Article

A Factor Analysis Perspective on Linear Regression in the ‘More Predictors than Samples’ Case

Faculty of Computer Science, Alexandru Ioan Cuza University of Iaşi, 700506 Iaşi, Romania
* Authors to whom correspondence should be addressed.
Entropy 2021, 23(8), 1012; https://doi.org/10.3390/e23081012
Submission received: 5 June 2021 / Revised: 25 July 2021 / Accepted: 27 July 2021 / Published: 3 August 2021
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science II)

Abstract
Linear regression (LR) is a core model in supervised machine learning performing a regression task. One can fit this model using either an analytic/closed-form formula or an iterative algorithm. Fitting it via the analytic formula becomes a problem when the number of predictors is greater than the number of samples, because the closed-form solution contains a matrix inverse that is not defined in that case. The standard approaches to this issue are the Moore–Penrose inverse and L2 regularization. We propose another solution, starting from a machine learning model that, this time, is used in unsupervised learning for a dimensionality reduction task or simply for density estimation: factor analysis (FA) with a one-dimensional latent space. The density estimation task is our focus since, in this case, FA can fit a Gaussian distribution even if the dimensionality of the data is greater than the number of samples; we retain this advantage when creating the supervised counterpart of factor analysis, which is linked to linear regression. We also create its semisupervised counterpart and then extend it to be usable with missing data. We prove an equivalence to linear regression and create experiments for each extension of the factor analysis model. The resulting algorithms are either closed-form solutions or expectation–maximization (EM) algorithms. The latter are linked to information theory by optimizing a function containing a Kullback–Leibler (KL) divergence or the entropy of a random variable.

1. Introduction

In machine learning, models can be grouped into two categories: probabilistic and nonprobabilistic. Probabilistic models can be classified as generative and discriminative [1]. Examples of classic generative models are naive Bayes and Gaussian mixture models (GMM). Examples of classic discriminative models are linear regression (LR) and logistic regression. The key difference is whether they model the joint probability of the input and the output—generative models—or they just model the conditional probability of the output given the input—discriminative models. For a classification or a regression task, one may argue that what you need is just a discriminative model, but the generative models have their advantages: they can sometimes handle missing data, can easily generate new data, can be extended to be unsupervised or semisupervised, etc. ([2] p. 268).
As one may notice, there are generative models for unsupervised learning that have counterparts in supervised learning, even though this is not widely discussed in the literature. One such example is the GMM ([2] p. 339) with its counterpart, the Gaussian joint Bayes model ([2] p. 102), also known as quadratic discriminant analysis. Their training/fitting algorithms are similar, as can be seen, for example, in [3,4]:
  • for Gaussian joint Bayes:
    \pi_j = \frac{1}{n}\sum_{i=1}^n 1\{z_i = j\}
    \mu_j = \frac{\sum_{i=1}^n 1\{z_i = j\}\, x_i}{\sum_{i=1}^n 1\{z_i = j\}}
    \Sigma_j = \frac{\sum_{i=1}^n 1\{z_i = j\}\,(x_i - \mu_j)(x_i - \mu_j)^\top}{\sum_{i=1}^n 1\{z_i = j\}}
    where x i is an input observation, ( π j , μ j , Σ j ) are the parameters of a GMM, observable z i is the class index corresponding to x i , j is a class index, and 1 { z i = j } is the indicator function, which returns 1 if the condition z i = j is true and 0 otherwise.
  • for expectation–maximization (EM) for the GMM, which optimizes a function involving a Kullback–Leibler (KL) divergence or the entropy of a random variable (see Appendix A for these details):
    E step:
    w_{ij} = p(z_i = j \mid x_i; \pi, \mu, \Sigma) = \frac{p(x_i \mid z_i = j; \mu, \Sigma)\, p(z_i = j; \pi)}{\sum_{l=1}^K p(x_i \mid z_i = l; \mu, \Sigma)\, p(z_i = l; \pi)}
    M step:
    \pi_j = \frac{1}{n}\sum_{i=1}^n w_{ij}
    \mu_j = \frac{\sum_{i=1}^n w_{ij}\, x_i}{\sum_{i=1}^n w_{ij}}
    \Sigma_j = \frac{\sum_{i=1}^n w_{ij}\,(x_i - \mu_j)(x_i - \mu_j)^\top}{\sum_{i=1}^n w_{ij}}
    where x_i is an input observation, (\pi_j, \mu_j, \Sigma_j) are the parameters of a GMM, unobservable z_i is the cluster index corresponding to x_i, j is a cluster index, and w_{ij} is the probability that x_i belongs to cluster j. A small numerical sketch contrasting the two fits is given right after this list.
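To make the parallel concrete, the following minimal numpy sketch (ours, not taken from the paper or the cited lecture notes; all names and the synthetic data are our own) fits the supervised Gaussian joint Bayes model with indicator-weighted averages and then performs one EM iteration for a GMM on the same data; the M-step lines mirror the supervised formulas with the indicators replaced by soft responsibilities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 300, 2, 3
z = rng.integers(0, K, size=n)                       # class/cluster indices
means_true = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
x = means_true[z] + rng.normal(size=(n, d))          # observations

# Gaussian joint Bayes (supervised): indicator-weighted estimates.
pi_s = np.array([(z == j).mean() for j in range(K)])
mu_s = np.array([x[z == j].mean(axis=0) for j in range(K)])
Sigma_s = np.array([np.cov(x[z == j].T, bias=True) for j in range(K)])

def gaussian_pdf(x, mu, Sigma):
    diff = x - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))

# One EM iteration for the GMM (unsupervised), started from a rough initialization.
pi, mu, Sigma = np.full(K, 1.0 / K), mu_s + 0.5, Sigma_s.copy()
# E step: w[i, j] = p(z_i = j | x_i; pi, mu, Sigma).
joint = np.stack([pi[j] * gaussian_pdf(x, mu[j], Sigma[j]) for j in range(K)], axis=1)
w = joint / joint.sum(axis=1, keepdims=True)
# M step: the same three formulas, with the indicator replaced by the weight w[i, j].
pi_em = w.mean(axis=0)
mu_em = (w.T @ x) / w.sum(axis=0)[:, None]
Sigma_em = np.array([((x - mu_em[j]).T * w[:, j]) @ (x - mu_em[j]) / w[:, j].sum()
                     for j in range(K)])
print(pi_s, pi_em)
```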
This similarity between GMM and Gaussian joint Bayes is intriguing; hence, we decided to further explore this aspect but starting from other supervised–unsupervised counterparts. As a result, we changed the root model into factor analysis (FA) [5] ([2] p. 381), which is normally used for dimensionality reduction or for density estimation when the dimensionality of the data is greater than the number of samples. Factor analysis is a Gaussian generative model used in unsupervised learning. We aimed at creating its supervised counterpart in order to handle a regression task and then exploit it as much as possible.
After creating the supervised counterpart, we proved a significant property, namely that linear regression is equivalent to (supervised) factor analysis—with one-dimensional latent space—when no constraints are imposed on the covariance matrices.
A linear regression model can be fitted via a closed-form solution or an iterative algorithm. When the number of predictors is greater than the number of samples, there is no closed-form solution. There are other solutions to this problem, as we will see.
At this point, we knew that factor analysis was linked to linear regression and that it could be used when the number of samples is lower than the dimensionality of the data; from now on, this regime is denoted as D ≫ n or n ≪ D. As a result, since linear regression is a widely known and used model, we shifted our focus from solely exploiting the factor analysis model to highlighting novel linear regression versions applicable in the D ≫ n regime:
  • linear regression when D > > n ,
  • semisupervised linear regression when D > > n ,
  • (semisupervised) linear regression when D > > n with missing data.
The structure of this paper is as follows. In Section 2, we include some theoretical background to enhance the readability of this paper. Section 3 contains related work. In Section 4, we include the models we proposed, starting from factor analysis. Section 5 contains experiments using the proposed models. In Section 6, we conclude the paper and show future directions.
We include the full algorithms in the appendices in a pseudocode format, two of them being instances of the expectation–maximization schema.

2. Theoretical Background

We started our analysis from two core models in machine learning: linear regression and factor analysis. We will discuss the aspects of those two models that are relevant to understanding the next sections of this paper.

2.1. Linear Regression

Proposition 1.
Let $\{(x^{(i)}, y^{(i)}) \mid x^{(i)} \in \mathbb{R}^{D\times 1}, y^{(i)} \in \mathbb{R}, i \in \{1, \dots, n\}\}$ be a data set where D is the dimensionality of the input data, $\{x^{(i)} \mid i \in \{1,\dots,n\}\}$ is the input, and $\{y^{(i)} \mid i \in \{1,\dots,n\}\}$ is the output. The linear regression model is as follows:
Y^{(i)} = w x^{(i)} + b + \epsilon^{(i)}
where $Y^{(i)}$ is a random variable corresponding to $y^{(i)}$, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, $\sigma \in \mathbb{R}_+^*$, $w \in \mathbb{R}^{1\times D}$, $b \in \mathbb{R}$. Then, the parameters w and b can be estimated via maximum likelihood as follows:
\hat{w}_{\mathrm{LR}} = \left(n\bar{y}\bar{x}^\top - Y X^\top\right)\left(n\bar{x}\bar{x}^\top - X X^\top\right)^{-1}    (1)
\hat{b}_{\mathrm{LR}} = \bar{y} - \hat{w}_{\mathrm{LR}}\bar{x},    (2)
or, equivalently, as follows:
\begin{bmatrix}\hat{w}_{\mathrm{LR}}^\top \\ \hat{b}_{\mathrm{LR}}\end{bmatrix} = \left(\tilde{X}\tilde{X}^\top\right)^{-1}\tilde{X} Y^\top    (3)
where $\bar{x} = \frac{x^{(1)} + \dots + x^{(n)}}{n}$, $\bar{y} = \frac{y^{(1)} + \dots + y^{(n)}}{n}$,
$X = [x^{(1)} \cdots x^{(n)}] \in \mathbb{R}^{D\times n}$, $\tilde{X} = [\tilde{x}^{(1)} \cdots \tilde{x}^{(n)}] \in \mathbb{R}^{(D+1)\times n}$,
$\tilde{x}^{(i)} = \begin{bmatrix} x^{(i)} \\ 1 \end{bmatrix} = [x_1^{(i)} \cdots x_D^{(i)} \; 1]^\top \in \mathbb{R}^{D+1}$, and $Y = [y^{(1)} \cdots y^{(n)}] \in \mathbb{R}^{1\times n}$.
A potential problem with Equation (3) arises when $\tilde{X}\tilde{X}^\top$ is not invertible. Such a case arises when $D + 1 > n$, i.e., there are more predictors than samples. Two standard solutions to this problem are the following (a small numerical sketch of both follows this list):
  • Let $A \in \mathbb{R}^{a\times b}$ be a matrix. Then, the Moore–Penrose inverse of A can be defined as $A^+ = \lim_{\alpha\to 0^+}(A^\top A + \alpha I)^{-1}A^\top$, which, algorithmically, is computed via the singular value decomposition of A ([6] Section 2.9). One may notice that the matrix $(\tilde{X}\tilde{X}^\top)^{-1}\tilde{X}$ from Equation (3) is just $(\tilde{X}^\top)^+$ when $\tilde{X}\tilde{X}^\top$ is invertible. When it is not, the solution is to replace the matrix $(\tilde{X}\tilde{X}^\top)^{-1}\tilde{X}$ from Equation (3) with $(\tilde{X}^\top)^+$.
  • L2 regularization, which results in ridge regression ([2] p. 225). The matrix $(\tilde{X}\tilde{X}^\top)^{-1}\tilde{X}$ from Equation (3) is replaced with $\left(\tilde{X}\tilde{X}^\top + \mathrm{diag}(\alpha, \dots, \alpha, 0)\right)^{-1}\tilde{X}$, where the last diagonal entry (the one corresponding to the bias) is 0 and $\alpha > 0$; the bigger the α, the more regularization we add to the model, i.e., the further we move away from overfitting. From this point of view, the first solution, using the Moore–Penrose inverse, can be interpreted as achieving the asymptotically lowest L2 regularization.
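Below is a minimal numpy sketch (ours) of these two fixes on synthetic data with more predictors than samples; the variable names and toy dimensions are our own choices, and the column-per-sample layout follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n = 50, 20                                   # more predictors than samples
X = rng.normal(size=(D, n))                     # columns are samples, as in the text
Y = rng.normal(size=(1, n))
X_tilde = np.vstack([X, np.ones((1, n))])       # append the constant row for the bias

# X_tilde @ X_tilde.T is (D+1) x (D+1) but has rank at most n, hence it is singular.
print(np.linalg.matrix_rank(X_tilde @ X_tilde.T))      # 20 here

# 1) Moore-Penrose inverse of X_tilde^T, i.e., the minimum-norm least-squares solution.
wb_pinv = np.linalg.pinv(X_tilde.T) @ Y.T              # shape (D+1, 1): [w^T; b]

# 2) Ridge regression: add alpha on the diagonal, except for the bias entry.
alpha = 0.1
penalty = alpha * np.eye(D + 1)
penalty[-1, -1] = 0.0                                  # do not regularize the bias
wb_ridge = np.linalg.solve(X_tilde @ X_tilde.T + penalty, X_tilde @ Y.T)

print(np.linalg.norm(wb_pinv), np.linalg.norm(wb_ridge))
```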

2.2. Factor Analysis

The formulas stated in the conclusion of the following Proposition were proved in [5] and are relevant for the factor analysis algorithm—although the matrix Ψ is considered as being diagonal there, the formulas stay the same even if Ψ is not diagonal.
Proposition 2.
Let us consider the following factor analysis model:
$z \sim \mathcal{N}(0, I)$, the latent variable, $z \in \mathbb{R}^{d\times 1}$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times d}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix. Then:
\begin{bmatrix} x \\ z \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\mu \\ 0\end{bmatrix}, \begin{bmatrix}\Lambda\Lambda^\top + \Psi & \Lambda \\ \Lambda^\top & I\end{bmatrix}\right)
x \sim \mathcal{N}(\mu, \Lambda\Lambda^\top + \Psi)
z \mid x \sim \mathcal{N}\left(\Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1}(x - \mu),\; I - \Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1}\Lambda\right).
A factor analysis model can be fitted via an EM algorithm using the maximum likelihood estimation (MLE) principle, and factor analysis can be used as a density estimation technique when the dimensionality of the data is greater than the number of samples [5].
For the algorithms we developed, we will let $z \sim \mathcal{N}(\mu_z, \Sigma_z)$ and not $z \sim \mathcal{N}(0, I)$, because z becomes observed data, we want to learn its parameters, and we do not want to impose something unrealistic like $z \sim \mathcal{N}(0, I)$. This generalization leads to the following result.
Proposition 3.
Let us consider the following linear Gaussian system:
$z \sim \mathcal{N}(\mu_z, \Sigma_z)$, the latent variable, $z \in \mathbb{R}^{d\times 1}$, $\mu_z \in \mathbb{R}^{d\times 1}$, $\Sigma_z \in \mathbb{R}^{d\times d}$ a symmetric and positive definite matrix,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times d}$, and $\Psi \in \mathbb{R}^{D\times D}$ a symmetric and positive definite matrix. (If Ψ is a diagonal matrix, then we say that we are in the FA case. If Ψ is a scalar matrix, $\Psi = \eta I$, $\eta \in \mathbb{R}_+^*$, then we are in the probabilistic principal component analysis (PPCA) case [7]. If Ψ is any symmetric and positive definite matrix, then we say we are in the unconstrained factor analysis (UncFA) case. While the first two settings are standard (FA and PPCA), the third one, UncFA, is proposed by us.) Then:
\begin{bmatrix} x \\ z \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\mu + \Lambda\mu_z \\ \mu_z\end{bmatrix}, \begin{bmatrix}\Lambda\Sigma_z\Lambda^\top + \Psi & \Lambda\Sigma_z \\ (\Lambda\Sigma_z)^\top & \Sigma_z\end{bmatrix}\right)
x \sim \mathcal{N}(\mu + \Lambda\mu_z,\; \Lambda\Sigma_z\Lambda^\top + \Psi)
z \mid x \sim \mathcal{N}\left(\mu_z + \Sigma_z\Lambda^\top(\Lambda\Sigma_z\Lambda^\top + \Psi)^{-1}(x - \mu - \Lambda\mu_z),\; \Sigma_z - \Sigma_z\Lambda^\top(\Lambda\Sigma_z\Lambda^\top + \Psi)^{-1}\Lambda\Sigma_z\right).    (4)
The proof can be found in ([8] pp. 9–11).
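As an illustration, here is a minimal numpy sketch (ours, with arbitrarily chosen toy parameters) that evaluates the posterior parameters of z given x from Proposition 3 and checks the marginal covariance of x by Monte Carlo sampling.

```python
import numpy as np

rng = np.random.default_rng(2)
D, d = 5, 2
mu_z, Sigma_z = rng.normal(size=d), np.diag([1.0, 2.0])
mu, Lam = rng.normal(size=D), rng.normal(size=(D, d))
A = rng.normal(size=(D, D))
Psi = A @ A.T + np.eye(D)                              # symmetric positive definite

# Posterior parameters of z | x for one observation x_star (last line of Proposition 3).
x_star = rng.normal(size=D)
C = Lam @ Sigma_z @ Lam.T + Psi                        # marginal covariance of x
K = Sigma_z @ Lam.T @ np.linalg.inv(C)
post_mean = mu_z + K @ (x_star - mu - Lam @ mu_z)
post_cov = Sigma_z - K @ Lam @ Sigma_z

# Sanity check of the marginal of x: sample from the generative process.
m = 200_000
z = rng.multivariate_normal(mu_z, Sigma_z, size=m)
x = mu + z @ Lam.T + rng.multivariate_normal(np.zeros(D), Psi, size=m)
print(np.max(np.abs(np.cov(x.T) - C)))                 # small for large m
```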

3. Related Work

Although factor analysis is widely used for dimensionality reduction, its supervised counterpart is, to the best of our knowledge, not present in the literature. What is present is a model called supervised principal component analysis or latent factor regression ([2] p. 405). The idea there is that, for a regression task, not only the input but also the output is generated by a latent variable, and factor analysis is applied to replace the input with a low-dimensional embedding. The key point is that the purpose of supervised principal component analysis is still dimensionality reduction and not regression at all, whereas regression is where we want to push the factor analysis model.
There is also the notion of a linear Gaussian system ([2] p. 119). This was already presented in Section 2; it generalizes the factor analysis generative process by considering that z has a learnable mean and covariance matrix, but it is not developed further, e.g., toward a supervised regression model.
Factor analysis is strongly related to principal component analysis (PCA) [9] because by imposing a certain constraint in factor analysis, we get a model called probabilistic principal component analysis [7] that can be fitted using a closed-form solution, which, in an asymptotic case, is also the solution for PCA. Probabilistic PCA can be kernelized using a model called the Gaussian process latent variable model (GPLVM) [10]. This model also has supervised counterparts [11], but, as in the case of FA, the supervised extension targets dimensionality reduction, and the idea is similar to the one in supervised PCA.

4. Proposed Models

In this section, we propose three models starting from the FA model, each in a new subsection: simple-supervised factor analysis (S2.FA), simple-semisupervised factor analysis (S3.FA), and missing simple-semisupervised factor analysis (MS3.FA). While S2.FA is applicable in the supervised case, regression, S3.FA is meant to be used in a semisupervised context. MS3.FA handles missing input data in a (semi)supervised scenario.
One important remark that will not be restated in this paper is that all the models (FA, S2.FA, S3.FA, MS3.FA) are fitted by maximizing the likelihood (the MLE principle) of the observed data. Another important observation is regarding the names of our proposed algorithms: the algorithms are called “simple-”—S2 = simple-supervised; S3 = simple-semisupervised; MS3 = missing simple-semisupervised—not only because they constitute a simple adaptation of the factor analysis model, but mostly because we created an adaptation of the (simple-)supervised FA model called (simple-)supervised PPCA, and we did not want this model to be confused with the already existing supervised PCA model in the literature. Simple-supervised probabilistic principal component analysis (S2.PPCA) is not discussed in this paper, but it is implemented and usable in the R package that we developed (https://github.com/aciobanusebi/s2fa; accessed on 31 July 2021) along with other undiscussed but related models: Simple-semisupervised unconstrained factor analysis (S3.UncFA), Simple-semisupervised probabilistic principal component analysis (S3.PPCA), Missing simple-semisupervised unconstrained factor analysis (MS3.UncFA), and Missing simple-semisupervised probabilistic principal component analysis (MS3.PPCA).

4.1. The S2.FA Model

The core of this subsection regards the S2.FA model, but in order to link it with LR, we need to also introduce a similar model to S2.FA: S2.UncFA. This link will make S2.FA a good candidate for replacing LR when D > > n . These three ideas—S2.FA and S2.UncFA, S2.UncFA-LR link, and replacing LR via S2.FA—will be expanded below.

4.1.1. The S2.FA Model. S2.FA and S2.UncFA

The first model that we propose is called simple-supervised factor analysis. It is a linear Gaussian system with slight changes:
$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$, an observed variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix.
If we do not impose the constraint of Ψ being diagonal, we arrive at simple-supervised unconstrained factor analysis (S2.UncFA):
$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$, an observed variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a symmetric and positive definite matrix.
In contrast with the factor analysis model, which is fitted via an EM algorithm, S2.UncFA and S2.FA are fitted via analytic formulas (see Propositions 4 and 5).
Proposition 4.
Let $\{(x^{(i)}, z^{(i)}) \mid x^{(i)} \in \mathbb{R}^{D\times 1}, z^{(i)} \in \mathbb{R}^{d\times 1}, i \in \{1, \dots, n\}\}$ be a data set where D is the dimensionality of the input data and d is the dimensionality of the output data (although in the context of this paper $d = 1$, we decided to expose more general results, $d \ge 1$, in order for the reader to gain more insight; this is why we write $\Sigma_z$ and not just $\sigma_z^2$, and treat $z^{(i)}$ and $\bar{z}$ as vectors rather than scalars), $\{x^{(i)} \mid i \in \{1,\dots,n\}\}$ is the input, and $\{z^{(i)} \mid i \in \{1,\dots,n\}\}$ is the output. We suppose that the data was generated as follows:
$z^{(i)} \sim \mathcal{N}(\mu_z, \Sigma_z)$, $z^{(i)} \in \mathbb{R}^{d\times 1}$, $\mu_z \in \mathbb{R}^{d\times 1}$, $\Sigma_z \in \mathbb{R}^{d\times d}$ a symmetric and positive definite matrix, and
$x^{(i)} \mid z^{(i)} \sim \mathcal{N}(\mu + \Lambda z^{(i)}, \Psi)$, $x^{(i)} \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times d}$, while $\Psi \in \mathbb{R}^{D\times D}$ is a symmetric and positive definite matrix.
Then, the parameters in the S2.UncFA algorithm (training phase) can be estimated via maximum likelihood using the following closed-form formulas:
\hat{\mu}_z = \frac{1}{n}\sum_{i=1}^n z^{(i)}    (5)
\hat{\Sigma}_z = \frac{1}{n}\sum_{i=1}^n (z^{(i)} - \hat{\mu}_z)(z^{(i)} - \hat{\mu}_z)^\top    (6)
\hat{\Lambda} = \left(n\bar{x}\bar{z}^\top - \sum_{i=1}^n x^{(i)} z^{(i)\top}\right)\left(n\bar{z}\bar{z}^\top - \sum_{i=1}^n z^{(i)} z^{(i)\top}\right)^{-1}    (7)
\hat{\mu} = \bar{x} - \hat{\Lambda}\bar{z}    (8)
\hat{\Psi} = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})(x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})^\top    (9)
where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)}$ and $\bar{z} = \hat{\mu}_z$. For the testing/prediction phase, one uses the formula for $z \mid x$ from (4).
For more elaborate notations and the proof, see [8] pp. 13–17.
Proposition 5.
[We will denote the parameter $\hat{\Psi}$ in (9) as $\hat{\Psi}_{\mathrm{S2.UncFA}}$.]
For the S2.FA algorithm, (9) is replaced by
\hat{\Psi} = \mathrm{diag}\left(\hat{\Psi}_{\mathrm{S2.UncFA}}\right)    (10)
where "diag" takes the diagonal of a matrix and returns the corresponding diagonal matrix.
The proof of Equation (10) is relatively simple, and we skip it for brevity. It can be found in [8] pp. 21–23.
For the step-by-step S2.FA algorithm and also for the matrix form of the algorithm, see Appendix B.
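The closed-form fit is easy to implement; the following numpy sketch (ours, with our own function names and synthetic data) applies Equations (5)–(9) and the diagonal restriction (10), and predicts with the z | x formula from (4). It remains usable when D > n as long as the estimated diagonal Ψ̂ has positive entries, since then the predictive matrix stays invertible.

```python
import numpy as np

def s2fa_fit(X, Z, diagonal_psi=True):
    """X: (D, n) inputs, Z: (d, n) outputs; returns (mu_z, Sigma_z, Lambda, mu, Psi)."""
    n = X.shape[1]
    x_bar = X.mean(axis=1, keepdims=True)          # (D, 1)
    mu_z = Z.mean(axis=1, keepdims=True)           # (d, 1) -- Eq. (5)
    z_bar = mu_z
    Sigma_z = (Z - z_bar) @ (Z - z_bar).T / n      # Eq. (6)
    Lam = (n * x_bar @ z_bar.T - X @ Z.T) @ np.linalg.inv(n * z_bar @ z_bar.T - Z @ Z.T)  # Eq. (7)
    mu = x_bar - Lam @ z_bar                       # Eq. (8)
    R = X - mu - Lam @ Z                           # residuals x(i) - mu - Lam z(i)
    Psi = R @ R.T / n                              # Eq. (9), the S2.UncFA estimate
    if diagonal_psi:                               # Eq. (10): keep only the diagonal
        Psi = np.diag(np.diag(Psi))
    return mu_z, Sigma_z, Lam, mu, Psi

def s2fa_predict(x_star, params):
    mu_z, Sigma_z, Lam, mu, Psi = params
    C = Lam @ Sigma_z @ Lam.T + Psi
    return mu_z + Sigma_z @ Lam.T @ np.linalg.solve(C, x_star - mu - Lam @ mu_z)

# Toy usage with D > n, where the diagonal Psi keeps C well conditioned.
rng = np.random.default_rng(3)
D, d, n = 40, 1, 15
X, Z = rng.normal(size=(D, n)), rng.normal(size=(d, n))
params = s2fa_fit(X, Z)
print(s2fa_predict(X[:, :1], params))
```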

4.1.2. The S2.FA Model. The Link between LR and S2.UncFA

Linear regression and S2.UncFA have the same prediction function after fitting, as we claim and prove below.
Proposition 6.
Let $\{(x^{(i)}, z^{(i)}) \mid x^{(i)} \in \mathbb{R}^{D\times 1}, z^{(i)} \in \mathbb{R}^{d}, i \in \{1, \dots, n\}\}$ be a data set where D is the dimensionality of the input data and d is the dimensionality of the output data (the same observation as earlier: in the context of this paper, only the $d = 1$ case is relevant), $\{x^{(i)} \mid i \in \{1,\dots,n\}\}$ is the input, and $\{z^{(i)} \mid i \in \{1,\dots,n\}\}$ is the output.
One can fit an S2.UncFA model and obtain, via the relationships (5)–(9), $\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Psi}, \hat{\Lambda}, \hat{\mu}$. Remember that at the test phase (see (4)), the predicted value is
\mathrm{predicted}_{\mathrm{S2.UncFA}}(x^*) = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z), \quad x^* \in \mathbb{R}^{D\times 1}.
One can fit a linear regression model and obtain $\hat{w}, \hat{b}$ from (1) and (2). Remember that at the test phase, the predicted value is
\mathrm{predicted}_{\mathrm{LR}}(x^*) = \hat{w} x^* + \hat{b}, \quad x^* \in \mathbb{R}^{D\times 1}.
Then:
\mathrm{predicted}_{\mathrm{S2.UncFA}}(x^*) = \mathrm{predicted}_{\mathrm{LR}}(x^*), \quad \forall x^* \in \mathbb{R}^{D\times 1}.
Proof. 
Let $X = [x^{(1)} \cdots x^{(n)}] \in \mathbb{R}^{D\times n}$ and $Z = [z^{(1)} \cdots z^{(n)}] \in \mathbb{R}^{d\times n}$.
We begin by computing $\hat{\Psi}$. By (9),
\hat{\Psi} = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})(x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})^\top = \frac{1}{n}\sum_{i=1}^n \left( x^{(i)} x^{(i)\top} - x^{(i)}\hat{\mu}^\top - x^{(i)} z^{(i)\top}\hat{\Lambda}^\top - \hat{\mu} x^{(i)\top} + \hat{\mu}\hat{\mu}^\top + \hat{\mu} z^{(i)\top}\hat{\Lambda}^\top - \hat{\Lambda} z^{(i)} x^{(i)\top} + \hat{\Lambda} z^{(i)}\hat{\mu}^\top + \hat{\Lambda} z^{(i)} z^{(i)\top}\hat{\Lambda}^\top \right)
= \frac{1}{n}XX^\top - \bar{x}\hat{\mu}^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \hat{\mu}\bar{x}^\top + \hat{\mu}\hat{\mu}^\top + \hat{\mu}\bar{z}^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \hat{\Lambda}\bar{z}\hat{\mu}^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top
= \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \bar{x}\hat{\mu}^\top - \hat{\mu}\bar{x}^\top + \hat{\mu}\hat{\mu}^\top + \hat{\Lambda}\bar{z}\hat{\mu}^\top + \hat{\mu}\bar{z}^\top\hat{\Lambda}^\top.
We substitute $\hat{\mu}$ with $\bar{x} - \hat{\Lambda}\bar{z}$:
\bar{x}\hat{\mu}^\top = \bar{x}(\bar{x} - \hat{\Lambda}\bar{z})^\top = \bar{x}\bar{x}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top
\hat{\mu}\bar{x}^\top = (\bar{x} - \hat{\Lambda}\bar{z})\bar{x}^\top = \bar{x}\bar{x}^\top - \hat{\Lambda}\bar{z}\bar{x}^\top
\hat{\mu}\hat{\mu}^\top = (\bar{x} - \hat{\Lambda}\bar{z})(\bar{x} - \hat{\Lambda}\bar{z})^\top = \bar{x}\bar{x}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top
\hat{\Lambda}\bar{z}\hat{\mu}^\top = \hat{\Lambda}\bar{z}(\bar{x} - \hat{\Lambda}\bar{z})^\top = \hat{\Lambda}\bar{z}\bar{x}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top
\hat{\mu}\bar{z}^\top\hat{\Lambda}^\top = (\bar{x} - \hat{\Lambda}\bar{z})\bar{z}^\top\hat{\Lambda}^\top = \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top
We return to computing $\hat{\Psi}$:
\hat{\Psi} = \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \bar{x}\bar{x}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \bar{x}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top + \bar{x}\bar{x}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top
= \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \bar{x}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top.    (11)
We continue by computing $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$. By (6),
\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top = \hat{\Lambda}\left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)\hat{\Lambda}^\top = \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top.    (12)
We observe that the above term (see (12)) is also included in $\hat{\Psi}$ (see (11)). By (7),
\hat{\Sigma}_z\hat{\Lambda}^\top = \left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)\left(n\bar{z}\bar{z}^\top - ZZ^\top\right)^{-1}\left(n\bar{z}\bar{x}^\top - ZX^\top\right) = \left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)\left(-\frac{1}{n}\right)\left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)^{-1}\left(n\bar{z}\bar{x}^\top - ZX^\top\right) = \frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top.    (13)
Since $(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top)^\top = \hat{\Lambda}\hat{\Sigma}_z^\top\hat{\Lambda}^\top = \hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$, the matrix $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$ is symmetric.
We have that:
\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top = \hat{\Lambda}(\hat{\Sigma}_z\hat{\Lambda}^\top) \overset{(13)}{=} \hat{\Lambda}\left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right) = \frac{1}{n}\hat{\Lambda}ZX^\top - \hat{\Lambda}\bar{z}\bar{x}^\top.    (14)
We also have that:
\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top = (\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top)^\top \overset{(14)}{=} \left(\frac{1}{n}\hat{\Lambda}ZX^\top - \hat{\Lambda}\bar{z}\bar{x}^\top\right)^\top = \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top.    (15)
We continue by computing $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi}$.
As we have already noticed above, $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$ is also included in $\hat{\Psi}$ (see (11) and (12)). In the computation of $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi}$, we replace $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$ once with (14) and then with (15). By (11), (12), (14), and (15), we get:
\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi} = \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top - \bar{x}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top + \left(\frac{1}{n}\hat{\Lambda}ZX^\top - \hat{\Lambda}\bar{z}\bar{x}^\top\right) + \left(\frac{1}{n}XZ^\top\hat{\Lambda}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top\right) = \frac{1}{n}XX^\top - \bar{x}\bar{x}^\top.    (16)
Observation: the result is exactly the maximum likelihood estimate of the covariance matrix Σ of the input data set if $x \sim \mathcal{N}(\mu, \Sigma)$: $\Sigma_{\mathrm{MLE}} = \frac{1}{n}XX^\top - \bar{x}\bar{x}^\top$. This is natural because, according to (4), we have $x \sim \mathcal{N}(\mu + \Lambda\mu_z, \Lambda\Sigma_z\Lambda^\top + \Psi)$, and there are enough free parameters in $\Lambda\Sigma_z\Lambda^\top + \Psi$, namely $dD + \frac{d^2 - d}{2} + d + \frac{D^2 - D}{2} + D$ free parameters ($dD$ in Λ, $\frac{d^2 - d}{2} + d$ in $\Sigma_z$, and $\frac{D^2 - D}{2} + D$ in Ψ), for it to become $\Sigma_{\mathrm{MLE}} = \frac{1}{n}XX^\top - \bar{x}\bar{x}^\top$, since Σ has $\frac{D^2 - D}{2} + D$ free parameters.
We return to the initial computation:
\mathrm{predicted}_{\mathrm{S2.UncFA}}(x^*)
\overset{(4)}{=} \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
\overset{(5),(13),(16)}{=} \bar{z} + \left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right)\left(\frac{1}{n}XX^\top - \bar{x}\bar{x}^\top\right)^{-1}(x^* - \bar{x} + \hat{\Lambda}\bar{z} - \hat{\Lambda}\bar{z})
= \left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right)\left(\frac{1}{n}XX^\top - \bar{x}\bar{x}^\top\right)^{-1}x^* + \bar{z} - \left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right)\left(\frac{1}{n}XX^\top - \bar{x}\bar{x}^\top\right)^{-1}\bar{x}
= \left(n\bar{z}\bar{x}^\top - ZX^\top\right)\left(n\bar{x}\bar{x}^\top - XX^\top\right)^{-1}x^* + \bar{z} - \left(n\bar{z}\bar{x}^\top - ZX^\top\right)\left(n\bar{x}\bar{x}^\top - XX^\top\right)^{-1}\bar{x}
\overset{(1)}{=} \hat{w} x^* + \bar{z} - \hat{w}\bar{x}
\overset{(2)}{=} \hat{w} x^* + \hat{b}
= \mathrm{predicted}_{\mathrm{LR}}(x^*), \quad \forall x^* \in \mathbb{R}^{D\times 1}. \quad \square
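A quick numerical check of Proposition 6 (ours, on synthetic data): with n > D, fitting S2.UncFA via (5)–(9) and linear regression via the normal equations gives the same prediction up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(4)
D, n = 5, 200
X = rng.normal(size=(D, n))
Z = rng.normal(size=(1, D)) @ X + 0.1 * rng.normal(size=(1, n))   # 1-D output

# S2.UncFA fit, Eqs. (5)-(9), with the full (unconstrained) Psi.
x_bar, z_bar = X.mean(axis=1, keepdims=True), Z.mean(axis=1, keepdims=True)
Sigma_z = (Z - z_bar) @ (Z - z_bar).T / n
Lam = (n * x_bar @ z_bar.T - X @ Z.T) @ np.linalg.inv(n * z_bar @ z_bar.T - Z @ Z.T)
mu = x_bar - Lam @ z_bar
R = X - mu - Lam @ Z
Psi = R @ R.T / n
C = Lam @ Sigma_z @ Lam.T + Psi          # equals (1/n) X X^T - x_bar x_bar^T, Eq. (16)

x_star = rng.normal(size=(D, 1))
pred_fa = z_bar + Sigma_z @ Lam.T @ np.linalg.solve(C, x_star - mu - Lam @ z_bar)

# Linear regression via the normal equations on the augmented inputs.
X_tilde = np.vstack([X, np.ones((1, n))])
wb = np.linalg.solve(X_tilde @ X_tilde.T, X_tilde @ Z.T)          # (D+1, 1)
pred_lr = wb[:-1].T @ x_star + wb[-1]

print(np.allclose(pred_fa, pred_lr))     # True
```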

4.1.3. The S2.FA Model. A New Approach for LR When D > > n

Since FA can be used to estimate the density of a data set when D > > n , and S2.UncFA is equivalent to LR, we consider S2.FA as a new approach to extend LR when D > > n besides the two solutions mentioned in Section 2.

4.2. The S3.FA Model

Factor analysis is a classic generative unsupervised model. Its supervised counterpart is S2.FA as shown in the previous subsection. Those two can be merged into a semisupervised model that we propose here, named simple-semisupervised factor analysis:
$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$, either an observed or a latent variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix.
Good hints for combining such supervised–unsupervised counterparts (for instance, Gaussian naive Bayes and the GMM) into a semisupervised model can be found in [12]. We applied those hints to our supervised–unsupervised counterparts, S2.FA and FA, and created an EM algorithm to fit an S3.FA model. For the step-by-step algorithm and the matrix form of the algorithm, see Appendix C. For more elaborate notations and the proof, see [8] pp. 26–34.

4.3. The MS3.FA Model

The algorithm that fits an S3.FA model can also be adapted for the case when not all the components of x are known. We call the resulting model missing simple-semisupervised factor analysis:
$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$, either an observed or a latent variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix; each component $x_1, \dots, x_D$ of x is either observed or latent.
The resulting algorithm that fits an MS3.FA model is an EM algorithm. For the step-by-step algorithm, see Appendix D. For more elaborate notations and the proof, see [8] pp. 34–37.

5. Experiments

In this section, we include the experiments we carried out on data with D > > n using the S2.FA, S3.FA, and MS3.FA models, comparing them with other methods. In all the experiments, we computed errors between the real values and the predicted values; the metric we used is mean squared error (MSE):
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{real}_i - \mathrm{predicted}_i\right)^2,
where N is the number of unknown elements whose real and predicted values are $\mathrm{real}_i$ and $\mathrm{predicted}_i$, respectively; an unknown element represents an output number for S2.FA and S3.FA or an input/output number for MS3.FA. We ran each experiment five times and computed a 95% confidence interval using the t-distribution. Furthermore, in each experiment we used the same three data sets:
  • the gas sensor array under flow modulation data set [13],
  • atp1d [14],
  • m5spec.
All three of these data sets have multiple outputs, but for each data set, we selected only the first output column that appears in the text data file and used it as the output: ace_conc for the gas sensor array under flow modulation data set, LBL_ALLminpA_fut_001 for atp1d, and the first column in the propvals file for m5spec.
We preprocessed each data set simply by dropping the constant columns. Only the second data set has constant columns: from 432 columns, we obtain 370.
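For concreteness, this preprocessing amounts to something like the following numpy sketch (ours; the toy matrix is only an illustration).

```python
import numpy as np

def drop_constant_columns(A):
    """A: (n_samples, n_columns); keeps the columns whose values are not all equal."""
    keep = ~np.all(A == A[0, :], axis=0)
    return A[:, keep]

A = np.array([[1.0, 5.0, 2.0], [3.0, 5.0, 2.0], [4.0, 5.0, 7.0]])
print(drop_constant_columns(A))   # the constant second column is removed
```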

5.1. The S2.FA Model: Experiment

The experiment concerning S2.FA covers the comparison of the three solutions presented so far for LR when D > > n :
  • Moore–Penrose inverse
  • ridge regression—L2 regularization
  • S2.FA.
Each data set was split into a training part (80%) and a testing part (20%). If the model had hyperparameters (among the compared methods, only ridge regression does), the training part was further split into a new training part (60% of the whole data set) and a validation part (20% of the whole data set) in order to set the hyperparameters; for ridge regression, we used a simple technique: pick the $\alpha \in \{10^{-2}, 10^{-1.9}, 10^{-1.8}, \dots, 10^{1.9}, 10^{2}\}$ that attains the minimum validation error. After setting the hyperparameters, we trained a new model on the initial training part (80% of the whole data set) and obtained the final model. All of the MSE errors are reported on the testing part and shown in Table 1.
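The hyperparameter selection for ridge regression can be sketched as follows (a minimal numpy version of the procedure described above, with our own helper names and synthetic data standing in for the real data sets).

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """X: (D, n), Y: (1, n); returns the stacked (D+1, 1) vector [w^T; b], bias unpenalized."""
    D, n = X.shape
    X_tilde = np.vstack([X, np.ones((1, n))])
    penalty = alpha * np.eye(D + 1)
    penalty[-1, -1] = 0.0
    return np.linalg.solve(X_tilde @ X_tilde.T + penalty, X_tilde @ Y.T)

def predict(wb, X):
    return wb[:-1].T @ X + wb[-1]

def mse(Y_true, Y_pred):
    return float(np.mean((Y_true - Y_pred) ** 2))

rng = np.random.default_rng(5)
D, n = 100, 50                                    # D >> n regime
X, Y = rng.normal(size=(D, n)), rng.normal(size=(1, n))

idx = rng.permutation(n)
tr, va, te = idx[:int(0.6 * n)], idx[int(0.6 * n):int(0.8 * n)], idx[int(0.8 * n):]

grid = 10.0 ** np.arange(-2.0, 2.0 + 1e-9, 0.1)   # 10^-2, 10^-1.9, ..., 10^2
best_alpha = min(grid, key=lambda a: mse(Y[:, va],
                                         predict(ridge_fit(X[:, tr], Y[:, tr], a), X[:, va])))
train_full = np.concatenate([tr, va])             # refit on the initial 80% training part
wb = ridge_fit(X[:, train_full], Y[:, train_full], best_alpha)
print(best_alpha, mse(Y[:, te], predict(wb, X[:, te])))
```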
As one may notice, the best method is different for each data set, so our general advice is to use all the methods on a given data set and pick the best one.

5.2. The S3.FA Model: Experiment

The experiment concerning S3.FA includes an analysis of algorithms for semisupervised regression when D > > n :
  • Moore–Penrose inverse—a supervised method
  • S2.FA—a supervised method
  • S3.FA—a semisupervised method
  • label propagation [15], a semisupervised method: we used the function sslLabelProp in the SSL R package [16] with the parameter alpha set to 1.
Each data set was split into a training part (80%) and a testing part (20%). We retained 5%, 10%, 15%, 20%, 25%, …, 100% of the output labels from the training part. For the supervised methods, we used only the labeled data in the training set, and for the semisupervised methods, we used the full training set when fitting the model. We initialized the S3.FA method with the fitted parameters returned by the S2.FA algorithm. All of the MSE errors are reported on the testing part and shown in Figure 1, Figure 2 and Figure 3 (presentation inspired by [17]) and in Table 2. The figures contain all the results from using 5% to 100% of the output labels, but we include less information in the table for brevity.
We notice that, on the selected data sets, S3.FA returns results even poorer than those of S2.FA, which uses only the labeled data. As in the previous experiment, the best method is data-dependent. The models that show a greater amount of variability compared to the others are S3.FA (on two data sets) and Moore–Penrose (on one data set). Moreover, as expected, the errors tend to decrease as the percentage of labeled training data points increases; the plots do not make this obvious, but the decrease can be seen numerically in Table 2.

5.3. The MS3.FA Model: Experiment

The experiment concerning MS3.FA includes a comparison of two different types of algorithms for data imputation when D > > n :
  • Mean imputation: for a given attribute (input column), compute its mean ignoring the missing values, then replace the missing data on that attribute with this computed mean
  • MS3.FA.
We also tried two other R packages, mice [18] and Amelia [19], but they could not be applied successfully to our data sets, perhaps because these data sets have a peculiarity: D ≫ n.
For each data set, we removed 10%, 20%, 30%, 40%, 50%, and 60% of the input cells and imputed them via the abovementioned algorithms (we could also have added missing data in the output, but we wanted to focus on the missing input data scenario and not on the semisupervised case). The results are presented in Figure 4, Figure 5 and Figure 6 and in Table 3.
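A minimal numpy sketch (ours, on synthetic data) of this setup for the mean-imputation baseline: hide a fraction p of the input cells at random, impute each missing cell with its attribute mean, and score the imputation by MSE.

```python
import numpy as np

rng = np.random.default_rng(6)
D, n, p = 30, 20, 0.3
X = rng.normal(size=(D, n))                   # rows are attributes, columns are samples

mask = rng.random(size=(D, n)) < p            # True marks a removed (missing) cell
X_missing = X.copy()
X_missing[mask] = np.nan

# Mean imputation: per attribute (row), using only the observed cells.
row_means = np.nanmean(X_missing, axis=1, keepdims=True)
X_imputed = np.where(np.isnan(X_missing), row_means, X_missing)

mse = np.mean((X[mask] - X_imputed[mask]) ** 2)
print(mse)
```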
From these results, we discover that MS3.FA is better than mean imputation on two data sets, and, as expected, the error increases as the percentage of missing data increases.

6. Conclusions and Future Work

The initial purpose of this paper was to extend an already existing model: factor analysis. We developed its supervised counterpart (S2.FA) and noticed that the unconstrained version (S2.UncFA) is equivalent to linear regression. Because FA is applied in density estimation when the dimensionality of the data is greater than the number of samples, and because of the already mentioned equivalence, the purpose of the paper became to analyze this new method of applying LR when D > > n , i.e., via S2.FA. Since FA and S2.FA are generative models and are unsupervised–supervised counterparts, we combined both into a new model S3.FA as an extension of LR to semisupervised learning when D > > n . The final extension regards missing data; it is called MS3.FA. We developed an R package (s2fa) with these algorithms; it can be found on GitHub. The experimental parts included several comparisons in the D > > n scenario:
  • of S2.FA with other techniques extending LR to the D > > n case,
  • of S3.FA with other (semi)supervised regression methods,
  • of MS3.FA with another data imputation algorithm.
The bottom line is that we do not necessarily recommend S3.FA for semisupervised regression since our results suggest that it gives poor results, but we encourage the consideration of S2.FA for regression and MS3.FA for missing data imputation as algorithms to be compared with others on a given data set.
As for future work, we could further explore the S2.FA, S3.FA, and MS3.FA algorithms when z is a real vector, not just a real number. Moreover, we can experiment with the PPCA version of the algorithms. Questions regarding the time complexity—empirical or not—can be also addressed; we expect the fitting time to be impractical if the number of columns is large. Because there are models such as mixture of factor analyzers [20] and mixture of linear regression models ([21] Section 14.5.1), another research direction involves mixtures of S2.FAs. Another idea would be to investigate the memory resources required by the algorithms when the data set increases and also to consider scalable systems such as Spark [22] for implementation.

Author Contributions

Conceptualization, S.C. and L.C.; methodology, S.C. and L.C.; software, S.C.; validation, S.C.; formal analysis, S.C.; investigation, S.C.; resources, L.C.; data curation; writing—original draft preparation, S.C.; writing—review and editing, S.C. and L.C.; visualization, S.C. and L.C.; supervision, L.C.; project administration, S.C. and L.C.; funding acquisition. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available data sets were analyzed in this study. This data can be found here: http://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+flow+modulation (accessed on 31 July 2021) for the gas sensor array under flow modulation data set, https://www.openml.org/d/41475 (accessed on 31 July 2021) for atp1d, http://www.eigenvector.com/data/Corn (accessed on 31 July 2021) for m5spec.

Acknowledgments

We thank Cristian Gaţu and Daniel Stamate for attentive proofreading and useful comments of previous versions of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

   The following abbreviations are used in this manuscript:
GMM: Gaussian mixture model
LR: Linear regression
EM: Expectation–maximization
KL: Kullback–Leibler
FA: Factor analysis
PCA: Principal component analysis
PPCA: Probabilistic principal component analysis
GPLVM: Gaussian process latent variable model
MLE: Maximum likelihood estimation
UncFA: Unconstrained factor analysis
S2.UncFA: Simple-supervised unconstrained factor analysis
S2.FA: Simple-supervised factor analysis
S2.PPCA: Simple-supervised probabilistic principal component analysis
S3.UncFA: Simple-semisupervised unconstrained factor analysis
S3.FA: Simple-semisupervised factor analysis
S3.PPCA: Simple-semisupervised probabilistic principal component analysis
MS3.UncFA: Missing simple-semisupervised unconstrained factor analysis
MS3.FA: Missing simple-semisupervised factor analysis
MS3.PPCA: Missing simple-semisupervised probabilistic principal component analysis
MSE: Mean squared error
ELBO: Evidence lower bound

Appendix A. On the Expectation–Maximization Algorithm

This section provides theoretical details on the EM algorithm [23]. These are relevant in order to establish a link between EM and the information theory field.
A latent variable model assumes that the data we observed—usually, denoted by the random variable X—is not complete: there is also some latent data modeled via random variables—usually, denoted by Z. Often, pairs consisting of an observed point and a latent one constitute the complete dataset, e.g., in a GMM the latent data is the cluster number and such a number is assigned to each point in the observed dataset.
Usually, in a latent variable model the likelihood of the observed data is not tractable—although there are exceptions, like in GMM or factor analysis—and therefore we cannot maximize it directly. Instead we maximize a lower bound for the log-likelihood function called ELBO (evidence lower bound). To simplify the discussion, we consider that we have one single observed datapoint, x. We will also consider the case where Z is continuous; when Z is discrete the ∫ sign is replaced by the ∑ sign.
Let q be any distribution over z, where the z values are the possible values of the random variable Z. The log-likelihood of the observed datapoint x is:   
\log p(x; \theta) \overset{\text{marginal}}{=} \log\int_z p(x, z; \theta)\,dz = \log\int_z \frac{q(z)}{q(z)}\, p(x, z; \theta)\,dz = \log\int_z q(z)\,\frac{p(x, z; \theta)}{q(z)}\,dz = \log \mathrm{E}_q\!\left[\frac{p(x, z; \theta)}{q(z)}\right] \overset{\text{Jensen}}{\ge} \mathrm{E}_q\!\left[\log\frac{p(x, z; \theta)}{q(z)}\right] \eqqcolon \mathrm{ELBO}(\theta, q).
Furthermore, the following relationships important for the E step of the EM algorithm—see below—can be proven:
\log p(x;\theta) - \mathrm{ELBO}(\theta, q) = \mathrm{KL}(q(\cdot)\,\|\,p(\cdot \mid x; \theta)) \;\Leftrightarrow\; \log p(x;\theta) = \mathrm{ELBO}(\theta, q) + \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta)) \;\Leftrightarrow\; \mathrm{ELBO}(\theta, q) = \log p(x;\theta) - \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta)).
Moreover, the following relationships important for the M step of the EM algorithm—see below—can be proven:
\mathrm{ELBO}(\theta, q) \overset{\text{def.}}{=} \mathrm{E}_q\!\left[\log\frac{p(x,z;\theta)}{q(z)}\right] = \int_z q(z)\log\frac{p(x,z;\theta)}{q(z)}\,dz = \int_z q(z)\log p(x,z;\theta)\,dz - \int_z q(z)\log q(z)\,dz = \mathrm{E}_q[\log p(x,z;\theta)] + H(q).
Now, instead of carrying out $\max_\theta \log p(x;\theta)$, we will execute $\max_{\theta, q} \mathrm{ELBO}(\theta, q)$, since the ELBO is a lower bound for the log-likelihood of x and hence its maximization will not hurt the process of maximizing $\log p(x;\theta)$.
The ELBO can be maximized in at least two ways:
  • via (block) coordinate ascent
    The resulting meta-algorithm is the EM algorithm.
    In fact this is the case for many classic models—EM for GMM [3], EM for factor analysis [5] etc.—.
    EM is an iterative algorithm, and an iteration encompasses two steps (a minimal numerical sketch of the resulting loop is given at the end of this appendix):
    • E step:
      $q^{(t)} = \arg\max_q \mathrm{ELBO}(\theta^{(t-1)}, q)$ for $\theta^{(t-1)}$ fixed (from the previous iteration).
      Since $\mathrm{ELBO}(\theta, q) = \log p(x;\theta) - \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta))$, we have:
      q^{(t)} = \arg\max_q \mathrm{ELBO}(\theta^{(t-1)}, q) = \arg\max_q \left(\log p(x;\theta^{(t-1)}) - \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta^{(t-1)}))\right) = \arg\min_q \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta^{(t-1)})) \overset{\text{KL property}}{=} p(\cdot\mid x;\theta^{(t-1)}).
      (In this case, we have $\log p(x;\theta^{(t-1)}) = \mathrm{ELBO}(\theta^{(t-1)}, q^{(t)})$.)
      So, we obtained the distribution $q^{(t)}$ as a posterior distribution. Although in classic models where conjugate priors are used it is tractable to compute $p(\cdot\mid x;\theta^{(t-1)})$ (this type of inference is called analytical inference), in other models this is not the case, and a solution to this shortcoming is represented by approximate/variational inference.
    • M step:
      $\theta^{(t)} = \arg\max_\theta \mathrm{ELBO}(\theta, q^{(t)})$ for $q^{(t)}$ fixed (from the E step).
      Since $\mathrm{ELBO}(\theta, q) = \mathrm{E}_q[\log p(x,z;\theta)] + H(q)$, we have:
      \theta^{(t)} = \arg\max_\theta \mathrm{ELBO}(\theta, q^{(t)}) = \arg\max_\theta \left(\mathrm{E}_{q^{(t)}}[\log p(x,z;\theta)] + H(q^{(t)})\right) = \arg\max_\theta \mathrm{E}_{q^{(t)}}[\log p(x,z;\theta)].
      So, we obtained a relatively simpler term to maximize. Note that the maximization is further customized using the probabilistic assumptions at hand.
  • via gradient ascent: this is the case of Variational Autoencoder [24] which will not be discussed since it is not necessarily relevant to this study.
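To make the coordinate-ascent view concrete, here is a minimal numpy sketch (ours, not the paper's code) of the EM loop for a toy two-component 1-D GMM; since the ELBO touches the log-likelihood after each E step, the monitored log-likelihood is non-decreasing across iterations.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])   # toy 1-D data

pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
prev_ll = -np.inf
for t in range(200):
    # E step: q(z) = p(z | x; theta), i.e., the responsibilities.
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    ll = np.log(dens.sum(axis=1)).sum()            # log-likelihood at the current theta
    w = dens / dens.sum(axis=1, keepdims=True)
    # M step: maximize E_q[log p(x, z; theta)]; H(q) does not depend on theta.
    pi = w.mean(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    sd = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0))
    if ll - prev_ll < 1e-8:                        # monotone improvement, so stop here
        break
    prev_ll = ll

print(pi, mu, sd, ll)
```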

Appendix B. S2.FA

Algorithm A1 S2.FA—nonmatrix form.
function Train(\{(x^{(i)}, z^{(i)}) \mid i \in \{1, \dots, n\}\})
    \bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)}
    \hat{\mu}_z = \frac{1}{n}\sum_{i=1}^n z^{(i)} = \bar{z}
    \hat{\Sigma}_z = \frac{1}{n}\sum_{i=1}^n (z^{(i)} - \hat{\mu}_z)(z^{(i)} - \hat{\mu}_z)^\top
    \hat{\Lambda} = \left(n\bar{x}\bar{z}^\top - \sum_{i=1}^n x^{(i)} z^{(i)\top}\right)\left(n\bar{z}\bar{z}^\top - \sum_{i=1}^n z^{(i)} z^{(i)\top}\right)^{-1}
    \hat{\mu} = \bar{x} - \hat{\Lambda}\bar{z}
    \hat{\Psi} = \mathrm{diag}\left(\frac{1}{n}\sum_{i=1}^n (x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})(x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})^\top\right)
    return (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi})
function Test(x^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    value = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
    covarianceMatrix = \hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z
    return (value, covarianceMatrix)
Algorithm A2 S2.FA—matrix form.
function Train(X, Z)    ▹ X = [x^{(1)} \cdots x^{(n)}], Z = [z^{(1)} \cdots z^{(n)}]
    \bar{x} = \frac{1}{n}\sum_{i=1}^n X_{:i}
    \hat{\mu}_z = \frac{1}{n}\sum_{i=1}^n Z_{:i} = \bar{z}
    \hat{\Sigma}_z = \frac{1}{n}\left(Z - \hat{\mu}_z \mathbf{1}_{1\times n}\right)\left(Z - \hat{\mu}_z \mathbf{1}_{1\times n}\right)^\top = \frac{1}{n} Z Z^\top - \hat{\mu}_z \hat{\mu}_z^\top    ▹ two ways to compute
    \hat{\Lambda} = \left(n\bar{x}\hat{\mu}_z^\top - X Z^\top\right)\left(n\hat{\mu}_z\hat{\mu}_z^\top - Z Z^\top\right)^{-1}
    \hat{\mu} = \bar{x} - \hat{\Lambda}\bar{z}
    \hat{\Psi} = \mathrm{diag}\left(\frac{1}{n}\left(X - \hat{\mu}\mathbf{1}_{1\times n} - \hat{\Lambda} Z\right)\left(X - \hat{\mu}\mathbf{1}_{1\times n} - \hat{\Lambda} Z\right)^\top\right)
    return (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi})
function Test(x^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    value = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
    covarianceMatrix = \hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z
    return (value, covarianceMatrix)
Here \mathbf{1}_{1\times n} = [1 \; 1 \; \cdots \; 1] \in \mathbb{R}^{1\times n} denotes the all-ones row vector.

Appendix C. S3.FA

The algorithms below are more general: z is not just a real number, as we state in the paper, but a real vector.
Algorithm A3 S3.FA—nonmatrix form.
function LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, (x^{(a)}, z^{(a)}), x^{(a+1)}, \dots, x^{(n)}\}, (\mu_z, \Sigma_z, \mu, \Lambda, \Psi))
    return \sum_{i=1}^a \left[\ln\mathcal{N}(z^{(i)} \mid \mu_z, \Sigma_z) + \ln\mathcal{N}(x^{(i)} \mid \mu + \Lambda z^{(i)}, \Psi)\right] + \sum_{i=a+1}^n \ln\mathcal{N}(x^{(i)} \mid \mu + \Lambda\mu_z, \Lambda\Sigma_z\Lambda^\top + \Psi)
function Train(\{(x^{(1)}, z^{(1)}), \dots, (x^{(a)}, z^{(a)}), x^{(a+1)}, \dots, x^{(n)}\}, nMaxIterations, eps)
    \bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)}
    \theta^{(0)} = initializeParameters(\{(x^{(1)}, z^{(1)}), \dots, x^{(n)}\})
    l_{\mathrm{RV\_Do}}^{(0)} = LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, x^{(n)}\}, \theta^{(0)})
    for t = 0 : nMaxIterations do
        E step: compute \mathrm{E}[Z^{(i)}], \mathrm{E}[Z^{(i)} Z^{(i)\top}] for i \in \{a+1, \dots, n\}:
            \mathrm{E}_{Z^{(i)} \mid X^{(i)} = x^{(i)}, \theta^{(t)}}[Z^{(i)}] = \mu_z^{(t)} + \Sigma_z^{(t)}\Lambda^{(t)\top}(\Lambda^{(t)}\Sigma_z^{(t)}\Lambda^{(t)\top} + \Psi^{(t)})^{-1}(x^{(i)} - \mu^{(t)} - \Lambda^{(t)}\mu_z^{(t)})
            \mathrm{E}_{Z^{(i)} \mid X^{(i)} = x^{(i)}, \theta^{(t)}}[Z^{(i)} Z^{(i)\top}] = \Sigma_z^{(t)} + \mathrm{E}_{Z^{(i)} \mid X^{(i)} = x^{(i)}, \theta^{(t)}}[Z^{(i)}]\,\mathrm{E}_{Z^{(i)} \mid X^{(i)} = x^{(i)}, \theta^{(t)}}[Z^{(i)}]^\top
        M step: compute \theta^{(t+1)} = (\mu_z^{(t+1)}, \Sigma_z^{(t+1)}, \mu^{(t+1)}, \Lambda^{(t+1)}, \Psi^{(t+1)}):
            \mu_z^{(t+1)} = \frac{1}{n}\left(\sum_{i=1}^a z^{(i)} + \sum_{i=a+1}^n \mathrm{E}[Z^{(i)}]\right)
            \Sigma_z^{(t+1)} = \frac{1}{n}\left(\sum_{i=1}^a (z^{(i)} - \mu_z^{(t+1)})(z^{(i)} - \mu_z^{(t+1)})^\top + \sum_{i=a+1}^n \left(\mathrm{E}[Z^{(i)} Z^{(i)\top}] - \mathrm{E}[Z^{(i)}]\mu_z^{(t+1)\top} - \mu_z^{(t+1)}\mathrm{E}[Z^{(i)}]^\top + \mu_z^{(t+1)}\mu_z^{(t+1)\top}\right)\right)
            \Lambda^{(t+1)} = \left(n\bar{x}\mu_z^{(t+1)\top} - \sum_{i=1}^a x^{(i)} z^{(i)\top} - \sum_{i=a+1}^n x^{(i)}\mathrm{E}[Z^{(i)}]^\top\right)\left(n\mu_z^{(t+1)}\mu_z^{(t+1)\top} - \sum_{i=1}^a z^{(i)} z^{(i)\top} - \sum_{i=a+1}^n \mathrm{E}[Z^{(i)} Z^{(i)\top}]\right)^{-1}
            \mu^{(t+1)} = \bar{x} - \Lambda^{(t+1)}\mu_z^{(t+1)}
            \Psi^{(t+1)} = \mathrm{diag}\Big(\frac{1}{n}\Big(\sum_{i=1}^a (x^{(i)} - \mu^{(t+1)} - \Lambda^{(t+1)} z^{(i)})(x^{(i)} - \mu^{(t+1)} - \Lambda^{(t+1)} z^{(i)})^\top + \sum_{i=a+1}^n \big((x^{(i)} - \mu^{(t+1)})(x^{(i)} - \mu^{(t+1)})^\top - (x^{(i)} - \mu^{(t+1)})\mathrm{E}[Z^{(i)}]^\top\Lambda^{(t+1)\top} - \Lambda^{(t+1)}\mathrm{E}[Z^{(i)}](x^{(i)} - \mu^{(t+1)})^\top + \Lambda^{(t+1)}\mathrm{E}[Z^{(i)} Z^{(i)\top}]\Lambda^{(t+1)\top}\big)\Big)\Big)
            l_{\mathrm{RV\_Do}}^{(t+1)} = LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, x^{(n)}\}, \theta^{(t+1)})
        if \frac{\|\theta^{(t)} - \theta^{(t+1)}\|_2^2}{\|\theta^{(t)}\|_2^2} \le eps or \frac{|l_{\mathrm{RV\_Do}}^{(t)} - l_{\mathrm{RV\_Do}}^{(t+1)}|}{|l_{\mathrm{RV\_Do}}^{(t)}|} \le eps then
            break
    return \theta^{(t)}
function Test(x^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))    ▹ the same as in S2.FA
    value = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
    covarianceMatrix = \hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z
    return (value, covarianceMatrix)
Algorithm A4 S3.FA—matrix form.
function LogLikelihood(X, Z, (\mu_z, \Sigma_z, \mu, \Lambda, \Psi))    ▹ X = [x^{(1)} \cdots x^{(n)}], Z = [z^{(1)} \cdots z^{(a)}]
    return \sum_{i=1}^a \left[\ln\mathcal{N}(Z_{:i} \mid \mu_z, \Sigma_z) + \ln\mathcal{N}(X_{:i} \mid \mu + \Lambda Z_{:i}, \Psi)\right] + \sum_{i=a+1}^n \ln\mathcal{N}(X_{:i} \mid \mu + \Lambda\mu_z, \Lambda\Sigma_z\Lambda^\top + \Psi)
function Train(X, Z, nMaxIterations, eps)    ▹ X = [x^{(1)} \cdots x^{(n)}], Z = [z^{(1)} \cdots z^{(a)}]
    \bar{x} = \frac{1}{n}\sum_{i=1}^n X_{:i}
    \theta^{(0)} = initializeParameters(X, Z)
    l_{\mathrm{RV\_Do}}^{(0)} = LogLikelihood(X, Z, \theta^{(0)})
    for t = 0 : nMaxIterations do
        E step: compute \mathrm{E}[Z^{(i)}], \mathrm{E}[Z^{(i)} Z^{(i)\top}] for i \in \{a+1, \dots, n\}:
            E\_Z = \mu_z^{(t)}\mathbf{1}_{1\times(n-a)} + \Sigma_z^{(t)}\Lambda^{(t)\top}(\Lambda^{(t)}\Sigma_z^{(t)}\Lambda^{(t)\top} + \Psi^{(t)})^{-1}\left(X_{:,(a+1):n} - \mu^{(t)}\mathbf{1}_{1\times(n-a)} - \Lambda^{(t)}\mu_z^{(t)}\mathbf{1}_{1\times(n-a)}\right)
            E\_Z\_Z\_T = \Sigma_z^{(t)} + E\_Z \cdot E\_Z^\top
            E\_Z = [\,Z \;\; E\_Z\,]
            E\_Z\_Z\_T = Z Z^\top + E\_Z\_Z\_T
        M step: compute \theta^{(t+1)} = (\mu_z^{(t+1)}, \Sigma_z^{(t+1)}, \mu^{(t+1)}, \Lambda^{(t+1)}, \Psi^{(t+1)}):
            \mu_z^{(t+1)} = \frac{1}{n}\sum_{i=1}^n E\_Z_{:i}
            \Sigma_z^{(t+1)} = \frac{1}{n} E\_Z\_Z\_T - \mu_z^{(t+1)}\mu_z^{(t+1)\top}
            \Lambda^{(t+1)} = \left(n\bar{x}\mu_z^{(t+1)\top} - X \cdot E\_Z^\top\right)\left(n\mu_z^{(t+1)}\mu_z^{(t+1)\top} - E\_Z\_Z\_T\right)^{-1}
            \mu^{(t+1)} = \bar{x} - \Lambda^{(t+1)}\mu_z^{(t+1)}
            \Psi^{(t+1)} = \mathrm{diag}\Big(\frac{1}{n}\Big((X - \mu^{(t+1)}\mathbf{1}_{1\times n})(X - \mu^{(t+1)}\mathbf{1}_{1\times n})^\top - (X - \mu^{(t+1)}\mathbf{1}_{1\times n})\,E\_Z^\top\,\Lambda^{(t+1)\top} - \Lambda^{(t+1)}\,E\_Z\,(X - \mu^{(t+1)}\mathbf{1}_{1\times n})^\top + \Lambda^{(t+1)}\,E\_Z\_Z\_T\,\Lambda^{(t+1)\top}\Big)\Big)
            l_{\mathrm{RV\_Do}}^{(t+1)} = LogLikelihood(X, Z, \theta^{(t+1)})
        if \frac{\|\theta^{(t)} - \theta^{(t+1)}\|_2^2}{\|\theta^{(t)}\|_2^2} \le eps or \frac{|l_{\mathrm{RV\_Do}}^{(t)} - l_{\mathrm{RV\_Do}}^{(t+1)}|}{|l_{\mathrm{RV\_Do}}^{(t)}|} \le eps then
            break
    return \theta^{(t)}
function Test(x^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))    ▹ the same as in S2.FA
    value = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
    covarianceMatrix = \hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z
    return (value, covarianceMatrix)

Appendix D. MS3.FA

The algorithms below are more general: z is not just a real number, as we state in the paper, but a real vector.
Algorithm A5 MS3.FA—Other functions.
function CondNormalNA(y, (\mu, \Sigma))
    I = \{i \mid y_i = \mathrm{NA}\}    ▹ I = indexes of the missing components
    OI = \{i \mid y_i \ne \mathrm{NA}\}    ▹ OI = other indexes (observed components)
    mean = \mu_I + \Sigma_{I,OI}\,\Sigma_{OI,OI}^{-1}\,(y_{OI} - \mu_{OI})
    covarianceMatrix = \Sigma_{I,I} - \Sigma_{I,OI}\,\Sigma_{OI,OI}^{-1}\,\Sigma_{I,OI}^\top
    \mathrm{E}[Y]_I = mean
    \mathrm{E}[Y]_{OI} = y_{OI}
    \mathrm{E}[Y Y^\top] = \mathrm{E}[Y]\,\mathrm{E}[Y]^\top
    \mathrm{E}[Y Y^\top]_{I,I} = \mathrm{E}[Y Y^\top]_{I,I} + covarianceMatrix
    return (\mathrm{E}[Y], \mathrm{E}[Y Y^\top])
function FullNormal(\mu_z, \Sigma_z, \mu, \Lambda, \Psi)
    mean = \begin{bmatrix}\mu + \Lambda\mu_z \\ \mu_z\end{bmatrix}
    covarianceMatrix = \begin{bmatrix}\Lambda\Sigma_z\Lambda^\top + \Psi & \Lambda\Sigma_z \\ (\Lambda\Sigma_z)^\top & \Sigma_z\end{bmatrix}
    return (mean, covarianceMatrix)
function LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\}, (\mu_z, \Sigma_z, \mu, \Lambda, \Psi))
    (mean, cov) = FullNormal(\mu_z, \Sigma_z, \mu, \Lambda, \Psi)
    result = 0
    for i = 1 : n do
        y^{(i)} = \begin{bmatrix}x^{(i)} \\ z^{(i)}\end{bmatrix}
        OI = \{j \mid y_j^{(i)} \ne \mathrm{NA}\}    ▹ OI = other (observed) indexes
        result = result + \ln\mathcal{N}(y_{OI}^{(i)} \mid mean_{OI}, cov_{OI,OI})
    return result
Algorithm A6 MS3.FA—Train.
function Train(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\}, nMaxIterations, eps)    ▹ data can have NAs
    \theta^{(0)} = initializeParameters(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\})
    l_{\mathrm{RV\_Do}}^{(0)} = LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\}, \theta^{(0)})
    for t = 0 : nMaxIterations do
        E step: compute the following for all i \in \{1, \dots, n\}:
            y^{(i)} = \begin{bmatrix}x^{(i)} \\ z^{(i)}\end{bmatrix}
            fullNormal = FullNormal(\mu_z^{(t)}, \Sigma_z^{(t)}, \mu^{(t)}, \Lambda^{(t)}, \Psi^{(t)})
            (\mathrm{E}[Y^{(i)}], \mathrm{E}[Y^{(i)} Y^{(i)\top}]) = CondNormalNA(y^{(i)}, fullNormal)
            \mathrm{E}[X^{(i)}] = \mathrm{E}[Y^{(i)}]_{1:D}    ▹ D = dim(x^{(i)})
            \mathrm{E}[Z^{(i)}] = \mathrm{E}[Y^{(i)}]_{(D+1):(D+d)}    ▹ d = dim(z^{(i)})
            \mathrm{E}[X^{(i)} X^{(i)\top}] = \mathrm{E}[Y^{(i)} Y^{(i)\top}]_{1:D,\,1:D}
            \mathrm{E}[X^{(i)} Z^{(i)\top}] = \mathrm{E}[Y^{(i)} Y^{(i)\top}]_{1:D,\,(D+1):(D+d)}
            \mathrm{E}[Z^{(i)} X^{(i)\top}] = \mathrm{E}[X^{(i)} Z^{(i)\top}]^\top
            \mathrm{E}[Z^{(i)} Z^{(i)\top}] = \mathrm{E}[Y^{(i)} Y^{(i)\top}]_{(D+1):(D+d),\,(D+1):(D+d)}
        M step: compute \theta^{(t+1)} = (\mu_z^{(t+1)}, \Sigma_z^{(t+1)}, \mu^{(t+1)}, \Lambda^{(t+1)}, \Psi^{(t+1)}):
            \mu_z^{(t+1)} = \frac{1}{n}\sum_{i=1}^n \mathrm{E}[Z^{(i)}]
            \Sigma_z^{(t+1)} = \frac{1}{n}\sum_{i=1}^n \left(\mathrm{E}[Z^{(i)} Z^{(i)\top}] - \mathrm{E}[Z^{(i)}]\mu_z^{(t+1)\top} - \mu_z^{(t+1)}\mathrm{E}[Z^{(i)}]^\top + \mu_z^{(t+1)}\mu_z^{(t+1)\top}\right)
            \bar{x} = \frac{1}{n}\sum_{i=1}^n \mathrm{E}[X^{(i)}]
            \Lambda^{(t+1)} = \left(n\bar{x}\mu_z^{(t+1)\top} - \sum_{i=1}^n \mathrm{E}[X^{(i)} Z^{(i)\top}]\right)\left(n\mu_z^{(t+1)}\mu_z^{(t+1)\top} - \sum_{i=1}^n \mathrm{E}[Z^{(i)} Z^{(i)\top}]\right)^{-1}
            \mu^{(t+1)} = \bar{x} - \Lambda^{(t+1)}\mu_z^{(t+1)}
            \Psi^{(t+1)} = \mathrm{diag}\Big(\frac{1}{n}\sum_{i=1}^n \big(\mathrm{E}[X^{(i)} X^{(i)\top}] - \mathrm{E}[X^{(i)}]\mu^{(t+1)\top} - \mu^{(t+1)}\mathrm{E}[X^{(i)}]^\top + \mu^{(t+1)}\mu^{(t+1)\top} - \mathrm{E}[X^{(i)} Z^{(i)\top}]\Lambda^{(t+1)\top} + \mu^{(t+1)}\mathrm{E}[Z^{(i)}]^\top\Lambda^{(t+1)\top} - \Lambda^{(t+1)}\mathrm{E}[Z^{(i)} X^{(i)\top}] + \Lambda^{(t+1)}\mathrm{E}[Z^{(i)}]\mu^{(t+1)\top} + \Lambda^{(t+1)}\mathrm{E}[Z^{(i)} Z^{(i)\top}]\Lambda^{(t+1)\top}\big)\Big)
            l_{\mathrm{RV\_Do}}^{(t+1)} = LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\}, \theta^{(t+1)})
        if \frac{\|\theta^{(t)} - \theta^{(t+1)}\|_2^2}{\|\theta^{(t)}\|_2^2} \le eps or \frac{|l_{\mathrm{RV\_Do}}^{(t)} - l_{\mathrm{RV\_Do}}^{(t+1)}|}{|l_{\mathrm{RV\_Do}}^{(t)}|} \le eps then
            break
    return \theta^{(t)}
Algorithm A7 MS3.FA—Test and Impute.
function Test(y^* = (x^*, z^*) partially known, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    fullNormal = FullNormal(\hat{\mu}_z, \hat{\Sigma}_z, \hat{\mu}, \hat{\Lambda}, \hat{\Psi})
    (\mathrm{E}[Y^*], \mathrm{E}[Y^* Y^{*\top}]) = CondNormalNA(y^*, fullNormal)
    value = \mathrm{E}[Y^*]
    covarianceMatrix = \mathrm{E}[Y^* Y^{*\top}] - \mathrm{E}[Y^*]\,\mathrm{E}[Y^*]^\top
    return (value, covarianceMatrix)
function Impute(y^* = (x^*, z^*) partially known, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    (value, covarianceMatrix) = Test(y^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    return value

References

  1. Mitchell, T. Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression. (Additional Chapter to Machine Learning; McGraw-Hill: New York, NY, USA, 1997.) Published Online. 2017; Available online: https://bit.ly/39Ueb4o (accessed on 31 July 2021).
  2. Murphy, K. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
  3. Ng, A. Machine Learning Course, Lecture Notes, Mixtures of Gaussians and the EM Algorithm. Available online: http://cs229.stanford.edu/notes2020spring/cs229-notes7b.pdf (accessed on 31 July 2021).
  4. Singh, A. Machine Learning Course, Homework 4, pr 1.1; CMU: Pittsburgh, PA, USA, 2010; p. 528 in Ciortuz, L.; Munteanu, A.; Bădărău, E. Machine Learning Exercise Book (In Romanian); Alexandru Ioan Cuza University of Iași: Iași, Romania, 2019. Available online: https://bit.ly/320ZuIk (accessed on 31 July 2021).
  5. Ng, A. Machine Learning Course, Lecture Notes, Part X. Available online: http://cs229.stanford.edu/notes2020spring/cs229-notes9.pdf (accessed on 31 July 2021).
  6. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 31 July 2021).
  7. Tipping, M.E.; Bishop, C.M. Probabilistic Principal Component Analysis. J. R. Stat. Soc. Ser. (Stat. Methodol.) 1999, 61, 611–622. Available online: https://bit.ly/2PCxoRr (accessed on 31 July 2021). [CrossRef]
  8. Ciobanu, S. Exploiting a New Probabilistic Model: Simple-Supervised Factor Analysis. Master’s Thesis, Alexandru Ioan Cuza University of Iași, Iași, Romania, 2019. Available online: https://bit.ly/31UsBx6 (accessed on 31 July 2021).
  9. Ng, A. Machine Learning Course, Lecture Notes, Part XI. Available online: http://cs229.stanford.edu/notes2020spring/cs229-notes10.pdf (accessed on 31 July 2021).
  10. Lawrence, N.D. Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data. Adv. Neural Inf. Process. Syst. 2004, 329–336. Available online: https://papers.nips.cc/paper/2540-gaussian-process-latent-variable-models-for-visualisation-of-high-dimensional-data.pdf (accessed on 31 July 2021).
  11. Gao, X.; Wang, X.; Tao, D.; Li, X. Supervised Gaussian Process Latent Variable Model for Dimensionality Reduction. IEEE Trans. Syst. Man, Cybern. Part (Cybern.) 2010, 41, 425–434. Available online: https://ieeexplore.ieee.org/document/5545418 (accessed on 31 July 2021).
  12. Mitchell, T.; Xing, E.; Singh, A. Machine Learning Course, Midterm Exam, pr. 5.3; CMU: Pittsburgh, PA, USA, 2010; p. 565 Ciortuz, L.; Munteanu, A.; Bădărău, E. Machine Learning Exercise Book (In Romanian); Alexandru Ioan Cuza University of Iași: Iași, Romania, 2019. Available online: https://bit.ly/320ZuIk (accessed on 31 July 2021).
  13. Ziyatdinov, A.; Fonollosa, J.; Fernández, L.; Gutierrez-Gálvez, A.; Marco, S.; Perera, A. Bioinspired early detection through gas flow modulation in chemo-sensory systems. Sens. Actuators Chem. 2015, 206, 538–547. [Google Scholar] [CrossRef] [Green Version]
  14. Spyromitros-Xioufis, E.; Tsoumakas, G.; Groves, W.; Vlahavas, I. Drawing parallels between multi-label classification and multi-target regression. arXiv 2014, arXiv:1211.6581v2. [Google Scholar]
  15. Xiaojin, Z.; Zoubin, G. Learning from Labeled and Unlabeled Data with Label Propagation; Technical Report CMU-CALD-02–107; Carnegie Mellon University: Pittsburgh, PA, USA, 2002. [Google Scholar]
  16. Wang, J. SSL: Semi-Supervised Learning, R Package Version 0.1; 2016. Available online: https://CRAN.R-project.org/package=SSL (accessed on 31 July 2021).
  17. Oliver, A.; Odena, A.; Raffel, C.; Cubuk, E.D.; Goodfellow, I.J. Realistic evaluation of deep semi-supervised learning algorithms. arXiv 2018, arXiv:1804.09170. [Google Scholar]
  18. van Buuren, S.; Groothuis-Oudshoorn, K. Mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 45, 1–67. Available online: https://www.jstatsoft.org/v45/i03/ (accessed on 31 July 2021). [CrossRef] [Green Version]
  19. Honaker, J.; King, G.; Blackwell, M. Amelia II: A Program for Missing Data. J. Stat. Softw. 2011, 45, 1–47. Available online: http://www.jstatsoft.org/v45/i07/ (accessed on 31 July 2021). [CrossRef]
  20. Ghahramani, Z.; Hinton, G.E. The EM Algorithm for Mixtures of Factor Analyzers; Technical Report, CRG-TR-96-1; University of Toronto: Toronto, ON, Canada, 1996; Available online: http://mlg.eng.cam.ac.uk/zoubin/papers/tr-96-1.pdf (accessed on 31 July 2021).
  21. Bishop, C.M. Pattern Recognition and Machine Learning; Springer Science + Business Media: Berlin, Germany, 2006. [Google Scholar]
  22. Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 2016, 59, 56–65. [Google Scholar] [CrossRef]
  23. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. (Methodol.) 1977, 39, 1–22. [Google Scholar]
  24. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
Figure 1. Simple-semisupervised factor analysis (S3.FA) experiment: MSE 95% confidence intervals on the gas sensor array under flow modulation data set using four methods for semisupervised regression when D ≫ n; p ∈ {5%, 10%, 15%, …, 100%} of the training output labels are retained.
Figure 2. S3.FA experiment: MSE 95% confidence intervals on the atp1d data set using four methods for semisupervised regression when D ≫ n; p ∈ {5%, 10%, 15%, …, 100%} of the training output labels are retained.
Figure 3. S3.FA experiment: MSE 95% confidence intervals on the m5spec data set using four methods for semisupervised regression when D ≫ n; p ∈ {5%, 10%, 15%, …, 100%} of the training output labels are retained.
Figure 4. Missing simple-semisupervised factor analysis (MS3.FA) experiment: MSE 95% confidence intervals on the gas sensor array under flow modulation data set using two methods for imputing missing data when D ≫ n; p ∈ {10%, 20%, 30%, 40%, 50%, 60%} of the input data are removed.
Figure 5. MS3.FA experiment: MSE 95% confidence intervals on the atp1d data set using two methods for imputing missing data when D ≫ n; p ∈ {10%, 20%, 30%, 40%, 50%, 60%} of the input data are removed.
Figure 6. MS3.FA experiment: MSE 95% confidence intervals on the m5spec data set using two methods for imputing missing data when D ≫ n; p ∈ {10%, 20%, 30%, 40%, 50%, 60%} of the input data are removed.
Table 1. Simple-supervised factor analysis (S2.FA) experiment: mean squared error (MSE) 95% confidence intervals on three data sets using three methods for regression when D ≫ n; the best MSE means are marked in bold.
Data Set/Method | Moore–Penrose | Ridge Regression | S2.FA
Gas sensor array under flow modulation | 0.0251 ± 0.0254 | 0.0062 ± 0.007 | 0.0452 ± 0.0208
atp1d | 94,627.7239 ± 80,183.0076 | 27,770.9253 ± 42,887.5216 | 4724.2957 ± 1616.3341
m5spec | 0.00004 ± 0.00001 | 0.02676 ± 0.01372 | 0.37344 ± 0.25025
Table 2. S3.FA experiment: MSE 95% confidence intervals on three data sets using four methods for semisupervised regression when D ≫ n; p ∈ {5%, 10%, 15%, 30%, 50%, 70%} of the training output labels are retained; the best MSE means are marked in bold.
Data Set | Method | p = 5% | p = 10% | p = 15%
Gas sensor array under flow modulation | Moore–Penrose | 0.034 ± 0.0437 | 0.0292 ± 0.0421 | 0.0267 ± 0.0326
Gas sensor array under flow modulation | S2.FA | 0.2511 ± 0.3953 | 0.0899 ± 0.2367 | 0.0285 ± 0.0333
Gas sensor array under flow modulation | S3.FA | 3.9799 ± 6.7087 | 0.349 ± 0.6626 | 0.2294 ± 0.1561
Gas sensor array under flow modulation | Label Propagation | 0.0853 ± 0.0073 | 0.0825 ± 0.0051 | 0.0708 ± 0.013
atp1d | Moore–Penrose | 3763.2939 ± 749.0966 | 3079.5042 ± 1842.7676 | 3428.7567 ± 1722.6437
atp1d | S2.FA | 2706.9906 ± 477.3245 | 2339.3504 ± 324.8581 | 2279.0802 ± 99.5443
atp1d | S3.FA | 3771.605 ± 1563.6543 | 2972.1137 ± 639.6724 | 2633.0816 ± 274.7409
atp1d | Label Propagation | 110,820.3235 ± 0 | 110,820.3235 ± 0 | 110,820.3235 ± 0
m5spec | Moore–Penrose | 0.4602 ± 0.354 | 0.3609 ± 0.7631 | 0.0187 ± 0.0221
m5spec | S2.FA | 0.7003 ± 0.3834 | 0.4269 ± 0.5441 | 0.4999 ± 0.5172
m5spec | S3.FA | 0.8133 ± 0.6366 | 1.653 ± 3.9497 | 0.5118 ± 0.5503
m5spec | Label Propagation | 0.2175 ± 0.0707 | 0.1939 ± 0.0706 | 0.2343 ± 0.0624

Data Set | Method | p = 30% | p = 50% | p = 70%
Gas sensor array under flow modulation | Moore–Penrose | 0.0313 ± 0.0259 | 0.0192 ± 0.0145 | 0.0132 ± 0.0075
Gas sensor array under flow modulation | S2.FA | 0.0588 ± 0.0576 | 0.0233 ± 0.0224 | 0.0279 ± 0.0116
Gas sensor array under flow modulation | S3.FA | 0.645 ± 1.3044 | 0.1666 ± 0.1059 | 0.0912 ± 0.0443
Gas sensor array under flow modulation | Label Propagation | 0.0668 ± 0.0086 | 0.0675 ± 0.018 | 0.0605 ± 0.0122
atp1d | Moore–Penrose | 6401.7587 ± 3602.9182 | 7907.484 ± 4449.2312 | 36,041.874 ± 21,300.704
atp1d | S2.FA | 2444.9001 ± 404.2393 | 2439.162 ± 108.3605 | 2399.7948 ± 160.1038
atp1d | S3.FA | 2760.6602 ± 391.9217 | 2659.6472 ± 112.9795 | 2521.9861 ± 222.3765
atp1d | Label Propagation | 110,820.3235 ± 0 | 110,820.3235 ± 0 | 110,820.3235 ± 0
m5spec | Moore–Penrose | 0.00057 ± 0.00085 | 0.00009 ± 0.00008 | 0.00009 ± 0.00004
m5spec | S2.FA | 0.37449 ± 0.12508 | 0.35796 ± 0.07275 | 0.3995 ± 0.03164
m5spec | S3.FA | 0.37865 ± 0.12808 | 0.36181 ± 0.07463 | 0.40419 ± 0.03415
m5spec | Label Propagation | 0.19391 ± 0.02754 | 0.17424 ± 0.00857 | 0.17752 ± 0.01545
Table 3. MS3.FA experiment: MSE 95% confidence intervals on three data sets using two methods for imputing missing data when D ≫ n; p ∈ {10%, 20%, 30%, 40%, 50%, 60%} of the input data are removed; the best MSE means are marked in bold.
Data Set | Method | p = 10% | p = 20% | p = 30%
Gas sensor array under flow modulation | Mean | 0.0108 ± 0.0022 | 0.0241 ± 0.0027 | 0.0359 ± 0.0011
Gas sensor array under flow modulation | MS3.FA | 0.00603 ± 0.00029 | 0.01176 ± 0.00081 | 0.01747 ± 0.00094
atp1d | Mean | 3239.9757 ± 68.2511 | 6831.0884 ± 105.6119 | 10,077.1798 ± 117.4693
atp1d | MS3.FA | 1807.7748 ± 105.03 | 3697.1146 ± 124.2197 | 5276.899 ± 104.6252
m5spec | Mean | 0.00013 ± 0.00001 | 0.00026 ± 0 | 0.00039 ± 0.00001
m5spec | MS3.FA | 0.14835 ± 0.000002 | 0.148433 ± 0.000002 | 0.148518 ± 0.000002

Data Set | Method | p = 40% | p = 50% | p = 60%
Gas sensor array under flow modulation | Mean | 0.0494 ± 0.0042 | 0.0597 ± 0.0015 | 0.0731 ± 0.0012
Gas sensor array under flow modulation | MS3.FA | 0.02372 ± 0.00159 | 0.0299 ± 0.00047 | 0.0364 ± 0.0011
atp1d | Mean | 13,423.3625 ± 252.7735 | 16,899.3166 ± 223.6971 | 20,156.2493 ± 62.1299
atp1d | MS3.FA | 7071.6793 ± 143.3232 | 8921.5319 ± 165.5759 | 10,558.5723 ± 53.1182
m5spec | Mean | 0.000521 ± 0.000005 | 0.000658 ± 0.000008 | 0.00080 ± 0.000005
m5spec | MS3.FA | 0.1486 ± 0.00001 | 0.14869 ± 0.00001 | 0.148781 ± 0.000001
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

