Article

A Factor Analysis Perspective on Linear Regression in the ‘More Predictors than Samples’ Case

Faculty of Computer Science, Alexandru Ioan Cuza University of Iaşi, 700506 Iaşi, Romania
* Authors to whom correspondence should be addressed.
Entropy 2021, 23(8), 1012; https://doi.org/10.3390/e23081012
Submission received: 5 June 2021 / Revised: 25 July 2021 / Accepted: 27 July 2021 / Published: 3 August 2021
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science II)

Abstract
Linear regression (LR) is a core model in supervised machine learning performing a regression task. One can fit this model using either an analytic/closed-form formula or an iterative algorithm. Fitting it via the analytic formula becomes a problem when the number of predictors is greater than the number of samples, because the closed-form solution contains a matrix inverse that is not defined in that case. The standard approaches to this issue are the Moore–Penrose inverse and L2 regularization. We propose another solution, starting from a machine learning model that, this time, is used in unsupervised learning for a dimensionality reduction task or simply for density estimation: factor analysis (FA) with a one-dimensional latent space. The density estimation task is our focus since, in this case, FA can fit a Gaussian distribution even if the dimensionality of the data is greater than the number of samples; we retain this advantage when creating the supervised counterpart of factor analysis, which is linked to linear regression. We also create its semisupervised counterpart and then extend it to be usable with missing data. We prove an equivalence to linear regression and create experiments for each extension of the factor analysis model. The resulting algorithms are either closed-form solutions or expectation–maximization (EM) algorithms. The latter are linked to information theory by optimizing a function containing a Kullback–Leibler (KL) divergence or the entropy of a random variable.

1. Introduction

In machine learning, models can be grouped into two categories: probabilistic and nonprobabilistic. Probabilistic models can be classified as generative and discriminative [1]. Examples of classic generative models are naive Bayes and Gaussian mixture models (GMM). Examples of classic discriminative models are linear regression (LR) and logistic regression. The key difference is whether they model the joint probability of the input and the output—generative models—or they just model the conditional probability of the output given the input—discriminative models. For a classification or a regression task, one may argue that what you need is just a discriminative model, but the generative models have their advantages: they can sometimes handle missing data, can easily generate new data, can be extended to be unsupervised or semisupervised, etc. ([2] p. 268).
As one may notice, there are generative models for unsupervised learning that have counterparts in supervised learning, even though this is not widely discussed in the literature. One such example is the GMM ([2] p. 339) with its counterpart, the Gaussian joint Bayes model ([2] p. 102), also known as quadratic discriminant analysis. Their training/fitting algorithms are similar, as can be seen, for example, in [3,4]:
  • for Gaussian joint Bayes:
    \pi_j = \frac{1}{n}\sum_{i=1}^n 1\{z_i = j\}
    \mu_j = \frac{\sum_{i=1}^n 1\{z_i = j\}\, x_i}{\sum_{i=1}^n 1\{z_i = j\}}
    \Sigma_j = \frac{\sum_{i=1}^n 1\{z_i = j\}\,(x_i - \mu_j)(x_i - \mu_j)^\top}{\sum_{i=1}^n 1\{z_i = j\}}
    where x i is an input observation, ( π j , μ j , Σ j ) are the parameters of a GMM, observable z i is the class index corresponding to x i , j is a class index, and 1 { z i = j } is the indicator function, which returns 1 if the condition z i = j is true and 0 otherwise.
  • for expectation–maximization (EM) for the GMM, which optimizes a function involving a Kullback–Leibler (KL) divergence or the entropy of a random variable (see Appendix A for these details):
    E step:
    w_{ij} = p(z_i = j \mid x_i; \pi, \mu, \Sigma) = \frac{p(x_i \mid z_i = j; \mu, \Sigma)\, p(z_i = j; \pi)}{\sum_{l=1}^K p(x_i \mid z_i = l; \mu, \Sigma)\, p(z_i = l; \pi)}
    M step:
    \pi_j = \frac{1}{n}\sum_{i=1}^n w_{ij}
    \mu_j = \frac{\sum_{i=1}^n w_{ij}\, x_i}{\sum_{i=1}^n w_{ij}}
    \Sigma_j = \frac{\sum_{i=1}^n w_{ij}\,(x_i - \mu_j)(x_i - \mu_j)^\top}{\sum_{i=1}^n w_{ij}}
    where x_i is an input observation, (\pi_j, \mu_j, \Sigma_j) are the parameters of a GMM, unobservable z_i is the cluster index corresponding to x_i, j is a cluster index, and w_{ij} is the probability that x_i belongs to cluster j. A small numerical sketch contrasting the two fits is given right after this list.
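To make the parallel concrete, the following minimal numpy sketch (ours, not taken from the paper or the cited lecture notes; all names and the synthetic data are our own) fits the supervised Gaussian joint Bayes model with indicator-weighted averages and then performs one EM iteration for a GMM on the same data; the M-step lines mirror the supervised formulas with the indicators replaced by soft responsibilities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 300, 2, 3
z = rng.integers(0, K, size=n)                       # class/cluster indices
means_true = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
x = means_true[z] + rng.normal(size=(n, d))          # observations

# Gaussian joint Bayes (supervised): indicator-weighted estimates.
pi_s = np.array([(z == j).mean() for j in range(K)])
mu_s = np.array([x[z == j].mean(axis=0) for j in range(K)])
Sigma_s = np.array([np.cov(x[z == j].T, bias=True) for j in range(K)])

def gaussian_pdf(x, mu, Sigma):
    diff = x - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma))

# One EM iteration for the GMM (unsupervised), started from a rough initialization.
pi, mu, Sigma = np.full(K, 1.0 / K), mu_s + 0.5, Sigma_s.copy()
# E step: w[i, j] = p(z_i = j | x_i; pi, mu, Sigma).
joint = np.stack([pi[j] * gaussian_pdf(x, mu[j], Sigma[j]) for j in range(K)], axis=1)
w = joint / joint.sum(axis=1, keepdims=True)
# M step: the same three formulas, with the indicator replaced by the weight w[i, j].
pi_em = w.mean(axis=0)
mu_em = (w.T @ x) / w.sum(axis=0)[:, None]
Sigma_em = np.array([((x - mu_em[j]).T * w[:, j]) @ (x - mu_em[j]) / w[:, j].sum()
                     for j in range(K)])
print(pi_s, pi_em)
```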
This similarity between GMM and Gaussian joint Bayes is intriguing; hence, we decided to further explore this aspect but starting from other supervised–unsupervised counterparts. As a result, we changed the root model into factor analysis (FA) [5] ([2] p. 381), which is normally used for dimensionality reduction or for density estimation when the dimensionality of the data is greater than the number of samples. Factor analysis is a Gaussian generative model used in unsupervised learning. We aimed at creating its supervised counterpart in order to handle a regression task and then exploit it as much as possible.
After creating the supervised counterpart, we proved a significant property, namely that linear regression is equivalent to (supervised) factor analysis—with one-dimensional latent space—when no constraints are imposed on the covariance matrices.
A linear regression model can be fitted via a closed-form solution or an iterative algorithm. When the number of predictors is greater than the number of samples, there is no closed-form solution. There are other solutions to this problem, as we will see.
At this point, we knew that factor analysis was linked to linear regression and that it could be used when the number of samples is lower than the dimensionality of the data; from now on, this regime is denoted as D ≫ n or n ≪ D. As a result, since linear regression is a widely known and used model, we shifted our focus from solely exploiting the factor analysis model to highlighting novel linear regression versions applicable in the D ≫ n regime:
  • linear regression when D > > n ,
  • semisupervised linear regression when D > > n ,
  • (semisupervised) linear regression when D > > n with missing data.
The structure of this paper is as follows. In Section 2, we include some theoretical background to enhance the readability of this paper. Section 3 contains related work. In Section 4, we include the models we proposed, starting from factor analysis. Section 5 contains experiments using the proposed models. In Section 6, we conclude the paper and show future directions.
We include the full algorithms in the appendices in a pseudocode format, two of them being instances of the expectation–maximization schema.

2. Theoretical Background

We started our analysis from two core models in machine learning: linear regression and factor analysis. We will discuss the aspects of those two models that are relevant to understanding the next sections of this paper.

2.1. Linear Regression

Proposition 1.
Let $\{(x^{(i)}, y^{(i)}) \mid x^{(i)} \in \mathbb{R}^{D\times 1}, y^{(i)} \in \mathbb{R}, i \in \{1, \dots, n\}\}$ be a data set where D is the dimensionality of the input data, $\{x^{(i)} \mid i \in \{1,\dots,n\}\}$ is the input, and $\{y^{(i)} \mid i \in \{1,\dots,n\}\}$ is the output. The linear regression model is as follows:
Y^{(i)} = w x^{(i)} + b + \epsilon^{(i)}
where $Y^{(i)}$ is a random variable corresponding to $y^{(i)}$, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, $\sigma \in \mathbb{R}_+^*$, $w \in \mathbb{R}^{1\times D}$, $b \in \mathbb{R}$. Then, the parameters w and b can be estimated via maximum likelihood as follows:
\hat{w}_{\mathrm{LR}} = \left(n\bar{y}\bar{x}^\top - Y X^\top\right)\left(n\bar{x}\bar{x}^\top - X X^\top\right)^{-1}    (1)
\hat{b}_{\mathrm{LR}} = \bar{y} - \hat{w}_{\mathrm{LR}}\bar{x},    (2)
or, equivalently, as follows:
\begin{bmatrix}\hat{w}_{\mathrm{LR}}^\top \\ \hat{b}_{\mathrm{LR}}\end{bmatrix} = \left(\tilde{X}\tilde{X}^\top\right)^{-1}\tilde{X} Y^\top    (3)
where $\bar{x} = \frac{x^{(1)} + \dots + x^{(n)}}{n}$, $\bar{y} = \frac{y^{(1)} + \dots + y^{(n)}}{n}$,
$X = [x^{(1)} \cdots x^{(n)}] \in \mathbb{R}^{D\times n}$, $\tilde{X} = [\tilde{x}^{(1)} \cdots \tilde{x}^{(n)}] \in \mathbb{R}^{(D+1)\times n}$,
$\tilde{x}^{(i)} = \begin{bmatrix} x^{(i)} \\ 1 \end{bmatrix} = [x_1^{(i)} \cdots x_D^{(i)} \; 1]^\top \in \mathbb{R}^{D+1}$, and $Y = [y^{(1)} \cdots y^{(n)}] \in \mathbb{R}^{1\times n}$.
A potential problem with Equation (3) arises when $\tilde{X}\tilde{X}^\top$ is not invertible. Such a case arises when $D + 1 > n$, i.e., there are more predictors than samples. Two standard solutions to this problem are the following (a small numerical sketch of both follows this list):
  • Let $A \in \mathbb{R}^{a\times b}$ be a matrix. Then, the Moore–Penrose inverse of A can be defined as $A^+ = \lim_{\alpha\to 0^+}(A^\top A + \alpha I)^{-1}A^\top$, which, algorithmically, is computed via the singular value decomposition of A ([6] Section 2.9). One may notice that the matrix $(\tilde{X}\tilde{X}^\top)^{-1}\tilde{X}$ from Equation (3) is just $(\tilde{X}^\top)^+$ when $\tilde{X}\tilde{X}^\top$ is invertible. When it is not, the solution is to replace the matrix $(\tilde{X}\tilde{X}^\top)^{-1}\tilde{X}$ from Equation (3) with $(\tilde{X}^\top)^+$.
  • L2 regularization, which results in ridge regression ([2] p. 225). The matrix $(\tilde{X}\tilde{X}^\top)^{-1}\tilde{X}$ from Equation (3) is replaced with $\left(\tilde{X}\tilde{X}^\top + \mathrm{diag}(\alpha, \dots, \alpha, 0)\right)^{-1}\tilde{X}$, where the last diagonal entry (the one corresponding to the bias) is 0 and $\alpha > 0$; the bigger the α, the more regularization we add to the model, i.e., the further we move away from overfitting. From this point of view, the first solution, using the Moore–Penrose inverse, can be interpreted as achieving the asymptotically lowest L2 regularization.
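Below is a minimal numpy sketch (ours) of these two fixes on synthetic data with more predictors than samples; the variable names and toy dimensions are our own choices, and the column-per-sample layout follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)
D, n = 50, 20                                   # more predictors than samples
X = rng.normal(size=(D, n))                     # columns are samples, as in the text
Y = rng.normal(size=(1, n))
X_tilde = np.vstack([X, np.ones((1, n))])       # append the constant row for the bias

# X_tilde @ X_tilde.T is (D+1) x (D+1) but has rank at most n, hence it is singular.
print(np.linalg.matrix_rank(X_tilde @ X_tilde.T))      # 20 here

# 1) Moore-Penrose inverse of X_tilde^T, i.e., the minimum-norm least-squares solution.
wb_pinv = np.linalg.pinv(X_tilde.T) @ Y.T              # shape (D+1, 1): [w^T; b]

# 2) Ridge regression: add alpha on the diagonal, except for the bias entry.
alpha = 0.1
penalty = alpha * np.eye(D + 1)
penalty[-1, -1] = 0.0                                  # do not regularize the bias
wb_ridge = np.linalg.solve(X_tilde @ X_tilde.T + penalty, X_tilde @ Y.T)

print(np.linalg.norm(wb_pinv), np.linalg.norm(wb_ridge))
```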

2.2. Factor Analysis

The formulas stated in the conclusion of the following Proposition were proved in [5] and are relevant for the factor analysis algorithm—although the matrix Ψ is considered as being diagonal there, the formulas stay the same even if Ψ is not diagonal.
Proposition 2.
Let us consider the following factor analysis model:
$z \sim \mathcal{N}(0, I)$, the latent variable, $z \in \mathbb{R}^{d\times 1}$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times d}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix. Then:
\begin{bmatrix} x \\ z \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\mu \\ 0\end{bmatrix}, \begin{bmatrix}\Lambda\Lambda^\top + \Psi & \Lambda \\ \Lambda^\top & I\end{bmatrix}\right)
x \sim \mathcal{N}(\mu, \Lambda\Lambda^\top + \Psi)
z \mid x \sim \mathcal{N}\left(\Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1}(x - \mu),\; I - \Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1}\Lambda\right).
A factor analysis model can be fitted via an EM algorithm using the maximum likelihood estimation (MLE) principle, and factor analysis can be used as a density estimation technique when the dimensionality of the data is greater than the number of samples [5].
For the algorithms we developed, we will let $z \sim \mathcal{N}(\mu_z, \Sigma_z)$ and not $z \sim \mathcal{N}(0, I)$, because z becomes observed data, we want to learn its parameters, and we do not want to impose something unrealistic like $z \sim \mathcal{N}(0, I)$. This generalization leads to the following result.
Proposition 3.
Let us consider the following linear Gaussian system:
$z \sim \mathcal{N}(\mu_z, \Sigma_z)$, the latent variable, $z \in \mathbb{R}^{d\times 1}$, $\mu_z \in \mathbb{R}^{d\times 1}$, $\Sigma_z \in \mathbb{R}^{d\times d}$ a symmetric and positive definite matrix,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times d}$, and $\Psi \in \mathbb{R}^{D\times D}$ a symmetric and positive definite matrix. (If Ψ is a diagonal matrix, then we say that we are in the FA case. If Ψ is a scalar matrix, $\Psi = \eta I$, $\eta \in \mathbb{R}_+^*$, then we are in the probabilistic principal component analysis (PPCA) case [7]. If Ψ is any symmetric and positive definite matrix, then we say we are in the unconstrained factor analysis (UncFA) case. While the first two settings are standard (FA and PPCA), the third one, UncFA, is proposed by us.) Then:
\begin{bmatrix} x \\ z \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix}\mu + \Lambda\mu_z \\ \mu_z\end{bmatrix}, \begin{bmatrix}\Lambda\Sigma_z\Lambda^\top + \Psi & \Lambda\Sigma_z \\ (\Lambda\Sigma_z)^\top & \Sigma_z\end{bmatrix}\right)
x \sim \mathcal{N}(\mu + \Lambda\mu_z,\; \Lambda\Sigma_z\Lambda^\top + \Psi)
z \mid x \sim \mathcal{N}\left(\mu_z + \Sigma_z\Lambda^\top(\Lambda\Sigma_z\Lambda^\top + \Psi)^{-1}(x - \mu - \Lambda\mu_z),\; \Sigma_z - \Sigma_z\Lambda^\top(\Lambda\Sigma_z\Lambda^\top + \Psi)^{-1}\Lambda\Sigma_z\right).    (4)
The proof can be found in ([8] pp. 9–11).
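As an illustration, here is a minimal numpy sketch (ours, with arbitrarily chosen toy parameters) that evaluates the posterior parameters of z given x from Proposition 3 and checks the marginal covariance of x by Monte Carlo sampling.

```python
import numpy as np

rng = np.random.default_rng(2)
D, d = 5, 2
mu_z, Sigma_z = rng.normal(size=d), np.diag([1.0, 2.0])
mu, Lam = rng.normal(size=D), rng.normal(size=(D, d))
A = rng.normal(size=(D, D))
Psi = A @ A.T + np.eye(D)                              # symmetric positive definite

# Posterior parameters of z | x for one observation x_star (last line of Proposition 3).
x_star = rng.normal(size=D)
C = Lam @ Sigma_z @ Lam.T + Psi                        # marginal covariance of x
K = Sigma_z @ Lam.T @ np.linalg.inv(C)
post_mean = mu_z + K @ (x_star - mu - Lam @ mu_z)
post_cov = Sigma_z - K @ Lam @ Sigma_z

# Sanity check of the marginal of x: sample from the generative process.
m = 200_000
z = rng.multivariate_normal(mu_z, Sigma_z, size=m)
x = mu + z @ Lam.T + rng.multivariate_normal(np.zeros(D), Psi, size=m)
print(np.max(np.abs(np.cov(x.T) - C)))                 # small for large m
```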

3. Related Work

Although factor analysis is widely used for dimensionality reduction, its supervised counterpart is, to the best of our knowledge, not present in the literature. What is present is a model called supervised principal component analysis or latent factor regression ([2] p. 405). The idea there is that, for a regression task, not only the input but also the output is generated by a latent variable, and factor analysis is applied to replace the input with a low-dimensional embedding. The key point is that the purpose of supervised principal component analysis is still dimensionality reduction and not regression at all, whereas regression is where we want to push the factor analysis model.
There is also the notion of a linear Gaussian system ([2] p. 119). This was already presented in Section 2; it generalizes the factor analysis generative process by considering that z has a learnable mean and covariance matrix, but it is not developed further, e.g., toward a supervised regression model.
Factor analysis is strongly related to principal component analysis (PCA) [9] because by imposing a certain constraint in factor analysis, we get a model called probabilistic principal component analysis [7] that can be fitted using a closed-form solution, which, in an asymptotic case, is also the solution for PCA. Probabilistic PCA can be kernelized using a model called the Gaussian process latent variable model (GPLVM) [10]. This model also has supervised counterparts [11], but, as in the case of FA, the supervised extension targets dimensionality reduction, and the idea is similar to the one in supervised PCA.

4. Proposed Models

In this section, we propose three models starting from the FA model, each in a new subsection: simple-supervised factor analysis (S2.FA), simple-semisupervised factor analysis (S3.FA), and missing simple-semisupervised factor analysis (MS3.FA). While S2.FA is applicable in the supervised case, regression, S3.FA is meant to be used in a semisupervised context. MS3.FA handles missing input data in a (semi)supervised scenario.
One important remark that will not be restated in this paper is that all the models (FA, S2.FA, S3.FA, MS3.FA) are fitted by maximizing the likelihood (the MLE principle) of the observed data. Another important observation is regarding the names of our proposed algorithms: the algorithms are called “simple-”—S2 = simple-supervised; S3 = simple-semisupervised; MS3 = missing simple-semisupervised—not only because they constitute a simple adaptation of the factor analysis model, but mostly because we created an adaptation of the (simple-)supervised FA model called (simple-)supervised PPCA, and we did not want this model to be confused with the already existing supervised PCA model in the literature. Simple-supervised probabilistic principal component analysis (S2.PPCA) is not discussed in this paper, but it is implemented and usable in the R package that we developed (https://github.com/aciobanusebi/s2fa; accessed on 31 July 2021) along with other undiscussed but related models: Simple-semisupervised unconstrained factor analysis (S3.UncFA), Simple-semisupervised probabilistic principal component analysis (S3.PPCA), Missing simple-semisupervised unconstrained factor analysis (MS3.UncFA), and Missing simple-semisupervised probabilistic principal component analysis (MS3.PPCA).

4.1. The S2.FA Model

The core of this subsection regards the S2.FA model, but in order to link it with LR, we need to also introduce a similar model to S2.FA: S2.UncFA. This link will make S2.FA a good candidate for replacing LR when D > > n . These three ideas—S2.FA and S2.UncFA, S2.UncFA-LR link, and replacing LR via S2.FA—will be expanded below.

4.1.1. The S2.FA Model. S2.FA and S2.UncFA

The first model that we propose is called simple-supervised factor analysis. It is a linear Gaussian system with slight changes:
$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$, an observed variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix.
If we do not impose the constraint of Ψ being diagonal, we arrive at simple-supervised unconstrained factor analysis (S2.UncFA):
$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$, an observed variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a symmetric and positive definite matrix.
In contrast with the factor analysis model, which is fitted via an EM algorithm, S2.UncFA and S2.FA are fitted via analytic formulas (see Propositions 4 and 5).
Proposition 4.
Let $\{(x^{(i)}, z^{(i)}) \mid x^{(i)} \in \mathbb{R}^{D\times 1}, z^{(i)} \in \mathbb{R}^{d\times 1}, i \in \{1, \dots, n\}\}$ be a data set where D is the dimensionality of the input data and d is the dimensionality of the output data (although in the context of this paper $d = 1$, we decided to expose more general results, $d \ge 1$, in order for the reader to gain more insight; this is why we write $\Sigma_z$ and not just $\sigma_z^2$, and treat $z^{(i)}$ and $\bar{z}$ as vectors rather than scalars), $\{x^{(i)} \mid i \in \{1,\dots,n\}\}$ is the input, and $\{z^{(i)} \mid i \in \{1,\dots,n\}\}$ is the output. We suppose that the data was generated as follows:
$z^{(i)} \sim \mathcal{N}(\mu_z, \Sigma_z)$, $z^{(i)} \in \mathbb{R}^{d\times 1}$, $\mu_z \in \mathbb{R}^{d\times 1}$, $\Sigma_z \in \mathbb{R}^{d\times d}$ a symmetric and positive definite matrix, and
$x^{(i)} \mid z^{(i)} \sim \mathcal{N}(\mu + \Lambda z^{(i)}, \Psi)$, $x^{(i)} \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times d}$, while $\Psi \in \mathbb{R}^{D\times D}$ is a symmetric and positive definite matrix.
Then, the parameters in the S2.UncFA algorithm (training phase) can be estimated via maximum likelihood using the following closed-form formulas:
\hat{\mu}_z = \frac{1}{n}\sum_{i=1}^n z^{(i)}    (5)
\hat{\Sigma}_z = \frac{1}{n}\sum_{i=1}^n (z^{(i)} - \hat{\mu}_z)(z^{(i)} - \hat{\mu}_z)^\top    (6)
\hat{\Lambda} = \left(n\bar{x}\bar{z}^\top - \sum_{i=1}^n x^{(i)} z^{(i)\top}\right)\left(n\bar{z}\bar{z}^\top - \sum_{i=1}^n z^{(i)} z^{(i)\top}\right)^{-1}    (7)
\hat{\mu} = \bar{x} - \hat{\Lambda}\bar{z}    (8)
\hat{\Psi} = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})(x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})^\top    (9)
where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)}$ and $\bar{z} = \hat{\mu}_z$. For the testing/prediction phase, one uses the formula for $z \mid x$ from (4).
For more elaborate notations and the proof, see [8] pp. 13–17.
Proposition 5.
[We will denote the parameter $\hat{\Psi}$ in (9) as $\hat{\Psi}_{\mathrm{S2.UncFA}}$.]
For the S2.FA algorithm, (9) is replaced by
\hat{\Psi} = \mathrm{diag}\left(\hat{\Psi}_{\mathrm{S2.UncFA}}\right)    (10)
where "diag" takes the diagonal of a matrix and returns the corresponding diagonal matrix.
The proof of Equation (10) is relatively simple, and we skip it for brevity. It can be found in [8] pp. 21–23.
For the step-by-step S2.FA algorithm and also for the matrix form of the algorithm, see Appendix B.
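The closed-form fit is easy to implement; the following numpy sketch (ours, with our own function names and synthetic data) applies Equations (5)–(9) and the diagonal restriction (10), and predicts with the z | x formula from (4). It remains usable when D > n as long as the estimated diagonal Ψ̂ has positive entries, since then the predictive matrix stays invertible.

```python
import numpy as np

def s2fa_fit(X, Z, diagonal_psi=True):
    """X: (D, n) inputs, Z: (d, n) outputs; returns (mu_z, Sigma_z, Lambda, mu, Psi)."""
    n = X.shape[1]
    x_bar = X.mean(axis=1, keepdims=True)          # (D, 1)
    mu_z = Z.mean(axis=1, keepdims=True)           # (d, 1) -- Eq. (5)
    z_bar = mu_z
    Sigma_z = (Z - z_bar) @ (Z - z_bar).T / n      # Eq. (6)
    Lam = (n * x_bar @ z_bar.T - X @ Z.T) @ np.linalg.inv(n * z_bar @ z_bar.T - Z @ Z.T)  # Eq. (7)
    mu = x_bar - Lam @ z_bar                       # Eq. (8)
    R = X - mu - Lam @ Z                           # residuals x(i) - mu - Lam z(i)
    Psi = R @ R.T / n                              # Eq. (9), the S2.UncFA estimate
    if diagonal_psi:                               # Eq. (10): keep only the diagonal
        Psi = np.diag(np.diag(Psi))
    return mu_z, Sigma_z, Lam, mu, Psi

def s2fa_predict(x_star, params):
    mu_z, Sigma_z, Lam, mu, Psi = params
    C = Lam @ Sigma_z @ Lam.T + Psi
    return mu_z + Sigma_z @ Lam.T @ np.linalg.solve(C, x_star - mu - Lam @ mu_z)

# Toy usage with D > n, where the diagonal Psi keeps C well conditioned.
rng = np.random.default_rng(3)
D, d, n = 40, 1, 15
X, Z = rng.normal(size=(D, n)), rng.normal(size=(d, n))
params = s2fa_fit(X, Z)
print(s2fa_predict(X[:, :1], params))
```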

4.1.2. The S2.FA Model. The Link between LR and S2.UncFA

Linear regression and S2.UncFA have the same prediction function after fitting, as we claim and prove below.
Proposition 6.
Let $\{(x^{(i)}, z^{(i)}) \mid x^{(i)} \in \mathbb{R}^{D\times 1}, z^{(i)} \in \mathbb{R}^{d}, i \in \{1, \dots, n\}\}$ be a data set where D is the dimensionality of the input data and d is the dimensionality of the output data (the same observation as earlier: in the context of this paper, only the $d = 1$ case is relevant), $\{x^{(i)} \mid i \in \{1,\dots,n\}\}$ is the input, and $\{z^{(i)} \mid i \in \{1,\dots,n\}\}$ is the output.
One can fit an S2.UncFA model and obtain, via the relationships (5)–(9), $\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Psi}, \hat{\Lambda}, \hat{\mu}$. Remember that at the test phase (see (4)), the predicted value is
\mathrm{predicted}_{\mathrm{S2.UncFA}}(x^*) = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z), \quad x^* \in \mathbb{R}^{D\times 1}.
One can fit a linear regression model and obtain $\hat{w}, \hat{b}$ from (1) and (2). Remember that at the test phase, the predicted value is
\mathrm{predicted}_{\mathrm{LR}}(x^*) = \hat{w} x^* + \hat{b}, \quad x^* \in \mathbb{R}^{D\times 1}.
Then:
\mathrm{predicted}_{\mathrm{S2.UncFA}}(x^*) = \mathrm{predicted}_{\mathrm{LR}}(x^*), \quad \forall x^* \in \mathbb{R}^{D\times 1}.
Proof. 
Let $X = [x^{(1)} \cdots x^{(n)}] \in \mathbb{R}^{D\times n}$ and $Z = [z^{(1)} \cdots z^{(n)}] \in \mathbb{R}^{d\times n}$.
We begin by computing $\hat{\Psi}$. By (9),
\hat{\Psi} = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})(x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})^\top = \frac{1}{n}\sum_{i=1}^n \left( x^{(i)} x^{(i)\top} - x^{(i)}\hat{\mu}^\top - x^{(i)} z^{(i)\top}\hat{\Lambda}^\top - \hat{\mu} x^{(i)\top} + \hat{\mu}\hat{\mu}^\top + \hat{\mu} z^{(i)\top}\hat{\Lambda}^\top - \hat{\Lambda} z^{(i)} x^{(i)\top} + \hat{\Lambda} z^{(i)}\hat{\mu}^\top + \hat{\Lambda} z^{(i)} z^{(i)\top}\hat{\Lambda}^\top \right)
= \frac{1}{n}XX^\top - \bar{x}\hat{\mu}^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \hat{\mu}\bar{x}^\top + \hat{\mu}\hat{\mu}^\top + \hat{\mu}\bar{z}^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \hat{\Lambda}\bar{z}\hat{\mu}^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top
= \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \bar{x}\hat{\mu}^\top - \hat{\mu}\bar{x}^\top + \hat{\mu}\hat{\mu}^\top + \hat{\Lambda}\bar{z}\hat{\mu}^\top + \hat{\mu}\bar{z}^\top\hat{\Lambda}^\top.
We substitute $\hat{\mu}$ with $\bar{x} - \hat{\Lambda}\bar{z}$:
\bar{x}\hat{\mu}^\top = \bar{x}(\bar{x} - \hat{\Lambda}\bar{z})^\top = \bar{x}\bar{x}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top
\hat{\mu}\bar{x}^\top = (\bar{x} - \hat{\Lambda}\bar{z})\bar{x}^\top = \bar{x}\bar{x}^\top - \hat{\Lambda}\bar{z}\bar{x}^\top
\hat{\mu}\hat{\mu}^\top = (\bar{x} - \hat{\Lambda}\bar{z})(\bar{x} - \hat{\Lambda}\bar{z})^\top = \bar{x}\bar{x}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top
\hat{\Lambda}\bar{z}\hat{\mu}^\top = \hat{\Lambda}\bar{z}(\bar{x} - \hat{\Lambda}\bar{z})^\top = \hat{\Lambda}\bar{z}\bar{x}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top
\hat{\mu}\bar{z}^\top\hat{\Lambda}^\top = (\bar{x} - \hat{\Lambda}\bar{z})\bar{z}^\top\hat{\Lambda}^\top = \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top
We return to computing $\hat{\Psi}$:
\hat{\Psi} = \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \bar{x}\bar{x}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \bar{x}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top + \bar{x}\bar{x}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top
= \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top + \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \bar{x}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top.    (11)
We continue by computing $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$. By (6),
\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top = \hat{\Lambda}\left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)\hat{\Lambda}^\top = \frac{1}{n}\hat{\Lambda}ZZ^\top\hat{\Lambda}^\top - \hat{\Lambda}\bar{z}\bar{z}^\top\hat{\Lambda}^\top.    (12)
We observe that the above term (see (12)) is also included in $\hat{\Psi}$ (see (11)). By (7),
\hat{\Sigma}_z\hat{\Lambda}^\top = \left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)\left(n\bar{z}\bar{z}^\top - ZZ^\top\right)^{-1}\left(n\bar{z}\bar{x}^\top - ZX^\top\right) = \left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)\left(-\frac{1}{n}\right)\left(\frac{1}{n}ZZ^\top - \bar{z}\bar{z}^\top\right)^{-1}\left(n\bar{z}\bar{x}^\top - ZX^\top\right) = \frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top.    (13)
Since $(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top)^\top = \hat{\Lambda}\hat{\Sigma}_z^\top\hat{\Lambda}^\top = \hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$, the matrix $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$ is symmetric.
We have that:
\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top = \hat{\Lambda}(\hat{\Sigma}_z\hat{\Lambda}^\top) \overset{(13)}{=} \hat{\Lambda}\left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right) = \frac{1}{n}\hat{\Lambda}ZX^\top - \hat{\Lambda}\bar{z}\bar{x}^\top.    (14)
We also have that:
\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top = (\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top)^\top \overset{(14)}{=} \left(\frac{1}{n}\hat{\Lambda}ZX^\top - \hat{\Lambda}\bar{z}\bar{x}^\top\right)^\top = \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top.    (15)
We continue by computing $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi}$.
As we have already noticed above, $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$ is also included in $\hat{\Psi}$ (see (11) and (12)). In the computation of $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi}$, we replace $\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top$ once with (14) and then with (15). By (11), (12), (14), and (15), we get:
\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi} = \frac{1}{n}XX^\top - \frac{1}{n}XZ^\top\hat{\Lambda}^\top - \frac{1}{n}\hat{\Lambda}ZX^\top - \bar{x}\bar{x}^\top + \hat{\Lambda}\bar{z}\bar{x}^\top + \bar{x}\bar{z}^\top\hat{\Lambda}^\top + \left(\frac{1}{n}\hat{\Lambda}ZX^\top - \hat{\Lambda}\bar{z}\bar{x}^\top\right) + \left(\frac{1}{n}XZ^\top\hat{\Lambda}^\top - \bar{x}\bar{z}^\top\hat{\Lambda}^\top\right) = \frac{1}{n}XX^\top - \bar{x}\bar{x}^\top.    (16)
Observation: the result is exactly the maximum likelihood estimate of the covariance matrix Σ of the input data set if $x \sim \mathcal{N}(\mu, \Sigma)$: $\Sigma_{\mathrm{MLE}} = \frac{1}{n}XX^\top - \bar{x}\bar{x}^\top$. This is natural because, according to (4), we have $x \sim \mathcal{N}(\mu + \Lambda\mu_z, \Lambda\Sigma_z\Lambda^\top + \Psi)$, and there are enough free parameters in $\Lambda\Sigma_z\Lambda^\top + \Psi$, namely $dD + \frac{d^2 - d}{2} + d + \frac{D^2 - D}{2} + D$ free parameters ($dD$ in Λ, $\frac{d^2 - d}{2} + d$ in $\Sigma_z$, and $\frac{D^2 - D}{2} + D$ in Ψ), for it to become $\Sigma_{\mathrm{MLE}} = \frac{1}{n}XX^\top - \bar{x}\bar{x}^\top$, since Σ has $\frac{D^2 - D}{2} + D$ free parameters.
We return to the initial computation:
\mathrm{predicted}_{\mathrm{S2.UncFA}}(x^*)
\overset{(4)}{=} \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
\overset{(5),(13),(16)}{=} \bar{z} + \left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right)\left(\frac{1}{n}XX^\top - \bar{x}\bar{x}^\top\right)^{-1}(x^* - \bar{x} + \hat{\Lambda}\bar{z} - \hat{\Lambda}\bar{z})
= \left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right)\left(\frac{1}{n}XX^\top - \bar{x}\bar{x}^\top\right)^{-1}x^* + \bar{z} - \left(\frac{1}{n}ZX^\top - \bar{z}\bar{x}^\top\right)\left(\frac{1}{n}XX^\top - \bar{x}\bar{x}^\top\right)^{-1}\bar{x}
= \left(n\bar{z}\bar{x}^\top - ZX^\top\right)\left(n\bar{x}\bar{x}^\top - XX^\top\right)^{-1}x^* + \bar{z} - \left(n\bar{z}\bar{x}^\top - ZX^\top\right)\left(n\bar{x}\bar{x}^\top - XX^\top\right)^{-1}\bar{x}
\overset{(1)}{=} \hat{w} x^* + \bar{z} - \hat{w}\bar{x}
\overset{(2)}{=} \hat{w} x^* + \hat{b}
= \mathrm{predicted}_{\mathrm{LR}}(x^*), \quad \forall x^* \in \mathbb{R}^{D\times 1}. \quad \square
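A quick numerical check of Proposition 6 (ours, on synthetic data): with n > D, fitting S2.UncFA via (5)–(9) and linear regression via the normal equations gives the same prediction up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(4)
D, n = 5, 200
X = rng.normal(size=(D, n))
Z = rng.normal(size=(1, D)) @ X + 0.1 * rng.normal(size=(1, n))   # 1-D output

# S2.UncFA fit, Eqs. (5)-(9), with the full (unconstrained) Psi.
x_bar, z_bar = X.mean(axis=1, keepdims=True), Z.mean(axis=1, keepdims=True)
Sigma_z = (Z - z_bar) @ (Z - z_bar).T / n
Lam = (n * x_bar @ z_bar.T - X @ Z.T) @ np.linalg.inv(n * z_bar @ z_bar.T - Z @ Z.T)
mu = x_bar - Lam @ z_bar
R = X - mu - Lam @ Z
Psi = R @ R.T / n
C = Lam @ Sigma_z @ Lam.T + Psi          # equals (1/n) X X^T - x_bar x_bar^T, Eq. (16)

x_star = rng.normal(size=(D, 1))
pred_fa = z_bar + Sigma_z @ Lam.T @ np.linalg.solve(C, x_star - mu - Lam @ z_bar)

# Linear regression via the normal equations on the augmented inputs.
X_tilde = np.vstack([X, np.ones((1, n))])
wb = np.linalg.solve(X_tilde @ X_tilde.T, X_tilde @ Z.T)          # (D+1, 1)
pred_lr = wb[:-1].T @ x_star + wb[-1]

print(np.allclose(pred_fa, pred_lr))     # True
```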

4.1.3. The S2.FA Model. A New Approach for LR When D > > n

Since FA can be used to estimate the density of a data set when D > > n , and S2.UncFA is equivalent to LR, we consider S2.FA as a new approach to extend LR when D > > n besides the two solutions mentioned in Section 2.

4.2. The S3.FA Model

Factor analysis is a classic generative unsupervised model. Its supervised counterpart is S2.FA as shown in the previous subsection. Those two can be merged into a semisupervised model that we propose here, named simple-semisupervised factor analysis:
$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$, either an observed or a latent variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix.
Good hints for combining such supervised–unsupervised counterparts (for instance, Gaussian naive Bayes and the GMM) into a semisupervised model can be found in [12]. We applied those hints to our supervised–unsupervised counterparts, S2.FA and FA, and created an EM algorithm to fit an S3.FA model. For the step-by-step algorithm and the matrix form of the algorithm, see Appendix C. For more elaborate notations and the proof, see [8] pp. 26–34.

4.3. The MS3.FA Model

The algorithm that fits an S3.FA model can also be adapted for the case when not all the components of x are known. We call the resulting model missing simple-semisupervised factor analysis:
$z \sim \mathcal{N}(\mu_z, \sigma_z^2)$, either an observed or a latent variable, $z \in \mathbb{R}$, $\mu_z \in \mathbb{R}$, $\sigma_z \in \mathbb{R}_+^*$,
$x \mid z \sim \mathcal{N}(\mu + \Lambda z, \Psi)$, $x \in \mathbb{R}^{D\times 1}$, $\mu \in \mathbb{R}^{D\times 1}$, $\Lambda \in \mathbb{R}^{D\times 1}$, and $\Psi \in \mathbb{R}^{D\times D}$ a diagonal matrix; each component $x_1, \dots, x_D$ of x is either observed or latent.
The resulting algorithm that fits an MS3.FA model is an EM algorithm. For the step-by-step algorithm, see Appendix D. For more elaborate notations and the proof, see [8] pp. 34–37.

5. Experiments

In this section, we include the experiments we carried out on data with D > > n using the S2.FA, S3.FA, and MS3.FA models, comparing them with other methods. In all the experiments, we computed errors between the real values and the predicted values; the metric we used is mean squared error (MSE):
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{real}_i - \mathrm{predicted}_i\right)^2,
where N is the number of unknown elements whose real and predicted values are $\mathrm{real}_i$ and $\mathrm{predicted}_i$, respectively; an unknown element represents an output number for S2.FA and S3.FA or an input/output number for MS3.FA. We ran each experiment five times and computed a 95% confidence interval using the t-distribution. Furthermore, in each experiment we used the same three data sets:
  • the gas sensor array under flow modulation data set [13],
  • atp1d [14],
  • m5spec.
All three of these data sets have multiple outputs, but for each data set, we selected only the first output column that appears in the text data file and used it as the output: ace_conc for the gas sensor array under flow modulation data set, LBL_ALLminpA_fut_001 for atp1d, and the first column in the propvals file for m5spec.
We preprocessed each data set simply by dropping the constant columns. Only the second data set has constant columns: from 432 columns, we obtain 370.
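For concreteness, this preprocessing amounts to something like the following numpy sketch (ours; the toy matrix is only an illustration).

```python
import numpy as np

def drop_constant_columns(A):
    """A: (n_samples, n_columns); keeps the columns whose values are not all equal."""
    keep = ~np.all(A == A[0, :], axis=0)
    return A[:, keep]

A = np.array([[1.0, 5.0, 2.0], [3.0, 5.0, 2.0], [4.0, 5.0, 7.0]])
print(drop_constant_columns(A))   # the constant second column is removed
```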

5.1. The S2.FA Model: Experiment

The experiment concerning S2.FA covers the comparison of the three solutions presented so far for LR when D > > n :
  • Moore–Penrose inverse
  • ridge regression—L2 regularization
  • S2.FA.
Each data set was split into a training part (80%) and a testing part (20%). If the model had hyperparameters (among the compared methods, only ridge regression does), the training part was further split into a new training part (60% of the whole data set) and a validation part (20% of the whole data set) in order to set the hyperparameters; for ridge regression, we used a simple technique: pick the $\alpha \in \{10^{-2}, 10^{-1.9}, 10^{-1.8}, \dots, 10^{1.9}, 10^{2}\}$ that attains the minimum validation error. After setting the hyperparameters, we trained a new model on the initial training part (80% of the whole data set) and obtained the final model. All of the MSE errors are reported on the testing part and shown in Table 1.
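The hyperparameter selection for ridge regression can be sketched as follows (a minimal numpy version of the procedure described above, with our own helper names and synthetic data standing in for the real data sets).

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """X: (D, n), Y: (1, n); returns the stacked (D+1, 1) vector [w^T; b], bias unpenalized."""
    D, n = X.shape
    X_tilde = np.vstack([X, np.ones((1, n))])
    penalty = alpha * np.eye(D + 1)
    penalty[-1, -1] = 0.0
    return np.linalg.solve(X_tilde @ X_tilde.T + penalty, X_tilde @ Y.T)

def predict(wb, X):
    return wb[:-1].T @ X + wb[-1]

def mse(Y_true, Y_pred):
    return float(np.mean((Y_true - Y_pred) ** 2))

rng = np.random.default_rng(5)
D, n = 100, 50                                    # D >> n regime
X, Y = rng.normal(size=(D, n)), rng.normal(size=(1, n))

idx = rng.permutation(n)
tr, va, te = idx[:int(0.6 * n)], idx[int(0.6 * n):int(0.8 * n)], idx[int(0.8 * n):]

grid = 10.0 ** np.arange(-2.0, 2.0 + 1e-9, 0.1)   # 10^-2, 10^-1.9, ..., 10^2
best_alpha = min(grid, key=lambda a: mse(Y[:, va],
                                         predict(ridge_fit(X[:, tr], Y[:, tr], a), X[:, va])))
train_full = np.concatenate([tr, va])             # refit on the initial 80% training part
wb = ridge_fit(X[:, train_full], Y[:, train_full], best_alpha)
print(best_alpha, mse(Y[:, te], predict(wb, X[:, te])))
```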
As one may notice, the best method is different for each data set, so our general advice is to use all the methods on a given data set and pick the best one.

5.2. The S3.FA Model: Experiment

The experiment concerning S3.FA includes an analysis of algorithms for semisupervised regression when D > > n :
  • Moore–Penrose inverse—a supervised method
  • S2.FA—a supervised method
  • S3.FA—a semisupervised method
  • label propagation [15], a semisupervised method: we used the function sslLabelProp in the SSL R package [16] with the parameter alpha set to 1.
Each data set was split into a training part (80%) and a testing part (20%). We retained 5%, 10%, 15%, 20%, 25%, …, 100% of the output labels from the training part. For the supervised methods, we used only the labeled data in the training set, and for the semisupervised methods, we used the full training set when fitting the model. We initialized the S3.FA method with the fitted parameters returned by the S2.FA algorithm. All of the MSE errors are reported on the testing part and shown in Figure 1, Figure 2 and Figure 3 (presentation inspired by [17]) and in Table 2. The figures contain all the results from using 5% to 100% of the output labels, but we include less information in the table for brevity.
We notice that, on the selected data sets, S3.FA returns results even poorer than those of S2.FA, which uses only the labeled data. As in the previous experiment, the best method is data-dependent. The models that show a greater amount of variability compared to the others are S3.FA (on two data sets) and Moore–Penrose (on one data set). Moreover, as expected, the errors tend to decrease as the percentage of labeled training data points increases; the plots do not make this obvious, but the decrease can be seen numerically in Table 2.

5.3. The MS3.FA Model: Experiment

The experiment concerning MS3.FA includes a comparison of two different types of algorithms for data imputation when D > > n :
  • Mean imputation: for a given attribute (input column), compute its mean ignoring the missing values, then replace the missing data on that attribute with this computed mean
  • MS3.FA.
We also tried two other R packages, mice [18] and Amelia [19], but they could not be applied successfully to our data sets, perhaps because these data sets have a peculiarity: D ≫ n.
For each data set, we removed 10%, 20%, 30%, 40%, 50%, and 60% of the input cells and imputed them via the abovementioned algorithms (we could also have added missing data in the output, but we wanted to focus on the missing input data scenario and not on the semisupervised case). The results are presented in Figure 4, Figure 5 and Figure 6 and in Table 3.
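A minimal numpy sketch (ours, on synthetic data) of this setup for the mean-imputation baseline: hide a fraction p of the input cells at random, impute each missing cell with its attribute mean, and score the imputation by MSE.

```python
import numpy as np

rng = np.random.default_rng(6)
D, n, p = 30, 20, 0.3
X = rng.normal(size=(D, n))                   # rows are attributes, columns are samples

mask = rng.random(size=(D, n)) < p            # True marks a removed (missing) cell
X_missing = X.copy()
X_missing[mask] = np.nan

# Mean imputation: per attribute (row), using only the observed cells.
row_means = np.nanmean(X_missing, axis=1, keepdims=True)
X_imputed = np.where(np.isnan(X_missing), row_means, X_missing)

mse = np.mean((X[mask] - X_imputed[mask]) ** 2)
print(mse)
```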
From these results, we discover that MS3.FA is better than mean imputation on two data sets, and, as expected, the error increases as the percentage of missing data increases.

6. Conclusions and Future Work

The initial purpose of this paper was to extend an already existing model: factor analysis. We developed its supervised counterpart (S2.FA) and noticed that the unconstrained version (S2.UncFA) is equivalent to linear regression. Because FA is applied in density estimation when the dimensionality of the data is greater than the number of samples, and because of the already mentioned equivalence, the purpose of the paper became to analyze this new method of applying LR when D > > n , i.e., via S2.FA. Since FA and S2.FA are generative models and are unsupervised–supervised counterparts, we combined both into a new model S3.FA as an extension of LR to semisupervised learning when D > > n . The final extension regards missing data; it is called MS3.FA. We developed an R package (s2fa) with these algorithms; it can be found on GitHub. The experimental parts included several comparisons in the D > > n scenario:
  • of S2.FA with other techniques extending LR to the D > > n case,
  • of S3.FA with other (semi)supervised regression methods,
  • of MS3.FA with another data imputation algorithm.
The bottom line is that we do not necessarily recommend S3.FA for semisupervised regression since our results suggest that it gives poor results, but we encourage the consideration of S2.FA for regression and MS3.FA for missing data imputation as algorithms to be compared with others on a given data set.
As for future work, we could further explore the S2.FA, S3.FA, and MS3.FA algorithms when z is a real vector, not just a real number. Moreover, we can experiment with the PPCA version of the algorithms. Questions regarding the time complexity—empirical or not—can be also addressed; we expect the fitting time to be impractical if the number of columns is large. Because there are models such as mixture of factor analyzers [20] and mixture of linear regression models ([21] Section 14.5.1), another research direction involves mixtures of S2.FAs. Another idea would be to investigate the memory resources required by the algorithms when the data set increases and also to consider scalable systems such as Spark [22] for implementation.

Author Contributions

Conceptualization, S.C. and L.C.; methodology, S.C. and L.C.; software, S.C.; validation, S.C.; formal analysis, S.C.; investigation, S.C.; resources, L.C.; data curation; writing—original draft preparation, S.C.; writing—review and editing, S.C. and L.C.; visualization, S.C. and L.C.; supervision, L.C.; project administration, S.C. and L.C.; funding acquisition. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available data sets were analyzed in this study. This data can be found here: http://archive.ics.uci.edu/ml/datasets/Gas+sensor+array+under+flow+modulation (accessed on 31 July 2021) for the gas sensor array under flow modulation data set, https://www.openml.org/d/41475 (accessed on 31 July 2021) for atp1d, http://www.eigenvector.com/data/Corn (accessed on 31 July 2021) for m5spec.

Acknowledgments

We thank Cristian Gaţu and Daniel Stamate for attentive proofreading and useful comments of previous versions of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

   The following abbreviations are used in this manuscript:
GMM: Gaussian mixture model
LR: Linear regression
EM: Expectation–maximization
KL: Kullback–Leibler
FA: Factor analysis
PCA: Principal component analysis
PPCA: Probabilistic principal component analysis
GPLVM: Gaussian process latent variable model
MLE: Maximum likelihood estimation
UncFA: Unconstrained factor analysis
S2.UncFA: Simple-supervised unconstrained factor analysis
S2.FA: Simple-supervised factor analysis
S2.PPCA: Simple-supervised probabilistic principal component analysis
S3.UncFA: Simple-semisupervised unconstrained factor analysis
S3.FA: Simple-semisupervised factor analysis
S3.PPCA: Simple-semisupervised probabilistic principal component analysis
MS3.UncFA: Missing simple-semisupervised unconstrained factor analysis
MS3.FA: Missing simple-semisupervised factor analysis
MS3.PPCA: Missing simple-semisupervised probabilistic principal component analysis
MSE: Mean squared error
ELBO: Evidence lower bound

Appendix A. On the Expectation–Maximization Algorithm

This section provides theoretical details on the EM algorithm [23]. These are relevant in order to establish a link between EM and the information theory field.
A latent variable model assumes that the data we observed—usually, denoted by the random variable X—is not complete: there is also some latent data modeled via random variables—usually, denoted by Z. Often, pairs consisting of an observed point and a latent one constitute the complete dataset, e.g., in a GMM the latent data is the cluster number and such a number is assigned to each point in the observed dataset.
Usually, in a latent variable model the likelihood of the observed data is not tractable—although there are exceptions, like in GMM or factor analysis—and therefore we cannot maximize it directly. Instead we maximize a lower bound for the log-likelihood function called ELBO (evidence lower bound). To simplify the discussion, we consider that we have one single observed datapoint, x. We will also consider the case where Z is continuous; when Z is discrete the ∫ sign is replaced by the ∑ sign.
Let q be any distribution over z, where the z values are the possible values of the random variable Z. The log-likelihood of the observed datapoint x is:   
\log p(x; \theta) \overset{\text{marginal}}{=} \log\int_z p(x, z; \theta)\,dz = \log\int_z \frac{q(z)}{q(z)}\, p(x, z; \theta)\,dz = \log\int_z q(z)\,\frac{p(x, z; \theta)}{q(z)}\,dz = \log \mathrm{E}_q\!\left[\frac{p(x, z; \theta)}{q(z)}\right] \overset{\text{Jensen}}{\ge} \mathrm{E}_q\!\left[\log\frac{p(x, z; \theta)}{q(z)}\right] \eqqcolon \mathrm{ELBO}(\theta, q).
Furthermore, the following relationships important for the E step of the EM algorithm—see below—can be proven:
\log p(x;\theta) - \mathrm{ELBO}(\theta, q) = \mathrm{KL}(q(\cdot)\,\|\,p(\cdot \mid x; \theta)) \;\Leftrightarrow\; \log p(x;\theta) = \mathrm{ELBO}(\theta, q) + \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta)) \;\Leftrightarrow\; \mathrm{ELBO}(\theta, q) = \log p(x;\theta) - \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta)).
Moreover, the following relationships important for the M step of the EM algorithm—see below—can be proven:
\mathrm{ELBO}(\theta, q) \overset{\text{def.}}{=} \mathrm{E}_q\!\left[\log\frac{p(x,z;\theta)}{q(z)}\right] = \int_z q(z)\log\frac{p(x,z;\theta)}{q(z)}\,dz = \int_z q(z)\log p(x,z;\theta)\,dz - \int_z q(z)\log q(z)\,dz = \mathrm{E}_q[\log p(x,z;\theta)] + H(q).
Now, instead of carrying out $\max_\theta \log p(x;\theta)$, we will execute $\max_{\theta, q} \mathrm{ELBO}(\theta, q)$, since the ELBO is a lower bound for the log-likelihood of x and hence its maximization will not hurt the process of maximizing $\log p(x;\theta)$.
The ELBO can be maximized in at least two ways:
  • via (block) coordinate ascent
    The resulting meta-algorithm is the EM algorithm.
    In fact this is the case for many classic models—EM for GMM [3], EM for factor analysis [5] etc.—.
    EM is an iterative algorithm, and an iteration encompasses two steps (a minimal numerical sketch of the resulting loop is given at the end of this appendix):
    • E step:
      $q^{(t)} = \arg\max_q \mathrm{ELBO}(\theta^{(t-1)}, q)$ for $\theta^{(t-1)}$ fixed (from the previous iteration).
      Since $\mathrm{ELBO}(\theta, q) = \log p(x;\theta) - \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta))$, we have:
      q^{(t)} = \arg\max_q \mathrm{ELBO}(\theta^{(t-1)}, q) = \arg\max_q \left(\log p(x;\theta^{(t-1)}) - \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta^{(t-1)}))\right) = \arg\min_q \mathrm{KL}(q(\cdot)\,\|\,p(\cdot\mid x;\theta^{(t-1)})) \overset{\text{KL property}}{=} p(\cdot\mid x;\theta^{(t-1)}).
      (In this case, we have $\log p(x;\theta^{(t-1)}) = \mathrm{ELBO}(\theta^{(t-1)}, q^{(t)})$.)
      So, we obtained the distribution $q^{(t)}$ as a posterior distribution. Although in classic models where conjugate priors are used it is tractable to compute $p(\cdot\mid x;\theta^{(t-1)})$ (this type of inference is called analytical inference), in other models this is not the case, and a solution to this shortcoming is represented by approximate/variational inference.
    • M step:
      $\theta^{(t)} = \arg\max_\theta \mathrm{ELBO}(\theta, q^{(t)})$ for $q^{(t)}$ fixed (from the E step).
      Since $\mathrm{ELBO}(\theta, q) = \mathrm{E}_q[\log p(x,z;\theta)] + H(q)$, we have:
      \theta^{(t)} = \arg\max_\theta \mathrm{ELBO}(\theta, q^{(t)}) = \arg\max_\theta \left(\mathrm{E}_{q^{(t)}}[\log p(x,z;\theta)] + H(q^{(t)})\right) = \arg\max_\theta \mathrm{E}_{q^{(t)}}[\log p(x,z;\theta)].
      So, we obtained a relatively simpler term to maximize. Note that the maximization is further customized using the probabilistic assumptions at hand.
  • via gradient ascent: this is the case of Variational Autoencoder [24] which will not be discussed since it is not necessarily relevant to this study.
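To make the coordinate-ascent view concrete, here is a minimal numpy sketch (ours, not the paper's code) of the EM loop for a toy two-component 1-D GMM; since the ELBO touches the log-likelihood after each E step, the monitored log-likelihood is non-decreasing across iterations.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 100)])   # toy 1-D data

pi, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
prev_ll = -np.inf
for t in range(200):
    # E step: q(z) = p(z | x; theta), i.e., the responsibilities.
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    ll = np.log(dens.sum(axis=1)).sum()            # log-likelihood at the current theta
    w = dens / dens.sum(axis=1, keepdims=True)
    # M step: maximize E_q[log p(x, z; theta)]; H(q) does not depend on theta.
    pi = w.mean(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    sd = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / w.sum(axis=0))
    if ll - prev_ll < 1e-8:                        # monotone improvement, so stop here
        break
    prev_ll = ll

print(pi, mu, sd, ll)
```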

Appendix B. S2.FA

Algorithm A1 S2.FA—nonmatrix form.
function Train(\{(x^{(i)}, z^{(i)}) \mid i \in \{1, \dots, n\}\})
    \bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)}
    \hat{\mu}_z = \frac{1}{n}\sum_{i=1}^n z^{(i)} = \bar{z}
    \hat{\Sigma}_z = \frac{1}{n}\sum_{i=1}^n (z^{(i)} - \hat{\mu}_z)(z^{(i)} - \hat{\mu}_z)^\top
    \hat{\Lambda} = \left(n\bar{x}\bar{z}^\top - \sum_{i=1}^n x^{(i)} z^{(i)\top}\right)\left(n\bar{z}\bar{z}^\top - \sum_{i=1}^n z^{(i)} z^{(i)\top}\right)^{-1}
    \hat{\mu} = \bar{x} - \hat{\Lambda}\bar{z}
    \hat{\Psi} = \mathrm{diag}\left(\frac{1}{n}\sum_{i=1}^n (x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})(x^{(i)} - \hat{\mu} - \hat{\Lambda} z^{(i)})^\top\right)
    return (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi})
function Test(x^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    value = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
    covarianceMatrix = \hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z
    return (value, covarianceMatrix)
Algorithm A2 S2.FA—matrix form.
function Train(X, Z)    ▹ X = [x^{(1)} \cdots x^{(n)}], Z = [z^{(1)} \cdots z^{(n)}]
    \bar{x} = \frac{1}{n}\sum_{i=1}^n X_{:i}
    \hat{\mu}_z = \frac{1}{n}\sum_{i=1}^n Z_{:i} = \bar{z}
    \hat{\Sigma}_z = \frac{1}{n}\left(Z - \hat{\mu}_z \mathbf{1}_{1\times n}\right)\left(Z - \hat{\mu}_z \mathbf{1}_{1\times n}\right)^\top = \frac{1}{n} Z Z^\top - \hat{\mu}_z \hat{\mu}_z^\top    ▹ two ways to compute
    \hat{\Lambda} = \left(n\bar{x}\hat{\mu}_z^\top - X Z^\top\right)\left(n\hat{\mu}_z\hat{\mu}_z^\top - Z Z^\top\right)^{-1}
    \hat{\mu} = \bar{x} - \hat{\Lambda}\bar{z}
    \hat{\Psi} = \mathrm{diag}\left(\frac{1}{n}\left(X - \hat{\mu}\mathbf{1}_{1\times n} - \hat{\Lambda} Z\right)\left(X - \hat{\mu}\mathbf{1}_{1\times n} - \hat{\Lambda} Z\right)^\top\right)
    return (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi})
function Test(x^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    value = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
    covarianceMatrix = \hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z
    return (value, covarianceMatrix)
Here \mathbf{1}_{1\times n} = [1 \; 1 \; \cdots \; 1] \in \mathbb{R}^{1\times n} denotes the all-ones row vector.

Appendix C. S3.FA

The algorithms below are more general: z is not just a real number, as we state in the paper, but a real vector.
Algorithm A3 S3.FA—nonmatrix form.
function LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, (x^{(a)}, z^{(a)}), x^{(a+1)}, \dots, x^{(n)}\}, (\mu_z, \Sigma_z, \mu, \Lambda, \Psi))
    return \sum_{i=1}^a \left[\ln\mathcal{N}(z^{(i)} \mid \mu_z, \Sigma_z) + \ln\mathcal{N}(x^{(i)} \mid \mu + \Lambda z^{(i)}, \Psi)\right] + \sum_{i=a+1}^n \ln\mathcal{N}(x^{(i)} \mid \mu + \Lambda\mu_z, \Lambda\Sigma_z\Lambda^\top + \Psi)
function Train(\{(x^{(1)}, z^{(1)}), \dots, (x^{(a)}, z^{(a)}), x^{(a+1)}, \dots, x^{(n)}\}, nMaxIterations, eps)
    \bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)}
    \theta^{(0)} = initializeParameters(\{(x^{(1)}, z^{(1)}), \dots, x^{(n)}\})
    l_{\mathrm{RV\_Do}}^{(0)} = LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, x^{(n)}\}, \theta^{(0)})
    for t = 0 : nMaxIterations do
        E step: compute \mathrm{E}[Z^{(i)}], \mathrm{E}[Z^{(i)} Z^{(i)\top}] for i \in \{a+1, \dots, n\}:
            \mathrm{E}_{Z^{(i)} \mid X^{(i)} = x^{(i)}, \theta^{(t)}}[Z^{(i)}] = \mu_z^{(t)} + \Sigma_z^{(t)}\Lambda^{(t)\top}(\Lambda^{(t)}\Sigma_z^{(t)}\Lambda^{(t)\top} + \Psi^{(t)})^{-1}(x^{(i)} - \mu^{(t)} - \Lambda^{(t)}\mu_z^{(t)})
            \mathrm{E}_{Z^{(i)} \mid X^{(i)} = x^{(i)}, \theta^{(t)}}[Z^{(i)} Z^{(i)\top}] = \Sigma_z^{(t)} + \mathrm{E}_{Z^{(i)} \mid X^{(i)} = x^{(i)}, \theta^{(t)}}[Z^{(i)}]\,\mathrm{E}_{Z^{(i)} \mid X^{(i)} = x^{(i)}, \theta^{(t)}}[Z^{(i)}]^\top
        M step: compute \theta^{(t+1)} = (\mu_z^{(t+1)}, \Sigma_z^{(t+1)}, \mu^{(t+1)}, \Lambda^{(t+1)}, \Psi^{(t+1)}):
            \mu_z^{(t+1)} = \frac{1}{n}\left(\sum_{i=1}^a z^{(i)} + \sum_{i=a+1}^n \mathrm{E}[Z^{(i)}]\right)
            \Sigma_z^{(t+1)} = \frac{1}{n}\left(\sum_{i=1}^a (z^{(i)} - \mu_z^{(t+1)})(z^{(i)} - \mu_z^{(t+1)})^\top + \sum_{i=a+1}^n \left(\mathrm{E}[Z^{(i)} Z^{(i)\top}] - \mathrm{E}[Z^{(i)}]\mu_z^{(t+1)\top} - \mu_z^{(t+1)}\mathrm{E}[Z^{(i)}]^\top + \mu_z^{(t+1)}\mu_z^{(t+1)\top}\right)\right)
            \Lambda^{(t+1)} = \left(n\bar{x}\mu_z^{(t+1)\top} - \sum_{i=1}^a x^{(i)} z^{(i)\top} - \sum_{i=a+1}^n x^{(i)}\mathrm{E}[Z^{(i)}]^\top\right)\left(n\mu_z^{(t+1)}\mu_z^{(t+1)\top} - \sum_{i=1}^a z^{(i)} z^{(i)\top} - \sum_{i=a+1}^n \mathrm{E}[Z^{(i)} Z^{(i)\top}]\right)^{-1}
            \mu^{(t+1)} = \bar{x} - \Lambda^{(t+1)}\mu_z^{(t+1)}
            \Psi^{(t+1)} = \mathrm{diag}\Big(\frac{1}{n}\Big(\sum_{i=1}^a (x^{(i)} - \mu^{(t+1)} - \Lambda^{(t+1)} z^{(i)})(x^{(i)} - \mu^{(t+1)} - \Lambda^{(t+1)} z^{(i)})^\top + \sum_{i=a+1}^n \big((x^{(i)} - \mu^{(t+1)})(x^{(i)} - \mu^{(t+1)})^\top - (x^{(i)} - \mu^{(t+1)})\mathrm{E}[Z^{(i)}]^\top\Lambda^{(t+1)\top} - \Lambda^{(t+1)}\mathrm{E}[Z^{(i)}](x^{(i)} - \mu^{(t+1)})^\top + \Lambda^{(t+1)}\mathrm{E}[Z^{(i)} Z^{(i)\top}]\Lambda^{(t+1)\top}\big)\Big)\Big)
            l_{\mathrm{RV\_Do}}^{(t+1)} = LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, x^{(n)}\}, \theta^{(t+1)})
        if \frac{\|\theta^{(t)} - \theta^{(t+1)}\|_2^2}{\|\theta^{(t)}\|_2^2} \le eps or \frac{|l_{\mathrm{RV\_Do}}^{(t)} - l_{\mathrm{RV\_Do}}^{(t+1)}|}{|l_{\mathrm{RV\_Do}}^{(t)}|} \le eps then
            break
    return \theta^{(t)}
function Test(x^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))    ▹ the same as in S2.FA
    value = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
    covarianceMatrix = \hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z
    return (value, covarianceMatrix)
Algorithm A4 S3.FA—matrix form.
function LogLikelihood(X, Z, (\mu_z, \Sigma_z, \mu, \Lambda, \Psi))    ▹ X = [x^{(1)} \cdots x^{(n)}], Z = [z^{(1)} \cdots z^{(a)}]
    return \sum_{i=1}^a \left[\ln\mathcal{N}(Z_{:i} \mid \mu_z, \Sigma_z) + \ln\mathcal{N}(X_{:i} \mid \mu + \Lambda Z_{:i}, \Psi)\right] + \sum_{i=a+1}^n \ln\mathcal{N}(X_{:i} \mid \mu + \Lambda\mu_z, \Lambda\Sigma_z\Lambda^\top + \Psi)
function Train(X, Z, nMaxIterations, eps)    ▹ X = [x^{(1)} \cdots x^{(n)}], Z = [z^{(1)} \cdots z^{(a)}]
    \bar{x} = \frac{1}{n}\sum_{i=1}^n X_{:i}
    \theta^{(0)} = initializeParameters(X, Z)
    l_{\mathrm{RV\_Do}}^{(0)} = LogLikelihood(X, Z, \theta^{(0)})
    for t = 0 : nMaxIterations do
        E step: compute \mathrm{E}[Z^{(i)}], \mathrm{E}[Z^{(i)} Z^{(i)\top}] for i \in \{a+1, \dots, n\}:
            E\_Z = \mu_z^{(t)}\mathbf{1}_{1\times(n-a)} + \Sigma_z^{(t)}\Lambda^{(t)\top}(\Lambda^{(t)}\Sigma_z^{(t)}\Lambda^{(t)\top} + \Psi^{(t)})^{-1}\left(X_{:,(a+1):n} - \mu^{(t)}\mathbf{1}_{1\times(n-a)} - \Lambda^{(t)}\mu_z^{(t)}\mathbf{1}_{1\times(n-a)}\right)
            E\_Z\_Z\_T = \Sigma_z^{(t)} + E\_Z \cdot E\_Z^\top
            E\_Z = [\,Z \;\; E\_Z\,]
            E\_Z\_Z\_T = Z Z^\top + E\_Z\_Z\_T
        M step: compute \theta^{(t+1)} = (\mu_z^{(t+1)}, \Sigma_z^{(t+1)}, \mu^{(t+1)}, \Lambda^{(t+1)}, \Psi^{(t+1)}):
            \mu_z^{(t+1)} = \frac{1}{n}\sum_{i=1}^n E\_Z_{:i}
            \Sigma_z^{(t+1)} = \frac{1}{n} E\_Z\_Z\_T - \mu_z^{(t+1)}\mu_z^{(t+1)\top}
            \Lambda^{(t+1)} = \left(n\bar{x}\mu_z^{(t+1)\top} - X \cdot E\_Z^\top\right)\left(n\mu_z^{(t+1)}\mu_z^{(t+1)\top} - E\_Z\_Z\_T\right)^{-1}
            \mu^{(t+1)} = \bar{x} - \Lambda^{(t+1)}\mu_z^{(t+1)}
            \Psi^{(t+1)} = \mathrm{diag}\Big(\frac{1}{n}\Big((X - \mu^{(t+1)}\mathbf{1}_{1\times n})(X - \mu^{(t+1)}\mathbf{1}_{1\times n})^\top - (X - \mu^{(t+1)}\mathbf{1}_{1\times n})\,E\_Z^\top\,\Lambda^{(t+1)\top} - \Lambda^{(t+1)}\,E\_Z\,(X - \mu^{(t+1)}\mathbf{1}_{1\times n})^\top + \Lambda^{(t+1)}\,E\_Z\_Z\_T\,\Lambda^{(t+1)\top}\Big)\Big)
            l_{\mathrm{RV\_Do}}^{(t+1)} = LogLikelihood(X, Z, \theta^{(t+1)})
        if \frac{\|\theta^{(t)} - \theta^{(t+1)}\|_2^2}{\|\theta^{(t)}\|_2^2} \le eps or \frac{|l_{\mathrm{RV\_Do}}^{(t)} - l_{\mathrm{RV\_Do}}^{(t+1)}|}{|l_{\mathrm{RV\_Do}}^{(t)}|} \le eps then
            break
    return \theta^{(t)}
function Test(x^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))    ▹ the same as in S2.FA
    value = \hat{\mu}_z + \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}(x^* - \hat{\mu} - \hat{\Lambda}\hat{\mu}_z)
    covarianceMatrix = \hat{\Sigma}_z - \hat{\Sigma}_z\hat{\Lambda}^\top(\hat{\Lambda}\hat{\Sigma}_z\hat{\Lambda}^\top + \hat{\Psi})^{-1}\hat{\Lambda}\hat{\Sigma}_z
    return (value, covarianceMatrix)

Appendix D. MS3.FA

The algorithms below are more general: z is not just a real number, as we state in the paper, but a real vector.
Algorithm A5 MS3.FA—Other functions.
function CondNormalNA(y, (\mu, \Sigma))
    I = \{i \mid y_i = \mathrm{NA}\}    ▹ I = indexes of the missing components
    OI = \{i \mid y_i \ne \mathrm{NA}\}    ▹ OI = other indexes (observed components)
    mean = \mu_I + \Sigma_{I,OI}\,\Sigma_{OI,OI}^{-1}\,(y_{OI} - \mu_{OI})
    covarianceMatrix = \Sigma_{I,I} - \Sigma_{I,OI}\,\Sigma_{OI,OI}^{-1}\,\Sigma_{I,OI}^\top
    \mathrm{E}[Y]_I = mean
    \mathrm{E}[Y]_{OI} = y_{OI}
    \mathrm{E}[Y Y^\top] = \mathrm{E}[Y]\,\mathrm{E}[Y]^\top
    \mathrm{E}[Y Y^\top]_{I,I} = \mathrm{E}[Y Y^\top]_{I,I} + covarianceMatrix
    return (\mathrm{E}[Y], \mathrm{E}[Y Y^\top])
function FullNormal(\mu_z, \Sigma_z, \mu, \Lambda, \Psi)
    mean = \begin{bmatrix}\mu + \Lambda\mu_z \\ \mu_z\end{bmatrix}
    covarianceMatrix = \begin{bmatrix}\Lambda\Sigma_z\Lambda^\top + \Psi & \Lambda\Sigma_z \\ (\Lambda\Sigma_z)^\top & \Sigma_z\end{bmatrix}
    return (mean, covarianceMatrix)
function LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\}, (\mu_z, \Sigma_z, \mu, \Lambda, \Psi))
    (mean, cov) = FullNormal(\mu_z, \Sigma_z, \mu, \Lambda, \Psi)
    result = 0
    for i = 1 : n do
        y^{(i)} = \begin{bmatrix}x^{(i)} \\ z^{(i)}\end{bmatrix}
        OI = \{j \mid y_j^{(i)} \ne \mathrm{NA}\}    ▹ OI = other (observed) indexes
        result = result + \ln\mathcal{N}(y_{OI}^{(i)} \mid mean_{OI}, cov_{OI,OI})
    return result
Algorithm A6 MS3.FA—Train.
function Train(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\}, nMaxIterations, eps)    ▹ data can have NAs
    \theta^{(0)} = initializeParameters(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\})
    l_{\mathrm{RV\_Do}}^{(0)} = LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\}, \theta^{(0)})
    for t = 0 : nMaxIterations do
        E step: compute the following for all i \in \{1, \dots, n\}:
            y^{(i)} = \begin{bmatrix}x^{(i)} \\ z^{(i)}\end{bmatrix}
            fullNormal = FullNormal(\mu_z^{(t)}, \Sigma_z^{(t)}, \mu^{(t)}, \Lambda^{(t)}, \Psi^{(t)})
            (\mathrm{E}[Y^{(i)}], \mathrm{E}[Y^{(i)} Y^{(i)\top}]) = CondNormalNA(y^{(i)}, fullNormal)
            \mathrm{E}[X^{(i)}] = \mathrm{E}[Y^{(i)}]_{1:D}    ▹ D = dim(x^{(i)})
            \mathrm{E}[Z^{(i)}] = \mathrm{E}[Y^{(i)}]_{(D+1):(D+d)}    ▹ d = dim(z^{(i)})
            \mathrm{E}[X^{(i)} X^{(i)\top}] = \mathrm{E}[Y^{(i)} Y^{(i)\top}]_{1:D,\,1:D}
            \mathrm{E}[X^{(i)} Z^{(i)\top}] = \mathrm{E}[Y^{(i)} Y^{(i)\top}]_{1:D,\,(D+1):(D+d)}
            \mathrm{E}[Z^{(i)} X^{(i)\top}] = \mathrm{E}[X^{(i)} Z^{(i)\top}]^\top
            \mathrm{E}[Z^{(i)} Z^{(i)\top}] = \mathrm{E}[Y^{(i)} Y^{(i)\top}]_{(D+1):(D+d),\,(D+1):(D+d)}
        M step: compute \theta^{(t+1)} = (\mu_z^{(t+1)}, \Sigma_z^{(t+1)}, \mu^{(t+1)}, \Lambda^{(t+1)}, \Psi^{(t+1)}):
            \mu_z^{(t+1)} = \frac{1}{n}\sum_{i=1}^n \mathrm{E}[Z^{(i)}]
            \Sigma_z^{(t+1)} = \frac{1}{n}\sum_{i=1}^n \left(\mathrm{E}[Z^{(i)} Z^{(i)\top}] - \mathrm{E}[Z^{(i)}]\mu_z^{(t+1)\top} - \mu_z^{(t+1)}\mathrm{E}[Z^{(i)}]^\top + \mu_z^{(t+1)}\mu_z^{(t+1)\top}\right)
            \bar{x} = \frac{1}{n}\sum_{i=1}^n \mathrm{E}[X^{(i)}]
            \Lambda^{(t+1)} = \left(n\bar{x}\mu_z^{(t+1)\top} - \sum_{i=1}^n \mathrm{E}[X^{(i)} Z^{(i)\top}]\right)\left(n\mu_z^{(t+1)}\mu_z^{(t+1)\top} - \sum_{i=1}^n \mathrm{E}[Z^{(i)} Z^{(i)\top}]\right)^{-1}
            \mu^{(t+1)} = \bar{x} - \Lambda^{(t+1)}\mu_z^{(t+1)}
            \Psi^{(t+1)} = \mathrm{diag}\Big(\frac{1}{n}\sum_{i=1}^n \big(\mathrm{E}[X^{(i)} X^{(i)\top}] - \mathrm{E}[X^{(i)}]\mu^{(t+1)\top} - \mu^{(t+1)}\mathrm{E}[X^{(i)}]^\top + \mu^{(t+1)}\mu^{(t+1)\top} - \mathrm{E}[X^{(i)} Z^{(i)\top}]\Lambda^{(t+1)\top} + \mu^{(t+1)}\mathrm{E}[Z^{(i)}]^\top\Lambda^{(t+1)\top} - \Lambda^{(t+1)}\mathrm{E}[Z^{(i)} X^{(i)\top}] + \Lambda^{(t+1)}\mathrm{E}[Z^{(i)}]\mu^{(t+1)\top} + \Lambda^{(t+1)}\mathrm{E}[Z^{(i)} Z^{(i)\top}]\Lambda^{(t+1)\top}\big)\Big)
            l_{\mathrm{RV\_Do}}^{(t+1)} = LogLikelihood(\{(x^{(1)}, z^{(1)}), \dots, (x^{(n)}, z^{(n)})\}, \theta^{(t+1)})
        if \frac{\|\theta^{(t)} - \theta^{(t+1)}\|_2^2}{\|\theta^{(t)}\|_2^2} \le eps or \frac{|l_{\mathrm{RV\_Do}}^{(t)} - l_{\mathrm{RV\_Do}}^{(t+1)}|}{|l_{\mathrm{RV\_Do}}^{(t)}|} \le eps then
            break
    return \theta^{(t)}
Algorithm A7 MS3.FA—Test and Impute.
function Test(y^* = (x^*, z^*) partially known, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    fullNormal = FullNormal(\hat{\mu}_z, \hat{\Sigma}_z, \hat{\mu}, \hat{\Lambda}, \hat{\Psi})
    (\mathrm{E}[Y^*], \mathrm{E}[Y^* Y^{*\top}]) = CondNormalNA(y^*, fullNormal)
    value = \mathrm{E}[Y^*]
    covarianceMatrix = \mathrm{E}[Y^* Y^{*\top}] - \mathrm{E}[Y^*]\,\mathrm{E}[Y^*]^\top
    return (value, covarianceMatrix)
function Impute(y^* = (x^*, z^*) partially known, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    (value, covarianceMatrix) = Test(y^*, (\hat{\mu}_z, \hat{\Sigma}_z, \hat{\Lambda}, \hat{\mu}, \hat{\Psi}))
    return value

References

  1. Mitchell, T. Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression. (Additional Chapter to Machine Learning; McGraw-Hill: New York, NY, USA, 1997.) Published Online. 2017; Available online: https://bit.ly/39Ueb4o (accessed on 31 July 2021).
  2. Murphy, K. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
  3. Ng, A. Machine Learning Course, Lecture Notes, Mixtures of Gaussians and the EM Algorithm. Available online: http://cs229.stanford.edu/notes2020spring/cs229-notes7b.pdf (accessed on 31 July 2021).
  4. Singh, A. Machine Learning Course, Homework 4, pr 1.1; CMU: Pittsburgh, PA, USA, 2010; p. 528 in Ciortuz, L.; Munteanu, A.; Bădărău, E. Machine Learning Exercise Book (In Romanian); Alexandru Ioan Cuza University of Iași: Iași, Romania, 2019. Available online: https://bit.ly/320ZuIk (accessed on 31 July 2021).
  5. Ng, A. Machine Learning Course, Lecture Notes, Part X. Available online: http://cs229.stanford.edu/notes2020spring/cs229-notes9.pdf (accessed on 31 July 2021).
  6. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 31 July 2021).
  7. Tipping, M.E.; Bishop, C.M. Probabilistic Principal Component Analysis. J. R. Stat. Soc. Ser. (Stat. Methodol.) 1999, 61, 611–622. Available online: https://bit.ly/2PCxoRr (accessed on 31 July 2021). [CrossRef]
  8. Ciobanu, S. Exploiting a New Probabilistic Model: Simple-Supervised Factor Analysis. Master’s Thesis, Alexandru Ioan Cuza University of Iași, Iași, Romania, 2019. Available online: https://bit.ly/31UsBx6 (accessed on 31 July 2021).
  9. Ng, A. Machine Learning Course, Lecture Notes, Part XI. Available online: http://cs229.stanford.edu/notes2020spring/cs229-notes10.pdf (accessed on 31 July 2021).
  10. Lawrence, N.D. Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data. Adv. Neural Inf. Process. Syst. 2004, 329–336. Available online: https://papers.nips.cc/paper/2540-gaussian-process-latent-variable-models-for-visualisation-of-high-dimensional-data.pdf (accessed on 31 July 2021).
  11. Gao, X.; Wang, X.; Tao, D.; Li, X. Supervised Gaussian Process Latent Variable Model for Dimensionality Reduction. IEEE Trans. Syst. Man, Cybern. Part (Cybern.) 2010, 41, 425–434. Available online: https://ieeexplore.ieee.org/document/5545418 (accessed on 31 July 2021).
  12. Mitchell, T.; Xing, E.; Singh, A. Machine Learning Course, Midterm Exam, pr. 5.3; CMU: Pittsburgh, PA, USA, 2010; p. 565 Ciortuz, L.; Munteanu, A.; Bădărău, E. Machine Learning Exercise Book (In Romanian); Alexandru Ioan Cuza University of Iași: Iași, Romania, 2019. Available online: https://bit.ly/320ZuIk (accessed on 31 July 2021).
  13. Ziyatdinov, A.; Fonollosa, J.; Fernández, L.; Gutierrez-Gálvez, A.; Marco, S.; Perera, A. Bioinspired early detection through gas flow modulation in chemo-sensory systems. Sens. Actuators Chem. 2015, 206, 538–547. [Google Scholar] [CrossRef] [Green Version]
  14. Spyromitros-Xioufis, E.; Tsoumakas, G.; Groves, W.; Vlahavas, I. Drawing parallels between multi-label classification and multi-target regression. arXiv 2014, arXiv:1211.6581v2. [Google Scholar]
  15. Xiaojin, Z.; Zoubin, G. Learning from Labeled and Unlabeled Data with Label Propagation; Technical Report CMU-CALD-02–107; Carnegie Mellon University: Pittsburgh, PA, USA, 2002. [Google Scholar]
  16. Wang, J. SSL: Semi-Supervised Learning, R Package Version 0.1; 2016. Available online: https://CRAN.R-project.org/package=SSL (accessed on 31 July 2021).
  17. Oliver, A.; Odena, A.; Raffel, C.; Cubuk, E.D.; Goodfellow, I.J. Realistic evaluation of deep semi-supervised learning algorithms. arXiv 2018, arXiv:1804.09170. [Google Scholar]
  18. van Buuren, S.; Groothuis-Oudshoorn, K. Mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 45, 1–67. Available online: https://www.jstatsoft.org/v45/i03/ (accessed on 31 July 2021). [CrossRef] [Green Version]
  19. Honaker, J.; King, G.; Blackwell, M. Amelia II: A Program for Missing Data. J. Stat. Softw. 2011, 45, 1–47. Available online: http://www.jstatsoft.org/v45/i07/ (accessed on 31 July 2021). [CrossRef]
  20. Ghahramani, Z.; Hinton, G.E. The EM Algorithm for Mixtures of Factor Analyzers; Technical Report, CRG-TR-96-1; University of Toronto: Toronto, ON, Canada, 1996; Available online: http://mlg.eng.cam.ac.uk/zoubin/papers/tr-96-1.pdf (accessed on 31 July 2021).
  21. Bishop, C.M. Pattern Recognition and Machine Learning; Springer Science + Business Media: Berlin, Germany, 2006. [Google Scholar]
  22. Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 2016, 59, 56–65. [Google Scholar] [CrossRef]
  23. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. (Methodol.) 1977, 39, 1–22. [Google Scholar]
  24. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
Figure 1. Simple-semisupervised factor analysis (S3.FA) experiment: MSE 95% confidence intervals on the gas sensor array under flow modulation data set using four methods for semisupervised regression when D ≫ n; p ∈ {5%, 10%, 15%, …, 100%} of the training output labels are retained.
Figure 2. S3.FA experiment: MSE 95% confidence intervals on the atp1d data set using four methods for semisupervised regression when D ≫ n; p ∈ {5%, 10%, 15%, …, 100%} of the training output labels are retained.
Figure 3. S3.FA experiment: MSE 95% confidence intervals on the m5spec data set using four methods for semisupervised regression when D ≫ n; p ∈ {5%, 10%, 15%, …, 100%} of the training output labels are retained.
Figure 4. Missing simple-semisupervised factor analysis (MS3.FA) experiment: MSE 95% confidence intervals on the gas sensor array under flow modulation data set using two methods for imputing missing data when D ≫ n; p ∈ {10%, 20%, 30%, 40%, 50%, 60%} of the input data are removed.
Figure 5. MS3.FA experiment: MSE 95% confidence intervals on the atp1d data set using two methods for imputing missing data when D ≫ n; p ∈ {10%, 20%, 30%, 40%, 50%, 60%} of the input data are removed.
Figure 6. MS3.FA experiment: MSE 95% confidence intervals on the m5spec data set using two methods for imputing missing data when D ≫ n; p ∈ {10%, 20%, 30%, 40%, 50%, 60%} of the input data are removed.
Table 1. Simple-supervised factor analysis (S2.FA) experiment: mean squared error (MSE) 95% confidence intervals on three data sets using three methods for regression when D ≫ n; the best MSE means are marked in bold.
Data Set/Method | Moore–Penrose | Ridge Regression | S2.FA
Gas sensor array under flow modulation | 0.0251 ± 0.0254 | 0.0062 ± 0.007 | 0.0452 ± 0.0208
atp1d | 94,627.7239 ± 80,183.0076 | 27,770.9253 ± 42,887.5216 | 4724.2957 ± 1616.3341
m5spec | 0.00004 ± 0.00001 | 0.02676 ± 0.01372 | 0.37344 ± 0.25025
Table 2. S3.FA experiment: MSE 95% confidence intervals on three data sets using four methods for semisupervised regression when D ≫ n; p ∈ {5%, 10%, 15%, 30%, 50%, 70%} of the training output labels are retained; the best MSE means are marked in bold.
Data Set | Method | p = 5% | p = 10% | p = 15%
Gas sensor array under flow modulation | Moore–Penrose | 0.034 ± 0.0437 | 0.0292 ± 0.0421 | 0.0267 ± 0.0326
Gas sensor array under flow modulation | S2.FA | 0.2511 ± 0.3953 | 0.0899 ± 0.2367 | 0.0285 ± 0.0333
Gas sensor array under flow modulation | S3.FA | 3.9799 ± 6.7087 | 0.349 ± 0.6626 | 0.2294 ± 0.1561
Gas sensor array under flow modulation | Label Propagation | 0.0853 ± 0.0073 | 0.0825 ± 0.0051 | 0.0708 ± 0.013
atp1d | Moore–Penrose | 3763.2939 ± 749.0966 | 3079.5042 ± 1842.7676 | 3428.7567 ± 1722.6437
atp1d | S2.FA | 2706.9906 ± 477.3245 | 2339.3504 ± 324.8581 | 2279.0802 ± 99.5443
atp1d | S3.FA | 3771.605 ± 1563.6543 | 2972.1137 ± 639.6724 | 2633.0816 ± 274.7409
atp1d | Label Propagation | 110,820.3235 ± 0 | 110,820.3235 ± 0 | 110,820.3235 ± 0
m5spec | Moore–Penrose | 0.4602 ± 0.354 | 0.3609 ± 0.7631 | 0.0187 ± 0.0221
m5spec | S2.FA | 0.7003 ± 0.3834 | 0.4269 ± 0.5441 | 0.4999 ± 0.5172
m5spec | S3.FA | 0.8133 ± 0.6366 | 1.653 ± 3.9497 | 0.5118 ± 0.5503
m5spec | Label Propagation | 0.2175 ± 0.0707 | 0.1939 ± 0.0706 | 0.2343 ± 0.0624

Data Set | Method | p = 30% | p = 50% | p = 70%
Gas sensor array under flow modulation | Moore–Penrose | 0.0313 ± 0.0259 | 0.0192 ± 0.0145 | 0.0132 ± 0.0075
Gas sensor array under flow modulation | S2.FA | 0.0588 ± 0.0576 | 0.0233 ± 0.0224 | 0.0279 ± 0.0116
Gas sensor array under flow modulation | S3.FA | 0.645 ± 1.3044 | 0.1666 ± 0.1059 | 0.0912 ± 0.0443
Gas sensor array under flow modulation | Label Propagation | 0.0668 ± 0.0086 | 0.0675 ± 0.018 | 0.0605 ± 0.0122
atp1d | Moore–Penrose | 6401.7587 ± 3602.9182 | 7907.484 ± 4449.2312 | 36,041.874 ± 21,300.704
atp1d | S2.FA | 2444.9001 ± 404.2393 | 2439.162 ± 108.3605 | 2399.7948 ± 160.1038
atp1d | S3.FA | 2760.6602 ± 391.9217 | 2659.6472 ± 112.9795 | 2521.9861 ± 222.3765
atp1d | Label Propagation | 110,820.3235 ± 0 | 110,820.3235 ± 0 | 110,820.3235 ± 0
m5spec | Moore–Penrose | 0.00057 ± 0.00085 | 0.00009 ± 0.00008 | 0.00009 ± 0.00004
m5spec | S2.FA | 0.37449 ± 0.12508 | 0.35796 ± 0.07275 | 0.3995 ± 0.03164
m5spec | S3.FA | 0.37865 ± 0.12808 | 0.36181 ± 0.07463 | 0.40419 ± 0.03415
m5spec | Label Propagation | 0.19391 ± 0.02754 | 0.17424 ± 0.00857 | 0.17752 ± 0.01545
Table 3. MS3.FA experiment: MSE 95% confidence intervals on three data sets using two methods for imputing missing data when D ≫ n; p ∈ {10%, 20%, 30%, 40%, 50%, 60%} of the input data are removed; the best MSE means are marked in bold.
Data Set | Method | p = 10% | p = 20% | p = 30%
Gas sensor array under flow modulation | Mean | 0.0108 ± 0.0022 | 0.0241 ± 0.0027 | 0.0359 ± 0.0011
Gas sensor array under flow modulation | MS3.FA | 0.00603 ± 0.00029 | 0.01176 ± 0.00081 | 0.01747 ± 0.00094
atp1d | Mean | 3239.9757 ± 68.2511 | 6831.0884 ± 105.6119 | 10,077.1798 ± 117.4693
atp1d | MS3.FA | 1807.7748 ± 105.03 | 3697.1146 ± 124.2197 | 5276.899 ± 104.6252
m5spec | Mean | 0.00013 ± 0.00001 | 0.00026 ± 0 | 0.00039 ± 0.00001
m5spec | MS3.FA | 0.14835 ± 0.000002 | 0.148433 ± 0.000002 | 0.148518 ± 0.000002

Data Set | Method | p = 40% | p = 50% | p = 60%
Gas sensor array under flow modulation | Mean | 0.0494 ± 0.0042 | 0.0597 ± 0.0015 | 0.0731 ± 0.0012
Gas sensor array under flow modulation | MS3.FA | 0.02372 ± 0.00159 | 0.0299 ± 0.00047 | 0.0364 ± 0.0011
atp1d | Mean | 13,423.3625 ± 252.7735 | 16,899.3166 ± 223.6971 | 20,156.2493 ± 62.1299
atp1d | MS3.FA | 7071.6793 ± 143.3232 | 8921.5319 ± 165.5759 | 10,558.5723 ± 53.1182
m5spec | Mean | 0.000521 ± 0.000005 | 0.000658 ± 0.000008 | 0.00080 ± 0.000005
m5spec | MS3.FA | 0.1486 ± 0.00001 | 0.14869 ± 0.00001 | 0.148781 ± 0.000001
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

