Article

An Improved Mixture Model of Gaussian Processes and Its Classification Expectation–Maximization Algorithm

1 School of Mathematics and Statistics, Shaanxi Normal University, Xi’an 710119, China
2 School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
3 School of Mathematics, Northwest University, Xi’an 710127, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2023, 11(10), 2251; https://doi.org/10.3390/math11102251
Submission received: 4 April 2023 / Revised: 4 May 2023 / Accepted: 6 May 2023 / Published: 11 May 2023

Abstract

The mixture of experts (ME) model is effective for multimodal data in statistics and machine learning. To treat non-stationary probabilistic regression, the mixture of Gaussian processes (MGP) and warped Gaussian process (WGP) models have been proposed, but they may not handle general non-stationary probabilistic regression well in practice because the ability of each Gaussian process (GP) expert is limited. In this paper, we first propose the mixture of warped Gaussian processes (MWGP) model, together with its classification expectation–maximization (CEM) algorithm, to address this problem. To overcome the local optima of the CEM algorithm, we then propose the split and merge CEM (SMCEM) algorithm for MWGP. Experiments were conducted on synthetic and real-world datasets, which show that our proposed MWGP is more effective than the models used for comparison and that the SMCEM algorithm can overcome the local optima of the CEM algorithm for MWGP.

1. Introduction

The mixture of experts (ME) model is effective for multimodal data in statistics and machine learning [1]. In ME, the input space is softly divided into multiple regions by an input-dependent gating function, and each region is specified by an expert. Due to the diversity of experts, such as the Gaussian distribution [2] and the support vector machine (SVM) [3], there are a variety of models based on the ME framework.
In the 2000s, Tresp constructed the mixture of Gaussian processes (MGP) model, a special ME where each expert is a stationary Gaussian process (GP) probabilistic regression model, by mixing GPs along the input space to treat non-stationary probabilistic regression [4,5,6,7,8]. Popular gating functions of the MGP include logistic distribution and Gaussian distribution. For learning MGP, the main algorithms are Markov Chain Monte Carlo (MCMC), variational Bayesian (VB), and expectation–maximization (EM). The MGP cannot work correctly on non-stationary probabilistic regression in some situations since the ability of each GP expert is limited.
In this paper, we propose the mixture of warped Gaussian processes (MWGP) model, which has more flexible and attractive properties than the MGP for handling non-stationary probabilistic regression, by modeling each component of an MGP with a warped Gaussian process (WGP); these WGPs are combined in the input space by the Gaussian distribution. The WGP is capable of modeling non-stationary probabilistic regression by learning a nonlinear distortion (also called warping) of the GP outputs. The MWGP can be viewed as a generalization of the MGP and WGP frameworks, and it allows non-stationary data to be handled within each mode of a multimodal dataset. In MWGP, the mixture parameters, warping function parameters, covariance function parameters, and indicator variables (regarded as the latent variables) are considered simultaneously. To handle this, we designed the classification expectation–maximization (CEM) algorithm for the training of MWGP. However, the CEM algorithm may easily converge to a local optimum for MWGP in some cases. We propose the split and merge CEM (SMCEM) algorithm for MWGP, based on the SMCEM algorithm of the MGP and the CEM algorithm of MWGP, to solve this problem. Experiments were conducted on synthetic and real-world datasets, and the results demonstrate the feasibility and superior accuracy of our proposed MWGP model trained using the CEM algorithm compared to other comparative models for probabilistic regression. Moreover, the SMCEM algorithm of MWGP can overcome local optima on some datasets at a negligible time cost.
The remainder of this paper is organized as follows. In Section 2, we present related works of GP, MGP, and WGP models. We describe GP, WGP, and our proposed MWGP in Section 3. In Section 4, we present our proposed SMCEM algorithm of MWGP, the CEM algorithm of MWGP, and the partial CEM algorithm of MWGP. The experimental results are presented in Section 5, and the conclusions are drawn in Section 6.

2. Related Works

Related works of the GP. The GP is a versatile tool for probabilistic regression and it has been successfully applied to practical fields such as time series prediction [9] and signal processing [10]. The non-stationary probabilistic regression problem exists widely, but it cannot be modeled effectively by the conventional GP model [11,12]. To solve this, the non-stationary GP model was proposed by introducing a robust and flexible covariance structure [13,14,15,16]. However, a single GP cannot handle the non-stationary probabilistic regression well due to its inherent simplicity.
Related works of the MGP model. The structure of the MGP is shown in Figure 1. As seen in this figure, the MGP is a more effective non-stationary model than the GP. However, the parameter estimation of the MGP is a challenge due to the unknown indicator variables (regarded as the latent variables) and the highly correlated samples [17,18,19,20]. The Markov Chain Monte Carlo (MCMC) method employing Gibbs sampling and hybrid Monte Carlo approximates the intractable integration and summation by simulated samples of the indicator variables and parameters [6,7,21,22,23], and it is commonly used in systems of partial differential equations [24,25]. The MCMC generally obtains precise results, but it takes a long time to generate the simulated samples. To improve the efficiency, the variational Bayesian (VB) inference and the expectation–maximization (EM) algorithm were proposed. Ross and Dy constructed the VB inference on the basis of the mean-field approximation, where the indicator variables and the stochastic parameters are forced to be independent in the approximate probability distribution [26]. Yuan and Neubauer established a variational EM algorithm by a similar mean-field method as the VB inference [8]. Then, the leave-one-out cross-validation (LOOCV) EM algorithm was proposed based on the LOOCV approximation method, where the probability density of the GP is approximated by the product of LOOCV probability densities [27]. To improve the accuracy, Chen et al. constructed the CEM algorithm by replacing the expectation step (E step) of the conventional EM algorithm with a classification–expectation step (CE step) [28,29]. For the CEM algorithm, samples are classified into components by the maximum a posteriori (MAP) principle of indicator variables in the CE step, and the parameters of components are learned independently in the maximization step (M step). Then, the MCMC EM algorithm was designed by approximating the Q-function of the EM algorithm with Gibbs sampling; this algorithm is generally accurate but slow [30,31,32]. The SMCEM algorithm was constructed by combining the split and merge EM algorithm and the CEM algorithm to address the local optimum of the CEM algorithm for MGP [33,34,35,36]. Regarding the model selection problem, i.e., selecting the number of components, the Dirichlet process as the gating function was developed [6,17,21,22]; moreover, Zhao and Ma proposed a synchronous balancing criterion [37]. Regarding the robustness problem, the robust MGP with the Laplace noise and the robust MGP with the Student-t noise were proposed to overcome this difficulty [38].
Related works of the WGP model. The WGP, trained by the maximum likelihood estimation (MLE) method, can handle non-stationary probabilistic regression effectively by transforming the GP output in a latent space to the real output in the observation space with a learnable nonlinear monotonic function [39], as seen in Figure 2. From this figure, it is clear that the WGP fits the warped data much better than the GP, whose performance degrades dramatically. In WGP, such a preprocessing transformation can be considered as an integral part of non-stationary probabilistic modeling. The improved WGP models involve a large number of parameters and hyperparameters, which limits the applicability of the WGP. The Hamiltonian MCMC method [40] and the VB inference [41] were proposed to improve the training of the WGP in some situations. To make the WGP structure flexible, Rios et al. constructed the WGP with a warping function that has a deep compositional architecture [42], and the multi-task WGP was proposed [43]. For the optimization of the WGP, a spatial branching strategy was designed [44]. The WGP (being a useful generalization of the GP) has also been widely used in practical applications [45,46,47,48].

3. Model Construction

In this section, we first introduce the GP and WGP models and then describe our proposed MWGP model.

3.1. The GP

The GP is a non-parametric statistical model, which is briefly described as follows. For a dataset $\{\mathbf{x}_n \in \mathbb{R}^E, y_n \in \mathbb{R}\}_{n=1}^{N}$, the standard Gaussian process (GP) model for probabilistic regression is defined by
$$y_n = f(\mathbf{x}_n) + \varepsilon_n, \quad \varepsilon_n \sim \mathcal{N}(0, \sigma^2), \tag{1}$$
where $f(\cdot)$, $y_n$, $\mathbf{x}_n$, $\varepsilon_n$, and $\sigma$ are the random latent function, the $n$-th output, the $n$-th $E \times 1$ input, the $n$-th Gaussian noise, and the SD of the Gaussian noise, respectively. The random latent function $f(\cdot)$ evaluated on $\{\mathbf{x}_n\}_{n=1}^{N}$ is subject to a Gaussian distribution
$$p(\mathbf{f} \mid X) = \mathcal{N}(\boldsymbol{\mu}, K),$$
where $X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]$ is the $E \times N$ input matrix, $\mathbf{f} = [f_n]_{N \times 1}$ is the latent vector with $f_n = f(\mathbf{x}_n)$, $\boldsymbol{\mu} = [\mu_n]_{N \times 1}$ is the mean vector with $\mu_n = \mu(\mathbf{x}_n)$ the mean function, $K = [K_{n\tilde{n}}]_{N \times N}$ is the covariance matrix, and $K_{n\tilde{n}} = K(\mathbf{x}_n, \mathbf{x}_{\tilde{n}}; \gamma)$ is the covariance function parameterized by $\gamma = [\gamma^{(1)}, \gamma^{(2)}, \ldots, \gamma^{(E+1)}]$. (In this paper, we use the squared exponential covariance function
$$K(\mathbf{x}_n, \mathbf{x}_{\tilde{n}}; \gamma) = (\gamma^{(1)})^2 \exp\!\left[-(\mathbf{x}_n - \mathbf{x}_{\tilde{n}})^{T} \Lambda (\mathbf{x}_n - \mathbf{x}_{\tilde{n}})/2\right],$$
where $\Lambda = \mathrm{diag}\!\left(1/(\gamma^{(2)})^2, 1/(\gamma^{(3)})^2, \ldots, 1/(\gamma^{(E+1)})^2\right)$ is the $E \times E$ diagonal matrix.)
The likelihood function of the GP is obtained by integrating $p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid X)$ with respect to $\mathbf{f}$, given by
$$p(\mathbf{y} \mid X) = \mathcal{N}(\boldsymbol{\mu}, K + \sigma^2 I_N), \tag{2}$$
where $p(\mathbf{y} \mid \mathbf{f})$ is the independent identically distributed Gaussian distribution obtained from Equation (1), $\mathbf{y} = [y_n]_{N \times 1}$ is the output vector, and $I_N$ is the $N \times N$ identity matrix.
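As a concrete illustration of the two formulas above, the following minimal Python sketch (assuming NumPy; the helper names `se_cov` and `gp_log_likelihood` are ours, not from the paper) evaluates the squared exponential covariance and the log-likelihood of a zero-mean GP via a Cholesky factorization.

```python
import numpy as np

def se_cov(X, X2, gamma):
    """Squared exponential covariance; gamma = [gamma^(1), ..., gamma^(E+1)]."""
    scale = np.asarray(gamma[1:])                      # length-scales gamma^(2), ..., gamma^(E+1)
    d = (X[:, None, :] - X2[None, :, :]) / scale       # pairwise scaled differences
    return gamma[0] ** 2 * np.exp(-0.5 * np.sum(d ** 2, axis=-1))

def gp_log_likelihood(X, y, gamma, sigma):
    """ln p(y | X) for the zero-mean GP likelihood N(0, K + sigma^2 I_N)."""
    N = X.shape[0]
    C = se_cov(X, X, gamma) + sigma ** 2 * np.eye(N)
    L = np.linalg.cholesky(C)                          # stable alternative to inverting C
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)
```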

3.2. The WGP Model

In the non-stationary probabilistic regression problem, the WGP model describes the real output in the observation space as a parametric nonlinear transformation of the GP. For the dataset $\mathcal{D} = \{\mathbf{x}_n \in \mathbb{R}^E, w_n \in \mathbb{R}\}_{n=1}^{N}$, the WGP is constructed by introducing a latent variable set $\{y_n \in \mathbb{R}\}_{n=1}^{N}$, where $w_n$ and $\mathbf{x}_n$ are the $n$-th output in the observation space and the $n$-th $E \times 1$ input, respectively. The latent vector $\mathbf{y} = [y_n]_{N \times 1}$ is subject to a GP with a zero-mean function (i.e., $\mu(\mathbf{x}_n) = 0$), defined by Equation (2):
$$p(\mathbf{y} \mid X) = \mathcal{N}(\mathbf{0}_{N \times 1}, K + \sigma^2 I_N).$$
The latent variable $y_n$ is transformed to $w_n$ by a monotonic warping function $g(\cdot\,; \Omega)$ (in this paper, we assume the warping function to be a feedforward neural network $g(w_n; \Omega) = w_n + \sum_{j=1}^{J} a_j \tanh\!\left(h_j (w_n + l_j)\right)$, where $a_j$ and $h_j$ are non-negative for any $j$ to ensure monotonicity, $J$ is the number of neurons, and $\Omega = [a_1, a_2, \ldots, a_J, h_1, h_2, \ldots, h_J, l_1, l_2, \ldots, l_J]^{T}$):
$$y_n = g(w_n; \Omega),$$
where $g(w_n; \Omega)$ maps $w_n$ to the entire real line, and $\Omega$ is the parameter vector composed of the $J$ neurons. As stated above, the established WGP is fully incorporated into the probabilistic framework of the GP.
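The following short Python sketch (a minimal illustration, assuming NumPy; the names `warp` and `warp_grad` are ours) implements this neural-network warping function and its derivative, and checks that non-negative $a_j$, $h_j$ indeed make it monotonically increasing.

```python
import numpy as np

def warp(w, a, h, l):
    """g(w; Omega) = w + sum_j a_j * tanh(h_j * (w + l_j)), with a_j, h_j >= 0."""
    w = np.asarray(w, dtype=float)
    return w + np.sum(a * np.tanh(h * (w[..., None] + l)), axis=-1)

def warp_grad(w, a, h, l):
    """dg/dw = 1 + sum_j a_j * h_j / cosh^2(h_j * (w + l_j)); always positive, hence monotone."""
    w = np.asarray(w, dtype=float)
    return 1.0 + np.sum(a * h / np.cosh(h * (w[..., None] + l)) ** 2, axis=-1)

# Example with J = 2 neurons: the warp is strictly increasing because a_j, h_j >= 0.
a, h, l = np.array([0.5, 1.0]), np.array([2.0, 0.3]), np.array([0.0, -1.0])
w = np.linspace(-3, 3, 7)
assert np.all(np.diff(warp(w, a, h, l)) > 0)
```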
For convenience, the WGP is denoted as
$$\mathbf{w} \sim \mathrm{WGP}(X; \theta, \Omega), \tag{3}$$
where $\mathbf{w} = [w_n]_{N \times 1}$ is the output vector and $\theta = \{\sigma, \gamma^{(1)}, \gamma^{(2)}, \ldots, \gamma^{(E+1)}\}$. For WGP, the information flow direction is $\mathbf{x}_n \rightarrow y_n \rightarrow w_n$, and the relationships between the main variables are shown in Figure 3. As seen in this figure, the outputs $\{w_n \in \mathbb{R}\}_{n=1}^{N}$ are conditionally independent when the strongly correlated $\{f_n\}_{n=1}^{N}$ are given. The parameters $\theta$ and $\Omega$ are learned jointly using a conjugate gradient method for WGP.

3.3. The MWGP Model

To process multimodal data while modeling the non-stationary nature of each mode, we next describe our proposed MWGP model mathematically, where $C$ different WGP components are mixed in the input region. The structure of MWGP is similar to that of the MGP, as illustrated in Figure 1. Compared to the MGP, the two-layer structure of MWGP can address non-stationary probabilistic regression in different ways.
A subscript $c$ is added to the preceding notation to index the components. The $n$-th sample $(\mathbf{x}_n, w_n)$ is allocated to the $c$-th WGP component by an indicator variable $z_{nc}$ (regarded as a latent variable), where $c = 1, 2, \ldots, C$. If $(\mathbf{x}_n, w_n)$ is in the $c$-th component, then $z_{nc} = 1$; otherwise, $z_{nc} = 0$. The distribution of the indicator variable vector $\mathbf{z}_n = [z_{n1}, z_{n2}, \ldots, z_{nC}]^{T}$ is given by
$$P(\mathbf{z}_n = \mathbf{e}_c) = \eta_c, \tag{4}$$
where $\mathbf{e}_c$ is the $c$-th column of the $C \times C$ identity matrix $I_C$ and $\sum_{c=1}^{C} \eta_c = 1$.
The distribution of the input vector $\mathbf{x}_n$ is given by
$$p(\mathbf{x}_n \mid \mathbf{z}_n = \mathbf{e}_c) = \mathcal{N}(\alpha_c, \Sigma_c), \tag{5}$$
where $\alpha_c$ and $\Sigma_c$ are the $E \times 1$ mean vector and the $E \times E$ covariance matrix of the Gaussian distribution, respectively. Equation (5) is commonly used in most generative mixture models.
After the distributions of Equations (4) and (5) are given, the distribution of the output vector $\mathbf{w}_c$ is given based on Equation (3) by
$$\mathbf{w}_c \sim \mathrm{WGP}(X_c; \theta_c, \Omega_c), \tag{6}$$
where $X_c$ is the $E \times N_c$ input matrix composed of $\{\mathbf{x}_n \mid \mathbf{z}_n = \mathbf{e}_c;\ n = 1, 2, \ldots, N\}$ in which $N_c = \sum_{n=1}^{N} z_{nc}$ is the sample number, $\mathbf{w}_c$ is the $N_c \times 1$ output vector composed of $\{w_n \mid \mathbf{z}_n = \mathbf{e}_c;\ n = 1, 2, \ldots, N\}$, $\Omega_c$ parameterizes the warping function of the $c$-th component, and $\theta_c = \{\sigma_c, \gamma_c^{(1)}, \gamma_c^{(2)}, \ldots, \gamma_c^{(E+1)}\}$. For MWGP, the $C$ WGP components are independent, and each component is defined by Equation (6). MWGP is generally more flexible than the MGP and WGP; its information flow direction is $\mathbf{z}_n \rightarrow \mathbf{x}_n \rightarrow y_n \rightarrow w_n$, as shown in Figure 4. If $C = 1$, then the MWGP degenerates to the WGP; if $g(w_n; \Omega_c) = w_n$, then the MWGP degenerates to the MGP.
With the above analysis, the mixture structure, the covariance function, and the warping function are incorporated simultaneously in the same probabilistic model framework. The computational cost of MWGP is similar to that of MGP since the time complexity of the inverse covariance matrix operation in MWGP is the same as that in MGP, i.e., $O(N^3/C^2)$. For MWGP, there is an overfitting problem when too many extra parameters are added.
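To make the generative structure explicit, here is a hedged Python sketch of sampling from an MWGP, following the information flow $\mathbf{z}_n \rightarrow \mathbf{x}_n \rightarrow y_n \rightarrow w_n$. It reuses the hypothetical `se_cov` and `warp` helpers from the earlier sketches, assumes SciPy, and the way `params[c]` bundles $(\gamma_c, \sigma_c, a, h, l)$ is purely illustrative, not the paper's notation.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)

def sample_mwgp(N, eta, alpha, Sigma, params):
    """Draw N samples (x, w, z) from the MWGP generative process z -> x -> y -> w."""
    C = len(eta)
    z = rng.choice(C, size=N, p=eta)                                          # Equation (4)
    x = np.array([rng.multivariate_normal(alpha[c], Sigma[c]) for c in z])    # Equation (5)
    w = np.empty(N)
    for c in range(C):
        idx = np.where(z == c)[0]
        if idx.size == 0:
            continue
        gamma, sigma, a, h, l = params[c]
        K = se_cov(x[idx], x[idx], gamma) + sigma ** 2 * np.eye(idx.size)
        y = rng.multivariate_normal(np.zeros(idx.size), K)   # latent zero-mean GP outputs
        # w_n = g^{-1}(y_n): invert the monotone warp numerically (bracket assumed wide enough)
        w[idx] = [brentq(lambda t, yn=yn: warp(t, a, h, l) - yn, -50.0, 50.0) for yn in y]
    return x, w, z
```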

4. Algorithm Design

To avoid the large time complexity of calculating the Q-function in the conventional EM algorithm of MWGP, we designed the CEM and partial CEM algorithms. In practice, the CEM algorithm of MWGP often reaches a local optimum when there is a large separation between two parts of the MWGP components, which arises when many components of MWGP lie in one region and few lie in another. To escape from this separation, simultaneous split and merge operations are performed repeatedly: two similar components in a region with many components are merged, and one component in a region with few components is split. Since the CEM algorithm of MWGP can sometimes get trapped in such a local optimum, we developed the SMCEM algorithm of MWGP to address this issue. The CEM algorithm of MWGP and the partial CEM algorithm of MWGP are sub-algorithms of the SMCEM algorithm for MWGP.

4.1. Procedures of the Proposed Algorithms

Denote the $C \times N$ indicator variable matrix $Z = [\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N]$ and the whole parameter set $\Phi = \{\Phi_c\}_{c=1}^{C}$, where $\Phi_c = \{\eta_c, \alpha_c, \Sigma_c, \theta_c, \Omega_c\}$. For MWGP, the procedures of the SMCEM algorithm, the CEM algorithm, and the partial CEM algorithm are presented in Algorithms 1–3.
Algorithm 1 The SMCEM algorithm for MWGP
Input: $\mathcal{D}$, $C$, $J$.
Output: $\Phi_{best}$, $L_{best}$.
1: Initialization: Initialize the indicator variable matrix $Z^{(0)}$ by k-means clustering on the input set $\{\mathbf{x}_n\}_{n=1}^{N}$, and obtain the indicator variable matrix $Z^{(1)}$ and the parameter set $\Phi^{(0)}$ by performing the CEM algorithm described in Algorithm 2. Set the number of current iterations as $r = 1$, and $L_{best} = -\infty$.
2: The merge operation: The component numbers of the merge operation, $u$ and $s$, are obtained by the merge criterion described in Appendix C, where $u, s \in \{1, 2, \ldots, C\}$ and $u \neq s$. The new component after merging the old $u$-th component and the old $s$-th component is called the $u$-th component.
3: The split operation: The component number of the split operation, $t$, is obtained by the split criterion described in Appendix C, where $t \in \{1, 2, \ldots, C\}$. The new components after splitting the old $t$-th component are called the $s$-th and $t$-th components.
4: For $n \in \{n \mid \mathbf{z}_n^{(3r-2)} = \mathbf{e}_s\}$, set $\mathbf{z}_n^{(3r-1)} = \mathbf{e}_u$; $\{\mathbf{x}_n \mid \mathbf{z}_n^{(3r-2)} = \mathbf{e}_t;\ n = 1, 2, \ldots, N\}$ is clustered into two clusters by k-means clustering, and $\mathbf{z}_n^{(3r-1)} = \mathbf{e}_s$ is set for samples of the first cluster; otherwise, set $\mathbf{z}_n^{(3r-1)} = \mathbf{z}_n^{(3r-2)}$.
5: Obtain $Z^{(3r)}$ by performing the partial CEM algorithm described in Algorithm 3.
6: Obtain $\Phi^{(3r)}$ and $Z^{(3r+1)}$ by performing the CEM algorithm described in Algorithm 2.
7: Convergence criterion: If the value of the approximated Q-function $L(\Phi^{(3r)}, Z^{(3r+1)}) > L_{best}$, then set $L_{best} = L(\Phi^{(3r)}, Z^{(3r+1)})$, $\Phi_{best} = \Phi^{(3r)}$, $r = r + 1$ and return to the second step; otherwise, stop.
Algorithm 2 The CEM algorithm for MWGP
Input: $\mathcal{D}$, $C$, $J$.
Output: $\Phi^{(r)}$, $Z^{(r)}$.
1: Initialization: Set the initialized indicator variable matrix $Z^{(0)} = Z^{(3r)}$ (or $Z^{(0)}$) and $r = 1$.
2: M step: Update $\Phi^{(r)}$ by maximizing $L(\Phi, Z^{(r-1)})$, as described in Appendix A.2.
3: CE step: Update $Z^{(r)}$ by the approximated MAP principle described in Appendix A.1.
4: Convergence criterion: If $r > 9$ and
$$\left[\sum_{i=r-4}^{r} L(\Phi^{(i)}, Z^{(i)}) - \sum_{i=r-9}^{r-5} L(\Phi^{(i)}, Z^{(i)})\right] \Big/ \left|\sum_{i=r-9}^{r-5} L(\Phi^{(i)}, Z^{(i)})\right| < \epsilon,$$
or $r \geq r_{max}$, stop; otherwise, set $r = r + 1$ and return to the second step.
Algorithm 3 The partial CEM algorithm for MWGP
Input: $\mathcal{D}$, $u$, $s$, $t$, $J$.
Output: $Z^{(r)}$.
1: Initialization: Set the initialized indicator variable matrix $Z^{(0)} = Z^{(3r-1)}$ and $r = 1$.
2: Partial M step: Update $\Phi_c^{(r)}$ for $c \in \{u, s, t\}$ by partially maximizing $L(\Phi, Z^{(r-1)})$, as described in Appendix B.1.
3: Partial CE step: Obtain $\mathbf{z}_n^{(r)}$ for $n \in \{n \mid \mathbf{z}_n^{(r-1)} = \mathbf{e}_c,\ c \in \{u, s, t\}\}$ by the approximated MAP principle described in Appendix B.2; otherwise, set $\mathbf{z}_n^{(r)} = \mathbf{z}_n^{(r-1)}$.
4: Convergence criterion: If $r > 9$ and
$$\left[\sum_{i=r-4}^{r} L(\Phi^{(i)}, Z^{(i)}) - \sum_{i=r-9}^{r-5} L(\Phi^{(i)}, Z^{(i)})\right] \Big/ \left|\sum_{i=r-9}^{r-5} L(\Phi^{(i)}, Z^{(i)})\right| < \epsilon,$$
or $r \geq r_{max}$, stop; otherwise, set $r = r + 1$ and return to the second step.
The k-means clustering method is adopted in the first step of Algorithm 1 since samples that are close to each other are likely to belong to the same component. In the second and third steps of Algorithm 1, the split candidate set $\{t\}$ and the merge candidate set $\{u, s\}$ are sorted by the split and merge criteria, respectively. By renumbering the merge and split candidate sets, we obtain the candidate set $\{u, s, t\}$. In the fifth step of Algorithm 1, we perform the partial CEM algorithm to retrain the parameters of the new components while ensuring that all other components are not affected by this retraining. The CEM algorithm is performed as a full training procedure for all components in the sixth step of Algorithm 1. In the seventh step of Algorithm 1, it is obvious that an accepted split or merge operation increases the value of the approximated Q-function in each iteration. We set the hyperparameters $C$, $J$, and $D$ according to the best RMSE (the root mean square error (RMSE) is used to characterize the accuracy of the model, defined mathematically by $\sqrt{\sum_{n=1}^{N} (y_n - \hat{y}_n)^2 / N}$, where $\hat{y}_n$ is the estimate of $y_n$). In Algorithm 1, components of MWGP with poor aggregation are divided in the split operation, and those with high similarity are combined in the merge operation. Simultaneous split and merge operations can perform a global search by crossing over low-likelihood positions.
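As an illustration of the reassignment performed in the fourth step of Algorithm 1, the following Python sketch (assuming SciPy's k-means; the function name and the integer-label encoding of $Z$ are ours, not the paper's) relabels the samples of the merged component and splits the chosen component with k-means.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def reassign_after_merge_split(z, X, u, s, t, seed=0):
    """Step 4 of Algorithm 1: samples of the old component s join u, and the old
    component t is split into two clusters; the first cluster is relabelled s.
    z is an N-vector of component indices, X is the N x E input matrix."""
    z_new = z.copy()
    z_new[z == s] = u                              # merged component keeps the label u
    idx = np.where(z == t)[0]
    _, labels = kmeans2(X[idx], 2, minit='++', seed=seed)
    z_new[idx[labels == 0]] = s                    # first cluster becomes the new component s
    return z_new                                   # second cluster keeps the label t
```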
In the second and third steps of Algorithm 2, $\Phi$ and $Z$ are updated alternately. Samples are classified into the $C$ components in the third step of Algorithm 2 to overcome the time complexity of the conventional EM algorithm. In the fourth step of Algorithm 2, we apply a relatively long-term convergence criterion,
$$\left[\sum_{i=r-4}^{r} L(\Phi^{(i)}, Z^{(i)}) - \sum_{i=r-9}^{r-5} L(\Phi^{(i)}, Z^{(i)})\right] \Big/ \left|\sum_{i=r-9}^{r-5} L(\Phi^{(i)}, Z^{(i)})\right| < \epsilon,$$
since $L(\Phi^{(r)}, Z^{(r)})$ may fluctuate during iterations; we also set the maximum number of iterations $r_{max} = 30$ and $\epsilon = 0.002$. Regarding the annealing mechanism, Algorithm 2 can be viewed as a deterministic annealing EM algorithm with the annealing parameter tending to positive infinity, while the conventional EM algorithm can be viewed as a deterministic annealing EM algorithm with the annealing parameter being one. Theoretically, Algorithm 2 is more likely to fall into a local optimum than the conventional EM algorithm. The details of the CEM algorithm are described in Appendix A.
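A minimal Python sketch of this windowed stopping rule (our own helper, under the stated settings $r_{max} = 30$ and $\epsilon = 0.002$) is:

```python
def converged(L_hist, r, eps=0.002, r_max=30):
    """Windowed stopping rule of Algorithms 2 and 3: compare the sum of the last five
    values of the approximated Q-function with the sum of the previous five values,
    relative to the magnitude of the latter. L_hist[i] stores L(Phi^(i), Z^(i))."""
    if r >= r_max:
        return True
    if r <= 9:                                   # the rule needs the values for i = r-9, ..., r
        return False
    recent = sum(L_hist[r - 4:r + 1])            # i = r-4, ..., r
    older = sum(L_hist[r - 9:r - 4])             # i = r-9, ..., r-5
    return (recent - older) / abs(older) < eps
```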
Algorithm 3 is performed on the new components generated by the simultaneous split and merge operations. In the second and third steps of Algorithm 3, $\Phi_c^{(r)}$ for $c \in \{u, s, t\}$ and $\mathbf{z}_n^{(r)}$ are updated alternately, where $n \in \{n \mid \mathbf{z}_n^{(r-1)} = \mathbf{e}_c,\ c \in \{u, s, t\}\}$. In the third step of Algorithm 3, $\mathbf{z}_n^{(r)}$ for $n \in \{n \mid \mathbf{z}_n^{(r-1)} = \mathbf{e}_c,\ c \in \{u, s, t\}\}$ is obtained starting from the initialized $Z^{(3r-1)}$, while $\mathbf{z}_n^{(r)} = \mathbf{z}_n^{(r-1)}$ is set for the other components. The details of the partial CEM algorithm are described in Appendix B.

4.2. Prediction Strategy

For MWGP, the time complexity of the conventional prediction method is generally high. We adopted the classification approximation method for MWGP to improve the predictive efficiency. In this prediction, the mean predictive output is used since the RMSE is adopted as the accuracy measure. In this paper, we used a predictive strategy similar to that of the MGP (or ME), i.e., the weighted prediction. A test sample was put into each WGP expert to calculate the predictive distribution individually, and then these predictive distributions were weighted and averaged according to the posterior probability to obtain the overall predictive distribution.
For the test sample $\mathbf{x}_{N+1}$ in the $c$-th component, the predictive distribution in the latent space of the WGP is a standard GP:
$$p(y_{N+1,c} \mid \mathbf{x}_{N+1,c}, \mathcal{D}, \theta_c) = \mathcal{N}(\tilde{y}_{N+1,c}, \sigma_{N+1,c}^2). \tag{7}$$
By applying the nonlinear transformation to Equation (7), the predictive distribution in the observation space is calculated by
$$p(w_{N+1,c} \mid \mathbf{x}_{N+1,c}, \mathcal{D}, \theta_c, \Omega_c) = \frac{\partial g(w_{N+1,c}; \Omega_c)}{\partial w_{N+1,c}} \frac{1}{\sqrt{2\pi\sigma_{N+1,c}^2}} \exp\!\left[-\frac{\left(g(w_{N+1,c}; \Omega_c) - \tilde{y}_{N+1,c}\right)^2}{2\sigma_{N+1,c}^2}\right]. \tag{8}$$
Compared to the shape of the predictive distribution in Equation (7), the shape of the predictive distribution in Equation (8) is generally asymmetric and multimodal. By integrating over $w_{N+1,c}$ in Equation (8), the mean predictive output of the WGP in the observation space is obtained by
$$E(w_{N+1,c}) = \int w_{N+1,c}\, p(w_{N+1,c} \mid \mathbf{x}_{N+1,c}, \mathcal{D}, \theta_c, \Omega_c)\, dw_{N+1,c} = \int g^{-1}(y_{N+1,c}; \Omega_c)\, \mathcal{N}(\tilde{y}_{N+1,c}, \sigma_{N+1,c}^2)\, dy_{N+1,c}, \tag{9}$$
where $g^{-1}(\cdot\,; \Omega_c)$ is the inverse of $g(\cdot\,; \Omega_c)$. A closed-form solution of $g^{-1}(\cdot\,; \Omega_c)$ is generally difficult to obtain, so we used the Newton–Raphson method to calculate it. Since Equation (9) is a one-dimensional integral with respect to the Gaussian density function, it can be solved accurately by the Gauss–Hermite quadrature method.
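The following Python sketch (our own illustration, assuming SciPy and the hypothetical `warp` helper from the earlier sketch; a bracketing root finder is used here in place of the paper's Newton–Raphson step) approximates Equation (9) with Gauss–Hermite quadrature:

```python
import numpy as np
from scipy.optimize import brentq

def predictive_mean_w(y_mean, y_var, a, h, l, n_quad=20):
    """Approximate E(w) = integral of g^{-1}(y) N(y | y_mean, y_var) dy by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    y = y_mean + np.sqrt(2.0 * y_var) * nodes          # change of variables for the e^{-t^2} weight
    # invert the monotone warp at each quadrature node (bracket assumed wide enough)
    g_inv = np.array([brentq(lambda t, yi=yi: warp(t, a, h, l) - yi, -50.0, 50.0) for yi in y])
    return np.sum(weights * g_inv) / np.sqrt(np.pi)
```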
Finally, the overall mean predictive output for MWGP is given based on Equation (9) by
$$\hat{w}_{N+1} = \sum_{c=1}^{C} \hat{z}_{N+1,c}\, E(w_{N+1,c}), \tag{10}$$
where $\hat{\mathbf{z}}_{N+1} = \arg\max_{\mathbf{z}_{N+1}} P(\mathbf{z}_{N+1} \mid \Phi)\, p(\mathbf{x}_{N+1} \mid \mathbf{z}_{N+1}, \Phi)$.
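Because $\hat{\mathbf{z}}_{N+1}$ is a hard assignment, the weighted sum above reduces to picking one component. A small Python sketch of this selection (our own helper, assuming SciPy) is:

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_component(x_new, eta, alpha, Sigma):
    """Hard assignment of a test input: argmax_c eta_c * N(x_new | alpha_c, Sigma_c)."""
    scores = [eta[c] * multivariate_normal.pdf(x_new, mean=alpha[c], cov=Sigma[c])
              for c in range(len(eta))]
    return int(np.argmax(scores))
```

The overall prediction $\hat{w}_{N+1}$ is then the predictive mean $E(w_{N+1,c})$ of the selected component, e.g., computed with the `predictive_mean_w` sketch above.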

5. Experimental Results

In this section, we show the experimental results of MWGP on synthetic datasets and three types of real-world datasets. The experiments were conducted on a personal computer equipped with a 2.9 GHz Intel Core i7 CPU and 16.00 GB of RAM, using Matlab R2019b.

5.1. Comparative Models

Models and related algorithms are described in Table 1, where the GP, support vector machine (SVM), and feedforward neural network (FNN) are comparative models. The GPML, SVM, and FNN toolboxes in Matlab R2019b were adopted in our experiments. For MWGP II, MWGP I, and WGP, we chose an optimal value of $J$ to avoid overfitting while balancing accuracy and efficiency. The RMSE and the MAE (note that the mean absolute error (MAE) is used to describe the sensitivity of the model to outliers, defined mathematically by $\sum_{n=1}^{N} |y_n - \hat{y}_n| / N$, where $\hat{y}_n$ is the estimate of $y_n$) are used to assess the performance on the real-world datasets.

5.2. Synthetic Datasets of MWGP I

To test the consistency of MWGP I, we generated 10 typical synthetic datasets, denoted by $S_1, S_2, \ldots, S_{10}$, using the MGP model with component number $C = 2$ and input dimension $E = 1$. $S_1$ is the original dataset, containing 300 training samples and 600 test samples. In each dataset, samples are warped by a monotonic function $g(\cdot\,; \Omega_c)$. The number of neurons $J$ is set to 2, and $\Omega_c$ is randomly generated from a Gaussian distribution. The main parameters of MWGP I on $S_1$ are shown in Table 2. The settings in which the other datasets differ from $S_1$ are listed as follows.
  • $S_2$ (a low noise dataset): $\sigma_1 = \sigma_2 = 0.0200$.
  • $S_3$ (a high noise dataset): $\sigma_1 = \sigma_2 = 0.5000$.
  • $S_4$: $\gamma_1^{(1)} = 0.0707$, $\gamma_2^{(1)} = 0.5000$.
  • $S_5$: $\gamma_1^{(1)} = 0.2828$, $\gamma_2^{(1)} = 2.0000$.
  • $S_6$ (a short length-scale dataset): $\gamma_1^{(2)} = 0.8165$, $\gamma_2^{(2)} = 6.3246$.
  • $S_7$ (a long length-scale dataset): $\gamma_1^{(2)} = 0.2041$, $\gamma_2^{(2)} = 1.5811$.
  • $S_8$ (a medium overlapping dataset): $\Sigma_1^{1/2} = 2.1213$, $\Sigma_2^{1/2} = 3.1820$.
  • $S_9$ (a large overlapping dataset): $\Sigma_1^{1/2} = 3.0000$, $\Sigma_2^{1/2} = 4.5000$.
  • $S_{10}$ (an unbalanced dataset): $\eta_1 = 0.2500$, $\eta_2 = 0.7500$.
The real parameters (RPs), average estimated parameters (AEPs), and standard deviations of estimated parameters (SDEPs) obtained by MWGP I on S 1 are listed in Table 2, where the AEPs obtained by MWGP I are similar to the related RPs and the related SDEPs are generally small. As a result, the parameter estimate of MWGP I is practically unbiased and effective.
The predictive results of MWGP I and MGP on $S_1$ are presented in Figure 5a. The figure suggests that MWGP I outperforms MGP in the flat zone, specifically in the interval (10.8, 11.7). In Figure 5b, the predictive probability density of MWGP I is asymmetrical across the whole distribution, but the predictive probability density of the MGP is symmetrical even though it is calculated using the warped samples. The warping functions learned by MWGP I on $S_1$ are shown in Figure 6. The warping function learned for the first component in Figure 6a is linear-like, while the warping function learned for the second component in Figure 6b is power-like, with an order between 0 and 1. It can be seen that MWGP I is flexible enough to handle non-stationarity across different regions of a multimodal dataset.
In Table 3, the average predicted RMSEs, SDs of predicted RMSEs, p-values [52] of predicted RMSEs, and average running times for MWGP I and the other models on $S_1, S_2, \ldots, S_{10}$ are illustrated, where each p-value compares MWGP I with the corresponding model. The prediction accuracies of MWGP I and MGP are better than those of the other models due to the mixture structure, and the prediction accuracy of MWGP I is the best of all. MWGP I is superior to MGP in accuracy because of the warping function. Although the SDEPs of $\gamma_c^{(1)}$ and $\gamma_c^{(2)}$ are larger than those of the other parameters, the predicted results of MWGP I are accurate on $S_1$. Thus, MWGP I is robust to the estimates of $\gamma_c^{(1)}$ and $\gamma_c^{(2)}$. The SDs of the predicted RMSEs for MWGP I and the other models are small. From the p-values, the predicted RMSE of MWGP I differs significantly from those of the other models, except for MGP on $S_3$ and $S_5$. Thus, our proposed MWGP I is effective, and it can optimize all parameters jointly.

5.3. Synthetic Datasets of MWGP II

MWGP I can fall into a local optimum in some cases; we propose MWGP II as a solution to this issue. To verify the consistency of MWGP II, we generated 6 typical synthetic datasets $S_{11}, S_{12}, \ldots, S_{16}$ by MGP with $C = 5$ and $E = 2$. In $S_{11}$, there are 750 training samples and 1500 test samples. In each dataset, samples are warped by $g(\cdot\,; \Omega_c)$. We set $J = 3$ and generated $\Omega_c$ at random from a Gaussian distribution for these synthetic datasets. The main parameters of MWGP II on $S_{11}$ are shown in Table 4. The settings in which the other datasets differ from $S_{11}$ are listed as follows.
  • $S_{12}$ (a noise dataset): $\sigma_1 = \sigma_3 = \sigma_5 = 0.5000$, and $\sigma_2 = \sigma_4 = 0.1000$.
  • $S_{13}$: $\gamma_1^{(1)} = 0.2828$, $\gamma_2^{(1)} = 2.0000$, $\gamma_3^{(1)} = 0.8944$, $\gamma_4^{(1)} = 0.7071$, and $\gamma_5^{(1)} = 0.5477$.
  • $S_{14}$ (a length-scale dataset): $\gamma_1^{(2)} = 0.2041$, $\gamma_2^{(2)} = 1.5811$, $\gamma_3^{(2)} = 1.2910$, $\gamma_4^{(2)} = 0.5000$, and $\gamma_5^{(2)} = 1.5811$.
  • $S_{15}$ (an overlapping dataset): $\Sigma_1^{1/2} = [3.0000, 2.4000; 2.4000, 3.0000]$, $\Sigma_2^{1/2} = [4.5000, 3.8730; 3.8730, 4.5000]$, $\Sigma_3^{1/2} = [3.0000, 0.0000; 3.0000, 0.0000]$, $\Sigma_4^{1/2} = [4.5000, 3.8730; 3.8730, 4.5000]$, $\Sigma_5^{1/2} = [3.0000, 2.4000; 2.4000, 3.0000]$.
  • $S_{16}$ (an unbalanced dataset): $\eta_1 = 1/9$, $\eta_2 = 3/9$, $\eta_3 = 1/9$, $\eta_4 = 3/9$, and $\eta_5 = 1/9$.
The average ALLFs (approximated log-likelihood functions, i.e., values of the approximated Q-function after convergence of the SMCEM algorithm) and the average running times of MWGP II and MWGP I on $S_{11}, S_{12}, \ldots, S_{16}$ are shown in Table 5. On these synthetic datasets, the average ALLF of MWGP II is larger than that of MWGP I, so MWGP II overcomes the local optimum of MWGP I. The average running time of MWGP II is longer than that of MWGP I since the partial CEM algorithm and the CEM algorithm are performed several times during the training of MWGP II. It can be concluded from the above discussion that our proposed MWGP II is effective.

5.4. Toy and Motorcycle Datasets

Toy data [7,27] and motorcycle data [6,8,27] were used to test the performance of the MGP. We tested the consistency of our proposed MWGP II and MWGP I on the toy dataset $S_{17}$ and the motorcycle dataset $S_{18}$. $S_{17}$ consisted of four components generated by four continuous functions, i.e.,
$$y_1 = 0.25 x_1^2 - 40 + 7\epsilon, \quad y_2 = 0.0625 (x_2 - 18)^2 + 0.5 x_2 + 20 + 7\epsilon, \quad y_3 = 0.008 (x_3 - 60)^2 - 70 + 2\epsilon, \quad y_4 = \sin(x_4) - 6 + 2\epsilon,$$
where $x_1 \in (0, 15)$, $x_2 \in (35, 60)$, $x_3 \in (45, 80)$, $x_4 \in (80, 100)$, and $\epsilon \sim \mathcal{N}(0, 1)$, as shown in Figure 7a. In each component, there are 50 training samples and 50 test samples. We set $J = 2$ and $C = 4$ for $S_{17}$.
$S_{18}$ contains the accelerometer readings recorded at 133 time points during an experiment evaluating the effectiveness of crash helmets. In $S_{18}$, samples belong to three components along the time axis (milliseconds), i.e., $(2.4, 11.4]$, $(11.4, 40.4]$, and $(40.4, 57.6]$, as shown in Figure 7b. We performed 7-fold cross-validation on this dataset, with the $k$-th fold consisting of the samples $\{(x_n, y_n) : n = 7i + k,\ i = 0, 1, \ldots, 18\}$; in each fold, 19 samples were used as the test set and the remaining samples as the training set. We set $J = 2$ and $C = 3$ for $S_{18}$.
We compared the MWGP II, MWGP I, MGP, WGP, GP, FNN, and SVM models on $S_{17}$ and $S_{18}$. The average predicted RMSEs, average predicted MAEs, and average running times of these models are listed in Table 6. On these datasets, MWGP II and MWGP I are more accurate than the other models, and MWGP II is more accurate than MWGP I. The average predicted RMSE and average predicted MAE of the MGP are larger than those of a single GP and the WGP since the data in $S_{17}$ and $S_{18}$ are highly multimodal and non-stationary. The FNN and SVM can hardly fit $S_{17}$ and $S_{18}$ accurately. In Figure 7a, MWGP II and MWGP I are better than the MGP, for example on the interval $(80, 100)$; in Figure 7b, MWGP II and MWGP I are better than the MGP, for example on the interval $(2.4, 11.4]$. Consequently, both MWGP II and MWGP I are effective for these tasks, and MWGP II can overcome the local optimum of MWGP I at the expense of only a little extra time on these datasets. In summary, the preprocessing transformation is critical for the MGP on the toy and motorcycle datasets.

5.5. River-flow Datasets

We conducted experiments on ten river-flow datasets $S_{19}, S_{20}, \ldots, S_{28}$ [53] to verify the consistency of our proposed MWGP II and MWGP I. In each dataset, about 40 years (i.e., from 1920 to 1960) of monthly river flow for rivers in the USA (such as the Current River, the Mad River, the Madison River, and the Mackenzie River) were recorded. There are approximately 155 training samples and 313 test samples in each dataset. For the river-flow datasets, there is minimal correlation between the prediction accuracy and the value of $C$. We set $J = 2$ and $C = 4$ for these datasets.
For comparison, MWGP II, MWGP I, MGP, WGP, GP, FNN, and SVM were evaluated on $S_{19}, S_{20}, \ldots, S_{28}$. The average predicted RMSEs (cubic meters/second), average predicted MAEs, and average running times of these models are recorded in Table 6. From this table, MWGP II and MWGP I achieve smaller average predicted RMSEs and MAEs than the other models, and the average predicted RMSE and MAE of MWGP II are smaller than those of MWGP I. Although MWGP I has the same accuracy as MWGP II on $S_{25}, S_{26}, \ldots, S_{28}$, it is more efficient than MWGP II. Based on the above analysis, our proposed MWGP II and MWGP I are effective for processing the river-flow datasets, which demonstrates that the nonlinear transformation is useful for this type of data. Additionally, MWGP II can overcome the local optimum of MWGP I at a minimal extra computational cost on some datasets.

6. Conclusions and Discussion

In this paper, we demonstrate that the MWGP model is a valuable generalization of the MGP and WGP models, and it is well suited for solving non-stationary probabilistic regression. From another point of view, the standard preprocessing transformation in MWGP can be learned adaptively and improved upon. We show that simultaneous split and merge operations are able to eliminate the component differences between the two regions to avoid the local optimum of the CEM algorithm for MWGP. Experimental results on synthetic and real-world datasets show that our proposed MWGP trained by the CEM algorithm as well as MWGP trained by the SMCEM algorithm are effective.
For MWGP, the actual number of WGP components is generally difficult to learn due to the correlation among outputs. In future work, we will focus on learning C for MWGP. Moreover, for probabilistic regression models, there are likely outliers in the observations that deviate significantly from the other samples. We will consider the robustness for MWGP based on the robustness of the MGP.

Author Contributions

Conceptualization, Y.X. and D.W.; methodology, Y.X., D.W. and Z.Q.; data simulation and experiment, Y.X. and D.W.; writing—original draft, Y.X.; writing—review and editing, D.W.; supervision, D.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (62006149), the Natural Science Foundation of Shaanxi Province (2020JQ-403), and the Foundation of Shaanxi Educational Committee (18JK0792).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The motorcycle data and the river-flow data that support the findings of this study are available at https://doi.org/10.1111/j.2517-6161.1985.tb01327.x; https://doi.org/10.2307/1403750 (accessed on 9 June 2022). All other datasets are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Nomenclature

AEP: Average estimated parameter
ALLF: Approximated log-likelihood function
CEM: Classification expectation–maximization (also called hard-cut expectation–maximization or hard expectation–maximization)
EM: Expectation–maximization
FNN: Feedforward neural network
GP: Gaussian process
LOOCV: Leave-one-out cross-validation
MAE: Mean absolute error
MAP: Maximum a posteriori
MCMC: Markov chain Monte Carlo
ME: Mixture of experts
MGP: Mixture of Gaussian processes
MLE: Maximum likelihood estimation
MWGP: Mixture of warped Gaussian processes
RMSE: Root mean square error
RP: Real parameter
SDEP: Standard deviation of the estimated parameter
SD: Standard deviation
SMCEM: Split and merge classification expectation–maximization
SVM: Support vector machine
VB: Variational Bayesian
WGP: Warped Gaussian process

Appendix A. Details of the CEM Algorithm

Appendix A.1. The Derivation of the Q-Function and Details of the Approximated MAP Principle

Denote $g(\mathbf{w}_c; \Omega_c) = [g(w_n; \Omega_c) \mid \mathbf{z}_n = \mathbf{e}_c;\ n = 1, 2, \ldots, N]$ as the $N_c \times 1$ function vector of the $c$-th component, $K_c = [K(\mathbf{x}_n, \mathbf{x}_{\tilde{n}}; \gamma_c) \mid \mathbf{z}_n = \mathbf{z}_{\tilde{n}} = \mathbf{e}_c;\ n, \tilde{n} = 1, 2, \ldots, N]$ as the $N_c \times N_c$ covariance matrix of the $c$-th component, and $\sum_{n=1}^{N_c} \ln\!\left(\partial g(w_n; \Omega_c)/\partial w_n\right)$ as a Jacobian term. The total log-likelihood function of MWGP is given by
$$L(\Phi, Z) = \ln p(\mathcal{D}, Z \mid \Phi) = \sum_{c=1}^{C} L_c(\Phi_c, Z) = \sum_{c=1}^{C} \left\{ \sum_{n=1}^{N} z_{nc} \left[ \ln \eta_c + \ln p(\mathbf{x}_n \mid \mathbf{z}_n = \mathbf{e}_c) \right] + \ln p(\mathbf{w}_c \mid X_c, \theta_c, \Omega_c) \right\}, \tag{A1}$$
where $\ln p(\mathbf{w}_c \mid X_c, \theta_c, \Omega_c)$ is the log-likelihood function of the $c$-th WGP, given by
$$\ln p(\mathbf{w}_c \mid X_c, \theta_c, \Omega_c) = -\frac{1}{2}\left[ N_c \ln 2\pi + g(\mathbf{w}_c; \Omega_c)^{T} \left(K_c + \sigma_c^2 I_{N_c}\right)^{-1} g(\mathbf{w}_c; \Omega_c) + \ln \left|K_c + \sigma_c^2 I_{N_c}\right| \right] + \sum_{n=1}^{N_c} \ln \frac{\partial g(w_n; \Omega_c)}{\partial w_n}.$$
The Q-function of the conventional EM algorithm of MWGP is obtained by taking the expectation of Equation (A1) with respect to $Z$:
$$Q(\Phi \mid \Phi^{(r-1)}) = E_{Z}\left[ L(\Phi, Z) \mid \mathcal{D}, \Phi^{(r-1)} \right] = \sum_{Z} P(Z \mid \mathcal{D}, \Phi^{(r-1)})\, L(\Phi, Z). \tag{A2}$$
The posterior probability $P(\mathbf{z}_n \mid \mathbf{x}_n, w_n, \Phi^{(r)})$ is calculated by
$$P(\mathbf{z}_n \mid \mathbf{x}_n, w_n, \Phi^{(r)}) \propto p(\mathbf{x}_n, w_n, \mathbf{z}_n \mid \Phi^{(r)}) = P(\mathbf{z}_n \mid \Phi^{(r)})\, p(\mathbf{x}_n \mid \mathbf{z}_n, \Phi^{(r)})\, p(w_n \mid \mathbf{x}_n, \mathbf{z}_n, \Phi^{(r)}) = P(\mathbf{z}_n \mid \Phi^{(r)})\, p(\mathbf{x}_n \mid \mathbf{z}_n, \Phi^{(r)})\, p\!\left(g(w_n; \Omega^{(r)}) \mid \mathbf{x}_n, \mathbf{z}_n, \Phi^{(r)}\right) \frac{\partial g(w_n; \Omega^{(r)})}{\partial w_n}, \tag{A3}$$
where $\partial g(w_n; \Omega^{(r)})/\partial w_n$ is the Jacobian term. Since the number of possible $Z$ is $C^N$, the time complexity of calculating Equation (A2) is $O(C^N)$. Thus, the classification approximation is adopted for calculating Equation (A2), and we then obtain the approximated Q-function for the CEM algorithm:
$$L(\Phi, Z^{(r)}) = \sum_{c=1}^{C} L_c(\Phi_c, Z^{(r)}), \tag{A4}$$
where $Z^{(r)}$ is calculated by an approximation of the MAP method, i.e., $Z^{(r)} = \arg\max_{Z} P(Z \mid \mathcal{D}, \Phi^{(r)})$:
$$\mathbf{z}_n^{(r)} = \arg\max_{\mathbf{z}_n} P(\mathbf{z}_n \mid \mathbf{x}_n, w_n, \Phi^{(r)}) = \arg\max_{\mathbf{z}_n} P(\mathbf{z}_n \mid \Phi^{(r)})\, p(\mathbf{x}_n \mid \mathbf{z}_n, \Phi^{(r)})\, p\!\left(g(w_n; \Omega^{(r)}) \mid \mathbf{x}_n, \mathbf{z}_n, \Phi^{(r)}\right) \frac{\partial g(w_n; \Omega^{(r)})}{\partial w_n}. \tag{A5}$$
In Equation (A5), $P(\mathbf{z}_n \mid \mathbf{x}_n, w_n, \Phi^{(r)})$ is derived by Equation (A3).

Appendix A.2. Details for Maximizing the Approximated Q-Function

The parameters $\theta_c$ and $\Omega_c$ are updated jointly by the conjugate gradient method, inherited from the training of the WGP.
The parameters $\eta_c$, $\alpha_c$, and $\Sigma_c$ are solved analytically as follows. By adopting the Lagrange multiplier method under the constraint $\sum_{c=1}^{C} \eta_c = 1$, we have
$$\eta_c = N_c \Big/ \sum_{c=1}^{C} N_c.$$
Let $\partial L(\Phi, Z^{(r-1)})/\partial \alpha_c = 0$ and $\partial L(\Phi, Z^{(r-1)})/\partial \Sigma_c = 0$. Then, we have
$$\alpha_c = \sum_{n=1}^{N} z_{nc}\, \mathbf{x}_n \Big/ N_c, \qquad \Sigma_c = \sum_{n=1}^{N} z_{nc}\, (\mathbf{x}_n - \alpha_c)(\mathbf{x}_n - \alpha_c)^{T} \Big/ N_c.$$
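A compact Python sketch of these closed-form updates (our own helper; it encodes the hard assignments as an $N \times C$ zero–one matrix rather than the paper's $C \times N$ convention, and assumes every component is non-empty) is:

```python
import numpy as np

def m_step_gating(X, Z):
    """Closed-form updates of eta_c, alpha_c, Sigma_c from hard assignments Z (N x C, 0/1)."""
    Nc = Z.sum(axis=0)                                   # samples per component, N_c
    eta = Nc / Nc.sum()
    alpha = (Z.T @ X) / Nc[:, None]                      # component means alpha_c
    Sigma = []
    for c in range(Z.shape[1]):
        d = X[Z[:, c] == 1] - alpha[c]
        Sigma.append(d.T @ d / Nc[c])                    # component covariances Sigma_c
    return eta, alpha, np.array(Sigma)
```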

Appendix B. Details of the Partial CEM Algorithm

Appendix B.1. Details of Maximizing the Approximated Q-Function of the Partial CEM Algorithm

The approximated Q-function in Equation (A4) is equal to
$$L(\Phi, Z^{(r)}) = \sum_{c \in \{u, s, t\}} L_c(\Phi_c, Z^{(r)}) + \sum_{c \notin \{u, s, t\}} L_c(\Phi_c, Z^{(r)}),$$
where only the first three terms, i.e., $\sum_{c \in \{u, s, t\}} L_c(\Phi_c, Z^{(r)})$, are maximized in the partial CEM algorithm. The details for maximizing $\sum_{c \in \{u, s, t\}} L_c(\Phi_c, Z^{(r)})$ are described in Appendix A.2.

Appendix B.2. Details of the Approximated MAP Principle of the Partial CEM Algorithm

When $\mathbf{z}_n^{(r-1)} = \mathbf{e}_c$ with $c \in \{u, s, t\}$, $\mathbf{z}_n^{(r)}$ is obtained by the approximated MAP method:
$$\mathbf{z}_n^{(r)} = \arg\max_{\mathbf{z}_n \in \{\mathbf{e}_u, \mathbf{e}_s, \mathbf{e}_t\}} P(\mathbf{z}_n \mid \mathbf{x}_n, w_n, \Phi_c^{(r)}) = \arg\max_{\mathbf{z}_n \in \{\mathbf{e}_u, \mathbf{e}_s, \mathbf{e}_t\}} P(\mathbf{z}_n = \mathbf{e}_c \mid \Phi_c^{(r)})\, p(\mathbf{x}_n \mid \mathbf{z}_n = \mathbf{e}_c, \Phi_c^{(r)})\, p\!\left(g(w_n; \Omega_c^{(r)}) \mid \mathbf{x}_n, \mathbf{z}_n = \mathbf{e}_c, \Phi_c^{(r)}\right) \frac{\partial g(w_n; \Omega_c^{(r)})}{\partial w_n},$$
where $P(\mathbf{z}_n \mid \mathbf{x}_n, w_n, \Phi_c^{(r)})$ is derived by Equation (A3).

Appendix C. Split and Merge Criteria

Since there are many possible candidate sets, specific and reasonable criteria are necessary to speed up the SMCEM algorithm.
The merge criterion is defined by
$$F_{merge}(u, s) = \cos\langle \mathbf{p}_u, \mathbf{p}_s \rangle = \mathbf{p}_u^{T}\, \mathbf{p}_s \,/\, \left(\|\mathbf{p}_u\|\, \|\mathbf{p}_s\|\right),$$
where $\|\cdot\|$ is the Euclidean norm and $\mathbf{p}_c$ denotes the $N \times 1$ vector $[P(\mathbf{z}_1 = \mathbf{e}_c \mid \mathbf{x}_1, w_1, \Phi_c^{(r)}), P(\mathbf{z}_2 = \mathbf{e}_c \mid \mathbf{x}_2, w_2, \Phi_c^{(r)}), \ldots, P(\mathbf{z}_N = \mathbf{e}_c \mid \mathbf{x}_N, w_N, \Phi_c^{(r)})]^{T}$, in which $P(\mathbf{z}_n = \mathbf{e}_c \mid \mathbf{x}_n, w_n, \Phi_c^{(r)})$ is derived by Equation (A3). The pair of components with the largest $F_{merge}(u, s)$, where $u \neq s$, is used for merging.
The split criterion is defined by
$$F_{split}(t) = L_t(\Phi_t, Z^{(r)}) / N_t.$$
The component with the smallest $F_{split}(t)$ is used for splitting.
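As an illustration, the following Python sketch (our own helper; the handling of a possible collision between the split candidate $t$ and the merge pair is left to the candidate renumbering described in Algorithm 1) evaluates both criteria for all components at once:

```python
import numpy as np

def select_merge_split(P, L_c, Nc):
    """P: N x C matrix whose columns are the posterior vectors p_c; L_c and Nc hold the
    per-component approximated log-likelihoods L_c(Phi_c, Z) and sample counts N_c."""
    norms = np.linalg.norm(P, axis=0)
    cos = (P.T @ P) / np.outer(norms, norms)       # F_merge(u, s) for all pairs
    np.fill_diagonal(cos, -np.inf)                 # exclude u == s
    u, s = np.unravel_index(np.argmax(cos), cos.shape)
    t = int(np.argmin(np.asarray(L_c) / np.asarray(Nc)))   # smallest F_split(t)
    return int(u), int(s), t
```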

References

  1. Yuksel, S.E.; Wilson, J.N.; Gader, P.D. Twenty years of mixture of experts. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1177–1193. [Google Scholar] [CrossRef]
  2. Jordan, M.I.; Jacobs, R.A. Hierarchical mixtures of experts and the EM algorithm. Neural Comput. 1994, 6, 181–214. [Google Scholar] [CrossRef]
  3. Lima, C.A.M.; Coelho, A.L.V.; Zuben, F.J.V. Hybridizing mixtures of experts with support vector machines: Investigation into nonlinear dynamic systems identification. Inf. Sci. 2007, 177, 2049–2074. [Google Scholar] [CrossRef]
  4. Tresp, V. Mixtures of Gaussian processes. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Denver, CO, USA, 1 January 2000; Volume 13, pp. 654–660. [Google Scholar]
  5. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–38. [Google Scholar]
  6. Rasmussen, C.E.; Ghahramani, Z. Infinite mixture of Gaussian process experts. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 9–14 December 2002; Volume 2, pp. 881–888. [Google Scholar]
  7. Meeds, E.; Osindero, S. An alternative infinite mixture of Gaussian process experts. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 4–7 December 2005; Volume 18, pp. 883–896. [Google Scholar]
  8. Yuan, C.; Neubauer, C. Variational mixture of Gaussian process experts. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 8–11 December 2008; Volume 21, pp. 1897–1904. [Google Scholar]
  9. Brahim-Belhouari, S.; Bermak, A. Gaussian process for nonstationary time series prediction. Comput. Stat. Data Anal. 2004, 47, 705–712. [Google Scholar] [CrossRef]
  10. Pérez-Cruz, F.; Vaerenbergh, S.V.; Murillo-Fuentes, J.J.; Lázaro-Gredilla, M.; Santamaría, I. Gaussian processes for nonlinear signal processing: An overview of recent advances. IEEE Signal Process. Mag. 2013, 30, 40–50. [Google Scholar] [CrossRef]
  11. Rasmussen, C.E.; Williams, C.K.I. Gaussian Process for Machine Learning; MIT Press: Cambridge, MA, USA, 2006; Chapter 2. [Google Scholar]
  12. MacKay, D.J.C. Introduction to Gaussian processes. NATO ASI Ser. F Comput. Syst. Sci. 1998, 168, 133–166. [Google Scholar]
  13. Xu, Z.; Guo, Y.; Saleh, J.H. VisPro: A prognostic SqueezeNet and non-stationary Gaussian process approach for remaining useful life prediction with uncertainty quantification. Neural Comput. Appl. 2022, 34, 14683–14698. [Google Scholar] [CrossRef]
  14. Heinonen, M.; Mannerström, H.; Rousu, J.; Kaski, S.; Lähdesmäki, H. Non-stationary Gaussian process regression with hamiltonian monte carlo. In Proceedings of the Machine Learning Research, Cadiz, Spain, 9–11 May 2016; Volume 51, pp. 732–740. [Google Scholar]
  15. Wang, Y.; Chaib-draa, B. Bayesian inference for time-varying applications: Particle-based Gaussian process approaches. Neurocomputing 2017, 238, 351–364. [Google Scholar] [CrossRef]
  16. Rhode, S. Non-stationary Gaussian process regression applied in validation of vehicle dynamics models. Eng. Appl. Artif. Intell. 2020, 93, 103716. [Google Scholar] [CrossRef]
  17. Sun, S.; Xu, X. Variational inference for infinite mixtures of Gaussian processes with applications to traffic flow prediction. IEEE Trans. Intell. Transp. Syst. 2010, 12, 466–475. [Google Scholar] [CrossRef]
  18. Jeon, Y.; Hwang, G. Bayesian mixture of gaussian processes for data association problem. Pattern Recognit. 2022, 127, 108592. [Google Scholar] [CrossRef]
  19. Li, T.; Ma, J. Attention mechanism based mixture of Gaussian processes. Pattern Recognit. Lett. 2022, 161, 130–136. [Google Scholar] [CrossRef]
  20. Kim, S.; Kim, J. Efficient clustering for continuous occupancy mapping using a mixture of Gaussian processes. Sensors 2022, 22, 6832. [Google Scholar] [CrossRef]
  21. Tayal, A.; Poupart, P.; Li, Y. Hierarchical double Dirichlet process mixture of Gaussian processes. In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI), Toronto, ON, Canada, 22–26 July 2012; pp. 1126–1133. [Google Scholar]
  22. Sun, S. Infinite mixtures of multivariate Gaussian processes. In Proceedings of the International Conference on Machine Learning and Cybernetics (ICMLC), Tianjin, China, 14–17 July 2013; pp. 1011–1016. [Google Scholar]
  23. Kastner, M. Monte Carlo methods in statistical physics: Mathematical foundations and strategies. Commun. Nonlinear Sci. Numer. Simul. 2010, 15, 1589–1602. [Google Scholar] [CrossRef]
  24. Khodadadian, A.; Parvizi, M.; Teshnehlab, M.; Heitzinger, C. Rational design of field-effect sensors using partial differential equations, Bayesian inversion, and artificial neural networks. Sensors 2022, 22, 4785. [Google Scholar] [CrossRef] [PubMed]
  25. Noii, N.; Khodadadian, A.; Ulloa, J.; Aldakheel, F.; Wick, T.; François, S.; Wriggers, P. Bayesian inversion with open-source codes for various one-dimensional model problems in computational mechanics. Arch. Comput. Methods Eng. 2022, 29, 4285–4318. [Google Scholar] [CrossRef]
  26. Ross, J.C.; Dy, J.G. Nonparametric mixture of Gaussian processes with constraints. In Proceedings of the 30th International Conference on Machine Learning (ICML), Atlanta, GA, USA, 17–19 June 2013; pp. 1346–1354. [Google Scholar]
  27. Yang, Y.; Ma, J. An efficient EM approach to parameter learning of the mixture of Gaussian processes. In Proceedings of the Advances in International Symposium on Neural Networks (ISNN), Guilin, China, 29 May–1 June 2011; Volume 6676, pp. 165–174. [Google Scholar]
  28. Chen, Z.; Ma, J.; Zhou, Y. A precise hard-cut EM algorithm for mixtures of Gaussian processes. In Proceedings of the 10th International Conference on Intelligent Computing (ICIC), Taiyuan, China, 3–6 August 2014; Volume 8589, pp. 68–75. [Google Scholar]
  29. Celeux, G.; Govaert, G. A classification EM algorithm for clustering and two stochastic versions. Comput. Stat. Data Anal. 1992, 14, 315–332. [Google Scholar] [CrossRef]
  30. Wu, D.; Chen, Z.; Ma, J. An MCMC based EM algorithm for mixtures of Gaussian processes. In Proceedings of the Advances in International Symposium on Neural Networks (ISNN), Jeju, Republic of Korea, 15–18 October 2015; Volume 9377, pp. 327–334. [Google Scholar]
  31. Wu, D.; Ma, J. An effective EM algorithm for mixtures of Gaussian processes via the MCMC sampling and approximation. Neurocomputing 2019, 331, 366–374. [Google Scholar] [CrossRef]
  32. Ma, J.; Xu, L.; Jordan, M.I. Asymptotic convergence rate of the EM algorithm for Gaussian mixtures. Neural Comput. 2000, 12, 2881–2907. [Google Scholar] [CrossRef]
  33. Zhao, L.; Chen, Z.; Ma, J. An effective model selection criterion for mixtures of Gaussian processes. In Proceedings of the Advances in Neural Networks-ISNN, Jeju, Republic of Korea, 15–18 October 2015; Volume 9377, pp. 345–354. [Google Scholar]
  34. Ueda, N.; Nakano, R.; Ghahramani, Z.; Hinton, G.E. SMEM algorithm for mixture models. Adv. Neural Inf. Process. Syst. 1998, 11, 599–605. [Google Scholar] [CrossRef] [PubMed]
  35. Li, Y.; Li, L. A novel split and merge EM algorithm for Gaussian mixture model. In Proceedings of the International Conference on Natural Computation (ICNC), Tianjin, China, 14–16 August 2009; pp. 479–483. [Google Scholar]
  36. Zhang, Z.; Chen, C.; Sun, J.; Chan, K.L. EM algorithms for Gaussian mixtures with split-and-merge operation. Pattern Recognit. 2003, 36, 1973–1983. [Google Scholar] [CrossRef]
  37. Zhao, L.; Ma, J. A dynamic model selection algorithm for mixtures of Gaussian processes. In Proceedings of the IEEE 13th International Conference on Signal Processing (ICSP), Chengdu, China, 6–10 November 2016; pp. 1095–1099. [Google Scholar]
  38. Li, T.; Wu, D.; Ma, J. Mixture of robust Gaussian processes and its hard-cut EM algorithm with variational bounding approximation. Neurocomputing 2021, 452, 224–238. [Google Scholar] [CrossRef]
  39. Snelson, E.; Rasmussen, C.E.; Ghahramani, Z. Warped Gaussian processes. Adv. Neural Inf. Process. Syst. 2003, 16, 337–344. [Google Scholar]
  40. Schmidt, M.N. Function factorization using warped Gaussian processes. In Proceedings of the 26th International Conference on Machine Learning (ICML), Montreal, QC, Canada, 14–18 June 2009; pp. 921–928. [Google Scholar]
  41. Lázaro-Gredilla, M. Bayesian warped Gaussian processes. Adv. Neural Inf. Process. Syst. 2012, 25, 6995–7004. [Google Scholar]
  42. Rios, G.; Tobar, F. Compositionally-warped Gaussian processes. Neural Netw. 2019, 118, 235–246. [Google Scholar] [CrossRef]
  43. Zhang, Y.; Yeung, D.Y. Multi-task warped Gaussian process for personalized age estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 2622–2629. [Google Scholar]
  44. Wiebe, J.; Cecílio, I.; Dunlop, J.; Misener, R. A robust approach to warped Gaussian process-constrained optimization. Math. Program. 2022, 196, 805–839. [Google Scholar] [CrossRef]
  45. Mateo-Sanchis, A.; Muñoz-Marí, J.; Pérez-Suay, A.; Camps-Valls, G. Warped Gaussian processes in remote sensing parameter estimation and causal inference. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1647–1651. [Google Scholar] [CrossRef]
  46. Jadidi, M.G.; Miró, J.V.; Dissanayake, G. Warped Gaussian processes occupancy mapping with uncertain inputs. IEEE Robot. Autom. Lett. 2017, 2, 680–687. [Google Scholar] [CrossRef]
  47. Kou, P.; Liang, D.; Gao, F.; Gao, L. Probabilistic wind power forecasting with online model selection and warped Gaussian process. Energy Convers. Manag. 2014, 84, 649–663. [Google Scholar] [CrossRef]
  48. Gonçalves, I.G.; Echer, E.; Frigo, E. Sunspot cycle prediction using warped Gaussian process regression. Adv. Space Res. 2020, 65, 677–683. [Google Scholar] [CrossRef]
  49. Rasmussen, C.E.; Nickisch, H. Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res. 2010, 11, 3011–3015. [Google Scholar]
  50. Svozil, D.; Kvasnicka, V.; Pospichal, J. Introduction to multi-layer feedforward neural networks. Chemom. Intell. Lab. Syst. 1997, 39, 43–62. [Google Scholar] [CrossRef]
  51. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27. [Google Scholar] [CrossRef]
  52. Derrac, J.; García, S.; Molina, D.; Herrera, F. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 2011, 1, 3–18. [Google Scholar] [CrossRef]
  53. Mcleod, A.I. Parsimony, model adequacy and periodic correlation in forecasting time series. Int. Stat. Rev. 1993, 61, 387–393. [Google Scholar] [CrossRef]
Figure 1. The diagram structure of the MGP model: the lower layer consists of two GPs and the upper layer is the MGP. The curves are divided into two curve segments along the input space, marked in different colors, and the curve segments of one color correspond to one GP.
Figure 2. An example of a non-stationary data regression task. The one-dimensional data are generated by adding Gaussian noise to a sine function. The dataset contains 300 training samples and 600 test samples. These samples are then warped by the function $w = y^3$. The mean and two standard deviation (SD) bounds are represented by triplets of lines.
Figure 3. A diagram showing the relations among the main variables in the WGP model.
Figure 4. The probabilistic graphical model of the MWGP model: the elements inside the boxes are main variables and the others are parameters.
Figure 5. Fitting results of MWGP I and MGP on $S_1$: (a) predictions of MWGP I and MGP, where line triplets represent the mean and two standard deviation (SD) bounds; (b) predictive probability densities of MWGP I and MGP at $x = 11.3$.
Figure 6. Warping functions obtained by MWGP I on $S_1$: (a) the warping function learned in the first component; (b) the warping function learned in the second component.
Figure 7. (a) Predictions of MWGP II, MWGP I, and MGP on $S_{17}$; (b) predictions of MWGP II, MWGP I, and MGP on $S_{18}$.
Table 1. The symbols represent the models and related algorithms; the bold font is used for our proposed models.

| Symbol | Model | Algorithm |
| MWGP II | MWGP | SMCEM |
| MWGP I | MWGP | CEM |
| MGP [28] | MGP | CEM |
| WGP [39] | WGP | MLE |
| GP [49] | GP | |
| FNN [50] | FNN | Levenberg–Marquardt |
| SVM [51] | SVM | Sequential minimal optimization |
Table 2. RPs and AEPs with SDEPs were obtained through 150 trials using MWGP I on S1.

           | η_c    | α_c    | Σ_c^{1/2} | σ_c    | γ_c^{(1)} | γ_c^{(2)}
c = 1 RP   | 0.5000 | 3.0000 | 1.8974    | 0.1414 | 0.1414    | 0.2887
      AEP  | 0.4944 | 3.1380 | 1.9185    | 0.1449 | 0.1584    | 0.2708
      SDEP | 0.0059 | 0.0347 | 0.0588    | 0.0143 | 0.1389    | 0.2761
c = 2 RP   | 0.5000 | 10.500 | 2.8460    | 0.1414 | 1.0000    | 2.2361
      AEP  | 0.5056 | 10.688 | 2.9122    | 0.1477 | 1.2247    | 2.0492
      SDEP | 0.0059 | 0.0380 | 0.0545    | 0.0138 | 0.1427    | 0.2658
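Assuming that RP, AEP, and SDEP in Table 2 denote the real parameter values and the average and standard deviation of the estimated parameters over the 150 trials (the acronyms are not expanded in this excerpt), each AEP/SDEP entry would be obtained along the lines of the following sketch, which uses placeholder numbers.

```python
import numpy as np

# Hypothetical estimates of a single parameter (e.g. eta_1) collected over
# repeated trials; the values are placeholders, not results from the paper.
estimates = np.array([0.4951, 0.4898, 0.5010, 0.4923, 0.4938])

aep = estimates.mean()        # average of the estimated parameter over trials
sdep = estimates.std(ddof=1)  # standard deviation of the estimates over trials
print(f"AEP = {aep:.4f}, SDEP = {sdep:.4f}")
```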
Table 3. The average predicted RMSEs, SDs of predicted RMSEs, p-values of predicted RMSEs, and average running times (seconds) for MWGP I and the other models from over one hundred trials on synthetic datasets; the bold font represents the best results. Each cell lists RMSE average / RMSE SD / p-value / running time; no p-value is reported for MWGP I, so its cells list RMSE average / RMSE SD / running time.

Model   | S1                                | S2                                | S3                                | S4                                | S5
MWGP I  | 0.1024 / 0.0312 / 2.5916          | 0.0891 / 0.0641 / 2.6589          | 0.3751 / 0.0543 / 2.8216          | 0.0901 / 0.0394 / 2.5175          | 0.5077 / 0.0346 / 2.5261
MGP     | 0.1442 / 0.0217 / 0.0000 / 1.9620 | 0.1142 / 0.0638 / 0.0000 / 2.5429 | 0.3792 / 0.0473 / 0.1319 / 2.5187 | 0.1331 / 0.0418 / 0.0000 / 1.9213 | 0.5178 / 0.0415 / 0.0571 / 2.0896
WGP     | 0.2507 / 0.0242 / 0.0000 / 0.2250 | 0.1694 / 0.0505 / 0.0000 / 0.2019 | 0.4168 / 0.0096 / 0.0000 / 0.2133 | 0.1417 / 0.0437 / 0.0000 / 0.2041 | 0.5569 / 0.0310 / 0.0000 / 0.1940
GP      | 0.4083 / 0.0605 / 0.0000 / 0.1637 | 0.2867 / 0.0471 / 0.0000 / 0.1581 | 0.5434 / 0.0261 / 0.0000 / 0.1661 | 0.2146 / 0.0624 / 0.0000 / 0.1590 | 0.6676 / 0.0113 / 0.0000 / 0.1714
FNN     | 0.3715 / 0.1330 / 0.0000 / 1.9095 | 0.2695 / 0.0143 / 0.0000 / 1.6244 | 0.7014 / 0.1307 / 0.0000 / 1.7249 | 0.2928 / 0.0514 / 0.0000 / 2.4233 | 0.6749 / 0.1753 / 0.0000 / 2.1344
SVM     | 0.4605 / 0.1942 / 0.0000 / 45.236 | 0.3561 / 0.0139 / 0.0000 / 52.126 | 0.7901 / 0.1889 / 0.0000 / 39.451 | 0.3051 / 0.0539 / 0.0000 / 52.089 | 0.7295 / 0.2657 / 0.0000 / 58.141

Model   | S6                                | S7                                | S8                                | S9                                | S10
MWGP I  | 0.0807 / 0.0196 / 2.5135          | 0.2869 / 0.0318 / 2.5388          | 0.2582 / 0.0316 / 2.8099          | 0.4858 / 0.0296 / 2.7052          | 0.1816 / 0.0273 / 2.3483
MGP     | 0.1248 / 0.0197 / 0.0000 / 2.4189 | 0.3028 / 0.0528 / 0.0000 / 1.9746 | 0.2719 / 0.0372 / 0.0000 / 2.2128 | 0.5105 / 0.0343 / 0.0000 / 2.1638 | 0.2090 / 0.0337 / 0.0000 / 1.8251
WGP     | 0.1702 / 0.0527 / 0.0000 / 0.2609 | 0.3250 / 0.0492 / 0.0000 / 0.2269 | 0.3069 / 0.1058 / 0.0000 / 0.2342 | 0.5492 / 0.1279 / 0.0000 / 0.2958 | 0.2709 / 0.0581 / 0.0000 / 0.2382
GP      | 0.3456 / 0.0828 / 0.0000 / 0.1688 | 0.4787 / 0.0836 / 0.0000 / 0.1654 | 0.4352 / 0.1304 / 0.0000 / 0.1680 | 0.5596 / 0.1336 / 0.0000 / 0.2086 | 0.4416 / 0.0977 / 0.0000 / 0.1601
FNN     | 0.3230 / 0.0907 / 0.0000 / 2.0654 | 0.3720 / 0.1551 / 0.0000 / 1.8096 | 0.4527 / 0.1316 / 0.0000 / 2.2110 | 0.5496 / 0.1561 / 0.0000 / 2.1758 | 0.3768 / 0.1409 / 0.0000 / 1.6346
SVM     | 0.4915 / 0.1544 / 0.0000 / 51.590 | 0.5443 / 0.2024 / 0.0000 / 43.145 | 0.4958 / 0.1672 / 0.0000 / 55.062 | 0.5578 / 0.1736 / 0.0000 / 55.242 | 0.4675 / 0.1632 / 0.0000 / 48.347
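Tables 3 and 6 report RMSE (and MAE) values together with p-values from a statistical comparison of the models; the reference list above includes a tutorial on nonparametric tests [52], but the specific test is not stated in this excerpt. The sketch below therefore uses a Wilcoxon signed-rank test on paired per-trial RMSEs as one plausible choice, with placeholder arrays standing in for the actual per-trial results.

```python
import numpy as np
from scipy.stats import wilcoxon

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Quick demonstration of the error metrics on toy predictions.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(f"RMSE = {rmse(y_true, y_pred):.4f}, MAE = {mae(y_true, y_pred):.4f}")

# Placeholder per-trial RMSEs for two models on the same dataset; in practice each
# pair would come from one repeated train/test run.
rmse_model_a = np.array([0.102, 0.109, 0.098, 0.105, 0.101, 0.107, 0.099, 0.104])
rmse_model_b = np.array([0.141, 0.150, 0.138, 0.146, 0.143, 0.149, 0.140, 0.145])

stat, p_value = wilcoxon(rmse_model_a, rmse_model_b)  # paired, nonparametric
print(f"Wilcoxon signed-rank p-value: {p_value:.4f}")
```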
Table 4. The main parameters of MWGP II on S11.

      | η_c    | α_c^{(1)} | α_c^{(2)} | (Σ_c^{(11)})^{1/2} | (Σ_c^{(12)})^{1/2} | (Σ_c^{(21)})^{1/2} | (Σ_c^{(22)})^{1/2} | σ_c    | γ_c^{(1)} | γ_c^{(2)} | γ_c^{(3)}
c = 1 | 0.2000 | 3.0000    | 3.0000    | 1.8974             | 1.5000             | 1.5000             | 1.8974             | 0.1414 | 0.1414    | 0.2887    | 1.2910
c = 2 | 0.2000 | 10.500    | 10.500    | 2.8460             | −2.1000            | −2.1000            | 2.8460             | 0.0200 | 1.0000    | 2.2361    | 0.5000
c = 3 | 0.2000 | 18.000    | 18.000    | 1.8974             | 0.0000             | 0.0000             | 1.8974             | 0.1414 | 0.4472    | 1.8257    | 1.5811
c = 4 | 0.2000 | 25.500    | 25.500    | 2.8460             | −2.1000            | −2.1000            | 2.8460             | 0.0200 | 0.5000    | 0.7071    | 0.7071
c = 5 | 0.2000 | 33.000    | 33.000    | 1.8974             | 1.5000             | 1.5000             | 1.8974             | 0.1414 | 0.2739    | 2.2361    | 1.2910
Table 5. The average ALLFs and running times (seconds) of MWGP II and MWGP I from over one hundred trials on the synthetic datasets; the bold font represents the best results. Each cell lists the ALLF followed by the running time.

Model   | S11                   | S12                   | S13
MWGP II | 1.0867 × 10^3, 11.438 | 1.6752 × 10^3, 10.584 | 1.7330 × 10^3, 11.478
MWGP I  | 1.1678 × 10^3, 5.4443 | 1.7388 × 10^3, 3.9803 | 1.8351 × 10^3, 5.4065

Model   | S14                   | S15                   | S16
MWGP II | 1.4240 × 10^3, 10.315 | 1.5961 × 10^3, 11.527 | 1.1597 × 10^3, 12.130
MWGP I  | 1.5726 × 10^3, 3.9889 | 1.6786 × 10^3, 5.5424 | 1.2283 × 10^3, 5.7286
Table 6. The average predicted RMSEs, average predicted MAEs, and average running times (seconds) of different models from over thirty trials on the toy dataset, the motorcycle dataset, and the river-flow datasets; the bold font represents the best results. Each cell lists RMSE / MAE / running time.

Model   | S17                      | S18                      | S19                      | S20
MWGP II | 13.481 / 7.7561 / 6.9812 | 24.153 / 13.297 / 15.671 | 47.896 / 29.157 / 2.3341 | 10.425 / 5.5228 / 2.2285
MWGP I  | 14.312 / 8.1550 / 2.3733 | 25.987 / 14.309 / 6.8800 | 48.171 / 29.316 / 1.0935 | 10.668 / 5.7159 / 1.0824
MGP     | 14.714 / 8.4912 / 1.6331 | 26.370 / 14.351 / 4.4129 | 49.060 / 30.075 / 0.7358 | 11.071 / 6.0772 / 0.6139
WGP     | 14.772 / 8.4834 / 0.1231 | 29.277 / 16.579 / 0.4211 | 49.411 / 30.391 / 0.0824 | 11.654 / 6.5137 / 0.0806
GP      | 20.387 / 13.322 / 0.1070 | 26.700 / 14.725 / 0.3186 | 55.466 / 35.323 / 0.0703 | 14.174 / 8.7933 / 0.0645
FNN     | 18.004 / 11.974 / 1.6293 | 30.359 / 17.213 / 16.433 | 49.588 / 30.514 / 1.2033 | 11.669 / 6.5154 / 1.1592
SVM     | 17.267 / 11.445 / 46.223 | 29.782 / 16.885 / 182.67 | 54.627 / 34.917 / 25.333 | 12.780 / 7.4816 / 23.858

Model   | S21                      | S22                      | S23                      | S24
MWGP II | 4.5938 / 3.7556 / 2.2943 | 14.266 / 8.6631 / 2.4118 | 16.617 / 10.885 / 2.2389 | 31.171 / 22.483 / 2.2581
MWGP I  | 4.6721 / 3.8321 / 1.0644 | 14.570 / 8.9534 / 1.1831 | 16.727 / 10.988 / 1.1016 | 31.536 / 22.814 / 1.1268
MGP     | 5.1759 / 4.3327 / 0.5734 | 14.924 / 9.2859 / 0.5911 | 16.814 / 11.067 / 0.7345 | 34.575 / 26.257 / 0.7646
WGP     | 4.7084 / 3.8626 / 0.0673 | 15.318 / 9.6578 / 0.0939 | 16.728 / 10.990 / 0.0886 | 32.428 / 23.877 / 0.0819
GP      | 5.6274 / 4.6852 / 0.0587 | 16.161 / 10.527 / 0.0718 | 17.043 / 11.302 / 0.0711 | 34.813 / 26.416 / 0.0755
FNN     | 4.7079 / 3.8629 / 1.1525 | 15.599 / 9.9261 / 1.3947 | 16.738 / 10.996 / 1.1650 | 32.666 / 24.093 / 1.1725
SVM     | 4.8446 / 3.9841 / 19.240 | 16.353 / 10.673 / 26.039 | 17.415 / 11.579 / 23.659 | 33.163 / 24.552 / 23.745

Model   | S25                      | S26                      | S27                      | S28
MWGP II | 27.736 / 19.675 / 2.4640 | 30.696 / 22.095 / 2.4040 | 50.497 / 31.265 / 2.2823 | 33.789 / 24.613 / 2.2461
MWGP I  | 27.776 / 19.702 / 1.2057 | 30.708 / 22.121 / 1.1113 | 50.502 / 31.283 / 1.1090 | 33.792 / 24.624 / 1.0795
MGP     | 28.053 / 20.015 / 0.7480 | 32.061 / 23.216 / 0.7923 | 52.317 / 32.837 / 0.7065 | 34.529 / 25.277 / 0.7084
WGP     | 27.785 / 19.719 / 0.0991 | 30.712 / 22.128 / 0.0909 | 50.712 / 31.415 / 0.0857 | 34.004 / 24.858 / 0.0826
GP      | 28.310 / 20.263 / 0.0762 | 32.803 / 23.871 / 0.0756 | 52.994 / 33.478 / 0.0745 | 35.276 / 26.064 / 0.0738
FNN     | 27.921 / 19.998 / 1.1907 | 30.727 / 22.141 / 1.1980 | 50.603 / 31.357 / 1.1813 | 34.163 / 25.040 / 1.1099
SVM     | 28.098 / 20.068 / 24.062 | 31.812 / 22.844 / 27.023 | 52.920 / 33.432 / 26.387 | 35.072 / 25.821 / 20.228
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
