Article

Kernel Learning by Spectral Representation and Gaussian Mixtures

by Luis R. Pena-Llamas 1, Ramon O. Guardado-Medina 2,*, Arturo Garcia 2 and Andres Mendez-Vazquez 1

1 Department of Computer Science, Centro de Investigación y de Estudios Avanzados (CINVESTAV), Ciudad de Mexico 44960, Mexico
2 Department of Research, Escuela Militar de Mantenimiento y Abastecimiento, Universidad del Ejercito y Fuerza Aerea, Zapopan 45200, Mexico
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(4), 2473; https://doi.org/10.3390/app13042473
Submission received: 18 December 2022 / Revised: 25 January 2023 / Accepted: 30 January 2023 / Published: 14 February 2023
(This article belongs to the Collection Machine Learning in Computer Engineering Applications)

Abstract:
One of the main tasks in kernel methods is the selection of an adequate mapping into a higher-dimensional space in order to improve class separation. However, this selection tends to be time consuming, and it may not yield the best separation between classes. Therefore, there is a need for better methods that are able to extract distance and class-separation information from the data. This work presents a novel approach for learning such mappings by using locally stationary kernels, spectral representations, and Gaussian mixtures.

1. Introduction

During the 1990s, the use of kernels [1,2,3,4,5] in Machine Learning received considerable attention for its ability to improve the performance of linear classifiers. By using kernels, Support Vector Machines and other kernel methods [6,7] can classify complex data sets by implicitly mapping them into high-dimensional spaces. However, an underlying issue remains, summarized by a simple question: which kernel should be used? [8].
Kernel selection is not a small task, and it depends strongly on the problem to be solved. A first idea for selecting the best kernel is to evaluate each candidate from a small set using leave-one-out cross-validation and to keep the kernel with the best classification properties. Nevertheless, this becomes time consuming when the number of samples ranges in the thousands. A better idea is to combine kernels in order to create new kernels with better classification properties. Methods using this type of technique are called Multiple Kernel Learning (MKL) [9]. For example, Lanckriet et al. [10] use Semi-Definite Programming (SDP) to find the best conic combination of multiple kernels. However, these methods still require a pre-selected set of kernels as input. A better plan is to use the distance information contained in the class data sets. For example, Hoi et al. [11] find a kernel Gram matrix by building the Laplacian graph [12] of the data; an SDP is then applied to find the best combination of kernels.
However, none of these methods are scalable, given that their Gram matrix needs to be built explicitly: the computational complexity of building a Gram matrix is $O(N^2)$, where $N$ is the number of samples. As a possible solution, Sequential Minimal Optimization (SMO) [13] has been proposed to reduce this complexity by splitting the quadratic programming problem into a series of quadratic programming sub-problems. For example, Bach et al. [14] use an SDP setup and solve the problem with an SMO algorithm. Other techniques [15] use a random sample of the training set and an approximation to the Gram matrix to reduce the complexity to $O(m^2 N + m^2 + mN)$, where $m$ is the sub-problem size. Expanding on this idea, Rahimi and Recht [16] approximate kernel functions using samples from the spectral distribution, but only for stationary kernels. On the other hand, Ghiasi-Shirazi et al. [17] propose a method for learning $m$ stationary kernels in the MKL setting. Its main advantage is the ability to learn the $m$ kernels in an unsupervised way, while also reducing the complexity of the classifier output from $O(m \times N \times N_{SV})$ to $O(m \times N)$. Finally, Oliva et al. [18] use Bayesian methods to learn a stationary kernel in a non-parametric way.
In this work, we propose to learn a locally stationary kernel from the data (stationary kernels are a subset of the locally stationary kernels) by using a spectral representation and Gaussian mixtures [19]. This improves classification and regression by viewing the kernel as the result of a sampling process on a spectral representation. The paper is structured as follows: Section 2 presents the basic theory behind stationary and locally stationary kernels. Section 3 develops the proposed algorithm using a Fourier basis and sampling, and gives a theorem on the quality of the spectral approximation. Section 4 reports classification and regression experiments that test the robustness of the proposed algorithm. Finally, Section 5 analyzes the advantages of the proposed algorithm and possible avenues of research.

2. The Concept of Kernels

The main idea behind kernel methods is to obtain the distance between samples in a higher-dimensional space while avoiding an explicit mapping of the samples into that space and the computation of the inner product there. In other words, let $X \subseteq \mathbb{R}^D$ be the input set with $D \in \mathbb{N}$, let $K$ be a feature space, and suppose the feature mapping is defined as $\varphi: X \to K$. The kernel function $\kappa: X \times X \to \mathbb{R}$ then has the following property:
$$\kappa(x, x') = \langle \varphi(x), \varphi(x') \rangle,$$
where $x, x' \in X$. Thus, $\varphi$ and the feature space can be defined implicitly. Now, let $\{x_i\}_{i=1}^N$ be the set of samples, $\kappa: X \times X \to \mathbb{R}$ be a valid kernel, and $\langle \cdot, \cdot \rangle$ be a well-defined inner product. Then the elements of the Gram matrix $K \in \mathbb{R}^{N \times N}$ are computed using the mapping $\kappa$, $K_{ij} := \kappa(x_i, x_j)$. Given this definition, Genton [20] makes an in-depth study of the classes of kernels from a statistical perspective, i.e., of kernel functions as covariance functions. He points out that kernels have a spectral representation which can be used to represent their Gram matrix. Based on this representation, the proposed algorithm learns the Gram matrix by using a Gibbs sampler to obtain the structure of such a matrix.
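For concreteness, the sketch below shows how a Gram matrix is assembled from a kernel function; the Gaussian (RBF) kernel and its parameter are illustrative choices and not the kernels learned in this paper, and the double loop makes the $O(N^2)$ cost discussed above explicit.

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    """Example kernel kappa(x, x'): a Gaussian (RBF) kernel (illustrative choice)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def gram_matrix(X, kernel):
    """Build K with K[i, j] = kappa(x_i, x_j); this requires O(N^2) kernel evaluations."""
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.randn(100, 5)        # N = 100 samples in R^5
K = gram_matrix(X, rbf_kernel)     # 100 x 100 Gram matrix
```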

2.1. Stationary Kernels

Stationary kernels [20] are defined as $\kappa(x, x') = \kappa_s(x - x')$. An important factor in this definition is its dependency on the lag vector only; such kernels can be interpreted as generalizations of the Gaussian probability density functions used to represent distributions [15]. Additionally, Bochner [21] proved that a symmetric function $\kappa_s$ is positive definite in $\mathbb{R}^D$ if and only if it has the form:
$$\kappa_s(x - x') = \int_{\mathbb{R}^D} e^{2\pi i\, \omega^T (x - x')}\, d\mu(\omega), \qquad (1)$$
where $\mu$ is a positive finite measure. Equation (1) is called the spectral representation of $\kappa_s$. Now, suppose $\mu$ has a density $F(\omega)$ and let $\tau = x - x'$. Then it is possible to obtain:
$$\kappa(\tau) = \int F(\omega)\, e^{2\pi i\, \omega^T \tau}\, d\omega, \qquad F(\omega) = \int \kappa(\tau)\, e^{-2\pi i\, \omega^T \tau}\, d\tau.$$
In other words, the kernel function $\kappa_s$ and its spectral density $F$ are Fourier duals of each other. Furthermore, since $\kappa(0) = \int F(\omega)\, d\omega$, when $F$ is a probability measure the only remaining condition to define a valid Gaussian process is $\kappa(0) = 1$; this condition ensures that the kernel $\kappa$ and the function $f$ are correctly correlated.

2.2. Locally Stationary Kernels

Extending the previous concept, Silverman [22] defines the locally stationary kernels as:
$$\kappa(x, x') := \kappa_1\!\left(\frac{x + x'}{2}\right) \kappa_s(x - x'), \qquad (2)$$
where $\kappa_1$ is a non-negative function and $\kappa_s$ is a stationary kernel. This type of kernel increases the representational power by introducing a possible variation of the overall similarity through $\kappa_1$. Furthermore, we can see from Equation (2) that the locally stationary kernels include all stationary kernels: setting $\kappa_1(\cdot) = c$, where $c$ is a positive constant, gives $\kappa(x, x') := c\, \kappa_s(x - x')$, a positive multiple of any stationary kernel. Moreover, the variance of a locally stationary kernel is obtained by setting $x = x'$; thus, the variance is given by:
$$\kappa(x, x) = \kappa_1(x)\, \kappa_s(0) = \kappa_1(x).$$
This means that the variance of a locally stationary kernel is governed by the non-negative function $\kappa_1$.
The spectral representation of a locally stationary kernel is also given in [22], and it is defined as:
$$\kappa_1\!\left(\frac{x + x'}{2}\right) \kappa_s(x - x') = \int_{X}\!\int_{X} e^{\,i\left(\omega_1^T x - \omega_2^T x'\right)}\, f_1\!\left(\frac{\omega_1 + \omega_2}{2}\right) f_2(\omega_1 - \omega_2)\, d\omega_1\, d\omega_2.$$
Furthermore, by setting $x = x' = 0$, we get:
$$\kappa(0, 0) = \int_{X}\!\int_{X} f_1\!\left(\frac{\omega_1 + \omega_2}{2}\right) f_2(\omega_1 - \omega_2)\, d\omega_1\, d\omega_2.$$
Consequently, in order to define a locally stationary kernel, $f_1$ and $f_2$ must be integrable functions. Additionally, an important fact is that the kernel $\kappa$ has a well-defined inverse representation, given by:
$$f_1\!\left(\frac{\omega_1 + \omega_2}{2}\right) f_2(\omega_1 - \omega_2) = \frac{1}{(2\pi)^{2}} \int_{X}\!\int_{X} e^{-i\left(\omega_1^T x - \omega_2^T x'\right)}\, \kappa_1\!\left(\frac{x + x'}{2}\right) \kappa_2(x - x')\, dx\, dx'.$$
Moreover, $f_2$ is the Fourier transform of $\kappa_1$ and $f_1$ is the Fourier transform of $\kappa_2$. Thus, introducing the two dummy variables $u = (x + x')/2$ and $v = x - x'$, it is possible to obtain:
$$f_1(\omega_1) = \frac{1}{2\pi} \int_{X} e^{-i\, v^T \omega_1}\, \kappa_2(v)\, dv, \qquad f_2(\omega_2) = \frac{1}{2\pi} \int_{X} e^{-i\, u^T \omega_2}\, \kappa_1(u)\, du,$$
and
$$\kappa_1(u) = \int_{X} e^{\,i\, u^T \omega_2}\, f_2(\omega_2)\, d\omega_2, \qquad \kappa_2(v) = \int_{X} e^{\,i\, v^T \omega_1}\, f_1(\omega_1)\, d\omega_1.$$
With this in mind, it is possible to use the ideas in [16] to approximate locally stationary kernels.

3. Approximating Stationary Kernels

Rahimi and Recht [16] make use of Equation (1) to approximate stationary kernels. That is, if we define $\zeta_\omega(x) = e^{\,i\, \omega^T x}$, then Equation (1) becomes:
$$\kappa(x - x') = \int_{\mathbb{R}^D} f(\omega)\, e^{\,i\, \omega^T (x - x')}\, d\omega = \mathbb{E}_\omega\!\left[\zeta_\omega(x)\, \zeta_\omega^*(x')\right],$$
where $\omega \sim f$. Now, using Monte Carlo integration and taking $\omega_j \sim f$, the kernel can be approximated as
$$\kappa(x - x') \approx \frac{1}{M_1} \sum_{j=1}^{M_1} \zeta_{\omega_j}(x)\, \zeta_{\omega_j}^*(x'). \qquad (3)$$
In particular, if the kernel is real-valued, then Equation (3) becomes
$$\kappa(x - x') \approx \frac{1}{M_1}\, \phi^T(x)\, \phi(x'), \qquad (4)$$
where $\phi(s) = \left[\cos(\omega_1^T s), \ldots, \cos(\omega_{M_1}^T s), \sin(\omega_1^T s), \ldots, \sin(\omega_{M_1}^T s)\right]$. A side effect of Equation (4) is that we can compute $f(x) = \sum_{i=1}^{n} \alpha_i\, \kappa(x_i - x)$ efficiently. This means that the function $f$ can be approximated as
$$f(x) \approx \frac{1}{M_1} \sum_{i=1}^{n} \alpha_i\, \phi(x_i)^T \phi(x) = \gamma^T \phi(x),$$
where $\gamma = \frac{1}{M_1} \sum_{i=1}^{n} \alpha_i\, \phi(x_i)$ is a constant vector. This constant makes it possible to avoid some of the operations needed to build the Gram matrix.
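The sketch below illustrates the random-feature approximation of Equations (3) and (4) for the particular case of a Gaussian (RBF) kernel, whose spectral density is itself Gaussian; the kernel choice, the sampling scale, and the sanity check are illustrative assumptions rather than part of the proposed method.

```python
import numpy as np

def random_fourier_features(X, M1, gamma=1.0, seed=None):
    """Map X (N x D) to phi(X) (N x 2*M1) so that (1/M1) * phi(x) @ phi(x')
    approximates the RBF kernel exp(-gamma * ||x - x'||^2).
    For this kernel the spectral density f(omega) is N(0, 2*gamma*I),
    so the frequencies omega_j are drawn from that Gaussian."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(M1, D))  # omega_j ~ f
    Z = X @ Omega.T                                               # N x M1 phases
    return np.hstack([np.cos(Z), np.sin(Z)])                      # [cos, sin] features

# Sanity check of Equation (4): kappa(x - x') ~ (1/M1) phi(x)^T phi(x')
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
gamma, M1 = 0.5, 5000
Phi = random_fourier_features(X, M1, gamma, seed=1)
K_approx = (Phi @ Phi.T) / M1
K_exact = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.max(np.abs(K_approx - K_exact)))   # small, and shrinks as M1 grows
```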

3.1. Approximating Locally Stationary Kernel

As stated above, $\kappa_2$ is a stationary kernel, which allows us to approximate $\kappa_2$ as presented at the beginning of Section 3. Now, to obtain the locally stationary kernel, we also need to approximate $\kappa_1$. For this, we define $\zeta_v(x) = e^{\,i\, v^T x / 2}$:
$$\kappa_1\!\left(\frac{x + x'}{2}\right) = \int_{\mathbb{R}^D} e^{\,i\, v^T \frac{x + x'}{2}}\, f_2(v)\, dv = \mathbb{E}_v\!\left[\zeta_v(x)\, \zeta_v(x')\right],$$
where $v \sim f_2$. Using Monte Carlo integration and taking $v_k \sim f_2$, for $k = 1, 2, \ldots, M_2$, it is possible to approximate $\kappa_1$ as:
$$\kappa_1\!\left(\frac{x + x'}{2}\right) \approx \frac{1}{M_2} \sum_{k=1}^{M_2} \zeta_{v_k}(x)\, \zeta_{v_k}(x'). \qquad (5)$$
To approximate the output of the locally stationary kernel, we can use Equations (3) and (5) as follows:
$$\kappa(x, x') = \kappa_1\!\left(\frac{x + x'}{2}\right) \kappa_2(x - x') \approx \frac{1}{M_1 M_2} \sum_{k=1}^{M_2} e^{\,i\, \frac{v_k^T x}{2}}\, e^{\,i\, \frac{v_k^T x'}{2}} \sum_{n=1}^{M_1} e^{\,i\, \omega_n^T x}\, e^{-i\, \omega_n^T x'} = \frac{1}{M_1 M_2} \sum_{n=1}^{M_1} \sum_{k=1}^{M_2} e^{\,i\left(\frac{v_k}{2} + \omega_n\right)^T x}\, e^{\,i\left(\frac{v_k}{2} - \omega_n\right)^T x'},$$
where $\omega_n \sim f_1$ and $v_k \sim f_2$. In particular, if the kernel is real-valued, the previous equation becomes
$$\kappa_1\!\left(\frac{x + x'}{2}\right) \kappa_2(x - x') \approx \frac{1}{M_1 M_2}\, \varphi(x)^T\, \bar{\varphi}(x'), \qquad (6)$$
where
$$\begin{aligned}
\varphi(s) &= \left[\cos\!\left(\left(\tfrac{v_1}{2} + \omega_1\right)^T s\right), \ldots, \cos\!\left(\left(\tfrac{v_{M_2}}{2} + \omega_{M_1}\right)^T s\right), \sin\!\left(\left(\tfrac{v_1}{2} + \omega_1\right)^T s\right), \ldots, \sin\!\left(\left(\tfrac{v_{M_2}}{2} + \omega_{M_1}\right)^T s\right)\right],\\
\bar{\varphi}(s) &= \left[\cos\!\left(\left(\tfrac{v_1}{2} - \omega_1\right)^T s\right), \ldots, \cos\!\left(\left(\tfrac{v_{M_2}}{2} - \omega_{M_1}\right)^T s\right), \sin\!\left(\left(\tfrac{v_1}{2} - \omega_1\right)^T s\right), \ldots, \sin\!\left(\left(\tfrac{v_{M_2}}{2} - \omega_{M_1}\right)^T s\right)\right],
\end{aligned}$$
and $\omega_n \sim f_1$, $v_k \sim f_2$. The advantage of representing the locally stationary kernel as in Equation (6) is the possibility of computing $f(x)$ as:
$$f(x) = \sum_{j=1}^{N} \alpha_j\, \kappa(x_j, x) \approx \frac{1}{M_1 M_2} \sum_{j=1}^{N} \alpha_j\, \varphi^T(x_j)\, \bar{\varphi}(x) = \psi^T \bar{\varphi}(x),$$
where $\psi = \frac{1}{M_1 M_2} \sum_{j=1}^{N} \alpha_j\, \varphi(x_j)$. Given this representation, we only need to compute $\psi$ once, avoiding the construction of the full Gram matrix.
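A possible implementation of the feature maps in Equation (6) is sketched below. The frequency distributions used here are plain zero-mean Gaussians standing in for the densities $f_1$ and $f_2$, and the sign of the sine block in the second map is flipped so that the inner product reproduces the real part of the complex Monte Carlo sum; these are implementation choices made for this sketch, not statements about the paper's exact definitions.

```python
import numpy as np

def ls_feature_maps(X, Omega, V):
    """Feature maps for the locally stationary approximation (cf. Equation (6)).
    Omega: (M1, D) frequencies for the stationary factor kappa_2 (omega_n ~ f_1).
    V:     (M2, D) frequencies for the modulating factor kappa_1 (v_k ~ f_2).
    Returns Phi, Phi_bar of shape (N, 2*M1*M2) such that
    (1/(M1*M2)) * Phi[i] @ Phi_bar[j] equals the Monte Carlo average of
    cos((v_k/2 + omega_n)^T x_i + (v_k/2 - omega_n)^T x_j)."""
    M1, D = Omega.shape
    W_plus = (V[:, None, :] / 2.0 + Omega[None, :, :]).reshape(-1, D)
    W_minus = (V[:, None, :] / 2.0 - Omega[None, :, :]).reshape(-1, D)
    A, B = X @ W_plus.T, X @ W_minus.T
    Phi = np.hstack([np.cos(A), np.sin(A)])
    # The minus sign on the sine block makes the inner product equal cos(a + b).
    Phi_bar = np.hstack([np.cos(B), -np.sin(B)])
    return Phi, Phi_bar

# Placeholder frequencies drawn from zero-mean Gaussians (symmetric spectral densities).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
Omega = rng.normal(scale=1.0, size=(400, 2))   # omega_n ~ f_1 (placeholder)
V = rng.normal(scale=0.5, size=(100, 2))       # v_k ~ f_2 (placeholder)
Phi, Phi_bar = ls_feature_maps(X, Omega, V)
K_approx = (Phi @ Phi_bar.T) / (400 * 100)
```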
Now, it is necessary to remark on an interesting property of this representation. It is possible to show that $\left|\varphi^T(x)\, \bar{\varphi}(x') - \kappa(x, x')\right| \le C$ almost everywhere, for some constant $C$ (using Hoeffding's inequality [23]). Given this, the following inequality holds: for any $\epsilon > 0$, taking $M_1$ and $M_2$ samples for $\kappa_2$ and $\kappa_1$, respectively,
$$P\!\left(\left|\varphi^T(x)\, \bar{\varphi}(x') - \kappa(x, x')\right| \ge \epsilon\right) \le 2 \exp\!\left(-\frac{M_1 M_2\, \epsilon^2}{2\left(\sigma^2 + 2\epsilon^2/3\right)}\right).$$
Therefore, the proposed representation provides a good approximation $\varphi^T(x)\, \bar{\varphi}(x')$ of the kernel. Furthermore, the following theorem gives a uniform bound: the larger $\epsilon$ is, the less likely it is to observe a deviation $\left|\varphi^T(x)\, \bar{\varphi}(x') - \kappa(x, x')\right|$ of at least $\epsilon$.
Theorem 1.
Approximation of a locally stationary kernel.
Let $\mathcal{M}$ be a compact subset of $\mathbb{R}^D$ with diameter $\mathrm{Diam}(\mathcal{M})$, and let $\sigma^2 > \epsilon/2$. Then the approximation of the kernel satisfies:
$$P\!\left(\sup_{x, x'} \left|\varphi^T(x)\, \bar{\varphi}(x') - \kappa(x, x')\right| \ge \epsilon\right) < \frac{2\sigma^2}{\epsilon}\left(1 + 4\,\mathrm{Diam}(\mathcal{M}) \exp\!\left(-\frac{M_1 M_2\, \epsilon^2}{4D\left(\sigma^2 + 2\epsilon^2/3\right)}\right)\right).$$
Proof of Theorem 1.
Define $s(x, x') \triangleq \varphi^T(x)\, \bar{\varphi}(x')$, let $\kappa(x, x') \triangleq \kappa_1\!\left(\frac{x + x'}{2}\right) \kappa_2(x - x')$ be a locally stationary kernel, and let $f(x, x') = s(x, x') - \kappa(x, x')$. Then $\mathbb{E}[f(x, x')] = 0$. Given that $\kappa_2$ is shift invariant, define $\Delta_{-} \triangleq x - x' \in \mathcal{M}_{-}$; and, given that $\kappa_1$ can be interpreted as acting on the mean, define $\Delta_{+} \triangleq \frac{x + x'}{2} \in \mathcal{M}_{+}$. Consequently, we can write $\kappa(\Delta_{-}, \Delta_{+}) = \kappa_1(\Delta_{+})\, \kappa_2(\Delta_{-})$. For a compact bounded subset $\mathcal{M} \subseteq \mathbb{R}^D$, it is known that $\mathrm{Diam}(\mathcal{M}_{-}) \le 2\,\mathrm{Diam}(\mathcal{M})$ and $\mathrm{Diam}(\mathcal{M}_{+}) \le 2\,\mathrm{Diam}(\mathcal{M})$. With this in mind, it is possible to build an $\epsilon$-net that covers $\mathcal{M}_{-} \times \mathcal{M}_{+}$ with at most $T = \left(\frac{4\,\mathrm{Diam}(\mathcal{M})}{r}\right)^{2D}$ balls of radius $r$. Let $\{(\Delta_{-,i}, \Delta_{+,i})\}_{i=1}^{T}$ denote the centers of these $T$ balls, and let $L_f$ be the Lipschitz constant of $f$. Then $|f(\Delta_{-}, \Delta_{+})| < \epsilon$ for all $(\Delta_{-}, \Delta_{+}) \in \mathcal{M}_{-} \times \mathcal{M}_{+}$ whenever $|f(\Delta_{-,i}, \Delta_{+,i})| < \frac{\epsilon}{2}$ for all $i$ and $L_f < \frac{\epsilon}{2r}$. Now, $L_f = f(\Delta_{+}^{*}, \Delta_{-}^{*})$, where $(\Delta_{+}^{*}, \Delta_{-}^{*}) = \arg\max_{(\Delta_{+}, \Delta_{-}) \in \mathcal{M}_{+} \times \mathcal{M}_{-}} f(\Delta_{+}, \Delta_{-})$. Additionally, we know that $\mathbb{E}[s(\Delta_{+}, \Delta_{-})] = \kappa(\Delta_{+}, \Delta_{-})$. Thus, it is possible to write:
$$\mathbb{E}[L_f^2] = \mathbb{E}\!\left[\left(s(\Delta_{+}^{*}, \Delta_{-}^{*}) - \kappa(\Delta_{+}^{*}, \Delta_{-}^{*})\right)^2\right] = \mathbb{E}\!\left[s(\Delta_{+}^{*}, \Delta_{-}^{*})^2\right] - \mathbb{E}\!\left[\kappa(\Delta_{+}^{*}, \Delta_{-}^{*})^2\right].$$
Now, given that both $\mathbb{E}\!\left[s(\Delta_{+}^{*}, \Delta_{-}^{*})^2\right]$ and $\mathbb{E}\!\left[\kappa(\Delta_{+}^{*}, \Delta_{-}^{*})^2\right]$ are positive,
$$\mathbb{E}\!\left[s(\Delta_{+}^{*}, \Delta_{-}^{*})^2\right] - \mathbb{E}\!\left[\kappa(\Delta_{+}^{*}, \Delta_{-}^{*})^2\right] \le \mathbb{E}\!\left[s(\Delta_{+}^{*}, \Delta_{-}^{*})^2\right] = \sigma^2,$$
with $\mathbb{E}[L_f^2] \le \sigma^2$ and $\mathbb{E}[L_f] \le \sigma$, where $\sigma^2$ is the second moment of the Fourier transform of $\kappa$. Thus, using Markov's inequality,
$$P\!\left(L_f \ge t\right) \le \frac{\mathbb{E}[L_f]}{t}, \qquad P\!\left(L_f \ge \frac{\epsilon}{2r}\right) \le \frac{2r\sigma^2}{\epsilon}.$$
Finally, using Boole's inequality, we have
$$P\!\left(\bigcup_{i=1}^{T} \left\{\left|f(\Delta_{+,i}, \Delta_{-,i})\right| \ge \frac{\epsilon}{2}\right\}\right) \le 2T \exp\!\left(-\frac{M_1 M_2\, \epsilon^2}{2\left(\sigma^2 + 2\epsilon^2/3\right)}\right).$$
With this at hand, it is possible to say:
$$P\!\left(\sup_{x, x'} \left|\varphi^T(x)\, \bar{\varphi}(x') - \kappa(x, x')\right| \le \epsilon\right) \ge 1 - 2\left(\frac{4\,\mathrm{Diam}(\mathcal{M})}{r}\right)^{2D} \exp\!\left(-\frac{M_1 M_2\, \epsilon^2}{2\left(\sigma^2 + 2\epsilon^2/3\right)}\right) - \frac{2r\sigma^2}{\epsilon}.$$
This means that we need to choose $r$ in the following expression:
$$1 - k_1\, r^{-2D} - k_2\, r, \qquad (7)$$
where
$$k_1 = 2\left(4\,\mathrm{Diam}(\mathcal{M})\right)^{2D} \exp\!\left(-\frac{M_1 M_2\, \epsilon^2}{2\left(\sigma^2 + 2\epsilon^2/3\right)}\right), \qquad k_2 = \frac{2\sigma^2}{\epsilon}.$$
The solution of (7) is given by $r = \left(\frac{k_1}{k_2}\right)^{\frac{1}{2D}}$. Then, plugging this result back in,
$$1 - k_1\!\left[\left(\frac{k_1}{k_2}\right)^{\frac{1}{2D}}\right]^{-2D} - k_2\left(\frac{k_1}{k_2}\right)^{\frac{1}{2D}}.$$
After some algebra, it is possible to obtain:
$$k_1\!\left[\left(\frac{k_1}{k_2}\right)^{\frac{1}{2D}}\right]^{-2D} = \frac{2\sigma^2}{\epsilon}, \qquad k_2\left(\frac{k_1}{k_2}\right)^{\frac{1}{2D}} = \left(\frac{2\sigma^2}{\epsilon}\right)^{\frac{2D-1}{2D}} 4\,\mathrm{Diam}(\mathcal{M}) \exp\!\left(-\frac{M_1 M_2\, \epsilon^2}{4D\left(\sigma^2 + 2\epsilon^2/3\right)}\right).$$
Using these equalities, we get (8) and (9):
$$P\!\left(\sup_{x, x'} \left|\varphi^T(x)\, \bar{\varphi}(x') - \kappa(x, x')\right| \le \epsilon\right) \ge 1 - \frac{2\sigma^2}{\epsilon} - \left(\frac{2\sigma^2}{\epsilon}\right)^{\frac{2D-1}{2D}} 4\,\mathrm{Diam}(\mathcal{M}) \exp\!\left(-\frac{M_1 M_2\, \epsilon^2}{4D\left(\sigma^2 + 2\epsilon^2/3\right)}\right), \qquad (8)$$
$$P\!\left(\sup_{x, x'} \left|\varphi^T(x)\, \bar{\varphi}(x') - \kappa(x, x')\right| \ge \epsilon\right) \le \frac{2\sigma^2}{\epsilon} + \left(\frac{2\sigma^2}{\epsilon}\right)^{\frac{2D-1}{2D}} 4\,\mathrm{Diam}(\mathcal{M}) \exp\!\left(-\frac{M_1 M_2\, \epsilon^2}{4D\left(\sigma^2 + 2\epsilon^2/3\right)}\right). \qquad (9)$$
Now, if $\sigma^2 > \epsilon/2$, then $\frac{2\sigma^2}{\epsilon} + \left(\frac{2\sigma^2}{\epsilon}\right)^{\frac{2D-1}{2D}} x < \frac{2\sigma^2}{\epsilon}\,(1 + x)$ for any $x > 0$. Finally:
$$P\!\left(\sup_{x, x'} \left|\varphi^T(x)\, \bar{\varphi}(x') - \kappa(x, x')\right| \ge \epsilon\right) < \frac{2\sigma^2}{\epsilon}\left(1 + 4\,\mathrm{Diam}(\mathcal{M}) \exp\!\left(-\frac{M_1 M_2\, \epsilon^2}{4D\left(\sigma^2 + 2\epsilon^2/3\right)}\right)\right). \qquad \square$$

3.2. Learning Locally Stationary Kernel, GaBaSR

In this section, we explain how to learn the proposed kernel. The learning algorithm is based on the Bayesian Nonparametric Kernel (BaNK) algorithm presented in [18]. However, given their greater representation capabilities, we propose learning a Gaussian mixture distribution over the frequencies to improve the performance of the algorithm. For this reason, we name the model Gaussian Mixture Bayesian Nonparametric Kernel Learning using Spectral Representation (GaBaSR). To learn the Gaussian mixture, the proposed algorithm uses ideas from [15], together with a different way of learning the kernel in the classification task. One of its main advantages is the use of vague/non-informative priors [15,24], as well as having fewer hyperparameters for learning the kernels.

3.2.1. GaBaSR Algorithm

Based on the previous ideas, the algorithm can be described at a high level as follows.
  • Learn all the parameters for the Gaussian mixture ρ ( ω ) :
    • Let $\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ be the current parameters of the Gaussian Mixture Model (GMM), where $\pi_k$ is the prior probability of the $k$th component and $\mu_k$ and $\Sigma_k$ are its mean and covariance matrix; the GMM is then $\rho(\omega) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\omega \mid \mu_k, \Sigma_k)$. The output of this step is the new set of sampled parameters for $\rho(\omega)$.
  • Take $M$ samples from $\rho(\omega)$, i.e., $\omega_i \sim \rho(\omega)$, $i = 1, 2, \ldots, M$, for the spectral representation.
    • Here the inputs are the parameters of the GMM and the current frequencies $\omega_i$, $i = 1, \ldots, M$; the output is the new set of sampled frequencies.
  • Approximate the kernel as
    $$\kappa_1\!\left(\frac{x + x'}{2}\right) \kappa_2(x - x') \approx \frac{1}{M_1 M_2}\, \varphi(x)^T\, \bar{\varphi}(x').$$
  • Predict the new samples:
    (a) If the task is regression, use:
    $$f(x) = \mathcal{N}\!\left(\beta^T \varphi(x),\, \sigma^2\right). \qquad (10)$$
    (b) If the task is classification, use:
    $$f(x) = \frac{1}{1 + \exp\!\left(-\beta^T \varphi(x)\right)}. \qquad (11)$$
In this work, we use a Markov Chain Monte Carlo (MCMC) algorithm, the Gibbs sampler [25], to learn the model and predict new inputs. The entire process is described in the following subsections.
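As a point of reference, the sketch below implements a heavily simplified version of the regression branch of this loop: the frequencies are drawn once from a fixed Gaussian instead of being resampled from the learned GMM, and only the conjugate Bayesian linear regression on top of the random features is carried out (the quantities $w_N$, $V_N$, $a_N$, $b_N$ of Section 3.2.3). Function names and default values, which mirror the vague-prior settings used later in the experiments, are illustrative assumptions.

```python
import numpy as np

def bayesian_rff_regression(X, y, M=250, a0=1e-3, b0=1e-3, v0=1e-6, scale=1.0, seed=0):
    """Simplified skeleton: fixed random Fourier features + conjugate Bayesian
    linear regression with a vague prior (prior precision v0, prior mean 0)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Omega = rng.normal(scale=scale, size=(M, D))      # omega_i ~ rho(omega), fixed here
    Z = X @ Omega.T
    Phi = np.hstack([np.cos(Z), np.sin(Z)])           # N x 2M feature matrix Phi(X)
    A = v0 * np.eye(2 * M) + Phi.T @ Phi              # posterior precision V_N^{-1}
    VN = np.linalg.inv(A)
    wN = VN @ (Phi.T @ y)                             # posterior mean (w_0 = 0)
    aN = a0 + N / 2.0
    bN = b0 + 0.5 * (y @ y - wN @ A @ wN)
    def predict(Xnew):
        Zn = Xnew @ Omega.T
        return np.hstack([np.cos(Zn), np.sin(Zn)]) @ wN
    return predict, (wN, VN, aN, bN)

# Toy usage: fit a noisy sine wave.
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * np.random.default_rng(1).normal(size=200)
predict, stats = bayesian_rff_regression(X, y, M=100, scale=2.0)
y_hat = predict(X)
```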

3.2.2. Learning the Gaussian Mixture

In order to learn the parameters $Z_i$, $\mu_k$, and $\Sigma_k$ of the Gaussian mixture, we take the following steps (a sketch of the assignment step is given after the list):
  • First sample Z i :
    Z i indicates the component of the Gaussian Mixture from which the random frequency ω i is drawn.
    For i = 1 , 2 , . . . , M do:
    (a)
    The element $\omega_i$ belongs to a represented class $k = 1, 2, \ldots, K$ with probability:
    $$p(z_i = k \mid Z_{-i}, \alpha, \mu_k, \Lambda_k) \propto \frac{N_{-i,k}}{N - 1 + \alpha}\, |\Lambda_k|^{1/2}\, e^{-\frac{1}{2} (\omega_i - \mu_k)^T \Lambda_k (\omega_i - \mu_k)}.$$
    (b)
    The element $\omega_i$ belongs to an unrepresented class with probability:
    $$p(z_i = k \mid \alpha, \mu_k, \Lambda_k) \propto \frac{\alpha}{N - 1 + \alpha}\, |\Lambda_k|^{1/2}\, e^{-\frac{1}{2} (\omega_i - \mu_k)^T \Lambda_k (\omega_i - \mu_k)},$$
    where the parameters μ k and Λ k are sampled from their priors,
    $$\mu_k \sim \mathcal{N}(\lambda, R^{-1}), \qquad \Lambda_k \sim \mathcal{W}(\beta, W^{-1}),$$
    where λ , R 1 , β and W 1 are vague/non-informative priors.
  • Second, sample $\mu_k$ and $\Lambda_k$:
    For $k = 1, 2, \ldots, K$, sample $\mu_k$ and $\Lambda_k$ from:
    $$\mu_k \sim \mathcal{N}\!\left(\left(N_k \bar{\omega}_k \Lambda_k + \lambda R\right)\left(N_k \Lambda_k + R\right)^{-1},\; \left(N_k \Lambda_k + R\right)^{-1}\right),$$
    $$\Lambda_k \sim \mathcal{W}\!\left(\beta + N_k + D - 1,\; \left[\beta W^{-1} + \sum_{i=1}^{N} \delta(z_i = k)\, (\omega_i - \mu_k)(\omega_i - \mu_k)^T\right]^{-1}\right).$$
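A sketch of the assignment step above is shown next. It assumes the caller maintains the counts $N_{-i,k}$, the component parameters, and a freshly prior-drawn pair $(\mu_{new}, \Lambda_{new})$ for the unrepresented case; all function and variable names are hypothetical.

```python
import numpy as np

def gaussian_density(x, mu, Lam):
    """Multivariate normal density N(x | mu, Lam^{-1}), parameterized by the precision Lam."""
    d = x - mu
    D = x.shape[0]
    return np.sqrt(np.linalg.det(Lam) / (2.0 * np.pi) ** D) * np.exp(-0.5 * d @ Lam @ d)

def sample_assignment(omega_i, counts, mus, Lams, mu_new, Lam_new, alpha, N, rng):
    """Sample z_i for one frequency omega_i following the two cases above.
    counts[k] = N_{-i,k}, the number of other frequencies assigned to component k;
    (mus[k], Lams[k]) are that component's mean and precision;
    (mu_new, Lam_new) are drawn from the priors for a fresh component.
    Returns an index in 0..K, where K means 'open a new component'."""
    K = len(counts)
    w = np.empty(K + 1)
    for k in range(K):   # represented components
        w[k] = counts[k] / (N - 1 + alpha) * gaussian_density(omega_i, mus[k], Lams[k])
    # unrepresented component, with parameters sampled from the priors
    w[K] = alpha / (N - 1 + alpha) * gaussian_density(omega_i, mu_new, Lam_new)
    w /= w.sum()
    return rng.choice(K + 1, p=w)
```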

3.2.3. Sampling to Approximate the Kernel

As we established earlier, the kernel can be represented by:
$$\kappa(x - x') \approx \frac{1}{M}\, \phi^T(x)\, \phi(x'),$$
where $\phi(s) = \left[\cos(\omega_1^T s), \ldots, \cos(\omega_M^T s), \sin(\omega_1^T s), \ldots, \sin(\omega_M^T s)\right]$ and $\omega_i$, $i = 1, 2, \ldots, M$, are samples from the learned Gaussian mixture. In order to approximate the kernel, for each random frequency we take a candidate frequency $\omega_j^*$ and accept it with probability $r = \min(1, \alpha)$, where
$$\alpha = \frac{P(y \mid X, \Omega_{-j}, \omega_j^*)}{P(y \mid X, \Omega)}.$$
Now, if the task is regression, Equation (10) is used; for classification, Equation (11) is used. Then we take a random number $u \sim U(0, 1)$ and accept $\omega_j^*$ if $u < r$; otherwise we reject $\omega_j^*$. For this, it is clear that we need to sample $\omega_j^*$ from:
$$\omega_j^* \sim \mathcal{N}\!\left(\omega_j^* \mid \mu_{Z_j}, \Sigma_{Z_j}\right).$$
In order to compute $P(y \mid X, \Omega)$, it is necessary to identify which type of task is being solved, regression or classification.
  • In the case of a regression:
    $$P(y \mid X, \Omega) \propto \frac{|V_N|^{1/2}}{|V_0|^{1/2}}\, \frac{b_0^{a_0}}{b_N^{a_N}}\, \frac{\Gamma(a_N)}{\Gamma(a_0)},$$
    where
    $$\begin{aligned}
    w_N &= V_N\left(V_0^{-1} w_0 + \Phi(X)^T y\right), \\
    V_N &= \left(V_0 + \Phi(X)^T \Phi(X)\right)^{-1}, \\
    a_N &= a_0 + \frac{N}{2}, \\
    b_N &= b_0 + \frac{1}{2}\left(w_0^T V_0^{-1} w_0 + y^T y - w_N^T V_N^{-1} w_N\right), \\
    \Phi(X) &= \left(\phi(x_1)^T, \ldots, \phi(x_N)^T\right)^T.
    \end{aligned}$$
  • In the classification task, an approximation to the logistic regression likelihood is used,
    $$p(y_i = C_1 \mid x, X, \Omega) \approx \mathrm{sigm}\!\left(w_N^T \phi(x)\right),$$
    where $\mathrm{sigm}(a) = \frac{1}{1 + \exp(-a)}$. Thus, the likelihood is approximated by:
    $$p(y \mid X, \Omega) = \prod_{i=1}^{N} p(y_i = C_1 \mid x_i, X, \Omega)^{y_i}\left(1 - p(y_i = C_1 \mid x_i, X, \Omega)\right)^{1 - y_i} \approx \prod_{i=1}^{N} \mathrm{sigm}\!\left(w_N^T \phi(x_i)\right)^{y_i}\left(1 - \mathrm{sigm}\!\left(w_N^T \phi(x_i)\right)\right)^{1 - y_i},$$
    where,
    $$w_N = V_N\left(V_0^{-1} w_0 + \Phi(X)^T y\right), \qquad V_N = \left(V_0 + \Phi(X)^T \Phi(X)\right)^{-1}.$$
  • Computing $\alpha$: the following criterion is used to accept a sample $\omega_j^*$ with probability $r$ (a sketch of this accept/reject step follows the list):
    $$r = \min(1, \alpha).$$
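The accept/reject step described above can be sketched as a standard Metropolis-Hastings update. Here `log_marginal` is a placeholder for the task-specific marginal likelihood $P(y \mid X, \Omega)$ of this subsection, and the function and variable names are illustrative.

```python
import numpy as np

def mh_update_frequency(j, Omega, mu_z, Sigma_z, log_marginal, X, y, rng):
    """Metropolis-Hastings step for one frequency omega_j:
    propose omega_j* from the Gaussian component omega_j is assigned to,
    accept with probability r = min(1, alpha), where alpha is the ratio of
    the marginal likelihoods with and without the proposal."""
    proposal = rng.multivariate_normal(mu_z, Sigma_z)   # omega_j* ~ N(mu_{Z_j}, Sigma_{Z_j})
    Omega_new = Omega.copy()
    Omega_new[j] = proposal
    log_alpha = log_marginal(X, y, Omega_new) - log_marginal(X, y, Omega)
    if np.log(rng.uniform()) < min(0.0, log_alpha):     # u < r with r = min(1, alpha)
        return Omega_new                                # accept omega_j*
    return Omega                                        # reject, keep omega_j
```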

3.2.4. Learning Locally Stationary Kernels

In order to learn locally stationary kernels, we use a similar process, but we compute $\varphi(x)$ using Equation (6) instead of Equation (4). Equation (6) needs the variables $\omega_i$, $i = 1, 2, \ldots, M_1$ (approximating $\kappa_2$) and $v_k$, $k = 1, 2, \ldots, M_2$ (approximating $\kappa_1$). To learn the variables for $\kappa_2$ we use the algorithm shown above; to learn the variables that approximate $\kappa_1$, we model $\kappa_1$ as an infinite Gaussian mixture as well. This means that we need to learn the variables $Z_j$, $\mu_k$, $\Sigma_k$, and $v_k$ that approximate the function $\kappa_1$. Learning these variables is very similar to the stationary case, with a slight modification:
  • Sample $Z_j$: sampling $Z_j$ is analogous to the stationary case, but with $v_k$ instead of $\omega_k$.
  • Sample $\mu_k$, $\Sigma_k$: this step is analogous to the previous section, but with $v_k$ instead of $\omega_k$.
  • Sample $v_k$: to sample $v_k$, we draw from
    $$v_k^* \sim \mathcal{N}\!\left(v_k^* \mid \mu_{Z_k}, \Sigma_{Z_k}\right).$$
    In order to compute $P(y \mid X, \Omega, V)$, we use the $\varphi(x)$ of the locally stationary kernel instead of the stationary one. This simple change adds more learning capability to GaBaSR.

3.2.5. Complexity of GaBaSR

  • Complexity of sampling all $Z_i$: the complexity of sampling one $Z_i$ is $O(KMd)$. Thus, the complexity of sampling all $Z_i$ for $i = 1, 2, \ldots, M$ is bounded by $O(KM^2 d)$, where $d$ is the dimension of the input vectors, $M$ is the number of samples used to approximate the kernel, and $K$ is the number of Gaussians found by the algorithm.
  • Complexity of sampling all μ k and Σ k :
    -
    Complexity of computing $\mu_k$: to take a sample we need to compute $\left(N_k \bar{\omega}_k \Lambda_k + \lambda R\right)\left(N_k \Lambda_k + R\right)^{-1}$, which takes $O(d + d^2)$, and $\left(N_k \Lambda_k + R\right)^{-1}$, which takes $O(d^3)$; so this step is bounded by $O(d^3)$.
    -
    Complexity of computing $\Lambda_k$: to take a sample we need to compute $\beta W^{-1}$, which takes $O(d^3)$. After that, we need to invert a $d \times d$ matrix, which also takes $O(d^3)$; so this step is bounded by $O(d^3)$.
    -
    The complexity of computing both $\mu_k$ and $\Lambda_k$ is therefore bounded by $O(d^3)$.
    We need to take $K$ samples, so sampling all $\mu_k$ and $\Sigma_k$, $k = 1, 2, \ldots, K$, is bounded by $O(K d^3)$.
  • Complexity of $P(y \mid X, \Omega_{-j}, \omega_j^*)$: this computation (for either the regression or the classification task) is bounded by $O(NdM + M^2 N + M^3)$, since computing the matrix $\Phi(X)$ costs $O(NdM)$ and computing $V_N$ costs $O(M^2 N + M^3)$. This means that the complexity of resampling all $M$ frequencies $(\omega_1, \omega_2, \ldots, \omega_M)$ is bounded by $O(NdM^2 + M^3 N + M^4)$.
  • Complexity of one sweep (loop) of the algorithm: summing the three complexities gives $O(KM^2 d + K d^3 + NdM^2 + M^3 N + M^4) = O(M^2 d (K + N) + K d^3 + M^3 N + M^4) = O(M^3 N)$, because $N \gg M$.
  • Complexity of $s$ sweeps (loops) of the algorithm: if we perform $s$ sweeps, then the total complexity of GaBaSR is bounded by $O(s M^3 N)$; a rough tally of these terms for concrete sizes is sketched below.
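As an illustration of the dominant term, the per-sweep cost above can be evaluated for concrete values of $N$, $M$, $K$, $d$; the snippet below ignores constants, uses illustrative sizes, and shows that the $M^3 N$ term dominates once $N \gg M$.

```python
def per_sweep_cost(N, M, K, d):
    """Tally the per-sweep terms listed above (constants and lower-order terms ignored)."""
    return {
        "assignments  K*M^2*d": K * M**2 * d,
        "mixture pars K*d^3":   K * d**3,
        "features     N*d*M^2": N * d * M**2,
        "posteriors   M^3*N":   M**3 * N,
        "             M^4":     M**4,
    }

# Illustrative sizes: N = 45,312 (Electricity), M = 500, K = 10, d = 8.
for name, cost in per_sweep_cost(45_312, 500, 10, 8).items():
    print(f"{name}: {cost:.3e}")
```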

4. Experiments

In this work, the experiments are performed without data cleaning, i.e., no normalization or removal of outliers is done. Additionally, we use vague/non-informative priors to test the robustness of GaBaSR: $a_0 = 0.001$, $b_0 = 0.001$, and $V_0 = 0.000001\, I_{2M}$, where $I_{2M}$ is the identity matrix of dimension $2M \times 2M$.
Using non-informative priors, together with the fact that there is no need to preprocess the data, can be seen as one of the advantages of GaBaSR. Finally, the main idea of kernel methods is to give more power to linear machines via the kernel trick; for this reason, we designed the experiments to compare GaBaSR with purely linear machines. Unfortunately, when trying to collect the original datasets used by Oliva et al. [18], we found that they are no longer available online. Thus, a comparison between GaBaSR and Oliva's algorithm could not be performed.

4.1. Classification

The first dataset is the XOR problem in 2D. We set the number of samples to $N = 6000$. After five sweeps with $M = 300$ frequencies, the proposed method achieved an AUC of 0.98. The result of this experiment is shown in Figure 1.
All the remaining results of the GaBaSR algorithm use $M = 500$ samples and 5 sweeps each. For the classification problems we use several small datasets: Breast Cancer, Credit-g, Blood Transfusion, Electricity, EEG-eye-state, and Kr vs. Kp. The Breast Cancer dataset is the Breast Cancer Wisconsin dataset from the UCI repository [26]. The Credit-g dataset also comes from the UCI repository [26] and classifies people, described by a set of attributes, as good or bad credit risks. The Electricity dataset was downloaded from openml.org and contains data from the Australian New South Wales electricity market. The EEG-eye-state dataset, downloaded from UCI, indicates whether the eye is closed (1) or open (0). The Kr vs. Kp dataset, also downloaded from UCI, is the King+Rook versus King+Pawn problem: with the King+Rook side to move, the task is to classify whether the position is a win or not.
Table 1, Table 2 and Table 3 show the results obtained with the perceptron, SVM, and GaBaSR, respectively. From those tables, we can see that on Kr vs. Kp and Electricity GaBaSR has an AUC similar to that of the SVM; it is also important to notice that on Blood Transfusion GaBaSR performs better than both the perceptron and the SVM.
In other words, comparing Table 3 with Table 2, it is possible to observe that the results for SVM classification are, in general, better than those for GaBaSR classification. An AUC equal to 0.5 indicates that the classifier is random, so it does not fulfil its function. Comparing Table 3 with Table 2, we can conclude that SVM classification works properly for all tested datasets except Blood Transfusion, while GaBaSR classification works properly only for Kr vs. Kp and Electricity. The result for Blood Transfusion obtained by GaBaSR is better than that of the SVM, but still not satisfactory. A negative result is also a result that has its value.
In general we obtained good accuracy, in most cases above 0.8. For example, on the Breast Cancer dataset we obtained an accuracy of 0.89, and on Credit-g we obtained an accuracy of 0.91 using only 500 frequencies and 5 sweeps.

4.2. Regression

For the regression experiments, we first use a synthetic data set. Samples are taken from the Gaussian mixture distribution shown in Equation (12). After that, we set $M = 250$ and $\beta \sim \mathcal{N}(0, I_{501})$. The dimension 501 corresponds to the extended feature vector ($2M = 500$ random features plus a constant term); the targets are sampled as $y_i \sim \mathcal{N}(\phi_\rho(x_i)^T \beta, \sigma^2)$, where $\phi_\rho$ denotes the random features built from $\rho$, i.e., $\omega_i \sim \rho(\omega)$, $i = 1, 2, \ldots, M$. An instance of this problem is shown in Figure 2. We use 250 samples and vague/non-informative priors to learn the function; the result is shown in Figure 3.
$$p(\omega) = \pi_1\, p\!\left(\omega \,\middle|\, 0, \left(\tfrac{1}{2}\right)^2\right) + \pi_2\, p\!\left(\omega \,\middle|\, \tfrac{3\pi}{4}, \left(\tfrac{1}{2}\right)^2\right) + \pi_3\, p\!\left(\omega \,\middle|\, \tfrac{11\pi}{8}, \left(\tfrac{1}{4}\right)^2\right), \qquad (12)$$
with $\pi_1 = \pi_2 = \pi_3 = \frac{1}{3}$.
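For reproducibility of this synthetic setup, a small sampler for the mixture in Equation (12) is sketched below; the random seed and sample count are illustrative.

```python
import numpy as np

def sample_spectral_mixture(n, seed=None):
    """Draw n samples from the Gaussian mixture of Equation (12):
    equal weights 1/3, means 0, 3*pi/4, 11*pi/8 and st. devs 1/2, 1/2, 1/4."""
    rng = np.random.default_rng(seed)
    means = np.array([0.0, 3 * np.pi / 4, 11 * np.pi / 8])
    stds = np.array([0.5, 0.5, 0.25])
    comps = rng.integers(0, 3, size=n)          # component index, pi_1 = pi_2 = pi_3 = 1/3
    return rng.normal(means[comps], stds[comps])

omega = sample_spectral_mixture(250, seed=0)    # M = 250 spectral samples
```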
All the results of the GaBaSR algorithm use $M = 500$ samples and 5 sweeps each. For the regression problems we use several small datasets: Mauna Loa CO2, California Houses, Boston house-price, and Diabetes. The Mauna Loa CO2 dataset [29], from the Global Monitoring Laboratory, collects the monthly mean CO2 concentration; as we can see from Figure 4, this series has a repeating seasonal pattern together with a steady increase. The California Houses dataset has 20,640 rows with 8 columns. The Boston house-price dataset was collected in 1978 from various suburbs of Boston. The Diabetes dataset has ten variables and records the progression of the disease one year later.
In this section we show three tables of results: Table 4 shows the results of our algorithm, while Table 5 and Table 6 present the results of linear regression and of the Support Vector Machine with a linear kernel, respectively.
In this subsection, the experiments are performed on these data sets and the result of a simple linear regression is also reported; the results appear in Table 4 and Table 5.
As can be seen, an important result is given by the Mauna Loa CO2 dataset, which contains data from 1958 to 2001. For this experiment the algorithm is trained with $M = 250$ and five sweeps, learning a stationary kernel. After the model has been trained, it is possible to assess its performance: the achieved MSE is 0.6052, which helps in estimating the CO2 outputs. For example, at the sample 2002.13 the prediction is 376.873, where the real measurement is 373.08. The full results for Mauna Loa CO2 are shown in Figure 4 and Table 4.
We use the following datasets: (1) Mauna Loa CO2 from [29], (2) California houses from [30], (3) Boston house-price from [30], and (4) Diabetes from [30].

5. Conclusions

Although GaBaSR's results are promising, there is still quite a lot of work to do. For example, sampling the $\omega_j$ is quite slow, since the matrix $\Phi$ needs to be updated for each individual proposed sample. Oliva et al. [18] state that this can be done using low-rank updates; however, they do not present a procedure to perform such a task. Low-rank updates are being considered for the next phase of GaBaSR, and it is necessary to research how many samples $M$ are required in order to obtain a good low-rank approximation.
In the experiments, it is possible to observe that GaBaSR is more accurate when performing a classification task than a regression task. This is an opportunity to improve the regression model: either the regression model needs more research to improve its performance, or a different model for the regression task is needed.

Author Contributions

Conceptualization, L.R.P.-L. and A.M.-V.; methodology, L.R.P.-L. and R.O.G.-M.; software, L.R.P.-L. and A.G.; validation, R.O.G.-M.; formal analysis, L.R.P.-L. and A.M.-V.; resources, A.M.-V.; data curation, A.G.; writing—original draft preparation, L.R.P.-L. and R.O.G.-M.; writing—review and editing, A.M.-V. and A.G.; supervision, A.M.-V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All the data are cited and can be downloaded from their respective sources.

Acknowledgments

The authors wish to thank the National Council for Science and Technology (CONACyT) in Mexico and the Escuela Militar de Mantenimiento y Abastecimiento, Fuerza Aérea Mexicana, Zapopan.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUC     Area Under the Curve
MSE     Mean Squared Error
SVM     Support Vector Machine
R²      Coefficient of determination
MKL     Multiple Kernel Learning
SDP     Semi-Definite Programming
SMO     Sequential Minimal Optimization
BaNK    Bayesian Nonparametric Kernel
GaBaSR  Gaussian Mixture Bayesian Nonparametric Kernel Learning using Spectral Representation
GMM     Gaussian Mixture Model
MCMC    Markov Chain Monte Carlo
UCI     University of California, Irvine

References

  1. Smola, A.J.; Schölkopf, B. Learning with Kernels; Citeseer: Princeton, NJ, USA, 1998; Volume 4. [Google Scholar]
  2. Soentpiet, R. Advances in Kernel Methods: Support Vector Learning; MIT Press: Cambridge, MA, USA, 1999. [Google Scholar]
  3. Anand, S.S.; Scotney, B.W.; Tan, M.G.; McClean, S.I.; Bell, D.A.; Hughes, J.G.; Magill, I.C. Designing a kernel for data mining. IEEE Expert 1997, 12, 65–74. [Google Scholar] [CrossRef]
  4. Schölkopf, B.; Smola, A.; Müller, K.R. Kernel principal component analysis. In Proceedings of the International Conference on Artificial Neural Networks, Lausanne, Switzerland, 8–10 October 1997; pp. 583–588. [Google Scholar]
  5. Zien, A.; Rätsch, G.; Mika, S.; Schölkopf, B.; Lengauer, T.; Müller, K.R. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 2000, 16, 799–807. [Google Scholar] [CrossRef] [PubMed]
  6. Tipping, M.E. The relevance vector machine. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; pp. 652–658. [Google Scholar]
  7. Junli, C.; Licheng, J. Classification mechanism of support vector machines. In Proceedings of the WCC 2000-ICSP 5th International Conference on Signal Processing Proceedings 16th World Computer Congress, Beijing, China, 21–25 August 2000; Volume 3, pp. 1556–1559. [Google Scholar]
  8. Bennett, K.P.; Campbell, C. Support vector machines: Hype or hallelujah? Acm Sigkdd Explor. Newsl. 2000, 2, 1–13. [Google Scholar] [CrossRef]
  9. Gönen, M.; Alpaydın, E. Multiple kernel learning algorithms. J. Mach. Learn. Res. 2011, 12, 2211–2268. [Google Scholar]
  10. Lanckriet, G.R.; Cristianini, N.; Bartlett, P.; Ghaoui, L.E.; Jordan, M.I. Learning the Kernel Matrix with Semidefinite Programming. J. Mach. Learn. Res. 2004, 5, 27–72. [Google Scholar]
  11. Hoi, S.C.; Jin, R.; Lyu, M.R. Learning Nonparametric Kernel Matrices from Pairwise Constraints. In Proceedings of the 24th International Conference on Machine Learning ACM, Corvallis, OR, USA, 20–24 June 2007; pp. 361–368. [Google Scholar]
  12. Cvetkovic, D.M.; Doob, M.; Sachs, H. Spectra of Graphs; Academic Press: New York, NY, USA, 1980; Volume 10. [Google Scholar]
  13. Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines; Microsoft: Redmond, WA, USA, 1998. [Google Scholar]
  14. Bach, F.R.; Lanckriet, G.R.; Jordan, M.I. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In Proceedings of the Twenty-First international Conference on Machine Learning ACM, Banff, AB, Canada, 4–8 July 2004; p. 6. [Google Scholar]
  15. Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  16. Rahimi, A.; Recht, B. Random features for large-scale kernel machines. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–5 December 2007; pp. 1177–1184. [Google Scholar]
  17. Ghiasi-Shirazi, K.; Safabakhsh, R.; Shamsi, M. Learning translation invariant kernels for classification. J. Mach. Learn. Res. 2010, 11, 1353–1390. [Google Scholar]
  18. Oliva, J.B.; Dubey, A.; Wilson, A.G.; Póczos, B.; Schneider, J.; Xing, E.P. Bayesian Nonparametric Kernel-Learning. In Proceedings of the Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 1078–1086. [Google Scholar]
  19. Rasmussen, C. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; pp. 554–560. [Google Scholar]
  20. Genton, M. Classes of kernels for machine learning: A statistics perspective. J. Mach. Learn. Res. 2001, 2, 299–312. [Google Scholar]
  21. Bochner, S. Harmonic Analysis and the Theory of Probability; California University Press: Berkeley, CA, USA, 1955. [Google Scholar]
  22. Silverman, R. Locally stationary random processes. IRE Trans. Inf. Theory 1957, 3, 182–187. [Google Scholar] [CrossRef]
  23. Hoeffding, W. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding; Springer: Berlin, Germany, 1994; pp. 409–426. [Google Scholar]
  24. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 2013. [Google Scholar]
  25. Geman, S.; Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 1984, 6, 721–741. [Google Scholar] [CrossRef] [PubMed]
  26. Dua, D.; Graff, C. UCI Machine Learning Repository. Open J. Stat. 2017, 10. [Google Scholar]
  27. Yeh, I.C.; Yang, K.J.; Ting, T.M. Knowledge discovery on RFM model using Bernoulli sequence. Expert Syst. Appl. 2009, 36, 5866–5871. [Google Scholar] [CrossRef]
  28. Gama, J. Electricity Dataset. 2004. Available online: http://www.inescporto.pt/~{}jgama/ales/ales_5.html (accessed on 6 August 2019).
  29. Carbon, D. Mauna LOA CO2. 2004. Available online: https://cdiac.ess-dive.lbl.gov/ftp/trends/CO2/sio-keel-flask/maunaloa_c.dat (accessed on 6 August 2019).
  30. Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; et al. API design for machine learning software: Experiences from the scikit-learn project. In Proceedings of the ECML PKDD Workshop: Languages for Data Mining and Machine Learning, Prague, Czech Republic, 23–27 September 2013; pp. 108–122. [Google Scholar]
Figure 1. The XOR problem with the probability of belonging to class 1 (orange). If the probability is greater than 0.5, the sample is assigned to class 1; otherwise it is assigned to class 2.
Figure 2. An instance of the samples taken from Equation (12).
Figure 3. Real vs. predicted values using GaBaSR.
Figure 4. Mauna Loa CO2 from 1958 to 2001 and the prediction.
Table 1. Results of the Perceptron.

Dataset                  N        M    Sweeps  AUC
Breast Cancer [26]       569      500  5       0.96480
Credit-g [26]            1000     500  5       0.44576
Blood Transfusion [27]   748      500  5       0.37572
Electricity [28]         45,312   500  5       0.68576
EEG-eye-state [26]       14,980   500  5       0.61635
Kr vs. Kp [26]           3196     500  5       0.99357
Table 2. Results of SVM Classification.

Dataset                  N        M    Sweeps  AUC
Breast Cancer [26]       569      500  5       0.934656
Credit-g [26]            1000     500  5       0.85282
Blood Transfusion [27]   748      500  5       0
Electricity [28]         45,312   500  5       0.76076
EEG-eye-state [26]       14,980   500  5       0.692924
Kr vs. Kp [26]           3196     500  5       0.96970
Table 3. Results of GaBaSR Classification.

Dataset                  N        M    Sweeps  AUC
Breast Cancer [26]       569      500  5       0.51348
Credit-g [26]            1000     500  5       0.5142
Blood Transfusion [27]   748      500  5       0.54063
Electricity [28]         45,312   500  5       0.74991
EEG-eye-state [26]       14,980   500  5       0.519743
Kr vs. Kp [26]           3196     500  5       0.9045
Table 4. Results of GaBaSR Regression.

Dataset              M    Sweeps  MSE              R²
Mauna Loa CO2        250  5       0.60522          0.99789
California houses    250  5       2.91313          −1.16892
Boston house-price   250  5       3.10931          0.97060
Diabetes             250  5       5,918,263.63330  −869.13210
Table 5. Results of Linear Regression.

Dataset              MSE      R²
Mauna Loa CO2        6.86     0.98
California houses    0.55     0.59
Boston house-price   18.92    0.78
Diabetes             3141.62  0.51
Table 6. Results of SVM Regression with a Linear Kernel.

Dataset              MSE         R²
Mauna Loa CO2        758.08513   −1.60705
California houses    2.00777     −0.50784
Boston house-price   47.66904    0.43533
Diabetes             8094.08752  −0.36496
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
