Article

Maximum Correntropy Criterion with Distributed Method

1 School of Mathematics and Statistics, South-Central University for Nationalities, Wuhan 430074, China
2 School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(3), 304; https://doi.org/10.3390/math10030304
Submission received: 23 November 2021 / Revised: 5 January 2022 / Accepted: 14 January 2022 / Published: 19 January 2022
(This article belongs to the Topic Machine and Deep Learning)

Abstract

The Maximum Correntropy Criterion (MCC) has recently triggered enormous research activity in the engineering and machine learning communities, since it is robust in the face of heavy-tailed noise or outliers in practice. This work is interested in distributed MCC algorithms, based on a divide-and-conquer strategy, which can deal with big data efficiently. By establishing minimax optimal error bounds, our results show that the averaged output function of this distributed algorithm achieves convergence rates comparable to those of the algorithm processing the total data on a single machine.

1. Introduction

In the big data era, the rapid expansion of data generation produces data sets of prohibitive size and complexity, which poses challenges to many traditional learning algorithms that require access to the whole data set. Distributed learning algorithms, based on the divide-and-conquer strategy, provide a simple and efficient way to address this issue and have therefore received increasing attention. Such a strategy starts by partitioning the big data set into multiple subsets that are distributed to local machines, then obtains a local estimator on each subset by running a base algorithm, and finally pools the local estimators together by simple averaging. It can substantially cut the time and memory costs of implementing the algorithm, and in many practical applications its learning performance has been shown to be as good as that of a single machine that can use all the data. This scheme has been developed in various learning contexts, including spectral algorithms [1,2], kernel ridge regression [3,4,5], gradient descent [6,7], semi-supervised approaches [8], minimum error entropy [9] and bias correction [10].
Regression estimation and inference play an important role in the fields of data mining and statistics. The traditional ordinary least squares (OLS) method provides an efficient estimator if the regression model error is normally distributed. However, heavy-tailed noise and outliers are common in the real world, which limits the application of OLS in practice. Various robust losses have been proposed in place of the least squares loss to deal with this problem. Commonly used robust losses include the adaptive Huber loss [11], gain functions [12], minimum error entropy [13], the exponential squared loss [14], etc. Among them, the Maximum Correntropy Criterion (MCC) is widely employed as an efficient alternative to the ordinary least squares method, which is suboptimal in non-Gaussian and non-linear signal processing situations [15,16,17,18,19]. Recently, MCC has been studied extensively in the literature and is widely adopted for many learning tasks, e.g., wind power forecasting [20] and pattern recognition [19]. In this paper, we are interested in the implementation of MCC by a distributed gradient descent method in a big data setting. Note that the MCC loss function is non-convex, so its analysis is essentially different from that of the least squares method; a rigorous analysis of distributed MCC is necessary to derive its consistency and learning rates.
Given a hypothesis function $f : \mathcal{X} \to \mathcal{Y}$ and a scaling parameter $\sigma > 0$, the correntropy between $f(X)$ and $Y$ is defined by
$$V_\sigma(f) := \mathbb{E}\left[ G\left( \frac{(f(X) - Y)^2}{2\sigma^2} \right) \right],$$
where $G(u)$ is the Gaussian function $\exp(-u)$, $u \in \mathbb{R}$. Given the sample set $D = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, the empirical form of $V_\sigma$ is
$$\hat{V}_\sigma(f) := \frac{1}{N} \sum_{i=1}^N G\left( \frac{(f(x_i) - y_i)^2}{2\sigma^2} \right).$$
The purpose of MCC is to maximize the empirical correntropy $\hat{V}_\sigma$ over a hypothesis space $\mathcal{H}$, that is,
$$f_{z,\mathcal{H}} := \arg\max_{f \in \mathcal{H}} \hat{V}_\sigma(f). \tag{1}$$
In the statistical learning context, the loss induced by correntropy, $\phi_\sigma : \mathbb{R} \to \mathbb{R}_+$, is defined as
$$\phi_\sigma(u) := \sigma^2 \left( 1 - G\left( \frac{u^2}{2\sigma^2} \right) \right) = \sigma^2 \left( 1 - \exp\left( -\frac{u^2}{2\sigma^2} \right) \right),$$
where $\sigma > 0$ is the scaling parameter. This loss function can be viewed as a variant of the Welsch function [21], and the estimator $f_{z,\mathcal{H}}$ of (1) is also the minimizer of the empirical risk minimization scheme over $\mathcal{H}$, that is,
$$\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^N \phi_\sigma(f(x_i) - y_i). \tag{2}$$
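To make these definitions concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the toy residuals are assumptions). It evaluates the correntropy-induced loss $\phi_\sigma$ and the empirical correntropy $\hat{V}_\sigma$; note that maximizing $\hat{V}_\sigma$ is equivalent to minimizing the average of $\phi_\sigma$, since $\phi_\sigma(u) = \sigma^2(1 - G(u^2/2\sigma^2))$.

```python
import numpy as np

def correntropy_loss(u, sigma):
    """phi_sigma(u) = sigma^2 * (1 - exp(-u^2 / (2 sigma^2)))."""
    return sigma**2 * (1.0 - np.exp(-u**2 / (2.0 * sigma**2)))

def empirical_correntropy(residuals, sigma):
    """V_hat_sigma(f) = (1/N) * sum_i G((f(x_i) - y_i)^2 / (2 sigma^2))."""
    return float(np.mean(np.exp(-residuals**2 / (2.0 * sigma**2))))

# Toy residuals with one gross outlier: the loss stays bounded by sigma^2,
# which is the source of MCC's robustness to outliers.
residuals = np.array([0.1, -0.2, 0.05, 8.0])
for sigma in (0.5, 1.0, 5.0):
    print(sigma, empirical_correntropy(residuals, sigma),
          correntropy_loss(residuals, sigma).mean())
```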
This paper aims at rigorous analysis of distributed gradient descent MCC within the framework of reproducing kernel Hilbert spaces (RKHSs). Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel [22], i.e., a continuous, symmetric and positive semi-definite function. A kernel $K$ is said to be positive semi-definite if the matrix $\left( K(u_i, u_j) \right)_{i,j=1}^m$ is positive semi-definite for any finite set $\{u_1, \ldots, u_m\} \subset \mathcal{X}$ and $m \in \mathbb{N}$. The RKHS $\mathcal{H}_K$ associated with the Mercer kernel $K$ is defined to be the completion of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in \mathcal{X}\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ given by $\langle K_x, K_u \rangle_K = K(x, u)$. It has the reproducing property
$$f(x) = \langle f, K_x \rangle_K \tag{3}$$
for any $f \in \mathcal{H}_K$ and $x \in \mathcal{X}$. Denote $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x, x)}$. By the property (3), we get that
$$\|f\|_\infty \le \kappa \|f\|_K \quad \text{for any } f \in \mathcal{H}_K. \tag{4}$$
Definition 1.
Given the sample set $D = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, the kernel gradient descent algorithm for solving (2) is stated iteratively, with $f_{1,D} = 0$, as
$$f_{t+1,D} = f_{t,D} - \eta_t \frac{1}{N} \sum_{i=1}^N \phi_\sigma'\left( f_{t,D}(x_i) - y_i \right) K_{x_i}, \tag{5}$$
where $\eta_t$ is the step size and $\phi_\sigma'\left( f_{t,D}(x_i) - y_i \right) = G\left( \frac{(f_{t,D}(x_i) - y_i)^2}{2\sigma^2} \right) \left( f_{t,D}(x_i) - y_i \right)$.
The divide-and-conquer algorithm for the kernel gradient descent MCC (5) is easy to describe. Rather than operating on the whole set of N examples, the distributed algorithm executes the following three steps (a code sketch follows the list):
  • Partition the data set D evenly and uniformly into m disjoint subsets $D_j$, $1 \le j \le m$.
  • Perform algorithm (5) on each data set $D_j$, and obtain the local estimate $f_{T+1,D_j}$ after the T-th iteration.
  • Take the average $\bar{f}_{T+1,D} = \frac{1}{m} \sum_{j=1}^m f_{T+1,D_j}$ as the final output.
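The following Python sketch illustrates the three steps above. It is a minimal illustration of ours, assuming a Gaussian kernel and toy hyperparameter values; function names such as local_mcc_gd are hypothetical, not from the paper. Each local estimate is represented by its coefficient vector on the subset's Gram matrix, and the final predictor averages the m local predictions.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """K(x, u) = exp(-||x - u||^2 / (2 * width^2)); the width is our toy choice.
    A and B are (n, d) and (p, d) arrays of inputs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width**2))

def local_mcc_gd(X, y, T, eta=0.5, theta=0.25, sigma=1.0, width=1.0):
    """Iteration (5) on one subset, with step size eta_t = eta * t^{-theta}.
    The iterate is represented as f(x) = sum_i alpha[i] * K(x_i, x)."""
    n = len(y)
    G = gaussian_kernel(X, X, width)
    alpha = np.zeros(n)
    for t in range(1, T + 1):
        r = G @ alpha - y                                 # residuals f_t(x_i) - y_i
        grad = np.exp(-r**2 / (2.0 * sigma**2)) * r       # phi_sigma'(residual)
        alpha -= (eta * t ** (-theta) / n) * grad
    return alpha

def distributed_mcc_predict(X, y, m, T, X_test, width=1.0, **gd_kwargs):
    """Divide-and-conquer MCC: split D into m subsets, run (5) locally,
    and average the m local predictions on X_test."""
    parts = np.array_split(np.random.permutation(len(y)), m)
    preds = np.zeros(len(X_test))
    for idx in parts:
        alpha = local_mcc_gd(X[idx], y[idx], T, width=width, **gd_kwargs)
        preds += gaussian_kernel(X_test, X[idx], width) @ alpha
    return preds / m
```

On simulated data from the noise model of Section 2, one would compare m = 1 against moderate m; Theorem 1 below predicts no loss in the convergence rate as long as m respects the constraint (8).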
In the next section, we study the asymptotic behavior of the final estimator $\bar{f}_{T+1,D}$ and show that $\bar{f}_{T+1,D}$ attains the minimax optimal rates over all estimators using the total data set of N samples, provided that the scaling parameter σ is chosen suitably.

2. Assumptions and Main Results

In the setting of non-parametric estimation, we denote by $X$ the explanatory variable taking values in a compact domain $\mathcal{X}$, and by $Y \in \mathcal{Y} \subseteq \mathbb{R}$ a real-valued response variable. Let ρ be the underlying distribution on $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$. Moreover, let $\rho_X$ be the marginal distribution of ρ on $\mathcal{X}$ and $\rho(\cdot \mid x)$ be the conditional distribution on $\mathcal{Y}$ for given $x \in \mathcal{X}$.
This work focuses on the application of MCC to regression problems, which is linked to the additive noise model
$$Y = f_\rho(X) + e, \qquad \mathbb{E}(e \mid X) = 0,$$
where $e$ is the noise and $f_\rho(x)$ is the regression function, i.e., the conditional mean $\mathbb{E}(Y \mid X = x)$ for $x \in \mathcal{X}$. The goal of this paper is to estimate the error between $\bar{f}_{T+1,D}$ and $f_\rho$ in the $L^2_{\rho_X}$-metric, defined by $\|\cdot\|_{L^2_{\rho_X}} := \left( \int_{\mathcal{X}} |\cdot|^2 \, d\rho_X \right)^{1/2}$. For simplicity, we will use $\|\cdot\|$ to denote the norm $\|\cdot\|_{L^2_{\rho_X}}$ when the meaning is clear from the context.
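To illustrate the setting, the following toy sketch (our own, not from the paper) draws data from the additive noise model with heavy-tailed Student-t noise, the regime in which MCC is expected to outperform OLS. The target $f_\rho$ and all parameter values are illustrative assumptions.

```python
import numpy as np

# Illustrative data from Y = f_rho(X) + e with E(e|X) = 0; Student-t noise
# with df = 1.5 has a well-defined mean (zero) but infinite variance.
rng = np.random.default_rng(1)
def f_rho(x):                      # any bounded regression function works
    return np.sin(np.pi * x)
X = rng.uniform(-1.0, 1.0, size=2000)
e = rng.standard_t(df=1.5, size=2000)
Y = f_rho(X) + e
```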
Below, we present two important assumptions, which play a vital role in carrying out the analysis. The first assumption concerns the regularity of the target function $f_\rho$. Define the integral operator $L_K : L^2_{\rho_X} \to L^2_{\rho_X}$ associated with $K$ by
$$L_K f := \int_{\mathcal{X}} f(x) K_x \, d\rho_X(x), \qquad f \in L^2_{\rho_X}.$$
As $K$ is a Mercer kernel on the compact domain $\mathcal{X}$, the operator $L_K$ is compact and positive, so $L_K^r$, the $r$-th power of $L_K$ for $r > 0$, is well defined. Our error bounds are stated in terms of the regularity of the target function $f_\rho$, given by [3,23]
$$f_\rho = L_K^r(h_\rho), \quad \text{for some } r > 0 \text{ and } h_\rho \in L^2_{\rho_X}. \tag{6}$$
The condition (6) measures the regularity of $f_\rho$ and is closely related to the smoothness of $f_\rho$ when $\mathcal{H}_K$ is a Sobolev space. If (6) holds with $r \ge \frac{1}{2}$, then $f_\rho$ lies in the space $\mathcal{H}_K$.
The second assumption, (7), concerns the capacity of $\mathcal{H}_K$, measured by the effective dimension [24,25]
$$\mathcal{N}(\lambda) = \mathrm{Trace}\left( (L_K + \lambda I)^{-1} L_K \right), \qquad \lambda > 0,$$
where $I$ is the identity operator on $\mathcal{H}_K$. In this paper, we assume that
$$\mathcal{N}(\lambda) \le C \lambda^{-s} \quad \text{for some } C > 0 \text{ and } 0 < s \le 1. \tag{7}$$
Note that (7) always holds with $s = 1$. For $0 < s < 1$, it is almost equivalent to requiring that the eigenvalues $\sigma_i$ of $L_K$ decay at the rate $i^{-1/s}$. The smoother the kernel function $K$, the smaller $s$ and the smaller the function space $\mathcal{H}_K$. In particular, if $K$ is a Gaussian kernel, then $s$ can be arbitrarily close to 0, as $K \in C^\infty$.
Throughout the paper, we assume that $\kappa := \sup_{x \in \mathcal{X}} \sqrt{K(x, x)} \le 1$ and $|y| \le M$ for some $M > 0$. We denote by $\lceil a \rceil$ the smallest integer not less than $a$.
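As an aside, the effective dimension can be approximated from data: the eigenvalues of the normalized Gram matrix serve as a finite-sample proxy for those of $L_K$. The sketch below is our own illustration under that assumption.

```python
import numpy as np

def effective_dimension(K_matrix, lam):
    """Empirical proxy for N(lambda) = Trace((L_K + lam I)^{-1} L_K):
    the eigenvalues of the normalized Gram matrix K_matrix / n are used
    in place of the (population) eigenvalues of the operator L_K."""
    n = K_matrix.shape[0]
    eigs = np.clip(np.linalg.eigvalsh(K_matrix / n), 0.0, None)
    return float(np.sum(eigs / (eigs + lam)))

# For a Gaussian kernel the eigenvalues decay very fast, so N(lambda)
# grows only slowly as lambda decreases (s close to 0 in (7)).
```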
Theorem 1.
Assume that (6) and (7) hold for some $r > \frac{1}{2}$ and $0 < s \le 1$. Take $\eta_t = \eta t^{-\theta}$ with $0 < \eta \le 1$ and $0 \le \theta < 1$. If $T = \lceil N^{\frac{1}{(2r+s)(1-\theta)}} \rceil$ and the partition number m of the data set D satisfies
$$m \le \frac{N^{\frac{r - 1/2}{2r+s}}}{(\log N)^5}, \tag{8}$$
then with confidence at least $1 - \delta$,
$$\left\| \bar{f}_{T+1,D} - f_\rho \right\| \le \tilde{C} \left( N^{-\frac{r}{2r+s}} + N^{\frac{5/2}{2r+s}} \sigma^{-2} \right) \left( \log \frac{12}{\delta} \right)^4,$$
where $\tilde{C}$ is a constant depending on θ.
Remark 1.
The above theorem, proved in Section 4, exhibits concrete learning rates for the distributed estimator $\bar{f}_{T+1,D}$ (and hence for the standard estimator of (5) with $m = 1$). It implies that kernel gradient descent for MCC on both the single and the distributed data set achieves the learning rate $O\left( N^{-\frac{r}{2r+s}} \right)$ when σ is large enough, which equals the minimax optimal rate in the regression setting [24,26] in the case $r > \frac{1}{2}$. This theorem suggests that distributed MCC does not sacrifice the convergence rate provided that the partition number m satisfies the constraint (8). Thus, the distributed MCC estimator $\bar{f}_{T+1,D}$ enjoys both computational efficiency and statistical optimality.
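For intuition, the following small helper (our own illustration; the function name and the concrete numbers are assumptions, not from the paper) evaluates the choice of T, the ceiling on m from (8), and the σ suggested by Corollary 1 below for given N, r, s and θ.

```python
import math

def theorem1_parameters(N, r, s, theta=0.0):
    """Parameter choices suggested by Theorem 1 and Corollary 1."""
    T = math.ceil(N ** (1.0 / ((2 * r + s) * (1 - theta))))
    m_max = N ** ((r - 0.5) / (2 * r + s)) / math.log(N) ** 5   # bound (8)
    sigma = N ** ((r / 2 + 5 / 4) / (2 * r + s))                # Corollary 1
    rate = N ** (-r / (2 * r + s))                              # optimal rate
    return T, m_max, sigma, rate

# e.g. N = 10**6, r = 1, s = 1/2: the rate is N^{-2/5}, and the ceiling
# on m grows like N^{1/5} / (log N)^5.
```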
With the help of Theorem 1, we can easily deduce the following optimal learning rate in expectation.
Corollary 1.
Assume that (6) and (7) hold for some $r > \frac{1}{2}$ and $0 < s \le 1$, and take $\eta_t = \eta t^{-\theta}$ with $0 < \eta \le 1$ and $0 \le \theta < 1$. If $T = \lceil N^{\frac{1}{(2r+s)(1-\theta)}} \rceil$, m satisfies (8) and $\sigma \ge N^{\frac{r/2 + 5/4}{2r+s}}$, then we have
$$\mathbb{E} \left\| \bar{f}_{T+1,D} - f_\rho \right\| = O\left( N^{-\frac{r}{2r+s}} \right).$$
By the confidence-based error estimate in Theorem 1, we can obtain the following almost sure convergence of the distributed gradient descent algorithm for MCC.
Corollary 2.
Assume that (6) and (7) hold for some $r > \frac{1}{2}$ and $0 < s \le 1$, and take $\eta_t = \eta t^{-\theta}$ with $0 < \eta \le 1$ and $0 \le \theta < 1$. If $T = \lceil N^{\frac{1}{(2r+s)(1-\theta)}} \rceil$, m satisfies (8) and $\sigma \ge N^{\frac{r/2 + 5/4}{2r+s}}$, then for arbitrary $\epsilon > 0$ we have, almost surely,
$$\lim_{N \to \infty} N^{\frac{r}{2r+s} - \epsilon} \left\| \bar{f}_{T+1,D} - f_\rho \right\| = 0.$$

3. Discussion and Conclusions

In this work, we have studied the theoretical properties and convergence behavior of a distributed kernel gradient descent MCC algorithm. As shown in Theorem 1, we derived minimax optimal error bounds for the distributed learning algorithm under a regularity condition on the regression function and a capacity condition on the RKHS. In the standard kernel gradient descent MCC algorithm ($m = 1$), the aggregate time complexity is $O(t N^2)$ after t iterations. In the distributed case ($m > 1$), the aggregate time complexity reduces to $O(t N^2 / m)$ after t iterations. In conclusion, the kernel gradient descent MCC algorithm (5) with the distributed method achieves fast convergence rates while substantially reducing algorithmic costs.
When the optimization problem arises from a non-convex loss, the iterate sequence generated by a gradient descent algorithm may converge only to a stationary point or a local minimizer. Note that the loss $\phi_\sigma$ induced by correntropy is not convex; thus, convergence of the gradient descent method (5) to the global minimizer is not unconditionally guaranteed, which brings difficulties to the mathematical analysis of convergence. Theorem 1 addresses this issue: it shows that, in the theoretical analysis, the iterates of the algorithm nevertheless attain the globally optimal learning rate.
For regression problems, the distributed method has been introduced into iterative algorithms in various learning paradigms, and the minimax optimal rate has been obtained under different constraints on the partition number m. For distributed spectral algorithms [1], the bound on m that ensures the optimal rates is
$$m \le N^{\min\left\{ \frac{2}{2r+s}, \, \frac{2r-1}{2r+s} \right\}}. \tag{9}$$
We see from (9) that the restriction on m suffers from a saturation phenomenon: the maximal m guaranteeing the optimal learning rate does not improve as r grows beyond $3/2$. Our restriction (8) is worse than (9) when $r < 5/2$ but better when $r > 5/2$, as the upper bound in (8) increases with r and thus overcomes the saturation effect in (9). For distributed kernel gradient descent algorithms with the least squares method [6] and the minimum error entropy (MEE) principle [9], the restrictions on m are
$$m \le \frac{N^{\frac{r - 1/2}{2r+s}}}{(\log N)^4} + 1 \tag{10}$$
and
$$m \le \frac{N^{\frac{r - 1/2}{2r+s}}}{(\log N)^5}, \tag{11}$$
respectively. Our bound (8) for MCC differs from (10) for least squares only by a logarithmic term, which has little impact on the upper bound of m ensuring optimal rates; moreover, numerical experiments show that the distributed kernel gradient descent algorithm for the least squares method is inferior to that for MCC in non-Gaussian noise models [15,27,28]. Our bound (8) is the same as (11) for the MEE principle. As is well known, MEE also performs well in dealing with non-Gaussian noise or heavy-tailed distributions [13,29]. However, MEE belongs to the class of pairwise learning problems, which work with pairs of samples rather than single samples as in MCC. Hence, the distributed kernel gradient descent algorithm for MCC has an advantage over MEE in algorithmic complexity.
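To compare the constraints (8)-(11) numerically, one can evaluate them side by side; the helper below is our own illustration, with constants and log factors as displayed above.

```python
import math

def m_bounds(N, r, s):
    """Partition-number ceilings (8)-(11) discussed above."""
    logN = math.log(N)
    mcc  = N ** ((r - 0.5) / (2 * r + s)) / logN ** 5        # (8), also MEE (11)
    spec = N ** (min(2.0, 2 * r - 1) / (2 * r + s))           # (9), saturates at r = 3/2
    ls   = N ** ((r - 0.5) / (2 * r + s)) / logN ** 4 + 1     # (10)
    return mcc, spec, ls

# For large r the bound (8) keeps growing, while (9) is capped by
# N^{2/(2r+s)}; this is why (8) wins once r > 5/2.
```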
Several related questions are worthwhile for future research. First, our distributed result provides the optimal rates by requiring a large robust scaling parameter σ. In practice, a moderate σ may be enough to ensure good learning performance in robust estimation, as shown by [17]. It is therefore of interest to investigate the convergence properties of the distributed version of algorithm (5) when σ is chosen as a constant, or when $\sigma(N) \to 0$ as N approaches ∞.
Secondly, our algorithm is carried out in the framework of supervised learning; however, in numerous real-world applications few labeled data are available while a large amount of unlabeled data is given, since labeling data is costly in both time and money. Thus, we shall investigate how to enhance the learning performance of the MCC algorithm by combining the distributed method with the additional information provided by unlabeled data.
Thirdly, as stated in Theorem 1, the choice of the last iteration T and the partition number m depends on the parameters r , s , which are usually unknown in advance. In practice, cross-validation is usually used to tune T and m adaptively. It would be interesting to know whether the kernel gradient descent MCC (5) with the distributed method can achieve the optimal convergence rate with adaptive T and m .
Last but not least, we should note here that all the data $D = \{(x_i, y_i)\}_{i=1}^N$ are drawn independently from the same distribution. In the distributed method, we partition D evenly and uniformly into m disjoint subsets, meaning that $|D_1| = \cdots = |D_m| = \frac{N}{m}$ and each sample $(x_i, y_i)$ is assigned to a subset $D_j$ ($1 \le j \le m$) with the same probability. In the context of uniform random sampling, such a random splitting strategy is reasonable and practical, so our theoretical analysis is based on the uniform random splitting mechanism. However, the theoretical analysis of other random or non-random splitting mechanisms requires new mathematical tools for optimal performance; this is beyond the scope of this paper and is left for future work.

4. Proofs of Main Results

This section is devoted to proving the main results in Section 2. Here and in the following, let the sample size of each subset $D_1, \ldots, D_m$ be n; that is, $D = D_1 \cup \cdots \cup D_m$ and $N = mn$. Define the empirical operator $L_{K,D}$ on $\mathcal{H}_K$ as
$$L_{K,D}(f) = \frac{1}{N} \sum_{i=1}^N \langle f, K_{x_i} \rangle_K \, K_{x_i}, \qquad f \in \mathcal{H}_K,$$
where $x_1, \ldots, x_N \in \{x : (x, y) \in D \text{ for some } y \in \mathcal{Y}\}$. Similarly, we define the operator $L_{K,D_j}$ on $\mathcal{H}_K$ for each subset $D_j$, $1 \le j \le m$,
$$L_{K,D_j}(f) = \frac{1}{n} \sum_{i=1}^n \langle f, K_{x_i} \rangle_K \, K_{x_i}, \qquad f \in \mathcal{H}_K,$$
where $x_1, \ldots, x_n \in \{x : (x, y) \in D_j \text{ for some } y \in \mathcal{Y}\}$.
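In finite-sample computations, the action of the empirical operator reduces to a Gram-matrix multiplication, since $\langle f, K_{x_i} \rangle_K = f(x_i)$ by (3). Below is a minimal sketch of ours for $f = \sum_j \alpha_j K_{x_j}$.

```python
import numpy as np

def empirical_operator_apply(gram, alpha):
    """Action of L_{K,D} on f = sum_j alpha_j K_{x_j}.
    Since <f, K_{x_i}>_K = f(x_i) by the reproducing property,
    L_{K,D} f = (1/N) sum_i f(x_i) K_{x_i}, whose coefficient vector
    in the span of {K_{x_i}} is (gram @ alpha) / N."""
    N = gram.shape[0]
    return gram @ alpha / N
```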

4.1. Preliminaries

We first introduce some necessary lemmas for the proofs, which can be found in [3,6,9].
Lemma 1.
Let $g(z)$ be a measurable function defined on $\mathcal{Z}$ with $\|g\|_\infty \le M$ almost surely for some $M > 0$. Let $0 < \delta < 1$; then each of the following estimates holds with confidence at least $1 - \delta$:
$$\left\| (L_K + \lambda I)^{-1/2} (L_K - L_{K,D}) \right\| \le 2 \mathcal{A}_{D,\lambda} \log \frac{2}{\delta},$$
$$\left\| (L_{K,D} + \lambda I)^{-1} (L_K + \lambda I) \right\| \le 2 \left( \frac{2 \mathcal{A}_{D,\lambda} \log \frac{2}{\delta}}{\sqrt{\lambda}} \right)^2 + 2,$$
and
$$\left\| (L_K + \lambda I)^{-1/2} \left( \frac{1}{N} \sum_{i=1}^N g(z_i) K_{x_i} - L_K g \right) \right\|_K \le 2 M \mathcal{A}_{D,\lambda} \log \frac{2}{\delta},$$
where $\mathcal{A}_{D,\lambda} := \frac{1}{N \sqrt{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{N}}$.
Let $\pi_i^t$ denote the polynomial defined by $\pi_i^t(x) = \prod_{j=i}^t (1 - \eta_j x)$ for $i \le t$ and, for notational simplicity, let $\pi_{t+1}^t(x) = 1$ be the constant one function. In our proofs, we need to deal with the polynomial operators $\pi_i^t(L_K)$ and $\pi_i^t(L_{K,D})$. For this purpose, we introduce the conventional notation $\prod_{j=T+1}^T := 1$ and the following preliminary lemmas.
Lemma 2.
If $0 \le \alpha < 1$ and $0 \le \theta < 1$, then for $T \ge 3$,
$$\sum_{i=1}^T i^{-(\theta + \alpha)} \left( \sum_{j=i+1}^T j^{-\theta} \right)^{-1} \le C_{\theta,\alpha} \, T^{-\min\{\alpha, \, 1-\theta\}} \log T,$$
where $C_{\theta,\alpha}$ is a constant depending only on θ and α, whose value is given in the proof. In particular, if $\alpha = 0$, we have
$$\sum_{i=1}^T i^{-\theta} \left( \sum_{j=i+1}^T j^{-\theta} \right)^{-1} \le 15 \log T.$$
Lemma 3.
If $\eta_t = \eta t^{-\theta}$ with $0 < \eta < 1$ and $0 \le \theta < 1$, then for $1 \le i \le T - 1$,
$$\left\| \pi_i^t(L_{K,D}) \right\| \le 1, \qquad \left\| \pi_i^t(L_K) \right\| \le 1,$$
$$\left\| L_{K,D} \, \pi_{i+1}^T(L_{K,D}) \right\| \le e \left( \eta \sum_{j=i+1}^T j^{-\theta} \right)^{-1}, \qquad \left\| L_K \, \pi_{i+1}^T(L_K) \right\| \le e \left( \eta \sum_{j=i+1}^T j^{-\theta} \right)^{-1},$$
and
$$\sum_{i=1}^T \eta_i \left\| (L_{K,D} + \lambda I) \pi_{i+1}^T(L_{K,D}) \right\| \le 1 + \frac{\eta \lambda}{1-\theta} T^{1-\theta}, \qquad \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I) \pi_{i+1}^T(L_K) \right\| \le 1 + \frac{\eta \lambda}{1-\theta} T^{1-\theta}. \tag{19}$$
Define a data-free gradient descent sequence for the least squares method in $\mathcal{H}_K$ by $f_1 = 0$ and
$$f_{t+1} = f_t - \eta_t \int_{\mathcal{X}} \left( f_t(x) - f_\rho(x) \right) K_x \, d\rho_X = (I - \eta_t L_K) f_t + \eta_t L_K f_\rho. \tag{20}$$
It has been well evidenced in the literature [30] that, under the assumption (6) with $r > \frac{1}{2}$, there hold
$$\left\| f_t - f_\rho \right\| \le h_\rho' \, t^{-r(1-\theta)} \tag{21}$$
and
$$\left\| f_t - f_\rho \right\|_K \le h_\rho' \, t^{-(r - \frac{1}{2})(1-\theta)}, \tag{22}$$
where $h_\rho' = \max\left\{ \|h_\rho\| \, (2r/e)^r, \; \|h_\rho\| \, [(2r-1)/e]^{r - \frac{1}{2}} \right\}$.
Lemma 4.
If $\eta_t = \eta t^{-\theta}$ with $0 < \eta < 1$ and $0 \le \theta < 1$, then there is a constant $C_{\rho,\theta,r}$ such that
$$\sum_{i=1}^T \eta_i \left\| L_{K,D} \, \pi_{i+1}^T(L_{K,D}) \right\| \left\| f_i - f_\rho \right\|_K \le C_{\rho,\theta,r}$$
and
$$\sum_{i=1}^T \eta_i \left\| L_K \, \pi_{i+1}^T(L_K) \right\| \left\| f_i - f_\rho \right\|_K \le C_{\rho,\theta,r}.$$
Lemma 5.
If $\eta_t = \eta t^{-\theta}$ with $0 < \eta < 1$ and $0 \le \theta < 1$, then there is a constant $D_{\rho,\theta,r}$ such that
$$\sum_{i=1}^T \eta_i \left\| f_i - f_\rho \right\|_K \le D_{\rho,\theta,r} \, T^{1-\theta}.$$
Recall the isometric relation between $\mathcal{H}_K$ and $L^2_{\rho_X}$, which yields
$$\|f\| = \left\| L_K^{1/2} f \right\|_K \le \left\| (L_K + \lambda I)^{1/2} f \right\|_K \quad \text{for all } f \in \mathcal{H}_K. \tag{26}$$

4.2. Bound for the Learning Sequence

We will need the following bound for the learning sequence in the proof.
Theorem 2.
If the step size sequence is $\eta_t = \eta t^{-\theta}$ with $0 < \eta \le 1$ and $0 \le \theta < 1$, then the learning sequence $\{f_{t,D}\}$ produced by (5) satisfies
$$\left\| f_{t,D} \right\|_K \le M \, t^{\frac{1-\theta}{2}}. \tag{27}$$
Proof. 
We prove the statement by induction. First note that the conclusion holds trivially for $t = 1$, since $f_{1,D} = 0$. Next, suppose that $\|f_{t,D}\|_K^2 \le M^2 \sum_{i=1}^{t-1} \eta_i$ holds. By the updating rule (5) and the reproducing property, we have
$$\begin{aligned} \|f_{t+1,D}\|_K^2 &= \|f_{t,D}\|_K^2 - \frac{2\eta_t}{N} \sum_{i=1}^N \phi_\sigma'\left( f_{t,D}(x_i) - y_i \right) f_{t,D}(x_i) + \frac{\eta_t^2}{N^2} \Big\| \sum_{i=1}^N \phi_\sigma'\left( f_{t,D}(x_i) - y_i \right) K_{x_i} \Big\|_K^2 \\ &\le \|f_{t,D}\|_K^2 - \frac{2\eta_t}{N} \sum_{i=1}^N \phi_\sigma'\left( f_{t,D}(x_i) - y_i \right) f_{t,D}(x_i) + \frac{\eta_t^2}{N} \sum_{i=1}^N G_i^2 \left( f_{t,D}(x_i) - y_i \right)^2 \\ &= \|f_{t,D}\|_K^2 + \frac{\eta_t}{N} \sum_{i=1}^N Q_i, \end{aligned} \tag{28}$$
where we abbreviate $G_i := G\left( \frac{(f_{t,D}(x_i) - y_i)^2}{2\sigma^2} \right)$ and
$$Q_i = \left( \eta_t G_i^2 - 2 G_i \right) f_{t,D}(x_i)^2 + 2 \left( G_i - \eta_t G_i^2 \right) y_i \, f_{t,D}(x_i) + \eta_t G_i^2 y_i^2.$$
The restriction $\eta_t \le 1$ implies $\eta_t G_i^2 - 2 G_i < 0$. By the property of quadratic functions, $Q_i$ is maximized at the vertex, so
$$Q_i \le \eta_t G_i^2 y_i^2 + \frac{\left( G_i - \eta_t G_i^2 \right)^2 y_i^2}{2 G_i - \eta_t G_i^2} = \frac{G_i \, y_i^2}{2 - \eta_t G_i} \le M^2.$$
Plugging this into (28), we obtain
$$\|f_{t+1,D}\|_K^2 \le \|f_{t,D}\|_K^2 + M^2 \eta_t \le M^2 \sum_{i=1}^t \eta_i = M^2 \eta \sum_{i=1}^t i^{-\theta} \le M^2 t^{1-\theta}.$$
This completes the proof. □
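As a numerical sanity check of Theorem 2 (our own illustration, reusing the representation of the iterates by kernel coefficients from the sketch in Section 1; all data and parameter values are toy assumptions), one can track the RKHS norm $\|f_{t,D}\|_K = \sqrt{\alpha^\top G \alpha}$ along iteration (5) and compare it with $M t^{(1-\theta)/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.clip(np.sin(3.0 * X[:, 0]), -1.0, 1.0)             # |y_i| <= M = 1
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
G = np.exp(-d2 / 2.0)                                      # Gaussian kernel, kappa <= 1
eta, theta, sigma = 0.5, 0.25, 1.0
alpha = np.zeros(len(y))
for t in range(1, 201):
    r = G @ alpha - y
    alpha -= (eta * t ** (-theta) / len(y)) * np.exp(-r**2 / (2 * sigma**2)) * r
    # After this update alpha represents f_{t+1,D}; Theorem 2 gives
    # ||f_{t+1,D}||_K <= M * (t+1)^{(1-theta)/2}.
    assert np.sqrt(alpha @ G @ alpha) <= (t + 1) ** ((1 - theta) / 2)
```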

4.3. Error Decomposition and Estimation of Error Bounds

Now we are in a position to bound the error of the distributed kernel gradient descent MCC. For this purpose, we decompose the error $\bar{f}_{T+1,D} - f_\rho$ into two parts:
$$\left\| \bar{f}_{T+1,D} - f_\rho \right\| \le \left\| f_{T+1} - f_\rho \right\| + \left\| \bar{f}_{T+1,D} - f_{T+1} \right\|. \tag{29}$$
As mentioned in Section 4.1, the first term is bounded by (21) under the assumption (6) with $r > \frac{1}{2}$. Our key analysis concerns the second term, which can be bounded with the help of the following proposition.
Proposition 1.
Assume that (6) holds for some $r > \frac{1}{2}$. Let $\eta_t = \eta t^{-\theta}$ with $0 < \eta \le 1$ and $0 \le \theta < 1$. For $\lambda > 0$, there holds
$$\left\| f_{T+1,D} - f_{T+1} \right\| \le C_{r,\theta} \left( \mathcal{B}_{D,\lambda} \left( \mathcal{C}_{D,\lambda} + \mathcal{G}_{D,\lambda} \right) \left( 1 + \lambda T^{1-\theta} \right) + \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2} \right) \tag{30}$$
and
$$\left\| f_{T+1,D} - f_{T+1} \right\|_K \le C_{r,\theta} \left( \mathcal{B}_{D,\lambda} \left( \mathcal{C}_{D,\lambda} + \mathcal{G}_{D,\lambda} \right) \left( 1 + \lambda T^{1-\theta} \right) / \sqrt{\lambda} + \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2} \right), \tag{31}$$
where
$$\mathcal{B}_{D,\lambda} = \left\| (L_{K,D} + \lambda I)^{-1} (L_K + \lambda I) \right\|, \qquad \mathcal{C}_{D,\lambda} = \left\| (L_K + \lambda I)^{-1/2} (L_K - L_{K,D}) \right\|,$$
$$\mathcal{G}_{D,\lambda} = \left\| (L_K + \lambda I)^{-1/2} \left( L_K f_\rho - \hat{f}_{\rho,D} \right) \right\|_K, \qquad \hat{f}_{\rho,D} = \frac{1}{N} \sum_{i=1}^N y_i K_{x_i} = \frac{1}{N} \sum_{(x,y) \in D} y \, K_x, \tag{32}$$
and $C_{r,\theta}$ is a constant depending on r and θ, given in the proof.
Proof. 
By the definition of $f_{t,D}$ in (5) and the definition of $f_t$ in (20), we have
$$f_{t+1,D} - f_{t+1} = \left( I - \eta_t L_{K,D} \right) (f_{t,D} - f_t) + \eta_t \left( L_K - L_{K,D} \right) f_t + \eta_t \left( \hat{f}_{\rho,D} - L_K f_\rho \right) + \eta_t E_{t,D}, \tag{33}$$
where $\hat{f}_{\rho,D}$ is defined in (32) and
$$E_{t,D} = \frac{1}{N} \sum_{i=1}^N \left( 1 - G\left( \frac{(f_{t,D}(x_i) - y_i)^2}{2\sigma^2} \right) \right) \left( f_{t,D}(x_i) - y_i \right) K_{x_i}. \tag{34}$$
Applying (33) iteratively from $t = 1$ to $T$, we obtain
$$f_{T+1,D} - f_{T+1} = I_1 + I_2 + I_3 + I_4, \tag{35}$$
where
$$I_1 = \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_{K,D}) \left( L_K - L_{K,D} \right) (f_i - f_\rho), \qquad I_2 = \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_{K,D}) \left( L_K - L_{K,D} \right) f_\rho,$$
$$I_3 = \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_{K,D}) \left( \hat{f}_{\rho,D} - L_K f_\rho \right), \qquad I_4 = \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_{K,D}) \, E_{i,D}.$$
For $I_1$, by (26) and Lemmas 4 and 5,
$$\begin{aligned} \|I_1\| &\le \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I)^{1/2} \pi_{i+1}^T(L_{K,D}) \left( L_K - L_{K,D} \right) (f_i - f_\rho) \right\|_K \\ &\le \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I)^{1/2} (L_{K,D} + \lambda I)^{-1/2} \right\| \left\| (L_{K,D} + \lambda I) \pi_{i+1}^T(L_{K,D}) \right\| \left\| (L_{K,D} + \lambda I)^{-1/2} (L_K + \lambda I)^{1/2} \right\| \\ &\qquad \times \left\| (L_K + \lambda I)^{-1/2} \left( L_K - L_{K,D} \right) \right\| \left\| f_i - f_\rho \right\|_K \\ &\le \mathcal{B}_{D,\lambda} \, \mathcal{C}_{D,\lambda} \left( \sum_{i=1}^T \eta_i \left\| L_{K,D} \, \pi_{i+1}^T(L_{K,D}) \right\| \left\| f_i - f_\rho \right\|_K + \lambda \sum_{i=1}^T \eta_i \left\| f_i - f_\rho \right\|_K \right) \\ &\le \mathcal{B}_{D,\lambda} \, \mathcal{C}_{D,\lambda} \left( C_{\rho,\theta,r} + D_{\rho,\theta,r} \, \lambda T^{1-\theta} \right). \end{aligned} \tag{36}$$
For $I_2$, by (26), Lemma 3, and the fact $\|f_\rho\|_\infty \le M$, we have
$$\|I_2\| \le \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I)^{1/2} \pi_{i+1}^T(L_{K,D}) \left( L_K - L_{K,D} \right) f_\rho \right\|_K \le M \, \frac{1 + \lambda T^{1-\theta}}{1-\theta} \, \mathcal{B}_{D,\lambda} \, \mathcal{C}_{D,\lambda}. \tag{37}$$
Similarly, we can bound $I_3$ as
$$\|I_3\| \le \frac{1 + \lambda T^{1-\theta}}{1-\theta} \, \mathcal{B}_{D,\lambda} \, \mathcal{G}_{D,\lambda}. \tag{38}$$
For $I_4$, first note that by the bound (27) on $\{f_{t,D}\}$ and the elementary inequality $1 - G(u) \le u$ for $u \ge 0$,
$$\left\| \left( 1 - G\left( \frac{(f_{t,D}(x_i) - y_i)^2}{2\sigma^2} \right) \right) \left( f_{t,D}(x_i) - y_i \right) K_{x_i} \right\|_K \le \frac{\left| f_{t,D}(x_i) - y_i \right|^3}{2\sigma^2} \le \frac{\left( M + \|f_{t,D}\|_K \right)^3}{2\sigma^2} \le \frac{\left( 2 M t^{\frac{1-\theta}{2}} \right)^3}{2\sigma^2} = \frac{4 M^3 t^{\frac{3(1-\theta)}{2}}}{\sigma^2}.$$
This implies that
$$\left\| E_{t,D} \right\|_K \le \frac{4 M^3 t^{\frac{3(1-\theta)}{2}}}{\sigma^2}. \tag{39}$$
This, together with the estimate $\|\pi_{i+1}^T(L_{K,D})\| \le 1$, gives
$$\|I_4\| \le \sum_{i=1}^T \eta_i \left\| E_{i,D} \right\|_K \le \frac{4 M^3 \eta}{\sigma^2} \sum_{i=1}^T i^{\frac{3(1-\theta)}{2} - \theta} \le \frac{2^3 M^3 \eta}{5(1-\theta)} \cdot \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2}.$$
Combining the estimates (36), (37), (38) and (39) with (35), we obtain (30) with
$$C_{r,\theta} = C_{\rho,\theta,r} + D_{\rho,\theta,r} + \frac{2M}{1-\theta} + \frac{2^3 M^3}{5(1-\theta)}.$$
Following a similar process, we can obtain the bound (31). □
The following theorem provides a bound for the second term in (29).
Theorem 3.
Take $\lambda = T^{-(1-\theta)}$. There is a constant $C_{r,\theta}'$ such that
$$\left\| \bar{f}_{T+1,D} - f_{T+1} \right\| \le C_{r,\theta}' \left[ \mathcal{G}_{D,\lambda} + \mathcal{C}_{D,\lambda} + \lambda^{-\frac{1}{2}} \log T \sup_{1 \le l \le m} \mathcal{C}_{D_l,\lambda} \mathcal{B}_{D_l,\lambda} \left( \mathcal{C}_{D_l,\lambda} + \mathcal{G}_{D_l,\lambda} \right) + \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2} \left( 1 + \log T \sup_{1 \le l \le m} \mathcal{C}_{D_l,\lambda} \right) \right]. \tag{40}$$
Proof. 
For each subset $D_l$ and each $1 \le t \le T$, we have
$$f_{t+1,D_l} - f_{t+1} = \left( I - \eta_t L_K \right) (f_{t,D_l} - f_t) + \eta_t \left( L_K - L_{K,D_l} \right) f_{t,D_l} + \eta_t \left( \hat{f}_{\rho,D_l} - L_K f_\rho \right) + \eta_t E_{t,D_l}. \tag{41}$$
This implies that
$$f_{T+1,D_l} - f_{T+1} = \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_K) \left( L_K - L_{K,D_l} \right) f_{i,D_l} + \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_K) \left( \hat{f}_{\rho,D_l} - L_K f_\rho \right) + \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_K) \, E_{i,D_l}, \tag{42}$$
and therefore
$$\begin{aligned} \left\| \bar{f}_{T+1,D} - f_{T+1} \right\| &= \Big\| \frac{1}{m} \sum_{l=1}^m \left( f_{T+1,D_l} - f_{T+1} \right) \Big\| \\ &\le \Big\| \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_K) \frac{1}{m} \sum_{l=1}^m \left( L_K - L_{K,D_l} \right) f_{i,D_l} \Big\| + \Big\| \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_K) \frac{1}{m} \sum_{l=1}^m \left( \hat{f}_{\rho,D_l} - L_K f_\rho \right) \Big\| + \Big\| \frac{1}{m} \sum_{l=1}^m \sum_{i=1}^T \eta_i \, \pi_{i+1}^T(L_K) \, E_{i,D_l} \Big\| \\ &=: J_1 + J_2 + J_3. \end{aligned}$$
We first estimate $J_2$. By (26), Lemma 3, and the choice $\lambda = T^{-(1-\theta)}$ (so that $\lambda T^{1-\theta} = 1$), we obtain
$$J_2 \le \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I) \pi_{i+1}^T(L_K) \right\| \Big\| \frac{1}{m} \sum_{l=1}^m (L_K + \lambda I)^{-1/2} \left( \hat{f}_{\rho,D_l} - L_K f_\rho \right) \Big\|_K \le \frac{1 + \lambda T^{1-\theta}}{1-\theta} \left\| (L_K + \lambda I)^{-1/2} \left( \hat{f}_{\rho,D} - L_K f_\rho \right) \right\|_K = \frac{2}{1-\theta} \, \mathcal{G}_{D,\lambda},$$
where we used that $\frac{1}{m} \sum_{l=1}^m \hat{f}_{\rho,D_l} = \hat{f}_{\rho,D}$, since the subsets have equal size. For $J_3$, by (39) we have
$$J_3 \le \sup_{1 \le l \le m} \sum_{i=1}^T \eta_i \left\| \pi_{i+1}^T(L_K) \right\| \left\| E_{i,D_l} \right\|_K \le \frac{2^3 M^3 \eta}{5(1-\theta)} \cdot \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2}.$$
The estimation of $J_1$ is more involved. We decompose it into three parts:
$$\begin{aligned} J_1 &\le \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I) \pi_{i+1}^T(L_K) \right\| \Big\| \frac{1}{m} \sum_{l=1}^m (L_K + \lambda I)^{-1/2} \left( L_K - L_{K,D_l} \right) f_{i,D_l} \Big\|_K \\ &\le \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I) \pi_{i+1}^T(L_K) \right\| \Big\| \frac{1}{m} \sum_{l=1}^m (L_K + \lambda I)^{-1/2} \left( L_K - L_{K,D_l} \right) (f_{i,D_l} - f_i) \Big\|_K \\ &\quad + \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I) \pi_{i+1}^T(L_K) \right\| \left\| (L_K + \lambda I)^{-1/2} \left( L_K - L_{K,D} \right) (f_i - f_\rho) \right\|_K \\ &\quad + \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I) \pi_{i+1}^T(L_K) \right\| \left\| (L_K + \lambda I)^{-1/2} \left( L_K - L_{K,D} \right) f_\rho \right\|_K \\ &=: J_{11} + J_{12} + J_{13}, \end{aligned}$$
where we used $\frac{1}{m} \sum_{l=1}^m L_{K,D_l} = L_{K,D}$. By Lemmas 4 and 5 and the fact $\lambda T^{1-\theta} = 1$, we obtain
$$J_{12} \le \mathcal{C}_{D,\lambda} \left( \sum_{i=1}^T \eta_i \left\| L_K \, \pi_{i+1}^T(L_K) \right\| \left\| f_i - f_\rho \right\|_K + \lambda \sum_{i=1}^T \eta_i \left\| f_i - f_\rho \right\|_K \right) \le \mathcal{C}_{D,\lambda} \left( C_{\rho,\theta,r} + D_{\rho,\theta,r} \right).$$
For $J_{13}$, by (19) we have
$$J_{13} \le \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I) \pi_{i+1}^T(L_K) \right\| \left\| (L_K + \lambda I)^{-1/2} \left( L_K - L_{K,D} \right) \right\| \left\| f_\rho \right\|_K \le M \, \frac{1 + \lambda T^{1-\theta}}{1-\theta} \, \mathcal{C}_{D,\lambda} = \frac{2M}{1-\theta} \, \mathcal{C}_{D,\lambda}.$$
Now we turn to $J_{11}$. We have
$$\begin{aligned} J_{11} &\le \sum_{i=1}^T \eta_i \left\| (L_K + \lambda I) \pi_{i+1}^T(L_K) \right\| \sup_{1 \le l \le m} \left\| (L_K + \lambda I)^{-1/2} \left( L_K - L_{K,D_l} \right) (f_{i,D_l} - f_i) \right\|_K \\ &\le \sum_{i=1}^T \eta_i \left( \Big( \sum_{j=i+1}^T \eta_j \Big)^{-1} + \lambda \right) \sup_{1 \le l \le m} \mathcal{C}_{D_l,\lambda} \left\| f_{i,D_l} - f_i \right\|_K. \end{aligned} \tag{43}$$
By Proposition 1 (see (31)) and the choice $\lambda = T^{-(1-\theta)}$, for $1 \le i \le T$ there holds $\lambda i^{1-\theta} \le 1$ and
$$\left\| f_{i,D_l} - f_i \right\|_K \le C_{r,\theta} \left( \mathcal{B}_{D_l,\lambda} \left( \mathcal{C}_{D_l,\lambda} + \mathcal{G}_{D_l,\lambda} \right) \left( 1 + \lambda i^{1-\theta} \right) / \sqrt{\lambda} + \frac{i^{\frac{5(1-\theta)}{2}}}{\sigma^2} \right) \le C_{r,\theta} \left( 2 \mathcal{B}_{D_l,\lambda} \left( \mathcal{C}_{D_l,\lambda} + \mathcal{G}_{D_l,\lambda} \right) / \sqrt{\lambda} + \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2} \right).$$
Plugging this into (43), we obtain
$$J_{11} \le C_{r,\theta} \sup_{1 \le l \le m} \mathcal{C}_{D_l,\lambda} \left( 2 \mathcal{B}_{D_l,\lambda} \left( \mathcal{C}_{D_l,\lambda} + \mathcal{G}_{D_l,\lambda} \right) / \sqrt{\lambda} + \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2} \right) \sum_{i=1}^T \eta_i \left( \Big( \sum_{j=i+1}^T \eta_j \Big)^{-1} + \lambda \right).$$
From Lemma 2, we see that
$$\sum_{i=1}^T \eta_i \left( \Big( \sum_{j=i+1}^T \eta_j \Big)^{-1} + \lambda \right) \le 15 \log T + \frac{\eta \lambda T^{1-\theta}}{1-\theta} = 15 \log T + \frac{\eta}{1-\theta} \le \left( 15 + \frac{1}{1-\theta} \right) \log T.$$
So, we have
$$J_{11} \le C_{r,\theta} \left( 15 + \frac{1}{1-\theta} \right) \log T \sup_{1 \le l \le m} \mathcal{C}_{D_l,\lambda} \left( 2 \mathcal{B}_{D_l,\lambda} \left( \mathcal{C}_{D_l,\lambda} + \mathcal{G}_{D_l,\lambda} \right) / \sqrt{\lambda} + \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2} \right).$$
Combining the estimates for $J_{11}$, $J_{12}$ and $J_{13}$, we obtain
$$J_1 \le \left( \frac{2M}{1-\theta} + C_{\rho,\theta,r} + D_{\rho,\theta,r} \right) \mathcal{C}_{D,\lambda} + 2 C_{r,\theta} \left( 15 + \frac{1}{1-\theta} \right) \lambda^{-\frac{1}{2}} \log T \sup_{1 \le l \le m} \mathcal{C}_{D_l,\lambda} \mathcal{B}_{D_l,\lambda} \left( \mathcal{C}_{D_l,\lambda} + \mathcal{G}_{D_l,\lambda} \right) + C_{r,\theta} \left( 15 + \frac{1}{1-\theta} \right) \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2} \log T \sup_{1 \le l \le m} \mathcal{C}_{D_l,\lambda}.$$
Now the desired bound (40) for $\|\bar{f}_{T+1,D} - f_{T+1}\|$ follows by combining the estimates for $J_1$, $J_2$ and $J_3$, with the constant given by
$$C_{r,\theta}' := \frac{2M + 2}{1-\theta} + C_{\rho,\theta,r} + D_{\rho,\theta,r} + 3 C_{r,\theta} \left( 15 + \frac{1}{1-\theta} \right) + \frac{2^3 M^3 \eta}{5(1-\theta)}.$$
This proves the theorem. □

4.4. Proofs

Now we can prove Theorem 1.
Proof. 
Firstly, note that with the choices $T = \lceil N^{\frac{1}{(2r+s)(1-\theta)}} \rceil$ and $\lambda = T^{-(1-\theta)}$, and under the restriction (8) on m, we have
$$\mathcal{A}_{D,\lambda} = \frac{1}{N\sqrt{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{N}} \le N^{-1 + \frac{1}{4r+2s}} + \sqrt{C} \, N^{-\frac{1}{2} + \frac{s}{4r+2s}} \le (\sqrt{C} + 1) N^{-\frac{r}{2r+s}}.$$
Therefore, since each subset has size $n = N/m$,
$$\mathcal{A}_{D_l,\lambda} \le \frac{m}{N} N^{\frac{1}{4r+2s}} + \sqrt{C} \, m^{\frac{1}{2}} N^{-\frac{1}{2}} N^{\frac{s}{4r+2s}} \le (1 + \sqrt{C}) \, m^{\frac{1}{2}} N^{-\frac{r}{2r+s}}$$
and
$$\frac{\mathcal{A}_{D_l,\lambda}}{\sqrt{\lambda}} \le (1 + \sqrt{C}) \, m^{\frac{1}{2}} N^{-\frac{r}{2r+s}} N^{\frac{1}{4r+2s}} \le 1 + \sqrt{C}.$$
By applying Lemma 1 with δ replaced by $\frac{\delta}{6m}$, for any $1 \le l \le m$ each of the following bounds holds with confidence at least $1 - \frac{\delta}{6m}$:
$$\mathcal{B}_{D_l,\lambda} \le 2 \left( \frac{2 \mathcal{A}_{D_l,\lambda} \log \frac{12m}{\delta}}{\sqrt{\lambda}} \right)^2 + 2, \qquad \mathcal{C}_{D_l,\lambda} \le 2 \mathcal{A}_{D_l,\lambda} \log \frac{12m}{\delta}, \qquad \mathcal{G}_{D_l,\lambda} \le 2M \mathcal{A}_{D_l,\lambda} \log \frac{12m}{\delta}.$$
Consequently, these bounds hold simultaneously for all $1 \le l \le m$ with confidence at least $1 - \frac{\delta}{2}$. This implies that, with confidence at least $1 - \frac{\delta}{2}$, there holds
$$\begin{aligned} \lambda^{-\frac{1}{2}} \log T \sup_{1 \le l \le m} \mathcal{C}_{D_l,\lambda} \mathcal{B}_{D_l,\lambda} \left( \mathcal{C}_{D_l,\lambda} + \mathcal{G}_{D_l,\lambda} \right) &\le 2^6 (M+1) \left( \Big( \frac{\mathcal{A}_{D_l,\lambda}}{\sqrt{\lambda}} \Big)^2 + 1 \right) \frac{\mathcal{A}_{D_l,\lambda}^2}{\sqrt{\lambda}} \log T \left( \log \frac{12m}{\delta} \right)^4 \\ &\le 2^6 (M+1) \left( (1+\sqrt{C})^2 + 1 \right) (1+\sqrt{C})^2 \, m N^{-\frac{2r - 1/2}{2r+s}} \log T \left( \log \frac{12m}{\delta} \right)^4 \\ &\le \frac{2^{10} (M+1) \left( (1+\sqrt{C})^2 + 1 \right) (1+\sqrt{C})^2}{(2r+s)(1-\theta)} \, m N^{-\frac{2r - 1/2}{2r+s}} (\log N)^5 \left( \log \frac{12}{\delta} \right)^4 \\ &\le \frac{2^{10} (M+1) \left( (1+\sqrt{C})^2 + 1 \right) (1+\sqrt{C})^2}{(2r+s)(1-\theta)} \, N^{-\frac{r}{2r+s}} \left( \log \frac{12}{\delta} \right)^4 \end{aligned} \tag{45}$$
and
$$\begin{aligned} \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2} \left( 1 + \log T \sup_{1 \le l \le m} \mathcal{C}_{D_l,\lambda} \right) &\le \frac{T^{\frac{5(1-\theta)}{2}}}{\sigma^2} \left( 1 + 2 \mathcal{A}_{D_l,\lambda} \log T \log \frac{12m}{\delta} \right) \\ &\le 2 \sigma^{-2} N^{\frac{5/2}{2r+s}} \left( 1 + \frac{2 + 2\sqrt{C}}{(2r+s)(1-\theta)} \, m^{\frac{1}{2}} N^{-\frac{r}{2r+s}} (\log N)^2 \right) \log \frac{12}{\delta} \\ &\le 2 \sigma^{-2} N^{\frac{5/2}{2r+s}} \left( 1 + \frac{2 + 2\sqrt{C}}{(2r+s)(1-\theta)} \right) \log \frac{12}{\delta}, \end{aligned} \tag{46}$$
where in the last step we used $m^{\frac{1}{2}} N^{-\frac{r}{2r+s}} (\log N)^2 \le 1$ under (8).
By Lemma 1 applied to the whole sample D, we have with confidence at least $1 - \frac{\delta}{4}$,
$$\mathcal{C}_{D,\lambda} \le 2 \mathcal{A}_{D,\lambda} \log \frac{8}{\delta} \le 2 (\sqrt{C} + 1) N^{-\frac{r}{2r+s}} \log \frac{12}{\delta} \tag{47}$$
and, with confidence at least $1 - \frac{\delta}{4}$,
$$\mathcal{G}_{D,\lambda} \le 2M \mathcal{A}_{D,\lambda} \log \frac{8}{\delta} \le 2M (\sqrt{C} + 1) N^{-\frac{r}{2r+s}} \log \frac{12}{\delta}. \tag{48}$$
Plugging the estimates (45)–(48) into (40), we obtain, with confidence at least $1 - \delta$,
$$\left\| \bar{f}_{T+1,D} - f_{T+1} \right\| \le C'' \left( N^{-\frac{r}{2r+s}} + \sigma^{-2} N^{\frac{5/2}{2r+s}} \right) \left( \log \frac{12}{\delta} \right)^4,$$
where
$$C'' = C_{r,\theta}' \left[ 2M(\sqrt{C}+1) + 2(\sqrt{C}+1) + \frac{2^{10}(M+1)\left( (1+\sqrt{C})^2 + 1 \right)(1+\sqrt{C})^2}{(2r+s)(1-\theta)} + 2\left( 1 + \frac{2 + 2\sqrt{C}}{(2r+s)(1-\theta)} \right) \right].$$
This, together with the bound
$$\left\| f_{T+1} - f_\rho \right\| \le h_\rho' \, T^{-r(1-\theta)} \le h_\rho' \, N^{-\frac{r}{2r+s}}$$
from (21), leads to the desired conclusion with $\tilde{C} = C'' + h_\rho'$. □
Proof of Corollary 1. 
When $\sigma \ge N^{\frac{r/2 + 5/4}{2r+s}}$, by Theorem 1 we have, with confidence at least $1 - \delta$, $\left\| \bar{f}_{T+1,D} - f_\rho \right\| \le 2\tilde{C} N^{-\frac{r}{2r+s}} \left( \log \frac{12}{\delta} \right)^4$. Writing $t = 2\tilde{C} N^{-\frac{r}{2r+s}} \left( \log \frac{12}{\delta} \right)^4$ and solving for δ, we get
$$\mathrm{Prob}\left\{ D : \left\| \bar{f}_{T+1,D} - f_\rho \right\| > t \right\} \le 12 \exp\left( -(2\tilde{C})^{-\frac{1}{4}} N^{\frac{r}{4(2r+s)}} \, t^{\frac{1}{4}} \right).$$
Using the probability-to-expectation formula
$$\mathbb{E}[\xi] = \int_0^\infty \mathrm{Prob}\{\xi > t\} \, dt$$
with $\xi = \left\| \bar{f}_{T+1,D} - f_\rho \right\|$, we have
$$\mathbb{E}\left\| \bar{f}_{T+1,D} - f_\rho \right\| = \int_0^\infty \mathrm{Prob}\left\{ D : \left\| \bar{f}_{T+1,D} - f_\rho \right\| > t \right\} dt \le 12 \int_0^\infty \exp\left( -(2\tilde{C})^{-\frac{1}{4}} N^{\frac{r}{4(2r+s)}} \, t^{\frac{1}{4}} \right) dt = 96 \tilde{C} N^{-\frac{r}{2r+s}} \int_0^\infty u^3 e^{-u} \, du = 96 \, \Gamma(4) \, \tilde{C} N^{-\frac{r}{2r+s}} = O\left( N^{-\frac{r}{2r+s}} \right),$$
where $\Gamma(d)$ is the Gamma function, defined for $d > 0$ by $\Gamma(d) = \int_0^\infty u^{d-1} e^{-u} \, du$.
The proof is complete. □
To prove Corollary 2, we need the following Borel–Cantelli lemma, which can be found in [31].
Lemma 6.
Let $\{a_N\}$ be a sequence of random variables in some probability space and $\{\xi_N\}$ be a sequence of positive numbers satisfying $\lim_{N \to \infty} \xi_N = 0$. If
$$\sum_{N=1}^\infty \mathrm{Prob}\left\{ |a_N - a| > \xi_N \right\} < \infty,$$
then $a_N$ converges to a almost surely.
Proof of Corollary 2. 
Let $\delta = N^{-2}$ in Theorem 1; then, noting that $(\log(12N^2))^4 \le 2^4 (\log(12N))^4$, we have
$$\mathrm{Prob}\left\{ N^{\frac{r}{2r+s}} \left\| \bar{f}_{T+1,D} - f_\rho \right\| \ge 2^5 \tilde{C} \left( \log(12N) \right)^4 \right\} < N^{-2}.$$
Thus, for any $\epsilon > 0$,
$$\mathrm{Prob}\left\{ N^{\frac{r}{2r+s} - \epsilon} \left\| \bar{f}_{T+1,D} - f_\rho \right\| \ge 2^5 \tilde{C} \, \frac{\left( \log(12N) \right)^4}{N^\epsilon} \right\} < N^{-2}.$$
Applying Lemma 6 with $a_N = N^{\frac{r}{2r+s} - \epsilon} \left\| \bar{f}_{T+1,D} - f_\rho \right\|$, $a = 0$ and $\xi_N = 2^5 \tilde{C} \left( \log(12N) \right)^4 / N^\epsilon$, we obtain the conclusion of Corollary 2 by noting that $\lim_{N \to \infty} \xi_N = 0$ and $\sum_{N=1}^\infty N^{-2} < \infty$.
The proof is finished. □

Author Contributions

Validation, F.X. and S.W.; Writing (original draft), B.W.; Writing (review and editing), T.H. All authors have read and agreed to the published version of the manuscript.

Funding

The work is supported partially by the National Key Research and Development Program of China (Grant No. 2021YFA1000600) and the National Natural Science Foundation of China (Grant No. 12071356).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Guo, Z.C.; Lin, S.B.; Zhou, D.X. Learning theory of distributed spectral algorithms. Inverse Probl. 2017, 33, 074009.
2. Mücke, N.; Blanchard, G. Parallelizing spectrally regularized kernel algorithms. J. Mach. Learn. Res. 2018, 19, 1069–1097.
3. Lin, S.B.; Guo, X.; Zhou, D.X. Distributed learning with regularized least squares. J. Mach. Learn. Res. 2017, 18, 3202–3232.
4. Hu, T.; Zhou, D.X. Distributed regularized least squares with flexible Gaussian kernels. Appl. Comput. Harmon. Anal. 2021, 53, 349–377.
5. Zhang, Y.; Duchi, J.; Wainwright, M. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res. 2015, 16, 3299–3340.
6. Lin, S.B.; Zhou, D.X. Distributed kernel-based gradient descent algorithms. Constr. Approx. 2018, 47, 249–276.
7. Shamir, O.; Srebro, N. Distributed stochastic optimization and learning. In Proceedings of the 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 30 September–3 October 2014; pp. 850–857.
8. Chang, X.; Lin, S.B.; Zhou, D.X. Distributed semi-supervised learning with kernel ridge regression. J. Mach. Learn. Res. 2017, 18, 1493–1514.
9. Hu, T.; Wu, Q.; Zhou, D.X. Distributed kernel gradient descent algorithm for minimum error entropy principle. Appl. Comput. Harmon. Anal. 2020, 49, 229–256.
10. Sun, H.; Wu, Q. Optimal rates of distributed regression with imperfect kernels. J. Mach. Learn. Res. 2021, 22, 1–34.
11. Sun, Q.; Zhou, W.X.; Fan, J. Adaptive Huber regression. J. Am. Stat. Assoc. 2020, 115, 254–265.
12. Feng, Y.; Wu, Q. A framework of learning through empirical gain maximization. Neural Comput. 2021, 33, 1656–1697.
13. Erdogmus, D.; Principe, J.C. Comparison of entropy and mean square error criteria in adaptive system training using higher order statistics. Proc. ICA 2000, 5, 6.
14. Song, Y.; Liang, X.; Zhu, Y.; Lin, L. Robust variable selection with exponential squared loss for the spatial autoregressive model. Comput. Stat. Data Anal. 2021, 155, 107094.
15. Feng, Y.; Fan, J.; Suykens, J.A. A statistical learning approach to modal regression. J. Mach. Learn. Res. 2020, 21, 1–35.
16. Feng, Y.; Huang, X.; Shi, L.; Yang, Y.; Suykens, J.A. Learning with the maximum correntropy criterion induced losses for regression. J. Mach. Learn. Res. 2015, 16, 993–1034.
17. Feng, Y.; Ying, Y. Learning with correntropy-induced losses for regression with mixture of symmetric stable noise. Appl. Comput. Harmon. Anal. 2020, 48, 795–810.
18. Gunduz, A.; Principe, J.C. Correntropy as a novel measure for nonlinearity tests. Signal Process. 2009, 89, 14–23.
19. He, R.; Zheng, W.S.; Hu, B.G.; Kong, X.W. A regularized correntropy framework for robust pattern recognition. Neural Comput. 2011, 23, 2074–2100.
20. Bessa, R.J.; Miranda, V.; Gama, J. Entropy and correntropy against minimum square error in offline and online three-day ahead wind power forecasting. IEEE Trans. Power Syst. 2009, 24, 1657–1666.
21. Holland, P.W.; Welsch, R.E. Robust regression using iteratively reweighted least-squares. Commun. Stat.-Theory Methods 1977, 6, 813–827.
22. Aronszajn, N. Theory of reproducing kernels. Trans. Am. Math. Soc. 1950, 68, 337–404.
23. Smale, S.; Zhou, D.X. Learning theory estimates via integral operators and their approximations. Constr. Approx. 2007, 26, 153–172.
24. Caponnetto, A.; De Vito, E. Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 2007, 7, 331–368.
25. Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science & Business Media: Berlin, Germany, 2008.
26. Blanchard, G.; Mücke, N. Optimal rates for regularization of statistical inverse learning problems. Found. Comput. Math. 2018, 18, 971–1013.
27. Santamaría, I.; Pokharel, P.P.; Principe, J.C. Generalized correlation function: Definition, properties, and application to blind equalization. IEEE Trans. Signal Process. 2006, 54, 2187–2197.
28. Liu, W.; Pokharel, P.P.; Principe, J.C. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE Trans. Signal Process. 2007, 55, 5286–5298.
29. Principe, J.C. Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010.
30. Yao, Y.; Rosasco, L.; Caponnetto, A. On early stopping in gradient descent learning. Constr. Approx. 2007, 26, 289–315.
31. Durrett, R. Probability: Theory and Examples; Cambridge University Press: Cambridge, UK, 2017.