
Error Bound of Mode-Based Additive Models

by Hao Deng, Jianghong Chen, Biqin Song and Zhibin Pan
1 College of Science, Huazhong Agricultural University, Wuhan 430070, China
2 College of Electrical and New Energy, China Three Gorges University, Yichang 443002, China
* Authors to whom correspondence should be addressed.
Entropy 2021, 23(6), 651; https://doi.org/10.3390/e23060651
Submission received: 22 March 2021 / Revised: 19 May 2021 / Accepted: 19 May 2021 / Published: 22 May 2021
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Due to their flexibility and interpretability, additive models are powerful tools for high-dimensional mean regression and variable selection. However, least-squares-based mean regression models are sensitive to non-Gaussian noise, so there is a need to improve their robustness. This paper considers estimation and variable selection via modal regression in reproducing kernel Hilbert spaces (RKHSs). Based on the mode-induced metric and a two-fold Lasso-type regularizer, we proposed a sparse modal regression algorithm and derived its excess generalization error bound. The experimental results demonstrated the effectiveness of the proposed model.

1. Introduction

Regression estimation and variable selection are two important tasks in high-dimensional data mining [1]. Sparse additive models [2,3], which aim to handle both tasks simultaneously, have been extensively investigated in the mean regression setting. As a class of models between linear and nonparametric regression, they inherit the flexibility of nonparametric regression and the interpretability of linear regression. Typical methods include COSSO [4], SpAM [2] and its variants such as Group SpAM [3], SAM [5], Group SAM [6], SALSA [7], MAM [8], SSAM [9], and ramp-SAM [10]. Through the lens of nonparametric regression, the additive structure imposed on the hypothesis space is crucial for overcoming the curse of dimensionality [7,11,12].
Usually, the aforementioned models are limited to estimating the conditional mean under the mean-squared error (MSE) criterion. However, for complex non-Gaussian noises (e.g., skewed or heavy-tailed noise), it is difficult for mean-based approaches to extract the intrinsic trends, resulting in degraded performance. Beyond traditional mean regression, it is therefore interesting to formulate a new regression framework under the (conditional) mode-based criterion. With the help of the recent works in [13,14,15,16,17,18,19], this paper aims to propose a new robust sparse additive model, rooted in modal regression associated with an RKHS.
As an alternative to mean regression, modal regression has been investigated in terms of its statistical behavior [14,15,17] and real-world applications [20,21]. Yao [14] proposed a modal linear regression algorithm and characterized its theoretical properties under a global mode assumption. As a natural extension of the Lasso [22], Wang et al. [15] considered regularized modal regression and established its generalization bound and variable selection consistency. Feng et al. [17] studied modal regression from a learning theory perspective and illustrated its relation with the MCC [23,24]. Different from these global approaches, local modal regression algorithms were formulated in [16,25] with convergence guarantees. The recent survey [26] gives a general overview of modal regression, and a more comprehensive list of references can be found there.
The proposed robust additive models are formulated under the Tikhonov regularization scheme, which involves three building blocks: the mode-based metric, the RKHS-based hypothesis space, and two Lasso-type penalties. Since linear function spaces, polynomial function spaces, and Sobolev/Besov spaces are special cases of RKHSs, the kernel-based function space is more flexible than traditional spline-based spaces or other dictionary-based hypotheses [2,5,27,28,29]. The mode-induced regression metric is robust to non-Gaussian noise according to both theoretical and empirical evaluations [14,15,17]. The two-fold regularization penalty addresses the sparsity and smoothness of the estimator, which has shown promising performance for mean regression [2,29,30,31]. Therefore, different from mean-based kernel regression and additive models, the mode-based approach enjoys robustness and interpretability simultaneously due to its metric criterion and trade-off penalty. The estimator of our approach can be obtained by integrating half-quadratic (HQ) optimization [32] and second-order cone programming (SOCP) [33].
The rest of this article is organized as follows. After introducing the robust additive model in Section 2, we state its generalization error bound in Section 3. Section 4 reports the experimental evaluation, and Section 5 ends this paper with a brief conclusion.

2. Methodology

2.1. Modal Regression

In this section, we recall the basic background on modal regression [19,34]. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^p$ associated with the input covariate vector and $\mathcal{Y} \subseteq \mathbb{R}$ be the response variable set. In this paper, we consider the following nonparametric model:
$$Y = f^*(X) + \epsilon, \qquad (1)$$
where $X = (X_1, \dots, X_p)^T \in \mathcal{X}$, $Y \in \mathcal{Y}$, and $\epsilon$ is random noise. For feasibility, we denote by $\rho$ the underlying joint distribution of $(X, Y)$ generated by (1).
Different from traditional mean regression under the noise condition $\mathbb{E}(\epsilon \mid X = x) = 0$ (e.g., Gaussian noise), we only require that the mode of the conditional distribution of $\epsilon$ equals zero at each $x \in \mathcal{X}$. That is:
$$\mathrm{mode}(\epsilon \mid X = x) := \arg\max_{t \in \mathbb{R}} P_{\epsilon|X}(t \mid X = x) = 0, \quad \forall x \in \mathcal{X}, \qquad (2)$$
where $P_{\epsilon|X}$ is the conditional density of $\epsilon$ given $X$. Notice that this zero-mode condition requires neither homogeneity nor symmetry of the noise distribution, so non-Gaussian noises (e.g., skewed or heavy-tailed noise) are not excluded.
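As a concrete illustration of the zero-mode condition, the following minimal NumPy sketch (our own example, not from the paper) draws a skewed noise whose mode is zero while its mean is not; this is exactly the regime where mean regression is biased but modal regression still targets $f^*$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential noise: its density is maximized at 0 (mode = 0),
# but its mean equals 1, so E(eps | X) != 0.
eps = rng.exponential(scale=1.0, size=100_000)

print("empirical mean:", eps.mean())              # close to 1
hist, edges = np.histogram(eps, bins=200, density=True)
print("empirical mode:", edges[np.argmax(hist)])  # close to 0
```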
From (1), we further deduce that:
$$f^*(u) := \sum_{j=1}^p f_j^*(u_j) = \mathrm{mode}(Y \mid X = u) = \arg\max_{t} P_{Y|X}(t \mid X = u),$$
where $u = (u_1, \dots, u_p)^T \in \mathcal{X}$ and $P_{Y|X}$ denotes the conditional density of $Y$ given $X$. The purpose of modal regression is then to find the target function $f^*$ from the empirical data $\mathbf{z} = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$ drawn independently from $\rho$.
For modal regression, the performance of a predictor $f: \mathcal{X} \to \mathbb{R}$ is measured by the mode-based metric:
$$\mathcal{R}(f) = \int_{\mathcal{X}} P_{Y|X}\big(f(x) \mid X = x\big)\, d\rho_X(x), \qquad (3)$$
where $\rho_X$ is the marginal distribution of $\rho$ on the input space $\mathcal{X}$.
Although the target function $f^*$ is the maximizer of $\mathcal{R}(f)$ over all measurable functions, it cannot be estimated directly via maximizing (3) because $P_{Y|X}$ and $\rho_X$ are unknown. Fortunately, indirect density-estimation-based strategies were proposed in [14,15,17]. As shown in Theorem 5 of [17], $\mathcal{R}(f)$ equals the density of the random variable $E_f = Y - f(X)$ at zero, i.e.,
$$\mathcal{R}(f) = P_{E_f}(0).$$
Therefore, we can find an approximation of $f^*$ by maximizing the empirical version of $P_{E_f}(0)$ with the help of kernel density estimation (KDE).
Let $K_\sigma: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ be a kernel with bandwidth $\sigma$, whose representing function $\phi: \mathbb{R} \to [0, \infty)$ satisfies $\phi\big(\frac{u - u'}{\sigma}\big) = K_\sigma(u, u')$ for all $u, u' \in \mathbb{R}$. Typical kernels used in KDE include the Gaussian kernel, the Epanechnikov kernel, the logistic kernel, and the sigmoid kernel. The KDE-based estimator of $P_{E_f}(0)$ is defined as:
$$\hat{P}_{E_f}(0) = \frac{1}{n\sigma} \sum_{i=1}^n K_\sigma\big(y_i - f(x_i), 0\big) = \frac{1}{n\sigma} \sum_{i=1}^n \phi\Big(\frac{y_i - f(x_i)}{\sigma}\Big) := \hat{\mathcal{R}}_\sigma(f).$$
Learning models for modal regression are usually formulated by Tikhonov regularization schemes associated with the empirical metric $\hat{\mathcal{R}}_\sigma(f)$; see, e.g., [15,35].
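For concreteness, the following minimal sketch evaluates the empirical metric $\hat{\mathcal{R}}_\sigma(f)$ with a Gaussian representing function $\phi(u) = \exp(-u^2/2)$; the function name and the choice of $\phi$ are our own illustration and are not prescribed by the paper.

```python
import numpy as np

def empirical_modal_metric(y, f_x, sigma):
    """KDE-based empirical modal regression metric.

    y     : (n,) observed responses y_i
    f_x   : (n,) predictions f(x_i)
    sigma : bandwidth of the KDE kernel
    Uses the Gaussian representing function phi(u) = exp(-u**2 / 2).
    """
    residuals = (y - f_x) / sigma
    phi = np.exp(-0.5 * residuals ** 2)
    return phi.mean() / sigma  # (1 / (n * sigma)) * sum_i phi((y_i - f(x_i)) / sigma)

# toy usage: the metric is larger when predictions sit near the conditional mode
y = np.array([0.1, -0.2, 0.05, 0.3])
print(empirical_modal_metric(y, np.zeros(4), sigma=0.5))
```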
Naturally, the data-free (population) modal regression metric corresponding to $\hat{\mathcal{R}}_\sigma(f)$ can be defined as:
$$\mathcal{R}_\sigma(f) = \frac{1}{\sigma} \int_{\mathcal{X} \times \mathcal{Y}} \phi\Big(\frac{y - f(x)}{\sigma}\Big)\, d\rho(x, y).$$
In theory, the learning performance of an estimator $f: \mathcal{X} \to \mathbb{R}$ can be evaluated in terms of $\mathcal{R}(f^*) - \mathcal{R}(f)$, which can be further bounded via $\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(f)$ (see Theorem 10 in [17]).
Remark 1.
As illustrated in [17], when $K_\sigma$ is taken to be a Gaussian kernel, modal regression maximizing $\mathcal{R}_\sigma(f)$ coincides with learning under the maximum correntropy criterion (MCC). By employing different kernels, we obtain a rich family of evaluation metrics for robust estimation.

2.2. Mode-Based Sparse Additive Models

The additive model is formulated as follows:
$$Y = \sum_{j=1}^p f_j^*(X_j) + \epsilon, \qquad (4)$$
where $X_j \in \mathcal{X}_j$ ($j = 1, 2, \dots, p$), $Y \in \mathcal{Y}$, and the $f_j^*$ are unknown component functions. By employing nonlinear hypothesis spaces with an additive structure, the additive model provides better flexibility for regression estimation and variable selection [19]. In [28], the theoretical properties of the sparse additive model with the quantile loss were discussed. We introduce some basic notation and assumptions in a similar way.
Suppose that $\mathbb{E} f_j^*(X_j) = 0$ and $\|f_j^*\|_{K_j} \leq 1$ for each $f_j^*$ in (4) with $j \in S$. Here, $f_j^*: \mathcal{X}_j \to \mathbb{R}$ is an unknown univariate function in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_j := \mathcal{H}_{K_j}$ associated with kernel $K_j$ and norm $\|\cdot\|_{K_j}$ [30,31], and $S \subset \{1, \dots, p\}$ is the intrinsic subset of active components with cardinality $|S| < p$. This means that each observation $(x_i, y_i)$ is generated according to:
$$y_i = \sum_{j \in S} f_j^*(x_{ij}) + \epsilon_i, \quad i = 1, \dots, n,$$
where $x_i = (x_{i1}, \dots, x_{ip})^T \in \mathbb{R}^p$, $f_j^* \in \mathcal{H}_j$, and $\epsilon$ satisfies condition (2).
For any given $j \in \{1, \dots, p\}$, denote $B_r(\mathcal{H}_j) = \{g \in \mathcal{H}_j : \|g\|_{K_j} \leq r\}$. The hypothesis space considered here is defined by:
$$\mathcal{F} = \Big\{ f = \sum_{j=1}^p f_j : f_j \in B_r(\mathcal{H}_j),\ j = 1, \dots, p \Big\}, \qquad (5)$$
which is a subset of the RKHS $\mathcal{H} = \{f = \sum_{j=1}^p f_j : f_j \in \mathcal{H}_j\}$ with the norm:
$$\|f\|_K^2 = \inf\Big\{ \sum_{j=1}^p \|f_j\|_{K_j}^2 : f = \sum_{j=1}^p f_j \Big\}.$$
For each $\mathcal{X}_j$ and the corresponding marginal distribution $\rho_{X_j}$, we denote $\|f_j\|_2^2 := \int_{\mathcal{X}_j} |f_j(u)|^2\, d\rho_{X_j}(u)$. Given inputs $\{x_i\}_{i=1}^n$, define the empirical norm of each $f_j$ as:
$$\|f_j\|_n^2 := \frac{1}{n} \sum_{i=1}^n f_j^2(x_{ij}), \quad f_j \in \mathcal{H}_j,\ j \in \{1, \dots, p\}.$$
With the help of the mode-based metric (3) and the hypothesis space (5), we formulate the mode-based sparse additive model as:
$$\hat{f} = \arg\max_{f \in \mathcal{F}} \Big\{ \hat{\mathcal{R}}_\sigma(f) - \lambda_1 \sum_{j=1}^p \|f_j\|_n - \lambda_2 \sum_{j=1}^p \|f_j\|_{K_j} \Big\}, \qquad (6)$$
where $(\lambda_1, \lambda_2)$ is a pair of positive regularization parameters. The first regularization term promotes sparsity [11,36], and the second one guarantees the smoothness of the solution.
By the representer theorem of kernel methods (e.g., [37]), the solution of (6) admits the following form:
$$\hat{f}(u) = \sum_{i=1}^n \sum_{j=1}^p \hat{\alpha}_{ij} K_j(u_j, x_{ij}), \quad u = (u_1, \dots, u_p)^T,$$
with a collection of coefficients $\{\hat{\alpha}_j = (\hat{\alpha}_{1j}, \dots, \hat{\alpha}_{nj})^T \in \mathbb{R}^n : j = 1, \dots, p\}$.
The optimal coefficients with respect to (6) are the solution of the following non-convex optimization problem:
$$\max_{\alpha_j \in \mathbb{R}^n,\ \alpha_j^T \mathbf{K}_j \alpha_j \leq 1}\ \Big\{ \frac{1}{n} \sum_{i=1}^n \phi\Big( \frac{y_i - \sum_{j=1}^p \mathbf{K}_{ji}^T \alpha_j}{\sigma} \Big) - \frac{\lambda_1}{\sqrt{n}} \sum_{j=1}^p \|\mathbf{K}_j \alpha_j\|_2 - \lambda_2 \sum_{j=1}^p \sqrt{\alpha_j^T \mathbf{K}_j \alpha_j} \Big\},$$
where $\mathbf{K}_{ji} = \big(K_j(x_{1j}, x_{ij}), \dots, K_j(x_{nj}, x_{ij})\big)^T \in \mathbb{R}^n$ and $\mathbf{K}_j = \big(K_j(x_{ij}, x_{lj})\big)_{i,l=1}^n = (\mathbf{K}_{j1}, \dots, \mathbf{K}_{jn}) \in \mathbb{R}^{n \times n}$.
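The sketch below illustrates how this finite-dimensional objective can be evaluated from the Gram matrices $\mathbf{K}_j$ (our own illustration with Gaussian component kernels; the paper does not prescribe these function names or kernel choices).

```python
import numpy as np

def gram_matrix(xj, gamma=1.0):
    """Gram matrix K_j of a univariate Gaussian kernel on the j-th covariate."""
    d = xj[:, None] - xj[None, :]
    return np.exp(-gamma * d ** 2)

def objective(alpha, K_list, y, sigma, lam1, lam2):
    """Objective of the finite-dimensional problem (to be maximized).

    alpha  : (p, n) array of coefficient vectors alpha_j
    K_list : list of p Gram matrices K_j, each (n, n)
    """
    n = y.shape[0]
    # f(x_i) = sum_j (K_j alpha_j)_i
    fx = sum(K @ a for K, a in zip(K_list, alpha))
    data_fit = np.mean(np.exp(-0.5 * ((y - fx) / sigma) ** 2)) / sigma              # Gaussian phi
    sparsity = sum(np.linalg.norm(K @ a) / np.sqrt(n) for K, a in zip(K_list, alpha))  # ||f_j||_n
    smoothness = sum(np.sqrt(max(a @ K @ a, 0.0)) for K, a in zip(K_list, alpha))      # ||f_j||_Kj
    return data_fit - lam1 * sparsity - lam2 * smoothness
```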
Remark 2.
There are various combinations of sparsity and smoothness regularization for additive models [2,3,29,30,31]. The regularization adopted in this paper is a two-fold group-Lasso scheme, which was employed in [28] in the quantile regression setting; it also differs from the coefficient-based regularized modal regression in [19].
Remark 3.
From a computational viewpoint, the proposed algorithm (6) can be transformed into a regularized least-squares regression problem by HQ optimization [32]. The transformed problem can then be tackled efficiently by SOCP [33].
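As a rough illustration of the half-quadratic idea for a Gaussian representing function, the sketch below alternates between updating sample weights and solving a weighted least-squares problem in the kernel coefficients. It replaces the two-fold Lasso penalty of (6) with a single quadratic penalty, so it is only a schematic of the HQ step under that simplification, not the SOCP-based solver used in the paper.

```python
import numpy as np

def hq_modal_fit(K, y, sigma=0.5, lam=0.1, n_iter=20):
    """Half-quadratic (iteratively reweighted) fit of f(x_i) = (K alpha)_i.

    K : (n, n) aggregated Gram matrix, y : (n,) responses.
    The weights w_i = exp(-r_i**2 / (2 sigma**2)) come from the HQ surrogate
    of the Gaussian representing function; the penalty here is a plain
    quadratic alpha^T K alpha instead of the two-fold Lasso of (6).
    """
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        r = y - K @ alpha                      # current residuals
        w = np.exp(-0.5 * (r / sigma) ** 2)    # HQ weights (large for small residuals)
        W = np.diag(w)
        # weighted kernel ridge update: (K W K + lam K) alpha = K W y
        alpha = np.linalg.solve(K @ W @ K + lam * K + 1e-8 * np.eye(n), K @ W @ y)
    return alpha
```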

3. Error Analysis

This section states upper bounds for the excess quantity $\mathcal{R}(f^*) - \mathcal{R}(\hat{f})$. For ease of presentation, we only consider the special setting where $\mathcal{H}_j \equiv \mathcal{H}_{j'}$ for all $j, j' \in \{1, \dots, p\}$, and we denote $\sum_{j=1}^p \mathcal{H}_j$ by $\mathcal{H}_K$ with $\sup_x K(x, x) \leq 1$.
Recall that the Mercer kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ admits the following spectral expansion [38]:
$$K(x, x') = \sum_{\ell \geq 1} b_\ell\, \psi_\ell(x)\, \psi_\ell(x'), \quad x, x' \in \mathcal{X},$$
where $\{(b_\ell, \psi_\ell)\}_{\ell \geq 1}$ are the eigenvalue-eigenfunction pairs of the integral operator $T_K f := \int K(\cdot, x) f(x)\, d\rho_X(x)$ with $b_1 \geq b_2 \geq \dots \geq 0$.
To evaluate the complexity of $\mathcal{H}_K$ in terms of the decay rate of the eigenvalues $\{b_\ell\}_{\ell \geq 1}$ [27,28], we adopt Assumption 1 in [28] as the basis of our analysis.
Assumption 1.
There exist $s \in (0, 1)$ and a constant $c_1 > 0$ such that $b_\ell \leq c_1 \ell^{-1/s}$ for all $\ell \geq 1$.
As illustrated in [27,28], the requirement $s < 1$ is a weak condition since $\sum_{\ell \geq 1} b_\ell = \mathbb{E} K(x, x) \leq 1$. In particular, $b_\ell \asymp \ell^{-2h}$ holds for the Sobolev space $\mathcal{H}_K = W_2^h$ ($h > \frac{1}{2}$) with the Lebesgue measure on $[0, 1]$.
To describe the hypothesis space in the RKHS, we adopt Assumption 2 in [28].
Assumption 2.
For the $s \in (0, 1)$ given in Assumption 1, there exists a positive constant $c_2$ such that $\|f\|_\infty \leq c_2\, \|f\|_2^{1-s}\, \|f\|_K^s$ for all $f \in \mathcal{H}_K$.
Remark 4.
To understand the statistical performance of the proposed estimator without any “correlatedness” conditions on the covariates, the Rademacher complexity [39] was used in [28] to measure functional complexity. Our analysis draws on the same tools.
In general, Assumption 2 is stronger than Assumption 1 and is satisfied when the RKHS is continuously embeddable in a Sobolev space. For uniformly bounded eigenfunctions $\{\psi_\ell\}_{\ell \geq 1}$, this sup-norm condition is consistent with Assumption 1.
For any given independent input variables $\{x_i\}_{i=1}^n \subset \mathcal{X}$, define the Rademacher complexity:
$$R_n(f) := \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i), \quad f \in \mathcal{H}_K,$$
where $\{\sigma_i\}_{i=1}^n$ is an i.i.d. sequence of Rademacher variables taking values in $\{\pm 1\}$ with probability $1/2$ each. As shown in [40], it holds that:
$$\mathbb{E}\, R_n\big(\{f \in \mathcal{H}_K : \|f\|_K = 1,\ \|f\|_2 \leq t\}\big) \leq \frac{1}{\sqrt{n}} \Big[ \sum_{\ell \geq 1} \min\{t^2, b_\ell\} \Big]^{\frac{1}{2}}.$$
Moreover, from Assumption 1, define:
$$\gamma_n := \inf\Big\{ \gamma \geq \sqrt{\tfrac{A \log \tilde{p}}{n}} :\ \mathbb{E}\Big[ \sup_{\|f\|_K = 1,\, \|f\|_2 \leq t} |R_n(f)| \Big] \leq \gamma t + \gamma^2,\ \forall t \in (0, 1) \Big\} \asymp \max\Big\{ \sqrt{\tfrac{A \log \tilde{p}}{n}},\ \Big(\tfrac{1}{n}\Big)^{\frac{1}{2(1+\alpha)}} \Big\}.$$
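To get a feel for which branch of $\gamma_n$ dominates, the following quick numeric check (our own illustration with arbitrary values of $A$, $\tilde{p}$, $n$, and $\alpha$) prints both branches: the logarithmic branch dominates for small $n$, while the kernel-complexity branch takes over for large $n$.

```python
import numpy as np

def gamma_n(n, p_tilde, A=2.0, alpha=0.5):
    """The two branches of gamma_n and their maximum."""
    log_branch = np.sqrt(A * np.log(p_tilde) / n)         # sqrt(A log p~ / n)
    kernel_branch = n ** (-1.0 / (2.0 * (1.0 + alpha)))   # n^(-1 / (2 (1 + alpha)))
    return log_branch, kernel_branch, max(log_branch, kernel_branch)

for n in (100, 10_000, 1_000_000):
    print(n, gamma_n(n, p_tilde=50))
```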
The main idea of our error analysis is to first establish a theoretical result on a suitably defined event and then investigate the behavior of $\hat{f}$ in (6) conditional on that event.
Define $\eta(t) := \max\{1, \sqrt{t}, t/\sqrt{n}\}$ for any $t > 0$ and $\xi_n := \xi_n(\lambda) = \max\big\{ \lambda^{-\frac{\alpha}{2}} n^{-\frac{1}{2}},\ \lambda^{\frac{1}{2}} n^{-\frac{1}{1+\alpha}},\ \sqrt{\tfrac{\log p}{n}} \big\}$, and consider the event:
$$\theta(t) = \Big\{ \Big| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i) \Big| \leq c_\alpha\, \eta(t)\, \xi_n\, \big( \|f\|_2 + \lambda^{\frac{1}{2}} \|f\|_K \big),\ \forall f \in \mathcal{H}_K \Big\},$$
where $\{\epsilon_i\}_{i=1}^n$ are zero-mean i.i.d. random variables with $|\epsilon_i| \leq L$ and $c_\alpha$ is a constant depending on $\alpha$ and $L$.
Remark 5.
To analyze the behavior of the regularized estimator conditioned on this event, several basic facts about empirical processes were introduced in [28]. Our analysis fits into this framework, and we restate the relevant lemmas from [28] as stepping stones.
Lemma 1.
Let Assumptions 1 and 2 hold. If $\log p \leq n$, it holds that:
$$P\big(\theta(t)\big) \geq 1 - \exp\{-t\}, \quad \forall \lambda > 0,\ t \geq 1.$$
The following lemma (see also Theorem 4 in [41]) describes the relationship between the empirical norm $\|\cdot\|_n$ and $\|\cdot\|_2$ for functions in $\mathcal{H}_K$.
Lemma 2.
For $A \geq 1$ and any given $\tilde{p} \geq p$ with $\log \tilde{p} \geq 2 \log\log n$, there exists a constant $c$ such that:
$$\|f\|_2 \leq c\, \big(\|f\|_n + \gamma_n \|f\|_K\big)$$
and:
$$\|f\|_n \leq c\, \big(\|f\|_2 + \gamma_n \|f\|_K\big)$$
with confidence at least $1 - \tilde{p}^{-A}$, where $\gamma_n \asymp \max\big( \sqrt{\tfrac{A \log \tilde{p}}{n}},\ (\tfrac{1}{n})^{\frac{1}{2(1+\alpha)}} \big)$.
Lemma 3.
Let $\{z_i\}_{i=1}^n \subset \mathcal{Z}$ be independent random variables, and let $\Gamma$ be a class of real-valued functions on $\mathcal{Z}$ satisfying:
$$\|\gamma\|_\infty \leq \eta_n,\ \forall \gamma \in \Gamma, \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^n \mathrm{var}\big(\gamma(z_i)\big) \leq \iota_n^2,$$
for some positive constants $\eta_n$ and $\iota_n$. Define $\zeta := \sup_{\gamma \in \Gamma} \big| \frac{1}{n} \sum_{i=1}^n \gamma(z_i) - \mathbb{E}\gamma(z) \big|$. Then,
$$P\Big\{ \zeta \geq \mathbb{E}\zeta + t\sqrt{2\big(\iota_n^2 + 2\eta_n \mathbb{E}\zeta\big)} + \frac{2\eta_n t^2}{3} \Big\} \leq \exp\{-n t^2\}.$$
For any given $\Delta^-$ and $\Delta^+$, define:
$$\mathcal{F}(\Delta^-, \Delta^+) = \Big\{ f = \sum_{j=1}^p f_j \in \mathcal{H}_K :\ \gamma_n \sum_{j=1}^p \|f_j - f_j^*\|_2 \leq \Delta^-,\ \gamma_n^2 \sum_{j=1}^p \|f_j - f_j^*\|_K \leq \Delta^+ \Big\}.$$
Lemma 4.
Let Assumptions 1 and 2 hold for each $\mathcal{H}_j$. For any given $A \geq 2$, with confidence at least $1 - \tilde{p}^{-A}$, it holds that:
$$\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(f) - \big( \hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(f) \big) \leq c^*\, \eta(t_0)\, (\Delta^- + \Delta^+) + \exp\{-\tilde{p}\},$$
for any $f \in \mathcal{F}(\Delta^-, \Delta^+)$ with $\max\{\Delta^-, \Delta^+\} \leq e^{\tilde{p}}$, where $t_0 = 2\log(2\sqrt{3}/\log 2) + A\log\tilde{p} + 2\log\tilde{p}$, $\lambda = n^{-\frac{1}{1+\alpha}}$, and $c^*$ is a positive constant.
Proof. 
Denote $\Gamma = \big\{\gamma(z) : \gamma(z) = \frac{1}{\sigma}\phi\big(\frac{y - f^*(x)}{\sigma}\big) - \frac{1}{\sigma}\phi\big(\frac{y - f(x)}{\sigma}\big),\ f \in \mathcal{F}(\Delta^-, \Delta^+)\big\}$. It is easy to verify that:
$$\mathbb{E}\gamma(z) - \frac{1}{n}\sum_{i=1}^n \gamma(z_i) = \mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(f) - \big(\hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(f)\big), \quad \forall \gamma \in \Gamma.$$
Let $\zeta := \sup_{\gamma \in \Gamma} \big| \frac{1}{n}\sum_{i=1}^n \gamma(z_i) - \mathbb{E}\gamma(z) \big|$. From Lemma 3, we have:
$$\zeta \leq \mathbb{E}\zeta + \sqrt{\frac{2t\big(\iota_n^2 + 2\eta_n \mathbb{E}\zeta\big)}{n}} + \frac{2\eta_n t}{3n}, \qquad (7)$$
with probability at least $1 - \exp\{-t\}$, where the constants $\eta_n$ and $\iota_n$ satisfy $\sup_{\gamma\in\Gamma}\|\gamma\|_\infty \leq \eta_n$ and $\sup_{\gamma\in\Gamma}\frac{1}{n}\sum_{i=1}^n \mathrm{var}(\gamma(z_i)) \leq \iota_n^2$. Observing that:
$$\sqrt{\frac{2t\big(\iota_n^2 + 2\eta_n \mathbb{E}\zeta\big)}{n}} \leq \sqrt{\frac{2t\iota_n^2}{n}} + \sqrt{\frac{4t\eta_n \mathbb{E}\zeta}{n}} \leq \sqrt{\frac{2t}{n}}\,\iota_n + \mathbb{E}\zeta + \frac{\eta_n t}{n}, \qquad (8)$$
we can take:
$$\iota_n^2 = \frac{2\|\phi'\|_\infty^2}{\sigma^4}\cdot\frac{(\Delta^-)^2}{\gamma_n^2} \geq 2\,\mathbb{E}\Big(\frac{1}{\sigma}\phi\Big(\frac{y - f^*(x)}{\sigma}\Big) - \frac{1}{\sigma}\phi\Big(\frac{y - f(x)}{\sigma}\Big)\Big)^2, \qquad (9)$$
since $\mathbb{E}(\gamma(z))^2 \leq \frac{\|\phi'\|_\infty^2}{\sigma^4}\|f - f^*\|_2^2 \leq \frac{\|\phi'\|_\infty^2}{\sigma^4}\cdot\frac{(\Delta^-)^2}{\gamma_n^2}$ for every $\gamma \in \Gamma$, and:
$$\eta_n = \sup_{\gamma\in\Gamma}\|\gamma\|_\infty \leq \frac{\|\phi'\|_\infty}{\sigma^2}\,\|f^* - f\|_\infty \leq \frac{\|\phi'\|_\infty}{\sigma^2}\,\|f^* - f\|_K \leq \frac{\|\phi'\|_\infty}{\sigma^2}\cdot\frac{\Delta^+}{\gamma_n^2}, \qquad (10)$$
where $\|\phi'\|_\infty$ denotes the Lipschitz constant of the representing function $\phi$.
Combining (7)–(10), we obtain, with confidence at least $1 - \exp\{-t\}$:
$$\zeta \leq 2\,\mathbb{E}\zeta + \frac{2\|\phi'\|_\infty \Delta^-}{\gamma_n \sigma^2}\sqrt{\frac{t}{n}} + \frac{\kappa\,\|\phi'\|_\infty \Delta^+}{\sigma^2 \gamma_n^2}\Big(1 + \frac{t}{n}\Big),$$
where $\kappa$ is an absolute constant.
By a symmetrization technique in [42], we have:
$$\mathbb{E}\zeta \leq 2\,\mathbb{E} R_n(\Gamma) \leq \frac{2\|\phi'\|_\infty}{\sigma^2}\,\mathbb{E} R_n\big(\mathcal{F} - f^*\big).$$
Applying Lemma 3 to $R_n(\mathcal{F} - f^*)$, we obtain that:
$$\mathbb{E}\big[R_n(\mathcal{F} - f^*)\big] \leq R_n(\mathcal{F} - f^*) + \frac{4\Delta^-}{\gamma_n}\sqrt{\frac{2t}{n}} + \frac{\Delta^+}{\gamma_n^2}\Big(1 + \frac{t}{n}\Big),$$
with probability at least $1 - 2\exp\{-t\}$. Moreover, with probability at least $1 - 3\exp\{-t\}$, it holds that:
$$\zeta \leq \frac{8\|\phi'\|_\infty}{\sigma^2}\, R_n(\mathcal{F} - f^*) + \frac{6\|\phi'\|_\infty \Delta^-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}} + \frac{5\|\phi'\|_\infty \Delta^+}{\gamma_n^2\sigma^2}\Big(1 + \frac{t}{n}\Big) \leq \frac{8\|\phi'\|_\infty}{\sigma^2}\sum_{j=1}^p R_n\big(\mathcal{H}_j - f_j^*\big) + \frac{6\|\phi'\|_\infty \Delta^-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}} + \frac{5\|\phi'\|_\infty \Delta^+}{\gamma_n^2\sigma^2}\Big(1 + \frac{t}{n}\Big).$$
On the event $\theta(t)$, Lemma 1 demonstrates that:
$$|R_n(f)| \leq c_\alpha\, \eta(t)\, \xi_n\, \big(\|f\|_2 + \lambda^{\frac{1}{2}}\|f\|_K\big), \quad \forall f \in \mathcal{H}_K,\ \lambda > 0,$$
with confidence $1 - \exp\{-t\}$. Then,
$$\zeta \leq \frac{8\|\phi'\|_\infty c_\alpha \eta(t)\xi_n}{\sigma^2}\sup_{f\in\mathcal{F}(\Delta^-,\Delta^+)}\Big\{\sum_{j=1}^p\|f_j - f_j^*\|_2 + \lambda^{\frac{1}{2}}\sum_{j=1}^p\|f_j - f_j^*\|_K\Big\} + \frac{6\|\phi'\|_\infty\Delta^-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}} + \frac{5\|\phi'\|_\infty\Delta^+}{\gamma_n^2\sigma^2}\Big(1 + \frac{t}{n}\Big).$$
Taking $\lambda = n^{-\frac{1}{1+\alpha}}$, we can verify that $\xi_n \leq c\,\gamma_n$ and $\xi_n\,\lambda^{\frac{1}{2}} \leq c\,\gamma_n^2$. Then,
$$\zeta \leq \frac{8 c_\alpha\, \eta(t)\,\|\phi'\|_\infty}{\sigma^2}\big(\Delta^+ + \Delta^-\big) + \frac{6\Delta^-\|\phi'\|_\infty}{\sigma^2}\sqrt{\frac{t}{A\log\tilde{p}}} + \frac{5\Delta^+\|\phi'\|_\infty}{\sigma^2}\cdot\frac{t}{A\log\tilde{p}},$$
on an event denoted by $\theta(\Delta^-, \Delta^+)$.
For $t = t_0 = 2\log(2\sqrt{3}/\log 2) + A\log\tilde{p} + 2\log\tilde{p}$, we first consider the case $e^{-\tilde{p}} \leq \Delta^- \leq e^{\tilde{p}}$ and $e^{-\tilde{p}} \leq \Delta^+ \leq e^{\tilde{p}}$. Considering the $(2\tilde{p}+1)^2$ different discrete pairs $\Delta^-_j = \Delta^+_j := 2^j$, $j = -\tilde{p}, \dots, \tilde{p}$, we deduce that:
$$P\Big(\bigcap_{k,j}\theta\big(\Delta^-_j, \Delta^+_k\big)\Big) \geq 1 - 3\Big(\frac{2}{\log 2}\Big)^2\tilde{p}^2\exp\Big\{-2\log\frac{2\sqrt{3}}{\log 2} - A\log\tilde{p} - 2\log\tilde{p}\Big\} \geq 1 - \tilde{p}^{-A}.$$
When $\Delta^- \leq e^{-\tilde{p}}$ or $\Delta^+ \leq e^{-\tilde{p}}$, it is trivial to obtain the desired result. □
The proof of Lemma 4 is adapted from the proof of Proposition 1 in [28] for quantile regression. We now state our main result on the error bound.
Theorem 1.
Let the regularization parameters of $\hat{f}$ defined in (6) be $\lambda_1 = \xi\gamma_n$ and $\lambda_2 = \xi\gamma_n^2$, where $\xi = \max\{2c\,\eta(t_0), 4\}$ with $\eta(t) = \max\{1, \sqrt{t}, t/\sqrt{n}\}$, $t_0 = 2\log(2\sqrt{3}/\log 2) + A\log\tilde{p} + 2\log\tilde{p}$, and $\gamma_n \asymp \max\big(\sqrt{\tfrac{A\log\tilde{p}}{n}},\ (\tfrac{1}{n})^{\frac{1}{2(1+\alpha)}}\big)$. Under Assumptions 1 and 2, for any $\tilde{p} \geq p$ such that $\log p \leq n$ and $\log\tilde{p} \geq 2\log\log n$, and for any constant $A \geq 2$, with probability at least $1 - 2\tilde{p}^{-A}$:
$$\mathcal{R}(f^*) - \mathcal{R}(\hat{f}) \leq c_s\,\|\phi'\|_\infty\,\eta(t_0)\big(\eta(t_0)\big)^{\frac{1}{4}}\sqrt{\gamma_n} \leq c\,\big(\eta(t_0)\big)^{\frac{5}{4}}\max\Big\{\Big(\frac{A\log\tilde{p}}{n}\Big)^{\frac{1}{4}},\ \Big(\frac{1}{n}\Big)^{\frac{1}{4(1+\alpha)}}\Big\} \leq c\,\max\Big\{\sqrt{A\log\tilde{p}},\ \frac{A\log\tilde{p}}{\sqrt{n}}\Big\}^{\frac{5}{4}}\cdot\max\Big\{\Big(\frac{A\log\tilde{p}}{n}\Big)^{\frac{1}{4}},\ \Big(\frac{1}{n}\Big)^{\frac{1}{4+4\alpha}}\Big\} \leq c\,\max\Big\{(A\log\tilde{p})^{\frac{7}{8}} n^{-\frac{1}{4}},\ (A\log\tilde{p})^{\frac{1}{2}} n^{-\frac{1}{4+4\alpha}},\ (A\log\tilde{p})^{\frac{3}{2}} n^{-\frac{3}{4}},\ (A\log\tilde{p})^{\frac{5}{4}} n^{-\frac{3+2\alpha}{4+4\alpha}}\Big\}.$$
Proof. 
By the definition of $\hat{f}$ in (6), we know that:
$$\hat{\mathcal{R}}_\sigma(\hat{f}) - \lambda_1\sum_{j=1}^p\|\hat{f}_j\|_n - \lambda_2\sum_{j=1}^p\|\hat{f}_j\|_{K_j} \geq \hat{\mathcal{R}}_\sigma(f^*) - \lambda_1\sum_{j=1}^p\|f_j^*\|_n - \lambda_2\sum_{j=1}^p\|f_j^*\|_{K_j}.$$
This implies that:
$$\hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(\hat{f}) \leq \lambda_1\sum_{j=1}^p\big(\|f_j^*\|_n - \|\hat{f}_j\|_n\big) + \lambda_2\sum_{j=1}^p\big(\|f_j^*\|_{K_j} - \|\hat{f}_j\|_{K_j}\big).$$
Moreover,
$$\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f}) \leq \mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f}) + \lambda_1\sum_{j\notin S}\|\hat{f}_j\|_n + \lambda_2\sum_{j\notin S}\|\hat{f}_j\|_{K_j} \leq \big[\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f})\big] - \big[\hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(\hat{f})\big] + \lambda_1\sum_{j\in S}\big(\|f_j^*\|_n - \|\hat{f}_j\|_n\big) + \lambda_2\sum_{j\in S}\big(\|f_j^*\|_{K_j} - \|\hat{f}_j\|_{K_j}\big) \leq \big[\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f})\big] - \big[\hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(\hat{f})\big] + \lambda_1\sum_{j\in S}\|\hat{f}_j - f_j^*\|_n + \lambda_2\sum_{j\in S}\|\hat{f}_j - f_j^*\|_{K_j}.$$
Taking $\lambda_1 = \xi\gamma_n$ and $\lambda_2 = \xi\gamma_n^2$ with $\gamma_n = \max\big\{\sqrt{\tfrac{A\log\tilde{p}}{n}},\ (\tfrac{1}{n})^{\frac{1}{2+2\alpha}}\big\}$, $\alpha \in (0, 1)$, we deduce that:
$$\gamma_n\sum_{j=1}^p\|\hat{f}_j - f_j^*\|_2 \leq 2p\,\gamma_n \leq 2\tilde{p} \leq e^{\tilde{p}}, \quad \forall n \geq 1,\ \tilde{p} \geq p,$$
and:
$$\gamma_n^2\sum_{j=1}^p\|\hat{f}_j - f_j^*\|_{K_j} \leq 2p\,\gamma_n^2 \leq 2\tilde{p} \leq e^{\tilde{p}}.$$
Therefore, we verify that $\hat{f} \in \mathcal{F}(\Delta^-, \Delta^+)$ with $\Delta^- \leq e^{\tilde{p}}$ and $\Delta^+ \leq e^{\tilde{p}}$. With the choices $\lambda_1 = \xi\gamma_n$ and $\lambda_2 = \xi\gamma_n^2$, it holds that:
$$\lambda_1\|\hat{f}_j - f_j^*\|_n + \lambda_2\|\hat{f}_j - f_j^*\|_{K_j} \leq 2(\lambda_1 + \lambda_2) \leq 4\xi\gamma_n, \quad \forall j \in S, \qquad (11)$$
due to the fact that $\|f_j\|_n \leq \|f_j\|_{K_j} \leq 1$ for any $f_j \in B_1(\mathcal{H}_{K_j})$.
According to Lemma 4 and (11), we obtain:
$$\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f}) \leq \frac{c\,\eta(t_0)\,\|\phi'\|_\infty}{\sigma^2}\Big(\gamma_n\sum_{j=1}^p\|\hat{f}_j - f_j^*\|_2 + \gamma_n^2\sum_{j=1}^p\|\hat{f}_j - f_j^*\|_K\Big) + \lambda_1\sum_{j\in S}\|\hat{f}_j - f_j^*\|_n + \lambda_2\sum_{j\in S}\|\hat{f}_j - f_j^*\|_K + e^{-\tilde{p}} \leq \frac{c\,\eta(t_0)\,\|\phi'\|_\infty}{\sigma^2}\,\xi\,\gamma_n + e^{-\tilde{p}},$$
with probability at least $1 - 2\tilde{p}^{-A}$.
Notice that $\log\tilde{p} \geq 2\log\log n$ implies that $e^{-\tilde{p}} \leq n^{-2} \leq \gamma_n$. Then:
$$\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f}) \leq \frac{c\,\eta(t_0)\,\|\phi'\|_\infty}{\sigma^2}\,\xi\,\gamma_n.$$
Combining this with Theorem 9 in [17] and setting $\sigma = \big(\|\phi'\|_\infty\,\eta(t_0)\,\xi\,\gamma_n\big)^{\frac{1}{4}}$, we obtain the desired result. □
The proof of Theorem 1 is inspired by that of Theorem 1 in [28]; see [28] for more details. According to Theorem 1, we can conclude that the mode-based SpAM achieves a learning rate with polynomial decay $O\big(n^{-\frac{1}{4+4\alpha}}\big)$, since $\alpha \in (0, 1)$ and $A$, $\tilde{p}$ are positive constants.

4. Experimental Evaluation

To demonstrate the effectiveness of our method, in this section, we evaluated the model on synthetic datasets. The inputs in $\mathbb{R}^p$ with dimension $p = 5$ and $p = 10$ were generated randomly from the uniform distribution on the interval $[0, 1]$. We then computed the MSE of our estimator $\hat{f}$. Figure 1, Figure 2 and Figure 3 depict the MSE of $\hat{f}$ for the parameter pairs $(\lambda_1, \lambda_2) = (0, 1)$, $(1, 0)$, and $(1, 1)$, respectively, while the number of samples $n$ varies from 50/60 to 80/90. The models were built with YALMIP [43] in the MATLAB environment, and fmincon was called to solve the resulting optimization problem. The figures show that the MSE tends to decrease as the number of samples $n$ increases under all three parameter settings, which verifies that our method is effective for the regression of high-dimensional data.
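The paper does not list the exact component functions used to generate the synthetic responses, so the sketch below uses hypothetical smooth components (a sine and a quadratic) purely to illustrate the protocol: uniform inputs on $[0, 1]^p$, an additive signal plus noise, and an MSE computed against the true regression function.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n, p=5, sigma_noise=0.1):
    """Synthetic additive data; the component functions below are
    hypothetical stand-ins, not the ones used in the paper."""
    X = rng.uniform(0.0, 1.0, size=(n, p))
    f_true = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2  # two active components
    y = f_true + sigma_noise * rng.standard_normal(n)
    return X, y, f_true

X, y, f_true = make_data(n=80)
f_hat = np.full_like(f_true, y.mean())   # placeholder estimator; replace with the fitted model
print("MSE:", np.mean((f_hat - f_true) ** 2))
```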

5. Conclusions

In this work, we proposed a mode-based sparse additive model and established its generalization error bound. The theoretical results extend the previous mean-based analysis to the mode-based approach. We demonstrated that the mode-based SpAM can achieve a learning rate with polynomial decay $O\big(n^{-\frac{1}{4+4\alpha}}\big)$, which is comparable to the previous result of $O\big(n^{-\frac{1}{7}}\big)$ in [15]. In the future, it will be important to further explore the variable selection consistency of the proposed model.

Author Contributions

Conceptualization, H.D., B.S., J.C. and Z.P.; methodology, H.D. and Z.P.; validation, B.S. and Z.P.; formal analysis, H.D., B.S. and Z.P.; investigation, H.D. and J.C.; resources, Z.P.; data curation, H.D. and J.C.; writing—original draft preparation, H.D. and J.C.; writing—review and editing, H.D. and J.C.; visualization, H.D. and J.C.; supervision, B.S. and Z.P.; project administration, B.S. and Z.P.; funding acquisition, H.D., B.S. and Z.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Fundamental Research Funds for the Central Universities of China (Grant Nos. 2662019FW003 and 2662020LXQD001) and the National Natural Science Foundation of China (Grant No. 12001217).

Data Availability Statement

The synthetic data generation procedure for the simulation experiments is described in Section 4.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Xia, Y.; Hou, Y.; Lv, S. Learning Rates for Partially Linear Support Vector Machine in High Dimensions. Anal. Appl. 2021, 19, 167–182.
2. Ravikumar, P.; Liu, H.; Lafferty, J.; Wasserman, L. SpAM: Sparse Additive Models. J. R. Stat. Soc. Ser. B 2009, 71, 1009–1030.
3. Yin, J.; Chen, X.; Xing, E.P. Group Sparse Additive Models. In Proceedings of the International Conference on Machine Learning (ICML), Edinburgh, UK, 26 June–1 July 2012; pp. 1643–1650.
4. Lin, Y.; Zhang, H.H. Component Selection and Smoothing in Multivariate Nonparametric Regression. Ann. Stat. 2006, 34, 2272–2297.
5. Zhao, T.; Liu, H. Sparse Additive Machine. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), La Palma, Spain, 21–23 April 2012; Volume 22, pp. 1435–1443.
6. Chen, H.; Wang, X.; Deng, C.; Huang, H. Group Sparse Additive Machine. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 197–207.
7. Kandasamy, K.; Yu, Y. Additive Approximations in High Dimensional Nonparametric Regression via the SALSA. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 69–78.
8. Wang, Y.; Chen, H.; Zheng, F.; Xu, C.; Gong, T.; Chen, Y. Multi-task Additive Models for Robust Estimation and Automatic Structure Discovery. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Online, 6–12 December 2020; Volume 33, pp. 11744–11755.
9. Chen, H.; Liu, G.; Huang, H. Sparse Shrunk Additive Models. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020; Volume 119, pp. 6194–6204.
10. Chen, H.; Guo, C.; Xiong, H.; Wang, Y. Sparse Additive Machine with Ramp Loss. Anal. Appl. 2021, 19, 509–528.
11. Meier, L.; Geer, S.V.D.; Buhlmann, P. High-dimensional Additive Modeling. Ann. Stat. 2009, 37, 3779–3821.
12. Raskutti, G.; Wainwright, M.J.; Yu, B. Minimax-optimal Rates for Sparse Additive Models over Kernel Classes via Convex Programming. J. Mach. Learn. Res. 2012, 13, 389–427.
13. Kemp, G.C.R.; Silva, J.M.C.S. Regression towards the Mode. J. Econom. 2012, 170, 92–101.
14. Yao, W.; Li, L. A New Regression Model: Modal Linear Regression. Scand. J. Stat. 2014, 41, 656–671.
15. Wang, X.; Chen, H.; Cai, W.; Shen, D.; Huang, H. Regularized Modal Regression with Applications in Cognitive Impairment Prediction. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 1448–1458.
16. Chen, Y.C.; Genovese, C.R.; Tibshirani, R.J.; Wasserman, L. Nonparametric Modal Regression. Ann. Stat. 2014, 44, 489–514.
17. Feng, Y.; Fan, J.; Suykens, J. A Statistical Learning Approach to Modal Regression. J. Mach. Learn. Res. 2020, 21, 1–35.
18. Collomb, G.; Härdle, W.; Hassani, S. A Note on Prediction via Estimation of the Conditional Mode Function. J. Stat. Plan. Inference 1986, 15, 227–236.
19. Chen, H.; Wang, Y.; Zheng, F.; Deng, C.; Huang, H. Sparse Modal Additive Model. IEEE Trans. Neural Netw. Learn. Syst. 2020, 1–15.
20. Li, J.; Ray, S.; Lindsay, B.G. A Nonparametric Statistical Approach to Clustering via Mode Identification. J. Mach. Learn. Res. 2007, 8, 1687–1723.
21. Einbeck, J.; Tutz, G. Modeling beyond Regression Function: An Application of Multimodal Regression to Speed-flow Data. J. R. Stat. Soc. Ser. C Appl. Stat. 2006, 55, 461–475.
22. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
23. Feng, Y.; Huang, X.; Shi, L.; Yang, Y.; Suykens, J.A. Learning with the Maximum Correntropy Criterion Induced Losses for Regression. J. Mach. Learn. Res. 2015, 16, 993–1034.
24. Lv, F.; Fan, J. Optimal Learning with Gaussians and Correntropy Loss. Anal. Appl. 2019, 19, 107–124.
25. Yao, W.; Lindsay, B.G.; Li, R. Local Modal Regression. J. Nonparametr. Stat. 2012, 24, 647–663.
26. Chen, Y. Modal Regression Using Kernel Density Estimation: A Review. Wiley Interdiscip. Rev. Comput. Stat. 2018, 10, e1431.
27. Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science and Business Media: Berlin/Heidelberg, Germany, 2008.
28. Lv, S.; Lin, H.; Lian, H.; Huang, J. Oracle Inequalities for Sparse Additive Quantile Regression in Reproducing Kernel Hilbert Space. Ann. Stat. 2018, 46, 781–813.
29. Huang, J.; Horowitz, J.L.; Wei, F. Variable Selection in Nonparametric Additive Models. Ann. Stat. 2010, 38, 2282–2313.
30. Christmann, A.; Zhou, D.X. Learning Rates for the Risk of Kernel-Based Quantile Regression Estimators in Additive Models. Anal. Appl. 2016, 14, 449–477.
31. Yuan, M.; Zhou, D.X. Minimax Optimal Rates of Estimation in High Dimensional Additive Models. Ann. Stat. 2016, 44, 2564–2593.
32. Nikolova, M.; Ng, M.K. Analysis of Half-Quadratic Minimization Methods for Signal and Image Recovery. SIAM J. Sci. Comput. 2006, 27, 937–966.
33. Alizadeh, F.; Goldfarb, D. Second-Order Cone Programming. Math. Program. 2003, 95, 3–51.
34. Guo, C.; Song, B.; Wang, Y.; Chen, H.; Xiong, H. Robust Variable Selection and Estimation Based on Kernel Modal Regression. Entropy 2019, 21, 403.
35. Wang, Y.; Tang, Y.Y.; Li, L.; Chen, H. Modal Regression-Based Atomic Representation for Robust Face Recognition and Reconstruction. IEEE Trans. Cybern. 2020, 50, 4393–4405.
36. Suzuki, T.; Sugiyama, M. Fast Learning Rate of Multiple Kernel Learning: Trade-off between Sparsity and Smoothness. Ann. Stat. 2013, 41, 1381–1405.
37. Schölkopf, B.; Smola, A.J. Learning with Kernels; The MIT Press: Cambridge, MA, USA, 2002.
38. Aronszajn, N. Theory of Reproducing Kernels. Trans. Am. Math. Soc. 1950, 68, 337–404.
39. Bartlett, P.L.; Bousquet, O.; Mendelson, S. Localized Rademacher Complexities. In Proceedings of the Conference on Computational Learning Theory (COLT), Sydney, Australia, 8–10 July 2002; Volume 2373, pp. 44–58.
40. Mendelson, S. Geometric Parameters of Kernel Machines. In Proceedings of the Conference on Computational Learning Theory (COLT), Sydney, Australia, 8–10 July 2002; Volume 2375, pp. 29–43.
41. Koltchinskii, V.; Yuan, M. Sparsity in Multiple Kernel Learning. Ann. Stat. 2010, 38, 3660–3695.
42. Van De Geer, S. Empirical Processes in M-Estimation; Cambridge University Press: Cambridge, UK, 2000.
43. Löfberg, J. Automatic Robust Convex Programming. Optim. Methods Softw. 2012, 27, 115–129.
Figure 1. MSE of $\hat{f}$ when $(\lambda_1, \lambda_2) = (0, 1)$.
Figure 2. MSE of $\hat{f}$ when $(\lambda_1, \lambda_2) = (1, 0)$.
Figure 3. MSE of $\hat{f}$ when $(\lambda_1, \lambda_2) = (1, 1)$.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
