
Error Bound of Mode-Based Additive Models

by Hao Deng, Jianghong Chen, Biqin Song and Zhibin Pan
1 College of Science, Huazhong Agricultural University, Wuhan 430070, China
2 College of Electrical and New Energy, China Three Gorges University, Yichang 443002, China
* Authors to whom correspondence should be addressed.
Entropy 2021, 23(6), 651; https://doi.org/10.3390/e23060651
Submission received: 22 March 2021 / Revised: 19 May 2021 / Accepted: 19 May 2021 / Published: 22 May 2021
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Due to their flexibility and interpretability, additive models are powerful tools for high-dimensional mean regression and variable selection. However, least-squares-based mean regression models are sensitive to non-Gaussian noise, so there is a need to improve their robustness. This paper considers estimation and variable selection via modal regression in reproducing kernel Hilbert spaces (RKHSs). Based on the mode-induced metric and a two-fold Lasso-type regularizer, we proposed a sparse modal regression algorithm and derived its excess generalization error bound. The experimental results demonstrated the effectiveness of the proposed model.

1. Introduction

Regression estimation and variable selection are two important tasks in high-dimensional data mining [1]. Sparse additive models [2,3], which aim to handle both tasks simultaneously, have been extensively investigated in the mean regression setting. As a class of models between linear and nonparametric regression, they inherit the flexibility of nonparametric regression and the interpretability of linear regression. Typical methods include COSSO [4], SpAM [2] and its variants such as Group SpAM [3], SAM [5], Group SAM [6], SALSA [7], MAM [8], SSAM [9], and ramp-SAM [10]. Through the lens of nonparametric regression, the additive structure imposed on the hypothesis space is crucial for overcoming the curse of dimensionality [7,11,12].
Usually, the aforementioned models are limited to estimating the conditional mean under the mean-squared error (MSE) criterion. However, for complex non-Gaussian noises (e.g., skewed or heavy-tailed noise), it is difficult for mean-based approaches to extract the intrinsic trends, resulting in degraded performance. Beyond traditional mean regression, it is therefore interesting to formulate a new regression framework under the (conditional) mode-based criterion. With the help of the recent works in [13,14,15,16,17,18,19], this paper aims to propose a new robust sparse additive model, rooted in modal regression associated with an RKHS.
As an alternative to mean regression, modal regression has been investigated in terms of its statistical behavior [14,15,17] and real-world applications [20,21]. Yao [14] proposed a modal linear regression algorithm and characterized its theoretical properties under a global mode assumption. As a natural extension of the Lasso [22], Wang et al. [15] considered regularized modal regression and established its generalization bound and variable selection consistency. Feng et al. [17] studied modal regression from a learning theory perspective and illustrated its relation with the MCC [23,24]. Different from these global approaches, local modal regression algorithms were formulated in [16,25] with convergence guarantees. The recent survey [26] gives a general overview of modal regression, and a more comprehensive list of references can be found there.
The proposed robust additive models are formulated under the Tikhonov regularization scheme, which involves three building blocks: the mode-based metric, the RKHS-based hypothesis space, and two Lasso-type penalties. Since linear function spaces, polynomial function spaces, and Sobolev/Besov spaces are special cases of RKHSs, the kernel-based function space is more flexible than traditional spline-based spaces or other dictionary-based hypotheses [2,5,27,28,29]. The mode-induced regression metric is robust to non-Gaussian noise according to both theoretical and empirical evaluations [14,15,17]. The two-fold regularization penalty addresses the sparsity and smoothness of the estimator, which has shown promising performance for mean regression [2,29,30,31]. Therefore, different from mean-based kernel regression and additive models, the mode-based approach enjoys robustness and interpretability simultaneously due to its metric criterion and trade-off penalty. The estimator of our approach can be obtained by integrating half-quadratic (HQ) optimization [32] and second-order cone programming (SOCP) [33].
The rest of this article is organized as follows. After introducing the robust additive model in Section 2, we state its generalization error bound in Section 3. Section 4 reports the experimental evaluation, and Section 5 ends this paper with a brief conclusion.

2. Methodology

2.1. Modal Regression

In this section, we recall the basic background on modal regression [19,34]. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^p$ associated with the input covariate vector and $\mathcal{Y} \subseteq \mathbb{R}$ be the response variable set. In this paper, we consider the following nonparametric model:
$$Y = f^*(X) + \epsilon, \qquad (1)$$
where $X = (X_1, \dots, X_p)^T \in \mathcal{X}$, $Y \in \mathcal{Y}$, and $\epsilon$ is random noise. For feasibility, we denote by $\rho$ the underlying joint distribution of $(X, Y)$ generated by (1).
Different from traditional mean regression under the noise condition $\mathbb{E}(\epsilon \mid X = x) = 0$ (e.g., Gaussian noise), we only require that the mode of the conditional distribution of $\epsilon$ equals zero at each $x \in \mathcal{X}$. That is:
$$\mathrm{mode}(\epsilon \mid X = x) := \arg\max_{t \in \mathbb{R}} P_{\epsilon|X}(t \mid X = x) = 0, \quad \forall x \in \mathcal{X}, \qquad (2)$$
where $P_{\epsilon|X}$ is the conditional density of $\epsilon$ given $X$. Notice that this zero-mode condition requires neither homogeneity nor symmetry of the noise distribution, so non-Gaussian noises (e.g., skewed or heavy-tailed noise) are not excluded.
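As a concrete illustration of the zero-mode condition, the following minimal NumPy sketch (our own example, not from the paper) draws a skewed noise whose mode is zero while its mean is not; this is exactly the regime where mean regression is biased but modal regression still targets $f^*$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential noise: its density is maximized at 0 (mode = 0),
# but its mean equals 1, so E(eps | X) != 0.
eps = rng.exponential(scale=1.0, size=100_000)

print("empirical mean:", eps.mean())              # close to 1
hist, edges = np.histogram(eps, bins=200, density=True)
print("empirical mode:", edges[np.argmax(hist)])  # close to 0
```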
From (1), we further deduce that:
$$f^*(u) := \sum_{j=1}^p f_j^*(u_j) = \mathrm{mode}(Y \mid X = u) = \arg\max_{t} P_{Y|X}(t \mid X = u),$$
where $u = (u_1, \dots, u_p)^T \in \mathcal{X}$ and $P_{Y|X}$ denotes the conditional density of $Y$ given $X$. The purpose of modal regression is then to find the target function $f^*$ from the empirical data $\mathbf{z} = \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n$ drawn independently from $\rho$.
For modal regression, the performance of a predictor $f: \mathcal{X} \to \mathbb{R}$ is measured by the mode-based metric:
$$\mathcal{R}(f) = \int_{\mathcal{X}} P_{Y|X}\big(f(x) \mid X = x\big)\, d\rho_X(x), \qquad (3)$$
where $\rho_X$ is the marginal distribution of $\rho$ on the input space $\mathcal{X}$.
Although the target function $f^*$ is the maximizer of $\mathcal{R}(f)$ over all measurable functions, it cannot be estimated directly via maximizing (3) because $P_{Y|X}$ and $\rho_X$ are unknown. Fortunately, indirect density-estimation-based strategies were proposed in [14,15,17]. As shown in Theorem 5 of [17], $\mathcal{R}(f)$ equals the density of the random variable $E_f = Y - f(X)$ at zero, i.e.,
$$\mathcal{R}(f) = P_{E_f}(0).$$
Therefore, we can find an approximation of $f^*$ by maximizing the empirical version of $P_{E_f}(0)$ with the help of kernel density estimation (KDE).
Let $K_\sigma: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ be a kernel with bandwidth $\sigma$, whose representing function $\phi: \mathbb{R} \to [0, \infty)$ satisfies $\phi\big(\frac{u - u'}{\sigma}\big) = K_\sigma(u, u')$ for all $u, u' \in \mathbb{R}$. Typical kernels used in KDE include the Gaussian kernel, the Epanechnikov kernel, the logistic kernel, and the sigmoid kernel. The KDE-based estimator of $P_{E_f}(0)$ is defined as:
$$\hat{P}_{E_f}(0) = \frac{1}{n\sigma} \sum_{i=1}^n K_\sigma\big(y_i - f(x_i), 0\big) = \frac{1}{n\sigma} \sum_{i=1}^n \phi\Big(\frac{y_i - f(x_i)}{\sigma}\Big) := \hat{\mathcal{R}}_\sigma(f).$$
Learning models for modal regression are usually formulated by Tikhonov regularization schemes associated with the empirical metric $\hat{\mathcal{R}}_\sigma(f)$; see, e.g., [15,35].
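For concreteness, the following minimal sketch evaluates the empirical metric $\hat{\mathcal{R}}_\sigma(f)$ with a Gaussian representing function $\phi(u) = \exp(-u^2/2)$; the function name and the choice of $\phi$ are our own illustration and are not prescribed by the paper.

```python
import numpy as np

def empirical_modal_metric(y, f_x, sigma):
    """KDE-based empirical modal regression metric.

    y     : (n,) observed responses y_i
    f_x   : (n,) predictions f(x_i)
    sigma : bandwidth of the KDE kernel
    Uses the Gaussian representing function phi(u) = exp(-u**2 / 2).
    """
    residuals = (y - f_x) / sigma
    phi = np.exp(-0.5 * residuals ** 2)
    return phi.mean() / sigma  # (1 / (n * sigma)) * sum_i phi((y_i - f(x_i)) / sigma)

# toy usage: the metric is larger when predictions sit near the conditional mode
y = np.array([0.1, -0.2, 0.05, 0.3])
print(empirical_modal_metric(y, np.zeros(4), sigma=0.5))
```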
Naturally, the data-free (population) modal regression metric corresponding to $\hat{\mathcal{R}}_\sigma(f)$ can be defined as:
$$\mathcal{R}_\sigma(f) = \frac{1}{\sigma} \int_{\mathcal{X} \times \mathcal{Y}} \phi\Big(\frac{y - f(x)}{\sigma}\Big)\, d\rho(x, y).$$
In theory, the learning performance of an estimator $f: \mathcal{X} \to \mathbb{R}$ can be evaluated in terms of $\mathcal{R}(f^*) - \mathcal{R}(f)$, which can be further bounded via $\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(f)$ (see Theorem 10 in [17]).
Remark 1.
As illustrated in [17], when $K_\sigma$ is taken to be a Gaussian kernel, modal regression maximizing $\mathcal{R}_\sigma(f)$ coincides with learning under the maximum correntropy criterion (MCC). By employing different kernels, we obtain a rich family of evaluation metrics for robust estimation.

2.2. Mode-Based Sparse Additive Models

The additive model is formulated as follows:
$$Y = \sum_{j=1}^p f_j^*(X_j) + \epsilon, \qquad (4)$$
where $X_j \in \mathcal{X}_j$ ($j = 1, 2, \dots, p$), $Y \in \mathcal{Y}$, and the $f_j^*$ are unknown component functions. By employing nonlinear hypothesis spaces with an additive structure, the additive model provides better flexibility for regression estimation and variable selection [19]. In [28], the theoretical properties of the sparse additive model with the quantile loss were discussed. We introduce some basic notation and assumptions in a similar way.
Suppose that $\mathbb{E} f_j^*(X_j) = 0$ and $\|f_j^*\|_{K_j} \leq 1$ for each $f_j^*$ in (4) with $j \in S$. Here, $f_j^*: \mathcal{X}_j \to \mathbb{R}$ is an unknown univariate function in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_j := \mathcal{H}_{K_j}$ associated with kernel $K_j$ and norm $\|\cdot\|_{K_j}$ [30,31], and $S \subset \{1, \dots, p\}$ is the intrinsic subset of active components with cardinality $|S| < p$. This means that each observation $(x_i, y_i)$ is generated according to:
$$y_i = \sum_{j \in S} f_j^*(x_{ij}) + \epsilon_i, \quad i = 1, \dots, n,$$
where $x_i = (x_{i1}, \dots, x_{ip})^T \in \mathbb{R}^p$, $f_j^* \in \mathcal{H}_j$, and $\epsilon$ satisfies condition (2).
For any given $j \in \{1, \dots, p\}$, denote $B_r(\mathcal{H}_j) = \{g \in \mathcal{H}_j : \|g\|_{K_j} \leq r\}$. The hypothesis space considered here is defined by:
$$\mathcal{F} = \Big\{ f = \sum_{j=1}^p f_j : f_j \in B_r(\mathcal{H}_j),\ j = 1, \dots, p \Big\}, \qquad (5)$$
which is a subset of the RKHS $\mathcal{H} = \{f = \sum_{j=1}^p f_j : f_j \in \mathcal{H}_j\}$ with the norm:
$$\|f\|_K^2 = \inf\Big\{ \sum_{j=1}^p \|f_j\|_{K_j}^2 : f = \sum_{j=1}^p f_j \Big\}.$$
For each $\mathcal{X}_j$ and the corresponding marginal distribution $\rho_{X_j}$, we denote $\|f_j\|_2^2 := \int_{\mathcal{X}_j} |f_j(u)|^2\, d\rho_{X_j}(u)$. Given inputs $\{x_i\}_{i=1}^n$, define the empirical norm of each $f_j$ as:
$$\|f_j\|_n^2 := \frac{1}{n} \sum_{i=1}^n f_j^2(x_{ij}), \quad f_j \in \mathcal{H}_j,\ j \in \{1, \dots, p\}.$$
With the help of the mode-based metric (3) and the hypothesis space (5), we formulate the mode-based sparse additive model as:
$$\hat{f} = \arg\max_{f \in \mathcal{F}} \Big\{ \hat{\mathcal{R}}_\sigma(f) - \lambda_1 \sum_{j=1}^p \|f_j\|_n - \lambda_2 \sum_{j=1}^p \|f_j\|_{K_j} \Big\}, \qquad (6)$$
where $(\lambda_1, \lambda_2)$ is a pair of positive regularization parameters. The first regularization term promotes sparsity [11,36], and the second one guarantees the smoothness of the solution.
By the representer theorem of kernel methods (e.g., [37]), the solution of (6) admits the following form:
$$\hat{f}(u) = \sum_{i=1}^n \sum_{j=1}^p \hat{\alpha}_{ij} K_j(u_j, x_{ij}), \quad u = (u_1, \dots, u_p)^T,$$
with a collection of coefficients $\{\hat{\alpha}_j = (\hat{\alpha}_{1j}, \dots, \hat{\alpha}_{nj})^T \in \mathbb{R}^n : j = 1, \dots, p\}$.
The optimal coefficients with respect to (6) are the solution of the following non-convex optimization problem:
$$\max_{\alpha_j \in \mathbb{R}^n,\ \alpha_j^T \mathbf{K}_j \alpha_j \leq 1}\ \Big\{ \frac{1}{n} \sum_{i=1}^n \phi\Big( \frac{y_i - \sum_{j=1}^p \mathbf{K}_{ji}^T \alpha_j}{\sigma} \Big) - \frac{\lambda_1}{\sqrt{n}} \sum_{j=1}^p \|\mathbf{K}_j \alpha_j\|_2 - \lambda_2 \sum_{j=1}^p \sqrt{\alpha_j^T \mathbf{K}_j \alpha_j} \Big\},$$
where $\mathbf{K}_{ji} = \big(K_j(x_{1j}, x_{ij}), \dots, K_j(x_{nj}, x_{ij})\big)^T \in \mathbb{R}^n$ and $\mathbf{K}_j = \big(K_j(x_{ij}, x_{lj})\big)_{i,l=1}^n = (\mathbf{K}_{j1}, \dots, \mathbf{K}_{jn}) \in \mathbb{R}^{n \times n}$.
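The sketch below illustrates how this finite-dimensional objective can be evaluated from the Gram matrices $\mathbf{K}_j$ (our own illustration with Gaussian component kernels; the paper does not prescribe these function names or kernel choices).

```python
import numpy as np

def gram_matrix(xj, gamma=1.0):
    """Gram matrix K_j of a univariate Gaussian kernel on the j-th covariate."""
    d = xj[:, None] - xj[None, :]
    return np.exp(-gamma * d ** 2)

def objective(alpha, K_list, y, sigma, lam1, lam2):
    """Objective of the finite-dimensional problem (to be maximized).

    alpha  : (p, n) array of coefficient vectors alpha_j
    K_list : list of p Gram matrices K_j, each (n, n)
    """
    n = y.shape[0]
    # f(x_i) = sum_j (K_j alpha_j)_i
    fx = sum(K @ a for K, a in zip(K_list, alpha))
    data_fit = np.mean(np.exp(-0.5 * ((y - fx) / sigma) ** 2)) / sigma              # Gaussian phi
    sparsity = sum(np.linalg.norm(K @ a) / np.sqrt(n) for K, a in zip(K_list, alpha))  # ||f_j||_n
    smoothness = sum(np.sqrt(max(a @ K @ a, 0.0)) for K, a in zip(K_list, alpha))      # ||f_j||_Kj
    return data_fit - lam1 * sparsity - lam2 * smoothness
```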
Remark 2.
There are various combinations of sparsity and smoothness regularization for additive models [2,3,29,30,31]. The regularization adopted in this paper is a two-fold group-Lasso scheme, which was employed in [28] in the quantile regression setting; it also differs from the coefficient-based regularized modal regression in [19].
Remark 3.
From a computational viewpoint, the proposed algorithm (6) can be transformed into a regularized least-squares regression problem by HQ optimization [32]. The transformed problem can then be tackled efficiently by SOCP [33].
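As a rough illustration of the half-quadratic idea for a Gaussian representing function, the sketch below alternates between updating sample weights and solving a weighted least-squares problem in the kernel coefficients. It replaces the two-fold Lasso penalty of (6) with a single quadratic penalty, so it is only a schematic of the HQ step under that simplification, not the SOCP-based solver used in the paper.

```python
import numpy as np

def hq_modal_fit(K, y, sigma=0.5, lam=0.1, n_iter=20):
    """Half-quadratic (iteratively reweighted) fit of f(x_i) = (K alpha)_i.

    K : (n, n) aggregated Gram matrix, y : (n,) responses.
    The weights w_i = exp(-r_i**2 / (2 sigma**2)) come from the HQ surrogate
    of the Gaussian representing function; the penalty here is a plain
    quadratic alpha^T K alpha instead of the two-fold Lasso of (6).
    """
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        r = y - K @ alpha                      # current residuals
        w = np.exp(-0.5 * (r / sigma) ** 2)    # HQ weights (large for small residuals)
        W = np.diag(w)
        # weighted kernel ridge update: (K W K + lam K) alpha = K W y
        alpha = np.linalg.solve(K @ W @ K + lam * K + 1e-8 * np.eye(n), K @ W @ y)
    return alpha
```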

3. Error Analysis

This section states upper bounds for the excess quantity $\mathcal{R}(f^*) - \mathcal{R}(\hat{f})$. For ease of presentation, we only consider the special setting where $\mathcal{H}_j \equiv \mathcal{H}_{j'}$ for all $j, j' \in \{1, \dots, p\}$, and we denote $\sum_{j=1}^p \mathcal{H}_j$ by $\mathcal{H}_K$ with $\sup_x K(x, x) \leq 1$.
Recall that the Mercer kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ admits the following spectral expansion [38]:
$$K(x, x') = \sum_{\ell \geq 1} b_\ell\, \psi_\ell(x)\, \psi_\ell(x'), \quad x, x' \in \mathcal{X},$$
where $\{(b_\ell, \psi_\ell)\}_{\ell \geq 1}$ are the eigenvalue-eigenfunction pairs of the integral operator $T_K f := \int K(\cdot, x) f(x)\, d\rho_X(x)$ with $b_1 \geq b_2 \geq \dots \geq 0$.
To evaluate the complexity of $\mathcal{H}_K$ in terms of the decay rate of the eigenvalues $\{b_\ell\}_{\ell \geq 1}$ [27,28], we adopt Assumption 1 in [28] as the basis of our analysis.
Assumption 1.
There exist $s \in (0, 1)$ and a constant $c_1 > 0$ such that $b_\ell \leq c_1 \ell^{-1/s}$ for all $\ell \geq 1$.
As illustrated in [27,28], the requirement $s < 1$ is a weak condition since $\sum_{\ell \geq 1} b_\ell = \mathbb{E} K(x, x) \leq 1$. In particular, $b_\ell \asymp \ell^{-2h}$ holds for the Sobolev space $\mathcal{H}_K = W_2^h$ ($h > \frac{1}{2}$) with the Lebesgue measure on $[0, 1]$.
To describe the hypothesis space in the RKHS, we adopt Assumption 2 in [28].
Assumption 2.
For the $s \in (0, 1)$ given in Assumption 1, there exists a positive constant $c_2$ such that $\|f\|_\infty \leq c_2\, \|f\|_2^{1-s}\, \|f\|_K^s$ for all $f \in \mathcal{H}_K$.
Remark 4.
To understand the statistical performance of the proposed estimator without any “correlatedness” conditions on the covariates, the Rademacher complexity [39] was used in [28] to measure functional complexity. Our analysis draws on the same tools.
In general, Assumption 2 is stronger than Assumption 1 and is satisfied when the RKHS is continuously embeddable in a Sobolev space. For uniformly bounded eigenfunctions $\{\psi_\ell\}_{\ell \geq 1}$, this sup-norm condition is consistent with Assumption 1.
For any given independent input variables $\{x_i\}_{i=1}^n \subset \mathcal{X}$, define the Rademacher complexity:
$$R_n(f) := \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i), \quad f \in \mathcal{H}_K,$$
where $\{\sigma_i\}_{i=1}^n$ is an i.i.d. sequence of Rademacher variables taking values in $\{\pm 1\}$ with probability $1/2$ each. As shown in [40], it holds that:
$$\mathbb{E}\, R_n\big(\{f \in \mathcal{H}_K : \|f\|_K = 1,\ \|f\|_2 \leq t\}\big) \leq \frac{1}{\sqrt{n}} \Big[ \sum_{\ell \geq 1} \min\{t^2, b_\ell\} \Big]^{\frac{1}{2}}.$$
Moreover, from Assumption 1, define:
$$\gamma_n := \inf\Big\{ \gamma \geq \sqrt{\tfrac{A \log \tilde{p}}{n}} :\ \mathbb{E}\Big[ \sup_{\|f\|_K = 1,\, \|f\|_2 \leq t} |R_n(f)| \Big] \leq \gamma t + \gamma^2,\ \forall t \in (0, 1) \Big\} \asymp \max\Big\{ \sqrt{\tfrac{A \log \tilde{p}}{n}},\ \Big(\tfrac{1}{n}\Big)^{\frac{1}{2(1+\alpha)}} \Big\}.$$
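To get a feel for which branch of $\gamma_n$ dominates, the following quick numeric check (our own illustration with arbitrary values of $A$, $\tilde{p}$, $n$, and $\alpha$) prints both branches: the logarithmic branch dominates for small $n$, while the kernel-complexity branch takes over for large $n$.

```python
import numpy as np

def gamma_n(n, p_tilde, A=2.0, alpha=0.5):
    """The two branches of gamma_n and their maximum."""
    log_branch = np.sqrt(A * np.log(p_tilde) / n)         # sqrt(A log p~ / n)
    kernel_branch = n ** (-1.0 / (2.0 * (1.0 + alpha)))   # n^(-1 / (2 (1 + alpha)))
    return log_branch, kernel_branch, max(log_branch, kernel_branch)

for n in (100, 10_000, 1_000_000):
    print(n, gamma_n(n, p_tilde=50))
```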
The main idea of our error analysis is to first establish a theoretical result on a suitably defined event and then investigate the behavior of $\hat{f}$ in (6) conditional on that event.
Define $\eta(t) := \max\{1, \sqrt{t}, t/\sqrt{n}\}$ for any $t > 0$ and $\xi_n := \xi_n(\lambda) = \max\big\{ \lambda^{-\frac{\alpha}{2}} n^{-\frac{1}{2}},\ \lambda^{\frac{1}{2}} n^{-\frac{1}{1+\alpha}},\ \sqrt{\tfrac{\log p}{n}} \big\}$, and consider the event:
$$\theta(t) = \Big\{ \Big| \frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i) \Big| \leq c_\alpha\, \eta(t)\, \xi_n\, \big( \|f\|_2 + \lambda^{\frac{1}{2}} \|f\|_K \big),\ \forall f \in \mathcal{H}_K \Big\},$$
where $\{\epsilon_i\}_{i=1}^n$ are zero-mean i.i.d. random variables with $|\epsilon_i| \leq L$ and $c_\alpha$ is a constant depending on $\alpha$ and $L$.
Remark 5.
To analyze the behavior of the regularized estimator conditioned on this event, several basic facts about empirical processes were introduced in [28]. Our analysis fits into this framework, and we restate the relevant lemmas from [28] as stepping stones.
Lemma 1.
Let Assumptions 1 and 2 hold. If $\log p \leq n$, it holds that:
$$P\big(\theta(t)\big) \geq 1 - \exp\{-t\}, \quad \forall \lambda > 0,\ t \geq 1.$$
The following lemma (see also Theorem 4 in [41]) describes the relationship between the empirical norm $\|\cdot\|_n$ and $\|\cdot\|_2$ for functions in $\mathcal{H}_K$.
Lemma 2.
For $A \geq 1$ and any given $\tilde{p} \geq p$ with $\log \tilde{p} \geq 2 \log\log n$, there exists a constant $c$ such that:
$$\|f\|_2 \leq c\, \big(\|f\|_n + \gamma_n \|f\|_K\big)$$
and:
$$\|f\|_n \leq c\, \big(\|f\|_2 + \gamma_n \|f\|_K\big)$$
with confidence at least $1 - \tilde{p}^{-A}$, where $\gamma_n \asymp \max\big( \sqrt{\tfrac{A \log \tilde{p}}{n}},\ (\tfrac{1}{n})^{\frac{1}{2(1+\alpha)}} \big)$.
Lemma 3.
Let $\{z_i\}_{i=1}^n \subset \mathcal{Z}$ be independent random variables, and let $\Gamma$ be a class of real-valued functions on $\mathcal{Z}$ satisfying:
$$\|\gamma\|_\infty \leq \eta_n,\ \forall \gamma \in \Gamma, \quad \text{and} \quad \frac{1}{n} \sum_{i=1}^n \mathrm{var}\big(\gamma(z_i)\big) \leq \iota_n^2,$$
for some positive constants $\eta_n$ and $\iota_n$. Define $\zeta := \sup_{\gamma \in \Gamma} \big| \frac{1}{n} \sum_{i=1}^n \gamma(z_i) - \mathbb{E}\gamma(z) \big|$. Then,
$$P\Big\{ \zeta \geq \mathbb{E}\zeta + t\sqrt{2\big(\iota_n^2 + 2\eta_n \mathbb{E}\zeta\big)} + \frac{2\eta_n t^2}{3} \Big\} \leq \exp\{-n t^2\}.$$
For any given $\Delta^-$ and $\Delta^+$, define:
$$\mathcal{F}(\Delta^-, \Delta^+) = \Big\{ f = \sum_{j=1}^p f_j \in \mathcal{H}_K :\ \gamma_n \sum_{j=1}^p \|f_j - f_j^*\|_2 \leq \Delta^-,\ \gamma_n^2 \sum_{j=1}^p \|f_j - f_j^*\|_K \leq \Delta^+ \Big\}.$$
Lemma 4.
Let Assumptions 1 and 2 hold for each $\mathcal{H}_j$. For any given $A \geq 2$, with confidence at least $1 - \tilde{p}^{-A}$, it holds that:
$$\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(f) - \big( \hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(f) \big) \leq c^*\, \eta(t_0)\, (\Delta^- + \Delta^+) + \exp\{-\tilde{p}\},$$
for any $f \in \mathcal{F}(\Delta^-, \Delta^+)$ with $\max\{\Delta^-, \Delta^+\} \leq e^{\tilde{p}}$, where $t_0 = 2\log(2\sqrt{3}/\log 2) + A\log\tilde{p} + 2\log\tilde{p}$, $\lambda = n^{-\frac{1}{1+\alpha}}$, and $c^*$ is a positive constant.
Proof. 
Denote $\Gamma = \big\{\gamma(z) : \gamma(z) = \frac{1}{\sigma}\phi\big(\frac{y - f^*(x)}{\sigma}\big) - \frac{1}{\sigma}\phi\big(\frac{y - f(x)}{\sigma}\big),\ f \in \mathcal{F}(\Delta^-, \Delta^+)\big\}$. It is easy to verify that:
$$\mathbb{E}\gamma(z) - \frac{1}{n}\sum_{i=1}^n \gamma(z_i) = \mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(f) - \big(\hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(f)\big), \quad \forall \gamma \in \Gamma.$$
Let $\zeta := \sup_{\gamma \in \Gamma} \big| \frac{1}{n}\sum_{i=1}^n \gamma(z_i) - \mathbb{E}\gamma(z) \big|$. From Lemma 3, we have:
$$\zeta \leq \mathbb{E}\zeta + \sqrt{\frac{2t\big(\iota_n^2 + 2\eta_n \mathbb{E}\zeta\big)}{n}} + \frac{2\eta_n t}{3n}, \qquad (7)$$
with probability at least $1 - \exp\{-t\}$, where the constants $\eta_n$ and $\iota_n$ satisfy $\sup_{\gamma\in\Gamma}\|\gamma\|_\infty \leq \eta_n$ and $\sup_{\gamma\in\Gamma}\frac{1}{n}\sum_{i=1}^n \mathrm{var}(\gamma(z_i)) \leq \iota_n^2$. Observing that:
$$\sqrt{\frac{2t\big(\iota_n^2 + 2\eta_n \mathbb{E}\zeta\big)}{n}} \leq \sqrt{\frac{2t\iota_n^2}{n}} + \sqrt{\frac{4t\eta_n \mathbb{E}\zeta}{n}} \leq \sqrt{\frac{2t}{n}}\,\iota_n + \mathbb{E}\zeta + \frac{\eta_n t}{n}, \qquad (8)$$
we can take:
$$\iota_n^2 = \frac{2\|\phi'\|_\infty^2}{\sigma^4}\cdot\frac{(\Delta^-)^2}{\gamma_n^2} \geq 2\,\mathbb{E}\Big(\frac{1}{\sigma}\phi\Big(\frac{y - f^*(x)}{\sigma}\Big) - \frac{1}{\sigma}\phi\Big(\frac{y - f(x)}{\sigma}\Big)\Big)^2, \qquad (9)$$
since $\mathbb{E}(\gamma(z))^2 \leq \frac{\|\phi'\|_\infty^2}{\sigma^4}\|f - f^*\|_2^2 \leq \frac{\|\phi'\|_\infty^2}{\sigma^4}\cdot\frac{(\Delta^-)^2}{\gamma_n^2}$ for every $\gamma \in \Gamma$, and:
$$\eta_n = \sup_{\gamma\in\Gamma}\|\gamma\|_\infty \leq \frac{\|\phi'\|_\infty}{\sigma^2}\,\|f^* - f\|_\infty \leq \frac{\|\phi'\|_\infty}{\sigma^2}\,\|f^* - f\|_K \leq \frac{\|\phi'\|_\infty}{\sigma^2}\cdot\frac{\Delta^+}{\gamma_n^2}, \qquad (10)$$
where $\|\phi'\|_\infty$ denotes the Lipschitz constant of the representing function $\phi$.
Combining (7)–(10), we obtain, with confidence at least $1 - \exp\{-t\}$:
$$\zeta \leq 2\,\mathbb{E}\zeta + \frac{2\|\phi'\|_\infty \Delta^-}{\gamma_n \sigma^2}\sqrt{\frac{t}{n}} + \frac{\kappa\,\|\phi'\|_\infty \Delta^+}{\sigma^2 \gamma_n^2}\Big(1 + \frac{t}{n}\Big),$$
where $\kappa$ is an absolute constant.
By a symmetrization technique in [42], we have:
$$\mathbb{E}\zeta \leq 2\,\mathbb{E} R_n(\Gamma) \leq \frac{2\|\phi'\|_\infty}{\sigma^2}\,\mathbb{E} R_n\big(\mathcal{F} - f^*\big).$$
Applying Lemma 3 to $R_n(\mathcal{F} - f^*)$, we obtain that:
$$\mathbb{E}\big[R_n(\mathcal{F} - f^*)\big] \leq R_n(\mathcal{F} - f^*) + \frac{4\Delta^-}{\gamma_n}\sqrt{\frac{2t}{n}} + \frac{\Delta^+}{\gamma_n^2}\Big(1 + \frac{t}{n}\Big),$$
with probability at least $1 - 2\exp\{-t\}$. Moreover, with probability at least $1 - 3\exp\{-t\}$, it holds that:
$$\zeta \leq \frac{8\|\phi'\|_\infty}{\sigma^2}\, R_n(\mathcal{F} - f^*) + \frac{6\|\phi'\|_\infty \Delta^-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}} + \frac{5\|\phi'\|_\infty \Delta^+}{\gamma_n^2\sigma^2}\Big(1 + \frac{t}{n}\Big) \leq \frac{8\|\phi'\|_\infty}{\sigma^2}\sum_{j=1}^p R_n\big(\mathcal{H}_j - f_j^*\big) + \frac{6\|\phi'\|_\infty \Delta^-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}} + \frac{5\|\phi'\|_\infty \Delta^+}{\gamma_n^2\sigma^2}\Big(1 + \frac{t}{n}\Big).$$
On the event $\theta(t)$, Lemma 1 demonstrates that:
$$|R_n(f)| \leq c_\alpha\, \eta(t)\, \xi_n\, \big(\|f\|_2 + \lambda^{\frac{1}{2}}\|f\|_K\big), \quad \forall f \in \mathcal{H}_K,\ \lambda > 0,$$
with confidence $1 - \exp\{-t\}$. Then,
$$\zeta \leq \frac{8\|\phi'\|_\infty c_\alpha \eta(t)\xi_n}{\sigma^2}\sup_{f\in\mathcal{F}(\Delta^-,\Delta^+)}\Big\{\sum_{j=1}^p\|f_j - f_j^*\|_2 + \lambda^{\frac{1}{2}}\sum_{j=1}^p\|f_j - f_j^*\|_K\Big\} + \frac{6\|\phi'\|_\infty\Delta^-}{\gamma_n\sigma^2}\sqrt{\frac{t}{n}} + \frac{5\|\phi'\|_\infty\Delta^+}{\gamma_n^2\sigma^2}\Big(1 + \frac{t}{n}\Big).$$
Taking $\lambda = n^{-\frac{1}{1+\alpha}}$, we can verify that $\xi_n \leq c\,\gamma_n$ and $\xi_n\,\lambda^{\frac{1}{2}} \leq c\,\gamma_n^2$. Then,
$$\zeta \leq \frac{8 c_\alpha\, \eta(t)\,\|\phi'\|_\infty}{\sigma^2}\big(\Delta^+ + \Delta^-\big) + \frac{6\Delta^-\|\phi'\|_\infty}{\sigma^2}\sqrt{\frac{t}{A\log\tilde{p}}} + \frac{5\Delta^+\|\phi'\|_\infty}{\sigma^2}\cdot\frac{t}{A\log\tilde{p}},$$
on an event denoted by $\theta(\Delta^-, \Delta^+)$.
For $t = t_0 = 2\log(2\sqrt{3}/\log 2) + A\log\tilde{p} + 2\log\tilde{p}$, we first consider the case $e^{-\tilde{p}} \leq \Delta^- \leq e^{\tilde{p}}$ and $e^{-\tilde{p}} \leq \Delta^+ \leq e^{\tilde{p}}$. Considering the $(2\tilde{p}+1)^2$ different discrete pairs $\Delta^-_j = \Delta^+_j := 2^j$, $j = -\tilde{p}, \dots, \tilde{p}$, we deduce that:
$$P\Big(\bigcap_{k,j}\theta\big(\Delta^-_j, \Delta^+_k\big)\Big) \geq 1 - 3\Big(\frac{2}{\log 2}\Big)^2\tilde{p}^2\exp\Big\{-2\log\frac{2\sqrt{3}}{\log 2} - A\log\tilde{p} - 2\log\tilde{p}\Big\} \geq 1 - \tilde{p}^{-A}.$$
When $\Delta^- \leq e^{-\tilde{p}}$ or $\Delta^+ \leq e^{-\tilde{p}}$, it is trivial to obtain the desired result. □
The proof of Lemma 4 is adapted from the proof of Proposition 1 in [28] for quantile regression. We now state our main result on the error bound.
Theorem 1.
Let the regularization parameters of $\hat{f}$ defined in (6) be $\lambda_1 = \xi\gamma_n$ and $\lambda_2 = \xi\gamma_n^2$, where $\xi = \max\{2c\,\eta(t_0), 4\}$ with $\eta(t) = \max\{1, \sqrt{t}, t/\sqrt{n}\}$, $t_0 = 2\log(2\sqrt{3}/\log 2) + A\log\tilde{p} + 2\log\tilde{p}$, and $\gamma_n \asymp \max\big(\sqrt{\tfrac{A\log\tilde{p}}{n}},\ (\tfrac{1}{n})^{\frac{1}{2(1+\alpha)}}\big)$. Under Assumptions 1 and 2, for any $\tilde{p} \geq p$ such that $\log p \leq n$ and $\log\tilde{p} \geq 2\log\log n$, and for any constant $A \geq 2$, with probability at least $1 - 2\tilde{p}^{-A}$:
$$\mathcal{R}(f^*) - \mathcal{R}(\hat{f}) \leq c_s\,\|\phi'\|_\infty\,\eta(t_0)\big(\eta(t_0)\big)^{\frac{1}{4}}\sqrt{\gamma_n} \leq c\,\big(\eta(t_0)\big)^{\frac{5}{4}}\max\Big\{\Big(\frac{A\log\tilde{p}}{n}\Big)^{\frac{1}{4}},\ \Big(\frac{1}{n}\Big)^{\frac{1}{4(1+\alpha)}}\Big\} \leq c\,\max\Big\{\sqrt{A\log\tilde{p}},\ \frac{A\log\tilde{p}}{\sqrt{n}}\Big\}^{\frac{5}{4}}\cdot\max\Big\{\Big(\frac{A\log\tilde{p}}{n}\Big)^{\frac{1}{4}},\ \Big(\frac{1}{n}\Big)^{\frac{1}{4+4\alpha}}\Big\} \leq c\,\max\Big\{(A\log\tilde{p})^{\frac{7}{8}} n^{-\frac{1}{4}},\ (A\log\tilde{p})^{\frac{1}{2}} n^{-\frac{1}{4+4\alpha}},\ (A\log\tilde{p})^{\frac{3}{2}} n^{-\frac{3}{4}},\ (A\log\tilde{p})^{\frac{5}{4}} n^{-\frac{3+2\alpha}{4+4\alpha}}\Big\}.$$
Proof. 
By the definition of $\hat{f}$ in (6), we know that:
$$\hat{\mathcal{R}}_\sigma(\hat{f}) - \lambda_1\sum_{j=1}^p\|\hat{f}_j\|_n - \lambda_2\sum_{j=1}^p\|\hat{f}_j\|_{K_j} \geq \hat{\mathcal{R}}_\sigma(f^*) - \lambda_1\sum_{j=1}^p\|f_j^*\|_n - \lambda_2\sum_{j=1}^p\|f_j^*\|_{K_j}.$$
This implies that:
$$\hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(\hat{f}) \leq \lambda_1\sum_{j=1}^p\big(\|f_j^*\|_n - \|\hat{f}_j\|_n\big) + \lambda_2\sum_{j=1}^p\big(\|f_j^*\|_{K_j} - \|\hat{f}_j\|_{K_j}\big).$$
Moreover,
$$\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f}) \leq \mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f}) + \lambda_1\sum_{j\notin S}\|\hat{f}_j\|_n + \lambda_2\sum_{j\notin S}\|\hat{f}_j\|_{K_j} \leq \big[\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f})\big] - \big[\hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(\hat{f})\big] + \lambda_1\sum_{j\in S}\big(\|f_j^*\|_n - \|\hat{f}_j\|_n\big) + \lambda_2\sum_{j\in S}\big(\|f_j^*\|_{K_j} - \|\hat{f}_j\|_{K_j}\big) \leq \big[\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f})\big] - \big[\hat{\mathcal{R}}_\sigma(f^*) - \hat{\mathcal{R}}_\sigma(\hat{f})\big] + \lambda_1\sum_{j\in S}\|\hat{f}_j - f_j^*\|_n + \lambda_2\sum_{j\in S}\|\hat{f}_j - f_j^*\|_{K_j}.$$
Taking $\lambda_1 = \xi\gamma_n$ and $\lambda_2 = \xi\gamma_n^2$ with $\gamma_n = \max\big\{\sqrt{\tfrac{A\log\tilde{p}}{n}},\ (\tfrac{1}{n})^{\frac{1}{2+2\alpha}}\big\}$, $\alpha \in (0, 1)$, we deduce that:
$$\gamma_n\sum_{j=1}^p\|\hat{f}_j - f_j^*\|_2 \leq 2p\,\gamma_n \leq 2\tilde{p} \leq e^{\tilde{p}}, \quad \forall n \geq 1,\ \tilde{p} \geq p,$$
and:
$$\gamma_n^2\sum_{j=1}^p\|\hat{f}_j - f_j^*\|_{K_j} \leq 2p\,\gamma_n^2 \leq 2\tilde{p} \leq e^{\tilde{p}}.$$
Therefore, we verify that $\hat{f} \in \mathcal{F}(\Delta^-, \Delta^+)$ with $\Delta^- \leq e^{\tilde{p}}$ and $\Delta^+ \leq e^{\tilde{p}}$. With the choices $\lambda_1 = \xi\gamma_n$ and $\lambda_2 = \xi\gamma_n^2$, it holds that:
$$\lambda_1\|\hat{f}_j - f_j^*\|_n + \lambda_2\|\hat{f}_j - f_j^*\|_{K_j} \leq 2(\lambda_1 + \lambda_2) \leq 4\xi\gamma_n, \quad \forall j \in S, \qquad (11)$$
due to the fact that $\|f_j\|_n \leq \|f_j\|_{K_j} \leq 1$ for any $f_j \in B_1(\mathcal{H}_{K_j})$.
According to Lemma 4 and (11), we obtain:
$$\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f}) \leq \frac{c\,\eta(t_0)\,\|\phi'\|_\infty}{\sigma^2}\Big(\gamma_n\sum_{j=1}^p\|\hat{f}_j - f_j^*\|_2 + \gamma_n^2\sum_{j=1}^p\|\hat{f}_j - f_j^*\|_K\Big) + \lambda_1\sum_{j\in S}\|\hat{f}_j - f_j^*\|_n + \lambda_2\sum_{j\in S}\|\hat{f}_j - f_j^*\|_K + e^{-\tilde{p}} \leq \frac{c\,\eta(t_0)\,\|\phi'\|_\infty}{\sigma^2}\,\xi\,\gamma_n + e^{-\tilde{p}},$$
with probability at least $1 - 2\tilde{p}^{-A}$.
Notice that $\log\tilde{p} \geq 2\log\log n$ implies that $e^{-\tilde{p}} \leq n^{-2} \leq \gamma_n$. Then:
$$\mathcal{R}_\sigma(f^*) - \mathcal{R}_\sigma(\hat{f}) \leq \frac{c\,\eta(t_0)\,\|\phi'\|_\infty}{\sigma^2}\,\xi\,\gamma_n.$$
Combining this with Theorem 9 in [17] and setting $\sigma = \big(\|\phi'\|_\infty\,\eta(t_0)\,\xi\,\gamma_n\big)^{\frac{1}{4}}$, we obtain the desired result. □
The proof of Theorem 1 is inspired by that of Theorem 1 in [28]; see [28] for more details. According to Theorem 1, we can conclude that the mode-based SpAM achieves a learning rate with polynomial decay $O\big(n^{-\frac{1}{4+4\alpha}}\big)$, since $\alpha \in (0, 1)$ and $A$, $\tilde{p}$ are positive constants.

4. Experimental Evaluation

To demonstrate the effectiveness of our method, in this section, we evaluated the model on synthetic datasets. The inputs in $\mathbb{R}^p$ with dimension $p = 5$ and $p = 10$ were generated randomly from the uniform distribution on the interval $[0, 1]$. We then computed the MSE of our estimator $\hat{f}$. Figure 1, Figure 2 and Figure 3 depict the MSE of $\hat{f}$ for the parameter pairs $(\lambda_1, \lambda_2) = (0, 1)$, $(1, 0)$, and $(1, 1)$, respectively, while the number of samples $n$ varies from 50/60 to 80/90. The models were built with YALMIP [43] in the MATLAB environment, and fmincon was called to solve the resulting optimization problem. The figures show that the MSE tends to decrease as the number of samples $n$ increases under all three parameter settings, which verifies that our method is effective for the regression of high-dimensional data.
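The paper does not list the exact component functions used to generate the synthetic responses, so the sketch below uses hypothetical smooth components (a sine and a quadratic) purely to illustrate the protocol: uniform inputs on $[0, 1]^p$, an additive signal plus noise, and an MSE computed against the true regression function.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n, p=5, sigma_noise=0.1):
    """Synthetic additive data; the component functions below are
    hypothetical stand-ins, not the ones used in the paper."""
    X = rng.uniform(0.0, 1.0, size=(n, p))
    f_true = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2  # two active components
    y = f_true + sigma_noise * rng.standard_normal(n)
    return X, y, f_true

X, y, f_true = make_data(n=80)
f_hat = np.full_like(f_true, y.mean())   # placeholder estimator; replace with the fitted model
print("MSE:", np.mean((f_hat - f_true) ** 2))
```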

5. Conclusions

In this work, we proposed a mode-based sparse additive model and established its generalization error bound. The theoretical results extend the previous mean-based analysis to the mode-based approach. We demonstrated that the mode-based SpAM can achieve a learning rate with polynomial decay $O\big(n^{-\frac{1}{4+4\alpha}}\big)$, which is comparable to the previous result of $O\big(n^{-\frac{1}{7}}\big)$ in [15]. In the future, it will be important to further explore the variable selection consistency of the proposed model.

Author Contributions

Conceptualization, H.D., B.S., J.C. and Z.P.; methodology, H.D. and Z.P.; validation, B.S. and Z.P.; formal analysis, H.D., B.S. and Z.P.; investigation, H.D. and J.C.; resources, Z.P.; data curation, H.D. and J.C.; writing—original draft preparation, H.D. and J.C.; writing—review and editing, H.D. and J.C.; visualization, H.D. and J.C.; supervision, B.S. and Z.P.; project administration, B.S. and Z.P.; funding acquisition, H.D., B.S. and Z.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Fundamental Research Funds for the Central Universities of China (Grant Nos. 2662019FW003 and 2662020LXQD001) and the National Natural Science Foundation of China (Grant No. 12001217).

Data Availability Statement

The synthetic data generation procedure for the simulation experiments is described in Section 4.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Xia, Y.; Hou, Y.; Lv, S. Learning Rates for Partially Linear Support Vector Machine in High Dimensions. Anal. Appl. 2021, 19, 167–182.
2. Ravikumar, P.; Liu, H.; Lafferty, J.; Wasserman, L. SpAM: Sparse Additive Models. J. R. Stat. Soc. Ser. B 2009, 71, 1009–1030.
3. Yin, J.; Chen, X.; Xing, E.P. Group Sparse Additive Models. In Proceedings of the International Conference on Machine Learning (ICML), Edinburgh, UK, 26 June–1 July 2012; pp. 1643–1650.
4. Lin, Y.; Zhang, H.H. Component Selection and Smoothing in Multivariate Nonparametric Regression. Ann. Stat. 2006, 34, 2272–2297.
5. Zhao, T.; Liu, H. Sparse Additive Machine. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), La Palma, Spain, 21–23 April 2012; Volume 22, pp. 1435–1443.
6. Chen, H.; Wang, X.; Deng, C.; Huang, H. Group Sparse Additive Machine. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 197–207.
7. Kandasamy, K.; Yu, Y. Additive Approximations in High Dimensional Nonparametric Regression via the SALSA. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; Volume 48, pp. 69–78.
8. Wang, Y.; Chen, H.; Zheng, F.; Xu, C.; Gong, T.; Chen, Y. Multi-task Additive Models for Robust Estimation and Automatic Structure Discovery. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Online, 6–12 December 2020; Volume 33, pp. 11744–11755.
9. Chen, H.; Liu, G.; Huang, H. Sparse Shrunk Additive Models. In Proceedings of the International Conference on Machine Learning (ICML), Vienna, Austria, 12–18 July 2020; Volume 119, pp. 6194–6204.
10. Chen, H.; Guo, C.; Xiong, H.; Wang, Y. Sparse Additive Machine with Ramp Loss. Anal. Appl. 2021, 19, 509–528.
11. Meier, L.; Geer, S.V.D.; Buhlmann, P. High-dimensional Additive Modeling. Ann. Stat. 2009, 37, 3779–3821.
12. Raskutti, G.; Wainwright, M.J.; Yu, B. Minimax-optimal Rates for Sparse Additive Models over Kernel Classes via Convex Programming. J. Mach. Learn. Res. 2012, 13, 389–427.
13. Kemp, G.C.R.; Silva, J.M.C.S. Regression towards the Mode. J. Econom. 2012, 170, 92–101.
14. Yao, W.; Li, L. A New Regression Model: Modal Linear Regression. Scand. J. Stat. 2014, 41, 656–671.
15. Wang, X.; Chen, H.; Cai, W.; Shen, D.; Huang, H. Regularized Modal Regression with Applications in Cognitive Impairment Prediction. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 1448–1458.
16. Chen, Y.C.; Genovese, C.R.; Tibshirani, R.J.; Wasserman, L. Nonparametric Modal Regression. Ann. Stat. 2014, 44, 489–514.
17. Feng, Y.; Fan, J.; Suykens, J. A Statistical Learning Approach to Modal Regression. J. Mach. Learn. Res. 2020, 21, 1–35.
18. Collomb, G.; Härdle, W.; Hassani, S. A Note on Prediction via Estimation of the Conditional Mode Function. J. Stat. Plan. Inference 1986, 15, 227–236.
19. Chen, H.; Wang, Y.; Zheng, F.; Deng, C.; Huang, H. Sparse Modal Additive Model. IEEE Trans. Neural Netw. Learn. Syst. 2020, 1–15.
20. Li, J.; Ray, S.; Lindsay, B.G. A Nonparametric Statistical Approach to Clustering via Mode Identification. J. Mach. Learn. Res. 2007, 8, 1687–1723.
21. Einbeck, J.; Tutz, G. Modeling beyond Regression Function: An Application of Multimodal Regression to Speed-flow Data. J. R. Stat. Soc. Ser. C Appl. Stat. 2006, 55, 461–475.
22. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288.
23. Feng, Y.; Huang, X.; Shi, L.; Yang, Y.; Suykens, J.A. Learning with the Maximum Correntropy Criterion Induced Losses for Regression. J. Mach. Learn. Res. 2015, 16, 993–1034.
24. Lv, F.; Fan, J. Optimal Learning with Gaussians and Correntropy Loss. Anal. Appl. 2019, 19, 107–124.
25. Yao, W.; Lindsay, B.G.; Li, R. Local Modal Regression. J. Nonparametr. Stat. 2012, 24, 647–663.
26. Chen, Y. Modal Regression Using Kernel Density Estimation: A Review. Wiley Interdiscip. Rev. Comput. Stat. 2018, 10, e1431.
27. Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science and Business Media: Berlin/Heidelberg, Germany, 2008.
28. Lv, S.; Lin, H.; Lian, H.; Huang, J. Oracle Inequalities for Sparse Additive Quantile Regression in Reproducing Kernel Hilbert Space. Ann. Stat. 2018, 46, 781–813.
29. Huang, J.; Horowitz, J.L.; Wei, F. Variable Selection in Nonparametric Additive Models. Ann. Stat. 2010, 38, 2282–2313.
30. Christmann, A.; Zhou, D.X. Learning Rates for the Risk of Kernel-Based Quantile Regression Estimators in Additive Models. Anal. Appl. 2016, 14, 449–477.
31. Yuan, M.; Zhou, D.X. Minimax Optimal Rates of Estimation in High Dimensional Additive Models. Ann. Stat. 2016, 44, 2564–2593.
32. Nikolova, M.; Ng, M.K. Analysis of Half-Quadratic Minimization Methods for Signal and Image Recovery. SIAM J. Sci. Comput. 2006, 27, 937–966.
33. Alizadeh, F.; Goldfarb, D. Second-Order Cone Programming. Math. Program. 2003, 95, 3–51.
34. Guo, C.; Song, B.; Wang, Y.; Chen, H.; Xiong, H. Robust Variable Selection and Estimation Based on Kernel Modal Regression. Entropy 2019, 21, 403.
35. Wang, Y.; Tang, Y.Y.; Li, L.; Chen, H. Modal Regression-Based Atomic Representation for Robust Face Recognition and Reconstruction. IEEE Trans. Cybern. 2020, 50, 4393–4405.
36. Suzuki, T.; Sugiyama, M. Fast Learning Rate of Multiple Kernel Learning: Trade-off between Sparsity and Smoothness. Ann. Stat. 2013, 41, 1381–1405.
37. Schölkopf, B.; Smola, A.J. Learning with Kernels; The MIT Press: Cambridge, MA, USA, 2002.
38. Aronszajn, N. Theory of Reproducing Kernels. Trans. Am. Math. Soc. 1950, 68, 337–404.
39. Bartlett, P.L.; Bousquet, O.; Mendelson, S. Localized Rademacher Complexities. In Proceedings of the Conference on Computational Learning Theory (COLT), Sydney, Australia, 8–10 July 2002; Volume 2373, pp. 44–58.
40. Mendelson, S. Geometric Parameters of Kernel Machines. In Proceedings of the Conference on Computational Learning Theory (COLT), Sydney, Australia, 8–10 July 2002; Volume 2375, pp. 29–43.
41. Koltchinskii, V.; Yuan, M. Sparsity in Multiple Kernel Learning. Ann. Stat. 2010, 38, 3660–3695.
42. Van De Geer, S. Empirical Processes in M-Estimation; Cambridge University Press: Cambridge, UK, 2000.
43. Löfberg, J. Automatic Robust Convex Programming. Optim. Methods Softw. 2012, 27, 115–129.
Figure 1. MSE of $\hat{f}$ when $(\lambda_1, \lambda_2) = (0, 1)$.
Figure 2. MSE of $\hat{f}$ when $(\lambda_1, \lambda_2) = (1, 0)$.
Figure 3. MSE of $\hat{f}$ when $(\lambda_1, \lambda_2) = (1, 1)$.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
