A Double Penalty Model for Ensemble Learning

Wang, Wenjia; Zhou, Yi-Hui

doi:10.3390/math10234532

Open AccessArticle

A Double Penalty Model for Ensemble Learning

by

Wenjia Wang

¹ and

Yi-Hui Zhou

^2,*

¹

Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 510000, China

²

Departments of Biological Sciences and Statistics, North Carolina State University, Raleigh, NC 27695, USA

^*

Author to whom correspondence should be addressed.

Mathematics 2022, 10(23), 4532; https://doi.org/10.3390/math10234532

Submission received: 28 October 2022 / Revised: 21 November 2022 / Accepted: 23 November 2022 / Published: 30 November 2022

(This article belongs to the Topic Machine and Deep Learning)

Download

Browse Figures

Versions Notes

Abstract

:

Modern statistical learning techniques often include learning ensembles, for which the combination of multiple separate prediction procedures (ensemble components) can improve prediction accuracy. Although ensemble approaches are widely used, work remains to improve our understanding of the theoretical underpinnings of aspects such as identifiability and relative convergence rates of the ensemble components. By considering ensemble learning for two learning ensemble components as a double penalty model, we provide a framework to better understand the relative convergence and identifiability of the two components. In addition, with appropriate conditions the framework provides convergence guarantees for a form of residual stacking when iterating between the two components as a cyclic coordinate ascent procedure. We conduct numerical experiments on three synthetic simulations and two real world datasets to illustrate the performance of our approach, and justify our theory.

Keywords:

double penalty model; interpretability; partially linear model; separability

MSC:

62G08; 62P10; 62J02

1. Introduction

Ensemble learning [1] uses multiple learning algorithms together to produce an improved prediction rule. Early work on ensemble learning [2] emphasized diversity of ensemble learning components [3], while much subsequent literature concerned collections of weak or strong learners [4]. Aggregation methods such as bagging [5] are technically ensemble methods, but the components are all of similar kind, and this paper is largely concerned with ensembles over different kinds of “strong learners”.

One practical approach to ensemble learning is to perform stacked ensemble approaches sequentially, using the residuals from one model as the input for the next [6]. This approach has been used in kaggle competitions [7], with the potential convenience that different analysts can analyze the data separately in succession. This approach has a connection to boosting, in that (pseudo)residuals focus attention on the poorly-fit observations [8].

For a researcher intending to fit multiple learning models, the practice of starting with an “interpretable” or low-dimensional model favors parsimony in explaining the relationship of predictors to outcome. Much of machine learning development has focused on prediction accuracy as a primary criterion [9]. However, recent commentary has emphasized interpretability of models, both to understand underlying relationships and to improve generalizability [10]. Part of the difficulty in moving forward with an emphasis on interpretability is the lack of guiding theory. Although explainable AI, including SHAP [11] and LIME [12], has gained lots of attention from the machine learning community, such deep learning based models require large size of dataset, and a thorough theoretical development is lacking. In addition, the concept of “interpretability” can be subjective.

In this paper, we study a double penalty model that can be viewed as a special instance of ensemble learning, although it is apparent that extensions beyond two ensemble components can be made by successive grouping of learners. We use the model to formalize theoretical questions concerning consistency and identifiability, which are partly determined by the concept of function separability. Practical considerations include the development of iterative fitting algorithms for the two components, which may be valuable even when separability cannot be established.

Although not our main focus, our model may be of assistance in understanding inherent tradeoffs in model interpretability, if the prediction rule is divided into interpretable and uninterpretable portions/functions, as defined by the investigator. We emphasize that, in our framework, high interpretability may come at the cost of prediction accuracy, but a modest loss in accuracy may be worth the gain in interpretability.

2. A Double Penalty Model and a Fitting Algorithm

Consider a function of interest h, which can be expressed by

\begin{matrix} h (x) = f^{*} (x) + g^{*} (x), \forall x \in Ω, \end{matrix}

(1)

where

f^{*} \in F

and

g^{*} \in G

are two unknown functions with known function classes

F

and

G

, respectively, and

Ω

is a compact and convex region. Following the motivation for this work, we suppose that the function class

F

consists of functions that are “easy to interpret”, for example, linear functions. We further suppose that

G

is judged to be uninterpretable, for example, the output from a random forest procedure. Suppose we observe data

(x_{i}, y_{i})

,

i = 1, \dots, n

with

y_{i} = h (x_{i}) + ϵ_{i}

, where

x_{i} \in Ω

and

ϵ_{i}

’s are i.i.d. random error with mean zero and finite variance. The goal of this work is to specify or estimate

f^{*}

and

g^{*}

.

Obviously, it is not necessary that

f^{*}

and

g^{*}

are unique, or can be statistically identified. For example, for any function

h_{1}

, the summation of

f^{*} + h_{1}

and

g^{*} - h_{1}

is also equal to h. Regardless of the identifiability of

f^{*}

and

g^{*}

, we propose the following double penalty model for fitting. In the rest of this work, we define the empirical inner product as

{〈 f, g 〉}_{n} = n^{- 1} \sum_{i = 1}^{n} f (x_{i}) g (x_{i}),

and the empirical norm

{∥ f ∥}_{n}^{2} = {〈 f, f 〉}_{n}

. The double penalty model is defined by

\begin{matrix} (\hat{f}, \hat{g}) = \underset{f \in F, g \in G}{argmin} {∥ y - f - g ∥}_{n}^{2} + L_{f} (f) + L_{g} (g), \end{matrix}

(2)

where

L_{f}

and

L_{g}

are convex penalty functions on f and g, and

\hat{f}

and

\hat{g}

are estimators of

f^{*}

and

g^{*}

, respectively. Under some circumstances, if

f^{*}

and

g^{*}

can be statistically identified, by using appropriate penalty functions

L_{f}

and

L_{g}

, we can obtain consistent estimators

\hat{f}

and

\hat{g}

of

f^{*}

and

g^{*}

, respectively. Even if

f^{*}

and

g^{*}

are nonidentifiable, by using the two penalty functions

L_{f}

and

L_{g}

, the relative contributions of the “easy to interpret” part and “hard to interpret” to a final prediction rule can be controlled.

Directly solving (2) may be difficult, because

L_{f} (f)

and

L_{g} (g)

may be partly confounded. Here we describe an iterative algorithm to solve the optimization problem in (2).

In Algorithm 1, for each iteration, two separated optimization problems (4) and (3) are solved with respect to f and g, respectively. The idea of Algorithm 1 is similar to the coordinate descent method, which minimizes the objective function with respect to each coordinate direction at a time. Such an idea has been widely used in practice, for example, backfitting algorithm in the generalized additive model, and the method of alternating projections [13]. The minimization in Equations (3) and (4) ensures that the function

\begin{matrix} ∥ y - f_{m} - g_{m} ∥_{n}^{2} + L_{f} (f_{m}) + L_{g} (g_{m}) \end{matrix}

decreases as m increases. One can stop Algorithm 1 after reaching a fixed number of iterations or no further improvement of function values can be made.

Algorithm 1 Iterative algorithm

Input: Data

(x_{i}, y_{i})

,

i = 1, \dots, n

, function classes

F

and

G

, and functions

L_{f}

and

L_{g}

. Set

m = 1

. Let

f_{0} = {argmin}_{f \in F} 1 / n \sum_{i = 1}^{n} {(y_{i} - f (x_{i}))}^{2} + L_{f} (f) .

While Stopping criteria are not satisfied do

Solve

\begin{matrix} g_{m} & = \underset{g \in G}{argmin} {∥ y - f_{m - 1} - g ∥}_{n}^{2} + L_{g} (g), and \end{matrix}

(3)

\begin{matrix} f_{m} & = \underset{f \in F}{argmin} {∥ y - f - g_{m} ∥}_{n}^{2} + L_{f} (f) . \end{matrix}

(4)

Set

m = m + 1

.

return

f_{m}

and

g_{m}

.

Remark 1.

The model (1) is not the same as additive models [14]. In the additive model, the functions

f^{*}

and

g^{*}

have different covariates (thus

f^{*}

and

g^{*}

are identifiable), while in (1),

f^{*}

and

g^{*}

share the same covariates. Therefore, additional efforts need to be made on addressing the identifiability issue. A more similar model is as in [15], where

f^{*}

and

g^{*}

are two realizations of Gaussian processes, with one capturing the global information and the other capturing the local fluctuations.

The convergence of Algorithm 1 is ensured if

L_{f}

or

L_{g}

is strongly convex. Let

∥ \cdot ∥

be a (semi-)norm of a Hilbert space. A function L is said to be strongly convex with respect to (semi-)norm

∥ \cdot ∥

if there exists a parameter

γ > 0

such that for any

x, y

in the domain and

t \in [0, 1]

,

\begin{matrix} L (t x + (1 - t) y) \leq t L (x) + (1 - t) L (y) - \frac{1}{2} γ t (1 - t) {∥ x - y ∥}^{2} . \end{matrix}

As a simple example,

{∥ \cdot ∥}^{2}

is strongly convex for any norm

∥ \cdot ∥

. If

L_{f}

or

L_{g}

is strongly convex, Algorithm 1 converges, as stated in the following theorem.

Theorem 1.

Suppose

L_{f}

or

L_{g}

is strongly convex with respect to the empirical norm with parameter

γ > 0

. We have

\begin{matrix} ∥ f_{m} - \hat{f} ∥_{n} + ∥ g_{m} - \hat{g} ∥_{n} \leq {(\frac{2}{2 + γ})}^{m - 1} (∥ f_{1} - \hat{f} ∥_{n} + ∥ g_{1} - \hat{g} ∥_{n}), \end{matrix}

and

(L_{f} (f_{m}), L_{g} (g_{m})) \to (L_{f} (\hat{f}), L_{g} (\hat{g}))

, as m goes to infinity (The proof can be found in Appendix A).

From Theorem 1, it can be seen that Algorithm 1 can achieve a linear convergence if

L_{f}

or

L_{g}

is strongly convex, regardless of the identifiability of

f^{*}

and

g^{*}

. We only require one penalty function to be strongly convex. The convergence rate depends on the parameter

γ

, which measures the convexity of a function. If the penalty function is more convex, i.e.,

γ

is larger, then the convergence of Algorithm 1 is faster. The strong convexity of the penalty function

L_{f}

or

L_{g}

can be easily fulfilled, because the square of norm of any Hilbert space is strongly convex. For example, the penalty functions in the ridge regression and the neural networks with fixed number of neuron are strongly convex with respect to the empirical norm.

An Example

We demonstrate Theorem 1 and the double penalty model by considering regression with an

L_{1}

penalty on the coefficients for f and an

L_{2}

penalty on the coefficients for g. Thus, the form is the familiar elastic net [16], which we here re-characterize as a mix of LASSO regression (more interpretable because coefficients are sparse) and less-interpretable ridge regression. Although extremely fast elastic net algorithms have been developed [17], the conditions of Theorem 1 hold, and we illustrate using the diabetes dataset [18] with 442 observations and 64 predictors, including interactions. The glmnet package (v 4.0-2) [19] was used in successive LASSO and ridge steps, as in Algorithm 1. Figure 1 left panel shows the convergence of

∥ f_{m} - \hat{f} ∥_{n} + {∥ g_{m} - \hat{g} ∥}_{n}

(root mean squared error) for

λ_{f} = 0.032

,

λ_{g} = 1

(the results from a minimum grid search, although any pair will do). The right panel shows various root mean squared minima (RMSE and root mean squared variation) over

{λ_{f}, λ_{g}}

using 10-fold cross-validation. The results suggest choices of

λ_{f}, λ_{g}

that can nearly achieve the overall minimum RMSE, while placing the bulk of variation/explanatory variation on the more interpretable f.

3. Separable Function Classes

Suppose

f^{*}

and

g^{*}

can be statistically specified. It can be seen that

F \cap G \subset {0}

, for otherwise

f^{*} + w \in F

and

g^{*} - w \in G

would be another decomposition of h for any

w \in F \cap G

. For example, we can consider

A

be a function class that h lies in,

F

be a subset of

A

, and

G = F^{⊥}

be

F

’s orthogonal complement in

A

. Nevertheless, we consider a more general case in the sense defined here. We define

F

and

G

as

L_{2}

-separable if there exists

θ_{1} \in [0, 1)

such that for any functions

f \in F

and

g \in G

,

\begin{matrix} {| 〈 f, g 〉}_{2} | \leq θ_{1} {∥ f ∥}_{L_{2}} {∥ g ∥}_{L_{2}}, \end{matrix}

(5)

where

{∥ f ∥}_{L_{2}}

denotes the

L_{2}

norm of a function

f \in L_{2} (Ω)

, and

{〈 f, g 〉}_{2}

denotes the inner product of functions

f, g \in L_{2} (Ω)

. Roughly speaking, the minimal angle of

F

and

G

is strictly bounded away from zero. If two function classes are

L_{2}

-separable, then

f^{*}

and

g^{*}

are unique, as stated in the following lemma.

Lemma 1.

Suppose (5) is true for any functions

f \in F

and

g \in G

, then

f^{*}

and

g^{*}

are unique, up to a difference on a measure zero set (The proof can be found in Appendix B).

3.1. A Separable Additive Model with the Same Covariates

Suppose two function classes

F \subset H^{ν_{1}} (Ω)

, and

G \subset H^{ν_{2}} (Ω)

, where

H^{ν} (Ω)

is the Sobolev space with known smoothness

ν

. We assume that

F

and

G

are bounded, i.e., there exist constants

R_{1}

and

R_{2}

such that

{∥ f ∥}_{H^{ν_{1}} (Ω)} \leq R_{1}

and

{∥ g ∥}_{H^{ν_{2}} (Ω)} \leq R_{2}

, for all

f \in F

and

g \in G

, respectively. Typically, the “easy to interpret” part

F

has a higher smoothness, thus we assume

ν_{1} \geq ν_{2}

. In order to estimate

f^{*}

and

g^{*}

, we employ the idea from kernel ridge regression.

Let

Ψ_{ν}

be the (isotropic) Matérn family [20], defined by

\begin{matrix} Ψ_{ν} (s, t) = \frac{(2 \sqrt{ν^{'}} {ϕ ∥ s - t ∥)}^{ν^{'}}}{Γ (ν^{'}) 2^{ν^{'} - 1}} K_{ν^{'}} (2 \sqrt{ν^{'}} ϕ ∥ s - t ∥), \end{matrix}

(6)

where

K_{ν^{'}}

is the modified Bessel function of the second kind,

ν^{'} = ν - p / 2

, and

ϕ

is the range parameter. We use

N_{Ψ_{ν}} (Ω)

to denote the reproducing kernel Hilbert space generated by

Ψ_{ν}

, and

{∥ \cdot ∥}_{N_{Ψ_{ν}} (Ω)}

to denote the norm of

N_{Ψ_{ν}} (Ω)

. By Corollary 10.48 in [21],

H^{ν} (Ω)

coincides with

N_{Ψ_{ν}} (Ω)

. We use the solution to

\begin{matrix} min_{f \in F, g \in G} {∥ y - f - g ∥}_{n}^{2} + λ_{1} {∥ f ∥}_{N_{Ψ_{ν_{1}}} (Ω)}^{2} + λ_{2} {∥ g ∥}_{N_{Ψ_{ν_{2}}} (Ω)}^{2} \end{matrix}

(7)

to estimate

f^{*}

and

g^{*}

, and the corresponding estimators are denoted by

\hat{f}

and

\hat{g}

, respectively. Note that if

G

only contains zero function, then (7) reduces to kernel ridge regression. We further require that (5) holds such that

f^{*}

and

g^{*}

are identifiable.

First, we consider the consistency of

\hat{f}

and

\hat{g}

, which is provided in the following theorem.

Theorem 2.

Suppose

x_{i}

’s are uniformly distributed on Ω, and the noise

ϵ_{i}

’s are i.i.d. sub-Gaussian, i.e., satisfying

K^{2} E exp (| ϵ_{i} |^{2} / K^{2}) - 1 \leq σ_{0}^{2}

for some constants K and

σ_{0}^{2}

, and all

i = 1, \dots, n

. If

max (λ_{1}, λ_{2}) = O_{P} (n^{- 2 ν_{2} / (2 ν_{2} + p)})

, we have (The proof can be found in Appendix C)

\begin{matrix} ∥ \hat{g} - g^{*} ∥_{L_{2}}^{2} + {∥ \hat{f} - f^{*} ∥}_{L_{2}}^{2} = O_{P} (n^{- \frac{2 ν_{2}}{2 ν_{2} + p}}) . \end{matrix}

Remark 2.

Note that in Theorem 2, we only require upper bounds on

max (λ_{1}, λ_{2})

, which is because

F

and

G

are bounded. In particular, we can set

max (λ_{1}, λ_{2}) = 0

. However, if

λ_{1}

and

λ_{2}

are large, it is more likely that

{\tilde{f}}_{m} \in F

and

{\tilde{g}}_{m} \in G

, which allows us to solve (7) efficiently.

Remark 3.

Because

f^{*}

and

g^{*}

share the same covariates, the convergence speed of

\hat{f} - f^{*}

is slower than the optimal rate

O_{P} (n^{- \frac{2 ν_{1}}{2 ν_{1} + p}})

. We cannot confirm whether the rate in Theorem 2 is optimal for

\hat{f}

.

In order to solve the optimization problem in (7), we apply Algorithm 1. In each iteration of Algorithm 1,

g_{m}

and

f_{m}

are solved by

\begin{matrix} g_{m} = & \underset{g \in G}{argmin} ∥ y - f_{m - 1} {- g ∥}_{n}^{2} + λ_{2} {∥ g ∥}_{N_{Ψ_{ν_{2}}} (Ω)}^{2}, \\ f_{m} = & \underset{f \in F}{argmin} ∥ y - f - g_{m} ∥_{n}^{2} + λ_{1} {∥ f ∥}_{N_{Ψ_{ν_{1}}} (Ω)}^{2}, \end{matrix}

which have explicit forms as

\begin{matrix} g_{m} = {\tilde{g}}_{m} = & r_{1} {(\cdot)}^{T} {(K_{1} + n λ_{1})}^{- 1} (Y - f_{m - 1} (X)), \end{matrix}

(8)

\begin{matrix} f_{m} = {\tilde{f}}_{m} = & r_{2} {(\cdot)}^{T} {(K_{2} + n λ_{2})}^{- 1} (Y - g_{m} (X)), \end{matrix}

(9)

if

{\tilde{g}}_{m} \in G

and

{\tilde{f}}_{m} \in F

, where

\begin{matrix} r_{l} (x) = & {(Ψ_{ν_{l}} (x, x_{1}), \dots, Ψ_{ν_{l}} (x, x_{n}))}^{T}, \\ f_{m - 1} (X) = & {(f_{m - 1} (x_{1}), \dots, f_{m - 1} (x_{n}))}^{T}, \\ g_{m} (X) = & {(g_{m} (x_{1}), \dots, g_{m} (x_{n}))}^{T}, \end{matrix}

K_{l} = {(Ψ_{ν_{l}} (x_{j}, x_{k}))}_{j k}

for

l = 1, 2

, and

Y = {(y_{1}, \dots, y_{n})}^{T}

. The explicit forms allow us to solve (17) efficiently.

By Theorem 1, the convergence of Algorithm 1 can be guaranteed. However, if the two function classes separate well, we can achieve a faster convergence, as shown in the following theorem.

Theorem 3.

Suppose two function classes

F \subset H^{ν_{1}} (Ω)

and

G \subset H^{ν_{2}} (Ω)

are

L_{2}

-separable satisfying (5), and

x_{i}

’s are uniformly distributed on Ω. For

n \geq N

, with probability at least

1 - C_{1} exp (- n^{\frac{2 ν_{2} - p}{2 ν_{2} + p}})

, we have either

\begin{matrix} ∥ f_{m} - \hat{f} ∥_{n} + {∥ g_{m + 1} - \hat{g} ∥}_{n} \leq C_{2} n^{- \frac{2 ν_{2}}{2 ν_{2} + p}}, \end{matrix}

or

\begin{matrix} ∥ f_{m} - \hat{f} ∥_{n} + ∥ g_{m} - \hat{g} ∥_{n} \leq {(θ_{1} + C_{3} n^{- \frac{2 ν_{2} - p}{2 (2 ν_{2} + p)}})}^{2 m - 6} (∥ f_{1} - \hat{f} ∥_{n} + ∥ g_{1} - \hat{g} ∥_{n}), \end{matrix}

where N and

C_{i}

’s are constants only depending on

F

,

G

and Ω (The proof can be found in Appendix D).

In the proof of Theorem 3, the key step is to show that with high probability, (5) implies that the separability holds with respect to the empirical norm, i.e.,

\begin{matrix} {| 〈 f, g 〉}_{n} | \leq θ_{2} {∥ f ∥}_{n} {∥ g ∥}_{n}, \end{matrix}

(10)

with some

θ_{2}

close to

θ_{1}

. It can be seen that in Theorem 3, if

F

and

G

are separable with respect to the

L_{2}

norm, Algorithm 1 achieves a linear convergence. The parameter

θ_{1}

determines the convergence speed. If

θ_{1}

is small, then the convergence of Algorithm 1 is fast, and a few iterations are enough. By Theorems 2 and 3, it can be seen that the approximation error (the difference between the optimal solution and numerical solution) can be much smaller than the statistical error, which is typical. In particular, we can conclude that the solution obtained by Algorithm 1 satisfies

\begin{matrix} ∥ f_{m} - f^{*} ∥_{L_{2} (Ω)} + {∥ g_{m + 1} - g^{*} ∥}_{L_{2} (Ω)} \leq C_{4} n^{- \frac{ν_{2}}{2 ν_{2} + p}}, \end{matrix}

where we apply Lemma 5.16 of [22], which ensures the asymptotic equivalence of

L_{2}

norm and the empirical norm of

∥ f_{m} - f^{*} ∥_{n}

.

3.2. Finite Dimensional Function Classes

As a special case of the model in Section 3.1, suppose two function classes

F

and

G

have finite dimensions. To be specific, suppose

\begin{matrix} F = \{f = \sum_{k = 1}^{d_{1}} α_{k} ϕ_{k} : α_{k} \in R, {∥ f ∥}_{L_{2}} \leq R_{f}\}, G = \{g = \sum_{j = 1}^{d_{2}} β_{j} φ_{j} : β_{k} \in R, {∥ g ∥}_{L_{2}} \leq R_{g}\}, \end{matrix}

where

ϕ_{k}, φ_{j}

’s are known functions defined on a compact set

Ω

, and

R_{f}

and

R_{g}

are known constants. Furthermore, assume

F

and

G

are

L_{2}

-separable. Since the dimension of each function class is finite, we can use the least squares method to estimate

f^{*}

and

g^{*}

, i.e.,

\begin{matrix} (\hat{f}, \hat{g}) = \underset{f \in F, g \in G}{argmin} {∥ y - f - g ∥}_{n}^{2} . \end{matrix}

(11)

By applying standard arguments in the theory of Vapnik-Chervonenkis subgraph class [22], the consistency of

\hat{f}

and

\hat{g}

holds. We do not present detailed discussion for the conciseness of this paper.

Although the exact solution to the optimization problem in (11) is available, we can still use Algorithm 1 to solve it. By comparing the exact solution with numeric solution obtained by Algorithm 1, we can study the convergence rate of Algorithm 1 via numerical simulations. The detailed numerical studies of the convergence rate is provided in Section 5.1.

4. Non-Separable Function Classes

In Section 3, we consider the case that

F

and

G

are

L_{2}

-separable, which implies

f^{*}

and

g^{*}

are statistically identifiable. However, in many practical cases,

F

and

G

are not

L_{2}

-separable. Such examples include

F

as a linear function class and

G

as the function space generated by a neural network. If

F

and

G

are not

L_{2}

-separable, then

f^{*}

and

g^{*}

are not statistically identifiable. To see this, note that there exist two sequences of functions

{f_{j}^{'}} \subset F

and

{g_{j}^{'}} \subset G

such that

∥ f_{j}^{'} - g_{j}^{'} ∥_{L_{2}} \to 0

. This implies that

(f^{*}, g^{*})

and

(f^{*} - f_{j}^{'}, g^{*} + g_{j}^{'})

are not statistically identifiable, which implies that we cannot consistently estimate

f^{*}

and

g^{*}

.

Although

F

and

G

can be not

L_{2}

-separable, we can still use (2) to specify

f^{*}

and

g^{*}

. We propose choosing

F

with simple structure and to be “easy to interpret”, and choosing

G

to be flexible to improve the prediction accuracy. The tradeoff between interpretation and prediction accuracy can be adjusted by applying different penalty functions

L_{f}

and

L_{g}

. If

L_{f}

is large, then (2) forces

f^{*}

to be small and

g^{*}

to be large, which indicates that the model is more flexible, but is less interpretable. On the other hand, if

L_{g}

is large, then the model is more interpretable, but may reduce the power of prediction.

Another way is to make

F

and

G

separable. Specifically, suppose

F, G \subset H^{ν} (Ω)

, where

ν > p / 2

and p is the dimension. Then we construct a new function class

G^{'}

such that

G^{'} = G \cap F^{⊥}

, where

F^{⊥}

is the perpendicular component of

F

in

H^{ν} (Ω)

. Although in general, it is not easy to find

F^{⊥}

(

F^{⊥}

may also be empty), in some cases, it is possible to build

F^{⊥}

, for example,

F

is of finite dimension. In the next subsection, we provide a specific example of building the perpendicular component of

F

and study the convergence property of its corresponding double penalty model.

A Generalization of Partially Linear Models

In this subsection, we consider a generalization of partially linear models. The responses in a typical partially linear model can be expressed as

\begin{matrix} y = x^{T} β + g (t) + ϵ . \end{matrix}

(12)

In the partially linear models (12),

β \in R^{p}

is a vector of regression coefficients associated with x, g is an unknown function of t with some known smoothness, which is usually a one dimensional scalar, and

ϵ

is a random noise. The partially linear model (12) can be estimated by the partial spline estimator [23,24], partial residual estimator [25,26], or SCAD-penalized regression [27].

In this work, we consider a more general model. Suppose we observe data

y_{i}

on

x_{i} \in Ω = {[0, 1]}^{p}

for

i = 1, \dots, n

, where

\begin{matrix} y_{i} = x_{i}^{T} β^{*} + g^{*} (x_{i}) + ϵ_{i}, \end{matrix}

(13)

and

ϵ_{i}

’s are i.i.d. random errors with mean zero and finite variance. We assume that the function

g^{*} \in H^{ν} (Ω)

, where

H^{ν} (Ω)

is the Sobolev space with known smoothness

ν

. This is a standard assumption in nonparametric regression, see [22,28] for example. It is natural to define the two function classes by

\begin{matrix} F = \{f (x) = x^{T} β : β \in R^{p}, {∥ β ∥}_{2} \leq R_{1}, x \in Ω\} \end{matrix}

and

G = {h \in H^{ν} (Ω) : ∥ h ∥_{H^{ν} (Ω)} \leq R_{2}}

, where

{∥ \cdot ∥}_{2}

denotes the Euclidean distance, and

R_{1}, R_{2}

are known constants. In practice, we can choose a sufficient large

R_{1}, R_{2}

such that

F

and

G

are large enough. Note that in (13), the linear part and nonlinear part share the same covariates, which is different with (12). It can be seen that

β^{*}

and

g^{*}

are non-identifiable because

F \subset G

. Furthermore,

F

is more interpretable compared with

G

because it is linear.

In order to uniquely identify

β^{*}

and

g^{*}

, we need to restrict function class

G

such that

F

and

G

are separable. This can be done by applying a newly developed approach, employing the projected kernel [29]. Let

e_{k}

,

k = 1, \dots, p

be an orthonormal basis of

F

. Then

F

can be defined as a linear span of the basis

{e_{1}, \dots, e_{p}}

, and the projection of a function

w \in G

on

F

is given by

\begin{matrix} P_{F} w = \sum_{k = 1}^{p} {〈 w, e_{k} 〉}_{2} e_{k} . \end{matrix}

(14)

The perpendicular component is

\begin{matrix} P_{F}^{⊥} w = w - P_{F} w . \end{matrix}

(15)

By (14) and (15), we can split

G

into two perpendicular classes as

F

and

F^{⊥}

, where

F^{⊥} = {w_{1} = P_{F}^{⊥} w, w \in G}

. Let

h = x^{T} β^{*} + g^{*} (x)

, where

g^{*} \in F^{⊥}

. Since

F

and

F^{⊥}

are perpendicular, they are

L_{2}

-separable. By Lemma 1,

β^{*}

and

g^{*}

are unique. However, in practice it is usually difficult to find a function

g^{*} \in F^{⊥}

directly. We propose using projected kernel ridge regression, which depends on the reproducing kernel Hilbert space generated by the projected kernel.

The reproducing kernel Hilbert space generated by the projected kernel can be defined in the following way. Define the linear operators

P_{F}^{(1)}

and

P_{F}^{(2)}

:

L_{2} (Ω \times Ω) \to L_{2} (Ω \times Ω)

as

\begin{matrix} P_{F}^{(1)} (u) (x, y) = & \sum_{k = 1}^{p} e_{k} (x) \int_{Ω} u (s, y) e_{k} (s) d s, \\ P_{F}^{(2)} (u) (x, y) = & \sum_{k = 1}^{p} e_{k} (y) \int_{Ω} u (x, t) e_{k} (t) d t, \end{matrix}

for

u \in L_{2} (Ω \times Ω)

. The projected kernel of

Ψ

can be defined by

\begin{matrix} Ψ_{F} = Ψ - P_{F}^{(1)} Ψ - P_{F}^{(2)} Ψ + P_{F}^{(1)} P_{F}^{(2)} Ψ . \end{matrix}

(16)

The function class

F^{⊥}

then is equivalent to the reproducing kernel Hilbert space generated by

Ψ_{F}

, denoted by

N_{Ψ_{F}} (Ω)

, and the norm is denoted by

{∥ \cdot ∥}_{N_{Ψ_{F}} (Ω)}

. For detailed discussion and properties of

Ψ_{F}

and

N_{Ψ_{F}} (Ω)

, we refer to [29].

By using the projected kernel of

Ψ

, the double penalty model is

\begin{matrix} (\hat{β}, \hat{g}) = \underset{β \in R^{p}, g \in N_{Ψ_{F}} (Ω)}{argmin} ∥ y - x^{T} {β - g ∥}_{n}^{2} + λ {∥ g ∥}_{N_{Ψ_{F}} (Ω)}^{2}, \end{matrix}

(17)

where

(\hat{β}, \hat{g})

are estimators of

(β^{*}, g^{*})

. In practice, we can use generalized cross validation (GCV) to choose the tuning parameter

λ

[29,30]. If the tuning parameter

λ

is chosen properly, we can show that

(\hat{β}, \hat{g})

are consistent, as stated in the following theorem. In the rest of this paper, we use the following notation. For two positive sequences

a_{n}

and

b_{n}

, we write

a_{n} ≍ b_{n}

if, for some constants

C, C^{'} > 0

,

C \leq a_{n} / b_{n} \leq C^{'}

.

Theorem 4.

Suppose

x_{i}

’s are uniformly distributed on Ω, and the noise

ϵ_{i}

’s are i.i.d. sub-Gaussian, i.e., satisfying

K^{2} E exp (| ϵ_{i} |^{2} / K^{2}) - 1 \leq σ_{0}^{2}

for some constants K and

σ_{0}^{2}

, and all

i = 1, \dots, n

. If

λ ≍ n^{- 2 ν / (2 ν + p)}

, we have

\begin{matrix} ∥ \hat{g} - g^{*} ∥_{L_{2}}^{2} = O_{P} (n^{- \frac{2 ν}{2 ν + p}}), {∥ \hat{β} - β^{*} ∥}_{2}^{2} = O_{P} (n^{- \frac{2 ν}{2 ν + p}}) . \end{matrix}

Theorem 4 is a direct result of Theorem 3, thus the proof is omitted. Theorem 4 shows that the double penalty model (17) can provide consistent estimators of

β^{*}

and

g^{*}

, and the convergence rate of

∥ \hat{g} - g^{*} ∥_{L_{2}}

is known to be optimal [31]. The convergence rate of

∥ \hat{β} - β^{*} ∥_{2}

in Theorem 4 is slower than the convergence rate

n^{- 1 / 2}

in the linear model. We conjecture that this is because the convergence rate is influenced by the estimation of

g^{*}

, which may introduce extra error because functions in

N_{Ψ_{F}} (Ω)

and

F

have the same input space.

In order to solve the optimization problem in (17), we apply Algorithm 1. By Theorem 1, the convergence of Algorithm 1 can be guaranteed. In each iteration of Algorithm 1,

g_{m}

and

f_{m}

are solved by

\begin{matrix} g_{m} = & \underset{g \in N_{Ψ_{F}} (Ω)}{argmin} ∥ y - x^{T} β_{m - 1} {- g ∥}_{n}^{2} + λ {∥ g ∥}_{N_{Ψ_{F}} (Ω)}^{2}, \\ β_{m} = & \underset{β \in R^{p}}{argmin} {∥ y - x^{T} β - g_{m} ∥}_{n}^{2}, \end{matrix}

which have explicit forms as

\begin{matrix} g_{m} (x) = & r {(x)}^{T} {(K + n λ)}^{- 1} (Y - X^{T} β_{m - 1}), \\ β_{m} = & {(X^{T} X)}^{- 1} X^{T} (Y - g_{m} (X)), \end{matrix}

where

\begin{matrix} r (x) = & {(Ψ_{F} (x, x_{1}), \dots, Ψ_{F} (x, x_{n}))}^{T}, \\ g_{m} (X) = & {(g_{m} (x_{1}), \dots, g_{m} (x_{n}))}^{T}, \end{matrix}

and

K = {(Ψ_{F} (x_{j}, x_{k}))}_{j k}

. The explicit forms allow us to solve (17) efficiently. Furthermore, because

F

and

N_{Ψ_{F}} (Ω)

are orthogonal, Theorem 3 implies that a few iterations of Algorithm 1 are sufficient to obtain a good numeric solution.

5. Numerical Examples

5.1. Convergence Rate of Algorithm 1

In this subsection, we report numerical studies on the convergence rate of Algorithm 1, and verify that the convergence rate in Theorem 3 is sharp. We consider two finite function classes such that the analytic solution of (11) is available, as stated in Section 3.2. By comparing the numeric solution and the analytic solution, we can verify the convergence rate is sharp.

We consider two function classes

F = {f | f (x) = α_{1} x, α_{1} \in [0, 10]}

and

G = {g | g (x) = α_{2} sin (θ x), α_{2} \in [0, 10]}

, where

x \in [0, 1]

,

θ

is a known parameter which controls the degree of separation of two function classes, i.e., the parameter

θ_{1}

in Lemma 1. It is easy to verify that for

f \in F

and

g \in G

,

\begin{matrix} | \int_{0}^{1} f (x) g (x) d x | \leq {ψ (θ) ∥ f ∥}_{L_{2} ([0, 1])} {∥ g ∥}_{L_{2} ([0, 1])}, \end{matrix}

where

ψ (θ) = \frac{2 \sqrt{3 θ} | sin (θ) - θ cos (θ) |}{θ^{2} \sqrt{2 θ - sin (2 θ)}} .

Suppose the underlying function

h (x) = β_{1}^{*} x + β_{2}^{*} sin (θ x)

with

(β_{1}^{*}, β_{2}^{*}) = (1, 3)

. Let

({\hat{β}}_{1}, {\hat{β}}_{2})

be the solution to (11), and

(β_{1, m}, β_{2, m})

be the values obtained at mth iteration of Algorithm 1. By Theorem 3,

\begin{matrix} ∥ (β_{1, m} - {\hat{β}}_{1}) {x ∥}_{n} + {∥ (β_{2, m} - {\hat{β}}_{2}) sin (θ x) ∥}_{n} \leq C {(θ_{1} + C_{5} n^{- \frac{2 ν_{2} - p}{2 (2 ν_{2} + p)}})}^{2 m - 6}, \end{matrix}

(18)

where

C = ∥ (β_{1, 1} - {\hat{β}}_{1}) {x ∥}_{2} + {∥ (β_{2, 1} - {\hat{β}}_{2}) sin (θ x) ∥}_{2}

. By taking logarithms on both sides of (18), we have

\begin{matrix} log (∥ (β_{1, m} - {\hat{β}}_{1}) x ∥_{n} + ∥ (β_{2, m} - {\hat{β}}_{2}) sin (θ x) ∥_{n}) \\ \leq & log (C {(θ_{1} + C_{5} n^{- \frac{2 ν_{2} - p}{2 (2 ν_{2} + p)}})}^{2 m - 6}) \\ \approx & 2 log (ψ (θ)) m + log (C ψ {(θ)}^{- 6}) . \end{matrix}

(19)

If the convergence rate in Theorem 3 is sharp, then

log (∥ (β_{1, m} - {\hat{β}}_{1}) x ∥_{n} + ∥ (β_{2, m} - {\hat{β}}_{2})

{sin (θ x) ∥}_{n})

is an approximate linear function with respect to m and the slope is close to

2 log (ψ (θ))

.

In our simulation studies, we choose

θ = 2, 3, 3.5, 4

. We choose the noise

ϵ \sim N (0, 0.1)

, where

N (0, 0.1)

is a normal distribution with mean zero and variance

0.1

. The algorithm stops if the left hand side of (18) is less than

10^{- 6}

. We choose 50 uniformly distributed points as training points. We run 100 simulations and take the average of the regression coefficient and the number of iterations needed for each

θ

. The results are shown in Table 1.

Theorem 3 shows that the approximation in (19) is more accurate when the sample size is larger. We conduct numerical studies using sample sizes

20, 50, 100, 150, 200

. We choose

θ = 3

. The results are presented in Table 2.

From Table 1 and Table 2, we find that the absolute difference increases as

θ

increases and sample size decreases. When

ψ (θ)

decreases, the iteration number decreases, which implies the convergence of Algorithm 1 becomes faster. These results corroborate our theory. The regression coefficients are close to our theoretical assertion

2 log (ψ (θ))

, which indicates that the convergence rate in Theorem 3 is sharp.

5.2. Prediction of Double Penalty Model

To study the prediction performance of double penalty model, we consider two examples, with

L_{2}

-separable function classes and non-

L_{2}

-separable function classes, respectively. In these examples, we would like to stress that we only show the double penalty model can provide relatively accurate estimator with a large part attributing to the “interpretable” part. Since accuracy is not the only goal of our estimator, models that may have extremely high prediction accuracy but may hard to interpret is not preferred in our case. Furthermore, the definition of “interpretable” can be subjective and depends on the user. Therefore, we choose our subjective “interpretable” model in these examples and only show the prediction performance of our model.

Example 1.

Consider function [32]

\begin{matrix} h (x) = \frac{sin (10 π x)}{2 x} + {(x - 1)}^{4}, x \in [0.5, 2.5] . \end{matrix}

Let

F = {f (x) = β_{1} x + β_{2}, β_{1}^{2} + β_{2}^{2} \leq 100}

, and

G

be the reproducing kernel Hilbert space generated by the projected kernel. The projected kernel is calculated as in (16), where Ψ is as in (6) with

ν = 3.5

and

ϕ = 1

. We use 20 uniformly distributed points from

[0.5, 2.5]

as training points, and let

ϵ \sim N (0, 0.1)

. For each simulation, we calculate the mean squared prediction error, which is approximated by calculating the mean squared prediction error on 201 evenly spaced points. We run 100 simulations, and the average mean squared prediction error is 0.016. In this example, the iteration number needed in Algorithm 1 is less than three because the two function classes are orthogonal, which corroborates the results in Theorem 3.

Figure 2 shows that the linear part can capture the trend. However, it can be seen from the figure that the difference between the true function and the linear part is still large. Therefore, a nonlinear part is needed to make good predictions. It also indicates that the function in this example is not easy to interpret.

Example 2.

Consider a modified function of [33]

\begin{matrix} h (x) = \frac{2}{\sqrt{\sum_{i = 1}^{5} {(x_{i} - 0.5)}^{2}} + 1} + \frac{0.5}{\sqrt{\sum_{i = 1}^{5} {(x_{i} - 0.7)}^{2}} + 1}, \end{matrix}

for

x_{i} \in [0, 1]

. We use

F = {f (x) = β_{1}^{T} x + β_{2}, ∥ β_{1} ∥_{2}^{2} + β_{2}^{2} \leq 10,000, x \in {[0, 1]}^{5}}

, and

G

as the reproducing kernel Hilbert space generated by Ψ, where Ψ is as in (6) with

ν = 3.5

and

ϕ = 1

. Note that

F

and

G

are not

L_{2}

-separable because

F \subset G

.

The double penalty model is

\begin{matrix} min_{β \in F, g \in N_{Ψ} ({[0, 1]}^{5})} ∥ y - x^{T} {β - g ∥}_{n}^{2} + λ {∥ g ∥}_{N_{Ψ} ({[0, 1]}^{5})}^{2}, \end{matrix}

(20)

and the solution is denoted by

(\hat{β}, \hat{g})

. We choose

n λ = 1, 0.1, 0.01

, where

n = 50

is the sample size. The noise

ϵ \sim N (0, σ^{2})

, where

σ^{2}

is chosen to be

0.1

and

0.01

. The iteration numbers are fixed in each simulation, with values

1, 2, 3, 4, 5

. We choose maximin Latin hypercube design [34] with sample size 50 as the training set. We run 100 simulations for each case and calculated the mean squared prediction error on the testing set, which is the first 1000 points of the Halton sequence [35].

Table 3 and Table 4 show the simulation results when the variance of noise is 0.1 and 0.01, respectively. We run simulations with iteration numbers

1, 2, 3, 4, 5

for each

n λ

, and we find the results are not of much difference. For the briefness, we only present the full simulation results of

n λ = 1

to show the similarity, and present the results with 5 iterations for other values of

n λ

. In Table 3 and Table 4, we calculate the mean squared prediction error on the training set and the testing set. We also calculate the

L_{2}

norm of

\hat{f}

and

\hat{g}

as in (20), which is approximated by the empirical norm using the first 1000 points of the Halton sequence.

From Table 3 and Table 4, we can obtain the following results: (i) The prediction error in all cases are small, which suggests that the double penalty model can make accurate predictions. (ii) If we increase

n λ

, the training error decreases. The prediction error decreases when

n λ

is relatively large, and becomes large when

n λ

is too small. (iii) One iteration in Algorithm 1 is sufficient to obtain a good solution of (20). (iv) The training error of the case with smaller

σ^{2}

is smaller. If

n λ

is chosen properly, the prediction error of the case with smaller

σ^{2}

is small. However, there is not much difference in the prediction error under the cases

σ^{2} = 0.1

and

0.01

when

n λ

is large. (v) For all values of

n λ

, the

L_{2}

norm of the linear function

\hat{f}

does not vary a lot. The

L_{2}

norm of

\hat{g}

, on the other hand, increases as

n λ

decreases. This is because a smaller

n λ

implies a lower penalty on g. (vi) Comparing the values of the

L_{2}

norm of

\hat{f}

and the

L_{2}

norm of

\hat{g}

, we can see the

L_{2}

norm of

\hat{f}

is much larger, which is desired because we tend to maximize the interpretable part, which is linear functions in this example.

6. Application to Real Datasets

To illustrate, we apply the approach to two datasets. The first dataset is [36], which includes 50 human fecal microbiome features for

n = 414

unrelated individuals, of genetic sequence tags corresponding to bacterial taxa, and with a response variable of log-transformed body mass index (BMI). To increase the prediction accuracy, we first reduce the number of original features to the final dataset using the HFE cross-validated approach [37], as discussed in [38]. The second dataset is the diabetes dataset from the lars R package, widely used to illustrate penalized regression [39]. The response is a log-transformed measure of disease progression one year after baseline, and predictor features are ten baseline variables, age, sex, BMI, average blood pressure, and six blood serum measurements.

Following Algorithm 1, we let f denote the LASSO algorithm ([40], interpretable part), and use the built-in

l_{1}

penalty as

L_{f}

, with parameter

λ_{f}

as implemented in the R package glmnet. For the “uninterpretable” part, we use the xgboost decision tree approach, with built-in

L_{2}

penalty as

L_{g}

, with parameter

λ_{g}

as implemented in the R package xgboost [41]. For xgboost, we set an

L_{1}

penalty as zero throughout, with other parameters (tree depth, etc.), set by cross-validation internally, while preserving convexity of

L_{g}

. We also set the maximum number of boosting iterations at ten. At each iterative step of LASSO and xgboost, ten simulations of five-fold cross-validation were performed and the predicted values were then averaged.

Finally, in order to explore the tradeoffs between the interpretable and uninterpretable parts, we first establish a range-finding exercise for the penalty tuning parameters on the logarithmic scale, such that

{log}_{10} (λ_{g}) + {log}_{10} (λ_{f}) = c

for constant c. We refer to this tradeoff as the transect between the tuning parameters, with low values of

λ_{f}

, for example, emphasizing and placing weight on the interpretable part by enforcing a low penalty for overfitting. To illustrate performance, we use the Pearson correlation coefficient between the response vector y and the average (cross-validated) values of

\hat{f}

,

\hat{g}

and

(\hat{f} + \hat{g})

over the transect. The correlations are of course directly related to the objective function term

\sum {(y - \hat{f} - \hat{g})}^{2}

, but are easier to interpret. Note that

\hat{f}

and

\hat{g}

are not orthogonal, so the correlations do not partition into the overall correlation of y with

(\hat{f} + \hat{g})

. Additionally, as a final comparison, we compute these correlation values over the entire grid of

{λ_{f}, λ_{g}}

values, to ensure that the transect was largely capturing the best choice of tuning parameters.

For the Goodrich microbiome data, Figure 3 top panel shows the correlations between y and the three cross-validated predictors over the transect. Low values of

λ_{f}

are favored, although it is clear that the decision tree is favored throughout most of the transect, i.e., y has much higher correlations with

\hat{g}

than with

\hat{f}

. Using

{log}_{10} (λ_{f})

in the range of (−2, −1) maximizes the correlation with the interpretable portion, while still achieving near the overall maximum correlation for the combined prediction rule (correlation of nearly 0.5). Our subjective “best balance" region for the interpretable portion is shown on the figure.

Figure 3 bottom panel shows the analogous results for the diabetes dataset. Here LASSO provides overall good predictions for small tuning parameter

λ_{f}

, and

{log}_{10} (λ_{f}) = - 2

provides good correlations (in the range 0.55–0.6) of y with

\hat{f}

,

\hat{g}

and

(\hat{f} + \hat{g})

. As the tuning parameter

λ_{f}

increases, the correlation between y and

\hat{f}

falls off dramatically, and our suggested “best balance” point is also shown. In no instance were the correlation values for the full grid of

{λ_{f}, λ_{g}}

more than 0.015 greater than the greatest value observed along the transects.

7. Discussion

In this work, we propose using a double penalty model as a means of isolating and studying the effects and implications of ensemble learning. We have established conditions for local algorithmic convergence under relatively general convexity conditions. We highlight the fact that in some settings identifiability is not necessary for effective use of the model in prediction. If two function classes are orthogonal, the convergence of the algorithm provided in this work is very fast. This observation inspires potential future work, given any two function classes, to construct two separable function classes that are orthogonal, and to obtain subsequent consistency results, since the two portions are identifiable.

Although our interest here is theoretical, we have also illustrated how the fitting algorithm can be used in practice to make the relative contribution of

\hat{f}

large, while not substantially degrading overall predictive performance. The examples here are relatively straightforward, serving to illustrate the theoretical concepts. Further practical implications and implementation issues will be described elsewhere.

Author Contributions

Conceptualization, Y.-H.Z.; methodology, W.W. and Y.-H.Z.; formal analysis; validation, W.W. and Y.-H.Z.; formal analysis, W.W. and Y.-H.Z.; investigation, Y.-H.Z.; resources, Y.-H.Z.; data curation, W.W. and Y.-H.Z.; writing—original draft preparation, W.W. and Y.-H.Z.; writing—review and editing, W.W. and Y.-H.Z.; supervision, Y.-H.Z.; visualization, Y.-H.Z.; project administration, Y.-H.Z.; funding acquisition, Y.-H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Environmental Protection Agency, grant number: 84045001 and Texas A&M Superfund Research Program, grant number: P42ES027704.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theorem 1

Without loss of generality, assume

L_{f}

is strongly convex. For any

α \in (0, 1)

, by the strong convexity of

L_{f}

, we have

\begin{matrix} ∥ f^{*} + g^{*} + ϵ - f_{m} - g_{m} ∥_{n}^{2} + L_{f} (f_{m}) \\ \leq & ∥ f^{*} + g^{*} + ϵ - α \hat{f} - (1 - α) f_{m} - g_{m} ∥_{n}^{2} + L_{f} (α \hat{f} + (1 - α) f_{m}) \\ \leq & ∥ f^{*} + g^{*} + ϵ - α \hat{f} - (1 - α) f_{m} - g_{m} ∥_{n}^{2} \\ + α L_{f} (\hat{f}) + (1 - α) L_{f} (f_{m}) - \frac{1}{2} γ α (1 - α) {∥ \hat{f} - f_{m} ∥}_{n}^{2} . \end{matrix}

(A1)

We can rewrite (A1) as

\begin{matrix} ∥ f^{*} - f_{m} ∥_{n}^{2} + 2 {〈 f^{*} - f_{m}, g^{*} - g_{m} + ϵ 〉}_{n} + L_{f} (f_{m}) \\ \leq & α^{2} ∥ f^{*} - \hat{f} ∥_{n}^{2} + {(1 - α)}^{2} {∥ f^{*} - f_{m} ∥}_{n}^{2} + 2 〈 f^{*} - α \hat{f} - (1 - α) f_{m}, g^{*} - g_{m} + ϵ 〉 \\ + 2 α (1 - α) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + α L_{f} (\hat{f}) + (1 - α) L_{f} (f_{m}) - \frac{1}{2} γ α (1 - α) {∥ \hat{f} - f_{m} ∥}_{n}^{2} \\ \leq & α^{2} ∥ f^{*} - \hat{f} ∥_{n}^{2} + {(1 - α)}^{2} {∥ f^{*} - f_{m} ∥}_{n}^{2} + 2 〈 f^{*} - α \hat{f} - (1 - α) f_{m}, g^{*} - g_{m} + ϵ 〉 \\ + 2 α (1 - α) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + α L_{f} (\hat{f}) + (1 - α) L_{f} (f_{m}) \\ - \frac{1}{2} γ α (1 - α) (∥ \hat{f} - f^{*} ∥_{n}^{2} - 2 {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + ∥ f^{*} - f_{m} ∥_{n}^{2}), \end{matrix}

which is the same as

\begin{matrix} (2 α - α^{2} + \frac{1}{2} γ α (1 - α)) {∥ f^{*} - f_{m} ∥}_{n}^{2} + 2 α {〈 f^{*} - f_{m}, g^{*} - g_{m} + ϵ 〉}_{n} + α L_{f} (f_{m}) \\ \leq & (α^{2} - \frac{1}{2} γ α (1 - α)) {∥ f^{*} - \hat{f} ∥}_{n}^{2} + 2 α {〈 f^{*} - \hat{f}, g^{*} - g_{m} + ϵ 〉}_{n} \\ + (2 + γ) α (1 - α) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + α L_{f} (\hat{f}) \\ \Leftrightarrow & (2 - α + \frac{1}{2} γ (1 - α)) {∥ f^{*} - f_{m} ∥}_{n}^{2} + 2 {〈 f^{*} - f_{m}, g^{*} - g_{m} + ϵ 〉}_{n} + L_{f} (f_{m}) \\ \leq & (α - \frac{1}{2} γ (1 - α)) {∥ f^{*} - \hat{f} ∥}_{n}^{2} + 2 {〈 f^{*} - \hat{f}, g^{*} - g_{m} + ϵ 〉}_{n} \\ + (2 + γ) (1 - α) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + L_{f} (\hat{f}) . \end{matrix}

(A2)

Taking limit

α \to 0

in (A2) yields

\begin{matrix} (2 + \frac{1}{2} γ) {∥ f^{*} - f_{m} ∥}_{n}^{2} + 2 {〈 f^{*} - f_{m}, g^{*} - g_{m} + ϵ 〉}_{n} + L_{f} (f_{m}) \\ \leq & - \frac{1}{2} γ {∥ f^{*} - \hat{f} ∥}_{n}^{2} + 2 {〈 f^{*} - \hat{f}, g^{*} - g_{m} + ϵ 〉}_{n} + (2 + γ) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + L_{f} (\hat{f}) . \end{matrix}

(A3)

Since

\hat{f}

is the solution to (2), for any

β \in (0, 1)

, it is true that

\begin{matrix} ∥ f^{*} + g^{*} + ϵ - \hat{f} - \hat{g} ∥_{n}^{2} + L_{f} (\hat{f}) + L_{g} (\hat{g}) \\ \leq & ∥ f^{*} + g^{*} + ϵ - β \hat{f} - (1 - β) f_{m} - \hat{g} ∥_{n}^{2} + L_{f} (β \hat{f} + (1 - β) f_{m}) + L_{g} (\hat{g}) \\ \leq & ∥ f^{*} + g^{*} + ϵ - β \hat{f} - (1 - β) f_{m} - \hat{g} ∥_{n}^{2} + β L_{f} (\hat{f}) \\ + (1 - β) L_{f} (f_{m}) + L_{g} (\hat{g}) - \frac{1}{2} γ β (1 - β) {∥ \hat{f} - f_{m} ∥}_{n}^{2} . \end{matrix}

By the similar approach as shown in (A1)–(A3), we can show

\begin{matrix} (2 + \frac{1}{2} γ) {∥ f^{*} - \hat{f} ∥}_{n}^{2} + 2 {〈 f^{*} - \hat{f}, g^{*} - \hat{g} + ϵ 〉}_{n} + L_{f} (\hat{f}) \\ \leq & - \frac{1}{2} γ {∥ f^{*} - f_{m} ∥}_{n}^{2} + 2 {〈 f^{*} - f_{m}, g^{*} - \hat{g} + ϵ 〉}_{n} \\ + (2 + γ) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + L_{f} (f_{m}) . \end{matrix}

(A4)

Combining (A3) and (A4) leads to

\begin{matrix} (1 + \frac{1}{2} γ) ∥ \hat{f} - f_{m} ∥_{n}^{2} \leq - {〈 \hat{f} - f_{m}, \hat{g} - g_{m} 〉}_{n} \leq ∥ \hat{f} - f_{m} ∥_{n} {∥ \hat{g} - g_{m} ∥}_{n} . \end{matrix}

Thus,

\begin{matrix} (1 + \frac{1}{2} γ) ∥ \hat{f} - f_{m} ∥_{n} \leq {∥ \hat{g} - g_{m} ∥}_{n} . \end{matrix}

(A5)

Applying the same procedure to function

g_{m + 1}

, and noting that we do not have the strong convexity of

L_{g} (g)

, we have

\begin{matrix} ∥ \hat{g} - g_{m + 1} ∥_{n} \leq & ∥ \hat{f} - f_{m} ∥_{n} . \end{matrix}

(A6)

By (A5) and (A6), we have

\begin{matrix} ∥ \hat{g} - g_{m + 1} ∥_{n} \leq ∥ \hat{f} - f_{m} ∥_{n} \leq \frac{2}{2 + γ} ∥ \hat{g} - g_{m} ∥_{n} \leq \dots \leq {(\frac{2}{2 + γ})}^{m} {∥ \hat{g} - g_{1} ∥}_{n}, \end{matrix}

which implies

∥ \hat{g} - g_{m} ∥_{n}

converges to zero. By (A5),

∥ \hat{f} - f_{m} ∥_{n}

also converges to zero. The rest of the proof is similar to the proof of Theorem 3. Thus, we finish the proof.

Appendix B. Proof of Lemma 1

The proof is straightforward. Suppose there exist another two functions

f_{0} \in F

and

g_{0} \in G

such that

h = f_{0} + g_{0}

. By (5), we have

\begin{matrix} 0 & = ∥ f_{0} + g_{0} - f^{*} - g^{*} ∥_{L_{2}}^{2} = ∥ f_{0} - f^{*} ∥_{L_{2}}^{2} + {∥ g_{0} - g^{*} ∥}_{L_{2}}^{2} + 2 {〈 f_{0} - f^{*}, g_{0} - g^{*} 〉}_{2} \\ \geq ∥ f_{0} - f^{*} ∥_{L_{2}}^{2} + ∥ g_{0} - g^{*} ∥_{L_{2}}^{2} - 2 θ_{1} ∥ f_{0} - f^{*} ∥_{L_{2}} {∥ g_{0} - g^{*} ∥}_{L_{2}} \\ \geq ∥ f_{0} - f^{*} ∥_{L_{2}}^{2} + ∥ g_{0} - g^{*} ∥_{L_{2}}^{2} - 2 ∥ f_{0} - f^{*} ∥_{L_{2}} {∥ g_{0} - g^{*} ∥}_{L_{2}} \\ = (∥ f_{0} - f^{*} ∥_{L_{2}} - ∥ g_{0} - g^{*} {∥_{L_{2}})}^{2}, \end{matrix}

where the equality holds only when

2 θ_{1} ∥ f_{0} - f^{*} ∥_{L_{2}} ∥ g_{0} - g^{*} ∥_{L_{2}} = 2 ∥ f_{0} - f^{*} ∥_{L_{2}} {∥ g_{0} - g^{*} ∥}_{L_{2}}

, i.e.,

∥ f_{0} - f^{*} ∥_{L_{2}} = {∥ g_{0} - g^{*} ∥}_{L_{2}} = 0

, since

θ_{1} < 1

. Thus, we finish the proof.

Appendix C. Proof of Theorem 2

Because

\hat{f}

and

\hat{g}

are derived by (7), we have

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{f} (x_{i}) - \hat{g} (x_{i}))}^{2} + λ_{1} ∥ \hat{f} ∥_{N_{Ψ_{ν_{1}}} (Ω)}^{2} + λ_{2} {∥ \hat{g} ∥}_{N_{Ψ_{ν_{2}}} (Ω)}^{2} \\ \leq & \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - f^{*} (x_{i}) - g^{*} (x_{i}))}^{2} + λ_{1} ∥ f^{*} ∥_{N_{Ψ_{ν_{1}}} (Ω)}^{2} + λ_{2} {∥ g^{*} ∥}_{N_{Ψ_{ν_{2}}} (Ω)}^{2}, \end{matrix}

which can be rewritten as

\begin{matrix} ∥ f^{*} - \hat{f} ∥_{n}^{2} + ∥ g^{*} - \hat{g} ∥_{n}^{2} + 2 {〈 f^{*} - \hat{f}, g^{*} - \hat{g} 〉}_{n} + λ_{1} ∥ \hat{f} ∥_{N_{Ψ_{ν_{1}}} (Ω)}^{2} + λ_{2} {∥ \hat{g} ∥}_{N_{Ψ_{ν_{2}}} (Ω)}^{2} \\ \leq & 2 {〈 ϵ, \hat{f} - f^{*} 〉}_{n} + 2 {〈 ϵ, \hat{g} - g^{*} 〉}_{n} + λ_{1} ∥ f^{*} ∥_{N_{Ψ_{ν_{1}}} (Ω)}^{2} + λ_{2} {∥ g^{*} ∥}_{N_{Ψ_{ν_{2}}} (Ω)}^{2} \end{matrix}

(A7)

Note

N_{Ψ_{ν_{l}}} (Ω)

coincides

H^{ν_{l}} (Ω)

, for

l = 1, 2

. By the entropy number of a unit ball in the Sobolev space

H^{ν_{l}} (Ω)

[42] and Lemma 8.4 of [22], it can be shown that

\begin{matrix} sup_{f \in F} \frac{| {〈 ϵ, \hat{f} - f^{*} 〉}_{n} |}{∥ f^{*} - \hat{f} ∥_{n}^{1 - \frac{p}{2 ν_{1}}} (∥ \hat{f} ∥_{N_{Ψ_{ν_{1}}} (Ω)} + ∥ f^{*} {∥_{N_{Ψ_{ν_{1}}} (Ω)})}^{\frac{p}{2 ν_{1}}}} = O_{P} (n^{- 1 / 2}), \end{matrix}

which implies

\begin{matrix} {〈 ϵ, \hat{f} - f^{*} 〉}_{n} \\ = & O_{P} (n^{- 1 / 2}) ∥ f^{*} - \hat{f} ∥_{n}^{1 - \frac{p}{2 ν_{1}}} (∥ \hat{f} ∥_{N_{Ψ_{ν_{1}}} (Ω)} + ∥ f^{*} {∥_{N_{Ψ_{ν_{1}}} (Ω)})}^{\frac{p}{2 ν_{1}}} \\ = & O_{P} (n^{- 1 / 2}) {∥ f^{*} - \hat{f} ∥}_{n}^{1 - \frac{p}{2 ν_{1}}}, \end{matrix}

(A8)

since

F

is bounded. Following a similar argument, we have

\begin{matrix} {〈 ϵ, \hat{g} - g^{*} 〉}_{n} = & O_{P} (n^{- 1 / 2}) {∥ g^{*} - \hat{g} ∥}_{n}^{1 - \frac{p}{2 ν_{2}}} . \end{matrix}

(A9)

Let

\begin{matrix} F_{1} = \{h_{1} \in F : h_{1} = \frac{\hat{f} - f^{*}}{∥ \hat{f} - f^{*} ∥_{L_{\infty} (Ω)}}\}, G_{1} = \{h_{2} \in G : h_{2} = \frac{\hat{g} - g^{*}}{∥ \hat{g} - g^{*} ∥_{L_{\infty} (Ω)}}\} . \end{matrix}

Thus,

F_{1} \subset N_{Ψ_{ν_{1}}} (Ω)

and

G_{1} \subset N_{Ψ_{ν_{2}}} (Ω)

with

{sup}_{h_{1} \in F_{1}} {∥ h_{1} ∥}_{L_{\infty} (Ω)} \leq 1

and

{sup}_{h_{2} \in G_{1}}

∥ h_{2} ∥_{L_{\infty} (Ω)} \leq 1

. Applying Lemma A2 yields

\begin{matrix} {〈 f^{*} - \hat{f}, g^{*} - \hat{g} 〉}_{n} & = {〈 f^{*} - \hat{f}, g^{*} - \hat{g} 〉}_{2} + O_{P} (n^{- 1 / 2}) ∥ \hat{f} - f^{*} ∥_{L_{\infty} (Ω)} {∥ \hat{g} - g^{*} ∥}_{L_{\infty} (Ω)}, \end{matrix}

(A10)

where we also use

F

and

G

are bounded. By the interpolation inequality, we have

\begin{matrix} ∥ \hat{f} - f^{*} ∥_{L_{\infty} (Ω)} \leq C_{1} ∥ \hat{f} - f^{*} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} ∥ \hat{f} - f^{*} ∥_{N_{Ψ_{ν_{1}}}}^{\frac{p}{2 ν_{1}}} \leq C_{2} {∥ \hat{f} - f^{*} ∥}_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}}, \end{matrix}

(A11)

where the last inequality is because

\hat{f} \in F

and

F

is bounded. Similarly,

\begin{matrix} ∥ \hat{g} - g^{*} ∥_{L_{\infty} (Ω)} \leq C_{3} {∥ \hat{g} - g^{*} ∥}_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}}, \end{matrix}

(A12)

By applying Lemma 5.16 of [22], we can conclude the asymptotic equivalence of

L_{2}

norm and the empirical norm of

∥ f^{*} - \hat{f} ∥_{n}^{2}

, i.e.,

\begin{matrix} \underset{n \to \infty}{lim sup} P (sup_{∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{2} \geq C_{6} n^{- \frac{2 ν_{1}}{2 ν_{1} + p}}} | \frac{∥ f^{*} - \hat{f} ∥_{n}^{2}}{∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{2}} - 1 | \geq η) = 0, \end{matrix}

for some constants

C_{6}

and

η

(if

∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{2} \geq C_{6} n^{- \frac{2 ν_{1}}{2 ν_{1} + p}}

, then the conclusions automatically hold, and there is nothing needs to be proved). Therefore, we can replace

∥ f^{*} - \hat{f} ∥_{n}

and

∥ g^{*} - \hat{g} ∥_{n}

by

∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}

and

∥ g^{*} - \hat{g} ∥_{L_{2} (Ω)}

in (A7), respectively. Plugging (A8), (A9), (A10), (A11), and (A12) into (A7), we obtain

\begin{matrix} (1 - θ_{1}) ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{2} + (1 - θ_{1}) {∥ g^{*} - \hat{g} ∥}_{L_{2} (Ω)}^{2} \\ \leq & ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{2} + ∥ g^{*} - \hat{g} ∥_{L_{2} (Ω)}^{2} - 2 θ_{1} ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)} {∥ g^{*} - \hat{g} ∥}_{L_{2} (Ω)} \\ \leq & ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{2} + ∥ g^{*} - \hat{g} ∥_{L_{2} (Ω)}^{2} + 2 {〈 f^{*} - \hat{f}, g^{*} - \hat{g} 〉}_{2} + λ_{1} ∥ \hat{f} ∥_{N_{Ψ_{ν_{1}}} (Ω)}^{2} + λ_{2} {∥ \hat{g} ∥}_{N_{Ψ_{ν_{2}}} (Ω)}^{2} \\ \leq & O_{P} (n^{- 1 / 2}) ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} + O_{P} (n^{- 1 / 2}) ∥ g^{*} - \hat{g} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} + λ_{1} ∥ f^{*} ∥_{N_{Ψ_{ν_{1}}} (Ω)}^{2} + λ_{2} {∥ g^{*} ∥}_{N_{Ψ_{ν_{2}}} (Ω)}^{2} \\ + O_{P} (n^{- 1 / 2}) ∥ \hat{f} - f^{*} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} {∥ \hat{g} - g^{*} ∥}_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} \\ = & O_{P} (n^{- 1 / 2}) ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} + O_{P} (n^{- 1 / 2}) ∥ g^{*} - \hat{g} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} + λ_{1} ∥ f^{*} ∥_{N_{Ψ_{ν_{1}}} (Ω)}^{2} + λ_{2} {∥ g^{*} ∥}_{N_{Ψ_{ν_{2}}} (Ω)}^{2} \\ = & O_{P} (n^{- 1 / 2}) ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} + O_{P} (n^{- 1 / 2}) {∥ g^{*} - \hat{g} ∥}_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} + O_{P} (n^{- \frac{2 ν_{2}}{2 ν_{2} + p}}), \end{matrix}

(A13)

where the first inequality is because of the Cauchy-Schwarz inequality, the second inequality is because

F

and

G

are separable with respect to

L_{2}

norm, and the first equality is because

F

is bounded. Therefore, since

ν_{1} \geq ν_{2}

either

\begin{matrix} ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{2} + {∥ g^{*} - \hat{g} ∥}_{L_{2} (Ω)}^{2} \\ = & O_{P} (n^{- 1 / 2}) ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} + O_{P} (n^{- 1 / 2}) {∥ g^{*} - \hat{g} ∥}_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} \\ = & O_{P} (n^{- 1 / 2}) ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} + O_{P} (n^{- 1 / 2}) {∥ g^{*} - \hat{g} ∥}_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} \end{matrix}

(A14)

or

\begin{matrix} ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{2} + {∥ g^{*} - \hat{g} ∥}_{L_{2} (Ω)}^{2} = O_{P} (n^{- \frac{2 ν_{2}}{2 ν_{2} + p}}) . \end{matrix}

(A15)

Consider (A14) first. If

∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)} \geq {∥ g^{*} - \hat{g} ∥}_{L_{2} (Ω)}

, then (A14) implies

\begin{matrix} ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{2} = O_{P} (n^{- 1 / 2}) ∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} \Leftrightarrow {∥ f^{*} - \hat{f} ∥}_{L_{2} (Ω)}^{2} = O_{P} (n^{- \frac{2 ν_{2}}{2 ν_{2} + p}}) . \end{matrix}

(A16)

Similarly, if

∥ f^{*} - \hat{f} ∥_{L_{2} (Ω)} < {∥ g^{*} - \hat{g} ∥}_{L_{2} (Ω)}

, then (A14) implies

\begin{matrix} ∥ g^{*} - \hat{g} ∥_{L_{2} (Ω)}^{2} = O_{P} (n^{- \frac{2 ν_{2}}{2 ν_{2} + p}}) . \end{matrix}

(A17)

Combining (A15), (A16), and (A17), we finish the proof.

Appendix D. Proof of Theorem 3

We first present some lemmas used in this proof.

Let

(T, d)

be a metric space with metric d, and T is a space. Let

N (ϵ, T, d)

denote the

ϵ

-covering number of the metric space

(T, d)

, and

H (ϵ, T, d) = log N (ϵ, T, d)

be the entropy number. We need the following two lemmas. Lemma A1 is a direct result of Theorem 2.1 of [43], which provides an upper bound on the difference between the empirical norm and

L_{2}

norm. Lemma A2 is a direct result of Theorem 3.1 of [43], which provides an upper bound on the empirical inner product. In Lemmas A1 and A2, we use the following definition. For

z > 0

, we define

\begin{matrix} J_{\infty}^{2} (z, A) = C_{0}^{2} inf_{δ > 0} E {[z \int_{δ}^{1} \sqrt{H (u z / 2, A, ∥ \cdot ∥_{\infty})} + \sqrt{n} δ z]}^{2}, \end{matrix}

where

C_{0}

is a constant, and

H (u, A, ∥ \cdot ∥_{\infty})

is the entropy of

(A, ∥ \cdot ∥_{\infty})

for a function class

A

.

Lemma A1.

Let

R = {sup}_{f \in A} {∥ f ∥}_{2}

, and

K = {sup}_{f \in A} {∥ f ∥}_{\infty}

, where

A

is a class. Then for all

t > 0

, with probability at least

1 - exp (- t)

,

\begin{matrix} sup_{f \in A} {| ∥ f ∥}_{n}^{2} - {∥ f ∥}_{2}^{2} | \leq C_{1} (\frac{2 R J_{\infty} (K, A) + R K \sqrt{t}}{\sqrt{n}} + \frac{4 J_{\infty}^{2} (K, A) + K^{2} t}{n}), \end{matrix}

where

C_{1}

is a constant.

Lemma A2.

Let

F

and

G

be two function classes. Let

\begin{matrix} R_{1} = sup_{f \in F} {∥ f ∥}_{2}, K_{1} = sup_{f \in F} {∥ f ∥}_{\infty}, R_{2} = sup_{g \in G} {∥ g ∥}_{2}, K_{2} = sup_{g \in G} {∥ g ∥}_{\infty} . \end{matrix}

Suppose that

R_{1} K_{2} \leq R_{2} K_{1}

. Assume

\begin{matrix} (\frac{2 R_{1} J_{\infty} (K_{1}, F) + R_{1} K_{1} \sqrt{t}}{\sqrt{n}} + \frac{4 J_{\infty}^{2} (K_{1}, F) + K_{1}^{2} t}{n}) \leq \frac{R_{1}^{2}}{C_{1}}, \end{matrix}

and

\begin{matrix} (\frac{2 R_{2} J_{\infty} (K_{2}, G) + R_{2} K_{2} \sqrt{t}}{\sqrt{n}} + \frac{4 J_{\infty}^{2} (K_{2}, G) + K_{2}^{2} t}{n}) \leq \frac{R_{2}^{2}}{C_{1}} . \end{matrix}

Then for

t \geq 4

, with probability at least

1 - 12 exp (- t)

,

\begin{matrix} \frac{1}{8 C_{1}} sup_{f \in F, g \in G} |{〈 f, g 〉}_{2} - {〈 f, g 〉}_{n}| \leq \frac{R_{1} J_{\infty} (K_{2}, G) + R_{2} J_{\infty} (R_{1} K_{2} / R_{2}), F) + R_{1} K_{2} \sqrt{t}}{\sqrt{n}} + \frac{K_{1} K_{2} t}{n} . \end{matrix}

If

m = 1

, then the results automatically hold. Suppose

m > 1

. Since

f_{m}

is the solution to (), for any

α \in (0, 1)

, we have

\begin{matrix} ∥ f^{*} + g^{*} + ϵ - f_{m} - g_{m} ∥_{n}^{2} + L_{f} (f_{m}) \\ \leq & ∥ f^{*} + g^{*} + ϵ - α \hat{f} - (1 - α) f_{m} - g_{m} ∥_{n}^{2} + L_{f} (α \hat{f} + (1 - α) f_{m}) \\ \leq & ∥ f^{*} + g^{*} + ϵ - α \hat{f} - (1 - α) f_{m} - g_{m} ∥_{n}^{2} + α L_{f} (\hat{f}) + (1 - α) L_{f} (f_{m}), \end{matrix}

(A18)

where the last inequality is because

L_{f}

is convex. Rewriting (A18) yields

\begin{matrix} ∥ f^{*} - f_{m} ∥_{n}^{2} + 2 {〈 f^{*} - f_{m}, g^{*} - g_{m} + ϵ 〉}_{n} + L_{f} (f_{m}) \\ \leq & α^{2} ∥ f^{*} - \hat{f} ∥_{n}^{2} + {(1 - α)}^{2} {∥ f^{*} - f_{m} ∥}_{n}^{2} + 2 〈 f^{*} - α \hat{f} - (1 - α) f_{m}, g^{*} - g_{m} + ϵ 〉 \\ + 2 α (1 - α) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + α L_{f} (\hat{f}) + (1 - α) L_{f} (f_{m}), \end{matrix}

which is the same as

\begin{matrix} (2 α - α^{2}) {∥ f^{*} - f_{m} ∥}_{n}^{2} + 2 α {〈 f^{*} - f_{m}, g^{*} - g_{m} + ϵ 〉}_{n} + α L_{f} (f_{m}) \\ \leq & α^{2} {∥ f^{*} - \hat{f} ∥}_{n}^{2} + 2 α {〈 f^{*} - \hat{f}, g^{*} - g_{m} + ϵ 〉}_{n} \\ + 2 α (1 - α) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + α L_{f} (\hat{f}) . \end{matrix}

(A19)

Because

α \in (0, 1)

, (A19) implies

\begin{matrix} (2 - α) ∥ f^{*} - f_{m} ∥_{n}^{2} + 2 {〈 f^{*} - f_{m}, g^{*} - g_{m} + ϵ 〉}_{n} + L_{f} (f_{m}) \\ \leq & α ∥ f^{*} - \hat{f} ∥_{n}^{2} + 2 {〈 f^{*} - \hat{f}, g^{*} - g_{m} + ϵ 〉}_{n} + 2 (1 - α) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + L_{f} (\hat{f}) . \end{matrix}

(A20)

Taking limit

α \to 0

in (A20) leads to

\begin{matrix} ∥ f^{*} - f_{m} ∥_{n}^{2} + {〈 f^{*} - f_{m}, g^{*} - g_{m} + ϵ 〉}_{n} + L_{f} (f_{m}) / 2 \end{matrix}

(A21)

\begin{matrix} \leq {〈 f^{*} - \hat{f}, g^{*} - g_{m} + ϵ 〉}_{n} + {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + L_{f} (\hat{f}) / 2 . \end{matrix}

(A22)

Since

\hat{f}

is the solution to (2), for any

β \in (0, 1)

, it is true that

\begin{matrix} ∥ f^{*} + g^{*} + ϵ - \hat{f} - \hat{g} ∥_{n}^{2} + L_{f} (\hat{f}) + L_{g} (\hat{g}) \\ \leq ∥ f^{*} + g^{*} + ϵ - β \hat{f} - (1 - β) f_{m} - \hat{g} ∥_{n}^{2} + L_{f} (β \hat{f} + (1 - β) f_{m}) + L_{g} (\hat{g}) \\ \leq & ∥ f^{*} + g^{*} + ϵ - β \hat{f} - (1 - β) f_{m} - \hat{g} ∥_{n}^{2} + β L_{f} (\hat{f}) + (1 - β) L_{f} (f_{m}) + L_{g} (\hat{g}), \end{matrix}

which implies

\begin{matrix} (1 - β^{2}) {∥ f^{*} - \hat{f} ∥}_{n}^{2} + 2 (1 - β) {〈 f^{*} - \hat{f}, g^{*} - \hat{g} + ϵ 〉}_{n} + (1 - β) L_{f} (\hat{f}) \end{matrix}

\begin{matrix} \leq & {(1 - β)}^{2} {∥ f^{*} - f_{m} ∥}_{n}^{2} + 2 β (1 - β) {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} \end{matrix}

(A23)

\begin{matrix} + 2 (1 - β) {〈 f^{*} - f_{m}, g^{*} - \hat{g} + ϵ 〉}_{n} + (1 - β) L_{f} (f_{m}) . \end{matrix}

(A24)

Since

β < 1

, (A23) implies

\begin{matrix} (1 + β) ∥ f^{*} - \hat{f} ∥_{n}^{2} + 2 {〈 f^{*} - \hat{f}, g^{*} - \hat{g} + ϵ 〉}_{n} + L_{f} (\hat{f}) \\ \leq & (1 - β) ∥ f^{*} - f_{m} ∥_{n}^{2} + 2 β {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + 2 {〈 f^{*} - f_{m}, g^{*} - \hat{g} + ϵ 〉}_{n} + L_{f} (f_{m}) . \end{matrix}

(A25)

Letting

β \to 1

in (A25) yields

\begin{matrix} ∥ f^{*} - \hat{f} ∥_{n}^{2} + {〈 f^{*} - \hat{f}, g^{*} - \hat{g} + ϵ 〉}_{n} + L_{f} (\hat{f}) / 2 \\ \leq & {〈 f^{*} - \hat{f}, f^{*} - f_{m} 〉}_{n} + {〈 f^{*} - f_{m}, g^{*} - \hat{g} + ϵ 〉}_{n} + L_{f} (f_{m}) / 2 . \end{matrix}

(A26)

Combining (A26) and (A21), it can be checked that

\begin{matrix} ∥ \hat{f} - f_{m} ∥_{n}^{2} \leq - {〈 \hat{f} - f_{m}, \hat{g} - g_{m} 〉}_{n}, \end{matrix}

(A27)

which implies

\begin{matrix} ∥ \hat{f} - f_{m} ∥_{n} \leq {∥ \hat{g} - g_{m} ∥}_{n} . \end{matrix}

(A28)

Applying similar approach to

g_{m}

, we obtain

\begin{matrix} ∥ \hat{g} - g_{m + 1} ∥_{n} \leq {∥ \hat{f} - f_{m} ∥}_{n} . \end{matrix}

(A29)

Let

\begin{matrix} F_{1} = \{h_{1} \in F : h_{1} = \frac{\hat{f} - f_{m}}{∥ \hat{f} - f_{m} ∥_{L_{\infty} (Ω)}}\}, G_{1} = \{h_{2} \in G : h_{2} = \frac{\hat{g} - g^{m}}{∥ \hat{g} - g_{m} ∥_{L_{\infty} (Ω)}}\} . \end{matrix}

Thus,

F_{1} \subset N_{Ψ_{ν_{1}}} (Ω)

and

G_{1} \subset N_{Ψ_{ν_{2}}} (Ω)

with

{sup}_{h_{1} \in F_{1}} {∥ h_{1} ∥}_{L_{\infty} (Ω)} \leq 1

and

{sup}_{h_{2} \in G_{1}}

∥ h_{2} ∥_{L_{\infty} (Ω)} \leq 1

. Applying Lemma A2 yields that with probability at least

1 - exp (- t)

,

\begin{matrix} |{〈 f_{m} - \hat{f}, g_{m} - \hat{g} 〉}_{n}| \\ \leq & |{〈 f_{m} - \hat{f}, g_{m} - \hat{g} 〉}_{2}| + C_{1} {(n t)}^{- 1 / 2} ∥ \hat{f} - f_{m} ∥_{L_{\infty} (Ω)} {∥ \hat{g} - g_{m} ∥}_{L_{\infty} (Ω)} \\ \leq & |{〈 f_{m} - \hat{f}, g_{m} - \hat{g} 〉}_{2}| + C_{2} {(n t)}^{- 1 / 2} ∥ \hat{f} - f_{m} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} {∥ \hat{g} - g_{m} ∥}_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}}, \end{matrix}

(A30)

where the last inequality is by the interpolation inequality, and

F

and

G

are bounded. Similarly, by Lemma A2, we also obtain that with probability at least

1 - 2 exp (- t)

,

\begin{matrix} ∥ f_{m} - \hat{f} ∥_{L_{2} (Ω)}^{2} \leq & ∥ \hat{f} - f_{m} ∥_{n}^{2} + C_{1} {(n t)}^{- 1 / 2} {∥ \hat{f} - f_{m} ∥}_{L_{\infty} (Ω)}^{2} \\ \leq & ∥ \hat{f} - f_{m} ∥_{n}^{2} + C_{2} {(n t)}^{- 1 / 2} {∥ \hat{f} - f_{m} ∥}_{L_{2} (Ω)}^{2 - \frac{p}{ν_{1}}}, \end{matrix}

(A31)

and

\begin{matrix} ∥ g_{m} - \hat{g} ∥_{L_{2} (Ω)}^{2} \leq & ∥ \hat{g} - g_{m} ∥_{n}^{2} + C_{1} {(n t)}^{- 1 / 2} {∥ \hat{g} - g_{m} ∥}_{L_{\infty} (Ω)}^{2} \\ \leq & ∥ \hat{g} - g_{m} ∥_{n}^{2} + C_{2} {(n t)}^{- 1 / 2} {∥ \hat{g} - g_{m} ∥}_{L_{2} (Ω)}^{2 - \frac{p}{ν_{2}}} . \end{matrix}

(A32)

Since

F

and

G

satisfy (5), together with (A30)–(A32), we have that with probability at least

1 - 3 exp (- t)

\begin{matrix} |{〈 f_{m} - \hat{f}, g_{m} - \hat{g} 〉}_{n}| \\ \leq & θ_{1} ∥ f_{m} - \hat{f} ∥_{L_{2} (Ω)} ∥ g_{m} - \hat{g} ∥_{L_{2} (Ω)} + C_{2} {(n t)}^{- 1 / 2} ∥ \hat{f} - f_{m} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} {∥ \hat{g} - g_{m} ∥}_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} \\ \leq & θ_{1} ∥ f_{m} - \hat{f} ∥_{n} ∥ g_{m} - \hat{g} ∥_{n} + C_{2} {(n t)}^{- 1 / 4} ∥ \hat{g} - g_{m} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} {∥ f_{m} - \hat{f} ∥}_{n} \\ + C_{2} {(n t)}^{- 1 / 4} ∥ \hat{f} - f_{m} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} {∥ g_{m} - \hat{g} ∥}_{n} \\ + C_{2} {(n t)}^{- 1 / 2} ∥ \hat{f} - f_{m} ∥_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{1}}} {∥ \hat{g} - g_{m} ∥}_{L_{2} (Ω)}^{1 - \frac{p}{2 ν_{2}}} \end{matrix}

(A33)

\begin{matrix} \leq & θ_{1} ∥ f_{m} - \hat{f} ∥_{n} ∥ g_{m} - \hat{g} ∥_{n} + C_{3} {(n t)}^{- 1 / 4} ∥ \hat{g} - g_{m} ∥_{n}^{1 - \frac{p}{2 ν_{2}}} {∥ f_{m} - \hat{f} ∥}_{n} \\ + C_{3} {(n t)}^{- 1 / 4} ∥ \hat{f} - f_{m} ∥_{n}^{1 - \frac{p}{2 ν_{2}}} {∥ g_{m} - \hat{g} ∥}_{n} \end{matrix}

(A34)

\begin{matrix} + C_{3} {(n t)}^{- 1 / 2} ∥ \hat{f} - f_{m} ∥_{n}^{1 - \frac{p}{2 ν_{2}}} {∥ \hat{g} - g_{m} ∥}_{n}^{1 - \frac{p}{2 ν_{2}}} \end{matrix}

(A35)

where the last inequality is by Lemma A1 and

ν_{1} \geq ν_{2}

.

If

∥ f_{m} - \hat{f} ∥_{n} < {({(n t)}^{- 1 / 4} n^{α})}^{\frac{2 ν_{2}}{p}}

, then (A29) implies

∥ g_{m + 1} - \hat{g} ∥_{n} < {({(n t)}^{- 1 / 4} n^{α})}^{\frac{2 ν_{2}}{p}}

. If

∥ f_{m} - \hat{f} ∥_{n} \geq {({(n t)}^{- 1 / 4} n^{α})}^{\frac{2 ν_{2}}{p}}

, then by (A28), we also have

∥ g_{m} - \hat{g} ∥_{n} \geq {({(n t)}^{- 1 / 4} n^{α})}^{\frac{2 ν_{2}}{p}}

. Thus,

{(n t)}^{- 1 / 4} {∥ f_{m} - \hat{f} ∥}_{n}^{- \frac{p}{2 ν_{2}}} \leq n^{- α}

and

{(n t)}^{- 1 / 4} {∥ g_{m} - \hat{g} ∥}_{n}^{- \frac{p}{2 ν_{2}}} \leq n^{- α}

, which, together with (A33), yields

\begin{matrix} |{〈 f_{m} - \hat{f}, g_{m} - \hat{g} 〉}_{n}| \leq (θ_{1} + C_{5} n^{- α}) ∥ f_{m} - \hat{f} ∥_{n} {∥ g_{m} - \hat{g} ∥}_{n} . \end{matrix}

(A36)

Define

θ_{2} = θ_{1} + C_{5} n^{- α}

. By (A27) and (A36), we have

\begin{matrix} ∥ \hat{f} - f_{m} ∥_{n}^{2} \leq θ_{2} ∥ \hat{f} - f_{m} ∥_{n} ∥ \hat{g} - g_{m} ∥_{n} \Leftrightarrow ∥ \hat{f} - f_{m} ∥_{n} \leq θ_{2} {∥ \hat{g} - g_{m} ∥}_{n} . \end{matrix}

(A37)

Applying the same procedure to function

g_{m + 1}

, we have

\begin{matrix} ∥ \hat{g} - g_{m + 1} ∥_{n} \leq & θ_{2} {∥ \hat{f} - f_{m} ∥}_{n} . \end{matrix}

(A38)

By (A37) and (A38), it can be seen that

\begin{matrix} ∥ \hat{g} - g_{m + 1} ∥_{n} \leq & θ_{2} ∥ \hat{f} - f_{m} ∥_{n} \leq θ_{2}^{2} ∥ \hat{g} - g_{m} ∥_{n} \dots \leq θ_{2}^{2 m - 2} {∥ \hat{g} - g_{1} ∥}_{n} . \end{matrix}

Taking

α = \frac{2 ν_{2} - p}{2 (2 ν_{2} + p)}

and

t = n^{\frac{2 ν_{2} - p}{2 ν_{2} + p}}

finishes the proof.

References

Opitz, D.; Maclin, R. Popular ensemble methods: An empirical study. J. Artif. Intell. Res. 1999, 11, 169–198. [Google Scholar] [CrossRef]
Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
Krogh, A.; Vedelsby, J. Neural network ensembles, cross validation, and active learning. Adv. Neural Inf. Process. Syst. 1994, 7, 231–238. [Google Scholar]
Wyner, A.J.; Olson, M.; Bleich, J.; Mease, D. Explaining the success of adaboost and random forests as interpolating classifiers. J. Mach. Learn. Res. 2017, 18, 1558–1590. [Google Scholar]
Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef] [Green Version]
Zhang, H.; Nettleton, D.; Zhu, Z. Regression-enhanced random forests. arXiv 2019, arXiv:1904.10416. [Google Scholar]
Bojer, C.S.; Meldgaard, J.P. Kaggle forecasting competitions: An overlooked learning opportunity. Int. J. Forecast. 2021, 37, 587–603. [Google Scholar] [CrossRef]
Li, C. A Gentle Introduction to Gradient Boosting. 2016. Available online: http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf (accessed on 1 January 2019).
Abbott, D. Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608. [Google Scholar]
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4765–4774. [Google Scholar]
Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
Bickel, P.J.; Klaassen, C.A.; Bickel, P.J.; Ritov, Y.; Klaassen, J.; Wellner, J.A.; Ritov, Y. Efficient and Adaptive Estimation for Semiparametric Models; Springer: Berlin/Heidelberg, Germany, 1993; Volume 4. [Google Scholar]
Hastie, T.J.; Tibshirani, R.J. Generalized Additive Models; Routledge: London, UK, 2017. [Google Scholar]
Ba, S.; Joseph, V.R. Composite Gaussian process models for emulating expensive functions. Ann. Appl. Stat. 2012, 6, 1838–1860. [Google Scholar] [CrossRef] [Green Version]
Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320. [Google Scholar] [CrossRef] [Green Version]
Zhou, Q.; Chen, W.; Song, S.; Gardner, J.; Weinberger, K.; Chen, Y. A reduction of the elastic net to support vector machines with an application to GPU computing. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 29. [Google Scholar]
Zou, H.; Hastie, T.; Tibshirani, R. On the “degrees of freedom” of the lasso. Ann. Stat. 2007, 35, 2173–2192. [Google Scholar] [CrossRef]
Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Stein, M.L. Interpolation of Spatial Data: Some Theory for Kriging; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1999. [Google Scholar]
Wendland, H. Scattered Data Approximation; Cambridge University Press: Cambridge, UK, 2004; Volume 17. [Google Scholar]
van de Geer, S. Empirical Processes in M-Estimation; Cambridge University Press: Cambridge, UK, 2000; Volume 6. [Google Scholar]
Wahba, G. Partial spline models for the semi-parametric estimation of several variables. In Statistical Analysis of Time Series, Proceedings of the Japan US Joint Seminar; 1984; pp. 319–329. Available online: https://cir.nii.ac.jp/crid/1573387449750539264 (accessed on 27 October 2022).
Heckman, N.E. Spline smoothing in a partly linear model. J. R. Stat. Soc. Ser. B (Methodol.) 1986, 48, 244–248. [Google Scholar] [CrossRef]
Speckman, P. Kernel smoothing in partial linear models. R. Stat. Soc. Ser. B (Methodol.) 1988, 50, 413–436. [Google Scholar] [CrossRef]
Chen, H. Convergence rates for parametric components in a partly linear model. Ann. Stat. 1988, 16, 136–146. [Google Scholar] [CrossRef]
Xie, H.; Huang, J. SCAD-penalized regression in high-dimensional partially linear models. Ann. Stat. 2009, 37, 673–696. [Google Scholar] [CrossRef]
Gu, C. Smoothing Spline ANOVA Models; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 297. [Google Scholar]
Tuo, R. Adjustments to Computer Models via Projected Kernel Calibration. SIAM/ASA J. Uncertain. Quantif. 2019, 7, 553–578. [Google Scholar] [CrossRef] [Green Version]
Wahba, G. Spline Models for Observational Data; SIAM: Philadelphia, PA, USA, 1990; Volume 59. [Google Scholar]
Stone, C.J. Optimal global rates of convergence for nonparametric regression. Ann. Stat. 1982, 5, 1040–1053. [Google Scholar] [CrossRef]
Gramacy, R.B.; Lee, H.K. Cases for the nugget in modeling computer experiments. Stat. Comput. 2012, 22, 713–722. [Google Scholar] [CrossRef]
Sun, L.; Hong, L.J.; Hu, Z. Balancing exploitation and exploration in discrete optimization via simulation through a Gaussian process-based search. Oper. Res. 2014, 62, 1416–1438. [Google Scholar] [CrossRef]
Santner, T.J.; Williams, B.J.; Notz, W.I. The Design and Analysis of Computer Experiments; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
Niederreiter, H. Random Number Generation and Quasi-Monte Carlo Methods; SIAM: Philadelphia, PA, USA, 1992; Volume 63. [Google Scholar]
Goodrich, J.K.; Waters, J.L.; Poole, A.C.; Sutter, J.L.; Koren, O.; Blekhman, R.; Beaumont, M.; Van Treuren, W.; Knight, R.; Bell, J.T. Human genetics shape the gut microbiome. Cell 2014, 159, 789–799. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Oudah, M.; Henschel, A. Taxonomy-aware feature engineering for microbiome classification. BMC Bioinform. 2018, 19, 227. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Zhou, Y.H.; Gallins, P. A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction. Front. Genet. 2019, 10, 579. [Google Scholar] [CrossRef]
Hastie, T.; Tibshirani, R.; Friedman, J.; Franklin, J. The elements of statistical learning: Data mining, inference and prediction. Math. Intell. 2005, 27, 83–85. [Google Scholar]
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
Adams, R.A.; Fournier, J.J. Sobolev Spaces; Academic Press: Cambridge, MA, USA, 2003; Volume 140. [Google Scholar]
van de Geer, S. On the uniform convergence of empirical norms and inner products, with application to causal inference. Electron. J. Stat. 2014, 8, 543–574. [Google Scholar] [CrossRef]

Figure 1. Left panel: Convergence of

f_{m}, g_{m}

using Algorithm 1 for the diabetes dataset. Right panel: For each curve, the label shows which portion (f or g has been subjected to minimization, and the x-axis corresponds to the other

λ

not minimized. The LASSO portion of the elastic net example achieves nearly the same minimum RMSE as the full elastic net for these data, which would be favored in terms of interpretability.

Figure 1. Left panel: Convergence of

f_{m}, g_{m}

using Algorithm 1 for the diabetes dataset. Right panel: For each curve, the label shows which portion (f or g has been subjected to minimization, and the x-axis corresponds to the other

λ

not minimized. The LASSO portion of the elastic net example achieves nearly the same minimum RMSE as the full elastic net for these data, which would be favored in terms of interpretability.

Figure 2. One simulation result of Example 1 in Section 5.2. Each dot represents an observation on randomly sampled point.

Figure 3. Top panel: cross-validated correlations between y and each of

\hat{f}

,

\hat{g}

, and

(\hat{f} + \hat{g})

for the microbiome dataset, where the tuning parameters vary along the transect as described in the text. Bottom panel: the analogous correlations for the diabetes dataset. Grey region and black vertical line represent suggested tuning parameter values to maximize interpretability while preserving high prediction accuracy.

Figure 3. Top panel: cross-validated correlations between y and each of

\hat{f}

,

\hat{g}

, and

(\hat{f} + \hat{g})

for the microbiome dataset, where the tuning parameters vary along the transect as described in the text. Bottom panel: the analogous correlations for the diabetes dataset. Grey region and black vertical line represent suggested tuning parameter values to maximize interpretability while preserving high prediction accuracy.

Table 1. The simulation results when the sample size is fixed. The last column show the absolute difference between the third column and the fourth column, given by

| 2 log (ψ (θ)) -

Regression coefficient|.

Table 1. The simulation results when the sample size is fixed. The last column show the absolute difference between the third column and the fourth column, given by

| 2 log (ψ (θ)) -

Regression coefficient|.

$θ$	$ψ (θ)$	$2 log (ψ (θ))$	Regression Coefficient	Iteration Numbers	Absolute Difference
2	0.978	−0.045	−0.050	491.55	0.006
3	0.828	−0.378	−0.419	59.02	0.040
3.5	0.615	−0.973	−1.121	22.34	0.148
4	0.304	−2.383	−2.624	10	0.241

Table 2. The simulation results under different sample sizes. The last column shows the absolute difference between

2 log (ψ (3))

and regression coefficients, given by

| 2 log (ψ (θ)) -

Regression coefficient|.

Table 2. The simulation results under different sample sizes. The last column shows the absolute difference between

2 log (ψ (3))

and regression coefficients, given by

| 2 log (ψ (θ)) -

Regression coefficient|.

Sample Size	Regression Coefficient	Iteration Numbers	Absolute Difference
20	−0.110	225.05	0.269
50	−0.410	60.26	0.0315
100	−0.404	61	0.0260
150	−0.363	68	0.0148
200	−0.381	65	0.00244

Table 3. Simulation results when

ϵ \sim N (0, 0.1)

. The third column shows the mean squared prediction error on the training points. The fourth column shows the mean squared prediction error on the testing points. The fifth column and the last column show the approximated

L_{2}

norm of

\hat{f}

and

\hat{g}

as in (20), respectively.

Table 3. Simulation results when

ϵ \sim N (0, 0.1)

. The third column shows the mean squared prediction error on the training points. The fourth column shows the mean squared prediction error on the testing points. The fifth column and the last column show the approximated

L_{2}

norm of

\hat{f}

and

\hat{g}

as in (20), respectively.

$n λ$	Iteration Number	Training Error	Prediction Error	Linear $L_{2}$	Nonlinear $L_{2}$
1	1	0.02951	0.01714	1.5336	0.0034
	2	0.02950	0.01712	1.5312	0.0054
	3	0.02949	0.01711	1.5288	0.0076
	4	0.02947	0.01710	1.5265	0.0097
	5	0.02946	0.01709	1.5242	0.0119
0.1	5	0.02404	0.01400	1.5264	0.02224
0.001	5	0.0043	0.0059	1.5285	0.1331
$1 \times 10^{- 9}$	5	$3.860 \times 10^{- 12}$	0.03388	1.5324	0.2174

Table 4. Simulation results when

ϵ \sim N (0, 0.01)

. The third column shows the mean squared prediction error on the training points. The fourth column shows the mean squared prediction error on the testing points. The fifth column and the last column show the approximated

L_{2}

norm of

\hat{f}

and

\hat{g}

as in (20), respectively.

Table 4. Simulation results when

ϵ \sim N (0, 0.01)

. The third column shows the mean squared prediction error on the training points. The fourth column shows the mean squared prediction error on the testing points. The fifth column and the last column show the approximated

L_{2}

norm of

\hat{f}

and

\hat{g}

as in (20), respectively.

$n λ$	Iteration Number	Training Error	Prediction Error	Linear $L_{2}$	Nonlinear $L_{2}$
1	1	0.01812	0.01759	1.5316	0.002998
	2	0.01811	0.01757	1.5294	0.004763
	3	0.01810	0.01755	1.5274	0.006664
	4	0.01809	0.01754	1.5253	0.008581
	5	0.01808	0.01753	1.5234	0.010481
0.1	5	0.01336	0.01387	1.5203	0.017585
0.001	5	0.00071	0.00088	1.5287	0.120022

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, W.; Zhou, Y.-H. A Double Penalty Model for Ensemble Learning. Mathematics 2022, 10, 4532. https://doi.org/10.3390/math10234532

AMA Style

Wang W, Zhou Y-H. A Double Penalty Model for Ensemble Learning. Mathematics. 2022; 10(23):4532. https://doi.org/10.3390/math10234532

Chicago/Turabian Style

Wang, Wenjia, and Yi-Hui Zhou. 2022. "A Double Penalty Model for Ensemble Learning" Mathematics 10, no. 23: 4532. https://doi.org/10.3390/math10234532

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Double Penalty Model for Ensemble Learning

Abstract

1. Introduction

2. A Double Penalty Model and a Fitting Algorithm

An Example

3. Separable Function Classes

3.1. A Separable Additive Model with the Same Covariates

3.2. Finite Dimensional Function Classes

4. Non-Separable Function Classes

A Generalization of Partially Linear Models

5. Numerical Examples

5.1. Convergence Rate of Algorithm 1

5.2. Prediction of Double Penalty Model

6. Application to Real Datasets

7. Discussion

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Proof of Theorem 1

Appendix B. Proof of Lemma 1

Appendix C. Proof of Theorem 2

Appendix D. Proof of Theorem 3

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI