Sparse Density Estimation with Measurement Errors

Yang, Xiaowei; Zhang, Huiming; Wei, Haoyu; Zhang, Shouzheng

doi:10.3390/e24010030

Open AccessArticle

Sparse Density Estimation with Measurement Errors

¹

School of Mathematics and Statistics, Chaohu University, Hefei 238000, China

²

Department of Mathematics, Faculty of Science and Technology, University of Macau, Macau 999078, China

³

UMacau Zhuhai Research Institute, Zhuhai 519031, China

⁴

Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA

⁵

Graduate School of Arts and Science, Yale University, New Haven, CT 06510-8034, USA

^*

Author to whom correspondence should be addressed.

^†

Xiaowei Yang, Huiming Zhang and Haoyu Wei are co-first authors which contributes equally to this work.

Entropy 2022, 24(1), 30; https://doi.org/10.3390/e24010030

Submission received: 16 November 2021 / Revised: 13 December 2021 / Accepted: 15 December 2021 / Published: 24 December 2021

(This article belongs to the Topic Machine and Deep Learning)

Download

Browse Figures

Versions Notes

Abstract

:

This paper aims to estimate an unknown density of the data with measurement errors as a linear combination of functions from a dictionary. The main novelty is the proposal and investigation of the corrected sparse density estimator (CSDE). Inspired by the penalization approach, we propose the weighted Elastic-net penalized minimal

ℓ_{2}

-distance method for sparse coefficients estimation, where the adaptive weights come from sharp concentration inequalities. The first-order conditions holding a high probability obtain the optimal weighted tuning parameters. Under local coherence or minimal eigenvalue assumptions, non-asymptotic oracle inequalities are derived. These theoretical results are transposed to obtain the support recovery with a high probability. Some numerical experiments for discrete and continuous distributions confirm the significant improvement obtained by our procedure when compared with other conventional approaches. Finally, the application is performed in a meteorology dataset. It shows that our method has potency and superiority in detecting multi-mode density shapes compared with other conventional approaches.

Keywords:

density estimation; Elastic-net; measurement errors; support recovery; multi-mode data

1. Introduction

Over the years, the mixture models have been extensively applied to model unknown distributional shapes in astronomy, biology, economics, and genomics (see [1] and references therein). The distributions of real data involving potential complex variables often show multi-mode and heterogeneity. Due to the flexibility, it also appears in various distribution-based statistical techniques, such as cluster analysis, discriminant analysis, survival analysis, and empirical Bayesian inference. Flexible mixture models can naturally represent how the data are generated as mathematical artifacts. Theoretical results show that the mixture can approximate any density in the Euclidean space well, and the amount of the mixture can also be finite (for example, a mixture of several Gaussian distributions). Although the mixture model is inherently attractive to the statistical modeling, it is well-known that it is difficult to infer (see [2,3]). From the computational aspect, the optimization problems of mixture models are non-convex. Although existing computational methods, such as EM and various MCMC algorithms, can make the mixture model fit the data relatively easily. It should be emphasized that the mixture problems are essentially challenging, even unrecognizable, and the number of components (says, the order selection) is hard to determine (see [4]). There is a large amount of literature on its approximation theory, and various methods have been proposed to estimate the components (see [5] and references therein).

The nonparametric method and combinatorial method in density estimation were well studied in [6,7], as well as [8]. These can consistently estimate the number of the mixture’s components when the components have known functional forms. When the number of candidate components is large, the non-parametric method becomes computationally infeasible. Fortunately, the high-dimensional inference would compensate for this gap and guarantee the corrected identification of the mixture components with a probability tending to one. With the advancement of technology, high-dimensional problems have been applied at the forefront of statistical research. The high-dimensional inference method has been applied to the infinite mixture models with a sparse mixture of

p \to \infty

components, which is an interesting and challenging problem (see [9,10]). We propose an improvement of the sparse estimation strategy proposed in [9], in which Bunea et al. propose a

ℓ_{1}

-type penalty [11] to obtain a sparse density estimate (SPADES). At the same time, we add a

ℓ_{2}

-type penalty and extend the oracle-inequality results to our new estimator.

In the real data, we often encounter the situation that the i.i.d. samples

X_{i} = Z_{i} + ε_{i}

are contained by some zero-mean measurement errors

{ε_{i}}_{i = 1}^{n}

; see [12,13,14,15,16]. For density estimation of

{Z_{i}}_{i = 1}^{n}

, if there exists orthogonal basis functions, the estimation method is quite easy. In the measurement errors setting, however, finding an orthogonal-based density function is not easy (see [17]). Ref. [17] suggests the assumption that the conditional distribution function of

X_{i}

given

Z_{i}

is known. This condition is somewhat strong since most conditional distributions cannot obtain the explicit formula (except the Gaussian distribution). To address this predicament, particularly with nonorthogonal base functions, the SPADES model is attractive and makes the situation easier to deal with. Based on the SPADES method, our approach is an Elastic-net calibration approach, which is simpler and more interpreted than the conditional inference procedure proposed by [17]. In this paper, we proposed the corrected loss function to debiasing the measurement errors, and this is motivated by [18]. The main problem of considering measurement errors in various statistical models is that they are responsible for the bias of the classical statistical estimates; this is true, e.g., in linear regression, which has traditionally been the main focus of studies related to measurement errors. Debiasing represents an important task in various statistical models. In linear regression, it can be performed in the basic measurement errors (ME) model, which is also denoted as the Errors-in-variables (EIV or EV) model, if it is possible to estimate the variability of measurement errors (see [19,20]). We derive the real variable selection consistency based on weighted

ℓ_{1}

+

ℓ_{2}

penalty [21]. At the same time, some theoretical results of SPADES only contain the situation of the equal weights setting, which is not plausible in the sense of adaptive (data-dependent) penalized estimation. Moreover, we perform the Poisson mixture model to approximate the complex discrete distribution in the simulation part, while existing papers only emphasize the performance of continuous distribution models. Note that the multivariate kernel density estimator can only deal with a continuous distribution, and it requires a multivariate bandwidths section, while our method is dimensional-free (the number of the required tuning parameters is only two). There has been quite a lot of work in this area, starting with [22].

There are several differences between our article and [9]. The first point is that the upper bound of non-asymptotic oracle inequality in in our Theorems 1 and 2 is tighter than Theorems 1 and 2 in [9], and the optimal weighted tuning parameters are derived. The second point is that the

ℓ_{1}

-penalized techniques are applied in [9] to estimate the sparse density. Still, this paper considers the estimation of density functions in the presence of a classical measurement error. We opt to use an Elastic-net criterion function to estimate the density, which is taken to be approximated by a series of basis functions. The third point is that the tuning parameters are chosen by the coordinate descent algorithm in [9], and the mixture weights are calculated by the generalized bisection method (GBM). However, this paper directly calculates the optimal weights, so our algorithm is more accessible to implement than [9].

This paper is presented as follows. Section 2 introduces the density estimator, which can deal with measurement errors. This section introduces data-dependent weights for the Lasso penalty, and the weights are derived by the event of KKT conditions holding a high probability. In Section 3, we give a condition that can accurately estimate the weights of the mixture, with a probability tending to 1. We show that, in an increasing dimensional mixture model under the local coherence assumption, if the tuning parameter is higher than the noise level, the recovery of the mixture components can hold with a high probability. In Section 4, we study the performance of our approach on artificial data generated from mixture Gaussian or Poisson distributions compared with other conventional methods, which indeed shows the improvement by employing our procedures. Moreover, the simulation also demonstrates that our method is better than the traditional EM algorithm, even under a low-dimensional model. Considering the multi-modal density aspect of the meteorology dataset, our proposed estimator has a stronger ability to detect multiple modes for the underlying distribution compared with other methods, such as SPADES or un-weighted Elastic-net estimator. Section 5 is the summary, and the proof of theoretical results is delivered in the Appendix A.

2. Density Estimation

2.1. Mixture Models

Suppose that

{Z_{i}}_{i = 1}^{n} \in R^{d}

are independent random variables with a common unknown density h. However, the observations are contaminated with measurement errors

{ε_{i}}_{i = 1}^{n}

as latent variables, the observed data are actually

X_{i} = Z_{i} + ε_{i}

. Let

{h_{j}}_{j = 1}^{W}

be a series of density functions (such as Gaussian density or Poisson mass function), and

{h_{j}}_{j = 1}^{W}

are also called basis functions. Assume that the h belongs to the linear combination of

{h_{j}}_{j = 1}^{W}

. The

W : = W_{n}

is a function of n, which is particularly intriguing for us since there may be

W ≫ n

(the high-dimensional setting). Let

β^{*} : = (β_{1}^{*}, \dots, β_{W}^{*}) \in R^{W}

be the unknown true parameter. Assume that

(H.1): the $h : = h_{β^{*}}$ is defined as

$Z \sim h (z) : = h_{β^{*}} (z) = \sum_{j = 1}^{W} β_{j}^{*} h_{j} (z), with \sum_{j = 1}^{W} β_{j}^{*} = 1 .$

(1)

If the base is orthogonal and there are no measurement errors, a perfectly natural method is to estimate h by an orthogonal series of estimators in the form of

h_{\tilde{β}}

, where

\tilde{β}

has the coordinates

{\tilde{β}}_{j} = \frac{1}{n} \sum_{i = 1}^{n} h_{j} (X_{i})

(see [17]). However, this estimator depends on the choice of W, and a data-driven selection of W or the threshold needs to be adaptive. This estimator can only be applied to

W \leq n

. Nevertheless, we want to solve more general problems for

W > n

, and the base functions

{h_{j}}_{j = 1}^{W}

may not orthogonal.

We aim to achieve the best convergence for the estimator when the W is not necessarily less than n. Theorem 33.2 in [5] states that any smooth density can be well-approximated by a finite mixture of some continuous functions. However, Theorem 33.2 in [5] does not confirm how many components W are required for the mixture. Thus, the hypothesis of the increasing-dimensional W is reasonable. For discrete distributions, there is also a similar mixture density approximation—see Remark of Theorem 33.2 in [5].

2.2. The Density Estimation with Measurement Errors

This subsection aims to construct a sparse estimator for the density

h (z) : = h_{β^{*}} (z)

as a linear combination of known densities.

Recall the definition of the

L_{2} (R^{d})

norm

∥ f ∥ = {(\int_{R^{d}} f^{2} (x) d x)}^{\frac{1}{2}}

. For

f, g \in L_{2} (R^{d})

, let

< f, g > = \int_{R^{d}} f (x) g (x) d x

be the inner product. If two functions f and g satisfy

< f, g > = 0

, then we call these two functions are orthogonal. Note that if the density

h (z)

belongs to

L_{2} (R^{d})

and assume that

{X_{i}}_{i = 1}^{n}

has the same distribution X, for any

f \in L_{2}

, we have

< f, h > = \int_{R^{d}} f (x) h (x) d x = E f (X)

. If

h (x)

is the density function for a discrete distribution, the integral is replaced by summation, and we can define the inner product as

< f, h > : = \sum_{k \in Z^{d}} f (k) h (k)

.

For true observations

{Z_{i}}_{i = 1}^{n}

, we minimize the

{∥ h_{β} - h ∥}^{2}

on

β \in R^{W}

to obtain the estimate of

h (z) : = h_{β^{*}} (z)

, i.e., minimizing

\begin{matrix} {∥ h_{β} - h ∥}^{2} = {∥ h ∥}^{2} + {∥ h_{β} ∥}^{2} - 2 < h_{β}, h > = {∥ h ∥}^{2} + {∥ h_{β} ∥}^{2} - 2 E h_{β} (Z) \propto - 2 E h_{β} (Z) + {∥ h_{β} ∥}^{2}, \end{matrix}

which implies that minimizing the

{∥ h_{β} - h ∥}^{2}

is equivalent to minimizing

\begin{matrix} - 2 E h_{β} (Z) + {∥ h_{β} ∥}^{2} \approx - \frac{2}{n} \sum_{i = 1}^{n} h_{β} (Z_{i}) + {∥ h_{β} ∥}^{2} . \end{matrix}

(2)

It is plausible to assign more constrains for the candidate set of

β

in the optimization, for example, the

ℓ_{1}

constrains

{∥ β ∥}_{1} \leq a

, where a is the tuning parameter. More adaptively, we prefer to use the weighted

ℓ_{1}

restriction

\sum_{j = 1}^{W} ω_{j} | β_{j} | \leq a

, where the weights

ω_{j}

’s are data-dependent and will be specified later. From [23], we add Elastic-net penality

2 \sum_{j = 1}^{W} ω_{j} | β_{j} | + c \sum_{j = 1}^{W} β_{j}^{2}

with tuning parameter c, which is in regards to the measurement errors (see [24,25]) for a similar purpose. We would have

c = 0

in the situation without measurement errors. The c indeed becomes larger if the measurement errors become more serious, i.e., we can say that the c is proportional to the increasing variability of the measurement errors. It is different from SPADES since adjusting for the measurement errors is important for accurately describing the relationship between the observed varables and the outcomes of interest.

From the discussion above, now we propose the following Corrected Sparse Density Estimator (CSDE):

\begin{matrix} \hat{β} : = \hat{β} (ω_{1}, \dots, ω_{W}) = \underset{β \in R^{W}}{\arg \min} \{- \frac{2}{n} \sum_{i = 1}^{n} h_{β} (X_{i}) + {∥ h_{β} ∥}^{2} + 2 \sum_{j = 1}^{W} ω_{j} | β_{j} | + c \sum_{j = 1}^{W} β_{j}^{2}\} \end{matrix}

(3)

where the c is the tuning parameter for

ℓ_{2}

-penalty, and the c also presents the correction for adjusting the measurement errors in our observations.

For CSDE, if

{h_{j}}_{j = 1}^{W}

is an orthogonal system, it can be clearly seen that the CSDE estimator is consistent with the soft thresholding estimator, and the explicit solution is

{\hat{β}}_{j} = \frac{{(1 - ω_{j} / | {\tilde{β}}_{j} |)}_{+} {\tilde{β}}_{j}}{1 + c}

, where

{\tilde{β}}_{j} = \frac{1}{n} \sum_{i = 1}^{n} h_{j} (X_{i})

and

x_{+} = \max (0, x)

. In this case, we can see that

ω_{j}

is the threshold of the j-th component of the simplest mean estimator

\tilde{β} = ({\tilde{β}}_{1}, \dots, {\tilde{β}}_{W})

.

From the sub-differential of the convex optimization, the corresponding Karush–Kuhn–Tucker conditions (necessary and sufficient first-order condition) for the minimizer in Equation (3) is

Lemma 1

(KKT conditions, Lemma 4.2 of [26]). Let

k \in {1, 2, \dots, W}

and

c > 0

. Then, a necessary and sufficient condition for CSDE to be a solution of Equation (3)

${\hat{β}}_{k} : \neq 0$ if $\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - \sum_{j = 1}^{W} {\hat{β}}_{j} < h_{j}, h_{k} > - c {\hat{β}}_{k} = w_{k} sign ({\hat{β}}_{k}) .$
${\hat{β}}_{k} = 0$ if $|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - \sum_{j = 1}^{W} {\hat{β}}_{j} < h_{j}, h_{k} > - c {\hat{β}}_{k}| \leq w_{k} .$

Since all values of

β_{j}^{*}

are non-negative, when conducting minimization in Equation (3), we have to put a non-negative restriction for optimizing Equation (3).

Due to the computational feasibility and optimal first-order conditions, we prefer an adaptively weighted Lasso penalty as a convex adaptive

ℓ_{1}

penalization. We require that the larger weights are assigned to the coefficients of unimportant covariates, while significant covariates accompany the smaller weights. So, the weights represent the importance of the covariates. The larger (smaller) weights shrink to zero more easily (difficultly) than the unweighted Lasso, with appropriate or even optimal weights, leading to less bias and more efficient variable selection. The derivation of the weights will be given in Section 3.1.

In the end of this part, we will illustrate that in the mixture models, even without measurement errors, Equation (1) cannot be partially transformed into the linear model

Y = X^{T} β + ε

, where Y is the n-dimensional response variables,

X

is the

W \times n

-dimensional fixed design matrix,

β

is a W-dimensional vector of model parameters, the

ε

is a

n \times 1

-dimensional vector for random error terms with zero mean and finite variance. Consider the least square objective function

U (β)

for estimating

β

,

U (β) = {(Y - X^{T} β)}^{T} (Y - X^{T} β) = - 2 Y^{T} X^{T} β + β^{T} X X^{T} β + Y^{T} Y .

Minimizing

U (β)

is equivalent to minimizing

U^{*} (β)

in Formula (4)

U^{*} (β) = - 2 Y^{T} X^{T} β + β^{T} X X^{T} β .

(4)

Comparing the objective function in Formula (4) with Equation (2), it is easy to obtain

Y = {(\frac{1}{n}, \frac{1}{n}, \dots, \frac{1}{n})}^{T}, β = {(β_{1}, β_{2}, \dots, β_{W})}^{T}, X = (\begin{matrix} h_{1} (X_{1}) & \dots & h_{1} (X_{n}) \\ ⋮ & ⋱ & ⋮ \\ h_{W} (X_{1}) & \dots & h_{W} (X_{n}) \end{matrix}) .

Substituting Y,

X

and

β

into a linear regression model, we obtain

{(\begin{matrix} \frac{1}{n} \\ ⋮ \\ \frac{1}{n} \end{matrix})}_{n \times 1} = {(\begin{matrix} h_{1} (X_{1}) & \dots & h_{W} (X_{1}) \\ ⋮ & ⋱ & ⋮ \\ h_{1} (X_{n}) & \dots & h_{W} (X_{n}) \end{matrix})}_{n \times W} {(\begin{matrix} β_{1} \\ ⋮ \\ β_{W} \end{matrix})}_{W \times 1} + {(\begin{matrix} ε_{1} \\ ⋮ \\ ε_{n} \end{matrix})}_{n \times 1} .

Then,

ε_{i} = \frac{1}{n} - \sum_{j = 1}^{W} β_{j} h_{j} (X_{i}), i = 1, 2, \dots, n .

(5)

It can be seen from Equation (5) that the value of

ε_{i}

is no longer random if

X

was the fixed design matrix. Furthermore, even for a random design

X

, take the expectation on both sides of Equation (5), and one can find that the left side is not equal to the right side, that is,

E (ε_{i}) = 0 = \frac{1}{n} - \sum_{j = 1}^{W} β_{j} E h_{j} (X_{i}) .

It leads to an additional requirement

\sum_{j = 1}^{W} β_{j} E h_{j} (X_{i}) = \frac{1}{n} \to 0

, which is meaningless as

n \to \infty

, since all

β_{j}

and

h_{j}

are positive. This is a contradiction to

\sum_{j = 1}^{W} β_{j} E h_{j} (X_{i}) > 0

for all n.

Both of the two situations above contradict the definition of the assumed linear regression model. Hence, we cannot convert the estimation of Equation (1) into the estimation problem of linear models. Thus, the existing oracle inequalities are not applicable anymore, and we will propose new ones later. However, we can transform the mixture models to a corrected score Dantzig selector, such as in [27]. Although [10] studies the oracle inequalities for adaptive Dantzig density estimation, their study does not contain the error-in-variables framework and the support recovery content.

3. Sparse Mixture Density Estimation

In this section, we will present the oracle inequalities for estimators

\hat{β}

and

h_{\hat{β}}

. The core of this section consists of five main results corresponding to the oracle inequalities for estimated density (Theorems 1 and 2), upper bounds on

ℓ_{1}

-estimation error (Corollaries 1 and 2) and support consistency (Theorem 3) as the byproduct of Corollary 2.

3.1. Data-Dependent Weights

The weights

ω_{j}

’s are chosen adequately such that the KKT conditions for stochastic optimization problems have a high probability of being satisfied.

As mentioned before, the weights in Equation (3) rely on the observed data since we calculate the weights, ensuring the KKT conditions hold with a high probability. The weighted Lasso estimates could have less

ℓ_{1}

estimation error than Lasso estimates (see the simulation part and [28]). Next, we need to consider what kind of data-dependent weight configuration can enable the KKT conditions to be satisfied with a high probability. A way to obtain data-dependent weights is to apply a concentration inequality for a weighted sum of independent random variables. Moreover, the weights should be a known data function without any unknown parameters. A criterion can help obtain the weights grounded on McDiarmid’s inequality (see [29] for more details).

Lemma 2.

Suppose

X_{1}, \dots, X_{n}

are independent random variables, and all values belong to a set A. Let

f : A^{n} \to R

be a function and satisfy the bounded difference conditions

{sup}_{x_{1}, \dots, x_{n}, x_{s}^{'} \in A} | f (x_{1}, \dots, x_{n}) - f (x_{1}, \dots, x_{s - 1}, x_{s}^{'}, x_{s + 1}, \dots, x_{n}) | \leq C_{s},

then for all

t > 0

,

P \{| f (X_{1}, \dots, X_{n}) - E f (X_{1}, \dots, X_{n}) | \geq t\} \leq 2 \exp \{- \frac{2 t^{2}}{\sum_{s = 1}^{n} C_{s}^{2}}\} .

We define the KKT conditions of optimization evaluated at

β^{*}

(it is from the sub-gradient of the optimization function evaluated at

β^{*}

) by the events below:

F_{k} (ω_{k}) : = \{|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - \sum_{j = 1}^{W} β_{j}^{*} < h_{j}, h_{k} > - c β_{k}^{*}| \leq ω_{k}\}, k = 1, 2, \dots, W .

Assume that

(H.2): $\exists L_{k} > 0$ s.t. ${∥h_{k}∥}_{\infty} = \max_{1 \leq i \leq n} | h_{k} (X_{i}) | \leq 2 L_{k}$ ;
(H.3): $0 < \max_{1 \leq j \leq W} | β_{j}^{*} | \leq B$ .

(H.2) is an assumption in sparse

ℓ_{1}

estimation, and the assumption (H.3) is a classical compact parameter space assumption in sparse high-dimensional regressions (see [9,25]).

Next, we check that the event

F_{k} (ω_{k})

is hold with high probability. Note that

E h_{k} (X_{i}) = \sum_{j = 1}^{W} β_{j}^{*} < h_{j}, h_{k} >

(which is free of

X_{i}

), we find

\begin{matrix} \frac{1}{n} |\sum_{i = 1}^{n} h_{k} (X_{i}) - (\sum_{i \neq s}^{n} h_{k} (X_{i}) + h_{k} (X_{s}^{'}))| = \frac{1}{n} |h_{k} (X_{s}) - h_{k} (X_{s}^{'})| \leq \frac{1}{n} (| h_{k} (X_{s}) | + | h_{k} (X_{s}^{'}) |) \leq \frac{4 L_{k}}{n}, \end{matrix}

where the last inequality is due to

| h_{k} (X_{i}) | \leq \max_{1 \leq i \leq n} | h_{k} (X_{i}) | \leq 2 L_{k}

.

Next, we apply the McDiarmid’s inequality on the event

F_{k}^{c} (ω_{k})

by (H.3). Then

\begin{matrix} P (F_{k}^{c} (ω_{k})) & = P \{|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - \sum_{j = 1}^{W} β_{j}^{*} < h_{j}, h_{k} > - c β_{k}^{*}| \geq ω_{k}\} \\ (by (H . 3)) & \leq P \{|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E h_{k} (X_{i})| \geq ω_{k} - c B\} \\ (define {\tilde{ω}}_{k} : = ω_{k} - c B > 0) & \leq 2 \exp \{- \frac{2 {\tilde{ω}}_{k}^{2}}{16 L_{k}^{2} / n}\} = 2 \exp \{- \frac{n {\tilde{ω}}_{k}^{2}}{8 L_{k}^{2}}\} = : \frac{δ}{W}, 0 < δ < 1 . \end{matrix}

Considering the previous line,

ω_{k} : = 2 \sqrt{2} L_{k} \sqrt{\frac{1}{n} \log \frac{2 W}{δ}} + c B = : 2 \sqrt{2} L_{k} v (δ / 2) + c B, where v = v (δ) : = \sqrt{\frac{1}{n} \log \frac{W}{δ}} .

(6)

The weight

ω_{k}

in our paper is different from [9], which gives the un-shift version (

\overset{ˇ}{ω_{k}} = 4 L_{k} \sqrt{\frac{1}{n} \log \frac{W}{δ / 2}}

), due to the Elastic-net penalty. Define the modified KKT conditions:

K_{k} (ω_{k}) : = {| \frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - \sum_{j = 1}^{W} β_{j}^{*} < h_{j}, h_{k} > | \leq {\tilde{ω}}_{k}}, k = 1, 2, \dots, W

(7)

which hold with probability of at least

1 - 2 \exp \{- \frac{n {\tilde{ω}}_{k}^{2}}{8 L_{k}^{2}}\}

.

3.2. Non-Asymptotic Oracle Inequalities

Introduced by [30], oracle inequality is a powerful non-asymptotic and analytical tool that seeks to provide the distance between the obtained estimator and a true estimator. The sharp oracle inequality connects the optimal convergence of an obtained estimator compared with the true parameter (see [31,32]).

For

\forall β \in R^{W}

, let

I (β) = {j \in {1, \dots, W} : β_{j} \neq 0}

be the indices corresponding to the non-zero components of the vector

β

, i.e., the support in mathematical jargon. If there is no ambiguity, we would like to write

I (β^{*})

as

I_{*}

for simplicity. Define

W (β) = \sum_{j = 1}^{W} I (β_{j} \neq 0)

as the number of its non-zero components, where

I (\cdot)

represents the indicative function. Let

σ_{j}^{2} = V a r (h_{j} (X_{1})), 1 \leq j \leq W

.

Below, we will state the non-asymptotic oracle inequalities for

h_{\hat{β}}

(with the probability at least

1 - δ (W, n)

for any integer W and n), which measures the

L_{2}

distance between

h_{\hat{β}}

and h. For

β \in R^{W}

, define the correlation for the two base densities:

h_{i}

and

h_{j}

,

ρ_{W} (i, j) = \frac{< h_{i}, h_{j} >}{∥ h_{i} ∥ ∥ h_{j} ∥}

,

i, j = 1, \dots, W .

Our results will be established under the local coherence condition, and we define the maximal local coherence as:

ρ (β) = \max_{i \in I (β)} \max_{j \neq i} | ρ_{W} (i, j) | .

It is easy to see that

ρ (β)

measures the separation of the variables in the set

I (β)

from one another and the rest. The degree of separation is measured in terms of the size of the correlation coefficients. However, the regular condition introduced by this coherence may be too strong. It may exclude cases that the “correlation” can be relatively significant for a small number of pairs

(i, j)

and almost zero otherwise. Thus, we consider the definition of the cumulative local coherence given by [9]:

ρ_{*} (β) = \sum_{i \in I (β)} \sum_{j > i} | ρ_{W} (i, j) |

. Define

H (β) = \max_{j \in I (β)} \frac{ω_{j}}{v (δ / 2) ∥ h_{j} ∥}, F = \max_{1 \leq j \leq W} \frac{v (δ / 2) ∥ h_{j} ∥}{{\tilde{ω}}_{j}} = \max_{1 \leq j \leq W} \frac{∥ h_{j} ∥}{2 \sqrt{2} L_{j}},

where

{\tilde{ω}}_{j} : = 2 \sqrt{2} L_{j} v (δ / 2)

.

By using the definition of

ρ_{*} (β)

and the notations above, we present the main results of this paper, which lays the foundation for the oracle inequality of the estimated mixture coefficients.

Theorem 1.

Under (H.1)–(H.3), let

c = \frac{\min_{1 \leq j \leq W} {{\tilde{ω}}_{j}}}{B}

and a given constant

0 < γ \leq 1

. If the true base functions

{h_{j}}_{j = 1}^{W}

conform to the cumulative local coherence assumption for all

β \in R^{W}

,

12 F H (β) ρ_{*} (β) \sqrt{W (β)} \leq γ,

(8)

then the

\hat{β}

of the optimization problem in Equation (3) has the following oracle inequality with a probability at least

1 - δ

,

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + \frac{α_{o p t 1} (1 - γ)}{(α_{o p t 1} - 1)} \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \frac{α_{o p t 1}}{α_{o p t 1} - 1} \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \\ \leq & \frac{α_{o p t 1} + 1}{α_{o p t 1} - 1} {∥ h_{β} - h ∥}^{2} + \frac{18 α_{o p t 1}^{2}}{α_{o p t 1} - 1} H^{2} (β) v^{2} (δ / 2) W (β), \end{matrix}

where

α_{o p t 1} = 1 + \sqrt{1 + \frac{{∥ h_{β} - h ∥}^{2}}{9 H^{2} (β) v^{2} (δ / 2) W (β)}}

.

It is worthy to note that here we use

\sqrt{W (β)}

instead of

W (β)

, and the latter is used in [9]. The upper bound of the oracle inequality by Theorem 1 is sharper than the upper bound of Theorem 1 in [9]. Further, we give the value of the optimal

α_{o p t 1}

, but [9] did not give it. The reason for this phenomenon is quite clean actually: from the proof, it is due to ineuqality (A5). Now, let us address the sparse Gram matrix

Ψ_{W} = {(< h_{i}, h_{j} >)}_{1 \leq i, j \leq W}

with a small number of non-zero elements in off-diagonal positions, define

ψ_{W} (i, j)

as the element

(i, j)

-th of position

ψ_{W}

. Condition (8) in Theorem 1 can be transformed to the condition

12 S H (β) \sqrt{W (β)} \leq γ,

where the number S is called the sparse index of matrix

Ψ_{W}

, which is defined as

S = | {(i, j) : i, j \in {1, \dots, W}, i > j and ψ_{W} (i, j) \neq 0} |

, where

| A |

is the number of elements of set A.

Sometimes the assumption in Condition (8) does not imply the positive definiteness of

Ψ_{W}

. Next, we give a similar oracle inequality that is valid under the hypothesis that the Gram matrix

Ψ_{W}

is positive definite.

Theorem 2.

Under the assumption of (H.1)–(H.3) and that the Gram matrix

Ψ_{W}

is positive definite with a minimum eigenvalue greater than or equal to

λ_{W} > 0

. For all

β \in R^{W}

, the

\hat{β}

of the optimization problem in Equation (3) has the following oracle inequality with probability at least

1 - δ

,

\begin{matrix} {|| h_{\hat{β}} - h ∥}^{2} + \frac{α_{o p t 2}}{α_{o p t 2} - 1} \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \frac{α_{o p t 2}}{α_{o p t 2} - 1} \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \\ \leq & \frac{α_{o p t 2} + 1}{α_{o p t 2} - 1} {∥ h_{β} - h ∥}^{2} + \frac{576 α_{o p t 2}^{2}}{α_{o p t 2} - 1} \frac{G}{λ_{W}} v^{2} (δ / 2), \end{matrix}

where

G = G (β) : = \sum_{j \in I (β)} L_{j}^{2}

and

α_{o p t 2} = 1 + \sqrt{1 + \frac{{∥ h_{β} - h ∥}^{2}}{288 \frac{G}{λ_{W}} v^{2} (δ / 2)}}

.

Remark 1.

The argument and result of Theorem 1 in this paper is more refined than the conclusion of Theorem 1 in [9] for Lasso by letting

γ = 1 / 2

and

c = 0

. In addition, Theorems 1 and 2 of this paper, respectively, give the optimal α value of the density estimation oracle inequalities, namely

α_{o p t 1}

,

α_{o p t 2}

. It provides a potentially sharper bound for the

ℓ_{1}

-estimation error bound.

Next, we will present the

ℓ_{1}

-estimation error for the estimator

\hat{β}

by Equation (3), and the weights are defined by Equation (6). For the technical point, we consider that

∥ h_{j} ∥ = 1

for all j in Equation (3), i.e., the base functions are normalized. This normalization mimics the covariates’ standardization procedure when doing penalized estimations in generalized linear models. For simplicity, we put

L : = \max_{1 \leq j \leq W} L_{j}

.

For any other choice of

v (δ / 2)

greater than or equal to

\sqrt{\frac{1}{n} \log \frac{2 W}{δ}}

, the conclusions of Section 3 are valid with a high probability. It imposes a restriction on the predictive performance of CSDE. As pointed out in [33], for the

ℓ_{1}

-penalty in the regression, the adjusted sequence

ω_{j}

required for the corrected selection is usually larger than the adjusted sequence

ω_{j}

that produces a good prediction. The selection of the mixture density shown below is also true. Specifically, we will take the value

β = β^{*}

and

v = v (δ / 2 W) = \sqrt{\frac{\log (2 W^{2} / δ)}{n}}

, then

α_{o p t 1}, α_{o p t 2} = 2

. Below, we give the Corollaries of Theorems 1 and 2.

Corollary 1.

Given the same conditions as Theorem 1 with

∥ h_{j} ∥ = 1

for all j, let

α_{o p t 1} = 2

, then we have the following

ℓ_{1}

-estimation error oracle inequality:

\begin{matrix} \sum_{j = 1}^{W} | {\hat{β}}_{j} - β_{j}^{*} | \leq \frac{72 \sqrt{2} v (δ / 2 W) W (β^{*})}{1 - γ} \frac{{(L + L_{\min})}^{2}}{L_{\min}} \end{matrix}

(9)

with probability at least

1 - δ / W

, where

L_{\min} = \min_{1 \leq j \leq W} L_{j}

.

Corollary 2.

Given the same conditions as Theorem 2 with

∥ h_{j} ∥ = 1

for all j, let

α_{o p t 2} = 2

, then we have the following oracle inequality, with probability at least

1 - δ / W

,

\sum_{j = 1}^{W} | {\hat{β}}_{j} - β_{j}^{*} | \leq \frac{288 \sqrt{2} v (δ / 2 W) G^{*}}{L_{\min} λ_{W}}, w h e r e G^{*} = \sum_{j \in I_{*}} L_{j}^{2} .

If the number

W (β^{*})

of the mixture indicator elements is much smaller than

\sqrt{n}

, then inequality (9) guarantees that the estimated

\hat{β}

is close to the true

β^{*}

, and the

ℓ_{1}

-estimation error will be presented in the numerical simulation in Section 4. Our results of Corollaries 1 and 2 are non-asymptotic for any W and n. The oracle inequalities are guiders for us to find an optimal tuning parameter with order

O (\sqrt{\frac{\log W}{n}})

for a sharper estimation error and better prediction performance. This is also an intermediate and crucial result, which leads to the main results of correctly identifying the mixture components in Section 3.3. In the following section, we turn to cope with the identification of

I_{*}

. Corrected components are selected by the proposed oracle inequality for the weighted

ℓ_{1}

+

ℓ_{2}

penalty.

3.3. Corrected Support Identification of Mixture Models

In this section, we will study the results of the support recovery of our CSDE estimator. There are few results on support recovery, while most of the results are the consistency of the

ℓ_{1}

-error and prediction errors. Here, we borrow the framework of [25,33]. They give many proof techniques to deal with the corrected support identification in linear models by

ℓ_{1} + ℓ_{2}

regularization. Let

\hat{I}

be the set of indicators consisting of non-zero elements of

\hat{β}

in the given Equation (3). In other words,

\hat{I}

is an estimate of the true support set

I (β^{*}) : = I_{*}

. We will study

P (\hat{I} = I (β^{*})) \geq 1 - ε

for a given

0 < ε < 1

under some mild conditions.

To identify the

I_{*}

consistently, we need more assumptions about some special correlation conditions than

ℓ_{1}

-error consistency.

Condition (A):

ρ_{*} (β^{*}) \leq \frac{L L_{\min} λ_{W}}{288 G^{*}} .

Moreover, we need an additional condition that the minimal signal should be higher than a threshold level and quantified by order of tuning parameter. Therefore, we state it as follows:

Condition (B):

\min_{j \in I^{*}} | β_{j}^{*} | \geq 4 \sqrt{2} v (\frac{δ}{2 W}) L

, where

v (\frac{δ}{2 W}) : = \sqrt{\frac{1}{n} \log \frac{2 W^{2}}{δ}}

.

When performing simulation, Condition (B) is the theoretical guarantee that the minor magnitude of

β_{j}

must be greater than a threshold value as a minimal signal condition. It is also called the Beta-min condition (see [26]).

Theorem 3.

Let

0 < δ < \frac{1}{2}

and define

ϵ_{k} : = | E [h_{k} (X_{1})] - E [h_{k} (Z_{1})] |

. Assume that both conditions (A) and (B) are true and give the same conditions as Corollary 2, then

P (\hat{I} = I_{*}) \geq 1 - (4 W {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} + 2 δ), w h e r e ϵ_{k}^{*} = ϵ_{k} / \sqrt{2} v (δ / 2 W) L .

Under the Beta-min condition, the support estimation is very close to the true support of

β_{j}^{*}

. The probability of the event

{\hat{I} = I_{*}}

is high when W is growing. The

\hat{β}

recovers the corrected support with probability at least

1 - (4 W {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} + 2 δ)

. The result is non-asymptotic and it is true for any fixed W and n. There is a similar conclusion about support consistency (see Theorem 6 of [25]).

4. Simulation and Real Data Analysis

Ref. [9] proposes the SPADES estimation to deal with the samples for sparse mixture density, and they also derive an algorithm from complementing their theoretical result. Their findings successfully handle the high-dimensional adaptive density estimation to some degree. However, their algorithm is costly and unstable. In this section, we deal with the tuning parameter directly and compare our CSDE method with the SPADES method in [9] and other similar methods. In all cases here, we fix

n = 100

for

W = 81, 131, 211, 321

, which is known as the dimension of the unknown parameter

β^{*}

. The performance of each estimator is evaluated by the

ℓ_{1}

-estimation error and the total variation (TV) distance between the estimator and the true value of

β^{*}

. The total variation (TV) error is defined as:

TV (h_{β^{*}}, h_{\hat{β}}) = \int | h_{β^{*}} (x) - h_{\hat{β}} (x) | d x .

4.1. Tuning Parameter Selection

In [9], the

λ_{1}

is chosen by the coordinate descent method, while the mixture weights are detected by GBM. However, in our article, the optimal weights can be computed directly. Thus, it is much easier to carry out than [9]. The

ℓ_{1}

-penalty term

\sum_{j = 1}^{W} ω_{j} | β_{j} |

with optimal weights are defined by

ω_{k} : = 2 \sqrt{2} L_{k} v (δ / 2) + c B

, where

L_{j} = {∥ h_{j} ∥}_{\infty}

, which usually can be computed easily for a continuous

h_{j}

.

For a discrete base density

{h_{j}}_{j = 1}^{W}

, it can be estimated as the following approximation by using concentration inequalities from Exercise 4.3.3 of [34]:

| med (X) - E (X) | \leq \sqrt{2 V a r (X)}, \bar{x} \approx x_{m e d} (1 + O (n^{- 1})) \approx h^{- 1} (L_{j}) (1 + O (n^{- 1}))

, where

\bar{x}

and

x_{m e d}

represent the sample mean and sample median, respectively, in each simulation, then we only need to select the

λ_{1}

and

c = λ_{2}

, and they can be detected by the nesting coordinate descent method. Moreover, the precision level is assigned as

ξ = 0.001

in our simulation.

4.2. Multi-Modal Distributions

First, we examine our method in a multi-modal Gaussian model that is similar to the first model in [9]. However, our mixture Gaussian has a different variance, which leads the meaningful weights to our estimation. The density function for the i.i.d. sample

Z_{1}, \dots, Z_{n}

is assigned as follows:

h_{β}^{*} (x) = \sum_{j = 1}^{W} β_{j}^{*} ϕ (x | a j, σ_{j}),

(10)

where

ϕ (x | a j, σ_{j})

is the density of

N (a j, σ_{j}^{2})

. However, to estimate

β^{*}

, we only observe i.i.d. data

X_{1}, \dots, X_{n}

with density

g_{β_{j}^{*}} (x) = \sum_{j = 1}^{W} β_{j}^{*} ϕ (x | a j, \sqrt{1.1} σ_{j})

. Put

a = 0.5, n = 100

and

β^{*} = {(0_{8}^{T}, 0.2, 0_{10}^{T}, 0.1, 0_{5}^{T}, 0.1, 0_{10}^{T}, 0.1, 0_{10}^{T}, 0.1, 0_{5}^{T}, 0.15, 0_{10}^{T}, 0.15, 0_{10}^{T}, 0.1, 0_{W - 76}^{T})}^{T},

(11)

with

σ = {(1_{20}^{T}, {0.8}_{6}^{T}, {0.6}_{11}^{T}, {0.4}_{11}^{T}, {0.6}_{6}^{T}, {0.8}_{11}^{T}, {1.2}_{W - 76}^{T})}^{T}

.

We replicate the simulation

N = 100

times. Simulation results are presented in Table 1, from which we can see our method has more and more excellent performances as the W increases, which matches the non-asymptotic results in the previous section. The best performance is far better than the other three methods when

W = 321

. It is worthy to note that the better approximation follows the increase in W, matching Equation (7) and Theorem 3 in our previous section.

We plot the solution path to compare the performance of the four estimators in

β_{j} \in I (β)

for every W in Figure 1 (the result of Elastic-net in

W = 321

is not be shown due to its poor performance.). These figures also provide strong support for the above analysis. Meanwhile, we plot the probability densities of the several estimators and the true density to complement the visual sensory of the advantage in our method in Figure 2. The robust competency of detecting the multi-mode is shown (whereas other methods only find the most strongest signal, ignoring other meaningful but relatively weak signals).

4.3. Mixture of Poisson Distributions

We study the mixture of discrete distribution: the mixture Poisson distribution

h_{β^{*}} (x) = \sum_{j = 1}^{W} β_{j}^{*} p (x | λ_{j} = a \cdot j),

(12)

where

p (x | λ_{j} = a \cdot j)

is the probability mass function (p.m.f.) of the Poisson distribution with mean

λ_{j}

. We set

a = 0.1

and

β^{*} = {(0_{8}^{T}, 0.2, 0_{10}^{T}, 0.1, 0_{5}^{T}, 0.1, 0_{10}^{T}, 0.1, 0_{10}^{T}, 0.1, 0_{5}^{T}, 0.15, 0_{10}^{T}, 0.15, 0_{10}^{T}, 0.1, 0_{W - 75}^{T})}^{T} .

(13)

The adjusted weights are calculated by Equation (3), and in discrete distributions, we define

〈 f, g 〉 = \sum_{k = 1}^{\infty} f (k) g (k)

. Meanwhile, the Poisson random variable with measurement errors can be treated as a negative binomial random variable. Let

n (x | λ_{j}, r)

be the p.m.f. of the Poisson distribution with the mean

λ_{j}

and dispersion parameter r. Suppose the observed data with sample size

n = 100

has the p.m.f.

g_{β^{*}} (x) = \sum_{j = 1}^{W} β_{j}^{*} n (x | λ_{j} = a \cdot j, r),

(14)

where

r = 6

, which leads to an increment of variance from Poisson to the negative binomial distribution. Similarly, we replicate each simulation to estimate the parameter

N = 100

times with the sample from the mixture negative binomial distribution above. The result is shown in Table 2. The result is actually akin to that in the previous mixture Gaussian distribution, while the strong performance of our method is shown clearly when W is considerable.

4.4. Low-Dimensional Mixture Model

Surprisingly, our method has more competitive efficacy than some popular methods (such as EM algorithm), even the dimension W is relatively small. To see this, we introduce the following numerical experiments to estimate the weights of the low-dimensional Gaussian mixture model: the samples

X_{1}, \dots, X_{n}

come from the model:

h_{β^{*}} (x) = \sum_{j = 1}^{W} β_{j}^{*} ϕ (x | μ_{j}, σ_{j}) .

The updated equation for the EM algorithm in t-th step is:

ω_{i j}^{(t)} = \frac{p_{j}^{(t)} ϕ (x_{i}; μ_{t}, σ_{t})}{\sum_{s = 1}^{W} p_{s}^{(t)} ϕ (x_{i}; μ_{s}, σ_{s})}, β_{j}^{(t + 1)} = \frac{\sum_{i = 1}^{W} ω_{i j}^{(t)}}{\sum_{i = 1}^{n} \sum_{j = 1}^{W} ω_{i j}^{(t)}} .

Here, we consider two scenarios:

\begin{matrix} (1) W = 6, β = {(0.3, 0, 0, 0.3, 0, 0.4)}^{T}, μ = {(0, 10, 20, 30, 40, 50)}^{T}, σ = {(1, 2, 3, 4, 5, 6)}^{T}; \\ (2) W = 7, β = {(0.1, 0, 0, 0.8, 0, 0, 0.1)}^{T}, μ = {(0, 1, 2, 3, 4, 5, 6)}^{T}, \\ σ = {(0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.3)}^{T} . \end{matrix}

For each scenario

n = 50

, and the fitter levels (cessation level) in the EM approach and our method are both

ξ = 10^{- 4}

. A well-advised initial value in the EM approach is the equal weight.

We replicate the simulation

N = 100

times, and the optimal tuning parameters stem from the cross-validation (CV). Thus, under each simulation, they are not the same, albeit they are very close to each other. The result can be seen in Table 3.

4.5. Real Data Examples

Practically, we consider using our method to estimate some densities in the environmental science field. Wind, which is mercurial, has been an advisable object to study for a long time in meteorology. Please note that the wind’s speed at one specific location may not be diverse so we will use the wind’s azimuth angle with a more sparse density at two sites in China. Many types of research about the estimated density for wind exist, so there is a possibility of using our approach to cope with some difficulties in meteorology science.

There have been some very credible meteorological dataset. We used the ERA5 hourly data in [35] to continue our analysis. We want chose a continental area and a coastal area in China, so we chose Beijing Nongzhanguan and Qingdao Coast. The locations of these two areas are: (116.3125

^{°}

E, 116.4375

^{°}

E) × (39.8125

^{°}

N, 39.8125

^{°}

N). Take notice that the wind in one day may be highly correlated. Therefore, using the data at a specific time point of each day in a consecutive period as i.i.d. samples is reasonable. The sample histograms at 6 am in Beijing Nongzhanguan and at midnight on the Qingdao Coast are shown in Figure 3. Here, we used the data from 1 January 2013 to 12 December 2015.

As we can see, their density does multi-peak (we used 1095 samples). Now, we can use our approach to estimate the multi-mode densities based on a relatively small size of samples, which is only a tiny part of the whole data from 1 January 2013 to 12 December 2015. Because one year has about 360 days, we may assume that every day is a latent factor that forms the base density. Thus, the model is designed as

h_{β^{*}} (x) = \sum_{j = 1}^{360} β_{j}^{*} ϕ (x | μ_{j}, σ_{j})

with the mean and variance parameters

μ = {(1, 2, \dots, 360)}^{T}

,

σ = t \cdot 1_{360}^{T}

, where t is the bandwidth (or tuning parameter). With the different sub-samples, the computed values are different.

Another critical issue is how to choose the tuning parameters

λ_{1}

and

λ_{2}

. Then, we apply the cross-validation criterion, namely choosing

λ_{i}

, to minimize the difference between the two estimators derived from the separated samples in a random dichotomy.

Now, start to construct the samples for the estimating procedure. Assume that an observatory wants to figure out some information about the two areas’ wind. However, it does not have intact data due to the limited budget at its inception. The only samples it has are several days’ information each month for the two areas, and these days scatter randomly. Furthermore, sample size

n = 168

exactly. These imperfect data increase the challenge of estimating a trustworthy density. We compared our method with other previous methods, in which appraising the difference between the complete data sample histogram and the estimated density under each method is for the evaluation. Notice that the samples are only a tiny part of the data, so the

n = 168 < M (= 360)

is relatively small. The small sample and large dimension setting coincide with the non-asymptotic theory provided in the previous section. The estimated density has been shown in Figure 4.

In this practical application, our method vindicates its more efficient estimating performance and stability from its propinquity of the complete sample histogram, namely the productive capacity of detecting the shape of the multi-mode density and the stronger inclination to bear a resemblance to each other sub-sample (although some subtle nuances do exist because of the different sub-sample). An alternative approach can be to consider principles and tools of circular statistics, which has been reviewed in [36].

5. Summary and Discussions

The paper deals with the deconvolution problem using Lasso-type methods: the observations

X_{1}, \dots, X_{n}

are independent and generated from

X_{i} = Z_{i} + ε_{i}

, and the goal is to estimate the unknown density h of the

Z_{i}

. We assume that the function h can be written as

h (\cdot) = h_{β^{*}} (\cdot) = \sum_{j = 1}^{W} β_{j}^{*} h_{j} (\cdot)

based on some functions

{h_{j}}_{j = 1}^{W}

from a specific dictionary and propose estimating the coefficients of this decomposition with the Elastic-net method. For this estimator, we show that under some classical assumptions of the model, such as coherence of the Gram matrix, finite sample bounds for the estimation and the prediction errors valid with a relatively high probability can be obtained. Moreover, we prove a variable selection consistency result under a beta-min condition and conduct an extensive numerical study. The following estimation problem is also similar to the CSDE.

For future study, it is also interesting and meaningful to do hypothesis testing about the coefficients

β^{*} \in R^{W}

in sparse mixture models. For a general function

h : R^{W} \mapsto R^{m}

and a nonempty closed set

Ω \in R^{m}

, we can consider

H_{0} : h (β^{*}) \in Ω vs . H_{1} : h (β^{*}) \notin Ω .

It is possible to use [37] as a general approach to hypothesis testing within models with measurement errors.

Author Contributions

Conceptualization, H.Z.; Data curation, X.Y.; Formal analysis, X.Y. and S.Z.; Funding acquisition, X.Y.; Investigation, X.Y. and S.Z.; Methodology, X.Y., H.Z., H.W. and S.Z.; Project administration, H.Z.; Resources, H.W.; Software, H.W.; Supervision, X.Y. and H.Z.; Visualization, H.W.; Writing—original draft, X.Y., H.Z., H.W. and S.Z.; Writing—review and editing, H.Z. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

Xiaowei Yang is supported in part by the General Research Project of Chaohu University (XLY-201906), Chaohu University Applied Curriculum Development Project (ch19yykc21), Key Project of Natural Science Foundation of Anhui Province Colleges and Universities (KJ2019A0683), Key Scientific Research Project of Chaohu University (XLZ-202105). Huiming Zhang is supported in part by the University of Macau under UM Macao Talent Program (UMMTP-2020-01). This work also is supported in part by the National Natural Science Foundation of China (Grant No. 11701109, 11901124) and the Guangxi Science Foundation (Grant No. 2018GXNSFAA138164).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The authors would like to thank Song Xi Chen’s Group https://songxichen.com/(accessed on 20 December 2021) for sharing the meteorological dataset.

Acknowledgments

The Appendix includes the proofs of the lemmas, corollaries and theorems in the main body.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

For convenience, we first give a preliminary lemma and proof. Define the random variables

M_{j} = \frac{1}{n} \sum_{i = 1}^{n} \{h_{j} (X_{i}) - E h_{j} (X_{i})\} .

Consider the event

E

by

E = ⋂_{j = 1}^{W} {2 | M_{j} | \leq {\tilde{ω}}_{j}}

, where

{\tilde{ω}}_{k} : = 2 \sqrt{2} L_{k} \sqrt{\frac{1}{n} \log \frac{W}{δ / 2}} = : 2 \sqrt{2} L_{k} v (δ / 2)

. Then, we have the following lemma, which is cornerstone for the proofs in below.

Lemma A1.

Suppose

\max_{1 \leq j \leq W} L_{j} < \infty

and

c = \frac{\min_{1 \leq j \leq W} {{\tilde{ω}}_{j}}}{B}

, for any

β \in R^{W}

on the event

E

, we have

{∥ h_{\hat{β}} - h ∥}^{2} + \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \leq {∥ h_{β} - h ∥}^{2} + 6 \sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} | .

Appendix A.1. Proof of Lemma A1

According to the definition of

\hat{β}

, for any

β \in R^{W}

, we find

- \frac{2}{n} \sum_{i = 1}^{n} h_{\hat{β}} (X_{i}) + {∥ h_{\hat{β}} ∥}^{2} + 2 \sum_{j = 1}^{W} ω_{j} | {\hat{β}}_{j} | + c \sum_{j = 1}^{W} {\hat{β}}_{j}^{2} \leq - \frac{2}{n} \sum_{i = 1}^{n} h_{β} (X_{i}) + {∥ h_{β} ∥}^{2} + 2 \sum_{j = 1}^{W} ω_{j} | β_{j} | + c \sum_{j = 1}^{W} β_{j}^{2} .

Then

\begin{matrix} {∥ h_{\hat{β}} ∥}^{2} - {∥ h_{β} ∥}^{2} \leq \frac{2}{n} \sum_{i = 1}^{n} h_{\hat{β}} (X_{i}) - \frac{2}{n} \sum_{i = 1}^{n} h_{β} (X_{i}) + 2 \sum_{j = 1}^{W} ω_{j} | β_{j} | - 2 \sum_{j = 1}^{W} ω_{j} | {\hat{β}}_{j} | + c \sum_{j = 1}^{W} β_{j}^{2} - c \sum_{j = 1}^{W} {\hat{β}}_{j}^{2} . \end{matrix}

Note that

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} & = {∥ h_{\hat{β}} - h_{β} + h_{β} - h ∥}^{2} = {∥ h_{\hat{β}} - h_{β} ∥}^{2} + {∥ h_{β} - h ∥}^{2} + 2 < h_{β} - h, h_{\hat{β}} - h_{β} > \\ = {∥ h_{β} - h ∥}^{2} - 2 < h, h_{\hat{β}} - h_{β} > + 2 < h_{β}, h_{\hat{β}} - h_{β} > + {∥ h_{\hat{β}} - h_{β} ∥}^{2} \\ = {∥ h_{β} - h ∥}^{2} - 2 < h, h_{\hat{β}} - h_{β} > + {∥ h_{\hat{β}} ∥}^{2} - {∥ h_{β} ∥}^{2} . \end{matrix}

Combining the two result above, we obtain

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} & \leq {∥ h_{β} - h ∥}^{2} + 2 \sum_{j = 1}^{W} ω_{j} | β_{j} | - 2 \sum_{j = 1}^{W} ω_{j} | {\hat{β}}_{j} | + c \sum_{j = 1}^{W} β_{j}^{2} - c \sum_{j = 1}^{W} {\hat{β}}_{j}^{2} \\ - 2 < h, h_{\hat{β}} - h_{β} > + \frac{2}{n} \sum_{i = 1}^{n} h_{\hat{β}} (X_{i}) - \frac{2}{n} \sum_{i = 1}^{n} h_{β} (X_{i}) . \end{matrix}

(A1)

According to the definition of

h_{β} (x)

, it gives

h_{β} (x) = \sum_{j = 1}^{W} β_{j} h_{j} (x)

with

β = (β_{1}, \dots, β_{W})

. For the three terms in Equation (A1), we have

\begin{matrix} - 2 < h, h_{\hat{β}} - h_{β} > + \frac{2}{n} \sum_{i = 1}^{n} h_{\hat{β}} (X_{i}) - \frac{2}{n} \sum_{i = 1}^{n} h_{β} (X_{i}) \\ = & 2 \cdot \frac{1}{n} \sum_{i = 1}^{n} (\sum_{j = 1}^{W} {\hat{β}}_{j} h_{j} (X_{i}) - \sum_{j = 1}^{W} β_{j} h_{j} (X_{i})) - 2 E (h_{β^{'}} - h_{β}) (X_{i}) |_{β^{'} = \hat{β}} \\ = & 2 \sum_{j = 1}^{W} \frac{1}{n} \sum_{i = 1}^{n} h_{j} (X_{i}) ({\hat{β}}_{j} - β_{j}) - 2 \sum_{j = 1}^{W} E [h_{j} (X_{i})] ({\hat{β}}_{j} - β_{j}) \\ = & 2 \sum_{j = 1}^{W} (\frac{1}{n} \sum_{i = 1}^{n} h_{j} (X_{i}) - E [h_{j} (X_{i})]) ({\hat{β}}_{j} - β_{j}) . \end{matrix}

Then

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} & \leq {∥ h_{β} - h ∥}^{2} + 2 \sum_{j = 1}^{W} (\frac{1}{n} \sum_{i = 1}^{n} h_{j} (X_{i}) - E [h_{j} (X_{i})]) ({\hat{β}}_{j} - β_{j}) \\ + 2 \sum_{j = 1}^{W} ω_{j} | β_{j} | - 2 \sum_{j = 1}^{W} ω_{j} | {\hat{β}}_{j} | + c \sum_{j = 1}^{W} β_{j}^{2} - c \sum_{j = 1}^{W} {\hat{β}}_{j}^{2} . \end{matrix}

Conditioning on

E

, we have

{∥ h_{\hat{β}} - h ∥}^{2} \leq {∥ h_{β} - h ∥}^{2} + \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + 2 \sum_{j = 1}^{W} ω_{j} (| β_{j} | - | {\hat{β}}_{j} |) + c \sum_{j = 1}^{W} (β_{j}^{2} - {\hat{β}}_{j}^{2}) .

We add

\sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + c \sum_{j = 1}^{W} {(β_{j} - {\hat{β}}_{j})}^{2}

to both sides of the inequality, it gives

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + c \sum_{j = 1}^{W} {(β_{j} - {\hat{β}}_{j})}^{2} \\ \leq & {∥ h_{β} - h ∥}^{2} + 2 \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + 2 \sum_{j = 1}^{W} ω_{j} (| β_{j} | - | {\hat{β}}_{j} |) + c \sum_{j = 1}^{W} (β_{j}^{2} - {\hat{β}}_{j}^{2}) + c \sum_{j = 1}^{W} {(β_{j} - {\hat{β}}_{j})}^{2} . \end{matrix}

Note that

\begin{matrix} c [\sum_{j = 1}^{W} (β_{j}^{2} - {\hat{β}}_{j}^{2}) + \sum_{j = 1}^{W} {(β_{j} - {\hat{β}}_{j})}^{2}] = c [\sum_{j = 1}^{W} (β_{j}^{2} - {\hat{β}}_{j}^{2} + β_{j}^{2} - 2 β_{j} {\hat{β}}_{j} + {\hat{β}}_{j}^{2})] \\ = & 2 c \sum_{j = 1}^{W} β_{j} (β_{j} - {\hat{β}}_{j}) = 2 c \sum_{j \in I (β)} β_{j} (β_{j} - {\hat{β}}_{j}) \leq 2 c B \sum_{j \in I (β)} | β_{j} - {\hat{β}}_{j} | \leq 2 \sum_{j \in I (β)} {\tilde{ω}}_{j} | β_{j} - {\hat{β}}_{j} |, \end{matrix}

where the last inequality is due to the assumption

c = \frac{\min_{1 \leq j \leq W} {{\tilde{ω}}_{j}}}{B} \leq \frac{{\tilde{ω}}_{j}}{B}

. Thus, we obtain

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + c \sum_{j = 1}^{W} {({\hat{β}}_{j} - β_{j})}^{2} \\ \leq & {∥ h_{β} - h ∥}^{2} + 2 \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + 2 \sum_{j = 1}^{W} ω_{j} (| β_{j} | - | {\hat{β}}_{j} |) + 2 \sum_{j \in I (β)} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | \\ \leq & {∥ h_{β} - h ∥}^{2} + 2 \sum_{j = 1}^{W} ω_{j} | {\hat{β}}_{j} - β_{j} | + 2 \sum_{j = 1}^{W} ω_{j} (| β_{j} | - | {\hat{β}}_{j} |) + 2 \sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} |, \end{matrix}

where the last inequality follows from

{\tilde{ω}}_{j} \leq ω_{j}

for all j.

We know

β_{j} \neq 0

if

j \in I (β)

, and

β_{j} = 0

if

j \notin I (β)

. Considering

| β_{j} | - | {\hat{β}}_{j} | \leq | {\hat{β}}_{j} - β_{j} |

for all j, we have

2 \sum_{j = 1}^{W} ω_{j} | {\hat{β}}_{j} - β_{j} | + 2 \sum_{j = 1}^{W} ω_{j} (| β_{j} | - | {\hat{β}}_{j} |) \leq 4 \sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} | .

Then

\begin{matrix} || h_{\hat{β}} {- h ∥}^{2} + \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + c \sum_{j = 1}^{W} {({\hat{β}}_{j} - β_{j})}^{2} \\ \leq ∥ h_{β} {- h ∥}^{2} + 4 \sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} | + 2 \sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} | \\ = ∥ h_{β} {- h ∥}^{2} + 6 \sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} | . \end{matrix}

Appendix A.2. Proof of Theorems

According to

{\tilde{ω}}_{j} = 2 \sqrt{2} L_{j} \sqrt{\frac{1}{n} \log \frac{2 W}{δ}}

in Equation (6), the sum of the independent random variables

ζ_{i j} = h_{j} (X_{i}) - E h_{j} (X_{i})

is determined by Hoeffding’s inequality, and

| h_{j} (X_{i}) | \leq 2 L_{j}

. We obtain

\begin{matrix} P (E^{c}) = P (⋃_{j = 1}^{W} {2 | M_{j} | > {\tilde{ω}}_{j}}) & \leq \sum_{j = 1}^{W} P (2 | M_{j} | > {\tilde{ω}}_{j}) \leq 2 \sum_{j = 1}^{W} \exp (- \frac{2 n^{2} \cdot {\tilde{ω}}_{j}^{2} / 4}{4 n L_{j}^{2}}) \\ = 2 \sum_{j = 1}^{W} \exp (- \log \frac{2 W}{δ}) = 2 W \cdot \frac{δ}{2 W} = δ . \end{matrix}

Appendix A.3. Proof of Theorem 1

By Lemma A1, we need an upper bound on

\sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} |

. For easy notation, let

q_{j} = {\hat{β}}_{j} - β_{j}, Q (β) = \sum_{j \in I (β)} | q_{j} | ∥ h_{j} ∥, Q = \sum_{j = 1}^{W} | q_{j} | ∥ h_{j} ∥ .

According to the definition of

H (β)

, that is,

H (β) = \max_{j \in I (β)} \frac{ω_{j}}{v (δ / 2) ∥ h_{j} ∥}

, we have

\sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} | \leq v (δ / 2) H (β) Q (β) .

(A2)

Let

Q_{*} (β) : = \sqrt{\sum_{j \in I (β)} q_{j}^{2} {∥ h_{j} ∥}^{2}}

. Using the definition of

h_{β} (x)

, we obtain

Q_{*}^{2} (β) = \sum_{j \in I (β)} q_{j}^{2} {∥ h_{j} ∥}^{2} = {∥ h_{\hat{β}} - h_{β} ∥}^{2} - \sum_{i, j \notin I (β)} q_{i} q_{j} < h_{i}, h_{j} > - (2 \sum_{i \notin I (β)} \sum_{j \in I (β)} q_{i} q_{j} < h_{i}, h_{j} > + \underset{i, j \in I (β), i \neq j}{\sum \sum} q_{i} q_{j} < h_{i}, h_{j} >) .

As

i, j \notin I (β)

,

β_{i} = β_{j} = 0

, it is easy to see

\underset{i, j \notin I (β)}{\sum \sum} < h_{i}, h_{j} > q_{i} q_{j} \geq 0

. Observe that

\begin{matrix} 2 \sum_{i \notin I (β)} \sum_{j \in I (β)} q_{i} q_{j} < h_{i}, h_{j} > + \underset{i, j \in I (β), i \neq j}{\sum \sum} q_{i} q_{j} < h_{i}, h_{j} > \\ = & 2 \sum_{i \notin I (β)} \sum_{j \in I (β)} q_{i} q_{j} < h_{i}, h_{j} > + 2 \underset{i, j \in I (β), j > i}{\sum \sum} q_{i} q_{j} < h_{i}, h_{j} > = 2 \underset{i \in I (β), j > i}{\sum \sum} q_{i} q_{j} < h_{i}, h_{j} > . \end{matrix}

By the definitions of

ρ_{W} (i, j)

and

ρ_{*} (β)

, then

\begin{matrix} Q_{*}^{2} (β) & \leq {∥ h_{\hat{β}} - h_{β} ∥}^{2} + 2 \underset{i \in I (β), j > i}{\sum \sum} | q_{i} | | q_{j} | ∥ h_{i} ∥ ∥ h_{j} ∥ \frac{< h_{i}, h_{j} >}{∥ h_{i} ∥ ∥ h_{j} ∥} \\ \leq {∥ h_{\hat{β}} - h_{β} ∥}^{2} + 2 ρ_{*} (β) \max_{i \in I (β), j > i} | q_{i} | ∥ h_{i} ∥ | q_{j} | ∥ h_{j} ∥ . \end{matrix}

By

\max_{i \in I (β)} | q_{i} | ∥ h_{i} ∥ \leq \sqrt{\sum_{j \in I (β)} q_{j}^{2} {∥ h_{j} ∥}^{2}} = Q_{*} (β), \max_{i \in I (β), j > i} | q_{j} | ∥ h_{j} ∥ \leq \sum_{j = 1}^{W} | q_{j} | ∥ h_{j} ∥

,

\begin{matrix} Q_{*}^{2} (β) & \leq {∥ h_{\hat{β}} - h_{β} ∥}^{2} + 2 ρ_{*} (β) Q_{*} (β) \sum_{j = 1}^{W} | q_{j} | ∥ h_{j} ∥ = {∥ h_{\hat{β}} - h_{β} ∥}^{2} + 2 ρ_{*} (β) Q_{*} (β) Q . \end{matrix}

(A3)

By Equation (A3), we can obtain

Q_{*}^{2} (β) - 2 ρ_{*} (β) Q_{*} (β) Q - {∥ h_{\hat{β}} - h_{β} ∥}^{2} \leq 0 .

To find the upper bound of

Q_{*} (β)

, applying the properties of the quadratic inequality to the above formula, we obtain that

\begin{matrix} Q_{*} (β) \leq ρ_{*} (β) Q + \sqrt{ρ_{*}^{2} (β) Q^{2} + {∥ h_{\hat{β}} - h_{β} ∥}^{2}} & \leq ρ_{*} (β) Q + [ρ_{*} (β) Q + ∥ h_{\hat{β}} - h_{β} ∥] \\ \leq 2 ρ_{*} (β) Q + ∥ h_{\hat{β}} - h_{β} ∥ . \end{matrix}

(A4)

Note that

W (β) = | I (β) | = \sum_{j = 1}^{W} I (β_{j} \neq 0)

, employing the Cauchy–Schwarz inequalities, we have

\begin{matrix} W (β) \sum_{j \in I (β)} {| q_{j} |}^{2} {∥ h_{j} ∥}^{2} & = \sum_{j \in I (β)} I^{2} (j \in I (β)) \sum_{j \in I (β)} {| q_{j} |}^{2} {∥ h_{j} ∥}^{2} \\ \geq {(\sum_{j \in I (β)} I ({j \in I (β)) | q_{j} | ∥ h_{j} ∥)}^{2} = Q^{2} (β) . \end{matrix}

Then,

Q_{*}^{2} (β) = \sum_{j \in I (β)} {| q_{j} |}^{2} {∥ h_{j} ∥}^{2} \geq Q^{2} (β) / W (β) .

In combination with Equation (A4), we can obtain

Q (β) / \sqrt{W (β)} \leq Q_{*} (β) \leq 2 ρ_{*} (β) Q + ∥ h_{\hat{β}} - h_{β} ∥

. Therefore,

\begin{matrix} Q (β) \leq 2 ρ_{*} (β) \sqrt{W (β)} Q + \sqrt{W (β)} ∥ h_{\hat{β}} - h_{β} ∥ . \end{matrix}

(A5)

By Lemma A1 and Equation (A2), we have the following inequality with probability exceeding

1 - δ

,

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \leq {∥ h_{β} - h ∥}^{2} + 6 \sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} | \\ \leq & {∥ h_{β} - h ∥}^{2} + 6 v (δ / 2) H (β) Q (β) \\ \leq & {∥ h_{β} - h ∥}^{2} + 6 v (δ / 2) H (β) [2 ρ_{*} (β) \sqrt{W (β)} \sum_{j = 1}^{W} | q_{j} | ∥ h_{j} ∥ + \sqrt{W (β)} ∥ h_{\hat{β}} - h_{β} ∥] (by (A 5)) \\ = & {∥h_{β} - h∥}^{2} + 12 v (δ / 2) H (β) ρ_{*} (β) \sqrt{W (β)} \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | \frac{∥h_{j}∥}{{\tilde{ω}}_{j}} \\ + & 6 v (δ / 2) H (β) \sqrt{W (β)} ∥ h_{\hat{β}} - h_{β} ∥ \\ \leq & {∥h_{β} - h∥}^{2} + 12 F H (β) ρ_{*} (β) \sqrt{W (β)} \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + 6 v (δ / 2) H (β) \sqrt{W (β)} ∥ h_{\hat{β}} - h_{β} ∥ \\ \leq & {∥h_{β} - h∥}^{2} + γ \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + 6 v (δ / 2) H (β) \sqrt{W (β)} ∥ h_{\hat{β}} - h_{β} ∥, \end{matrix}

where the second last inequality follows from the definition of

F : = \max_{1 \leq j \leq W} \frac{v (δ / 2) ∥h_{j}∥}{{\tilde{ω}}_{j}}

, and the last inequality is derived by the assumption

12 F H (β) ρ_{*} (β) \sqrt{W (β)} \leq γ, (0 < γ \leq 1) .

Further, we can find that, with probability at least

1 - δ

,

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + (1 - γ) \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \\ \leq & {∥ h_{β} - h ∥}^{2} + 6 v (δ / 2) H (β) \sqrt{W (β)} ∥ h_{\hat{β}} - h_{β} ∥ \\ = & {∥ h_{β} - h ∥}^{2} + 6 v (δ / 2) H (β) \sqrt{W (β)} ∥ h_{\hat{β}} - h + h - h_{β} ∥ \\ \leq & {∥ h_{β} - h ∥}^{2} + 6 v (δ / 2) H (β) \sqrt{W (β)} ∥ h_{\hat{β}} - h ∥ + 6 v (δ / 2) H (β) \sqrt{W (β)} ∥ h - h_{β} ∥ . \end{matrix}

Using the elementary inequality

2 s t \leq s^{2} / α + α t^{2} (s, t \in R, α > 1)

to the last two terms of the above inequality, it yields

\begin{matrix} 2 {3 v (δ / 2) H (β) \sqrt{W (β)}} ∥ h_{\hat{β}} - h ∥ \leq α \cdot 9 v^{2} (δ / 2) H^{2} (β) W (β) + {∥ h_{\hat{β}} - h ∥}^{2} / α, \\ 2 {3 v (δ / 2) H (β) \sqrt{W (β)}} ∥ h_{β} - h ∥ \leq α \cdot 9 v^{2} (δ / 2) H^{2} (β) W (β) + {∥ h_{β} - h ∥}^{2} / α . \end{matrix}

Thus,

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + (1 - γ) \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \\ \leq & {∥ h_{β} - h ∥}^{2} + 18 α v^{2} (δ / 2) H^{2} (β) W (β) + {∥ h_{\hat{β}} - h ∥}^{2} / α + {∥ h_{β} - h ∥}^{2} / α . \end{matrix}

Simplifying, we have

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + \frac{α (1 - γ)}{(α - 1)} \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \frac{α}{α - 1} \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \\ \leq \frac{α + 1}{α - 1} {∥ h_{β} - h ∥}^{2} + \frac{18 α^{2}}{α - 1} H^{2} (β) v^{2} (δ / 2) W (β), α > 1, 0 < γ \leq 1 . \end{matrix}

(A6)

Optimizing

α

to obtain the sharp upper bounds for the above oracle inequality

\begin{matrix} α_{o p t 1} : & = \underset{α > 1}{\arg \min} \{\frac{α + 1}{α - 1} {∥ h_{β} - h ∥}^{2} + \frac{18 α^{2}}{α - 1} H^{2} (β) v^{2} (δ / 2) W (β)\} \\ = 1 + \sqrt{1 + \frac{{∥ h_{β} - h ∥}^{2}}{9 H^{2} (β) v^{2} (δ / 2) W (β)}} \end{matrix}

by the first order condition. To date, Theorem 1 is proved by substituting

α_{o p t 1}

into Equation (A6).

Appendix A.4. Proof of Theorem 2

By the minimal eigenvalue assumption for

ψ_{W}

, we have

\begin{matrix} {∥ h_{β} ∥}^{2} = {|| \sum_{j = 1}^{W} β_{j} h_{j} (x) ||}^{2} = β^{T} ψ_{W} β \geq λ_{W} {∥ β ∥}^{2} \geq λ_{W} \sum_{j \in I (β)} β_{j}^{2} . \end{matrix}

(A7)

Using the definition of

ω_{j}

and assumption

L_{\min} : = \min_{1 \leq j \leq W} L_{j} > 0

,

ω_{j} = 2 L_{j} (\sqrt{\frac{2 \log (2 W / δ)}{n}} + \frac{c B}{2 L_{j}}) \leq 2 L_{j} (\sqrt{\frac{2 \log (2 W / δ)}{n}} + \frac{c B}{2 L_{\min}}) .

Since

c B = {\tilde{ω}}_{\min} = 2 \sqrt{2} L_{\min} v (δ / 2)

and

v (δ / 2) = \sqrt{\frac{\log (2 W / δ)}{n}}

, we have

ω_{j} \leq 4 \sqrt{2} L_{j} v (δ / 2) .

Let

G (β) = \sum_{j \in I (β)} L_{j}^{2}

, by the Cauchy–Schwartz inequality, we obtain

\begin{matrix} 6 & \sum_{j \in I (β)} ω_{j} | {\hat{β}}_{j} - β_{j} | \leq 24 \sqrt{2} v (δ / 2) \sum_{j \in I (β)} L_{j} | {\hat{β}}_{j} - β_{j} | \\ \leq 24 \sqrt{2} v (δ / 2) \sqrt{\sum_{j \in I (β)} L_{j}^{2}} \sqrt{\sum_{j \in I (β)} {({\hat{β}}_{j} - β_{j})}^{2}} \leq 24 \sqrt{2} v (δ / 2) \sqrt{\frac{G (β)}{λ_{W}}} ∥ h_{\hat{β}} - h_{β} ∥, \end{matrix}

(A8)

where the last inequality above is from Equation (A7) due to

{∥ h_{\hat{β}} - h_{β} ∥}^{2} = \underset{1 \leq i, j \leq W}{\sum \sum} ({\hat{β}}_{i} - β_{i}) ({\hat{β}}_{j} - β_{j}) < h_{i}, h_{j} > \geq λ_{W} \sum_{j \in I (β)} {({\hat{β}}_{j} - β_{j})}^{2} .

Let

b (β) : = 12 \sqrt{2} v (δ / 2) \sqrt{\frac{G (β)}{λ_{W}}}

, Lemma 2 implies

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \leq {∥ h_{β} - h ∥}^{2} + 2 b (β) ∥ h_{\hat{β}} - h_{β} ∥ \\ = & {∥ h_{β} - h ∥}^{2} + 2 b (β) (∥ h_{\hat{β}} - h + h - h_{β} ∥) \leq {∥ h_{β} - h ∥}^{2} + 2 b (β) ∥ h_{\hat{β}} - h ∥ + 2 b (β) ∥ h_{β} - h ∥ . \end{matrix}

Using the inequality

2 s t \leq s^{2} / α + α t^{2}

(s, t \in R, α > 1)

for the last two terms on the right side of the above inequality, we find

\begin{matrix} 2 b (β) ∥ h_{\hat{β}} - h ∥ + 2 b (β) ∥ h_{β} - h ∥ & \leq {∥ h_{\hat{β}} - h ∥}^{2} / α + b^{2} (β) α + {∥ h_{β} - h ∥}^{2} / α + b^{2} (β) α \\ = {∥ h_{\hat{β}} - h ∥}^{2} / α + {∥ h_{β} - h ∥}^{2} / α + 2 b^{2} (β) α . \end{matrix}

Thus,

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} & \leq {∥ h_{β} - h ∥}^{2} + {∥ h_{\hat{β}} - h ∥}^{2} / α \\ + {∥ h_{β} - h ∥}^{2} / α + 2 b^{2} (β) α \end{matrix}

gives

\frac{α - 1}{α} {∥ h_{\hat{β}} - h ∥}^{2} + \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \leq \frac{α + 1}{α} {∥ h_{β} - h ∥}^{2} + 2 α b^{2} (β)

. Therefore,

\begin{matrix} {∥ h_{\hat{β}} - h ∥}^{2} + \frac{α}{α - 1} \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j} | + \frac{α}{α - 1} \sum_{j = 1}^{W} c {({\hat{β}}_{j} - β_{j})}^{2} \\ \leq \frac{α + 1}{α - 1} {∥ h_{β} - h ∥}^{2} + \frac{2 α^{2}}{α - 1} b^{2} (β) \\ = \frac{α + 1}{α - 1} {∥ h_{β} - h ∥}^{2} + \frac{576 α^{2}}{α - 1} \frac{G (β)}{λ_{W}} v^{2} (δ / 2) . \end{matrix}

To obtain the sharp upper bounds for the above oracle inequality, we optimize

α

\begin{matrix} α_{o p t 2} : & = \underset{α > 1}{\arg \min} \{\frac{α + 1}{α - 1} {∥ h_{β} - h ∥}^{2} + \frac{576 α^{2}}{α - 1} \frac{G (β)}{λ_{W}} v^{2} (δ / 2)\} = 1 + \sqrt{1 + \frac{{∥ h_{β} - h ∥}^{2}}{288 \frac{G (β)}{λ_{W}} v^{2} (δ / 2)}}, \end{matrix}

by the first-order condition. This completes the proof of Theorem 2.

Appendix A.5. Proof of Corollory 1

Let

{\tilde{ω}}_{\min} : = \min_{1 \leq j \leq W} {\tilde{ω}}_{j}

. We replace

v (δ / 2)

in Theorem 1 by the larger value

v (δ / 2 W)

. Substituting

β = β^{*}

in Theorem 1, we have

\begin{matrix} \frac{α_{o p t 1} (1 - γ)}{α_{o p t 1} - 1} \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j}^{*} | \leq \frac{18 α_{o p t 1}^{2}}{α_{o p t 1} - 1} H^{2} (β^{*}) v^{2} (δ / 2 W) W (β^{*}) \end{matrix}

by

h = h_{β^{*}}

. Since

{\tilde{ω}}_{j} \geq {\tilde{ω}}_{\min}

for all j, we obtain

\begin{matrix} \sum_{j = 1}^{W} | {\hat{β}}_{j} - β_{j}^{*} | \leq \frac{18 α_{o p t 1}}{1 - γ} \cdot \frac{1}{{\tilde{ω}}_{min}} \cdot max_{j \in I (β)} \frac{ω_{j}^{2}}{{∥ h_{j} ∥}^{2}} \cdot W (β^{*}) . \end{matrix}

In this case,

α_{o p t 1} = 2

, and

∥ h_{j} ∥ = 1

; thus,

\begin{matrix} ∥ \hat{β} - β^{*} ∥ & \leq \frac{36}{1 - γ} \cdot \max_{j \in I (β)} \frac{ω_{j}^{2}}{{\tilde{ω}}_{\min}} \cdot W (β^{*}) \\ = \frac{72 \sqrt{2} v (δ / 2 W) W (β^{*})}{1 - γ} \max_{j \in I (β)} \frac{{(L_{j} + L_{\min})}^{2}}{L_{\min}} \leq \frac{72 \sqrt{2} v (δ / 2 W) W (β^{*})}{1 - γ} \frac{{(L + L_{\min})}^{2}}{L_{\min}} \end{matrix}

from

{\tilde{ω}}_{\min} = 2 \sqrt{2} v (δ / 2 W) L_{\min}

and

ω_{j}^{2} = {[2 \sqrt{2} v (δ / 2 W)]}^{2} {[L_{j} + \frac{{\tilde{ω}}_{\min}}{2 \sqrt{2} v (δ / 2 W)}]}^{2} = {[2 \sqrt{2} v (δ / 2 W)]}^{2} {[L_{j} + L_{\min}]}^{2} .

This completes the proof of Corollary 1.

Appendix A.6. Proof of Corollary 2

Let

β = β^{*}

in Theorem 2, with

α_{o p t 2} = 2

, we replace

v (δ / 2)

in Theorem 2 by the larger value

v (δ / 2 W)

, then

\sum_{j = 1}^{W} {\tilde{ω}}_{min} | {\hat{β}}_{j} - β_{j}^{*} | \leq \sum_{j = 1}^{W} {\tilde{ω}}_{j} | {\hat{β}}_{j} - β_{j}^{*} | \leq \frac{576 α_{o p t 2} G^{*}}{λ_{W}} v^{2} (δ / 2 W) .

By the definition of

{\tilde{ω}}_{\min}

, we can obtain

\sum_{j = 1}^{W} | {\hat{β}}_{j} - β_{j}^{*} | \leq \frac{576 α_{o p t 2} G^{*} v^{2} (δ / 2)}{{\tilde{ω}}_{\min} λ_{W}} = \frac{576 \cdot 2 G^{*} v^{2} (δ / 2 W)}{2 \sqrt{2} v (δ / 2 W) L_{\min} λ_{W}} = \frac{288 \sqrt{2} G^{*} v (δ / 2 W)}{L_{\min} λ_{W}} .

This concludes the proof of Corollary 2.

Appendix A.7. Proof of Theorem 3

The following lemma is by virtue of the KKT conditions. It derives a bound of

P (I_{*} ⫅ \hat{I})

, which is easily analyzed.

Lemma A2

(Proposition 3.3 in [33]).

P (I_{*} ⫅ \hat{I}) \leq W (β^{*}) \max_{k \in I_{*}} P ({\hat{β}}_{k} = 0 and β_{k}^{*} \neq 0) .

To present the proof of Theorem 3, we first notice that

P (\hat{I} \neq I_{*}) \leq P (I_{*} ⫅ \hat{I}) + P (\hat{I} ⫅ I_{*}) .

Next, we control the probability on the right side of the above inequality.

For the control of

P (I_{*} ⫅ \hat{I})

, by Lemma A2, it remains to bound

P ({\hat{β}}_{k} = 0 and β_{k}^{*} \neq 0)

.

Below, we will use the conclusion of Lemma 2 (KKT conditions). Recall that

E [h_{k} (Z_{1})] = \sum_{j \in I_{*}} β_{j}^{*} < h_{k}, h_{j} > = \sum_{j = 1}^{W} β_{j}^{*} < h_{k}, h_{j} >

. Since we assume that the density of

Z_{1}

is the mixture density

h_{β^{*}} = \sum_{j \in I_{*}} β_{j}^{*} h_{j}

. Therefore, for

k \in I_{*}

, we have,

\begin{matrix} P ({\hat{β}}_{k} = 0 and β_{k}^{*} \neq 0) = P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - \sum_{j = 1}^{W} {\hat{β}}_{j} < h_{j}, h_{k} >| \leq 2 \sqrt{2} v (δ / 2 W) L_{k}; β_{k}^{*} \neq 0) \\ = & P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (Z_{1})] + E [h_{k} (Z_{1})] - \sum_{j = 1}^{W} {\hat{β}}_{j} < h_{j}, h_{k} >| \leq 2 \sqrt{2} v (δ / 2 W) L_{k}; β_{k}^{*} \neq 0) \\ = & P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (Z_{1})] - \sum_{j = 1}^{W} ({\hat{β}}_{j} - β_{j}^{*}) < h_{j}, h_{k} >| \leq 2 \sqrt{2} v (δ / 2 W) L_{k}; β_{k}^{*} \neq 0) \\ = & P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (Z_{1})] - \sum_{j \neq k}^{W} ({\hat{β}}_{j} - β_{j}^{*}) < h_{j}, h_{k} > + β_{k}^{*} {∥ h_{k} ∥}^{2}| \leq 2 \sqrt{2} v (δ / 2 W) L_{k}) \\ \leq & P (| β_{k}^{*} {∥ h_{k} ∥}^{2} - 2 \sqrt{2} v (δ / 2 W) L_{k} \leq |\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (Z_{1})]| + |\sum_{j \neq k}^{W} ({\hat{β}}_{j} - β_{j}^{*}) < h_{j}, h_{k} >|) \\ (A9) & \leq & P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (Z_{1})]| \geq \frac{| β_{k}^{*} | {∥ h_{k} ∥}^{2}}{2} - \sqrt{2} v (δ / 2 W) L_{k}) \\ (A10) & + & P (|\sum_{j \neq k}^{W} ({\hat{β}}_{j} - β_{j}^{*}) < h_{j}, h_{k} >| \geq \frac{| β_{k}^{*} | {∥ h_{k} ∥}^{2}}{2} - \sqrt{2} v (δ / 2 W) L_{k}) . \end{matrix}

Similar to Lemma 2, for Equation (A9), we use Hoeffding’s inequality. Since

∥ h_{k} ∥ = 1

for all k. Put

ϵ_{k} : = | E [h_{k} (X_{1})] - E [h_{k} (Z_{1})] |

. Consider Condition (B),

\min_{k \in I_{*}} | β_{k}^{*} | \geq 4 \sqrt{2} v (δ / 2 W) L

and

L \geq \max_{1 \leq k \leq W} L_{k}

, then we have

\begin{matrix} P & (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (Z_{1})]| \geq \frac{| β_{k}^{*} | {∥ h_{k} ∥}^{2}}{2} - \sqrt{2} v (δ / 2 W) L_{k}) \\ = P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (X_{1})] + E [h_{k} (X_{1})] - E [h_{k} (Z_{1})]| \geq \frac{| β_{k}^{*} | {∥ h_{k} ∥}^{2}}{2} - \sqrt{2} v (δ / 2 W) L_{k}) \\ \leq P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (X_{1})]| \geq \frac{| β_{k}^{*} |}{2} - \sqrt{2} v (δ / 2 W) L - ϵ_{k}) \\ \leq P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (X_{1})]| \geq 2 \sqrt{2} v (δ / 2 W) L - \sqrt{2} v (δ / 2 W) L - ϵ_{k}) \\ = P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (X_{1})]| \geq \sqrt{2} v (δ / 2 W) L - ϵ_{k}) \\ = P (|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (X_{1})]| \geq \sqrt{2} v (δ / 2 W) L (1 - ϵ_{k}^{*})) (let ϵ_{k}^{*} = ϵ_{k} / \sqrt{2} v (δ / 2 W) L) \\ \leq 2 \exp \{- \frac{4 n^{2} v^{2} (δ / 2 W) L^{2} {(1 - ϵ_{k}^{*})}^{2}}{4 n L^{2}}\} \\ = 2 \exp \{- n {(1 - ϵ_{k}^{*})}^{2} \frac{\log (2 W^{2} / δ)}{n}\} = 2 {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} . \end{matrix}

(A11)

For the upper bound of Equation (A10), using Condition (A) and Condition (B), by the definitions of

ρ_{*} (β^{*})

and

W (β^{*})

, we obtain

\begin{matrix} P (|\sum_{j \neq k}^{W} ({\hat{β}}_{j} - β_{j}^{*}) < h_{j}, h_{k} >| \geq \frac{| β_{k}^{*} | {∥ h_{k} ∥}^{2}}{2} - \sqrt{2} v (δ / 2 W) L_{k}) \\ = & P (|\sum_{j \neq k}^{W} ({\hat{β}}_{j} - β_{j}^{*}) < h_{j}, h_{k} >| \geq \frac{| β_{k}^{*} |}{2} - \sqrt{2} v (δ / 2 W) L_{k}) \\ \leq & P (|\sum_{j \neq k}^{W} ({\hat{β}}_{j} - β_{j}^{*}) \frac{< h_{j}, h_{k} >}{∥ h_{j} ∥ ∥ h_{k} ∥} \cdot ∥ h_{j} ∥ ∥ h_{k} ∥| \geq 2 \sqrt{2} v (δ / 2 W) L - \sqrt{2} v (δ / 2 W) L) \\ \leq & P (ρ_{*} (β^{*}) \sum_{j \neq k}^{W} |{\hat{β}}_{j} - β_{j}^{*}| \geq \sqrt{2} v (δ / 2 W) L) \leq P (\sum_{j = 1}^{W} |{\hat{β}}_{j} - β_{j}^{*}| \geq \frac{\sqrt{2} v (δ / 2 W) L}{ρ_{*} (β^{*})}) \\ \leq & P (\sum_{j = 1}^{W} |{\hat{β}}_{j} - β_{j}^{*}| \geq \frac{288 \sqrt{2} G^{*} v (δ / 2 W)}{L_{\min} λ_{W}}) \leq \frac{δ}{W} . \end{matrix}

where the second last inequality is by Condition (A), and the last inequality above is by using the

ℓ_{1}

-estimation oracle inequality in Corollary 2.

Therefore, by the definition of

W (β^{*})

,

W (β^{*}) = | I_{*} | \leq W

, we find

\begin{matrix} P (I_{*} ⫅ \hat{I}) \leq W (β^{*}) \max_{k \in I_{*}} P ({\hat{β}}_{k} = 0) & \leq W (β^{*}) 2 {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} + W (β^{*}) \frac{δ}{W} \\ \leq 2 W {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} + W \frac{δ}{W} = 2 W {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} + δ . \end{matrix}

For the control of

P (\hat{I} ⫅ I_{*})

, let

\begin{matrix} \tilde{η} = \underset{η \in R^{W (β^{*})}}{\arg \min} z (η), \end{matrix}

(A12)

where

z (η) = - \frac{2}{n} \sum_{i = 1}^{n} \sum_{j \in I_{*}} η_{j} h_{j} (X_{i}) + {∥ \sum_{j \in I_{*}} η_{j} h_{j} ∥}^{2} + \sum_{j \in I_{*}} (4 \sqrt{2} v (δ / 2) L_{j} + 2 c B) | η_{j} | + c \sum_{j \in I_{*}} η_{j}^{2} .

Consider the following random event

\begin{matrix} ⋂_{k \notin I_{*}} & \{|- \frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) + \sum_{j \in I_{*}} {\tilde{η}}_{j} < h_{j}, h_{k} >| \leq 2 \sqrt{2} v (δ / 2) L_{k}\} \\ \subseteq ⋂_{k \notin I_{*}} \{|- \frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) + \sum_{j \in I_{*}} {\tilde{η}}_{j} < h_{j}, h_{k} >| \leq 2 \sqrt{2} v (δ / 2 W) L\} : = Ψ . \end{matrix}

(A13)

Let

\bar{η} \in R^{W}

be a vector corresponding to the component of the index set

I_{*}

having

\tilde{η}

given by Equation (A12), and the component at other corresponding positions is 0. By Lemma 1, we know that

\bar{η} \in R^{W}

is a solution of Equation (3) on the event

Ψ

. It is recalled that

\hat{β} \in R^{W}

, which is also a solution of Equation (3). Through the definition of the indicator set

\hat{I}

, we have

{\hat{β}}_{k} \neq 0

for

k \in \hat{I}

. By construction, we obtain

{\tilde{η}}_{k} \neq 0

for some subset

T ⫅ I_{*}

. The KKT conditions indicate that any two solutions have non-zero components at the same positions. Therefore,

\hat{I} = T ⫅ I_{*}

on the event

Ψ

. Further, we can write

\begin{matrix} P (\hat{I} ⊈ I_{*}) \leq P (Ψ^{c}) = P (⋃_{k \notin I_{*}} \{|- \frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) + \sum_{j \in I_{*}} {\tilde{η}}_{j} < h_{j}, h_{k} >| \geq 2 \sqrt{2} v (δ / 2 W) L\}) \\ \leq & \sum_{k \notin I_{*}} P \{|- \frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) + \sum_{j \in I_{*}} {\tilde{η}}_{j} < h_{j}, h_{k} >| \geq 2 \sqrt{2} v (δ / 2 W) L\} \\ = & \sum_{k \notin I_{*}} P \{|- \frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) + E [h_{k} (Z_{1})] - E [h_{k} (Z_{1})] + \sum_{j \in I_{*}} {\tilde{η}}_{j} < h_{j}, h_{k} >| \geq 2 \sqrt{2} v (δ / 2 W) L\} \\ = & \sum_{k \notin I_{*}} P \{|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (Z_{1})] - \sum_{j \in I_{*}} ({\tilde{η}}_{j} - β_{j}^{*}) < h_{j}, h_{k} >| \geq 2 \sqrt{2} v (δ / 2 W) L\} \\ (A14) & \leq & \sum_{k \notin I_{*}} P \{|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E [h_{k} (Z_{1})]| \geq \sqrt{2} v (δ / 2 W) L\} \\ (A15) & + \sum_{k \notin I_{*}} P \{\sum_{j \in I_{*}} | {\tilde{η}}_{j} - β_{j}^{*} | |< h_{j}, h_{k} >| \geq \sqrt{2} v (δ / 2 W) L\} . \end{matrix}

According to the previously proven Formula (A11), we find

\begin{matrix} \sum_{k \notin I_{*}} P \{|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (Z_{i}) - E h_{k} (Z_{1})| \geq \sqrt{2} v (δ / 2 W) L\} \\ \leq & \sum_{k \notin I_{*}} P \{|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E h_{k} (X_{1})| \geq \sqrt{2} v (δ / 2 W) L - ϵ_{k}\} \\ = & \sum_{k = 1}^{W} P \{|\frac{1}{n} \sum_{i = 1}^{n} h_{k} (X_{i}) - E h_{k} (X_{1})| \geq \sqrt{2} v (δ / 2 W) L (1 - ϵ_{k}^{*})\} \leq 2 W {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} . \end{matrix}

For the upper bound of Equation (A15), observe Theorem 2, we can use a larger

v (δ / 2 W)

instead of

v (δ / 2)

. Consider the construction of

\tilde{η}

in Equation (A12), we obtain

\begin{matrix} P (\sum_{j \in I_{*}} | {\tilde{η}}_{j} - β_{j}^{*} | \geq \frac{288 \sqrt{2} G^{*} v (δ / 2 W)}{L_{min} λ_{W}}) \leq \frac{δ}{W} . \end{matrix}

Similarly, we have

\begin{matrix} \sum_{k \notin I_{*}} P \{\sum_{j \in I_{*}} | {\tilde{η}}_{j} - β_{j}^{*} | |< h_{j}, h_{k} >| \geq \sqrt{2} v (δ / 2 W) L\} \\ \leq & \sum_{k = 1}^{W} P \{\sum_{j \in I_{*}} | {\tilde{η}}_{j} - β_{j}^{*} | |\frac{< h_{j}, h_{k} >}{∥ h_{j} ∥ ∥ h_{k} ∥} ∥ h_{j} ∥ ∥ h_{k} ∥| \geq \sqrt{2} v (δ / 2 W) L\} \\ \leq & \sum_{k = 1}^{W} P \{\sum_{j \in I_{*}} | {\tilde{η}}_{j} - β_{j}^{*} | ρ_{*} (β^{*}) \geq \sqrt{2} v (δ / 2 W) L\} \\ = & \sum_{k = 1}^{W} P \{\sum_{j \in I_{*}} | {\tilde{η}}_{j} - β_{j}^{*} | \geq \frac{\sqrt{2} v (δ / 2 W) L}{ρ_{*} (β^{*})}\} \\ (using Condition (A)) & \leq \sum_{k = 1}^{W} P \{\sum_{j \in I_{*}} | {\tilde{η}}_{j} - β_{j}^{*} | \geq \frac{288 \sqrt{2} G^{*} v (δ / 2 W)}{L_{\min} λ_{W}}\} \leq \sum_{k = 1}^{W} \frac{δ}{W} = δ . \end{matrix}

Combining all the bounds above, we can obtain

\begin{matrix} P (\hat{I} \neq I_{*}) \leq P (I_{*} ⫅ \hat{I}) + P (\hat{I} ⫅ I_{*}) & \leq 2 W {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} + δ + 2 W {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} + δ \\ = 4 W {(\frac{δ}{2 W^{2}})}^{{(1 - ϵ_{k}^{*})}^{2}} + 2 δ . \end{matrix}

This completes the proof of Theorem 3.

References

McLachlan, G.J.; Lee, S.X.; Rathnayake, S.I. Finite mixture models. Ann. Rev. Stat. Appl. 2019, 6, 355–378. [Google Scholar] [CrossRef]
Balakrishnan, S.; Wainwright, M.J.; Yu, B. Statistical guarantees for the EM algorithm: From population to sample-based analysis. Ann. Stat. 2017, 45, 77–120. [Google Scholar] [CrossRef]
Wu, Y.; Zhou, H.H. Randomly initialized EM algorithm for two-component Gaussian mixture achieves near optimality in O( $\sqrt{n}$ ) iterations. arXiv 2019, arXiv:1908.10935. [Google Scholar]
Chen, J.; Khalili, A. Order selection in finite mixture models with a nonsmooth penalty. J. Am. Stat. Assoc. 2008, 103, 1674–1683. [Google Scholar] [CrossRef] [Green Version]
DasGupta, A. Asymptotic Theory of Statistics and Probability; Springer: New York, NY, USA, 2008. [Google Scholar]
Devroye, L.; Lugosi, G. Combinatorial Methods in Density Estimation; Springer: New York, NY, USA, 2001. [Google Scholar]
Biau, G.; Devroye, L. Density estimation by the penalized combinatorial method. J. Multivar. Anal. 2005, 94, 196–208. [Google Scholar] [CrossRef] [Green Version]
Martin, R. Fast Nonparametric Estimation of a Mixing Distribution with Application to High Dimensional Inference. Ph.D. Thesis, Purdue University, West Lafayette, IN, USA, 2009. [Google Scholar]
Bunea, F.; Tsybakov, A.B.; Wegkamp, M.H.; Barbu, A. Spades and mixture models. Ann. Stat. 2010, 38, 2525–2558. [Google Scholar] [CrossRef]
Bertin, K.; Le Pennec, E.; Rivoirard, V. Adaptive Dantzig density estimation. Annales de l’IHP Probabilités et Statistiques 2011, 47, 43–74. [Google Scholar] [CrossRef]
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodological 1996, 58, 267–288. [Google Scholar] [CrossRef]
Hall, P.; Lahiri, S.N. Estimation of distributions, moments and quantiles in deconvolution problems. Ann. Stat. 2008, 36, 2110–2134. [Google Scholar] [CrossRef]
Meister, A. Density estimation with normal measurement error with unknown variance. Stat. Sinica 2006, 16, 195–211. [Google Scholar]
Cheng, C.L.; van Ness, J.W. Statistical Regression with Measurement Error; Wiley: New York, NY, USA, 1999. [Google Scholar]
Zhu, H.; Zhang, R.; Zhu, G. Estimation and Inference in Semi-Functional Partially Linear Measurement Error Models. J. Syst. Sci. Complex. 2020, 33, 1179–1199. [Google Scholar] [CrossRef]
Zhu, H.; Zhang, R.; Yu, Z.; Lian, H.; Liu, Y. Estimation and testing for partially functional linear errors-in-variables models. J. Multivar. Anal. 2019, 170, 296–314. [Google Scholar] [CrossRef]
Bonhomme, S. Penalized Least Squares Methods for Latent Variables Models. In Advances in Economics and Econometrics: Volume 3, Econometrics: Tenth World Congress; Cambridge University Press: Cambridge, UK, 2013; Volume 51, p. 338. [Google Scholar]
Nakamura, T. Corrected score function for errors-in-variables models: Methodology and application to generalized linear models. Biometrika 1990, 77, 127–137. [Google Scholar] [CrossRef]
Buonaccorsi, J.P. Measurement error. In Models, Methods, and Applications; Chapman & Hall/CRC: Boca Raton, FL, USA, 2010. [Google Scholar]
Carroll, R.J.; Ruppert, D.; Stefanski, L.A.; Crainiceanu, C.M. Measurement error in nonlinear models. In A Modern Perspective, 2nd ed.; Chapman & Hall/CRC: Boca Raton, FL, USA, 2006. [Google Scholar]
Zou, H.; Zhang, H. On the adaptive elastic-net with a diverging number of parameters. Ann. Stat. 2009, 37, 1733–1751. [Google Scholar] [CrossRef] [Green Version]
Aitchison, J.; Aitken, C.G. Multivariate binary discrimination by the kernel method. Biometrika 1976, 63, 413–420. [Google Scholar] [CrossRef]
Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef] [Green Version]
Rosenbaum, M.; Tsybakov, A.B. Sparse recovery under matrix uncertainty. Ann. Stat. 2010, 38, 2620–2651. [Google Scholar] [CrossRef]
Zhang, H.; Jia, J. Elastic-net regularized high-dimensional negative binomial regression: Consistency and weak signals detection. Stat. Sinica 2022, 32. [Google Scholar] [CrossRef]
Buhlmann, P.; van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications; Springer: New York, NY, USA, 2011. [Google Scholar]
Belloni, A.; Rosenbaum, M.; Tsybakov, A.B. Linear and conic programming estimators in high dimensional errors-in-variables models. J. R. Stat. Soc. Series B Stat. Methodol. 2017, 79, 939–956. [Google Scholar] [CrossRef] [Green Version]
Huang, H.; Gao, Y.; Zhang, H.; Li, B. Weighted Lasso estimates for sparse logistic regression: Non-asymptotic properties with measurement errors. Acta Math. Sci. 2021, 41, 207–230. [Google Scholar] [CrossRef]
Zhang, H.; Chen, S.X. Concentration Inequalities for Statistical Inference. Commun. Math. Res. 2021, 37, 1–85. [Google Scholar]
Donoho, D.L.; Johnstone, J.M. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81, 425–455. [Google Scholar] [CrossRef]
Deng, H.; Chen, J.; Song, B.; Pan, Z. Error bound of mode-based additive models. Entropy 2021, 23, 651. [Google Scholar] [CrossRef]
Bickel, P.J.; Ritov, Y.A.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732. [Google Scholar] [CrossRef]
Bunea, F. Honest variable selection in linear and logistic regression models via ℓ₁ and ℓ₁ + ℓ₂ penalization. Electron. J. Stat. 2008, 2, 1153–1194. [Google Scholar] [CrossRef]
Chow, Y.S.; Teicher, H. Probability Theory: Independence, Interchangeability, Martingales, 3rd ed.; Springer: New York, NY, USA, 2003. [Google Scholar]
Hersbach, H.; de Rosnay, P.; Bell, B.; Schepers, D.; Simmons, A.; Soci, C.; Abdalla, S.; Alonso-Balmaseda, M.; Balsamo, G.; Bechtold, P.; et al. Operational Global Reanalysis: Progress, Future Directions and Synergies with NWP; European Centre for Medium Range Weather Forecasts: Reading, UK, 2018. [Google Scholar]
Fisher, N.I. Statistical Analysis of Circular Data; Cambridge University Press: Cambridge, UK, 1995. [Google Scholar]
Broniatowski, M.; Jureckova, J.; Kalina, J. Likelihood ratio testing under measurement errors. Entropy 2018, 20, 966. [Google Scholar] [CrossRef] [Green Version]

Figure 1. The simulation result in Section 4.2. The estimated support of

β^{*}

by the four types of estimators, and the W is varying. The circles represent the means of the estimators under the four specific approaches, while the half vertical lines mean the standard deviations.

Figure 1. The simulation result in Section 4.2. The estimated support of

β^{*}

by the four types of estimators, and the W is varying. The circles represent the means of the estimators under the four specific approaches, while the half vertical lines mean the standard deviations.

Figure 2. The simulation results in Section 4.2. The density map of the four estimators. The result of Elastic-net in

W = 321

is not be shown due to its poor performance.

Figure 2. The simulation results in Section 4.2. The density map of the four estimators. The result of Elastic-net in

W = 321

is not be shown due to its poor performance.

Figure 3. The sample histogram of the azimuth in Beijing Nongzhanguan at 6am and Qingdao Coast at 12 am.

Figure 4. The density map of the four estimators’ approaches for the three random sub-samples from the real-world data in Section 4.5.

Table 1. The simulation results in Section 4.2. The mean and standard deviation of the errors in the four estimators of

β^{*}

under

N = 100

simulations, with

n = 100

. The quasi-optimal

λ_{2}

is

c = 0.002

for Elastic-net, while

c = 0.027

is for the CSDE.

Table 1. The simulation results in Section 4.2. The mean and standard deviation of the errors in the four estimators of

β^{*}

under

N = 100

simulations, with

n = 100

. The quasi-optimal

λ_{2}

is

c = 0.002

for Elastic-net, while

c = 0.027

is for the CSDE.

	W	$λ_{1}$	$L_{1}$ Error	$TV$ Error
Lasso	81	0.065	2.133 (2.467)	1.137 (1.115)
Elastic-net		0.065	2.061 (1.439)	1.114 (0.805)
SPADES		0.053	1.922 (2.211)	1.258 (1.296)
CSDE		0.053	2.191 (4.812)	1.405 (2.329)
Lasso	131	0.068	2.032 (0.985)	1.352 (0.712)
Elastic-net		0.068	2.236 (2.498)	1.409 (1.056)
SPADES		0.056	1.880 (2.644)	0.972 (1.204)
CSDE		0.056	1.635 (0.342)	0.863 (0.402)
Lasso	211	0.071	2.572 (4.187)	1.605 (2.702)
Elastic-net		0.071	2.061 (1.883)	1.353 (1.516)
SPADES		0.058	1.764 (1.041)	0.832 (0.610)
CSDE		0.058	1.648 (0.168)	0.791 (0.415)
Lasso	321	0.074	2.120 (2.842)	1.146 (1.115)
Elastic-net		0.074	10.173 (82.753)	7.839 (67.887)
SPADES		0.061	2.106 (4.816)	0.818 (1.565)
CSDE		0.061	1.623 (0.085)	0.634 (0.199)

Table 2. The simulation result in Section 4.3. The mean and standard deviation of the errors in the four estimators of

β^{*}

under

N = 100

simulations. The

λ_{2}

is chosen as

c = 0.005

for Elastic-net, while

c = 0.203

for the CSDE.

Table 2. The simulation result in Section 4.3. The mean and standard deviation of the errors in the four estimators of

β^{*}

under

N = 100

simulations. The

λ_{2}

is chosen as

c = 0.005

for Elastic-net, while

c = 0.203

for the CSDE.

	W	$λ_{1}$	$L_{1}$ Error	$TV$ Error
Lasso	81	0.048	1.796 (0.006)	0.002 (0.001)
Elastic-net		0.048	1.796 (0.006)	0.002 (0.001)
SPADES		0.138	1.811 (0.013)	0.002 (0.005)
CSDE		0.138	1.806 (0.008)	0.003 (0.005)
Lasso	131	0.051	1.828 (0.006)	0.003 (0.001)
Elastic-net		0.051	1.830 (0.009)	0.004 (0.002)
SPADES		0.145	1.880 (0.006)	0.002 (0.005)
CSDE		0.145	1.854 (0.006)	0.002 (0.004)
Lasso	211	0.053	1.935 (0.010)	0.005 (0.003)
Elastic-net		0.053	2.061 (0.014)	0.007 (0.008)
SPADES		0.152	1.935 (0.008)	0.005 (0.003)
CSDE		0.152	1.861 (0.005)	0.003 (0.002)
Lasso	321	0.055	1.927 (0.031)	0.005 (0.002)
Elastic-net		0.055	2.123 (0.026)	0.009 (0.009)
SPADES		0.158	1.938 (0.008)	0.005 (0.003)
CSDE		0.158	1.852 (0.002)	0.002 (0.001)

Table 3. The low-dimensional simulation result in Section 4.4.

		$L_{1}$ Error	$TV$ Error
Scenario 1	EM	0.255 (0.122)	0.205 (0.098)
Scenario 1	CSDE	0.206 (0.145)	0.185 (0.104)
Scenario 2	EM	0.111 (0.055)	0.111 (0.055)
Scenario 2	CSDE	0.109 (0.037)	0.108 (0.037)

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yang, X.; Zhang, H.; Wei, H.; Zhang, S. Sparse Density Estimation with Measurement Errors. Entropy 2022, 24, 30. https://doi.org/10.3390/e24010030

AMA Style

Yang X, Zhang H, Wei H, Zhang S. Sparse Density Estimation with Measurement Errors. Entropy. 2022; 24(1):30. https://doi.org/10.3390/e24010030

Chicago/Turabian Style

Yang, Xiaowei, Huiming Zhang, Haoyu Wei, and Shouzheng Zhang. 2022. "Sparse Density Estimation with Measurement Errors" Entropy 24, no. 1: 30. https://doi.org/10.3390/e24010030

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Sparse Density Estimation with Measurement Errors

Abstract

1. Introduction

2. Density Estimation

2.1. Mixture Models

2.2. The Density Estimation with Measurement Errors

3. Sparse Mixture Density Estimation

3.1. Data-Dependent Weights

3.2. Non-Asymptotic Oracle Inequalities

3.3. Corrected Support Identification of Mixture Models

4. Simulation and Real Data Analysis

4.1. Tuning Parameter Selection

4.2. Multi-Modal Distributions

4.3. Mixture of Poisson Distributions

4.4. Low-Dimensional Mixture Model

4.5. Real Data Examples

5. Summary and Discussions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Appendix A

Appendix A.1. Proof of Lemma A1

Appendix A.2. Proof of Theorems

Appendix A.3. Proof of Theorem 1

Appendix A.4. Proof of Theorem 2

Appendix A.5. Proof of Corollory 1

Appendix A.6. Proof of Corollary 2

Appendix A.7. Proof of Theorem 3

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI