Next Article in Journal
Predicting Change in Emotion through Ordinal Patterns and Simple Symbolic Expressions
Next Article in Special Issue
Representation Theorem and Functional CLT for RKHS-Based Function-on-Function Regressions
Previous Article in Journal
A Feasible Method to Control Left Ventricular Assist Devices for Heart Failure Patients: A Numerical Study
Previous Article in Special Issue
Group Logistic Regression Models with lp,q Regularization
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Sharper Sub-Weibull Concentrations

1
Department of Mathematics, Faculty of Science and Technology, University of Macau, Macau 999078, China
2
Zhuhai UM Science & Technology Research Institute, Zhuhai 519031, China
3
Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2022, 10(13), 2252; https://doi.org/10.3390/math10132252
Submission received: 21 April 2022 / Revised: 20 June 2022 / Accepted: 21 June 2022 / Published: 27 June 2022
(This article belongs to the Special Issue New Advances in High-Dimensional and Non-asymptotic Statistics)

Abstract

:
Constant-specified and exponential concentration inequalities play an essential role in the finite-sample theory of machine learning and high-dimensional statistics area. We obtain sharper and constants-specified concentration inequalities for the sum of independent sub-Weibull random variables, which leads to a mixture of two tails: sub-Gaussian for small deviations and sub-Weibull for large deviations from the mean. These bounds are new and improve existing bounds with sharper constants. In addition, a new sub-Weibull parameter is also proposed, which enables recovering the tight concentration inequality for a random variable (vector). For statistical applications, we give an 2 -error of estimated coefficients in negative binomial regressions when the heavy-tailed covariates are sub-Weibull distributed with sparse structures, which is a new result for negative binomial regressions. In applying random matrices, we derive non-asymptotic versions of Bai-Yin’s theorem for sub-Weibull entries with exponential tail bounds. Finally, by demonstrating a sub-Weibull confidence region for a log-truncated Z-estimator without the second-moment condition, we discuss and define the sub-Weibull type robust estimator for independent observations { X i } i = 1 n without exponential-moment conditions.

1. Introduction

In the last two decades, with the development of modern data collection methods in science and techniques, scientists and engineers can access and load a huge number of variables in their experiments. Over hundreds of years, probability theory lays the mathematical foundation of statistics. Arising from data-driving problems, various recent statistics research advances also contribute to new and challenging probability problems for further study. For example, in recent years, the rapid development of high-dimensional statistics and machine learning have promoted the development of the probability theory and even pure mathematics, such as random matrices, large deviation inequalities, and geometric functional analysis, etc.; see [1]. More importantly, the concentration inequality (CI) quantifies the concentration of measures that are at the heart of statistical machine learning. Usually, CI quantifies how a random variable (r.v.) X deviates around its mean E X = : μ by presenting as one-side or two-sided bounds for the tail probability of X μ
P ( X μ > t ) or P ( | X μ | > t ) some small δ , t 0 .
The classical statistical models are faced with fixed-dimensional variables only. However, contemporary data science motivates statisticians to pay more attention to studying p × p random Hessian matrices (or sample covariance matrices, [2]) with p , arising from the likelihood functions of high-dimensional regressions with covariates in R p . When the model dimension increases with sample size, obtaining asymptotic results for the estimator is potentially more challenging than the fixed dimensional case. In statistical machine learning, concentration inequalities (large derivation inequalities) are essential in deriving non-asymptotic error bounds for the proposed estimator; see [3,4]. Over recent decades, researchers have developed remarkable results of matrix concentration inequalities, which focuses on non-asymptotic upper and lower bounds for the largest eigenvalue of a finite sum of random matrices. For a more fascinated introduction, please refer to the book [5].
Motivated from sample covariance matrices, a random matrix is a specific matrix A p × p with its entries A j k drawn from some distributions. As p , random matrix theory mainly focuses on studying the properties of the p eigenvalues of A p × p , which turn out to have some limit law. Several famous limit laws in random matrix theory are different from the CLT for the summation of independent random variables since the p eigenvalues are dependent and interact with each other. For convergence in distribution, some pioneering works are the Wigner’s semicircle law for some symmetric Gaussian matrices’ eigenvalues, the Marchenko-Pastur law for Wishart distributed random matrices (sample covariance matrices), and the Tracy-Widom laws for the limit distribution for maximum eigenvalues in Wishart matrices. All these three laws can be regarded as the CLT of random matrix versions. Moreover, the limit law for the empirical spectral density is some circle distribution, which sheds light on the non-communicative behaviors of the random matrix, while the classic limit law in CLT is for normal distribution or infinite divisible distribution. For strong convergence, Bai-Yin’s law complements the Marchenko-Pastur law, which asserts that almost surely convergence of the smallest and largest eigenvalue for a sample covariance matrix. The monograph [2] thoroughly introduces the limit law in random matrices.
This work aims to extend non-asymptotic results from sub-Gaussian to sub-Weibull in terms of exponential concentration inequalities with applications in count data regressions, random matrices, and robust estimators. The contributions are:
(i)
We review and present some new results for sub-Weibull r.v.s, including sharp concentration inequalities for weighted summations of independent sub-Weibull r.v.s and negative binomial r.v.s, which are useful in many statistical applications.
(ii)
Based on the generalized Bernstein-Orlicz norm, a sharper concentration for sub-Weibull summations is obtained in Theorem 1. Here we circumvent Stirling’s approximation and derive the inequalities more subtly. As a result, the confidence interval based on our result is sharper and more accurate than that in [6] (For example, see Remark 2) and [7] (see Proposition 1 with unknown constants) gave.
(iii)
By sharper sub-Weibull concentrations, we give two applications. First, from the proposed negative binomial concentration inequalities, we obtain the O P ( p / n ) (up to some log factors) estimation error for the estimated coefficients in negative binomial regressions under the increasing-dimensional framework p = p n and heavy-tailed covariates. Second, we provide a non-asymptotic Bai-Yin’s theorem for sub-Weibull random matrices with exponential-decay high probability.
(iv)
We propose a new sub-Weibull parameters, which is enabled of recovering the tight concentration inequality for a single non-zero mean random vector. The simulation studies for estimating sub-Gaussian and sub-exponential parameters show these parameters could be estimated well.
(v)
We establish a unified non-asymptotic confidence region and the convergence rate for general log-truncated Z-estimator in Theorem 5. Moreover, we define a sub-Weibull type estimator for a sequence of independent observations { X i } i = 1 n without the second-moment condition, beyond the definition of the sub-Gaussian estimator.

2. Sharper Concentrations for Sub-Weibull Summation

Concentration inequalities are powerful in high-dimensional statistical inference, and it can derive explicit non-asymptotic error bounds as a function of sample size, sparsity level, and dimension [3]. In this section, we present preparation results of concentration inequalities for sub-Weibull random variables.

2.1. Properties of Sub-Weibull norm and Orlicz-Type Norm

In empirical process theory, sub-Weibull norm (or other Orlicz-type norms) is crucial to derive the tail probability for both single sub-Weibull random variable and summation of random variables (by using the Chernoff’s inequality). A benefit of Orlicz-type norms is that the concentration does not need the zero mean assumption.
Definition 1
(Sub-Weibull norm). For θ > 0 , the sub-Weibull norm of X is defined as
X ψ θ : = inf { C ( 0 , ) : E [ exp ( | X | θ / C θ ) ] 2 } .
The · ψ θ is also called the ψ θ -norm. We define X as a sub-Weibull random variable with index θ if it has a bounded ψ θ -norm (denoted as X subW ( θ ) ). Actually, the sub-Weibull norm is a special case of Orlicz norms below.
Definition 2
(Orlicz Norms). Let g : [ 0 , ) [ 0 , ) be a non-decreasing convex function with g ( 0 ) = 1 . The “g-Orlicz norm” of a real-valued r.v. X is given by
X g : = inf { η > 0 : E [ g ( | X | / η ) ] 2 } .
Using exponential Markov’s inequality, we have
P ( | X | t ) = P ( g ( | X | / X g ) g ( t / X g ) ) g 1 ( t / X g ) E g ( X / X g ) 2 g 1 ( t / X g )
by Definition 2. For example, let g ( x ) = e x θ , which leads to sub-Weibull norm for θ 1 .
Example 1
( ψ θ -norm of bounded r.v.). For a r.v. | X | M < , we have
X ψ θ = inf { t > 0 : E e | X | θ / t θ 2 } inf { t > 0 : E e M θ / t θ 2 } = M ( log 2 ) 1 / θ .
In general, we have following corollary to determine X ψ θ based on moment generating functions (MGF). It would be useful for doing statistical inference of ψ θ -norm.
Corollary 1.
If X ψ θ < , then X ψ θ = m | X | θ 1 ( 2 ) 1 / θ for the MGF ϕ Z ( t ) : = E e t Z .
Remark 1.
If we observe i.i.d. data { X i } i = 1 n from a sub-Weibull distribution, one can use the empirical moment generating function (EMGF, [8]) to estimate the sub-Weibull norm of X. Since the EMGF m ^ | X | θ ( t ) = 1 n i = 1 n exp { t | X i | θ } converge to MGF m | X | θ ( t ) in probability for t in a neighbourhood of zero, the value of the inverse function of EMGF at 2. Then, under some regularity conditions, m ^ | X | θ 1 ( 2 ) , is a consistent estimate for X ψ θ .
In particular, if we take θ = 1 , we get the sub-exponential norm of X, which is defined as X ψ 1 = inf { t > 0 : E exp ( | X | / t ) 2 } . For independent r.v.s { X i } i = 1 n , if E X i = 0 and X i ψ 1 < , by Proposition 4.2 in [4], we know t 0
P | i = 1 n X i | t 2 exp 1 4 t 2 i = 1 n 2 X i ψ 1 2 t max 1 i n X i ψ 1 .
Example 2.
An explicitly calculation of the sub-exponential norm is given in [9], they show that Poisson r.v. X Poisson ( λ ) has sub-exponential norm X ψ 1 [ log ( log ( 2 ) λ 1 + 1 ) ] 1 . And Example 1 with triangle inequality implies
X E X ψ 1 X ψ 1 + E X ψ 1 = X ψ 1 + λ log 2 [ log ( log ( 2 ) λ 1 + 1 ) ] 1 + λ log 2
based on following useful results.
Proposition 1
(Lemma A.3 in [9]). For any α > 0 and any r.v.s X , Y we have X + Y ψ θ K α X ψ θ + Y ψ θ and
E X ψ θ 1 d α ( log 2 ) 1 / α X ψ θ , X E X ψ θ K α 1 + d α log 2 1 / α X ψ θ ,
where d θ : = ( θ e ) 1 / θ / 2 , K θ : = 2 1 / θ if θ ( 0 , 1 ) and K θ = 1 if θ 1 .
To extend Poisson variables, one can also consider concentration for sums of independent heterogeneous negative binomial variables { Y i } i = 1 n with probability mass functions:
P ( Y i = y ) = Γ ( y + k i ) Γ ( k i ) y ! ( 1 q i ) k i q i y q i ( 0 , 1 ) , y N ,
where { k i } i = 1 n ( 0 , ) are variance-dependence parameters. Here, the mean and variance of { Y i } i = 1 n are E Y i = k i q i 1 q i , Var   Y i = k i q i ( 1 q i ) 2 respectively. The MGF of { Y i } i = 1 n are E e s Y i = 1 q i 1 q i e s k i for i = 1 , , n . Based on (3), we obtain following results.
Corollary 2.
For any independent r.v.s { Y i } i = 1 n satisfying Y i ψ 1 < , t 0 , and non-random weight w = ( w 1 , , w n ) , we have
P | i = 1 n w i ( Y i E Y i ) | t 2 e 1 4 t 2 2 i = 1 n w i 2 ( Y i ψ 1 + | E Y i / log 2 | ) 2 t max 1 i n | w i | ( Y i ψ 1 + | E Y i / log 2 | ) .
P | i = 1 n w i ( Y i E Y i ) | > 2 2 t i = 1 n w i 2 Y i E Y i ψ 1 2 1 / 2 + 2 t max 1 i n ( | w i | Y i E Y i ψ 1 ) 2 e t .
In particular, if Y i is independently distributed as NB ( μ i , k i ) , we have
P | i = 1 n w i ( Y i E Y i ) | t 2 e 1 4 ( t 2 2 i = 1 n w i 2 a 2 ( μ i , k i ) t max 1 i n | w i | a ( μ i , k i ) ) ,
where a ( μ i , k i ) : = log 1 ( 1 q i ) / 2 k i q i 1 + μ i log 2 with q i : = μ i k i + μ i .
Corollary 2 can play an important role in many non-asymptotic analyses of various estimators. For instance, recently [10] uses the above inequality as an essential role for deriving the non-asymptotic behavior of the penalty estimator in the counting data model.
Next, we study moment properties for sub-Weibull random variables. Lemma 1.4 in [11] showed that if X subG ( σ 2 ) , then we have: (a). the tail satisfies P ( | X | > t ) 2 e t 2 / 2 σ 2 for any t > 0 ; (b). The (a) implies that moments E | X | k ( 2 σ 2 ) k / 2 k Γ ( k 2 ) and [ k 1 / 2 ( E ( | X | k ) ) 1 / k ] 2 σ 2 e 2 / e , k 2 . We extend Lemma 1.4 in [11] to sub-Weibull r.v. X satisfying following properties.
Corollary 3
(Moment properties of sub-Weibull norm). (a). If X ψ θ < , then P { | X | > t } 2 e ( t / X ψ θ ) θ for all t 0 ; and then E | X | k 2 X ψ θ k Γ ( k θ + 1 ) for all k 1 . (2). Let C θ : = max k 1 2 2 π θ 1 / k k θ 1 / ( 2 k ) , for all k 1 we have ( E | X | k ) 1 / k C θ ( θ e 11 / 12 ) 1 / θ X ψ θ k 1 / θ .
Particularly, sub-Weibull r.v.s reduce to sub-exponential or sub-Gaussian r.v.s when θ = 1 or 2. It is obvious that the smaller θ is, the heavier tail the r.v. has. A r.v. is called heavy-tailed if its distribution function fails to be bounded by a decreasing exponential function, i.e.,
e λ x d F ( x ) = , λ > 0 (the tail decays slower than some exponential r.v.s);
see [12]. Hence for sub-Weibull r.v.s, we usually focus on the the sub-Weibull index θ ( 0 , 1 ) . A simple example that the heavy-tailed distributions arises when we work more production on sub-Gaussian r.v.s. Via a power transform of | X | , the next corollary explains the relation of sub-Weibull norm with parameter θ and r θ , which is similar to Lemmas 2.7.6 of [1] for sub-exponential norm.
Corollary 4.
For any θ , r ( 0 , ) , if X subW ( θ ) , then | X | r subW ( θ / r ) . Moreover,
| X | r ψ θ / r = X ψ θ r .
Conversely, if X subW ( r θ ) , then X r subW ( θ ) with X r ψ θ = X ψ r θ r .
By Corollary 4, we obtain that d-th root of the absolute value of sub-Gaussian is subW ( 2 d ) by letting r = 1 / d . Corollary 4 can be extended to product of r.v.s, from Proposition D.2 in [6] with the equality replacing by inequality, we state it as the following proposition.
Proposition 2.
If { W i } i = 1 d are (possibly dependent) r.vs satisfying W i ψ α i < ∞ for some α i > 0 , then
i = 1 d W i ψ β i = 1 d W i ψ α i where 1 β : = i = 1 d 1 α i .
For multi-armed bandit problems in reinforcement learning, [7] move beyond sub-Gaussianity and consider the reward under sub-Weibull distribution which has a much weaker tail. The corresponding concentration inequality (Theorem 3.1 in [7]) for the sum of independent sub-Weibull r.v.s is illustrated as follows.
Proposition 3
(Concentration inequality for sub-Weibull distribution). Suppose { X i } i = 1 n are independent sub-Weibull random variables with X i E X i ψ θ v . Then there exists absolute constants C 1 θ and C 2 θ only depending on θ such that with probability at least 1 e t :
1 n i = 1 n X i E X i v C 1 θ t n 1 / 2 + C 2 θ t n 1 / θ = O ( n 1 / θ ) , θ > 2 O ( n 1 / 2 ) , 0 < θ 2 .
The weakness in the Proposition 3 is that the upper bound of S n a : = i = 1 n a i Y i E ( i = 1 n a i Y i ) is up to a unknown constants C 1 θ , C 2 θ . In the next section, we will give a constants-specified and high probability upper bound for | S n a | , which improve Proposition 3 and is sharper than Theorem 3.1 in [6].

2.2. Main Results: Concentrations for Sub-Weibull Summation

Based on the exponential moment condition, the Chernoff’s tricks implies the following sub-exponential concentrations from Proposition 4.2 in [4].
Proposition 4.
For any independent r.v.s { Y i } i = 1 n satisfying Y i ψ 1 < , t 0 , and non-random weight w = ( w 1 , , w n ) , we have
P ( | i = 1 n w i ( Y i E Y i ) | > 2 ( 2 t i = 1 n w i 2 Y i E Y i ψ 1 2 ) 1 / 2 + 2 t max 1 i n ( | w i | Y i E Y i ψ 1 ) ) 2 e t .
But it is not easy to extend to sub-Weibull distributions. From Corollary 4, Y i subW ( θ ) | Y i | 1 / θ subW ( 1 ) . The MGF of | Y i | 1 / θ satisfies E e λ 1 / θ | Y i | 1 / θ e λ 1 / θ K 1 / θ , | λ | 1 K for some constant K > 0 . The bound of E e λ 1 / θ | Y i | 1 / θ with θ 1 or 2 is not directly applicable for deriving the concentration of i = 1 n w i ( Y i E Y i ) by using the independence and Chernoff’s tricks, since the MGF of Weibull r.v. do not has closed form as exponential function. Thanks to the tail probability derived by Orlicz-type norms, instead of using the upper bound for MGF, an alternative method is given by [6] who defines the so-called Generalized Bernstein-Orlicz (GBO) norm. And the GBO norm can help us to derive tail behaviours for sub-Weibull r.v.s.
Definition 3
(GBO norm). Fix α > 0 and L 0 . Define the function Ψ θ , L ( · ) as the inverse function Ψ θ , L 1 ( t ) : = log ( t + 1 ) + L log ( t + 1 ) 1 / θ f o r   a l l t 0 . The GBO norm of a r.v. X is then given by X Ψ θ , L : = inf { η > 0 : E [ Ψ θ , L ( | X | / η ) ] 1 } .
The monotone function Ψ θ , L ( · ) is motivated by the classical Bernstein’s inequality for sub-exponential r.v.s. Like the sub-Weibull norm properties Corollary 3, the following proposition in [6] allows us to get the concentration inequality for r.v. with finite GBO norm.
Proposition 5.
If X Ψ θ , L < , then P ( | X | X Ψ θ , L { t + L t 1 / θ } ) 2 e t t 0 .
With an upper bound of GBO norm, we could easily derive the concentration inequality for a single sub-Weibull r.v. or even the sum of independent sub-Weibull r.v.s. The sharper upper bounds for the GBO norm is obtained for the sub-Weibull summation, which refines the constant in the sub-Weibull concentration inequality. Let | | X | | p : = ( E | X | p ) 1 / p for all integer p 1 . First, by truncating more precisely, we obtain a sharper upper bound for | | X | | p , comparing to Proposition C.1 in [6].
Corollary 5.
If X p C 1 p + C 2 p 1 / θ for p 2 and constants C 1 , C 2 , then
X Ψ θ , K γ e C 1
where K = γ 2 / θ C 2 / ( γ C 1 ) and γ 1.78 is the minimal solution of
k > 1 : e 2 k 2 1 + e 2 ( 1 k 2 ) / k 2 k 2 1 1 .
The proof can be seen in the Appendix A. In below, we need the moment estimation for sums of independent symmetric r.v.s.
Lemma 1
(Khinchin-Kahane Inequality, Theorem 1.3.1 of [13]). Let a i i = 1 n be a finite non-random sequence, ε i i = 1 n be a sequence of independent Rademacher variables and 1 < p < q < . Then i = 1 n ε i a i q q 1 p 1 1 / 2 i = 1 n ε i a i p .
Lemma 2
(Theorem 2 of [14]). Let X i i = 1 n be a sequence of independent symmetric r.v.s, and p 2 . Then, e 1 2 e 2 X i p X 1 + + X n p e X i p , where X i p : = inf { t > 0 : i = 1 n log ϕ p X i / t p } with ϕ p ( X ) : = E | 1 + X | p .
Lemma 3
(Example 3.2 and 3.3 of [14]). Assume X be a symmetric r.v. satisfying P | X | t = e N ( t ) . For any t 0 , we have
(a)
If N ( t ) is concave, then log ϕ p ( e 2 t X ) p M p , X ( t ) : = ( t p X p p ) ( p t 2 X 2 2 ) .
(b)
For convex N ( t ) , denote the convex conjugate function N ( t ) : = sup s > 0 { t s N ( s ) } and M p , X ( t ) = p 1 N ( p | t | ) , if p | t | 2 p t 2 , if p | t | < 2 . Then log ϕ p ( t X / 4 ) p M p , X ( t ) .
With the help of three lemmas above, we can obtain the main results concerning the shaper and constant-specified concentration inequality for the sum of independent sub-Weibull r.v.s.
Theorem 1
(Concentration for sub-Weibull summation). Let γ be given in Corollary 5. If X i i = 1 n are independent centralized r.v.s such that X i ψ θ < for all 1 i n and some θ > 0 , then for any weight vector w = ( w 1 , , w n ) R n , the following bounds holds true:
(a)
The estimate for GBO norm of the summation:
i = 1 n w i X i Ψ θ , L n ( θ , b X ) γ e C ( θ ) b X 2 ,
where b X = ( w 1 X 1 ψ θ , , w n X n ψ θ ) R n , with
C ( θ ) : = 2 log 1 / θ 2 + e 3 Γ 1 / 2 2 θ + 1 + 3 2 θ 3 θ sup p 2 p 1 θ Γ 1 / p p θ + 1 , i f θ 1 , 2 [ 4 e + ( log 2 ) 1 / θ ] , i f θ > 1 ;
and L n ( θ , b ) = γ 2 / θ A ( θ ) b b 2 1 { 0 < θ 1 } + γ 2 / θ B ( θ ) b β b 2 1 { θ > 1 } where B ( θ ) = : 2 e θ 1 / θ 1 θ 1 1 / β 4 e + ( log 2 ) 1 / θ and A ( θ ) = : inf p 2 e 3 3 2 θ 3 θ p 1 / θ Γ 1 / p p θ + 1 2 [ log 1 / θ 2 + e 3 ( Γ 1 / 2 ( 2 θ + 1 ) + 3 2 θ 3 θ sup p 2 p 1 / θ Γ 1 / p ( p θ + 1 ) ) ] . For the case θ > 1 , β is the Hölder conjugate satisfying 1 / θ + 1 / β = 1 .
(b)
Concentration for sub-Weibull summation:
P | i = 1 n w i X i | 2 e C ( θ ) b X 2 { t + L n ( θ , b X ) t 1 / θ } 2 e t .
(c)
Another form of for θ 2 :
P | i = 1 n w i X i | s 2 exp s θ 4 e C ( θ ) b X 2 L n ( θ , b X ) θ s 2 16 e 2 C 2 ( θ ) b X 2 2 ( θ < 2 ) = 2 e s 2 / 16 e 2 C 2 ( θ ) b 2 2 , if s 4 e C ( θ ) b X 2 L n θ / ( θ 2 ) ( θ , b X ) 2 e s θ / [ 4 e C ( θ ) b X 2 L n ( θ , b X ) ] θ , if s > 4 e C ( θ ) b X 2 L n θ / ( θ 2 ) ( θ , b X ) ; ( θ > 2 ) = 2 e s θ / [ 4 e C ( θ ) b X 2 L n ( θ , b X ) ] θ , if s < 4 e C ( θ ) b X 2 L n θ / ( 2 θ ) ( θ , b X ) 2 e s 2 / 16 e 2 C 2 ( θ ) b X 2 2 , if s 4 e C ( θ ) b X 2 L n θ / ( 2 θ ) ( θ , b X ) .
Remark 2.
The constant C ( θ ) in Theorem 1 can be improved as C ( θ ) / 2 under symmetric assumption of sub-Weibull r.v.s { X i } i = 1 n . Moreover, by the improved symmetrization theorem (Theorem 3.4 in [15]), one can replace the constant C ( θ ) in Theorem 1 by a sharper constant ( 1 + o ( 1 ) ) C ( θ ) / 2 . Theorem 1 (b) also implies a potential empirical upper bound for i = 1 n w i X i for independent sub-Weibull r.v.s { X i } i = 1 n , because the only unknown variable in 2 e C ( θ ) b X 2 { t + L n ( θ ) t 1 / θ } is b X . From Remark 1, estimating b X is possible for i.i.d. observation { X i } i = 1 n .
Remark 3.
Compared with the newest result in [6], our method do not use the crude String’s approximation will give sharper concentration. For example, suppose X 1 , , X 10 are i.i.d. r.v.s with mean μ and X 1 μ ψ θ = 1 . Here we set θ = 0.5 , X is heavy-tailed (for example set the density of X as f ( x ) = 1 2 x e x · 1 ( x 0 ) ). We find that C ( θ ) 2825.89 , A ( θ ) 0.07 , and L 10 ( θ , 1 10 ) = 0.23 . Hence, 95 % confidence interval in our method will be
μ X ¯ ± 2 e × 2118.80 ,
while the 95% confidence interval in Theorem 3.1 of [6] is evaluated as
μ X ¯ ± 2 e × 3969.94 .
In this example, it can be seen that our method does give a much better (tighter) confidence interval.
Remark 4.
Theorem 1 (b) generalizes the sub-Gaussian concentration inequalities, sub-exponential concentration inequalities, and Bernstein’s concentration inequalities with Bernstein’s moment condition. For θ < 2 in Theorem 1 (c), the tail behaviour of the sum is akin to a sub-Gaussian tail for small t, and the tail resembles the exponential tail for large t; For θ > 2 , the tail behaves like a Weibull r.v. with tail parameter θ and the tail of sums match that of the sub-Gaussian tail for large t. The intuition is that the sum will concentrate around zero by the Law of Large Number. Theorem 1 shows that the convergence rate will be faster for small deviations from the mean and will be slower for large deviations from the mean.
Remark 5.
Recently, similar result presented in [16] is that
P | i = 1 n X i | > x exp x n K θ 1 / θ , for x n K θ
where K θ is some constants only depends on X and θ ( K θ can be obtained by Proposition 3). But it is obvious to see this large derivation result cannot guarantee a n -convergence rate (as presented in Proposition 3) whereas our result always give a n -convergence rate, as presented in Theorem 1 (c) and Proposition 3.

2.3. Sub-Weibull Parameter

In this part, a new sub-Weibull parameters is proposed, which is enable of recovering the tight concentration inequality for single non-zero mean random vector. Similar to characterizations of sub-Gaussian r.vs. in Proposition 2.5.2 of [1], sub-Weibull r.vs. has the equivalent definitions.
Proposition 6
(Characterizations of sub-Weibull r.v., [17]). Let X be a r.v., then the following properties are equivalent. (1). The tails of X satisfy P ( | X | x ) e ( x / K 1 ) θ , for all x 0 ; (2). The moments of X satisfy X k : = ( E | X | k ) 1 / k K 2 k 1 / θ for all k 1 θ ; (3). The MGF of | X | 1 / θ satisfies E e λ 1 / θ | X | 1 / θ e λ 1 / θ K 3 1 / θ for | λ | 1 K 3 ; (4). E e | X / K 4 | 1 / θ 2 .
From the upper bound of ( E | X | k ) 1 / k in Proposition 6(2), an alternative definition of the sub-Weibull norm X ψ θ : = sup k 1 k 1 / θ ( E | X | k ) 1 / k is given by [17]. Let θ = 1 . An alternative definition of the sub-exponential norm is X ψ 1 : = sup k 1 k 1 ( E | X | k ) 1 / k see Proposition 2.7.1 of [1]. The sub-exponential r.v. X satisfies equivalent properties in Proposition 6 (Characterizations of sub-exponential with θ = 1 ). However, these definition is not enough to obtain the sharp parameter as presented in the sub-Gaussian case. Here, we redefine the sub-Weibull parameter by our Corollary 3(a).
Definition 4
(Sub-Weibull r.v., X subW ( θ , v ) ). Define the sub-Weibull norm
X φ θ = sup k 1 E | X | θ k / k ! 1 / ( θ k ) .
We denote the sub-Weibull r.v. as X subW ( θ , v ) if v = X φ θ < for a given θ > 0 . For θ 1 , the · φ θ is a norm which satisfies triangle inequality by Minkowski’s inequality: E ( | X + Y | r ) 1 / r [ E ( | X | r ) ] 1 / r + [ E ( | Y | r ) ] 1 / r , ( r 1 ) comparing to Proposition 1. Definition 4 is free of bounding MGF, and it avoids Stirling’s approximation in the proof of the tail inequality. We obtain following main results for this moment-based norm.
Corollary 6.
If X φ θ < , then P { | X | > t } 2 exp { t θ 2 X φ θ θ } for all t 0 .
Theorem 2
(sub-Weibull concentration). Suppose that there are n independent sub-Weibull r.v.s X i subW ( θ , v i ) for i = 1 , 2 , , n . We have
P i = 1 n X i t exp θ e 11 / 12 t θ 2 [ e ( i = 1 n v i ) C θ ] θ , for t e ( i = 1 n v i ) C θ ( 2 1 θ e 11 / 12 ) 1 / θ ,
and P 1 n i = 1 n X i e v ¯ 2 1 / θ C θ log ( α 1 ) θ e 11 / 12 1 / θ 1 α ( 1 e 1 , 1 ] . Moreover, we have
P | i = 1 n X i | e ( i = 1 n ( E | X i | ) t ) 1 / t + e ( i = 1 n v i ) 2 1 / θ C θ ( t θ e 11 / 12 ) 1 / θ e t , t 0 .
The proof of Theorem 2 can be seen in Appendix A.8. The concentration in this Theorem 2 will serve a critical role in many statistical and machine learning literature. For instance, the sub-Weibull concentrations in [7] contain unknown parameters, which makes the algorithm for general sub-Weibull random rewards is infeasible. However, when using our results, it will become feasible as we give explicit constants in these concentrations.
Importantly, the sub-exponential parameter is a special case of sub-Weibull norm by letting θ = 1 . Denote the sub-exponential parameter for r.v X as
X φ 1 : = sup k 1 E | X | k k ! 1 / k .
We denote X sE φ 1 ( v ) if v = X φ 2 . For exponential r.v. X Exp ( μ ) , the moment is E X k = k ! λ k and X φ 1 = λ . Another case of sub-Weibull norm is θ = 2 , which defines sub-Gaussian parameter:
X φ 2 : = sup k 1 E | X | 2 k k ! 1 / 2 k ( Var X ) 1 / 2 .
Like the generalized method of moments, we can give the higher-moment estimation procedure for the norm X φ 2 . Unfortunately, the method in Remark 1 for estimating MGF is not stable in the simulation since the exponential function has a massive variance in some cases.
  • Estimation procedure for X φ 2 and X φ 1 . Consider
    X ^ φ 2 = sup k 1 1 n × k ! i = 1 n | X i | 2 k 1 / ( 2 k ) , X ^ φ 1 = sup k 2 1 k ! · 1 n i = 1 n X i k 1 / k
    as a discrete optimization problem. We can take k max big enough to minimize
    1 n × k ! i = 1 n | X i | 2 k 1 / ( 2 k ) , 1 k ! · 1 n i = 1 n X i k 1 / k on k { 1 , , p max } .
At the first glimpse, the bigger p is, the larger n is required in this method. Nonetheless, often, most of common distributions only require a median-size of p to give a relatively good result, then only the median-size of n in turn is required. For standard Gaussian random, centralized Bernoulli (successful probability μ = 0.3 ), and uniform distributed (on [ 1 , 1 ] ) variable X,
X φ 2 = 2 Γ ( 1 + p ) / 2 Γ ( 1 / 2 ) Γ ( 1 + p / 2 ) 1 / p , μ ( 1 μ ) p + ( 1 μ ) μ p Γ ( p / 2 + 1 ) 1 / p , Γ 1 / p ( p / 2 + 1 ) ( p + 1 ) 1 / p .
It can be shown that X φ 2 1 , 0.4582576 , 0.5773503 . The Figure 1, Figure 2 and Figure 3 show the estimated value from different n under estimate method (8) for the three distributions mentioned above. The estimate method (8) is a correct estimated method for sub-Gaussian parameter to our best knowledge.
For centralized negative binomial, and centralized Poisson ( λ = 1 ) variable X, X φ 1 = 2.460938 , 0.7357589 , respectively. The Figure 4 and Figure 5 show the estimated value from different n under estimate method (8) for the four distributions mentioned above.
The five figures mentioned above show litter bias between the estimated norm and true norm. It is worthy to note that the norm estimator for centralized negative binomial case has a peak point. This is caused by sub-exponential distributions having relatively heavy tails, and hence the norm estimation may not robust as that in sub-Gaussian under relatively small sample sizes.
Moreover, sub-Gaussian and sub-exponential parameter is extensible for random vectors with values in a normed space ( X , · ) , we define norm-sub-Gaussian parameter and norm-sub-exponential parameter: The norm-sub-Gaussian parameter:
X φ 2 = sup k 1 ( k ! ) 1 / ( 2 k ) E X 2 k 1 / ( 2 k ) ;
the norm-sub-exponential parameter:
X φ 1 = sup k 1 ( k ! ) 1 / k E X k 1 / k .
We denote X nsubG φ 1 ( σ 2 ) and X nsubG φ 2 ( σ 2 ) for σ 2 = X φ 2 and X φ 1 , respectively.

3. Statistical Applications of Sub-Weibull Concentrations

3.1. Negative Binomial Regressions with Heavy-Tail Covariates

In statistical regression analysis, the responses { Y i } i = 1 n in linear regressions are assume to be continuous Gaussian variables. However, the category in classification or grouping may be infinite with index by the non-negative integers. The categorical variables is treated as countable responses for distinction categories or groups; sometimes it can be infinite. In practice, random count responses include the number of patients, the bacterium in the unit region, or stars in the sky and so on. The responses { Y i } i = 1 n with covariates { X i } i = 1 n belongs to generalized linear regressions. We consider i.i.d. random variables { ( X i , Y i ) } i = 1 n ( X , Y ) R p × N . By the methods of the maximum likelihood or the M-estimation, the estimator β ^ n is given by
β ^ n : = arg min β R p 1 n i = 1 n ( X i β , Y i ) ,
where the loss function ( · , · ) is convex and twice differentiable in the first argument.
In high-dimensional regressions, the dimension β may be growing with sample size n. When { Y i } i = 1 n belongs to the exponential family, [18] studied the asymptotic behavior of β ^ n in the generalized linear models (GLMs) as p n : = dim ( X ) is increasing. In our study, we focus on the case that the covariates is subW ( θ ) heavy-tailed for θ < 1 .
The target vector β : = arg min β R p E X T β , Y is assumed to be the loss under the population expectation, comparing to (9). Let ˙ ( u , y ) : = t ( t , y ) t = u , ¨ ( u , y ) : = t ˙ ( t , y ) t = u and C ( u , y ) : = sup | s t | u ¨ ( s , y ) ¨ ( t , y ) . Finally, define the score function and Hessian matrix of the empirical loss function are Z ^ n ( β ) : = 1 n i = 1 n ˙ ( X i T β , Y i ) X i and Q ^ n ( β ) : = 1 n i = 1 n ¨ ( X i T β , Y i ) X i X i T , respectively. The population version of Hessian matrix is Q ( β ) : = E [ ¨ ( X T β , Y ) X X T ] . The following so-called determining inequalities guarantee the 2 -error for the estimator obtained from the smooth M-estimator defined as (9).
Lemma 4
(Corollary 3.1 in [19]). Let δ n ( β ) : = 3 2 [ Q ^ n ( β ) ] 1 Z ^ n ( β ) 2 for β R p . If ( · , · ) is a twice differentiable function that is convex in the first argument and for some β R p : max 1 i n C X i 2 δ n ( β ) , Y i 4 3 . Then there exists a vector β ^ n R p satisfying Z ^ n ( β ^ n ) = 0 as the estimating equation of (9),
1 2 δ n ( β ) β ^ n β 2 δ n β .
Applications of Lemma 4 in regression analysis is of special interest when X is heavy tailed, i.e., the sub-Weibull index θ < 1 . For the negative binomial regression (NBR) with the known dispersion parameter k > 0 , the loss function is
( u , y ) = y u + ( y + k ) log ( k + e u ) .
Thus we have ˙ ( u , y ) = k ( y e u ) k + e u , ¨ ( u , y ) = k ( y + k ) e u ( k + e u ) 2 , see [20] for details.
Further computation gives C ( u , y ) = sup | s t | u e s ( k + e t ) 2 ( k + e s ) 2 e t and it implies that C ( u , y ) e 3 u . Therefore, condition max 1 i n C X i 2 δ n ( β ) , Y i 4 3 in Lemma 4 leads to
max 1 i n X i 2 δ n β log ( 4 / 3 ) 3 .
This condition need the assumption of the design space for max 1 i n X i 2 .
In NBR with loss (10), one has
Q ^ n ( β ) : = 1 n i = 1 n ( Y i + k ) k e X i β X i X i ( k + e X i β ) 2 and Z ^ n ( β ) : = 1 n i = 1 n k ( Y i e X i β ) X i k + e X i β .
To guarantee that β ^ n approximates β well, some regularity conditions are required.
(C.1): For M Y , M X > 0 , assume max 1 i n Y i ψ 1 M Y and the heavy-tailed covariates { X i k } are uniformly sub-Weibull with max 1 i n , 1 k p X i k ψ θ M X for 0 < θ < 1 .
(C.2): The vector X i is sparse or bounded. Let F Y : = { max 1 i n E Y i = max 1 i n e X i β B , max 1 i n X i 2 I n } with a slowly increasing function I n , we have P { F Y c } = ε n 0 .
In addition, to bound max 1 i n , 1 i k | X i k | , the sub-Weibull concentration determines:
P max 1 i n , 1 i k | X i k | > t n p P ( | X 11 | > t ) 2 n p e ( t / X 11 ψ θ ) θ δ t = M X log 1 / θ ( 2 n p δ ) ,
by using Corollary 3. Hence, we define the event for the maximum designs:
F max = max 1 i n , 1 k p | X i k | M X log 1 / θ ( 2 n p δ ) F Y .
To make sure that the optimization in (9) has a unique solution, we also require the minimal eigenvalue condition.
(C.3): Suppose that b E ( Q ^ n ( β ) ) b C min is satisfied for all b S p 1 .
In the proof, to ensure that the random Hessian function has a non-singular eigenvalue, we define the event
F 1 = max k , j 1 n i = 1 n Y i k e X i β X i k X i j ( k + e X i β ) 2 E Y i k e X i β X i k X i j ( k + e X i β ) 2 C min 4
F 2 = max k , j 1 n i = 1 n k e X i β X i k X i j ( k + e X i β ) 2 E k e X i β X i k X i j ( k + e X i β ) 2 C min 4 .
Theorem 3
(Upper bound for 2 -error). In the NBR with loss (10) and ( C . 1 C . 3 ) , let
M B X = M X + B log 2 , R n : = 6 M B X M X C min 2 p n log 2 p δ + 1 n p log 2 p δ log 1 / θ 2 n p δ ,
and b : = ( k / n ) M X 2 ( 1 , , 1 ) R n . Under the event F 1 F 2 F max , for any 0 < δ < 1 , if the sample size n satisfies
R n I n log ( 4 / 3 ) 3 ,
Let c n : = e 1 4 ( n t 2 2 M X 4 log 4 / θ ( 2 n p δ ) M B X 2 n t M X 2 log 2 / θ ( 2 n p δ ) M B X ) + e ( t θ / 2 [ 4 e C ( θ / 2 ) b 2 L n ( θ / 2 , b ) ] θ / 2 t 2 16 e 2 C 2 ( θ / 2 ) b 2 2 ) with t = C min / 4 , then
P ( β ^ n β 2 R n ) 1 2 p 2 c n δ ε n .
A few comment is made on this theorem. First, in order to get β ^ n β 2 p 0 , we need p = o ( n ) under sample size restriction (11) with I n = o ( log 1 / θ ( n p ) · [ n 1 p log p ] 1 / 2 ) . Second, note that the ε n in provability 1 2 p 2 c n δ ε n depends on the models size and the fluctuation of the design by the event F max .

3.2. Non-Asymptotic Bai-Yin’s Theorem

In statistical machine learning, exponential decay tail probability is crucial to evaluate the finite-sample performance. Unlike Bai-Yin’s law with the fourth-moment condition that leads to polynomial decay tail probability, under sub-Weibull conditions of data, we provide a exponential decay tail probability on the extreme eigenvalues of a n × p random matrix.
Let A = A n , p be an n × p random matrix whose entries are independent copies of a r.v. with zero mean, unit variance, and finite fourth moment. Suppose that the dimensions n and p both grow to infinity while the aspect ratio p / n converges to a constant in [ 0 , 1 ] . Then Bai-Yin’s law [21] asserted that the standardized extreme eigenvalues satisfying
1 n λ m i n ( A ) = 1 p n + o p n , 1 n λ m a x ( A ) = 1 + p n + o p n a . s . .
Next we introduce a special counting measure for measuring the complexity of a certain set in some space. The N ε is called an ε-net of K in R n if K can be covered by balls with centers in K and radii ε (under Euclidean distance). The covering number N ( K , ε ) is defined by the smallest number of closed balls with centers in K and radii ε whose union covers K.
For purposes of studying random matrices, we need to extend the definition of sub-Weibull r.v. to sub-Weibull random vectors. The n-dimensional unit Euclidean sphere S n 1 , is denoted by S n 1 = { x R n : x 2 = 1 } . We say that a random vector X in R n is sub-Weibull if the one-dimensional marginals X , a are sub-Weibull r.v.s for all a R n . The sub-Weibull norm of a random vector X is defined as X ψ θ : = sup a S n 1 X , a ψ θ . Similarly, define the spectral norm for any p × p matrix B as B = max | | x | | 2 = 1 B x 2 = sup x S p 1 | B x , x | . Spectral norm has many good properties, see [1] for details.
Furthermore, for simplicity, we assume that the rows in random matrices are isotropic random vectors. A random vector Y in R n is called isotropic if Var ( Y ) = I p . Equivalently, Y is isotropic if E [ Y , a 2 ] = a 2 2 for all a R n . In the non-asymptotic regime, Theorem 4.6.1 in [1] study the upper and lower bounds of maximum (minimum) eigenvalues of random matrices with independent sub-Gaussian entries which are sampled from high-dimensional distributions. As an extension of Theorem 4.6.1 in [1], the following result is a non-asymptotic versions of Bai-Yin’s law for sub-Weibull entries, which is useful to estimate covariance matrices from heavy-tailed data [ subW ( θ ) , θ < 1 ].
Theorem 4 (Non-asymptotic Bai-Yin’s law).
Let A be an n × p matrix whose rows A i are independent isotropic sub-Weibull random vectors in R p with covariance matrix I p and max 1 i n A i ψ θ K . Then for every s 0 , we have
P 1 n A A I p H ( c p + s 2 , n ; θ ) 1 2 e s 2 ,
where
H ( t , n ; θ ) : = 2 e K C ( θ / 2 ) K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ) t n + A ( θ / 2 ) ( γ 2 t ) 2 / θ n , θ 2 B ( θ / 2 ) ( γ 2 t ) 2 / θ n 1 / θ , θ > 2 ,
where K α : = 2 1 / α if α ( 0 , 1 ) and K α = 1 if α 1 ; A ( θ / 2 ) , B ( θ / 2 ) and C ( θ / 2 ) defined in Theorem 1a.
Moreover, the concentration inequality for extreme eigenvalues hold for c n log 9 / p
P 1 H 2 ( c p + s 2 , n ; θ ) λ m i n ( A ) n λ m a x ( A ) n 1 + H 2 ( c p + s 2 , n ; θ ) 1 2 e s 2 .

3.3. General Log-Truncated Z-Estimators and sub-Weibull Type Robust Estimators

Motivated from log-truncated loss in [22,23], we study the almost surely continuous and non-decreasing function φ c : R R for truncating the original score function
log 1 x + c ( | x | ) φ c ( x ) log 1 + x + c ( | x | ) , x R
where c ( | x | ) > 0 is a high-order function [23] of | x | which is to be specified. For example, a plausible choose for φ c ( x ) in (13) should have following form
φ c ( x ) = log 1 + x + c ( | x | ) 1 ( x 0 ) log 1 x + c ( | x | ) 1 ( x 0 ) = sign ( x ) log ( 1 + | x | + c ( | x | ) ) .
For (14), we get φ c ( x ) x for sufficiently smaller x and φ c ( x ) x for larger x. Under (13), now we show that c ( | x | ) must obey a key inequality. For all x R , it suffices to verify log [ 1 x + c ( | x | ) ] log [ 1 + x + c ( | x | ) ] , which is equivalent to check log 1 + c ( | x | ) + x 1 + c ( | x | ) x 0 , namely 1 + c ( | x | ) 2 x 2 1 c ( | x | ) 1 + x 2 1 .
For independent r.v.s { X i } i = 1 n , using the score function (14), we define the score function of data
Z ^ α n ( θ ) = 1 n α n i = 1 n φ c α n X i θ for any θ R .
Then the influence of the heavy-tailed outliers is weaken by φ c α n X i θ by choosing an optimal α n . We aim to estimate the average mean: μ n : = 1 n i = 1 n E X i for non-i.i.d. samples { X i } i = 1 n . Define the Z-estimator θ ^ α n as
θ ^ α n { θ R : Z ^ α n ( θ ) = 0 } ,
where α n is the tuning parameter (will be determined later).
To guarantee consistency for log-truncated Z-estimators (15), we require following assumptions of c ( · ) .
(C.1): For a constant c 2 > 1 , the c ( x ) satisfies weak triangle inequality and scaling property,
( C . 1.1 ) : c ( x + y ) c 2 [ c ( x ) + c ( y ) ] , ( C . 1.2 ) : c ( t x ) f ( t ) c ( x )
for f ( t ) satisfies
(C.1.3): f ( t ) and f ( t ) / | t | are non-constant increasing functions and lim t 0 f ( t ) / | t | = 0 .
Remark 6.
Note that | x | 1 + x 2 1 and we could put c ( | x | ) = | x | . However, c ( | x | ) = | x | does not satisfy (C.1.3) since f ( t ) = | t | and f ( t ) / | t | are constant functions of t.
In the following theorem, we establish the finite sample confidence interval and the convergence rate of the estimator θ ^ α n .
Theorem 5.
Let { X i } i = 1 n be independent samples drawn from an unknown probability distribution { P i } i = 1 n on R . Consider the estimator θ ^ α n defined as (15) with (C.1), α n 0 and 1 n i = 1 n E [ c ( X i θ ) ] = O ( 1 ) . Let B n + ( θ ) = μ n θ + 1 n α n i = 1 n E [ c α n ( X i θ ) ] + log ( δ 1 ) n α n and B n ( θ ) = μ n θ 1 n α n i = 1 n E [ c α n ( X i θ ) ] log ( δ 1 ) n α n . Let θ + be the smallest solution of the equation B n + ( θ ) = 0 and θ be the largest solution of B n ( θ ) = 0 .
(a). We have with the ( 1 2 δ ) -confidence intervals
P ( B n ( θ ) < Z ^ α n ( θ ) < B n + ( θ ) ) 1 2 δ , P ( θ θ ^ α n θ + ) 1 2 δ ,
for any δ ( 0 , 1 / 2 ) satisfies the sample condition:
1 n α n i = 1 n E [ c α n X i α n [ μ n ± d n ( c ) ] ] + log ( δ 1 ) n α n < d n ( c ) ,
where d n ( c ) is a constant such that B n ± ( μ n ± d n ( c ) ) < 0 .
(b). Moreover, picking α n f 1 log ( δ 1 ) c 2 i = 1 n E [ c X i μ n ] , one has
P | θ ^ α n μ n | g α n 1 2 log ( δ 1 ) n α n 1 2 δ , with g α n ( t ) : = t + c 2 α n c α n t .
The (17) in Theorem 5 is a fundamental extension of Lemma 2.1 (see Theorem 16 in [24]) with c ( x ) = x 2 / 2 from i.i.d. sample to independent sample. Let c ( x ) = | x | β / β , for i.i.d. sample, Theorem 5 implies Lemmas 2.3, 2.4 and Theorem 2.1 in [22]. The α n f 1 log ( δ 1 ) c 2 i = 1 n E [ c X i μ n ] in Theorem 5(b) gives a theoretical guarantee for choosing the tuning parameter α n .
Proposition 7
(Theorem 2.1 in [22]). Let { X i } i = 1 n be a sequence of i.i.d. samples drawn from an unknown probability distribution on R . We assume E X 1 β < for a certain β ( 1 , 2 ] and denote μ = E X 1 , v β = E X 1 μ β . Given any ϵ ( 0 , 1 / 2 ) and positive integer n 2 v β + 1 β β β 1 2 β log ϵ 1 v β , let α n = 1 2 ( 2 β log ( ϵ 1 ) n v β ) 1 β . Then, with probability at least 1 2 ϵ ,
| θ ^ α n μ | 2 2 β log ( ϵ 1 ) n β 1 β v β 1 β β 2 β log ( ϵ 1 ) n v β β 1 β 1 = O n β 1 β .
Comparing to the convergence rate in (18), put O ( n β 1 β ) = O ( n 1 / θ ) for θ > 2 . It implies
β 1 + θ 1 = 1 , ( θ 2 or 0 < β 2 ) .
For example, let us deal with the Pareto distribution Pareto ( α , k ) with shape parameter α > 0 and scale parameter k > 0 , and the density function is f ( x ) = α k α x α + 1 · 1 { x [ k , ) } . For α 2 , Pareto ( α , k ) has infinite variance, and it does not belong to the sub-Weibull distribution, so do the sample mean of i.i.d. Pareto distributed data. Proposition 7 shows that the estimator error for robust mean estimator enjoys sub-Weibull concentration as presented in Proposition 3, without finite sub-Weibull norm assumption of data. With the Weibull-tailed behavior, it motivates us to define general sub-Weibull estimators having the non-parametric convergence rate O ( n 1 / θ ) in Proposition 3 for θ > 2 , even if the data do not have finite sub-Weibull norm.
Definition 5
(Sub-Weibull estimators). An estimator μ ^ : = μ ^ ( X 1 , , X n ) based on i.i.d. samples { X i } i = 1 n from an unknown probability distribution P with mean μ P , is called ( A , B , C ) - subW ( θ ) if
t ( 0 , A ) , P ( | μ ^ μ P | B ( t / n ) 1 / θ ) 1 C e t .
For example, in Proposition 7, θ ^ α n is ( , B , 1 ) - subW ( β β 1 ) with B 2 2 β log ( ϵ 1 ) β 1 β v β 1 β in Definition 5. When θ = 2 , [25] defined sub-Gaussian estimators (includes Median of means and Catoni’s estimators) for certain heavy-tailed distributions and discussed the nonexistence of sub-Gaussian mean estimators under β -moment condition for the data ( β ( 1 , 2 ) ).

4. Conclusions

Concentration inequalities are far-reaching useful in high-dimensional statistical inferences and machine learnings. They can facilitate various explicit non-asymptotic confidence intervals as a function of the sample size and model dimension.
Future research includes sharper version of Theorem 2 that is crucial to construct non-asymptotic and data-driven confidence intervals for the sub-Weibull sample mean. Although we have obtained sharper upper bounds for sub-Weibull concentrations, the lower bounds on tail probabilities are also important in some statistical applications [26]. Developing non-asymptotic and sharp lower tail bounds of Weibull r.v.s is left for further study. For negative binomial concentration inequalities in Corollary 2, it is of interesting to study concentration inequalities of COM-negative binomial distributions (see [27]).

Author Contributions

Conceptualization, H.Z. and H.W.; Formal analysis, H.Z. and H.W.; Funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by National Natural Science Foundation of China Grant (12101630) and the University of Macau under UM Macao Talent Programme (UMMTP-2020-01). This work is also supported in part by the Key Project of Natural Science Foundation of Anhui Province Colleges and Universities (KJ2021A1034), Key Scientific Research Project of Chaohu University (XLZ-202105).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

The authors thank Guang Cheng for the discussion about doing statistical inference in the non-asymptotic way and Arun Kumar Kuchibhotla for his help about the proof of Theorem 1. The authors also thank Xiaowei Yang for his helpful comments on Theorem 5.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1

Proof of Corollary 1.
ϕ | X | θ ( t ) is continuous for t a neighborhood of zero, by the definition, 2 E e ( | X | / X ψ θ ) θ = m | X | θ X ψ θ θ . Since | X | θ > 0 , the MGF m | X | θ ( t ) is monotonic increasing. Hence, inverse function m | X | θ 1 ( t ) exists and X ψ θ θ = m | X | θ 1 ( 2 ) . So X ψ θ = m | X | θ 1 ( 2 ) 1 / θ . □

Appendix A.2

Proof of Corollary 2.
The first inequality is the direct application of (3) by observing that for any constant a R , and r.v. Y with Y ψ 1 < , a Y ψ 1 = | a | Y ψ 1 , Y + a ψ 1 Y ψ 1 + a ψ 1 = Y ψ 1 + | a | / log 2 and X + a ψ 1 2 ( X ψ 1 + | a | / log 2 ) 2 . The second inequality is obtained from (3) by considering two rate in ( t 2 i = 1 n 2 Y i ψ 1 2 t max 1 i n Y i ψ 1 ) separately. For (5), we only need to note that
Y i ψ 1 = inf { t > 0 : E e Y i / t 2 } = inf { t > 0 : 1 q i 1 q i e 1 / t k i 2 } = log 1 ( 1 q i ) / 2 k i q i 1 .
Then the third inequality is obtained by the first inequality and the definition of a ( μ i , k i ) . □

Appendix A.3

Proof of Corollary  3.
The first and second part of this proposition were shown in Lemma 2.1 of [28]. For the third result, using the bounds of Gamma function [see [29]:
2 π x x ( 1 / 2 ) e x Γ ( x ) [ 2 π x x ( 1 / 2 ) e x ] · e 1 / ( 12 x ) , ( x > 0 ) ,
it gives
( E | X | k ) 1 / k 2 X φ θ k k θ [ 2 π k / θ k θ 1 2 e 11 k 12 θ ] 1 / k = ( 2 2 π θ ) 1 / k { k θ k θ + 1 2 e 11 k 12 θ } 1 / k X φ θ = ( 2 2 π θ ) 1 / k k / θ 1 θ + 1 2 k e 11 12 θ X φ θ C θ ( θ e 11 / 12 ) 1 / θ X φ θ k 1 / θ .

Appendix A.4

Proof of Corollary 4.
By the definition of ψ θ -norm, E exp { | X / X ψ θ | θ } 2 . Then E exp { | | X | r / X ψ θ r | θ / r } 2 . The result | X | r subW ( θ / r ) follows by the definition of ψ θ -norm again. Moreover,
X ψ θ : = inf { C ( 0 , ) : E [ exp ( | X | θ / C θ ) ] 2 } = [ inf { C r ( 0 , ) : E [ exp { | | X | r / C r | θ / r } ] 2 } ] 1 / r = | X | r ψ θ / r 1 / r ,
which verifies (6). If X subW ( r θ ) , then E exp { | X r / X ψ r θ r | θ } = E exp { | X / X ψ r θ | r θ } 2 , which means that X r subW ( θ ) with
X ψ r θ : = inf { C ( 0 , ) : E [ exp ( | X | r θ / C r θ ) ] 2 } = [ inf { C r ( 0 , ) : E [ exp { | | X | r / C r | θ } ] 2 } ] 1 / r = | X | r ψ θ 1 / r .

Appendix A.5

Proof of Corollary 5.
Set Δ : = sup p 2 X p p + L p 1 / θ so that X p Δ p + L Δ p 1 / θ holds for all p 2 . By Markov’s inequality for t-th moment ( t 2 ) , we have
P | X | e Δ t + e L Δ t 1 / θ | | X | | t e Δ [ t + L t 1 / θ ] t e t , [ By the definition of Δ ] .
So, for any t 2 ,
P | X | e Δ t + e L Δ t 1 / θ e t .
Note the definition of Δ shows X t Δ t + L Δ t 1 / θ holds for all t 2 and assumption X t C 1 t + C 2 t 1 / θ for all t 2 . It gives e Δ t + e L Δ t 1 / θ e C 1 t + e C 2 t 1 / θ . This inequality with (A1) gives
P | X | e C 1 t + e C 2 t 1 / θ 1 { 0 < t < 2 } + e t { t 2 } , t > 0 .
Take K = k 2 / θ C 2 / ( k C 1 ) , and define δ k : = k e C 1 for a certain constant k > 1 ,
E Ψ θ , K | X | δ k = 0 P | X | δ k Ψ θ , K 1 ( s ) d s = 0 P ( | X | k e C 1 log ( 1 + s ) + k e C 1 K [ log ( 1 + s ) ] 1 / θ ) d s = 0 P ( | X | e C 1 log ( 1 + s ) k 2 + e C 2 [ log ( 1 + s ) k 2 ] 1 / θ ) d s By ( A 2 ) ] 0 < k 2 log ( 1 + s ) < 2 d s + k 2 log ( 1 + s ) 2 exp k 2 log ( 1 + s ) d s 0 e 2 k 2 1 d t + e 2 k 2 1 d t ( 1 + t ) k 2 = e 2 k 2 1 + ( 1 + t ) 1 k 2 1 k 2 e 2 k 2 1 = e 2 k 2 1 + e 2 ( 1 k 2 ) / k 2 k 2 1 1 .
Therefore, X Ψ θ , K γ e C 1 with γ defined as the smallest solution of the inequality { k > 1 : e 2 k 2 1 + e 2 ( 1 k 2 ) / k 2 k 2 1 1 } . An approximate solution is γ 1.78 . □

Appendix A.6

The main idea in the proof is by the sharper estimates of the GBO norm of the sum of symmetric r.v.s.
Proof of Theorem 1.
(a)
Without loss of generality, we assume X i ψ θ = 1 . Define Y i : = | X i | ( log 2 ) 1 / θ + , then it is easy to check that P ( | X i | t ) 2 e t θ implies P ( Y i t ) e t θ . For independent Rademacher r.v. { ε i } i = 1 n , the symmetrization inequality gives i = 1 n w i X i p 2 i = 1 n ε i w i X i p . Note that ε i X i is identically distributed as ε i | X i | ,
i = 1 n w i X i p 2 i = 1 n ε i w i | X i | p 2 i = 1 n ε i w i Y i + ( log 2 ) 1 / θ p 2 i = 1 n ε i w i Y i p + 2 ( log 2 ) 1 / θ i = 1 n ε i w i p [ Khinchin - Kahane inequality ] 2 i = 1 n ε i w i Y i p + 2 ( log 2 ) 1 / θ p 1 2 1 1 / 2 i = 1 n ε i w i 2 < 2 i = 1 n ε i w i Y i p + 2 ( log 2 ) 1 / θ p ( E ( i = 1 n ε i w i ) 2 ) 1 / 2 ε i i = 1 n are independent ] = 2 i = 1 n ε i w i Y i p + 2 ( log 2 ) 1 / θ p w 2 .
From Lemma 2, we are going to handle the first term in (A3) with the sum of symmetric r.v.s. Since P ( Y i t ) e t θ , then
i = 1 n ε i w i Y i p = i = 1 n w i Z i p , Z i : = ε i Y i
for symmetric independent r.v.s { Z i } i = 1 n satisfying | Z i | = d Y i and P ( Z i t ) = e t θ for all t 0 .
Next, we proceed the proof by checking the moment conditions in Corollary 5.
Case θ 1 : N ( t ) = t θ is concave for θ 1 . From Lemmas 2 and 3 (a), for p 2 ,
i = 1 n w i Z i p e inf t > 0 : i = 1 n log ϕ p e 2 w i e 2 t Z i p e inf t > 0 : i = 1 n p M p , Z i w i e 2 t p = e inf t > 0 : i = 1 n w i e 2 t p Z i p p p w i e 2 t 2 Z i 2 2 p e inf t > 0 : Γ p θ + 1 e 2 p t p w p p 1 + e inf t > 0 : p Γ 2 θ + 1 e 4 t 2 w 2 2 1 ,
where the last inequality we use Z i p p = 0 p t p 1 P ( | Z i | t ) d t 0 p t p 1 e t θ d t = p Γ p θ + 1 . Hence
i = 1 n w i Z i p e 3 Γ 1 / p p θ + 1 w p + p Γ 1 / 2 2 θ + 1 w 2 ,
and
i = 1 n w i X i p 2 e 3 Γ 1 / p p θ + 1 w p + p Γ 1 / 2 2 θ + 1 w 2 + 2 ( log 2 ) 1 / θ p w 2 = 2 e 3 Γ 1 / p p θ + 1 w p + 2 ( log 2 ) 1 / θ + e 3 Γ 1 / 2 2 θ + 1 p w 2 .
Using homogeneity, we can assume that p w 2 + p 1 / θ w = 1 . Then w 2 p 1 / 2 and w p 1 / θ . Therefore, for p 2 ,
w p i = 1 n | w i | 2 w p 2 1 / p ( p 1 ( p 2 ) / θ ) 1 / p = ( p p / θ p ( 2 θ ) / θ ) 1 / p 3 2 θ 3 θ p 1 / θ = 3 2 θ 3 θ p 1 / θ { p w 2 + p 1 / θ w } ,
where the last inequality follows form the fact that p 1 / p 3 1 / 3 for any p 2 , p N . Hence
i = 1 n w i X i p 2 e 3 + 2 θ e θ Γ 1 / p p θ + 1 w + 2 log 1 / θ 2 + e 3 Γ 1 / 2 2 θ + 1 + 3 2 θ 3 θ p 1 θ Γ 1 / p p θ + 1 p w 2 .
Following Corollary 5, we have
i = 1 n w i X i Ψ θ , L n ( θ , p ) γ e D 1 ( θ ) ,
where L n ( θ , p ) = γ 2 / θ D 2 ( θ , p ) γ D 1 ( θ ) , D 1 ( θ ) : = 2 [ log 1 / θ 2 + e 3 ( Γ 1 / 2 ( 2 θ + 1 ) + sup p 2 3 2 θ 3 θ p 1 θ Γ 1 / p ( p θ + 1 ) ) ] w 2 < , and D 2 ( θ , p ) : = 2 e 3 3 2 θ 3 θ p 1 / θ Γ 1 / p p θ + 1 w .
Finally, take L n ( θ ) = inf p 1 L n ( θ , p ) > 0 . Indeed, the positive limit can be argued by (2.2) in [30]. Then by the monotonicity property of the GBO norm, it gives
i = 1 n w i X i Ψ θ , L n ( θ ) i = 1 n w i X i Ψ θ , L n ( θ , p ) γ e D 1 ( θ ) .
Case θ > 1 : In this case N ( t ) = t θ is convex with N ( t ) = θ 1 θ 1 1 θ 1 t θ θ 1 . By Lemmas 2 and 3(b), for p 2 , we have
i = 1 n w i Z i p e inf t > 0 : i = 1 n log ϕ p 4 w i t Z i / 4 p + e inf t > 0 : i = 1 n p M p , Z i ( 4 w i t ) p e inf t > 0 : i = 1 n p 1 N p | 4 w i t | 1 + e inf t > 0 : i = 1 n p ( 4 w i t ) 2 1 = 4 e p w 2 + ( p / θ ) 1 / θ ( 1 θ 1 ) 1 / β w β
with β mentioned in the statement. Therefore, for p 2 , Equation (A3) implies
i = 1 n w i X i p [ 8 e + 2 ( log 2 ) 1 / θ ] p w 2 + 8 e ( p / θ ) 1 / θ ( 1 θ 1 ) 1 / β w β .
Then the following result follows by Corollary 5,
i = 1 n w i X i Ψ θ , L ( θ ) γ e D 1 ( θ ) ,
where L n ( θ ) = γ 2 / θ D 2 ( θ ) γ D 1 ( θ ) , D 1 ( θ ) = 8 e + 2 ( log 2 ) 1 / θ w 2 , and D 2 ( θ ) = 8 e θ 1 / θ ( 1 θ 1 ) 1 / β w β .
Note that w i X i = ( w i X i ψ θ ) ( X i / X i ψ θ ) , we can conclude (a).
(b)
It is followed from Proposition 5 and (a).
(c)
For easy notation, put L n ( θ ) = L n ( θ , b X ) in the proof. When θ < 2 , by the inequality a + b 2 ( a b ) for a , b > 0 , we have
P | i = 1 n w i X i | 4 e C ( θ ) b 2 t 2 e t , if t L n ( θ ) t 1 / θ .
Put s : = 4 e C ( θ ) b 2 t , we have
P | i = 1 n w i X i | s 2 exp s 2 16 e 2 C 2 ( θ ) b 2 2 , if s 4 e C ( θ ) b 2 L n θ / ( θ 2 ) ( θ ) .
For t L n ( θ ) t 1 / θ , we obtain P ( | i = 1 n w i X i | 4 e C ( θ ) b X 2 L n ( θ ) t 1 / θ ) 2 e t . Let s : = 4 e C ( θ ) b 2 L n ( θ ) t 1 / θ , it gives
P | i = 1 n w i X i | s 2 exp s θ [ 4 e C ( θ ) b 2 L n ( θ ) ] θ , if s > 4 e C ( θ ) b 2 L n θ / ( θ 2 ) ( θ ) .
Similarly, for θ > 2 , it implies
P i = 1 n w i X i s 2 e s θ [ 4 e C ( θ ) b 2 L n ( θ ) ] θ if s 4 e C ( θ ) b 2 L n θ / ( 2 θ ) ( θ ) ,
and P i = 1 n w i X i s 2 e s 2 16 e 2 C 2 ( θ ) b 2 2 if s 4 e C ( θ ) b 2 L n θ / ( 2 θ ) ( θ ) . □

Appendix A.7

Proof of Corollary 6.
Using the definition of X φ θ , it yields
E e ( c 1 | X | ) θ = 1 + k = 1 c k E | X | k θ k ! 1 + k = 1 c k k ! X φ θ k θ k ! = 1 + k = 1 ( X φ θ θ c θ ) k = 1 + X φ θ θ c θ k = 0 ( X φ θ θ c θ ) k X φ 2 θ c θ < 1 ] = 1 + ( X φ 2 θ c θ ) 1 1 X φ 2 θ / c θ 2
if X φ 2 θ c θ 1 2 which implies that the minimal c is 2 1 / θ X φ θ . That is to say we have E e | X / [ 2 1 / θ X φ θ ] | 1 / θ 2 . Applying (2), we have
P { | X | > t } 2 e ( t / [ 2 1 / θ X φ θ ] ) θ = 2 exp { t θ 2 X φ θ θ } for all t 0 .

Appendix A.8

Proof of Theorem 2.
Minkowski’s inequality for p 1 and definition of X φ θ imply
i = 1 n X i p i = 1 n X i p i = 1 n v i · 2 1 / θ C θ p θ e 11 / 12 1 / θ ,
where the last inequality by letting C θ : = max k 1 2 2 π θ 1 / k k θ 1 / ( 2 k ) in Corollary 3b.
From Markov’s inequality, it yields
P i = 1 n X i t t p i = 1 n X i p p t p ( i = 1 n v i ) p 2 p / θ C θ p θ e 11 / 12 p / θ .
Let t p ( i = 1 n v i ) p 2 p / θ C θ p θ e 11 / 12 p / θ = e p , it gives
t = e ( i = 1 n v i ) 2 1 / θ C θ p θ e 11 / 12 1 / θ and p = θ e 11 / 12 t θ [ e ( i = 1 n v i ) 2 1 / θ C θ ] θ .
Therefore, for p 1 , we have
P i = 1 n X i t P i = 1 n X i e ( i = 1 n v i ) C θ ( 2 1 θ e 11 / 12 ) 1 / θ e p ( 0 , e 1 ] .
So
P i = 1 n X i t exp θ e 11 / 12 t θ 2 [ e ( i = 1 n v i ) C θ ] θ , t e ( i = 1 n v i ) C θ ( 2 1 θ e 11 / 12 ) 1 / θ .
Let v ¯ = 1 n i = 1 n v i and e p = : α . Then
P((1/n)∑_{i=1}^n X_i ≤ e v̄ · 2^{1/θ} C_θ (log(α^{-1})/(θ e^{11/12}))^{1/θ}) ≥ 1 − α ∈ (1 − e^{-1}, 1].
For p < 1, note that moment monotonicity shows that (E|X|^p)^{1/p} is a non-decreasing function of p, i.e.,
0 < p ≤ 1 implies (E|X|^p)^{1/p} ≤ E|X|.
The c_r-inequality implies ‖∑_{i=1}^n X_i‖_p^p ≤ ∑_{i=1}^n ‖X_i‖_p^p. Using Markov's inequality again, we have
P(∑_{i=1}^n X_i ≥ t) ≤ t^{-p} ‖∑_{i=1}^n X_i‖_p^p ≤ t^{-p} ∑_{i=1}^n ‖X_i‖_p^p ≤ t^{-p} ∑_{i=1}^n (E|X_i|)^p.
Put t^{-p} ∑_{i=1}^n (E|X_i|)^p = e^{-p}, i.e., t = e(∑_{i=1}^n (E|X_i|)^p)^{1/p}. Then, we obtain
P(∑_{i=1}^n X_i ≥ e(∑_{i=1}^n (E|X_i|)^p)^{1/p}) ≤ e^{-p} ∈ (e^{-1}, 1).
Combining (A5) and (A6), we obtain, for all t ≥ 0,
P(|∑_{i=1}^n X_i| ≥ e(∑_{i=1}^n (E|X_i|)^t)^{1/t} + e(∑_{i=1}^n v_i) 2^{1/θ} C_θ (t/(θ e^{11/12}))^{1/θ}) ≤ e^{-t}.
This completes the proof. □
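The moment-to-tail step used above (Markov's inequality at the optimal order p) can be illustrated with a tiny numerical experiment. In the Python sketch below, the constants theta, C, and t are arbitrary illustrative values and the moment growth ‖S‖_p ≤ C p^{1/θ} is an assumed toy input rather than the exact bound of the proof; the script minimizes the Markov bound (‖S‖_p/t)^p over p by grid search and compares it with the balancing choice that makes the bound equal to e^{−p}.

import numpy as np

theta, C, t = 0.5, 1.0, 50.0

p_grid = np.linspace(1.0, 20.0, 2000)
markov = (C * p_grid ** (1.0 / theta) / t) ** p_grid   # Markov bound at each p

p_balance = (t / (np.e * C)) ** theta                  # choice solving C*p^{1/theta} = t/e
print(f"best grid bound : p = {p_grid[np.argmin(markov)]:.2f}, bound = {markov.min():.3e}")
print(f"balancing choice: p = {p_balance:.2f}, bound = exp(-p) = {np.exp(-p_balance):.3e}")
# The balancing choice is not the exact minimizer, but it yields a bound of the
# same exponential order, which is all the proof needs.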

Appendix A.9

Proof of Theorem 3.
Note that, for any b ∈ S^{p−1}, it holds that
b Q ^ n ( β ) b b E ( Q ^ n ( β ) ) b b max k , j | [ Q ^ n ( β ) E Q ^ n ( β ) ] k j | = max k , j 1 n i = 1 n ( Y i + k ) k e X i β X i X i ( k + e X i β ) 2 E ( Y i + k ) k e X i β X i X i ( k + e X i β ) 2 k j .
Consider the decomposition
1 n i = 1 n ( Y i + k ) k e X i β X i k X i j ( k + e X i β ) 2 E ( Y i + k ) k e X i β X i k X i j ( k + e X i β ) 2 = 1 n i = 1 n Y i k e X i β X i k X i j ( k + e X i β ) 2 E Y i k e X i β X i k X i j ( k + e X i β ) 2 + k n i = 1 n k e X i β X i k X i j ( k + e X i β ) 2 E k e X i β X i k X i j ( k + e X i β ) 2
For the first term, under the event F_max and with t = C_min/4, we have
P 1 n i = 1 n Y i k e X i β X i k X i j ( k + e X i β ) 2 E Y i k e X i β X i k X i j ( k + e X i β ) 2 t , F max 2 exp 1 4 n 2 t 2 2 i = 1 n ( X i k X i j ) 2 ( Y i ψ 1 + | exp ( X i β ) log 2 | ) 2 n t max 1 i n | X i k X i j | ( Y i ψ 1 + | exp ( X i β ) log 2 | ) 2 exp 1 4 n t 2 2 M X 4 log 4 / θ ( 2 n p δ ) M B X 2 n t M X 2 log 2 / θ ( 2 n p δ ) M B X
where we use k e^{X_i^⊤β}/(k + e^{X_i^⊤β})² ≤ 1, and the second-to-last inequality follows from Corollary 2.
For the second term, by Theorem 1 and ‖X_{ik} X_{ij}‖_{ψ_{θ/2}} ≤ ‖X_{ik}‖_{ψ_θ} ‖X_{ij}‖_{ψ_θ} ≤ M_X², we have
P k n i = 1 n k e X i β X i k X i j ( k + e X i β ) 2 E k e X i β X i k X i j ( k + e X i β ) 2 t , F max 2 exp t θ / 2 [ 4 e C ( θ / 2 ) b 2 L n ( θ / 2 , b ) ] θ / 2 t 2 16 e 2 C 2 ( θ / 2 ) b 2 2
where b = ( k / n ) M X 2 ( 1 , , 1 ) R n .
Assume that b^⊤ E(Q̂_n(β)) b ≥ C_min for all b ∈ S^{p−1}. Under F_1 and F_2, (A7) shows that b^⊤ Q̂_n(β) b ≥ b^⊤ E(Q̂_n(β)) b − C_min/2 ≥ C_min/2. Then
P { λ min ( Q ^ n ( β ) ) C min 2 } = P b E ( Q ^ n ( β ) ) b C min 2 , b S p 1 P b E ( Q ^ n ( β ) ) b C min 2 , b S p 1 , F max + P ( F max c ) P { F 1 , F max } + P { F 2 , F max } + P ( F R c ( n ) ) 2 p 2 exp 1 4 n t 2 M X 4 log 4 / θ ( 2 n p δ ) M B X 2 n t M X 2 log 2 / θ ( 2 n p δ ) M B X
+ 2 p 2 exp t θ / 2 [ 4 e C ( θ / 2 ) b 2 L n ( θ / 2 , b ) ] θ / 2 t 2 16 e 2 C 2 ( θ / 2 ) b 2 2 + P ( F max c ) .
Then, conditioning on F_1 ∩ F_2, we have
δ_n(β) := (3/2)‖[Q̂_n(β)]^{-1} Ẑ_n(β)‖_2 ≤ (3/C_min)‖Ẑ_n(β)‖_2.
By k/(k + e^{X_i^⊤β}) ≤ 1, Corollary 2 implies, for any 1 ≤ k ≤ p,
P [ | p n i = 1 n k ( Y i e X i β ) X i k k + e X i β | > 2 2 t p n i = 1 n X i k 2 Y i E Y i ψ 1 2 1 / 2 + 2 t p n max 1 i n | X i k | Y i E Y i ψ 1 ] 2 e t .
Let
λ 1 n ( t , X ) : = 2 2 t p n max 1 k n i = 1 n X i k 2 Y i E Y i ψ 1 2 1 / 2 + 2 t p n max 1 i n , 1 k p ( | X i k | Y i E Y i ψ 1 ) .
Under the event F_max, we bound max_{1≤i≤n, 1≤k≤p} |X_{ik}| ≤ M_X log^{1/θ}(2np/δ) and max_{1≤k≤p} (1/n)∑_{i=1}^n X_{ik}² ≤ M_X² log^{2/θ}(2np/δ). Note that M_{BX} = M_X + B log 2; then (C.1) and (C.2) give
λ 1 n ( t , X ) 2 2 t p M B X 2 max 1 k p 1 n i = 1 n X i k 2 1 / 2 + 2 t p n max 1 i n , 1 k p | X i k | M B X 2 M B X M X ( 2 t p + t p / n ) log 1 / θ ( 2 n p / δ ) = : λ n ( t ) .
So, P | p n i = 1 n k ( Y i e X i β ) X i k k + e X i β | > λ n ( t ) 2 e t , k = 1 , 2 , , p . Thus (A10) shows
P { n Z ^ n ( β ) 2 > λ 1 n ( t ) } P { n Z ^ n ( β ) 2 > λ 1 n ( t ) , F max } + P ( F max c ) P ( k = 1 p { 1 n i = 1 n k ( Y i e X i β ) X i k k + e X i β > λ 1 n ( t ) p } ) + P ( F max c ) 2 p e t + P ( F max c ) = δ + ε n ,
where t := log(2p/δ). Then ‖β̂_n − β‖_2 ≤ δ_n(β) ≤ (3/C_min)‖Ẑ_n(β)‖_2 ≤ 3λ_{1n}(t)/(C_min √n) via Lemma 4. Under F_1 ∩ F_2 ∩ F_max, we obtain
β ^ n β 2 6 M B X M X C min 2 p n log 2 p δ + 1 n p log 2 p δ log 1 / θ 2 n p δ .
Furthermore, under F_1 ∩ F_2 ∩ F_max, this yields the sample-size condition (11) on n. □
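To connect the bound with practice, the following Python sketch is purely illustrative: the Gaussian covariates, the dispersion k, the true coefficient vector, and the plain Newton solver are our own assumptions, not part of the theorem. It fits a negative binomial regression with known dispersion by solving (up to scaling) the score equation underlying Ẑ_n(β), with weights matching the matrix Q̂_n(β) of the proof, and compares ‖β̂_n − β‖_2 with the leading factor √((p/n) log(2p/δ)) of the bound.

import numpy as np

rng = np.random.default_rng(1)
n, p, k = 2000, 10, 4.0
beta_true = np.concatenate([np.array([0.5, -0.5, 0.25]), np.zeros(p - 3)])

X = rng.normal(size=(n, p)) / np.sqrt(p)      # sub-Weibull (Gaussian) covariates
mu = np.exp(X @ beta_true)
Y = rng.negative_binomial(k, k / (k + mu)).astype(float)   # NB responses with mean mu

beta = np.zeros(p)
for _ in range(50):                            # Newton iterations for the MLE
    eta = np.exp(X @ beta)
    score = X.T @ (k * (Y - eta) / (k + eta))  # score of the NB log-likelihood
    W = (Y + k) * k * eta / (k + eta) ** 2     # observed-information weights
    step = np.linalg.solve(X.T @ (W[:, None] * X), score)
    beta = beta + step
    if np.linalg.norm(step) < 1e-10:
        break

err = np.linalg.norm(beta - beta_true)
rate = np.sqrt(p / n * np.log(2 * p / 0.05))   # leading factor of the bound (delta = 0.05)
print(f"||beta_hat - beta||_2 = {err:.4f},  sqrt((p/n) log(2p/delta)) = {rate:.4f}")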

Appendix A.10

Proof of Theorem 4.
For convenience, the proof is divided into three steps.
Step 1. We adopt the following lemma.
Lemma A1 (Computing the spectral norm on a net, Lemma 5.4 in [1])
Let B be a symmetric p × p matrix, and let N_ε be an ε-net of S^{p−1} for some ε ∈ [0, 1). Then
‖B‖ = max_{‖x‖_2 = 1} ‖Bx‖_2 = sup_{x ∈ S^{p−1}} |⟨Bx, x⟩| ≤ (1 − 2ε)^{-1} sup_{x ∈ N_ε} |⟨Bx, x⟩|.
We first show that ‖(1/n)A^⊤A − I_p‖ ≤ 2 max_{x ∈ N_{1/4}} |(1/n)‖Ax‖_2² − 1|. Indeed, note that ⟨((1/n)A^⊤A − I_p)x, x⟩ = (1/n)⟨A^⊤Ax, x⟩ − 1 = (1/n)‖Ax‖_2² − 1 for x ∈ S^{p−1}. By setting ε = 1/4 in Lemma A1, we obtain
‖(1/n)A^⊤A − I_p‖ ≤ (1 − 2ε)^{-1} sup_{x ∈ N_ε} |⟨((1/n)A^⊤A − I_p)x, x⟩| = 2 max_{x ∈ N_{1/4}} |(1/n)‖Ax‖_2² − 1|.
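As a small numerical illustration of Lemma A1 (not used in the proof), the Python sketch below builds an explicit 1/4-net of the unit circle and checks that twice the maximal quadratic form over the net dominates the spectral norm of a symmetric 2 × 2 matrix; the matrix and the dimension are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(2, 2))
B = (M + M.T) / 2                                  # symmetric test matrix

# Angular spacing h with chord length 2*sin(h/2) <= 1/4 gives a 1/4-net of the circle.
h = 2 * np.arcsin(1 / 8)
angles = np.arange(0.0, 2 * np.pi, h)
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)

quad_form = np.abs(np.einsum('ij,jk,ik->i', net, B, net))   # |<Bx, x>| over the net
print(f"||B|| = {np.linalg.norm(B, 2):.4f} <= 2 * max_net |<Bx,x>| = {2 * quad_form.max():.4f}")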
Step 2. Let Z_i := |⟨A_i, x⟩| for a fixed x ∈ S^{p−1}. Observe that ‖Ax‖_2² = ∑_{i=1}^n |⟨A_i, x⟩|² = ∑_{i=1}^n Z_i². The {Z_i}_{i=1}^n are independent subW(θ) with E Z_i² = 1 and max_{1≤i≤n} ‖Z_i‖_{ψ_θ} = K. Then, by Corollary 4, the Z_i² are independent subW(θ/2) r.v.s with max_{1≤i≤n} ‖Z_i²‖_{ψ_{θ/2}} = K². The norm triangle inequality (Lemma A.3 in [9]) gives
max 1 i n Z i 2 1 ψ θ / 2 K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K .
where K_α := 2^{1/α} if α ∈ (0, 1) and K_α = 1 if α ≥ 1.
Denote b X : = 1 n ( Z 1 2 1 ψ θ / 2 , , Z n 2 1 ψ θ / 2 ) in Theorem 1. With (A11), we have
b X 2 = n 1 i = 1 n Z i 2 1 ψ θ / 2 2 K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K n
and b K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K n .
For β : = θ θ 1 > 1 , we obtain
b X β = n 1 { i = 1 n Z i 2 1 ψ θ / 2 β } 1 / β n β 1 1 [ K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K ] = n θ 1 K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K .
Let L_n(θ/2, b_X) be the constant defined in Theorem 1(a). Then,
b X 2 L n ( θ / 2 , b X ) = γ 4 / θ A ( θ / 2 ) b , θ 2 B ( θ / 2 ) b β , θ > 2 . K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K γ 4 / θ A ( θ / 2 ) / n , θ 2 B ( θ / 2 ) / n 1 / θ , θ > 2 .
Hence
2 e C ( θ / 2 ) { b X 2 t + b 2 L n ( θ / 2 , b X ) t 2 / θ } 2 e K C ( θ / 2 ) K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ) t n + A ( θ / 2 ) ( γ 2 t ) 2 / θ / n , θ 2 B ( θ / 2 ) ( γ 2 t ) 2 / θ / n 1 / θ , θ > 2 = : H ( t , n ; θ ) .
Therefore, P((1/n)|∑_{i=1}^n (Z_i² − 1)| ≥ H(t, n; θ)) ≤ 2e^{-t}. Let t = cp + s² for a constant c; then
P(|(1/n)‖Ax‖_2² − 1| ≥ H(cp + s², n; θ)) ≤ 2e^{-(cp + s²)}.
Step 3. We use the following lemma on covering numbers from [1].
Lemma A2 (Covering numbers of the sphere).
For the unit Euclidean sphere S^{n−1}, the covering number N(S^{n−1}, ε) satisfies N(S^{n−1}, ε) ≤ (1 + 2/ε)^n for every ε > 0.
We now show the concentration of ‖(1/n)A^⊤A − I_p‖; (12) then follows from the definition of the largest and smallest eigenvalues. The conclusion is drawn from Steps 1 and 2:
P 1 n A A I p H ( c p + s 2 , n ; θ ) P 2 max x N 1 / 4 | 1 n A x 2 2 1 | H ( c p + s 2 , n ; θ ) N ( S n 1 , 1 / 4 ) P | 1 n A x 2 2 1 | H ( c p + s 2 , n ; θ ) / 2 2 · 9 n e ( c p + s 2 ) ,
where the last inequality follows from Lemma A2 with ε = 1/4. When c ≥ (n log 9)/p, we have 2·9^n e^{-(cp + s²)} ≤ 2e^{-s²}, and (12) is proved.
Moreover, note that
max | | x | | 2 = 1 | 1 n A x 2 2 1 | = max | | x | | 2 = 1 ( 1 n A A I p ) x 2 2 = 1 n A A I p 2 H 2 ( c p + s 2 , n ; θ ) .
implies that
1 H 2 ( c p + s 2 , n ; θ ) 1 n λ m a x ( A ) 1 + H 2 ( c p + s 2 , n ; θ ) .
Similarly, for the minimal eigenvalue, we have
min | | x | | 2 = 1 | 1 n A x 2 2 1 | = min | | x | | 2 = 1 ( 1 n A A I p ) x 2 2 = 1 n A A I p 2 H 2 ( c p + s 2 , n ; θ ) .
This implies 1 H 2 ( c p + s 2 , n ; θ ) 1 n λ m i n ( A ) 1 + H 2 ( c p + s 2 , n ; θ ) . So we obtain that the two events satisfy
1 n A A I p 2 H 2 ( c p + s 2 , n ; θ ) 1 H 2 ( c p + s 2 , n ; θ ) 1 n λ m i n ( A ) 1 n λ m a x ( A ) 1 + H 2 ( c p + s 2 , n ; θ )
Then we obtain the second conclusion in this theorem. □
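A direct simulation of this non-asymptotic Bai-Yin behaviour is given below; the signed-Weibull entry distribution and the (n, p) values are illustrative assumptions. The Python sketch standardizes the entries exactly by their second moment Γ(1 + 2/θ) and reports how the extreme eigenvalues of (1/n)A^⊤A approach 1 as p/n → 0.

import numpy as np
from math import gamma, sqrt

rng = np.random.default_rng(2)
theta, p = 0.5, 50
sd = sqrt(gamma(1.0 + 2.0 / theta))      # std of a signed Weibull(theta) entry

def sub_weibull_matrix(n):
    """n x p matrix of i.i.d. mean-zero, unit-variance sub-Weibull(theta) entries."""
    signs = rng.choice([-1.0, 1.0], size=(n, p))
    return signs * rng.weibull(theta, size=(n, p)) / sd

for n in [200, 1000, 5000, 20000]:
    A = sub_weibull_matrix(n)
    eig = np.linalg.eigvalsh(A.T @ A / n)
    lo, hi = (1 - sqrt(p / n)) ** 2, (1 + sqrt(p / n)) ** 2
    print(f"n = {n:6d}  lambda_min = {eig[0]:.3f}  lambda_max = {eig[-1]:.3f}"
          f"  Bai-Yin limits ({lo:.3f}, {hi:.3f})")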

Appendix A.11

Proof of Theorem 5.
By independence and (13),
E e ± n α n Z ^ α n ( θ ) = i = 1 n E exp { ± φ c α n X i θ } i = 1 n E [ 1 ± α n X i θ + c ( α n ( X i θ ) ) ] i = 1 n exp { ± α n E X i θ + E [ c ( α n ( X i θ ) ) ] } = exp ± α n i = 1 n E X i θ + i = 1 n E [ c ( α n ( X i θ ) ) ] .
For convenience, let
B_n^+(θ) = μ_n − θ + (1/(nα_n))∑_{i=1}^n E[c(α_n(X_i − θ))] + log(δ^{-1})/(nα_n)
and B_n^-(θ) = μ_n − θ − (1/(nα_n))∑_{i=1}^n E[c(α_n(X_i − θ))] − log(δ^{-1})/(nα_n). Therefore, Equation (A12) and Markov's inequality show
P(Ẑ_{α_n}(θ) ≥ B_n^+(θ)) = P(e^{nα_n Ẑ_{α_n}(θ)} ≥ e^{nα_n B_n^+(θ)}) ≤ E e^{nα_n Ẑ_{α_n}(θ)} / e^{nα_n B_n^+(θ)} ≤ e^{nα_n B_n^+(θ) − log(δ^{-1})} / e^{nα_n B_n^+(θ)} = δ
and P(Ẑ_{α_n}(θ) ≤ B_n^-(θ)) = P(e^{-nα_n Ẑ_{α_n}(θ)} ≥ e^{-nα_n B_n^-(θ)}) ≤ E e^{-nα_n Ẑ_{α_n}(θ)} / e^{-nα_n B_n^-(θ)} ≤ e^{-nα_n B_n^-(θ) − log(δ^{-1})} / e^{-nα_n B_n^-(θ)} = δ. These two inequalities yield P(B_n^-(θ) < Ẑ_{α_n}(θ) < B_n^+(θ)) ≥ 1 − P(Ẑ_{α_n}(θ) ≤ B_n^-(θ)) − P(Ẑ_{α_n}(θ) ≥ B_n^+(θ)) ≥ 1 − 2δ.
Since ∂Ẑ_{α_n}(θ)/∂θ = −(1/n)∑_{i=1}^n φ̇_c(α_n(X_i − θ)) < 0, the map θ ↦ Ẑ_{α_n}(θ) is non-increasing. If θ = μ_n, we have B_n^+(μ_n) > 0 from (A13). As n is sufficiently large and α_n → 0, the term (1/(nα_n))∑_{i=1}^n E[c(α_n(X_i − θ))] ≤ (f(α_n)/α_n)·(1/n)∑_{i=1}^n E[c(X_i − θ)] = (f(α_n)/α_n)·O(1) in B_n^+(θ) converges to 0 by (C.1.2) and (C.1.3). Then, there must be a constant d_n(c) > 0 such that B_n^+(μ_n + d_n(c)) < 0. So, under (16), B_n^+(θ) = 0 has a solution, and we denote the smallest solution by θ_+ ∈ (μ_n, μ_n + d_n(c)). Similarly, for B_n^-(θ), we have B_n^-(μ_n) < 0. Condition (16) implies B_n^-(μ_n − d_n(c)) > 0, so B_n^-(θ) = 0 has a solution, and we denote the largest solution by θ_- ∈ (μ_n − d_n(c), μ_n). Since Ẑ_{α_n}(θ) is continuous and non-increasing, the estimating equation Ẑ_{α_n}(θ) = 0 has a solution θ̂_{α_n} ∈ [θ_-, θ_+] with probability at least 1 − 2δ. Recall that
B_n^+(θ_+) = μ_n − θ_+ + (1/(nα_n))∑_{i=1}^n E[c(α_n(X_i − θ_+))] + log(δ^{-1})/(nα_n) = 0
has the smallest solution θ + ( μ n , μ n + d n ( c ) ) under the condition (16). We have
μ n θ ^ α n μ n θ + = 1 n α n i = 1 n E [ c α n X i α n θ + ] log ( δ 1 ) n α n = 1 n α n i = 1 n E [ c α n ( X i μ n ) + α n ( μ n θ + ) ] log ( δ 1 ) n α n By ( C . 1.1 ) ] c 2 n α n i = 1 n E [ c α n X i α n μ n ] c 2 α n · c α n ( μ n θ + ) log ( δ 1 ) n α n
which implies
μ n θ + + c 2 α n · c α n ( μ n θ + ) c 2 n α n i = 1 n E [ c α n ( X i μ n ) ] + log ( δ 1 ) n α n .
Put (c_2/(nα_n))∑_{i=1}^n E[c(α_n(X_i − μ_n))] = log(δ^{-1})/(nα_n), i.e., ∑_{i=1}^n c_2 E[c(α_n(X_i − μ_n))] = log(δ^{-1}). The scaling assumption c(tx) ≤ f(t)c(x) gives
f(α_n) c_2 ∑_{i=1}^n E[c(X_i − μ_n)] ≥ c_2 ∑_{i=1}^n E[c(α_n(X_i − μ_n))] = log(δ^{-1})
and thus α_n ≥ f^{-1}(log(δ^{-1})/(c_2 ∑_{i=1}^n E[c(X_i − μ_n)])). Let g_{α_n}(t) = t + (c_2/α_n)c(α_n t). Moreover, Equation (A16) and this choice of α_n yield
g α n ( μ n θ + ) = μ n θ + + c 2 α n c α n ( μ n θ + ) 2 log ( δ 1 ) n α n .
Solving the above inequality for μ_n − θ_+, we obtain
μ n θ ^ α n μ n θ + g α n 1 2 log ( δ 1 ) n α n .
Similarly, for θ_-, one has μ_n − θ̂_{α_n} ≤ μ_n − θ_- ≤ g_{α_n}^{-1}(2 log(δ^{-1})/(nα_n)). Hence, (17) holds with probability at least 1 − 2δ. □
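For readers who want to experiment with such log-truncated estimators, the following Python sketch implements one concrete instance: it uses the classical Catoni influence function φ(x) = sign(x) log(1 + |x| + x²/2) (i.e., c(x) = x²/2), heavy-tailed Pareto data with a finite mean but infinite variance, and a simple tuning of α_n. All of these choices are illustrative assumptions rather than the exact φ_c, c(·), and α_n of Theorem 5.

import numpy as np

def phi(x):
    """Catoni-type influence function: sign(x) * log(1 + |x| + x^2/2)."""
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x ** 2)

def log_truncated_mean(X, alpha, tol=1e-10):
    """Solve sum_i phi(alpha * (X_i - theta)) = 0 by bisection (Z is non-increasing)."""
    lo, hi = X.min(), X.max()
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phi(alpha * (X - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

rng = np.random.default_rng(3)
n, delta = 5000, 0.01
X = rng.pareto(1.5, size=n) + 1.0            # heavy-tailed: mean 3, infinite variance
alpha = np.sqrt(2 * np.log(1 / delta) / n)   # illustrative tuning of alpha_n

print(f"true mean = 3.000, sample mean = {X.mean():.3f}, "
      f"log-truncated estimate = {log_truncated_mean(X, alpha):.3f}")

On such data the sample mean fluctuates heavily across runs, while the truncated Z-estimator remains close to the true mean, which is the qualitative content of the confidence bound (17).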

References

1. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science; Cambridge University Press: Cambridge, UK, 2018; Volume 47.
2. Bai, Z.; Silverstein, J.W. Spectral Analysis of Large Dimensional Random Matrices; Springer: New York, NY, USA, 2010; Volume 20.
3. Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48.
4. Zhang, H.; Chen, S.X. Concentration Inequalities for Statistical Inference. Commun. Math. Res. 2021, 37, 1–85.
5. Tropp, J.A. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 2015, 8, 1–230.
6. Kuchibhotla, A.K.; Chakrabortty, A. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Inf. Inference J. IMA 2022, ahead of print.
7. Hao, B.; Abbasi-Yadkori, Y.; Wen, Z.; Cheng, G. Bootstrapping Upper Confidence Bound. Adv. Neural Inf. Process. Syst. 2019, 32.
8. Gbur, E.E.; Collins, R.A. Estimation of the Moment Generating Function. Commun. Stat. Simul. Comput. 1989, 18, 1113–1134.
9. Götze, F.; Sambale, H.; Sinulis, A. Concentration inequalities for polynomials in α-sub-exponential random variables. Electron. J. Probab. 2021, 26, 1–22.
10. Li, S.; Wei, H.; Lei, X. Heterogeneous Overdispersed Count Data Regressions via Double-Penalized Estimations. Mathematics 2022, 10, 1700.
11. Rigollet, P.; Hütter, J.C. High Dimensional Statistics. Lecture Notes, 2019. Available online: http://www-math.mit.edu/rigollet/PDFs/RigNotes17.pdf (accessed on 20 April 2022).
12. Foss, S.; Korshunov, D.; Zachary, S. An Introduction to Heavy-Tailed and Subexponential Distributions; Springer: New York, NY, USA, 2011.
13. De la Pena, V.; Gine, E. Decoupling: From Dependence to Independence; Springer: Berlin/Heidelberg, Germany, 2012.
14. Latala, R. Estimation of moments of sums of independent real random variables. Ann. Probab. 1997, 25, 1502–1513.
15. Kashlak, A.B. Measuring distributional asymmetry with Wasserstein distance and Rademacher symmetrization. Electron. J. Stat. 2018, 12, 2091–2113.
16. Vladimirova, M.; Girard, S.; Nguyen, H.; Arbel, J. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-exponential properties to heavier tailed distributions. Stat 2020, 9, e318.
17. Wong, K.C.; Li, Z.; Tewari, A. Lasso guarantees for β-mixing heavy-tailed time series. Ann. Stat. 2020, 48, 1124–1142.
18. Portnoy, S. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Stat. 1988, 16, 356–366.
19. Kuchibhotla, A.K. Deterministic inequalities for smooth M-estimators. arXiv 2018, arXiv:1809.05172.
20. Zhang, H.; Jia, J. Elastic-net regularized high-dimensional negative binomial regression: Consistency and weak signals detection. Stat. Sin. 2022, 32, 181–207.
21. Bai, Z.D.; Yin, Y.Q. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. In Advances in Statistics; World Scientific: Singapore, 1993; pp. 1275–1294.
22. Chen, P.; Jin, X.; Li, X.; Xu, L. A generalized Catoni's M-estimator under finite α-th moment assumption with α ∈ (1,2). Electron. J. Stat. 2021, 15, 5523–5544.
23. Xu, L.; Yao, F.; Yao, Q.; Zhang, H. Non-Asymptotic Guarantees for Robust Statistical Learning under (1+ε)-th Moment Assumption. arXiv 2022, arXiv:2201.03182.
24. Lerasle, M. Lecture notes: Selected topics on robust statistical learning theory. arXiv 2019, arXiv:1908.10761.
25. Devroye, L.; Lerasle, M.; Lugosi, G.; Oliveira, R.I. Sub-Gaussian mean estimators. Ann. Stat. 2016, 44, 2695–2725.
26. Zhang, A.R.; Zhou, Y. On the non-asymptotic and sharp lower tail bounds of random variables. Stat 2020, 9, e314.
27. Zhang, H.; Tan, K.; Li, B. COM-negative binomial distribution: Modeling overdispersion and ultrahigh zero-inflated count data. Front. Math. China 2018, 13, 967–998.
28. Zajkowski, K. On norms in some class of exponential type Orlicz spaces of random variables. Positivity 2019, 24, 1231–1240.
29. Jameson, G.J. A simple proof of Stirling's formula for the gamma function. Math. Gaz. 2015, 99, 68–74.
30. Alzer, H. On some inequalities for the gamma and psi functions. Math. Comput. 1997, 66, 373–389.
Figure 1. Standard Gaussian.
Figure 2. Centralized Bernoulli.
Figure 3. Uniform on [−1, 1].
Figure 4. Centralized negative binomial.
Figure 5. Centralized Poisson.