Next Article in Journal
Predicting Change in Emotion through Ordinal Patterns and Simple Symbolic Expressions
Next Article in Special Issue
Representation Theorem and Functional CLT for RKHS-Based Function-on-Function Regressions
Previous Article in Journal
A Feasible Method to Control Left Ventricular Assist Devices for Heart Failure Patients: A Numerical Study
Previous Article in Special Issue
Group Logistic Regression Models with lp,q Regularization
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Sharper Sub-Weibull Concentrations

1
Department of Mathematics, Faculty of Science and Technology, University of Macau, Macau 999078, China
2
Zhuhai UM Science & Technology Research Institute, Zhuhai 519031, China
3
Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2022, 10(13), 2252; https://doi.org/10.3390/math10132252
Submission received: 21 April 2022 / Revised: 20 June 2022 / Accepted: 21 June 2022 / Published: 27 June 2022
(This article belongs to the Special Issue New Advances in High-Dimensional and Non-asymptotic Statistics)

Abstract

:
Constant-specified and exponential concentration inequalities play an essential role in the finite-sample theory of machine learning and high-dimensional statistics area. We obtain sharper and constants-specified concentration inequalities for the sum of independent sub-Weibull random variables, which leads to a mixture of two tails: sub-Gaussian for small deviations and sub-Weibull for large deviations from the mean. These bounds are new and improve existing bounds with sharper constants. In addition, a new sub-Weibull parameter is also proposed, which enables recovering the tight concentration inequality for a random variable (vector). For statistical applications, we give an 2 -error of estimated coefficients in negative binomial regressions when the heavy-tailed covariates are sub-Weibull distributed with sparse structures, which is a new result for negative binomial regressions. In applying random matrices, we derive non-asymptotic versions of Bai-Yin’s theorem for sub-Weibull entries with exponential tail bounds. Finally, by demonstrating a sub-Weibull confidence region for a log-truncated Z-estimator without the second-moment condition, we discuss and define the sub-Weibull type robust estimator for independent observations { X i } i = 1 n without exponential-moment conditions.

1. Introduction

In the last two decades, with the development of modern data collection methods in science and techniques, scientists and engineers can access and load a huge number of variables in their experiments. Over hundreds of years, probability theory lays the mathematical foundation of statistics. Arising from data-driving problems, various recent statistics research advances also contribute to new and challenging probability problems for further study. For example, in recent years, the rapid development of high-dimensional statistics and machine learning have promoted the development of the probability theory and even pure mathematics, such as random matrices, large deviation inequalities, and geometric functional analysis, etc.; see [1]. More importantly, the concentration inequality (CI) quantifies the concentration of measures that are at the heart of statistical machine learning. Usually, CI quantifies how a random variable (r.v.) X deviates around its mean E X = : μ by presenting as one-side or two-sided bounds for the tail probability of X μ
P ( X μ > t ) or P ( | X μ | > t ) some small δ , t 0 .
The classical statistical models are faced with fixed-dimensional variables only. However, contemporary data science motivates statisticians to pay more attention to studying p × p random Hessian matrices (or sample covariance matrices, [2]) with p , arising from the likelihood functions of high-dimensional regressions with covariates in R p . When the model dimension increases with sample size, obtaining asymptotic results for the estimator is potentially more challenging than the fixed dimensional case. In statistical machine learning, concentration inequalities (large derivation inequalities) are essential in deriving non-asymptotic error bounds for the proposed estimator; see [3,4]. Over recent decades, researchers have developed remarkable results of matrix concentration inequalities, which focuses on non-asymptotic upper and lower bounds for the largest eigenvalue of a finite sum of random matrices. For a more fascinated introduction, please refer to the book [5].
Motivated from sample covariance matrices, a random matrix is a specific matrix A p × p with its entries A j k drawn from some distributions. As p , random matrix theory mainly focuses on studying the properties of the p eigenvalues of A p × p , which turn out to have some limit law. Several famous limit laws in random matrix theory are different from the CLT for the summation of independent random variables since the p eigenvalues are dependent and interact with each other. For convergence in distribution, some pioneering works are the Wigner’s semicircle law for some symmetric Gaussian matrices’ eigenvalues, the Marchenko-Pastur law for Wishart distributed random matrices (sample covariance matrices), and the Tracy-Widom laws for the limit distribution for maximum eigenvalues in Wishart matrices. All these three laws can be regarded as the CLT of random matrix versions. Moreover, the limit law for the empirical spectral density is some circle distribution, which sheds light on the non-communicative behaviors of the random matrix, while the classic limit law in CLT is for normal distribution or infinite divisible distribution. For strong convergence, Bai-Yin’s law complements the Marchenko-Pastur law, which asserts that almost surely convergence of the smallest and largest eigenvalue for a sample covariance matrix. The monograph [2] thoroughly introduces the limit law in random matrices.
This work aims to extend non-asymptotic results from sub-Gaussian to sub-Weibull in terms of exponential concentration inequalities with applications in count data regressions, random matrices, and robust estimators. The contributions are:
(i)
We review and present some new results for sub-Weibull r.v.s, including sharp concentration inequalities for weighted summations of independent sub-Weibull r.v.s and negative binomial r.v.s, which are useful in many statistical applications.
(ii)
Based on the generalized Bernstein-Orlicz norm, a sharper concentration for sub-Weibull summations is obtained in Theorem 1. Here we circumvent Stirling’s approximation and derive the inequalities more subtly. As a result, the confidence interval based on our result is sharper and more accurate than that in [6] (For example, see Remark 2) and [7] (see Proposition 1 with unknown constants) gave.
(iii)
By sharper sub-Weibull concentrations, we give two applications. First, from the proposed negative binomial concentration inequalities, we obtain the O P ( p / n ) (up to some log factors) estimation error for the estimated coefficients in negative binomial regressions under the increasing-dimensional framework p = p n and heavy-tailed covariates. Second, we provide a non-asymptotic Bai-Yin’s theorem for sub-Weibull random matrices with exponential-decay high probability.
(iv)
We propose a new sub-Weibull parameters, which is enabled of recovering the tight concentration inequality for a single non-zero mean random vector. The simulation studies for estimating sub-Gaussian and sub-exponential parameters show these parameters could be estimated well.
(v)
We establish a unified non-asymptotic confidence region and the convergence rate for general log-truncated Z-estimator in Theorem 5. Moreover, we define a sub-Weibull type estimator for a sequence of independent observations { X i } i = 1 n without the second-moment condition, beyond the definition of the sub-Gaussian estimator.

2. Sharper Concentrations for Sub-Weibull Summation

Concentration inequalities are powerful in high-dimensional statistical inference, and it can derive explicit non-asymptotic error bounds as a function of sample size, sparsity level, and dimension [3]. In this section, we present preparation results of concentration inequalities for sub-Weibull random variables.

2.1. Properties of Sub-Weibull norm and Orlicz-Type Norm

In empirical process theory, sub-Weibull norm (or other Orlicz-type norms) is crucial to derive the tail probability for both single sub-Weibull random variable and summation of random variables (by using the Chernoff’s inequality). A benefit of Orlicz-type norms is that the concentration does not need the zero mean assumption.
Definition 1
(Sub-Weibull norm). For θ > 0 , the sub-Weibull norm of X is defined as
X ψ θ : = inf { C ( 0 , ) : E [ exp ( | X | θ / C θ ) ] 2 } .
The · ψ θ is also called the ψ θ -norm. We define X as a sub-Weibull random variable with index θ if it has a bounded ψ θ -norm (denoted as X subW ( θ ) ). Actually, the sub-Weibull norm is a special case of Orlicz norms below.
Definition 2
(Orlicz Norms). Let g : [ 0 , ) [ 0 , ) be a non-decreasing convex function with g ( 0 ) = 1 . The “g-Orlicz norm” of a real-valued r.v. X is given by
X g : = inf { η > 0 : E [ g ( | X | / η ) ] 2 } .
Using exponential Markov’s inequality, we have
P ( | X | t ) = P ( g ( | X | / X g ) g ( t / X g ) ) g 1 ( t / X g ) E g ( X / X g ) 2 g 1 ( t / X g )
by Definition 2. For example, let g ( x ) = e x θ , which leads to sub-Weibull norm for θ 1 .
Example 1
( ψ θ -norm of bounded r.v.). For a r.v. | X | M < , we have
X ψ θ = inf { t > 0 : E e | X | θ / t θ 2 } inf { t > 0 : E e M θ / t θ 2 } = M ( log 2 ) 1 / θ .
In general, we have following corollary to determine X ψ θ based on moment generating functions (MGF). It would be useful for doing statistical inference of ψ θ -norm.
Corollary 1.
If X ψ θ < , then X ψ θ = m | X | θ 1 ( 2 ) 1 / θ for the MGF ϕ Z ( t ) : = E e t Z .
Remark 1.
If we observe i.i.d. data { X i } i = 1 n from a sub-Weibull distribution, one can use the empirical moment generating function (EMGF, [8]) to estimate the sub-Weibull norm of X. Since the EMGF m ^ | X | θ ( t ) = 1 n i = 1 n exp { t | X i | θ } converge to MGF m | X | θ ( t ) in probability for t in a neighbourhood of zero, the value of the inverse function of EMGF at 2. Then, under some regularity conditions, m ^ | X | θ 1 ( 2 ) , is a consistent estimate for X ψ θ .
In particular, if we take θ = 1 , we get the sub-exponential norm of X, which is defined as X ψ 1 = inf { t > 0 : E exp ( | X | / t ) 2 } . For independent r.v.s { X i } i = 1 n , if E X i = 0 and X i ψ 1 < , by Proposition 4.2 in [4], we know t 0
P | i = 1 n X i | t 2 exp 1 4 t 2 i = 1 n 2 X i ψ 1 2 t max 1 i n X i ψ 1 .
Example 2.
An explicitly calculation of the sub-exponential norm is given in [9], they show that Poisson r.v. X Poisson ( λ ) has sub-exponential norm X ψ 1 [ log ( log ( 2 ) λ 1 + 1 ) ] 1 . And Example 1 with triangle inequality implies
X E X ψ 1 X ψ 1 + E X ψ 1 = X ψ 1 + λ log 2 [ log ( log ( 2 ) λ 1 + 1 ) ] 1 + λ log 2
based on following useful results.
Proposition 1
(Lemma A.3 in [9]). For any α > 0 and any r.v.s X , Y we have X + Y ψ θ K α X ψ θ + Y ψ θ and
E X ψ θ 1 d α ( log 2 ) 1 / α X ψ θ , X E X ψ θ K α 1 + d α log 2 1 / α X ψ θ ,
where d θ : = ( θ e ) 1 / θ / 2 , K θ : = 2 1 / θ if θ ( 0 , 1 ) and K θ = 1 if θ 1 .
To extend Poisson variables, one can also consider concentration for sums of independent heterogeneous negative binomial variables { Y i } i = 1 n with probability mass functions:
P ( Y i = y ) = Γ ( y + k i ) Γ ( k i ) y ! ( 1 q i ) k i q i y q i ( 0 , 1 ) , y N ,
where { k i } i = 1 n ( 0 , ) are variance-dependence parameters. Here, the mean and variance of { Y i } i = 1 n are E Y i = k i q i 1 q i , Var   Y i = k i q i ( 1 q i ) 2 respectively. The MGF of { Y i } i = 1 n are E e s Y i = 1 q i 1 q i e s k i for i = 1 , , n . Based on (3), we obtain following results.
Corollary 2.
For any independent r.v.s { Y i } i = 1 n satisfying Y i ψ 1 < , t 0 , and non-random weight w = ( w 1 , , w n ) , we have
P | i = 1 n w i ( Y i E Y i ) | t 2 e 1 4 t 2 2 i = 1 n w i 2 ( Y i ψ 1 + | E Y i / log 2 | ) 2 t max 1 i n | w i | ( Y i ψ 1 + | E Y i / log 2 | ) .
P | i = 1 n w i ( Y i E Y i ) | > 2 2 t i = 1 n w i 2 Y i E Y i ψ 1 2 1 / 2 + 2 t max 1 i n ( | w i | Y i E Y i ψ 1 ) 2 e t .
In particular, if Y i is independently distributed as NB ( μ i , k i ) , we have
P | i = 1 n w i ( Y i E Y i ) | t 2 e 1 4 ( t 2 2 i = 1 n w i 2 a 2 ( μ i , k i ) t max 1 i n | w i | a ( μ i , k i ) ) ,
where a ( μ i , k i ) : = log 1 ( 1 q i ) / 2 k i q i 1 + μ i log 2 with q i : = μ i k i + μ i .
Corollary 2 can play an important role in many non-asymptotic analyses of various estimators. For instance, recently [10] uses the above inequality as an essential role for deriving the non-asymptotic behavior of the penalty estimator in the counting data model.
Next, we study moment properties for sub-Weibull random variables. Lemma 1.4 in [11] showed that if X subG ( σ 2 ) , then we have: (a). the tail satisfies P ( | X | > t ) 2 e t 2 / 2 σ 2 for any t > 0 ; (b). The (a) implies that moments E | X | k ( 2 σ 2 ) k / 2 k Γ ( k 2 ) and [ k 1 / 2 ( E ( | X | k ) ) 1 / k ] 2 σ 2 e 2 / e , k 2 . We extend Lemma 1.4 in [11] to sub-Weibull r.v. X satisfying following properties.
Corollary 3
(Moment properties of sub-Weibull norm). (a). If X ψ θ < , then P { | X | > t } 2 e ( t / X ψ θ ) θ for all t 0 ; and then E | X | k 2 X ψ θ k Γ ( k θ + 1 ) for all k 1 . (2). Let C θ : = max k 1 2 2 π θ 1 / k k θ 1 / ( 2 k ) , for all k 1 we have ( E | X | k ) 1 / k C θ ( θ e 11 / 12 ) 1 / θ X ψ θ k 1 / θ .
Particularly, sub-Weibull r.v.s reduce to sub-exponential or sub-Gaussian r.v.s when θ = 1 or 2. It is obvious that the smaller θ is, the heavier tail the r.v. has. A r.v. is called heavy-tailed if its distribution function fails to be bounded by a decreasing exponential function, i.e.,
e λ x d F ( x ) = , λ > 0 (the tail decays slower than some exponential r.v.s);
see [12]. Hence for sub-Weibull r.v.s, we usually focus on the the sub-Weibull index θ ( 0 , 1 ) . A simple example that the heavy-tailed distributions arises when we work more production on sub-Gaussian r.v.s. Via a power transform of | X | , the next corollary explains the relation of sub-Weibull norm with parameter θ and r θ , which is similar to Lemmas 2.7.6 of [1] for sub-exponential norm.
Corollary 4.
For any θ , r ( 0 , ) , if X subW ( θ ) , then | X | r subW ( θ / r ) . Moreover,
| X | r ψ θ / r = X ψ θ r .
Conversely, if X subW ( r θ ) , then X r subW ( θ ) with X r ψ θ = X ψ r θ r .
By Corollary 4, we obtain that d-th root of the absolute value of sub-Gaussian is subW ( 2 d ) by letting r = 1 / d . Corollary 4 can be extended to product of r.v.s, from Proposition D.2 in [6] with the equality replacing by inequality, we state it as the following proposition.
Proposition 2.
If { W i } i = 1 d are (possibly dependent) r.vs satisfying W i ψ α i < ∞ for some α i > 0 , then
i = 1 d W i ψ β i = 1 d W i ψ α i where 1 β : = i = 1 d 1 α i .
For multi-armed bandit problems in reinforcement learning, [7] move beyond sub-Gaussianity and consider the reward under sub-Weibull distribution which has a much weaker tail. The corresponding concentration inequality (Theorem 3.1 in [7]) for the sum of independent sub-Weibull r.v.s is illustrated as follows.
Proposition 3
(Concentration inequality for sub-Weibull distribution). Suppose { X i } i = 1 n are independent sub-Weibull random variables with X i E X i ψ θ v . Then there exists absolute constants C 1 θ and C 2 θ only depending on θ such that with probability at least 1 e t :
1 n i = 1 n X i E X i v C 1 θ t n 1 / 2 + C 2 θ t n 1 / θ = O ( n 1 / θ ) , θ > 2 O ( n 1 / 2 ) , 0 < θ 2 .
The weakness in the Proposition 3 is that the upper bound of S n a : = i = 1 n a i Y i E ( i = 1 n a i Y i ) is up to a unknown constants C 1 θ , C 2 θ . In the next section, we will give a constants-specified and high probability upper bound for | S n a | , which improve Proposition 3 and is sharper than Theorem 3.1 in [6].

2.2. Main Results: Concentrations for Sub-Weibull Summation

Based on the exponential moment condition, the Chernoff’s tricks implies the following sub-exponential concentrations from Proposition 4.2 in [4].
Proposition 4.
For any independent r.v.s { Y i } i = 1 n satisfying Y i ψ 1 < , t 0 , and non-random weight w = ( w 1 , , w n ) , we have
P ( | i = 1 n w i ( Y i E Y i ) | > 2 ( 2 t i = 1 n w i 2 Y i E Y i ψ 1 2 ) 1 / 2 + 2 t max 1 i n ( | w i | Y i E Y i ψ 1 ) ) 2 e t .
But it is not easy to extend to sub-Weibull distributions. From Corollary 4, Y i subW ( θ ) | Y i | 1 / θ subW ( 1 ) . The MGF of | Y i | 1 / θ satisfies E e λ 1 / θ | Y i | 1 / θ e λ 1 / θ K 1 / θ , | λ | 1 K for some constant K > 0 . The bound of E e λ 1 / θ | Y i | 1 / θ with θ 1 or 2 is not directly applicable for deriving the concentration of i = 1 n w i ( Y i E Y i ) by using the independence and Chernoff’s tricks, since the MGF of Weibull r.v. do not has closed form as exponential function. Thanks to the tail probability derived by Orlicz-type norms, instead of using the upper bound for MGF, an alternative method is given by [6] who defines the so-called Generalized Bernstein-Orlicz (GBO) norm. And the GBO norm can help us to derive tail behaviours for sub-Weibull r.v.s.
Definition 3
(GBO norm). Fix α > 0 and L 0 . Define the function Ψ θ , L ( · ) as the inverse function Ψ θ , L 1 ( t ) : = log ( t + 1 ) + L log ( t + 1 ) 1 / θ f o r   a l l t 0 . The GBO norm of a r.v. X is then given by X Ψ θ , L : = inf { η > 0 : E [ Ψ θ , L ( | X | / η ) ] 1 } .
The monotone function Ψ θ , L ( · ) is motivated by the classical Bernstein’s inequality for sub-exponential r.v.s. Like the sub-Weibull norm properties Corollary 3, the following proposition in [6] allows us to get the concentration inequality for r.v. with finite GBO norm.
Proposition 5.
If X Ψ θ , L < , then P ( | X | X Ψ θ , L { t + L t 1 / θ } ) 2 e t t 0 .
With an upper bound of GBO norm, we could easily derive the concentration inequality for a single sub-Weibull r.v. or even the sum of independent sub-Weibull r.v.s. The sharper upper bounds for the GBO norm is obtained for the sub-Weibull summation, which refines the constant in the sub-Weibull concentration inequality. Let | | X | | p : = ( E | X | p ) 1 / p for all integer p 1 . First, by truncating more precisely, we obtain a sharper upper bound for | | X | | p , comparing to Proposition C.1 in [6].
Corollary 5.
If X p C 1 p + C 2 p 1 / θ for p 2 and constants C 1 , C 2 , then
X Ψ θ , K γ e C 1
where K = γ 2 / θ C 2 / ( γ C 1 ) and γ 1.78 is the minimal solution of
k > 1 : e 2 k 2 1 + e 2 ( 1 k 2 ) / k 2 k 2 1 1 .
The proof can be seen in the Appendix A. In below, we need the moment estimation for sums of independent symmetric r.v.s.
Lemma 1
(Khinchin-Kahane Inequality, Theorem 1.3.1 of [13]). Let a i i = 1 n be a finite non-random sequence, ε i i = 1 n be a sequence of independent Rademacher variables and 1 < p < q < . Then i = 1 n ε i a i q q 1 p 1 1 / 2 i = 1 n ε i a i p .
Lemma 2
(Theorem 2 of [14]). Let X i i = 1 n be a sequence of independent symmetric r.v.s, and p 2 . Then, e 1 2 e 2 X i p X 1 + + X n p e X i p , where X i p : = inf { t > 0 : i = 1 n log ϕ p X i / t p } with ϕ p ( X ) : = E | 1 + X | p .
Lemma 3
(Example 3.2 and 3.3 of [14]). Assume X be a symmetric r.v. satisfying P | X | t = e N ( t ) . For any t 0 , we have
(a)
If N ( t ) is concave, then log ϕ p ( e 2 t X ) p M p , X ( t ) : = ( t p X p p ) ( p t 2 X 2 2 ) .
(b)
For convex N ( t ) , denote the convex conjugate function N ( t ) : = sup s > 0 { t s N ( s ) } and M p , X ( t ) = p 1 N ( p | t | ) , if p | t | 2 p t 2 , if p | t | < 2 . Then log ϕ p ( t X / 4 ) p M p , X ( t ) .
With the help of three lemmas above, we can obtain the main results concerning the shaper and constant-specified concentration inequality for the sum of independent sub-Weibull r.v.s.
Theorem 1
(Concentration for sub-Weibull summation). Let γ be given in Corollary 5. If X i i = 1 n are independent centralized r.v.s such that X i ψ θ < for all 1 i n and some θ > 0 , then for any weight vector w = ( w 1 , , w n ) R n , the following bounds holds true:
(a)
The estimate for GBO norm of the summation:
i = 1 n w i X i Ψ θ , L n ( θ , b X ) γ e C ( θ ) b X 2 ,
where b X = ( w 1 X 1 ψ θ , , w n X n ψ θ ) R n , with
C ( θ ) : = 2 log 1 / θ 2 + e 3 Γ 1 / 2 2 θ + 1 + 3 2 θ 3 θ sup p 2 p 1 θ Γ 1 / p p θ + 1 , i f θ 1 , 2 [ 4 e + ( log 2 ) 1 / θ ] , i f θ > 1 ;
and L n ( θ , b ) = γ 2 / θ A ( θ ) b b 2 1 { 0 < θ 1 } + γ 2 / θ B ( θ ) b β b 2 1 { θ > 1 } where B ( θ ) = : 2 e θ 1 / θ 1 θ 1 1 / β 4 e + ( log 2 ) 1 / θ and A ( θ ) = : inf p 2 e 3 3 2 θ 3 θ p 1 / θ Γ 1 / p p θ + 1 2 [ log 1 / θ 2 + e 3 ( Γ 1 / 2 ( 2 θ + 1 ) + 3 2 θ 3 θ sup p 2 p 1 / θ Γ 1 / p ( p θ + 1 ) ) ] . For the case θ > 1 , β is the Hölder conjugate satisfying 1 / θ + 1 / β = 1 .
(b)
Concentration for sub-Weibull summation:
P | i = 1 n w i X i | 2 e C ( θ ) b X 2 { t + L n ( θ , b X ) t 1 / θ } 2 e t .
(c)
Another form of for θ 2 :
P | i = 1 n w i X i | s 2 exp s θ 4 e C ( θ ) b X 2 L n ( θ , b X ) θ s 2 16 e 2 C 2 ( θ ) b X 2 2 ( θ < 2 ) = 2 e s 2 / 16 e 2 C 2 ( θ ) b 2 2 , if s 4 e C ( θ ) b X 2 L n θ / ( θ 2 ) ( θ , b X ) 2 e s θ / [ 4 e C ( θ ) b X 2 L n ( θ , b X ) ] θ , if s > 4 e C ( θ ) b X 2 L n θ / ( θ 2 ) ( θ , b X ) ; ( θ > 2 ) = 2 e s θ / [ 4 e C ( θ ) b X 2 L n ( θ , b X ) ] θ , if s < 4 e C ( θ ) b X 2 L n θ / ( 2 θ ) ( θ , b X ) 2 e s 2 / 16 e 2 C 2 ( θ ) b X 2 2 , if s 4 e C ( θ ) b X 2 L n θ / ( 2 θ ) ( θ , b X ) .
Remark 2.
The constant C ( θ ) in Theorem 1 can be improved as C ( θ ) / 2 under symmetric assumption of sub-Weibull r.v.s { X i } i = 1 n . Moreover, by the improved symmetrization theorem (Theorem 3.4 in [15]), one can replace the constant C ( θ ) in Theorem 1 by a sharper constant ( 1 + o ( 1 ) ) C ( θ ) / 2 . Theorem 1 (b) also implies a potential empirical upper bound for i = 1 n w i X i for independent sub-Weibull r.v.s { X i } i = 1 n , because the only unknown variable in 2 e C ( θ ) b X 2 { t + L n ( θ ) t 1 / θ } is b X . From Remark 1, estimating b X is possible for i.i.d. observation { X i } i = 1 n .
Remark 3.
Compared with the newest result in [6], our method do not use the crude String’s approximation will give sharper concentration. For example, suppose X 1 , , X 10 are i.i.d. r.v.s with mean μ and X 1 μ ψ θ = 1 . Here we set θ = 0.5 , X is heavy-tailed (for example set the density of X as f ( x ) = 1 2 x e x · 1 ( x 0 ) ). We find that C ( θ ) 2825.89 , A ( θ ) 0.07 , and L 10 ( θ , 1 10 ) = 0.23 . Hence, 95 % confidence interval in our method will be
μ X ¯ ± 2 e × 2118.80 ,
while the 95% confidence interval in Theorem 3.1 of [6] is evaluated as
μ X ¯ ± 2 e × 3969.94 .
In this example, it can be seen that our method does give a much better (tighter) confidence interval.
Remark 4.
Theorem 1 (b) generalizes the sub-Gaussian concentration inequalities, sub-exponential concentration inequalities, and Bernstein’s concentration inequalities with Bernstein’s moment condition. For θ < 2 in Theorem 1 (c), the tail behaviour of the sum is akin to a sub-Gaussian tail for small t, and the tail resembles the exponential tail for large t; For θ > 2 , the tail behaves like a Weibull r.v. with tail parameter θ and the tail of sums match that of the sub-Gaussian tail for large t. The intuition is that the sum will concentrate around zero by the Law of Large Number. Theorem 1 shows that the convergence rate will be faster for small deviations from the mean and will be slower for large deviations from the mean.
Remark 5.
Recently, similar result presented in [16] is that
P | i = 1 n X i | > x exp x n K θ 1 / θ , for x n K θ
where K θ is some constants only depends on X and θ ( K θ can be obtained by Proposition 3). But it is obvious to see this large derivation result cannot guarantee a n -convergence rate (as presented in Proposition 3) whereas our result always give a n -convergence rate, as presented in Theorem 1 (c) and Proposition 3.

2.3. Sub-Weibull Parameter

In this part, a new sub-Weibull parameters is proposed, which is enable of recovering the tight concentration inequality for single non-zero mean random vector. Similar to characterizations of sub-Gaussian r.vs. in Proposition 2.5.2 of [1], sub-Weibull r.vs. has the equivalent definitions.
Proposition 6
(Characterizations of sub-Weibull r.v., [17]). Let X be a r.v., then the following properties are equivalent. (1). The tails of X satisfy P ( | X | x ) e ( x / K 1 ) θ , for all x 0 ; (2). The moments of X satisfy X k : = ( E | X | k ) 1 / k K 2 k 1 / θ for all k 1 θ ; (3). The MGF of | X | 1 / θ satisfies E e λ 1 / θ | X | 1 / θ e λ 1 / θ K 3 1 / θ for | λ | 1 K 3 ; (4). E e | X / K 4 | 1 / θ 2 .
From the upper bound of ( E | X | k ) 1 / k in Proposition 6(2), an alternative definition of the sub-Weibull norm X ψ θ : = sup k 1 k 1 / θ ( E | X | k ) 1 / k is given by [17]. Let θ = 1 . An alternative definition of the sub-exponential norm is X ψ 1 : = sup k 1 k 1 ( E | X | k ) 1 / k see Proposition 2.7.1 of [1]. The sub-exponential r.v. X satisfies equivalent properties in Proposition 6 (Characterizations of sub-exponential with θ = 1 ). However, these definition is not enough to obtain the sharp parameter as presented in the sub-Gaussian case. Here, we redefine the sub-Weibull parameter by our Corollary 3(a).
Definition 4
(Sub-Weibull r.v., X subW ( θ , v ) ). Define the sub-Weibull norm
X φ θ = sup k 1 E | X | θ k / k ! 1 / ( θ k ) .
We denote the sub-Weibull r.v. as X subW ( θ , v ) if v = X φ θ < for a given θ > 0 . For θ 1 , the · φ θ is a norm which satisfies triangle inequality by Minkowski’s inequality: E ( | X + Y | r ) 1 / r [ E ( | X | r ) ] 1 / r + [ E ( | Y | r ) ] 1 / r , ( r 1 ) comparing to Proposition 1. Definition 4 is free of bounding MGF, and it avoids Stirling’s approximation in the proof of the tail inequality. We obtain following main results for this moment-based norm.
Corollary 6.
If X φ θ < , then P { | X | > t } 2 exp { t θ 2 X φ θ θ } for all t 0 .
Theorem 2
(sub-Weibull concentration). Suppose that there are n independent sub-Weibull r.v.s X i subW ( θ , v i ) for i = 1 , 2 , , n . We have
P i = 1 n X i t exp θ e 11 / 12 t θ 2 [ e ( i = 1 n v i ) C θ ] θ , for t e ( i = 1 n v i ) C θ ( 2 1 θ e 11 / 12 ) 1 / θ ,
and P 1 n i = 1 n X i e v ¯ 2 1 / θ C θ log ( α 1 ) θ e 11 / 12 1 / θ 1 α ( 1 e 1 , 1 ] . Moreover, we have
P | i = 1 n X i | e ( i = 1 n ( E | X i | ) t ) 1 / t + e ( i = 1 n v i ) 2 1 / θ C θ ( t θ e 11 / 12 ) 1 / θ e t , t 0 .
The proof of Theorem 2 can be seen in Appendix A.8. The concentration in this Theorem 2 will serve a critical role in many statistical and machine learning literature. For instance, the sub-Weibull concentrations in [7] contain unknown parameters, which makes the algorithm for general sub-Weibull random rewards is infeasible. However, when using our results, it will become feasible as we give explicit constants in these concentrations.
Importantly, the sub-exponential parameter is a special case of sub-Weibull norm by letting θ = 1 . Denote the sub-exponential parameter for r.v X as
X φ 1 : = sup k 1 E | X | k k ! 1 / k .
We denote X sE φ 1 ( v ) if v = X φ 2 . For exponential r.v. X Exp ( μ ) , the moment is E X k = k ! λ k and X φ 1 = λ . Another case of sub-Weibull norm is θ = 2 , which defines sub-Gaussian parameter:
X φ 2 : = sup k 1 E | X | 2 k k ! 1 / 2 k ( Var X ) 1 / 2 .
Like the generalized method of moments, we can give the higher-moment estimation procedure for the norm X φ 2 . Unfortunately, the method in Remark 1 for estimating MGF is not stable in the simulation since the exponential function has a massive variance in some cases.
  • Estimation procedure for X φ 2 and X φ 1 . Consider
    X ^ φ 2 = sup k 1 1 n × k ! i = 1 n | X i | 2 k 1 / ( 2 k ) , X ^ φ 1 = sup k 2 1 k ! · 1 n i = 1 n X i k 1 / k
    as a discrete optimization problem. We can take k max big enough to minimize
    1 n × k ! i = 1 n | X i | 2 k 1 / ( 2 k ) , 1 k ! · 1 n i = 1 n X i k 1 / k on k { 1 , , p max } .
At the first glimpse, the bigger p is, the larger n is required in this method. Nonetheless, often, most of common distributions only require a median-size of p to give a relatively good result, then only the median-size of n in turn is required. For standard Gaussian random, centralized Bernoulli (successful probability μ = 0.3 ), and uniform distributed (on [ 1 , 1 ] ) variable X,
X φ 2 = 2 Γ ( 1 + p ) / 2 Γ ( 1 / 2 ) Γ ( 1 + p / 2 ) 1 / p , μ ( 1 μ ) p + ( 1 μ ) μ p Γ ( p / 2 + 1 ) 1 / p , Γ 1 / p ( p / 2 + 1 ) ( p + 1 ) 1 / p .
It can be shown that X φ 2 1 , 0.4582576 , 0.5773503 . The Figure 1, Figure 2 and Figure 3 show the estimated value from different n under estimate method (8) for the three distributions mentioned above. The estimate method (8) is a correct estimated method for sub-Gaussian parameter to our best knowledge.
For centralized negative binomial, and centralized Poisson ( λ = 1 ) variable X, X φ 1 = 2.460938 , 0.7357589 , respectively. The Figure 4 and Figure 5 show the estimated value from different n under estimate method (8) for the four distributions mentioned above.
The five figures mentioned above show litter bias between the estimated norm and true norm. It is worthy to note that the norm estimator for centralized negative binomial case has a peak point. This is caused by sub-exponential distributions having relatively heavy tails, and hence the norm estimation may not robust as that in sub-Gaussian under relatively small sample sizes.
Moreover, sub-Gaussian and sub-exponential parameter is extensible for random vectors with values in a normed space ( X , · ) , we define norm-sub-Gaussian parameter and norm-sub-exponential parameter: The norm-sub-Gaussian parameter:
X φ 2 = sup k 1 ( k ! ) 1 / ( 2 k ) E X 2 k 1 / ( 2 k ) ;
the norm-sub-exponential parameter:
X φ 1 = sup k 1 ( k ! ) 1 / k E X k 1 / k .
We denote X nsubG φ 1 ( σ 2 ) and X nsubG φ 2 ( σ 2 ) for σ 2 = X φ 2 and X φ 1 , respectively.

3. Statistical Applications of Sub-Weibull Concentrations

3.1. Negative Binomial Regressions with Heavy-Tail Covariates

In statistical regression analysis, the responses { Y i } i = 1 n in linear regressions are assume to be continuous Gaussian variables. However, the category in classification or grouping may be infinite with index by the non-negative integers. The categorical variables is treated as countable responses for distinction categories or groups; sometimes it can be infinite. In practice, random count responses include the number of patients, the bacterium in the unit region, or stars in the sky and so on. The responses { Y i } i = 1 n with covariates { X i } i = 1 n belongs to generalized linear regressions. We consider i.i.d. random variables { ( X i , Y i ) } i = 1 n ( X , Y ) R p × N . By the methods of the maximum likelihood or the M-estimation, the estimator β ^ n is given by
β ^ n : = arg min β R p 1 n i = 1 n ( X i β , Y i ) ,
where the loss function ( · , · ) is convex and twice differentiable in the first argument.
In high-dimensional regressions, the dimension β may be growing with sample size n. When { Y i } i = 1 n belongs to the exponential family, [18] studied the asymptotic behavior of β ^ n in the generalized linear models (GLMs) as p n : = dim ( X ) is increasing. In our study, we focus on the case that the covariates is subW ( θ ) heavy-tailed for θ < 1 .
The target vector β : = arg min β R p E X T β , Y is assumed to be the loss under the population expectation, comparing to (9). Let ˙ ( u , y ) : = t ( t , y ) t = u , ¨ ( u , y ) : = t ˙ ( t , y ) t = u and C ( u , y ) : = sup | s t | u ¨ ( s , y ) ¨ ( t , y ) . Finally, define the score function and Hessian matrix of the empirical loss function are Z ^ n ( β ) : = 1 n i = 1 n ˙ ( X i T β , Y i ) X i and Q ^ n ( β ) : = 1 n i = 1 n ¨ ( X i T β , Y i ) X i X i T , respectively. The population version of Hessian matrix is Q ( β ) : = E [ ¨ ( X T β , Y ) X X T ] . The following so-called determining inequalities guarantee the 2 -error for the estimator obtained from the smooth M-estimator defined as (9).
Lemma 4
(Corollary 3.1 in [19]). Let δ n ( β ) : = 3 2 [ Q ^ n ( β ) ] 1 Z ^ n ( β ) 2 for β R p . If ( · , · ) is a twice differentiable function that is convex in the first argument and for some β R p : max 1 i n C X i 2 δ n ( β ) , Y i 4 3 . Then there exists a vector β ^ n R p satisfying Z ^ n ( β ^ n ) = 0 as the estimating equation of (9),
1 2 δ n ( β ) β ^ n β 2 δ n β .
Applications of Lemma 4 in regression analysis is of special interest when X is heavy tailed, i.e., the sub-Weibull index θ < 1 . For the negative binomial regression (NBR) with the known dispersion parameter k > 0 , the loss function is
( u , y ) = y u + ( y + k ) log ( k + e u ) .
Thus we have ˙ ( u , y ) = k ( y e u ) k + e u , ¨ ( u , y ) = k ( y + k ) e u ( k + e u ) 2 , see [20] for details.
Further computation gives C ( u , y ) = sup | s t | u e s ( k + e t ) 2 ( k + e s ) 2 e t and it implies that C ( u , y ) e 3 u . Therefore, condition max 1 i n C X i 2 δ n ( β ) , Y i 4 3 in Lemma 4 leads to
max 1 i n X i 2 δ n β log ( 4 / 3 ) 3 .
This condition need the assumption of the design space for max 1 i n X i 2 .
In NBR with loss (10), one has
Q ^ n ( β ) : = 1 n i = 1 n ( Y i + k ) k e X i β X i X i ( k + e X i β ) 2 and Z ^ n ( β ) : = 1 n i = 1 n k ( Y i e X i β ) X i k + e X i β .
To guarantee that β ^ n approximates β well, some regularity conditions are required.
(C.1): For M Y , M X > 0 , assume max 1 i n Y i ψ 1 M Y and the heavy-tailed covariates { X i k } are uniformly sub-Weibull with max 1 i n , 1 k p X i k ψ θ M X for 0 < θ < 1 .
(C.2): The vector X i is sparse or bounded. Let F Y : = { max 1 i n E Y i = max 1 i n e X i β B , max 1 i n X i 2 I n } with a slowly increasing function I n , we have P { F Y c } = ε n 0 .
In addition, to bound max 1 i n , 1 i k | X i k | , the sub-Weibull concentration determines:
P max 1 i n , 1 i k | X i k | > t n p P ( | X 11 | > t ) 2 n p e ( t / X 11 ψ θ ) θ δ t = M X log 1 / θ ( 2 n p δ ) ,
by using Corollary 3. Hence, we define the event for the maximum designs:
F max = max 1 i n , 1 k p | X i k | M X log 1 / θ ( 2 n p δ ) F Y .
To make sure that the optimization in (9) has a unique solution, we also require the minimal eigenvalue condition.
(C.3): Suppose that b E ( Q ^ n ( β ) ) b C min is satisfied for all b S p 1 .
In the proof, to ensure that the random Hessian function has a non-singular eigenvalue, we define the event
F 1 = max k , j 1 n i = 1 n Y i k e X i β X i k X i j ( k + e X i β ) 2 E Y i k e X i β X i k X i j ( k + e X i β ) 2 C min 4
F 2 = max k , j 1 n i = 1 n k e X i β X i k X i j ( k + e X i β ) 2 E k e X i β X i k X i j ( k + e X i β ) 2 C min 4 .
Theorem 3
(Upper bound for 2 -error). In the NBR with loss (10) and ( C . 1 C . 3 ) , let
M B X = M X + B log 2 , R n : = 6 M B X M X C min 2 p n log 2 p δ + 1 n p log 2 p δ log 1 / θ 2 n p δ ,
and b : = ( k / n ) M X 2 ( 1 , , 1 ) R n . Under the event F 1 F 2 F max , for any 0 < δ < 1 , if the sample size n satisfies
R n I n log ( 4 / 3 ) 3 ,
Let c n : = e 1 4 ( n t 2 2 M X 4 log 4 / θ ( 2 n p δ ) M B X 2 n t M X 2 log 2 / θ ( 2 n p δ ) M B X ) + e ( t θ / 2 [ 4 e C ( θ / 2 ) b 2 L n ( θ / 2 , b ) ] θ / 2 t 2 16 e 2 C 2 ( θ / 2 ) b 2 2 ) with t = C min / 4 , then
P ( β ^ n β 2 R n ) 1 2 p 2 c n δ ε n .
A few comment is made on this theorem. First, in order to get β ^ n β 2 p 0 , we need p = o ( n ) under sample size restriction (11) with I n = o ( log 1 / θ ( n p ) · [ n 1 p log p ] 1 / 2 ) . Second, note that the ε n in provability 1 2 p 2 c n δ ε n depends on the models size and the fluctuation of the design by the event F max .

3.2. Non-Asymptotic Bai-Yin’s Theorem

In statistical machine learning, exponential decay tail probability is crucial to evaluate the finite-sample performance. Unlike Bai-Yin’s law with the fourth-moment condition that leads to polynomial decay tail probability, under sub-Weibull conditions of data, we provide a exponential decay tail probability on the extreme eigenvalues of a n × p random matrix.
Let A = A n , p be an n × p random matrix whose entries are independent copies of a r.v. with zero mean, unit variance, and finite fourth moment. Suppose that the dimensions n and p both grow to infinity while the aspect ratio p / n converges to a constant in [ 0 , 1 ] . Then Bai-Yin’s law [21] asserted that the standardized extreme eigenvalues satisfying
1 n λ m i n ( A ) = 1 p n + o p n , 1 n λ m a x ( A ) = 1 + p n + o p n a . s . .
Next we introduce a special counting measure for measuring the complexity of a certain set in some space. The N ε is called an ε-net of K in R n if K can be covered by balls with centers in K and radii ε (under Euclidean distance). The covering number N ( K , ε ) is defined by the smallest number of closed balls with centers in K and radii ε whose union covers K.
For purposes of studying random matrices, we need to extend the definition of sub-Weibull r.v. to sub-Weibull random vectors. The n-dimensional unit Euclidean sphere S n 1 , is denoted by S n 1 = { x R n : x 2 = 1 } . We say that a random vector X in R n is sub-Weibull if the one-dimensional marginals X , a are sub-Weibull r.v.s for all a R n . The sub-Weibull norm of a random vector X is defined as X ψ θ : = sup a S n 1 X , a ψ θ . Similarly, define the spectral norm for any p × p matrix B as B = max | | x | | 2 = 1 B x 2 = sup x S p 1 | B x , x | . Spectral norm has many good properties, see [1] for details.
Furthermore, for simplicity, we assume that the rows in random matrices are isotropic random vectors. A random vector Y in R n is called isotropic if Var ( Y ) = I p . Equivalently, Y is isotropic if E [ Y , a 2 ] = a 2 2 for all a R n . In the non-asymptotic regime, Theorem 4.6.1 in [1] study the upper and lower bounds of maximum (minimum) eigenvalues of random matrices with independent sub-Gaussian entries which are sampled from high-dimensional distributions. As an extension of Theorem 4.6.1 in [1], the following result is a non-asymptotic versions of Bai-Yin’s law for sub-Weibull entries, which is useful to estimate covariance matrices from heavy-tailed data [ subW ( θ ) , θ < 1 ].
Theorem 4 (Non-asymptotic Bai-Yin’s law).
Let A be an n × p matrix whose rows A i are independent isotropic sub-Weibull random vectors in R p with covariance matrix I p and max 1 i n A i ψ θ K . Then for every s 0 , we have
P 1 n A A I p H ( c p + s 2 , n ; θ ) 1 2 e s 2 ,
where
H ( t , n ; θ ) : = 2 e K C ( θ / 2 ) K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ) t n + A ( θ / 2 ) ( γ 2 t ) 2 / θ n , θ 2 B ( θ / 2 ) ( γ 2 t ) 2 / θ n 1 / θ , θ > 2 ,
where K α : = 2 1 / α if α ( 0 , 1 ) and K α = 1 if α 1 ; A ( θ / 2 ) , B ( θ / 2 ) and C ( θ / 2 ) defined in Theorem 1a.
Moreover, the concentration inequality for extreme eigenvalues hold for c n log 9 / p
P 1 H 2 ( c p + s 2 , n ; θ ) λ m i n ( A ) n λ m a x ( A ) n 1 + H 2 ( c p + s 2 , n ; θ ) 1 2 e s 2 .

3.3. General Log-Truncated Z-Estimators and sub-Weibull Type Robust Estimators

Motivated from log-truncated loss in [22,23], we study the almost surely continuous and non-decreasing function φ c : R R for truncating the original score function
log 1 x + c ( | x | ) φ c ( x ) log 1 + x + c ( | x | ) , x R
where c ( | x | ) > 0 is a high-order function [23] of | x | which is to be specified. For example, a plausible choose for φ c ( x ) in (13) should have following form
φ c ( x ) = log 1 + x + c ( | x | ) 1 ( x 0 ) log 1 x + c ( | x | ) 1 ( x 0 ) = sign ( x ) log ( 1 + | x | + c ( | x | ) ) .
For (14), we get φ c ( x ) x for sufficiently smaller x and φ c ( x ) x for larger x. Under (13), now we show that c ( | x | ) must obey a key inequality. For all x R , it suffices to verify log [ 1 x + c ( | x | ) ] log [ 1 + x + c ( | x | ) ] , which is equivalent to check log 1 + c ( | x | ) + x 1 + c ( | x | ) x 0 , namely 1 + c ( | x | ) 2 x 2 1 c ( | x | ) 1 + x 2 1 .
For independent r.v.s { X i } i = 1 n , using the score function (14), we define the score function of data
Z ^ α n ( θ ) = 1 n α n i = 1 n φ c α n X i θ for any θ R .
Then the influence of the heavy-tailed outliers is weaken by φ c α n X i θ by choosing an optimal α n . We aim to estimate the average mean: μ n : = 1 n i = 1 n E X i for non-i.i.d. samples { X i } i = 1 n . Define the Z-estimator θ ^ α n as
θ ^ α n { θ R : Z ^ α n ( θ ) = 0 } ,
where α n is the tuning parameter (will be determined later).
To guarantee consistency for log-truncated Z-estimators (15), we require following assumptions of c ( · ) .
(C.1): For a constant c 2 > 1 , the c ( x ) satisfies weak triangle inequality and scaling property,
( C . 1.1 ) : c ( x + y ) c 2 [ c ( x ) + c ( y ) ] , ( C . 1.2 ) : c ( t x ) f ( t ) c ( x )
for f ( t ) satisfies
(C.1.3): f ( t ) and f ( t ) / | t | are non-constant increasing functions and lim t 0 f ( t ) / | t | = 0 .
Remark 6.
Note that | x | 1 + x 2 1 and we could put c ( | x | ) = | x | . However, c ( | x | ) = | x | does not satisfy (C.1.3) since f ( t ) = | t | and f ( t ) / | t | are constant functions of t.
In the following theorem, we establish the finite sample confidence interval and the convergence rate of the estimator θ ^ α n .
Theorem 5.
Let { X i } i = 1 n be independent samples drawn from an unknown probability distribution { P i } i = 1 n on R . Consider the estimator θ ^ α n defined as (15) with (C.1), α n 0 and 1 n i = 1 n E [ c ( X i θ ) ] = O ( 1 ) . Let B n + ( θ ) = μ n θ + 1 n α n i = 1 n E [ c α n ( X i θ ) ] + log ( δ 1 ) n α n and B n ( θ ) = μ n θ 1 n α n i = 1 n E [ c α n ( X i θ ) ] log ( δ 1 ) n α n . Let θ + be the smallest solution of the equation B n + ( θ ) = 0 and θ be the largest solution of B n ( θ ) = 0 .
(a). We have with the ( 1 2 δ ) -confidence intervals
P ( B n ( θ ) < Z ^ α n ( θ ) < B n + ( θ ) ) 1 2 δ , P ( θ θ ^ α n θ + ) 1 2 δ ,
for any δ ( 0 , 1 / 2 ) satisfies the sample condition:
1 n α n i = 1 n E [ c α n X i α n [ μ n ± d n ( c ) ] ] + log ( δ 1 ) n α n < d n ( c ) ,
where d n ( c ) is a constant such that B n ± ( μ n ± d n ( c ) ) < 0 .
(b). Moreover, picking α n f 1 log ( δ 1 ) c 2 i = 1 n E [ c X i μ n ] , one has
P | θ ^ α n μ n | g α n 1 2 log ( δ 1 ) n α n 1 2 δ , with g α n ( t ) : = t + c 2 α n c α n t .
The (17) in Theorem 5 is a fundamental extension of Lemma 2.1 (see Theorem 16 in [24]) with c ( x ) = x 2 / 2 from i.i.d. sample to independent sample. Let c ( x ) = | x | β / β , for i.i.d. sample, Theorem 5 implies Lemmas 2.3, 2.4 and Theorem 2.1 in [22]. The α n f 1 log ( δ 1 ) c 2 i = 1 n E [ c X i μ n ] in Theorem 5(b) gives a theoretical guarantee for choosing the tuning parameter α n .
Proposition 7
(Theorem 2.1 in [22]). Let { X i } i = 1 n be a sequence of i.i.d. samples drawn from an unknown probability distribution on R . We assume E X 1 β < for a certain β ( 1 , 2 ] and denote μ = E X 1 , v β = E X 1 μ β . Given any ϵ ( 0 , 1 / 2 ) and positive integer n 2 v β + 1 β β β 1 2 β log ϵ 1 v β , let α n = 1 2 ( 2 β log ( ϵ 1 ) n v β ) 1 β . Then, with probability at least 1 2 ϵ ,
| θ ^ α n μ | 2 2 β log ( ϵ 1 ) n β 1 β v β 1 β β 2 β log ( ϵ 1 ) n v β β 1 β 1 = O n β 1 β .
Comparing to the convergence rate in (18), put O ( n β 1 β ) = O ( n 1 / θ ) for θ > 2 . It implies
β 1 + θ 1 = 1 , ( θ 2 or 0 < β 2 ) .
For example, let us deal with the Pareto distribution Pareto ( α , k ) with shape parameter α > 0 and scale parameter k > 0 , and the density function is f ( x ) = α k α x α + 1 · 1 { x [ k , ) } . For α 2 , Pareto ( α , k ) has infinite variance, and it does not belong to the sub-Weibull distribution, so do the sample mean of i.i.d. Pareto distributed data. Proposition 7 shows that the estimator error for robust mean estimator enjoys sub-Weibull concentration as presented in Proposition 3, without finite sub-Weibull norm assumption of data. With the Weibull-tailed behavior, it motivates us to define general sub-Weibull estimators having the non-parametric convergence rate O ( n 1 / θ ) in Proposition 3 for θ > 2 , even if the data do not have finite sub-Weibull norm.
Definition 5
(Sub-Weibull estimators). An estimator μ ^ : = μ ^ ( X 1 , , X n ) based on i.i.d. samples { X i } i = 1 n from an unknown probability distribution P with mean μ P , is called ( A , B , C ) - subW ( θ ) if
t ( 0 , A ) , P ( | μ ^ μ P | B ( t / n ) 1 / θ ) 1 C e t .
For example, in Proposition 7, θ ^ α n is ( , B , 1 ) - subW ( β β 1 ) with B 2 2 β log ( ϵ 1 ) β 1 β v β 1 β in Definition 5. When θ = 2 , [25] defined sub-Gaussian estimators (includes Median of means and Catoni’s estimators) for certain heavy-tailed distributions and discussed the nonexistence of sub-Gaussian mean estimators under β -moment condition for the data ( β ( 1 , 2 ) ).

4. Conclusions

Concentration inequalities are far-reaching useful in high-dimensional statistical inferences and machine learnings. They can facilitate various explicit non-asymptotic confidence intervals as a function of the sample size and model dimension.
Future research includes sharper version of Theorem 2 that is crucial to construct non-asymptotic and data-driven confidence intervals for the sub-Weibull sample mean. Although we have obtained sharper upper bounds for sub-Weibull concentrations, the lower bounds on tail probabilities are also important in some statistical applications [26]. Developing non-asymptotic and sharp lower tail bounds of Weibull r.v.s is left for further study. For negative binomial concentration inequalities in Corollary 2, it is of interesting to study concentration inequalities of COM-negative binomial distributions (see [27]).

Author Contributions

Conceptualization, H.Z. and H.W.; Formal analysis, H.Z. and H.W.; Funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by National Natural Science Foundation of China Grant (12101630) and the University of Macau under UM Macao Talent Programme (UMMTP-2020-01). This work is also supported in part by the Key Project of Natural Science Foundation of Anhui Province Colleges and Universities (KJ2021A1034), Key Scientific Research Project of Chaohu University (XLZ-202105).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Acknowledgments

The authors thank Guang Cheng for the discussion about doing statistical inference in the non-asymptotic way and Arun Kumar Kuchibhotla for his help about the proof of Theorem 1. The authors also thank Xiaowei Yang for his helpful comments on Theorem 5.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1

Proof of Corollary 1.
ϕ | X | θ ( t ) is continuous for t a neighborhood of zero, by the definition, 2 E e ( | X | / X ψ θ ) θ = m | X | θ X ψ θ θ . Since | X | θ > 0 , the MGF m | X | θ ( t ) is monotonic increasing. Hence, inverse function m | X | θ 1 ( t ) exists and X ψ θ θ = m | X | θ 1 ( 2 ) . So X ψ θ = m | X | θ 1 ( 2 ) 1 / θ . □

Appendix A.2

Proof of Corollary 2.
The first inequality is the direct application of (3) by observing that for any constant a R , and r.v. Y with Y ψ 1 < , a Y ψ 1 = | a | Y ψ 1 , Y + a ψ 1 Y ψ 1 + a ψ 1 = Y ψ 1 + | a | / log 2 and X + a ψ 1 2 ( X ψ 1 + | a | / log 2 ) 2 . The second inequality is obtained from (3) by considering two rate in ( t 2 i = 1 n 2 Y i ψ 1 2 t max 1 i n Y i ψ 1 ) separately. For (5), we only need to note that
Y i ψ 1 = inf { t > 0 : E e Y i / t 2 } = inf { t > 0 : 1 q i 1 q i e 1 / t k i 2 } = log 1 ( 1 q i ) / 2 k i q i 1 .
Then the third inequality is obtained by the first inequality and the definition of a ( μ i , k i ) . □

Appendix A.3

Proof of Corollary  3.
The first and second part of this proposition were shown in Lemma 2.1 of [28]. For the third result, using the bounds of Gamma function [see [29]:
2 π x x ( 1 / 2 ) e x Γ ( x ) [ 2 π x x ( 1 / 2 ) e x ] · e 1 / ( 12 x ) , ( x > 0 ) ,
it gives
( E | X | k ) 1 / k 2 X φ θ k k θ [ 2 π k / θ k θ 1 2 e 11 k 12 θ ] 1 / k = ( 2 2 π θ ) 1 / k { k θ k θ + 1 2 e 11 k 12 θ } 1 / k X φ θ = ( 2 2 π θ ) 1 / k k / θ 1 θ + 1 2 k e 11 12 θ X φ θ C θ ( θ e 11 / 12 ) 1 / θ X φ θ k 1 / θ .

Appendix A.4

Proof of Corollary 4.
By the definition of ψ θ -norm, E exp { | X / X ψ θ | θ } 2 . Then E exp { | | X | r / X ψ θ r | θ / r } 2 . The result | X | r subW ( θ / r ) follows by the definition of ψ θ -norm again. Moreover,
X ψ θ : = inf { C ( 0 , ) : E [ exp ( | X | θ / C θ ) ] 2 } = [ inf { C r ( 0 , ) : E [ exp { | | X | r / C r | θ / r } ] 2 } ] 1 / r = | X | r ψ θ / r 1 / r ,
which verifies (6). If X subW ( r θ ) , then E exp { | X r / X ψ r θ r | θ } = E exp { | X / X ψ r θ | r θ } 2 , which means that X r subW ( θ ) with
X ψ r θ : = inf { C ( 0 , ) : E [ exp ( | X | r θ / C r θ ) ] 2 } = [ inf { C r ( 0 , ) : E [ exp { | | X | r / C r | θ } ] 2 } ] 1 / r = | X | r ψ θ 1 / r .

Appendix A.5

Proof of Corollary 5.
Set Δ : = sup p 2 X p p + L p 1 / θ so that X p Δ p + L Δ p 1 / θ holds for all p 2 . By Markov’s inequality for t-th moment ( t 2 ) , we have
P | X | e Δ t + e L Δ t 1 / θ | | X | | t e Δ [ t + L t 1 / θ ] t e t , [ By the definition of Δ ] .
So, for any t 2 ,
P | X | e Δ t + e L Δ t 1 / θ e t .
Note the definition of Δ shows X t Δ t + L Δ t 1 / θ holds for all t 2 and assumption X t C 1 t + C 2 t 1 / θ for all t 2 . It gives e Δ t + e L Δ t 1 / θ e C 1 t + e C 2 t 1 / θ . This inequality with (A1) gives
P | X | e C 1 t + e C 2 t 1 / θ 1 { 0 < t < 2 } + e t { t 2 } , t > 0 .
Take K = k 2 / θ C 2 / ( k C 1 ) , and define δ k : = k e C 1 for a certain constant k > 1 ,
E Ψ θ , K | X | δ k = 0 P | X | δ k Ψ θ , K 1 ( s ) d s = 0 P ( | X | k e C 1 log ( 1 + s ) + k e C 1 K [ log ( 1 + s ) ] 1 / θ ) d s = 0 P ( | X | e C 1 log ( 1 + s ) k 2 + e C 2 [ log ( 1 + s ) k 2 ] 1 / θ ) d s By ( A 2 ) ] 0 < k 2 log ( 1 + s ) < 2 d s + k 2 log ( 1 + s ) 2 exp k 2 log ( 1 + s ) d s 0 e 2 k 2 1 d t + e 2 k 2 1 d t ( 1 + t ) k 2 = e 2 k 2 1 + ( 1 + t ) 1 k 2 1 k 2 e 2 k 2 1 = e 2 k 2 1 + e 2 ( 1 k 2 ) / k 2 k 2 1 1 .
Therefore, X Ψ θ , K γ e C 1 with γ defined as the smallest solution of the inequality { k > 1 : e 2 k 2 1 + e 2 ( 1 k 2 ) / k 2 k 2 1 1 } . An approximate solution is γ 1.78 . □

Appendix A.6

The main idea in the proof is by the sharper estimates of the GBO norm of the sum of symmetric r.v.s.
Proof of Theorem 1.
(a)
Without loss of generality, we assume X i ψ θ = 1 . Define Y i : = | X i | ( log 2 ) 1 / θ + , then it is easy to check that P ( | X i | t ) 2 e t θ implies P ( Y i t ) e t θ . For independent Rademacher r.v. { ε i } i = 1 n , the symmetrization inequality gives i = 1 n w i X i p 2 i = 1 n ε i w i X i p . Note that ε i X i is identically distributed as ε i | X i | ,
i = 1 n w i X i p 2 i = 1 n ε i w i | X i | p 2 i = 1 n ε i w i Y i + ( log 2 ) 1 / θ p 2 i = 1 n ε i w i Y i p + 2 ( log 2 ) 1 / θ i = 1 n ε i w i p [ Khinchin - Kahane inequality ] 2 i = 1 n ε i w i Y i p + 2 ( log 2 ) 1 / θ p 1 2 1 1 / 2 i = 1 n ε i w i 2 < 2 i = 1 n ε i w i Y i p + 2 ( log 2 ) 1 / θ p ( E ( i = 1 n ε i w i ) 2 ) 1 / 2 ε i i = 1 n are independent ] = 2 i = 1 n ε i w i Y i p + 2 ( log 2 ) 1 / θ p w 2 .
From Lemma 2, we are going to handle the first term in (A3) with the sum of symmetric r.v.s. Since P ( Y i t ) e t θ , then
i = 1 n ε i w i Y i p = i = 1 n w i Z i p , Z i : = ε i Y i
for symmetric independent r.v.s { Z i } i = 1 n satisfying | Z i | = d Y i and P ( Z i t ) = e t θ for all t 0 .
Next, we proceed the proof by checking the moment conditions in Corollary 5.
Case θ 1 : N ( t ) = t θ is concave for θ 1 . From Lemmas 2 and 3 (a), for p 2 ,
i = 1 n w i Z i p e inf t > 0 : i = 1 n log ϕ p e 2 w i e 2 t Z i p e inf t > 0 : i = 1 n p M p , Z i w i e 2 t p = e inf t > 0 : i = 1 n w i e 2 t p Z i p p p w i e 2 t 2 Z i 2 2 p e inf t > 0 : Γ p θ + 1 e 2 p t p w p p 1 + e inf t > 0 : p Γ 2 θ + 1 e 4 t 2 w 2 2 1 ,
where the last inequality we use Z i p p = 0 p t p 1 P ( | Z i | t ) d t 0 p t p 1 e t θ d t = p Γ p θ + 1 . Hence
i = 1 n w i Z i p e 3 Γ 1 / p p θ + 1 w p + p Γ 1 / 2 2 θ + 1 w 2 ,
and
i = 1 n w i X i p 2 e 3 Γ 1 / p p θ + 1 w p + p Γ 1 / 2 2 θ + 1 w 2 + 2 ( log 2 ) 1 / θ p w 2 = 2 e 3 Γ 1 / p p θ + 1 w p + 2 ( log 2 ) 1 / θ + e 3 Γ 1 / 2 2 θ + 1 p w 2 .
Using homogeneity, we can assume that p w 2 + p 1 / θ w = 1 . Then w 2 p 1 / 2 and w p 1 / θ . Therefore, for p 2 ,
w p i = 1 n | w i | 2 w p 2 1 / p ( p 1 ( p 2 ) / θ ) 1 / p = ( p p / θ p ( 2 θ ) / θ ) 1 / p 3 2 θ 3 θ p 1 / θ = 3 2 θ 3 θ p 1 / θ { p w 2 + p 1 / θ w } ,
where the last inequality follows form the fact that p 1 / p 3 1 / 3 for any p 2 , p N . Hence
i = 1 n w i X i p 2 e 3 + 2 θ e θ Γ 1 / p p θ + 1 w + 2 log 1 / θ 2 + e 3 Γ 1 / 2 2 θ + 1 + 3 2 θ 3 θ p 1 θ Γ 1 / p p θ + 1 p w 2 .
Following Corollary 5, we have
i = 1 n w i X i Ψ θ , L n ( θ , p ) γ e D 1 ( θ ) ,
where L n ( θ , p ) = γ 2 / θ D 2 ( θ , p ) γ D 1 ( θ ) , D 1 ( θ ) : = 2 [ log 1 / θ 2 + e 3 ( Γ 1 / 2 ( 2 θ + 1 ) + sup p 2 3 2 θ 3 θ p 1 θ Γ 1 / p ( p θ + 1 ) ) ] w 2 < , and D 2 ( θ , p ) : = 2 e 3 3 2 θ 3 θ p 1 / θ Γ 1 / p p θ + 1 w .
Finally, take L n ( θ ) = inf p 1 L n ( θ , p ) > 0 . Indeed, the positive limit can be argued by (2.2) in [30]. Then by the monotonicity property of the GBO norm, it gives
i = 1 n w i X i Ψ θ , L n ( θ ) i = 1 n w i X i Ψ θ , L n ( θ , p ) γ e D 1 ( θ ) .
Case θ > 1 : In this case N ( t ) = t θ is convex with N ( t ) = θ 1 θ 1 1 θ 1 t θ θ 1 . By Lemmas 2 and 3(b), for p 2 , we have
i = 1 n w i Z i p e inf t > 0 : i = 1 n log ϕ p 4 w i t Z i / 4 p + e inf t > 0 : i = 1 n p M p , Z i ( 4 w i t ) p e inf t > 0 : i = 1 n p 1 N p | 4 w i t | 1 + e inf t > 0 : i = 1 n p ( 4 w i t ) 2 1 = 4 e p w 2 + ( p / θ ) 1 / θ ( 1 θ 1 ) 1 / β w β
with β mentioned in the statement. Therefore, for p 2 , Equation (A3) implies
i = 1 n w i X i p [ 8 e + 2 ( log 2 ) 1 / θ ] p w 2 + 8 e ( p / θ ) 1 / θ ( 1 θ 1 ) 1 / β w β .
Then the following result follows by Corollary 5,
i = 1 n w i X i Ψ θ , L ( θ ) γ e D 1 ( θ ) ,
where L n ( θ ) = γ 2 / θ D 2 ( θ ) γ D 1 ( θ ) , D 1 ( θ ) = 8 e + 2 ( log 2 ) 1 / θ w 2 , and D 2 ( θ ) = 8 e θ 1 / θ ( 1 θ 1 ) 1 / β w β .
Note that w i X i = ( w i X i ψ θ ) ( X i / X i ψ θ ) , we can conclude (a).
(b)
It is followed from Proposition 5 and (a).
(c)
For easy notation, put L n ( θ ) = L n ( θ , b X ) in the proof. When θ < 2 , by the inequality a + b 2 ( a b ) for a , b > 0 , we have
P | i = 1 n w i X i | 4 e C ( θ ) b 2 t 2 e t , if t L n ( θ ) t 1 / θ .
Put s : = 4 e C ( θ ) b 2 t , we have
P | i = 1 n w i X i | s 2 exp s 2 16 e 2 C 2 ( θ ) b 2 2 , if s 4 e C ( θ ) b 2 L n θ / ( θ 2 ) ( θ ) .
For t L n ( θ ) t 1 / θ , we obtain P ( | i = 1 n w i X i | 4 e C ( θ ) b X 2 L n ( θ ) t 1 / θ ) 2 e t . Let s : = 4 e C ( θ ) b 2 L n ( θ ) t 1 / θ , it gives
P | i = 1 n w i X i | s 2 exp s θ [ 4 e C ( θ ) b 2 L n ( θ ) ] θ , if s > 4 e C ( θ ) b 2 L n θ / ( θ 2 ) ( θ ) .
Similarly, for θ > 2 , it implies
P i = 1 n w i X i s 2 e s θ [ 4 e C ( θ ) b 2 L n ( θ ) ] θ if s 4 e C ( θ ) b 2 L n θ / ( 2 θ ) ( θ ) ,
and P i = 1 n w i X i s 2 e s 2 16 e 2 C 2 ( θ ) b 2 2 if s 4 e C ( θ ) b 2 L n θ / ( 2 θ ) ( θ ) . □

Appendix A.7

Proof of Corollary 6.
Using the definition of X φ θ , it yields
E e ( c 1 | X | ) θ = 1 + k = 1 c k E | X | k θ k ! 1 + k = 1 c k k ! X φ θ k θ k ! = 1 + k = 1 ( X φ θ θ c θ ) k = 1 + X φ θ θ c θ k = 0 ( X φ θ θ c θ ) k X φ 2 θ c θ < 1 ] = 1 + ( X φ 2 θ c θ ) 1 1 X φ 2 θ / c θ 2
if X φ 2 θ c θ 1 2 which implies that the minimal c is 2 1 / θ X φ θ . That is to say we have E e | X / [ 2 1 / θ X φ θ ] | 1 / θ 2 . Applying (2), we have
P { | X | > t } 2 e ( t / [ 2 1 / θ X φ θ ] ) θ = 2 exp { t θ 2 X φ θ θ } for all t 0 .

Appendix A.8

Proof of Theorem 2.
Minkowski’s inequality for p 1 and definition of X φ θ imply
i = 1 n X i p i = 1 n X i p i = 1 n v i · 2 1 / θ C θ p θ e 11 / 12 1 / θ ,
where the last inequality by letting C θ : = max k 1 2 2 π θ 1 / k k θ 1 / ( 2 k ) in Corollary 3b.
From Markov’s inequality, it yields
P i = 1 n X i t t p i = 1 n X i p p t p ( i = 1 n v i ) p 2 p / θ C θ p θ e 11 / 12 p / θ .
Let t p ( i = 1 n v i ) p 2 p / θ C θ p θ e 11 / 12 p / θ = e p , it gives
t = e ( i = 1 n v i ) 2 1 / θ C θ p θ e 11 / 12 1 / θ and p = θ e 11 / 12 t θ [ e ( i = 1 n v i ) 2 1 / θ C θ ] θ .
Therefore, for p 1 , we have
P i = 1 n X i t P i = 1 n X i e ( i = 1 n v i ) C θ ( 2 1 θ e 11 / 12 ) 1 / θ e p ( 0 , e 1 ] .
So
P i = 1 n X i t exp θ e 11 / 12 t θ 2 [ e ( i = 1 n v i ) C θ ] θ , t e ( i = 1 n v i ) C θ ( 2 1 θ e 11 / 12 ) 1 / θ .
Let v ¯ = 1 n i = 1 n v i and e p = : α . Then
P((1/n)∑_{i=1}^n X_i ≤ e v̄ · 2^{1/θ} C_θ (log(α^{-1})/(θ e^{11/12}))^{1/θ}) ≥ 1 − α ∈ (1 − e^{-1}, 1].
For p < 1, note that moment monotonicity shows that (E|X|^p)^{1/p} is a non-decreasing function of p, i.e.,
0 < p ≤ 1 implies (E|X|^p)^{1/p} ≤ E|X|.
The c_r-inequality implies ‖∑_{i=1}^n X_i‖_p^p ≤ ∑_{i=1}^n ‖X_i‖_p^p. Using Markov's inequality again, we have
P(∑_{i=1}^n X_i ≥ t) ≤ t^{-p} ‖∑_{i=1}^n X_i‖_p^p ≤ t^{-p} ∑_{i=1}^n ‖X_i‖_p^p ≤ t^{-p} ∑_{i=1}^n (E|X_i|)^p.
Put t^{-p} ∑_{i=1}^n (E|X_i|)^p = e^{-p}, i.e., t = e(∑_{i=1}^n (E|X_i|)^p)^{1/p}. Then, we obtain
P(∑_{i=1}^n X_i ≥ e(∑_{i=1}^n (E|X_i|)^p)^{1/p}) ≤ e^{-p} ∈ (e^{-1}, 1).
Combining (A5) and (A6), we obtain, for all t ≥ 0,
P(|∑_{i=1}^n X_i| ≥ e(∑_{i=1}^n (E|X_i|)^t)^{1/t} + e(∑_{i=1}^n v_i) 2^{1/θ} C_θ (t/(θ e^{11/12}))^{1/θ}) ≤ e^{-t}.
This completes the proof. □
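The moment-to-tail step used above (Markov's inequality at the optimal order p) can be illustrated with a tiny numerical experiment. In the Python sketch below, the constants theta, C, and t are arbitrary illustrative values and the moment growth ‖S‖_p ≤ C p^{1/θ} is an assumed toy input rather than the exact bound of the proof; the script minimizes the Markov bound (‖S‖_p/t)^p over p by grid search and compares it with the balancing choice that makes the bound equal to e^{−p}.

import numpy as np

theta, C, t = 0.5, 1.0, 50.0

p_grid = np.linspace(1.0, 20.0, 2000)
markov = (C * p_grid ** (1.0 / theta) / t) ** p_grid   # Markov bound at each p

p_balance = (t / (np.e * C)) ** theta                  # choice solving C*p^{1/theta} = t/e
print(f"best grid bound : p = {p_grid[np.argmin(markov)]:.2f}, bound = {markov.min():.3e}")
print(f"balancing choice: p = {p_balance:.2f}, bound = exp(-p) = {np.exp(-p_balance):.3e}")
# The balancing choice is not the exact minimizer, but it yields a bound of the
# same exponential order, which is all the proof needs.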

Appendix A.9

Proof of Theorem 3.
Note that, for any b ∈ S^{p−1}, it holds that
b Q ^ n ( β ) b b E ( Q ^ n ( β ) ) b b max k , j | [ Q ^ n ( β ) E Q ^ n ( β ) ] k j | = max k , j 1 n i = 1 n ( Y i + k ) k e X i β X i X i ( k + e X i β ) 2 E ( Y i + k ) k e X i β X i X i ( k + e X i β ) 2 k j .
Consider the decomposition
1 n i = 1 n ( Y i + k ) k e X i β X i k X i j ( k + e X i β ) 2 E ( Y i + k ) k e X i β X i k X i j ( k + e X i β ) 2 = 1 n i = 1 n Y i k e X i β X i k X i j ( k + e X i β ) 2 E Y i k e X i β X i k X i j ( k + e X i β ) 2 + k n i = 1 n k e X i β X i k X i j ( k + e X i β ) 2 E k e X i β X i k X i j ( k + e X i β ) 2
For the first term, under the event F_max and with t = C_min/4, we have
P 1 n i = 1 n Y i k e X i β X i k X i j ( k + e X i β ) 2 E Y i k e X i β X i k X i j ( k + e X i β ) 2 t , F max 2 exp 1 4 n 2 t 2 2 i = 1 n ( X i k X i j ) 2 ( Y i ψ 1 + | exp ( X i β ) log 2 | ) 2 n t max 1 i n | X i k X i j | ( Y i ψ 1 + | exp ( X i β ) log 2 | ) 2 exp 1 4 n t 2 2 M X 4 log 4 / θ ( 2 n p δ ) M B X 2 n t M X 2 log 2 / θ ( 2 n p δ ) M B X
where we use k e^{X_i^⊤β}/(k + e^{X_i^⊤β})² ≤ 1, and the second-to-last inequality follows from Corollary 2.
For the second term, by Theorem 1 and ‖X_{ik} X_{ij}‖_{ψ_{θ/2}} ≤ ‖X_{ik}‖_{ψ_θ} ‖X_{ij}‖_{ψ_θ} ≤ M_X², we have
P k n i = 1 n k e X i β X i k X i j ( k + e X i β ) 2 E k e X i β X i k X i j ( k + e X i β ) 2 t , F max 2 exp t θ / 2 [ 4 e C ( θ / 2 ) b 2 L n ( θ / 2 , b ) ] θ / 2 t 2 16 e 2 C 2 ( θ / 2 ) b 2 2
where b = ( k / n ) M X 2 ( 1 , , 1 ) R n .
Assume that b^⊤ E(Q̂_n(β)) b ≥ C_min for all b ∈ S^{p−1}. Under F_1 and F_2, (A7) shows that b^⊤ Q̂_n(β) b ≥ b^⊤ E(Q̂_n(β)) b − C_min/2 ≥ C_min/2. Then
P { λ min ( Q ^ n ( β ) ) C min 2 } = P b E ( Q ^ n ( β ) ) b C min 2 , b S p 1 P b E ( Q ^ n ( β ) ) b C min 2 , b S p 1 , F max + P ( F max c ) P { F 1 , F max } + P { F 2 , F max } + P ( F R c ( n ) ) 2 p 2 exp 1 4 n t 2 M X 4 log 4 / θ ( 2 n p δ ) M B X 2 n t M X 2 log 2 / θ ( 2 n p δ ) M B X
+ 2 p 2 exp t θ / 2 [ 4 e C ( θ / 2 ) b 2 L n ( θ / 2 , b ) ] θ / 2 t 2 16 e 2 C 2 ( θ / 2 ) b 2 2 + P ( F max c ) .
Then, conditioning on F_1 ∩ F_2, we have
δ_n(β) := (3/2)‖[Q̂_n(β)]^{-1} Ẑ_n(β)‖_2 ≤ (3/C_min)‖Ẑ_n(β)‖_2.
By k/(k + e^{X_i^⊤β}) ≤ 1, Corollary 2 implies, for any 1 ≤ k ≤ p,
P [ | p n i = 1 n k ( Y i e X i β ) X i k k + e X i β | > 2 2 t p n i = 1 n X i k 2 Y i E Y i ψ 1 2 1 / 2 + 2 t p n max 1 i n | X i k | Y i E Y i ψ 1 ] 2 e t .
Let
λ 1 n ( t , X ) : = 2 2 t p n max 1 k n i = 1 n X i k 2 Y i E Y i ψ 1 2 1 / 2 + 2 t p n max 1 i n , 1 k p ( | X i k | Y i E Y i ψ 1 ) .
Under the event F_max, we bound max_{1≤i≤n, 1≤k≤p} |X_{ik}| ≤ M_X log^{1/θ}(2np/δ) and max_{1≤k≤p} (1/n)∑_{i=1}^n X_{ik}² ≤ M_X² log^{2/θ}(2np/δ). Note that M_{BX} = M_X + B log 2; then (C.1) and (C.2) give
λ 1 n ( t , X ) 2 2 t p M B X 2 max 1 k p 1 n i = 1 n X i k 2 1 / 2 + 2 t p n max 1 i n , 1 k p | X i k | M B X 2 M B X M X ( 2 t p + t p / n ) log 1 / θ ( 2 n p / δ ) = : λ n ( t ) .
So, P | p n i = 1 n k ( Y i e X i β ) X i k k + e X i β | > λ n ( t ) 2 e t , k = 1 , 2 , , p . Thus (A10) shows
P { n Z ^ n ( β ) 2 > λ 1 n ( t ) } P { n Z ^ n ( β ) 2 > λ 1 n ( t ) , F max } + P ( F max c ) P ( k = 1 p { 1 n i = 1 n k ( Y i e X i β ) X i k k + e X i β > λ 1 n ( t ) p } ) + P ( F max c ) 2 p e t + P ( F max c ) = δ + ε n ,
where t := log(2p/δ). Then ‖β̂_n − β‖_2 ≤ δ_n(β) ≤ (3/C_min)‖Ẑ_n(β)‖_2 ≤ 3λ_{1n}(t)/(C_min √n) via Lemma 4. Under F_1 ∩ F_2 ∩ F_max, we obtain
β ^ n β 2 6 M B X M X C min 2 p n log 2 p δ + 1 n p log 2 p δ log 1 / θ 2 n p δ .
Furthermore, under F_1 ∩ F_2 ∩ F_max, this yields the sample-size condition (11) on n. □
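To connect the bound with practice, the following Python sketch is purely illustrative: the Gaussian covariates, the dispersion k, the true coefficient vector, and the plain Newton solver are our own assumptions, not part of the theorem. It fits a negative binomial regression with known dispersion by solving (up to scaling) the score equation underlying Ẑ_n(β), with weights matching the matrix Q̂_n(β) of the proof, and compares ‖β̂_n − β‖_2 with the leading factor √((p/n) log(2p/δ)) of the bound.

import numpy as np

rng = np.random.default_rng(1)
n, p, k = 2000, 10, 4.0
beta_true = np.concatenate([np.array([0.5, -0.5, 0.25]), np.zeros(p - 3)])

X = rng.normal(size=(n, p)) / np.sqrt(p)      # sub-Weibull (Gaussian) covariates
mu = np.exp(X @ beta_true)
Y = rng.negative_binomial(k, k / (k + mu)).astype(float)   # NB responses with mean mu

beta = np.zeros(p)
for _ in range(50):                            # Newton iterations for the MLE
    eta = np.exp(X @ beta)
    score = X.T @ (k * (Y - eta) / (k + eta))  # score of the NB log-likelihood
    W = (Y + k) * k * eta / (k + eta) ** 2     # observed-information weights
    step = np.linalg.solve(X.T @ (W[:, None] * X), score)
    beta = beta + step
    if np.linalg.norm(step) < 1e-10:
        break

err = np.linalg.norm(beta - beta_true)
rate = np.sqrt(p / n * np.log(2 * p / 0.05))   # leading factor of the bound (delta = 0.05)
print(f"||beta_hat - beta||_2 = {err:.4f},  sqrt((p/n) log(2p/delta)) = {rate:.4f}")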

Appendix A.10

Proof of Theorem 4.
For convenience, the proof is divided into three steps.
Step 1. We adopt the following lemma.
Lemma A1 (Computing the spectral norm on a net, Lemma 5.4 in [1])
Let B be a symmetric p × p matrix, and let N_ε be an ε-net of S^{p−1} for some ε ∈ [0, 1). Then
‖B‖ = max_{‖x‖_2 = 1} ‖Bx‖_2 = sup_{x ∈ S^{p−1}} |⟨Bx, x⟩| ≤ (1 − 2ε)^{-1} sup_{x ∈ N_ε} |⟨Bx, x⟩|.
We first show that ‖(1/n)A^⊤A − I_p‖ ≤ 2 max_{x ∈ N_{1/4}} |(1/n)‖Ax‖_2² − 1|. Indeed, note that ⟨((1/n)A^⊤A − I_p)x, x⟩ = (1/n)⟨A^⊤Ax, x⟩ − 1 = (1/n)‖Ax‖_2² − 1 for x ∈ S^{p−1}. By setting ε = 1/4 in Lemma A1, we obtain
‖(1/n)A^⊤A − I_p‖ ≤ (1 − 2ε)^{-1} sup_{x ∈ N_ε} |⟨((1/n)A^⊤A − I_p)x, x⟩| = 2 max_{x ∈ N_{1/4}} |(1/n)‖Ax‖_2² − 1|.
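As a small numerical illustration of Lemma A1 (not used in the proof), the Python sketch below builds an explicit 1/4-net of the unit circle and checks that twice the maximal quadratic form over the net dominates the spectral norm of a symmetric 2 × 2 matrix; the matrix and the dimension are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(2, 2))
B = (M + M.T) / 2                                  # symmetric test matrix

# Angular spacing h with chord length 2*sin(h/2) <= 1/4 gives a 1/4-net of the circle.
h = 2 * np.arcsin(1 / 8)
angles = np.arange(0.0, 2 * np.pi, h)
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)

quad_form = np.abs(np.einsum('ij,jk,ik->i', net, B, net))   # |<Bx, x>| over the net
print(f"||B|| = {np.linalg.norm(B, 2):.4f} <= 2 * max_net |<Bx,x>| = {2 * quad_form.max():.4f}")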
Step 2. Let Z_i := |⟨A_i, x⟩| for a fixed x ∈ S^{p−1}. Observe that ‖Ax‖_2² = ∑_{i=1}^n |⟨A_i, x⟩|² = ∑_{i=1}^n Z_i². The {Z_i}_{i=1}^n are independent subW(θ) with E Z_i² = 1 and max_{1≤i≤n} ‖Z_i‖_{ψ_θ} = K. Then, by Corollary 4, the Z_i² are independent subW(θ/2) r.v.s with max_{1≤i≤n} ‖Z_i²‖_{ψ_{θ/2}} = K². The norm triangle inequality (Lemma A.3 in [9]) gives
max 1 i n Z i 2 1 ψ θ / 2 K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K .
where K_α := 2^{1/α} if α ∈ (0, 1) and K_α = 1 if α ≥ 1.
Denote b X : = 1 n ( Z 1 2 1 ψ θ / 2 , , Z n 2 1 ψ θ / 2 ) in Theorem 1. With (A11), we have
b X 2 = n 1 i = 1 n Z i 2 1 ψ θ / 2 2 K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K n
and b K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K n .
For β : = θ θ 1 > 1 , we obtain
b X β = n 1 { i = 1 n Z i 2 1 ψ θ / 2 β } 1 / β n β 1 1 [ K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K ] = n θ 1 K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K .
Let L_n(θ/2, b_X) be the constant defined in Theorem 1(a). Then,
b X 2 L n ( θ / 2 , b X ) = γ 4 / θ A ( θ / 2 ) b , θ 2 B ( θ / 2 ) b β , θ > 2 . K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ] K γ 4 / θ A ( θ / 2 ) / n , θ 2 B ( θ / 2 ) / n 1 / θ , θ > 2 .
Hence
2 e C ( θ / 2 ) { b X 2 t + b 2 L n ( θ / 2 , b X ) t 2 / θ } 2 e K C ( θ / 2 ) K θ / 2 [ 1 + ( [ ( e θ / 2 ) θ / 2 ] log 2 ) θ / 2 ) t n + A ( θ / 2 ) ( γ 2 t ) 2 / θ / n , θ 2 B ( θ / 2 ) ( γ 2 t ) 2 / θ / n 1 / θ , θ > 2 = : H ( t , n ; θ ) .
Therefore, P((1/n)|∑_{i=1}^n (Z_i² − 1)| ≥ H(t, n; θ)) ≤ 2e^{-t}. Let t = cp + s² for a constant c; then
P(|(1/n)‖Ax‖_2² − 1| ≥ H(cp + s², n; θ)) ≤ 2e^{-(cp + s²)}.
Step 3. We use the following lemma on covering numbers from [1].
Lemma A2 (Covering numbers of the sphere).
For the unit Euclidean sphere S^{n−1}, the covering number N(S^{n−1}, ε) satisfies N(S^{n−1}, ε) ≤ (1 + 2/ε)^n for every ε > 0.
We now show the concentration of ‖(1/n)A^⊤A − I_p‖; (12) then follows from the definition of the largest and smallest eigenvalues. The conclusion is drawn from Steps 1 and 2:
P 1 n A A I p H ( c p + s 2 , n ; θ ) P 2 max x N 1 / 4 | 1 n A x 2 2 1 | H ( c p + s 2 , n ; θ ) N ( S n 1 , 1 / 4 ) P | 1 n A x 2 2 1 | H ( c p + s 2 , n ; θ ) / 2 2 · 9 n e ( c p + s 2 ) ,
where the last inequality follows from Lemma A2 with ε = 1/4. When c ≥ (n log 9)/p, we have 2·9^n e^{-(cp + s²)} ≤ 2e^{-s²}, and (12) is proved.
Moreover, note that
max | | x | | 2 = 1 | 1 n A x 2 2 1 | = max | | x | | 2 = 1 ( 1 n A A I p ) x 2 2 = 1 n A A I p 2 H 2 ( c p + s 2 , n ; θ ) .
implies that
1 H 2 ( c p + s 2 , n ; θ ) 1 n λ m a x ( A ) 1 + H 2 ( c p + s 2 , n ; θ ) .
Similarly, for the minimal eigenvalue, we have
min | | x | | 2 = 1 | 1 n A x 2 2 1 | = min | | x | | 2 = 1 ( 1 n A A I p ) x 2 2 = 1 n A A I p 2 H 2 ( c p + s 2 , n ; θ ) .
This implies 1 H 2 ( c p + s 2 , n ; θ ) 1 n λ m i n ( A ) 1 + H 2 ( c p + s 2 , n ; θ ) . So we obtain that the two events satisfy
1 n A A I p 2 H 2 ( c p + s 2 , n ; θ ) 1 H 2 ( c p + s 2 , n ; θ ) 1 n λ m i n ( A ) 1 n λ m a x ( A ) 1 + H 2 ( c p + s 2 , n ; θ )
Then we obtain the second conclusion in this theorem. □
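A direct simulation of this non-asymptotic Bai-Yin behaviour is given below; the signed-Weibull entry distribution and the (n, p) values are illustrative assumptions. The Python sketch standardizes the entries exactly by their second moment Γ(1 + 2/θ) and reports how the extreme eigenvalues of (1/n)A^⊤A approach 1 as p/n → 0.

import numpy as np
from math import gamma, sqrt

rng = np.random.default_rng(2)
theta, p = 0.5, 50
sd = sqrt(gamma(1.0 + 2.0 / theta))      # std of a signed Weibull(theta) entry

def sub_weibull_matrix(n):
    """n x p matrix of i.i.d. mean-zero, unit-variance sub-Weibull(theta) entries."""
    signs = rng.choice([-1.0, 1.0], size=(n, p))
    return signs * rng.weibull(theta, size=(n, p)) / sd

for n in [200, 1000, 5000, 20000]:
    A = sub_weibull_matrix(n)
    eig = np.linalg.eigvalsh(A.T @ A / n)
    lo, hi = (1 - sqrt(p / n)) ** 2, (1 + sqrt(p / n)) ** 2
    print(f"n = {n:6d}  lambda_min = {eig[0]:.3f}  lambda_max = {eig[-1]:.3f}"
          f"  Bai-Yin limits ({lo:.3f}, {hi:.3f})")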

Appendix A.11

Proof of Theorem 5.
By independence and (13),
E e ± n α n Z ^ α n ( θ ) = i = 1 n E exp { ± φ c α n X i θ } i = 1 n E [ 1 ± α n X i θ + c ( α n ( X i θ ) ) ] i = 1 n exp { ± α n E X i θ + E [ c ( α n ( X i θ ) ) ] } = exp ± α n i = 1 n E X i θ + i = 1 n E [ c ( α n ( X i θ ) ) ] .
For convenience, let
B_n^+(θ) = μ_n − θ + (1/(nα_n))∑_{i=1}^n E[c(α_n(X_i − θ))] + log(δ^{-1})/(nα_n)
and B_n^-(θ) = μ_n − θ − (1/(nα_n))∑_{i=1}^n E[c(α_n(X_i − θ))] − log(δ^{-1})/(nα_n). Therefore, Equation (A12) and Markov's inequality show
P(Ẑ_{α_n}(θ) ≥ B_n^+(θ)) = P(e^{nα_n Ẑ_{α_n}(θ)} ≥ e^{nα_n B_n^+(θ)}) ≤ E e^{nα_n Ẑ_{α_n}(θ)} / e^{nα_n B_n^+(θ)} ≤ e^{nα_n B_n^+(θ) − log(δ^{-1})} / e^{nα_n B_n^+(θ)} = δ
and P(Ẑ_{α_n}(θ) ≤ B_n^-(θ)) = P(e^{-nα_n Ẑ_{α_n}(θ)} ≥ e^{-nα_n B_n^-(θ)}) ≤ E e^{-nα_n Ẑ_{α_n}(θ)} / e^{-nα_n B_n^-(θ)} ≤ e^{-nα_n B_n^-(θ) − log(δ^{-1})} / e^{-nα_n B_n^-(θ)} = δ. These two inequalities yield P(B_n^-(θ) < Ẑ_{α_n}(θ) < B_n^+(θ)) ≥ 1 − P(Ẑ_{α_n}(θ) ≤ B_n^-(θ)) − P(Ẑ_{α_n}(θ) ≥ B_n^+(θ)) ≥ 1 − 2δ.
Since ∂Ẑ_{α_n}(θ)/∂θ = −(1/n)∑_{i=1}^n φ̇_c(α_n(X_i − θ)) < 0, the map θ ↦ Ẑ_{α_n}(θ) is non-increasing. If θ = μ_n, we have B_n^+(μ_n) > 0 from (A13). As n is sufficiently large and α_n → 0, the term (1/(nα_n))∑_{i=1}^n E[c(α_n(X_i − θ))] ≤ (f(α_n)/α_n)·(1/n)∑_{i=1}^n E[c(X_i − θ)] = (f(α_n)/α_n)·O(1) in B_n^+(θ) converges to 0 by (C.1.2) and (C.1.3). Then, there must be a constant d_n(c) > 0 such that B_n^+(μ_n + d_n(c)) < 0. So, under (16), B_n^+(θ) = 0 has a solution, and we denote the smallest solution by θ_+ ∈ (μ_n, μ_n + d_n(c)). Similarly, for B_n^-(θ), we have B_n^-(μ_n) < 0. Condition (16) implies B_n^-(μ_n − d_n(c)) > 0, so B_n^-(θ) = 0 has a solution, and we denote the largest solution by θ_- ∈ (μ_n − d_n(c), μ_n). Since Ẑ_{α_n}(θ) is continuous and non-increasing, the estimating equation Ẑ_{α_n}(θ) = 0 has a solution θ̂_{α_n} ∈ [θ_-, θ_+] with probability at least 1 − 2δ. Recall that
B_n^+(θ_+) = μ_n − θ_+ + (1/(nα_n))∑_{i=1}^n E[c(α_n(X_i − θ_+))] + log(δ^{-1})/(nα_n) = 0
has the smallest solution θ + ( μ n , μ n + d n ( c ) ) under the condition (16). We have
μ n θ ^ α n μ n θ + = 1 n α n i = 1 n E [ c α n X i α n θ + ] log ( δ 1 ) n α n = 1 n α n i = 1 n E [ c α n ( X i μ n ) + α n ( μ n θ + ) ] log ( δ 1 ) n α n By ( C . 1.1 ) ] c 2 n α n i = 1 n E [ c α n X i α n μ n ] c 2 α n · c α n ( μ n θ + ) log ( δ 1 ) n α n
which implies
μ n θ + + c 2 α n · c α n ( μ n θ + ) c 2 n α n i = 1 n E [ c α n ( X i μ n ) ] + log ( δ 1 ) n α n .
Put (c_2/(nα_n))∑_{i=1}^n E[c(α_n(X_i − μ_n))] = log(δ^{-1})/(nα_n), i.e., ∑_{i=1}^n c_2 E[c(α_n(X_i − μ_n))] = log(δ^{-1}). The scaling assumption c(tx) ≤ f(t)c(x) gives
f(α_n) c_2 ∑_{i=1}^n E[c(X_i − μ_n)] ≥ c_2 ∑_{i=1}^n E[c(α_n(X_i − μ_n))] = log(δ^{-1})
and thus α_n ≥ f^{-1}(log(δ^{-1})/(c_2 ∑_{i=1}^n E[c(X_i − μ_n)])). Let g_{α_n}(t) = t + (c_2/α_n)c(α_n t). Moreover, Equation (A16) and this choice of α_n yield
g α n ( μ n θ + ) = μ n θ + + c 2 α n c α n ( μ n θ + ) 2 log ( δ 1 ) n α n .
Solving the above inequality for μ_n − θ_+, we obtain
μ n θ ^ α n μ n θ + g α n 1 2 log ( δ 1 ) n α n .
Similarly, for θ_-, one has μ_n − θ̂_{α_n} ≤ μ_n − θ_- ≤ g_{α_n}^{-1}(2 log(δ^{-1})/(nα_n)). Hence, (17) holds with probability at least 1 − 2δ. □
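For readers who want to experiment with such log-truncated estimators, the following Python sketch implements one concrete instance: it uses the classical Catoni influence function φ(x) = sign(x) log(1 + |x| + x²/2) (i.e., c(x) = x²/2), heavy-tailed Pareto data with a finite mean but infinite variance, and a simple tuning of α_n. All of these choices are illustrative assumptions rather than the exact φ_c, c(·), and α_n of Theorem 5.

import numpy as np

def phi(x):
    """Catoni-type influence function: sign(x) * log(1 + |x| + x^2/2)."""
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x ** 2)

def log_truncated_mean(X, alpha, tol=1e-10):
    """Solve sum_i phi(alpha * (X_i - theta)) = 0 by bisection (Z is non-increasing)."""
    lo, hi = X.min(), X.max()
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if phi(alpha * (X - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

rng = np.random.default_rng(3)
n, delta = 5000, 0.01
X = rng.pareto(1.5, size=n) + 1.0            # heavy-tailed: mean 3, infinite variance
alpha = np.sqrt(2 * np.log(1 / delta) / n)   # illustrative tuning of alpha_n

print(f"true mean = 3.000, sample mean = {X.mean():.3f}, "
      f"log-truncated estimate = {log_truncated_mean(X, alpha):.3f}")

On such data the sample mean fluctuates heavily across runs, while the truncated Z-estimator remains close to the true mean, which is the qualitative content of the confidence bound (17).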

References

1. Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science; Cambridge University Press: Cambridge, UK, 2018; Volume 47.
2. Bai, Z.; Silverstein, J.W. Spectral Analysis of Large Dimensional Random Matrices; Springer: New York, NY, USA, 2010; Volume 20.
3. Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48.
4. Zhang, H.; Chen, S.X. Concentration Inequalities for Statistical Inference. Commun. Math. Res. 2021, 37, 1–85.
5. Tropp, J.A. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 2015, 8, 1–230.
6. Kuchibhotla, A.K.; Chakrabortty, A. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Inf. Inference J. IMA 2022, ahead of print.
7. Hao, B.; Abbasi-Yadkori, Y.; Wen, Z.; Cheng, G. Bootstrapping Upper Confidence Bound. Adv. Neural Inf. Process. Syst. 2019, 32.
8. Gbur, E.E.; Collins, R.A. Estimation of the Moment Generating Function. Commun. Stat. Simul. Comput. 1989, 18, 1113–1134.
9. Götze, F.; Sambale, H.; Sinulis, A. Concentration inequalities for polynomials in α-sub-exponential random variables. Electron. J. Probab. 2021, 26, 1–22.
10. Li, S.; Wei, H.; Lei, X. Heterogeneous Overdispersed Count Data Regressions via Double-Penalized Estimations. Mathematics 2022, 10, 1700.
11. Rigollet, P.; Hütter, J.C. High Dimensional Statistics. Lecture Notes, 2019. Available online: http://www-math.mit.edu/rigollet/PDFs/RigNotes17.pdf (accessed on 20 April 2022).
12. Foss, S.; Korshunov, D.; Zachary, S. An Introduction to Heavy-Tailed and Subexponential Distributions; Springer: New York, NY, USA, 2011.
13. De la Pena, V.; Gine, E. Decoupling: From Dependence to Independence; Springer: Berlin/Heidelberg, Germany, 2012.
14. Latala, R. Estimation of moments of sums of independent real random variables. Ann. Probab. 1997, 25, 1502–1513.
15. Kashlak, A.B. Measuring distributional asymmetry with Wasserstein distance and Rademacher symmetrization. Electron. J. Stat. 2018, 12, 2091–2113.
16. Vladimirova, M.; Girard, S.; Nguyen, H.; Arbel, J. Sub-Weibull distributions: Generalizing sub-Gaussian and sub-exponential properties to heavier tailed distributions. Stat 2020, 9, e318.
17. Wong, K.C.; Li, Z.; Tewari, A. Lasso guarantees for β-mixing heavy-tailed time series. Ann. Stat. 2020, 48, 1124–1142.
18. Portnoy, S. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Stat. 1988, 16, 356–366.
19. Kuchibhotla, A.K. Deterministic inequalities for smooth M-estimators. arXiv 2018, arXiv:1809.05172.
20. Zhang, H.; Jia, J. Elastic-net regularized high-dimensional negative binomial regression: Consistency and weak signals detection. Stat. Sin. 2022, 32, 181–207.
21. Bai, Z.D.; Yin, Y.Q. Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. In Advances in Statistics; World Scientific: Singapore, 1993; pp. 1275–1294.
22. Chen, P.; Jin, X.; Li, X.; Xu, L. A generalized Catoni's M-estimator under finite α-th moment assumption with α ∈ (1,2). Electron. J. Stat. 2021, 15, 5523–5544.
23. Xu, L.; Yao, F.; Yao, Q.; Zhang, H. Non-Asymptotic Guarantees for Robust Statistical Learning under (1+ε)-th Moment Assumption. arXiv 2022, arXiv:2201.03182.
24. Lerasle, M. Lecture notes: Selected topics on robust statistical learning theory. arXiv 2019, arXiv:1908.10761.
25. Devroye, L.; Lerasle, M.; Lugosi, G.; Oliveira, R.I. Sub-Gaussian mean estimators. Ann. Stat. 2016, 44, 2695–2725.
26. Zhang, A.R.; Zhou, Y. On the non-asymptotic and sharp lower tail bounds of random variables. Stat 2020, 9, e314.
27. Zhang, H.; Tan, K.; Li, B. COM-negative binomial distribution: Modeling overdispersion and ultrahigh zero-inflated count data. Front. Math. China 2018, 13, 967–998.
28. Zajkowski, K. On norms in some class of exponential type Orlicz spaces of random variables. Positivity 2019, 24, 1231–1240.
29. Jameson, G.J. A simple proof of Stirling's formula for the gamma function. Math. Gaz. 2015, 99, 68–74.
30. Alzer, H. On some inequalities for the gamma and psi functions. Math. Comput. 1997, 66, 373–389.
Figure 1. Standard Gaussian.
Figure 2. Centralized Bernoulli.
Figure 3. Uniform on [−1, 1].
Figure 4. Centralized negative binomial.
Figure 5. Centralized Poisson.