Article

Sharper Concentration Inequalities for Median-of-Mean Processes

1
School of Mathematics, Harbin Institute of Technology, Harbin 150001, China
2
Department of Statistics and Data Science, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077, Singapore
3
School of Statistics, Renmin University of China, Beijing 100872, China
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2023, 11(17), 3730; https://doi.org/10.3390/math11173730
Submission received: 30 July 2023 / Revised: 29 August 2023 / Accepted: 29 August 2023 / Published: 30 August 2023
(This article belongs to the Special Issue New Advances in High-Dimensional and Non-asymptotic Statistics)

Abstract:
The Median-of-Mean (MoM) estimator is an efficient statistical method for handling contaminated data. In this paper, we propose a variance-dependent MoM estimation method based on the tail probability of a binomial distribution. Under mild conditions, the resulting bound is sharper than the one obtained from the classical Hoeffding approach. The method is then used to study the concentration of variance-dependent MoM empirical processes and of the sub-Gaussian intrinsic moment norm. Finally, we give a bound for the variance-dependent MoM estimator with distribution-free contaminated data.

1. Introduction

Nowadays, information processing must handle huge and varied volumes of data. With the rapid expansion of data volume, traditional centralized data processing has gradually become unable to meet current needs, which motivates distributing processing across the computers on a network.
When dealing with large amounts of data, contaminated observations, which we generally call outliers, inevitably arise. Outliers lower the accuracy, or raise the sensitivity, of data processing tasks. Naturally, inferring probability density functions from contaminated samples is an important problem. Correspondingly, when a dataset contains no outliers, we call it sane.
The Median-of-Mean (MoM) method is an effective way to deal with contaminated data, which divides the original data into several blocks, calculates the mean for each block, and then takes the median of these means. The literature on MoM methods can be traced back to Ref. [1]. In recent years, MoM methods have been widely used in the field of machine learning. For example, Ref. [2] used the MoM method to design estimators for kernel mean embedding and maximum mean discrepancy with excessive resistance properties to outliers; Ref. [3] applied the MoM method to achieve the optimal trade-off between accuracy and confidence under minimal assumptions in the classical statistical learning/regression problem; Ref. [4] introduced an MoM method for robust machine learning without deteriorating the estimation properties of a given estimator which is also easily computable in practice; Ref. [5] introduced a robust nonparametric density estimator combining the popular Kernel Density Estimation method and the Median-of-Means principle.
When using MoM methods to deal with contaminated data, the data often do not exhibit clear normal-distribution characteristics but rather the more general sub-Gaussian property; thus, non-asymptotic techniques are needed. Non-asymptotic inference is particularly advantageous in the finite-sample setting. Especially in the field of machine learning, non-asymptotic inference can establish rigorous error bounds for the learning procedure of interest (see Refs. [6,7,8]). In practice, the exact distribution of the data is often unknown, which calls for the study of broader distribution classes such as sub-Gaussian, sub-exponential, heavy-tailed, and bounded distributions. For example, Ref. [9] studied the non-asymptotic concentration of heteroskedastic Wishart-type matrices; Ref. [10] constructed sub-Gaussian estimators of a mean vector under adversarial contamination and heavy-tailed data by Median-of-Mean versions of the Stahel–Donoho outlyingness and of the Median Absolute Deviation functions; Ref. [11] obtained deconvolution results for some singular density errors via a combinatorial Median-of-Mean approach and assessed estimator quality by establishing non-asymptotic risk bounds.
To obtain a clear picture of robust estimation from a non-asymptotic viewpoint, we mainly study variance-dependent MoM methods based on binomial tail probabilities, covering both the uncontaminated and the contaminated case. The paper proceeds as follows. We first provide a variance-dependent MoM bias inequality by using bounds on binomial tails with unbounded samples, whose bias bound is tighter than the classical Hoeffding bound (see Section 2). Then, by the variance-dependent MoM inequality, we obtain a generalization bound via entropic complexity (see Section 3.1) and a non-asymptotic property of the sub-Gaussian intrinsic moment norm (see Section 3.2). Finally, the variance-dependent MoM inequality with contaminated data is presented in Section 4.

2. Variance-Dependent Median-of-Mean Estimator without Outliers

The MoM method was originally introduced on page 242 of Ref. [1]; it improves on the empirical mean for heavy-tailed distributions while inheriting its efficiency for light-tailed distributions. The MoM estimator is derived as follows.
Without loss of generality, suppose that the sample data $X_1, X_2, \ldots, X_n$ are decomposed into $K$ blocks, with each block including $B$ observations, that is to say, $n = KB$. We first compute the mean of each block, which yields estimators $\hat{\mu}_1, \ldots, \hat{\mu}_K$, each based on $B$ observations. Then, the MoM estimator is given by the median of these block means, i.e.,
$$\mathrm{MoM}_K[\mu] = \mathrm{median}\left\{\hat{\mu}_1, \ldots, \hat{\mu}_K\right\}.$$
It turns out that, even under the very mild condition $\operatorname{Var}(X) = \sigma^2 < \infty$, the MoM estimator enjoys a nice concentration inequality in the finite-sample case.
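For illustration, the following minimal Python sketch (ours, not part of the original paper) computes the MoM estimator exactly as described above: split the data into $K$ blocks of $B$ observations, average within blocks, and take the median of the block means. The simulated heavy-tailed data, the number of blocks, and the injected outliers are arbitrary choices.

```python
import numpy as np

def mom_estimator(x, K):
    """Median-of-Mean estimate of the mean of x using K equal-sized blocks.

    Assumes len(x) is (at least) K * B with B = len(x) // K, as in the text (n = K * B).
    """
    x = np.asarray(x, dtype=float)
    B = len(x) // K
    # Split the first K*B observations into K blocks of B observations each.
    blocks = x[:K * B].reshape(K, B)
    block_means = blocks.mean(axis=1)   # \hat{mu}_1, ..., \hat{mu}_K
    return np.median(block_means)       # MoM_K[mu]

# Example: heavy-tailed data with a few gross outliers.
rng = np.random.default_rng(0)
sample = rng.standard_t(df=3, size=1200)
sample[:5] += 100.0                      # contaminate a few observations
print("sample mean:", sample.mean())
print("MoM (K=30): ", mom_estimator(sample, K=30))
```

On such contaminated data, the MoM estimate typically stays close to the true mean while the plain sample mean is pulled away by the outliers.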
Given the i.i.d. sample $X_1, X_2, \ldots, X_n$ with mean $\mu_0$ and finite variance $\sigma^2$, using Hoeffding's inequality, Proposition 1 in Ref. [12] produces the following concentration inequality:
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{n t^2}{27\sigma^2}\right),$$
where $t = \sigma\sqrt{(2+\delta)/B}$ — see the detailed description in Remark 1.
When additional conditions are imposed on the distribution under consideration, tighter bounds can be obtained; our result based on binomial tails (Theorem 1) is such an improvement.
In practice, one often needs to split the data into blocks, and the minimum number of samples per block is a natural concern, since it involves a trade-off between efficiency and robustness; from a statistical point of view, the effect of the variance should also be taken into account. The following theorem accounts for the variance when partitioning the data and yields the variance-dependent MoM inequality.
Theorem 1. 
Given the i.i.d. samples $X_1, X_2, \ldots, X_n$ with mean $\mu_0$ and finite variance $\sigma^2$, for $\delta \ge 2/(\sqrt{2\pi}-2)$, there exist $B \in \mathbb{N}$ and $\varepsilon > 0$ such that $B\varepsilon^2 \ge (2+\delta)\sigma^2$. Then, the MoM estimator has the following concentration inequality:
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{0.0976\, n t^2}{\sigma^2}\right),\tag{1}$$
where $t = \sigma\sqrt{(2+\delta)/B}$.
A powerful feature of Theorem 1 is that the $X_i$'s may be unbounded. In addition, finite-sample exponential concentration is not easy to obtain when only the variance is assumed to exist (see Ref. [13]). Moreover, Theorem 1 provides the basis for further deriving the inequality with outliers. In the process of proving the theorem, we use the following lemma.
Lemma 1 
(Theorem 1 of [14]). Suppose $S_n \sim \mathrm{Bin}(n, p)$, $a > p$ with $a, p \in (0,1)$, and $1 \le an \le n-1$. If $an \in \mathbb{N}$, then
$$\mathbb{P}\left(S_n \ge an\right) \le \frac{1}{1-r}\cdot\frac{1}{\sqrt{2\pi a(1-a)n}}\, e^{-n D(a\|p)},$$
where $r = r(a,p) := \frac{p(1-a)}{a(1-p)}$, and $D(a\|p) := a\log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}$ is the KL divergence between Bernoulli distributions with parameters $a$ and $p$. If $an \notin \mathbb{N}$, the bound still holds, but it can be tightened by replacing $a$ with $a^* := \lceil an\rceil/n$.
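As a quick sanity check of Lemma 1 (ours, not from Ref. [14]), the following snippet compares the stated upper bound with the exact binomial tail computed via scipy; the parameters n, p, and a are arbitrary and chosen so that an is an integer.

```python
import numpy as np
from scipy.stats import binom

def lemma1_bound(n, p, a):
    """Upper bound of Lemma 1 for P(S_n >= a*n), assuming a > p and a*n an integer."""
    r = p * (1 - a) / (a * (1 - p))
    D = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))  # KL(Bern(a) || Bern(p))
    return (1 / (1 - r)) * np.exp(-n * D) / np.sqrt(2 * np.pi * a * (1 - a) * n)

n, p, a = 100, 0.3, 0.5                    # a*n = 50 is an integer
exact = binom.sf(a * n - 1, n, p)          # P(S_n >= a*n)
print("exact tail :", exact)
print("Lemma 1 UB :", lemma1_bound(n, p, a))
```

The bound tracks the exact tail closely in this regime, which is why it yields sharp constants when applied to the indicator sums below.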
Now, we give a detailed proof of Theorem 1.
Proof of Theorem 1. 
First, observe that the event
$$\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \varepsilon \quad \text{for } \varepsilon \ge 0$$
implies that at least $K/2$ of the $\hat{\mu}_\ell$ ($\ell = 1, \ldots, K$) must lie at distance more than $\varepsilon$ from $\mu_0$. Namely,
$$\left\{\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \varepsilon\right\} \subseteq \left\{\sum_{\ell=1}^{K}\mathbf{1}\left\{\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right\} \ge \frac{K}{2}\right\} \quad \text{for } \varepsilon \ge 0.$$
Here, $K$ is assumed to be even. When $K$ is odd, the same argument applies with at least $\lceil K/2\rceil$ blocks; for convenience of writing, the proof below only treats the case $K/2$, the case $\lceil K/2\rceil$ being proved in the same way.
Define $Z_\ell = \mathbf{1}\left\{\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right\}$ and let $\tilde{p} := \tilde{p}_{\varepsilon, B} = \mathbb{E} Z_\ell = \mathbb{P}\left(\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right)$. The condition of the theorem and Chebyshev's inequality (see p. 239 in Ref. [15]) imply that there exist $B \in \mathbb{N}$ and $\varepsilon > 0$ such that
$$\tilde{p} := \tilde{p}_{\varepsilon, B} = \mathbb{P}\left(\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right) \le \frac{\sigma^2}{B\varepsilon^2} < \frac{1}{2}.\tag{2}$$
In fact, the detailed derivation is as follows:
$$\mathbb{P}\left(\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right) \le \frac{\operatorname{Var}(\hat{\mu}_\ell)}{\varepsilon^2} = \frac{\operatorname{Var}\!\left(\frac{X_{\ell 1} + \cdots + X_{\ell B}}{B}\right)}{\varepsilon^2} = \frac{\frac{1}{B^2}\operatorname{Var}\!\left(\sum_{i=1}^{B} X_{\ell i}\right)}{\varepsilon^2} = \frac{\frac{1}{B^2}\sum_{i=1}^{B}\operatorname{Var}(X_{\ell i})}{\varepsilon^2} = \frac{\frac{1}{B^2}\, B\sigma^2}{\varepsilon^2} = \frac{\sigma^2}{B\varepsilon^2}.$$
The random variables $Z_\ell \sim \mathrm{Bernoulli}(\tilde{p})$ are i.i.d. because the samples $X_1, X_2, \ldots, X_n$ are i.i.d. Applying Lemma 1 (with $a = 1/2$, $n = K$, and $p = \tilde{p}$) to their sum gives
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \varepsilon\right) \le \mathbb{P}\left(\sum_{\ell=1}^{K} Z_\ell \ge \frac{K}{2}\right) \le \frac{1-\tilde{p}}{1-2\tilde{p}}\sqrt{\frac{2}{\pi K}}\, e^{-K D\left(\frac{1}{2}\,\|\,\tilde{p}\right)},$$
where $D\left(\frac{1}{2}\,\|\,\tilde{p}\right) = \frac{1}{2}\log\frac{1}{4\tilde{p}(1-\tilde{p})}$.
Setting $B \ge (2+\delta)\sigma^2/\varepsilon^2 > 2\sigma^2/\varepsilon^2$ for $\delta > 0$ satisfies Equation (2); then,
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \sigma\sqrt{\frac{(2+\delta)K}{n}}\right) \le \frac{1-\tilde{p}}{1-2\tilde{p}}\sqrt{\frac{2}{\pi K}}\, e^{-K D\left(\frac{1}{2}\,\|\,\tilde{p}\right)} = \left(1 + \frac{\tilde{p}}{1-2\tilde{p}}\right)\sqrt{\frac{2}{\pi K}}\, e^{-K D\left(\frac{1}{2}\,\|\,\tilde{p}\right)} \le \frac{\delta+1}{\delta}\sqrt{\frac{2}{\pi K}}\left(1 + \frac{\delta^2}{4+4\delta}\right)^{-K/2}.$$
When $K = 1$, we set $\delta \ge 2/(\sqrt{2\pi}-2) \approx 3.95$ so that $\frac{\delta+1}{\delta}\sqrt{\frac{2}{\pi K}} \le \frac{\delta+1}{\delta}\sqrt{\frac{2}{\pi}} \le 1$ ($K = 1, \ldots, n$). Then, it follows that
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \sigma\sqrt{\frac{(2+\delta)K}{n}}\right) \le \left(1 + \frac{\delta^2}{4+4\delta}\right)^{-K/2}$$
for $1 \le K \le n$ and $\delta \ge 2/(\sqrt{2\pi}-2)$.
Now, taking $t := \sigma\sqrt{(2+\delta)K/n}$ gives
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{n t^2}{2(2+\delta)\sigma^2}\ln\left(1 + \frac{\delta^2}{4+4\delta}\right)\right).$$
The function $g(\delta) = \frac{1}{2+\delta}\ln\left(1 + \frac{\delta^2}{4+4\delta}\right)$ ($\delta \ge 2/(\sqrt{2\pi}-2)$) is a monotonically decreasing function, so its maximum is $g\!\left(2/(\sqrt{2\pi}-2)\right) \approx 0.0976$.
This then leads to the final result:
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{0.0976\, n t^2}{\sigma^2}\right). \qquad \square$$
Remark 1. 
The classical result via Hoeffding's inequality (see Proposition 1 in Ref. [12]) states that
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \sigma\sqrt{\frac{(2+\delta)K}{n}}\right) \le e^{-\frac{K\delta^2}{2(2+\delta)^2}}.$$
Similarly, to obtain a sharp constant, one can take $t := \sigma\sqrt{(2+\delta)K/n}$; then,
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{n t^2 \delta^2}{2\sigma^2(2+\delta)^3}\right),$$
and the function
$$g(\delta) = \frac{\delta^2}{(2+\delta)^3}$$
achieves its unique maximum at $\delta = 4$ with $g(4) = 2/27$. It follows that
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{n t^2}{27\sigma^2}\right).$$
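Since $0.0976 > 1/27 \approx 0.037$, the exponent in Theorem 1 is uniformly larger in magnitude than the Hoeffding-based exponent above. The short script below (ours) evaluates both bounds for a few values of the ratio $nt^2/\sigma^2$ to make the comparison concrete.

```python
import numpy as np

# Exponential bounds of Theorem 1 and of the Hoeffding-based Remark 1,
# as functions of the ratio u = n * t^2 / sigma^2.
u = np.array([10.0, 20.0, 50.0])
theorem1 = np.exp(-0.0976 * u)     # sharper constant 0.0976
hoeffding = np.exp(-u / 27.0)      # classical constant 1/27 ~ 0.037
for ui, b1, b2 in zip(u, theorem1, hoeffding):
    print(f"n t^2 / sigma^2 = {ui:5.1f}:  Theorem 1 = {b1:.2e},  Hoeffding = {b2:.2e}")
```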
Remark 2. 
The efficient interval of $t$ is an interesting issue. By the construction $t = \sigma\sqrt{(2+\delta)K/n}$, it follows that $\sqrt{(2+\delta)/n} \le t/\sigma \le \sqrt{2+\delta}$ since $1 \le K \le n$.
Remark 3. 
In Theorem 1, substituting $t = \sigma\sqrt{(2+\delta)/B}$ into inequality (1) produces
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \sigma\sqrt{\frac{2+\delta}{B}}\right) \le \exp\left(-\frac{0.0976\, n\, \sigma^2(2+\delta)}{B\sigma^2}\right) = \exp\left(-0.0976\,(2+\delta)K\right).$$
Since $\delta \ge 2/(\sqrt{2\pi}-2) \approx 3.95 > 2$, we have
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > 2\sigma\sqrt{\frac{K}{n}}\right) \le e^{-0.5807 K}.$$
This result is better than the bound $e^{-K/8}$ of level-dependent sub-Gaussian estimators. Of course, our conditions are more stringent (see Proposition 12 in Ref. [16]).

3. Applications

In this section, we use the proposed sharper concentration inequalities for MoM estimators to perform two applications in statistical machine learning.

3.1. Concentration for Supremum of Variance-Dependent MoM Empirical Processes

Let $\psi(x) \in B_L$ with $|\psi(x)| \le M_0 < \infty$, where $B_L$ is a ball of the space of Lipschitz functions and $M_0$ is a constant. Let $P\psi = \mathbb{E}\psi = \int \psi \,\mathrm{d}P$.
To derive the concentration inequality for the supremum of variance-dependent MoM empirical processes, the following auxiliary Lemma 2 is necessary, whose proof is trivial and thus omitted.
Lemma 2. 
$|\operatorname{med}(a) - \operatorname{med}(b)| \le \|a - b\|_\infty$ for $a, b \in B_L$, where $\operatorname{med}(a)$ means the value of the function $a(x)$ at the midpoint of the domain, and similarly for $\operatorname{med}(b)$.
By Lemma 2, for $\phi \in B_L$, we have
$$\left|\mathrm{MoM}_K[\phi] - P\phi\right| \le \left|\mathrm{MoM}_K[\phi] - \mathrm{MoM}_K[\psi]\right| + \left|P(\phi - \psi)\right| + \left|\mathrm{MoM}_K[\psi] - P\psi\right| \le \|\phi - \psi\|_\infty + \|\phi - \psi\|_\infty + \left|\mathrm{MoM}_K[\psi] - P\psi\right| = 2\|\phi - \psi\|_\infty + \left|\mathrm{MoM}_K[\psi] - P\psi\right|.\tag{3}$$
Let $\psi_1, \ldots, \psi_{N(\xi, B_L, \|\cdot\|_\infty)}$ be a $\xi$-covering of $B_L$ w.r.t. $\|\cdot\|_\infty$. It is well known that there exist constants $C_L > 0$ and $r \ge 1$ such that
$$\log N\left(\xi, B_L, \|\cdot\|_\infty\right) \le C_L\left(\frac{1}{\xi}\right)^{r}, \quad \xi > 0,$$
where $N(\xi, B_L, \|\cdot\|_\infty)$ denotes the number of $\|\cdot\|_\infty$-balls of radius $\xi > 0$ needed to cover the class $B_L$, and $C_L$ is a universal constant depending only on $B_L$.
Put $N = N(\xi, B_L, \|\cdot\|_\infty)$ for simplicity. By the definition of $N$, for any $\phi \in B_L$ there exists $i \in \{1, \ldots, N\}$ such that
$$\|\phi - \psi_i\|_\infty \le \xi.$$
Then, (3) becomes
$$\left|\mathrm{MoM}_K[\phi] - P\phi\right| \le 2\xi + \left|\mathrm{MoM}_K[\psi_i] - P\psi_i\right|.$$
Then, by Theorem 1, the union bound over $\{\psi_i\}_{i=1}^{N}$ gives that
$$\mathbb{P}\left(\max_{1\le i\le N}\left|\mathrm{MoM}_K[\psi_i] - P\psi_i\right| \le \sigma\sqrt{\frac{\ln(N/\delta)}{0.0976\, n}}\right) \ge 1 - \delta.$$
Together, (4)–(6) give
$$\mathbb{P}\left(\sup_{\phi \in B_L}\left|\mathrm{MoM}_K[\phi] - P\phi\right| \le 2\xi + \sigma\sqrt{\frac{\ln(N/\delta)}{0.0976\, n}}\right) \ge 1 - \delta.$$
Put $\xi^{r+2} = C_L/N$, i.e., $\xi = \left(C_L/N\right)^{\frac{1}{r+2}}$; then, for $\phi \in B_L$ and $\delta \in (0,1)$, we have
$$\mathbb{P}\left(\sup_{\phi \in B_L}\left|\mathrm{MoM}_K[\phi] - P\phi\right| \le 2\left(\frac{C_L}{N}\right)^{\frac{1}{r+2}} + \sigma\sqrt{\frac{\ln(N/\delta)}{0.0976\, n}}\right) \ge 1 - \delta.$$
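To illustrate the quantity controlled in this subsection, the following simulation sketch (ours) computes $\max_i \left|\mathrm{MoM}_K[\psi_i] - P\psi_i\right|$ over a small, explicitly chosen dictionary of bounded Lipschitz functions that stands in for the covering $\{\psi_i\}$; the functions, sample sizes, and the Monte Carlo approximation of $P\psi_i$ are all illustrative choices, not part of the paper.

```python
import numpy as np

def mom_of_function(x, psi, K):
    """MoM_K[psi]: block means of psi(X_i), followed by their median."""
    vals = psi(x)
    B = len(vals) // K
    return np.median(vals[:K * B].reshape(K, B).mean(axis=1))

rng = np.random.default_rng(1)
x = rng.normal(size=2000)

# A small stand-in for the covering {psi_1, ..., psi_N}: bounded Lipschitz functions.
centers = np.linspace(-2.0, 2.0, 9)
psis = [lambda t, c=c: np.tanh(t - c) for c in centers]

# Approximate P*psi_i by Monte Carlo with a large independent sample.
big = rng.normal(size=200_000)
true_means = [np.tanh(big - c).mean() for c in centers]

K = 40
sup_dev = max(abs(mom_of_function(x, psi, K) - m) for psi, m in zip(psis, true_means))
print("max_i |MoM_K[psi_i] - P psi_i| ~", sup_dev)
```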

3.2. Concentration for Variance-Dependent MoM Intrinsic Moment Norm

A centered random variable $X$ is called sub-Gaussian if
$$\mathbb{E}\, e^{sX} \le e^{s^2\sigma_G^2/2} \quad \text{for } s \in \mathbb{R},$$
where the quantity $\sigma_G > 0$ is called the sub-Gaussian parameter. In non-asymptotic statistics, because collected sub-Gaussian data are often unstable, it is sometimes not possible to use the empirical moment-generating function directly to estimate sub-Gaussian parameters such as the variance-type parameters of sub-Gaussian distributions (see Ref. [17]). This motivates the use of the sub-Gaussian intrinsic moment norm, defined as follows.
Definition 1 
(Intrinsic moment norm, see Definition 2 in Ref. [17]). The sub-Gaussian intrinsic moment norm is defined as
$$\|X\|_G := \max_{k \ge 1}\left[\frac{2^k k!}{(2k)!}\,\mathbb{E} X^{2k}\right]^{1/(2k)} = \max_{k \ge 1}\left[\frac{1}{(2k-1)!!}\,\mathbb{E} X^{2k}\right]^{1/(2k)},$$
where $n!! = \prod_{j=0}^{\lceil n/2\rceil - 1}(n - 2j) = n(n-2)(n-4)\cdots$ for $n \in \mathbb{N}$.
As the amount of computation increases, so does the importance of the distributed MoM approach, with the corresponding intrinsic moment norm estimator defined below.
Definition 2 
(see Equation (7) in Ref. [17]). Let $[K] = \{1, \ldots, K\}$ and let $B_s$ denote the $s$-th block of samples (of size $B$). The MoM estimator for the sub-Gaussian intrinsic moment norm is given by
$$\widehat{\|X\|}_{b,G} := \max_{1\le k\le \kappa_n}\left\{\operatorname*{median}_{s \in [K]}\left[\big((2k-1)!!\big)^{-1}\, P_B^{B_s} X^{2k}\right]\right\}^{1/(2k)},$$
where $P_B^{B_s} X = B^{-1}\sum_{i \in B_s} X_i$ ($s = 1, \ldots, K$), so that $P_B^{B_s} X^{2k} = B^{-1}\sum_{i \in B_s} X_i^{2k}$.
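A direct implementation of the estimator in Definition 2 reads as follows (our sketch); the choices of $K$ and $\kappa_n$, as well as the Gaussian test data, are arbitrary and only serve to exercise the code.

```python
import numpy as np

def double_factorial_odd(k):
    """(2k-1)!! = 1 * 3 * ... * (2k-1)."""
    return np.prod(np.arange(1, 2 * k, 2, dtype=float))

def mom_intrinsic_moment_norm(x, K, kappa_n):
    """MoM estimator of the sub-Gaussian intrinsic moment norm (Definition 2)."""
    x = np.asarray(x, dtype=float)
    B = len(x) // K
    blocks = x[:K * B].reshape(K, B)
    candidates = []
    for k in range(1, kappa_n + 1):
        block_moments = (blocks ** (2 * k)).mean(axis=1)          # P_B^{B_s} X^{2k}
        med = np.median(block_moments / double_factorial_odd(k))  # median over blocks
        candidates.append(med ** (1.0 / (2 * k)))
    return max(candidates)

rng = np.random.default_rng(2)
x = rng.normal(scale=1.5, size=3000)
print("estimated ||X||_G (K=30, kappa_n=5):", mom_intrinsic_moment_norm(x, K=30, kappa_n=5))
```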
Definition 3. 
For any $B \in \mathbb{N}$ and $1 \le k \le \kappa_n$,
$$\bar{g}_{k,B}(\sigma_k) := 1 - \left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{-\frac{1}{2k}}\max_{1\le j\le \kappa_n}\left[-\sqrt{2}\,B^{-\frac{1}{2}}\,\frac{\sigma_j^{j}}{\mathbb{E} X^{2j}} + \frac{\mathbb{E} X^{2j}}{(2j-1)!!}\right]^{\frac{1}{2j}}$$
and $\underline{g}_{k,B}(\sigma_k) := \left[\sqrt{2}\,B^{-1/2}\,\frac{\sigma_k^{k}}{\mathbb{E} X^{2k}} + 1\right]^{1/(2k)} - 1$.
Theorem 2. 
Suppose that, for $\varepsilon > 0$ and $n \in \mathbb{N}$, there exists $B \in \mathbb{N}$ such that $\operatorname{Var}\left(X^{2k}\right) < \frac{\varepsilon B}{2}\,\sigma_k^{k}$, where $\{\sigma_k\}_{k=1}^{\kappa_n}$ is a finite constant sequence. Then, we have
$$\mathbb{P}\left(\|X\|_G \le \left(1 - \max_{1\le k\le \kappa_n}\bar{g}_{k,B}(\sigma_k)\right)^{-1}\widehat{\|X\|}_{b,G}\right) > 1 - \kappa_n\, e^{-0.3904 K}$$
and
$$\mathbb{P}\left(\|X\|_G > \left(1 + \max_{1\le k\le \kappa_n}\underline{g}_{k,B}(\sigma_k)\right)^{-1}\widehat{\|X\|}_{b,G}\right) > 1 - \kappa_n\, e^{-0.3904 K}.$$
Remark 4. 
Let K = n / B ; we then obtain distributed samples that satisfy Theorem 2.
Remark 5. 
The key coefficient in the exponent is $-0.3904 < -0.125$. In fact, the corresponding coefficient in Theorem 3 of Ref. [17] without outliers is $-0.125$, as long as $\eta(\varepsilon) = 1$ is taken. This means that our bound is tighter than the bound in Ref. [17].
Proof of Theorem 2. 
From Definitions 1 and 2, we have
$$\|X\|_G = \max_{1\le k\le \kappa_n}\left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}$$
and
$$\widehat{\|X\|}_{b,G} = \max_{1\le k\le \kappa_n}\left\{\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right]\right\}^{1/(2k)}.$$
Recall that $\underline{g}_{k,B}(\sigma_k)$ and $\bar{g}_{k,B}(\sigma_k)$ are the sequences such that
$$\left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\left(1 - \bar{g}_{k,B}(\sigma_k)\right) = \max_{1\le k\le \kappa_n}\left[-\sqrt{2}\,B^{-1/2}\,\frac{\sigma_k^{k}}{\mathbb{E} X^{2k}} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}$$
and
$$\left[\sqrt{2}\,B^{-1/2}\,\frac{\sigma_k^{k}}{\mathbb{E} X^{2k}} + 1\right]^{1/(2k)} = 1 + \underline{g}_{k,B}(\sigma_k)$$
for any $B \in \mathbb{N}$ and $1 \le k \le \kappa_n$.
For the first inequality of Theorem 2, we have, by (7),
$$\begin{aligned}
&\mathbb{P}\left(\widehat{\|X\|}_{b,G} \le \left(1 - \max_{1\le k\le \kappa_n}\bar{g}_{k,B}(\sigma_k)\right)\|X\|_G\right)\\
&\quad= \mathbb{P}\left(\widehat{\|X\|}_{b,G} \le \max_{1\le k\le \kappa_n}\left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\left(1 - \max_{1\le k\le \kappa_n}\bar{g}_{k,B}(\sigma_k)\right)\right)\\
&\quad\le \mathbb{P}\left(\widehat{\|X\|}_{b,G} \le \max_{1\le k\le \kappa_n}\left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\left(1 - \bar{g}_{k,B}(\sigma_k)\right)\right) \quad [\text{By }(9)]\\
&\quad= \mathbb{P}\left(\widehat{\|X\|}_{b,G} \le \max_{1\le k\le \kappa_n}\left[-\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad\le \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right]^{1/(2k)} \le \left[-\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right] - \frac{\mathbb{E} X^{2k}}{(2k-1)!!} \le -\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2}\right)\\
&\quad= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} - \mathbb{E} X^{2k} \le -\sqrt{2}\,\sigma_k^{k}\, B^{-1/2}\right)\\
&\quad< \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\left|\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} - \mathbb{E} X^{2k}\right| \ge \sqrt{2}\,\sigma_k^{k}\, B^{-1/2}\right) \le \kappa_n\, e^{-0.3904 K},
\end{aligned}$$
where the last inequality is by Theorem 1 and the assumption in Theorem 2.
Let $\underline{g}_B(\sigma) := \max_{1\le k\le \kappa_n}\underline{g}_{k,B}(\sigma_k)$. For the second inequality of Theorem 2, the definition of $\underline{g}_{k,B}(\sigma_k)$ implies
$$\begin{aligned}
&\mathbb{P}\left(\|X\|_G \le \frac{\widehat{\|X\|}_{b,G}}{1 + \underline{g}_B(\sigma)}\right)
= \mathbb{P}\left(\widehat{\|X\|}_{b,G} \ge \max_{1\le k\le \kappa_n}\left[\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad\le \mathbb{P}\left(\max_{1\le k\le \kappa_n}\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right]^{1/(2k)} \ge \max_{1\le k\le \kappa_n}\left[\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad\le \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right]^{1/(2k)} \ge \left[\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right] \ge \frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right)\\
&\quad= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} \ge \sqrt{2}\,\sigma_k^{k}\, B^{-1/2} + \mathbb{E} X^{2k}\right)
= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} - \mathbb{E} X^{2k} \ge \sqrt{2}\,\sigma_k^{k}\, B^{-1/2}\right)\\
&\quad< \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\left|\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} - \mathbb{E} X^{2k}\right| \ge \sqrt{2}\,\sigma_k^{k}\, B^{-1/2}\right) \le \kappa_n\, e^{-0.3904 K},
\end{aligned}$$
where the last inequality is by Theorem 1 and the assumption in Theorem 2. □

4. Concentration for Variance-Dependent MoM with Distribution-Free Outliers

In the field of big data and artificial intelligence, most work involves dealing with abnormal data. Sometimes we cannot find each outlier directly, but we can obtain a rough idea of the total number of outliers. For example, sometimes there may be abnormal economic activities in a certain region, but the specific company or person who is abnormal may not be known for the time being; however, the total number of companies and the total population in the region are still known.
Based on such information, how to accurately estimate the characteristics of a sample containing outliers is an important problem. In this section, we characterize the variance-dependent MoM estimator with outliers in the following theorem.
Theorem 3. 
Suppose that
(H.1) The sample $[n] = \{X_1, X_2, \ldots, X_n\}$ contains $n - n_O$ i.i.d. inliers with finite mean $\mu_0$ and finite variance $\sigma^2$, and $n_O$ outliers, on which no assumption is made.
(H.2) Set $K = K_O + K_S$, where $K_O$ is the number of blocks containing at least one outlier and $K_S$ is the number of sane blocks containing no outlier. For $t > 0$, there exists a function $\eta(\varepsilon_O) \in (1/2, 1)$ such that $K \ge \max\left\{2,\ \frac{1}{2\eta(\varepsilon_O)-1},\ \frac{(2\eta(\varepsilon_O)-1)nt^2}{2\eta(\varepsilon_O)\sigma^2}\right\}$ and $K_S \ge \eta(\varepsilon_O)K$, where $\varepsilon_O := n_O/n$.
Then, for $t > 0$, we have
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| \le t\right) \ge 1 - \exp\left\{\left(\frac{(2\eta(\varepsilon_O)-1)\,n t^2}{2\eta(\varepsilon_O)\sigma^2} - 1\right)\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\right\}.$$
Remark 6. 
Regarding the numbers $n_O$ and $K_O$: when one divides the $n$ samples evenly into $K$ blocks, an extreme case is that the unfavorable blocks are entirely filled with outliers, say $K_O$ blocks, while the favorable blocks contain no outliers, say $K_S$ blocks; one then has $\varepsilon_O = n_O/n = K_O/K$.
Remark 7. 
Regarding the function $\eta(\varepsilon_O)$, we can give a concrete expression showing that such a function exists, for example, $\eta(\varepsilon_O) = (1 + 2\varepsilon_O)/2 \in (1/2, 1)$, where $\varepsilon_O \in (0, 1/2)$. But there is more than one admissible expression, so the non-concrete function $\eta(\varepsilon_O)$ is more appropriate for this theorem.
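To see how the lower bound of Theorem 3 behaves numerically, the sketch below (ours) plugs the concrete choice $\eta(\varepsilon_O) = (1 + 2\varepsilon_O)/2$ from this remark into the bound; the sample size, contamination level, and threshold $t$ are arbitrary illustrative inputs.

```python
import numpy as np

def theorem3_lower_bound(n, n_outliers, t, sigma):
    """Evaluate the lower bound of Theorem 3 on P(|MoM_K[mu] - mu_0| <= t),
    using eta(eps_O) = (1 + 2*eps_O)/2 as suggested in Remark 7."""
    eps_o = n_outliers / n
    eta = (1.0 + 2.0 * eps_o) / 2.0          # in (1/2, 1) for eps_o in (0, 1/2)
    a = (2.0 * eta - 1.0) / (2.0 * eta)      # the ratio (2*eta - 1) / (2*eta) < 1
    factor = (2.0 * eta - 1.0) * n * t ** 2 / (2.0 * eta * sigma ** 2) - 1.0
    return 1.0 - np.exp(factor * a * np.log(a))

# Example: 10,000 observations, 200 of them contaminated, unit variance, t = 0.3.
print(theorem3_lower_bound(n=10_000, n_outliers=200, t=0.3, sigma=1.0))
```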
In fact, there is an adaptive way to generate block number K, but we do not show the specific calculation here; see Ref. [18] for more detail. Now, we give a detailed proof of Theorem 3.
Proof of Theorem 3. 
If, among the sane blocks, the number of blocks whose sample mean is within distance $t$ of the population mean $\mu_0$ is at least $K/2$, then the MoM estimator is within distance $t$ of $\mu_0$. Mathematically, for $t > 0$, we have
$$\left\{\left|\mathrm{MoM}_K[\mu] - \mu_0\right| \le t\right\} \supseteq \left\{\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K}{2}\right\} \supseteq \left\{\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right\}.$$
Further, the following inequality holds:
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| \le t\right) \ge \mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right).\tag{11}$$
From condition (H.2), we have $1 \le \frac{K_S}{2\eta(\varepsilon_O)} \le K_S - 1$ and
$$K - 1 \ge K_S \ge \eta(\varepsilon_O)K \ge \frac{2\eta(\varepsilon_O)}{1 + 1/(K_S-1)} > 1 \quad \text{when } K \ge 2.$$
Applying Theorem 2 in Ref. [14], we can obtain a lower bound for Formula (11), i.e.,
$$\mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right) \ge 1 - \frac{c\,K_S}{1-r}\cdot\frac{\eta(\varepsilon_O)}{\sqrt{2\pi K_S\left(2\eta(\varepsilon_O)-1\right)}}\, e^{-K_S D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)},\tag{13}$$
where $c = c(r) = \frac{4\eta^2(\varepsilon_O)}{2\eta(\varepsilon_O)-1}\left(1 + \frac{r(1+r)}{(1-r)^2}\right)$, $r = r\!\left(\frac{1}{2\eta(\varepsilon_O)},\,\tilde{p}_S\right) = \frac{\tilde{p}_S\left(2\eta(\varepsilon_O)-1\right)}{1-\tilde{p}_S}$, and
$$D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right) = \frac{1}{2\eta(\varepsilon_O)}\log\frac{1}{2\eta(\varepsilon_O)\,\tilde{p}_S} + \frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)\left(1-\tilde{p}_S\right)}.$$
On the other hand, by Chebyshev's inequality (see p. 239 in Ref. [15]), we have
$$1 - \frac{1}{2\eta(\varepsilon_O)} < 1 - \tilde{p}_S = \mathbb{P}\left(\left|\hat{\mu}_i - \mu_0\right| > t\right) \le \frac{\sigma^2}{Bt^2} = \frac{K\sigma^2}{nt^2} \le 1 \quad \text{for } t > 0\ (i = 1, \ldots, K_S).\tag{14}$$
Thus, $\tilde{p}_S \in \left[1 - \frac{K\sigma^2}{nt^2},\ \frac{1}{2\eta(\varepsilon_O)}\right)$ and $r \in \left[\frac{\left(nt^2 - K\sigma^2\right)\left(2\eta(\varepsilon_O)-1\right)}{K\sigma^2},\ 1\right)$. Because $\eta(\varepsilon_O)K \le K_S \le K - 1$ and $\eta(\varepsilon_O) \in (1/2, 1)$, the inequality (13) can be written as
$$\mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right) \ge 1 - e^{-K_S D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)} \ge 1 - e^{-(K-1) D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)},\tag{15}$$
where
$$1 + \frac{1}{c}\cdot\frac{K_S}{1-r}\cdot\frac{\eta(\varepsilon_O)}{\sqrt{2\pi K_S\left(2\eta(\varepsilon_O)-1\right)}}\, e^{-K_S D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)}\tag{16}$$
The inequality (16) can be valid, for example, if $\eta(\varepsilon_O)$ is infinitely close to $1/2$.
From $\tilde{p}_S \in \left[1 - \frac{K\sigma^2}{nt^2},\ \frac{1}{2\eta(\varepsilon_O)}\right)$, we obtain the following lower bound on $D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)$:
$$D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right) > \frac{1}{2\eta(\varepsilon_O)}\log\frac{1/(2\eta(\varepsilon_O))}{1/(2\eta(\varepsilon_O))} + \frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)\left(1 - 1 + \frac{K\sigma^2}{nt^2}\right)} = \frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{\left(2\eta(\varepsilon_O)-1\right)nt^2}{2\eta(\varepsilon_O)K\sigma^2}.\tag{17}$$
Substituting Equation (17) into Equation (15), we have
$$\mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right) > 1 - \exp\left\{-(K-1)\,\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{\left(2\eta(\varepsilon_O)-1\right)nt^2}{2\eta(\varepsilon_O)K\sigma^2}\right\}.\tag{18}$$
Further, due to Relation (14), we have $\frac{\left(2\eta(\varepsilon_O)-1\right)nt^2}{2\eta(\varepsilon_O)\sigma^2} < K \le n$ and $\frac{K\sigma^2}{nt^2} \le 1$; then, the inequality (18) can be bounded as
$$\mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right) > 1 - \exp\left\{\left(\frac{\left(2\eta(\varepsilon_O)-1\right)nt^2}{2\eta(\varepsilon_O)\sigma^2} - 1\right)\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\right\}. \qquad \square$$

5. Conclusions

In this paper, we obtain bounds for variance-dependent MoM estimation based on the binomial tail probability, covering both the uncontaminated and the contaminated case. The non-asymptotic properties of the MoM estimator without contamination have been shown to be superior to the existing classical Hoeffding-based results. As a next step, we will continue to investigate bounds for variance-dependent MoM estimation with outliers under sub-Gaussian or Weibull distributions. Compared with traditional exponential-family distributions, it is more practical to study inequalities for these distributions (see Refs. [19,20]). We further plan to study application problems with a practical background.

Author Contributions

Conceptualization, G.T. and Y.L.; methodology, G.T. and Y.L.; formal analysis, B.T.; writing—original draft preparation, G.T.; writing—review and editing, Y.L. and J.L.; supervision, B.T.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by China Postdoctoral Science Foundation 2023M733852.

Data Availability Statement

This paper does not use any data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nemirovskij, A.S.; Yudin, D.B. Problem Complexity and Method Efficiency in Optimization; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 1983. [Google Scholar]
  2. Lerasle, M.; Szabó, Z.; Mathieu, T.; Lecué, G. Monk outlier-robust mean embedding estimation by median-of-means. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3782–3793. [Google Scholar]
  3. Lugosi, G.; Mendelson, S. Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. 2019, 22, 925–965. [Google Scholar] [CrossRef]
  4. Lecué, G.; Lerasle, M. Robust machine learning by median-of-means: Theory and practice. Ann. Stat. 2020, 48, 906–931. [Google Scholar] [CrossRef]
  5. Humbert, P.; Le Bars, B.; Minvielle, L. Robust kernel density estimation with median-of-means principle. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; p. 9444. [Google Scholar]
  6. Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48. [Google Scholar]
  7. Zhang, H.; Chen, S.X. Concentration Inequalities for Statistical Inference. Commun. Math. Res. 2021, 37, 1–85. [Google Scholar] [CrossRef]
  8. Zhang, H.; Lei, X. Growing-dimensional Partially Functional Linear Models: Non-asymptotic Optimal Prediction Error. Phys. Scr. 2023, 98, 095216. [Google Scholar] [CrossRef]
  9. Cai, T.T.; Han, R.; Zhang, A.R. On the non-asymptotic concentration of heteroskedastic Wishart-type matrix. Electron. J. Probab. 2022, 27, 1–40. [Google Scholar] [CrossRef]
  10. Depersin, J.; Lecué, G. On the robustness to adversarial corruption and to heavy-tailed data of the Stahel–Donoho median of means. Inf. Inference J. IMA 2023, 12, 814–850. [Google Scholar] [CrossRef]
  11. Marteau, C.; Sart, M. Deconvolution for some singular density errors via a combinatorial median of means approach. Math. Stat. Learn. 2023, 6, 51–85. [Google Scholar] [CrossRef]
  12. Chen, Y. A Short Note on the Median-of-Means Estimator; University of Washington: Seattle, WA, USA, 2020; Available online: https://faculty.washington.edu/yenchic/short_note/note_MoM.pdf (accessed on 12 November 2020).
  13. Minsker, S. U-statistics of growing order and sub-Gaussian mean estimators with sharp constants. arXiv 2022, arXiv:2202.11842. [Google Scholar]
  14. Ferrante, G.C. Bounds on Binomial Tails With Applications. IEEE Trans. Inf. Theory 2021, 67, 8273–8279. [Google Scholar] [CrossRef]
  15. Alsmeyer, G. Chebyshev’s Inequality. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar] [CrossRef]
  16. Lerasle, M. Lecture Notes: Selected Topics on Robust Statistical Learning Theory. arXiv 2019, arXiv:1908.10761. [Google Scholar]
  17. Zhang, H.; Wei, H.; Cheng, G. Tight Non-asymptotic Inference via Sub-Gaussian Intrinsic Moment Norm. arXiv 2023, arXiv:2303.07287. [Google Scholar]
  18. Depersin, J.; Lecué, G. Robust sub-Gaussian estimation of a mean vector in nearly linear time. Ann. Stat. 2022, 50, 511–536. [Google Scholar] [CrossRef]
  19. Hallinan, A.J., Jr. A review of the Weibull distribution. J. Qual. Technol. 1993, 25, 85–93. [Google Scholar] [CrossRef]
  20. Xu, L.; Yao, F.; Yao, Q.; Zhang, H. Non-Asymptotic Guarantees for Robust Statistical Learning under Infinite Variance Assumption. J. Mach. Learn. Res. 2023, 24, 1–46. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
