Article

Sharper Concentration Inequalities for Median-of-Mean Processes

1
School of Mathematics, Harbin Institute of Technology, Harbin 150001, China
2
Department of Statistics and Data Science, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077, Singapore
3
School of Statistics, Renmin University of China, Beijing 100872, China
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2023, 11(17), 3730; https://doi.org/10.3390/math11173730
Submission received: 30 July 2023 / Revised: 29 August 2023 / Accepted: 29 August 2023 / Published: 30 August 2023
(This article belongs to the Special Issue New Advances in High-Dimensional and Non-asymptotic Statistics)

Abstract:
The Median-of-Mean (MoM) estimator is an efficient statistical method for handling contaminated data. In this paper, we propose a variance-dependent MoM estimation method based on the tail probability of a binomial distribution. Under mild conditions, the resulting bound is sharper than the one obtained from the classical Hoeffding approach. The method is then used to study the concentration of variance-dependent MoM empirical processes and of the sub-Gaussian intrinsic moment norm. Finally, we give a bound for the variance-dependent MoM estimator with distribution-free contaminated data.

1. Introduction

Nowadays, information processing must handle huge and varied volumes of data. With the rapid expansion of data volume, traditional centralized data processing has gradually become unable to meet current needs, which motivates distributing processing across the computers on a network.
When dealing with large amounts of data, contaminated observations, which we generally call outliers, inevitably arise. Outliers lower the accuracy, or raise the sensitivity, of data processing tasks. Naturally, inferring probability density functions from contaminated samples is an important problem. Correspondingly, when a dataset contains no outliers, we call it sane.
The Median-of-Mean (MoM) method is an effective way to deal with contaminated data, which divides the original data into several blocks, calculates the mean for each block, and then takes the median of these means. The literature on MoM methods can be traced back to Ref. [1]. In recent years, MoM methods have been widely used in the field of machine learning. For example, Ref. [2] used the MoM method to design estimators for kernel mean embedding and maximum mean discrepancy with excessive resistance properties to outliers; Ref. [3] applied the MoM method to achieve the optimal trade-off between accuracy and confidence under minimal assumptions in the classical statistical learning/regression problem; Ref. [4] introduced an MoM method for robust machine learning without deteriorating the estimation properties of a given estimator which is also easily computable in practice; Ref. [5] introduced a robust nonparametric density estimator combining the popular Kernel Density Estimation method and the Median-of-Means principle.
When using MoM methods to deal with contaminated data, the data often do not exhibit clear normal-distribution characteristics but rather the more general sub-Gaussian property; thus, non-asymptotic techniques are needed. Non-asymptotic inference is particularly advantageous in the finite-sample setting. Especially in the field of machine learning, non-asymptotic inference can establish rigorous error bounds for the learning procedure of interest (see Refs. [6,7,8]). In practice, the exact distribution of the data is often unknown, which calls for the study of broader distribution classes such as sub-Gaussian, sub-exponential, heavy-tailed, and bounded distributions. For example, Ref. [9] studied the non-asymptotic concentration of heteroskedastic Wishart-type matrices; Ref. [10] constructed sub-Gaussian estimators of a mean vector under adversarial contamination and heavy-tailed data by Median-of-Mean versions of the Stahel–Donoho outlyingness and of the Median Absolute Deviation functions; Ref. [11] obtained deconvolution results for some singular density errors via a combinatorial Median-of-Mean approach and assessed estimator quality by establishing non-asymptotic risk bounds.
To obtain a clear picture of robust estimation from a non-asymptotic viewpoint, we mainly study variance-dependent MoM methods based on binomial tail probabilities, covering both the uncontaminated and the contaminated case. The paper proceeds as follows. We first provide a variance-dependent MoM bias inequality by using bounds on binomial tails with unbounded samples, whose bias bound is tighter than the classical Hoeffding bound (see Section 2). Then, by the variance-dependent MoM inequality, we obtain a generalization bound via entropic complexity (see Section 3.1) and a non-asymptotic property of the sub-Gaussian intrinsic moment norm (see Section 3.2). Finally, the variance-dependent MoM inequality with contaminated data is presented in Section 4.

2. Variance-Dependent Median-of-Mean Estimator without Outliers

The MoM method was originally introduced on page 242 of Ref. [1]; it improves on the empirical mean for heavy-tailed distributions while inheriting its efficiency for light-tailed distributions. The MoM estimator is derived as follows.
Without loss of generality, suppose that the sample data $X_1, X_2, \ldots, X_n$ are decomposed into $K$ blocks, with each block including $B$ observations, that is to say, $n = KB$. We first compute the mean of each block, which yields estimators $\hat{\mu}_1, \ldots, \hat{\mu}_K$, each based on $B$ observations. Then, the MoM estimator is given by the median of these block means, i.e.,
$$\mathrm{MoM}_K[\mu] = \mathrm{median}\left\{\hat{\mu}_1, \ldots, \hat{\mu}_K\right\}.$$
It turns out that, even under the very mild condition $\operatorname{Var}(X) = \sigma^2 < \infty$, the MoM estimator enjoys a nice concentration inequality in the finite-sample case.
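For illustration, the following minimal Python sketch (ours, not part of the original paper) computes the MoM estimator exactly as described above: split the data into $K$ blocks of $B$ observations, average within blocks, and take the median of the block means. The simulated heavy-tailed data, the number of blocks, and the injected outliers are arbitrary choices.

```python
import numpy as np

def mom_estimator(x, K):
    """Median-of-Mean estimate of the mean of x using K equal-sized blocks.

    Assumes len(x) is (at least) K * B with B = len(x) // K, as in the text (n = K * B).
    """
    x = np.asarray(x, dtype=float)
    B = len(x) // K
    # Split the first K*B observations into K blocks of B observations each.
    blocks = x[:K * B].reshape(K, B)
    block_means = blocks.mean(axis=1)   # \hat{mu}_1, ..., \hat{mu}_K
    return np.median(block_means)       # MoM_K[mu]

# Example: heavy-tailed data with a few gross outliers.
rng = np.random.default_rng(0)
sample = rng.standard_t(df=3, size=1200)
sample[:5] += 100.0                      # contaminate a few observations
print("sample mean:", sample.mean())
print("MoM (K=30): ", mom_estimator(sample, K=30))
```

On such contaminated data, the MoM estimate typically stays close to the true mean while the plain sample mean is pulled away by the outliers.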
Given the i.i.d. sample $X_1, X_2, \ldots, X_n$ with mean $\mu_0$ and finite variance $\sigma^2$, using Hoeffding's inequality, Proposition 1 in Ref. [12] produces the following concentration inequality:
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{n t^2}{27\sigma^2}\right),$$
where $t = \sigma\sqrt{(2+\delta)/B}$ — see the detailed description in Remark 1.
When additional conditions are imposed on the distribution under consideration, tighter bounds can be obtained; our result based on binomial tails (Theorem 1) is such an improvement.
In practice, one often needs to split the data into blocks, and the minimum number of samples per block is a natural concern, since it involves a trade-off between efficiency and robustness; from a statistical point of view, the effect of the variance should also be taken into account. The following theorem accounts for the variance when partitioning the data and yields the variance-dependent MoM inequality.
Theorem 1. 
Given the i.i.d. samples $X_1, X_2, \ldots, X_n$ with mean $\mu_0$ and finite variance $\sigma^2$, for $\delta \ge 2/(\sqrt{2\pi}-2)$, there exist $B \in \mathbb{N}$ and $\varepsilon > 0$ such that $B\varepsilon^2 \ge (2+\delta)\sigma^2$. Then, the MoM estimator has the following concentration inequality:
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{0.0976\, n t^2}{\sigma^2}\right),\tag{1}$$
where $t = \sigma\sqrt{(2+\delta)/B}$.
A powerful feature of Theorem 1 is that the $X_i$'s may be unbounded. In addition, finite-sample exponential concentration is not easy to obtain when only the variance is assumed to exist (see Ref. [13]). Moreover, Theorem 1 provides the basis for further deriving the inequality with outliers. In the process of proving the theorem, we use the following lemma.
Lemma 1 
(Theorem 1 of [14]). Suppose $S_n \sim \mathrm{Bin}(n, p)$, $a > p$ with $a, p \in (0,1)$, and $1 \le an \le n-1$. If $an \in \mathbb{N}$, then
$$\mathbb{P}\left(S_n \ge an\right) \le \frac{1}{1-r}\cdot\frac{1}{\sqrt{2\pi a(1-a)n}}\, e^{-n D(a\|p)},$$
where $r = r(a,p) := \frac{p(1-a)}{a(1-p)}$, and $D(a\|p) := a\log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}$ is the KL divergence between Bernoulli distributions with parameters $a$ and $p$. If $an \notin \mathbb{N}$, the bound still holds, but it can be tightened by replacing $a$ with $a^* := \lceil an\rceil/n$.
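As a quick sanity check of Lemma 1 (ours, not from Ref. [14]), the following snippet compares the stated upper bound with the exact binomial tail computed via scipy; the parameters n, p, and a are arbitrary and chosen so that an is an integer.

```python
import numpy as np
from scipy.stats import binom

def lemma1_bound(n, p, a):
    """Upper bound of Lemma 1 for P(S_n >= a*n), assuming a > p and a*n an integer."""
    r = p * (1 - a) / (a * (1 - p))
    D = a * np.log(a / p) + (1 - a) * np.log((1 - a) / (1 - p))  # KL(Bern(a) || Bern(p))
    return (1 / (1 - r)) * np.exp(-n * D) / np.sqrt(2 * np.pi * a * (1 - a) * n)

n, p, a = 100, 0.3, 0.5                    # a*n = 50 is an integer
exact = binom.sf(a * n - 1, n, p)          # P(S_n >= a*n)
print("exact tail :", exact)
print("Lemma 1 UB :", lemma1_bound(n, p, a))
```

The bound tracks the exact tail closely in this regime, which is why it yields sharp constants when applied to the indicator sums below.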
Now, we give a detailed proof of Theorem 1.
Proof of Theorem 1. 
First, observe that the event
$$\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \varepsilon \quad \text{for } \varepsilon \ge 0$$
implies that at least $K/2$ of the $\hat{\mu}_\ell$ ($\ell = 1, \ldots, K$) must lie at distance more than $\varepsilon$ from $\mu_0$. Namely,
$$\left\{\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \varepsilon\right\} \subseteq \left\{\sum_{\ell=1}^{K}\mathbf{1}\left\{\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right\} \ge \frac{K}{2}\right\} \quad \text{for } \varepsilon \ge 0.$$
Here, $K$ is assumed to be even. When $K$ is odd, the same argument applies with at least $\lceil K/2\rceil$ blocks; for convenience of writing, the proof below only treats the case $K/2$, the case $\lceil K/2\rceil$ being proved in the same way.
Define $Z_\ell = \mathbf{1}\left\{\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right\}$ and let $\tilde{p} := \tilde{p}_{\varepsilon, B} = \mathbb{E} Z_\ell = \mathbb{P}\left(\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right)$. The condition of the theorem and Chebyshev's inequality (see p. 239 in Ref. [15]) imply that there exist $B \in \mathbb{N}$ and $\varepsilon > 0$ such that
$$\tilde{p} := \tilde{p}_{\varepsilon, B} = \mathbb{P}\left(\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right) \le \frac{\sigma^2}{B\varepsilon^2} < \frac{1}{2}.\tag{2}$$
In fact, the detailed derivation is as follows:
$$\mathbb{P}\left(\left|\hat{\mu}_\ell - \mu_0\right| > \varepsilon\right) \le \frac{\operatorname{Var}(\hat{\mu}_\ell)}{\varepsilon^2} = \frac{\operatorname{Var}\!\left(\frac{X_{\ell 1} + \cdots + X_{\ell B}}{B}\right)}{\varepsilon^2} = \frac{\frac{1}{B^2}\operatorname{Var}\!\left(\sum_{i=1}^{B} X_{\ell i}\right)}{\varepsilon^2} = \frac{\frac{1}{B^2}\sum_{i=1}^{B}\operatorname{Var}(X_{\ell i})}{\varepsilon^2} = \frac{\frac{1}{B^2}\, B\sigma^2}{\varepsilon^2} = \frac{\sigma^2}{B\varepsilon^2}.$$
The random variables $Z_\ell \sim \mathrm{Bernoulli}(\tilde{p})$ are i.i.d. because the samples $X_1, X_2, \ldots, X_n$ are i.i.d. Applying Lemma 1 (with $a = 1/2$, $n = K$, and $p = \tilde{p}$) to their sum gives
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \varepsilon\right) \le \mathbb{P}\left(\sum_{\ell=1}^{K} Z_\ell \ge \frac{K}{2}\right) \le \frac{1-\tilde{p}}{1-2\tilde{p}}\sqrt{\frac{2}{\pi K}}\, e^{-K D\left(\frac{1}{2}\,\|\,\tilde{p}\right)},$$
where $D\left(\frac{1}{2}\,\|\,\tilde{p}\right) = \frac{1}{2}\log\frac{1}{4\tilde{p}(1-\tilde{p})}$.
Setting $B \ge (2+\delta)\sigma^2/\varepsilon^2 > 2\sigma^2/\varepsilon^2$ for $\delta > 0$ satisfies Equation (2); then,
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \sigma\sqrt{\frac{(2+\delta)K}{n}}\right) \le \frac{1-\tilde{p}}{1-2\tilde{p}}\sqrt{\frac{2}{\pi K}}\, e^{-K D\left(\frac{1}{2}\,\|\,\tilde{p}\right)} = \left(1 + \frac{\tilde{p}}{1-2\tilde{p}}\right)\sqrt{\frac{2}{\pi K}}\, e^{-K D\left(\frac{1}{2}\,\|\,\tilde{p}\right)} \le \frac{\delta+1}{\delta}\sqrt{\frac{2}{\pi K}}\left(1 + \frac{\delta^2}{4+4\delta}\right)^{-K/2}.$$
When $K = 1$, we set $\delta \ge 2/(\sqrt{2\pi}-2) \approx 3.95$ so that $\frac{\delta+1}{\delta}\sqrt{\frac{2}{\pi K}} \le \frac{\delta+1}{\delta}\sqrt{\frac{2}{\pi}} \le 1$ ($K = 1, \ldots, n$). Then, it follows that
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \sigma\sqrt{\frac{(2+\delta)K}{n}}\right) \le \left(1 + \frac{\delta^2}{4+4\delta}\right)^{-K/2}$$
for $1 \le K \le n$ and $\delta \ge 2/(\sqrt{2\pi}-2)$.
Now, taking $t := \sigma\sqrt{(2+\delta)K/n}$ gives
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{n t^2}{2(2+\delta)\sigma^2}\ln\left(1 + \frac{\delta^2}{4+4\delta}\right)\right).$$
The function $g(\delta) = \frac{1}{2+\delta}\ln\left(1 + \frac{\delta^2}{4+4\delta}\right)$ ($\delta \ge 2/(\sqrt{2\pi}-2)$) is a monotonically decreasing function, so its maximum is $g\!\left(2/(\sqrt{2\pi}-2)\right) \approx 0.0976$.
This then leads to the final result:
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{0.0976\, n t^2}{\sigma^2}\right). \qquad \square$$
Remark 1. 
The classical result via Hoeffding's inequality (see Proposition 1 in Ref. [12]) states that
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \sigma\sqrt{\frac{(2+\delta)K}{n}}\right) \le e^{-\frac{K\delta^2}{2(2+\delta)^2}}.$$
Similarly, to obtain a sharp constant, one can take $t := \sigma\sqrt{(2+\delta)K/n}$; then,
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{n t^2 \delta^2}{2\sigma^2(2+\delta)^3}\right),$$
and the function
$$g(\delta) = \frac{\delta^2}{(2+\delta)^3}$$
achieves its unique maximum at $\delta = 4$ with $g(4) = 2/27$. It follows that
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > t\right) \le \exp\left(-\frac{n t^2}{27\sigma^2}\right).$$
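Since $0.0976 > 1/27 \approx 0.037$, the exponent in Theorem 1 is uniformly larger in magnitude than the Hoeffding-based exponent above. The short script below (ours) evaluates both bounds for a few values of the ratio $nt^2/\sigma^2$ to make the comparison concrete.

```python
import numpy as np

# Exponential bounds of Theorem 1 and of the Hoeffding-based Remark 1,
# as functions of the ratio u = n * t^2 / sigma^2.
u = np.array([10.0, 20.0, 50.0])
theorem1 = np.exp(-0.0976 * u)     # sharper constant 0.0976
hoeffding = np.exp(-u / 27.0)      # classical constant 1/27 ~ 0.037
for ui, b1, b2 in zip(u, theorem1, hoeffding):
    print(f"n t^2 / sigma^2 = {ui:5.1f}:  Theorem 1 = {b1:.2e},  Hoeffding = {b2:.2e}")
```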
Remark 2. 
The efficient interval of $t$ is an interesting issue. By the construction $t = \sigma\sqrt{(2+\delta)K/n}$, it follows that $\sqrt{(2+\delta)/n} \le t/\sigma \le \sqrt{2+\delta}$ since $1 \le K \le n$.
Remark 3. 
In Theorem 1, substituting $t = \sigma\sqrt{(2+\delta)/B}$ into inequality (1) produces
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > \sigma\sqrt{\frac{2+\delta}{B}}\right) \le \exp\left(-\frac{0.0976\, n\, \sigma^2(2+\delta)}{B\sigma^2}\right) = \exp\left(-0.0976\,(2+\delta)K\right).$$
Since $\delta \ge 2/(\sqrt{2\pi}-2) \approx 3.95 > 2$, we have
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| > 2\sigma\sqrt{\frac{K}{n}}\right) \le e^{-0.5807 K}.$$
This result is better than the bound $e^{-K/8}$ of level-dependent sub-Gaussian estimators. Of course, our conditions are more stringent (see Proposition 12 in Ref. [16]).

3. Applications

In this section, we use the proposed sharper concentration inequalities for MoM estimators to perform two applications in statistical machine learning.

3.1. Concentration for Supremum of Variance-Dependent MoM Empirical Processes

Let $\psi(x) \in B_L$ with $|\psi(x)| \le M_0 < \infty$, where $B_L$ is a ball of the space of Lipschitz functions and $M_0$ is a constant. Let $P\psi = \mathbb{E}\psi = \int \psi \,\mathrm{d}P$.
To derive the concentration inequality for the supremum of variance-dependent MoM empirical processes, the following auxiliary Lemma 2 is necessary, whose proof is trivial and thus omitted.
Lemma 2. 
$|\operatorname{med}(a) - \operatorname{med}(b)| \le \|a - b\|_\infty$ for $a, b \in B_L$, where $\operatorname{med}(a)$ means the value of the function $a(x)$ at the midpoint of the domain, and similarly for $\operatorname{med}(b)$.
By Lemma 2, for $\phi \in B_L$, we have
$$\left|\mathrm{MoM}_K[\phi] - P\phi\right| \le \left|\mathrm{MoM}_K[\phi] - \mathrm{MoM}_K[\psi]\right| + \left|P(\phi - \psi)\right| + \left|\mathrm{MoM}_K[\psi] - P\psi\right| \le \|\phi - \psi\|_\infty + \|\phi - \psi\|_\infty + \left|\mathrm{MoM}_K[\psi] - P\psi\right| = 2\|\phi - \psi\|_\infty + \left|\mathrm{MoM}_K[\psi] - P\psi\right|.\tag{3}$$
Let $\psi_1, \ldots, \psi_{N(\xi, B_L, \|\cdot\|_\infty)}$ be a $\xi$-covering of $B_L$ w.r.t. $\|\cdot\|_\infty$. It is well known that there exist constants $C_L > 0$ and $r \ge 1$ such that
$$\log N\left(\xi, B_L, \|\cdot\|_\infty\right) \le C_L\left(\frac{1}{\xi}\right)^{r}, \quad \xi > 0,$$
where $N(\xi, B_L, \|\cdot\|_\infty)$ denotes the number of $\|\cdot\|_\infty$-balls of radius $\xi > 0$ needed to cover the class $B_L$, and $C_L$ is a universal constant depending only on $B_L$.
Put $N = N(\xi, B_L, \|\cdot\|_\infty)$ for simplicity. By the definition of $N$, for any $\phi \in B_L$ there exists $i \in \{1, \ldots, N\}$ such that
$$\|\phi - \psi_i\|_\infty \le \xi.$$
Then, (3) becomes
$$\left|\mathrm{MoM}_K[\phi] - P\phi\right| \le 2\xi + \left|\mathrm{MoM}_K[\psi_i] - P\psi_i\right|.$$
Then, by Theorem 1, the union bound over $\{\psi_i\}_{i=1}^{N}$ gives that
$$\mathbb{P}\left(\max_{1\le i\le N}\left|\mathrm{MoM}_K[\psi_i] - P\psi_i\right| \le \sigma\sqrt{\frac{\ln(N/\delta)}{0.0976\, n}}\right) \ge 1 - \delta.$$
Together, (4)–(6) give
$$\mathbb{P}\left(\sup_{\phi \in B_L}\left|\mathrm{MoM}_K[\phi] - P\phi\right| \le 2\xi + \sigma\sqrt{\frac{\ln(N/\delta)}{0.0976\, n}}\right) \ge 1 - \delta.$$
Put $\xi^{r+2} = C_L/N$, i.e., $\xi = \left(C_L/N\right)^{\frac{1}{r+2}}$; then, for $\phi \in B_L$ and $\delta \in (0,1)$, we have
$$\mathbb{P}\left(\sup_{\phi \in B_L}\left|\mathrm{MoM}_K[\phi] - P\phi\right| \le 2\left(\frac{C_L}{N}\right)^{\frac{1}{r+2}} + \sigma\sqrt{\frac{\ln(N/\delta)}{0.0976\, n}}\right) \ge 1 - \delta.$$
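To illustrate the quantity controlled in this subsection, the following simulation sketch (ours) computes $\max_i \left|\mathrm{MoM}_K[\psi_i] - P\psi_i\right|$ over a small, explicitly chosen dictionary of bounded Lipschitz functions that stands in for the covering $\{\psi_i\}$; the functions, sample sizes, and the Monte Carlo approximation of $P\psi_i$ are all illustrative choices, not part of the paper.

```python
import numpy as np

def mom_of_function(x, psi, K):
    """MoM_K[psi]: block means of psi(X_i), followed by their median."""
    vals = psi(x)
    B = len(vals) // K
    return np.median(vals[:K * B].reshape(K, B).mean(axis=1))

rng = np.random.default_rng(1)
x = rng.normal(size=2000)

# A small stand-in for the covering {psi_1, ..., psi_N}: bounded Lipschitz functions.
centers = np.linspace(-2.0, 2.0, 9)
psis = [lambda t, c=c: np.tanh(t - c) for c in centers]

# Approximate P*psi_i by Monte Carlo with a large independent sample.
big = rng.normal(size=200_000)
true_means = [np.tanh(big - c).mean() for c in centers]

K = 40
sup_dev = max(abs(mom_of_function(x, psi, K) - m) for psi, m in zip(psis, true_means))
print("max_i |MoM_K[psi_i] - P psi_i| ~", sup_dev)
```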

3.2. Concentration for Variance-Dependent MoM Intrinsic Moment Norm

A centered random variable $X$ is called sub-Gaussian if
$$\mathbb{E}\, e^{sX} \le e^{s^2\sigma_G^2/2} \quad \text{for } s \in \mathbb{R},$$
where the quantity $\sigma_G > 0$ is called the sub-Gaussian parameter. In non-asymptotic statistics, because collected sub-Gaussian data are often unstable, it is sometimes not possible to use the empirical moment-generating function directly to estimate sub-Gaussian parameters such as the variance-type parameters of sub-Gaussian distributions (see Ref. [17]). This motivates the use of the sub-Gaussian intrinsic moment norm, defined as follows.
Definition 1 
(Intrinsic moment norm, see Definition 2 in Ref. [17]). The sub-Gaussian intrinsic moment norm is defined as
$$\|X\|_G := \max_{k \ge 1}\left[\frac{2^k k!}{(2k)!}\,\mathbb{E} X^{2k}\right]^{1/(2k)} = \max_{k \ge 1}\left[\frac{1}{(2k-1)!!}\,\mathbb{E} X^{2k}\right]^{1/(2k)},$$
where $n!! = \prod_{j=0}^{\lceil n/2\rceil - 1}(n - 2j) = n(n-2)(n-4)\cdots$ for $n \in \mathbb{N}$.
As the amount of computation increases, so does the importance of the distributed MoM approach, with the corresponding intrinsic moment norm estimator defined below.
Definition 2 
(see Equation (7) in Ref. [17]). Let $[K] = \{1, \ldots, K\}$ and let $B_s$ denote the $s$-th block of samples (of size $B$). The MoM estimator for the sub-Gaussian intrinsic moment norm is given by
$$\widehat{\|X\|}_{b,G} := \max_{1\le k\le \kappa_n}\left\{\operatorname*{median}_{s \in [K]}\left[\big((2k-1)!!\big)^{-1}\, P_B^{B_s} X^{2k}\right]\right\}^{1/(2k)},$$
where $P_B^{B_s} X = B^{-1}\sum_{i \in B_s} X_i$ ($s = 1, \ldots, K$), so that $P_B^{B_s} X^{2k} = B^{-1}\sum_{i \in B_s} X_i^{2k}$.
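A direct implementation of the estimator in Definition 2 reads as follows (our sketch); the choices of $K$ and $\kappa_n$, as well as the Gaussian test data, are arbitrary and only serve to exercise the code.

```python
import numpy as np

def double_factorial_odd(k):
    """(2k-1)!! = 1 * 3 * ... * (2k-1)."""
    return np.prod(np.arange(1, 2 * k, 2, dtype=float))

def mom_intrinsic_moment_norm(x, K, kappa_n):
    """MoM estimator of the sub-Gaussian intrinsic moment norm (Definition 2)."""
    x = np.asarray(x, dtype=float)
    B = len(x) // K
    blocks = x[:K * B].reshape(K, B)
    candidates = []
    for k in range(1, kappa_n + 1):
        block_moments = (blocks ** (2 * k)).mean(axis=1)          # P_B^{B_s} X^{2k}
        med = np.median(block_moments / double_factorial_odd(k))  # median over blocks
        candidates.append(med ** (1.0 / (2 * k)))
    return max(candidates)

rng = np.random.default_rng(2)
x = rng.normal(scale=1.5, size=3000)
print("estimated ||X||_G (K=30, kappa_n=5):", mom_intrinsic_moment_norm(x, K=30, kappa_n=5))
```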
Definition 3. 
For any $B \in \mathbb{N}$ and $1 \le k \le \kappa_n$,
$$\bar{g}_{k,B}(\sigma_k) := 1 - \left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{-\frac{1}{2k}}\max_{1\le j\le \kappa_n}\left[-\sqrt{2}\,B^{-\frac{1}{2}}\,\frac{\sigma_j^{j}}{\mathbb{E} X^{2j}} + \frac{\mathbb{E} X^{2j}}{(2j-1)!!}\right]^{\frac{1}{2j}}$$
and $\underline{g}_{k,B}(\sigma_k) := \left[\sqrt{2}\,B^{-1/2}\,\frac{\sigma_k^{k}}{\mathbb{E} X^{2k}} + 1\right]^{1/(2k)} - 1$.
Theorem 2. 
Suppose that, for $\varepsilon > 0$ and $n \in \mathbb{N}$, there exists $B \in \mathbb{N}$ such that $\operatorname{Var}\left(X^{2k}\right) < \frac{\varepsilon B}{2}\,\sigma_k^{k}$, where $\{\sigma_k\}_{k=1}^{\kappa_n}$ is a finite constant sequence. Then, we have
$$\mathbb{P}\left(\|X\|_G \le \left(1 - \max_{1\le k\le \kappa_n}\bar{g}_{k,B}(\sigma_k)\right)^{-1}\widehat{\|X\|}_{b,G}\right) > 1 - \kappa_n\, e^{-0.3904 K}$$
and
$$\mathbb{P}\left(\|X\|_G > \left(1 + \max_{1\le k\le \kappa_n}\underline{g}_{k,B}(\sigma_k)\right)^{-1}\widehat{\|X\|}_{b,G}\right) > 1 - \kappa_n\, e^{-0.3904 K}.$$
Remark 4. 
Let K = n / B ; we then obtain distributed samples that satisfy Theorem 2.
Remark 5. 
The key coefficient in the exponent is $-0.3904 < -0.125$. In fact, the corresponding coefficient in Theorem 3 of Ref. [17] without outliers is $-0.125$, as long as $\eta(\varepsilon) = 1$ is taken. This means that our bound is tighter than the bound in Ref. [17].
Proof of Theorem 2. 
From Definitions 1 and 2, we have
$$\|X\|_G = \max_{1\le k\le \kappa_n}\left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}$$
and
$$\widehat{\|X\|}_{b,G} = \max_{1\le k\le \kappa_n}\left\{\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right]\right\}^{1/(2k)}.$$
Recall that $\underline{g}_{k,B}(\sigma_k)$ and $\bar{g}_{k,B}(\sigma_k)$ are the sequences such that
$$\left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\left(1 - \bar{g}_{k,B}(\sigma_k)\right) = \max_{1\le k\le \kappa_n}\left[-\sqrt{2}\,B^{-1/2}\,\frac{\sigma_k^{k}}{\mathbb{E} X^{2k}} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}$$
and
$$\left[\sqrt{2}\,B^{-1/2}\,\frac{\sigma_k^{k}}{\mathbb{E} X^{2k}} + 1\right]^{1/(2k)} = 1 + \underline{g}_{k,B}(\sigma_k)$$
for any $B \in \mathbb{N}$ and $1 \le k \le \kappa_n$.
For the first inequality of Theorem 2, we have, by (7),
$$\begin{aligned}
&\mathbb{P}\left(\widehat{\|X\|}_{b,G} \le \left(1 - \max_{1\le k\le \kappa_n}\bar{g}_{k,B}(\sigma_k)\right)\|X\|_G\right)\\
&\quad= \mathbb{P}\left(\widehat{\|X\|}_{b,G} \le \max_{1\le k\le \kappa_n}\left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\left(1 - \max_{1\le k\le \kappa_n}\bar{g}_{k,B}(\sigma_k)\right)\right)\\
&\quad\le \mathbb{P}\left(\widehat{\|X\|}_{b,G} \le \max_{1\le k\le \kappa_n}\left[\frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\left(1 - \bar{g}_{k,B}(\sigma_k)\right)\right) \quad [\text{By }(9)]\\
&\quad= \mathbb{P}\left(\widehat{\|X\|}_{b,G} \le \max_{1\le k\le \kappa_n}\left[-\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad\le \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right]^{1/(2k)} \le \left[-\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right] - \frac{\mathbb{E} X^{2k}}{(2k-1)!!} \le -\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2}\right)\\
&\quad= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} - \mathbb{E} X^{2k} \le -\sqrt{2}\,\sigma_k^{k}\, B^{-1/2}\right)\\
&\quad< \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\left|\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} - \mathbb{E} X^{2k}\right| \ge \sqrt{2}\,\sigma_k^{k}\, B^{-1/2}\right) \le \kappa_n\, e^{-0.3904 K},
\end{aligned}$$
where the last inequality is by Theorem 1 and the assumption in Theorem 2.
Let $\underline{g}_B(\sigma) := \max_{1\le k\le \kappa_n}\underline{g}_{k,B}(\sigma_k)$. For the second inequality of Theorem 2, the definition of $\underline{g}_{k,B}(\sigma_k)$ implies
$$\begin{aligned}
&\mathbb{P}\left(\|X\|_G \le \frac{\widehat{\|X\|}_{b,G}}{1 + \underline{g}_B(\sigma)}\right)
= \mathbb{P}\left(\widehat{\|X\|}_{b,G} \ge \max_{1\le k\le \kappa_n}\left[\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad\le \mathbb{P}\left(\max_{1\le k\le \kappa_n}\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right]^{1/(2k)} \ge \max_{1\le k\le \kappa_n}\left[\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad\le \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right]^{1/(2k)} \ge \left[\frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right]^{1/(2k)}\right)\\
&\quad= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]}\left[\frac{1}{(2k-1)!!}\, P_B^{B_s} X^{2k}\right] \ge \frac{\sigma_k^{k}}{(2k-1)!!}\cdot\sqrt{2}\,B^{-1/2} + \frac{\mathbb{E} X^{2k}}{(2k-1)!!}\right)\\
&\quad= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} \ge \sqrt{2}\,\sigma_k^{k}\, B^{-1/2} + \mathbb{E} X^{2k}\right)
= \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} - \mathbb{E} X^{2k} \ge \sqrt{2}\,\sigma_k^{k}\, B^{-1/2}\right)\\
&\quad< \sum_{k=1}^{\kappa_n}\mathbb{P}\left(\left|\operatorname*{median}_{s\in[K]} P_B^{B_s} X^{2k} - \mathbb{E} X^{2k}\right| \ge \sqrt{2}\,\sigma_k^{k}\, B^{-1/2}\right) \le \kappa_n\, e^{-0.3904 K},
\end{aligned}$$
where the last inequality is by Theorem 1 and the assumption in Theorem 2. □

4. Concentration for Variance-Dependent MoM with Distribution-Free Outliers

In the field of big data and artificial intelligence, most work involves dealing with abnormal data. Sometimes we cannot find each outlier directly, but we can obtain a rough idea of the total number of outliers. For example, sometimes there may be abnormal economic activities in a certain region, but the specific company or person who is abnormal may not be known for the time being; however, the total number of companies and the total population in the region are still known.
Based on such information, how to accurately estimate the characteristics of a sample containing outliers is an important problem. In this section, we characterize the variance-dependent MoM estimator with outliers in the following theorem.
Theorem 3. 
Suppose that
(H.1) The sample $[n] = \{X_1, X_2, \ldots, X_n\}$ contains $n - n_O$ i.i.d. inliers with finite mean $\mu_0$ and finite variance $\sigma^2$, and $n_O$ outliers, on which no assumption is made.
(H.2) Set $K = K_O + K_S$, where $K_O$ is the number of blocks containing at least one outlier and $K_S$ is the number of sane blocks containing no outlier. For $t > 0$, there exists a function $\eta(\varepsilon_O) \in (1/2, 1)$ such that $K \ge \max\left\{2,\ \frac{1}{2\eta(\varepsilon_O)-1},\ \frac{(2\eta(\varepsilon_O)-1)nt^2}{2\eta(\varepsilon_O)\sigma^2}\right\}$ and $K_S \ge \eta(\varepsilon_O)K$, where $\varepsilon_O := n_O/n$.
Then, for $t > 0$, we have
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| \le t\right) \ge 1 - \exp\left\{\left(\frac{(2\eta(\varepsilon_O)-1)\,n t^2}{2\eta(\varepsilon_O)\sigma^2} - 1\right)\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\right\}.$$
Remark 6. 
Regarding the numbers $n_O$ and $K_O$: when one divides the $n$ samples evenly into $K$ blocks, an extreme case is that the unfavorable blocks are entirely filled with outliers, say $K_O$ blocks, while the favorable blocks contain no outliers, say $K_S$ blocks; one then has $\varepsilon_O = n_O/n = K_O/K$.
Remark 7. 
Regarding the function $\eta(\varepsilon_O)$, we can give a concrete expression showing that such a function exists, for example, $\eta(\varepsilon_O) = (1 + 2\varepsilon_O)/2 \in (1/2, 1)$, where $\varepsilon_O \in (0, 1/2)$. But there is more than one admissible expression, so the non-concrete function $\eta(\varepsilon_O)$ is more appropriate for this theorem.
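To see how the lower bound of Theorem 3 behaves numerically, the sketch below (ours) plugs the concrete choice $\eta(\varepsilon_O) = (1 + 2\varepsilon_O)/2$ from this remark into the bound; the sample size, contamination level, and threshold $t$ are arbitrary illustrative inputs.

```python
import numpy as np

def theorem3_lower_bound(n, n_outliers, t, sigma):
    """Evaluate the lower bound of Theorem 3 on P(|MoM_K[mu] - mu_0| <= t),
    using eta(eps_O) = (1 + 2*eps_O)/2 as suggested in Remark 7."""
    eps_o = n_outliers / n
    eta = (1.0 + 2.0 * eps_o) / 2.0          # in (1/2, 1) for eps_o in (0, 1/2)
    a = (2.0 * eta - 1.0) / (2.0 * eta)      # the ratio (2*eta - 1) / (2*eta) < 1
    factor = (2.0 * eta - 1.0) * n * t ** 2 / (2.0 * eta * sigma ** 2) - 1.0
    return 1.0 - np.exp(factor * a * np.log(a))

# Example: 10,000 observations, 200 of them contaminated, unit variance, t = 0.3.
print(theorem3_lower_bound(n=10_000, n_outliers=200, t=0.3, sigma=1.0))
```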
In fact, there is an adaptive way to generate block number K, but we do not show the specific calculation here; see Ref. [18] for more detail. Now, we give a detailed proof of Theorem 3.
Proof of Theorem 3. 
If, among the sane blocks, the number of blocks whose sample mean is within distance $t$ of the population mean $\mu_0$ is at least $K/2$, then the MoM estimator is within distance $t$ of $\mu_0$. Mathematically, for $t > 0$, we have
$$\left\{\left|\mathrm{MoM}_K[\mu] - \mu_0\right| \le t\right\} \supseteq \left\{\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K}{2}\right\} \supseteq \left\{\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right\}.$$
Further, the following inequality holds:
$$\mathbb{P}\left(\left|\mathrm{MoM}_K[\mu] - \mu_0\right| \le t\right) \ge \mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right).\tag{11}$$
From condition (H.2), we have $1 \le \frac{K_S}{2\eta(\varepsilon_O)} \le K_S - 1$ and
$$K - 1 \ge K_S \ge \eta(\varepsilon_O)K \ge \frac{2\eta(\varepsilon_O)}{1 + 1/(K_S-1)} > 1 \quad \text{when } K \ge 2.$$
Applying Theorem 2 in Ref. [14], we can obtain a lower bound for Formula (11), i.e.,
$$\mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right) \ge 1 - \frac{c\,K_S}{1-r}\cdot\frac{\eta(\varepsilon_O)}{\sqrt{2\pi K_S\left(2\eta(\varepsilon_O)-1\right)}}\, e^{-K_S D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)},\tag{13}$$
where $c = c(r) = \frac{4\eta^2(\varepsilon_O)}{2\eta(\varepsilon_O)-1}\left(1 + \frac{r(1+r)}{(1-r)^2}\right)$, $r = r\!\left(\frac{1}{2\eta(\varepsilon_O)},\,\tilde{p}_S\right) = \frac{\tilde{p}_S\left(2\eta(\varepsilon_O)-1\right)}{1-\tilde{p}_S}$, and
$$D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right) = \frac{1}{2\eta(\varepsilon_O)}\log\frac{1}{2\eta(\varepsilon_O)\,\tilde{p}_S} + \frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)\left(1-\tilde{p}_S\right)}.$$
On the other hand, by Chebyshev's inequality (see p. 239 in Ref. [15]), we have
$$1 - \frac{1}{2\eta(\varepsilon_O)} < 1 - \tilde{p}_S = \mathbb{P}\left(\left|\hat{\mu}_i - \mu_0\right| > t\right) \le \frac{\sigma^2}{Bt^2} = \frac{K\sigma^2}{nt^2} \le 1 \quad \text{for } t > 0\ (i = 1, \ldots, K_S).\tag{14}$$
Thus, $\tilde{p}_S \in \left[1 - \frac{K\sigma^2}{nt^2},\ \frac{1}{2\eta(\varepsilon_O)}\right)$ and $r \in \left[\frac{\left(nt^2 - K\sigma^2\right)\left(2\eta(\varepsilon_O)-1\right)}{K\sigma^2},\ 1\right)$. Because $\eta(\varepsilon_O)K \le K_S \le K - 1$ and $\eta(\varepsilon_O) \in (1/2, 1)$, the inequality (13) can be written as
$$\mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right) \ge 1 - e^{-K_S D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)} \ge 1 - e^{-(K-1) D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)},\tag{15}$$
where
$$1 + \frac{1}{c}\cdot\frac{K_S}{1-r}\cdot\frac{\eta(\varepsilon_O)}{\sqrt{2\pi K_S\left(2\eta(\varepsilon_O)-1\right)}}\, e^{-K_S D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)}\tag{16}$$
The inequality (16) can be valid, for example, if $\eta(\varepsilon_O)$ is infinitely close to $1/2$.
From $\tilde{p}_S \in \left[1 - \frac{K\sigma^2}{nt^2},\ \frac{1}{2\eta(\varepsilon_O)}\right)$, we obtain the following lower bound on $D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right)$:
$$D\left(\frac{1}{2\eta(\varepsilon_O)}\,\middle\|\,\tilde{p}_S\right) > \frac{1}{2\eta(\varepsilon_O)}\log\frac{1/(2\eta(\varepsilon_O))}{1/(2\eta(\varepsilon_O))} + \frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)\left(1 - 1 + \frac{K\sigma^2}{nt^2}\right)} = \frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{\left(2\eta(\varepsilon_O)-1\right)nt^2}{2\eta(\varepsilon_O)K\sigma^2}.\tag{17}$$
Substituting Equation (17) into Equation (15), we have
$$\mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right) > 1 - \exp\left\{-(K-1)\,\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{\left(2\eta(\varepsilon_O)-1\right)nt^2}{2\eta(\varepsilon_O)K\sigma^2}\right\}.\tag{18}$$
Further, due to Relation (14), we have $\frac{\left(2\eta(\varepsilon_O)-1\right)nt^2}{2\eta(\varepsilon_O)\sigma^2} < K \le n$ and $\frac{K\sigma^2}{nt^2} \le 1$; then, the inequality (18) can be bounded as
$$\mathbb{P}\left(\sum_{i\in[K_S]}\mathbf{1}\left\{\left|\hat{\mu}_i - \mu_0\right| \le t\right\} \ge \frac{K_S}{2\eta(\varepsilon_O)}\right) > 1 - \exp\left\{\left(\frac{\left(2\eta(\varepsilon_O)-1\right)nt^2}{2\eta(\varepsilon_O)\sigma^2} - 1\right)\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\log\frac{2\eta(\varepsilon_O)-1}{2\eta(\varepsilon_O)}\right\}. \qquad \square$$

5. Conclusions

In this paper, we obtain bounds for variance-dependent MoM estimation based on the binomial tail probability, covering both the uncontaminated and the contaminated case. The non-asymptotic properties of the MoM estimator without contamination have been shown to be superior to the existing classical Hoeffding-based results. As a next step, we will continue to investigate bounds for variance-dependent MoM estimation with outliers under sub-Gaussian or Weibull distributions. Compared with traditional exponential-family distributions, it is more practical to study inequalities for these distributions (see Refs. [19,20]). We further plan to study application problems with a practical background.

Author Contributions

Conceptualization, G.T. and Y.L.; methodology, G.T. and Y.L.; formal analysis, B.T.; writing—original draft preparation, G.T.; writing—review and editing, Y.L. and J.L.; supervision, B.T.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by China Postdoctoral Science Foundation 2023M733852.

Data Availability Statement

This paper does not use any data.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nemirovskij, A.S.; Yudin, D.B. Problem Complexity and Method Efficiency in Optimization; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 1983. [Google Scholar]
  2. Lerasle, M.; Szabó, Z.; Mathieu, T.; Lecué, G. Monk outlier-robust mean embedding estimation by median-of-means. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3782–3793. [Google Scholar]
  3. Lugosi, G.; Mendelson, S. Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. 2019, 22, 925–965. [Google Scholar] [CrossRef]
  4. Lecué, G.; Lerasle, M. Robust machine learning by median-of-means: Theory and practice. Ann. Stat. 2020, 48, 906–931. [Google Scholar] [CrossRef]
  5. Humbert, P.; Le Bars, B.; Minvielle, L. Robust kernel density estimation with median-of-means principle. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; p. 9444. [Google Scholar]
  6. Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48. [Google Scholar]
  7. Zhang, H.; Chen, S.X. Concentration Inequalities for Statistical Inference. Commun. Math. Res. 2021, 37, 1–85. [Google Scholar] [CrossRef]
  8. Zhang, H.; Lei, X. Growing-dimensional Partially Functional Linear Models: Non-asymptotic Optimal Prediction Error. Phys. Scr. 2023, 98, 095216. [Google Scholar] [CrossRef]
  9. Cai, T.T.; Han, R.; Zhang, A.R. On the non-asymptotic concentration of heteroskedastic Wishart-type matrix. Electron. J. Probab. 2022, 27, 1–40. [Google Scholar] [CrossRef]
  10. Depersin, J.; Lecué, G. On the robustness to adversarial corruption and to heavy-tailed data of the Stahel–Donoho median of means. Inf. Inference J. IMA 2023, 12, 814–850. [Google Scholar] [CrossRef]
  11. Marteau, C.; Sart, M. Deconvolution for some singular density errors via a combinatorial median of means approach. Math. Stat. Learn. 2023, 6, 51–85. [Google Scholar] [CrossRef]
  12. Chen, Y. A Short Note on the Median-of-Means Estimator; University of Washington: Seattle, WA, USA, 2020; Available online: https://faculty.washington.edu/yenchic/short_note/note_MoM.pdf (accessed on 12 November 2020).
  13. Minsker, S. U-statistics of growing order and sub-Gaussian mean estimators with sharp constants. arXiv 2022, arXiv:2202.11842. [Google Scholar]
  14. Ferrante, G.C. Bounds on Binomial Tails With Applications. IEEE Trans. Inf. Theory 2021, 67, 8273–8279. [Google Scholar] [CrossRef]
  15. Alsmeyer, G. Chebyshev’s Inequality. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar] [CrossRef]
  16. Lerasle, M. Lecture Notes: Selected Topics on Robust Statistical Learning Theory. arXiv 2019, arXiv:1908.10761. [Google Scholar]
  17. Zhang, H.; Wei, H.; Cheng, G. Tight Non-asymptotic Inference via Sub-Gaussian Intrinsic Moment Norm. arXiv 2023, arXiv:2303.07287. [Google Scholar]
  18. Depersin, J.; Lecué, G. Robust sub-Gaussian estimation of a mean vector in nearly linear time. Ann. Stat. 2022, 50, 511–536. [Google Scholar] [CrossRef]
  19. Hallinan, A.J., Jr. A review of the Weibull distribution. J. Qual. Technol. 1993, 25, 85–93. [Google Scholar] [CrossRef]
  20. Xu, L.; Yao, F.; Yao, Q.; Zhang, H. Non-Asymptotic Guarantees for Robust Statistical Learning under Infinite Variance Assumption. J. Mach. Learn. Res. 2023, 24, 1–46. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
