Article

Estimating Smoothness and Optimal Bandwidth for Probability Density Functions

by Dimitris N. Politis 1,*, Peter F. Tarassenko 2 and Vyacheslav A. Vasiliev 2
1 Department of Mathematics and Halicioglu Data Science Institute, University of California, San Diego, CA 92093-0112, USA
2 Institute of Applied Mathematics and Computer Science, Tomsk State University, 36 Lenin Ave., 634050 Tomsk, Russia
* Author to whom correspondence should be addressed.
Stats 2023, 6(1), 30-49; https://doi.org/10.3390/stats6010003
Submission received: 22 November 2022 / Revised: 21 December 2022 / Accepted: 21 December 2022 / Published: 27 December 2022
(This article belongs to the Special Issue Advances in Probability Theory and Statistics)

Abstract

The properties of non-parametric kernel estimators of probability density functions from two special classes are investigated. Each class is parametrized by a smoothness parameter of the distribution. One of the classes was introduced by Rosenblatt; the other is introduced in this paper. For the case of a known smoothness parameter, the rates of mean square convergence of the optimal (with respect to the bandwidth) density estimators are found. For the case of an unknown smoothness parameter, an estimation procedure for the parameter is developed and its almost sure convergence is proved. The convergence rates in the almost sure sense of these estimators are obtained. Adaptive estimators of densities from the given classes, based on the constructed smoothness parameter estimators, are presented. Examples show how the parameters of the adaptive density estimation procedures can be chosen. Non-asymptotic and asymptotic properties of these estimators are investigated. Specifically, upper bounds for the mean square error of the adaptive density estimators for a fixed sample size are found and their strong consistency is proved. The convergence of these estimators in the almost sure sense is established. Simulation results illustrate the onset of the asymptotic behavior as the sample size grows large.

1. Introduction

Let $X_1, \ldots, X_n$ be independent identically distributed random variables (i.i.d. r.v.’s) having a probability density function $f$. In the typical non-parametric set-up, nothing is assumed about $f$ except that it possesses a certain degree of smoothness, e.g., that it has $r$ continuous derivatives.
Estimating $f$ via kernel smoothing is a sixty-year-old problem; M. Rosenblatt, who was one of its originators, discusses the subject’s history and evolution in the monograph [1]. For some point $x$, the kernel smoothed estimator of $f(x)$ is defined by
$$f_{n,h}(x) = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{h}K\left(\frac{x-X_j}{h}\right), \tag{1}$$
where the kernel $K(\cdot)$ is a bounded function satisfying $\int K(x)\,dx = 1$ and $\int K^2(x)\,dx < \infty$, and the positive bandwidth parameter $h$ is a decreasing function of the sample size $n$.
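The estimator (1) can be sketched in a few lines of NumPy. This is only an illustration: the Gaussian kernel, the bandwidth value and the names `kde`, `sample` and `grid` are assumptions made here, not choices taken from the paper (which later works with flat-top kernels).

```python
import numpy as np

def kde(x, sample, h, kernel=None):
    """Kernel estimate f_{n,h}(x) = (1/n) * sum_j (1/h) * K((x - X_j) / h)."""
    if kernel is None:
        # Assumption for illustration only: standard Gaussian kernel.
        kernel = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    u = (x[:, None] - sample[None, :]) / h      # shape (len(x), n)
    return kernel(u).mean(axis=1) / h

rng = np.random.default_rng(0)
sample = rng.standard_normal(500)               # hypothetical data
grid = np.linspace(-6.0, 6.0, 1201)
density = kde(grid, sample, h=0.4)              # h = 0.4 is an arbitrary choice
```

Because the kernel integrates to one, the estimate integrates to (approximately) one over a grid that covers the data.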
If $K(\cdot)$ has finite moments up to $q$-th order, and moments of order up to $q-1$ equal to zero, then $q$ is called the ‘order’ of the kernel $K(\cdot)$. Since the unknown function $f$ is assumed to have $r$ continuous derivatives, it typically follows that
$$\mathrm{Var}(f_{n,h}(x)) = \frac{C_{f,K}(x)}{hn} + o\left(\frac{1}{hn}\right),$$
and
$$\mathrm{Bias}(f_{n,h}(x)) = c_{f,K}(x)\,h^{k} + o(h^{k}),$$
where $k = \min(q, r)$, and $C_{f,K}(x)$, $c_{f,K}(x)$ are bounded functions depending on $K(\cdot)$ as well as $f$ and its derivatives, cf. [1] p. 8.
The idea of choosing a kernel of order $q$ greater than or equal to $r$ in order to ensure that $\mathrm{Bias}(f_{n,h}(x))$ is $O(h^{r})$ dates back to the early 1960s in the work of [2,3]; recent references on higher-order kernels include the following: [4,5,6,7,8,9,10]. Note that since $r$ is typically unknown and can be arbitrarily large, it is possible to use kernels of infinite order that achieve the minimal bias condition $\mathrm{Bias}(f_{n,h}(x)) = O(h^{r})$ for any $r$; Ref. [11] gives many properties of kernels of infinite order. In this paper we will employ a particularly useful class of infinite order kernels, namely the flat-top family; see [12] for a general definition.
It is a well-known fact that optimal bandwidth selection is perhaps the most crucial issue in such non-parametric smoothing problems; see [13], as well as the book [14]. The goal typically is minimization of the large-sample mean squared error (MSE) of $f_{n,h}(x)$. However, to perform this minimization, the practitioner needs to know the degree of smoothness $r$, as well as the constants $C_{f,K}(x)$ and $c_{f,K}(x)$. Using an infinite order kernel and focusing just on optimizing the order of magnitude of the large-sample MSE, it is apparent that the optimal bandwidth $h$ must be asymptotically of order $n^{-1/(2r+1)}$; this yields a large-sample MSE of order $n^{-2r/(2r+1)}$.
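The rate $n^{-1/(2r+1)}$ comes from balancing the variance order $1/(nh)$ against the squared-bias order $h^{2r}$. The sketch below checks this balance numerically; the function names and the tuning constant `const` are hypothetical, and the unknown constants $C_{f,K}(x)$ and $c_{f,K}(x)$ are deliberately ignored.

```python
def rate_optimal_bandwidth(n, r, const=1.0):
    """Bandwidth of order n^(-1/(2r+1)); `const` is an unspecified tuning constant."""
    return const * n ** (-1.0 / (2.0 * r + 1.0))

def variance_order(n, h):
    """Order of the variance term, 1/(n h), constants dropped."""
    return 1.0 / (n * h)

def squared_bias_order(h, r):
    """Order of the squared bias, h^(2r), constants dropped."""
    return h ** (2.0 * r)

n, r = 32, 2
h = rate_optimal_bandwidth(n, r)
# At this bandwidth both terms have the same order, n^(-2r/(2r+1)).
balanced = (variance_order(n, h), squared_bias_order(h, r), n ** (-2.0 * r / (2.0 * r + 1.0)))
```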
A generalization of the above scenario is possible using a degree of smoothness $r$ that has another sense, and that is not necessarily an integer. Let $[r]$ denote the integer part of $r$, and define $\gamma = r - [r]$; then, one may assume that $f$ has $[r]$ continuous derivatives, and that the $[r]$th derivative satisfies a Lipschitz condition of order $\gamma$. Interestingly, even in this case where $f$ is assumed to belong to the Hölder class of degree $r$ (the derivative of the density function of the order $r$ satisfies the Lipschitz condition) the MSE–optimal bandwidth $h$ is still of order $n^{-1/(2r+1)}$ and again yields a large-sample MSE of the order $n^{-2r/(2r+1)}$ (see, e.g., [15,16,17,18] among others).
The problem of course is that, as previously mentioned, the underlying degree of smoothness $r$ is typically unknown. In Section 4 of the paper at hand, we develop an estimator $r_n$ of $r$ and prove its strong consistency; this is perhaps the first such result in the literature. In order to construct our estimator $r_n$, we operate under a class of functions that is slightly more general than the aforementioned Hölder class; this class of functions is formally defined in Section 2 via Equation (3) or (4).
Under such a condition on the tails of the characteristic function we are able to show in Section 3 that the optimized MSE of $\hat f_n(x)$ is again of order $n^{-2r/(2r+1)}$ for possibly non-integer $r$; this is true, for example, when the characteristic function $\phi(s)$ has tails of order $O(1/|s|^{r+1})$, see Example 2.
Furthermore, in Section 5 we develop an adaptive estimator $\hat f_n(x)$ that achieves the optimal MSE rate of $n^{-2r/(2r+1)}$ within a logarithmic factor despite the fact that $r$ is unknown; see the Examples after Theorem 3. A similar effect arises in the adaptive estimation problem for densities from the Hölder class; see [18,19,20]. It should be pointed out that problems of asymptotically optimal adaptive density estimation over other classes have also been considered in the literature; see, e.g., [14,21,22,23].
The construction of $\hat f_n(x)$ is rather technical; it uses the new estimator $r_n$, and it is inspired by the construction of sequential estimates although we are in a fixed $n$, non-sequential setting. As the major theoretical result of our paper, we are able to prove a non-asymptotic upper bound for the MSE of $\hat f_n(x)$ that satisfies the above mentioned optimal rate. Section 6 contains some simulation results showing the performance of the new estimator $\hat f_n(x)$ in practice. All proofs are deferred to Section 7, while Section 8 contains our conclusions and suggestions for future work.

2. Problem Set-Up and Basic Assumptions

Let $X_1, \ldots, X_n$ be i.i.d. having a probability density function $f$. Denote by $\phi(s) = \int_{-\infty}^{\infty} e^{isx} f(x)\,dx$ the characteristic function of $f$ and by $\phi_n(s) = \frac{1}{n}\sum_{k=1}^{n} e^{isX_k}$ the sample characteristic function. For some finite $r > 0$, define two families $F_r^{+}$ and $F_r^{-}$ of bounded, i.e.,
$$\exists\; 0 < \bar f < \infty:\quad \sup_{y \in R^1} f(y) \le \bar f, \tag{2}$$
and continuous functions $f$ satisfying one of the following conditions, respectively:
$$\int_{-\infty}^{\infty} |s|^{r}\,|\phi(s)|\,ds < \infty, \qquad \int_{-\infty}^{\infty} |s|^{r+\varepsilon}\,|\phi(s)|\,ds = \infty \quad \text{for all } \varepsilon > 0, \tag{3}$$
$$\int_{-\infty}^{\infty} |s|^{r-\varepsilon}\,|\phi(s)|\,ds < \infty, \qquad \int_{-\infty}^{\infty} |s|^{r}\,|\phi(s)|\,ds = \infty \quad \text{for all } 0 < \varepsilon < r. \tag{4}$$
In other words, $F_r^{+}$ is the family of functions (introduced by M. Rosenblatt) satisfying (2) and (3), while $F_r^{-}$ is the family of functions (introduced in this paper) satisfying (2) and (4). It should be noted that the new class $F_r^{-}$ is slightly wider than the classical class $F_r^{+}$.
In addition, define the family $F_{r,m}^{+}$ (respectively, $F_{r,m}^{-}$) as the family of functions $f$ that belong to $F_r^{+}$ (respectively, $F_r^{-}$) but with $f$ being such that the modulus $|\phi(s)|$ of its characteristic function has monotonically decreasing tails.
Consider the class $\Xi$ of non-parametric kernel smoothed estimators $f_{n,h}(x)$ of $f(x)$ as given in Equation (1). Note that we can alternatively express $f_{n,h}(x)$ in terms of the Fourier transform of the kernel $K(\cdot)$, i.e.,
$$f_{n,h}(x) = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{h}K\left(\frac{x-X_j}{h}\right) = \frac{1}{2\pi}\int \lambda(s,h)\,\phi_n(s)\,e^{-isx}\,ds,$$
where
$$\lambda(s,h) = \int \frac{1}{h}K\left(\frac{x}{h}\right) e^{isx}\,dx.$$
In this paper, we will employ the family of flat-top infinite order kernels, i.e., we will let the function $\lambda(s,h)$ be of the form
$$\lambda_c(s,h) = \begin{cases} 1 & \text{if } |s| \le 1/h, \\ g(s,h) & \text{if } 1/h < |s| \le c/h, \\ 0 & \text{if } |s| \ge c/h, \end{cases}$$
where $c$ is a fixed number in $[1, \infty)$ chosen by the practitioner, and $g(s,h)$ is some properly chosen continuous, real-valued function satisfying $g(s,h) = g(-s,h)$, $g(s,1) = g(s/h,h)$, and $|g(s,h)| \le 1$ for any $s$, with $g(1/h,h) = 1$ and $g(c/h,h) = 0$; see [12,24,25,26] for more details on the above flat-top family of kernels.
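A minimal sketch of a flat-top estimator computed in the Fourier domain follows. It assumes the linear taper $g(s,h) = (c - h|s|)/(c-1)$ with $c > 1$, and approximates the integral by a Riemann sum on an evenly spaced grid; the function names, the grid size `num` and the default `c=2.0` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def lambda_flat_top(s, h, c=2.0):
    """Flat-top lambda_c(s, h) with the linear taper (c - h|s|)/(c - 1); needs c > 1."""
    a = h * np.abs(np.asarray(s, dtype=float))
    return np.clip((c - a) / (c - 1.0), 0.0, 1.0)

def flat_top_density(x, sample, h, c=2.0, num=2001):
    """f_{n,h}(x) = (1/2pi) * integral of lambda_c(s,h) * phi_n(s) * exp(-i s x) ds."""
    sample = np.asarray(sample, dtype=float)
    s = np.linspace(-c / h, c / h, num)          # lambda_c vanishes outside this range
    phi_n = np.exp(1j * s[:, None] * sample[None, :]).mean(axis=1)  # sample char. function
    integrand = lambda_flat_top(s, h, c) * phi_n * np.exp(-1j * s * x)
    ds = s[1] - s[0]
    return float(np.real(integrand.sum() * ds) / (2.0 * np.pi))
```

For example, `flat_top_density(0.0, data, h=0.5)` evaluates the estimate at $x = 0$ for an array `data`; the imaginary part of the integral cancels by symmetry, so only the real part is kept.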
Define $g'_h(s,h)$ as the partial derivative of the function $g(s,h)$ with respect to the bandwidth $h$. We will also assume that for some $c_0 > 0$
$$\overline{\lim_{h \to 0}}\ \sup_{1/h < |s| < c/h} |g'_h(s,h)| / |s| < c_0. \tag{6}$$
Denote for every $0 \le \gamma < r$ the functions
$$\delta_\gamma(h) = \int_{1/h < |s| < c/h} |s|^{r-\gamma}\,|\phi(s)|\,ds \quad \text{when } h > 0, \quad \text{and} \quad \delta_\gamma(0) = 0.$$
From (3) and (5) it follows that $\delta_\gamma(h) = o(1)$ as $h \to 0$ for $f \in F_r^{+}$ and $\gamma = 0$, as well as for $f \in F_r^{-}$ and $0 < \gamma < r$. In other cases $\delta_\gamma(h) = \infty$.
Define the following classes: $\bar F_r = F_r^{+} \cup F_r^{-}$ and $\bar F_{r,m} = F_{r,m}^{+} \cup F_{r,m}^{-}$.
The main aim of the paper is the estimation of the parameter $r$ of these classes and adaptive estimation of densities from the class $\bar F_r$ with the unknown parameter $r$.

3. Asymptotic Mean Square Optimal Estimation

The mean square error (MSE) $u_f^2(f_{n,h}) = E_f (f_{n,h}(x) - f(x))^2$ of the estimators $f_{n,h}(x) \in \Xi$, $f \in \bar F_r$, has the following form:
$$u_f^2(f_{n,h}) = U_f^2(h,c) - \frac{1}{n}\left(\int K(v)\, f(x - hv)\,dv\right)^2,$$
where $U_f^2(h,c)$ is the principal term of the MSE,
$$U_f^2(h,c) = \frac{L_1 f(x)}{nh} + \left(\frac{1}{2\pi}\int_{1/h < |s| < c/h} (1 - g(s,h))\,\phi(s)\,e^{-isx}\,ds\right)^2,$$
$L_1 = \int K^2(v)\,dv$. Thus, in particular, $\sup_{f \in \bar F_r} \int K(v)\, f(x - hv)\,dv < \infty$.
To minimize the principal term $U_f^2(h,c)$ with respect to $h$ we set its first derivative with respect to $h$ to zero, which gives the following equality for the optimal (in the mean square sense) value $h^0 = h^0(n)$:
$$\int_{1/h^0 < |s| < c/h^0} (g(s,h^0) - 1)\,\phi(s)\,e^{-isx}\,ds \cdot \Big\{ c\,\phi(c/h^0)\,e^{-\frac{icx}{h^0}} + c\,\phi(-c/h^0)\,e^{\frac{icx}{h^0}} + (h^0)^2 \int_{1/h^0 < |s| < c/h^0} g'_h(s,h^0)\,\phi(s)\,e^{-isx}\,ds \Big\} = \frac{2\pi^2 L_1 f(x)}{n}. \tag{8}$$
From the definition of the class of kernels, for cases $\delta_\gamma(h) < \infty$ we have
$$\left|\int_{1/h < |s| < c/h} (g(s,h) - 1)\,\phi(s)\,e^{-isx}\,ds\right| \le 2 h^{r-\gamma}\,\delta_\gamma(h)$$
and for $h$ small enough, according to (6),
$$\left|\int_{1/h < |s| < c/h} g'_h(s,h)\,\phi(s)\,e^{-isx}\,ds\right| \le c_0\, h^{r-1-\gamma}\,\delta_\gamma(h).$$
Then, by the definition of the class $\bar F_{r,m}$, for $h$ small enough, denoting $c_1(\gamma) = \frac{r+1-\gamma}{2(c^{r+1-\gamma} - 1)}$, we have
$$\delta_\gamma(h) \ge \int_{1/h}^{c/h} |s|^{r-\gamma}\,ds \cdot \Big[\inf_{1/h < s < c/h} |\phi(s)| + \inf_{1/h < s < c/h} |\phi(-s)|\Big] \ge (c_1(\gamma))^{-1}\, h^{-(r+1-\gamma)} \big[\,|\phi(c/h)| + |\phi(-c/h)|\,\big].$$
Thus, for $h \ll 1$,
$$|\phi(c/h)| + |\phi(-c/h)| \le c_1(\gamma)\, h^{r+1-\gamma}\,\delta_\gamma(h)$$
and from (8) it follows that
$$(h^0)^{2r+1-2\gamma}\,\delta_\gamma^2(h^0) \ge \frac{\pi^2 L_1 f(x)}{(c_0 + c_1(\gamma))\,n}.$$
Define the number $h_1^0 = h_1^0(n)$ from the equality
$$(h_1^0)^{2r+1-2\gamma}\,\delta_\gamma^2(h_1^0) = \frac{\pi^2 L_1 f(x)}{(c_0 + c_1(\gamma))\,n}. \tag{9}$$
It is obvious that $0 < h_1^0 \le h^0$ and $(h_1^0)^{2r+1-2\gamma}\,\delta_\gamma^2(h_1^0) \le (h^0)^{2r+1-2\gamma}\,\delta_\gamma^2(h^0)$.
Then, from (7) and (9), for every $f \in \bar F_{r,m}$ and $f_n^0 = f_{n,h^0}$, as $n \to \infty$, we have
$$u_f^2(f_n^0) \le u_f^2(f_{n,h_1^0}) \le \frac{L_1 f(x)}{n h_1^0} + \frac{1}{\pi^2}(h_1^0)^{2r-2\gamma}\,\delta_\gamma^2(h_1^0) = C_\gamma \cdot \delta_\gamma^{\frac{2}{2r+1-2\gamma}}(h_1^0)\; n^{-\frac{2r-2\gamma}{2r+1-2\gamma}}, \tag{10}$$
where
$$C_\gamma = L_1^{\frac{2r-2\gamma}{2r+1-2\gamma}}(x) \left(1 + \frac{\pi^2}{c_0 + c_1(\gamma)}\right) \left(\frac{c_0 + c_1(\gamma)}{\pi^2}\right)^{\frac{1}{2r+1-2\gamma}}.$$
In this way we have proved the following theorem, which gives the rates of convergence of the random quantities $f_n^0(x)$ and $f_{n,h_1^0}(x)$. We can loosely call $f_n^0(x)$ and $f_{n,h_1^0}(x)$ ‘estimators’, although it is clear that these functions cannot be considered as estimators in the usual sense in view of the dependence of the bandwidths $h^0$ and $h_1^0$ on the unknown parameters $r$ and $f(x)$. Nevertheless, this theorem can be used for the construction of bona fide adaptive estimators with the optimal and suboptimal convergence rates; see Examples 1 and 2, as well as Section 5.3 in what follows.
Theorem 1.
Let $f(x) > 0$. Then, for the asymptotically optimal (with respect to the bandwidth $h$) in the MSE sense ‘estimator’ $f_n^0(x)$ of the function $f \in \bar F_r$ and for the ‘estimator’ $f_{n,h_1^0}(x)$ of $f \in \bar F_{r,m}$ the following limit relations, as $n \to \infty$, hold:
1. $\sup_{f \in \bar F_r} \left| \inf_h u_f^2(f_{n,h}) - U_f^2(h^0,c) \right| = O\left(\frac{1}{n}\right)$;
2. for every $f \in \bar F_{r,m}$, with $\gamma = 0$ if $f \in F_{r,m}^{+}$ and every $0 < \gamma < r$ if $f \in F_{r,m}^{-}$,
$$u_f^2(f_n^0) \le u_f^2(f_{n,h_1^0}) \le C_\gamma \cdot \delta_\gamma^{\frac{2}{2r+1-2\gamma}}(h_1^0)\; n^{-\frac{2r-2\gamma}{2r+1-2\gamma}}, \quad n \ge 1.$$
Remark 1.
The definition (9) of $h_1^0$ is essentially simpler than the definition (8) of the optimal bandwidth $h^0$. From Theorem 1 it follows that the (slightly) suboptimal ‘estimator’ $f_{n,h_1^0}$ can be successfully used instead.
It should be noted that the parameter $\gamma$ is chosen by the practitioner here and that $\gamma = 0$ if $f \in F_{r,m}^{+}$ but $0 < \gamma < r$ if $f \in F_{r,m}^{-}$, in which case we want to choose $\gamma$ close to 0.
We shall write in the sequel $\varphi(s) \sim \psi(s)$ as $s \to \infty$ instead of the limit relations
$$0 < \underline{\lim_{s \to \infty}} \frac{\varphi(s)}{\psi(s)} \le \overline{\lim_{s \to \infty}} \frac{\varphi(s)}{\psi(s)} < \infty.$$
Example 1.
Consider an estimation problem of the function $f \in F_{r,m}^{+}$, satisfying the following additional condition
$$|\phi(s)| \sim \frac{1}{|s|^{r+1} \ln^{1+\varphi} |s|} \quad \text{as } |s| \to \infty, \quad \varphi > 0,$$
using the kernel estimator $f_{n,h}(x) \in \Xi$.
By making use of (9) and (10) we find the rate of convergence of the MSE $u_f^2(f_n^0)$ and $u_f^2(f_{n,h_1^0})$. To this end we calculate
$$\delta_0(h) = \int_{1/h < |s| < c/h} |s|^{r}\,|\phi(s)|\,ds \sim \frac{1}{(\ln h^{-1})^{\varphi}} - \frac{1}{(\ln h^{-1} + \ln c)^{\varphi}} \sim \frac{1}{(\ln h^{-1})^{1+\varphi}}.$$
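It is easy to verify that $f_n^0, f_{n,h_1^0} \in \Xi$. Thus, from (9), as $n \to \infty$,
$$h_1^0 \sim (n \gamma_n)^{-\frac{1}{2r+1}},$$
where $\gamma_n \sim \ln^{-2(1+\varphi)} n$, $n \to \infty$, is a solution of the equation
$$\gamma_n \left[\ln n + \ln \gamma_n\right]^{2(1+\varphi)} = 1.$$
Therefore, as $n \to \infty$, we have
$$h_1^0 \sim \left(\frac{\ln^{2(1+\varphi)} n}{n}\right)^{\frac{1}{2r+1}} \quad \text{and} \quad u_f^2(f_{n,h_1^0}) = O\left(\left(\frac{1}{n^{2r} \ln^{2(1+\varphi)} n}\right)^{\frac{1}{2r+1}}\right).$$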
Consider the piecewise linear flat-top kernel $\lambda_c^{LIN}(s,h)$, introduced by [25] (see [26] as well):
$$\lambda_c^{LIN}(s,h) = \frac{c}{c-1}\left(1 - \frac{h}{c}|s|\right)^{+} - \frac{1}{c-1}\left(1 - h|s|\right)^{+},$$
where $(x)^{+} = \max(x, 0)$ is the positive part function.
Then, from (8) we obtain
$$\left\{\phi(c/h^0)\,e^{-\frac{icx}{h^0}} + \phi(-c/h^0)\,e^{\frac{icx}{h^0}}\right\}^2 \sim \frac{1}{n}$$
and, for $n$ large enough,
$$|\phi(c/h^0)|^2 + |\phi(-c/h^0)|^2 \ge \frac{C}{n}.$$
Thus, similarly to $h_1^0$, as $n \to \infty$, for $f \in F_r^{-}$ we find
$$h^0 \sim \left(\frac{\ln^{2(1+\varphi)} n}{n}\right)^{\frac{1}{2(r+1)}}$$
and
$$u_f^2(f_n^0) = O\left(\left(\frac{1}{n^{2r+1} \ln^{2(1+\varphi)} n}\right)^{\frac{1}{2(r+1)}}\right) = o\left(u_f^2(f_{n,h_1^0})\right).$$
Example 2.
Consider an estimation problem of the function $f \in F_{r,m}^{-}$, satisfying the following additional condition:
$$|\phi(s)| \sim \frac{1}{|s|^{r+1}} \quad \text{as } |s| \to \infty,$$
using the kernel estimator $f_{n,h}(x) \in \Xi$.
Using (9) and (10) we will find the rate of convergence of the MSE $u_f^2(f_n^0)$ and $u_f^2(f_{n,h_1^0})$. To this end, we calculate
$$\delta_\gamma(h) = \int_{1/h < |s| < c/h} |s|^{r-\gamma}\,|\phi(s)|\,ds \sim h^{\gamma}.$$
It is easy to verify that $f_n^0, f_{n,h_1^0} \in \Xi$. Thus, from (9), as $n \to \infty$,
$$h_1^0 \sim n^{-\frac{1}{2r+1}}.$$
Therefore, we have
$$u_f^2(f_{n,h_1^0}) = O\left(\frac{1}{n^{\frac{2r}{2r+1}}}\right), \quad n \to \infty.$$
Similarly to Example 1, as $n \to \infty$, for $f \in F_r^{-}$ we find
$$h^0 \sim \frac{1}{n^{\frac{1}{2(r+1)}}}$$
and
$$u_f^2(f_n^0) = O\left(\frac{1}{n^{\frac{2r+1}{2(r+1)}}}\right) = o\left(u_f^2(f_{n,h_1^0})\right).$$
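The orders stated in Example 2 for $\delta_\gamma(h)$, $h_1^0$ and the bound (10) can be traced in a short worked computation. The sketch below makes the simplifying assumption that $|\phi(s)|$ equals $|s|^{-(r+1)}$ exactly on the range of integration (the example only assumes this up to the relation $\sim$), with $c > 1$ and $0 < \gamma < r$.

```latex
% Assumption: |\phi(s)| = |s|^{-(r+1)} exactly for 1/h < |s| < c/h, with c > 1.
\begin{align*}
\delta_\gamma(h) &= 2\int_{1/h}^{c/h} s^{r-\gamma}\, s^{-(r+1)}\, ds
  = 2\int_{1/h}^{c/h} s^{-\gamma-1}\, ds
  = \frac{2\,(1-c^{-\gamma})}{\gamma}\, h^{\gamma}, \\
(h_1^0)^{2r+1-2\gamma}\,\delta_\gamma^2(h_1^0) &\sim (h_1^0)^{2r+1} \sim \frac{1}{n}
  \quad\Longrightarrow\quad h_1^0 \sim n^{-\frac{1}{2r+1}}, \\
\delta_\gamma^{\frac{2}{2r+1-2\gamma}}(h_1^0)\; n^{-\frac{2r-2\gamma}{2r+1-2\gamma}}
  &\sim n^{-\frac{2\gamma}{(2r+1)(2r+1-2\gamma)}-\frac{2r-2\gamma}{2r+1-2\gamma}}
  = n^{-\frac{2r}{2r+1}}.
\end{align*}
```

The last line uses $2\gamma + (2r-2\gamma)(2r+1) = 2r(2r+1-2\gamma)$, so the dependence on the auxiliary parameter $\gamma$ cancels in the final rate.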

4. Estimation of the Degree of Smoothness r

Define the functions
$$\Phi_\alpha(A,B) = \int_{A < |s| < B} |s|^{\alpha}\,|\phi(s)|\,ds, \qquad \Phi_\alpha = \Phi_\alpha(0,\infty),$$
$$\Phi_{n,\alpha}(A,B) = \int_{A < |s| < B} |s|^{\alpha}\,|\phi_n(s)|\,ds, \qquad \Phi_{n,\alpha} = \Phi_{n,\alpha}(0,\infty).$$
Let $(\delta_n)_{n \ge 1}$ and $(\varrho_n)_{n \ge 1}$ be two given sequences of positive numbers chosen by the practitioner such that $\delta_n \to 0$ and $\varrho_n \to \infty$ as $n \to \infty$. The sequence $(\delta_n)$ represents the ‘grid’-size in our search for the correct exponent $r$, while $(\varrho_n)$ represents an upper bound that limits this search.
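Define the following sets of non-random sequences
$$C_{+} = \Big\{ (A_n, B_n, \delta_n)_{n \ge 1}:\ A_n \to \infty,\ 0 < A_n < B_n,\ \delta_n \to 0 \text{ as } n \to \infty;\ \text{for some } m_0 \ge 2,\ \sum_{n \ge 1} \frac{B_n^{2m_0(\varrho_n + 1 + \delta_n)}}{n^{m_0}} < \infty;\ \Phi_{r+\varepsilon}(A_n,B_n) \to \infty,\ \varepsilon > 0 \Big\},$$
$$C_{-} = \Big\{ (A_n, B_n, \delta_n)_{n \ge 1}:\ A_n \to \infty,\ 0 < A_n < B_n,\ \delta_n \to 0 \text{ as } n \to \infty;\ \text{for some } m_0 \ge 2,\ \sum_{n \ge 1} \frac{B_n^{2m_0(\varrho_n + 1 + \delta_n)}}{n^{m_0}} < \infty;\ \Phi_{r}(A_n,B_n) \to \infty \Big\}.$$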
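Remark 2.
Formally, the definitions of the sets $C_{+}$, $C_{-}$ and, consequently, of the estimators $r_n^{+}$ and $r_n^{-}$, as well as of the sets $C_{+}^{*}$, $C_{-}^{*}$ defined below, depend on the unknown function $\Phi_\alpha(A,B)$. At the same time, the set $C_{+}$ (and, consequently, the estimator $r_n^{+}$ and the set $C_{+}^{*}$) can be defined independently of $\Phi_\alpha(A,B)$.
Indeed, denote $\alpha_s = |s|^{r+1}\,|\phi(s)|$ and
– let $f \in F_r^{+}$.
Then for every $\varepsilon > 0$
$$\lim_{s \to \infty} s^{\varepsilon/2}\,\alpha_s = \infty \quad \text{and} \quad \overline{\lim_{s \to \infty}}\ \alpha_s \cdot \log s < \infty.$$
Thus $(A_n, B_n, \delta_n) \in C_{+}$ for appropriately chosen $(\delta_n)$ and $A_n = O(B_n^{1/2})$ because (consider for simplification the case $A_n > 0$)
$$\Phi_{r+\varepsilon}(A_n,B_n) = \int_{A_n}^{B_n} |s|^{r+\varepsilon}\,|\phi(s)|\,ds = \int_{A_n}^{B_n} s^{\varepsilon-1}\alpha_s\,ds = \int_{0}^{B_n} s^{\varepsilon-1}\alpha_s\,ds - \int_{0}^{A_n} s^{\varepsilon-1}\alpha_s\,ds \ge C_1 \int_{0}^{B_n} s^{\varepsilon/2-1}\,ds - C_2 \int_{0}^{A_n} s^{\varepsilon-1}\log^{-1} s\,ds \sim B_n^{\varepsilon/2} - \Big[A_n^{\varepsilon}\log^{-1} A_n + \int_{0}^{A_n} s^{\varepsilon-1}\log^{-2} s\,ds\Big] \to \infty.$$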
According to the definition of the class $F_r^{-}$, it is impossible to find elements of the set $C_{-}$ independently of the function to be estimated without using a priori information about $f$. Consider one simple example.
– Let $f \in F_r^{-}$.
Suppose, e.g., in addition that
$$\underline{\lim_{s \to \infty}}\ \log s \cdot \frac{1}{s}\int_{0}^{s} |u|^{r+1}\,|\phi(u)|\,du > 0.$$
Then $(A_n, B_n, \delta_n) \in C_{-}$ for appropriately chosen $(\delta_n)$ and $A_n = o(B_n)$ because
$$\Phi_{r}(A_n,B_n) = \int_{A_n}^{B_n} s^{r}\,|\phi(s)|\,ds = \int_{A_n}^{B_n} s^{-1}\,d\Big(\int_{0}^{s}\alpha_u\,du\Big) = \frac{1}{s}\int_{0}^{s}\alpha_u\,du\,\Big|_{A_n}^{B_n} + \int_{A_n}^{B_n} s^{-2}\int_{0}^{s}\alpha_u\,du\,ds \ge C\int_{A_n}^{B_n}\frac{1}{s\log s}\,ds \sim C\log\log B_n \to \infty.$$
Other examples are given in Example 3 (see also Remark 3 and Example 4).
For an arbitrary given $H > 0$ chosen by the practitioner, define the estimators $(r_n^{+})_{n \ge 1}$ and $(r_n^{-})_{n \ge 1}$ of the parameter $r$ in (3) and (4) as follows:
$$r_n^{+} = \min\Big[\varrho_n,\ \big(\delta_n \cdot \inf\{k \ge 1:\ \Phi_{n,(k+1)\delta_n}(A_n,B_n) \ge H,\ (A_n,B_n,\delta_n) \in C_{+}\}\big)\Big], \tag{11}$$
$$r_n^{-} = \min\Big[\varrho_n,\ \big(\delta_n \cdot \inf\{k \ge 1:\ \Phi_{n,k\delta_n}(A_n,B_n) \ge H,\ (A_n,B_n,\delta_n) \in C_{-}\}\big)\Big]. \tag{12}$$
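The grid search in (11) can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `estimate_r_plus`, the trapezoidal quadrature with `num` nodes and all argument names are assumptions, and the membership condition $(A_n,B_n,\delta_n) \in C_{+}$ is taken for granted rather than checked.

```python
import numpy as np

def estimate_r_plus(sample, A, B, delta, rho, H=1.0, num=2000):
    """Sketch of r_n^+ = min[rho, delta * inf{k >= 1: Phi_{n,(k+1)delta}(A, B) >= H}]."""
    sample = np.asarray(sample, dtype=float)
    s = np.linspace(A, B, num)
    # |phi_n(s)| for s > 0; the range -B < s < -A contributes the same amount
    # because phi_n(-s) is the complex conjugate of phi_n(s).
    mod = np.abs(np.exp(1j * s[:, None] * sample[None, :]).mean(axis=1))
    ds = s[1] - s[0]

    def Phi_n(alpha):
        vals = s ** alpha * mod
        return 2.0 * ds * (vals.sum() - 0.5 * (vals[0] + vals[-1]))  # trapezoidal rule

    k = 1
    while k * delta < rho:           # beyond this point the minimum equals rho
        if Phi_n((k + 1) * delta) >= H:
            return k * delta
        k += 1
    return rho
```

A call such as `estimate_r_plus(data, A=1.0, B=np.log(len(data)), delta=0.1, rho=3.0)` mirrors the choice $B_n = \ln n$ discussed in Example 3; the specific values of `A`, `delta`, `rho` and `H` here are placeholders. The estimator $r_n^{-}$ in (12) differs only in using the exponent $k\delta_n$ instead of $(k+1)\delta_n$.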
Example 3.
For the functions $\phi(\cdot)$ from Examples 1 and 2, we can use the definitions (11) and (12) with the following choices:
$$B_n = \ln n, \qquad \varrho_n = \rho\,\frac{\ln n}{\ln \ln n}, \qquad \rho \in (0, (2m_0)^{-1}),$$
arbitrary $\delta_n \to 0$ and $A_n = o(B_n)$, as $n \to \infty$. Indeed, for $f \in F_{r,m}^{+}$ and every $\varepsilon > 0$ (Example 1),
$$\Phi_{r+\varepsilon}(A_n,B_n) = \int_{A_n < |s| < B_n} |s|^{r+\varepsilon}\,|\phi(s)|\,ds \sim B_n^{\varepsilon}\ln^{-\varphi} B_n - A_n^{\varepsilon}\ln^{-\varphi} A_n \sim \ln^{\varepsilon} n \,\ln^{-\varphi}\ln n \to \infty$$
and for $f \in F_{r,m}^{-}$ (Example 2),
$$\Phi_{r}(A_n,B_n) = \int_{A_n < |s| < B_n} |s|^{r}\,|\phi(s)|\,ds \sim \ln\frac{B_n}{A_n} \to \infty$$
and, consequently, the classes $C_{+}$ and $C_{-}$ are not empty.
Define
$$J_{n,\alpha} = \int_{A_n < |s| < B_n} |s|^{\alpha}\,|\phi_n(s) - \phi(s)|\,ds, \quad \alpha > 0, \quad n \ge 1.$$
Lemma 1.
Let $(A_n, B_n, \delta_n) \in C_{+} \cup C_{-}$. Then, for every $\alpha > 0$, $m \ge 1$ and $n \ge 1$ there exist positive numbers $C_{\alpha,m}$ such that
$$\sup_{f \in \bar F_r} E_f J_{n,\alpha}^{2m} \le C_{\alpha,m}\,\frac{B_n^{2m(\alpha+1)}}{n^{m}} \tag{13}$$
and for every $f \in \bar F_r$
$$J_{n,\varrho_n + \delta_n} = o(1) \quad P_f\text{-a.s.}$$
Define the sets $C_{+}^{*}$ and $C_{-}^{*}$ of non-random sequences $(A_n, B_n, \delta_n)_{n \ge 1}$:
$$C_{+}^{*} = \{(A_n, B_n, \delta_n) \in C_{+}:\ \lim_{n \to \infty} \Phi_{r+\delta_n}(A_n,B_n) = \infty\},$$
$$C_{-}^{*} = \{(A_n, B_n, \delta_n) \in C_{-}:\ \lim_{n \to \infty} \Phi_{r-\delta_n}(A_n,B_n) = 0\}.$$
Remark 3.
It can be directly verified that under the conditions of Remark 2 the sequences $(A_n, B_n, \delta_n) \in C_{+}^{*}$ if $A_n = O(B_n^{1/2})$ and $\delta_n^{-1} = o(\log B_n)$, as well as $(A_n, B_n, \delta_n) \in C_{-}^{*}$ if $A_n = o(B_n)$ and
$$\delta_n = o\left(\frac{\log\log\log B_n}{\log B_n}\right).$$
Moreover, under the conditions of Example 3.1, $(A_n, B_n, \delta_n) \in C_{+}^{*}$ if we put
$$\delta_n = \delta \cdot \frac{\ln\ln\ln(n+1)}{\ln\ln(n+1)}, \quad \delta > \varphi.$$
Example 4.
Consider the functions $\phi(\cdot)$ from Examples 1 and 2 and suppose that the smoothness parameter satisfies $r \le R$ for some known number $R$. Then the sequences $(A_n, B_n, \delta_n) \in C_{+}^{*} \cup C_{-}^{*}$ if we put
$$B_n = n^{b}, \quad 0 < b < \frac{m_0 - 1}{4(R+1)}, \quad A_n = o(B_n), \quad \varrho_n = R,$$
$$\delta_n = \delta \cdot \frac{\ln\ln(n+1)}{\ln(n+1)}, \quad \delta b > \varphi.$$
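Theorem 2.
The estimators $r_n^{+}$ and $r_n^{-}$, defined in (11) and (12), respectively, with $\varrho_n \to \infty$ have the following properties:
1.
(a) if $f \in F_r^{+}$ and $(A_n, B_n, \delta_n) \in C_{+}$, then
$$\lim_{n \to \infty} r_n^{+} = r \quad P_f\text{-a.s.};$$
(b) if $f \in F_r^{-}$ and $(A_n, B_n, \delta_n) \in C_{-}$, then
$$\lim_{n \to \infty} r_n^{-} = r \quad P_f\text{-a.s.};$$
2.
(a) if $f \in F_r^{+}$ and for some $\delta_n \to 0$ the sequences $(A_n, B_n, \delta_n) \in C_{+}^{*}$, then
$$\lim_{n \to \infty} \delta_n^{-1}(r_n^{+} - r) = 0 \quad P_f\text{-a.s.};$$
(b) if $f \in F_r^{-}$ and for some $\delta_n \to 0$ the sequences $(A_n, B_n, \delta_n) \in C_{-}^{*}$, then
$$\lim_{n \to \infty} \delta_n^{-1}(r_n^{-} - r) = 0 \quad P_f\text{-a.s.}$$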

5. Adaptive Estimation of the Functions $f \in \bar F_r$

The purpose of this section is the construction and investigation of an adaptive estimator of the function $f \in \bar F_r$ with unknown $r$, which can either serve as the main estimator (since it achieves the optimal rate of convergence within $\bar F_r$) or can serve as a ‘pilot’ estimator to be used in (8) and (9) for the construction of adaptive optimal and suboptimal bandwidths $\hat h^0$ and $\hat h_1^0$.

5.1. Adaptive MSE–Optimal Estimation

We define an adaptive estimator of $f \in \bar F_r$ as follows:
$$\hat f_n(x) = \frac{1}{n}\sum_{j=1}^{n} \Lambda_{j-1}(x - X_j) = \frac{1}{2\pi n}\sum_{j=1}^{n} \int \lambda_{j-1}(s)\,e^{-is(x - X_j)}\,ds, \tag{14}$$
where $\Lambda_{j-1}(z) = \frac{1}{\hat h_{j-1}} K\left(\frac{z}{\hat h_{j-1}}\right) = \frac{1}{2\pi}\int \lambda_{j-1}(s)\,e^{-isz}\,ds$ is the smoothing kernel, and $\lambda_{j-1}(s) = \lambda_c(s, \hat h_{j-1})$; the required bandwidths are defined by
$$\hat h_j = (j+1)^{-\frac{1}{1 + 2r(j)}}, \quad j \ge 1,$$
where $r(j) = r_j^{+}$ if $f \in F_r^{+}$ and $r(j) = r_j^{-}$ if $f \in F_r^{-}$; recall that the estimators $r_j^{+}$ and $r_j^{-}$ are defined in (11) and (12), respectively.
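The structure of (14), in which the $j$th observation is smoothed with a bandwidth built from the first $j-1$ observations only, can be sketched as below. Several assumptions are made for illustration: the linear flat-top kernel with $c > 1$ is used in its closed form $K(u) = (\cos u - \cos cu)/(\pi (c-1) u^2)$, the smoothness estimates are passed in as an array `r_seq` with `r_seq[j-1]` standing for $r(j-1)$, and the bandwidth formula is extended to index 0 so that $\hat h_0 = 1$. None of the function or argument names come from the paper.

```python
import numpy as np

def flat_top_kernel(u, c=2.0):
    """Closed-form kernel whose Fourier transform is the linear flat-top taper (c > 1)."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    out = np.full(u.shape, (c + 1.0) / (2.0 * np.pi))    # limiting value at u = 0
    nz = np.abs(u) > 1e-4                                # avoid cancellation near zero
    out[nz] = (np.cos(u[nz]) - np.cos(c * u[nz])) / (np.pi * (c - 1.0) * u[nz] ** 2)
    return out

def adaptive_density(x, sample, r_seq, c=2.0):
    """Sketch of (14): (1/n) * sum_j (1/h_{j-1}) * K((x - X_j) / h_{j-1})."""
    sample = np.asarray(sample, dtype=float)
    j = np.arange(1, len(sample) + 1, dtype=float)
    # h_{j-1} = ((j-1) + 1)^(-1/(1 + 2 r(j-1))), with r_seq[j-1] playing the role of r(j-1)
    h = j ** (-1.0 / (1.0 + 2.0 * np.asarray(r_seq, dtype=float)))
    return float(np.mean(flat_top_kernel((x - sample) / h, c) / h))
```

In practice `r_seq` would be produced by recomputing the smoothness estimator on growing prefixes of the data, which costs one estimator evaluation per observation; a constant `r_seq` reduces the sketch to a non-adaptive estimator with sequentially shrinking bandwidths.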
From the definition of $r(j)$ it follows that $\hat h_j \le \bar h_j$, $j \ge 1$, where $\bar h_j = (j+1)^{-\frac{1}{1 + 2\varrho_j}}$. Note that $\bar h_j \le 1$ and $\bar h_j \to 0$ if the following additional condition
$$\lim_{j \to \infty} \frac{\varrho_j}{\ln j} = 0 \tag{15}$$
on the sequence $(\varrho_j)$ defined in the beginning of Section 4 holds.
Denote
$$n_1 = \begin{cases} \sup\{n \ge 1:\ \Phi_{r}(A_n,B_n) > H - 1\} & \text{if } f \in F_r^{+}, \\ \sup\{n \ge 1:\ \Phi_{r-\delta_n}(A_n,B_n) > H - 1\} & \text{if } f \in F_r^{-}, \end{cases}$$
$$n_2 = \begin{cases} \sup\{n \ge 1:\ \Phi_{r+\delta_n}(A_n,B_n) < H + 1\} & \text{if } f \in F_r^{+}, \\ \sup\{n \ge 1:\ \Phi_{r}(A_n,B_n) < H + 1\} & \text{if } f \in F_r^{-}, \end{cases}$$
where the constant $H$ was first used in (11) and (12). Define the following sequences for $j \ge 0$, $0 \le \gamma \le r$,
$$h_j = (j+1)^{-\frac{1}{1 + 2r - 2\gamma}}, \qquad h_j^{*} = (j+1)^{-\frac{1}{1 + 2(r - \gamma - \delta_{j+1})}},$$
$$\tilde h_j = (j+1)^{-\frac{1}{1 + 2(r - \gamma + \delta_{j+1})}} \quad \text{and} \quad \Delta_\gamma(h) = \int_{|s| > h^{-1}} |s|^{r-\gamma}\,|\phi(s)|\,ds,$$
as well as the constants
$$C_1 = \bar f \cdot \int K^2(u)\,du, \qquad \tilde C_1 = C_1\Big(\sum_{j=1}^{n_1} j + C_{r,2m_0}\sum_{j > n_1} \frac{B_j^{4m_0(r+1)}}{j^{2m_0 - 1}}\Big), \qquad \tilde C_2(\gamma) = \bar f^{\,2} + C_2(\gamma),$$
$$C_2(\gamma) = \frac{1}{4\pi^2}\Big[\sum_{j=1}^{n_2} \bar h_{j-1}^{2r-2\gamma}\,\Delta_\gamma^2(\bar h_{j-1}) + C_{r+1,m_0}\sum_{j > n_2} \bar h_{j-1}^{2r-2\gamma}\,\Delta_\gamma^2(\bar h_{j-1})\,\frac{B_j^{2m_0(r+1+\delta_j)}}{n^{m_0}}\Big]$$
and the function
$$\Psi_\gamma(n) = \frac{C_1}{n^2}\sum_{j=1}^{n} \frac{1}{h_{j-1}^{*}} + \frac{1}{4\pi^2 n}\sum_{j=1}^{n} \tilde h_{j-1}^{2r-2\gamma}\,\Delta_\gamma^2(\tilde h_{j-1}).$$
Note that the summability of the series in the definitions of the constants $\tilde C_1$ and $C_2(\gamma)$ follows from the corresponding requirement in the definition of the classes $C_{+}$ and $C_{-}$.
The main properties of the constructed estimators are stated in the following theorem.
Theorem 3.
Let the sequences $(A_n, B_n, \delta_n)$ in the definition of the estimator $r_n^{+}$ belong to the set $C_{+}^{*}$ and those in the definition of the estimator $r_n^{-}$ belong to the set $C_{-}^{*}$, and let the condition (15) be fulfilled.
Let $\gamma = 0$ if $f \in F_r^{+}$ and $\gamma \in (0, r)$ if $f \in F_r^{-}$. Then, for every $f \in \bar F_r$ and $n \ge 1$ the estimator (14) has the following properties:
1. $u_f^2(\hat f_n) \le \Psi_\gamma(n) + \dfrac{\tilde C_1}{n^2} + \dfrac{\tilde C_2(\gamma)}{n}$;
2. the estimator $\hat f_n$ is strongly consistent: $\lim_{n \to \infty} \hat f_n(x) = f(x)$ $P_f$-a.s.
Example 11. (Examples 1 and 4 revisited, $f \in F_r^{+}$) In this case
$$\frac{1}{n^2}\sum_{j=1}^{n} \frac{1}{h_{j-1}^{*}} = \frac{1}{n^2}\sum_{j=1}^{n} \frac{1}{h_{j-1}} \cdot (\ln j)^{\frac{2\delta}{(1+2r)^2}} \sim \frac{1}{n h_n} \cdot (\ln n)^{\frac{2\delta}{(1+2r)^2}}$$
and
$$\frac{1}{n}\sum_{j=1}^{n} \tilde h_{j-1}^{2r}\,\Delta_0^2(\tilde h_{j-1}) \sim \frac{1}{n}\sum_{j=1}^{n} h_{j-1}^{2r} \cdot (\ln j)^{\frac{4r\delta}{(1+2r)^2} - 2\varphi}.$$
Thus, under the following conditions
$$m_0 > 16R(R+1) + 1, \qquad 4R < b < \frac{m_0 - 1}{4(R+1)}, \qquad \frac{\varphi}{b} < \delta < \frac{\varphi}{4R}$$
we have, as $n \to \infty$,
$$\frac{1}{n}\sum_{j=1}^{n} \tilde h_{j-1}^{2r}\,\Delta_0^2(\tilde h_{j-1}) = o\Big(\frac{1}{n^2}\sum_{j=1}^{n} \frac{1}{h_{j-1}^{*}}\Big)$$
and, consequently,
$$\Psi_0(n) \sim \frac{1}{n h_n} \cdot (\ln n)^{\frac{2\delta}{(1+2r)^2}} \sim \frac{1}{n^{\frac{2r}{1+2r}}} \cdot (\ln n)^{\frac{2\delta}{(1+2r)^2}}.$$
Then, according to Theorem 3, in this case the rate of convergence of the adaptive density estimators of $f \in F_r^{+}$ differs from the rate of the non-adaptive estimators in [26] by an extra log factor only.
For the functions $f \in F_r^{-}$ and $\gamma \in (0, \min(r, 1))$ from Examples 2 and 4, it is easy to verify that
$$\Psi_\gamma(n) \sim \frac{1}{n^{\frac{2(r-\gamma)}{1 + 2(r-\gamma)}}}\,(\ln n)^{\frac{\delta}{1 + 2(r-\gamma)}} \quad \text{as } n \to \infty.$$

5.2. A Symmetric Estimator

Noting that the construction of the estimator $\hat f_n(x)$ depends on the order by which the data $X_1, \ldots, X_n$ are employed, a simple improvement is immediately available. Let $X_{[1]} \le X_{[2]} \le \cdots \le X_{[n]}$ be the order statistics that are a sufficient statistic in the case of our i.i.d. sample $X_1, \ldots, X_n$. Hence, by the Rao–Blackwell theorem, the estimator
$$E\big(\hat f_n(x) \,\big|\, X_{[1]}, \ldots, X_{[n]}\big) \tag{16}$$
will have smaller (or, at least, not larger) MSE than $\hat f_n(x)$.
Unfortunately, the estimator (16) is difficult to compute. However, it is possible to construct a simple estimator that captures the same idea. To do this, consider all distinct permutations of the data $X_1, \ldots, X_n$, and order them in some fashion so that $X_1^{(k)}, \ldots, X_n^{(k)}$ is the $k$th permutation. For unifying presentation, the 1st permutation will be the original data $X_1, \ldots, X_n$. Because of the continuity of the r.v.s $X_1, \ldots, X_n$, the number of such permutations is $n!$ with probability one.
So let $\hat f_n^{(k)}(x)$ be the estimator $\hat f_n(x)$ as computed from the $k$th permutation $X_1^{(k)}, \ldots, X_n^{(k)}$, i.e., Equation (14) with $X_1^{(k)}, \ldots, X_n^{(k)}$ instead of $X_1, \ldots, X_n$.
Finally, let $b \le n!$ be a positive integer (possibly depending on $n$), and let
$$\bar f_{n,b}(x) = \frac{1}{b}\sum_{k=1}^{b} \hat f_n^{(k)}(x). \tag{17}$$
Theorem 4.
For any choice of $b$ $(\le n!)$, we have $MSE(\bar f_{n,b}(x)) \le MSE(\hat f_n(x))$.
Ideally, the practitioner would use a high value of b—even b = n ! if the latter is computationally feasible. However, even moderate values of b would give some improvement; in this case, the b permutations to be included in the construction of f ¯ n , b ( x ) might be picked randomly as in resampling/subsampling methods—see e.g., [27].
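The averaging over permutations can be sketched generically. In the snippet below, `estimator` is a hypothetical callable `(x, sample) -> float` standing in for any order-dependent estimator such as $\hat f_n$; the random permutations are drawn independently, so they are not guaranteed to be distinct, which is an approximation of the construction above that is harmless when $b$ is far smaller than $n!$.

```python
import numpy as np

def symmetrized(estimator, x, sample, b, rng=None):
    """Average an order-dependent estimator over b orderings of the sample."""
    rng = np.random.default_rng() if rng is None else rng
    sample = np.asarray(sample, dtype=float)
    values = [estimator(x, sample)]            # the 1st "permutation" is the original order
    for _ in range(b - 1):
        values.append(estimator(x, rng.permutation(sample)))
    return float(np.mean(values))
```

For an estimator that does not depend on the order of the data the average returns the estimator itself, so any change in the output reflects order dependence only.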

5.3. Adaptive Optimal Bandwidth

Define
$$L(h, \phi) = \int_{1/h < |s| < c/h} (g(s,h) - 1)\,\phi(s)\,e^{-isx}\,ds \cdot \Big\{ c\,\phi(c/h)\,e^{-\frac{icx}{h}} + c\,\phi(-c/h)\,e^{\frac{icx}{h}} + h^2 \int_{1/h < |s| < c/h} g'_h(s,h)\,\phi(s)\,e^{-isx}\,ds \Big\}.$$
According to (8) the optimal bandwidth $h^0$ is defined from the equality
$$L(h^0, \phi) = \frac{2\pi^2 L_1 f(x)}{n}.$$
Thus, it is natural to define the adaptive (to the unknown parameter $r$ and the function $f(x)$) optimal bandwidth $\hat h^0$ from the equality
$$L(\hat h^0, \phi_n) = \frac{2\pi^2 L_1 \bar f_{n,b}(x)}{n},$$
where the adaptive estimator $\bar f_{n,b}(x)$ is defined in (17).
It is hoped that the bandwidths $h^0$ and $\hat h^0$ have similar asymptotic properties in view of the fact that, according to Theorems 3 and 4, the function
$$n\,\Psi^{-1/2}(n)\,\big[L(h^0, \phi) - L(\hat h^0, \phi_n)\big]$$
is bounded in probability.

6. Simulation Results

In this section we provide results of a simulation study regarding the estimators introduced in Section 3.
Two flat-top kernels have been used in the simulation. The first one has the piecewise linear kernel characteristic function introduced in [26], i.e.,
$$\lambda(s) = \begin{cases} 1, & |s| \le 1, \\ (c - |s|)/(c - 1), & 1 < |s| < c, \\ 0, & |s| \ge c. \end{cases}$$
The piecewise linear characteristic function and corresponding kernel are shown in Figure 1.
The second case refers to the infinitely differentiable flat-top kernel characteristic function defined in [28], i.e.,
$$\lambda(s) = \begin{cases} 1, & |s| \le c, \\ \exp\big(-b\,\exp\big(-b/(|s| - c)^2\big)/(|s| - 1)^2\big), & c < |s| < 1, \\ 0, & |s| \ge 1. \end{cases}$$
The characteristic function and kernel of the second case are shown in Figure 2.
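The infinitely differentiable flat-top characteristic function can be evaluated as in the following sketch. The default values `b=1.0` and `c=0.5` are placeholders chosen here for illustration (the text above does not state which values were used in the simulation), and the function name is hypothetical.

```python
import numpy as np

def lambda_smooth(s, b=1.0, c=0.5):
    """Infinitely differentiable flat-top function: 1 on |s| <= c, 0 on |s| >= 1 (0 < c < 1)."""
    a = np.abs(np.atleast_1d(np.asarray(s, dtype=float)))
    out = np.zeros_like(a)                    # value 0 for |s| >= 1
    out[a <= c] = 1.0                         # flat top
    mid = (a > c) & (a < 1.0)
    am = a[mid]
    # exp(-b * exp(-b / (|s| - c)^2) / (|s| - 1)^2) on the transition region
    out[mid] = np.exp(-b * np.exp(-b / (am - c) ** 2) / (am - 1.0) ** 2)
    return out
```

On the transition region the function decreases monotonically from 1 to 0, and all derivatives vanish at both ends, which is what distinguishes it from the piecewise linear taper.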
We examine kernel density estimators of triangular, exponential, Laplace, and gamma (with various shape parameters) distributions. Figure 3, Figure 4 and Figure 5 illustrate the estimator MSE as a function of the sample size.
Using the notation $C(x) = \{0,\ x < 0;\ 1,\ x \ge 0\}$ for the Heaviside step function, the triangular density function is defined as $f(x) = ((\lambda - |x|)/\lambda^2)\,C(\lambda - x)\,C(\lambda + x)$, having characteristic function $\phi(s) = 2(1 - \cos(\lambda s))/(\lambda s)^2$. The Laplace density $f(x) = (\lambda/2)\exp(-\lambda|x|)$ has characteristic function $\phi(s) = \lambda^2/(\lambda^2 + s^2)$, and the gamma density $f(x) = \lambda^k x^{k-1} e^{-\lambda x}/\Gamma(k)$ has characteristic function $\phi(s) = \lambda^k/(\lambda - is)^k$.
In all cases we choose the scale parameter $\lambda$ so that the variance equals 1, and consider estimation of the density function $f(x)$ at the point $x = 1$.
All the above-mentioned characteristic functions $\phi(s)$ satisfy condition (4) for $r = 1$ (triangular and Laplace), and $r = k - 1$ (gamma, $k > 1$); therefore, all distributions belong to the family $F_r^{-}$ with the corresponding value of $r$. In addition, all $\phi(s)$ meet the requirements of Example 2. Thus, the bandwidth can be taken in the form $h = O(n^{-1/(2(r+1))})$ and the expected convergence rate of the kernel estimator MSE is $n^{-(2r+1)/(2(r+1))}$.
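The tail orders behind these values of $r$ can be checked numerically from the closed-form characteristic functions. In the sketch below the unit-variance scale values ($\lambda = \sqrt{6}$ for the triangular, $\lambda = \sqrt{2}$ for the Laplace and $\lambda = \sqrt{k}$ for the gamma density) are derived here from the standard variance formulas; the text only states that the variance is set to 1, and the function names are hypothetical.

```python
import numpy as np

def phi_triangular(s, lam=np.sqrt(6.0)):
    """Characteristic function of the triangular density on [-lam, lam] (s != 0)."""
    s = np.asarray(s, dtype=float)
    return 2.0 * (1.0 - np.cos(lam * s)) / (lam * s) ** 2

def phi_laplace(s, lam=np.sqrt(2.0)):
    """Characteristic function of the Laplace density (lam/2) exp(-lam |x|)."""
    return lam**2 / (lam**2 + np.asarray(s, dtype=float) ** 2)

def phi_gamma(s, k, lam=None):
    """Characteristic function of the gamma density with shape k and rate lam."""
    lam = np.sqrt(k) if lam is None else lam      # unit variance: k / lam^2 = 1
    return (lam / (lam - 1j * np.asarray(s, dtype=float))) ** k

s_big = 1.0e3
laplace_tail = abs(phi_laplace(s_big)) * s_big**2          # approaches lam^2 = 2, so r + 1 = 2
gamma_tail = abs(phi_gamma(s_big, k=3.0)) * s_big**3       # approaches lam^3, so r + 1 = k
```

The triangular characteristic function oscillates, but its modulus is bounded by $4/(\lambda s)^2$, so its tail is also of order $|s|^{-2}$.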
The main goal of the simulation study is investigation of the MSE behavior for the kernel estimator with the growth of the sample size. We generate sequences of 150 samples for sample sizes from 25 to 2000 with step 25, and for some distributions for sample sizes from 2000 to 20,000 with step 100 or 200. Then, for each sample size we calculate the estimator MSE multiplied by $n^{(2r+1)/(2(r+1))}$ and expect visual stabilization of the sequence of resulting values with the growth of $n$.
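A reduced version of this experiment can be sketched for the Laplace case ($r = 1$). The snippet is an illustration under stated assumptions, not a reproduction of the study: it uses the piecewise linear flat-top kernel in closed form with $c = 2$, the bandwidth $h = n^{-1/(2(r+1))}$ with constant 1, the unit-variance scale $\lambda = \sqrt{2}$, and hypothetical function names.

```python
import numpy as np

def flat_top_kernel(u, c=2.0):
    """Kernel whose Fourier transform is the piecewise linear flat-top function (c > 1)."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    out = np.full(u.shape, (c + 1.0) / (2.0 * np.pi))    # limiting value at u = 0
    nz = np.abs(u) > 1e-4
    out[nz] = (np.cos(u[nz]) - np.cos(c * u[nz])) / (np.pi * (c - 1.0) * u[nz] ** 2)
    return out

def scaled_mse(n, r=1.0, x=1.0, reps=150, seed=0):
    """Monte Carlo MSE at x for Laplace data, and the MSE times n^((2r+1)/(2(r+1)))."""
    rng = np.random.default_rng(seed)
    lam = np.sqrt(2.0)                                   # unit-variance Laplace
    truth = 0.5 * lam * np.exp(-lam * abs(x))
    h = n ** (-1.0 / (2.0 * (r + 1.0)))
    samples = rng.laplace(scale=1.0 / lam, size=(reps, n))
    estimates = flat_top_kernel((x - samples) / h).mean(axis=1) / h
    mse = float(np.mean((estimates - truth) ** 2))
    return mse, mse * n ** ((2.0 * r + 1.0) / (2.0 * (r + 1.0)))
```

Calling `scaled_mse(n)` over an increasing sequence of sample sizes gives the raw and scaled MSE values whose stabilization is examined in the figures.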
Typical examples of the simulation results are presented in Figure 3 (for $r = 1$), Figure 4 (for $r = 1$ and $r = 2$), and Figure 5 (for $r = 3$ and $r = 5$). The expected stabilization of the scaled MSE is observed in all cases. Moreover, increasing $r$ increases the sample size that is needed to reach the limiting asymptotic behavior. For $r = 1$ and $r = 2$ we can see stabilization starting from $n \approx 500$, for $r = 3$ it starts from $n \approx 1500$, while for $r = 5$ the asymptotic behavior is observed to start from sample size $n \approx 15{,}000$.

7. Technical Proofs

7.1. Proof of Lemma 1

First we note that for every $m \ge 1$ and $n \ge 1$ there exist positive numbers $\kappa_m$ such that
$$\sup_{f \in \bar F_r} E_f |\phi_n(s) - \phi(s)|^{2m} \le \frac{\kappa_m}{n^{m}}.$$
These inequalities follow from the Burkholder inequality (see, for example, [29]) for the martingale $\big(\sum_{k=1}^{n} (e^{isX_k} - \phi(s)),\ F_n^X\big)$, $F_n^X = \sigma\{X_1, \ldots, X_n\}$, and finiteness of the function $\phi(\cdot)$.
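For $m = 1$ the bound can be made explicit: since the summands $e^{isX_k} - \phi(s)$ are i.i.d. with mean zero, $E_f|\phi_n(s) - \phi(s)|^2 = (1 - |\phi(s)|^2)/n \le 1/n$. The following Monte Carlo sketch checks this identity for standard normal data; the distribution, the point `s`, and the sample and replication sizes are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, s = 200, 4000, 1.5
phi = np.exp(-0.5 * s**2)                         # characteristic function of N(0, 1) at s
samples = rng.standard_normal((reps, n))
phi_n = np.exp(1j * s * samples).mean(axis=1)     # one sample characteristic function per replication
mc_second_moment = float(np.mean(np.abs(phi_n - phi) ** 2))
exact_second_moment = (1.0 - phi**2) / n          # variance identity for m = 1
```

The two quantities agree up to Monte Carlo error, and both are below $1/n$.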
Using this and Hölder’s inequalities we can estimate
sup f F ¯ r E f J n , α 2 m = sup f F ¯ r E f A n < | s | < B n | s | α | ϕ n ( s ) ϕ ( s ) | d s 2 m
A n < | s | < B n | s | 2 m α 2 m 1 d s 2 m 1 · A n < | s | < B n sup f F ¯ r E f | ϕ n ( s ) ϕ ( s ) | 2 m d s C α , m B n 2 m ( α + 1 ) n m .
From the Borel–Cantelli lemma and the assumed summability of the right-hand side of (13) for m = m 0 , α = ϱ n + δ n and f F ¯ r follows the second assertion of Lemma 1.
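The last inequality in the Hölder estimate for $E_f J_{n,\alpha}^{2m}$ can be spelled out as follows. This is a sketch assuming $\alpha \ge 0$ and $0 \le A_n < B_n$; the explicit constant is one admissible choice and not necessarily the one intended by the authors. Writing $p = 2m\alpha/(2m-1)$,

$$\left( \int_{A_n < |s| < B_n} |s|^{p}\, ds \right)^{2m-1} \le \left( \frac{2 B_n^{p+1}}{p+1} \right)^{2m-1} = \left( \frac{2}{p+1} \right)^{2m-1} B_n^{2m\alpha + 2m - 1},$$

$$\int_{A_n < |s| < B_n} \sup_{f} E_f\, |\phi_n(s) - \phi(s)|^{2m}\, ds \le 2 B_n\, \kappa_m\, n^{-m},$$

so the product is at most $C_{\alpha,m} B_n^{2m(\alpha+1)} n^{-m}$ with $C_{\alpha,m} = 4^{m} \kappa_m (p+1)^{-(2m-1)}$.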

7.2. Proof of Theorem 2

We first prove statements 1(a) and 2(a) of Theorem 2. First, we show that, for $n$ large enough,
$$r_n^{+} < \varrho_n \quad P_f\text{-a.s.}$$
To this end, according to the definition of the estimator $r_n^{+}$, it is enough to establish for some $\alpha > r$ the limiting relation
$$\lim_{n \to \infty} \Phi_{n,\alpha}(A_n, B_n) = \infty \quad P_f\text{-a.s.},$$
which follows from the definition of the class $\mathcal{C}^{+}$ and Lemma 1:
$$\Phi_{n,\alpha}(A_n, B_n) = \Phi_{n,\alpha}(A_n, B_n) - \Phi_{\alpha}(A_n, B_n) + \Phi_{\alpha}(A_n, B_n) \ge \Phi_{\alpha}(A_n, B_n) - J_{n,\alpha} \to \infty \quad P_f\text{-a.s.}$$
From (18) and by the definition of the estimator $r_n^{+}$, for $n$ large enough, we have
$$\Phi_{n,\, r_n^{+} + \delta_n}(A_n, B_n) \ge H.$$
Thus,
$$\varlimsup_{n \to \infty} \Phi_{r_n^{+} + \delta_n}(A_n, B_n) \ge \varlimsup_{n \to \infty} \Phi_{n,\, r_n^{+} + \delta_n}(A_n, B_n) - \varlimsup_{n \to \infty} J_{n,\, \varrho_n + \delta_n} \ge H \quad P_f\text{-a.s.} \tag{19}$$
Analogously,
$$\varlimsup_{n \to \infty} \Phi_{r_n^{+}}(A_n, B_n) \le H \quad P_f\text{-a.s.} \tag{20}$$
From (19) and (20) it follows that, for any $\varepsilon > 0$ and $\delta > 0$ and for $n$ large enough,
$$r - \varepsilon - \delta < r_n^{+} < r + \varepsilon \quad P_f\text{-a.s.},$$
and assertion 1(a) of Theorem 2 is proved.
From the definitions of the estimator $r_n^{+}$ and the class $\mathcal{C}^{+}_{*}$, Chebyshev's inequality and (13), for $m \ge 1$, $n > n_1$ and $f \in \mathcal{F}_r^{+}$ we have
$$P_f\left( r_{n-1}^{+} < r - \delta_{n-1} \right) \le P_f\left( \Phi_{n,r}(A_n, B_n) \ge H \right) \le P_f\left( J_{n,r} \ge H - \Phi_{r}(A_n, B_n) \right)$$
$$\le \frac{E_f J_{n,r}^{2m}}{\left( H - \Phi_{r}(A_n, B_n) \right)^{2m}} \le C_{r,m}\, B_n^{2m(r+1)}\, n^{-m}. \tag{21}$$
Similarly to (21), for $n > n_2$ and $m \ge 1$ we obtain
$$P_f\left( r_{n-1}^{+} > r + \delta_{n-1} \right) \le P_f\left( \Phi_{n,\, r + \delta_n}(A_n, B_n) < H \right) \le P_f\left( \Phi_{r + \delta_n}(A_n, B_n) - H \le J_{n,\, r + \delta_n} \right) \le \frac{E_f J_{n,\, r + \delta_n}^{2m}}{\left( \Phi_{r + \delta_n}(A_n, B_n) - H \right)^{2m}} \le C_{r+1,m}\, B_n^{2m(r+1+\delta_n)}\, n^{-m} \tag{22}$$
and, consequently, for $f \in \mathcal{F}_r^{+}$,
$$P_f\left( \delta_n^{-1}\, |r_n^{+} - r| \ge 1 \right) \le 2\, C_{r+1,m}\, B_n^{2m(r+1+\delta_n)}\, n^{-m}. \tag{23}$$
Assertion 2(a) of Theorem 2 follows from the Borel–Cantelli lemma and the assumed summability of the right-hand side of (23) for $m \ge m_0$.
The other statements of Theorem 2, for the estimator $r_n^{-}$, can be proved analogously.
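The proof of assertion 1(a) rests on $\Phi_{n,\alpha}(A_n, B_n)$ growing without bound for some $\alpha > r$. A small numerical sketch may make this dichotomy concrete. It assumes, consistently with how $J_{n,\alpha}$ is used above but without the definition being restated in this section, that $\Phi_{\alpha}(A, B) = \int_{A < |s| < B} |s|^{\alpha} |\phi(s)|\, ds$. It uses the Laplace characteristic function $\phi(s) = 1/(1+s^2)$, for which the simulation section takes $r = 1$. The function names are hypothetical.

```python
import math


def laplace_cf_abs(s):
    """|phi(s)| for the standard Laplace distribution: 1 / (1 + s^2)."""
    return 1.0 / (1.0 + s * s)


def phi_alpha(alpha, a, b, cf_abs=laplace_cf_abs, steps=4000):
    """Trapezoid-rule value of the integral of |s|^alpha * |phi(s)| over a < |s| < b.

    Requires 0 < a < b. The integrand is even in s, so we integrate over (a, b)
    and double. A log-spaced grid (substitution s = exp(t)) is used because b
    may be large.
    """
    lo, hi = math.log(a), math.log(b)
    dt = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        s = math.exp(lo + i * dt)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (s ** alpha) * cf_abs(s) * s  # extra factor s is ds/dt
    return 2.0 * total * dt


if __name__ == "__main__":
    for b in (1e2, 1e3, 1e4):
        print(b, phi_alpha(0.5, 1.0, b), phi_alpha(1.5, 1.0, b))
```

For $\alpha = 0.5$ the values stay below 4 however large $B$ becomes, while for $\alpha = 1.5$ they grow roughly like $4\sqrt{B}$, which is the kind of divergence the threshold $H$ is meant to detect.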

7.3. Proof of Theorem 3

Consider the deviation of the estimator (14) in the following form:
$$\hat f_n(x) - f(x) = I_1(n) + I_2(n), \tag{24}$$
where
$$I_1(n) = \hat f_n(x) - \tilde f_n(x), \qquad I_2(n) = \tilde f_n(x) - f(x),$$
$$\tilde f_n(x) = \frac{1}{n} \sum_{j=1}^{n} \int K(z)\, f(x - \hat h_{j-1} z)\, dz = \frac{1}{2\pi n} \sum_{j=1}^{n} \int \lambda_{j-1}(s)\, \phi(s)\, e^{-isx}\, ds.$$
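Now we estimate second moments of $I_1(n)$ and $I_2(n)$. Denote $\mathcal{F}_j^X = \sigma\{X_1, \ldots, X_j\}$. For $f \in \overline{\mathcal{F}}_r$ we have
$$\begin{aligned}
E_f I_1^2(n) &= \frac{1}{n^2}\, E_f \left[ \sum_{j=1}^{n} \frac{1}{\hat h_{j-1}} \left( K\!\left( \frac{x - X_j}{\hat h_{j-1}} \right) - \hat h_{j-1} \int K(z)\, f(x - \hat h_{j-1} z)\, dz \right) \right]^2 \\
&= \frac{1}{n^2}\, E_f \sum_{j=1}^{n} \frac{1}{\hat h_{j-1}^2}\, E_f \left\{ \left[ K\!\left( \frac{x - X_j}{\hat h_{j-1}} \right) - \hat h_{j-1} \int K(z)\, f(x - \hat h_{j-1} z)\, dz \right]^2 \,\Big|\, \mathcal{F}_{j-1}^X \right\} \\
&= \frac{1}{n^2}\, E_f \sum_{j=1}^{n} \frac{1}{\hat h_{j-1}} \int \left[ K(u) - \hat h_{j-1} \int K(z)\, f(x - \hat h_{j-1} z)\, dz \right]^2 f(x - \hat h_{j-1} u)\, du \\
&\le \frac{C_1}{n^2} \sum_{j=1}^{n} E_f \frac{1}{\hat h_{j-1}} + \frac{\bar f^{\,2}}{n} \le \frac{C_1}{n^2} \sum_{j=1}^{n} \frac{1}{h_{j-1}^{*}} + \frac{C_1}{n^2} \sum_{j=1}^{n} j\, P_f\left( r(j) < r - \delta_j \right) + \frac{\bar f^{\,2}}{n}.
\end{aligned}$$
From (21) for $m = 2m_0$ we obtain
$$\begin{aligned}
E_f I_1^2(n) &\le \frac{C_1}{n^2} \sum_{j=1}^{n} \frac{1}{h_{j-1}^{*}} + \frac{C_1}{n^2} \sum_{j=1}^{n_1} j + \frac{C_1 C_{r,2m_0}}{n^2} \sum_{j > n_1} j\, B_j^{4m_0(r+1)}\, j^{-2m_0} + \frac{\bar f^{\,2}}{n} \\
&= \frac{C_1}{n^2} \sum_{j=1}^{n} \frac{1}{h_{j-1}^{*}} + \frac{\tilde C_1}{n^2} + \frac{\bar f^{\,2}}{n}.
\end{aligned} \tag{25}$$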
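Further, by the definition of the function $\tilde f_n(x)$, the Cauchy–Bunyakovskii–Schwarz inequality and (22), we have
$$\begin{aligned}
E_f I_2^2(n) &= \frac{1}{4\pi^2 n^2}\, E_f \left[ \sum_{j=1}^{n} \int_{|s| \ge \hat h_{j-1}^{-1}} \left( \lambda_{j-1}(s) - 1 \right) \phi(s)\, e^{-isx}\, ds \right]^2 \\
&\le \frac{1}{4\pi^2 n} \sum_{j=1}^{n} E_f \left( \int_{|s| \ge \hat h_{j-1}^{-1}} |\phi(s)|\, ds \right)^2 \\
&\le \frac{1}{4\pi^2 n} \sum_{j=1}^{n} E_f\, \hat h_{j-1}^{2r - 2\gamma} \left( \int_{|s| \ge \hat h_{j-1}^{-1}} |s|^{r-\gamma}\, |\phi(s)|\, ds \right)^2 = \frac{1}{4\pi^2 n} \sum_{j=1}^{n} E_f\, \hat h_{j-1}^{2r - 2\gamma}\, \Delta_\gamma^2(\hat h_{j-1}) \\
&\le \frac{1}{4\pi^2 n} \left[ \sum_{j=1}^{n} \tilde h_{j-1}^{2r - 2\gamma}\, \Delta_\gamma^2(\tilde h_{j-1}) + \sum_{j=1}^{n_2} \bar h_{j-1}^{2r - 2\gamma}\, \Delta_\gamma^2(\bar h_{j-1}) + \sum_{j > n_2} \bar h_{j-1}^{2r - 2\gamma}\, \Delta_\gamma^2(\bar h_{j-1})\, P_f\left( r(j) > r - \gamma + \delta_j \right) \right] \\
&\le \frac{1}{4\pi^2 n} \left[ \sum_{j=1}^{n} \tilde h_{j-1}^{2r - 2\gamma}\, \Delta_\gamma^2(\tilde h_{j-1}) + \sum_{j=1}^{n_2} \bar h_{j-1}^{2r - 2\gamma}\, \Delta_\gamma^2(\bar h_{j-1}) + C_{r+1,m_0} \sum_{j > n_2} \bar h_{j-1}^{2r - 2\gamma}\, \Delta_\gamma^2(\bar h_{j-1})\, B_j^{2m_0(r+1+\delta_j)}\, j^{-m_0} \right] \\
&\le \frac{1}{4\pi^2 n} \sum_{j=1}^{n} \tilde h_{j-1}^{2r - 2\gamma}\, \Delta_\gamma^2(\tilde h_{j-1}) + \frac{C_2}{n}.
\end{aligned} \tag{26}$$
The first assertion of Theorem 3 follows from (24)–(26).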
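The structure of the estimator analysed in this proof, an average of terms $\hat h_{j-1}^{-1} K\big((x - X_j)/\hat h_{j-1}\big)$ in which the bandwidth for observation $X_j$ depends only on $X_1, \ldots, X_{j-1}$, can be sketched as follows. This is a minimal illustration and not the authors' implementation: `bandwidth_rule`, `toy_rule` and the Gaussian stand-in kernel are hypothetical placeholders, whereas in the paper $\hat h_{j-1}$ is built from the estimated smoothness.

```python
import math


def recursive_kde(x, data, bandwidth_rule, kernel):
    """Evaluate (1/n) * sum_j K((x - X_j) / h_{j-1}) / h_{j-1} at the point x.

    bandwidth_rule(past) must return a positive bandwidth computed only from
    `past`, the observations X_1, ..., X_{j-1} seen before X_j arrives.
    """
    total = 0.0
    for j, xj in enumerate(data):
        h_prev = bandwidth_rule(data[:j])  # uses X_1..X_{j-1} only
        total += kernel((x - xj) / h_prev) / h_prev
    return total / len(data)


def gaussian_kernel(u):
    """Standard normal density, used here only as a simple stand-in kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def toy_rule(past, r=1.0):
    """Hypothetical rule h = (len(past) + 1)^(-1/(2(r+1))) with a fixed, known r."""
    return (len(past) + 1) ** (-1.0 / (2.0 * (r + 1.0)))


if __name__ == "__main__":
    sample = [0.3, -1.2, 0.8, 0.1, -0.4, 2.0]
    print(recursive_kde(0.0, sample, toy_rule, gaussian_kernel))
```

Because each bandwidth is measurable with respect to the past observations, the centred summands form martingale differences, which is what the conditional-expectation step in the bound for $E_f I_1^2(n)$ uses.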
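For the proof of the second assertion we first estimate, for some integer $m > 1$, the rate of convergence of the moment $E_f I_1^{2m}(n)$.
Let $\alpha_1, \ldots, \alpha_m$ be non-negative integers and denote
$$\tilde K_j(x) = K\!\left( \frac{x - X_j}{\hat h_{j-1}} \right) - \hat h_{j-1} \int K(z)\, f(x - \hat h_{j-1} z)\, dz,$$
$$\bar K_j(x) = \int \left[ K(u) - \hat h_{j-1} \int K(z)\, f(x - \hat h_{j-1} z)\, dz \right]^2 f(x - \hat h_{j-1} u)\, du.$$
By the Burkholder inequality for the martingale $\sum_{j=1}^{n} \frac{1}{\hat h_{j-1}} \tilde K_j(x)$ we have
$$\begin{aligned}
E_f I_1^{2m}(n) &= \frac{1}{n^{2m}}\, E_f \left[ \sum_{j=1}^{n} \frac{1}{\hat h_{j-1}}\, \tilde K_j(x) \right]^{2m} \le \frac{C}{n^{2m}}\, E_f \left[ \sum_{j=1}^{n} \frac{1}{\hat h_{j-1}^2}\, \tilde K_j^2(x) \right]^{m} \\
&\le \frac{C}{n^{2m}} \sum_{\alpha_1 + \cdots + \alpha_m = m}\ \sum_{1 \le j_1 < \cdots < j_m \le n} E_f\, \frac{1}{\hat h_{j_1 - 1}^2 \cdots \hat h_{j_m - 1}^2}\, \tilde K_{j_1}^2(x) \cdots \tilde K_{j_m}^2(x) \\
&\le \frac{C}{n^{2m}} \sum_{\alpha_1 + \cdots + \alpha_m = m}\ \sum_{1 \le j_1 < \cdots < j_m \le n} E_f\, \frac{\tilde K_{j_1}^2(x) \cdots \tilde K_{j_{m-1}}^2(x) \cdot \bar K_{j_m}(x)}{\hat h_{j_1 - 1}^2 \cdots \hat h_{j_{m-1} - 1}^2 \cdot h_{j_m - 1}^{*}} \\
&\le \frac{C}{n^{2m}} \sum_{\alpha_1 + \cdots + \alpha_m = m}\ \sum_{1 \le j_1 < \cdots < j_m \le n} E_f\, \frac{1}{\hat h_{j_1 - 1}^2 \cdots \hat h_{j_{m-2} - 1}^2 \cdot h_{j_{m-1} - 1}^{*} \cdot h_{j_m - 1}^{*}} \\
&\qquad \cdot \tilde K_{j_1}^2(x) \cdots \tilde K_{j_{m-2}}^2(x) \cdot \bar K_{j_{m-1}}(x) \le \cdots \le \frac{C}{n^{2m}} \left( \sum_{j=1}^{n} \frac{1}{h_{j-1}^{*}} \right)^{m}.
\end{aligned}$$
By the definition of $h_j^{*}$, for some $0 < r^{*} < r - \gamma$ and $j^{*} < \infty$, $h_j^{*} \ge n^{-\frac{1}{1 + 2r^{*}}}$ and, consequently,
$$E_f I_1^{2m}(n) = O\!\left( n^{-\frac{2mr^{*}}{1 + 2r^{*}}} \right) \quad \text{as } n \to \infty.$$
Thus $\frac{2mr^{*}}{1 + 2r^{*}} > 1$ for $m > \frac{1 + 2r^{*}}{2r^{*}}$ and, by the Borel–Cantelli lemma, as $n \to \infty$,
$$I_1(n) \to 0 \quad P_f\text{-a.s.} \tag{27}$$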
Further,
$$|I_2(n)| \le \frac{1}{4\pi^2 n} \sum_{j=1}^{n} \hat h_{j-1}^{\,r}\, |\Delta_\gamma(\hat h_{j-1})|$$
and, consequently, as $n \to \infty$,
$$I_2(n) \to 0 \quad P_f\text{-a.s.} \tag{28}$$
The second assertion of Theorem 3 follows from (27) and (28).
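The Borel–Cantelli step for $I_1(n)$ needs the moment bound of order $n^{-2mr^{*}/(1+2r^{*})}$ to be summable in $n$, which is why $m$ must exceed $(1 + 2r^{*})/(2r^{*})$. The short sketch below computes the smallest admissible integer $m$ for a given $r^{*}$; the function name is hypothetical and exact rational arithmetic is used to avoid rounding at the boundary.

```python
from fractions import Fraction
import math


def min_moment_order(r_star):
    """Smallest integer m with 2*m*r_star / (1 + 2*r_star) > 1."""
    r_star = Fraction(r_star)
    if r_star <= 0:
        raise ValueError("r_star must be positive")
    threshold = (1 + 2 * r_star) / (2 * r_star)
    return math.floor(threshold) + 1  # strict inequality m > threshold


if __name__ == "__main__":
    for r_star in (Fraction(1, 4), Fraction(1, 2), 1, 2):
        m = min_moment_order(r_star)
        exponent = 2 * m * Fraction(r_star) / (1 + 2 * Fraction(r_star))
        print(f"r*={r_star}: m={m}, decay exponent={exponent}")
```

Smaller values of $r^{*}$ force higher moments: for $r^{*} = 1/4$ the sketch gives $m = 4$, whereas $r^{*} = 1$ already works with $m = 2$.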

7.4. Proof of Theorem 4

Note that the distribution of $\hat f_n^{(k)}(x)$ is the same as that of $\hat f_n^{(j)}(x)$ for all $k, j$. Hence, $E \bar f_{n,b}(x) = E \hat f_n(x)$. Now $|\mathrm{Cov}(\hat f_n^{(k)}(x), \hat f_n^{(j)}(x))| \le \mathrm{Var}(\hat f_n^{(1)}(x))$ by the Cauchy–Schwarz inequality and the fact that $\mathrm{Var}(\hat f_n^{(k)}(x)) = \mathrm{Var}(\hat f_n^{(j)}(x))$. Thus,
$$\mathrm{Var}(\bar f_{n,b}(x)) = \frac{1}{b^2} \sum_{k=1}^{b} \sum_{j=1}^{b} \mathrm{Cov}(\hat f_n^{(k)}(x), \hat f_n^{(j)}(x)) \le \mathrm{Var}(\hat f_n^{(1)}(x)),$$
and the theorem is proven.
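The variance identity used in this proof is easy to check numerically. The sketch below is an illustration only: it takes an equicorrelated covariance matrix, which is an assumption made for convenience (the proof requires only equal variances), and confirms that the variance of the average does not exceed the common variance. The function names are hypothetical.

```python
def variance_of_average(cov):
    """Variance of the arithmetic mean of b variables with covariance matrix cov."""
    b = len(cov)
    return sum(sum(row) for row in cov) / (b * b)


def equicorrelated(b, sigma2, rho):
    """b x b covariance matrix with common variance sigma2 and correlation rho."""
    return [[sigma2 if k == j else rho * sigma2 for j in range(b)] for k in range(b)]


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        cov = equicorrelated(4, 2.0, rho)
        print(rho, variance_of_average(cov))
```

With perfectly correlated replicates ($\rho = 1$) the bound is attained, and the averaging brings a strict reduction whenever the replicates are less than perfectly correlated.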

8. Conclusions

Non-parametric kernel estimation crucially depends on the bandwidth choice which, in turn, depends on the smoothness of the underlying function. Focusing on estimating a probability density function, we define a smoothness class and propose a data-based estimator of the underlying degree of smoothness. The convergence rates in the almost sure sense of the proposed estimators are obtained. Adaptive estimators of densities from the given class on the basis of the constructed smoothness parameter estimators are also presented, and their consistency is established. Simulation results illustrate the realization of the asymptotic behavior when the sample size grows large.
Recently, there has been increasing interest in nonparametric estimation with dependent data, in terms of both theory and applications; see, e.g., [15,30,31,32,33]. With respect to probability density estimation, many asymptotic results remain true when moving from i.i.d. data to data that are weakly dependent. For example, the estimator variance, bias and MSE have the same asymptotic expansions as in the i.i.d. case, subject to some limitations on the allowed bandwidth rate; fortunately, the optimal bandwidth rate of $n^{-1/5}$ is in the allowed range (see [34,35]).
Consequently, it is conjectured that our proposed estimator of smoothness, as well as the resulting data-based bandwidth choice and probability density estimator, will retain their validity even when the data are weakly dependent. Confirming this conjecture is left to future work, especially since working with dependent data can be quite intricate. For example, [36] extended the results of [34] from the realm of linear time series to strong-mixing processes. In so doing, Remark 5 of [36] pointed to a nontrivial error in the work of [34] which is directly relevant to optimal bandwidth choice.

Author Contributions

All authors contributed equally to this project. All authors have read and agreed to the published version of the manuscript.

Funding

Partial funding by NSF grant DMS 19-14556.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are very thankful to O. Lepskii for helpful comments and remarks.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rosenblatt, M. Stochastic Curve Estimation; NSF-CBMS Regional Conference Series in Probability and Statistics, 3; Institute of Mathematical Statistics: Hayward, CA, USA, 1991.
  2. Bartlett, M.S. Statistical estimation of density function. Sankhya Indian J. Stat. 1963, A25, 245–254.
  3. Parzen, E. On estimation of a probability density function and mode. Ann. Math. Stat. 1962, 33, 1065–1076.
  4. Devroye, L. A Course in Density Estimation; Birkhäuser: Boston, MA, USA; Basel, Switzerland; Stuttgart, Germany, 1987.
  5. Gasser, T.; Müller, H.-G.; Mammitzsch, V. Kernels for nonparametric curve estimation. J. R. Stat. Soc. Ser. B 1985, 60, 238–252.
  6. Granovsky, B.L.; Müller, H.-G. Optimal kernel methods: A unifying variational principle. Int. Stat. Rev. 1991, 59, 373–388.
  7. Jones, M.C. On higher order kernels. J. Nonparametric Stat. 1995, 5, 215–221.
  8. Marron, J.S. Visual understanding of higher order kernels. J. Comput. Graph. Stat. 1994, 3, 447–458.
  9. Müller, H.-G. Nonparametric Regression Analysis of Longitudinal Data; Springer: Berlin/Heidelberg, Germany, 1988.
  10. Scott, D.W. Multivariate Density Estimation: Theory, Practice and Visualization; Wiley: New York, NY, USA, 1992.
  11. Devroye, L. A note on the usefulness of superkernels in density estimation. Ann. Stat. 1992, 20, 2037–2056.
  12. Politis, D.N. On nonparametric function estimation with infinite-order flat-top kernels. In Probability and Statistical Models with Applications; Charalambides, C., Ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2001; pp. 469–483.
  13. Jones, M.C.; Marron, J.S.; Sheather, S.J. A brief survey of bandwidth selection for density estimation. J. Am. Stat. Assoc. 1996, 91, 401–407.
  14. Tsybakov, A. Introduction to Nonparametric Estimation; Springer Series in Statistics; Springer: New York, NY, USA, 2009.
  15. Dobrovidov, A.V.; Koshkin, G.M.; Vasiliev, V.A. Non-Parametric State Space Models; Kendrick Press: Heber City, UT, USA, 2012.
  16. Ibragimov, I.A.; Khasminskii, R.Z. Statistical Estimation: Asymptotic Theory; Springer: Berlin/Heidelberg, Germany, 1981.
  17. Lehmann, E.L.; Romano, J.P. Testing Statistical Hypotheses, 4th ed.; Springer Texts in Statistics: New York, NY, USA, 2022.
  18. Lepskii, O.V.; Spokoiny, V.G. Optimal pointwise adaptive methods in nonparametric estimation. Ann. Stat. 1997, 25, 2512–2546.
  19. Brown, L.D.; Low, M.G. Superefficiency and Lack of Adaptibility in Functional Estimation; Technical Report; Cornell University: Ithaca, NY, USA, 1992.
  20. Lepskii, O.V. On a problem of adaptive estimation in Gaussian white noise. Theory Probab. Its Appl. 1990, 35, 454–466.
  21. Butucea, C. Exact adaptive pointwise estimation on Sobolev classes of densities. ESAIM Probab. Stat. 2001, 5, 1–31.
  22. Goldenshluger, A.; Lepski, O. Bandwidth selection in kernel density estimation: Oracle inequalities and adaptive minimax optimality. Ann. Statist. 2011, 39, 1608–1632.
  23. Lacour, C.; Massart, P.; Rivoirard, V. Estimator selection: A new method with applications to kernel density estimation. Sankhya A 2017, 79, 298–335.
  24. Politis, D.N. Adaptive bandwidth choice. J. Nonparametric Stat. 2003, 15, 517–533.
  25. Politis, D.N.; Romano, J.P. On a Family of Smoothing Kernels of Infinite Order. In Computing Science and Statistics, Proceedings of the 25th Symposium on the Interface, San Diego, CA, USA, 14–17 April 1993; Tarter, M., Lock, M., Eds.; The Interface Foundation of North America: San Diego, CA, USA, 1993; pp. 141–145.
  26. Politis, D.N.; Romano, J.P. Multivariate density estimation with general flat-top kernels of infinite order. J. Multivar. Anal. 1999, 68, 1–25.
  27. Politis, D.N.; Romano, J.P.; Wolf, M. Subsampling; Springer: New York, NY, USA, 1999.
  28. McMurry, T.; Politis, D.N. Nonparametric regression with infinite order flat-top kernels. J. Nonparametric Stat. 2004, 16, 549–562.
  29. Liptser, R.; Shiryaev, A. Theory of Martingales; Springer: New York, NY, USA, 1988.
  30. Bijloos, G.; Meyers, J.A. Fast-Converging Kernel Density Estimator for Dispersion in Horizontally Homogeneous Meteorological Conditions. Atmosphere 2021, 12, 1343.
  31. Cortes Lopez, J.C.; Jornet Sanz, M. Improving Kernel Methods for Density Estimation in Random Differential Equations Problems. Math. Comput. Appl. 2020, 25, 33.
  32. Correa-Quezada, R.; Cueva-Rodriguez, L.; Alvarez-Garcia, J.; del Rio-Rama, M.C. Application of the Kernel Density Function for the Analysis of Regional Growth and Convergence in the Service Sector through Productivity. Mathematics 2020, 8, 1234.
  33. Vasiliev, V.A. A truncated estimation method with guaranteed accuracy. Ann. Inst. Stat. Math. 2014, 66, 141–163.
  34. Hallin, M.; Tran, L.T. Kernel density estimation for linear processes: Asymptotic normality and bandwidth selection. Ann. Inst. Stat. Math. 1996, 48, 429–449.
  35. Wu, W.-B.; Mielniczuk, J. Kernel density estimation for linear processes. Ann. Stat. 2002, 30, 1441–1459.
  36. Lu, Z. Asymptotic normality of kernel density estimators under dependence. Ann. Inst. Stat. Math. 2001, 53, 447–468.
Figure 1. Piecewise linear characteristic function (left) and corresponding kernel (right), c = 1.5 .
Figure 2. Infinitely differentiable flat-top characteristic function (left) and corresponding kernel (right), c = 0.05 , b = 1 .
Figure 3. MSE of kernel estimators multiplied by $n^{3/4}$ as a function of the sample size $n$ for the triangle density function. (a) MSE of estimator with piecewise linear kernel characteristic function. (b) MSE of estimator with infinitely differentiable flat-top kernel characteristic function.
Figure 4. MSE of kernel estimators (with piecewise linear kernel characteristic function) as a function of the sample size $n$. (a) Laplace density function ($r = 1$, MSE multiplied by $n^{3/4}$). (b) Gamma density function ($k = 3$, $r = 2$, MSE multiplied by $n^{5/6}$).
Figure 5. MSE of kernel estimators (with piecewise linear kernel characteristic function) as a function of the sample size $n$. (a) Gamma distribution shape parameter $k = 4$, $r = 3$ (MSE multiplied by $n^{7/8}$). (b) Gamma distribution shape parameter $k = 6$, $r = 5$ (MSE multiplied by $n^{11/12}$).
Politis, D.N.; Tarassenko, P.F.; Vasiliev, V.A. Estimating Smoothness and Optimal Bandwidth for Probability Density Functions. Stats 2023, 6, 30-49. https://doi.org/10.3390/stats6010003
