Article

Asymptotic Efficiency of Point Estimators in Bayesian Predictive Inference

by Emanuele Dolera 1,2
1 Department of Mathematics, University of Pavia, Via Adolfo Ferrata 5, 27100 Pavia, Italy
2 Collegio Carlo Alberto, Piazza V. Arbarello 8, 10134 Torino, Italy
Mathematics 2022, 10(7), 1136; https://doi.org/10.3390/math10071136
Submission received: 1 March 2022 / Revised: 24 March 2022 / Accepted: 29 March 2022 / Published: 1 April 2022

Abstract: The point estimation problems that emerge in Bayesian predictive inference are concerned with random quantities which depend on both observable and non-observable variables. Intuition suggests splitting such problems into two phases, the former relying on the estimation of the random parameter of the model, the latter concerning the estimation of the original quantity from the distinguished element of the statistical model obtained by plugging the estimated parameter in the place of the random parameter. This paper discusses both phases within a decision-theoretic framework. As a main result, a non-standard loss function on the space of parameters, given in terms of a Wasserstein distance, is proposed to carry out the first phase. Finally, the asymptotic efficiency of the entire procedure is discussed.

1. Introduction

This paper carries on a project—conceived by Eugenio Regazzini some years ago, and partially developed in collaboration with Donato M. Cifarelli—which aims at proving why and how some classical, frequentist algorithms from the theory of point estimation can be justified, under some regularity assumptions, within the Bayesian framework. See [1,2,3,4]. This project was inspired, in turn, by the works and the thoughts of Bruno de Finetti about the foundation of statistical inference, substantially based on the following principles.
  • De Finetti’s vision of statistics is grounded on the irrefutable fact that the Bayesian standpoint—intended as the use of basic tools of probability theory and, especially, of conditional distributions—becomes a necessity for those who intend statistical inference as the utilization of observed data to update their original beliefs about other quantities of interest, not yet observed. See [5,6].
  • Rigorous notions of point estimation and optimality of an estimator can be achieved only within a decision-theoretic framework (see, e.g., [7]), at least if we admit all estimators into competition and disregard distinguished restrictions such as unbiasedness or equivariance. In turn, decision theory proves to be genuinely Bayesian, thanks to a well-known result by Abraham Wald. See [8] [Chapter 4].
  • At least from a mathematical stance, the existence of the prior distribution can be drawn from various representation theorems which, by pertaining to the more basic act of modeling incoming information, stand before the problem of point estimation. The most luminous example is the celebrated de Finetti representation theorem for exchangeable observations. See [6,9] and, for a predictive approach [10,11].
Indeed, these principles do not force the assessment of a specific prior distribution, but just lead the statistician to take cognizance that some prior has, in any case, to exist. This fact agrees with de Finetti’s indication to keep the concepts of “Bayesian standpoint” and “Bayesian techniques” as distinguished. See also [12].
Despite their robust logical coherence, orthodox Bayesian solutions to inferential problems suffer two main drawbacks on the practical, operational side, which may limit their use. On the one hand, it is rarely the case that a prior distribution is fully specified due to a lack of prior information, this phenomenon even being amplified by the choice of complex statistical models (e.g., of nonparametric type). On the other hand, the numerical tractability of the Bayesian solutions often proves to be a serious hurdle, especially in the presence of large datasets. For example, it suffices to mention those algorithms from Bayesian nonparametrics that involve tools from combinatorics (like permutations or set/integer partitions) having exponential algorithmic complexity. See, e.g., [13]. Finally, the implicit nature of the notion of Bayesian estimator, although conceptually useful, makes it hard to employ in practical problems, especially in combination with non-quadratic loss functions, even if noteworthy progress has been achieved from the numerical side in the last decade. All these issues still pervade modern statistical literature while, historically, they have paved the way firstly to the “Fisherian revolution” and then to more recent techniques such as empirical Bayes and objective Bayes methods. The ultimate result has been a proliferation of many ad hoc algorithms, often of limited conceptual value, that provide focused and operational solutions to very specific problems.
Aware of this trend, Eugenio Regazzini conceived his project with the aims of: reframing the algorithms of modern statistics—especially those obtained by frequentist techniques—within the Bayesian theory as summarized in points 1–3 above, showing whether they can be re-interpreted as good approximations of Bayesian algorithms. The rationale is that orthodox Bayesian theory could be open to accept even non-Bayesian solutions (hence, suboptimal ones if seen “through the glass of the prior”) as long as such solutions prove to be more operational than the Bayesian ones and, above all, asymptotically almost efficient, in the Bayesian sense. This concept means that, for a fixed prior, the Bayesian risk function evaluated at the non-Bayesian estimator is approximately equal to the overall minimum of such risk function (achieved when evaluated at the Bayesian estimator), the error of approximation going to zero as the sample size increases. Of course, these goals can be carried out after providing quantitative estimates for the risk function, as done, for example, in some decision-theoretic work on the empirical Bayes approach to inference. See, e.g., the seminal work [14]. Indeed, Regazzini’s project has much in common with the empirical Bayes theory, although the former strictly remains on the “orthodox Bayesian main way” whilst the latter mixes Bayesian and frequentist techniques. As to more practical results, an archetype of Regazzini’s line of reasoning can be found in a previous statement from [15] [Section 5] which proves that the maximum likelihood estimator (MLE)—obtained in the classical context of n i.i.d. observations, driven by a regular parametric model—has the same Bayesian efficiency (coinciding with the mean square error, in this case) as the Bayesian estimator up to O ( 1 / n ) -terms, provided that the prior is smooth enough. Another example can be found in [16] where the authors, while dealing with species sampling problems, rediscover the so-called Good–Turing estimator for the probability of finding a new species (which is obtained via empirical Bayes arguments) within the Bayesian nonparametric setting described in [17]. Other examples are contained in [2,4]. In any case, Regazzini’s project is not only a matter of “rigorously justifying” a given algorithm, but rather of logically conceiving an estimation problem from the beginning to the end by quantifying coherent degrees of approximation in terms of the Bayesian risk or, more generally, in terms of speed of shrinkage of the posterior distribution with respect to distances on the space of probability measures, these goals being proved uniformly with respect to an entire class of priors. Hence, this plan of action is conceptually antipodal to that of (nowadays called) “Bayesian consistency”, i.e., to justify a Bayesian algorithm from the point of view of classical statistics.

1.1. Main Contributions and General Strategy

In this paper, we pursue Regazzini’s project by considering some predictive problems where the quantity U n , m to be estimated depends explicitly on new (hitherto unobserved) variables X n + 1 , , X n + m , possibly besides the original sample variables X 1 , , X n and an unobservable parameter T. Thus, U n , m = u n , m ( X n + 1 , , X n + m ; X 1 , , X n ; T ) . For simplicity, we confine ourselves to the simplest case in which both ( X 1 , , X n ) and ( X n + 1 , , X n + m ) are segments of a whole sequence { X i } i 1 of exchangeable X -valued random variables, while T is a random parameter that makes the X i ’s conditionally i.i.d. with a common distribution depending on T, in accordance with de Finetti’s representation theorem. From the statistical point of view, the exchangeability assumption just reveals a supposed homogeneity between the observable quantities while, from a mathematical point of view, it simply states that the joint distribution of any k-subset of the X i ’s depends only on k and not on the specific k-subset, for any k N . Thus, we are setting our estimation problem within an orthodox Bayesian framework where, independently of the fact that we are able or not to precisely assess the prior distribution, such a prior has to exist for mere mathematical reasons. This solid theoretical background provides all the elements to logically formulate the original predictive estimation problem as the following decision-theoretic question: find
$$\hat{U}_{n,m} = \operatorname*{Argmin}_{Z}\, \mathsf{E}\big[ L_{U}(U_{n,m}, Z) \big], \qquad (1)$$
where: L U is a suitable loss function on the space U in which U n , m takes its values; Z runs over the space of all U -valued, σ ( X 1 , , X n ) -measurable random variables; the expectation is taken with respect to the joint distribution of ( X 1 , , X n + m ) and T. It is remarkable that the same estimation problem would have been meaningless in classical (Fisherian) statistics, which can solely consider the estimation of (a function of) the parameter, and not of random quantities. Now, the solution displayed in (1) depends of course on the prior and it is the optimal one when seen, in terms of the Bayesian risk, “with the glass of that prior”. However, the above-mentioned difficulties about the assessment of a specific prior can diminish the practical (but not the conceptual) value of this solution, in the sense that it could prove to be non-operational in the case of a lack of prior information. Sometimes, when the prior is known up to further unknown parameters, another estimation problem is needed.
Our research is then focused on formalizing a general strategy aimed at producing, under regularity conditions, alternative estimators U n , m * which prove to be asymptotically nearly optimal (as specified above), uniformly with respect to any prior in some class. More precisely, for any fixed prior in that class, we aim at proving the validity of the asymptotic expansions (as n + ),
$$\mathsf{E}\big[ L_{U}(U_{n,m}, \hat{U}_{n,m}) \big] = \hat{R}_{0,m} + \frac{1}{n}\,\hat{R}_{1,m} + o\Big(\frac{1}{n}\Big) \qquad (2)$$
$$\mathsf{E}\big[ L_{U}(U_{n,m}, U^{*}_{n,m}) \big] = R^{*}_{0,m} + \frac{1}{n}\,R^{*}_{1,m} + o\Big(\frac{1}{n}\Big), \qquad (3)$$
along with R ^ i , m = R i , m * for i = 0 , 1 , where U ^ n , m is the same as in (1). This is exactly the content of Theorem 5.1 and Corollary 5.1 in [15], which deal with the case where: U n , m = T (estimation of the parameter of the model), so that U coincides with the parameter space Θ R ; L U is the quadratic loss function, so that the risk function coincides with the mean square error; U ^ n , m = E [ T | X 1 , , X n ] is the Bayesian estimator with respect to L U ; U n , m * coincides with the MLE; R ^ 0 , m = R 0 , m * = 0 and R ^ 1 , m = R 1 , m * = Θ [ I ( θ ) ] 1 π ( d θ ) , I denoting the Fisher information of the model and π being any prior on Θ with positive and sufficiently smooth density (with respect to the Lebesgue measure). Moving to truly predictive problems, the main operational solutions come from the empirical Bayes theory, which shares Equation (1) with the approach we are going to present. However, the empirical Bayes theory very soon leaves the “Bayesian main way” by bringing some sort of Law of Large Numbers into the game, in order to replace the unknown quantities (usually, the prior itself). Here, on the contrary, we pursue Regazzini’s project by proposing a new method that remains on the Bayesian main way. It consists of the following six steps.
Step 1. Reformulate problem (1) into another (orthodox Bayesian) estimation problem about T, the random parameter of the model. Roughly speaking, start from the following de Finetti representation:
$$\mathsf{P}[X_1 \in A_1, \dots, X_k \in A_k \mid T = \theta] = \mu_k(A_1 \times \dots \times A_k \mid \theta) := \prod_{i=1}^{k} \mu(A_i \mid \theta), \qquad (4)$$
valid for all k N , Borel sets A 1 , , A k , θ Θ , and some probability kernel μ ( · | · ) , which coincides with the statistical model for the single observation. Then, consider the following estimation problem: find
$$\hat{T}_{n,m} = \operatorname*{Argmin}_{W}\, \mathsf{E}\big[ L_{\Theta,(X_1,\dots,X_n)}(T, W) \big], \qquad (5)$$
where: L Θ , ( X 1 , , X n ) is a suitable loss function on Θ ; W runs over the space of all Θ -valued, σ ( X 1 , , X n ) -measurable random variables; the expectation is taken with respect to the joint distribution of ( X 1 , , X n ) and T. The explicit definition of L Θ , ( X 1 , , X n ) is given in terms of a Wasserstein distance, as follows:
$$L_{\Theta,(x_1,\dots,x_n)}(\theta,\tau) = \inf_{\Gamma} \int_{U^2} L_{U}(u,v)\, \Gamma(du\, dv), \qquad (6)$$
where $\Gamma$ runs over the Fréchet class of all probability measures on $U^2$ with marginals $\gamma_{\theta,(x_1,\dots,x_n)}$ and $\gamma_{\tau,(x_1,\dots,x_n)}$, respectively, and $\gamma_{\theta,(x_1,\dots,x_n)}$ stands for the pull-back measure $\mu_m(\cdot\,|\,\theta) \circ u_{n,m}(\,\cdot\,; x_1,\dots,x_n; \theta)^{-1}$ on $U$.
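To see what the loss (6) amounts to in the simplest situation, the following purely illustrative sketch (not part of the paper; sample sizes and samplers are assumptions) approximates it when $U = \mathbb{R}$ by means of Dall'Aglio's quantile representation of the 2-Wasserstein distance, i.e., by comparing equally sized sorted Monte Carlo samples of the two push-forward laws.

```python
# A purely illustrative sketch of the loss (6) when U = R: by Dall'Aglio's quantile
# representation, W_2^2 between two laws on the real line can be approximated by
# comparing equally sized sorted samples of the two push-forward measures.
import numpy as np

def loss_theta(sample_theta, sample_tau):
    """Monte Carlo approximation of L_{Theta,(x_1..x_n)}(theta, tau) = W_2^2(gamma_theta, gamma_tau)."""
    a = np.sort(np.asarray(sample_theta, dtype=float))
    b = np.sort(np.asarray(sample_tau, dtype=float))
    return float(np.mean((a - b) ** 2))

# Illustration with the exponential model of Section 4.2 and u_1(y, theta) = y, so that
# gamma_theta = Exp(theta); the exact value is 2 * (1/theta - 1/tau)^2.
rng = np.random.default_rng(0)
theta, tau, N = 2.0, 3.0, 200_000
approx = loss_theta(rng.exponential(1 / theta, N), rng.exponential(1 / tau, N))
exact = 2 * (1 / theta - 1 / tau) ** 2
print(approx, exact)  # the two values should be close
```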
Step 2. After getting the estimator T ^ n , m from (5), consider estimators U n , m * that satisfy the following approximated version of problem (1): find
$$U^{*}_{n,m} = \operatorname*{Argmin}_{Z} \int_{X^m} L_{U}\big( u_{n,m}(y_1,\dots,y_m; X_1,\dots,X_n; \hat{T}_{n,m}),\, Z \big)\, \mu_m(dy_1 \cdots dy_m \mid \hat{T}_{n,m}), \qquad (7)$$
where Z runs over the space of all U -valued, σ ( X 1 , , X n ) -measurable random variables.
  Step 3. For the estimators $\hat{U}_{n,m}$ and $U^{*}_{n,m}$ that solve (1) and (7), respectively, prove that (2) and (3) hold along with $\hat{R}_{i,m} = R^{*}_{i,m}$ for $i = 0, 1$. This entails the asymptotic almost efficiency of $U^{*}_{n,m}$, which is still a prior-dependent estimator. In any case, this step is crucial to show that the loss function $L_{\Theta,(x_1,\dots,x_n)}$ given in (6) is “Bayesianly well-conceived”, that is, in harmony with the original aim displayed in (1).
  Step 4. Identities (2) and (3) provide conditions on the statistical model $\mu(\cdot\,|\,\cdot)$ that possibly allow the existence of some prior-free estimator $\tilde{T}_{n,m}$ of $T$ which turns out to be asymptotically almost efficient, with respect to the same risk function as that displayed on the right-hand side of (5). More precisely, this fact consists of proving the validity of the following identities (as $n \to +\infty$)
$$\mathsf{E}\big[ L_{\Theta,(X_1,\dots,X_n)}(T, \hat{T}_{n,m}) \big] = \hat{\rho}_{0,m} + \frac{1}{n}\,\hat{\rho}_{1,m} + o\Big(\frac{1}{n}\Big) \qquad (8)$$
$$\mathsf{E}\big[ L_{\Theta,(X_1,\dots,X_n)}(T, \tilde{T}_{n,m}) \big] = \tilde{\rho}_{0,m} + \frac{1}{n}\,\tilde{\rho}_{1,m} + o\Big(\frac{1}{n}\Big), \qquad (9)$$
along with ρ ^ i , m = ρ ˜ i , m for i = 0 , 1 , where T ^ n , m is the same as in (5), for all prior distributions in a given class.
Step 5. After getting estimators T ˜ n , m as in Step 4, consider the prior-free estimators U ˜ n , m satisfying the analogous minimization problem as in (7), with T ^ n , m replaced by T ˜ n , m .
  Step 6. For any estimator $\tilde{U}_{n,m}$ found as in Step 5, prove the validity of the following identity (as $n \to +\infty$):
$$\mathsf{E}\big[ L_{U}(U_{n,m}, \tilde{U}_{n,m}) \big] = \tilde{R}_{0,m} + \frac{1}{n}\,\tilde{R}_{1,m} + o\Big(\frac{1}{n}\Big), \qquad (10)$$
along with R ^ i , m = R ˜ i , m for i = 0 , 1 , where the R ^ i , m ’s are the same as in (2), for all prior distributions in the same class as specified in Step 4. This last step shows why and how the frequentist (i.e., prior-free) estimator U ˜ n , m can be used, within the orthodox Bayesian framework, as a good approximation of the Bayesian estimator U ^ n , m . This is particularly remarkable at least in two cases that do not exclude each other: when the estimator T ˜ n , m obtained from Step 4 is much simpler and numerically manageable than T ^ n , m ; when prior information is sufficient to characterize only a class of priors, but not a specific element of it.
This plan of action obeys the following principles:
(A)
The loss function L Θ , ( x 1 , , x n ) on Θ is harmoniously coordinated with the original choice of the loss function L U on U . This principle is much aligned with de Finetti’s thought (see [18]), since it remarks on the more concrete nature of the space U compared with the space Θ which is, in principle, only a set of labels. Hence, it is much more reasonable to firstly metrize the space U and then the space Θ accordingly (as in (6)), rather than directly metrize Θ —even without taking account of the original predictive aim.
(B)
The Bayesian risk function associated with both U n , m * and U ˜ n , m can be bounded from above by the sum of two quantities: the former taking account of the error in estimating T, the latter reflecting the fact that we are estimating both U n , m * and U ˜ n , m from an “estimated distribution”.
The former principle, whose formalization constitutes the main novelty of this work, is concerned with the geometrical structure of the space of the parameters Θ. This is what we call a relativistic principle in point estimation theory: the goal of estimating a random quantity that depends on the observations (possibly besides the parameter) yields a modification of the geometry of Θ, to be now thought of as a curved space according to a non-trivial geometry. Of course, this modified geometry entails a coordinated notion of mean square error, now referred to the Riemannian geodesic distance. The term relativistic principle just hints at the original main principle of General Relativity Theory, according to which the presence of a massive body modifies the geometry of the surrounding physical space, by means of the well-known Einstein tensor equations. These equations formalize a sort of compatibility between the physical and the geometric structures of the space. Thus, we will call the identities (43) and (44), stated in Section 3 to properly characterize the (Riemannian) metric on Θ, compatibility equations. Actually, the idea of metrizing the parameter space Θ in a non-standard way has been well-known since the pioneering paper [19] by Radhakrishna Rao, and has received so much attention in the statistical literature as to give birth to a fertile branch called Information Geometry. See, e.g., [20]. In particular, the concepts of efficiency, unbiasedness, Cramér–Rao lower bounds, Rao–Blackwell and Lehmann–Scheffé theorems are by far best understood in this non-standard (i.e., non-Euclidean) setting. See [21]. In any case, to the best of our knowledge, this is the first work which connects the use of a non-standard geometric setting on Θ with predictive estimation problems—even if some hints can be drawn from [22]. In our opinion, the lack of awareness about the aforesaid relativistic principle, combined with an abuse of the quadratic loss function on Θ, has produced a lot of actually sub-efficient algorithms, most of which are focused on the estimation of certain probabilities, or of nonparametric objects. In these cases, the efficiency of the ensuing estimators is created artificially through a misuse of the quadratic loss, and it proves to be drastically downsized whenever these estimators are evaluated by means of other, more concrete loss functions which take account (as in (6)) of the natural geometry of the spaces of really observable quantities. To get an idea of this phenomenon, see the discussion about Robbins’ estimators in Section 4.4 below.

1.2. Organization of the Paper

We conclude the introduction by summarizing the main results of the paper, which are threefold. The first block of results, including Theorem 1, Proposition 1 and Lemma 1 in Section 2.2, concerns some refinement of de Finetti’s Law of Large Numbers for the log-likelihood process. The second block of theoretical results, developed in Section 3, contains:
(i)
Proposition 2, which shows how to bound from above the Bayesian risk of any estimator of U n , m by using the Wasserstein distance;
(ii)
Proposition 3, which explains how to use the Laplace method of the approximation of integrals to get asymptotic expansions of the Bayesian risk functions;
(iii)
the formulation of the compatibility Equations (43) and (44);
(iv)
the proof of the “asymptotic almost efficiency” of the estimator U n , m * obtained in Step 2, via verification of identities (2) and (3);
(v)
the successful completion of Step 6, that is, the proof of the “asymptotic almost efficiency” of estimators U ˜ n , m obtained in Step 5, via verification of identity (10).
The last block of results, contained in Section 4, consists of explicit verifications of the compatibility equations for some simple statistical models (Section 4.1, Section 4.2 and Section 4.3), and also the adaptation of our plan of action to the same Poisson-mixture model used by Herbert Robbins in [23] to illustrate his empirical Bayes approach to predictive inference (Section 4.4). Finally, all the proofs of the theoretical results are deferred to Section 5, while some conclusions and future developments are hinted at in Section 6.

2. Technical Preliminaries

We begin by rigorously fixing the mathematical setting, split into two subsections. The former will contain a very general framework which will serve to give a precise meaning to the questions presented in the Introduction and to state in full generality one of the main results, that is, Proposition 2 in Section 3. In fact, this statement will include some inequalities that, by carrying out the goal described in point (B) of the Introduction, will constitute the starting point for all the results presented in Section 3. The second subsection will deal with a simplification of the original setting—essentially based on additional regularity conditions for the spaces U and Θ and for the statistical model μ ( · | · )—aimed at introducing the novel compatibility equations without too many technicalities.

2.1. The General Framework

Let $(X, \mathcal{X})$ and $(\Theta, \mathcal{T})$ be standard Borel spaces, called the sample space (for any single observation) and the parameter space, respectively. Consider a sequence $\{X_i\}_{i \ge 1}$ of $X$-valued random variables (r.v.’s, from now on) along with another $\Theta$-valued r.v. $T$, all the $X_i$’s and $T$ being defined on a suitable probability space $(\Omega, \mathcal{F}, \mathsf{P})$. Assume that (4) holds for all $k \in \mathbb{N}$, $A_1, \dots, A_k \in \mathcal{X}$ and $\theta \in \Theta$ with some given probability kernel $\mu(\cdot\,|\,\cdot): \mathcal{X} \times \Theta \to [0,1]$, called the statistical model (for any single observation). The validity of (4) entails that the $X_i$’s are exchangeable and that
$$\mathsf{P}[X_1 \in A_1, \dots, X_k \in A_k] = \int_{\Theta} \mu_k(A_1 \times \dots \times A_k \mid \theta)\, \pi(d\theta) =: \alpha_k(A_1 \times \dots \times A_k) \qquad (11)$$
holds for all $k \in \mathbb{N}$ and $A_1, \dots, A_k \in \mathcal{X}$ with some given probability measure (p.m.) $\pi$ on $(\Theta, \mathcal{T})$ called the prior distribution. Identity (11) uniquely characterizes the p.m. $\alpha_k$ on $(X^k, \mathcal{X}^k)$ for any $k \in \mathbb{N}$, this p.m. being called the law of $k$ observations, where $X^k$ ($\mathcal{X}^k$, respectively) denotes the $k$-fold Cartesian product ($k$-fold product σ-algebra, respectively) of copies of $X$ ($\mathcal{X}$, respectively). Moreover, let
$$\pi_k(B \mid x_1,\dots,x_k) := \mathsf{P}[T \in B \mid X_1 = x_1, \dots, X_k = x_k], \qquad \beta_k(A \mid x_1,\dots,x_k) := \mathsf{P}[X_{k+1} \in A \mid X_1 = x_1, \dots, X_k = x_k]$$
be two probability kernels, with $\pi_k(\cdot\,|\,\cdot): \mathcal{T} \times X^k \to [0,1]$ and $\beta_k(\cdot\,|\,\cdot): \mathcal{X} \times X^k \to [0,1]$, defined as respective solutions of the following disintegration problems
$$\mathsf{P}[X_1 \in A_1, \dots, X_k \in A_k, T \in B] = \int_{A_1 \times \dots \times A_k} \pi_k(B \mid x_1,\dots,x_k)\, \alpha_k(dx_1 \cdots dx_k)$$
$$\mathsf{P}[X_1 \in A_1, \dots, X_k \in A_k, X_{k+1} \in A] = \int_{A_1 \times \dots \times A_k} \beta_k(A \mid x_1,\dots,x_k)\, \alpha_k(dx_1 \cdots dx_k)$$
for any k N , A 1 , , A k , A X and B T . The probability kernels π k ( · | · ) and β k ( · | · ) are called posterior distribution and predictive distribution, respectively.
Let ( U , d U ) be a Polish metric space and, for fixed n , m N , let u n , m : X m × X n × Θ U be a measurable map. Let U n , m : = u n , m ( X n + 1 , , X n + m ; X 1 , , X n ; T ) be the random quantity to be estimated with respect to the loss function L U ( u , v ) : = d U 2 ( u , v ) . Now, recall the notion of barycenter (also known as Fréchet mean) of a given p.m.. Let ( S , d S ) be a Polish metric space, endowed with its Borel σ -algebra B ( S ) . Given a p.m. μ on ( S , B ( S ) ) , define
$$\operatorname{Bary}_{S}[\mu; d_S] := \operatorname*{Argmin}_{y \in S} \int_{S} d_S^2(x, y)\, \mu(dx)$$
provided that μ has finite second moment ( μ P 2 ( S , d S ) , in symbols) and that at least one minimum point exists. See [24,25,26] for results on existence, uniqueness and some characterizations of barycenters. Then, put
$$\rho_{n,m}(C \mid x_1,\dots,x_n) := \mathsf{P}[U_{n,m} \in C \mid X_1 = x_1, \dots, X_n = x_n],$$
meaning that ρ n , m ( · | · ) : B ( U ) × X n [ 0 , 1 ] is a probability kernel that solves the disintegration problem
$$\mathsf{P}[X_1 \in A_1, \dots, X_n \in A_n, U_{n,m} \in C] = \int_{A_1 \times \dots \times A_n} \rho_{n,m}(C \mid x_1,\dots,x_n)\, \alpha_n(dx_1 \cdots dx_n)$$
for any A 1 , , A n X and C B ( U ) . If E [ d U 2 ( U n , m , u 0 ) ] < + for some u 0 U and Bary U [ ρ n , m ( · | x 1 , , x n ) ; d U ] exists uniquely for α n -almost all ( x 1 , , x n ) , then
$$\hat{U}_{n,m} = \operatorname{Bary}_{U}\big[ \rho_{n,m}(\cdot \mid X_1,\dots,X_n);\, d_U \big] \qquad (12)$$
solves the minimization problem (1). To give an analogous formalization to the minimization problem (7), define
$$\gamma_{\theta,(x_1,\dots,x_n)}(C) := \mu_m\big( \{ (y_1,\dots,y_m) \in X^m \,:\, u_{n,m}(y_1,\dots,y_m; x_1,\dots,x_n; \theta) \in C \} \,\big|\, \theta \big)$$
for any θ Θ , ( x 1 , , x n ) X n and C B ( U ) . Again, if γ θ , ( x 1 , , x n ) P 2 ( U , d U ) and Bary U [ γ θ , ( x 1 , , x n ) ( · ) ; d U ] exists uniquely for any θ Θ and α n -almost all ( x 1 , , x n ) , then
$$U^{*}_{n,m} = \operatorname{Bary}_{U}\big[ \gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)};\, d_U \big] \qquad (13)$$
solves the minimization problem (7). By the way, notice that a combination of de Finetti’s representation theorem with basic properties of conditional distributions entails that
$$\rho_{n,m}(C \mid x_1,\dots,x_n) = \int_{\Theta} \gamma_{\theta,(x_1,\dots,x_n)}(C)\, \pi_n(d\theta \mid x_1,\dots,x_n)$$
for α n -almost all ( x 1 , , x n ) . It remains to formalize the minimization problem (5). If γ θ , ( x 1 , , x n ) , γ τ , ( x 1 , , x n ) P 2 ( U , d U ) , then the loss function in (6) satisfies
$$L_{\Theta,(x_1,\dots,x_n)}(\theta,\tau) = W_U^2\big( \gamma_{\theta,(x_1,\dots,x_n)};\, \gamma_{\tau,(x_1,\dots,x_n)} \big),$$
where $W_U$ denotes the 2-Wasserstein distance on $\mathcal{P}_2(U, d_U)$. See [27] [Chapters 6–7] for more information on the Wasserstein distance. Therefore, if $\pi_n(\cdot\,|\,x_1,\dots,x_n) \in \mathcal{P}_2\big(\Theta, L_{\Theta,(x_1,\dots,x_n)}^{1/2}\big)$ and $\operatorname{Bary}_{\Theta}\big[\pi_n(\cdot\,|\,x_1,\dots,x_n);\, L_{\Theta,(x_1,\dots,x_n)}^{1/2}\big]$ exists uniquely for $\alpha_n$-almost all $(x_1,\dots,x_n)$, then
$$\hat{T}_{n,m} = \operatorname{Bary}_{\Theta}\big[ \pi_n(\cdot \mid X_1,\dots,X_n);\, L_{\Theta,(X_1,\dots,X_n)}^{1/2} \big] \qquad (15)$$
solves the minimization problem (5).
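For concreteness, the following purely illustrative sketch (not part of the paper) approximates the barycenter in (15) in the simplified setting of Section 2.2, where the loss reduces to a distance $\Delta$ on $\Theta$: the Fréchet mean is obtained by a grid search, with Monte Carlo draws standing in for the posterior (the Gamma draws below are a hypothetical placeholder).

```python
# A purely illustrative sketch of the barycenter in (15): the Fréchet mean is
# approximated by a grid search over candidate values of tau, using Monte Carlo
# draws that stand in for the posterior distribution on Theta.
import numpy as np

def frechet_mean(posterior_draws, delta2, grid):
    """Return argmin over the grid of the posterior expectation of delta2(theta, tau)."""
    draws = np.asarray(posterior_draws, dtype=float)
    risks = [np.mean(delta2(draws, tau)) for tau in grid]
    return grid[int(np.argmin(risks))]

rng = np.random.default_rng(1)
posterior_draws = rng.gamma(shape=50.0, scale=1 / 25.0, size=5_000)   # hypothetical posterior sample
delta2 = lambda th, ta: 2.0 * (1.0 / th - 1.0 / ta) ** 2              # Delta^2 of the exponential model (Section 4.2)
grid = np.linspace(0.5, 4.0, 2_000)
print(frechet_mean(posterior_draws, delta2, grid))                    # close to 1 / E[1/theta] under this Delta
```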
To conclude, it remains to formalize the definition of the various Bayesian risk functions that will appear in the formulation of the main results. For any estimator $U^{\sharp}_{n,m} = u^{\sharp}_{n,m}(X_1,\dots,X_n)$ of $U_{n,m}$, obtained with a measurable $u^{\sharp}_{n,m}: X^n \to U$ (the superscript $\sharp$ marks a generic estimator), put
$$R_U[U^{\sharp}_{n,m}] := \mathsf{E}\big[ L_U(U_{n,m}, U^{\sharp}_{n,m}) \big] = \int_{\Theta}\int_{X^{n+m}} L_U\big( u_{n,m}(\mathbf{y};\mathbf{x};\theta),\, u^{\sharp}_{n,m}(\mathbf{x}) \big)\, \mu_{n+m}(d\mathbf{y}\, d\mathbf{x} \mid \theta)\, \pi(d\theta) = \int_{X^n}\int_{\Theta}\int_{X^m} L_U\big( u_{n,m}(\mathbf{y};\mathbf{x};\theta),\, u^{\sharp}_{n,m}(\mathbf{x}) \big)\, \mu_m(d\mathbf{y} \mid \theta)\, \pi_n(d\theta \mid \mathbf{x})\, \alpha_n(d\mathbf{x}) \qquad (16)$$
provided that the integrals are finite. Here and throughout, the bold symbols $\mathbf{x}, \mathbf{y}$ are just short-hands to denote the vectors $(x_1,\dots,x_n)$ and $(y_1,\dots,y_m)$, respectively. Analogously, for any estimator $T^{\sharp}_{n,m} = t^{\sharp}_{n,m}(X_1,\dots,X_n)$ of $T$, obtained with a measurable $t^{\sharp}_{n,m}: X^n \to \Theta$, put
$$R_{\Theta}[T^{\sharp}_{n,m}] := \mathsf{E}\big[ L_{\Theta,(X_1,\dots,X_n)}(T, T^{\sharp}_{n,m}) \big] = \int_{\Theta}\int_{X^n} L_{\Theta,\mathbf{x}}\big(\theta, t^{\sharp}_{n,m}(\mathbf{x})\big)\, \mu_n(d\mathbf{x} \mid \theta)\, \pi(d\theta) = \int_{X^n}\int_{\Theta} L_{\Theta,\mathbf{x}}\big(\theta, t^{\sharp}_{n,m}(\mathbf{x})\big)\, \pi_n(d\theta \mid \mathbf{x})\, \alpha_n(d\mathbf{x}) \qquad (17)$$
provided that the integrals are finite.
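Since (16) and (17) are plain expectations under the joint law of $(X_1,\dots,X_{n+m}, T)$, they can be approximated by direct simulation. The following purely illustrative sketch (not part of the paper) does this for the quadratic loss on $U = \mathbb{R}$; the prior and model samplers, the map $u_{n,m}$ and the estimator are placeholders, chosen only so that the risk is finite.

```python
# A purely illustrative Monte Carlo sketch of the Bayesian risk (16) for quadratic loss:
# draw the parameter from the prior, the observed and future samples from the model, and
# average the loss between U_{n,m} and the value returned by the estimator.
import numpy as np

def bayes_risk(sample_prior, sample_model, u_nm, estimator, n, m, reps=50_000, seed=0):
    rng = np.random.default_rng(seed)
    losses = np.empty(reps)
    for r in range(reps):
        theta = sample_prior(rng)
        x = sample_model(rng, theta, n)     # observed sample X_1, ..., X_n
        y = sample_model(rng, theta, m)     # future sample X_{n+1}, ..., X_{n+m}
        losses[r] = (u_nm(y, x, theta) - estimator(x)) ** 2
    return losses.mean()

# Example: exponential model, Gamma(3, 1) prior (hypothetical), U_{n,1} = X_{n+1},
# and the sample mean as estimator.
risk = bayes_risk(sample_prior=lambda rng: rng.gamma(3.0, 1.0),
                  sample_model=lambda rng, th, k: rng.exponential(1.0 / th, k),
                  u_nm=lambda y, x, th: y[0],
                  estimator=lambda x: x.mean(),
                  n=20, m=1)
print(risk)
```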

2.2. The Simplified Framework

Start by assuming that $U = \mathbb{R}$ and $L_U(u,v) = |u - v|^2$. Then, restrict the attention to those predictive problems in which the quantity to be estimated depends only on the new observations $X_{n+1},\dots,X_{n+m}$ and on the random parameter $T$, but not on the observable variables $X_1,\dots,X_n$. This restriction is actually non-conceptual, and it is made only to diminish the mathematical complexity of the ensuing asymptotic expansions (valid as $n \to +\infty$), having in this way fewer sources of dependence on the variable $n$. Thus, the quantity to be estimated has the form $u_m(X_{n+1},\dots,X_{n+m}; T)$ for some measurable $u_m: X^m \times \Theta \to \mathbb{R}$. From now on, it will be assumed that
$$\mathsf{E}\big[ u_m(X_{n+1},\dots,X_{n+m}; T)^2 \big] < +\infty. \qquad (18)$$
Whence, for the Bayesian estimator $\hat{U}_{n,m}$ in (12), existence and uniqueness are well-known: its explicit form is given by $\hat{U}_{n,m} = \hat{u}_{n,m}(X_1,\dots,X_n)$ with
$$\hat{u}_{n,m}(x_1,\dots,x_n) = \mathsf{E}[u_m(X_{n+1},\dots,X_{n+m}; T) \mid X_1 = x_1,\dots,X_n = x_n] = \int_{\Theta}\int_{X^m} u_m(y_1,\dots,y_m;\theta)\, \mu_m(dy_1 \cdots dy_m \mid \theta)\, \pi_n(d\theta \mid x_1,\dots,x_n),$$
which is finite for α n -almost all ( x 1 , , x n ) . The risk function R U evaluated at U ^ n , m achieves its overall minimum value and, from (16), it takes the form:
$$R_U[\hat{U}_{n,m}] = \int_{X^n} \Big[ \int_{\Theta} v(\theta)\, \pi_n(d\theta \mid \mathbf{x}) + \int_{\Theta} [m(\theta)]^2\, \pi_n(d\theta \mid \mathbf{x}) - \Big( \int_{\Theta} m(\theta)\, \pi_n(d\theta \mid \mathbf{x}) \Big)^2 \Big]\, \alpha_n(d\mathbf{x}), \qquad (19)$$
with
$$m(\theta) := \int_{X^m} u_m(y_1,\dots,y_m;\theta)\, \mu_m(dy_1 \cdots dy_m \mid \theta), \qquad v(\theta) := \int_{X^m} \big[ u_m(y_1,\dots,y_m;\theta) - m(\theta) \big]^2\, \mu_m(dy_1 \cdots dy_m \mid \theta)$$
thanks to the well-known “Law of Total Variance”. See, e.g., [28] [Problem 34.10(b)]. As to the issue of estimating T, the first remarkable simplification induced by the above assumptions is that the p.m. γ θ , ( x 1 , , x n ) is independent of ( x 1 , , x n ) . Whence,
$$\Delta(\theta,\tau) := \big[ L_{\Theta,(x_1,\dots,x_n)}(\theta,\tau) \big]^{1/2} = W_U\big( \gamma_{\theta,(x_1,\dots,x_n)};\, \gamma_{\tau,(x_1,\dots,x_n)} \big), \qquad (20)$$
is, in turn, independent of ( x 1 , , x n ) and defines a distance on Θ provided that
$$\gamma_{\theta,(x_1,\dots,x_n)} = \gamma_{\tau,(x_1,\dots,x_n)}$$
entails $\theta = \tau$. Thus, for any estimator $T^{\sharp}_{n,m} = t^{\sharp}_{n,m}(X_1,\dots,X_n)$ of $T$, obtained with a measurable $t^{\sharp}_{n,m}: X^n \to \Theta$, (17) becomes
$$R_{\Theta}[T^{\sharp}_{n,m}] = \int_{X^n}\int_{\Theta} \big[ \Delta(\theta, t^{\sharp}_{n,m}(\mathbf{x})) \big]^2\, \pi_n(d\theta \mid \mathbf{x})\, \alpha_n(d\mathbf{x}). \qquad (21)$$
The last simplifications concern the basic objects of the inference, i.e., the statistical model $\mu(\cdot\,|\,\cdot)$ and the prior $\pi$. First, assume that $\Theta = (a,b) \subseteq \mathbb{R}$ and that $\pi$ has a density $p$ (with respect to the Lebesgue measure). Even if this one-dimensionality assumption can seem a drastic simplification, it is again of a non-conceptual nature, and it is made to diminish the mathematical complexity of the ensuing statements. In fact, one of the goals of this work is to provide a Riemannian-like characterization of the metric space $(\Theta, \Delta)$, and this is particularly simple in such a one-dimensional setting. The following arguments should be quite easily reproduced at least in a finite-dimensional setting (i.e., when $\Theta \subseteq \mathbb{R}^d$) by using basic tools of Riemannian geometry, such as local expansions of the geodesic distance. See, e.g., [29] [Chapter 5]. As to the statistical model $\mu(\cdot\,|\,\cdot)$, consider the following:
Assumption 1.
μ ( · | · ) is dominated by some σ-finite measure χ on ( X , X ) with a (distinguished version of the) density f ( · | θ ) that satisfies:
(i)
f ( x | θ ) > 0 for all x X and θ Θ ;
(ii)
for any fixed x X , θ f ( x | θ ) belongs to C 4 ( Θ ) ;
(iii)
there exists a separable Hilbert space $H$ for which $\log f(x\,|\,\cdot) \in H$ for all $x \in X$, and such that, for any open $\Theta'$ whose closure is compact in $\Theta$ ($\Theta' \Subset \Theta$, in symbols), the restriction operators $R_{\Theta'}: h \mapsto h|_{\Theta'}$ are continuous from $H$ to $C^0(\overline{\Theta'})$;
(iv)
$\int_X |\log f(x\,|\,\theta)|^2\, \mu(dx\,|\,\theta) < +\infty$ for $\pi$-a.e. $\theta$, and the Kullback–Leibler divergence
$$K(t\,\|\,\theta) := \int_X \big[ \log f(x\,|\,t) - \log f(x\,|\,\theta) \big]\, \mu(dx\,|\,t)$$
is well-defined.
A canonical choice for the Hilbert space H is in the form of a weighted Sobolev space H r ( Θ ; π ) for some r 1 . See, e.g., [30,31] for definition and further properties of weighted Sobolev spaces, such as embedding theorems. By the way, it is worth remarking that such assumptions are made to easily state the following results. It is plausible they could be relaxed in future works.
In this regularity setting, introduce the sequence { H n } n 1 , where H n : Ω H represents the (normalized) log-likelihood process, that is
$$H_n := \frac{1}{n}\,\ell_n(\,\cdot\,; X_1,\dots,X_n) := \frac{1}{n} \sum_{i=1}^{n} \log f(X_i\,|\,\cdot) = \int_X \log f(\xi\,|\,\cdot)\, e_n(X_1,\dots,X_n)(d\xi)$$
the symbol $e_n(X_1,\dots,X_n)$ standing for the empirical measure based on $(X_1,\dots,X_n)$, i.e.,
$$e_n(X_1,\dots,X_n) := \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}.$$
For completeness, any notation like $\ell_n(\,\cdot\,; X_1,\dots,X_n)$ is just a short-hand to denote the entire function $\theta \mapsto \ell_n(\theta; X_1,\dots,X_n)$. First of all, observe that $H_n$ is a sufficient statistic in both the classical and the Bayesian sense. See [11]. Then, a version of de Finetti’s Law of Large Numbers (see [9,32]) for the log-likelihood process can be stated as follows:
Theorem 1.
Under Assumption 1, define the following H -valued r.v.
$$H := \int_X \log f(z\,|\,\cdot)\, \mu(dz\,|\,T) = -K(T\,\|\,\cdot) + \int_X \log f(z\,|\,T)\, \mu(dz\,|\,T)$$
along with $\nu_n(D) := \mathsf{P}[H_n \in D]$ and $\nu(D) := \mathsf{P}[H \in D]$, for any $D \in \mathcal{B}(H)$. Then, it holds that
$$H_n \xrightarrow{\ L^2\ } H$$
which, in turn, yields that ν n ν , where ⇒ denotes weak convergence of p.m.’s on ( H , B ( H ) ) .
Then, to carry out the objectives mentioned in the Introduction, a quantitative refinement of the thesis ν n ν is needed, as stated in the following proposition.
Proposition 1.
Let C b 2 ( H ) denote the space of bounded, C 2 functionals on H . Besides Assumption 1, suppose there exists a function Γ ( · ; μ , π ) : H R such that
$$\frac{1}{2}\, \mathsf{E}\Big[ \operatorname{Hess}[\Psi]_{H} \otimes \operatorname{Cov}_{T}\big[ \log f(X_i\,|\,\cdot) \big] \Big] = \mathsf{E}\big[ \Psi(H)\, \Gamma(H; \mu, \pi) \big]$$
holds for all functionals $\Psi \in C_b^2(H)$, where $\operatorname{Hess}[\Psi]_h$ denotes the Hessian of $\Psi$ at $h \in H$, $\otimes$ is the tensor product between quadratic forms (operators) and $\operatorname{Cov}_t[\log f(X_i\,|\,\cdot)]$ stands for the covariance operator of the $H$-valued r.v.’s $\log f(X_i\,|\,\cdot)$ with respect to the p.m. $\mu(\cdot\,|\,t)$. Then,
$$\int_H \Psi(h)\, \nu_n(dh) = \int_H \Psi(h)\, \nu(dh) + \frac{1}{n} \int_H \Psi(h)\, \Gamma(h; \mu, \pi)\, \nu(dh) + o\Big(\frac{1}{n}\Big) \qquad (26)$$
holds as $n \to +\infty$ for all continuous $\Psi: H \to \mathbb{R}$ for which the above integrals are convergent.
For further information on second-order differentiability in Hilbert/Banach spaces, see [33,34]. By the way, the above identity (26) is a quantitative strengthening of de Finetti’s theorem, similar to the identities stated in Theorem 1.1 of [8] [Chapter 6], valid in a finite-dimensional setting. Later on, we will resort to uniform versions of (26), meaning that the $o(\frac{1}{n})$-term is uniformly bounded with respect to $h$. However, such kinds of results—much more in the spirit of the Central Limit Theorem—are very difficult to prove and, to the best of the author’s knowledge, there are no known results in infinite dimensions. Examples in finite-dimensional settings are given in [35,36], which prove Berry–Esseen-like inequalities in the very specific context of Bernoulli r.v.’s. See also [37]. Anyway, since one merit of [35] is to show how to use the classical Central Limit Theorem to prove an expansion as in (26), one could hope to follow that very same line of reasoning by resorting to some version of the central limit theorem for Banach spaces, such as that stated in [38]. Research on this is ongoing.
Now, to make the above Proposition 1 a bit more concrete, it is worth noticing the case in which f ( · | θ ) is in exponential form. In fact, in this case, the identity (26) can be rewritten in a simpler form, condensed in the following statement.
Lemma 1.
Besides Assumption 1, suppose that $f(x\,|\,\theta) = \exp\{\theta S(x) - M(\theta)\}$, with some measurable $S: X \to \mathbb{R}$ and $M(\theta) := \log \int_X e^{\theta S(x)}\, \chi(dx) \in \mathbb{R}$ for all $\theta \in \Theta$. Then, (26) holds with
$$\nu(D) := \mathsf{P}\big[ \big( \theta \mapsto \theta\, M'(T) - M(\theta) \big) \in D \big]$$
and
$$\Gamma\big( \theta \mapsto \theta\, M'(t) - M(\theta);\, \mu, \pi \big) = \frac{M''(t)}{p(t)}\, \frac{d^2}{dy^2}\Big[ M''(V(y))\, p(V(y))\, V'(y) \Big]\Big|_{y = M'(t)},$$
where $V$ denotes the inverse function of $M'$, i.e., $V(M'(t)) = t$ for any $t \in \Theta$.
To conclude this subsection, consider the expressions (19)–(21) and notice that they depend explicitly on the posterior distribution $\pi_n(\cdot\,|\,x_1,\dots,x_n)$. Now, thanks to Assumption 1, the mapping $t \mapsto \delta_t$ can be seen as defined on $\Theta$ and taking values in the dual space $H^*$, with Riesz representative $h_t \in H$. More formally, for any $h \in H$ and $t \in \Theta$, it holds that $h(t) = {}_{H}\langle h, \delta_t \rangle_{H^*} = \langle h, h_t \rangle$, where $\langle \cdot, \cdot \rangle$ stands for the scalar product on $H$ while ${}_{H}\langle \cdot, \cdot \rangle_{H^*}$ denotes the pairing between $H$ and $H^*$. In this notation, the posterior distribution can be rewritten in exponential form as:
$$\pi_n(B \mid X_1,\dots,X_n) = \frac{\int_B \exp\{n\, \langle H_n, h_\theta \rangle\}\, \pi(d\theta)}{\int_{\Theta} \exp\{n\, \langle H_n, h_\theta \rangle\}\, \pi(d\theta)} = \pi_n^*(B \mid H_n)$$
for any B T , the probability kernel π n * ( · | · ) : T × H [ 0 , 1 ] being defined by
$$\pi_n^*(B \mid h) := \frac{\int_B \exp\{n\, \langle h, h_\theta \rangle\}\, \pi(d\theta)}{\int_{\Theta} \exp\{n\, \langle h, h_\theta \rangle\}\, \pi(d\theta)}.$$
This is particularly interesting because it shows that the posterior distribution can always be thought of, in the presence of a dominated statistical model characterized by strictly positive, smooth densities, as an element of an exponential family, even if the original statistical model μ ( · | · ) is not in exponential form. By utilizing the kernel π n * in combination with the p.m. ν n , the following re-writings of (19)–(21) are valid:
$$R_U[\hat{U}_{n,m}] = \int_{H} \Big[ \int_{\Theta} v(\theta)\, \pi_n^*(d\theta \mid h) + \int_{\Theta} [m(\theta)]^2\, \pi_n^*(d\theta \mid h) - \Big( \int_{\Theta} m(\theta)\, \pi_n^*(d\theta \mid h) \Big)^2 \Big]\, \nu_n(dh)$$
$$R_{\Theta}[T^{\sharp}_{n,m}] = \int_{H}\int_{\Theta} \big[ \Delta(\theta, T^{\sharp}_{n,m}(h)) \big]^2\, \pi_n^*(d\theta \mid h)\, \nu_n(dh),$$
where, with a slight abuse of notation, $T^{\sharp}_{n,m}$ also denotes the mapping on $H$ such that $T^{\sharp}_{n,m}(H_n) = t^{\sharp}_{n,m}(X_1,\dots,X_n)$ holds $\mathsf{P}$-a.s.
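For concreteness, the following purely illustrative numerical sketch (not part of the paper) instantiates the objects of this subsection for the exponential model of Section 4.2: it evaluates the normalized log-likelihood process $H_n$ on a grid of $\Theta$ and forms the posterior in the exponential shape above, $\pi_n^*(d\theta \mid H_n) \propto \exp\{n\, H_n(\theta)\}\, p(\theta)\, d\theta$; the data and the prior density are hypothetical choices.

```python
# A purely illustrative numerical sketch: H_n on a grid of Theta, and the posterior in
# exponential form for the model f(x|theta) = theta * exp(-theta * x).
import numpy as np

rng = np.random.default_rng(2)
theta_true, n = 2.0, 200
x = rng.exponential(1.0 / theta_true, size=n)    # observed sample (placeholder data)

grid = np.linspace(0.05, 6.0, 4_000)             # discretization of Theta = (0, +infinity)
dtheta = grid[1] - grid[0]
H_n = np.log(grid) - grid * x.mean()             # H_n(theta) = (1/n) sum_i log f(X_i | theta)
prior = np.exp(-grid)                            # hypothetical prior density p: Exp(1)

log_post = n * H_n + np.log(prior)               # log of exp{n H_n(theta)} p(theta)
log_post -= log_post.max()                       # stabilize before exponentiating
post = np.exp(log_post)
post /= post.sum() * dtheta                      # normalized posterior density on the grid

print(np.sum(grid * post) * dtheta)              # posterior mean, close to theta_true for large n
```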

3. Main Results

The first result establishes a relationship between the Bayesian risk functions R U and R Θ defined in (16) and (17), respectively. Due to the central role of this relationship, it will be formulated within the general framework described in Section 2.1.
Proposition 2.
Consider any estimator $U^{\sharp}_{n,m} = u^{\sharp}_{n,m}(X_1,\dots,X_n)$ of $U_{n,m}$ and any estimator $T^{\sharp}_{n,m} = t^{\sharp}_{n,m}(X_1,\dots,X_n)$ of $T$ such that $\mathsf{E}[d_U^2(U_{n,m}, u_0)] < +\infty$ holds for some $u_0 \in U$ along with $\mathsf{E}\big[ L_{\Theta,(X_1,\dots,X_n)}(T^{\sharp}_{n,m}, t_0) \big] < +\infty$ for some $t_0 \in \Theta$. Then, it holds
$$R_U[U^{\sharp}_{n,m}] \le R_{\Theta}[T^{\sharp}_{n,m}] + \mathsf{E}\Big[ \int_U d_U^2(U^{\sharp}_{n,m}, u)\, \gamma_{T^{\sharp}_{n,m},(X_1,\dots,X_n)}(du) \Big] + 2\, \mathsf{E}\Big[ L_{\Theta,(X_1,\dots,X_n)}^{1/2}(T, T^{\sharp}_{n,m}) \Big( \int_U d_U^2(U^{\sharp}_{n,m}, u)\, \gamma_{T^{\sharp}_{n,m},(X_1,\dots,X_n)}(du) \Big)^{1/2} \Big]. \qquad (31)$$
In particular, if the Bayesian risk function $R_{\Theta}$ is optimized by choosing $T^{\sharp}_{n,m} = \hat{T}_{n,m}$, where $\hat{T}_{n,m}$ is as in (15), and $U^{\sharp}_{n,m}$ is chosen equal to $U^{*}_{n,m}$, where $U^{*}_{n,m}$ is as in (13), then (31) becomes
$$R_U[U^{*}_{n,m}] \le R_{\Theta}[\hat{T}_{n,m}] + \mathsf{E}\Big[ \int_U d_U^2(U^{*}_{n,m}, u)\, \gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)}(du) \Big] + 2\, \mathsf{E}\Big[ L_{\Theta,(X_1,\dots,X_n)}^{1/2}(T, \hat{T}_{n,m}) \Big( \int_U d_U^2(U^{*}_{n,m}, u)\, \gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)}(du) \Big)^{1/2} \Big]$$
$$= \inf_{T^{\sharp}_{n,m}} R_{\Theta}[T^{\sharp}_{n,m}] + \mathsf{E}\Big[ \inf_{U^{\sharp}_{n,m}} \int_U d_U^2(U^{\sharp}_{n,m}, u)\, \gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)}(du) \Big] + 2\, \mathsf{E}\Big[ L_{\Theta,(X_1,\dots,X_n)}^{1/2}(T, \hat{T}_{n,m}) \Big( \int_U d_U^2(U^{*}_{n,m}, u)\, \gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)}(du) \Big)^{1/2} \Big]. \qquad (32)$$
As an immediate remark, notice that the last member of (32) is obtained by first optimizing the risk $R_{\Theta}$ with respect to the choice of $T^{\sharp}_{n,m}$ and then, after getting $\hat{T}_{n,m}$, optimizing the term $\mathsf{E}\big[ \int_U d_U^2(U^{\sharp}_{n,m}, u)\, \gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)}(du) \big]$ with respect to the choice of $U^{\sharp}_{n,m}$. Of course, one can argue about the convenience of this procedure—and such a discussion is actually due—even if, in most problems, the strategy proposed in Proposition 2 proves indeed to be the simplest and the most feasible one, above all if computational issues are taken into account. In fact, the absolute best theoretical strategy—consisting of optimizing the right-hand side of (31) jointly with respect to the choice of $(U^{\sharp}_{n,m}, T^{\sharp}_{n,m})$—turns out very often to be too complex and onerous to carry out. Therefore, it seems reasonable to quantify, at least approximately, how far the strategy of Proposition 2 is from absolute optimality, in terms of efficiency. Finally, the additional term
$$2\, \mathsf{E}\Big[ L_{\Theta,(X_1,\dots,X_n)}^{1/2}(T, \hat{T}_{n,m}) \Big( \int_U d_U^2(U^{*}_{n,m}, u)\, \gamma_{\hat{T}_{n,m},(X_1,\dots,X_n)}(du) \Big)^{1/2} \Big] \qquad (33)$$
will be reconsidered in the next statement, within the simplified setting of Section 2.2. Indeed, by arguing asymptotically, it will be shown that it is essentially negligible, proving in this way a sort of “Pythagorean inequality”.
Henceforth, to make the above remark effective, we will formulate the subsequent results within the simplified setting introduced in Section 2.2. Indeed, Steps 1–3 mentioned in the Introduction are worthy of being reconsidered in light of Proposition 2. On the one hand, Steps 1 and 2 boil down to checking the existence and uniqueness of the barycenters appearing in (15) and (13), for instance by using the results contained in [24,25,26]. On the other hand, Step 3 hinges on the validity of (2) and (3), which are somewhat related to inequality (32). More precisely, (2) will be proved directly by resorting to identity (29), while (3) will be obtained by estimating the right-hand side of (32). Here is a precise statement.
Proposition 3.
Besides Assumption 1 and (18), suppose that $p > 0$ and $p \in C^1(\Theta)$, $m, v \in C^2(\Theta)$, $\Delta^2 \in C^2(\Theta^2)$, and that $\kappa_t$ is any element of $H \cap C^3(\Theta)$ with a unique minimum point at $t \in \Theta$. Then, it holds
$$\int_{\Theta} \{v(\theta) + [m(\theta)]^2\}\, \pi_n^*(d\theta \mid \kappa_t) - \Big( \int_{\Theta} m(\theta)\, \pi_n^*(d\theta \mid \kappa_t) \Big)^2 = v(t) + \frac{1}{n\,\kappa_t''(t)}\Big\{ [m'(t)]^2 + \frac{1}{2} v''(t) + v'(t)\Big( \frac{p'(t)}{p(t)} - \frac{1}{2}\frac{\kappa_t'''(t)}{\kappa_t''(t)} \Big) \Big\} + o\Big(\frac{1}{n}\Big) \qquad (34)$$
$$\int_{\Theta} \Delta^2(\theta,\tau)\, \pi_n^*(d\theta \mid \kappa_t) = \Delta^2(t,\tau) + \frac{1}{n\,\kappa_t''(t)}\Big\{ \frac{1}{2}\frac{\partial^2}{\partial\theta^2}\Delta^2(\theta,\tau)\Big|_{\theta=t} + \frac{\partial}{\partial\theta}\Delta^2(\theta,\tau)\Big|_{\theta=t}\Big( \frac{p'(t)}{p(t)} - \frac{1}{2}\frac{\kappa_t'''(t)}{\kappa_t''(t)} \Big) \Big\} + o\Big(\frac{1}{n}\Big) \qquad (35)$$
as $n \to +\infty$, for any $\tau \in \Theta$.
Here, it is worth noticing that the asymptotic expansions derived in the above proposition are obtained by means of the Laplace method, as first proposed in [39]. See also [40] [Chapter 20]. At this stage, we face the problem of optimizing the left-hand side of (35) with respect to $\tau$. Since the explicit expression of $\Delta^2(t,\tau)$ will hardly be known in closed form, a reasonable strategy considers, for fixed $t \in \Theta$, the optimization of the right-hand side of (35) with respect to $\tau$, disregarding the remainder term $o(1/n)$. If $\Delta^2 \in C^3(\Theta^2)$, this attempt leads to considering the equation
$$\frac{\partial}{\partial\tau}\Big\{ \Delta^2(t,\tau) + \frac{1}{n\,\kappa_t''(t)}\Big[ \frac{1}{2}\frac{\partial^2}{\partial\theta^2}\Delta^2(\theta,\tau)\Big|_{\theta=t} + \frac{\partial}{\partial\theta}\Delta^2(\theta,\tau)\Big|_{\theta=t}\Big( \frac{p'(t)}{p(t)} - \frac{1}{2}\frac{\kappa_t'''(t)}{\kappa_t''(t)} \Big) \Big] \Big\} = 0 \qquad (36)$$
and, since
$$\frac{\partial}{\partial\tau}\Delta^2(t,\tau)\Big|_{\tau=t} = 0, \qquad (37)$$
we have that any solution of (36) is of the form $\hat{\tau}_n = t + \epsilon_n$, with some $\epsilon_n$ that goes to zero as $n \to +\infty$. For completeness, the validity of (37) could be obtained by using the explicit expression of the Wasserstein distance due to Dall’Aglio. See [41].
If Δ 2 C 4 ( Θ 2 ) , we can plug the expression of τ ^ n into (35), and expand further the right-hand side. Exploiting that
$$\Delta^2(t,\hat{\tau}_n) = \frac{1}{2}\frac{\partial^2}{\partial\tau^2}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n^2 + o(\epsilon_n^2)$$
$$\frac{\partial}{\partial t}\Delta^2(t,\hat{\tau}_n) = \frac{\partial}{\partial\tau}\frac{\partial}{\partial t}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n + \frac{1}{2}\frac{\partial^2}{\partial\tau^2}\frac{\partial}{\partial t}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n^2 + o(\epsilon_n^2)$$
$$\frac{\partial^2}{\partial t^2}\Delta^2(t,\hat{\tau}_n) = \frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t} + \frac{\partial}{\partial\tau}\frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n + \frac{1}{2}\frac{\partial^2}{\partial\tau^2}\frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n^2 + o(\epsilon_n^2),$$
we get
$$\int_{\Theta} \Delta^2(\theta,\hat{\tau}_n)\, \pi_n^*(d\theta \mid \kappa_t) = \frac{1}{2}\frac{\partial^2}{\partial\tau^2}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n^2 + \frac{1}{n\,\kappa_t''(t)}\Big\{ \frac{1}{2}\frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t} + \frac{1}{2}\frac{\partial}{\partial\tau}\frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n + \frac{1}{4}\frac{\partial^2}{\partial\tau^2}\frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n^2 + \Big( \frac{p'(t)}{p(t)} - \frac{1}{2}\frac{\kappa_t'''(t)}{\kappa_t''(t)} \Big)\frac{\partial}{\partial\tau}\frac{\partial}{\partial t}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n + \frac{1}{2}\Big( \frac{p'(t)}{p(t)} - \frac{1}{2}\frac{\kappa_t'''(t)}{\kappa_t''(t)} \Big)\frac{\partial^2}{\partial\tau^2}\frac{\partial}{\partial t}\Delta^2(t,\tau)\Big|_{\tau=t}\,\epsilon_n^2 \Big\} + o(\epsilon_n^2) + o\Big(\frac{1}{n}\Big).$$
The right-hand side of this expression has the form
$$a\,\epsilon_n^2 + \frac{1}{n}\big( A\,\epsilon_n^2 + B\,\epsilon_n + C \big) + o(\epsilon_n^2) + o\Big(\frac{1}{n}\Big),$$
so that the choice
$$\epsilon_n = -\frac{B}{2na}\Big( 1 + \frac{A}{na} \Big)^{-1} + o\Big(\frac{1}{n^2}\Big) = -\frac{B}{2na} + o\Big(\frac{1}{n}\Big)$$
optimizes its expression. Whence,
$$\hat{\tau}_n = t - \frac{1}{n\,\kappa_t''(t)}\Big( \frac{\partial^2}{\partial\tau^2}\Delta^2(t,\tau)\Big|_{\tau=t} \Big)^{-1}\Big[ \frac{1}{2}\frac{\partial}{\partial\tau}\frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t} + \Big( \frac{p'(t)}{p(t)} - \frac{1}{2}\frac{\kappa_t'''(t)}{\kappa_t''(t)} \Big)\frac{\partial}{\partial\tau}\frac{\partial}{\partial t}\Delta^2(t,\tau)\Big|_{\tau=t} \Big] + o\Big(\frac{1}{n}\Big) \qquad (39)$$
and consequently
$$\int_{\Theta} \Delta^2(\theta,\hat{\tau}_n)\, \pi_n^*(d\theta \mid \kappa_t) = \frac{1}{2n\,\kappa_t''(t)}\,\frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t} + o\Big(\frac{1}{n}\Big). \qquad (40)$$
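A purely illustrative numerical check of the Laplace-type expansion just used is sketched below (not part of the paper), for the exponential model with $\kappa_t(\theta) = K(t\,\|\,\theta)$; the convention assumed here is that the posterior-type weight is $\exp\{-n\,\kappa_t(\theta)\}\,p(\theta)$, which concentrates at the minimum point $t$ of $\kappa_t$, and both the prior and the test functional are hypothetical choices.

```python
# Purely illustrative check of the second-order Laplace expansion: under the assumed
# convention, E[g] ≈ g(t) + (1/(n k''(t))) * ( g''(t)/2 + g'(t)*(p'(t)/p(t) - k'''(t)/(2 k''(t))) ).
import numpy as np

t = 1.5
kappa = lambda th: np.log(t / th) + th / t - 1.0      # K(t || theta) for the exponential model
prior = lambda th: np.exp(-th)                        # hypothetical prior density (Exp(1))
g = lambda th: 1.0 / th                               # test functional, e.g. m(theta) of Section 4.2

grid = np.linspace(0.02, 12.0, 200_000)

def posterior_expectation(n):
    logw = -n * kappa(grid) + np.log(prior(grid))
    w = np.exp(logw - logw.max())
    return np.sum(g(grid) * w) / np.sum(w)

kpp, kppp = 1.0 / t**2, -2.0 / t**3                   # kappa'', kappa''' at theta = t
gp, gpp = -1.0 / t**2, 2.0 / t**3                     # g', g'' at theta = t
pp_over_p = -1.0                                      # p'/p for the Exp(1) prior
for n in (50, 200, 800, 3200):
    expansion = g(t) + (1.0 / (n * kpp)) * (0.5 * gpp + gp * (pp_over_p - 0.5 * kppp / kpp))
    print(n, n * (posterior_expectation(n) - expansion))   # tends to 0, i.e. the error is o(1/n)
```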
A first consequence of these computations is that the (Bayesian) estimator T ^ n , m in (15) has the same form as (39) with t and κ t replaced by the MLE, denoted by θ ^ n , and H n , respectively. Of course, this fact has some relevance only in the case that θ ^ n exists and is unique. Moreover, coming back to (32), it is worth noticing that
$$\inf_{U^{\sharp}_{n,m}} \int_U |U^{\sharp}_{n,m} - u|^2\, \gamma_{\hat{\tau}_n}(du) = v(\hat{\tau}_n) = v(t) + v'(t)\,\epsilon_n + o\Big(\frac{1}{n}\Big),$$
where we have dropped the dependence on ( X 1 , , X n ) in the expression of γ τ ^ n , in agreement with the simplified setting of Section 2.2 we are following. The last preliminary remark is about the additional term (33) that appears in the last member of (32). In fact, exploiting from the beginning that U = R and L U ( u , v ) = | u v | 2 , we find that it reduces to
$$2\, \mathsf{E}\Big[ \int_{\Theta}\int_{X^m} \big( u_m(\mathbf{y},\theta) - u_m(\mathbf{y},\hat{T}_{n,m}) \big)\big( u_m(\mathbf{y},\hat{T}_{n,m}) - m(\hat{T}_{n,m}) \big)\, \mu_m(d\mathbf{y} \mid \theta)\, \pi_n(d\theta \mid X_1,\dots,X_n) \Big] \qquad (42)$$
by which we notice that it also involves “covariance terms”. The way is now paved to state the following.
Theorem 2.
Besides Assumption 1 and (18), suppose that $m, v \in C^2(\Theta)$ and $\Delta^2 \in C^4(\Theta^2)$. Then, the identities
$$\frac{\partial^2}{\partial\tau^2}\Delta^2(t,\tau)\Big|_{\tau=t} = -\,\frac{\partial}{\partial\tau}\frac{\partial}{\partial t}\Delta^2(t,\tau)\Big|_{\tau=t} \qquad (43)$$
$$\frac{1}{2} v''(t) + [m'(t)]^2 = \frac{1}{2}\frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t} - \frac{1}{2}\Big( \frac{\partial^2}{\partial\tau^2}\Delta^2(t,\tau)\Big|_{\tau=t} \Big)^{-1}\frac{\partial}{\partial\tau}\frac{\partial^2}{\partial t^2}\Delta^2(t,\tau)\Big|_{\tau=t}\, v'(t) \qquad (44)$$
entail that
$$\int_{\Theta} \{v(\theta) + [m(\theta)]^2\}\, \pi_n^*(d\theta \mid \kappa_t) - \Big( \int_{\Theta} m(\theta)\, \pi_n^*(d\theta \mid \kappa_t) \Big)^2 = \int_{\Theta} \Delta^2(\theta,\hat{\tau}_n)\, \pi_n^*(d\theta \mid \kappa_t) + v(t) + v'(t)\,\epsilon_n + o\Big(\frac{1}{n}\Big) \qquad (45)$$
for any $t \in \Theta$, any $\kappa_t$ in $H \cap C^3(\Theta)$ with a unique minimum point at $t \in \Theta$, and any $p > 0$ with $p \in C^1(\Theta)$, provided that the term in (42) is of $o(\frac{1}{n})$-type. Thus, if either
(A1)
(26) holds uniformly with respect to some class F of continuous functionals Ψ : H R , in the sense that
$$\sup_{\Psi \in \mathcal{F}} \Big| \int_H \Psi(h)\, \nu_n(dh) - \int_H \Psi(h)\, \nu(dh) - \frac{1}{n}\int_H \Psi(h)\, \Gamma(h; \mu, \pi)\, \nu(dh) \Big| = o\Big(\frac{1}{n}\Big)$$
(A2)
both the functionals $h \mapsto \int_{\Theta} \{v(\theta) + [m(\theta)]^2\}\, \pi_n^*(d\theta \mid h) - \big( \int_{\Theta} m(\theta)\, \pi_n^*(d\theta \mid h) \big)^2$ and $h \mapsto \inf_{T^{\sharp}_{n,m}} \int_{\Theta} [\Delta(\theta, T^{\sharp}_{n,m}(h))]^2\, \pi_n^*(d\theta \mid h)$ belong to $\mathcal{F}$, for all $n \in \mathbb{N}$
or
(B1)
(34) and (40) hold uniformly for all κ t belonging to a given subset D of H
(B2)
ν n ( D ) = 1 for all n N
then (2)–(3) are in force with
$$\hat{R}_{0,m} = R^{*}_{0,m} = \int_{\Theta} v(t)\, \pi(dt) \qquad (46)$$
$$\hat{R}_{1,m} = R^{*}_{1,m} = \int_{\Theta} \frac{1}{\bar{\kappa}_t''(t)}\Big\{ [m'(t)]^2 + \frac{1}{2} v''(t) + v'(t)\Big( \frac{p'(t)}{p(t)} - \frac{1}{2}\frac{\bar{\kappa}_t'''(t)}{\bar{\kappa}_t''(t)} \Big) \Big\}\, \pi(dt) + \int_{\Theta} v(t)\, \Gamma(\bar{\kappa}_t; \mu, \pi)\, \pi(dt), \qquad (47)$$
where $\bar{\kappa}_t(\theta) := K(t\,\|\,\theta)$, for any $p > 0$ with $p \in C^1(\Theta)$.
As announced in the Introduction, here we have minted the term compatibility equations to refer to identities (43) and (44). They actually constitute two “compatibility conditions” that involve only the statistical model, without any mention of the prior. The dependence on the quantity to be estimated is indeed hidden in the expression of $\Delta^2$. More deeply, these equations can be viewed as a check on the compatibility between the original estimation problem (1) and the fact that we have metrized the space of the parameters Θ as in (20). Actually, they could have a more general value if interpreted as relations aimed at characterizing $\Delta^2$, rather than imposing that this distance is given in terms of the Wasserstein distance as in (20). However, for a distance that is characterized differently from (20), an analogue of inequality (32) should be checked in terms of this new distance on Θ. As to the concrete check of the compatibility equations, we notice that the former identity (43) is generally valid as a consequence of the representation formula for the Wasserstein distance due to Dall’Aglio (see [41]), as long as the exchange between derivatives and integrals is allowed. For the other identity (44), we have instead collected in Section 4 some examples of simple statistical models for which its verification proves to be quite simple. Finally, the issue of extending these equations to higher dimensions, including the infinite-dimensional case, is deferred to Section 6.
Apropos of the other assumptions, the verification that the term in (42) is of $o(\frac{1}{n})$-type is generally straightforward. For instance, such a term is even equal to zero if $u_m$ is independent of $\theta$. As to the two groups of assumptions which are needed to prove (46) and (47), the latter block, formed by (B1) and (B2), is certainly easier to check. However, (B1) and (B2) can prove to be rather strong, since they require the existence of the MLE for any $n \in \mathbb{N}$. On the other hand, checking (A1) and (A2) is generally harder, since it constitutes a strong reinforcement of de Finetti’s Law of Large Numbers for the log-likelihood process, similar in its conception to those stated in [35,36]. Moreover, the check of (A2) is more or less equivalent to proving a uniform regularity of the mapping $h \mapsto \pi_n^*(\cdot \mid h)$, as a map from $H$ into the space of p.m.’s on $(\Theta, \mathcal{T})$ metrized with a Wasserstein distance. This theory is presented and developed in [42,43]. In any case, these lines of research deserve further investigation, to be deferred to a forthcoming paper.
Finally, we consider Steps 4–6 mentioned in the Introduction, in light of the previous results. In fact, the compatibility Equations (43) and (44) suggest two new compatibility conditions, which are necessary to get (10) along with R ^ i , m = R ˜ i , m for i = 0 , 1 . A formal statement reads as follows.
Theorem 3.
Besides Assumption 1 and (18), suppose that $m, v \in C^2(\Theta)$ and $\Delta^2 \in C^4(\Theta^2)$. Assume also that either (A1) and (A2) or (B1) and (B2) of Theorem 2 are in force. Then, any solution $\hat{\tau}_n$ of the following equations:
$$v(\hat{\theta}_n) = v(\hat{\tau}_n) + \Delta^2(\hat{\tau}_n, \hat{\theta}_n) + o\Big(\frac{1}{n}\Big) \qquad (48)$$
$$v'(\hat{\theta}_n) = \frac{\partial}{\partial t}\Delta^2(\hat{\tau}_n, t)\Big|_{t=\hat{\theta}_n} + o\Big(\frac{1}{n}\Big) \qquad (49)$$
$$\frac{1}{2} v''(\hat{\theta}_n) + [m'(\hat{\theta}_n)]^2 = \frac{1}{2}\frac{\partial^2}{\partial t^2}\Delta^2(\hat{\tau}_n, t)\Big|_{t=\hat{\theta}_n} + o\Big(\frac{1}{n}\Big), \qquad (50)$$
where $\hat{\theta}_n$ stands for the MLE, yields a prior-free estimator $\tilde{T}_{n,m}$ and, through Step 5, another prior-free estimator $\tilde{U}_{n,m}$ that satisfies (10) along with $\hat{R}_{i,m} = \tilde{R}_{i,m}$ for $i = 0, 1$, where $\hat{R}_{0,m}$ and $\hat{R}_{1,m}$ are as in (46) and (47), respectively, provided that the term in (42) is of $o(\frac{1}{n})$-type.
The derivation of new prior-free estimators via this procedure represents a novel line of research that we would like to pursue in forthcoming works.

4. Applications and Examples

This section is split into five subsections, and has two main purposes. In fact, Section 4.1, Section 4.2 and Section 4.3 just contain explicit examples of very simple statistical models for which the compatibility equations are satisfied. These models are the one-dimensional Gaussian, the exponential and the Pareto models. Section 4.4 has a different nature, since it is devoted to a more concrete application of our approach to the original Poisson-mixture setting used by Herbert Robbins to introduce his own approach to empirical Bayes theory. Finally, Section 4.5 carries on the discussion initiated in Section 4.4 by showing a concrete application relative to one year of claims data for an automobile insurance company.

4.1. The Gaussian Model

Here, we have X = Θ = R and
$$\mu(A \mid \theta) = \int_A \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2}(x-\theta)^2 \Big\}\, dx \qquad (A \in \mathcal{B}(\mathbb{R}))$$
for some known $\sigma^2 > 0$. For simplicity, we put $m = 1$ and $u_1(y,\theta) = y$, which is tantamount to saying that the original predictive aim was focused on the estimation of $X_{n+1}$. In this setting, it is very straightforward to check that $m(\theta) = \theta$ and $v(\theta) = \sigma^2$. Moreover, in view of well-known computations on the Wasserstein distance (see [44,45]), it is also straightforward to check that $\Delta^2(\theta,\tau) = |\theta - \tau|^2$. Therefore, (43) becomes $2 = 2$, while (44) reduces to $1 = 1$. Finally, it is also possible to check the validity of (48)–(50) with the simplest choice $\hat{\tau}_n = \hat{\theta}_n$.
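A purely illustrative Monte Carlo sketch of this Gaussian example is reported below (not part of the paper), under a hypothetical conjugate $N(0, s^2)$ prior: the Bayes estimator of $X_{n+1}$ is the posterior mean of $\theta$ and coincides with the plug-in estimator $U^{*}_{n,m}$ built from $\hat{T}_{n,m}$ (since $\Delta$ is the Euclidean distance here), while the prior-free choice based on the MLE is the sample mean; their Bayesian risks agree beyond the $1/n$ order.

```python
# Purely illustrative Monte Carlo comparison of the Bayes/plug-in estimator and the
# MLE-based prior-free estimator for the Gaussian example, under a hypothetical prior.
import numpy as np

sigma2, s2, n, reps = 1.0, 4.0, 50, 200_000
rng = np.random.default_rng(3)

theta = rng.normal(0.0, np.sqrt(s2), reps)                  # draws of T from the prior
xbar = theta + rng.normal(0.0, np.sqrt(sigma2 / n), reps)   # sufficient statistic of X_1..X_n
x_new = theta + rng.normal(0.0, np.sqrt(sigma2), reps)      # the quantity to predict, X_{n+1}

u_bayes = (s2 / (s2 + sigma2 / n)) * xbar                   # Bayes / plug-in estimator
u_mle = xbar                                                # prior-free estimator (MLE plug-in)

risk_bayes = np.mean((x_new - u_bayes) ** 2)                # approx sigma2 + posterior variance
risk_mle = np.mean((x_new - u_mle) ** 2)                    # approx sigma2 * (1 + 1/n)
# Exact gap is sigma2^2 / (n * (n*s2 + sigma2)), i.e. o(1/n); Monte Carlo noise dominates it.
print(risk_bayes, risk_mle)
```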
The case of constant mean and unknown variance will not be dealt with here because its treatment is substantially included in the following subsection. Apropos of the multidimensional variant of this model, very important in many statistical applications, we just mention the interesting paper [46] which paves the way, mathematically speaking, to write down the multidimensional analogous of the compatibility equations in a full Riemannian context.

4.2. The Exponential Model

Here, we have X = Θ = ( 0 , ) and
$$\mu(A \mid \theta) = \int_A \theta\, e^{-\theta x}\, dx \qquad (A \in \mathcal{B}((0,+\infty))).$$
Again, for simplicity, we put $m = 1$ and $u_1(y,\theta) = y$, which is tantamount to saying that the original predictive aim was focused on the estimation of $X_{n+1}$. In this setting, it is very straightforward to check that $m(\theta) = 1/\theta$ and $v(\theta) = 1/\theta^2$. Moreover, by resorting to the Dall’Aglio representation of the Wasserstein distance (see [41]), it is also straightforward to check that $\Delta^2(\theta,\tau) = 2\,|1/\theta - 1/\tau|^2$. Although very simple, this is a very interesting example of a non-Euclidean distance on $\Theta = (0,\infty)$. As to the validity of the compatibility equations, we easily see that (43) yields $4/t^4 = 4/t^4$, while (44) becomes:
$$\frac{3}{t^4} + \Big(\frac{1}{t^2}\Big)^2 = \frac{1}{2}\cdot\frac{4}{t^4} - \frac{1}{2}\cdot\frac{8}{t^5}\cdot\Big(\frac{4}{t^4}\Big)^{-1}\cdot\Big(-\frac{2}{t^3}\Big).$$
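The verification just carried out can also be automated. The following purely illustrative symbolic sketch (not part of the paper; it assumes SymPy is available) checks the two compatibility equations for this exponential model, with $m(t) = 1/t$, $v(t) = 1/t^2$ and $\Delta^2(t,\tau) = 2(1/t - 1/\tau)^2$; both expressions should simplify to zero.

```python
# Purely illustrative symbolic check of the compatibility Equations (43) and (44)
# for the exponential model of this subsection.
import sympy as sp

t, tau = sp.symbols("t tau", positive=True)
m, v = 1 / t, 1 / t**2
Delta2 = 2 * (1 / t - 1 / tau) ** 2

d2_tau = sp.diff(Delta2, tau, 2).subs(tau, t)             # second tau-derivative at tau = t
d_tau_d_t = sp.diff(Delta2, tau, 1, t, 1).subs(tau, t)    # mixed derivative at tau = t
d2_t = sp.diff(Delta2, t, 2).subs(tau, t)                 # second t-derivative at tau = t
d_tau_d2_t = sp.diff(Delta2, t, 2, tau, 1).subs(tau, t)   # tau-derivative of the second t-derivative

eq43 = sp.simplify(d2_tau + d_tau_d_t)
eq44 = sp.simplify(sp.Rational(1, 2) * sp.diff(v, t, 2) + sp.diff(m, t) ** 2
                   - sp.Rational(1, 2) * d2_t
                   + sp.Rational(1, 2) * d_tau_d2_t / d2_tau * sp.diff(v, t))
print(eq43, eq44)   # expected output: 0 0
```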

4.3. The Pareto Model

Here, we have X = Θ = ( 0 , ) and
$$\mu(A \mid \theta) = \int_{A \cap (\theta,+\infty)} \frac{\alpha\,\theta^{\alpha}}{x^{\alpha+1}}\, dx \qquad (A \in \mathcal{B}((0,+\infty)))$$
for some known $\alpha > 2$. Again, for simplicity, we put $m = 1$ and $u_1(y,\theta) = y$, which is tantamount to saying that the original predictive aim was focused on the estimation of $X_{n+1}$. In this setting, it is very straightforward to check that $m(\theta) = \frac{\alpha}{\alpha-1}\,\theta$ and $v(\theta) = \frac{\alpha}{(\alpha-2)(\alpha-1)^2}\,\theta^2$. Moreover, by resorting to the Dall’Aglio representation of the Wasserstein distance (see [41]), it is also straightforward to check that $\Delta^2(\theta,\tau) = \frac{\alpha}{\alpha-2}\,|\theta - \tau|^2$. Of course, this is not a regular model, since the support of $\mu(\cdot\,|\,\theta)$ varies with $\theta$. Anyway, it is interesting to notice that the compatibility equations are still valid in this case. Therefore, the analysis of such non-regular models should motivate further investigations about their intrinsic value.

4.4. Robbins’ Approach to Empirical Bayes

In his seminal paper [23], Herbert Robbins introduced the following model to present his own approach to empirical Bayes theory. The problem that he considers is inspired by car insurance data analysis, and it is only slightly different from a “standard” predictive problem. We start by putting $X = \mathbb{N}_0^2$ and $U = \mathbb{N}_0$, and considering exchangeable random variables $X_i$’s with $X_i = (\xi_i, \eta_i)$. The practical meaning is that $\xi_i$ represents the number of accidents experienced by the $i$-th customer in the past year, while $\eta_i$ represents the number of accidents that the same $i$-th customer will experience in the current year. Then, Robbins (in his own notation) attaches to the $i$-th customer a random parameter, say $\lambda_i > 0$, which represents the rate of a Poisson distribution for that customer. Moreover, he considers the $\lambda_i$’s as i.i.d. and, conditionally on the $\lambda_i$’s, the $X_i$’s become independent; in addition, $\xi_i$ and $\eta_i$ become i.i.d. with distribution $\mathrm{Poi}(\lambda_i)$, for all $i \in \mathbb{N}$. Robbins calls $G$ the common distribution of the $\lambda_i$’s and interprets it as a “prior distribution”. However, if we strictly follow the Bayesian main way, we should call this distribution $\theta$ to avoid confusion, and just realize that we have, this way, defined the statistical model, that is
$$\mu(\{(k,h)\} \mid \theta) = \int_0^{+\infty} \frac{e^{-z} z^{k}}{k!}\, \frac{e^{-z} z^{h}}{h!}\, \theta(dz) \qquad ((k,h) \in \mathbb{N}_0^2). \qquad (51)$$
Thus, the actual prior (Bayesianly speaking) is some p.m. $\pi$ on the space of all p.m.’s on $((0,+\infty), \mathcal{B}((0,+\infty)))$, while the random parameter $T$ considered in the present paper is some random probability measure. Here, the objective—actually very practical and intuitively logical—is to estimate $\eta_1$ on the basis of the sample $(\xi_1,\dots,\xi_n)$. Thus, our $U_{n,m}$ coincides with $\eta_1$ and the loss function is just, as usual, the quadratic loss. Throughout his paper, Robbins works under the conditioning to $T = \theta$ (that is, under a fixed prior, in his own terminology). Hence, his “theoretical estimator” reads
$$\mathsf{E}_{\theta}[\eta_1 \mid (\xi_1,\dots,\xi_n)] = \mathsf{E}_{\theta}[\eta_1 \mid \xi_1] = (\xi_1 + 1)\, \frac{p_{\theta}(\xi_1 + 1)}{p_{\theta}(\xi_1)}, \qquad (52)$$
where $p_{\theta}(k) := \mu(\{k\} \times \mathbb{N}_0 \mid \theta)$. To get rid of the unobservable $\theta$, Robbins exploits that $\mathsf{E}_{\theta}[\xi_1] = \sum_{k=0}^{+\infty} k\, p_{\theta}(k)$ to bring the Strong Law of Large Numbers into the game. Indeed, since
$$\hat{p}(k) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\xi_i = k\}\ \xrightarrow{\ \mathsf{P}_{\theta}\text{-a.s.}\ }\ p_{\theta}(k)$$
holds for any θ , then it could be worth considering the (prior-free) estimator:
$$\tilde{U}_{n,m} = (\xi_1 + 1)\, \frac{\hat{p}(\xi_1 + 1)}{\hat{p}(\xi_1)}. \qquad (53)$$
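For completeness, here is a purely illustrative sketch (not from Robbins’ paper) of how the prior-free estimator (53) is computed from data: the empirical frequencies $\hat{p}(k)$ are formed from last year’s counts and the rule $(k+1)\,\hat{p}(k+1)/\hat{p}(k)$ is applied; the Gamma–Poisson counts used below are synthetic placeholders.

```python
# Purely illustrative computation of the Robbins-type prior-free estimator (53).
import numpy as np
from collections import Counter

def robbins_estimates(xi):
    """Return {k: estimate of E[eta | xi = k]} for every observed count k."""
    xi = np.asarray(xi, dtype=int)
    freq = Counter(xi.tolist())
    p_hat = {k: c / xi.size for k, c in freq.items()}   # empirical frequencies p_hat(k)
    return {k: (k + 1) * p_hat.get(k + 1, 0.0) / p_hat[k] for k in sorted(p_hat)}

rng = np.random.default_rng(4)
rates = rng.gamma(shape=1.0, scale=0.2, size=10_000)    # hypothetical customer-specific rates
xi = rng.poisson(rates)                                 # last year's claim counts (synthetic)
print(robbins_estimates(xi))
```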
At this stage, if we want to maintain the Bayesian main way, we should make three basic considerations. First, given the statistical model (51), independently of the estimation problem, the assumption of exchangeability of the X i ’s entails the existence of some prior distribution π , by de Finetti’s representation theorem. Second, given the quadratic loss function on U , the best (i.e., the most efficient) estimator is given by:
$$\widehat{U}_{n,m}:=\mathsf{E}[\eta_1\,|\,(\xi_1,\dots,\xi_n)],$$
where the expectation E depends, of course, on the prior π. Third, if we consider the above estimator as useless, because of an effective ignorance about the prior π, we are justified in considering the above Ũ_{n,m} as a possible approximation of Û_{n,m}, in the sense expressed by the joint validity of (2) and (10), with R̂_{i,m} = R̃_{i,m} for i = 0, 1, uniformly with respect to a whole (possibly very large) class of priors π. Unfortunately, this is not the case. Or rather, we could achieve this goal only in the presence of distinguished choices of π. Therefore, if there is ignorance about π, we can only consider the Robbins estimator as efficient "at the zero-level", and not also "at the O(1/n)-level". If we follow the approach presented in this paper, the natural choice for an estimator is given by:
$$U_{n,m}^{*}=(\xi_1+1)\,\frac{\displaystyle\int_0^{+\infty}\frac{e^{-z}z^{\xi_1+1}}{(\xi_1+1)!}\,\widehat{T}_{n,m}(\mathrm{d}z)}{\displaystyle\int_0^{+\infty}\frac{e^{-z}z^{\xi_1}}{\xi_1!}\,\widehat{T}_{n,m}(\mathrm{d}z)},$$
where the estimator T̂_{n,m} belongs to the effective space of the parameters Θ, that is, the space of all p.m.'s on ((0,+∞), ℬ((0,+∞))), and is identified as:
$$\widehat{T}_{n,m}=\operatorname*{Argmin}_{\tau\in\Theta}\int_{\Theta}W_2^2\big(\mu_{\xi_1,\theta},\,\mu_{\xi_1,\tau}\big)\,\pi_n(\mathrm{d}\theta\,|\,\xi_1,\dots,\xi_n),$$
with
$$\mu_{k,\theta}(A):=\frac{\displaystyle\int_A\frac{e^{-z}z^{k}}{k!}\,\theta(\mathrm{d}z)}{\displaystyle\int_0^{+\infty}\frac{e^{-z}z^{k}}{k!}\,\theta(\mathrm{d}z)}\qquad\big(A\in\mathscr{B}((0,+\infty))\big).$$
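Although a full implementation of the minimization in (55) is beyond the scope of this subsection, the plug-in step in (54) is easy to code once an estimate T̂_{n,m} is available. The sketch below is ours and purely illustrative: it assumes that T̂_{n,m} is represented by a discrete approximation on a grid (points z with weights w), and all names are hypothetical.

```python
# A minimal sketch of the plug-in predictive estimator (54), assuming the
# estimated mixing measure T_hat is approximated by a discrete measure on a grid.
import numpy as np
from scipy.stats import poisson

def plug_in_estimate(xi1, z, w):
    """(xi1 + 1) * int Poi(xi1+1 | z) T_hat(dz) / int Poi(xi1 | z) T_hat(dz)."""
    z, w = np.asarray(z, float), np.asarray(w, float)
    num = np.sum(w * poisson.pmf(xi1 + 1, z))
    den = np.sum(w * poisson.pmf(xi1, z))
    return (xi1 + 1) * num / den

# Example: T_hat approximated by an Exp(beta) law discretized on a grid.
beta = 4.0
z = np.linspace(1e-3, 20.0, 2000)
w = beta * np.exp(-beta * z)
w /= w.sum()
print(plug_in_estimate(0, z, w))   # close to (0 + 1) / (beta + 1) = 0.2
```

When T̂_{n,m} is an exponential law, the output of this sketch agrees with the closed form derived later in Section 4.5.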
The proof of the fact that our estimator is more efficient than Robbins' estimator, at least asymptotically and uniformly with respect to a whole class of priors, will be given in a forthcoming paper. Such a proof will constitute only a first step towards a complete vindication of our approach. The crowning achievement of the project would be the production of some prior-free approximation of T̂_{n,m} that could lead, through (54), to an efficient estimator Ũ_{n,m} up to the "O(1/n)-level". Research on this is ongoing.

4.5. An Example of Real Data Analysis

This subsection represents a continuation of the analysis of Robbins' approach to empirical Bayes theory, hinting at some concrete applications. We display below Table 1, taken from [47], which refers to one year of claims data for a European automobile insurance company. The original source of the data is the actuarial work [48].
Here, a population of 9461 automobile insurance policy holders is considered. Out of these, 7840 made no claims during the year, 1317 made a single claim, 239 made two claims each, and so forth, continuing to the one person who made seven claims. The insurance company is concerned about the claims each policy holder will make in the next year. The third and fourth rows of the table provide estimates of such numbers, obtained by following the original Robbins method (based on (53)) and another compound model discussed in Section 6.1 of [47], respectively. In particular, the Robbins estimator predicts that the 7840 policy holders who made no claims during the year will contribute an amount of 7840 × 0.168 ≈ 1317 accidents, and so on. Analogously, the compound model predicts that the same 7840 policy holders will contribute an amount of 7840 × 0.164 ≈ 1286 accidents, and so on. Moreover, it is worth noticing that the original Robbins estimator suffers from the lack of certain regularity properties, such as monotonicity, so that various smoothed versions of it have been provided by other authors. See [49]. See also [50] [Chapter 5] for a comprehensive treatment.
Here, we seize the opportunity to give the reader a taste of our approach, as explained in Section 4.4. A detailed treatment would prove, in any case, too complex to be thoroughly developed in this paper, due to the significant amount of numerical techniques which are necessary to carry out our strategy. Indeed, the big issue concerns the implementation of the infinite-dimensional minimization problem (55), which is still under investigation. However, we can simplify the treatment by restricting attention to prior distributions π that put the total unitary mass, for example, on the set E of exponential distributions, so that θ(dz) = βe^{−βz} dz for z > 0 and some hyper-parameter β > 0. Thus, given some hyper-prior ζ on the hyper-parameter β, we can easily see that (55) boils down to a simple, one-dimensional minimization problem. Its solution T̂_{n,m} is provided by the distribution
$$\hat{\beta}_n\,e^{-\hat{\beta}_n z}\,\mathrm{d}z$$
with β̂_n coinciding with the harmonic mean of the posterior distribution of the hyper-parameter β. On the basis of the theory developed in the paper, this solution will prove asymptotically nearly optimal, uniformly with respect to the (narrow) class of prior distributions that put the total unitary mass on E. Hence, the estimator U*_{n,m} in (54) assumes the form
$$U_{n,m}^{*}=\frac{\xi_1+1}{\hat{\beta}_n+1}.$$
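For completeness, here is the short computation behind this reduction; it only uses the normalization of the Gamma integral:
$$\int_0^{+\infty}\frac{e^{-z}z^{k}}{k!}\,\hat{\beta}_n\,e^{-\hat{\beta}_n z}\,\mathrm{d}z=\frac{\hat{\beta}_n}{(1+\hat{\beta}_n)^{k+1}}\qquad(k\in\mathbb{N}_0),$$
so that the ratio of integrals in (54) equals $(1+\hat{\beta}_n)^{-1}$, and multiplying by $(\xi_1+1)$ gives the stated form.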
This last estimator is, of course, not prior-free, because β̂_n depends on the prior ζ. However, to get a quick result, we can approximate β̂_n by means of the Laplace method again, yielding
$$\frac{\xi_1+1}{\hat{\beta}_n+1}\;\approx\;(\xi_1+1)\,\frac{S_n}{S_n+n}=:\widetilde{U}_{n,m},$$
where S_n represents the total number of claimed accidents. Since n = 9461 and S_n = 2028 in the dataset under consideration, we obtain the following new Table 2, which is indeed comparable with the previous one. To give an idea, the Robbins estimator predicts 2019 total accidents for the next year, while the estimator Ũ_{n,m} above predicts 2033 total accidents.
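The figures in Table 2 can be reproduced with a few lines of code; the following sketch (ours, for illustration only) recomputes the row of Ũ_{n,m} from the raw claim counts reported above.

```python
# Reproducing the last row of Table 2 from the raw claim counts (illustration only).
counts = [7840, 1317, 239, 42, 14, 4, 4, 1]         # policy holders with k = 0,...,7 claims
n = sum(counts)                                     # 9461 policy holders
S_n = sum(k * c for k, c in enumerate(counts))      # 2028 accidents claimed in total

u_tilde = [(k + 1) * S_n / (S_n + n) for k in range(len(counts))]
print([round(u, 3) for u in u_tilde])               # matches the third row of Table 2 up to rounding
```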
In any case, a thorough analysis of this specific example deserves more attention, and will be developed in a forthcoming paper.

5. Proofs

Gathered here are the proofs of the results stated in the main body of the paper.

5.1. Theorem 1

First, by following the same line of reasoning as in [32], conclude that the sequence {H_n}_{n≥1} is a Cauchy sequence in L²(Ω; ℍ) := {W : Ω → ℍ | E[‖W‖²_ℍ] < +∞}. Thus, by completeness, there exists a random element H* in L²(Ω; ℍ) such that H_n → H* in L². Now, exploit the continuous embedding ℍ ↪ C⁰(Θ̄). By de Finetti's Strong Law of Large Numbers (see [9]), H_n(θ) converges P-a.s. to −K(T‖θ) + ∫_X log f(z|T) μ(dz|T) =: H(θ), for any fixed θ ∈ Θ. Since H ∈ ℍ by Assumption 1, H = H* as elements of ℍ. At this stage, the conclusion that ν_n ⇒ ν follows from the standard implication that L²-convergence entails convergence in distribution, which remains true for random elements taking values in a separable Hilbert space. See [51].

5.2. Proposition 1

Start by considering a functional Ψ in C b 2 ( H ) . Notice that
$$\int_{\mathbb{H}}\Psi(h)\,\nu_n(\mathrm{d}h)=\mathsf{E}[\Psi(H_n)]$$
and then expand the term Ψ(H_n − H + H) by the Taylor formula (see [33,34]) to get
$$\Psi(H_n)=\Psi(H)+\big\langle\nabla\Psi(H),\,H_n-H\big\rangle+\tfrac{1}{2}\big\langle\mathrm{Hess}[\Psi]_{H}(H_n-H),\,H_n-H\big\rangle+o\big(\|H_n-H\|^2\big).$$
Observe that H is σ(T)-measurable while, by de Finetti's representation theorem, the distribution of H_n − H, given T, coincides with the distribution of a sum of n i.i.d. random elements. Whence, the tower property of the conditional expectation entails:
$$\mathsf{E}[\Psi(H_n)]=\mathsf{E}[\Psi(H)]+\frac{1}{2n}\,\mathsf{E}\Big[\mathrm{Hess}[\Psi]_{H}\,\mathrm{Cov}_T[\log f(X_i\,|\,\cdot)]\Big]+o\Big(\frac{1}{n}\Big)$$
since E[H_n − H | T] = 0 and, therefore,
$$\mathsf{E}\big[\big\langle\nabla\Psi(H),\,H_n-H\big\rangle\,\big|\,T\big]=\big\langle\nabla\Psi(H),\,\mathsf{E}[H_n-H\,|\,T]\big\rangle=0,$$
the expression E[H_n − H | T] being intended as a Bochner integral. Thus, the main identity (26) follows immediately from (25), for any Ψ ∈ C_b²(ℍ). Once (26) is established for regular Ψ's, one can extend its validity to more general continuous Ψ's by standard approximation arguments.

5.3. Lemma 1

First, observe that:
$$-K(T\,\|\,\theta)+\int_{\mathbb{X}}\log f(z\,|\,T)\,\mu(\mathrm{d}z\,|\,T)=\theta\,M'(T)-M(\theta).$$
Notice also that:
$$\int_{\mathbb{H}}\Psi(h)\,\nu_n(\mathrm{d}h)=\mathsf{E}\Big[\Psi\Big(\theta\mapsto\frac{\theta}{n}\sum_{i=1}^{n}S(X_i)-M(\theta)\Big)\Big].$$
Then, repeat the same arguments as in the previous proof, getting
$$\int_{\mathbb{H}}\Psi(h)\,\nu_n(\mathrm{d}h)=\int_{\mathbb{H}}\Psi(h)\,\nu(\mathrm{d}h)+\frac{1}{2n}\int_{\Theta}\frac{\mathrm{d}^2}{\mathrm{d}x^2}\Psi\big(\theta\mapsto\theta x-M(\theta)\big)\Big|_{x=M'(t)}\,M''(t)\,p(t)\,\mathrm{d}t+o\Big(\frac{1}{n}\Big).$$
For standard exponential families, the function M′ is one-to-one, with inverse function V. Whence, by indicating the range of M′ as Cod(M′),
$$\begin{aligned}\int_{\Theta}\frac{\mathrm{d}^2}{\mathrm{d}x^2}\Psi\big(\theta\mapsto\theta x-M(\theta)\big)\Big|_{x=M'(t)}\,M''(t)\,p(t)\,\mathrm{d}t&=\int_{\mathrm{Cod}(M')}\frac{\mathrm{d}^2}{\mathrm{d}x^2}\Psi\big(\theta\mapsto\theta x-M(\theta)\big)\,M''(V(x))\,p(V(x))\,V'(x)\,\mathrm{d}x\\&=\int_{\mathrm{Cod}(M')}\Psi\big(\theta\mapsto\theta x-M(\theta)\big)\,\frac{\mathrm{d}^2}{\mathrm{d}x^2}\Big[M''(V(x))\,p(V(x))\,V'(x)\Big]\,\mathrm{d}x,\end{aligned}$$
where, for the last identity, a double integration-by-parts has been used. Finally, changing the variable back according to x = M′(t) leads to the desired result.

5.4. Proposition 2

A disintegration argument shows that
$$\begin{aligned}R_U[U_{n,m}]&=\int_{\mathbb{X}^n}\int_{\Theta\times\mathbb{X}^m}L_U\big(u_{n,m}(y;x;\theta),\,u_{n,m}(x)\big)\,\mathsf{P}\big[(X_{n+1},\dots,X_{n+m})\in\mathrm{d}y,\,T\in\mathrm{d}\theta\,\big|\,(X_1,\dots,X_n)=x\big]\,\alpha_n(\mathrm{d}x)\\&=\int_{\mathbb{X}^n}\int_{\Theta\times\mathbb{X}^m}L_U\big(u_{n,m}(y;x;\theta),\,u_{n,m}(x)\big)\,\mu_m(\mathrm{d}y\,|\,\theta)\,\pi_n(\mathrm{d}\theta\,|\,x)\,\alpha_n(\mathrm{d}x)\\&=\int_{\mathbb{X}^n}\int_{\Theta}W_U^2\big(\gamma_{\theta,x};\,\delta_{u_{n,m}(x)}\big)\,\pi_n(\mathrm{d}\theta\,|\,x)\,\alpha_n(\mathrm{d}x).\end{aligned}$$
Then, use the triangle inequality for the Wasserstein distance to obtain:
$$W_U\big(\gamma_{\theta,x};\,\delta_{u_{n,m}(x)}\big)\le W_U\big(\gamma_{\theta,x};\,\gamma_{\tau,x}\big)+W_U\big(\gamma_{\tau,x};\,\delta_{u_{n,m}(x)}\big)$$
for any τ ∈ Θ. Take the square of both sides and observe that:
$$W_U^2\big(\gamma_{\tau,x};\,\delta_{u_{n,m}(x)}\big)=\int_{\mathbb{U}}d_U^2\big(u,\,u_{n,m}(x)\big)\,\gamma_{\tau,x}(\mathrm{d}u).$$
Now, (31) is proved by letting τ = T n , m after noticing that the latter summand in the above right-hand side is independent of θ .
Finally, (32) is obtained by first optimizing the risk R_Θ with respect to the choice of T_{n,m} and then, after getting T̂_{n,m}, by optimizing the term $\mathsf{E}\big[\int_{\mathbb{U}}d_U^2(U_{n,m},u)\,\gamma_{\widehat{T}_{n,m},(X_1,\dots,X_n)}(\mathrm{d}u)\big]$ with respect to the choice of U_{n,m}.

5.5. Proposition 3

Preliminarily, use Theorem 1 in [52] [Section II.1] to prove that:
$$\int_{\Theta}\varphi(\theta)\,e^{-n\kappa_t(\theta)}\,\mathrm{d}\theta=\sqrt{\frac{2\pi}{n}}\,e^{-n\kappa_t(t)}\Big[c_0+\frac{c_2}{2n}+o\Big(\frac{1}{n}\Big)\Big]$$
holds for any φ ∈ C²(Θ) such that φ(t) ≠ 0, where
$$c_0:=\frac{b_0}{(2a_0)^{1/2}},\qquad c_2:=\left\{\frac{b_2}{2}-\frac{3a_1b_1}{a_0}+\big[5a_1^2-4a_0a_2\big]\,\frac{3b_0}{16a_0^2}\right\}\times\frac{1}{a_0^{3/2}},$$
with $a_0:=\tfrac{1}{2}\kappa_t''(t)$, $a_1:=\tfrac{1}{3!}\kappa_t'''(t)$, $a_2:=\tfrac{1}{4!}\kappa_t''''(t)$, $b_0=\varphi(t)$, $b_1=\varphi'(t)$ and $b_2=\tfrac{1}{2}\varphi''(t)$. Moreover, from that very same theorem, it holds that:
$$\int_{\Theta}\varphi(\theta)\,e^{-n\kappa_t(\theta)}\,\mathrm{d}\theta=\sqrt{\pi}\,e^{-n\kappa_t(t)}\Big[\frac{c_1}{n^{3/2}}+o\Big(\frac{1}{n^{3/2}}\Big)\Big]$$
for any φ C 2 ( Θ ) with a zero of order 1 at t, where
$$c_1:=\Big[\frac{b_1^{*}}{2}-\frac{a_1b_0^{*}}{2a_0}\Big]\frac{1}{a_0}$$
with $b_0^{*}:=\varphi'(t)$ and $b_1^{*}:=\tfrac{1}{2}\varphi''(t)$. At this stage, application of these formulas gives:
$$\int_{\Theta}m(\theta)\,\pi_n^{*}(\mathrm{d}\theta\,|\,\kappa_t)=m(t)+\frac{1}{na_0}\left[\frac{1}{4}m''(t)+\frac{1}{2}m'(t)\frac{p'(t)}{p(t)}-\frac{3}{4}\frac{a_1}{a_0}m'(t)\right]+o\Big(\frac{1}{n}\Big)$$
and
$$\int_{\Theta}m^{2}(\theta)\,\pi_n^{*}(\mathrm{d}\theta\,|\,\kappa_t)=m^{2}(t)+\frac{1}{na_0}\left[\frac{1}{2}\big(m'(t)\big)^{2}+\frac{1}{2}m(t)m''(t)+m(t)m'(t)\frac{p'(t)}{p(t)}-\frac{3}{2}\frac{a_1}{a_0}m(t)m'(t)\right]+o\Big(\frac{1}{n}\Big).$$
Then, in addition,
$$\int_{\Theta}\big[v(\theta)-v(t)\big]\,\pi_n^{*}(\mathrm{d}\theta\,|\,\kappa_t)=\frac{1}{na_0}\left[\frac{1}{4}v''(t)+\frac{1}{2}v'(t)\frac{p'(t)}{p(t)}-\frac{3}{4}\frac{a_1}{a_0}v'(t)\right]+o\Big(\frac{1}{n}\Big)$$
and
$$\int_{\Theta}\Delta^{2}(\theta,\tau)\,\pi_n^{*}(\mathrm{d}\theta\,|\,\kappa_t)=\Delta^{2}(t,\tau)+\frac{1}{na_0}\left[\frac{1}{2}\,\frac{\partial}{\partial\theta}\Delta^{2}(\theta,\tau)\Big|_{\theta=t}\frac{p'(t)}{p(t)}+\frac{1}{4}\,\frac{\partial^{2}}{\partial\theta^{2}}\Delta^{2}(\theta,\tau)\Big|_{\theta=t}-\frac{3}{4}\,\frac{a_1}{a_0}\,\frac{\partial}{\partial\theta}\Delta^{2}(\theta,\tau)\Big|_{\theta=t}\right]+o\Big(\frac{1}{n}\Big),$$
completing the proof by mere substitution.
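As a quick numerical sanity check of the leading term of the Laplace approximation invoked above, the following sketch (ours; the choices of κ and φ are toy examples, not the κ_t of the paper) compares the exact integral with the approximation √(2π/n) e^{−nκ(t)} b_0/(2a_0)^{1/2} and shows the relative error decaying like 1/n.

```python
# Toy numerical check of the leading-order Laplace approximation (illustrative only).
import numpy as np
from scipy.integrate import quad

def laplace_leading_term(n, phi, kappa, t, kappa2):
    """Compare int phi(u) exp(-n kappa(u)) du with sqrt(2*pi/n) e^{-n kappa(t)} b0 / sqrt(2*a0)."""
    exact, _ = quad(lambda u: phi(u) * np.exp(-n * kappa(u)), -np.inf, np.inf)
    a0, b0 = kappa2(t) / 2.0, phi(t)                  # a0 = kappa''(t)/2, b0 = phi(t)
    approx = np.sqrt(2.0 * np.pi / n) * np.exp(-n * kappa(t)) * b0 / np.sqrt(2.0 * a0)
    return exact, approx

# kappa(u) = (u - 1)^2 / 2 has its minimum at t = 1 with kappa''(t) = 1; phi(u) = 1 + u^2.
for n in (10, 100, 1000):
    exact, approx = laplace_leading_term(n, lambda u: 1.0 + u**2,
                                         lambda u: 0.5 * (u - 1.0)**2, 1.0, lambda u: 1.0)
    print(n, round(exact, 6), round(approx, 6), round(abs(exact - approx) / exact, 6))
```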

5.6. Theorem 2

The core of the proof hinges on the identity (45). Now, the asymptotic expansion of its left-hand side is provided by (34), while the analogous expansion for the right-hand side follows from a combination of (40) with (41). It is now straightforward to notice that the validity of (43) and (44) entails (45). At this stage, the validity of (46) and (47) for R̂_{0,m} and R̂_{1,m} follows directly by substitution. As to the same identities for R*_{0,m} and R*_{1,m}, the argument rests on the combination of (3) with (32), exploiting the fact that the additional term (33) is of o(1/n)-type. Thus, the asymptotic expansion of the left-hand side of (3) is given in terms of integrals with respect to ν of the sum of the two left-hand sides of (40) and (41), respectively. Resorting once again to (45), one gets the desired identities for R*_{0,m} and R*_{1,m} by substitution.

5.7. Theorem 3

The core is the proof of (10), with the same expressions (46) and (47) also for R ˜ 0 , m and R ˜ 1 , m , respectively. As in the proof of Theorem 2, the left-hand side of (10) is analyzed by resorting to inequality (32), exploiting the fact that the ensuing additional term, similar to that in (33), is of o ( 1 / n ) -type. Now, the argument is very similar to that of the preceding proof, with the variant that now the expansion (35) is not optimized in τ , but it is just evaluated at τ = τ ^ n . The conclusion reduces once again to a matter of substituting the expressions (48)–(50) into the two expansions (35) and (41).

6. Conclusions and Future Developments

This paper should be seen as a pioneering work in the field of predictive problems, whose main aim is to show how the practical construction of efficient estimators of random quantities (that depend on future and/or past observations) entails non-standard metrizations of the parameter space Θ . This is the essence of the compatibility Equations (43) and (44). Of course, all the lines of research proposed in this paper deserve much more attention, in order to produce new results of wider validity.
The first issue deals with the extension of the compatibility equations to higher dimensions, including the infinite dimension. For finite dimensions, this is only a technical fact. Indeed, the question amounts to extending the asymptotic expansion given in Proposition 3 from dimension 1 to dimension d > 1. This is done in [39] for the Bayesian setting, and in [53,54] in a general mathematical setting. See also [55] [Section 2.2]. For the infinite dimension, the mathematical literature is rather scant. Some interesting results on asymptotic expansions of Laplace type for separable Hilbert spaces with Gaussian measure are contained in [56]. Finally, the topic is still in its early stage as far as metric measure spaces (i.e., the full nonparametric setting) are concerned. See [57,58].
Another mathematical tool that proves to be critical to our study is the Wasserstein distance. As explained in specific monographs like [27,59], the Wasserstein distance has several connections with other fields of mathematical analysis, such as optimal transportation and the theory of PDEs. Actually, the achievement of some estimators within our theory (like the one in (55)) is tightly connected with some optimization issues in transport theory. In this respect, an interesting mathematical area to explore is represented by the theory of Wasserstein barycenters and the ensuing numerical algorithms. See [60]. Research on this is ongoing.
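As a concrete hint of what such numerical algorithms look like in the simplest setting, the sketch below (ours, purely illustrative) computes an empirical Wasserstein-2 barycenter in dimension one, where the barycenter is obtained by averaging the quantile functions of the input measures.

```python
# A minimal sketch of a one-dimensional Wasserstein-2 barycenter via quantile
# averaging (in d = 1 the W2 barycenter has quantile function equal to the
# weighted average of the input quantile functions). Illustrative only.
import numpy as np

def wasserstein2_barycenter_1d(samples, weights, grid_size=1000):
    """Empirical W2 barycenter of 1-d samples, returned as barycenter quantiles."""
    u = (np.arange(grid_size) + 0.5) / grid_size            # common quantile grid
    quantiles = [np.quantile(np.asarray(s), u) for s in samples]
    return np.average(np.stack(quantiles), axis=0, weights=weights)

rng = np.random.default_rng(0)
samples = [rng.normal(0.0, 1.0, 5000), rng.normal(4.0, 2.0, 5000)]
bary_q = wasserstein2_barycenter_1d(samples, weights=[0.5, 0.5])
print(bary_q.mean(), bary_q.std())   # roughly mean 2.0 and sd 1.5 for equal weights
```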
Then, all the extensions of de Finetti's Law of Large Numbers for the log-likelihood process, stated in Theorem 1, Proposition 1 and Lemma 1 in Section 2.2, are worth reconsidering, independently of their use for the purposes of this paper. As to possible extensions, the first direction concerns the analysis of dominated, parametric non-regular models, such as those considered in [61,62,63]. Here, in fact, we never used the properties of the MLE as the root of the gradient of the log-likelihood, so that the asymptotic results contained in the quoted works should be enough to extend our statements. Subsequently, it would also be very interesting to consider dominated models which are parametrized by infinite-dimensional objects, where typically the MLE does not exist. See, e.g., the recent book [64] for plenty of examples.
As to more statistical objectives, it would be interesting to further deepen the connection between our approach and some relevant achievements obtained within the empirical Bayes theory, such as those contained in [22,23,65,66,67,68]. See also the book [69] for plenty of applications. In particular, the discussion contained in Section 4.4 about the original Poisson-mixture setting considered by Herbert Robbins deserves more attention.
A very fertile area of application of predictive inference is that of species sampling problems. The pioneering contributions on this topic can be identified with [66,67,70]. Nowadays, the Bayesian approach (especially of nonparametric type) has received much attention, and has produced noteworthy new results in this field. See [17,71,72,73] and also [55,74,75] for novel asymptotic results. Indeed, it would be interesting to investigate whether it is possible to derive, within the approach of this paper, both asymptotic results and new estimators, hopefully more competitive than the existing ones.
Another prolific field of application is that of density estimation, aimed at solving clustering and/or classification problems. See [76] for a Bayesian perspective. Here, there is an additional technical difficulty due to the fact that the parameter is an element of some infinite-dimensional manifold, so that the characterization of any metric on Θ will prove mathematically more complex.
A last mention is devoted to predictive problems with "compressed data". This kind of research comes directly from computer science, where the complexity of the observed data makes the available sample essentially useless for statistical inference purposes. For this reason, many algorithms have been conceived to compress the information in order to make it useful in some sense. See, e.g., [77]. Here, the Bayesian approach is in its early stage (see [78]), and the results of this paper can provide a valuable contribution.

Funding

This research received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No 817257.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

I wish to express my enormous gratitude and admiration to Eugenio Regazzini. He has represented for me a constant source of inspiration, transmitting enthusiasm and method for the development of my own research. This paper represents a small present for his 75th birthday.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Cifarelli, D.M.; Dolera, E.; Regazzini, E. Note on “Frequentist Approximations to Bayesian prevision of exchangeable random elements” [Int. J. Approx. Reason. 78 (2016) 138–152]. Int. J. Approx. Reason. 2017, 86, 26–27. [Google Scholar] [CrossRef]
  2. Cifarelli, D.M.; Dolera, E.; Regazzini, E. Frequentist approximations to Bayesian prevision of exchangeable random elements. Int. J. Approx. Reason. 2016, 78, 138–152. [Google Scholar] [CrossRef] [Green Version]
  3. Dolera, E. On an asymptotic property of posterior distributions. Boll. Dell’Unione Mat. Ital. 2013, 6, 741–748. (In Italian) [Google Scholar]
  4. Dolera, E.; Regazzini, E. Uniform rates of the Glivenko–Cantelli convergence and their use in approximating Bayesian inferences. Bernoulli 2019, 25, 2982–3015. [Google Scholar] [CrossRef] [Green Version]
  5. de Finetti, B. Bayesianism: Its unifying role for both the foundations and applications of statistics. Int. Stat. Rev. 1974, 42, 117–130. [Google Scholar] [CrossRef]
  6. de Finetti, B. La prévision: Ses lois logiques, ses sources subjectives. Ann. L’Inst. Henri Poincaré 1937, 7, 1–68. [Google Scholar]
  7. Ferguson, T.S. Mathematical Statistics: A Decision Theoretic Approach; Academic Press: Cambridge, MA, USA, 1967. [Google Scholar]
  8. Lehmann, E.L.; Casella, G. Theory of Point Estimation, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1998. [Google Scholar]
  9. Aldous, D.J. Exchangeability and Related Topics; Ecole d’Eté de Probabilités de Saint-Flour XIII, Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1985; pp. 1–198. [Google Scholar]
  10. Berti, P.; Pratelli, L.; Rigo, P. Exchangeable sequences driven by an absolutely continuous random measure. Ann. Probab. 2013, 78, 138–152. [Google Scholar] [CrossRef] [Green Version]
  11. Fortini, S.; Ladelli, L.; Regazzini, E. Exchangeability, predictive distributions and parametric models. Sankhya 2000, 62, 86–109. [Google Scholar]
  12. Rubin, D.B. Bayesianly justifiable and relevant frequency calculations for the applied statisticians. Ann. Stat. 1984, 12, 1151–1172. [Google Scholar] [CrossRef]
  13. Lijoi, A.; Prünster, I. Models beyond the Dirichlet process. In Bayesian Nonparametrics; Hjort, N.L., Holmes, C.C., Müller, P., Walker, S.G., Eds.; Cambridge University Press: Cambridge, UK, 2010; pp. 80–136. [Google Scholar]
  14. Robbins, H. The empirical Bayes approach to statistical decision problems. Ann. Math. Stat. 1964, 35, 1–20. [Google Scholar] [CrossRef]
  15. Ghosh, J.K.; Sinha, B.K.; Joshi, S.N. Expansions for posterior probability and integrated Bayes risk. In Statistical Decision Theory and Related Topics III; Gupta, S., Berger, J., Eds.; Academic Press: Cambridge, MA, USA, 1982; pp. 403–456. [Google Scholar]
  16. Favaro, S.; Nipoti, B.; Teh, Y.W. Rediscovery of Good-Turing estimators via Bayesian nonparametrics. Biometrics 2016, 72, 136–145. [Google Scholar] [CrossRef] [PubMed]
  17. Lijoi, A.; Mena, R.H.; Prünster, I. Bayesian Nonparametric Estimation of the Probability of Discovering New Species. Biometrika 2009, 94, 769–786. [Google Scholar] [CrossRef]
  18. de Finetti, B. Probabilità di una teoria e probabilità dei fatti. In Studi di Probabilità, Statistica e Ricerca Operativa in onore di Giuseppe Pompilj; Oderisi: Gubbio, Italy, 1971; pp. 86–101. (In Italian) [Google Scholar]
  19. Rao, R.C. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91. [Google Scholar]
  20. Amari, S.-I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Berlin/Heidelberg, Germany, 2016; Volume 194. [Google Scholar]
  21. Oller, J.M.; Corcuera, J.M. Intrinsic analysis of statistical estimation. Ann. Stat. 1995, 23, 1562–1581. [Google Scholar] [CrossRef]
  22. Zhang, C.-H. Estimation of sums of random variables: Example and information bounds. Ann. Stat. 2005, 33, 2022–2041. [Google Scholar] [CrossRef] [Green Version]
  23. Robbins, H. An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability; Statistical Laboratory of the University of California: Davis, CA, USA, 1956; Volume I, pp. 157–163. [Google Scholar]
  24. Berezin, S.; Miftakhov, A. On barycenters of probability measures. Bull. Pol. Acad. Sci. Math. 2020, 68, 11–20. [Google Scholar] [CrossRef]
  25. Karcher, H. Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math. 1977, 30, 509–541. [Google Scholar] [CrossRef]
  26. Kim, Y.-H.; Pass, B. Nonpositive curvature, the variance functional, and the Wasserstein barycenter. Proc. Am. Math. Soc. 2000, 148, 1745–1756. [Google Scholar] [CrossRef]
  27. Ambrosio, L.; Gigli, N.; Savaré, G. Gradient Flows in Metric Spaces and in the Space of Probability Measures, 2nd ed.; Birkhäuser: Basel, Switzerland, 2008. [Google Scholar]
  28. Billingsley, P. Probability and Measure, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 1995. [Google Scholar]
  29. do Carmo, M.P. Riemannian Geometry; Birkhäuser: Basel, Switzerland, 2013. [Google Scholar]
  30. Heinonen, J.; Kilpeläinen, T.; Martio, O. Nonlinear Potential Theory of Degenerate Elliptic Equations; Oxford Science Publications: Oxford, UK, 2008. [Google Scholar]
  31. Kufner, A. Weighted Sobolev Spaces; John Wiley & Sons: Hoboken, NJ, USA, 1985. [Google Scholar]
  32. de Finetti, B. La legge dei grandi numeri nel caso dei numeri aleatori equivalenti. Rend. Della R. Accad. Naz. Lincei 1933, 18, 203–207. (In Italian) [Google Scholar]
  33. Bauschke, H.H.; Combettes, P.L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
  34. Borwein, J.M.; Noll, D. Second order differentiability of convex functions in Banach spaces. Trans. Am. Math. Soc. 1994, 132, 43–81. [Google Scholar] [CrossRef]
  35. Dolera, E.; Favaro, S. Rates of convergence in de Finetti’s representation theorem, and Hausdorff moment problem. Bernoulli 2020, 26, 1294–1322. [Google Scholar] [CrossRef]
  36. Mijoule, G.; Peccati, G.; Swan, Y. On the rate of convergence in de Finetti’s representation theorem. Lat. Am. J. Probab. Math. Stat. 2016, 13, 1–23. [Google Scholar] [CrossRef]
  37. Dolera, E. Estimates of the approximation of weighted sums of conditionally independent random variables by the normal law. J. Inequal. Appl. 2013, 2013, 320. [Google Scholar] [CrossRef] [Green Version]
  38. Götze, F. On the rate of convergence in the central limit theorem in Banach Spaces. Ann. Probab. 1986, 14, 922–942. [Google Scholar] [CrossRef]
  39. Tierney, L.; Kadane, J.B. Accurate approximations for posterior moments and marginal densities. J. Am. Stat. Assoc. 1986, 81, 82–86. [Google Scholar] [CrossRef]
  40. DasGupta, A. Asymptotic Theory of Statistics and Probability; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  41. Dall’Aglio, G. Sugli estremi dei momenti delle funzioni di ripartizione doppia. Ann. Della Sc. Norm. Super. Pisa Cl. Sci. 1956, 10, 35–74. (In Italian) [Google Scholar]
  42. Dolera, E.; Mainini, E. On Uniform Continuity of Posterior Distributions. Stat. Probab. Lett. 2020, 157, 108627. [Google Scholar] [CrossRef] [Green Version]
  43. Dolera, E.; Mainini, E. Lipschitz continuity of probability kernels in the optimal transport framework. arXiv 2020, arXiv:2010.08380. [Google Scholar]
  44. Dowson, D.C.; Landau, B.V. The Fréchet distance between multivariate normal distributions. J. Multivar. Anal. 1982, 12, 450–455. [Google Scholar] [CrossRef] [Green Version]
  45. Olkin, I.; Pukelsheim, F. The distance between two random vectors with given dispersion matrices. Linear Algebra Its Appl. 1982, 48, 257–263. [Google Scholar] [CrossRef] [Green Version]
  46. Malagó, L.; Montrucchio, L.; Pistone, G. Wasserstein Riemannian geometry of positive definite matrices. Inf. Geom. 2018, 1, 137–179. [Google Scholar] [CrossRef]
  47. Efron, B.; Hastie, T. Computer Age Statistical Inference. Algorithms, Evidence, and Data Science; Cambridge University Press: Cambridge, UK, 2016. [Google Scholar]
  48. Thyrion, P. Contribution à l’étude du bonus pour non sinistre en assurance automobile. ASTIN Bull. J. IAA 1960, 1, 142–162. (In French) [Google Scholar] [CrossRef] [Green Version]
  49. van Houwelingen, J.C. Monotonizing empirical Bayes estimators for a class of discrete distributions with monotone likelihood ratio. Stat. Neerl. 1977, 31, 95–104. [Google Scholar] [CrossRef]
  50. Carlin, B.P.; Louis, T.A. Bayesian Methods for Data Analysis, 3rd ed.; Chapman and Hall: Boca Raton, FL, USA, 2009. [Google Scholar]
  51. Ledoux, M.; Talagrand, M. Probability in Banach Spaces; Springer: Berlin/Heidelberg, Germany, 1991. [Google Scholar]
  52. Wong, R. Asymptotic Approximations of Integrals; SIAM: Philadelphia, PA, USA, 2001. [Google Scholar]
  53. McClure, J.P.; Wong, R. Error bounds for multidimensional Laplace approximation. J. Approx. Theory 1983, 37, 372–390. [Google Scholar] [CrossRef] [Green Version]
  54. Olver, F.W.J. Error bounds for the Laplace approximation for definite integrals. J. Approx. Theory 1968, 1, 293–313. [Google Scholar] [CrossRef] [Green Version]
  55. Dolera, E.; Favaro, S. A Berry–Esseen theorem for Pitman’s α–diversity. Ann. Appl. Probab. 2020, 30, 847–869. [Google Scholar] [CrossRef]
  56. Albeverio, S.; Steblovskaya, V. Asymptotics of infinite-dimensional integrals with respect to smooth measures. (I). Infin. Dimens. Anal. Quantum Probab. Relat. Top. 1999, 2, 529–556. [Google Scholar] [CrossRef]
  57. Gigli, N. Second order analysis on (P2(M), W2). Mem. Am. Math. Soc. 2012, 216, xii+154. [Google Scholar]
  58. Gigli, N.; Ohta, S.I. First variation formula in Wasserstein spaces over compact Alexandrov spaces. Can. Math. Bull. 2010, 55, 723–735. [Google Scholar] [CrossRef] [Green Version]
  59. Villani, C. Optimal Transport. Old and New; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  60. Cuturi, M.; Doucet, A. Fast Computation of Wasserstein Barycenters. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, June 21–26 2014; Volume 32, pp. 685–693. [Google Scholar]
  61. Smith, R.L. Maximum likelihood estimation in a class of nonregular cases. Biometrika 1985, 72, 67–90. [Google Scholar] [CrossRef]
  62. Woodroofe, M. Maximum likelihood estimation of a translation parameter of a truncated distribution. Ann. Math. Stat. 1972, 43, 113–122. [Google Scholar] [CrossRef]
  63. Woodroofe, M. Maximum likelihood estimation of a translation parameter of a truncated distribution (II). Ann. Stat. 1974, 2, 474–488. [Google Scholar] [CrossRef]
  64. Giné, E.; Nickl, R. Mathematical Foundations of Infinite-Dimensional Statistical Models; Cambridge Series in Statistical and Probabilistic Mathematics: Cambridge, UK, 2016. [Google Scholar]
  65. Efron, B.; Thisted, R. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika 1976, 63, 435–447. [Google Scholar] [CrossRef] [Green Version]
  66. Good, I.J. The population frequencies of species and the estimation of population parameters. Biometrika 1953, 40, 237–264. [Google Scholar] [CrossRef]
  67. Good, I.J.; Toulmin, G.H. The number of new species, and the increase in population coverage, when a sample is increased. Biometrika 1956, 43, 45–63. [Google Scholar] [CrossRef]
  68. Orlitsky, A.; Suresh, A.T.; Wu, Y. Optimal prediction of the number of unseen species. Proc. Natl. Acad. Sci. USA 2016, 113, 13283–13288. [Google Scholar] [CrossRef] [Green Version]
  69. Maritz, J.S.; Lwin, T. Empirical Bayes Methods with Applications; Chapman and Hall: Boca Raton, FL, USA, 1989. [Google Scholar]
  70. Fisher, R.A.; Corbet, A.S.; Williams, C.B. The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 1943, 12, 42–58. [Google Scholar] [CrossRef]
  71. Favaro, S.; Lijoi, A.; Mena, R.H.; Prünster, I. Bayesian nonparametric inference for species variety with a two parameter Poisson-Dirichlet process prior. J. Roy. Statist. Soc. Ser. B 2009, 71, 993–1008. [Google Scholar] [CrossRef]
  72. Favaro, S.; Lijoi, A.; Prünster, I. A new estimator of the discovery probability. Biometrics 2012, 68, 1188–1196. [Google Scholar] [CrossRef] [Green Version]
  73. Arbel, J.; Favaro, S.; Nipoti, B.; Teh, Y.W. Bayesian nonparametric inference for discovery probabilities: Credible intervals and large sample asymptotic. Stat. Sin. 2017, 27, 839–858. [Google Scholar] [CrossRef]
  74. Dolera, E.; Favaro, S. A compound Poisson perspective of Ewens–Pitman sampling model. Mathematics 2021, 9, 2820. [Google Scholar] [CrossRef]
  75. Pitman, J. Combinatorial Stochastic Processes; Ecole d’Eté de Probabilités de Saint-Flour XXXII, Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  76. Sambasivan, R.; Das, S.; Sahu, S.K. A Bayesian perspective of statistical machine learning for big data. Comput. Stat. 2020, 35, 893–930. [Google Scholar] [CrossRef] [Green Version]
  77. Cormode, G.; Yi, K. Small Summaries for Big Data; Cambridge University Press: Cambridge, UK, 2020. [Google Scholar]
  78. Dolera, E.; Favaro, S.; Peluchetti, S. Learning-augmented count-min sketches via Bayesian nonparametrics. arXiv 2021, arXiv:2102.04462. [Google Scholar]
Table 1. Table reporting, in the second row, the exact counts of claimed accidents. The third and fourth rows display estimated numbers of accidents.

Claims            | 0     | 1     | 2     | 3    | 4    | 5    | 6    | 7
Counts            | 7840  | 1317  | 239   | 42   | 14   | 4    | 4    | 1
Robbins estimator | 0.168 | 0.363 | 0.527 | 1.33 | 1.43 | 6.00 | 1.25 | 0
Gamma MLE         | 0.164 | 0.398 | 0.633 | 0.87 | 1.10 | 1.34 | 1.57 | 0
Table 2. Table reporting, in the second row, the exact counts of claimed accidents. The third row displays estimated numbers of accidents.

Claims            | 0     | 1     | 2     | 3     | 4     | 5    | 6    | 7
Counts            | 7840  | 1317  | 239   | 42    | 14    | 4    | 4    | 1
Estimator Ũ_{n,m} | 0.176 | 0.353 | 0.53  | 0.706 | 0.882 | 1.06 | 1.23 | 1.41