
Robust Model Selection Criteria Based on Pseudodistances

1 Department of Applied Mathematics, Bucharest University of Economic Studies, 010164 Bucharest, Romania
2 “Gh. Mihoc - C. Iacob” Institute of Mathematical Statistics and Applied Mathematics, Romanian Academy, 010164 Bucharest, Romania
3 Department of Statistics and Actuarial-Financial Mathematics, Lab of Statistics and Data Analysis, University of the Aegean, 83200 Karlovasi, Greece
* Author to whom correspondence should be addressed.
Entropy 2020, 22(3), 304; https://doi.org/10.3390/e22030304
Submission received: 17 February 2020 / Revised: 2 March 2020 / Accepted: 3 March 2020 / Published: 6 March 2020

Abstract

In this paper, we introduce a new class of robust model selection criteria. These criteria are defined by estimators of the expected overall discrepancy using pseudodistances and the minimum pseudodistance principle. Theoretical properties of these criteria are proved, namely asymptotic unbiasedness, robustness, consistency, as well as the limit laws. The case of linear regression models is studied and a specific pseudodistance-based criterion is proposed. Monte Carlo simulations and applications to real data are presented in order to illustrate the performance of the new methodology. These examples show that the new selection criterion for regression models is a good competitor of some well-known criteria and may have superior performance, especially in the case of small and contaminated samples.

1. Introduction

Model selection is fundamental to the practical applications of statistics, and there is a substantial literature on this issue. Classical model selection criteria include, among others, the $C_p$ criterion, the Akaike Information Criterion (AIC), based on the Kullback-Leibler divergence, the Bayesian Information Criterion (BIC), and the General Information Criterion (GIC), which corresponds to a general class of criteria that also estimate the Kullback-Leibler divergence. These criteria were proposed in [1,2,3,4], respectively, and represent powerful tools for choosing the best model among different candidate models that can be used to fit a given data set. On the other hand, many classical procedures for model selection are extremely sensitive to outliers and to other departures from the distributional assumptions of the model. Robust versions of classical model selection criteria, which are not strongly affected by outliers, have been proposed, for example, in [5,6,7]. Some recent proposals for robust model selection are criteria based on divergences and minimum divergence estimators. We recall here the Divergence Information Criterion (DIC) based on the density power divergences introduced in [8], the Modified Divergence Information Criterion (MDIC) introduced in [9] and the criteria based on minimum dual divergence estimators introduced in [10].
Interest in statistical methods based on divergence measures has grown significantly in recent years. For a wide variety of models, statistical methods based on divergences have high model efficiency and are also robust, representing attractive alternatives to the classical methods. We refer to the monographs [11,12] for an excellent presentation of such methods, their importance and their applications. The pseudodistances that we use in the present paper were originally introduced in [13], where they are called “type-0” divergences and where the corresponding minimum divergence estimators have been studied. They are also presented and extensively studied in [14], where they are called γ-divergences, as well as in [15] in the context of decomposable pseudodistances. Like divergences, the pseudodistances are not mathematical metrics in the strict sense of the term. They satisfy two properties, namely nonnegativity and the fact that the pseudodistance between two probability measures equals zero if and only if the two measures are equal. Divergences are moreover characterized by the information processing property, that is, complete invariance with respect to statistically sufficient transformations of the observation space. In general, a pseudodistance may not satisfy this property. We have adopted the term pseudodistance for this reason, although the other terms mentioned above are also encountered in the literature.
The pseudodistances that we consider in this paper have also been used to define robustness and efficiency measures, as well as the corresponding optimal robust M-estimators, following Hampel's infinitesimal approach, in [16]. The minimum pseudodistance estimators for general parametric models have been studied in [15] and consist of minimizing an empirical version of a pseudodistance between the assumed theoretical model and the true model underlying the data. These estimators have the advantage of not requiring any prior smoothing and conciliate robustness with high efficiency, providing a high degree of stability under model misspecification, often with a minimal loss in model efficiency. Such estimators are also defined and studied for the multivariate normal model, as well as for linear regression models, in [17,18], where applications to portfolio optimization models are also presented.
In the present paper we propose new criteria for model selection, based on pseudodistances and on minimum pseudodistance estimators. These new criteria have robustness properties, are asymptotically unbiased, consistent and compare well with some other known model selection criteria, even for small samples.
The paper is organized as follows. Section 2 is devoted to minimum pseudodistance estimators and to their asymptotic properties, which will be needed in the subsequent sections. Section 3 presents new estimators of the expected overall discrepancy using pseudodistances, together with their theoretical properties, including robustness, consistency and limit laws. The new asymptotically unbiased model selection criteria are presented in Section 3.3, where the cases of the univariate normal model and of linear regression models are investigated. Applications based on Monte Carlo simulations and on real data, illustrating the performance of the new methodology for linear regression models, are included in Section 4.

2. Minimum Pseudodistance Estimators

The construction of the new model selection criteria is based on the following family of pseudodistances (see [15]). For two probability measures P and Q admitting densities p and q, respectively, with respect to the Lebesgue measure, the family of pseudodistances of order $\gamma > 0$ is defined by
$$R_\gamma(P,Q) = \frac{1}{\gamma+1}\ln\int p^\gamma \, dP + \frac{1}{\gamma(\gamma+1)}\ln\int q^\gamma \, dQ - \frac{1}{\gamma}\ln\int p^\gamma \, dQ \tag{1}$$
and satisfies the limit relation
$$\lim_{\gamma \downarrow 0} R_\gamma(P,Q) = R_0(P,Q),$$
where $R_0(P,Q) := \int \ln\frac{q}{p}\, dQ$ is the modified Kullback-Leibler divergence.
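Since the paper itself contains no code, a minimal numerical sketch of (1) may help fix ideas. The function name and the quadrature bounds below are illustrative assumptions, not part of the original text; the example checks the nonnegativity property mentioned in the Introduction.

```python
# Sketch: evaluate the order-gamma pseudodistance R_gamma(P, Q) in (1) by
# numerical quadrature, for two univariate densities.
import numpy as np
from scipy import integrate
from scipy.stats import norm

def pseudodistance(p_pdf, q_pdf, gamma, lo=-30.0, hi=30.0):
    """R_gamma(P,Q), cf. (1); note that dP = p dlambda and dQ = q dlambda."""
    int_pg_dP = integrate.quad(lambda x: p_pdf(x) ** gamma * p_pdf(x), lo, hi)[0]
    int_qg_dQ = integrate.quad(lambda x: q_pdf(x) ** gamma * q_pdf(x), lo, hi)[0]
    int_pg_dQ = integrate.quad(lambda x: p_pdf(x) ** gamma * q_pdf(x), lo, hi)[0]
    g = gamma
    return (np.log(int_pg_dP) / (g + 1)
            + np.log(int_qg_dQ) / (g * (g + 1))
            - np.log(int_pg_dQ) / g)

# R_gamma is nonnegative and vanishes iff the two laws coincide:
p = norm(0, 1).pdf
q = norm(0.5, 1.2).pdf
print(pseudodistance(p, q, gamma=0.2))   # > 0
print(pseudodistance(p, p, gamma=0.2))   # ~ 0, up to quadrature error
```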
Let $(P_\theta)$ be a parametric model indexed by $\theta \in \Theta$, where $\Theta$ is a d-dimensional parameter space, and let $p_\theta$ be the corresponding densities with respect to the Lebesgue measure $\lambda$. Let $X_1, \dots, X_n$ be a random sample from $P_{\theta_0}$, $\theta_0 \in \Theta$. For fixed $\gamma > 0$, a minimum pseudodistance estimator of the unknown parameter $\theta_0$ is defined by replacing the measure $P_{\theta_0}$ in the pseudodistance $R_\gamma(P_\theta, P_{\theta_0})$ with the empirical measure $P_n$ pertaining to the sample, and then minimizing this empirical quantity with respect to $\theta$ over the parameter space. Since the middle term of $R_\gamma(P_\theta, P_{\theta_0})$ does not depend on $\theta$, these estimators are defined by
$$\widehat{\theta}_n = \arg\min_{\theta\in\Theta}\left\{\frac{1}{\gamma+1}\ln\int p_\theta^{\gamma+1}\, d\lambda - \frac{1}{\gamma}\ln\frac{1}{n}\sum_{i=1}^n p_\theta^\gamma(X_i)\right\} \tag{3}$$
or equivalently as
$$\widehat{\theta}_n = \arg\max_{\theta\in\Theta}\; C_\gamma(\theta)^{-1}\cdot\frac{1}{n}\sum_{i=1}^n p_\theta^\gamma(X_i),$$
where $C_\gamma(\theta) = \left(\int p_\theta^{\gamma+1}\, d\lambda\right)^{\gamma/(\gamma+1)}$. Denoting $h(x,\theta) := C_\gamma(\theta)^{-1}\cdot p_\theta^\gamma(x)$, these estimators can be written as
$$\widehat{\theta}_n = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n h(X_i,\theta).$$
The optimum given above need not be uniquely defined.
On the other hand,
$$\arg\max_{\theta\in\Theta}\int h(x,\theta)\, dP_{\theta_0}(x) = \theta_0 \tag{6}$$
and here $\theta_0$ is the unique optimizer, since $R_\gamma(P_\theta, P_{\theta_0}) = 0$ implies $\theta = \theta_0$.
Define
$$R_\gamma(\theta_0) := \max_{\theta\in\Theta}\int h(x,\theta)\, dP_{\theta_0}(x) = \int h(x,\theta_0)\, dP_{\theta_0}(x).$$
An estimator of $R_\gamma(\theta_0)$ is defined by
$$\widehat{R}_\gamma(\theta_0) := \max_{\theta\in\Theta}\int h(x,\theta)\, dP_n(x) = \max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n h(X_i,\theta) = \frac{1}{n}\sum_{i=1}^n h(X_i,\widehat{\theta}_n). \tag{7}$$
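For the univariate normal model, the estimator (3) and the quantity (7) can be computed by direct numerical maximization of the empirical mean of $h$. The sketch below is an assumption-laden illustration (function names are hypothetical; the closed form $\int p_\theta^{\gamma+1}\, d\lambda = (2\pi\sigma^2)^{-\gamma/2}(\gamma+1)^{-1/2}$ for $N(m,\sigma)$ is a standard Gaussian computation, not quoted from the paper):

```python
# Sketch: minimum pseudodistance (MPD) estimation for N(m, sigma),
# maximizing (1/n) sum_i h(X_i, theta) as in (5)-(7).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def mpd_fit_normal(x, gamma):
    g = gamma
    def neg_obj(par):
        m, log_s = par
        s = np.exp(log_s)                                   # enforce sigma > 0
        int_pg1 = (2 * np.pi * s**2) ** (-g / 2) / np.sqrt(g + 1)
        c_gamma = int_pg1 ** (g / (g + 1))                  # C_gamma(theta)
        h = norm.pdf(x, loc=m, scale=s) ** g / c_gamma      # h(X_i, theta)
        return -h.mean()
    res = minimize(neg_obj, x0=[np.median(x), np.log(np.std(x))],
                   method="Nelder-Mead")
    m_hat, s_hat = res.x[0], np.exp(res.x[1])
    return m_hat, s_hat, -res.fun       # -res.fun is R_hat_gamma(theta_0) in (7)

rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0, 1, 95), rng.normal(8, 1, 5)])  # 5% outliers
print(mpd_fit_normal(sample, gamma=0.3))  # location/scale barely moved by outliers
```

The log-parametrization of the scale is a design convenience: it keeps the optimizer inside the parameter space without constraints.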
The following regularity conditions of the model will be assumed throughout the rest of the paper.
(C1) The density $p_\theta(x)$ has continuous partial derivatives with respect to θ up to the third order (for all x, λ-a.e.).
(C2) There exists a neighborhood $N_{\theta_0}$ of $\theta_0$ such that the first-, second- and third-order partial derivatives with respect to θ of $h(x,\theta)$ are dominated on $N_{\theta_0}$ by some $P_{\theta_0}$-integrable functions.
(C3) The integrals $\int\left[\frac{\partial^2}{\partial\theta^2}h(x,\theta)\right]_{\theta=\theta_0} dP_{\theta_0}(x)$ and $\int\left[\frac{\partial}{\partial\theta}h(x,\theta)\right]_{\theta=\theta_0}\left[\frac{\partial}{\partial\theta}h(x,\theta)\right]^t_{\theta=\theta_0} dP_{\theta_0}(x)$ exist.
Theorem 1.
Assume that conditions (C1), (C2) and (C3) are fulfilled. Then
(a) 
Let $B := \{\theta\in\Theta;\ \|\theta-\theta_0\| \le n^{-1/3}\}$. Then, as $n\to\infty$, with probability one, the function $\theta\mapsto\frac{1}{n}\sum_{i=1}^n h(X_i,\theta)$ attains a local maximal value at some point $\widehat{\theta}_n$ in the interior of B, which implies that the estimator $\widehat{\theta}_n$ is $n^{1/3}$-consistent.
(b) 
$\sqrt{n}\,(\widehat{\theta}_n - \theta_0)$ converges in distribution to a centered multivariate normal random variable with covariance matrix
$$V = S^{-1} M S^{-1}, \tag{8}$$
where $S := \int\left[\frac{\partial^2}{\partial\theta^2}h(x,\theta)\right]_{\theta=\theta_0} dP_{\theta_0}(x)$ and $M := \int\left[\frac{\partial}{\partial\theta}h(x,\theta)\right]_{\theta=\theta_0}\left[\frac{\partial}{\partial\theta}h(x,\theta)\right]^t_{\theta=\theta_0} dP_{\theta_0}(x)$.
(c) 
$\sqrt{n}\left(\widehat{R}_\gamma(\theta_0) - R_\gamma(\theta_0)\right)$ converges in distribution to a centered normal variable with variance $\sigma^2(\theta_0) = \int h(x,\theta_0)^2\, dP_{\theta_0}(x) - \left(\int h(x,\theta_0)\, dP_{\theta_0}(x)\right)^2$.
We refer to [15] for details regarding these estimators and for the proofs of the above asymptotic properties.

3. Model Selection Criteria Based on Pseudodistances

Model selection is a method for selecting the best model among candidate models that can be used to fit a given data set. A model selection criterion can be considered as an approximately unbiased estimator of the expected overall discrepancy, a nonnegative quantity which measures the distance between the true unknown model and a fitted approximating model. If the value of the criterion is small, then the approximated candidate model can be chosen. In the following, by applying the same methodology used for AIC, we construct new criteria for model selection using pseudodistances (1) and minimum pseudodistance estimators.
Let X 1 , , X n be a random sample from the distribution associated with the true model Q with density q and let p θ be the density of a candidate model P θ from a parametric family ( P θ ) , where θ Θ R d .

3.1. The Expected Overall Discrepancy

For fixed $\gamma > 0$, we consider the quantity
$$W_\theta = \frac{1}{\gamma+1}\ln\int p_\theta^{\gamma+1}\, d\lambda - \frac{1}{\gamma}\ln\int p_\theta^\gamma\, q\, d\lambda,$$
which is the pseudodistance $R_\gamma(P_\theta, Q)$ without the middle term, which remains constant irrespective of the model $(P_\theta)$ used.
The target theoretical quantity that will be approximated by an asymptotically unbiased estimator is given by
$$E[W_{\widehat{\theta}_n}] = E\left[\,W_\theta\,|_{\theta=\widehat{\theta}_n}\right], \tag{10}$$
where $\widehat{\theta}_n$ is a minimum pseudodistance estimator defined as in (3). The same pseudodistance is used for both $W_\theta$ and $\widehat{\theta}_n$. The quantity (10) can be seen as an average distance between Q and $(P_\theta)$ up to a constant and is called the expected overall discrepancy between Q and $(P_\theta)$.
The next Lemma gives the gradient vector and the Hessian matrix of W θ and is useful for the evaluation of E [ W θ ^ n ] through Taylor expansion.
Throughout this paper, for a scalar function $\varphi_\theta(\cdot)$, the quantity $\frac{\partial}{\partial\theta}\varphi_\theta(\cdot)$ denotes the d-dimensional gradient vector of $\varphi_\theta(\cdot)$ with respect to the vector θ and $\frac{\partial^2}{\partial\theta^2}\varphi_\theta(\cdot)$ denotes the corresponding d×d Hessian matrix. We also use the notations $\dot{\varphi}_\theta$ and $\ddot{\varphi}_\theta$ for the first- and second-order derivatives of $\varphi_\theta$ with respect to θ.
We assume the following conditions, allowing differentiation under the integral sign:
(C4) There exists a neighborhood $N_\theta$ of θ such that
$$\sup_{t\in N_\theta}\left\|\frac{\partial}{\partial t}\int p_t^{\gamma+1}\, d\lambda\right\| < \infty, \qquad \sup_{t\in N_\theta}\left\|\int \frac{\partial}{\partial t}\left[p_t^\gamma\, \dot{p}_t\right] d\lambda\right\| < \infty.$$
(C5) There exists a neighborhood $N_\theta$ of θ such that
$$\sup_{t\in N_\theta}\left\|\frac{\partial}{\partial t}\int p_t^\gamma\, q\, d\lambda\right\| < \infty, \qquad \sup_{t\in N_\theta}\left\|\int \frac{\partial}{\partial t}\left[p_t^{\gamma-1}\, \dot{p}_t\right] q\, d\lambda\right\| < \infty.$$
Lemma 1.
Under (C4) and (C5), the gradient vector and the Hessian matrix of $W_\theta$ are
$$\frac{\partial}{\partial\theta}W_\theta = \frac{\int p_\theta^\gamma\,\dot{p}_\theta\, d\lambda}{\int p_\theta^{\gamma+1}\, d\lambda} - \frac{\int p_\theta^{\gamma-1}\,\dot{p}_\theta\, q\, d\lambda}{\int p_\theta^\gamma\, q\, d\lambda},$$
$$\frac{\partial^2}{\partial\theta^2}W_\theta = \frac{\left[\gamma\int p_\theta^{\gamma-1}\dot{p}_\theta\dot{p}_\theta^t\, d\lambda + \int p_\theta^\gamma\ddot{p}_\theta\, d\lambda\right]\int p_\theta^{\gamma+1}\, d\lambda - (\gamma+1)\int p_\theta^\gamma\dot{p}_\theta\, d\lambda\left(\int p_\theta^\gamma\dot{p}_\theta\, d\lambda\right)^t}{\left(\int p_\theta^{\gamma+1}\, d\lambda\right)^2} - \frac{\left[(\gamma-1)\int p_\theta^{\gamma-2}\dot{p}_\theta\dot{p}_\theta^t\, q\, d\lambda + \int p_\theta^{\gamma-1}\ddot{p}_\theta\, q\, d\lambda\right]\int p_\theta^\gamma\, q\, d\lambda - \gamma\int p_\theta^{\gamma-1}\dot{p}_\theta\, q\, d\lambda\left(\int p_\theta^{\gamma-1}\dot{p}_\theta\, q\, d\lambda\right)^t}{\left(\int p_\theta^\gamma\, q\, d\lambda\right)^2}.$$
When the true model Q belongs to the parametric model $(P_\theta)$, hence $Q = P_{\theta_0}$ and $q = p_{\theta_0}$, the gradient vector and the Hessian matrix of $W_\theta$ simplify to
$$\left.\frac{\partial}{\partial\theta}W_\theta\right|_{\theta=\theta_0} = 0, \tag{12}$$
$$\left.\frac{\partial^2}{\partial\theta^2}W_\theta\right|_{\theta=\theta_0} = M_\gamma(\theta_0), \tag{13}$$
where
$$M_\gamma(\theta_0) := \frac{\left(\int p_{\theta_0}^{\gamma-1}\dot{p}_{\theta_0}\dot{p}_{\theta_0}^t\, d\lambda\right)\left(\int p_{\theta_0}^{\gamma+1}\, d\lambda\right) - \left(\int p_{\theta_0}^\gamma\dot{p}_{\theta_0}\, d\lambda\right)\left(\int p_{\theta_0}^\gamma\dot{p}_{\theta_0}\, d\lambda\right)^t}{\left(\int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2}. \tag{14}$$
In the following Propositions we suppose that the true model Q belongs to the parametric model $(P_\theta)$; hence $Q = P_{\theta_0}$, $q = p_{\theta_0}$, and $\theta_0$ is the value of the parameter corresponding to the true model $Q = P_{\theta_0}$. We also say that $\theta_0$ is the true value of the parameter. (All the proofs of the Propositions are given in Appendix A.)
Proposition 1.
When the true model Q belongs to the parametric model $(P_\theta)$, assuming that (C4) and (C5) are fulfilled for $q = p_{\theta_0}$ and $\theta = \theta_0$, the expected overall discrepancy is given by
$$E[W_{\widehat{\theta}_n}] = W_{\theta_0} + \frac{1}{2}E\left[(\widehat{\theta}_n-\theta_0)^t M_\gamma(\theta_0)(\widehat{\theta}_n-\theta_0)\right] + E[R_n], \tag{15}$$
where $R_n = o\left(\|\widehat{\theta}_n-\theta_0\|^2\right)$ and $M_\gamma(\theta_0)$ is given by (14).

3.2. Estimation of the Expected Overall Discrepancy

In this section, we introduce an estimator of the expected overall discrepancy, under the hypothesis that the true model Q belongs to the parametric model ( P θ ) . Hence, Q = P θ 0 and the unknown parameter θ 0 will be estimated by a minimum pseudodistance estimator θ ^ n .
For a given $\theta\in\Theta$, a natural estimator of $W_\theta$ is defined by
$$Q_\theta := \frac{1}{\gamma+1}\ln\int p_\theta^{\gamma+1}\, d\lambda - \frac{1}{\gamma}\ln\frac{1}{n}\sum_{i=1}^n p_\theta^\gamma(X_i). \tag{16}$$
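Equation (16) translates directly into code. A minimal sketch, assuming a generic one-dimensional candidate density pdf and SciPy quadrature (both illustrative choices not taken from the paper):

```python
# Sketch: the empirical statistic Q_theta of (16), the data-dependent
# counterpart of W_theta obtained by plugging in the empirical measure.
import numpy as np
from scipy import integrate

def Q_stat(x, pdf, gamma, lo=-np.inf, hi=np.inf):
    int_pg1 = integrate.quad(lambda t: pdf(t) ** (gamma + 1), lo, hi)[0]
    return (np.log(int_pg1) / (gamma + 1)
            - np.log(np.mean(pdf(x) ** gamma)) / gamma)
```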
Lemma 2.
Assuming (C4), the gradient vector and the Hessian matrix of $Q_\theta$ are given by
$$\frac{\partial}{\partial\theta}Q_\theta = \frac{\int p_\theta^\gamma\dot{p}_\theta\, d\lambda}{\int p_\theta^{\gamma+1}\, d\lambda} - \frac{\sum_{i=1}^n p_\theta^{\gamma-1}(X_i)\,\dot{p}_\theta(X_i)}{\sum_{i=1}^n p_\theta^\gamma(X_i)},$$
$$\frac{\partial^2}{\partial\theta^2}Q_\theta = \frac{\left[\gamma\int p_\theta^{\gamma-1}\dot{p}_\theta\dot{p}_\theta^t\, d\lambda + \int p_\theta^\gamma\ddot{p}_\theta\, d\lambda\right]\int p_\theta^{\gamma+1}\, d\lambda - (\gamma+1)\int p_\theta^\gamma\dot{p}_\theta\, d\lambda\left(\int p_\theta^\gamma\dot{p}_\theta\, d\lambda\right)^t}{\left(\int p_\theta^{\gamma+1}\, d\lambda\right)^2} - \frac{\left[(\gamma-1)\sum_{i=1}^n p_\theta^{\gamma-2}(X_i)\dot{p}_\theta(X_i)\dot{p}_\theta(X_i)^t + \sum_{i=1}^n p_\theta^{\gamma-1}(X_i)\ddot{p}_\theta(X_i)\right]\sum_{i=1}^n p_\theta^\gamma(X_i)}{\left(\sum_{i=1}^n p_\theta^\gamma(X_i)\right)^2} + \gamma\,\frac{\left(\sum_{i=1}^n p_\theta^{\gamma-1}(X_i)\dot{p}_\theta(X_i)\right)\left(\sum_{i=1}^n p_\theta^{\gamma-1}(X_i)\dot{p}_\theta(X_i)\right)^t}{\left(\sum_{i=1}^n p_\theta^\gamma(X_i)\right)^2}.$$
Proposition 2.
When the true model Q belongs to the parametric model $(P_\theta)$, imposing conditions (C1)-(C5), it holds that
$$E[Q_{\theta_0}] = E[Q_{\widehat{\theta}_n}] + \frac{1}{2}E\left[(\theta_0-\widehat{\theta}_n)^t M_\gamma(\theta_0)(\theta_0-\widehat{\theta}_n)\right] + E[R_n], \tag{17}$$
where $R_n = o\left(\|\widehat{\theta}_n-\theta_0\|^2\right)$.
The following result allows us to define an asymptotically unbiased estimator of the expected overall discrepancy.
Proposition 3.
When the true model Q belongs to the parametric model $(P_\theta)$, under (C1)-(C5), it holds that
$$E[W_{\widehat{\theta}_n}] = E[Q_{\widehat{\theta}_n}] + E\left[(\theta_0-\widehat{\theta}_n)^t M_\gamma(\theta_0)(\theta_0-\widehat{\theta}_n)\right] + \frac{1}{2\gamma n}\left(1 - \frac{\int p_{\theta_0}^{2\gamma+1}\, d\lambda}{\left(\int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2}\right) + E[R_n] + \frac{1}{\gamma}E[R_n'], \tag{18}$$
where $R_n = o\left(\|\widehat{\theta}_n-\theta_0\|^2\right)$ and $R_n' = o\left(\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2\right)$.

3.2.1. Limit Properties of the Estimator $Q_{\widehat{\theta}_n}$

Under the hypothesis that the true model Q belongs to the family of models $(P_\theta)$, hence $Q = P_{\theta_0}$, we prove the consistency and the asymptotic normality of the estimator $Q_{\widehat{\theta}_n}$.
Note that
$$Q_{\widehat{\theta}_n} = \frac{1}{\gamma+1}\ln\int p_{\widehat{\theta}_n}^{\gamma+1}\, d\lambda - \frac{1}{\gamma}\ln\frac{1}{n}\sum_{i=1}^n p_{\widehat{\theta}_n}^\gamma(X_i) = \ln\left[\frac{\frac{1}{n}\sum_{i=1}^n p_{\widehat{\theta}_n}^\gamma(X_i)}{\left(\int p_{\widehat{\theta}_n}^{\gamma+1}\, d\lambda\right)^{\gamma/(\gamma+1)}}\right]^{-1/\gamma} = \ln\left[\widehat{R}_\gamma(\theta_0)\right]^{-1/\gamma},$$
where $\int p_{\widehat{\theta}_n}^{\gamma+1}\, d\lambda = \left.\int p_\theta^{\gamma+1}\, d\lambda\right|_{\theta=\widehat{\theta}_n}$ and $\widehat{R}_\gamma(\theta_0)$ is given by (7).
First we prove that $\widehat{R}_\gamma(\theta_0)$ is a consistent estimator of $R_\gamma(\theta_0)$. Indeed, using Theorem 1 and the fact that $\int\frac{\partial}{\partial\theta}h(x,\theta_0)\, dP_{\theta_0}(x) = 0$, a Taylor expansion of $\frac{1}{n}\sum_{i=1}^n h(X_i,\theta)$ at $\widehat{\theta}_n$ around $\theta_0$ gives
$$\widehat{R}_\gamma(\theta_0) = \frac{1}{n}\sum_{i=1}^n h(X_i,\theta_0) + o_P(n^{-1/2}). \tag{21}$$
Using the weak law of large numbers,
$$\frac{1}{n}\sum_{i=1}^n h(X_i,\theta_0) = R_\gamma(\theta_0) + o_P(1). \tag{22}$$
Combining (21) and (22), we obtain that $\widehat{R}_\gamma(\theta_0)$ converges to $R_\gamma(\theta_0)$ in probability.
Then, using the continuous mapping theorem, since $g(t) = \ln t^{-1/\gamma}$ is a continuous function, we get
$$Q_{\widehat{\theta}_n} = \ln\left[\widehat{R}_\gamma(\theta_0)\right]^{-1/\gamma} \to \ln\left[R_\gamma(\theta_0)\right]^{-1/\gamma} = W_{\theta_0}$$
in probability.
On the other hand, using the asymptotic normality of the estimator R ^ γ ( θ 0 ) (according to Theorem 1 (c)) together with the univariate delta method, we obtain the asymptotic normality of Q θ ^ n . The Proposition below summarizes the above asymptotic results.
Proposition 4.
Under (C1)-(C3), when Q = P θ 0 , it holds
(a) $Q_{\widehat{\theta}_n}$ converges to $W_{\theta_0}$ in probability.
(b) $\sqrt{n}\left(Q_{\widehat{\theta}_n} - W_{\theta_0}\right)$ converges in distribution to a centered univariate normal random variable with variance $\frac{\sigma^2(\theta_0)}{\gamma^2 R_\gamma(\theta_0)^2}$, $\sigma^2(\theta_0)$ being defined in Theorem 1.

3.2.2. Robustness Properties of the Estimator $Q_{\widehat{\theta}_n}$

The influence function is a useful tool for describing the robustness of an estimator. Recall that a map T, defined on a set of probability measures and taking values in the parameter space, is the statistical functional corresponding to an estimator $\widehat{\theta}_n$ of the parameter θ whenever $\widehat{\theta}_n = T(P_n)$, where $P_n$ is the empirical measure associated with the sample. The influence function of T at $P_\theta$ is defined by
$$\mathrm{IF}(x; T, P_\theta) := \left.\frac{\partial T(\widetilde{P}_{\varepsilon x})}{\partial\varepsilon}\right|_{\varepsilon=0},$$
where $\widetilde{P}_{\varepsilon x} := (1-\varepsilon)P_\theta + \varepsilon\delta_x$, $\varepsilon > 0$, $\delta_x$ being the Dirac measure putting all its mass at x. The gross error sensitivity of the estimator is defined by
$$\gamma^*(T, P_\theta) = \sup_x \left\|\mathrm{IF}(x; T, P_\theta)\right\|.$$
Whenever the influence function is bounded with respect to x, the corresponding estimator is called B-robust (see [19]).
In what follows, for a given $\gamma > 0$, we derive the influence function of the estimator $Q_{\widehat{\theta}_n}$. The statistical functional associated with this estimator, which we denote by U, is defined by
$$U(P) := \frac{1}{\gamma+1}\ln\int p_{T(P)}^{\gamma+1}\, d\lambda - \frac{1}{\gamma}\ln\int p_{T(P)}^\gamma\, dP,$$
where T is the statistical functional corresponding to the minimum pseudodistance estimator $\widehat{\theta}_n$ that is used, namely
$$T(P) := \arg\sup_\theta\; C_\gamma(\theta)^{-1}\int p_\theta^\gamma\, dP,$$
where $C_\gamma(\theta) = \left(\int p_\theta^{\gamma+1}\, d\lambda\right)^{\gamma/(\gamma+1)}$.
Due to the Fisher consistency of the functional T, according to (6), we have T ( P θ 0 ) = θ 0 which implies that U ( P θ 0 ) = W θ 0 .
Proposition 5.
When $Q = P_{\theta_0}$, the influence function of $Q_{\widehat{\theta}_n}$ is given by
$$\mathrm{IF}(x; U, P_{\theta_0}) = \frac{1}{\gamma}\left(1 - \frac{p_{\theta_0}^\gamma(x)}{\int p_{\theta_0}^{\gamma+1}\, d\lambda}\right). \tag{23}$$
Note that the influence function of the estimator $Q_{\widehat{\theta}_n}$ does not depend on the estimator $\widehat{\theta}_n$ itself, but on the pseudodistance used. Usually, $p_{\theta_0}^\gamma(x)$ is bounded with respect to x, and therefore $Q_{\widehat{\theta}_n}$ is a robust estimator of $W_{\theta_0}$.
For a comparison at the level of the influence function, we consider the AIC criterion, which is defined by
$$AIC = -2\ln(L(\widehat{\theta}_n)) + 2d,$$
where $L(\widehat{\theta}_n)$ is the maximum value of the likelihood function for the model, $\widehat{\theta}_n$ the maximum likelihood estimator and d the dimension of the parameter. The statistical functional corresponding to the statistic $-2\ln(L(\widehat{\theta}_n))$ is
$$V(P) = -2\int\ln p_{T(P)}\, dP,$$
where T here is the statistical functional corresponding to the maximum likelihood estimator. The influence function of the functional V is given by
$$\mathrm{IF}(x; V, P_{\theta_0}) = 2\left(\int\ln p_{\theta_0}\, dP_{\theta_0} - \ln p_{\theta_0}(x)\right). \tag{24}$$
This influence function is not bounded with respect to x; therefore the statistic $-2\ln(L(\widehat{\theta}_n))$ is not robust.
For example, in the case of the univariate normal model, for a positive γ, the influence function (23) writes as
$$\mathrm{IF}(x; U, P_{\theta_0}) = \frac{1}{\gamma}\left(1 - \sqrt{\gamma+1}\cdot\exp\left(-\frac{\gamma}{2}\left(\frac{x-m}{\sigma}\right)^2\right)\right) \tag{25}$$
while the influence function (24) writes as
$$\mathrm{IF}(x; V, P_{\theta_0}) = \left(\frac{x-m}{\sigma}\right)^2 - 1 \tag{26}$$
(here $\theta_0 = (m,\sigma)$). For all the pseudodistances, the influence function (25) is bounded with respect to x; therefore the selection criteria based on the statistic $Q_{\widehat{\theta}_n}$ will be robust. On the other hand, the influence function (26) is not bounded with respect to x, showing the non-robustness of AIC in this case. Moreover, the gross error sensitivities corresponding to these influence functions are $\gamma^*(U, P_{\theta_0}) = \frac{1}{\gamma}$ and $\gamma^*(V, P_{\theta_0}) = \infty$. These results show that, in the case of the normal model, the gross error sensitivity decreases as γ increases. Therefore, larger values of γ are associated with more robust procedures. For the particular case $m = 0$ and $\sigma = 1$, the influence functions (25) and (26) are represented in Figure 1.
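A plot in the spirit of Figure 1 can be generated as follows; this sketch assumes the reconstructed forms (25)-(26) and uses illustrative values of γ:

```python
# Sketch: bounded IF of the pseudodistance statistic vs. the unbounded IF
# underlying AIC, for the standard normal model (m = 0, sigma = 1).
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 400)
m, sigma = 0.0, 1.0
fig, ax = plt.subplots()
for g in (0.1, 0.25, 0.5):
    if_u = (1 - np.sqrt(g + 1) * np.exp(-0.5 * g * ((x - m) / sigma) ** 2)) / g
    ax.plot(x, if_u, label=f"IF(x; U), gamma = {g}")     # bounded by 1/gamma
ax.plot(x, ((x - m) / sigma) ** 2 - 1, "k--", label="IF(x; V), AIC")  # unbounded
ax.set_xlabel("x"); ax.set_ylabel("influence"); ax.legend()
plt.show()
```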

3.3. Model Selection Criteria Using Pseudodistances

3.3.1. The Case of Univariate Normal Family

The criteria that we propose in this section correspond to the case where the candidate model is a univariate normal model from the family of normal models ( P θ ) indexed by θ = ( μ , σ ) . We also suppose that the true model Q belongs to ( P θ ) .
In the case of the univariate normal model, $M_\gamma(\theta_0)$ defined in (14) expresses as
$$M_\gamma(\theta_0) = \frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\, A(\gamma)\, V^{-1},$$
where V is the asymptotic covariance matrix given by (8) and the matrix $A(\gamma)$ is given by
$$A(\gamma) = \begin{pmatrix} 1 & 0 \\ 0 & \dfrac{3\gamma^2+4\gamma+2}{2(2\gamma+1)} \end{pmatrix}.$$
For small positive values of γ, the matrix $A(\gamma)$ can be approximated by the identity matrix I.
According to Theorem 1, $\sqrt{n}\,(\widehat{\theta}_n-\theta_0)$ is asymptotically multivariate normal, and then the statistic $n\,(\theta_0-\widehat{\theta}_n)^t V^{-1}(\theta_0-\widehat{\theta}_n)$ has approximately a $\chi^2_d$ distribution. For large n, it holds that
$$E\left[(\theta_0-\widehat{\theta}_n)^t M_\gamma(\theta_0)(\theta_0-\widehat{\theta}_n)\right] \approx \frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\cdot\frac{d}{n}.$$
Also, for the normal model, it holds that
$$\frac{\int p_{\theta_0}^{2\gamma+1}\, d\lambda}{\left(\int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2} = \frac{\gamma+1}{\sqrt{2\gamma+1}}.$$
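The last identity follows from the Gaussian power integral; the following short derivation is a standard computation added here for completeness, not quoted from the paper:
$$\int p_{\theta_0}^{c}\, d\lambda = \int (2\pi\sigma^2)^{-c/2}\, e^{-\frac{c(x-m)^2}{2\sigma^2}}\, dx = (2\pi\sigma^2)^{(1-c)/2}\, c^{-1/2}, \qquad c>0,$$
so that, taking $c = 2\gamma+1$ in the numerator and $c = \gamma+1$ in the denominator,
$$\frac{\int p_{\theta_0}^{2\gamma+1}\, d\lambda}{\left(\int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^{2}} = \frac{(2\pi\sigma^2)^{-\gamma}\,(2\gamma+1)^{-1/2}}{(2\pi\sigma^2)^{-\gamma}\,(\gamma+1)^{-1}} = \frac{\gamma+1}{\sqrt{2\gamma+1}}.$$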
Therefore, (18) becomes
$$E[W_{\widehat{\theta}_n}] \approx E[Q_{\widehat{\theta}_n}] + \frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\cdot\frac{d}{n} + \frac{1}{2\gamma n}\left(1 - \frac{\gamma+1}{\sqrt{2\gamma+1}}\right) + E[R_n] + \frac{1}{\gamma}E[R_n']. \tag{30}$$
Using the central limit theorem and the asymptotic properties of $\widehat{\theta}_n$ given in Theorem 1, the following hold:
$$n\cdot o\left(\|\widehat{\theta}_n-\theta_0\|^2\right) = o_P(1), \tag{31}$$
$$n\cdot o\left(\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2\right) = o_P(1). \tag{32}$$
Using (30), (31) and (32) we obtain:
Proposition 6.
For the univariate normal family, an asymptotically unbiased estimator of the expected overall discrepancy is given by
$$Q_{\widehat{\theta}_n} + \frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\cdot\frac{d}{n} + \frac{1}{2\gamma n}\left(1 - \frac{\gamma+1}{\sqrt{2\gamma+1}}\right), \tag{33}$$
where $\widehat{\theta}_n$ is a minimum pseudodistance estimator given by (3).
Under the hypothesis that $(P_\theta)$ is the univariate normal model, as assumed in this subsection, the function h writes as
$$h(x,\theta) = (\gamma+1)^{\gamma/(2(\gamma+1))}\cdot\left(\sigma\sqrt{2\pi}\right)^{-\gamma/(\gamma+1)}\cdot\exp\left(-\frac{\gamma}{2}\left(\frac{x-m}{\sigma}\right)^2\right)$$
and it can easily be checked that all the conditions (C1)–(C5) are fulfilled. Therefore we can use all the results presented in the preceding subsections, so that Proposition 6 is fully justified.
Moreover, the selection criteria based on (33) are consistent on the basis of Proposition 4. It should also be noted that the bias correction term in (33) decreases slowly as the parameter γ increases, always staying very close to zero (of order $10^{-2}$). As expected, the larger the sample size, the smaller the bias correction. As we saw in Section 3.2.2, since the gross error sensitivity of $Q_{\widehat{\theta}_n}$ is $\gamma^*(U, P_{\theta_0}) = \frac{1}{\gamma}$, larger values of γ are associated with more robust procedures. On the other hand, the approximation of $A(\gamma)$ by the identity matrix holds for values of γ close to zero. Thus, positive values of γ smaller than 0.5, for example, could represent choices satisfying both the robustness requirement and the approximation of $A(\gamma)$ by the identity matrix, an approximation which is necessary to construct the criterion in this case.
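For concreteness, the criterion (33) can be assembled from the estimation sketch of Section 2; the helper mpd_fit_normal and the function name below are hypothetical, and the identity $Q_{\widehat{\theta}_n} = \ln[\widehat{R}_\gamma(\theta_0)]^{-1/\gamma}$ from Section 3.2.1 is used to recover $Q_{\widehat{\theta}_n}$:

```python
# Sketch: the pseudodistance-based criterion (33) for the univariate
# normal family, assuming mpd_fit_normal from the Section 2 sketch.
import numpy as np

def normal_criterion(x, gamma):
    m_hat, s_hat, r_hat = mpd_fit_normal(x, gamma)
    q = -np.log(r_hat) / gamma          # Q_{theta_hat} = ln[R_hat]^{-1/gamma}
    n, d = len(x), 2                    # theta = (m, sigma), so d = 2
    bias = ((gamma + 1) ** 2 / (2 * gamma + 1) ** 1.5 * d / n
            + (1 - (gamma + 1) / np.sqrt(2 * gamma + 1)) / (2 * gamma * n))
    return q + bias
```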

3.3.2. The Case of Linear Regression Models

In the following, we adapt the pseudodistance-based model selection criterion to the case of linear regression models. Consider the linear regression model
$$Y = \alpha + \beta^t X + e,$$
where $e \sim N(0,\sigma)$ and e is independent of X. Suppose we have a sample given by the i.i.d. random vectors $Z_i = (X_i, Y_i)$, $i = 1,\dots,n$, such that $Y_i = \alpha + \beta^t X_i + e_i$.
We consider the joint distribution of the entire data and write a pseudodistance between the theoretical model and the true model corresponding to the data. Let $P_\theta$, $\theta := (\alpha,\beta,\sigma)$, be the probability measure associated with the theoretical model given by the random vector $Z = (X,Y)$ and Q the probability measure associated with the true model corresponding to the data. Denote by $p_\theta$, respectively q, the corresponding densities. For $\gamma > 0$, the pseudodistance between $P_\theta$ and Q is defined by
$$R_\gamma(P_\theta, Q) := \frac{1}{\gamma+1}\ln\int p_\theta^\gamma(x,y)\, dP_\theta(x,y) + \frac{1}{\gamma(\gamma+1)}\ln\int q^\gamma(x,y)\, dQ(x,y) - \frac{1}{\gamma}\ln\int p_\theta^\gamma(x,y)\, dQ(x,y).$$
Similarly to [18], since the middle term above does not depend on $P_\theta$, a minimum pseudodistance estimator of the parameter $\theta_0 = (\alpha_0,\beta_0,\sigma_0)$ is defined by
$$\widehat{\theta}_n = (\widehat{\alpha}_n,\widehat{\beta}_n,\widehat{\sigma}_n) = \arg\min_{\alpha,\beta,\sigma}\left\{\frac{1}{\gamma+1}\ln\int p_\theta^\gamma(x,y)\, dP_\theta(x,y) - \frac{1}{\gamma}\ln\int p_\theta^\gamma(x,y)\, dP_n(x,y)\right\},$$
where $P_n$ is the empirical measure associated with the sample. This estimator can be written as
$$\widehat{\theta}_n = (\widehat{\alpha}_n,\widehat{\beta}_n,\widehat{\sigma}_n) = \arg\min_{\alpha,\beta,\sigma}\left\{\frac{1}{\gamma+1}\ln\int \phi_\sigma^{\gamma+1}(e)\, de - \frac{1}{\gamma}\ln\frac{1}{n}\sum_{i=1}^n\phi_\sigma^\gamma(Y_i-\alpha-\beta^t X_i)\right\},$$
where $\phi_\sigma$ is the density of the random variable $e \sim N(0,\sigma)$. Then, the estimator $Q_{\widehat{\theta}_n}$ can be written as
$$Q_{\widehat{\theta}_n} = \min_{\alpha,\beta,\sigma}\left\{\frac{1}{\gamma+1}\ln\frac{1}{\left(\sigma\sqrt{2\pi}\right)^\gamma\sqrt{\gamma+1}} - \frac{1}{\gamma}\ln\frac{1}{n}\sum_{i=1}^n\frac{1}{\left(\sigma\sqrt{2\pi}\right)^\gamma}\cdot\exp\left(-\frac{\gamma}{2\sigma^2}\left(Y_i-\alpha-\beta^t X_i\right)^2\right)\right\}. \tag{39}$$
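The minimization in (39) has no closed form, but a direct numerical minimization works well in practice. A minimal sketch, assuming the reconstructed form of (39) above (function name, parametrization and starting values are illustrative choices):

```python
# Sketch: minimize (39) over (alpha, beta, sigma) for a linear regression
# model; returns the fitted parameter vector and Q_{theta_hat}.
import numpy as np
from scipy.optimize import minimize

def regression_Q(X, y, gamma):
    n, p = X.shape
    g = gamma
    def obj(par):
        a, b, log_s = par[0], par[1:p + 1], par[p + 1]
        s = np.exp(log_s)                          # sigma > 0 via log-scale
        r = y - a - X @ b                          # residuals Y_i - alpha - beta' X_i
        first = -np.log((s * np.sqrt(2 * np.pi)) ** g * np.sqrt(g + 1)) / (g + 1)
        terms = np.exp(-0.5 * g * (r / s) ** 2) / (s * np.sqrt(2 * np.pi)) ** g
        return first - np.log(terms.mean()) / g
    start = np.r_[y.mean(), np.zeros(p), np.log(y.std())]
    fit = minimize(obj, start, method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-10})
    return fit.x, fit.fun
```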
In order to construct an asymptotically unbiased estimator of the expected overall discrepancy in the case of linear regression models, we evaluate the second and third terms in (18).
For values of γ close to 0 (γ smaller than 0.3), we found the following approximation of the matrix $M_\gamma(\theta_0)$:
$$M_\gamma(\theta_0) \approx \frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\, V^{-1}\begin{pmatrix} I & 0 \\ 0 & \dfrac{3\gamma^2+4\gamma+2}{2(\gamma+1)(2\gamma+1)} \end{pmatrix},$$
where V is the asymptotic covariance matrix of $\widehat{\theta}_n$ and I is the identity matrix. We refer to [15] for the asymptotic properties of the minimum pseudodistance estimators in the case of linear regression models. Since $\sqrt{n}\,(\widehat{\theta}_n-\theta_0)$ is asymptotically multivariate normally distributed, using the $\chi^2$ distribution, we obtain the approximation
$$E\left[(\widehat{\theta}_n-\theta_0)^t M_\gamma(\theta_0)(\widehat{\theta}_n-\theta_0)\right] \approx \frac{1}{n}\cdot\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\left((d-1) + \frac{3\gamma^2+4\gamma+2}{2(\gamma+1)(2\gamma+1)}\right).$$
Also, the third term in (18) is given by
$$\frac{1}{2\gamma n}\left(1 - \left(\frac{\gamma+1}{\sqrt{2\gamma+1}}\right)^d\right).$$
Then, according to Proposition 3, an asymptotically unbiased estimator of the expected overall discrepancy is given by
$$Q_{\widehat{\theta}_n} + \frac{1}{n}\cdot\frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\left((d-1) + \frac{3\gamma^2+4\gamma+2}{2(\gamma+1)(2\gamma+1)}\right) + \frac{1}{2\gamma n}\left(1 - \left(\frac{\gamma+1}{\sqrt{2\gamma+1}}\right)^d\right), \tag{43}$$
where $Q_{\widehat{\theta}_n}$ is given by (39). Note that, using the asymptotic properties of $\widehat{\theta}_n$ and the central limit theorem, the last two terms in (18) of Proposition 3 are $o_P\left(\frac{1}{n}\right)$.
When comparing different linear regression models, as in Section 4 below, we can ignore the terms in (43) that depend only on n and γ. Therefore, we can use as model selection criterion the simplified expression
$$Q_{\widehat{\theta}_n} + \frac{(\gamma+1)^2}{(2\gamma+1)^{3/2}}\cdot\frac{d}{n} - \frac{1}{2\gamma n}\left(\frac{\gamma+1}{\sqrt{2\gamma+1}}\right)^d, \tag{44}$$
which we call the Pseudodistance-based Information Criterion (PIC).
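Putting the pieces together, (44) can be sketched as follows; this assumes the regression_Q helper above and counts $d = p + 2$ parameters (intercept, p slopes, and σ), which follows from θ = (α, β, σ):

```python
# Sketch: the PIC criterion (44) for a linear regression model with
# design matrix X (n x p) and response y; smaller values are preferred.
import numpy as np

def PIC(X, y, gamma):
    _, q = regression_Q(X, y, gamma)     # Q_{theta_hat} from (39)
    n = len(y)
    d = X.shape[1] + 2                   # alpha, beta_1..beta_p, sigma
    g = gamma
    return (q + (g + 1) ** 2 / (2 * g + 1) ** 1.5 * d / n
              - ((g + 1) / np.sqrt(2 * g + 1)) ** d / (2 * g * n))
```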

4. Applications

4.1. Simulation Study

In order to illustrate the performance of the PIC criterion (44) for linear regression models, we performed a simulation study, using the model selection criteria AIC, BIC and MDIC for comparison. These criteria are defined respectively by
$$AIC = n\log\widehat{\sigma}_p^2 + 2(p+2),$$
$$BIC = n\log\widehat{\sigma}_p^2 + (p+2)\log n,$$
where n is the sample size, p is the number of covariates of the model and $\widehat{\sigma}_p^2$ is the classical unbiased estimator of the variance of the model, and by
$$MDIC = n\, MQ_{\widehat{\theta}} + (2\pi)^{-\alpha/2}(1+\alpha)^{2+p/2}\, p$$
with $\alpha = 0.25$ and
$$MQ_{\widehat{\theta}} = -\left(1+\alpha^{-1}\right)\frac{1}{n}\sum_{i=1}^n f_{\widehat{\theta}}^\alpha(X_i),$$
where $\widehat{\theta}$ is a consistent estimate of the vector of unknown parameters involved in the model with p covariates and $f_{\widehat{\theta}}$ is the associated probability density function. Note that MDIC is based on the well-known BHHJ family of divergence measures, indexed by a parameter $\alpha > 0$, and on the minimum divergence estimating method for robust parameter estimation (see [20]). The value $\alpha = 0.25$ was found in [9] to be an ideal one for a great variety of settings. The above three criteria were chosen for this comparative study with PIC not only due to their popularity, but also due to their special characteristics. Indeed, AIC is the classical representative of asymptotically efficient criteria, BIC is known to be consistent, while MDIC is associated with robust estimation (see e.g., [20,21,22,23]).
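For reference, the AIC/BIC values above can be computed from an ordinary least squares fit. A minimal sketch, assuming the reconstructed penalties in (45)-(46) (the function name is illustrative):

```python
# Sketch: AIC and BIC for a linear regression with intercept, based on the
# unbiased residual variance estimator of the fitted model.
import numpy as np

def aic_bic(X, y):
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])          # add intercept column
    beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta_hat) ** 2)
    sigma2_hat = rss / (n - p - 1)                # unbiased variance estimator
    aic = n * np.log(sigma2_hat) + 2 * (p + 2)
    bic = n * np.log(sigma2_hat) + (p + 2) * np.log(n)
    return aic, bic
```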
Let $X_1, X_2, X_3, X_4$ be four variables following respectively the normal distributions N(0,3), N(1,3), N(2,3) and N(3,3). We consider the model
$$Y = a_0 + a_1 X_1 + a_2 X_2 + \varepsilon$$
with $a_0 = a_1 = a_2 = 1$ and $\varepsilon \sim N(0,1)$. This is the uncontaminated model. In order to evaluate the robustness of the new PIC criterion, we also consider the contaminated model
$$Y = d_1\left(a_0 + a_1 X_1 + a_2 X_2 + \varepsilon\right) + d_2\left(a_0 + a_1 X_1 + a_2 X_2 + \varepsilon^*\right),$$
where $\varepsilon^* \sim N(5,1)$ and $d_1, d_2 \in [0,1]$ are such that $d_1 + d_2 = 1$. Note that for $d_1 = 1$ and $d_2 = 0$ the uncontaminated model is obtained.
The simulated data corresponding to the contaminated model are
$$Y_i = d_1\left(1 + X_{1,i} + X_{2,i} + \varepsilon_i\right) + d_2\left(1 + X_{1,i} + X_{2,i} + \varepsilon_i^*\right),$$
for $i = 1,\dots,n$, where $X_{1,i}, X_{2,i}, \varepsilon_i, \varepsilon_i^*$ are values of the variables $X_1, X_2, \varepsilon, \varepsilon^*$ independently generated from the normal distributions N(1,3), N(2,3), N(0,1), N(5,1), respectively.
With a set of four possible regressors, there are $2^4 - 1 = 15$ possible model specifications that include at least one regressor. These 15 possible models constitute the set of candidate models in our study. More precisely, this set contains the full model $(X_1, X_2, X_3, X_4)$, given by
$$Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 + \varepsilon,$$
as well as all 14 possible subsets of the full model consisting of one $(X_{j_1})$, two $(X_{j_1}, X_{j_2})$ or three $(X_{j_1}, X_{j_2}, X_{j_3})$ of the four regressors $X_1, X_2, X_3, X_4$, with $j_1 \neq j_2 \neq j_3$, $j_i \in \{1,2,3,4\}$ and $i = 1,2,3$.
In our simulation study, for several values of the parameter γ associated with the pseudodistance, we compared the new criterion PIC with the other model selection criteria. Different levels of contamination and different sample sizes have been considered. In the examples presented in this work, $d_1 \in \{0.8, 0.9, 0.95, 1\}$ and $n \in \{20, 50, 100\}$. Additional examples for n = 30, 75, 200, 500 have been analyzed (results not shown) with similar findings (see below). For each setting, fifty experiments were performed in order to select the best model among the available candidate models. Within each of the fifty experiments, on the basis of the simulated observations, the value of each of the above model selection criteria was calculated for each of the 15 possible models. Then, for each criterion, the 15 candidate models were ranked from 1st to 15th according to the value of the criterion. The model chosen by a given criterion is the one with the lowest criterion value among all 15 candidate models.
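A skeleton of one such replication, assuming the PIC sketch from Section 3.3.2 (names, seeds and the interpretation of the scale parameter "3" are illustrative assumptions):

```python
# Sketch: one replication of the Section 4.1 design -- generate contaminated
# data, score all 15 regressor subsets by PIC, and record the selected subset.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(12345)

def one_experiment(n, d1, gamma):
    # Regressors X1..X4 with the means stated in the text.
    X = np.column_stack([rng.normal(m, 3, n) for m in (0, 1, 2, 3)])
    # Contaminated errors d1*eps + d2*eps_star, with d2 = 1 - d1.
    e = d1 * rng.normal(0, 1, n) + (1 - d1) * rng.normal(5, 1, n)
    y = 1 + X[:, 0] + X[:, 1] + e
    subsets = [s for r in (1, 2, 3, 4) for s in combinations(range(4), r)]  # 15
    scores = {s: PIC(X[:, list(s)], y, gamma) for s in subsets}
    return min(scores, key=scores.get)   # subset with the lowest PIC value

picks = [one_experiment(n=50, d1=0.9, gamma=0.15) for _ in range(50)]
print(sum(p == (0, 1) for p in picks) / 50)  # proportion selecting (X1, X2)
```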
Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, Table 10, Table 11 and Table 12 present the proportions of models selected by the considered criteria. Among the 15 candidate models only 4 were chosen at least once. These four models are the same in all instances and appear in the 2nd column of all tables.
For small sample sizes ( n = 20 , n = 30 ) the criteria PIC and MDIC yield the best results. When the level of contamination is 10% or 20%, the PIC criterion yields very good results and beats the other competitors almost all the time. When the level of contamination is small, for example 5% or when there is no contamination, the two criteria are comparable, in the sense that in many cases the proportions of selected models by the two criteria are very close, so that sometimes PIC wins and sometimes MDIC wins. Table 1, Table 2, Table 3 and Table 4 present these results for n = 20 , but similar results are obtained for n = 30 , too.
For medium sample sizes ( n = 50 , n = 75 ), the criteria PIC and BIC yield the best results. The results for n = 50 are given in Table 5, Table 6, Table 7 and Table 8. Note that the PIC criterion yields the best results for 0% and 10% contamination. For the other levels of contamination, there are values of γ for which PIC is the best among all the considered criteria. On the other hand, in most cases when BIC wins, the proportions of selections of the true model by BIC and PIC are close.
When the sample size is large ( n = 100 , n = 200 , n = 500 ), BIC generally yields better results than PIC which stays relatively close behind, but sometimes BIC and PIC have the same performance. Table 9, Table 10, Table 11 and Table 12 present the results obtained for n = 100 .
Thus, the new PIC criterion works very well for small to medium sample sizes and for levels of contamination up to 20%, but falls behind BIC for large sample sizes. Note that for contaminated data, PIC with γ = 0.15 prevails in most of the considered cases. On the other hand, for uncontaminated data, it is PIC with γ = 0.2 that prevails in all the considered instances. It is also worth mentioning that PIC with γ = 0.3 behaves very satisfactorily in most cases, irrespective of the proportion of contamination (0%–20%) and the sample size. Observe also that in all cases AIC has the highest overestimation rate, which is somewhat expected (see [24]).
Although consistency is the main focus of the applications presented in this work, one should point out that if prediction is part of the objective of a regression analysis, then model selection carried out using criteria such as the ones used in this work has desirable properties. In fact, the case of finite-dimensional normal regression models has been shown to be associated with satisfactory prediction errors for criteria such as AIC and BIC (see [25]). Furthermore, it should be pointed out that in many instances PIC behaves quite similarly to the above criteria by choosing the same models. Also, according to the presented simulation results, the proportion of times PIC chooses the true model is always higher than that of AIC (even in the case of non-contaminated data) and sometimes higher than that of BIC. These results imply a satisfactory prediction ability for the proposed PIC criterion.
In conclusion, the new PIC criterion is a good competitor of the well known model selection criteria AIC, BIC and MDIC and may have superior performance especially in the case of small and contaminated samples.

4.2. Real Data Example

In order to illustrate the proposed method, we used the Hald cement data (see [26]), a popular example for multiple linear regression. This example concerns the heat evolved in calories per gram of cement, Y, as a function of the amount of each of four ingredients in the mix: tricalcium aluminate ($X_1$), tricalcium silicate ($X_2$), tetracalcium alumino-ferrite ($X_3$) and dicalcium silicate ($X_4$). The data are presented in Table 13.
Since 4 variables are available, there are 15 possible candidate models (involving at least one regressor) for this data set. Note that the 4 single-variable models should be excluded from the analysis, because cement involves a mixture of at least two components that react chemically (see [27], p. 102). The model selection criteria used are PIC for several values of γ, AIC, BIC and MDIC with α = 0.25. Table 14 shows the model selected by each of the considered criteria.
Observe that, in this example, PIC behaves similarly to AIC and MDIC, having a slight tendency toward overestimation. Note that for this specific dataset the collinearity is quite strong, with $X_1$ and $X_3$, as well as $X_2$ and $X_4$, being seriously correlated. It should be pointed out that the model ($X_1, X_2, X_4$) is chosen not only by AIC and PIC, but also by Mallows' $C_p$ criterion ([1]), with ($X_1, X_2, X_3$) coming a very close second. Note further that ($X_1, X_2, X_4$) has also been chosen by cross validation ([28], p. 33) and PRESS ([26], p. 325). Finally, it is worth noticing that these two models share the highest adjusted $R^2$ values, which are almost identical (0.976 for ($X_1, X_2, X_4$) and 0.974 for ($X_1, X_2, X_3$)), making the distinction between them extremely hard. Thus, in this example, the new PIC criterion gives results as good as other recognized classical model selection criteria.

5. Conclusions

In this work, by applying the same methodology as for AIC to a family of pseudodistances, we constructed new model selection criteria using minimum pseudodistance estimators. We proved theoretical properties of these criteria, including asymptotic unbiasedness, robustness, consistency, as well as the limit laws. The case of linear regression models was studied in detail and a specific pseudodistance-based selection criterion was proposed.
For linear regression models, a comparative study based on Monte Carlo simulations illustrates the performance of the new methodology. For small sample sizes, the criteria PIC and MDIC yield the best results, and in many cases PIC wins, for example when the level of contamination is 10% or 20%. For medium sample sizes, the criteria PIC and BIC yield the best results. When the sample size is large, BIC generally yields better results than PIC, which stays relatively close behind, although sometimes BIC and PIC have the same performance.
Based on the results of the simulation study and on the real data example, we conclude that the new PIC criterion is a good competitor of the well known criteria AIC, BIC and MDIC with an overall performance which is very satisfactory for all possible settings according to the sample size and contamination rate. Also PIC may have superior performance, especially in the case of small and contaminated samples.
An important issue that needs further investigation is the choice of an appropriate value for the parameter γ associated with the procedure. The findings of the presented simulation study show that, for contaminated data, the value γ = 0.15 leads to very good results, irrespective of the sample size. Also, γ = 0.3 produces overall very satisfactory results, irrespective of the sample size and the contamination rate. We hope to explore this further and provide a clear solution to this problem in future work. We also intend to extend this methodology to other types of models, including nonlinear models and time series models.

Author Contributions

A.T. conceived the methodology and obtained the theoretical results. A.T., A.K. and P.T. conceived the application part. A.K. and P.T. implemented the method in R and obtained the numerical results. All authors wrote the paper. All authors have read and approved the final manuscript.

Funding

This research received no external funding.

Acknowledgments

The work of the first author was partially supported by a grant of the Romanian National Authority for Scientific Research, CNCS-UEFISCDI, project number PN-II-RU-TE-2012-3-0007. The work of the third author was completed as part of the activities of the Laboratory of Statistics and Data Analysis of the University of the Aegean.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AIC	Akaike Information Criterion
BIC	Bayesian Information Criterion
GIC	General Information Criterion
DIC	Divergence Information Criterion
MDIC	Modified Divergence Information Criterion
PIC	Pseudodistance-based Information Criterion
BHHJ	Basu, Harris, Hjort and Jones family of divergence measures

Appendix A

Proof of Proposition 1.
Using a Taylor expansion of $W_\theta$ around the true parameter $\theta_0$ and taking $\theta = \widehat{\theta}_n$, on the basis of (12) and (13) we obtain
$$W_{\widehat{\theta}_n} = W_{\theta_0} + \frac{1}{2}(\widehat{\theta}_n-\theta_0)^t M_\gamma(\theta_0)(\widehat{\theta}_n-\theta_0) + o\left(\|\widehat{\theta}_n-\theta_0\|^2\right).$$
Then (15) holds. □
Proof of Proposition 2.
Using a Taylor expansion of $Q_\theta$ around $\widehat{\theta}_n$ and taking $\theta = \theta_0$, we obtain
$$Q_{\theta_0} = Q_{\widehat{\theta}_n} + \left[\left.\frac{\partial}{\partial\theta}Q_\theta\right|_{\theta=\widehat{\theta}_n}\right]^t(\theta_0-\widehat{\theta}_n) + \frac{1}{2}(\theta_0-\widehat{\theta}_n)^t\left[\left.\frac{\partial^2}{\partial\theta^2}Q_\theta\right|_{\theta=\widehat{\theta}_n}\right](\theta_0-\widehat{\theta}_n) + o\left(\|\widehat{\theta}_n-\theta_0\|^2\right).$$
Note that $\left.\frac{\partial}{\partial\theta}Q_\theta\right|_{\theta=\widehat{\theta}_n} = 0$ by the very definition of $\widehat{\theta}_n$.
By applying the weak law of large numbers and the continuous mapping theorem, we get
$$\left.\frac{\partial^2}{\partial\theta^2}Q_\theta\right|_{\theta=\theta_0} - \left.\frac{\partial^2}{\partial\theta^2}W_\theta\right|_{\theta=\theta_0} \xrightarrow{P} 0$$
and, using (13),
$$\left.\frac{\partial^2}{\partial\theta^2}Q_\theta\right|_{\theta=\theta_0} - M_\gamma(\theta_0) \xrightarrow{P} 0. \tag{A4}$$
Then, using the consistency of $\widehat{\theta}_n$ and (A4), we obtain
$$\left.\frac{\partial^2}{\partial\theta^2}Q_\theta\right|_{\theta=\widehat{\theta}_n} = M_\gamma(\theta_0) + o_P(1).$$
Consequently,
$$Q_{\theta_0} = Q_{\widehat{\theta}_n} + \frac{1}{2}(\theta_0-\widehat{\theta}_n)^t M_\gamma(\theta_0)(\theta_0-\widehat{\theta}_n) + o\left(\|\widehat{\theta}_n-\theta_0\|^2\right)$$
and we deduce (17). □
Proof of Proposition 3.
Using Propositions 1 and 2, we obtain
$$E[W_{\widehat{\theta}_n}] = E[Q_{\widehat{\theta}_n}] + E\left[(\theta_0-\widehat{\theta}_n)^t M_\gamma(\theta_0)(\theta_0-\widehat{\theta}_n)\right] - E[Q_{\theta_0}] + W_{\theta_0} + E[R_n], \tag{A7}$$
where $R_n = o\left(\|\widehat{\theta}_n-\theta_0\|^2\right)$.
In order to evaluate $W_{\theta_0} - E[Q_{\theta_0}]$, note that
$$Q_{\theta_0} - W_{\theta_0} = -\frac{1}{\gamma}\left(\ln\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \ln\int p_{\theta_0}^{\gamma+1}\, d\lambda\right).$$
A Taylor expansion of the function $\ln x$ around $\int p_{\theta_0}^{\gamma+1}\, d\lambda$ yields
$$\ln\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) = \ln\int p_{\theta_0}^{\gamma+1}\, d\lambda + \frac{1}{\int p_{\theta_0}^{\gamma+1}\, d\lambda}\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \int p_{\theta_0}^{\gamma+1}\, d\lambda\right) - \frac{1}{2}\cdot\frac{1}{\left(\int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2}\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2 + o\left(\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2\right).$$
Then
$$E[Q_{\theta_0} - W_{\theta_0}] = -\frac{1}{\gamma}\left\{\frac{1}{\int p_{\theta_0}^{\gamma+1}\, d\lambda}\,E\left[\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \int p_{\theta_0}^{\gamma+1}\, d\lambda\right] - \frac{1}{2}\cdot\frac{1}{\left(\int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2}\,E\left[\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2\right]\right\} - \frac{1}{\gamma}E[R_n'],$$
where $R_n' = o\left(\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2\right)$. The first expectation on the right-hand side is zero, since $E[p_{\theta_0}^\gamma(X)] = \int p_{\theta_0}^{\gamma+1}\, d\lambda$.
On the other hand,
$$E\left[\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i) - \int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2\right] = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n p_{\theta_0}^\gamma(X_i)\right) = \frac{1}{n}\mathrm{Var}\left(p_{\theta_0}^\gamma(X)\right) = \frac{1}{n}\left(E[p_{\theta_0}^{2\gamma}(X)] - E[p_{\theta_0}^\gamma(X)]^2\right) = \frac{\int p_{\theta_0}^{2\gamma+1}\, d\lambda - \left(\int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2}{n}.$$
Consequently,
$$E[Q_{\theta_0}] - W_{\theta_0} = -\frac{1}{2\gamma n}\left(1 - \frac{\int p_{\theta_0}^{2\gamma+1}\, d\lambda}{\left(\int p_{\theta_0}^{\gamma+1}\, d\lambda\right)^2}\right) - \frac{1}{\gamma}E[R_n']. \tag{A11}$$
Using (A7) and (A11), we obtain (18). □
Proof of Proposition 5.
For the contaminated model $\widetilde{P}_{\varepsilon x} = (1-\varepsilon)P_{\theta_0} + \varepsilon\delta_x$, it holds that
$$U(\widetilde{P}_{\varepsilon x}) = \frac{1}{\gamma+1}\ln\int p_{T(\widetilde{P}_{\varepsilon x})}^{\gamma+1}\, d\lambda - \frac{1}{\gamma}\ln\int p_{T(\widetilde{P}_{\varepsilon x})}^\gamma\, d\widetilde{P}_{\varepsilon x}.$$
Differentiating with respect to ε at $\varepsilon = 0$ yields
$$\left.\frac{\partial}{\partial\varepsilon}\left[U(\widetilde{P}_{\varepsilon x})\right]\right|_{\varepsilon=0} = \frac{\int p_{\theta_0}^\gamma\dot{p}_{\theta_0}\, d\lambda}{\int p_{\theta_0}^{\gamma+1}\, d\lambda}\cdot\mathrm{IF}(x; T, P_{\theta_0}) - \frac{1}{\gamma}\cdot\frac{1}{\int p_{\theta_0}^{\gamma+1}\, d\lambda}\left(-\int p_{\theta_0}^{\gamma+1}\, d\lambda + \gamma\int p_{\theta_0}^\gamma\dot{p}_{\theta_0}\, d\lambda\cdot\mathrm{IF}(x; T, P_{\theta_0}) + p_{\theta_0}^\gamma(x)\right) = \frac{1}{\gamma}\left(1 - \frac{p_{\theta_0}^\gamma(x)}{\int p_{\theta_0}^{\gamma+1}\, d\lambda}\right).$$
Thus we obtain
$$\mathrm{IF}(x; U, P_{\theta_0}) = \frac{1}{\gamma}\left(1 - \frac{p_{\theta_0}^\gamma(x)}{\int p_{\theta_0}^{\gamma+1}\, d\lambda}\right).$$

References

  1. Mallows, C.L. Some comments on Cp. Technometrics 1973, 15, 661–675.
  2. Akaike, H. Information theory and an extension of the maximum likelihood principle. In Proceedings of the Second International Symposium on Information Theory; Petrov, B.N., Csáki, F., Eds.; Springer: Berlin/Heidelberg, Germany, 1973; pp. 267–281.
  3. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
  4. Konishi, S.; Kitagawa, G. Generalised information criteria in model selection. Biometrika 1996, 83, 875–890.
  5. Ronchetti, E. Robust model selection in regression. Stat. Probab. Lett. 1985, 3, 21–23.
  6. Ronchetti, E.; Staudte, R.G. A robust version of Mallows' Cp. J. Am. Stat. Assoc. 1994, 89, 550–559.
  7. Agostinelli, C. Robust model selection in regression via weighted likelihood estimating equations. Stat. Probab. Lett. 2002, 76, 1930–1934.
  8. Mattheou, K.; Lee, S.; Karagrigoriou, A. A model selection criterion based on the BHHJ measure of divergence. J. Stat. Plann. Inf. 2009, 139, 228–235.
  9. Mantalos, P.; Mattheou, K.; Karagrigoriou, A. An improved divergence information criterion for the determination of the order of an AR process. Commun. Stat.-Simul. Comput. 2010, 39, 865–879.
  10. Toma, A. Model selection criteria using divergences. Entropy 2014, 16, 2686–2698.
  11. Pardo, L. Statistical Inference Based on Divergence Measures; Chapman & Hall: London, UK, 2006.
  12. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; Chapman & Hall: London, UK, 2011.
  13. Jones, M.C.; Hjort, N.L.; Harris, I.R.; Basu, A. A comparison of related density-based minimum divergence estimators. Biometrika 2001, 88, 865–873.
  14. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
  15. Broniatowski, M.; Toma, A.; Vajda, I. Decomposable pseudodistances and applications in statistical estimation. J. Stat. Plan. Infer. 2012, 142, 2574–2585.
  16. Toma, A.; Leoni-Aubin, S. Optimal robust M-estimators using Renyi pseudodistances. J. Multivar. Anal. 2013, 115, 359–373.
  17. Toma, A.; Leoni-Aubin, S. Robust portfolio optimization using pseudodistances. PLoS ONE 2015, 10, e0140546.
  18. Toma, A.; Fulga, C. Robust estimation for the single index model using pseudodistances. Entropy 2018, 20, 374.
  19. Hampel, F.R.; Ronchetti, E.; Rousseeuw, P.J.; Stahel, W. Robust Statistics: The Approach Based on Influence Functions; Wiley: Hoboken, NJ, USA, 1986.
  20. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559.
  21. Karagrigoriou, A. Asymptotic efficiency of the order selection of a nongaussian AR process. Stat. Sin. 1997, 7, 407–423.
  22. Vonta, F.; Karagrigoriou, A. Generalized measures of divergence in survival analysis and reliability. J. Appl. Prob. 2010, 47, 216–234.
  23. Karagrigoriou, A.; Mattheou, K.; Vonta, F. On asymptotic properties of AIC variants with applications. Open J. Stat. 2011, 1, 105–109.
  24. Shibata, R. Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 1976, 63, 117–126.
  25. Speed, T.P.; Yu, B. Model selection and prediction: Normal regression. Ann. Inst. Stat. Math. 1993, 45, 35–54.
  26. Draper, N.R.; Smith, H. Applied Regression Analysis, 2nd ed.; Wiley: Hoboken, NJ, USA, 1981.
  27. Burnham, K.P.; Anderson, D.R. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach; Springer: Berlin/Heidelberg, Germany, 2002.
  28. Hjorth, J.S.U. Computer Intensive Statistical Methods: Validation, Model Selection and Bootstrap; Chapman and Hall: London, UK, 1994.
Figure 1. Influence functions in the case of the normal model.
Table 1. Proportions (in %) of selected models by the considered criteria (n = 20, d1 = 0.8). For each criterion, the first row gives the proportion of selections of the true model (X1, X2); the second row gives the combined proportion for the overfitted models (X1, X2, X3), (X1, X2, X4) and (X1, X2, X3, X4).

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         90      84      88      84      92      90      86
           larger models (combined)       10      16      12      16       8      10      14
AIC        X1, X2                         62      56      52      56      66      56      60
           larger models (combined)       38      44      48      44      34      44      40
BIC        X1, X2                         74      76      60      74      72      68      70
           larger models (combined)       26      24      40      26      28      32      30
MDIC       X1, X2                         86      86      64      78      84      80      74
           larger models (combined)       14      14      36      22      16      20      26
Table 2. Proportions (in %) of selected models by the considered criteria (n = 20, d1 = 0.9); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         80      84      90      82      82      80      80
           larger models (combined)       20      16      10      18      18      20      20
AIC        X1, X2                         60      52      56      62      64      54      52
           larger models (combined)       40      48      44      38      36      46      48
BIC        X1, X2                         76      70      78      72      84      76      76
           larger models (combined)       24      30      22      28      16      24      24
MDIC       X1, X2                         86      76      88      74      92      78      86
           larger models (combined)       14      24      22      26       8      22      14
Table 3. Proportions (in %) of selected models by the considered criteria (n = 20, d1 = 0.95); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         82      88      80      94      82      88      86
           larger models (combined)        8      12      20       6      18      12      14
AIC        X1, X2                         78      50      66      70      66      64      66
           larger models (combined)       22      50      34      30      34      36      34
BIC        X1, X2                         84      64      74      84      84      76      82
           larger models (combined)       16      36      26      16      16      24      18
MDIC       X1, X2                         90      74      82      88      88      80      88
           larger models (combined)       10      26      18      12      12      20      12
Table 4. Proportions (in %) of selected models by the considered criteria (n = 20, d1 = 1); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         86      86      86      86      88      82      92
           larger models (combined)       14      14      14      14      12      18       8
AIC        X1, X2                         64      74      62      58      64      62      70
           larger models (combined)       36      26      38      42      36      38      30
BIC        X1, X2                         78      90      78      80      82      80      74
           larger models (combined)       22      10      22      20      18      20      26
MDIC       X1, X2                         84      92      88      88      88      88      80
           larger models (combined)       16       8      12      12      12      12      20
Table 5. Proportions (in %) of selected models by the considered criteria (n = 50, d1 = 0.8); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         86      96      94      90      88      86      90
           larger models (combined)       14       4       6      10      12      14      10
AIC        X1, X2                         74      64      82      62      64      78      72
           larger models (combined)       26      36      18      38      36      22      28
BIC        X1, X2                         94      86      96      86      90      88      90
           larger models (combined)        6      14       4      14      10      12      10
MDIC       X1, X2                         94      82      98      82      86      88      90
           larger models (combined)        6      18       2      18      14      12      10
Table 6. Proportions (in %) of selected models by the considered criteria (n = 50, d1 = 0.9); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         92      88      92      90      82      94      86
           larger models (combined)        8      12       8      10      18       6      14
AIC        X1, X2                         70      64      62      64      66      74      72
           larger models (combined)       30      36      38      36      34      26      28
BIC        X1, X2                         92      88      82      92      88      88      86
           larger models (combined)        8      12      18       8      12      12      14
MDIC       X1, X2                         92      86      76      88      84      88      86
           larger models (combined)        8      14      24      12      16      12      14
Table 7. Proportions (in %) of selected models by the considered criteria (n = 50, d1 = 0.95); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         94      92      92      88      84      90      88
           larger models (combined)        6       8       8      12      16      10      12
AIC        X1, X2                         70      62      66      68      70      72      58
           larger models (combined)       30      38      34      32      30      28      42
BIC        X1, X2                         96      82      92      86      92      92      86
           larger models (combined)        4      18       8      14       8       8      14
MDIC       X1, X2                         90      78      88      86      86      90      82
           larger models (combined)       10      22      12      14      14      10      18
Table 8. Proportions (in %) of selected models by the considered criteria (n = 50, d1 = 1); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         94      90      80      84      90      94      88
           larger models (combined)        6      10      20      16      10       6      12
AIC        X1, X2                         64      68      62      68      66      64      62
           larger models (combined)       34      32      38      32      34      36      38
BIC        X1, X2                         86      86      86      90      86      94      82
           larger models (combined)       14      14      14      10      14       6      18
MDIC       X1, X2                         84      84      82      88      84      90      82
           larger models (combined)       16      16      18      12      16      10      18
Table 9. Proportions (in %) of selected models by the considered criteria (n = 100, d1 = 0.8); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         94      94      94      92      88      88      94
           larger models (combined)        6       6       6       8      12      12       6
AIC        X1, X2                         70      82      78      70      68      68      72
           larger models (combined)       30      18      22      30      32      32      28
BIC        X1, X2                         90      96      98      90      96      94      88
           larger models (combined)       10       4       2      10       4       6      12
MDIC       X1, X2                         86      96      92      86      92      90      88
           larger models (combined)       14       4       8      14       8      10      12
Table 10. Proportions (in %) of selected models by the considered criteria (n = 100, d1 = 0.9); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         88      92      96      88      88      88      86
           larger models (combined)       12       8       4      12      12      12      14
AIC        X1, X2                         68      72      78      66      70      78      60
           larger models (combined)       32      28      22      34      30      22      40
BIC        X1, X2                         98      98      96      88      92      94      92
           larger models (combined)        2       2       4      12       8       6       8
MDIC       X1, X2                         90      90      96      84      82      90      82
           larger models (combined)       10      10       4      16      18      10      18
Table 11. Proportions (in %) of selected models by the considered criteria (n = 100, d1 = 0.95); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         90      88      92      90      98      96      92
           larger models (combined)       10      12       8      10       2       4       8
AIC        X1, X2                         70      78      78      66      82      68      68
           larger models (combined)       30      22      22      34      18      32      32
BIC        X1, X2                         96      92      92      94      96      94      88
           larger models (combined)        4       8       8       6       4       6      12
MDIC       X1, X2                         90      88      82      90      94      84      88
           larger models (combined)       10      12      18      10       6      16      12
Table 12. Proportions (in %) of selected models by the considered criteria (n = 100, d1 = 1); layout as in Table 1.

Criterion  Model                      γ=0.01  γ=0.05  γ=0.10  γ=0.15  γ=0.20  γ=0.25  γ=0.30
PIC        X1, X2                         94      96      92      92      96      90      94
           larger models (combined)        6       4       8       8       4      10       6
AIC        X1, X2                         78      74      72      74      70      62      74
           larger models (combined)       22      26      28      26      30      38      26
BIC        X1, X2                         96     100      92      96      94      90     100
           larger models (combined)        4       0       8       4       6      10       0
MDIC       X1, X2                         94      92      86      90      86      80      94
           larger models (combined)        6       8      14      10      14      20       6
Table 13. Hald cement data.

 X1   X2   X3   X4      Y
  7   26    6   60   78.5
  1   29   15   52   74.3
 11   56    8   20  104.3
 11   31    8   47   87.6
  7   52    6   33   95.9
 11   55    9   22  109.2
  3   71   17    6  102.7
  1   31   22   44   72.5
  2   54   18   22   93.1
 21   47    4   26  115.9
  1   40   23   34   83.8
 11   66    9   12  113.3
 10   68    8   12  109.4
Table 14. Models selected by the model selection criteria.

Criterion        Selected variables
PIC, γ = 0.05    X1, X2, X4
PIC, γ = 0.15    X1, X2, X4
PIC, γ = 0.2     X1, X2, X3
PIC, γ = 0.25    X1, X2, X4
PIC, γ = 0.3     X1, X2, X4
AIC              X1, X2, X4
BIC              X1, X2
MDIC             X1, X2, X3
