Article

Model Selection with Missing Data Embedded in Missing-at-Random Data

Keiji Takai 1 and Kenichi Hayashi 2
1 Faculty of Commerce, Kansai University, Yamatecho 3-3-35, Osaka 564-8680, Japan
2 Department of Mathematics, Keio University, Hiyoshi 3-14-1, Kohokuku, Yokohama 223-0061, Japan
* Author to whom correspondence should be addressed.
Stats 2023, 6(2), 495-505; https://doi.org/10.3390/stats6020031
Submission received: 8 March 2023 / Revised: 31 March 2023 / Accepted: 7 April 2023 / Published: 11 April 2023
(This article belongs to the Special Issue Advances in Probability Theory and Statistics)

Abstract

When models are built with missing data, an information criterion is needed to select the best model among the various candidates. Using a conventional information criterion for missing data may lead to the selection of the wrong model when data are not missing at random. Conventional information criteria implicitly assume that any subset of missing-at-random data is also missing at random, and thus the maximum likelihood estimator is assumed to be consistent; that is, it is assumed that the estimator will converge to the true value. However, this assumption may not be practical. In this paper, we develop an information criterion that works even for not-missing-at-random data, so long as the largest missing data set is missing at random. Simulations are performed to show the superiority of the proposed information criterion over conventional criteria.

1. Introduction

In data analysis, not all variables are equally important in the final model; some are more important than others. Thus, what matters in practice is model selection. For model selection, several variants of the Kullback-Leibler (KL) information criteria have been developed. In the case of complete data, these criteria include Akaike’s information criterion (AIC; [1]) and Takeuchi’s information criterion (TIC; [2]). They all measure the “distance” from the true distribution to the model distribution in the sense of KL information and select the model that minimizes this information.
Applying these information criteria to real data is not straightforward, as nearly all real data have missing values. In practice, missing values are an obstacle to applying statistical methods, since the methods implicitly assume the availability of complete data. To handle such missing values with information criteria, modifications have been proposed by authors such as [3,4,5,6,7]. The first modification of the information criterion is attributed to [3]. The paper [4] derived the information criterion for missing data directly, and the paper [5] used the symmetric KL divergence for the derivation. The paper [6] used the Hermite expansion with the Markov chain Monte Carlo method to approximate the H-function of the EM algorithm, and hence requires substantial computation in application. The paper [7] also used the EM algorithm to develop an information criterion for settings with missing covariates.
However, despite these modifications, the existing information criteria still face a major problem in practical application. The common problem is their implicit or explicit assumption that any set of data selected by them produces consistent parameter estimates in the model. This assumption is simply not realistic. In reality, these criteria can exclude a variable that causes missingness, resulting in inconsistent parameter estimates, because the missing data are then not missing at random (NMAR; [8]). For the parameter estimates to be consistent, the missing-data mechanism must be modeled [9]. However, this is practically difficult, since the data necessary for the modeling are missing.
To overcome the problems described above, we propose a new information criterion for missing data. The primary advantage of our information criterion is that it can handle situations where a subset of the data is NMAR, so long as the largest set of missing data is missing at random (MAR). In computing our criterion, we estimate the parameters with the largest data set and extract the parameters necessary for the model under consideration. These two steps allow us to obtain an estimator that is consistent for the true value under any candidate model. Since the existing criteria have no such feature, they are unable to produce a proper value of the information criterion when the data are NMAR.
The remainder of the paper is organized as follows. In the next section, we develop the asymptotic theory for a subset of MAR data. In Section 3, we derive our information criterion for a subset of MAR data. In Section 4, the results of two simulations are presented to demonstrate the effectiveness of the proposed information criterion in practical situations. We conclude the paper with suggestions for future research.

2. Inference of Maximum Likelihood Estimator with Missing Data

We represent the variables of the entire data set as a vector $x$ and partition it as $(y, z)$, where $y$ contains the variables of interest and $z$ the remaining variables. The model density functions for these variables are all written as $f(\cdot)$; they are distinguished by context. If the values of $x$ were observed completely, we would denote the data in matrix form by $X_c$. However, since $x$ contains missing values, $X_c$ consists of an observed part, $X_o$, and a missing part, $X_m$. For $Y_c$, the matrices $Y_o$ and $Y_m$ are defined in the same way. We denote the parameters of the distributions of $x$ and $y$ by $\theta$ and $\alpha$, respectively, and write $\theta = (\alpha', \beta')'$, where $\beta$ denotes the remaining part of $\theta$. The relationship between $\theta$ and $\alpha$ is assumed to be $\alpha = S\theta$, where $S$ is a matrix that extracts the necessary part of the parameter $\theta$. Note that although we could theoretically allow $\alpha = g(\theta)$ for a nonlinear function $g$, in this paper we confine our focus to the linear relationship $\alpha = S\theta$ for simplicity.
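For concreteness, a minimal example of the selection matrix (our own illustration): if $\theta$ stacks $\alpha \in \mathbb{R}^{p}$ on top of $\beta \in \mathbb{R}^{q}$, then

$$S = \begin{pmatrix} I_{p} & 0_{p \times q} \end{pmatrix}, \qquad S\theta = S\begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \alpha.$$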
In real data analysis, we often encounter a situation in which a subset of data is embedded in MAR data. With the MAR mechanism, the missingness of data is caused by observed values for variables included in the analysis; on the other hand, for NMAR data (i.e., data that are not MAR), the missingness is caused by variables not included in the analysis and/or by missing values of the included variables [8]. It follows that using more variables in the analysis makes it more likely that the missing data will be MAR (or close to MAR). One reason for this is that variables added into the analysis are likely to include those variables causing missingness. Another reason is that more variables are likely to include some variables related to missing values, recovering information lost by the missingness. These phenomena have been partly confirmed in simulations by [10,11,12]. Thus, MAR is a more plausible missing-data mechanism for data with a larger set of variables.
An example from [9] illustrates a subset of missing data that is NMAR while the entire data set is MAR. A survey is conducted with the aim of estimating the correlation between self-esteem and sexual behavior among teenagers. However, the question on sexual behavior is so sensitive that it is asked only of respondents over 15 years of age. The resulting data are MAR, since the sexual-behavior variable is missing only according to the observed age. As described before, our interest is the relationship between self-esteem and sexual behavior among teenagers. If the variables used for the analysis are limited to self-esteem and sexual behavior, the data are NMAR, because the data set excludes age, which causes the missingness, and thus the correlation estimate is biased. On the other hand, if age is included as an additional variable, the data become MAR, and likelihood estimation using the data on all three variables yields a consistent maximum likelihood estimator of the correlation coefficient between self-esteem and sexual behavior.
For missing data, an exact distribution, such as the t-distribution in the complete-data case, is rarely available except in special cases such as monotone missing data [13,14]. Hence, asymptotic theory is preferable for deriving the distribution of an estimator with missing data. In asymptotic theory, the most important properties of an estimator are consistency for the true value and asymptotic normality. Of these two properties, the former is more important in practice. Consistency means that the estimated parameter converges to the true value of the assumed model and is thus close to the true value in a large sample. For an analyst, this is a highly desirable property. In fact, an estimator without consistency is essentially useless to the analyst as a means of approximating the parameter that he/she wishes to estimate. However, even if an estimator lacks consistency, asymptotic normality may still hold [15]. Notably, it has been shown that the maximum likelihood (ML) estimator under MAR is a good estimator in this sense, since it has both consistency and asymptotic normality [16].
We can now provide more detail on the estimation method when a subset of MAR data is used to produce the ML estimator. It has been shown that the ML estimator based on the observed data has consistency [16]. Expressing this in the form of an equation, we have

$$\hat{\theta} = \arg\max_{\theta} \frac{1}{n}\log f(X_o \mid \theta) \xrightarrow{\text{a.s.}} \theta_0 \stackrel{\mathrm{def}}{=} \arg\max_{\theta} E\left[\log f(x \mid \theta)\right],$$

where $n$ is the sample size, and the arrow indicates that the left-hand side converges almost surely to the right-hand side as $n$ increases to infinity. What matters here is that the ML estimator is based on the missing data $X_o$, and it is a function of the data and the corresponding missing-data indicators. This ML estimator converges to the value that maximizes $E[\log f(x \mid \theta)]$ (where the expectation is taken with respect to $x$), the same convergence point as that of the maximizer of $n^{-1}\log f(X_c \mid \theta)$, which is not actually available in the presence of missing data. The asymptotic distribution of $\sqrt{n}(\hat{\theta} - \theta_0)$ is normal, with mean zero and variance $V(\theta) = \left(E\left[-\nabla^2 \log f(X_o \mid \theta)\right]\right)^{-1}$, where $\nabla$ is the first-derivative operator with respect to the parameter $\theta$. On the other hand, the ML estimator
$$\tilde{\alpha} = \arg\max_{\alpha} \frac{1}{n}\log f(Y_o \mid \alpha)$$

is generally not guaranteed to converge to the true value $\alpha_0$ in $\theta_0$, since the subset of data corresponding to $\alpha$ might be NMAR. Hence, to estimate $\alpha$, we construct an estimator by extracting the part corresponding to $\alpha$ from the ML estimator $\hat{\theta}$; that is, we set $\hat{\alpha} = S\hat{\theta}$.
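As a minimal sketch of this two-step estimation (our own illustration, not the authors' code), consider a bivariate normal pair $(y, z)$ in which $z$ is always observed and $y$ is MAR given $z$; the full parameter $\theta = (\mu_y, \mu_z, \sigma_y, \sigma_z, \rho)$ is estimated from the observed-data likelihood, and the sub-parameter of interest is then extracted as $\hat{\alpha} = S\hat{\theta}$:

```python
# Sketch of the two-step estimation in Section 2 (illustrative assumptions
# throughout: bivariate normal model, MAR missingness in y driven by z).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
y = 0.7 * z + np.sqrt(1.0 - 0.7**2) * rng.normal(size=n)
# y is always observed when z < 0, observed with probability 0.3 otherwise (MAR given z)
obs = rng.random(n) < np.where(z < 0.0, 1.0, 0.3)

def neg_obs_loglik(theta):
    mu_y, mu_z, s_y, s_z, rho = theta
    cov = np.array([[s_y**2, rho * s_y * s_z],
                    [rho * s_y * s_z, s_z**2]])
    # complete cases contribute the joint density; incomplete cases the z-margin only
    ll = multivariate_normal.logpdf(np.column_stack([y[obs], z[obs]]),
                                    mean=[mu_y, mu_z], cov=cov).sum()
    ll += norm.logpdf(z[~obs], loc=mu_z, scale=s_z).sum()
    return -ll

theta_hat = minimize(neg_obs_loglik, x0=[0.0, 0.0, 1.0, 1.0, 0.0],
                     bounds=[(None, None), (None, None),
                             (1e-3, None), (1e-3, None), (-0.99, 0.99)]).x
# alpha = (mu_y, sigma_y): extract it from theta_hat (alpha^ = S theta^) rather than
# re-estimating it from the observed y-values alone, which would be NMAR-biased.
alpha_hat = theta_hat[[0, 2]]
```

Estimating $(\mu_y, \sigma_y)$ from the observed $y$-values alone would be biased here, because the missingness of $y$ depends on $z$, which such an analysis excludes.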

3. Information Criterion for Missing Data

In this section, we propose an information criterion (IC) for missing data that are MAR as a whole but might be NMAR as a subset. In selecting a model among multiple candidates, the standard IC is based on the discrepancy from the true distribution $g(y)$ to the model $f(y \mid \alpha)$. This is measured by the KL information, which is defined as

$$E_{g(y)}\left[\log\frac{f(y \mid \alpha)}{g(y)}\right] = E_{g(y)}\left[\log f(y \mid \alpha)\right] - E_{g(y)}\left[\log g(y)\right].$$
Since the second term is constant, the first term determines the KL information, and it thus works as a measure of a model’s goodness. When we estimate the parameter $\alpha$ by the ML method and obtain its estimate $\tilde{\alpha}$, a naive measure of goodness of fit is $E_{g(y)}[\log f(y \mid \tilde{\alpha})]$. The problem is that the true distribution $g$ is unknown. If the complete data were available and were a random sample, we could use $n^{-1}\sum_{j=1}^{n}\log f(y_j \mid \tilde{\alpha})$ as an estimate, where $y_j$ $(j = 1, \ldots, n)$ are the data used to estimate $\tilde{\alpha}$. However, the problem with this empirical likelihood is that the estimate depends on the data, and hence $n^{-1}\sum_{j=1}^{n}\log f(y_j \mid \tilde{\alpha})$ is an asymptotically biased estimator of $E_{g(y)}[\log f(y \mid \tilde{\alpha})]$. The likelihood with this bias corrected is known as the IC. Some ICs have special names, such as AIC [1], TIC [2] and GIC [17]. In the correction and derivation of an IC, asymptotic properties play a central role. Of the various desirable properties of an ML estimator, the most important are asymptotic normality and consistency for the true value. See also Konishi and Kitagawa [17] for the technical details.
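As a small numerical illustration of this optimism (our own example, not from the paper): for a normal model fit by ML, the in-sample measure $n^{-1}\sum_j \log f(y_j \mid \tilde{\alpha})$ exceeds $E_{g(y)}[\log f(y \mid \tilde{\alpha})]$ on average, and $n$ times the gap is close to the number of parameters $p$, which is exactly the quantity the AIC corrects for:

```python
# Monte Carlo check that the empirical log-likelihood is optimistically biased.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 5000
gaps = []
for _ in range(reps):
    y = rng.normal(size=n)
    mu, var = y.mean(), y.var()          # ML estimates of the p = 2 parameters

    def mean_loglik(u):
        return np.mean(-0.5 * np.log(2.0 * np.pi * var) - (u - mu) ** 2 / (2.0 * var))

    y_new = rng.normal(size=100_000)     # fresh draws approximate E_g[log f(y | alpha~)]
    gaps.append(mean_loglik(y) - mean_loglik(y_new))

print(np.mean(gaps) * n)                 # approximately p = 2 for moderate n
```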
For missing data, an IC can be computed in a similar spirit. However, considerable caution must be exercised in constructing the criterion, as different subsets of the missing data may have different types of missing-data mechanisms. Even if a subset of the missing data, call it $Y_o$, is MAR, which would mean that a naive ML estimator based on $\log f(Y_o \mid \alpha)$ would have consistency, a smaller subset of $Y_o$ might be NMAR because the variables causing missingness are excluded from the model under consideration, in which case the ML estimator would not have consistency. This means that a straightforward extension of the conventional IC could not be effectively applied to such missing-data cases.
To derive the proposed IC for cases with missing data, we assume that the data as a whole, represented as $X_o$, are MAR. As shown in [16], the ML estimator based on $X_o$ has consistency and asymptotic normality. However, a subset of the MAR data $X_o$, call it $Y_o$, might be NMAR, and thus the ML estimator based on $Y_o$ would be asymptotically biased. To ensure that the ML estimator is consistent (i.e., converges to the true value), it is necessary to perform ML estimation with the entire data set, even when we are interested in a model for a subset $Y_o$ of $X_o$. Let the ML estimator based on the entire data set $X_o$ be $\hat{\theta}$, and let the part of $\theta$ associated with the model of interest be $\alpha$. In the linear case considered here, there is a selection matrix $S$ such that $S\theta = \alpha$, and the ML estimate of $\alpha$ can be written as $\hat{\alpha} = S\hat{\theta}$. In this setting, we can evaluate the bias.
We use $\log f(Y_o \mid \alpha)$ to denote the log-likelihood function. Under the regularity conditions and the MAR assumption, the ML estimator is proved to have consistency and asymptotic normality [16]. We can now evaluate the bias of $\log f(Y_o \mid \hat{\alpha})$, which is given as

$$b(G) = E\left[\log f(Y_o \mid \hat{\alpha})\right] - nE_{g(y)}\left[\log f(y \mid \hat{\alpha})\right].$$
Decomposing the bias, we have

$$\begin{aligned} b(G) &= E\left[\log f(Y_o \mid \hat{\alpha}) - \log f(Y_o \mid \alpha_0)\right] + E\left[\log f(Y_o \mid \alpha_0) - nE_{g(y)}\left[\log f(y \mid \alpha_0)\right]\right] \\ &\quad + E\left[nE_{g(y)}\left[\log f(y \mid \alpha_0)\right] - nE_{g(y)}\left[\log f(y \mid \hat{\alpha})\right]\right]. \end{aligned}$$

Let the three terms be, in order, $D_1$, $D_2$, and $D_3$.
In this paper, we have assumed that $X_o$ is MAR and that $\hat{\theta}$ is asymptotically normally distributed. First, we evaluate $D_1$. Here, $\nabla$ and $\nabla_\alpha$ denote the first-derivative operators with respect to $\theta$ and to $\alpha$, respectively. Assume that the ML estimator $\hat{\theta}$ takes the form

$$\sqrt{n}(\hat{\theta} - \theta_0) = V_{X_o}(\theta_0)\frac{1}{\sqrt{n}}\nabla\log f(X_o \mid \theta_0) + o_p(1),$$

where $\theta_0$ is the true value of the parameter, and $V_{X_o}(\theta_0)$ is the variance defined as $V_{X_o}(\theta_0) = E\left[\nabla\log f(X_o \mid \theta_0)\,\nabla\log f(X_o \mid \theta_0)'\right]^{-1}$. In addition, assume that $\sqrt{n}(\hat{\theta} - \theta_0)$ is asymptotically distributed as normal with mean zero and variance $V_{X_o}(\theta_0)$. Using $\sqrt{n}(\hat{\alpha} - \alpha_0) = S\sqrt{n}(\hat{\theta} - \theta_0)$ and $n^{-1}\nabla_\alpha^2\log f(Y_o \mid \alpha_0) \approx -V_{Y_o}(\theta_0)^{-1}$, we obtain
$$\begin{aligned} \log f(Y_o \mid \hat{\alpha}) - \log f(Y_o \mid \alpha_0) &= \nabla_\alpha \log f(Y_o \mid \alpha_0)'(\hat{\alpha} - \alpha_0) + \frac{1}{2}(\hat{\alpha} - \alpha_0)'\nabla_\alpha^2 \log f(Y_o \mid \alpha_0)(\hat{\alpha} - \alpha_0) + o_p(1) \\ &= \frac{1}{\sqrt{n}}\nabla_\alpha \log f(Y_o \mid \alpha_0)'\, S\, V_{X_o}(\theta_0)\frac{1}{\sqrt{n}}\nabla\log f(X_o \mid \theta_0) \\ &\quad - \frac{1}{2}\sqrt{n}(\hat{\theta} - \theta_0)'\, S'\, V_{Y_o}(\theta_0)^{-1}\, S\sqrt{n}(\hat{\theta} - \theta_0) + o_p(1). \end{aligned} \tag{4}$$
The expectation of the first term in Equation (4) can be written as

$$\operatorname{tr}\left\{S V_{X_o}(\theta_0)\, E\left[\frac{1}{\sqrt{n}}\nabla\log f(Y_o \mid \alpha_0)\,\frac{1}{\sqrt{n}}\nabla\log f(Z_o \mid Y_o; \beta_0)'\right]S' + S V_{X_o}(\theta_0) S'\left(V_{Y_o}(\theta_0)\right)^{-1}\right\},$$
where $\operatorname{tr}\{\cdot\}$ denotes the trace of the matrix in braces. Because $E_{g(Z_o \mid Y_o)}\left[\nabla\log f(Z_o \mid Y_o; \beta_0) \mid Y_o\right] = 0$, the first term in this trace is zero. Therefore, the expected first term is $\operatorname{tr}\left\{S V_{X_o}(\theta_0) S'\left(V_{Y_o}(\theta_0)\right)^{-1}\right\}$. Next, we calculate the expectation of the second term in Equation (4) as
$$-\frac{1}{2}\operatorname{tr}\left\{E\left[\sqrt{n}(\hat{\theta} - \theta_0)\sqrt{n}(\hat{\theta} - \theta_0)'\right]S'\, V_{Y_o}(\theta_0)^{-1}\, S\right\} = -\frac{1}{2}\operatorname{tr}\left\{S V_{X_o}(\theta_0) S'\, V_{Y_o}(\theta_0)^{-1}\right\}.$$

In summary, $D_1 = \frac{1}{2}\operatorname{tr}\left\{S V_{X_o}(\theta_0) S'\, V_{Y_o}(\theta_0)^{-1}\right\} + o_p(1)$.
Second, we evaluate $D_2$:

$$D_2 = E\left[\log f(Y_o \mid \alpha_0)\right] - nE_{g(y)}\left[\log f(y \mid \alpha_0)\right] = E\left[\log f(Y_c \mid \alpha_0) - \log f(Y_m \mid Y_o; \alpha_0)\right] - nE_{g(y)}\left[\log f(y \mid \alpha_0)\right] = -E\left[E\left[\log f(Y_m \mid Y_o; \alpha_0) \mid Y_o\right]\right] = -E\left[H(\alpha_0 \mid \alpha_0)\right],$$

where $H(\alpha_0 \mid \alpha_0)$ is the H-function of the EM algorithm based on $\log f(Y_o \mid \alpha)$, evaluated at $\alpha = \alpha_0$.
Finally, we evaluate $D_3$. Define $\eta(\alpha) = E_{g(y)}\left[\log f(y \mid \alpha)\right]$. Since $\nabla_\alpha\eta(\alpha_0) = 0$,

$$\eta(\hat{\alpha}) - \eta(\alpha_0) = -\frac{1}{2}(\hat{\alpha} - \alpha_0)'V_{Y_c}(\theta_0)^{-1}(\hat{\alpha} - \alpha_0) + o_p(1) = -\frac{1}{2}(\hat{\theta} - \theta_0)'S'V_{Y_c}(\theta_0)^{-1}S(\hat{\theta} - \theta_0) + o_p(1),$$

where $V_{Y_c}(\theta_0) = E\left[-\nabla_\alpha^2 \log f(Y_c \mid \alpha_0)\right]^{-1}$. Taking the expectation of both sides, we have

$$D_3 = \frac{1}{2}\operatorname{tr}\left\{S\,E\left[\sqrt{n}(\hat{\theta} - \theta_0)\sqrt{n}(\hat{\theta} - \theta_0)'\right]S'\,V_{Y_c}(\theta_0)^{-1}\right\} = \frac{1}{2}\operatorname{tr}\left\{S V_{X_o}(\theta_0) S'\, V_{Y_c}(\theta_0)^{-1}\right\}.$$
Therefore, ignoring the $o_p(1)$ terms, we obtain an unbiased estimator of $nE_{g(y)}\left[\log f(y \mid \hat{\alpha})\right]$ as

$$\log f(Y_o \mid \hat{\alpha}) + E\left[H(\alpha_0 \mid \alpha_0)\right] - \frac{1}{2}\operatorname{tr}\left\{S V_{X_o}(\theta_0) S'\left(V_{Y_o}(\theta_0)^{-1} + V_{Y_c}(\theta_0)^{-1}\right)\right\}. \tag{5}$$

This quantity cannot be computed as it stands, since unknown quantities such as $\alpha_0$ and $\theta_0$ are involved.
We can, however, construct an estimated version of the criterion in Equation (5). By replacing the unknown parameter with the consistent ML estimator and dropping the expectation from the second term in Equation (5), we can substitute the first two terms with

$$\log f(Y_o \mid \hat{\alpha}) + H(\hat{\alpha} \mid \hat{\alpha}) = Q_Y(\hat{\alpha} \mid \hat{\alpha}).$$

This is a natural estimator of the first two terms in Equation (5). The penalty term is estimated by replacing $\theta_0$ with its ML estimator, which is consistent under the assumption that the data as a whole are MAR. As is conventional in deriving the AIC, we multiply the bias-corrected quantity by $-2$, obtaining

$$\mathrm{IC} = -2Q_Y(\hat{\alpha} \mid \hat{\alpha}) + \operatorname{tr}\left\{S\hat{V}_{X_o}(\hat{\theta})S'\left(\hat{V}_{Y_o}(\hat{\theta})^{-1} + \hat{V}_{Y_c}(\hat{\theta})^{-1}\right)\right\}, \tag{6}$$
where $Q_Y(\hat{\alpha} \mid \hat{\alpha})$ is the Q-function of the EM algorithm based on $\log f(Y_c \mid \alpha)$, evaluated at $\alpha = \hat{\alpha}$, and $\hat{V}_\bullet$ is the sample counterpart of $V_\bullet$ for each subscript. Under the regularity conditions assumed in [16], the estimators of the variance-covariance matrices in Equation (6) converge to their population versions (e.g., $\hat{V}_{Y_o}(\hat{\theta})$ converges to $V_{Y_o}(\theta_0)$ in probability as $n \to \infty$).
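In code, assembling Equation (6) is mechanical once an EM routine has produced $Q_Y(\hat{\alpha} \mid \hat{\alpha})$ and the plug-in variance matrices. A schematic numpy helper reflecting our reading of Equation (6) (a sketch, not reference code) is:

```python
import numpy as np

def proposed_ic(q_hat, S, v_xo_hat, v_yo_hat, v_yc_hat):
    """IC = -2 Q_Y(a^|a^) + tr{ S V^_Xo S' ( V^_Yo^{-1} + V^_Yc^{-1} ) }.

    q_hat: Q-function value at the ML estimate; S: selection matrix;
    v_*_hat: plug-in estimates of the variance matrices in the text.
    """
    var_alpha = S @ v_xo_hat @ S.T  # estimated asymptotic variance of alpha^ = S theta^
    penalty = np.trace(var_alpha @ (np.linalg.inv(v_yo_hat) + np.linalg.inv(v_yc_hat)))
    return -2.0 * q_hat + penalty
```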
This IC is an extension of the AIC to missing data. It reduces to the original AIC [1] when $X_o = X_c = Y_o = Y_c$ and $\theta = \alpha$, that is, when there are no missing data. Similar to the original AIC, our IC shares the starting point $E[\log f(y \mid \hat{\theta})]$ with [3,4]. However, the first and second terms of our IC differ from those of [3]. Our IC shares its first term with those of [4] and [6], but not its second term. This difference arises from the way in which the bias is approximated: [4] approximated the bias under the setting in which all subsets are MAR, whereas [6] approximated the bias using the Hermite expansion and Markov chain Monte Carlo methods.

4. Simulations

To confirm the validity of our proposed IC, two simulations were performed.

4.1. Selection of the Parameter Structure

We will confirm through simulations that our IC selects the correct model when there are missing values for the variables $(y_1, y_2)$ in $x = (y_1, y_2, z)$. In so doing, we compare our IC with a slightly modified AIC, since the original AIC proposed by [1] cannot deal with missing data:

$$\mathrm{AIC} = -2\log f(Y_o \mid \hat{\alpha}) + 2p,$$

where $p$ is the dimension of the vector $\hat{\alpha}$. This modification has been used as a competitor to the information criterion proposed by [4]. As an additional competitor, we use $\mathrm{AIC}_{cd}$, which Cavanaugh and Shumway [4] developed for missing data:
$$\mathrm{AIC}_{cd} = -2Q(\hat{\alpha} \mid \hat{\alpha}) + 2\operatorname{tr}\left[\hat{V}_{Y_c}(\hat{\alpha})^{-1}\left\{-\frac{1}{2}\nabla_\alpha^2\log f(Y_o \mid \hat{\alpha})\right\}^{-1}\right].$$
Four candidate models were considered, all based on a trivariate normal distribution for the variables $x = (y_1, y_2, z)$. The models restrict the means of $(y_1, y_2)$ and/or the variances of $(y_1, y_2)$, or impose no restrictions at all. Model 1 restricts the two expectations to be equal and the two variances to be equal. Model 2 restricts only the expectations to be equal. Model 3 restricts only the two variances to be equal. Model 4 has no restrictions. In the simulations, one of the four models is the true data-generating model, and a model is selected according to each of the three information criteria: AIC, $\mathrm{AIC}_{cd}$, and our proposed IC.

With respect to model selection, it should be noted that all the criteria assume that the true model is included as a special case of each candidate model. Model 1, for example, is a special case of the other three candidate models. On the other hand, Model 2 is not a special case of Model 3, since Model 2 places no restrictions on the variances. Theoretically, therefore, Models 2, 3, and 4 cannot all be compared using an information criterion. For each true model, the candidate models that can or cannot be theoretically compared are shown in Table 1; a number in parentheses denotes a model that cannot be theoretically compared. Nevertheless, we have included these theoretically non-comparable models in the set of candidates, because such comparisons are often conducted in practice, as pointed out in [18].
The simulation was conducted using the following procedure. First, a complete data set of a given size was generated from a trivariate normal distribution $N(\mu, \Sigma)$ for the variables $(y_1, y_2, z)$, where $\mu = (\mu_1, \mu_2, \mu_3)'$ and $\Sigma = [\sigma_{ab}]_{a,b=1,2,3}$. The means and variances used for data generation are shown in Table 2, where the covariances are all set to 0.7. The model assumed to be true is varied.

Second, denoting the missing indicators for $(y_1, y_2)$ by $(r_1, r_2)$, we generated missing values according to the following two missing-data mechanisms. The first mechanism is, for $i = 1, 2$,
$$P(r_i = 1 \mid z) = \begin{cases} \psi_i & (z \geq 0), \\ 1 & (z < 0). \end{cases} \tag{7}$$

This is a case in which the missing-data mechanism is non-smooth. We set $\psi_1 = 0.3$ and $\psi_2 = 0.7$. The second missing-data mechanism draws the missing indicators from the Bernoulli distribution with the observation probability given, for $i = 1, 2$, by

$$P(r_i = 1 \mid z) = \omega_i\left\{\frac{1}{2}\left(\exp(z) + \exp(-z)\right)\right\}^{-1} = \frac{\omega_i}{\cosh(z)}. \tag{8}$$

This is a smooth missing-data mechanism. We set $\omega_1 = 0.3$ and $\omega_2 = 0.7$. With the first missing-data mechanism, the values of $(y_1, y_2)$ above the threshold $z = 0$ are observed only with probabilities $\psi_1$ and $\psi_2$, while those below $z = 0$ are always observed. With the second mechanism, the missingness of $(y_1, y_2)$ is determined stochastically through the inverse of the hyperbolic cosine of $z$. Notice that under both mechanisms the missing data for $(y_1, y_2, z)$ are MAR, whereas those for $(y_1, y_2)$ alone are NMAR, since $z$, which is not included in the models, causes the missingness of $y_1$ and $y_2$.
Third, for the missing data, the model that gives the minimum IC value is selected for each of the three ICs. The procedure is repeated 1000 times, and the number of times that each of the four models is selected is recorded.
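For reference, one replication of the data-generation step under the two mechanisms might look as follows (our own sketch using the values stated in the text; $r = 1$ means the value is observed):

```python
# Generate trivariate normal data and impose mechanisms (7) and (8).
import numpy as np

rng = np.random.default_rng(2)
n = 400
mu = np.zeros(3)                                   # Table 2, Model 1
cov = 0.7 * np.ones((3, 3)) + 0.3 * np.eye(3)      # unit variances, covariances 0.7
y1, y2, z = rng.multivariate_normal(mu, cov, size=n).T

psi = (0.3, 0.7)                                   # mechanism (7): non-smooth in z
p7 = [np.where(z < 0.0, 1.0, p) for p in psi]

omega = (0.3, 0.7)                                 # mechanism (8): smooth, omega / cosh(z)
p8 = [w / np.cosh(z) for w in omega]

r1, r2 = (rng.random(n) < p8[0]), (rng.random(n) < p8[1])
y1_obs = np.where(r1, y1, np.nan)                  # NaN marks a missing value
y2_obs = np.where(r2, y2, np.nan)
```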
The results for the first missing-data mechanism are summarized in Table 3. As can be seen from the table, our IC is generally superior to its two competitors. In the cases where Models 1 to 3 are true, the proposed IC outperforms the other criteria for every sample size. In the case where Model 4 is true, the proposed IC underperforms the other criteria when the sample size is small; however, as the sample size increases from 200 to 800, the proposed IC comes to select the correct model about as often as the other criteria. In addition to the above findings, the AIC is found to be more robust than $\mathrm{AIC}_{cd}$ against this non-smooth missing-data mechanism.
The results for the second missing-data mechanism are given in Table 4. Here, again, our IC is generally superior to its competitors. While all the information criteria become more likely to select the true model as the sample size increases, our IC is the most likely to do so. For any true model, our IC outperforms the other information criteria, except when Model 4 is true; even in that case, our IC is equivalent to the others. In sum, our IC nearly uniformly outperforms its major competitors in selecting the correct model.

4.2. Selection of Linear Regression Models

In this simulation, we demonstrate how our IC works for linear regression models. Since the derivation given above does not consider covariates, our IC does not directly apply to a regression model. However, it is precisely in the practical analysis of regression models that information criteria are needed to select among candidate models. Thus, we naively applied our IC in a simulation and monitored its performance.

In this simulation, we compared the performance of our proposed IC with that of the AIC and $\mathrm{AIC}_{cd}$ criteria. Four regression models were used in the comparison. From here on, for simplicity, we use more common notation. Let $y$ be the dependent variable and $x = (x_1, x_2)$ the independent variables. The regression model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon,$$

where $\epsilon$ has mean zero and a finite variance conditional on $(x_1, x_2)$. The four models considered are Model 5 with $\beta_1 = \beta_2 = 0$; Model 6 with $\beta_1$ nonzero and $\beta_2 = 0$; Model 7 with $\beta_1 = 0$ and $\beta_2$ nonzero; and Model 8 with both $\beta_1$ and $\beta_2$ nonzero. For example, the data-generating model of Model 6 with $(\beta_0, \beta_1, \beta_2) = (1, 1/2, 0)$ is that the joint distribution of $(y, x_1, x_2)$ is trivariate normal with mean $[0, 0, 0]'$ and variance
$$\begin{pmatrix} 1 & \frac{1}{2} & 0 \\ \frac{1}{2} & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
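As a consistency check (our own arithmetic, not in the original), the slopes implied by this joint normal agree with the stated coefficients: writing $\Sigma_{xx}$ for the covariance matrix of $(x_1, x_2)$ and $\sigma_{xy}$ for its covariance with $y$,

$$\begin{pmatrix}\beta_1 \\ \beta_2\end{pmatrix} = \Sigma_{xx}^{-1}\sigma_{xy} = \begin{pmatrix}1 & 0 \\ 0 & 1\end{pmatrix}^{-1}\begin{pmatrix}1/2 \\ 0\end{pmatrix} = \begin{pmatrix}1/2 \\ 0\end{pmatrix}, \qquad \operatorname{Var}(\epsilon) = 1 - \sigma_{xy}'\Sigma_{xx}^{-1}\sigma_{xy} = \frac{3}{4}.$$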
These four models do not necessarily have a nested/nesting relationship. Model 5 is a special case of Models 6 to 8, and Models 6 and 7 are each a special case of Model 8, but Models 6 and 7 are not nested in one another. Thus, theoretically speaking, not all four models can be compared with AIC-type information criteria. In practice, however, the AIC has commonly been used to compare non-nested models [18], and we compare the four models according to this convention.
The missingness of $y$ and $x_1$ depends only on $x_2$. To represent the missing-data mechanism more formally, we introduce missing-data indicators $(r_y, r_{x_1})$ for $(y, x_1)$, each of which takes the value 1 when the corresponding variable is observed and 0 when it is not. The missing-data mechanisms used are

$$P(r_y = 1 \mid x_2) = \omega_y\left\{\frac{1}{2}\left(\exp(x_2) + \exp(-x_2)\right)\right\}^{-1}, \qquad P(r_{x_1} = 1 \mid x_2) = \omega_{x_1}\left\{\frac{1}{2}\left(\exp(x_2) + \exp(-x_2)\right)\right\}^{-1},$$

with $\omega_y$ set at 0.7 and $\omega_{x_1}$ at 0.3.
The simulation was conducted according to the following procedure. First, a sample without missing values is created by drawing a random sample of size $n$ from the joint normal distribution of $(y, x_1, x_2)$ specified above. Second, missing values are created in the sample through the missing-data mechanisms for $(y, x_1)$ introduced above. Finally, the proposed IC and the competitor ICs are computed on this sample for Models 5 to 8, and each criterion selects the model attaining its minimum value. We repeated this procedure 1000 times, counting the number of times each model was selected. The counts indicate how well each IC selects the best model.
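A sketch of the first two steps of this procedure (our implementation of the stated design, not the authors' code) is:

```python
# One replication: draw (y, x1, x2) jointly normal, then impose MAR missingness
# on y and x1 through the always-observed x2, as in the mechanisms above.
import numpy as np

rng = np.random.default_rng(3)
n = 200
cov = np.array([[1.0, 0.5, 0.0],    # joint covariance of (y, x1, x2) for Model 6
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
y, x1, x2 = rng.multivariate_normal(np.zeros(3), cov, size=n).T

omega_y, omega_x1 = 0.7, 0.3
r_y = rng.random(n) < omega_y / np.cosh(x2)    # missingness depends only on x2,
r_x1 = rng.random(n) < omega_x1 / np.cosh(x2)  # so (y, x1, x2) as a whole is MAR
y = np.where(r_y, y, np.nan)
x1 = np.where(r_x1, x1, np.nan)
# Models 5-8 would now each be fit by EM on this incomplete sample, and the
# model minimizing each criterion recorded.
```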
The results of the simulation are shown in Table 5. Regardless of which model is the true one, our proposed IC works well. In the case where Model 5 is true, our proposed IC outperforms the two competitors for every sample size; here, $\mathrm{AIC}_{cd}$ works much better than AIC but falls behind our IC. In the case where either Model 6 or Model 7 is true, $\mathrm{AIC}_{cd}$ shows the best performance for every sample size, with our IC second best, while AIC selects the wrong model, Model 8, more than half the time. Interestingly, when Model 6 (respectively, Model 7) is true, none of the three ICs select Model 5, which is nested in the true model, or Model 7 (respectively, Model 6), which is not in a nesting relationship with it. In the case where Model 8 is true, AIC performs best. However, while AIC selects the true model more often than our proposed IC for a sample size of 200, our proposed IC is comparable for sample sizes above 200. $\mathrm{AIC}_{cd}$ always selects a wrong model in this case. In summary, our proposed IC performed best or, at worst, second best throughout the simulation, and even in the second-best cases its performance was very close to that of the higher-performing competitor.
Note that even for different values of the parameters in the missing-data mechanisms, we observed almost the same trend.

5. Conclusions

This paper considered the practical situation in which a set of missing data may not be MAR and developed a new information criterion to handle such a situation. This contrasts with previous ICs, which implicitly consider only the unrealistic case in which the data are MAR as a whole and subsets of the data are MAR as well. Our proposed IC uses the largest data set to estimate model parameters and circumvents the problem that the conventional ICs ignore. Using numerical simulations, it was shown that our new IC works better than, or at worst, equivalently to, its competitors.
The study of information criteria for missing data is far from complete, and the present approach requires refinement. Although we applied our IC to regression, this was done without a rigorous basis, as noted in the body of the paper; a solid foundation for the comparison of regression models needs to be developed. Furthermore, application of the proposed IC can be extended to cases in which the parameters of the model are not a linear function of the parameters for the entire data. The approach used to develop our IC need not be limited to regression models; it can be applied to other conventional statistical models, such as the generalized linear model and the exploratory factor model. Further refinements of our IC are also possible. Currently, our IC uses $H(\hat{\alpha} \mid \hat{\alpha})$ as the estimator of $E[H(\alpha_0 \mid \alpha_0)]$; implementing the MCMC-based estimator of the H-function presented in [6] in our IC may improve the performance of model selection.
Further, the framework of our proposed IC might be applied to data collected under a two-phase design, for which efficient estimation of regression parameters has been extensively developed [19,20]. In a two-phase design, the first-phase variables are observed for all subjects, while the second-phase variables are observed only for those subjects selected on the basis of the first-phase variables, so as to avoid high cost. Since the missingness of the second-phase variables is caused by the values of the first-phase variables, the entire data set is MAR. However, the variables of interest vary from analysis to analysis, and in some cases the subset of the data used in a model becomes NMAR; hence, variable selection becomes necessary. In such cases, our proposed IC might contribute to selecting variables. Developing our IC to deal with the two-phase design is left for future work.

Author Contributions

The authors carried out this work and drafted the manuscript collaboratively. In particular, K.T. focused on construction of the estimator of the proposed criteria. K.H. conducted all simulations. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by JSPS KAKENHI (Grants-in-Aid for Scientific Research) Grant Numbers 18K11205 (Takai) and 15K15950 (Hayashi).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are grateful to the associate editor and anonymous reviewers for their comments and suggestions that served to materially improve the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
  2. Takeuchi, K. Distribution of informational statistics and a criterion for model fitting. Suri-Kagaku 1976, 153, 12–18. (In Japanese)
  3. Shimodaira, H. A new criterion for selecting models from partially observed data. In Selecting Models from Data: AI and Statistics IV; Cheeseman, P., Oldford, R.W., Eds.; Springer: New York, NY, USA, 1994; pp. 21–30.
  4. Cavanaugh, J.E.; Shumway, R.H. An Akaike information criterion for model selection in the presence of incomplete data. J. Stat. Plan. Inference 1998, 67, 45–65.
  5. Seghouane, A.K.; Bekara, M.; Fleury, G. A criterion for model selection in the presence of incomplete data based on Kullback’s symmetric divergence. Signal Process. 2005, 85, 1405–1417.
  6. Ibrahim, J.G.; Zhu, H.; Tang, N. Model selection criteria for missing-data problems using the EM algorithm. J. Am. Stat. Assoc. 2008, 103, 1648–1658.
  7. Consentino, F.; Claeskens, G. Variable selection with incomplete covariate data. Biometrics 2008, 64, 1062–1069.
  8. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data; Wiley: Hoboken, NJ, USA, 2002.
  9. Enders, C.K. Applied Missing Data Analysis (Methodology in the Social Sciences); Guilford Press: New York, NY, USA, 2010.
  10. Schafer, J.L.; Graham, J.W. Missing data: Our view of the state of the art. Psychol. Methods 2002, 7, 147–177.
  11. Graham, J.W. Adding missing-data-relevant variables to FIML-based structural equation models. Struct. Equ. Model. Multidiscip. J. 2003, 10, 80–100.
  12. Collins, L.M.; Schafer, J.L.; Kam, C.M. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol. Methods 2001, 6, 330–351.
  13. Chang, W.Y.; Richards, D.S.P. Finite-sample inference with monotone incomplete multivariate normal data, I. J. Multivar. Anal. 2009, 100, 1883–1899.
  14. Chang, W.Y.; Richards, D.S.P. Finite-sample inference with monotone incomplete multivariate normal data, II. J. Multivar. Anal. 2010, 101, 603–620.
  15. White, H. Maximum likelihood estimation of misspecified models. Econometrica 1982, 50, 1–25.
  16. Takai, K.; Kano, Y. Asymptotic inference with incomplete data. Commun. Stat. Theory Methods 2013, 42, 3174–3190.
  17. Konishi, S.; Kitagawa, G. Information Criteria and Statistical Modeling; Springer: New York, NY, USA, 2007.
  18. Akaike, H. Prediction and entropy. In A Celebration of Statistics; Atkinson, A.C., Fienberg, S.E., Eds.; Springer: New York, NY, USA, 1985; pp. 1–24.
  19. Tao, R.; Zeng, D.; Lin, D.-Y. Optimal designs of two-phase studies. J. Am. Stat. Assoc. 2020, 115, 1946–1959.
  20. Yang, C.; Diao, L.; Cook, R.J. Adaptive response-dependent two-phase designs: Some results on robustness and efficiency. Stat. Med. 2022, 41, 4403–4425.
Table 1. True models and the corresponding candidate models. A number in parentheses indicates a model that is theoretically inappropriate for comparison but is compared in practice.

True Model    Candidate Models
1             1, 2, 3, 4
2             1, 2, (3), 4
3             1, (2), 3, 4
4             (1), (2), (3), 4
Table 2. True values of the parameters. The covariances are common across models.

True Model    μ1      μ2      μ3     σ11    σ22    σ33    σab (a ≠ b)
1             0       0       0      1      1      1      0.7
2             0       0       0      1      2      1      0.7
3             0.5     −0.5    0      1      1      1      0.7
4             0.5     −0.5    0      1      2      1      0.7
Table 3. Performance comparison under the missing-data mechanism (7). Each entry is the number of times the indicated model was selected out of 1000 replications; columns are grouped by the true model.

True Model:         1                       2                       3                       4
Selected:       1    2    3    4       1    2    3    4       1    2    3    4       1    2    3    4
n = 100
  Proposed    754  117  106   23      67  630   87  216       0    0  855  145       0    0  131  869
  AICcd       506  163  243   88       6  538    3  453       0    0  749  251       0    0    8  992
  AIC         597  134  211   58      10  578    4  408       0    0  808  192       0    0   13  987
n = 200
  Proposed    743   99  143   15      12  652   38  298       0    0  892  108       0    0   46  954
  AICcd       441  142  295  122       0  397    0  603       0    0  724  276       0    0    0 1000
  AIC         559  111  255   75       0  432    0  568       0    0  811  189       0    0    0 1000
n = 400
  Proposed    672  104  197   27       1  638   15  346       0    0  881  119       0    0   17  983
  AICcd       298  118  399  185       0  205    0  795       0    0  685  315       0    0    0 1000
  AIC         429   93  345  133       0  247    0  753       0    0  752  248       0    0    0 1000
n = 800
  Proposed    652   83  234   31       0  592    5  403       0    0  893  107       0    0    6  994
  AICcd       190   58  466  286       0   40    0  960       0    0  646  354       0    0    0 1000
  AIC         270   58  464  208       0   51    0  949       0    0  727  273       0    0    0 1000
Table 4. Performance comparison under the missing-data mechanism (8). Each entry is the number of times the indicated model was selected out of 1000 replications; columns are grouped by the true model.

True Model:         1                       2                       3                       4
Selected:       1    2    3    4       1    2    3    4       1    2    3    4       1    2    3    4
n = 100
  Proposed    829   88   74    9     134  743    9  114       0    0  914   86       0    1  121  878
  AICcd       484  200  232   84      73  638   24  265       0    0  716  284       0    0   98  902
  AIC         670  165  138   27     154  661   17  168       0    0  808  192       0    0  170  830
n = 200
  Proposed    820   99   74    7      28  852    3  117       0    0  895  105       0    0   28  972
  AICcd       465  210  212  113       8  710    3  279       0    0  675  325       0    0   12  988
  AIC         662  166  134   38      20  800    4  176       0    0  799  201       0    0   24  976
n = 400
  Proposed    795  124   70   11       2  882    0  116       0    0  867  133       0    0    1  999
  AICcd       418  251  193  138       0  710    1  289       0    0  611  389       0    0    1  999
  AIC         598  239  120   43       0  813    1  186       0    0  717  283       0    0    1  999
n = 800
  Proposed    765  158   60   17       0  885    0  115       0    0  827  173       0    0    0 1000
  AICcd       375  303  163  159       0  718    0  282       0    0  536  464       0    0    0 1000
  AIC         549  279  106   66       0  824    0  176       0    0  656  344       0    0    0 1000
Table 5. Performance comparison with the regression models. Each entry is the number of times the indicated model was selected out of 1000 replications; columns are grouped by the true model.

True Model:         5                       6                       7                       8
Selected:       5    6    7    8       5    6    7    8       5    6    7    8       5    6    7    8
n = 200
  Proposed    912   17   36   35       0  983    0   17       0    0  969   31       5   21  115  859
  AICcd       703  153  144    0       0 1000    0    0       0    0 1000    0       0  271  729    0
  AIC         157  205  213  425       0  439    0  561       0    0  442  558       0    6   19  975
n = 400
  Proposed    915   18   31   36       0  982    0   18       0    0  980   20       0    0    3  997
  AICcd       706  138  156    0       0 1000    0    0       0    0 1000    0       0  188  812    0
  AIC         160  223  201  416       0  430    0  570       0    0  452  548       0    0    1  999
n = 800
  Proposed    890   30   30   50       0  987    0   13       0    0  976   24       0    0    0 1000
  AICcd       699  162  139    0       0 1000    0    0       0    0 1000    0       0  112  888    0
  AIC         178  217  180  425       0  420    0  580       0    0  401  599       0    0    0 1000
n = 1600
  Proposed    887   39   31   43       0  976    0   24       0    0  971   29       0    0    0 1000
  AICcd       705  141  154    0       0 1000    0    0       0    0 1000    0       0   28  972    0
  AIC         180  191  177  452       0  425    0  575       0    0  445  555       0    0    0 1000
