2.1. AMac: Constructing MCS
Suppose we are given a set of samples $Y = \{y_1, \dots, y_n\}$ and a set of candidate models $\mathcal{M} = \{M_1, M_2, \dots, M_N\}$, where $M_i$ ($i = 1, \dots, N$) denotes the $i$-th candidate model. We assume that the sample is generated by the true model $M^*(\theta^*)$, where $\theta^*$ is the real model parameter, and that the probability of the true model being in the candidate set is $P(M^* \in \mathcal{M}) = 1$.
The objective is to determine which model $M_i$ ($i = 1, \dots, N$) is most likely to have produced the data. To achieve this, we utilize a criterion function $f$ to differentiate between the models in $\mathcal{M}$; the BIC is one example of such a criterion function. In general, a model $M$ is more likely to be $M^*$ under the criterion $f$ if $f(M)$ is smaller, so the model with the lowest value of $f$ is typically chosen as the estimate of $M^*$. However, due to incomplete data and the inherent uncertainty of the criterion $f$, this approach may not always produce satisfactory results [23]. To address this problem, the idea is to identify several models from $\mathcal{M}$ that together make up a model set $\Gamma_{1-\alpha}$, such that $\Gamma_{1-\alpha}$ covers the true model $M^*$ with the desired probability. The mathematical representation of this requirement is as follows:

$$P(M^* \in \Gamma_{1-\alpha}) \ge 1 - \alpha. \quad (1)$$
Here, $\alpha \in (0, 1)$ is a predetermined significance level, and $\Gamma_{1-\alpha}$ represents the model confidence set (MCS) with a confidence level of $1-\alpha$. To minimize the number of models in $\Gamma_{1-\alpha}$, we need to select those models with a higher probability of being $M^*$ to form $\Gamma_{1-\alpha}$.
Assuming that a smaller value of $f$ denotes a superior model, we can begin by computing the criterion values for all models in $\mathcal{M}$, denoted as $f(M_1), \dots, f(M_N)$. Based on these criterion values, we can then arrange the models in $\mathcal{M}$ in ascending order. Let the models be rearranged as follows:

$$f(M_{(1)}) \le f(M_{(2)}) \le \dots \le f(M_{(N)}).$$

According to our assumption, we expect to have the following:

$$P(M_{(1)} = M^*) \ge P(M_{(2)} = M^*) \ge \dots \ge P(M_{(N)} = M^*). \quad (2)$$

If Equation (2) holds under the criterion $f$, we can attempt to form a model confidence set $\Gamma_{1-\alpha}$ by choosing the first $k$ models, i.e., $\Gamma_{1-\alpha} = \{M_{(1)}, \dots, M_{(k)}\}$, where $k$ satisfies (1). This is the rationale of the Mac approach [17], which estimates the probability that each model in the sorted sequence equals the true model and then makes a cut (Mac) in that sequence.
In real-world applications, it is frequently difficult for the criterion $f$ to completely satisfy (2). However, if we can calculate the probabilities in Equation (2), the models can be re-sorted in descending order of these probabilities, which yields the following:

$$P(M_{[1]} = M^*) \ge P(M_{[2]} = M^*) \ge \dots \ge P(M_{[N]} = M^*),$$

and, since $M^* \in \mathcal{M}$, we have the following:

$$\sum_{l=1}^{N} P(M_{[l]} = M^*) = 1.$$

Then, we can select some models from $\mathcal{M}$ to form a model confidence set $\Gamma_{1-\alpha}$ that satisfies Equation (1). Let $k$ be the smallest integer that satisfies Equation (3), which is as follows:

$$\sum_{l=1}^{k} P(M_{[l]} = M^*) \ge 1 - \alpha. \quad (3)$$

We then obtain a minimal $1-\alpha$ model confidence set for $M^*$ as follows:

$$\Gamma_{1-\alpha} = \{M_{[1]}, M_{[2]}, \dots, M_{[k]}\}.$$

It is clear that the crucial step in obtaining the minimal $1-\alpha$ model confidence set for $M^*$ is to acquire the probability set:

$$\{P(M_i = M^*) : i = 1, 2, \dots, N\}. \quad (4)$$
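Given estimates of the probabilities in (4), the cut defined by Equation (3) is straightforward to compute. The following Python sketch sorts a vector of estimated probabilities in descending order and returns the indices of the smallest set whose cumulative probability reaches $1-\alpha$; the function name and the toy numbers are illustrative only and are not part of the method.

```python
import numpy as np

def minimal_mcs(probs, alpha):
    """Smallest model confidence set from estimated probabilities.

    probs : estimated probabilities that each candidate model is the true
            model (Equation (4)); they should sum to approximately 1.
    alpha : significance level, so the set aims at 1 - alpha coverage.
    """
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                  # descending probabilities
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, 1.0 - alpha) + 1)   # smallest k with cum >= 1 - alpha
    return order[:k]                                 # indices of M_[1], ..., M_[k]

# Toy example with four candidates and alpha = 0.1: the set keeps three models.
print(minimal_mcs([0.55, 0.30, 0.10, 0.05], alpha=0.10))   # -> [0 1 2]
```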
Referring to the Mac method, we can use the bootstrap method to simulate the probabilities in Equation (4) [24]. Assuming that $M^*$ is known, we can generate $B$ sets of data $Y^{(1)}, \dots, Y^{(B)}$ from $M^*(\hat{\theta}^*)$. By using the new data $Y^{(b)}$ and the criterion $f$, we can rank the candidate models to obtain $M_{(1)}^{(b)}, \dots, M_{(N)}^{(b)}$ for $b = 1, \dots, B$, and calculate $\hat{p}_l = \frac{1}{B}\sum_{b=1}^{B} I\{M_{(l)}^{(b)} = M^*\}$, where $I\{\cdot\}$ represents the indicator function. Then, for any $l = 1, \dots, N$, we have the following:

$$P(M_{(l)} = M^*) \approx \hat{p}_l. \quad (5)$$

Here, $M_{(l)}^{(b)}$ represents the $l$th-ordered model after reordering the models in $\mathcal{M}$ under the bootstrap data $Y^{(b)}$. However, in reality, we do not know $M^*$; we only know that $M^* \in \mathcal{M}$. Therefore, according to the law of total probability, for any $l$, we have the following:

$$P(M_{(l)} = M^*) = \sum_{j=1}^{N} P(M_{(l)} = M^* \mid M^* = M_j)\, P(M^* = M_j) \approx \sum_{j=1}^{N} \hat{p}_{l,j}\, P(M^* = M_j),$$

where $\hat{p}_{l,j}$ denotes the bootstrap probability obtained by generating the data from $M_j(\hat{\theta}_j)$, and $\hat{\theta}_j$ is the estimated parameter corresponding to the model $M_j$.
It can be observed that the estimated probability $P(M_{(l)} = M^*)$ can be approximated as a weighted average of the bootstrap probabilities corresponding to each model, $\sum_{j=1}^{N} w_j\, \hat{p}_{l,j}$. In the expression above, however, all the weight is concentrated on the single model $M_j = M^*$, while the weights for the other models are all zero. Note that Liu et al. proposed the Mac method [17], which approximates the estimated probability $P(M_{(l)} = M^*)$ using the bootstrap probabilities computed under the top-ranked model $M_{(1)}$ alone. This is equivalent to setting the weight of $M_{(1)}$ to 1 and the weights of all other models to 0. This approximation yields good results only when the criterion $f$ is relatively accurate (i.e., when the criterion $f$ ranks the true model first). Hence, the efficacy of the Mac method is strongly dependent on the accuracy of the criterion $f$: when data fluctuations increase or the accuracy of the criterion $f$ decreases, the effectiveness of the Mac method diminishes significantly. To mitigate the impact of data fluctuations and of the criterion $f$, we draw inspiration from model averaging and approximate the estimated probability using Equation (6):

$$P(M_{(l)} = M^*) \approx \sum_{j=1}^{N} w_j\, \hat{p}_{l,j}, \quad (6)$$

where $w_j \ge 0$ is the weight associated with model $M_j$ and $\sum_{j=1}^{N} w_j = 1$. When substantial data fluctuations cause the criterion $f$ to make incorrect judgments, utilizing information from multiple models in Equation (6) mitigates the impact of such misjudgments, which leads to a more effective approximation.
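The difference between the Mac approximation and Equation (6) can be illustrated in a few lines of Python. Here `p_hat[l, j]` is assumed to hold the bootstrap probability that the $l$-th ranked model equals the true model when the data are generated from the fitted candidate $M_j$; the array values and the weight vectors are made up purely for illustration.

```python
import numpy as np

def combine_rank_probs(p_hat, weights):
    """Equation (6): weighted average of per-model bootstrap probabilities.

    p_hat   : (N, N) array, p_hat[l, j] ~ P(M_(l) = M* | data from M_j).
    weights : length-N vector with w_j >= 0 and sum(w_j) = 1.
    Returns a length-N vector approximating P(M_(l) = M*) for l = 1, ..., N.
    """
    return np.asarray(p_hat, float) @ np.asarray(weights, float)

# Illustrative bootstrap probabilities for three candidates (columns sum to 1).
p_hat = np.array([[0.70, 0.40, 0.20],
                  [0.20, 0.35, 0.30],
                  [0.10, 0.25, 0.50]])

mac_weights  = np.array([1.0, 0.0, 0.0])   # Mac: all weight on the top-ranked model
amac_weights = np.array([0.5, 0.3, 0.2])   # AMac: weight spread over several models

print(combine_rank_probs(p_hat, mac_weights))    # uses only the first column
print(combine_rank_probs(p_hat, amac_weights))   # blends information from all models
```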
It can be seen that, given the weights $w_1, \dots, w_N$, the key to calculating the probabilities in Equation (4) is to simulate $\{\hat{p}_{l,j} : l, j = 1, \dots, N\}$. Equation (6) may then be used to compute the probabilities in Equation (4). The calculated probabilities are then sorted in descending order to obtain $\hat{P}(M_{[1]} = M^*) \ge \hat{P}(M_{[2]} = M^*) \ge \dots \ge \hat{P}(M_{[N]} = M^*)$. In the end, we find the required value of $k$ using Equation (3), and as a result, we obtain a minimal $1-\alpha$ confidence set for $M^*$ as $\Gamma_{1-\alpha} = \{M_{[1]}, \dots, M_{[k]}\}$. The specific algorithm for constructing the MCS using AMac is shown in Algorithm 1.
The key step of Algorithm 1 is to make a cut in the model sequence $M_{[1]}, M_{[2]}, \dots, M_{[N]}$ according to the bootstrap probabilities. Considering that our method combines the idea of model averaging, we name it average Mac (AMac).
Algorithm 1 Constructing MCS using AMac
1: Using the data $Y$ and a criterion $f$, perform parameter estimation and ranking of the models to obtain the ordered models $M_{(1)}, \dots, M_{(N)}$ and their corresponding parameter estimates $\hat{\theta}_{(1)}, \dots, \hat{\theta}_{(N)}$.
2: Choose a set of probability weights $w_1, \dots, w_N$ for the models $M_{(1)}, \dots, M_{(N)}$, satisfying $w_j \ge 0$ and $\sum_{j=1}^{N} w_j = 1$.
3: Keeping the explanatory data $X$ unchanged, generate new data $Y^{(b,j)}$ under the model $M_{(j)}(\hat{\theta}_{(j)})$ for each $j = 1, \dots, N$, respectively.
4: Calculate the values of $f$ for each candidate model in the candidate model set $\mathcal{M}$ using $Y^{(b,j)}$ and sort them in ascending order:
$$f(M_{(1)}^{(b,j)}) \le f(M_{(2)}^{(b,j)}) \le \dots \le f(M_{(N)}^{(b,j)}),$$
resulting in an ordered sequence of models $M_{(1)}^{(b,j)}, \dots, M_{(N)}^{(b,j)}$.
5: Repeat steps 3 and 4 for $b = 1, \dots, B$ and $j = 1, \dots, N$.
6: Calculate the empirical probabilities of $\{M_{(l)}^{(b,j)} = M_{(j)}\}$:
$$\hat{p}_{l,j} = \frac{1}{B}\sum_{b=1}^{B} I\{M_{(l)}^{(b,j)} = M_{(j)}\}, \quad l, j = 1, \dots, N.$$
7: According to Equation (6), calculate $\hat{P}(M_{(l)} = M^*)$ for any $l$:
$$\hat{P}(M_{(l)} = M^*) = \sum_{j=1}^{N} w_j\, \hat{p}_{l,j}.$$
8: Based on the re-sorting of the candidate models in descending order of $\hat{P}(M_{(l)} = M^*)$, we obtain
$$\hat{P}(M_{[1]} = M^*) \ge \hat{P}(M_{[2]} = M^*) \ge \dots \ge \hat{P}(M_{[N]} = M^*).$$
9: By calculating
$$\sum_{l=1}^{k} \hat{P}(M_{[l]} = M^*) \ge 1 - \alpha,$$
we determine the minimum value of $k$, and the $1-\alpha$ confidence set for $M^*$ is $\Gamma_{1-\alpha} = \{M_{[1]}, \dots, M_{[k]}\}$.
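To illustrate how the steps of Algorithm 1 fit together, the self-contained sketch below runs AMac on a toy linear-regression problem, with all subsets of four covariates as candidates, BIC as the criterion $f$, a parametric bootstrap under each fitted candidate, and exponentially decaying rank weights. The candidate set, the number of bootstrap replications, and in particular the weight scheme are illustrative assumptions made for this example, not the choices prescribed by the method.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the response depends only on the first two of four covariates.
n, p = 80, 4
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Candidate models: all non-empty subsets of the covariates (15 models).
candidates = [s for r in range(1, p + 1) for s in itertools.combinations(range(p), r)]
N, B, alpha = len(candidates), 200, 0.10

def ols_fit(subset, y_obs):
    """Least squares fit on the chosen columns; returns coefficients and error variance."""
    Xs = X[:, subset]
    beta, *_ = np.linalg.lstsq(Xs, y_obs, rcond=None)
    resid = y_obs - Xs @ beta
    return beta, float(np.mean(resid ** 2))

def bic(subset, y_obs):
    """Gaussian BIC used as the criterion f (smaller is better)."""
    _, s2 = ols_fit(subset, y_obs)
    return n * np.log(s2) + len(subset) * np.log(n)

def rank_by_f(y_obs):
    """Order the candidate indices by ascending criterion value."""
    return np.argsort([bic(s, y_obs) for s in candidates])

# Step 1: rank the candidates on the observed data and fit each of them.
order = rank_by_f(y)
fits = [ols_fit(candidates[i], y) for i in order]

# Step 2: probability weights for the ordered models (illustrative choice).
weights = np.exp(-np.arange(N))
weights /= weights.sum()

# Steps 3-6: parametric bootstrap under each ordered model M_(j).
p_hat = np.zeros((N, N))              # p_hat[l, j] ~ P(rank-l model equals M_(j))
for j, (beta, s2) in enumerate(fits):
    mean = X[:, candidates[order[j]]] @ beta
    for _ in range(B):
        y_b = mean + rng.normal(scale=np.sqrt(s2), size=n)
        rank_of_gen = int(np.where(rank_by_f(y_b) == order[j])[0][0])
        p_hat[rank_of_gen, j] += 1.0 / B

# Step 7: Equation (6) -- weighted average over the generating models.
prob = p_hat @ weights

# Steps 8-9: sort in descending probability and cut at the 1 - alpha level.
desc = np.argsort(prob)[::-1]
k = int(np.searchsorted(np.cumsum(prob[desc]), 1 - alpha) + 1)
mcs = [candidates[order[i]] for i in desc[:k]]
print("AMac model confidence set:", mcs)
```

Because the weights enter only through the weighted average in Step 7, the Mac procedure is recovered from the same sketch by replacing the weight vector with one that places all of its mass on the top-ranked model.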
2.2. MSP-*: Constructing MCS under High-Dimensional Variables in the Linear Regression Model
When dealing with high-dimensional variables, a large number of explanatory variables can result in an overly large candidate model set. In the case of the linear regression model, for instance, there are $2^p$ candidate models if the explanatory variables have dimension $p$; with $p = 30$, this already exceeds $10^9$ models. If the squares and interaction terms of the explanatory variables are also considered, the number of candidate models $N$ becomes even larger. When $N$ is large, AMac encounters two problems: (1) the first step of AMac, which involves sorting all candidates based on the criterion $f$, becomes computationally infeasible; and (2) estimating, by bootstrap, the probability that each model in $\mathcal{M}$ is the true model becomes challenging. To the best of our knowledge, the existing approaches for constructing model confidence sets (MCS, LRT, LBM, BMS, Mac) all incur unreasonably high computational costs when handling a large number of candidate models. It is therefore particularly important to propose an effective method for reducing the number of candidate models when there are many explanatory variables.
Davide Ferrari [14] suggested using variable selection methods such as the lasso [25] and SCAD [26] to reduce the number of variables and then applying the LBM method to construct the MCS. However, this strategy does not successfully tackle the issue of an excessive number of candidate models induced by high-dimensional variables, for two key reasons. First, although the first step reduces the dimensionality $p$ through variable selection, the reduced number of variables may still be relatively large, especially when the number of true variables is itself large. Second, most existing variable selection techniques are unlikely to retain every true variable, so some true variables are likely to be left out. Consequently, the true model is no longer included in the model set built from the selected variables, which violates the assumption of the model confidence set that the true model is contained in the candidate model set.
We have observed that, when constructing an MCS, the selection of well-performing models is more crucial than the selection of variables. Motivated by this, we aim to efficiently reduce the number of candidate models by directly selecting high-performing models from the candidate model set to form a new set. To achieve this, we narrow down the initial candidate model set $\mathcal{M}$ to $\mathcal{M}'$ while ensuring that $\mathcal{M}'$ satisfies the following two properties:
Property 1. The number of candidate models in $\mathcal{M}'$ is relatively small and does not increase significantly with $p$.
Property 2. $\mathcal{M}'$ should satisfy $M^* \in \mathcal{M}'$, or at least ensure this property when the sample size is sufficiently large, i.e., $P(M^* \in \mathcal{M}') \to 1$ as $n \to \infty$.
We refer to a model set that satisfies these two properties as a model selection path (MSP). Property 1 ensures that, after the MSP is constructed, the computational burden of constructing the MCS from it in the second step remains manageable. Property 2 requires that the true model be included in the MSP, which is a prerequisite for constructing the MCS from it in the second step. Next, we present the specific method for constructing the MSP in the context of the linear regression model.
Assume that the data follow a linear regression model,

$$Y = X\beta + \varepsilon,$$

where $Y$ is defined as previously mentioned, $X$ denotes the $n \times p$ matrix of explanatory variables, $\beta = (\beta_1, \dots, \beta_p)^T$ represents the coefficient vector of the independent variables, and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$ is the random error vector. We assume that the errors $\varepsilon_i$ are independently and identically distributed random variables with a mean of 0 and a variance of $\sigma^2$. Additionally, we assume that $\frac{1}{n} X^T X \to C$, where $C$ is a positive definite matrix. Our proposed solution is to construct the MSP by utilizing the "solution path" [27] obtained from the adaptive lasso (Alasso) method [22]. This approach allows us to significantly reduce the number of candidate models.
Let us start by briefly introducing the relevant background on the Alasso and lars [21]. Assume that $\hat{\beta}$ is a $\sqrt{n}$-consistent estimator of $\beta$ (in this paper, we use the least squares estimator $\hat{\beta}_{\mathrm{ols}}$), choose $\gamma > 0$ (in this paper, we choose $\gamma = 1$), and define the weight vector $\hat{w} = 1/|\hat{\beta}|^{\gamma}$. Note that $\hat{\beta}_{\mathrm{ols}}$ and $\gamma = 1$ are the choices most commonly used in applications of the adaptive lasso. These parameter values are also the default settings in the adaptive lasso code available on the web (http://www4.stat.ncsu.edu/~boos/var.select/lasso.adaptive.html, accessed on 20 July 2023). Furthermore, it is important to note that the specific value of $\gamma$ within the range $\gamma > 0$ does not affect the large-sample properties of the MSP constructed by Algorithm 2 (see the proof in Appendix A.2 for details). For the sake of simplicity, we set $\gamma = 1$ in Algorithm 2.
The Alasso estimator $\hat{\beta}^{\mathrm{alasso}}$ is defined as follows:

$$\hat{\beta}^{\mathrm{alasso}} = \arg\min_{\beta} \Big\| Y - \sum_{j=1}^{p} x_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j|, \quad (7)$$

where $\lambda$ is a tuning parameter. Let $\mathcal{A} = \{j : \beta_j \ne 0\}$ represent the set of variables with nonzero true coefficients and $\mathcal{A}_n^{\lambda} = \{j : \hat{\beta}^{\mathrm{alasso}}_j \ne 0\}$ represent the set of variables with nonzero coefficients in the Alasso estimation results. According to Zou's proof [22], as the sample size $n$ tends toward infinity, we can choose an appropriate $\lambda$ such that $P(\mathcal{A}_n^{\lambda} = \mathcal{A}) \to 1$. In this case, the model constructed from the nonzero variables determined by this $\lambda$ is the true model. Therefore, if we consider all the models represented by the solutions of Equation (7) with $\lambda \in [0, +\infty)$ and form a model set $\mathcal{M}'$, then, as the sample size $n$ tends toward infinity, $\mathcal{M}'$ will contain the true model with probability tending to 1. It is known that the lars algorithm can provide all solutions of the Alasso with minimal computational effort. Therefore, we propose using the lars algorithm to generate the solution path of the Alasso, and we use the models $M'_1, \dots, M'_N$ formed by the nonzero variables in all solutions on the path to form $\mathcal{M}' = \{M'_1, \dots, M'_N\}$, where $N$ represents the number of steps required by the lars–Alasso algorithm to obtain the Alasso solution path. Specifically, the construction method of the proposed MSP is presented in Algorithm 2.
Algorithm 2 Alasso–lars (AL): the construction algorithm for the MSP
1: Compute the least squares estimate $\hat{\beta}_{\mathrm{ols}}$ under the full variable set, and calculate $\hat{w} = 1/|\hat{\beta}_{\mathrm{ols}}|$.
2: Define $x_j^{**} = x_j/\hat{w}_j$, $j = 1, \dots, p$.
3: Solve the lasso problem shown in Equation (8) using the lars–lasso algorithm:
$$\hat{\beta}^{**} = \arg\min_{\beta} \Big\| Y - \sum_{j=1}^{p} x_j^{**} \beta_j \Big\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|. \quad (8)$$
4: Assuming that the lars–lasso algorithm takes a total of $N$ steps to solve Equation (8), save the model consisting of the nonzero variables at each step and let $\mathcal{M}' = \{M'_1, \dots, M'_N\}$ be the new set of candidate models. Here, $M'_1, \dots, M'_N$ are mutually distinct models.
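Under the assumption that a standard lars implementation is available, Algorithm 2 can be sketched in Python with `sklearn.linear_model.lars_path`: rescale the columns by the adaptive weights, trace the lasso path on the rescaled design, and collect the distinct active sets along the path. The simulated data, the function name, and the use of OLS-based weights with $\gamma = 1$ follow the text above; everything else is illustrative.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)

# Illustrative data: only the first three of ten coefficients are nonzero.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X @ np.r_[2.0, -1.5, 1.0, np.zeros(p - 3)] + rng.normal(size=n)

def msp_alasso_lars(X, y, gamma=1.0):
    """Sketch of Algorithm 2 (Alasso-lars): build the model selection path.

    Step 1: OLS estimate and adaptive weights w_j = 1 / |beta_ols_j|**gamma.
    Step 2: rescale the columns, x_j** = x_j / w_j.
    Step 3: run the lars-lasso algorithm on the rescaled design (Equation (8)).
    Step 4: keep the distinct sets of nonzero variables found along the path.
    """
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / np.abs(beta_ols) ** gamma
    X_star = X / w                                    # column-wise rescaling
    _, _, coefs = lars_path(X_star, y, method="lasso")
    path = []
    for step in range(coefs.shape[1]):
        model = tuple(np.flatnonzero(coefs[:, step]))
        if model and model not in path:               # distinct, non-empty models only
            path.append(model)
    return path

msp = msp_alasso_lars(X, y)
print(f"{len(msp)} candidate models instead of 2^{p} = {2 ** p}")
print(msp)   # each tuple lists the variable indices of one model on the path
```

Only the distinct active sets matter for Step 4 of Algorithm 2, which is why duplicate models, which can appear when a variable is dropped and later re-enters the path, are filtered out in this sketch.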
Algorithm 2 essentially leverages the consistency of the Alasso in variable selection and the piecewise linearity, with respect to $\lambda$, of the solution to the Alasso problem in a linear regression model [28]; this solution path can be computed efficiently with the lars algorithm. In Theorem 2, we provide a proof of the effectiveness of the MSP constructed by Algorithm 2.