2.1. AMac: Constructing MCS
Suppose we are given a set of samples $Y = \{y_1, \dots, y_n\}$ and a set of candidate models $\mathcal{M} = \{M_1, M_2, \dots, M_N\}$, where $M_i$ ($i = 1, \dots, N$) denotes the $i$-th candidate model. We assume that the sample is generated by the true model $M^*(\theta^*)$, where $\theta^*$ is the real model parameter, and that the probability of the true model being in the candidate set is $P(M^* \in \mathcal{M}) = 1$.
The objective is to determine which model $M_i$ ($i = 1, \dots, N$) is most likely to have produced the data. To achieve this, we utilize a criterion function $f$ to differentiate between the models in $\mathcal{M}$; the BIC is one example of such a criterion function. In general, a model $M$ is more likely to be $M^*$ under the criterion $f$ if $f(M)$ is smaller, so the model with the lowest value of $f$ is typically chosen as the estimate of $M^*$. However, due to incomplete data and the inherent uncertainty of the criterion $f$, this approach may not always produce satisfactory results [23]. To address this problem, the idea is to identify several models from $\mathcal{M}$ that together make up a model set $\Gamma_{1-\alpha}$, such that $\Gamma_{1-\alpha}$ covers the true model $M^*$ with the desired probability. The mathematical representation of this requirement is as follows:

$$P(M^* \in \Gamma_{1-\alpha}) \ge 1 - \alpha. \quad (1)$$
Here, $\alpha \in (0, 1)$ is a predetermined significance level, and $\Gamma_{1-\alpha}$ represents the model confidence set (MCS) with a confidence level of $1-\alpha$. To minimize the number of models in $\Gamma_{1-\alpha}$, we need to select those models with a higher probability of being $M^*$ to form $\Gamma_{1-\alpha}$.
Assuming that a smaller value of $f$ denotes a superior model, we can begin by computing the criterion values for all models in $\mathcal{M}$, denoted as $f(M_1), \dots, f(M_N)$. Based on these criterion values, we can then arrange the models in $\mathcal{M}$ in ascending order. Let the models be rearranged as follows:

$$f(M_{(1)}) \le f(M_{(2)}) \le \dots \le f(M_{(N)}).$$

According to our assumption, we expect to have the following:

$$P(M_{(1)} = M^*) \ge P(M_{(2)} = M^*) \ge \dots \ge P(M_{(N)} = M^*). \quad (2)$$

If Equation (2) holds under the criterion $f$, we can attempt to form a model confidence set $\Gamma_{1-\alpha}$ by choosing the first $k$ models, i.e., $\Gamma_{1-\alpha} = \{M_{(1)}, \dots, M_{(k)}\}$, where $k$ satisfies (1). This is the rationale of the Mac approach [17], which estimates the probability that each model in the sorted sequence equals the true model and then makes a cut (Mac) in that sequence.
In real-world applications, it is frequently difficult for the criterion $f$ to completely satisfy (2). However, if we can calculate the probabilities in Equation (2), the models can be re-sorted in descending order of these probabilities, which yields the following:

$$P(M_{[1]} = M^*) \ge P(M_{[2]} = M^*) \ge \dots \ge P(M_{[N]} = M^*),$$

and, since $M^* \in \mathcal{M}$, we have the following:

$$\sum_{l=1}^{N} P(M_{[l]} = M^*) = 1.$$

Then, we can select some models from $\mathcal{M}$ to form a model confidence set $\Gamma_{1-\alpha}$ that satisfies Equation (1). Let $k$ be the smallest integer that satisfies Equation (3), which is as follows:

$$\sum_{l=1}^{k} P(M_{[l]} = M^*) \ge 1 - \alpha. \quad (3)$$

We then obtain a minimal $1-\alpha$ model confidence set for $M^*$ as follows:

$$\Gamma_{1-\alpha} = \{M_{[1]}, M_{[2]}, \dots, M_{[k]}\}.$$

It is clear that the crucial step in obtaining the minimal $1-\alpha$ model confidence set for $M^*$ is to acquire the probability set:

$$\{P(M_i = M^*) : i = 1, 2, \dots, N\}. \quad (4)$$
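Given estimates of the probabilities in (4), the cut defined by Equation (3) is straightforward to compute. The following Python sketch sorts a vector of estimated probabilities in descending order and returns the indices of the smallest set whose cumulative probability reaches $1-\alpha$; the function name and the toy numbers are illustrative only and are not part of the method.

```python
import numpy as np

def minimal_mcs(probs, alpha):
    """Smallest model confidence set from estimated probabilities.

    probs : estimated probabilities that each candidate model is the true
            model (Equation (4)); they should sum to approximately 1.
    alpha : significance level, so the set aims at 1 - alpha coverage.
    """
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]                  # descending probabilities
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, 1.0 - alpha) + 1)   # smallest k with cum >= 1 - alpha
    return order[:k]                                 # indices of M_[1], ..., M_[k]

# Toy example with four candidates and alpha = 0.1: the set keeps three models.
print(minimal_mcs([0.55, 0.30, 0.10, 0.05], alpha=0.10))   # -> [0 1 2]
```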
Referring to the Mac method, we can use the bootstrap method to simulate the probabilities in Equation (4) [24]. Assuming that $M^*$ is known, we can generate $B$ sets of data $Y^{(1)}, \dots, Y^{(B)}$ from $M^*(\hat{\theta}^*)$. By using the new data $Y^{(b)}$ and the criterion $f$, we can rank the candidate models to obtain $M_{(1)}^{(b)}, \dots, M_{(N)}^{(b)}$ for $b = 1, \dots, B$, and calculate $\hat{p}_l = \frac{1}{B}\sum_{b=1}^{B} I\{M_{(l)}^{(b)} = M^*\}$, where $I\{\cdot\}$ represents the indicator function. Then, for any $l = 1, \dots, N$, we have the following:

$$P(M_{(l)} = M^*) \approx \hat{p}_l. \quad (5)$$

Here, $M_{(l)}^{(b)}$ represents the $l$th-ordered model after reordering the models in $\mathcal{M}$ under the bootstrap data $Y^{(b)}$. However, in reality, we do not know $M^*$; we only know that $M^* \in \mathcal{M}$. Therefore, according to the law of total probability, for any $l$, we have the following:

$$P(M_{(l)} = M^*) = \sum_{j=1}^{N} P(M_{(l)} = M^* \mid M^* = M_j)\, P(M^* = M_j) \approx \sum_{j=1}^{N} \hat{p}_{l,j}\, P(M^* = M_j),$$

where $\hat{p}_{l,j}$ denotes the bootstrap probability obtained by generating the data from $M_j(\hat{\theta}_j)$, and $\hat{\theta}_j$ is the estimated parameter corresponding to the model $M_j$.
It can be observed that the estimated probability $P(M_{(l)} = M^*)$ can be approximated as a weighted average of the bootstrap probabilities corresponding to each model, $\sum_{j=1}^{N} w_j\, \hat{p}_{l,j}$. In the expression above, however, all the weight is concentrated on the single model $M_j = M^*$, while the weights for the other models are all zero. Note that Liu et al. proposed the Mac method [17], which approximates the estimated probability $P(M_{(l)} = M^*)$ using the bootstrap probabilities computed under the top-ranked model $M_{(1)}$ alone. This is equivalent to setting the weight of $M_{(1)}$ to 1 and the weights of all other models to 0. This approximation yields good results only when the criterion $f$ is relatively accurate (i.e., when the criterion $f$ ranks the true model first). Hence, the efficacy of the Mac method is strongly dependent on the accuracy of the criterion $f$: when data fluctuations increase or the accuracy of the criterion $f$ decreases, the effectiveness of the Mac method diminishes significantly. To mitigate the impact of data fluctuations and of the criterion $f$, we draw inspiration from model averaging and approximate the estimated probability using Equation (6):

$$P(M_{(l)} = M^*) \approx \sum_{j=1}^{N} w_j\, \hat{p}_{l,j}, \quad (6)$$

where $w_j \ge 0$ is the weight associated with model $M_j$ and $\sum_{j=1}^{N} w_j = 1$. When substantial data fluctuations cause the criterion $f$ to make incorrect judgments, utilizing information from multiple models in Equation (6) mitigates the impact of such misjudgments, which leads to a more effective approximation.
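The difference between the Mac approximation and Equation (6) can be illustrated in a few lines of Python. Here `p_hat[l, j]` is assumed to hold the bootstrap probability that the $l$-th ranked model equals the true model when the data are generated from the fitted candidate $M_j$; the array values and the weight vectors are made up purely for illustration.

```python
import numpy as np

def combine_rank_probs(p_hat, weights):
    """Equation (6): weighted average of per-model bootstrap probabilities.

    p_hat   : (N, N) array, p_hat[l, j] ~ P(M_(l) = M* | data from M_j).
    weights : length-N vector with w_j >= 0 and sum(w_j) = 1.
    Returns a length-N vector approximating P(M_(l) = M*) for l = 1, ..., N.
    """
    return np.asarray(p_hat, float) @ np.asarray(weights, float)

# Illustrative bootstrap probabilities for three candidates (columns sum to 1).
p_hat = np.array([[0.70, 0.40, 0.20],
                  [0.20, 0.35, 0.30],
                  [0.10, 0.25, 0.50]])

mac_weights  = np.array([1.0, 0.0, 0.0])   # Mac: all weight on the top-ranked model
amac_weights = np.array([0.5, 0.3, 0.2])   # AMac: weight spread over several models

print(combine_rank_probs(p_hat, mac_weights))    # uses only the first column
print(combine_rank_probs(p_hat, amac_weights))   # blends information from all models
```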
It can be seen that, given the weights $w_1, \dots, w_N$, the key to calculating the probabilities in Equation (4) is to simulate $\{\hat{p}_{l,j} : l, j = 1, \dots, N\}$. Equation (6) may then be used to compute the probabilities in Equation (4). The calculated probabilities are then sorted in descending order to obtain $\hat{P}(M_{[1]} = M^*) \ge \hat{P}(M_{[2]} = M^*) \ge \dots \ge \hat{P}(M_{[N]} = M^*)$. In the end, we find the required value of $k$ using Equation (3), and as a result, we obtain a minimal $1-\alpha$ confidence set for $M^*$ as $\Gamma_{1-\alpha} = \{M_{[1]}, \dots, M_{[k]}\}$. The specific algorithm for constructing the MCS using AMac is shown in Algorithm 1.
The key step of Algorithm 1 is to make a cut in the model sequence $M_{[1]}, M_{[2]}, \dots, M_{[N]}$ according to the bootstrap probabilities. Considering that our method combines the idea of model averaging, we name it average Mac (AMac).
Algorithm 1 Constructing MCS using AMac
1: Using the data $Y$ and a criterion $f$, perform parameter estimation and ranking of the models to obtain the ordered models $M_{(1)}, \dots, M_{(N)}$ and their corresponding parameter estimates $\hat{\theta}_{(1)}, \dots, \hat{\theta}_{(N)}$.
2: Choose a set of probability weights $w_1, \dots, w_N$ for the models $M_{(1)}, \dots, M_{(N)}$, satisfying $w_j \ge 0$ and $\sum_{j=1}^{N} w_j = 1$.
3: Keeping the explanatory data $X$ unchanged, generate new data $Y^{(b,j)}$ under the model $M_{(j)}(\hat{\theta}_{(j)})$ for each $j = 1, \dots, N$, respectively.
4: Calculate the values of $f$ for each candidate model in the candidate model set $\mathcal{M}$ using $Y^{(b,j)}$ and sort them in ascending order:
$$f(M_{(1)}^{(b,j)}) \le f(M_{(2)}^{(b,j)}) \le \dots \le f(M_{(N)}^{(b,j)}),$$
resulting in an ordered sequence of models $M_{(1)}^{(b,j)}, \dots, M_{(N)}^{(b,j)}$.
5: Repeat steps 3 and 4 for $b = 1, \dots, B$ and $j = 1, \dots, N$.
6: Calculate the empirical probabilities of $\{M_{(l)}^{(b,j)} = M_{(j)}\}$:
$$\hat{p}_{l,j} = \frac{1}{B}\sum_{b=1}^{B} I\{M_{(l)}^{(b,j)} = M_{(j)}\}, \quad l, j = 1, \dots, N.$$
7: According to Equation (6), calculate $\hat{P}(M_{(l)} = M^*)$ for any $l$:
$$\hat{P}(M_{(l)} = M^*) = \sum_{j=1}^{N} w_j\, \hat{p}_{l,j}.$$
8: Based on the re-sorting of the candidate models in descending order of $\hat{P}(M_{(l)} = M^*)$, we obtain
$$\hat{P}(M_{[1]} = M^*) \ge \hat{P}(M_{[2]} = M^*) \ge \dots \ge \hat{P}(M_{[N]} = M^*).$$
9: By calculating
$$\sum_{l=1}^{k} \hat{P}(M_{[l]} = M^*) \ge 1 - \alpha,$$
we determine the minimum value of $k$, and the $1-\alpha$ confidence set for $M^*$ is $\Gamma_{1-\alpha} = \{M_{[1]}, \dots, M_{[k]}\}$.
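To illustrate how the steps of Algorithm 1 fit together, the self-contained sketch below runs AMac on a toy linear-regression problem, with all subsets of four covariates as candidates, BIC as the criterion $f$, a parametric bootstrap under each fitted candidate, and exponentially decaying rank weights. The candidate set, the number of bootstrap replications, and in particular the weight scheme are illustrative assumptions made for this example, not the choices prescribed by the method.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the response depends only on the first two of four covariates.
n, p = 80, 4
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Candidate models: all non-empty subsets of the covariates (15 models).
candidates = [s for r in range(1, p + 1) for s in itertools.combinations(range(p), r)]
N, B, alpha = len(candidates), 200, 0.10

def ols_fit(subset, y_obs):
    """Least squares fit on the chosen columns; returns coefficients and error variance."""
    Xs = X[:, subset]
    beta, *_ = np.linalg.lstsq(Xs, y_obs, rcond=None)
    resid = y_obs - Xs @ beta
    return beta, float(np.mean(resid ** 2))

def bic(subset, y_obs):
    """Gaussian BIC used as the criterion f (smaller is better)."""
    _, s2 = ols_fit(subset, y_obs)
    return n * np.log(s2) + len(subset) * np.log(n)

def rank_by_f(y_obs):
    """Order the candidate indices by ascending criterion value."""
    return np.argsort([bic(s, y_obs) for s in candidates])

# Step 1: rank the candidates on the observed data and fit each of them.
order = rank_by_f(y)
fits = [ols_fit(candidates[i], y) for i in order]

# Step 2: probability weights for the ordered models (illustrative choice).
weights = np.exp(-np.arange(N))
weights /= weights.sum()

# Steps 3-6: parametric bootstrap under each ordered model M_(j).
p_hat = np.zeros((N, N))              # p_hat[l, j] ~ P(rank-l model equals M_(j))
for j, (beta, s2) in enumerate(fits):
    mean = X[:, candidates[order[j]]] @ beta
    for _ in range(B):
        y_b = mean + rng.normal(scale=np.sqrt(s2), size=n)
        rank_of_gen = int(np.where(rank_by_f(y_b) == order[j])[0][0])
        p_hat[rank_of_gen, j] += 1.0 / B

# Step 7: Equation (6) -- weighted average over the generating models.
prob = p_hat @ weights

# Steps 8-9: sort in descending probability and cut at the 1 - alpha level.
desc = np.argsort(prob)[::-1]
k = int(np.searchsorted(np.cumsum(prob[desc]), 1 - alpha) + 1)
mcs = [candidates[order[i]] for i in desc[:k]]
print("AMac model confidence set:", mcs)
```

Because the weights enter only through the weighted average in Step 7, the Mac procedure is recovered from the same sketch by replacing the weight vector with one that places all of its mass on the top-ranked model.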
2.2. MSP-*: Constructing MCS under High-Dimensional Variables in the Linear Regression Model
When dealing with high-dimensional variables, a large number of explanatory variables can result in an overly large candidate model set. In the case of the linear regression model, for instance, there are $2^p$ candidate models if the explanatory variables have dimension $p$; with $p = 30$, this already exceeds $10^9$ models. If the squares and interaction terms of the explanatory variables are also considered, the number of candidate models $N$ becomes even larger. When $N$ is large, AMac encounters two problems: (1) the first step of AMac, which involves sorting all candidates based on the criterion $f$, becomes computationally infeasible; and (2) estimating, by bootstrap, the probability that each model in $\mathcal{M}$ is the true model becomes challenging. To the best of our knowledge, the existing approaches for constructing model confidence sets (MCS, LRT, LBM, BMS, Mac) all incur unreasonably high computational costs when handling a large number of candidate models. It is therefore particularly important to propose an effective method for reducing the number of candidate models when there are many explanatory variables.
Davide Ferrari [14] suggested using variable selection methods such as the lasso [25] and SCAD [26] to reduce the number of variables and then applying the LBM method to construct the MCS. However, this strategy does not successfully tackle the issue of an excessive number of candidate models induced by high-dimensional variables, for two key reasons. First, although the first step reduces the dimensionality $p$ through variable selection, the reduced number of variables may still be relatively large, especially when the number of true variables is itself large. Second, most existing variable selection techniques are unlikely to retain every true variable, so some true variables are likely to be left out. Consequently, the true model is no longer included in the model set built from the selected variables, which violates the assumption of the model confidence set that the true model is contained in the candidate model set.
We have observed that, when constructing an MCS, the selection of well-performing models is more crucial than the selection of variables. Motivated by this, we aim to efficiently reduce the number of candidate models by directly selecting high-performing models from the candidate model set to form a new set. To achieve this, we narrow down the initial candidate model set $\mathcal{M}$ to $\mathcal{M}'$ while ensuring that $\mathcal{M}'$ satisfies the following two properties:
Property 1. The number of candidate models in $\mathcal{M}'$ is relatively small and does not increase significantly with $p$.
Property 2. $\mathcal{M}'$ should satisfy $M^* \in \mathcal{M}'$, or at least ensure this property when the sample size is sufficiently large, i.e., $P(M^* \in \mathcal{M}') \to 1$ as $n \to \infty$.
We refer to a model set that satisfies these two properties as a model selection path (MSP). Property 1 ensures that, after the MSP is constructed, the computational burden of constructing the MCS from it in the second step remains manageable. Property 2 requires that the true model be included in the MSP, which is a prerequisite for constructing the MCS from it in the second step. Next, we present the specific method for constructing the MSP in the context of the linear regression model.
Assume that the data follow a linear regression model,

$$Y = X\beta + \varepsilon,$$

where $Y$ is defined as previously mentioned, $X$ denotes the $n \times p$ matrix of explanatory variables, $\beta = (\beta_1, \dots, \beta_p)^T$ represents the coefficient vector of the independent variables, and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^T$ is the random error vector. We assume that the errors $\varepsilon_i$ are independently and identically distributed random variables with a mean of 0 and a variance of $\sigma^2$. Additionally, we assume that $\frac{1}{n} X^T X \to C$, where $C$ is a positive definite matrix. Our proposed solution is to construct the MSP by utilizing the "solution path" [27] obtained from the adaptive lasso (Alasso) method [22]. This approach allows us to significantly reduce the number of candidate models.
Let us start by briefly introducing the relevant background on the Alasso and lars [21]. Assume that $\hat{\beta}$ is a $\sqrt{n}$-consistent estimator of $\beta$ (in this paper, we use the least squares estimator $\hat{\beta}_{\mathrm{ols}}$), choose $\gamma > 0$ (in this paper, we choose $\gamma = 1$), and define the weight vector $\hat{w} = 1/|\hat{\beta}|^{\gamma}$. Note that $\hat{\beta}_{\mathrm{ols}}$ and $\gamma = 1$ are the choices most commonly used in applications of the adaptive lasso. These parameter values are also the default settings in the adaptive lasso code available on the web (http://www4.stat.ncsu.edu/~boos/var.select/lasso.adaptive.html, accessed on 20 July 2023). Furthermore, it is important to note that the specific value of $\gamma$ within the range $\gamma > 0$ does not affect the large-sample properties of the MSP constructed by Algorithm 2 (see the proof in Appendix A.2 for details). For the sake of simplicity, we set $\gamma = 1$ in Algorithm 2.
The Alasso estimator $\hat{\beta}^{\mathrm{alasso}}$ is defined as follows:

$$\hat{\beta}^{\mathrm{alasso}} = \arg\min_{\beta} \Big\| Y - \sum_{j=1}^{p} x_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j|, \quad (7)$$

where $\lambda$ is a tuning parameter. Let $\mathcal{A} = \{j : \beta_j \ne 0\}$ represent the set of variables with nonzero true coefficients and $\mathcal{A}_n^{\lambda} = \{j : \hat{\beta}^{\mathrm{alasso}}_j \ne 0\}$ represent the set of variables with nonzero coefficients in the Alasso estimation results. According to Zou's proof [22], as the sample size $n$ tends toward infinity, we can choose an appropriate $\lambda$ such that $P(\mathcal{A}_n^{\lambda} = \mathcal{A}) \to 1$. In this case, the model constructed from the nonzero variables determined by this $\lambda$ is the true model. Therefore, if we consider all the models represented by the solutions of Equation (7) with $\lambda \in [0, +\infty)$ and form a model set $\mathcal{M}'$, then, as the sample size $n$ tends toward infinity, $\mathcal{M}'$ will contain the true model with probability tending to 1. It is known that the lars algorithm can provide all solutions of the Alasso with minimal computational effort. Therefore, we propose using the lars algorithm to generate the solution path of the Alasso, and we use the models $M'_1, \dots, M'_N$ formed by the nonzero variables in all solutions on the path to form $\mathcal{M}' = \{M'_1, \dots, M'_N\}$, where $N$ represents the number of steps required by the lars–Alasso algorithm to obtain the Alasso solution path. Specifically, the construction method of the proposed MSP is presented in Algorithm 2.
Algorithm 2 Alasso–lars (AL): the construction algorithm for the MSP
1: Compute the least squares estimate $\hat{\beta}_{\mathrm{ols}}$ under the full variable set, and calculate $\hat{w} = 1/|\hat{\beta}_{\mathrm{ols}}|$.
2: Define $x_j^{**} = x_j/\hat{w}_j$, $j = 1, \dots, p$.
3: Solve the lasso problem shown in Equation (8) using the lars–lasso algorithm:
$$\hat{\beta}^{**} = \arg\min_{\beta} \Big\| Y - \sum_{j=1}^{p} x_j^{**} \beta_j \Big\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|. \quad (8)$$
4: Assuming that the lars–lasso algorithm takes a total of $N$ steps to solve Equation (8), save the model consisting of the nonzero variables at each step and let $\mathcal{M}' = \{M'_1, \dots, M'_N\}$ be the new set of candidate models. Here, $M'_1, \dots, M'_N$ are mutually distinct models.
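Under the assumption that a standard lars implementation is available, Algorithm 2 can be sketched in Python with `sklearn.linear_model.lars_path`: rescale the columns by the adaptive weights, trace the lasso path on the rescaled design, and collect the distinct active sets along the path. The simulated data, the function name, and the use of OLS-based weights with $\gamma = 1$ follow the text above; everything else is illustrative.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(1)

# Illustrative data: only the first three of ten coefficients are nonzero.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X @ np.r_[2.0, -1.5, 1.0, np.zeros(p - 3)] + rng.normal(size=n)

def msp_alasso_lars(X, y, gamma=1.0):
    """Sketch of Algorithm 2 (Alasso-lars): build the model selection path.

    Step 1: OLS estimate and adaptive weights w_j = 1 / |beta_ols_j|**gamma.
    Step 2: rescale the columns, x_j** = x_j / w_j.
    Step 3: run the lars-lasso algorithm on the rescaled design (Equation (8)).
    Step 4: keep the distinct sets of nonzero variables found along the path.
    """
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    w = 1.0 / np.abs(beta_ols) ** gamma
    X_star = X / w                                    # column-wise rescaling
    _, _, coefs = lars_path(X_star, y, method="lasso")
    path = []
    for step in range(coefs.shape[1]):
        model = tuple(np.flatnonzero(coefs[:, step]))
        if model and model not in path:               # distinct, non-empty models only
            path.append(model)
    return path

msp = msp_alasso_lars(X, y)
print(f"{len(msp)} candidate models instead of 2^{p} = {2 ** p}")
print(msp)   # each tuple lists the variable indices of one model on the path
```

Only the distinct active sets matter for Step 4 of Algorithm 2, which is why duplicate models, which can appear when a variable is dropped and later re-enters the path, are filtered out in this sketch.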
Algorithm 2 essentially leverages the consistency of the Alasso in variable selection and the piecewise linearity, with respect to $\lambda$, of the solution to the Alasso problem in a linear regression model [28]; this solution path can be computed efficiently with the lars algorithm. In Theorem 2, we provide a proof of the effectiveness of the MSP constructed by Algorithm 2.