4.1. Simulated Experiments
For the simulated data, we consider the noise term in (4) following the Gaussian distribution scaled by 0.05 and the same target functions as adopted in previous papers [9,34]. Among them, the target function is non-additive in Examples 1, 2 and 4 and additive in Example 3.
Example 1. The non-additive model (from [9]), where the input variables are i.i.d. and drawn from the uniform distribution.
Example 2. The non-additive model (from [9]). In this situation, the input variables are drawn from a multivariate Gaussian distribution, where Σ denotes the covariance matrix with the (i, j)th entry specified as in [9].
Example 3. The additive model (from [9]). Here, we consider three cases of the input dimension and both correlated and uncorrelated input variables.
Example 4. The non-additive model (from [34]). In this case, the input variables are sampled from a given distribution; when the correlation parameter equals zero, the input variables are independent, whereas a nonzero value means they are correlated. We also consider three cases of the input dimension with different choices of the correlation parameter.
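To illustrate how correlated Gaussian inputs of this kind can be generated, the following is a minimal sketch. The AR(1)-type covariance Σ(i, j) = ρ^|i−j| is an assumption for illustration; the exact covariance entries used in the examples are those specified in [9].

```python
import numpy as np

def correlated_gaussian_inputs(n, p, rho, seed=0):
    """Draw n samples of p-dimensional Gaussian inputs with an
    AR(1)-type covariance Sigma[i, j] = rho ** |i - j|.
    (This covariance form is an illustrative assumption.)"""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.zeros(p), sigma, size=n)

# Draw 500 samples of 10 correlated input variables.
X = correlated_gaussian_inputs(n=500, p=10, rho=0.5)
```

Setting ρ = 0 recovers the uncorrelated (independent) setting, so the same generator covers both regimes considered above.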
As mentioned above, three parameters, including E, L and the trade-off parameter, are involved in our algorithm when performing variable selection. In the experiments, we let each of them range over a grid of candidate values, while the parameters of the compared algorithms are tuned near their default values. The final values are chosen by cross-validation. Finally, each algorithm assigns each variable a statistic that measures its importance.
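The grid search over candidate parameter values can be organized as a standard k-fold cross-validation loop. The sketch below uses a hypothetical callable `fit_score` and placeholder grids, not the paper's actual values or training routine:

```python
import numpy as np
from itertools import product

def cross_validate(fit_score, X, y, grid, k=5, seed=0):
    """Return the parameter combination with the best average
    validation score over k folds. `fit_score` is a hypothetical
    callable: (train, val, params) -> score."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    best, best_score = None, -np.inf
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        scores = []
        for i in range(k):
            val = folds[i]
            tr = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(fit_score((X[tr], y[tr]), (X[val], y[val]), params))
        if np.mean(scores) > best_score:
            best, best_score = params, np.mean(scores)
    return best

# Toy usage: the score peaks at E = 100, L = 3 by construction.
grid = {"E": [50, 100, 200], "L": [2, 3]}
X, y = np.zeros((20, 4)), np.zeros(20)
score = lambda tr, va, p: -(p["E"] - 100) ** 2 - (p["L"] - 3) ** 2
best = cross_validate(score, X, y, grid)
```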
For simplicity, we assume that the number of relevant variables is known, and the variables with the largest statistics are selected accordingly. Three common metrics are adopted to measure the performance of each algorithm: TP (the average number of selected truly relevant variables), FP (the average number of selected truly irrelevant variables) and STP (the standard deviation of TP). Generally, a higher TP together with a lower FP and STP indicates a better variable-selection algorithm.
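The three metrics follow directly from the selected index sets across repetitions; a minimal sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def selection_metrics(selected_sets, relevant):
    """TP: average count of selected truly relevant variables;
    FP: average count of selected truly irrelevant variables;
    STP: standard deviation of the per-run TP counts."""
    relevant = set(relevant)
    tp_counts = [len(set(s) & relevant) for s in selected_sets]
    fp_counts = [len(set(s) - relevant) for s in selected_sets]
    return (float(np.mean(tp_counts)),
            float(np.mean(fp_counts)),
            float(np.std(tp_counts)))

# Two repetitions with true relevant variables {0, 1, 2}:
# run 1 selects all three, run 2 misses one and picks an irrelevant one.
tp, fp, stp = selection_metrics([[0, 1, 2], [0, 1, 5]], relevant=[0, 1, 2])
```

Here TP = 2.5, FP = 0.5 and STP = 0.5, matching the definitions above.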
We repeated the experiments 20 times, with a new observation set generated in each circumstance. The average variable-selection results of each algorithm are presented in Table 2, Table 3 and Table 4. The optimal results in each circumstance are marked in bold, where DeepVIB (G) means that the latent variable Z obeys the Gaussian distribution, while DeepVIB (E) indicates the exponential distribution. The reported results tell us the following.
(1) Table 2 shows that the TPs of the proposed DeepVIB (G) and DeepVIB (E) in Examples 1 and 2 always equal the number of truly relevant variables, while the FPs and STPs are always 0. This means that the proposed method can perfectly select all the relevant variables in every run. In comparison, the other algorithms always select some irrelevant variables in Example 1 or 2. This supports the superiority of DeepVIB for variable selection.
(2) Table 3 and Table 4 show the performance of all variable-selection algorithms in Examples 3 and 4. Generally, the performance degrades when p increases or when the input variables become correlated, which is consistent with the phenomenon observed in Examples 1 and 2. In particular, the proposed DeepVIB (G) and DeepVIB (E) can select all the relevant variables in several of the considered cases, which is better performance than that of the other variable-selection algorithms.
(3) From the empirical results in Table 3 and Table 4, we also note that DeepVIB (G) provides consistently larger TPs than DeepVIB (E) in some cases. This indicates that DeepVIB can achieve better variable-selection results when the latent variable Z obeys the Gaussian distribution. An empirical reason may be that the Gaussian distribution can generate a larger range on Z than the exponential distribution, providing the neural network more choices when mapping Y from Z. Hence, we prefer the Gaussian distribution when estimating the distribution of Z in DeepVIB.
(4) As for the other classical variable-selection algorithms, the experimental results depend on their properties. Methods such as Elastic Net and SCAD were proposed for linear models, and their performance on nonlinear examples is generally inferior to that of methods designed for nonlinear models, such as Lasso Net and RF. We also note that Lasso Net performed perfectly in Example 4 but not in Example 3; the reason may be that the input data are drawn from different distributions and that the target function in Example 3 is more complex than the one in Example 4. In addition, SpAM performs well except in Example 2. An underlying reason is that SpAM was originally proposed with an additional additivity assumption on the target function for sparse variable selection. In fact, Example 2 contains input variables in the denominator, which is not consistent with the additive assumption, whereas Examples 1 and 4 are closer to additive models except for an interaction term between two of the inputs. Different from SpAM, Lasso Net is based on DNNs and thus does not rely on a specific assumption about the target function. The strong representation and approximation ability of DNNs ensures that Lasso Net can achieve satisfactory performance on the four examples. This also supports our motivation to investigate DeepVIB for variable selection.
To highlight the influence of the trade-off parameter in the proposed method, we define a new statistic from the empirical selection results, where a larger value of this statistic means better selection results. Performing experiments on Example 2, the statistic versus the trade-off parameter is shown in Figure 2. The experimental results show that the statistic achieves its maximum at a moderate parameter value in this situation, which performs better than a small value close to zero (the case of total compression).
4.2. Real Experiments
We also conducted experiments on three real-world datasets for regression problems. These data were selected from the UCI machine-learning datasets, including Boston Housing Price (BHP), California Housing Price (CHP) and Diabetes.
To be specific, the BHP dataset contains 506 observations with 13 input variables: the per capita crime rate by town (CRIM), proportion of residential land zoned for lots (ZN), proportion of non-retail business acres per town (INDUS), Charles River dummy variable (CHAS), nitric oxides concentration (NOX), average number of rooms per dwelling (RM), AGE, weighted distances to five Boston employment centers (DIS), index of accessibility to radial highways (RAD), TAX, pupil–teacher ratio by town (PTRATIO), the proportion of black residents by town (B) and lower status of the population (LSTAT). These variables are considered relevant, and 13 irrelevant pseudo-variables are generated from a fixed noise distribution and appended to the data.
The CHP dataset consists of 20,640 instances with eight variables: the median household income (MedInc), median age of home (HouseAge), average number of rooms (AveRooms), average number of bedrooms (AveBedrms), population, average occupancy (AveOccup), latitude and longitude. Similarly, 22 irrelevant pseudo-variables are generated and appended in the same manner. For the Diabetes dataset, 442 instances were collected with 10 attributes: age, sex, BMI, BP, s1, s2, s3, s4, s5 and s6. An additional 20 irrelevant pseudo-variables are generated analogously.
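Augmenting a real dataset with irrelevant pseudo-variables amounts to appending independent noise columns to the design matrix. A minimal sketch, assuming standard Gaussian noise (the paper's exact noise distribution is not stated in this excerpt):

```python
import numpy as np

def add_pseudo_variables(X, n_noise, seed=0):
    """Append n_noise irrelevant pseudo-variables to the design
    matrix X. Standard Gaussian noise is an illustrative
    assumption, not necessarily the paper's choice."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((X.shape[0], n_noise))
    return np.hstack([X, noise])

# BHP-like setup: 506 observations, 13 real variables,
# plus 13 pseudo-variables -> 26 columns in total.
X_real = np.random.default_rng(1).standard_normal((506, 13))
X_aug = add_pseudo_variables(X_real, n_noise=13)
```

Since the appended columns are independent of the response by construction, a good variable-selection algorithm should rank all of them as irrelevant.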
Since the truly informative variables are unknown, Table 5 presents the two most-relevant and the two most-irrelevant variables in BHP, CHP and Diabetes as selected by each algorithm. From Table 5, we can see that the proposed algorithm considered the pseudo-variables as irrelevant and the real variables as relevant in all three datasets. The selected relevant variables generally coincide with those selected by the other algorithms.
In contrast, the Knockoff and SCAD algorithms incorrectly identified one real variable as irrelevant in the BHP dataset. The Lasso and Elastic Net algorithms each incorrectly identified one real variable, while the Knockoff algorithm identified two real variables as irrelevant in the CHP dataset. The SCAD, RF, AdaBoost and SpAM algorithms each incorrectly identified one real variable as irrelevant in the Diabetes dataset. This also supports the superiority of DeepVIB for variable selection.