Article

Distance Correlation-Based Feature Selection in Random Forest

by Suthakaran Ratnasingam * and Jose Muñoz-Lopez
Department of Mathematics, California State University, San Bernardino, CA 92407, USA
* Author to whom correspondence should be addressed.
Entropy 2023, 25(9), 1250; https://doi.org/10.3390/e25091250
Submission received: 30 May 2023 / Revised: 20 July 2023 / Accepted: 21 August 2023 / Published: 23 August 2023
(This article belongs to the Special Issue Recent Advances in Statistical Inference for High Dimensional Data)

Abstract
The Pearson correlation coefficient ($\rho$) is a commonly used measure of correlation, but it is limited to measuring the linear relationship between two numerical variables. The distance correlation measures all types of dependence between random vectors X and Y in arbitrary dimensions, not just linear dependence. In this paper, we propose a filter method that utilizes distance correlation as a criterion for feature selection in random forest regression. We conduct extensive simulation studies to evaluate its performance against existing methods under various data settings, in terms of prediction mean squared error. The results show that our proposed method is competitive with existing methods and outperforms all other methods on high-dimensional ($p \ge 300$) nonlinearly related data sets. The applicability of the proposed method is also illustrated by two real data applications.

1. Introduction

Feature selection is a crucial aspect of model construction in machine learning. Its main objective is to identify the most significant features while eliminating irrelevant, redundant, and noisy ones; the process involves selecting a subset of the most prominent features. Feature selection is widely used for various reasons, including enhancing model interpretability, reducing learning time, improving learning accuracy, and overcoming the curse of dimensionality. It is employed in many fields, particularly in classification tasks such as bioinformatics data analysis, image recognition, and change point detection. Various techniques have been proposed in the literature for evaluating feature subsets in machine learning. The filter method, as described by [1,2], utilizes the intrinsic properties of the data to assess feature subsets. The wrapper method, as discussed by [3,4], determines the best subset of features for the task based on the performance of the learning algorithm. Finally, the hybrid approach, as described by [5,6,7], makes use of both filters and wrappers by utilizing independent criteria and learning algorithms to measure feature subsets. Additionally, the AIC and BIC criteria are used to identify the ‘best model’. One popular method is the Lasso, introduced by [8], which employs an $\ell_1$-regularized linear regression model. Other Lasso-based feature selection methods have been developed since then, such as the adaptive Lasso [9], LARS [10], and the elastic net [11], among others. However, when dealing with high-dimensional data, Lasso methods can face two significant problems: high computational cost and over-fitting. The correlation coefficient (CC), introduced by [12], is a criterion utilized in feature selection for multiple machine learning algorithms. Ref. [13] used the CC, amongst other measures, for feature selection in high-dimensional data analysis. Ref. [14] improved their models using the CC together with a clustering technique to filter out less important parameters. Ref. [15] used the CC for detecting daily activities in smart homes, where models rely heavily on selecting the appropriate features for these activities, and thus on feature selection.
Random forests (RF) is an ensemble learning algorithm first proposed by [16]. The method utilizes decision trees and can perform both classification and regression analyses. It does so by combining the bootstrap aggregation method and the random subspace method to generate a collection of decision trees, which are then used for prediction. When building a random forest, the best predictor from a randomly chosen subset of predictors is used to split each node. Although this approach may seem counterintuitive, it has proven more effective than other classifiers such as discriminant analysis, support vector machines, and neural networks. Additionally, ref. [16] showed that this approach is resistant to overfitting. According to [17], when a data set has a small number of relevant features and a large number of irrelevant features, RF algorithms may not attain the intended predictive performance, especially if the algorithm selects only a few features at each node. Several methods have been proposed in the literature to improve the performance of the traditional RF of [16]. For example, ref. [18] established consistency of a special type of purely random forest model in which strong variables have a larger probability of being selected as splitting variables. Ref. [19] proposed a modification to the standard RF algorithm called Reinforcement Learning Trees (RLT), which uses a specific type of splitting-variable selection and the muting of noise variables to prioritize strong variables in the initial stages of tree construction, gradually decreasing the number of candidate variables towards the terminal nodes. Ref. [20] investigated regression problems within the context of random forest algorithms by focusing on the selection of significant features that are strongly correlated with the response variable.
The Pearson product-moment correlation is one criterion for identifying features that exhibit high levels of correlation with the response. However, $\rho$ has some drawbacks. One issue is that it only measures the linear relationship between two random variables, X and Y. Additionally, $\rho = 0$ implies that X and Y are independent only if their joint distribution is bivariate normal. Furthermore, even if X and Y are dependent, $\rho$ can still be zero.
To remedy this, ref. [21] introduced the distance correlation (dCor), which measures all types of dependence between random vectors X and Y in arbitrary dimensions. The dCor is bounded between 0 and 1, and it equals zero only when the random vectors are independent. According to [22], the dCor is effective in identifying nonlinear relationships that cannot be detected by the Pearson correlation coefficient. Additionally, it can be used for random vectors of any dimension, unlike the Pearson correlation coefficient, which is limited to pairs of univariate variables. This paper introduces a new approach that incorporates the dCor as a pre-processing step in the conventional RF algorithm for high-dimensional nonlinear datasets. Specifically, we utilize the dCor to select the features that have a significant correlation with the response variable, which are then used in the construction of the RF.

2. Main Results

Consider a set of p features, $X = (X_1, \ldots, X_p)$, and the dependent variable Y. The goal is to estimate the regression function $f(x) = E(Y \mid X = x)$, and we assume that $Y = f(X) + \epsilon$. We observe a sample of i.i.d. training observations $\mathcal{D}_n = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, where each $X_i = (X_{i1}, \ldots, X_{ip})$ denotes a set of p variables from a feature space $\mathcal{X}$. Let the $\epsilon_i$'s be i.i.d. with mean 0 and variance $\sigma^2$, and let $p'$ denote the number of features retained after removing those that have little correlation with the response. The remaining $p - p'$ variables have no influence on the response. We also assume that the expected value $E(Y \mid X)$ is completely determined by a set of $p' < p$ variables, which means $E(Y \mid X) = E(Y \mid X_1, X_2, \ldots, X_{p'})$.
In their work, ref. [21] proposed a statistical measure called distance correlation (DC) that quantifies all forms of dependence between random vectors X and Y in arbitrary dimensions, unlike the Pearson CC, which is limited to pairs of univariate variables. The DC ranges from 0 to 1, and it equals 0 only when the random vectors are independent. According to [22], the DC is effective in detecting nonlinear relationships that cannot be detected by the Pearson CC. The underlying distance covariance measures the weighted distance between the joint characteristic function of the two variables and the product of their marginal characteristic functions. In the bivariate normal case, the DC is a deterministic function of the Pearson product-moment correlation $\rho$ (CC).
Definition 1. 
Supposing random variables X and Y have finite and positive variances, the distance correlation ($\mathcal{R}$) is defined as
$$\mathcal{R}(X, Y) = \frac{\mathrm{dCov}(X, Y)}{\sqrt{\mathrm{dCov}(X, X) \cdot \mathrm{dCov}(Y, Y)}},$$
where dCov ( X , Y ) is the distance covariance between random variables X and Y.
The $\mathrm{dCov}(X, Y)$ is defined as follows:
$$\mathrm{dCov}^2(X, Y) = \int_{\mathbb{R}^{p+q}} \left| f_{X,Y}(t, s) - f_X(t) f_Y(s) \right|^2 w(t, s) \, dt \, ds,$$
where $f_X(\cdot)$ and $f_Y(\cdot)$ are the characteristic functions of the random variables X (p-dimensional) and Y (q-dimensional), and $f_{X,Y}(\cdot, \cdot)$ is their joint characteristic function. The weight function is given by $w(t, s) = \left( c_p c_q \|t\|_p^{1+p} \|s\|_q^{1+q} \right)^{-1}$, where $c_d = \pi^{(1+d)/2} / \Gamma((1+d)/2)$. The calculation of $\mathrm{dCov}(X, Y)$ is more involved than the relatively simple calculation of the covariance used by the CC; fortunately, the R package “energy”, authored by Rizzo, implements this calculation. It is interesting to note that, according to [22], the population distance covariance coincides with the covariance with respect to Brownian motion, the random motion of particles suspended in a medium. In the same article, the distance correlation is described as the “natural extension” of the CC, and it is clear that the DC offers certain advantages over the CC.
In terms of advantages, DC surpasses CC in several ways. For example, while CC is restricted to pairs of univariate variables, DC can handle variables of any dimension. Moreover, $\mathcal{R}$ takes values in the closed interval [0, 1]. It is interesting to note that CC = 0 implies only the absence of linear correlation, not independence, whereas $\mathcal{R}(X, Y) = 0$ does imply independence between X and Y. Our aim is to utilize DC as the criterion for our filter method. Having these advantages over CC does not guarantee that our filter method will perform better than the one presented in [20]. Nonetheless, there is reason for optimism, since [23] employed DC as a feature selection criterion for energy polynomials, matching the performance of the unfiltered models with two orders of magnitude fewer parameters.
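To make these properties concrete, the sample distance correlation of [21] can be computed by double-centering pairwise distance matrices. The following is a minimal NumPy sketch for univariate X and Y (an illustration only, not the optimized “energy” implementation), applied to a purely nonlinear relationship that the Pearson coefficient misses:

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation (Szekely, Rizzo, and Bakirov, 2007)
    for univariate x and y; a minimal sketch of the univariate case."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # Double-center: A_jk = a_jk - row mean - column mean + grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                # squared sample distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else np.sqrt(max(dcov2, 0.0) / denom)

x = np.linspace(-1, 1, 201)
y = x ** 2                                # purely nonlinear dependence
pearson = np.corrcoef(x, y)[0, 1]         # ~0 by symmetry
dcor_xy = distance_correlation(x, y)      # clearly positive
```

On this deterministic example, `pearson` is numerically zero while `dcor_xy` is well above zero, illustrating that $\mathcal{R}(X, Y) = 0$ occurs only under independence.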

2.1. Feature Selection Method in Random Forest

Our focus is on exploring how distance correlation can facilitate feature selection. To this end, we employ a feature selection algorithm to enhance our machine-learning models, particularly random forests (Algorithm 1). The goal of our feature selection algorithm is to reduce the feature space by comparing the DC between each feature and the dependent variable against a threshold value, denoted by $\mathcal{R}^*$.
As outlined above, our approach involves creating a subset of the feature space using training data, which is then employed to train a random forest model. To achieve this, we first specify a threshold value $\mathcal{R}^*$. We then compute $\mathcal{R}(Y, X_i)$ for $i = 1, \ldots, p$. Based on the resulting distance correlation values, we identify a subset $X' \subseteq X$ that includes every feature $X_j$ satisfying $\mathcal{R}(Y, X_j) \ge \mathcal{R}^*$. We subsequently employ $X'$ to construct a random forest and compute the mean squared error (MSE) using test data.
Algorithm 1 Proposed DC-based Method
Given a training data set $\mathcal{D}_n$ and a set of distance correlation thresholds $\mathcal{R}^*$ of length s:
  • Compute the distance correlation between Y and each feature $X_j$, and rank the features by distance correlation.
  • For each threshold $\mathcal{R}^*$:
    (a) Eliminate the less correlated variables using the specified $\mathcal{R}^*$ as a threshold.
    (b) Using the new training data with the reduced feature space, construct a random forest using the Breiman RF algorithm.
  • Given the s constructed random forests, select the model with the minimum prediction error and record the corresponding value of $\mathcal{R}^*$.
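The screening step of Algorithm 1 can be sketched as follows. This is a hypothetical toy example (i.i.d. normal features and a Model 1-style response, not the paper's simulation), with the forest-fitting step left as a comment since any standard RF implementation can be plugged in:

```python
import numpy as np

def dcor(x, y):
    # Sample distance correlation (univariate case), as in Szekely et al. (2007).
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else np.sqrt(max((A * B).mean(), 0.0) / denom)

def dc_filter(X, y, threshold):
    """Step (a): keep only the features X_j with dCor(Y, X_j) >= threshold."""
    scores = np.array([dcor(X[:, j], y) for j in range(X.shape[1])])
    return np.where(scores >= threshold)[0], scores

# Toy training data: feature 0 dominates, features 4..9 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5 * X[:, 0] + X[:, 1] + X[:, 2] + X[:, 3] + rng.normal(size=200)

for threshold in (0.1, 0.3, 0.5):        # the candidate threshold set of size s
    keep, scores = dc_filter(X, y, threshold)
    # Step (b): fit a random forest on X[:, keep] (e.g., Breiman's RF);
    # the forest with the smallest test MSE across thresholds is then selected.
```

With these data, the dominant feature 0 survives every threshold, and its distance correlation score is by far the largest.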

2.2. Theoretical Results

In this section, we develop a large-sample theory for the proposed DC-based feature selection method. We assume that the features are statistically independent and that only the relevant ones are strongly correlated with the response variable. Consider the model
$$Y_i = f(X_i) + \epsilon_i.$$
As in [19], we assume a moment condition on the random error terms $\epsilon_i$. Our goal is to ensure that our variable importance measure still converges and that it depends only on the filtered features. The j-th variable importance is calculated by randomly permuting the values of $X_j$ in the out-of-bag sample; the permuted version is denoted by $\tilde{X}_j$. Given that we are using a regression tree and have chosen to minimize the sum of squared errors as our criterion, the resulting squared error after permutation can be calculated as
$$\mathbb{E}_{\tilde{X}_j}\left[\left(Y - \hat{f}(X_1, \ldots, \tilde{X}_j, \ldots, X_p)\right)^2\right].$$
We can express the variable importance for the j-th variable as follows:
$$VI_j = \mathbb{E}\left[\left(f(X_1, \ldots, \tilde{X}_j, \ldots, X_p) - f(X_1, \ldots, X_j, \ldots, X_p)\right)^2\right] = \mathbb{E}\left[\left(Y - f(X_1, \ldots, \tilde{X}_j, \ldots, X_p)\right)^2\right] - \mathbb{E}\left[\left(Y - f(X_1, \ldots, X_j, \ldots, X_p)\right)^2\right].$$
Theorem 1. 
Under assumptions 3.1, 3.2, 3.3, and 3.4 of [19], there exists a fixed constant $1 < B < \infty$ such that, for any $\kappa > 0$, the estimated variable importance converges to the true variable importance at an exponential rate. That is,
$$P\left(\left|\widehat{VI}_j - VI_j\right| > \kappa\right) \le e^{-\kappa \cdot n^{\nu(p')}/B},$$
where $0 < \nu(p') \le 1$ is a function of the dimension $p'$, the reduced number of features obtained using the DC-based filter method. Here $VI_j$ is the measure of variable importance for each retained variable j, as defined in Section 2.2, and $\widehat{VI}_j$ is its estimate.
Proof. 
Employing analogous reasoning as presented in [20], we can establish the validity of Theorem 1. Consequently, the detailed proof is omitted here. □
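To make the permutation-importance quantities of this section concrete, the sketch below estimates $VI_j$ by Monte Carlo for a known regression function; here the true f stands in for the fitted forest $\hat{f}$, and the toy model is an assumption for illustration, not one analyzed in the paper. Permuting an influential column inflates the squared error, while permuting a noise column leaves it unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5000, 6
X = rng.normal(size=(n, p))

def f(X):
    # Known regression function; columns 2..5 are pure noise features.
    return 5 * X[:, 0] + X[:, 1]

y = f(X) + rng.normal(size=n)            # epsilon ~ N(0, 1)

def variable_importance(f, X, y, j, rng):
    """Monte Carlo estimate of VI_j: the increase in mean squared error
    after randomly permuting column j (the permuted column plays the
    role of the out-of-bag X~_j)."""
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return np.mean((y - f(Xp)) ** 2) - np.mean((y - f(X)) ** 2)

vi = [variable_importance(f, X, y, j, rng) for j in range(p)]
# vi[0] is near 2 * 25 = 50, vi[1] near 2, and vi[2:] are exactly 0,
# since f ignores the noise columns entirely.
```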

3. Simulation Study

In this section, we perform a simulation study to assess the efficacy of our proposed method. In addition to the simulation setup used in [20], we examine two additional settings. For each setting, we generate 200 training samples and 1000 test samples. We evaluate the performance of our approach for various numbers of features, namely p = 80, 100, 300, and 500.
  • Under settings 1 & 2, we consider the following model:
    Model 1: $Y_i = 5 X_{i,1} + X_{i,2} + X_{i,3} + X_{i,4} + \epsilon_i$,
    where the $\epsilon_i$'s are random errors that are normally distributed with mean 0 and variance 1.
    Setting 1: Generate $X_i$ from a normal distribution $N(0_{p \times 1}, \Sigma_{p \times p})$, where $\Sigma_{i,j} = \rho^{|i-j|}$, with $\rho = 0.5$ and 0.8.
    Setting 2: Generate $X_i$ from a normal distribution $N(0_{p \times 1}, \Sigma_{p \times p})$, where $\Sigma_{i,j} = \rho^{|i-j|} + 0.2\, I(i \neq j)$, with $\rho = 0.5$.
  • Under setting 3, we consider the following model:
    Model 2: $Y_i = X_{i,1}^2 + X_{i,20} + X_{i,33}^3 + X_{i,55}^2 + \epsilon_i$,
    where the $\epsilon_i$'s are random errors that are normally distributed with mean 0 and variance 1.
    Setting 3: Generate $X_i$ from a normal distribution $N(0_{p \times 1}, \Sigma_{p \times p})$, where $\Sigma_{i,j} = \rho^{|i-j|}$, with $\rho = 0.8$.
  • Under setting 4, we consider the following model:
    Model 3: $Y_i = 100 \times (X_{i,1} - 0.5)^2 \times (X_{i,2} - 0.25)_{+} + \epsilon_i$,
    where $(\cdot)_{+}$ represents the positive part and the $\epsilon_i$'s are random errors that are normally distributed with mean 0 and variance 1.
    Setting 4: Generate $X_i$ from $\mathrm{Unif}[0,1]^p$.
The first step of our method involves calculating the distance correlation between the response variable Y and each feature variable $X_j$ for $j = 1, \ldots, p$. Next, we use pre-defined thresholds to select significant features. These thresholds are minimum distance correlation levels between Y and $X_j$, namely $\mathcal{R}^* = 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60$. If $\mathcal{R}^* = 0$, then all features are selected and included in the random forest regression. Conversely, if $\mathcal{R}^* = 0.5$, then only features with a distance correlation of at least 0.5 with the response variable are selected and added to the RF at each stage. We repeated the procedure 200 times to obtain reliable results.
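The screening stage of this simulation design can be reproduced in outline as follows (a NumPy sketch of Setting 1 with Model 1 and the threshold grid above; it reports only how many features survive each threshold, not the full 200-replicate RF comparison):

```python
import numpy as np

rng = np.random.default_rng(2023)
n, p, rho = 200, 100, 0.5

# Setting 1: X ~ N(0, Sigma) with AR(1)-type covariance Sigma_ij = rho^|i-j|.
idx = np.arange(p)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Model 1: Y_i = 5 X_i1 + X_i2 + X_i3 + X_i4 + eps_i.
y = 5 * X[:, 0] + X[:, 1] + X[:, 2] + X[:, 3] + rng.normal(size=n)

def dcor(u, v):
    # Sample distance correlation for univariate u and v.
    a = np.abs(u[:, None] - u[None, :])
    b = np.abs(v[:, None] - v[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else np.sqrt(max((A * B).mean(), 0.0) / denom)

scores = np.array([dcor(X[:, j], y) for j in range(p)])
thresholds = (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
n_kept = [int((scores >= t).sum()) for t in thresholds]
# The threshold 0 keeps all p features; larger thresholds screen more
# aggressively, shrinking the feature space passed to the random forest.
```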

3.1. Analysis of the Linear Models

Table 1 presents the results for all methods for Model 1 and setting 1 with ρ = 0.5 .
One evident trend is that increasing the number of parameters (p) increases the MSE; that is, the model's accuracy decreases as the number of parameters grows, which is expected. The RLTNo5 model, i.e., RLT without muting in which five features are utilized in the linear combination used to create a split candidate, performed significantly better than the other models. On the other hand, the traditional RF had the worst performance, which is consistent with our aim of enhancing the traditional RF with our methods. The optimal r threshold is likely between 0.4 and 0.6, although the optimal $\mathcal{R}^*$ threshold value is inconclusive. Nonetheless, the general trend indicates that as $\mathcal{R}^*$ increases, the MSE decreases. This suggests that the best model has $\mathcal{R}^* > 0.6$, but we found that this was not the case. For $\mathcal{R}^* > 0.6$, the model's accuracy decreased, and we even encountered errors for excessively high $\mathcal{R}^*$ values, since the model then discards all parameters and no random forest can be generated. It is probable that for these settings the optimal $\mathcal{R}^*$ threshold is between 0.5 and 0.7.
We observed a significant improvement in the CC method's performance in the RF model when r increased from 0.1 to 0.2 in the p = 500 column, resulting in a 44.7% decrease in MSE. Similarly, there was a 45.4% reduction in MSE when our method's threshold $\mathcal{R}^*$ increased from 0.4 to 0.5. The similarity in the magnitude of these MSE drops may be coincidental; however, we observed the same pattern for p = 80, 100, and 300. To clarify, let $\mathrm{MSE}_{\mathrm{DC}}(\mathcal{R}^*, p)$ represent the DC MSE at threshold $\mathcal{R}^*$ and dimension p, and let $\mathrm{MSE}_{\mathrm{CC}}(r, p)$ be the CC MSE at r and p. We noticed the following trend:
$$\left[\mathrm{MSE}_{\mathrm{DC}}(0.5, 80) - \mathrm{MSE}_{\mathrm{DC}}(0.4, 80)\right] - \left[\mathrm{MSE}_{\mathrm{CC}}(0.2, 80) - \mathrm{MSE}_{\mathrm{CC}}(0.1, 80)\right] = 0.0774$$
$$\left[\mathrm{MSE}_{\mathrm{DC}}(0.5, 100) - \mathrm{MSE}_{\mathrm{DC}}(0.4, 100)\right] - \left[\mathrm{MSE}_{\mathrm{CC}}(0.2, 100) - \mathrm{MSE}_{\mathrm{CC}}(0.1, 100)\right] = 0.0656$$
$$\left[\mathrm{MSE}_{\mathrm{DC}}(0.5, 300) - \mathrm{MSE}_{\mathrm{DC}}(0.4, 300)\right] - \left[\mathrm{MSE}_{\mathrm{CC}}(0.2, 300) - \mathrm{MSE}_{\mathrm{CC}}(0.1, 300)\right] = 0.0327$$
$$\left[\mathrm{MSE}_{\mathrm{DC}}(0.5, 500) - \mathrm{MSE}_{\mathrm{DC}}(0.4, 500)\right] - \left[\mathrm{MSE}_{\mathrm{CC}}(0.2, 500) - \mathrm{MSE}_{\mathrm{CC}}(0.1, 500)\right] = 0.0065$$
The DC-based model accuracy eventually improves to a comparable level with the CC-based model when R reaches approximately 0.5. However, this is not the optimal R value, just as r = 0.2 is not the optimal threshold. In this case, the CC method easily identifies the more important parameters, while the DC method is more cautious and does not filter out parameters with weak linear correlations. The best prediction MSEs are achieved at r = 0.5 for the CC method and R = 0.6 for the DC method. Although a higher R threshold is required for the DC method to optimize, the prediction MSE results are comparable to those of the CC method.
According to Table 2, we see the same optimal threshold values of r and R . The optimal MSEs for the DC and CC methods are even closer, but the CC method still has a slight edge. The race for the best MSE is now closer with RLT, but RLTNo5 remains the best model, while the traditional RF remains the least accurate. As the correlation between parameters and the response variable increases, the MSE generally decreases compared to Table 1.
In Table 3, we observe that the CC method outperforms our method and marks the first instance where a better model than RLTNo5 is identified. It is possible that the DC method could achieve comparable results at a higher threshold, but we did not have the opportunity to optimize this threshold for the DC method.

3.2. Analysis of the Nonlinear Model

In this section, we examine a nonlinear model as outlined in setting 3. The results are presented in Table 4.
These results are particularly exciting as they reveal the advantages of using DC as a feature selection criterion. It is worth noting that the CC method threshold stops at 0.3 because, as the data are not constructed under a linear model, setting a CC threshold higher than 0.3 will filter out all the parameters of the model, making it impossible to construct an RF. This is not the case with the DC method, as it is capable of detecting nonlinear correlations and allowing more parameters to survive the filter method. Although the CC method does not perform well in this case, we can see that RLT remains the best method for p = 80 , 100 , and 300. However, for the high-dimensional case, our proposed method performs best, indicating that it could be an improvement over RF in high-dimensional scenarios. In the future, it would be interesting to compare the proposed method with other machine learning techniques in high-dimensional datasets that exhibit nonlinear correlations. Additionally, we assess the benefits of the proposed method using the simulation setting employed in a previous study [19]. The outcomes of this analysis are presented in Table 5.
The performance of our DC method is outstanding compared to both traditional RF and the CC method in this nonlinear simulated dataset, similar to our other nonlinear simulated dataset. It is worth noting that the CC method has a lower threshold, as a threshold higher than 0.3 eliminates all parameters from the RF model. As we have previously observed, RLT performs exceptionally well here. However, as seen in setting 3, as the number of parameters increases, our proposed method appears to gain an advantage over RLT. Specifically, the DC-based feature selection method outperforms RLT for p = 300 and p = 500 . This once again supports the notion that the DC-based method may be an excellent candidate for high-dimensional data analysis.
According to Figure 1, it is evident that the DC-based method outperforms CC significantly. In setting 4, we observe that our optimal MSE is often less than half of the CC method’s MSE.

4. Applications

To illustrate the practical usage, we apply our proposed methods to two real datasets, which are provided below.
  • Riboflavin Data:
    This dataset contains riboflavin production by Bacillus subtilis. There are n = 71 observations of p = 4088 predictors (gene expressions) and a one-dimensional response variable.
  • Boston Housing Data:
This dataset contains housing data for 506 census tracts of Boston from the 1970 census. There are n = 506 observations of p = 13 predictors and a one-dimensional response variable.

4.1. Riboflavin Data

The riboflavin dataset is a widely used dataset found in the ‘hdi’ R package, provided by [24]. It consists of 71 observations of 4088 predictors, representing the expression levels of 4088 genes, and a single response variable, the riboflavin production of Bacillus subtilis. The objective of our study is to predict the log-transformed riboflavin production rate using gene expressions as predictors. This dataset is an example of a high-dimensional dataset, as the number of features is much larger than the number of observations, i.e., p > n. The results of our analysis are presented in Table 6.
To ensure stable results, we conduct 200 repetitions and calculate the average prediction mean squared error. The findings indicate that the CC-based feature selection method is much more precise than the RLT methods and significantly better than the traditional RF. Our proposed method comes in second place, with an optimal threshold of 0.7. It is worth noting that a better $\mathcal{R}^*$ threshold may exist in the range (0.65, 0.75).
The results indicate that the methods have similar accuracy, but the CC method performs better. In support of this, Figure 2 shows a continued decrease in MSE as the R threshold increases, suggesting that an optimal threshold may exist beyond 0.7. However, even with this potential for improvement, the results obtained with our proposed method are comparable at best to those of the CC method.
Figure 3 illustrates the diminishing returns of increasing the CC threshold and highlights the potential for a better prediction of mean squared error (MSE) by increasing the DC threshold.

4.2. Boston Housing Data

The Boston housing data set is provided by [25] and is a built-in data set in R. Unlike the riboflavin data set, it has a lower dimensionality, with only 13 predictors and a one-dimensional response variable. The data set contains 506 observations and provides information gathered from the 1970 census. The predictors include the per capita crime rate by town, the average number of rooms per dwelling, the pupil-teacher ratio by town, and other factors. The response variable is the median value of owner-occupied homes in $1000s. The objective is to use the available information, such as the per capita crime rate by town (CRIM), nitric oxides concentration (NOX), proportion of non-retail business acres per town (INDUS), and full-value property-tax rate per $10,000 (TAX), among others, to predict the median value of owner-occupied homes.
We applied the same methodology to analyze the Boston housing dataset, and the prediction MSE results are presented in Table 7. Similar to the Riboflavin dataset, we do not observe any improvement in the model by using the RLT method. However, we see slight improvements from the two filter methods compared to traditional RF. Furthermore, we notice that our proposed method slightly outperforms the CC method. Moreover, we observe that our proposed DC-based method has relatively stable results irrespective of the R , whereas the CC method shows an increasing trend in prediction MSE and results in almost three times the MSE of the traditional RF as r varies from 0.1 to 0.7.
Once again, the results obtained support the notion that our DC-based feature selection method is more conservative in eliminating predictors that are relevant to the RF model compared to other methods. It is interesting to note that a similar trend as seen in Figure 4 can also be observed in Figure 5, where the change in MSE of the CC method from r = 0.2 to 0.4 is similar to that of the DC method from R = 0.5 to 0.7. This change in MSE is approximately 12% for both methods, as r and R vary within those ranges.

5. Conclusions

In this paper, we proposed a novel variable selection procedure for RF using distance correlation. The proposed DC-based method performed very well in most cases, especially for nonlinear models. Although we anticipated that our approach would perform similarly to or better than the CC-based filter method, we were pleasantly surprised to find that it also outperformed the RLT methods under high-dimensional settings. Our approach consistently outperformed the traditional RF method and, for the nonlinear models, even outperformed the CC method. In the linearly simulated data, the DC method performed similarly to the CC method in most cases. However, optimizing the DC prediction MSE required a higher threshold, which is not surprising given that our method is more conservative in filtering features. This is not a significant disadvantage, except perhaps for computational cost, as more features are retained in the RF model construction; to address it, the threshold can be adjusted to a higher value.
We observed only one case where DC significantly underperformed the CC method, namely setting 2. In that case, a strong linear correlation was simulated, so the CC method was expected to perform well, which was indeed the case. In such situations, one can consider increasing the DC threshold to 0.7 or 0.8 and examining whether the prediction MSE improves and becomes comparable to that of the CC method, as we observed previously. Our method demonstrated superior performance in nonlinear models, particularly in high-dimensional cases, which motivates further exploration of high-dimensional datasets. Finally, two real data applications are provided to illustrate the advantages of the proposed methods.

Author Contributions

Conceptualization, S.R.; Methodology, S.R. and J.M.-L.; Software, S.R. and J.M.-L.; Validation, J.M.-L.; Formal analysis, J.M.-L. and S.R.; Writing—original draft, J.M.-L.; Writing—review & editing, S.R. and J.M.-L.; Supervision, S.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hall, M.A. Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; pp. 359–366. [Google Scholar]
  2. Dash, M.; Choi, K.; Scheuermann, P.; Liu, H. Feature selection for clustering—A filter solution. In Proceedings of the Second International Conference on Data Mining, Arlington, VA, USA, 11–13 April 2002; pp. 115–122. [Google Scholar]
  3. Caruana, R.; Freitag, D. Greedy attribute selection. In Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ, USA, 10–13 July 1994; pp. 28–36. [Google Scholar]
  4. Dy, J.G.; Brodley, C.E. Feature subset selection and order identification for unsupervised learning. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA, 29 June–2 July 2000; pp. 247–254.
  5. Ng, A.Y. On feature selection: Learning with exponentially many irrelevant features as training examples. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; pp. 404–412.
  6. Das, S. Filters, wrappers and a boosting-based hybrid for feature selection. In Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001; pp. 74–81.
  7. Xing, E.; Jordan, M.; Karp, R. Feature selection for high-dimensional genomic microarray data. In Proceedings of the Eighteenth International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001; pp. 601–608.
  8. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
  9. Zou, H. The adaptive Lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1428.
  10. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
  11. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 2005, 67, 301–320.
  12. Pearson, K. Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity, and Panmixia. Philos. Trans. R. Soc. Lond. Ser. A 1896, 187, 253–318.
  13. Cai, J.; Luo, J.; Wang, S.; Yang, S. Feature selection in machine learning: A new perspective. Neurocomputing 2018, 300, 70–79.
  14. Hsu, H.H.; Hsieh, C.W. Feature Selection via Correlation Coefficient Clustering. J. Softw. 2010, 5, 1371–1377.
  15. Liu, Y.; Mu, Y.; Chen, K.; Li, Y.; Guo, J. Daily activity feature selection in smart homes based on pearson correlation coefficient. Neural Process. Lett. 2020, 51, 1771–1787.
  16. Breiman, L. Random Forests; Technical Report; University of California: Berkeley, CA, USA, 2001.
  17. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning—Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2001.
  18. Biau, G.; Devroye, L.; Lugosi, G. Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 2008, 9, 2015–2033.
  19. Zhu, R.; Zeng, D.; Kosorok, M. Reinforcement learning trees. J. Am. Stat. Assoc. 2015, 110, 1770–1784.
  20. Wonkye, Y.T. Innovations of Random Forests for Longitudinal Data. Ph.D. Thesis, Bowling Green State University, OhioLINK Electronic Theses and Dissertations Center, Bowling Green, OH, USA, 2019.
  21. Székely, G.J.; Rizzo, M.L.; Bakirov, N.K. Measuring and testing independence by correlation of distances. Ann. Stat. 2007, 35, 2769–2794.
  22. Székely, G.J.; Rizzo, M.L. Brownian distance covariance. Ann. Appl. Stat. 2009, 3, 1236–1265.
  23. Das, R.; Kasieczka, G.; Shih, D. Feature Selection with Distance Correlation. arXiv 2022, arXiv:2212.00046.
  24. Bühlmann, P.; Kalisch, M.; Meier, L. High-dimensional statistics with a view toward applications in biology. Annu. Rev. Stat. Appl. 2014, 1, 255–278.
  25. Harrison, D., Jr.; Rubinfeld, D.L. Hedonic housing prices and the demand for clean air. J. Environ. Econ. Manag. 1978, 5, 81–102.
Figure 1. Prediction MSE Comparison for Models 2 and 3.
Figure 2. Boxplot for Prediction MSE Comparison for Riboflavin Data.
Figure 3. Prediction MSE Comparison for Riboflavin Data for CC and DC-based Methods.
Figure 4. Prediction MSE Comparison for Setting 1 (ρ = 0.5, 0.8) and Setting 2 with ρ = 0.5.
Figure 5. Prediction MSE Comparison for Boston Housing Data for CC and DC-based Methods.
Table 1. Prediction Mean Squared Error for Model 1 and Setting 1 with ρ = 0.5.

| Method | p = 80 | p = 100 | p = 300 | p = 500 |
|---|---|---|---|---|
| Traditional RF | 30.4468 | 32.3146 | 37.0157 | 39.9092 |
| RLT No 1 | 17.1149 | 18.2449 | 20.6827 | 22.2395 |
| RLT No 2 | 8.3586 | 9.2965 | 10.8636 | 12.1497 |
| RLT No 5 | 5.9539 | 6.8420 | 8.4067 | 9.5437 |
| RLT Mod 1 | 23.5688 | 24.9247 | 29.2962 | 31.4494 |
| RLT Mod 2 | 12.7399 | 13.8862 | 16.9476 | 19.1914 |
| RLT Mod 5 | 9.7806 | 10.9047 | 13.5140 | 15.6142 |
| CC (r = 0) | 30.4568 | 32.3099 | 36.9560 | 39.9454 |
| CC (r = 0.1) | 22.8696 | 24.6372 | 29.9454 | 33.0442 |
| CC (r = 0.2) | 16.5787 | 16.7887 | 16.9566 | 18.2652 |
| CC (r = 0.3) | 15.9218 | 15.8904 | 15.7830 | 15.7455 |
| CC (r = 0.4) | 13.3106 | 13.4890 | 13.0326 | 13.0766 |
| CC (r = 0.5) | 12.5500 | 12.8932 | 12.4917 | 12.5678 |
| CC (r = 0.6) | 16.4444 | 16.9558 | 16.4051 | 15.5541 |
| DC (R = 0) | 30.5103 | 32.2662 | 36.9264 | 39.9739 |
| DC (R = 0.1) | 30.4394 | 32.3129 | 36.9792 | 39.9157 |
| DC (R = 0.2) | 30.4860 | 32.2304 | 37.0245 | 39.8639 |
| DC (R = 0.3) | 30.2126 | 32.1138 | 37.0334 | 39.8655 |
| DC (R = 0.4) | 20.8794 | 22.2660 | 27.2499 | 30.5149 |
| DC (R = 0.5) | 16.7517 | 16.6341 | 16.3208 | 16.6678 |
| DC (R = 0.6) | 13.7511 | 13.8123 | 13.5889 | 13.3938 |
Table 2. Prediction Mean Squared Error for Model 1 and Setting 1 with ρ = 0.8.

| Method | p = 80 | p = 100 | p = 300 | p = 500 |
|---|---|---|---|---|
| Traditional RF | 16.4542 | 16.8286 | 20.2293 | 21.4920 |
| RLT No 1 | 11.1426 | 11.6650 | 13.5729 | 14.1749 |
| RLT No 2 | 6.8722 | 7.3101 | 8.8551 | 9.6527 |
| RLT No 5 | 5.4821 | 5.8649 | 7.3025 | 8.0649 |
| RLT Mod 1 | 14.9992 | 15.5370 | 18.7807 | 19.8693 |
| RLT Mod 2 | 10.3251 | 10.8486 | 13.8485 | 15.1718 |
| RLT Mod 5 | 8.4156 | 8.9015 | 11.3316 | 12.5533 |
| CC (r = 0) | 16.4618 | 16.8028 | 20.2333 | 21.5206 |
| CC (r = 0.1) | 13.0510 | 13.2847 | 16.1036 | 17.3913 |
| CC (r = 0.2) | 10.7760 | 10.5976 | 10.9608 | 11.1928 |
| CC (r = 0.3) | 10.2295 | 10.0385 | 10.0109 | 10.0872 |
| CC (r = 0.4) | 9.2580 | 9.0398 | 9.0732 | 9.1315 |
| CC (r = 0.5) | 8.5590 | 8.4243 | 8.5828 | 8.5259 |
| CC (r = 0.6) | 9.1113 | 9.0128 | 9.1327 | 9.0838 |
| DC (R = 0) | 16.4589 | 16.8685 | 20.2596 | 21.5370 |
| DC (R = 0.1) | 16.4747 | 16.8312 | 20.2180 | 21.5444 |
| DC (R = 0.2) | 16.4707 | 16.7899 | 20.1973 | 21.5172 |
| DC (R = 0.3) | 16.3218 | 16.7368 | 20.2653 | 21.5056 |
| DC (R = 0.4) | 12.2518 | 12.5301 | 14.5710 | 15.9063 |
| DC (R = 0.5) | 10.3558 | 10.2450 | 10.2731 | 10.3228 |
| DC (R = 0.6) | 9.4236 | 9.2640 | 9.3533 | 9.3839 |
Table 3. Prediction Mean Squared Error for Model 1 and Setting 2 with ρ = 0.5.

| Method | p = 80 | p = 100 | p = 300 | p = 500 |
|---|---|---|---|---|
| Traditional RF | 21.9640 | 23.6652 | 28.2053 | 30.0032 |
| RLT No 1 | 13.0988 | 14.2620 | 16.5793 | 17.3747 |
| RLT No 2 | 7.3378 | 8.2417 | 10.2177 | 11.1712 |
| RLT No 5 | 5.5720 | 6.3689 | 8.2305 | 9.2038 |
| RLT Mod 1 | 17.9596 | 19.3122 | 23.0986 | 24.3520 |
| RLT Mod 2 | 11.4233 | 12.5715 | 16.2147 | 17.8654 |
| RLT Mod 5 | 9.1465 | 10.2833 | 13.5496 | 15.2372 |
| CC (r = 0) | 21.9342 | 23.6987 | 28.1940 | 29.9885 |
| CC (r = 0.1) | 21.7451 | 23.6321 | 28.2193 | 29.9617 |
| CC (r = 0.2) | 20.9032 | 22.9340 | 27.5293 | 29.3341 |
| CC (r = 0.3) | 16.8882 | 18.6162 | 23.0721 | 25.1728 |
| CC (r = 0.4) | 11.9670 | 12.4959 | 13.4938 | 14.0448 |
| CC (r = 0.5) | 11.3873 | 11.7433 | 11.6022 | 11.3566 |
| CC (r = 0.6) | 9.0305 | 9.2198 | 9.3215 | 9.1254 |
| DC (R = 0) | 21.9021 | 23.7338 | 28.1547 | 29.9792 |
| DC (R = 0.1) | 21.8892 | 23.7192 | 28.1623 | 30.0492 |
| DC (R = 0.2) | 21.8888 | 23.6486 | 28.1887 | 30.0208 |
| DC (R = 0.3) | 21.8853 | 23.7239 | 28.2238 | 29.9949 |
| DC (R = 0.4) | 21.6011 | 23.4470 | 28.0920 | 29.8334 |
| DC (R = 0.5) | 19.3799 | 21.3558 | 26.1481 | 28.1744 |
| DC (R = 0.6) | 12.5753 | 13.4929 | 15.1863 | 16.6041 |
Table 4. Prediction Mean Squared Error for Model 2 and Setting 3 with ρ = 0.8.

| Method | p = 80 | p = 100 | p = 300 | p = 500 |
|---|---|---|---|---|
| Traditional RF | 9.4389 | 9.5245 | 10.4246 | 10.7869 |
| RLT No 1 | 8.6755 | 8.7385 | 9.4071 | 9.7955 |
| RLT No 2 | 8.5479 | 8.6631 | 9.4587 | 9.9032 |
| RLT No 5 | 8.6720 | 8.7762 | 9.5994 | 10.0118 |
| RLT Mod 1 | 9.6584 | 9.7615 | 10.7009 | 11.2133 |
| RLT Mod 2 | 9.7378 | 9.8579 | 10.9569 | 11.4871 |
| RLT Mod 5 | 9.8222 | 9.9758 | 11.0402 | 11.6132 |
| CC (r = 0) | 10.5241 | 10.4354 | 11.7246 | 12.1731 |
| CC (r = 0.1) | 11.0046 | 10.9849 | 12.0554 | 12.3790 |
| CC (r = 0.2) | 11.3745 | 11.1895 | 11.8162 | 11.9509 |
| CC (r = 0.3) | 10.8041 | 10.5800 | 10.9673 | 10.8763 |
| DC (R = 0) | 9.4371 | 9.5271 | 10.4387 | 10.7732 |
| DC (R = 0.1) | 9.4270 | 9.5461 | 10.4322 | 10.7692 |
| DC (R = 0.2) | 9.4465 | 9.5276 | 10.4433 | 10.7636 |
| DC (R = 0.3) | 9.4336 | 9.5344 | 10.4295 | 10.7577 |
| DC (R = 0.4) | 8.9385 | 8.9611 | 9.6091 | 9.8364 |
| DC (R = 0.5) | 9.4990 | 9.4992 | 9.5010 | 9.4111 |
| DC (R = 0.6) | 10.4607 | 10.4244 | 10.3874 | 10.3362 |
Table 5. Prediction Mean Squared Error for Model 3 and Setting 4.

| Method | p = 80 | p = 100 | p = 300 | p = 500 |
|---|---|---|---|---|
| Traditional RF | 6.1719 | 6.3132 | 7.0491 | 7.4381 |
| RLT No 1 | 2.4868 | 2.4958 | 2.9554 | 3.3648 |
| RLT No 2 | 2.5882 | 2.6486 | 3.3094 | 3.8033 |
| RLT No 5 | 2.8512 | 2.8675 | 3.5907 | 4.3271 |
| RLT Mod 1 | 3.1720 | 3.1258 | 3.8918 | 4.5346 |
| RLT Mod 2 | 3.6176 | 3.5186 | 4.5701 | 5.1221 |
| RLT Mod 5 | 3.7851 | 3.7519 | 4.8743 | 5.7918 |
| CC (r = 0) | 6.1638 | 6.2397 | 7.0040 | 7.4891 |
| CC (r = 0.1) | 8.6832 | 8.9644 | 9.0353 | 9.1730 |
| CC (r = 0.2) | 10.7540 | 10.7789 | 10.7112 | 10.5731 |
| CC (r = 0.3) | 12.2340 | 12.2444 | 11.8764 | 12.3109 |
| DC (R = 0) | 6.1879 | 6.1030 | 7.0218 | 7.5451 |
| DC (R = 0.1) | 6.1925 | 6.0984 | 6.9839 | 7.4811 |
| DC (R = 0.2) | 6.2513 | 6.0910 | 6.9863 | 7.4811 |
| DC (R = 0.3) | 6.1112 | 6.0962 | 7.0018 | 7.4744 |
| DC (R = 0.4) | 5.5324 | 5.5445 | 6.2003 | 6.7826 |
| DC (R = 0.5) | 2.6557 | 2.5385 | 2.8895 | 3.2704 |
| DC (R = 0.6) | 9.8633 | 9.5040 | 9.4988 | 9.9643 |
Table 6. Prediction Mean Squared Error for Riboflavin Data.

| Method | MSE |
|---|---|
| Traditional RF | 0.5029 |
| RLT No 1 | 0.5521 |
| RLT No 2 | 0.5459 |
| RLT No 5 | 0.5436 |
| RLT Mod 1 | 0.5555 |
| RLT Mod 2 | 0.5216 |
| RLT Mod 5 | 0.5623 |

| Threshold | CC (r) | DC (R) |
|---|---|---|
| 0.00 | 0.5026 | 0.5071 |
| 0.05 | 0.4936 | 0.5133 |
| 0.10 | 0.4866 | 0.5049 |
| 0.15 | 0.4654 | 0.5104 |
| 0.20 | 0.4521 | 0.5130 |
| 0.25 | 0.4356 | 0.5043 |
| 0.30 | 0.4217 | 0.5063 |
| 0.35 | 0.4083 | 0.5076 |
| 0.40 | 0.3864 | 0.5100 |
| 0.45 | 0.4076 | 0.5029 |
| 0.50 | 0.5594 | 0.4990 |
| 0.55 | 0.4175 | 0.4873 |
| 0.60 | 0.5565 | 0.4628 |
| 0.65 | NA | 0.4358 |
| 0.70 | NA | 0.4126 |
Table 7. Prediction Mean Squared Error for Boston Housing Data.

| Method | MSE |
|---|---|
| Traditional RF | 11.6123 |
| RLT No 1 | 16.5492 |
| RLT No 2 | 16.7430 |
| RLT No 5 | 16.0898 |
| RLT Mod 1 | 16.0028 |
| RLT Mod 2 | 15.6108 |
| RLT Mod 5 | 15.6015 |

| Threshold | CC (r) | DC (R) |
|---|---|---|
| 0.10 | 11.5548 | 11.5702 |
| 0.15 | 11.5674 | 11.5258 |
| 0.20 | 11.5926 | 11.5477 |
| 0.25 | 11.9115 | 11.5586 |
| 0.30 | 12.6297 | 11.5891 |
| 0.35 | 12.7505 | 11.5651 |
| 0.40 | 12.9315 | 11.5344 |
| 0.45 | 15.3672 | 11.5441 |
| 0.50 | 18.6801 | 11.5417 |
| 0.55 | 21.5029 | 11.5951 |
| 0.60 | 21.7865 | 11.9905 |
| 0.65 | 22.6410 | 12.5806 |
| 0.70 | 30.9052 | 13.0999 |
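The CC/DC threshold rows in the tables above correspond to a simple filter: score each feature against the response (Pearson correlation for CC, distance correlation for DC), keep only features whose score exceeds the threshold, and fit the random forest on the retained columns. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the function names `distance_correlation` and `dc_filter` are illustrative, and the downstream random-forest fit is omitted.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation of Székely, Rizzo, and Bakirov (2007)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    # Pairwise Euclidean distance matrices.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    # Double-center: subtract row and column means, add back the grand mean.
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()  # squared sample distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(max(dcov2, 0.0) / denom)) if denom > 0 else 0.0

def dc_filter(X, y, threshold):
    """Indices of features whose distance correlation with y exceeds the threshold."""
    return [j for j in range(X.shape[1])
            if distance_correlation(X[:, j], y) > threshold]
```

Because distance correlation is zero only under independence, this filter can retain features that are nonlinearly related to the response (e.g., y depending on x only through x²), which a Pearson-correlation filter would discard.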