Distance Correlation-Based Feature Selection in Random Forest
Abstract
1. Introduction
2. Main Results
2.1. Feature Selection Method in Random Forest
Algorithm 1 Proposed DC-based Method
Given a training data set $\mathcal{D}_{n}$ and the distance correlation set $\overrightarrow{\mathcal{R}^{\ast}}$ of length $s$,
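The steps of Algorithm 1 are not reproduced in this section, so the following is only a minimal sketch of the quantities it works with: the sample distance correlation between each feature and the response (computed from double-centered pairwise distances), followed by a simple threshold screen. The screening rule, the function names `distance_correlation` and `screen_features`, and the parameter `threshold` (standing in for $\mathcal{R}^{\ast}$) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def _double_centered_distances(v):
    """Pairwise Euclidean distances between observations, double-centered."""
    v = np.asarray(v, dtype=float)
    v = v.reshape(v.shape[0], -1)  # treat a 1-D sample as an n x 1 matrix
    diff = v[:, None, :] - v[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Subtract row and column means, add back the grand mean.
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation between two samples of equal length."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = max((a * b).mean(), 0.0)          # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()   # product of squared distance variances
    if dvar2 <= 0.0:
        return 0.0
    return float(np.sqrt(dcov2 / np.sqrt(dvar2)))


def screen_features(X, y, threshold):
    """Hypothetical screen: keep columns whose distance correlation
    with y is at least `threshold` (an assumed rule, see lead-in)."""
    X = np.asarray(X, dtype=float)
    scores = np.array([distance_correlation(X[:, j], y)
                       for j in range(X.shape[1])])
    keep = np.flatnonzero(scores >= threshold)
    return keep, scores
```

Under this sketch, the retained column indices in `keep` would be the ones passed on to the random forest; an exactly linear relationship gives a distance correlation of 1, and a symmetric quadratic relationship gives a positive value even though its Pearson correlation is zero.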

2.2. Theoretical Results
3. Simulation Study
Under Settings 1 and 2, we consider the following model:
$$\mathbf{Model\ 1:}\quad Y_{i}=5\left(X_{i,1}+X_{i,2}+X_{i,3}+X_{i,4}\right)+\epsilon_{i}$$
– Setting 1: Generate $X_{i}$ from a normal distribution $N\left(0_{p\times 1},\Sigma_{p\times p}\right)$, where $\Sigma_{i,j}=\rho^{|i-j|}$, with $\rho=0.5$ and $0.8$.
– Setting 2: Generate $X_{i}$ from a normal distribution $N\left(0_{p\times 1},\Sigma_{p\times p}\right)$, where $\Sigma_{i,j}=\rho^{|i-j|}+0.2\,I_{(i\ne j)}$, with $\rho=0.5$.
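The two settings for Model 1 can be sketched as a small data generator. This is an illustration only: it reads the covariance as $\Sigma_{i,j}=\rho^{|i-j|}$ (plus $0.2$ off the diagonal in Setting 2), and it assumes standard normal noise $\epsilon_i$, since the noise distribution and the sample size are not stated in this section. The names `ar1_covariance` and `simulate_model1` are hypothetical.

```python
import numpy as np


def ar1_covariance(p, rho, offset=0.0):
    """Sigma[i, j] = rho**|i - j| + offset * 1{i != j}."""
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    return sigma + offset * (1.0 - np.eye(p))


def simulate_model1(n, p, rho, offset=0.0, seed=None):
    """Draw (X, y) from Model 1; offset=0.0 gives Setting 1, offset=0.2 Setting 2."""
    rng = np.random.default_rng(seed)
    X = rng.multivariate_normal(np.zeros(p), ar1_covariance(p, rho, offset), size=n)
    eps = rng.standard_normal(n)  # assumption: N(0, 1) noise
    y = 5.0 * X[:, :4].sum(axis=1) + eps
    return X, y
```

For example, `simulate_model1(n, 80, rho=0.5)` would correspond to Setting 1 and `simulate_model1(n, 80, rho=0.5, offset=0.2)` to Setting 2, with `n` chosen by the user.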
Under Setting 3, we consider the following model:
$$\mathbf{Model\ 2:}\quad Y_{i}=X_{i,1}^{2}+X_{i,20}+X_{i,33}^{3}+X_{i,55}^{2}+\epsilon_{i}$$
– Setting 3: Generate $X_{i}$ from a normal distribution $N\left(0_{p\times 1},\Sigma_{p\times p}\right)$, where $\Sigma_{i,j}=\rho^{|i-j|}$, with $\rho=0.8$.
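A comparable sketch for Model 2 is given below. It again reads the covariance as $\rho^{|i-j|}$ and assumes standard normal noise, neither of which is confirmed in this section; `simulate_model2` is a hypothetical name. Note the shift from the 1-based indices of the model (1, 20, 33, 55) to 0-based column indices.

```python
import numpy as np


def simulate_model2(n, p, rho=0.8, seed=None):
    """Draw (X, y) from Model 2 under Setting 3 (requires p >= 55)."""
    if p < 55:
        raise ValueError("Model 2 uses feature 55, so p must be at least 55")
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    eps = rng.standard_normal(n)  # assumption: N(0, 1) noise
    # Columns 0, 19, 32, 54 correspond to X_{i,1}, X_{i,20}, X_{i,33}, X_{i,55}.
    y = X[:, 0] ** 2 + X[:, 19] + X[:, 32] ** 3 + X[:, 54] ** 2 + eps
    return X, y
```

Because two of the four active features enter only through even powers, their Pearson correlation with the response is zero in the population, which is the kind of dependence a distance-based measure is designed to pick up.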
Under Setting 4, we consider the following model:
$$\mathbf{Model\ 3:}\quad Y_{i}=100\times\left(X_{i,1}-0.5\right)^{2}\times\left(X_{i,2}-0.25\right)^{+}+\epsilon_{i}$$
– Setting 4: Generate $X_{i}$ from $\mathrm{Unif}[0,1]^{p}$.
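Model 3 can be sketched in the same way. The superscript $+$ is read here as the positive part, $(u)^{+}=\max(u,0)$, and the noise is assumed to be standard normal; both readings are assumptions, and the names `model3_signal` and `simulate_model3` are hypothetical.

```python
import numpy as np


def model3_signal(X):
    """Noise-free part of Model 3: 100 * (X1 - 0.5)^2 * max(X2 - 0.25, 0)."""
    X = np.asarray(X, dtype=float)
    return 100.0 * (X[:, 0] - 0.5) ** 2 * np.maximum(X[:, 1] - 0.25, 0.0)


def simulate_model3(n, p, seed=None):
    """Draw (X, y) from Model 3 under Setting 4 (independent Unif[0, 1] features)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, p))
    eps = rng.standard_normal(n)  # assumption: N(0, 1) noise
    return X, model3_signal(X) + eps
```

The signal vanishes whenever the first feature equals 0.5 or the second is at most 0.25, so only two of the $p$ features carry information and they interact multiplicatively.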
3.1. Analysis of the Linear Models
3.2. Analysis of the Nonlinear Model
4. Applications
Riboflavin Data: This dataset contains riboflavin production by Bacillus subtilis. There are $n=71$ observations of $p=4088$ predictors (gene expressions) and a one-dimensional response variable.
Boston Housing Data: This dataset contains housing data for 506 census tracts of Boston from the 1970 census. There are $n=506$ observations of $p=14$ predictors.
4.1. Riboflavin Data
4.2. Boston Housing Data
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
| Method | Variant or threshold | $p=80$ | $p=100$ | $p=300$ | $p=500$ |
|---|---|---|---|---|---|
| Traditional RF | | 30.4468 | 32.3146 | 37.0157 | 39.9092 |
| No | RLTNo1 | 17.1149 | 18.2449 | 20.6827 | 22.2395 |
| | RLTNo2 | 8.3586 | 9.2965 | 10.8636 | 12.1497 |
| | RLTNo5 | 5.9539 | 6.8420 | 8.4067 | 9.5437 |
| Moderate | RLTMod1 | 23.5688 | 24.9247 | 29.2962 | 31.4494 |
| | RLTMod2 | 12.7399 | 13.8862 | 16.9476 | 19.1914 |
| | RLTMod5 | 9.7806 | 10.9047 | 13.5140 | 15.6142 |
| CC $\left(r^{\ast}\right)$ | 0 | 30.4568 | 32.3099 | 36.9560 | 39.9454 |
| | 0.1 | 22.8696 | 24.6372 | 29.9454 | 33.0442 |
| | 0.2 | 16.5787 | 16.7887 | 16.9566 | 18.2652 |
| | 0.3 | 15.9218 | 15.8904 | 15.7830 | 15.7455 |
| | 0.4 | 13.3106 | 13.4890 | 13.0326 | 13.0766 |
| | 0.5 | 12.5500 | 12.8932 | 12.4917 | 12.5678 |
| | 0.6 | 16.4444 | 16.9558 | 16.4051 | 15.5541 |
| DC $\left(\mathcal{R}^{\ast}\right)$ | 0 | 30.5103 | 32.2662 | 36.9264 | 39.9739 |
| | 0.1 | 30.4394 | 32.3129 | 36.9792 | 39.9157 |
| | 0.2 | 30.4860 | 32.2304 | 37.0245 | 39.8639 |
| | 0.3 | 30.2126 | 32.1138 | 37.0334 | 39.8655 |
| | 0.4 | 20.8794 | 22.2660 | 27.2499 | 30.5149 |
| | 0.5 | 16.7517 | 16.6341 | 16.3208 | 16.6678 |
| | 0.6 | 13.7511 | 13.8123 | 13.5889 | 13.3938 |

| Method | Variant or threshold | $p=80$ | $p=100$ | $p=300$ | $p=500$ |
|---|---|---|---|---|---|
| Traditional RF | | 16.4542 | 16.8286 | 20.2293 | 21.4920 |
| No | RLTNo1 | 11.1426 | 11.6650 | 13.5729 | 14.1749 |
| | RLTNo2 | 6.8722 | 7.3101 | 8.8551 | 9.6527 |
| | RLTNo5 | 5.4821 | 5.8649 | 7.3025 | 8.0649 |
| Moderate | RLTMod1 | 14.9992 | 15.5370 | 18.7807 | 19.8693 |
| | RLTMod2 | 10.3251 | 10.8486 | 13.8485 | 15.1718 |
| | RLTMod5 | 8.4156 | 8.9015 | 11.3316 | 12.5533 |
| CC $\left(r^{\ast}\right)$ | 0 | 16.4618 | 16.8028 | 20.2333 | 21.5206 |
| | 0.1 | 13.0510 | 13.2847 | 16.1036 | 17.3913 |
| | 0.2 | 10.7760 | 10.5976 | 10.9608 | 11.1928 |
| | 0.3 | 10.2295 | 10.0385 | 10.0109 | 10.0872 |
| | 0.4 | 9.2580 | 9.0398 | 9.0732 | 9.1315 |
| | 0.5 | 8.5590 | 8.4243 | 8.5828 | 8.5259 |
| | 0.6 | 9.1113 | 9.0128 | 9.1327 | 9.0838 |
| DC $\left(\mathcal{R}^{\ast}\right)$ | 0 | 16.4589 | 16.8685 | 20.2596 | 21.5370 |
| | 0.1 | 16.4747 | 16.8312 | 20.2180 | 21.5444 |
| | 0.2 | 16.4707 | 16.7899 | 20.1973 | 21.5172 |
| | 0.3 | 16.3218 | 16.7368 | 20.2653 | 21.5056 |
| | 0.4 | 12.2518 | 12.5301 | 14.5710 | 15.9063 |
| | 0.5 | 10.3558 | 10.2450 | 10.2731 | 10.3228 |
| | 0.6 | 9.4236 | 9.2640 | 9.3533 | 9.3839 |

| Method | Variant or threshold | $p=80$ | $p=100$ | $p=300$ | $p=500$ |
|---|---|---|---|---|---|
| Traditional RF | | 21.9640 | 23.6652 | 28.2053 | 30.0032 |
| No | RLTNo1 | 13.0988 | 14.2620 | 16.5793 | 17.3747 |
| | RLTNo2 | 7.3378 | 8.2417 | 10.2177 | 11.1712 |
| | RLTNo5 | 5.5720 | 6.3689 | 8.2305 | 9.2038 |
| Moderate | RLTMod1 | 17.9596 | 19.3122 | 23.0986 | 24.3520 |
| | RLTMod2 | 11.4233 | 12.5715 | 16.2147 | 17.8654 |
| | RLTMod5 | 9.1465 | 10.2833 | 13.5496 | 15.2372 |
| CC $\left(r^{\ast}\right)$ | 0 | 21.9342 | 23.6987 | 28.1940 | 29.9885 |
| | 0.1 | 21.7451 | 23.6321 | 28.2193 | 29.9617 |
| | 0.2 | 20.9032 | 22.9340 | 27.5293 | 29.3341 |
| | 0.3 | 16.8882 | 18.6162 | 23.0721 | 25.1728 |
| | 0.4 | 11.9670 | 12.4959 | 13.4938 | 14.0448 |
| | 0.5 | 11.3873 | 11.7433 | 11.6022 | 11.3566 |
| | 0.6 | 9.0305 | 9.2198 | 9.3215 | 9.1254 |
| DC $\left(\mathcal{R}^{\ast}\right)$ | 0 | 21.9021 | 23.7338 | 28.1547 | 29.9792 |
| | 0.1 | 21.8892 | 23.7192 | 28.1623 | 30.0492 |
| | 0.2 | 21.8888 | 23.6486 | 28.1887 | 30.0208 |
| | 0.3 | 21.8853 | 23.7239 | 28.2238 | 29.9949 |
| | 0.4 | 21.6011 | 23.4470 | 28.0920 | 29.8334 |
| | 0.5 | 19.3799 | 21.3558 | 26.1481 | 28.1744 |
| | 0.6 | 12.5753 | 13.4929 | 15.1863 | 16.6041 |

| Method | Variant or threshold | $p=80$ | $p=100$ | $p=300$ | $p=500$ |
|---|---|---|---|---|---|
| Traditional RF | | 9.4389 | 9.5245 | 10.4246 | 10.7869 |
| No | RLTNo1 | 8.6755 | 8.7385 | 9.4071 | 9.7955 |
| | RLTNo2 | 8.5479 | 8.6631 | 9.4587 | 9.9032 |
| | RLTNo5 | 8.6720 | 8.7762 | 9.5994 | 10.0118 |
| Moderate | RLTMod1 | 9.6584 | 9.7615 | 10.7009 | 11.2133 |
| | RLTMod2 | 9.7378 | 9.8579 | 10.9569 | 11.4871 |
| | RLTMod5 | 9.8222 | 9.9758 | 11.0402 | 11.6132 |
| CC $\left(r^{\ast}\right)$ | 0 | 10.5241 | 10.4354 | 11.7246 | 12.1731 |
| | 0.1 | 11.0046 | 10.9849 | 12.0554 | 12.3790 |
| | 0.2 | 11.3745 | 11.1895 | 11.8162 | 11.9509 |
| | 0.3 | 10.8041 | 10.5800 | 10.9673 | 10.8763 |
| DC $\left(\mathcal{R}^{\ast}\right)$ | 0 | 9.4371 | 9.5271 | 10.4387 | 10.7732 |
| | 0.1 | 9.4270 | 9.5461 | 10.4322 | 10.7692 |
| | 0.2 | 9.4465 | 9.5276 | 10.4433 | 10.7636 |
| | 0.3 | 9.4336 | 9.5344 | 10.4295 | 10.7577 |
| | 0.4 | 8.9385 | 8.9611 | 9.6091 | 9.8364 |
| | 0.5 | 9.4990 | 9.4992 | 9.5010 | 9.4111 |
| | 0.6 | 10.4607 | 10.4244 | 10.3874 | 10.3362 |

| Method | Variant or threshold | $p=80$ | $p=100$ | $p=300$ | $p=500$ |
|---|---|---|---|---|---|
| Traditional RF | | 6.1719 | 6.3132 | 7.0491 | 7.4381 |
| | RLTNo1 | 2.4868 | 2.4958 | 2.9554 | 3.3648 |
| | RLTNo2 | 2.5882 | 2.6486 | 3.3094 | 3.8033 |
| | RLTNo5 | 2.8512 | 2.8675 | 3.5907 | 4.3271 |
| | RLTMod1 | 3.1720 | 3.1258 | 3.8918 | 4.5346 |
| | RLTMod2 | 3.6176 | 3.5186 | 4.5701 | 5.1221 |
| | RLTMod5 | 3.7851 | 3.7519 | 4.8743 | 5.7918 |
| CC $\left(r^{\ast}\right)$ | 0 | 6.1638 | 6.2397 | 7.0040 | 7.4891 |
| | 0.1 | 8.6832 | 8.9644 | 9.0353 | 9.1730 |
| | 0.2 | 10.7540 | 10.7789 | 10.7112 | 10.5731 |
| | 0.3 | 12.2340 | 12.2444 | 11.8764 | 12.3109 |
| DC $\left(\mathcal{R}^{\ast}\right)$ | 0 | 6.1879 | 6.1030 | 7.0218 | 7.5451 |
| | 0.1 | 6.1925 | 6.0984 | 6.9839 | 7.4811 |
| | 0.2 | 6.2513 | 6.0910 | 6.9863 | 7.4811 |
| | 0.3 | 6.1112 | 6.0962 | 7.0018 | 7.4744 |
| | 0.4 | 5.5324 | 5.5445 | 6.2003 | 6.7826 |
| | 0.5 | 2.6557 | 2.5385 | 2.8895 | 3.2704 |
| | 0.6 | 9.8633 | 9.5040 | 9.4988 | 9.9643 |

| Method | Variant | Value |
|---|---|---|
| Traditional RF | | 0.5029 |
| No | RLTNo1 | 0.5521 |
| | RLTNo2 | 0.5459 |
| | RLTNo5 | 0.5436 |
| Moderate | RLTMod1 | 0.5555 |
| | RLTMod2 | 0.5216 |
| | RLTMod5 | 0.5623 |

| Threshold | CC $\left(r^{\ast}\right)$ | DC $\left(\mathcal{R}^{\ast}\right)$ |
|---|---|---|
| 0.00 | 0.5026 | 0.5071 |
| 0.05 | 0.4936 | 0.5133 |
| 0.10 | 0.4866 | 0.5049 |
| 0.15 | 0.4654 | 0.5104 |
| 0.20 | 0.4521 | 0.5130 |
| 0.25 | 0.4356 | 0.5043 |
| 0.30 | 0.4217 | 0.5063 |
| 0.35 | 0.4083 | 0.5076 |
| 0.40 | 0.3864 | 0.5100 |
| 0.45 | 0.4076 | 0.5029 |
| 0.50 | 0.5594 | 0.4990 |
| 0.55 | 0.4175 | 0.4873 |
| 0.60 | 0.5565 | 0.4628 |
| 0.65 | NA | 0.4358 |
| 0.70 | NA | 0.4126 |

| Method | Variant | Value |
|---|---|---|
| Traditional RF | | 11.6123 |
| No | RLTNo1 | 16.5492 |
| | RLTNo2 | 16.7430 |
| | RLTNo5 | 16.0898 |
| Moderate | RLTMod1 | 16.0028 |
| | RLTMod2 | 15.6108 |
| | RLTMod5 | 15.6015 |

| Threshold | CC $\left(r^{\ast}\right)$ | DC $\left(\mathcal{R}^{\ast}\right)$ |
|---|---|---|
| 0.1 | 11.5548 | 11.5702 |
| 0.15 | 11.5674 | 11.5258 |
| 0.2 | 11.5926 | 11.5477 |
| 0.25 | 11.9115 | 11.5586 |
| 0.3 | 12.6297 | 11.5891 |
| 0.35 | 12.7505 | 11.5651 |
| 0.4 | 12.9315 | 11.5344 |
| 0.45 | 15.3672 | 11.5441 |
| 0.5 | 18.6801 | 11.5417 |
| 0.55 | 21.5029 | 11.5951 |
| 0.6 | 21.7865 | 11.9905 |
| 0.65 | 22.6410 | 12.5806 |
| 0.7 | 30.9052 | 13.0999 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ratnasingam, S.; Muñoz-Lopez, J. Distance Correlation-Based Feature Selection in Random Forest. Entropy 2023, 25, 1250. https://doi.org/10.3390/e25091250