High Per Parameter: A Large-Scale Study of Hyperparameter Tuning for Machine Learning Algorithms
Abstract
1. Introduction
2. Previous Work
The Current Study
 Consider significantly more algorithms;
 Consider significantly more datasets;
 Consider Bayesian optimization, rather than weaker-performing random search or grid search.
3. Experimental Setup
 Datasets;
 Algorithms;
 Metrics;
 Hyperparameter tuning;
 Overall flow.
3.1. Datasets
3.2. Algorithms
3.3. Metrics
 Accuracy: the fraction of correct predictions (∈$[0,1]$).
 Balanced accuracy: an accuracy score that takes class imbalance into account, essentially the accuracy score with class-balanced sample weights [14] (∈$[0,1]$).
 F1 score: the harmonic mean of precision and recall; in the multiclass case, this is the weighted average of the per-class F1 scores (∈$[0,1]$).
 R^{2} score: the ${R}^{2}$ (coefficient of determination) regression score function (∈$[-\infty ,1]$).
 Adjusted R^{2} score: a modified version of the R^{2} score that adjusts for the number of predictors in a regression model. It is defined as $1-(1-R^{2})\cdot (n-1)/(n-p-1)$, with $R^{2}$ being the R^{2} score, n being the number of samples, and p being the number of features (∈$[-\infty ,1]$).
 Complement RMSE: the complement of the root mean squared error (RMSE), defined as $1-\mathrm{RMSE}$ (∈$[-\infty ,1]$). This has the same range as the previous two metrics.
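To make these definitions concrete, the following minimal sketch computes all six metrics with scikit-learn; adjusted_r2 and complement_rmse are our own helper names (scikit-learn provides the other four directly):

import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, mean_squared_error, r2_score)

def adjusted_r2(y_true, y_pred, p):
    # 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p the number of features
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def complement_rmse(y_true, y_pred):
    # 1 - RMSE
    return 1 - np.sqrt(mean_squared_error(y_true, y_pred))

# Classification metrics:
#   accuracy_score(y_true, y_pred)
#   balanced_accuracy_score(y_true, y_pred)
#   f1_score(y_true, y_pred, average='weighted')  # weighted multiclass F1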
3.4. Hyperparameter Tuning
3.5. Overall Flow
 Optuna is run over the training set for 50 trials to tune the model’s hyperparameters; the best model is retained, and its test-set metric score is computed (a code sketch of this flow follows Algorithm 1 below).
 Fifty models are evaluated over the training set with default hyperparameters; the best model is retained, and its test-set metric score is computed. Strictly speaking, a few algorithms (decision tree, KNN, Bayesian) are essentially deterministic. For consistency, we still performed the 50 default-hyperparameter trials. Further, our examination of the respective implementations revealed possible randomness; e.g., for decision tree, when max_features < n_features, the algorithm will select max_features features at random. Although the default is max_features = n_features, we took no chances on hidden randomness deep within the code.
Algorithm 1 Experimental setup (per algorithm and dataset)

# 'metric1', 'metric2', 'metric3' are, respectively:
#   · For classification: accuracy, balanced accuracy, F1
#   · For regression: R^{2}, adjusted R^{2}, complement RMSE
# eval_score is the 5-fold cross-validation score
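To make the tuned arm of this flow concrete, here is a minimal Optuna sketch for a single (algorithm, dataset) pair, using RandomForestClassifier and the search space of Table 1; this is an illustration under our own naming, not the study's exact code, and X_train, y_train, X_test, y_test are assumed given:

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search space mirrors the RandomForestClassifier rows of Table 1
    # ('auto' for max_features is omitted; it is deprecated in recent scikit-learn)
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int('n_estimators', 10, 1000, log=True),
        min_weight_fraction_leaf=trial.suggest_float('min_weight_fraction_leaf', 0.0, 0.5),
        max_features=trial.suggest_categorical('max_features', ['sqrt', 'log2']),
    )
    # eval_score: 5-fold cross-validation score over the training set
    return cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)  # 50 trials, as described above
best_model = RandomForestClassifier(**study.best_params).fit(X_train, y_train)
test_score = best_model.score(X_test, y_test)  # test-set metric score (accuracy)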

4. Results
 The 13 algorithms and 9 measures of Table 2 are considered (separately for classifiers and regressors) as a dataset with 13 samples and the following 9 features: metric1_median, metric2_median, metric3_median, metric1_mean, metric2_mean, metric3_mean, metric1_std, metric2_std, metric3_std.
 Scikit-learn’s RobustScaler is applied, which scales features using statistics that are robust to outliers: “This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile). Centering and scaling happen independently on each feature…” [14].
 The hp_score of an algorithm is then simply the mean of its nine scaled features.
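A minimal sketch of this computation, assuming the Table 2 summaries have been assembled into a 13 × 9 NumPy array (the random placeholder below stands in for the real values):

import numpy as np
from sklearn.preprocessing import RobustScaler

# X: 13 algorithms x 9 features (median, mean, std of each of the 3 metrics);
# random placeholder values stand in for the actual Table 2 summaries
X = np.random.rand(13, 9)

scaled = RobustScaler().fit_transform(X)  # per feature: subtract median, divide by IQR
hp_score = scaled.mean(axis=1)            # mean of the nine scaled features, per algorithm
ranking = np.argsort(hp_score)[::-1]      # sort algorithms by hp_score, descending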
5. Discussion
 Decide how much to invest in hyperparameter tuning of a particular algorithm;
 Select algorithms that require less tuning, hopefully saving time as well as energy [15].
6. Concluding Remarks
 Algorithms may be added to the study.
 Datasets may be added to the study.
 Hyperparameters that have not been considered herein may be added.
 Specific components of the setup may be modified (e.g., the metrics and the scaler of Algorithm 1).
 Additional summary scores, like the hp_score, may be devised.
 For algorithms at the top of the lists in Table 3, we may inquire whether particular hyperparameters are the root cause of their hyperparameter sensitivity; further, we may seek out better defaults. For example, [16] recently focused on hyperparameter tuning for KernelRidge, which is at the top of the regressor list in Table 3. Ref. [6] discussed the tunability of a specific hyperparameter, though they noted the problem of hyperparameter dependency.
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
 Bergstra, J.; Yamins, D.; Cox, D.D. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, Austin, TX, USA, 11–17 July 2013; Volume 13, p. 20.
 Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2623–2631.
 Sipper, M.; Moore, J.H. AddGBoost: A gradient boosting-style algorithm based on strong learners. Mach. Learn. Appl. 2022, 7, 100243.
 Sipper, M. Neural networks with à la carte selection of activation functions. SN Comput. Sci. 2021, 2, 1–9.
 Bischl, B.; Binder, M.; Lang, M.; Pielok, T.; Richter, J.; Coors, S.; Thomas, J.; Ullmann, T.; Becker, M.; Boulesteix, A.L.; et al. Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges. arXiv 2021, arXiv:2107.05847.
 Probst, P.; Boulesteix, A.L.; Bischl, B. Tunability: Importance of Hyperparameters of Machine Learning Algorithms. J. Mach. Learn. Res. 2019, 20, 1–32.
 Weerts, H.J.P.; Mueller, A.C.; Vanschoren, J. Importance of Tuning Hyperparameters of Machine Learning Algorithms. arXiv 2020, arXiv:2007.07588.
 Turner, R.; Eriksson, D.; McCourt, M.; Kiili, J.; Laaksonen, E.; Xu, Z.; Guyon, I. Bayesian Optimization is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020. In Proceedings of the NeurIPS 2020 Competition and Demonstration Track, Virtual Event/Vancouver, BC, Canada, 6–12 December 2020; Volume 133, pp. 3–26.
 Bergstra, J.; Bengio, Y. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 2012, 13, 281–305.
 Romano, J.D.; Le, T.T.; La Cava, W.; Gregg, J.T.; Goldberg, D.J.; Chakraborty, P.; Ray, N.L.; Himmelstein, D.; Fu, W.; Moore, J.H. PMLB v1.0: An open source dataset collection for benchmarking machine learning methods. arXiv 2021, arXiv:2012.00058v2.
 Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
 Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794.
 Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
 Scikit-Learn: Machine Learning in Python. 2022. Available online: https://scikit-learn.org/ (accessed on 22 June 2022).
 García-Martín, E.; Rodrigues, C.F.; Riley, G.; Grahn, H. Estimation of energy consumption in machine learning. J. Parallel Distrib. Comput. 2019, 134, 75–88.
 Stuke, A.; Rinke, P.; Todorović, M. Efficient hyperparameter tuning for kernel ridge regression with Bayesian optimization. Mach. Learn. Sci. Technol. 2021, 2, 035022.
Table 1. Hyperparameter value ranges and sets used in tuning; (log) denotes log-scale sampling.

Classification

Algorithm  Hyperparameter  Values
AdaBoostClassifier  n_estimators  [10, 1000] (log)
  learning_rate  [0.1, 10] (log)
DecisionTreeClassifier  max_depth  [2, 10]
  min_impurity_decrease  [0.0, 0.5]
  criterion  {gini, entropy}
GradientBoostingClassifier  n_estimators  [10, 1000] (log)
  learning_rate  [0.01, 0.3]
  subsample  [0.1, 1]
KNeighborsClassifier  weights  {uniform, distance}
  algorithm  {auto, ball_tree, kd_tree, brute}
  n_neighbors  [2, 20]
LGBMClassifier  n_estimators  [10, 1000] (log)
  learning_rate  [0.01, 0.2]
  bagging_fraction  [0.5, 0.95]
LinearSVC  max_iter  [10, 10,000] (log)
  tol  [1 × 10^{−5}, 0.1] (log)
  C  [0.01, 10] (log)
LogisticRegression  penalty  {l1, l2}
  solver  {liblinear, saga}
MultinomialNB  alpha  [0.01, 10] (log)
  fit_prior  {True, False}
PassiveAggressiveClassifier  C  [0.01, 10] (log)
  fit_intercept  {True, False}
  max_iter  [10, 1000] (log)
RandomForestClassifier  n_estimators  [10, 1000] (log)
  min_weight_fraction_leaf  [0.0, 0.5]
  max_features  {auto, sqrt, log2}
RidgeClassifier  solver  {auto, svd, cholesky, lsqr, sparse_cg, sag, saga}
  alpha  [0.001, 10] (log)
SGDClassifier  penalty  {l2, l1, elasticnet}
  alpha  [1 × 10^{−5}, 1] (log)
XGBClassifier  n_estimators  [10, 1000] (log)
  learning_rate  [0.01, 0.2]
  gamma  [0.0, 0.4]
Regression

Algorithm  Hyperparameter  Values
AdaBoostRegressor  n_estimators  [10, 1000] (log)
  learning_rate  [0.1, 10] (log)
BayesianRidge  n_iter  [10, 1000] (log)
  alpha_1  [1 × 10^{−7}, 1 × 10^{−5}] (log)
  lambda_1  [1 × 10^{−7}, 1 × 10^{−5}] (log)
  tol  [1 × 10^{−5}, 0.1] (log)
DecisionTreeRegressor  max_depth  [2, 10]
  min_impurity_decrease  [0.0, 0.5]
  criterion  {squared_error, friedman_mse, absolute_error}
GradientBoostingRegressor  n_estimators  [10, 1000] (log)
  learning_rate  [0.01, 0.3]
  subsample  [0.1, 1]
KNeighborsRegressor  weights  {uniform, distance}
  algorithm  {auto, ball_tree, kd_tree, brute}
  n_neighbors  [2, 20]
KernelRidge  kernel  {linear, poly, rbf, sigmoid}
  alpha  [0.1, 10] (log)
  gamma  [0.1, 10] (log)
LGBMRegressor  lambda_l1  [1 × 10^{−8}, 10.0] (log)
  lambda_l2  [1 × 10^{−8}, 10.0] (log)
  num_leaves  [2, 256]
LinearRegression  fit_intercept  {True, False}
  normalize  {True, False}
LinearSVR  loss  {epsilon_insensitive, squared_epsilon_insensitive}
  tol  [1 × 10^{−5}, 0.1] (log)
  C  [0.01, 10] (log)
PassiveAggressiveRegressor  C  [0.01, 10] (log)
  fit_intercept  {True, False}
  max_iter  [10, 1000] (log)
RandomForestRegressor  n_estimators  [10, 1000] (log)
  min_weight_fraction_leaf  [0.0, 0.5]
  max_features  {auto, sqrt, log2}
SGDRegressor  alpha  [1 × 10^{−5}, 1] (log)
  penalty  {l2, l1, elasticnet}
XGBRegressor  n_estimators  [10, 1000] (log)
  learning_rate  [0.01, 0.2]
  gamma  [0.0, 0.4]
Table 2. Improvement (in percent) of the best tuned model over the best default model, per metric: median and mean (std) over all runs; Reps is the number of runs. Acc: accuracy; Bal: balanced accuracy; F1: F1 score; Adj: adjusted R^{2}; CRMSE: complement RMSE.

Classification

Algorithm  Acc Median  Acc Mean (std)  Bal Median  Bal Mean (std)  F1 Median  F1 Mean (std)  Reps
AdaBoostClassifier  1.9  20.9 (65.3)  2.2  21.5 (57.4)  1.9  39.3 (150.1)  4320 
DecisionTreeClassifier  0.0  115.6 (2.3 × 10^{3})  0.0  96.0 (2.2 × 10^{3})  0.0  55.7 (1.3 × 10^{3})  4220 
GradientBoostingClassifier  0.5  45.1 (1.4 × 10^{3})  0.6  48.9 (1.4 × 10^{3})  0.6  42.0 (1.1 × 10^{3})  4096 
KNeighborsClassifier  0.8  3.8 (13.8)  1.8  5.9 (16.7)  1.5  5.1 (17.7)  4254 
LGBMClassifier  0.0  1.2 (11.5)  0.0  1.0 (11.9)  0.0  0.9 (12.1)  4287 
LinearSVC  0.0  1.0 (8.3)  0.0  1.9 (8.5)  0.0  1.7 (8.2)  4299 
LogisticRegression  0.0  1.5 (8.2)  0.0  3.4 (12.5)  0.0  3.4 (11.8)  4307 
MultinomialNB  0.0  9.8 (58.9)  8.5  27.5 (48.9)  10.5  40.5 (128.6)  4149 
PassiveAggressiveClassifier  1.9  7.8 (24.0)  1.8  5.9 (18.6)  3.0  10.8 (28.8)  4301 
RandomForestClassifier  0.0  153.6 (2.3 × 10^{3})  0.0  218.8 (3.0 × 10^{3})  0.0  134.1 (2.0 × 10^{3})  4320 
RidgeClassifier  0.0  1.0 (6.8)  0.0  1.4 (7.3)  0.0  1.9 (7.9)  4273 
SGDClassifier  1.2  5.0 (20.6)  1.6  5.2 (16.7)  2.0  8.6 (26.7)  4212 
XGBClassifier  0.0  13.5 (643.1)  0.0  11.4 (431.2)  0.0  10.1 (467.6)  4111 
Regression

Algorithm  R^{2} Median  R^{2} Mean (std)  Adj Median  Adj Mean (std)  CRMSE Median  CRMSE Mean (std)  Reps
AdaBoostRegressor  2.0  3.6 (33.5)  2.1  −741.3 (9.5 × 10^{3})  3.8  5.1 (20.9)  3179 
BayesianRidge  0.0  6.8 × 10^{3} (3.7 × 10^{5})  −0.0  −3.3 (55.3)  0.0  1.0 (9.8)  3117 
DecisionTreeRegressor  3.8  61.5 (788.0)  4.0  49.1 (841.5)  7.0  63.4 (1.3 × 10^{3})  3150 
GradientBoostingRegressor  1.6  17.3 (430.9)  1.7  −6.6 × 10^{5} (2.4 × 10^{7})  4.1  2.3 (126.4)  3180 
KNeighborsRegressor  3.5  77.8 (627.4)  3.5  18.8 (471.6)  4.5  203.5 (5.3 × 10^{3})  3160 
KernelRidge  69.5  −9.3 × 10^{5} (5.0 × 10^{7})  65.9  3.6 × 10^{3} (1.7 × 10^{5})  49.5  1.7 × 10^{3} (8.1 × 10^{4})  3053 
LGBMRegressor  0.0  0.0 (25.6)  0.0  −1.2 (34.4)  0.0  0.4 (2.1)  3179 
LinearRegression  0.0  2.3 (70.4)  0.0  −35.1 (469.4)  0.0  −1.7 (62.8)  3170 
LinearSVR  25.1  86.4 (2.7 × 10^{3})  24.3  173.5 (2.8 × 10^{3})  23.9  159.7 (2.2 × 10^{3})  3161 
PassiveAggressiveRegressor  71.6  180.7 (1.7 × 10^{3})  58.5  −304.3 (4.1 × 10^{3})  62.0  331.9 (5.5 × 10^{3})  3167 
RandomForestRegressor  −0.1  1.5 (44.2)  −0.2  −1.2 × 10^{3} (4.6 × 10^{4})  −0.5  −1.5 (13.2)  3180 
SGDRegressor  0.0  2.6 (68.6)  0.0  −41.4 (2.0 × 10^{3})  0.0  2.2 (39.8)  3167 
XGBRegressor  0.9  20.0 (717.1)  0.8  −675.6 (7.4 × 10^{3})  2.3  6.8 (164.6)  3180 
Table 3. Algorithms sorted by hp_score, separately for classification and regression.

Classification  Regression

Algorithm  hp_score  Algorithm  hp_score
RandomForestClassifier  3.89  KernelRidge  2110.75 
DecisionTreeClassifier  2.43  GradientBoostingRegressor  183.97 
GradientBoostingClassifier  1.52  BayesianRidge  35.19 
MultinomialNB  1.36  PassiveAggressiveRegressor  5.34 
AdaBoostClassifier  0.79  LinearSVR  2.13 
PassiveAggressiveClassifier  0.56  KNeighborsRegressor  0.59 
XGBClassifier  0.38  DecisionTreeRegressor  0.35 
SGDClassifier  0.35  RandomForestRegressor  0.09 
KNeighborsClassifier  0.27  AdaBoostRegressor  −0.07 
LogisticRegression  −0.08  XGBRegressor  −0.10 
LinearSVC  −0.09  SGDRegressor  −0.23 
RidgeClassifier  −0.10  LinearRegression  −0.25 
LGBMClassifier  −0.10  LGBMRegressor  −0.26 
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. 
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).