# Assessing the Performance of Random Forests for Modeling Claim Severity in Collision Car Insurance


## Abstract


## 1. Introduction

#### 1.1. Aim and Methodology

#### 1.2. Literature on Models and Prediction Performance

#### 1.3. Main Results

## 2. Description of the Dataset

#### 2.1. Available Data and Variables

#### 2.2. Claim Distribution

#### 2.3. Descriptive Statistics

## 3. Model Parameterization

#### 3.1. Optimal Regression Models (GAM and GLM)

#### 3.2. Random Forests Models

## 4. Comparison and Discussion of the Models

#### 4.1. Overall Model Comparison

#### 4.2. Comparison on Individual Profiles

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

**Figure A1.** Illustration of the relative frequencies of the claim severity $S$ along risk factors in the training sample.

Term | GAM (1) | Sig. | GLM (3) | Sig. |
---|---|---|---|---|
Intercept | $7.161$ | *** | $6.761$ | *** |
$BS$ (baseline: Yes) | | | | |
No | $-0.072$ | *** | $-0.072$ | *** |
$\mathit{CS}$ (baseline: CL) | | | | |
CH | $0.024$ | * | $0.024$ | * |
CC | $-0.121$ | *** | $-0.126$ | *** |
CO | $-0.195$ | *** | $-0.198$ | *** |
${\mathit{CB}}_{\mathit{g}}$ (baseline: VW) | | | | |
BMW | $-0.105$ | *** | $-0.116$ | *** |
Daimler | $0.106$ | *** | $0.099$ | *** |
Fiat | $0.062$ | ** | $0.067$ | ** |
Ford | $-0.007$ | | $-0.007$ | |
GM | $0.096$ | * | $0.113$ | * |
Greely | $0.007$ | | $-0.004$ | |
Honda | $-0.024$ | | $-0.019$ | |
Hyundai | $-0.035$ | | $-0.026$ | |
Independent | $0.082$ | | $0.086$ | |
Mazda | $0.051$ | * | $0.054$ | * |
PSA | $0.132$ | *** | $0.134$ | *** |
Renault | $0.149$ | *** | $0.155$ | *** |
Subaru | $0.038$ | | $0.037$ | |
Suzuki | $0.207$ | *** | $0.212$ | *** |
Tata | $0.088$ | . | $0.102$ | * |
Toyota | $0.019$ | | $0.028$ | |
$\mathit{ID}$ (baseline: 300) | | | | |
500 | $0.053$ | *** | $0.052$ | *** |
1000 | $0.218$ | *** | $0.221$ | *** |
≥2000 | $0.487$ | *** | $0.525$ | *** |
${\widehat{f}}_{1}\left(\mathit{AG}\right)$ | $7.009$ | *** | | |
${\mathit{AG}}_{\mathit{c}}$ (baseline: 29–57) | | | | |
18–21 | | | $0.293$ | *** |
22–24 | | | $0.229$ | *** |
25–28 | | | $0.057$ | ** |
58–75 | | | $-0.033$ | ** |
76–81 | | | $0.011$ | |
$>81$ | | | $0.196$ | *** |
${\widehat{f}}_{2}\left(\mathit{HP}\right)$ | $3.927$ | *** | | |
${\mathit{HP}}_{\mathit{c}}$ (baseline: >126) | | | | |
41–125 | | | $0.070$ | *** |
${\widehat{f}}_{3}\left(\mathit{AC}\right)$ | $5.230$ | *** | | |
${\mathit{AC}}_{\mathit{c}}$ (baseline: 0–3) | | | | |
4 | | | $0.069$ | *** |
5 | | | $0.097$ | *** |
6 | | | $0.158$ | *** |
7 | | | $0.178$ | *** |
$>8$ | | | $0.269$ | *** |
${\widehat{f}}_{4}\left(\mathit{WC}\right)$ | $1.002$ | *** | | |
$\mathit{WC}$ | | | $0.0002$ | *** |
${\widehat{f}}_{5}(\mathit{LO},\mathit{LA})$ | $24.740$ | *** | | |
${(\mathit{LO},\mathit{LA})}_{\mathit{c}}$ (baseline: 3) | | | | |
1 | | | $-0.132$ | *** |
2 | | | $-0.080$ | *** |
4 | | | $0.068$ | *** |
5 | | | $0.128$ | *** |
6 | | | $0.254$ | *** |
7 | | | $0.477$ | *** |
BIC | 201,748 | | 201,516 | |
N | 65,950 | | 65,950 | |
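
To read the coefficient table, it can help to convert a few coefficients into multiplicative effects. The sketch below assumes that the linear predictor is on the log scale of the claim severity $S$; this is suggested by Table 5, which reports errors for $\log S$, but it is not stated in the text available here. Under that assumption, a coefficient $\beta$ corresponds to a factor $\exp(\beta)$ relative to the baseline class. The dictionary name and labels are illustrative; the values are the GLM (3) coefficients from the table.

```python
import math

# GLM (3) coefficients copied from the appendix table.
# Assumption: they act additively on log(S), so exp(beta) is a
# multiplicative factor relative to the baseline class.
glm_coefficients = {
    "ID = 500 (vs. 300)": 0.052,
    "ID = 1000 (vs. 300)": 0.221,
    "ID >= 2000 (vs. 300)": 0.525,
    "BS = No (vs. Yes)": -0.072,
}

# Convert each coefficient into a multiplicative factor.
factors = {label: math.exp(beta) for label, beta in glm_coefficients.items()}

for label, factor in factors.items():
    # A factor above 1 means a higher predicted severity than the baseline.
    print(f"{label}: factor {factor:.3f} ({(factor - 1) * 100:+.1f}%)")
```

Under this reading, positive coefficients such as those of the higher deductible classes translate into factors above one, and negative coefficients into factors below one.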


1. In fact, a very small share of claims is not closed, and our data report mostly identical reserves for each open claim, making such records hard to interpret. By omitting open claims, which may relate to more difficult cases and higher amounts, we keep in mind that the overall claim distribution might differ and that our findings underestimate the total claim amount. Nevertheless, the basis for the study is the same for all of the considered models, allowing us to compare them.

2. These results reach the boundary of our tuning grid, and an ad hoc analysis beyond this point was conducted. That analysis does not bring insights that differ from those obtained at the boundary. Thus, we keep the graphics with the same $\alpha$-axis, which allows for simple comparisons among the cases.

**Figure 2.** Illustration of the generalized additive models (GAM) effects (${\widehat{f}}_{1}$, ${\widehat{f}}_{2}$, ${\widehat{f}}_{3}$ and ${\widehat{f}}_{4}$) in Equation (1) and the bins obtained from evolutionary trees for the age of the policyholder $AG$, the horsepower $HP$, the age $AC$ and the weight $WC$ of the car.

**Figure 3.** Illustration of the spatial GAM effect ${\widehat{f}}_{5}(LO,LA)$ along longitude and latitude.

**Figure 4.** Illustration of the BIC as a function of the penalization parameter $\alpha$ for the evolutionary trees in the case of $AG$, $HP$ and $AC$ and as a function of the number of bins $n$ for Fisher’s natural breaks in the case of $(LO,LA)$.

**Figure 5.** Illustration of the optimal classification of the spatial information along Fisher’s natural breaks.

**Figure 6.** Illustration of the $\overline{RMSE}$ model error and 95%-confidence intervals for different values of the number of risk factors $m$ and the minimum node size for both ${\mathrm{RF}}_{\log S}$ and ${\mathrm{RF}}_{S}$ models.

**Figure 7.** Illustration of the $\overline{RMSE}$ model error and 95%-confidence intervals along the number of trees $B$ for both ${\mathrm{RF}}_{\log S}$ and ${\mathrm{RF}}_{S}$ models.

**Figure 10.** Comparison of the violin plots across the GAM, GLM and RF models in the training and test samples.

**Figure 11.** Illustration of the $ABC$ and $ICC$ GOL statistics for $\log S$ in the training and test samples.

**Figure 12.** Comparison of the predicted claim severity $\widehat{S}$ with 95%-confidence intervals along different ages (in graphs a to d for profiles 1 to 4) and horsepower values (in graph e for profile 5) across the GAM, GLM, and RF models.

**Figure 13.** Illustration of the spatial claim severity predictions for the GAM in Equation (1).

**Figure 14.** Illustration of the spatial claim severity predictions for the ${\mathrm{RF}}_{S}$ model.

Variable | Description |
---|---|
**Continuous variables** | |
$AG$ | Age of the policyholder (in years) |
$BM$ | Bonus-malus level (in %) |
$HP$ | Horsepower of the vehicle (in hp) |
$VC$ | Value of the car without accessories (in CHF) |
$AP$ | Value of the accessories of the car as a percentage of the car value |
$AC$ | Age of the car (in years) |
$WC$ | Weight of the car (in kg) |
$LO$ | Longitude of the policyholder’s main residence ($6.0$–${10.5}^{\circ}$ E) |
$LA$ | Latitude of the policyholder’s main residence ($45.8$–${47.8}^{\circ}$ N) |
**Categorical variables** | |
$\mathit{CA}$ | 26 cantons of Switzerland and Principality of Liechtenstein |
$\mathit{LR}$ | Language regions along three classes: German, French, Italian |
$\mathit{NA}$ | Nationality along the following classes: Switzerland (CH), France (FR), Germany and Austria (DE-AT), Spain (ES), Portugal (PT), Italy (IT), Eastern Europe and Turkey (EE-TR), other (OT) |
$\mathit{UT}$ | Utilization along three classes: private use with/without irregular commuter route (PE), private use with regular commuter route (PR), professional use (PL) |
$\mathit{CS}$ | Car body style along four classes: hatchback and sedan (CH), limousine (CL), convertible (CC), other (CO) |
$\mathit{CB}$ | Car brand |
${\mathit{CB}}_{\mathit{g}}$ | Car brand group |
${\mathit{CB}}_{\mathit{c}}$ | Car brand country |
$\mathit{NS}$ | Number of seats along three classes: $4-$, 5 and $6+$ seats |
$\mathit{ID}$ | Deductible along four classes: CHF 300, 500, 1000 and $2000+$ |
**Binary variables** | |
$GE$ | Gender of the policyholder (male/female) |
$IT$ | Online contract underwriting (yes/no) |
$BS$ | Bonus-malus level protection (yes/no) |
$PM$ | Zero-alcohol engagement (yes/no) |
$DW$ | Driving license withdrawal (yes/no) |
$DD$ | Driving less than $10,000$ km per year (yes/no) |

Statistic | Weibull | Gamma | Log-Normal |
---|---|---|---|
KS | 0.057 | 0.068 | 0.043 |
CvM | 96.328 | 110.973 | 34.269 |
BIC | 1,150,200 | 1,150,469 | 1,147,293 |
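
The three statistics in this table can be computed directly from a fitted distribution. The following is a minimal sketch for the log-normal case only, assuming maximum likelihood estimates of the log-scale mean and standard deviation and the usual textbook definitions of the Kolmogorov-Smirnov (KS) and Cramér-von Mises (CvM) statistics and of the BIC. The simulated sample and all function names are hypothetical and do not reproduce the data or the code behind the table.

```python
import math
import random
from statistics import NormalDist, fmean

def fit_lognormal(sample):
    """Maximum likelihood estimates (mu, sigma) of a log-normal distribution."""
    logs = [math.log(x) for x in sample]
    mu = fmean(logs)
    sigma = math.sqrt(fmean([(v - mu) ** 2 for v in logs]))
    return mu, sigma

def ks_statistic(sample, cdf):
    """Largest distance between the empirical and the fitted CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n) for i, x in enumerate(xs))

def cvm_statistic(sample, cdf):
    """Cramér-von Mises statistic in its n-scaled form."""
    xs = sorted(sample)
    n = len(xs)
    return 1 / (12 * n) + sum(
        (cdf(x) - (2 * i + 1) / (2 * n)) ** 2 for i, x in enumerate(xs)
    )

def lognormal_bic(sample, mu, sigma):
    """BIC = k * ln(n) - 2 * log-likelihood, with k = 2 parameters."""
    loglik = sum(
        -math.log(x) - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        - (math.log(x) - mu) ** 2 / (2 * sigma ** 2)
        for x in sample
    )
    return 2 * math.log(len(sample)) - 2 * loglik

# Hypothetical severities; the parameters 7.0 and 1.1 are illustrative only.
rng = random.Random(0)
claims = [rng.lognormvariate(7.0, 1.1) for _ in range(2000)]

mu_hat, sigma_hat = fit_lognormal(claims)
fitted = NormalDist(mu_hat, sigma_hat)

def fitted_cdf(x):
    # Log-normal CDF expressed through the normal CDF of log(x).
    return fitted.cdf(math.log(x))

ks = ks_statistic(claims, fitted_cdf)
cvm = cvm_statistic(claims, fitted_cdf)
bic = lognormal_bic(claims, mu_hat, sigma_hat)
print(f"KS={ks:.3f}  CvM={cvm:.3f}  BIC={bic:.0f}")
```

Repeating the same three computations for a Weibull and a gamma fit would give the remaining columns; for all three statistics, lower values indicate a closer fit.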

**Table 3.** Comparison of the total predicted severity ${\sum}_{i}{\widehat{S}}_{i}$, goodness-of-fit statistics (GOF) and goodness-of-lift statistics (GOL) across the generalized additive models (GAM), generalized linear models (GLM), and random forests (RF) models in the training sample (data from 2011–2014).

Statistic | Data | GAM | GLM | ${\mathrm{RF}}_{\log S}$ | ${\mathrm{RF}}_{S}$ |
---|---|---|---|---|---|
${\sum}_{i}{\widehat{S}}_{i}$ (in CHF) | 149.3 mio | 149.5 mio | 148.4 mio | 87.2 mio | 149.0 mio |
RE | | 0.29% | 0.37% | 41.4% | 0.04% |
RMSE | | 3047 | 3049 | 3053 | 2860 |
MAE | | 1730 | 1732 | 1416 | 1650 |
ABC $(\cdot {10}^{-3})$ | | 4.4 | 5.1 | 112.9 | 96.7 |
ICC | | 0.456 | 0.456 | 0.324 | 0.364 |

**Table 4.** Comparison of the total predicted claim severity ${\sum}_{i}{\widehat{S}}_{i}$, GOF and GOL across the GAM, GLM, and RF models in the test sample (data from 2015).

Statistic | Data | GAM | GLM | ${\mathrm{RF}}_{\log S}$ | ${\mathrm{RF}}_{S}$ |
---|---|---|---|---|---|
${\sum}_{i}{\widehat{S}}_{i}$ (in CHF) | 34.9 mio | 34.8 mio | 34.9 mio | 20.3 mio | 34.8 mio |
RE | | 0.24% | 0.08% | 41.9% | 0.16% |
RMSE | | 2964 | 2965 | 3114 | 2957 |
MAE | | 1689 | 1692 | 1521 | 1701 |
ABC $(\cdot {10}^{-3})$ | | 5.643 | 7.252 | 5.396 | 9.915 |
ICC | | 0.458 | 0.459 | 0.458 | 0.454 |
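
The goodness-of-fit rows above (RE, RMSE, MAE) can be illustrated on a toy example. The sketch below uses the standard definitions of the root mean square error and the mean absolute error. For RE it assumes the absolute difference between total predicted and total observed severity, divided by the total observed severity; the exact definition behind the tables is not given in the text available here, so this reading is an assumption. All names and numbers are hypothetical.

```python
import math

def relative_error(observed, predicted):
    """Assumed definition: |sum(pred) - sum(obs)| / sum(obs)."""
    total_observed = sum(observed)
    return abs(sum(predicted) - total_observed) / total_observed

def rmse(observed, predicted):
    """Root mean square error over individual claims."""
    squared = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(observed, predicted):
    """Mean absolute error over individual claims."""
    absolute = [abs(p - o) for o, p in zip(observed, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical claim severities (in CHF) and model predictions.
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

re_value = relative_error(observed, predicted)
rmse_value = rmse(observed, predicted)
mae_value = mae(observed, predicted)
print(f"RE={re_value:.2%}  RMSE={rmse_value:.2f}  MAE={mae_value:.2f}")
```

Because RE only compares totals, over- and underpredictions on individual claims can cancel out, whereas RMSE and MAE accumulate claim-level errors. This is why a model can show a small RE together with a comparatively large RMSE.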

**Table 5.** Comparison of the GOF and GOL statistics across the GAM, GLM, and RF models in the training and test samples for $\log S$.

Statistic | GAM | GLM | ${\mathrm{RF}}_{\log S}$ |
---|---|---|---|
**Training sample (data from 2011–2014)** | | | |
RMSE | 1.111 | 1.111 | 1.113 |
MAE | 0.837 | 0.837 | 0.839 |
ABC $(\cdot {10}^{-3})$ | 0.250 | 0.107 | 2.150 |
ICC | 0.497 | 0.498 | 0.498 |
**Test sample (data from 2015)** | | | |
RMSE | 1.063 | 1.064 | 1.062 |
MAE | 0.805 | 0.805 | 0.803 |
ABC $(\cdot {10}^{-3})$ | 0.255 | 0.414 | 0.2314 |
ICC | 0.498 | 0.498 | 0.498 |
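
Tables 3 and 4 report a total predicted severity for the log-scale random forest that lies roughly 41% below the observed total, while its errors on the $\log S$ scale are in line with the other models. One generic mechanism that produces such a pattern is the retransformation problem: exponentiating a prediction of $\log S$ targets $\exp(\mathrm{E}[\log S])$, which is below $\mathrm{E}[S]$ for a skewed severity. Whether this accounts for the reported figures is an assumption, since the text available here does not describe how the log-scale predictions were transformed back. The sketch below illustrates the effect on simulated log-normal data with an intercept-only model, together with a smearing-type correction; the parameters and names are hypothetical.

```python
import math
import random
from statistics import fmean

# Hypothetical log-normal severities; 7.0 and 1.1 are illustrative values.
rng = random.Random(1)
severities = [rng.lognormvariate(7.0, 1.1) for _ in range(20000)]
log_severities = [math.log(s) for s in severities]

# Intercept-only "model" on the log scale: predict the mean of log(S).
log_prediction = fmean(log_severities)

# Naive back-transformation: exponentiate the log-scale prediction.
naive_prediction = math.exp(log_prediction)

# Smearing-type correction: rescale by the mean exponentiated residual.
residuals = [v - log_prediction for v in log_severities]
smearing_factor = fmean([math.exp(r) for r in residuals])
smeared_prediction = naive_prediction * smearing_factor

actual_mean = fmean(severities)
ratio = naive_prediction / actual_mean
print(f"naive / actual mean: {ratio:.3f}")
print(f"smeared / actual mean: {smeared_prediction / actual_mean:.3f}")
```

For a log-normal severity with log-scale standard deviation $\sigma$, the naive prediction falls short of the mean by the factor $\exp(-\sigma^{2}/2)$, so the shortfall grows quickly with the dispersion of $\log S$.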

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Staudt, Y.; Wagner, J.
Assessing the Performance of Random Forests for Modeling Claim Severity in Collision Car Insurance. *Risks* **2021**, *9*, 53.
https://doi.org/10.3390/risks9030053
