# Synthetic Dataset Generation of Driver Telematics


## Abstract


The first stage generates a synthetic portfolio of feature variables by applying an extended `SMOTE` algorithm. The second stage simulates values for the number of claims as multiple binary classifications using feedforward neural networks. The third stage simulates values for the aggregated amount of claims as a regression using feedforward neural networks, with the number of claims included in the set of feature variables. The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data. Other visualizations and data summarizations produce remarkably similar statistics between the two datasets. We hope that researchers interested in obtaining telematics datasets to calibrate models or learning algorithms will find our work to be valuable.

## 1. Background

#### 1.1. Literature

#### 1.2. Motivation

`SMOTE`. We subsequently construct two neural networks that emulate the number of claims and the aggregated amount of claims drawn from the real data. Integrating the synthetic observations with the two neural networks, we are able to produce the complete portfolio with a synthetic number of claims and aggregated amount of claims.

`SMOTE` and the feedforward neural networks. This section also provides a comparison of the real data and the synthetically generated data when Poisson and gamma regression models are used. We conclude in Section 5.

## 2. Related Work

`SMOTE`, Synthetic Minority Oversampling Technique. This procedure is used to generate the classical and telematics predictor variables in the dataset. The second algorithm is the feedforward neural network. This is used to generate the corresponding response variables that describe the number of claims and aggregated amount of claims.

#### 2.1. Extended SMOTE

`SMOTE`) is originally intended to address classification datasets with severe class imbalances. The procedure augments the data by oversampling observations from the minority class, which is accomplished by selecting samples within a neighborhood in the feature space. First, we choose a minority class instance and obtain its K-nearest neighbors, where K is typically set to 5; all K neighbors should be minority instances. Subsequently, one of these K neighbors is randomly chosen to compute a new instance by interpolation. The interpolation is performed by computing the difference between the minority class instance under consideration and the selected neighbor. This difference is multiplied by a random number uniformly drawn between 0 and 1, and the result is added to the instance under consideration to form the new instance. In effect, this procedure does not duplicate observations; rather, the interpolation selects a random point along the “line segment” between the features (Fernández et al. 2018).

`SMOTE` for creating synthetic data points from the minority class is adopted in this paper with one minor adjustment. In our data generation, we apply it to generate predictor variables based on the entire feature space of the original (real) dataset. The minor adjustment is to tweak the interpolation by randomly drawing a number from a U-shaped distribution, rather than a uniform distribution, between 0 and 1. This mechanism has the effect of maintaining the characteristics of the original dataset with little possibility of duplication. In particular, we are able to capture characteristics of observations that may be considered unusual or outliers. Section 4.1.1 provides a further description of a synthetically generated portfolio.
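The interpolation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the paper specifies the U-shaped distribution only graphically (Figure 2), so a Beta(0.5, 0.5) draw is used here as a stand-in, since it is U-shaped on (0, 1) and concentrates mass near the endpoints, keeping synthetic points close to real ones.

```python
import numpy as np

def extended_smote(X, n_samples, k=5, rng=None):
    """Generate synthetic rows by interpolating between a real row and
    one of its k nearest neighbours, with a U-shaped interpolation weight."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; mask the diagonal to exclude self-matches.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per row

    out = np.empty((n_samples, X.shape[1]))
    for s in range(n_samples):
        i = rng.integers(n)                    # pick a real observation
        j = nn[i, rng.integers(k)]             # one of its k neighbours
        lam = rng.beta(0.5, 0.5)               # assumed U-shaped weight in (0, 1)
        out[s] = X[i] + lam * (X[j] - X[i])    # interpolate along the segment
    return out

# Tiny demonstration on four points at the corners of the unit square.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
S = extended_smote(X_demo, 10, k=2, rng=0)
```

Because each synthetic row lies on a segment between two real rows, points near the boundary of the data (outliers) remain represented rather than being averaged away.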

#### 2.2. Feedforward Neural Network

rectified linear unit (`ReLU`) functions, as seen in the bottom left of Figure 1. The sigmoid is used as an activation function in neural networks to convert any real-valued input to a probability between 0 and 1. It is this property that allows a neural network to be used as a binary classifier. On the other hand, the

`ReLU` function is a piecewise linear function that returns the input directly as output if it is positive, and zero otherwise. It is often the default choice for many neural network architectures, because it is believed to make models easier to train while performing well.
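The two activation functions just described are one-liners; a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    # Maps any real value into (0, 1), so the output can act as a probability.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Identity for positive inputs, zero otherwise (piecewise linear).
    return np.maximum(x, 0.0)
```

A network with a sigmoid output layer thresholded at 0.5 is exactly the binary classifier used later in Section 4.1.2.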

`SGD`) learning rate). Parameters can be learned from the data using a loss optimizer; hyperparameters, however, must be predetermined prior to the learning process and, in many cases, these decisions depend on the judgment of the analyst or the user. Hornik et al. (1989) proved that standard multi-layer feedforward networks are capable of approximating any measurable function and are thus called universal approximators. This implies that any lack of success in applications must arise from inadequate learning, an insufficient number of hidden units, or the lack of a deterministic relationship between input and target. Hyperparameters may be even more essential in deep learning in order to yield satisfactory output.

`AdaGrad` (Duchi et al. 2011); `RMSProp`; `Adam` (Kingma and Ba 2014); and others (Ruder 2016). The `Adam` optimizer is an efficient stochastic optimization method that combines the advantages of two popular methods: `AdaGrad`, which works well with sparse gradients, and `RMSProp`, which performs excellently in on-line and non-stationary settings. Recent works by Zhang et al. (2019); Peng et al. (2018); Bansal et al. (2016); and Arik et al. (2017) have shown that the `Adam` optimizer outperforms the alternatives from both theoretical and practical perspectives. Therefore, in this paper, we use `Adam` as the optimizer in our neural network simulations.
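The `Adam` update rule (Algorithm 1 of Kingma and Ba 2014) is compact enough to sketch directly; the loop below uses the default hyperparameters cited later in Section 4.1.2, and the quadratic objective at the end is purely illustrative.

```python
import numpy as np

def adam_minimize(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Bare-bones Adam loop: exponential moving averages of the gradient
    and its square, with bias correction, drive the parameter update."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)                     # first-moment estimate
    v = np.zeros_like(theta)                     # second-moment estimate
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimizing f(x) = (x - 3)^2, whose gradient is 2(x - 3),
# drives x toward 3.
x_star = adam_minimize(lambda x: 2 * (x - 3), np.array([0.0]),
                       alpha=0.05, steps=2000)
```

Note how the per-coordinate scaling by `sqrt(v_hat)` is the `RMSProp`-like ingredient, while the momentum term `m` is what `Adam` adds on top.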

## 3. The Synthetic Output: File Description

The variables in the datafile include the following:

- `Duration` is the period the policyholder is insured, in days, with values in [22, 366].
- `Insured.age` is the age of the insured driver, in integral years, with values in [16, 103].
- `Car.age` is the age of the vehicle, with values in [−2, 20]; negative values are rare but possible, since a vehicle of a newer model year can be purchased up to two years in advance.
- `Years.noclaims` is the number of years without any claims, with values in [0, 79] and always less than `Insured.age`.
- `Territory` is the territorial location code of the vehicle, with 55 labels in {11, 12, 13, …, 91}.
- `Annual.pct.driven` is the number of days the policyholder uses the vehicle divided by 365, with values in [0, 1].
- `Pct.drive.mon`, ⋯, `Pct.drive.sun` are compositional variables, meaning that the seven (days of the week) variables sum to 100%. `Pct.drive.wkday` and `Pct.drive.wkend` are clearly compositional variables too.
- `NB_Claim` is the number of claims, with values in {0, 1, 2, 3}: 95.72% of observations have zero claims, 4.06% exactly one claim, merely 0.20% two claims, and 0.01% three claims. The real `NB_Claim` has the following proportions: zero claims: 95.60%; one claim: 4.19%; two claims: 0.20%; three claims: 0.007%.
- `AMT_Claim` is the aggregated amount of claims, with values in [0, 138,766.5].

Table 3 shows summary statistics of the synthetic and real data.

`NB_Claim` can be treated as an integer-valued variable or as a categorical variable, with category 0 representing the least risky drivers who, thus far, have a zero claim frequency history. The percentage variables are those with values between 0 and 100%. Compositional variables are less frequently described in insurance datasets but are increasingly becoming important for telematics-related variables. Compositional variables refer to a class or group of variables commonly presented as percentages or proportions that describe parts of some whole; the total sum of these parts is typically constrained to equal some fixed constant, such as 100% of the whole. A clear example in our dataset is the pair `Pct.drive.wkday` and `Pct.drive.wkend`, which are, respectively, the percentages of time spent driving during weekdays and during weekends. For instance, if each of these is 50%, then half of the time the individual spends driving on the road is during the week (Monday through Friday), while the other half is during the weekend (Saturday and Sunday). See So et al. (2020) and Verbelen et al. (2018).

## 4. The Data Generating Process

`SMOTE` to reproduce the feature space. In the first stage, a synthetic portfolio of the space of feature variables is generated by applying an extended `SMOTE` algorithm. The second stage simulates values for the number of claims as multiple binary classifications using feedforward neural networks. The third stage simulates values for the amount of claims as a regression using a feedforward neural network, with the number of claims treated as one of the feature variables. The final synthetic data are created by combining the synthetic portfolio, the synthetic number of claims, and the synthetic amount of claims. The resulting data generation is evaluated by comparing the synthetic data and the real data when Poisson and gamma regression models are fitted to the respective data. Note that the response variables were generated with a highly complex, nonparametric procedure, so these comparisons do not necessarily reflect the true nature of the data generation. We also provide other visualizations and data summarizations to demonstrate the remarkably similar statistics between the two datasets.

#### 4.1. The Detailed Simulation Procedures

`SMOTE`. For convenience, we use the notation ${x}_{i}\in X=\{{X}_{1},{X}_{2},\cdots ,{X}_{50}\}$, $i=1,2,\cdots ,M$, which describes a portfolio with 50 feature variables, where ${x}_{i}$ is an observation (a policy). ${Y}_{1}$ is `NB_Claim` and ${Y}_{2}$ is `AMT_Claim`. The superscript $r$ denotes real data and $s$ denotes synthetic data.

#### 4.1.1. Synthetic Portfolio Generation

`SMOTE` to generate the final synthetic portfolio, ${X}^{s}$, as described in Section 2.1. The extended `SMOTE` differs from the original `SMOTE` in just a single step: the interpolation. The detailed procedure is as follows. For each feature vector (observation) ${x}_{i}^{r}$, the distance between ${x}_{i}^{r}$ and the other feature vectors in ${X}^{r}$ is computed using the Euclidean distance, to obtain the 5 nearest neighbors of each ${x}_{i}^{r}$. Subsequently, one ${x}_{i}^{r}$ and one of its nearest neighbors are randomly selected. The difference between ${x}_{i}^{r}$ and this neighbor is multiplied by a random number drawn from the U-shaped distribution shown in Figure 2, and the result is added to ${x}_{i}^{r}$ to create a synthetic feature vector, ${x}_{i}^{s}$. In total, 100,000 synthetic observations are generated, which constitute the synthetic portfolio ${X}^{s}$. After applying the extended `SMOTE`, the following considerations are also reflected in the synthetic portfolio generation:

- integer features are rounded up;
- for categorical features, only `Car.use` is multi-class; it is converted by one-hot encoding before applying the extended `SMOTE` so that every categorical feature variable takes the value 0 or 1, and after generation these are rounded up; and
- for compositional features, `Pct.drive.sun` and `Pct.drive.wkend` are not involved in the generation process but are calculated as 1 minus the sum of the related features.
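The three post-processing steps above can be sketched as follows. The column indices are illustrative assumptions, not taken from the paper, and the rounding-up of indicator columns mirrors the text; in practice a multi-class one-hot block could instead be snapped to its argmax to keep the indicators mutually exclusive.

```python
import numpy as np

def postprocess(X_syn, int_cols, onehot_cols, day_cols, sun_col):
    """Post-process a raw synthetic portfolio after extended SMOTE:
      - round integer-valued features up,
      - round one-hot indicator columns up (interpolated values lie in [0, 1]),
      - recompute the left-out compositional part as 1 - sum(rest)."""
    X = np.array(X_syn, dtype=float)
    X[:, int_cols] = np.ceil(X[:, int_cols])        # "rounded up"
    X[:, onehot_cols] = np.ceil(X[:, onehot_cols])  # back to {0, 1}
    X[:, sun_col] = 1.0 - X[:, day_cols].sum(axis=1)
    return X

# One hypothetical row: [age, use_A, use_B, mon, tue, wed, thu, fri, sat, sun]
row = [[34.4, 0.2, 0.8, 0.10, 0.15, 0.12, 0.13, 0.14, 0.16, 0.0]]
out = postprocess(row, int_cols=[0], onehot_cols=[1, 2],
                  day_cols=[3, 4, 5, 6, 7, 8], sun_col=9)
```

The last step guarantees that the seven daily shares again sum exactly to the whole, restoring the compositional constraint that raw interpolation would otherwise break.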

#### 4.1.2. The Simulation of Number of Claims

- Sub-simulation 1: ${Z}_{1}^{r}={\mathbb{1}}_{{Y}_{1}^{r}\ge 1}$. The corresponding instance index set is $\{{1}^{\left(1\right)},{2}^{\left(1\right)},\cdots ,{M}^{\left(1\right)}\}$, and the data are given as $${\mathcal{D}}_{1}=\{({x}_{{1}^{\left(1\right)}}^{r},{z}_{1,{1}^{\left(1\right)}}^{r}),({x}_{{2}^{\left(1\right)}}^{r},{z}_{1,{2}^{\left(1\right)}}^{r}),\cdots ,({x}_{{M}^{\left(1\right)}}^{r},{z}_{1,{M}^{\left(1\right)}}^{r})\}$$
- Sub-simulation 2: ${Z}_{2}^{r}={\mathbb{1}}_{{Y}_{1}^{r}\ge 2\,\mid\,{Y}_{1}^{r}\ge 1}$. The corresponding instance index set is $\{{1}^{\left(2\right)},{2}^{\left(2\right)},\cdots ,{M}^{\left(2\right)}\}$, and the data are given as $${\mathcal{D}}_{2}=\{({x}_{{1}^{\left(2\right)}}^{r},{z}_{2,{1}^{\left(2\right)}}^{r}),({x}_{{2}^{\left(2\right)}}^{r},{z}_{2,{2}^{\left(2\right)}}^{r}),\cdots ,({x}_{{M}^{\left(2\right)}}^{r},{z}_{2,{M}^{\left(2\right)}}^{r})\}$$
- Sub-simulation 3: ${Z}_{3}^{r}={\mathbb{1}}_{{Y}_{1}^{r}=3\,\mid\,{Y}_{1}^{r}\ge 2}$. The corresponding instance index set is $\{{1}^{\left(3\right)},{2}^{\left(3\right)},\cdots ,{M}^{\left(3\right)}\}$, and the data are given as $${\mathcal{D}}_{3}=\{({x}_{{1}^{\left(3\right)}}^{r},{z}_{3,{1}^{\left(3\right)}}^{r}),({x}_{{2}^{\left(3\right)}}^{r},{z}_{3,{2}^{\left(3\right)}}^{r}),\cdots ,({x}_{{M}^{\left(3\right)}}^{r},{z}_{3,{M}^{\left(3\right)}}^{r})\}$$
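The three nested binary outcomes above combine back into a claim count. A minimal sketch of the decoding logic:

```python
def nb_claim(z1, z2, z3):
    """Reassemble NB_Claim from the three nested binary sub-simulations:
    z1 = 1{N >= 1}, z2 = 1{N >= 2 | N >= 1}, z3 = 1{N = 3 | N >= 2}."""
    if not z1:
        return 0       # no claim at all
    if not z2:
        return 1       # at least one, but not two
    return 3 if z3 else 2

# Each binary pattern maps to exactly one claim count.
counts = [nb_claim(*z) for z in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]]
```

Because each sub-simulation conditions on the previous one, the later classifiers only ever see (and only ever need to predict on) the subset of policies that passed the earlier thresholds.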

`GP`) algorithm, as detailed in the previous section: the number of hidden layers, the number of nodes in the first hidden layer, the number of nodes in the remaining hidden layers, the activation functions, the batch size, and the learning rate. Table 5 presents the resulting network architectures. We use the sigmoid activation function for the output layer, since this is a binary problem and its output lies between 0 and 1. The classification threshold is 0.5, and the cross-entropy loss function is used. The weights of the neural network are optimized using the `Adam` optimizer, which takes as inputs $\alpha$ (the learning rate), ${\beta}_{1}$, ${\beta}_{2}$, and $\epsilon$; see Algorithm 1 of Kingma and Ba (2014). In practice, ${\beta}_{1}=0.9$, ${\beta}_{2}=0.999$, and $\epsilon=10^{-8}$ are commonly used, with no further tuning. Thus, we tuned only the learning rate via `GP`.

#### 4.1.3. The Simulation of Aggregated Amount of Claims

`ReLU` as the activation function and `MSE` as the loss function. The `Adam` optimizer is used with hyperparameters selected in the same manner as described in Section 4.1.2. These are further described in Table 6.

#### 4.2. Comparison: Poisson and Gamma Regression

`Annual.pct.driven`, `Credit.score`, and `Pct.drive.tue`. For both datasets, observed values are colored blue and predicted values orange. As expected, the distributions of the average claim frequency, as well as the pattern of blue and orange, for the feature variables considered here are very similar between the real and synthetic datasets.

`Years.noclaims` and `Total.miles.driven`. Neither feature variable appears to produce much variation in the predicted values, which may indicate that they are relatively less important predictors of claim severity. It may also be explained by the fact that we do not necessarily have an exceptionally good prediction model here; however, that is not the purpose of this exercise.

## 5. Concluding Remarks

`SMOTE` algorithm to produce a synthetic portfolio of feature variables and using feedforward neural networks to simulate the number and aggregated amount of claims. The resulting data generation is evaluated by comparing the synthetic data and the real data when Poisson and gamma regression models are fitted to the respective data. Data summarization and visualization of the resulting fitted models produce remarkably similar statistics and patterns between the two datasets. Additional figures provided in Appendix A suggest notable similarities between the two datasets. We are hopeful that researchers interested in obtaining driver telematics datasets to calibrate statistical models or machine learning algorithms will find the output of this research helpful for their purposes. We encourage the research community to build better predictive models and test them with our synthetic datafile.

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Graphical Display of Distributions of Selected Variables between Synthetic and Real Datasets

**Figure A1.** Synthetic data: Distribution of average number of claims for six telematics-related features.

## References

- Arik, Sercan O., Markus Kliegl, Rewon Child, Joel Hestness, Andrew Gibiansky, Chris Fougner, Ryan Prenger, and Adam Coates. 2017. Convolutional recurrent neural networks for small-footprint keyword spotting. arXiv arXiv:1703.05390. [Google Scholar]
- Ayuso, Mercedes, Montserrat Guillen, and Jens P. Nielsen. 2019. Improving automobile insurance ratemaking using telematics: Incorporating mileage and driver behaviour data. Transportation 46: 735–52. [Google Scholar] [CrossRef] [Green Version]
- Ayuso, Mercedes, Montserrat Guillén, and Ana María Pérez-Marín. 2014. Time and distance to first accident and driving patterns of young drivers with pay-as-you-drive insurance. Accident Analysis and Prevention 73: 125–31. [Google Scholar] [CrossRef] [PubMed]
- Ayuso, Mercedes, Montserrat Guillen, and Ana María Pérez-Marín. 2016. Telematics and gender discrimination: Some usage-based evidence on whether men’s risk of accidents differs from women’s. Risks 4: 10. [Google Scholar] [CrossRef] [Green Version]
- Baecke, Philippe, and Lorenzo Bocca. 2017. The value of vehicle telematics data in insurance risk selection processes. Decision Support Systems 98: 69–79. [Google Scholar] [CrossRef]
- Bansal, Trapit, David Belanger, and Andrew McCallum. 2016. Ask the GRU: Multi-task learning for deep text recommendations. arXiv arXiv:1609.02116v2. [Google Scholar]
- Bergstra, James, and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research 13: 281–305. [Google Scholar]
- Bergstra, James S., Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. New York: Curran Associates Inc., pp. 2546–54. [Google Scholar]
- Boucher, Jean-Philippe, Steven Côté, and Montserrat Guillen. 2017. Exposure as duration and distance in telematics motor insurance using generalized additive models. Risks 5: 54. [Google Scholar] [CrossRef] [Green Version]
- Butler, Patrick. 1993. Cost-based pricing of individual automobile risk transfer: Car-mile exposure unit analysis. Journal of Actuarial Practice 1: 51–67. [Google Scholar]
- Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. `SMOTE`: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research 16: 321–57. [Google Scholar] [CrossRef]
- Dalkilic, Turkan Erbay, Fatih Tank, and Kamile Sanli Kula. 2009. Neural networks approach for determining total claim amounts in insurance. Insurance: Mathematics and Economics 45: 236–41. [Google Scholar] [CrossRef]
- Denuit, Michel, Xavier Maréchal, Sandra Pitrebois, and Jean-François Walhin. 2007. Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems. West Sussex: John Wiley & Sons. [Google Scholar]
- Duchi, John, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12: 2121–59. [Google Scholar]
- Fernández, Alberto, Salvador Garcia, Francisco Herrera, and Nitesh V. Chawla. 2018. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research 61: 863–905. [Google Scholar] [CrossRef]
- Franceschi, Luca, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. 2017. Forward and reverse gradient-based hyperparameter optimization. Paper presented at 34th International Conference on Machine Learning, Sydney, Australia, August 6–11; pp. 1165–73. [Google Scholar]
- Gabrielli, Andrea, and Mario V. Wüthrich. 2018. An individual claims history simulation machine. Risks 6: 29. [Google Scholar] [CrossRef] [Green Version]
- Gan, Guojun, and Emiliano A. Valdez. 2007. Valuation of large variable annuity portfolios: Monte Carlo simulation and synthetic datasets. Dependence Modeling 5: 354–74. [Google Scholar] [CrossRef]
- Gan, Guojun, and Emiliano A. Valdez. 2018. Nested stochastic valuation of large variable annuity portfolios: Monte Carlo simulation and synthetic datasets. Data 3: 1–21. [Google Scholar]
- Gao, Guangyuan, Shengwang Meng, and Mario V. Wüthrich. 2019. Claim frequency modeling using telematics car driving data. Scandinavian Actuarial Journal 2: 143–62. [Google Scholar] [CrossRef]
- Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge, MA: MIT Press. [Google Scholar]
- Guillen, Montserrat, Jens P. Nielsen, Ana María Pérez-Marín, and Valandis Elpidorou. 2020. Can automobile insurance telematics predict the risk of near-miss events? North American Actuarial Journal 24: 141–52. [Google Scholar] [CrossRef] [Green Version]
- Guillen, Montserrat, Jens P. Nielsen, Mercedes Ayuso, and Ana M. Pérez-Marín. 2019. The use of telematics devices to improve automobile insurance rates. Risk Analysis 39: 662–72. [Google Scholar] [CrossRef]
- Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2: 359–66. [Google Scholar] [CrossRef]
- Husnjak, Siniša, Dragan Peraković, Ivan Forenbacher, and Marijan Mumdziev. 2015. Telematics system in usage based motor insurance. Procedia Engineering 100: 816–25. [Google Scholar] [CrossRef] [Green Version]
- Ibiwoye, Ade, Olawale O. E. Ajibola, and Ashim B. Sogunro. 2012. Artificial neural network model for predicting insurance insolvency. International Journal of Management and Business Research 2: 59–68. [Google Scholar]
- Karapiperis, Dimitris, Birny Birnbaum, Aaron Bradenburg, Sandra Catagna, Allen Greenberg, Robin Harbage, and Anne Obersteadt. 2015. Usage-Based Insurance and Vehicle Telematics: Insurance Market and Regulatory Implications. Technical Report. Kansas City: National Association of Insurance Commissioners and The Center for Insurance Policy and Research. [Google Scholar]
- Kiermayer, Mark, and Christian Weiß. 2020. Grouping of contracts in insurance using neural networks. Scandinavian Actuarial Journal, 1–28. [Google Scholar] [CrossRef]
- Kingma, Diederik P., and Jimmy Ba. 2014. `Adam`: A method for stochastic optimization. arXiv arXiv:1412.6980. [Google Scholar]
- Li, Jing, Ji-Hang Cheng, Jing-Yuan Shi, and Fei Huang. 2012. Brief introduction of back propagation (BP) neural network algorithm and its improvement. Advances in Intelligent and Soft Computing 169: 553–58. [Google Scholar]
- Maclaurin, Dougal, David Duvenaud, and Ryan Adams. 2015. Gradient-based hyperparameter optimization through reversible learning. Paper presented at 32nd International Conference on Machine Learning, Lille, France, July 6–11; Volume 37, pp. 2113–22. [Google Scholar]
- McCulloch, Warren S., and Walter Pitts. 1943. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics 5: 115–33. [Google Scholar] [CrossRef]
- Murugan, Pushparaja. 2017. Hyperparameters optimization in deep convolutional neural network/Bayesian approach with Gaussian process prior. arXiv arXiv:1712.07233. [Google Scholar]
- Osafune, Tatsuaki, Toshimitsu Takahashi, Noboru Kiyama, Tsuneo Sobue, Hirozumi Yamaguchi, and Teruo Higashino. 2017. Analysis of accident risks from driving behaviors. International Journal of Intelligent Transportation Systems Research 15: 192–202. [Google Scholar] [CrossRef]
- Peng, Yifan, Anthony Rios, Ramakanth Kavuluru, and Zhiyong Lu. 2018. Chemical-protein relation extraction with ensembles of svm, cnn, and rnn models. arXiv arXiv:1802.01255. [Google Scholar]
- Pesantez-Narvaez, Jessica, Montserrat Guillen, and Manuela Alcañiz. 2019. Predicting motor insurance claims using telematics data—XGBoost versus logistic regression. Risks 7: 70. [Google Scholar] [CrossRef] [Green Version]
- Pérez-Marín, Ana M., Montserrat Guillen, Manuela Alcañiz, and Lluís Bermúdez. 2019. Quantile regression with telematics information to assess the risk of driving above the posted speed limit. Risks 7: 80. [Google Scholar] [CrossRef] [Green Version]
- Ruder, Sebastian. 2016. An overview of gradient descent optimization algorithms. arXiv arXiv:1609.04747. [Google Scholar]
- Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems. New York: Curran Associates Inc., pp. 2951–59. [Google Scholar]
- So, Banghee, Jean-Philippe Boucher, and Emiliano A. Valdez. 2020. Cost-sensitive multi-class adaboost for understanding driving behavior with telematics. arXiv arXiv:2007.03100. [Google Scholar] [CrossRef]
- Thiede, Luca Anthony, and Ulrich Parlitz. 2019. Gradient based hyperparameter optimization in echo state networks. Neural Networks 115: 23–29. [Google Scholar] [CrossRef]
- Verbelen, Roel, Katrien Antonio, and Gerda Claeskens. 2018. Unravelling the predictive power of telematics data in car insurance pricing. Journal of the Royal Statistical Society: Series C Applied Statistics 67: 1275–304. [Google Scholar] [CrossRef] [Green Version]
- Viaene, Stijn, Guido Dedene, and Richard A. Derrig. 2005. Auto claim fraud detection using Bayesian learning neural networks. Expert Systems with Applications 29: 653–66. [Google Scholar] [CrossRef]
- Wüthrich, Mario V. 2019. Bias regularization in neural network models for general insurance pricing. European Actuarial Journal 10: 179–202. [Google Scholar] [CrossRef]
- Yan, Chun, Meixuan Li, Wei Liu, and Man Qi. 2020. Improved adaptive genetic algorithm for the vehicle insurance fraud identification model based on a bp neural network. Theoretical Computer Science 817: 12–23. [Google Scholar] [CrossRef]
- Zhang, Jingzhao, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Sanjiv Kumar, and Suvrit Sra. 2019. Why `Adam` beats `SGD` for attention models. arXiv arXiv:1912.03194. [Google Scholar]

Data Source | Reference | Sample | Period | Analytical Techniques | Research Synthesis
---|---|---|---|---|---
Belgium | Verbelen et al. (2018) | 10,406 drivers (33,259 obs.) | 2010–2014 | Poisson GAM, Negative binomial GAM | Shows that telematics variables are important predictors of driving habits
Canada | So et al. (2020) | 71,875 obs. | 2013–2016 | Adaboost, SAMME.C2 | Demonstrates that telematics information improves the accuracy of claim frequency prediction with a new boosting algorithm
China | Gao et al. (2019) | 1478 drivers | 2014.01–2017.06 | Poisson GAM | Shows the relevance of telematics covariates extracted from speed-acceleration heatmaps in a claim frequency model
Europe | Baecke and Bocca (2017) | 6984 drivers (<age 30) | 2011–2015 | Logistic regression, Random forests, Neural networks | Illustrates the importance of telematics variables for pricing UBI products and shows that as few as three months of data may already be enough to obtain efficient risk estimates
Greece | Guillen et al. (2020) | 157 drivers (1225 obs.) | 2016–2017 | Negative binomial reg. | Demonstrates how the information drawn from telematics can help predict near-miss events
Japan | Osafune et al. (2017) | 809 drivers | 2013.12–2015.02 | Support Vector Machines | Investigates accident risk indices that statistically separate safe and risky drivers
Spain | Ayuso et al. (2014) | 15,940 drivers (<age 30) | 2009–2011 | Weibull regression | Compares driving behaviors of novice and experienced young drivers with PAYD policies
 | Ayuso et al. (2016) | 8198 drivers (<age 30) | 2009–2011 | Weibull regression | Determines that the use of gender becomes irrelevant in the presence of sufficient telematics information
 | Boucher et al. (2017) | 71,489 obs. | 2011 | Poisson GAM | Offers the benefits of using generalized additive models (GAM) to gain additional insights as to how premiums can be more dynamically assessed with telematics information
 | Guillen et al. (2019) | 25,014 drivers (<age 40) | 2011 | Zero-inflated Poisson | Investigates how telematics information helps explain part of the occurrence of zero accidents not typically accounted for by traditional risk factors
 | Ayuso et al. (2019) | 25,014 drivers (<age 40) | 2011 | Poisson regression | Incorporates information drawn from telematics metrics into a classical frequency model for tariff determination
 | Pérez-Marín et al. (2019) | 9614 drivers (<age 35) | 2010 | Quantile regression | Demonstrates that the use of quantile regression allows for better identification of factors associated with risky drivers
 | Pesantez-Narvaez et al. (2019) | 2767 drivers (<age 30) | 2011 | XGBoost | Examines and compares the performance of the XGBoost algorithm against traditional logistic regression

Type | Variable | Description
---|---|---
Traditional | Duration | Duration of the insurance coverage of a given policy, in days
 | Insured.age | Age of insured driver, in years
 | Insured.sex | Sex of insured driver (Male/Female)
 | Car.age | Age of vehicle, in years
 | Marital | Marital status (Single/Married)
 | Car.use | Use of vehicle: Private, Commute, Farmer, Commercial
 | Credit.score | Credit score of insured driver
 | Region | Type of region where driver lives: rural, urban
 | Annual.miles.drive | Annual miles expected to be driven, as declared by the driver
 | Years.noclaims | Number of years without any claims
 | Territory | Territorial location of vehicle
Telematics | Annual.pct.driven | Annualized percentage of time on the road
 | Total.miles.driven | Total distance driven, in miles
 | Pct.drive.xxx | Percent of driving on day xxx of the week: mon/tue/…/sun
 | Pct.drive.xhrs | Percent of vehicle driven within x hrs: 2hrs/3hrs/4hrs
 | Pct.drive.xxx | Percent of vehicle driven during xxx: wkday/wkend
 | Pct.drive.rushxx | Percent of driving during xx rush hours: am/pm
 | Avgdays.week | Mean number of days used per week
 | Accel.xxmiles | Number of sudden accelerations of 6/8/9/…/14 mph/s per 1000 miles
 | Brake.xxmiles | Number of sudden brakes of 6/8/9/…/14 mph/s per 1000 miles
 | Left.turn.intensityxx | Number of left turns per 1000 miles with intensity 08/09/10/11/12
 | Right.turn.intensityxx | Number of right turns per 1000 miles with intensity 08/09/10/11/12
Response | NB_Claim | Number of claims during the observation period
 | AMT_Claim | Aggregated amount of claims during the observation period

Synthetic data: summary statistics of `AMT_Claim` by `NB_Claim`.

NB_Claim | Mean | Std Dev | Min | Q1 | Median | Q3 | Max
---|---|---|---|---|---|---|---
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 4062 | 6767 | 0 | 670 | 2191 | 4776 | 138,767
2 | 8960 | 9554 | 0 | 2350 | 7034 | 11,225 | 56,780
3 | 5437 | 2314 | 2896 | 3620 | 5372 | 5698 | 9743

Real data: summary statistics of `AMT_Claim` by `NB_Claim`.

NB_Claim | Mean | Std Dev | Min | Q1 | Median | Q3 | Max
---|---|---|---|---|---|---|---
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 4646 | 8387 | 0 | 659 | 2238 | 5140 | 145,153
2 | 8643 | 10,920 | 0 | 1739 | 5184 | 11,082 | 62,259
3 | 5682 | 2079 | 3253 | 4540 | 5416 | 5773 | 9521

Category | Continuous/Integer | Percentage | Compositional
---|---|---|---
Marital | Duration | Annual.pct.driven | Pct.drive.mon
Insured.sex | Insured.age | Pct.drive.xhrs | Pct.drive.tue
Car.use | Car.age | Pct.drive.rushxx | ⋮
Region | Credit.score | | Pct.drive.sun
Territory | Annual.miles.drive | | Pct.drive.wkday
NB_Claim | Years.noclaims | | Pct.drive.wkend
 | Total.miles.driven | |
 | Avgdays.week | |
 | Accel.xxmiles | |
 | Brake.xxmiles | |
 | Left.turn.intensityxx | |
 | Right.turn.intensityxx | |
 | AMT_Claim | |

Architecture | N. Hidden Layers | N. Nodes, First Hidden Layer | N. Nodes, Remaining Hidden Layers | Activation | Batch Size | Learning Rate
---|---|---|---|---|---|---
sub-sim1 | 3 | 353 | 68 | ReLU | 85 | 0.000667
sub-sim2 | 3 | 473 | 67 | ReLU | 18 | 0.001019
sub-sim3 | 2 | 60 | 60 | ReLU | 16 | 0.001922

Architecture | N. Hidden Layers | N. Nodes, First Hidden Layer | N. Nodes, Remaining Hidden Layers | Activation | Batch Size | Learning Rate
---|---|---|---|---|---|---
 | 6 | 344 | 67 | ReLU | 3 | 0.000526


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

So, B.; Boucher, J.-P.; Valdez, E.A.
Synthetic Dataset Generation of Driver Telematics. *Risks* **2021**, *9*, 58.
https://doi.org/10.3390/risks9040058
