Next Article in Journal
Premium Risk Net of Reinsurance: From Short-Term to Medium-Term Assessment
Next Article in Special Issue
Conditional Variance Forecasts for Long-Term Stock Returns
Previous Article in Journal / Special Issue
Predicting Motor Insurance Claims Using Telematics Data—XGBoost versus Logistic Regression
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:

On the Validation of Claims with Excess Zeros in Liability Insurance: A Comparative Study

Department of Actuarial Mathematics and Statistics, School of Mathematical and Computer Sciences, Heriot-Watt University Malaysia, 62200 Putrajaya, Wilayah Persekutuan Putrajaya, Malaysia
Risks 2019, 7(3), 71;
Submission received: 27 May 2019 / Revised: 17 June 2019 / Accepted: 19 June 2019 / Published: 30 June 2019
(This article belongs to the Special Issue Machine Learning in Insurance)


In this study, we consider the problem of zero claims in a liability insurance portfolio and compare the predictability of three models. We use French motor third party liability (MTPL) insurance data, which has been used for a pricing game, and show that how the type of coverage and policyholders’ willingness to subscribe to insurance pricing, based on telematics data, affects their driving behaviour and hence their claims. Using our validation set, we then predict the number of zero claims. Our results show that although a zero-inflated Poisson (ZIP) model performs better than a Poisson regression, it can even be outperformed by logistic regression.

1. Introduction

There are two main types of machine learning: (i) predictive or supervised learning in which the machine trains data and learns the relationship between inputs and outputs and (ii) descriptive and unsupervised learning in which machine uses the inputs and discovers the outputs (Murphy 2012). Classification and regression are two supervised learning approaches which are well-known in general insurance. One of the objectives of the insurance companies is to charge premiums which is commensurate with the risk characteristics of their policyholders and for this, they classify the policyholders into homogeneous groups according to, say, age, sex, type of policy, subscription to telematics-based insurance pricing (see, Section 3), etc. Regression analysis and its extensions such as generalised linear modelling (GLM) are strong tools in insurance pricing. Unlike regression models, GLM is not constrained to a normal distribution and can be applied to any distribution from an exponential family. For example, a logistic regression model handles binary responses and thus is suitable for a Bernoulli distribution and a Poisson regression model applies to count data and deals with discrete random variables. GLM has long been used in actuarial practice to model claims amounts and claims frequency in the insurance portfolio (Haberman and Renshaw 1996; McCullagh and Nelder 1998).
In this study, we consider motor third party liability (MTPL) insurance. One of the problems in modelling claims frequency in this class of insurance is the number of zero claims and building a model that can capture all these zero claims. Zero claims in MTPL does not necessarily mean that there has been no accident during the term of a policy, rather it means that there has been no reported accident to the insurance company. This particularly happens under a no claim discount (NCD) system as some policyholders, known as bonus hunger, prefer to benefit from a discount by not reporting a claim. Another problem which is related to the previous one is the problem of over-dispersion. In a Poisson regression model, claims are distributed according to a Poisson distribution with equal mean and variance. Therefore, to build an appropriate model we need to test our dataset for the presence of over-dispersion (Peruman-Chaney et al. 2013; Wilson and Einbeck 2018). Binomial regression, negative binomial (NB) regression and zero-inflated Poisson (ZIP) model are techniques that can handle over and under dispersed data with the latter being able to distinguish between structured and unstructured zeros. Lambert (1992) considers a ZIP model where the probability of only possible observation, i.e., 0 and the parameter of a Poisson distribution depend on some covariates. Lambert (1992) applies this technique to model the number of defects in manufacturing. Since then, this model has been applied in different settings including insurance pricing. For example, Lee et al. (2002) use this model to analyse the impact of lifestyle and motivations on car crashes involving young drivers in Australia. Yip and Yau (2005) use ZIP to model claims frequency in car insurance. They compare different types of zero-inflated count models and conclude that a zero-inflated double Poisson regression model is a good fit for their dataset. Boucher et al. (2007) compare zero-inflated, hurdle and compound frequency models and conclude that the bonus rate is an important factor for policyholders to report the claim. In another study, Boucher et al. (2009) consider the problem of bonus hunger and construct a ZIP model to distinguish between the distribution of the number of claims and the number of accidents.
Model fitting and the selection of risk factors can be challenging in some cases. There are some papers that consider these problems. For example, Tang et al. (2014) propose a method to determine the variables in a ZIP model. They combine EM algorithm and adaptive LASSO and find that their technique performs better for the non-inflated part of the ZIP regression. Liu and Pitt (2017) also apply LASSO and ridge regression to address this issue in a bivariate negative binomial regression model. See, also, Cantoni and Auda (2018), Chowdhury et al. (2019) and Chen et al. (2019) among others.
The impact of mileage as a risk factor is considered by Lemaire et al. (2015). They conclude that annual mileage is a powerful predictor of the number of claims at-fault. Tselentis et al. (2017) provide a review of some Usage-based motor insurance (UBI) including Pay-as-you-drive (PAYD), Pay-how-you-drive (PHYD) and Pay-at-the-pump (PATP). PATP is a pricing method that considers fuel consumption as a rating factor but did not get enough attention from researchers. These new pricing methods require telematics data. In recent years, there is much research on telematics data and mileage based (MB) insurance. Boucher et al. (2017) apply generalised additive models and consider both time and mileage in insurance pricing. See the following papers on the relevance of including the mileage as a risk factor (Ayuso et al. 2019; Guillen et al. 2019; Verbelen et al. 2018).
In addition to regression analysis, neural network, decision tree, random forest and boosting algorithms such as XGBoost, etc., are other machine learning techniques that can be applied to model claims frequency and insurance pricing. However, although these models have good predictive power, unlike regression models, it is difficult to interpret their parameters and their computation time is long. Weerasinghe and Wijegunasekara (2016) study neural network, decision tree and multinomial logistic regression models. Their results show that the neural network has the best predictive performance among the three models. However, they state that to understand the relationship between independent and dependent variables, the logistic regression is the best model. Fauzan and Murfi (2018) compare XGBoost, neural network and random forest models and find that in terms of the Gini index, XGBoost is a more accurate algorithm. See, also, Spedicato et al. (2018) and Gao et al. (2019) and the references therein.
In this study, we consider the classical Poisson and logistic regression and compare our findings with a ZIP model. We divide our dataset into training and validation (hold-out) set to predict the number of zero claims. This paper is organised as follows. In the next section, we present models and notation. Section 3 discusses our dataset. In Section 4, we build our models and in Section 5 we test their validation. Finally, Section 6 concludes.

2. Methodology and Notation

Risk classification is an important concept in general insurance pricing. An insurance company tries to determine the insurance premium according to risk characteristics of policyholders such as age, sex, type of policy and car model, etc. Regression analysis is a well-known technique to incorporate such risk (rating) factors. In this section, we review Poisson regression, Logistic regression and ZIP model.
Let y i { 0 , 1 , 2 , } be a dependent or response variable such as number of claims, for  i = 1 , , n that follows a Poisson distribution with parameter λ i . Assuming a log link function and that λ i is a linear combination of rating factors β 0 + β 1 x i 1 + + β k x i k we have
E [ y i | x i ] = λ i = exp { β 0 + β 1 x i 1 + + β k x i k } , y i Pois ( λ i ) for i = 1 , 2 , , n .
When we consider the average number of claims for each policyholder, we need to specify a unit measure or exposure. We cannot expect two policyholders with the same risk characteristics, but different terms, to be equally risky. Normally, the length of coverage is considered as an exposure. However, in recent years, it is argued that even if policyholders join at different times, some may drive fewer distances than others. Therefore, when such information is available as in telematics data, mileage travelled is considered as a more appropriate exposure (Guillen et al. 2019). In our study, all policyholders are under observation for one year and thus the exposure for each policyholder is 1.
We use logistic regression when y i { 0 , 1 } is a binary, also called dichotomous variable. In that case,
E [ y i | x i ] = π i ( x ) = g β 0 + β 1 x i 1 + + β k x i k
where g is a logistic link function to ensure that π i is between 0 and 1. Hence
π i ( x ) = exp β 0 + β 1 x i 1 + + β k x i k 1 + exp β 0 + β 1 x i 1 + + β k x i k
or, more commonly
log π i ( x ) 1 π i ( x ) = β 0 + β 1 x i 1 + + β k x i k .
In this paper, we use logistic regression to answer the question: “What is the probability of a claim ( y i = 1 ) and zero claims ( y i = 0 ) for a given policyholder with particular risk characteristics?”
When the mean and variance of the underlying population is not equal, the assumption of a Poisson distribution is not suitable and a better candidate is a distribution that can allow for over/under dispersion such as a binomial or NB distribution. However, sometimes we deal with a large number of zeros in our dataset. For example, we see in the next section that many policyholders have zero claims, which does not necessarily mean that they were involved in no accidents, but they are low risk. In such cases, we can apply a ZIP model which is a mixture of a point mass at zero, also called structural zeros, and another claims frequency distribution, such as a Poisson or NB, which can be written as
Pr ( y i = j ) = π i + ( 1 π i ) Pr ( y i = 0 ) j = 0 ( 1 π i ) Pr ( y i = j ) j = 1 , 2 ,
where π i is given by Equation (2) and denotes the probability of zeros when zero is the only possible observation. In a ZIP model, y i follows a Poisson distribution with parameters being given by Equation (1).
We can easily implement these models in R and the codes are provided in Appendix A (Frees et al. 2014, 2016).

3. Data

We use datasets provided by the French Institute of Actuaries for the 2017 pricing game, which is based on French MTPL insurance. The dataset is available in Package ‘CASdatasets’ by Dutang and Dutang and Charpentier (2019) and to the best of the author’s knowledge, this is the first time it is used in a study. The dataset contains some information about the new pricing strategy of the company. The policyholders were given a choice whether they would like to join a new mileage-based (MB) pricing system or not. We would like to see how policyholders’ perception regarding this new system affects their driving behaviour and hence their number of claims. There are two types of datasets: (i) underwriting and (ii) claims dataset. Underwriting datasets are available for three years, whereas claims dataset is only publicly available for year 0. Therefore, we only use data from year 0. After merging claims and underwriting datasets, we randomly split our data into training and validation sets with 60 % being in training and 40 % in the validation set. As some policyholders have more than one car, we assume that each policy covers only one car and therefore consider the number of policies and claims per policy rather than claims per policyholder. We have 100,000 policies (rows in underwriting dataset) and 12,654 policies with claims (rows in claims dataset after consolidation). Table 1 shows the variables we use in our study. In addition to these variables, information about Insee town code, make and model, marketing duration and age of driving license are also provided. However, we do not take into account these variables as, for example, there is a considerable number of policies with 113 years for driving license age which is not reasonable.
In Table 1 policy ID refers to the combination of the vehicle ID and policyholder ID. In this study, we have 100,000 policy ID. Bonus coefficient is the percentage of the full premium that policyholders pay allowing for their claims experience and the allocated discount. There are four types of coverage available: Maxi, Median 2, Median 1 and Mini. The time from the last policy alteration, such as the inclusion of a new driver, is represented by situation duration. Payments can be made annually, semi-annually, quarterly and monthly. As it is usual for the liability insurance, some of the claims amounts are negative1. Therefore, we set all claims amounts of less than 30 equal to zero (Ferreira and Minikel 2012; Frees et al. 2014).
Subscription to mileage-based (MB) policy refers to a new scheme in which one of the main risk factors is the travel distance and policyholders are charged based on their mileage, also known as PAYD scheme. Policy Usage includes WorkPrivate, Retired, Professional and AllTrips. If a policy covers two drivers, age and gender are provided for both drivers. Different features of the car including age, engine power (represented by Din), fuel type, max speed (provided by manufacturing company), type—Tourism and Commercial, value and weight are provided and will be used as rating factors. In this study, we only consider the number of claims as a dependent variable.
We now provide some explanatory analysis based on the training set. The minimum policy term in our dataset is one year, which means all these policies have been under observation for at least one year. Since claims have occurred in Year 0, we consider car years or earned exposure of one year for all policies. The maximum claims number is 6, the oldest policyholder is 103 years old and the oldest car is 66 years old. Table 2 presents mean and standard deviation of our numerical explanatory variables for all policies, policies without claims and policies with at least one claim based on the training set. In order to examine which variables are considerably different in the group of policies with claims and the group of policies without claims and hence are effective on the frequency of claims, we can apply Mann-Whitney test. The Mann-Whitney test is a nonparametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample is less than or greater than a randomly selected value from a second sample. The Mann-Whitney test shows that the difference in the mean for all these variables is statistically significant with p-value < 0.0001 , except for policy duration and driver age 2 with p-values 0.001232 and 0.004252 , respectively.
Figure 1 shows the distribution of the number of claims. We can observe that zero claims form a large part of our portfolio.
Figure 2 illustrates how policies are distributed across categorical variables. As we can see, most of our policies cover one driver and most of the drivers are men aged between 51 and 70. Our policyholders prefer Maxi and drive tourism cars for work and private purposes. Most of them pay annually and are distributed almost evenly across monthly and biannual payment categories. They have not registered for MB scheme and they use diesel with very few of them using a hybrid car. Next, we see how claims are distributed across categorical variables.
Table 3 presents the distribution of the number of claims across different categories. For the variable policy usage, although professional usage forms a small portion of our portfolio, claims under professional group is more than private and retired groups. However, from Figure 3 professional and retired groups have almost the same median loss and except for all trips we can see little difference among policies in this group. Under this insurance, the most comprehensive protection is provided by maxis and as can be expected this may lead to moral hazard. We can see there are more claims under maxis than under other types of coverage. The order of coverage is maxis, median 2, median 1 and mini and unsurprisingly, the percentage of claims reduces in the same order. Under mini, 97.39 % of the policies have made zero claims. Perhaps lower coverage is a motivation for policyholders to take more precautious measures. Figure 3 shows the effect of policy coverage on the amounts of claims and as we can see this will be an effective covariate in our model. From Table 3 those policyholders who were willing to subscribe to MB plan are less likely to have an accident. Figure 3 shows that the subscribers are less dispersed than those who have not subscribed. From the regulatory point of view, gender cannot be used as a discriminatory factor. In fact, we can see there is no considerable difference between male’s and female’s number of claims. In Figure 2 the least favourable payment frequency is quarterly payment, but we do not see considerable differences in claim numbers and amounts for different categories of payments. A large number of policies provide coverage only for one driver, but policies with two drivers have a slightly greater chance of making claims. The age of the first driver ranges from 19 to 103. We classify the policyholders in different age groups as 18–30, 31–50, 51–70, 71–85 and 85+. Most of the policyholders are in the range 51–70 and the next largest group is between 31 and 50. Both Table 3 and Figure 3 do not show a significant difference in claims frequency and claim amounts for different age categories and it seems that some categories can be combined together. In fact, in the next section we see that instead of these categories, we use age as a numerical covariate in our models as some categories are not statistically significant.
Most of our policyholders drive gasoline cars and very few of them have hybrid cars.2 According to Table 3, hybrid cars make more claims than gasoline and diesel cars. Most policies cover tourism cars and claims percentage made by this type of cars is more than commercial cars. Our initial analysis suggests that payment frequency and gender are not significant variables and therefore can be removed from our study. In the next section, we will see that they are indeed insignificant and are not included in our final models.

4. Results

In this section, we use statistical software R and package “pscl” to build Poisson, logistic and ZIP models (Zeileis et al. 2008). Our purpose is to estimate the frequency and the probability of claims and compare our results with a ZIP model using Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC).
Table 4 presents three Poisson regression models with their estimated coefficients and their corresponding p-values. Model 1 is the full model where we consider all the variables from Section 3. However, according to pricing game document, there is a correlation between vehicle cylinder, weight, value, speed and power and in our dataset, some of the entries for weight, value and cylinder are missing. Therefore, we only incorporate speed and power into our models. We build Model 2 using the stepwise selection of variables that can be implemented in R. In Model 3 we only consider variables which are statistically significant at 0.05.
As we can see, some of the coefficients are statistically significant at 0.0001 . For example, the coefficient associated with the Bonus is significant and positive as expected. The bonus represents the percentage of the full premium and a large percentage shows an adverse claims history of a policyholder. The positive sign indicates that as the percentage of the full premium increases, the mean of claims frequency will increase. The coefficients associated with Coverage are negative for all categories and significant. The coefficient of Median 2 shows that the policyholders with this type of coverage have fewer claims than policyholders with Maxi coverage (the reference level). For example, in Model 1, a policyholder with a Median 2 has fewer claims than a policyholder with a Maxi coverage by exp ( 0.1854 ) = 0.83 and a policyholder with a Mini coverage has fewer claims by exp ( 1.2611 ) = 0.28 . The coefficient of the car’s power, represented by Din, is positive and significant, which indicates that powerful cars are more likely to be involved in an accident and therefore the mean of claims frequency for the owners of the powerful cars is higher. Unlike Ayuso et al. (2019) and Guillen et al. (2019), we found that Vehicle age has a negative impact on the number of claims. In our study, most of the policyholders are middle-aged and more likely to have old cars. In Section 3 we saw that the mean of the vehicle age is 9.56 for all policies and 7.30 for policies with at least one claim. Our portfolio of middle-aged policyholders also affects the sign of the coefficient associated with Age 1. Our dataset includes drivers as old as 103. Therefore, it seems reasonable to find a positive impact of age on the mean of the number of claims. In Model 1 the coefficients which are not significantly different from zero include Female 1, car usage for Retired and All trips, Hybrid fuel, Type and Speed. The coefficient of Professional usage indicates that Professional usage increases the mean of claims frequency compared to Work and private usage (the reference level) by exp ( 0.1536 ) = 1.17 . This is in line with Table 3 that policies for professional purposes make more claims. We obtain similar results for gasoline cars as in Table 3. Owners of Gasoline cars have fewer claims than owners of Diesel cars by exp ( 0.2621 ) = 0.77 . We can see that the coefficient associated with Driver2 is positive. This seems reasonable as a policy that covers two drivers is more likely to make claims. The coefficient of Age 2 is negative. One interpretation can be that the average age of the second drivers is lower than the average age of the first drivers. However, in Section 3, we saw that driver age 2 is not significantly different for policies with claims and policies without claims. The coefficients associated with Duration and Policy duration are both negative. This implies that more experienced policyholders make fewer claims and also the more stable a policy is, the lower the mean of the number of claims. The coefficient of subscription to MB is negative and therefore this variable reduces the mean of claims frequency. Perhaps those who are willing to be monitored by telematics technology are more confident about their driving behaviour. We saw in the previous section that payment frequency is not a significant variable. As we can see, their corresponding p-values for some categories in models 1 and 2 are not significant at 0.05 and therefore we have removed them from Model 3. However, we decided to keep the variable Usage, although not all categories are significant at 0.05, as we found in the previous section that it is effective on the number of claims. Among our three models, Model 2 has the lowest AIC and Model 3 has the lowest BIC. As we can see, the computation time for Model 2 is longer than the other two models. The reason is that the stepwise algorithm examines different models to find the one with the smallest AIC.
Table 5 presents three logistic models with their coefficients and the corresponding p-values. Similar to Table 4, Model 1 includes all variables, Model 2 is based on the stepwise algorithm and Model 3 only includes significant variables. The interpretation of the coefficients in logistic regression is similar to Poisson regression and as we can see, the signs of the coefficients are the same. The only difference is that in logistic regression we look at the impact of variables on the odds of the occurrence of claims. So, for example, the interpretation of the coefficient associated with Bonus is that, the greater the percentage of the full premium (adverse claims history) is, the higher the odds of the occurrence of the claims for the coefficient associated with professional usage; we can say that the odds of the occurrence of claims for policyholders with professional usage increases by exp ( 0.1691 ) = 1.18 as opposed to policyholders with work and private usage. For the negative coefficient associated with subscription to MB, we can say that the odds of the occurrence of claims fall for a policyholder who joins this scheme. Other variables can be similarly interpreted. Model 2 is built by examining different models and finding the one with the lowest AIC. All variables in this model are the same as the variables in stepwise Poisson regression except for duration which is not included in stepwise logistic regression. For Model 3 we again remove all variables with a p-value greater than 0.05. In addition, we do not include payment frequency as this has been proved to be insignificant in Section 3. As we can see, Model 2 has the smallest AIC and Model 3 has the smallest BIC. Further, the computation time for the stepwise algorithm is longer than the stepwise Poisson regression model.
When building a model, it is important to consider the underlying assumptions. For example, to fit a ZIP model to our data, we first need to test for the presence of over-dispersion. One approach is to fit a quasi-Poisson and to determine the dispersion parameter, i.e.,  θ in Var ( y ) = θ E [ y ] . In our case, using only significant variables from Table 4 and Table 5, the dispersion parameter is 1.1 . Alternatively, we can fit NB regression and compare our new model with Poisson regression. In our case, AIC and BIC for NB regression are 46,184 and 46,346, respectively, which are lower than AIC and BIC for the Poisson regression. Now, since we have the problem of over-dispersion and excess zeros, we can fit a ZIP model to our data. Table 6 shows the estimated coefficients and their p-values for the Poisson (count) part and zero-inflated part of three ZIP models. Model 1 is the full model where we consider the variables of the full model in Table 4 for the count part and the variables of the full model in Table 5 for the zero-inflated part. As we can see, most variables are not significantly different from zero. If we consider the significant level of 0.1, the coefficient associated with Age 1 is positive as in Table 4 and statistically significant in the count part. In addition, the coefficient associated with Age 2 is positive and significant in the zero-inflated part, but not in the count part. From Section 3, we know that the second divers are younger than the first drivers. Therefore, we can claim that in this group older drivers are more likely to have zero claims. The coefficient of situation duration in the count part is negative and significant as in Table 4 with the same interpretation. The coefficients associated with coverage are significant at 0.01 in the count part with the identical signs as in Table 4, but they are not significant in the zero-inflated part. The interpretation is that the mean frequency of claims for policyholders covered under, for example, Mini coverage is less than the policyholders covered under Maxi coverage by exp ( 1.0487 ) = 0.35 .
The coefficient of fuel (gasoline) is positive and significant which indicates that the odds of zero claims for drivers of gasoline cars increases by exp ( 0.5066 ) = 1.66 as opposed to drivers of diesel cars. Further, in the zero-inflated part, the coefficient of Driver2? is negative and significant. Therefore, a policy with two drivers is less likely to have zero claims, in other words, a policy with the 2nd driver is more likely to be involved in an accident and to make a claim. The associated coefficient of vehicle age is positive and significantly different from zero in the zero-inflated part, which is in line with our findings for Poisson and logistic models that it is more likely for the owners of older cars to have zero claims. All other variables including subscription to MB are not significantly different from zero. The variables of Model 2 in the count and zero-inflated part come from the variables of stepwise models in Table 4 and Table 5, respectively. The coefficients have the same sign and therefore similar interpretation as in Model 1. Again the coefficient of subscription to MB is not significantly different from zero. Model 3 can be built using the variables of the models that contain only significant variables in Table 4 and Table 5. Coverage in the count part and Age 2, Driver2?, fuel and vehicle age in the zero-inflated part are all significantly different from zero. In Table 6 the signs of some of the coefficients do not conform to Table 4 and Table 5. For example, subscription to MB is positive both in the count part and in the zero-inflated part. Since such coefficients are not statistically significant, we can conclude that they are not significantly different from zero. Comparing AIC and BIC of these three models, we can see that the smallest AIC can be obtained by Model 2 where the variables come from stepwise models in Table 4 and Table 5 and the smallest BIC by Model 3. In addition, AIC has considerably improved for ZIP models compared to Poisson models in Table 4. In the next section, we show that the prediction of zero claims by ZIP is considerably better than Poisson regression.

5. Validation

In this section, we use our validation set to compare the predictability of the models discussed in Section 4. Table 7 presents the predicted number of zero and non-zero claims by our models in Section 4. In this table, individual 1 refers to an 85-year-old male policyholder with a maxi policy that pays biannually with the bonus (percentage of the full premium) of 0.5. He holds this policy for retired usage, for 29 years and has not signed to MB scheme. The policy was modified nine years ago. He owns a 10-year-old tourism car with gasoline, the din of 98 and max speed of 182. In year 0, this policyholder has not made any claim and the probability of zero claims predicted by Poisson regression according to full model is exp ( 0.1036 ) where 0.1036 is the estimated parameter λ and the probability of zero claims predicted by logistic regression is 1 0.0903 where 0.0903 is the estimated π = Pr ( y = 1 ) . Prediction of zero claims by ZIP is 0.9104. Individual 2 is a male policyholder with a maxi policy. This policy covers two drivers aged 54 and 56 and has been held for six years and been modified two years ago with a bonus of 0.5 and monthly premium payment. The policyholder owns a two-year old tourism car with diesel for work and private purposes with the din of 75 and max speed of 163. The estimated parameter by Poisson regression is λ = 0.1794 and by logistic regression is π = 0.1525 . As we can see, the count part of ZIP for the two policyholders is very close to the estimated value of Poisson regression. If we add the probability of zero claims in all these models, we can approximate the number of zero claims. Results show that ZIP models considerably outperform Poisson regression and logistic regression performs better than ZIP models in predicting zero claims. Further, we can see that there is a slight difference between predictions made by full models, stepwise models and the models with only significant variables.

6. Conclusions

We have divided our dataset into training and validation sets. Using our training set, we have developed three models and compared our models according to their AIC and BIC values. We found that type of coverage, vehicle age and fuel are statistically significant in most of our models. We then validated our models and showed that a ZIP model can predict the frequency of claims better than a Poisson regression. Further, we have shown that if we are just concerned about the number of zero and non-zero claims, logistic regression can even outperform a ZIP model. In fact, logistic regression is a one layer neural network and there is a scope to extend our study to a more generalised form of logistic regression for future research. We saw that the policyholders who were willing to be monitored by telematics devices are less likely to make a claim. A thorough study of the policyholders’ behaviour before and after being monitored by telematics devices can be another area of future research. Given the current concern regarding climate change and sustainability, the possibility of the inclusion of fuel consumption into a pricing model may be considered in the future (Tselentis et al. 2017).


This research received no external funding.


I am very grateful to the reviewers’ comments and suggestions which were valuable to improve this paper.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

# Loading the prepared data
 Data <- read.csv("Year0.csv", header = TRUE)
# Creating training and validation datasets
random <- runif(dim(Data)[1])
# Training set is our data <<0.6
train <- random < 0.6
DataTrain <- cbind(Data, random, train)
# Validation set is everything not included in training set
valid <- !(train) ; DataValid <- cbind(Data, random, valid)
# Exporting our sets
write.csv(DataTrain[train == TRUE,], "DataTrain.csv")
write.csv(DataValid[valid == TRUE,], "DataValid.csv")
# Codes to produce Table~\ref{tab.2}:
DataTrain <- read.csv("DataTrain.csv", header = TRUE)
# Remove negative claim amounts
DataTrain$claim_amount[DataTrain$claim_amount < 30] <- 0
# Adjusting claim numbers
DataTrain$claim_nb <- DataTrain$claim_nb ∗ (DataTrain$claim_amount > 0)
# Removing zeros
DataTrain$drv_age2[DataTrain$drv_age2==0]         <- NA
DataTrain$vh_value[DataTrain$vh_value==0]         <- NA
DataTrain$vh_cyl[DataTrain$vh_cyl==0]             <- NA
DataTrain$vh_weight[DataTrain$vh_weight==0]       <- NA
DataTrain$drv_drv2[DataTrain$drv_drv2==0]         <- NA
# Separating the training set into two sets of policies with and without claims
NClaim <- subset(DataTrain, DataTrain$claim_nb == 0)
Claim  <- subset(DataTrain, DataTrain$claim_nb > 0)
# Calculations for all policies
Mydata <- data.frame(cbind(DataTrain$claim_nb, DataTrain$pol_duration,
        DataTrain$pol_sit_duration,  DataTrain$drv_age1, DataTrain$drv_age2,
        DataTrain$vh_value, DataTrain$vh_age, DataTrain$vh_cyl,
        DataTrain$vh_speed, DataTrain$vh_weight, DataTrain$vh_din))
Mean <- sapply(Mydata, mean, na.rm = TRUE)
SD   <- sapply(Mydata, sd,   na.rm = TRUE)
# Calculations for policies without claims
NMydata <- data.frame(cbind(NClaim$claim_nb, NClaim$pol_duration,
           NClaim$pol_sit_duration, NClaim$drv_age1, NClaim$drv_age2,
           NClaim$vh_value, NClaim$vh_age, NClaim$vh_cyl, NClaim$vh_speed,
NMean <- with(NClaim, sapply(NMydata, mean, na.rm = TRUE))
NSD   <- with(NClaim, sapply(NMydata, sd, na.rm = TRUE))
# Calculations for policies with claims
CMydata <- data.frame(cbind(Claim$claim_nb, Claim$pol_duration,
          Claim$pol_sit_duration, Claim$drv_age1, Claim$drv_age2,
          Claim$vh_value, Claim$vh_age, Claim$vh_cyl, Claim$vh_speed,
          Claim$vh_weight, Claim$vh_din))
CMean <- with(Claim, sapply(CMydata, mean, na.rm = TRUE))
CSD   <- with(Claim, sapply(CMydata, sd, na.rm = TRUE))
# Modelling
DataTrain <- read.csv("DataTrain.csv", header = TRUE)
DataTrain$claim_amount[DataTrain$claim_amount < 30] <- 0
DataTrain$claim_nb <- DataTrain$claim_nb ∗ (DataTrain$claim_amount > 0)
# Re-leveling categorical variables:
DataTrain$drv_sex1_r     <- relevel(factor(DataTrain$drv_sex1), ref = "M")
DataTrain$pol_coverage_r <- relevel(factor(DataTrain$pol_coverage), ref = "Maxi")
DataTrain$pol_pay_freq_r <- relevel(factor(DataTrain$pol_pay_freq), ref = "Yearly")
DataTrain$pol_payd_r     <- relevel(factor(DataTrain$pol_payd), ref = "No")
DataTrain$pol_usage_r    <- relevel(factor(DataTrain$pol_usage), ref = "WorkPrivate")
DataTrain$vh_fuel_r      <- relevel(factor(DataTrain$vh_fuel), ref = "Diesel")
DataTrain$vh_type_r      <- relevel(factor(DataTrain$vh_type), ref = "Tourism")
DataTrain$drv_drv2_r     <- relevel(factor(DataTrain$drv_drv2), ref = "No")# Poisson regression
Model.poi <- glm(claim_nb ~ drv_age1 + drv_age2 + drv_sex1_r + drv_drv2_r + pol_sit_duration
              + pol_bonus + pol_coverage_r + pol_pay_freq_r + pol_payd_r + pol_usage_r
              + pol_duration + vh_fuel_r + vh_type_r + vh_din + vh_age + vh_speed,
              data = DataTrain,
              family = poisson(link = "log"), offset = log(Exposures), na.action = na.omit)
# Logistic regression
# y=1 represents claim and y=0 no claim
DataTrain$y[DataTrain$claim_nb==0]  <- 0
DataTrain$y[DataTrain$claim_nb > 0] <- 1
# Model:
Model.log <- glm(y ~ drv_age1 + drv_age2 + drv_sex1_r + drv_drv2_r + pol_sit_duration
              + pol_bonus + pol_coverage_r + pol_pay_freq_r + pol_payd_r + pol_usage_r
              + pol_duration + vh_fuel_r + vh_type_r + vh_din + vh_age + vh_speed,
              data = DataTrain,
              family = binomial(link = "logit"), na.action = na.omit)# ZIP regression
Model.zeropoi <- zeroinfl(claim_nb ~ drv_age1 + drv_age2 + drv_sex1_r + drv_drv2_r
              + pol_sit_duration + pol_bonus + pol_coverage_r + pol_pay_freq_r
              + pol_payd_r + pol_usage_r
              + pol_duration + vh_fuel_r + vh_type_r + vh_din + vh_age + vh_speed,
              data = DataTrain, na.action = na.omit,
               dist = "poisson", link = "logit")
# Validation:
# loading validation set
DataValid <- read.csv("DataValid.csv", header = TRUE)
DataValid$claim_amount[DataValid$claim_amount < 30] <- 0
DataValid$claim_nb <- DataValid$claim_nb ∗ (DataValid$claim_amount > 0)
DataValid$pol_coverage_r <- DataValid$pol_coverage
DataValid$vh_fuel_r      <- DataValid$vh_fuel
DataValid$vh_type_r      <- DataValid$vh_type
DataValid$pol_pay_freq_r <- DataValid$pol_pay_freq
DataValid$pol_payd_r     <- DataValid$pol_payd
DataValid$drv_drv2_r     <- DataValid$drv_drv2
DataValid$pol_usage_r    <- DataValid$pol_usage
DataValid$drv_sex1_r  <- DataValid$drv_sex1
DataValid$y[DataValid$claim_nb==0]  <- 0
DataValid$y[DataValid$claim_nb > 0] <- 1
# Prediction:
predict.poi <- predict(Model.poi, DataValid, type = "response")
predict.log <- predict(Model.log, DataValid, type = "response")
predict.zeropoi <- cbind( DataValid, Mean = predict(Model.zeropoi,
                 DataValid, type = "response"),Probab = predict(Model.zeropoi,
                 DataValid, type = "prob"))
# Test for dispersion
Model.neg <- MASS::glm.nb(claim_nb ~ drv_age1 + drv_age2 + drv_drv2_r + pol_sit_duration
              + pol_bonus + pol_coverage_r + pol_payd_r + pol_usage_r
              + vh_fuel_r + vh_din + vh_age , data = DataTrain,
              link = "log", na.action = na.omit)
# Codes to predict zero claims:
sum(exp(-predict(Model.poi, DataValid, type = "response")))
sum(1-predict(Model.log, DataValid, type = "response"))
sum(predict(Model.zeropoi, DataValid, type = "prob")[,1])


  1. Ayuso, Mercedes, Montserrat Guillen, and Jens Perch Nielsen. 2019. Improving automobile insurance ratemaking using telematics: incorporating mileage and driver behaviour data. Transportation 46: 735–52. [Google Scholar] [CrossRef]
  2. Boucher, Jean-Philippe, Michel Denuit, and Montserrat Guillén. 2007. Risk classification for claim counts. North American Actuarial Journal 11: 110–31. [Google Scholar] [CrossRef]
  3. Boucher, Jean-Philippe, Michel Denuit, and Montserrat Guillen. 2009. Number of accidents or number of claims? An approach with zero-inflated Poisson models for panel data. The Journal of Risk and Insurance 76: 821–46. [Google Scholar] [CrossRef]
  4. Boucher, Jean-Philippe, Steven Côté, and Montserrat Guillen. 2017. Exposure as duration and distance in telematics motor insurance using generalized additive models. Risks 5: 54. [Google Scholar] [CrossRef]
  5. Cantoni, Eva, and Marie Auda. 2018. Stochastic variable selection strategies for zero-inflated models. Statistical Modelling 18: 3–23. [Google Scholar] [CrossRef]
  6. Chen, Kun, Rui Huang, Ngai Hang Chan, and Chun Yip Yau. 2019. Subgroup analysis of zero-inflated Poisson regression model with applications to insurance data. Insurance: Mathematics and Economics 86: 8–18. [Google Scholar] [CrossRef]
  7. Chowdhury, Shrabanti, Saptarshi Chatterjee, Himel Mallick, Prithish Banerjee, and Broti Garai. 2019. Group regularization for zero-inflated poisson regression models with an application to insurance ratemaking. Journal of Applied Statistics 46: 1567–81. [Google Scholar] [CrossRef]
  8. Dutang, Christophe, and Arthur Charpentier. 2019. CASdatasets: Insurance Datasets. Available online: (accessed on 15 March 2019).
  9. Fauzan, Muhammad Arief, and Hendri Murfi. 2018. The accuracy of XGBoost for insurance claim prediction. International Journal of Advances in Soft Computing and Its Applications 10: 159–71. [Google Scholar]
  10. Ferreira, Joseph, and Eric Minikel. 2012. Measuring per mile risk for pay-as-you-drive automobile insurance. Transportation Research Record: Journal of the Transportation Research Board 2297: 97–103. [Google Scholar] [CrossRef]
  11. Frees, Edward W., Richard A. Derrig, and Glenn Meyers. 2014. Predictive Modeling Applications in Actuarial Science, Volume I: Predictive Modeling Techniques. New York: Cambridge University Press. [Google Scholar]
  12. Frees, Edward W., Richard A. Derrig, and Glenn Meyers. 2016. Predictive Modeling Applications in Actuarial Science, Volume II: Case Studies in Insurance. New York: Cambridge University Press. [Google Scholar]
  13. Gao, Guangyuan, Shengwang Meng, and Mario V. Wüthrich. 2019. Claims frequency modeling using telematics car driving data. Scandinavian Actuarial Journal 2019: 143–62. [Google Scholar] [CrossRef]
  14. Guillen, Montserrat, Jens Perch Nielsen, Mercedes Ayuso, and Ana M. Pérez-Marín. 2019. The use of telematics devices to improve automobile insurance rates. Risk Analysis 39: 662–762. [Google Scholar] [CrossRef] [PubMed]
  15. Haberman, Steven, and Arthur E. Renshaw. 1996. Generalized linear models and actuarial science. The Statistician 45: 407–36. [Google Scholar] [CrossRef]
  16. Lambert, Diane. 1992. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34: 1–14. [Google Scholar] [CrossRef]
  17. Lee, Andy H., Mark R. Stevenson, Kui Wang, and Kelvin K. W. Yau. 2002. Modeling young driver motor vehicle crashes: Data with extra zeros. Accident Analysis and Prevention 34: 515–21. [Google Scholar] [CrossRef]
  18. Lemaire, Jean, Sojung Carol Park, and Kili C. Wang. 2015. The use of annual mileage as a rating variables. Astin Bulletin 46: 39–69. [Google Scholar] [CrossRef]
  19. Liu, Feng, and David Pitt. 2017. Application of bivariate negative binomial regression model in analysing insurance count data. Annals of Actuarial Science 11: 390–411. [Google Scholar] [CrossRef] [Green Version]
  20. McCullagh, Peter, and John A. Nelder. 1998. Generalized Linear Models. London: Chapman and Hall. [Google Scholar]
  21. Murphy, Kevin P. 2012. Machine Learning—A Probabilistic Perspective. Cambridge: The MIT Press. [Google Scholar]
  22. Perumean-Chaney, Suzanne E., Charity Morgan, David McDowall, and Inmaculada Aban. 2013. Zero-inflated and overdispersed: What’s one to do? Journal of Statistical Computation and Simulation 83: 1671–83. [Google Scholar] [CrossRef]
  23. Spedicato, Giorgio Alfredo, Christophe Dutang, and Leonardo Petrini. 2018. Machine learning methods to perform pricing optimization. A comparison with standard GLMs. Variance, Casualty Actuarial Society 12: 69–89. [Google Scholar]
  24. Tang, Yanlin, Liya Xiang, and Zhongyi Zhu. 2014. Risk factor selection in rate making EM adaptive LASSO for zero-inflated Poisson regression models. Risk Analysis 34: 1112–27. [Google Scholar] [CrossRef]
  25. Tselentis, Dimitrios I., George Yannis, and Eleni I. Vlahogianni. 2017. Innovative motor insurance schemes: A review of current practices and emerging challenges. Accident Analysis and Prevention 98: 139–48. [Google Scholar] [CrossRef]
  26. Verbelen, Roel, Katrien Antonio, and Gerda Claeskens. 2018. Unraveling the predictive power of telematics data in car insurance pricing. Journal of the Royal Statistical Society: Series C (Applied Statistics) 67: 1275–304. [Google Scholar] [CrossRef]
  27. Weerasinghe, K. P. M. L. P., and M. C. Wijegunasekara. 2016. A comparative study of data mining algorithms in the prediction of auto insurance claims. European International Journal of Science and Technology 5: 47–54. [Google Scholar]
  28. Wilson, Paul, and Jochen Einbeck. 2018. A new and intuitive test for zero modification. Statistical Modelling. [Google Scholar] [CrossRef]
  29. Yip, Karen C. H., and Kelvin K. W. Yau. 2005. On modeling claim frequency data in general insurance with extra zeros. Insurance: Mathematics and Economics 36: 153–63. [Google Scholar] [CrossRef]
  30. Zeileis, Achim, Christian Kleiber, and Simon Jackman. 2008. Regression models for count data in R. Journal of Statistical Software 27: 1–25. [Google Scholar] [CrossRef]
This happens due to subrogation rights of the insurer.
According to the game document, hybrid cars were not popular at the time of collecting this dataset.
Figure 1. Distribution of claims frequency.
Figure 1. Distribution of claims frequency.
Risks 07 00071 g001
Figure 2. Distribution of policies according to categorical variables.
Figure 2. Distribution of policies according to categorical variables.
Risks 07 00071 g002
Figure 3. Distribution of log of claims amounts according to categorical variables.
Figure 3. Distribution of log of claims amounts according to categorical variables.
Risks 07 00071 g003
Table 1. Variables in our datasets.
Table 1. Variables in our datasets.
ControlPolicyDriver (1 and 2)VehicleResponse
policy IDbonus coefficientdriver 2?agenumber of claims
type of coverageagecylinder
durationgenderdin power
situation duration fuel type
payment frequency max speed
subscription to MB type
usage value
Table 2. The mean and standard deviation of numerical variables in the training set.
Table 2. The mean and standard deviation of numerical variables in the training set.
VariablesAll PoliciesPolicies without ClaimsPolicies with Claims
Policy duration 11.09 8.56 11.13 8.57 10.78 8.56
Policy duration since the last change 2.74 2.36 2.78 2.41 2.40 2.36
Driver age 1 54.65 14.86 54.75 14.86 53.84 14.86
Driver age 2 46.93 16.21 47.06 16.19 46.04 16.21
Vehicle value18,0868677.9217,8588618.4719,8948677.92
Vehicle age 9.56 7.03 9.84 7.19 7.30 7.03
Engine cylinder1645 460.59 1,639 464.05 1,696 460.59
Speed 170.71 23.48 170.13 23.69 175.31 23.48
Weight 1171.59 288.39 1164.36 288.68 1228.89 288.39
Motor power (din) 91.43 34.41 90.58 34.35 98.23 34.41
Table 3. Frequency of claims per categorical variables in the training set.
Table 3. Frequency of claims per categorical variables in the training set.
VariablesCategoriesClaim FrequencyTotal
Policy usageWorkPrivate35,248 3,877 4504970139,632
88.94 % 9.78 % 1.14 %
Retired14,193 1,462 1912030015,869
89.44 % 9.21 % 1.20 %
Professional 3,729 5447610000 4,359
85.55 % 12.48 % 1.74 %
All trips41101000052
78.85 % 19.23 % 1.92 %
Policy coverageMaxis33,4594,4896007090138,628
86.62 % 11.62 % 1.55 %
Median 29,62886282710010,580
91.00 % 8.15 % 0.78 %
Median 1 5,122 412322000 5,568
91.99 % 7.04 % 0.57 %
Mini 5,002 13040000 5,136
97.39 % 2.53 % 0.08 %
Subscription to MBNo50,946 5,714 69376100157,440
88.69 % 9.95 % 1.21 %
Yes 2,265 179253000 2,472
91.63 % 7.24 % 1.01 %
PaymentYearly20,094 2,106 2632530122,492
89.34 % 9.36 % 1.17 %
88.96 % 9.75 % 1.11 %
Monthly15,880 1,875 2342340018,016
88.14 % 10.41 % 1.30 %
Quarterly 1,307 166222000 1,497
87.31 % 11.09 % 1.47 %
Policy with 2 driversNo35,675 3,814 4574760039,999
89.19 % 9.54 % 1.14 %
88.06 % 10.44 % 1.31 %
Gender 1Male32,118 3,501 4335240036,108
88.95 % 9.70 % 1.20 %
Female21,093 2,392 2852760123,804
88.61 % 10.05 % 1.20 %
Age 118–30 2,471 299294001 2,804
88.12 % 10.66 % 1.03 %
31–5018,961 2,228 2562440021,473
88.30 % 10.38 % 1.19 %
51–7022,978 2,479 3224030025,822
89.99 % 9.60 % 1.25 %
71–858,1548221059300 9,093
89.67 % 9.04 % 1.15 %
85 + 6476562000720
89.86 % 9.03 % 0.83 %
Vehicle fuelDiesel28,605 3,783 4755470132,925
86.88 % 11.49 % 1.44 %
Gasoline24,565 2,104 2412530026,938
91.19 % 7.81 % 0.89 %
83.67 % 12.24 % 4.08 %
Vehicle typeTourism47,891 5,387 66873100154,030
88.64 % 9.97 % 1.24 %
Commercial 5,320 506506000 5,882
90.45 % 8.60 % 0.85 %
Total 53,2115,89371879100159,912
Table 4. Regression coefficient of Poisson models.
Table 4. Regression coefficient of Poisson models.
CoefficientsModel 1: All VariablesModel 2: Stepwise SelectionModel 3: Only Significant
Intercept 2.6883 < 0.0001 2.5645 < 0.0001 2.4729 < 0.0001
Age 1 0.0048 0.0001 0.0048 0.0002 0.0036 0.0026
Age 2 0.0034 0.0084 0.0034 0.0087 0.0033 0.0111
Female 1 0.0380 0.1251 0.0374 0.1285
Driver2? 0.1790 0.0057 0.1781 0.0060 0.1717 0.0080
Situation duration 0.0185 0.0078 0.0185 0.0080 0.0220 0.0013
Bonus 0.8677 < 0.0001 0.8683 < 0.0001 0.9125 < 0.0001
Coverage(Med2) 0.1854 < 0.0001 0.1847 < 0.0001 0.1877 < 0.0001
Coverage(Med1) 0.2282 < 0.0001 0.2280 < 0.0001 0.2306 < 0.0001
Coverage(Mini) 1.2611 < 0.0001 1.2631 < 0.0001 1.2723 < 0.0001
Payment(biannual) 0.0485 0.0919 0.0487 0.0908
Payment(quarterly) 0.1676 0.0184 0.1681 0.0181
Payment(monthly) 0.0911 0.0018 0.0912 0.0018
Subscription to MB 0.1586 0.0198 0.1587 0.0197 0.1675 0.01370
Usage(retired) 0.0304 0.4331 0.0297 0.4433 0.0315 0.4146
Usage(professional) 0.1536 0.0003 0.1535 0.0002 0.1481 0.0002
Usage(all trips) 0.3451 0.2328 0.3456 0.2321 0.3448 0.2332
Duration 0.0025 0.0969 0.0025 0.0978
Fuel(gasoline) 0.2621 < 0.0001 0.2630 < 0.0001 0.2607 < 0.0001
Fuel(hybrid) 0.1265 0.6896 0.1225 0.6988 0.1196 0.70588
Type(commercial) 0.0318 0.5466
Din(power) 0.0022 0.0004 0.0026 < 0.0001 0.0024 < 0.0001
Vehicle age 0.0316 < 0.0001 0.0318 < 0.0001 0.0332 < 0.0001
Vehicle speed 0.0009 0.3967
Degrees of freedom242217
Running time (s) 0.761 7.336 0.601
Table 5. Regression coefficients of logistic models.
Table 5. Regression coefficients of logistic models.
CoefficientsModel 1: All variablesModel 2: Stepwise SelectionModel 3: Only Significant
Intercept 2.6571 < 0.0001 2.5321 < 0.0001 2.4255 < 0.0001
Age 1 0.0043 0.0036 0.0038 0.0066 0.0032 0.0226
Age 2 0.0048 0.0011 0.0047 0.0013 0.0047 0.0013
Female 1 0.0441 0.1198 0.0428 0.1275
Driver2? 0.2410 0.0012 0.2363 0.0014 0.2340 0.0016
Situation duration 0.0234 0.0027 0.0248 0.0013 0.0265 0.0006
Bonus 0.9017 < 0.0001 0.9151 < 0.0001 0.9447 < 0.0001
Coverage(Med2) 0.1814 < 0.0001 0.1786 < 0.0001 0.1832 < 0.0001
Coverage(Med1) 0.2111 0.0005 0.2061 0.0007 0.2132 0.0004
Coverage(Mini) 1.2481 < 0.0001 1.2438 < 0.0001 1.2589 < 0.0001
Payment(biannual) 0.0522 0.1121 0.0490 0.1335
Payment(quarterly) 0.1852 0.0240 0.1883 0.0217
Payment(monthly) 0.0939 0.0049 0.0936 0.0051
Subscription to MB 0.2014 0.0088 0.2038 0.0080 0.2098 0.0063
Usage(retired) 0.0180 0.6847 0.0149 0.7364 0.0203 0.6455
Usage(professional) 0.1691 0.0007 0.1733 0.0002 0.1664 0.0004
Usage(all trips) 0.4841 0.1577 0.4856 0.1563 0.4835 0.1577
Duration 0.0019 0.2708
Fuel(gasoline) 0.2885 < 0.0001 0.2914 < 0.0001 0.2875 < 0.0001
Fuel(hybrid) 0.0707 0.8560 0.0691 0.8592 0.0598 0.8778
Type(commercial) 0.0462 0.4402
Din(power) 0.0022 0.0019 0.0027 < 0.0001 0.0025 < 0.0001
Vehicle age 0.0336 < 0.0001 0.0336 < 0.0001 0.0332 < 0.0001
Vehicle speed 0.0010 0.4404
Degrees of freedom242117
Running time (s) 0.634 31.611 0.431
Table 6. Regression coefficients of zero-inflated Poisson (ZIP) models.
Table 6. Regression coefficients of zero-inflated Poisson (ZIP) models.
CoefficientsModel 1 * Model 2 * Model 3 *
Estimatep-Value Estimatep-Value Estimatep-Value
Poisson (count) part
Intercept−2.2750< 0.0001 −2.1404< 0.0001 −2.0736< 0.0001
Age 10.00660.0513 0.00630.0542 0.00460.1347
Age 20.00120.6647 0.00130.6290 0.00190.4998
Female 1−0.04480.4746 −0.04790.4324
Driver2?−0.04730.7366 −0.05360.6972 −0.08090.5584
Situation duration−0.00090.0194 −0.00170.9282 −0.00110.9558
Bonus0.57870.2021 0.59530.2068 0.54330.2026
Coverage(Med2)−0.36700.0006 −0.37340.0004 −0.36460.0006
Coverage(Med1)−0.42940.0022 −0.42310.0022 −0.42690.0024
Coverage(Mini)−1.04870.0017 −1.05990.0016 −1.10310.0007
Payment(biannual)−0.06910.3318 −0.06680.3472
Payment(quarterly)-0.05690.7525 −0.04520.8002
Payment(monthly)0.00910.8961 0.01370.8427
Subscription to MB0.05960.7580 0.04620.8099 0.01260.9486
Usage(retired)−0.06230.5523 −0.06230.5455 −0.05380.6006
Usage(professional)0.07230.5491 0.07100.5063 0.09520.3725
Usage(all trips)0.25430.6523 0.23750.6765 −0.03320.9544
Duration−0.00430.2948 −0.00250.1145
Fuel(gasoline)−0.03460.6549 −0.03870.6228 −0.07590.3854
Fuel(hybrid)0.69760.2553 0.70500.2500 0.63660.3302
Din(power)0.00230.3055 0.00260.1207 0.00240.3063
Vehicle age0.00240.7925 0.00320.1207 0.00630.4799
Vehicle speed0.00100.7557
Zero-inflation part
Intercept−0.55440.6449 −0.46340.6883 −0.48190.6826
Age 10.00400.5922 0.00320.6558 0.00200.7736
Age 20.01080.0653 0.01110.0546 0.01210.0356
Female 1−0.19620.1572 −0.20200.1358
Driver2?−0.54910.0892 −0.56330.0748 −0.61650.0522
Situation duration0.03310.3289 0.03090.3575 0.03870.2442
Bonus−0.86510.4958 −0.82000.5308 −1.06370.3802
Coverage(Med2)−0.37980.1014 −0.39320.0848 −0.35950.1139
Coverage(Med1)−0.40300.1541 −0.38710.1601 −0.38470.1645
Coverage(Mini)0.28030.5949 0.26100.6222 0.20520.6923
Payment(biannual)−0.25960.0910 −0.25470.0963
Payment(quarterly)−0.56610.2637 −0.53540.2798
Payment(monthly)−0.17210.2518 −0.16150.2751
Subscription to MB0.42260.2041 0.39780.2310 0.35600.3030
Usage(retired)−0.07750.7250 −0.07830.7177 −0.05330.8043
Usage(professional)−0.22150.4672 −0.22730.4063 −0.15100.5703
Usage(all trips)−0.30500.8550 −0.36680.8345 −1.86810.7405
Fuel(gasoline)0.50660.0017 0.50020.0021 0.40900.0236
Fuel(hybrid)114.800.1984 116.600.1890 1.06320.2829
Din(power)0.00000.9944 -0.00000.9959 −0.00020.9627
Vehicle age0.0696< 0.0001 0.0712< 0.0001 0.0756< 0.0001
Vehicle speed0.00060.9258
Log-likelihood−23,044 −23,045 −23,054
Degrees of freedom48 43 34
AIC46,184 46,175 46,177
BIC46,616 46,563 46,482
Running time (s)16.719 18.81 13.951
* Model 1: full model; Model 2: based on the variables of stepwise models in Table 4 and Table 5; Model 3: based on the variables of only significant models in Table 4 and Table 5.
Table 7. Prediction of the probability and the number of zero claims by our models.
Table 7. Prediction of the probability and the number of zero claims by our models.
Prob Zero Claims: Individual 1 **Prob Zero Claims: Individual 2 ***Total No. of Zero ClaimsTotal No. of Non-Zero Claims
Observed value 0135,7724316
PoissonFull model 0.90160.835835,361.444726.56
Stepwise 0.90190.835035,361.274726.73
Significant 0.90190.841035,360.504727.50
LogisticFull model 0.90970.847535,606.884481.12
Stepwise 0.90910.848035,606.954481.05
Significant 0.91020.851235,606.624481.38
ZIPModel 1 *Count0.10240.177835,6024486
Model 2 *Count0.10160.178535,601.274486.74
Model 3 *Count0.10200.171035,601.064486.94
* Model 1: full model; Model 2: based on the variables of stepwise models in Table 4 and Table 5; Model 3: based on the variables of only significant models in Table 4 and Table 5. ** An 85-year-old male policyholder with biannual maxi coverage and bonus of 0.5 for retired usage. He had this policy for 29 years and changed it nine years ago. He has not registered for MB and owns a 10-year old tourism car with gasoline, the din of 98 and max speed of 182. *** A 54-year-old male with monthly maxi coverage and bonus of 0.5 for private usage. The 2nd driver is a 56-year-old female. The policy was written six years ago and was modified two years ago. It covers a two-year-old tourism car with diesel and din of 75 and max speed of 163. It is not part of MB scheme.

Share and Cite

MDPI and ACS Style

Qazvini, M. On the Validation of Claims with Excess Zeros in Liability Insurance: A Comparative Study. Risks 2019, 7, 71.

AMA Style

Qazvini M. On the Validation of Claims with Excess Zeros in Liability Insurance: A Comparative Study. Risks. 2019; 7(3):71.

Chicago/Turabian Style

Qazvini, Marjan. 2019. "On the Validation of Claims with Excess Zeros in Liability Insurance: A Comparative Study" Risks 7, no. 3: 71.

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop