Article

Prediction Interval for Compound Conway–Maxwell–Poisson Regression Model with Application to Vehicle Insurance Claim Data

by Jahnavi Merupula 1, V. S. Vaidyanathan 1 and Christophe Chesneau 2,*
1 Department of Statistics, Pondicherry University, Puducherry 605014, India
2 Department of Mathematics, LMNO, University of Caen, 14032 Caen, France
* Author to whom correspondence should be addressed.
Math. Comput. Appl. 2023, 28(2), 39; https://doi.org/10.3390/mca28020039
Submission received: 16 January 2023 / Revised: 15 February 2023 / Accepted: 27 February 2023 / Published: 9 March 2023
(This article belongs to the Special Issue Statistical Inference in Linear Models)

Abstract

Regression models in which the response variable has a compound distribution have applications in actuarial science. For example, the aggregate claim amount in a vehicle insurance portfolio can be modeled using a compound Poisson distribution. In this paper, we propose a regression model, wherein the response variable is assumed to have a compound Conway–Maxwell–Poisson (CMP) distribution. This distribution is a parsimonious two-parameter Poisson distribution that accounts for both over- and under-dispersed count data, making it more suitable for application in various fields. A two-part methodology in the framework of a generalized linear model is proposed to estimate the parameters. Additionally, a method to obtain the prediction interval of the response variable is developed. The workings of the proposed methodology are illustrated through simulated data. An application of the compound CMP regression model to real-life vehicle insurance claims data is presented.

1. Introduction

Compound regression models have applications in various research fields, including economics and finance. In economic consumer theory, for example, compound Poisson regression models are often used to examine the factors that account for the expenditures incurred by tourists during their stay at a location. The factors may include length of stay, type of holiday accommodation, age, occupation, socio-economic status of the tourist, etc. See Gómez-Déniz and Pérez-Rodríguez [1]. In actuarial risk theory, the aggregate claim amount incurred by the insurance company against the claims made by the policyholders is modeled using compound models. See Klugman et al. [2] and Bahnemann [3] for a detailed discussion on compound models, their distributional properties and applications in insurance claim modeling. Jørgensen and Paes De Souza [4] applied the compound Poisson regression model to determine the impact on the conditional mean of the aggregate claim amount caused by factors such as age and model of the vehicle, exposure, deductibles, etc., in the context of car insurance. In this paper, we propose a compound regression model using a two-parameter Poisson distribution. Some mathematical background is presented below in order to fix the notation. Let
$$S = \sum_{j=1}^{N} Y_j, \tag{1}$$
denote the random sum, where the distributions of the random variables $N$ and $Y_1, Y_2, \ldots, Y_N$ are assumed to be discrete and continuous, respectively. Moreover, the $Y_j$s are assumed to be independent and identically distributed. Therefore, in the sequel, we refer to the $Y_j$s as $Y$. Further, $N$ and $Y$ are, in general, assumed to be independent. The above-mentioned $S$ is a compound random variable. If $Y_j$ represents the claim amounts on an insurance portfolio and $N$ denotes the number of claims made, then $S$ represents the aggregate claim amount. When $N$ has a Poisson distribution, the distribution of $S$ is known as the compound Poisson distribution. Though the Poisson distribution is often used in constructing compound distributions, it is not suitable for modeling over- or under-dispersed count data. As an alternative to the Poisson distribution, one can use a generalized Poisson distribution (Consul and Jain [5]) to model count data that are either over- or under-dispersed. Recently, Shmueli et al. [6] studied a two-parameter Poisson distribution developed by Conway and Maxwell [7], known as the Conway–Maxwell–Poisson (CMP) distribution. This is a flexible two-parameter generalization of the Poisson distribution that can model both over- and under-dispersed data and includes the Poisson, geometric and Bernoulli distributions as special cases. A detailed discussion on the properties of this distribution and its applications can be found in Sellers et al. [8]. Also, Sellers and Premeaux [9] contains a detailed review of CMP regression models. In the context of compound distributions, assuming the CMP and binomial distributions for $N$ and $Y$ in Equation (1), a discrete compound CMP-binomial distribution is developed by Saavithri et al. [10].
Considering the Poisson distribution as the counting distribution, compound Poisson regression models are available in the literature. See Frees et al. [11], Andersen and Bonat [12], and Delong et al. [13]. However, its applicability is limited to data with equi-dispersed counts. To allow for flexibility in the compound regression models in terms of accommodating dispersed counts, a counting distribution that can model both over- and under-dispersed data should be considered. This serves as motivation to use the CMP distribution as the counting distribution to build a compound regression model.
The goal of this work is to create a regression model for S using a CMP distribution for N. The present work is novel because of the distribution used for N and its convolution with the distribution of Y. The problem of obtaining prediction intervals for the response variable S is also addressed. The parameters of the compound regression model are estimated using the generalized linear model (GLM) approach in two cases. In the first case, we assume that data on S are available but not on N and Y. We assume data on both N and Y are available in the latter case. For this case, a two-part likelihood-based estimation procedure is developed within the framework of the GLM. A methodology to obtain the prediction interval (PI) for the response variable of the proposed compound regression model is developed.
The rest of the paper is organized as follows: The compound CMP regression model is given in Section 2. In Section 3, the estimation of the parameters of the proposed regression model using the GLM approach is discussed. Section 4 deals with the suggested methodology for obtaining the prediction intervals for the compound CMP regression model. A numerical illustration of the estimation procedure using simulated data and an application to real-life vehicle insurance claims data is presented in Section 5. The conclusion of the paper is given in Section 6.

2. Compound CMP Regression Model

The probability mass function (pmf) of the random variable $N$ having the CMP distribution is given by
$$P(N = n) = \frac{\lambda^{n}}{(n!)^{\nu} Z(\lambda, \nu)}, \quad n = 0, 1, 2, \ldots, \quad \lambda > 0, \ \nu \geq 0, \tag{2}$$
where $Z(\lambda, \nu) = \sum_{j=0}^{\infty} \lambda^{j} / (j!)^{\nu}$ is the normalizing constant. Some important remarks on this distribution are given below. The parameters $\lambda$ and $\nu$ are the location and dispersion parameters, respectively. This pmf is not defined for $\lambda \geq 1$ and $\nu = 0$. The mean and variance of $N$ are given by $E(N) = \lambda \, \partial \ln Z(\lambda, \nu) / \partial \lambda$ and $V(N) = \lambda \, \partial E(N) / \partial \lambda$, respectively. When $\nu = 1$, the CMP distribution reduces to the Poisson distribution. For $\nu > 1$, the distribution is under-dispersed, and for $\nu < 1$, it is over-dispersed.
Since the location parameter $\lambda$ of the CMP distribution does not represent its mean, a mean-reparameterized form of the distribution is used in building the compound regression model. The pmf of $N$ under the mean reparameterization is given by
$$P(N = n) = \frac{\left(\mu_1 + \frac{e^{\phi} - 1}{2 e^{\phi}}\right)^{n e^{\phi}}}{(n!)^{e^{\phi}} Z(\mu_1, \phi)}, \quad n = 0, 1, 2, \ldots, \quad \mu_1 > 0, \ \phi \in \mathbb{R}, \tag{3}$$
where $Z(\mu_1, \phi) = \sum_{j=0}^{\infty} \left(\mu_1 + \frac{e^{\phi} - 1}{2 e^{\phi}}\right)^{j e^{\phi}} \frac{1}{(j!)^{e^{\phi}}}$ is the normalizing constant. When $\phi = 0$, the distribution reduces to the Poisson distribution. For $\phi > 0$, the distribution is under-dispersed, and for $\phi < 0$, it is over-dispersed. See Ribeiro Jr et al. [14]. Here, $\mu_1 \approx \lambda^{1/\nu} - \frac{\nu - 1}{2\nu}$ corresponds to the mean of the distribution and $\phi = \ln(\nu)$. This approximation works reasonably well for $\nu \leq 1$ or $\lambda > 10^{\nu}$. Under this approximation, the mean and variance of $N$ are $E(N) = \mu_1$ and $V(N) = \mu_1 e^{-\phi}$, respectively.
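As a rough numerical illustration of Equation (3), the following Python sketch evaluates the mean-parametrized CMP pmf by truncating the normalizing constant after a fixed number of terms. The function name `cmp_mean_pmf` and the truncation point `max_terms` are illustrative assumptions and do not come from the paper, whose own computations are carried out in R.

```python
import math


def cmp_mean_pmf(mu1, phi, max_terms=500):
    """Approximate [P(N=0), ..., P(N=max_terms-1)] for the mean-parametrized CMP.

    The infinite series Z(mu1, phi) is truncated at max_terms (an assumption).
    """
    nu = math.exp(phi)                      # nu = e^phi
    base = mu1 + (nu - 1.0) / (2.0 * nu)    # mu1 + (e^phi - 1) / (2 e^phi)
    if base <= 0.0:
        raise ValueError("mu1 + (e^phi - 1)/(2 e^phi) must be positive")
    # Log of the unnormalized weight: e^phi * (j * ln(base) - ln(j!)).
    log_w = [nu * (j * math.log(base) - math.lgamma(j + 1))
             for j in range(max_terms)]
    top = max(log_w)
    # Log-sum-exp keeps the truncated normalizing constant numerically stable.
    log_z = top + math.log(sum(math.exp(v - top) for v in log_w))
    return [math.exp(v - log_z) for v in log_w]


# With phi = 0 the weights are mu1^j / j!, which is the Poisson(mu1) case.
pmf = cmp_mean_pmf(3.0, 0.0)
poisson_at_2 = math.exp(-3.0) * 3.0 ** 2 / math.factorial(2)
print(abs(pmf[2] - poisson_at_2) < 1e-12)
```

Working on the log scale avoids overflow in the powers and factorials; the check at the end only confirms the Poisson special case stated in the text.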
Convolutions can be used to obtain the probability density function (pdf) of the random sum $S$ defined in Equation (1). In Equation (1), $N = 0$ implies $S = 0$. Let $p_0$ denote the probability mass at $S = 0$. Since $S$ is not continuous at zero, the pdf of $S$ is represented as a generalized pdf in terms of the Dirac delta function as
$$f(s) = p_0 \delta(s) + \sum_{i=1}^{\infty} g_Y^{*i}(s) P(N = i), \quad s \geq 0, \tag{4}$$
where $\delta(s)$ is the Dirac delta function such that $\int_0^{\infty} \delta(s) \, ds = 1$. Here, $P(N = i)$ denotes the pmf of the CMP distribution defined in Equation (3), and $g_Y^{*i}(\cdot)$ denotes the pdf of the $i$-fold convolution of $Y$, whose distribution is assumed to be continuous with support in $\mathbb{R}^{+}$. Note that $p_0 = P(N = 0) = Z(\mu_1, \phi)^{-1}$. In this paper, the distribution of $Y$ is considered to be a mean-reparameterized gamma distribution. Based on Jorgensen [15] (Chapter 3), the pdf of $Y$ is given by
$$g_Y(y; \mu_2, \psi) = \frac{1}{\Gamma(\psi)} \left(\frac{\psi}{\mu_2}\right)^{\psi} y^{\psi - 1} \exp\left(-\frac{\psi y}{\mu_2}\right), \quad y > 0, \ \mu_2 > 0, \ \psi > 0, \tag{5}$$
where $\mu_2$ denotes the mean of $Y$, $\psi$ denotes the dispersion parameter and $\Gamma(\cdot)$ denotes the gamma function. This form is taken for mathematical convenience and to accommodate asymmetry in the distribution of $Y$. For example, in the context of insurance claim modeling, the individual claim amounts are always positive and often right-skewed. Since the gamma distribution is closed under convolution, we obtain
$$g_Y^{*i}(y) = \frac{1}{\Gamma(\psi)} \left(\frac{\psi}{i \mu_2}\right)^{\psi} y^{\psi - 1} \exp\left(-\frac{\psi y}{i \mu_2}\right), \quad y > 0, \ \mu_2 > 0, \ \psi > 0. \tag{6}$$
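Using Equations (3) and (6) in Equation (4), we obtain
$$f(s) = p_0 \delta(s) + \frac{s^{\psi - 1} \psi^{\psi}}{Z(\mu_1, \phi) \, \mu_2^{\psi} \, \Gamma(\psi)} \sum_{i=1}^{\infty} \frac{\left(\mu_1 + \frac{e^{\phi} - 1}{2 e^{\phi}}\right)^{i e^{\phi}}}{(i!)^{e^{\phi}}} \, i^{-\psi} \exp\left(-\frac{\psi s}{i \mu_2}\right), \quad s \geq 0. \tag{7}$$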
The pdf of $S$ defined in Equation (7) is called the compound CMP gamma pdf. For the random sum defined in Equation (1), we have
$$E(S) = E(N) E(Y), \quad V(S) = E(N) V(Y) + V(N) [E(Y)]^{2}. \tag{8}$$
See, for instance, Bahnemann [3] (Chapter 4). Using Equation (8), the mean and variance of the compound CMP gamma distribution given in Equation (7) are obtained as
$$E(S) = \mu_1 \mu_2, \quad V(S) = \mu_2^{2} \mu_1 \left[\psi^{-1} + e^{-\phi}\right]. \tag{9}$$
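The step from Equation (8) to Equation (9) can be checked numerically. The sketch below plugs $E(N) = \mu_1$, $V(N) = \mu_1 e^{-\phi}$, $E(Y) = \mu_2$ and $V(Y) = \mu_2^2/\psi$ into Equation (8); the function name `compound_cmp_gamma_moments` is a hypothetical helper introduced only for this illustration.

```python
import math


def compound_cmp_gamma_moments(mu1, mu2, phi, psi):
    """Return (E(S), V(S)) from Equation (8) with the CMP and gamma moments."""
    mean_n, var_n = mu1, mu1 * math.exp(-phi)   # moments of N (mean-parametrized CMP)
    mean_y, var_y = mu2, mu2 ** 2 / psi         # moments of Y (gamma with mean mu2)
    mean_s = mean_n * mean_y
    var_s = mean_n * var_y + var_n * mean_y ** 2
    return mean_s, var_s


mean_s, var_s = compound_cmp_gamma_moments(mu1=2.0, mu2=3.0, phi=0.0, psi=1.5)
# Closed form of Equation (9) for the same inputs.
closed_form = 3.0 ** 2 * 2.0 * (1.0 / 1.5 + math.exp(-0.0))
print(mean_s, var_s, closed_form)
```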
To build a compound regression model for $S$, let $X = (\mathbf{1}, X_1, X_2, \ldots, X_p)$ denote the design matrix, where $X_i$, $i = 1, 2, \ldots, p$, are the column vectors corresponding to the covariates and $\mathbf{1}$ is the vector of 1s. Following the GLM procedure given in De Jong et al. [16] (Chapter 5), the model is built by regressing $S$ on $X$ using the log-link function. This is because the log-link function guarantees that the expected value of the response variable is positive. Let $\mu$ denote the expected value of $S$. Then, the compound CMP gamma regression model is given by
$$\mu = \exp(X \delta), \tag{10}$$
where $\delta = (\delta_0, \delta_1, \ldots, \delta_p)^{\top}$ is a $(p + 1) \times 1$ vector of regression parameters. In the context of modeling vehicle insurance claims data, $S$ may denote the aggregate claim amount, and the covariates may denote the driver’s age, vehicle type, and so on. In the sequel, the method of estimating the regression parameters using the likelihood approach is discussed.

3. Parameter Estimation

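Consider a sample $s = (s_1, s_2, \ldots, s_r)$ of $r$ observations on $S$. Suppose that $s$ contains $D$ $(> 0)$ positive values and $r - D$ zeros. Note that $D$ can be regarded as random, with $D \sim \mathrm{Binomial}(r, 1 - p_0)$, where $p_0 = Z(\mu_1, \phi)^{-1}$. Therefore, the likelihood function $L$ based on $s$ and $D = d$ is
$$L = \binom{r}{d} p_0^{\,r - d} (1 - p_0)^{d} \prod_{k=1}^{d} f(s_k^{+}) = \binom{r}{d} \left(\frac{1}{Z(\mu_1, \phi)}\right)^{r - d} \left(1 - \frac{1}{Z(\mu_1, \phi)}\right)^{d} \prod_{k=1}^{d} f(s_k^{+}), \tag{11}$$
where $f(s_k^{+}) = \frac{s_k^{\psi - 1} \psi^{\psi}}{(Z(\mu_1, \phi) - 1) \, \mu_2^{\psi} \, \Gamma(\psi)} \sum_{i=1}^{\infty} \frac{\left(\mu_1 + \frac{e^{\phi} - 1}{2 e^{\phi}}\right)^{i e^{\phi}}}{(i!)^{e^{\phi}}} \, i^{-\psi} \exp\left(-\frac{\psi s_k}{i \mu_2}\right)$.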
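Thus, the log-likelihood function $l$ based on $s$ and $D = d$ is obtained as
$$l(\mu_1, \mu_2, \phi, \psi; s) = \ln \binom{r}{d} - r \ln(Z(\mu_1, \phi)) + (\psi - 1) \sum_{k=1}^{d} \ln(s_k) - \sum_{k=1}^{d} \psi \ln(\mu_2) + d \psi \ln(\psi) - d \ln(\Gamma(\psi)) + \sum_{k=1}^{d} \ln \left[\sum_{i=1}^{\infty} \frac{\left(\mu_1 + \frac{e^{\phi} - 1}{2 e^{\phi}}\right)^{i e^{\phi}}}{(i!)^{e^{\phi}}} \, i^{-\psi} \exp\left(-\frac{\psi s_k}{i \mu_2}\right)\right]. \tag{12}$$
Since $E(N) = \mu_1$ and $E(Y) = \mu_2$, from Equation (9), we obtain $\mu = \mu_1 \mu_2$. Let the elements of the design matrix $X$ be $x_{kl}$, $l = 0, 1, \ldots, p$; $k = 1, 2, \ldots, d$, with the $k$th row given by $x_k = (1, x_{k1}, x_{k2}, \ldots, x_{kp})$. Replacing $\mu_2$ with $\mu / \mu_1$ and $\mu$ with $\exp(X \delta)$ in Equation (12), the log-likelihood function based on $s$ and $D = d$ becomes
$$l(\delta, \mu_1, \phi, \psi; s) = \ln \binom{r}{d} - r \ln(Z(\mu_1, \phi)) + (\psi - 1) \sum_{k=1}^{d} \ln(s_k) - \sum_{k=1}^{d} \psi \ln\left(\frac{e^{\sum_{l=0}^{p} x_{kl} \delta_l}}{\mu_1}\right) + d \psi \ln(\psi) - d \ln(\Gamma(\psi)) + \sum_{k=1}^{d} \ln \left[\sum_{i=1}^{\infty} \frac{\left(\mu_1 + \frac{e^{\phi} - 1}{2 e^{\phi}}\right)^{i e^{\phi}}}{(i!)^{e^{\phi}}} \, i^{-\psi} \exp\left(-\frac{\psi s_k \mu_1}{i \, e^{\sum_{l=0}^{p} x_{kl} \delta_l}}\right)\right]. \tag{13}$$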
The maximum likelihood (ML) estimates of the parameters in Equation (13) can be obtained by solving the $(p + 4)$ log-likelihood equations simultaneously. However, these equations are non-linear, and therefore closed-form solutions cannot be obtained. Hence, iterative algorithms based on numerical methods can be used to solve the equations and obtain the parameter estimates. Let $\hat{\delta}$ denote the ML estimate of $\delta$. By the asymptotic property of the ML estimators, for large $r$, the following distribution approximation holds:
$$\Sigma_{\delta}^{-1/2} (\hat{\delta} - \delta) \sim N_{p+1}(0, I),$$
where $\delta$ and $\Sigma_{\delta}$ denote the mean vector and the covariance matrix of $\hat{\delta}$, respectively. Using Equation (10), an estimate of the expected value of $S$ given the covariates $X$ can be obtained as $\hat{\mu} = \exp(X \hat{\delta})$.
Now assume that data on $S$ are unavailable, but data on $N$ and $Y$ are available. This can happen, for example, when modeling the aggregate claim amount with data on the claim frequency ($N$) and the individual claim amounts ($Y$). Using $N$ and $Y$, we can compute the value of $S$ and then build the regression model using the method described above. However, computing the estimates in this way is computationally demanding because of the infinite sum in the log-likelihood function. To reduce the computational difficulty, we can use $N$ and $Y$ to build two separate regression models to obtain $\hat{\mu}$. To this end, a two-part GLM methodology is proposed to estimate $\mu$, assuming $N$ and $Y$ to be (1) independent and (2) dependent.

3.1. Independent Compound Regression Model

Using Equation (9), we have $\mu = \mu_1 \mu_2$. The proposed two-part GLM method is implemented by building two separate regression models, namely, the CMP regression model and the gamma regression model, for the means of $N$ and $Y$, respectively. Given the data on $N$, $Y$ and $X$, the estimated mean of $S$ is computed as $\hat{\mu} = \hat{\mu}_1 \hat{\mu}_2$. Here, $\hat{\mu}_1$ and $\hat{\mu}_2$ are obtained by regressing $N$ and $Y$ separately on $X$. Using the log-link function, we have $\mu_1 = E(N) = e^{X \alpha}$ and $\mu_2 = E(Y) = e^{X \beta}$, where $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)^{\top}$ and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^{\top}$ denote the vectors of regression parameters.
Let $n = (n_1, \ldots, n_m)$ denote $m$ observations on $N$. For each $n_k > 0$, let there be $n_k$ observations on $Y$ denoted by $y_{kj}$, $j = 1, 2, \ldots, n_k$, $k = 1, 2, \ldots, m$. Let $\bar{y} = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_m)$, where
$$\bar{y}_k = \begin{cases} \sum_{j=1}^{n_k} y_{kj} / n_k & \text{if } n_k > 0, \\ 0 & \text{if } n_k = 0. \end{cases}$$
Let the design matrix $X$ be of order $m \times (p + 1)$ with elements $x_{kl}$, $k = 1, 2, \ldots, m$; $l = 0, 1, \ldots, p$. Since the distribution of $Y$ has positive support, any zeros in $\bar{y}$ are excluded, together with the corresponding rows of $X$, when building the gamma regression model. Let $q$ denote the number of observations for which $\bar{y}_k = 0$, $k = 1, 2, \ldots, m$, and let $t = m - q$. Following Garrido et al. [17], the distribution $Y \sim \mathrm{gamma}(\mu_2, \psi)$ is equivalent to $\bar{Y} \mid N \sim \mathrm{gamma}\left(\mu_2, \frac{\psi}{N}\right)$ for independent and identically distributed $Y_1, \ldots, Y_N$. Using the pmf of $N$ given in Equation (3) with $\mu_1 = e^{X \alpha}$, the corresponding log-likelihood function is given by
$$l(\alpha, \phi; n) = \sum_{k=1}^{m} e^{\phi} \left[ n_k \ln\left(e^{\sum_{l=0}^{p} x_{kl} \alpha_l} + \frac{e^{\phi} - 1}{2 e^{\phi}}\right) - \ln(n_k!) \right] - \sum_{k=1}^{m} \ln Z\left(e^{\sum_{l=0}^{p} x_{kl} \alpha_l}, \phi\right). \tag{14}$$
The ML estimates for the $(p + 1)$ regression parameters are obtained by simultaneously solving the corresponding log-likelihood equations. Let $\hat{\alpha} = (\hat{\alpha}_0, \hat{\alpha}_1, \ldots, \hat{\alpha}_p)$ denote the ML estimate of $\alpha$. Then the ML estimate of $\mu_1$ is obtained as $\hat{\mu}_1 = e^{X \hat{\alpha}}$. Along similar lines, the ML estimate of $\beta$, namely, $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)$, is obtained using the likelihood function corresponding to the conditional pdf of $\bar{Y}$ given $N = n$. The conditional pdf is given by
$$f(\bar{y} \mid n; \mu_2, \psi) = \frac{1}{\Gamma(\psi / n)} \left(\frac{\psi / n}{\mu_2}\right)^{\psi / n} \bar{y}^{(\psi / n) - 1} \exp\left(-\frac{\psi \bar{y}}{n \mu_2}\right), \quad \bar{y} > 0. \tag{15}$$
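Taking $\mu_2 = e^{X \beta}$ in Equation (15), the log-likelihood function is obtained as
$$l(\beta, \psi; \bar{y}) = -t \ln \Gamma\left(\frac{\psi}{n}\right) + t \frac{\psi}{n} \ln\left(\frac{\psi}{n}\right) + \sum_{k=1}^{t} \left[ \left(\frac{\psi}{n} - 1\right) \ln(\bar{y}_k) - \frac{\psi \bar{y}_k}{n \, e^{\sum_{l=0}^{p} x_{kl} \beta_l}} - \frac{\psi}{n} \sum_{l=0}^{p} x_{kl} \beta_l \right]. \tag{16}$$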
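The likelihood equations for $\alpha$ and $\beta$ are, respectively, given by
$$\sum_{k=1}^{m} x_{kl} \left(n_k - e^{\sum_{l=0}^{p} x_{kl} \alpha_l}\right) = 0 \tag{17}$$
and
$$\sum_{k=1}^{t} x_{kl} \frac{n_k}{e^{\sum_{l=0}^{p} x_{kl} \beta_l}} \left(\bar{y}_k - e^{\sum_{l=0}^{p} x_{kl} \beta_l}\right) = 0, \quad l = 0, 1, \ldots, p. \tag{18}$$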
Since Equations (17) and (18) are non-linear, iterative procedures can be used to solve them. As an alternative, one can use the functions cmp() (from the cmpreg package) and glm(., family = Gamma(link = "log")) available in R to obtain $\hat{\alpha}$ and $\hat{\beta}$. Using $\hat{\alpha}$ and $\hat{\beta}$, the ML estimate of the expected value of $S$, namely, $\hat{\mu} = \hat{\mu}_1 \hat{\mu}_2$, can be computed. By the asymptotic property of the ML estimators, we have
$$\Sigma_{\alpha}^{-1/2} (\hat{\alpha} - \alpha) \sim N_{p+1}(0, I)$$
and
$$\Sigma_{\beta}^{-1/2} (\hat{\beta} - \beta) \sim N_{p+1}(0, I).$$
Here, $\alpha$ and $\Sigma_{\alpha}$ denote the mean vector and covariance matrix of $\hat{\alpha}$, respectively. Similarly, $\beta$ and $\Sigma_{\beta}$ denote the mean vector and covariance matrix of $\hat{\beta}$, respectively. The standard errors of $\hat{\alpha}$ and $\hat{\beta}$ are the square roots of the diagonal elements of the corresponding covariance matrices. Since $\hat{\alpha}$ and $\hat{\beta}$ do not have closed-form expressions, their standard errors can be obtained using the sample Hessian matrix. The sample Hessian matrices of $\hat{\alpha}$ and $\hat{\beta}$, namely, $H_{\hat{\alpha}}$ and $H_{\hat{\beta}}$, are given by $H_{\hat{\alpha}} = e^{\hat{\phi}} e^{X \hat{\alpha}} X^{\top} X$ and $H_{\hat{\beta}} = \hat{\psi} X^{\top} X$, respectively. Since the expressions of the standard errors of the parameters $\alpha$ and $\beta$ contain the dispersion parameters $\phi$ and $\psi$, respectively, these may be estimated using the following formulas:
$$\hat{\phi} = \ln \left( \frac{m - (p + 1)}{\sum_{k=1}^{m} (n_k - \hat{\mu}_{1k})^{2} / \hat{\mu}_{1k}} \right)$$
and
$$\hat{\psi} = \frac{1}{t - (p + 1)} \sum_{k=1}^{t} \left(\frac{\bar{y}_k - \hat{\mu}_{2k}}{\hat{\mu}_{2k}}\right)^{2},$$
where $\hat{\mu}_{1k}$ and $\hat{\mu}_{2k}$ are the estimated values of $\mu_1$ and $\mu_2$, respectively, corresponding to the $k$th observation.
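The estimator of $\psi$ above is a sum of squared relative residuals divided by the residual degrees of freedom. A minimal Python sketch of that computation is given below; the function name `pearson_dispersion` and the toy inputs are assumptions made for illustration, and the fitted means would in practice come from the gamma GLM.

```python
def pearson_dispersion(y_bar, mu_hat, n_params):
    """Sum of squared relative residuals divided by t - n_params.

    y_bar    : positive average claim amounts (zeros already removed)
    mu_hat   : fitted means from the gamma regression, same length as y_bar
    n_params : number of regression parameters, p + 1 in the text
    """
    t = len(y_bar)
    if len(mu_hat) != t:
        raise ValueError("y_bar and mu_hat must have the same length")
    if t <= n_params:
        raise ValueError("need more observations than regression parameters")
    total = sum(((y - m) / m) ** 2 for y, m in zip(y_bar, mu_hat))
    return total / (t - n_params)


# Toy inputs (hypothetical): four observations, intercept plus one covariate.
psi_hat = pearson_dispersion([1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 2.0, 4.0], n_params=2)
print(psi_hat)
```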

3.2. Dependent Compound Regression Model

Although independence between $N$ and $Y$ is commonly assumed in compound regression models, it is rarely observed in practice. For instance, in the framework of modeling the aggregate claim amounts, it is typical to observe that the claim amounts depend on the claim frequency as well. See, for example, the work of Garrido et al. [17]. As a result, $N$ is included as a covariate in the regression model of $\bar{Y}$. Let $\theta$ represent the regression parameter associated with $N$. Since $S$ denotes a random sum, it can be written as $S = N \bar{Y}$. The GLM of $S$ through the log-link function is given by Garrido et al. [17] as
$$\mu = e^{X \beta} M_N'(\theta),$$
where $M_N'(\theta)$ represents the derivative of the moment generating function of $N$ with respect to $\theta$. Taking $N$ as CMP, $M_N'(\theta)$ is obtained as
$$M_N'(\theta) = \sum_{n=0}^{\infty} n e^{\theta n} \frac{\left(\mu_1 + \frac{e^{\phi} - 1}{2 e^{\phi}}\right)^{n e^{\phi}}}{(n!)^{e^{\phi}} Z(\mu_1, \phi)}.$$
Note that if $\theta = 0$, i.e., when $N$ is independent of $\bar{Y}$, $M_N'(\theta) = E(N)$, and thus the dependent compound regression model coincides with the independent compound regression model. The pdf of $S$ in the dependent case is given by
$$f_S(s) = f_{\bar{Y} \mid N}(\bar{y} \mid n) \, f_N(n),$$
where $f_{\bar{Y} \mid N}(\bar{y} \mid n)$ is given in Equation (15) with $\mu_2 = \mu_{\theta}$ and $\psi = \psi_{\theta}$. The corresponding log-likelihood function is
$$l(\alpha, \beta, \phi, \psi, \theta) = l(\alpha, \phi; n) + l(\beta, \psi, \theta; \bar{y} \mid n),$$
where $l(\alpha, \phi; n)$ corresponds to Equation (14). Let the ML estimates of $\alpha$, $\beta$ and $\theta$ be denoted as $\tilde{\alpha}$, $\tilde{\beta}$ and $\tilde{\theta}$, where $\tilde{\alpha}$ is obtained using Equation (17). The function $l(\beta, \psi, \theta; \bar{y} \mid n)$ corresponds to Equation (16) with $\mu_2$ replaced with $\mu_{\theta}$. To obtain the estimates of $\beta$ and $\theta$, the GLM of $E(\bar{Y} \mid N, X)$ is used with the log-link function and is defined by $\mu_{\theta} = e^{X \beta + \theta N}$. The corresponding likelihood equations of the regression parameters are
$$\sum_{k=1}^{t} \frac{n_k x_{kl}}{e^{\sum_{l=0}^{p} x_{kl} \beta_l + \theta n_k}} \left(\bar{y}_k - e^{\sum_{l=0}^{p} x_{kl} \beta_l + \theta n_k}\right) = 0, \quad l = 0, 1, \ldots, p, \tag{21}$$
and
$$\sum_{k=1}^{t} \frac{n_k^{2}}{e^{\sum_{l=0}^{p} x_{kl} \beta_l + \theta n_k}} \left(\bar{y}_k - e^{\sum_{l=0}^{p} x_{kl} \beta_l + \theta n_k}\right) = 0. \tag{22}$$
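The dispersion parameter $\psi_{\theta}$ can be estimated using
$$\hat{\psi}_{\theta} = \frac{1}{t - (p + 1)} \sum_{k=1}^{t} \left(\frac{\bar{y}_k - \hat{\mu}_{\theta k}}{\hat{\mu}_{\theta k}}\right)^{2},$$
where $\hat{\mu}_{\theta k}$ is the estimated value of $\mu_{\theta}$ corresponding to the $k$th observation. In addition, $\tilde{\beta}$ and $\tilde{\theta}$ can be obtained by solving Equations (21) and (22) through iterative algorithms. Thus, the estimate of $\mu$ is given by $\tilde{\mu} = e^{X \tilde{\beta}} M_N'(\tilde{\theta})$. Denote $\beta_{\theta} = \begin{pmatrix} \beta \\ \theta \end{pmatrix}_{(p + 2) \times 1}$ and its ML estimate as $\tilde{\beta}_{\theta} = \begin{pmatrix} \tilde{\beta} \\ \tilde{\theta} \end{pmatrix}_{(p + 2) \times 1}$. By the asymptotic property of the ML estimators, we have
$$\Sigma_{\beta_{\theta}}^{-1/2} (\tilde{\beta}_{\theta} - \beta_{\theta}) \sim N_{p+2}(0, I).$$
Here, $\beta_{\theta}$ and $\Sigma_{\beta_{\theta}}$ denote the mean vector and covariance matrix of $\tilde{\beta}_{\theta}$, respectively. The standard error of $\tilde{\beta}_{\theta}$ corresponds to the square root of the diagonal elements of the sample Hessian matrix, which is given by $H_{\tilde{\beta}_{\theta}} = \hat{\psi}_{\theta} X^{*\top} A X^{*}$, where $X^{*}$ is the design matrix of order $t \times (p + 2)$ that includes $n$, and $A$ is a $t \times t$ diagonal matrix with the positive elements of $n$. Note that $H_{\tilde{\alpha}} = H_{\hat{\alpha}}$.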
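The quantity $M_N'(\theta)$ used in the dependent model has no closed form for a general CMP distribution, but it can be approximated by truncating its defining series. The Python sketch below does this; the name `cmp_mgf_derivative` and the truncation point `max_terms` are assumptions for illustration only (the paper itself uses the com.expectation() function in R). The final lines compare the result with the Poisson closed form $\mu_1 e^{\theta} \exp(\mu_1 (e^{\theta} - 1))$, which applies when $\phi = 0$.

```python
import math


def cmp_mgf_derivative(theta, mu1, phi, max_terms=500):
    """Approximate M_N'(theta) = sum_n n * exp(theta * n) * P(N = n).

    P(N = n) is the mean-parametrized CMP pmf; the series is truncated
    at max_terms, which is an assumption of this sketch.
    """
    nu = math.exp(phi)
    base = mu1 + (nu - 1.0) / (2.0 * nu)
    if base <= 0.0:
        raise ValueError("mu1 + (e^phi - 1)/(2 e^phi) must be positive")
    log_w = [nu * (n * math.log(base) - math.lgamma(n + 1))
             for n in range(max_terms)]
    top = max(log_w)
    log_z = top + math.log(sum(math.exp(v - top) for v in log_w))
    # The n = 0 term contributes nothing, so the sum starts at n = 1.
    return sum(n * math.exp(theta * n + log_w[n] - log_z)
               for n in range(1, max_terms))


# Poisson check (phi = 0): M_N'(theta) = mu1 * e^theta * exp(mu1 * (e^theta - 1)).
approx = cmp_mgf_derivative(0.5, 2.0, 0.0)
exact = 2.0 * math.exp(0.5) * math.exp(2.0 * (math.exp(0.5) - 1.0))
print(approx, exact)
```

At $\theta = 0$ the same function returns the mean of the truncated distribution, in line with the remark that the dependent model then reduces to the independent one.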

4. Prediction Intervals

From the estimates of the regression parameters, we can obtain an estimate of the expected value of $S$ for some fixed values of the covariates. Given the covariates, it is frequently useful to predict the actual value of $S$. In a regression setup, the actual value of $S$ is related to its expected value as
$$S = \hat{E}(S \mid X) + \epsilon,$$
where $\epsilon$ is the error term. Since $\epsilon$ is unobserved, it is not possible to predict the actual $S$ exactly. Instead, a prediction interval is constructed, that is, an interval that contains the actual value of $S$ with a specified probability. In this section, a method for calculating the PI for $S$ is proposed. Let $S_0$ denote the response given the covariate $x_0 = (1, x_{01}, \ldots, x_{0p})$. Thus, we have $S_0 = \hat{E}(S_0 \mid x_0) + \epsilon$, where $\hat{E}(S_0 \mid x_0) = \exp(x_0 \hat{\delta}) = \hat{\mu}_0$ (say). Assuming $E(\epsilon) = 0$, we get $E(S_0) = \hat{\mu}_0$. Additionally, we have $V(S_0) = V(\hat{\mu}_0) + V(\epsilon)$. Hence, the $100(1 - \alpha)\%$ PI for $S_0$ is given by $[k_1, k_2]$, such that
$$P[k_1 \leq S_0 \leq k_2] = 1 - \alpha, \tag{23}$$
where $\alpha \in (0, 1)$. Here, $k_1$ and $k_2$ correspond, respectively, to the lower $(\alpha/2)$th and upper $(\alpha/2)$th percentiles of the distribution of $S_0$, which is the compound CMP gamma distribution with mean $E(S_0)$ and variance $V(S_0)$. Since $V(S_0)$ depends on $V(\hat{\mu}_0)$, we proceed as below to obtain an expression for $V(\hat{\mu}_0)$. To begin, consider
$$\hat{\mu}_0 = \exp(x_0 \hat{\delta}) \ \Rightarrow \ \ln(\hat{\mu}_0) = x_0 \hat{\delta}. \tag{24}$$
Using the first-order Taylor series expansion of $\ln(A)$ at $E(A)$, we have
$$\ln(A) \approx \ln(E(A)) + (A - E(A)) \frac{1}{E(A)}.$$
Thus, we have
$$E(\ln(A)) \approx \ln(E(A)) \tag{25}$$
and
$$V(\ln(A)) \approx \frac{V(A)}{E(A)^{2}}. \tag{26}$$
Taking $A$ to be $\hat{\mu}_0$ in Equations (25) and (26), we obtain $E(\ln(\hat{\mu}_0)) \approx \ln E(\hat{\mu}_0)$ and $V(\ln(\hat{\mu}_0)) \approx \frac{V(\hat{\mu}_0)}{E(\hat{\mu}_0)^{2}}$. From Equation (24), we establish that
$$E(\ln(\hat{\mu}_0)) \approx E(x_0 \hat{\delta}) = x_0 E(\hat{\delta}) \ \Rightarrow \ E(\hat{\mu}_0) \approx \exp(x_0 E(\hat{\delta})) = \exp(x_0 \delta) = \mu_0.$$
In a similar manner, we obtain
$$V(\hat{\mu}_0) \approx V(\ln(\hat{\mu}_0)) \, E(\hat{\mu}_0)^{2} = V(x_0 \hat{\delta}) \, \mu_0^{2} = x_0 V(\hat{\delta}) x_0^{\top} \mu_0^{2} = x_0 \, \mathrm{diag}(\Sigma_{\delta}) \, x_0^{\top} \mu_0^{2}.$$
An estimate of $V(\epsilon)$, namely, $\hat{V}(\epsilon)$, can be obtained by dividing the residual sum of squares (RSS) of the compound CMP regression model by $m - (p + 1)$. Using $V(\hat{\mu}_0)$ and $\hat{V}(\epsilon)$, we obtain $V(S_0)$. However, obtaining the values of $k_1$ and $k_2$ from Equation (23) is not easy since the cumulative distribution function of the compound CMP gamma distribution is not invertible. One may use bootstrap procedures to identify $k_1$ and $k_2$. We propose below a heuristic method to obtain the PI using the two-part GLM methodology given in the previous section.
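The delta-method approximation for $V(\hat{\mu}_0)$ derived above can be sketched in a few lines of Python. The function name `delta_method_variance` and the `diag_only` switch are assumptions introduced here: with `diag_only=True` only the diagonal of the covariance matrix is used, as in the last expression of the derivation, whereas the default uses the full quadratic form $x_0 V(\hat{\delta}) x_0^{\top}$. The fitted value $\hat{\mu}_0$ is plugged in for the unknown $\mu_0$.

```python
import math


def delta_method_variance(x0, delta_hat, cov_delta, diag_only=False):
    """Return (mu0_hat, approximate variance of mu0_hat) under a log link.

    x0        : covariate row, starting with 1 for the intercept
    delta_hat : estimated regression coefficients
    cov_delta : covariance matrix of delta_hat as a list of rows
    diag_only : if True, ignore the off-diagonal covariances
    """
    mu0 = math.exp(sum(x * d for x, d in zip(x0, delta_hat)))
    p = len(x0)
    quad = 0.0
    for i in range(p):
        for j in range(p):
            if diag_only and i != j:
                continue
            quad += x0[i] * cov_delta[i][j] * x0[j]
    # V(mu0_hat) is approximated by V(x0 * delta_hat) * mu0^2.
    return mu0, quad * mu0 ** 2


# Hypothetical inputs for illustration only.
cov = [[0.04, 0.01], [0.01, 0.09]]
mu0_hat, var_full = delta_method_variance([1.0, 2.0], [0.0, 0.0], cov)
_, var_diag = delta_method_variance([1.0, 2.0], [0.0, 0.0], cov, diag_only=True)
print(mu0_hat, var_full, var_diag)
```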
The PI for $S_0$ is obtained using the PIs of $N_0$ and $\bar{Y}_0$, where $N_0 = \hat{E}(N_0 \mid x_0) + \epsilon$ and $\bar{Y}_0 = \hat{E}(\bar{Y}_0 \mid x_0) + \epsilon$. Note that $\hat{E}(N_0 \mid x_0)$ is obtained from the GLM of $N$ on $X$ and $\hat{E}(\bar{Y}_0 \mid x_0)$ is obtained using the GLM of $\bar{Y}$ on $X$. Denoting $\hat{E}(N_0 \mid x_0) = \hat{\mu}_{01}$ and $\hat{E}(\bar{Y}_0 \mid x_0) = \hat{\mu}_{02}$, we have $\hat{\mu}_{01} = \exp(x_0 \hat{\alpha})$ and $\hat{\mu}_{02} = \exp(x_0 \hat{\beta})$. Proceeding along similar lines as for the PI for $S_0$, the PIs for $N_0$ and $\bar{Y}_0$ can be obtained, respectively, as $[a_1, a_2]$ and $[b_1, b_2]$, such that
$$P[a_1 \leq N_0 \leq a_2] = 1 - \alpha$$
and
$$P[b_1 \leq \bar{Y}_0 \leq b_2] = 1 - \alpha,$$
where $\alpha \in (0, 1)$. Since $N_0$ has the mean-reparameterized CMP distribution given in Equation (3), $a_1$ and $a_2$ are, respectively, the lower $(\alpha/2)$th and upper $(\alpha/2)$th percentiles of the CMP distribution with mean $\hat{\mu}_{01}$ and dispersion parameter $\phi = \frac{\hat{\mu}_{01}}{V(\hat{\mu}_{01}) + \hat{V}(\epsilon)}$, where $V(\hat{\mu}_{01}) = x_0 \, \mathrm{diag}(\Sigma_{\alpha}) \, x_0^{\top} \mu_{01}^{2}$. Likewise, $b_1$ and $b_2$ correspond, respectively, to the lower $(\alpha/2)$th and upper $(\alpha/2)$th percentiles of the mean-reparameterized gamma distribution given in Equation (15) with mean $\hat{\mu}_{02}$ and dispersion parameter $\psi = \frac{V(\hat{\mu}_{02}) + \hat{V}(\epsilon)}{\hat{\mu}_{02}^{2}}$, where $V(\hat{\mu}_{02}) = x_0 \, \mathrm{diag}(\Sigma_{\beta}) \, x_0^{\top} \mu_{02}^{2}$. If $\Sigma_{\alpha}$ and $\Sigma_{\beta}$ are not known, the corresponding sample Hessian matrices can be used to compute $V(\hat{\mu}_{01})$ and $V(\hat{\mu}_{02})$. The values of $\hat{V}(\epsilon)$ for the CMP and gamma regression models can be obtained by dividing the RSS of the corresponding regression models by $m - h$ and $t - h$, respectively, where $h$ denotes the number of regression parameters in the model.
The PI for $S_0$ given $x_0$ can be constructed using the PIs of $N_0$ and $\bar{Y}_0$. By virtue of the equality $S = N \bar{Y}$, a trivial PI for $S_0$ given $x_0$ can be taken to be $[k_1, k_2] = [a_1 b_1, a_2 b_2]$. When $N$ is large, it may be useful to know the PI for $S_0$. For example, in modeling aggregate claim amounts from insurance data, the company may want to know the PI for the aggregate claim amount for high claim frequencies so that enough funds can be maintained. In this case, the PI for $S_0$ given $x_0$ can be defined as $[a_2 b_1, a_2 b_2]$. This definition of the PI is used in the remainder of the paper.
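The two interval definitions above can be illustrated with a short Python sketch. It computes CMP percentiles by accumulating a truncated pmf and then forms both $[a_1 b_1, a_2 b_2]$ and $[a_2 b_1, a_2 b_2]$. The helper names `cmp_quantile` and `product_intervals`, the truncation point and the numerical gamma limits are assumptions made for illustration; in the paper the percentiles are obtained with R functions.

```python
import math


def cmp_quantile(prob, mu1, phi, max_terms=500):
    """Smallest n whose truncated mean-parametrized CMP cdf reaches prob."""
    nu = math.exp(phi)
    base = mu1 + (nu - 1.0) / (2.0 * nu)
    if base <= 0.0:
        raise ValueError("mu1 + (e^phi - 1)/(2 e^phi) must be positive")
    log_w = [nu * (n * math.log(base) - math.lgamma(n + 1))
             for n in range(max_terms)]
    top = max(log_w)
    log_z = top + math.log(sum(math.exp(v - top) for v in log_w))
    cumulative = 0.0
    for n, v in enumerate(log_w):
        cumulative += math.exp(v - log_z)
        if cumulative >= prob:
            return n
    return max_terms - 1


def product_intervals(count_pi, severity_pi):
    """Return the trivial PI [a1*b1, a2*b2] and the PI [a2*b1, a2*b2]."""
    a1, a2 = count_pi
    b1, b2 = severity_pi
    return (a1 * b1, a2 * b2), (a2 * b1, a2 * b2)


# 95% limits for N0 with mean 3 and phi = 0 (the Poisson case).
count_pi = (cmp_quantile(0.025, 3.0, 0.0), cmp_quantile(0.975, 3.0, 0.0))
# The gamma limits are taken as given here (hypothetical values).
trivial_pi, high_frequency_pi = product_intervals(count_pi, (2.0, 5.0))
print(count_pi, trivial_pi, high_frequency_pi)
```

The second interval is narrower because its lower limit is anchored at the upper count percentile, which is the design choice the text motivates for high claim frequencies.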

5. Numerical Illustration

5.1. Simulation Study

This section provides a numerical illustration of how to compute the PI for $S$ using simulated data for the independent and dependent compound regression models. To generate random samples from the CMP and gamma regression models with a single covariate $X_1 = (x_{11}, x_{21}, \ldots, x_{m1})$, generated from a standard normal distribution, the following steps are implemented:
  • Generate $n_k$, $k = 1, 2, \ldots, m$, from the CMP distribution given in Equation (3) with mean $\mu_{1k} = \exp(\alpha_0 + \alpha_1 x_{k1})$ by fixing $\alpha_0$, $\alpha_1$ and $\phi$. Obtain $n = (n_1, n_2, \ldots, n_m)$.
  • For each $n_k > 0$, generate $y_{kj}$, $j = 1, 2, \ldots, n_k$, from the gamma distribution given in Equation (5) with mean $\mu_{2k}$ by fixing $\psi$, $\beta_0$, $\beta_1$ and $\theta$, where $\mu_{2k} = \exp(\beta_0 + \beta_1 x_{k1})$ for the independent compound regression model and $\mu_{2k} = \exp(\beta_0 + \beta_1 x_{k1} + \theta n_k)$ for the dependent compound regression model. Compute $\bar{y}_k$ and obtain $\bar{y} = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_m)$.
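The two generation steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `sample_cmp` and `simulate_compound` are hypothetical, the CMP draw uses inverse-cdf sampling on a truncated pmf, and the gamma draw uses shape $\psi$ and scale $\mu_{2k}/\psi$ so that its mean is $\mu_{2k}$. The paper's own simulation is carried out in R.

```python
import math
import random


def sample_cmp(rng, mu1, phi, max_terms=200):
    """Draw one count from the truncated mean-parametrized CMP by inverse cdf."""
    nu = math.exp(phi)
    base = mu1 + (nu - 1.0) / (2.0 * nu)
    if base <= 0.0:
        raise ValueError("mu1 + (e^phi - 1)/(2 e^phi) must be positive")
    log_w = [nu * (n * math.log(base) - math.lgamma(n + 1))
             for n in range(max_terms)]
    top = max(log_w)
    log_z = top + math.log(sum(math.exp(v - top) for v in log_w))
    u, cumulative = rng.random(), 0.0
    for n, v in enumerate(log_w):
        cumulative += math.exp(v - log_z)
        if u <= cumulative:
            return n
    return max_terms - 1


def simulate_compound(m, alpha, beta, phi, psi, theta=0.0, seed=1):
    """Return rows (x, n, y_bar, s); theta = 0 gives the independent model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(m):
        x = rng.gauss(0.0, 1.0)                       # standard normal covariate
        n = sample_cmp(rng, math.exp(alpha[0] + alpha[1] * x), phi)
        if n > 0:
            mu2 = math.exp(beta[0] + beta[1] * x + theta * n)
            # gammavariate(shape, scale): the mean is shape * scale = mu2.
            claims = [rng.gammavariate(psi, mu2 / psi) for _ in range(n)]
            y_bar = sum(claims) / n
        else:
            y_bar = 0.0                               # no claims, so S = 0
        rows.append((x, n, y_bar, n * y_bar))
    return rows


rows = simulate_compound(50, alpha=(0.5, 0.3), beta=(1.0, 0.5), phi=1.6, psi=1.5)
print(rows[:3])
```

For strongly negative $\phi$ and small means, the base of the power in Equation (3) can become non-positive, in which case this sketch raises an error rather than sampling.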
For the simulation, the values of the regression parameters are taken as $\alpha_0 = 0.5$, $\alpha_1 = 0.3$, $\beta_0 = 1$, $\beta_1 = 0.5$ and $\theta = 0.5$. The dispersion parameter $\psi$ of the gamma distribution is set to 1.5. To accommodate over-, equi- and under-dispersion in $N$, three choices of the dispersion parameter $\phi$, namely, $\phi = -1.6$, $0$ and $1.6$, are considered. The CMP and gamma GLMs are fitted to the generated $n$ and $\bar{y}$ values, using their respective log-link functions for both the independent and dependent compound regression models. All the computations are carried out in R (version 4.1.1). The cmp() function in the cmpreg package (Ribeiro Jr [18]) and the glm() function are used to carry out the CMP and gamma regression, respectively. To compute the value of $M_N'(\hat{\theta})$ in the dependent compound regression model, the com.expectation() function in the compoisson package is employed. The qcom() function in the compoisson package is used to determine the quantile values from the CMP distribution, and the qgammaAlt() function in the EnvStats package is used to determine quantile values from the gamma distribution. For the above choices of the parameters, the 95% PI for $S$ is obtained for the independent and dependent compound regression models under three choices of sample size ($m$), namely, $m = 25$, 50 and 100. The actual $S$ observations, denoted by $s = (s_1, s_2, \ldots, s_m)$, are computed as $s_k = n_k \bar{y}_k$, $k = 1, 2, \ldots, m$.
The proportion of s lying within its PI is presented in Table 1 for the various choices of m and ϕ . Additionally, the plots of the corresponding prediction bands are displayed in Table 2 and Table 3. From Table 1, it can be observed that, for the choices of the covariate and coefficients considered, the proportion is large for ϕ = 1.6 in the independent compound regression model and for ϕ = 1.6 in the dependent compound regression model.
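The proportion reported in Table 1 is the share of observed $s_k$ values that fall inside their own prediction intervals. A minimal Python sketch of that calculation is shown below; the function name `coverage_proportion` and the toy numbers are assumptions for illustration and are not taken from the paper's results.

```python
def coverage_proportion(observed, intervals):
    """Share of observations s_k lying inside their PI [lower_k, upper_k]."""
    if len(observed) != len(intervals) or not observed:
        raise ValueError("need one interval per observation")
    hits = sum(1 for s, (lower, upper) in zip(observed, intervals)
               if lower <= s <= upper)
    return hits / len(observed)


# Hypothetical observations and intervals: two of the three are covered.
proportion = coverage_proportion([1.0, 5.0, 10.0],
                                 [(0.0, 2.0), (6.0, 8.0), (9.0, 12.0)])
print(proportion)
```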

5.2. Real-Life Application

In this section, the proposed two-part methodology to obtain the PI for the compound CMP gamma regression is applied to real-life vehicle insurance claims data. The dataset pertains to the average damage claims for privately owned and insured vehicles in Britain in the year 1975. See Dutang and Charpentier [19]. It consists of 128 observations on five variables, namely, the owner’s age ($X_1$), car age ($X_2$), model ($X_3$), number of claims ($N$) and average claim amount ($\bar{Y}$) in pounds. The variable $X_1$ consists of eight categories of age group; the variable $X_2$, four categories of car age; and the variable $X_3$, four categories of model. The aggregate claim amount ($S$) for each observation is obtained by multiplying the average claim amount by the number of claims. A dispersion test on $N$, performed using the function dispersiontest() available in the AER package in R, resulted in a dispersion index of 119.8246 and a p-value of $2.091 \times 10^{-6}$, indicating that $N$ is over-dispersed. Similarly, the Kolmogorov–Smirnov test used to assess the goodness-of-fit of the gamma distribution to $\bar{Y}$ yielded a p-value of 0.7191. As a result, the CMP distribution can be used to model $N$, whereas the gamma distribution can be used to model $\bar{Y}$. To implement the proposed estimation methodology and validate its performance, 80% of the observations are randomly chosen as training data and the remaining 20% as test data. The observations in the training data are used to fit the independent and dependent compound regression models. The owner’s age, car age and car model are the covariates considered in the model. The cmp() function in the cmpreg package and the glm() function are used to obtain the estimates of the CMP and gamma regression models, respectively. The estimates of the regression parameters, their corresponding p-values (in parentheses) and the AIC values are given in Table 4.
Using the AIC values for the CMP and gamma regression models, the combined AIC values for the compound regression models are obtained as 2110.31 and 2108.31 , respectively. For each observation in the test data, the PI for S is computed using the estimates of the fitted model. The corresponding prediction band of the independent and dependent compound regression model is displayed in Figure 1. From this figure, it can be noted that some observations do not fall within the prediction band. One reason for this is that these observations have large claim frequencies when compared with the other observations, and the corresponding limits of the PI based on the CMP regression are also large. As a result, the limits of the PI of such observations deviate from their observed values. The proportion of observed S in the test data lying within its PI is found to be 0.4782 and 0.6956 for the independent and dependent compound regression models, respectively. Based on the combined AIC values and the proportions, it can be inferred that the dependent compound regression model provides a relatively better fit for modeling the aggregate claim amount.

6. Conclusions

The Poisson distribution is generally used in compound regression models as the counting distribution. In practice, the Poisson distribution’s equi-dispersion assumption is frequently violated. The methodology presented in this paper provides a way to handle non-equi-dispersed count data in the context of compound regression models by using the CMP distribution. The proposed compound regression model can be used when the count data are over- or under-dispersed. The estimation of the parameters was carried out using a two-part GLM approach for the independent and dependent compound regression models. This approach is less complex and provides separate estimates for the count and the continuous distribution involved in the model. Since, in practice, knowing the actual value of the response variable is more useful than knowing its predicted value, a methodology to obtain the prediction interval of the response variable was proposed. An application of the two-part GLM method to real-life data revealed that the dependent compound regression model performs relatively better than the independent compound regression model. Thus, in practice, one can start with the dependent compound regression model and check the significance of the count variable in the model. If the count variable is not found to be significant, then the independent compound regression model can be used. To conclude, the proposed compound CMP regression model could be an alternative for modeling a compound random variable when the count data are not equi-dispersed.

Author Contributions

J.M. has contributed to the conceptualization, methodology, mathematical derivation and simulation. V.S.V. and C.C. have contributed equally to mathematical derivation and original draft preparation. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gómez-Déniz, E.; Pérez-Rodríguez, J.V. Modelling distribution of aggregate expenditure on tourism. Econ. Model. 2019, 78, 293–308.
  2. Klugman, S.A.; Panjer, H.H.; Willmot, G.E. Loss Models: From Data to Decisions; John Wiley & Sons: New York, NY, USA, 2012; Volume 715.
  3. Bahnemann, D. Distributions for Actuaries; Casualty Actuarial Society: Arlington, VA, USA, 2015; Volume 2.
  4. Jørgensen, B.; Paes De Souza, M.C. Fitting Tweedie’s compound Poisson model to insurance claims data. Scand. Actuar. J. 1994, 1994, 69–93.
  5. Consul, P.C.; Jain, G.C. A generalization of the Poisson distribution. Technometrics 1973, 15, 791–799.
  6. Shmueli, G.; Minka, T.P.; Kadane, J.B.; Borle, S.; Boatwright, P. A useful distribution for fitting discrete data: Revival of the Conway-Maxwell-Poisson distribution. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2005, 54, 127–142.
  7. Conway, R.W.; Maxwell, W.L. A queuing model with state dependent service rates. J. Ind. Eng. 1962, 12, 132–136.
  8. Sellers, K.F.; Borle, S.; Shmueli, G. The COM-Poisson model for count data: A survey of methods and applications. Appl. Stoch. Model. Bus. Ind. 2012, 28, 104–116.
  9. Sellers, K.F.; Premeaux, B. Conway-Maxwell-Poisson regression models for dispersed count data. Wiley Interdiscip. Rev. Comput. Stat. 2021, 13, e1533.
  10. Saavithri, V.; Priyadharshini, J.; Banu, Z.P. Compound COM-Poisson Distribution with Binomial Compounding Distribution. Available online: https://www.internationaljournalssrg.org/uploads/specialissuepdf/ICRMIT/2018/MTT/ICRMIT-P122.pdf (accessed on 15 January 2023).
  11. Frees, E.W.; Gao, J.; Rosenberg, M.A. Predicting the frequency and amount of health care expenditures. N. Am. Actuar. J. 2011, 15, 377–392.
  12. Andersen, D.A.; Bonat, W.H. Double generalized linear compound Poisson models to insurance claims data. Electron. J. Appl. Stat. Anal. 2017, 10, 384–407.
  13. Delong, Ł.; Lindholm, M.; Wüthrich, M.V. Making Tweedie’s compound Poisson model more accessible. Eur. Actuar. J. 2021, 11, 185–226.
  14. Ribeiro, E.E., Jr.; Zeviani, W.M.; Bonat, W.H.; Demétrio, C.G.; Hinde, J. Reparametrization of COM-Poisson regression models with applications in the analysis of experimental data. Stat. Model. 2020, 20, 443–466.
  15. Jorgensen, B. The Theory of Dispersion Models; CRC Press: Boca Raton, FL, USA, 1997.
  16. De Jong, P.; Heller, G.Z. Generalized Linear Models for Insurance Data; Cambridge University Press: Cambridge, UK, 2008.
  17. Garrido, J.; Genest, C.; Schulz, J. Generalized linear models for dependent frequency and severity of insurance claims. Insur. Math. Econ. 2016, 70, 205–215.
  18. Ribeiro, E.E., Jr. Cmpreg: Reparametrized COM-Poisson Regression Models, R Package Version 0.0.1; Available online: https://rdrr.io/github/JrEduardo/cmpreg/ (accessed on 15 January 2023).
  19. Dutang, C.; Charpentier, A. CASdatasets: Insurance Datasets. 2019. R Package Version 1.0-11. Available online: http://cas.uqam.ca/ (accessed on 15 January 2023).
Figure 1. Prediction band for the test data under independent model and dependent model.
Table 1. Proportion of S lying in its respective PIs.
| m | ϕ | Independent Model | Dependent Model |
|---|---|---|---|
| 25 | −1.6 | 0.6667 | 0.9444 |
| 25 | 0 | 0.7777 | 0.8333 |
| 25 | 1.6 | 0.8400 | 0.8400 |
| 50 | −1.6 | 0.7353 | 0.8529 |
| 50 | 0 | 0.6500 | 0.7000 |
| 50 | 1.6 | 0.7656 | 0.8297 |
| 100 | −1.6 | 0.6615 | 0.9077 |
| 100 | 0 | 0.7088 | 0.9493 |
| 100 | 1.6 | 0.7777 | 0.9393 |
Table 2. Prediction bands of independent compound regression model for over-, equi- and under-dispersed data. [Image panels for m = 25, 50, 100 (rows) and ϕ = −1.6, 0, 1.6 (columns); images not reproduced here.]
Table 3. Prediction bands of dependent compound regression model for over-, equi- and under-dispersed data. [Image panels for m = 25, 50, 100 (rows) and ϕ = −1.6, 0, 1.6 (columns); images not reproduced here.]
Table 4. Parameter estimates, p-values (in parentheses) and AIC for the CMP and gamma regression models for the real-life data.
| Covariates | CMP Regression Model | Gamma Regression Model (Independent Case) | Gamma Regression Model (Dependent Case) |
|---|---|---|---|
| (Intercept) | 1.5007 ($< 2 \times 10^{-16}$) | 5.7421 ($< 2 \times 10^{-16}$) | 5.7754 ($< 2 \times 10^{-16}$) |
| OwnerAge21–24 | 1.5885 ($< 2 \times 10^{-16}$) | −0.2010 (0.0670) | −0.1800 (0.0964) |
| OwnerAge25–29 | 2.6237 ($< 2 \times 10^{-16}$) | −0.1129 (0.2705) | −0.0497 (0.6357) |
| OwnerAge30–34 | 2.7585 ($< 2 \times 10^{-16}$) | −0.3276 (0.0034) | −0.2542 (0.0262) |
| OwnerAge35–39 | 2.8854 ($< 2 \times 10^{-16}$) | −0.3150 (0.0047) | −0.2271 (0.0496) |
| OwnerAge40–49 | 3.5362 ($< 2 \times 10^{-16}$) | −0.2722 (0.0081) | −0.1140 (0.3528) |
| OwnerAge50–59 | 3.3678 ($< 2 \times 10^{-16}$) | −0.1854 (0.0843) | −0.0590 (0.6219) |
| OwnerAge60+ | 3.0280 ($< 2 \times 10^{-16}$) | −0.3054 (0.0036) | −0.2120 (0.0553) |
| ModelB | 1.0255 ($< 2 \times 10^{-16}$) | 0.0584 (0.4260) | 0.1414 (0.0877) |
| ModelC | 0.6930 ($< 2 \times 10^{-16}$) | 0.1083 (0.1387) | 0.1500 (0.0450) |
| ModelD | −0.1889 (0.00485) | 0.4041 ($6.01 \times 10^{-7}$) | 0.3762 ($2.40 \times 10^{-6}$) |
| CarAge10+ | −1.9174 ($< 2 \times 10^{-16}$) | −0.8138 ($< 2 \times 10^{-16}$) | −0.9494 ($5.87 \times 10^{-16}$) |
| CarAge4–7 | −0.1558 ($6.65 \times 10^{-5}$) | −0.0615 (0.3959) | −0.0727 (0.3089) |
| CarAge8–9 | −1.4876 ($< 2 \times 10^{-16}$) | −0.4188 ($8.64 \times 10^{-8}$) | −0.5323 ($2.02 \times 10^{-8}$) |
| NClaims | - | - | −0.0010 (0.0301) |
| $\hat{\phi}$ | −0.8374 ($< 2 \times 10^{-16}$) | - | - |
| $\hat{\psi}$ | - | 0.0667 | 0.0644 |
| AIC | 984.7148 | 1125.6 | 1123.6 |