A Flexible Semi-Poisson Distribution with Applications to Insurance Claims and Biological Data

Almuhayfith, Fatimah E.; Bapat, Sudeep R.; Bakouch, Hassan S.; Alnaghmosh, Aminh M.

doi:10.3390/math11051122

by Fatimah E. Almuhayfith ¹,*, Sudeep R. Bapat ², Hassan S. Bakouch ³,⁴ and Aminh M. Alnaghmosh ¹

¹ Department of Mathematics and Statistics, College of Science, King Faisal University, Alahsa 31982, Saudi Arabia

² Department of Operations Management and Quantitative Techniques, Indian Institute of Management, Indore 453556, India

³ Department of Mathematics, College of Science, Qassim University, Buraydah 51452, Saudi Arabia

⁴ Department of Mathematics, Faculty of Science, Tanta University, Tanta 31111, Egypt

* Author to whom correspondence should be addressed.

Mathematics 2023, 11(5), 1122; https://doi.org/10.3390/math11051122

Submission received: 18 January 2023 / Revised: 13 February 2023 / Accepted: 20 February 2023 / Published: 23 February 2023

(This article belongs to the Section Probability and Statistics)

Abstract:

In this paper, a discrete one-parameter distribution called the semi-Poisson distribution is introduced, which is supported on the set of non-negative integers. This distribution is seen to capture over-dispersion and zero-inflation scenarios well. Several properties of the proposed distribution, such as moments, the probability-generating function, the index of dispersion, a recurrence relation for the moments, and negative moments, are presented. The distribution is applied to two real-life datasets related to insurance claims and parasite counts, where it is noted to perform better than many of the existing discrete distributions on $Z^{+}$, including some of the recently introduced ones.

Keywords:

discrete distributions; over-dispersion; recurrence relation for the moments; entropy; estimation; simulation

MSC:

62E15; 62G30; 62E10

1. Introduction

Many discrete distributions supported on the entire set of non-negative integers, i.e., $Z^{+} = \{0, 1, 2, \dots\}$, are available in the literature. Some of the well-known ones are the Poisson, geometric, and negative binomial distributions. However, these distributions have their own limitations when it comes to real-world applications with zero-inflation and/or over-dispersion. A zero-inflated model is one in which zero-valued observations occur more frequently than a standard count model would predict. A routine practical example is the number of insurance claims within a population prone to a particular risk, where there are no claims (hence 0) for the sub-population that has not taken out the insurance. The most widely applied zero-inflated model is the zero-inflated Poisson (ZIP) model and its associated regression setting, introduced by [1], where the model was applied to quality-control data. A few other applications of the ZIP model can be found in dental epidemiology [2], psychology [3], and computer science [4]. Generally, once the inflated zeros are accounted for, the non-zero part of the dataset appears to be over-dispersed, i.e., the observations show greater variability than the model predicts. Such a scenario is often witnessed in ecological or biological examples, where the count data are persistent. The ZIP model fails to capture this over-dispersion because the mean and variance of a Poisson distribution are equal; hence, the ZIP estimators appear to be severely biased. Several improved models have been introduced in the literature, including the zero-inflated negative binomial model introduced by [5] and the hurdle models discussed by [6]. A few other notable papers in this area are: [7], where the authors analyze excess zeros in substance abuse research; [8], which provides an introduction to Poisson regression and its alternatives; [9], where the authors discuss zero-inflated and hurdle models in an HIV-risk reduction intervention trial; and [10], which discusses models for excess zeros in the presence of outliers. Additionally, we may refer to [11], where the authors compare the above-mentioned count data models and others under different zero-inflation situations.

In this paper, in the same spirit, we introduce a single-parameter discrete distribution supported on the entire set of non-negative integers $Z^{+}$, called the semi-Poisson ($SP$) distribution. The probability mass function (pmf) of the semi-Poisson ($SP(λ)$) distribution is given by

P (x; λ) = C_{λ} \frac{(x + 1) λ^{x + 2}}{(x + 2) x!}, x = 0, 1, 2, \dots,

(1)

where $C_{λ} = \frac{1}{(λ^{2} - λ + 1) e^{λ} - 1}$ and $λ > 0$ is the parameter. Figure 1 contains several pmf plots for the $SP(λ)$ distribution for different values of the parameter $λ$. A few points regarding the behaviour of the $SP$ pmf are as follows. For all parameter values $λ$, the pmf is seen to peak near the value $x = λ$. We show that it is in fact unimodal in Section 2.4. Further, for small values of $λ$ (roughly less than 1), the $SP$ pmf is decreasing throughout. For any fixed range of X, as $λ$ increases the pmf is first right skewed, then symmetric, i.e., increasing and then decreasing around $λ$, and ultimately left skewed. In the subsequent sections, we elaborate more on the advantages of the $SP$ model over a standard Poisson or other similar models, especially in terms of handling zero-inflation and over-dispersion. See the practical data analysis section of this paper for further information.
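As a quick numerical illustration of Equation (1), the pmf can be sketched in R as follows. The helper name `dsp`, the log-scale evaluation and the truncation of the support at 200 are choices made here for illustration; they are assumptions and not part of the code in Appendix A.

```r
# pmf of SP(lambda) from Equation (1); `dsp` is a hypothetical helper name.
dsp <- function(x, lambda) {
  inv_c <- (lambda^2 - lambda + 1) * exp(lambda) - 1   # 1 / C_lambda
  # (x + 1) * lambda^(x + 2) / ((x + 2) * x!) evaluated on the log scale
  exp(log(x + 1) - log(x + 2) + (x + 2) * log(lambda) - lgamma(x + 1)) / inv_c
}

x <- 0:200                          # truncated support (assumption)
total <- sum(dsp(x, 2))             # should be numerically close to 1
mode_5 <- x[which.max(dsp(x, 5))]   # location of the largest pmf value for lambda = 5
```

Working on the log scale avoids overflow in `lambda^(x + 2)` and `x!` for large x; the same function can be used to redraw pmf plots for other values of the parameter.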

The rest of the paper is organized as follows: Section 1.1 and Section 1.2 outline specific characterizations and the cumulative distribution function of the $SP(λ)$ model. Section 2 covers the theoretical properties of the $SP(λ)$ distribution. Section 3 proposes the maximum likelihood estimation method for the underlying parameter, while Section 4 outlines extensive simulations and presents a comparative analysis. Section 5 includes two specific real-life applications of the $SP$ model, and Section 6 provides the conclusions.

1.1. Characterizations

A random variable following the $SP(λ)$ distribution can also be written as a weighted Poisson random variable; the weighted Poisson distribution was introduced by [12]. Let Y be a standard Poisson random variable with parameter $λ$ and pmf denoted by $p(y, λ)$. Then, a weighted Poisson distribution has the following pmf:

p_{w} (y, λ) = \frac{w (y)}{E_{λ} [w (Y)]} p (y, λ),

(2)

where $w(y)$ is the appropriate weight function and $E_{λ}[w(Y)]$ is the normalizing constant. In our context, the pmf of the $SP(λ)$ distribution can be rewritten as follows:

P (y; λ) = (\frac{y + 1}{y + 2}) (\frac{λ^{2}}{(λ^{2} - λ + 1) - e^{- λ}}) p (y, λ) \equiv (\frac{y + 1}{y + 2}) λ^{2} e^{λ} C_{λ} p (y, λ),

(3)

where $C_{λ}$ is the constant provided in (1). The above restructured pmf can be viewed as a weighted Poisson distribution with the weight function $\frac{y + 1}{y + 2}$ and the normalizing constant ${(λ^{2} e^{λ} C_{λ})}^{- 1}$.
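The value of the normalizing constant can be checked directly. The short derivation below is a sketch that assumes only that Y is Poisson with parameter $λ$ and that the weight is $w(y) = \frac{y + 1}{y + 2}$; it is offered as a verification step rather than as part of the original construction.

```latex
\begin{aligned}
E_{λ}\left[\frac{1}{Y + 2}\right]
  &= \sum_{y = 0}^{\infty} \frac{e^{-λ} λ^{y}}{(y + 2)\, y!}
   = \frac{e^{-λ}}{λ^{2}} \sum_{k = 2}^{\infty} \frac{(k - 1) λ^{k}}{k!}
   = \frac{λ - 1 + e^{-λ}}{λ^{2}}, \\
E_{λ}[w(Y)]
  &= 1 - E_{λ}\left[\frac{1}{Y + 2}\right]
   = \frac{(λ^{2} - λ + 1) - e^{-λ}}{λ^{2}}
   = \frac{1}{λ^{2} e^{λ} C_{λ}} .
\end{aligned}
```

The middle sum uses $\sum_{k \geq 2} (k - 1) λ^{k} / k! = (λ - 1) e^{λ} + 1$, which is also the identity behind the constant $C_{λ}$ in (1).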

Another form of characterization can be seen through the P-inverted Poisson ($PIP$) distribution, as discussed by [13]. A $PIP$ distribution has the following pmf:

p_{_{P I P}} (y, λ) = (y + 1 - λ) \frac{e^{- λ} λ^{y}}{y!}, y \in N, λ \leq 1 .

(4)

Now, we can again rewrite the $SP$ pmf slightly differently as follows:

P (y; λ) = \frac{y + 1}{(y + 2) (y + 1 - λ)} λ^{2} e^{λ} C_{λ} p_{_{P I P}} (y, λ),

(5)

where $C_{λ}$ is the constant given in (1). The pmf above can be viewed as a weighted P-inverted Poisson distribution with the weight function $\frac{y + 1}{(y + 2) (y + 1 - λ)}$ and the normalizing constant ${(λ^{2} e^{λ} C_{λ})}^{- 1}$.

1.2. The Cumulative Distribution Function

A simple closed-form expression of the cumulative distribution function (cdf) for the $SP(λ)$ distribution is difficult to obtain. However, since the $SP$ random variable can be viewed as a weighted Poisson variable, we outline the cdf by utilizing the results proposed in [12]. According to [12], the cdf of the weighted Poisson distribution takes the following form:

F_{w} (y, λ) = E_{λ} [w (Y)] - λ^{y} E_{λ} [\frac{t!}{(Y + t)!} w (Y + t)],

(6)

where $t \in N$ and Y is a standard Poisson random variable. Thus, the cdf of an $SP$ random variable specifically becomes:

F (y, λ) = E_{λ} [(\frac{Y + 1}{Y + 2})] - λ^{y} E_{λ} [\frac{t!}{(Y + t)!} (\frac{Y + t + 1}{Y + t + 2})]; y, t \in N .

(7)
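Since no simple closed form is available, the cdf can in practice also be evaluated by summing the pmf in (1). The R sketch below takes that route; the names `dsp` and `psp` are hypothetical helpers introduced here, and the code does not implement the expectation-based expression in (7).

```r
# pmf of SP(lambda) from Equation (1); hypothetical helper name.
dsp <- function(x, lambda) {
  inv_c <- (lambda^2 - lambda + 1) * exp(lambda) - 1   # 1 / C_lambda
  exp(log(x + 1) - log(x + 2) + (x + 2) * log(lambda) - lgamma(x + 1)) / inv_c
}

# cdf by direct summation of the pmf over 0, 1, ..., floor(q).
psp <- function(q, lambda) {
  sapply(q, function(qi) {
    if (qi < 0) 0 else sum(dsp(0:floor(qi), lambda))
  })
}

cdf_vals <- psp(0:10, 2)   # cdf of SP(2) at 0, 1, ..., 10
```

Direct summation is exact up to rounding for any finite argument, which makes it a convenient reference when checking other representations of the cdf.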

2. Properties

This section is devoted to exploring the main theoretical properties of the $SP(λ)$ distribution.

2.1. Natural Exponential Family

We first rewrite the $SP$ pmf from Equation (1) as follows:

P (x; λ) = \frac{w (x)}{x! E_{λ} [w (X)]} exp (x ln λ - λ),

(8)

where $w(x)$ and $E_{λ}[w(X)]$ are as defined earlier. From this representation, one can see that the $SP(λ)$ distribution belongs to a natural exponential family over $N$.

2.2. The Generating Functions

After performing some algebraic calculations, the moment generating function (mgf) of X is obtained as follows:

\begin{matrix} M_{X} (t) & = E (e^{t X}) \\ & = \frac{e^{- 2 t} (e^{λ e^{t}} - λ e^{λ e^{t} + t} + λ^{2} e^{λ e^{t} + 2 t} - 1)}{e^{λ} (λ^{2} - λ + 1) - 1}, t \in R . \end{matrix}

The probability generating function (pgf) of X can be obtained by replacing $e^{t}$ by t in the mgf as follows:

\begin{matrix} P_{X} (t) & = E (t^{X}) \\ & = \frac{e^{λ t} (1 - λ t + λ^{2} t^{2}) - 1}{t^{2} (e^{λ} (1 - λ + λ^{2}) - 1)}, t \in R . \end{matrix}
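The closed form of the pgf can be compared with a truncated version of the defining series $E(t^{X})$. The following R sketch does this for one pair of values; the helper names and the cut-off of 200 terms are assumptions made for illustration.

```r
# pmf of SP(lambda) from Equation (1); hypothetical helper name.
dsp <- function(x, lambda) {
  inv_c <- (lambda^2 - lambda + 1) * exp(lambda) - 1   # 1 / C_lambda
  exp(log(x + 1) - log(x + 2) + (x + 2) * log(lambda) - lgamma(x + 1)) / inv_c
}

# Closed-form pgf from Section 2.2 (evaluated here for t != 0).
pgf_closed <- function(t, lambda) {
  (exp(lambda * t) * (1 - lambda * t + lambda^2 * t^2) - 1) /
    (t^2 * (exp(lambda) * (1 - lambda + lambda^2) - 1))
}

# Truncated series for E(t^X); `upper` is an assumed cut-off.
pgf_series <- function(t, lambda, upper = 200) {
  x <- 0:upper
  sum(t^x * dsp(x, lambda))
}

# absolute difference between the two evaluations at t = 0.7, lambda = 2
gap <- abs(pgf_closed(0.7, 2) - pgf_series(0.7, 2))
```

A gap at the level of rounding error supports the closed form; setting t = 1 in `pgf_closed` returns 1, as required of a pgf.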

2.3. Moments

By definition, the rth moment of the $SP$ random variable X about the origin can be expressed as follows:

E (X^{r}) = μ_{r}^{'} = \sum_{x = 0}^{\infty} x^{r} \frac{(x + 1) λ^{x + 2}}{((λ^{2} - λ + 1) e^{λ} - 1) (x + 2) x!} .

(9)

By differentiating (9) with respect to $λ$, we obtain the recurrence relation for the moments as follows:

\begin{matrix} \frac{d μ_{r}^{'}}{d λ} & = \frac{μ_{r + 1}^{'}}{λ} + \frac{2 μ_{r}^{'}}{λ} - \frac{λ (λ + 1) e^{λ} μ_{r}^{'}}{e^{λ} (λ^{2} - λ + 1) - 1} . \end{matrix}

Hence,

\begin{matrix} \frac{λ (λ + 1) e^{λ} μ_{r}^{'}}{e^{λ} (λ^{2} - λ + 1) - 1} + \frac{d μ_{r}^{'}}{d λ} = \frac{μ_{r + 1}^{'}}{λ} + \frac{2 μ_{r}^{'}}{λ} . \end{matrix}

Thus,

\begin{matrix} μ_{r + 1}^{'} = (\frac{λ^{2} (λ + 1) e^{λ}}{e^{λ} (λ^{2} - λ + 1) - 1} - 2) μ_{r}^{'} + \frac{λ d μ_{r}^{'}}{d λ}, \end{matrix}

for $r = 0, 1, 2, 3, \dots$, with $μ_{0}^{'} = 1$ and $λ > 0$.

Similarly, the rth central moment can be expressed as follows:

μ_{r} = \sum_{x = 0}^{\infty} {(x - μ_{1}^{'})}^{r} \frac{(x + 1) λ^{x + 2}}{((λ^{2} - λ + 1) e^{λ} - 1) (x + 2) x!},

(10)

where $μ_{1}^{'} = \frac{e^{λ} (λ - 1) (λ^{2} + 2) + 2}{e^{λ} (λ^{2} - λ + 1) - 1}$. Now, by differentiating the equation above with respect to $λ$ and simplifying, we obtain

\begin{matrix} \frac{d μ_{r}}{d λ} & = \frac{μ_{r + 1}}{λ} + \frac{(μ_{1}^{'} + 2) μ_{r}}{λ} - \frac{e^{λ} λ (1 + λ) μ_{r}}{(- 1 + e^{λ} (1 - λ + λ^{2}))} \\ + \frac{r e^{λ} λ (2 + λ (4 + λ) - e^{λ} (2 + λ (2 + (- 2 + λ) λ))) μ_{r - 1}}{{(- 1 + e^{λ} (1 + (- 1 + λ) λ))}^{2}} . \end{matrix}

Hence,

\begin{matrix} μ_{r + 1} & = \frac{λ d μ_{r}}{d λ} - μ_{1}^{'} μ_{r} + (\frac{e^{λ} λ^{2} (1 + λ)}{- 1 + e^{λ} (1 - λ + λ^{2})} - 2) μ_{r} \\ & - \frac{r e^{λ} λ^{2} (2 + λ (4 + λ) - e^{λ} (2 + λ (2 + (- 2 + λ) λ))) μ_{r - 1}}{{(- 1 + e^{λ} (1 + (- 1 + λ) λ))}^{2}}, \end{matrix}

where $r = 1, 2, 3, \dots$, $μ_{0} = 1$, $μ_{1} = 0$ and $λ > 0$.
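The central-moment relation can be written more compactly. The sketch below starts from the expression for $\frac{d μ_{r}}{d λ}$ given above and uses two facts from this section: the raw-moment recurrence with $r = 0$, and the closed form of the variance $μ_{2} = V(X)$. It is offered as a check on the expressions above, not as a result stated in the original derivation.

```latex
\begin{aligned}
μ_{1}^{'} + 2 &= \frac{λ^{2} (λ + 1) e^{λ}}{e^{λ} (λ^{2} - λ + 1) - 1}
  && \text{(raw recurrence with } r = 0,\ μ_{0}^{'} = 1\text{)} \\
\frac{d μ_{r}}{d λ} &= \frac{μ_{r + 1}}{λ} - \frac{r\, μ_{2}}{λ}\, μ_{r - 1}
  && \text{(the two terms in } μ_{r} \text{ cancel)} \\
μ_{r + 1} &= λ \frac{d μ_{r}}{d λ} + r\, μ_{2}\, μ_{r - 1},
  && r = 1, 2, 3, \dots
\end{aligned}
```

For $r = 1$ the last line reduces to $μ_{2} = μ_{2} μ_{0}$, since $μ_{1} = 0$ and $μ_{0} = 1$, which is a quick consistency check on the sign of the $μ_{r - 1}$ term.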

For convenience, the first few raw moments of the $SP(λ)$ distribution are as follows:

\begin{matrix} E (X) = \frac{e^{λ} (- 2 + 2 λ - λ^{2} + λ^{3}) + 2}{e^{λ} (1 - λ + λ^{2}) - 1}, \end{matrix}

\begin{matrix} E (X^{2}) = \frac{e^{λ} (4 - 4 λ + 2 λ^{2} + λ^{4}) - 4}{e^{λ} (1 - λ + λ^{2}) - 1}, \end{matrix}

\begin{matrix} E (X^{3}) = \frac{e^{λ} (- 8 + 8 λ - 4 λ^{2} + 2 λ^{3} + 2 λ^{4} + λ^{5}) + 8}{e^{λ} (1 - λ + λ^{2}) - 1} . \end{matrix}

Hence, the variance can be obtained as follows:

\begin{matrix} V (X) = \frac{e^{λ} λ^{2} (- 2 - 4 λ - λ^{2} + e^{λ} (2 + 2 λ - 2 λ^{2} + λ^{3}))}{{(e^{λ} (1 - λ + λ^{2}) - 1)}^{2}} . \end{matrix}
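The closed forms for $E(X)$, $E(X^{2})$ and $V(X)$ can be checked against direct summation of the pmf in (1). The R sketch below does this for $λ = 2$; the function names and the truncation at 200 are assumptions made for illustration.

```r
# pmf of SP(lambda) from Equation (1); hypothetical helper name.
dsp <- function(x, lambda) {
  inv_c <- (lambda^2 - lambda + 1) * exp(lambda) - 1   # 1 / C_lambda
  exp(log(x + 1) - log(x + 2) + (x + 2) * log(lambda) - lgamma(x + 1)) / inv_c
}

# Closed forms from Section 2.3.
sp_denominator <- function(l) exp(l) * (1 - l + l^2) - 1
sp_mean   <- function(l) (exp(l) * (-2 + 2 * l - l^2 + l^3) + 2) / sp_denominator(l)
sp_second <- function(l) (exp(l) * (4 - 4 * l + 2 * l^2 + l^4) - 4) / sp_denominator(l)
sp_var    <- function(l) {
  exp(l) * l^2 * (-2 - 4 * l - l^2 + exp(l) * (2 + 2 * l - 2 * l^2 + l^3)) /
    sp_denominator(l)^2
}

# Moments by direct summation over a truncated support (assumption).
x <- 0:200
lambda <- 2
mean_sum <- sum(x * dsp(x, lambda))
var_sum  <- sum(x^2 * dsp(x, lambda)) - mean_sum^2
```

Comparing `mean_sum` with `sp_mean(lambda)` and `var_sum` with `sp_var(lambda)` for a few values of the parameter gives a simple way to validate the formulas before using them further.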

Additionally, the index of dispersion ($ID$) is found to be:

\begin{matrix} I D (X) = \frac{V (X)}{E (X)} = \frac{e^{λ} λ^{2} (- 2 - 4 λ - λ^{2} + e^{λ} (2 + 2 λ - 2 λ^{2} + λ^{3}))}{(e^{λ} (1 - λ + λ^{2}) - 1) (e^{λ} (- 2 + 2 λ - λ^{2} + λ^{3}) + 2)} . \end{matrix}

Figure 2 contains a plot showing the value of the index of dispersion for different choices of $λ$ from 0 to 50. One can clearly note that for all values of $λ$, $ID > 1$, which accounts for over-dispersion. For $λ = 1.4$, $ID$ attains the maximum value of 1.099. Additionally, as $λ$ approaches ∞, the $SP$ distribution becomes more equi-dispersed as it approaches the standard Poisson situation.

Incomplete Moments

Incomplete moments prove to be useful in many practical problems, such as computing expected losses, or in relation to some characterization problems. The rth incomplete moment for a random variable X is given by

μ_{r}^{'} (m) = \sum_{x = 0}^{m} x^{r} f (x),

(11)

where $f(x)$ denotes the pmf of the random variable X, whereas the rth incomplete factorial moment is obtained as follows:

μ^{[r]} (m) = \sum_{x = r}^{m} x^{[r]} f (x), m \geq r .

(12)

Ref. [14] proposed a class of discrete distributions known as the modified power series distributions (MPSD). The form of the MPSD class is obtained as follows:

P_{θ} (X = x) = a (x) \frac{g {(θ)}^{x}}{f (θ)}, x \in T,

(13)

where $θ$ is the underlying parameter, $g(θ)$ and $f(θ)$ are positive, finite and differentiable functions of $θ$ alone, T is a subset of the set of non-negative integers, and $a(x) > 0$. One can note that the $SP(λ)$ model is a member of this family with:

a (x) = \frac{(x + 1)}{(x + 2) x!}, g (λ) = λ, f (λ) = \frac{1}{λ^{2} C_{λ}} .

Hence, by utilizing the ideas from [14], one can easily derive the following recurrence relations:

μ_{r + 1}^{'} (m) = λ \frac{d}{d λ} (μ_{r}^{'} (m)) + μ_{1}^{'} μ_{r}^{'} (m),

(14)

μ^{[r + 1]} (m) = λ \frac{d}{d λ} (μ^{[r]} (m)) + μ^{[r]} (m) (μ^{[1]} - r),

(15)

where $μ_{1}^{'}$ and $μ^{[1]}$ are equal and represent the mean of the $SP(λ)$ distribution.
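The recurrence for the incomplete raw moments can be checked numerically: the derivative with respect to $λ$ is replaced by a central difference and the two sides are compared. The R sketch below does this for one case; the helper names, the step size and the chosen values of r, m and $λ$ are assumptions made for illustration.

```r
# pmf of SP(lambda) from Equation (1); hypothetical helper name.
dsp <- function(x, lambda) {
  inv_c <- (lambda^2 - lambda + 1) * exp(lambda) - 1   # 1 / C_lambda
  exp(log(x + 1) - log(x + 2) + (x + 2) * log(lambda) - lgamma(x + 1)) / inv_c
}

# Mean of SP(l), closed form from Section 2.3.
sp_mean <- function(l) {
  (exp(l) * (-2 + 2 * l - l^2 + l^3) + 2) / (exp(l) * (1 - l + l^2) - 1)
}

# r-th incomplete raw moment, Equation (11), summed over x = 0, ..., m.
inc_moment <- function(r, m, lambda) {
  x <- 0:m
  sum(x^r * dsp(x, lambda))
}

lambda <- 2; r <- 1; m <- 5
h <- 1e-5   # step of the central difference (assumption)
d_inc <- (inc_moment(r, m, lambda + h) - inc_moment(r, m, lambda - h)) / (2 * h)

lhs <- inc_moment(r + 1, m, lambda)                                   # incomplete moment of order r + 1
rhs <- lambda * d_inc + sp_mean(lambda) * inc_moment(r, m, lambda)    # recurrence right-hand side
```

Agreement of `lhs` and `rhs` up to the accuracy of the finite difference indicates that the second factor in the last term is the incomplete moment of order r evaluated at m.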

2.4. Reliability Properties

A probability distribution is said to be log-concave if its pmf satisfies the inequality

P {(x; λ)}^{2} \geq P (x - 1; λ) P (x + 1; λ),

for all x, where $P(x; λ)$ denotes the pmf of the distribution. The log-concavity (log-convexity) of a distribution affects the characteristics of its reliability function, failure rate function, tail probabilities and moments. In particular, log-concavity (log-convexity) produces an increasing failure rate (IFR) (decreasing failure rate (DFR)) and a monotonically decreasing mean residual life (DMRL) (increasing mean residual life (IMRL)) time function; see [15]. The next proposition shows the log-concavity of the $SP$ distribution.

Proposition 1.

If $X \sim SP(λ)$, then the pmf of the random variable X is log-concave for all λ and x; the log-concavity condition is independent of λ.

Proof.

A pmf $P(x; λ)$ is said to be log-concave if it satisfies the following inequality:

P {(x; λ)}^{2} \geq P (x - 1; λ) P (x + 1; λ),

for all x. For the $SP$ model, upon utilizing the pmf from (1), the inequality above takes the form (for $x \geq 1$):

C_{λ}^{2} \frac{{(x + 1)}^{2} λ^{2 (x + 2)}}{{(x + 2)}^{2} {(x!)}^{2}} \geq C_{λ} \frac{x λ^{x + 1}}{(x + 1) (x - 1)!} \times C_{λ} \frac{(x + 2) λ^{x + 3}}{(x + 3) (x + 1)!},

which upon simplification gives,

(x + 3) {(x + 1)}^{4} > x^{2} {(x + 2)}^{3} .

Hence,

3 + 13 x + 22 x^{2} + 18 x^{3} + 7 x^{4} + x^{5} > 8 x^{2} + 12 x^{3} + 6 x^{4} + x^{5} .

And thus,

3 + 13 x + 14 x^{2} + 6 x^{3} + x^{4} > 0,

which holds for all $x \geq 0$. Hence, the distribution is log-concave, which implies that it belongs to the IFR and DMRL reliability classes, as per [15]. □
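The log-concavity inequality and the polynomial expansion used in the proof can be checked numerically. The R sketch below works with the logarithm of the pmf in (1) to avoid underflow; the helper names, the grid of $λ$ values and the upper limit of 100 are assumptions made for illustration.

```r
# log of the SP(lambda) pmf from Equation (1); hypothetical helper name.
log_dsp <- function(x, lambda) {
  log(x + 1) - log(x + 2) + (x + 2) * log(lambda) - lgamma(x + 1) -
    log((lambda^2 - lambda + 1) * exp(lambda) - 1)
}

# TRUE if 2 log P(x) >= log P(x - 1) + log P(x + 1) for x = 1, ..., upper.
is_log_concave <- function(lambda, upper = 100) {
  x <- 1:upper
  all(2 * log_dsp(x, lambda) >= log_dsp(x - 1, lambda) + log_dsp(x + 1, lambda))
}

checks <- sapply(c(0.5, 1, 2, 5, 20), is_log_concave)

# The polynomial identity behind the last step of the proof, on integers 0..50.
x <- 0:50
poly_gap <- (x + 3) * (x + 1)^4 - x^2 * (x + 2)^3
poly_rhs <- 3 + 13 * x + 14 * x^2 + 6 * x^3 + x^4
```

The parameter drops out of the inequality, so the result of `is_log_concave` should not depend on the value of $λ$ supplied.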

Corollary 1.

The following results are the direct consequence of log-concavity:

i: Using the relationships between log-concavity and unimodality presented in [16], the $SP$ distribution is unimodal.
ii: $SP$ has an increasing failure rate (IFR).
iii: The convolution of $SP$ with any other log-concave discrete distribution is also log-concave.
iv: $SP$ has at most an exponential tail, i.e., ${lim}_{x \to \infty} e^{b x} P (Y = x) = 0$ for some $b > 0$, which implies $P (Y = x) = o (e^{- b x})$ as $x \to \infty$.
v: $SP$ has a monotonically decreasing mean residual life (MRL) time function.

2.5. Entropy

We derive an expansion of the Shannon entropy for the $SP$ distribution. The Shannon entropy is defined by

E [- log p (x)],

where $p(x)$ is the pmf of the underlying distribution. In this regard, we make use of the first characterization given in (3). Using the pmf given in (3), the Shannon entropy for the $SP$ distribution can be written as follows:

E [- log p (x)] = E [- log (λ^{2} e^{λ} C_{λ})] + E [- log (\frac{x + 1}{x + 2})] + E [- log p (x, λ)],

where $p(x, λ)$ denotes the pmf of a standard Poisson distribution with parameter $λ$. Now, by utilizing the Taylor series expansion for the log function and simplifying, we obtain

\begin{matrix} E [- log p (x)] = - log (λ^{2} e^{λ} C_{λ}) - E [\sum_{r = 1}^{\infty} {(- 1)}^{r - 1} \frac{x^{r}}{r}] + E [\sum_{r = 1}^{\infty} {(- 1)}^{r - 1} \frac{x^{r}}{2^{r} r}] + log 2 \\ + E [- log p (x, λ)] . \end{matrix}

Now, the last term in the equation above provides the Shannon entropy for a standard Poisson distribution, denoted as H; by using the method presented in [17], H can be approximated by

H = \frac{1}{2} log (2 π e λ) - \frac{1}{12 λ} + O (λ^{- 2}),

where $O(λ^{- 2})$ denotes terms of order $λ^{- 2}$. Hence, the entropy for $SP$ can be simplified to

\begin{matrix} E [- log p (x)] = - log (λ^{2} e^{λ} C_{λ}) - E [\sum_{r = 1}^{\infty} {(- 1)}^{r - 1} \frac{x^{r}}{r}] + E [\sum_{r = 1}^{\infty} {(- 1)}^{r - 1} \frac{x^{r}}{2^{r} r}] + log 2 \\ + \frac{1}{2} log (2 π e λ) - \frac{1}{12 λ} + O (λ^{- 2}) . \end{matrix}
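For a given $λ$, the Shannon entropy can also be evaluated directly from its definition by summing $- p(x) \log p(x)$ over a truncated support. The R sketch below does this and, for reference, evaluates the Poisson approximation H quoted above without its $O(λ^{- 2})$ term. The helper names and the cut-off of 300 are assumptions, and the code does not implement the series expansion itself.

```r
# log of the SP(lambda) pmf from Equation (1); hypothetical helper name.
log_dsp <- function(x, lambda) {
  log(x + 1) - log(x + 2) + (x + 2) * log(lambda) - lgamma(x + 1) -
    log((lambda^2 - lambda + 1) * exp(lambda) - 1)
}

# Shannon entropy E[-log p(X)] by direct summation; `upper` is an assumed cut-off.
sp_entropy <- function(lambda, upper = 300) {
  lp <- log_dsp(0:upper, lambda)
  -sum(exp(lp) * lp)
}

# Leading terms of the Poisson entropy approximation quoted in the text.
poisson_entropy_approx <- function(lambda) {
  0.5 * log(2 * pi * exp(1) * lambda) - 1 / (12 * lambda)
}

h_sp   <- sp_entropy(10)              # entropy of SP(10), direct summation
h_pois <- poisson_entropy_approx(10)  # Poisson approximation at lambda = 10
```

The direct sum gives a reference value against which any truncated version of the expansion can be compared.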

2.6. Infinite Divisibility and Further Properties

For the infinite divisibility, self-decomposability and stability properties of $SP(λ)$, it is noted that, according to [18,19], a necessary condition for infinite divisibility of a discrete distribution $P(x; λ)$ is that $P(0; λ) > 0$, which is satisfied by the $SP$ distribution. Moreover, the classes of self-decomposable and stable distributions are subclasses of the infinitely divisible distributions; therefore, $SP$ is an infinitely divisible distribution, self-decomposable and stable.

3. Estimation

In this section, we derive the maximum likelihood estimator for the unknown parameter $λ$ of the $SP$ distribution. Let $x_{1}, x_{2}, \dots, x_{n}$ be the observed values of a random sample of size n from the $SP(λ)$ distribution. The corresponding likelihood and log-likelihood functions can be written as follows:

L (λ | x_{1}, x_{2}, \dots, x_{n}) = C_{λ}^{n} λ^{^{\sum_{i = 1}^{n} (x_{i} + 2)}} \prod_{i = 1}^{n} (\frac{x_{i} + 1}{x_{i} + 2}) \frac{1}{x_{i}!}

(16)

and

ln L (λ | x_{1}, x_{2}, \dots, x_{n}) = n ln C_{λ} + \sum_{i = 1}^{n} (x_{i} + 2) ln λ + \sum_{i = 1}^{n} ln [(\frac{x_{i} + 1}{x_{i} + 2}) \frac{1}{x_{i}!}],

(17)

respectively. By differentiating the log-likelihood in Equation (17) with respect to $λ$ and equating the derivative to 0, one obtains the likelihood equation for the MLE $\hat{λ}$. This equation takes the following form:

\frac{d ln L}{d λ} = \frac{n}{C_{λ}} \frac{d C_{λ}}{d λ} + \frac{1}{λ} \sum_{i = 1}^{n} (x_{i} + 2) = 0 .

(18)

The following proposition states the existence of real-valued roots for the earlier equation:

Proposition 2.

Let $d \ln L / d λ$ be denoted as $D_{λ}$. Then, $D_{λ}$ has at least one real-valued root.

Proof.

$D_{λ}$ can be simplified as follows:

D_{λ} = - n λ (λ + 1) e^{λ} C_{λ} + \frac{1}{λ} \sum_{i = 1}^{n} (x_{i} + 2) .

Thus, upon applying L'Hôpital's rule, we obtain ${lim}_{λ \to 0} D_{λ} = \infty$ (provided that at least one $x_{i} > 0$) and ${lim}_{λ \to \infty} D_{λ} = - n < 0$. Since $D_{λ}$ is continuous on $(0, \infty)$, it crosses the $λ$-axis at least once; hence, $D_{λ} = 0$ has at least one real-valued root. □

However, since this equation is non-linear, one has to resort to a numerical method to obtain the MLE. Here, we find the MLE using the “optim” function in R [20], which uses the Nelder–Mead method by default. Further, under the regularity conditions described in [21], the MLE is asymptotically normally distributed with mean $λ$ and variance provided by the following expression:

V (\hat{λ}) = {\{E_{λ} [- \frac{d^{2}}{d λ^{2}} ln L (λ | X_{1}, X_{2}, \dots, X_{n})]\}}^{- 1} {|}_{λ = \hat{λ}} .

(19)
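The estimation step can be sketched in R by solving $D_{λ} = 0$ with a root finder. The sketch below uses `uniroot` on the score function from the proof of Proposition 2, which is a different numerical route from the `optim` call mentioned above. The data vector, the bracketing interval and all function names are hypothetical, and the standard error uses the observed information, obtained by a central difference of the score, as a stand-in for the expected information.

```r
sp_denominator <- function(l) (l^2 - l + 1) * exp(l) - 1   # 1 / C_lambda

# Mean of SP(l), closed form from Section 2.3.
sp_mean <- function(l) (exp(l) * (-2 + 2 * l - l^2 + l^3) + 2) / sp_denominator(l)

# Score function D_lambda from the proof of Proposition 2.
score <- function(lambda, x) {
  -length(x) * lambda * (lambda + 1) * exp(lambda) / sp_denominator(lambda) +
    sum(x + 2) / lambda
}

# MLE as the root of the score; the bracketing interval is an assumption.
sp_mle <- function(x, lower = 1e-3, upper = 50) {
  uniroot(score, interval = c(lower, upper), x = x, tol = 1e-10)$root
}

x_obs <- c(0, 0, 1, 2, 0, 3, 1, 0, 2, 1)   # hypothetical counts, not from the paper
lambda_hat <- sp_mle(x_obs)

# Observed information via a central difference of the score (assumption).
h <- 1e-5
obs_info <- -(score(lambda_hat + h, x_obs) - score(lambda_hat - h, x_obs)) / (2 * h)
se_hat <- 1 / sqrt(obs_info)               # approximate standard error of lambda_hat
```

Because $D_{λ} = \frac{n}{λ} (\bar{x} - E(X))$, the root of the score is the value of $λ$ at which the model mean equals the sample mean, which gives a simple check on the output of any optimizer.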

4. Simulations

In this section, we present extensive simulation analyses highlighting the performance of the maximum likelihood estimator, which is found using the approximation method. A brief overview of the underlying simulation study is provided as follows:

(1) In order to generate a random sample from the $SP(λ)$ distribution, we make use of the characterization based on the weighted Poisson distribution, as seen in Equation (3). We therefore generate 10,000 samples of size n from the $SP(λ)$ distribution for two specific values of $λ$, i.e., 0.5 and 2.
(2) The MLEs for a particular $λ$ are then computed from these 10,000 samples. Let these be denoted by ${\hat{λ}}_{i}, i = 1, 2, \dots, 10000$.
(3) Using the MLEs from step (2), we compute the biases (Biases) and mean squared errors (MSEs) of the estimates using the following forms:

$B i a s (\hat{λ}, n) = \frac{1}{10000} \sum_{i = 1}^{10000} ({\hat{λ}}_{i} - λ),$

(20)

$M S E (\hat{λ}, n) = \frac{1}{10000} \sum_{i = 1}^{10000} {({\hat{λ}}_{i} - λ)}^{2} .$

(21)

We repeat the steps above for several equidistant values of n ranging from 10 to 200. Under each combination, we find the values of the Bias and MSE. Figure 3 shows the behaviour of the Bias and MSE with respect to n for the MLE $\hat{λ}$.
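The sample-generation step can be sketched with an acceptance-rejection scheme built on the weighted Poisson form in Equation (3): a Poisson proposal y is kept with probability $\frac{y + 1}{y + 2}$, which never exceeds 1. This particular sampler is an assumption about how the characterization can be used, and the names `rsp` and `sp_mean` are hypothetical.

```r
# Draw n values from SP(lambda) by acceptance-rejection on Poisson proposals.
rsp <- function(n, lambda) {
  out <- integer(0)
  while (length(out) < n) {
    y <- rpois(2 * n, lambda)                       # proposals; batch size is arbitrary
    keep <- runif(length(y)) < (y + 1) / (y + 2)    # accept with probability w(y)
    out <- c(out, y[keep])
  }
  out[seq_len(n)]
}

# Mean of SP(l), closed form from Section 2.3, used as a reference value.
sp_mean <- function(l) {
  (exp(l) * (-2 + 2 * l - l^2 + l^3) + 2) / (exp(l) * (1 - l + l^2) - 1)
}

set.seed(1)
draws <- rsp(20000, 2)                       # one large sample from SP(2)
mean_gap <- abs(mean(draws) - sp_mean(2))    # sample mean versus closed-form mean
```

Since the weight is at least 1/2, at least half of the proposals are accepted on average, so the loop terminates quickly; the sample mean of a large draw should be close to the closed-form mean.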

Based on the plots above, one can clearly see that the Biases and the MSEs decay sharply to 0 as $n \to \infty$ for both parameter values. This supports the asymptotic consistency of the MLE found using the numerical approximation technique. For convenience, we present a short version of the underlying R code in Appendix A. Users may change the parameter values to obtain the desired results.

5. Applications to Real Data

In this section, we outline the applicability of the proposed $SP$ distribution on two real-life datasets concerning insurance claims and parasite counts. We picked these two specific examples as they show zero-inflation and over-dispersion. Zero-inflation occurs in decreasing order, from the first dataset to the second.

5.1. Dataset 1

The first dataset that we consider provides information on an infection produced by the parasite “Trypanosoma murmanensis” in cod; the data were collected over three successive years along the Finnmark coast (northern Norway). The response variable is the number of parasites found in each cod (Intensity). The dataset has been extracted from the package “countreg” [22] of the R software. For completeness, we present the dataset in a frequency format in Table 1. The mean and variance of the parasite intensity are $μ = 0.7220$ and $σ^{2} = 3.4940$, respectively. One can clearly note that the observations are over-dispersed, since the variance exceeds the mean. Additionally, zero-inflation is pronounced, as the frequency of absence of parasites is substantial.

We now fit the $SP(λ)$ distribution to the intensities. The parameter is estimated using the maximum likelihood method, which gives $\hat{λ} = 0.7220$. The MLE is found using the “Brent” method in the “optim” function in R. The Kolmogorov–Smirnov (K-S) test gives a p-value of 0.71, which supports the $SP$ fit. This is also supported by Figure 4, which contains the $SP$ pmf fit and the empirical distribution function (ecdf) fit.

For comparison purposes, we consider a set of comparable models, some of which are studied and compared in [23,24]. We also consider a recent zero-inflated distribution called the zero inflated Waring distribution, which is introduced by [25]. The following list contains the corresponding pmfs of these distributions:

ZIW (zero inflated Waring distribution) given by [25],

$P (x; α, β, π) = \{\begin{matrix} π + (1 - π) \frac{α}{α + β}, & if x = 0 \\ (1 - π) \frac{α (α + β - 1)! (x + β - 1)!}{(β - 1)! (x + α + β)!}, & if x > 0, \end{matrix}$

where $α, β > 0$ and $0 \leq π \leq 1$ .
NLD (new logarithmic distribution) given by [26],

$p_{x} = \frac{log (1 - α θ^{x}) - log (1 - α θ^{x + 1})}{log (1 - α)}, x = 0, 1, 2 \dots,$

where $α < 1, α \neq 0, 0 < θ < 1$ .
NGDP (new geometric discrete Pareto distribution), as proposed by [23],

$p_{x} = \frac{q^{x}}{{(x + 1)}^{α}} - \frac{q^{x + 1}}{{(x + 2)}^{α}}, x = 0, 1, 2 \dots,$

where $0 < q \leq 1, α \geq 0$ .
DGP (discrete generalized Pareto), provided by [27],

$p_{x} = \frac{1}{{(1 + λ x)}^{α}} - \frac{1}{{(1 + λ (x + 1))}^{α}}, x = 0, 1, 2 \dots,$

where $λ, α > 0$ .
PIG (Poisson inverse Gaussian distribution), outlined by [28],

$p_{x} = \frac{1}{x!} \sqrt{\frac{2 ϕ}{π}} e^{ϕ / μ} ϕ^{- \frac{1}{4} + \frac{x}{2}} {(2 + \frac{ϕ}{μ^{2}})}^{\frac{1 - 2 x}{4}} K_{x - \frac{1}{2}} (\sqrt{2 ϕ + \frac{ϕ^{2}}{μ^{2}}}), x = 0, 1, 2 \dots,$

where $ϕ, μ > 0$ and $K_{a} (.)$ denotes the modified Bessel function of the third kind with $a \in R$ .
NB (negative binomial distribution),

$p_{x} = \frac{(x + r - 1)!}{x! (r - 1)!} p^{r} {(1 - p)}^{x}, x = 0, 1, 2 \dots,$

where $r > 0, 0 < p < 1$ .
PLD (Poisson Lindley distribution) provided in [29],

$p_{x} = \frac{θ^{2} (x + θ + 2)}{{(θ + 1)}^{x + 3}}, x = 0, 1, 2 \dots,$

where $θ > 0$ .

Table 2 presents the comparison findings. One can clearly note that the proposed $SP(λ)$ distribution fits the data better than the rest in terms of the log-likelihood function, AIC (Akaike’s information criterion), and BIC (Bayesian information criterion) values, and the K-S test statistic and p-value. Among AIC and BIC, one may prefer the BIC criterion, as it penalizes each additional parameter more heavily (by $\ln n$ rather than 2). Thus, distributions that depend on more than two parameters have a tendency to be penalized.
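The two information criteria are simple functions of the maximized log-likelihood, the number of parameters k and the sample size n. The R sketch below shows how they are computed; the function name and the numerical inputs are purely hypothetical and are not values from Table 2.

```r
# AIC and BIC from a maximized log-likelihood; `info_criteria` is a hypothetical helper.
info_criteria <- function(loglik, k, n) {
  c(AIC = 2 * k - 2 * loglik,
    BIC = k * log(n) - 2 * loglik)
}

# Hypothetical comparison: same log-likelihood, one versus three parameters.
one_par   <- info_criteria(loglik = -500, k = 1, n = 1000)
three_par <- info_criteria(loglik = -500, k = 3, n = 1000)
```

With equal log-likelihoods, the gap between the two models is 2 per extra parameter under AIC and $\ln n$ per extra parameter under BIC, so the BIC penalty is the larger one whenever n exceeds about 7.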

5.2. Dataset 2

The second dataset under consideration represents the vaccine-adverse event counts and the number of claims in automobile insurance. One may refer to [30] for a further description and analysis of the dataset. Table 3 details the entire dataset in a frequency-type setting. We find that the mean and variance of the insurance claims data are $μ = 1.5069$ and $σ^{2} = 2.9034$, respectively. This clearly shows that the observations are over-dispersed.

The $SP$ distribution is fitted to this dataset using the MLE as the estimated parameter, which is found to be $\hat{λ} = 1.5069$. The corresponding K-S test provides a p-value of 0.58, which suggests a good fit. This is also seen in Figure 5 through the pmf fit and ecdf plots.

As a comparison, we again consider a subset of the above outlined models, along with the zero-inflated Poisson (ZIP) distribution, and find the associated likelihood, AIC, and BIC values, as well as the K-S test statistics and corresponding p-values of the distribution fits. The results are displayed in Table 4. Clearly, the $SP$ distribution proves to be better than the other comparable models, according to all the criteria. Interestingly, it outperforms the recently introduced zero-inflated Waring distribution and the ZIP distribution.

6. Conclusions

In this paper, we proposed a flexible single-parameter discrete distribution supported on the set of non-negative integers. This distribution is a weighted version of the Poisson distribution, but proves to be much better at handling over-dispersion and zero-inflation. A set of theoretical properties, such as its distribution function, generating functions, index of dispersion, regular moments, recurrence relation of the moments, and incomplete moments, are derived. Parameter estimation is undertaken using the maximum likelihood approach. This distribution is then compared with a set of other comparable models to fit two real-life datasets related to insurance claims and parasite counts, where it is found to be superior to the rest of the models according to standard criteria.

Author Contributions

Conceptualization, H.S.B.; Methodology, S.R.B. and H.S.B.; Formal analysis, S.R.B., F.E.A. and A.M.A.; Investigation, S.R.B. and H.S.B.;Writing—original draft, S.R.B.; Writing—review & editing, F.E.A. and A.M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. 2540].

Data Availability Statement

The paper includes the data that was used in the study.

Acknowledgments

This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. 2540].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. R Codes

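library(ggplot2)

# Negative log-likelihood of SP(lam) for a fixed sample z, from Equation (17).
nll <- function(lam, z) {
  n <- length(z)
  n * log(exp(lam) * (lam^2 - lam + 1) - 1) - sum(z + 2) * log(lam) -
    sum(log((z + 1) / (z + 2)) - lfactorial(z))
}

# Draw n values from SP(l) by acceptance-rejection: Poisson proposals are kept
# with probability (y + 1) / (y + 2), the weight in Equation (3). This sampler
# is an assumed implementation of the weighted Poisson characterization.
rsp <- function(n, l) {
  out <- integer(0)
  while (length(out) < n) {
    y <- rpois(2 * n, l)
    out <- c(out, y[runif(length(y)) < (y + 1) / (y + 2)])
  }
  out[seq_len(n)]
}

l <- 0.5                               # true parameter value
n_grid <- seq(from = 10, to = 200, by = 10)
b_l <- c()
mse_l <- c()
for (n in n_grid) {
  x <- c()                             # MLEs for the current sample size
  for (i in 1:1000) {                  # Section 4 reports 10,000 replications
    z <- rsp(n, l)                     # sample drawn once per replication
    # one-dimensional minimization; the search bounds are an assumption
    fit <- optim(2, nll, z = z, method = "Brent", lower = 1e-4, upper = 50)
    x <- c(x, fit$par[1])
  }
  b_l <- c(b_l, mean(x) - l)
  mse_l <- c(mse_l, mean((x - l)^2))
}

df <- data.frame(n = n_grid, b_l, mse_l)

ggplot(data = df, aes(x = n, y = b_l)) +
  geom_line() +
  geom_point() +
  xlab("n") +
  ylab(expression(Bias(hat(lambda)))) +
  theme_bw()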

References

Lambert, D. Zero-Inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992, 34, 1–14.
Böhning, D.; Dietz, E.; Schlattmann, P.; Mendonca, L.; Kirchner, U. The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. J. R. Stat. Soc. Ser. A 1999, 162, 195–209.
Atkins, D.C.; Gallop, R.C. Rethinking how family researchers model infrequent outcomes: A tutorial on count regression and zero-inflated models. J. Fam. Psychol. 2007, 21, 726–735.
Fagundes, R.A.A.; Souza, R.M.C.R.; Cysneiros, F.J.A. Zero-inflated prediction model in software-fault data. IET Softw. 2016, 10, 1–9.
Greene, W.H. Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models. NYU Working Paper No. EC-94-10. 1994. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1293115 (accessed on 17 January 2023).
Mullahy, J. Specification and testing of some modified count data models. J. Econom. 1986, 33, 341–365.
Bandyopadhyay, D.; DeSantis, S.M.; Korte, J.E.; Brady, K.T. Some considerations for excess zeroes in substance abuse research. Am. J. Drug Alcohol Abus. 2011, 37, 376–382.
Coxe, S.; West, S.G.; Aiken, L.S. The analysis of count data: A gentle introduction to Poisson regression and its alternatives. J. Personal. Assess. 2009, 91, 121–136.
Hu, M.; Pavlicova, M.; Nunes, E.V. Zero-inflated and hurdle models of count data with extra zeros: Examples from an HIV-risk reduction intervention trial. Am. J. Drug Alcohol Abus. 2011, 37, 367–375.
Usman, M.; Oyejola, B.A. Models for count data in the presence of outliers and/or excess zero. Math. Theory Model. 2013, 3, 94–103.
Tüzen, M.F.; Erbaş, S. A comparison of count data models with an application to daily cigarette consumption of young persons. Commun. Stat.-Theory Methods 2018, 47, 5825–5844.
Louzayadio, C.G.; Malouata, R.O.; Koukoutikissa, M.D. A weighted Poisson distribution for underdispersed count data. Int. J. Stat. Probab. 2021, 10, 157.
Rattihalli, R.N.; Bhati, D. Generation of new families of discrete distributions. Calcutta Stat. Assoc. Bull. 2017, 68, 135–146.
Tripathi, R.C.; Gupta, P.L.; Gupta, R.C. Incomplete moments of modified power series distributions with applications. Commun. Stat.-Theory Methods 1986, 15, 999–1015.
Gupta, P.L.; Gupta, R.C.; Tripathi, R.C. On the monotonic properties of discrete failure rates. J. Stat. Plan. Inference 1997, 65, 255–268.
Grandell, J. Mixed Poisson Processes; Chapman & Hall: London, UK, 1997.
Evans, R.J. Ramanujan’s Second Notebook: Asymptotic expansions for hypergeometric series and related functions. In Ramanujan Revisited; Academic Press: New York, NY, USA, 1988.
Nekoukhou, V.; Alamatsaz, M.H.; Bidram, H. A discrete analog of the generalized exponential distribution. Commun. Stat.-Theory Methods 2012, 41, 2000–2013.
Steutel, F.W.; van Harn, K. Infinite Divisibility of Probability Distributions on the Real Line, 1st ed.; CRC Press: Boca Raton, FL, USA, 1979.
R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2022; Available online: https://www.R-project.org/ (accessed on 17 January 2023).
Rohatgi, V.K.; Saleh, A.K. An Introduction to Probability and Statistics, 2nd ed.; John Wiley and Sons: Hoboken, NJ, USA, 2000.
Zeileis, A.; Kleiber, C. Countreg: Count Data Regression. R Package Version 0.2-0. 2018. Available online: http://R-Forge.Rproject.org/projects/countreg/ (accessed on 17 January 2023).
Bhati, D.; Bakouch, H.S. A new infinitely divisible discrete distribution with applications to count data modeling. Commun. Stat.-Theory Methods 2019, 48, 1401–1416.
Shaul, K.B.; Ridder, A. Exponential dispersion models for over-dispersed zero-inflated count data. Commun. Stat.-Simul. Comput. 2021.
Rivas, L.; Campos, F. Zero inflated Waring distribution. Commun. Stat.-Simul. Comput. 2021.
Gómez-Déniz, E.; Sarabia, J.M.; Calderín-Ojeda, E. A new discrete distribution with actuarial applications. Insur. Math. Econ. 2011, 48, 406–412.
Prieto, F.; Gómez-Déniz, E.; Sarabia, J.M. Modelling road accident blackspots data with the discrete generalized Pareto distribution. Accid. Anal. Prev. 2014, 71, 38–49.
Willmot, G. The Poisson-inverse Gaussian as an alternative to the negative binomial. Scand. Actuar. J. 1987, 3, 113–127.
Sankaran, M. The Discrete Poisson-Lindley Distribution. Biometrics 1970, 26, 145–149.
Rose, C.E.; Martin, S.W.; Wannemuehler, K.A.; Plikaytis, B.D. On the use of zero inflated and hurdle models for modeling vaccine adverse events count data. J. Biopharm. Stat. 2006, 16, 463–481.

Figure 1. Pmf of the SP $(λ)$ distribution for different parameter combinations.

Figure 2. Index of dispersion for different values of $λ$.

Figure 3. Bias and MSE for $λ = 0.5$ (left) and $λ = 2$ (right).

Figure 4. Fitted pmf plot (left) and ecdf plot (right) for the parasite intensity data.

Figure 5. Fitted pmf plot (left) and ecdf plot (right) for the insurance claims data.

Table 1. Parasite intensity data.

Value	0	1	2	3	4	5	6	7	8	9	10
Frequency	654	38	33	21	17	12	11	10	5	4	8
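
The over-dispersion and excess of zeros in Table 1 can be checked directly from the frequency table. The following is a minimal sketch in base R, not part of the authors' appendix; the variable names (`value`, `freq`, `xbar`, `s2`, `dispersion`, `zero_share`) are chosen here for illustration, and the sample variance is assumed to use the usual n − 1 divisor.

```r
# Table 1: parasite intensity counts and their observed frequencies.
value <- 0:10
freq  <- c(654, 38, 33, 21, 17, 12, 11, 10, 5, 4, 8)

n          <- sum(freq)                               # sample size
xbar       <- sum(value * freq) / n                   # sample mean
s2         <- sum(freq * (value - xbar)^2) / (n - 1)  # sample variance (n - 1 divisor, assumed)
dispersion <- s2 / xbar                               # index of dispersion (variance / mean)
zero_share <- freq[1] / n                             # observed proportion of zeros

round(c(n = n, mean = xbar, variance = s2,
        dispersion = dispersion, zero_share = zero_share), 4)
```

The sample mean is 587/813, about 0.722, in line with the fitted $\hat{λ} = 0.7220$ reported for the SP model in Table 2, and the index of dispersion is well above 1.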

Table 2. Comparing model fits for the parasite intensity data.

Model	Fitted Parameters	Log-L	AIC	BIC	K-S	p-Value
SP $(λ)$	$\hat{λ} = 0.7220$	−778.18	1558.36	1563.06	0.04	0.71
Poisson $(λ)$	$\hat{λ} = 0.7130$	−1332.85	2667.71	2672.41	0.24	0.04
ZIW $(α, β, π)$	$\hat{α} = 1.04, \hat{β} = 0.26, \hat{π} = 0.49$	−781.56	1569.12	1583.22	0.05	0.64
NLD $(α, θ)$	$\hat{α} = 0.55, \hat{θ} = 0.17$	−824.19	1652.38	1661.78	0.18	0.13
NGDP $(q, α)$	$\hat{q} = 0.73, \hat{α} = 5.92$	−800.91	1605.68	1615.08	0.11	0.36
DGP $(λ, α)$	$\hat{λ} = 0.71, \hat{α} = 5.38$	−811.12	1626.24	1655.64	0.15	0.26
PIG $(ϕ, μ)$	$\hat{ϕ} = 0.052, \hat{μ} = 0.108$	−791.54	1587.82	1596.48	0.21	0.42
NB $(r, p)$	$\hat{r} = 0.58, \hat{p} = 0.32$	−785.41	1574.82	1584.22	0.08	0.51
PLD $(θ)$	$\hat{θ} = 1.92$	−974.37	1950.74	1955.45	0.06	0.06
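
The AIC and BIC columns of Table 2 can be reproduced from the reported log-likelihoods. The sketch below assumes the standard definitions AIC = 2k − 2 log L and BIC = k log n − 2 log L, with k the number of fitted parameters and n = 813 taken as the total frequency in Table 1; `info_criteria` is a hypothetical helper written for this illustration, not a function from the paper or from any package.

```r
# Information criteria from a maximised log-likelihood (standard definitions, assumed):
#   AIC = 2k - 2*logL,  BIC = k*log(n) - 2*logL
info_criteria <- function(loglik, k, n) {
  c(AIC = 2 * k - 2 * loglik,
    BIC = k * log(n) - 2 * loglik)
}

n <- 813  # total frequency in Table 1, assumed to be the sample size

# SP(lambda): one parameter, Log-L taken from Table 2.
sp_ic <- round(info_criteria(loglik = -778.18, k = 1, n = n), 2)
sp_ic

# NB(r, p): two parameters, Log-L taken from Table 2.
nb_ic <- round(info_criteria(loglik = -785.41, k = 2, n = n), 2)
nb_ic
```

Under these assumptions the SP and NB rows of Table 2 are recovered to two decimals, which suggests the tabulated criteria follow the standard definitions.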

Table 3. Insurance claims data.

Value	0	1	2	3	4	5	6	7	8	9	10	11	12
Frequency	1437	1010	660	428	236	122	62	34	14	8	4	4	1
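
To see why a plain Poisson model struggles with the data in Table 3, the observed frequencies can be set against the frequencies a Poisson law with the same mean would predict. This is an illustrative sketch in base R rather than the authors' code; the names `p_pois` and `expected` are introduced here, and the last cell is assumed to absorb the upper tail P(X ≥ 12) so that the expected frequencies sum to the sample size.

```r
# Table 3: number of insurance claims and observed frequencies.
value <- 0:12
freq  <- c(1437, 1010, 660, 428, 236, 122, 62, 34, 14, 8, 4, 4, 1)

n    <- sum(freq)                  # sample size
xbar <- sum(value * freq) / n      # sample mean, which is also the Poisson MLE

# Poisson probabilities at the sample mean; the final cell takes the tail P(X >= 12).
p_pois <- dpois(value, lambda = xbar)
p_pois[length(p_pois)] <- 1 - ppois(11, lambda = xbar)
expected <- n * p_pois

data.frame(value, observed = freq, expected = round(expected, 1))
```

With these data the Poisson fit predicts roughly 890 zeros against the 1437 observed, which is the zero-inflation pattern the comparison in Table 4 is concerned with.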

Table 4. Comparing model fits for the insurance claims data.

Model	Fitted Parameters	Log-L	AIC	BIC	K-S	p-Value
SP $(λ)$	$\hat{λ} = 1.50$	−3753.627	7149.29	7155.54	0.04	0.58
Poisson $(λ)$	$\hat{λ} = 1.51$	−7231.13	14,464.27	14,470.57	0.29	0.05
ZIW $(α, β, π)$	$\hat{α} = 48.92, \hat{β} = 67.02, \hat{π} = 0.45$	−6797.25	13,588.51	13,619.41	0.18	0.24
ZIPD $(μ, σ)$	$\hat{μ} = 2.04, \hat{σ} = 0.26$	−6869.55	13,741.5	13,754.1	0.24	0.22
NGDP $(q, α)$	$\hat{q} = 0.51, \hat{α} = 0.33$	−6741.78	13,487.57	13,500.16	0.10	0.33
NB $(r, p)$	$\hat{r} = 1.52, \hat{p} = 0.50$	−6741.6	13,485.2	13,486.33	0.06	0.44
PLD $(θ)$	$\hat{θ} = 0.99$	−6745.99	13,493.99	13,500.29	0.14	0.29


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

