Next Article in Journal
Similarity Measures Based on T-Spherical Fuzzy Information with Applications to Pattern Recognition and Decision Making
Next Article in Special Issue
Fitting Non-Parametric Mixture of Regressions: Introducing an EM-Type Algorithm to Address the Label-Switching Problem
Previous Article in Journal
Simplification of Galactic Dynamic Equations
Previous Article in Special Issue
Mean Equality Tests for High-Dimensional and Higher-Order Data with k-Self Similar Compound Symmetry Covariance Structure
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Mixtures of Semi-Parametric Generalised Linear Models

by
Salomon M. Millard
*,† and
Frans H. J. Kanfer
Department of Statistics, University of Pretoria, Pretoria 0002, South Africa
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Symmetry 2022, 14(2), 409; https://doi.org/10.3390/sym14020409
Submission received: 19 January 2022 / Revised: 5 February 2022 / Accepted: 15 February 2022 / Published: 18 February 2022
(This article belongs to the Special Issue Symmetry in Multivariate Analysis)

Abstract

:
The mixture of generalised linear models (MGLM) requires knowledge about each mixture component’s specific exponential family (EF) distribution. This assumption is relaxed and a mixture of semi-parametric generalised linear models (MSPGLM) approach is proposed, which allows for unknown distributions of the EF for each mixture component while much of the parametric structure of the traditional MGLM is retained. Such an approach inherently allows for both symmetric and non-symmetric component distributions, frequently leading to non-symmetrical response variable distributions. It is assumed that the random component of each mixture component follows an unknown distribution of the EF. The specific member can either be from the standard class of distributions or from the broader set of admissible distributions of the EF which is accessible through the semi-parametric procedure. Since the inverse link functions of the mixture components are unknown, the MSPGLM estimates each mixture component’s inverse link function using a kernel smoother. The MSPGLM algorithm alternates the estimation of the regression parameters with the estimation of the inverse link functions. The properties of the proposed MSPGLM are illustrated through a simulation study on the separable individual components. The MSPGLM procedure is also applied on two data sets.

1. Introduction

In many situations, mixture of regression modelling focuses on Gaussian distributions, and hence symmetrical component distributions [1,2]. Frequently, data follow a distribution that is non-standard, e.g., non-symmetric or skewed multi-modal. The flexibility embedded in mixture modelling based on known component distributions enables the modelling of both symmetric and non-symmetric data patterns. In many applications, the individual component distributions are of importance, and the selection of appropriate component distributions are essential. In the case of MGLM, both the selection of the correct component distribution members and the selection of the component link functions are required. Azzalini et al. [3] and Wainer [4] presented examples for a single component model, highlighting the inadequacy of using the typical logistic link to model binary responses. Weisberg [5] argued that if the chosen distributions are inadequate, a better fit might be obtained if the inverse link function is estimated from the data. Estimating the inverse link function directly from the data enables more flexible models. The proposed semi-parametric procedure for estimating the component inverse link functions facilitates the selection of component members from the broader class of EF distributions. Finite mixture modelling provides a statistical modelling approach with applications in a wide variety of random phenomena, including marketing and market segmentation [6], insurance [7], biology [8], medicine [9], and economics [10]. Recently, there has been an enhanced focus on non-parametric and semi-parametric mixture models. In 2019, Xiang, Yao, and Yang [11] published an overview of semi-parametric extensions of finite mixture models, and in 2020, Ma, Wang, and Lee [12] considered semi-parametric mixture regression with unspecified error distributions.
In this paper we start by considering MGLM [13,14], proposing a semi-parametric estimation of the component inverse link functions for modelling the component conditional expected value of the response variable given explanatory variables. The proposed MSPGLM has several important advantages:
  • A semi-parametric, estimated, component inverse link function assists in determining if a selected parametric component link function is appropriate to be used as a component link function;
  • If the selected parametric component link function is inadequate, a better component fit can be obtained by estimating a semi-parametric component link function. Improved component fits will improve the overall fitted mixture model;
  • Relaxing the assumption of a common component link function by estimating individual semi-parametric component link functions;
  • In [5], it is shown that under fairly general conditions, n consistent estimates of the component parameter directions are obtained.
The MSPGLM has limitations similar to that of the SPGLM. Since a scale and location factor can be absorbed into the component link functions, the estimated component regression coefficients estimate a direction in p dimensional space. The magnitudes of the estimated component regression coefficients can therefore not be directly related to component rates of change in the explanatory variables. However, ratios of the estimated component regression coefficients are useful indications of the relative impact of variables at the component level, which is analogous to the findings in [15].
The performance of the proposed MSPGLM is evaluated through a simulation study, evaluating the performance of the individual components that are separable within the overall procedure. The procedure is also implemented on a data set from an insurance company and on a South African COVID-19 data set.
The paper is structured as follows. Section 2 describes the SPGLM, followed by the MSPGLM. Section 3 gives simulation results of the SPGLM performance for continuous and categorical response variables. In Section 4 and Section 5, applications of the proposed MSPGLM procedure on insurance and COVID-19 data are given. Lastly, Section 6 contains a discussion of the results, conclusions, and possibilities for future research.

2. Materials and Methods

2.1. Semi-Parametric Mixture of Generalised Linear Models

The GLM is a generalisation of the mean regression model, selecting a distribution of the EF for the random component, Y, with the following density or mass function:
f Y y | η = h y , ϕ exp η y A η ϕ ,
where A η is the cumulant function and ϕ a dispersion parameter. E ( Y | x ) = μ ( x ) , with g ( μ ( x ) ) = x T β = η , and g ( · ) is a smooth and invertible link function; x is the vector of explanatory or feature variables, and β is the vector of regression parameters. It follows for the EF that:
μ ( x ) = A η , and
V A R ( Y | x ) = ϕ A η .
The canonical inverse link function is g 1 ( · ) = A ( · )  [16].
The SPGLM considers cases where the random component is assumed to follow an unknown distribution from the EF. The inverse link function is therefore also unknown and needs to be estimated in addition to the regression parameters β .

2.2. SPGLM Estimation

Consider a random sample of pairs Y i , x i T R × R p   for i = 1 , , n . Set Y = Y 1 Y n T : n × 1 , and X = x 1 x n T : n × p . Following from (2), we estimate the expected value using the Nadaraya–Watson weighted average in the focal point z, varying over the range of possible x i T β values:
g 1 ( z ) = A ^ ( z ) = 1 n h i = 1 n y i K h ( z x i T β ) τ ( z , X , β ) ,
with τ ( z , X , β ) as the normalising constant. This approach uses a one-dimensional smoother obtained from the linear predictor in contrast to a full non-parametric approach requiring a p dimensional smoother.
We use an alternating estimation scheme [5] by first estimating β using the Newton Raphson algorithm, followed by the estimation of the inverse link function.
Next, we derive the Newton Raphson update rule for β . The likelihood function for β is as follows:
L ( β | y , X ) = i = 1 n h y i , ϕ exp η i y i A η i ϕ ,
with the following log-likelihood:
l ( β | y , X ) = i = 1 n y i A 1 g 1 x i T β A A 1 g 1 x i T β ϕ + i = 1 n log h y i , ϕ ,
which simplifies for the canonical link function to:
l ( β | y , X ) = i = 1 n y i x i T β ϕ i = 1 n A x i T β ϕ + i = 1 n log h y i , ϕ .
Maximising l ( β | y , X ) using the Newton Rapshon algorithm yields the update rule for β :
β n e w = β o l d + H ϕ 1 ( β o l d ) ϕ ( β o l d ) = β o l d + 1 ϕ X T W A ^ X 1 X T y A ^ ( X β o l d ) ,
with W A ^ = 1 ϕ d i a g A ^ ( x 1 T β ) A ^ ( x n T β ) , ϕ β = 1 ϕ X T y A ( X β ) and H ϕ β = X T W A X . For A ( · ) , we substitute the Nadaraya–Watson weighted average of (4) into (5).
The second derivative of A ( · ) is estimated using the following:
A ^ ( z ) = 1 n h 2 i = 1 n 1 τ ( z , X , β ) y i A ^ ( z ) K h ( z x i T β ) ,
which is the derivative of A ^ ( . ) , or
A ^ ( z ) = 1 ϕ 1 n h i = 1 n y i 2 K h ( z x i T β ) τ ( z , X , β ) A ( z ) 2 ,
derived using (3) for a fixed value of the dispersion parameter ϕ .
Weisberg and Welsh [5] proposed to initialise the semi-parametric estimation procedure by selecting a suitable parametric GLM to obtain initial values for β and the inverse link function. This parametric inverse link function is updated using the non-parametric estimate A ^ ( · ) . In the following step, updated values for β are determined using (5). The procedure is iterated until convergence, selecting h through the use of cross-validation. The proposed procedure for estimating the SPGLM is given in Algorithm 1.
Algorithm 1 Semi-parametric generalised linear models (SPGLM)
  • Fit an initial suitable parametric GLM.
  • Set β o l d equal to the regression parameter estimates from ( 1 ) .
  • For the current values of β o l d :
    (a)
    Determine the non-parametric estimate of the inverse link A ^ ( . ) using (4).
    (b)
    Use the Newton Raphson update rule (5) to determine β n e w .
  • Set β o l d = β n e w .
  • Repeat 3 and 4 until convergence.

2.3. Mixture of Semi-Parametric Generalised Linear Models

Consider K components, each with an SPGLM structure with an unknown link function g k ( μ k ( x ) ) = x T β k = η k for k = 1 , , K . The expected value of the kth component is μ k ( x ) = E k ( Y | x ) . The random variable Y is observed from mixture component k with probability π k . For an MSPGLM, the density or mass function is as follows:
f Y y | x , θ = k = 1 K π k h k y , ϕ k exp x T β k y A k x T β k ϕ k ,
with θ = β 1 , β K ; π 1 , , π K , ( β 1 , , β K ) as the regression parameters and ( π 1 , , π k ) as the mixing probabilities.
For a random sample of n pairs Y i , x i T for i = 1 , , n , the complete data log-likelihood function is the following:
l ( θ | y , X , Z ) = i = 1 n k = 1 K z i k l o g h k y i , ϕ k + x i T β k y i A k x i T β k ϕ k + l o g π k
where the unobserved component membership matrix Z = ( z i k ) , with
z i k = 0 if   the   observations   is   not   from   component   k 1 if   the   observations   is   from   component   k .
The complete data log-likelihood is maximised using the Expectation Maximisation (EM) algorithm [17]. We start with a suitable parametric MGLM to obtain initial parameter estimates β k , initial inverse link functions g k 1 ( · ) , and the estimated responsibilities for the ith observation in component k, γ i k , as in Millard [18]. The component inverse link functions are updated using the Nadaraya–Watson weighted average (4), with the following responsibilities incorporated:
A ^ k ( z ) = 1 n k h i = 1 n γ i k y i K h ( z x i T β ) τ k ( z , X , β )
where
τ k ( z , X , β ) = 1 n k h i = 1 n γ i k K h ( z x i T β )
and n k = i = 1 n γ i k . The second derivative A ( . ) is estimated using the following:
A ^ k ( z ) = 1 ϕ k 1 n k h i = 1 n γ i k y i 2 K h ( z x i T β ) τ k ( z , X , β ) A ^ ( z ) 2 .
The update rule for β k is as follows:
β k n e w = β k o l d + H ϕ 1 ( β k o l d ) ϕ ( β k o l d ) = β k o l d + 1 ϕ X T W A ^ k X 1 X T y A ^ k ( X β k o l d ) ,
with W A ^ k = 1 ϕ d i a g A ^ k ( x 1 T β k ) A ^ k ( x n T β k ) . The MSPGLM estimation procedure is summarised in Algorithm 2.
Algorithm 2 Mixture of semi-parametric generalised linear models (MSPGLM)
  • Fit an initial suitable parametric MGLM.
  • Set β k o l d equal to the regression parameter estimates of each component from step ( 1 ) .
  • For component k
    (a)
    Determine the non-parametric estimate of the component inverse link A ^ k ( · ) using (9).
    (b)
    Use the Newton Raphson update rule (11) to update the unknown component parameters β k n e w .
  • Set β k o l d = β k n e w .
  • Repeat ( 3 ) and ( 4 ) for component k until convergence.
  • Perform 3 5 for all K components.

3. Simulation Study

We assess the performance of the SPGLM estimation algorithm, Algorithm 1, in discrete and continuous scenarios considering different link functions and sample sizes. Data are generated using the specified link functions as indicated in Table 1. A GLM with a single feature variable is used in the generation. Scenarios 1–3 consider binary response models, with scenario 1 using a logit link. Scenarios 2 and 3 use the probit and piecewise linear link functions, respectively. Scenarios 4–6 are continuous response variable models using inverse, identity, and piecewise linear link functions.
A suitable GLM is fitted to the generated data to identify initial values for β and an initial link function. The initial link functions are given in Table 1. The fitted model is obtained using Algorithm 1. Aligned to recent trends in simulation designs [19,20,21], this process is repeated 1000 times for each scenario. These simulation results also guide the MSPGLM since the mixture components are separable.
Figure 1 shows the inverse link functions used to generate the data and the response variable values for a single simulation iteration of the discrete response scenarios.
Similarly, Figure 2 shows the inverse link functions used to generate the data and the response variable values for a single simulation iteration of the continuous response scenarios.
Table 2 gives the simulation results of the proportion of instances where the SPGLM procedure outperforms the initial link function based on the prediction error, which is also presented in the left pane of Figure 3. These scenarios clearly indicate the diverse results that can be obtained by the SPGLM. In scenario 1, the SPGLM outperforms the logit link function in 22.2 33.0 % of the cases, with the lower percentages occurring as the sample size increases. This indicates that with larger sample sizes, the signal from the initial logit link function is of such a nature that the SPGLM procedure can only perform better in ± 22 % of the simulations. This low value is to be expected as the generation link function and the initial link function are the same. The SPGLM outperforms the logit link function for scenario 2 in 32.5 34.9 % of the cases. This trend does not seem to change as the sample size increases. Scenario 3 shows that the prediction error decreases as the sample size increases. The SPGLM yields better results than the logit link function, varying between 69.2 % and 99.8 % as the sample size increases.
The SPGLM outperforms the inverse link function in scenario 4 in more than 97 % of the cases for all sample sizes evaluated. In scenario 5, the SPGLM only performs better in 4.5– 9.7 % of the cases. This is to be expected as the chosen initial link function is the link function that generated the data and therefore serves as confirmation of the parametric model. The piecewise linear link function, scenario 6, shows that the SPGLM outperforms the initial link function in all cases.
For scenarios 1 and 5, the SPGLM does not perform better than the initial link function. It can therefore be argued that the semi-parametric process is not able to identify a better inverse link structure compared to the parametric model under consideration, confirming the chosen initial link function as suitable. For scenarios 3, 4, and 6, the SPGLM generally outperforms the chosen initial link function, indicating the advantage of a data-driven link function. The estimated link function selects an unknown distribution from the broader class of distributions in the EF.
Table 3 gives the simulation results, based on the prediction error, of the number of iterations required for convergence in the cases where the SPGLM performs better than the initial link function, which is also shown in the right pane of Figure 3. The results are consistent in that the average number of iterations to convergence increases as the sample size increases.
This simulation study clearly illustrates the flexibility that can be achieved using the SPGLM procedure.

4. Application: Premium Collection Rates

4.1. Problem Statement and Data

This application is based on data received from a start-up insurance company. The company has 2227 active policies. Their target market is the lower income group. They market two unique products specifically designed to support the households of the policy holders in the event of a claim.
A major challenge in the insurance industry, specifically in the lower income bracket, is the premium collection rate. Collection rates in this market varies between 30 % and 35 % . The company is in need of a better understanding of the behaviour of its customer base and must use this understanding to focus their marketing efforts on customers in order to improve collection rates.
A model predicting the likelihood of a policyholder paying at least one premium in the next three months is required. The variables available are indicated in Table 4.
Relevant data were used to ensure that all policies are measured up to three months after the last month under consideration. A brief summary of the data is given in Table 5 and Table 6.
The exploratory results consider the marginal relationships between variables, not taking into account the differences between possible latent segments in the data. The section below considers an MSPGLM to model the collection rate. The identification of the latent segments will highlight differences in the regression structures, likely due to latent behavioural groups.

4.2. Modelling and Results

A two-component MSPGLM is fitted using the premium collection data. The estimated MSPGLM regression parameters are given in Table 7.
The parameters of the MSPGLM are not directly comparable to parametric MGLM parameters. We therefore use ratios of the regression parameters in Table 8 and Table 9 to determine the relative importance of the explanatory variables in components 1 and 2, respectively.
In both Table 8 and Table 9, the values above the diagonal of 1 s contain ratios indicating the relative importance of the parameters associated with the rows to parameters associated with the columns. Table 8 shows that for component 1, the variable Current had a contribution more than five times that of the variable Last3, and more than eight times that of Gender. Similarly, it can be seen that the relative importance of Bank-A is more than 160 times that of Bank-B. The corresponding importance for component 2 are 1.923 and 27.097 when comparing the variable Current to the variable Last3 and Gender, respectively, in Table 9. Comparing Bank-A to Bank-B yields a relative importance of 0.34 , showing a different relationship when measuring the impact of the banks between components 1 and 2.
Figure 4 shows the estimated MGSPLM link function compared to the initial link function for component 1 in the left pane and for component 2 in the right pane. It is clear that the initial link, MGLM, and the MSPGLM link functions are similar, especially for component 2. This is supported by the prediction errors, as indicated in Table 10. In both components as well as overall, the MSPGLM performs marginally better than the MGLM model based on the initial link.
Table 11 gives the classification accuracy of the estimated MGLM and MSPGLM, respectively. The classification accuracy of both models are similar, with the MSPGLM again performing slightly better.
Given the marginal differences in the estimated inverse link functions of the MGLM and MSPGLM, the results support the use of the initial logit link function as the correct link function. The MGLM, a mixture of logistic regression models, could therefore be used as an appropriate model for the client retention application.

5. Application: COVID-19 Data

5.1. Problem Description and Data

This section considers an application of MSPGLM to COVID-19 infection rates, with time as the explanatory variable. Data are observed for the Kwazulu-Natal and Eastern Cape provinces in South Africa. The response variable is the 14-day infection rate, calculated as r p , t = X p , t X p , t 14 , with X p , t as the daily cumulative COVID-19 positive cases in province p in time period t. This measure is useful in modelling the spread of the disease over time. We considered data spanning the period from December 2020 to 15 February 2021, which include the second wave of infections in South Africa. During this period, the 14-day infection rates varied between 1.006 and 1.380 , covering periods where the 14-day growth was almost stationary up to periods where the growth was more than 38%. The data were sourced from the Data Science for Social Impact COVID-19 data repository, hosted by the University of Pretoria.

5.2. Modelling and Results

A two-component MSPGLM is fitted to the COVID-19 14-day infection rate data, specifically to illustrate the ability of the estimation algorithm to update and improve the inverse link functions and to identify the latent provinces. Estimation Algorithm 2 is used to fit the MSPGLM model, selecting identity link functions as initial link functions and using only time as the explanatory variable in the linear predictor. The regression coefficients for the time variable, t, are 0.02439 and 0.00015 for the two components, respectively.
Figure 5 shows the observed data, initial inverse link functions as well as the estimated MSPGLM inverse link functions for both components, plotted against time. The components are identified using hard clustering.
It is clear that the mixture model could identify the two latent provinces. Based on the observed data, it is also clear that Eastern Cape has a downward sloping 14-day infection rate, while Kwazulu-Natal has a shape that increases and later decreases over the time period.
Figure 5 also clearly shows the ability of the estimation algorithm to update the initial inverse link functions to much more appropriate estimated inverse link functions. Component 1 is identified as the Eastern Cape and Component 2 as Kwazulu-Natal.
This improved fit of the MSPGLM is supported by the prediction errors in Table 12. In both components, the MSPGLM performs much better than the model based on the initial link functions.
The MSPGLM model therefore shows that the chosen initial link functions are not appropriate and that the estimated MSPGLM inverse link functions should rather be used.

6. Discussion and Conclusions

In this paper, we introduced an MSPGLM that estimates appropriate, semi-parametric component inverse link functions, which facilitates the selection of component members from the broader class of EF distributions. The approach allows for a more flexible modelling capability, resulting in an improved capturing of the observed structure.
The first derivative of the component cumulant function A k ( · ) is the component inverse canonical link function, estimated using the Nadaraya–Watson weighted average. The MSPGLM procedure uses an alternating estimation scheme by first updating the component regression parameters, followed by updating the component inverse link function. The MSPGLM procedure either confirms the choice of the initial, parametric component inverse link function or estimates a component inverse link function from the broader class of distributions in the EF, resulting in a better model fit. These properties were explored in the simulation study.
The prediction accuracy in the simulation scenarios where the initial parametric link function was different from the generating link function was drastically improved by the MSPGLM, highlighting the value of the proposed data-driven semi-parametric procedure. This procedure outperformed the MGLM in 69.2– 99.8 % of the cases where the generating link function was substantially different from the initial parametric link function. The procedure also confirmed the suitability of the initial parametric link function in cases where the initial parametric link function corresponded to the generating link function.
Two practical applications were considered. The first application modelled premium collection rates in an insurance company. The estimation results confirm the logit component link functions as appropriate component link functions. In the second application, 14-day infection rates for COVID-19 were modelled. The estimated model improved the initial parametric link functions by updating the initial component identity link functions with estimated non-linear component link functions, showing a substantial improvement in the prediction accuracy. The latent provinces were also successfully identified.
Recent trends confirm a continued interest in a semi-parametric mixture regression, as indicated by [11]. Further research could include the development of an alternative approach to using the estimated regression parameters in order to simplify the interpretation thereof. Additionally, one could investigate different updating strategies for the responsibilities in the MSPGLM.

Author Contributions

Conceptualization, S.M.M. and F.H.J.K.; methodology, S.M.M. and F.H.J.K.; software, S.M.M.; formal analysis, S.M.M. and F.H.J.K.; writing—review and editing, S.M.M. and F.H.J.K.; visualization, S.M.M. and F.H.J.K.; project administration, S.M.M.; funding acquisition, S.M.M. and F.H.J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by STATOMET, the Bureau for Statistical and Survey Methodology at the University of Pretoria, grant number STM21a.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data used in the application are available in a publicly accessible repository.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

EFExponential family
GLMGeneralised linear model
SPGLMSemi-parametric generalised linear model
MGLMMixture of generalised linear models
MSPGMMixture of semi-parametric generalised linear models

References

  1. An, P.; Wang, Z.; Zhang, C. Ensemble unsupervised autoencoders and Gaussian mixture model for cyberattack detection. Inf. Process. Manag. 2022, 59, 102844. [Google Scholar] [CrossRef]
  2. Huang, T.; Peng, H.; Zhang, K. Model selection for Gaussian mixture models. Stat. Sin. 2017, 27, 147–169. [Google Scholar] [CrossRef] [Green Version]
  3. Azzalini, A.; Bowman, A.; Härdle, W. On the use of nonoparametric regression for model checking. Biometrika 1989, 76, 1–12. [Google Scholar] [CrossRef]
  4. Wainer, H. Pyramid power: Searching for an error in test scoring with 830,000 helpers. Am. Stat. 1983, 37, 87–91. [Google Scholar]
  5. Weisberg, S.; Welsh, A.H. Adapting for the missing link. Ann. Stat. 1994, 22, 1674–1700. [Google Scholar] [CrossRef]
  6. Caracciolo, F.; Furno, M.; D’Amico, M.; Califano, G.; Di Vita, G. Variety seeking behavior in the wine domain: A consumers segmentation using big data. Food Qual. Prefer. 2022, 97, 104481. [Google Scholar] [CrossRef]
  7. Pacáková, V.; Zapletal, D. Mixture distributions in modelling of insurance losses. In Proceedings of the 2013 International Conference on Applied Mathematics and Computational Methods in Engineering, Sun Valley, ID, USA, 5–9 May 2013; pp. 16–19. [Google Scholar]
  8. Hamel, S.; Yoccoz, N.G.; Gaillard, J. Assessing variation in life-history tactics within a population using mixture regression models: A practical guide for evolutionary ecologists. Biol. Rev. 2017, 92, 754–775. [Google Scholar] [CrossRef] [PubMed]
  9. Kurz, C.F.; Hatfield, L.A. Identifying and interpreting subgroups in health care utilization data with count mixture regression models. Stat. Med. 2019, 38, 4423–4435. [Google Scholar] [CrossRef] [PubMed]
  10. Wang, E.; Lee, C. The impact of clean energy consumption on economic growth in China: Is environmental regulation a curse or a blessing? Int. Rev. Econ. Financ. 2022, 77, 39–58. [Google Scholar] [CrossRef]
  11. Xiang, S.; Yao, W.; Yang, G. An overview of semiparametric extensions of finite mixture models. Stat. Sci. 2019, 34, 391–404. [Google Scholar] [CrossRef] [Green Version]
  12. Ma, Y.; Wang, S.; Xu, L.; Yao, W. Semiparametric mixture regression with unspecified error distributions. Test 2020, 30, 429–444. [Google Scholar] [CrossRef]
  13. Jansen, R.C. Maximum likelihood in a generalized linear finite mixture model by using the EM algorithm. Biometrics 1993, 49, 227–231. [Google Scholar] [CrossRef]
  14. Nguyen, H.D. Finite Mixture Models for Regression Problems. Ph.D. Thesis, University of Queensland, St Lucia, Australia, 2015. [Google Scholar]
  15. Li, K.; Duan, N. Regression analysis under link violation. Annals Stat. 1989, 17, 1009–1052. [Google Scholar] [CrossRef]
  16. Dunn, P.K.; Smyth, G. Generalized Linear Models with Examples in R, 1st ed.; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  17. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–38. [Google Scholar]
  18. Millard, S.M. Contributions to Mixture Regression with Applications in Industry. Ph.D. Thesis, University of Pretoria, Pretoria, South Africa, 2018. [Google Scholar]
  19. Asar, Y.; Algamal, Z. A New Two-parameter Estimator for the Gamma Regression Model. Stat. Optim. Inf. Comput. 2022. [Google Scholar] [CrossRef]
  20. Belias, M.; Rovers, M.M.; Reitsma, J.B.; Debray, T.P.A.; IntHout, J. Statistical approaches to identify subgroups in meta-analysis of individual participant data: A simulation study. BMC Med. Res. Methodol. 2019, 19, 183. [Google Scholar] [CrossRef] [PubMed]
  21. Jia, B. The application of Monte Carlo methods for learning generalized linear mode. Biom. Biostat. Int. J. 2018, 7, 422–428. [Google Scholar]
Figure 1. SPGLM discrete simulation cases, generated data, and inverse link functions for a single simulation iteration.
Figure 1. SPGLM discrete simulation cases, generated data, and inverse link functions for a single simulation iteration.
Symmetry 14 00409 g001
Figure 2. SPGLM continuous simulation cases, generated data, and inverse link functions for a single simulation iteration.
Figure 2. SPGLM continuous simulation cases, generated data, and inverse link functions for a single simulation iteration.
Symmetry 14 00409 g002
Figure 3. Proportion of cases (a) where the SPGLM outperforms the initial link, and the average number of iterations (b) until convergence versus sample size.
Figure 3. Proportion of cases (a) where the SPGLM outperforms the initial link, and the average number of iterations (b) until convergence versus sample size.
Symmetry 14 00409 g003
Figure 4. Observed data (red), the initial inverse link functions MGLM (black), and estimated MSPGLM inverse link functions (blue). Component 1 is presented in the left pane (a), with Component 2 in the right pane (b).
Figure 4. Observed data (red), the initial inverse link functions MGLM (black), and estimated MSPGLM inverse link functions (blue). Component 1 is presented in the left pane (a), with Component 2 in the right pane (b).
Symmetry 14 00409 g004
Figure 5. Observed data (red), the initial inverse link functions MGLM (black), and the estimated MSPGLM inverse link functions (blue). Component 1 is presented in the left pane (a), with Component 2 in the right pane (b). The first day of December 2020 corresponds to t = 1 .
Figure 5. Observed data (red), the initial inverse link functions MGLM (black), and the estimated MSPGLM inverse link functions (blue). Component 1 is presented in the left pane (a), with Component 2 in the right pane (b). The first day of December 2020 corresponds to t = 1 .
Symmetry 14 00409 g005
Table 1. Simulation scenarios considered for the SPGLM.
Table 1. Simulation scenarios considered for the SPGLM.
ScenarioResponse TypeGenerating LinkInitial LinkSample Sizes
1DiscreteLogitLogit 100 , 200 , 500 , 1000
2DiscreteProbitLogit 100 , 200 , 500 , 1000
3DiscretePiecewise linearLogit 100 , 200 , 500 , 1000
4ContinuousInverse linkIdentity 100 , 200 , 500 , 1000
5ContinuousIdentityIdentity 100 , 200 , 500 , 1000
6ContinuousPiecewise linearIdentity 100 , 200 , 500 , 1000
Table 2. Proportion of simulations where the SPGLM model outperforms the initial link, based on the prediction error, with standard deviations in brackets.
Table 2. Proportion of simulations where the SPGLM model outperforms the initial link, based on the prediction error, with standard deviations in brackets.
Response TypeDiscrete
Scenario123
Generating linkLogitProbitPiecewise linear
Initial linkLogitLogitLogit
Sample size
100 0.330 ( 0.470 ) 0.329 ( 0.470 ) 0.692 ( 0.462 )
200 0.315 ( 0.465 ) 0.349 ( 0.477 ) 0.806 ( 0.396 )
500 0.222 ( 0.416 ) 0.325 ( 0.469 ) 0.961 ( 0.194 )
1000 0.223 ( 0.416 ) 0.331 ( 0.471 ) 0.998 ( 0.045 )
Response TypeContinuous
Scenario456
Generating linkInverseIdentityPiecewise linear
Initial linkIdentityIdentityIdentity
Sample size
100 0.976 ( 0.153 ) 0.060 ( 0.239 ) 1.000 ( 0.000 )
200 0.999 ( 0.032 ) 0.097 ( 0.296 ) 1.000 ( 0.000 )
500 1.000 ( 0.000 ) 0.072 ( 0.259 ) 1.000 ( 0.000 )
1000 1.000 ( 0.000 ) 0.047 ( 0.212 ) 1.000 ( 0.000 )
Table 3. Average number of iterations until convergence, based on the prediction error, with standard deviations in brackets.
Table 3. Average number of iterations until convergence, based on the prediction error, with standard deviations in brackets.
Response TypeDiscrete
Scenario123
Generating linkLogitProbitPiecewise linear
Initial linkLogitLogitLogit
Sample size
100 1.461 ( 1.747 ) 1.465 ( 1.085 ) 16.990 ( 22.473 )
200 1.895 ( 2.897 ) 2.003 ( 2.186 ) 17.809 ( 17.181 )
500 3.347 ( 4.236 ) 3.526 ( 2.537 ) 32.309 ( 22.354 )
1000 4.571 ( 4.602 ) 5.927 ( 4.133 ) 49.321 ( 27.858 )
Response TypeContinuous
Scenario456
Generating linkInverseIdentityPiecewise linear
Initial linkIdentityIdentityIdentity
Sample size
100 21.516 ( 26.620 ) 1.000 ( 0.000 ) 31.493 ( 68.02 )
200 38.183 ( 38.472 ) 15.567 ( 51.743 ) 43.138 ( 88.531 )
500 67.441 ( 50.656 ) 64.569 ( 93.437 ) 98.561 ( 86.089 )
1000 108.117 ( 58.969 ) 122.681 ( 97.194 ) 156.310 ( 70.296 )
Only the cases where the SPGLM outperforms the initial link.
Table 4. Customer retention data available for modelling payment behaviour.
Table 4. Customer retention data available for modelling payment behaviour.
VariableDescriptionRoleType
Pay (y)A binary response variable indicating at least one payment in the following three months.ResponseCategorical
CurrentA binary indicator of a payment in the current month.ExplanatoryCategorical
Last3The number of payments in the three months prior to the current month.ExplanatoryNumerical
GenderGender of the policy holder.ExplanatoryCategorical
ProductProduct indicator (two products).ExplanatoryCategorical
BankBank used for payments (4 banks).ExplanatoryCategorical
AgeAge of the policy holder during the month of evaluation.ExplanatoryNumerical
Table 5. Data summary: Age, Bank, and Gender versus Payment.
Table 5. Data summary: Age, Bank, and Gender versus Payment.
Age
LevelFrequency Non-PayPercentage Non-PayFrequency PayPercentage Pay
a:–248865.1854734.815
b:25–3564064.58135135.419
c:35–4552665.50427734.496
d:45–5515466.6677733.333
e:55+4161.1942638.806
Total144965.06577834.935
Bank
LevelFrequency Non-PayPercentage Non-PayFrequency PayPercentage Pay
A51179.84412920.156
B17352.74415547.256
C56560.88436339.116
D20060.42313139.577
Total144965.06577834.935
Gender
LevelFrequency Non-PayPercentage Non-PayFrequency PayPercentage Pay
Female62562.18938037.811
Male82467.43039832.570
Total144965.06577834.935
Table 6. Data summary: Product and payment history versus Payment.
Table 6. Data summary: Product and payment history versus Payment.
Product
LevelFrequency Non-PayPercentage Non-PayFrequency PayPercentage Pay
068670.14329229.857
176361.08948638.911
Total144965.06577834.935
Current month payment indicator
LevelFrequency Non-PayPercentage Non-PayFrequency PayPercentage Pay
081570.07734829.923
163459.58643040.414
Total144965.06577834.935
Number of payments in the previous three months
LevelFrequency Non-PayPercentage Non-PayFrequency PayPercentage Pay
053589.9166010.084
162476.37719323.623
226245.09531954.905
32811.96620688.034
Total144965.06577834.935
Table 7. Estimated parameters of the MSPGLM model.
Table 7. Estimated parameters of the MSPGLM model.
Parameter Estimates
VariableComponent 1Component 2
Current0.3270.687
Last30.0630.357
Gender0.0400.025
Product0.1670.055
Bank-A−0.437−0.131
Bank-B−0.0030.386
Bank-C−0.0860.178
Age0.011−0.030
Table 8. Relative importance of the estimated regression parameters for component 1 of the MSPGLM.
Table 8. Relative importance of the estimated regression parameters for component 1 of the MSPGLM.
VariableCurrentLast3GenderProduct
Current15.1608.1611.961
Last30.19411.5810.380
Gender0.1230.63210.240
Product0.5102.6324.1621
Bank-A1.3346.88710.8902.617
Bank-B0.0080.0420.0660.016
Bank-C0.2641.3632.1550.518
Age0.0340.1760.2790.067
VariableBank-ABank-BBank-CAge
Current0.749123.2453.78729.244
Last30.14523.8830.7345.667
Gender0.09215.1030.4643.584
Product0.38262.8541.93114.914
Bank-A1164.4705.05439.026
Bank-B0.00610.0310.237
Bank-C0.19832.54517.722
Age0.0264.2140.1291
Table 9. Relative importance of the estimated regression parameters for component 2 of the MSPGLM.
Table 9. Relative importance of the estimated regression parameters for component 2 of the MSPGLM.
VariableCurrentLast3GenderProduct
Current11.92327.09712.565
Last30.520114.0916.534
Gender0.0370.07110.464
Product0.0800.1532.1571
Bank-A0.1910.3685.182.402
Bank-B0.5621.08015.2257.060
Bank-C0.2600.5007.0403.264
Age0.0440.0851.2020.557
VariableBank-ABank-BBank-CAge
Current5.2311.7803.84922.552
Last32.7200.9262.00111.728
Gender0.1930.0660.1420.832
Product0.4160.1420.3061.795
Bank-A10.3400.7364.311
Bank-B2.93912.16312.672
Bank-C1.3590.46215.859
Age0.2320.0790.1711
Table 10. Prediction errors for the MGLM and MSPGLM models.
Table 10. Prediction errors for the MGLM and MSPGLM models.
ComponentMGLMMSPGLM
Overall295.904295.336
Component 1241.914241.394
Component 253.99053.942
Table 11. Classification accuracy for the MGLM and MSPGLM models.
Table 11. Classification accuracy for the MGLM and MSPGLM models.
ModelClassification Acuracy
MGLM 83.34 %
MSPGLM 83.75 %
Table 12. Prediction errors for the MGLM and MSPGLM models.
Table 12. Prediction errors for the MGLM and MSPGLM models.
ComponentMGLMMSPGLM
Overall7.7061.736
Component 11.1440.819
Component 26.5620.917
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Millard, S.M.; Kanfer, F.H.J. Mixtures of Semi-Parametric Generalised Linear Models. Symmetry 2022, 14, 409. https://doi.org/10.3390/sym14020409

AMA Style

Millard SM, Kanfer FHJ. Mixtures of Semi-Parametric Generalised Linear Models. Symmetry. 2022; 14(2):409. https://doi.org/10.3390/sym14020409

Chicago/Turabian Style

Millard, Salomon M., and Frans H. J. Kanfer. 2022. "Mixtures of Semi-Parametric Generalised Linear Models" Symmetry 14, no. 2: 409. https://doi.org/10.3390/sym14020409

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop