# Applications of Clustering with Mixed Type Data in Life Insurance


## Abstract


## 1. Introduction and Motivation

- A tracking and monitoring system is a risk management tool that helps insurers take the actions needed to mitigate the economic impact of mortality deviations.
- It is a tool for improved understanding of the emergence of death claims experience, thereby helping an insurer in product design, underwriting, marketing, pricing, reserving, and financial planning.
- It provides a proactive tool for dealing with regulators, credit analysts, investors, and rating agencies who may be interested in reasons for any volatility in earnings as a result of death claim fluctuations.
- A better understanding of the company’s emergence of death claims experience helps to improve its claims predictive models.
- The results of a tracking and monitoring system give the company a benchmark for its death claims experience that can be compared with that of other companies in the industry.

## 2. Empirical Data

## 3. Data Clustering Algorithms

#### 3.1. The k-Prototype Algorithm

Algorithm 1: Pseudo-code of the k-prototype algorithm.
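As a rough illustration of the kind of assign-and-update pass a k-prototype algorithm performs, the following Python sketch follows the general form described by Huang (1998): squared Euclidean distance on the numerical attributes plus a weighted count of categorical mismatches. It is a sketch under stated assumptions, not the algorithm used in this paper. The single weight `lam`, the function names, and the toy records are all hypothetical, and the spatial component and the two weights ${\lambda}_{1}$ and ${\lambda}_{2}$ of Section 3.2 are deliberately left out.

```python
from collections import Counter


def mixed_distance(x_num, x_cat, p_num, p_cat, lam):
    """Squared Euclidean part plus lam times the number of categorical mismatches."""
    num_part = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    cat_part = sum(1 for a, b in zip(x_cat, p_cat) if a != b)
    return num_part + lam * cat_part


def k_prototype_step(records, prototypes, lam):
    """One assign-then-update pass.

    records and prototypes are lists of (numeric_tuple, categorical_tuple).
    Returns the cluster label of each record and the updated prototypes.
    """
    # Assignment: each record goes to its nearest prototype.
    labels = [
        min(
            range(len(prototypes)),
            key=lambda j: mixed_distance(
                num, cat, prototypes[j][0], prototypes[j][1], lam
            ),
        )
        for num, cat in records
    ]
    # Update: means for numerical attributes, modes for categorical ones.
    new_prototypes = []
    for j, old in enumerate(prototypes):
        members = [r for r, lab in zip(records, labels) if lab == j]
        if not members:
            new_prototypes.append(old)  # keep an empty cluster's prototype
            continue
        num_cols = list(zip(*(m[0] for m in members)))
        cat_cols = list(zip(*(m[1] for m in members)))
        means = tuple(sum(col) / len(col) for col in num_cols)
        modes = tuple(Counter(col).most_common(1)[0][0] for col in cat_cols)
        new_prototypes.append((means, modes))
    return labels, new_prototypes


# Hypothetical toy records: (issue age,), (gender, plan)
records = [
    ((30.0,), ("F", "Term")),
    ((32.0,), ("F", "Term")),
    ((60.0,), ("M", "ULS")),
    ((62.0,), ("M", "ULS")),
]
prototypes = [((30.0,), ("F", "Term")), ((60.0,), ("M", "ULS"))]
labels, updated = k_prototype_step(records, prototypes, lam=1.0)
```

In practice the pass would be repeated until the labels stop changing, and the numerical attributes would usually be standardized first so that no single attribute dominates the distance.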

#### 3.2. Determining the Parameters ${\lambda}_{1}$ and ${\lambda}_{2}$

#### 3.3. Determining the Optimal Number of Clusters

- Set $k=1,2,\dots ,10$;
- Run the k-prototype algorithm and calculate $\log\left({W}_{k}\right)$ for each $k=1,2,\dots ,10$ on the original data $\mathbf{X}$;
- For each $b=1,2,\dots ,B$, generate a reference data set ${\mathbf{X}}_{b}^{\ast}$ with sample size $n$. Run the clustering algorithm under the candidate $k$ values and compute $$\mathrm{E}[\log\left({W}_{k}\left({\mathbf{X}}^{\ast}\right)\right)]=\frac{1}{B}\sum _{b=1}^{B}\log\left({W}_{k}\left({\mathbf{X}}_{b}^{\ast}\right)\right)$$ together with the gap statistic of Tibshirani et al. (2001), $$\mathrm{Gap}\left(k\right)=\mathrm{E}[\log\left({W}_{k}\left({\mathbf{X}}^{\ast}\right)\right)]-\log\left({W}_{k}\right);$$
- Define $s\left(k\right)=\left(\sqrt{1+1/B}\right)\times \mathrm{sd}\left(k\right)$, where $\mathrm{sd}\left(k\right)=\sqrt{(1/B){\sum}_{b=1}^{B}{(\log\left({W}_{k}\left({\mathbf{X}}_{b}^{\ast}\right)\right)-\mathrm{E}[\log\left({W}_{k}\left({\mathbf{X}}^{\ast}\right)\right)])}^{2}}$; and
- Choose the optimal number of clusters as the smallest $k$ such that $\mathrm{Gap}\left(k\right)\ge \mathrm{Gap}(k+1)-s(k+1)$.
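The selection rule in the steps above can be sketched in a few lines of Python. The sketch assumes that the values $\log({W}_{k})$ have already been computed by whatever clustering routine is in use. The function name `choose_k_by_gap` and its two list arguments are hypothetical, and $\mathrm{Gap}(k)$ is taken as the reference mean minus the observed $\log({W}_{k})$, as in Tibshirani et al. (2001).

```python
import math


def choose_k_by_gap(log_w, ref_log_w):
    """Pick the number of clusters with the gap-statistic rule.

    log_w[i]        : log(W_k) on the original data for k = i + 1
    ref_log_w[i][b] : log(W_k) on the b-th reference data set for k = i + 1
    Returns (chosen k, list of Gap(k), list of s(k)).
    """
    gaps, s = [], []
    for observed, refs in zip(log_w, ref_log_w):
        B = len(refs)
        mean_ref = sum(refs) / B
        # Standard deviation with divisor B, as in the definition of sd(k).
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in refs) / B)
        gaps.append(mean_ref - observed)
        s.append(math.sqrt(1 + 1 / B) * sd)
    # Smallest k with Gap(k) >= Gap(k + 1) - s(k + 1).
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - s[i + 1]:
            return i + 1, gaps, s
    # No candidate satisfied the rule: fall back to the largest k tried.
    return len(gaps), gaps, s


# Hypothetical values for k = 1, 2, 3 with B = 2 reference data sets.
k_opt, gaps, s = choose_k_by_gap(
    [5.0, 3.0, 2.8],
    [[5.5, 5.5], [4.5, 4.5], [4.2, 4.2]],
)
```

Returning the largest candidate when no $k$ satisfies the inequality is a design choice of this sketch; the steps above do not say what to do in that case.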

## 4. Clustering Results and Interpretation

**Cluster 1**

- This cluster consists entirely of female policyholders. It has a larger percentage of variable life plans and a smaller percentage of substandard policies than clusters 2 and 3. The violin plots for the numerical attributes show that the youngest group with the smallest amount of insurance coverage is in this cluster. Geographically, the insureds in this cluster are mostly distributed in the northeastern states, such as New Jersey, New York, Rhode Island, and New Hampshire.

**Cluster 2**

- This cluster has an interesting gender make-up. While clusters 1 and 3 each have a dominating gender, cluster 2 has 30% females and 70% males. It also has the largest proportions of smokers, term conversion underwriting type policies, and substandard policies of all the clusters. When it comes to plan type, however, 91% of its policies are universal life contracts and almost none are term plans. With respect to issue age and amount of insurance coverage, this cluster has the most senior policyholders and, not surprisingly, lower face amounts. Geographically, with the exception of a few states that dominate the cluster, the remaining states are almost uniformly distributed. Cluster 2 has the states with the lowest proportions of insured policies among all the clusters.

**Cluster 3**

- Male policyholders dominate this cluster, which has the smallest proportions of smokers and term conversion underwriting type policies among all clusters. More than 90% of the policyholders purchased term plans, and most of them have larger face amounts than policyholders in the other two clusters. According to the violin plots, the policyholders in this cluster are more often middle aged than those in the other clusters. They are geographically more scattered, notably in Arkansas, Alabama, Mississippi, Tennessee, and Oregon; interestingly, cluster 3 has the largest proportion of policies among all clusters.

## 5. Discussion

- Cluster 1 has the most favorable A/E ratio among all the clusters: it is significantly less than 1 at the 10% significance level, with moderate variability. This can reasonably be explained by the feature that sets it apart from the other clusters: it consists entirely of female policyholders, and females live longer than males on average. It has a larger percentage of variable life plans and fewer substandard policies than clusters 2 and 3, and fewer smokers and term conversion policies than cluster 2. In addition, the violin plots for the numerical attributes show that the youngest group with the smallest amount of insurance coverage belongs to this cluster. We expect this youngest group to have generally low mortality rates. Geographically, the insureds in this cluster are mostly distributed in the northeastern states, such as New Jersey, New York, Rhode Island, and New Hampshire. It may be noted that people in this region typically have higher incomes and better employer-provided health insurance.
- Cluster 2 has an A/E ratio of 0.68, which is not significantly less than 1 at either the 5% or the 10% significance level; it has the largest variability of this ratio among all clusters. Cluster 2 therefore has the most unfavorable A/E ratio from a statistical perspective. The characteristics of this cluster can be captured by these dominant features: (i) Its gender make-up is a mix of males and females, with more males than females. (ii) It has the largest proportions of smokers, term conversion underwriting type policies, and substandard policies. (iii) When it comes to plan type, 91% of its policies are universal life contracts and none are term policies. (iv) With respect to issue age and amount of insurance coverage, this cluster has the largest proportion of elderly people and, therefore, lower face amounts. All these dominating features help explain a generally worse mortality and a larger variability of deviations. For example, the older group has a higher mortality rate than the younger group, and the largest proportion of smokers compounds this mortality. To some extent, the largest proportions of term conversion underwriting types and substandard policies also indicate a more inferior mortality experience.
- Cluster 3 has the A/E ratio that is most significantly less than 1, even though it has the worst A/E ratio among all the clusters. The characteristics can be captured by some dominating features in the cluster: male policyholders dominate this cluster, and it has the smallest proportions of smokers and term conversion underwriting type policies among all three clusters. More than 90% of the policyholders purchased term plans, and most of them have larger face amounts than those in the other clusters. According to the violin plots, the policyholders in this cluster are more often middle aged than those in the other clusters. The policyholders are geographically more scattered, notably in Arkansas, Alabama, Mississippi, Tennessee, and Oregon. We generally know that smokers' mortality is worse than that of non-smokers. Relatively younger age groups have a lower mortality rate than other age groups. Term plans generally have fixed terms and are more subject to frequent underwriting. The small variability can be explained by the larger number of policies, which provides more information and hence much more predictable mortality.
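For readers less familiar with the measure discussed above, an amount-based A/E ratio is conventionally the total face amount of actual death claims divided by the total expected claim amount. The Python sketch below assumes that conventional definition and a hypothetical record layout `(face_amount, expected_rate, died)`; the function name and the toy figures are illustrative only, and the confidence intervals developed in Appendix A are not reproduced here.

```python
def ae_ratio_by_amount(policies):
    """Amount-weighted actual-to-expected (A/E) mortality ratio.

    policies: iterable of (face_amount, expected_rate, died) tuples, where
    expected_rate is the expected mortality rate for the exposure period
    and died is True if a death claim occurred.
    """
    policies = list(policies)
    # Actual claims: face amounts of the policies that resulted in a death claim.
    actual = sum(face for face, _, died in policies if died)
    # Expected claims: face amount times expected mortality rate, over all policies.
    expected = sum(face * rate for face, rate, _ in policies)
    if expected == 0:
        raise ValueError("expected claim amount is zero; A/E ratio is undefined")
    return actual / expected


# Hypothetical toy portfolio (rates are unrealistically high to keep the numbers simple).
toy_portfolio = [
    (100_000.0, 0.25, True),
    (300_000.0, 0.25, False),
]
ratio = ae_ratio_by_amount(toy_portfolio)
```

A ratio below 1 means that fewer claim dollars were paid than expected, which is the sense in which the discussion above calls a low A/E ratio favorable.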

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Definition |
---|---|
A/E | Actual to expected ratio |
ULS | Universal life with secondary guarantees |
VLS | Variable life with secondary guarantees |
WGS84 | World Geodetic System 1984 |

## Appendix A. Convergence of A/E Ratio

**Theorem A1 (Lyapunov Theorem).**

**Theorem A2 (Lyapunov Theorem).**

**Proof.**

**Proof.**

## Appendix B. Tables That Summarize the Distributions of Clusters

State (Cluster 1) | Proportion (Cluster 1) | State (Cluster 2) | Proportion (Cluster 2) | State (Cluster 3) | Proportion (Cluster 3) |
---|---|---|---|---|---|
NJ | 36.36% | WV | 21.25% | AR | 74.78% |
NY | 34.54% | DE | 19.40% | AL | 74.19% |
RI | 34.35% | PA | 19.26% | MS | 73.84% |
NH | 34.09% | OH | 18.50% | TN | 71.36% |
ME | 33.98% | IN | 16.40% | OR | 70.64% |
MA | 33.64% | RI | 16.21% | ID | 69.03% |
DE | 32.70% | ME | 15.79% | OK | 68.16% |
CA | 32.46% | SD | 15.48% | KY | 68.02% |
NV | 32.25% | IL | 15.45% | TX | 66.90% |
MD | 31.90% | NJ | 14.17% | WA | 66.29% |
IL | 31.82% | NY | 14.12% | UT | 64.42% |
PA | 31.63% | SC | 13.84% | GA | 64.34% |
DC | 31.54% | MD | 13.54% | CO | 64.29% |
CT | 30.82% | IA | 12.78% | WY | 63.83% |
MT | 29.96% | MO | 12.78% | ND | 63.20% |
NM | 29.94% | KS | 12.69% | NE | 62.92% |
IN | 29.55% | ND | 12.51% | NC | 62.69% |
FL | 29.19% | LA | 12.32% | KS | 61.93% |
MN | 28.60% | VT | 12.22% | VA | 61.80% |
AZ | 28.47% | MA | 12.15% | IA | 61.74% |
WI | 28.34% | FL | 12.07% | LA | 61.48% |
MI | 27.94% | NH | 11.93% | MT | 61.02% |
VT | 27.93% | MN | 11.65% | MO | 60.85% |
OH | 27.61% | WI | 11.65% | AZ | 60.69% |
WY | 27.12% | MI | 11.64% | MI | 60.42% |
UT | 27.01% | NM | 11.59% | WI | 60.01% |
VA | 26.91% | CT | 11.53% | SD | 59.86% |
SC | 26.86% | NE | 11.51% | VT | 59.85% |
WA | 26.61% | NC | 11.49% | MN | 59.75% |
MO | 26.37% | VA | 11.29% | SC | 59.31% |
LA | 26.20% | CA | 11.26% | FL | 58.74% |
GA | 26.10% | AZ | 10.83% | DC | 58.66% |
NC | 25.82% | NV | 10.34% | NM | 58.47% |
CO | 25.66% | CO | 10.05% | CT | 57.65% |
NE | 25.57% | KY | 9.98% | NV | 57.40% |
IA | 25.48% | DC | 9.80% | CA | 56.28% |
WV | 25.48% | GA | 9.56% | MD | 54.56% |
KS | 25.38% | OK | 9.08% | MA | 54.21% |
SD | 24.66% | WY | 9.05% | IN | 54.05% |
ND | 24.29% | MT | 9.02% | NH | 53.98% |
TX | 24.24% | TX | 8.86% | OH | 53.90% |
ID | 22.84% | AL | 8.74% | WV | 53.27% |
OK | 22.77% | MS | 8.62% | IL | 52.73% |
OR | 22.24% | UT | 8.57% | NY | 51.34% |
KY | 22% | ID | 8.13% | ME | 50.23% |
TN | 20.71% | TN | 7.93% | NJ | 49.47% |
AR | 18.04% | AR | 7.18% | RI | 49.44% |
MS | 17.54% | OR | 7.12% | PA | 49.10% |
AL | 17.07% | WA | 7.10% | DE | 47.90% |

Categorical Variables | Levels | Cluster 1 | Cluster 2 | Cluster 3 |
---|---|---|---|---|
Gender | Female | 100% | 30.43% | 0.09% |
Gender | Male | 0% | 69.57% | 99.91% |
Smoker Status | Smoker | 4.43% | 6.53% | 3.45% |
Smoker Status | Nonsmoker | 95.57% | 93.47% | 96.55% |
Underwriting Type | Term conversion | 3.59% | 22.79% | 0.85% |
Underwriting Type | Underwritten | 96.41% | 77.21% | 99.15% |
Substandard Indicator | Yes | 6% | 11.58% | 7.82% |
Substandard Indicator | No | 94% | 88.42% | 92.18% |
Plan | Term | 73.71% | 0% | 91.51% |
Plan | ULS | 8.95% | 90.86% | 0.12% |
Plan | VLS | 17.34% | 9.14% | 8.37% |

Continuous Variables | Cluster | Minimum | 1st Quartile | Mean | 3rd Quartile | Maximum |
---|---|---|---|---|---|---|
Issue Age | Cluster 1 | 0 | 31 | 38.59 | 46 | 81 |
Issue Age | Cluster 2 | 0 | 47 | 51.53 | 61 | 90 |
Issue Age | Cluster 3 | 0 | 36 | 44.47 | 53 | 85 |
Face Amount | Cluster 1 | 215 | 100,000 | 375,066 | 500,000 | 13,000,000 |
Face Amount | Cluster 2 | 3,000 | 57,000 | 448,634 | 250,000 | 19,000,000 |
Face Amount | Cluster 3 | 4,397 | 250,000 | 717,646 | 1,000,000 | 100,000,000 |

## References

- Ahmad, Amir, and Shehroz S. Khan. 2019. Survey of state-of-the-art mixed data clustering algorithms. IEEE Access 7: 31883–902.
- Carter, Carl. 2002. Great Circle Distances. Available online: https://www.inventeksys.com/wp-content/uploads/2011/11/GPS_Facts_Great_Circle_Distances.pdf (accessed on 21 October 2020).
- Devale, Amol B., and Raja V. Kulkarni. 2012. Applications of data mining techniques in life insurance. International Journal of Data Mining & Knowledge Management Process 2: 31–40.
- Dickson, David C. M., Mary R. Hardy, and Howard R. Waters. 2013. Actuarial Mathematics for Life Contingent Risks, 2nd ed. Cambridge: Cambridge University Press.
- Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Paper presented at the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, August 2–4; Volume 96, pp. 226–31.
- Fahad, Adil, Najlaa Alshatri, Zahir Tari, Abdullah Alamri, Ibrahim Khalil, Albert Y. Zomaya, Sebti Foufou, and Abdelaziz Bouras. 2014. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing 2: 267–79.
- Gan, Guojun. 2011. Data Clustering in C++: An Object-Oriented Approach. Data Mining and Knowledge Discovery Series; Boca Raton: Chapman & Hall/CRC Press.
- Gan, Guojun. 2013. Application of data clustering and machine learning in variable annuity valuation. Insurance: Mathematics and Economics 53: 795–801.
- Gan, Guojun, and X. Sheldon Lin. 2015. Valuation of large variable annuity portfolios under nested simulation: A functional data approach. Insurance: Mathematics and Economics 62: 138–50.
- Gan, Guojun, Chaoqun Ma, and Jianhong Wu. 2007. Data Clustering: Theory, Algorithms and Applications. In ASA-SIAM Series on Statistics and Applied Probability. Philadelphia: SIAM Press.
- Gan, Guojun, and Emiliano A. Valdez. 2016. An empirical comparison of some experimental designs for the valuation of large variable annuity portfolios. Dependence Modeling 4: 382–400.
- Gan, Guojun, and Emiliano A. Valdez. 2020. Data clustering with actuarial applications. North American Actuarial Journal 24: 168–86.
- Hsu, Chung-Chian, and Yu-Cheng Chen. 2007. Mining of mixed data with application to catalog marketing. Expert Systems with Applications 32: 12–23.
- Huang, Zhexue. 1997. Clustering large data sets with mixed numeric and categorical values. Paper presented at the First Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, February 23–24; pp. 21–34.
- Huang, Zhexue. 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2: 283–304.
- Jain, Anil K., M. Narasimha Murty, and Patrick J. Flynn. 1999. Data clustering: A review. ACM Computing Surveys 31: 264–323.
- Jang, Hong-Jun, Byoungwook Kim, Jongwan Kim, and Soon-Young Jung. 2019. An efficient grid-based k-prototypes algorithm for sustainable decision-making on spatial objects. Sustainability 11: 1801.
- MacCuish, John David, and Norah E. MacCuish. 2010. Clustering in Bioinformatics and Drug Discovery. Boca Raton: CRC Press.
- MacQueen, James. 1967. Some methods for classification and analysis of multivariate observations. Paper presented at the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, June 21–July 18; Volume 1, pp. 281–97.
- Najjar, Ahmed, Christian Gagné, and Daniel Reinharz. 2014. A novel mixed values k-prototypes algorithm with application to health care databases mining. Paper presented at IEEE Symposium on Computational Intelligence in Healthcare and e-Health (CICARE), Orlando, FL, USA, December 9–12; pp. 1–8.
- Sfyridis, Alexandros, and Paolo Agnolucci. 2020. Annual average daily traffic estimation in England and Wales: An application of clustering and regression modelling. Journal of Transport Geography 83: 1–17.
- Szepannek, Gero. 2019. clustMixType: User-friendly clustering of mixed-type data in R. The R Journal 10: 200–8.
- Szepannek, Gero. 2017. R: k-Prototypes Clustering for Mixed Variable-Type Data. Vienna: R Foundation for Statistical Computing.
- Thiprungsri, Sutapat, and Miklos A. Vasarhelyi. 2011. Cluster analysis for anomaly detection in accounting data: An audit approach. The International Journal of Digital Accounting Research 11: 69–84.
- Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63: 411–23.
- Vadiveloo, Jeyaraj, Gao Niu, Justin Xu, Xiaoyong Shen, and Tianyi Song. 2014. Tracking and monitoring claims experience: A practical application of risk management. Risk Management 31: 12–15.
- Wang, Xin, Wei Gu, Danielle Ziebelin, and Howard Hamilton. 2010. An ontology-based framework for geospatial clustering. International Journal of Geographical Information Science 24: 1601–30.

**Figure 3.** (**a**) Gap statistics in terms of the corresponding number of clusters and (**b**) results of choosing the optimal number of clusters.

**Figure 5.** Distributions of the numerical and categorical attributes in each of the optimal clusters.

**Figure 6.** Actual to expected mortality rates based on face amounts. The values in the boxes are the observed ratios for the respective clusters; the confidence intervals were calculated based on the theory developed in Appendix A.

Categorical Variables | Description | Levels | Proportions |
---|---|---|---|
Gender | Insured’s sex | Female | 34.1% |
Gender | Insured’s sex | Male | 65.9% |
Smoker Status | Insured’s smoking status | Smoker | 4.14% |
Smoker Status | Insured’s smoking status | Nonsmoker | 95.86% |
Underwriting Type | Type of underwriting requirement | Term conversion | 4.52% |
Underwriting Type | Type of underwriting requirement | Underwritten | 95.48% |
Substandard Indicator | Indicator of substandard policies | Yes | 7.76% |
Substandard Indicator | Indicator of substandard policies | No | 92.24% |
Plan | Plan type | Term | 74.28% |
Plan | Plan type | ULS | 14.55% |
Plan | Plan type | VLS | 11.17% |

Continuous Variables | Description | Minimum | Mean | Maximum |
---|---|---|---|---|
Issue Age | Policyholder’s age at issue | 0 | 43.62 | 90 |
Face Amount | Amount of sum insured at issue | 215 | 529,636 | 100,000,000 |

Statistic | Cluster 1 | Cluster 2 | Cluster 3 |
---|---|---|---|
Number of observations | 342,518 | 147,561 | 647,778 |
Percentage | 30.10% | 12.97% | 56.93% |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yin, S.; Gan, G.; Valdez, E.A.; Vadiveloo, J.
Applications of Clustering with Mixed Type Data in Life Insurance. *Risks* **2021**, *9*, 47.
https://doi.org/10.3390/risks9030047
