Data Mining in Actuarial Science: Theory and Applications

A special issue of Risks (ISSN 2227-9091).

Deadline for manuscript submissions: closed (31 December 2021) | Viewed by 36624

Special Issue Editors


Prof. Dr. Emiliano A. Valdez
Guest Editor
Department of Mathematics, University of Connecticut, 341 Mansfield Road, Storrs, CT 06269-1009, USA
Interests: Copula models and dependencies; elliptical distributions and their applications; managing post-retirement assets; longevity risks and annuitization; risk measures and capital requirements; applications of financial economics in actuarial science; competing risks models; survival analysis

Dr. Guojun Gan
Guest Editor
Department of Mathematics, University of Connecticut, 341 Mansfield Road, Storrs, CT 06269-1009, USA
Interests: data mining; actuarial science; computational finance

Special Issue Information

Dear Colleagues,

Insurance companies continue to gather increasing volumes of information, which are used for improved data-driven decision making. In recent years, this has generated growing interest in, and need for, data mining tools and techniques to analyze data, especially big data, in insurance and actuarial science. Data mining has potential applications in all personal and commercial lines of insurance: life, non-life, health, and pensions. Although data mining is clearly useful in insurance and actuarial science, it faces many challenges during its implementation.

This Special Issue aims to collect recent developments in the application of data mining techniques in insurance and actuarial science. We welcome original research articles that develop data mining techniques and case studies that showcase applications. We also encourage data mining work derived from collaborative efforts between academia and industry.

Prof. Dr. Emiliano A. Valdez
Dr. Guojun Gan
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Risks is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Data mining
  • Machine or statistical learning
  • Actuarial learning
  • Insurance big data
  • Data visualization
  • Predictive analytics
  • Actuarial applications

Published Papers (8 papers)


Research


26 pages, 1035 KiB  
Article
Concordance Probability for Insurance Pricing Models
by Jolien Ponnet, Robin Van Oirbeek and Tim Verdonck
Risks 2021, 9(10), 178; https://doi.org/10.3390/risks9100178 - 08 Oct 2021
Cited by 2 | Viewed by 2044
Abstract
The concordance probability, also called the C-index, is a popular measure of the discriminatory ability of a predictive model. In this article, the definition of this measure is adapted to the specific needs of the frequency and severity models typically used in the technical pricing of a non-life insurance product. For the frequency model, the need for two different groups is addressed by defining three new types of the concordance probability. These adapted definitions also deal with the concept of exposure, the duration of a policy or insurance contract. Frequency data typically have a large sample size, so we present two fast and accurate estimation procedures for big data. Their good performance is illustrated on two real-life datasets. Using these examples, we also estimate the concordance probability developed for severity models.
(This article belongs to the Special Issue Data Mining in Actuarial Science: Theory and Applications)
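The C-index described in this abstract can be computed naively in a few lines. A minimal, illustrative Python sketch (ignoring the paper's exposure adjustments and fast big-data estimators; names are our own) is:

```python
# Pairwise concordance probability (C-index) for a frequency model:
# the probability that, for a randomly chosen pair with different
# observed claim counts, the policy with more claims also received
# the higher predicted frequency. Ties in predictions count as 1/2.
from itertools import combinations

def concordance_probability(y_obs, y_pred):
    """Naive O(n^2) C-index over all pairs with y_obs[i] != y_obs[j]."""
    concordant = ties = comparable = 0
    for i, j in combinations(range(len(y_obs)), 2):
        if y_obs[i] == y_obs[j]:
            continue  # only pairs with different outcomes are comparable
        comparable += 1
        # orient the pair so that `hi` has the larger observed count
        hi, lo = (i, j) if y_obs[i] > y_obs[j] else (j, i)
        if y_pred[hi] > y_pred[lo]:
            concordant += 1
        elif y_pred[hi] == y_pred[lo]:
            ties += 1
    return (concordant + 0.5 * ties) / comparable

# A perfectly ranked toy example gives a C-index of 1.0:
print(concordance_probability([0, 0, 1, 2], [0.1, 0.2, 0.5, 0.9]))
```

The paper's contribution is precisely to replace this quadratic-time computation with fast estimators suitable for large frequency datasets.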

19 pages, 1404 KiB  
Article
Synthetic Dataset Generation of Driver Telematics
by Banghee So, Jean-Philippe Boucher and Emiliano A. Valdez
Risks 2021, 9(4), 58; https://doi.org/10.3390/risks9040058 - 24 Mar 2021
Cited by 13 | Viewed by 5219
Abstract
This article describes the techniques employed to produce a synthetic dataset of driver telematics, emulated from a similar real insurance dataset. The synthetic dataset contains 100,000 policies with observations of the driver's claims experience together with associated classical risk variables and telematics-related variables. This work aims to produce a resource that can be used to advance models for assessing risks in usage-based insurance. It follows a three-stage process using machine learning algorithms. In the first stage, a synthetic portfolio of the space of feature variables is generated by applying an extended SMOTE algorithm. The second stage simulates values for the number of claims as multiple binary classifications using feedforward neural networks. The third stage simulates values for the aggregated amount of claims as a regression using feedforward neural networks, with the number of claims included in the set of feature variables. The resulting dataset is evaluated by comparing the synthetic and real datasets when Poisson and gamma regression models are fitted to the respective data. Other visualizations and data summaries produce remarkably similar statistics between the two datasets. We hope that researchers interested in obtaining telematics datasets to calibrate models or learning algorithms will find our work to be valuable.
(This article belongs to the Special Issue Data Mining in Actuarial Science: Theory and Applications)
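The SMOTE-style interpolation underlying the first stage can be sketched simply: a synthetic feature vector is drawn on the line segment between a real record and one of its nearest neighbours. A minimal pure-Python illustration (not the paper's extended algorithm; function names and the neighbour count are our own assumptions) is:

```python
# Minimal SMOTE-style sampling: pick a record, pick one of its k nearest
# neighbours, and interpolate a new record between them.
import math
import random

def nearest_neighbors(x, data, k):
    """Indices of the k points in `data` closest to x (Euclidean),
    skipping x itself (its self-distance of 0 sorts first)."""
    dists = [(math.dist(x, z), i) for i, z in enumerate(data)]
    return [i for _, i in sorted(dists)[1:k + 1]]

def smote_sample(data, k=2, rng=random.Random(0)):
    """One synthetic record interpolated between a point and a neighbour."""
    x = rng.choice(data)
    nn = data[rng.choice(nearest_neighbors(x, data, k))]
    t = rng.random()  # interpolation weight in [0, 1]
    return [xi + t * (ni - xi) for xi, ni in zip(x, nn)]
```

Because the synthetic point lies between two real points, it always stays inside the convex hull of the original feature space; the paper's extension adapts this idea to the mixed telematics feature set.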

33 pages, 962 KiB  
Article
Alleviating Class Imbalance in Actuarial Applications Using Generative Adversarial Networks
by Kwanda Sydwell Ngwenduna and Rendani Mbuvha
Risks 2021, 9(3), 49; https://doi.org/10.3390/risks9030049 - 08 Mar 2021
Cited by 10 | Viewed by 3503
Abstract
To build adequate predictive models, a substantial amount of data is desirable. However, when expanding to new or unexplored territories, this required level of information is rarely available. To build such models, actuaries often have to procure data from local providers, use limited and unsuitable industry and public research, or rely on extrapolations from other, better-known markets. Another common pathology when applying machine learning techniques in actuarial domains is the prevalence of imbalanced classes, where risk events of interest, such as mortality and fraud, are under-represented in the data. In this work, we show how an implicit model using a Generative Adversarial Network (GAN) can alleviate these problems through the generation of adequate-quality data from very limited or highly imbalanced samples. We provide an introduction to GANs and how they are used to synthesize data that enhance the resolution of very infrequent events and improve model robustness. Overall, we show a significant superiority of GANs for boosting predictive models when compared to competing approaches on benchmark datasets. This work offers numerous contributions to actuaries, with applications to, inter alia, new sample creation, data augmentation, boosting predictive models, anomaly detection, and missing data imputation.
(This article belongs to the Special Issue Data Mining in Actuarial Science: Theory and Applications)
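For readers unfamiliar with GANs, the adversarial training referred to above is, in its classic formulation (this is the standard objective, not a detail specific to this paper), a minimax game between a generator $G$ and a discriminator $D$:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator learns to tell real records from synthetic ones, while the generator learns to produce samples the discriminator cannot distinguish, which is what allows realistic synthetic minority-class records to be drawn from the trained generator.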

19 pages, 550 KiB  
Article
Applications of Clustering with Mixed Type Data in Life Insurance
by Shuang Yin, Guojun Gan, Emiliano A. Valdez and Jeyaraj Vadiveloo
Risks 2021, 9(3), 47; https://doi.org/10.3390/risks9030047 - 03 Mar 2021
Cited by 5 | Viewed by 3519
Abstract
Death benefits are generally the largest cash-flow items affecting the financial statements of life insurers, yet some insurers still lack a systematic process to track and monitor death claims. In this article, we explore data clustering to examine and understand how actual death claims differ from what is expected, an early stage of developing a monitoring system crucial for risk management. We extend the k-prototypes clustering algorithm to draw inferences from a life insurance dataset using only the insured's characteristics and policy information, without regard to known mortality. This clustering efficiently handles categorical, numerical, and spatial attributes. Using gap statistics, the optimal clusters obtained from the algorithm are then used to compare the actual to expected death claims experience of the life insurance portfolio. Our empirical data contain observations on approximately 1.14 million policies with a total insured amount of over 650 billion dollars. For this portfolio, the algorithm produced three natural clusters, each with lower actual to expected death claims but with differing variability. The analytical results provide management with a process to identify policyholder attributes that dominate significant mortality deviations, thereby enhancing decision making.
(This article belongs to the Special Issue Data Mining in Actuarial Science: Theory and Applications)
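The core of k-prototypes clustering is a mixed-type dissimilarity: squared Euclidean distance on numeric attributes plus a weight times the number of mismatched categorical attributes. A minimal sketch (the function name, weight, and example record are illustrative, not taken from the paper) is:

```python
# k-prototypes mixed-type dissimilarity between two records, each split
# into numeric attributes and categorical attributes. `gamma` balances
# the categorical mismatch count against the numeric distance.
def kprototypes_distance(a_num, a_cat, b_num, b_cat, gamma=1.0):
    numeric = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    categorical = sum(x != y for x, y in zip(a_cat, b_cat))
    return numeric + gamma * categorical

# Two hypothetical policies: (age, face amount in $k) plus (sex, smoker).
# The age gap contributes 25 and the sex mismatch contributes 2.0:
d = kprototypes_distance([35, 250], ["F", "N"], [40, 250], ["M", "N"], gamma=2.0)
print(d)  # 27.0
```

The clustering then alternates between assigning records to the nearest prototype under this distance and updating each prototype's numeric means and categorical modes.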

14 pages, 953 KiB  
Article
Mining Actuarial Risk Predictors in Accident Descriptions Using Recurrent Neural Networks
by Jean-Thomas Baillargeon, Luc Lamontagne and Etienne Marceau
Risks 2021, 9(1), 7; https://doi.org/10.3390/risks9010007 - 29 Dec 2020
Cited by 7 | Viewed by 2482
Abstract
One crucial task of actuaries is to structure data so that observed events are explained by their inherent risk factors. They are proficient at generalizing important elements to obtain useful forecasts. Although this expertise is beneficial when paired with conventional statistical models, it becomes limited when faced with massive unstructured datasets. Moreover, it does not take advantage of the representation capabilities of recent machine learning algorithms. In this paper, we present an approach that departs from the traditional actuarial approach by automatically extracting textual features from a large corpus. We design a neural architecture that can be trained to predict a phenomenon using words represented as dense embeddings. We then extract the features identified as important by the model to assess the relationship between the words and the phenomenon. The technique is illustrated through a case study that estimates the number of cars involved in an accident using the accident's description as input to a Poisson regression model. We show that our technique yields models that are more accurate and more interpretable than usual actuarial data mining baselines.
(This article belongs to the Special Issue Data Mining in Actuarial Science: Theory and Applications)
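The downstream step of the case study, text-derived features feeding a Poisson regression, can be sketched with the usual log link: the predicted count is the exponential of a linear score. A minimal illustration (the weights and features here are hypothetical, not fitted values from the paper) is:

```python
# Expected count under a log-link Poisson regression: rate = exp(b0 + b.x).
import math

def poisson_rate(features, beta, intercept):
    """Predicted event count for one observation under a log link."""
    return math.exp(intercept + sum(b * f for b, f in zip(beta, features)))

# e.g. two hypothetical text-derived features for an accident description:
# a "collision" keyword indicator and the description's token count.
rate = poisson_rate([1.0, 12.0], beta=[0.40, 0.02], intercept=0.0)
```

In the paper, the features are not hand-crafted indicators like these but are extracted automatically from dense word embeddings by the trained neural architecture.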

23 pages, 961 KiB  
Article
Application of a Vine Copula for Multi-Line Insurance Reserving
by Himchan Jeong and Dipak Dey
Risks 2020, 8(4), 111; https://doi.org/10.3390/risks8040111 - 21 Oct 2020
Cited by 4 | Viewed by 2614
Abstract
This article introduces a novel use of the vine copula to capture dependence among multi-line claim triangles, especially when an insurance portfolio consists of more than two lines of business. First, we suggest a way to choose an optimal joint loss development model for multiple lines of business that considers the marginal distributions, the vine copula structure, and the choice of family for each pair copula. The performance of the model is demonstrated with Bayesian model diagnostics and out-of-sample validation measures. Finally, we discuss an implication of the dependence modeling: it allows a company to analyze and establish the risk capital for the whole portfolio.
(This article belongs to the Special Issue Data Mining in Actuarial Science: Theory and Applications)
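To see why a vine helps beyond two lines of business, recall the standard pair-copula construction: a three-dimensional density factors into marginals and bivariate copulas, for example along a D-vine (this is the textbook decomposition, not the paper's specific fitted model):

```latex
f(x_1, x_2, x_3) = f_1(x_1)\, f_2(x_2)\, f_3(x_3)\,
  c_{12}\bigl(F_1(x_1), F_2(x_2)\bigr)\,
  c_{23}\bigl(F_2(x_2), F_3(x_3)\bigr)\,
  c_{13|2}\bigl(F_{1|2}(x_1 \mid x_2), F_{3|2}(x_3 \mid x_2)\bigr)
```

Each pair copula $c_{12}$, $c_{23}$, $c_{13|2}$ can come from a different family, which is precisely the flexibility the model-selection procedure in the paper exploits.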

12 pages, 827 KiB  
Article
Address Identification Using Telematics: An Algorithm to Identify Dwell Locations
by Christopher Grumiau, Mina Mostoufi, Solon Pavlioglou and Tim Verdonck
Risks 2020, 8(3), 92; https://doi.org/10.3390/risks8030092 - 01 Sep 2020
Cited by 1 | Viewed by 2953
Abstract
In this work, a method is proposed for exploiting the predictive power of a geo-tagged dataset to identify user-relevant points of interest (POI). The proposed methodology is then applied in an insurance context to automatically identify a driver's residence address, based solely on their pattern of movements on the map. The analysis is performed on a real-life telematics dataset, which we anonymized for this study to respect privacy regulations. Model performance is evaluated on an independent batch of the dataset for which the address is known to be correct. The model predicts the residence postal code of the user with a high level of accuracy, achieving an F1 score of 0.83. A reliable result from the proposed method could generate benefits beyond fraud detection, such as general data quality inspections, one-click quotations, and better-targeted marketing.
(This article belongs to the Special Issue Data Mining in Actuarial Science: Theory and Applications)
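One plausible dwell-location heuristic in the spirit of this abstract is to snap night-time GPS pings to a coarse grid and take the modal cell as the candidate home location. The sketch below is our own illustration, not the paper's algorithm; the grid size and night-hour window are assumptions:

```python
# Candidate home location: the most frequent grid cell among night-time
# GPS pings. `cell` is the grid resolution in degrees; `night` is the
# assumed at-home hour window.
from collections import Counter

def candidate_home(pings, cell=0.01, night=range(0, 6)):
    """pings: iterable of (hour, lat, lon). Returns the modal night cell."""
    cells = Counter(
        (round(lat / cell) * cell, round(lon / cell) * cell)
        for hour, lat, lon in pings
        if hour in night
    )
    return cells.most_common(1)[0][0] if cells else None

# Two 2-3 a.m. pings near the same block dominate a single daytime ping:
home = candidate_home([(2, 51.001, 4.001), (3, 51.002, 4.001), (14, 50.0, 3.0)])
```

The predicted cell can then be mapped to a postal code and validated against records with a known address, as done in the paper's evaluation.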

Review


26 pages, 657 KiB  
Review
Machine Learning in P&C Insurance: A Review for Pricing and Reserving
by Christopher Blier-Wong, Hélène Cossette, Luc Lamontagne and Etienne Marceau
Risks 2021, 9(1), 4; https://doi.org/10.3390/risks9010004 - 23 Dec 2020
Cited by 28 | Viewed by 12439
Abstract
In the past 25 years, computer scientists and statisticians have developed machine learning algorithms capable of modeling highly nonlinear transformations and interactions of input features. While actuaries use GLMs frequently in practice, only in the past few years have they begun studying these newer algorithms to tackle insurance-related tasks. In this work, we review the applications of machine learning in actuarial science and present the current state of the art in ratemaking and reserving. We first give an overview of neural networks, then briefly outline applications of machine learning algorithms to actuarial science tasks. Finally, we summarize future trends of machine learning for the insurance industry.
(This article belongs to the Special Issue Data Mining in Actuarial Science: Theory and Applications)
