Article

Hybrid Random Forest Survival Model to Predict Customer Membership Dropout

by Pedro Sobreiro 1,2,*,†, José Garcia-Alonso 2, Domingos Martinho 3 and Javier Berrocal 2,†
1 Sport Sciences School of Rio Maior (ESDRM), Polytechnic Institute of Santarém, 2001-904 Santarém, Portugal
2 Quercus Software Engineering Group, University of Extremadura, 06006 Badajoz, Spain
3 ISLA Santarém, 2000-241 Santarém, Portugal
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2022, 11(20), 3328; https://doi.org/10.3390/electronics11203328
Submission received: 18 August 2022 / Revised: 7 October 2022 / Accepted: 9 October 2022 / Published: 15 October 2022
(This article belongs to the Topic Data Science and Knowledge Discovery)

Abstract

Dropout prediction is a problem that many organizations must address, as retaining customers is generally more profitable than attracting new ones. Existing approaches address the problem with a dependent variable representing dropout or non-dropout, without considering the dynamic perspective that the dropout risk changes over time. To address this, we explore the use of random survival forests combined with clusters, in order to evaluate whether the prediction performance improves. The model performance was determined using the concordance probability, the Brier Score, and the prediction error, considering 5200 customers of a health club. Our results show that the prediction performance of the survival models increased substantially when using clusters rather than no clusters, with a statistically significant difference between the models. The model using a hybrid approach improved the accuracy of the survival model, providing support to develop countermeasures considering the period in which dropout is likely to occur.

1. Introduction

Customer retention is a problem that many organizations have to deal with, and in this context dropout prediction provides insights to identify customers who could churn. Dropout represents the decision of a customer to end their relationship with an organization [1], which creates two outcomes: dropout or non-dropout. Dropout arises in two main scenarios [2,3]: (1) contractual settings, where customers pay a monthly fee and explicitly announce the end of the relationship; and (2) non-contractual settings, where the organization has to extrapolate whether the customer is still active or not. In the contractual setting, the customer must choose whether or not to drop out; for example, whether or not to renew a contract [4]. This means that, in contractual settings, customer dropout represents an explicit ending of a relationship, which is more penalizing than in non-contractual settings [5] and has implications for the profitability of organizations, increasing marketing costs and reducing sales [6].
The advantages of developing retention strategies are supported by the fact that the costs of customer retention are lower than those associated with customer acquisition [7,8], and a reduction of dropout by 5% can lead to an almost doubling of profits [9]. To address this problem, organizations can exploit their customer databases, which are considered the most valuable asset most organizations possess [10]. The development of a customer retention strategy can be supported by the identification of customers who may drop out [11]; for example, using churn prediction models to detect customers with a high propensity to drop out [12].
Anticipating dropout allows for the development of countermeasures to reduce customer churn. Several studies have addressed the problem of customer retention in an attempt to improve profitability [13,14,15]; in particular, organizations have been addressing this problem by shifting their target from capturing new customers to preserving existing ones [14], considering that investments in retention strategies are more profitable than acquiring new customers [13].
Machine learning allows for the extraction of patterns from data, learning a model from a set of descriptive features and a target feature based on a set of historical examples [16]. The approaches normally employed address the problem through a dependent variable representing dropout or non-dropout, without considering the dynamic perspective that the dropout risk changes over time [11]. A static perspective determines the dropout risk at a specific moment in time, but does not consider changes over time. Survival models have been proposed in an attempt to resolve this limitation [17], capturing the temporal dimension of customer dropout to predict when the dropout will occur, as well as making use of censored data, which allows for consideration of existing information about customers that have not churned yet [18].
There are several challenges related to the timing of the dropout, such as considering the behavior of the customer as static and not considering the dynamic behavior of the customer, in terms of the intent to drop out [11]. The importance of understanding when dropout will occur and the risk related to the temporal perspective of the problem is an element that should be addressed. However, few studies have considered this aspect [18,19,20,21]. Van den Poel and Larivière [20] have used a Cox proportional hazards model to investigate customer attrition in a European financial services company. Baesens et al. [21] have explored the use of a neural network-based survival model to anticipate when customers will default or pay off loans early. Burez and Vandenpoel [19] have proposed two different processes to predict customer dropout, instead of using only one, covering commercial and financial churn, and suggested that financial churn (the customer stops paying the invoices) is easier to predict than commercial churn (the customer does not renew their subscription). Perianez et al. [18] have investigated the use of survival trees and forests to improve prediction accuracy compared to traditional methods, such as Cox regression, in mobile social game subscriptions.
Survival analysis, which originates from biomedical statistics, is especially well-suited to studying the timing of events in longitudinal data [22]. It consists of a class of statistical methods modelling the occurrence and timing of an event, such as customer dropout, and allows us to examine not only whether an event occurred, but also how long it took to occur. The primary value of survival analysis in our context, however, is comparing the dropout probability for individuals classified through theoretically relevant variables. Survival methods have enjoyed increasing popularity in several disciplines, ranging from medicine to economics [22].
Random Survival Forests do not make the proportional hazards assumption [23], and have the flexibility to model survival curves of dissimilar shapes for contrasting groups of subjects. The Random Survival Forest is an extension of the Random Forest, allowing for efficient non-parametric analysis of time-to-event data [24]. These characteristics allow us to surpass the Cox regression limitation of the proportional hazards assumption, which requires us to exclude variables that do not fulfill the model assumptions. It has also been shown, by Breiman [24], that ensemble learning can be further improved by injecting randomization into the base learning process (i.e., the use of Random Forests).
Previous researchers have also proposed the integration of several algorithms to improve the performance in the prediction of dropout, such as the use of clustering methods combined with churn prediction [25,26,27], where the customers are grouped into clusters to improve the prediction performance within each cluster. Clustering methods use unsupervised algorithms to group elements with similar characteristics. Such unsupervised methods have been widely used, employing approaches such as Hierarchical Clustering [28], k-Means [27], or Random Forest Clustering [24]. Following this concept, Vijaya and Sivasankar [27] have suggested the adoption of hybrid models combining more than one classifier, in order to increase the performance compared to the use of single classifiers. Jafari-Marandi et al. [29] have also explored an approach combining clustering methods in parallel with a classification approach.
Although several studies have addressed whether a given customer will drop out or not, there is a lack of research regarding the prediction of when the customer will drop out. To address this gap, survival analysis can be utilized, which allows for prediction of when the customer will drop out, providing an opportunity to explore the duration of the customer's relationship with the organization and its influence on the churn prediction. Additionally, we assess whether survival analysis combined with clustering could improve the prediction performance.
In this study, we investigate whether a hybrid approach using clustering and random survival forests, which, to the best of our knowledge, has not previously been applied to membership dropout data, can improve the prediction performance compared to a random survival model without the use of clusters. The performance is evaluated by comparing the average discrepancies in the customer status (dropout/non-dropout) between both approaches.
The remainder of this paper is organized as follows: In the next sections, we detail survival analysis and survival trees. Next, we describe our research methodology. Section 7 presents the results, addressing our research goal, providing evaluation metrics, and comparing the performance with and without clusters. Finally, the discussion and conclusion are provided.

2. Why Dropout Prediction?

The problem related to contractual setting scenarios is that customer dropout is more damaging than in non-contractual settings (e.g., buying a product), as it represents a well-defined termination of a relationship with an organization [5]. Organizations are more profitable when they retain customers, due to reduced marketing costs and increased sales, minimizing the problems related to reduced sales, competitors gaining new customers, and loss of market share [6].
The cost of acquiring new customers is five to six times greater than that of maintaining existing ones [30], which has motivated organizations to move from capturing new customers to retaining existing ones [14].
The overall idea is that investment in retention strategies has higher returns [13]; however, such strategies should be targeted at customers with a higher likelihood of churn, and non-profitable retention actions must be avoided [31].
The analysis of existing customer data supports the extraction of patterns related to customer dropout, which an organization can then use to retain customers. Using this information, companies can anticipate churners and develop countermeasures to avoid desertions, retaining as many customers as possible through measures such as giving concessions [27]. By using historical data, organizations can train models for the classification of future dropout or non-dropout. This allows relevant actors to identify ways to incite customers to stay, by estimating the probability of dropout in a given period of time [15].

3. Survival Analysis

Survival analysis focuses on the analysis of the time remaining until an event of interest occurs, and explores its relationships with different factors. Its main advantage is related to the concept of censoring: observations for which the event of interest has not been completely observed (e.g., customers that have not dropped out yet) are still incorporated into the analysis. This means that there are customers that are still active, for whom we do not know whether the dropout event will occur; such observations are said to be censored. Survival models take censoring into account and incorporate this uncertainty; instead of predicting the time of the event, as in regression models, survival models allow for prediction of the probability of the event happening at a particular time.
The time of dropout is represented by T, a non-negative random variable indicating the time period in which the event occurs for a randomly selected individual from the population; the probability of the event occurring in each time period, given that it has not occurred in a previous time period, is known as the discrete-time hazard function [22]. The survival function represents the probability of an individual surviving beyond time t, S(t) = P(T > t), t ≥ 0, with the properties S(0) = 1 and S(∞) = 0. The distribution function is represented by F, defined as F(t) = P(T ≤ t), for t ≥ 0. The probability density function f is given by
$$f(t) = \lim_{dt \to 0} \frac{P[t \le T < t + dt]}{dt},$$
where f(t)dt represents the probability of an event occurring at time t. We represent the evolution of the dropout probability over time using the hazard function:
$$\lambda(t) = \lim_{dt \to 0} \frac{P[t \le T < t + dt \mid T \ge t]}{dt}.$$
The determination of the survival curves is based on the following elements: (1) the total number of observations removed during the time period (e.g., days, months, or years), either by dropout or by censoring; (2) the observations composing the sample of the study; and (3) the customers who have not yet dropped out at any given time. The survival probability up to time period i, $p_i$, is calculated as
$$p_i = \frac{r_i - d_i}{r_i},$$
where $r_i$ is the number of individuals at risk at the beginning of the period and $d_i$ the number of individuals who dropped out during the period. The estimated survival time is also computed considering the month in which the customer currently is. Cox regression allows us to test differences between survival times. A further advantage of survival analysis is that it allows us to detect whether the risk of an event differs systematically across different people, using specific predictors. The coefficients in the Cox regression are related to the hazard, where a positive value represents a worse prognosis and, in contrast, a negative value denotes a better prognosis. Survival analysis also allows us to include the information of covariates observed up to the censoring event.
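As a minimal illustration of the survival probability calculation above, the following Python sketch (our own, with illustrative variable names; it is not the study's code) computes the per-period probabilities $p_i$ and the cumulative survival curve from membership durations and dropout indicators:

```python
import numpy as np

def survival_curve(durations, dropouts):
    """Life-table style estimate: p_i = (r_i - d_i) / r_i for each period i,
    with the cumulative product giving the survival probability S(i).
    Censored customers simply leave the risk set after their last period."""
    durations = np.asarray(durations)
    dropouts = np.asarray(dropouts)
    periods = np.arange(int(durations.max()) + 1)
    survival, s = [], 1.0
    for i in periods:
        r_i = np.sum(durations >= i)                      # at risk at period i
        d_i = np.sum((durations == i) & (dropouts == 1))  # dropouts in period i
        if r_i > 0:
            s *= (r_i - d_i) / r_i
        survival.append(s)
    return periods, np.array(survival)

# Toy usage: months of membership and dropout flags (1 = dropout, 0 = censored)
months, surv = survival_curve([3, 12, 7, 24, 5, 18], [1, 0, 1, 0, 1, 1])
```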
The Cox PH model assumes the covariates to be time-independent; for example, gender and age at registration do not change over time once recorded [32]. As the Cox model requires the hazards in both groups to be proportional, researchers are often asked to “test” whether the hazards are proportional [33]. Considering this, we explored another approach that allowed us to develop the analysis without the proportional hazards assumption; namely, survival trees.
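Since the results below report proportional hazards violations for several covariates, a hedged sketch of how such a check can be run with the lifelines package used in the study is shown here; the toy data frame and column names are illustrative, not the study's data.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Illustrative toy frame; in the study, the frame would hold the customer
# covariates plus the duration ('months') and event ('dropout') columns.
df = pd.DataFrame({
    "months":  [3, 12, 7, 24, 5, 18, 9, 2],
    "dropout": [1, 0, 1, 0, 1, 1, 0, 1],
    "age":     [22, 35, 41, 28, 19, 52, 30, 26],
    "maccess": [0.5, 2.1, 1.0, 3.2, 0.2, 1.8, 0.9, 0.4],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="dropout")
# Schoenfeld-residual-based test of the proportional hazards assumption;
# covariates failing the test would have to be excluded from a Cox model.
cph.check_assumptions(df, p_value_threshold=0.01)
```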

4. Survival Trees

Survival trees are methods based on Random Forest models [24]. A Random Survival Forest is an ensemble method for analysis of right-censored data [34], using randomization to improve the performance. Random survival forests follow this framework [34]:  
  1. Draw B random samples of the same size from the original data set with replacement; the samples that are not drawn are said to be out-of-bag (OOB). Grow a survival tree on each of the b = 1, …, B samples.
  2. At each node, select a random subset of predictor variables and find the best predictor and splitting value that provide two subsets (the daughter nodes) which maximize the difference in the objective function.
  3. Repeat step 2 recursively on each daughter node until a stopping criterion is met.
  4. Calculate a cumulative hazard function (CHF) for each tree and average over all CHFs for the B trees to obtain the ensemble CHF.
  5. Compute the prediction error for the ensemble CHF using only the OOB data.
In each node, a predictor x is selected from the randomly chosen subset of predictor variables, together with a split value c (one unique value of x). Each sample i is assigned to the right daughter node if $x_i \le c$, or to the left node otherwise. Then, the log-rank statistic is calculated as follows:
$$L(x, c) = \frac{\sum_{i=1}^{N} \left( d_{i,1} - Y_{i,1} \dfrac{d_i}{Y_i} \right)}{\sqrt{\sum_{i=1}^{N} \dfrac{Y_{i,1}}{Y_i} \left( 1 - \dfrac{Y_{i,1}}{Y_i} \right) \left( \dfrac{Y_i - d_i}{Y_i - 1} \right) d_i}},$$
where
  • j: Daughter node, $j \in \{1, 2\}$;
  • $d_{i,j}$: Number of events at time $t_i$ in daughter node j;
  • $Y_{i,j}$: Number of elements that had the event or are at risk at time $t_i$ in daughter node j;
  • $d_i$: Number of events at time $t_i$, such that $d_i = \sum_j d_{i,j}$;
  • $Y_i$: Number of elements that experienced an event or are at risk at $t_i$, such that $Y_i = \sum_j Y_{i,j}$.
We loop over every x and c until we find the pair $x^*, c^*$ satisfying $|L(x^*, c^*)| \ge |L(x, c)|$ for every x and c. The model performance is assessed using the concordance probability (C-index) and the Brier Score (BS) [35]. The feature importance is determined by calculating the difference in prediction error between the original and the noised (permuted) data [24].
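A minimal numpy sketch of the log-rank splitting statistic defined above is given here (our own illustration; daughter node 1 is taken as $x \le c$, and the function name is ours):

```python
import numpy as np

def log_rank_split(times, events, x, c):
    """Log-rank statistic L(x, c) for splitting on predictor x at value c,
    following the formula above (daughter node 1: samples with x <= c)."""
    times, events, x = map(np.asarray, (times, events, x))
    in_left = x <= c
    num, den = 0.0, 0.0
    for t in np.unique(times[events == 1]):      # distinct event times t_i
        Y_i = np.sum(times >= t)                 # at risk overall
        Y_i1 = np.sum(times[in_left] >= t)       # at risk in daughter node 1
        d_i = np.sum((times == t) & (events == 1))
        d_i1 = np.sum((times == t) & (events == 1) & in_left)
        if Y_i < 2:
            continue
        num += d_i1 - Y_i1 * d_i / Y_i
        den += (Y_i1 / Y_i) * (1 - Y_i1 / Y_i) * ((Y_i - d_i) / (Y_i - 1)) * d_i
    return num / np.sqrt(den) if den > 0 else 0.0
```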
The BS is used to evaluate the predicted accuracy of the survival function at a given time t. It represents the average square distance between the survival status and the predicted survival probability, where a value of 0 is the best possible outcome.
$$BS(t) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{\left(0 - \hat{S}(t, x_i)\right)^2 \cdot \mathbf{1}_{T_i \le t,\, \delta_i = 1}}{\hat{G}(T_i)} + \frac{\left(1 - \hat{S}(t, x_i)\right)^2 \cdot \mathbf{1}_{T_i > t}}{\hat{G}(t)} \right],$$
where $\hat{S}(t, x_i)$ is the predicted survival probability and $\hat{G}$ is the Kaplan–Meier estimate of the censoring distribution. The model should have a Brier score below 0.25, considering that, if $\hat{S}(t, x_i) = 0.5$ for all $i \in [1, N]$, then $BS(t) = 0.25$.
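The following numpy sketch (ours, with $\hat{G}$ estimated by a reverse Kaplan–Meier on the censoring times) illustrates the BS(t) computation defined above; it is a simplified illustration rather than the exact PySurvival implementation.

```python
import numpy as np

def censoring_survival(times, events):
    """Reverse Kaplan-Meier estimate G(t) of the censoring distribution
    (the 'events' of interest here are the censorings, i.e., 1 - delta)."""
    uniq = np.unique(times)
    at_risk = np.array([np.sum(times >= u) for u in uniq])
    censored = np.array([np.sum((times == u) & (events == 0)) for u in uniq])
    surv = np.cumprod(1.0 - censored / at_risk)

    def G(t):
        idx = np.searchsorted(uniq, t, side="right") - 1
        return surv[idx] if idx >= 0 else 1.0

    return np.vectorize(G, otypes=[float])

def brier_score(t, times, events, s_hat_t):
    """BS(t) as defined above; s_hat_t[i] is the predicted S(t, x_i)."""
    times, events, s_hat_t = (np.asarray(a, dtype=float)
                              for a in (times, events, s_hat_t))
    G = censoring_survival(times, events)
    g_t = float(G(t))
    died = (times <= t) & (events == 1)
    alive = times > t
    term_died = np.where(died, s_hat_t ** 2 / np.clip(G(times), 1e-12, None), 0.0)
    term_alive = np.where(alive, (1.0 - s_hat_t) ** 2 / max(g_t, 1e-12), 0.0)
    return float(np.mean(term_died + term_alive))
```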

5. Methodology

To achieve our research goals and to simplify the analysis, the survival probabilities are first presented as a survival curve, to provide an overall perspective of dropout over time; the representation of the survival probabilities indicates the times at which the events are observed [36]. Then, the machine learning survival model was created, following the approach of Ishwaran et al. [34] and using PySurvival [37]. The model performance was determined on the testing set according to the Brier Score (BS) and Mean Absolute Error (MAE), considering that, due to the censoring of the data, standard evaluation metrics such as the root mean square error are not suitable [35]. The predicted and actual customer dropout are presented as a time series, together with the performance indicators Root Mean Square Error, Median Absolute Error, and Mean Absolute Error [37]. The training and testing sets were created using the scikit-learn package with the holdout method (75/25) [38]. The feature importance was determined by calculating the differences between the true class labels and the noised data [24].
The hybrid model was developed through the identification of an optimal number of clusters. The calculation of the optimal number of clusters was based on the Bayesian Information Criterion (BIC), where the model with the lowest score is typically selected as the best model [39]; however, Scrucca et al. [40] define the BIC such that the highest score indicates the best model, which is the convention we followed. In addition, we used visualization to increase the interpretability of the number of clusters, which was provided using the elbow method. Next, k-Means was used to partition the observations into the identified number of clusters.
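The study performs this step with the R package mclust; as a hedged Python analogue (not the authors' code), the sketch below selects the number of Gaussian mixture components by BIC and computes the k-Means elbow curve. Note that scikit-learn's bic() is defined so that lower values are better, whereas mclust reports the BIC with the opposite sign.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # placeholder for the customer features

# BIC over candidate numbers of mixture components (sklearn: lower is better)
bic = {k: GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X).bic(X)
       for k in range(2, 11)}
best_k = min(bic, key=bic.get)

# Elbow curve data: within-cluster sum of squares (inertia) as k grows
inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(2, 11)}

# Partition the observations into the selected number of clusters
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```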

5.1. Model Performance Evaluation

The BS measures the average discrepancy between the status (dropout/non-dropout) and the estimated probabilities at a given time. The Integrated Brier Score (IBS) was used to calculate the performance over all available times (i.e., from $t_1$ to $t_{max}$) as:
$$IBS = \int_{t_1}^{t_{max}} BS^c(t)\, dw(t).$$
This represents the average square distance between the survival status and the predicted survival probability, where a value of 0 is the best possible outcome.
The Mean Absolute Error (MAE) is a measure of the error between observed and predicted values, where $x_i$ and $y_i$ are the true and the predicted values, respectively:
$$MAE = \frac{1}{D} \sum_{i=1}^{D} |x_i - y_i|.$$
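A short numpy sketch of both metrics, under the assumption of uniform weights $w(t)$ for the IBS integral (our illustration, not the PySurvival implementation):

```python
import numpy as np

def integrated_brier_score(times, bs_values):
    """IBS: average of BS(t) over [t_1, t_max], approximated by trapezoidal
    integration of the Brier scores evaluated at the given times."""
    times, bs_values = np.asarray(times, float), np.asarray(bs_values, float)
    return float(np.trapz(bs_values, times) / (times[-1] - times[0]))

def mean_absolute_error(actual, predicted):
    """MAE between observed and predicted values (e.g., dropouts per month)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted)))
```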
The IBS and MAE were also calculated for each cluster, in order to compare the performance against the model without clusters. Additionally, validation tests were performed to compare the accuracy of the hybrid approach against that of the random survival model without clusters. To that end, a Mann–Whitney test was conducted to estimate whether the difference in prediction ability was significant, at a 95% confidence level.
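The comparison can be run with SciPy as follows (a sketch; the Brier score series are illustrative placeholders, not the reported values):

```python
from scipy.stats import mannwhitneyu

# Per-time-point Brier scores of the model without clusters vs. one cluster
bs_without_clusters = [0.36, 0.34, 0.35, 0.33, 0.31]
bs_cluster_0 = [0.02, 0.03, 0.02, 0.01, 0.02]

u_stat, p_value = mannwhitneyu(bs_without_clusters, bs_cluster_0,
                               alternative="two-sided")
significant = p_value < 0.05   # 95% confidence level
```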

5.2. Model Operationalization

The survival analysis was conducted using the Lifelines package [41] (See Appendix A for software versioning). Dropout was considered a binary value, where one represents churn and zero represents no churn. Dropout happens when a member does not make a payment.
The random survival forest was developed using the PySurvival package [37], an open-source Python package for survival analysis modeling. The model was built with 75% of the data for training and 25% for testing.
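A hedged sketch of this step, assuming the PySurvival API documented at https://www.pysurvival.io/ and using synthetic placeholders in place of the health club data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from pysurvival.models.survival_forest import RandomSurvivalForestModel
from pysurvival.utils.metrics import concordance_index

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                     # customer features
T = rng.integers(1, 40, size=500).astype(float)   # months of membership
E = rng.integers(0, 2, size=500)                  # 1 = dropout, 0 = censored

# 75/25 holdout split
X_train, X_test, T_train, T_test, E_train, E_test = train_test_split(
    X, T, E, test_size=0.25, random_state=0)

rsf = RandomSurvivalForestModel(num_trees=200)
rsf.fit(X_train, T_train, E_train)
c_index = concordance_index(rsf, X_test, T_test, E_test)
```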
Using the mclust package [40], the number of clusters was calculated by varying the number of components and identifying the structure of the covariance matrix, modelling each component constituting the data set with a multivariate normal distribution [42].
The hybrid approach was developed as follows:
  • Identify the optimal number of clusters using the mclust package of Scrucca et al. [40].
  • Fit the model using the identified number of clusters.
  • For each element, estimate the cluster membership.
  • For each cluster, follow the framework proposed by Ishwaran et al. [34] to calculate the random survival model.
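A minimal sketch of these steps (our own, assuming the PySurvival API noted above; here k-Means assigns the cluster memberships once the number of clusters has been selected):

```python
import numpy as np
from sklearn.cluster import KMeans
from pysurvival.models.survival_forest import RandomSurvivalForestModel

def fit_hybrid(X, T, E, n_clusters):
    """Fit one random survival forest per cluster and return both the
    cluster labels and the per-cluster models."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    models = {}
    for c in range(n_clusters):
        mask = labels == c
        rsf = RandomSurvivalForestModel(num_trees=200)
        rsf.fit(X[mask], T[mask], E[mask])   # survival forest for this cluster
        models[c] = rsf
    return labels, models
```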

6. Data Set

In this study, the data of 5209 Portuguese health club customers were analyzed (mean age = 27.88, SD = 11.80 years). The data were collected using the e@sport software (Cedis, Portugal) between 2014 and 2017. The information retrieved included: age of the participants (in years), sex (0, female; 1, male), non-attendance days before dropout, total amount billed, average number of visits per week, total number of visits, weekly contracted accesses, number of registration renewals, number of customer referrals, registration month, customer enrollment duration, and status (dropout/non-dropout). A dropout event is considered to occur when a customer communicates their intention to terminate the contract, or has not paid the monthly fee for 60 days.
Table 1 shows the summary statistics of the data. The average age was 27.9 ± 11.8 years, the number of entries was 29 ± 41.2, and the enrollment period was 9 ± 8.2 months. Figure 1 shows the distribution of the dropout, considering the number of months of membership (where 0 denotes non-dropout and 1 denotes dropout).

7. Results

Table 2 provides the data regarding the survival time of the customers during the first 24 months. The results indicate that the customers have a survival probability of 24.44% at 12 months (column Prob), with a median survival time of 10 months (column Estimated_Survival). The survival probability at 6 months was 54.5%, representing a dropout risk of 45.5%, with an estimated survival of 6 months.
Figure 2 shows the overall Kaplan–Meier survival curve, considering the number of months of membership (x axis) against the survival probability (y axis). The customer dropout is very high in the first 12 months, with the survival probability falling from 54% after the first 6 months to 24.44% after 12 months.
The survival considering other cohorts is represented in Figure 3; in particular, survival by gender and by contracted access frequency. Survival by gender was very similar; however, survival by contracted access frequency indicated that customers with a contracted access frequency of 6 and 4 times a week have higher survival probabilities, compared to the lower survival of customers with contracted access frequencies of 7 and 2 times a week. Survival curves can be used to explore tendencies related to survival, in order to extract actionable knowledge, giving a perspective on the probability of surviving within a given period of time.
The proportional hazard assumptions failed in the following variables: Age (p < 0.01), cfreq (p < 0.01), dayswfreq (p < 0.01), tbilled (p < 0.01), freeuse (p < 0.01), and nentries (p < 0.01). Therefore, it was not possible to calculate the effect of the cohorts, in terms of the survival time, using the Cox regression.

7.1. Survival Trees

To evaluate the performance of the random survival forest in predicting the survival time, considering the effect of the cohorts, we calculated the concordance probability (C-index), IBS, and Mean Absolute Error (MAE). The IBS presented an accuracy along the 12 months of 0.08 (Figure 4).
Figure 5 presents the actual and predicted customers that dropped out during the 40 months, showing an average absolute error of 7.2 customers.
Table 3 shows the feature importance, calculated according to Breiman [24], as the percent increase in misclassification rate compared to the out-of-bag rate (with all variables intact). Out-of-bag refers to bootstrap aggregating (i.e., sub-sampling with replacement to create training samples for the model to learn from), where two independent sets are created: the bootstrap sample (in-the-bag), obtained by sampling with replacement from the original data set, and the out-of-bag set, which is the difference between the original data set and the bootstrap data set. The most important variable was tbilled, followed by dayswfreq and nentries. The least relevant features were cfreq, age, and sex.
The prediction was very similar to the actual value. The model accuracy was very high, with a root mean square error of 8. The mean absolute error was 3.13 customers, and the median absolute error was 4.88 customers.

7.2. Survival Tree-Based Model with Clusters

In our approach, we created clusters and applied the survival trees within each cluster. The determination of the number of clusters using the BIC criterion and the EEV model yielded the following scores: 9 clusters, 6990.94; 7 clusters, −30,105.59; and 6 clusters, −44,616.29. Figure 6 shows the determination of the number of clusters using the BIC. The elbow analysis, presented in Figure 7, shows that the curve flattens after 8 clusters. Therefore, nine clusters was considered the optimal number, which was used to partition the customers. The highest prediction performance is represented in the three biggest clusters, considering the number of elements: cluster 0 (n = 1955), cluster 4 (n = 729), and cluster 8 (n = 1020).
The prediction performance analysis (IBS score) in clusters 0, 4, and 8 yielded accuracies of 0.07 (Figure 8), 0.078 (Figure 9), and 0.105 (Figure 10), respectively. The cluster 0 actual versus predicted model had a mean absolute error of 4.9 customers, a median absolute error of 0.864, and a Root Mean Square Error of 8.96 (Figure 11). The feature importance in the survival model for cluster 0 (Table 4) identified the three most relevant features to predict survival as freeuse, age, and maccess, while the features with lower relevance were nentries, dayswfreq, and tbilled.
The cluster 4 actual versus predicted model presented a mean absolute error of 2.29 customers, a median absolute error of 0.717, and a Root Mean Square Error of 3.32 (Figure 12). The feature importance in the survival model for cluster 4 (Table 5) identified the three most relevant features to predict survival as nentries, dayswfreq, and tbilled, while the least relevant were freeuse, cfreq, and sex.
Finally, the cluster 8 actual versus predicted model presented a mean absolute error of 2.02 customers, a median absolute error of 1.09, and a Root Mean Square Error of 3.52 (Figure 13). The feature importance in the survival model for cluster 8 (Table 6) identified the three most relevant features to predict survival as maccess, dayswfreq, and tbilled, while the least relevant were cfreq, age, and sex.

Model Comparison

Table 7 shows the performance of both approaches; that is, with and without clusters. The RMSE, mean, and median errors in the clustered approach were lower than those obtained when not using clusters to predict the survival time until dropout. The metrics, comparing the values in each cluster against the model without clusters, were as follows: median 1.68 ± 1.34 vs. 4.8 without clusters; mean 1.348 ± 0.831 vs. 3.13 without clusters; RMSE 2.80 ± 2.69 vs. 7.9 without clusters; and IBS 0.051 ± 0.036 vs. 0.089 without clusters. On average, the clustered models outperform the model without clusters, even if we consider plus one standard deviation in all the indicators.
Overall, the performance was improved through clustering, both in terms of the mean and the median error.
The model without clusters presented a RMSE of 7.904, a mean absolute error of 3.131 customers, a median absolute error of 4.876, and an IBS of 0.089. One cluster in the model using clusters had worse performance (cluster 0), with a RMSE of 9.181, a mean absolute error of 0.893, and a median absolute error of 4.766. The cluster with the best performance (cluster 5) had a RMSE of 0.312, a mean absolute error of 0, and a median absolute error of 0.127. The overall performance of the model was improved when using clusters, with the non-clustering IBS of 0.089 only being surpassed by cluster 8, with an IBS of 0.105.
Comparing the prediction accuracy in each cluster using the Brier Score, the median Brier score without clusters was 0.356, while those for clusters 0, 1, 2, 3, 4, 5, 6, 7, and 8 were 0.022, 0.114, 0.041, 0.056, 0.022, 0.072, 0.133, 0.137, and 0.049, respectively. The results of applying the Mann–Whitney test between each cluster and the model without clusters were as follows: cluster 0 (U = 530, n1 = 19, n2 = 29, p < 0.05); cluster 1 (U = 400, n1 = 19, n2 = 25, p < 0.05); cluster 2 (U = 374, n1 = 19, n2 = 21, p < 0.05); cluster 3 (U = 721, n1 = 19, n2 = 42, p < 0.05); cluster 4 (U = 415, n1 = 19, n2 = 23, p < 0.05); cluster 5 (U = 220, n1 = 19, n2 = 13, p < 0.05); cluster 6 (U = 276, n1 = 19, n2 = 19, p < 0.05); cluster 7 (U = 597, n1 = 19, n2 = 39, p < 0.05); and cluster 8 (U = 663, n1 = 19, n2 = 38, p < 0.05). The difference in prediction accuracy between the survival model using clusters and that without clusters was thus statistically significant, with the median Brier score being lower when using clusters.

8. Discussion

In this study, we evaluated the performance of random survival forests using membership data of customers of a health club. A survival model was created to determine the duration of the relationship. This approach provides an additional view to identify when the customer dropout will occur, allowing for the development of retention strategies that consider the timing of the events. More than 70% of the customers were predicted to drop out in the first 12 months, which is very high and has not been identified in other studies. Burez and Vandenpoel [19], in a study of pay TV users, found that one out of three customers leave the company before one year, and half the customers leave within two years.
The accuracy calculated using the actual and predicted customers who dropped out during the 40 months showed a mean absolute error of 7.5 customers. Using the hybrid model, the mean absolute errors were 1.56, 6.52, and 1.56 customers in clusters 1, 2, and 3, respectively. The features dayswfreq, tbilled, and nentries represented more than 66% of the importance in predicting the survival model without clusters. Accordingly, in the hybrid approach, the most relevant features were nentries, tbilled, and dayswfreq, representing 67% of the importance in cluster 1; in cluster 2, dayswfreq, tbilled, and nentries represented 70% of the prediction importance; and, in cluster 3, tbilled, dayswfreq, and nentries represented 69% of the prediction importance.
The use of clustering in supporting customer segmentation to improve the performance of machine learning techniques is not new. Jafari-Marandi et al. [29] have also explored the use of clusters to improve the prediction accuracy. However, this approach combined with the use of survival models, to the best of our knowledge, has not been previously attempted or reported.
The better performance of the hybrid model in predicting when customers will drop out, using existing data, supports the development of management countermeasures to reduce dropout. The duration of the relationship between the customer and the organization is an important aspect, allowing us to understand that the decision of the customer to drop out changes over time, which implies that existing models predicting customer dropout may only be correct at a specific point in time, after which the decision may change [11].
The time perspective allows us to identify the period in which retention actions should be developed; therefore, the prediction should be as accurate as possible. However, customers who are about to churn but cannot be retained should be excluded from the countermeasures to avoid dropout, considering that targeting them may constitute a waste of scarce resources [15]. Assessing predictive models considering only their accuracy seems to be a limited perspective, considering that customers with a higher risk of churning may not be the best targets for the development of retention strategies. Further research should be conducted exploring this perspective, thus providing further insight into the return on investment in the development of countermeasures. A business context, or the clarification of the business objectives underlying the prediction of customer dropout, should be developed, in order to clarify which objectives should be achieved before the employment of machine learning algorithms.

9. Conclusions

In this paper, we investigated customer dropout in a health club organization, considering the dynamic perspective that the dropout risk varies with time. We explored two approaches: a survival model based on random forests with and without clusters. The model using clusters allowed us to group the customers into different clusters, comprising a hybrid approach. Based on the results, the proposed model using clusters improved the accuracy of the survival model, allowing for the development of targeted approaches that take into account the timing of when the dropout occurs and the cluster a given customer belongs to. Most importantly, managers can use the resulting information to improve their retention strategies.

Author Contributions

Conceptualization, P.S., J.B., D.M. and J.G.-A.; methodology, P.S. and J.B.; software, P.S. and J.B.; validation, J.B.; formal analysis, P.S. and J.B.; investigation, P.S., J.B. and J.G.-A.; data curation, P.S.; writing—original draft preparation, P.S. and J.B.; writing—review and editing, P.S., J.B. and D.M.; supervision, J.B., D.M. and J.G.-A.; project administration, J.B. and J.G.-A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Spanish Ministry of Science and Innovation under Project PID2021-124054OB-C31, in part by the 4IE+ Project by the Interreg V-A España-Portugal (POCTEP) (2014–2020) Program under Grant 0499-4IE-PLUS-4-E, in part by the Department of Economy, Regional Ministry of Economy, Science and Digital Agenda under Grant GR21133.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. Software Versioning

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
 
Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /home/sobreiro/miniconda3/envs/survival/lib/libmkl_intel_lp64.so
 
locale:
[1] en_US.UTF-8
 
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
 
other attached packages:
[1] mclust_5.4.10    labelled_2.9.1   kableExtra_1.3.4 gtsummary_1.6.0
[5] visdat_0.5.3     readxl_1.4.0     stargazer_5.2.3  reticulate_1.25
[9] ggplot2_3.3.6    dlookr_0.5.6     dplyr_1.0.9
 
loaded via a namespace (and not attached):
[1] reactable_0.2.3     webshot_0.5.3       httr_1.4.3
[4] tools_4.2.1         utf8_1.2.2          R6_2.5.1
[7] rpart_4.1.16        DBI_1.1.3           colorspace_2.0-3
[10] withr_2.5.0         tidyselect_1.1.2    gridExtra_2.3
[13] curl_4.3.2          compiler_4.2.1      extrafontdb_1.0
[16] cli_3.3.0           rvest_1.0.2         gt_0.5.0
[19] xml2_1.3.3          labeling_0.4.2      bookdown_0.26
[22] scales_1.2.0        mvtnorm_1.1-3       rappdirs_0.3.3
[25] systemfonts_1.0.4   stringr_1.4.0       digest_0.6.29
[28] rmarkdown_2.14      svglite_2.1.0       pkgconfig_2.0.3
[31] htmltools_0.5.2     showtext_0.9-5      extrafont_0.18
[34] fastmap_1.1.0       highr_0.9           htmlwidgets_1.5.4
[37] rlang_1.0.2         rstudioapi_0.13     sysfonts_0.8.8
[40] shiny_1.7.1         generics_0.1.2      farver_2.1.0
[43] jsonlite_1.8.0      magrittr_2.0.3      Formula_1.2-4
[46] Matrix_1.4-1        Rcpp_1.0.8.3        munsell_0.5.0
[49] fansi_1.0.3         gdtools_0.2.4       partykit_1.2-15
[52] lifecycle_1.0.1     stringi_1.7.6       yaml_2.3.5
[55] inum_1.0-4          grid_4.2.1          hrbrthemes_0.8.0
[58] promises_1.2.0.1    forcats_0.5.1       crayon_1.5.1
[61] lattice_0.20-45     haven_2.5.0         splines_4.2.1
[64] hms_1.1.1           knitr_1.39          pillar_1.7.0
[67] glue_1.6.2          evaluate_0.15       pagedown_0.18
[70] broom.helpers_1.7.0 vctrs_0.4.1         png_0.1-7
[73] httpuv_1.6.5        Rttf2pt1_1.3.10     cellranger_1.1.0
[76] gtable_0.3.0        purrr_0.3.4         tidyr_1.2.0
[79] assertthat_0.2.1    xfun_0.31           mime_0.12
[82] libcoin_1.0-9       xtable_1.8-4        later_1.3.0
[85] survival_3.3-1      viridisLite_0.4.0   tibble_3.1.7
[88] showtextdb_3.0      ellipsis_0.3.2

References

  1. Berry, M.J.A.; Linoff, G. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, 2nd ed.; Wiley Pub: Indianapolis, Indiana, 2004.
  2. Gupta, S.; Hanssens, D.; Hardie, B.; Kahn, W.; Kumar, V.; Lin, N.; Ravishanker, N.; Sriram, S. Modeling Customer Lifetime Value. J. Serv. Res. 2006, 9, 139–155.
  3. Ascarza, E. Retention Futility: Targeting High-Risk Customers Might be Ineffective. J. Mark. Res. 2018, 55, 80–98.
  4. Prasasti, N.; Ohwada, H. Applicability of machine-learning techniques in predicting customer defection. In Proceedings of the 2014 International Symposium on Technology Management and Emerging Technologies, Bandung, Indonesia, 27–29 May 2014; pp. 157–162.
  5. Risselada, H.; Verhoef, P.C.; Bijmolt, T.H.A. Staying Power of Churn Prediction Models. J. Interact. Mark. 2010, 24, 198–208.
  6. Amin, A.; Anwar, S.; Adnan, A.; Nawaz, M.; Alawfi, K.; Hussain, A.; Huang, K. Customer churn prediction in the telecommunication sector using a rough set approach. Neurocomputing 2017, 237, 242–254.
  7. Fornell, C.; Wernerfelt, B. Defensive Marketing Strategy by Customer Complaint Management: A Theoretical Analysis. J. Mark. Res. 1987, 24, 337–346.
  8. Edward, M.; Sahadev, S. Role of switching costs in the service quality, perceived value, customer satisfaction and customer retention linkage. Asia Pac. J. Mark. Logist. 2011, 23, 327–345.
  9. Reichheld, F.F. Learning from Customer Defections. Harv. Bus. Rev. 1996, 74, 56–67.
  10. Athanassopoulos, A.D. Customer Satisfaction Cues To Support Market Segmentation and Explain Switching Behavior. J. Bus. Res. 2000, 47, 191–207.
  11. Alboukaey, N.; Joukhadar, A.; Ghneim, N. Dynamic behavior based churn prediction in mobile telecom. Expert Syst. Appl. 2020, 162, 113779.
  12. Verbeke, W.; Martens, D.; Mues, C.; Baesens, B. Building comprehensible customer churn prediction models with advanced rule induction techniques. Expert Syst. Appl. 2011, 38, 2354–2364.
  13. Coussement, K.; Van den Poel, D. Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers. Expert Syst. Appl. 2009, 36, 6127–6134.
  14. Garcia, D.L.; Nebot, A.; Vellido, A. Intelligent data analysis approaches to churn as a business problem: A survey. Knowl. Inf. Syst. 2017, 51, 719–774.
  15. Devriendt, F.; Berrevoets, J.; Verbeke, W. Why you should stop predicting customer churn and start using uplift models. Inf. Sci. 2019, 548, 497–515.
  16. Kelleher, J.D.; Namee, B.M.; D'Arcy, A. Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies, 1st ed.; The MIT Press: Cambridge, MA, USA, 2015.
  17. Routh, P.; Roy, A.; Meyer, J. Estimating customer churn under competing risks. J. Oper. Res. Soc. 2021, 72, 1138–1155.
  18. Perianez, A.; Saas, A.; Guitart, A.; Magne, C. Churn Prediction in Mobile Social Games: Towards a Complete Assessment Using Survival Ensembles. In Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada, 17–19 October 2016; pp. 564–573.
  19. Burez, J.; Vandenpoel, D. Separating financial from commercial customer churn: A modeling step towards resolving the conflict between the sales and credit department. Expert Syst. Appl. 2008, 35, 497–514.
  20. Van den Poel, D.; Larivière, B. Customer attrition analysis for financial services using proportional hazard models. Eur. J. Oper. Res. 2004, 157, 196–217.
  21. Baesens, B.; Van Gestel, T.; Stepanova, M.; Van den Poel, D.; Vanthienen, J. Neural network survival analysis for personal loan data. J. Oper. Res. Soc. 2005, 56, 1089–1098.
  22. Singer, J.D.; Willett, J.B. It's About Time: Using Discrete-Time Survival Analysis to Study Duration and the Timing of Events. J. Educ. Stat. 1993, 18, 155–195.
  23. Ehrlinger, J. ggRandomForests: Exploring Random Forest Survival. arXiv 2016, arXiv:1612.08974.
  24. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  25. Hung, S.Y.; Yen, D.C.; Wang, H.Y. Applying data mining to telecom churn management. Expert Syst. Appl. 2006, 31, 515–524.
  26. Gok, M.; Ozyer, T.; Jida, J. A Case Study for the Churn Prediction in Turksat Internet Service Subscription. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015—ASONAM'15, Paris, France, 25–28 August 2015; pp. 1220–1224.
  27. Vijaya, J.; Sivasankar, E. An efficient system for customer churn prediction through particle swarm optimization based feature selection model with simulated annealing. Clust. Comput. 2019, 22, 10757–10768.
  28. Saunders, J. Cluster Analysis for Market Segmentation. Eur. J. Mark. 1980, 14, 422–435.
  29. Jafari-Marandi, R.; Denton, J.; Idris, A.; Smith, B.K.; Keramati, A. Optimum profit-driven churn decision making: Innovative artificial neural networks in telecom industry. Neural Comput. Appl. 2020, 32, 14929–14962.
  30. Bhattacharya, C.B. When customers are members: Customer retention in paid membership contexts. J. Acad. Mark. Sci. 1998, 26, 31.
  31. Neslin, S.A.; Gupta, S.; Kamakura, W.; Lu, J.; Mason, C.H. Defection Detection: Measuring and Understanding the Predictive Accuracy of Customer Churn Models. J. Mark. Res. 2006, 43, 204–211.
  32. Schober, P.; Vetter, T.R. Survival Analysis and Interpretation of Time-to-Event Data: The Tortoise and the Hare. Anesth. Analg. 2018, 127, 792–798.
  33. Stensrud, M.J.; Hernan, M.A. Why Test for Proportional Hazards? JAMA 2020, 323, 1401–1402.
  34. Ishwaran, H.; Kogalur, U.B.; Blackstone, E.H.; Lauer, M.S. Random survival forests. Ann. Appl. Stat. 2008, 2, 841–860.
  35. Wang, P.; Li, Y.; Reddy, C.K. Machine Learning for Survival Analysis: A Survey. arXiv 2017, arXiv:1708.04649.
  36. Bland, J.M.; Altman, D.G. Survival probabilities (The Kaplan–Meier method). BMJ 1998, 317, 1572.
  37. PySurvival: Open Source Package for Survival Analysis Modeling. 2019. Available online: https://www.pysurvival.io/ (accessed on 17 August 2022).
  38. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  39. Schwartz, H.; Sap, M.; Kern, M.; Eichstaedt, J.; Kapelner, A.; Agrawal, M.; Blanco, E.; Dziurzynski, L.; Park, G.; Stillwell, D.; et al. Predicting Individual Well-Being through the Language of Social Media; World Scientific: Singapore, 2015; pp. 516–527.
  40. Scrucca, L.; Fop, M.; Murphy, T.; Raftery, A. mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. R J. 2016, 8, 289.
  41. Davidson-Pilon, C. lifelines: Survival analysis in Python. J. Open Source Softw. 2019, 4, 1317.
  42. Akogul, S.; Erisoglu, M. A Comparison of Information Criteria in Clustering Based on Mixture of Multivariate Normal Distributions. Math. Comput. Appl. 2016, 21, 34.
Figure 1. Number of members per month.
Figure 2. Survival probabilities.
Figure 3. Survival by gender and contracted access frequency.
Figure 4. Global model performance.
Figure 5. Global model performance: predicted versus actual.
Figure 6. Cluster number analysis.
Figure 7. Elbow analysis.
Figure 8. Model performance for cluster 0.
Figure 9. Model performance for cluster 4.
Figure 10. Model performance for cluster 8.
Figure 11. Conditional survival forest for cluster 0.
Figure 12. Conditional survival forest for cluster 4.
Figure 13. Conditional survival forest for cluster 8.
Table 1. Summary statistics of features used.
Characteristic | N = 5209
Age (age in years), Mean (SD) | 28 (12)
Male or female (percentage of male), % | 35%
dayswfreq (non-attendance days before dropout), Mean (SD) | 76 (102)
tbilled (total amount billed), Mean (SD) | 155 (155)
maccess (average entries by week), Mean (SD) | 0.89 (0.76)
freeuse (user with free use (1) or with limited entries (0)), % | 4.9%
nentries (total number of entries), Mean (SD) | 29 (41)
cfreq (weekly contracted accesses), %
   2 | 1.3%
   4 | 2.4%
   6 | 0.2%
   7 | 96%
months (customer enrolment, in months), Mean (SD) | 9 (8)
dropout (customer dropout 1; non-dropout 0), % | 88%
Table 2. Determination of survival time probabilities.
Event_at | Removed | Observed | Censored | Entrance | At_Risk | Estimated_Survival | Prob
0 | 1 | 1 | 0 | 5209 | 5209 | 7 | 1.000
1 | 339 | 249 | 90 | 0 | 5208 | 7 | 0.952
2 | 543 | 449 | 94 | 0 | 4869 | 7 | 0.864
3 | 520 | 506 | 14 | 0 | 4326 | 7 | 0.763
4 | 368 | 362 | 6 | 0 | 3806 | 7 | 0.691
5 | 361 | 350 | 11 | 0 | 3438 | 6 | 0.620
6 | 385 | 373 | 12 | 0 | 3077 | 6 | 0.545
7 | 254 | 240 | 14 | 0 | 2692 | 5 | 0.496
8 | 215 | 192 | 23 | 0 | 2438 | 6 | 0.457
9 | 192 | 171 | 21 | 0 | 2223 | 6 | 0.422
10 | 288 | 270 | 18 | 0 | 2031 | 6 | 0.366
11 | 362 | 350 | 12 | 0 | 1743 | 9 | 0.293
12 | 240 | 229 | 11 | 0 | 1381 | 10 | 0.244
13 | 85 | 60 | 25 | 0 | 1141 | 9 | 0.231
14 | 112 | 86 | 26 | 0 | 1056 | 9 | 0.212
15 | 78 | 73 | 5 | 0 | 944 | 9 | 0.196
16 | 61 | 59 | 2 | 0 | 866 | 10 | 0.183
17 | 68 | 64 | 4 | 0 | 805 | 10 | 0.168
18 | 61 | 56 | 5 | 0 | 737 | 9 | 0.155
19 | 50 | 39 | 11 | 0 | 676 | 9 | 0.146
20 | 53 | 36 | 17 | 0 | 626 | 9 | 0.138
21 | 74 | 56 | 18 | 0 | 573 | 10 | 0.124
22 | 75 | 61 | 14 | 0 | 499 | 11 | 0.109
23 | 49 | 39 | 10 | 0 | 424 | 11 | 0.099
24 | 20 | 13 | 7 | 0 | 375 | 10 | 0.096
Note: Removed, the sum of customers that dropped out and that were censored; Censored, the event did not occur during the period of the data collection; At_Risk, number of customers at risk of dropout; Prob, survival probability; Estimated_Survival, months to survive in the health club.
Table 3. Feature importance in the survival model.
Feature | Importance | Pct_Importance
tbilled | 7.639 | 0.264
freeuse | 5.758 | 0.199
dayswfreq | 5.105 | 0.176
nentries | 4.847 | 0.167
maccess | 3.744 | 0.129
cfreq | 1.871 | 0.065
sex_1 | −0.057 | 0.000
age | −0.210 | 0.000
Table 4. Feature importance in the survival model with cluster 0.
Feature | Importance | Pct_Importance
freeuse | 1.586 | 0.443
age | 1.052 | 0.294
maccess | 0.944 | 0.263
sex_1 | −0.033 | 0.000
cfreq | −1.026 | 0.000
nentries | −1.595 | 0.000
dayswfreq | −1.960 | 0.000
tbilled | −3.245 | 0.000
Table 5. Feature importance in the survival model with cluster 4.
Feature | Importance | Pct_Importance
nentries | 2.226 | 0.388
dayswfreq | 2.086 | 0.363
tbilled | 1.033 | 0.180
age | 0.225 | 0.039
maccess | 0.174 | 0.030
freeuse | 0.000 | 0.000
cfreq | 0.000 | 0.000
sex_1 | −0.989 | 0.000
Table 6. Feature importance in the survival model with cluster 8.
Feature | Importance | Pct_Importance
maccess | 4.632 | 0.278
dayswfreq | 3.235 | 0.194
tbilled | 2.962 | 0.178
nentries | 2.366 | 0.142
freeuse | 2.191 | 0.132
cfreq | 1.026 | 0.062
age | 0.237 | 0.014
sex_1 | −0.626 | 0.000
Table 7. Brier Score performance prediction in each cluster.
Cluster | RMSE | Mean | Median | IBS | n | Ntrain | Ntest
0 | 9.181 | 0.893 | 4.766 | 0.070 | 1955 | 1466 | 489
1 | 1.350 | 0.938 | 1.065 | 0.000 | 109 | 81 | 28
2 | 3.861 | 1.159 | 2.272 | 0.036 | 425 | 318 | 107
3 | 2.156 | 1.207 | 1.529 | 0.057 | 624 | 468 | 156
4 | 2.903 | 0.882 | 1.891 | 0.078 | 729 | 546 | 183
5 | 0.312 | 0.000 | 0.127 | NaN | 49 | 36 | 13
6 | 0.689 | 0.400 | 0.521 | NaN | 40 | 30 | 10
7 | 1.255 | 1.025 | 1.050 | 0.017 | 258 | 193 | 65
8 | 3.513 | 0.975 | 1.941 | 0.105 | 1020 | 765 | 255
w/o clusters | 7.904 | 3.131 | 4.876 | 0.089 | 5209 | 3906 | 1303
Note: NaN, value not possible to calculate; n represents the number of elements in each cluster; Ntrain and Ntest are the number of elements used to train and test the model, respectively.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
