Erratum published on 29 April 2020, see Risks 2020, 8(2), 42.

Modelling Recovery Rates for Non-Performing Loans

Department of Mathematics, Imperial College London, London SW7 2AZ, UK
Authors to whom correspondence should be addressed.
Risks 2019, 7(1), 19;
Submission received: 12 February 2019 / Accepted: 15 February 2019 / Published: 20 February 2019
(This article belongs to the Special Issue Advances in Credit Risk Modeling and Management)


Based on a rich dataset of recoveries donated by a debt collection business, recovery rates for non-performing loans taken from a single European country are modelled using linear regression, linear regression with Lasso, beta regression and inflated beta regression. We also propose a two-stage model: beta mixture model combined with a logistic regression model. The proposed model allowed us to model the multimodal distribution we found for these recovery rates. All models were built using loan characteristics, default data and collections data prior to purchase by the debt collection business. The intended use of the models was to estimate future recovery rates for improved risk assessment, capital requirement calculations and bad debt management. They were compared using a range of quantitative performance measures under K-fold cross validation. Among all the models, we found that the proposed two-stage beta mixture model performs best.

1. Introduction

In Basel II, an internal ratings-based (IRB) approach was proposed by the Basel Committee in 2001 to determine capital requirements for credit risk (Bank for International Settlements 2001). This IRB approach grants banks permission to use their own risk models or assessments to calculate regulatory capital. Under the IRB approach, banks are required to estimate the following risk components: probability of default (PD), loss given default (LGD), exposure at default (EAD) and maturity (M) (Bank for International Settlements 2001). Since Basel II’s capital requirement calculation depends heavily on LGD, financial institutions have put more emphasis on modelling LGD in recent years. Unlike the estimation of PD, which is well-established, LGD is not so well-understood and still subject to research. Improving LGD modelling can help financial institutions assess their risk and regulatory capital requirement more precisely, as well as improving debt management.
LGD is defined as the proportion of money financial institutions fail to collect during the collection period, given the borrower has already defaulted. Conversely, Recovery Rate (RR) is defined as the proportion of money financial institutions successfully collected minus the administration fees during the collection period, given the borrower has already defaulted. Equations (1) and (2) give formal definitions of RR and LGD, respectively:
  • Suppose individual $i$ has already defaulted on a loan; let $EAD_i$ be the exposure at default for this individual.
  • Let $A_i$ be the administration costs (e.g., letters, phone calls, visits, lawyers and legal work) incurred for individual $i$.
  • Let $R_i$ be the amount recovered for individual $i$.
$$\text{Recovery Rate} = \frac{R_i - A_i}{EAD_i} = \frac{\text{Collections} - \text{Admin Fee}}{\text{Outstanding Balance at Default}} \quad (1)$$
$$\text{Loss Given Default} = 1 - \text{Recovery Rate} = 1 - \frac{R_i - A_i}{EAD_i} \quad (2)$$
RR mainly lies in the interval [0, 1] and typically has high concentrations at the boundary points 0 and 1. It is possible for RR to be negative if recoveries are less than administration costs, $A_i > R_i$, and greater than 1 if recoveries exceed exposure plus administration costs, $R_i > EAD_i + A_i$. Typically, however, RR is truncated to the interval [0, 1] when developing LGD models.
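As a concrete illustration of Equations (1) and (2), the quantities can be computed directly; the figures below are hypothetical, not taken from the study's data:

```python
def recovery_rate(recovered, admin_costs, ead):
    """RR = (R_i - A_i) / EAD_i."""
    return (recovered - admin_costs) / ead

def loss_given_default(recovered, admin_costs, ead):
    """LGD = 1 - RR."""
    return 1.0 - recovery_rate(recovered, admin_costs, ead)

# Hypothetical example: 600 recovered, 50 admin costs, 1000 exposure at default
rr = recovery_rate(600.0, 50.0, 1000.0)        # (600 - 50) / 1000 = 0.55
lgd = loss_given_default(600.0, 50.0, 1000.0)  # 1 - 0.55 = 0.45
```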
The main challenge in estimating LGD is the bimodal shape, with high concentrations at 0 and 1, typically present in empirical LGD distributions, where people either repay in full or repay nothing. For the dataset used in this study, we found our LGD distribution is actually trimodal. Therefore, regression models have been studied that specifically deal with this problem. For example, Bellotti and Crook (2012) built Tobit and decision tree models, along with beta and fractional logit transformations of the RR response variable, to forecast LGD based on a dataset of 55,000 defaulted credit cards in the UK from 1999 to 2005. They concluded that ordinary least squares regression with macroeconomic variables performed best in terms of forecast performance. Calabrese (2012) proposed a mixed continuous-discrete model, where the boundary values 0 and 1 are modelled by Bernoulli random variables and the continuous part of the RR is modelled by a beta random variable. This model was applied to predict the RR of Bank of Italy loans from 1985 to 1999. The results were compared with Papke and Wooldridge's fractional response model with log-log, logistic and complementary log-log link functions (Papke and Wooldridge 1996) and with linear regression; the mixed continuous-discrete model achieved the best performance. Qi and Zhao (2011) applied four linear models, namely ordinary least squares regression, fractional response regression, inverse Gaussian regression, and inverse Gaussian regression with beta transformation, and two non-linear models, namely regression tree and neural network, to model the LGD of 3751 defaulted bank loans and bonds in the US from 1985 to 2008. They concluded that fractional response regression is slightly better than ordinary least squares regression, and that the non-linear models perform best. Loterman et al. (2012) performed a benchmark study of LGD comparing twenty-four different models on six datasets from international banks. They concluded that non-linear models, such as neural networks, support vector machines and mixture models, perform better than linear models.
For this project, we specifically modelled and predicted RR for data from a single European country provided by a debt collection company. For reasons of commercial confidentiality and data protection, the debt collection company will remain anonymous and some aspects of the data were also anonymised, including the country of origin. Consequently, the data cannot be made publicly available. We applied some of the models that have been studied previously and also extended the existing models, proposing a new beta mixture model to improve the accuracy of RR prediction. A good prediction of RR would help the debt collection company determine collection policy for new debt portfolios. It is important to note that the RR we modelled differs from the usual RR, as the data only contain positive repayments and no administration fee was recorded. Therefore, all the RRs in our data lie in the range (0, 1] instead of [0, 1]. Figure 1 shows a histogram of RR for the data. We can clearly see that there are modes just above 0 and at approximately 0.55, and a high spike at the boundary value 1. Since the shape of the empirical RR distribution is trimodal, it is reasonable to assume that the recovery rate is a mixed-type random variable. The multimodality of RR is a natural consequence of different groups of bad debts being serviced using different strategies; e.g., one strategy may be that some bad debts are written off once the debtor has paid back some agreed fixed percentage of the outstanding balance. Having outcome RR within (0, 1] motivated the use of the beta regression model, and the multimodal nature of RR motivated the use of a mixture model within this context.
The beta mixture model has been applied successfully within several other application domains. Ji et al. (2005) showed how to apply the beta mixture regression model in several bioinformatics applications such as meta-analysis of gene expression data and to cluster correlation coefficients between gene expressions. Laurila et al. (2011) used a beta mixture model to describe DNA methylation patterns, helping to reduce the dimensionality of microarray data. Moustafa et al. (2018) used a beta mixture model as the basis of an anomaly detection system. Their network data are typically bounded, which suggests a beta distribution, and the use of the beta mixture allowed them to identify latent clusters in normal network use.
Inspired by Calabrese’s mixed continuous-discrete model (Calabrese 2012), we propose a two-stage model composed of:
  • A beta mixture model, parameterised by mean and precision based on two sets of predictor variables, on the interval (0, 1), to model the two modes located just above 0 and around 0.55.
  • A logistic regression model for the mode at the boundary value 1.
The proposed model allows representation of the trimodal feature of the data. The beta mixture component groups the clients into two clusters for RR < 1, based on their personal information, debt conditions and repayment history, which may provide useful information for other business analysis and decision-making; the logistic regression component then models the third case, RR = 1. In addition, we also used linear regression, linear regression with Lasso, beta regression and inflated beta regression to model RR. Model performance was measured by mean squared error, mean absolute error and mean aggregate absolute error under K-fold cross validation.
To our knowledge, this is the first study for estimating RR for portfolios of non-performing loans using a statistical model, and the first use of a beta mixture model for LGD. We also developed a novel procedure for predicting an expected value of outcome from a beta mixture model based on assigning a new observation to one of the clusters in the mixture. The remainder of the article is organised as follows: Section 2 provides a detailed data overview. Section 3 introduces the modelling methodology with great emphasis on the proposed beta mixture model combined with logistic regression model. Section 4 analyses some important features of the models and reports the model performance and Section 5 concludes with key findings and future recommendations.

2. Data

Three datasets were provided by the debt collection company:
Dataset 1
provides 48 predictor variables of personal information including socio-demographic variables, Credit Bureau Score and debt status for 120,699 individuals for loans originating between January 1998 and May 2014 from several different financial institutions. Overall, 97.5% of them have credit card debt and only 2.5% are refinanced credit cards (product = “R”). Partial information was extracted from a Bad Debt Bureau. Each record corresponds to a bad loan and has a unique key Loan.Ref.
Dataset 2
records all the recoveries made by the bank before the debt collection company purchased the debt portfolio. It contains 15 predictor variables about historical collection information, which includes number of calls, contacts and visits made by the bank to collect the debt. It also includes repayments in the format of monthly summary. In total, there are 42,832 individuals’ records in Dataset 2, among which only 34,807 individuals can be matched to Dataset 1 by Loan.Ref. Numbers of calls, contacts, visits, repayment and some other monthly activities are aggregated by summing for each loan identified by Loan.Ref.
Dataset 3
records all the recoveries made by the debt collection company after they purchased the debt portfolio from the bank. It includes 12 predictor variables about the ongoing collection information. There are 8281 individuals in total, among which only 8237 individuals are from Dataset 1. Since only positive repayments are recorded, all the recovery rates we calculated are strictly greater than 0. Therefore, in the modelling section, we only focus on the recovery modelling in the interval (0, 1], which is slightly different from the usual RR defined in [0, 1]. The debt collection period recorded in this dataset is from January 2015 to end of November 2016.
Figure 2 shows how the data were joined. There are 8237 data points in Dataset 3, but historical collection information is recorded in Dataset 2 for only 7161 of them. For the remaining 1076 individuals, there were no historical recoveries by the bank, i.e., no calls, contacts, visits or payments, so a value of 0 was assigned to their aggregate recoveries in Dataset 2. The modified Dataset 2 was then joined to Datasets 1 and 3 by the unique key Loan.Ref, giving a table of 8237 data points with 61 variables.
Table A1 gives descriptive statistics for each of the variables in the joined dataset used in the statistical modelling. The predictor variable Pre-Recovery Rate is the bank's RR before the debt portfolio was purchased. Its minimum value is −0.130, which is negative because administration fees exceeded repayments during the collection period. The predictor variable Credit Bureau Score is a generic credit score provided by a credit bureau.

Recovery Rate Calculation

Since the repayments in Datasets 2 and 3 were recorded in the format of monthly activity summaries, each individual may have several repayments for the same loan. Therefore, we defined the recovery rate as the sum of repayments minus the administration fee (if available) over the original balance of the loan, which is also equivalent to the difference between original balance and ending balance over the original balance. For each individual i, RR is calculated using:
$$\text{Recovery Rate}_i = \frac{\text{Repayments}_i - \text{AdminFee}_i}{\text{Original Balance}_i} = \frac{\text{Original Balance}_i - \text{Ending Balance}_i}{\text{Original Balance}_i} \quad (3)$$
Figure 1 is the empirical RR histogram calculated using Equation (3) for the 8237 data points after pre-processing. The remaining 112,462 data points essentially have RR = 0, but since we do not know whether they had been serviced, they were excluded from the analysis. Essentially, the goal of our model is to estimate RR computed from Dataset 3 (post-purchase), based on pre-purchase information given in Datasets 1 and 2.
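The per-loan aggregation described above (summing monthly repayments before dividing by the original balance) can be sketched as follows; the loan references and amounts are hypothetical, and, as in the study's data, no administration fee is recorded:

```python
# Monthly repayment records (hypothetical): (loan_ref, repayment)
monthly = [("A1", 100.0), ("A1", 250.0), ("A2", 400.0), ("A1", 150.0)]
original_balance = {"A1": 1000.0, "A2": 800.0}

# Aggregate repayments per loan, then compute RR = total repayments / original balance
totals = {}
for ref, amount in monthly:
    totals[ref] = totals.get(ref, 0.0) + amount

rr = {ref: totals[ref] / original_balance[ref] for ref in totals}
# rr["A1"] = 500 / 1000 = 0.5; rr["A2"] = 400 / 800 = 0.5
```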

3. Modelling Methodology

We applied various models to estimate RR. In all cases, model performance was measured within a K-fold cross validation framework. We first tried ordinary least squares linear regression, with and without stepwise backward variable selection using the AIC criterion. In the following sub-sections, we list the other modelling approaches we explored. Let y denote the outcome variable, recovery rate, and let X be the corresponding vector of predictor variables.

3.1. Linear Regression with Lasso

We applied linear regression with a Lasso (Least Absolute Shrinkage and Selection Operator) penalty. The model structure is
$$y = \beta_0 + \beta^T X + \epsilon$$
where β 0 and β are intercept and coefficients to be estimated and ϵ is the error term. Then, estimation using least squares error with Lasso is given by the optimisation problem on a training dataset of N observations:
$$(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \left[ \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \beta_0 - \beta^T X_i \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right],$$
where λ > 0 is a tuning parameter controlling the size of regularisation. Regression with Lasso will tend to shrink coefficient estimates to zero and hence is a form of variable selection (Friedman et al. 2010). The value of λ is chosen using K-fold cross validation. For this project, the R packages “lars” (Hastie and Efron 2013) and “glmnet” (Friedman et al. 2010) were used to estimate linear regression with Lasso.
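The study estimated this model with the R packages cited above. Purely as an illustrative sketch of the same objective, a basic coordinate-descent solver with soft thresholding can be written in Python; the data and the value of λ below are synthetic, and the sketch assumes standardised predictors and a centred response (no intercept):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/N)||y - Xb||^2 + lam * ||b||_1.
    Assumes standardised columns and centred y (intercept omitted)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual excluding j
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            # soft-thresholding update for coordinate j
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / z
    return b

# Synthetic data: only the first of three features matters
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = 2.0 * X[:, 0] + 0.01 * rng.standard_normal(200)
beta = lasso_cd(X, y, lam=0.1)   # beta[0] near 2; beta[1], beta[2] shrunk to ~0
```

The soft-thresholding step is exactly what drives some coefficients to zero, giving the variable-selection behaviour described above.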

3.2. Multivariate Beta Regression

The problem with linear regression is that it does not take account of the particular distribution of RR, which is between 0 and 1. The beta distribution, with two shape parameters α and β , allows us to model RR in the open interval ( 0 , 1 ) :
$$f(y_i; \alpha_i, \beta_i) = \frac{\Gamma(\alpha_i + \beta_i)}{\Gamma(\alpha_i)\,\Gamma(\beta_i)}\, y_i^{\alpha_i - 1} (1 - y_i)^{\beta_i - 1}, \quad 0 < y_i < 1,$$
where $\alpha, \beta > 0$ are the shape parameters and $\Gamma(\cdot)$ is the Gamma function. The beta distribution is reparameterised by mean and precision parameters, denoted $\mu$ and $\phi$ respectively, following Ferrari and Cribari-Neto (2004), since this parameterisation meaningfully expresses the expected value and variance:
$$\phi_i = \alpha_i + \beta_i, \quad E(y_i) = \mu_i = \frac{\alpha_i}{\alpha_i + \beta_i}, \quad Var(y_i) = \frac{\mu_i (1 - \mu_i)}{\phi_i + 1}.$$
The reparameterised beta distribution is then
$$f(y_i; \mu_i, \phi_i) = \frac{\Gamma(\phi_i)}{\Gamma(\mu_i \phi_i)\,\Gamma((1 - \mu_i)\phi_i)}\, y_i^{\mu_i \phi_i - 1} (1 - y_i)^{(1 - \mu_i)\phi_i - 1}, \quad 0 < y_i < 1,$$
with 0 < μ i < 1 and ϕ i > 0 . Figure 3a demonstrates three examples of the beta distribution with fixed ϕ = 5 and different μ . The variance is maximised at μ = 0.5 . Figure 3b demonstrates another three examples of beta distribution with fixed μ = 0.5 and different ϕ .
The precision parameter $\phi$ is inversely related to $Var(y_i)$ for fixed $\mu$. Furthermore, the variance of y is a function of $\mu$, which enables the regression to model heteroskedasticity. RR is modelled as $y_i \sim B(\mu_i, \phi_i)$ for $i \in \{1, \ldots, N\}$, where N is the sample size. The multivariate beta regression model (Cribari-Neto and Zeileis 2010) is defined as:
$$F_1(\mu_i) = \eta^T X_i = \xi_{1i},$$
$$F_2(\phi_i) = \gamma^T W_i = \xi_{2i},$$
where $\eta$ and $\gamma$ are parameter vectors to be estimated, corresponding to the predictor variables X and W, respectively.
The predictor variables in W may be the same as in X, a subset, or different variables entirely. For this study, W contains a subset of predictor variables determined using stepwise variable selection. The link functions ensure that $\mu_i \in (0, 1)$ and $\phi_i > 0$; we applied the logit and log link functions to $\mu_i$ and $\phi_i$, respectively:
$$\mu_i = \frac{1}{1 + e^{-\eta^T X_i}}, \quad \phi_i = e^{\gamma^T W_i}.$$
With this multivariate beta regression model, η and γ can be estimated by maximum likelihood estimation, where the log-likelihood function is
$$L(\eta, \gamma) = \sum_{i=1}^{N} \Big[ \log \Gamma(\phi_i) - \log \Gamma(\mu_i \phi_i) - \log \Gamma((1 - \mu_i)\phi_i) + (\mu_i \phi_i - 1) \log y_i + ((1 - \mu_i)\phi_i - 1) \log(1 - y_i) \Big]. \quad (8)$$
By substituting $\mu_i = F_1^{-1}(\eta^T X_i)$ and $\phi_i = F_2^{-1}(\gamma^T W_i)$ into Equation (8), the log-likelihood is obtained as a function of $\eta$ and $\gamma$. The parameters can be estimated using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) quasi-Newton method, which is considered to be the most appropriate method (Mittelhammer et al. 2000; Nocedal and Wright 1999).
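The maximum likelihood step can be sketched in Python (the study itself used R's "betareg"). The following minimises the negative of the log-likelihood in Equation (8) under the logit and log links with BFGS, on simulated data whose true coefficients (0.3, 0.8) and precision 20 are hypothetical choices for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln, expit

def beta_reg_negloglik(params, X, W, y):
    """Negative log-likelihood of beta regression: logit link for the mean mu,
    log link for the precision phi (Ferrari & Cribari-Neto parameterisation)."""
    p, q = X.shape[1], W.shape[1]
    eta, gamma = params[:p], params[p:p + q]
    mu = expit(X @ eta)                                  # mu in (0, 1)
    phi = np.exp(np.clip(W @ gamma, -30.0, 30.0))        # phi > 0, clipped for stability
    ll = (gammaln(phi) - gammaln(mu * phi) - gammaln((1 - mu) * phi)
          + (mu * phi - 1) * np.log(y) + ((1 - mu) * phi - 1) * np.log1p(-y))
    return -ll.sum()

# Simulated example: intercept plus one predictor for mu, constant precision
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
W = np.ones((n, 1))
mu_true = expit(X @ np.array([0.3, 0.8]))
y = rng.beta(mu_true * 20.0, (1 - mu_true) * 20.0)

res = minimize(beta_reg_negloglik, x0=np.zeros(3), args=(X, W, y), method="BFGS")
eta_hat = res.x[:2]   # should be close to the true (0.3, 0.8)
```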

3.3. Inflated Beta Regression

The disadvantage of beta regression is that it does not include the boundary values 0 or 1. Therefore, a modification is required before fitting the model. To better represent RR on the boundaries, Calabrese (2012) suggested considering RR as a mixture of Bernoulli random variables for the boundaries 0 and 1 and a beta random variable for the open interval (0, 1). The distribution for this inflated beta regression on [0, 1] is then defined as
$$f_Y(y) = \begin{cases} p_0, & \text{if } y = 0 \\ (1 - p_0 - p_1)\, f_B(y; \alpha, \beta), & \text{if } 0 < y < 1 \\ p_1, & \text{if } y = 1 \end{cases}$$
for $y \in [0, 1]$, where $p_0 = P(y = 0)$, $p_1 = P(y = 1)$, $0 < p_0 + p_1 < 1$ and $f_B(y)$ is the beta distribution defined in Section 3.2. Moreover, if RR $y \in (0, 1]$, i.e., it is inflated only at one, as our data are, then the distribution is simply
$$f_Y(y) = \begin{cases} (1 - p_1)\, f_B(y; \alpha, \beta), & \text{if } 0 < y < 1 \\ p_1, & \text{if } y = 1 \end{cases}$$
We used maximum likelihood estimation to estimate the parameters of the Bernoulli and beta random variables, parameterising the discrete part in the following way (Calabrese 2012):
$$s_i = \frac{p_1}{p_1 + p_0}, \quad d_i = p_0 + p_1.$$
The log-likelihood function is then
$$L(s, d, \alpha, \beta) = \sum_{y_i = 0} \big[ \log(1 - s_i) + \log d_i \big] + \sum_{y_i = 1} \big[ \log s_i + \log d_i \big] + \sum_{0 < y_i < 1} \big[ \log(1 - d_i) + \log f_B(y_i; \alpha_i, \beta_i) \big].$$
The continuous beta random variables can be parameterised in the same way as described in Section 3.2.
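A minimal sketch of the one-inflated case (inflation only at 1, matching our data), with a single probability p1 and fixed beta parameters rather than the regression parameterisation above:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def one_inflated_loglik(y, p1, a, b):
    """Log-likelihood of the one-inflated beta on (0, 1]:
    P(y = 1) = p1; y | y < 1 ~ Beta(a, b), with weight (1 - p1)."""
    y = np.asarray(y)
    ones = y == 1.0
    ll_ones = ones.sum() * np.log(p1)
    ll_cont = (np.log1p(-p1) + beta_dist.logpdf(y[~ones], a, b)).sum()
    return ll_ones + ll_cont

# With p1 = 0.5 and Beta(1, 1) (uniform), y = [1.0, 0.5] gives
# log(0.5) + [log(0.5) + 0] = 2*log(0.5)
ll_example = one_inflated_loglik([1.0, 0.5], 0.5, 1.0, 1.0)
```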

3.4. Beta Mixture Model combined with Logistic Regression

Examining the distribution of RR shown in Figure 1, it can be seen that the distribution between 0 and 1 is bimodal. For this reason, we consider a beta mixture model to deal with what appears to be two different groups of recoveries. We propose a two-stage model: beta mixture model combined with logistic regression. The beta mixture model allows us to model the multimodality of RR in the interval (0, 1). This is similar to the two-stage (decision tree) model used by Bellotti and Crook (2012), but with a beta mixture used for regression.
Firstly, RR is classified into ones and non-ones using logistic regression. Secondly, within the non-ones group, a mixture of beta distributions is used to model RR in the range (0, 1). In general, a mixture of beta distributions consists of m components, where each component follows a parametric beta distribution. The prior probability of component j is denoted $\pi_j$, for $j \in \{1, \ldots, m\}$. Let $M_j$ denote the jth component/cluster in the beta mixture model. The beta mixture model with m components is defined as:
$$g(y; \mu, \phi) = \sum_{j=1}^{m} \pi_j f_j(y; X, \mu_j, \phi_j) = \sum_{j=1}^{m} \pi_j f_j\big(y; X, W, F_1^{-1}(\eta_j^T X_i), F_2^{-1}(\gamma_j^T W_i)\big) = \sum_{j=1}^{m} \pi_j f_j(y; X, W, \eta_j, \gamma_j),$$
where $f_j$ is the beta distribution corresponding to the jth component, with separate parameter vectors $\eta_j$ and $\gamma_j$. The same link functions are used as in Section 3.2. The prior probabilities $\pi_j$ need to satisfy the following conditions:
$$\sum_{j=1}^{m} \pi_j = 1, \quad \pi_j \geq 0.$$
The iterative Expectation-Maximisation (EM) algorithm was used to estimate the parameters of the beta mixture model, as described by (Leisch 2004). In particular, R package “flexmix” (Leisch 2004; Gruen and Leisch 2007, 2008) embedded in R package “betareg” (Cribari-Neto and Zeileis 2010; Gruen et al. 2012) was applied to estimate the model. Figure 4 illustrates the two-stage mixture model as a decision tree.
The choice of m in the model depends on the number of clusters expected in the data. Based on our analysis of the recoveries for the dataset we used, m = 2 was used since this corresponded to the two modes we see in the RR distribution for RR < 1, as shown in Figure 1. If it is not clear how many clusters may exist, approaches based on AIC can be used.
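To illustrate the mixture mechanics, the following Python sketch fits an unconditional two-component beta mixture by EM with a moment-matching M-step. This is a simplification of the full model, which regresses the mean and precision on covariates via "flexmix"/"betareg" in R; the simulated components are hypothetical choices that mimic the two RR modes (just above 0 and around 0.55):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def moment_match(y, w=None):
    """Weighted method-of-moments estimate of beta parameters (a, b)."""
    if w is None:
        w = np.full(len(y), 1.0 / len(y))
    mu = w @ y
    var = w @ (y - mu) ** 2
    common = mu * (1 - mu) / var - 1
    return mu * common, (1 - mu) * common

def fit_beta_mixture(y, m=2, n_iter=100):
    """EM for an m-component beta mixture on (0, 1); the M-step uses
    moment matching as a simple approximation to the weighted MLE."""
    # Initialise components from equal slices of the sorted sample
    pi = np.full(m, 1.0 / m)
    init = [moment_match(h) for h in np.array_split(np.sort(y), m)]
    a = np.array([p[0] for p in init])
    b = np.array([p[1] for p in init])
    for _ in range(n_iter):
        dens = np.array([pi[j] * beta_dist.pdf(y, a[j], b[j]) for j in range(m)])
        resp = dens / dens.sum(axis=0)          # E-step: responsibilities
        pi = resp.mean(axis=1)                  # M-step: mixture weights
        for j in range(m):
            a[j], b[j] = moment_match(y, resp[j] / resp[j].sum())
    return pi, a, b

rng = np.random.default_rng(3)
y = np.concatenate([rng.beta(2, 20, 1000),     # mode just above 0
                    rng.beta(30, 25, 1000)])   # mode around 0.55
pi, a, b = fit_beta_mixture(y)
means = np.sort(a / (a + b))   # component means, roughly 0.09 and 0.55
```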

Predictions Using the Beta Mixture Model

Given the beta mixture model, we need to predict the RR for new clients based on their information, i.e., $X^{new}$ and $W^{new}$. Figure 5 shows a flowchart explaining how to calculate the estimated RR from the beta mixture model. This gives an expected value of RR y conditional on the cluster $M_j$. Therefore, we need to first identify which cluster the new observation belongs to. Even though the R package "betareg" (Cribari-Neto and Zeileis 2010; Gruen et al. 2012) can compute the conditional expectation for us, it does not identify which cluster the new points should be assigned to. Therefore, we propose a method to do this. In general, there are two feasible approaches to assign a new observation to $M_j$:
  • Assign the new observation to the cluster that achieves the highest log-likelihood. This is a hard clustering approach, which assigns the observation to exactly one cluster (Fraley and Raftery 2002).
  • Assign the new observation to each cluster j with probability P ( M j ) . This is a soft clustering approach, which assigns the observation to a percentage weighted cluster (Leisch 2004).
Decomposing the expected value of y using the Law of Total Expectation, we get
$$E(y \mid x_i) = \sum_{j=1}^{m} P(M_j \mid x_i)\, E(y \mid x_i, M_j) \quad (12)$$
where $E(y \mid x_i, M_j)$ is calculated from the beta mixture model prediction (refer to Figure 5). We can substitute $P(M_j \mid x_i) = \frac{f(x_i \mid M_j)\, P(M_j)}{f(x_i)}$, where $f(x_i) = \sum_{j=1}^{m} f(x_i \mid M_j)\, P(M_j)$, to get
$$E(y \mid x_i) = \sum_{j=1}^{m} \frac{f(x_i \mid M_j)\, P(M_j)\, E(y \mid x_i, M_j)}{f(x_i)} \quad (13)$$
where $P(M_j)$ is the prior probability of belonging to cluster $M_j$. The density $f(x_i \mid M_j)$ is estimated using kernel density estimation,
$$\hat{f}(x^{new}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\prod_{k=1}^{d} h_{i,k}} \prod_{k=1}^{d} K\!\left( \frac{x_k^{new} - x_{i,k}}{h_{i,k}} \right)$$
where K ( · ) is the Gaussian kernel (Azzalini and Menardi 2014) and d is the number of dimensions in data x. In addition, x i may be high-dimensional, which makes the kernel density estimation computationally expensive. As a remedy, we applied Principal Component Analysis (PCA) to reduce the dimension of x i , and then kernel density estimation was performed in the reduced dimension space.
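A sketch of this dimension-reduction-plus-density-estimation step, assuming for simplicity a single shared bandwidth h rather than the per-point bandwidths $h_{i,k}$ above, with PCA computed via SVD:

```python
import numpy as np

def pca_reduce(X, d):
    """Project centred X onto its top-d principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

def gaussian_kde(train, query, h):
    """Product-Gaussian kernel density estimate with a shared bandwidth h."""
    diff = (query[:, None, :] - train[None, :, :]) / h       # shape (q, n, d)
    d = train.shape[1]
    kern = np.exp(-0.5 * (diff ** 2).sum(axis=2)) / ((2 * np.pi) ** (d / 2) * h ** d)
    return kern.mean(axis=1)                                 # average over the n points

rng = np.random.default_rng(2)
Z = pca_reduce(rng.standard_normal((100, 5)), 2)      # reduce 5 dims to 2
train = rng.standard_normal((4000, 1))
dens = gaussian_kde(train, np.array([[0.0]]), h=0.3)  # near the N(0,1) density at 0
```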
Approach 1: Maximum log-likelihood.
Given a new observation x i , choose j that maximises the density:
$$\arg\max_j \log f(y \mid x_i, M_j),$$
which is computed using the log-likelihood function. If the objective function is maximised with respect to Cluster M j , then set
$$P(M_j \mid x_i) = 1 \quad \text{and} \quad P(M_k \mid x_i) = 0 \ \text{for all } k \neq j,$$
and hence, from Equation (12), the expected value of y is given by $E(y \mid x_i, M_j)$.
Approach 2: Prior Probability.
Treat $P(M_j)$ as a prior estimated using the methods given in Table 1 and use it in Equation (13) for soft clustering. By substituting $P(M_j)$ from Table 1 into Equation (13), we can compute $E(y \mid x)$ for $y \in (0, 1)$.
After calculating $E(y \mid x)$ for the interval (0, 1) using the beta mixture model, the boundary 1 needs to be taken into consideration using a logistic regression model. From the decision tree defined in Figure 4, the logistic regression provides the estimate at the first leaf node, $P(y = 1 \mid X = x)$. Then, the overall expectation of RR $y \in (0, 1]$ is
$$\begin{aligned} E(y \mid x) &= P(y = 1 \mid x)\, E(y \mid x, y = 1) + P(0 < y < 1 \mid x)\, E(y \mid x, 0 < y < 1) \\ &= P(y = 1 \mid x) \times 1 + \big(1 - P(y = 1 \mid x)\big)\, E(y \mid x, 0 < y < 1) \\ &= P(y = 1 \mid x) + \big(1 - P(y = 1 \mid x)\big)\, E(y \mid x, 0 < y < 1) \end{aligned} \quad (14)$$
where $E(y \mid x, 0 < y < 1)$ is the predicted RR from the beta mixture model using Approach 1 or 2.
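Equation (14) reduces to a simple combination of the two stages; the probabilities below are hypothetical:

```python
def two_stage_expectation(p_one, e_cont):
    """Equation (14): E(y|x) = P(y=1|x) + (1 - P(y=1|x)) * E(y|x, 0<y<1)."""
    return p_one + (1.0 - p_one) * e_cont

# Hypothetical stages: logistic model gives P(y=1|x) = 0.2,
# beta mixture gives E(y|x, 0<y<1) = 0.55
expected_rr = two_stage_expectation(0.2, 0.55)  # 0.2 + 0.8 * 0.55 = 0.64
```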

4. Results

The linear model had an adjusted $R^2$ of 0.69, considerably higher than reported for most models of RR (e.g., see Bellotti and Crook (2012); Loterman et al. (2012)), which could be explained by the richness of the data, especially the collections information. We expected the linear regression model to be misspecified due to the bounded range of the outcome variable, and this was confirmed by the residual vs. fitted plot for the model and a Breusch-Pagan test for heteroscedasticity (p < 0.0001).
For the beta mixture model, we used all variables for X, but variable selection for W based firstly on the output of stepwise selection using AIC in linear regression and then on a series of likelihood ratio tests. The result was the selection of four variables for W: pre-recovery rate, post balance, customer payment frequency and credit bureau score. Table 2 shows parameter estimates for η and γ for the two clusters, along with coefficient estimates under standard beta regression in the interval (0, 1) for comparison.
In Table 2, there are “NA” values for some of the p-values in the beta mixture model because the estimation algorithm could not produce reliable standard errors in these cases. We can see that the significance of variables was diluted across the two clusters. For instance, credit bureau score was significant in the standard beta regression with a p-value of 0.0022, but in the beta mixture model it was not significant for either cluster at the 5% significance level. Where the estimates were significant (at the 5% level), the directions of association of the coefficient estimates in the beta mixture model were mostly consistent across the two clusters, although the magnitudes differed; Pre-Recovery Rate in the $\gamma$ component was the only exception. The model also demonstrated some interesting significant associations with RR: taking insurance was associated with higher recoveries, and having a record at the bad debt bureau was associated with lower recovery rates. In addition, pre-purchase recoveries were positively correlated with future RR, although the total number of calls to the customer had a negative association, perhaps because these were difficult customers from whom to collect, hence requiring more intervention.
Following the procedure in Figure 5, the expected value of RR conditional on cluster $M_j$ was calculated based on the parameters $\eta$ and $\gamma$ estimated in Table 2. Since it was too time consuming to perform kernel density estimation on 29 variables, we reduced the dimension to six using PCA, which greatly shortened the running time of the two clusters' density estimations. Nevertheless, some information is inevitably lost during dimension reduction, which may result in weaker estimates. Figure 6 shows histograms of the expected value of RR conditional on each cluster for the test dataset. The shapes of the two clusters are similar, except that Cluster 2 has more estimates in the range 0.2 to 0.6.
Figure 7 shows four histograms of predicted RR corresponding to the four different priors defined in Table 1, contrasted with the true RR. The predicted value of the beta mixture model combined with the logistic regression model was calculated by applying the formula derived in Equation (14). Models with the different priors performed in a similar way. Importantly, they were all able to model the bimodal nature of the RR. The figure shows that none of the models were good at predicting the extreme values of RR close to 0 or 1, but this follows naturally from the fact that these predictions are estimates of expected values of RR, through Equation (12), albeit conditional on predictor variables, and thus do not represent the extremes of the distribution well. Further detail can be seen in Figure 8, which shows predicted RR against true RR. The strong correlation between predicted and true RR is clear. However, it is noticeable that, when true RR was around 0.6, the model tended to under-estimate for some observations. This was because the model was not perfect at detecting observations in Cluster 2, which suggests future improvements to the model to enhance its capacity to predict the correct latent cluster.

Model Performance

Predictive performance was measured using K-fold cross validation with three performance measures popular in the literature on RR estimation: mean squared error (MSE), mean absolute error (MAE) and mean aggregate absolute error (MAAE). Since the sample size (8237) was relatively large and model estimation time was long, K = 3 was chosen. Let n be the sample size. Then,
$$MSE = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n/K} \sum_{i=1}^{n/K} (\hat{y}_{ki} - y_i)^2 = \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n/K} (\hat{y}_{ki} - y_i)^2,$$
$$MAE = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n/K} \sum_{i=1}^{n/K} |\hat{y}_{ki} - y_i| = \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n/K} |\hat{y}_{ki} - y_i|.$$
MAAE is the MAE at segment level (Thomas and Bijak 2015) and is defined here as the absolute value of the mean error within each segment, averaged over segments:
$$MAAE = \frac{1}{V} \sum_{i=1}^{V} \left| \frac{1}{|S_i|} \sum_{j \in S_i} (y_j - \hat{y}_j) \right|$$
where V is the number of segments, expressed as disjoint index sets $S_1, \ldots, S_V$. The segments could express different characteristics, e.g., risk bands; for this study, however, each segment was a different random sample from the test data, with approximately the same sample size and jointly exhaustive. We used V = 100 since this balanced the number of segments against the number of observations in each segment. MSE, MAE and MAAE are all penalty measures, so the smaller the value, the better the model. Since RR is a financial ratio between 0 and 1, the MAE reflects the size of the error in an intuitive and direct way. If one is interested in the segment portfolio level, then the MAAE should be used.
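The three performance measures can be computed directly; the segment assignment for MAAE below uses a random equal-size partition, as in the study, and the data values are hypothetical:

```python
import numpy as np

def mse(y, yhat):
    """Mean squared error."""
    return float(np.mean((yhat - y) ** 2))

def mae(y, yhat):
    """Mean absolute error."""
    return float(np.mean(np.abs(yhat - y)))

def maae(y, yhat, V, seed=0):
    """Mean aggregate absolute error: |mean error| per random disjoint
    segment, averaged over the V segments."""
    rng = np.random.default_rng(seed)
    segs = np.array_split(rng.permutation(len(y)), V)
    return float(np.mean([abs(np.mean(y[s] - yhat[s])) for s in segs]))

y_true = np.array([0.2, 0.8, 1.0, 0.5])
y_pred = np.array([0.3, 0.7, 0.9, 0.5])
# mse -> 0.0075, mae -> 0.075; maae never exceeds mae for equal-size segments
```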
All models were trained on the same partitions of data into cross-validation folds, to avoid bias introduced by different samples. Table 3 shows the results. There was little difference in predictive performance between the various linear regressions, with or without variable selection or the Lasso penalty. The last linear model, “excluding Dataset 2”, was built without predictor variables from Dataset 2. It showed noticeably worse performance than the other linear models, especially for MAE, which demonstrates that including past recoveries data (i.e., Dataset 2) improved performance. The standard beta regression model, with and without one-inflation, performed much worse than linear regression, but the beta mixture model with logistic regression gave the best performance on all three measures. The different priors gave similar performance, although the Approach 1 method of cluster assignment (maximum log-likelihood) was slightly worse than the Approach 2 soft clustering methods.

5. Conclusions

Linear regression, beta regression, inflated beta regression, and a beta mixture model combined with logistic regression were applied to model the recovery rate of non-performing loans. The models’ predictive performances were measured using mean squared error, mean absolute error and mean absolute aggregate error under three-fold cross validation. To produce predictions from the beta mixture model, hard and soft clustering methods were developed, and the soft clustering approaches gave marginally better predictive performance. Theoretically, the proposed beta mixture model combined with logistic regression should be well suited to predicting recovery rates for these data, since it models the multimodality in the dataset and treats the boundary value separately. Indeed, we found that it achieved the best results among the models. Stepwise linear regression also achieved relatively good performance; however, its normality and homoscedasticity assumptions did not hold. In our experiments, we also found that including previous collections data boosted predictive performance.
We believe the beta mixture model is useful for modelling RR because it captures different servicing strategies. In our study, the cluster with mode around 0.55 likely corresponds to loans for which the debt servicer has agreed with the borrower to repay only a proportion of the outstanding debt. There may be servicing strategies in other bad debt portfolios that could be discovered using a similar mixture model or clustering approach. We developed a technique to predict the correct latent cluster for new observations, and it works well. However, the results suggest that further work to refine this aspect of the model could yield improved performance.
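The soft-clustering combination of the two stages can be sketched as follows. This is a minimal illustration under our reading of the model — stage one gives a boundary probability P(RR = 1) from the logistic model, stage two gives cluster priors P(M_j) and per-cluster beta-component means E(RR | M_j, x) — and the function name and all numbers are hypothetical:

```python
import numpy as np

def predict_rr(p_boundary, cluster_priors, cluster_means):
    """Soft-clustering prediction for the two-stage model:
    E(RR | x) = P(RR=1 | x) * 1 + (1 - P(RR=1 | x)) * sum_j P(M_j) E(RR | M_j, x).

    p_boundary     : P(RR = 1 | x) from the stage-one logistic model
    cluster_priors : P(M_j), e.g. Flexmix pi_j, the training-set cluster
                     size ratio, or the indifferent prior 1/m per cluster
    cluster_means  : E(RR | M_j, x), one beta-component mean per cluster
    """
    p = np.asarray(cluster_priors, dtype=float)
    mu = np.asarray(cluster_means, dtype=float)
    assert np.isclose(p.sum(), 1.0), "cluster priors must sum to 1"
    return p_boundary + (1.0 - p_boundary) * float(p @ mu)

# Indifferent prior over m = 2 clusters, hypothetical component means.
print(predict_rr(0.1, [0.5, 0.5], [0.2, 0.55]))  # 0.1 + 0.9 * 0.375 = 0.4375
```

Hard clustering is the special case where the prior puts all mass on one cluster, so the prediction collapses to the boundary-adjusted mean of that single component.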

Author Contributions

Conceptualisation, H.Y. and T.B.; methodology, H.Y. and T.B.; validation, T.B.; investigation and statistical modelling, H.Y.; data analysis, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, T.B.; and supervision, T.B.


Funding

This research received no external funding.


Acknowledgments

We wish to thank the anonymous debt collection company for use of their data and their expertise, which was essential to understand the meaning and context of the data. We would also like to thank Tommaso Pappagallo who did preliminary data analysis as part of his MSci project, which was useful in taking this project work forward.

Conflicts of Interest

The authors declare no conflict of interest.


Abbreviations

The following abbreviations are used in this manuscript:

IRB: Internal ratings-based
RR: Recovery rate
LGD: Loss given default
PD: Probability of default
EAD: Exposure at default
EM: Expectation-Maximisation (algorithm)
MSE: Mean squared error
MAE: Mean absolute error
MAAE: Mean absolute aggregate error

Appendix A

Table A1. Descriptive statistics. n = 8237 . For numeric variables: min, mean (standard deviation), max. For factors, frequency (%age) for each level. All predictive data is collected prior to servicing.
Variable | Type | Description | Statistics
RR post | numeric | Recovery rate (outcome variable) | 0.000508, 0.280 (0.283), 1
Product | factor | Type of loan | C: 7468 (90.7%), R: 769 (9.3%)
Principal | numeric | Original loan amount | 0, 3120 (2330), 15000
Interest | numeric | Interest payments | 0, 551 (439), 3380
Insurance | numeric | Insurance fees | 0, 42 (84.6), 953
Late charges | numeric | Late charge fees | 0, 269 (109), 1470
Overlimit fees | numeric | Over credit limit fees | 0, 13.3 (24.6), 315
Credit limit | numeric | Credit limit | 0, 4560 (2660), 13800
Sex | factor | Sex | F: 3196 (38.8%), M: 5041 (61.2%)
Married | factor | Marriage status | 0: 1201 (14.6%), D: 518 (6.3%), M: 3929 (47.7%), O: 217 (2.6%), S: 2230 (27.1%), W: 142 (1.7%)
Age | numeric | Age | 1, 48.7 (11.1), 87
DelphiScore | integer | Credit bureau score | 0, 298 (138), 443
Bureau Sub 1 | factor | Loan is in the servicer’s bureau (1 = True) | 0: 1520 (18.5%), 1: 6717 (81.5%)
CustPaymentFreq | integer | Customer repayment frequency | 1, 7.56 (5.59), 29
Post Balance | numeric | Exposure amount at start of servicing | 0, 3130 (2630), 15900
Total paid amount | numeric | Total net paid amount | −275, 1200 (1100), 11200
Total calls | numeric | Total number of calls | 0, 104 (106), 911
Total contacts | numeric | Total number of contacts (except calls) | 0, 28.5 (26.5), 196
Bankreport Freq | numeric | Bank reporting frequency | 0, 11.6 (7.92), 26
Pre recovery rate | numeric | Recovery rate | −0.130, 0.258 (0.217), 2.89
Employer | factor | Employer known | EmployerProvided: 8053 (97.8%), NoInfo: 184 (2.2%)
Total number | integer | Total number of loan accounts | 0, 2.3 (2.43), 68


  1. Azzalini, Adelchi, and Giovanna Menardi. 2014. Clustering via nonparametric density estimation: The R package pdfCluster. Journal of Statistical Software 57: 1–26.
  2. Bank for International Settlements. 2001. The Internal Ratings-based Approach. Basel: Bank for International Settlements.
  3. Bellotti, Tony, and Jonathan Crook. 2012. Loss given default models incorporating macroeconomic variables for credit cards. International Journal of Forecasting 28: 171–82.
  4. Calabrese, Raffaella. 2012. Predicting bank loan recovery rates with a mixed continuous-discrete model. Applied Stochastic Models in Business and Industry 30: 99–114.
  5. Cribari-Neto, Francisco, and Achim Zeileis. 2010. Beta regression in R. Journal of Statistical Software 34: 1–24.
  6. Ferrari, Silvia, and Francisco Cribari-Neto. 2004. Beta regression for modelling rates and proportions. Journal of Applied Statistics 31: 799–815.
  7. Fraley, Chris, and Adrian E. Raftery. 2002. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97: 611–31.
  8. Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33: 1–22.
  9. Gruen, Bettina, Ioannis Kosmidis, and Achim Zeileis. 2012. Extended beta regression in R: Shaken, stirred, mixed, and partitioned. Journal of Statistical Software 48: 1–25.
  10. Gruen, Bettina, and Friedrich Leisch. 2007. Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis 51: 5247–52.
  11. Gruen, Bettina, and Friedrich Leisch. 2008. Flexmix version 2: Finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software 28: 1–35.
  12. Hastie, Trevor, and Brad Efron. 2013. lars: Least Angle Regression, Lasso and Forward Stagewise, R package version 1.2. Available online: (accessed on 18 February 2019).
  13. Ji, Yuan, Chunlei Wu, Ping Liu, Jing Wang, and Kevin R. Coombes. 2005. Applications of beta-mixture models in bioinformatics. Bioinformatics 21: 2118–22.
  14. Laurila, Kirsti, Bodil Oster, Claus L. Andersen, Philippe Lamy, Torben Orntoft, Olli Yli-Harja, and Carsten Wiuf. 2011. A beta-mixture model for dimensionality reduction, sample classification and analysis. BMC Bioinformatics 12: 215.
  15. Leisch, Friedrich. 2004. Flexmix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software 11: 1–38.
  16. Loterman, Gert, Iain Brown, David Martens, Christophe Mues, and Bart Baesens. 2012. Benchmarking regression algorithms for loss given default modeling. International Journal of Forecasting 28: 161–70.
  17. Mittelhammer, Ron C., George Judge, and Douglas Miller. 2000. Econometric Foundations, 1st ed. Cambridge: Cambridge University Press.
  18. Moustafa, Nour, Gideon Creech, and Jill Slay. 2018. Anomaly Detection System using Beta Mixture Models and Outlier Detection. In Progress in Computing, Analytics and Networking. Advances in Intelligent Systems and Computing. Edited by Prasant Kumar Pattnaik, Siddharth Swarup Rautaray, Himansu Das and Janmenjoy Nayak. Singapore: Springer, vol. 710.
  19. Nocedal, Jorge, and Stephen J. Wright. 1999. Numerical Optimization, 1st ed. Berlin: Springer.
  20. Papke, Leslie, and Jeffrey Wooldridge. 1996. Econometric methods for fractional response variables with an application to 401(k) plan participation rates. Journal of Applied Econometrics 11: 619–32.
  21. Qi, Min, and Xinlei Zhao. 2011. Comparison of modeling methods for loss given default. Journal of Banking & Finance 35: 2842–55.
  22. Thomas, Lyn, and Katarzyna Bijak. 2015. Impact of Segmentation on the Performance Measures of LGD Models. Available online: (accessed on 18 February 2019).
Figure 1. Histogram of recovery rates for 8237 loans after pre-processing described in Section 2. The stack at 1 shows the frequency of R R = 1 , whereas the stack at 0 shows the frequency of small R R > 0 .
Figure 2. Joining the three datasets.
Figure 3. Beta distribution. (a) Beta Distribution with Fixed ϕ ; (b) Beta Distribution with Fixed μ .
Figure 4. Estimating the expected value of RR using the two-stage decision tree model.
Figure 5. Prediction of RR conditional on each cluster M j .
Figure 6. E ( y | M j , X n e w ) based on the Test dataset, for the two clusters ( m = 2 ). (a) E ( y | M 1 , X n e w ) ; (b) E ( y | M 2 , X n e w ) .
Figure 7. Predicted RR on test data ( n = 2746 ) using beta mixture with four different priors, combined with logistic regression.
Figure 8. Predicted RR against true RR on test data ( n = 2746 ) using beta mixture with the indifferent prior, combined with logistic regression.
Table 1. Determining P ( M j ) in Approach 2.

Approach 2 | P ( M 1 ) | P ( M 2 )
π j prior | π j [ 1 ] , extracted from the EM algorithm | π j [ 2 ] , extracted from the EM algorithm
Prior based on training set cluster size ratio | Cluster 1 size / Total sample size | Cluster 2 size / Total sample size
Indifferent prior | 1/2 | 1/2
Table 2. η and γ estimated by EM algorithm. M1 and M2 represent Clusters 1 and 2. Columns give estimates and p-values for the beta mixture model in (0, 1) (M1, M2) and for standard beta regression in (0, 1) (Betareg).

Mean submodel (η):

Variable | M1 Estimate | Pr(>|z|) | M2 Estimate | Pr(>|z|) | Betareg Estimate | Pr(>|z|)
Product R | −0.03376 | 0.47711 | −0.00766 | 0.59733 | 0.02270 | 0.41766
Late Charges | 0.00042 | 0.00578 | 0.00115 | <0.0001 | 0.00072 | <0.0001
Overlimit Fees | −0.00105 | 0.07594 | 0.00145 | <0.0001 | 0.00018 | 0.52533
Credit limit | 0.00004 | NA | −0.00001 | NA | −0.00003 | <0.0001
Sex = Male | 0.03659 | 0.17453 | −0.01412 | 0.13364 | 0.00969 | 0.43796
Marital status = | | | | | |
Credit Bureau Score | 0.00059 | 0.10337 | 0.00007 | 0.07890 | 0.00038 | 0.00222
Bureau bad debt | −0.32990 | 0.01290 | −0.06936 | <0.0001 | −0.24123 | 0.00000
Cust Payment Freq | 0.06530 | <0.0001 | 0.03506 | <0.0001 | 0.05046 | <0.0001
Post Balance | −0.00106 | NA | −0.00127 | NA | −0.00103 | 0.00000
Total Paid Amount | 0.00004 | NA | −0.00038 | NA | −0.00014 | <0.0001
Total Calls | −0.00044 | 0.00515 | −0.00023 | 0.00275 | −0.00032 | <0.0001
Total Contacts | −0.00136 | 0.03257 | 0.00040 | 0.08116 | −0.00031 | 0.28402
Bank report Freq | −0.01719 | <0.0001 | −0.00407 | <0.0001 | −0.01117 | <0.0001
Pre recovery Rate | 0.56850 | <0.0001 | 3.63447 | <0.0001 | 2.26212 | <0.0001
Total Number | −0.00949 | 0.16151 | −0.00169 | 0.42951 | −0.00776 | 0.00651

Precision submodel (γ):

Variable | M1 Estimate | Pr(>|z|) | M2 Estimate | Pr(>|z|) | Betareg Estimate | Pr(>|z|)
Pre recovery Rate | 0.49096 | 0.00025 | −2.11510 | <0.0001 | −0.18488 | 0.01538
Post Balance | 0.00039 | <0.0001 | 0.00018 | NA | 0.00031 | 0.00000
Cust Payment Freq | 0.02949 | <0.0001 | 0.17612 | <0.0001 | 0.07759 | 0.00000
Credit Bureau Score | −0.00058 | 0.00458 | −0.00033 | 0.09534 | −0.00028 | 0.01388
Table 3. Predictive results using three-fold cross validation.

Model | MSE | MAE | MAAE
Linear regression:
Linear regression | 0.024984 | 0.114268 | 0.025894
Stepwise linear regression | 0.024752 | 0.113621 | 0.025700
Linear regression with Lasso | 0.025228 | 0.114847 | 0.023739
Linear regression, excluding Dataset 2 | 0.026822 | 0.121385 | 0.026303
Beta regression:
Standard beta regression | 0.085630 | 0.260459 | 0.161366
Inflated beta regression | 0.076650 | 0.216374 | 0.048466
Beta mixture model combined with logistic regression:
Max log-likelihood | 0.018750 | 0.095432 | 0.030629
Prior based on R Flexmix π j | 0.018460 | 0.091833 | 0.023991
Prior based on training set cluster size ratio | 0.019325 | 0.092225 | 0.022594
Indifferent prior | 0.018030 | 0.092399 | 0.026298

Ye, H.; Bellotti, A. Modelling Recovery Rates for Non-Performing Loans. Risks 2019, 7, 19.
