Abstract

Mixture cure models are widely adopted in credit scoring. They consist of two parts: an incident part, which predicts the probability of default, and a latency part, which predicts when customers are likely to default. Because the two parts describe two closely related aspects of credit risk, it is reasonable to expect the two sets of coefficients to be related. Moreover, in practice, results are difficult to interpret when the coefficients of the same variable have conflicting signs in the two parts. Most existing works either ignore the connection between the two sets of coefficients or impose a strict constraint between them. We propose a mixture cure model that promotes variable effect consistency through a sign-based penalty. The model is more flexible in that it allows the two sets of coefficients to differ in distribution and magnitude. To accommodate high-dimensional credit data, a group lasso penalty is also imposed for variable selection. Simulations show that the proposed method performs competitively with alternative methods in terms of estimation and prediction. Furthermore, an empirical study illustrates that the proposed method outperforms the alternative method and improves the interpretability of the results.

1. Introduction

Credit scoring is an effective and crucial approach for evaluating credit risk [1, 2]. Even a slight improvement in the predictive precision of credit scoring models can bring considerable benefits, so credit scoring has attracted increasing attention from scholars and practitioners. Many studies treat it as a classification problem that distinguishes noncreditworthy customers from creditworthy ones [2, 3]. These studies focus on classification techniques, including logistic regression, support vector machines, neural networks, and random forests [4, 5]. For example, Zhou et al. [6] proposed a logistic regression method with clustering analysis for credit risk evaluation, and Zhang et al. [7] proposed a cost-sensitive logistic regression model to assess credit risk. Considering the high cost and time consumption of credit scoring, a credit granting process using three-way decisions was proposed to make efficient credit decisions [8]. Since the exposure to risk and the losses caused by default are strongly related to the time at which customers default [9], predicting credit risk over time is of great significance for timely risk management.

Survival analysis, with its ability to predict the probability of default over time, was first applied to credit scoring in 1992 [10]. It can be more informative than a binary classification model. Subsequently, various survival analysis models have been proposed to predict credit risk over time. For instance, the Cox proportional hazards (PH) model has been adopted to predict early repayment and time to default in personal loans and to investigate the effect of different variables on time to default [11, 12]. In addition, macroeconomic factors and time-varying data have been incorporated into survival analysis to improve prediction performance in credit scoring [13, 14]. These models have been further extended by a survival gradient boosting decision tree approach to enhance prediction performance [15].

However, standard survival analysis assumes that, if the loan term is long enough, every customer will eventually default. In practice, a substantial proportion of customers will not default during the entire loan term. Mixture cure models, as applied in medicine, assume that some patients have been cured and will not die during the follow-up period. They are therefore more appropriate for the credit market and were first introduced to credit scoring by Tong et al. [16].

Recently, the mixture cure model, an extension of standard survival analysis, has been widely adopted in credit scoring for its ability to predict not only whether customers will default but also when they are likely to default. Results have shown that the mixture cure model is more suitable for credit data than standard survival analysis models and that the mixture cure model incorporating penalized splines has better prediction performance [17]. Mixture cure models have been further developed by identifying different risk patterns of customers and by considering the influence of competing risks and the relationship between the default times and the variables [18–20].

Mixture cure models consist of two parts: an incident part, which predicts the probability of default, and a latency part, which predicts when customers are likely to default. The two sets of coefficients in the two parts represent the effects of the variables on these two aspects of credit risk. Although the two aspects are closely related, most existing studies ignore the relationship between the two sets of coefficients. These works generally assume that there are no direct constraints between the two sets of coefficients, which may produce conflicting variable effects. For example, Dirick et al. [21] proposed a mixture cure model incorporating macroeconomic factors to predict credit risk. Their results show that customers’ annual income has opposite effects on whether and when to default. In other words, according to the results, customers with lower annual income have a lower probability of default, but they are more likely to default earlier. Such conflicting results are difficult to interpret and to apply in practice.

In fact, the two model parts describe two closely related aspects of credit, namely, the probability of default and the survival (nondefault) time. Customers with a high default probability are more likely to default earlier, so it is reasonable to expect that the two sets of coefficients are related. Theoretical derivations [22] and empirical analysis [23] also suggest that relaxing the independence of the two sets of coefficients can improve model performance. In [23], the independence assumption is relaxed by establishing a joint distribution of the defaulting predictor and the logarithm of the hazard rate. However, the two model parts still describe two different aspects of default, and the assumption of a joint distribution may be too strict. We therefore consider a more flexible model that allows the two sets of coefficients to differ in distribution and magnitude. Zhang et al. [24] proposed a sign consistency penalty that promotes similarity in sign to obtain more interpretable results. In this paper, we propose a variable effect consistency mixture cure model with a sign-based penalty. The proposed method promotes similarity in the signs of the variable effects in the two model parts to improve interpretability. To accommodate high-dimensional credit data, a group lasso penalty is also imposed for variable selection [25].

The contributions of this paper are as follows. First, we propose a variable effect consistency mixture cure model. The proposed method can lead to more interpretable results by promoting the similarity in the signs of coefficients in the two parts of the mixture cure model. Second, a group lasso penalty is imposed to select important variable subgroups and accommodate the high-dimensional data. Third, simulation and empirical analysis of credit data illustrate that the proposed method can improve the prediction accuracy as well as interpretability, which has important practical significance for applying the prediction results to the credit business.

The remainder of the paper is organized as follows. Section 2 introduces the variable effect consistency mixture cure model. The computational algorithm is presented in Section 3. Simulations are carried out in Section 4. An empirical study is presented in Section 5. Finally, conclusions are discussed in Section 6.

2. Methods

In this paper, we consider credit data with customers and variables. Denote as the time to default and as the time of censoring. Let be the unobservable binary variable with indicating that the customer is cured and will not default (), whereas indicates an uncured customer and will eventually default.

Denote as the censoring indicator of customer , where and are the time to default and censoring time of customer , respectively. if censored and otherwise. Note that there are three possible credit states of customers. (a) and : censored, cured customers who will not default; (b) and : censored, uncured customers who will eventually default and have not been observed to default in censoring time ; (c) and : uncensored, uncured customers who have been observed to default.

Denote . Note that, in many practical cases, variables can be naturally grouped. For instance, many categorical variables may have several levels and can be represented by subgroups of dummy variables [26]. The additive model with polynomial or nonparametric components can be expressed as groups of basis functions [27]. In addition, grouping structure can also be introduced by taking advantage of prior knowledge. For example, genes belonging to the same biological pathway can also be considered a group [28]. Let be the variable vector with subgroups. is the -th subgroup of variable vector, and . The observable data are , .

The incident part of the mixture cure model describes the probability of default, for which we adopt a logistic regression model. Let be the probability of cured (nondefault) customer .where is the intercept, is the vector of unknown regression coefficients, and the -th subgroup of the coefficient vector is .
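As a minimal sketch of the incident part, the snippet below evaluates a logistic model for one customer. The function names, the argument layout, and the convention that the linear predictor models the probability of being uncured (so that the cure probability is its complement) are assumptions made for illustration. The opposite sign convention works equally well as long as it is applied consistently.

```python
import math


def uncured_probability(x, alpha0, alpha):
    """Logistic probability that a customer is uncured (will eventually default).

    Assumed convention: a larger linear predictor means a higher default risk.
    """
    eta = alpha0 + sum(a * v for a, v in zip(alpha, x))
    # Numerically stable evaluation of the logistic function.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def cure_probability(x, alpha0, alpha):
    """Probability that a customer is cured, i.e., never defaults."""
    return 1.0 - uncured_probability(x, alpha0, alpha)
```

Under this convention, a positive coefficient raises the probability of default, which matches the way the coefficients of the incident part are interpreted in the empirical study.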

In the latency part, for uncured customers, we adopt an exponential model for survival. Note that the exponential model has been commonly adopted in mixture cure models [29, 30]. It is easy to capture the relations between the probability and the time to default for it includes only one parameter [23]. The survival function iswhere is the hazard function of customer , is the intercept, is the vector of unknown regression coefficients, and the -th subgroup of the coefficient vector is . Survival function indicates the survival probability of uncured customers in time , that is, the probability of default in time given the customer will default.
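The exponential latency model can be sketched as follows. The log-linear link between the covariates and the hazard rate, as well as the function and argument names, are illustrative assumptions rather than the paper's exact parameterization.

```python
import math


def hazard_rate(x, beta0, beta):
    """Constant hazard of an uncured customer (assumed log-linear in x)."""
    return math.exp(beta0 + sum(b * v for b, v in zip(beta, x)))


def uncured_survival(t, rate):
    """Exponential survival function: P(no default by time t | uncured)."""
    if t < 0:
        raise ValueError("time must be non-negative")
    return math.exp(-rate * t)
```

Because the exponential model has a single rate parameter, a positive coefficient in this sketch increases the hazard and therefore shortens the expected time to default.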

The mixture cure model can be given bywhere is the survival probability of customer in time .
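To illustrate how the two parts combine, the sketch below computes a population survival function as a mixture of cured customers, who never default, and uncured customers with exponential survival. The names `p_uncured` and `rate` are hypothetical, and the weighting by the uncured probability is an assumed convention.

```python
import math


def population_survival(t, p_uncured, rate):
    """Survival probability at time t for a customer drawn from the mixture.

    Cured customers (probability 1 - p_uncured) survive forever;
    uncured customers survive with probability exp(-rate * t).
    """
    return (1.0 - p_uncured) + p_uncured * math.exp(-rate * t)
```

Unlike a standard survival function, this curve levels off at the cure probability `1 - p_uncured` instead of decaying to zero.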

For observable data , the objective function can be written as follows.where is the log-likelihood function, which iswith , and . Here, is the penalty function, which iswhere and are tuning parameters, is the norm, and is the sign function. In many practical cases, grouping structure arises naturally. In addition, it is hard to interpret the results when coefficients corresponding to the same variables have conflicting signs. Therefore, we consider a flexible mixture cure model with sign consistency and group variable selection penalties. The first penalty is a group lasso penalty. It can conduct estimation and group variable selection by shrinking the coefficients of insignificant groups to 0. It considers grouping structures and has good prediction performance [26]. The second penalty is the sign consistency penalty. It promotes the sign consistency of and in the two parts of the model, which can lead to more interpretable results [31].
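The two ingredients of the objective function can be sketched as below. This is an illustration under stated assumptions, not the paper's exact formula: the likelihood assumes an exponential latency model, the group lasso term uses the common square-root-of-group-size weights, and the sign consistency term is taken to be the squared difference of signs. All function and argument names are hypothetical.

```python
import math


def log_likelihood(times, observed, p_uncured, rates):
    """Observed-data log-likelihood of a mixture cure model.

    observed[i] is True when the default of customer i was observed.
    """
    total = 0.0
    for t, d, p, lam in zip(times, observed, p_uncured, rates):
        if d:
            # Observed default: uncured, with density lam * exp(-lam * t).
            total += math.log(p) + math.log(lam) - lam * t
        else:
            # Censored: either cured, or uncured and not yet defaulted.
            total += math.log((1.0 - p) + p * math.exp(-lam * t))
    return total


def _sign(v):
    return (v > 0) - (v < 0)


def penalty(alpha, beta, groups, lam1, lam2):
    """Group lasso plus sign-consistency penalty (assumed form).

    groups is a list of index lists, one per variable subgroup.
    """
    group_term = 0.0
    for idx in groups:
        weight = math.sqrt(len(idx))
        group_term += weight * math.sqrt(sum(alpha[j] ** 2 for j in idx))
        group_term += weight * math.sqrt(sum(beta[j] ** 2 for j in idx))
    sign_term = sum((_sign(a) - _sign(b)) ** 2 for a, b in zip(alpha, beta))
    return lam1 * group_term + lam2 * sign_term
```

In this sketch, a pair of coefficients with opposite signs contributes 4 to the sign term, a pair with the same sign contributes 0, and a pair in which exactly one coefficient is zero contributes 1.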

3. Computational Algorithm

In this section, the Expectation Coordinate Descent (ECD) algorithm is developed to optimize the objective function. In E-step, a latent unobserved is introduced to obtain a complete log-likelihood function. In CD-step, group coordinate descent is adopted to iteratively update a single parameter with the remaining parameters fixed at their most recent values. Sign function is difficult to optimize for its discontinuity and nondifferentiability. Therefore, referring to [24, 32], we propose the approximation as follows:where is a small positive constant (more discussions below).
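One commonly used smooth surrogate for the sign function is x / sqrt(x^2 + tau^2). The sketch below assumes this form for illustration. The function names are hypothetical, and `tau` is left as a required argument so that no particular value is implied.

```python
import math


def smooth_sign(x, tau):
    """Differentiable approximation of sign(x); tau > 0 controls sharpness."""
    return x / math.sqrt(x * x + tau * tau)


def smooth_sign_penalty(alpha, beta, tau):
    """Sum of squared differences between the approximate signs."""
    return sum(
        (smooth_sign(a, tau) - smooth_sign(b, tau)) ** 2
        for a, b in zip(alpha, beta)
    )
```

As `tau` shrinks, `smooth_sign` approaches the true sign function away from zero, but its slope at the origin, `1 / tau`, grows, which is one way to see the trade-off between approximation quality and stability.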

The ECD algorithm updates in the –th iteration as follows.

3.1. E-Step

Denote the observation of the latent as and denote the complete data as . The complete log-likelihood iswhere

The expectation of iswhere

When customer is observed to default (), the unobserved , whereas the expectation of is related to the probability of cured and the uncured but censored customers.

In E-step, we take the expectation of with respect to given the complete data .where
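The E-step weight for a single customer can be sketched as follows, assuming an exponential latency model and the convention that `p_uncured` is the probability of being uncured. The function name and arguments are illustrative.

```python
import math


def e_step_weight(t, observed, p_uncured, rate):
    """Posterior probability that a customer is uncured given the data.

    An observed default implies the customer is uncured (weight 1).
    For a censored customer, Bayes' rule weighs "uncured but still
    surviving at time t" against "cured".
    """
    if observed:
        return 1.0
    surviving = p_uncured * math.exp(-rate * t)
    return surviving / ((1.0 - p_uncured) + surviving)
```

The longer a censored customer has gone without defaulting, the smaller this weight becomes, so long-surviving customers are increasingly treated as cured in the subsequent update.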

3.2. CD-Step

In CD-step, group coordinate descent is adopted to iteratively update . The intercept is updated by

For , we adopted a fast unified algorithm, Groupwise Majorization Descent (GMD) proposed in [33] to solve the group lasso penalized objective function in (4). The upper bound of the objective function is as follows:

Here, is -length vector, and is a constant as follows:where is the maximum eigenvalue function.
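The group-wise update at the heart of a majorization descent step can be sketched as below. The sketch covers only a smooth loss plus a group lasso penalty and omits the sign consistency term. `grad_k` stands for the gradient of the loss with respect to the group, `gamma_k` for an upper bound on the largest eigenvalue of the corresponding Hessian block, and `weight` for the group weight. All of these names are assumptions for illustration.

```python
import math


def gmd_group_update(beta_k, grad_k, gamma_k, lam, weight):
    """One majorization-descent update of a coefficient group.

    Minimizes the quadratic upper bound of the loss plus
    lam * weight * ||beta_k||_2, which has a closed-form
    group soft-thresholding solution.
    """
    z = [gamma_k * b - g for b, g in zip(beta_k, grad_k)]
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam * weight:
        # The whole group is shrunk to zero (group deselected).
        return [0.0] * len(beta_k)
    scale = (1.0 - lam * weight / norm) / gamma_k
    return [scale * v for v in z]
```

Because the threshold acts on the norm of the whole group, the coefficients in a subgroup enter or leave the model together.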

Similarly, the intercept is updated by

For , consider the optimization function:

Here, is -length vector, and is a constant as follows:

The tuning parameters, and , are selected by 5-fold cross-validation. The parameter in the approximation of the sign function controls the degree of approximation [24]. A smaller leads to a better approximation but less stable estimation. The proposed method is valid as long as is not too large, and the parameters with different signs can be distinguished [34]. Therefore, as suggested in [31], we set , which leads to satisfactory results.
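Selecting two tuning parameters by 5-fold cross-validation can be sketched as a grid search. In the sketch below, `fit_and_score` is a hypothetical user-supplied callable that fits the model on the training indices and returns a validation loss (for example, a negative log-likelihood). The grid and the seed are placeholders.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Randomly partition indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_tuning(n, grid, fit_and_score, k=5, seed=0):
    """Return the (lam1, lam2) pair with the smallest total validation loss."""
    folds = k_fold_indices(n, k, seed)
    best, best_score = None, float("inf")
    for lam1, lam2 in grid:
        score = 0.0
        for f in range(k):
            valid = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            score += fit_and_score(train, valid, lam1, lam2)
        if score < best_score:
            best, best_score = (lam1, lam2), score
    return best
```

Using the same folds for every candidate pair keeps the comparison between grid points fair.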

The ECD algorithm is summarized in Table 1.

4. Simulations

In this section, simulation experiments illustrate the performance of the proposed method compared with alternative methods. The proposed method is a mixture cure model with group lasso and sign consistency penalties (MCGS). The two alternative methods are the standard mixture cure model without variable selection or a sign consistency penalty (Full) and the mixture cure model with a group lasso penalty (MCG). For comparability, both alternative methods adopt logistic regression in the incident part and the exponential model in the latency part.

Here, we set sample size and consider low-dimensional data with and high-dimensional data with . The censoring time is generated from an exponential distribution with censoring rates . We consider three examples regarding different grouping structures of coefficients and different types of variables. The true values of coefficients are generated according to the following settings in three examples:Example 1: for each subgroup, we set . Intragroup variables and are generated from a multivariate normal distribution with the correlation coefficient , whereas intergroup variables are independent. Denote the true coefficients as and . The coefficients of the two scenarios are shown as follows:Scenario 1:Scenario 2:Example 2: the settings are similar to Example 1 except for the subgroup settings. We set 15 variables in the first subgroup, 5 variables in the second subgroup, and 10 variables for the remaining subgroups. The coefficients are shown as follows:Example 3: consider the case with some discrete variables. For each subgroup, we set . A latent variable is generated from a multivariate normal distribution with the intragroup correlation coefficient and intergroup correlation independent. The coefficient setting is the same as Example 2. is defined as follows:
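A data-generating process of this kind can be sketched as follows. Every numeric setting, the shared-factor construction of the intragroup correlation, and the use of an exponential rate parameter for censoring (rather than a target censoring proportion) are assumptions for illustration and do not reproduce the paper's settings.

```python
import math
import random


def simulate_customer(rng, alpha0, alpha, beta0, beta,
                      group_size, rho, censor_hazard):
    """Draw covariates, cure status, and a possibly censored default time."""
    x = []
    for _ in range(len(alpha) // group_size):
        # A shared factor gives correlation rho within a group;
        # different groups are independent.
        shared = rng.gauss(0.0, 1.0)
        x += [math.sqrt(rho) * shared
              + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
              for _ in range(group_size)]
    eta = alpha0 + sum(a * v for a, v in zip(alpha, x))
    uncured = rng.random() < 1.0 / (1.0 + math.exp(-eta))
    rate = math.exp(beta0 + sum(b * v for b, v in zip(beta, x)))
    default_time = rng.expovariate(rate) if uncured else math.inf
    censor_time = rng.expovariate(censor_hazard)
    observed = default_time <= censor_time
    return x, min(default_time, censor_time), observed, uncured


rng = random.Random(1)
sample = [
    simulate_customer(rng, 0.0, [1.0, 1.0, 0.0, 0.0],
                      0.0, [0.5, 0.5, 0.0, 0.0],
                      group_size=2, rho=0.5, censor_hazard=0.5)
    for _ in range(200)
]
```

Cured customers are always censored in this construction, whereas uncured customers are censored only when the censoring time precedes their default time.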

The performance of each model is measured by five measures. Denote , as the estimate of , as the estimate of , and as the estimate of . The true positive rate (TPR), false positive rate (FPR), and mean square error (MSE) with respect to , , and can be written as follows:where

The relative root mean square error of the cure rate estimation () and the relative root mean square error of the hazard function estimation () are
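The selection and estimation measures can be sketched as below. The sketch assumes that TPR and FPR are computed coefficient by coefficient (not group by group) and that the MSE is averaged over coefficients. The paper's exact definitions may differ, and the function names are hypothetical.

```python
def selection_rates(estimate, truth, tol=1e-8):
    """True and false positive rates of the set of nonzero estimates."""
    tp = sum(1 for e, t in zip(estimate, truth) if abs(e) > tol and t != 0)
    fp = sum(1 for e, t in zip(estimate, truth) if abs(e) > tol and t == 0)
    n_pos = sum(1 for t in truth if t != 0)
    n_neg = len(truth) - n_pos
    tpr = tp / n_pos if n_pos else float("nan")
    fpr = fp / n_neg if n_neg else float("nan")
    return tpr, fpr


def mean_squared_error(estimate, truth):
    """Average squared difference between estimated and true coefficients."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)
```

A good selection method has a TPR close to 1 and an FPR close to 0, meaning that it retains the truly relevant variables while discarding the irrelevant ones.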

Tables 2–5 show the mean TPR, FPR, and MSE of the coefficients, as well as the standard deviations, over the 100 replicates for each example.

As indicated in Tables 2–5, the two group selection methods (MCGS and MCG) perform considerably better than the Full method. This is expected, since the group lasso can select important subgroups of variables. Between the two group selection methods, the proposed method has competitive performance compared with the MCG method, which indicates that promoting sign consistency can improve estimation. For instance, under Scenario 1 in Example 1 with and in Table 2, the mean MSEs of , , and for the proposed method are 0.12, 0.04, and 0.09, respectively, compared with 1.00, 0.05, and 0.63 for the MCG method and 17.3, 2.38, and 11.56 for the Full method.

Tables 6–8 show the mean RMSE of and , as well as the standard deviations, over the 100 replicates for each example. The results illustrate the performance in terms of predicting the probability of nondefault and survival.

As shown in Tables 6–8, the prediction performance of the group selection methods is considerably better than that of the Full method, and the proposed method has competitive performance compared with the MCG method. For example, in Example 2 with and in Table 8, the mean and for the proposed method are 0.01 and 0.07, respectively, compared with 0.04 and 0.12 for the MCG method and 0.27 and 10.01 for the Full method. In addition, a comparison of the low- and high-dimensional settings shows that the group selection methods have greater advantages in prediction performance when the dimensionality is higher.

Results of simulation reveal that the proposed variable effect consistency mixture cure model can improve the performance in terms of estimation and prediction compared with alternative methods.

5. Empirical Study

In this section, we apply the proposed method to real data on credit loans. The data come from the personal loan department of a city commercial bank in China and contain 4796 personal loan samples from 2014 to 2019 after preprocessing. The data include mortgage loans and credit loans, covering consumer durables, personal housing decoration loans, and other personal consumption loans. Censoring time is the interval between the loan value date and either default or the end of observation (June 1, 2019). Therefore, censoring times vary from customer to customer, with a mean of 1.93 years and a standard deviation of 0.8. Customers whose time to default is longer than the censoring time are censored (). Of the 4796 customers, 47 are censored. After the discrete variables are transformed into dummy variables, the data contain 27 variables. Table 9 provides a list of variables and their descriptions.

In this section, the alternative method is the mixture cure model with group lasso (MCG). Unlike in the simulation, the true values of the parameters and are unknown for real data. Following [30, 35], we adopt (1) the log-rank statistic and (2) the negative log-likelihood to evaluate the performance of the models instead of the five measures used in the simulation.

The log-rank statistic is a commonly used indicator in survival analysis for testing the null hypothesis that there is no difference in survival distribution between two or more independent groups. It is calculated by cross-validation. We sequentially take samples as the validation set and the remaining samples as the training set, apply the proposed and alternative methods to obtain the estimates of and , and then calculate and for the validation set. Results of and are based on 10 replicates. We divide the calculated into two groups at the median and calculate the log-rank statistic, and we do the same for the calculated . The mean log-rank statistics of the proposed and alternative methods are 5.6 and 4.3, respectively, indicating better performance of the proposed method.
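The median split and the two-group log-rank statistic can be sketched as below, using the standard hypergeometric variance. The function names are hypothetical, and the sketch is written for clarity, not speed.

```python
import statistics


def split_at_median(scores):
    """Return index lists of customers at or below, and above, the median."""
    m = statistics.median(scores)
    low = [i for i, s in enumerate(scores) if s <= m]
    high = [i for i, s in enumerate(scores) if s > m]
    return low, high


def log_rank_statistic(times_a, events_a, times_b, events_b):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    pooled = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    obs_minus_exp, variance = 0.0, 0.0
    for t in sorted({u for u, e in pooled if e}):
        n_a = sum(1 for u in times_a if u >= t)   # at risk in group A
        n_b = sum(1 for u in times_b if u >= t)   # at risk in group B
        d_a = sum(1 for u, e in zip(times_a, events_a) if e and u == t)
        d_b = sum(1 for u, e in zip(times_b, events_b) if e and u == t)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0
```

A larger statistic means that the survival curves of the two risk groups are better separated, which is why a higher value is read as better discrimination by the fitted model.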

Figure 1 shows the Kaplan–Meier curves stratified by different groups. Kaplan–Meier curves are commonly used to describe the change of the survival probability over time in survival analysis [36, 37]. The probability of being cured is negatively related to , and the survival time is negatively related to . In Figure 1, one group is denoted by “low risk,” with lower (a) or lower (b), whereas the other group is denoted by “high risk.” As indicated in Figure 1, the curves of the two groups follow clearly different trends. Customers with lower and have lower risk and are less likely to default.
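For reference, the Kaplan–Meier estimator behind such curves can be sketched in a few lines. The function name and the return format, a list of (event time, survival probability) pairs, are choices made for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimates at each distinct event time.

    events[i] is truthy when the default of customer i was observed
    and falsy when the observation was censored.
    """
    survival, curve = 1.0, []
    for t in sorted({u for u, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        defaults = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - defaults / at_risk
        curve.append((t, survival))
    return curve
```

Applying the function separately to the low-risk and high-risk groups yields the two step functions that a stratified Kaplan–Meier plot displays.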

To assess the performance of the model, the data are randomly divided into a training set and a test set at a ratio of 2:1. The training set is used for fitting the model, and the test set is used for evaluating the prediction performance of the fitted model. The tuning parameters are selected by 5-fold cross-validation. The mean (standard deviation) negative log-likelihood of the proposed method (MCGS) and the alternative method (MCG) is 106.04 (16.04) and 118.60 (19.09), respectively, based on 100 replicates. This indicates that the proposed method performs better than the alternative in terms of model fit and prediction.

Table 10 reports the estimates from the MCGS and MCG methods. A positive coefficient indicates that the variable is positively related to the probability of default, and a positive coefficient indicates that the variable is negatively related to default time. The probability of default and the default time are two closely related aspects of credit. Compared with the alternative method, the signs of the and of the proposed method are promoted to be more consistent, whereas the business type in the MCG model has an opposite effect on the probability and time to default. The results show that promoting variable effect consistency can improve prediction performance as well as interpretability.

The coefficient results of the proposed method reveal that interest rate, loan line, business type, gender, education, and employment status are important variables that affect the probability of default and the time to default. Loan term, age, medical insurance, entrusted payment, early repayment, annual household income, type of workplace, and housing status have no significant impact on credit. The impact of occupation and professional title on credit is not clear.

From the perspective of loan products, we find that interest rate has a negative impact on credit. This is not surprising, as higher interest rates lead to higher costs, so customers are more likely to default. The loan line has a positive effect. One possible explanation is that low-risk customers are more likely to obtain a higher loan line. Loan term, entrusted payment, and early repayment have no effect on credit. The coefficients of business type reveal that, compared with other personal loans, loans for consumer durables are more likely to default.

From the perspective of customer characteristics, employment status has a positive effect on credit. Age, annual household income, housing status, and type of workplace have no significant effect on credit. Compared with women, men are more likely to default. This is consistent with the results of [38] and with men’s greater preference for risk [39]. Customers with higher education are less likely to default: a bachelor’s degree or above has a positive effect on credit. Generally, customers with higher education have a better chance of obtaining decent jobs and income, so they tend to maintain good credit records and are less likely to default. Compared with other employment groups, such as self-employed, freelance, and unemployed customers, the employed group has a more stable income and is less likely to default.

6. Conclusions

The mixture cure model is widely adopted in credit scoring for its ability to predict both whether customers will default and when they are likely to default. However, most existing studies ignore the relationship between the two sets of variable effects in the two model parts, which may produce conflicting variable effects. Such results can be difficult to interpret and to apply in practice.

In this paper, we propose a variable effect consistency mixture cure model that promotes similarity in the signs of the variable effects in the two model parts by imposing a sign consistency penalty. To accommodate high-dimensional credit data, we also impose a group lasso penalty to conduct variable selection and parameter estimation. Simulations show that the proposed method has competitive performance compared with the MCG method and substantially outperforms the Full method in terms of estimation and prediction. Furthermore, the empirical study illustrates that the proposed method can improve prediction performance as well as interpretability. The results of the variable effect consistency mixture cure model also offer additional insights into the relationship between the variable effects before and after the loan.

Data Availability

The raw/processed data used in the empirical study cannot be shared at this time as the data also form part of an ongoing study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Office for Philosophy and Social Sciences of China under Grant no. 20&ZD137 and the National Bureau of Statistics of China under Grant no. 2020ZX20.