Abstract

We introduce an imbalanced data classification approach based on logistic regression significant discriminant and Fisher discriminant. First, a key indicators extraction model based on logistic regression significant discriminant and correlation analysis is derived to extract features for customer classification. Second, a customer scoring model is established as a linear weighted sum whose weights come from the Fisher discriminant. Third, a customer rating model is constructed in which the number of customers across ratings follows a normal distribution. The proposed model and the classical SVM classification method are evaluated in terms of their ability to correctly classify consumers as default or nondefault customers. Empirical results using the data of 2157 customers in financial engineering suggest that the proposed approach performs better than the SVM model in dealing with imbalanced data classification. Moreover, our approach helps banks and bond investors locate qualified customers.

1. Introduction

Imbalanced data sets are common in practice [1]. A data set is imbalanced when the number of samples in one class is much larger than the number of samples in another class. Many classification approaches built on imbalanced data sets perform well on one class but badly on the other [2]. In these cases, much more attention should be paid to the rare class. For example, in risk management, the number of default customers is only 1% to 4% of all loan customers, yet these customers cause hundreds of billions of loan losses for banks. What is more, a classifier trained on such data often fails to identify the real default customers. Therefore, high prediction accuracy on the default customers is especially useful in helping bankers, society, and bond investors reduce losses.

The main customer classification approaches proposed to handle imbalanced data problems can be divided into three categories. One of the best-known categories is the classification model based on econometric techniques. Using an empirical study of imbalanced bankrupt and nonbankrupt enterprises in the U.S., Altman established the five-variable Z-score model [3]. By combining collinearity diagnostics with logistic regression significant discriminant (LRSD), Shi and Chi established a classification model for handling imbalanced financial data [4]. Altman et al. employed a statistical discriminant technique to revise the Z-score model and created the zeta rating model in order to deal with imbalanced bankrupt and nonbankrupt data [5]. Ju and Sohn created a credit-scoring classification model for selecting appropriate funding beneficiaries [6]. Elliott et al. built a model based on a double hidden Markov model (DHMM) to extract information about the “true” credit qualities of firms from all of the loan firms [7]. In order to screen good customers, Hwang et al. created an ordered semiparametric probit credit rating model by substituting an ordered semiparametric function for the linear regression function [8]. Min and Lee proposed a DEA credit scoring model for loan customers’ classification and compared it with conventional models for business failure prediction such as multiple discriminant analysis, logistic regression analysis, and neural networks [9]. The linear discriminant analysis (LDA) model has also been applied to clients’ classification [10].

The second category is the classification approach based on stochastic probability. In order to infer credit-quality data that are missing not at random (MNAR), Chen and Åstebro proposed a flexible method to generate the probability of missingness within a model-based bound and collapse Bayesian technique. Empirical results show that the method improves the classification power of credit scoring models under MNAR conditions [11]. Carmona et al. used a structural model with stochastic volatility for the computation of rare credit portfolio losses, and they demonstrated the efficiency of their method in situations where importance sampling is not possible or numerically unstable [12]. Kim and Sohn proposed a random effects multinomial regression model to estimate transition probabilities of different types of customers [13]. Carling et al. established a continuous time model consisting of macroeconomic factors, such as the GDP growth rate and the unemployment rate, to readjust a debtor’s credit rating transition probability [14]. KMV Company used the probability that asset value is less than debt value to measure enterprises’ default situation, creating the KMV model to measure default probability [15]. JPMorgan utilized a transition matrix to describe the debtor’s probability of credit rating change, establishing the CreditMetrics credit rating model [16]. Credit Suisse First Boston used stochastic probability to measure default in establishing the CreditRisk+ model [17].

The third category is the imbalanced data classification method based on artificial intelligence. In order to extract features for classification, Li et al. created a generalized linear discriminant analysis model based on the trace ratio criterion algorithm (GLDA-TRA) [18]. Abellán and Mantas studied ensembles of classifiers for bankruptcy prediction and credit scoring using the random subspace method, and an experimental study showed that the bagging scheme on decision trees presents the best results for bankruptcy prediction and credit scoring [19]. Wang established a hybrid sampling SVM model for imbalanced data classification [20]. Zhong et al. carried out a comprehensive experimental comparison of the effectiveness of four learning algorithms (i.e., BP, ELM, I-ELM, and SVM) on an imbalanced data set consisting of real financial data for corporate credit ratings [21]. Akkoç proposed a three-stage hybrid adaptive neurofuzzy inference system (ANFIS) credit scoring model to deal with imbalanced loan customers [22]. In order to classify loan customers, Finlay compared the performance of several multiple classifiers and found that error trimmed boosting outperformed all other multiple classifiers on UK credit data [23]. Twala explored the predicted behaviour of five classifiers for different types of noise in terms of credit risk prediction accuracy and how such accuracy could be improved by using classifier ensembles. The experimental evaluation showed that the ensemble of classifiers technique had the potential to improve prediction accuracy [24].

Although existing research has made great progress in handling imbalanced data issues, some drawbacks remain. First, the real default status of customers is not taken into account in existing loan customer classification. Second, the collinearity between indicators, which can confuse the information carried by the index system, is not excluded in existing classification research.

In order to overcome the above shortcomings, this paper creates a novel imbalanced data classification approach based on logistic regression significant discriminant (LRSD) and Fisher discriminant. Using 2157 microfinance loans to small private businesses from a Chinese state-owned commercial bank, the empirical results show that the average accuracy rate of the proposed model is 96.27%. The proposed model therefore performs well on imbalanced customer classification.

The rest of the paper is organized as follows. Section 2 introduces the methodology of this paper. Section 3 presents the data and empirical analysis of our imbalanced data classification model for small private business. We conclude the paper in Section 4.

2. A Novel Imbalanced Data Classification Approach

2.1. Data Standardization
2.1.1. The Standardization of Quantitative Indicators

In order to eliminate the influence of differences in indicator units and dimensions on index screening, the original data should be transformed into real numbers within the interval [0, 1]. The quantitative indicators include positive indicators, negative indicators, and interval indicators. The positive indicators are those for which a larger value indicates a better credit situation of microfinance for small private business, such as “industry cycle index.” The negative indicators are those for which a smaller value is better, such as “private loan amount.” The interval indicators are those that are reasonable only when they lie in certain intervals. It should be pointed out that there are three interval indicators in this paper, that is, “age,” “age of guarantor,” and “consumer price index.” The ideal interval of “age” and “age of guarantor” is taken from [4]. If the age of the guarantor or the business owner is within this interval, the repayment ability and repayment willingness of the small private business are strong. The ideal interval of “consumer price index (CPI)” is also taken from [4]. When the CPI of the region where the small private business is located lies within this interval, there is neither inflation nor deflation.

It should be noted that no technique has been shown to be optimal for all kinds of data. Because max-min normalization has been widely used in the standardization of quantitative indicators [4, 25, 26], it is applied here to transform the positive and negative indicators. Let $x_{ij}$ denote the standard score of the $i$th customer on the $j$th indicator. Let $v_{ij}$ denote the original data of the $i$th customer on the $j$th indicator. Let $n$ denote the number of customers. The standardization equations of positive indicators and negative indicators are shown in (1) and (2), respectively [4]. Consider

$$x_{ij} = \frac{v_{ij} - \min_{1 \le i \le n}(v_{ij})}{\max_{1 \le i \le n}(v_{ij}) - \min_{1 \le i \le n}(v_{ij})}. \tag{1}$$

Equation (1) is the ratio of the deviation between the indicator original data $v_{ij}$ and the minimum value $\min(v_{ij})$ to the range $\max(v_{ij}) - \min(v_{ij})$. It indicates that the closer the indicator original data is to the maximum value $\max(v_{ij})$, the bigger the standardized value. Consider

$$x_{ij} = \frac{\max_{1 \le i \le n}(v_{ij}) - v_{ij}}{\max_{1 \le i \le n}(v_{ij}) - \min_{1 \le i \le n}(v_{ij})}. \tag{2}$$

The variables in (2) have the same meanings as in (1). Equation (2) indicates that the closer the indicator original data is to the minimum value $\min(v_{ij})$, the bigger the standardized value.

Let $q_1$ denote the left boundary of the ideal interval. Let $q_2$ denote the right boundary of the ideal interval. The standard score equation of the interval indicators is as follows [4]:

$$x_{ij} = 1 - \frac{q_1 - v_{ij}}{\max\bigl(q_1 - \min(v_{ij}),\ \max(v_{ij}) - q_2\bigr)}, \quad v_{ij} < q_1, \tag{3a}$$

$$x_{ij} = 1 - \frac{v_{ij} - q_2}{\max\bigl(q_1 - \min(v_{ij}),\ \max(v_{ij}) - q_2\bigr)}, \quad v_{ij} > q_2, \tag{3b}$$

$$x_{ij} = 1, \quad q_1 \le v_{ij} \le q_2. \tag{3c}$$

The remaining variables in (3a), (3b), and (3c) have the same meanings as in (1).

Equations (3a), (3b), and (3c) standardize the interval indicators. From (3c), if the indicator original data $v_{ij}$ belongs to the interval $[q_1, q_2]$, the standardized value equals 1. From (3a), if the indicator original data $v_{ij}$ is less than the left boundary $q_1$, the numerator is the deviation between the indicator original data and the left boundary, and the denominator $\max(q_1 - \min(v_{ij}), \max(v_{ij}) - q_2)$ is the larger of $q_1 - \min(v_{ij})$ and $\max(v_{ij}) - q_2$. Equation (3a) shows that the smaller the distance between the indicator original data and the left boundary $q_1$, the bigger the standardized value. Similarly, if $v_{ij} > q_2$, (3b) shows that the smaller the distance between the indicator original data and the right boundary $q_2$, the bigger the standardized value.
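The three standardization rules can be sketched in a few lines of Python. This is a minimal illustration that follows the verbal description of (1), (2), and (3a) to (3c), not the authors' implementation. The function names, the handling of a constant indicator, the "one minus ratio" form used outside the ideal interval, and the interval bounds in the usage line are assumptions made for this example.

```python
from typing import List


def standardize_positive(values: List[float]) -> List[float]:
    """Max-min score: larger raw values map closer to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # assumption: a constant indicator scores 0 for everyone
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize_negative(values: List[float]) -> List[float]:
    """Max-min score: smaller raw values map closer to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # assumption: same convention as above
        return [0.0 for _ in values]
    return [(hi - v) / (hi - lo) for v in values]


def standardize_interval(values: List[float], q1: float, q2: float) -> List[float]:
    """Score 1 inside [q1, q2]; decay linearly with distance outside it."""
    lo, hi = min(values), max(values)
    # Largest possible distance from the ideal interval on either side.
    denom = max(q1 - lo, hi - q2)
    scores = []
    for v in values:
        if v < q1:
            scores.append(1.0 - (q1 - v) / denom)
        elif v > q2:
            scores.append(1.0 - (v - q2) / denom)
        else:
            scores.append(1.0)
    return scores


# Hypothetical ages with a hypothetical ideal interval [30, 40].
ages = [20, 30, 40, 50, 60]
age_scores = standardize_interval(ages, 30, 40)
```

The denominator is only used when a value falls outside the ideal interval, in which case it is strictly positive, so no division by zero can occur.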

2.1.2. The Standardization of Qualitative Indicators

By rational analysis and expert investigation for qualitative indicators, the scoring standard of qualitative indicators can be obtained.

2.2. The Key Indicators Extraction Approach in Imbalanced Data Classification
2.2.1. Screening the Key Indicators Based on Logistic Regression Significant Discriminant

Using the logistic regression model for selecting indicators, we ensure that the reserved indicators can effectively distinguish default customers from nondefault ones. Let $y$ be the dependent variable of the logistic regression model, that is, the default status of a customer’s loan, coded as a binary variable that separates default customers from nondefault customers. Let $p$ denote the corresponding default probability when credit rating is conducted with indicators $x_1, x_2, \ldots, x_m$. Let $\beta_0$ denote the constant term, and let $\beta_1, \beta_2, \ldots, \beta_m$ denote the regression coefficients. The logistic regression model is as follows [6]:

$$\ln\frac{p}{1 - p} = \beta_0 + \sum_{i=1}^{m} \beta_i x_i. \tag{4}$$

Next, this paper gives a selection approach based on logistic regression significant discriminant (LRSD). The null hypothesis is that the $i$th indicator has no effect on customers’ default status, so that its logistic regression coefficient $\beta_i$ equals zero. The alternative hypothesis is that the $i$th indicator has a significant effect on customers’ default status, so that $\beta_i$ is not equal to zero. We establish the Wald statistic and judge whether the coefficients in (4) equal zero. In other words, we judge whether the $i$th indicator significantly affects customers’ default status.

Let $W_i$ denote the Wald test value of the $i$th indicator. Let $\hat{\beta}_i$ denote the estimated coefficient of the $i$th indicator in (4). Let $S_{\hat{\beta}_i}$ denote the standard deviation of $\hat{\beta}_i$. Thus, the Wald test value is given by

$$W_i = \left(\frac{\hat{\beta}_i}{S_{\hat{\beta}_i}}\right)^2. \tag{5}$$

The standard process of selecting the indicators is as follows. By comparing the test probability $sig_i$ of the Wald test value with the given significance level $Level_0 = 0.05$ [27], we can determine whether an indicator has a significant effect on customers’ default status. If $sig_i < Level_0$, then $\beta_i \neq 0$, which means the $i$th indicator affects customers’ default status significantly, and therefore the $i$th indicator should be reserved. On the contrary, if $sig_i \ge Level_0$, then $\beta_i = 0$, which means the $i$th indicator does not have a significant effect on customers’ default status, and therefore the $i$th indicator should be deleted.
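The screening rule can be illustrated with a small Python sketch. It assumes the coefficient estimates and their standard deviations are already available from a fitted logistic regression, uses the squared form of the Wald statistic, and obtains the test probability from a chi-square distribution with one degree of freedom. The function names and the example numbers are hypothetical and are not taken from Table 5.

```python
import math
from typing import Dict, List, Tuple


def wald_test(coef: float, std_err: float) -> Tuple[float, float]:
    """Return the Wald value and its test probability (chi-square, 1 d.f.)."""
    wald = (coef / std_err) ** 2
    # Survival function of chi-square with 1 d.f.: P(X > w) = erfc(sqrt(w / 2)).
    sig = math.erfc(math.sqrt(wald / 2.0))
    return wald, sig


def screen_indicators(estimates: Dict[str, Tuple[float, float]],
                      level: float = 0.05) -> List[str]:
    """Keep indicators whose test probability is below the significance level."""
    kept = []
    for name, (coef, std_err) in estimates.items():
        _, sig = wald_test(coef, std_err)
        if sig < level:
            kept.append(name)
    return kept


# Hypothetical (coefficient, standard deviation) pairs.
example = {"indicator_a": (2.0, 0.5), "indicator_b": (0.1, 0.5)}
reserved = screen_indicators(example)
```

In this toy input, the first indicator has a large coefficient relative to its standard deviation and is reserved, while the second is deleted.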

2.2.2. Deleting the Repeated Information Indicators Based on Correlation Analysis

The aim of the correlation analysis is to delete indicators of large correlation from the whole extensive indicators set, avoiding repeated information.

Let $x_{ki}$ denote the standard score of the $k$th customer on the $i$th indicator, and let $n$ denote the number of customers. Let $\bar{x}_i$ and $\bar{x}_j$ denote the mean values of the $i$th indicator and the $j$th indicator, respectively. Let $r_{ij}$ denote the correlation coefficient between the $i$th indicator and the $j$th indicator. Then,

$$r_{ij} = \frac{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ki} - \bar{x}_i)^2 \sum_{k=1}^{n} (x_{kj} - \bar{x}_j)^2}}. \tag{6}$$

As a matter of experience, the threshold of the correlation coefficient is set to 0.8 [25]. In other words, if the absolute value of the correlation coefficient is more than 0.8, the two indicators reflect repeated information, and one of them should be deleted.
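A short Python sketch of the correlation filter follows. The paper chooses which indicator of a correlated pair to delete by judgment; the sketch replaces that judgment with a simple priority order (earlier indicators are kept), which is an assumption of this example, as are the function names and the toy data.

```python
import math
from typing import Dict, List


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Note: a constant indicator has zero variance and would divide by zero.
    return cov / math.sqrt(var_a * var_b)


def drop_redundant(columns: Dict[str, List[float]],
                   threshold: float = 0.8) -> List[str]:
    """Keep an indicator only if it is not highly correlated with one already kept."""
    kept: List[str] = []
    for name, scores in columns.items():  # dict order acts as the priority order
        if all(abs(pearson(scores, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: "y" nearly duplicates "x", while "z" carries different information.
toy = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.5],
    "z": [1.0, -1.0, 1.0, -1.0],
}
selected = drop_redundant(toy)
```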

2.3. The Establishment Approach of Credit Scoring Model
2.3.1. The Indicator Empowerment Based on Fisher Discriminant

The weighting of every selected indicator is calculated by the Fisher discriminant method, under which the bigger the distance between the default sample and the nondefault sample, the bigger the weighting.

The deviation matrix among indicators in the same group (the within-group deviation matrix) is defined in (7) and (8) [28]. Its entry for a pair of indicators is the deviation value between the two indicators, which combines the deviation value between them in the nondefault sample group and the deviation value between them in the default sample group. These quantities are computed from the customers’ standard scores on the two indicators, the mean values of the two indicators overall and within each sample group, the number of indicators, the number of nondefault customers, and the number of default customers.

The deviation matrix between the default sample and the nondefault sample (the between-group deviation matrix) is defined in (9) and (10) [28]. It is built from the deviation value of each indicator between the default sample and the nondefault sample.

The remaining variables in (10) have the same meanings as in (8).

Correspondingly, a Fisher criterion function of the Fisher discriminant coefficient vector is defined in (11) [28].

The meaning of (11) is clear. The objective function maximizes the deviation between the default sample and the nondefault sample while minimizing the deviation among indicators in the same group. It thereby reflects the empowering idea that the bigger the distance between the default sample and the nondefault sample, the bigger the weight.

Now we outline the approach to solving the empowerment model. Finding the maximum of (11) is equivalent to obtaining the eigenvector associated with the largest nonzero eigenvalue of the corresponding characteristic equation. The resulting Fisher discriminant coefficient vector is given by (12) and (13).

Substituting into (12) yields (14), which gives the Fisher weighting of each indicator.
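As a rough illustration of the weighting idea, the following Python sketch uses the textbook two-class Fisher solution, in which the discriminant direction is proportional to the inverse within-group scatter matrix applied to the difference of the group mean vectors. It is not a transcription of (7) to (14). In particular, turning the coefficients into weights by taking absolute values and dividing by their sum is an assumption of this example, and all function names and data are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mean_vector(rows: Matrix) -> List[float]:
    return [sum(r[k] for r in rows) / len(rows) for k in range(len(rows[0]))]


def scatter(rows: Matrix, mean: List[float]) -> Matrix:
    """Sum of outer products of deviations from the group mean."""
    m = len(mean)
    s = [[0.0] * m for _ in range(m)]
    for r in rows:
        d = [r[k] - mean[k] for k in range(m)]
        for k in range(m):
            for l in range(m):
                s[k][l] += d[k] * d[l]
    return s


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("within-group matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fisher_weights(nondefault: Matrix, default: Matrix) -> List[float]:
    m0, m1 = mean_vector(nondefault), mean_vector(default)
    s0, s1 = scatter(nondefault, m0), scatter(default, m1)
    m = len(m0)
    within = [[s0[k][l] + s1[k][l] for l in range(m)] for k in range(m)]
    direction = solve(within, [a - b for a, b in zip(m0, m1)])
    total = sum(abs(c) for c in direction)  # assumed normalization to sum 1
    return [abs(c) / total for c in direction]


# Toy standardized scores: indicator 0 separates the groups, indicator 1 does not.
good = [[0.8, 0.2], [0.9, 0.6], [1.0, 0.4], [0.9, 0.4]]
bad = [[0.1, 0.5], [0.2, 0.3], [0.3, 0.4], [0.2, 0.6]]
weights = fisher_weights(good, bad)
```

In the toy data the first indicator receives most of the weight, which matches the idea that a larger distance between the two groups leads to a larger weight.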

2.3.2. The Calculation of Customers’ Credit Scores

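Let $P_i$ denote the score of the $i$th customer, let $w_j$ denote the Fisher weighting of the $j$th indicator, let $x_{ij}$ denote the standard score of the $i$th customer on the $j$th indicator, and let $m$ denote the number of indicators. We have [26]

$$P_i = \sum_{j=1}^{m} w_j x_{ij}. \tag{15}$$

The remaining variables in (15) have the same meanings as in (14).

Because the credit score calculated by (15) lies in the interval [0, 1], it is not on the generally accepted scale of [0, 100]. The credit score can be converted to a number between 0 and 100 by using (16). Let $S_i$ denote the standard score of the $i$th customer. Let $n$ denote the number of customers. Thus, the standard score is given by

$$S_i = \frac{P_i - \min_{1 \le i \le n}(P_i)}{\max_{1 \le i \le n}(P_i) - \min_{1 \le i \le n}(P_i)} \times 100. \tag{16}$$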
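The two scoring steps amount to a weighted sum followed by a max-min rescaling to the 0 to 100 scale. The Python sketch below shows this under the assumption that the weights and standardized scores are already available; the function names and the toy numbers are hypothetical.

```python
from typing import List


def credit_scores(standardized: List[List[float]],
                  weights: List[float]) -> List[float]:
    """Linear weighted score of each customer (one row per customer)."""
    return [sum(w * x for w, x in zip(weights, row)) for row in standardized]


def to_percent_scale(scores: List[float]) -> List[float]:
    """Rescale raw scores so that the minimum maps to 0 and the maximum to 100."""
    lo, hi = min(scores), max(scores)
    return [100.0 * (s - lo) / (hi - lo) for s in scores]


# Toy example: three customers, two indicators, weights summing to 1.
raw = credit_scores([[1.0, 1.0], [0.5, 0.0], [0.0, 0.0]], [0.6, 0.4])
standard = to_percent_scale(raw)
```

By construction the best customer in the sample always receives 100 and the worst receives 0, so the standard score is relative to the sample at hand.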

2.4. The Establishment Approach of Credit Rating Model

With customer numbers of credit ratings following normal distribution [9, 26], all loan customers can be divided into nine ratings. A step-by-step instruction is provided.

Step 1. According to (16), the customers’ standard scores in descending order can be obtained.

Step 2. On the basis of customer numbers of all credit ratings following a bell-shaped normal distribution, we can compute the sample proportion of every rating, as shown in the second column of Table 1. The third column and the fourth column of Table 1 are the illustration of every rating. The sample frequency distribution is shown as in Figure 1.

Step 3. Because the first credit rating accounts for 8% of the total sample, the first scoring interval can be obtained from the customers’ credit scores. If a customer’s credit score belongs to the first scoring interval, the customer is assigned to the first credit rating. In the same way, all of the small private businesses can be divided into nine ratings.
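The three steps can be sketched in Python as follows. Only the 8% share of the first rating is stated in the text; the remaining shares and the rating labels below are hypothetical placeholders chosen to be symmetric and bell-shaped, since Table 1 is not reproduced here. The function name is also an assumption.

```python
from typing import List

# Hypothetical labels and shares; only the first share (8%) comes from the text.
RATINGS = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C"]
PROPORTIONS = [0.08, 0.10, 0.12, 0.13, 0.14, 0.13, 0.12, 0.10, 0.08]


def assign_ratings(scores: List[float]) -> List[str]:
    """Rank customers by score (descending) and cut the ranking by cumulative share."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    ratings = [""] * n
    start = 0
    cumulative = 0.0
    for label, share in zip(RATINGS, PROPORTIONS):
        cumulative += share
        # The last rating absorbs any rounding remainder.
        end = n if label == RATINGS[-1] else round(cumulative * n)
        for i in order[start:end]:
            ratings[i] = label
        start = end
    return ratings


# With 2157 customers, the first rating holds round(2157 * 0.08) = 173 customers.
demo = assign_ratings([float(i) for i in range(2157)])
```

The score of the last customer placed in a rating gives the lower bound of that rating's scoring interval.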

3. Empirical Study of the Imbalanced Data Classification Model

3.1. Sample and Data Source
3.1.1. Sample Analysis

Data is collected from a Chinese government owned commercial bank that deals with 2157 small private businesses from 30 provinces [26]. The 2157 small private businesses consist of 246 default customers and 1911 nondefault customers. Moreover, we can compute that the default customers account for only 11.40% of the total sample and the nondefault customers account for 88.60% of the total sample. The credit rating sample satisfies the characteristics of imbalanced data. The distribution of customers is shown as in Figure 2.

3.1.2. The Establishment of Extensive Index System

According to the available indicators from a Chinese national commercial bank [26], this paper selects 64 indicators of microfinance for small private businesses, which are grouped into six feature layers, that is, “basic information,” “guarantee and joint guarantee,” “capacity of repayment,” “capacity of profitability,” “capacity of operation,” and “microenvironment,” as shown in Table 2, columns 1, 2, and 4. All of these 64 indicators come from [3–28].

At the beginning of screening, we removed six unavailable indicators, such as “industry experience” and “business capacity,” leaving 58 indicators. The deleted indicators are marked with “unavailability delete” in column 5 of Table 2.

3.2. The Establishment of Key Indicators Extraction Model for Small Private Business
3.2.1. The Standardization of Indicators Data

According to the indicator type in column d of Table 3, taking the original data of positive indicators from column 1 to 2157 of Table 3 into (1), the original data of negative indicators into (2), and the original data of interval indicators into (3a), (3b), and (3c), the standardized data of indicators are obtained. The results are illustrated in columns 2158 to 4314 of Table 3.

Next, we will compute the standardized score of qualitative indicators. The scoring standard of qualitative indicators can be obtained by rational analysis, as shown in column 2 to 6 of Table 4. In accordance with the scoring standard of qualitative indicators in Table 4, the standardized scores of qualitative indicators are obtained combining with the indicator type in Column d of Table 3, as shown in column 2158 to 4314 of Table 3.

3.2.2. Indicators Extraction Based on LRSD

In order to create a logistic regression significant discriminant (LRSD) model, the training sample and test sample need to be determined. Of the 2157 customers, 80% are randomly selected as the training sample (i.e., 1726 customers). In the training sample, the selected 1529 nondefault customers are shown in columns 2158 to 3686 of Table 3, and the selected 197 default customers are shown in columns 4069 to 4265 of Table 3. Meanwhile, all of the 2157 customers are used as the test sample.
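The reported training counts (1529 nondefault and 197 default customers) are consistent with drawing 80% within each class, since 80% of 1911 and 80% of 246 round to those values. The text only states that the selection is random, so the stratified draw in the following Python sketch is an assumption, as are the function name, the seed, and the label coding (1 for default, 0 for nondefault).

```python
import random
from typing import List


def stratified_split(labels: List[int], fraction: float = 0.8,
                     seed: int = 0) -> List[int]:
    """Return sorted indices of a training sample drawn class by class."""
    rng = random.Random(seed)
    chosen: List[int] = []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        chosen.extend(rng.sample(members, round(fraction * len(members))))
    return sorted(chosen)


# Class sizes from Section 3.1.1: 1911 nondefault (0) and 246 default (1).
labels = [0] * 1911 + [1] * 246
train_idx = stratified_split(labels)
n_default_train = sum(labels[i] for i in train_idx)
```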

Taking the standardized data from columns 2158 to 3686 and columns 4069 to 4265 of Table 3 into (4) and (5), the regression results are given in the third to sixth column of Table 5. And the given significance level Level0 equals 0.050 [27], as shown in Table 5, column 7.

According to the indicator selection standard in Section 2.2.1 above, if the test probability sigi of the Wald test value is less than the given significance level Level0, the ith indicator affects customers’ default status significantly and should be reserved. Comparing the given significance level Level0 with the test probability sigi in the sixth column of Table 5, 24 indicators are reserved. The screening results are listed in the eighth column of Table 5.

3.2.3. Indicators Extraction Based on Correlation Analysis

Substituting the 24 reserved indicators’ data in Table 3 into (6), the correlation coefficients among these indicators are obtained, as shown in Table 6. As mentioned in Section 2.2.2 above, if the absolute value of the correlation coefficient of two indicators is larger than the threshold 0.8 [25], the two indicators reflect repeated information and one of them can be deleted. From Table 6, the correlation coefficient of “strength of the guarantor” and “credit status of joint guarantor” is 0.925. Because 0.925 is larger than the threshold 0.8, these two indicators reflect repeated information. Because “strength of the guarantor” reveals more of the basic information of debtors than “credit status of joint guarantor,” it is reasonable to delete “credit status of joint guarantor.” In the same way, we deleted another four indicators, namely, “the relationship of coinsurance group membership,” “net income,” “fixed assets turnover,” and “consumer price index.” All five deleted indicators are marked with “deleted by correlation analysis” in column 3 and column 5 of Table 2. In summary, we select 19 indicators which can effectively distinguish nondefault customers from default ones, as shown in Table 2.

In order to test the classification ability of the key indicators extraction model, we use all of the 2157 customers as the test sample. Substitute the 19 selected indicators into (4), and the regression results are shown as in Table 7. Table 7 shows that the average accuracy rate for our model is 96.27%. The model has good classification ability for the small private business.

It should be pointed out that many evaluation metrics can be applied in measuring a model’s performance, such as AUC, recall, precision, F-measure, and overall success rate. This study evaluates the model’s performance by using the average of the default customers’ accuracy rate and the nondefault customers’ accuracy rate. This measure reflects the advantage of the proposed classification approach in dealing with imbalanced data. For instance, suppose there are ten customers, two of whom are default customers and eight of whom are nondefault customers. If both default customers are misclassified and all eight nondefault customers are classified correctly, the overall success rate is 80% (=8/10). However, under the measure used in this paper, the default customers’ accuracy rate is 0% and the nondefault customers’ accuracy rate is 100%, so the average accuracy rate equals 50% [=(100% + 0%)/2]. The average accuracy rate therefore measures classification performance on imbalanced data more faithfully than the overall success rate does.
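The ten-customer example can be reproduced with a few lines of Python. The label coding (1 for default, 0 for nondefault) and the function names are assumptions made for this sketch, and both classes are assumed to be present in the sample.

```python
from typing import List


def class_accuracy(actual: List[int], predicted: List[int], cls: int) -> float:
    """Share of customers of class `cls` that are predicted as `cls`."""
    hits = [p == cls for a, p in zip(actual, predicted) if a == cls]
    return sum(hits) / len(hits)


def average_accuracy(actual: List[int], predicted: List[int]) -> float:
    """Mean of the default and nondefault accuracy rates."""
    return (class_accuracy(actual, predicted, 1)
            + class_accuracy(actual, predicted, 0)) / 2.0


def overall_success_rate(actual: List[int], predicted: List[int]) -> float:
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Two default customers, eight nondefault customers, everything predicted nondefault.
actual = [1, 1] + [0] * 8
predicted = [0] * 10
avg = average_accuracy(actual, predicted)       # 0.5
overall = overall_success_rate(actual, predicted)  # 0.8
```

A classifier that ignores the rare class entirely still scores 80% on the overall success rate but only 50% on the average accuracy rate.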

3.2.4. Comparative Analysis of LRSD and SVM in the Customer Classification

Based on the support vector machine (SVM) classification method in [29], this section constructs an SVM classification model for discriminating the customers’ default status. In order to obtain the most accurate classification function, we let the penalty parameter γ vary from 1 to 5 (i.e., γ ∈ [1, 5]) and the kernel parameter σ² vary from 0 to 3 (i.e., σ² ∈ [0, 3]), each with a step of 0.5. This gives a total of 9 × 7 = 63 combinations of the penalty parameter γ and the kernel parameter σ², as shown in columns 2 and 3 of Table 8. The classification function whose γ and σ² yield the maximum average accuracy rate is taken as the most accurate classification function.
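The parameter grid and the selection rule can be sketched in Python as follows. The helper names are hypothetical, and the `evaluate` argument stands in for whatever routine trains the SVM with a given parameter pair and returns its average accuracy rate; a toy scoring function is used here only to show the mechanics.

```python
from typing import Callable, List, Tuple


def frange(start: float, stop: float, step: float) -> List[float]:
    """Inclusive range of evenly spaced floats."""
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


GAMMAS = frange(1.0, 5.0, 0.5)   # 9 values of the penalty parameter
SIGMA2S = frange(0.0, 3.0, 0.5)  # 7 values of the kernel parameter
GRID: List[Tuple[float, float]] = [(g, s2) for g in GAMMAS for s2 in SIGMA2S]


def select_best(grid: List[Tuple[float, float]],
                evaluate: Callable[[Tuple[float, float]], float]) -> Tuple[float, float]:
    """Return the parameter pair with the highest evaluation score."""
    return max(grid, key=evaluate)


# Toy stand-in for "train the SVM and return its average accuracy rate".
best = select_best(GRID, lambda p: -abs(p[0] - 2.5) - abs(p[1] - 1.0))
```

If the kernel divides by σ², the grid point σ² = 0 would need special handling in a real implementation.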

As mentioned in Section 3.2.2, 80% of the customers are randomly selected as the training sample, and all of the 2157 customers are used as the test sample. Using the standardized data from columns 2158 to 3686 and columns 4069 to 4265 of Table 3, the 63 default customers’ accuracy rates, the 63 nondefault customers’ accuracy rates, and the 63 average accuracy rates can be calculated separately, as shown in Table 8, columns 4 to 6. From the sixth column of Table 8, the maximum of the average accuracy rate is 93.10%. Therefore, the most accurate classification function can be obtained. Consider

$$\hat{y}_i = \operatorname{sign}\Bigl(\sum_{k} \alpha_k y_k K(x_i, x_k) + b\Bigr), \tag{17}$$

where $\hat{y}_i$ denotes the predicted default status of the $i$th customer, $\alpha_k$ denote the Lagrangian multipliers for the SVM training sample, $y_k$ denotes the default status of the $k$th customer in the training sample, $K(\cdot, \cdot)$ denotes the kernel function for the SVM training sample, and $b$ denotes the regression coefficient for the SVM training sample.

From Tables 7 and 8, the average accuracy rate for our proposed model is 96.27%, and the average accuracy rate for the SVM model is 93.10%. The proposed model based on logistic regression significant discriminant and correlation analysis has better performance than the SVM model in dealing with imbalanced data.

3.3. The Calculation of Credit Scoring for Small Private Business
3.3.1. The Determination of Indicators Weighting

Taking the 19 selected indicators’ data from Table 3 into (7) and (8), the deviation matrix among indicators in the same group is obtained. Substituting the number of nondefault customers, the number of default customers, and the 19 selected indicators’ data from Table 3 into (9) and (10), the deviation matrix between the default sample and the nondefault sample is obtained. Thus, we obtain the Fisher weighting vector by utilizing (11) to (14).

3.3.2. The Calculation of Credit Scoring

Substituting the Fisher weighting vector and the 19 selected indicators’ data from Table 3 into (15), the 2157 customers’ credit scores are obtained, as shown in the third column of Table 9. Taking the data of the third column in Table 9, the maximum value 0.699 of this column, and the minimum value 0.324 into (16), the 2157 customers’ standard credit scores are obtained, as shown in the fourth column of Table 9.

3.4. The Credit Rating for Small Private Business

As it is mentioned in Section 2.4 above, ranking the customers’ credit scores of the fourth column of Table 9 in descending order, the results are given in the third column of Table 10.

Take the first credit rating as an example. From the second column of Table 1, the first credit rating accounts for 8% of the total sample, so the number of customers in the first credit rating is 173 (= 2157 × 8%, rounded), as shown in column 5 of Table 10. The credit score of the 173rd customer, 72.79, can be found in the third column, so the first scoring interval is [72.79, 100], which is listed in column 7 of Table 10. That is to say, customers whose credit scores lie in the range of 72.79 to 100 are rated AAA. In the same way, the credit rating results of the remaining eight ratings are obtained, as shown in column 7 of Table 10.

4. Conclusion

Small private businesses are important cornerstones of current Chinese economic development. At the end of 2013, statistical data showed that there were 44.36 million small private businesses in China, and their money amounted to 2.43 trillion Yuan [30]. However, most Chinese small private businesses face difficulty in raising funds due to their poor financial structures. The Chinese government has led financial innovation by supporting small private businesses through a series of inclusive financial measures. However, the default rates of small private businesses have been very high for several reasons. One major reason is that the credit rating system of microfinance for small private business is not sound, and most banks in China have not yet established such a rating system. Another primary reason is that the real default status of customers is not taken into account in existing credit rating systems.

In order to resolve the customer classification problem effectively, we propose a novel imbalanced data classification approach of microfinance for small private businesses. First of all, this paper sets up a key indicators extraction model by using logistic regression significant discriminant to select indicators which can effectively distinguish default customers from nondefault ones and utilizing the correlation analysis to delete the repeated information indicators. Secondly, on the basis of the linear weighted evaluation utilizing Fisher method, the credit scoring model for small private business, which reflects the default discriminant ability for default customers and nondefault customers, is established. And then, a credit rating model in which the customer number of credit ratings follows normal distribution is established.

The proposed approach has been verified using the data of 2157 small private businesses of a Chinese state-owned commercial bank. The results of our empirical analysis show that the proposed approach can accurately divide customers into credit ratings. There are 19 indicators which can effectively distinguish default small private businesses from nondefault ones, such as “marital status,” “asset-liability ratio,” and “industry cycle index.” Our approach can also help banks and bond investors find high-quality customers.

Moreover, the performance of two classifier systems is evaluated in terms of their ability to correctly classify consumers as default (i.e., bad customer) or nondefault (i.e., good customer) credit risks. Empirical results suggest that the proposed approach performs better than the SVM model in dealing with imbalanced data classification.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of the paper.

Acknowledgments

The research was supported by National Natural Science Foundation of China (no. 71471027, 71373207, 71201018, and 71171031), Banking Information Technology Risk Management Project of China Banking Regulatory Commission (CBRC) (no. 2012-4-005), Science and Technology Research Project of Ministry of Education of China (no. 2011-10), New Century Excellent Talents Support Plan of Ministry of Education of China (no. NCET-11-0443), Doctoral Fund of Ministry of Education of China (no. Z223021312), Credit Risks Evaluation and Loan Pricing For Petty Loan Funded for the Head Office of Post Savings Bank of China (no. 2009-07), and Basic Research and New Technology Project of Henan Province of China (no. 132300410333).