## A Novel Imbalanced Data Classification Approach Based on Logistic Regression and Fisher Discriminant

We introduce an imbalanced data classification approach based on logistic regression significant discriminant (LRSD) and Fisher discriminant. First, a key indicator extraction model based on logistic regression significant discriminant and correlation analysis is derived to extract features for customer classification. Second, a customer scoring model is established by linear weighting, with the weights obtained from the Fisher discriminant. Third, a customer rating model is constructed in which the number of customers in each rating follows a normal distribution. The proposed model and the classical SVM classification method are evaluated in terms of their ability to correctly classify customers as default or nondefault. Empirical results using the data of 2157 customers in financial engineering suggest that the proposed approach performs better than the SVM model in dealing with imbalanced data classification. Moreover, our approach helps banks and bond investors locate qualified customers.

#### 1. Introduction

Imbalanced data sets are common in practice [1]. A data set is imbalanced when the number of samples in one class is much larger than that in another class. Many classification approaches built on imbalanced data sets perform well on one class but poorly on the other [2]. Much more attention should be paid to the rare class in these cases. For example, in risk management, default customers account for only 1% to 4% of all loan customers, yet they cause hundreds of billions in loan losses for banks. What is more, a classifier trained on such data often fails to identify the real default customers. Therefore, a high prediction accuracy on default customers is more useful in helping bankers, society, and bond investors reduce losses.

The main customer classification approaches proposed to handle imbalanced data problems can be divided into three categories. The first, and one of the best known, is the classification model based on econometric techniques. Using an empirical study on imbalanced data of bankrupt and nonbankrupt enterprises in the U.S., Altman established the five-variable Z-score model [3]. By combining collinearity diagnostics with logistic regression significant discriminant (LRSD), Shi and Chi established a classification model for handling imbalanced financial data [4]. Altman et al. employed statistical discriminant techniques to revise the Z-score model and created the ZETA rating model in order to deal with imbalanced bankrupt and nonbankrupt data [5]. Ju and Sohn created a credit-scoring classification model for selecting appropriate funding beneficiaries [6]. Elliott et al. built a model based on a double hidden Markov model (DHMM) to extract information about the “true” credit qualities of firms from all of the loan firms [7]. In order to screen good customers, Hwang et al. created an ordered semiparametric probit credit rating model by substituting an ordered semiparametric function for the linear regression function [8]. In comparison with conventional models such as multiple discriminant analysis, logistic regression analysis, and neural networks for business failure prediction, Min and Lee proposed a DEA credit scoring model for loan customers’ classification [9]. The linear discriminant analysis (LDA) model has also been applied to client classification [10].

The second category is the classification approach based on stochastic probability. To infer credit-quality data that are missing not at random (MNAR), Chen and Åstebro proposed a flexible method to generate the probability of missingness within a model-based bound and collapse Bayesian technique. Empirical results show that the method improves the classification power of credit scoring models under MNAR conditions [11]. Carmona et al. used a structural model with stochastic volatility for the computation of rare credit portfolio losses, and they demonstrated the efficiency of their method in situations where importance sampling is not possible or numerically unstable [12]. Kim and Sohn proposed a random effects multinomial regression model to estimate transition probabilities of different types of customers [13]. Carling et al. established a continuous time model consisting of macroeconomic factors, such as the GDP growth rate and the unemployment rate, to readjust the debtor’s credit rating transition probability [14]. KMV Company used the probability that asset value falls below debt value to measure enterprises’ default situation, creating the KMV model to measure default probability [15]. JPMorgan utilized a transition matrix to describe the debtor’s probability of credit rating change, establishing the CreditMetrics credit rating model [16]. Credit Suisse First Boston established the CreditRisk+ model, which uses stochastic probability to measure default [17].

The third category is the imbalanced data classification method based on artificial intelligence. In order to extract features for classification, Li et al. created a generalized linear discriminant analysis model based on the trace ratio criterion algorithm (GLDA-TRA) [18]. Abellán and Mantas studied ensembles of classifiers for bankruptcy prediction and credit scoring using the random subspace method, and their experimental study showed that the bagging scheme on decision trees presents the best results for bankruptcy prediction and credit scoring [19]. Wang established a hybrid sampling SVM model for imbalanced data classification [20]. Zhong et al. carried out a comprehensive experimental comparison of the effectiveness of four learning algorithms (i.e., BP, ELM, I-ELM, and SVM) on an imbalanced data set consisting of real financial data for corporate credit ratings [21]. Akkoç proposed a three-stage hybrid adaptive neurofuzzy inference system (ANFIS) credit scoring model to deal with imbalanced loan customers [22]. In order to classify loan customers, Finlay compared the performance of several multiple classifiers and found that error trimmed boosting outperformed all other multiple classifiers on UK credit data [23]. Twala explored the predicted behaviour of five classifiers for different types of noise in terms of credit risk prediction accuracy and how such accuracy could be improved by using classifier ensembles. The experimental evaluation showed that the ensemble of classifiers technique had the potential to improve prediction accuracy [24].

Although existing research has made great progress in handling imbalanced data, some drawbacks remain. First, the real default status of customers is not taken into account in existing loan customer classification. Second, the collinearity between indicators, which can introduce redundant and confusing information into the index system, is not excluded in existing classification research.

To overcome the above shortcomings, this paper proposes a novel imbalanced data classification approach based on logistic regression significant discriminant (LRSD) and Fisher discriminant. Using data on 2157 microfinance loans for small private businesses from a Chinese state-owned commercial bank, the empirical results show that the average accuracy rate of the proposed model is 96.27%. The proposed model thus performs well on imbalanced customer classification.

The rest of the paper is organized as follows. Section 2 introduces the methodology of this paper. Section 3 presents the data and empirical analysis of our imbalanced data classification model for small private business. We conclude the paper in Section 4.

#### 2. A Novel Imbalanced Data Classification Approach

##### 2.1. Data Standardization

###### 2.1.1. The Standardization of Quantitative Indicators

In order to eliminate the influence of differences in indicator units and dimensions on index screening, the original data should be transformed into real numbers within the interval $[0, 1]$. The quantitative indicators include positive indicators, negative indicators, and interval indicators. Positive indicators are those for which a larger value means a better credit situation of microfinance for small private business, such as “industry cycle index.” Negative indicators are those for which a smaller value is better. Interval indicators are those which are reasonable only when they lie in certain intervals, such as “private loan amount.” It should be pointed out that there are three interval indicators in this paper, that is, “age,” “age of guarantor,” and “consumer price index.” The ideal interval of “age” and “age of guarantor” follows [4]. If the age of the guarantor or the business owner is within this ideal interval, it means the repayment ability and repayment willingness of the small private business are strong. The ideal interval of “consumer price index (CPI)” also follows [4]. When the CPI of the area where the small private business is located lies within this interval, there is neither inflation nor deflation.

It has to be noted that no technique has been shown to be optimal for all kinds of data. Because max-min normalization has been widely used in the standardization of quantitative indicators [4, 25, 26], it is applied here to transform the positive and negative indicators. Let $x_{ij}$ denote the standard score of the $j$th customer on the $i$th indicator. Let $v_{ij}$ denote the original data of the $j$th customer on the $i$th indicator. Let $n$ denote the number of customers. The standardization equations of positive indicators and negative indicators are shown in (1) and (2), respectively [4]. Consider

$$x_{ij} = \frac{v_{ij} - \min_{1 \le j \le n}(v_{ij})}{\max_{1 \le j \le n}(v_{ij}) - \min_{1 \le j \le n}(v_{ij})}. \tag{1}$$

Equation (1) is the ratio of the deviation between the original data $v_{ij}$ and the minimum value $\min(v_{ij})$ to the range $\max(v_{ij}) - \min(v_{ij})$. It indicates that the closer the original data $v_{ij}$ is to the maximum value $\max(v_{ij})$, the bigger the standardized value $x_{ij}$ is. Consider

$$x_{ij} = \frac{\max_{1 \le j \le n}(v_{ij}) - v_{ij}}{\max_{1 \le j \le n}(v_{ij}) - \min_{1 \le j \le n}(v_{ij})}. \tag{2}$$

The meanings of the variables in (2) are the same as in (1). Equation (2) indicates that the closer the original data $v_{ij}$ is to the minimum value $\min(v_{ij})$, the bigger the standardized value $x_{ij}$ is.

Let $q_1$ denote the left boundary of the ideal interval. Let $q_2$ denote the right boundary of the ideal interval. The standard score equation of the interval indicators is shown as follows [4]:

$$x_{ij} = \begin{cases} 1 - \dfrac{q_1 - v_{ij}}{\max\bigl(q_1 - \min(v_{ij}),\ \max(v_{ij}) - q_2\bigr)}, & v_{ij} < q_1, & \text{(3a)} \\[2ex] 1 - \dfrac{v_{ij} - q_2}{\max\bigl(q_1 - \min(v_{ij}),\ \max(v_{ij}) - q_2\bigr)}, & v_{ij} > q_2, & \text{(3b)} \\[2ex] 1, & q_1 \le v_{ij} \le q_2. & \text{(3c)} \end{cases}$$

The meanings of the rest of the variables in (3a), (3b), and (3c) are the same as in (1).

Equations (3a), (3b), and (3c) define the standardization of interval indicators. From (3c), if the original data $v_{ij}$ belongs to the interval $[q_1, q_2]$, the standardized value identically equals 1. From (3a), if the original data $v_{ij}$ is less than the left boundary $q_1$, the numerator is the deviation between the original data and the left boundary, and the denominator $\max(q_1 - \min(v_{ij}), \max(v_{ij}) - q_2)$ is the larger of $q_1 - \min(v_{ij})$ and $\max(v_{ij}) - q_2$. Equation (3a) illustrates that the smaller the distance between the original data and the left boundary $q_1$ is, the bigger the standardized value is. Similarly, if $v_{ij} > q_2$, (3b) indicates that the smaller the distance between the original data and the right boundary $q_2$ is, the bigger the standardized value is.
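As an illustration, the three standardization rules described above can be sketched in Python. This is a minimal sketch and not the implementation used in the paper: the function names are hypothetical, the interval boundaries in the usage lines are illustrative values rather than the ideal intervals of [4], and returning 0 for a constant indicator is an added assumption.

```python
def standardize_positive(values):
    """Max-min score for a positive indicator: larger raw values score closer to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # assumption: a constant indicator gets score 0 for every customer
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize_negative(values):
    """Max-min score for a negative indicator: smaller raw values score closer to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # same assumption as above
        return [0.0 for _ in values]
    return [(hi - v) / (hi - lo) for v in values]


def standardize_interval(values, q1, q2):
    """Score for an interval indicator with ideal interval [q1, q2]."""
    lo, hi = min(values), max(values)
    # Largest possible distance from the ideal interval on either side.
    denom = max(q1 - lo, hi - q2)
    scores = []
    for v in values:
        if v < q1:
            scores.append(1 - (q1 - v) / denom)
        elif v > q2:
            scores.append(1 - (v - q2) / denom)
        else:
            scores.append(1.0)
    return scores


# Hypothetical raw data: ages with an illustrative ideal interval [30, 50].
ages = [20, 30, 40, 50, 70]
print(standardize_interval(ages, q1=30, q2=50))  # [0.5, 1.0, 1.0, 1.0, 0.0]
```

In this sketch a value inside the ideal interval always scores 1, and the score decays linearly with the distance to the nearer boundary.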

###### 2.1.2. The Standardization of Qualitative Indicators

By rational analysis and expert investigation for qualitative indicators, the scoring standard of qualitative indicators can be obtained.

##### 2.2. The Key Indicators Extraction Approach in Imbalanced Data Classification

###### 2.2.1. Screening the Key Indicators Based on Logistic Regression Significant Discriminant

Using the logistic regression model for selecting indicators, we ensure that the retained indicators can effectively distinguish default customers from nondefault ones. Let $y$ be the dependent variable of the logistic regression model, namely, the default status of a customer’s loan. Use $y = 1$ to denote a default customer and $y = 0$ to denote a nondefault customer. Let $P$ denote the corresponding default probability when conducting credit rating with the indicators $x_1, x_2, \ldots, x_m$. Let $\beta_0$ denote the constant term, and let $\beta_1, \beta_2, \ldots, \beta_m$ denote the regression coefficients. The logistic regression model is as follows [6]:

$$\ln\frac{P}{1 - P} = \beta_0 + \sum_{i=1}^{m} \beta_i x_i. \tag{4}$$

Next, this paper gives a selection approach based on logistic regression significant discriminant (LRSD). The null hypothesis is that the $i$th indicator has no effect on customers’ default status, that is, the logistic regression coefficient of the $i$th indicator is equal to zero. Conversely, the alternative hypothesis is that the $i$th indicator has a significant effect on customers’ default status, that is, its coefficient is not equal to zero. We establish the Wald statistic and judge whether the coefficients in (4) equal zero or not. In other words, we judge whether the $i$th indicator significantly affects customers’ default status.

Let $\mathrm{Wald}_i$ denote the Wald test value of the $i$th indicator. Let $\hat{\beta}_i$ denote the estimated coefficient of the $i$th indicator in (4). Let $\mathrm{SE}(\hat{\beta}_i)$ denote the standard deviation of $\hat{\beta}_i$. Thus, the Wald test value is given by

$$\mathrm{Wald}_i = \left(\frac{\hat{\beta}_i}{\mathrm{SE}(\hat{\beta}_i)}\right)^{2}. \tag{5}$$

The standard process of selecting the indicators is as follows. Comparing the test probability $\mathrm{sig}_i$ of the Wald test value with the given significance level $\mathrm{Level}_0 = 0.05$ [27], we can determine whether an indicator has a significant effect on customers’ default status. If $\mathrm{sig}_i < \mathrm{Level}_0$, then $\beta_i \neq 0$, which means the $i$th indicator affects customers’ default status significantly, and therefore the $i$th indicator should be retained. On the contrary, if $\mathrm{sig}_i \ge \mathrm{Level}_0$, then $\beta_i = 0$, which means the $i$th indicator does not have a significant effect on customers’ default status, and therefore the $i$th indicator should be deleted.
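The screening rule above can be sketched as follows. This is a minimal sketch under stated assumptions: the coefficient estimates and their standard deviations are taken as given from an already fitted logistic regression, the test probability is read from a chi-square distribution with one degree of freedom (the usual reference distribution for a single-coefficient Wald statistic, which the text does not state explicitly), and the function name and the numbers in the usage lines are hypothetical.

```python
import math


def wald_screen(coefs, std_errs, level=0.05):
    """Return (wald, sig, keep) for each indicator.

    coefs    -- estimated logistic regression coefficients, one per indicator
    std_errs -- standard deviations of those estimates
    level    -- significance level (0.05 in the text)
    """
    results = []
    for beta, se in zip(coefs, std_errs):
        wald = (beta / se) ** 2
        # Upper-tail probability of a chi-square variable with 1 degree of
        # freedom: P(chi2_1 > w) = erfc(sqrt(w / 2)).
        sig = math.erfc(math.sqrt(wald / 2.0))
        results.append((wald, sig, sig < level))
    return results


# Hypothetical estimates for two indicators.
for wald, sig, keep in wald_screen([1.2, 0.1], [0.4, 0.3]):
    print(f"Wald={wald:.3f}  sig={sig:.4f}  keep={keep}")
```

With these illustrative inputs, the first indicator is retained and the second is deleted, mirroring the comparison of the test probability with the 0.05 level.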

###### 2.2.2. Deleting the Repeated Information Indicators Based on Correlation Analysis

The aim of the correlation analysis is to delete indicators of large correlation from the whole extensive indicators set, avoiding repeated information.

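Let $x_{ik}$ denote the standard score of the $k$th customer on the $i$th indicator, and let $n$ denote the number of customers. Let $\bar{x}_i$ and $\bar{x}_j$ denote the mean values corresponding to the $i$th indicator and the $j$th indicator, respectively. Let $r_{ij}$ denote the correlation coefficient between the $i$th indicator and the $j$th indicator. Then,

$$r_{ij} = \frac{\sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\sqrt{\sum_{k=1}^{n} (x_{ik} - \bar{x}_i)^{2} \sum_{k=1}^{n} (x_{jk} - \bar{x}_j)^{2}}}. \tag{6}$$

Following common practice, the threshold of the correlation coefficient is set to 0.80 [25]. In other words, if the absolute value of the correlation coefficient is more than 0.8, the two indicators carry repeated information, and one of them should be deleted.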
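The correlation-based deletion can be illustrated with the short sketch below. The text does not say which indicator of a highly correlated pair is deleted, so the rule used here (keep the indicator listed first and drop the later one) is an assumption, as are the function names and the data. Constant indicators are assumed to have been removed beforehand, since their correlation is undefined.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_correlated(indicators, threshold=0.8):
    """Return the names of the indicators kept after correlation screening.

    indicators -- dict mapping indicator name to its list of standard scores;
                  insertion order decides which indicator of a pair survives
                  (assumption: the earlier one is kept).
    """
    kept = []
    for name, scores in indicators.items():
        if all(abs(pearson(scores, indicators[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical standard scores of four customers on three indicators.
data = {
    "a": [0.10, 0.40, 0.50, 0.90],
    "b": [0.05, 0.20, 0.25, 0.45],  # proportional to "a", so |r| is close to 1
    "c": [0.90, 0.10, 0.80, 0.20],
}
print(drop_correlated(data))  # ['a', 'c']
```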

##### 2.3. The Establishment Approach of Credit Scoring Model

###### 2.3.1. The Indicator Empowerment Based on Fisher Discriminant

The weight of every selected indicator is calculated with the Fisher discriminant method, following the idea that the bigger the distance between the default sample and the nondefault sample, the bigger the weight.

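Let $S_w$ denote the deviation matrix among indicators in the same group. Let $s_{ij}$ denote the deviation value between the $i$th indicator and the $j$th indicator. Let $m$ denote the number of indicators. Let $n_0$ denote the number of nondefault customers. Let $n_1$ denote the number of default customers. Let $s_{ij}^{(0)}$ denote the deviation value between the $i$th indicator and the $j$th indicator in the nondefault sample group. Let $s_{ij}^{(1)}$ denote the deviation value between the $i$th indicator and the $j$th indicator in the default sample group. Let $x_{ik}$ denote the standard score of the $k$th customer on the $i$th indicator. Let $x_{jk}$ denote the standard score of the $k$th customer on the $j$th indicator. The superscripts $(0)$ and $(1)$ mark quantities computed within the nondefault and default sample groups, respectively. Let $\bar{x}_i$ and $\bar{x}_j$ denote the mean values. Let $\bar{x}_i^{(0)}$ and $\bar{x}_j^{(0)}$ denote the mean values in the nondefault sample group. Let $\bar{x}_i^{(1)}$ and $\bar{x}_j^{(1)}$ denote the mean values in the default sample group. We have [28]

$$S_w = (s_{ij})_{m \times m}, \tag{7}$$

where

$$s_{ij} = s_{ij}^{(0)} + s_{ij}^{(1)} = \sum_{k=1}^{n_0} \bigl(x_{ik}^{(0)} - \bar{x}_i^{(0)}\bigr)\bigl(x_{jk}^{(0)} - \bar{x}_j^{(0)}\bigr) + \sum_{k=1}^{n_1} \bigl(x_{ik}^{(1)} - \bar{x}_i^{(1)}\bigr)\bigl(x_{jk}^{(1)} - \bar{x}_j^{(1)}\bigr). \tag{8}$$

Let $S_b$ denote the deviation matrix between the default sample and the nondefault sample. Let $d_i$ denote the deviation value of the $i$th indicator between the default sample and the nondefault sample. Thus [28],

$$S_b = (d_i d_j)_{m \times m}, \tag{9}$$

where

$$d_i = \bar{x}_i^{(0)} - \bar{x}_i^{(1)}. \tag{10}$$

The meanings of the rest of the variables in (10) are the same as in (8).

Correspondingly, a Fisher criterion function can be defined as follows [28]:

$$J(w) = \frac{w^{T} S_b w}{w^{T} S_w w}, \tag{11}$$

where $w$ is a Fisher discriminant coefficient vector.

The rationale of (11) is as follows. The objective function maximizes the deviation between the default sample and the nondefault sample while minimizing the deviation among indicators within the same group, which reflects the weighting idea that the bigger the distance between the default sample and the nondefault sample, the bigger the weight.

Now we outline the approach to solve the weighting model. Finding the maximum value in (11) is equivalent to obtaining the eigenvector of the largest nonzero eigenvalue of the characteristic equation $|S_w^{-1} S_b - \lambda I| = 0$. Thus, the Fisher discriminant coefficient vector is given by

$$w^{*} = S_w^{-1}\bigl(\bar{x}^{(0)} - \bar{x}^{(1)}\bigr), \tag{12}$$

where

$$\bar{x}^{(g)} = \bigl(\bar{x}_1^{(g)}, \bar{x}_2^{(g)}, \ldots, \bar{x}_m^{(g)}\bigr)^{T}, \quad g = 0, 1. \tag{13}$$

Substituting the sample quantities into (12), we have

$$w^{*} = (w_1, w_2, \ldots, w_m)^{T}, \tag{14}$$

where $w_i$ denotes the Fisher weighting of the $i$th indicator.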
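To make the weighting step concrete, the sketch below computes a Fisher discriminant direction for two groups. It assumes the standard two-group closed form, in which the maximizer of the Fisher criterion is proportional to the inverse within-group deviation matrix applied to the difference of the group mean vectors. The function names and the data are hypothetical, the within-group matrix is assumed to be nonsingular, and any further normalization of the resulting coefficients into weights is not reproduced here.

```python
def mean_vector(rows):
    """Column means of a list of customers, each a list of indicator scores."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]


def scatter(rows, mean):
    """Sum of outer products of the deviations from the group mean."""
    m = len(mean)
    s = [[0.0] * m for _ in range(m)]
    for r in rows:
        d = [r[i] - mean[i] for i in range(m)]
        for i in range(m):
            for j in range(m):
                s[i][j] += d[i] * d[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fisher_direction(nondefault, default):
    """Within-group matrix inverse times (nondefault mean - default mean)."""
    m0, m1 = mean_vector(nondefault), mean_vector(default)
    s0, s1 = scatter(nondefault, m0), scatter(default, m1)
    m = len(m0)
    sw = [[s0[i][j] + s1[i][j] for j in range(m)] for i in range(m)]
    diff = [a - b for a, b in zip(m0, m1)]
    return solve(sw, diff)


# Hypothetical standard scores on two indicators.
nondefault = [[0.8, 0.5], [0.9, 0.4], [1.0, 0.6]]
default = [[0.2, 0.5], [0.3, 0.6], [0.1, 0.4]]
print(fisher_direction(nondefault, default))  # approximately [40.0, -30.0]
```

In this toy example the two groups differ only in their mean on the first indicator, yet the second indicator receives a nonzero coefficient because the two indicators are correlated within the groups.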

###### 2.3.2. The Calculation of Customers’ Credit Scores

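Let $Z_j$ denote the score of the $j$th customer, and let $x_{ij}$ denote the standard score of the $j$th customer on the $i$th indicator. We have [26]

$$Z_j = \sum_{i=1}^{m} w_i x_{ij}. \tag{15}$$

The meanings of the rest of the variables in (15) are the same as in (14).

Because the credit score calculated by (15) does not lie in the generally accepted range $[0, 100]$, it is converted to a number between 0 and 100 by using (16). Let $\tilde{Z}_j$ denote the standard score of the $j$th customer. Let $n$ denote the number of customers. Thus, the standard score is given by

$$\tilde{Z}_j = \frac{Z_j - \min_{1 \le j \le n}(Z_j)}{\max_{1 \le j \le n}(Z_j) - \min_{1 \le j \le n}(Z_j)} \times 100. \tag{16}$$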
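The scoring step can be sketched as a weighted sum followed by a rescaling to the range from 0 to 100. The min-max form of the rescaling is an assumption of this sketch, and the function names, weights, and scores are hypothetical. The sketch also assumes that the customers do not all share the same score.

```python
def credit_scores(weights, rows):
    """Linear weighted score of each customer.

    weights -- one weight per indicator
    rows    -- one list of standard indicator scores per customer
    """
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]


def to_percent_scale(scores):
    """Rescale scores linearly so that the lowest is 0 and the highest is 100."""
    lo, hi = min(scores), max(scores)
    return [100.0 * (s - lo) / (hi - lo) for s in scores]


# Hypothetical weights and standard scores for three customers.
weights = [0.5, 0.5]
customers = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
raw = credit_scores(weights, customers)
print(to_percent_scale(raw))  # [100.0, 50.0, 0.0]
```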

##### 2.4. The Establishment Approach of Credit Rating Model

Under the requirement that the number of customers across credit ratings follows a normal distribution [9, 26], all loan customers are divided into nine ratings. The step-by-step procedure is as follows.

*Step 1.* Using (16), obtain the customers’ standard scores and sort them in descending order.

*Step 2.* Given that the number of customers across all credit ratings follows a bell-shaped normal distribution, we can compute the sample proportion of every rating, as shown in the second column of Table 1. The third and fourth columns of Table 1 describe every rating. The sample frequency distribution is shown in Figure 1.