Abstract

Aiming at the problem that the credit card default data of a financial institution are unbalanced, which leads to unsatisfactory prediction results, this paper proposes a prediction model based on k-means SMOTE and BP neural network. In this model, the k-means SMOTE algorithm is used to change the data distribution, the importance of the data features is then calculated with a random forest, and these importance values are substituted into the BP neural network as initial weights for prediction. The model effectively solves the problem of sample data imbalance. At the same time, this paper constructs five common machine learning models, KNN, logistic regression, SVM, random forest, and decision tree, and compares the classification performance of these six prediction models. The experimental results show that the proposed algorithm greatly improves the prediction performance of the model, raising its AUC value from 0.765 to 0.929. Moreover, when the feature importance is taken as the initial weight of the BP neural network, the prediction accuracy of the model also improves slightly. In addition, compared with the other five prediction models, the comprehensive prediction effect of the BP neural network is better.

1. Introduction

Recently, the state has vigorously promoted the economic construction of large- and medium-sized cities, which not only improves people’s living standards but also changes people’s consumption concepts and consumption patterns. People are more and more inclined to spend ahead of time and mortgage their “credit” to the bank to enjoy certain things in advance. However, when consuming, people often lack rational thinking and overestimate their ability to repay loans to banks in time. On the one hand, this increases the loan risk of banks; on the other hand, it deepens the credit crisis of consumers themselves [1]. With a large number of banks issuing credit cards, credit card defaults emerge one after another, so it is very important for banks to effectively identify high-risk credit card default users. Generally speaking, customers who repay their credit card loans overdue are far fewer than customers who do not [2, 3]. Distinguishing overdue from non-overdue repayment is a binary (“two-classification”) task in machine learning prediction; in this task, the minority class is called the positive class (default), and the majority class is called the negative class (nondefault). However, most credit card loan data are unbalanced. In view of this situation, domestic and overseas scholars have carried out a large amount of research. Khoshgoftaar et al. [4] proposed an evolutionary sampling method for unbalanced data, which uses genetic algorithms to selectively delete majority-class samples and retain samples carrying rich feature information. Compared with other existing data sampling techniques, evolutionary sampling performs better and is more conducive to empirical replication. The FN undersampling method used by Zhao et al. [5] regarded the majority class as a cluster, which was divided into multiple regions. They calculated the distance from the negative-class samples to the sample mean point in each region and reserved only one sample point per region. Finally, the remaining negative-class samples were combined with the original positive-class samples for training and analysis. Zan et al. [6] used a generative adversarial network (GAN) to synthesize minority samples to balance the data, then used AdaBoost to change the weight of the input samples, and established a prediction model based on a decision tree classifier, which improved the recognition rate on unbalanced data to a certain extent. Hu et al. [7] used improved oversampling and undersampling techniques to solve the problem of data imbalance: they synthesized new samples by assigning higher weights to adjacent minority samples through a weight vector, undersampled the majority class based on the Euclidean distance criterion, and kept the sample count constant during resampling; they found that this method was superior to using a single data sampling technique. Han et al. [8] used an improved version of the SMOTE algorithm, Borderline-SMOTE, which also synthesizes new samples from minority samples. Whereas the original SMOTE algorithm synthesizes from the k nearest neighbors of any minority sample, the improved algorithm locates the minority samples near the class boundary and synthesizes new samples from them. Wang et al. [9] constructed a deep learning prediction model for imbalanced data.
The model proposed a new loss function on top of the original neural network. This method does not need to balance the data in advance; predictive analysis can be performed directly, and it effectively reduces the classification error between positive and negative examples. Jiao et al. [10] proposed a reinforcement learning cumulative reward mechanism to improve the attribute selection of the classification and regression tree, so as to improve the model’s prediction probability for minority samples.

We can see that the problem of class imbalance is mainly addressed from two perspectives. The first perspective is to balance the data by changing the number of samples; this approach can be further divided into three strands: improving the oversampling method, changing the data distribution based on undersampling, and combining oversampling with undersampling. The second perspective is to improve the classifier algorithm to improve the prediction performance of the model, together with relevant evaluation indicators to evaluate the prediction results. Under normal circumstances, since undersampling loses information, oversampling is the most widely used technique, and SMOTE is the most common method. However, we find that most improved versions of SMOTE cannot reduce the imbalance between the sample categories and within them at the same time, and the applicability of improved classifiers is also limited. Therefore, this paper proposes an improved SMOTE algorithm with better applicability that combines the k-means algorithm. This method clusters all samples using the k-means unsupervised learning algorithm, finds the clusters containing more minority-class samples, and then uses SMOTE to synthesize new samples within those clusters to change the data distribution. It reduces not only the imbalance between categories but also the imbalance within categories. At the same time, it is combined with the BP neural network to predict credit card defaults and help banks identify credit card risks effectively.

2. Basic Theory

2.1. PCA

The main idea of the principal component analysis (PCA) method is to transform the n-dimensional feature variables through the coordinate axes and the origin to form a new m-dimensional feature set (usually, m is less than n) [11]. These m dimensions are also called principal components. The essence of PCA is to replace a series of correlated sample features with newly generated comprehensive features that are uncorrelated with each other. When analyzing the data, the cumulative variance ratio threshold can be set in advance. The working steps of PCA are as follows:

The first step is to standardize the original sample. This step is executed automatically by the software that analyzes the data.

The second step is to determine the correlation between the sample features and calculate the correlation coefficient matrix.

The third step is to determine the number m of principal components after dimensionality reduction, calculate the eigenvalues and their corresponding eigenvectors, and then combine these eigenvectors to obtain each principal component.

The fourth step is to determine the comprehensive evaluation index, calculate the information contribution rate of each eigenvalue and principal component, and then weight these values to obtain the final evaluation value.
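As a rough illustration of these steps, the following sketch applies PCA with a preset cumulative variance ratio using scikit-learn; the 0.95 threshold and the random placeholder data are assumptions for demonstration only, not values from this paper.

```python
# Minimal PCA sketch (scikit-learn assumed); the 0.95 cumulative variance
# threshold and the random placeholder data are illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 10)                 # placeholder for the n-dimensional samples

X_std = StandardScaler().fit_transform(X)   # step 1: standardize the original sample
pca = PCA(n_components=0.95)                # keep components up to the preset
                                            # cumulative variance ratio
X_new = pca.fit_transform(X_std)            # steps 2-3: m principal components

print(pca.explained_variance_ratio_)        # step 4: contribution rate of each component
```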

2.2. Feature Importance Calculation of Random Forest

Random forest is a relatively basic machine learning algorithm, which is widely used in predictive analysis [12], data labeling [13], tag ranking [14], feature importance calculation [15], and other fields. The principle of the algorithm is as follows: the bootstrap method is used to randomly construct n decision trees; each decision tree is split and pruned, and the trees are finally combined to form a random forest. In this paper, the random forest is used to calculate feature importance, which serves as the initial weights of the BP neural network. The basic algorithm steps are as follows:

The first step is to calculate the out-of-bag data error (error1) using the sample data that were not selected when drawing samples to construct a decision tree (the out-of-bag data).

The second step is to randomly add noise interference to feature a of all the out-of-bag samples and then calculate the error again, recorded as error2.

The third step is to calculate the importance of feature a as $\frac{1}{n}\sum(\text{error2}-\text{error1})$ (n is the number of decision trees constructed).
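The procedure above is the classic out-of-bag permutation importance. As a hedged sketch (an approximation, not the exact implementation used in this paper), the following code estimates the same quantity with scikit-learn's permutation importance on a held-out split instead of the out-of-bag data; the synthetic data and hyperparameters are illustrative assumptions.

```python
# Hedged sketch: permutation importance on a held-out split, approximating
# the out-of-bag "error2 - error1" procedure described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=27, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffling one feature at a time plays the role of the noise interference;
# the mean accuracy drop over repeats approximates that feature's importance.
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
importance = result.importances_mean       # later reused as BP initial weights
print(importance[:5])
```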

2.3. BP Neural Network

The prediction model used in this paper is the BP neural network algorithm, a feed-forward neural network trained by backward error propagation. It is often used for bank risk analysis [16], geological disaster monitoring [17], image and handwritten digit recognition [18, 19], and other fields. A BP neural network consists of three parts: an input layer, a middle (hidden) layer, and an output layer. In the model, data samples enter the input layer, are combined through different weights, pass through the middle layer, and finally produce the result at the output layer. Different weights and activation functions make the output of the model very different. In this experiment, the following steps were taken:

The first step is to assign and initialize the parameters. In this experiment, the feature importance calculated by the random forest is taken as the weight of the input-layer variable $X_i$, with the same value set for the weights from one input variable to all hidden-layer neurons. In addition, the numbers of nodes in the input layer, hidden layer, and output layer are determined.

The second step is to calculate the output of the hidden layer: $Z_j = f\left(\sum_i w_{ij} X_i + a_j\right)$, where $w_{ij}$ is the weight from input node $i$ to hidden node $j$ and $f$ is the activation function.

The third step is to calculate the output of the output layer: $Y_k = f\left(\sum_j v_{jk} Z_j + b_k\right)$, where $v_{jk}$ is the weight from hidden node $j$ to output node $k$. Among them, both $a_j$ and $b_k$ in the second and third steps are offsets.

The fourth step is to calculate the error $E = \frac{1}{2}\sum_k (y_k - Y_k)^2$, where $y_k$ is the expected output value and $Y_k$ is the actual output value.

The fifth step is to update the weights and biases in reverse.
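A minimal NumPy sketch of these five steps is given below, assuming a sigmoid activation and toy shapes matching the 27-55-2 architecture used later; the learning rate and random data are illustrative assumptions, not settings from the paper.

```python
# Minimal NumPy sketch of one forward pass and one backward update of a
# three-layer BP network (sigmoid activation and toy data are assumptions).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 27, 55, 2

W = rng.normal(size=(n_in, n_hidden))    # input-to-hidden weights (or feature importances)
V = rng.normal(size=(n_hidden, n_out))   # hidden-to-output weights
a = np.zeros(n_hidden)                   # hidden-layer offsets a_j
b = np.zeros(n_out)                      # output-layer offsets b_k
lr = 0.1                                 # illustrative learning rate

x = rng.random(n_in)                     # one input sample X_i
y_true = np.array([1.0, 0.0])            # expected output y_k

Z = sigmoid(x @ W + a)                   # step 2: hidden-layer output
Y = sigmoid(Z @ V + b)                   # step 3: output-layer output
E = 0.5 * np.sum((y_true - Y) ** 2)      # step 4: error E

delta_out = (Y - y_true) * Y * (1 - Y)   # step 5: backward update of weights and biases
delta_hid = (delta_out @ V.T) * Z * (1 - Z)
V -= lr * np.outer(Z, delta_out); b -= lr * delta_out
W -= lr * np.outer(x, delta_hid); a -= lr * delta_hid
```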

3. k-Means SMOTE Algorithm

We know that SMOTE is a method for synthesizing new samples to solve data imbalance, proposed by Chawla et al. [20], and it is widely used in various fields. SMOTE is an improvement over random oversampling: instead of simply repeating original samples at random, it generates new artificial samples by a formula. However, the SMOTE algorithm may still aggravate the imbalance within the sample classes to a certain extent. Therefore, in view of the imbalance of the credit card sample categories, this paper uses an improved SMOTE algorithm, the k-means SMOTE algorithm, which reduces the imbalance between categories on the one hand and the imbalance within categories on the other. In this experiment, we first cluster all 30,000 samples with the k-means method, then filter out the clusters containing a high proportion of minority-class samples, and finally perform SMOTE oversampling inside the filtered clusters. The detailed steps of the k-means SMOTE algorithm are as follows:

The first step is to randomly select k points among all samples and use them as the cluster centers $c_1, c_2, \ldots, c_k$.

The second step is to calculate the distance from each sample to each cluster center: $d(x_i, c_j) = \lVert x_i - c_j \rVert$. Among them, $i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, k$.

The third step is to allocate each sample to the cluster whose center is closest: $C_j = \{x_i : d(x_i, c_j) \le d(x_i, c_l),\ l = 1, 2, \ldots, k\}$.

The fourth step is to recalculate each cluster center: $c_j = \frac{1}{|C_j|}\sum_{x \in C_j} x$.

The fifth step is to repeat the second, third, and fourth steps until the cluster centers no longer change.

The sixth step is to filter out the clusters with few minority-class samples and select the clusters with more minority-class samples in which to synthesize new minority samples.

The seventh step is to perform SMOTE oversampling in each filtered cluster $C_K$:

$x_{\text{new}} = x + \mathrm{rand}(0, 1) \times (x_c - x)$

Among them, rand(0, 1) represents a random number between 0 and 1, $x_{\text{new}}$ represents a newly synthesized minority-class sample, $x_c$ represents a minority-class sample randomly selected from the m nearest neighbors within the filtered cluster, and $x$ represents a minority-class sample in the filtered cluster other than those m neighbors. The k-means SMOTE algorithm flow is shown in Figure 1.
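For reference, a hedged sketch of this resampling step is shown below using the KMeansSMOTE implementation in the imbalanced-learn library (assumed available); the synthetic data, k = 122 clusters, and m = 5 neighbors are illustrative choices, and the exact parameters may need adjusting for real data.

```python
# Hedged sketch using imbalanced-learn's KMeansSMOTE (library assumed);
# the synthetic data, k = 122, and m = 5 neighbors are illustrative choices.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=30000, n_features=27,
                           weights=[0.78, 0.22], random_state=0)

sampler = KMeansSMOTE(
    kmeans_estimator=KMeans(n_clusters=122, random_state=0),  # steps 1-5: clustering
    k_neighbors=5,                                            # m nearest neighbors for SMOTE
    random_state=0,
)
X_res, y_res = sampler.fit_resample(X, y)   # steps 6-7: filter clusters, oversample inside
print(y.mean(), y_res.mean())               # minority share before and after resampling
```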

4. Experimental Data and Preliminary Analysis

4.1. Preliminary Analysis of Data

This paper uses data on credit card usage from the Kaggle website (https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset). The sample size is 30,000, of which 6,636 are in the positive category (default) and 23,364 in the negative category (no default). The sample has a total of 25 variables. In this experiment, since the variable ID has no relationship with the target variable, it was deleted, leaving 23 feature variables and 1 target variable. The variables are shown in Table 1:

Among these 23 features, each feature has been processed accordingly. For the feature limit_bal, we draw a density map according to the default type, and the result is shown in Figure 2.

It can be found from Figure 2 that when the given credit amount is approximately below 150,000, the probability of default is greater than that of nondefault. This shows that when the credit amount is low, there may be more defaulters. For the feature age, we also performed a visual analysis, as shown in Figure 3.

Figure 3 shows that the probability of nondefault is higher for ages between approximately 25 and 40, which indicates that consumers in this age group are more capable of repaying credit card loans. This may be because their work and family tend to be stable, without too much pressure. For the feature sex, we draw a stacked histogram according to the target variable, as shown in Figure 4.

As shown in Figure 4, whether for males or females, the proportion of defaulting consumers is still relatively low, which is in line with the general situation. Conventionally, most default data such as credit card fraud are uneven, and we need to make some adjustments to the model based on the actual situation. For the feature education, we find that the feature has six attribute values, and the meanings of the numbers 5 and 6 are unknown. To avoid a “curse of dimensionality” when processing the data, we merge them into one value (unknown) and draw a stacked histogram to visualize this feature, as shown in Figure 5.

For the feature marriage, we draw the same graph as the feature sex and education. The default and nondefault conditions of this feature are shown in Figure 6.

It can be seen from the above three figures that the sample set is unbalanced in the corresponding attribute values of the three characteristics of gender, education, and marriage. For the feature series payment status, we draw different stacked histograms according to different months, and the results are shown in Figure 7.

It can be seen from Figure 7 that consumers who delay payment by one month or less rarely default on their credit cards. In May, August, and September, consumers who delayed payment for more than two months had a greater probability of credit card default, which in turn increases the loan risk of financial institutions. For the feature series BillAMT and PayAMT, we also perform the corresponding analysis and draw line graphs to visualize the two features, as shown in Figures 8 and 9.

As shown in Figures 8 and 9, due to the imbalance of the data, the default lines occupy only the front part of each figure. Figure 8 shows the bill amounts, and Figure 9 shows the amounts previously paid. Comparing these two images, we find that the six subimages in Figure 9 fluctuate more and span a wider range than the six subimages in Figure 8. Moreover, the uncertainty of the previous payment amounts also increases the difficulty for banks in adjusting credit card loan limits.

4.2. Data Processing and Feature Importance

In this experiment, there are a total of 23 features and 1 target variable. After coding and data cleaning, the 23 features become 89 input variables. This is a heavy computational load for the model and is not conducive to the prediction results of this paper. For comparative analysis with other models, this paper uses PCA for dimensionality reduction and finally obtains 27 input variables, then uses the random forest to calculate the importance of these 27 variables and takes them as the initial weights of the BP neural network. The calculated feature importances are shown in Table 2.
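A rough sketch of this preprocessing chain is given below, assuming the Kaggle CSV with its published column names; the file name, the list of categorical fields, and the use of impurity-based importances are assumptions for illustration, and the paper's own pipeline may differ in its details.

```python
# Rough sketch of the preprocessing chain (file name and column names follow
# the Kaggle CSV and are assumptions; standardization before PCA is added here).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("UCI_Credit_Card.csv").drop(columns=["ID"])   # hypothetical file name
y = df.pop("default.payment.next.month")

categorical = ["SEX", "EDUCATION", "MARRIAGE",
               "PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
X = pd.get_dummies(df, columns=categorical)        # 23 features -> roughly 89 dummies

X_std = StandardScaler().fit_transform(X)
X27 = PCA(n_components=27).fit_transform(X_std)    # reduce to 27 input variables

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X27, y)
importance = rf.feature_importances_               # candidate initial weights for the BP net
```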

5. Model Prediction and Comparative Analysis

5.1. Model Evaluation Method

According to the actual situation, unbalanced data should be assessed with evaluation indexes designed for unbalanced data [21]. However, because the numbers of positive and negative samples were balanced at the beginning of the experiment, we still use the two-class evaluation indicators commonly used in the past: confusion matrix, recall, precision, F1-score, AUC value, and so on.
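For clarity, a minimal sketch of computing these indicators with scikit-learn is shown below; the tiny label and score arrays are placeholders standing in for a model's test-set outputs.

```python
# Minimal sketch of the two-class evaluation indicators (scikit-learn assumed);
# the tiny arrays below are placeholders for a model's test-set outputs.
from sklearn.metrics import (confusion_matrix, recall_score, precision_score,
                             f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 0, 1]                  # actual labels (1 = default)
y_pred  = [0, 1, 1, 1, 0, 0]                  # hard predictions
y_score = [0.2, 0.6, 0.8, 0.7, 0.1, 0.4]      # predicted probability of default

print(confusion_matrix(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))
print("precision", precision_score(y_true, y_pred))
print("f1-score ", f1_score(y_true, y_pred))
print("AUC      ", roc_auc_score(y_true, y_score))
```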

5.2. BP Neural Network Prediction Model

This paper constructs a BP neural network prediction model based on the credit card default data. Since this paper has 27 input variables, 55 neurons in the hidden layer, and 2 neurons in the output layer, the BP neural network model used is shown in Figure 10.

Then, we use the 27 features after principal component dimensionality reduction as input variables and use the feature importance calculated by the random forest as the initial weights of the BP neural network. For example, the input-to-hidden weight matrix W is constructed as

$$W = (w_{ij})_{27 \times 55}, \quad w_{ij} = I_i, \tag{8}$$

where $I_i$ is the importance of the i-th input feature.

In formula (8), the matrix has 27 rows and 55 columns: the 27 rows correspond to the input variables, and the 55 columns correspond to the hidden-layer neurons. In this experiment, we set every entry in each row of the matrix to the importance of the corresponding feature and substitute the result into the model for prediction. We find that when the weights are initialized in the default way, the prediction accuracy of the model is 0.8796, and when the feature importances are assigned to the weights, the prediction accuracy is 0.8811. Numerically, the accuracy in the second case is slightly higher.
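A small NumPy sketch of this initialization is given below: each of the 27 rows of the 27 × 55 input-to-hidden matrix is filled with that feature's importance; the Dirichlet draw merely stands in for the Table 2 values, which are not reproduced here.

```python
# NumPy sketch of the weight initialization described above: every entry in
# row i of the 27 x 55 matrix W is the importance of feature i (the Dirichlet
# draw is a stand-in for the Table 2 values, which are not reproduced here).
import numpy as np

importance = np.random.dirichlet(np.ones(27))     # placeholder importances, sum to 1
n_hidden = 55

W = np.tile(importance[:, None], (1, n_hidden))   # shape (27, 55), w_ij = I_i for all j
print(W.shape)                                    # (27, 55)
```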

When building the model, we used a three-layer BP neural network to build the credit card default prediction model. The input layer has 27 neurons, the hidden layer has 55 neurons, and the output layer has 2 neurons. The number of hidden-layer neurons is determined with the empirical formula

$$h = 2n + 1,$$

where n is the number of input-layer neurons, giving $h = 2 \times 27 + 1 = 55$.

Apart from the initial weights of the hidden layer and the number of neurons in the hidden layer, which received the simple treatment described above, the other parameters are kept at their default values.

Due to the uneven distribution of the experimental data, we use the k-means SMOTE algorithm to solve this problem. For the parameter k in the k-means SMOTE algorithm, we use the empirical formula

$$k \approx \sqrt{N/2},$$

where N is the sample size.

Substituting the sample size N = 30000 into the above formula gives k ≈ 122. We substitute this value into the k-means SMOTE algorithm and draw ROC curves to compare the prediction performance of the model before and after k-means SMOTE intuitively. We find that k-means SMOTE greatly improves the prediction performance of the model. The result is shown in Figure 11.

In Figure 11, we find that after the sample is processed by the k-means SMOTE algorithm, the prediction performance of the model is greatly improved: the AUC value increases from 0.765 to 0.930, the ROC curve of the model moves closer to the horizontal line at 1 above the coordinate axis, and the accuracy rate rises from 0.8252 to 0.8796.

Normally, a BP neural network model with many parameters is prone to overfitting: because the model fits the training data closely, it may also learn the noise. We compare the performance of the prediction model on the training set and the testing set, and the results are as follows.

It can be seen from the above table that the performance indexes of the prediction model on these two data sets differ little, so we judge that the possibility of overfitting in this experiment is relatively low, and the performance of the model can achieve the desired results.

5.3. Comparative Analysis with Other Models

In order to verify the effectiveness of the method used in this experiment, we also establish five other common machine learning models for predictive analysis under the same conditions. We compare and analyze the prediction results of these five models in the same situation and use several common performance indicators to evaluate the models. Since the confusion matrix presents the prediction results case by case, it is not convenient for comparing the performance of these models, so we condense it slightly (e.g., the accuracy rate is taken as approximately the average of the accuracies on positive and negative examples), as shown in Table 3.
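As a hedged sketch of how such a comparison can be run under identical conditions, the code below trains the five reference models on the same features and reports their AUC values; the synthetic data and default hyperparameters are assumptions, so the numbers it prints will not match Table 4.

```python
# Hedged sketch of the baseline comparison: five reference models trained on
# the same (synthetic, balanced) features and scored with AUC; hyperparameters
# are library defaults, so the printed numbers will not match Table 4.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=27, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),
    "Random forest": RandomForestClassifier(random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    score = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_te, score):.3f}")
```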

It can be seen from Table 4 that the F1 values of these six models all reach above 0.8, indicating that all six models can effectively predict the imbalanced credit data in this paper, but the comprehensive prediction performance of the BP neural network is slightly better: its AUC value is the highest among the six models, while SVM has the highest accuracy rate. However, the running time of the SVM model is too long, close to 6 minutes; compared with the other models, the running efficiency of SVM is very low, and if the amount of data is very large, using SVM for prediction is not a wise choice. In addition, we find that except for the decision tree, whose AUC value is lower, the differences in AUC among the other models are not particularly large. This can also be seen intuitively from the ROC curves, as shown in Figure 12.

In Figure 12, we find that without the numbers in Table 3, no obvious difference can be seen among the ROC curves of the first five models. The sixth image in the figure is the ROC curve of the decision tree, which is clearly different from the previous five. This also shows that the decision tree has the worst performance among the six prediction models.

6. Summary

This paper proposes a combined approach using the k-means SMOTE and BP neural network algorithms for imbalanced data. We find that the improved SMOTE algorithm (k-means SMOTE) not only effectively solves the problem of data imbalance but also improves the prediction performance of the model. In addition, we find that using the feature importance calculated by the random forest as the initial weights of the hidden layer of the BP neural network can slightly improve the prediction performance of the model. However, this change is not obvious. On the one hand, this may be because the credit card default data have many influencing factors and are rather complicated; we cannot take all such influencing factors into account, which may indirectly affect the calculation of the feature importance. On the other hand, the amount of sample data may not be sufficient, and the relatively simple BP neural network model may not interpret these data well for predictive analysis.

In addition, as the penetration rate of credit cards in our country gradually increases, we offer the following suggestions for research on default risk. On the one hand, the construction of the credit indicator system should be further improved. A good credit index system is conducive to better assessment of personal credit, so that a risk prediction model with better classification performance can be established. Specifically, methods such as the Delphi expert method, the analytic hierarchy process, and regression analysis can be used to find the most representative individual credit indicators, determine the weight of each indicator, and finally manage the evaluation system dynamically. On the other hand, risk management and control should be strengthened. Since credit card loan default involves personal moral issues, it is highly subjective and uncontrollable. Although major financial institutions are committed to developing the best methods for avoiding credit card loan risk, they have not been able to completely resolve the problem of credit defaults. Therefore, financial institutions should focus on controlling and avoiding risks and try their best to reduce risk losses. Based on the idea of ensemble methods in machine learning, they can comprehensively use each superior classifier to develop a more versatile risk control model.

Data Availability

This paper uses data on credit card usage, which comes from the Kaggle website (https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset).

Conflicts of Interest

The authors declare that they have no conflict of interest.