Abstract

The credit card business has become an indispensable financial service for commercial banks. As the credit card business has developed, commercial banks have achieved outstanding results in retaining existing customers, tapping potential customers, and growing market share. Credit card operations generate massive amounts of data across multiple dimensions, including basic customer information; billing, installment, and repayment information; transaction flows; and overdue records. Compared with the preloan and postloan stages, default prediction during the on-loan stage involves data on a much larger scale, which makes it difficult to identify signs of risk. With the growing maturity and practicality of technologies such as big data analysis and artificial intelligence, it has become possible to further mine and analyze these massive amounts of transaction data. This study mined and analyzed the transaction flow data that best reflect customer behavior. XGBoost, which is widely used in financial classification models, and Long Short-Term Memory (LSTM), which is widely used for time-series data, were selected for comparative research. The accuracy of the XGBoost model depends on the expertise applied in feature extraction, whereas the LSTM algorithm can achieve high accuracy without feature extraction. The resulting XGBoost-LSTM model showed good classification performance in default prediction. The results of this study can provide a reference for the application of deep learning algorithms in the field of finance.

1. Introduction

Both the issuance of credit cards and the scale of credit have increased steadily in recent years. According to data from the People’s Bank of China, at the end of 2020, the number of credit cards issued totaled 778 million, and the credit balance of credit cards was 7.91 trillion yuan. Under the effects of the COVID-19 pandemic, the quality of credit card assets deteriorated in 2020, and indicators such as overdue scale and nonperforming rates trended upward in the short term. The total credit overdue for more than half a year was 83.8 billion yuan, accounting for 1.06% of the outstanding credit card balance, an increase of 0.08 percentage points from the end of 2019. On the premise of ensuring the stable development of the credit card business, effectively managing credit card customers and reducing the problems caused by customer default have become a focus of attention. Effective use of already-collected customer information to identify customers who may default is a key measure for increasing profits. Based on the results of default predictions, banks can reduce or freeze the credit lines of accounts that may default, thereby reducing their risk exposure. This would prevent the balance of an account destined to default from growing and save financial institutions hundreds of millions of yuan in losses each year.

Current default prediction is primarily based on account, credit bureau, and transaction flow data. The account-level data include, among other information, month-end balance, credit limit, borrower income, account activity, and arrears. Credit bureau data include credit score, total credit limit, total outstanding balance on all cards, and the number of outstanding accounts. Transaction flow data contain a large amount of information related to default prediction, such as consumption, installment, and repayment habits. Transaction flow data is easy for financial institutions to obtain, is difficult to forge, and has a high degree of authenticity. Khandani et al. [1] integrated account credit records, financial behavior, and user-level transaction flow data to predict overdue payments; the total transaction amount of nearly 20 categories was extracted as the feature value for the flow data. Their research indicated that any forward-looking insights about consumer credit collected from historical consumer behavior data are crucial. Machine-learning predictions are highly adaptable and can capture the dynamics of the ever-changing credit cycle and the absolute level of default rates.

With the widespread application of artificial intelligence technology, scholars are increasingly applying machine learning to research on default prediction [2–8]. Butaru et al. [9] collected customer credit card data from multiple banks. These data included internal account-level information from banks and consumer data from large credit bureaus in the United States. Three data mining models—decision tree, random forest, and logistic regression—were used to study customer default predictions; the comparison showed that the decision tree and random forest models were better than logistic regression in credit card prediction accuracy. Since Chen and Guestrin [10] proposed the XGBoost algorithm in 2016, it has been used in many fields, including disease diagnosis, image recognition, and personal credit. Studies have shown that XGBoost can provide better prediction accuracy than other methods. Zhang and Chen [11] applied XGBoost to bond default risk prediction and found that the XGBoost algorithm was superior to traditional algorithms (e.g., LR, SVM, and KNN) for dealing with imbalanced data. Given its high performance in various fields, this study used the XGBoost algorithm as a representative machine-learning algorithm to establish a default prediction model and used it as a benchmark for comparison with deep learning models.

The default prediction model based on machine learning requires extracting features of the transaction flow data, and the quality of the model largely depends on this feature engineering [12–14]. The LSTM [15] algorithm performs well in time-series data mining and has had many applications in the financial field [16–19], such as customer service marketing, risk control, and trading strategy. In the field of antifraud, for example, deep learning technology automatically recognizes fraudulent transactions from massive amounts of transaction data, realizes successful interception, and blocks fraudulent transactions, thereby improving system effectiveness, reducing the rate of false alarms, and reducing compliance risks [20]. This article focuses on exploring the application of deep learning technology in processing transaction flow data to improve the accuracy of default prediction. Machine learning algorithms and deep learning algorithms are used on the same data set to construct default prediction models, and the prediction accuracy and modeling workload are compared, ultimately revealing that the deep learning model achieves high prediction accuracy without manual feature extraction. The results of this study have practical significance for guiding financial institutions in reducing losses caused by credit card customers who default.

2. Data

The source used to establish the default prediction model is user data from the credit card services of a small-scale commercial bank. Use of the data was authorized by the bank, and the data were desensitized and contain no user identity information. Most cardholders are small business owners, and consumption amounts exceeding 500 yuan are automatically converted into installments. Thus, most users rely on IOUs or cash withdrawals, and there are only a few outgoing transactions and repayment transactions every month.

2.1. Data Type Description

The data include basic user information and installment, billing, transaction flow, and credit information; all of these variables are monthly. In addition to directly using the monthly values within the observation period, statistical aggregates are also used, including the maximum, minimum, and average values, as well as the ratio of the amount to the credit limit during the period.

2.1.1. Basic Information

Basic information includes not only attributes such as age, gender, and marital status but also a customer hierarchical code, which is a comprehensive indicator based on the user’s occupation; company nature and size; position; and annual income. It also includes the number of days from card issuance to the first use of the credit card, derived from user behavior, which can reflect the user’s desire for funds.

2.1.2. Billing, Installment, and Credit Information

Billing information describes the user’s past bill amounts and overdue status. Installment information comprises the number and amount of installments in different states (e.g., new, activated, completed in advance, and completed), as well as the main installment type and expiration time. The credit information is the PBOC credit score, which is updated monthly.

2.1.3. Transaction Flow Information

Transaction flow data contain the user ID and transaction date, time, type, and amount. The transaction type is a 4-digit code covering more than 100 transaction types; the transaction amount is in RMB, with positive values meaning outgoing and negative values meaning incoming. As shown in Table 1, the length of monthly transaction data differs by cardholder, which is typical of unstructured data. Together with the sheer volume of data, this variable length is the main difficulty in processing transaction data.

2.2. User Definition

A good user is defined as one having no overdue situation within one year after the observation point. Bad users are defined as users who have been overdue for more than 60 days within one year after the observation point. Users who are overdue for 0–60 days are uncertain users and were excluded. The total number of samples is 140,000, of which 80% were randomly selected as training samples and 20% as test samples. In both the training data and the test data, bad customers account for 20% of the samples.
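For concreteness, a minimal sketch of such a stratified split follows; the DataFrame `samples` and the column name `label` are hypothetical stand-ins for the bank's internal schema, not names from the paper:

```python
# Minimal sketch (not the authors' code): an 80/20 stratified split that
# preserves the ~20% bad-user proportion in both training and test sets.
# `samples` and `label` are hypothetical names for the internal schema.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_samples(samples: pd.DataFrame):
    train, test = train_test_split(
        samples,
        test_size=0.2,              # 20% held out as test samples
        stratify=samples["label"],  # keep the bad-user ratio equal in both sets
        random_state=42,
    )
    return train, test
```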

3. Model Evaluation Index

The confusion matrix and area under the curve (AUC) were used to evaluate the models. The confusion matrix summarizes the records in the data set in matrix form according to two criteria: the real category and the classification judgment made by the model [21]. The name derives from the fact that the matrix makes it easy to see whether the model confuses categories (i.e., predicts a positive category as negative). P and N represent the positive and negative judgment results of the model, respectively; T and F indicate whether the model's judgment is true or false. The confusion matrix is defined as follows:

                        Predicted positive     Predicted negative
    Actual positive     TP (true positive)     FN (false negative)
    Actual negative     FP (false positive)    TN (true negative)

Knowing the confusion matrix, the accuracy rate (ACC), precision, and recall can be calculated as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$

The ACC value is calculated at a single cutoff value between positive and negative examples. Compared with the ACC value, the AUC value integrates the predictive performance over all cutoff values [11, 22]. AUC represents the area under the receiver operating characteristic (ROC) curve and lies between 0 and 1; as a single value, it can intuitively evaluate the quality of the classifier: the larger, the better. It can be calculated with the standard rank-based (Wilcoxon-Mann-Whitney) formula:

$$\mathrm{AUC} = \frac{\sum_{i \in \text{positive}} \mathrm{rank}_i - \frac{M(M+1)}{2}}{M \cdot N},$$

where $M$ and $N$ are the numbers of positive and negative samples, respectively, and $\mathrm{rank}_i$ is the rank of sample $i$ when all samples are sorted by predicted score.
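As an illustration only (not the authors' code), these indicators can be computed with scikit-learn; the toy arrays below are invented for the example:

```python
# Toy example: computing ACC, precision, recall, and AUC for a binary
# classifier, matching the definitions above. Data are invented.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])             # 1 = bad user
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.8, 0.6])
y_pred = (y_score >= 0.5).astype(int)                    # one fixed cutoff

print("ACC      :", accuracy_score(y_true, y_pred))      # (TP+TN)/(TP+TN+FP+FN)
print("Precision:", precision_score(y_true, y_pred))     # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))        # TP/(TP+FN)
print("AUC      :", roc_auc_score(y_true, y_score))      # integrates all cutoffs
```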

4. XGBoost Model

4.1. XGBoost Algorithm

Extreme gradient boosting (XGBoost) is a boosting-type tree algorithm proposed by Chen and Guestrin. It is widely used in web text and product classification, as well as customer behavior prediction, and it has achieved state-of-the-art results in many machine learning competitions [10]. Its wide application benefits from optimization in four aspects: a distributed weighted quantile sketch algorithm that solves the problem of split-point selection, better processing of sparse data, an efficient cache-aware block data storage structure, and better use of parallel and distributed computing.

The prediction result of the model consisting of $K$ decision trees is

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$

where $x_i$ is the $i$-th input sample; $\hat{y}_i$ is the predicted value calculated through the mapping relationships $f_k$; and $\mathcal{F}$ is the collection of mapping relationships (regression trees). The optimization objective and loss function are defined as

$$L(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2.$$

Here, $l$ is the differentiable convex loss function, which measures the difference between the predicted value $\hat{y}_i$ and the target value $y_i$, and $\Omega$ is an additional regularization term that helps smooth the model weights to avoid overfitting.

The above objective contains functions as parameters and cannot be optimized in Euclidean space using traditional methods. Let $\hat{y}_i^{(t)}$ be the prediction for the $i$-th sample at the $t$-th iteration. The loss function of the $t$-th iteration is then defined as

$$L^{(t)} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t).$$

A second-order Taylor expansion is performed on the previous equation:

$$L^{(t)} \simeq \sum_{i=1}^{n} \Bigl[ l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Bigr] + \Omega(f_t),$$

where $g_i = \partial_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}^{(t-1)}\bigr)$ is the first-order partial derivative of $l$ with respect to $\hat{y}^{(t-1)}$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l\bigl(y_i, \hat{y}^{(t-1)}\bigr)$ is the second-order partial derivative. Removing the constant term yields the simplified objective function of the $t$-th iteration:

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \Bigl[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Bigr] + \Omega(f_t).$$

With $I_j = \{\, i \mid q(x_i) = j \,\}$ defined as the sample set on leaf node $j$ and $w_j$ as the weight of the corresponding leaf node, the simplified objective can be rewritten as

$$\tilde{L}^{(t)} = \sum_{j=1}^{T} \Bigl[ \Bigl(\sum_{i \in I_j} g_i\Bigr) w_j + \tfrac{1}{2} \Bigl(\sum_{i \in I_j} h_i + \lambda\Bigr) w_j^2 \Bigr] + \gamma T.$$

For a given tree structure $q$, the optimal weight $w_j^*$ of each leaf node can be calculated by the following formula:

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.$$

The corresponding optimal objective function can be calculated as follows:

$$\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\bigl(\sum_{i \in I_j} g_i\bigr)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T.$$

4.2. Transaction Flow Feature Extraction

Using the XGBoost model for overdue forecasting requires feature extraction from the original transaction data, converting original data of unequal length into feature data of equal length. The transaction data were first classified and counted based on business experience and historical documents. The process of feature extraction is to initially screen out important features based on business experience and then derive feature statistics. First, the data were divided into four main categories according to transaction type: consumption, cash withdrawal, fee, and repayment. The number of transactions and the total transaction amount were then counted separately for each major category. Finally, statistics were calculated over the three-month window. In the first stage, 48 feature values were extracted, as shown in Table 2. However, the prediction results obtained from these features were not ideal. In the second stage, we continued to dig into the data to understand the business background and added 34 features (for a total of 82 features, as shown in Table 2). The added features include the number of days of repayment in advance, statistics on the number of transactions per month, and separate statistics for a special transaction type: penalty.
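A hedged sketch of the first-stage extraction follows; the column names and the mapping from 4-digit type codes to the four major categories are hypothetical, since the real code table is proprietary:

```python
# Sketch of the first-stage feature extraction: per user, count transactions
# and sum amounts for each major category and month, then add 3-month totals.
# Column names and the type-code mapping are hypothetical.
import pandas as pd

TYPE_TO_CATEGORY = {"1001": "consumption", "2001": "cash_withdrawal",
                    "3001": "fee", "4001": "repayment"}  # illustrative codes

def extract_flow_features(txns: pd.DataFrame) -> pd.DataFrame:
    """txns columns: user_id, month (1..3), type (4-digit code), amount."""
    txns = txns.assign(category=txns["type"].map(TYPE_TO_CATEGORY))

    # Per user / month / category: transaction count and total amount.
    monthly = (txns.groupby(["user_id", "month", "category"])["amount"]
                   .agg(n_txn="count", total_amt="sum")
                   .unstack(["month", "category"], fill_value=0))
    monthly.columns = ["_".join(map(str, c)) for c in monthly.columns]

    # 3-month window statistics per category (totals shown for brevity;
    # max/min/mean of the monthly values would be derived similarly).
    window = (txns.groupby(["user_id", "category"])["amount"]
                  .agg(n_txn_3m="count", total_amt_3m="sum")
                  .unstack("category", fill_value=0))
    window.columns = ["_".join(map(str, c)) for c in window.columns]
    return monthly.join(window).fillna(0)
```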

4.3. XGBoost Model Establishment

The flowchart for XGBoost model establishment is shown in Figure 1. First, user-related basic information, credit information, and billing, installment, and transaction flow data were collected and preprocessed (e.g., data screening, classification label generation, and missing data supplementation). Then, the flow characteristics of the transaction flow data were extracted and input into the XGBoost classification model. For the other types of data, monthly values and statistics were generated for the observation period and entered into the XGBoost classification model as feature values. Finally, the XGBoost classification model was trained. If the model evaluation conditions are met, the establishment of the default prediction model is complete.
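The training step itself can be sketched as follows; the hyperparameter values are illustrative placeholders rather than the paper's settings, and `X_train`, `y_train`, `X_test`, and `y_test` denote the assembled feature matrices and labels:

```python
# Sketch of the XGBoost training step in Figure 1. Hyperparameters are
# illustrative; X_train/X_test concatenate the basic, billing, installment,
# credit, and extracted flow features described above.
import xgboost as xgb
from sklearn.metrics import roc_auc_score

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=4,
    learning_rate=0.1,
    eval_metric="auc",   # evaluate with AUC, as in Section 3
)
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"test AUC = {test_auc:.3f}")
```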

4.4. Feature Importance

Figure 2 shows the top 20 most important features and their weights for the XGBoost model. The default classification is primarily based on penalty-related features and the PBOC score. Five of the top 20 most important indicators are related to transaction flow data, accounting for a total of 24% of the importance. This shows that the transaction flow data contain important information related to default prediction. Penalty-related features were not in the preliminary extraction of the transaction flow data but are important features selected after in-depth analysis of each transaction type. The model evaluation indicators of XGBoost models using different feature sets are shown in Table 3. When transaction features are not used for overdue prediction, the AUC of the XGBoost model is 71%. Using the 48 preliminarily extracted transaction features from the first stage improves prediction over not using them. Compared with using only the preliminarily extracted transaction features, using all 82 transaction features increases the AUC by 13% and the recall rate by 34%, a significant improvement indicating that the features mined in the second stage have better distinguishing ability. Feature extraction is thus the most important and most arduous step in machine learning modeling.
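For reference, the importance ranking of a fitted model can be read off directly; this sketch assumes the `model` and `X_train` objects from the training step above:

```python
# Sketch: rank features of the fitted XGBClassifier by importance weight,
# as in Figure 2. Assumes `model` and `X_train` from the training step.
import pandas as pd

importance = (pd.Series(model.feature_importances_, index=X_train.columns)
                .sort_values(ascending=False))
print(importance.head(20))  # the top 20 most important features
```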

4.5. XGBoost Model Evaluation

The confusion matrix of the test data based on the XGBoost model is shown in Figure 3(a). According to the confusion matrix, the ACC is 86.5%, the precision is 74.1%, and the recall is 51.5%. The ROC curve of the test data is shown in Figure 3(b), and the AUC is 89%.

4.6. Comparison of Machine Learning Models

The corresponding data packages in Python were used to train the machine learning models for default prediction. Model parameter settings and data preprocessing methods are shown in Table 4. Except for the normalization required by the K-nearest neighbors (KNN) and SVM algorithms, the other algorithms used the same original data. The principle of parameter setting was to set as few parameters as possible without overfitting.

The model evaluation indicators of the machine learning algorithms are shown in Table 5. The KNN and SVM algorithms appear less effective for default prediction, while the decision tree, random forest, AdaBoost, and XGBoost algorithms perform better; of these, XGBoost is the best. Figure 4 shows the ROC curves for the different algorithms.
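A sketch of this comparison follows; hyperparameters are placeholders rather than the Table 4 settings, and normalization is applied only to KNN and SVM, as noted above:

```python
# Sketch of the model comparison: fit several classifiers on the same data
# and compare test AUC. Hyperparameters are placeholders, not Table 4.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score
import xgboost as xgb

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True)),
    "DecisionTree": DecisionTreeClassifier(max_depth=6),
    "RandomForest": RandomForestClassifier(n_estimators=200),
    "AdaBoost": AdaBoostClassifier(n_estimators=200),
    "XGBoost": xgb.XGBClassifier(n_estimators=300, max_depth=4),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name:>12}: AUC = {auc:.3f}")
```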

5. XGBoost-LSTM Model

5.1. LSTM Algorithm

The traditional neural network model is fully connected from the input layer to the hidden layer and then to the output layer; there is no connection between nodes in the same layer, and propagation through the network is sequential. This kind of network structure is often powerless against sequence or time-series problems because it lacks memory. A different kind of network, the recurrent neural network (RNN), is therefore required. LSTM is a type of RNN that is especially good at processing sequence data. The ingenuity of LSTM is that, by adding input, forget, and output gates, the weight of the self-loop is changed. With fixed model parameters, the integration scale at different times can be changed dynamically, thereby avoiding the problem of vanishing or exploding gradients. Figure 5 illustrates the structure of the LSTM unit.

5.1.1. Input Gate

The input gate ($i_t$) controls the information entered into the internal storage unit, which can be expressed as follows:

$$i_t = \sigma\bigl(b^i + U^i x_t + W^i h_{t-1}\bigr),$$

where $\sigma$ is the sigmoid function; $x_t$ is the input vector at time $t$; $h_{t-1}$ is the hidden layer vector, including the output of all LSTM units; and $b^i$, $U^i$, and $W^i$ represent the bias, the input weight, and the recurrent weight of the input gate, respectively.

5.1.2. Forget Gate

The forget gate ($f_t$) controls how much information from the previous time step is stored in the internal storage unit, which can be expressed as

$$f_t = \sigma\bigl(b^f + U^f x_t + W^f h_{t-1}\bigr),$$

where $b^f$, $U^f$, and $W^f$ represent, respectively, the bias, input weight, and recurrent weight of the forget gate. The internal storage unit (cell state) of the LSTM unit is updated as

$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh\bigl(b + U x_t + W h_{t-1}\bigr),$$

where $b$, $U$, and $W$ represent the bias, input weight, and recurrent weight of the LSTM unit, respectively. On the right side of this formula, the first term is the cell state information controlled by the forget gate, and the second term is the input information controlled by the input gate [16].

5.1.3. Output Gate

The output gate ($o_t$) controls the internal storage unit by releasing and generating the required information, which can be given by the following formulas:

$$o_t = \sigma\bigl(b^o + U^o x_t + W^o h_{t-1}\bigr), \qquad h_t = o_t \odot \tanh(c_t).$$

Here, $b^o$, $U^o$, and $W^o$ represent the bias, input weight, and recurrent weight, respectively, of the output gate.
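To make the unit concrete, here is a minimal Keras sketch of an LSTM classifier consistent with the structure above. The layer sizes and input shape are illustrative, not the paper's architecture; the two softmax outputs correspond to the two LSTM output results that Section 5.3 later feeds into XGBoost:

```python
# Minimal Keras sketch (illustrative sizes, not the paper's architecture):
# an LSTM classifier over spliced transaction sequences. timesteps=90 matches
# 3 months x 30 transactions (Section 5.2); n_features is a placeholder.
import tensorflow as tf

def build_lstm_classifier(timesteps: int = 90, n_features: int = 4):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, n_features)),
        tf.keras.layers.Masking(mask_value=0.0),         # skip zero-padded steps
        tf.keras.layers.LSTM(64),                        # the gated unit of Figure 5
        tf.keras.layers.Dense(2, activation="softmax"),  # P(good), P(bad)
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```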

5.2. Transaction Flow Data Processing

The establishment of a default prediction model using the LSTM algorithm does not require feature extraction; the data processing work mainly involves interception (truncation) and completion (padding). Based on the distribution of the average number of transactions per month per user, 30 was selected as the threshold for the number of transactions per month. Data exceeding the threshold were discarded, and insufficient data were filled with zeros. The 3 months of transaction flow data were then spliced together. Table 6 shows the processed input data for the LSTM model for one account.
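A sketch of this interception-and-completion step follows; the array layout and names are illustrative:

```python
# Sketch: truncate each month to at most 30 transactions, zero-pad shorter
# months, and splice the 3 months into one fixed-length LSTM input sequence.
import numpy as np

MAX_TXN_PER_MONTH = 30

def pad_month(txns: np.ndarray, n_features: int) -> np.ndarray:
    """txns: (n_txn, n_features) array for one user-month."""
    out = np.zeros((MAX_TXN_PER_MONTH, n_features))
    n = min(len(txns), MAX_TXN_PER_MONTH)  # transactions beyond 30 are discarded
    out[:n] = txns[:n]
    return out

def splice_user(months: list, n_features: int) -> np.ndarray:
    """Concatenate 3 padded months -> (90, n_features) sequence."""
    return np.vstack([pad_month(m, n_features) for m in months])
```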

5.3. XGBoost-LSTM Model Establishment

The flowchart for establishing the XGBoost-LSTM model is shown in Figure 6. The processing of transaction flow data differs from that used to establish the XGBoost model: no feature extraction is required. Instead, the LSTM flow-data classification model is trained directly on the spliced flow data from the observation period. The classification results of the LSTM model are then used as features and input into the XGBoost model for training.
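A hedged sketch of this fusion step follows; all names are illustrative, with `lstm_model` denoting the trained flow classifier and `X_*_structured` the non-flow feature matrices:

```python
# Sketch of the fusion in Figure 6: the LSTM's two class-probability outputs
# become extra feature columns for XGBoost. All names are illustrative.
import numpy as np
import xgboost as xgb

lstm_feat_train = lstm_model.predict(seq_train)  # shape (N_train, 2)
lstm_feat_test = lstm_model.predict(seq_test)

X_train_fused = np.hstack([X_train_structured, lstm_feat_train])
X_test_fused = np.hstack([X_test_structured, lstm_feat_test])

fusion = xgb.XGBClassifier(n_estimators=300, max_depth=4)
fusion.fit(X_train_fused, y_train)
p_default = fusion.predict_proba(X_test_fused)[:, 1]  # predicted default prob.
```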

5.4. Feature Importance

Figure 7 shows the top 20 most important features and their weights for the XGBoost-LSTM model. The default classification of the XGBoost-LSTM model was based on the outputs of the LSTM model and the PBOC score.

5.5. XGBoost-LSTM Model Evaluation

The confusion matrix for the test data based on the XGBoost-LSTM model is shown in Figure 8(a). The ACC is 93.6%, the precision is 92.8%, and the recall is 73.6%. The ROC curve of the test data is shown in Figure 8(b), and the AUC is 95%.

5.6. Comparison of XGBoost Model and XGBoost-LSTM Model
5.6.1. Data Processing

Table 7 shows the data sources and data processing methods required by the different algorithms. The XGBoost-LSTM model requires the least amount of data processing work because it eliminates feature extraction from the transaction flow data.

5.6.2. Model Performance

The model evaluation indicators are shown in Table 8. Combining these with the data in Table 3: when transaction features are not used for overdue prediction, the test-set AUC of the model is 71%, the recall rate is 0.8%, and the precision is 49%. After adding the two output results of the LSTM model, the test-set AUC increases by 24%, the recall rate by 73%, and the precision by 43%. This significant increase in the evaluation indicators proves the usefulness of the output features of the LSTM model. At the same time, the improvement of the XGBoost-LSTM model in AUC and precision over the XGBoost model built with the extracted transaction features is smaller. This shows that, provided useful features are extracted, the XGBoost model can also achieve good predictive performance; however, extracting useful features requires deep business experience and a large amount of data mining. As shown in Figures 3(a) and 8(a), the XGBoost-LSTM model identified 2,035 more bad samples than the XGBoost model (out of a total of 8,863 bad samples), and the number of good samples falsely identified as bad decreased by 957. In practical application, this means that 23% more bad customers are detected and the false identification rate among samples predicted to be bad decreases by 16%. The XGBoost-LSTM model thus obtains the best prediction results on the data set used in this study. These results verify that the XGBoost-LSTM model achieves high classification accuracy without feature extraction.

6. Conclusions

In the method used in this study, features related to transaction flow had the highest importance weights, showing that transaction flow data can effectively predict credit card default. Second, in the process of XGBoost modeling, the accuracy of default prediction mainly depended on feature extraction. It takes a great deal of time to understand the specific meaning of each transaction type, and only with a deep understanding of the credit card business background can transaction types be correctly classified and useful features extracted. Third, when applying LSTM to process transaction flow data, it was only necessary to complete and splice the data, without any feature extraction work, which again confirms that the advantage of deep learning is that it does not require manual feature extraction. Finally, the XGBoost-LSTM fusion model combined basic, billing, and installment information, as well as the PBOC credit score and transaction flow data, and obtained very good test accuracy. This study shows that LSTM is an effective method for dealing with credit card transaction flow data.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.