Abstract

With growing demand for credit evaluation and risk prediction, building an effective credit evaluation model for small- and medium-sized enterprises (SMEs) has become a research hotspot. Building on previous studies, this paper proposes a two-layer feature extraction method based on Gradient Boosting Decision Tree (GBDT) and Convolutional Neural Network (CNN). First, GBDT is used to combine and automatically screen the original features and to handle their missing values, yielding transformed high-dimensional sparse features. Then, CNN is used to extract features further, and finally, a logistic regression (LR) model makes the prediction. In the simulation experiment, a dataset of 14,366 SME credit evaluation records is used as the analysis sample to verify the results. The results show that the GBDT-CNN-LR model performs best among the compared models. The model also shows good generalization ability and stability in the reliability test.

1. Introduction

Credit financing for small- and medium-sized enterprises (SMEs) faces two problems. On the one hand, because of their small scale and their high operating and capital-flow risks, their financing channels and financing limits are restricted; on the other hand, their high debt repayment risk and fraudulent behavior expose the banking industry to a large risk of capital loss. Both problems stem from the asymmetry of information between the two parties. Establishing a high-precision credit evaluation and prediction model that addresses the financing difficulties and high credit risks of SMEs has therefore become a focus of current research.

SME credit evaluation based on artificial intelligence algorithms is accurate and fast, and it is increasingly used in the bank credit evaluation business. At the same time, the requirements for the accuracy of the evaluation algorithm are also increasing. Scholars have done extensive research on algorithms for SME credit classification prediction, including statistical methods, single machine learning algorithms, ensemble (integrated) learning algorithms, and multimodel hybrid ensemble learning algorithms [1–4]. Compared with credit evaluation methods based on machine learning algorithms, traditional statistical methods often require more complicated feature engineering in the early stage, which is not only inefficient but also makes model accuracy depend heavily on that early feature engineering work. The data mining models of machine learning algorithms mainly include artificial neural networks [5–8], support vector machines [9–11], and decision trees [12, 13]. Huang et al. [14] compared the classification accuracy and applicability of several common neural network models. The empirical results show that the probabilistic neural network (PNN) has the lowest classification error rate. Uddin et al. [15] applied the random forest (RF) method to the robust modeling of credit default prediction and found it to be a more efficient classifier than the others compared. Wang et al. [16] selected appropriate indicators and used an improved SVM model to detect the credit risk of SMEs. Luo et al. [17] applied a deep belief network with Restricted Boltzmann Machines to credit scoring and obtained higher accuracy than traditional logistic regression methods. Zhong et al. [18] compared the training effects of BP, ELM, I-ELM, and SVM, and the results showed that ELM and BP neural networks perform better.

Missing values, high dimensionality, and redundancy in SME credit evaluation data make it difficult to find the optimal feature set for the evaluation classifier, which is a key factor behind the low accuracy of current evaluation classification. To further improve evaluation performance, research on hybrid integrated machine learning algorithms has targeted these problems so that the integrated model outperforms the original model on the evaluation indicators of the predicted results. The RS-PSO-SVM model [19] addresses nonlinear modeling and multicollinearity with high accuracy and efficiency. It uses the PSO algorithm to optimize the SVM model parameters and to assess and classify corporate credit risks. Sun et al. [20] combined SMOTE and Bagging to propose the DTE-SBD model, which not only handles the class imbalance problem of enterprise credit evaluation but also increases the diversity of base classifiers for the DT ensemble. Ma [21] put forward a hybrid integrated method, RS-Boosting, based on boosting and random subspace sampling to predict corporate credit risk and verified the effectiveness and feasibility of the method through empirical comparisons. Arora and Kaur [22] used the Bolasso algorithm to select consistent and relevant features from the feature library and applied the generated candidate features to different classification algorithms such as the random forest. The results showed that the BS-RF algorithm performs well in the classification accuracy of credit evaluation.

SME credit evaluation data have complex, highly redundant features and often contain many missing values. Therefore, when machine learning methods are used for corporate credit evaluation, high requirements are placed on the early-stage processing of missing data, and good feature engineering is also required. However, most of the above models simply remove the redundant features in the metadata and put the resulting subsets into one or several base models for training, without comparing and verifying the results of the selected subsets across different base models. In addition, when the number of feature indicators in the dataset changes, the original model is no longer applicable.

To address the shortcomings of existing research, this paper proposes a hybrid ensemble model that uses the GBDT-CNN method for feature extraction to evaluate corporate credit. The method extracts features from the original data in a way that handles missing sample values effectively while reducing the difficulty of feature engineering. This reduces the assumptions made about the missing-data mechanism and the dependence on a data distribution model, and it makes the model more robust to abnormal values in the original data.

2. Enterprise Credit Evaluation Techniques and Procedures

2.1. GBDT Model

Gradient Boosting Decision Tree (GBDT) is an iterative decision tree algorithm based on the idea of Boosting and the CART algorithm. The first decision tree is generated from the original predictive indicators; in each subsequent iteration, the goal is to minimize the loss function of the current learner, that is, to make the loss function descend along its gradient. Through continuous iteration, the final residual error approaches 0. Adding up the results of all trees gives the final prediction [23].

The credit risk identification of SMEs is a binary classification problem, in which risk is predicted from indicators such as basic corporate information, stocks, capital, investment, and income. Let $y \in \{0, 1\}$ denote the credit behavior of the enterprise, with $y = 1$ denoting dishonest behavior and $y = 0$ denoting nondishonest behavior, and let $x = (x_1, \ldots, x_k)$ be a $k$-dimensional variable composed of basic information of the enterprise. For a training dataset containing $N$ samples, the GBDT modeling process is as follows:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c),$$

where $f_0(x)$ is the initial decision tree with only one root node, $(x_i, y_i)$ is the $i$-th training sample, $c$ is the constant that minimizes the loss function, and $L$ is the loss function.

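In the GBDT model, different loss functions can be used for binary classification problems, but the log-likelihood loss is generally used. For labels $y \in \{0, 1\}$ it can be written as

$$L(y, f(x)) = -\left[\, y \log p(x) + (1 - y) \log\left(1 - p(x)\right) \right], \qquad p(x) = \frac{1}{1 + e^{-f(x)}},$$

where $f(x)$ is the binary classification model to be solved.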

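Let the number of iterations be $M$. In iteration $m = 1, \ldots, M$, the negative gradient for the $i$-th training sample is

$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}.$$

According to all samples and their negative gradient directions $(x_i, r_{mi})$, a decision tree composed of $J$ leaf nodes is obtained. The $j$-th leaf node region is $R_{mj}$, and the best fit value of each leaf node is

$$c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L\left(y_i, f_{m-1}(x_i) + c\right).$$

The learner obtained in this round is

$$h_m(x) = \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj}),$$

where $I(x \in R_{mj})$ is the indicator function that equals 1 when the sample falls in the $j$-th leaf node region and 0 otherwise, and

$$f_m(x) = f_{m-1}(x) + h_m(x).$$

After $M$ rounds of iteration, the final decision model is

$$f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj}).$$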

According to the number of times a variable is selected as the split variable in the regression trees during the iteration process and the degree of improvement it brings to the model at each split, the importance of each variable can be obtained as

$$\hat{J}_j^{2} = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_j^{2}(T_m), \qquad \hat{J}_j^{2}(T_m) = \sum_{t=1}^{J-1} \hat{i}_t^{2}\, I(v_t = j),$$

where $T_m$ is the decision tree trained in the $m$-th iteration, $I(v_t = j)$ is the indicator function that equals 1 when the $j$-th variable $x_j$ is selected as the split variable of the $t$-th internal node of $T_m$, $\hat{i}_t^{2}$ denotes the improvement of the prediction result when that variable is used as the split variable, and $\hat{J}_j^{2}(T_m)$ represents the importance value of variable $x_j$ in the decision tree $T_m$.
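The boosting procedure of this section can be illustrated with a minimal sketch. It is not the implementation used in the experiments: it assumes labels in {0, 1}, the log-likelihood loss, one-split trees (stumps) as weak learners, a single Newton step to approximate each leaf value, and a shrinkage factor `lr` that does not appear in the derivation. All function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_stump(X, r):
    """Return (gain, feature, threshold) of the best one-split tree fitted to r."""
    base, best = sse(r), None
    for j in range(len(X[0])):
        # The largest value is skipped: it would leave the right side empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [r[i] for i, row in enumerate(X) if row[j] <= t]
            right = [r[i] for i, row in enumerate(X) if row[j] > t]
            gain = base - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

def fit_gbdt(X, y, n_rounds=10, lr=0.5):
    """Assumes both classes are present and at least one feature varies."""
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1.0 - p0))             # constant minimizing the log loss
    F = [f0] * len(y)
    trees, importance = [], [0.0] * len(X[0])
    for _ in range(n_rounds):
        p = [sigmoid(f) for f in F]
        r = [yi - pi for yi, pi in zip(y, p)]  # negative gradient of the log loss
        gain, j, t = best_stump(X, r)
        leaf = {}
        for side in (True, False):
            idx = [i for i, row in enumerate(X) if (row[j] <= t) == side]
            num = sum(r[i] for i in idx)
            den = sum(p[i] * (1.0 - p[i]) for i in idx)
            leaf[side] = num / max(den, 1e-12)  # Newton step approximating the leaf value
        for i, row in enumerate(X):
            F[i] += lr * leaf[row[j] <= t]
        trees.append((j, t, leaf[True], leaf[False]))
        importance[j] += gain / n_rounds        # average split improvement per variable
    return {"f0": f0, "lr": lr, "trees": trees, "importance": importance}

def predict_proba(model, x):
    f = model["f0"]
    for j, t, left_value, right_value in model["trees"]:
        f += model["lr"] * (left_value if x[j] <= t else right_value)
    return sigmoid(f)
```

On a small dataset whose first feature separates the two classes, every round splits on that feature, so its importance value is positive while that of the other feature stays at zero.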

2.2. CNN

A Convolutional Neural Network (CNN) consists of one or more convolutional layers and a fully connected layer, together with associated weights and pooling layers. CNN features such as local connection, weight sharing, and pooling effectively reduce network complexity and the number of training parameters. They also give the model a certain degree of invariance to translation, distortion, and scaling. While maintaining strong robustness and fault tolerance, the network structure is also easy to train and optimize [2, 7, 24].

Here, the combined features and feature classifications found automatically by GBDT are mapped to higher dimensions through the CNN so that they better reflect the distribution of the data.
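Local connection, weight sharing, and pooling can be illustrated with a small sketch. It assumes a one-dimensional convolution over a sparse feature vector; the paper does not state the layer shapes, so the kernel, the pooling size, and the function names here are purely illustrative.

```python
def conv1d(x, kernel, bias=0.0):
    """Valid 1D convolution: the same kernel weights are reused at every position."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]

def relu(values):
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

sparse_input = [0, 1, 0, 0, 1, 1, 0, 0]   # e.g. part of a one-hot encoded vector
kernel = [1.0, 0.5, -1.0]                 # 3 shared weights instead of one per input
features = max_pool1d(relu(conv1d(sparse_input, kernel)), size=2)
print(features)
```

Each output value depends only on three neighboring inputs (local connection), and the whole layer has three weights regardless of input length (weight sharing), which is how the parameter count stays small.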

2.3. Logistic Regression

Logistic regression is used for classification problems. The decision boundary can be expressed as $w^{T}x + b = 0$; if a sample point satisfies $w^{T}x + b > 0$, its category is judged as 1. For the binary classification problem, the given dataset is as follows:

$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \qquad x_i \in \mathbb{R}^{n}, \quad y_i \in \{0, 1\}.$$

Because the value of $w^{T}x + b$ is continuous, it is used to fit the conditional probability $P(y = 1 \mid x)$. However, for $w \neq 0$, $w^{T}x + b$ takes values over all of $\mathbb{R}$, which does not conform to the range of a probability, 0 to 1, so a generalized linear model is used. The ideal link is the unit step function:

$$P(y = 1 \mid x) = \begin{cases} 0, & z < 0, \\ 0.5, & z = 0, \\ 1, & z > 0, \end{cases} \qquad z = w^{T}x + b.$$

The step function is not differentiable, and the logistic (log-odds) function is a commonly used substitute:

$$\hat{y} = \frac{1}{1 + e^{-(w^{T}x + b)}}.$$

Then,

$$\ln \frac{\hat{y}}{1 - \hat{y}} = w^{T}x + b.$$

Regarding $\hat{y}$ as a class posterior probability estimate,

$$w^{T}x + b = \ln \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)}, \qquad P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{T}x + b)}}.$$

That is, the log odds of the output $y = 1$ is a linear function of the input $x$; this is the logistic regression model. The closer the value of $w^{T}x + b$ is to positive infinity, the closer the probability $P(y = 1 \mid x)$ is to 1. Therefore, logistic regression first fits the decision boundary and then establishes the probability link between this boundary and the classification, which gives the probability in the dichotomous case.
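The relation between the linear score, the logistic function, and the log odds can be checked numerically. The following is a minimal sketch; the weights, bias, and sample are arbitrary illustrative values, and the function names are hypothetical.

```python
import math

def linear_score(w, b, x):
    """Value of the linear function whose zero set is the decision boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_odds(p):
    return math.log(p / (1.0 - p))

def predict(w, b, x):
    """Class 1 exactly when the score is positive, i.e. the probability exceeds 0.5."""
    return 1 if sigmoid(linear_score(w, b, x)) > 0.5 else 0

w, b, x = [2.0, -1.0], 0.5, [1.0, 0.5]   # illustrative values only
z = linear_score(w, b, x)
p = sigmoid(z)
print(z, p, log_odds(p), predict(w, b, x))
```

The log odds of the predicted probability recovers the linear score, which is the identity the derivation above relies on.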

3. Enterprise Credit Evaluation Model GBDT-CNN-LR

The samples used by SMEs for credit evaluation often contain a large amount of missing data. The use of machine learning and other methods for credit evaluation has high requirements for the processing of missing data in the early stage. In addition, features of SMEs’ credit evaluation have the characteristics of large number, complexity, and high redundancy. Traditional machine learning methods must be based on good feature engineering in the early stage. Therefore, finding the optimal evaluation feature set of the evaluation classifier is the key to improving the accuracy of the evaluation classification. Most of the existing missing value processing methods use certain approaches to fill in data artificially. It is necessary to assume that the dataset obeys a certain distribution model. However, in practical applications, the feature missing data are often intertwined. If the assumptions and models are unreasonable, they will affect the follow-up learning effect of the classifier.

According to the analysis above, if a method adopted can make full use of the information contained in the known dataset, there is no need for the bank and other financial institutions to process the missing data before they classify SMEs’ credit, thereby reducing the assumption of the data missing mechanism and the dependence on the data distribution model. Thus, it improves the quality of the evaluation feature set in the evaluation classifier, thereby enhancing the classification accuracy. Therefore, this research mainly focuses on how to simplify the preliminary feature processing for enterprise credit data as much as possible so as to achieve the highest possible discrimination accuracy while realizing feature extraction and feature combination.

This problem can be considered from two aspects: first, compared with manual feature engineering, whether the adopted method can reflect the information contained in the original data features so as to ensure the accuracy of the subsequent classification of untrustworthy companies; second, whether the adopted method can adapt to and handle outliers and missing values in the data, including whether it is sensitive to the data and whether it can still maintain high accuracy when the data volume is large.

Therefore, this paper proposes a method based on GBDT-CNN to extract the features of the original data. The GBDT feature generation part is based on the idea of Boosting: except for the first decision tree, which is generated from the original predictors, each subsequent iteration aims to minimize the loss function of the current learner; that is, the loss function always descends along its gradient, and the final residual error tends to 0 through continuous iterations. Finally, the prediction result is obtained by combining the results of all the trees through a specific aggregation function. Unlike the traditional use of the model, this paper uses GBDT as a tool to automatically combine and filter the features of the original data, discover distinguishable features, and generate new feature combinations, thereby retaining the information contained in the original data. In addition, when the loss function is properly selected, GBDT is robust to abnormal conditions in the original dataset and is not sensitive to hyperparameters, so it can achieve good prediction accuracy without lengthy parameter tuning. Because the original dataset contains both continuous and discrete values, GBDT can handle them flexibly without preprocessing, which simplifies early feature engineering.

In the GBDT model section, each original data sample will eventually fall on the leaf node of the tree, and after the One-Hot encoding is connected, the transformed high-dimensional sparse feature vector is obtained. This paper then uses CNN with Batch Normalization as a further feature extraction tool to find higher-dimensional features to improve classification accuracy. The specific implementation methods are as follows. First, this paper uses BN to standardize the input data of each layer of the network to ensure that the mean and variance of the input distribution are stable within a certain range. While alleviating the Internal Covariate Shift problem in the network, it also alleviates the disappearance of the gradient to a certain extent and accelerates the convergence of the model. Second, BN makes the network more robust to parameters and activation functions and reduces the complexity of training and tuning of the neural network model. Third, the BN training process uses the Mini Batch mean and variance as the overall sample statistics estimation and introduces random noise. To a certain extent, they have a regularization effect on the model and enhance the robustness of the model.
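The two transformations described in this paragraph, one-hot encoding of the leaf each sample reaches and Batch Normalization of a mini-batch, can be sketched as follows. This is a simplified illustration: the leaf indices and leaf counts are made-up inputs, and `gamma`, `beta`, and `eps` are the usual scale, shift, and stability terms with assumed default values.

```python
import math

def one_hot_leaves(leaf_ids, leaves_per_tree):
    """leaf_ids[t] is the index of the leaf the sample reaches in tree t.
    The per-tree one-hot blocks are concatenated into one sparse vector."""
    vector = []
    for leaf, n_leaves in zip(leaf_ids, leaves_per_tree):
        block = [0.0] * n_leaves
        block[leaf] = 1.0
        vector.extend(block)
    return vector

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each column with the mini-batch mean and (biased) variance."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            out[i][j] = gamma * (column[i] - mean) / math.sqrt(var + eps) + beta
    return out

# A sample that lands in leaf 2 of a 4-leaf tree, leaf 0 of a 2-leaf tree,
# and leaf 1 of a 3-leaf tree becomes a 9-dimensional vector with three ones.
encoded = one_hot_leaves([2, 0, 1], [4, 2, 3])
print(encoded)
```

With $M$ trees the encoded vector always contains exactly $M$ ones, which is why the transformed features are high-dimensional and sparse.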

After extracting the characteristics of the original data through the GBDT-CNN method, the classification model is then used to identify and discriminate the untrustworthy enterprises. LR (logistic regression) is a kind of generalized linear model. The output is the weighted sum of the input features, and the final result is output by the Sigmoid function so that it lies between 0 and 1, which conforms to the meaning of probability. The credit evaluation of an enterprise is to conclude whether to lend or not after comprehensively inspecting various financial and operating indicators of the enterprise. Therefore, the logistic regression model can be better applied to the problem of enterprise credit evaluation, and it is easy to explain the importance of each evaluation index to the final evaluation result.

Based on the analysis and discussion above, this paper aims to establish a GBDT-CNN-LR-based credit risk assessment model for SMEs. The frame diagram is shown in Figure 1.

When integrated learning methods are used for enterprise credit evaluation, two factors need to be considered: (1) whether the model can effectively identify untrustworthy companies in the sample, that is, the accuracy requirement; (2) whether the weak learners of the model differ from one another so that the model does not degrade, that is, the diversity requirement. Regarding the first factor, previous studies that used the GBDT-LR model to predict Facebook ad clicks showed that it handles the prediction problem well and achieves high accuracy, which indicates that the GBDT-CNN-LR model has an application basis and can plausibly achieve good recognition accuracy. Regarding the second factor, GBDT draws on the idea of Boosting in the training process. Each round reduces the residual of the previous model along the gradient direction, so each classification tree constructed reduces the error of the previous step. Thus, GBDT pays more attention to samples with larger gradients, and each subsequently constructed decision tree can be regarded as focusing on only part of the samples. Compared with the forecast of ad clicks, enterprise credit evaluation requires a higher accuracy rate, because a wrong evaluation result may cause huge economic losses to the bank. In actual experiments, the traditional GBDT-LR model still struggles to reach the expected accuracy, since the accuracy of LR is limited by the preceding feature engineering. Therefore, this paper proposes applying CNN to the feature vector generated by GBDT. The intention is to find higher-dimensional features as input data to improve the prediction accuracy of LR.

4. Experiments and Discussion

4.1. Datasets

The experimental dataset contains the credit records of 14,366 small- and medium-sized enterprises and 14 characteristics, including company stock price, foreign investment, registered capital, corporate assets, income, expenses, liabilities, and taxation, which are selected as the credit evaluation indicators of small- and medium-sized enterprises.

4.2. Evaluation Index

The accuracy is used as the most important evaluation index, that is, the number of samples that are predicted correctly divided by the total number of samples, and the f1_score coefficient and recall_score are used as auxiliary evaluation indicators.
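For reference, the three indicators can be computed from the confusion counts as in the sketch below. It assumes that the positive class is labeled 1; the labels and predictions are made-up values, not results from the experiments.

```python
def confusion_counts(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0

def precision(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0

def f1(y_true, y_pred):
    """Harmonic mean of precision and recall."""
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if p + r else 0.0

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # illustrative labels
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # illustrative predictions
print(accuracy(y_true, y_pred), recall(y_true, y_pred), f1(y_true, y_pred))
```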

4.3. The Result of the Experiment

First of all, this paper conducts statistical analysis on the missing values of each feature in the sample set. Most of the features in the sample set used in this paper have 60% or more missing data, which verifies the universality of the problem that this paper aims to solve. Therefore, this paper uses the proposed GBDT-CNN model to search for the distribution and information of the data itself and automatically fill in the missing data. The new feature vector generated is substituted into the Logistic model as an input index to output the discrimination result.
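The per-feature missing-value statistics mentioned above can be computed as in the following sketch. It assumes that a missing entry is stored as `None` or NaN; the sample rows are invented for illustration and are not the experimental data.

```python
import math

def is_missing(value):
    """Treat None and floating-point NaN as missing (an assumed convention)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_rate_per_feature(rows):
    n, k = len(rows), len(rows[0])
    return [sum(1 for row in rows if is_missing(row[j])) / n for j in range(k)]

def overall_missing_rate(rows):
    cells = sum(len(row) for row in rows)
    return sum(1 for row in rows for v in row if is_missing(v)) / cells

rows = [
    [1.0, None, 3.0],
    [None, None, 2.0],
    [4.0, float("nan"), None],
    [5.0, 1.0, 0.0],
]
print(missing_rate_per_feature(rows))   # share of missing entries in each column
print(overall_missing_rate(rows))       # share of missing entries in the whole table
```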

First, compare the evaluation effects of the single model and the integrated model, and the results are shown in Table 1.

It can be seen from Table 1 that both the tree model and the logistic regression model can achieve better prediction accuracy, but the prediction accuracy rate of the SVM, MLP, NB, and KNN models is only 51.07%. The three evaluation indicators (accuracy, f1_score, and recall_score) of the model after adding CNN to extract features are higher than those of other models.

Before CNN is added to extract features, random forest, decision tree, and GBDT perform significantly better than the logistic model. Logistic regression is a linear model, whereas random forest, decision tree, and GBDT are all nonlinear models, which often perform better than logistic regression on nonlinear datasets and on many linear ones. The linearity of the logistic model, which limits its predictive ability, is therefore a reasonable explanation for this phenomenon.

When the GBDT model is used to extract features and the logistic model is then used for classification, the prediction accuracy is 93.49%, which is worse than that of single models such as random forest and decision tree. Therefore, this paper considers further optimization of the model. Since the features automatically filtered out by the GBDT model are high-dimensional and very sparse, this paper uses CNN to convolve and sum the features obtained by GBDT and move them from a highly sparse space to a moderately sparse space, which not only provides the degree of sparsity that logistic regression requires but also maintains the differences between features.

The experiment shown in Figure 2 compares the evaluation effect of the GBDT-CNN-LR model with that of the GBDT-LR model without CNN.

It can be seen from Figure 2 that adding CNN to extract features increases the accuracy by 4.6% compared with the GBDT-LR model without CNN. In addition to the evaluation indicators above, the area under the ROC curve (AUC) allows a more accurate judgment of the performance of the GBDT-CNN-LR model. Therefore, this paper draws the ROC curves of the different models. As shown in Figure 3, the AUC of GBDT-CNN-LR is 0.992, which is larger than that of the other models. Therefore, it can be considered that the GBDT-CNN-LR model, which adds CNN to extract features, is reasonable and has higher prediction accuracy for evaluating the credit risk of small- and medium-sized enterprises.
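The AUC can be computed without drawing the curve, using its interpretation as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses this pairwise definition on made-up scores; it is quadratic in the sample size and is meant only as an illustration.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5.
    Assumes labels in {0, 1} with at least one sample of each class."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]             # illustrative labels
scores = [0.1, 0.4, 0.35, 0.8]    # illustrative predicted probabilities
print(roc_auc(labels, scores))    # 3 of the 4 pairs are ordered correctly
```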

The missing values of the sample data account for a relatively large amount, reaching 42.6% of the total dataset. Using GBDT-CNN to automatically fill missing values has achieved high prediction accuracy, but if the new data does not fit the sample model, the model is very likely to be unstable. Therefore, this paper tests the stability of the model.

The dataset is divided into 4 parts, and each dataset retains the same missing rate as the original dataset. Then, we train each small dataset and draw the corresponding ROC_AUC curve graph, compare the AUC area of the model, and judge the stability of the model. The results are shown in Figure 4.
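One way to obtain parts with nearly the same missing rate is sketched below. The paper does not describe how the split was made, so this is an assumption: rows are shuffled, ordered by their number of missing values, and dealt to the parts in turn. Missing entries are assumed to be stored as `None`, and the function names are hypothetical.

```python
import random

def count_missing(row):
    return sum(1 for v in row if v is None)

def missing_rate(rows):
    return sum(count_missing(r) for r in rows) / sum(len(r) for r in rows)

def split_preserving_missing_rate(rows, n_parts=4, seed=0):
    """Deal rows round-robin after ordering them by how many values they lack."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)                                 # random order within ties
    order.sort(key=lambda i: count_missing(rows[i]))   # stable sort keeps that order
    parts = [[] for _ in range(n_parts)]
    for position, i in enumerate(order):
        parts[position % n_parts].append(rows[i])
    return parts

# Toy data: 40 rows of 4 features, where row i lacks (i % 4) values.
rows = [[None] * (i % 4) + [1.0] * (4 - i % 4) for i in range(40)]
parts = split_preserving_missing_rate(rows)
print([round(missing_rate(p), 3) for p in parts], round(missing_rate(rows), 3))
```

Each part can then be used to train and evaluate the models separately, and the resulting AUC values compared across parts.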

The results show that the prediction accuracy of the support vector machine model is still poor, and that of the multilayer perceptron (MLP) fluctuates sharply. The reason may be that the neural network is more sensitive to the data, that there is too little data, or that there are too many missing values, so the training of the neural network has a large error. The AUC of the GBDT-LR model without the CNN channel decreased by about 2%–3%, whereas the AUC of the GBDT-CNN-LR model with the CNN channel hardly decreased. Therefore, the GBDT-CNN-LR model shows good generalization ability and stability on both large and small datasets. The GBDT-LR model without the CNN channel also has good generalization ability and stability, but its values are lower than those of the GBDT-CNN-LR model.

5. Conclusions

SME credit evaluation based on artificial intelligence algorithms is being applied more and more widely in the bank credit evaluation business, which places higher requirements on the accuracy of the evaluation model and algorithm. This paper proposes the GBDT-CNN-LR evaluation model. The model first uses GBDT to automatically combine and filter the original data features, which copes better with problems such as concentrated missing indicator values, and obtains transformed high-dimensional sparse feature vectors. Then, on the basis of the feature vector generated by GBDT, CNN is used for further feature extraction, and finally, logistic regression makes the prediction from these higher-dimensional features. In the simulation experiment, the accuracy of the GBDT-CNN-LR model is clearly higher than that of the Random Forest Classifier, Decision Tree Classifier, Logistic Regression, SVM, and other basic classification algorithms. In addition, the model shows good generalization ability and stability in the reliability test, so it can help reduce investment risk and provide reliable technical support for financial institutions, which gives it practical significance.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was partially supported by the Group Building Scientific Innovation Project for Universities in Chongqing (CXQT21021) and the Science and Technology Research Project of Chongqing Education Commission (KJQN202100712).