Abstract

Network lending is an innovative financial product that operates outside traditional financial intermediaries and is delivered through Internet platforms. We study credit risk prediction for online loans based on risk efficiency analysis. We put forward the concept of borrower risk efficiency and apply it to risk prediction. The main task of this study is to construct a risk efficiency feature from a range of risk characteristics and to carry out risk prediction after a series of feature screening steps. The framework combines logistic regression with the slack-based measure (SBM), and feature selection and verification are carried out through machine learning and statistical methods. Firstly, the efficiency-related risk characteristics are extracted and the risk efficiency is calculated with MaxDEA. Secondly, the features are screened and verified in Python. Then, the efficiency value obtained by the SBM method is used as a new index, together with the initial related indexes, for training and testing the logistic model. To assess the effectiveness of the proposed risk-efficiency-based credit risk prediction scheme, we compare the predictions before and after adding the risk efficiency feature. The simulation results show that the logistic-SBM model is more suitable for credit risk prediction than the commonly used logistic method. Finally, some suggestions are put forward to China’s regulatory authorities and to the platforms themselves for controlling credit risk in the Internet lending industry.

1. Introduction

In the “Interim Measures for the Management of Business Activities of Online Lending Information Intermediaries” promulgated in 2016, online lending is defined as direct lending through an Internet platform between individuals, where individuals include natural persons, legal persons, and other organizations. Peer-to-peer (P2P) network finance is a branch of Internet finance, which is the product of the combination of the Internet and finance. The academic definition of P2P network finance has much in common with that of Internet finance, which is a new financial business model through which traditional financial institutions and Internet enterprises achieve financing. Davis and Gelpern and Slattery believe that P2P online lending has injected fresh vitality into the traditional lending market to meet the needs of investors and consumers [1, 2]. Financial technology based on P2P is one of the new breakthroughs in financial service institutions [3]. The main business models of Internet finance include Internet payment, online lending, equity crowdfunding, Internet fund sales, Internet insurance, Internet trust, and Internet consumer finance. Lenders have a greater impact on borrowers than borrowers have on lenders [4]. As big data and blockchain technologies advance, financial credit risk in the context of the Internet has become a popular research subject [5]. P2P online lending originated outside China. The earliest P2P online lending platform in the world is Zopa in the UK, which was established in London in March 2005. In 2007, China established its first P2P network lending enterprise. The new financial industry represented by peer-to-peer lending has gradually become a new source of volatility due to the increasing complexity of the Chinese financial market [6]. P2P lending platforms differ in background and transparency [7]. Platform background is related to operational risk [8].
The embryonic period of P2P online lending financial enterprises lasted from 2007 to 2012. From 2013 to 2015, these enterprises entered a period of vigorous expansion. The period from 2017 to now has been one of consolidation and standardization. More than 10000 P2P online lending financial enterprises have existed, of which more than 5000 were in operation at the same time. The annual transaction scale is about 3 trillion yuan, and the bad debt loss rate is very high. After continuous rectification of the industry, the People’s Bank of China issued the “FinTech Development Plan (2019-2021)” in September 2019, proposing to “further enhance the technology application ability of the financial industry and realize the deep integration and coordinated development of finance and technology.” By the beginning of 2020, far fewer P2P online lending institutions remained in operation.

In China, the definition of online lending covers individual-to-individual lending, individual-to-business lending, and business-to-business lending. Since the birth of the first P2P platform in China in 2007, online lending has developed rapidly. To a certain extent, it is both the result of the continuous advancement of modern information technology and the inevitable product of the diversification of lending needs. However, the problems exposed during this development have become more prominent. Investors should pay attention to information asymmetry and the impact of credit risk [9]. The online lending industry in China faces not only the problems common to online lending in other countries but also problems specific to China. Internet financial risk is not only directly related to the operation and development of the Internet financial system itself but also has a very important impact on the country’s macroeconomic operation because of the industry’s rapid development and growing scale [10].

2. Literature Review

Since 2013, innovative Internet financial services such as Yirendai, Crowdfunder, and Renrendai have emerged in China, promoting the reform of financial service models and accelerating financial marketization. However, the risk of the industry has also become obvious. Although there are a large number of online lending investors, most of them lack professional lending knowledge [11]. Moreover, individual online loans are small. When lenders lack effective information about the borrower, they often follow other bidders blindly or behave irrationally in other ways, which inevitably increases the credit risk of online lending [12, 13]. The theory of information asymmetry was first put forward by Akerlof (1970) [14], based on observations of the used car market. In online lending, information asymmetry can also make borrower default more likely [15, 16]. An imbalance of these factors means that the platform’s resources and opportunities cannot play their role, resulting in the collapse of the platform [17]. The survival of the platform depends on the age, scale, and life cycle of the enterprise [18]. The management ability of platform operators plays a key role in the success or failure of small and micro platforms [19, 20]. Recent years have seen the rapid development of Internet finance in China, and various peer-to-peer (P2P) lending platforms have been launched [8]. The default behaviors of borrowers with different credit grades in the online P2P loan market are diverse [21]. Reputation plays an important role in the long-term development of a P2P lending platform [22]. For instance, in February 2017, 55 problematic platforms were involved in illegal fundraising, difficulty with cash withdrawal, fraud, absconding with money, loss of contact, and other risky breaches. Such negative news has greatly affected investors’ confidence and has badly damaged the social reputation of the entire industry.
Therefore, it is particularly important to evaluate Internet financial risks scientifically. The risk and regulation of P2P lending platforms in China are taken seriously. Regulations now require online lending platforms to place client funds under bank depository arrangements, which has gradually made bank depository the norm [23]. P2P online lending differs from traditional financial institutions in its transaction system, which uses an interest rate auction when a transaction is concluded. Herzenstein and Barasinska [24] studied the interest rates of the American Prosper online lending platform in 2011 and 2014, respectively. They found that borrowers set the maximum interest rate they were willing to pay for the funds, and investors then decided whether to lend according to the loan information provided by the Prosper platform. This innovative lending model provides investors with a new way of managing their finances. Liu et al. find that investors’ herd behavior exists significantly [25]. The P2P mode not only lets investors’ idle funds grow in value but also gives borrowers an additional loan channel to meet their demand for funds. In this lending mode, the lending process no longer depends on offline financial institutions but relies on the network lending platform to match the needs of both sides and to complete the transaction. The reasons for choosing the logistic-SBM model are as follows. Data envelopment analysis (DEA) lies at the intersection of management science, mathematics, mathematical economics, and operations research. DEA uses multiple inputs and outputs to measure the relative efficiency of each decision-making unit (DMU).
In the risk management of borrowers of Internet financial loan products, the DEA method can treat each borrower as a DMU and obtain its efficiency value, rather than studying only the traditional indicators of the borrower. At present, there is little research on real customer credit data in China. This study selected the logistic regression method for big data analysis after comparing different mathematical modelling methods. According to the characteristics of the source data, data envelopment analysis was used to process the source data, and the processed data were then used to train the logistic regression model in order to improve the accuracy of the model prediction. This method not only provides an innovative approach to the credit risk analysis of Internet financial borrowing customers but also expands the research space in this field, which has both theoretical and practical significance. Based on the current state of P2P lending platforms in China, the credit risk, transaction risk, legal risk, and other risks arising during their development are analyzed. In addition, corresponding regulatory measures are put forward to strengthen the development of P2P lending platforms in China.

3. Methods

The notion of probability is very closely related to the notion of symmetry [26]. In essence, credit risk prediction means predicting the probability that a borrower defaults. The specific research methods are as follows.

First is data preparation. This study divides the credit data of Internet financial technology companies into a sample set and a test set.

Then, the SBM-DEA model was established. Using five input and output indicators (Section 3.2), the efficiency value of each customer was obtained with the DEA model in MaxDEA software, and the resulting DEA_score was added to the dataset as a new feature.

The third step is feature processing. The feature processing methods include feature binning, correlation coefficient analysis, information value (IV) screening, and random forest importance ranking.

The fourth step was to test the logistic-SBM model. The prediction results of the model were inspected directly through the confusion matrix. The AUC value of the model was calculated, and the model was also tested with the K-S test.

In the last step, we compared the values of corresponding evaluation measures of two models.

The logistic-SBM model was established through MaxDEA and Python software.

3.1. Data Source Preparation

We used the real credit data of an Internet financial technology company as the analysis object. The company is mainly engaged in small loans, online finance, and other Internet financial products. The platform has a variety of data sources, high data quality, and rich data information. The loan customer risk management model studied in this paper is built on the loan records of the platform. The sample population was sampled and divided into a sample set and a test set.

For data selection, loans that had reached the end of repayment and loans that had defaulted were selected for modelling. The target variable was derived from the user’s “repayment status” feature: if the loan was repaid without default, the value is 0; if the loan became overdue and default occurred, the value is 1. Finally, 14028 completed loan records were selected as the sample set, of which 10237 were successfully paid off, accounting for 72.98% of the total number of samples, and 3791 had overdue loans, accounting for 27.02%.
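As an illustration of dividing labelled loans into a sample set and a test set, the following minimal sketch performs a stratified split so that the share of defaults is preserved in both parts. The 30% test fraction, the seed, and the function name `stratified_split` are assumptions made for this example; the paper does not report how the split was performed. The class counts are reused from the text purely as example data.

```python
import random


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split record indices into train/test while keeping class shares.

    `test_fraction` and `seed` are illustrative assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, value in enumerate(labels) if value == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


# 0 = repaid without default, 1 = overdue with default
labels = [0] * 10237 + [1] * 3791
train_idx, test_idx = stratified_split(labels)
default_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
```

Splitting within each class keeps the default share of the test part close to the overall 27.02%, which matters when the two classes are unbalanced.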

3.2. SBM-DEA Method

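This study used the SBM-DEA method (SBM method for short) to preprocess the data because it assigns each customer an individual efficiency value rather than dividing customers into categories. Adding this value can improve the prediction accuracy of the model and make the prediction of the initial logistic model more effective. The nonoriented SBM model is used in this study. For an evaluated DMU $k$ among $n$ DMUs, with $m$ inputs $x_{ik}$ and $q$ outputs $y_{rk}$, the nonoriented SBM model is as follows:

$$\rho^{*} = \min_{\lambda,\, s^{-},\, s^{+}} \frac{1 - \frac{1}{m}\sum_{i=1}^{m} s_{i}^{-}/x_{ik}}{1 + \frac{1}{q}\sum_{r=1}^{q} s_{r}^{+}/y_{rk}}$$

$$\text{s.t.}\quad \sum_{j=1}^{n} \lambda_{j} x_{ij} + s_{i}^{-} = x_{ik}\ (i = 1, \ldots, m),\qquad \sum_{j=1}^{n} \lambda_{j} y_{rj} - s_{r}^{+} = y_{rk}\ (r = 1, \ldots, q),\qquad \lambda_{j},\ s_{i}^{-},\ s_{r}^{+} \ge 0.$$

The SBM model uses $\rho^{*}$ to represent the efficiency value of the evaluated DMU. It measures inefficiency from both the input side and the output side, which is why it is called the nonoriented model. The model assumes that there are no zeros in the input and output data. In the SBM model, the inefficiency of input and output is reflected as follows:

$$\text{input inefficiency} = \frac{1}{m}\sum_{i=1}^{m} \frac{s_{i}^{-}}{x_{ik}}, \qquad \text{output inefficiency} = \frac{1}{q}\sum_{r=1}^{q} \frac{s_{r}^{+}}{y_{rk}}.$$

If the efficiency value ($\rho^{*}$) of the SBM model is equal to 1, the evaluated DMU is strongly efficient, that is, all slacks are zero, whereas an efficiency value of 1 in a radial model guarantees only weak efficiency. The projection value (target value) of the evaluated DMU is

$$\hat{x}_{ik} = x_{ik} - s_{i}^{-}, \qquad \hat{y}_{rk} = y_{rk} + s_{r}^{+}.$$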

The reasons for SBM indicator selection are as follows: the input indicators include borrower’s liability information, credit risk score, and income information. These three indicators can mainly summarize the borrower’s asset flow and external risk evaluation information. The output indicators are the borrower’s loan amount and period, which are the most important indicators to describe the borrower’s loan situation. Input and output indicators of the SBM method are shown in Table 1.

Based on the indicator correlations obtained in the initial stage of logistic regression and on experience from daily business, these three input indicators and two output indicators were finally selected as the input and output indicators of the SBM method.

Using the above five indicators, the efficiency value of each customer was obtained by the SBM method in MaxDEA software, and DEA_score was added to the dataset as a new feature. The distribution of DEA_score is shown in Figure 1.
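To make the SBM efficiency value concrete, the sketch below evaluates the standard nonoriented SBM ratio, (1 minus the mean relative input slack) divided by (1 plus the mean relative output slack), for one DMU with three inputs and two outputs. It does not solve the optimization that MaxDEA performs: the slack values are supplied by hand, and the borrower data and the function names `sbm_efficiency` and `sbm_projection` are hypothetical.

```python
def sbm_efficiency(inputs, outputs, input_slacks, output_slacks):
    """Nonoriented SBM ratio for one DMU at the given slack values."""
    if min(inputs) <= 0 or min(outputs) <= 0:
        raise ValueError("SBM assumes strictly positive inputs and outputs")
    m, q = len(inputs), len(outputs)
    input_ineff = sum(s / x for s, x in zip(input_slacks, inputs)) / m
    output_ineff = sum(s / y for s, y in zip(output_slacks, outputs)) / q
    return (1.0 - input_ineff) / (1.0 + output_ineff)


def sbm_projection(inputs, outputs, input_slacks, output_slacks):
    """Target values: inputs reduced and outputs raised by their slacks."""
    target_inputs = [x - s for x, s in zip(inputs, input_slacks)]
    target_outputs = [y + s for y, s in zip(outputs, output_slacks)]
    return target_inputs, target_outputs


# Hypothetical borrower (illustrative numbers only)
x = [4.0, 10.0, 8.0]      # three input indicators
y = [5.0, 12.0]           # two output indicators
s_minus = [1.0, 0.0, 2.0]  # input slacks (excess inputs)
s_plus = [0.0, 3.0]        # output slacks (output shortfalls)

rho = sbm_efficiency(x, y, s_minus, s_plus)
targets = sbm_projection(x, y, s_minus, s_plus)
```

With all slacks equal to zero the ratio is exactly 1, which corresponds to a strongly efficient DMU; any positive slack lowers the value below 1.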

3.3. Logistic-SBM Modelling Process

Because the data used in this study are large and have wide, complex dimensions, the logistic-SBM model was built in a series of steps. First, the input and output indicators of the SBM model were selected from the initial indexes of the logistic model. Then, MaxDEA software was used to calculate the efficiency value of each customer, treated as a decision-making unit (DMU). As a new index, the efficiency value obtained by DEA was used in the training and testing of the logistic model together with the initial relevant indexes. Finally, the model was used to predict the default probability of loan customers in order to verify its effectiveness and accuracy, which also helps to analyze the contribution of the DEA index to the accuracy of the logistic regression model.

3.3.1. Feature Binning

Inspection of the collected datasets showed that the data types were inconsistent and that many features were of character type. Because these character indicators may play a large role in the model, we used weight of evidence (WOE) to transform them into measurable numerical indicators. Bins were formed by chi-square merging: the chi-square value of each pair of adjacent intervals is computed, and the two intervals with the smallest value are combined. The formula used in this step is as follows:

$$\chi^{2} = \sum_{i=1}^{2}\sum_{j=1}^{k} \frac{(A_{ij} - E_{ij})^{2}}{E_{ij}}, \qquad E_{ij} = R_{i} \cdot \frac{C_{j}}{N}$$

$A_{ij}$ is the number of instances of the $j$th class in the $i$th interval, $E_{ij}$ is the expected frequency of $A_{ij}$, $N$ is the total number of samples, $R_{i}$ is the number of samples in the $i$th group, and $C_{j}/N$ is the proportion of the $j$th class in the whole.
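The merging rule can be sketched as follows. This is a minimal illustration of one chi-square merging step under the usual ChiMerge definition (observed versus expected class counts in two adjacent intervals), not the authors’ implementation; the bin counts and the function names are hypothetical.

```python
def chi_square(interval_a, interval_b):
    """Chi-square statistic for two adjacent intervals.

    Each interval is a list of class counts, e.g. [n_good, n_bad].
    """
    rows = [interval_a, interval_b]
    n_classes = len(interval_a)
    total = sum(sum(row) for row in rows)
    if total == 0:
        return 0.0
    class_totals = [sum(row[j] for row in rows) for j in range(n_classes)]
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(n_classes):
            expected = row_total * class_totals[j] / total
            if expected > 0:
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


def merge_most_similar(intervals):
    """One merging step: combine the adjacent pair with the smallest chi-square."""
    scores = [chi_square(intervals[i], intervals[i + 1])
              for i in range(len(intervals) - 1)]
    i = scores.index(min(scores))
    merged = [a + b for a, b in zip(intervals[i], intervals[i + 1])]
    return intervals[:i] + [merged] + intervals[i + 2:]


# Hypothetical [n_good, n_bad] counts for three adjacent bins
bins = [[40, 10], [38, 12], [20, 30]]
merged_bins = merge_most_similar(bins)
```

In this example the first two bins have nearly the same default share, so they yield the smallest chi-square value and are merged first. In practice the step is repeated until a stopping rule (a maximum number of bins or a chi-square threshold) is met; that rule is not specified in the text.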

The feature information is shown in Table 2. Continuous feature variables were discretized, and the states of discrete features were merged to reduce their number, which conveniently puts all variables on similar scales. Missing values of a feature were brought into the model as an independent bin. Binning reduces the impact of extreme values and meaningless fluctuations in the features on the score and thereby increases the stability and robustness of the model.

3.3.2. Correlation Coefficient

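The correlation coefficient was obtained by calculating the correlation between each pair of features. The correlation coefficient formula is as follows:

$$r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}[X]\,\operatorname{Var}[Y]}}$$

Among them, $\operatorname{Cov}(X, Y)$ is the covariance of $X$ and $Y$; $\operatorname{Var}[X]$ and $\operatorname{Var}[Y]$ are the variances of $X$ and $Y$, respectively.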

If the absolute value of the correlation coefficient between two features was greater than 0.7, the features were considered strongly correlated. Among strongly correlated features, only one needs to be retained and the others can be deleted. Based on the correlations shown in Table 3, the total debt ratio indicator was deleted.
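The screening rule can be sketched in a few lines. The example below computes the Pearson correlation coefficient and keeps a feature only if its absolute correlation with every feature already kept is at most 0.7. The feature names and values are hypothetical, and keeping the first feature of a correlated pair is an assumption of this sketch; which member of a pair to retain is a modelling decision.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # undefined for constant features


def drop_correlated(features, threshold=0.7):
    """Keep a feature only if |r| <= threshold against every kept feature."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns (illustrative values only)
data = {
    "debt_ratio": [0.10, 0.20, 0.30, 0.40, 0.50],
    "total_debt_ratio": [0.12, 0.21, 0.33, 0.41, 0.52],
    "income": [5.0, 3.0, 6.0, 2.0, 4.0],
}
selected = drop_correlated(data)
```

Here the two debt ratios are almost perfectly correlated, so only the first is retained, while the weakly correlated income column is kept.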

3.3.3. IV

Information value (IV) measures the amount of predictive information that a variable carries. From the formula, it is a weighted sum of the WOE values of the groups of an independent variable, and its size indicates the influence of the independent variable on the target variable. Weight of evidence (WOE) is a supervised coding method. The calculation formula is

$$WOE_{i} = \ln\left(\frac{B_{i}/B_{T}}{G_{i}/G_{T}}\right), \qquad IV_{i} = \left(\frac{B_{i}}{B_{T}} - \frac{G_{i}}{G_{T}}\right) WOE_{i}, \qquad IV = \sum_{i} IV_{i}$$

where $B_{i}$ and $G_{i}$ are the numbers of defaulting and nondefaulting customers in group $i$, and $B_{T}$ and $G_{T}$ are the corresponding totals. WOE is used to encode the input variables, and IV is used to evaluate their predictive ability: the IV of a feature indicates the predictive ability of that variable. The IV of each remaining feature was then calculated. After grouping, the formula for calculating the IV of each group is shown in Table 4.

According to the reference threshold of IV, the features with IV less than or equal to 0.02 are defined as nonpredictive features. Therefore, all features of this class were deleted. According to the characteristic IV shown in Table 5, “Marriage” and “Birth_month” features were deleted.
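The IV screening can be illustrated with a short sketch. It assumes the common convention in which the WOE of a group is the logarithm of the group’s share of defaulting customers divided by its share of nondefaulting customers; the opposite orientation changes the sign of WOE but not the IV. The bin counts and the function name `woe_iv` are hypothetical, and every bin is assumed to contain at least one customer of each class.

```python
import math


def woe_iv(bins):
    """Return (WOE per bin, total IV) for bins given as (n_good, n_bad).

    Assumes every bin contains at least one good and one bad customer.
    """
    good_total = sum(good for good, _ in bins)
    bad_total = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        p_good = good / good_total
        p_bad = bad / bad_total
        woe = math.log(p_bad / p_good)
        woes.append(woe)
        iv += (p_bad - p_good) * woe
    return woes, iv


# Hypothetical features, each split into two bins of (n_good, n_bad)
uninformative = [(50, 50), (50, 50)]
informative = [(80, 20), (20, 80)]

_, iv_low = woe_iv(uninformative)
_, iv_high = woe_iv(informative)
keep_low = iv_low > 0.02    # threshold taken from the text
keep_high = iv_high > 0.02
```

A feature whose bins all have the same default share has an IV of zero and would be removed under the 0.02 threshold, while a feature whose bins differ strongly in default share is retained.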

3.3.4. Random Forest Model

The random forest model is an ensemble algorithm that generates many trees and obtains its result by voting or averaging. For grouped variables, the CART Gini value is used as the evaluation standard. Feature importance was calculated in the following steps, where the variable importance score is denoted by $VIM$ and the Gini index by $GI$. The formula for calculating the Gini index of node $m$ is

$$GI_{m} = \sum_{k=1}^{K}\sum_{k' \ne k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^{2}$$

The meaning of each symbol in the formula is as follows: $K$ means that there are $K$ categories.

$p_{mk}$ means the proportion of category $k$ in node $m$.

The importance of feature $X_{j}$ at node $m$ is the change in the Gini index before and after the node branches and is calculated as follows:

$$VIM_{jm} = GI_{m} - GI_{l} - GI_{r}$$

Among them, $GI_{l}$ and $GI_{r}$, respectively, represent the Gini index of the two new nodes after branching.

When the nodes at which feature $X_{j}$ appears in decision tree $i$ form the set $M$, the importance of $X_{j}$ in the $i$th tree is

$$VIM_{ij} = \sum_{m \in M} VIM_{jm}$$

Assuming that there are $n$ trees in the RF, the importance of $X_{j}$ over the whole forest is

$$VIM_{j} = \sum_{i=1}^{n} VIM_{ij}$$

Finally, the importance scores are normalized. The formula is as follows:

$$VIM_{j}^{\text{norm}} = \frac{VIM_{j}}{\sum_{i=1}^{c} VIM_{i}}$$

Assuming that there are $c$ features $X_{1}, X_{2}, \ldots, X_{c}$, the Gini index score $VIM_{j}$ of each feature $X_{j}$ is calculated in this way. Features are ranked from high to low according to their importance, and the top features are selected.

The order of feature importance is shown in Table 6. Firstly, the feature variables in the random forest were sorted in descending order of variable importance (VI). Then, the features contributing only a negligible share of importance were removed from the current feature variables to obtain a new feature set.

The feature importance results were obtained by the random forest algorithm, rounded to three decimal places, and sorted from high to low. At the same time, the cumulative importance was calculated. According to the feature importance ranking, the “Pay_method” feature clearly showed low importance.
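The Gini-based importance score can be illustrated on a single split. The sketch below uses the sample-weighted impurity decrease that CART-style implementations typically compute, which is an assumption of this example and differs from a plain difference of Gini indices; the node counts, the per-node gains, and the function names are hypothetical. Only the feature names DEA_score and Pay_method come from the text.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gain(parent, left, right):
    """Sample-weighted decrease in Gini index produced by one split."""
    n = sum(parent)
    return (gini(parent)
            - sum(left) / n * gini(left)
            - sum(right) / n * gini(right))


def normalized_importance(gains_by_feature):
    """Sum the gains of each feature over all nodes and trees, then normalize."""
    raw = {name: sum(gains) for name, gains in gains_by_feature.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# One hypothetical split: class counts are [n_good, n_bad]
gain = split_gain([50, 50], [40, 10], [10, 40])

# Hypothetical per-node gains collected across a forest
importance = normalized_importance({
    "DEA_score": [0.18, 0.02],
    "Pay_method": [0.05],
})
ranking = sorted(importance, key=importance.get, reverse=True)
```

The normalized scores sum to 1, so they can be read as shares of total importance and accumulated to obtain the cumulative importance used for screening.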

3.3.5. Logistic-SBM Model Variables

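In P2P network credit lending, the logistic model was adopted because of its strong ability to discriminate defaulting loan customers. For a borrower with feature vector $x = (x_{1}, \ldots, x_{d})$, the logistic formula is

$$p(x) = P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_{0} + \beta_{1} x_{1} + \cdots + \beta_{d} x_{d})}}$$

The overdue status of a group of $n$ applicants in the performance period is $y_{1}, \ldots, y_{n}$, where $y_{i} = 1$ denotes default and $y_{i} = 0$ denotes no default. The likelihood function and log likelihood function are

$$L(\beta) = \prod_{i=1}^{n} p(x_{i})^{y_{i}}\,[1 - p(x_{i})]^{1 - y_{i}}, \qquad \ln L(\beta) = \sum_{i=1}^{n} \left\{ y_{i} \ln p(x_{i}) + (1 - y_{i}) \ln[1 - p(x_{i})] \right\}$$

The parameter estimation formula is as follows:

$$\hat{\beta} = \arg\max_{\beta} \ln L(\beta), \qquad \frac{\partial \ln L(\beta)}{\partial \beta_{j}} = \sum_{i=1}^{n} [y_{i} - p(x_{i})]\, x_{ij}, \quad x_{i0} = 1$$

Estimate $\beta$ by the gradient descent method applied to the negative log likelihood; with learning rate $\alpha > 0$, the formula is as follows:

$$\beta_{j} \leftarrow \beta_{j} + \alpha \sum_{i=1}^{n} [y_{i} - p(x_{i})]\, x_{ij}$$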

It is very important to select variables from the dataset. Considering the correlation coefficient, validity, and importance of index data, 15 variables were selected in the final logistic-SBM model for empirical study. Logistic-SBM model variables are shown in Table 7.
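The gradient-based estimation of the logistic model can be sketched in plain Python. The example fits an intercept and one coefficient by batch gradient descent on the mean negative log likelihood. The learning rate, the number of iterations, and the toy data (a single efficiency-like score per borrower) are assumptions for illustration only; the study itself used 15 variables and does not report its optimizer settings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(beta, x):
    """Predicted default probability; beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Batch gradient descent on the mean negative log likelihood."""
    n, d = len(X), len(X[0])
    beta = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            error = predict_proba(beta, xi) - yi  # p(x_i) - y_i
            grad[0] += error
            for j, value in enumerate(xi):
                grad[j + 1] += error * value
        beta = [b - learning_rate * g / n for b, g in zip(beta, grad)]
    return beta


# Hypothetical data: one efficiency-like score per borrower, 1 = default
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
X = [[s] for s in scores]
y = [0, 0, 0, 1, 0, 1, 1, 1]

beta = fit_logistic(X, y)
p_low_score = predict_proba(beta, [0.1])
p_high_score = predict_proba(beta, [0.9])
```

In this toy data, defaults are concentrated among low scores, so the fitted coefficient is negative and the predicted default probability falls as the score rises.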

4. Result Analysis and Inspection

Model verification measures the predictive ability of the developed model and includes internal and external tests. The internal test compares the predictions for the in-sample test set with the actual outcomes. The external test compares the model’s predictions with the actual outcomes on a dataset that was not used to build the model. The primary goal of the developed model is to distinguish whether a borrower will default. Prediction accuracy, confusion matrix analysis, and the Kolmogorov-Smirnov test can all be used as criteria for judging the quality of the model.

4.1. Confusion Matrix Analysis

Accuracy is an important concept and indicator in model evaluation. The performance of the resulting classifier can then be evaluated in terms of the recall (or sensitivity) and precision of the classifier on an evaluation dataset. Recall and precision are defined in terms of the number of true positives (TP), misses (FN), and false alarms (FP) of the classifier (cf. Table 8).

In Table 8, the first row gives the results predicted by the model, and the first column gives the actual results in the original data. True positive (TP) is the number of positive samples correctly classified as positive; false negative (FN) is the number of positive samples misclassified as negative; false positive (FP) is the number of negative samples misclassified as positive; true negative (TN) is the number of negative samples correctly classified as negative. As a common evaluation measure, accuracy is expressed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

The borrowers predicted by the model were compared with the labelled good and bad borrowers. In the confusion matrix, the first quadrant is the number of borrowers that the model predicts to be nondefaulting and that actually did not default. The second and third quadrants contain the incorrect predictions. The fourth quadrant is the number of borrowers that the model predicts to default and that actually defaulted. The accuracy of the model, that is, the ratio of the number of correct predictions to the total number, was 77.49%, so 22.51% of borrowers were incorrectly predicted. This result indicates that the model has good predictive ability.
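The four cells of the confusion matrix and the accuracy can be computed as in the following minimal sketch. Treating default (label 1) as the positive class is an assumption of this example, and the labels and predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) with label 1 treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical actual labels and model predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fn, fp, tn)
```

Because accuracy depends on the class balance, it is usually read together with the other measures discussed in this section rather than on its own.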

4.2. AUC-ROC Curve Observation

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. The ROC (receiver operating characteristic) curve is a probability curve, and the AUC (area under the curve) represents the degree of separability, that is, how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0 as 0 and 1 as 1. The ROC curve of the logistic-SBM model is shown in Figure 2.
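One way to see what the AUC measures is its rank interpretation: it equals the probability that a randomly chosen defaulting borrower receives a higher score than a randomly chosen nondefaulting borrower, with ties counted as one half. The sketch below computes the AUC directly from that definition. The labels and scores are hypothetical, and the pairwise loop is quadratic in the sample size, so it is meant for illustration rather than for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = default) and predicted default probabilities
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

auc = roc_auc(y_true, scores)
```

A value of 0.5 corresponds to scores that carry no ranking information, and a value of 1.0 to scores that separate the two classes perfectly.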

4.3. K-S Test

The KS indicator measures the largest gap between the cumulative distributions of good customers and bad customers. The calculation formula was as follows:

$$KS = \max_{k} \left| G_{k} - B_{k} \right|$$

$G_{k}$ is the cumulative proportion of good customers up to the $k$th score interval, and $B_{k}$ is the cumulative proportion of bad customers up to the same interval.

Firstly, the samples were ranked by score from large to small, and then the cumulative proportions of good and bad samples in each quantile interval were calculated. The larger the distance between the two, the higher the KS value and the stronger the model’s ability to distinguish good and bad customers. In actual business, a KS value below 20% means that the model discriminates poorly, a value between 20% and 30% means that the discrimination is moderate, and a value between 30% and 60% means that the model is very effective.

The KS curve of the logistic-SBM model is shown in Figure 3. The KS value of the logistic-SBM model is 33.3%, which lies in the 30% to 60% range and indicates that the model distinguishes defaulting customers well.
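The KS calculation described in this subsection can be sketched as follows. The example ranks samples by score from large to small, accumulates the proportions of bad and good customers, and records the largest gap. It evaluates the gap at every distinct score rather than at quantile intervals, which is a simplification made for this sketch; the labels and scores are hypothetical.

```python
def ks_statistic(y_true, scores):
    """Largest gap between cumulative bad and good proportions.

    Label 1 marks a bad (defaulting) customer, label 0 a good one.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_bad = sum(y_true)
    n_good = len(y_true) - n_bad
    cum_bad = cum_good = 0
    ks = 0.0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        # Compare the curves only after all samples sharing a score are counted
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
    return ks


# Hypothetical labels (1 = default) and predicted default probabilities
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

ks = ks_statistic(y_true, scores)
```

A KS value computed this way can then be compared with the business thresholds quoted above.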

4.4. Comparison of Model Evaluation

Precision, specificity, and recall are also important concepts and indicators in model evaluation. As common evaluation measures, sensitivity, specificity, G-Measure, and F-Measure are used to make the evaluation. F-Measure is also called F-Score. It is the weighted harmonic average of precision ($P$) and recall ($R$) and is often used to evaluate the quality of a classification model. The F-Measure function synthesizes the results of $P$ and $R$; when the parameter $\beta = 1$, the weight of $P$ and $R$ is the same. The higher the F-Measure, the more effective the model. Their specific expressions are shown as follows:

$$\text{Sensitivity} = R = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}, \qquad P = \frac{TP}{TP + FP}$$

$$G\text{-Measure} = \sqrt{\text{Sensitivity} \times \text{Specificity}}, \qquad F\text{-Measure} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R}$$

We compare the mean values of corresponding evaluation measures of two models. The performance comparison of two models is shown in Table 9.

The comparison shows that the logistic-SBM model presented in this article performs better than the logistic model: the higher the value of the evaluation indicators, the better the model. The simulation results show that, on these evaluation indicators, the logistic-SBM model is more suitable for credit risk evaluation than the commonly used logistic model. These results indicate that using data envelopment analysis to preprocess the data and adding the efficiency value as a feature in the logistic regression model can improve the accuracy of the model.
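The evaluation measures used in this comparison can be computed from the four confusion matrix counts, as in the minimal sketch below. The F-Measure follows the standard weighted harmonic mean of precision and recall. Defining the G-Measure as the geometric mean of sensitivity and specificity is an assumption of this sketch, since other definitions exist; the counts and the function name are hypothetical.

```python
import math


def evaluation_measures(tp, fn, fp, tn, beta=1.0):
    """Sensitivity, specificity, precision, G-Measure, and F-Measure.

    Assumes no denominator is zero, i.e. each class occurs and at least
    one positive prediction is made.
    """
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = ((1 + beta ** 2) * precision * sensitivity
                 / (beta ** 2 * precision + sensitivity))
    # Assumed definition: geometric mean of sensitivity and specificity
    g_measure = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "g_measure": g_measure,
    }


# Hypothetical confusion matrix counts
measures = evaluation_measures(tp=2, fn=1, fp=1, tn=4)
```

With `beta=1.0`, precision and recall carry equal weight; a value above 1 weights recall more heavily, and a value below 1 weights precision more heavily.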

5. Concluding Remarks

With the rapid development of the Internet, P2P has been applied in various fields [27]. At present, the risk management of borrowers on P2P network lending platforms mainly includes the following. The first is the authentication of borrowers’ basic information: their identity information and credit level are mined from many aspects, borrowers are rated, and feature variables are extracted from the basic information to determine the characteristics relevant to credit management. The second is the combination of credit line management and credit risk: the loan limit of a borrower corresponds to the borrower’s credit risk level.

Credit risk has four main characteristics: it is asymmetric, cumulative, unsystematic, and endogenous. The good operation of a platform requires strict auditing of borrowers. Only by admitting high-quality borrowers and thereby minimizing the risk of P2P network credit transactions can a P2P platform maintain stable operation. The grade assigned by the P2P lending site is the most predictive factor of default, but the accuracy of the model is improved by adding other information, especially the borrower’s debt level [28]. The results suggest that borrowers’ social information can be used not only for credit screening but also for default reduction and debt collection [29].

Relevant suggestions are put forward to provide a reference for the credit management of the P2P network lending industry in China. Regulatory authorities and the platforms themselves should take measures to control the credit risk of the P2P Internet lending industry. The specific recommendations are as follows: (1) improve the social credit investigation system and realize information sharing; (2) improve and implement policies; and (3) undertake social responsibility and actively develop through innovation.

Data Availability

This study collected partial loan records from an inclusive finance platform in China from 2014 to 2018.

Conflicts of Interest

The authors declare that they have no conflicts of interest.