Abstract

A double-combination index screening model is constructed by combining the nonparametric K-nearest neighbor discriminant method with R cluster analysis. The article has two distinguishing characteristics. First, the nonparametric K-nearest neighbor discriminant method is used to select the indicators that can significantly discriminate the default loss rate, which remedies a shortcoming of previous research that focuses only on indicators able to discriminate the default state. Second, the R cluster analysis applied in this paper sorts the indicators within each criterion class rather than across the whole index system. This ensures that indicators clustered in one class share both economic implications and data characteristics, and avoids the situation where indicators in one class share data characteristics but differ in economic implications.

1. Introduction

The existing research on the influencing factors of credit risk in microenterprises is divided into the following two categories.

(1) Existing Studies on Credit Evaluation Indicators System. Reusens and Croux (2017) argue that government debt, GDP growth rate, inflation, and other macroeconomic factors play a significant role in corporate credit, so they use these variables to build a credit evaluation index system [1]. Anand et al. (2016) argue that indicators such as profitability, liquidity, firm size, and credit rating influence the stability of the firm and play a vital role in credit evaluation, so these indicators should be included in the credit evaluation index system [2]. Jones et al. (2015) built a corporate credit rating system using financial indicators such as total assets. In addition to these financial variables, they added market variables such as enterprise scale and years of establishment to the index system [3]. Doumpos et al. (2015) mainly examined the impact of financial indicators on corporate credit and built a credit evaluation index system including asset returns, interest income, solvency, long-term debt leverage, and the size of the company [4].

(2) Existing Studies on Indicators Selection Methods. Many existing studies establish a classifier from a fuzzy-logic perspective to solve the credit evaluation problem [5–7]. Sohn et al. (2016) use the fuzzy logic regression method to establish the credit rating equation [8]. Abiyev (2014) develops fuzzy logic and neural network methods to extract important credit risk assessment information [9]. Ju and Sohn (2014) established a credit rating equation to pick out appropriate funding beneficiaries [10]. Elliott et al. (2014) screen out the true information that reflects the credit state of a company based on a double hidden Markov model (DHMM) [11]. Abellán and Mantas (2014) construct ensembles of classifiers for bankruptcy prediction and credit scoring based on the random subspace method. Their experimental studies show that bagging schemes over decision trees provide the best results for bankruptcy prediction and credit scoring [12]. Bijak and Thomas (2015) use improved Bayesian analysis techniques to deal with the problem of loss from bad loans [13]. Gorzałczany and Rudziński (2016) are more concerned than other scholars with the supervision and division of customer credit ratings, which helps banks make better lending decisions [14]. Jones et al. (2015) predict the variation tendency of customer credit levels and determine the credit threshold through a binary classifier [3].

The defects of the existing research are as follows. First, most existing studies construct the indicator system from the perspective of default versus nondefault and lack research from the perspective of the default loss rate. Second, some existing studies do not classify the indicators according to their economic meaning when using R cluster analysis, so they cannot remove the indicators that carry redundant information.

Contributions of This Paper. First, this paper applies the nonparametric K-nearest neighbor discriminant method to remove indicators that cannot significantly distinguish samples with different default loss rates. Second, the paper classifies indicators by R cluster analysis and, using the coefficient of variation, selects from each class the indicator that carries the most information. This ensures that duplicate information is removed.

2. Research Principle

2.1. The Difficulty of the Problem

Difficulty 1. The first difficulty is how to ensure that the selected indicators can significantly differentiate samples with different default loss rates. In existing studies, the indicators selected by many classic methods can only distinguish between default states.

Difficulty 2. The second difficulty is how to delete the indicators whose information overlaps or is redundant.

2.2. The Method to Solve the Difficulty

The Method to Solve the Difficulty 1. The nonparametric K-nearest neighbor discrimination method screens out the indicators that have significant discrimination ability on samples with different default loss rates. For each candidate indicator, the identification accuracy is calculated with that indicator removed. This accuracy is compared with the accuracy obtained using all the indicators, which gives the accuracy difference for that indicator. If the accuracy difference is greater than or equal to 0, the indicator is deleted; if the accuracy difference is less than 0, the indicator is retained. After these steps, the indicators that have significant discrimination ability on different default loss rates are selected.

The Method to Solve the Difficulty 2. The indicators are screened again by R cluster analysis to exclude collinearity. The indicators retained by the nonparametric K-nearest neighbor discrimination method are clustered within each criterion layer. In each class of each criterion layer, the indicator with the largest coefficient of variation is kept; these indicators constitute the final indicator system, which is free of information redundancy.

3. Construction of Indicator System

3.1. Indicators’ First Selection by Nonparametric K-Nearest Neighbor Discrimination Method
3.1.1. Selection of the Optimal K Value

In this paper, the optimal K value is selected by the error balance method (Xing and Tingjin, 2014) [15], subject to a constraint on the range of K. Compared with generalized cross validation, the error balance method not only yields the optimal K value but also greatly reduces the computational cost.

The error balance method increases K from 1 and combines the test errors of all the samples to draw the trend of the test error. According to this trend, the optimal K value is the one for which the test error is minimum. This method not only specifies the direction in which the optimal value is searched but also ensures that the optimal value is chosen within a reasonable range. This paper combines the idea of Góra and Wojna (2002) with the error balance method to find the best K value [16].

Let $e_i$ be the test error of the $i$th class of samples, $m_i$ the number of $i$th class samples misjudged into other classes, and $n_i$ the actual number of $i$th class samples ($i = 1, 2, 3$). Then

$$e_i = \frac{m_i}{n_i}. \tag{1}$$

Let $E$ be the test error of all the samples; $e_1$, $e_2$, and $e_3$ the test errors of the high default loss rate sample, the low default loss rate sample, and the nondefault sample; and $n_1$, $n_2$, and $n_3$ the sample sizes of the high default loss rate sample, the low default loss rate sample, and the nondefault sample. Then

$$E = \frac{n_1 e_1 + n_2 e_2 + n_3 e_3}{n_1 + n_2 + n_3}. \tag{2}$$

The meanings of formulas (1) and (2) are as follows: the ratio of the number of misjudgments to the actual sample size is the test error of a class, and the weighted average of the test errors of the three classes is the test error of the total sample.

With $e_1$, $e_2$, $e_3$, $n_1$, $n_2$, and $n_3$ as above, let $K$ be the number of nearest neighbors and $n$ the sample size. The objective function is

$$\min_{K} E = \frac{n_1 e_1 + n_2 e_2 + n_3 e_3}{n_1 + n_2 + n_3}, \quad \text{s.t. } 1 \le K \le \sqrt{n}. \tag{3}$$

The Meaning of (3). According to Góra and Wojna’s theory, the optimal K value should lie between 1 and the square root of the sample size. Under this constraint, the optimal K value is the value that minimizes the test error of the total sample.
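The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not say how the test error is estimated, so leave-one-out evaluation, Euclidean distance, and majority voting are assumed here, and the function names `knn_predict`, `total_test_error`, and `select_k` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(
        range(len(train_x)),
        key=lambda idx: math.dist(train_x[idx], query),
    )
    votes = Counter(train_y[idx] for idx in ranked[:k])
    # Break voting ties in favor of the smallest class label (arbitrary choice).
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)


def total_test_error(xs, ys, k):
    """Leave-one-out total test error E for a given k (assumed estimator).

    Each class error is e_i = m_i / n_i; E is their sample-size-weighted
    average, which equals (total misjudged) / (total samples).
    """
    misjudged = Counter()
    sizes = Counter(ys)
    for i, (x, y) in enumerate(zip(xs, ys)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if knn_predict(rest_x, rest_y, x, k) != y:
            misjudged[y] += 1
    n = len(ys)
    return sum(sizes[c] * (misjudged[c] / sizes[c]) for c in sizes) / n


def select_k(xs, ys):
    """Scan K = 1 .. floor(sqrt(n)) and return (best_k, errors_by_k)."""
    k_max = max(1, math.isqrt(len(ys)))
    errors = {k: total_test_error(xs, ys, k) for k in range(1, k_max + 1)}
    # Prefer the smallest K when several values give the same minimum error.
    best_k = min(errors, key=lambda k: (errors[k], k))
    return best_k, errors
```

Here `xs` is a list of feature tuples and `ys` the matching list of class labels. The returned `errors` dictionary holds the values one would plot to see the trend of the test error against K.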

3.1.2. The Process of Index Screening through Nonparametric K-Nearest Neighbor Discriminant Method

(1) Calculate the Prior Probabilities. Let $q_i$ be the prior probability of class $i$, where $i = 1, 2, 3$ denotes the high default loss rate, low default loss rate, and nondefault classes and $\sum_{i=1}^{3} q_i = 1$. Let $n_i$ be the sample size of class $i$ and $n$ the sum of the sample sizes of all classes (Ganjiang, 2007) [17]:

$$q_i = \frac{n_i}{n}. \tag{4}$$

The Meaning of (4). The prior probability of each class is the ratio of the number of samples in that class to the total number of samples. The smaller the result, the smaller the likelihood that a sample will be classified into the class.

(2) Using K-Nearest Neighbor Estimation to Obtain the Probability Density Function. Let $f_i(x)$ be the probability density function of class $i$ ($i = 1, 2, 3$), $k_i$ the number of the $K$ nearest neighbors of $x$ that belong to class $i$ (so that $k_1 + k_2 + k_3 = K$), $n_i$ the sample size of class $i$, and $V(x)$ the volume of the neighborhood of $x$ that contains the $K$ nearest neighbors (Ganjiang, 2007) [17]:

$$f_i(x) = \frac{k_i}{n_i V(x)}. \tag{5}$$

The Meaning of (5). Formula (5) estimates the probability density of class $i$ at $x$ from the proportion of class $i$ samples that fall within the neighborhood of $x$.

(3) Calculate Posterior Probability. Let $p(i \mid x)$ be the posterior probability that a sample $x$ belongs to class $i$, $q_i$ the prior probability of class $i$ ($i = 1, 2, 3$), and $f_i(x)$ the probability density function of class $i$. The denominator is the sum over all classes of the product of the probability density function and the prior probability (Ganjiang, 2007) [17]:

$$p(i \mid x) = \frac{q_i f_i(x)}{\sum_{u=1}^{3} q_u f_u(x)}. \tag{6}$$

If $p(1 \mid x)$ is the largest of the three posterior probabilities, the sample is assigned to the high default loss rate class; if $p(2 \mid x)$ is the largest, the sample is assigned to the low default loss rate class; if $p(3 \mid x)$ is the largest, the sample is assigned to the nondefault class. The accuracy equals 1 minus the error rate.
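The three steps above (prior, K-nearest-neighbor density, posterior) can be illustrated with a short Python sketch. It assumes Euclidean distance and the standard K-nearest-neighbor density estimate, in which the density of a class is the number of its samples among the K neighbors divided by the class size times a common neighborhood volume. The names `knn_posteriors` and `classify` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def knn_posteriors(train_x, train_y, query, k):
    """Posterior of each class at `query` from priors and K-NN densities.

    prior_i = n_i / n, density_i = k_i / (n_i * V),
    posterior_i = prior_i * density_i / sum_u(prior_u * density_u).
    The common volume V cancels in the ratio, so it is set to 1 here.
    """
    n = len(train_y)
    class_sizes = Counter(train_y)
    nearest = sorted(range(n), key=lambda idx: math.dist(train_x[idx], query))[:k]
    neighbor_counts = Counter(train_y[idx] for idx in nearest)

    weighted = {}
    for label, n_i in class_sizes.items():
        prior = n_i / n
        density = neighbor_counts.get(label, 0) / n_i  # V = 1 (cancels below)
        weighted[label] = prior * density
    total = sum(weighted.values())
    return {label: value / total for label, value in weighted.items()}


def classify(train_x, train_y, query, k):
    """Assign the query to the class with the largest posterior.

    Ties go to the smallest label (an arbitrary choice for this sketch).
    """
    post = knn_posteriors(train_x, train_y, query, k)
    return max(sorted(post), key=post.get)
```

One consequence worth noting: when the priors are the sample proportions, the class size cancels between the prior and the density, so the posterior of each class reduces to the share of the K neighbors that belong to it.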

(4) Measure the Identification Accuracy of the Default Loss Rate. Let $a_i$ be the accuracy of the $i$th class of samples, $c_i$ the number of $i$th class samples correctly judged by the nonparametric K-nearest neighbor discriminant method, and $n_i$ the actual number of $i$th class samples. Then $a_i$ is

$$a_i = \frac{c_i}{n_i}. \tag{7}$$

The Meaning of (7). The larger the calculated value, the better the nonparametric K-nearest neighbor discriminant method identifies samples of that class.

Let $A$ be the identification accuracy of all the samples; then

$$A = \frac{n_1 a_1 + n_2 a_2 + n_3 a_3}{n_1 + n_2 + n_3}. \tag{8}$$

The Meaning of Formula (8). The discrimination accuracy of all the samples is equal to the weighted average of the discrimination accuracy of the high default loss rate sample, the discrimination accuracy of the low default loss rate sample, and the discrimination accuracy of the nondefault sample. The higher the value of $A$, the higher the accuracy of discrimination of all samples.
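As a small illustration of the two accuracy measures, the following Python sketch computes the per-class accuracy and the overall accuracy from lists of actual and predicted labels. It assumes that the weights in the overall accuracy are the class sample sizes; the function names are hypothetical.

```python
from collections import Counter


def class_accuracies(actual, predicted):
    """Per-class accuracy: correctly judged samples of a class / class size."""
    sizes = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: correct[label] / n_i for label, n_i in sizes.items()}


def overall_accuracy(actual, predicted):
    """Overall accuracy as the sample-size-weighted mean of class accuracies."""
    sizes = Counter(actual)
    acc = class_accuracies(actual, predicted)
    return sum(sizes[label] * acc[label] for label in sizes) / len(actual)
```

With sample-size weights the overall accuracy is simply the share of all samples that are judged correctly, so a small class with poor accuracy can be masked by a large class with high accuracy. This is why the per-class values are reported alongside the total.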

(5) Calculate the Degree of Influence of the jth Indicator on the Discrimination Accuracy. Let $d_j$ be the degree of influence of the $j$th indicator on the discrimination accuracy, $A_{-j}$ the identification accuracy of the remaining indicators after the $j$th indicator is eliminated, and $A$ the identification accuracy of all the indicators. Then $d_j$ is

$$d_j = A_{-j} - A. \tag{9}$$

Formula (9) reflects the degree of influence of the $j$th indicator on the discrimination accuracy.

(6) Three Criteria of Indicator Screening Based on Nonparametric K-Nearest Neighbor Discriminant

Criterion 1. If the discrimination accuracy of the remaining indicators after the $j$th indicator is excluded is larger than the discrimination accuracy of all the indicators, that is, the degree of influence $d_j > 0$, then deleting the indicator improves the discrimination accuracy, so the indicator should be removed. This is referred to as Standard 1. All the indicators that meet Criterion 1 should be removed.

Criterion 2. If the discrimination accuracy of the remaining indicators after the $j$th indicator is excluded is equal to the discrimination accuracy of all the indicators, that is, the degree of influence $d_j = 0$, then deleting the indicator does not change the discrimination accuracy, so the indicator should be removed. This is referred to as Standard 2. All the indicators that meet Criterion 2 should be removed.

Criterion 3. If the discrimination accuracy of the remaining indicators after the $j$th indicator is excluded is smaller than the discrimination accuracy of all the indicators, that is, the degree of influence $d_j < 0$, then deleting the indicator decreases the discrimination accuracy, so the indicator should be retained. This is referred to as Standard 3. All the indicators that meet Criterion 3 should be retained.
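The three criteria amount to a leave-one-indicator-out loop, sketched below in Python. The function `accuracy_of` is a hypothetical caller-supplied callable that returns the overall discrimination accuracy for a given tuple of indicators (for example, from a K-nearest neighbor discriminant run); it is an assumption of this sketch, not something defined in the paper.

```python
def screen_indicators(indicators, accuracy_of):
    """Leave-one-indicator-out screening by the three criteria.

    `accuracy_of(subset)` is a hypothetical callable returning the overall
    discrimination accuracy for a tuple of indicator names.
    Returns (retained, removed, influence), where
    influence[name] = accuracy without `name` - accuracy with all indicators.
    """
    all_indicators = tuple(indicators)
    full_accuracy = accuracy_of(all_indicators)
    influence, retained, removed = {}, [], []
    for name in all_indicators:
        remaining = tuple(x for x in all_indicators if x != name)
        influence[name] = accuracy_of(remaining) - full_accuracy
        if influence[name] < 0:
            retained.append(name)  # Criterion 3: accuracy drops without it
        else:
            removed.append(name)   # Criteria 1 and 2: accuracy rises or is unchanged
    return retained, removed, influence
```

Because each accuracy is a ratio of counts, an unchanged set of correct classifications yields an influence of exactly 0. If accuracies are produced by a procedure with floating-point noise, a small tolerance around 0 may be a safer comparison.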

3.2. Indicators’ Second Selection by R Clustering Analysis

R-type clustering, also known as variable clustering, is a method of clustering variables. To reflect the characteristics of the objects being evaluated and to ensure the uniqueness of each indicator, R-type clustering is used to further cluster the variables selected by the nonparametric K-nearest neighbor discriminant method and to delete redundant information.

(1) R-Type Clustering Analysis Based on the Squared Sum Method. The R cluster analysis of the indicators in the same criterion layer is carried out by the squared sum method.

Let $S_i$ be the sum of squared deviations of the $i$th class of indicators ($i = 1, 2, \ldots, l$), where the indicators are divided into $l$ classes; $r_i$ the number of indicators in the $i$th class; $X_j^{(i)}$ the standardized sample value vector of the $j$th indicator in the $i$th class ($j = 1, 2, \ldots, r_i$); and $\bar{X}^{(i)}$ the average vector of the $i$th class of indicators:

$$S_i = \sum_{j=1}^{r_i} \left(X_j^{(i)} - \bar{X}^{(i)}\right)^{\mathsf{T}} \left(X_j^{(i)} - \bar{X}^{(i)}\right). \tag{10}$$

Let $S$ be the sum of squared deviations of all the classes of indicators:

$$S = \sum_{i=1}^{l} S_i. \tag{11}$$

Step 1. Treat the $N$ indicators as $N$ classes.

Step 2. Combine any two of the $N$ indicators into one class and leave the remaining indicators unchanged. There are $N(N-1)/2$ possible combinations. According to (10), calculate the sum of squared deviations $S_i$ of each class.

Step 3. Calculate by (11) the total sum of squared deviations $S$ over all the classes, and reclassify the indicators using the combination that minimizes $S$.

Step 4. Repeat Steps 2 and 3 on the resulting classes until the number of classes is $l$.
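The merging procedure in Steps 1 to 4 can be sketched as a greedy loop in Python. This is a minimal illustration, assuming each indicator is supplied as an already standardized list of sample values; the names `deviation_sum`, `r_cluster`, `columns`, and `n_classes` are hypothetical.

```python
from itertools import combinations


def deviation_sum(cluster, columns):
    """Sum of squared deviations of a class's indicator vectors from their mean vector."""
    vectors = [columns[name] for name in cluster]
    length = len(vectors[0])
    mean = [sum(v[t] for v in vectors) / len(vectors) for t in range(length)]
    return sum((v[t] - mean[t]) ** 2 for v in vectors for t in range(length))


def r_cluster(columns, n_classes):
    """Greedy R-type clustering of indicators.

    Start with one class per indicator; at each step merge the pair of
    classes whose union gives the smallest total sum of squared deviations,
    until `n_classes` classes remain.
    `columns` maps indicator name -> standardized sample values (assumed).
    """
    clusters = [(name,) for name in columns]
    while len(clusters) > n_classes:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            merged = clusters[a] + clusters[b]
            others = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
            total = deviation_sum(merged, columns) + sum(
                deviation_sum(c, columns) for c in others
            )
            if best is None or total < best[0]:
                best = (total, others + [merged])
        clusters = best[1]
    return clusters
```

The sketch recomputes every total from scratch for clarity, which is acceptable for the handful of indicators in one criterion layer but would be slow for a large index system.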

In R cluster analysis, a reasonable number of categories is between 2 and 4. To avoid subjectivity in choosing the number of categories, the nonparametric K-W test is applied to each class after clustering to judge whether the number of classes is reasonable. The original hypothesis of the nonparametric K-W test is that there are no significant differences in the numerical characteristics of the different indicators.

If the significance level of every category satisfies sig > 0.05, the original hypothesis is accepted. That is, there is no significant difference between the indicators in the same class, and the number of classes is reasonable. Otherwise, the indicators should be reclustered.
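The check described above can be illustrated with a hand-written Kruskal-Wallis statistic in Python. This is a sketch under assumptions: the decision uses the chi-square approximation at the 5% level with tabulated critical values for 1 to 3 degrees of freedom, which is rough for small samples, and the names `kruskal_wallis_h` and `same_data_features` are hypothetical. Each group is the list of sample values of one indicator in the class being tested.

```python
# 5% critical values of the chi-square distribution for 1, 2, and 3 degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815}


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction."""
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)
    if n < 2:
        return 0.0

    # Average rank of each distinct value (ranks start at 1).
    rank_of, tie_term, start = {}, 0, 0
    while start < n:
        end = start
        while end < n and pooled[end] == pooled[start]:
            end += 1
        t = end - start
        rank_of[pooled[start]] = (start + 1 + end) / 2
        tie_term += t ** 3 - t
        start = end

    rank_sums = [sum(rank_of[v] for v in group) for group in groups]
    h = 12 / (n * (n + 1)) * sum(
        r ** 2 / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def same_data_features(groups):
    """True if the hypothesis of no difference is not rejected at the 5% level."""
    return kruskal_wallis_h(groups) < CHI2_CRIT_05[len(groups) - 1]
```

A class passes the check when `same_data_features` returns `True`, which corresponds to sig > 0.05 under the chi-square approximation.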

(2) Analysis of the Size of the Discriminant Force Based on the Coefficient of Variation. An indicator’s coefficient of variation reflects its identification ability. The bigger an indicator’s coefficient of variation, the more information it contains. Therefore, the indicator with the biggest coefficient of variation within each class should be retained.

Let $\sigma_j$ be the overall standard deviation of the $j$th indicator and $\bar{x}_j$ the mean of the $j$th indicator. The coefficient of variation $v_j$ of the $j$th indicator is

$$v_j = \frac{\sigma_j}{\bar{x}_j}. \tag{12}$$

The advantage of the coefficient of variation is that the indicator with the largest coefficient of variation has the strongest ability to distinguish different information and plays the largest role in the comprehensive evaluation. Removing the indicators whose coefficient of variation is small keeps the index system simple and effective.
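A minimal Python sketch of this selection rule follows. It assumes the coefficient of variation is computed on the raw indicator values with a nonzero mean (for z-score standardized data the mean is 0 and the ratio is undefined). The names `coefficient_of_variation`, `pick_representatives`, `clusters`, and `columns` are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (mean assumed nonzero)."""
    return pstdev(values) / mean(values)


def pick_representatives(clusters, columns):
    """Keep, from each class, the indicator with the largest coefficient of variation.

    `clusters` is a list of tuples of indicator names; `columns` maps each
    indicator name to its raw sample values.
    """
    return [
        max(cluster, key=lambda name: coefficient_of_variation(columns[name]))
        for cluster in clusters
    ]
```

A class that contains a single indicator simply returns that indicator, which matches the case where a criterion layer has nothing to merge.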

Let $LGD_k$ be the default loss rate of the $k$th sample, $L_k$ the receivable principal and interest of the $k$th sample that has not yet been repaid, and $R_k$ the receivable principal and interest of the $k$th sample. Then

$$LGD_k = \frac{L_k}{R_k}. \tag{13}$$

4. Empirical Study

4.1. Sample Selection and Data Sources

The loan data of 860 microenterprises in this paper are derived from the credit database of the head office of a commercial bank. There are 830 nondefault customers and 30 default customers. Each sample includes 68 indicators such as Debt Asset Ratio, Acid-Test Ratio, and Operating Profit Ratio, which are shown in columns 2–69 of Table 1. The default loss rate calculated by formula (13) is set out in column 72 of Table 1. According to the type of the default loss rate, the 860 customers are divided into three categories, which are placed in column 73 of Table 1.

4.2. Screening of Indexes Based on Nonparametric K-Nearest Neighbor Discriminant Method

Select the optimal value of K. In this paper, the sample size is 860, and the K value should be smaller than the square root of the sample size, so K is less than $\sqrt{860} \approx 29.33$. K should therefore belong to [1, 29].

Find the best K value. According to the objective function, the best value of K is the one that minimizes the test error of all samples. The test error for each value of K is used to draw the trend of the test error against K.

Determine the optimal K value. It can be seen from Figure 1 that the K value corresponding to the minimum test error is 1, so K = 1.

The indicators are then screened by the nonparametric K-nearest neighbor discriminant method. The indicators are placed in column 1 of Table 2.

The discriminant accuracy of the 68 indicators is obtained by nonparametric K-nearest neighbor discrimination.

Step 1. Calculate the discriminant accuracy of the high default loss rate sample. Among the 24 high default loss rate samples, 13 were accurately discriminated by the nonparametric K-nearest neighbor discriminant method. According to formula (7), the discriminant accuracy of the high default loss rate sample is 13/24 = 54.17%, placed in column 2 of Table 2.

Step 2. Calculate the discriminant accuracy of the low default loss rate sample. Among the 6 low default loss rate samples, none was accurately discriminated by the nonparametric K-nearest neighbor discriminant method. According to formula (7), the discriminant accuracy of the low default loss rate sample is 0/6 = 0%, placed in column 2 of Table 2.

Step 3. Calculate the discriminant accuracy of the nondefault sample. Among the 830 nondefault samples, 823 were accurately discriminated by the nonparametric K-nearest neighbor discriminant method. According to formula (7), the discriminant accuracy of the nondefault sample is 823/830 = 99.16%, placed in column 2 of Table 2.

Step 4. Calculate the discriminant accuracy of all samples. Putting 54.17%, 0%, and 99.16% into formula (8) gives the discrimination accuracy of all the samples, which is placed in column 2 of Table 2.
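As a check on the arithmetic, the counts reported in Steps 1 to 3 can be substituted directly. The worked calculation below takes the weights in formula (8) to be the class sample sizes, which is an assumption about the exact form of that formula; the symbols $a_1$, $a_2$, $a_3$ (class accuracies) and $A$ (overall accuracy) are notation chosen for this illustration.

```latex
\begin{aligned}
a_1 &= \frac{13}{24} \approx 54.17\%, \\
a_2 &= \frac{0}{6} = 0\%, \\
a_3 &= \frac{823}{830} \approx 99.16\%, \\
A   &= \frac{24\,a_1 + 6\,a_2 + 830\,a_3}{24 + 6 + 830}
     = \frac{13 + 0 + 823}{860}
     = \frac{836}{860} \approx 97.21\%.
\end{aligned}
```

Under this weighting the overall figure is dominated by the 830 nondefault samples, so it stays high even though none of the low default loss rate samples is identified correctly.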

Each of the 68 indicators is deleted in turn, and the discriminant accuracy of the remaining 67 indicators is calculated by the nonparametric K-nearest neighbor discriminant method.

Using the 67 indicators that remain after removing the first indicator, the discriminant accuracy of the high default loss rate sample, the low default loss rate sample, the nondefault sample, and the total sample is obtained and placed in the first row of Table 2. Similarly, the other indicators are removed one by one, and the discriminant accuracy of the high default loss rate sample, the low default loss rate sample, the nondefault sample, and the total sample is calculated and placed in the other rows of Table 2. Substituting the accuracy obtained without the jth indicator and the accuracy of all the indicators into (9) gives the degree of influence of the jth indicator on the discrimination accuracy, which is placed in column 7 of Table 2.

Screen indicators based on the degree of discrimination of different indicators.

Standard 1 (remove indicators whose degree of influence is greater than 0). According to the degree of influence in the second column of Table 3, some indicators have a degree of influence larger than 0. Discrimination accuracy improves if indicators of this type are eliminated, so they are removed; the results are placed in the corresponding rows in column 3 of Table 3.

Standard 2 (remove indicators whose degree of influence is equal to 0). According to the degree of influence in the second column of Table 3, some indicators have a degree of influence equal to 0. Discrimination accuracy does not change if indicators of this type are eliminated, so they are removed; the results are placed in the corresponding rows in column 3 of Table 3.

Standard 3 (retain indicators whose degree of influence is less than 0). According to the degree of influence in the second column of Table 3, some indicators have a degree of influence less than 0. Discrimination accuracy decreases if indicators of this type are eliminated, so they are retained; the results are placed in the corresponding rows in column 3 of Table 3.

Indicator Screening Results. Of the 68 indicators, 58 were excluded and 10 were retained. Table 4 shows the indicators retained by nonparametric K-nearest neighbor discrimination.

4.3. Indicators’ Second Selection by R Clustering Analysis
4.3.1. R Clustering Analysis Based on the Squared Sum Method

The 10 indicators retained from the previous screening are classified according to the criterion layer. The classification results are shown in Table 5.

R clustering analysis is used within each criterion layer to classify the indicators, and the K-W test is used to test the different classification results.

(1) In the criterion layer of solvency, there is only one indicator, so it forms a single class and no K-W test is needed. The criterion layers of the basic situation of the legal representative and of nonfinancial factors within the enterprise are in the same situation, so the single indicator in each of these three criterion layers should be reserved.

(2) In the criterion layer of operating capacity, there are two indicators. First, the two indicators are placed in one class. The K-W test for these two indicators rejects the original hypothesis that they have the same data features, so the two indicators have different data features and both should be reserved.

(3) In the criterion layer of enterprise external macroconditions, there are two indicators. First, the two indicators are placed in one class. The K-W test for these two indicators rejects the original hypothesis that they have the same data features, so the two indicators have different data features and both should be reserved.

(4) In the criterion layer of profitability, there are three indicators. First, the three indicators are divided into 2 classes. The K-W test for the two indicators that fall in the same class rejects the original hypothesis that they have the same data features, so these two indicators should be split into 2 classes. In this case, there is no need to test the three indicators as a single class. Consequently, the three indicators in this criterion layer are divided into three classes, and all three should be reserved.

The classification results of the indicators are as follows. (1) In the criterion layer of solvency, the single indicator should be reserved. (2) In the criterion layer of the basic situation of the legal representative, the single indicator should be reserved. (3) In the criterion layer of nonfinancial factors within the enterprise, the single indicator should be reserved. (4) In the criterion layer of operating capacity, both indicators should be reserved. (5) In the criterion layer of enterprise external macroconditions, both indicators should be reserved. (6) In the criterion layer of profitability, all three indicators should be reserved.

4.3.2. Analysis of the Size of the Discriminant Force Based on the Coefficient of Variation

R clustering analysis shows that there is no redundant information in any criterion layer, so there is no need to use the coefficient of variation to delete indicators with weaker recognition ability. This completes the second index screening process.

By applying the nonparametric K-nearest neighbor discriminant method and R clustering analysis, the paper establishes a credit evaluation indicator system for small enterprises, which contains 6 criterion layers and 10 indicators.

4.4. Comparative Analysis

To demonstrate the superiority of the combined model of the nonparametric K-nearest neighbor discriminant and R clustering proposed in this paper, the combined model is compared with stepwise discriminant analysis and a neural network model. The superiority of an indicator screening model is reflected in the identification ability of the indicators it selects. Therefore, this article compares the discriminatory power of the three models.

Comparative analysis includes the following two steps.

Step 1. The combined model, stepwise discriminant analysis model, and neural network model will be used, respectively, to screen indicators that have significant discriminating ability on default loss rate.

Step 2. Use the selected index system to test the discrimination ability of the model. The higher the discriminative power of the model, the greater the superiority of the model.

Table 6 shows the discriminating ability of the three models for all types of samples. The discriminatory power of the combined model is higher than that of the stepwise discriminant analysis model and the neural network model, both for individual classes of samples and for all the samples. Therefore, the combined model is superior to the other two models; that is, the index system screened by the combined model has stronger identification ability. In addition, the combined model is more suitable for multiclass problems because it has higher discriminative power when dealing with them.

5. Conclusion

5.1. The Main Conclusions

A credit index system containing 10 indicators is selected by the combined model of the nonparametric K-nearest neighbor discrimination method and R cluster analysis.

5.2. The Characteristics of This Article

First, the nonparametric K-nearest neighbor discrimination method is used to select the indicators that have significant discriminant ability on samples with different default loss rates. This remedies the deficiency of previous studies, which mainly focus on the default state.

Second, the R cluster analysis applied in this paper is based on the criterion layer rather than the whole index system. This ensures that the indicators clustered into the same class have the same economic implications and data features, and avoids clustering together indicators that share data characteristics but differ in economic implications.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of the paper.

Acknowledgments

The research is supported by the Key Project of National Natural Science Foundation of China (71731003), China Postdoctoral Science Foundation (2015M582746XB), and Natural Science Foundation of Inner Mongolia Autonomous Region of China (2016MS0714). The authors would like to show great gratitude to the organizations mentioned above.