Complexity

Volume 2018, Article ID 2067065, 9 pages

https://doi.org/10.1155/2018/2067065

## Empirical Study on Indicators Selection Model Based on Nonparametric K-Nearest Neighbor Identification and R Clustering Analysis

^{1}College of Economics and Management, Inner Mongolia Agricultural University, Hohhot 010010, China

^{2}Huachen Trust Limited Liability Company, Hohhot 010010, China

Correspondence should be addressed to Zhan-jiang Li; lizhanjiang582@163.com

Received 13 September 2017; Revised 19 February 2018; Accepted 26 February 2018; Published 30 April 2018

Academic Editor: Enzo Pasquale Scilingo

Copyright © 2018 Yan Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The combination of the nonparametric K-nearest neighbor discriminant method and R cluster analysis is used to construct a double-combination index screening model. The characteristics of this article are as follows. First, the nonparametric K-nearest neighbor discriminant method is used to select the indicators that have significant ability to discriminate the default loss rate, which addresses a shortcoming of previous research, namely its exclusive focus on indicators that discriminate default state. Second, the R cluster analysis applied in this paper clusters the indicators within each criterion class rather than across the whole index system. This ensures that indicators clustered into one class share both the same economic implications and the same data characteristics, and it avoids the situation where indicators clustered into one class share only data characteristics while differing in economic meaning.

#### 1. Introduction

The existing research on the influencing factors of credit risk in microenterprises is divided into the following two categories.

*(1) Existing Studies on Credit Evaluation Indicator Systems*. Reusens and Croux (2017) argue that government debt, GDP growth rate, inflation, and other macroeconomic factors play a significant role in promoting corporate credit, so they include these variables in a credit evaluation index system [1]. Anand et al. (2016) hold that indicators such as profitability, liquidity, firm size, and credit rating influence the stability of the firm and play a vital role in credit evaluation, so these indicators should be included in the credit evaluation index system [2]. Jones et al. (2015) built a corporate credit rating system using financial indicators such as total assets; in addition, they included market variables such as enterprise scale and years since establishment in the index system [3]. Doumpos et al. (2015) mainly examined the impact of financial indicators on corporate credit and built a credit evaluation index system including asset returns, interest income, solvency, long-term debt leverage, and company size [4].

*(2) Existing Studies on Indicator Selection Methods*. Many existing studies build classifiers from a fuzzy perspective to solve the credit evaluation problem [5–7]. Sohn et al. (2016) use fuzzy logistic regression to establish a credit rating equation [8]. Abiyev (2014) develops fuzzy logic and neural network methods to extract important credit risk assessment information [9]. Ju and Sohn (2014) established a credit rating equation to select appropriate funding beneficiaries [10]. Elliott et al. (2014) screen out the true information that reflects the credit state of a company based on a double hidden Markov model (DHMM) [11]. Abellán and Mantas (2014) construct ensembles of classifiers for bankruptcy prediction and credit scoring based on the random subspace method; their experiments show that bagged decision trees provide the best results for bankruptcy forecasts and credit scores [12]. Bijak and Thomas (2015) use improved Bayesian analysis techniques to deal with the problem of losses from bad loans [13]. Gorzałczany and Rudziński (2016) focus more than other scholars on the supervision and division of customer credit ratings, which helps banks make better lending decisions [14]. Jones et al. (2015) predict the variation tendency of customer credit levels and determine the credit threshold through a binary classifier [3].

The defects of the existing research are as follows. First, most existing studies construct indicator systems from the perspective of default versus nondefault, and lack research from the perspective of the default loss rate. Second, some existing studies do not classify indicators by their economic meaning when using R cluster analysis, so they cannot remove indicators that carry redundant information.

*Contributions of This Paper*. First, this paper applies the nonparametric K-nearest neighbor discriminant method to remove indicators that cannot significantly distinguish samples with different default loss rates. Second, the paper classifies indicators by R clustering analysis and, using the coefficient of variation, selects from each class the indicator that carries the most information, ensuring that duplicate information is removed.

#### 2. Research Principle

##### 2.1. The Difficulty of the Problem

*Difficulty 1. *The first difficulty is how to ensure that the selected indicators can significantly differentiate samples with different default loss rates. In existing studies, the indicators selected by many classic methods can only distinguish different default states.

*Difficulty 2. *The second difficulty is how to delete indicators that suffer from information overlap and redundancy.

##### 2.2. The Method to Solve the Difficulty

*The Method to Solve Difficulty 1*. The nonparametric K-nearest neighbor discrimination method screens out the indicators that have significant ability to discriminate samples with different default loss rates. If there are $m$ indicators, then for each indicator the identification accuracy after removing it is calculated and compared with the accuracy obtained with all indicators, giving the accuracy difference between the reduced indicator set and the full set. If the accuracy difference for a certain indicator is greater than or equal to 0, the indicator is deleted; if it is less than 0, the indicator is retained. After these steps, the indicators with significant discrimination ability on different default loss rates are selected.

*The Method to Solve Difficulty 2*. R cluster analysis screens the indexes again and excludes collinearity. The indexes screened out by the nonparametric K-nearest neighbor discrimination method are reclassified by R cluster analysis within each criterion layer. The indicator with the largest coefficient of variation in each category of each criterion layer enters the final indicator system, which therefore does not suffer from information redundancy.

#### 3. Construction of Indicator System

##### 3.1. Indicators’ First Selection by Nonparametric K-Nearest Neighbor Discrimination Method

###### 3.1.1. Selection of the Optimal K Value

In this paper, the optimal K value is selected by the error balance method (Xing and Tingjin, 2014) [15], with an additional constraint imposed on the method. Compared with generalized cross-validation, the error balance method not only obtains the optimal K value but also greatly reduces the computational cost.

The error balance method increases the K value from 1 and uses the test error of all samples to draw the trend of the test error. According to this trend, the optimal K value is determined as the one that minimizes the test error. This method not only specifies the direction of the optimal K value search but also ensures that the optimal K value is chosen within a reasonable range. This paper combines Góra and Wojna’s idea (Góra and Wojna, 2002) with the error balance method to find the best K value [16].

Assume that $e_i$ is the test error of the $i$th type sample, $m_i$ is the number of $i$th type samples misjudged into other classes, and $n_i$ is the actual number of $i$th type samples ($i = 1, 2, 3$):

$$e_i = \frac{m_i}{n_i}, \quad i = 1, 2, 3. \tag{1}$$

Assume that $E$ is the test error of all samples; $e_1$, $e_2$, and $e_3$ are the test errors of the high default loss rate, low default loss rate, and nondefault samples; and $n_1$, $n_2$, and $n_3$ are the corresponding sample sizes:

$$E = \frac{n_1 e_1 + n_2 e_2 + n_3 e_3}{n_1 + n_2 + n_3}. \tag{2}$$

The meanings of formulas (1) and (2) are as follows: the ratio of the number of misjudged samples to the actual sample size gives the per-class test error, and the weighted average of the test errors of the three sample types gives the total sample test error.
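Formulas (1) and (2) can be sketched numerically as follows; the misjudgment counts and class sizes used here are hypothetical, chosen only to illustrate the computation:

```python
def per_class_test_error(misjudged, actual):
    """Formula (1): e_i = m_i / n_i for each class i."""
    return [m / n for m, n in zip(misjudged, actual)]

def total_test_error(misjudged, actual):
    """Formula (2): E is the n_i-weighted average of the e_i."""
    e = per_class_test_error(misjudged, actual)
    return sum(n * ei for n, ei in zip(actual, e)) / sum(actual)

# hypothetical counts: 2 of 10 high-loss, 1 of 10 low-loss,
# and 0 of 20 nondefault samples misjudged
e = per_class_test_error([2, 1, 0], [10, 10, 20])   # [0.2, 0.1, 0.0]
E = total_test_error([2, 1, 0], [10, 10, 20])       # 3/40 = 0.075
```

Note that the weighted average in (2) collapses to the total number of misjudgments over the total sample size, since $\sum_i n_i e_i = \sum_i m_i$.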

Assume that $e_1$, $e_2$, and $e_3$ are the test errors of the high default loss rate, low default loss rate, and nondefault samples; $K$ is the number of nearest neighbors; $n$ is the total sample size; and $n_1$, $n_2$, and $n_3$ are the sizes of the three samples:

$$K^{*} = \arg\min_{K} E(K). \tag{3}$$

*The Meaning of (3)*. According to Góra and Wojna’s theory, the optimal $K$ value should fall within the range they propose [16]. Under this constraint, the optimal $K$ value is the one that minimizes the total sample test error $E$.
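The error balance sweep can be sketched as below. The one-dimensional toy data, the leave-one-out evaluation of $E(K)$, and the $\sqrt{n}$ upper bound on the search range are illustrative assumptions, not the paper's exact procedure:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_test_error(X, y, k):
    """Leave-one-out estimate of the total test error E(k)."""
    wrong = 0
    for i in range(len(X)):
        tx, ty = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        if knn_predict(tx, ty, X[i], k) != y[i]:
            wrong += 1
    return wrong / len(X)

# toy one-dimensional data: two well-separated classes
X = [1.0, 1.2, 1.4, 5.0, 5.2, 5.4]
y = [0, 0, 0, 1, 1, 1]

# sweep K upward from 1 (here bounded by sqrt(n), a common heuristic)
errors = {k: loo_test_error(X, y, k) for k in range(1, int(math.sqrt(len(X))) + 1)}
best_k = min(errors, key=errors.get)   # formula (3): argmin of E(K)
```

On this separable toy data every candidate $K$ achieves zero test error, so the sweep returns the smallest one.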

###### 3.1.2. The Process of Index Screening through the Nonparametric K-Nearest Neighbor Identification Method

*(1) Calculate the Prior Probabilities*. Assume that $P(\omega_i)$ is the prior probability of class $\omega_i$ ($i = 1, 2, 3$), $n_i$ is the sample size of class $\omega_i$, and $n$ is the sum of the sample sizes of all classes (Ganjiang, 2007) [17]:

$$P(\omega_i) = \frac{n_i}{n}, \quad i = 1, 2, 3. \tag{4}$$

*The Meaning of (4)*. The prior probability of each class is the ratio of that class’s sample size to the total number of samples. The smaller this ratio, the smaller the likelihood that a sample is classified into that class.

*(2) Using K-Nearest Neighbor Estimation to Obtain the Probability Density Function*. Assume that $f(x \mid \omega_i)$ is the probability density function of class $\omega_i$; $k_i$ is the number of the $K$ nearest neighbors belonging to class $\omega_i$, with $\sum_{i} k_i = K$; $n_i$ is the sample size of class $\omega_i$; and $V$ is the volume that contains the $K$ nearest neighbors (Ganjiang, 2007) [17]:

$$f(x \mid \omega_i) = \frac{k_i}{n_i V}. \tag{5}$$

*The Meaning of (5)*. Formula (5) estimates the probability density that $x$ falls within the established neighborhood.

*(3) Calculate Posterior Probability*. Assume that $P(\omega_i \mid x)$ is the posterior probability of the known category $\omega_i$; $P(\omega_i)$ is the prior probability of each class, with $i = 1, 2, 3$; $f(x \mid \omega_i)$ is the probability density function of each class; and the denominator is the sum over all classes of the product of the probability density function and the prior probability (Ganjiang, 2007) [17]:

$$P(\omega_i \mid x) = \frac{P(\omega_i)\, f(x \mid \omega_i)}{\sum_{j=1}^{3} P(\omega_j)\, f(x \mid \omega_j)}. \tag{6}$$

If $P(\omega_1 \mid x)$ is the largest of the three, the sample is assigned to the high default loss rate class; if $P(\omega_2 \mid x)$ is the largest, to the low default loss rate class; and if $P(\omega_3 \mid x)$ is the largest, to the nondefault class. One minus the error rate equals the accuracy.
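A minimal numeric sketch of formulas (4)–(6), with hypothetical class sizes and neighbor counts; note that the $n_i$ and $V$ terms cancel, so the k-NN posterior reduces to $k_i / K$:

```python
def knn_posteriors(k_counts, n_counts, volume):
    """Posterior probabilities P(w_i | x) from k-NN estimates.

    k_counts[i]: neighbors of x belonging to class i (sum = K)
    n_counts[i]: training samples in class i
    volume:      volume V of the region holding the K neighbors
    """
    n = sum(n_counts)
    priors = [n_i / n for n_i in n_counts]                  # formula (4)
    densities = [k_i / (n_i * volume)                       # formula (5)
                 for k_i, n_i in zip(k_counts, n_counts)]
    joint = [p * f for p, f in zip(priors, densities)]
    total = sum(joint)
    return [j / total for j in joint]                       # formula (6)

# hypothetical case: K = 5 neighbors split 3/1/1 over classes of
# size 30, 30, and 800; the posterior comes out as k_i / K
post = knn_posteriors([3, 1, 1], [30, 30, 800], volume=1.0)
# post == [0.6, 0.2, 0.2], so the sample goes to the first class
```

The cancellation shows why the method needs no explicit density model: the posterior depends only on the class composition of the neighborhood.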

*(4) Measure the Identification Accuracy of the Default Loss Rate*. Assume that $A_i$ is the accuracy for the $i$th type sample, $r_i$ is the number of $i$th type samples correctly judged by the nonparametric K-nearest neighbor discriminant method, and $n_i$ is the actual number of $i$th type samples. Then $A_i$ is

$$A_i = \frac{r_i}{n_i}, \quad i = 1, 2, 3. \tag{7}$$

*The Meaning of (7)*. The larger the calculated value, the better the nonparametric K-nearest neighbor discriminant method identifies that class of samples.

Assume that $A$ is the identification accuracy for all samples; then

$$A = \frac{n_1 A_1 + n_2 A_2 + n_3 A_3}{n_1 + n_2 + n_3}. \tag{8}$$

*The Meaning of Formula (8)*. The discrimination accuracy of all samples equals the weighted average of the discrimination accuracies of the high default loss rate, low default loss rate, and nondefault samples. The higher $A$ is, the higher the overall discrimination accuracy.

*(5) Calculate the Degree of Influence of the ith Indicator on the Discrimination Accuracy*. Assume that $\Delta A_i$ is the degree of influence of the $i$th indicator on the discrimination accuracy, $A_{(i)}$ is the identification accuracy of the remaining indicators after eliminating the $i$th indicator, and $A$ is the identification accuracy with all indicators. Then

$$\Delta A_i = A_{(i)} - A. \tag{9}$$

Formula (9) reflects the degree of influence of the $i$th indicator on the discrimination accuracy.

*(6) Three Criteria of Indicator Screening Based on Nonparametric K-Nearest Neighbor Discriminant*

*Criterion 1. *If the discrimination accuracy of the remaining indicators after excluding the $i$th indicator is larger than the discrimination accuracy of all indicators, that is, $\Delta A_i > 0$, then deleting the indicator improves the accuracy, so the indicator should be removed. All indicators that meet Criterion 1 are removed.

*Criterion 2. *If the discrimination accuracy of the remaining indicators after excluding the $i$th indicator equals the discrimination accuracy of all indicators, that is, $\Delta A_i = 0$, then deleting the indicator does not change the accuracy, so the indicator should still be removed. All indicators that meet Criterion 2 are removed.

*Criterion 3. *If the discrimination accuracy of the remaining indicators after excluding the $i$th indicator is smaller than the discrimination accuracy of all indicators, that is, $\Delta A_i < 0$, then deleting the indicator decreases the accuracy, so the indicator should be retained. All indicators that meet Criterion 3 are retained.
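The three criteria amount to a single backward-elimination pass over the indicator set. The sketch below assumes a caller-supplied `accuracy_of` function; the toy stand-in used in the example is hypothetical and merely mimics a discriminant whose accuracy depends on two informative indicators:

```python
def screen_indicators(indicators, accuracy_of):
    """Keep only indicators whose removal lowers accuracy (Criterion 3);
    drop those with delta >= 0 (Criteria 1 and 2)."""
    full_acc = accuracy_of(indicators)
    kept = []
    for ind in indicators:
        rest = [j for j in indicators if j != ind]
        delta = accuracy_of(rest) - full_acc   # formula (9)
        if delta < 0:
            kept.append(ind)
    return kept

# toy accuracy function: only x1 and x2 carry discriminating power
toy_acc = lambda inds: 0.5 + 0.2 * ('x1' in inds) + 0.1 * ('x2' in inds)
selected = screen_indicators(['x1', 'x2', 'x3', 'x4'], toy_acc)
# x3 and x4 leave the accuracy unchanged, so only x1 and x2 survive
```

In the paper's setting, `accuracy_of` would be the nonparametric K-nearest neighbor discrimination accuracy $A$ of formula (8), recomputed for each reduced indicator set.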

##### 3.2. Indicators’ Second Selection by R Clustering Analysis

R-type clustering, also known as variable clustering, clusters variables rather than samples. To reflect the characteristics of things and ensure the uniqueness of each indicator, R-type clustering is used to further cluster the variables selected by the nonparametric K-nearest neighbor discriminant method and to delete redundant information.

*(1) R-Type Clustering Analysis Based on the Sum of Squared Deviations Method*. The R cluster analysis of the indexes within the same criterion layer is carried out by the sum of squared deviations method.

Assume that $W_t$ is the sum of squared deviations of the $t$th class of indicators ($t = 1, 2, \ldots, k$); the indicators are divided into $k$ classes; $m_t$ is the number of indicators in the $t$th class; $x_j^{(t)}$ is the standardized sample value vector of the $j$th indicator in the $t$th class ($j = 1, 2, \ldots, m_t$); and $\bar{x}^{(t)}$ is the mean vector of the $t$th class of indicators:

$$W_t = \sum_{j=1}^{m_t} \left(x_j^{(t)} - \bar{x}^{(t)}\right)^{\mathrm{T}} \left(x_j^{(t)} - \bar{x}^{(t)}\right). \tag{10}$$

Assume that $S$ is the sum of squared deviations over all $k$ classes of indicators:

$$S = \sum_{t=1}^{k} W_t. \tag{11}$$

*Step 1. *Treat the $m$ indicators as $m$ classes.

*Step 2. *Combine any two of the $m$ indicators into one class, leaving the remaining indicators unchanged; there are $m(m-1)/2$ possible combinations. According to (10), calculate the sum of squared deviations $W_t$ of each class of indicators.

*Step 3. *Calculate the total sum of squared deviations $S$ over all classes by (11), and reclassify the indicators using the combination that minimizes $S$.

*Step 4. *Repeat Step 3 until the number of classes reaches the desired value.
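Steps 1–4 can be sketched as the following agglomerative procedure over standardized indicator columns; the four-column toy dataset is hypothetical and constructed so that two pairs of indicators are perfectly correlated:

```python
def ward_r_cluster(columns, n_classes):
    """Steps 1-4: agglomerate standardized indicator columns, always
    merging the pair that least increases the total sum of squared
    deviations S (formulas (10)-(11))."""
    def standardize(col):
        m = sum(col) / len(col)
        sd = (sum((v - m) ** 2 for v in col) / len(col)) ** 0.5
        return [(v - m) / sd for v in col]

    cols = [standardize(c) for c in columns]
    clusters = [[j] for j in range(len(cols))]          # Step 1

    def sse(cluster):
        # formula (10): squared deviations of member columns
        # from the class mean vector
        mean = [sum(cols[j][t] for j in cluster) / len(cluster)
                for t in range(len(cols[0]))]
        return sum((cols[j][t] - mean[t]) ** 2
                   for j in cluster for t in range(len(mean)))

    while len(clusters) > n_classes:                    # Steps 2-4
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                inc = (sse(clusters[a] + clusters[b])
                       - sse(clusters[a]) - sse(clusters[b]))
                if best is None or inc < best[0]:
                    best = (inc, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
    return clusters

# four indicator columns: 0 and 1 identical after standardization,
# likewise 2 and 3, so two classes recover {0, 1} and {2, 3}
t = list(range(10))
data = [t, [2 * v for v in t], [v * v for v in t], [3 * v * v for v in t]]
groups = ward_r_cluster(data, 2)
```

Because the columns are standardized first, scale differences between indicators (e.g., ratios versus absolute amounts) do not distort the clustering.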

In the R cluster analysis, the reasonable number of categories lies between 2 and 4. To avoid subjective arbitrariness in choosing the number of categories, a nonparametric test is applied to each class after clustering to judge the rationality of the classification number. The null hypothesis of the test is that there are no significant differences in the numerical characteristics of the different indicators within a class.

If the significance level of each category satisfies sig > 0.05, the null hypothesis is accepted; that is, there is no significant difference between indicators of the same class, and the number of classes is reasonable. Otherwise, the indicators should be reclustered.

*(2) Analysis of the Size of the Discriminant Force Based on the Coefficient of Variation*. An indicator’s coefficient of variation reflects its discriminating ability. The bigger an indicator’s coefficient of variation, the more information the indicator contains. Therefore, the indicator with the biggest coefficient of variation within each class should be retained.

Assume that $\sigma_j$ is the overall standard deviation of the $j$th indicator and $\mu_j$ is the mean of the $j$th indicator; the coefficient of variation of the $j$th indicator is

$$V_j = \frac{\sigma_j}{\mu_j}. \tag{12}$$

The advantage of the coefficient of variation is that the indicator with the largest coefficient of variation has the strongest ability to distinguish different information and plays the largest role in the comprehensive evaluation; removing indicators with small coefficients of variation keeps the index system simple and effective.
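Selecting each cluster's representative by formula (12) can be sketched as follows; the indicator names and sample values are hypothetical:

```python
def coefficient_of_variation(values):
    """Formula (12): V_j = sigma_j / mu_j (population standard
    deviation over the absolute mean)."""
    mean = sum(values) / len(values)
    sigma = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return sigma / abs(mean)

def pick_representative(cluster, data):
    """Within one cluster, keep the indicator with the largest V_j."""
    return max(cluster, key=lambda name: coefficient_of_variation(data[name]))

data = {'quick_ratio': [1.1, 1.0, 0.9, 1.0],        # nearly constant
        'operating_profit': [0.5, 2.0, 3.5, 6.0]}   # widely dispersed
chosen = pick_representative(['quick_ratio', 'operating_profit'], data)
# the widely dispersed indicator carries more information and is kept
```

Dividing by the absolute mean makes the dispersion measure scale-free, so indicators measured in different units remain comparable.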

Assume that $L_i$ is the default loss rate of the $i$th sample, $P_i$ is the receivable principal and interest of the $i$th sample that has not yet been repaid, and $Q_i$ is the total receivable principal and interest of the $i$th sample:

$$L_i = \frac{P_i}{Q_i}. \tag{13}$$

#### 4. Empirical Study

##### 4.1. Sample Selection and Data Sources

The loan data of 860 microenterprises in this paper comes from the credit database of the head office of a commercial bank. There are 830 nondefault customers and 30 default customers. Each sample includes 68 indicators, such as Debt Asset Ratio, Acid-Test Ratio, and Operating Profit Ratio, which are shown in columns 2–69 of Table 1. The default loss rate calculated by formula (13) is set out in column 72 of Table 1. According to the type of default loss rate, the 860 customers are divided into three categories and placed in column 73 of Table 1.