Complexity

Volume 2018, Article ID 1032643, 17 pages

https://doi.org/10.1155/2018/1032643

## A Novel Approach for Reducing Attributes and Its Application to Small Enterprise Financing Ability Evaluation

Correspondence should be addressed to Wenli Shi; shiwenli2008@126.com

Received 28 June 2017; Accepted 5 December 2017; Published 15 January 2018

Academic Editor: Dimitri Volchenkov

Copyright © 2018 Baofeng Shi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Attribute reduction is a key preprocessing step for reducing the large dimensionality encountered in data mining of complex systems. Many researchers have proposed approaches for reducing attributes or selecting key features in multicriteria decision-making evaluation. In practice, the existing approaches for attribute reduction focus on improving classification accuracy or saving computational time, without considering the influence of the reduction results on the original data set. To help address this gap, we develop a novel attribute reduction approach combining Pearson correlation analysis with F-test significance discrimination for screening and identifying the key characteristics of the original data set. The proposed model is verified using the financing ability evaluation data of 713 small enterprises of a city commercial bank in China, and the experimental results show that the proposed reduction model is efficient and effective. Moreover, our experimental findings help banks locate qualified partners and alleviate the difficulties enterprises face when applying for loans.

#### 1. Introduction

With the coming of the era of big data, the size of data sets has been increasing sharply, making it difficult for decision makers and management to base decisions on those data [1]. The most important task for decision makers is therefore to reduce the huge number of attributes, or large dimensionality, in data sets. Attribute reduction, also called indicator selection or feature screening, ascertains a subset of attributes to reduce the dimensionality of the original data set. By reducing attributes, one can select the attributes with the highest information content and save computational time and memory [2]. Attribute reduction is also useful for improving classification accuracy, as it deletes chaotic and irrelevant attributes [3]. In practice, attribute reduction has been applied in a great many fields, such as decision making, pattern recognition, and economic and social system evaluation [4–7].

The main attribute reduction approaches can be divided into three categories. One of the most famous is based on rough set theory. The rough set approach proposed by Pawlak provides useful tools for reasoning from data [8]. It is advantageous over other attribute reduction approaches, which typically use multivariate statistics requiring specific parametric assumptions [9, 10]. Degang et al. established a model for reducing the attributes of covering decision systems that combines them with traditional rough sets; an empirical study indicated that the proposed approach accomplished better classification performance than existing rough set methods [11]. In order to improve classification accuracy on data containing hybrid attribute types, such as numerical and categorical attributes, Hu et al. introduced a simple and efficient greedy algorithm for hybrid attribute reduction [12]. When decision or evaluation systems contain errors, missing data, or missing attributes, neither DRSA (dominance-based rough set approach) [13] nor VC-DRSA (variable-consistency dominance-based rough set approach) [14] works appropriately. Inuiguchi et al. created a variable-precision dominance-based rough set approach (VP-DRSA) to deal with these problems [15]. Tsang et al. presented an attribute reduction model with covering rough sets based on the discernibility matrix to compute all attribute reducts [16]. Furthermore, Wang et al. developed a novel approach for constructing a simpler discernibility matrix with covering rough sets, improving some earlier characterizations of attribute reduction [17]. In addition, two further important attribute reduction models extend Pawlak's rough sets: the neighborhood rough set (NRS) model [18] and the fuzzy rough set model [19]. They can tackle continuous numeric data and fuzzy information granulation, and they allow some flexibility in determining which objects should be included in a rough set [20].

The second category screens key factors with attribute reduction models based on statistical or econometric techniques. In order to obtain preference information from the decision maker in multiobjective search, Zitzler and Künzli defined an optimization goal in terms of a binary performance measure and selected key information directly using this measure [21]. Polat and Krmac screened the most important attributes using a pairwise Fisher score attribute reduction approach (PFSAR) and correlation-based attribute reduction [22]. Ju and Sohn developed a technology attribute reduction model that uses logistic regression based on exploratory factor analysis (EFA) of 16 technology-related attributes [23]. Elliott et al. developed a model based on a double hidden Markov model (DHMM) to extract information about the "true" credit qualities of firms [24]. Shi et al. created an indicator extraction model based on Pearson correlation analysis and logistic regression significance discrimination for customer classification; the proposed approach ensured that the reserved indicators can effectively distinguish default customers from nondefault customers [25].

In addition, there are other attribute reduction methods, such as the concept lattice model, heuristic algorithms, and the ant colony optimization algorithm. Some researchers have developed new attribute reduction models using concept lattice classification theory [26–28]. Wei et al. discussed attribute reduction in information systems by establishing three equivalence relations on the attribute set and its power set [29]. In data analysis and machine learning studies, most existing attribute reduction work has focused on improving classification accuracy while neglecting the problem of how to decrease the test cost; Min et al. proposed a heuristic algorithm to handle this problem in attribute reduction [30]. Chi et al. created an indicator screening model based on correlation analysis and component analysis [31]. Minimal test cost attribute reduction is very important in cost-sensitive machine learning, but in many cases heuristic algorithms cannot find the optimal solution. To deal with this problem, Xu et al. established an ant colony optimization algorithm for attribute reduction; experimental results on UCI data sets showed that it outperforms the information gain-based approach [32]. Following the principles of eliminating redundant information and maximizing information content, Shi and Chi proposed an attribute reduction model combining cluster analysis and the coefficient of variation [33]. Because people are interested in the maximal rules implicated in attribute reduction, Li et al. developed two new kinds of attribute reduction approaches in the decision formal context based on maximal rules [34].

The existing findings offer important references for reducing attributes, but they still have limitations. First of all, in the evaluation of complex systems, the aim of attribute reduction is to eliminate the factors that do not have a significant effect on the comprehensive evaluation results. However, the existing attribute reduction approaches have not established a comprehensive index (i.e., the comprehensive score vector *y*) that reflects the characteristics of all attributes. This means that the existing approaches have not modeled the relationship between the attributes and the comprehensive index (i.e., the comprehensive evaluation result), so some reserved attributes have no significant effect on the comprehensive evaluation result. Secondly, most existing attribute reduction approaches judge performance by the standard of saving computational time; this standard does not analyze the information contribution of the reserved attributes relative to the mass-election attributes. Thirdly, most existing research verifies the applicability of the proposed attribute reduction methods using numerical simulation rather than actual data.

To address these shortcomings, this study creates a novel attribute reduction model to screen the key influencing factors. We advance in three aspects. First, this paper establishes an attribute reduction approach combining Pearson correlation analysis with F-test significance discrimination. Pearson correlation analysis is applied to calculate the correlation among attributes and delete similar attributes; F-test significance discrimination is used to select the key attributes that have the greatest influence on the comprehensive index *y*. Second, we define an information contribution ratio to assess this attribute reduction approach from a statistical viewpoint. Third, the proposed approach is verified using the financing ability evaluation data of 713 small enterprises of a city commercial bank in China. Empirical evidence shows that the selected attributes reflect 94.7% of the original information with 27.54% of the original attributes. Furthermore, this paper selects 19 key influencing factors for assessing the financing ability of small enterprises.

The remainder of this paper is organized as follows. Section 2 introduces the design and methodology of this study. Section 3 presents the data and empirical analysis of our attribute reduction model for 713 small enterprises. Section 4 concludes and highlights the future research directions of this paper.

#### 2. Design and Methodology of the Study

In this section, we introduce a novel attribute reduction model combining Pearson correlation analysis with an F-test significance discrimination approach. First, in order to eliminate the influence of differences in attribute units and dimensions on attribute reduction, the original data are transformed into real numbers within the interval [0, 1]. Second, we utilize Pearson correlation analysis to delete highly correlated attributes from the whole mass-election attribute set, avoiding repeated information. Third, an F-test significance discrimination approach is created to select the attributes with the highest information content, which ensures that each selected attribute has a significant influence on small enterprise financing performance. Step-by-step instructions follow.

##### 2.1. Standardization of Attribute Data

In our attribute reduction model, the first step is standardization of the attribute data, so that subsequent calculations and parameters use the same standard. According to their features, attributes can be divided into two types: quantitative and qualitative. Quantitative attributes include positive attributes, negative attributes, and interval attributes. Positive attributes are those for which greater values indicate better small enterprise financing capacity. Negative attributes are those for which smaller values indicate better financing capacity. Interval attributes are reasonable only when the original data fall within a certain range.

The standardization equations for positive attributes, negative attributes, and interval attributes are given by (1), (2), and (3), respectively [35]:

$$x_{ij} = \frac{v_{ij} - \min_{1 \le i \le n} v_{ij}}{\max_{1 \le i \le n} v_{ij} - \min_{1 \le i \le n} v_{ij}}, \tag{1}$$

$$x_{ij} = \frac{\max_{1 \le i \le n} v_{ij} - v_{ij}}{\max_{1 \le i \le n} v_{ij} - \min_{1 \le i \le n} v_{ij}}, \tag{2}$$

$$x_{ij} = \begin{cases} 1 - \dfrac{q_1 - v_{ij}}{\max\left(q_1 - \min_i v_{ij},\, \max_i v_{ij} - q_2\right)}, & v_{ij} < q_1, \\[2mm] 1, & q_1 \le v_{ij} \le q_2, \\[2mm] 1 - \dfrac{v_{ij} - q_2}{\max\left(q_1 - \min_i v_{ij},\, \max_i v_{ij} - q_2\right)}, & v_{ij} > q_2, \end{cases} \tag{3}$$

where $x_{ij}$ is the standardized score of the $i$th small enterprise on the $j$th attribute, $v_{ij}$ is the original value of the $i$th small enterprise on the $j$th attribute, $n$ is the number of small enterprises, $q_1$ is the left boundary of the ideal interval, and $q_2$ is the right boundary of the ideal interval.

Qualitative attributes are those whose values are described by text rather than a numerical value. The standard scores of qualitative attributes can be obtained by rational analysis and expert investigation.
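As an illustration, the three standardization rules can be sketched in Python. This is a minimal sketch assuming (1)–(3) take the common min–max form; the function names and toy data are illustrative, not from the bank data set.

```python
import numpy as np

def standardize_positive(v):
    # eq. (1): larger raw values map to scores closer to 1
    return (v - v.min()) / (v.max() - v.min())

def standardize_negative(v):
    # eq. (2): smaller raw values map to scores closer to 1
    return (v.max() - v) / (v.max() - v.min())

def standardize_interval(v, q1, q2):
    # eq. (3): score 1 inside the ideal interval [q1, q2],
    # decaying linearly toward 0 outside it
    span = max(q1 - v.min(), v.max() - q2)
    x = np.ones_like(v, dtype=float)
    below, above = v < q1, v > q2
    x[below] = 1 - (q1 - v[below]) / span
    x[above] = 1 - (v[above] - q2) / span
    return x

# toy example: asset-liability ratio treated as a negative attribute
ratios = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
print(standardize_negative(ratios))  # -> 1, 0.75, 0.5, 0.25, 0
```

All three functions return scores in [0, 1], so attributes with different units become directly comparable.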

##### 2.2. Pearson Correlation Coefficients

The Pearson product-moment correlation coefficient was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s [36]. It is a measure of the linear correlation (dependence) between two random variables, also called the PPMCC, PCC, or Pearson's *r*. Historically, it is the first formal measure of correlation, and it is still one of the most widely used measures of relationship.

The Pearson correlation coefficient of two attributes $X$ and $Y$ is defined as the covariance of the two variables divided by the product of their standard deviations. It is commonly represented by the letter *r* and can be equivalently defined by [37]

$$r = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}}, \tag{4}$$

where $\bar{x}$ and $\bar{y}$ are the means of $X$ and $Y$, respectively. Equation (4) is applied to calculate the correlation between the two variables $X$ and $Y$. The coefficient ranges from −1 to 1 and is invariant under separate changes of location and scale of either variable. A value of 1 indicates a total positive correlation between $X$ and $Y$, a value of 0 implies no correlation, and a value of −1 indicates a total negative correlation.

Some authors have offered guidelines for interpreting the Pearson correlation coefficient [38–41]. If the Pearson correlation coefficient of two attributes is greater than 0.8 [40, 41], we can conclude that the attributes are informationally redundant, and one of them should be removed. Conversely, if the coefficient is smaller than 0.8, the attributes are not redundant and both should be kept.
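The 0.8 redundancy rule can be sketched as a greedy pairwise filter. This is an illustration only; the function name `drop_redundant`, the tie-breaking choice of keeping the earlier attribute, and the synthetic data are our assumptions.

```python
import numpy as np

def drop_redundant(X, names, threshold=0.8):
    """Drop one attribute of every pair whose |Pearson r| exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)   # attribute-by-attribute correlations
    removed = set()
    for j in range(len(names)):
        if j in removed:
            continue
        for k in range(j + 1, len(names)):
            # keep the earlier attribute, drop the later redundant one
            if k not in removed and abs(corr[j, k]) > threshold:
                removed.add(k)
    return [names[j] for j in range(len(names)) if j not in removed]

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = a * 2 + rng.normal(scale=0.01, size=100)   # near-duplicate of a
c = rng.normal(size=100)                        # independent attribute
X = np.column_stack([a, b, c])
print(drop_redundant(X, ["a", "b", "c"]))       # -> ['a', 'c']
```

In practice the choice of which attribute of a redundant pair to drop can also be guided by expert judgment rather than position.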

##### 2.3. Attribute Reduction Model

In our attribute reduction model, the third step is to select the key attributes with the greatest influence on the comprehensive index *y* and to delete the uncorrelated attributes. We first calculate the attribute weights using the entropy weight approach. Then, we obtain the financing ability evaluation score (i.e., the comprehensive index *y*) for every small enterprise. Subsequently, the multiple determination coefficient between *y* and all of the attributes is obtained, and the multiple determination coefficient between *y* and the remaining attributes after removing an attribute is calculated. By using F-test significance discrimination, the key attributes with the greatest influence on small enterprise financing ability evaluation are selected. The reduction idea is that the bigger the difference between the two multiple determination coefficients, the more significant the removed attribute is to the comprehensive evaluation result. This makes up for the deficiency that existing attribute reduction approaches cannot reflect the influence of attributes on the comprehensive index *y*, because their reduction processes have nothing to do with *y*.

###### 2.3.1. Weighting Attributes Utilizing Entropy Weight Method

Let $w_j$ denote the weight of the $j$th attribute, let $x_{ij}$ denote the standard score of the $i$th small enterprise on the $j$th attribute, let $n$ denote the number of small enterprises, and let $m$ denote the number of attributes.

The subordinate degree function of the $j$th attribute is given by

$$f_{ij} = \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}}. \tag{5}$$

Then, the entropy of the $j$th attribute can be calculated with

$$e_j = -\frac{1}{\ln n}\sum_{i=1}^{n} f_{ij}\ln f_{ij}. \tag{6}$$

And then, the entropy weight of the $j$th attribute is [42]

$$w_j = \frac{1 - e_j}{\sum_{j=1}^{m}\left(1 - e_j\right)}, \tag{7}$$

where $0 \le w_j \le 1$ and $\sum_{j=1}^{m} w_j = 1$.
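The entropy weighting can be sketched in a few lines, assuming the standard entropy weight formulas $f_{ij} = x_{ij}/\sum_i x_{ij}$, $e_j = -(1/\ln n)\sum_i f_{ij}\ln f_{ij}$, and $w_j = (1 - e_j)/\sum_j (1 - e_j)$; the helper name and toy matrix are ours.

```python
import numpy as np

def entropy_weights(X):
    """Entropy weights for an n-by-m matrix of standardized scores in [0, 1]."""
    n, m = X.shape
    f = X / X.sum(axis=0)                       # subordinate degree f_ij
    # convention: f * ln(f) is taken as 0 when f = 0
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(f > 0, f * np.log(f), 0.0)
    e = -plogp.sum(axis=0) / np.log(n)          # entropy e_j
    return (1 - e) / (1 - e).sum()              # entropy weight w_j

X = np.array([[0.9, 0.5],
              [0.1, 0.5],
              [0.5, 0.5]])
w = entropy_weights(X)
print(w)   # the constant second column carries no information: w ~ [1, 0]
```

Attributes whose scores vary little across enterprises have entropy close to 1 and therefore receive weight close to 0, which matches the intuition that uninformative attributes should not drive the evaluation.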

###### 2.3.2. Reducing Attributes Based on Test Significance Discrimination

After eliminating redundant information in Section 2.2, this section selects the key attributes with the greatest influence on the comprehensive index *y* using the F-test significance discrimination approach. We now outline the steps to build an attribute reduction model based on F-test significance discrimination.

*Step 1.* Calculate the comprehensive index *y*. Let $y_i$ denote the comprehensive index, or comprehensive score, of the $i$th small enterprise for financing ability evaluation. We have

$$y_i = \sum_{j=1}^{m} w_j x_{ij}. \tag{8}$$

The meanings of the rest of the variables in (8) are the same as those in (1) and (7).
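The comprehensive index in (8) is simply a weighted sum of the standardized scores. A sketch with hypothetical numbers (not from the bank data set):

```python
import numpy as np

# hypothetical standardized scores (3 firms x 2 attributes) and entropy weights
X = np.array([[0.9, 0.2],
              [0.4, 0.8],
              [0.1, 0.5]])
w = np.array([0.6, 0.4])
y = X @ w            # y_i = sum_j w_j * x_ij, eq. (8)
print(y)             # -> 0.62, 0.56, 0.26
```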

*Step 2.* Calculate the Pearson correlation coefficients between each attribute $x_j$ and the comprehensive index *y*. Assume the attributes $x_1, x_2, \dots, x_m$ are ranked in descending order of the absolute value of their correlation coefficients.

*Step 3.* Calculate the multiple determination coefficient $R_1^2$ between the comprehensive index *y* and the remaining attributes $x_2, x_3, \dots, x_m$ after removing the first attribute $x_1$, which has the biggest correlation coefficient absolute value.

Let $\beta_0, \beta_2, \beta_3, \dots, \beta_m$ denote the estimated parameters, let $x_2, x_3, \dots, x_m$ denote the attributes, and let $\varepsilon$ denote the random error term. The regression function is given by

$$y = \beta_0 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_m x_m + \varepsilon. \tag{9}$$

In (9), the estimated values of the parameters can be obtained using the least squares estimation method. Furthermore, the estimated value vector $\hat{y}$ of the comprehensive index *y* can be calculated. Then, we have [43]

$$R_1^2 = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}, \tag{10}$$

where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $n$ denotes the number of small enterprises.
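The multiple determination coefficient can be computed directly with least squares. This is a sketch; `r_squared` is our helper name, and the synthetic data are illustrative.

```python
import numpy as np

def r_squared(y, X):
    """R^2 of regressing y on the columns of X (intercept included)."""
    A = np.column_stack([np.ones(len(y)), X])       # add intercept beta_0
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)    # least squares estimates
    y_hat = A @ beta
    # R^2 = explained sum of squares / total sum of squares, as in eq. (10)
    return ((y_hat - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.01, size=50)
print(r_squared(y, X))   # close to 1: the attributes explain y almost fully
```

Dropping a column of `X` and recomputing gives the reduced coefficient needed in the following steps.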

It should be pointed out that the attribute $x_1$ should be reserved in attribute reduction, because it has the maximum pertinency with the comprehensive evaluation results. This also indicates that the attribute $x_1$ has the biggest impact on small enterprise financing ability evaluation.

*Step 4.* Calculate the multiple determination coefficient $R_2^2$ between the comprehensive index *y* and the remaining attributes $x_3, x_4, \dots, x_m$ after removing the first attribute $x_1$, with the biggest correlation coefficient absolute value, and the second attribute $x_2$, with the second biggest.

Let $\beta_0, \beta_3, \dots, \beta_m$ denote the estimated parameters, let $x_3, x_4, \dots, x_m$ denote the attributes, and let $\varepsilon$ denote the random error term. The regression function is as follows:

$$y = \beta_0 + \beta_3 x_3 + \beta_4 x_4 + \dots + \beta_m x_m + \varepsilon. \tag{11}$$

In the same way, we can calculate the estimated value vector $\hat{y}$ of the comprehensive index for (11). The multiple determination coefficient is then given by

$$R_2^2 = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}. \tag{12}$$

*Step 5.* Calculate $\Delta R_2^2$. Let $\Delta R_2^2$ denote the difference between the multiple determination coefficients $R_1^2$ and $R_2^2$; namely,

$$\Delta R_2^2 = R_1^2 - R_2^2. \tag{13}$$

In (13), the difference $\Delta R_2^2$ reflects the influence of the attribute $x_2$ on the comprehensive index *y*. If $\Delta R_2^2$ differs from zero significantly, the attribute $x_2$ affects the comprehensive evaluation result significantly, and therefore $x_2$ should be reserved. On the contrary, if $\Delta R_2^2$ does not differ from zero significantly, then $R_1^2 \approx R_2^2$, which indicates that the attribute $x_2$ has no significant effect on the comprehensive evaluation result *y*, and $x_2$ should be deleted.

*Step 6.* Reduce attributes by establishing F-test significance discrimination.

Hypothesis $H_0$: $\Delta R_j^2 = 0$; $H_1$: $\Delta R_j^2 \neq 0$.

Let $F_j$ denote the F-test value of the attribute $x_j$; we have [44]

$$F_j = \frac{\Delta R_j^2}{\left(1 - R_{j-1}^2\right)/\left(n - m + j - 2\right)}. \tag{14}$$

We can understand (14) from three aspects. Firstly, the bigger the multiple determination coefficient $R_{j-1}^2$ is, the smaller the deviation between the estimated value $\hat{y}$ and the actual comprehensive index *y*; the smaller $R_j^2$ is, the bigger the deviation between the estimated value and the actual comprehensive index after removing the attribute $x_j$. That is to say, when we remove the attribute $x_j$, the explanatory ability of the remaining attributes with respect to the comprehensive evaluation score *y* decreases significantly. This indicates that the attribute $x_j$ has a significant effect on the comprehensive evaluation result of small enterprises; thus $x_j$ should be reserved.

Secondly, the bigger the difference between the multiple determination coefficients $R_{j-1}^2$ and $R_j^2$ is, the bigger the difference between the explanatory ability of the attributes before and after removing $x_j$ with respect to the comprehensive evaluation score *y*. This means the attribute $x_j$ affects the comprehensive evaluation result of small enterprises significantly, and $x_j$ should not be deleted.

Thirdly, the bigger the difference $\Delta R_j^2$ is, the bigger the test value $F_j$. In this situation, the F test is passed easily, which also expresses that the attribute $x_j$ affects the comprehensive evaluation result significantly.

Under hypothesis $H_0$, $F_j$ follows an F distribution; that is to say, $F_j \sim F(1, n - m + j - 2)$. Let the significance level $\alpha$ be 0.05 [45]; the critical value $F_{0.05}$ can be obtained from F statistics tables. If $F_j > F_{0.05}$, accept hypothesis $H_1$: $\Delta R_j^2$ differs from zero significantly, and the attribute $x_j$ should be reserved. Conversely, if $F_j \le F_{0.05}$, reject hypothesis $H_1$, which indicates that $\Delta R_j^2$ does not differ from zero significantly, and the attribute $x_j$ should be deleted.

*Step 7.* Repeat *Step 3* to *Step 6*, and select the other attributes.

For the rest of the attributes, we can reduce attributes by repeating Step 3 to Step 6, until the first attribute $x_j$ whose test value satisfies $F_j \le F_{0.05}$ is found. At this point, the attribute reduction stops, which suggests that the rest of the attributes have no significant influence on the comprehensive evaluation result *y*.
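Steps 2–7 can be sketched as a single loop. The exact form of the F statistic in (14) follows the paper; here we assume the standard partial-F comparison of nested regressions and a hardcoded large-sample critical value $F_{0.05}(1, \infty) \approx 3.84$, so this is an illustration under stated assumptions rather than an exact reimplementation.

```python
import numpy as np

def r_squared(y, X):
    # R^2 of regressing y on the columns of X (intercept included)
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ beta
    return ((y_hat - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def reduce_attributes(X, y, names, f_crit=3.84):
    """Rank attributes by |r(x_j, y)|, then keep removing the next one
    while the drop in R^2 (Delta R_j^2) is significant under the F test."""
    n, m = X.shape
    order = np.argsort(-np.abs(np.array(
        [np.corrcoef(X[:, j], y)[0, 1] for j in range(m)])))
    kept = [order[0]]                                   # x_1 is always reserved
    r2_prev = r_squared(y, np.delete(X, order[:1], axis=1))
    for k in range(1, m):
        r2_k = r_squared(y, np.delete(X, order[:k + 1], axis=1))
        delta = r2_prev - r2_k                          # Delta R_j^2, eq. (13)
        p = m - k                                       # regressors in larger model
        f_j = delta / ((1 - r2_prev) / (n - p - 1))     # assumed partial-F form
        if f_j <= f_crit:                               # not significant: stop
            break
        kept.append(order[k])
        r2_prev = r2_k
    return [names[j] for j in kept]

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.05, size=200)
print(reduce_attributes(X, y, ["x1", "x2", "x3", "x4"]))
```

On this synthetic data the two informative attributes `x1` and `x2` are reserved; an irrelevant attribute is dropped as soon as its $\Delta R^2$ fails the test.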

##### 2.4. The Judgment of Reasonability of the Proposed Attribute Reduction Approach

Following the idea that the multiple determination coefficient describes the explanatory ability of the independent variables with respect to the dependent variable, this paper uses an information contribution ratio to assess the performance of the attribute reduction model. The information contribution ratio is defined as the ratio of the explanatory ability of the reserved attributes with respect to the comprehensive evaluation score *y* to the explanatory ability of the mass-election attributes with respect to *y*.

Let $\gamma$ denote the information contribution ratio of the reserved attributes to the mass-election attributes, let $R_r^2$ denote the multiple determination coefficient of the reserved attributes with respect to the comprehensive evaluation score *y*, and let $R_a^2$ denote the multiple determination coefficient of the mass-election attributes with respect to *y*. The information contribution ratio of the reserved attributes to the mass-election attributes is given by

$$\gamma = \frac{R_r^2}{R_a^2}. \tag{15}$$

Equation (15) is applied to judge the reasonability of the proposed attribute reduction model. The numerator reflects the explanatory ability of the reserved attributes with respect to the comprehensive evaluation score *y*, and the denominator illustrates the explanatory ability of the mass-election attributes with respect to *y*. Their ratio reveals the information contribution degree of the reserved attributes relative to the mass-election attributes.

As a decision criterion, the proposed attribute reduction approach is considered reasonable if the reserved attributes contribute more than 90% of the information of the mass-election attributes while using less than 30% of the attributes in the mass-election attribute set.
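The criterion can be checked numerically. In this sketch the reserved subset is chosen by hand on synthetic data, and `r_squared` is our helper name; with real data the subset would come from the reduction procedure above.

```python
import numpy as np

def r_squared(y, X):
    # R^2 of regressing y on the columns of X (intercept included)
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ beta
    return ((y_hat - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(3)
X_all = rng.normal(size=(300, 10))                  # mass-election attributes
y = X_all[:, :3] @ np.array([0.6, 0.3, 0.1]) + rng.normal(scale=0.1, size=300)
X_kept = X_all[:, :3]                               # hypothetical reserved subset
gamma = r_squared(y, X_kept) / r_squared(y, X_all)  # eq. (15)
print(gamma)   # close to 1: the reserved 30% of attributes keep most information
```

Since the reserved attributes are a subset of the mass-election attributes, $R_r^2 \le R_a^2$ and hence $\gamma \le 1$; values above 0.9 with fewer than 30% of the attributes satisfy the criterion.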

#### 3. Empirical Study

##### 3.1. Sample Selection and Data Sources

In view of the research purpose of verifying the applicability of the proposed attribute reduction model, this subsection carries out an empirical study based on the financing ability data of 713 small enterprises. In order to guarantee the representativeness of the empirical results, this paper collected data from the headquarters and all of the branches of a city commercial bank in China, including the Beijing, Tianjin, Shanghai, Chongqing, Shenyang, Dalian, and Dandong Branches. The data are shown in Column 5 to Column 717 of Table 1 [46].