Abstract
Measurement errors that follow a normal distribution are ubiquitous in applications. Generally, a smaller measurement error requires a better instrument and a higher test cost. In decision making, we select an attribute subset with appropriate measurement errors to minimize the total test cost. Recently, the error-range-based covering rough set with uniformly distributed errors was proposed to investigate this issue. However, in most applications measurement errors follow a normal distribution rather than a uniform distribution, which is too simple a model. In this paper, we introduce normal distribution measurement errors into the covering-based rough set model and deal with the test-cost-sensitive attribute reduction problem in this new model. The major contributions of this paper are fourfold. First, we build a new data model based on normal distribution measurement errors. Second, the covering-based rough set model with measurement errors is constructed through the “3-sigma” rule of the normal distribution. With this model, coverings are constructed from data rather than assigned by users. Third, the test-cost-sensitive attribute reduction problem is redefined on this covering-based rough set. Fourth, a heuristic algorithm is proposed to deal with this problem. The experimental results show that the algorithm is more effective and efficient than the existing one. This study suggests new research trends concerning cost-sensitive learning.
1. Introduction
The measurement error is the difference between a measurement value and its true value. It can come from the measuring instrument, from the item being measured, from the environment, from the operator, and from other sources [1]. As a plausible distribution for measurement errors, the normal distribution was put forward by Gauss in 1809. In fact, normal distribution is found to be applicable over almost the whole of science and engineering measurement. In data mining applications, the data model based on measurement errors is an important form of uncertain data (see, e.g., [2–4]).
Test costs refer to time, money, or other resources spent in obtaining data items related to some object [5–10]. There are a number of measurement methods with different test costs to obtain a data item. Generally, higher test cost is required to obtain data with smaller measurement error. In data mining applications, we will select an attribute subset with appropriate measurement error to minimize the total test cost and at the same time preserve necessary information of the original decision system.
An attribute reduct is a subset of attributes that are jointly sufficient and individually necessary for preserving a particular property of the given information table [11]. It is a key problem of rough set theory and has attracted much attention in recent years (see, e.g., [12–16]). As a generalization of attribute reduction, test-cost-sensitive attribute reduction [6] focuses on selecting a set of tests to satisfy a minimal test cost criterion.
Recently, the error-range-based covering rough set [4] was introduced to address error ranges. This theory is based on both the covering-based rough set [17–23] and the neighborhood rough set [24–28]. Moreover, in the new theory, the test-cost-sensitive attribute reduction problem deals with numeric data instead of nominal data. Therefore, the problem is more challenging than the one defined in [6]. However, the error-range-based covering rough set considers only uniformly distributed errors, which are rather unrealistic.
In this paper, we introduce the normal distribution to build a new model of covering-based rough set that addresses normal distribution measurement errors (NDME) according to the “3-sigma” rule. The major contributions of this paper are fourfold. First, we introduce the normal distribution to build a new data model based on measurement errors. The error range is computed according to the values of attributes instead of being fixed across different datasets. Second, we build the computational model, namely, the covering-based rough set with normal distribution measurement errors. Third, the minimal test cost attribute reduction problem is redefined in the new model. Fourth, we propose a heuristic algorithm to address the reduction problem. Specifically, a weighted heuristic reduction algorithm is designed, where attribute significance is adjusted by weighted test cost.
Ten open datasets from the University of California, Irvine (UCI) library are employed to study the performance and effectiveness of our algorithm. We adopt three measures to evaluate the performance of the reduction algorithms from a statistical viewpoint. Experiments undertaken with the open source software cost-sensitive rough sets (Coser) [29] validate the performance of this algorithm. Experimental results show that our algorithm can generate a minimal test cost reduct in most cases. At the same time, the proposed algorithm achieves better performance and efficiency than the existing one [4].
The rest of the paper is organized as follows: Section 2 presents the data models with measurement errors and test costs, respectively. Section 3 describes the computational model, namely, the covering-based rough set model with normal distribution measurement errors. The minimal test cost reduct problem under the new model is also defined in this section. Next, Section 4 presents a weighted heuristic reduction algorithm and a competition approach. Experimental results and a comparison with the existing work are discussed in Section 5. Finally, conclusions are drawn in Section 6.
2. Data Models
This section presents data models. First, we propose a decision system with normal distribution measurement errors, which is also called NEDS for brevity. Then, we introduce test costs to NEDS and define test-cost-sensitive decision systems with NDME.
2.1. Normal Distribution
The normal distribution is an important distribution in science and engineering measurement. It can be described by the probability density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)},$$
where the parameter $\mu$ is the mean, which gives the location of the distribution, and the parameter $\sigma^2$ is the variance, which gives the scale of the distribution.
The cumulative distribution function (CDF) of a random variable $X$ gives the probability of a value less than or equal to some value $x$:
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt,$$
where the right-hand side represents the probability that the random variable $X$ takes on a value less than or equal to $x$. The standard normal distribution appears with $\mu = 0$ and $\sigma = 1$. The equation becomes
$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\,dt.$$
For a normal distribution, about 68.27% of the values lie within one standard deviation of the mean, about 95.45% lie within two standard deviations, and nearly all values (99.73%) lie within three standard deviations of the mean; this is the “3-sigma” rule [30]. We use the following example to explain the relationship between standard deviation and confidence interval.
Example 1. Let the standard deviation be 0.01 and let the mean be 0; then about 99.73% of the measurement errors lie between −0.03 and +0.03.
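The proportions quoted by the “3-sigma” rule and the interval of Example 1 can be checked numerically. The following is a minimal sketch using only the standard library; the function names `coverage_within` and `error_interval` are illustrative and do not come from the paper.

```python
import math


def coverage_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


def error_interval(mean, sigma, k=3.0):
    """Interval [mean - k*sigma, mean + k*sigma] used as the error range."""
    return (mean - k * sigma, mean + k * sigma)


for k in (1, 2, 3):
    # Prints 68.27, 95.45 and 99.73 (percent) for k = 1, 2, 3.
    print(k, round(100 * coverage_within(k), 2))

# Example 1: mean 0 and standard deviation 0.01 give roughly (-0.03, +0.03).
print(error_interval(0.0, 0.01))
```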
2.2. Decision Systems with Measurement Errors
We introduce normal distribution measurement errors into our model [31] to make the model more realistic.
Definition 2. A decision system with normal distribution measurement errors (NEDS) is the 6-tuple: where is the nonempty set called a universe and and are the nonempty sets of variables called conditional attributes and decision attributes, respectively. is the set of values for each , and is an information function for each . is the maximum value of measurement error. and are the upper confidence limit (UCL) and the lower confidence limit (LCL) of , respectively.
Definition 3. Letting be a NEDS, the error range of attribute is defined as where is a user-specified parameter, is the th instance value of , , and is the number of instances. The precision of can be adjusted through setting.
From Definition 3 we can see that the decision system with normal distribution measurement errors is a generalization of the decision system and the decision system with error range (DSER) (see, e.g., [4]). If all attributes are error free, that is , a NEDS degrades to a DS. If the error range is a fixed value, that is , a NEDS degrades to a DSER.
We now describe how to deal with abnormal values of measurement error. In applications, if a repeated measurement satisfies $|x_i - \bar{x}| > 3\sigma$, then $x_i$ is considered an abnormal value and is rejected, where $x_i$ is the $i$th measurement value, $\bar{x}$ is the mean of all measurement values, and $\sigma$ is their standard deviation. This is the Pauta criterion of measurement error theory.
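The Pauta criterion can be illustrated with a short sketch. It assumes that the standard deviation is estimated from the repeated measurements themselves (the sample standard deviation); that choice and the function name `pauta_filter` are assumptions made for illustration.

```python
import statistics


def pauta_filter(values, k=3.0):
    """Split repeated measurements into kept and rejected values.

    A value is rejected when it deviates from the mean by more than
    k standard deviations (k = 3 for the Pauta criterion).
    """
    mean = statistics.fmean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    kept = [v for v in values if abs(v - mean) <= k * sigma]
    rejected = [v for v in values if abs(v - mean) > k * sigma]
    return kept, rejected


# Nineteen readings close to 10 and one gross error.
readings = [9.9] * 10 + [10.1] * 9 + [15.0]
kept, rejected = pauta_filter(readings)
print(len(kept), rejected)
```

Because a single outlier also inflates the estimated standard deviation, the criterion only becomes effective with a reasonable number of repeated measurements.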
Now, we investigate the relationship between the limit of confidence interval and the standard deviation in the following proposition.
Proposition 4. Let and be LCL and UCL, respectively, and let be the confidence level. One has the upper limit of confidence interval where .
A measurement error that exceeds three standard deviations is an abnormal error, which needs to be identified and removed from consideration. The standard normal distribution is a special case of the normal distribution. The limit of its confidence interval is investigated in the following proposition.
Proposition 5. Let and be LCL and UCL of standard normal distribution measurement errors, respectively. One has
Proof. The standard normal distribution is given by taking mean and in a general normal distribution. , . Therefore (10) holds.
In Definition 3, a key parameter is an adjusting factor. Now we introduce it by the following proposition.
Proposition 6. Let and be LCL and UCL of , respectively. Confidence intervals are stated at the confidence level, and . According to (3), one has
According to (3) and Proposition 6, if , we have , ; if , we have , , and if , we have , .
Larger error ranges are associated with shorter reaction times than smaller ones, whereas smaller error ranges are associated with higher classification accuracies than larger ones. Generally, a smaller measurement error requires a better instrument and a higher test cost. In many applications, it is impossible or unnecessary to distinguish objects or elements with a small error range in a universe. One can adjust the size of the error range through the setting to meet different requirements.
2.3. Test-Cost-Independent Decision System with Normal Distribution Measurement Errors
We introduce test costs to the data model. Now, we discuss the new model as follows.
Definition 7. A test-cost-independent decision system with normal distribution measurement errors (TCINEDS) is the 7-tuple: where , and have the same meanings as in a NEDS, and is the test cost function. Test costs are independent of one another, that is, for any .
Note that in this model, test costs are not applicable to decision attributes.
To facilitate processing and comparison, the values of conditional attributes are normalized into the range from 0 to 1 through the linear function
$$y = \frac{x - \min}{\max - \min},$$
where $\max$ and $\min$ are the maximal and minimal values of the attribute, and $x$ and $y$ are the initial value and the normalized value, respectively.
Table 1 presents a decision system of Iris, whose conditional attributes are normalized values, where $C = \{\text{SL}, \text{SW}, \text{PL}, \text{PW}\}$ and $D = \{\text{Class}\}$.
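The linear normalization applied to each conditional attribute can be sketched as follows. This is a minimal illustration; the function name `min_max_normalize` and the handling of a constant column are assumptions, not part of the paper.

```python
def min_max_normalize(column):
    """Map the values of one attribute linearly onto the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant attribute: every value maps to 0 (illustrative choice).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# A few raw sepal-length-like values, used only as sample input.
print(min_max_normalize([4.3, 5.8, 7.9]))
```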
3. Covering-Based Rough Set with Normal Distribution Measurement Errors
Rough set theory is a powerful tool for dealing with uncertain knowledge in information systems [32]. It has been successfully applied to feature selection [33, 34], rule extraction [35–37], uncertainty reasoning [38, 39], decision evaluation [40, 41], granular computing [42–45], and so forth. Recently, the covering-based rough set has attracted much research interest with significant achievements in both theory and application.
The concept of neighborhood (see, e.g., [46–48]) has been applied to define different types of covering-based rough set [16–18]. From different viewpoints, a neighborhood is called an information granule or a covering element. Figure 1 illustrates the neighborhoods of in a two-dimensional real space [25]. For this type of neighborhood rough set model, the distance parameter is a user-specified parameter. Objects with a distance less than are viewed as neighbors. Recently, a new neighborhood was defined in [4]. The size of the neighborhood depends on error ranges of tests, and more objects fall into the neighborhood of for wider error ranges. Figure 2 illustrates this two-dimensional neighborhood.
In this section, we introduce normal distribution measurement errors into the covering-based rough set. The new model is called the covering-based rough set with normal distribution measurement errors. As mentioned earlier, if we do not consider errors, this model degenerates to the classical decision system. Therefore, this model is a natural extension of the classical one. The test-cost-sensitive attribute reduction problem on the covering-based rough set model with NDME is also proposed in this section.
3.1. Covering-Based Rough Set with Normal Distribution Measurement Errors
According to the “3-sigma” rule, we present a new model considering both error distribution and confidence interval. According to Definition 2, we have defined a neighborhood in [31] as follows.
Definition 8. Let be a decision system with normal distribution measurement errors. Given and , the neighborhood of with respect to normal distribution measurement errors on attribute set is defined as where is the error boundary. It represents the error value of in .
We explain why instead of was employed in (14) as the maximal distance. Although the value of error is within a certain range, there are significant differences among confidence intervals. As mentioned earlier, the “3-sigma” rule states that, for a normal distribution, different proportions of values lie within different numbers of standard deviations of the mean. In particular, the proportion is very close to 0 for data more than three standard deviations from the mean.
Therefore, measurement errors with no more than a difference of should be viewed as the family of neighborhood granules. Naturally, the size of the neighborhood depends on error ranges of tests and adjusting factor. Figure 3 shows the different sizes of neighborhood based on , , and .
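The construction of neighborhoods from per-attribute error bounds can be sketched as below. The exact maximal distance used in (14) is not recoverable from the text here, so `max_dist` is a hypothetical mapping from each attribute to its distance bound; it would be derived from the error range and the adjusting factor. The attribute names and values are sample input only.

```python
def neighborhood(data, i, attrs, max_dist):
    """Indices of objects within max_dist[a] of object i on every attribute a."""
    xi = data[i]
    return {
        j
        for j, xj in enumerate(data)
        if all(abs(xj[a] - xi[a]) <= max_dist[a] for a in attrs)
    }


def all_neighborhoods(data, attrs, max_dist):
    """Neighborhood of every object; together these sets cover the universe."""
    return {i: neighborhood(data, i, attrs, max_dist) for i in range(len(data))}


data = [
    {"SL": 0.20, "PL": 0.10},
    {"SL": 0.22, "PL": 0.11},
    {"SL": 0.60, "PL": 0.55},
]
max_dist = {"SL": 0.03, "PL": 0.03}  # hypothetical per-attribute bounds
print(all_neighborhoods(data, ["SL", "PL"], max_dist))
```

Every object belongs to its own neighborhood, so the family of neighborhoods always forms a covering of the universe; wider bounds yield larger neighborhoods.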
In the new model, the lower and upper approximations are defined as follows.
Definition 9 (see [31]). Let be a decision system with normal distribution measurement errors, and let be a neighborhood relation induced by . For any , the lower and upper approximations of in a neighborhood approximation space are defined as
, . The boundary region of in is defined as The positive region of with respect to is defined as [49, 50].
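The approximations and the positive region can be computed directly from the neighborhoods. The sketch below uses the standard neighborhood rough set definitions (an object is in the lower approximation if its neighborhood is contained in the target set, and in the upper approximation if its neighborhood intersects it). The dictionary-based representation and the function names are assumptions for illustration.

```python
def lower_approximation(neigh, target):
    """Objects whose whole neighborhood lies inside the target set."""
    return {x for x, nx in neigh.items() if nx <= target}


def upper_approximation(neigh, target):
    """Objects whose neighborhood intersects the target set."""
    return {x for x, nx in neigh.items() if nx & target}


def positive_region(neigh, decision):
    """Union of the lower approximations of all decision classes."""
    classes = {}
    for x, d in decision.items():
        classes.setdefault(d, set()).add(x)
    pos = set()
    for members in classes.values():
        pos |= lower_approximation(neigh, members)
    return pos


neigh = {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3, 4}, 4: {3, 4}}
decision = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
class_a = {0, 1, 2}
boundary = upper_approximation(neigh, class_a) - lower_approximation(neigh, class_a)
print(positive_region(neigh, decision), boundary)
```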
3.2. Test-Cost-Sensitive Attribute Reduct Problem
Attribute reduction is a successful technique to remove redundant data and facilitate the mining task. A number of definitions of relative reducts exist [25, 38, 51, 52] for different rough set models. In this section, we define test-cost-sensitive attribute reduction on the covering-based rough set model with NDME.
The minimal test cost reduct problem proposed in [6] can be redefined as follows.
Problem 1. The minimal test cost reduct problem. Input: ; Output: ; Constraint: ; Optimization objective: .
Compared with the classical minimal reduction problem, there are several differences. The first is the input, where the test costs and measurement errors are external information. The second is the optimization objective, which is to minimize the test cost instead of the number of features. We can adopt the addition-deletion strategy [15] to design our heuristic reduction algorithm.
In order to address the constraint of the problem, we have defined an inconsistent object in [4]. Here we redefine it as follows.
Definition 10. Let be a decision system with normal distribution measurement errors, , and . In , any is called an inconsistent object if . The set of inconsistent objects in is
We can evaluate the characteristics of the neighborhood block through the number of inconsistent objects, namely, . From Definition 10 we know that given , if and only if . Consequently, the is an important parameter when we compute the positive region. Therefore, the following proposition can be used as an alternative definition of a reduct.
Proposition 11. Let be a NEDS. Any is a decision-relative reduct if and only if (1) , (2) .
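The role of inconsistent objects can be illustrated with a small sketch. It reads Definition 10 as follows: an object in the neighborhood of x is inconsistent if its decision value differs from that of x, and x belongs to the positive region exactly when its neighborhood contains no inconsistent object. The data structures and function names are assumptions for illustration.

```python
def inconsistent_objects(neigh, x, decision):
    """Objects in the neighborhood of x whose decision differs from that of x."""
    return {y for y in neigh[x] if decision[y] != decision[x]}


def total_inconsistent(neigh, decision):
    """Sum of the numbers of inconsistent objects over all neighborhoods."""
    return sum(len(inconsistent_objects(neigh, x, decision)) for x in neigh)


def in_positive_region(neigh, x, decision):
    """x is in the positive region iff its neighborhood is consistent."""
    return not inconsistent_objects(neigh, x, decision)


neigh = {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3, 4}, 4: {3, 4}}
decision = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
print(inconsistent_objects(neigh, 2, decision), total_inconsistent(neigh, decision))
```

Under this reading, an attribute subset preserves the positive region of the full attribute set when it produces no more inconsistent objects than the full set does, which is what a reduction algorithm can test in practice.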
Sometimes we are interested in minimal reduction or minimal test cost reduct (see, e.g., [6]). In this work, we focus on finding reducts with minimal test cost, that is, testcostsensitive attribute reducts. Because the TCINEDS is a natural extension of NEDS, concepts in the NEDS are also applicable to the TCINEDS. We introduce the following definition.
Definition 12. Let denote the set of all reducts of a TCINEDS . Any where is called a minimal test cost reduct.
According to this definition, we would first need to compute all reducts, which requires an exhaustive algorithm. However, finding a reduct with minimal test cost is NP-hard, so exhaustive search is infeasible for large datasets. Therefore, we propose a heuristic algorithm to deal with this problem for large datasets.
3.3. Evaluation Measures
In order to dispel the influence of subjective and objective factors, three evaluation measures are adopted to evaluate the performance of the proposed algorithm. We adopt the three measures proposed in [6] for this purpose. These are finding optimal factor (FOF), maximal exceeding factor (MEF), and average exceeding factor (AEF).
Let $K$ be the number of experiments and let $k$ be the number of experiments in which an optimal reduct is found. The finding optimal factor is both a qualitative and a quantitative measure, which is defined as
$$\mathrm{FOF} = \frac{k}{K}.$$
The maximal exceeding factor describes the worst case of an algorithm, which is defined as
$$\mathrm{MEF} = \max_{1 \le i \le K} ef(R_i),$$
where
$$ef(R) = \frac{c(R) - c(R')}{c(R')}$$
is the exceeding factor indicating the badness of a reduct, which is a quantitative measure, where $R'$ is an optimal reduct, $R$ is the searched reduct, and $c(\cdot)$ denotes the total test cost.
The average exceeding factor is defined as
$$\mathrm{AEF} = \frac{\sum_{i=1}^{K} ef(R_i)}{K},$$
which represents the overall performance of an algorithm.
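The three measures can be computed from the test costs of the reducts found by the heuristic and the costs of the corresponding optimal reducts. The sketch below assumes the definitions of [6] as commonly stated: FOF is the fraction of experiments in which an optimal reduct is found, the exceeding factor is the relative extra cost over the optimum, and MEF and AEF are its maximum and mean. The function names and sample costs are illustrative.

```python
def exceeding_factor(cost_found, cost_optimal):
    """Relative extra test cost of a found reduct over an optimal one."""
    return (cost_found - cost_optimal) / cost_optimal


def evaluate(found_costs, optimal_costs):
    """Return FOF, MEF and AEF over a series of experiments."""
    efs = [exceeding_factor(f, o) for f, o in zip(found_costs, optimal_costs)]
    runs = len(efs)
    optimal_hits = sum(1 for ef in efs if ef == 0)
    return {
        "FOF": optimal_hits / runs,
        "MEF": max(efs),
        "AEF": sum(efs) / runs,
    }


# Four hypothetical experiments: two hit the optimum, two exceed it.
print(evaluate([10, 12, 15, 20], [10, 10, 15, 16]))
```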
4. Algorithm
The test-cost-sensitive attribute reduct problem is NP-hard. Therefore, heuristic algorithms are needed to calculate possible reducts for large datasets. In order to evaluate the performance of a heuristic algorithm, we should find an optimal reduct from all reducts. Hence, exhaustive algorithms are also needed.
In this section, we mainly present a heuristic algorithm and a competition approach to deal with the new problem. The exhaustive algorithm of [4] is adopted to find all reducts of datasets. It is based on backtracking where pruning techniques are crucial in reducing computation.
4.1. The Weighted Heuristic Reduction Algorithm
To design a heuristic algorithm, we employ an algorithm framework similar to the one proposed in [6]. The algorithm follows the typical addition-deletion strategy [15] and is listed in Algorithm 1. It constructs a super-reduct and then reduces it to obtain a reduct. The algorithm is essentially different from the one in [6]. First, the input is a test-cost-independent decision system with normal distribution measurement errors, which is more general than the TCIER. Second, test results are numerical with normal distribution measurement errors rather than only nominal. The key code of this framework is listed in lines 5 and 7, and the attribute significance function is redefined to obtain the respective algorithm. The efficiency of the weighted heuristic reduction algorithm will be discussed in Section 5.4.
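The addition-deletion strategy can be sketched generically as follows. This is not a reproduction of Algorithm 1; it is a minimal illustration that assumes two hypothetical callbacks: `inconsistency(B)`, which returns the number of inconsistent objects under attribute subset `B`, and `significance(B, a)`, which scores candidate attribute `a` given the current subset `B`.

```python
def addition_deletion(attributes, inconsistency, significance):
    """Build a super-reduct greedily, then drop redundant attributes."""
    attributes = list(attributes)
    # The full attribute set defines the consistency level to preserve.
    target = inconsistency(frozenset(attributes))

    # Addition: add the most significant attribute until the target is met.
    selected = []
    while inconsistency(frozenset(selected)) > target:
        remaining = [a for a in attributes if a not in selected]
        best = max(remaining, key=lambda a: significance(frozenset(selected), a))
        selected.append(best)

    # Deletion: remove any attribute whose removal keeps the target.
    for a in list(selected):
        trial = [b for b in selected if b != a]
        if inconsistency(frozenset(trial)) <= target:
            selected = trial
    return set(selected)
```

The addition loop always terminates, because selecting every attribute reproduces the target level by construction.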

As previously mentioned, is an important parameter in evaluating the quality of a neighborhood block. Now, we introduce the following concepts.
Definition 13. Let be a NEDS, , and . is the number of inconsistent objects in neighborhood . The total number of inconsistent objects with respect to the positive region is
Finally, we propose a weighted heuristic information function: where is necessary and indispensable, and it plays a dominant role in the heuristic information, where is the test cost of and is a userspecified parameter. If , test costs are essentially not considered. If , tests with lower cost have bigger significance. Different settings can adjust the significance of test cost.
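The effect of the user-specified exponent can be illustrated with a small sketch. The exact heuristic function is not recoverable from the text here, so the form below, a consistency gain multiplied by the test cost raised to a non-positive power, is an assumption chosen only because it reproduces the two properties stated above: an exponent of 0 ignores test costs, and a negative exponent favors cheaper tests.

```python
def weighted_significance(gain, cost, lam):
    """Significance of a test: consistency gain weighted by cost ** lam.

    gain -- reduction in the number of inconsistent objects (assumed form)
    cost -- positive test cost of the attribute
    lam  -- user-specified exponent, expected to be <= 0
    """
    if cost <= 0:
        raise ValueError("test cost must be positive")
    return gain * cost ** lam


# With lam = 0 the cost has no influence; with lam = -1 the cheaper test wins.
print(weighted_significance(4, 10, 0), weighted_significance(4, 2, 0))
print(weighted_significance(4, 10, -1), weighted_significance(4, 2, -1))
```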
4.2. The Competition Approach
In order to obtain better results, the competition approach has been discussed in [6]. In the new environment, it is still valid because there is no universally optimal . In this approach, reducts compete against each other with only one winner, that is, a reduct with minimal test cost, which can be obtained using : where is the reduct obtained by Algorithm 1 using the heuristic information, with the sets of user-specified values.
This approach can improve the quality of the result significantly. The algorithm runs times with different values; hence, more runtime is needed. However, it is acceptable because the heuristic algorithm is fast.
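The competition approach amounts to running the heuristic once per parameter value and keeping the cheapest reduct. A minimal sketch follows; `run_heuristic` and `test_cost` are hypothetical callbacks standing in for Algorithm 1 and the test cost function, and the parameter values are sample input.

```python
def competition(run_heuristic, lambdas, test_cost):
    """Return the reduct with minimal test cost over all parameter values."""
    best = None
    for lam in lambdas:
        reduct = run_heuristic(lam)
        if best is None or test_cost(reduct) < test_cost(best):
            best = reduct
    return best


# Toy stand-ins: the heuristic returns a cheaper reduct for negative values.
costs = {"a": 5, "b": 2, "c": 9}
winner = competition(
    lambda lam: {"a", "c"} if lam == 0 else {"a", "b"},
    [0, -0.5, -1],
    lambda reduct: sum(costs[a] for a in reduct),
)
print(winner)
```

The runtime grows linearly with the number of parameter values, which matches the observation that the approach is affordable only because each heuristic run is fast.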
5. Experiments
5.1. Data Generation
Most datasets from the UCI library [53] have no intrinsic measurement errors and test costs. Therefore, in our experiments, we create these data to study the performance of the reduction algorithm. For example, measurement errors satisfy normal distribution and Pauta criterion. For the same object, the condition attribute with narrower error ranges should be more expensive. In this section, we will discuss the generation of both the measurement errors and test costs.
Step 1. Ten datasets from the UCI Repository of Machine Learning Databases are selected, and these datasets are listed in Table 2. Each dataset should contain exactly one decision attribute and have no missing value. To facilitate processing, we first normalize the data items into the range from 0 to 1. Missing values, if any, are then directly set to 0.5.
Step 2. We produce the number of additional tests for one particular attribute. We generate a random integer in the range and . That is, we have measurement methods to obtain values for each object. The number of tests including the additional ones for our experiments is , which is listed in Table 2.
Step 3. We produce the for each original test according to (6) and (7). The is computed according to the value of databases without any subjectivity. Three kinds of error ranges of different databases are shown in Table 3. These error ranges are maximal, minimal, and average error ranges of all attributes, respectively. The precision of can be adjusted through setting, and we set to be 0.01 in our experiments.
Step 4. We produce “new” data subject to error ranges. Letting be the original test, we can add a random number in to to produce , where . The number is generated as follows.
Letting and be uniformly distributed on (0, 1), then is a random number which has a normal distribution with mean and . From Proposition 4 we know the , and .
Since we need a random number in , we let
Finally, the error range is According to Definition 8, is the error range of the new test .
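Step 4 can be sketched as follows. Generating a normal random number from two uniform numbers on (0, 1) is done here with the Box-Muller transform, which is an assumption since the formula itself is not recoverable from the text. Keeping the error inside three standard deviations by rejection is likewise an illustrative choice; the function names are hypothetical.

```python
import math
import random


def box_muller(rng):
    """Standard normal number from two uniform numbers (Box-Muller transform)."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def bounded_error(sigma, rng, k=3.0):
    """Normal measurement error restricted to [-k*sigma, +k*sigma]."""
    while True:
        z = box_muller(rng)
        if abs(z) <= k:  # reject values beyond k standard deviations
            return sigma * z


rng = random.Random(0)
original = 0.42  # a normalized attribute value, used as sample input
noisy = original + bounded_error(0.01, rng)
print(noisy)
```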
The generated NDME with different error ranges are shown in Figure 4. The generated NDME of different databases are shown in Figure 5.
Step 5. The test costs are produced, and they are always represented by positive integers. Let be the original test and let be the last test for one particular data item. is set to a random number in . where is set to . This setting guarantees that tests with narrower error ranges are more expensive.
A dataset generated by this approach is listed in Table 4. SL stands for sepal length, SW stands for sepal width, PL stands for petal length, and PW stands for petal width. SL1, PL1, and PL2 indicate different revisions of the original data. There is only one method to obtain SW and PW.
5.2. Effectiveness of the Heuristic Algorithm
Let . The heuristic algorithm runs 800 times with different test cost settings and different setting on all datasets. Figures 6 and 7 show the results of FOF. For different settings of , the performance of the algorithm is completely different; that is, the test cost plays a key role in this heuristic algorithm. The results are incomparable to others when ; hence, these results are not included in this experiment results.
Figures 8 and 9 show the results of MEF, which provide the worst case of the algorithm, and they should be viewed as a statistical measure. Figures 10 and 11 show the results of AEF. These display the overall performance of the algorithm from a statistical perspective.
From the results we observe that the quality of the results varies for different datasets. It is related to the dataset itself because the error range and heuristic information are all computed according to the values of dataset. Then the AEF is less than 0.3 in most cases, which is acceptable. Although the results are generally acceptable, the performance of the algorithm should be improved. In Section 5.3, we will address this issue further.
5.3. Comparison of Three Approaches
For the proposed algorithm, is a key parameter. Three approaches can be obtained with different setting of . The first approach is called nonweighting approach. It is the only one without taking into account test costs, which is implemented by setting . The second approach, called the best approach, is to choose the best value as depicted in Figures 6 through 11. The third approach is the competition approach discussed in Section 4.2. All three are based on Algorithm 1 and the same databases. Now we compare the performance of the proposed algorithm through three approaches mentioned in Section 4.
Table 5 lists results for all three approaches. From Table 5, we observe the following. (1) The nonweighting approach rarely finds the optimal reduct and is unacceptable under all three metrics. (2) In most cases, the best approach obtains optimal results. However, we have no idea how to obtain the best value of in real applications. (3) The competition approach improves the quality of results significantly, and the runtime is acceptable for a relatively small number of .
5.4. Comparison with Existing Algorithm
This section describes the major improvements over the existing model [4].
First, NDME is incorporated into the data model, and a covering-based rough set based on NDME is proposed. In most cases, measurement errors follow a normal distribution instead of a uniform distribution; hence, the new model has wider application areas.
Second, compared with the fixed error ranges used for different databases in [4], the proposed error ranges are generated adaptively according to the database values. Table 3 shows the generated error ranges for different databases. The error ranges for different attributes of the same database are completely different. For example, the maximal error range of Wdbc is 0.0040, and the minimal one is 0.0006.
Third, a weighted heuristic algorithm is developed to deal with the minimal test cost reduct problem. Our algorithm is compared with the weighted algorithm [4] in terms of effectiveness and efficiency. Since the two algorithms have different parameters, we compare the results of the competition approach on ten datasets. Figure 12 shows the competition approach results of the two algorithms. From the results we observe that (1) on the Wpbc and Iono datasets, the two algorithms have the same performance; (2) the weighted algorithm has better performance than our algorithm on the Iris, Glass, and Credit datasets; (3) however, our algorithm performs better than the weighted algorithm on five datasets.
The efficiency comparison between our algorithm and the weighted algorithm of [4] is depicted in Figure 13. From the results we note that our algorithm improves the runtime. Figure 14 shows the efficiency ratios of the two algorithms.
6. Conclusions
In rough set models, measurement errors and test costs are both intrinsic to data. In this paper, we built a new covering-based rough set model considering measurement errors and test costs at four levels. (1) At the data model level, a new data model with NDME and test cost was proposed. This model has more application areas because normally distributed measurement errors are widespread. (2) At the computational model level, we introduced a covering-based rough set with NDME. This model is generally more complex than those previously presented in this field. (3) At the problem level, a minimal test cost reduct problem based on the new model was redefined. (4) At the algorithm level, a weighted heuristic algorithm was developed to deal with this reduct problem. Experimental results indicate the effectiveness and efficiency of the algorithm.
In summary, the data model based on normal distribution measurement errors has a wide application scope. This study suggests new research trends in covering-based rough sets and cost-sensitive learning.
Acknowledgments
This work is in part supported by National Science Foundation of China under Grant no. 61170128, the Natural Science Foundation of Fujian Province, China, under Grant no. 2012J01294, State Key Laboratory of Management and Control for Complex Systems Open Project under Grant no. 20110106, and Fujian Province Foundation of Higher Education under Grant no. JK2012028, and the Education Department of Fujian Province under Grant no. JA12222.