Abstract

In data mining, removing redundant attributes, known as attribute reduction (AR), and in particular finding reducts with minimal cardinality, is an important preprocessing step. This paper proposes a novel search strategy for minimal attribute reduction based on rough set theory (RST) and the fish swarm algorithm (FSA), built on a method for coding combinations (subsets) of the attribute set. The method first identifies the core attributes using the discernibility matrix; all subsets of the noncore attributes with the same cardinality are then encoded as integers, which serve as the individuals of FSA. The coding method thereby restricts the evolutionary direction of each individual. The fitness function of an individual is defined by the attribute dependency of RST, and FSA is used to find the optimal set of reducts. In each loop, if the maximum attribute dependency equals the attribute dependency of the full condition attribute set, the algorithm terminates; otherwise, one more attribute is added in the next loop. Several well-known UCI datasets were used to verify the method. The experimental results show that the proposed method finds minimal attribute reductions effectively and has good global search ability.

1. Introduction

Data mining, also known as knowledge discovery in databases, includes extracting knowledge, discovering new patterns, and predicting future trends from large amounts of data. With a growing number of applications in different fields, massive volumes of very high-dimensional data are being produced, and data mining faces a great challenge. Many datasets contain redundant attributes, which not only occupy extensive computing resources but also seriously affect the decision-making process. Removing these redundant attributes is therefore necessary for data mining [1]. Attribute reduction (AR) in rough set theory (RST) removes redundant or insignificant knowledge while keeping the classification ability of the information system unchanged. It was proposed by Pawlak and Słowiński [2]. RST is now widely used in fields such as machine learning, data mining, and knowledge discovery [3–6].

AR is one of the core problems in RST. The minimal reduction problem, in which the cardinality of the attribute subset is the smallest among all possible reductions, is an important part of it and has attracted much attention. One basic way to find the minimal reducts is to construct a discernibility function from the dataset via the discernibility matrix and simplify it [7–9]. Unfortunately, generating a minimal reduct has been shown to be NP-hard, and the run time of generating all reducts is exponential [10]. Because many kinds of NP-hard problems can be tackled by heuristic algorithms, heuristic attribute reduction has become the main research direction in AR [11].

Swarm intelligence and related evolutionary algorithms are heuristic approaches that are widely used for attribute reduction, including the genetic algorithm (GA) [12–14], particle swarm optimization (PSO) [15–18], ant colony optimization (ACO) [19, 20], and the fish swarm algorithm (FSA) [11, 21, 22]. FSA is inspired by the natural schooling behaviors of fish, such as random, swarming, following, and preying behaviors, and uses them to generate candidate solutions for optimization problems. It has a strong ability to avoid local optima and reach a global optimum [23]. For this reason, FSA has received much attention in recent years.

In this paper, a new method for coding subsets of an attribute set is proposed and, based on it, a novel strategy for minimal attribute reduction using FSA and RST. The algorithm first identifies the core attributes by the discernibility matrix. Each subset of the noncore attributes is then encoded as an integer by the proposed coding method, and an initial population is generated for FSA, which searches for the optimal set of reducts. The fitness function of a subset is defined by the attribute dependency of RST. In each loop, the coding method restricts the evolutionary direction of the individuals. If the maximum attribute dependency equals the attribute dependency of the full condition attribute set, the algorithm terminates; otherwise, one more attribute is added in the next loop. Numerical results on several benchmark datasets indicate that the proposed method is robust and economical in the number of fitness function evaluations.

The rest of the paper is organized as follows. Section 2 introduces some basic concepts of rough sets and the fish swarm algorithm. Section 3 focuses on the coding method for combinations. Section 4 proposes a novel attribute reduction algorithm based on the fish swarm algorithm and rough sets. Section 5 tests the performance of the proposed method on some well-known datasets. Finally, Section 6 concludes the paper and outlines areas of further research.

2. Background

2.1. Base Notions of Rough Set Theory

This section reviews some basic notions of rough set theory and their properties.

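A decision table can be represented as $S = (U, A, V, f)$, where $U$ is a nonempty finite set of objects; $A = C \cup D$, where $C$ is a set of condition attributes and $D$ is a decision attribute set; $V$ is the domain of the attributes belonging to $A$; and $f: U \times A \to V$ is a function assigning attribute values to objects in $U$.

For any $B \subseteq A$, there is an associated indiscernibility relation $IND(B)$:

$$IND(B) = \{(x, y) \in U \times U : f(x, a) = f(y, a), \ \forall a \in B\}. \quad (1)$$

Let $X \subseteq U$; the $B$-lower approximation of $X$ is defined as $\underline{B}X = \{x \in U : [x]_B \subseteq X\}$, where $[x]_B$ denotes the equivalence class of $IND(B)$ determined by object $x$. The notation $POS_B(D)$ refers to the $B$-positive region, given by $POS_B(D) = \bigcup_{X \in U/IND(D)} \underline{B}X$. The $B$-approximation quality with respect to the decision attribute set $D$ is defined as

$$\gamma_B(D) = \frac{|POS_B(D)|}{|U|}, \quad (2)$$

and the core attribute set is defined as

$$CORE(C) = \{a \in C : POS_{C - \{a\}}(D) \neq POS_C(D)\}. \quad (3)$$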
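To make these notions concrete, the following minimal sketch computes the positive region, the approximation quality (attribute dependency), and the core on a toy decision table. It assumes the standard Pawlak definitions; the table `rows`, the column indices, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group object indices into the equivalence classes of IND(attrs)."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(idx)
    return list(blocks.values())

def positive_region(rows, cond, dec):
    """Objects whose cond-equivalence class lies inside a single decision class."""
    pos = set()
    for block in partition(rows, cond):
        decisions = {tuple(rows[i][d] for d in dec) for i in block}
        if len(decisions) == 1:
            pos.update(block)
    return pos

def dependency(rows, cond, dec):
    """Approximation quality: |POS_cond(dec)| / |U|."""
    return len(positive_region(rows, cond, dec)) / len(rows)

def core(rows, cond, dec):
    """Attributes whose removal changes the positive region."""
    full = positive_region(rows, cond, dec)
    return [a for a in cond
            if positive_region(rows, [c for c in cond if c != a], dec) != full]

# Toy decision table (assumption): columns 0-2 are condition attributes,
# column 3 is the decision attribute.
rows = [
    (0, 0, 0, 0),
    (0, 1, 0, 1),
    (1, 0, 0, 1),
    (1, 1, 1, 0),
]
print(dependency(rows, [0, 1, 2], [3]))  # dependency of the full condition set
print(core(rows, [0, 1, 2], [3]))        # attributes that cannot be dropped
```

In this toy table, attributes 0 and 1 together already discern all objects, so attribute 2 is redundant and does not belong to the core.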

2.2. The Principle of FSA

FSA is a bionic optimization algorithm that simulates fish swarm behaviors such as preying, swarming, and following, and records the maximum fitness value found so far on a bulletin board. In FSA, let $N$ be the population size. The artificial fish (AF) are generated by a random function; each AF is represented by its position vector $X_i$ in the search space, and $X_i^{next}$ is the updated value of $X_i$. The food satisfaction of $X_i$ is represented by the fitness function value $Y_i = f(X_i)$. The Euclidean distance $d_{ij} = \|X_i - X_j\|$ describes the relationship between $X_i$ and $X_j$. Other parameters include $step$ (the maximum step length), $visual$ (the visual distance of a fish), a random number $rand$, and $\delta$ (a crowd factor).

Preying behavior is the basic behavior of FSA and is described by (4). For $X_i$, a position $X_j$ is selected at random within the current visual scope. If $Y_j > Y_i$, then $X_i$ moves a step toward $X_j$. Otherwise, another random $X_j$ is tried. If no $X_j$ with $Y_j > Y_i$ is found after a given number of trials, $X_i$ is replaced directly with a random position within the visual scope, which helps FSA escape from local optima. The corresponding function is defined in (4).

Swarming behavior is described by (5). It models the attraction of the swarm center to the individual. Let $n_f$ be the number of AFs within the current visual scope of $X_i$, and let $X_c$ be the center position of those neighbors. If the food satisfaction at the swarm center $X_c$ is greater than that at $X_i$ and the center is not too crowded, as judged by the crowd factor $\delta$, then $X_i$ moves a step toward $X_c$. Otherwise, preying behavior is used to determine the next position of $X_i$.

Following behavior is described by (6). Let $X_{max}$ be the AF with the greatest food concentration among the AFs in the current visual scope of $X_i$. If the food satisfaction at $X_{max}$ is greater than that at $X_i$ and its neighborhood is not too crowded, as judged by $\delta$, then $X_i$ moves a step toward $X_{max}$. Otherwise, preying behavior is used to determine the next position of $X_i$.

In addition, the $visual$ and $step$ parameters play an important role in FSA. They determine its convergence speed and its ability to escape from local optima. Following [11], they are set using the Lorentzian function and the normal distribution function.
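As an illustration of the preying behavior, the sketch below implements one preying move in a continuous search space for a maximization problem. It follows the commonly used form of the artificial fish swarm algorithm and is not taken from this paper's equation (4); the function `prey`, the objective `sphere`, and the parameter names `visual`, `step`, and `try_number` are assumptions made for the example.

```python
import math
import random

def sphere(pos):
    """Toy objective to maximize (assumption): peak at the origin."""
    return -sum(v * v for v in pos)

def prey(x, f, visual, step, try_number, rng):
    """One preying move of an artificial fish at position x (hypothetical helper)."""
    for _ in range(try_number):
        # Sample a candidate position inside the visual scope of x.
        cand = [xi + visual * rng.uniform(-1.0, 1.0) for xi in x]
        if f(cand) > f(x):
            dist = math.dist(x, cand)
            if dist == 0.0:
                continue
            # Move a random fraction of the step length toward the candidate.
            scale = rng.random() * step / dist
            return [xi + scale * (ci - xi) for xi, ci in zip(x, cand)]
    # No better candidate was found: jump to a random position in the visual scope.
    return [xi + visual * rng.uniform(-1.0, 1.0) for xi in x]

rng = random.Random(1)
start = [3.0, 4.0]
moved = prey(start, sphere, visual=1.0, step=0.5, try_number=5, rng=rng)
print(moved, sphere(moved))
```

The random jump in the last line of `prey` is what lets an individual leave a position where no better neighbor can be found.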

3. A Coding Method for Combination

Let $I_n = \{1, 2, \ldots, n\}$ be a set of $n$ integers. The number of permutations of $I_n$ is $n!$. Sort them from small to large in lexicographic order. The Cantor expansion and the inverse Cantor expansion show that there is a one-to-one correspondence between the set of all permutations of $I_n$ and a range of $n!$ consecutive integers. Converting a permutation into a decimal number in this way allows heuristic algorithms to be applied to the TSP. Unlike the TSP, rough set attribute reduction is concerned with combinations of the attribute set, so it is necessary to discuss the rank of a combination in the lexicographically ordered sequence of combinations.
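For reference, the correspondence given by the Cantor expansion can be sketched as follows. The 0-based ranking convention and the function names `perm_rank` and `perm_unrank` are assumptions made for this example.

```python
from itertools import permutations
from math import factorial

def perm_rank(perm):
    """0-based lexicographic rank of a permutation of distinct items (Cantor expansion)."""
    n = len(perm)
    rank = 0
    for i, p in enumerate(perm):
        # Count later items smaller than p; each accounts for (n-1-i)! earlier permutations.
        smaller = sum(1 for q in perm[i + 1:] if q < p)
        rank += smaller * factorial(n - 1 - i)
    return rank

def perm_unrank(rank, items):
    """Inverse Cantor expansion: the permutation of items with the given 0-based rank."""
    pool = sorted(items)
    perm = []
    for i in range(len(pool), 0, -1):
        idx, rank = divmod(rank, factorial(i - 1))
        perm.append(pool.pop(idx))
    return perm

# Round trip over all permutations of {1, 2, 3, 4}.
for expected_rank, p in enumerate(permutations([1, 2, 3, 4])):
    assert perm_rank(list(p)) == expected_rank
    assert perm_unrank(expected_rank, [1, 2, 3, 4]) == list(p)
```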

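Let $C_n^k$ be the set of all combinations of $k$ elements chosen from $\{1, 2, \ldots, n\}$; then the cardinality of $C_n^k$ is $\binom{n}{k}$. The elements of each combination in $C_n^k$ are sorted from small to large, that is, $a_1 < a_2 < \cdots < a_k$.

By lexicographic order, $C_n^k$ can be regarded as a sequence.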

Example 1. Let and . All the elements of were shown in Table 1.

From Table 1, we can find that the NO is . By lexicographic order, the following proposition is apparent.

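Proposition 2. Let $a = \{a_1, a_2, \ldots, a_k\}$ and $b = \{b_1, b_2, \ldots, b_k\}$ be two combinations of the same cardinality, each written in increasing order. Then $a$ precedes $b$ if and only if there exists $j$ such that $a_i = b_i$ for all $i < j$, and $a_j < b_j$.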

Proposition 3. Let , then, ,  .

According to the property of combination, let and . Then the number of combinations in as is by Proposition 3. Thus we can get a matrix about , where The follow equation is apparent: if , then

Example 4. For , matrix is as follows:

Since , define a mapping from to as follows:where and . We will give Algorithm 1 for calculating the combination by the matrix .

input: .
output:
begin
:
if ;
   and ;
end
;
for to do
  if .
   , and ;
  end
end
end

For all , according to Algorithm 1 and (9) and (10), we can calculate ; thus we can decode a integer within into a combination in .

Example 5. ; let and . By Algorithm 1, we know; then ,   ;; then ,  ;; then ,  ;; then ;;in Table 1, the 22th permutation being , that is, the same result.
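The decoding of an integer into a combination discussed in this section can be sketched in Python as follows. This is an independent illustration of lexicographic ranking and unranking, not a transcription of Algorithm 1: it computes binomial coefficients directly instead of using a precomputed matrix, and the 1-based numbering and the names `comb_unrank` and `comb_rank` are assumptions.

```python
from itertools import combinations
from math import comb

def comb_unrank(rank, n, k):
    """Return the rank-th (1-based) k-subset of {1..n} in lexicographic order."""
    if not 1 <= rank <= comb(n, k):
        raise ValueError("rank out of range")
    rank -= 1  # switch to a 0-based offset
    result = []
    candidate = 1
    for remaining in range(k, 0, -1):
        # Skip every block of combinations whose next element is smaller.
        while True:
            block = comb(n - candidate, remaining - 1)
            if rank < block:
                break
            rank -= block
            candidate += 1
        result.append(candidate)
        candidate += 1
    return result

def comb_rank(combo, n):
    """Inverse mapping: 1-based lexicographic rank of an increasing k-subset of {1..n}."""
    k = len(combo)
    rank = 1
    prev = 0
    for i, c in enumerate(combo):
        for skipped in range(prev + 1, c):
            rank += comb(n - skipped, k - 1 - i)
        prev = c
    return rank

# Check against the order produced by itertools.combinations.
for r, c in enumerate(combinations(range(1, 8), 3), start=1):
    assert comb_unrank(r, 7, 3) == list(c)
    assert comb_rank(list(c), 7) == r
```

Because every integer in the valid range maps to exactly one combination of the chosen cardinality, an optimizer that moves over integers can never leave the set of fixed-length subsets.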

4. RFSA-RST Algorithm

This section proposes an algorithm, named RFSA-RST, for finding minimal reducts of a dataset based on RST and FSA. It first uses the concept of the core to find the core attributes; FSA is then applied in a restrained manner, searching only attribute subsets of a fixed length in each loop, so that the reduct found is as short as possible.

4.1. Encoding Method

Essentially, AR is a combinatorial optimization problem whose solution space is the power set of the attribute set. Let $C$ be the condition attribute set, and group the subsets of $C$ by cardinality: the subsets with $k$ attributes form one part containing $\binom{|C|}{k}$ elements, and parts with different $k$ are disjoint, so the power set of $C$ is split into disjoint parts. According to Algorithm 1, every integer within the valid range can be decoded into a combination of the given cardinality, so an integer encoding is adopted. Each integer thus represents a combination, and the solution space is integer valued. Consider the increment of an AF's position. By (4)–(6), it may be fractional, so it does not fit the integer solution space. To address this problem, the new increment is defined as follows [11]:

4.2. Fitness Function

In FSA, the quality of an AF is determined by the fitness function. For the AR problem, a subset should retain the classification ability and contain as few attributes as possible, and the fitness function must reflect both requirements. It should also be as simple as possible to keep the computation efficient. In this paper, each subset of the noncore attributes is encoded as an integer that serves as an AF. Since the cardinality of the combinations searched in a loop is fixed, FSA is restricted to fixed-length subsets of the condition attribute set and judges an AF by the dependency value of the attribute subset represented by its integer, as explained below.

For an AF, the combination corresponding to its integer position is calculated by Algorithm 1. The fitness value of the AF is then defined as follows:
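A minimal sketch of such a fitness evaluation is given below. It assumes that the candidate subset is the core together with the decoded combination of noncore attributes and that the fitness is the attribute dependency of that subset; the brute-force `decode` helper, the toy table, and all names are hypothetical.

```python
from itertools import combinations, islice

def decode(rank, items, k):
    """rank-th (1-based) k-subset of items in lexicographic order (brute force)."""
    return list(next(islice(combinations(sorted(items), k), rank - 1, None)))

def dependency(rows, cond, dec):
    """Fraction of objects whose values on cond determine the decision uniquely."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[a] for a in cond), set()).add(row[dec])
    consistent = sum(1 for row in rows
                     if len(groups[tuple(row[a] for a in cond)]) == 1)
    return consistent / len(rows)

def fitness(rank, rows, core_attrs, noncore_attrs, k, dec):
    """Dependency of the core plus the k noncore attributes encoded by rank."""
    subset = core_attrs + decode(rank, noncore_attrs, k)
    return dependency(rows, subset, dec)

# Toy decision table (assumption): columns 0-2 are conditions, column 3 is the decision.
# Attribute 0 is the only core attribute of this table.
table = [
    (0, 0, 0, 0),
    (1, 0, 0, 1),
    (0, 1, 1, 1),
    (1, 1, 0, 1),
]
print(fitness(1, table, [0], [1, 2], 0, 3))  # core only
print(fitness(1, table, [0], [1, 2], 1, 3))  # core plus attribute 1
```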

4.3. Algorithm Description

In FSA, an initial population of a given size needs to be provided, representing multiple points in the search space. If the fitness value of some AF equals the attribute dependency of the full condition attribute set, the process terminates and outputs the attribute subset corresponding to that AF. If the change in average fitness between two successive generations is 0.001 or less, no further generations are created and the current loop terminates. A new initial population is then created for the next loop, whose attribute subsets contain one more attribute; that is, the initial strings in each loop have one more attribute than those in the previous loop. The whole process is repeated on the new initial population until a minimal set of reduct(s) is found. The detailed process of RFSA-RST is shown in Algorithm 2.

Input: .
Output:
begin
  ; Calculate and ;
  if ,
   , output and terminate process;
  else
   ; ; ;
   repeat
    ; ;
    Get randomly;
    repeat
     ;
     for to
      Calculate by Algorithm 1; ;
      Calculate for the permutation by Eq (14).
      if
       
      else
       ;
      end
     end
      ; ;
     Implement swarming behavior, following behavior
     and preying behavior.
    until change ≤ 0.001;
   until ;
   return ;
  end
end
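The outer structure of this procedure, core first and then one more attribute per loop, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the inner FSA search is replaced by plain random sampling of integer ranks, the 0.001 convergence test is replaced by a fixed number of generations, and the names `minimal_reduct`, `pop_size`, and `max_iter` are hypothetical.

```python
import random
from itertools import combinations, islice
from math import comb

def dependency(rows, attrs, dec):
    """Fraction of objects whose values on attrs determine the decision uniquely."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(row[dec])
    return sum(len(groups[tuple(r[a] for a in attrs)]) == 1 for r in rows) / len(rows)

def minimal_reduct(rows, cond, dec, pop_size=10, max_iter=50, seed=0):
    rng = random.Random(seed)
    target = dependency(rows, cond, dec)
    # Core: attributes whose removal lowers the dependency of the full set.
    core = [a for a in cond
            if dependency(rows, [c for c in cond if c != a], dec) < target]
    if dependency(rows, core, dec) == target:
        return sorted(core)
    noncore = [a for a in cond if a not in core]
    for k in range(1, len(noncore) + 1):      # one more attribute per loop
        total = comb(len(noncore), k)
        for _ in range(max_iter):             # stand-in for FSA generations
            for _ in range(pop_size):
                rank = rng.randint(1, total)  # integer-coded individual
                extra = next(islice(combinations(noncore, k), rank - 1, None))
                subset = core + list(extra)
                if dependency(rows, subset, dec) == target:
                    return sorted(subset)
    return sorted(cond)

# Toy decision table (assumption): columns 0-2 are conditions, column 3 is the decision.
data = [
    (0, 0, 0, 0),
    (1, 0, 0, 1),
    (0, 1, 1, 1),
    (1, 1, 0, 1),
]
print(minimal_reduct(data, [0, 1, 2], 3))
```

Like the heuristic it imitates, a sampling-based inner search can miss a reduct at a given length and move on to a longer one, which is why the experiments distinguish normal runs from successful runs.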

5. Performance Comparison

To evaluate the effectiveness of the proposed algorithm, RFSA-RST, we compared it experimentally with two other algorithms: the one proposed in [12], denoted here by RGA-RST, and the one proposed in [17], denoted here by PSO-RST. All three algorithms were run on a PC with Windows XP, a 3.1 GHz CPU, and 1 GB of main memory. Six test datasets were chosen from the UCI machine learning repository. Basic information about the datasets, including the number of objects and the number of (condition) attributes, is listed in Table 2.

To make the comparison fair and reasonable, the three algorithms were each run independently 50 times on each dataset with the same parameter settings. For each run, three values were recorded: the length of the output, the run time, and whether the output is a reduct. If the output is a reduct, the run is said to be normal; otherwise, it is said to be unsuccessful. If the output is a minimum reduct, the run is not only normal but also successful. Let STL, AVL, and AVT denote the shortest length, average length, and average running time over the 50 runs, respectively. The ratios of successful and normal runs are also recorded. The performance of the three algorithms on the datasets is reported in Tables 3 and 4.
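The statistics defined above can be computed from the recorded runs as in the sketch below. The record layout, the function name `summarize`, and the sample values are hypothetical, and the length of a minimum reduct (`min_len`) is assumed to be known for the dataset.

```python
def summarize(runs, min_len):
    """Aggregate run records given as (output_length, seconds, is_reduct) tuples."""
    normal = [r for r in runs if r[2]]                    # output is a reduct
    successful = [r for r in normal if r[0] == min_len]   # output is a minimum reduct
    return {
        "STL": min(r[0] for r in runs),
        "AVL": sum(r[0] for r in runs) / len(runs),
        "AVT": sum(r[1] for r in runs) / len(runs),
        "normal_ratio": len(normal) / len(runs),
        "success_ratio": len(successful) / len(runs),
    }

# Made-up records for illustration only; these are not results from the paper.
sample_runs = [(3, 0.5, True), (4, 0.7, True), (3, 0.6, False), (3, 0.4, True)]
stats = summarize(sample_runs, min_len=3)
print(stats)
```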

Table 3 shows that the proposed algorithm matches the other two algorithms in the shortest length of the outputs and outperforms them in the average length of the output solutions, except on the Soybean-small dataset. This indicates that the proposed algorithm is more stable than the other two.

Table 4 shows that the ratio of normal runs is roughly the same for all three algorithms, but the proposed algorithm outperforms the other two in the ratio of successful runs, except on the Soybean-small dataset. Therefore, when a minimum attribute reduction is required, the proposed algorithm is the best of the three, which is also reflected in the average length in Table 3. In terms of average running time, the proposed algorithm is slightly worse than PSO-RST but better than RGA-RST.

Since the proposed algorithm and RGA-RST operate in the same way, another experiment was carried out to compare the time efficiency of the two. Both algorithms were modified so that the core attribute set is no longer identified, and the length parameter of each algorithm was varied from 1 to the cardinality of the minimal attribute reduction of each dataset. The running time of each loop was recorded. For clarity and because of length limitations, only the results for the Spect and Sponge datasets are shown, in Figure 1. Figure 1 shows that the running time of each loop of the proposed algorithm increased as this parameter increased, while that of RGA-RST remained relatively stable. This is because the search process of the proposed algorithm focuses on attribute subsets of a fixed length. Therefore, the time efficiency of the algorithm is higher in each loop and the convergence rate is faster, which shows that the proposed algorithm is more efficient.

Summing up the experimental results, the proposed algorithm is more efficient than the other two typical algorithms on the test datasets, although its running time is slightly worse than that of PSO-RST.

6. Conclusions

In this paper, we derived a new method for minimum attribute reduction based on rough set theory and the fish swarm algorithm. Using a coding method for combinations (subsets) of the attribute set, FSA is used to search for a minimum attribute reduction, and attribute dependency is used to calculate the fitness values of the attribute subsets. FSA is restrained so that, in each search process, every integer corresponds to an attribute subset of a specified length. The cardinality of the attribute subset represented by an AF starts from the length of the core and is incremented by one after each loop.

The test results on six UCI datasets show that the method improves both the accuracy of minimal attribute reduction and the convergence rate.

Future work will examine whether the starting length is reasonable and will aim to reduce redundant search steps in order to improve search time efficiency.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the Opening Project of Sichuan Province University Key Laboratory of Bridge Non-Destruction Detecting and Engineering Computing (2014QYJ02).