Abstract
Parallel attribute reduction is one of the most important topics in current research on rough set theory. Although some parallel algorithms are well documented, most of them still struggle to deal effectively with complex heterogeneous data that include both categorical and numerical attributes. To address this problem, a novel attribute reduction algorithm based on neighborhood multigranulation rough sets was developed to process massive heterogeneous data in parallel. A MapReduce-based parallelization method for attribute reduction was proposed in the framework of neighborhood multigranulation rough sets. To improve the reduction efficiency, hashing Map/Reduce functions were designed to speed up the positive region calculation. Thereafter, a quick parallel attribute reduction algorithm using MapReduce was developed. The effectiveness and superiority of this parallel algorithm were demonstrated by theoretical analysis and comparison experiments.
1. Introduction
With the rapid development of information technology, especially in sensing, communication, networking, and computation, the amount of accumulated data in many fields is increasing at striking speed. The inestimable value of big data has become a common understanding in academia and industry [1] and has garnered great attention in many countries; thus, big data technology development has been announced as a national strategy by many countries [2]. However, much of the vast array of data that comes to us may be chaotic, irrelevant, and redundant. How to extract the implicit information hidden in complex information systems and express it in the form of explicit knowledge has been an active area of research over the past few decades. In practice, rough sets [3] have been widely used as a mathematical tool to deal with uncertain data. As one of the core research topics of rough set theory, attribute reduction can remove redundant attributes and reduce data dimensions while keeping the dependencies between decision attributes and condition attributes in a decision table stable. Scholars have designed a large number of attribute reduction algorithms in recent years, which can generally be divided into three kinds of methods [4]: positive region [5–7], discernibility matrix [8–11], and information entropy [12–15]. However, these algorithms execute serially. Although such serial algorithms can handle small data efficiently, their computational complexity, which depends on the number of attributes and the sample size, may inevitably lead to lower efficiency or even complete failure when facing massive data.
In order to solve this problem, some scholars have proposed parallel algorithms for high-dimensional or large-scale data. Based on the divide-and-conquer strategy, Xiao et al. [16] used parallel computing to distribute the reduction task over multiple processors that work simultaneously. Lv et al. [17] used rough entropy to measure attribute significance and proposed a parallel minimum reduction set algorithm. However, these algorithms require the whole dataset to be loaded into memory at once. To make up for this defect, Qian et al. [18, 19] utilized a distributed file system based on the Google File System (GFS) and the MapReduce parallel programming model. The decision table was divided into several sub-decision tables, and thus, a large amount of data did not need to be loaded into the memory of a single machine when calculating attribute reduction. Moreover, the machines in the cluster can cooperate with each other to solve problems that a single machine could not address. Zhang et al. [20] proposed a parallel method for computing rough set approximations based on the MapReduce technique to deal with massive data. The equivalence classes for each sub-decision table were calculated in parallel in the Map step; thereafter, these equivalence classes were combined in the Reduce step if their information sets were the same. Qian et al. [21] further showed that the key to improving the reduction efficiency is the effective computation of equivalence classes and attribute significance. Consequently, a <key, value> pair structure was designed to speed up the computation of equivalence classes and attribute significance, and the traditional attribute reduction process was parallelized based on the MapReduce mechanism. However, the abovementioned algorithms are all based on Pawlak's classic rough set model with an equivalence relation, which is only suitable for categorical data.
To break the limit of the equivalence relation, Lin [22] proposed the concept of the neighborhood model and adopted the neighborhood relation instead of the equivalence relation, which can directly deal with numerical data through neighborhood granulation of the universe. Hu et al. [23, 24] proved the monotonic relation between the positive region and the attribute set in the neighborhood rough set model and put forward an attribute reduction algorithm with lower computational complexity that is suitable for heterogeneous data including categorical and numerical attributes. Qian et al. [25] extended Pawlak's rough set model to a multigranulation rough set (MGRS) model, where the set approximations are defined by using multiple equivalence relations on the universe. Based on the pessimistic multigranulation rough set model, Sang and Qian [26] analyzed granular space selection under multiple granular spaces, defined an importance measure for a granular space, and designed a granular space reduction algorithm. Subsequently, Lin et al. [27] expanded the neighborhood rough set model to multiple granular spaces and proposed the concept of a neighborhood multigranulation rough set (NMGRS) model by constructing the universe through a hierarchical division of attribute set sequences. Furthermore, Yong et al. [28] hashed a dataset by dividing data into a series of hash buckets according to the Euclidean distance, which dramatically decreased the calculation time and reduced time complexity to for getting positive regions. A quick and efficient attribute reduction algorithm with the time complexity of was also given. In addition, Qian et al. discussed local rough sets for dealing with big data [29, 30]. However, the aforementioned algorithms based on different extended rough set models still rely on serial computation.
To the best of our knowledge, it is still a challenging task to perform parallel attribute reduction on complex and massive data. In particular, existing algorithms cannot effectively deal with complex heterogeneous data, which include categorical and numerical attributes, in parallel from the perspective of multiple granular spaces. Given the prevalence of heterogeneous datasets in real-life applications, it is necessary to investigate effective parallel approaches to this issue. To parallelize the traditional attribute reduction algorithm for complex heterogeneous data, the neighborhood multigranulation rough set model is considered in this paper, and the parallelization points of hashing, positive region calculation, and boundary object pruning are analyzed based on the MapReduce mechanism. Thereafter, a fast parallel attribute reduction algorithm is developed. The effectiveness and superiority of this parallel algorithm are demonstrated by theoretical analysis and comparison experiments.
Different from available algorithms, the contribution of this paper is twofold: (1) motivated by the aforementioned MapReduce technology, hash algorithm, and neighborhood multigranulation rough set model, the parallelization methods of multiple granular spaces and hashing Map/Reduce functions for heterogeneous data are presented; (2) a neighborhood multigranulation rough set model-based parallel attribute reduction algorithm using MapReduce, which has not been done before, is proposed.
The paper is organized as follows. Section 2 outlines some preliminary knowledge. In Section 3, we present the parallelization strategies of multiple granular spaces for heterogeneous data and the parallel fast attribute reduction algorithm based on the neighborhood multigranulation rough set model. Next, experiments are conducted to evaluate the efficiency of the proposed algorithm in Section 4. Finally, in Section 5, we present the conclusion and the future work.
2. Preliminary Knowledge
In this section, 1-type and 2-type neighborhood multigranulation rough sets and the MapReduce programming model are briefly described.
2.1. Neighborhood Multigranulation Rough Set
The neighborhood rough set model uses a neighborhood relation in place of the equivalence relation and can therefore directly process numerical and heterogeneous data. To further process heterogeneous data from the perspective of multiple granular spaces and multiple levels of granularity, neighborhood rough set theory has been extended from a single attribute subset to multiple attribute subsets. Two types of neighborhood multigranulation rough set models have been developed [27].
2.1.1. 1-Type Neighborhood Multigranulation Rough Sets (1-Type NMGRS)
Definition 1 (see [25]). Let $\langle U, \Delta \rangle$ be a nonempty metric space, where $U$ is a nonempty finite set of objects, called the universe. A closed ball taking $x$ as its center and $\delta$ as its radius is called the $\delta$-neighborhood of $x$ and is defined as follows: $\delta(x) = \{ y \in U \mid \Delta(x, y) \le \delta \}$, where $x \in U$, $\delta \ge 0$, and $\Delta$ is a distance function. For two points in the universe, $x$ and $y$, the distance function can usually use the Euclidean distance formula.
Definition 2 (see [25]). Let be a nonempty metric space. When categorical and numerical attributes coexist, let and be categorical and numerical attributes, respectively. The neighborhood of can be defined as follows:
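The neighborhood of a sample under mixed attributes can be illustrated with a short Python sketch. It assumes the common convention that two samples are neighbors when their categorical attributes match exactly and the Euclidean distance over their numerical attributes does not exceed the radius. The function name `mixed_neighborhood`, the index parameters, and the toy universe are hypothetical and are not taken from the paper.

```python
import math

def mixed_neighborhood(universe, center_id, delta, num_idx, cat_idx=()):
    """Return the IDs of all samples in the neighborhood of one sample.

    Assumed convention: categorical attributes must match exactly and the
    Euclidean distance over the numerical attributes must be <= delta.
    """
    center = universe[center_id]
    result = set()
    for sample_id, row in universe.items():
        same_category = all(row[i] == center[i] for i in cat_idx)
        distance = math.sqrt(sum((row[i] - center[i]) ** 2 for i in num_idx))
        if same_category and distance <= delta:
            result.add(sample_id)
    return result

# Hypothetical toy universe: two numerical attributes and one categorical attribute.
U = {
    1: (0.10, 0.20, "a"),
    2: (0.13, 0.22, "a"),
    3: (0.14, 0.23, "b"),
    4: (0.60, 0.70, "a"),
}

numeric_only = mixed_neighborhood(U, 1, 0.08, num_idx=(0, 1))
mixed = mixed_neighborhood(U, 1, 0.08, num_idx=(0, 1), cat_idx=(2,))
print(numeric_only, mixed)
```

With numerical attributes only, samples 2 and 3 both fall inside the closed ball around sample 1; adding the categorical attribute excludes sample 3 because its category differs.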
Definition 3 (see [25]). Given a decision system, , where is the set of condition attributes and is the set of decision attributes, and . is a set of attribute values and is a domain of the attribute . is a function such that for every and . is the neighborhood relation. Let be a categorical attribute set and be a numerical attribute set, so is an mixed attribute set; , , and represent two partitions and a covering of the universe , respectively. For any , the optimistic multigranulation lower and upper approximations of with respect to and in are defined as follows: whereas the pessimistic multigranulation lower and upper approximations of are defined as follows:
2.1.2. 2-Type Neighborhood Multigranulation Rough Sets (2-Type NMGRS)
Whereas only a single neighborhood relation is used in the 1-type neighborhood multigranulation rough sets, multiple neighborhood relations are fully considered in the 2-type neighborhood multigranulation rough sets, which were denoted as 2-type NMGRS by Qian et al. [25].
Definition 4 (see [27]). Given a decision system , let be a neighborhood relation on the universe induced by and , where is a categorical attribute subset, and is a numerical attribute subset. For any , the optimistic lower and upper approximations of in are defined as follows: whereas the pessimistic multigranulation lower and upper approximations of are defined as follows:
Definition 5 (see [31]). Given a decision system , let be an attribute subset of , is a partition of universe induced by decision attribute , and is a set of neighborhood radii. The attribute dependency of for decision class with the neighborhood radius is defined as follows:
Definition 6 (see [31]). Given a decision system , let be an attribute subset of , is a partition of universe induced by decision attribute , and is a set of neighborhood radii. , if , then the attribute is necessary for ; else, if , then regardless of whether the attribute is removed from , the decision positive region of the system is unchanged; in other words, the attribute is redundant for .
Given the attribute subset , , if , and , then is a relative reduction of with the neighborhood radius .
Definition 7 (see [19]). Given a decision system , let , and , if and , the decision system can be divided into subdecision systems, and is called the subsystem of .
2.2. MapReduce Programming Model
MapReduce is a parallel processing framework that breaks down a large task into many small tasks. The small tasks are independent of each other and differ from the large task only in size. The MapReduce parallel programming model breaks down the computational process into two main stages: the Map stage and the Reduce stage.
In the MapReduce model, the whole dataset is divided into many splits in natural sequence, which are then passed to the Map stage. Data in the MapReduce programming model are represented as <key, value> pairs. The Map function takes <K_{1}, V_{1}> pairs as input and generates a set of intermediate <K_{2}, V_{2}> pairs. The Reduce function groups together all intermediate values associated with the same K_{2}, merges the values for each K_{2} to form a possibly smaller set of values, and finally outputs <K_{3}, V_{3}> pairs. The Map and Reduce functions are given as follows:
Map: <K_{1}, V_{1}> → list(<K_{2}, V_{2}>)
Reduce: <K_{2}, list(V_{2})> → <K_{3}, V_{3}>
Here, K_{i} and V_{i} (i = 1, 2, 3) represent the user-defined data types; list() is used to denote a list.
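The data flow of the two stages can be simulated in a few lines of single-process Python. This is only an illustrative sketch and not Hadoop code: the helper `run_mapreduce`, the counting job, and the toy splits are all hypothetical. It shows how intermediate pairs from independent splits are grouped by key before the Reduce function is applied.

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    """Simulate Map, grouping by key, and Reduce in one process."""
    intermediate = defaultdict(list)
    for split in splits:                      # each split would go to one Map task
        for key1, value1 in split:
            for key2, value2 in map_fn(key1, value1):
                intermediate[key2].append(value2)
    output = []
    for key2 in sorted(intermediate):         # each key would go to one Reduce call
        output.extend(reduce_fn(key2, intermediate[key2]))
    return output

def count_map(sample_id, decision):
    # Emit <decision value, 1> for every sample.
    yield decision, 1

def count_reduce(decision, ones):
    # Merge all values that share the same key.
    yield decision, sum(ones)

# Hypothetical toy input: two splits of <sample ID, decision value> pairs.
splits = [[(1, "Yes"), (2, "Yes")], [(3, "No"), (4, "Yes")]]
counts = run_mapreduce(splits, count_map, count_reduce)
print(counts)
```

In a real cluster the grouping step is performed by the framework between the two stages, so only the Map and Reduce functions need to be written by the user.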
3. Parallel Attribute Reduction Algorithm for NMGRS
For numerical or heterogeneous data, many attribute reduction algorithms based on neighborhood multigranulation rough sets have been developed. However, it is still a challenging task to parallelize these attribute reduction algorithms for massive heterogeneous data. Motivated by the works of Qian et al. [21] and Yong et al. [28], quick parallelization strategies that speed up the computation of neighborhood classes and positive regions are proposed, and a parallel attribute reduction algorithm is designed in this section.
3.1. Parallelization Strategies
To parallelize the attribute reduction algorithm based on the neighborhood multigranulation rough set model, the MapReduce model was adopted. The key point is therefore how to design the Map and Reduce functions so that neighborhood classes and positive regions are obtained quickly. The work of Yong et al. [28] demonstrated that the neighbors of a sample can only lie in its own hash bucket or in the adjacent hash buckets. Therefore, to find possible neighbors, it is only necessary to group the samples according to their hash values. So, in the Map function, the hash value of each sample is calculated first, and then the hash value and sample ID are output. In the Reduce function, the sample IDs that share the same hash value are merged into one hash bucket.
Thus, the Map and Reduce functions for hash buckets calculation are designed as follows.


Example 1. We take the decision table shown in Table 1 as an example to illustrate the calculation process of Algorithms 1 and 2, where the decision attribute is listed in the last column.
According to Definition 7, the decision information system was divided into and , where , and . The neighborhood radius is given as 0.08, and the condition attribute subset is given as .
The Map process:
The <KEY_{HM}, VALUE_{HM}> pairs output by Map 1 are <0,1> and <1,2>.
The <KEY_{HM}, VALUE_{HM}> pairs output by Map 2 are <1,3> and <3,4>.
The Reduce process:
The <KEY_{HR}, VALUE_{HR}> pair output by Reduce 1 is <0,{1}>.
The <KEY_{HR}, VALUE_{HR}> pair output by Reduce 2 is <1,{2,3}>.
The <KEY_{HR}, VALUE_{HR}> pair output by Reduce 3 is <3,{4}>.
After Algorithms 1 and 2, samples were hashed into three hash buckets with hash values of 0, 1, and 3.
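The hashing Map and Reduce steps can be sketched in Python as follows. The sketch assumes that the hash value of a sample is the integer part of its Euclidean distance to a fixed reference point divided by the neighborhood radius; under that assumption, the triangle inequality guarantees that two samples within the radius of each other land in the same or in adjacent buckets. The function names, the reference point, and the one-dimensional sample values are hypothetical (they are not the values of Table 1), chosen so that the buckets resemble those of Example 1.

```python
import math
from collections import defaultdict

def hash_map(sample_id, vector, reference, delta):
    """Map step: emit <hash value, sample ID> for one sample."""
    yield math.floor(math.dist(vector, reference) / delta), sample_id

def hash_reduce(hash_value, sample_ids):
    """Reduce step: merge the IDs that share one hash value into a bucket."""
    yield hash_value, sorted(sample_ids)

def build_buckets(splits, reference, delta):
    grouped = defaultdict(list)
    for split in splits:                       # one Map task per split
        for sample_id, vector in split:
            for key, value in hash_map(sample_id, vector, reference, delta):
                grouped[key].append(value)
    return dict(pair for key in sorted(grouped)
                for pair in hash_reduce(key, grouped[key]))

# Hypothetical one-dimensional samples in two splits, radius 0.08.
splits = [[(1, (0.05,)), (2, (0.10,))], [(3, (0.13,)), (4, (0.30,))]]
buckets = build_buckets(splits, reference=(0.0,), delta=0.08)
print(buckets)
```

Because neighbors can only lie in buckets `h - 1`, `h`, and `h + 1` for a sample with hash value `h`, the later neighborhood search only needs to scan those three buckets instead of the whole universe.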
Next, we calculate the positive regions under the current subset. According to Definition 4, the process of neighborhood computation and positive region judgment based on the neighborhood multigranulation rough sets can be divided into two parts. First, the neighborhood of each sample is calculated under one condition attribute subset, which gives the positive region for that subset. Second, the positive regions of the multiple condition attribute subsets are combined, either by taking their intersection (applicable to the pessimistic neighborhood multigranulation rough set model) or by taking their union (applicable to the optimistic neighborhood multigranulation rough set model).
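The combination of per-subset positive regions described above amounts to a set union or a set intersection. A minimal Python sketch, with a hypothetical function name and hypothetical sample IDs, is:

```python
def combine_positive_regions(regions, mode):
    """Combine the positive regions obtained under several attribute subsets.

    mode="optimistic" takes the union, mode="pessimistic" the intersection.
    """
    regions = [set(region) for region in regions]
    if not regions:
        return set()
    if mode == "optimistic":
        return set().union(*regions)
    if mode == "pessimistic":
        return set.intersection(*regions)
    raise ValueError("mode must be 'optimistic' or 'pessimistic'")

# Hypothetical positive regions under three condition attribute subsets.
regions = [{1, 2, 4}, {2, 4}, {4, 5}]
optimistic = combine_positive_regions(regions, "optimistic")
pessimistic = combine_positive_regions(regions, "pessimistic")
print(optimistic, pessimistic)
```

The pessimistic result is never larger than the optimistic one, since a sample must be in the positive region under every subset rather than under at least one.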
As to the neighborhood calculation for a single condition attribute subset, according to the work in [26], whether a sample belongs to the positive region can be judged by a distance function after traversing the hash buckets where its neighbors may exist. The hash value of a sample is first calculated by the Map function, and then the candidate hash buckets are looked up in the output files (named after the hash value) of Algorithm 2. Consequently, whether the sample belongs to the positive region can be judged by scanning these hash buckets according to the distance function and the decision attribute; samples in the positive region are assigned the key value 1, while samples in the boundary are assigned the key value 0. The positive region for the whole universe is then obtained by combining the partial positive regions in the Reduce function.
Map and Reduce functions for neighborhood calculation by a single condition attribute subset are designed as follows.


The positive region of the whole universe under single attribute granularity can be obtained by this algorithm through the intersection or union of the single granularity positive region sequence. Thus, the significant attribute could be calculated and added to current reduction subsets.
Example 2. We continue using the decision table in Table 1 and the same conditions as in Example 1 to illustrate the operation process.
The Map process:
The <KEY_{M}, VALUE_{M}> pairs output by Map 1 are <0,1> and <0,2>.
The <KEY_{M}, VALUE_{M}> pairs output by Map 2 are <0,3> and <1,4>.
The Reduce process:
The <KEY_{R}, VALUE_{R}> pair output by Reduce 1 is <0,{1,2,3}>.
The <KEY_{R}, VALUE_{R}> pair output by Reduce 2 is <1,{4}>.
The above results show that the positive and boundary regions of the current universe can be acquired by Algorithms 3 and 4. Since the key value 1 marks the positive region and the key value 0 marks the boundary, the positive region includes sample 4 and the boundary region includes samples 1, 2, and 3.
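The positive region judgment with hash buckets can be sketched in Python as follows. The sketch assumes a distance-based hash (integer part of the distance to a reference point divided by the radius) and treats a sample as positive when every neighbor within the radius has the same decision value. All names and the one-dimensional toy samples are hypothetical; they are not the data of Table 1, so the resulting regions differ from those of Example 2.

```python
import math
from collections import defaultdict

def bucket_of(vector, reference, delta):
    # Assumed hash: integer part of (distance to reference point / radius).
    return math.floor(math.dist(vector, reference) / delta)

def region_map(sample_id, samples, buckets, reference, delta):
    """Map step: emit <1, ID> for a positive sample and <0, ID> for a boundary one."""
    vector, decision = samples[sample_id]
    h = bucket_of(vector, reference, delta)
    consistent = True
    for bucket in (h - 1, h, h + 1):           # only adjacent buckets are scanned
        for other_id in buckets.get(bucket, ()):
            other_vector, other_decision = samples[other_id]
            if math.dist(vector, other_vector) <= delta and other_decision != decision:
                consistent = False
    yield (1 if consistent else 0), sample_id

def region_reduce(key, sample_ids):
    """Reduce step: merge the sample IDs that share one key."""
    yield key, sorted(sample_ids)

def positive_and_boundary(samples, reference, delta):
    buckets = defaultdict(list)
    for sample_id, (vector, _) in samples.items():
        buckets[bucket_of(vector, reference, delta)].append(sample_id)
    grouped = defaultdict(list)
    for sample_id in samples:
        for key, value in region_map(sample_id, samples, buckets, reference, delta):
            grouped[key].append(value)
    return dict(pair for key in sorted(grouped)
                for pair in region_reduce(key, grouped[key]))

# Hypothetical samples: ID -> (numerical vector, decision value).
samples = {1: ((0.05,), "Yes"), 2: ((0.10,), "Yes"),
           3: ((0.15,), "No"), 4: ((0.30,), "No")}
regions = positive_and_boundary(samples, reference=(0.0,), delta=0.08)
print(regions)
```

Here samples 2 and 3 are within the radius of each other but carry different decisions, so both fall into the boundary (key 0), while samples 1 and 4 are positive (key 1).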
According to the monotonicity proof of Ma et al. [32], a sample that belongs to a positive region remains in it when additional attributes are added. It is therefore unnecessary to recompute these samples in the second Reduce stage, so the Reduce function can focus on the boundary samples.
The Map and Reduce functions for updating positive regions are designed as follows.


Example 3. Here, we take Table 1 and the results of Example 2 to illustrate the parallel updating of the boundary set and the formation of a new decision table.
It can be seen from Example 2 that the current positive region set is {4}, and thus, this update should remove the sample with the ID of 4 from the decision table.
The Map process:
The <KEY_{UM}, VALUE_{UM}> pairs output by Map 1 are <1, (1 0.10 0.20 0.61 0.20 Yes)> and <1, (2 0.13 0.22 0.56 0.10 Yes)>.
The <KEY_{UM}, VALUE_{UM}> pair output by Map 2 is <1, (3 0.14 0.23 0.40 0.31 No)>.
The Reduce process:
The <KEY_{UR}, VALUE_{UR}> pairs output by Reduce 1 are <1, (0.10 0.20 0.61 0.20 Yes)>, <2, (0.13 0.22 0.56 0.10 Yes)>, and <3, (0.14 0.23 0.40 0.31 No)>.
After the above operation, the boundary region updating is finished, and the actual storage is as follows:
Subsystem 1: 1 0.10 0.20 0.61 0.20 Yes
2 0.13 0.22 0.56 0.10 Yes
Subsystem 2: 3 0.14 0.23 0.40 0.31 No
The above results show that the subsequent attribute reduction can be processed directly on the basis of the whole dataset without any extra splitting.
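The update step of Example 3 can be sketched in Python as a filter followed by a regrouping. The sketch uses the three records listed in Example 3 (assuming the third record has ID 3) and invents the attribute values of sample 4, which the text does not list; the function names are hypothetical as well.

```python
def update_map(record, positive_ids):
    """Map step: keep only boundary records, all under the constant key 1."""
    if record[0] not in positive_ids:
        yield 1, record

def update_reduce(key, records):
    """Reduce step: emit <sample ID, remaining fields> for each kept record."""
    for record in sorted(records, key=lambda r: r[0]):
        yield record[0], record[1:]

def update_table(splits, positive_ids):
    collected = []
    for split in splits:                       # one Map task per split
        for record in split:
            for _, value in update_map(record, positive_ids):
                collected.append(value)
    return list(update_reduce(1, collected))

splits = [
    [(1, 0.10, 0.20, 0.61, 0.20, "Yes"), (2, 0.13, 0.22, 0.56, 0.10, "Yes")],
    # The values of sample 4 are hypothetical placeholders.
    [(3, 0.14, 0.23, 0.40, 0.31, "No"), (4, 0.90, 0.85, 0.12, 0.77, "No")],
]
new_table = update_table(splits, positive_ids={4})
for sample_id, fields in new_table:
    print(sample_id, fields)
```

Only the boundary samples survive, so each later iteration works on a smaller decision table.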
3.2. Parallel Attribute Reduction Algorithm
On the basis of the parallel algorithms given in Section 3.1, a neighborhood multigranulation rough set-based parallel attribute reduction algorithm using MapReduce is presented. For convenience, this algorithm is denoted as PARA_NMG in this paper.
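The overall idea of a positive-region-driven reduction loop with boundary pruning can be illustrated by the following serial Python sketch. It is not the PARA_NMG algorithm itself: it assumes a generic greedy forward selection in which the attribute that yields the largest positive region is added at each step and the positive samples are then removed, and it uses a brute-force positive region function as a stand-in for the parallel jobs. All names and the toy data are hypothetical.

```python
import math

def positive_region(samples, attrs, delta):
    """Serial stand-in for the parallel positive region computation."""
    positive = set()
    for sample_id, (vector, decision) in samples.items():
        point = [vector[a] for a in attrs]
        consistent = all(
            other_decision == decision
            for other_vector, other_decision in samples.values()
            if math.dist(point, [other_vector[a] for a in attrs]) <= delta
        )
        if consistent:
            positive.add(sample_id)
    return positive

def greedy_reduct(samples, n_attrs, delta):
    """Greedy forward selection with pruning of positive samples (assumed scheme)."""
    reduct, remaining = [], dict(samples)
    while remaining:
        best_attr, best_pos = None, set()
        for a in range(n_attrs):
            if a in reduct:
                continue
            pos = positive_region(remaining, reduct + [a], delta)
            if len(pos) > len(best_pos):
                best_attr, best_pos = a, pos
        if best_attr is None:                  # no attribute enlarges the positive region
            break
        reduct.append(best_attr)
        for sample_id in best_pos:             # positive samples need no further checks
            del remaining[sample_id]
    return reduct

# Hypothetical samples: ID -> (three numerical attributes, decision value).
samples = {
    1: ((0.10, 0.10, 0.50), "Yes"),
    2: ((0.15, 0.90, 0.50), "No"),
    3: ((0.80, 0.20, 0.50), "No"),
    4: ((0.90, 0.80, 0.50), "No"),
}
reduct = greedy_reduct(samples, n_attrs=3, delta=0.2)
print(reduct)
```

On this toy data the first attribute separates samples 3 and 4, the second attribute then resolves samples 1 and 2, and the constant third attribute is never selected.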

3.3. Algorithm: Time Complexity Analysis
It is assumed that the neighborhood decision information system has samples and condition attributes. The positive region calculation is still the key step for the proposed PARA_NMG algorithm. In step 2.1, the calculation method in literature [28] is used to calculate the positive region of each attribute set, the time complexity of which is . As to step 2.3, suppose that there are attributes eventually selected, with each attribute is added into the reduction subset, samples will convert from the boundary sample to the positive region (in term of probability). Therefore, the time complexity of serial calculation is . Furthermore, the MapReduce model was used in the PARA_NMG algorithm to parallelize the attribute reduction algorithm; assuming that there are nodes, the time complexity of the algorithm is , which is superior to the time complexity of in literature [22] and time complexity of in literature [33].
4. Experiment Analysis
In this section, we conduct numerical experiments to assess the efficiency of the proposed algorithm. The experiments were run on a PC cluster of nine nodes, where one was set as the master node and the rest were configured as slave nodes. Each node is equipped with an Intel Core i5-2400M CPU (four cores in all, each 3.1 GHz) and 4 GB of RAM and runs Ubuntu 14.0, Hadoop 2.6.0, and Java 1.6.20. All algorithms were coded in Java.
To illustrate the efficiency of the proposed PARA_NMG algorithm, the representative parallel attribute reduction algorithm based on the positive region proposed in [21], denoted as PAAR_PR, was used for comparison. The difference is that the PAAR_PR algorithm is based on the classical rough set model, while the PARA_NMG algorithm is based on the neighborhood multigranulation rough set model.
To test the efficiency of the two algorithms on different types of data, the experiments were carried out with the real datasets Soybean, US Census Data (1990), Susy, PAMAP2 Physical Activity Monitoring, and Poker Hand from the UCI Machine Learning Repository [34] and another dataset, KDD99. Here, Soybean and US Census Data (1990) are categorical datasets, Susy and PAMAP2 Physical Activity Monitoring are numerical datasets, and Poker Hand and KDD99 are heterogeneous datasets. To create a big data environment, the dataset Soybean was duplicated 100,000 times as a new dataset. For convenience, these six datasets are denoted as DS1~DS6, respectively. The characteristics of these datasets are shown in Table 2.
4.1. Comparison and Analysis of Reduction Results
For the neighborhood rough set model, it is important to select a proper neighborhood radius when calculating neighborhood classes. According to the work of Hu et al. [24], a reasonable neighborhood radius should be selected in the interval . Qian et al. [29] analyzed the monotonicity of the positive region with respect to the neighborhood radius and found that the classification accuracy decreases as the neighborhood radius increases. Considering these factors and the characteristics of the selected datasets, the neighborhood radius for our PARA_NMG algorithm was set to 0.1 for numerical data. The reduction results of the PARA_NMG algorithm and the PAAR_PR algorithm are shown in Table 3.
It can be seen from Table 3 that the PAAR_PR algorithm obtained effective reduction results on the categorical datasets DS1 and DS2. Although DS6 contains a few numerical attributes, the equivalence classes could still be obtained, so the PAAR_PR algorithm remains practicable there. However, for the numerical datasets DS3 and DS4 and the heterogeneous dataset DS5, PAAR_PR could not produce reduction results because equivalence classes could not be obtained on these datasets. Thus, the applicability of the PAAR_PR algorithm depends on the characteristics of the datasets. By comparison, the PARA_NMG algorithm was not limited by data types when calculating attribute reduction on different datasets. Considering the prevalence of heterogeneous datasets in real-life applications, the neighborhood multigranulation rough set-based PARA_NMG algorithm has better applicability.
In addition, for datasets DS1, DS2, and DS6, although attribute reduction results were obtained by both algorithms, the selected attribute subsets differed slightly. To further analyze the two algorithms' effects on reduction results from the perspective of classification accuracy, seven well-known classifiers, namely, sequential minimal optimization (SMO), naive Bayes, naive Bayesian model (NBM), logistic regression model (LRM), locally weighted learning (LWL), J48, and MultiClassClassifier, were selected to test the classification accuracy associated with the different attribute reduction subsets. The test results are shown in Table 4.
We can see from Table 4 that the classification accuracies obtained with the reduction subsets of the PARA_NMG algorithm are better for most of these classifiers. Classification accuracy is an important factor in real-life applications. So, from a practical point of view, our neighborhood multigranulation rough set-based PARA_NMG algorithm also has better applicability.
4.2. Comparative Analysis on Computational Time
To illustrate the influence of the number of computer nodes on the computational time of the two algorithms, the experiments were run on a cluster with different numbers of nodes. The average running times of the two algorithms were recorded and are shown in Table 5. For datasets DS3~DS5, only the results of the PARA_NMG algorithm are given.
As can be seen from Table 5, PAAR_PR is faster than PARA_NMG because different rough set models are used. PAAR_PR is based on the classical rough set model, and the time complexity of the classical heuristic serial reduction algorithm based on the positive region is . Conversely, PARA_NMG is based on the neighborhood multigranulation rough set model, and the time complexity of the classical serial reduction algorithm is . To minimize the number of computations needed to obtain positive regions, the hash function was introduced into the Map and Reduce stages for neighborhood multigranulation rough sets, and the time complexity of our parallel attribute reduction algorithm was reduced to . To some extent, the computational time of our algorithm is still comparable.
Besides the computational time, the speedup is an important performance index for evaluating the efficiency of a parallel algorithm. It is defined as follows: $S_p = T_1 / T_p$, where $p$ is the number of nodes, $T_1$ is the execution time at one node, and $T_p$ is the execution time at $p$ nodes.
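As a small worked example of the speedup measure, the ratio of the one-node execution time to the multi-node execution time can be computed and compared with the ideal linear value. The timings below are hypothetical and are not taken from Table 5.

```python
def speedup(time_one_node, time_p_nodes):
    """Speedup: execution time on one node divided by execution time on p nodes."""
    return time_one_node / time_p_nodes

# Hypothetical timings in seconds for 1, 2, 4, and 8 nodes.
timings = {1: 1200.0, 2: 640.0, 4: 350.0, 8: 200.0}
speedups = {p: speedup(timings[1], t) for p, t in timings.items()}
for p, s in speedups.items():
    # The ideal (linear) speedup on p nodes equals p.
    print(f"nodes={p} speedup={s:.2f} ideal={p}")
```

A measured speedup below the ideal value is expected in practice, because task scheduling and data transfer between nodes add overhead.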
The speedup of the two algorithms was tested with different numbers of nodes. For clarity, the average speedup of the two algorithms on each dataset with different numbers of computer nodes is presented in Figure 1, where the x-axis represents the number of computer nodes, the y-axis represents the speedup, and the red star-marked line labeled "linear" represents the theoretical speedup of a parallel algorithm.
[Figure 1: Average speedup of the two algorithms versus the number of computer nodes, panels (a)-(f).]
As shown in Figure 1, the parallel reduction algorithm proposed in this paper achieves better speedup on different data types. As the number of nodes increases, the superiority of our PARA_NMG algorithm in speedup becomes more and more obvious. Therefore, the PARA_NMG algorithm is more suitable for processing massive heterogeneous data in parallel on a large number of computing nodes.
5. Conclusion
Attribute reduction is one of the important research issues in rough set theory. In the current big data era, traditional attribute reduction algorithms face great challenges in dealing with massive data. Most existing parallel algorithms have seldom taken granular computing into consideration, especially for complex heterogeneous data that include categorical and numerical attributes. To address these issues, a quick parallel attribute reduction algorithm for heterogeneous data using MapReduce in the framework of neighborhood multigranulation rough sets was developed in this paper. The hash function was introduced into the Map and Reduce stages to speed up the positive region calculation. The effectiveness and superiority of the developed algorithm were verified by comparison analysis.
However, only static data were considered in this paper; in fact, datasets in real-world applications often vary dynamically over time. How to parallelize the incremental attribute reduction algorithm in the framework of neighborhood multigranulation rough sets is a focus for future research.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61833011, 61403184, and 61533010), the major program of the Natural Science Foundation of Jiangsu Province Education Commission, China (17KJA120001), the National Key Research and Development Program of China (2017YFD0401001), and the Six Talent Peaks Project in Jiangsu Province, China (XNY038).