Parallel Attribute Reduction Algorithm for Complex Heterogeneous Data Using MapReduce

Zhang, Tengfei; Ma, Fumin; Cao, Jie; Peng, Chen; Yue, Dong

doi:https://doi.org/10.1155/2018/8291650

Complexity

Special Issue

Bio-Inspired Learning and Adaptation for Optimization and Control of Complex Systems

Research Article | Open Access

Volume 2018 | Article ID 8291650 | https://doi.org/10.1155/2018/8291650

Academic Editor: Liang Hu

Received 20 May 2018

Accepted 20 Aug 2018

Published 27 Sept 2018

Abstract

Parallel attribute reduction is one of the most important topics in current research on rough set theory. Although some parallel algorithms were well documented, most of them are still faced with some challenges for effectively dealing with the complex heterogeneous data including categorical and numerical attributes. Aiming at this problem, a novel attribute reduction algorithm based on neighborhood multigranulation rough sets was developed to process the massive heterogeneous data in the parallel way. The MapReduce-based parallelization method for attribute reduction was proposed in the framework of neighborhood multigranulation rough sets. To improve the reduction efficiency, the hashing Map/Reduce functions were designed to speed up the positive region calculation. Thereafter, a quick parallel attribute reduction algorithm using MapReduce was developed. The effectiveness and superiority of this parallel algorithm were demonstrated by theoretical analysis and comparison experiments.

1. Introduction

With the rapid development of information technology, especially in sensing, communication, networking, and computation, the amount of accumulated data in many fields is increasing at a striking speed. The inestimable value of big data has become a common understanding in academia and industry [1] and has garnered great attention in many countries; thus, big data technology development has been announced as a national strategy by many countries [2]. However, most of the vast array of data that comes to us may be chaotic, irrelevant, and redundant. How to extract the implicit information hidden in complex information systems and express it in the form of explicit knowledge has been an active area of research for decades. In practice, rough sets [3] have been widely used as a mathematical tool to deal with uncertain data. As one of the core research topics of rough set theory, attribute reduction can remove redundant attributes and reduce data dimensions while preserving the dependencies between decision attributes and conditional attributes in a decision table. Scholars have designed a large number of attribute reduction algorithms in recent years, which can generally be divided into three categories [4]: positive region [5–7], discernibility matrix [8–11], and information entropy [12–15]. However, these algorithms execute serially. Although such serial algorithms can handle small data efficiently, their computational complexity, which depends on the number of attributes and the sample size, may inevitably lead to low efficiency or even complete failure when facing massive data.

In order to solve this problem, some scholars have proposed parallel algorithms for high-dimensional or large-scale data. Based on the divide and conquer strategy, Xiao et al. [16] used parallel computing to divide the reduction task into multiple processors that process simultaneously. Rough entropy was used to measure attribute significance by Lv et al. [17], and a parallel minimum reduction set algorithm was proposed. However, the dataset should be loaded at once into the memory when implementing these algorithms. To make up for the defects of the aforementioned parallel algorithms, the Google File System- (GFS-) based distributed file system and the MapReduce parallel programming model were utilized by Qian et al. [18, 19]. The decision table was divided into several subdecision tables, and thus, a large amount of data did not need to be loaded into a single memory bank when calculating attribute reduction. Moreover, the machines in the cluster can cooperate with each other to solve problems that the single machine could not address. Zhang et al. [20] proposed a parallel method for computing rough set approximations based on the MapReduce technique to deal with the massive data. The equivalence classes for each subdecision table were calculated parallelly in Map step; thereafter, these equivalence classes were combined in Reduce step if their information sets are the same. Qian et al. [21] further analyzed that the key to improving the reduction efficiency is the effective computation of equivalence classes and attribute significance. Consequently, a structure of <key, value> pair to speed up the computation of equivalence classes and attribute significance was designed and the traditional attribute reduction process was parallelized based on MapReduce mechanism. However, these abovementioned algorithms were based on the Pawlak’s classic rough set model with an equivalence relation, which is only suitable for categorical data.

To break the limit of the equivalence relation, Lin [22] proposed the concept of the neighborhood model and adopted the neighborhood relation instead of the equivalence relation, which can directly deal with numerical data through neighborhood granulation of the universe. The monotonic relation between the positive region and the attribute set in the neighborhood rough set model was proved by Hu et al. [23, 24], and an attribute reduction algorithm with lower computational complexity, which is suitable for heterogeneous data including categorical and numerical attributes, was put forward. Qian et al. [25] extended Pawlak's rough set model to a multigranulation rough set (MGRS) model, where the set approximations are defined by using multiple equivalence relations on the universe. Based on the pessimistic multigranulation rough set model, Sang and Qian [26] analyzed granular space selection under multiple granular spaces, defined the importance measure of a granular space, and designed a granular space reduction algorithm. Subsequently, Lin et al. [27] expanded the neighborhood rough set model to multiple granular spaces and proposed the concept of a neighborhood multigranulation rough set (NMGRS) model by constructing the universe through a hierarchical division of attribute set sequences. Furthermore, Yong et al. [28] hashed a dataset by dividing the data into a series of hash buckets according to the Euclidean distance, which dramatically decreased the calculation time and lowered the time complexity of obtaining positive regions. A quick and efficient attribute reduction algorithm was also given. In addition, Qian et al. discussed local rough sets for dealing with big data [29, 30]. However, the aforementioned algorithms based on different extended rough set models still rely on serial computation.

To the best of our knowledge, it is still a challenging task to perform parallel attribute reduction on complex and massive data. In particular, the existing algorithms could not effectively deal with the complex heterogeneous data, which include categorical and numerical attributes, in the parallel way from the multiple granular computing perspectives. Due to the rampant existence of heterogeneous datasets in real-life applications, it is therefore necessary to investigate effective parallel approaches to deal with this issue. For the purpose of parallelizing the traditional attribute reduction algorithm for complex heterogeneous data, the neighborhood multigranulation rough set model was considered in this paper, and the parallelization points of the hashing, positive region calculating, and boundary objects pruning are analyzed based on MapReduce mechanism. Thereafter, a fast parallel attribute reduction algorithm is developed. The effectiveness and superiority of this parallel algorithm were demonstrated by theoretical analysis and comparison experiments.

Different from available algorithms, the contribution of this paper is twofold: (1) motivated by the aforementioned MapReduce technology, hash algorithm, and neighborhood multigranulation rough set model, the parallelization methods of multiple granular spaces and hashing Map/Reduce functions for heterogeneous data are brought to light; (2) a neighborhood multigranulation rough set model-based parallel attribute reduction algorithm using MapReduce, which has never been done before, is proposed.

The paper is organized as follows. Section 2 outlines some preliminary knowledge. In Section 3, we present the parallelization strategies of multiple granular spaces for heterogeneous data and the parallel fast attribute reduction algorithm based on the neighborhood multigranulation rough set model. Next, experiments are conducted to evaluate the efficiency of the proposed algorithm in Section 4. Finally, in Section 5, we present the conclusion and the future work.

2. Preliminary Knowledge

In this section, 1-type and 2-type neighborhood multigranulation rough sets and MapReduce programming model will be briefly described.

2.1. Neighborhood Multigranulation Rough Set

Neighborhood rough set model uses neighborhood relation to replace equivalence relation, which can directly process numerical data and heterogeneous data. For further processing heterogeneous data from the perspective of multiple granular spaces and multiple levels of granularity, neighborhood rough set theory has been extended from single attribute subset to multiple attribute subsets. Two types of neighborhood multigranulation rough set models have been developed [27].

2.1.1. 1-Type Neighborhood Multigranulation Rough Sets (1-Type NMGRS)

Definition 1 (see [25]). Let $\langle U, \Delta \rangle$ be a nonempty metric space; $U$ is a nonempty finite set of objects, called the universe. A closed ball taking $x$ as its center and $\delta$ as its radius is called the $\delta$-neighborhood of $x$ and is defined as follows: $\delta(x) = \{ y \in U \mid \Delta(x, y) \le \delta \}$, where $\delta \ge 0$ and $\Delta$ is a distance function. For two points in the universe, $x_i$ and $x_j$, the distance function can usually use the Euclidean distance formula.
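As a minimal sketch of Definition 1, assuming the Euclidean distance and a brute-force scan of the universe, the neighborhood of a sample can be computed as below. The function name `neighborhood` and the toy sample values are illustrative assumptions, not taken from the paper.

```python
import math

def neighborhood(universe, center_id, delta):
    """Return the IDs of all samples within distance delta of the given sample."""
    center = universe[center_id]
    return {
        sample_id
        for sample_id, values in universe.items()
        if math.dist(center, values) <= delta
    }

# Toy universe: sample ID -> numerical attribute values (illustrative only).
universe = {1: (0.10, 0.20), 2: (0.13, 0.22), 3: (0.50, 0.60)}

near_1 = neighborhood(universe, 1, 0.08)
near_3 = neighborhood(universe, 3, 0.08)
print(near_1, near_3)
```

A sample always belongs to its own neighborhood because its distance to itself is zero, so the result is never empty.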

Definition 2 (see [25]). Let be a nonempty metric space. When categorical and numerical attributes coexist, let and be categorical and numerical attributes, respectively. The neighborhood of can be defined as follows:

Definition 3 (see [25]). Given a decision system, , where is the set of condition attributes and is the set of decision attributes, and . is a set of attribute values and is a domain of the attribute . is a function such that for every and . is the neighborhood relation. Let be a categorical attribute set and be a numerical attribute set, so is an mixed attribute set; , , and represent two partitions and a covering of the universe , respectively. For any , the optimistic multigranulation lower and upper approximations of with respect to and in are defined as follows: whereas the pessimistic multigranulation lower and upper approximations of are defined as follows:

2.1.2. 2-Type Neighborhood Multigranulation Rough Sets (2-Type NMGRS)

In contrast to the 1-type neighborhood multigranulation rough sets, in which only a single neighborhood relation is used, the 2-type neighborhood multigranulation rough sets fully consider multiple neighborhood relations; they were denoted as 2-type NMGRS by Qian et al. [25].

Definition 4 (see [27]). Given a decision system , let be a neighborhood -relation on the universe induced by and , where is a categorical attribute subset, and is a numerical attribute subset. For any , the optimistic lower and upper approximations of in are defined as follows: whereas the pessimistic multigranulation lower and upper approximations of are defined as follows:

Definition 5 (see [31]). Given a decision system , let be an attribute subset of , is a partition of universe induced by decision attribute , and is a set of neighborhood radii. The attribute dependency of for decision class with the neighborhood radius is defined as follows:

Definition 6 (see [31]). Given a decision system , let be an attribute subset of , is a partition of universe induced by decision attribute , and is a set of neighborhood radii. , if , then the attribute is necessary for ; else, if , then regardless of whether the attribute is removed from , the decision positive region of the system is unchanged; in other words, the attribute is redundant for .

Given the attribute subset , , if , and , then is a relative reduction of with the neighborhood radius .

Definition 7 (see [19]). Given a decision system , let , and , if and , the decision system can be divided into -subdecision systems, and is called the subsystem of .

2.2. MapReduce Programming Model

MapReduce is a parallel processing framework that breaks down a large task into many small tasks. The small tasks are independent of each other and differ from the large task only in size. The MapReduce parallel programming model breaks down the computational process into two main stages: the Map stage and the Reduce stage.

In the MapReduce model, the whole dataset is divided into many splits in natural sequence, which are then passed to the Map stage. Data in the MapReduce programming model are represented as <key, value> pairs. The Map function takes <K1, V1> pairs as input and generates a set of intermediate <K2, V2> pairs. The Reduce function groups together all intermediate values associated with the same K2, merges the values for each K2 into a possibly smaller set of values, and finally outputs <K3, V3> pairs. The Map and Reduce functions are given as follows:

Map: <K1, V1> → [<K2, V2>]

Reduce: <K2, [V2]> → [<K3, V3>]

Here, Ki and Vi (i = 1, 2, 3) represent the user-defined data types; [·] is used to denote a list.
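The Map, group-by-key, and Reduce flow described above can be imitated in a few lines of single-process Python. This is only an in-memory sketch for illustration, not Hadoop code; `run_mapreduce`, `label_map`, and `label_reduce` are hypothetical helper names, and the toy job simply counts decision labels across two splits.

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    """Run map_fn on every split, group values by key, then apply reduce_fn per key."""
    grouped = defaultdict(list)
    for split in splits:
        for key, value in map_fn(split):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

def label_map(split):
    """Map: emit a <label, 1> pair for every record in the split."""
    return [(label, 1) for label in split]

def label_reduce(key, values):
    """Reduce: merge all values that share a key into one count."""
    return sum(values)

counts = run_mapreduce([["Yes", "Yes"], ["No", "Yes"]], label_map, label_reduce)
print(counts)
```

In a real cluster the splits are processed on different nodes and the grouping step is the shuffle phase; here both are collapsed into one loop to keep the data flow visible.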

3. Parallel Attribute Reduction Algorithm for NMGRS

Aiming at the numerical or the heterogeneous data, many attribute reduction algorithms based on neighborhood multigranulation rough sets have been developed. However, it is still a challenging task to parallelize these attribute reduction algorithms for massive heterogeneous data. Motivated by the works of Qian et al. [21] and Yong et al. [28], quick parallelization strategies to speed up the computation of neighborhood classes and positive regions are proposed, and a parallel attribute reduction algorithm is designed in this section.

3.1. Parallelization Strategies

To parallelize the attribute reduction algorithm based on the neighborhood multigranulation rough set model, the MapReduce model was adopted. The key point is therefore how to design the Map and Reduce functions so that neighborhood classes and positive regions can be obtained quickly. The work of Yong et al. [28] demonstrated that the neighbors of a sample can only exist in its own hash bucket or in the adjacent hash buckets. Therefore, to find possible neighbors, it is only necessary to group the samples according to their hash values. In the Map function, the hash value of each sample is calculated first, and then the hash values and sample IDs are output. In the Reduce function, the sample IDs with the same hash value are merged into one hash bucket.

Thus, the Map and Reduce functions for hash buckets calculation are designed as follows.

Algorithm 1: Map function for hash bucket calculation
Input: condition attribute subset C; a data split S_i
Output: <KEY_HM, VALUE_HM> // KEY_HM is the set of hash values of the samples, and VALUE_HM is the set of sample IDs
begin
  <KEY_HM, VALUE_HM> = ∅
  for each x_i ∈ S_i do
    let key = hash(x_i)
    // the hash value is computed with respect to x₀, a special sample in the universe U that is defined through the information function f
    let value = the ID of x_i
    <KEY_HM, VALUE_HM> = <KEY_HM, VALUE_HM> ∪ <key, value>
  end for
end

Algorithm 2: Reduce function for hash bucket calculation
Input: <KEY_HM, VALUE_HM>
Output: <KEY_HR, VALUE_HR> // KEY_HR is the set of distinct hash values key', and VALUE_HR is the set of sample ID subsets value' that share the same hash value key'
begin
  <KEY_HR, VALUE_HR> = ∅
  for each <key, value> in <KEY_HM, VALUE_HM> do
    if key does not appear in <KEY_HR, VALUE_HR> then
      <key', value'> = <key, {value}>
    else // key = key'_k for an existing pair <key'_k, value'_k>
      <KEY_HR, VALUE_HR> = <KEY_HR, VALUE_HR> − <key'_k, value'_k>
      value'_k = value'_k ∪ {value} // combine samples with the same hash value to obtain the hash bucket
      <key', value'> = <key'_k, value'_k>
    end if
    <KEY_HR, VALUE_HR> = <KEY_HR, VALUE_HR> ∪ <key', value'>
  end for // output to multiple files; a file named after a hash value is a hash bucket
end

Example 1. We take the decision table shown in Table 1 as an example to illustrate the calculation process of Algorithms 1 and 2, where the decision attribute is listed in the last column.

According to Definition 7, the decision information system was divided into and , where , and . The neighborhood radius is given as 0.08, and the condition attribute subset is given as .

The Map process:

The <KEY_HM, VALUE_HM> pairs output by Map 1 are <0,1> and <1,2>.

The <KEY_HM, VALUE_HM> pairs output by Map 2 are <1,3> and <3,4>.

The Reduce process:

The <KEY_HR, VALUE_HR> pair output by Reduce 1 is <0,{1}>.

The <KEY_HR, VALUE_HR> pair output by Reduce 2 is <1,{2,3}>.

The <KEY_HR, VALUE_HR> pair output by Reduce 3 is <3,{4}>.

After Algorithms 1 and 2, the samples are hashed into three hash buckets with hash values of 0, 1, and 3.
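The hashing Map and Reduce steps can be sketched in Python as follows. Several things here are assumptions rather than the authors' implementation: the hash value is taken to be the distance to a reference sample divided by the radius and rounded up, the reference sample `x0` is taken to be the attribute-wise minimum, only the first attribute column is used as the condition attribute subset, and the value 0.30 for sample 4 is hypothetical because that row is not listed in the text. Under these assumptions the sketch yields the same pairs as Example 1.

```python
import math
from collections import defaultdict

def hash_map(split, x0, delta):
    """Map: emit a <hash value, sample ID> pair for every sample in one data split."""
    return [(math.ceil(math.dist(values, x0) / delta), sample_id)
            for sample_id, values in split]

def hash_reduce(pairs):
    """Reduce: merge sample IDs that share a hash value into one hash bucket."""
    buckets = defaultdict(list)
    for key, sample_id in pairs:
        buckets[key].append(sample_id)
    return dict(buckets)

delta = 0.08
x0 = (0.10,)                              # assumed reference sample (attribute-wise minimum)
split_1 = [(1, (0.10,)), (2, (0.13,))]
split_2 = [(3, (0.14,)), (4, (0.30,))]    # 0.30 is a hypothetical value for sample 4

pairs = hash_map(split_1, x0, delta) + hash_map(split_2, x0, delta)
buckets = hash_reduce(pairs)
print(pairs)
print(buckets)
```

With this kind of hash, two samples whose distance is at most the radius differ by at most one in hash value (by the triangle inequality), which is why only a sample's own bucket and its adjacent buckets need to be scanned later.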

Next, we calculate the positive regions under the current subset. According to Definition 4, the process of neighborhood computation and positive region judgment based on the neighborhood multigranulation rough sets can be divided into two parts. First, calculate the neighborhood of each sample under one condition attribute subset; then take the intersection of the positive region sets of multiple condition attribute subsets (applicable to the pessimistic neighborhood multigranulation rough set model) or take the union of the positive region sets (applicable to the optimistic neighborhood multigranulation rough set model).
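The second part, combining the per-subset positive regions, reduces to a set union or a set intersection. The following is a small sketch of that step only; the function name `combine_positive_regions`, the mode strings, and the sample ID sets are illustrative assumptions.

```python
def combine_positive_regions(regions, mode):
    """Combine per-subset positive regions: union (optimistic) or intersection (pessimistic)."""
    if not regions:
        return set()
    if mode == "optimistic":
        return set().union(*regions)
    if mode == "pessimistic":
        return set.intersection(*map(set, regions))
    raise ValueError(f"unknown mode: {mode}")

# Positive regions (sets of sample IDs) under three condition attribute subsets, made up for illustration.
regions = [{1, 4}, {4}, {2, 4}]

optimistic = combine_positive_regions(regions, "optimistic")
pessimistic = combine_positive_regions(regions, "pessimistic")
print(optimistic, pessimistic)
```

The pessimistic result is always a subset of the optimistic one, since a sample must be positive under every subset to survive the intersection.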

As to the neighborhood calculation under a single condition attribute subset, according to the work in literature [26], whether a sample belongs to the positive region can be judged by a distance function after traversing the hash buckets where a neighbor may exist. The hash value of a sample is first calculated by the Map function, and the possible hash buckets are then looked up in the output files (named after the hash values) produced by Algorithm 2. Consequently, whether the sample belongs to the positive region can be judged by scanning the found hash buckets according to the distance function and the decision attribute, where the key of a sample in the positive region is set to 1 and the key of a sample in the boundary is set to 0. The positive region for the whole universe is obtained by combining the positive regions of all data splits in the Reduce function.

Map and Reduce functions for neighborhood calculation by a single condition attribute subset are designed as follows.

Algorithm 3: Map function for neighborhood calculation under a single condition attribute subset
Input: a single condition attribute subset C; the hash buckets B; a data split S_i
Output: <KEY_M, VALUE_M>
// key indicates whether the sample belongs to the positive region: key is 1 for a sample in the positive region and 0 for a sample in the boundary; value is the ID of the sample
begin
  <KEY_M, VALUE_M> = ∅
  for each x_i ∈ S_i do
    let key_i = 1 // assume that this sample belongs to the positive region under C
    let value_i = the ID of x_i, where x_i falls in hash bucket B_k
    for each y in B_{k-1} ∪ B_k ∪ B_{k+1} do // traverse the hash buckets where a neighbor may exist
      if y is in the neighborhood of x_i but has a different decision attribute value then
        let key_i = 0 // a conflicting neighbor places x_i in the boundary
        break
      end if
    end for
    <KEY_M, VALUE_M> = <KEY_M, VALUE_M> ∪ <key_i, value_i>
  end for
end

Algorithm 4: Reduce function for neighborhood calculation under a single condition attribute subset
Input: <KEY_M, VALUE_M>
Output: <KEY_R, VALUE_R>
// KEY_R is the set of distinct keys, and VALUE_R is the set of sample ID subsets that share the same key
begin
  <KEY_R, VALUE_R> = ∅
  for each <key, value> in <KEY_M, VALUE_M> do
    if key does not appear in <KEY_R, VALUE_R> then
      <key', value'> = <key, {value}>
    else // key = key_k for an existing pair <key_k, value_k>
      <KEY_R, VALUE_R> = <KEY_R, VALUE_R> − <key_k, value_k>
      value_k = value_k ∪ {value} // combine samples with the same key
      <key', value'> = <key_k, value_k>
    end if
    <KEY_R, VALUE_R> = <KEY_R, VALUE_R> ∪ <key', value'>
  end for
end

These algorithms yield the positive region of the whole universe under each single attribute granularity; the multigranulation positive region then follows from the intersection or union of this sequence of single-granularity positive regions. Thus, the most significant attribute can be identified and added to the current reduction subset.

Example 2. We continue using the decision table in Table 1 and the same conditions as in Example 1 to illustrate the operation process.

The Map process:

The <KEY_M, VALUE_M> pairs output by Map 1 are <0,1> and <0,2>.

The <KEY_M, VALUE_M> pairs output by Map 2 are <0,3> and <1,4>.

The Reduce process:

The <KEY_R, VALUE_R> pair output by Reduce 1 is <0,{1,2,3}>.

The <KEY_R, VALUE_R> pair output by Reduce 2 is <1,{4}>.

The above results show that the positive and boundary regions of the current universe can be acquired by Algorithms 3 and 4, where the positive region includes the sample with ID 4 and the boundary region includes the samples with IDs 1, 2, and 3.
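A compact Python sketch of the positive region Map and Reduce steps is given below. It follows the key convention stated in the text (1 for the positive region, 0 for the boundary) and scans only a sample's own and adjacent hash buckets. The data layout, the function names, the use of the first attribute column only, and the entire row for sample 4 are assumptions made for illustration; under them the sketch yields the same pairs as Example 2.

```python
import math
from collections import defaultdict

def positive_region_map(split, samples, buckets, delta):
    """Map: emit <1, ID> for positive-region samples and <0, ID> for boundary samples."""
    pairs = []
    for sample_id in split:
        values, decision, bucket = samples[sample_id]
        # Only the sample's own bucket and the two adjacent buckets can hold neighbors.
        candidates = [other for k in (bucket - 1, bucket, bucket + 1)
                      for other in buckets.get(k, [])]
        conflict = any(
            math.dist(values, samples[other][0]) <= delta
            and samples[other][1] != decision
            for other in candidates
        )
        pairs.append((0 if conflict else 1, sample_id))
    return pairs

def positive_region_reduce(pairs):
    """Reduce: merge the sample IDs that share the same key."""
    merged = defaultdict(list)
    for key, sample_id in pairs:
        merged[key].append(sample_id)
    return dict(merged)

# sample ID -> (condition attribute values, decision, hash bucket)
samples = {
    1: ((0.10,), "Yes", 0),
    2: ((0.13,), "Yes", 1),
    3: ((0.14,), "No", 1),
    4: ((0.30,), "No", 3),   # hypothetical row: sample 4 is not listed in the text
}
buckets = {0: [1], 1: [2, 3], 3: [4]}

pairs = (positive_region_map([1, 2], samples, buckets, 0.08)
         + positive_region_map([3, 4], samples, buckets, 0.08))
regions = positive_region_reduce(pairs)
print(regions)
```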

According to the monotonicity proved in the work of Ma et al. [32], a sample that already belongs to the positive region remains in it when additional attributes are added. In other words, it is unnecessary to recalculate these samples in the second Reduce stage, so the Reduce function can focus on the boundary samples.

The Map and Reduce functions for updating positive regions are designed as follows.

Algorithm 5: Map function for updating positive regions
Input: a data split S_i
Output: <KEY_UM, VALUE_UM>
// key^ = 1 indicates that the sample does not belong to the positive region; value^ is the sequence of the sample's attribute values
begin
  <KEY_UM, VALUE_UM> = ∅
  for each x_i ∈ S_i do
    if x_i does not belong to the positive region then
      key^_i = 1, value^_i = x_i
    else
      key^_i = 0, value^_i = x_i
    end if
    <KEY_UM, VALUE_UM> = <KEY_UM, VALUE_UM> ∪ <key^_i, value^_i>
  end for
end

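Algorithm 6: Reduce function for updating positive regions
Input: <KEY_UM, VALUE_UM>
Output: <KEY_UR, VALUE_UR>
// KEY_UR is the sequence number assigned to each pair with key^ = 1, and VALUE_UR is the set of the corresponding value^ sequences
begin
  KEY_UR = 0 and VALUE_UR = ∅
  for each <key^_k, value^_k> in <KEY_UM, VALUE_UM> do
    if key^_k = 1 then
      KEY_UR = KEY_UR + 1
      VALUE_UR = VALUE_UR ∪ value^_k
    end if
  end for
end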

Example 3. Here, we take Table 1 and the results of Example 2 to illustrate how the boundary set is updated in parallel and a new decision table is formed.
It can be seen from Example 2 that the current positive region contains only the sample with ID 4, and thus, this update should remove the sample with the ID of 4 from the decision table.

The Map process:

The <KEY_UM, VALUE_UM> pairs output by Map 1 are <1, (1 0.10 0.20 0.61 0.20 Yes)> and <1, (2 0.13 0.22 0.56 0.10 Yes)>.

The <KEY_UM, VALUE_UM> pair output by Map 2 is <1, (3 0.14 0.23 0.40 0.31 No)>.

The Reduce process:

The <KEY_UR, VALUE_UR> pairs output by Reduce 1 are <1, (0.10 0.20 0.61 0.20 Yes)>, <2, (0.13 0.22 0.56 0.10 Yes)>, and <3, (0.14 0.23 0.40 0.31 No)>.

After the above operation, the boundary region update is finished, and the actual storage situation is as follows:

1 0.10 0.20 0.61 0.20 Yes

2 0.13 0.22 0.56 0.10 Yes

3 0.14 0.23 0.40 0.31 No

The above results show that the subsequent attribute reduction can be processed directly on the basis of the whole dataset without any extra splitting.
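The boundary update can be sketched as a filter followed by renumbering. In the sketch below, `update_boundary` is a hypothetical helper name, the first three rows are the ones listed in this example, and the row for sample 4 is hypothetical because its values are not given in the text.

```python
def update_boundary(rows, positive_ids):
    """Keep only boundary rows (IDs not in the positive region) and number them from 1."""
    boundary = [row for row in rows if row[0] not in positive_ids]
    return {seq: row for seq, row in enumerate(boundary, start=1)}

# Each row: (sample ID, four condition attribute values, decision).
rows = [
    (1, 0.10, 0.20, 0.61, 0.20, "Yes"),
    (2, 0.13, 0.22, 0.56, 0.10, "Yes"),
    (3, 0.14, 0.23, 0.40, 0.31, "No"),
    (4, 0.30, 0.25, 0.50, 0.20, "No"),   # hypothetical row for sample 4
]

new_table = update_boundary(rows, positive_ids={4})
for seq, row in new_table.items():
    print(seq, row)
```

Because positive-region samples never leave the positive region when attributes are added, shrinking the table this way reduces the work of every later iteration without changing the reduction result.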

3.2. Parallel Attribute Reduction Algorithm

On the basis of parallel algorithms given in Section 3.1, a neighborhood multigranulation rough set-based parallel attribute reduction algorithm using MapReduce is presented. For convenience, this algorithm is denoted as PARA_NMG in this paper.

Algorithm 7: PARA_NMG
Input: a neighborhood decision system and the neighborhood radius
Output: reduction reduct
begin
Step 1: initialize reduct = ∅ and the boundary sample set Q = U;
Step 2: if (C − reduct) = ∅, go to Step 3; otherwise, while (C − reduct) ≠ ∅, execute the following loop operations:
Step 2.1: for each condition attribute c_k ∈ (C − reduct), use Algorithms 1~4 to calculate the positive region POS_k of reduct ∪ {c_k};
Step 2.2: compare the positive regions POS_k of the condition attribute subset after each attribute c_k is added, and find the current maximum positive region Max_POS; if Max_POS = ∅, keep the attribute reduction unchanged and go to Step 3; otherwise, add the corresponding c_k into reduct, reduct = reduct ∪ {c_k};
Step 2.3: update the boundary sample set Q with Algorithms 5~6, Q = Q − Max_POS, and return to Step 2.1;
Step 3: output the reduction reduct
end.
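The control flow of the steps above can be sketched serially in Python. This is a simplified, single-process illustration and not the authors' MapReduce implementation: it uses one granulation only, finds neighbors by brute force instead of hash buckets, and breaks ties by attribute order. The function names and the toy data are assumptions.

```python
import math

def positive_region(samples, attrs, delta, candidates):
    """IDs in candidates whose neighbors (within candidates) all share their decision."""
    def project(sample_id):
        return [samples[sample_id][0][a] for a in attrs]
    pos = set()
    for i in candidates:
        if all(samples[j][1] == samples[i][1]
               for j in candidates
               if math.dist(project(i), project(j)) <= delta):
            pos.add(i)
    return pos

def greedy_reduct(samples, n_attrs, delta):
    """Forward selection: add the attribute with the largest positive region, then prune."""
    reduct, boundary = [], set(samples)
    while boundary and len(reduct) < n_attrs:
        best_attr, best_pos = None, set()
        for a in range(n_attrs):
            if a in reduct:
                continue
            pos = positive_region(samples, reduct + [a], delta, boundary)
            if len(pos) > len(best_pos):
                best_attr, best_pos = a, pos
        if best_attr is None:      # no attribute yields a nonempty positive region
            break
        reduct.append(best_attr)
        boundary -= best_pos       # positive samples need not be examined again
    return reduct

# Toy decision table: sample ID -> (condition attribute values, decision), illustrative only.
samples = {
    1: ((0.10, 0.50), "Yes"),
    2: ((0.15, 0.90), "No"),
    3: ((0.90, 0.50), "No"),
    4: ((0.50, 0.20), "Yes"),
}
reduct = greedy_reduct(samples, n_attrs=2, delta=0.2)
print(reduct)
```

On this toy table the first attribute separates samples 3 and 4, and the second attribute is then needed to separate samples 1 and 2, so both attributes are selected.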

3.3. Algorithm: Time Complexity Analysis

It is assumed that the neighborhood decision information system has -samples and -condition attributes. The positive region calculation is still the key step for the proposed PARA_NMG algorithm. In step 2.1, the calculation method in literature [28] is used to calculate the positive region of each attribute set, the time complexity of which is . As to step 2.3, suppose that there are -attributes eventually selected, with each attribute is added into the reduction subset, -samples will convert from the boundary sample to the positive region (in term of probability). Therefore, the time complexity of serial calculation is . Furthermore, the MapReduce model was used in the PARA_NMG algorithm to parallelize the attribute reduction algorithm; assuming that there are -nodes, the time complexity of the algorithm is , which is superior to the time complexity of in literature [22] and time complexity of in literature [33].

4. Experiment Analysis

In this section, we conducted numerical experiments to assess the efficiency of the proposed algorithm. The experiments were implemented on a PC cluster of nine nodes, where one was set as the master node and the rest were configured as slave nodes. Each node was equipped with an Intel Core i5-2400M CPU (four cores in all, each 3.1 GHz) and 4 GB of RAM, and ran Ubuntu 14.0, Hadoop 2.6.0, and Java 1.6.20. All algorithms were coded in Java.

To illustrate the efficiency of our proposed PARA_NMG algorithm, the representative parallel attribute reduction algorithm based on the positive region proposed in literature [21], denoted as PAAR_PR, was used for comparison. The difference is that the PAAR_PR algorithm is based on the classical rough set model, while the PARA_NMG algorithm is based on the neighborhood multigranulation rough set model.

To test the efficiencies of above two algorithms on different types of data, the experiments were carried out with the real datasets Soybean, US Census Data (1990), Susy, PAMAP2 Physical Activity Monitoring, and Poker Hand from UCI Machine Learning Repository [34] and another dataset KDD99. Here, Soybean and US Census Data (1990) are categorical datasets, Susy and PAMAP2 Physical Activity Monitoring are numerical datasets, and Poker Hand and KDD99 are heterogeneous datasets. To create a big data environment, the dataset Soybean was duplicated 100,000 times as a new dataset. For convenience, these above six datasets were denoted as DS1~DS6, respectively. The characteristics of these datasets are shown in Table 2.

4.1. Comparison and Analysis of Reduction Results

For the neighborhood rough set model, it is important to select a proper neighborhood radius when calculating neighborhood classes. According to the work of Hu et al. [24], the neighborhood radius should be selected within a reasonable interval. Qian et al. [29] analyzed the monotonicity of the positive region with respect to the neighborhood radius and found that the classification accuracy is reduced as the neighborhood radius increases. Considering these factors and the characteristics of the selected datasets, the neighborhood radius for our PARA_NMG algorithm was set to 0.1 for numerical data. The reduction results of the PARA_NMG algorithm and the PAAR_PR algorithm are shown in Table 3.

It can be seen from Table 3 that the PAAR_PR algorithm obtained effective reduction results on categorical datasets DS1 and DS2. Notwithstanding that there are few numerical attributes in DS6, the equivalence classes could be obtained, so the PAAR_PR algorithm is still practicable. However, for numerical datasets DS3 and DS4 and heterogeneous dataset DS5, PAAR_PR could not get the reduction results because equivalent classes could not be obtained in these datasets. Thus, the applicability of PAAR_PR algorithm depends on the characteristics of datasets. Comparatively speaking, the PARA_NMG algorithm was not limited by data types when calculating attribute reduction on different datasets. Considering the rampant existence of heterogeneous datasets in real-life applications, the neighborhood multigranulation rough set-based PARA_NMG algorithm has better applicability.

In addition, for datasets DS1, DS2, and DS6, although the attribute reduction results were all obtained, there was a little difference between the selected attribute subsets by both algorithms. To further analyze the two algorithms’ effects on reduction results from the perspective of classification accuracy, seven well-known typical classifiers, namely, sequential minimal optimization (SMO), naive Bayes, naive Bayesian model (NBM), logistic regression model (LRM), locally weighted learning (LWL), J48, and MultiClassClassifier , were selected to further test the classification accuracy associated with different attribute reduction subsets. The test results are shown in Table 4.

We can see from Table 4 that classification accuracies, according to the reduction subsets of PARA_NMG algorithm, are better for most of these classifiers. In fact, the classification accuracy is the important factor that should be considered in real-life applications. So, from a practical point of view, our neighborhood multigranulation rough set-based PARA_NMG algorithm also has better applicability.

4.2. Comparative Analysis on Computational Time

To illustrate the influence of the number of computer nodes on the two algorithms’ computational time, the experiments were implemented on a cluster with different number of nodes. The average running times of the two algorithms were recorded, which are shown in Table 5 as follows. For datasets DS3~DS5, only the results of PARA_NMG algorithm are given.

As can be seen from Table 5, PAAR_PR is faster than PARA_NMG because different rough set models are used. PAAR_PR is based on the classical rough set model and inherits the time complexity of the classical heuristic serial reduction algorithm based on the positive region. In contrast, PARA_NMG is based on the neighborhood multigranulation rough set model, whose classical serial reduction algorithm has a higher time complexity. To minimize the number of computations needed to obtain positive regions, the hash function was introduced into the Map and Reduce stages for neighborhood multigranulation rough sets, which reduced the time complexity of our parallel attribute reduction algorithm. To some extent, the computational time of our algorithm is therefore still comparable.

In fact, besides the computational time, the speedup is an important performance index for evaluating the efficiency of a parallel algorithm. It is defined as $S_p = T_1 / T_p$, where $p$ is the number of nodes, $T_1$ is the execution time on one node, and $T_p$ is the execution time on $p$ nodes.
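As a small worked illustration, the speedup is the ratio of the single-node running time to the running time on a given number of nodes. The running times below are hypothetical numbers chosen only to show the calculation; they are not measurements from the experiments.

```python
def speedup(time_one_node, time_p_nodes):
    """Speedup on p nodes: running time on one node divided by running time on p nodes."""
    return time_one_node / time_p_nodes

# Hypothetical running times in seconds, keyed by the number of nodes.
times = {1: 800.0, 2: 420.0, 4: 230.0, 8: 130.0}

speedups = {p: speedup(times[1], t) for p, t in times.items()}
for p, s in speedups.items():
    # The theoretical (linear) speedup on p nodes is p itself.
    print(f"{p} nodes: speedup {s:.2f} (linear would be {p})")
```

A measured speedup below the linear value reflects communication and scheduling overhead, which is why the linear curve serves as the upper reference in speedup plots.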

The speedup of the two algorithms was tested with different numbers of nodes. For a more intuitive comparison, the average speedup of the two algorithms on each dataset with different numbers of computer nodes is presented in Figure 1, where the horizontal axis represents the number of computer nodes, the vertical axis represents the speedup, and the red line with star markers, labeled as linear, represents the theoretical speedup of a parallel algorithm.

Figure 1: Average speedup of the two algorithms on each dataset with different numbers of computer nodes, panels (a)–(f).

As shown in Figure 1, the parallel reduction algorithm proposed in this paper achieves better speedup on different data types. As the number of nodes increases, the superiority of our PARA_NMG algorithm in speedup becomes more and more obvious. Therefore, the PARA_NMG algorithm is more suitable for processing massive heterogeneous data in parallel on a large number of computing nodes.

5. Conclusion

Attribute reduction is one of the important research issues in rough set theory. In the current big data era, traditional attribute reduction algorithms face big challenges in dealing with massive data. Most existing parallel algorithms have seldom taken granular computing into consideration, especially for complex heterogeneous data that include both categorical and numerical attributes. To address these issues, a quick parallel attribute reduction algorithm for heterogeneous data using MapReduce was developed in this paper in the framework of neighborhood multigranulation rough sets. The hash function was introduced into the Map and Reduce stages to speed up the positive region calculation. The effectiveness and superiority of the developed algorithm were verified by comparative analysis.

However, only static data were considered in this paper, whereas datasets in real-world applications often vary dynamically over time. How to parallelize the incremental attribute reduction algorithm in the framework of neighborhood multigranulation rough sets is a focus for future research.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61833011, 61403184, and 61533010), the major program of the Natural Science Foundation of Jiangsu Province Education Commission, China (17KJA120001), the National Key Research and Development Program of China (2017YFD0401001), and the Six Talent Peaks Project in Jiangsu Province, China (XNY-038).

References

[1] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, “A survey on deep learning for big data,” Information Fusion, vol. 42, pp. 146–157, 2018.
[2] “Notice of the state council on the issuance of a platform for the development of large data,” http://www.gov.cn/zhengce/content/2015G09/05/content_10137.htm.
[3] Z. Pawlak, “Rough sets,” International Journal of Computer & Information Sciences, vol. 11, no. 5, pp. 341–356, 1982.
[4] J. Qian, D. Q. Miao, Z. H. Zhang, and W. Li, “Hybrid approaches to attribute reduction based on indiscernibility and discernibility relation,” International Journal of Approximate Reasoning, vol. 52, no. 2, pp. 212–230, 2011.
[5] T. F. Zhang, J. M. Xiao, and X. H. Wang, “Algorithms of attribute relative reduction in rough set theory,” Acta Electronica Sinica, vol. 33, no. 11, pp. 2080–2083, 2005.
[6] Z. Y. Xu, Z. P. Liu, B. R. Yang, and W. Song, “A quick attribute reduction algorithm with complexity of max (O(|C||U|),O(|C|²|U/C|)),” Chinese Journal of Computers, vol. 29, no. 3, pp. 391–399, 2006.
[7] F. Jing, J. Yunliang, and L. Yong, “Quick attribute reduction with generalized indiscernibility models,” Information Sciences, vol. 397-398, pp. 15–36, 2017.
[8] Y. Yao and Y. Zhao, “Discernibility matrix simplification for constructing attribute reducts,” Information Sciences, vol. 179, no. 7, pp. 867–882, 2009.
[9] Y. Yao and Y. Zhao, “Attribute reduction in decision-theoretic rough set models,” Information Sciences, vol. 178, no. 17, pp. 3356–3373, 2008.
[10] M. Fumin and Z. Tengfei, “Generalized binary discernibility matrix for attribute reduction in incomplete information systems,” The Journal of China Universities of Posts and Telecommunications, vol. 24, no. 4, pp. 57–75, 2017.
[11] J. Konecny, “On attribute reduction in concept lattices: methods based on discernibility matrix are outperformed by basic clarification and reduction,” Information Sciences, vol. 415-416, pp. 199–212, 2017.
[12] G. Wang, “Rough reduction in algebra view and information view,” International Journal of Intelligent Systems, vol. 18, no. 6, pp. 679–688, 2003.
[13] D. Ye, Z. Chen, and S. Ma, “A novel and better fitness evaluation for rough set based minimum attribute reduction problem,” Information Sciences, vol. 222, pp. 413–423, 2013.
[14] L. Sun, J. Xu, and Y. Tian, “Feature selection using rough entropy-based uncertainty measures in incomplete decision systems,” Knowledge-Based Systems, vol. 36, pp. 206–216, 2012.
[15] Q. Zhang, P. Zhang, and G. Wang, “Research on approximation set of rough set based on fuzzy similarity,” Journal of Intelligent & Fuzzy Systems, vol. 32, no. 3, pp. 2549–2562, 2017.
[16] D. W. Xiao, G. Y. Wang, and F. Hu, “Fast parallel attribute reduction algorithm based on rough set theory,” Computer Science, vol. 36, no. 3, pp. 208–211, 2009.
[17] Y. J. Lv, X. N. Liu, and L. Chen, “Rough set attribute reduction algorithm based on PGA,” Computer Science, vol. 35, no. 3, pp. 219–221, 2008.
[18] J. Qian, D. Q. Miao, and Z. H. Zhang, “Knowledge reduction algorithms in cloud computing,” Chinese Journal of Computers, vol. 34, no. 12, pp. 2332–2343, 2011.
[19] J. Qian, P. Lv, X. Yue, C. Liu, and Z. Jing, “Hierarchical attribute reduction algorithms for big data using MapReduce,” Knowledge-Based Systems, vol. 73, pp. 18–31, 2015.
[20] J. Zhang, T. Li, D. Ruan, Z. Gao, and C. Zhao, “A parallel method for computing rough set approximations,” Information Sciences, vol. 194, pp. 209–223, 2012.
[21] J. Qian, D. Miao, Z. Zhang, and X. Yue, “Parallel attribute reduction algorithms using MapReduce,” Information Sciences, vol. 279, pp. 671–690, 2014.
[22] T. Y. Lin, “Granular computing on binary relations I: data mining and neighborhood systems,” Rough Sets in Knowledge Discovery, vol. 1, pp. 107–121, 1998.
[23] Q. H. Hu, H. Zhao, and D. R. Yu, “Efficient symbolic and numerical attribute reduction with neighborhood rough sets,” Pattern Recognition & Artificial Intelligence, vol. 21, no. 6, pp. 732–738, 2008.
[24] Q. Hu, D. Yu, J. Liu, and C. Wu, “Neighborhood rough set based heterogeneous feature subset selection,” Information Sciences, vol. 178, no. 18, pp. 3577–3594, 2008.
[25] Y. Qian, J. Liang, Y. Yao, and C. Dang, “MGRS: a multi-granulation rough set,” Information Sciences, vol. 180, no. 6, pp. 949–970, 2010.
[26] Y. L. Sang and Y. H. Qian, “A granular space reduction approach to pessimistic multi-granulation rough sets,” Pattern Recognition & Artificial Intelligence, vol. 25, no. 3, pp. 361–366, 2012.
[27] G. Lin, Y. Qian, and J. Li, “NMGRS: neighborhood-based multi-granulation rough sets,” International Journal of Approximate Reasoning, vol. 53, no. 7, pp. 1080–1093, 2012.
[28] L. Yong, H. Wenliang, J. Yunliang, and Z. Zhiyong, “Quick attribute reduct algorithm for neighborhood rough set model,” Information Sciences, vol. 271, pp. 65–81, 2014.
[29] Y. Qian, X. Liang, G. Lin, Q. Guo, and J. Liang, “Local multigranulation decision-theoretic rough sets,” International Journal of Approximate Reasoning, vol. 82, pp. 119–137, 2017.
[30] Y. Qian, X. Liang, Q. Wang et al., “Local rough set: a solution to rough data analysis in big data,” International Journal of Approximate Reasoning, vol. 97, pp. 38–63, 2018.
[31] Y. Xu, H. J. Yang, and X. Ji, “Neighborhood multi-granulation rough set model based on double granulate criterion,” Control & Decision, vol. 30, no. 8, pp. 1469–1478, 2015.
[32] F. M. Ma, J. W. Chen, and T. F. Zhang, “Quick attribute reduction algorithm for neighborhood multi-granulation rough set based on double granulate criterion,” Control & Decision, vol. 32, no. 6, pp. 1121–1127, 2017.
[33] H. Chen, T. Li, Y. Cai, C. Luo, and H. Fujita, “Parallel attribute reduction in dominance-based neighborhood rough set,” Information Sciences, vol. 373, pp. 351–368, 2016.
[34] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.

Copyright

Copyright © 2018 Tengfei Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
