Random Fuzzy Granular Decision Tree
In this study, the classification problem is solved from the viewpoint of granular computing: the problem is equivalently transformed into a fuzzy granular space and solved there. Most classification algorithms handle only numerical data; the random fuzzy granular decision tree (RFGDT) can handle not only numerical data but also nonnumerical data such as information granules. The work proceeds in four steps. First, an adaptive global random clustering (AGRC) algorithm is proposed, which adaptively finds the optimal cluster centers, maximizes the ratio of interclass standard deviation to intraclass standard deviation, and avoids falling into a local optimum. Second, on the basis of AGRC, a parallel model is designed for fuzzy granulation of data to construct the granular space, which greatly improves efficiency compared with serial granulation. Third, in the fuzzy granular space, we design RFGDT to classify the fuzzy granules; it selects important features as tree nodes based on the information gain ratio and avoids overfitting through the proposed pruning algorithm. Finally, we employ datasets from the UC Irvine Machine Learning Repository for verification. Theoretical analysis and experimental results show that RFGDT is efficient, accurate, and robust in solving classification problems.
Classification is an important research topic in data mining. Among classification approaches, the decision tree is very effective. It has the advantages of high classification accuracy, few parameters, and strong interpretability. Decision trees have been widely adopted in business, medical care, and other fields, and have achieved remarkable results. Data generated in daily life is increasing rapidly, which brings both opportunities and challenges for decision trees in large-scale data classification. The decision tree is an example-based inductive learning algorithm. With in-depth research on decision tree algorithms and the diversified needs of practical applications, a variety of learning algorithms and models for constructing decision trees have been proposed.
1.1. Information Entropy Decision Tree
Quinlan proposed the ID3 decision tree algorithm. This algorithm employs information gain from information theory to evaluate feature quality when splitting tree nodes; the feature with the largest information gain and the corresponding split point are used to construct the node. ID3 is a clear and quick method, but it has some obvious shortcomings. First, it is not suitable for datasets containing continuous features; second, it tends to choose conditional features with more values as the optimal splitting features. In the same year, Schlimmer and Fisher proposed the ID4 decision tree method, which constructs decision trees incrementally. Two years later, the ID5 algorithm was presented by Utgoff, which allows the structure of an existing decision tree to be modified by adding new training instances without retraining. In addition, Xiaohu Liu and his colleagues discussed decision tree construction that, when selecting the split feature, considers both the information gain of the conditional feature at the current node and the information gain of the conditional feature at the next node. To address the shortcomings of ID3 in selecting features, Xizhao Wang et al. selected appropriate decision tree branches for merging during construction. This algorithm can increase comprehensibility, improve generality, and reduce complexity. The C4.5 decision tree, proposed by Quinlan et al. in 1993, improved on the performance of ID3. To solve the bias of information gain when selecting split features, this method employs the information gain ratio as the metric for choosing split features.
The C4.5 algorithm can also process continuous features, discrete features, and incomplete datasets, which gives it wider applicability than the ID3 algorithm. When constructing the C4.5 decision tree, a pruning operation is adopted, which can enhance the efficiency of the decision tree, reduce its scale, and effectively avoid overfitting. Domingos and his colleagues presented a very fast decision tree for data streams, called VFDT. The algorithm shortens the training time of the decision tree by using sampling. Hulten designed the CVFDT algorithm, which extended VFDT and required users to give parameters in advance. Recently, researchers have proposed many other decision tree algorithms based on information entropy [9–16].
1.2. Gini Coefficient Decision Tree
In 1984, Breiman and his colleagues designed the CART algorithm, which adopts the Gini coefficient as the metric for feature splitting and chooses the conditional feature with the smallest Gini coefficient as the splitting feature of the tree node. When generating a decision tree, if the purity of a tree node is greater than or equal to a threshold assigned in advance, the node is no longer split, and the majority label of the instances covered by the node is used as the label of the leaf node. Meanwhile, the approach adopts resampling to analyze the accuracy of the constructed decision tree and perform pruning, and the decision tree with high accuracy and the smallest size is selected as the final decision tree. Here, the pruning method is the minimum cost complexity pruning method (MCCP), which can mitigate overfitting, reduce the tree size, and improve interpretability. CART also has its own shortcomings. Due to the limitation of computer memory, it cannot effectively process large-scale datasets. To address this memory limitation, in 1996 Mehta et al. proposed the SLIQ decision tree. When building the decision tree, instances are presorted, and then a breadth-first method is used to choose the optimal split feature. The SLIQ algorithm has the advantages of fast calculation and the ability to process larger datasets. When the data exceed the memory capacity, this method employs feature lists, classification lists, and class histograms to get the solution. However, when the amount of data is sufficiently large, the algorithm still faces the problem of insufficient memory. In 1998, Rastogi et al. presented the PUBLIC decision tree algorithm. The algorithm integrates the tree construction process with the tree adjustment process. 
By calculating the cost function value of a tree node, the algorithm judges whether the node needs to be pruned or expanded. If it is not expanded, the node is marked as a leaf node. Combining the construction process and the adjustment process greatly increases the training efficiency of the PUBLIC decision tree. RainForest is a decision tree framework proposed by Gehrke and his colleagues in 2000. Its aim was to enhance the scalability of decision trees and reduce the consumption of computer memory as much as possible.
1.3. Rough Set Decision Tree
Incorporating rough set theory into the decision tree gives it the ability to handle uncertain, incomplete, and inconsistent data. When using rough sets to generate a decision tree, the main research focus is how to use rough set theory to choose node splitting features. Miao and his colleagues designed a rough-set-based multivariate decision tree algorithm. The method first selects the conditional features in the core to construct a multivariate test, and then generates a new feature to split the node. The advantage of this algorithm is that the training efficiency is relatively high, but because there are too many variables in the nodes, the decision tree is difficult to interpret. Wang et al. designed a fuzzy decision tree on the basis of rough sets, which uses fuzzy integration to keep the results consistent. In 2011, Zhai and his colleagues adopted the fuzzy rough set to generate a decision tree and presented a new selection criterion for fuzzy condition features. Wei et al. constructed a decision tree according to a variable precision rough set, which allows small errors in the classification process and improves the generalization ability of the decision tree. Jiang and his colleagues designed an incremental decision tree learning method via rough sets and applied it to network intrusion detection. In 2012, Hu and others proposed a monotonic ordered mutual information decision tree. This decision tree employs dominance rough set theory to establish a new metric of feature quality for constructing tree nodes. It is resistant to noise, handles monotonic classification problems, and performs well on general classification problems. On the basis of this algorithm, Qian and his colleagues designed a fusion monotonic decision tree. 
The algorithm uses feature selection to generate multiple data feature distributions, constructs multiple decision trees from them, and employs these decision trees to make comprehensive decisions. Pei and his colleagues designed a monotonically constrained multivariate decision tree. The algorithm first uses the ordered mutual information splitting criterion to generate different data subsets, and then optimizes these subsets to build a decision tree.
1.4. Parallel Decision Tree
Serial decision trees have been researched extensively and in depth, and many decision tree models and algorithms have been proposed; however, because of the recursive nature of decision trees and the limitations of computing platforms, research on parallelizing decision trees has been comparatively limited. The following is a brief review of the current research status of parallel decision trees. The research on parallel decision trees started with SPRINT, proposed by Shafer et al. in 1996. The algorithm tries to avoid the problem of insufficient memory by improving the data structure used during the growth of the decision tree, but during calculation of the tree splitting node, the algorithm requires a broadcast from the entire instance to the entire instance. Kufrin et al. discussed the parallelization of decision trees and introduced a parallelization framework in 1997. One year later, Joshi and his team proposed a parallel form of decision tree similar to SPRINT. Unlike the traditional depth-first construction of decision trees, the algorithm grows the decision tree breadth-first within a parallel framework. This method can avoid possible load imbalance. Srivastava and his team presented two parallel decision tree models on the basis of the synchronous construction approach and the asynchronous construction method, but both models suffer from large communication overhead and load imbalance. Shent et al. gave a parallel decision tree that divides the input instances into four subsets to build a decision tree, and applied the generated parallel decision tree to the user authentication problem. With in-depth research on parallel decision trees and the emergence of the MapReduce distributed computing framework, Panda et al. designed a parallel decision tree in 2009, which relies on multiple distributed computing frameworks. 
Walkowiak and his colleagues focused on the parallelization of decision trees and proposed an optimization model of network computing for the distributed implementation of decision trees. In 2012, Yin et al. adopted the scalable regression tree algorithm to give a parallel implementation of the PLANET algorithm under the MapReduce framework. In response to the problem of overlearning or overfitting, Wang Ran et al. proposed an extreme learning machine decision tree on the basis of a parallel computing framework, but the disadvantage of this algorithm is that it can only handle numerical datasets and cannot handle mixed datasets. A parallel C4.5 decision tree has also been proposed that considers the problem of overfitting and mixed data types. For the ordered mutual information decision tree that is widely used in monotonic classification problems, Mu and his colleagues presented a fast version and gave its parallel implementation. Other parallel decision tree approaches have also been proposed, such as the distributed fuzzy decision tree and the parallel Pearson correlation coefficient decision tree. In addition to the above research, Li and other researchers proposed some classification and alignment algorithms [43–49] from the perspective of granular computing, which have good performance.
In this study, a decision tree is constructed in granular space to solve the classification problem. The main contributions are as follows:
(i) We propose AGRC, which can adaptively give the optimal cluster centers; it is a global optimization method and can avoid falling into a local optimum.
(ii) We design the parallel granulation method based on the above clustering algorithm, which solves the problem of high complexity of traditional serial granulation and enhances the granulation efficiency.
(iii) In granular space, we define fuzzy granules and related operators and select features based on the information gain ratio to construct a fuzzy granular decision tree for classification. To avoid overlearning, we also design the corresponding pruning algorithm. The method presented can solve binary or multiclass classification problems and give feature importance according to the order of the tree nodes generated.
3. The Problem
Let be a classification system. The goal is to design a classification algorithm. During the process, parameters can be obtained by statistic instances. Then, the label of instance can be predicted by the model. The symbols mentioned above can be explained in detail as follows. is an instance set. represents a feature set. denotes the value region of feature . expresses an information function that allocates a value for each feature, that is, . indicates a label set, where is a label w.r.t. instance .
4. The Algorithm Description
To solve the classification problem described in Section 3, the model is presented as follows. First, to enhance efficiency as much as possible, we cluster the data. For this purpose, an adaptive clustering algorithm is designed, which obtains the number of clusters and the cluster centers automatically. Second, on the basis of the number of clusters and the cluster centers, parallel granulation is executed by calculating the distance between instances and cluster centers and dividing the instance set into instance subsets. Third, the problem of instance classification is converted into the problem of granule classification in granular space. Fuzzy granules, related operators, and a cost function are defined in granular space. Splitting nodes are found by solving the cost function to build a fuzzy granular decision tree. Fourth, the pruning algorithm for RFGDT is designed. Finally, the label of a test instance is predicted by RFGDT. The overview is described in Figure 1.
4.1. Theory of AGRC
The K-means algorithm is an unsupervised clustering approach. The cluster centers and their number need to be specified in advance, and the results depend on the initial cluster centers. If the initial cluster centers are not well selected, the algorithm falls into a local optimum. We propose an Adaptive Global Random Clustering (AGRC) algorithm, which adaptively selects the cluster centers and their number; the initial selection of cluster centers is random, which makes it a global optimization approach. The idea is as follows. If the between-cluster variance is large and the intracluster variance is small, the clustering performs well. Hence, can be used as an evaluation indicator, where represents standard deviation of intercluster and denotes standard deviation of inner-cluster. The goal is to increase the ratio continuously until the maximum iteration is reached. In each iteration, we obtain a set of new cluster centers, and the parameters, including the quantity of cluster centers, the cluster centers, and the evaluation value, are added to an evaluation set. When the process is over, the cluster centers corresponding to the largest evaluation value in the evaluation set are taken as the optimal parameters. In each iteration, we select an instance as the next cluster center with a certain probability, until the quantity of cluster centers meets the preset for this iteration. The process is described as follows:
Step I: remove instances with missing feature values.
Step II: normalize feature values of instances into .
Step III: assign maximum iterations , evaluation set (composed of standard deviation of intercluster, standard deviation of inner-cluster, and evaluation value), and iteration .
Step IV: assign current cluster center set and the quantity of current cluster center and generate a random number within as the quantity of cluster center.
Step V: randomly select one instance as a cluster center and set , ,
Step VI: calculate the shortest distance between the remaining instances and all cluster centers and the probability of an instance being selected as the next cluster center is
Step VII: if is selected as cluster center, we set , , .
Step VIII: if , go to Step V; otherwise, go to Step IX.
Step IX: calculate the standard deviation of intercluster and the standard deviation of inner-cluster and update the evaluation set, that is,
Step X: update iteration .
Step XI: if , go to Step XII; or else, go to Step IV.
Step XII: in evaluation set , the cluster center set with the largest ratio of the intercluster standard deviation to the sum of inner-cluster standard deviation, and the quantity of cluster center corresponding to it, (here, expresses the quantity of elements of the set), is the optimal solution.
Step XIII: end.
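The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the selection probability in Step VI is assumed to be proportional to the squared shortest distance (as in k-means++ seeding), the number of centers is assumed to be drawn uniformly from 2 to a hypothetical bound `k_max`, and the standard deviations are computed in one plausible way. Steps I and II (removing incomplete instances and normalizing) are assumed to have been applied already. All function names are hypothetical.

```python
import math
import random


def _dist(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def _mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def seed_centers(data, k, rng):
    """Pick k centers: the first uniformly at random, the rest with
    probability proportional to the squared shortest distance to the
    centers chosen so far (assumed weighting)."""
    centers = [rng.choice(data)]
    while len(centers) < k:
        weights = [min(_dist(x, c) for c in centers) ** 2 for x in data]
        if sum(weights) == 0:  # every point coincides with a chosen center
            break
        centers.append(rng.choices(data, weights=weights, k=1)[0])
    return centers


def evaluate(data, centers):
    """Inter-cluster std divided by the sum of intra-cluster stds."""
    clusters = [[] for _ in centers]
    for x in data:
        nearest = min(range(len(centers)), key=lambda j: _dist(x, centers[j]))
        clusters[nearest].append(x)
    grand = _mean(data)
    inter = math.sqrt(sum(_dist(c, grand) ** 2 for c in centers) / len(centers))
    intra = sum(
        math.sqrt(sum(_dist(x, c) ** 2 for x in pts) / len(pts))
        for c, pts in zip(centers, clusters) if pts
    )
    return inter / intra if intra > 0 else float("inf")


def agrc(data, max_iter=50, k_max=None, seed=0):
    """Keep the randomly seeded center set with the largest evaluation value."""
    rng = random.Random(seed)
    k_max = k_max or max(2, int(math.sqrt(len(data))))
    best_score, best_centers = -1.0, None
    for _ in range(max_iter):
        k = rng.randint(2, k_max)          # random quantity of centers
        centers = seed_centers(data, k, rng)
        score = evaluate(data, centers)
        if score > best_score:             # best entry of the evaluation set
            best_score, best_centers = score, centers
    return best_centers, best_score
```

Because each iteration restarts from a fresh random seeding and only the best-scoring center set is retained, no single poor initialization determines the result, which is the sense in which the text calls the approach global.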
The principle of Adaptive Global Random Clustering is given. On the basis of the principle, the algorithm is as shown in Algorithm 1.
4.2. From Data to Fuzzy Granules
Now, we introduce how to implement parallel fuzzy granulation of data via cluster centers. We can adopt AGRC to get the cluster center set . Let instance set be and feature set be . If , and , the similarity between and w.r.t. is
where , , and represents the similarity between and on (). A fuzzy granule generated by can be written as
For simplicity, it can also be written as
Here “” represents separator and “+” denotes the union of elements. In other words, is the similarity between and cluster centers. The cardinal number of fuzzy granule is obtained by equation:
Now we give operators of fuzzy granules. For , the operators between fuzzy granules generated by and can be written as follows:
Here , , and is a parameter (). For , , the fuzzy granular array generated by on can be written as follows:
The symbol “+” denotes union and the symbol “–” represents separator. The cardinal number of the fuzzy granular array can be written as
Now, we have the operators between fuzzy granular arrays. Let and be the fuzzy granular arrays generated by the instance and on feature subset , respectively. The operators can be written as follows:
The difference between two fuzzy granular vectors can be written as follows:
From the granulation process, we can see that fuzzy granules and fuzzy granular arrays are generated together with their operators. The set of fuzzy granules constitutes a space called the fuzzy granular space.
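The granulation described above can be illustrated with a short Python sketch. The similarity function is an assumption: for feature values normalized to [0, 1], it is taken to be one minus the absolute difference. A fuzzy granule is represented as the list of similarities between an instance and every cluster center on one feature, a fuzzy granular array as the list of granules over all features, and the cardinal number as the sum of the memberships. The thread pool is only a stand-in for the parallel model; the names `similarity`, `granulate`, `cardinality`, and `granulate_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def similarity(u, v):
    """Assumed single-feature similarity for values normalized to [0, 1]."""
    return 1.0 - abs(u - v)


def granulate(instance, centers):
    """Fuzzy granular array: one granule (list of similarities to the
    cluster centers) per feature of the instance."""
    return [
        [similarity(instance[a], c[a]) for c in centers]
        for a in range(len(instance))
    ]


def cardinality(granule):
    """Cardinal number of a fuzzy granule: the sum of its memberships."""
    return sum(granule)


def granulate_all(instances, centers, workers=4):
    """Granulate instances independently; each one needs only the centers,
    so the work can be distributed without coordination."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda x: granulate(x, centers), instances))
```

Since each instance is compared only with the cluster centers rather than with every other instance, the cost per instance grows with the number of centers instead of the size of the dataset, which is what makes the granulation easy to parallelize.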
Theorem 1. For , , , the similarity of fuzzy granules satisfies the following equation:
Proof 1. According to equation (7), we have and . According to equation (5), we have . As a consequence, and are established. Due to and , we can have and . Equations (13) and (14) show and . Therefore, and are both established. From , we can get . Dividing both sides of the inequality by , we obtain . That is, the inequality is established.
Theorem 2. For , feature subset and satisfy . Let and be fuzzy granular arrays on feature set and . Inequality is established.
Proof 2. According to equation (13), if , then . Due to and , is established. Because of , for , we have and . That is, if , then . In sum, inequality is also established.
Below we give an example to explain the granulation process and measurement.
Example 1. As shown in Table 1, let , , and be instance set, feature set, and label set, respectively. represents the cluster center set and parameter is . The fuzzy granulation is as follows.
We take instance as an example. The similarities between and , , and on feature , , and are , , , , , and , respectively. According to equation (7), fuzzy granules generated by on feature , , and are, respectively.
In the same way, fuzzy granules generated by on feature set are , , and respectively.
According to , (here, when , ; when , ), we have
Similarly, we also obtain
Hence, the distance between instance and on feature set with is
4.3. Random Fuzzy Granular Decision Tree
RFGDT can embody structure and express the course of classifying instances on the basis of features. It includes a series of if-then rules, or it can also be regarded as a conditional probability distribution calculated in fuzzy granular space and label space. The strength is that this algorithm is readable and its efficiency is high. When learning, we employ the training data to generate a RFGDT on the basis of minimizing cost function. When predicting, test data are classified via the model. There are three steps in learning of RFGDT, namely, selecting feature, constructing tree, and pruning tree.
RFGDT is to describe the feature structure for classifying fuzzy granules in the fuzzy granular space, which can be composed of directed edges and nodes. A feature can be expressed by an internal node, and a label can be denoted by a leaf node.
During classification, the model starts from the root node, tests a certain fuzzy granule of the instance, and assigns the fuzzy granule to a child node according to the result. Each child node corresponds to one value of the tested feature. In this way, fuzzy granules are tested and allocated recursively until they reach a leaf node. Finally, fuzzy granules are assigned the label of that leaf node.
Now, the definition of the fuzzy granular rule set is written as follows.
Definition 1. Suppose that is a decision system, where is an instance set, is a feature set, is a label set, and is a cluster center set. For , a fuzzy granular array can be generated by instance . Then, fuzzy granular space can be generated by , that is, where , are fuzzy granular array coefficients.
Suppose that , , a rule can be made up of fuzzy granular array and label. Thus, a rule set can be constructed.
4.3.1. IF-Then Rule
RFGDT can also be regarded as a set of if-then rules. An RFGDT can be converted into an if-then rule set as follows: a rule is constructed for every path from the root node to a leaf node; the features of the internal nodes on the path correspond to the conditions of the rule, and the label of the leaf node corresponds to its conclusion. The paths of an RFGDT (and the corresponding if-then rule set) have a key property: they are mutually exclusive and complete. This means that every fuzzy granular array is covered by exactly one path or rule. Coverage here means that the features of the fuzzy granular array are consistent with the features on the path.
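The conversion from tree to rule set can be sketched as a path enumeration. The nested-dictionary tree layout and the feature names below are hypothetical and chosen only for illustration; any tree structure with labeled leaves would serve.

```python
def extract_rules(tree, conditions=()):
    """Yield (conditions, label) for every root-to-leaf path.

    A leaf is any non-dict value (the label). An internal node is a dict
    {"feature": name, "children": {value: subtree}} (hypothetical layout).
    """
    if not isinstance(tree, dict):
        yield list(conditions), tree
        return
    for value, child in tree["children"].items():
        # Extend the path with the test made at this internal node.
        yield from extract_rules(child, conditions + ((tree["feature"], value),))


# Usage: a two-level tree yields one rule per leaf.
example_tree = {
    "feature": "a1",
    "children": {
        "low": "no",
        "high": {"feature": "a2", "children": {"low": "no", "high": "yes"}},
    },
}
rules = list(extract_rules(example_tree))
```

Each leaf is reached by exactly one path, so the number of rules equals the number of leaves, and the branch values at every internal node partition the inputs; this is why the rule set is mutually exclusive and complete.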
The learning of RFGDT generalizes a set of classification rules from training data. There may be more than one RFGDT that is consistent with the training data (i.e., that classifies the training data correctly). Our purpose is to find an RFGDT that has little contradiction with the training data and has strong generalization ability. In other words, RFGDT learning estimates a conditional probability model from training data. Infinitely many conditional probability models exist for a division of the fuzzy granular space. The chosen model should not only fit the training data well but also predict the test data well. RFGDT learning uses a cost function to express this aim. As described below, the cost function of RFGDT learning is usually a regularized maximum likelihood function, and the strategy is to minimize the cost function.
The algorithm of RFGDT learning is to recursively select the optimal feature and segment the training data on the basis of the feature, in order that each subdataset can get the best classified results. This process corresponds to the division of fuzzy granular space and the form of RFGDTs. At the beginning, the root node is constructed and all fuzzy granular arrays are placed at the root node. The algorithm chooses an optimal feature, and divides the training data into several subsets on the basis of this feature, in order that each subset has the best classification under the current conditions. If these subsets have been correctly classified, then the algorithm constructs leaf nodes and divides these subsets into the corresponding leaf nodes; if there are still subsets that cannot be correctly classified, then the algorithm selects new optimal features for these subsets, continues to segment them, constructs corresponding nodes, and proceeds recursively until all training data subsets are basically classified correctly or there is no suitable feature. Finally, each subset is divided into leaf nodes, i.e., there are clear categories. This generates a RFGDT.
The RFGDT produced may have better classification ability for training data but may have worse classification ability for test data, that is, overfitting may occur. We need to prune the tree from bottom to top to let the tree be simpler in order that it can enhance its generalization ability. Specifically, it is to remove the leaf nodes that are too subdivided, make them fall back to the parent node, or even an ancestor node, and then modify the parent node or an ancestor node to a new leaf node.
If the quantity of features is large, the features can also be chosen at the beginning of the RFGDT learning, leaving only features that have sufficient classification ability for training data. We can draw a conclusion that the learning algorithm includes feature selection, RFGDT construction, and RFGDT pruning. Since RFGDT denotes conditional probability distribution, RFGDTs of different depths correspond to probability models of different complexity. The generation of RFGDT corresponds to local selection of the model, and the pruning of RFGDT is related to global selection of the model. The generation of the RFGDT only considers the local optimum, while the pruning of the RFGDT considers the global optimum.
4.3.3. Feature Selection and Cost Function Construction
Selecting important features can enhance the efficiency of RFGDT learning. If the result of using a feature for classification is not very different from the result of random classification, the feature is said to have no classification ability. Empirically, discarding such features has little effect on accuracy. Here, we redefine the information gain ratio and use this criterion as the cost function for constructing an RFGDT. First, the empirical entropy of the dataset is defined by
where denotes the set composed of fuzzy granular arrays, expresses the quantity of elements of the set, is the subset composed of fuzzy granular arrays of which classification is , and is the quantity of elements of the set. Stipulate . It can be seen from the definition that entropy only depends on the distribution of and has nothing to do with the value of . The greater the entropy is, the greater the uncertainty of the random variable is. can be verified from the definition.
The empirical conditional entropy of feature on the fuzzy granular array set is
where is the subset composed of fuzzy granular arrays taking the value on the feature , is the quantity of elements of the subset , is the subset composed of fuzzy granular arrays that takes the value on feature and the label , and is the quantity of elements of the subset .
The information gain is calculated as follows:
We can now write the ratio of information gain as
Here, ( denotes the quantity of taking the value on feature ).
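As an illustration of the quantities defined in this subsection, the following Python sketch computes the empirical entropy, the empirical conditional entropy, the information gain, and the information gain ratio. It assumes the standard textbook forms of these quantities with base-2 logarithms, and it uses discrete feature values as a stand-in for the values taken by fuzzy granular arrays on a feature; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Empirical entropy of a list of discrete outcomes (labels or values)."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def conditional_entropy(values, labels):
    """Empirical conditional entropy of the labels given a feature's values."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    return entropy(labels) - conditional_entropy(values, labels)


def gain_ratio(values, labels):
    """Information gain divided by the entropy of the feature's own values."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0
```

For example, a feature whose two values separate two equally frequent labels perfectly has gain ratio 1, while a feature whose values are independent of the labels has gain ratio 0. Dividing by the entropy of the feature's own values is what counteracts the bias of plain information gain toward features with many values.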
4.3.4. RFGDT Generation
We adopt the information gain ratio criterion to select features and recursively build a RFGDT. The specific method is as follows.
The information gain ratio for each feature of each subdataset is calculated in the Map stage. Then, in the Reduce stage, the information gain ratios on the corresponding features of each subdataset are summed. The feature with the largest sum of information gain ratio is chosen as the feature of the node, and the child nodes are constructed from the different feature values. We call the above approach recursively on the child nodes to build an RFGDT until the information gain ratios of all features are very small or there are no features to choose from. The algorithm is as follows:
Step I: is randomly divided into subfuzzy granular array sets .
Step II: in the Map phase, the Map function uses each feature as the key and the information gain ratio as the value, namely, .
Step III: in the Reduce phase, the Hadoop distributed system first aggregates the output results of all Map functions according to the key, and then uses these aggregated intermediate results as the input to the Reduce phase. The intermediate results after aggregation are as follows:
Step IV: if all fuzzy granular arrays in belong to the same class , the algorithm sets as a single-node tree, adopts as the label of the node, and returns .
Step V: if , set as a single-node tree, use the label with the largest number of fuzzy granular arrays in as the class of the node, and return .
Step VI: or else, calculate the sum of the information gain ratio of each feature in to according to equation (28), and select the feature with the largest sum of information gain ratio as the split feature, which can be written as
Step VII: if the information gain ratio of is less than the threshold , set as a single-node tree, and use the class with the largest number of fuzzy granular arrays in as the label of the node, and return .
Step VIII: if not, for each possible value of , according to , divide into a number of nonempty subsets , the label with the largest number of fuzzy granular arrays in is used as a mark, and the subnodes are constructed. The tree is formed by the nodes and their subnodes, and return .
Step IX: for the node , use as training set and as feature set, recursively call Step I to Step VIII, get subtree , and return .
The algorithm is described as in Algorithm 2.
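The Map and Reduce stages of the feature selection can be sketched in plain Python. This is a single-process illustration under assumptions, not the Hadoop implementation and not the authors' Algorithm 2: each subset is assumed to be a pair of rows (dictionaries from feature name to discrete value) and labels, the gain ratio is computed in its standard form, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _entropy(items):
    n = len(items)
    return -sum(c / n * math.log2(c / n) for c in Counter(items).values())


def _gain_ratio(values, labels):
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    cond = sum(len(g) / len(labels) * _entropy(g) for g in groups.values())
    split = _entropy(values)
    return (_entropy(labels) - cond) / split if split > 0 else 0.0


def map_phase(subset, features):
    """Emit (feature, gain ratio) pairs for one data subset."""
    rows, labels = subset
    return [(f, _gain_ratio([r[f] for r in rows], labels)) for f in features]


def reduce_phase(pairs):
    """Sum the gain ratios emitted for each feature key."""
    totals = defaultdict(float)
    for feature, ratio in pairs:
        totals[feature] += ratio
    return dict(totals)


def best_split_feature(subsets, features):
    """Run Map on every subset, Reduce by key, and pick the largest sum."""
    pairs = [p for s in subsets for p in map_phase(s, features)]
    totals = reduce_phase(pairs)
    return max(totals, key=totals.get), totals
```

Each Map call touches only its own subset, so the subsets can be processed on separate workers; only the small list of (feature, ratio) pairs has to be brought together in the Reduce stage.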
The algorithm recursively generates the RFGDT until it cannot proceed further. The tree generated in this way is often very accurate on the training data but less accurate on the test data, i.e., overfitting occurs. The main reason is that too much consideration is given to classifying the training data correctly, which builds an overly complex tree. The solution is to reduce tree complexity by simplifying the constructed tree. The process of simplifying the constructed tree is called pruning. Specifically, pruning cuts some subtrees or leaf nodes from the constructed tree and uses their root node or parent node as a new leaf node, thereby simplifying the classification tree. The pruning of RFGDT can be achieved by minimizing the overall cost function of the tree.
Suppose that the quantity of leaf nodes of the tree is , is the tree leaf node, the leaf node has fuzzy granular arrays, where there are fuzzy granular arrays in label (). is the empirical entropy on the leaf node , is the parameter (), then the cost function of RFGDT learning can be written as
where . We have
In equation (31), denotes the prediction error of the algorithm on training data, i.e., the fit degree between the algorithm and the training data, expresses model complexity, and parameter controls the impact between them. The larger chooses a simpler model (tree), and the smaller one chooses a more complex model. means that only the fit between the model and the training data is considered, and the complexity of the model is not considered. Pruning is to select the algorithm with the smallest cost function when is determined, i.e., the subtree with the smallest cost function. If is determined, the larger the subtree is, the better the fit is to training data, but the higher the complexity is; on the contrary, the smaller the subtree, the lower the model complexity, but it often does not fit well to the training data. The cost function just shows the balance between the two. RFGDT generation only considers the better fit of the training data by increasing the information gain. The RFGDT pruning also considers the reduction of model complexity by optimizing the cost function. The RFGDT generation is a local learning model, and the RFGDT pruning is a global learning model. The following is a RFGDT pruning algorithm. Step I: empirical entropy of each node is calculated. Step II: recursively retract upward from the leaf nodes of the tree. Suppose that the whole tree before and after a group of leaf nodes retract to its parent node is and , and the corresponding cost function values are and , respectively. If then pruning, that is, the parent node becomes a new leaf node. Step III: go to Step II, until it cannot continue, and get the subtree with the smallest cost function .
The algorithm is shown in Algorithm 3.
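The pruning procedure can be illustrated with a short sketch. It assumes the cost of a tree is the sum over its leaves of (number of arrays at the leaf × empirical entropy at the leaf) plus a complexity parameter times the number of leaves, as described above. The tree layout (nested dicts with `counts` and `children`), the base-2 logarithm, and the names `leaf_cost`, `tree_cost`, `prune`, and `alpha` are assumptions made for illustration. Because the cost is additive over leaves, comparing the whole tree before and after a retraction reduces to comparing the collapsed parent with its current leaves.

```python
import math
from collections import Counter


def leaf_cost(counts, alpha):
    """Cost of one leaf: (size * empirical entropy) + alpha."""
    total = sum(counts.values())
    fit = -sum(n * math.log2(n / total) for n in counts.values() if n)
    return fit + alpha


def tree_cost(node, alpha):
    """Cost of the subtree rooted at node: the sum of its leaf costs."""
    if not node.get("children"):
        return leaf_cost(node["counts"], alpha)
    return sum(tree_cost(child, alpha) for child in node["children"].values())


def prune(node, alpha):
    """Bottom-up pruning: collapse a node whose children are all leaves
    when the collapsed tree does not cost more than the current one."""
    children = node.get("children")
    if not children:
        return node
    for key in children:
        children[key] = prune(children[key], alpha)
    if all(not child.get("children") for child in children.values()):
        if leaf_cost(node["counts"], alpha) <= tree_cost(node, alpha):
            return {"counts": node["counts"]}
    return node


# Hypothetical example tree: class counts are stored at every node.
example = {
    "counts": Counter(a=9, b=1),
    "children": {
        "left": {"counts": Counter(a=5, b=1)},
        "right": {"counts": Counter(a=4)},
    },
}
pruned = prune(example, alpha=1.0)
print("children" in pruned)
```

With a small `alpha` the split is kept because it lowers the entropy term; a larger `alpha` penalizes the extra leaf enough to collapse it.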
4.3.6. Label Prediction
After the RFGDT is constructed, a given test instance is first transformed into a fuzzy granular array, and the trained RFGDT is then used to predict its label. The method is described further in Algorithm 4.
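The prediction step amounts to walking from the root to a leaf. The sketch below covers only that traversal and assumes the instance has already been granulated. The nested-dict tree layout, the fallback to the majority label stored at the current node when a feature value was not seen during training, and the name `predict` are illustrative assumptions.

```python
def predict(tree, instance):
    """Follow the split features from the root until a leaf is reached."""
    node = tree
    while "feature" in node:
        child = node["children"].get(instance[node["feature"]])
        if child is None:
            # Unseen feature value: fall back to this node's majority label.
            break
        node = child
    return node["label"]


# Hypothetical tree that splits on feature 0.
tree = {
    "label": "n",
    "feature": 0,
    "children": {0: {"label": "n"}, 1: {"label": "y"}},
}
print(predict(tree, (1, 0)))
```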
5. Experimental Analysis
This paper employs 4 datasets from the UC Irvine Machine Learning Repository as the data source for the experiments and constructs 8 further datasets with 1% noise and 3% noise, respectively, as shown in Table 2. Tenfold cross-validation was adopted in the experiments. Either 90% or 70% of the data were chosen randomly as the training set, and the remaining data were taken as the test set to execute one verification. This process was repeated ten times. Running time and average accuracy were used as the performance measures. As illustrated in Figure 2, serial granulation, serial clustering fuzzy granulation, and the proposed parallel clustering fuzzy granulation were compared on efficiency. C4.5, Support Vector Machines (SVMs), Convolutional Neural Networks (CNN), and RFGDT were compared on average accuracy (see Figures 3–6). In the RFGDT classifier, the quantity of cluster centers is a key parameter that affects classification performance, such as accuracy. The relation between the quantity of cluster centers and the average accuracy was therefore analyzed.
Serial fuzzy granulation has low efficiency and cannot meet practical needs. If a computing task is parallelizable, the serial task can be converted into a parallel one. This paper adopted parallel granulation via clustering. MapReduce is commonly employed for parallel tasks on large-scale data. Fuzzy granulation of a large-scale dataset can be divided into several subcomputing tasks, and the subdatasets can be assigned to computing nodes by abstracting a hierarchical computing model. Owing to its simple and easy-to-use programming interface, MapReduce has been widely used as a parallel programming model and computing framework. The main idea is as follows. A job is divided into multiple independently runnable Map tasks; these Map tasks are distributed to several processors for execution; intermediate results are generated; and the Reduce tasks combine the intermediate results to produce the final output. The MapReduce calculation process has two parts, namely, Map and Reduce. Map receives the input (see Table 3) and outputs intermediate results in the form of key-value pairs (see Table 4).
The output of Reduce is like , (see Table 5). Without clustering, the complexity of granulation is ; with clustering, the time complexity of granulation can reach (); if fuzzy granulation is performed by the parallel method, the complexity can achieve , where denotes the quantity of subsets, represents the quantity of cluster centers, expresses the quantity of instances, is the quantity of features, and each subset corresponds to a Map.
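The Map, shuffle, and Reduce flow described above can be emulated in a few lines of plain Python. This is only a sketch of the pattern, not the Hadoop job used in the paper: the similarity `1 - |x - c|` on features scaled to [0, 1], the shape of a granule (one list of per-center similarities for each feature), and the names `map_granulate`, `shuffle`, `reduce_granulate`, and `granulate` are all assumptions introduced for illustration.

```python
from collections import defaultdict


def map_granulate(subset, centers):
    """Map task: emit (instance id, granule) pairs for one subset.

    The granule holds, for each feature, the similarity to every cluster
    center. The similarity 1 - |x - c| is an illustrative assumption.
    """
    for instance_id, x in subset:
        granule = [[1.0 - abs(xj - c[j]) for c in centers] for j, xj in enumerate(x)]
        yield instance_id, granule


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_granulate(key, values):
    """Reduce task: each instance id is emitted once, so keep its single granule."""
    return key, values[0]


def granulate(data, centers, n_subsets):
    """Split the data into subsets (one per Map task), then map, shuffle, reduce."""
    subsets = [data[i::n_subsets] for i in range(n_subsets)]
    pairs = [pair for subset in subsets for pair in map_granulate(subset, centers)]
    return dict(reduce_granulate(k, v) for k, v in shuffle(pairs).items())


data = [(0, (0.0, 1.0)), (1, (0.5, 0.5)), (2, (1.0, 0.0))]
centers = [(0.0, 0.0), (1.0, 1.0)]
granules = granulate(data, centers, n_subsets=2)
print(granules[1])
```

In this sketch each Map task compares only its own instances against the cluster centers, so the work per task shrinks as the number of subsets grows, while the result is the same for any number of subsets.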
Figure 2 compares the efficiency of traditional (serial) granulation, serial granulation with clustering, and the proposed parallel granulation with clustering, where the abscissa expresses the quantity of instances and the ordinate denotes the average time taken for granulation. Below represents the quantity of instances and is the quantity of cluster centers. When and , serial granulation took 50 mins, serial clustering granulation took 37 mins, and parallel clustering granulation took only 9 mins. Relative to serial granulation, serial clustering granulation and parallel clustering granulation reduced the running time by 26.00% and 82.00%, respectively. When and , serial granulation took 800 mins, while serial clustering granulation took 461 mins (i.e., a 42.38% improvement). Parallel granulation with clustering took only 115 mins, an improvement of 85.63% and 75.05% over serial granulation and serial clustering granulation, respectively. When and , the running time of parallel granulation with clustering improved by 45.33% and 86.33%, respectively, compared with the other two methods. We can conclude that, as the quantity of instances increases, serial clustering granulation and parallel clustering granulation increase the efficiency to a great extent.
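The percentages quoted above can be checked from the reported running times, reading each one as a relative reduction in running time. The helper name `reduction` is introduced here only for this check.

```python
def reduction(baseline, new):
    """Percentage reduction in running time relative to a baseline."""
    return 100.0 * (baseline - new) / baseline


# First setting: serial 50 mins, serial clustering 37 mins, parallel clustering 9 mins.
print(reduction(50, 37))    # serial clustering vs. serial (reported as 26.00%)
print(reduction(50, 9))     # parallel clustering vs. serial (reported as 82.00%)

# Second setting: serial 800 mins, serial clustering 461 mins, parallel clustering 115 mins.
print(reduction(800, 461))  # serial clustering vs. serial (reported as 42.38%)
print(reduction(800, 115))  # parallel clustering vs. serial (reported as 85.63%)
print(reduction(461, 115))  # parallel vs. serial clustering (reported as 75.05%)
```

In the first setting both percentages are measured against serial granulation; measured against serial clustering granulation instead, the parallel method's reduction would be `reduction(37, 9)`, roughly 76%.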
Taking 90% of the Wine Quality dataset as the training set, we obtained the following results. As shown in Figure 3(a), when the quantity of cluster centers was less than 3006, the average accuracy of RFGDT was lower than that of the other three methods. As the quantity of cluster centers rose, the average accuracy of RFGDT increased rapidly. In particular, when the quantity of cluster centers was 3006, it reached a peak value of 0.969, while the average accuracies of SVMs, C4.5, and CNN were 0.961, 0.952, and 0.948 (i.e., improvements of 0.83%, 1.79%, and 2.22%, respectively). When the quantity of cluster centers was greater than 3006, the average accuracy of RFGDT decreased slightly but remained higher than that of the other three methods. After adding 1% noise to the data, as illustrated in Figure 3(b), when the quantity of cluster centers was 3006, the average accuracy of RFGDT reached a peak value of 0.964, while the average accuracies of SVMs, C4.5, and CNN were 0.923, 0.914, and 0.892, respectively; RFGDT improved on them by 4.44%, 5.47%, and 8.07%, respectively. Comparing these two datasets, the accuracies of SVMs, C4.5, and CNN dropped by 3.95%, 3.99%, and 5.91%, respectively, whereas the peak value of RFGDT dropped by only 0.52%. These figures indicate that RFGDT is more robust and stable under noise. When 3% noise was added to the data, as exhibited in Figure 3(c), SVMs, C4.5, CNN, and RFGDT decreased by about 0.54%, 1.97%, 0.56%, and 0.42%, respectively, compared with the data with 1% noise. After that, we took 70% of the data as the training set to verify the performance. Overall, the average accuracies of the four methods declined. As shown in Figures 3(d)–3(f), RFGDT performs better than the other three algorithms when the number of cluster centers is more than 2000.
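The improvement percentages quoted above are consistent with a relative gain over each baseline accuracy. This reading is inferred from the reported numbers and is not a formula stated in the text. For the noise-free peak, with $a$ denoting average accuracy (a symbol introduced only for this illustration):

```latex
\frac{a_{\mathrm{RFGDT}} - a_{\mathrm{SVMs}}}{a_{\mathrm{SVMs}}}
  = \frac{0.969 - 0.961}{0.961} \approx 0.83\%,
\qquad
\frac{0.969 - 0.952}{0.952} \approx 1.79\%,
\qquad
\frac{0.969 - 0.948}{0.948} \approx 2.22\%.
```

The drops under noise follow the same pattern, for example $(0.961 - 0.923)/0.961 \approx 3.95\%$ for SVMs and $(0.969 - 0.964)/0.969 \approx 0.52\%$ for RFGDT.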
As illustrated in Figure 4(a), the Bank Marketing dataset contained nearly 50,000 instances, about 10 times the scale of the Wine Quality dataset. The average accuracy curve of RFGDT was similar in shape to that in Figure 3: high in the middle and low on both sides. When the quantity of cluster centers was , the average accuracy of RFGDT reached its largest value of 0.963, while those of SVMs, C4.5, and CNN were 0.951, 0.947, and 0.955, respectively (i.e., improvements of 1.26%, 1.69%, and 0.84%, respectively). On the Bank Marketing dataset with noise, as demonstrated in Figure 4(b), RFGDT reached a peak value of 0.956 at . Compared with SVMs, C4.5, and CNN, RFGDT was higher by 3.13%, 4.82%, and 2.69%, respectively. The accuracy of RFGDT dropped by 0.73%, while those of SVMs, C4.5, and CNN dropped by 2.59%, 3.84%, and 2.69%, respectively. As can be seen, RFGDT is not sensitive to noise, whereas C4.5 is more sensitive to noise. Figure 4(c) shows that the accuracies of all four algorithms decreased when the proportion of noisy data was 3%. However, RFGDT outperformed SVMs, C4.5, and CNN by about 3.61%, 5.22%, and 2.16% in average accuracy, respectively. When 70% of the data were taken as the training set, the accuracies of the four algorithms decreased compared with using 90% of the data as the training set. However, as shown in Figures 4(d)–4(f), RFGDT outperforms the other three algorithms under most parameters.
The Localization Data for Person Activity dataset contained more than 160,000 instances. As illustrated in Figure 5(a), without noise, when , the average accuracy curve of RFGDT reached a peak value of 0.953, while SVMs achieved 0.932, C4.5 achieved 0.922, and CNN achieved 0.947 (improvements of 2.25%, 3.36%, and 0.63%, respectively). CNN performs better than SVMs, SVMs performs better than C4.5, and RFGDT is slightly better than CNN. On the dataset with noise, as demonstrated in Figure 5(b), the peak value of RFGDT was higher than those of SVMs, C4.5, and CNN by 4.30%, 4.86%, and 2.38%, respectively. Compared with SVMs and C4.5, RFGDT and CNN are less sensitive to noise. As shown in Figure 5(c), when noise accounted for 3% of the data, the accuracies of all four algorithms decreased. However, RFGDT achieved an average accuracy of 0.932 at K = 119,031, while SVMs, C4.5, and CNN reached only 0.893, 0.887, and 0.912, respectively (i.e., improvements of 4.37%, 5.07%, and 2.19%). Figures 5(d)–5(f) compare the performance with 70% of the data as the training set and show that RFGDT performs better than SVMs, C4.5, and CNN under particular parameter settings.
The dimension of the IDA2016Challenge dataset is much higher than that of the other three datasets. We took 90% and 70% of the data as training sets, respectively, and on this basis added 1% and 3% noise to the dataset. The detailed results are as follows. As shown in Figure 6(a), when , RFGDT achieved a peak value of 0.956, while SVMs, C4.5, and CNN reached only 0.932, 0.921, and 0.947, respectively (i.e., improvements of about 2.58%, 3.80%, and 0.95%, respectively). After adding 1% noise, as illustrated in Figure 6(b), the highest value of RFGDT was 0.947, which was higher than those of SVMs, C4.5, and CNN by about 4.30%, 5.11%, and 2.38%, respectively. After adding 3% noise, as demonstrated in Figure 6(c), the accuracies of all four algorithms decreased, but RFGDT remained better than the other three algorithms. When 70% of the data are taken as the training set, RFGDT still outperforms SVMs, C4.5, and CNN, as exhibited in Figures 6(d)–6(f).
Besides the datasets mentioned above, we also applied the algorithm to predict Alzheimer’s disease from voice. This dataset was from the University of Pittsburgh and was stored in the form of speech and text from participants including elderly controls, people with possible Alzheimer’s disease, and people with other dementia diagnoses. The corpus included 1263 instances. Mel-frequency Cepstral Coefficients (MFCC) were extracted from the corpus as features for prediction. We calculated the first 20 dimensions, their first-order difference, and their second-order difference, which were concatenated to get 57-dimensional features. The maximum precision of RFGDT was 0.932. Predicting Alzheimer’s disease from voice is a simple and low-cost method compared with Magnetic Resonance Imaging, and it is meaningful and valuable for the diagnosis of Alzheimer’s disease.
From the above analysis, we can see that the average accuracy of RFGDT is better than that of SVMs, C4.5, and CNN on the six datasets. On smaller datasets, CNN performs worse than SVMs and C4.5. In particular, on datasets containing noise, the average accuracy of RFGDT is stable and less sensitive to noise. The average accuracy curve of RFGDT is high in the middle and low on both sides. When the value of $K$ (the quantity of cluster centers) is small, RFGDT performs worse than the other three algorithms; as $K$ increases, RFGDT comes to outperform them. We use 10-fold cross-validation, and the test set and training set are obtained randomly. In other words, each time the algorithm is evaluated on a different training set and test set; this introduces randomness, but the performance of the algorithm is still evaluated objectively. The imbalance of instances also affects the performance of the algorithm. For noisy datasets, we also found that RFGDT is more robust. The main reason lies in the fuzzy granulation process: RFGDT embodies a global comparison idea, which can overcome noise interference to some extent. This is also an advantage of RFGDT. At the same time, we found that the choice of $K$ is key to classification. If $K$ is too small, the classification accuracy is reduced; if $K$ is too large, noise increases and the classification accuracy is also reduced. Compared with classical methods, the granulation process costs some time, but it can be executed offline. Moreover, parallel granulation greatly improves the efficiency, and the accuracy of classification can be enhanced in the granular space.
In this study, we propose an RFGDT that is suitable for binary and multiclass classification problems. The algorithm introduces the idea of parallel distributed granulation, which improves the efficiency of data granulation. For the parallel granulation process, we design AGRC. We transform a data classification problem into the fuzzy granular space to find the solution. In the fuzzy granular space, we define fuzzy granules and fuzzy granular arrays on the basis of the operators designed. The information gain ratio is used to select the feature that serves as the split point, and the fuzzy granular decision tree is constructed recursively. To avoid overfitting, we also design a pruning algorithm for RFGDT, which further improves performance. In the future, we will apply it to cloud computing and big data.
The dataset used to support the findings of this study is from the UC Irvine Machine Learning Repository.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported in part by the Scientific Research “Climbing” Program of Xiamen University of Technology under Grant nos. XPDKT20027 and XPDKQ18011, in part by the National Natural Science Foundation of China under Grant nos. 61976183 and 41804019, in part by the Natural Science Foundation of Fujian Province of China under Grant nos. 2019J01850 and 2018J01480, in part by Project of Industry-University-Research of Xiamen University and Scientific Research Institute under Grant no. 3502Z20203064, in part by University Natural Sciences Research Project of Anhui Province of China under Grant no. KJ2020A0660, and in part by the Natural Science Foundation of Anhui Province of China under Grant no. 2008085MF202.
J. C. Schlimmer and D. Fisher, “A case study of incremental concept induction,” in Proceedings of the 1986 National Conference on Artificial Intelligence, pp. 496–501, Philadelphia, PA, USA, August 1986.
P. E. Utgoff, “Id5: An incremental id3,” in Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, MI, USA, June 1988.
X. Liu and S. Li, “An optimized algorithm of decision tree,” Journal of Software, vol. 9, no. 10, pp. 797–800, 1998, (in Chinese).
X. Wang and C. Yang, “Merging-branches impact on decision tree induction,” Chinese Journal of Computers, vol. 30, no. 8, pp. 1251–1258, 2007, (in Chinese).
J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
P. Domingos and G. Hulten, “Mining high-speed data streams,” in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, August 2000.
G. Hulten, L. Spencer, and P. Domingos, “Mining time-changing data streams,” in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 2001.
J. Su and H. Zhang, “A fast decision tree learning algorithm,” in Proceedings of the 21st National Conference on Artificial Intelligence, pp. 500–505, Boston, MA, USA, July 2006.
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, CA, USA, 1984.
M. Mehta, R. Agrawal, and J. Rissanen, “SLIQ: a fast scalable classifier for data mining,” in Proceedings of the 1996 International Conference on Extending Database Technology, Avignon, France, March 1996.
B. Chandra and P. Pallath, “On improving efficiency of SLIQ decision tree algorithm,” in Proceedings of the 2007 International Joint Conference on Neural Networks, Orlando, FL, USA, August 2007.
D. Miao and J. Wang, “Rough sets based approach for multivariate decision tree construction,” Journal of Software, vol. 6, pp. 425–431, 1997, (in Chinese).
J. C. Shafer, R. Agrawal, and M. Mehta, “Sprint: a scalable parallel classifier for data mining,” in Proceedings of the 1996 22nd International Conference on Very Large Data Bases, pp. 544–555, Mumbai, India, September 1996.
M. V. Joshi, G. Karypis, and V. Kumar, “Scalparc: a new scalable and efficient parallel classification algorithm for mining large datasets,” in Proceedings of the 1998 First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pp. 573–579, Orlando, FL, USA, March 1998.
K. Walkowiak and M. Wozniak, “Decision tree induction methods for distributed environment,” Man-Machine Interactions, Springer, Heidelberg, Germany, 2009.