Applying Randomness Effectively Based on Random Forests for Classification Task of Datasets of Insufficient Information

Sug, Hyontai

doi:https://doi.org/10.1155/2012/258054

Journal of Applied Mathematics

Research Article | Open Access

Volume 2012 | Article ID 258054 | https://doi.org/10.1155/2012/258054

Hyontai Sug¹

Academic Editor: Hak-Keung Lam

Received 20 Jul 2012

Revised 08 Oct 2012

Accepted 08 Oct 2012

Published 07 Nov 2012

Abstract

Random forests are known to work well for classification tasks in data mining because they are robust on datasets that contain insufficient information and possibly some errors. Applying random forests blindly, however, may not produce good results, and a dataset from the domain of rotogravure printing is one such case. This paper therefore investigates how to obtain the best classification accuracy when random forests are applied to predict the occurrence of cylinder bands in rotogravure printing. Because random forests can generate good results given an appropriate combination of parameters, such as the number of randomly selected attributes for each split and the number of trees in the forest, we investigate an effective data mining procedure that uses trial random forests to take the properties of the target dataset into account. Experiments show the effectiveness of the suggested procedure with very good results.

1. Introduction

Because rotogravure printing is used for high-volume printing, preventing process delays is important for productivity. During rotogravure printing, however, a series of bands sometimes appears on the cylinder of the printing machine and ruins the printouts. When this happens, a pressman must take appropriate action to remove the bands from the cylinder, which can delay the process by up to several hours. Preventive maintenance is therefore desirable if the occurrence of bands can be predicted accurately in advance [1]. Many researchers have tried to increase the predictive accuracy for this task [2–5], mostly with decision tree-based and neurocomputing-based methods. A known weak point of decision trees is their relatively poor accuracy compared to other data mining methods such as neural networks, because decision trees fragment datasets and prefer majority classes even when the available dataset is small. To overcome this problem, a large number of decision trees can be generated for a single dataset by some random sampling method and used together for classification. Random forests [6, 7] are a representative data mining method that uses many trees for that purpose. Random forests are known to be robust for real-world datasets that may lack sufficient information and may contain missing and erroneous data. The dataset called “cylinder bands” is a real-world dataset from the domain of rotogravure printing with these properties, and the performance of random forests depends on how well the parameter values of the algorithm match the properties of the given dataset. In this paper, we therefore seek the best predictive accuracy of random forests for the cylinder bands by examining the properties of the dataset with trial random forests and an effective search.

Several research results have been published on better classification models for the so-called “cylinder bands” dataset since the first paper on the task [2]. The authors of that paper generated rules with the C4.5 decision tree algorithm [8] to improve the heuristics that predict the occurrence of bands, or a series of grooves, in the cylinder during printing. Because the rules are based on a single decision tree, however, the prediction accuracy is somewhat limited. Other researchers have since tried to find knowledge models with better accuracy.

In an effort to find knowledge models with better performance, fuzzy lattice neurocomputing (FLN) models based on competitive clustering and supervised clustering were suggested [3]. Later, the researchers of FLN models found that the data space can be divided into subspaces based on the class value of each data instance. Depending on the fitness of each data instance to the data space, five fit algorithms were suggested [4]: FLN tightest fit, FLN ordered tightest fit, FLN first fit, FLN selective fit, and FLN max tightest fit. A fit is called tightest if the lattice-join of any data instance in the same class causes a contradiction. FLN tightest fit was the first of the five FLN models, and FLN ordered tightest fit has the best accuracy among the fuzzy lattice neurocomputing models. Training an FLN model is a polynomial-time algorithm, so it can take a long computing time when the input data is large [9].

Other researchers tried to find knowledge models with better performance based on randomness in attribute selection and in training datasets. The random subspace method [6] selects subsets of attributes randomly and applies aggregation to find better classification models. The SubBag method [5] combines BAGGING [10] with the random subspace method. BAGGING stands for Bootstrap AGGregatING: several equally sized training sets are made by sampling with replacement, and the trained knowledge models vote for classification or prediction. These ensemble methods were combined with a decision tree algorithm based on C4.5 and with a rule generator named JRip, which is based on RIPPER (Repeated Incremental Pruning to Produce Error Reduction) [11]. According to experiments with a variety of datasets, the RIPPER algorithm gave better accuracy and could treat larger datasets than the rule generation method of C4.5, and the algorithm is also known to be robust for noisy datasets. SubBag and BAGGING with JRip showed results competitive with FLN tightest fit.
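The BAGGING scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the cited works: the function names and the `train` callback (any function that maps a training set to a callable model) are hypothetical and introduced here only for the example.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) instances from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagging_predict(data, train, instance, n_models=50, seed=0):
    """Train n_models on bootstrap samples and return the majority vote.

    `train` is a hypothetical callback: it takes a list of training
    instances and returns a function that maps an instance to a class label.
    """
    rng = random.Random(seed)
    models = [train(bootstrap_sample(data, rng)) for _ in range(n_models)]
    votes = Counter(model(instance) for model in models)
    return votes.most_common(1)[0][0]
```

Each model sees an equally sized training set, and the final label is whichever class receives the most votes.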

More recently, random forests were tried with some fixed parameter values. Because each decision tree in a random forest is independent, parallel training of the decision trees was tried with a concurrent programming language called Erlang [12]. Boström generated the random forests with a decision tree algorithm that uses information gain [13]. Random forests of 100 trees, 1,000 trees, 10,000 trees, and 100,000 trees were generated and showed results comparable to FLN tightest fit and the SubBag method. Table 1 summarizes all the previous works.

2. The Method

2.1. Random Forests

Random forests, suggested by Breiman [7], are based on BAGGING, use many decision trees with some random selection of attributes to split each node in the tree, and do no pruning. In other words, random forests use the bootstrap method [14] to sample a training set, and each training set is used to build one tree. Because the bootstrap method samples with replacement, each training set can contain duplicate instances, which can compensate somewhat for the insufficiency of training data.
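How many duplicates a bootstrap sample contains can be estimated directly. The sketch below uses the standard result that an instance is missed by all n draws with probability (1 - 1/n)^n; the function name is an assumption made for this example, and 540 is the size of the cylinder bands dataset used later in the paper.

```python
def expected_unique_fraction(n):
    """Expected share of distinct instances in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

# For a dataset of 540 instances, roughly 63% of the instances are expected
# to appear in a given bootstrap sample; the rest of the sample is duplicates.
print(round(expected_unique_fraction(540), 3))
```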

After sampling, a conventional decision tree generation algorithm such as C4.5 or CART can be applied, but without pruning. When random selection of attributes is applied to split each node, the number of candidate attributes for the split is limited to a predefined number, say R. R may be given by the user, or a default value can be used. The default value is the first integer less than log₂(number of attributes) + 1 [7, 15], and half and double of that number are also recommended for further search [16]. The choice of R thus determines the degree of randomness in tree generation.
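As a worked example of the default value, the sketch below computes the default number of candidate attributes together with its half and double. It assumes that "the first integer less than log₂(number of attributes) + 1" means the floor of that expression; the function name and the dictionary keys are introduced here for illustration only.

```python
import math

def candidate_split_sizes(num_attributes):
    """Default number of candidate attributes per split, plus half and double."""
    default = math.floor(math.log2(num_attributes) + 1)
    half = max(1, default // 2)
    double = min(num_attributes, 2 * default)
    return {"half": half, "default": default, "double": double}

# The cylinder bands dataset has 39 conditional attributes.
print(candidate_split_sizes(39))  # {'half': 3, 'default': 6, 'double': 12}
```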

The other factor that affects the accuracy of random forests is the number of decision trees in the forest, say t. Because the trees are generated from samples drawn by random sampling with replacement, an appropriate t value can compensate for the insufficiency of training data. According to Breiman, tens to hundreds of decision trees are enough, because thousands of trees may not perform better than a smaller number of trees. Moreover, the accuracy of random forests can differ depending on how many trees are in the forest, but a small difference in the number of trees may not change the accuracy.

2.2. Optimization Procedure

Decision tree algorithms tend to neglect minor classes to achieve the best overall accuracy, so a smaller R value in random forests can alleviate this tendency. Minor classes are classes that have fewer instances, possibly with more conflicting class values. Which classes count as minor or major depends on the composition of the given dataset. Note that setting R = (the number of attributes) turns each tree of the random forest into a conventional decision tree without pruning. Moreover, because preparing training sets by random sampling with replacement, or bootstrapping, has the effect of oversampling, it can duplicate training instances in a way that results in better accuracy. An appropriate combination of R and t values can therefore generate better results. In other words, because an appropriate t value can supplement training instances for better accuracy and an appropriate or smaller R value can mitigate the decision tree's tendency to neglect minor classes, we can find the best random forests by searching over both.

The default R value is often used together with some fixed t value in the belief that these values suit the dataset at hand, because they are commonly recommended and used by other researchers [12, 17, 18]. However, the parameters should be set to reflect the fact that there may not be enough instances for training.

This is the reason why we generate trial random forests in three different ways, each with a given number of trees: R = the number of attributes, R = the default number of attributes to pick randomly, and R = 1. Note that with R = the number of attributes, the splitting criterion of the decision tree algorithm is applied fully, as in conventional decision tree algorithms. By setting R smaller, we weaken the influence of the splitting criterion so that the decision tree's preference for major classes is mitigated.
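The effect of the number of candidate attributes on a single split can be illustrated as follows. This is a simplified sketch, not the tree builder used in the experiments: `gains` is a hypothetical mapping from attribute name to a split score such as information gain, and the function name is an assumption.

```python
import random

def pick_split_attribute(gains, r, rng):
    """Pick the best-scoring attribute among r randomly chosen candidates.

    gains: dict mapping attribute name -> split score (hypothetical input).
    """
    candidates = rng.sample(sorted(gains), r)
    return max(candidates, key=gains.get)
```

With r equal to the number of attributes, the attribute with the highest score always wins, as in a conventional decision tree. With r = 1, the split attribute is chosen purely at random and the splitting criterion plays no role in the choice of attribute.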

Our target dataset, “cylinder bands,” has 540 instances in total, two classes, “band” and “no band,” and 39 conditional attributes, such as the nominal attribute “cylinder number” and the numeric attribute “viscosity.” Class “band” has 228 instances and class “no band” has 312. The dataset is therefore small, which means that we may not have enough instances for accurate classification. The procedure to find the best random forests is shown in Procedure 1.

Procedure 1:
Begin
  Check whether the grid search could be effective by generating trial random forests;
  /* R: the number of attributes picked randomly to split each node */
  R := 2 × (the first integer less than log₂(number of conditional attributes) + 1);
  I := 100; F := 1000; D := 25;
  Do
    For t := I to F step D
      /* Generate random forests of t trees in which R attributes are picked
         randomly to split each node */
      Generate Random_forests(R, t);
    End For;
    If R = 1 Then exit the loop;
    R := the smallest integer not less than R/2;
  Loop;
End.
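The search in Procedure 1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the experiments: the forests for R = 1 are generated before the loop stops, R is halved by rounding up, the cap of R at the number of attributes is an added safeguard, and `evaluate` is a hypothetical callback that builds a random forest for a given (R, t) pair and returns its cross-validated accuracy.

```python
import math

def search_schedule(num_attributes, first=100, last=1000, step=25):
    """Yield the (R, t) pairs in the order the procedure visits them."""
    # Initial R: double the default, capped at the number of attributes
    # (the cap is an assumption, not part of the procedure).
    r = min(num_attributes, 2 * math.floor(math.log2(num_attributes) + 1))
    while True:
        for t in range(first, last + 1, step):
            yield r, t
        if r == 1:
            break
        r = math.ceil(r / 2)

def grid_search(evaluate, num_attributes):
    """Return (accuracy, R, t) for the best setting found.

    `evaluate(r, t)` is a hypothetical callback returning an accuracy.
    Ties on accuracy are broken in favor of larger R, then larger t.
    """
    return max((evaluate(r, t), r, t) for r, t in search_schedule(num_attributes))
```

For 39 attributes the schedule visits R = 12, 6, 3, 2, and 1, with 37 tree counts each (100, 125, ..., 1000), so 185 forests are generated in total.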

In Procedure 1, there are four parameters to be defined: I, F, D, and R. I represents the initial number of trees in the random forests, F represents the final number of trees, and D represents the increment of the number of trees in the for-loop. I and F are set to 100 and 1,000, respectively, in the experiment. I was set to 100 so that a sufficiently small number of trees is considered. F was set to 1,000 because this value showed the best results on average in Boström's experiment [12]. In that experiment, forests of 100 trees, 1,000 trees, 10,000 trees, and 100,000 trees were generated for 34 datasets with the default R value and ranked 1 to 4 by accuracy. The average ranks of 100 trees, 1,000 trees, 10,000 trees, and 100,000 trees are 3.12, 2.06, 2.44, and 2.38, respectively. For the cylinder bands dataset, the accuracies of 1,000 trees and 100,000 trees are 79.81% and 80.19%, respectively. But because 1,000 trees has the best average rank over the 34 datasets, we use 1,000 trees for generality. D was set to 25 because we found that increments smaller than 25 generated almost the same accuracies. One may set a smaller D value as R becomes smaller in order to search more finely. R represents the number of randomly selected attributes used to generate each decision tree in the random forests. It is initialized to double the first integer less than log₂(number of attributes) + 1. The initial value of R was inspired by Breiman's recommendation, because a smaller R value can generate better results in most cases [16, 19]. But one may set the initial value to the total number of attributes if a more thorough search is necessary. Such rare cases can also be detected by inspecting the trial random forests. For example, if the accuracy of the random forests with R = the number of attributes is greater than the accuracy of the random forests with R = the default number of attributes to pick randomly, or with R = 1, we should initialize R with the total number of attributes.
On the other hand, if the accuracy of the random forests with R = 1 is greater than the accuracy of the random forests with R = the default number of attributes to pick randomly, we can set R as above and do the grid search. The R value is decreased during iteration. We let R decrease all the way down to 1 because the dataset is small, which means we may not have enough information for accurate classification, so we want the randomness in the tree building process to be maximized as the search proceeds. For details about Random_forests(), refer to Breiman's paper [7].
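The rule for choosing the initial number of randomly selected attributes from the trial forests can be written as a small function. This is a hedged reading of the text, not a definitive rule: it assumes the third trial uses one attribute per split, it interprets "greater than ... or ..." as "greater than both", and it treats every remaining case like the usual one. All names are hypothetical.

```python
def choose_initial_r(acc_all, acc_default, acc_one, num_attributes, default_r):
    """Suggest the initial R for the grid search from three trial accuracies.

    acc_all, acc_default, acc_one: accuracies of trial forests built with
    R = number of attributes, R = default, and R = 1 (assumed third trial).
    """
    if acc_all > acc_default and acc_all > acc_one:
        # Rare case: the full splitting criterion helps, so search from the top.
        return num_attributes
    # Usual case described in the text (the R = 1 trial beats the default);
    # handling all other outcomes the same way is an assumption.
    return 2 * default_r
```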

3. Experiments

3.1. Experiments for the Dataset “Cylinder Bands”

The dataset was obtained from the UCI machine learning repository [20]. The number of attributes is 39. Among the 39 attributes, 19 are nominal and the other 20 are numeric. About 4.8% of the attribute values are missing.

We first check whether our suggested method could find better random forests effectively by generating trial random forests. If we generate random forests with R = 39, the number of attributes, the accuracy of the random forests is 61.4815%, and the accuracy of each class is 18.4% for class “band” and 92.9% for class “no band” with 10-fold cross-validation. Because the dataset has 39 attributes, this setting is like a conventional decision tree without pruning to which the bootstrap method is applied. From these trial random forests, we can see that the class “band” has very limited data instances for correct classification.
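The reported overall accuracy can be checked against the per-class accuracies and the class sizes given earlier (228 "band" and 312 "no band" instances). The sketch below is only an arithmetic check; the correct-instance counts are recovered by rounding and are therefore an assumption.

```python
band, no_band = 228, 312

# Correctly classified instances implied by the per-class accuracies.
band_correct = round(0.184 * band)        # 42
no_band_correct = round(0.929 * no_band)  # 290

overall = (band_correct + no_band_correct) / (band + no_band)
majority_baseline = no_band / (band + no_band)

print(f"{overall:.4%}")            # 61.4815%
print(f"{majority_baseline:.2%}")  # 57.78%
```

The trial forest is thus only a few points above the accuracy obtained by always predicting "no band", which is consistent with the observation that the class "band" is poorly classified.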

In order to see whether more randomness and bootstrapping may give better results, we also try random forests with the default R value [15] for the number of attributes to pick randomly and with a smaller, non-default R value. The result is summarized in Table 2.

From Table 2, we can expect to find better accuracy by performing a grid search with smaller R and larger t values when generating random forests. In order to find the best possible results, we decrease the R value from its initial value. But because we do not know exactly which t value will generate the best result for a given R value, and a very small increase in t may generate accuracy similar to the previous ones, we increase t by a given interval as we iterate.

In the experiment, 10-fold cross-validation was used. Hence, the dataset is divided into ten equal subsets, and each subset is used once for testing while the other nine subsets are used for training. Random forests in Weka were utilized for the experiment. Weka is a data mining package written in Java [21]. Table 3 shows the best accuracy based on Procedure 1, in which R and t were varied.
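The 10-fold split described above can be sketched as follows. This is a plain, unstratified split written for illustration; whether the tool used in the experiments stratifies the folds by class is not stated in the text, and the function name and seed are assumptions.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

For the 540 instances of the cylinder bands dataset, each of the ten rounds tests on 54 instances and trains on the remaining 486.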

In each iteration the initial number of trees is 100, the number of trees is incremented by 25 at each step to find the proper number of trees in the forest, and the final number of trees is 1,000. From the results in Table 3, we can see that accuracy improves as the R value decreases. Table 4 shows the confusion matrices of the results for the default and the suggested R values.

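Figure 1 shows ROC curves for the default R value and the suggested R value. The AUC is 88.8% for the former and 91.45% for the latter.

Figure 1: ROC curves, panels (a) and (b).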
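For readers who want to reproduce an AUC figure from classifier scores, the area under the ROC curve equals the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one. The sketch below computes it by comparing all pairs; the function name and the 0/1 label encoding are assumptions, and this is not the routine used to produce Figure 1.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and 0/1 labels (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise comparison is quadratic in the number of instances, which is acceptable for a dataset of a few hundred instances.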

In Table 3, the accuracy of the random forests with 325 decision trees and one randomly selected attribute per split is 85.7407%, and according to our literature survey this is the best accuracy reported so far. Table 5 summarizes the survey to compare the accuracy with that of other methods.

In Table 5, the accuracy of the fuzzy lattice neurocomputing models is given in row 1 [3] and row 2 [4]. Training and testing were done only once in those experiments, so they are less objective than the other experiments. The result for 100,000 trees, which is the best of the concurrent random forests [12], is presented in the 3rd row of the table. It is based on the default value for the number of attributes to pick randomly. It was generated on a Dell PowerEdge R815 server with 48 cores and 64 GB of memory and therefore took considerable computing resources, while our random forests were generated on a Pentium PC with 2 GB of main memory. SubBag with JRip [5] has a somewhat poorer result than the others, as we can see in the 4th row. Of the two experiments using JRip, BAGGING with JRip has the better accuracy, as we can see in the 5th row. 50 JRip classifiers were used for aggregation in that experiment. The 6th row of the table contains the accuracy of a single C4.5 decision tree, which is the basis of the first paper on the dataset [2]. From the values of sensitivity and specificity, we can see that C4.5 tends to neglect minor classes. The last row shows the result of the suggested method. All in all, we can say that our random forests produced a very competitive result. Another advantage of our method is that it is more readily available than the other data mining methods referred to. For example, several data mining tools that provide random forests are available, such as Salford Systems' random forests [22], R [23], and Weka.
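Sensitivity and specificity, which the comparison above relies on, can be computed from a two-class confusion matrix as sketched below. The function name is an assumption. The example counts are not taken from Table 5; they are the counts implied by the trial forest reported earlier (18.4% of 228 "band" and 92.9% of 312 "no band" instances classified correctly, rounded), with "band" treated as the positive class by assumption.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of positives that are detected
    specificity = tn / (tn + fp)  # share of negatives that are detected
    return sensitivity, specificity

# Counts implied by the trial forest with all attributes as candidates.
sens, spec = sensitivity_specificity(tp=42, fn=186, tn=290, fp=22)
print(f"{sens:.3f} {spec:.3f}")  # 0.184 0.929
```

A large gap between the two values, as here, is the pattern that indicates a classifier neglecting the minor class.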

3.2. Experiments for Other Datasets Having the Property of the Number of Attributes < the Number of Instances

In order to see whether the suggested procedure can find better results than the conventional application of random forests, three other datasets from different domains in the UCI machine learning repository, called “Bridges,” “Dermatology,” and “Post Operative,” were tried. Dataset “Bridges” has 12 conditional attributes, 108 instances, and 7 classes. Dataset “Dermatology” has 33 conditional attributes, 366 instances, and 6 classes. Dataset “Post Operative” has 8 conditional attributes, 90 instances, and 3 classes. Note that all four datasets, including “cylinder bands,” have the property that the number of attributes < the number of instances, as shown in Table 6.

Table 7 shows trial random forests for the three datasets.

As we can see in Table 7, better results could be generated with a smaller R value, so the grid search was performed. Tables 8, 9, and 10 have the results of the grid search for the datasets. Table 8 has the results of experiments for the dataset “Bridges.”

For the “Bridges” dataset, the same best accuracy of 66.9811% was found at two different R values. But while the accuracy was found only once at one of them, it was found 34 times at the other. Table 9 has the results of experiments for the dataset “Dermatology.”

Table 10 has the results of experiments for the dataset “Post Operative.”

For the “Post Operative” dataset, the same best accuracy of 65.5556% was found at two different R values. But while it was found only once at one of them, it was found 19 times at the other. As we can see in Tables 8, 9, and 10, the suggested procedure could find better results for other datasets as well.

3.3. Experiments for Other Datasets Having the Property of the Number of Attributes > the Number of Instances

Because we have so far considered only datasets with the property that the number of attributes < the number of instances, two other datasets from the UCI machine learning repository, “DB world” and “lung cancer,” which have the property that the number of attributes > the number of instances, were also tried. Table 11 summarizes the datasets.

Because the two datasets might have many irrelevant attributes, preprocessing to select major attributes was performed first. It is based on Weka's correlation-based feature subset (CFS) selection method [24] with best-first search. For the “DB world” and “lung cancer” datasets, 46 and 11 attributes were selected, respectively. Table 12 shows the results of trial random forests.

Table 13 has the results of experiments for the dataset “DB world.”

Table 14 has the results of experiments for the dataset “lung cancer.”

Note that the trial random forests of the dataset “lung cancer” have the same accuracy at R = default and R = 1, so the best accuracy is obtained at both values. For comparison, experiments were also done without attribute selection for the datasets “DB world” and “lung cancer.” Table 15 shows the results of trial random forests.

Table 16 has the results of experiments for the dataset “DB world.” As values for R, the whole number of attributes and 1/3 of it were additionally used, because we know that the dataset contains many irrelevant attributes. The setting is based on the idea of Genuer et al. [17].

Table 17 has the results of experiments for the dataset “lung cancer.” As values for R, the whole number of attributes and 1/3 of it were additionally used.

If we compare the best accuracies in Tables 13 and 16 for the dataset “DB world,” the best accuracy for the preprocessed dataset is 95.3125% and that for the original dataset is 90.625%. Moreover, if we compare the best accuracies in Tables 14 and 17 for the dataset “lung cancer,” the best accuracy for the preprocessed dataset is 75.0% and that for the original dataset is 59.375%. Therefore, we can conclude that our method is effective and that the trial random forests reflect well whether the grid search is needed or not.

4. Conclusions

Rotogravure printing is favored for massive printing tasks that produce millions of copies. Hence, it is important to prevent process delays for better productivity. Preventive maintenance is the more desirable way to reduce the delays, provided that the occurrence of bands in the cylinder can be predicted, so more accurate prediction is important. Random forests are known to be robust for missing and erroneous data as well as insufficient information while performing well. Moreover, they can exploit the fast building property of decision trees, so they do not require much computing time for most datasets even though the forests have many trees. Hence, they are well suited to real-world data mining, because many real-world datasets have these properties.

Because random forests are likely to generate better results when the combination of parameters, such as the number of randomly picked attributes (R) and the number of trees in the forest (t), suits the given dataset, an effective procedure that considers the properties of both the dataset and random forests was investigated to find good results. Of R and t, different R values can affect the accuracy of random forests very much, so we suggest generating trial random forests to see whether better results are possible. Among the six datasets used, five showed that R = 1 is the best choice, while one dataset showed that both the default value and R = 1 are the best choices. That R = 1 can be the best choice means that we need maximum randomness in splitting, because the datasets do not have sufficient information for correct classification. So for some datasets the default R value with an appropriate number of trees can be the best choice, but for other datasets a smaller R value can be the best. In this sense, the trial random forests can play the role of a compass for the further grid search.

Acknowledgment

This work was supported by Dongseo University, “Dongseo Frontier Project” Research Fund of 2010.

References

[1] B. Evans and D. Fisher, “Using decision tree induction to minimize process delays in printing industry,” in Handbook of Data Mining and Knowledge Discovery, W. Klösgen and J. M. Żytkow, Eds., pp. 874–880, Oxford University Press, 2002.
[2] B. Evans and D. Fisher, “Overcoming process delays with decision tree induction,” IEEE Expert, vol. 9, no. 1, pp. 60–66, 1994.
[3] V. G. Kaburlasos and V. Petridis, “Fuzzy Lattice Neurocomputing (FLN) models,” Neural Networks, vol. 13, no. 10, pp. 1145–1170, 2000.
[4] A. Cripps, V. G. Kaburlasos, N. Nguyen, and S. E. Papadakis, “Improved experimental results using Fuzzy Lattice Neurocomputing (FLN) classifiers,” in Proceedings of the International Conference on Machine Learning; Models, Technologies and Applications (MLMTA '03), pp. 161–166, Las Vegas, Nev, USA, June 2003.
[5] P. Panov and S. Džeroski, “Combining bagging and random subspaces to create better ensembles,” Lecture Notes in Computer Science, vol. 4723, pp. 118–129, 2007.
[6] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
[7] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[8] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[9] “Big O notation,” 16.070 Introduction to Computers and Programming, MIT, http://web.mit.edu/16.070/www/lecture/big_o.pdf.
[10] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[11] W. W. Cohen, “Fast effective rule induction,” in Proceedings of the 12th International Conference on Machine Learning, pp. 115–123, Tahoe City, Calif, USA, 1995.
[12] H. Boström, “Concurrent learning of large-scale random forests,” in Proceedings of the Scandinavian Conference on Artificial Intelligence, pp. 20–29, Trondheim, Norway, 2011.
[13] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[14] B. Efron and R. Tibshirani, “Improvements on cross-validation: the .632+ bootstrap method,” Journal of the American Statistical Association, vol. 92, no. 438, pp. 548–560, 1997.
[15] Class RandomForest, http://weka.sourceforge.net/doc/weka/classifiers/trees/RandomForest.html.
[16] L. Breiman and A. Cutler, “Random Forests,” http://www.stat.berkeley.edu/users/breiman/RandomForests/.
[17] R. Genuer, J. Poggi, and C. Tuleau, “Random Forests: some methodological insights,” Tech. Rep. inria00340725, INRIA, 2008.
[18] A. Liaw and M. Wiener, “Classification and regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002.
[19] L. Breiman and A. Cutler, “Random Forests,” http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.
[20] A. Frank and A. Asuncion, “UCI machine learning repository,” University of California, School of Information and Computer Science, Irvine, Calif, USA, 2010, http://archive.ics.uci.edu/ml.
[21] WEKA, http://www.cs.waikato.ac.nz/ml/weka/.
[22] “Salford systems-random forests,” http://www.salford-systems.com/en/products/randomforests.
[23] “The R project for statistical computing,” http://www.r-project.org/.
[24] M. A. Hall, Correlation-based feature subset selection for machine learning [Ph.D. thesis], The University of Waikato, Hamilton, New Zealand, 1999.

Copyright

Copyright © 2012 Hyontai Sug. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
