Abstract

Imperfect data streams lead to tree-size explosion and detrimental accuracy problems. Overfitting and imbalanced class distributions reduce the performance of the original decision-tree algorithm for stream mining. In this paper, we propose an incremental optimization mechanism to solve these problems. The mechanism, called Optimized Very Fast Decision Tree (OVFDT), possesses an optimized node-splitting control mechanism. Accuracy, tree size, and learning time are the significant factors influencing an algorithm's performance; naturally, a bigger tree takes longer to compute. OVFDT is a pioneering model equipped with an incremental optimization mechanism that seeks a balance between accuracy and tree size for data stream mining. It operates incrementally by a test-then-train approach. Three types of functional tree leaves improve the accuracy with which the tree model makes a prediction for a new data stream in the testing phase. The optimized node-splitting mechanism controls the growth of the tree model in the training phase. The experiments show that OVFDT obtains an optimal tree structure on both numeric and nominal datasets.

1. Introduction

Decision-tree learning is one of the most significant classification techniques in data mining and has been applied to many areas, including business intelligence, healthcare, and biomedicine. The traditional approach to building a decision-tree, based on greedy search, loads a full set of data into memory and partitions the data into a hierarchy of nodes and leaves. The tree cannot be changed when new data are acquired, unless the whole model is rebuilt by reloading the complete set of historical data together with the new data. This approach is unsuitable for unbounded input data such as data streams, in which new data continuously flow in at high speed. To this end, the incremental approach builds a decision-tree dynamically, so that the tree grows as new data arrive.

A new generation of algorithms has been developed for incremental decision-trees, a pioneer of which, using a Hoeffding bound (HB) in node splitting, is the so-called Very Fast Decision-tree (VFDT) [1]. It builds a decision-tree simply by keeping track of the statistics of the attributes of the incoming data. When sufficient statistics have accumulated at each leaf, a node-splitting algorithm determines whether there is enough statistical evidence in favor of a node-split, which expands the tree by replacing the leaf with a new decision node. This decision-tree learns by incrementally updating the model while scanning the data stream on the fly. This powerful concept is in contrast to a traditional decision-tree that requires the reading of a full dataset for tree induction. The obvious advantage is its real-time data mining capability, which frees it from the need to store all of the data to retrain the decision-tree, because the moving data streams are infinite. A research work [2] proves the feasibility of classification algorithms for analyzing biosignals in the form of infinite data streams, and it also provides a comparison of the traditional decision-tree C4.5 and the incremental decision-tree VFDT in practice.

On one hand, the challenge for data stream mining is associated with imbalanced class distribution. The term “imbalanced data” refers to irregular class distributions in a dataset: for example, a large percentage of training samples may be biased toward one class, leaving few samples that describe the other. Both noise and imbalanced class distribution significantly impair the accuracy of a decision-tree classifier through the confusion and misclassification prompted by the inappropriate data. The size of the decision-tree will also grow excessively large under noisy data. To tackle these problems, some researchers have applied data manipulation techniques to handle imbalanced class distributions, including undersampling, resampling, a recognition-based induction scheme [3], and a feature subset selection approach [4]. On the other hand, despite the difference in their tree-building processes, both traditional and incremental decision-trees suffer from a phenomenon called overfitting when the input data are infected with noise. The noise confuses the tree-building process with conflicting instances. Consequently, the tree size becomes very large and eventually describes the noise rather than the underlying relationship.

With traditional decision-trees, the underperforming branches created by noise and biases are commonly pruned by cross-validating them with separate sets of training and testing data. Pruning algorithms [5] help keep the size of the decision-tree in check; however, the majority are post-pruning techniques that remove irrelevant tree paths after a whole model has been built from a stationary dataset. Post-pruning of a decision-tree in high-speed data stream mining, however, may not be possible (or desirable) because of the nature of incremental access to the constantly incoming data streams.

In this paper, we devise a new version of VFDT, called Optimized VFDT (OVFDT), which provides incremental optimization of prediction accuracy and decision-tree model size. The motivations of OVFDT are as follows. (1) To handle the imbalanced class distribution problem, OVFDT proposes three types of functional tree leaf that improve the classification accuracy. (2) To deal with noisy data in data streams, OVFDT uses an adaptive tie-breaking threshold instead of a user-predefined one. We do not know what the best setting is unless all possibilities have been tried, which is an obstacle for real-world applications. Simulation experiments show that the optimized value of the adaptive tie is ideal for constraining tree growth. (3) To prevent the overfitting problem, OVFDT embeds an incremental optimization mechanism in the node-splitting test that obtains an optimal decision-tree balancing prediction accuracy and model size.

The rest of this paper is structured as follows: Section 2 introduces the research background of decision-tree learning for data streams and the effect of the tie-breaking threshold in tree building. Section 3 presents the details of the OVFDT algorithm in terms of a test-then-train approach. Experiments are described in Section 4. Section 5 concludes.

2. Background

2.1. Decision-Tree in Data Stream Mining

A decision-tree classification problem is defined as follows: N is the number of examples in a dataset of the form (X, y), where X is a vector of attributes and y is a discrete class label. Suppose k is the index of a class label, so that the class label with the kth discrete value is y_k. Attribute x_i is the ith attribute in X and is assigned a value v_ij, where 1 ≤ j ≤ J and J is the number of different values of x_i. The classification goal is to produce a decision-tree model from the N examples that predicts the classes of X in future examples with high accuracy. In data stream mining, the example size is very large or unlimited, N → ∞.

VFDT [1] constructs an incremental decision-tree by using constant memory and constant time-per-sample. It is a pioneering predictive technique that utilizes the Hoeffding bound. The tree is built by recursively replacing leaves with decision nodes. Sufficient statistics of an attribute x_i with a value v_ij belonging to a class y_k are stored in each leaf. A heuristic evaluation function G(·) is used to determine split attributes for converting leaves to nodes. Nodes contain the split attributes, and leaves contain only the class labels. A leaf represents a class according to the sample labels. When a sample enters, it traverses the tree from the root to a leaf, evaluating the relevant attribute at every node. Once the sample reaches a leaf, the sufficient statistics are updated. At this time, the system evaluates each possible condition based on the attribute values; if the statistics are sufficient to support one test over the others, then the leaf is converted to a decision node. The decision node contains the number of possible values for the chosen attribute according to the installed split test. The main elements of VFDT include, first, a tree-initializing process that initially contains only a single leaf and, second, a tree-growing process that contains a splitting check using a heuristic function G(·) and the Hoeffding bound (HB). VFDT uses information gain as G(·).

The formula of HB is shown in (1). HB controls the error over the attribute-splitting selection, where R is the range of the class distribution and n is the number of instances that have fallen into a leaf; δ is one minus the desired probability of choosing the correct attribute at any given node. Consider
\[ \varepsilon = \sqrt{\frac{R^{2}\ln(1/\delta)}{2n}}. \tag{1} \]
To evaluate a splitting value for attribute x_i, the algorithm compares the best two heuristic values. Suppose G(x_a) is the best value, where x_a = argmax_i G(x_i); suppose G(x_b) is the second best value, where x_b = argmax_{i≠a} G(x_i); and let ΔG be the difference of the best two values, ΔG = G(x_a) − G(x_b). HB is used to compute a high-confidence interval for the true mean of the heuristic value, so that the true mean is at least the observed mean minus ε with probability 1 − δ. If, after observing n examples, the inequality ΔG > ε holds, then with probability 1 − δ the best attribute observed over this portion of the stream is truly the best attribute over the entire stream. Thus, a splitting attribute can be found without the full set of attribute values, even when we do not know all values of the stream. In other words, the model is not trained from the full dataset; the tree grows incrementally as more data arrive.
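To make the node-splitting test concrete, the following is a minimal Java sketch of the check described above. The class and method names (HoeffdingSplit, hoeffdingBound, shouldSplit) are our own illustrative choices, not part of VFDT's published implementation; the fixed tie threshold tau anticipates the pre-pruning heuristic discussed in Section 2.2.

public final class HoeffdingSplit {

    // Hoeffding bound (1): epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // Split when the best attribute beats the second best by more than epsilon,
    // or when the race is declared a tie (epsilon has shrunk below threshold tau).
    static boolean shouldSplit(double gBest, double gSecond,
                               double range, double delta, long n, double tau) {
        double epsilon = hoeffdingBound(range, delta, n);
        double deltaG = gBest - gSecond;
        return deltaG > epsilon || epsilon < tau;
    }

    public static void main(String[] args) {
        // Example: information gain over a 10-class problem (R = log2(10)),
        // delta = 1e-7, after 200 samples have reached this leaf.
        double range = Math.log(10) / Math.log(2);
        System.out.println(shouldSplit(0.9, 0.6, range, 1e-7, 200, 0.05));
    }
}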

In the past decade, several research papers have proposed different methodologies to improve the accuracy of VFDT. HOT [6] proposes an algorithm producing several optional tree branches at the same time, replacing rules of lower accuracy with the optional ones. The classification accuracy is improved significantly, while the learning speed is slowed because of the construction of the optional branches; some of the options are inactive branches that consume computing resources. The functional tree leaf was originally proposed for integration into an incremental decision-tree in VFDTc [7]; the Naïve Bayes classifier on the tree leaf consequently improved classification accuracy. The functional tree leaf is able to handle both continuous and discrete values in data streams, but there is no direct evidence that it can handle imperfections such as noise and bias in data streams. FlexDT [8] proposes a Sigmoid function to handle noisy data and missing values. The Sigmoid function is used to decide the true node-splitting value, but it sacrifices algorithm speed; for this reason, a lightweight algorithm with fast learning speed is favored in the data stream environment. CBDT [9] is a forest-of-trees algorithm that maintains a number of trees, each rooted on a different attribute and growing independently. It is sensitive to concept drift in data streams thanks to its sliding-window mechanism. VFDR [10] is a decision-rule learner using HB. Like VFDT, VFDR proposes a rule-expanding mechanism that constructs decision rules (ordered or unordered) from a data stream on the fly. VFDT handles streaming data by keeping the tree structure updated as new data arrive. It only requires reading enough samples to satisfy the statistical bound (referring to the HB) to construct a decision-tree. Since the whole training dataset cannot be analyzed at one time, normal optimization methods that search a full dataset for an optimum between accuracy and tree size do not work well here. Our previous work provided a solution that sustains prediction accuracy and regulates the growth of the decision-tree to a reasonable extent, even in the presence of noise. Moderated Very Fast Decision-tree (MVFDT) [11] is a novel extension of the VFDT model that optimizes the tree-growing process via an adaptive tie-breaking threshold instead of the user-predefined value in VFDT.

There are two popular platforms for implementing stream-mining decision-tree algorithms. Very Fast Machine Learning (VFML) [12] is a C-based tool for mining time-changing high-speed data streams. Massive Online Analysis (MOA) [13] is Java-based software for massive data analysis and a well-known open source project extended from the WEKA data mining suite. In both platforms, the parameters of VFDT must be preconfigured, and different tree induction tasks require different parameter setups.

MOA is an open source project with a user-friendly graphic interface. It also provides several ways to evaluate an algorithm's performance; hence, several VFDT-extended algorithms have been built on this platform. For example, the VFDT algorithms embedded in MOA (released in November 2011) are as follows. Ensemble Hoeffding Tree [14] is an online bagging method with an ensemble of VFDT classifiers. Adaptive Size Hoeffding Tree (ASHT) [15] is derived from VFDT by adding a maximum number of split nodes: after a node splits, if the number of split nodes exceeds the maximum value, some nodes are deleted to reduce the tree size. Besides, it is designed for handling concept-drift data streams. AdaHOT [15] is derived from HOT. Each leaf stores an estimate of the current error, and the weight of each node in the voting process is proportional to the square of the inverse of that error. AdaHOT combines HOT with a voting mechanism on each node and extends the advantage of using optional trees to replace tree branches of bad performance. Based on the assumption that "there has been no change in the average value inside the window," ADWIN [16] proposes a solution to detect changes using a variable-length window of recently seen instances. In this paper, the OVFDT algorithm is developed on the foundation of the MOA platform.

2.2. Relationship amongst Accuracy, Tree Size, and Time

When data contain noisy values, the result of the heuristic function may be confused: the difference ΔG between the best two heuristic evaluations may be negligible. To solve this problem, a fixed tie-breaking threshold τ, a user-predefined parameter for the incremental decision-tree, is proposed as a pre-pruning mechanism to control the tree-growth speed [17]. This threshold relaxes the node-splitting condition so that a tie is declared, and a split allowed, when ε < τ. An appropriate τ guarantees a minimum of tree growth and guards against the tree-size explosion problem. τ must be set before a new learning task starts; however, so far there is no single τ suitable for all problems. In other words, no single default value works well in all tasks, so the choice of τ depends on the data and their nature. It has been observed that excessive invocation of tie breaking makes the performance of decision-tree learning decline significantly on complex and noisy data, even with the additional condition controlled by the parameter τ.

In addition to the tie-breaking threshold τ, n_min is the number of instances a leaf should observe between split attempts. In other words, τ is a user-defined value that controls the tree-growing speed, and n_min is a user-defined value that controls the interval between node-splitting checks: the former constrains the tree size and the latter constrains the learning speed. In order to optimize accuracy, tree size, and speed for decision-tree learning, an example is first given to demonstrate the relationship among these three factors for data streams.

In this example, we use VFDT, a classical incremental decision-tree using HB in node splitting, to evaluate synthetic datasets with added class bias. We use MOA to generate two typical datasets: LED24 is a nominal dataset, and Waveform21 is a numeric dataset. Both datasets share their origins with the sample generators donated to the UCI machine learning repository. LED24 uses 24 nominal attributes to classify 10 different classes, and Waveform21 uses 21 numeric attributes to classify 3 different classes. The data stream problem is simulated by a large number of instances, as many as one million. The accuracy, tree size, and time are recorded while changing the predefined values of τ and n_min. From Table 1, we can see the following. (i) In general, a bigger tree brings higher accuracy, even when caused by the overfitting problem, but takes more learning time. (ii) τ is proposed to control the tree-size growth. A bigger τ brings faster tree growth but longer computation time; however, because memory is limited, the tree size stops increasing once τ reaches a threshold (a different one for LED24 and for Waveform21). (iii) n_min is proposed to control the learning time. A bigger n_min brings a faster learning speed, but a smaller tree and lower accuracy.

A proposed solution [18] to overcome this detrimental effect is an improved tie-breaking mechanism, which considers not only the best and the second best splitting candidates in terms of the heuristic function, but also the worst candidate. At the same time, an extra parameter is introduced, which determines how many times smaller the gap between the best two candidates should be, relative to the gap between the best and the worst, before it is considered a tie; when this tie condition holds, the attribute is split as a node. Obviously, this approach uses two extra elements, the worst candidate and the ratio parameter, which bring extra computation to the original algorithm. Moreover, the only way to detect the best tie-breaking threshold for a certain task is to try all the possibilities in VFDT, which is impractical for real-world applications. In this paper, we propose an adaptive tie-breaking threshold based on an incremental optimization methodology. The breakthrough of our work is the optimized node-splitting control, which is specified in the following sections.

3. Proposed Methodology

3.1. Motivation and Overview

OVFDT, which inherits the use of HB, implements a test-then-train approach (Figure 1) for classifying continuously arriving data streams, even infinite ones. The whole test-then-train process is synchronized such that when the data stream arrives, one segment at a time, the decision-tree is first tested for prediction output, and then training (also known as updating) of the decision-tree occurs incrementally. The testing process is explained in detail in Section 3.3, and the training process in Section 3.4. Ideally, the node-splitting test updates the tree model in order to improve accuracy, while a bigger tree model takes longer computation time. A node-splitting check is carried out when the number of instances observed at a leaf exceeds the predefined value n_min.

Imperfect data streams, including noisy data and biased class distribution, degrade the performance of VFDT. Figure 2 shows the accuracy, tree size, and computation time of VFDT on the same dataset structure with imperfect values added. An ideal stream is free from noise and has a uniform proportion of class samples, which is rare in the real world. Comparing ideal data streams with imperfect data streams, we conclude Lemma 1.

Lemma 1. Imperfections in data streams worsen the performance of VFDT: the tree size and the computation time increase, but the accuracy declines. In other words, the optimization goal is to increase the accuracy without enlarging the tree size, within an acceptable computation time. Naturally, a bigger tree takes longer to compute; for this reason, the computation time is dependent on the tree size.

In the decision-tree model, each path from the root to a leaf represents a rule. To ensure high accuracy, there must be a sufficient number of rules, which is the number of leaves in the tree model. Suppose the Hoeffding Tree (HT) is a decision-tree algorithm using the Hoeffding bound (HB) as the node-splitting test. Let Accu_i(HT) be the accuracy of the decision-tree structure HT at the ith node-splitting estimation, and let Size_i(HT) be its tree size; then
\[ \text{Accu}_i(HT) = f\left(\text{Size}_i(HT)\right), \]
where f is a mapping function from tree size to accuracy. Most incremental optimization objectives can be expressed as the sum of several subobjective functions:
\[ \min F(x) = \sum_{i} f_i(x), \]
where each f_i is a continuously differentiable function whose domain is a nonempty, convex, and closed set. We consider the optimization problem of maximizing Accu_i(HT) while keeping Size_i(HT) compact.

Based on Lemma 1, we propose a solution that optimizes the decision-tree structure by improving the original VFDT.

The tree model is updated when a node splitting occurs. The original VFDT considers the HB as the only index for splitting a node; however, this is not enough. In light of the above optimization goal, OVFDT proposes an optimized node-splitting control during the tree-building process.

3.2. OVFDT Test-Then-Train Process

Data streams are open-ended problems for which traditional sampling strategies are not viable in a nonstopping stream scenario. OVFDT is an improved version of the original VFDT and its extensions that use HB to decide node splitting. The most significant contribution is that OVFDT can obtain an optimal tree structure by balancing the accuracy and the tree size. This is useful for data stream mining especially in the event of tree-size explosion, when the decision-tree is subject to imperfect streams with noisy data and imbalanced class distribution.

HT algorithms run a test-then-train approach to build a decision-tree model. When a new stream sample arrives, it is sorted from the root to a predicted leaf. Comparing the predicted class with the true class of this sample, we maintain an error matrix for every tree leaf in the testing process. In terms of the stored statistics matrix, the decision-tree model is updated in the training process. Table 2 presents the differences between OVFDT and other HT algorithms (including the original VFDT and its extensions). Pseudocode 1 shows the input parameters and the output of OVFDT, with the approach presented as pseudocode.

INPUT:
S: a stream of samples
X: a set of symbolic attributes
G(·): heuristic function used for node-splitting estimation
δ: one minus the desired probability of choosing a correct attribute at any given node
n_min: the minimum number of samples between node-splitting estimations
ℱ: a functional tree leaf strategy
OUTPUT:
HT: A decision tree
PROCEDURE: OVFDT (S, X, G, δ, n_min, ℱ)
A data stream sample s = (X, y) arrives
IF HT is null, THEN  initializeHT(S, X, G, δ, n_min, ℱ)
 ELSE  traverseHT(s, HT, ℱ) and update OCD
Label the leaf with the predicted class y′ among the samples seen so far
Let n_l be the number of samples seen at the leaf l
IF the samples seen so far at leaf l do not all belong to the same class
  and (n_l mod n_min) is zero, THEN doNodeSplitting(HT, l, X, G, δ, ℱ)
Return HT
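For readers who prefer running code over pseudocode, below is a minimal Java sketch of the test-then-train loop, written against MOA's stream and classifier abstractions. The method names follow MOA's tutorial code and may differ slightly across MOA versions (e.g., whether nextInstance() wraps the instance in an Example); OVFDT itself is not part of stock MOA, so the stock HoeffdingTree stands in here.

import moa.classifiers.Classifier;
import moa.classifiers.trees.HoeffdingTree;   // package path varies across MOA versions
import moa.streams.generators.LEDGenerator;
import com.yahoo.labs.samoa.instances.Instance;

public class TestThenTrain {
    public static void main(String[] args) {
        LEDGenerator stream = new LEDGenerator();   // LED24-style nominal stream
        stream.prepareForUse();

        Classifier learner = new HoeffdingTree();   // stock VFDT stands in for OVFDT
        learner.setModelContext(stream.getHeader());
        learner.prepareForUse();

        long seen = 0, correct = 0;
        while (stream.hasMoreInstances() && seen < 1_000_000) {
            Instance inst = stream.nextInstance().getData(); // newer MOA wraps instances
            if (learner.correctlyClassifies(inst)) correct++; // test first ...
            learner.trainOnInstance(inst);                    // ... then train
            seen++;
        }
        System.out.printf("%d instances, %.2f%% accuracy%n", seen, 100.0 * correct / seen);
    }
}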

3.3. OVFDT Testing Approach

Suppose X is a vector of attributes and y is the class, with K different values, included in the data streams. For decision-tree prediction tasks, the learning goal is to induce a function y′ = HT_ℱ(X), where y′ is the class predicted by a Hoeffding tree (HT) according to a functional tree leaf strategy ℱ. When a new data stream sample (X, y) arrives, it traverses from the root of the decision-tree to an existing leaf according to the current decision-tree structure, provided that the root exists initially. Otherwise, the heuristic function G(·) is used to construct a tree model with a single root node. When a new instance arrives, it is sorted from the root to a leaf by the current tree model. The classifier on the leaf can further enhance the prediction accuracy via embedded classifiers. OVFDT contains three such classifiers to improve the prediction performance: the Majority Class (MC), the Naïve Bayes (NB), and the Weighted Naïve Bayes (WNB).

Suppose y′ is the predicted class value and y is the actual class in a data stream sample with attribute vector X. A sufficient-statistics matrix stores the number of passed-by samples that contain attribute x_i with a value v_ij belonging to a certain class y_k so far. We call this statistics table the Observed Class Distribution (OCD) matrix. The size of the OCD is J × K, where J is the total number of distinct values of attribute x_i and K is the number of distinct class values. Suppose n_ijk is the sufficient statistic counting the samples in which attribute x_i takes value v_ij and belongs to class y_k. The OCD of a node for attribute x_i is then defined as
\[ \mathrm{OCD}_i = \begin{pmatrix} n_{i11} & \cdots & n_{i1K} \\ \vdots & \ddots & \vdots \\ n_{iJ1} & \cdots & n_{iJK} \end{pmatrix}. \]

For a certain leaf l where attribute x_i takes the value v_ij, the relevant OCD row is the vector of class counts (n_ij1, n_ij2, ..., n_ijK).

The Majority Class classifier chooses the class with the maximum observed count as the predicted class in a leaf. Thus, MC predicts the class y′ such that
\[ y' = \arg\max_{y_k}\, n_{ijk}. \]

The Naïve Bayes classifier chooses the class with the maximum probability computed by Naïve Bayes as the predicted class in a leaf. The formula of Naïve Bayes is
\[ P(y_k \mid X) = \frac{P(y_k)\,\prod_{i} P(x_i \mid y_k)}{P(X)}, \]
with the priors and likelihoods estimated from the OCD counts.

The OCD of the leaf for value v_ij is updated incrementally. Thus, NB predicts the class y′ such that
\[ y' = \arg\max_{y_k} P(y_k)\prod_i P(x_i \mid y_k). \]

The Weighted Naïve Bayes classifier is proposed to reduce the effect of imbalanced class distribution. It chooses the class with the maximum probability computed by a Weighted Naïve Bayes as the predicted class in a leaf:
\[ P_w(y_k \mid X) \propto w_k\, P(y_k)\prod_i P(x_i \mid y_k), \]
where w_k is a weight derived from the observed class distribution to compensate for class imbalance.

The OCD of the leaf for value v_ij is updated. Thus, WNB predicts the class y′ such that
\[ y' = \arg\max_{y_k} w_k\, P(y_k)\prod_i P(x_i \mid y_k). \]

After the sample traverses the whole HT, it is assigned a predicted class y′ according to the functional tree leaf ℱ. Comparing the predicted class y′ to the actual class y, the sequential statistics of correct and incorrect predictions are updated immediately. Meanwhile, the sufficient statistics n_ijk, a count of attribute x_i with value v_ij belonging to class y_k, are updated in each node. This series of actions is what we call the testing approach in this paper; Pseudocode 2 gives its pseudocode.

PROCEDURE:  traverseHT (s, HT, ℱ)
Sort s from the root to a leaf l by  HT. Update OCD in each node: n_ijk++
Switch (ℱ)
 Case MC: predict the class y′ with max n_ijk
 Case NB: predict the class y′ with max NB probability
 Case WNB: predict the class y′ with max WNB probability
 IF y′ equals the actual class label y in s, THEN correct++
ELSE incorrect++
Return y′

To store the OCD for OVFDT, the MC, NB, and WNB strategies require memory proportional to the product of the number of nodes in the tree model, the number of attributes, the maximum number of values per attribute, and the number of classes. The OCDs of NB and WNB are converted from that of MC; therefore, no extra memory is required, and when needed they can be computed from the MC counts.
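The following Java sketch illustrates, under our own naming, how a leaf's OCD counts can back all three functional tree leaf strategies. The Laplace smoothing and the WNB weighting (down-weighting frequent classes by inverse frequency) are illustrative assumptions; the paper does not fix these details.

// Minimal sketch of a leaf's Observed Class Distribution (OCD) and the three
// functional tree leaf strategies (MC, NB, WNB). Names and the exact WNB
// weighting are illustrative assumptions, not the paper's fixed choices.
final class LeafOCD {
    final long[][][] n;      // n[i][j][k]: attribute i, value j, class k
    final long[] classTotal; // samples per class seen at this leaf
    long total;

    LeafOCD(int attrs, int maxValues, int classes) {
        n = new long[attrs][maxValues][classes];
        classTotal = new long[classes];
    }

    void update(int[] x, int y) {          // x[i] = value index of attribute i
        for (int i = 0; i < x.length; i++) n[i][x[i]][y]++;
        classTotal[y]++; total++;
    }

    int predictMC() {                      // Majority Class
        int best = 0;
        for (int k = 1; k < classTotal.length; k++)
            if (classTotal[k] > classTotal[best]) best = k;
        return best;
    }

    int predictNB(int[] x) { return argmaxPosterior(x, false); }
    int predictWNB(int[] x) { return argmaxPosterior(x, true); }

    private int argmaxPosterior(int[] x, boolean weighted) {
        int best = 0; double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < classTotal.length; k++) {
            // log prior, Laplace-smoothed
            double score = Math.log((classTotal[k] + 1.0) / (total + classTotal.length));
            for (int i = 0; i < x.length; i++)  // log likelihoods from OCD counts
                score += Math.log((n[i][x[i]][k] + 1.0) / (classTotal[k] + n[i].length));
            if (weighted)                       // WNB: down-weight frequent classes
                score += Math.log((total + 1.0) / (classTotal[k] + 1.0));
            if (score > bestScore) { bestScore = score; best = k; }
        }
        return best;
    }
}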

3.4. OVFDT Training Approach

Right after the testing approach, the training follows. Node-splitting estimation decides whether HT should be updated, depending on whether the samples received so far can be represented by additional underlying rules in the decision-tree. In principle, the node-splitting estimation could be applied to every single new sample that arrives; of course, this would be too exhaustive and would slow down the tree-building process. Instead, a parameter n_min is proposed in VFDT so that the node-splitting estimation is only done when n_min examples have been observed at a leaf. In the node-splitting estimation, the tree model is updated when the heuristic function chooses the most appropriate attribute, the one with the highest heuristic value, as a splitting node according to the HB and the tie-breaking threshold. The heuristic function is implemented as information gain here. This node-splitting estimation constitutes the so-called training phase.

The node-splitting test is modified to use a dynamic tie-breaking threshold τ, which restricts the splitting of an attribute into a decision node. The parameter τ is traditionally preconfigured with a default value defined by the user, and the optimal value is usually not known until all of the possibilities in an experiment have been tried; an example was presented in Section 2.2. Longitudinal testing of different values in advance is certainly not feasible in real-time applications. Instead, we assign a dynamic tie threshold, equal to the running mean of HB over each pass of stream data, as the splitting threshold that controls node splitting during the tree-building process. Tie breaking that occurs close to the HB mean can effectively narrow the variance of the distribution. The HB mean is calculated dynamically whenever new data arrive.

The estimation of splits and ties is only executed once for every n_min (a user-supplied value) samples that arrive at a leaf. Instead of a preconfigured tie, OVFDT uses an adaptive tie that is calculated incrementally. At the ith node-splitting estimation, the HB estimates whether the sufficient statistics constitute a large enough sample to split a new node at the corresponding leaf l. Let τ_l be the adaptive tie corresponding to leaf l within the k estimations seen so far, and let ψ_li be a binary variable that takes the value 1 if the HB computed at the ith estimation relates to leaf l and 0 otherwise. τ_l is computed as the mean of the related HB values:
\[ \tau_{l} = \frac{\sum_{i=1}^{k} \mathit{HB}_i\,\psi_{li}}{\sum_{i=1}^{k} \psi_{li}}. \tag{13} \]
To constrain the fluctuation of HB, an upper bound and a lower bound on the adaptive tie are also maintained in the adaptive tie mechanism.
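A minimal Java sketch of the adaptive tie bookkeeping follows, assuming one accumulator per leaf (our own naming). The floor/ceiling clamp stands in for the paper's upper and lower bounds, whose exact formulas we do not reproduce.

// Running-mean adaptive tie per leaf, per (13); names are illustrative.
final class AdaptiveTie {
    private double sumHB = 0.0; // sum of HB values recorded for this leaf
    private long count = 0;     // number of node-splitting estimations so far

    void observe(double hb) { sumHB += hb; count++; }

    // tau = incremental mean of HB per (13); floor/ceiling stand in for the
    // paper's upper/lower bounds on the adaptive tie.
    double value(double floor, double ceiling) {
        if (count == 0) return floor;      // no estimation yet: most conservative tie
        double mean = sumHB / count;
        return Math.max(floor, Math.min(ceiling, mean));
    }
}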

For lightweight operation, we propose an error-based pre-pruning mechanism for OVFDT, which stops a noninformative split before a leaf is turned into a new node. The pre-pruning takes the node-splitting error into account both globally and locally.

According to the optimization goal mentioned in Section 3.1, besides the HB, we also consider the global and local accuracy in terms of the sequential statistics of correct and incorrect predictions computed by the functional tree leaf. Let Δ_i be the difference between the numbers of correct and incorrect predictions, where i is the index of the testing approach. Then Δ_i reflects the global accuracy of the current HT prediction on the newly arrived data streams. If Δ_i ≥ 0, the number of correct predictions is no less than the number of incorrect predictions in the current tree structure; otherwise, the current tree needs to be updated by node splitting. In this approach, the statistics of correct and incorrect predictions are updated. Suppose Φ_i = Δ_i − Δ_{i−1}, which reflects the trend of HT's accuracy. If Φ_i declines, the global accuracy of the current HT model is worsening. Likewise, comparing Δ_i and Δ_{i−1}, the local accuracy is monitored during node splitting: if Δ_{i−1} is greater than Δ_i, the current accuracy is declining locally. In this case, the HT should be updated to suit the newly arrived data streams.

Lemma 2. Monitoring global accuracy. The model's accuracy varies whenever a node splits and the tree structure is updated. The overall accuracy of the current tree model is monitored during node splitting by comparing the numbers of correctly and incorrectly predicted samples seen so far, recorded as global performance indicators. This monitoring allows the global accuracy to be determined.

Lemma 3. Monitoring local accuracy. The global accuracy is tracked by comparing the number of correctly predicted samples with the number of wrongly predicted ones. Likewise, comparing the global accuracy measured at the current node-splitting estimation with that at the previous one, the increment in accuracy is tracked dynamically. This monitoring allows us to check whether the current node splitting is advantageous at each step.

Figure 3 gives an example of why the proposed pre-pruning takes both the local and the global accuracy into account in incremental pruning. At the ith node-splitting estimation, the difference between correctly and incorrectly predicted classes was Δ_i, and it was Δ_{i−1} at the (i−1)th estimation. Φ_i was negative, so the local accuracy at the ith estimation was worse than at the previous one, even while both were on a globally increasing trend. Hence, if accuracy is getting worse, it is necessary to update the HT structure.
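A compact sketch of Lemmas 2 and 3 as bookkeeping code follows (our own naming): globalOk() checks the sign of Δ_i, and locallyDeclining() checks Φ_i = Δ_i − Δ_{i−1}.

// Sequential-error statistics for the optimized node-splitting control.
// deltaPrev and deltaCurr correspond to Delta_{i-1} and Delta_i in the text.
final class AccuracyMonitor {
    private long correct = 0, incorrect = 0;
    private long deltaPrev = 0;

    void record(boolean predictionWasCorrect) {
        if (predictionWasCorrect) correct++; else incorrect++;
    }

    // Lemma 2 (global): Delta_i = #correct - #incorrect; non-negative is healthy.
    boolean globalOk() { return correct - incorrect >= 0; }

    // Lemma 3 (local): Phi_i = Delta_i - Delta_{i-1} < 0 means a local decline.
    // Call once per node-splitting estimation.
    boolean locallyDeclining() {
        long deltaCurr = correct - incorrect;
        boolean declining = (deltaCurr - deltaPrev) < 0;
        deltaPrev = deltaCurr;
        return declining;
    }
}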

Combining the prediction statistics gathered in the testing phase, Pseudocode 3 presents the pseudocode of the training phase in OVFDT for building an upright tree. The optimized node-splitting control appears in the IF condition of Pseudocode 3. In each node-splitting estimation, the HB value related to a leaf is recorded; the recorded HB values are used to compute the adaptive tie, which uses the mean of HB for each leaf l instead of the fixed user-defined value in VFDT.

PROCEDURE doNodeSplitting (HT, l, X, G, δ, ℱ):
     FOR each attribute x_i at the leaf l
         Compute G(x_i)
         Let x_a be the attribute with the highest G(x_i) and x_b the second best
         Compute HB with δ
         Let ΔG = G(x_a) − G(x_b)
     END-FOR
     IF (ΔG > HB) or (ΔG ≤ HB and the optimized node-splitting conditions hold:
       HB falls below the adaptive tie τ within its upper and lower bounds,
       and the accuracy statistics Δ and Φ of Section 3.4 indicate a decline)
           Replace l by an internal node that splits on x_a
           Update the adaptive tie τ and its bounds
         FOR each branch of the split
   Add a new leaf l_m and initialize its OCD from the split statistics
   Let ℱ_m be the functional tree leaf obtained by predicting the class in l_m
         according to ℱ at l_m
   FOR each class y_k and each value v_ij of each attribute x_i
    Initialize and reset OCD: n_ijk = 0
   END-FOR
       END-FOR
END-IF
Return updated HT
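Putting the pieces together, and reusing the AdaptiveTie and AccuracyMonitor sketches above, the optimized split decision might look as follows. The exact combination of tie and accuracy conditions paraphrases Section 3.4 rather than quoting Pseudocode 3's four-way disjunction verbatim, so treat this as one plausible reading rather than the definitive rule.

// Optimized node-splitting control (sketch): split on a clear HB winner, or on
// an adaptive-tie tie-break when the monitored accuracy suggests the tree
// should grow.
final class OptimizedSplitControl {
    static boolean shouldSplit(double deltaG, double hb,
                               AdaptiveTie tie, AccuracyMonitor monitor,
                               double tieFloor, double tieCeiling) {
        tie.observe(hb);                          // record HB for this leaf, per (13)
        if (deltaG > hb) return true;             // statistically clear winner
        double tau = tie.value(tieFloor, tieCeiling);
        boolean tied = hb < tau;                  // adaptive tie-break condition
        boolean globalBad = !monitor.globalOk();          // Lemma 2
        boolean localBad = monitor.locallyDeclining();    // Lemma 3, updates Delta_{i-1}
        // Grow on a tie only when accuracy is suffering globally or locally.
        return tied && (globalBad || localBad);
    }
}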

4. Evaluation

4.1. Evaluation Platform and Datasets

A Java package with OVFDT has been implemented with the MOA toolkit as a simulation platform for the experiments. The running environment is a Windows 7 PC with an Intel Quad 2.8 GHz CPU and 8 GB RAM. In all of the experiments, the parameters of the algorithms are set to the default values suggested by MOA: δ is the allowable error in a split decision, where values closer to zero take longer to decide, and n_min is the number of instances a leaf should observe between split attempts. The main goal of this section is to provide evidence of the improvement of OVFDT compared with the original VFDT.

The experimental datasets, including pure nominal datasets, pure numeric datasets, and mixed datasets, are either synthetic, generated by the MOA generators, or extracted from real-world applications publicly available from the UCI repository. The descriptions of the experimental datasets are listed in Table 3. The LED24 dataset is generated by the MOA generator; in the experiment, we add 10% noisy data to simulate imperfect data streams. The LED24 problem uses 24 binary attributes to classify 10 different classes. The Waveform21 dataset is also generated by MOA; the original dataset was donated by David Aha to the UCI repository. The goal of the task is to differentiate between three different classes of waveform, using 21 numeric attributes that contain noise. Cover Type is used to predict forest cover type from cartographic variables, and this data is collected from the real world [19].
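For reproducibility, streams like these can also be generated and evaluated directly from MOA's command line. The invocation below is illustrative of MOA's documented task syntax (learner -l, stream -s, instance limit -i, sample frequency -f), which may vary across MOA releases, and the stock HoeffdingTree learner again stands in for OVFDT; the generator's noise percentage is a configurable option of LEDGenerator.

java -cp moa.jar moa.DoTask \
  "EvaluatePrequential -l trees.HoeffdingTree -s generators.LEDGenerator -i 1000000 -f 10000"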

The benchmarked algorithms are VFDT [1], HOT [6], and ADWIN [16]. VFDT and HOT are representative learning methods without sliding-window criteria for handling concept drift; ADWIN uses an adaptive window technique. The paper [16] claimed that ADWIN performed as well as, or only slightly worse than, the best fixed window for each rate of change in CVFDT, which justifies why CVFDT is not compared in this test.

4.2. Held-Out Evaluation

The first evaluation simulates a holdout testing approach. The datasets are divided into two parts: 70% for training the model and 30% for testing it. In Figures 4 and 5, for LED24 and Waveform21, the OVFDT algorithms have the best accuracy and a compact tree size. For the Cover Type data, HOT obtains higher accuracy but a much bigger model size than the others. OVFDT has the mechanism of optimizing node splitting so as to balance the prediction accuracy and, consequently, the model size.

Besides, we use the receiver operating characteristic (ROC) as a standard method for analysing and comparing classification results. It provides a convenient graphical display of the trade-off between true and false positive classification rates for two-class problems. In decision-tree classification, the ROC is used for more than two classes; in this case, we apply multi-class ROC analysis to evaluate the performance of the tree-learning algorithms.

As with two-class ROC statistics, for each class y_k from k = 1 to K in a multi-class ROC analysis, the samples with class y_k are assigned as positive and the others as negative, as in Figure 6. True positives (TPs) are examples correctly labeled as positive, calculated by (17). False positives (FPs) are negative examples incorrectly labeled as positive, calculated by (18). True negatives (TNs) correspond to negatives correctly labeled as negative, calculated by (20). Finally, false negatives (FNs) are positive examples incorrectly labeled as negative, calculated by (19). Each class can thus be converted into a two-class problem, with corresponding values for TP, TN, FP, and FN.

Precision-Recall is a well-known analysis method for ROC evaluation. In pattern recognition, precision is the fraction of retrieved instances that are relevant, whereas recall is the fraction of relevant instances that are retrieved. The values of precision and recall range from 0 to 1. A precision score of 1 for a class means that every item labeled as belonging to that class does, indeed, belong to it; a recall score of 1 means that every item from the class was correctly labeled. Precision-Recall scores are not analyzed in isolation: the F-measure [20] is a weighted harmonic mean of precision and recall that evenly weights the two scores, with a best value of 1 and a worst value of 0. In addition, the true positive rate (TPR) and false positive rate (FPR) are commonly used as benchmarks in ROC analysis. According to the ROC matrix shown in Figure 6, the calculation of TPR is given in (21), the calculation of FPR in (22), precision in (23), and the F-measure in (24). The experimental results of the Precision-Recall test are shown in Table 4. Consider
\[ \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \tag{21} \]
\[ \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}, \tag{22} \]
\[ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \tag{23} \]
\[ F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}. \tag{24} \]
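As a concrete reference, the one-vs-rest conversion of Figure 6 and the metrics in (21)-(24) can be computed from a K×K confusion matrix as below; this is a standard computation, with our own naming.

// One-vs-rest ROC statistics from a KxK confusion matrix, where cm[a][p]
// counts samples of actual class a predicted as class p.
final class MultiClassROC {
    static double[] statsForClass(long[][] cm, int k) {
        long tp = cm[k][k], fn = 0, fp = 0, tn = 0;
        for (int a = 0; a < cm.length; a++)
            for (int p = 0; p < cm.length; p++) {
                if (a == k && p != k) fn += cm[a][p];       // missed positives
                else if (a != k && p == k) fp += cm[a][p];  // false alarms
                else if (a != k && p != k) tn += cm[a][p];  // correct rejections
            }
        double tpr = tp / (double) (tp + fn);                // (21), equals recall
        double fpr = fp / (double) (fp + tn);                // (22)
        double precision = tp / (double) (tp + fp);          // (23)
        double f1 = 2 * precision * tpr / (precision + tpr); // (24)
        return new double[] { tpr, fpr, precision, f1 };
    }
}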

The pairwise Precision-Recall results are shown in Table 4. OVFDT with the WNB functional tree leaf has the best statistical results on the Waveform21 data, which contains numeric attributes only. For the other two datasets, OVFDT also has the better TPR, indicating higher sensitivity in this experiment. For the ROC curve, Figure 7 displays the TPR-FPR analysis in ROC space. The plots of OVFDT lie higher than those of the other algorithms in ROC space, so the proposed methods outperform the others in this holdout test.

4.3. Test-Then-Train Evaluation

The second evaluation implements a test-then-train approach, that is, an incremental evaluation. When new instances come, they are first used to test the current tree model, and the statistical results of accuracy and model size are recorded; after that, those instances are used to train and update the model. From Figure 8, we can see that OVFDT with the WNB functional tree leaf obtains the best accuracy among the tested algorithms, with a relatively smaller tree size. Because its tree induction uses optional tree branches, HOT results in a much bigger model size than the other methods. This incremental evaluation shows OVFDT's outperformance in terms of accuracy and compact tree size.

5. Conclusion

Imperfect data streams lead to tree-size explosion and detrimental accuracy problems. In the original VFDT, a tie-breaking threshold that takes a user-defined value is proposed to alleviate this problem by controlling the node-splitting process, which is the way the tree grows. But no single default value always works well, and the user-defined value is static throughout the stream-mining operation. In this paper, we propose an Optimized VFDT (OVFDT) algorithm that uses an adaptive tie mechanism to automatically search for an optimized amount of tree-node splitting, balancing the accuracy and the tree size, during the tree-building process. The optimized node-splitting mechanism controls the attribute-splitting estimation incrementally. Balancing accuracy against tree size is important, as stream mining is supposed to operate in limited-memory computing environments while reasonable accuracy is still needed. It is a known contradiction that high accuracy requires a large tree with many decision paths, while too sparse a decision-tree results in poor accuracy. In the experiments, we use a holdout test and an incremental test to evaluate the proposed model. The results show that OVFDT achieves better performance, in terms of high prediction accuracy and compact tree size, than the other VFDTs. This advantage is technically accomplished by means of the simple incremental optimization mechanisms described in this paper; they are lightweight and suitable for incremental learning. The contribution is significant because OVFDT can potentially be further extended into other variants of VFDT models in various applications, while the best possible (optimal) accuracy and minimum tree size can always be guaranteed.