Abstract

The ever-increasing generation of data confronts us with the problem of handling massive amounts of information online. One of the biggest challenges is how to extract valuable information from these massive, continuous data streams in a single pass. In a data stream context, data arrive continuously at high speed; therefore, the algorithms developed to address this context must be efficient in terms of memory and time management and capable of detecting changes over time in the underlying distribution that generated the data. This work describes a novel method for the task of pattern classification over a continuous data stream based on an associative model. The proposed method is based on the Gamma classifier, which is inspired by the Alpha-Beta associative memories; both are supervised pattern recognition models. The proposed method is capable of handling the space and time constraints inherent to data stream scenarios. The Data Streaming Gamma classifier (DS-Gamma classifier) implements a sliding window approach to provide concept drift detection and a forgetting mechanism. In order to test the classifier, several experiments were performed using different data stream scenarios with real and synthetic data streams. The experimental results show that the method exhibits competitive performance when compared to other state-of-the-art algorithms.

1. Introduction

In recent years, technological advances have promoted the generation of a vast amount of information from different areas of knowledge: sensor networks, financial data, fraud detection, and web data, among others. According to the study performed by IDC (International Data Corporation) [1], the digital universe in 2013 was estimated at 4.4 trillion gigabytes. Of this digital data, only 22% would be a candidate for analysis, while the available storage capacity could hold just 33% of the generated information. Under this scenario, the extraction of knowledge is becoming a very challenging task. We are now confronted with the problem of handling very large datasets and even with the possibility of an infinite data flow. Therefore, the algorithms developed to address this context, unlike traditional ones, must meet certain constraints, as defined in [2]: work with a limited amount of time, use a limited amount of memory, and make one or only a few passes over the data. They should also be capable of reacting to concept drift, that is, changes in the distribution of the data over time. In [3], more details of the requirements to be considered for data stream algorithms can be found.

As the idea suggests, a data stream can roughly be thought of as an ordered, endless sequence of data items, where the input arrives more or less continuously as time progresses [4]. The extensive research on data stream algorithms performed over recent years has produced a large variety of techniques for data stream problems. Several relevant data mining tasks have been addressed: clustering, classification, prediction, and concept drift detection. Data stream algorithms can be roughly divided into the single classifier approach and the ensemble approach. The single model approach can be further divided into model-based and instance-based approaches. Model-based approaches have the disadvantage of the computational cost of updating the model every time a new element of the stream arrives. In instance-based approaches, only a representative subset of elements (the case base) from the data stream has to be adapted; this does not come for free, since the computational cost is reflected during the classification stage [4]. The single classifier approach includes several data mining techniques such as decision trees, decision rules, instance based learners (IBL), and neural networks, among others. A more detailed taxonomy of data stream algorithms can be found in [5].

Decision trees have been widely used in data stream classification [6–9]. One of the most representative models for data stream classification using decision trees was proposed by Domingos and Hulten [6, 7] and has been the inspiration for several of the tree models used for data stream mining. In [6], the authors proposed a Very Fast Decision Tree (VFDT) learner that uses Hoeffding bounds, which guarantee that the tree can be trained in constant time. An extension of this work was presented in [7] where the decision trees are built from a sliding window of fixed size. Trees are also commonly used as base learners for ensemble approaches.

Decision rule models have also been applied to the data stream context. One of the first methods proposed for rule induction in an online environment was FLORA (FLOating Rough Approximation) [10]. To deal with concept drift and hidden contexts, this framework uses three techniques: keeping only a window of trusted examples, storing concept descriptions to reuse them when a previous context reappears, and using heuristics to control these two functions. Another relevant work was presented by Ferrer-Troyano et al. in [11]. The method is called FACIL (Fast and Adaptive Classifier by Incremental Learning), an incremental classifier based on decision rules. The classifier uses a partial instance memory where examples are rejected if they do not describe a decision boundary.

Algorithms for the data stream context have to be incremental and highly adaptive. In IBL algorithms, incremental learning and model adaptation are very simple, since these methods come down to adapting the case base (i.e., the set of representative examples). This allows obtaining a flexible and easily updated model without complex computations, which keeps the update and classification times low [12]. Beringer and Hüllermeier proposed an IBL algorithm for data stream classification [4]; the system maintains an implicit concept description in the form of a case base that is updated taking into account three indicators: temporal relevance, spatial relevance, and consistency. On the arrival of a new example, this example is incorporated into the case base and then a replacement policy is applied based on the three previously mentioned indicators. In [12], a new technique named Similarity-Based Data Stream Classifier (SimC) is introduced. The system maintains a representative and small set of information in order to preserve the distribution of classes over the stream. It controls noise and outliers by applying an insertion/removal policy designed to retain the best representation of the knowledge using appropriate estimators. Shaker and Hüllermeier presented an instance-based algorithm that can be applied to classification and regression problems [13]. The algorithm is called IBLStreams and it optimizes the composition and size of the case base autonomously. On arrival of a new example, the example is first added to the case base and then it is checked whether other examples might be removed, either because they have become redundant or because they are noisy.

Neural network-based models have hardly been used in data stream scenarios; one of the main issues of this approach is the weight update caused by the arrival of a new example. The influence on the network of this single example cannot simply be canceled later on; at best, it can be reduced gradually over the course of time [12]. However, there are some works where neural networks are used for data stream tasks, such as [14], where an online perceptron is used to classify nonstationary imbalanced data streams. They have also been used as base learners for ensemble methods [15, 16].

Ensemble methods have also been extensively used in data stream classification. The ensemble approach, as its name implies, maintains in memory an ensemble of multiple models whose individual predictions are combined, generally by averaging or voting techniques, to output a final prediction [2]. The main idea behind an ensemble classifier is that different learning algorithms explore different search spaces and hypothesis evaluations [3]. Different classifiers can be used as base learners in ensemble methods, for instance, trees [17–19], Naïve Bayes [20, 21], neural networks [15], and KNN [22].

SVMs have been used for different tasks in data stream scenarios, such as sentiment analysis, fault detection and prediction, medical applications, spam detection, and multilabel classification. In [23], a fault-detection system based on data stream prediction is proposed; an SVM algorithm is used to classify the incoming data and trigger a fault alarm in case the data are considered abnormal. Sentiment analysis has also been addressed using SVM methods. Kranjc et al. [24] introduce a cloud-based scientific workflow platform, which is able to perform online dynamic adaptive sentiment analysis on Twitter data. They used a linear SVM for sentiment classification. While this work does not focus on a specific topic, Smailović et al. [25] present a methodology that analyzes whether the sentiment expressed in Twitter feeds, which discuss selected companies, can indicate their stock price changes. They propose the use of the Pegasos SVM [26] to classify the tweets as positive, negative, or neutral. The tweets classified as positive were then used to calculate a sentiment probability for each day, which is used in a Granger causality analysis to correlate positive sentiment probability and stock closing price. In [27], a stream clustering SVM algorithm to treat spam identification was presented. While in general this problem has previously been addressed as a classification problem, they handle it as an anomaly detection problem. Krawczyk and Woźniak [28] proposed a version of incremental One-Class Support Vector Machine that assigns weights to each object according to its level of significance. They also introduce two schemes for estimating weights for new, incoming data and examine their usefulness.

Another well-studied issue in the data stream context is concept drift detection. The change in the concept might occur due to changes in hidden variables affecting the phenomena or changes in the characteristics of the observed variables. According to [29], two main tactics are used in concept drift detection: (1) monitoring the evolution of performance indicators using statistical techniques and (2) monitoring the distributions of two different time windows. Some algorithms use methods that adapt the decision model at regular intervals without taking into account whether change really happened; this involves extra processing time. On the other hand, explicit change detection methods can detect the point of change, quantify the change, and take the needed actions. In general, the methods for concept drift detection can be integrated with different base learners. Some methods related to this issue can be found in [30–32].

Some other issues related to data stream classification have also been addressed, such as unbalanced data stream classification [14, 15, 19], classification in the presence of uncertain data [8], classification with unlabeled instances [8, 33, 34], novel class detection [22, 35], and multilabel classification [36].

In this paper we describe a novel method for the task of pattern classification over a continuous data stream based on an associative model. The proposed method is based on the Gamma classifier [37], which is a supervised pattern recognition model inspired by the Alpha-Beta associative memories [38]. The proposal has been developed to work in data stream scenarios using a sliding window approach and is capable of handling the space and time constraints inherent to this type of scenario. Various methods have been designed for data stream classification, but to our knowledge the associative approach has never been used for this kind of problem. The main contribution of this work is the implementation of a novel method for data stream classification based on an associative model.

The rest of this paper is organized as follows. Section 2 describes all the materials and methods needed to develop our proposal. Section 3 describes how the experimental phase was conducted and discusses the results. Some conclusions are presented in Section 4, and finally the Acknowledgments and References are included.

2. Materials and Methods

2.1. Gamma Classifier

This classifier is a supervised method whose name is derived from the similarity operator that it uses: the generalized Gamma operator. This operator takes as input two binary vectors, $x$ and $y$, and a positive integer $\theta$ and returns 1 if both vectors are similar ($\theta$ being the degree of allowed dissimilarity) or 0 otherwise. The Gamma operator uses other operators (namely, $\alpha$, $\beta$, and $u_\beta$) that will be introduced first. The rest of this section is strongly based on [37, 38].

Definition 1 (Alpha and Beta operators). Given the sets $A = \{0, 1\}$ and $B = \{0, 1, 2\}$, the Alpha ($\alpha$) and Beta ($\beta$) operators are defined in tabular form as shown in Table 1.

Definition 2 (Alpha operator applied to vectors). Let $x, y \in A^n$ with $n \in \mathbb{Z}^+$ be the input column vectors. The output of $\alpha(x, y) \colon A^n \times A^n \rightarrow B^n$ is an $n$-dimensional vector whose $i$th component ($i = 1, 2, \ldots, n$) is computed as follows:
$$[\alpha(x, y)]_i = \alpha(x_i, y_i).$$

Definition 3 ($u_\beta$ operator). Considering the binary pattern $x \in A^n$ as input, this unary operator gives the following nonnegative integer as output, calculated as follows:
$$u_\beta(x) = \sum_{i=1}^{n} \beta(x_i, x_i).$$
Thus, if $x$ has exactly $k$ components equal to 1, then $u_\beta(x) = k$.

Definition 4 (pruning operator). Let $n, m \in \mathbb{Z}^+$ with $m \leq n$; let $x \in A^n$ and $y \in A^m$ be two binary vectors; then $x$ pruned by $y$, denoted by $x \oslash y$, is an $m$-dimensional binary vector whose $i$th component is defined as follows:
$$[x \oslash y]_i = x_{i + (n - m)},$$
where $i \in \{1, 2, \ldots, m\}$.
For instance, let $x \in A^6$ and $y \in A^3$, so that $n = 6$ and $m = 3$; then $x \oslash y = [x_4, x_5, x_6]^T$. Thus, only the 4th, 5th, and 6th elements of $x$ are used in $x \oslash y$. Notice that the first $n - m = 3$ components of $x$ are discarded when building $x \oslash y$.
The Gamma operator needs binary vectors as input. In order to deal with real and/or integer vectors, a method to represent such vectors in binary form is needed. In this work, we use the modified Johnson-Möbius code [39], which is a variation of the classical Johnson-Möbius code. To convert a set of real/integer numbers into a binary representation we follow these steps:
(1) Let $R = \{r_1, r_2, \ldots, r_q\}$ be a set of real numbers.
(2) If there are one or more negative numbers in the set, create a new set by subtracting the minimum (i.e., $r_{\min}$) from each number in $R$, obtaining a new set with only nonnegative real numbers, in which the former minimum in particular becomes 0.
(3) Choose a fixed number $\rho$ and truncate each number of the set to have only $\rho$ decimal positions.
(4) Scale up all the numbers of the set obtained in the previous step by multiplying them by $10^{\rho}$, in order to leave only nonnegative integer numbers; let $e_m$ denote the maximum value of the resulting set.
(5) For each element $e_j$ of the scaled set, concatenate $e_m - e_j$ zeroes with $e_j$ ones.
For instance, let us use the modified Johnson-Möbius code to convert the elements of a set $R$ into binary vectors.
(1) Subtract the minimum. Since $-0.7$ is the minimum in $R$, the members of the transformed set are obtained by subtracting $-0.7$ from each member of $R$.
(2) Scale up the numbers. We select $\rho = 1$, so each number is truncated to 1 decimal position and then multiplied by 10 ($10^{\rho} = 10^{1}$).
(3) Concatenate $e_m - e_j$ zeroes with $e_j$ ones, where $e_m$ is the maximum number in the scaled set and $e_j$ is the current number to be coded. Given that the maximum number in the scaled set is 31, all the binary vectors have 31 components; that is, they are all 31 bits long. For example, the element whose scaled value is $e_j = 22$ is converted into its binary representation by appending $31 - 22 = 9$ zeroes followed by 22 ones, which gives the final vector "0000000001111111111111111111111".
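To make the coding procedure concrete, the following Python sketch implements the five steps above; the function name, the small input set (chosen to be consistent with the recoverable values in the example: a minimum of $-0.7$, a scaled maximum of 31, and an element that scales to 22), and the small epsilon used to guard against floating-point truncation artifacts are our own choices, not part of the original formulation.

```python
def johnson_mobius_encode(values, rho=1):
    """Encode a set of real numbers with the modified Johnson-Mobius code.

    values: the set R of real numbers to encode.
    rho:    number of decimal positions kept after truncation.
    Returns one binary string per input value, all of length e_m.
    """
    minimum = min(values)
    # Step (2): shift so that every number is nonnegative.
    shifted = [v - minimum if minimum < 0 else v for v in values]
    # Steps (3)-(4): truncate to rho decimals and scale up to integers.
    scale = 10 ** rho
    integers = [int(v * scale + 1e-9) for v in shifted]  # epsilon guards float truncation
    e_m = max(integers)
    # Step (5): e_m - e_j zeroes followed by e_j ones.
    return ["0" * (e_m - e) + "1" * e for e in integers]


# Example consistent with the text: minimum -0.7, scaled maximum 31, one element scaling to 22.
for code in johnson_mobius_encode([-0.7, 1.5, 2.4], rho=1):
    print(code)
```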

Definition 5 (Gamma operator). The similarity Gamma operator $\gamma_g$ takes two binary patterns, $x \in A^n$ and $y \in A^m$, with $n, m \in \mathbb{Z}^+$ and $n \leq m$, and a nonnegative integer $\theta$ as input and outputs a binary number, for which there are two cases.
Case 1. If $n = m$, then the output is computed according to the following:
$$\gamma_g(x, y, \theta) = \begin{cases} 1 & \text{if } m - u_\beta[\alpha(x, y) \bmod 2] \leq \theta, \\ 0 & \text{otherwise,} \end{cases}$$
where mod denotes the usual modulo operation, applied componentwise to the vector $\alpha(x, y)$.
Case 2. If $n < m$, then the output is computed using $y \oslash x$ ($y$ pruned by $x$) instead of $y$, so that both operands of $\alpha$ have the same dimension $n$:
$$\gamma_g(x, y, \theta) = \begin{cases} 1 & \text{if } n - u_\beta[\alpha(x, y \oslash x) \bmod 2] \leq \theta, \\ 0 & \text{otherwise.} \end{cases}$$
In order to better illustrate how the generalized Gamma operator works, consider two binary patterns $x$ and $y$ of the same length that differ in exactly two components, with $\theta = 1$. First, $\alpha(x, y) \bmod 2$ is computed, which yields a vector with a 1 in every component where $x$ and $y$ agree; applying $u_\beta$ to this vector counts those agreements, so $m - u_\beta[\alpha(x, y) \bmod 2] = 2$. Since $2 > \theta = 1$, the result of $\gamma_g(x, y, \theta)$ is 0, and it is decided that, given $\theta = 1$, the two vectors are not similar. However, if we increase $\theta$ to 2, the result of $\gamma_g$ will be 1, since $2 \leq 2$; in this case, the two vectors are considered similar. The main idea of the generalized Gamma operator is to indicate (with a result equal to 1) that two binary vectors are similar, allowing up to $\theta$ bits to be different while still considering those vectors similar. If more than $\theta$ bits are different, the vectors are said to be not similar (result equal to 0). Thus, if $\theta = 0$ both vectors must be equal for $\gamma_g$ to output a 1.
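As an illustration, the following Python sketch implements the $\alpha$, $\beta$, $u_\beta$, pruning, and generalized Gamma operators as defined above, assuming the usual tabular values $\alpha(0,0)=1$, $\alpha(0,1)=0$, $\alpha(1,0)=2$, $\alpha(1,1)=1$ and $\beta(0,0)=\beta(0,1)=\beta(1,0)=0$, $\beta(1,1)=\beta(2,0)=\beta(2,1)=1$ for Table 1; the names and example vectors are ours, and the code is a minimal sketch rather than the authors' implementation.

```python
# Alpha and Beta operators in tabular form (Definition 1 / Table 1, assumed values).
ALPHA = {(0, 0): 1, (0, 1): 0, (1, 0): 2, (1, 1): 1}
BETA = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1, (2, 0): 1, (2, 1): 1}

def alpha_vec(x, y):
    """Componentwise Alpha operator (Definition 2)."""
    return [ALPHA[(xi, yi)] for xi, yi in zip(x, y)]

def u_beta(x):
    """u_beta operator (Definition 3): counts the components of x equal to 1."""
    return sum(BETA[(xi, xi)] for xi in x)

def prune(x, y):
    """x pruned by y (Definition 4): keep only the last len(y) components of x."""
    return x[len(x) - len(y):]

def gamma_g(x, y, theta):
    """Generalized Gamma operator (Definition 5): 1 if x and y differ in at most
    theta bits, 0 otherwise; if x is shorter than y, y is first pruned by x."""
    if len(x) < len(y):
        y = prune(y, x)
    m = len(x)
    agreements = u_beta([a % 2 for a in alpha_vec(x, y)])
    return 1 if m - agreements <= theta else 0

# Two vectors differing in exactly two bits are similar only when theta >= 2.
x = [1, 1, 0, 0]
y = [1, 0, 1, 0]
print(gamma_g(x, y, 1), gamma_g(x, y, 2))   # prints 0 1
```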

2.1.1. Gamma Classifier Algorithm

Let $\{(x^\mu, C^\mu) \mid \mu = 1, 2, \ldots, p\}$ be the fundamental pattern set with cardinality $p$; when a test pattern $\tilde{x} \in \mathbb{R}^n$ is presented to the Gamma classifier, these steps are followed:
(1) Code the fundamental set using the modified Johnson-Möbius code, obtaining a value $e_m(j)$ for each component $j$ of the $n$-dimensional vectors in the fundamental set. These values are calculated from the original real-valued vectors as $e_m(j) = \max_{\mu} e^{\mu}_j$, where $e^{\mu}_j$ is the scaled integer obtained from the $j$th component of pattern $x^{\mu}$.
(2) Compute the stop parameter $\rho$ from the values $e_m(j)$ obtained in the previous step.
(3) Code the test pattern using the modified Johnson-Möbius code, with the same parameters used to code the fundamental set. If any $\tilde{e}_j$ (obtained by offsetting, scaling, and truncating the original $\tilde{x}_j$) is greater than the corresponding $e_m(j)$, code it using the larger number of bits.
(4) Transform the index of all fundamental patterns into two indices, one for their class and another for their position in the class (e.g., $x^\mu$, which is the $\omega$th pattern of class $i$, becomes $x^{i\omega}$).
(5) Initialize $\theta$ to 0.
(6) If $\theta = 0$, test whether $\tilde{x}$ is a fundamental pattern by calculating $\gamma_g(x^{i\omega}_j, \tilde{x}_j, 0)$ for $j = 1, 2, \ldots, n$ and then computing the initial weighted addition $c^{i\omega}_0$ for each fundamental pattern as follows:
$$c^{i\omega}_0 = \sum_{j=1}^{n} \gamma_g\left(x^{i\omega}_j, \tilde{x}_j, 0\right).$$
If there is a unique maximum whose value equals $n$, assign the class associated with such maximum to the test pattern.

(7) Calculate $\gamma_g(x^{i\omega}_j, \tilde{x}_j, \theta)$ for each component of the fundamental patterns.
(8) Compute a weighted sum $c_i$ for each class, according to this equation:
$$c_i = \frac{\sum_{\omega=1}^{k_i} \sum_{j=1}^{n} \gamma_g\left(x^{i\omega}_j, \tilde{x}_j, \theta\right)}{k_i},$$
where $k_i$ is the cardinality in the fundamental set of class $i$.
(9) If there is more than one maximum among the different $c_i$, increment $\theta$ by 1 and repeat steps (7) and (8) until there is a unique maximum, or the stop condition is fulfilled.
(10) If there is a unique maximum among the $c_i$, assign $\tilde{x}$ to the class corresponding to such maximum.
(11) Otherwise, assign $\tilde{x}$ to the class of the first maximum.
As an example, let us consider a fundamental set of four patterns grouped in two classes, $C_1 = \{x^1, x^2\}$ and $C_2 = \{x^3, x^4\}$, and two test patterns $\tilde{x}^1$ and $\tilde{x}^2$ to be classified. The dimension of all patterns is $n = 2$, and there are 2 classes, both with the same cardinality $k_1 = k_2 = 2$. Now the steps in the algorithm are followed.
(1) Code the fundamental set with the modified Johnson-Möbius code, obtaining a value $e_m(j)$ for each component. Since the maximum value among the first components of all fundamental patterns is 7, then $e_m(1) = 7$, and the maximum value among the second components is $e_m(2) = 9$. Thus, the binary vectors representing the first components are 7 bits long, while those representing the second components are 9 bits long.
(2) Compute the stop parameter $\rho$.
(3) Code the patterns to be classified with the modified Johnson-Möbius code, using the same parameters used with the fundamental set. Again, the binary vectors representing the first components are 7 bits long, while those representing the second components are 9 bits long.
(4) Transform the index of all fundamental patterns into two indices, one for the class they belong to and another for their position in the class. Given that $x^1$ and $x^2$ belong to class $C_1$, the first index for both patterns becomes 1; the second index is used to differentiate between patterns assigned to the same class; thus $x^1$ becomes $x^{11}$, and $x^2$ is now $x^{12}$. Something similar happens to $x^3$ and $x^4$, but with the first index being 2 since they belong to class $C_2$: $x^3$ becomes $x^{21}$, and $x^4$ is now $x^{22}$.
(5) Initialize $\theta = 0$.
(6) Calculate $\gamma_g(x^{i\omega}_j, \tilde{x}_j, 0)$ for each component of the fundamental patterns in each class. Since $\theta = 0$, for $\gamma_g$ to give a result of 1 it is necessary that both vectors be equal. For $\tilde{x}^1$, only one fundamental pattern has its first component equal to that of the test pattern, while no fundamental pattern has a second component equal to that of the test pattern. Given that this is the only case for which a component of a fundamental pattern is equal to the same component of the test pattern, it is also the only instance in which $\gamma_g$ outputs 1. Now, for $\tilde{x}^2$ we again have $\theta = 0$, thus forcing both vectors to be equal in order to obtain a 1. Similarly to what happened in the case of $\tilde{x}^1$, there is but one instance of such an occurrence.
(7) Compute a weighted sum $c_i$ for each class. For $\tilde{x}^1$ and class $C_1$, we add together the results obtained on all components of all fundamental patterns belonging to class $C_1$. Since all gave 0 (i.e., none of them were similar to the corresponding component of the test pattern given the value of $\theta$ in effect), the result of the weighted addition for class $C_1$ is $c_1 = 0$. For class $C_2$, there was one instance similar to the corresponding component of the test pattern (equal, since $\theta = 0$ during this run); thus the weighted addition for class $C_2$ is $c_2 = 1/2$, given that the sum of results is divided by the cardinality of the class, $k_2 = 2$. For $\tilde{x}^2$, there was one result of $\gamma_g$ equal to 1 for class $C_1$, which coupled with the cardinality of that class being 2 makes $c_1 = 1/2$; no similarities were found between the fundamental patterns of class $C_2$ and the test pattern (for $\theta = 0$), thus $c_2 = 0$.
(8) If there is more than one maximum among the different $c_i$, increment $\theta$ by 1 and repeat steps (6) and (7) until there is a unique maximum, or the stop condition is fulfilled. There is a unique maximum (for each test pattern), so we go directly to step (9).
(9) If there is a unique maximum, assign $\tilde{x}$ to the class corresponding to such maximum: for $\tilde{x}^1$ the unique maximum corresponds to class $C_2$, while for $\tilde{x}^2$ it corresponds to class $C_1$.
(10) Otherwise, assign the test pattern to the class of the first maximum. This is unnecessary, since both test patterns have already been assigned a class.
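A compact way to sketch the classification loop in Python is shown below. It relies on the observation that, for modified Johnson-Möbius codes of equal length, $\gamma_g$ on two coded components equals 1 exactly when the corresponding scaled integers differ by at most $\theta$ (the Hamming distance between two unary codes is the absolute difference of the coded values), so the sketch works directly on the scaled integers. Function names, the toy fundamental set, and the simplified stop rule are our own; the fundamental-pattern check of step (6) is omitted for brevity.

```python
from collections import defaultdict

def gamma_similarity(a, b, theta):
    """gamma_g on the Johnson-Mobius codes of two integers reduces to |a - b| <= theta."""
    return 1 if abs(a - b) <= theta else 0

def gamma_classify(fundamental, test, rho_stop):
    """Classify one test pattern (a tuple of scaled integer components).

    fundamental: list of (pattern, class_label) pairs, with patterns already
                 offset, truncated, and scaled to integers as in Section 2.1.
    rho_stop:    stop parameter (largest theta that will be tried).
    """
    by_class = defaultdict(list)          # step (4): group patterns by class
    for pattern, label in fundamental:
        by_class[label].append(pattern)

    theta = 0                             # step (5)
    winners = []
    while theta <= rho_stop:
        sums = {}                         # steps (7)-(8): weighted sum per class
        for label, patterns in by_class.items():
            total = sum(gamma_similarity(p_j, t_j, theta)
                        for pattern in patterns
                        for p_j, t_j in zip(pattern, test))
            sums[label] = total / len(patterns)
        best = max(sums.values())
        winners = [label for label, s in sums.items() if s == best]
        if len(winners) == 1:             # step (10): unique maximum found
            return winners[0]
        theta += 1                        # step (9): relax similarity and retry
    return winners[0]                     # step (11): fall back to the first maximum

# Toy usage: two classes with two 2-dimensional patterns each.
fundamental = [((2, 9), "C1"), ((3, 7), "C1"), ((7, 1), "C2"), ((6, 2), "C2")]
print(gamma_classify(fundamental, (6, 1), rho_stop=9))   # prints C2
```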

2.2. Sliding Windows

A widespread approach for data stream mining is the use of a sliding window to keep only a representative portion of the data. We are not interested in keeping all the information of the data stream, particularly when we have to meet the memory space constraint required for algorithms working in this context. This technique is able to deal with concept drift, eliminating those data points that come from an old concept. According to [5], the training window size can be fixed or variable over time.

Sliding windows of a fixed size store in memory a fixed number of the most recent examples. Whenever a new example arrives, it is saved to memory and the oldest one is discarded. These types of windows are similar to first-in first-out data structures. This simple adaptive learning method is often used as a baseline in evaluation of new algorithms.

Sliding windows of variable size vary the number of examples in a window over time, typically depending on the indications of a change detector. A straightforward approach is to shrink the window whenever a change is detected, such that the training data reflects the most recent concept, and grow the window otherwise.
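As a minimal illustration of the fixed-size case, a first-in first-out window can be realized with a double-ended queue; the class and method names below are ours.

```python
from collections import deque

class FixedSlidingWindow:
    """First-in first-out window keeping only the w most recent examples."""

    def __init__(self, w):
        self.window = deque(maxlen=w)   # the oldest element is dropped automatically

    def add(self, example):
        self.window.append(example)

    def examples(self):
        return list(self.window)

window = FixedSlidingWindow(3)
for item in range(5):
    window.add(item)
print(window.examples())   # [2, 3, 4], the three most recent items
```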

2.3. Proposed Method

The Gamma classifier has previously been used for time series prediction in air pollution [37] and oil well production [40], showing very promising results. The current work takes the Gamma classifier and combines it with a sliding window approach, with the aim of developing a new method capable of performing classification over a continuous data stream, referred to as DS-Gamma classifier.

A data stream $S$ can be defined as a sequence of data elements of the form $S = \{s_1, s_2, \ldots, s_N\}$ from a universe of size $N$, where $N$ is huge and potentially infinite but countable. We assume that each element $s_t$ has the following form: $s_t = (x_t, y_t)$, where $x_t$ is an input vector of size $n$ and $y_t$ is the class label associated with it. The subscript $t$ is an index indicating the sequential order of the elements in the data stream; $t$ is bounded by the size of the data stream, such that if the data stream is infinite, $t$ takes infinitely many values, but remains countable.

One of the main challenges for mining data streams is the ability to maintain an accurate decision model. The current model has to incorporate new knowledge from new data, and it should also forget old information when data become outdated. To address these issues, the proposed method implements learning and forgetting policies. The learning (insertion) and forgetting (removal) policies were designed to guarantee the spatial and temporal relevance of the data and the consistency of the current model.

The proposed method implements a fixed window approach. This approach achieves concept drift detection in an indirect way by considering only the most recent data. Algorithm 1 outlines the general steps used as the learning policy. Initially, the window is created with the first $W$ labeled records, where $W$ is the size of the window. After that, every arriving sample is used for classification and evaluation of performance, and then it is used to update the window.

(1) Initialize window
(2) for each $e_t$ in the stream do
(3)   If $t \leq W$ then
(4)     Add $e_t$ to the tail of the window
(5)   Else
(6)     Use the window to classify $e_t$
(7)     Add $e_t$ to the tail of the window

Classifiers that deal with concept drift are forced to implement forgetting, adaptation, or drift detection mechanisms in order to adjust to changing environments. In this proposal, a forgetting mechanism is implemented. Algorithm 2 outlines the basic forgetting policy implemented in this proposal: the oldest element in the window is removed.

(1) for each $e_t$ in the stream do
(2)   If $t > W$ then
(3)     Remove the oldest element ($e_{t-W}$) from the beginning of the window
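Putting the learning and forgetting policies together, a minimal Python sketch of the DS-Gamma processing loop could look as follows; the classify callable stands in for the Gamma classification step of Section 2.1.1, and the names are ours rather than part of the original pseudocode.

```python
from collections import deque

def ds_gamma_stream(stream, window_size, classify):
    """Process a labeled stream with the fixed-window learning/forgetting policies.

    stream:   iterable of (x, y) pairs arriving over time.
    classify: callable(window, x) -> predicted label, e.g. the Gamma classifier
              applied with the window contents as fundamental set.
    Yields (true_label, predicted_label) pairs once the window has been filled.
    """
    window = deque()                          # Algorithm 1, line (1): initialize window
    for t, (x, y) in enumerate(stream, start=1):
        if t <= window_size:                  # Algorithm 1, lines (3)-(4): fill the window
            window.append((x, y))
        else:
            prediction = classify(window, x)  # Algorithm 1, line (6): test before training
            yield y, prediction
            window.append((x, y))             # learning policy: insert the newest example
            window.popleft()                  # Algorithm 2: forget the oldest example
```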

2.4. MOA

Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams [41]. MOA contains a collection of offline and online algorithms for both classification and clustering, as well as tools for evaluation on data streams with concept drift. MOA can also be used to build synthetic data streams using generators, reading ARFF files, or joining several streams. The MOA stream generators allow simulating potentially infinite sequences of data with different characteristics (different types of concept drift and recurrent concepts, among others). All the algorithms used for comparative analysis are implemented under the MOA framework.

2.5. Data Streams

This section provides a brief description of the data streams used during the experimental phase. The proposed solution has been tested using synthetic and real data streams. For data stream classification problems, only a few suitable data streams can be successfully used for evaluation. A few data streams with enough examples are hosted in [42, 43]. For further evaluation, the use of synthetic data stream generators becomes necessary. The synthetic data streams used in our experiments were generated using the MOA framework mentioned in the previous section. A summary of the main characteristics of the data streams is shown in Table 2. Descriptions of the data streams were taken from [42, 43] in the case of the real ones and from [44] for the synthetic ones.

2.5.1. Real World Data Streams

Bank Marketing Dataset. The Bank Marketing dataset is available from the UCI Machine Learning Repository [43]. This data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit. It contains 45,211 instances. Each instance has 16 attributes and the class label. The dataset has two classes that identify whether a bank term deposit was or was not subscribed.

Electricity Dataset. This data was collected from the Australian New South Wales electricity market. In this market, prices are not fixed and are affected by the demand and supply of the market. The prices of the market are set every five minutes. The dataset was first described by Harries in [45]. During the time period described in the dataset, the electricity market was expanded with the inclusion of adjacent areas, which led to more elaborate management of the supply. The excess production of one region could be sold to the adjacent region. The Electricity dataset contains 45,312 instances. Each example of the dataset refers to a period of 30 minutes (i.e., there are 48 instances for each day). Each example in the dataset has 8 fields: the day of week, the time stamp, the New South Wales electricity demand, the New South Wales electricity price, the Victoria electricity demand, the Victoria electricity price, the scheduled electricity transfer between states, and the class label. The class label identifies the change of the price relative to a moving average of the last 24 hours. We use the normalized version of this dataset, so that the numerical values are between 0 and 1. The dataset is available from [42].

Forest CoverType Dataset. The Forest CoverType dataset is one of the largest datasets available from the UCI Machine Learning Repository [43]. This dataset contains information that describes forest cover types from cartographic variables. A given observation (30 × 30 meter cell) was determined from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 581,012 instances and 54 attributes. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types). The dataset has 7 classes that identify the type of forest cover.

2.5.2. Synthetic Data Streams

Agrawal Data Stream. This generator was originally introduced by Agrawal et al. in [46]. The MOA implementation was used to generate the data streams for the experimental phase in this work. The generator produces a stream containing nine attributes. Although not explicitly stated by the authors, a sensible conclusion is that these attributes describe hypothetical loan applications. There are ten functions defined for generating binary class labels from the attributes. Presumably these determine whether the loan should be approved. Perturbation shifts numeric attributes from their true value, adding an offset drawn randomly from a uniform distribution, the range of which is a specified percentage of the total value range [44]. For each experiment, a data stream with 100,000 examples was generated.

RandomRBF Data Stream. This generator was devised to offer an alternate complex concept type that is not straightforward to approximate with a decision tree model. The RBF (Radial Basis Function) generator works as follows. A fixed number of random centroids are generated. Each center has a random position, a single standard deviation, class label, and weight. New examples are generated by selecting a center at random, taking weights into consideration so that centers with higher weight are more likely to be chosen. A random direction is chosen to offset the attribute values from the central point. The length of the displacement is randomly drawn from a Gaussian distribution with standard deviation determined by the chosen centroid. The chosen centroid also determines the class label of the example. Only numeric attributes are generated [44]. For our experiments, a data stream with 100,000 data instances, 10 numerical attributes, and 2 classes was generated.

Rotating Hyperplane Data Stream. This synthetic data stream was generated using the MOA framework. A hyperplane in $d$-dimensional space is the set of points $x$ that satisfy
$$\sum_{i=1}^{d} w_i x_i = w_0,$$
where $x_i$ is the $i$th coordinate of $x$ [44]; examples for which $\sum_{i=1}^{d} w_i x_i \geq w_0$ are labeled positive, and examples for which $\sum_{i=1}^{d} w_i x_i < w_0$ are labeled negative. Hyperplanes are useful for simulating time-changing concepts, because the orientation and position of the hyperplane can be adjusted in a gradual way if we change the relative size of the weights. In MOA, change is introduced to this data stream by adding drift to each weight attribute using the formula $w_i = w_i + d\sigma$, where $\sigma$ is the probability that the direction of change is reversed and $d$ is the change applied to every example. For our experiments, a data stream with 100,000 data instances, 10 numerical attributes, and 2 classes was generated.
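For illustration only, the labeling rule and a gradual weight drift can be sketched as follows; this is not the MOA generator itself, the parameter names are ours, and the threshold $w_0 = \frac{1}{2}\sum_i w_i$ (so that roughly half of the uniformly drawn examples are positive) is an assumption.

```python
import random

def rotating_hyperplane(n_examples, d=10, drift=0.001, reverse_prob=0.1, seed=1):
    """Yield (x, label) pairs from a hyperplane whose weights drift over time."""
    rng = random.Random(seed)
    w = [rng.random() for _ in range(d)]
    direction = [1.0] * d
    for _ in range(n_examples):
        x = [rng.random() for _ in range(d)]
        w0 = 0.5 * sum(w)                       # assumed threshold for sum_i w_i * x_i
        label = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= w0 else 0
        for i in range(d):                      # gradual rotation of the hyperplane
            if rng.random() < reverse_prob:     # occasionally reverse the drift direction
                direction[i] *= -1.0
            w[i] += direction[i] * drift        # apply the per-example change
        yield x, label

# Usage: materialize a small stream of 5 examples.
for x, label in rotating_hyperplane(5, d=3):
    print(label, [round(v, 2) for v in x])
```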

2.6. Algorithms Evaluation

One of the objectives of this work is to perform a consistent comparison between the classification performance of our proposal and that of other well-known data stream pattern classification algorithms. Evaluation of data stream algorithms is a relatively new field that has not been studied as thoroughly as evaluation in batch learning. In traditional batch learning, evaluation has been mainly limited to evaluating a model using different random reorderings of the same static dataset ($k$-fold cross-validation, leave-one-out, and bootstrap) [2]. For data stream scenarios, the problem of infinite data raises new challenges. On the other hand, the batch approaches cannot effectively measure the accuracy of a model in a context with concepts that change over time. In the data stream setting, one of the main concerns is how to design an evaluation practice that allows us to obtain a reliable measure of accuracy over time. According to [41], two main approaches arise.

2.6.1. Holdout

When traditional batch learning reaches a scale where cross-validation is too time-consuming, it is often accepted to instead measure performance on a single holdout set. An extension of this approach can be applied to data stream scenarios: first, a set of instances is used to train the model; then, a set of the following unseen instances is used to test the model; the model is then trained again with the next instances and tested with the subsequent ones, and so forth.

2.6.2. Interleaved Test-Then-Train or Prequential

Each individual example can be used to test the model before it is used for training, and from this the accuracy can be incrementally updated. When intentionally performed in this order, the model is always being tested on examples it has not seen.

Holdout evaluation gives a more accurate estimation of the performance of the classifier on more recent data. However, it requires recent test data that are sometimes difficult to obtain, while prequential evaluation has the advantage that no holdout set is needed for testing, making maximum use of the available data. In [47], Gama et al. present a general framework to assess data stream algorithms. They studied different mechanisms of evaluation that include the holdout estimator, the prequential error, and the prequential error estimated over a sliding window or using fading factors. They stated that the use of the prequential error with forgetting mechanisms proves to be advantageous in assessing performance and in comparing stream learning algorithms. Since our proposed method works with a sliding window approach, we have selected the prequential error over a sliding window for its evaluation. The prequential error for a sliding window of size $w$, computed at time $i$, is based on an accumulated sum of a 0-1 loss function $L$ between the predicted values $\hat{y}_k$ and the observed values $y_k$, using the following expression [47]:
$$P_w(i) = \frac{1}{w} \sum_{k = i - w + 1}^{i} L\left(y_k, \hat{y}_k\right).$$
We also compare the performance results of each competing algorithm with the Gamma classifier using a statistical test. We use the Wilcoxon test [48] to assess the statistical significance of the results. The Wilcoxon signed-ranks test is a nonparametric test, which can be used to rank the differences in performance of two classifiers over the datasets, ignoring the signs and comparing the ranks for the positive and the negative differences.
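The windowed prequential estimate can be maintained incrementally; the following sketch keeps the last $w$ 0-1 losses in a queue, which is one straightforward way to realize the expression above (class and method names are ours).

```python
from collections import deque

class PrequentialError:
    """Prequential 0-1 error estimated over a sliding window of size w."""

    def __init__(self, w):
        self.losses = deque(maxlen=w)

    def update(self, y_true, y_pred):
        self.losses.append(0 if y_true == y_pred else 1)
        return self.error()

    def error(self):
        return sum(self.losses) / len(self.losses) if self.losses else 0.0

# Usage: error over the last 3 predictions.
pe = PrequentialError(w=3)
for y_true, y_pred in [(1, 1), (0, 1), (1, 1), (0, 0)]:
    print(pe.update(y_true, y_pred))   # 0.0, 0.5, 0.33..., 0.33...
```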

2.7. Experimental Setup

To ensure a valid comparison of classification performance, the same conditions and validation schemes were applied in each experiment. Classification performance of each of the compared algorithms was calculated using the prequential error approach. These results are used to compare the performance of our proposal with that of other data stream classification algorithms. We have aimed at choosing representative algorithms from different approaches covering the state of the art in data stream classification, such as Hoeffding Tree, IBL, Naïve Bayes with DDM (Drift Detection Method), SVM, and ensemble methods. All of these algorithms are implemented in the MOA framework. Further details on the implementation of these algorithms can be found in [44]. For all data streams, with the exception of the Electricity dataset, the experiments were executed 10 times and the averages of the performance and execution time results are presented. Records in the Electricity dataset have a temporal relation that should not be altered; for this reason, with this specific dataset the experiment was executed just once. The Bank and CoverType datasets were reordered 10 times using the randomized filter from the WEKA platform [49] for each experiment. For the synthetic data streams (Agrawal, Rotating Hyperplane, and RandomRBF), 10 data streams were generated for each of them using different random seeds in MOA. Performance was calculated using the prequential error introduced in the previous section. We also measured the total time used for classification to evaluate the efficiency of each algorithm. All experiments were conducted on a personal computer with an Intel Core i3-2100 processor running the Ubuntu 13.04 64-bit operating system with 4096 MB of RAM.

3. Results and Discussion

In this section, we present and discuss the results obtained during the experimental phase, throughout which six data streams, three real and three synthetic, were used to obtain the classification and time performance of each of the compared classification algorithms. First, we evaluate the sensitivity of the proposed method to changes in the window size, a parameter that can largely influence the performance of the classifier. Table 3 shows the average accuracy of the DS-Gamma classifier for different window sizes; the best accuracy for each data stream is emphasized with boldface. Table 4 shows the average total time used to classify each data stream. The results of the average performance with an indicator of the standard deviation are also depicted in Figure 1.

In general, the DS-Gamma classifier performance improved as the window size increased, with the exception of the Electricity data stream, for which performance was better with smaller window sizes. It is worth noting that the DS-Gamma classifier achieved its best performance for the Electricity data stream with a window size of 50, with a performance of 89.12%. This performance is better than all the performances achieved by the other compared algorithms with a window size of 1000. For the other data streams, the best performance was generally obtained with window sizes between 500 and 1000. With window sizes greater than 1000, performance remained stable and in some cases even degraded, as we can observe in the results for the Bank and Hyperplane data streams in Figure 1. Figure 2 shows the average classification time for different window sizes. A linear trend line (dotted line) and the coefficient of determination were included in the graphic of each data stream. The coefficient of determination values, ranging between 0.9743 and 0.9866, show that time grows linearly as the size of the window increases.

To evaluate the DS-Gamma classifier, we compared its performance with that of other data stream classifiers. Table 5 presents the performance results of each evaluated algorithm on each data stream, including the standard deviation. The best performance is emphasized with boldface. We also include box plots with the same results in Figure 3. For the Agrawal data stream, the Hoeffding Tree and the DS-Gamma classifier achieved the two best performances. OzaBoostAdwin presents the best performance for the CoverType and Electricity data streams, but this ensemble classifier exhibits the highest classification time for all the data streams, as we can observe in Table 6 and Figure 4. This can be a serious disadvantage, especially in data stream scenarios. The lowest times are achieved by the perceptron with the Drift Detection Method (DDM), but its performance is generally low when compared to the other methods. A fact that caught our attention was the low performance of the DS-Gamma classifier for the CoverType data stream. This data stream presents a heavy class imbalance that seems to negatively affect the performance of the classifier. As we can observe in step (8) of the DS-Gamma classifier algorithm, classification relies on a weighted sum per class; when a class greatly outnumbers the other(s), this sum will be biased toward the class with the most members.

To statistically compare the results obtained by the DS-Gamma classifier and the other evaluated algorithms, the Wilcoxon signed-ranks test was used. Table 7 shows the values of the asymptotic significance (2-sided) obtained by the Wilcoxon test using a significance level $\alpha = 0.05$. The obtained values are all greater than $\alpha$, so the null hypothesis is not rejected and we can infer that there are no significant differences among the compared algorithms.
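For reference, a comparison of this kind can be carried out with the Wilcoxon signed-ranks test available in SciPy; the per-stream accuracies below are placeholders for illustration, not the values reported in Table 5.

```python
from scipy.stats import wilcoxon

# Hypothetical per-data-stream accuracies for two classifiers (placeholders only).
ds_gamma = [0.78, 0.81, 0.89, 0.74, 0.83, 0.80]
other = [0.80, 0.79, 0.86, 0.77, 0.82, 0.81]

statistic, p_value = wilcoxon(ds_gamma, other)
print(p_value > 0.05)   # True here: no significant difference at alpha = 0.05
```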

4. Conclusions

In this paper, we describe a novel method for the task of pattern classification over a continuous data stream based on an associative model. The proposed method is a supervised pattern recognition model based on the Gamma classifier. During the experimental phase, the proposed method was tested using different data stream scenarios with real and synthetic data streams. The proposed method presented very promising results in terms of performance and classification time when compared with some of the state-of-the-art algorithms for data stream classification.

We also study the effect of the window size over the classification accuracy and time. In general, accuracy improved as the window size increased. We observed that best performances were obtained with window sizes between 500 and 1000. For larger window sizes performance remained stable and in some cases even declined.

Since the proposed method bases its classification on a weighted sum per class, it is strongly affected by severe class imbalance, as we can observe from the results obtained in the experiments performed with the CoverType dataset, a heavily imbalanced dataset, where the accuracy of the classifier was the worst among the compared algorithms. As future work, a mechanism to address severe class imbalance should be integrated to improve classification accuracy.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Instituto Politécnico Nacional (Secretaría Académica, COFAA, SIP, CIC, and CIDETEC), CONACyT, and SNI for their financial support to develop this work. The support of the University of Porto was provided during the research stay from August to December 2014. The authors also want to thank Dr. Yenny Villuendas Rey for her contribution to the statistical significance analysis.