Abstract

Data stream mining has become an active research topic in data mining and has attracted the attention of many scholars. However, traditional data stream mining techniques still have unsolved problems in dealing with concept drift and concept evolution. To alleviate the influence of concept drift and concept evolution on novel class detection and classification, this paper proposes a classification and novel class detection algorithm based on a cohesion and separation index built on the Mahalanobis distance. Experimental results show that the algorithm can effectively mitigate the impact of concept drift on classification and novel class detection.

1. Introduction

In recent years, with the continued spread of the Internet and the development of the Internet of Things and data acquisition technology, data volumes have exploded. A constantly changing, time-stamped data model, the data stream, has emerged in the Internet, finance, medicine, and ecological monitoring. Since the advent of the Internet and wireless communication networks, the data stream as a new data model has attracted more and more attention [1, 2]. The data stream has characteristics that differ from those of traditional datasets: it is time-ordered, rapidly changing, massive, and potentially infinite. Precisely because of these unique characteristics, the processing model for data streams differs greatly from traditional data mining techniques. Traditional data mining operates on static datasets, which can be permanently stored in a medium and scanned multiple times during analysis. Unlike a traditional static database, a data stream is updated rapidly and flows continuously into and out of the computer system. Accordingly, the two biggest challenges in processing a data stream are its inherently infinite length and the concept drift that occurs as the data change in real time. Concept drift means that the statistical properties of the target variable that the model attempts to predict change over time in an unpredictable manner. It is therefore impractical to store and train on all historical data with traditional techniques, which makes it necessary to adapt existing data mining methods and design new mining algorithms for this new data model.

Novel class detection in data streams is a technique for detecting new categories as they appear in a stream. Many traditional data stream classification algorithms train classifiers with a fixed number of classes. In reality, however, outliers and novel classes appear in the data stream over time, causing a gradual decline in the accuracy of traditional data stream classification algorithms. It is therefore urgent to design novel class detection algorithms suited to the characteristics of data streams.

The rest of this paper is organized as follows: Section 2 introduces related research on data stream classification and novel class detection. Section 3 details the C&NCBM algorithm. Section 4 describes the experimental results and a detailed analysis on different datasets. Section 5 concludes the paper and discusses challenges and directions for future research.

2. Related Work

2.1. Data Stream Classification in the Presence of Concept Drift

The literature [3] reviews various learning algorithms developed in recent years in the context of concept drift. In 1986, Schlimmer and Granger [4] first proposed the notion of "concept drift," which subsequently attracted increasing attention from the academic community. From 1986 to 2000, research focused on using a single classifier for concept drift data stream classification: Widmer and Kubat proposed the FLORA family of methods [5], and Hulten et al. proposed methods such as CVFDT [6]. At the same time, researchers began to study the theoretical problems of concept drift data stream classification.

Because a single classifier must be continuously updated when processing a concept drift data stream and its generalization ability is limited [7], Black and Hickey [8] first introduced ensemble learning into concept drift data stream classification and proposed the AES algorithm. After about 2000, research therefore turned to ensemble classifiers, and concept drift data stream classification entered a period of rapid development, with studies moving closer to real-world settings. Klinkenberg and Lanquillon studied early on the cases of concept drift with user feedback and with no feedback [8–11]. In 2004, the journal Intelligent Data Analysis published a special issue on concept drift data streams [12], which mainly discussed how incremental learning can let existing classifiers adapt to concept drift at small cost. Subsequently, more attention has been paid to issues such as class imbalance learning [13, 14], recurring concept learning [15, 16], semisupervised learning [17, 18], and active learning [19, 20] in the classification of concept drift data streams. Table 1 summarizes the three main types of concept drift data stream classification techniques from 2000 to 2016.

2.2. Novel Class Detection in the Presence of Concept Drift

In the literature [33], Masud et al. proposed a novel class detection method for data streams with concept drift and infinite length, but this method does not address feature evolution. The literature [34] addresses feature evolution in addition to concept evolution, but the methods in [33, 34] still have too high a false alarm rate on some datasets and cannot distinguish between different novel classes. Masud et al. [35] proposed a method to handle the concept evolution caused by the emergence of novel classes. This method adds an auxiliary classifier set alongside the primary classifier set. When an arriving instance in the data stream is judged an outlier by both the primary and the auxiliary classifier sets, it is temporarily stored in a buffer. When enough instances have accumulated in the buffer, the novel class detection module is invoked; if a novel class is found, its instances are marked accordingly. In the literature [36], a feature space transformation technique is proposed to deal with the evolution of data stream features: the traditional data stream ensemble classifier is combined with novel class detection technology to solve the feature evolution problem in the data stream.

Chandak [37] proposed a string-based data stream processing method, which mainly solves the concept evolution problem through the CON_EVOLUTION algorithm. Miao et al. [38] addressed the limitation that the MineClass framework can handle only numerical data: they proposed a novel class detection algorithm for mixed-attribute data and used the VFDTc classifier to optimize the framework's processing time and model size. ZareMoodi et al. [39] used local patterns and neighbor graphs to solve the concept evolution problem in data streams. Local patterns are Boolean feature groups over sequential and categorical features that are used to improve classification accuracy, while neighbor graphs are used to analyze interrelated objects among candidate novel classes to improve the accuracy of novel class detection.

Through these continued efforts, novel class detection has achieved many results. However, most novel class algorithms cannot handle multiple simultaneous novel classes and do not consider the interaction of different attributes within an instance when determining a novel class. Therefore, building on previous studies and taking the role of attributes into account, this paper proposes a novel class detection algorithm that can distinguish different categories of novel classes.

3. Classification and Novel Class Detection Algorithm Based on Mahalanobis Distance (C&NCBM)

3.1. Cohesion and Separation Index Based on Mahalanobis Distance

Based on the Mahalanobis distance [40] and the cohesive separation index N-NSC proposed by Masud et al. [33], a novel class detection index is proposed. The relevant definitions are as follows.
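As an illustration of the underlying metric, the following minimal Python sketch (the sample data and names are invented for illustration, not from the paper) computes the Mahalanobis distance between two points given the inverse of a covariance matrix estimated from data:

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between x and y under inverse covariance cov_inv."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ cov_inv @ d))

# Estimate the covariance from toy sample data, then invert it
data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
cov_inv = np.linalg.inv(np.cov(data.T))
print(mahalanobis([1.0, 2.0], [2.0, 4.0], cov_inv))
```

With the identity matrix as `cov_inv`, the expression reduces to the ordinary Euclidean distance; a data-driven covariance instead accounts for the scale and correlation of the attributes.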

Definition 1. (R-outlier) (see [33]). Let x be a test instance and let h be the pseudopoint closest to x. If x falls outside the region of feature space covered by h, then x is an R-outlier.
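A minimal sketch of this definition (function and variable names are illustrative, not from the paper): a test point is an R-outlier when it falls outside the radius of its nearest pseudopoint.

```python
import numpy as np

def is_r_outlier(x, centroids, radii):
    """True if x lies outside the hypersphere of its nearest pseudopoint.

    centroids: (k, d) array of cluster centers; radii: (k,) array of radii.
    """
    dists = np.linalg.norm(centroids - x, axis=1)  # distance to every centroid
    nearest = np.argmin(dists)                     # index of the closest pseudopoint
    return dists[nearest] > radii[nearest]         # outside its radius => R-outlier

# Toy pseudopoints: unit-radius clusters at the origin and at (10, 10)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
radii = np.array([1.0, 1.0])
print(is_r_outlier(np.array([5.0, 5.0]), centroids, radii))  # True
```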

Definition 2. (F-outlier) (see [33]). If x is an R-outlier for every classifier in the classifier set M, then x is an F-outlier.

Definition 3. (λc-neighbor) (see [33]). The λc-neighborhood of an F-outlier x is the set of the n instances of class c closest to x, denoted λc(x), where n is a user-set parameter.
According to the above definitions, we give the definition of the Mahalanobis-distance-based cohesion and separation index MN-NSC.

Definition 4. (MN-NSC). Let D_out(x) be the average Mahalanobis distance from the F-outlier x to λout(x), let D_c(x) be the average Mahalanobis distance from x to λc(x), and let D_cmin(x) be the minimum of D_c(x) over all existing classes c; then MN-NSC is defined as follows:

MN-NSC(x) = (D_cmin(x) − D_out(x)) / max(D_cmin(x), D_out(x)),

where λout(x) represents the n-neighborhood of x among the other F-outliers and λc(x) represents the n-neighborhood of x within existing class c.
By definition, the value of MN-NSC lies in the interval [−1, 1]. A negative MN-NSC means that x is closer to an existing class and far from the other F-outliers; a positive MN-NSC means that x is far from every existing class and close to the other F-outliers. When at least N (N > n) F-outliers have an MN-NSC value greater than 0, a novel class has emerged in the data stream.
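The index can be sketched in Python as follows (a minimal sketch with invented helper names and toy data; the identity inverse covariance reduces the Mahalanobis distance to the Euclidean distance purely for clarity of the example):

```python
import numpy as np

def mahalanobis(x, y, vi):
    """Mahalanobis distance given an inverse covariance matrix vi."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ vi @ d))

def avg_nn_dist(x, points, n, vi):
    """Average Mahalanobis distance from x to its n nearest points."""
    d = np.sort([mahalanobis(x, p, vi) for p in points])
    return d[:n].mean()

def mn_nsc(x, f_outliers, classes, n, vi):
    """MN-NSC in [-1, 1]: positive when x is nearer the F-outliers than any class."""
    d_out = avg_nn_dist(x, f_outliers, n, vi)                       # D_out(x)
    d_cmin = min(avg_nn_dist(x, pts, n, vi) for pts in classes.values())  # D_cmin(x)
    return (d_cmin - d_out) / max(d_cmin, d_out)

# Toy data: one existing class near the origin, F-outliers near (10, 10)
classes = {'a': np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])}
f_outliers = np.array([[10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
vi = np.eye(2)  # identity => plain Euclidean distance, for illustration
print(mn_nsc(np.array([10.5, 10.5]), f_outliers, classes, n=2, vi=vi))  # positive
print(mn_nsc(np.array([0.5, 0.5]), f_outliers, classes, n=2, vi=vi))   # negative
```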

3.2. Algorithm

This section elaborates the process of the classification and novel class detection algorithm based on the Mahalanobis distance cohesion and separation index and analyzes how the algorithm handles concept drift in the data stream.

First, the data stream is divided into data blocks of the same size, and the last arriving data block D_u, the current best classifier set M, the nearest-neighbor parameter n, and the novel class threshold q are taken as input to the algorithm. The instances in the data block are then classified, and each instance is checked for being an R-outlier. If an instance is an R-outlier, it is added to the outlier set F_list. K-means is used to cluster the instances in F_list, creating a pseudopoint h for each cluster. Each pseudopoint stores the cluster center and cluster radius, and the MN-NSC value is calculated for each pseudopoint h. If the number of pseudopoints with an MN-NSC value greater than zero exceeds the threshold q, the algorithm determines that a novel class has emerged and labels its instances. Once all data in D_u are labeled, D_u is used to train a new model M'. The model with the lowest classification accuracy is selected from the set M and replaced with M'. In this way, the classifier set always models the latest concepts, which mitigates the concept drift problem in the data stream (Algorithm 1). The pseudocode for the algorithm is shown below.

Input: Data block D_u, Classifier set M, Nearest-neighbor parameter n, Threshold q
Output: Updated classifier set M
(1)for each instance x in block D_u do
(2) Classify(M, x)
(3) if x is an R-outlier for all classifiers in the classifier set M then
(4)  Add x to the set F_list
(5) end if
(6)end for
(7)Cluster F_list by K-means and create a pseudopoint h for each cluster
(8)for each pseudopoint h do
(9) Compute MN-NSC(h)
(10) if MN-NSC(h) is greater than 0 then
(11)  count = count + 1
(12) end if
(13)end for
(14)if count greater than q then
(15) Put all instances belonging to the novel class in block D_u into a new class
(16)end if
(17)if all instances in D_u are labeled then
(18) M' = Train(D_u)
(19) M = Replace(M, M')
(20)end if
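Steps (7)-(16) of the pseudocode can be sketched in Python as follows (a simplified illustration with invented helper names; a minimal K-means stands in for the clustering step, and an identity inverse covariance is used so the Mahalanobis distance reduces to the Euclidean one):

```python
import numpy as np

def kmeans(points, k, iters=10):
    """Minimal Lloyd's K-means; returns one pseudopoint (centroid) per cluster."""
    centroids = points[np.linspace(0, len(points) - 1, k).astype(int)]  # spread seeds
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(points[:, None] - centroids, axis=2), axis=1)
        centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return centroids

def mn_nsc(h, f_outliers, classes, n, vi):
    """Cohesion/separation score of pseudopoint h (identity vi => Euclidean)."""
    def avg_nn(pts):
        diff = pts - h
        d = np.sort(np.sqrt(np.einsum('ij,jk,ik->i', diff, vi, diff)))
        return d[:n].mean()
    d_out = avg_nn(f_outliers)
    d_cmin = min(avg_nn(pts) for pts in classes.values())
    return (d_cmin - d_out) / max(d_cmin, d_out)

def novel_class_detected(f_outliers, classes, n, q, k, vi):
    """Steps (7)-(16): cluster the F-outliers, count pseudopoints with MN-NSC > 0."""
    count = sum(1 for h in kmeans(f_outliers, k)
                if mn_nsc(h, f_outliers, classes, n, vi) > 0)
    return count > q
```

For example, with one existing class near the origin and F-outliers forming two clumps far away from it, `novel_class_detected` returns True for a small threshold q, mirroring lines (14)-(16) of the pseudocode.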

4. Experiment and Analysis

To verify the classification and novel class detection algorithm based on the Mahalanobis distance cohesion and separation index proposed in this paper, three sets of experiments were performed on two real datasets and one synthetic dataset. KNN (K-Nearest Neighbor) [41] was selected as the base data stream classifier of the C&NCBM algorithm to confirm the final predicted category of each instance; the algorithm proposed in this paper is essentially built on KNN. To verify the effectiveness of the algorithm, KNN used alone to classify the data stream and the MineClass algorithm [33] proposed by Masud et al. were selected for comparative experiments.

4.1. Experimental Datasets

The KDD Cup 1999, Covertype, and ArtificialCDS datasets were selected as experimental datasets. The number of classes, number of dimensions, and total number of dataset samples for each dataset are shown in Table 2.

4.1.1. KDD Cup 1999 Dataset

(http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). The KDD Cup 1999 dataset was used in the ACM annual competition in 1999. It consists of 494,021 samples in 3 categories, each sample containing 42 attributes. This article uses the 10% version of the KDD Cup dataset.

4.1.2. Covertype Dataset

(http://archive.ics.uci.edu/ml/datasets/Covertype). The Covertype dataset is Resource Information System (RIS) data from US Forest Service (USFS) Region 2. It contains 581,012 instances in 7 classes, each with 54 attributes.

4.1.3. ArtificialCDS Dataset

(https://moa.cms.waikato.ac.nz/). The ArtificialCDS dataset is a random concept drift data stream that is automatically generated by MOA. The data stream contains 5 classes with a total of 100,000 instances, and the attribute dimension of each sample is 27.

4.2. Performance Index
4.2.1. Classification Accuracy

This experiment uses the accuracy [42] and evaluation time [33] of the classification algorithms to evaluate their quality; these are widely used evaluation standards in the field of classification. A good classification algorithm should keep the evaluation time short while ensuring high classification accuracy.

4.2.2. Kappa Statistic

The Kappa Statistic [43] is an indicator for assessing classification accuracy:

kappa = (p_0 − p_c) / (1 − p_c),

where p_0 is the proportion of the classifier's observed agreement, that is, the total number of correctly classified samples divided by the total number of samples, and p_c is the proportion of agreement expected under random classification.
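A minimal Python sketch of this statistic (the function name and toy labels are illustrative):

```python
import numpy as np

def kappa_statistic(y_true, y_pred):
    """Kappa = (p0 - pc) / (1 - pc): agreement corrected for chance."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p0 = np.mean(y_true == y_pred)  # observed agreement
    pc = sum(np.mean(y_true == c) * np.mean(y_pred == c)  # chance agreement
             for c in np.union1d(y_true, y_pred))
    return (p0 - pc) / (1 - pc)

print(kappa_statistic([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 (perfect agreement)
print(kappa_statistic([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 (no better than chance)
```

Unlike raw accuracy, kappa discounts the agreement a classifier would obtain by guessing according to the class frequencies, which makes it a fairer measure on imbalanced streams.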

4.3. Experimental Results and Analysis

This section separately compares and verifies the classification performance of the proposed algorithm and its robustness to concept drift, and analyzes the results.

4.3.1. Experiment 1

According to the experimental objectives described above, we selected the Covertype, KDD Cup 1999, and ArtificialCDS datasets as experimental datasets and compared the classification accuracy and evaluation time of C&NCBM, MineClass, and KNN alone on these three datasets. The specific parameter values used for each dataset are shown in Table 3. The experimental results on the three datasets are shown in Tables 4–6.

It can be seen from the experimental results in Tables 4–6 that, throughout the data stream classification process, the classification accuracy of C&NCBM is very stable and significantly higher than that of the other two algorithms. MineClass also classifies better than KNN alone. The evaluation time of C&NCBM is significantly longer than that of the other two algorithms, while the difference between the evaluation times of MineClass and KNN alone is small. In short, C&NCBM achieves higher accuracy than MineClass but requires more evaluation time.

The results of the three sets of experiments on two real datasets and one artificial dataset show that the proposed algorithm, when used to classify data streams with concept drift and novel classes, has the following characteristics. (1) It makes timely judgments when a novel class appears in a concept drift data stream and adaptively updates the original model afterwards, giving it stronger classification robustness to novel class occurrences. (2) Compared with an ordinary classifier, it improves classification accuracy significantly, and it also improves accuracy to a certain extent over the Euclidean-distance-based classification and novel class detection algorithm MineClass [33]. (3) Its evaluation time is slightly longer than that of the other algorithms.

4.3.2. Experiment 2

The appearance of concept drift in a data stream indicates that the mapping between attributes and categories has changed, and classifiers on the data stream are built on this mapping. When the attribute-to-category mapping changes, the classifier's Kappa Statistic will inevitably change significantly. Therefore, in this section, we use the change in the classifiers' Kappa Statistic to measure the sensitivity of the different algorithms to concept drift.

We selected the Covertype and ArtificialCDS datasets as experimental datasets and compared the Kappa Statistic of C&NCBM, MineClass, and KNN on these two datasets. The comparison results are shown in Figure 1.

To introduce concept drift, we rearranged the Covertype dataset so that at most 3 and at least 2 categories appear in any block at the same time, with new categories appearing randomly. The concept drift of the rearranged Covertype dataset falls mainly in blocks 3 and 5. The ArtificialCDS dataset automatically generated by MOA exhibits incremental drift, mainly in blocks 4 and 6. The results in Figure 1 show that KNN's Kappa Statistic declines fastest because it lacks a concept drift handling mechanism. MineClass is partially affected, but its decrease is smaller than KNN's. C&NCBM is the least affected by concept drift, and its accuracy curve is the most gradual. When concept drift occurs in the data stream, all three algorithms are affected to a certain extent, but the proposed C&NCBM algorithm adapts to concept drift better and can reduce its influence on classification to some extent.

5. Conclusion

In this paper, a cohesion and separation index based on the Mahalanobis distance, MN-NSC, is proposed. Building on this index, a Mahalanobis-distance-based classification and novel class detection algorithm, C&NCBM, is proposed. Unlike the traditional approach of measuring the distance between instances with the Euclidean distance, this method pays more attention to the similarity between instances and can sensitively detect small changes among outliers. Comparative experiments against the KNN and MineClass algorithms verify the effectiveness of the classification algorithm. The Kappa Statistic of the C&NCBM, KNN, and MineClass algorithms was also compared, and the results show that the proposed C&NCBM algorithm performs best: its concept drift adaptability can mitigate the influence of concept drift on classification in the data stream to some extent. However, because of the added cost of computing Mahalanobis distances, the proposed algorithm requires slightly more time than the other algorithms. How to reduce the computation time while preserving classification effectiveness is a direction for future research.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (nos. 61862042 and 61762062, 61601215); Science and Technology Innovation Platform Project of Jiangxi Province (no. 20181BCD40005); Major Discipline Academic and Technical Leader Training Plan Project of Jiangxi Province (no. 20172BCB22030); Primary Research & Development Plan of Jiangxi Province (no. 20192BBE50075, 20181ACE50033, 20171BBE50064, 2013ZBBE50018); Jiangxi Province Natural Science Foundation of China (nos. 20192BAB207019 and 20192BAB207020); and Graduate Innovation Fund Project of Jiangxi Province (nos. YC2019-S100 and YC2019-S048).