Abstract

Data stream mining techniques can classify evolving data streams, such as network traffic, in the presence of concept drift. To classify high-bandwidth network traffic in real time, data stream mining classifiers need to be implemented on a reconfigurable high-throughput platform, such as a Field Programmable Gate Array (FPGA). This paper proposes an algorithm for online network traffic classification based on incremental k-means clustering that continuously learns from both labeled and unlabeled flow instances. Two distance measures for incremental k-means (Euclidean and Manhattan) are analyzed to measure their impact on network traffic classification in the presence of concept drift. The experimental results on real datasets show that the proposed algorithm exhibits consistency, with up to 94% average accuracy for both distance measures, even in the presence of concept drift. The proposed incremental k-means classifier using Manhattan distance can classify network traffic three times faster than its Euclidean counterpart, at 671 thousand flow instances per second.

1. Introduction

Network traffic classification is a critical network processing task for network management. Traffic measurement and classification enable network administrators to understand the current network state and to reconfigure the network so that the observed state improves over time. The complexity and dynamic characteristics of today's network traffic necessitate traffic classification techniques that can adapt to new concepts. This includes the ability to classify types of traffic almost instantaneously, so that the knowledge gained from learning new concepts does not become outdated.

Data stream mining algorithms [16] have been introduced to overcome the shortcomings of conventional data mining algorithms. They are designed to handle concept drift, to forget old irrelevant data, and to adapt to new knowledge. References [7-9] have proposed the use of data stream mining algorithms for traffic classification, such as the Very Fast Decision Tree [3] and the Concept-Adaptive Very Fast Decision Tree [4]. Reference [10] proposed a new algorithm named Concept-Adaptive Rough Set based Decision Tree (CRSDT) to classify network traffic. These algorithms have successfully demonstrated the ability of data stream mining to handle dynamic and fast-changing network data streams with sustained accuracy. However, the decision-tree-based implementations require an intensive training process and cause high memory consumption for model building [11].

References [2, 12] proposed the use of incremental clustering for data stream classification. Although both works show high classification accuracy for evolving data streams, the processing rate of such algorithms is low. One of the reasons is the use of Euclidean distance as the distance metric in both works. Euclidean distance computation requires multiple square and square root operations, which contribute to high overhead and limited speed. An alternative distance measure is the Manhattan distance, which does not require heavy multiplications [13] and can be efficiently implemented on reconfigurable hardware such as a Field Programmable Gate Array (FPGA). Unlike in batch data mining, the distance metric in k-means incremental learning cannot be converted directly from Euclidean to Manhattan distance; certain modifications to the incremental k-means algorithm are needed. In incremental clustering algorithms, cluster information is normally stored as a summary of points, and the raw data are discarded after training. With Euclidean distance, the radius can be recomputed entirely from the summary of points, but with Manhattan distance the raw data are needed. Thus, some modification of the summary of points is required.

This paper analyzes the online incremental k-means network traffic classification of [16] with two distance measures (Euclidean and Manhattan). The proposed method using Manhattan distance requires less computation than Euclidean distance and, hence, achieves a higher running speed. The algorithm has been verified with real network traces to evaluate its accuracy under both distance measures and the possible implementation trade-offs for high-throughput network traffic classification. The rest of the paper is organized as follows. Section 2 reviews related works. The proposed incremental k-means method based on Euclidean distance is described in Section 3. The modification of the proposed method to suit Manhattan distance is discussed in Section 4. The experimental setup and results are discussed in Section 5. Section 6 concludes the paper and suggests future work.

2. Related Works

The simplicity of clustering algorithms such as k-means makes them well suited for network traffic classification. References [17-19] proposed k-means implementations for classifying network traffic with high accuracy. Reference [20] proposed the use of a feature selection method with the k-means algorithm to enhance classification accuracy. Reference [11] proposed a new initialization method for centroid selection in k-means to further improve the classification accuracy of network traffic, whereas [21] proposed an enhancement to the k-means algorithm to prevent the diverse impact of attributes on the clustering output. These works successfully show the suitability of clustering-based algorithms for network traffic classification, although they cannot adapt to changes of concept in today's network traffic.

References [22-27] proposed incremental clustering algorithms that can adapt to new knowledge over time. Reference [22] proposed the microclustering concept, where only a summary of the clusters is kept throughout the learning process; the algorithm learns new concepts incrementally by updating the cluster summaries. Reference [23] adapted the microclustering concept of [22] and added a macroclustering stage with a pyramidal time frame, in which not all microclusters are saved, reducing the overall memory consumption. Reference [24] adapted the microclustering and macroclustering concepts of [22, 23] and customized them for trajectory data. Although the method in [24] demonstrates the ability to update the classification model incrementally, it does not support streaming data.

Reference [25] proposed graph-based incremental clustering, although it is not suitable for online network classification due to its long processing time and large memory consumption. Reference [26] proposed an incremental DBSCAN for data warehousing. The proposed algorithm is based on density-based clustering, where the radius of a cluster and the minimum number of points in a cluster are assumed to be fixed. Reference [27] proposed incremental-clustering-based real-time anomaly detection, in which incremental training is initiated based on a false alarm threshold that requires continuous feedback from the network administrator. Incremental clustering can continuously learn new knowledge and reduce the misclassification caused by outdated knowledge. However, for online real-time network traffic classification, the processing rate of software implementations of such algorithms is not sufficient to support current network speeds. Implementing such algorithms on reconfigurable hardware such as a Field Programmable Gate Array (FPGA) can accelerate the processing rate.

To the best of our knowledge, only [28] has proposed an FPGA implementation of an incremental clustering algorithm for multimedia traffic classification. The proposed method uses Hamming distance instead of Euclidean distance for the distance measurement. It appends an extra bit to the data to indicate training or testing instances and incrementally updates the model when the training bit is detected. However, the method requires a large training set to achieve high accuracy. On the other hand, FPGA implementations of the nonincremental k-means algorithm are more common, for example, [29-32]. However, implementing the Euclidean-distance-based k-means algorithm in hardware consumes considerable hardware resources: references [30, 31] reported that 90% of the hardware resources were required to implement the k-means algorithm. A modification of the distance was proposed in [33] using the squared distance. References [13, 34-36] proposed the use of Manhattan distance as the distance measure for k-means. Such an implementation not only reduces the hardware cost but is also fully configurable and easily pipelined to support a high degree of parallelism. Hence, towards a hardware-accelerated incremental online network traffic classifier, the proposed incremental k-means classifier based on Manhattan distance is a better option than one based on Euclidean distance.

3. Online Incremental k-Means Algorithm

We proposed online incremental k-means clustering in [16] for online network traffic classification. It consists of two main processes: classification and learning. The following terms are used throughout this paper:
(1) Flow: the network traffic that belongs to a process-to-process communication.
(2) Instance: a tuple of attributes.
(3) Flow features: the attributes or statistical features extracted from a flow, for example, the number of bytes in the payload.
(4) Flow instance: the instance made up of flow features which represents a flow.
The classification process performs online classification on flow instances, while the learning process simultaneously performs incremental learning to update the classification model. Both processes are discussed in detail in Sections 3.2 and 3.4, respectively. Figure 1 shows an overview of the proposed method. The selection module selects the flow instances to be learned in order to avoid mislearning (see Section 3.3). Manual labeling is not covered in this paper; an example of such a technique is the groundtruth method discussed in [15].

3.1. Classification Model Initialization

Before online classification can be performed, the classification model needs to be initialized. This process is performed once during start-up to prepare the base classifier model. In this stage, the supervised k-means technique is used to cluster the batch of labeled flow instances into initial clusters. In order to increase classification efficiency, the classification model is made up of smaller micromodels located in different regions of the Euclidean space. To achieve this, precollected flow instances are distributed to m chunks according to their distance to the origin, d_o, such that

d_o(X) = \sqrt{\sum_{i=1}^{d} x_i^2}, \quad (1)

where X = (x_1, x_2, ..., x_d) is a flow instance with d flow features. Each micromodel is then built using the respective chunk of flow instances.
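
As an illustration, this initialization-time chunking step can be sketched as follows. This is a minimal Python sketch under our own assumptions; the function names and the use of NumPy are not part of the original implementation.

    import numpy as np

    def distance_to_origin(x):
        """Euclidean distance of a flow instance to the origin, as in (1)."""
        return float(np.sqrt(np.sum(np.asarray(x, dtype=float) ** 2)))

    def build_chunks(instances, m):
        """Sort precollected flow instances by their distance to the origin
        and split them into m chunks, one chunk per micromodel."""
        instances = np.asarray(instances, dtype=float)
        order = np.argsort([distance_to_origin(x) for x in instances])
        return np.array_split(instances[order], m)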

The initialization of centroids is based on the method suggested in [12], where the initial classes of the clusters are assumed to be proportional to the data distribution. The clusters are then compressed into sufficient statistics known as clustering features (CF). A CF is a 3-tuple that summarizes the information about a cluster, as proposed in [22]. Given N d-dimensional data points X_1, X_2, ..., X_N in a cluster, the CF of the cluster is defined as the 3-tuple CF = \langle N, LS, SS \rangle, where

LS = \sum_{j=1}^{N} X_j, \quad SS = \sum_{j=1}^{N} X_j^2. \quad (2)

In our classification model, we modify the CF by adding the timestamp t and the class information c to represent each cluster: CF = \langle N, LS, SS, t, c \rangle. Merging a new instance X into cluster C is based on the Additivity Theorem [22], where t and c are unchanged, such that

CF' = \langle N + 1, LS + X, SS + X^2, t, c \rangle. \quad (3)

Raw data are discarded in order to save memory space. The clusters represented by clustering features are used for classification, and they may be modified based on newly received data.
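
The clustering feature and its additive merge can be expressed compactly in code. The following Python sketch is illustrative only; the class layout is our own assumption, not the paper's exact data structure.

    from dataclasses import dataclass
    import numpy as np

    @dataclass(eq=False)
    class ClusteringFeature:
        n: int            # N: number of instances in the cluster
        ls: np.ndarray    # LS: linear sum of the instances (d-dimensional)
        ss: np.ndarray    # SS: per-dimension sum of squared instances
        timestamp: int    # t: usage timestamp for model reconstruction
        label: int        # c: class of the cluster

        def merge(self, x):
            """Absorb a new instance using the Additivity Theorem, as in (3)."""
            self.n += 1
            self.ls = self.ls + x
            self.ss = self.ss + x ** 2

        @property
        def centroid(self):
            return self.ls / self.n

        @property
        def radius(self):
            """Average radius computed from the summary alone, as in (5)."""
            var = np.sum(self.ss / self.n - self.centroid ** 2)
            return float(np.sqrt(max(var, 0.0)))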

Algorithm 1 shows the overall steps of classification model initialization. During model initialization, the precollected flow instances are divided into m sets, and the supervised k-means method is used to create a micromodel for each set of flow instances. The created clusters are summarized as clustering features with timestamp 0.

Input: X: precollected flow instances (array of features)
Input: Y: true labels for X
d_o: distance of a flow instance to the origin
m: number of micromodels
n: number of precollected flow instances
k: number of clusters in a micromodel

for i in range of n do
    calculate d_o(X_i)
end for
quicksort X by d_o
distribute X into m micromodels
for each micromodel do
    generate k clusters using supervised k-means
    summarize each cluster into a clustering feature, CF
    store the clusters in a time-series and set their timestamps to 0
end for

3.2. Online Classification

Classification starts upon receiving an incoming flow instance X. The distance to the origin, d_o(X), is calculated to find the respective micromodel. Within the micromodel, the distance between each cluster centroid and X is computed using (4). The nearest cluster C_1 and the second nearest cluster C_2 are then determined. Assuming that the real class label of the flow instance is unknown (an unlabeled flow instance), the predicted class is the class of C_1 with respect to

d(X, c_i) = \sqrt{\sum_{j=1}^{d} (x_j - c_{ij})^2}, \quad (4)

where c_i is the centroid of cluster C_i, computed as c_i = LS_i / N_i.
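
A sketch of this classification step in Python follows. It reuses distance_to_origin and ClusteringFeature from the earlier sketches; representing the micromodel boundaries as a sorted list of d_o upper bounds is our own assumption.

    import numpy as np

    def classify(x, micromodels, bounds):
        """Classify flow instance x (Section 3.2).

        micromodels: list of cluster lists (ClusteringFeature objects)
        bounds: ascending d_o upper bound of each micromodel
        Returns (nearest cluster, second nearest cluster, predicted label)."""
        d_o = distance_to_origin(x)
        idx = min(int(np.searchsorted(bounds, d_o)), len(micromodels) - 1)
        model = micromodels[idx]
        ranked = sorted(model, key=lambda cf: float(np.linalg.norm(x - cf.centroid)))
        c1, c2 = ranked[0], ranked[1]    # nearest and second nearest, by (4)
        return c1, c2, c1.label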

3.3. Selection of Learning Instance

As a self-training method is applied in the learning algorithm, all predicted labels are assumed to be accurate. While this is not entirely true in an incremental learning process, a certain rate of false positives must be expected. Since learning from falsely predicted flow instances causes false learning, only flow instances with a high prediction confidence are chosen for learning. The selection criteria are designed to reuse information from the classification process so that no extra computation is needed; extra computation would lengthen model learning and make incremental learning inefficient.

Confidence is divided into three levels, as shown in Table 1. The criteria determine the confidence level in terms of a label conflict between the two nearest clusters (conflicting neighbor) and the condition where a flow instance lies within the nearest cluster's boundary (in-boundary). Conflicting neighbor is set to true when the two nearest clusters belong to different classes and to false when they belong to the same class.

The boundary of the nearest cluster is determined by the cluster's average radius, defined by (5) [22]:

R = \sqrt{\frac{SS}{N} - \left(\frac{LS}{N}\right)^2}. \quad (5)

However, for clusters with N = 1, (5) is not valid, as the subtraction evaluates to zero and the cluster has no usable boundary. This consideration was not analyzed in [37]. In [2], the calculation of the maximum boundary is based on the nearest neighbor's maximum boundary. However, the boundaries of such clusters are not always similar, and determining the nearest neighbors increases the computational complexity of the system.

We propose that the boundary of a cluster with only one flow instance be determined by attribute similarity. Each attribute of the cluster centroid is compared with the corresponding attribute of the incoming flow instance. An attribute is considered similar if the ratio between the two values is within 10%, that is, 0.9 \le x_i / c_i \le 1.1. A boundary threshold parameter defines the maximum number of nonsimilar attributes between a flow instance and the cluster centroid that can be tolerated for the incoming flow instance to be included in the respective cluster. If the number of nonsimilar attributes is lower than the boundary threshold, the instance is considered to be within the boundary of the particular cluster.
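
A possible reading of this boundary test in Python is sketched below; the handling of zero-valued centroid attributes is our own assumption.

    import numpy as np

    def in_singleton_boundary(x, centroid, boundary_threshold=1, tol=0.10):
        """Boundary test for a cluster with a single instance (Section 3.3).

        An attribute is similar when the ratio between the instance value and
        the centroid value is within tol (10%). The instance is in the boundary
        if fewer than boundary_threshold attributes are nonsimilar."""
        x = np.asarray(x, dtype=float)
        centroid = np.asarray(centroid, dtype=float)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(centroid != 0, x / centroid,
                             np.where(x == 0, 1.0, np.inf))
        nonsimilar = int(np.sum(np.abs(ratio - 1.0) > tol))
        return nonsimilar < boundary_threshold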

The confidence level is determined as follows: let C_1 be the nearest cluster and C_2 the second nearest cluster, y_i the class of C_i, R_1 the radius of C_1, and c_1 the centroid of C_1. The confidence level is set to zero by default. In the case of y_1 = y_2 (no conflicting neighbor), the confidence level increases by one. If d(X, c_1) \le R_1 for a cluster C_1 with more than one flow instance, or if the attribute-similarity condition holds for a C_1 with only one flow instance, the confidence level increases to two, provided that the condition y_1 = y_2 is also satisfied.
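
The confidence level can then be computed from quantities already produced by classification, for example as follows (a sketch reusing the earlier helpers; this is our reading of Table 1, which is not reproduced here):

    import numpy as np

    def confidence_level(x, c1, c2):
        """0: conflicting neighbors; 1: agreeing neighbors;
        2: agreeing neighbors and x within the nearest cluster's boundary."""
        level = 0
        if c1.label == c2.label:                       # no conflicting neighbor
            level = 1
            if c1.n > 1:
                in_bound = float(np.linalg.norm(x - c1.centroid)) <= c1.radius
            else:
                in_bound = in_singleton_boundary(x, c1.centroid)
            if in_bound:
                level = 2
        return level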

3.4. Semisupervised Incremental Learning

Only flow instances with confidence level two are used in incremental learning. As shown in Algorithm 2, such flow instances are merged with the nearest cluster based on (3). The learning then updates the classification model. As the input flow instances arrive in streams, changes in distribution and concept are expected; outdated knowledge needs to be deleted, and the micromodels need to be reconstructed based on the recent clusters. This process is discussed in Section 3.4.2.

X: incoming flow instance
C_1: nearest cluster to X

while new X with confidence level two do
    merge X into C_1
end while
-- Labeled Instance Injection (upon receiving) --
if a labeled flow instance (X, y) is received then
    inject (X, y) according to Table 2
end if
-- Micro-model Reconstruction (after one chunk) --
K: number of clusters
K_d: desired number of clusters
while K > K_d do
    delete clusters with timestamp 0
    reduce the remaining clusters' timestamps by 1
end while
gather all clusters in the classification model
reconstruct the micromodels

3.4.1. Injecting Labeled Instances

In our previous work [16], we assumed that the labeling of flow instances could be done immediately after the learning process. Our extended experiments suggest that this is not possible, since labeling flow instances involves manual steps. Thus, we can only assume that some flow instances will be labeled externally and fed into the model once they are ready. Such a flow instance is treated as a new flow instance and reclassified in order to obtain the necessary information, namely, its confidence level and prediction result. To keep the injection effort minimal, different handling methods based on the confidence level and the prediction result are proposed for the possible scenarios. Table 2 shows the injection handling method for each scenario. A false prediction on a previously trained flow instance causes the trained cluster to be deleted immediately, since that cluster is no longer reliable. In addition, a new cluster holding the labeled instance is added if the instance is not within the boundary of the nearest cluster C_1. This holds except for scenarios 2 and 3, since there is a possibility that the instance lies within the boundary of the second nearest cluster C_2. Figure 2(a) illustrates the boundary condition when the instance is nearer to C_1 but lies within the boundary of C_2. Merging the instance into C_2 would shift the boundary of C_2 towards C_1 and can result in an overlap (Figure 2(b)). Thus, when C_1 and C_2 belong to different classes, one of these clusters needs to be deleted. If they belong to the same class, the instance is ignored, since learning it brings no significant change to the model.
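
Since Table 2 is not reproduced here, the following Python sketch is only our paraphrase of the scenario handling described above; model.delete, model.add_cluster, and the model attributes are hypothetical helpers.

    import numpy as np

    def inject_labeled(x, y_true, model):
        """Sketch of labeled-instance injection (Section 3.4.1)."""
        c1, c2, y_pred = classify(x, model.micromodels, model.bounds)
        if y_pred != y_true:
            model.delete(c1)                  # a falsely trained cluster is unreliable
        in_c1 = float(np.linalg.norm(x - c1.centroid)) <= c1.radius
        in_c2 = float(np.linalg.norm(x - c2.centroid)) <= c2.radius
        if in_c2 and not in_c1:               # overlap risk, cf. Figure 2
            if c1.label != c2.label:
                model.delete(c2)              # drop one of the conflicting clusters
            return                            # same class: learning x changes little
        if not in_c1:
            model.add_cluster(x, y_true)      # open a new cluster for the label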

3.4.2. Micromodel Reconstruction

In order to prevent storing all outdated clusters, which may result in an imbalanced micromodel, a micromodel reconstruction process is performed after a user-predefined number of flow instances (a chunk) has been received. Clusters that are unutilized or underutilized are deleted, as they do not contribute to the classification decision. In addition, the micromodel reduction aims to reduce the memory footprint and the classification time. The reduction process includes the following steps (see the sketch after this list):
(1) All clusters are structured as a time-series based on the time they were created.
(2) All clusters are given zero timestamps.
(3) When a cluster is nominated as the nearest cluster in the classification process, its timestamp is incremented by one.
(4) When a chunk has been received, the timestamp of each cluster is checked from the beginning of the series. Clusters with timestamp zero are deleted until the number of clusters, K, is reduced to a user-predefined number, K_d. Then, the timestamps of the remaining clusters are decremented by one. If K > K_d even after deleting all zero-timestamp clusters, the deletion process is repeated for timestamp one and so on until the required number of clusters is reached.
After the deletion of unused clusters, the remaining clusters are repartitioned into micromodels based on their centroid locations in the Euclidean space.
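
A compact Python sketch of this reduction and repartitioning follows; it reuses distance_to_origin and the ClusteringFeature sketch, and the even-sized repartitioning is our own simplification.

    def reconstruct(clusters, k_desired, m):
        """Prune stale clusters by timestamp, then repartition the survivors
        into m micromodels by distance to origin (Section 3.4.2)."""
        threshold = 0
        while len(clusters) > k_desired:
            excess = len(clusters) - k_desired
            survivors = []
            for cf in clusters:                  # oldest first (creation order)
                if cf.timestamp <= threshold and excess > 0:
                    excess -= 1                  # delete this stale cluster
                else:
                    survivors.append(cf)
            clusters = survivors
            threshold += 1                       # escalate if still too many
        for cf in clusters:
            cf.timestamp = max(cf.timestamp - 1, 0)
        ordered = sorted(clusters, key=lambda cf: distance_to_origin(cf.centroid))
        size = max(1, -(-len(ordered) // m))     # ceiling division into m parts
        return [ordered[i:i + size] for i in range(0, len(ordered), size)]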

4. Analysis of Euclidean and Manhattan Distance Measures

The conversion from Euclidean distance to Manhattan distance for the incremental k-means is not direct: the clustering features and the related equations must be modified to suit the distance measure used.

4.1. Euclidean Distance versus Manhattan Distance

Euclidean distance and Manhattan distance are special cases of the Minkowski distance. The Minkowski distance (see (6)) is the distance of order p between two points X = (x_1, x_2, ..., x_d) and Y = (y_1, y_2, ..., y_d), where p \ge 1:

D(X, Y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}. \quad (6)

Manhattan distance is the Minkowski distance of first order (L1-norm), while Euclidean distance is the Minkowski distance of second order (L2-norm).
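
In code, the relationship between the three distances is immediate (a small illustrative Python sketch):

    import numpy as np

    def minkowski(x, y, p):
        """Minkowski distance of order p, as in (6)."""
        return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

    def manhattan(x, y):
        """p = 1: additions and absolute values only, no multiplications."""
        return float(np.sum(np.abs(np.asarray(x) - np.asarray(y))))

    def euclidean(x, y):
        """p = 2: requires squaring and a square root."""
        return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))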

4.2. Affected Elements

By changing the distance measure, the following elements of the proposed incremental k-means algorithm need to change as well: the calculation of the distance to origin d_o (see (1)), the distance between an incoming instance and a centroid d (see (4)), and the radius of a cluster R (see (5)). The change in R indirectly changes the elements of the CF. Applying the Manhattan distance, the new equation for d_o is (7), while the new equation for d is (8):

d_o(X) = \sum_{i=1}^{d} |x_i|, \quad (7)

d(X, c_i) = \sum_{j=1}^{d} |x_j - c_{ij}|. \quad (8)

For the case of R, the original equation is (9), which can be translated in terms of the CF, where N is the number of instances in cluster C, c is the centroid of C, and d is the dimension:

R = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{d} (x_{ji} - c_i)^2}. \quad (9)

After substituting the Euclidean distance with the Manhattan distance, the new equation for R becomes

R = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{d} |x_{ji} - c_i|. \quad (10)

Note that (10) cannot be represented in terms of \langle N, LS, SS \rangle; hence, the calculation of the radius is no longer possible by keeping only the cluster summary. For example, let C be a cluster with three instances in one dimension,

C = \{x_1, x_2, x_3\}, \quad (11)

with centroid c = (x_1 + x_2 + x_3)/3. Expanding (10) results in

R = \frac{|x_1 - c| + |x_2 - c| + |x_3 - c|}{3}. \quad (12)

When a new instance x_4 is added into cluster C, the centroid changes to

c' = \frac{x_1 + x_2 + x_3 + x_4}{4}. \quad (13)

The new value of R will be

R' = \frac{|x_1 - c'| + |x_2 - c'| + |x_3 - c'| + |x_4 - c'|}{4}. \quad (14)

Each absolute term in (14) requires recalculation due to the change of the centroid, but the individual instances are no longer accessible (i.e., raw data are not kept). In this case, R cannot be recalculated exactly. In this paper, we suggest calculating an approximation of R by storing the previous sum of absolute deviations in each dimension, S_i = \sum_{j=1}^{N} |x_{ji} - c_i|, together with the accumulated direction of the centroid change; the new R can be calculated from these values. Since a centroid is located in the middle of the instances in the cluster, and a new instance is added only when it is within the radius (d(X, c) \le R), we can assume that the change of the centroid is as small as possible, so that the change of an instance's deviation on one side of the centroid is matched by an opposite change on the other side. By taking the direction of change into consideration, we assume these changes cancel out. With this assumption, the approximation of R is

R' \approx \frac{1}{N + 1} \sum_{i=1}^{d} \left( S_i + |x_{(N+1)i} - c_i'| \right), \quad (15)

where x_{N+1} is the newly merged instance and c' is the new centroid.
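
The approximate update can be implemented with a few vector operations (a Python sketch of (15); the names are our own):

    import numpy as np

    def approx_radius_update(s, n, x, new_centroid):
        """Approximate Manhattan radius after merging instance x.

        s: previous per-dimension sums of absolute deviations, S_i
        n: number of instances before the merge
        Assumes the deviation changes caused by the small centroid shift
        cancel out across instances on opposite sides of the centroid."""
        s_new = s + np.abs(x - new_centroid)      # add the new instance's deviation
        radius = float(np.sum(s_new)) / (n + 1)   # as in (15)
        return s_new, radius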

With these changes in the calculation of the new R, the sum of squares SS is no longer needed, while the per-dimension deviation sums S = (S_1, ..., S_d) must be kept. The new clustering feature is changed accordingly to CF = \langle N, LS, S, t, c \rangle. Merging a new instance X into cluster C causes the recalculation of S based on (15), the additive update of LS based on (16), and the increment of N by one, while t and c remain unchanged:

LS' = LS + X. \quad (16)
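
Mirroring the earlier Euclidean sketch, the Manhattan clustering feature could look as follows (again an illustrative Python sketch, not the paper's exact data structure):

    from dataclasses import dataclass
    import numpy as np

    @dataclass(eq=False)
    class ManhattanCF:
        n: int
        ls: np.ndarray    # LS: linear sum, kept for the centroid (ls / n)
        s: np.ndarray     # S: per-dimension absolute deviation sums (replaces SS)
        timestamp: int
        label: int

        def merge(self, x):
            """Merge a new instance: LS is additive (16); S follows (15)."""
            self.ls = self.ls + x
            self.n += 1
            self.s = self.s + np.abs(x - self.ls / self.n)  # deviation from new centroid

        @property
        def centroid(self):
            return self.ls / self.n

        @property
        def radius(self):
            return float(np.sum(self.s)) / self.n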

4.3. Complexity Comparison

Table 3 summarizes the changes between our proposed method using Euclidean distance and using Manhattan distance. For the calculation of the distance to origin, Euclidean distance requires multiplications and a square root, whereas Manhattan distance requires only additions and absolute values. d_o is used to determine the partition, which is required once per instance, and it affects the reconstruction stage the most: during reconstruction, d_o needs to be calculated for all clusters in the classification model. Hence, Euclidean distance increases the complexity of reconstruction correspondingly.

Similar to d_o, the distance to a centroid d in Euclidean distance also requires multiplications and a square root, compared to only additions and absolute values for Manhattan distance. The higher complexity in the calculation of d directly affects the classification time.

The radius R is used during the selection of learning instances and during the injection of labeled instances. The calculation of R from the cluster summary with Euclidean distance requires a small, fixed number of operations, compared to a number of operations that grows with the dimension for Manhattan distance. In this case, the complexity depends on the dimension: as long as the dimension of the dataset is greater than 2, the complexity of R with Manhattan distance is greater. The increase in complexity of the calculation of R directly affects the learning time.

The update of the CF with Euclidean distance is slightly simpler than with Manhattan distance, due to the recalculation of the deviation sums S for each dimension involved. This is the only trade-off of using Manhattan distance over Euclidean distance. However, since the CF update only happens during learning and model reconstruction, its impact is confined to those stages.

Overall, although the computations of R and the CF update are more complex for Manhattan distance, they are not as frequent as the calculation of the distance d, which must be computed against every cluster for every incoming flow instance.
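
The per-instance cost difference can be probed with a quick micro-benchmark (illustrative only; relative software timings vary by platform and understate the advantage on hardware, where multipliers dominate the resource cost):

    import timeit
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.random(11)      # one flow instance with 11 features, as in Table 4
    c = rng.random(11)      # one cluster centroid

    t_euc = timeit.timeit(lambda: np.sqrt(np.sum((x - c) ** 2)), number=100_000)
    t_man = timeit.timeit(lambda: np.sum(np.abs(x - c)), number=100_000)
    print(f"Euclidean: {t_euc:.3f}s  Manhattan: {t_man:.3f}s")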

5. Experimental Results

This section describes the simulation setup and the results of our proposed work. The experiments are conducted to analyze and compare the performance of our proposed method using the Euclidean and Manhattan distance measures.

5.1. Datasets

Two real network traffic datasets, Cambridge [14] and UNIBS [15], are chosen for the experiments. The Cambridge dataset was captured from the University of Cambridge network. It contains 248 attributes and 12 classes. A total of 11 online features are selected from the attributes, as listed in Table 4. The data of the minimal classes, games and interactive, are not used, as they are not sufficient for training and testing (fewer than 10 flow instances). The UNIBS dataset was captured at the University of Brescia. It was collected over three consecutive days; the traces are in pcap format and come with a groundtruth. The traces were processed to extract online features using only the first 5 packets of each observed flow, as in [38]. Using the provided groundtruth labels, we labeled the flow instances into 5 classes (Web, Mail, P2P, SKYPE, and MSN). The details of the datasets are summarized in Table 5.

5.2. Experimental Setup

The model parameters used in our experiments are as stated below, unless specified otherwise:
(1) Percentage of labeling.
(2) Chunk size = 1000.
(3) Number of micromodels.
(4) Desired number of clusters.
(5) Boundary threshold = 1.
In our experiments, the first chunk of data (the first 1000 flow instances) is treated as precollected flow instances and used for model initialization to generate the base model. The rest of the data are randomly labeled for different labeling percentages. The accuracy of the proposed model is verified using the interleaved test-then-train method, where the data are first tested before being trained incrementally [39] (a sketch of this evaluation loop follows). Each experiment was repeated several times, and the average performance indicators are reported in this paper.
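
For reference, the interleaved test-then-train loop can be sketched as follows (Python; model.learn and the stream format are our own assumptions):

    import numpy as np

    def prequential_eval(stream, model):
        """Interleaved test-then-train evaluation [39]: each chunk is first
        used to test the current model and then used to train it."""
        accuracies = []
        for xs, ys in stream:                  # stream yields (instances, labels) chunks
            preds = [classify(x, model.micromodels, model.bounds)[2] for x in xs]
            accuracies.append(float(np.mean([p == y for p, y in zip(preds, ys)])))
            model.learn(xs, ys)                # hypothetical incremental update
        return accuracies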

The performance indicators used in this paper are accuracy, cumulative accuracy, running time, classification speed, and memory requirement. Accuracy refers to the accuracy on each chunk, while cumulative accuracy is the accumulated accuracy after the classification of each chunk. Running time is defined as the time to process one chunk, including the time for model reconstruction. In our experiments, the running time does not include the data labeling time, as in [12], since data labeling is usually done offline and is beyond the scope of this paper. Classification time is measured as the time needed to classify one flow instance, excluding the feature extraction time.

5.3. Performance

This subsection discusses the overall performance of the proposed algorithm. The accuracy over time is shown in Figure 3. As we only use the first chunk of each dataset for model initialization, the classes that were not seen in the first chunk are treated as new concepts. A drift detection experiment was conducted on our datasets using the Drift Detection Method [40] provided in the MOA tools [41].

The series of detected drifts is plotted in Figure 4. We found that the Cambridge dataset has more detected drifts than the UNIBS dataset. In order to visualize the drifts clearly, we show the Cambridge dataset over different chunk ranges. In the figure, we observe that when concept drifts occur (drift detected = 1), our proposed algorithm (using either Euclidean or Manhattan distance) learns from the new knowledge and maintains the classification accuracy, in contrast to the model without incremental learning. To clearly show the accuracy difference between the two distance measures, the accuracy gain of Manhattan distance over Euclidean distance is shown in Figure 5. The results show that the proposed method using Manhattan distance provides slightly higher accuracy than using Euclidean distance, demonstrating that a simpler distance measure can provide better performance.

Table 6 shows the overall performance of our proposed algorithm. In this paper, we assume that the classifier is the bottleneck of the network traffic classification system since, as reported in several works such as [38], flow feature extraction in FPGA can run at very high speed. We report the performance of the classification in milliseconds per flow instance in software, as that is the lower bound of the system performance. The classification time of our proposed algorithm is 4.45 ms per 1,000 flow instances with Euclidean distance and 1.49 ms per 1,000 flow instances with Manhattan distance. This shows that using Manhattan distance increases the classification speed by almost 3 times. Moreover, we found that the time for injecting a labeled flow instance is similar to the time for classifying a new flow instance; thus, injecting labeled flow instances does not cause delays long enough to affect the overall online classification process. The time for reconstruction is less than 1% of the total running time and is almost negligible for both methods. Our proposed system does not require large memory, as only the summary of cluster information, the CF, is kept; it requires on average 140 KiB of RAM to complete the processes. Since both variants of the proposed method store a similar amount of data, \langle N, LS, SS, t, c \rangle for Euclidean distance and \langle N, LS, S, t, c \rangle for Manhattan distance, the overall memory consumption of both methods is similar.

5.4. Impact of Different Parameter Settings on Classification Performance

In this subsection, we analyze the effect of changing the algorithm parameters on the overall classification performance. The experiments are done by changing one parameter while fixing the others. Figures 6 and 7 show how the labeling percentage affects the accuracy and the running time, respectively. An increase in labeled flow instances provides more class information to the classification model, resulting in better accuracy. Our experimental results also show that the use of Manhattan distance provides better accuracy than Euclidean distance. The percentage of labeling does not affect the classification time, but it does increase the running time, more significantly at high labeling percentages. This is because the more labeled data are injected into the classification model, the more distance measures need to be calculated; hence, the running time difference becomes more significant.

The number of desired clusters, K_d, is the parameter used in the model reconstruction process to maintain the number of clusters in the classification model. Figures 8 and 9 show the impact of different numbers of desired clusters on the accuracy and the reconstruction time, respectively. The increases in accuracy and reconstruction time are more consistent with increasing K_d for the proposed method using Manhattan distance. The reconstruction time of our proposed method with Euclidean distance is higher because the origin distance calculation required by Euclidean distance in the reconstruction stage is more complex than that of Manhattan distance.

6. Conclusion

This paper proposed and analyzed an online incremental learning method for high-bandwidth network traffic classification with two different distance measures (Euclidean and Manhattan). The use of Manhattan distance not only improves the running and classification times but also slightly improves the average accuracy. In the future, we will implement this work on reprogrammable hardware so that it can perform inline classification of live network traffic.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The first author is funded by the UTM Zamalah scholarship. This work is funded by the Ministry of Science, Technology, and Innovation Science Fund grant (UTM vote no. 4S095).