Research Article  Open Access
High-Performance Internet Traffic Classification Using a Markov Model and Kullback-Leibler Divergence
Abstract
As internet traffic rapidly increases, fast and accurate network classification is becoming essential for high quality of service control and early detection of network traffic abnormalities. Machine learning techniques based on statistical features of packet flows have recently become popular for network classification, partly because of the limitations of traditional port- and payload-based methods. In this paper, we propose a Markov model-based network classification with a Kullback-Leibler divergence criterion. Our study is mainly focused on hard-to-classify (or overlapping) traffic patterns of network applications, which current techniques have difficulty dealing with. The results of simulations conducted using our proposed method indicate that the overall accuracy reaches around 90% with a reasonable group size of 100 connections.
1. Introduction
Traditional methods of network classification are either port-based or payload-based. However, both types of approaches are currently facing numerous challenges from the advanced techniques used to circumvent the firewalls of organizations and the increasing number of packet encryption techniques. Machine learning-based network classification techniques that use statistical information from network traffic data are currently emerging.
Machine learning techniques presented so far to analyze network traffic data can be categorized into three groups. Supervised learning, a classification method, uses training traffic data with known application labels. With this method, we extract (or learn) the patterns of applications from the training sample data and apply that knowledge to unlabeled test data. Unsupervised learning, also called clustering, is by contrast based purely on a predefined similarity measure of network traffic data. Clustering is one way of overcoming a lack of sufficient traffic data for training. Finally, so-called semi-supervised learning [1] combines both techniques.
Most learning approaches use packet duration, number of packets, average packet size, or inter-arrival time as statistical features. While several Markov models and hidden Markov models are used to extract information from packet sequence and directions, we utilize the Markov model proposed by Munz et al. [2], which uses a discrete-time Markov model with a finite state space and achieves good accuracy at relatively low complexity. In order to handle correlated data, which is a characteristic feature of network traffic data, Zhang et al. [3] proposed a “bag of flows” concept. Since the first few packets follow similar patterns per network application or protocol, it is better to form a group of a certain size and assign group membership in order to prevent wrong individual assignments due to fluctuation.
In this paper, we generally follow a supervised learning technique with a Markov model whose states are defined by the direction and size of the first four packets (in TCP connections, it is well known that the first four packets are most of the time enough to capture the characteristics of the application [4]). Previous Markov modeling techniques, however, fail to differentiate among applications with overlapping, or hard-to-classify, traffic patterns, as in IMAP and SMTP. Table 1 shows that both IMAP and SMTP include the same state sequences, 0414 and 1414 (states 0 to 3 are for client-to-server packets, and states 4 to 7 are for server-to-client packets; in each case a smaller state number means a smaller packet size. More details about this table and the state numbering are given in Sections 4 and 5.2), and these two sequences comprise more than 50% of packets in both applications (66.98% in IMAP and 52.98% in SMTP). Now consider a network flow with traffic pattern 0414. With traditional Markov model classifiers, this flow will be classified as an IMAP application because IMAP has the higher probability for this pattern. However, this flow also has a probability of more than 0.46 of belonging to SMTP. The same argument can be applied to the 1414 pattern, and in fact more than 52% of SMTP packets will be classified as IMAP packets under traditional Markov model classifiers.

Our solution to this issue is to build another Markov model for the testing flows and measure the similarity between the trained Markov model and the testing Markov model. In the training stage, we build and train a Markov model for each application. In the testing stage, we first collect a group of network flows, called a “bag of flows,” that are believed to belong to the same application using only the port number, and then build a Markov model for each such group. After that, we use Kullback-Leibler divergence to measure the divergence between the Markov models of the training set and those of the test set. Finally, we classify a test group to an application whose Markov model has the smallest divergence from that of the test group. We verify the performance of our proposed method by means of theoretical and real data simulations.
The remainder of this paper is organized as follows. In Section 2, we review relevant related work on network classification. The theoretical background of our Markov model with Kullback-Leibler divergence and some evaluation measures for the classification used in our work are briefly explained in Section 3. Section 4 shows our proposed classification method with Kullback-Leibler divergence, while Section 5 presents the results of a theoretical simulation and real data-based analysis. Section 6 concludes and discusses further prospective developments.
2. Related Works
Traditional packet classification techniques are either port-based or payload-based approaches. Port-based approaches classify packets based on the ports used by the packets. However, nowadays, many applications, especially P2P applications, use unpredictable or dynamic ports, which limits the efficacy of this approach [5]. Payload-based, or Deep Packet Inspection (DPI), approaches look deep inside the packet to capture its application-specific pattern. However, this technique also suffers from challenges such as frequently changing packet formats, encryption of the packet payload, and load increments [6].
Machine learning approaches are being developed by researchers as an alternative to traditional approaches. A machine learning approach is either unsupervised or supervised. The unsupervised, or clustering, techniques start with unlabeled packets and classify them into different clusters. The simple $k$-means algorithm [4], in which each flow is represented by a point in a $P$-dimensional space, is one such technique. The first $P$ packets of each flow are observed and the packet size of the $i$th packet becomes the coordinate at dimension $i$ for this flow. The flows clustered close enough together are considered to belong to the same application. One problem with this technique is that it requires an application to dominate in at least one of the clusters, which is not always the case, especially for closely overlapping traffic such as SMTP and IMAP.
Supervised learning starts with known packet classes. Features such as packet size, packet direction, and packet inter-arrival time are extracted for each class. These feature vectors are used to build and train models that represent the classes. Various modeling techniques have been proposed. Roughan et al. [7] collected statistics on the feature vectors and classified unknown traffic via Nearest Neighbors (NN), Linear Discriminant Analysis (LDA), or Quadratic Discriminant Analysis (QDA) classification techniques. Moore and Zuev [8] applied naive Bayes techniques to map unknown traffic to pre-classified traffic classes. These techniques are simple and effective, but they generally consume considerable computing time or require large memory space to construct and maintain complex data structures. They also tend to perform very poorly when the feature vectors are not clustered tightly enough. Crotti et al. [9] built protocol fingerprints, a PDF vector on packet size and packet inter-arrival time, for each traffic class, which can express the statistical characteristics of traffic classes in a compact and efficient way. Unknown traffic is then classified via normalized threshold techniques. However, they suggest using a simple histogram of feature vectors for the protocol fingerprints, which is too simplistic to capture the subtle differences between characteristically overlapping traffic classes.
Several means of improving supervised learning techniques have been attempted. Nguyen and Armitage [10] proposed taking packet samples from various locations during packet transmission to build multiple subflows. Most classification techniques capture either the entire flow or the first few packets of the flow as samples, but they claim that capturing the entire flow is time-consuming and detecting the beginning of packet transmission is not always possible. Instead, they capture packets intermittently during packet transmission and use these multiple subflows to train their modeling system. An interesting idea of considering the entire packet flow as a linear combination of a set of multiple component flows was introduced in [11–13]. The authors apply wavelet analysis to extract these components [11, 12] or apply ICA (Independent Component Analysis) [13] to identify the fundamental independent components and use them to classify the target flow. Their target flow was abnormal flow generated by network attacking programs, but their techniques are general enough to be applied to any network flow. Reference [14] shows another approach to classifying abnormal packet flow. The technique initially trains the classifier with normal traffic data only, and the classifier then evolves dynamically, learning about anomalous behaviors using the Discriminative Restricted Boltzmann Machine. To overcome the limitations of a single technique, hybrid approaches have been suggested in [15, 16]. Chen et al. [15] combine a hardware classifier based on network processors with a software classifier based on the Flexible Neural Tree technique, while Ye and Cho [16] combine a signature-based classifier with a statistics-based classifier. By combining different classifying techniques, we can expect faster or more accurate classification.
Some other efforts have taken into consideration the time series characteristics of the transmitted packets in addition to their statistical properties. Palmieri and Fiore [17] plot the recurring patterns of the training packets on a Recurrence Plot (RP) and overlay the embedding vectors of the testing packets on top of the RP, measuring the distance of each pair of plotted points. The packets are classified to the application for which the distance is the smallest. Dainotti et al. [18] and Mu and Wu [19] constructed Hidden Markov Models (HMMs) to represent the traffic classes, while Munz et al. [2] built Markov Models (MMs) with eight states and four stages, which is much simpler than building HMMs but still effectively expresses the time series and statistical characteristics of the transmitted packets. Zhang et al. [3] proposed constructing a bag of flows and classifying it as a whole instead of classifying individual flows. Classifying a bag of flows as a whole has the advantage that the misclassified flows can be ignored if they make up the smaller part of the bag, because the larger portion of correctly classified flows determines the membership of the entire bag.
The above techniques improve the performance of supervised learning methods in various aspects. However, they achieve that at the expense of other performance criteria. Capturing multiple subflows [10] and HMMs [18, 19] both increase overall classification time considerably, due to the extensive packet collection in the former case and the construction of the complex model in the latter. Both the Markov model by Munz et al. [2] and the Bag-of-Flow technique by Zhang et al. [3] show degraded performance when the target traffic classes overlap in their traffic patterns. Hybrid approaches [15, 16] improve the performance but at the cost of increased classification time.
Our technique combines the Bag-of-Flow (BOF) technique in [3] with the Markov model approach in [2]. We believe that the Markov model is simple and powerful enough to grasp the characteristics of the traffic and that the BOF concept is essential for improving the classification accuracy. However, applying the BOF technique directly to packet classification does not always produce the best results, especially when the target traffic classes show similar and overlapping characteristics, as in SMTP and IMAP. In such a case, direct application of BOF actually results in inferior performance to individual assignment (the details are given in Section 5.2). We propose constructing another Markov model for the flows contained in the bag and measuring its similarity with the target Markov models. The flows in this bag are all assigned to the traffic class whose Markov model is most similar to the test Markov model.
3. Theoretical Background
3.1. Markov Models
A discrete-time Markov model with finite state space $S = \{s_1, \ldots, s_m\}$ is determined by the transition probability matrix $A = [a_{ij}]$ and the initial probability distribution $\pi = (\pi_1, \ldots, \pi_m)$ of state space $S$, where
$$a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i), \qquad \pi_i = P(X_1 = s_i), \qquad \sum_{j=1}^{m} a_{ij} = 1, \qquad \sum_{i=1}^{m} \pi_i = 1.$$
We build separate Markov models for each known network traffic application in the training phase. Let $M_i$ denote a Markov model constructed from the training data of application $i$. Empirical estimation of the transition probabilities and the initial probability distribution can be done in a way similar to that of Munz et al. [2]. In the testing phase, we can proceed either by individual connection or by grouping correlated connections. For each group of connections, we build Markov models as was done in the training phase. Let $T_g$ denote a Markov model constructed from the testing data of group $g$. In the following section, we provide a dissimilarity measure between two Markov models.
3.2. Kullback-Leibler Information
Kullback-Leibler information is a well-known dissimilarity measure between two probability distributions [20]. Let $x_1, \ldots, x_n$ be a set of observations drawn randomly from an unknown true probability distribution function $G(x)$, and let $F(x)$ be an arbitrary probability distribution function. We assume that the goodness of the model defined by $F(x)$ is assessed in terms of its closeness as a probability distribution to the true distribution $G(x)$. Akaike [21] proposed the use of the following Kullback-Leibler information (or divergence):
$$I(G; F) = E_G\left[\log \frac{G(X)}{F(X)}\right],$$
where $E_G$ represents the expectation with respect to the probability distribution $G$. Kullback-Leibler information, or relative entropy, with respect to $G$ can be represented in discrete models as
$$I(g; f) = \sum_{i} g(x_i) \log \frac{g(x_i)}{f(x_i)},$$
where $g$ and $f$ are the probability mass functions of $G$ and $F$, respectively.
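The discrete form of the Kullback-Leibler information can be illustrated with a minimal sketch. The function name `kl_information`, the dictionary representation of the mass functions, and the two example distributions are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def kl_information(g, f):
    """Discrete Kullback-Leibler information: sum over x of g(x) * log(g(x) / f(x)).

    g and f are dicts mapping outcomes to probabilities. Outcomes with
    g(x) == 0 contribute nothing; an outcome with g(x) > 0 and f(x) == 0
    makes the divergence infinite.
    """
    total = 0.0
    for x, gx in g.items():
        if gx == 0.0:
            continue
        fx = f.get(x, 0.0)
        if fx == 0.0:
            return math.inf
        total += gx * math.log(gx / fx)
    return total

# Hypothetical mass functions over two outcomes.
g = {"a": 0.5, "b": 0.5}
f = {"a": 0.25, "b": 0.75}
print(kl_information(g, f), kl_information(f, g))
```

The two printed values differ, which shows that the measure is not symmetric in its arguments; the order of the test and training distributions therefore matters when it is used for classification.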
3.3. Evaluation of Classification Methods
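Classification performance can be evaluated by measuring “error rates” or misclassification probabilities. In binary classification, it is well known that the total probability of misclassification (TPM) is given by
$$\mathrm{TPM} = p_1 P(2 \mid 1) + p_2 P(1 \mid 2),$$
where $p_i$ represents the prior probability of class $i$ and $P(j \mid i)$ is the probability of assigning an observation from class $i$ to class $j$.
In our work, we use recall and precision, which are used in [2], and the $F$-measure to evaluate per-class performance, as in [3]. Writing TP, FP, and FN for the numbers of true positives, false positives, and false negatives of a class, one has
(i) $\text{recall} = \dfrac{TP}{TP + FN}$;
(ii) $\text{precision} = \dfrac{TP}{TP + FP}$;
(iii) $F\text{-measure} = \dfrac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$.
Let $n$ be the total number of observations that belong to either one of two classes. Then, we can represent our evaluation measures in a simple frequency table (see Table 2):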

Hence, for example, the recall $R_1$ is the proportion correctly classified as application 1 out of all the true application 1 connections, and the precision $P_2$ is the proportion correctly classified as application 2 out of all those classified as application 2 connections. The $F$-measure is the harmonic mean of recall and precision and is intended to represent the overall accuracy.
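The per-class measures can be computed directly from a frequency table of true versus predicted classes. The following is a minimal sketch; the function name `per_class_measures`, the nested-dictionary layout, and the counts are hypothetical and serve only to illustrate the definitions.

```python
def per_class_measures(confusion, cls):
    """Return (recall, precision, F-measure) for one class.

    confusion[true_class][predicted_class] holds a count of observations.
    """
    tp = confusion[cls].get(cls, 0)
    fn = sum(c for pred, c in confusion[cls].items() if pred != cls)
    fp = sum(row.get(cls, 0) for true, row in confusion.items() if true != cls)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0.0:
        return recall, precision, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure

# Hypothetical counts: 100 true application 1 and 100 true application 2 connections.
confusion = {
    1: {1: 80, 2: 20},
    2: {1: 30, 2: 70},
}
r1, p1, f1 = per_class_measures(confusion, 1)
print(r1, p1, f1)
```

With these counts, the recall of application 1 is 80/100, while its precision is 80/110 because 30 connections of application 2 were also predicted as application 1.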
On the basis of the empirical measures, we can also estimate the TPM as follows:
4. Classification Using Kullback-Leibler Information
In this paper, we focus first on two applications, SMTP (port 25) and IMAP (port 143), whose traffic patterns are difficult to distinguish using existing classification algorithms, and later extend our technique to other applications in Sections 5.3 and 5.4. Bernaille et al. [4] showed that the first four packets of a TCP connection are sufficient to classify known applications with high accuracy; therefore, we build our Markov model using only the first four packets exchanged. State space is defined by a combination of direction of packets and four payload intervals: $[0, 99]$, $[100, 299]$, $[300, \mathrm{MSS} - 1]$, and $[\mathrm{MSS}]$ (we follow the same payload intervals as in [2]. These intervals have been chosen because they emphasize well the difference in traffic feature vectors among the various applications [2]). The value of the Maximum Segment Size (MSS) is often exchanged in a TCP connection. Since the direction is either client-to-server or server-to-client, each stage can have $2 \times 4 = 8$ different states. Thus, our model becomes a four-stage left-right Markov model with state space $\{0, 1, \ldots, 7\}$. States 0–3 represent payload length intervals from client-to-server while states 4–7 represent those of server-to-client. For example, state sequence 0414 means the following: the client sends a 0–99 byte packet first (after the handshake), the server responds with a 0–99 byte packet, the client then sends a slightly larger 100–299 byte packet, and finally the server responds with a 0–99 byte packet.
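The state numbering can be sketched as a small mapping from packet direction and payload length to a state, and from the first four packets to a state sequence. This is an illustrative sketch only: the function names are hypothetical, the default MSS of 1460 bytes is an assumption, and the boundaries of the two upper intervals are assumed to follow [2], since only the 0–99 and 100–299 byte intervals are spelled out in the worked example above.

```python
def packet_state(payload_len, from_client, mss=1460):
    """Map one packet to a state 0..7.

    States 0-3 are client-to-server, states 4-7 are server-to-client;
    within a direction, a larger state number means a larger payload.
    The default MSS and the two upper interval boundaries are assumptions.
    """
    if payload_len < 0:
        raise ValueError("payload length must be non-negative")
    if payload_len <= 99:
        interval = 0
    elif payload_len <= 299:
        interval = 1
    elif payload_len < mss:
        interval = 2
    else:
        interval = 3
    return interval if from_client else interval + 4

def flow_pattern(packets, mss=1460):
    """Build the state sequence of the first four (payload_len, from_client) packets."""
    return "".join(str(packet_state(n, c, mss)) for n, c in packets[:4])

# Hypothetical flow matching the worked example: small, small, medium, small.
example = [(50, True), (80, False), (150, True), (20, False)]
print(flow_pattern(example))
```

For the hypothetical flow above, the resulting pattern is 0414, the sequence described in the text.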
By investigating the state sequences of SMTP and IMAP in training data, we find that 0414 is the dominant pattern in both applications. Other common patterns, such as 1414 and 0404, also exist in both applications. In this section, we explain our classification model, which has only two common patterns: 0414 and 1414. We then extend our model in the simulation section to incorporate an extra unique pattern per application.
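4.1. A Time-Variant Markov Model with Two Patterns
Suppose that we have two common patterns, 0414 and 1414, in both applications. Let $M_i$ be the Markov model for application $i$ ($i = 1, 2$), and let the two patterns be the observations, say $O_1 = 0414$ and $O_2 = 1414$, in our Markov models.
Assume that $p$ and $q$ are the proportions of $O_1$ in App 1 and App 2, respectively. Thus, the initial state probabilities are $\pi_i$ for each model $M_i$. That is,
$$\pi_1 = (p, 1 - p, 0, 0, 0, 0, 0, 0), \qquad \pi_2 = (q, 1 - q, 0, 0, 0, 0, 0, 0).$$
Since we have four stages, we need three transition probability matrices in addition to the initial probability vector to compute a pattern probability. Let $A_i^{(t)}$ denote the one-step transition matrix from stage $t$ to stage $t + 1$ in $M_i$; that is, $A_i^{(t)} = [P(X_{t+1} = k \mid X_t = j)]$ for $j, k \in \{0, 1, \ldots, 7\}$. On the basis of the frequency of observations $O_1$ and $O_2$ in each model, we can build three transition probability matrices, $A_i^{(1)}$, $A_i^{(2)}$, and $A_i^{(3)}$. Because both patterns continue as 414 after the first packet, the first matrix has unit entries for the transitions $0 \to 4$ and $1 \to 4$, the second has a unit entry for the transition $4 \to 1$, and all remaining entries are zero. $A_i^{(3)}$ can be constructed in a similar fashion, with a unit entry for the transition $1 \to 4$.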
4.2. Theoretical Evaluation of Assignment
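Let $O_1$ and $O_2$ be the only two observations of the Markov models. Thus, for each connection, we can define $P(O_1 \mid M_i)$ and $P(O_2 \mid M_i)$. We can say $P(O_j \mid M_i)$ is the likelihood of $O_j$ under model $M_i$. Each likelihood is computed as follows:
$$P(O_j \mid M_i) = \pi_i(s_1)\, A_i^{(1)}(s_1, s_2)\, A_i^{(2)}(s_2, s_3)\, A_i^{(3)}(s_3, s_4), \qquad O_j = s_1 s_2 s_3 s_4,$$
where $\pi_i$ is the initial probability vector of model $M_i$ and $A_i^{(t)}$ is its transition matrix from stage $t$ to stage $t + 1$.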
4.2.1. Individual Assignment
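Individual assignment is based on the likelihood of observations. Hence, the decision rule is
$$\text{assign } O \text{ to App 1 if } P(O \mid M_1) > P(O \mid M_2), \quad \text{and to App 2 if } P(O \mid M_1) < P(O \mid M_2).$$
For simplicity, we assume $p > q$ in the following performance calculations, where $p$ and $q$ are the proportions of pattern 0414 ($O_1$) in App 1 and App 2, respectively. Since $P(O_1 \mid M_1) = p > q = P(O_1 \mid M_2)$ and $P(O_2 \mid M_1) = 1 - p < 1 - q = P(O_2 \mid M_2)$, we assign observation $O_1$ to App 1 and $O_2$ to App 2. If $p = q$, which is highly unlikely, we cannot determine its membership, so we leave it as undecided.
Since the decision rule is given, we can determine the evaluation measure recall as follows:
$$R_i = \sum_{j} P(O_j \mid M_i)\, I(O_j \text{ is assigned to App } i),$$
where $P(O_j \mid M_i)$ is the proportion of $O_j$ in model $M_i$ and $I(\cdot)$ is the set indicator function. The above formula for computing recall can be used for more than two patterns. In order to determine precision, we need to estimate the ratio of App 1 to App 2 and apply Bayes' rule. We may assume equal prior probabilities for the two applications when no information regarding their abundance is available. A summary of evaluation measures is given in Table 3.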

4.2.2. Group Assignment
We can form a correlated group of connections in several ways. Zhang et al. [3] used the concept of “flow,” which consists of successive IP packets having the same five-tuple: [src_ip, src_port, dst_ip, dst_port, protocol]. They formed a BOF and assigned the BOF instead of individual connections.
In this paper, we use a simple concept of bag based on port number only. Constructing a group based on the same five-tuple is very time-consuming and may not be a good strategy in real-time classification problems. Our port-only group assignment, by contrast, is fast and convenient even though its members are slightly less correlated than those of five-tuple-based BOFs.
Let $X_i$ denote the number of $O_1$ in a group of size $n$ from model $M_i$. Then, $X_i$ is a random variable whose distribution is binomial with the proportion of $O_1$ in $M_i$ as its parameter, that is, $X_1 \sim B(n, p)$ and $X_2 \sim B(n, q)$, and $P(X_i > n/2)$ represents the probability of assigning such a group to App 1 by majority vote. Thus, recall and precision can be computed as in the individual assignment case, except that $R_1 = P(X_1 > n/2)$ and $R_2 = P(X_2 < n/2)$. We call the case $X_i = n/2$ undecided, which means that we cannot assign such a group under model $M_i$ (see Table 4).
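The effect of majority voting on recall can be sketched with a binomial tail probability. This is an illustrative sketch under the assumption that the flows in a group are assigned independently, each correctly with the same probability; the function name `majority_recall` and the example values are hypothetical.

```python
from math import comb

def majority_recall(p_correct, n):
    """Probability that more than half of n flows are individually assigned correctly.

    Assumes independent flows, each assigned correctly with probability
    p_correct. Ties (exactly n/2 correct when n is even) count as undecided
    and are therefore not included.
    """
    return sum(
        comb(n, k) * p_correct**k * (1 - p_correct) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Hypothetical individual recall values just below and well above one half.
for p in (0.47, 0.70):
    print(p, majority_recall(p, 11), majority_recall(p, 101))
```

When the individual recall is below one half, the majority vote pushes the group recall further down as the group grows, whereas an individual recall above one half is amplified. This is consistent with the observation that majority counting can perform worse than individual assignment for heavily overlapping applications.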

We propose three group assignment methods: Majority, Kullback-Leibler, and 4096d. Majority assignment is based on the voting of each individual assignment, Kullback-Leibler assignment is our main proposed method, and 4096d assignment can be another reasonable candidate method based on Euclidean distance in 4096 dimensions. We explain the Kullback-Leibler and 4096d methods further in the next subsection.
4.2.3. Kullback-Leibler and 4096d Methods
Kullback-Leibler. For each bag of connections, we build a Markov model $T$ and measure divergence by computing the Kullback-Leibler information $I(T; M_1)$ and $I(T; M_2)$. For $i = 1, 2$,
$$I(T; M_i) = \sum_{O} P(O \mid T) \log \frac{P(O \mid T)}{P(O \mid M_i)},$$
where $P(O \mid T)$ and $P(O \mid M_i)$ are the likelihoods of $O$ under the model in test and training data, respectively. Since our Markov model consists of four stages and eight states, we have $8^4 = 4096$ possible observations $O$ in test data. To avoid division by zero, we use a small positive constant in place of $P(O \mid M_i)$ for such $O$ with $P(O \mid M_i) = 0$ and $P(O \mid T) > 0$.
If $I(T; M_1) < I(T; M_2)$, then we assign a group of testing connections to App 1. The evaluation measures recall and precision can be calculated in a similar fashion to the Majority case.
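The Kullback-Leibler group assignment can be sketched end to end as follows. This is a minimal sketch, not the authors' implementation: the function names, the dictionary-based model representation, the floor value `EPSILON`, and the training proportions are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import product

STATES = "01234567"  # eight states per stage
EPSILON = 1e-6       # assumed floor used when a training likelihood is zero

def fit_markov(patterns):
    """Estimate the initial distribution and one transition table per stage."""
    n = len(patterns)
    initial = {s: c / n for s, c in Counter(p[0] for p in patterns).items()}
    stages = []
    for t in range(3):
        counts = defaultdict(Counter)
        for p in patterns:
            counts[p[t]][p[t + 1]] += 1
        stages.append({
            src: {dst: c / sum(row.values()) for dst, c in row.items()}
            for src, row in counts.items()
        })
    return initial, stages

def likelihood(pattern, model):
    """Likelihood of a four-state pattern under a fitted model."""
    initial, stages = model
    prob = initial.get(pattern[0], 0.0)
    for t, table in enumerate(stages):
        prob *= table.get(pattern[t], {}).get(pattern[t + 1], 0.0)
    return prob

def kl_information(test_model, train_model, eps=EPSILON):
    """Sum over all 4096 patterns; patterns with zero test likelihood are skipped."""
    total = 0.0
    for states in product(STATES, repeat=4):
        obs = "".join(states)
        p_test = likelihood(obs, test_model)
        if p_test == 0.0:
            continue
        p_train = likelihood(obs, train_model)
        if p_train == 0.0:
            p_train = eps
        total += p_test * math.log(p_test / p_train)
    return total

def classify_bag(bag, train_models):
    """Assign the whole bag to the application with the smallest divergence."""
    test_model = fit_markov(bag)
    return min(train_models, key=lambda app: kl_information(test_model, train_models[app]))

# Hypothetical training data: pattern 0414 makes up 70% of App 1 and 50% of App 2.
train_models = {
    "App 1": fit_markov(["0414"] * 70 + ["1414"] * 30),
    "App 2": fit_markov(["0414"] * 50 + ["1414"] * 50),
}
balanced_bag = ["0414"] * 5 + ["1414"] * 5
print(classify_bag(balanced_bag, train_models))
```

In this hypothetical setting, individual assignment would send every 0414 flow to App 1 because its likelihood is higher there, while the bag as a whole is assigned to App 2 because its pattern proportions match that model exactly.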
4096d. We can map each BOF of size $n$ to a point in 4096-dimensional space, because we have $8^4 = 4096$ possible observations. For example, a size-10 test BOF is represented by the point whose coordinates count how many of its 10 connections show each of the 4096 possible observations, so that at most 10 coordinates are nonzero. After suitable standardization, we use the Euclidean distance to determine the membership of a test BOF.
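A sparse version of the 4096d mapping can be sketched as follows. The standardization used here, dividing each count by the bag size, is an assumption, since the text does not specify it; the function names and the reference proportions are likewise hypothetical.

```python
import math
from collections import Counter

def bag_vector(bag):
    """Sparse point in 4096-dimensional space: pattern -> relative frequency.

    Dividing by the bag size is an assumed form of standardization.
    Patterns that do not occur are implicit zeros.
    """
    n = len(bag)
    return {pattern: count / n for pattern, count in Counter(bag).items()}

def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def classify_4096d(bag, references):
    """Assign the bag to the application whose reference point is nearest."""
    point = bag_vector(bag)
    return min(references, key=lambda app: euclidean(point, references[app]))

# Hypothetical reference points built from training proportions.
references = {
    "App 1": {"0414": 0.7, "1414": 0.3},
    "App 2": {"0414": 0.5, "1414": 0.5},
}
print(classify_4096d(["0414"] * 8 + ["1414"] * 2, references))
```

Because every coordinate enters the distance with the same weight, a rare pattern that occurs in only one application moves the point very little, which is one way to see why this method may be less sensitive to such patterns than a likelihood-ratio criterion.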
Theoretical considerations for handling more than two patterns can be easily made. Trinomial or multinomial distributions are needed instead of binomial distributions to compute evaluation measures. We include some of those simulation results in the next section.
5. Simulation Experiments
In this section, we evaluate the performance of our proposed method under various scenarios. First, several hypothetical model-based simulations are presented, followed by real network traffic-based results.
5.1. Model-Based Simulation
Suppose the proportions of observations for each application are given in Table 5.

In the following four hypothetical model-based simulations, we choose values of $p_1$, $p_2$, $q_1$, and $q_2$ that represent both our real traffic sequences and more challenging scenarios. These parameters stand for the proportion of packet patterns for each application; that is, $p_1$ denotes the proportion of pattern 0414 and $p_2$ denotes that of pattern 1414 in application 1, while $q_1$ and $q_2$ stand for the proportions of the same patterns in application 2. For simplicity of notation, we use capital letter abbreviations to represent evaluation measures and subscript numbers to denote the application as usual: (i) $R$ = recall, $P$ = precision, $F$ = $F$-measure, and TPM = total probability of misclassification; (ii) Maj = Majority, KL = Kullback-Leibler.
Case 1 (, , , and (see Table 6)). Case 1 represents a situation in which there are only two patterns. As expected, group assignment gives a better performance than individual assignment and both KL and 4096d are better than Maj. An interesting observation in this case is the fact that the low values for in individual (50%) and Maj (46%) improve up to 86% in KL as bag size becomes . Under , which have and , we wrongly assign up to half of true App 2 to App 1 in individual and Maj in group assignment. Nevertheless, even in that case, KL performs well as group size increases.
The next three case studies deal with more than two patterns. Case 2 deals with a situation that is similar to Case 1 but with an extra unique pattern in application 2.

Case 2 (, , , and (see Table 7)). The result shown in Table 7 shows that the individual assignment is even better than the group Maj method, but both methods give abysmal performances for and . The KL method gives the best performance among the competitors (99%).

Case 3 (, , , and (see Table 8)). In Case 3, each application had its own unique pattern. KL performs well in this case also but the performance gap between it and 4096d is smaller than that in Case 2. This tells us that KL excels when a rare application specific pattern exists.

Case 4 (, , , and (see Table 9)). Case 4 represents a situation that is close to real network traffic. The group assignment Maj performs worse than the individual in most of the measures, while KL and 4096d perform as before. and of individual assignment are bad and worse in Maj but KL and 4096d do exceptionally well in this case as well.

5.2. Real Data-Based Simulation
We retrieved traffic data from the packet traces in [22]. Our simulation system used the pcap library functions to extract valid TCP connections from the traces. We defined a valid TCP connection as a packet exchange between a client and a server that starts with a proper three-way TCP handshake and has at least four packets after it (currently our system eliminates flows with fewer than four packets. However, it is not difficult to extend our system to handle such flows. We can simply add zero-length packets at the end of a flow that does not contain all four packets. Applications that produce fewer than four packets can then be characterized as flows ending with a number of zero-length packets). Among the packet traces, we singled out SMTP (port 25) and IMAP (port 143) connections. The trace files were huge and produced over 160,000 connections for SMTP and over 30,000 connections for IMAP. We used tenfold cross-validation, so that nine-tenths of the connections from each application, selected at random, are used to train the target Markov model and the remaining one-tenth is used for testing. For each connection, we removed the three-way TCP handshake packets (SYN, SYN/ACK, and final ACK) and collected only the first four packets after the handshake. All acknowledgement packets are ignored.
Table 10 shows the state sequences (patterns) for each application in sorted order with the most frequent one at the top. Both applications had the 0414 sequence as the most frequent sequence, with the other common sequences being 0404, 0400, and 1414. Note that states 0 to 3 are for the client-to-server packets and states 4 to 7 are for the server-to-client packets, with each state representing one of the four payload length intervals: $[0, 99]$, $[100, 299]$, $[300, \mathrm{MSS} - 1]$, and $[\mathrm{MSS}]$ (Section 4). For example, 0414 means the first packet was a client-to-server packet with size in $[0, 99]$, the second packet was a server-to-client packet with size in $[0, 99]$, the third packet was a client-to-server one with size in $[100, 299]$, and finally the fourth packet was a server-to-client one with size in $[0, 99]$. Let us summarize IMAP and SMTP application data collected in a given time interval as in Table 11.


Given a real network traffic observation , we calculate the empirical likelihood under each model , where is the proportion of state in the first stage and is an empirical transition matrix constructed from the total connections. Evaluation measure, , can be computed as in (11) using empirical likelihood and instead. The other evaluation measures can be computed in a similar fashion.
Table 10 shows why SMTP and IMAP applications are difficult to classify using conventional network classification methods, such as the BOF technique in [3] and a plain Markov model approach in [2]. By inspection, both 0414 and 1414 have a higher likelihood under the IMAP model than under the SMTP model, so state sequences 0414 and 1414 of SMTP are misclassified as IMAP. Since these two sequences account for 52.98% of SMTP connections, the recall rate of SMTP is at most 1 − 0.5298 = 0.4702.
Group assignment using Maj does worse than individual assignment in the recall rate of SMTP. Because more than half of the SMTP state sequences are misclassified, majority counting aggravates the situation. Table 12 shows the performance of individual and group assignment.

The Kullback-Leibler approach gives the best performance, as in the model-based simulation. The recall rates of SMTP with the Kullback-Leibler approach are well over 90% and close to 100% for a bag size of 100. The recall rates of IMAP are also remarkably high. The IMAP recall values for KL (Kullback-Leibler) are somewhat lower than those for the Maj approach, but it should be understood that the high recall rate of IMAP in majority counting comes at a severe sacrifice of the SMTP recall rate. The 4096d approach also shows strong recall rates, around 90%, for SMTP traffic. Its performance degrades, however, with IMAP traffic, showing around a 65% recall rate. It appears that simple Euclidean distance in 4096-dimensional space is not accurate enough to capture the characteristics of a traffic class whose traffic pattern heavily overlaps with another. Further, 4096d treats all dimensions equally, so it may not be sensitive enough to detect some unique rare patterns in certain applications.
Even though the Kullback-Leibler method performed well in the real data simulation, computing time is still a concern in online network classification and decision-making. Table 13 presents a comparison of execution times for various algorithms. The execution times are normalized for 10,000 connections. The individual assignment and Maj show similar time requirements. The KL approach is slightly faster than the 4096d method but is roughly ten times slower than the individual assignment or Maj for a bag size of 100 connections, which seems to be a reasonable bag size (100 connections is large enough to present a representative state sequence distribution and small enough to declare the identity of some unknown traffic in due time). Most of the time is consumed in building the Markov model for the test data; computing the distance between two Markov models does not take significant execution time, contrary to our initial prediction. It turns out that the probability distribution over our total state-sequence space is very sparse, meaning that most of its entries are zero. Therefore, we can significantly reduce the KL computing time.

5.3. Extending to Multiple Applications
We have extended our proposed approach to handle network classification problems in the presence of more than two applications. From the network traffic data repository we collected the application protocols that have enough traffic to analyze. As a result we ended up with 10 network protocols: FTP, SSH, SMTP, HTTP, POP, NNTP, IMAP, HTTPS, SPOP, and BitTorrent.
We conducted an experiment to check whether our proposed method works well in the presence of other protocols. Tables 14 and 15 show the recall and precision rates of various models, respectively, for the chosen protocols. $R_x$ in Table 14 is the recall rate for a protocol with port number $x$, while $P_x$ in Table 15 represents the precision rate for the same protocol. The BitTorrent protocol uses random port numbers, but we collected and tested packets destined to port 6881, since our pcap files contain BitTorrent traffic of an older version in which 6881 is known to be one of the most frequent port numbers (we discuss the problem of detecting BitTorrent packets in the presence of random port numbers in Section 5.4). Thus, $R_{6881}$ and $P_{6881}$ stand for the recall and precision rates of these representative BitTorrent packets. The recall rates of SMTP and IMAP are slightly different from those in Table 12, since each packet is now matched against 10 different models instead of two. It is clear that our techniques, KL10 (Kullback-Leibler with bag size 10) and KL100 (Kullback-Leibler with bag size 100), consistently show much better performance than the other techniques. The other techniques show very poor recall rates for ports 21, 25, and 119. However, KL10 and especially KL100 produce close to 90% recall for all ports except NNTP (port 119), which appears to be very hard to classify correctly, as can be seen in the table. Still, KL100 matches the packets of NNTP much better than the other techniques, reaching almost 80% recall with bag size 100 while the others produce recall rates of around 20% or less.


Precision rates in Table 15 show the accuracy of the prediction of each model. Again, the proposed KL model displays considerably higher precision rates than the others. To compute recall and precision rates, we need the distribution of predicted ports for each classification technique. For example, Table 16 shows the distribution of predicted ports for each protocol with the KL10 classification technique. The first column of the table shows the total number of connections belonging to each protocol. For example, 1792 connections were found to belong to port 21. The remaining columns show the predicted port number for each protocol. For port 21, KL10 predicted that 1480 connections belong to port 21, 212 connections belong to port 110, 40 connections belong to port 119, and 60 connections belong to port 6881. Therefore, the recall rate is 1480/1792 = 0.8259, as shown in Table 14. The precision rate for port 21 can be computed by collecting all numbers in column 2, which shows the number of connections from all 10 protocols that are predicted to belong to port 21. Of the 1792 connections destined to port 21, 1480 have been matched to port 21; of the 16508 connections destined to port 25, 30 have been matched to port 21; and so forth. However, since each protocol has a different number of connections, we have to normalize the predicted connection numbers before adding them. We normalized the connection number of each protocol to 1000 connections, and the table shows the normalized connection number below each raw connection number. The numbers within parentheses in the first column of Table 16 are the normalization factors. The sum of the normalized numbers in column 2 (that is, of all connections predicted to belong to port 21) is 853.698; therefore, the precision rate of the KL10 technique for port 21 is 825.89/853.698 = 0.9674, as shown in Table 15.
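The recall and normalized precision computation described above can be sketched as follows. This is only an illustrative sketch: the function name and the dictionary layout are our own assumptions, the port 21 row uses the counts quoted in the text, and the port 25 row is hypothetical apart from the 30 connections matched to port 21 (the text does not say where the remaining port 25 connections were assigned). With only two rows, the resulting precision does not reproduce the value reported for the full 10-protocol table.

```python
from typing import Dict, Tuple

Confusion = Dict[int, Dict[int, int]]  # confusion[true_port][predicted_port] = connections


def recall_and_normalized_precision(
    confusion: Confusion, scale: float = 1000.0
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Compute per-port recall and precision after rescaling each row to `scale` connections."""
    ports = list(confusion)
    recall: Dict[int, float] = {}
    normalized: Dict[int, Dict[int, float]] = {}
    for true_port, row in confusion.items():
        total = sum(row.values())
        # Recall: share of this protocol's connections matched to its own port.
        recall[true_port] = row.get(true_port, 0) / total
        # Rescale the row so that large protocols do not dominate column sums.
        normalized[true_port] = {p: c * scale / total for p, c in row.items()}

    precision: Dict[int, float] = {}
    for port in ports:
        # Column sum: normalized connections from every protocol predicted as `port`.
        column_sum = sum(normalized[t].get(port, 0.0) for t in ports)
        hit = normalized[port].get(port, 0.0)
        precision[port] = hit / column_sum if column_sum else float("nan")
    return recall, precision


# Port 21 row from the text; port 25 row is partly hypothetical (see lead-in).
confusion = {
    21: {21: 1480, 110: 212, 119: 40, 6881: 60},
    25: {21: 30, 25: 16478},
}
recall, precision = recall_and_normalized_precision(confusion)
print(round(recall[21], 4))  # 0.8259
```

Normalizing each row before summing a column means that every protocol contributes as if it had the same number of connections, which is why the precision differs from the one computed on raw counts.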

5.4. Detecting BitTorrent Packets in the Presence of Random Port Numbers
Detecting P2P packets such as those of BitTorrent is a hard problem that has been investigated by numerous researchers. In this section, we describe how our technique can be applied to detect BitTorrent packets and provide some preliminary experimental results. Since our technique needs a set of flows believed to belong to the same protocol, in this case BitTorrent, we collected packets coming from the same host to build a bag of flows and applied our technique to it. We identified hosts that exchange packets with more than 10 different peers within a relatively short time period, in our case 1000 seconds. There were 1049 such hosts in the pcap files used in the experiment. Since the pcap files do not contain the payload portion, we cannot tell exactly which traffic is due to a true BitTorrent application. (Another approach to collecting traffic for BitTorrent detection would be to generate BitTorrent packets ourselves and combine them with the existing pcap files. However, it is very hard to generate realistic BitTorrent traffic in this way; the generated traffic would consist of very simple patterns whose classification would be trivial.) Instead, we assumed that ports 6881 through 6899 are BitTorrent ports and collected packets with these ports as BitTorrent traffic. Packets with these ports have a very high chance of being BitTorrent packets (in the older version of BitTorrent, whose traffic our pcap files contain, 6881–6899 is known to be the port range that BitTorrent hosts use), and the purpose of our experiment is to see how many of them are classified as BitTorrent packets by our KL model.
The 1049 hosts produced 37123 flows, or connections, whose port distribution is shown in Table 17. About 30% of them were destined to one of the well-known ports that our classification system can recognize, while the rest (70% of the whole traffic), categorized as “Other” in the table, used random port numbers. The table also shows the prediction result of the proposed KL10 method. Since each flow is always matched to one of the 10 models, no flow is predicted in the “Other” category. Instead, each category is predicted to contain more flows than actually belong to it. We believe the prediction system will become more precise when it is equipped with additional Markov models for the missing protocols.
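The host selection rule above (more than 10 different peers within 1000 seconds) can be sketched with a sliding time window. The sketch assumes, hypothetically, that packets are available as (timestamp, source, destination) tuples sorted by timestamp and that both directions of a packet count as an exchange; neither the record format nor the function name comes from the original experiment.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Set, Tuple

Packet = Tuple[float, str, str]  # (timestamp in seconds, source host, destination host)


def active_hosts(
    packets: Iterable[Packet], min_peers: int = 10, window: float = 1000.0
) -> Set[str]:
    """Return hosts seen with more than `min_peers` distinct peers inside any `window` seconds.

    Assumes `packets` is sorted by timestamp.
    """
    recent: Dict[str, Deque[Tuple[float, str]]] = defaultdict(deque)
    peer_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    selected: Set[str] = set()

    for ts, src, dst in packets:
        # Count the packet for both endpoints (an assumption of this sketch).
        for host, peer in ((src, dst), (dst, src)):
            if host in selected:
                continue
            queue = recent[host]
            queue.append((ts, peer))
            peer_counts[host][peer] += 1
            # Drop packets that have fallen out of the time window.
            while queue and ts - queue[0][0] > window:
                _, old_peer = queue.popleft()
                peer_counts[host][old_peer] -= 1
                if peer_counts[host][old_peer] == 0:
                    del peer_counts[host][old_peer]
            if len(peer_counts[host]) > min_peers:
                selected.add(host)
    return selected


# Toy data: host "A" contacts 11 peers in 11 seconds, host "B" needs 2000 seconds.
toy = [(float(i), "A", f"p{i}") for i in range(11)]
toy += [(200.0 * i, "B", f"q{i}") for i in range(11)]
chosen = active_hosts(sorted(toy))
print(chosen)
```

The packets of each selected host would then be grouped into flows to form the bag passed to the classifier.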

We are especially interested in the recall and precision performance for BitTorrent traffic. Out of the 1049 hosts, we further identified 39 hosts that produce traffic with ports in the 6881–6899 range. The total number of BitTorrent flows in this category was 943. We were interested in how many of them are classified as BitTorrent by our prediction system and how precise the prediction is. The result is shown in Table 18 and summarized at the bottom of the table. “num of BT” in the table stands for “the number of BitTorrent flows,” and “num of NonBT” stands for “the number of non-BitTorrent flows.” For each host, the table shows the true numbers of BitTorrent and non-BitTorrent flows together with the predicted numbers of BitTorrent and non-BitTorrent flows. From the table, we can see that all of the BitTorrent flows are classified as BitTorrent by our detection system; therefore, the recall rate is 100%. However, the precision rate is 54.70%, a much lower figure than those in Table 15. The main reason is that our system has only 10 models (ports 21, 22, 25, 80, 110, 119, 143, 443, and 995, plus BitTorrent) with which to classify the traffic. All other traffic that does not belong to one of these, that is, the traffic in the “Other” category of Table 17, still has to be classified into one of these models, which lowers the precision rate. In particular, a significant portion of it is classified as BitTorrent traffic, as shown in Table 18. It might be that these flows are actually BitTorrent traffic, or that they are other P2P traffic with a packet pattern similar to that of BitTorrent.

The low precision rate could be a problem in a real deployment of our system on a gateway server. However, we believe that the precision rate will improve as the number of models increases. Also, since our technique flags unknown traffic as BitTorrent by looking only at the packet headers, we can combine it with a Deep Packet Inspection (DPI) technique to classify the traffic further. That is, instead of applying DPI to all the traffic, we can first extract suspected BitTorrent traffic with our technique and then apply DPI only to the extracted traffic.
There are other concerns when the current system is deployed in a real environment. One is building and maintaining an LRU (Least Recently Used) list to keep track of the hosts for which packets are collected. Since we cannot keep track of all possible hosts, we use the LRU list to remove relatively inactive hosts from the monitoring list. How many hosts to keep in the LRU list, how many packets to collect per host, and so forth are questions to resolve in a real-world deployment. An LRU list that is too short will remove BitTorrent hosts prematurely when a large number of non-BitTorrent hosts exchange packets before the BitTorrent host has had the chance to send or receive its second packet. An LRU list that is too long obviously puts too much overhead on the system. Another concern is the relatively slow classification time of the Kullback-Leibler method. However, in a moderate-speed network such as the one in our test pcap files, the timestamps of the captured packets show that there is enough time to analyze the traffic pattern and classify it with the Kullback-Leibler technique. Even in a high-speed network, classification is a highly parallel task, and with proper equipment such as parallel network processors our technique would still be a viable option.
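The two-stage idea above can be sketched as a simple filter. Both stages are passed in as callables because the text does not specify an interface for either the header-based classifier or the DPI engine; the names `header_classifier`, `dpi_check`, and the label "bittorrent" are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Bag = TypeVar("Bag")


def two_stage_filter(
    bags: Iterable[Bag],
    header_classifier: Callable[[Bag], str],
    dpi_check: Callable[[Bag], bool],
    suspect_label: str = "bittorrent",
) -> Tuple[List[Bag], int]:
    """Apply the costly DPI check only to bags flagged by the header-only classifier.

    Returns the confirmed bags and the number of bags that reached the DPI stage.
    """
    confirmed: List[Bag] = []
    inspected = 0
    for bag in bags:
        if header_classifier(bag) != suspect_label:
            continue  # rejected by the cheap first stage; DPI is skipped
        inspected += 1
        if dpi_check(bag):
            confirmed.append(bag)
    return confirmed, inspected


# Toy stand-ins for the two stages (not real classifiers).
toy_bags = [
    {"id": 1, "guess": "bittorrent", "payload_ok": True},
    {"id": 2, "guess": "bittorrent", "payload_ok": False},
    {"id": 3, "guess": "http", "payload_ok": False},
]
confirmed, inspected = two_stage_filter(
    toy_bags,
    header_classifier=lambda b: b["guess"],
    dpi_check=lambda b: b["payload_ok"],
)
print([b["id"] for b in confirmed], inspected)
```

In this arrangement the DPI cost scales with the number of suspected bags rather than with the whole traffic volume.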
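The LRU list discussed above could be kept with an ordered dictionary, as in the following sketch. The class name, its two capacity parameters, and the choice to stop storing packets once a host's quota is full are assumptions made for illustration; the text leaves these design questions open.

```python
from collections import OrderedDict
from typing import Any, List, Optional


class HostLRU:
    """Keep packets for at most `max_hosts` hosts, evicting the least recently active one."""

    def __init__(self, max_hosts: int, max_packets_per_host: int) -> None:
        self.max_hosts = max_hosts
        self.max_packets = max_packets_per_host
        self._hosts: "OrderedDict[str, List[Any]]" = OrderedDict()

    def add_packet(self, host: str, packet: Any) -> Optional[str]:
        """Record a packet for `host`; return the evicted host, if any."""
        evicted = None
        if host in self._hosts:
            self._hosts.move_to_end(host)  # mark as most recently active
        else:
            if len(self._hosts) >= self.max_hosts:
                evicted, _ = self._hosts.popitem(last=False)  # drop the oldest host
            self._hosts[host] = []
        stored = self._hosts[host]
        if len(stored) < self.max_packets:
            stored.append(packet)
        return evicted

    def packets(self, host: str) -> List[Any]:
        return list(self._hosts.get(host, []))


# A list that is too short evicts a host before its second packet arrives.
lru = HostLRU(max_hosts=2, max_packets_per_host=5)
lru.add_packet("bt-host", "pkt1")
lru.add_packet("a", "x")
evicted = lru.add_packet("b", "y")
lru.add_packet("bt-host", "pkt2")
print(evicted, lru.packets("bt-host"))
```

The toy run reproduces the premature eviction described in the text: the first packet of the evicted host is lost, so its bag has to be rebuilt from scratch.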
6. Discussion and Conclusion
In this paper, we developed a novel classification method based on a Markov model with Kullback-Leibler divergence. Our primary goal was to develop a method that performs well on hard-to-classify network applications. Most previous methods of network classification perform well in most cases by using either correlated information of connections or a combination of a machine learning technique and Markov or hidden Markov models, but they fail to produce convincing results when the connection patterns of applications are similar.
We proposed a novel method that combines a flexible Markov model with Kullback-Leibler information and correlated traffic connections, obtained by grouping (or bagging) connections by the port number of the application. As our theoretical simulation and real data simulation show, our method outperformed the other methods in hard-to-classify situations, even though we did not cover all possible cases.
We recognize the slowness of the Kullback-Leibler approach (compared to the Individual or Majority approach, as shown in Table 13) as one drawback of our technique. However, its high prediction success rate, even among classes whose traffic patterns overlap, such as SMTP and IMAP, is promising. Further, our technique is scalable in that its execution time increases only linearly with the number of target classes: once it has built the Markov model for the test data, the distance from each of the Markov models of the target classes can be measured very quickly.
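The scaling argument above can be illustrated with a small sketch: the Markov model of the test bag is built once, and one divergence is then computed per target class. The encoding of flows as sequences of integer states, the additive smoothing, and the weighting of rows by how often each state is left are all assumptions of this sketch and may differ from the construction used in the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def transition_matrix(
    sequences: Sequence[Sequence[int]], n_states: int, alpha: float = 1.0
) -> Tuple[Matrix, List[float]]:
    """Estimate smoothed transition probabilities and per-state weights from state sequences."""
    counts = [[alpha] * n_states for _ in range(n_states)]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    row_totals = [sum(row) for row in counts]
    grand_total = sum(row_totals)
    probs = [[c / t for c in row] for row, t in zip(counts, row_totals)]
    weights = [t / grand_total for t in row_totals]
    return probs, weights


def kl_markov(p: Matrix, weights: List[float], q: Matrix) -> float:
    """Weighted sum of row-wise KL divergences D(p_row || q_row)."""
    return sum(
        w * sum(pi * math.log(pi / qi) for pi, qi in zip(p_row, q_row))
        for w, p_row, q_row in zip(weights, p, q)
    )


def classify_bag(
    bag: Sequence[Sequence[int]], models: Dict[str, Matrix], n_states: int
) -> Tuple[str, Dict[str, float]]:
    """Build the test model once, then make one cheap comparison per class model."""
    p, weights = transition_matrix(bag, n_states)
    distances = {label: kl_markov(p, weights, q) for label, q in models.items()}
    return min(distances, key=distances.get), distances


# Toy class models: "A" alternates between states, "B" tends to stay in place.
models = {
    "A": transition_matrix([[0, 1, 0, 1, 0, 1]], 2)[0],
    "B": transition_matrix([[0, 0, 0, 0], [1, 1, 1, 1]], 2)[0],
}
label, distances = classify_bag([[0, 1, 0, 1], [1, 0, 1, 0]], models, 2)
print(label)  # A
```

Because the loop in `classify_bag` visits each class model exactly once, adding a class adds one matrix comparison, which matches the linear growth claimed above. The smoothing keeps every probability positive so that the logarithm is always defined.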
Our approach can be extended to more general multiclass problems. A table can be set up to classify observations with patterns to one of the classes (see Table 19).

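The basic idea of individual assignment is to compute the likelihood of each observation under every class model and to assign the observation to the class whose likelihood exceeds those of all other classes, assuming that all likelihoods have distinct values.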
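The recall rate for each application can then be computed from this table as the fraction of that application's observations that are assigned to its own class. Precision rates can also be computed by taking into account the abundance proportions of each application.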
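A minimal sketch of individual assignment and of the resulting classification table is given below. The likelihood values are hypothetical inputs; how they are obtained from the class models is not shown here, and raising an error on ties is our own choice for handling the distinct-values assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def assign_individual(likelihoods: Dict[str, float]) -> str:
    """Assign an observation to the class with the strictly largest likelihood."""
    ranked = sorted(likelihoods.values(), reverse=True)
    if len(ranked) > 1 and ranked[0] == ranked[1]:
        raise ValueError("likelihoods are tied; the assignment rule assumes distinct values")
    return max(likelihoods, key=likelihoods.get)


def assignment_table(
    samples: Iterable[Tuple[str, Dict[str, float]]]
) -> Counter:
    """Count observations by (true class, assigned class)."""
    table: Counter = Counter()
    for true_class, likelihoods in samples:
        table[(true_class, assign_individual(likelihoods))] += 1
    return table


def recall_rate(table: Counter, cls: str) -> float:
    """Share of the observations of `cls` that were assigned to `cls`."""
    total = sum(n for (true_class, _), n in table.items() if true_class == cls)
    return table[(cls, cls)] / total


# Hypothetical likelihoods for three observations.
samples = [
    ("smtp", {"smtp": 0.6, "imap": 0.4}),
    ("smtp", {"smtp": 0.3, "imap": 0.7}),
    ("imap", {"smtp": 0.2, "imap": 0.8}),
]
table = assignment_table(samples)
print(recall_rate(table, "smtp"), recall_rate(table, "imap"))  # 0.5 1.0
```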
Disclosure
A preliminary, shortened version of this paper was published in [23] by the same authors. Extensive experimentation has been performed since the publication of the preliminary version, and the results are presented in the current paper.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This work was supported by Inha University and National Research Foundation of Korea grants funded by the Korean Government (NRF2012R1A1A200665 and NRF2013R1A1A2059335).
References
[1] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, “Offline/realtime traffic classification using semi-supervised learning,” Performance Evaluation, vol. 64, no. 9–12, pp. 1194–1213, 2007.
[2] G. Munz, H. Dai, L. Braun, and G. Carle, “TCP traffic classification using Markov models,” in Proceedings of the 2nd International Conference on Traffic Monitoring and Analysis (TMA '10), pp. 127–140, Zurich, Switzerland, April 2010.
[3] J. Zhang, Y. Xiang, Y. Wang, W. Zhou, Y. Xiang, and Y. Guan, “Network traffic classification using correlation information,” IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 1, pp. 104–117, 2013.
[4] L. Bernaille, R. Teixeira, and K. Salamatian, “Early application identification,” in Proceedings of the ACM International Conference on Emerging Network Experiments and Technologies (CoNEXT '06), Lisbon, Portugal, December 2006.
[5] T. Karagiannis, A. Broido, N. Brownlee, and K. Claffy, “Is P2P dying or just hiding?” in Proceedings of the 47th Annual IEEE Global Telecommunications Conference (Globecom '04), Dallas, Tex, USA, November 2004.
[6] T. T. T. Nguyen and G. Armitage, “A survey of techniques for internet traffic classification using machine learning,” IEEE Communications Surveys and Tutorials, vol. 10, no. 4, pp. 56–76, 2008.
[7] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-Service mapping for QoS: a statistical signature-based approach to IP traffic classification,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC '04), pp. 135–148, ACM, Sicily, Italy, October 2004.
[8] A. W. Moore and D. Zuev, “Internet traffic classification using Bayesian analysis techniques,” in Proceedings of ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '05), pp. 50–60, ACM, Banff, Canada, June 2005.
[9] M. Crotti, M. Dusi, F. Gringoli, and M. Salgarelli, “Traffic classification through simple statistical fingerprinting,” ACM SIGCOMM Computer Communication Review, vol. 37, no. 1, pp. 5–16, 2007.
[10] T. Nguyen and G. Armitage, “Training on multiple sub-flows to optimize the use of Machine Learning classifiers in real-world IP networks,” in Proceedings of the 31st IEEE Conference on Local Computer Networks, Tampa, Fla, USA, November 2006.
[11] A. Castiglione, A. De Santis, and F. Palmieri, “Characterizing and classifying card-sharing traffic through wavelet analysis,” in Proceedings of the 3rd International Conference on Intelligent Networking and Collaborative Systems (INCoS '11), pp. 691–697, IEEE, Fukuoka, Japan, December 2011.
[12] F. Palmieri, U. Fiore, A. Castiglione, and A. De Santis, “On the detection of card-sharing traffic through wavelet analysis and Support Vector Machines,” Applied Soft Computing Journal, vol. 13, no. 1, pp. 615–627, 2013.
[13] F. Palmieri, U. Fiore, and A. Castiglione, “A distributed approach to network anomaly detection based on independent component analysis,” Concurrency Computation: Practice and Experience, vol. 26, no. 5, pp. 1113–1129, 2014.
[14] U. Fiore, F. Palmieri, A. Castiglione, and A. De Santis, “Network anomaly detection with the restricted Boltzmann machine,” Neurocomputing, vol. 122, pp. 13–23, 2013.
[15] Z. Chen, B. Yang, Y. Chen, A. Abraham, C. Grosan, and L. Peng, “Online hybrid traffic classifier for Peer-to-Peer systems based on network processors,” Applied Soft Computing, vol. 9, no. 2, pp. 685–694, 2009.
[16] W. Ye and K. Cho, “Hybrid P2P traffic classification with heuristic rules and machine learning,” Soft Computing, vol. 18, no. 9, pp. 1815–1827, 2014.
[17] F. Palmieri and U. Fiore, “A nonlinear, recurrence-based approach to traffic classification,” Computer Networks, vol. 53, no. 6, pp. 761–773, 2009.
[18] A. Dainotti, W. D. Donato, A. Pescapè, and P. S. Rossi, “Classification of network traffic via packet-level hidden Markov models,” in Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '08), pp. 1–5, IEEE, New Orleans, La, USA, December 2008.
[19] X. Mu and W. Wu, “A parallelized network traffic classification based on hidden Markov model,” in Proceedings of the International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC '11), pp. 107–112, IEEE, Beijing, China, October 2011.
[20] T. Cover and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[21] H. Akaike, “Information theory and an extension of the maximum likelihood principle,” in Proceedings of the 2nd International Symposium on Information Theory, pp. 267–281, Budapest, Hungary, 1973.
[22] SimpleWeb Traces, http://www.simpleweb.org/wiki/Traces.
[23] J. Kim, J. Hwang, and K. Kim, “Internet traffic classification using a Markov model and Kullback-Leibler divergence,” in Proceedings of the International Conference on Communicatio