Mobile Information Systems
Volume 2017 (2017), Article ID 3146868, 22 pages
https://doi.org/10.1155/2017/3146868
Research Article

Effective Packet Number for 5G IM WeChat Application at Early Stage Traffic Classification

School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Correspondence should be addressed to Xiangzhan Yu

Received 28 October 2016; Revised 29 December 2016; Accepted 15 January 2017; Published 14 February 2017

Academic Editor: Michal Choras

Copyright © 2017 Muhammad Shafiq and Xiangzhan Yu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Accurate network traffic classification at an early stage is very important for 5G network applications. In recent years, researchers have worked hard to propose effective machine learning models that classify Internet traffic applications at an early stage from only a few packets. Nevertheless, this problem still needs thorough study to determine both the effective packet number and the effective machine learning (ML) model. In this paper, we address this problem using five Internet traffic datasets. We first extract the packet sizes of the first 20 packets and then carry out mutual information analysis to measure how much information each packet carries about the flow type. Thereafter, we run 10 well-known machine learning algorithms using cross-validation. Two statistical tests, the Friedman test and the Wilcoxon pairwise test, are applied to the experimental results. We also apply the statistical tests to the classifiers to find the effective ML classifier. Our experimental results show that 13–19 packets are the effective packet numbers for early stage classification of 5G IM WeChat traffic, and that Random Forest is an effective ML classifier for early stage Internet traffic classification.

1. Introduction

In recent years, early stage Internet traffic classification has received much attention in the area of network traffic classification. From the perspective of feature extraction, most researchers proposed machine learning models based on features extracted from a whole network flow [13]. In 2005, Moore et al. [4] presented a feature extraction method that many researchers have since followed. They used the whole flow and extracted 248 statistical features, such as packet sizes and their maximum, minimum, and average values. Machine learning classifiers can achieve very effective performance using these statistical features [5]. These feature extraction methods also showed high performance in anomaly detection [6]. However, they are not very effective in practice, because the features are available only after the whole flow has been observed. Thus it is very important to classify Internet traffic at an early stage, in view of security policies and quality of service (QoS) management. In 2012 [7], Internet traffic classification with few packets became a very hot topic in the area of network traffic classification. Addressing the accuracy of early stage classification, Qu et al. in 2012 [8] showed that it is possible to classify Internet traffic at an early stage with effective accuracy.

However, no study has been carried out on early stage traffic identification for the 5G WeChat application. An important open question is how many packets are most effective for classifying 5G WeChat traffic at an early stage. Very few studies address this question for early stage traffic in general, and this is the first study concerned with instant messaging (IM) WeChat traffic classification. In this study, we aim to find the most effective number of packets, as well as the most effective machine learning classifier, for early stage WeChat traffic classification using an empirical study and information analysis. Five traffic datasets and ten well-known and widely applied ML classifiers are used. We use packet size as the feature and apply mutual information analysis to measure the identification information carried by each of the first packets. Thereafter, a number of datasets are created for identification. Then, all the selected machine learning classifiers are run on the generated datasets using different numbers of packets. Finally, two statistical tests are applied to the experimental results to find the most effective packet number and ML classifier.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the datasets used in this study, and Section 4 presents the study framework and its detailed steps. Section 5 introduces mutual information analysis, the selected machine learning classifiers, and the theory of the statistical tests. Results and analysis are given in Section 6, followed by a discussion in Section 7. Finally, Section 8 concludes the paper.

2. Related Work

Recently, some studies have proposed classifying Internet traffic at an early stage from only a few packets [7], but doing so is very hard. The main problem in early stage Internet traffic classification is the extraction of effective features. Bernaille et al. in 2006 [9] proposed an early stage Internet traffic classification technique that uses the sizes of the first few packets of a TCP flow as features and applies k-means clustering to 10 types of application traffic; they obtained very effective classification results. Huang et al. in 2008 [10] studied the characteristics of early stage application traffic and used these characteristics for early stage Internet traffic classification. Moreover, in 2013 the authors in [11] extracted features of early stage application traffic. With machine learning classifiers, they used packet size, interpacket time, and the average and standard deviation of packet size and interpacket time, and obtained very high performance for early stage Internet traffic classification. Este et al. in 2009 [12] studied the features of the first few packets of a flow and found that packet size, packet interarrival time, and packet direction carry enough information. They also found that these are the most effective features for early stage Internet traffic classification. Hullar et al. presented an automatic machine learning (ML) method for early stage P2P traffic classification that consumes limited computational and memory resources. Rizzi et al. in 2013 [13] presented a very effective neuro-fuzzy system to identify early stage traffic. Nguyen et al. [14] further extended the early stage to “timely” classification of VoIP traffic. They derived statistical features from a subflow, where a subflow is a small number of packets.

In [11], the authors used 20 packets to extract features at an early stage. The authors of [9] state that five packets are enough to classify early stage traffic accurately. Dainotti et al. [15] used the packet size (PS) and interpacket time (IPT) of the first 10 packets, together with the average and standard deviation of packet size and interpacket time, for early stage traffic identification. Peng et al. in 2015 [16] used the payload size of the first 10 packets for early stage Internet traffic classification. They state that 5–7 packets are the most effective for early stage Internet traffic identification. They also note that selecting too many packets increases the computational complexity, while selecting too few packets does not provide enough information and decreases accuracy.

Bernaille et al. in 2006 [17] studied the problem of the effective packet number for early stage Internet traffic classification. They used k-means, GMM, and HMM models with the size and direction of the first 10 packets of a TCP connection for early stage traffic identification. They state that the first four packets are the most effective for early stage Internet traffic classification. They conducted many experiments on eight different traffic traces and obtained very high identification results using the first four packets with the three machine learning algorithms. Lim et al. in 2010 [18] used not only packet size but also connection level and statistical features on a number of different datasets, applying Naïve Bayes, C4.5 decision tree, k-nearest neighbors, and Support Vector Machine in their experiments. They used the first 10 packets to identify both UDP and TCP application flows, but their study was purely empirical.

In recent years, the number of Internet users has increased day by day, owing to reliable and free instant messaging and calling applications on the Internet. WeChat is one such freely available application, an instant messaging and free calling application developed by Tencent Holdings in China. It is a multifunctional application that can be used on both smartphones and desktop machines. After its launch, WeChat reached 300 million online users [19], and by November 2015 its active users had reached 650 million worldwide, of which 100 million were outside China [20]. The steadily increasing number of active users and volume of traffic of this application can affect network performance. It is also important to classify WeChat message, audio call, and video call traffic accurately to manage quality of service (QoS). Huang et al. [21] proposed the ChatDissect measurement tool to measure WeChat application traffic and identified 150 K users and 16 GB of WeChat traffic in real-world network traces. In 2013, Church and De Oliveira [22] compared the performance of a mobile instant messaging service with traditional short messages. In 2014, O’Hara et al. [23] studied the instant messaging application WhatsApp on smartphones and conducted interviews and a survey to study user activity. In 2014, Fiadino et al. [24] also studied WhatsApp flows; they collected data in a European network consisting of millions of flows and also studied audio and video flows. In 2014, Liu and Guo [25] studied video messaging services in WeChat and WhatsApp, capturing the traffic on mobile devices. However, no study has determined the number of packets that is most effective for early stage classification of WeChat traffic. In our previous work [26], we classified only WeChat text message flows, using datasets from two different environments and four well-known ML classifiers. In that work we used only 50 features to classify text message flows and obtained high accuracy.

3. Datasets

In this research work, we use five datasets: the four subdatasets of HIT Trace I and the HIT Trace II dataset. The details are given below.

3.1. HIT Trace I Dataset

To build the HIT Trace I dataset, we capture the traffic of four WeChat functions, namely, text messages, picture messages, audio calls, and video calls, and treat these traces as four separate datasets. In this study, we are interested in classifying WeChat real-time application traffic at an early stage and in finding the effective number of packets for doing so. We therefore capture WeChat text message, picture message, audio call, and video call traffic using the Wireshark tool [27] for a duration of 1 hour at the research lab of the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, on 23 May 2016. We keep only packets with a nonzero payload. During capture, we record the WeChat TCP, UDP, and SSL traffic of text messages, picture messages, audio calls, and video calls. After capture, the trace is saved as a file with the .pcap extension. The characteristics of these datasets are given in Table 1; note that every captured application includes TCP, UDP, and SSL traffic instances and that this trace comprises four subdatasets.

Table 1: Characteristics of HIT Trace I selected traffic trace.
3.2. HIT Trace II Dataset

The second dataset is collected in a dormitory of Harbin Institute of Technology using Wireshark. We capture WeChat TCP and UDP traffic and the traffic of four other applications, DNS, FTP, Telnet, and WWW, each for a duration of one hour. For our study we keep only packets with a nonzero payload. We capture DNS traffic on 26 December 2015, FTP and Telnet traffic on 27 December 2015, and WWW traffic on 28 December 2015. After capture, we save the trace as a file with the .pcap extension. The detailed characteristics of this dataset are given in Table 2. Note that WTCP means WeChat TCP traffic only and WUDP means WeChat UDP traffic only.

Table 2: Characteristics of HIT Trace II selected traffic trace.

4. Study Framework

Our study framework includes two models: the first addresses the problem of the effective packet number and the second addresses the problem of the effective machine learning (ML) classifier.

Figure 1 depicts the framework used to find the effective packet number. The method and its steps are explained below.

Figure 1: Study Framework 1 for effective packets number.

(i) Trace Traffic. In this first step, we capture WeChat traffic using the Wireshark tool and save the captured traffic as a file with the .pcap extension. We capture text message, picture message, audio call, and video call traffic for one hour each. Thereafter, we select the first 20 nonzero-payload packets of every application and save them for feature extraction.

Figure 2 depicts the framework used to select the effective ML classifier. The process is explained below, from the feature extraction step through to the Wilcoxon test results.

Figure 2: Study Framework 2 for selection effective ML classifier.

(i) Features Extraction. As discussed in the related work section, packet size is the most essential packet feature for early stage Internet traffic classification [28]. In this study, we use packet size as the feature of early stage WeChat traffic, taking the packet sizes of the first 20 packets of the WeChat application. Note that we use only packets with a nonzero payload [16]. The features are arranged in numerical order: packet 1, packet 2, and so on up to packet 20, with the corresponding packet size 1, packet size 2, and so on up to packet size 20.
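The feature extraction step can be illustrated with a minimal Python sketch. It assumes the per-packet payload sizes of a flow have already been read from the trace; the function name `early_packet_sizes`, the zero padding of short flows, and the sample flow are illustrative assumptions, since the paper does not say how flows with fewer than 20 nonzero-payload packets are handled.

```python
from typing import Iterable, List

NUM_PACKETS = 20  # number of early packets used as features in the text


def early_packet_sizes(payload_sizes: Iterable[int], n: int = NUM_PACKETS) -> List[int]:
    """Return the sizes of the first n nonzero-payload packets of one flow.

    Assumption: flows with fewer than n such packets are right-padded with 0.
    """
    # Drop zero-payload packets (for example bare ACKs), keep arrival order.
    sizes = [s for s in payload_sizes if s > 0][:n]
    return sizes + [0] * (n - len(sizes))


# Hypothetical flow: payload sizes in arrival order, including empty packets.
flow = [0, 517, 0, 1448, 1448, 0, 93]
print(early_packet_sizes(flow, n=5))  # [517, 1448, 1448, 93, 0]
```

Position k of the returned vector corresponds to "packet size k" in the ordering described above.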

(ii) Mutual Information. After building the 20 feature datasets, we run mutual information analysis on the packet sizes of the first packets, which we label as packet features 1 to 20, to find the mutual information of each feature. This analysis tells us how much identification information each packet carries and therefore how effective each packet feature is.

(iii) Generate Feature Datasets. In this step, we build the feature datasets: the first packet with its packet size as an integer, then the second packet with its packet size, then the third, and so on up to 20 feature datasets.
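One way to read this step is sketched below in Python. It assumes that dataset k contains the sizes of the first k packets of every flow, which matches the later analysis in terms of "the first four packets" but is an interpretation rather than something the text states outright; `prefix_datasets` and the sample rows are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[Sequence[int], str]  # (packet sizes of one flow, class label)


def prefix_datasets(rows: List[Row], max_packets: int = 20) -> Dict[int, List[Row]]:
    """Build one dataset per packet count k, keeping only the first k sizes."""
    return {
        k: [(list(sizes[:k]), label) for sizes, label in rows]
        for k in range(1, max_packets + 1)
    }


# Hypothetical labelled flows with three packet sizes each.
rows = [([517, 1448, 93], "text"), ([120, 60, 60], "audio")]
datasets = prefix_datasets(rows, max_packets=3)
print(datasets[2])  # [([517, 1448], 'text'), ([120, 60], 'audio')]
```

Each of the resulting datasets can then be passed to the classifiers separately, so that results can be compared across packet counts.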

(iv) Selected Classifiers for Identification. For this study, we select 10 well-known machine learning classifiers that are widely used in network traffic classification. Using the selected classifiers, we run cross-validated identification on the generated datasets. In this framework we are interested only in finding the effective packet number, not in the accuracy of a particular machine learning algorithm, so we focus on how the results vary with the number of packets.

(v) Friedman Test. We also use statistical tests to examine the effective packet numbers more closely. The Friedman test is run to find significant differences among the results obtained with different numbers of packets. The statistical tests are described in detail in Section 5.

(vi) Wilcoxon Test. This is also a statistical test, used alongside the Friedman test. We first apply the Friedman test and then apply the Wilcoxon test to find the effective number of packets among the different numbers of packets.

5. Methodology

In this section, we explain all the methods applied in this study.

5.1. Mutual Information

In information theory, mutual information is widely used for feature selection [29], image processing [30], speech recognition [31], and so on. Mutual information measures the mutual dependence between two random variables, that is, the amount of information one random variable holds about the other. The mutual information between two variables $X$ and $Y$ can be defined as

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y). \tag{1}$$

In (1), $H(X)$ and $H(Y)$ are the marginal entropies of $X$ and $Y$, $H(X \mid Y)$ and $H(Y \mid X)$ are the conditional entropies, and $H(X,Y)$ is the joint entropy of $X$ and $Y$. The connection among $I(X;Y)$, $H(X)$, $H(Y)$, $H(X \mid Y)$, $H(Y \mid X)$, and $H(X,Y)$ is shown in Figure 3. According to Shannon’s definition of entropy, we have

$$H(X) = -\sum_{x} p(x)\log p(x), \qquad H(Y) = -\sum_{y} p(y)\log p(y), \qquad H(X,Y) = -\sum_{x}\sum_{y} p(x,y)\log p(x,y), \tag{2}$$

where $p(\cdot)$ is the probability distribution function of a random variable. As in [32], combining (1) with the three entropy definitions in (2) gives the computational formula of mutual information:

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}. \tag{3}$$

We use the same method for mutual information as in [32].

Figure 3: The relationship between mutual information and the entropies.

However, if the variables are continuous random variables, the summation is replaced by a definite double integral:

$$I(X;Y) = \int\!\!\int p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy. \tag{4}$$

Many free software packages for computing mutual information are available on the Internet; for our study we chose Peng’s mutual information MATLAB toolbox [33].
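For discrete variables, the double-sum form of mutual information can be estimated directly from sample counts. The Python sketch below is a plug-in estimate written for illustration; it is not the MATLAB toolbox used in the paper, and the function name and the choice of base-2 logarithms (bits) are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X;Y) in bits from two equally long samples."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical data: size of one packet position per flow, and the flow class.
sizes = [517, 517, 120, 120]
labels = ["text", "text", "audio", "audio"]
print(mutual_information(sizes, labels))  # 1.0 (the size fully determines the class)
```

Applied once per packet position, with `xs` the sizes at that position across flows and `ys` the flow classes, this yields one score per packet, which is the kind of per-packet comparison the analysis relies on.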

5.2. Machine Learning Classifiers

We conducted our identification experiments using ten well-known and widely used machine learning classifiers. All the selected classifiers were run in the Weka data mining software [34]. Weka is a data mining application used in many areas of computer science and by many researchers for network traffic classification [35]. First, we formatted the datasets as comma separated value (CSV) files, a format supported by Weka. Then we applied all the selected classifiers using two-fold cross-validation. Introductory information about the applied classifiers is given below, and the classifiers selected for this study are shown in Table 3.

(i) Bayes. Bayes classifiers are based on Bayes’ theorem; they are very widely used in computing and engineering and give very effective results. In this study, we use the Bayesian network (Bayes Net) classifier [36] and the Naïve Bayes classifier [37, 38].

(ii) Meta. From the Meta category in Weka, we select the Bagging [39] and AdaBoost [40] classifiers to classify WeChat traffic accurately. A meta classifier first trains base learners and then combines them to produce a stronger learner.

(iii) Rule. Algorithms in this category create rules according to a specific policy and then classify the test data with them. In this category, we select the OneR [41] and PART [42] rule-based classifiers.

(iv) Trees. These are decision tree algorithms, used by many researchers; they are also called statistical classifiers. In our study, we select J48, also called C4.5 [43], Naïve Bayesian trees [44], and Random Forest [45].

(v) SMO. From the function category, we select SMO [46, 47]. SMO (sequential minimal optimization) is a supervised machine learning technique for training the Support Vector Machine (SVM), which is widely used in many areas for classification. SVM is useful for both classification and regression.

Table 3: Classifiers selected in this study.

For more detail on the applied machine learning classifiers, see the original publications cited for each classifier.
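The two-fold cross-validation procedure used with these classifiers can be sketched in a few lines of Python. This is an illustration of the evaluation scheme only, not of Weka's implementation: the folds are not stratified, and the names `two_fold_accuracy`, `FitPredict`, and the majority-class baseline are hypothetical stand-ins for a real classifier.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[int], str]  # (feature vector, class label)
FitPredict = Callable[[List[Row], List[Sequence[int]]], List[str]]


def two_fold_accuracy(rows: List[Row], fit_predict: FitPredict, seed: int = 0) -> float:
    """Shuffle, split into two halves, train on each half and test on the other."""
    if len(rows) < 2:
        raise ValueError("need at least two rows for two folds")
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    folds = [idx[:half], idx[half:]]
    correct = 0
    for t in (0, 1):
        test = [rows[i] for i in folds[t]]
        train = [rows[i] for i in folds[1 - t]]
        preds = fit_predict(train, [x for x, _ in test])
        correct += sum(p == y for p, (_, y) in zip(preds, test))
    # Every row is tested exactly once, so this is the overall accuracy.
    return correct / len(rows)


def majority_class(train: List[Row], test_x: List[Sequence[int]]) -> List[str]:
    """Baseline stand-in for a classifier: always predict the most common label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return [label] * len(test_x)


rows = [([i], "text") for i in range(10)]
print(two_fold_accuracy(rows, majority_class))  # 1.0
```

Any function with the `FitPredict` signature can replace `majority_class`, which keeps the evaluation loop independent of the classifier being tested.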

5.3. Statistical Test

Statistical tests are conducted to examine the effectiveness of the packets in more depth, to compare the results of the applied ML classifiers, and to find significant differences among the results of the applied methods. In this study, we apply two statistical tests, the Friedman test and the Wilcoxon test, to the results of the methods [48, 49]. Both tests are introduced below.

(i) Friedman Test. The Friedman test is a nonparametric statistical test used to find significant differences among the results of the applied methods. Its first step is to convert the original results of the methods to ranks: the best performing method receives rank 1, the second best rank 2, and so on. The average ranking (AR) of each method is then calculated. If $r_i^j$ is the rank of the $j$th of $n$ algorithms on the $i$th of $m$ datasets, the average ranking of algorithm $j$ is

$$AR_j = \frac{1}{m}\sum_{i=1}^{m} r_i^j.$$

Under the null hypothesis all the algorithms behave similarly, so their ranks should be equal, and the Friedman statistic is calculated as follows:

$$\chi_F^2 = \frac{12m}{n(n+1)}\left[\sum_{j=1}^{n} AR_j^2 - \frac{n(n+1)^2}{4}\right],$$

where $m$ is the number of executions and $n$ is the number of methods. If the statistic is large enough, there is a significant difference among the results of the applied methods and the null hypothesis is rejected. The probability value ($p$ value) gives the significance level of each result and is usually reported in the analysis of test results. For multiple hypothesis testing we also apply a post hoc method to determine which hypotheses are rejected at a specified significance level. In many cases the lowest significance level at which a hypothesis is rejected is also of interest; this is called the adjusted $p$ value (APV), and a post hoc method can be used to find it for each hypothesis. In this study, we use Holm’s test [48] as the post hoc method, which is very effective at producing significant test results. In this work, we first apply the Friedman test with a $1 \times n$ comparison, because the $n \times n$ comparison is too long to show in this paper.
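The ranking, the Friedman statistic, and Holm's adjustment can be sketched in Python as follows. This is an illustrative implementation under stated assumptions rather than the software used in the paper: higher scores are taken to be better, tied scores share the mean of the ranks they span, and the inputs to `holm_adjusted` are assumed to be unadjusted p values already obtained from the pairwise comparisons against the control. All function names and sample numbers are hypothetical.

```python
from typing import List, Sequence


def average_ranks(scores: Sequence[Sequence[float]]) -> List[float]:
    """scores[i][j] is the score of method j in execution i (higher is better).

    Rank 1 is best; tied scores share the mean of the ranks they span.
    """
    m, n = len(scores), len(scores[0])
    totals = [0.0] * n
    for row in scores:
        order = sorted(range(n), key=lambda j: -row[j])
        pos = 0
        while pos < n:
            end = pos
            while end + 1 < n and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / m for t in totals]


def friedman_statistic(scores: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic for m executions and n methods."""
    m, n = len(scores), len(scores[0])
    ar = average_ranks(scores)
    return 12 * m / (n * (n + 1)) * (sum(r * r for r in ar) - n * (n + 1) ** 2 / 4)


def holm_adjusted(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p values, returned in the input order."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    adjusted = [0.0] * k
    running = 0.0
    for pos, i in enumerate(order):
        # The (pos+1)-th smallest p value is multiplied by k - pos.
        running = max(running, (k - pos) * p_values[i])
        adjusted[i] = min(1.0, running)
    return adjusted


# Hypothetical accuracies: rows are classifiers, columns are packet counts.
scores = [[0.90, 0.80, 0.70], [0.95, 0.85, 0.60], [0.92, 0.81, 0.75]]
print(average_ranks(scores))       # [1.0, 2.0, 3.0]
print(friedman_statistic(scores))  # 6.0
```

In the packet-number analysis the "methods" being ranked are the packet counts and each classifier contributes one execution; in the classifier analysis the roles are swapped.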

(ii) Wilcoxon Test. We also use the Wilcoxon signed rank test in this study. The Wilcoxon test is a nonparametric test used for pairwise comparison between two methods [50]. Let $d_i$ be the difference between the performance scores of the two methods on the $i$th of $n$ problems; if the scores lie in different ranges, they can be normalized to the interval $[0, 1]$ as in [51]. The differences are then ranked by their absolute values, and in case of ties practitioners apply one of the methods in [52]. A positive difference means that the first method performed better, and a negative difference means the opposite. $R^{+}$ is the sum of the ranks of the positive differences and $R^{-}$ is the sum of the ranks of the negative differences. If the difference between $R^{+}$ and $R^{-}$ is large, the null hypothesis is rejected. Like the Friedman test, this test is used to decide whether the hypothesis is rejected at a specified significance level $\alpha$.
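The two rank sums of the signed rank test can be computed with a short Python sketch. It assumes that zero differences are discarded and that tied absolute differences share the mean of the ranks they span; the paper defers tie handling to [52], so these choices, like the function name and the sample scores, are illustrative assumptions.

```python
from typing import Sequence, Tuple


def signed_rank_sums(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (R_plus, R_minus) for paired scores a and b.

    Assumptions: pairs with zero difference are dropped, and tied absolute
    differences receive the mean of the ranks they span.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        shared = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


# Hypothetical paired scores (in percent) of two methods on four problems.
print(signed_rank_sums([10, 12, 9, 15], [8, 13, 9, 11]))  # (5.0, 1.0)
```

The smaller of the two sums is what is usually compared against a critical value, or converted to a p value, to decide whether to reject the null hypothesis at level α.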

5.4. Evaluation Criteria for Performance Measurement

The confusion matrix is the basis of performance measurement in traffic classification. Figure 4 shows the confusion matrix used to evaluate traffic classification performance, in which rows refer to the actual class of the instances and columns refer to the predicted class.

Figure 4: Confusion matrix for classification results evaluation.

The quantities taken from the confusion matrix for Internet traffic classification are described below.

(i) True Positive (TP): an instance of Class A is correctly identified as belonging to Class A.

(ii) True Negative (TN): an instance that is not of Class A is correctly identified as not belonging to Class A.

(iii) False Positive (FP): an instance that is not of Class A is incorrectly identified as belonging to Class A.

(iv) False Negative (FN): an instance of Class A is incorrectly identified as not belonging to Class A.

From these quantities, different metrics can be derived to evaluate classification performance [53, 54]; note that an effective classifier minimizes the FP and FN values. In this study we use the accuracy and AUC metrics, defined as follows.

(i) Accuracy. Classification accuracy is the proportion of correctly classified samples among all classified samples. Mathematically, it is the sum of TP and TN divided by the sum of TP, TN, FP, and FN:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Accuracy measures the performance of a classifier and shows the overall effectiveness of the classification model.

(ii) Sensitivity. Sensitivity and recall are the same metric in traffic classification. Sensitivity is TP divided by the sum of TP and FN:

$$\text{Sensitivity} = \frac{TP}{TP + FN}.$$

(iii) Specificity. Specificity is the ability of a machine learning classifier to classify negative instances correctly. Mathematically, it is TN divided by the sum of FP and TN:

$$\text{Specificity} = \frac{TN}{FP + TN}.$$

(iv) Area under Curve. The area under the receiver operating characteristic (ROC) curve [55] summarizes the performance of a machine learning classifier. The ROC curve shows the trade-off between the false positive rate (FPR) and the true positive rate (TPR), where FPR equals 1 − specificity and TPR is the sensitivity. The AUC value can be computed from the confusion matrix through TPR and FPR:

$$\text{AUC} = \frac{1 + TPR - FPR}{2} = \frac{TPR + (1 - FPR)}{2},$$

where specificity = 1 − FPR and sensitivity = TPR.

Replacing 1 − FPR by specificity and TPR by sensitivity, we get

$$\text{AUC} = \frac{\text{Sensitivity} + \text{Specificity}}{2}. \tag{14}$$

Equation (14) shows that AUC is the average of sensitivity and specificity.
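The four metrics above follow directly from the confusion matrix counts, as the following Python sketch shows for a single class treated as positive. The function name and the sample counts are hypothetical, and the AUC here is the single-operating-point value given by the average of sensitivity and specificity, not an area computed from a full ROC curve.

```python
from typing import Dict


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and AUC from confusion matrix counts."""
    sensitivity = tp / (tp + fn)  # TPR, also called recall
    specificity = tn / (tn + fp)  # equals 1 - FPR
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "auc": (sensitivity + specificity) / 2,
    }


# Hypothetical counts for one class against all others.
for name, value in binary_metrics(tp=45, tn=40, fp=10, fn=5).items():
    print(f"{name}: {value:.3f}")
```

On an imbalanced dataset accuracy and AUC can diverge noticeably, because accuracy is dominated by the majority class while the AUC weights both classes equally.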

6. Experimental Results and Analysis

In this section, we present the detailed experimental results and analysis. We first give the mutual information analysis results for the HIT Trace I dataset, including its four subdatasets, and then for the HIT Trace II dataset. We then analyze the results of the applied methods to validate the effectiveness of the packets, and lastly give the results of the statistical tests for the effective ML classifier.

6.1. Mutual Information Analysis Results
6.1.1. Mutual Information Results of the HIT Trace I Dataset

Figure 5 shows the results of the mutual information analysis. In Figure 5 the mutual information of the first two packets of text messages and picture messages is higher than that of audio call and video call packets, whose values are no more than 0.1. This means that for audio and video calls the first two packets contribute little information, whereas for text and picture messages the first packet contributes considerably more than packets 2 to 4. In text message traffic, packets 8-9 give high identification information; in picture message traffic, packets 7-8 do; and in audio call traffic, packets 6-7 do. Video call traffic is very different from the other traffic types: here packets 19-20 give very high identification information. More details of the mutual information results are shown in Table 16.

Figure 5: Mutual information results of HIT Trace I dataset.
6.1.2. Mutual Information Results of the HIT Trace II Dataset

Figure 6 shows the results of the mutual information analysis in detail. In Figure 6 the mutual information of the first two packets of the FTP, DNS, and WWW applications is higher than that of the WTCP, WUDP, and Telnet applications. Packets 2-3 also do not contribute very effectively, which means that packets 1–3 do not give much identification information. However, packets 6 and 17 give very effective identification information, and the remaining packets contribute less by comparison. Moreover, from the application perspective, FTP and DNS give very effective identification information compared with the other applications.

Figure 6: Mutual information results of HIT Trace II dataset.
6.2. Analysis Results of HIT Trace I Dataset
6.2.1. Results of the Text Messages Traffic Dataset

Figure 7 shows the accuracy results for the WeChat text message dataset; the detailed results are shown in Table 19. All the applied machine learning classifiers obtain very low accuracy using the first two and three packets of early traffic, because it is very difficult to identify Internet traffic from so few packets. Naïve Bayes, Hoeffding, and Random Forest obtain very low accuracy using the first two packets. From the text message dataset, therefore, we cannot conclude that the very first packets are the more effective ones for early stage Internet traffic classification, and the first three packets cannot be called effective. With the first four packets of the WeChat text message dataset, however, all the applied machine learning classifiers obtain very effective accuracy except the Random Forest and PART classifiers, which obtain low accuracy using the first four packets. Support Vector Machine (SVM) obtains low accuracy compared with the other machine learning classifiers. The accuracy of all the classifiers increases steadily with the number of packets, except for Random Forest, whose accuracy is poor and unstable. Thus we can infer that the first four and five packets carry enough identification information for early stage classification of the WeChat text message dataset. Note that we use two-fold cross-validation in this study.

Figure 7: Accuracy results of WeChat text message dataset.

Figure 8 shows the AUC results for the WeChat text message dataset; the details are shown in Table 23. The figure makes clear that the first two or three packets do not contribute to AUC identification with the selected machine learning classifiers, and that the AUC increases steadily from the first four packets up to twenty packets. Most of the applied machine learning classifiers give very effective AUC results, but some of them, such as SMO and OneR, give ineffective results for the WeChat text message dataset. As discussed in the accuracy analysis, Naïve Bayes, Hoeffding, and Random Forest give low accuracy, yet their AUC results are effective by comparison. This indicates that the WeChat text message dataset is imbalanced.

Figure 8: AUC result of WeChat text message dataset.
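
To see how accuracy and AUC can disagree on imbalanced data, consider the following minimal sketch. It is a binary simplification with made-up labels and scores (the datasets in this study are not reproduced here), and the helper names are hypothetical. AUC is computed as the fraction of positive/negative pairs that the scores rank correctly, with ties counted as one half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at a fixed score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Imbalanced toy sample (assumption): 2 positives, 8 negatives.
    # The scores rank every positive above every negative, but all of
    # them lie above the 0.5 threshold.
    labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    scores = [0.95, 0.90, 0.80, 0.75, 0.70, 0.70, 0.65, 0.60, 0.60, 0.55]
    print("accuracy:", accuracy(labels, scores))  # 0.2
    print("AUC:", auc(labels, scores))            # 1.0
```

Here the ranking is perfect (AUC of 1.0) while the thresholded predictions are mostly wrong (accuracy of 0.2), because accuracy depends on the decision threshold and the class proportions whereas AUC depends only on the ranking.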

Table 4 shows Friedman’s test results for accuracy. Packet number nineteen performs best, with the lowest ranking value of 1.7555. Comparing the p values with the adjusted p values, the adjusted p values are less than the p values for packet numbers 10–12, and likewise for packet numbers 15–17. These are the best performing packet numbers.

Table 4: Friedman’s test results for WeChat text messages dataset.
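
As a rough illustration of how Friedman average ranks are obtained, the sketch below ranks the packet numbers within each classifier (rank 1 for the highest accuracy, ties sharing the mean rank), averages the ranks over classifiers, and computes the Friedman chi-square statistic without tie correction. The accuracy values are invented for the example, and the function names are hypothetical; the adjusted p values reported in the tables are not reproduced.

```python
def average_ranks(results):
    """Mean rank of each column over all rows; rank 1 is the best (largest).

    results[c][j] is the accuracy of classifier c with packet setting j.
    Tied values within a row share the mean of the ranks they occupy.
    """
    k = len(results[0])
    totals = [0.0] * k
    for row in results:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1..j+1
            for t in range(i, j + 1):
                totals[order[t]] += mean_rank
            i = j + 1
    return [total / len(results) for total in totals]


def friedman_statistic(results):
    """Friedman chi-square statistic (no tie correction)."""
    n, k = len(results), len(results[0])
    ranks = average_ranks(results)
    return 12 * n / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4
    )


if __name__ == "__main__":
    # Invented accuracies: 3 classifiers (rows) x 3 packet settings (columns).
    toy = [
        [0.60, 0.80, 0.90],
        [0.55, 0.85, 0.88],
        [0.50, 0.70, 0.95],
    ]
    print(average_ranks(toy))       # [3.0, 2.0, 1.0]
    print(friedman_statistic(toy))  # 6.0
```

The column with the lowest average rank is the setting that performs best most consistently across classifiers, which is how the "lowest ranking value" is read in the Friedman tables.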

To better understand the results of Friedman’s test, we also run the Wilcoxon signed-rank test. Table 5 shows the Wilcoxon pairwise test results for the WeChat text messages dataset. In the table, the p value for 20 packets is greater than 0.05 for accuracy. Thus we conclude that there is no significant difference between the results of 19 packets and those of the other packets for the WeChat text messages dataset.

Table 5: Wilcoxon pairwise test results for WeChat text messages dataset.
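
The pairwise comparison can be sketched as follows: given the paired accuracies of the same classifiers under two packet settings, the Wilcoxon signed-rank test ranks the absolute differences and compares the rank sums of the positive and negative differences. The sketch below computes an exact two-sided p value by enumerating all sign assignments, which is practical only for small samples such as ten classifiers. The paired values are invented, given as integer percentages to avoid floating-point ties, and the function names are hypothetical.

```python
from itertools import product


def signed_rank_statistic(a, b):
    """Return (W_plus, W_minus, ranks) for paired samples.

    Zero differences are dropped; tied absolute differences share a mean rank.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, ranks


def wilcoxon_exact_p(a, b):
    """Exact two-sided p value by enumerating every sign assignment."""
    w_plus, w_minus, ranks = signed_rank_statistic(a, b)
    observed = min(w_plus, w_minus)
    total = sum(ranks)
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= observed:
            hits += 1
    return hits / 2 ** len(ranks)


if __name__ == "__main__":
    # Invented accuracies (percent) of 10 classifiers under two settings.
    setting_a = [91, 88, 93, 85, 90, 87, 92, 89, 94, 86]
    setting_b = [90, 86, 90, 81, 85, 81, 85, 81, 85, 76]
    p = wilcoxon_exact_p(setting_a, setting_b)
    print(p, "significant" if p < 0.05 else "not significant")
```

A p value below 0.05 is read as a significant difference between the two settings, and a p value above 0.05 as no significant difference, which is the convention used in the Wilcoxon tables of this section.
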
6.2.2. Results of the Picture Messages Traffic Dataset

Figure 9 shows the accuracy results for the WeChat picture messages dataset; the detailed results are given in Table 18. The results differ from those of the text messages dataset. Packet number three gives only a very small increase in accuracy compared with the first two packets.

Figure 9: Accuracy results of WeChat pictures message dataset.

From these results, we conclude that packet number three does not contribute to identification accuracy. We also observe that accuracy increases continuously with the number of packets for all the applied classifiers except the Support Vector Machine (SMO) classifier, which gives erratic results at every packet number, and OneR, which gives very poor identification results for the first 12 packets and improves continuously thereafter. This suggests that the dataset is imbalanced. The AUC results for the WeChat picture messages dataset, shown in Figure 10 and Table 22, are broadly similar to the accuracy results: all the classifiers reach similarly high AUC values except SMO and OneR, whose results differ from the others.

Figure 10: AUC results of WeChat pictures message dataset.

Table 6 shows Friedman’s test results for the accuracy of the WeChat picture messages dataset. Packet number 18 performs best, with the lowest average ranking value of 4.3333. Comparing the p values with the adjusted p values, the p values for packet numbers six to seventeen are less than the adjusted p values, so these are the best behaving packet numbers for accuracy. The p values for packet numbers five and nine, in contrast, are greater than the adjusted p values, so for these there is no significant difference among the accuracy results.

Table 6: Friedman’s test results for WeChat pictures messages dataset.

Table 7 shows the Wilcoxon test results for accuracy. The p values for packet numbers 13–15 and 20 are greater than the standard significance level of 0.05. Thus the results for packet numbers 13–15 and 20 are not significantly different for the WeChat picture messages dataset.

Table 7: Wilcoxon pairwise test results for WeChat pictures messages dataset.
6.2.3. Results of the Audio Call Traffic Dataset

Figure 11 shows the accuracy results of all the selected machine learning classifiers for the WeChat audio call dataset; the detailed results are given in Table 17. Compared with the previous datasets, the results for the audio call dataset are more complex. The figure shows that the first three packets do not give effective identification performance, whereas packet number four does. The SMO classifier again gives erratic, unstable results. OneR gives stable results after 11 packets, and Bayes Net gives an effective result at packet number nine and consistent performance after 12 packets.

Figure 11: Accuracy results of WeChat audio call dataset.

The AUC results for the WeChat audio call dataset are shown in Figure 12 and Table 21. The AUC results are simpler than the accuracy results. The first three packets give low AUC, while the first four packets give effective AUC. However, the SMO and Bayes Net classifiers give low AUC for the first four packets, whereas the other classifiers give good AUC results and Random Forest gives very high AUC.

Figure 12: AUC results of WeChat audio call dataset.

Table 8 shows Friedman’s test results for the accuracy of the WeChat audio call dataset. Packet number sixteen has the lowest average ranking value, and the p values of all packet numbers are less than the adjusted p values except for packet numbers 5 and 11. Thus we can say that there is no significant difference among the results. The Wilcoxon test results are shown in Table 9: only packet numbers 2–5 and 11–15 have p values less than 0.05, which means that these packets are significantly different from the other packets.

Table 8: Friedman’s test results for WeChat audio call dataset.
Table 9: Wilcoxon pairwise test results for WeChat audio call dataset.
6.2.4. Results of the Video Call Traffic Dataset

Figure 13 shows the accuracy results for the WeChat video call dataset. These results differ from those of the previous datasets and are somewhat complex; nevertheless, all the classifiers achieve effective accuracy at every packet number. The accuracy curve of the C4.5 decision tree classifier is completely flat across all packet numbers. SMO and Naïve Bayes give their lowest results with the first two and three packets, but their accuracy increases continuously after four packets. Similarly, Bayes Net, Random Forest, Bagging, and OneR give their lowest results at packet number 17, whereas the remaining classifiers achieve high accuracy at that packet number.

Figure 14 and Table 24 show the AUC results for the WeChat video call dataset. The AUC pattern is simpler than the accuracy pattern. Most of the applied classifiers achieve effective AUC on the video call dataset; the exceptions are OneR and SMO, whose AUC values are the lowest.

Table 10 shows Friedman’s test results for the accuracy of the WeChat video call dataset. Packet number two has the highest average rank value, 5.20. All the p values are less than the adjusted p values except for packet numbers 4 and 5, which means that there is no significant difference among these results with respect to accuracy.

Table 10: Friedman’s test results for WeChat video call dataset.
Figure 13: Accuracy results of WeChat video call dataset.
Figure 14: AUC results of WeChat video call dataset.

Table 11 shows the Wilcoxon test results for the accuracy of the WeChat video call dataset. All the p values are less than the standard significance level of 0.05 except that of packet number 4. This means that the results for every packet number except 4 are significantly different from all the other results.

Table 11: Wilcoxon pairwise test results for WeChat video call dataset.
6.3. Analysis Results of HIT Trace II Dataset

Figure 15 shows the accuracy results for the HIT Trace II dataset; the detailed results are given in Table 25. All the applied classifiers achieve low accuracy on early stage Internet traffic, because it is very difficult to classify Internet traffic using only the first few packets. However, our interest here is not classification accuracy as such but finding the most effective packet numbers and the most effective ML classifier. Packet numbers 13 and 14 give the same identification results, but these are low compared with the accuracy at other packet numbers. In contrast, accuracy increases continuously for packet numbers 12 and 18, which makes their results very good compared with those of the other packet numbers. All the classifiers show stable accuracy, and Random Forest gives more effective results than the other classifiers. Thus the first six packets, as well as packets 15–19, carry enough identification information.

Figure 15: Accuracy results of HIT Trace II dataset.

The AUC results for the HIT Trace II dataset are shown in Figure 16, and the detailed AUC results are given in Table 26. The AUC pattern for HIT Trace II is simpler than for the other traces. For example, only packet numbers 5 and 17 have low AUC values, and all the remaining packet numbers give good AUC results. The Bagging classifier gives low AUC for packet number 14, whereas all the remaining classifiers give high AUC for that packet number. Similarly, for packet number 5 all the classifiers give good AUC except SMO and AdaBoost. Overall, all the classifiers give high AUC values for early stage packets.

Figure 16: AUC results of HIT Trace II dataset.

Table 12 shows Friedman’s test results for the accuracy of the HIT Trace II dataset. Packet number 18 has the lowest average ranking value, and the p values of all packet numbers are less than the adjusted p values except for packet numbers 8 and 9. Thus we can say that there is no significant difference among the results. The Wilcoxon test results are shown in Table 13: only packet numbers 16 and 18 have p values less than 0.05, which means that these packets are significantly different from the other packets.

Table 12: Friedman’s test results for HIT Trace II dataset.
Table 13: Wilcoxon pairwise test results for HIT Trace II dataset.
6.4. Analysis Results of Algorithms

Table 14 shows Friedman’s test results for the applied machine learning classifiers. Random Forest has the lowest average ranking value of all the classifiers, and all the p values are less than the adjusted p values except those of the Hoeffding, Bayes Net, SMO, and OneR classifiers. The Wilcoxon test results are shown in Table 15: only OneR, Part, C4.5, and Random Forest have p values less than 0.05, which means that these classifiers are significantly different from the other classifiers.

Table 14: Friedman’s test results for algorithm.
Table 15: Wilcoxon pairwise test results for algorithms.
Table 16: Mutual information analysis.
Table 17: Accuracy results of WeChat audio traffic dataset.
Table 18: Accuracy results of WeChat picture traffic dataset.
Table 19: Accuracy results of WeChat text traffic dataset.

7. Analysis and Discussion

Although the five IM and WeChat traffic datasets give different accuracy and AUC results, some lessons for early stage WeChat traffic classification can be drawn from them.
(i) The mutual information analysis and the classification experiments both show that the first three packets of early stage IM traffic do not carry enough identification information.
(ii) The experimental results show that early traffic packets carry enough identification information for WeChat early traffic classification. All the applied classifiers achieve highly effective identification performance on early stage traffic except Support Vector Machine and OneR, whose results are very poor compared with the others.
(iii) Accuracy makes it easy to evaluate classification performance for early stage Internet traffic. In some cases, however, a classifier achieves high identification performance and in other cases it does not, which is due to imbalanced datasets.
(iv) The performance of the OneR and SVM classifiers remains poor as the number of nonzero payload packets increases, and it differs markedly from that of the other classifiers.
(v) Random Forest gives very accurate results for all the applied datasets.

8. Conclusion

In this paper, we have tried to find the most effective packet numbers, as well as an effective machine learning classifier, for early stage classification of IM WeChat traffic. We applied mutual information analysis and ten well-known machine learning classifiers to five datasets: the WeChat text messages, picture messages, audio call, and video call traffic datasets, and HIT Trace II. From the experimental results, we conclude that nonzero payload packet sizes carry enough identification information for classifying WeChat instant messaging traffic. The results differ across the five datasets because of the different functionality of the 5G WeChat application. Nevertheless, the first three packets do not carry enough identification information and give very poor results, while packet numbers 13–19 are the most effective packet numbers for 5G WeChat early stage traffic identification. As for the classifier, we conclude that Random Forest is an effective ML classifier for IM early stage traffic classification.

There is still room for further research in early Internet traffic classification. A new method should be developed to select effective packet numbers for early stage traffic identification of the 5G WeChat application. Using more packets increases computational complexity, whereas using too few features decreases the classification accuracy of a machine learning classifier, so more models should be developed that show how many packets are needed for accurate IM application traffic classification.

Appendix

Detailed Results of the Experimental Work

See Tables 16–26.

Table 20: Accuracy results of WeChat video traffic dataset.
Table 21: AUC results of WeChat audio traffic dataset.
Table 22: AUC results of WeChat picture traffic dataset.
Table 23: AUC results of WeChat text traffic dataset.
Table 24: AUC results of WeChat video traffic dataset.
Table 25: Accuracy results of HIT Trace II traffic dataset.
Table 26: AUC results of HIT Trace II traffic dataset.

Competing Interests

The authors declare that they have no competing interests.

References

  1. A. Este, F. Gringoli, and L. Salgarelli, “Support vector machines for TCP traffic classification,” Computer Networks, vol. 53, no. 14, pp. 2476–2490, 2009. View at Publisher · View at Google Scholar · View at Scopus
  2. W. Li and A. W. Moore, “A machine learning approach for efficient traffic classification,” in Proceedings of the 15th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '07), pp. 310–317, IEEE, Istanbul, Turkey, October 2007. View at Publisher · View at Google Scholar · View at Scopus
  3. A. W. Moore and D. Zuev, “Internet traffic classification using bayesian analysis techniques,” in Proceedings of the ACM SIGMETRICS International Conference On Measurement and Modeling of Computer Systems (SIGMETRICS '05), pp. 50–60, Alberta, Canada, June 2005. View at Publisher · View at Google Scholar · View at Scopus
  4. A. Moore, D. Zuev, and M. Crogan, Discriminators for Use in Flow-Based Classification, Department of Computer Science, Queen Mary and Westfield College, 2005.
  5. F. Palmieri and U. Fiore, “A nonlinear, recurrence-based approach to traffic classification,” Computer Networks, vol. 53, no. 6, pp. 761–773, 2009. View at Publisher · View at Google Scholar · View at Zentralblatt MATH · View at Scopus
  6. U. Fiore, F. Palmieri, A. Castiglione, and A. De Santis, “Network anomaly detection with the restricted Boltzmann machine,” Neurocomputing, vol. 122, pp. 13–23, 2013. View at Publisher · View at Google Scholar · View at Scopus
  7. A. Dainotti, A. Pescapé, and K. C. Claffy, “Issues and future directions in traffic classification,” IEEE Network, vol. 26, no. 1, pp. 35–40, 2012. View at Publisher · View at Google Scholar · View at Scopus
  8. B. Qu, Z. Zhang, L. Guo, and D. Meng, “On accuracy of early traffic classification,” in Proceedings of the IEEE 7th International Conference on Networking, Architecture and Storage (NAS '12), pp. 348–354, Xiamen, China, June 2012. View at Publisher · View at Google Scholar · View at Scopus
  9. L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, “Traffic classification on the fly,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 2, pp. 23–26, 2006. View at Google Scholar
  10. N.-F. Huang, G.-Y. Jai, and H.-C. Chao, “Early identifying application traffic with application characteristics,” in Proceedings of the IEEE International Conference on Communications (ICC '08), pp. 5788–5792, Beijing, China, May 2008. View at Publisher · View at Google Scholar · View at Scopus
  11. N.-F. Huang, G.-Y. Jai, H.-C. Chao, Y.-J. Tzang, and H.-Y. Chang, “Application traffic classification at the early stage by characterizing application rounds,” Information Sciences, vol. 232, pp. 130–142, 2013. View at Publisher · View at Google Scholar · View at Scopus
  12. A. Este, F. Gringoli, and L. Salgarelli, “On the stability of the information carried by traffic flow features at the packet level,” SIGCOMM Computer Communication Review, vol. 39, no. 3, pp. 13–18, 2009. View at Google Scholar
  13. A. Rizzi, S. Colabrese, and A. Baiocchi, “Low complexity, high performance neuro-fuzzy system for Internet traffic flows early classification,” in Proceedings of the 9th International Wireless Communications and Mobile Computing Conference (IWCMC '13), pp. 77–82, Sardinia, Italy, July 2013. View at Publisher · View at Google Scholar · View at Scopus
  14. T. T. T. Nguyen, G. Armitage, P. Branch, and S. Zander, “Timely and continuous machine-learning-based classification for interactive IP traffic,” IEEE/ACM Transactions on Networking, vol. 20, no. 6, pp. 1880–1894, 2012. View at Publisher · View at Google Scholar · View at Scopus
  15. A. Dainotti, A. Pescapé, and C. Sansone, “Early classification of network traffic through multi-classification,” in Proceedings of the International Workshop on Traffic Monitoring and Analysis, Springer, Vienna, Austria, April 2011.
  16. L. Peng, B. Yang, and Y. Chen, “Effective packet number for early stage internet traffic identification,” Neurocomputing, vol. 156, pp. 252–267, 2015. View at Publisher · View at Google Scholar · View at Scopus
  17. L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, “Traffic classification on the fly,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 2, pp. 23–26, 2006. View at Publisher · View at Google Scholar
  18. Y. Lim, H. Kim, J. Jeong, C. Kim, T. Kwon, and Y. Choi, “Internet traffic classification demystified: on the sources of the discriminative power,” in Proceedings of the 6th International Conference on Emerging Networking EXperiments and Technologies (Co-NEXT '10), ACM, Philadelphia, Pa, USA, 2010.
  19. More than 300 million users engage, http://www.chinadaily.com.cn/cndy/2013-01/17/content_16128915.htm.
  20. By the numbers: 50+ Amazing WeChat Statistics, http://expandedramblings.com/index.php/wechat-statistics/.
  21. Q. Huang, P. P. C. Lee, C. He, J. Qian, and C. He, “Fine-grained dissection of WeChat in cellular networks,” in Proceedings of the 23rd IEEE International Symposium on Quality of Service (IWQoS '15), pp. 309–318, Portland, Ore, USA, June 2015. View at Publisher · View at Google Scholar · View at Scopus
  22. K. Church and R. De Oliveira, “What's up with whatsapp?: comparing mobile instant messaging behaviors with traditional SMS,” in Proceedings of the 15th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI '13), pp. 352–361, ACM, Munich, Germany, August 2013. View at Publisher · View at Google Scholar · View at Scopus
  23. K. P. O'Hara, M. Massimi, R. Harper, S. Rubens, and J. Morris, “Everyday dwelling with whatsApp,” in Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '14), 1142, p. 1131, ACM, Baltimore, Md, USA, February 2014. View at Publisher · View at Google Scholar
  24. P. Fiadino, M. Schiavone, and P. Casas, “Vivisecting whatsapp through large-scale measurements in mobile networks,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, pp. 133–134, 2014. View at Publisher · View at Google Scholar
  25. Y. Liu and L. Guo, “An empirical study of video messaging services on smartphones,” in Proceedings of the 24th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV '14), pp. 79–84, March 2014. View at Publisher · View at Google Scholar · View at Scopus
  26. M. Shafiq, X. Yu, and A. A. Laghari, “WeChat text messages service flow traffic classification using machine learning technique,” in Proceedings of the 6th International Conference on IT Convergence and Security (ICITCS '16), pp. 1–5, Prague, Czech Republic, September 2016. View at Publisher · View at Google Scholar
  27. Wireshark, http://www.wireshark.org/.
  28. A. Este, F. Gringoli, and L. Salgarelli, “On the stability of the information carried by traffic flow features at the packet level,” ACM SIGCOMM Computer Communication Review, vol. 39, no. 3, pp. 13–18, 2009. View at Google Scholar
  29. H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005. View at Publisher · View at Google Scholar · View at Scopus
  30. F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, “Multimodality image registration by maximization of mutual information,” IEEE Transactions on Medical Imaging, vol. 16, no. 2, pp. 187–198, 1997. View at Publisher · View at Google Scholar · View at Scopus
  31. L. R. Bahl, P. Brown, P. de Souza, and R. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '86), Tokyo, Japan, April 1986. View at Publisher · View at Google Scholar
  32. L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Proceedings of the IEEE-IECEJ-ASJ International Conference on Acoustics, Speech, and Signal Processing (ICASSP '86), vol. 86, pp. 49–52, Tokyo, Japan, April 1986. View at Scopus
  33. H. Peng, Mutual information Matlab toolbox, http://www.mathworks.com/matlabcentral/fileexchange/14888-mutual-information-computation.
  34. “Weka 3: Data Mining Software in Java,” http://www.cs.waikato.ac.nz/ml/weka/.
  35. L. Peng, B. Yang, Y. Chen, and Z. Chen, “Effectiveness of statistical features for early stage internet traffic identification,” International Journal of Parallel Programming, vol. 44, no. 1, pp. 181–197, 2016. View at Publisher · View at Google Scholar · View at Scopus
  36. N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997. View at Publisher · View at Google Scholar · View at Scopus
  37. P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Machine Learning, vol. 29, no. 2-3, pp. 103–130, 1997. View at Publisher · View at Google Scholar · View at Scopus
  38. M. E. Maron, “Automatic indexing: an experimental inquiry,” Journal of the ACM (JACM), vol. 8, no. 3, pp. 404–417, 1961. View at Publisher · View at Google Scholar · View at Scopus
  39. L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996. View at Google Scholar · View at Zentralblatt MATH · View at Scopus
  40. Y. Freund and R. E. Schapire, “A desicion-theoretic generalization of on-line learning and an application to boosting,” in Proceedings of the 2nd Eropean Conference on Computational Learning Theory (EuroCOLT '95), Springer, Barcelona, Spain, 1995.
  41. R. C. Holte, “Very simple classification rules perform well on most commonly used datasets,” Machine Learning, vol. 11, no. 1, pp. 63–90, 1993. View at Publisher · View at Google Scholar · View at Scopus
  42. E. Frank and I. H. Witten, “Generating accurate rule sets without global optimization,” in Proceedings of the 15th International Conference on Machine Learning, July 1998.
  43. R. C. Quinlan, 4.5: Programs for Machine Learning, Morgan Kaufmann Publishers Inc, San Francisco, Calif, USA, 1993.
  44. R. Kohavi, “Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid,” in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD '96), pp. 202–207, Portland, Ore, USA, August 1996.
  45. V. Svetnik, A. Liaw, C. Tong, J. Christopher Culberson, R. P. Sheridan, and B. P. Feuston, “Random forest: a classification and regression tool for compound classification and QSAR modeling,” Journal of Chemical Information and Computer Sciences, vol. 43, no. 6, pp. 1947–1958, 2003. View at Publisher · View at Google Scholar · View at Scopus
  46. C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995. View at Publisher · View at Google Scholar · View at Scopus
  47. J. Kim, J. Hwang, and K. Kim, “High-performance internet traffic classification using a Markov model and Kullback-Leibler divergence,” Mobile Information Systems, vol. 2016, Article ID 6180527, 13 pages, 2016. View at Publisher · View at Google Scholar · View at Scopus
  48. S. García, A. Fernández, J. Luengo, and F. Herrera, “Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power,” Information Sciences, vol. 180, no. 10, pp. 2044–2064, 2010. View at Publisher · View at Google Scholar · View at Scopus
  49. F. Gringoli, L. Salgarelli, M. Dusi, N. Cascarano, F. Risso, and K. C. Claffy, “Gt: picking up the truth from the ground for internet traffic,” ACM SIGCOMM Computer Communication Review, vol. 39, no. 5, pp. 12–18, 2009. View at Google Scholar
  50. D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, Chapman and Hall/CRC Press, 2003. View at Publisher · View at Google Scholar
  51. D. Quade, “Using weighted rankings in the analysis of complete blocks with additive block effects,” Journal of the American Statistical Association, vol. 74, no. 367, pp. 680–683, 1979. View at Publisher · View at Google Scholar · View at Zentralblatt MATH · View at Scopus
  52. J. D. Gibbons and S. Chakraborti, Nonparametric Statistical Inference, Springer, Berlin, Germany, 2011.
  53. J. Makhoul, F. Kubala, R. Schwartz, and R. Weischedel, “Performance measures for information extraction,” in Proceedings of the DARPA Broadcast News Workshop, pp. 249–252, Herndon, Va, USA, February 1999.
  54. D. L. Olson and D. Delen, Advanced Data Mining Techniques, Springer, 1st edition, 2008.
  55. A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997. View at Publisher · View at Google Scholar · View at Scopus