Abstract

Accurate network traffic classification at an early stage is very important for 5G network applications. In recent years, researchers have worked hard to propose effective machine learning models that classify Internet traffic applications at an early stage from only a few packets. Nevertheless, this problem still needs deeper study to determine both the effective number of packets and an effective machine learning (ML) model. In this paper, we address this problem using five Internet traffic datasets. First, we extract the packet sizes of the first 20 packets and carry out mutual information analysis to measure how much information each packet carries about the flow type. Thereafter, we run 10 well-known machine learning algorithms using cross-validation. Two statistical tests, the Friedman test and the Wilcoxon pairwise test, are applied to the experimental results. We also apply the statistical tests to the classifiers to find an effective ML classifier. Our experimental results show that 13–19 packets are the effective packet numbers for early stage traffic classification of the 5G IM WeChat application. We also find that Random Forest is an effective ML classifier for early stage Internet traffic classification.

1. Introduction

During the last few years, early stage Internet traffic classification has received a lot of attention in the area of network traffic classification. From the perspective of feature extraction, most researchers proposed machine learning models based on features extracted from the whole network flow [13]. In 2005, Moore et al. [4] presented a feature extraction method that many researchers have since followed. They used the whole traffic flow and extracted 248 statistical features, such as the packet sizes and their maximum, minimum, and average values. Machine learning classifiers can achieve very effective performance using these statistical features [5]. These feature extraction methods also showed high performance in anomaly detection [6]. However, they are not very effective in practice, because they require the whole flow. Thus it is very important to classify Internet traffic at an early stage, in view of security policies and quality of service (QoS) management. In 2012 [7], Internet traffic classification with few packets became a very hot topic in the area of network traffic classification. Regarding accuracy, Qu et al. in 2012 [8] showed that it is possible to classify Internet traffic at an early stage with effective accuracy.

However, no study has been carried out on early stage traffic identification of the 5G WeChat application. An important open problem is how many packets are most effective for early stage traffic classification of the 5G WeChat application. There are very few studies related to this problem, and this is the first study concerned with instant messaging (IM) WeChat traffic classification. In this study, we aim to find the most effective number of packets, as well as an effective machine learning classifier, for WeChat traffic classification at an early stage using empirical study and information analysis. Five traffic datasets and ten well-known and widely applied ML classifiers are used. We use packet size as the feature and then use mutual information analysis to measure the identification information carried by the first packets. Thereafter, a number of datasets are created for identification. Then, all the selected machine learning classifiers are run on the generated datasets using different numbers of packets. Finally, two different statistical analyses are carried out on the experimental results to find the most effective packet number and ML classifier.

The rest of the paper is organized as follows. Section 2 reviews related work. The datasets used in this study are discussed in Section 3, and Section 4 presents the study framework and the detailed steps of this work. Mutual information analysis, the selected machine learning classifiers, and the statistical tests are described in detail in Section 5. Results and analysis are presented in Section 6, followed by a discussion in Section 7. Finally, the conclusion is presented in Section 8.

2. Related Work

Recently, some studies have proposed classifying Internet traffic at an early stage with few packets [7], but doing so accurately is very hard. The main problem in early stage Internet traffic classification is the extraction of effective features. Bernaille et al. in 2006 [9] proposed an early stage Internet traffic classification technique that uses the sizes of the first few packets of a TCP flow as features and applies the k-means clustering technique to 10 types of application traffic; they obtained very effective classification results. Huang et al. in 2008 [10] studied the characteristics of early stage Internet application traffic and used these characteristics for early stage Internet traffic classification. Moreover, in 2013 the authors in [11] extracted features of early stage application traffic. With machine learning classifiers, they used packet size, interpacket time, and the average and standard deviation of packet size and interpacket time, and obtained very high performance for early stage Internet traffic classification. Este et al. in 2009 [12] studied the features of the first few packets and found that packet size, packet interarrival time, and packet direction of early traffic carry enough information. They also found that these are the most effective features for early stage Internet traffic classification. Hullar et al. presented an automatic machine learning (ML) method for early stage P2P traffic classification that consumes limited computational and memory resources. Rizzi et al. in 2013 [13] presented a very effective neuro-fuzzy system to identify early stage traffic. Nguyen et al. [14] further extended the early stage to “timely” classification of VoIP traffic. They derived statistical features from a subflow, where a subflow is a small number of packets.

In [11], the authors used 20 packets and extracted features at an early stage. In [9], the authors state that five packets are enough to classify early stage traffic accurately. Dainotti et al. [15] used the packet size (PS) and interpacket time (IPT) of the first 10 packets; they also used the average and standard deviation of the packet sizes and interpacket times for early stage traffic identification. Peng et al. in 2015 [16] used the payload size of the first 10 packets for early stage Internet traffic classification. They report that 5–7 packets are the most effective for early stage Internet traffic identification. They also note that selecting too many packets increases the computational complexity, while selecting too few decreases accuracy because the packets do not carry enough information.

Bernaille et al. in 2006 [17] studied the problem of the effective number of packets for early stage Internet traffic classification. They used k-means, GMM, and HMM models with the size and direction of the first 10 packets of a TCP connection for early stage traffic identification. They report that the first four packets are the most effective for early stage Internet traffic classification. They conducted many experiments on eight different traffic traces and obtained very high identification results using the first four packets with the three machine learning algorithms. Lim et al. in 2010 [18] used not only packet size but also connection-level and statistical features on a number of different datasets, applying Naïve Bayes, C4.5 decision tree, k-nearest neighbors, and Support Vector Machine in their experiments. They used the first 10 packets to identify UDP and TCP application flows, but their work was limited to an empirical study.

During the last few years, the number of Internet users has increased day by day owing to reliable and free instant messaging and calling applications. WeChat, developed by Tencent Holdings in China, is one such instant messaging and free calling application. It is a multifunctional application that can be used on both smartphones and desktop machines. After its launch, the number of online WeChat users reached 300 million [19], and by November 2015 its active users reached 650 million worldwide, of which 100 million were outside China [20]. The steadily increasing number of active users and the traffic of this application can affect network performance. It is also important to classify WeChat message, audio call, and video call traffic accurately in order to manage quality of service (QoS). Huang et al. [21] proposed the measurement tool ChatDissect to measure WeChat application traffic and distinguished 150 K users and 16 GB of WeChat traffic in real-world network traces. In 2013, Church and De Oliveira [22] compared mobile instant messaging with traditional short messages. In 2014, O’Hara et al. [23] studied the instant messaging application WhatsApp on smartphones, using interviews and a survey to study user activity. In 2014, Fiadino et al. [24] also studied WhatsApp flows, collecting data in a European network that consisted of millions of flows, including audio and video flows. In 2014, Liu and Guo [25] studied video messaging services in WeChat and WhatsApp and captured the traffic using mobile devices. However, no study has determined the number of packets that is most effective for early stage traffic classification of the WeChat application. In our previous work [26], we classified only WeChat text message flows, using datasets from two different environments and four well-known ML classifiers. In that work we used only 50 features to classify text message flows and obtained high accuracy.

3. Datasets

In this work, we use five datasets, drawn from HIT Trace I and HIT Trace II; the details are given below.

3.1. HIT Trace I Dataset

To build the HIT Trace I dataset, we captured the traffic of four WeChat functions (text messages, picture messages, audio calls, and video calls) and treat the four traces as separate datasets. In this work, we are interested in classifying WeChat real-time application traffic at an early stage and in finding the effective number of packets for doing so. We therefore captured WeChat text message, picture message, audio call, and video call traffic using the Wireshark tool [27] for a duration of 1 hour at the research lab of the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, on 23 May 2016. We keep only packets with nonzero payload. The capture includes WeChat TCP, UDP, and SSL traffic for text messages, picture messages, audio calls, and video calls. After capture, each trace is saved as a file with the .pcap extension. The characteristics of these datasets are given in Table 1; note that this trace comprises four subdatasets, and all the captured applications include TCP, UDP, and SSL traffic instances.

3.2. HIT Trace II Dataset

The second dataset was collected in a dormitory of Harbin Institute of Technology using Wireshark. We captured WeChat TCP and UDP traffic and the traffic of four other applications (DNS, FTP, Telnet, and WWW), each for a duration of one hour. We keep only packets with nonzero payload. DNS traffic was captured on 26 December 2015, FTP and Telnet traffic on 27 December 2015, and WWW traffic on 28 December 2015. After capture, we save each trace as a file with the .pcap extension. The detailed characteristics of this dataset are given in Table 2. Note that WTCP denotes WeChat TCP traffic only and WUDP denotes WeChat UDP traffic only.

4. Study Framework

Our study framework includes two models: the first addresses the problem of the effective packet number, and the second addresses the problem of the effective machine learning (ML) classifier.

Figure 1 depicts the framework for finding the effective packet number. The steps of the method are explained below.

(i) Trace Traffic. In this first step, we capture WeChat traffic using the Wireshark tool and save the captured traffic as a file with the .pcap extension. Note that we capture text message, picture message, audio call, and video call traffic for one hour each. Thereafter, we select the first 20 nonzero-payload packets of every application and save them for feature extraction.

Figure 2 depicts the framework for finding the effective ML classifier. The process is explained below, from the generation of the feature datasets up to the Wilcoxon test results.

(i) Features Extraction. As discussed in the related work section, packet size is the most essential feature for early stage Internet traffic classification [28]. In this work, we use the packet size as the feature of early stage WeChat traffic, taking the sizes of the first 20 packets. Note that we use only packets with nonzero payload [16]. The features are ordered by packet number: packet size 1, packet size 2, and so on up to packet size 20.

(ii) Mutual Information. After building the 20-feature datasets, we run mutual information analysis on the packet sizes of the first packets, labeled 1 to 20, to find the mutual information of each feature. This analysis tells us how much identification information each packet carries and therefore how effective each packet feature is.

(iii) Generate Feature Datasets. In this step, we build the feature datasets: the first uses the first packet with its packet size as an integer, the second uses the second packet with its packet size, and so on up to 20 feature datasets.

(iv) Selected Classifiers for Identification. We select 10 well-known machine learning classifiers that are widely used in network traffic classification. Using them, we run cross-validated identification on the generated datasets. Here we are interested in finding the effective packet number, not in the accuracy of any particular machine learning algorithm. Thus we focus only on the results per packet number.

(v) Friedman Test. We also use statistical tests to examine the effective packet numbers in more depth. The Friedman test is run to find significant differences among the results obtained with different numbers of packets. The statistical tests are described in detail in Section 5.

(vi) Wilcoxon Test. This is also a statistical test, used alongside the Friedman test. We first apply the Friedman test and then the Wilcoxon test to find the effective number of packets among the different numbers of packets.
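The feature extraction and dataset generation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in this study: the function names `first_payload_sizes` and `prefix_datasets` are hypothetical, the per-packet payload sizes are assumed to have been read from the capture already, and dataset k is assumed to hold the sizes of the first k packets of each flow.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def first_payload_sizes(payload_sizes: Iterable[int], n: int = 20) -> List[int]:
    """Return the sizes of the first n packets with nonzero payload."""
    sizes: List[int] = []
    for size in payload_sizes:
        if size > 0:  # skip zero-payload packets
            sizes.append(size)
            if len(sizes) == n:
                break
    return sizes


def prefix_datasets(
    flows: Sequence[Sequence[int]],
    labels: Sequence[str],
    max_packets: int = 20,
) -> Dict[int, List[Tuple[Tuple[int, ...], str]]]:
    """Build one dataset per packet count k = 1..max_packets.

    Assumption: dataset k contains the first k packet sizes of every flow
    that has at least k usable packets, together with the flow label.
    """
    datasets: Dict[int, List[Tuple[Tuple[int, ...], str]]] = {}
    for k in range(1, max_packets + 1):
        rows = []
        for sizes, label in zip(flows, labels):
            if len(sizes) >= k:
                rows.append((tuple(sizes[:k]), label))
        datasets[k] = rows
    return datasets
```

Each generated dataset can then be passed to the classifiers, one run per packet count.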

5. Methodology

In this section, we explain the methods applied in this study.

5.1. Mutual Information

In information theory, mutual information is widely used for feature selection [29], image processing [30], speech recognition [31], and so on. Mutual information measures the mutual dependence between two random variables, that is, the amount of information one random variable holds about the other. The mutual information between two variables $X$ and $Y$ can be defined as

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y). \tag{1}$$

In (1), $H(X)$ and $H(Y)$ are the marginal entropies of $X$ and $Y$, $H(X \mid Y)$ and $H(Y \mid X)$ are the conditional entropies, and $H(X,Y)$ is the joint entropy of $X$ and $Y$. The connection among $H(X)$, $H(Y)$, $H(X \mid Y)$, $H(Y \mid X)$, $H(X,Y)$, and $I(X;Y)$ is shown in Figure 3. According to Shannon’s definition of entropy, we have

$$H(X) = -\sum_{x} p(x) \log p(x), \qquad H(X,Y) = -\sum_{x}\sum_{y} p(x,y) \log p(x,y), \qquad H(X \mid Y) = -\sum_{x}\sum_{y} p(x,y) \log p(x \mid y),$$

where $p(\cdot)$ is the probability distribution function of a random variable. As in [32], substituting these definitions into the three equations in (1) gives the computational formula of mutual information:

$$I(X;Y) = \sum_{x}\sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}.$$

We used the same method for mutual information as in [32].

However, if the variables are continuous random variables, the summation is replaced by a definite double integral:

$$I(X;Y) = \int_{Y}\int_{X} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy.$$

Many free software packages for computing mutual information are available on the Internet; for our study we chose Peng’s mutual information MATLAB toolbox [33].
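For discrete samples, the mutual information between a packet-size feature and the flow type can be estimated directly from observed frequencies. The following Python sketch is only an illustration under that assumption. It is not Peng’s MATLAB toolbox; the function name `mutual_information` and the use of base-2 logarithms (so the result is in bits) are choices made here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired discrete samples xs and ys."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    count_x = Counter(xs)          # marginal counts of x
    count_y = Counter(ys)          # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi
```

For example, a feature that determines a two-class label exactly, with both classes equally frequent, yields 1 bit, while a feature that is independent of the label yields 0.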

5.2. Machine Learning Classifiers

We conducted our identification experiments using ten well-known and widely used machine learning classifiers. All the selected classifiers are run using the Weka data mining software [34]. Weka is a data mining application used in many areas of computer science and by many researchers for network traffic classification [35]. First, we formatted the datasets as comma-separated values (CSV), a file format supported by Weka. Then we applied all the selected machine learning classifiers using two-fold cross-validation. Introductory information about the applied classifiers is given below, and the classifiers selected for this study are shown in Table 3.

(i) Bayes: Bayes classifiers are based on Bayes’ theorem; they are widely used in computing and engineering and give very effective results. In this study, we use the Bayesian network (Bayes Net) classifier [36] and the Naïve Bayes classifier [37, 38].

(ii) Meta: from the Meta category in Weka, we select the Bagging [39] and AdaBoost [40] classifiers to classify WeChat traffic accurately. A meta classifier first trains base learners and then combines them to produce a strong learner.

(iii) Rule: algorithms in this category create rules using a specific policy and then apply them to classify the test data. In this category, we select the OneR [41] and PART [42] rule-based classifiers.

(iv) Trees: decision tree algorithms, also called statistical classifiers, are used by many researchers. In our study, we select J48, also called the C4.5 classifier [43], Naïve Bayesian trees [44], and Random Forest [45].

(v) SMO: in the function category, we select SMO [46, 47]. SMO is a supervised machine learning technique for training a Support Vector Machine (SVM), which is widely used in many areas. SVM is useful for both classification and regression.

For more details on the applied machine learning classifiers, see the original publications cited for each classifier.
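The two-fold cross-validation procedure used with these classifiers can be illustrated with a small, self-contained Python sketch. It is not the Weka implementation: the helper names are hypothetical, the toy rule learner is only a stand-in in the spirit of OneR on a single feature, and the folds are split at random without the class stratification that tools such as Weka typically apply.

```python
import random
from collections import Counter


def two_fold_indices(n_samples, seed=0):
    """Shuffle sample indices and split them into two folds of (almost) equal size."""
    if n_samples < 2:
        raise ValueError("two-fold cross-validation needs at least two samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]


def fit_one_rule(values, labels):
    """Toy OneR-style rule on one feature: map each value to its majority class."""
    counts = {}
    for value, label in zip(values, labels):
        counts.setdefault(value, Counter())[label] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    return rule, default


def two_fold_accuracy(values, labels, seed=0):
    """Train on one fold, test on the other, swap roles, and pool the results."""
    fold_a, fold_b = two_fold_indices(len(values), seed)
    correct = 0
    for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
        rule, default = fit_one_rule([values[i] for i in train],
                                     [labels[i] for i in train])
        correct += sum(rule.get(values[i], default) == labels[i] for i in test)
    return correct / len(values)
```

Every sample is used once for testing and once for training, so the pooled accuracy covers the whole dataset.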

5.3. Statistical Test

To examine the effectiveness of the packets in more depth, to compare the results of the applied ML classifiers, and to find significant differences among the results of the applied methods, statistical tests are conducted. In this study, we ran two different statistical tests, the Friedman and Wilcoxon tests, on the results of the methods [48, 49]. Both tests are introduced below.

(i) Friedman Test. The Friedman test is a nonparametric statistical test used to find significant differences between the results of the applied methods. The first step is to compute the test statistic: the original results of the methods are converted to ranks, with the best performing method ranked 1, the second best ranked 2, and so on. After ranking, the average ranking (AR) is calculated. If $r_i^j$ is the rank of the $j$th of $n$ algorithms on the $i$th of $N$ datasets, the average ranking of algorithm $j$ is

$$R_j = \frac{1}{N}\sum_{i=1}^{N} r_i^j.$$

Under the null hypothesis, all the algorithms behave similarly, so their ranks should be equal, and the $\chi^2$ distribution statistic is calculated as

$$\chi_F^2 = \frac{12N}{n(n+1)}\left[\sum_{j=1}^{n} R_j^2 - \frac{n(n+1)^2}{4}\right],$$

where $N$ is the number of executions and $n$ is the number of methods. If the value of the statistic is large enough, there is a significant difference among the results of the applied methods and the null hypothesis is rejected. The probability value ($p$ value) gives the significance level for each method and is usually used to analyze the test results. For multiple hypothesis testing we also apply a post hoc method to determine which hypotheses are rejected at a specified significance level; in many cases the lowest significance level at which a hypothesis is rejected is also of interest. This lowest significance level is called the adjusted $p$ value (APV), and the post hoc method can be used to find it for each hypothesis. In this study, we use Holm’s test [48] as the post hoc method, which is very effective at producing significant test results. In this work, we report the Friedman test using the $1 \times n$ comparison, because the $n \times n$ comparison is too long to show in this paper.

(ii) Wilcoxon Test. We also use the Wilcoxon signed-rank test in this study. The Wilcoxon test is a nonparametric test used for pairwise comparison between two methods [50]. Let $d_i$ be the difference between the performance scores of the two methods on the $i$th out of $n$ problems; if the scores are known to lie in different ranges, they can be normalized to the interval $[0, 1]$ [51]. The differences are then ranked by their absolute values, and in case of ties practitioners apply one of the methods in [52]. Positive differences indicate that the first method performed better, and negative differences that the second did. $R^{+}$ is the sum of the ranks of the positive differences and $R^{-}$ is the sum of the ranks of the negative differences. If the difference between $R^{+}$ and $R^{-}$ is large, the null hypothesis is rejected. Like the Friedman test, this test is used to determine whether the hypothesis is rejected at a specified significance level $\alpha$.
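The ranking steps of both tests, and Holm’s adjustment, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the statistical software used in this study: the function names are hypothetical, higher scores are assumed to be better, tied values share their average rank, zero differences are dropped in the Wilcoxon step, and the conversion of the statistics into p values (which needs the reference distributions) is left out.

```python
def ranks_with_ties(values, largest_first=False):
    """Return 1-based ranks of values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=largest_first)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def friedman(scores):
    """scores[i][j]: score of method j in execution i (higher is better).

    Returns the average rank of each method and the Friedman chi-square statistic.
    """
    n_exec, n_methods = len(scores), len(scores[0])
    rows = [ranks_with_ties(row, largest_first=True) for row in scores]
    avg = [sum(row[j] for row in rows) / n_exec for j in range(n_methods)]
    chi2 = (12 * n_exec / (n_methods * (n_methods + 1))) * (
        sum(r * r for r in avg) - n_methods * (n_methods + 1) ** 2 / 4
    )
    return avg, chi2


def wilcoxon_rank_sums(scores_a, scores_b):
    """Return (R+, R-) for paired scores; zero differences are dropped (assumption)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    ranks = ranks_with_ties([abs(d) for d in diffs])
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus


def holm_adjusted(p_values):
    """Holm step-down adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, i in enumerate(order):
        candidate = min(1.0, (m - step) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted
```

With three methods that are ranked identically in three executions, `friedman` returns average ranks of 1, 2, and 3 and a statistic of 6. A hypothesis is rejected at level $\alpha$ when its Holm-adjusted p value does not exceed $\alpha$.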

5.4. Evaluation Criteria for Performance Measurement

The confusion matrix is the basis of traffic classification performance measurement. Figure 4 shows the confusion matrix for traffic classification performance evaluation, in which rows refer to the actual class of the instances and columns refer to the predicted class.

The counts used in Internet traffic classification, derived from the confusion matrix, are described below:

(i) True Positive (TP): an instance of Class A is correctly identified as belonging to Class A.

(ii) True Negative (TN): an instance not belonging to Class A is correctly identified as not belonging to Class A.

(iii) False Positive (FP): an instance not belonging to Class A is incorrectly identified as belonging to Class A.

(iv) False Negative (FN): an instance of Class A is incorrectly identified as not belonging to Class A.
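The four counts above can be obtained for one class by comparing actual and predicted labels, treating that class as positive and every other class as negative. The Python sketch below is illustrative only; the function name `confusion_counts` and the one-versus-rest treatment are assumptions made here.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, TN, FP, FN) for the class `positive`, one-versus-rest."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # Class A identified as Class A
        elif a != positive and p != positive:
            tn += 1   # not Class A identified as not Class A
        elif a != positive and p == positive:
            fp += 1   # not Class A identified as Class A
        else:
            fn += 1   # Class A identified as not Class A
    return tp, tn, fp, fn
```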

From these counts, different metrics can be derived to evaluate classification performance [53, 54]; note that an effective classifier minimizes the FP and FN values. In this work we use the accuracy and AUC metrics, defined as follows.

(i) Accuracy. Classification accuracy is the proportion of correctly classified samples among all classified samples. Mathematically, accuracy is the sum of TP and TN divided by the sum of TP, TN, FP, and FN:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$

Accuracy measures the performance of a classifier and shows the overall effectiveness of the classification model.

(ii) Sensitivity. Sensitivity and recall are the same metric in traffic classification, namely TP divided by the sum of TP and FN, so (4) can be used for sensitivity:

$$\text{Sensitivity} = \frac{TP}{TP + FN}. \tag{4}$$

(iii) Specificity. Specificity is the ability of a machine learning classifier to correctly classify negative instances. Equation (5) gives its formula, TN divided by the sum of FP and TN:

$$\text{Specificity} = \frac{TN}{FP + TN}. \tag{5}$$

(iv) Area under Curve. The area under the receiver operating characteristic (ROC) curve [55], or AUC, summarizes the performance of a machine learning classifier. The ROC curve shows the trade-off between FPR and TPR, where FPR equals 1 − specificity and TPR is the sensitivity. The AUC value can be computed from the confusion matrix through TPR and FPR:

$$\text{AUC} = \frac{1 + TPR - FPR}{2},$$

since specificity = 1 − FPR and sensitivity = TPR.

Replacing 1 − FPR by specificity and TPR by sensitivity, we get

$$\text{AUC} = \frac{\text{Sensitivity} + \text{Specificity}}{2}. \tag{14}$$

Equation (14) shows that AUC is the average of sensitivity and specificity.
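The metrics above follow directly from the four confusion-matrix counts. The short Python sketch below shows the arithmetic; the function name `classification_metrics` and the example counts are hypothetical, and the code assumes that both the positive and the negative class occur at least once so that no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, and AUC from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # TPR, also called recall
    specificity = tn / (tn + fp)   # equals 1 - FPR
    auc = (sensitivity + specificity) / 2   # average of sensitivity and specificity
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "auc": auc,
    }
```

For instance, with the made-up counts TP = 30, TN = 45, FP = 15, and FN = 10, all four metrics equal 0.75.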

6. Experimental Results and Analysis

In this section, we present the detailed experimental results and analysis. First, we present the mutual information analysis results of the HIT Trace I dataset, including its four subdatasets, and then those of the HIT Trace II dataset. Next, we analyze the results of the applied methods to validate the effectiveness of the packets, and lastly we give the statistical test results for the effective ML classifier.

6.1. Mutual Information Analysis Results
6.1.1. Mutual Information Results of the HIT Trace I Dataset

Figure 5 shows the results of the mutual information analysis. In Figure 5, the mutual information of the first two packets of text messages and picture messages is higher than that of the audio call and video call packets, whose values are no more than 0.1. This means that for audio and video calls the first two packets contribute little information, while for text and picture messages the first packets contribute highly compared with packets 2 to 4. In text message traffic, packets 8-9 give high identification information; in picture message traffic, packets 7-8; and in audio call traffic, packets 6-7. Video call traffic is very different from the other traffic types: there, packets 19-20 give very high identification information. More details of the mutual information results are shown in Table 16.

6.1.2. Mutual Information Results of the HIT Trace II Dataset

Figure 6 shows the detailed results of the mutual information analysis. In Figure 6, the mutual information of the first two packets of the FTP, DNS, and WWW applications is higher than that of the WTCP, WUDP, and Telnet applications. Packets 2-3 also do not contribute very effectively. This means that packets 1–3 do not give much identification information. However, packets 6 and 17 give very effective identification information, and the remaining packets contribute less. Moreover, from the application perspective, FTP and DNS give very effective identification information compared with the other applications.

6.2. Analysis Results of HIT Trace I Dataset
6.2.1. Results of the Text Messages Traffic Dataset

Figure 7 shows the accuracy results for the WeChat text messages dataset, and the detailed results are shown in Table 19. All the applied machine learning classifiers obtain very low accuracy using the first two and three packets of early traffic, because it is very difficult to identify Internet traffic from only a few packets. Naïve Bayes, Hoeffding, and Random Forest obtain very low accuracy using the first two packets. Hence, from the text messages dataset we cannot conclude that the first three packets are effective for early stage Internet traffic classification. However, using the first four packets of the WeChat text messages dataset, all the applied machine learning classifiers obtain very effective accuracy except the Random Forest and PART classifiers, which obtain low accuracy using the first four packets. Support Vector Machine (SVM) also obtains low accuracy compared with the other machine learning classifiers. All the classifiers show continuously increasing accuracy as the number of packets grows, except Random Forest, whose accuracy is poor and unstable. Thus we can infer that the first four and five packets carry enough identification information for early stage classification of the WeChat text messages dataset. Note that we use two-fold cross-validation in this study.

Figure 8 shows the AUC results for the WeChat text messages dataset, and the details are shown in Table 23. The figure makes clear that the first two or three packets do not contribute to AUC identification with the selected machine learning classifiers, and that the AUC increases continuously from the first four packets to twenty packets. All the applied machine learning classifiers give very effective AUC results, except some, such as the SMO and OneR classifiers, which give ineffective results for the WeChat text messages dataset. As discussed in the accuracy analysis, Naïve Bayes, Hoeffding, and Random Forest give low accuracy, whereas their AUC results are effective by comparison. This indicates that the WeChat text messages dataset is imbalanced.

Table 4 shows the Friedman test results for accuracy. In the Friedman test results, packet number nineteen performs best in accuracy, with the lowest ranking value of 1.7555. Comparing the p values and adjusted p values, for packet numbers 10–12 the adjusted p values are less than the p values, and the same holds for packet numbers 15–17. These are the best performing packet numbers.

To better understand the results of the Friedman test, we also run the Wilcoxon signed-rank test. Table 5 shows the Wilcoxon pairwise test results for the WeChat text messages dataset. In the table, the p value of 20 packets is greater than 0.05 for the accuracy results. Thus we conclude that there is no significant difference between the results of 19 packets and the other packets for the WeChat text messages dataset.

6.2.2. Results of the Picture Messages Traffic Dataset

Figure 9 shows the accuracy results for the WeChat picture messages dataset, and the detailed results are shown in Table 18. The results differ from those of the text messages dataset. Packet number three gives a very small increase in accuracy compared with the first two packets.

From the results, it is concluded that the packet number three does not give identification results for the accuracy. It is also observed in the result that all the applied classifiers got continuously increment results using all the number of packets except Support Vector Machine (SMO) classifiers, continuously giving random results using all numbers of packets, while OneR classifiers give very poor identification result in the first 12 packets and then their results are continuously increasing. It means that there exist imbalance data in the dataset. The AUC results for the WeChat pictures messages dataset are a little bit similar to accuracy results. In Figure 10 and Table 22 the AUC results are shown, in which all the machine learning classifiers get the same AUC results but only SMO and OneR ML classifiers results are different from the other ML classifiers which hit high AUC results values.

Table 6 shows Friedman’s test results for the accuracy of the WeChat picture messages dataset. In the table, packet number 18 gives the best accuracy performance, with the lowest average ranking value of 4.3333. However, comparing the p values with the adjusted p values for accuracy, the p values for packet numbers six to seventeen are less than the adjusted p values. These are therefore the best-behaving packet numbers for accuracy, whereas the p values for packet numbers five and nine are greater than the adjusted p values. Thus there is no significant difference among the results with respect to accuracy.

Table 7 shows the Wilcoxon test results for accuracy. The p values for packet numbers 13–15 and 20 are greater than the significance level of 0.05. Thus the results for packet numbers 13–15 and 20 are not significantly different for the WeChat picture messages dataset.

6.2.3. Results of the Audio Call Traffic Dataset

Figure 11 shows the accuracy results of all the selected machine learning classifiers for the WeChat audio call dataset. Compared with the previous datasets, the results for the audio call dataset are considerably more complex. The figure shows that the first three packets do not yield effective identification performance, while packet number four does. Again the SMO classifier gives erratic, unstable results, while the OneR classifier gives stable results after 11 packets. The Bayes Net classifier gives effective results at packet number nine, and its performance remains steady after 12 packets. The detailed accuracy results are shown in Table 17.

The AUC results for the WeChat audio call dataset are shown in Figure 12 and Table 21. The AUC results are much simpler than the accuracy results. The first three packets yield low AUC values, while the first four packets yield effective AUC values. However, the SMO and Bayes Net classifiers give low AUC values for the first four packets, whereas the other classifiers give good AUC values and the Random Forest classifier gives very high AUC values.

Table 8 shows Friedman’s test results for the accuracy of the WeChat audio call dataset. Packet number sixteen obtains the lowest average ranking value for accuracy, and the p values of all packet numbers are less than the adjusted p values except for packet numbers 5 and 11. Thus we can say that there is no significant difference among the results. The Wilcoxon test results are shown in Table 9, in which only packet numbers 2–5 and 11–15 obtain p values less than 0.05, which means that the results for these packet numbers are significantly different from those of the other packet numbers.

6.2.4. Results of the Video Call Traffic Dataset

Figure 13 shows the accuracy results for the WeChat video call dataset. These results differ from those of the previous datasets. The pattern is somewhat complex; nevertheless, all the machine learning classifiers obtain effective accuracy results for all packet numbers. The accuracy curve of the C4.5 decision tree classifier, however, is completely flat across all packet numbers. With the SMO and Naïve Bayes classifiers, the first two and three packets give the lowest results, but after four packets their accuracy increases steadily. Similarly, the Bayes Net, Random Forest, Bagging, and OneR classifiers give their lowest results at packet number 17, whereas the remaining classifiers obtain high accuracy at packet number 17.

Figure 14 and Table 24 show the AUC results for the WeChat video call dataset. The AUC pattern is simpler than that of the accuracy results. Most of the applied machine learning classifiers obtain effective AUC results for the video call dataset; the exceptions are OneR and SMO, whose results are the lowest among the classifiers.

Table 10 shows Friedman’s statistical test results for the accuracy of the WeChat video call dataset. Packet number two obtains the highest average rank value, 5.20, compared with the other packet numbers. All the p values are less than the adjusted p values except for packet numbers 4–5. This means that there is no significant difference among these results with respect to accuracy.

Table 11 shows the Wilcoxon test results for the accuracy of the WeChat video call dataset. All the p values are less than the significance level of 0.05 except for packet number 4. This means that the results for all packet numbers except 4 are significantly different from the other results.

6.3. Analysis Results of HIT Trace II Dataset

Figure 15 shows the accuracy results for the HIT Trace II dataset. All the applied machine learning classifiers obtain low accuracy for early stage Internet traffic, because it is very difficult to classify Internet traffic using only the first few packets. However, our aim is not classification accuracy itself but finding the most effective packet numbers and the most effective ML classifier. Packet numbers 13–14 give the same identification results, but their accuracy is low compared with that of the other packet numbers. The accuracy results for packet numbers 12 and 18, in contrast, increase steadily, which means that they are very good compared with those of the other packet numbers. Moreover, all classifiers show stable accuracy results, but the Random Forest algorithm gives more effective results than the other machine learning classifiers. Thus the first six packets carry enough identification information, as do packets 15–19.

The AUC results for the HIT Trace II dataset are shown in Figure 16 and Table 25. The AUC pattern for HIT Trace II is very simple compared with that of the other traces. For example, only packet number 5 and packet number 17 obtain low AUC values, and all the remaining packet numbers obtain good AUC results. Moreover, the Bagging classifier gives a low AUC for packet number 14, while all the remaining classifiers give a high AUC for packet number 14. Similarly, for packet number 5 all classifiers give a good AUC except the SMO and AdaBoost classifiers. Overall, all the machine learning classifiers give high AUC values for early stage packets. The detailed AUC results are shown in Table 26.

Table 12 shows Friedman’s test results for the accuracy of the HIT Trace II dataset. Packet number 18 obtains the lowest average ranking value for accuracy, and the p values of all packet numbers are less than the adjusted p values except for packet numbers 8 and 9. Thus we can say that there is no significant difference among the results. The Wilcoxon test results are shown in Table 13, in which only packet numbers 16 and 18 obtain p values less than 0.05, which means that the results for these packet numbers are significantly different from those of the other packet numbers.

6.4. Analysis Results of Algorithms

Table 14 shows Friedman’s test results for the applied machine learning classifiers. The Random Forest classifier obtains the lowest average ranking value among the classifiers, and all the p values are less than the adjusted p values except for the Hoeffding, Bayes Net, SMO, and OneR classifiers. The Wilcoxon test results are shown in Table 15, in which only the OneR, PART, C4.5, and Random Forest classifiers obtain p values less than 0.05, which means that the results of these classifiers are significantly different from those of the other classifiers.

7. Analysis and Discussion

Although the five applied IM and WeChat traffic datasets give different accuracy and AUC results, several lessons can be drawn from them for early stage WeChat traffic classification.
(i) Both the mutual information analysis and the classification experiments show that the first three packets do not carry enough identification information for early stage IM traffic.
(ii) The early traffic packets carry enough identification information for WeChat early traffic classification. All the applied machine learning classifiers obtain highly effective identification performance on early stage traffic, except the Support Vector Machine and OneR classifiers, whose results are very poor compared with those of the other ML classifiers.
(iii) Accuracy results make it easy to evaluate classification performance for early stage Internet traffic classification. In some cases, however, a classifier obtains high identification performance and in other cases it does not, which is due to imbalanced datasets.
(iv) The performance of the OneR and SVM classifiers remains poor as the number of nonzero payload packets increases, and it differs markedly from that of the other machine learning classifiers.
(v) The experimental results show that Random Forest gives very accurate results for all the applied datasets.
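The idea that a given packet does or does not carry identification information can be made concrete with a small mutual information calculation between a packet's size and the flow type. The sketch below is illustrative only: the packet sizes and flow labels are hypothetical, the function name `mutual_information` is an assumption, and real packet sizes would normally be discretized into bins before such a calculation.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# Hypothetical flows: two text flows and two audio flows.
flow_types = ["text", "text", "audio", "audio"]
# Hypothetical size of packet 1: identical for every flow (e.g. a handshake).
packet1_sizes = [60, 60, 60, 60]
# Hypothetical size of packet 4: differs by flow type.
packet4_sizes = [120, 120, 900, 900]

print("packet 1:", mutual_information(packet1_sizes, flow_types), "bits")
print("packet 4:", mutual_information(packet4_sizes, flow_types), "bits")
```

A packet whose size is the same for every flow type yields zero mutual information, whereas a packet whose size fully separates the two flow types yields one bit, the entropy of the balanced two-class label.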

8. Conclusion

In this paper, we have tried to find the most effective packet numbers for IM WeChat early stage traffic classification, as well as the most effective machine learning classifier. Mutual information analysis and ten well-known machine learning classifiers are applied to five datasets: the WeChat text messages, picture messages, audio call, and video call traffic datasets and HIT Trace II. According to the experimental results, we conclude that nonzero payload size packets carry enough identification information for WeChat instant messaging traffic classification. The experimental results of the five datasets differ because of the different functionality of the 5G WeChat application. Across the datasets, the first three packets do not carry enough identification information and give very poor results, while packet numbers 13–19 are the most effective packet numbers for 5G WeChat early stage traffic classification. As for ML classifiers, we conclude that Random Forest is an effective classifier for IM early stage traffic classification.

There is still a gap for further research in early Internet traffic classification. A new method should be developed to select effective packet numbers for 5G WeChat application early stage traffic identification. Selecting more packets increases computational complexity, while using too few features decreases the classification accuracy of machine learning classifiers, so more models should be developed that show how many packets should be used for accurate IM application traffic classification.

Appendix

Detailed Results of the Experimental Work

See Tables 16–26.

Competing Interests

The authors declare that they have no competing interests.