Abstract

Because of advances in Internet technology, applications of the Internet of Things have become a crucial topic. The number of mobile devices in use worldwide increases daily; therefore, information security concerns are increasingly vital. Botnet viruses are a major threat to both personal computers and mobile devices; therefore, a method of botnet feature characterization is proposed in this study. The proposed method is a classification model in which an artificial fish swarm algorithm and a support vector machine are combined. A LAN environment containing several computers infected by a botnet virus was simulated to test this model, and the packet data of the network flow were collected. The proposed method was used to identify the critical features that determine the pattern of a botnet. The experimental results indicated that the method can be used to identify the essential botnet features and that the proposed method outperformed a genetic algorithm.

1. Introduction

Because of advancements and innovations in technology, applications of the Internet of Things (IoT) [1], such as cloud computing [2] and smartphone applications, are growing rapidly. The IoT is not a new type of technology; it is an extension of existing technologies. For example, tens of thousands of smartphones are connected by Wi-Fi, 3G networks, or radio-frequency identification; therefore, using smartphones is a type of IoT application, and the development of the IoT will be a major trend in the future.

However, because of the recent information explosion, information security has become a crucial topic, even in relation to the IoT. Botnets [3-6] are a recent major threat; when a computer has been infected by a botnet virus, it still functions normally, but the attacker can control the infected computer to threaten victims by launching distributed denial of service (DDoS) attacks [7], sending spam, engaging in phishing, or stealing personal or company data. Botnets are typically composed of three components: a bot herder, a bot client, and a command and control (C&C) server. The bot herder is the attacker, and the bot client is the victim that is infected by the botnet virus; the C&C server is the control server of a botnet and also the communication channel between a bot herder and a bot client. A bot herder typically uses the Internet Relay Chat (IRC) protocol to communicate with the C&C server and the bot clients. The IRC protocol provides real-time one-on-one or group chat room service through a connection to an IRC server, and every chat room is called a channel. A bot herder uses IRC channels to send specific command codes, which are determined in advance by the bot herder who distributed the virus, to a bot client. When a bot client recognizes a specific command code designed by the bot herder, the bot client performs the action that corresponds to the received command code.

Because botnet viruses are always changing, in both pattern and attack methods, detecting and protecting against these viruses have become extremely difficult. Most botnet-detecting studies have applied basic Internet virus detection methods such as Honeynet and anomaly-based, signature-based, or machine-learning techniques [8]. The anomaly-based and signature-based methods are the most commonly used. In the anomaly-based method, when the detection system observes that the traffic in the user network exhibits unusual actions, it determines that the user might be the victim of a botnet virus. The advantage of using the anomaly-based method is that unknown botnets can be detected; the disadvantage is that the rate of misjudgment might be high. In the signature-based method, an unusual packet database is typically built, and when the system detects that the Internet packets of a user conform to the database, the user might be infected by a botnet virus. The advantage of using this method is a high detection rate; however, the database must be frequently updated. Because both these methods possess disadvantages, they were not used in this research; instead, the machine-learning method was adopted for detecting botnet viruses. A method that can be used to detect unknown botnet viruses and has a high detection rate was developed by using feature selection, which was used to identify the critical features of botnet viruses.

Feature selection is used for identifying the critical features of a large amount of multidimensional data and subsequently using those features for analysis. For example, suppose that there are 10 computers in an office and a few of them are infected with an Internet virus. The monthly Internet packet data of this office must be collected, which yields an extremely large data set because it contains thousands of packet transfer records, and every record has multiple features, such as a host IP address, a MAC address, and the protocol type. Analyzing these data reveals the infected computers as those that exhibit anomalies in several features. Once the relationship between certain features and viruses is identified, those features must be monitored carefully in the future.

This example is an application of feature selection. In a large set of features, the feature subset that is most representative of or most related to a goal must be identified because, although every feature is different, some features are irrelevant and others are noisy or redundant. If all these unnecessary features are considered, the complexity of and space necessary for calculations increase, and the correlation between the feature subset and the goal decreases. Therefore, the purpose of feature selection is to filter out unnecessary features and to identify the feature subset that is most related to the goal. Moreover, as the number of features increases, the number of possible feature subsets grows exponentially. When the number of features becomes so large that the data can no longer be processed practically, the problem is called the curse of dimensionality. Searching all possible feature subsets requires an excessive amount of time and calculation space, which is not cost-effective; therefore, an efficient and effective optimization algorithm must be used for determining the most suitable feature subset within limited time and calculation space.
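The exponential growth mentioned above can be made concrete with a short sketch. The function names below are illustrative and are not part of the original study; the value 12 corresponds to the number of features in the data sets described in Section 2.4.

```python
from itertools import combinations

def count_nonempty_subsets(n_features):
    """Number of non-empty feature subsets of n_features features (2^n - 1)."""
    return 2 ** n_features - 1

def enumerate_subsets(features):
    """Yield every non-empty subset; feasible only for very small feature sets."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)

# With 12 features an exhaustive search is still possible (4095 subsets),
# but every additional feature doubles the search space.
for n in (12, 24, 48):
    print(n, count_nonempty_subsets(n))
```

Because each candidate subset requires training and validating a classifier, even a moderate number of features makes exhaustive search impractical, which motivates the metaheuristic search used in this study.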

Classification and clustering are widely used in various fields, such as recommendation systems [9], voice communication systems [10], and data mining. Applying feature selection can increase the efficiency of classification and clustering, and increasing classification accuracy and performance through feature selection is imperative. Classification refers to assigning data to appropriate categories. Multiple classification methods can be used, such as a decision tree [11], support vector machine (SVM) [12, 13], or neural network [14, 15]. All these methods are types of supervised learning. Recently, using an SVM has become increasingly common because an SVM can achieve high classification accuracy with small training sets [13]. The main purpose of the SVM is to establish an optimal hyperplane to classify data and build a classification model.

Metaheuristic algorithms are widely used in various optimization problems, such as feature selection [16, 17] and schedule management [18]. Many metaheuristic algorithms are inspired by natural mechanisms; for example, genetic algorithms (GAs) [19] were inspired by gene mutation and crossover, and particle swarm optimization [20, 21] was inspired by the movement of flocks of birds. Other metaheuristic algorithms include cat swarm optimization [22], ant colony optimization [23], and the artificial fish swarm algorithm (AFSA) [24], which simulates the foraging of a fish swarm.

In [25], the results indicated that the AFSA exhibited excellent performance in function optimization, and the potential of applying the AFSA to optimization problems was also revealed. Furthermore, in [26], the researchers proposed a type of feature selection and back-propagation network for botnet detection; however, using an AFSA combined with an SVM classifier might yield superior performance. In this study, a classification model combining an AFSA and an SVM is proposed. The proposed method was used to identify the critical features determining the pattern of a botnet. The findings indicated that the proposed method can be used to identify the essential botnet features and thereby classify botnet traffic accurately.

Section 2 introduces the SVM, GA, AFSA, and feature characterization of the botnet virus. Section 3 introduces the proposed botnet detection method, using the SVM and the AFSA. Section 4 presents the experiment results and Section 5 provides a conclusion and suggestions for future studies.

2. Background

2.1. Support Vector Machine

The SVM was proposed by Cortes and Vapnik [27]. It is a supervised learning model based on structural risk minimization [27] and the Vapnik–Chervonenkis dimension [28]. An SVM is typically applied in machine learning [29] and for solving classification or regression problems; therefore, the main purpose of an SVM is identifying the optimal hyperplane to analyze various classification data. The optimal hyperplane possesses the maximal margin associated with the various classification data, as shown in Figure 1. Two black points and three white points exist on the maximal margin line, which represent two types of classification data; these points are called support vectors.

These support vectors can be used for classifying new data. When the data are not linearly separable, a kernel function must be used to map the data into a higher-dimensional feature space in which they can be separated. Three types of kernel function are commonly used: radial basis functions (RBFs), polynomials, and sigmoids. Using an appropriate kernel function for transforming the data is imperative for increasing the classification speed. The three kernel functions are described as follows.

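RBF kernel:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0. \qquad (1)$$

Polynomial kernel:

$$K(x_i, x_j) = \left(\gamma\, x_i^{T} x_j + r\right)^{d}, \quad \gamma > 0. \qquad (2)$$

Sigmoid kernel:

$$K(x_i, x_j) = \tanh\left(\gamma\, x_i^{T} x_j + r\right). \qquad (3)$$

Here $x_i$ and $x_j$ are input vectors, and $\gamma$, $r$, and $d$ are kernel parameters.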
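As an illustration, the three kernels can be written in a few lines of Python. This is a minimal sketch that assumes the standard forms of the kernels, namely exp(-gamma * ||u - v||^2), (gamma * u.v + coef0)^degree, and tanh(gamma * u.v + coef0); the function names, parameter names, and default values are illustrative and are not taken from the original study.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(u, v, gamma=1.0, coef0=1.0, degree=3):
    """Polynomial kernel: (gamma * u.v + coef0) ** degree."""
    return (gamma * dot(u, v) + coef0) ** degree

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * u.v + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)

# Identical points have maximal RBF similarity; distant points approach 0.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))
```

The RBF kernel has a single parameter, gamma, which keeps the number of SVM parameters that must be tuned alongside the feature subset small.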

2.2. Genetic Algorithm

The GA was first proposed by J. Holland in 1975, and the main concept of GAs is the simulation of survival of the fittest through crossover and mutation. In this algorithm, chromosomes, which are composed of a series of genes, play an essential role. Every chromosome has its own fitness value, and chromosomes with high fitness values have a high chance of survival. In this study, the SVM classification accuracy was used as the fitness value. The GA process is outlined as follows.

(1) Initialization. Encode the optimization problem for use with the GA, define the fitness function, and randomly create N initial chromosomes, including their genes and parameters.

(2) Evaluate Fitness. Use the fitness function to evaluate the fitness of every chromosome.

(3) Reproduction. Determine the reproduction rate of every chromosome based on its fitness value; if the fitness value is high, the reproduction rate is high as well. Use the roulette wheel selection method to select the reproduction chromosomes.

(4) Crossover. Randomly pair chromosomes from the reproduction pool and create a new generation of chromosomes by applying one-point crossover to each pair with a probability equal to the crossover rate.

(5) Mutation. Randomly select genes and apply simple mutation with a probability equal to the mutation rate; this can increase the chance of identifying improved solutions.

(6) Stop the Algorithm If Terminal Criteria Are Satisfied. If the terminal criteria are satisfied, stop the algorithm and output the optimal solution. Otherwise, begin the next iteration and repeat until the terminal criteria are satisfied.
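The GA steps above can be sketched as follows for binary chromosomes. This is a minimal illustration under several assumptions: a fixed number of generations is used as the terminal criterion, the crossover rate of 0.8 is an arbitrary choice, and `toy_fitness` is a hypothetical stand-in for the SVM classification accuracy used in this study. Roulette wheel selection requires non-negative fitness values.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total == 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return chromosome
    return population[-1]

def one_point_crossover(a, b, rng):
    """Swap the tails of two parents at a random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, rate, rng):
    """Flip each gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def run_ga(fitness, n_genes, pop_size=20, crossover_rate=0.8,
           mutation_rate=0.05, generations=50, seed=0):
    rng = random.Random(seed)
    # (1) Initialization: random binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # (2) Evaluate fitness.
        fitnesses = [fitness(c) for c in population]
        next_generation = []
        while len(next_generation) < pop_size:
            # (3) Reproduction by roulette wheel selection.
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            # (4) Crossover with probability crossover_rate.
            if rng.random() < crossover_rate:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            # (5) Mutation.
            next_generation.append(mutate(c1, mutation_rate, rng))
            next_generation.append(mutate(c2, mutation_rate, rng))
        population = next_generation[:pop_size]
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    # (6) Terminal criterion reached: output the best solution found.
    return best

TARGET = [1, 0, 1, 0, 1, 1, 0, 0]  # hypothetical "ideal" subset for the demo

def toy_fitness(chromosome):
    """Stand-in fitness: number of genes that agree with TARGET."""
    return sum(a == b for a, b in zip(chromosome, TARGET))

print(run_ga(toy_fitness, n_genes=len(TARGET)))
```

In the setting of this paper, `toy_fitness` would be replaced by a function that trains the SVM on the selected features and returns the cross-validated accuracy.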

2.3. Artificial Fish Swarm Algorithm
2.3.1. Conception

The AFSA is an optimization algorithm that simulates the behavior of a fish swarm, such as foraging and movement. For example, the position of most fish in a pond is typically the position at which the most food can be obtained. The AFSA includes three main steps: Follow, Swarm, and Prey. In the AFSA, these three steps are repeated to determine the optimal solution. Similar to other bioinspired algorithms, the AFSA is used to determine the optimal or most satisfactory solution in a limited time by continually searching for possible solutions using a metaheuristic. In the AFSA, the position of every fish is considered a solution, and every solution has a fitness value that is evaluated using the fitness function. The fitness function is chosen according to the goal of the optimization.

2.3.2. Process

In the following, the current fish denotes the fish being processed, and the center denotes the center position of its neighboring fish, as defined in Table 3. The process of the AFSA is outlined as follows.

(1) Initialization. Encode the optimization problem for use with the AFSA, define the fitness function, and randomly create the initial fish, including their positions and parameters.

(2) Evaluate Fitness. Use the fitness function to evaluate the fitness of every fish.

(3) Movement of Fish Swarm. Process the Follow, Swarm, and Prey movements of every fish and determine the optimal solution.

Follow. At this step, the current fish is compared with its neighbors based on fitness value; if the best fitness among its neighbors is superior to its own and the crowded degree is not greater than the maximal crowded degree, then the current fish moves to the position of that neighbor, which indicates that the feature subset of the current fish is replaced by that of the neighbor and that the Follow step is completed. If the Follow step fails, the Swarm step is implemented; otherwise, the Follow step is implemented for the next fish.

Swarm. At this step, the fitness value of the current fish is compared with that of the center; if the fitness value of the center is superior and the crowded degree is not greater than the maximal crowded degree, then the current fish moves to the center; this indicates that the feature subset of the current fish is replaced by that of the center and that the Swarm step is completed. If the Swarm step fails, the Prey step is implemented; otherwise, the Follow step is implemented for the next fish.

Prey. At this step, the current fish randomly changes its own feature subset; for example, if a feature is 0 and it is randomly chosen to change, it becomes 1. The number of changed features is not greater than the vision. If the fitness of the changed feature subset is greater than that of the original, then the changed feature subset replaces the original feature subset, which indicates that the Prey step is completed. If the Prey step fails, the algorithm repeats this step until the number of repetitions reaches the maximal try number.

(4) Stop the Algorithm If Terminal Criteria Are Satisfied. If the terminal criteria are satisfied, then stop the algorithm and output the optimal solution; otherwise, begin the next iteration and repeat until the terminal criteria are satisfied.
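The Follow, Swarm, and Prey steps can be sketched for binary feature subsets as follows. Several details are assumptions made only for this illustration because they are not specified in the text above: neighbors are the fish within a Hamming distance of at most the vision, the center is the bitwise majority vote of the neighbors, the crowded degree is the fraction of the swarm within the vision of the target position, and a fixed iteration count serves as the terminal criterion. `toy_fitness` is a hypothetical stand-in for the SVM classification accuracy.

```python
import random

def hamming(a, b):
    """Number of positions at which two feature subsets differ."""
    return sum(x != y for x, y in zip(a, b))

def neighbors(swarm, i, vision):
    """Other fish whose subsets lie within the vision of fish i."""
    return [f for j, f in enumerate(swarm)
            if j != i and hamming(f, swarm[i]) <= vision]

def too_crowded(swarm, position, vision, max_crowd):
    """Assumed crowded degree: share of the swarm near `position`."""
    near = sum(hamming(f, position) <= vision for f in swarm)
    return near / len(swarm) > max_crowd

def follow(swarm, i, fitness, vision, max_crowd):
    nbrs = neighbors(swarm, i, vision)
    if not nbrs:
        return False
    best = max(nbrs, key=fitness)
    if fitness(best) > fitness(swarm[i]) and not too_crowded(swarm, best, vision, max_crowd):
        swarm[i] = best[:]  # adopt the feature subset of the best neighbor
        return True
    return False

def swarm_step(swarm, i, fitness, vision, max_crowd):
    nbrs = neighbors(swarm, i, vision)
    if not nbrs:
        return False
    # Assumed center: majority vote per feature over the neighbors.
    center = [1 if 2 * sum(bits) >= len(nbrs) else 0 for bits in zip(*nbrs)]
    if fitness(center) > fitness(swarm[i]) and not too_crowded(swarm, center, vision, max_crowd):
        swarm[i] = center
        return True
    return False

def prey(swarm, i, fitness, vision, max_try, rng):
    for _ in range(max_try):
        trial = swarm[i][:]
        n_flips = rng.randint(1, min(vision, len(trial)))
        for k in rng.sample(range(len(trial)), n_flips):
            trial[k] = 1 - trial[k]  # flip a randomly chosen feature
        if fitness(trial) > fitness(swarm[i]):
            swarm[i] = trial
            return True
    return False

def run_afsa(fitness, n_features, n_fish=30, vision=3, max_crowd=0.5,
             max_try=30, iterations=20, seed=0):
    rng = random.Random(seed)
    swarm = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_fish)]
    best = max(swarm, key=fitness)[:]
    for _ in range(iterations):
        for i in range(n_fish):
            if not follow(swarm, i, fitness, vision, max_crowd):
                if not swarm_step(swarm, i, fitness, vision, max_crowd):
                    prey(swarm, i, fitness, vision, max_try, rng)
            if fitness(swarm[i]) > fitness(best):
                best = swarm[i][:]
    return best

TARGET = [1, 0, 1, 0, 1, 1, 0, 0]  # hypothetical "ideal" subset for the demo

def toy_fitness(subset):
    return sum(a == b for a, b in zip(subset, TARGET))

print(run_afsa(toy_fitness, n_features=len(TARGET)))
```

Because every fitness call in the real method involves training an SVM, an implementation would likely cache the fitness of each subset instead of recomputing it as this sketch does.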

2.4. Feature Characterization

To build a botnet detection system, a botnet network data set must be collected. Following [26], a local area network (LAN) simulation was built to collect the packet data of network flow; the computers used in this LAN were infected by a botnet virus. The software VirtualBox was used to simulate 10 computers, and the operating systems of those virtual computers included Windows XP, Windows 7, and Linux; the computers were connected to the Internet through a Linux router. On these computers, normal user behaviors were simulated, such as playing online games, browsing websites, and watching videos. The packet data of this LAN were collected for 3 weeks, and the packets included those exchanged between the C&C server and the botnet virus.

Three data sets (Botnet1, Botnet2, and Botnet3) were obtained using various simulated LANs, and each one was infected by a distinct IRC botnet virus. The duration of each data set was 1 week, every data set had 12 features, and every data set contained 223 instances. The features of each data set, referenced from [26, 30], are shown in Table 1.

Details regarding the features of AvgLength, StddevLength, Time_Regularity, and Info_Char are described as follows.

AvgLength. This feature is the average length of the packets and is calculated by using (4), where $l_i$ is the length of packet $i$ and $n$ is the total number of packets:

$$\mathrm{AvgLength} = \frac{1}{n}\sum_{i=1}^{n} l_i. \qquad (4)$$

StddevLength. This feature is the standard deviation of the packet lengths and is calculated by using (5), where $l_i$ is the length of packet $i$, $\bar{l}$ is the average packet length, and $n$ is the total number of packets:

$$\mathrm{StddevLength} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(l_i - \bar{l}\right)^{2}}. \qquad (5)$$
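The two packet-length features can be computed directly from a list of packet lengths. The sketch below assumes the population form of the standard deviation (division by the number of packets); the function names and the sample lengths are illustrative and are not taken from the collected data sets.

```python
import math

def avg_length(lengths):
    """Average packet length over all packets of a flow."""
    return sum(lengths) / len(lengths)

def stddev_length(lengths):
    """Population standard deviation of the packet lengths."""
    mean = avg_length(lengths)
    variance = sum((length - mean) ** 2 for length in lengths) / len(lengths)
    return math.sqrt(variance)

# Hypothetical flows: regular status reports versus mixed user traffic.
status_reports = [60, 60, 60, 60]
user_traffic = [60, 1500, 400, 90]
print(avg_length(status_reports), stddev_length(status_reports))
print(avg_length(user_traffic), stddev_length(user_traffic))
```

A flow of short, uniform status report packets produces a small average and a standard deviation near zero, which is the pattern these two features are intended to capture.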

Time_Regularity. Because a bot client typically transmits a status report packet to a bot herder, knowing the transmission time regularity of each packet was necessary. This feature is the transmission time regularity of specific packets. A transmission time regularity counter is defined as , and if the total number of packets is , then the total number of is -1, and a set is an array, (i.e., ). For example, is the transmission time counter that counts the packet number, and the interval time is 2 seconds. Subsequently, the frequency array and the infrequency array were defined. The variable is a constant value between 0 and 1 which was set as 0.5 in this study. The feature Time_Regularity is calculated by using (6):

Info_Char. Because the specific command that a bot herder uses to control the computer of a bot client typically contains symbols, determining the weight of the symbols in the packets is necessary. This feature is the American Standard Code for Information Interchange (ASCII) counter, and 95 counters exist; each counter counts the number of times relevant ASCII characters appear in all packets. For example, a counter was defined as ; therefore, is the counter that counts the number of times the ASCII number 10 appears, even as a decimal, or with the symbol #. The feature Info_Char is calculated by using (7):

3. The Proposed Method

Both the GA and AFSA are metaheuristic algorithms; however, they employ distinct optimization mechanisms. The GA has demonstrated success in numerous applications, but a previous study [25] indicated that the AFSA yields superior optimization performance. In this study, the SVM was employed as the classifier, and the AFSA and the GA were used to perform feature selection. A classifier establishes a classification model and uses it to assign data to the correct categories. First, the data must be divided into multiple portions, and every record must have the correct category label. Some portions were regarded as training data and the rest were regarded as test data; the training data were input into the classifier, which was the SVM, to establish the classification model, and the test data were then used to verify this model and obtain the classification accuracy. The portions were used alternately to perform these steps, which constitutes the cross-validation process. For example, the first portion of the data was used as the test data and the remaining data were used as training data, whereas in the next round, the second portion was used as the test data and the remaining data were used as training data. The pseudocode of AFSA is shown in Pseudocode 1.

Randomly initialize the fish swarm.
WHILE (terminal condition is not reached)
  FOR (i = 1; i <= NumFish; i++)
    Measure fitness for Fish i.
    DO step Follow
    IF (Follow fails) THEN
      DO step Swarm
      IF (Swarm fails) THEN
        DO step Prey
      END IF
    END IF
  END FOR
END WHILE
Output optimal solution.
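The cross-validation process described before Pseudocode 1 can be sketched as follows. This is a minimal illustration that assumes contiguous folds, with the first portion used as test data in the first round; `majority_class` is a hypothetical stand-in classifier used only so that the sketch runs without an SVM library.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_val_accuracy(X, y, k, fit_predict):
    """Average accuracy over k folds for any fit_predict(train_X, train_y, test_X)."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        predictions = fit_predict([X[j] for j in train_idx],
                                  [y[j] for j in train_idx],
                                  [X[j] for j in test_idx])
        correct = sum(p == y[j] for p, j in zip(predictions, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

def majority_class(train_X, train_y, test_X):
    """Stand-in classifier: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_X)

# Hypothetical data: 10 records with one feature each.
X = [[i] for i in range(10)]
y = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(cross_val_accuracy(X, y, 5, majority_class))
```

In the proposed method, the value returned by such a routine, with the SVM in place of the stand-in classifier, would serve as the fitness of a candidate feature subset.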

In this study, the solution set comprises two parts: the SVM parameters (e.g., $C$ and $\gamma$) and the feature subset. In the second part, binary codes were used to represent feature selection; 0 indicated that the feature was not selected and 1 indicated that it was selected. Table 2 shows the solution set.

The feature subset (10101) indicates that the first, third, and fifth features were selected, whereas the second and fourth features were not selected. When data are input into the SVM without preprocessing, every feature is effectively selected, and the classification accuracy is likely to be unreliable. Thus, the AFSA must be used to conduct feature selection. Incorporating the AFSA with the SVM enables the algorithm to identify a superior feature subset such as (10101). Only data relevant to the selected features are input into the SVM to establish the classification model; this makes it possible to analyze whether the classification accuracy is improved. Thus, feature selection is attained, and performing the aforementioned steps excludes unnecessary data.
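The binary encoding can be illustrated with a short sketch that applies a feature subset such as (10101) to a data record before it is passed to the classifier. The function names and the sample record are hypothetical.

```python
def selected_features(mask):
    """Return the 1-based indices of the features selected by a binary mask."""
    return [i + 1 for i, bit in enumerate(mask) if bit == "1"]

def apply_mask(rows, mask):
    """Keep only the columns whose mask bit is 1."""
    keep = [i for i, bit in enumerate(mask) if bit == "1"]
    return [[row[i] for i in keep] for row in rows]

# A hypothetical record with five feature values.
records = [[10, 20, 30, 40, 50]]
print(selected_features("10101"))
print(apply_mask(records, "10101"))
```

Only the reduced records returned by `apply_mask` would be used to train and test the SVM for the corresponding fish.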

In the initial steps of the AFSA, the algorithm assigns a random feature subset to every fish, and the SVM is used to obtain the classification accuracy, which serves as the fitness of every fish. Subsequently, the Follow, Swarm, and Prey processes are implemented to obtain the optimal solution. The definitions of the parameters, referenced from [31], are presented in Table 3.

The steps involved in the AFSA-SVM method are presented as follows.

(1) Initiation. Randomly assign a feature subset to each fish. Define all parameters, including the vision, maximal crowded degree, and maximal trial number. For example, Figure 2 shows that eight fish were initiated; each fish has its own feature subset, and the circle represents the vision of the current fish.

(2) Evaluate the classification accuracy of the feature subset of each fish by using the SVM and use it as the fitness value, as shown in Figure 2.

(3) Starting with the first fish, implement the Follow step. If Follow is successful, perform step 6; otherwise, perform step 4. For example, in Figure 2, the fitness value of the current fish is 55, whereas its best neighbor exhibits a fitness value of 80. Thus, a superior fish is located within the vision of the current fish. Therefore, the Follow step is successful and the current fish moves to the location of the best neighbor, replacing its feature subset as shown in Figure 3.

(4) Implement the Swarm step for the same fish. If successful, perform step 6; otherwise, perform step 5. For example, in Figure 2, calculate the center subset by using the definition in Table 3 and then use the SVM to evaluate its fitness value, comparing the fitness values of the current fish and the center subset. If the fitness value of the center subset is higher, the Swarm step is successful and the current fish moves to the center subset, replacing its feature subset.

(5) Implement the Prey step for the same fish. After the Prey step, perform step 6. For example, in Figure 2, the feature subset of the current fish is 00001101. The features change randomly each time the Prey step is executed. The number of changed features must be less than the vision, and the number of times Prey is executed must be less than the maximal trial number. After changing the feature subset, evaluate the fitness value by using the SVM and compare it with that of the original feature subset of the current fish; if the changed feature subset exhibits superior fitness, the Prey step is successful and the original feature subset is replaced with the changed feature subset.

(6) Determine whether the current fish is the last in the fish swarm. If not, begin from step 3 and perform the steps for the next fish; if so, perform step 7.

(7) Determine the fitness of every fish; if a fitness value superior to that of the current optimal solution is observed, update the optimal solution. Then perform step 8.

(8) Determine whether the terminal criteria are satisfied. If so, stop the algorithm; otherwise, start from step 3 to begin the next iteration.

Figure 4 shows the AFSA flow chart.

4. Experimental Results

To evaluate the performance of feature selection using the AFSA combined with an SVM, the AFSA was compared with a GA in terms of the classification accuracy, the number of features in the optimal solution subset, and the time each algorithm spent performing calculations. For both the AFSA and GA, the terminal condition of each fold was that the optimal solution had not been updated for 1 hour. The algorithm parameters used in this study are presented as follows.

AFSA. The number of fish was 30, the maximal number of trials was 30, and the maximal crowded degree was 0.5.

GA. The genetic number was 20, and the mutation rate was 0.05.

The AFSA and GA were implemented on a desktop computer. The operating system was Microsoft Windows 7, the processor was a 2.66-GHz Intel Core 2 Quad Processor Q8400, the amount of memory was 2 GB, and the algorithms were coded using Dev C++. The classifier was implemented with the Library for Support Vector Machines (LIBSVM) [32] using the RBF kernel function.

4.1. Experiment 1

Simulated botnet data sets were collected as mentioned in Section 2.4, and Table 4 shows the experimental results for each data set classified using the AFSA and the GA with a fivefold cross-validation process. The results are the averages over the five folds. The average classification accuracy, the number of selected features in the optimal solution subset, and the total time of the AFSA and GA were compared. The AFSA was more accurate than the GA for all data sets, indicating that an increased botnet detection rate can be obtained. The AFSA also selected fewer features than the GA did; thus, the amount of data processed in botnet detection was reduced, thereby reducing the detection time. Finally, the AFSA spent less total time than the GA did, except for the data set Botnet3. Based on these results, compared with the GA, the AFSA can obtain higher classification rates, identify the optimal feature subset by using fewer selected features, and spend less time performing calculations.

To determine the critical features, the number of times each feature was selected in the optimal subsets output by the AFSA-SVM method was counted, and the results are presented in Table 5. A high count indicates that the feature is critical for classifying the input data when using the SVM. Thus, the features that exhibit high counts are the features critical to botnet detection.
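The counting procedure can be sketched as follows. The optimal subsets listed here are hypothetical examples and are not the results reported in Table 5.

```python
def selection_counts(optimal_subsets):
    """Count how often each feature is selected across the optimal subsets."""
    n_features = len(optimal_subsets[0])
    return [sum(int(subset[i]) for subset in optimal_subsets)
            for i in range(n_features)]

def rank_features(counts):
    """Return 1-based feature numbers ordered from most to least selected."""
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    return [i + 1 for i in order]

# Hypothetical optimal subsets, one per fold or data set.
subsets = ["10101", "10001", "00101"]
counts = selection_counts(subsets)
print(counts)
print(rank_features(counts))
```

Features that appear in most optimal subsets, regardless of the fold or data set, are the ones treated as critical in the discussion that follows.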

The results in Table 5 revealed that Features 9 and 11, AvgLength and Time_Regularity, were the features most often selected in the optimal feature subset, followed by Feature 12, Info_Char. Because of idle time, the bot herder was not always controlling the computer of the bot client; however, the computer of the bot client still sent status report packets to the bot herder regularly; therefore, AvgLength is a critical feature. Furthermore, the transmission time interval exhibited a regular pattern when the status report packets were sent, which is why Time_Regularity is such a critical feature. Moreover, because the specific commands sent by the bot herder typically contain specific symbols, identifying the symbols that the bot herder uses may help identify an infected computer.

4.2. Experiment 2

Tenfold cross-validation was subsequently used, and the terminal condition of each fold was changed so that the algorithm stopped when the optimal solution had not been updated for 1 hour or when the classification accuracy reached 100%. The results are shown in Table 6. These results make it possible to determine whether the optimal feature subset has fallen into a local optimum. The execution time was substantially reduced, and the classification accuracy was higher and fewer features were selected compared with fivefold cross-validation. With tenfold cross-validation, the training data grow, so each training set comprises additional samples; this growth may substantially increase the convergence rate.

The number of times each feature was selected in the optimal subsets when using tenfold cross-validation is shown in Table 7. The results indicate that Features 9, 10, and 11, representing AvgLength, StddevLength, and Time_Regularity, respectively, were most often selected in the optimal feature subset when using tenfold cross-validation; this was similar to the results of using fivefold cross-validation, except for Feature 10 (StddevLength). The classification rate increased as StddevLength was selected more often. Therefore, the StddevLength feature was critical to botnet detection. StddevLength represents the standard deviation of the packet length; the bot clients regularly sent status report packets to the bot herder, and these packets were typically short and consistent in length. Thus, StddevLength was a vital feature in botnet detection.

5. Conclusion and Future Work

In this study, a feature selection method for detecting botnet viruses, the AFSA-SVM method, is proposed. Based on the experimental results, using the AFSA yielded only slightly higher classification accuracies than using the GA, but it required less time and selected fewer features. In practical applications, classification accuracy is typically the first priority, but in certain processes, such as botnet virus detection, detection speed is as crucial as accuracy. To obtain the desired detection speed, the amount of data to be processed must be reduced while the accuracy level is maintained; therefore, in this scenario, the AFSA-SVM method is superior.

The results also show that both the GA and the AFSA can be applied to identify the critical features of a botnet and to filter out unnecessary features, and that these algorithms can be used easily in various applications. In our research, IRC botnet traffic was collected as the data set; however, in real-world situations, botnet viruses are constantly changing, and an increasing number of botnet viruses use peer-to-peer (P2P) or other protocols as the attack method. Therefore, in future studies, the proposed method must be tested on P2P botnets and other types of botnet viruses. Finally, we hope that a feature-selection-based system for detecting botnet viruses can be constructed in the future.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.