Abstract
The system call sequences of processes are important for host-based anomaly detection. However, detection accuracy can be seriously degraded by subsequences that appear in the call sequences of both normal and abnormal processes. Detection is further obstructed when the normal/abnormal distributions of subsequences are extremely imbalanced and contain many ambiguous samples. In this paper, the system call sequences are first divided into weighted fixed-length subsequences. Secondly, a suffix tree of each system call sequence is constructed to automatically extract a variable-length subsequence, namely the longest repeated substring of the tree. The frequencies of the fixed- and variable-length subsequences that appear in each system call sequence constitute its feature vector. Finally, the vectors are input into a cost-sensitive relaxed support vector machine, in which the penalty-free slack of the relaxed SVM is split independently between the two classes with different weights. Experimental results on two public datasets, ADFA-LD and UNM, show that the AUC of the proposed method can reach 99%, while the false alarm rate is only 2.4%.
1. Introduction
System calls provide interfaces between system functions and user applications, and call sequences can reflect the targets of process actions. Accordingly, the system call sequences stored in various auditing and logging systems are important intrusion detection objects [1, 2].
Usually, system call sequences are broken into subsequences and submitted to classification models [3]. Jewell and Beaver [4] proposed that the unique sequences of system calls are the basis for discriminating normal and abnormal behaviors of processes. Helman and Bhangoo [5] defined the priority of a system call sequence based on the occurrence probability of system calls to obtain typical features. Lee and Stolfo [6] performed machine learning tasks on operating system call sequences of normal and abnormal executions of the UNIX Sendmail program. Xie et al. [7–9] attempted to reduce the dimension of the frequency vectors of subsequences with PCA to improve computational efficiency. Haider et al. [10] proposed taking both the rarest repeating subsequence and the most frequent repeating subsequence in the system call sequence as statistical features. Kan et al. [11] proposed a novel IoT network intrusion detection approach based on an adaptive particle swarm optimization convolutional neural network (APSO-CNN). Bian et al. [12] extracted graph-based features from host authentication logs, which are then employed in the detection of APT targets in the network. Shin and Kim [13] preprocessed the collected data using n-grams to overcome the limitations of the sequence time-delay embedding (STIDE) algorithm for host intrusion detection systems (HIDS).
Recently, inspired by linguistic mining in natural language processing (NLP), researchers started to analyze the semantic relations between subsequences. Forrest and Hofmeyr [1] defined the sequences as phrases composed of words (system calls), and then utilized artificial immune systems to classify the phrases. Creech and Hu [14] proposed drawing semantic features of the system call sequences (phrases) using a context-free grammar and built an extreme learning machine for the classification. Liao and Vemuri [15] calculated the TF-IDF scores of system call sequences and input them into a K-NN model for anomaly detection. Zhang and Shen [16] built an improved TF-IDF model of subsequences, which takes both the time information and the correlation between the processes into consideration. Marteau [17] proposed a similarity measurement method to evaluate the similarity between symbolic sequences. Ambusaidi et al. [18] proposed a supervised feature selection algorithm, which is able to handle both linearly and nonlinearly related data features. Shams et al. [19] designed a new context-aware feature extraction method for convolutional neural network (CNN)-based multiclass intrusion detection. Subba [20] combined a TF-IDF vectorizer and singular value decomposition (SVD) to design a novel HIDS framework for detecting anomalous system processes.
However, it is difficult to select appropriate subsequences of system calls to discover the real purpose of the calling actions. Laszka et al. [21] proved that the optimal length of a subsequence is highly dependent on data and applications and needs to be carefully fine-tuned. For example, if the subsequence is too short, we may get an incomplete calling trace. On the other hand, if the subsequence is too long, the malicious calls are mixed with many normal system calls, and the extracted features may be disturbed. Finally, normal call sequences are plentiful while abnormal ones are extremely few, and many unusual API sequences, along with incomplete sequences, may have a strong influence on the classification model. How to deal with imbalanced and noisy data is therefore also a big challenge.
In view of the above problems, both the semantic information contained in short subsequences and the representative features in variable-length subsequences are considered to generate combined features. Usually, system call sequences contain repeated subsequences, which are regarded as program-specific behavior patterns, and the length of the repeated subsequences differs among call sequences. In order to generate variable-length repeating subsequences, a suffix tree is constructed for each system call sequence and the longest repeating substring is automatically extracted from the subtrees. Furthermore, to address the imbalanced and noisy subsequences of normal and abnormal calls, the widely used relaxed support vector machine (RSVM) is improved by assigning different weights and free slack amounts to the positive and negative classes, which are scaled by the sizes of the two classes to reduce the influence of data imbalance and outliers.
2. Feature Extraction
In order to automatically extract the call subsequences and related features that reflect the real target of the system calls, we present a dynamic feature extraction method for system call sequences. As shown in Figure 1, the features are extracted from both fixed-length and variable-length subsequences. In the first step, the training sequences are split into subsequences by n-gram [22], and each subsequence is weighted by TF-IDF [23]. Then, the subsequences with the highest weights are selected to compose a fixed-length subsequence set. In the second step, suffix trees are constructed for the training sequences, and the longest repeating substrings in the suffix trees are selected as variable-length subsequences. In the third step, the fixed-length and variable-length subsequences are combined into a corpus set. In step 4, the occurrence frequencies of the corpus subsequences in each system call sequence are counted to constitute feature vectors. In step 5, an autoencoder [24] is utilized to reduce the dimension of the vectors before they are submitted to the classifiers in step 6.

2.1. Fixed-Length Subsequence Based on Semantics
TF-IDF (term frequency-inverse document frequency) [23] is commonly utilized in text mining to evaluate the importance of every single word or phrase in a document. In this paper, TF-IDF is employed to evaluate the system calls, which have been coded into word sequences. As described in [1], each system call is first represented by a unique word (or number), so the call sequences become word sequences. The elements of the resulting set are indexed as defined in (1).
Definition 1. Inverse ratio of sequence frequency. The inverse ratio of a sequence frequency is the inverse document frequency (IDF) defined in TF-IDF. Equation (1) computes it from the number of training sequences and the number of times the fixed-length subsequence appears in the training set.
Definition 2. Vocabulary frequency of a single sequence. Equation (2) gives the frequency of a fixed-length subsequence in a system call sequence, and the frequencies of all the fixed-length subsequences in that system call sequence are collected into a vector.
Definition 3. Process behavior weight. As shown in (4), it combines the inverse ratio and the vocabulary frequency of a single subsequence to evaluate the importance of that subsequence (vocabulary item) to one system call sequence.
Definition 4. Fixed-length corpus of system call sequences. The three subsequences with the highest weights in a single process, calculated by (4), are included in the fixed-length sequence corpus. Equation (5) indexes the fixed-length subsequences selected from each system call sequence.
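As a minimal sketch of Definitions 1–4, assuming the standard TF-IDF form (relative frequency multiplied by the logarithm of the inverse sequence frequency), the fixed-length corpus could be built as below. The function names, the logarithmic IDF, and the tie-breaking rule are our assumptions and are not necessarily identical to equations (1)–(5). The window length 2 and the top three weights per sequence follow the settings stated in the text.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Split a system call sequence into overlapping fixed-length subsequences."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def fixed_length_corpus(sequences, n=2, top_k=3):
    """Union of the top_k n-grams of each sequence, ranked by TF-IDF weight."""
    grams_per_seq = [ngrams(s, n) for s in sequences]

    # Number of training sequences in which each n-gram appears (assumed IDF basis).
    seq_freq = Counter()
    for grams in grams_per_seq:
        seq_freq.update(set(grams))

    num_seqs = len(sequences)
    corpus = set()
    for grams in grams_per_seq:
        if not grams:
            continue
        counts = Counter(grams)
        weights = {
            g: (c / len(grams)) * math.log(num_seqs / seq_freq[g])
            for g, c in counts.items()
        }
        # Highest weight first; ties are broken by the n-gram itself (assumption).
        ranked = sorted(weights, key=lambda g: (-weights[g], g))
        corpus.update(ranked[:top_k])
    return corpus


traces = [
    [6, 4, 1, 4, 1, 4, 3],
    [6, 4, 2, 2, 2, 5],
    [6, 4, 7, 8, 7, 8],
]
corpus = fixed_length_corpus(traces)
```

In this toy example the subsequence (6, 4) occurs in every trace, so its weight is zero and it is not selected, whereas trace-specific subsequences such as (4, 1) are kept.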
2.2. Variable-Length Subsequence
Usually, the repeated subsequences in a system call sequence reflect the behavior patterns of processes, which are very useful for anomaly detection. In this paper, these sequence fragments are defined as the representative features with variable lengths.
2.2.1. Constructing Suffix Tree
Ukkonen's algorithm [25] is a commonly used suffix tree construction algorithm based on path compression and suffix links. In this paper, it is utilized to construct a suffix tree from each single system call sequence. Consider the sequence “6, 4, 1, 4, 1, 4, 3”, and let S denote the current string for which the suffix tree has been built and c the next character. First, the suffix tree for the initial string S = “6” with c = null is shown in Figure 2(a). Then S = “6” and c = “4”, so the suffixes of S + c are “4” and “6, 4”, as shown in Figure 2(b). Next, S = “6, 4” and c = “1”, so the suffixes of S + c are “1”, “4, 1”, and “6, 4, 1”, as shown in Figure 2(c). In the same way, the suffix tree for the full sequence “6, 4, 1, 4, 1, 4, 3” is shown in Figure 3.

Figure 2: panels (a), (b), and (c).

2.2.2. Search the Longest Repeating Substring
After the suffix tree is built, its longest repeating substring is selected to represent the behavior pattern of the system call sequence. As shown in Figure 4, for the sequence “6, 4, 1, 4, 1, 4, 3” the deepest nonleaf node is node “4” and the longest repeating substring is “1, 4”, which is incorporated into the variable-length sequence set.
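The search can be illustrated with a minimal sketch that deliberately replaces the suffix tree with a naive enumeration of substrings, which is far slower than Ukkonen's algorithm but easy to check by hand. The function name and the `allow_overlap` switch are our assumptions. For the example sequence, allowing overlapping occurrences yields (4, 1, 4), while the substring “1, 4” reported above is among the longest repeats only when the two occurrences may not overlap, so the non-overlapping mode is our guess at the intended behavior.

```python
from collections import defaultdict


def longest_repeats(seq, allow_overlap=True):
    """Return every longest substring (as a tuple) that occurs at least twice.

    Naive enumeration from the longest candidate length downwards; a suffix
    tree would find the same answer much faster on long sequences.
    """
    seq = tuple(seq)
    n = len(seq)
    for length in range(n - 1, 0, -1):
        positions = defaultdict(list)
        for i in range(n - length + 1):
            positions[seq[i:i + length]].append(i)

        found = []
        for sub, starts in positions.items():
            if len(starts) < 2:
                continue
            # Non-overlapping mode needs two start positions at least
            # `length` apart; the first and last starts are the farthest pair.
            if allow_overlap or starts[-1] - starts[0] >= length:
                found.append(sub)
        if found:
            return found
    return []


trace = [6, 4, 1, 4, 1, 4, 3]
overlapping = longest_repeats(trace, allow_overlap=True)
non_overlapping = longest_repeats(trace, allow_overlap=False)
```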

2.2.3. Segmenting Long System Call Sequence
To alleviate the effect of long system call sequences on the efficiency of suffix tree generation, long sequences are segmented into subsequences. As shown in Figure 5, a long system call sequence is divided into several subsequences, and a suffix tree is constructed for each of them.

2.2.4. Generating Corpus
The fixed-length subsequence set and the variable-length subsequence set are combined into a single corpus set that represents the behavior patterns of the system calls.
2.2.5. Feature Extraction
First, the frequency with which each corpus subsequence defined in (8) appears in each system call sequence is counted by an Aho–Corasick (AC) automaton [24]. For every system call sequence, the occurrence frequencies of the fixed-length subsequences and of the variable-length subsequences together form its frequency vector, and the frequency vectors of all the call sequences constitute a feature matrix.
As the number of system call sequences increases, the number of subsequences in the corpus also increases dramatically. In order to control the dimension of the feature vectors and facilitate mining of the potential features in the feature matrix in (8), an autoencoder [26] is utilized to reduce its dimension. Finally, the reduced matrix is submitted to the weighted relaxed support vector machine described in the next section.
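The counting step can be sketched as follows. This is a simplified illustration that scans each sequence once per pattern instead of using an Aho–Corasick automaton, so it produces the same kind of counts only with a worse running time. The function names, the sorted column order, and the choice to count overlapping occurrences are our assumptions.

```python
def count_occurrences(seq, pattern):
    """Count (possibly overlapping) occurrences of pattern in seq."""
    seq, pattern = tuple(seq), tuple(pattern)
    m = len(pattern)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == pattern)


def feature_matrix(sequences, corpus):
    """One row per system call sequence, one column per corpus subsequence."""
    columns = sorted(corpus)  # fixed column order so rows are comparable
    rows = [[count_occurrences(s, p) for p in columns] for s in sequences]
    return columns, rows


corpus = {(4, 1), (1, 4), (4, 1, 4)}  # toy mix of fixed- and variable-length items
columns, rows = feature_matrix([[6, 4, 1, 4, 1, 4, 3], [6, 4, 3]], corpus)
```

Each row is the frequency vector of one sequence, and the resulting matrix would then be passed to the dimension reduction step.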
3. Weighted Relaxed Support Vector Machines
The widely used RSVM [27] is an extension of SVM-L2 with an additional penalty-free slack variable for each sample. A restricted amount of this penalty-free slack is used to relax influential support vectors and push them towards their respective classes. In this paper, based on WRSVM [28], we modify RSVM by assigning different weights and free slack amounts to the two classes, normal and abnormal call sequences, which are scaled by the positive and negative class sizes with a separate penalty control factor for each class.
The enhanced weighted relaxed support vector machine (EWR-SVM) model is given in equation (10), which extends the well-known RSVM formulation. Unlike RSVM, the constructed model differentiates between the positive and negative classes and assigns them weights inversely proportional to the class sizes, that is, to the sizes of the majority (normal) and minority (abnormal) classes. Free slack is denoted by a separate variable in the constraint for each sample. Because of the imbalance, separate amounts of total free slack are provided for the normal and the abnormal classes in the constraints, each parameterized by the free slack provided per sample.
Taking the first-order partial derivatives of the Lagrangian function of equation (10) and substituting the resulting conditions back, we obtain the Wolfe dual of equation (10), given in equation (12). Equation (12) can be solved efficiently with the sequential minimal optimization (SMO) algorithm, and the dot products in equation (12) can be replaced by a kernel for nonlinear classification.
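To make the class-dependent scaling concrete, the sketch below computes only the per-class quantities described in this section: a penalty weight inversely proportional to the class size and a total free-slack budget equal to the per-sample free slack times the class size. It does not solve the optimization problem, and the function name, the parameter names, and the numeric values of the penalty factors and per-sample slack are hypothetical.

```python
def class_parameters(n_normal, n_abnormal,
                     penalty_normal=1.0, penalty_abnormal=1.0,
                     slack_per_sample=0.1):
    """Per-class penalty weights and free-slack budgets (illustrative only)."""
    return {
        "normal": {
            # Weight inversely proportional to the class size.
            "weight": penalty_normal / n_normal,
            # Total penalty-free slack available to this class.
            "free_slack_budget": slack_per_sample * n_normal,
        },
        "abnormal": {
            "weight": penalty_abnormal / n_abnormal,
            "free_slack_budget": slack_per_sample * n_abnormal,
        },
    }


# Class sizes of the ADFA-LD subset used in Section 4.2.
params = class_parameters(n_normal=833, n_abnormal=719)
```

With equal penalty factors, each sample of the smaller (abnormal) class receives a larger weight, which is the intended counterbalance to class imbalance.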
4. Experiment
4.1. Datasets Description
In this section, a group of comparisons is carried out to test the performance of the proposed method on two public system call sequence datasets, ADFA-LD and UNM. The ADFA-LD dataset [29] was released by the Australian Defence Force Academy. It contains thousands of system call traces collected from modern Linux local servers. The UNM dataset [30] was released by the University of New Mexico and contains system call sequences of normal applications in the Linux system as well as sequences from attacking processes. The detailed descriptions of the two datasets are shown in Tables 1–3, from which we can see that UNM is an obviously imbalanced dataset.
First, the sequence features are extracted by counting the occurrences of the fixed-length and variable-length subsequences in each system call sequence, and the extracted features are verified by statistical tests, including information gain and the Mann–Whitney U test [31], in the attachment. Then the features are input into classification models including the proposed EWR-SVM, naive Bayes, logistic regression, random forest, and gradient descent tree, and the performance is also compared with some traditional methods [8, 10, 32].
The results are evaluated by AUC, F1-score, false alarm rate, and the ROC (receiver operating characteristic) curve. The ROC curve takes both the false positive rate and the true positive rate into consideration: it plots the true positive rate against the false positive rate as the decision threshold varies. AUC is the area under the ROC curve.
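As a small illustration of two of these indices, the sketch below computes AUC through its rank interpretation (the probability that a randomly chosen abnormal sample scores higher than a randomly chosen normal one, with ties counted as one half), which equals the area under the ROC curve. It also computes the false alarm rate as the fraction of normal samples flagged as abnormal. The function names and the label convention (1 for abnormal, 0 for normal) are our assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def false_alarm_rate(labels, predictions):
    """False positives divided by all normal samples."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return fp / (fp + tn)


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
far = false_alarm_rate([0, 0, 0, 1], [1, 0, 0, 1])
```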
4.2. Experimental Result
4.2.1. Experimental Result
The ADFA-LD experimental data set consists of 833 normal system call sequences and 719 attack system call sequences. The UNM experimental data set consists of 6 normal sample files and 16 attack sample files. Features of the system call sequences are extracted from the ADFA-LD and UNM datasets. Fixed-length features are extracted by a sliding window of length 2, and variable-length features are extracted by a suffix tree. The anomaly detection results of the classifiers, including EWR-SVM, naive Bayes, logistic regression, random forest, and gradient descent tree, are shown in Table 4 and Figures 6 and 7.


EWR-SVM performs better than the other classifiers on both datasets. The results demonstrate that the features extracted by our method contain useful information that reflects the behavior patterns behind the call sequences and that these features help the model recognize abnormal program behaviors. In order to further demonstrate the effectiveness of our method, a feature validation is carried out on the ADFA-LD and UNM datasets.
4.2.2. Feature Validation Experiment
The occurrence frequencies of the fixed- and variable-length subsequences in ADFA-LD and UNM are shown in Figures 8–15. From the figures, we can see that the distributions of the word (call) frequency differ between the normal and attacking traces (system call sequences). Information gain and the Mann–Whitney U test are used to demonstrate the effectiveness of the extracted features.








The information gain of both the selected fixed-length sequences and the remaining ones is shown in Table 5. The table shows that the information gain of the selected fixed-length sequence features is clearly higher than that of the others. The results demonstrate that fixed-length sequences with high TF-IDF weights are useful for extracting the behavior features of system call sequences.
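For reference, information gain can be computed as the entropy of the class labels minus the expected entropy after splitting on a feature. The sketch below assumes a discrete feature, for example whether a subsequence is present in a trace; how the frequency features are discretized for Table 5 is not stated here, so that choice and the function names are our assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]                                    # 1 = attack, 0 = normal
gain_informative = information_gain([1, 1, 0, 0], labels)
gain_useless = information_gain([1, 0, 1, 0], labels)
```

A feature that separates the classes perfectly attains the full label entropy (one bit here), while a feature independent of the labels has zero gain.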
According to the statistical results, the frequencies of the subsequences of normal and attacking sequences do not follow a normal distribution in either dataset, so a nonparametric method, the Mann–Whitney U test [31], is employed. The following hypotheses are proposed. H0: there is no significant difference between the subsequence frequency of the normal sequences and that of the abnormal sequences. H1: the opposite of H0.
The results of the Mann–Whitney U test are shown in Table 6. When the p value is above the significance level, the original (null) hypothesis is retained; when it is below the significance level, the original hypothesis is rejected and the alternative hypothesis is accepted. From Table 6, we can see that the frequency of the fixed-length subsequences of normal sequences in the ADFA-LD and UNM datasets is significantly different from that of the abnormal sequences. Similarly, the frequency of the variable-length subsequences of normal sequences in the ADFA-LD and UNM datasets is also clearly different from that of the abnormal ones.
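The test itself can be sketched with the standard library alone. The version below computes the U statistic by pairwise comparison and a two-sided p value from the normal approximation, without a tie correction or continuity correction, so it is only a rough check for small or heavily tied samples. The function name is ours, and this is not necessarily the procedure used to produce Table 6.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # U for sample x: pairs where x exceeds y, with ties counted as one half.
    u_x = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    u = min(u_x, n1 * n2 - u_x)
    mean = n1 * n2 / 2.0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value


u_stat, p_value = mann_whitney_u([1, 2, 3], [4, 5, 6])
```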
4.2.3. Comparison with Traditional Methods
In this section, we compare EWR-SVM with other abnormal trace detection methods [8, 10, 32], and the results are shown in Table 7. From the table, we can see that our method outperforms the others. In [8], system call sequences are classified according to the frequencies of abnormal subsequences in the tested traces, and the optimal window width is set by empirical test. However, the optimal length of the subsequence varies among datasets, resulting in unstable performance. Haider [10] proposed utilizing the rarest repeating subsequence and the most frequently repeated subsequence, along with the maximum and minimum system calls in a single sequence, to extract the trace features. However, the correlated contextual semantic information in the sequence is ignored. Anandapriya and Lakshmanan [32] brought the concept of contextual semantic information into trace detection with fixed-length windows. The results in Table 7 show that the AUC of the proposed EWR-SVM is larger than that of the others, while its false-positive rate is lower than that of the compared methods. That is because our method considers not only the semantic information of the sequences but also the behavior patterns of the processes behind the calls. Furthermore, we assign different weights and free slack amounts to the two classes (normal and attacking traces), scaled by the normal and abnormal class sizes, to relieve the effect of the imbalanced data distribution between the two classes. The slack factors in EWR-SVM also reduce the influence of outliers on the optimal hyperplane and ensure a large margin between the two classes.
5. Conclusions
In this paper, both fixed-length and variable-length subsequences of system calls are taken into consideration for host-based intrusion detection. The semantic weight is incorporated into the selection of fixed-length subsequences, while the variable-length subsequence is automatically selected from the suffix tree of each call sequence. In order to deal with the imbalanced data distribution and the outlier samples of the call sequences, a cost-sensitive relaxed support vector machine, EWR-SVM, is proposed, in which the restricted penalty-free slack is split independently between the two classes in proportion to the number of samples in each class with different weights. Both the theoretical analysis and the experimental comparison results demonstrate that our method is more suitable for the detection of attacking traces in system call sequences.
Data Availability
Host intrusion detection datasets in this article are public datasets. Web pages are as follows: (1) http://www.cs.unm.edu/~immsec/systemcalls.htm. (2) https://research.unsw.edu.au/projects/adfa-ids-datasets.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This work was supported by the National Key Research and Development Program of China (020YFB1805405 and 2019QY0800) and the Natural Science Foundation of China (Nos. 61872255, U19A2068, and U1736212).