Abstract

Insider threat detection has been a challenging task for decades; existing approaches generally employ traditional generative unsupervised learning methods to build a normal user behavior model and flag significant deviations as anomalies. However, such approaches fall short in both precision and computational complexity. In this paper, we propose a novel insider threat detection method, the Image-based Insider Threat Detector via Geometric Transformation (IGT), which converts unsupervised anomaly detection into a supervised image classification task, so that performance can be boosted via computer vision techniques. Specifically, IGT uses a novel image-based feature representation of user behavior obtained by transforming audit logs into grayscale images. By applying multiple geometric transformations to these behavior images, IGT constructs a self-labelled dataset and then trains a behavior classifier to detect anomalies in a self-supervised manner. The motivation behind the proposed method is that images converted from normal behavior data contain unique latent features that remain recognizable after geometric transformation, whereas images converted from malicious behavior do not. Experimental results on the CERT dataset show that IGT outperforms classical autoencoder-based unsupervised insider threat detection approaches, improving the instance-based and user-based Area under the Receiver Operating Characteristic Curve (AUROC) by 4% and 2%, respectively.

1. Introduction

Insider threat generally refers to malicious or unintentional actions on the part of an insider that negatively affect the confidentiality, integrity, or availability of the organization's information system [1]. Because insiders are usually knowledgeable about the organization's security mechanisms and authorized to access its services, insider threats are among the most challenging threats and the hardest to detect. As reported in the Insider Threat Annual Report, 67% of organizations have experienced one or more insider attacks in the last 12 months, and this figure keeps increasing with growing economic uncertainty [2]. In a recent survey, insider attacks are shown to account for 25% of cybercrime incidents, and 30% of respondents indicate that the damage and economic loss caused by insider threats are much more severe than those caused by external ones [3]. As such, emerging insider threats such as system sabotage and data breaches have been recognized as critical security challenges faced by various institutions and government agencies. Hence, it is urgent to develop effective approaches for detecting malicious insiders accurately.

Insider threat detection is very important, but the sensitivity of the attack source and the stealthiness of malicious activities make the identification of insiders very challenging. Firstly, built-in security defense mechanisms aimed at external attacks cannot detect insiders, so additional security procedures must be established to discover insider threats. Secondly, malicious activities by insiders represent only a small portion of their overall activities, and most are committed in multiple stages over a long period of time; analysts therefore need long-term monitoring across a wide range of audit data sources, which increases the detection burden. Thirdly, the diversity of insider attacks and the complex role distribution within an organization mean that a "one-size-fits-all" detection model may not exist, and the detection scheme must be dictated by actual requirements. Fourthly, organizations can suffer negative effects (e.g., inefficient work behavior) if an innocent user is classified as suspicious, which defeats the original purpose of deploying a security procedure. Last but not least, researchers cannot easily acquire and use real insider datasets due to privacy and reputation concerns, making it difficult to evaluate insider threat detection approaches effectively. In brief, high accuracy requirements, the excessive burden of handling big data, and the lack of real-world datasets are the main challenges in designing an effective insider threat detection mechanism.

Despite the aforementioned challenges, industry and academia have proposed many insider threat detection approaches [4–10]. Since malicious behavior varies widely, it is impractical to characterize insider threats explicitly. Instead, most solutions build normal user behavior models through historical behavior analysis and identify anomalies as significant deviations from normal behavior [11]. In the modeling process, many classical learning algorithms, such as support vector machines (SVM), isolation forests (IF), hidden Markov models (HMM), and Bayesian inference, can serve as the benchmark for security analysis and have achieved remarkable results in practical applications [12]. Apart from the classification algorithms, the audit data sources also play an important role in determining detection capability and performance. This is because malicious threat scenarios are usually not limited to a specific behavior domain but are scattered across multiple domains and composed of multiple activities. A certain activity of a malicious scenario, analysed in isolation, may even appear normal. For example, the device activity "using a removable drive on the office computer after work" is normal behavior, but it can be judged as malicious when combined with the http activity "uploading files to wikileaks.org." In other words, the judgment of malicious activity must be combined with its specific context, which puts forward new requirements for multisource data fusion.

According to their data fusion methods, traditional insider threat detection solutions can be classified into two categories. One category deploys multiple sub-detectors and generates the final decision based on a voting mechanism, where each sub-detector focuses only on a specific type of suspicious activity. The other category combines the statistics extracted from all relevant audit data to form feature vectors and identifies suspicious activities using various machine learning classification algorithms. However, because whether an activity is malicious is closely related to its contextual situation, the performance of sub-detectors that target only a specific behavior domain is not satisfactory. Therefore, we adopt the second scheme in this paper. This scheme, however, suffers from the following limitations: (i) Feature engineering relies on domain knowledge about how an insider attack is characterized. It is not a trivial task to define appropriate feature vectors based on potential threat scenarios, and there is still much room for improvement. (ii) Traditional, shallow machine-learning models cannot achieve satisfactory precision due to the complexity and heterogeneity of user behavior data. Thus, one aim of this research is to develop a high-precision insider threat detection method with a deep learning model. Moreover, when it comes to practical application, unsupervised learning is the first choice for researchers. In this regard, unsupervised anomaly detection methods can be roughly divided into two categories: reconstruction-based anomaly scores and reconstruction-based representation learning [13]. The former assumes that anomalies and nonanomalies have different latent low-dimensional representations, so it is difficult to compress and reconstruct anomalies with a reconstruction model optimized for nonanomalies; samples with large reconstruction errors are regarded as anomalies. The latter uses a two-step approach, which first learns a compact representation of the data and then applies density estimation methods on the lower-dimensional representation; samples that lie in low-density regions are deemed anomalous. However, in this paper we do not use these two generative approaches and instead adopt a completely different approach to achieve unsupervised detection of insider threats.

Inspired by the image classification method GeoTransform [13], we find that the unsupervised classification problem can be converted into a supervised classification problem by constructing a self-labelled dataset. What is more, this method can improve classification accuracy while reducing problem complexity. Meanwhile, the academic community has begun to apply transfer learning to the domain of cybersecurity in recent years [14, 15]. On this basis, we propose a novel insider threat detection approach named IGT, which is based on image representation and geometric transformations. Following the principle of comparison with a historical baseline, IGT constructs an individual behavior model for each user, applies the unsupervised classification method based on geometric transformation to the images converted from user behavioral feature vectors, and finally achieves precise identification of malicious instances and users. Specifically, we extract user behavioral representation vectors from all relevant audit data according to the potential malicious scenarios and convert them to grayscale images. Then, we train a multi-class neural classifier for each user over the self-labelled dataset, which is created from the normal instances and their transformed versions obtained by applying different geometric transformations. In the testing phase, this classifier is applied to transformed instances of the test sample, and samples with poor classification results are judged as malicious. The intuition behind our method is that images converted from normal behavior data may contain unique latent features compared to malicious data. It should also be mentioned that all the experiments in this paper are based on the CERT public dataset [16].

In summary, this paper makes the following contributions:

Firstly, according to the potential threat scenarios and available audit data, we design a more reasonable feature set, which helps to better represent user behavior. The proposed features consist primarily of occurrence time, assigned computer, and specific activity, and can be subdivided into three types (i.e., week, day, session) based on the aggregation granularity. The experimental data show that our feature vector has better behavior representational capacity than existing feature engineering schemes.

Secondly, we propose a novel insider threat detection approach, IGT. By converting behavior feature vectors to grayscale images and constructing a self-labelled dataset through geometric transformations, IGT converts the unsupervised anomaly detection problem into a supervised image classification problem, thereby reducing its complexity. To the best of our knowledge, this is the first work to apply the unsupervised classification method with geometric transformation to insider threat detection.

Thirdly, we evaluate IGT on the CERT dataset. Our experiment results show that IGT outperforms the classical autoencoder-based classification method, and improves the instance- and user-based AUROC by 4% and 2%, respectively.

The rest of this paper is organized as follows. Section 2 summarizes the related work on insider threat detection. Section 3 presents the feature extraction method and designs an unsupervised insider threat detection mechanism based on image representation and geometric transformation, together with the related algorithm. Section 4 details the employed dataset, experimental setting, and evaluation results. Finally, we discuss the weaknesses and future work in Section 5 and conclude in Section 6.

2. Related Work

Due to its important role in organizational security, insider threat detection has been widely investigated over many decades. On the one hand, in order to prevent military data from being stolen by insiders, DARPA consecutively released two insider-related projects, ADAMS and CINDER. As their engineering outcome, the PRODIGAL system takes user activity logs as input and achieves good detection performance by constructing a flexible dynamic detection architecture [7]. The technical report released by the CERT Insider Threat Center explored the possible manifestations of insider threats and presented common mitigation and preventive measures [17]. On the other hand, there are also many excellent surveys and solutions in academia. Liu et al. systematically reviewed the existing studies on insider threat from the perspective of audit data sources [12]. Homoliak et al. proposed a structural taxonomy and novel categorization of insiders to systematize knowledge in insider threat research [11]. Hunker and Probst argue that the insider threat detection problem cannot be effectively addressed without the collective efforts of psychoanalysis, social relationship investigation, and anomaly detection technology [18]. Although we agree with this opinion, it should be noted that the literature on nonbehavioral factors (e.g., psychometry, emotion) is outside the scope of this paper. Our focus is how to predict whether an employee is behaving abnormally with respect to his or her past activity at any given time instance. Within this scope, this section introduces the related studies from the perspectives of feature engineering and unsupervised anomaly detection.

2.1. Feature Extraction

Textual audit data such as host logs cannot be used directly in anomaly detection algorithms, so feature extraction is a necessary step to convert them into numerical vectors. Depending on the degree of human intervention, feature extraction methods can be classified into two types: statistical features based on manual definition and hidden features based on representation learning. The former is an intelligible and common approach whose core idea is to manually define indicators that may be related to insider threats as feature attributes by means of expert domain knowledge. These indicators include a variety of types, such as frequencies and statistics. Tuor et al. generated 408 features to characterize the behavior pattern by combining different users, time frames, and activity frequency information, and this representation is proved to perform well [5]. Following this combination idea, Le et al. expand the feature vector (824 dimensions) by adding statistical indicators (e.g., the number of words in a copied file) to obtain more detailed user behavior characteristics [19]. While such an expansion improves information richness, it also introduces risks associated with information redundancy and large overhead. Unlike the above methods, Chattopadhyay et al. are not limited to simple frequency and statistical aspects but introduce the concepts of sliding windows and time-series features to capture the dynamic characteristics of user activities [10]. More specifically, they construct the feature vector by calculating the variation of each indicator within a time window. Yuan et al. pay more attention to activity time information, so they extract the behavior temporal representation at both the intra-session and inter-session levels [20]. That is, they generate the original behavior features by calculating activity times, activity types, session durations, and session intervals. In order to detect low-intensity yet long-lasting threats, Yuan et al. add group-related indicators and behavioral deviation indicators to the original single-day features, and construct a compound matrix to characterize the user's historical behavior pattern [21].

Hidden feature extraction based on representation learning is another common approach. It exploits deep learning models to automatically extract users' behavior characteristics. In a sense, this method can be regarded as multiple abstractions of the raw audit data, and its purpose is to obtain the numerical representation that is most conducive to anomaly detection. Sharma [6] and Yuan et al. [22] arrange the activities in the audit data in chronological order for each user to generate an activity sequence set, and then feed those sequences into a Long Short-Term Memory (LSTM) network to obtain an advanced behavior representation. Sun et al. adopt the same network model (i.e., LSTM) to capture the general nonlinear dependency over the history of activities, but the difference is that they use interleaved sequences formed by user behaviors and user attributes as model input [23]. Based on the Bidirectional Encoder Representations from Transformers (BERT) model, Yuan et al. map the activity type and its corresponding time information to the embedding space, and then construct the behavior representation by summing the resulting vectors [24]. To further improve model accuracy, Jiang et al. expand the feature vector by exploiting a graph convolutional network and the structural information between users [8]. Inspired by natural language processing, Liu et al. first use the "4W" template to reorganize the audit logs, and then transform the human-consumable textual data into machine-consumable numerical vectors with the help of the Word2vec model [4]. The main advantage of this approach is that it can capture the latent semantic properties in the original audit logs without relying on any domain knowledge. However, it suffers from limited detection performance.

2.2. Unsupervised Anomaly Detection

Considering the feasibility in practical applications, unsupervised anomaly detection methods are the current mainstream research direction. Many classical unsupervised anomaly detection algorithms, such as IF [25], HMM [26], AutoEncoder (AE) [6, 9, 27–29], Generative Adversarial Network (GAN) [30, 31], and One-Class Support Vector Machine (OCSVM) [32], have been applied in the field of insider threat detection. Gavai et al. design an insider threat detection scheme based on enterprise online activity data [25]. The scheme does not attempt to model normal user behavior but utilizes the isolation forest algorithm to detect statistical outliers directly. Rashid et al. apply the hidden Markov model to insider threat detection for the first time [26]. They use user behavior sequences as system input, the hidden Markov model as the modeling approach, and the deviation between predicted results and actual operations as the judging criterion to detect anomalies. The benefit of this method is that the detection model has good interpretability, which is convenient for experts conducting post-event analysis. However, it also has some disadvantages, such as high computational overhead and poor early detection performance. Yuan et al. propose a novel LSTM-based deep autoencoder anomaly detection method for discrete event logs, which determines whether a sequence is normal or not by analyzing (encoding) and reconstructing (decoding) the given sequence [27]. Moreover, Yuan et al. [30] and Gayathri et al. [31] utilize generative adversarial networks to augment the training data, and such solutions have achieved good results.

The powerful representation and discrimination capabilities of deep learning models provide new opportunities for insider threat detection. Most unsupervised anomaly detection methods can be roughly categorized into two approaches: reconstruction-based anomaly scores and reconstruction-based representation learning. The former identifies anomalies based on whether the reconstruction error exceeds a threshold, and classical methods in this category include the AutoEncoder [6, 9, 27–29] and GAN [30, 31]. The latter applies the low-density rejection principle to the extracted hidden features to detect anomalies; examples of such methods are OCSVM [32] and kernel density estimation (KDE) [33]. Although the above works offer distinctive views and new ideas on feature extraction, their final discriminative methods follow one of these two principles. For example, Liu et al. combine the Word2vec model with an autoencoder classifier to detect insider threats [28], while Lin et al. achieve the same goal by exploiting a deep belief network and a one-class support vector machine [32].

Compared with the aforementioned works, the IGT scheme focuses on technical innovation in the insider threat detection method itself and provides an effective reference for other anomaly detection problems in the security field. In this regard, GeoTransform [13] is similar to our work, but its target is a well-arranged visual image, while the target of IGT is discrete multisource audit logs, which greatly increases the problem complexity. Moreover, the major differences between our work and the classical unsupervised scheme proposed in [28] are as follows: (i) IGT constructs a separate behavior model for each user instead of sharing one model across all users, because different users have different behavior patterns, and a unified model struggles to detect subtle malicious activities. (ii) IGT adopts a statistical feature extraction method based on expert domain knowledge, while the scheme in [28] uses the Word2vec model to extract behavior characteristics. Since malicious behavior is closely related to the context in which it occurs, the intervention of domain knowledge is conducive to improving modeling accuracy. (iii) Unlike the work in [28], IGT utilizes a completely different anomaly detection method, which converts the unsupervised anomaly detection problem into a supervised image classification problem by constructing a self-labelled dataset, thereby reducing the problem complexity. More importantly, the performance of our proposed solution is significantly better than that of the existing mainstream insider threat detection methods.

3. Methodology

The primary purpose of this study is to explore the feasibility of image-based classification methods in the field of insider threat detection, thereby providing new research ideas for solving cybersecurity problems. To this end, this section begins with the workflow of our proposed method and gives an overview of the design principle and basic framework (Section 3.1). Then, we elaborate the feature extraction and image conversion steps involved in the detection procedure in Sections 3.2 and 3.3, respectively. Finally, the modified unsupervised anomaly detection method and the corresponding algorithm implementation are presented in detail in Section 3.4.

3.1. System Overview

Figure 1 shows the system workflow of IGT, which comprises four key procedures, namely, feature extraction, image conversion, anomaly detection, and result analysis. The feature extraction procedure is responsible for the abstraction and generalization of audit data. According to the potential malicious scenarios, it extracts frequency and statistical features for each user by aggregating multiple audit data sources, and then constructs numerical vectors representing user behavior and profile information. After obtaining the initial feature vectors, the image conversion procedure converts them into grayscale images and sends these images to the anomaly detection module. On the basis of these grayscale images, the anomaly detection procedure constructs and trains a geometric transformation-based classification model. This discriminative model is applied to transformed instances of the test samples, and those with poor classification accuracy are regarded as suspicious instances. When the classification work is finished, the result analysis procedure identifies malicious instances and users according to the corresponding thresholds and provides the final detection results to the security analyst.

Before getting into the details of this method, we state the insider threat detection problem studied in this paper and give a simple mathematical formulation. The problem can be defined as follows: "Given an employee's past online activity, predict whether the employee is behaving abnormally with respect to his or her past activity at any given time instance." Let 𝒳 be the space of all activities, and let X ⊆ 𝒳 be the set of normal activities. Given a sample set S of the employee's past normal activities, we would like to construct the best possible classifier h_X(x), where h_X(x) = 1 if x ∈ X and h_X(x) = 0 otherwise. Activities that are not in X are regarded as anomalies. A common method for controlling the classification result is to learn a scoring function n_X(x), such that higher scores indicate that a sample is less likely to belong to X [13]. Once such a scoring function has been learned, a classifier can be constructed from it by specifying an anomaly threshold λ:

$$h_X^{\lambda}(x) = \begin{cases} 0\ (\text{anomalous}), & \text{if } n_X(x) \geq \lambda,\\ 1\ (\text{normal}), & \text{otherwise}. \end{cases} \tag{1}$$

In fact, the critical point of insider threat detection is how to construct the best possible classifier h_X(x) and learn the scoring function n_X(x). As for the threshold, it can be determined based on empirical knowledge and numerical experiments.

As previously mentioned, the starting point of this work is to apply a novel unsupervised anomaly detection method originating from computer vision to the insider threat detection problem. That is to say, how to create a proper image representation and how to construct an effective discriminative model are the primary problems we need to solve. In this process, the content of a sample moves from multiple logs to grayscale images, and the type of classifier moves from unsupervised to supervised. However, "supervised" here does not mean that we train the discriminative model with behavior labels (normal or malicious); instead, we use self-labelled information (rotation, translation, etc.). Although we assume that all the training data are positive samples, the real anomaly labels are not used at any point in the detection process. In order to present the proposed scheme in detail, we select the benchmark datasets provided by CERT as an application example. Besides, our analysis distinguishes between malicious instances detected and malicious users detected, which represent different aspects of the proposed scheme's performance. Because the diversity of user roles has a strong impact on the distribution of actions performed, a high malicious-instance detection rate is not synonymous with all malicious users being detected [19]. A user is identified as malicious if the proportion of his or her suspicious instances exceeds another specific threshold κ.

3.2. Feature Extraction

Feature extraction is an essential part of insider threat detection. The performance of an anomaly detection algorithm is closely related to the feature vectors used to train the model, but characterizing the user behavior pattern is not a trivial task due to factors such as validity and interpretability. In order to solve this problem, we design a more reasonable feature set based on the potential malicious scenarios. The audit data in the CERT dataset consist mainly of logon information, information on files handled by users, external device information, e-mail communication information, detailed http browsing history, and the organization's structure information. The first five logs record most activities performed by users within a certain timeframe and provide the basic data support for user behavior analysis. The organizational structure information represents context data, such as the working role and personal information; normally, it is used as auxiliary information in the process of behavior analysis.

Based on the above audit data, we can perform feature extraction to create numerical vectors that best represent the user behavior pattern. Given an aggregation condition t (i.e., a timeframe), different audit data are aggregated based on user id in chronological order, and then feature extraction is performed on the aggregated data to generate fixed-length vectors (also called data instances). Considering that malicious activities are scattered among numerous normal working activities and that potential insider attacks manifest in various forms, such as data leakage and intellectual property theft, we extract behavior features from three aspects: time, computer, and activity. Table 1 depicts the feature structure in the case of the CERT dataset. Since time is an important dimension for representing the user behavior pattern, we set 4 different timeframes to capture as much information as possible. Meanwhile, given that masqueraders make up a large portion of insiders, the assigned computer information is added to help characterize the user behavior. As for the activity aspect, we design several indicators for different behavior domains, such as the number of logons, the number of doc file operations, the number of recruiting websites visited, the mean size of e-mail attachments, and the number of external devices used. These indicators mainly involve two types: frequency indicators and statistical indicators, where the former is the number of activities performed in a specific timeframe and the latter is a descriptive statistic such as the mean or standard deviation. In short, a behavior feature is made up of the occurrence time, the assigned computer, and activity indicators, and the feature vector can be viewed as an enumeration of all the above features. Furthermore, IGT adopts a categorizing scheme (e.g., job and leak for websites, doc and exe for files) when extracting HTTP and file features. Such a design effectively avoids privacy leakage, because it only requires examining the website domain or file suffix instead of inspecting the specific content.
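To make the aggregation step concrete, the following sketch computes a few user-day indicators from the CERT logon and device logs with pandas. The column names follow the CERT CSV schema, but the chosen indicators, the office-hours window, and the use of pandas itself are illustrative assumptions rather than the full feature set of Table 1.

import pandas as pd

logon = pd.read_csv("logon.csv", parse_dates=["date"])    # columns: id, date, user, pc, activity
device = pd.read_csv("device.csv", parse_dates=["date"])

def user_day_features(df, name):
    # Frequency indicators per (user, day); "after hours" assumes an 08:00-17:59 office window.
    df = df.assign(day=df["date"].dt.floor("D"),
                   after_hours=~df["date"].dt.hour.between(8, 17))
    grouped = df.groupby(["user", "day"])
    return pd.DataFrame({
        f"n_{name}": grouped.size(),
        f"n_{name}_after_hours": grouped["after_hours"].sum(),
    })

# Join the per-domain indicators into one fixed-length user-day vector per (user, day).
features = user_day_features(logon, "logon").join(
    user_day_features(device, "device"), how="outer").fillna(0)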

To further explore the impact of feature extraction on detection performance, we also set three different aggregation granularities based on time duration. Table 2 shows the details of the aggregation granularities used in this paper. User-week and user-day data instances represent the user's activities during the corresponding timeframe. User-session data instances summarize the behavior information by collecting activities from a Logon to the corresponding Logoff, or from one Logon to the next Logon. According to the feature structure above, we construct 40, 28, and 16 features for the respective data instances. However, finer-grained data do not necessarily yield the best detection performance, because the difficulty of anomaly detection increases with the refinement of data granularity. Although finer-grained data provide higher fidelity for behavior analysis, they also bring the drawbacks of longer learning time and a larger imbalance ratio. Moreover, the duration and number of activities vary greatly from one session to another, which further increases the problem complexity. That is to say, there is a tradeoff between detection efficiency and data fidelity, and it is also one of the goals of this work to explore which aggregation granularity is most beneficial for insider threat detection.

It should be noted that there are several similar feature extraction schemes in academia [10, 19]. To make our innovation easier to understand, we make the following comparison. What is common among all these schemes is that they extract behavior features based on experts' domain knowledge. Compared with these previous studies, however, our feature extraction method is designed for malicious behavior with a higher degree of pertinence. Although the number of features proposed in [19] is much larger than in our scheme, which in turn increases information richness, it also introduces risks associated with redundancy and overhead. Due to the high dependency between behavior features and the existence of noise, numerous input variables (features) sometimes degrade model performance and increase unnecessary overhead. Therefore, we strip away the inessential indicators and design several critical features related to potential malicious scenarios. Similar to our work, the method proposed in [10] extracts 20 specific behavior features, but it only considers two aspects of information: time and activity. An additional experimental analysis is performed in Section 4.2 to verify the superiority of our feature extraction scheme.

3.3. Image Conversion

The extracted behavior feature vector cannot be fed directly into the unsupervised anomaly detection algorithm based on geometric transformation because of its column-vector format. Thus, how to convert the original feature vector into a grayscale image is a problem worth exploring. In essence, image conversion amounts to generating a square matrix from a column vector. In this regard, we propose a heuristic image conversion strategy. Given that a malicious scenario is closely related to several features jointly, we represent all the incidence information between the features directly to enhance the expressiveness. Thus, we use the following equation to perform the image conversion:

$$I = v\,v^{\mathsf{T}}, \qquad I_{ij} = v_i \cdot v_j, \tag{2}$$

where v denotes the extracted behavior feature vector and I is the resulting square matrix (grayscale image).

Although such a conversion does not add essential information beyond the original vector, it constructs a spatial relationship among the features and provides a forthright representation of their interactions. In the earlier stages of this work, we tried a few information enhancement methods, such as a convolutional coder, to construct more complicated images, but they degraded performance and we abandoned them altogether. We hypothesize that these methods introduce much irrelevant information, which interferes with the learning effect. By means of the above conversion, each feature vector extracted from the audit data can be represented as a grayscale image.
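A minimal NumPy sketch of this conversion is shown below, assuming the outer-product reading of equation (2); the min-max rescaling to [0, 1] is an added assumption so that the matrix can be treated as a grayscale image.

import numpy as np

def vector_to_image(v):
    """Convert a behavior feature vector into a square grayscale image whose
    pixel (i, j) encodes the pairwise relation between features i and j."""
    v = np.asarray(v, dtype=np.float32)
    img = np.outer(v, v)                      # n x n incidence matrix (equation (2))
    rng = img.max() - img.min()
    return (img - img.min()) / rng if rng > 0 else img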

3.4. Anomaly Detection

After obtaining the image representation of user behavior, we can apply the transformation-based unsupervised anomaly detection algorithm to detect malicious instances. The anomaly detection procedure can be divided into two major steps. Firstly, we create a self-labelled dataset of images from the original user behavior image set S by using a series of geometric transformations Ψ (the reason for using geometric transformations is explained later). Let S_Ψ denote the newly created dataset, which is generated by applying each transformation in Ψ to all images in S. The label of each new instance is the index of the transformation that was applied to it. In this way, we generate a self-labelled multi-class dataset (with |Ψ| classes) whose cardinality is |Ψ| · |S|. Then, we train a multi-class image classifier to predict the transformation index of each image. In the testing phase, this classifier is applied to transformed instances of a test sample, and samples with poor classification results are judged as malicious. To measure the quality of the classification results, we utilize a scoring function n_S(x), which is defined as a combination of the log-likelihoods of the output softmax vectors coming from the classifier h_S(x). After that, we select a proper threshold based on numerical experiments and report those instances whose scores exceed the threshold to analysts for further identification.
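The self-labelling step can be sketched as follows; the transformation list Ψ is assumed to be given (see the candidate-set sketch below), and each transformed image simply inherits the index of the transformation that produced it.

import numpy as np

def build_self_labelled_dataset(images, transformations):
    """Apply every transformation in Psi to every normal behavior image; the
    class label of each generated image is the transformation index."""
    X, y = [], []
    for img in images:
        for k, t in enumerate(transformations):
            X.append(t(img))
            y.append(k)
    return np.stack(X), np.array(y)   # |Psi| * |S| samples, |Psi| classes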

During the above process, the selection of the transformation set is critical to the anomaly detection performance, since the appropriate set of transformations is problem dependent rather than fixed. In the field of image anomaly detection, the GeoTransform method [13] applies 72 different transformations to each sample; its original intention is that these geometric transformations preserve the spatial information and local pixel correlation of normal images. However, considering that the behavior representation images are converted from feature vectors, the initial candidate transformations need not be limited to the geometric field. Thus, nongeometric transformations such as Gaussian blurring and Laplace sharpening are also added to the candidate set to provide more possibilities for detection performance. In addition, we do not make any further division of the rotation transformation. That is, the candidate transformations are compositions of 9 types of shifts, rotations, flips, Gaussian blurring, and Laplace sharpening of the images in each sample, yielding a set of K = 144 candidate transformations:

$$T_{(s_h,s_v),\,o,\,f,\,p,\,b} = R^{\,o} \circ F^{\,f} \circ G^{\,p} \circ L^{\,b} \circ \tau_{(s_h,s_v)},$$

where R, F, G, and L denote the transformations of rotation (90 degrees), horizontal flip, Gaussian blurring, and Laplace sharpening, respectively. The corresponding binary parameters o, f, p, b indicate whether each of these transformations is applied or not. The translation transformation is represented by τ_(s_h, s_v), where s_h and s_v denote the direction of the translation along each axis. For both filters, a kernel size of 3 × 3 is used; for the Gaussian kernel we use σ = 1 and for the Laplacian kernel σ = 0.5. Inevitably, such an extension introduces unnecessary redundancy, and an effective strategy should be designed to discard useless transformations. For this purpose, we adopt the optimization technique in [34] to select the transformations that are most helpful for improving detection performance. Specifically, we split the self-labelled dataset into binary subsets composed of pairs of transformations T_i and T_j (i > j), and then calculate the accuracy for every pair of transformations by training a deep neural classifier based on the Wide Residual Network model (see Section 4.1 for details) [13]. Transformation pairs with an accuracy of around 50% are regarded as redundant, and only the transformation with the fewer operations is retained. The intuition behind this method is that two transformations are equivalent in detection performance if the classifier cannot distinguish one from the other. By applying the aforementioned procedure to the candidate set, we obtain the final transformation set Ψ for the CERT dataset. Somewhat surprisingly, the experimental results show that the geometric transformations yield better results than the nongeometric ones; we think the possible reason is that the nongeometric transformations eliminate features that are important for characterizing the normal behavior pattern. More experimental details about the selection of the transformation set are presented in Section 4.3.
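The following sketch enumerates the 144 candidate transformations in Python/SciPy. The σ = 1 Gaussian filter matches the setting above, while the 0.5 weight used for Laplace sharpening and the composition order are assumptions made for illustration.

import itertools
import numpy as np
from scipy import ndimage

def make_transformation(shift, rotate, flip, blur, sharpen):
    """Compose one candidate transformation; shift is a (rows, cols) offset in
    {-1, 0, 1}^2 and the remaining arguments are the booleans (o, f, p, b)."""
    def t(img):
        out = np.roll(img, shift, axis=(0, 1))           # translation tau
        if rotate:
            out = np.rot90(out)                          # 90-degree rotation R
        if flip:
            out = np.fliplr(out)                         # horizontal flip F
        if blur:
            out = ndimage.gaussian_filter(out, sigma=1)  # Gaussian blurring G
        if sharpen:
            out = out - 0.5 * ndimage.laplace(out)       # Laplace sharpening L
        return out
    return t

shifts = list(itertools.product((-1, 0, 1), repeat=2))   # 9 translations
flags = (False, True)
candidates = [make_transformation(s, o, f, p, b)
              for s, o, f, p, b in itertools.product(shifts, flags, flags, flags, flags)]
assert len(candidates) == 144                            # K = 144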

Besides, the scoring function n_S(x) is used to measure the anomaly degree of a data instance and generate the suspicious sample set for the security analysts. Herein, we use the following scoring function as the discriminative criterion:

$$n_S(x) = -\sum_{k=0}^{|\Psi|-1} \log p\big(\mathbf{y}(T_k(x)) \mid T_k\big),$$

which is the negated combined log-likelihood of the output softmax vectors coming from the classifier h_S(x), under the assumption that all of these conditional distributions are independent. However, this ideal assumption is inconsistent with the facts, so we replace each conditional distribution with the more general Dirichlet distribution [13]. Therefore, the final scoring function used in this paper is

$$n_S(x) = -\sum_{k=0}^{|\Psi|-1} \log \mathrm{Dir}\big(\mathbf{y}(T_k(x)) \mid \tilde{\alpha}_k\big),$$

where α̃_k is the maximum likelihood parameter of the Dirichlet distribution for transformation T_k, which can be estimated through numerical methods [35]. For each test sample, the log-likelihood of each transformed version is calculated using the classifier output and the respective transformation's Dirichlet parameter α̃_k, and all the log-likelihoods are combined to yield the score. The larger the score, the more anomalous the sample. Moreover, in order to strike a balance between detection accuracy and investigation overhead, we set a specific threshold based on numerical experiments to decide which samples should be reported.
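A simplified version of this score, dropping the Dirichlet refinement and using the raw softmax probabilities, can be sketched as follows; `softmax_fn` (the trained classifier's softmax output) is an assumed callable.

import numpy as np

def anomaly_score(softmax_fn, transformations, x, eps=1e-12):
    """Negative sum of the log-probabilities that the classifier assigns to the
    correct transformation index of each transformed version of x; larger
    scores indicate more anomalous samples."""
    score = 0.0
    for k, t in enumerate(transformations):
        p = softmax_fn(t(x))          # softmax vector over the |Psi| classes
        score -= np.log(p[k] + eps)
    return score

Algorithm 1 shows the details of the whole insider threat detection mechanism.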

Input: user set U, user behavior set Ω, transformation set Ψ, softmax classifier h_S(x)
Output: suspicious behavior set Φ, suspicious user set Ua
(1) Φ = ∅, Ua = ∅, Γ = {week, day, session}
(2) for each user u ∈ U do
(3)  Ωu = Ωtrain ∪ Ωtest
(4)  for each granularity t ∈ Γ do
(5)   for each record ω ∈ Ωtrain do
(6)    calculate the feature vectors based on user id u;
(7)   end for
(8)  end for
(9)  // create the grayscale images S according to equation (2)
(10) // create the self-labelled dataset S_Ψ
(11) while not converged do
(12)  train h_S(x) on the self-labelled dataset S_Ψ
(13) end while
(14) for k = 0 to |Ψ| − 1 do // for each transformation in Ψ
(15)  calculate the Dirichlet parameter α̃_k according to the numerical method [35]
(16) end for
(17) calculate the threshold parameters λ and κ
(18) for each record ω ∈ Ωtest do
(19)  create the feature vector, the grayscale image x, and the self-labelled images {T_k(x)}
(20)  calculate the anomaly score n_S(x)
(21)  if n_S(x) ≥ λ then
(22)   Φ = Φ ∪ {ω}
(23)  end if
(24) end for
(25) if the proportion of suspicious instances of u exceeds κ then
(26)  Ua = Ua ∪ {u}
(27)  report u to the security analyst
(28) end if
(29) end for
(30) return suspicious behavior set Φ and suspicious user set Ua

4. Evaluations

In this section, we present the experimental evaluation of the proposed detection mechanism based on the CERT insider threat dataset. Firstly, we give a brief introduction about this dataset, the evaluation metrics, and the deep neural classifier used in this paper, and then we verify the effectiveness of the proposed features. Subsequently, we discuss the impact of the selection of the transformation set and threshold on the algorithm performance. Finally, the performance comparison with other representative algorithms is presented in detail.

4.1. Dataset

In order to evaluate the performance of the proposed scheme, we use the Carnegie Mellon University (CMU) CERT insider threat dataset, a publicly available dataset for insider threat mitigation research [16]. The dataset consists of various versions, and each release characterizes an organization with 1000 to 4000 employees. In this paper, we select release r4.2 for evaluation. The dataset records user activity logs over a period of 18 months. As insider threat events are usually rare in the real world, the class imbalance problem is fully reflected in these datasets. For example, the number of malicious users is around a tenth of the number of normal users in the r4.2 dataset, and the imbalance ratio of instances is even larger. More details about the datasets can be found in Table 3. As shown in the table, the imbalance ratio increases with the refinement of aggregation granularity, which further indicates that finer-grained data do not guarantee higher performance for insider threat detection. Moreover, the dataset is split into a training set and a testing set in chronological order, and the splitting ratio is set to 30% as recommended in [10].

Next, we introduce the performance metrics used in this paper. Detection rate (DR), precision (PR), F1 score, and false positive rate (FPR) are commonly used metrics in the classification field, and they apply equally well to insider threat detection [21]. In this paper, true positives (TP) denote malicious samples correctly recognized as "malicious," false positives (FP) denote normal samples incorrectly recognized as "malicious," false negatives (FN) denote malicious samples incorrectly recognized as "normal," and true negatives (TN) denote normal samples correctly recognized as "normal." Among these metrics, precision represents the percentage of malicious warnings generated by the system that are true, and the F1-score summarizes both DR and PR as a harmonic mean. Due to the extremely skewed data, the AUROC is another important indicator for evaluating detection performance; basically, the larger the AUROC value, the better the anomaly detection method. For convenience, all the following metrics are reported in percent. In addition, as mentioned in Section 3.1, we report the system performance from the perspective of data instances (instance-based) and organizational users (user-based). For user-based results, a user is classified as "malicious" if the number of anomalous instances in a specific timeframe exceeds a specific threshold. Therefore, there are two kinds of performance metrics in this work: instance-based (IDR, IFPR, IPr, IF1) and user-based (UDR, UFPR, UPr, UF1).
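The instance-based and user-based evaluation can be sketched as follows; `scores`, `labels`, and `users` are assumed to be parallel NumPy arrays over the test instances, and the user-level score used here (fraction of instances above the threshold λ) mirrors the user-based decision rule described above.

import numpy as np
from sklearn.metrics import roc_auc_score

def instance_and_user_auroc(scores, labels, users, lam):
    """Instance-based AUROC uses the raw anomaly scores; user-based AUROC
    aggregates each user's instances before scoring."""
    i_auroc = roc_auc_score(labels, scores)
    user_ids = np.unique(users)
    user_scores = np.array([np.mean(scores[users == u] >= lam) for u in user_ids])
    user_labels = np.array([labels[users == u].max() for u in user_ids])
    u_auroc = roc_auc_score(user_labels, user_scores)
    return i_auroc, u_auroc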

In addition, the detailed information about deep neural classifier used in the paper is presented as follows. We use a Wide Residual Network [13] with architecture parameters of depth 10 and width factor 4 to construct the multi-class classifier . Figure 2 depicts the full architecture of the Wide Residual Network model. This model consists of 7 convolutional layers, 3 skip connections, a global average pooling, and a fully connected layer with output size equal to the number of applied transformations . A batch size of 32, epoch size of 200, Adam optimizer with default hyperparameters, and a cross-entropy loss are applied.
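A possible PyTorch rendering of this classifier is sketched below; the exact arrangement of the residual blocks and the 1 × 1 projection convolutions on the skip paths are assumptions, since only the layer counts and training settings are specified above.

import torch
import torch.nn as nn

class WideBlock(nn.Module):
    """One residual block: two 3x3 convolutions plus a skip connection."""
    def __init__(self, c_in, c_out, stride):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(c_in)
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False)
        self.skip = (nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False)
                     if (c_in != c_out or stride != 1) else nn.Identity())

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.skip(x)

class WRN10_4(nn.Module):
    """Wide Residual Network with depth 10 and width factor 4: an initial
    convolution, three residual blocks (7 convolutions and 3 skip connections
    in total), global average pooling, and a fully connected layer whose
    output size equals the number of applied transformations."""
    def __init__(self, num_transformations, in_channels=1):
        super().__init__()
        widths = [16, 64, 128, 256]                 # 16, 16*4, 32*4, 64*4
        self.conv0 = nn.Conv2d(in_channels, widths[0], 3, padding=1, bias=False)
        self.blocks = nn.Sequential(
            WideBlock(widths[0], widths[1], stride=1),
            WideBlock(widths[1], widths[2], stride=2),
            WideBlock(widths[2], widths[3], stride=2))
        self.bn = nn.BatchNorm2d(widths[3])
        self.fc = nn.Linear(widths[3], num_transformations)

    def forward(self, x):
        out = self.blocks(self.conv0(x))
        out = torch.relu(self.bn(out)).mean(dim=(2, 3))   # global average pooling
        return self.fc(out)

# Training configuration stated above: Adam with default hyperparameters,
# cross-entropy loss, batch size 32, 200 epochs.
model = WRN10_4(num_transformations=18)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()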

4.2. Feature Superiority Comparison

Features play an important role in determining the performance of anomaly detection. In this section, we give a contrastive analysis on the superiority of the feature extraction scheme. The comparison objects are the feature extraction methods proposed in Refs. [10, 19]. All these methods extract behavior features based on experts’ domain knowledge, but the number of features proposed in work [19] are much more than ours, which means more detailed information about the user activities. Different from our scheme, the method proposed in work [10] extracts features from the aspect of time and activity. To evaluate the superiority more reasonably, we select three classical classification algorithms as the anomaly detection approach: random forest (RF) [36], isolation forest (IF) [37], and autoencoder [28]. These approaches utilize the extracted features as input directly, and output the performance metric as the evaluation criteria. In this process, Python 3.7 is used for feature extraction and Scikit-learn is used for implementing anomaly detection algorithms. The features are normalized before being used to train the classifier. In terms of parameter selection, we perform parameter search using hyperopt, which is a parameter tuning solution based on the Parzen estimator [38]. Specifically, for RF, we tune the number of features (all, square root and log base-2 of all features), the number of decision tree estimators (50 to 100), and the depth of individual trees (3 to 10). Similarly, for the autoencoder, a limit of 200 epochs was assumed. The number of hidden layers was searched between 1 and 3 and each hidden layer has the size set to the half of the previous layer. The mini-batch size is set to 32 and L2 regularization penalty is . With respect to the isolation forest, we tune the number of trees (30 to 100) and the threshold for suspicious instance is set at 5%. It should be noted that the isolation forest is trained on the training set and is then applied to the test set, instead of the whole dataset at a time. All the results are obtained by averaging the multiple experiment data, where each setting is randomly repeated 20 times.
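The hyperopt-based search for the random-forest baseline could look like the sketch below; the objective function, the AUROC scoring on a held-out split (X_val, y_val), and max_evals=50 are illustrative assumptions, while the three search ranges follow the text.

from hyperopt import fmin, tpe, hp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

space = {
    "max_features": hp.choice("max_features", [None, "sqrt", "log2"]),   # all, sqrt, log2
    "n_estimators": hp.choice("n_estimators", list(range(50, 101))),     # 50 to 100 trees
    "max_depth": hp.choice("max_depth", list(range(3, 11))),             # depth 3 to 10
}

def objective(params):
    clf = RandomForestClassifier(**params, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)                    # assumed normalized features and labels
    auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    return -auc                                  # hyperopt minimizes the objective

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)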

Table 4 and Figure 3 present the feature comparison results obtained by the anomaly detection approaches at different data granularity levels. It is observed that our feature extraction scheme performs best among the three schemes at every granularity level, and this advantage becomes more apparent when the detection methods change from supervised to unsupervised. The main reason is that the ground-truth information used in the supervised algorithm makes up for the deficiency of features in behavior characterization. Meanwhile, compared with the more detailed feature extraction scheme [19], the more succinct method proposed in [10] also performs better. This is because the numerous features (824 dimensions) inevitably introduce much redundancy, which in turn degrades the model performance. On the other hand, more detailed and pertinent information is beneficial to detection performance when the dependency between features is absent. As shown in Figure 3, the AUROC value of our feature extraction scheme is higher than the others in both cases (instance-based and user-based results), which shows its superiority in representing the user behavior pattern.

4.3. Transformations Selection and Parameter Analysis

As mentioned in Section 3.4, we attempt to explore more possibilities for detection performance by using various transformations. To obtain the proper transformation set, we apply the transformation selection procedure to the candidate set on the r4.2 dataset. Due to the huge computational overhead brought about by the numerous transformations, we use a simplified setup in which the nongeometric transformations are applied only over the shift operations, generating 27 possible states; that is, the number of candidate transformations changes from 144 to 63. Following this procedure, we compare the performance of different transformation pairs, and the results are summarized in Table 5.

From Table 5, we can see that the nongeometric transformations do not help improve detection performance. We believe this is because the nongeometric transformations reduce the representational capacity of the local pixels converted from the discrete feature vectors. For example, the malicious scenarios are closely related to specific features, but nongeometric transformations such as smoothing weaken this incidence relation. In addition, only 18 geometric transformations (the 9 shifts, each with and without rotation) remain after applying the optimization selection technique, but their AUROC values (Transform18) are higher than the others. This phenomenon indicates that eliminating redundant transformations helps reduce computational overhead, makes the algorithm faster, and lowers the complexity of the classification space. In summary, the final transformations used in this work are the translations and rotations.

When describing the IGT mechanism, we introduced the threshold parameters λ and κ to discriminate the suspicious samples. An instance whose anomaly score exceeds the threshold λ, or a user whose quantity of suspicious instances exceeds κ, is regarded as anomalous and reported to the security experts for further investigation. Generally speaking, the threshold is correlated with the available investigation budget and is set primarily based on empirical knowledge and numerical experiments. However, considering that user behaviors form streaming data, we select a historical score of the training samples as the threshold λ rather than a proportion of the testing samples. That is, the threshold is fixed within a specified period of time, and such a setting is convenient for the subsequent discrimination of new behavior data. As recommended in most works, the τ-percentile of historical scores is specified as the threshold; in this case, the parameter τ is synonymous with the threshold λ, only expressed in a different form. To assign a proper value to the parameter τ, we conduct the following numerical experiments and adopt the F1 metric for its expressive power. We analyse the impact of the threshold on the algorithm performance when applying Transform18 as the transformation set. As shown in Figure 4(a), for all granularity levels, the F1 metric generally shows an upward trend as the threshold increases, but this trend disappears when the threshold is close to the theoretical maximum. A larger threshold means fewer alarms, which could result in many suspicious events going undetected; however, in the security domain, it is preferable for managers to reduce false negatives rather than false positives. Therefore, we select the 95th percentile of the historical anomaly scores in the training phase as the threshold λ.

Similarly, we experimentally determine the appropriate value for the threshold κ. Although in practice the determination of a malicious user depends on whether the person actually performs a malicious act, there is a proportional relation between the quantity of suspicious instances and the anomaly degree: the more suspicious instances, the greater the likelihood that the user is malicious. However, it is not reasonable to set one fixed count for all users, because the number of malicious behaviors varies with the user's behavioral habits and ultimate purpose. Therefore, we adopt a similar method to select the parameter κ. Specifically, we calculate the proportion of suspicious instances in the latest timeframe, whose length is equal to the timeframe of the training dataset, and label the user as malicious if this proportion exceeds the threshold κ. The intuition behind this method is that a user with a larger abnormal proportion than in the historical normal timeframe is more likely to be malicious. We test values for κ ranging from 5% to 20% with a step size of 1.5%. Figure 4(b) presents the corresponding results. It can be seen that the F1 metric of the user-based detection results decreases when the threshold κ is set too large or too small. This is consistent with the intuition that a small κ means more false alarms while a large κ means more false negatives; in either case, performance degrades. Besides, from Figure 4(b), we can also see that the detection performance is at a relatively high level when κ is set to 14%.
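Putting the two thresholds together, the selection and application of λ and κ can be sketched as follows; `anomaly_score`, `softmax_fn`, `transformations`, and `train_images` refer to the earlier sketches and are assumed to be available.

import numpy as np

# lambda: 95th percentile of the anomaly scores on the (assumed-normal) training instances.
train_scores = np.array([anomaly_score(softmax_fn, transformations, x)
                         for x in train_images])
lam = np.percentile(train_scores, 95)

def is_suspicious_user(test_scores, kappa=0.14):
    """Flag a user when the fraction of suspicious instances in the latest
    timeframe exceeds kappa (0.14 in our experiments)."""
    return np.mean(np.asarray(test_scores) >= lam) > kappa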

4.4. Comparison with Other Related Work

In this section, we compare the performance of the IGT algorithm with other classical unsupervised insider threat detection algorithms, namely, the autoencoder-based method [28] and the IF-based method [37]. The main differences between the work in [28] and our scheme were introduced in Section 2, and here we refer to their model as the baseline model. The main idea of the autoencoder model is to learn the normal behavior pattern based on reconstruction errors: after being trained with only normal samples, it can reconstruct normal samples with minimal reconstruction errors, so a high reconstruction error means that the test sample deviates significantly from normal samples. IF presumes that outliers are easier to isolate from the rest of the data than normal samples; hence, samples with shorter path lengths to the corresponding leaves are considered suspicious. All three algorithms are trained in an unsupervised manner and label the instances whose anomaly scores exceed the 95th percentile of historical scores in the training phase as suspicious instances. Different from the IGT algorithm, IF and the autoencoder only use the original feature vectors as input and do not involve the transformation operations. In terms of experimental setting, we implement the IGT classifier with PyTorch, and the Adam optimizer with default hyperparameters is used to minimize the cross-entropy loss function. The batch size and number of epochs for all these methods are set to 32 and 200, respectively. As for the IF and autoencoder models, we adopt the same experimental setting described in Section 4.2 and search for the best hyperparameters by means of the hyperopt tool.
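For reference, the isolation-forest baseline with the same 95th-percentile thresholding can be sketched with scikit-learn as follows; the number of trees is only a placeholder within the tuned range, and X_train/X_test are assumed to hold the normalized feature vectors.

import numpy as np
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
# score_samples returns a "normality" score (higher = more normal), so negate it
# to obtain an anomaly score consistent with the rest of the paper.
train_scores = -iforest.score_samples(X_train)
test_scores = -iforest.score_samples(X_test)
threshold = np.percentile(train_scores, 95)
suspicious = test_scores >= threshold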

Firstly, we compare the performance of the three detection algorithms from two aspects: instance-based results and user-based results. Table 6 gives the detailed results, and Figure 5 depicts the corresponding ROC curves and AUC values on user-day data. It can be seen that IGT performs best among these insider threat detection algorithms in either case, and the performance of the AutoEncoder is better than that of IF. Compared with the precision metric (40%–60%), the detection rate of each algorithm is at a relatively high level (60%–80%), which means that the key to improving detection performance is to reduce false alarms. In this regard, IGT achieves a lower FPR with a similar DR. Note that the isolation forest algorithm shows a surprisingly poor performance, with a higher FPR and lower DR; we think the possible reason is that IF has weaker representational learning ability and is more suitable for outlier detection than for novelty detection. Figure 5 further demonstrates the superiority of the IGT scheme, whose instance- and user-based AUROC are improved by 4% and 2%, respectively, over the AutoEncoder algorithm. The performance gaps between these algorithms become smaller when the detection object changes from anomalous instances to anomalous users, but this is in line with our expectation and can be explained by the smaller imbalance ratio.

Secondly, in order to explore possible explanations for the performance difference between these algorithms, we analyse the trends of the anomaly scores of two different users under the different approaches based on user-day data, in which EDB0714 is a malicious user and TCD0009 is a normal user. The detailed information can be seen in Figure 6. Due to its poor performance, the IF-based algorithm is omitted here. The gray line denotes the anomaly scores of training samples, and the black line denotes the anomaly scores of test samples. The false positives, true positives, and false negatives are depicted by red, blue, and violet points, respectively, and the star markers at the bottom indicate the actual anomalous days. From Figures 6(a) and 6(c), we can observe that most of the anomalous instances of the malicious user are detected by both the IGT and AutoEncoder algorithms, but the number of red points (false positives) under the AutoEncoder approach is higher than under IGT. This phenomenon is more obvious in the anomaly score trends of the normal user (i.e., Figures 6(b) and 6(d)), which further indicates that the representational learning ability of the AutoEncoder is weaker than that of the IGT scheme. Although the results for two individual users may not be sufficient on their own, it should be clarified that both users were chosen randomly from the organizational personnel pool and that similar conclusions can be drawn for additional users.

Moreover, we investigate the impact of different data granularity levels on the IGT detection algorithm; the results are presented in Table 6 and Figure 7. The relationship between the algorithm's performance and data granularity is not monotonic. For example, the AUROC value on user-week data (0.86) is higher than that on user-session data (0.82) but lower than that on user-day data (0.88). This confirms our earlier conjecture that finer-grained data do not necessarily yield the best detection performance. Although higher data fidelity helps to construct a more precise individual behavior model, it also introduces the drawback of a larger imbalance ratio. Considering the timeliness requirements of a practical anomaly detection system and the above experimental results, we consider user-day data to be a suitable granularity level for user behavior analysis. Finer-grained detection requirements can be satisfied by further analysis after the suspicious days have been obtained. Note that our conclusion differs slightly from the work [10], which reports that algorithm performance is degraded by higher data granularity levels (i.e., decreases monotonically). This discrepancy can be explained by the different model construction methods: IGT constructs a separate behavior model for each user, whereas the latter shares one model across all users. In other words, the smaller number of training instances degrades the detection performance of IGT on user-week data.

Finally, to further demonstrate the effectiveness of the IGT scheme, we train and test the insider threat detection algorithms on another CERT dataset release, r6.2. Compared with the r4.2 dataset, r6.2 has a different organizational structure and more users (4000). Moreover, it simulates only one malicious user per insider threat scenario, which significantly increases the detection difficulty. We adopt the same experimental setting as for the preceding dataset, and the results are presented in Figure 8. Although the IDR and IPR metrics degrade by about 10–15%, the performance of IGT remains better than that of the classical AutoEncoder algorithm. We attribute this degradation to the larger imbalance ratio and the more complex malicious scenarios. In addition, since there are only 5 malicious users in this dataset, both algorithms detect all the malicious users, but the UFPR of IGT is significantly lower than that of the AutoEncoder. Regarding data granularity, user-day data again show higher performance than the other data types from r6.2, which further validates the previous conclusion. Overall, the IGT algorithm proposed in this paper outperforms the existing mainstream insider threat detection methods.

5. Discussion and Future Work

5.1. Discussion

The features proposed in this paper are composed of occurrence time, assigned computer, and specific activity, and they are shown to perform well in detecting insider threats on the CERT dataset. However, since feature selection is domain-specific, there is no single feature extraction scheme that can cover all domains. Although our feature structure covers most critical indicators, the specific activities should be adjusted according to the potential malicious scenarios. In other words, if the activity information related to an emerging insider threat is not contained in the feature vectors, IGT may not be able to identify the compromise. Therefore, in order to improve the practicality and generality of the IGT mechanism, we discuss several potential concerns and countermeasures. Firstly, which behaviors should be selected as features for a specific domain? Secondly, for a specific user behavior, which attributes should be computed as its feature representation? Thirdly, how should features be designed to detect unknown insider threats? For the first question, the guideline proposed by MITRE ATT&CK provides a possible solution [39]: security analysts can look up the relevant behaviors for each cyber threat by means of the guideline. For the second, statistical indicators and frequency indicators are commonly used in insider threat detection, and the proper attributes should be selected according to the type of user behavior. For example, frequency indicators are appropriate for logon activity, while statistical indicators such as the mean and standard deviation are more suitable for content-related activities (see the sketch below); other types of indicators can of course also be used to characterize user behavior patterns. Finally, in terms of unknown threat detection, one possible solution is to extract advanced features without relying on expert knowledge. Such a feature extraction method does not require assumptions about the potential malicious scenarios; instead, it constructs the user behavior model in a purely data-driven manner and can therefore provide some advantages in detecting unknown threats.
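The distinction between the two indicator types can be illustrated with the following sketch; the specific attributes chosen here (after-hours logon counts, email content sizes) are hypothetical examples rather than the exact features used in this paper.

```python
# Hedged sketch of the two indicator types discussed above:
# frequency indicators for discrete activities such as logon, and
# statistical indicators (mean / standard deviation) for content-related activities.
import statistics

def frequency_indicators(logon_events):
    """logon_events: list of dicts like {'hour': 23, 'pc': 'PC-1234'} for one user-day."""
    return {
        "logon_count": len(logon_events),
        "after_hours_logons": sum(1 for e in logon_events
                                  if e["hour"] < 8 or e["hour"] > 18),
    }

def statistical_indicators(email_sizes):
    """email_sizes: list of email content sizes (bytes) for one user-day."""
    if not email_sizes:
        return {"email_size_mean": 0.0, "email_size_std": 0.0}
    return {
        "email_size_mean": statistics.mean(email_sizes),
        "email_size_std": statistics.pstdev(email_sizes),
    }
```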

5.2. Future Work

In the preceding sections, we introduced the basic idea and operational process of the IGT mechanism in detail and conducted extensive numerical experiments to validate its feasibility and superiority. Nevertheless, we do not provide a rigorous theoretical justification of the detection mechanism; our aim is to explore the possibility of image-based classification methods in the field of insider threat detection and to provide a new research direction for this cybersecurity problem. Heuristic design and experimental verification are the main research methods of this paper. In addition, the behavior model used in this work is constructed from a state description limited to a single exemplar. Accordingly, several improvements remain, and future work includes the following three directions. First, it is important to develop a theory that grounds the use of geometric transformations, which is the basis for the wider application of this anomaly detection method. Second, we will attempt to extract advanced behavior features without relying on expert knowledge; for example, latent semantic properties can be obtained from log text by means of natural language processing techniques. Third, we will investigate how to utilize the temporal information in user activities to improve detection performance. In addition, we have carried out some related work and posted it on the arXiv server [40].

6. Conclusions

Traditional insider threat detection approaches usually suffer from low precision and high computational complexity. To address this, inspired by image classification techniques, we propose a novel image-based insider threat detector via geometric transformation (IGT). IGT constructs an individual behavior model for each user and applies self-supervised classification to images that are transformed from user behavior feature vectors and further processed with geometric transformations. Unlike classical unsupervised methods, our anomaly detection approach completely alleviates the need for a generative component by converting the unsupervised anomaly detection problem into a supervised image classification problem. More importantly, evaluation results on the CERT dataset show that, compared with the classical insider threat detection approaches, IGT improves the instance- and user-based AUROC by 4% and 2%, respectively. As future work, we will attempt to develop a theory that grounds the use of geometric transformations and explore the utilization of temporal information in user activities to further improve insider threat detection performance.

Data Availability

All the data used in this study come from the Carnegie Mellon University CERT insider threat dataset, which can be downloaded at https://kilthub.cmu.edu/articles/dataset/Insider_Threat_Test_Dataset/12841247/1.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This research was supported by research grants from the National Natural Science Foundation of China under grant nos. 61772271, 62106282, and 62172432, the Equipment Research and Development Fund under grant no. ZXD2020C2316, and the Natural Science Foundation of Jiangsu under grant no. SBK2020043435.