Abstract

Isolation Forest, or iForest, is one of the most effective outlier detectors proposed in recent years. Yet the model relies mainly on randomization, so it offers no guidance on how to select a proper attribute or how to locate an optimized split point on that attribute while building an isolation tree. Addressing these two issues, we propose an improved computational framework that seeks the most separable attributes and locates the corresponding optimized split points effectively. Experimental results show that the proposed model achieves overall better outlier detection accuracy than the original model and its related variants.

1. Introduction

In the information society, tremendous volumes of data are generated every second to record events. A few of these records are special because they deviate from what we expect, and they can be treated as anomalies. Anomalies exist in nearly every field: a spam email, an incidental machine breakdown, or a hacking attack on the Web can all be regarded as anomalous events in real life. According to the definition given by Hawkins [1], an anomaly, also called an outlier, differs markedly from the observed normal objects and is suspected to be generated by a different mechanism. Anomalies are therefore few and distinct, and detecting them accurately is the main challenge of this field, which is commonly called anomaly detection or outlier detection.

In practice, data annotated with normal or abnormal labels, which would provide the prior knowledge needed to identify anomalies, are usually unavailable. Moreover, anomalies are rare, so the numbers of normal and abnormal instances tend to be heavily unbalanced. For these two reasons, anomaly detection cannot be treated as a typical classification problem solved by traditional supervised machine learning methods. Consequently, most models proposed in the past decades are unsupervised, including statistical models, distance-based models, density-based models, and tree-based models.

If a dataset with multivariate attributes contains normal and anomalous points, then, in contrast to traditional views, we consider that the essential distinction between anomalies and normal instances lies in their discrepant values on each attribute. In other words, the different value distributions of normal and anomalous points on each attribute are what make them separable. iForest is one of the successful attempts to partition attribute values effectively by means of randomization [2]. Yet, in its problem setting, it is not clear how to select a proper split point.

Despite the advantages of iForest, we believe there is still room for improvement. For example, the selection of the attribute and the determination of the split value on the chosen attribute are completely arbitrary when building the iForest, and smarter methods deserve study. Moreover, since each tree in the iForest, i.e., an iTree, is a binary tree whose growth is constrained by depth, the leaf nodes (external nodes) of the tree should play a role in deciding whether an instance is normal, which previous studies have not considered. Aiming at these issues, we propose a new solution that optimizes the tree-building process by answering three key questions.

(1) How to select a suitable or distinguishable attribute to partition the instances?

(2) How to determine the appropriate split point on a given attribute? A split point refers to the attribute value chosen to partition the data into two sets on that attribute.

(3) How to mark the leaf nodes with proper category labels? Since there are only two kinds of labels in anomaly detection, we can mark a leaf node with label 1 for normal instances and 0 for anomalies. For a specific detection process in an iTree, when the decision procedure reaches one of the leaf nodes, the label of that leaf (1 or 0) determines whether the instance is normal.

The optimized iTree and the corresponding improved iForest have four merits:

(1) Via a heuristic searching method based on the gradient, the new model is able to locate the best split point on a given attribute efficiently.

(2) Compared with the extremely unbalanced isolation trees generated by the original model, i.e., iForest, the iTrees generated by the proposed model are well balanced.

(3) Compared to iForest, the new model needs fewer isolation trees to constitute the forest.

(4) It achieves more favorable outlier detection accuracy, in terms of AUC, than the state-of-the-art methods.

The rest of the article is organized as follows. Section 2 reviews related studies in the field of outlier detection. Section 3 introduces the notation used in the rest of the paper. Section 4 presents the motivation of our study, the quantitative analysis behind the proposed model, and the detailed algorithm descriptions. Section 5 reports the experimental comparisons with existing methods on real datasets, and Section 6 analyzes the behavior of the proposed model through extended experiments. Finally, Section 7 concludes the work.

2. Related Work

Although extensive surveys of outlier detection have appeared in recent review articles [3–6], the field remains an active and fast-moving part of data mining, so some new insights and trends are introduced here. In our view, the state-of-the-art approaches can be categorized as follows.

2.1. Distance-Based Approaches

The distance-based approaches [7] are commonly regarded as the basic methods in the study of outlier detection. The original idea is to compute the distance between a given object and its k-th nearest neighbor and to report as outliers the objects with the largest such distances. Several variants followed, such as the k-weight method [8] and outlier detection using indegree number (ODIN) [9]. The k-weight method defines the weight of an object by summing the distances to its k nearest neighbors, which is more suitable for high dimensional data. ODIN formalizes the nearest-neighbor relations as a graph and detects the nodes with low indegree as outliers. Other related models use pruning techniques to rank the top-N outliers by distance [10, 11]. Together they constitute the family of distance-based approaches.

2.2. Density-Based Approaches

These approaches introduced the concept of the Local Outlier Factor (LOF) [12]: each instance is assigned a score based on the local density of its neighbors, denoting its degree of outlierness, and a potential outlier is identified by a relatively high LOF value. Several extended models build on this idea. INFLO computes its score using both the k-nearest neighbors and the reverse k-nearest neighbors [13]. LOCI further introduced the multigranularity deviation factor (MDEF), which uses r-neighborhoods rather than k-nearest neighbors [14]. To alleviate the sensitivity of parameter tuning, the improved model LDOF was proposed, although its performance is similar to that of the traditional density-based approaches [15]. For high dimensional data, some works try to separate normal and anomalous instances in meaningful subspaces of the data space [16, 17].

2.3. Tree-Based Approaches

In recent years, tree-based approaches have attracted considerable attention in outlier detection for their outstanding detection accuracy and scalability. Typical models include IHCCUDT [18], iForest [2], SCiForest [19], and RRCT [20]. iForest and RRCT both adopt a randomization strategy for tree building: RRCT considers the impact of the outliers on the entire data, while iForest focuses only on the cost of isolating instances. IHCCUDT and SCiForest both consider how to reduce the inconsistency within the subsets produced by each partition. IHCCUDT utilizes a small number of labeled instances to mark the remaining unlabeled ones, whereas SCiForest follows the idea of iForest, identifying anomalies by the depth of the leaf nodes and gaining performance by introducing random hyperplanes in subspaces.

2.4. Scenario-Aware Approaches

Applying available outlier detection models to specific problem scenarios is also a trend in this field because of its practical interest. Schubert et al. proposed a unified framework that abstracts the notion of locality from the classic distance-based notion and applied it to spatial, video, and network outlier detection [21]. Aiming at the communities existing in a network, Gao et al. used a generative model called CODA to detect outliers, defined as individuals deviating significantly from the rest of their community [22]. Pokrajac et al. devised an incremental LOF algorithm to detect outliers in data streams [23]. Other applications found in the literature include trajectory outlier detection [24] and outlier detection with uncertain data [25].

3. Notation Definitions

The symbols used in this article are defined in Table 1.

4. The Proposed Solution

4.1. Two Fundamental Assumptions for the Attribute Values

It is widely accepted that anomalous instances should be different from normal ones, yet how to quantify the difference remains an open issue in this field. In this study, we consider that the distributions of attribute values can serve as a basis for investigating the difference between the two kinds of instances: the accumulated discrepancy of the value distributions across attributes tells us how different the anomalous instances and normal ones are. Two assumptions about the data representation are presented as follows:

(1) From the perspective of probability theory, the distribution of an attribute's values can be depicted by a probability density function (pdf). Whether for normal or anomalous instances, we suppose that the attribute values of each group tend to cluster together around some concentrated probability density. Ideally, the probability density of the attribute values for a single class of instances (normal or anomalous) has only one peak, as shown in Figure 1, so the probability density of values deviating from the center is relatively small.

(2) We suppose that normal and anomalous instances form different clusters of values on a given attribute, which implies separability between them. Two kinds of mixed probability density distributions commonly arise, as shown in Figure 2. In the left diagram, the distributions of normal and anomalous instances overlap heavily, so the integrated distribution has just one peak. In the right diagram, the two distributions overlap less and the integrated distribution shows two peaks. In terms of the separability of the attribute values, the right diagram is clearly better than the left one. From this observation, we conclude that the separability of an attribute is affected by two factors: the distance between the peaks and the dispersion of the values within each category of instances.

4.2. The Study of the Separability on the Attribute’s Values

In this section, we quantify the separability between the normal and anomalous instances. Based on the assumptions raised in the last section, two factors play significant roles in distinguishing the two types of objects by attribute values. Let $V^a$ and $V^n$ denote the value sets of the anomalous and the normal instances on an attribute, and let $V = V^a \cup V^n$. We average the values of each type of instance to obtain a center, and define the distance between the two centers by the Euclidean distance

$$D = \left|\operatorname{mean}(V^a) - \operatorname{mean}(V^n)\right|, \tag{1}$$

where $\operatorname{mean}(\cdot)$ is the mean function. The dispersion of the attribute values for each type of instance is measured by the variance over the values, so we have

$$\sigma^2_a = \operatorname{var}(V^a), \qquad \sigma^2_n = \operatorname{var}(V^n), \tag{2}$$

where $\operatorname{var}(\cdot)$ is the variance function. Besides the two factors above, another influential factor is the information contained in an attribute. From the perspective of information theory, the more scattered the values of an attribute are, the more information the attribute carries and the more important it is in feature selection; we again use the variance to measure this degree of dispersion. Combining the three aspects, we propose the separability index

$$SI = \frac{D \cdot \operatorname{var}(V)}{\sigma^2_a + \sigma^2_n} \tag{3}$$

to measure how separable an attribute is in distinguishing normal instances from anomalous ones. Equation (3) shows that the separability of an attribute is positively correlated with $D$ and $\operatorname{var}(V)$ and negatively correlated with $\sigma^2_a + \sigma^2_n$. As a criterion, the separability index therefore allows us to select an attribute with a high value of (3) to partition the instances, which answers the first question regarding how to choose a suitable attribute for iTree building.
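To make the computation of (1)–(3) concrete, the following minimal Python sketch evaluates the separability index of one attribute. The function name, the epsilon guard, and the exact algebraic form are our assumptions, since only the qualitative dependencies are fixed by the text.

```python
import numpy as np

def separability_index(values_anom, values_norm):
    """Separability index of one attribute, following (1)-(3).

    values_anom / values_norm: attribute values of the (assumed known)
    anomalous and normal instances.
    """
    values_anom = np.asarray(values_anom, dtype=float)
    values_norm = np.asarray(values_norm, dtype=float)
    all_values = np.concatenate([values_anom, values_norm])

    dist = abs(values_anom.mean() - values_norm.mean())   # (1): distance between centers
    dispersion = values_anom.var() + values_norm.var()    # (2): within-class dispersion
    information = all_values.var()                        # spread of the whole attribute

    return dist * information / (dispersion + 1e-12)      # (3); epsilon avoids division by zero
```

During tree building the class labels are unknown, so the same index has to be evaluated on the two sides of a candidate split, as described next.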

For the values of a given attribute, some correspond to the normal instances and the others to the anomalous ones, but we do not know in advance which part of the values belongs to which class. We therefore need a split value that divides the attribute values into two parts. Based on the separability of an attribute discussed above, if we choose a value $v$ from the attribute values as the split value, partitioning $V$ into $V_l = \{x \in V : x \le v\}$ and $V_r = \{x \in V : x > v\}$, we can calculate the separability index of the split by

$$SI(v) = \frac{D(v)\cdot \operatorname{var}(V)}{\operatorname{var}(V_l) + \operatorname{var}(V_r)}, \qquad D(v) = \left|\operatorname{mean}(V_l) - \operatorname{mean}(V_r)\right|. \tag{4}$$

The attribute value with the highest value of the index should then be the best split point. Note that, when performing parameter tuning, we found that the roles of the distance between the centers and the variance should be enhanced in the formula, so we slightly change it to

$$SI(v) = \frac{D(v)^2 \cdot \operatorname{var}(V)^2}{\operatorname{var}(V_l) + \operatorname{var}(V_r)}. \tag{5}$$

4.3. A Gradient Method for Fast Searching the Split Point

For a given attribute, if we sort all of its values in ascending order, the simplest way to locate the best split value is to calculate the separability index for each sorted value one by one and choose the value with the largest index as the split point. However, when the number of values is huge, an exhaustive search is not a good solution. Hence, we propose an approximately optimal method for searching the split point of a given attribute. Calculating the gradient of the separability index between two adjacent attribute values guides us to skip attribute values that cannot become the split point and thus speeds up the search. The gradient for two neighboring attribute values $v_i$ and $v_{i+1}$ is defined as

$$g_i = \frac{SI(v_{i+1}) - SI(v_i)}{v_{i+1} - v_i}. \tag{6}$$

While scanning the attribute values from small to large, the values should be inspected one after another when the gradient of two adjacent candidate split values is close to zero, which means the search is possibly approaching the best split value. On the other hand, the more the gradient deviates from zero, the more attribute values can be skipped, because the current two attribute values and their neighbors have little chance of being the best split value. We therefore define a step size based on the gradient to determine how many values can be skipped while searching,

$$step_i = \max\left(1,\ \left\lceil \left|g_i\right| \cdot m \right\rceil\right), \tag{7}$$

where $m$ denotes the number of values of the attribute. The detailed search procedure is summarized in Algorithm 1. Note that, in the implementation, the derived split point is the mean of two neighboring attribute values.

Input:
V = {v_1, ..., v_m}: sorted values of the given attribute;
Output:
v*: the attribute value having the largest value of the separability index;
SI*: the largest value of the separability index;
(1) Initiate i as 1
(2) Let v* be v_1, SI* be SI(v_1), and step be 1
(3) while i + step ≤ m do
(4) Set i = i + step
(5) Set s = SI(v_i)
(6) if s > SI* then
(7) Set SI* = s
(8) Set v* = v_i
(9) end if
(10) Update step by following formulas (6) and (7)
(11) end while
(12) Return v* and SI*
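A compact Python sketch of Algorithm 1 is given below, reusing the notation of (4)–(7). The squared form of (5), the step-size cap, and the names `split_index` and `best_split` are our assumptions rather than the authors' implementation.

```python
import numpy as np

def split_index(values, i):
    """Separability index (5) when sorted `values` are split after position i."""
    left, right = values[:i + 1], values[i + 1:]
    dist = abs(left.mean() - right.mean())
    dispersion = left.var() + right.var() + 1e-12
    return (dist ** 2) * (values.var() ** 2) / dispersion

def best_split(values):
    """Gradient-guided search for the best split point (sketch of Algorithm 1)."""
    values = np.sort(np.asarray(values, dtype=float))
    m = len(values)
    if m < 3:                                   # too few values for a meaningful split
        return float(values.mean()), 0.0
    i, step = 1, 1
    best_i, best_si = 1, split_index(values, 1)
    prev_si, prev_v = best_si, values[1]
    while i + step < m - 1:
        i += step
        si = split_index(values, i)
        if si > best_si:
            best_i, best_si = i, si
        grad = (si - prev_si) / (values[i] - prev_v + 1e-12)   # gradient, eq. (6)
        step = max(1, int(np.ceil(abs(grad) * m)))             # skip width, eq. (7)
        step = min(step, max(1, m // 10))                      # cap (our addition)
        prev_si, prev_v = si, values[i]
    # the returned split point is the mean of two neighbouring attribute values
    return float((values[best_i] + values[best_i + 1]) / 2.0), float(best_si)
```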
4.4. Parameter Setting of the Separability Index

In some cases, we notice that (5) yields a very biased split point, contrary to expectation. We illustrate this with an example. We generate one-dimensional two-class data from two different Gaussian distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, where the first parameter is the mean and the second one is the variance. Of the synthetic data, 800 instances drawn from the first distribution are treated as normal and 200 instances drawn from the second distribution as anomalous. Before calculating the separability index, all values are normalized by

$$v' = \frac{v - v_{\min}}{v_{\max} - v_{\min}}, \tag{8}$$

where $v_{\max}$ and $v_{\min}$ are the maximum and the minimum of the generated points, respectively. We derive the best split point by searching for the largest value of the separability index over the attribute values, as shown in Figure 3. As can be seen, the obtained split point is clearly not the expected one. Investigating the values of the separability index shows that they are biased by the variances $\operatorname{var}(V_l)$ and $\operatorname{var}(V_r)$. The reasons behind this are analyzed as follows; in our opinion, two aspects of $\operatorname{var}(V_l)$ and $\operatorname{var}(V_r)$ should be considered when calculating the separability index:

(1) When the attribute values belonging to one class of the data make up the major part, the variance calculated over these values is relatively reliable, since a variance estimated from more instances is statistically more credible. In this case, to preserve this reliability, the obtained variance should be assigned a heavy weight.

(2) On the other hand, when the attribute values belonging to one class make up only a minor part, the derived variance is not very reliable, since a small number of samples lacks statistical significance and fewer instances distort the variance estimate. In this case, it is reasonable to assign a light weight to the obtained variance to minimize its biased influence on the separability index.

Based on the analysis mentioned above, we introduce two parameters into (5) to regulate the contributions of $\operatorname{var}(V_l)$ and $\operatorname{var}(V_r)$:

$$SI(v) = \frac{D(v)^2 \cdot \operatorname{var}(V)^2}{\alpha \operatorname{var}(V_l) + \beta \operatorname{var}(V_r)}, \tag{9}$$

where the parameters $\alpha$ and $\beta$ are set as

$$\alpha = \frac{|V_l|}{|V|}, \qquad \beta = \frac{|V_r|}{|V|}, \tag{10}$$

which have no fixed values but vary with the split value $v$. Recalculating the best split point in this way yields a more reasonable result, as shown in Figure 4, where the curve of the separability index forms the desired single peak.
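The normalization (8) and the weighted index (9)–(10) can be sketched as follows; the choice of the weights as the fraction of values on each side of the split is our assumption, consistent with the reasoning in items (1) and (2) above.

```python
import numpy as np

def minmax_normalize(values):
    """Min-max normalization of the attribute values, eq. (8)."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min() + 1e-12)

def weighted_split_index(values, i):
    """Separability index with adaptive weights alpha and beta, eqs. (9)-(10).

    A side holding the majority of the values receives the heavier weight,
    so its variance estimate dominates the denominator.
    """
    left, right = values[:i + 1], values[i + 1:]
    alpha, beta = len(left) / len(values), len(right) / len(values)   # eq. (10)
    dist = abs(left.mean() - right.mean())
    dispersion = alpha * left.var() + beta * right.var() + 1e-12
    return (dist ** 2) * (values.var() ** 2) / dispersion             # eq. (9)
```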

4.5. The Algorithms for iTree and iForest Building

An iTree is a height-limited proper binary tree; in other words, it is a tree truncated by depth. In general, given $\psi$ training instances, the height of the tree does not exceed $\lceil \log_2 \psi \rceil$. The depth of a leaf node in the iTree is therefore not the depth it would have if the tree were fully grown. If $e(x)$ denotes the depth of a leaf node containing $n$ instances, its depth in the fully grown tree can be estimated as

$$\hat{h}(x) = e(x) + c(n), \qquad c(n) = 2\bigl(\ln(n-1) + \gamma\bigr) - \frac{2(n-1)}{n}, \tag{11}$$

where $\gamma \approx 0.5772$ is Euler's constant. We can then mark the leaf nodes of the iTree by

$$\operatorname{label}(x) = \begin{cases} 0, & \hat{h}(x) < c(\psi), \\ 1, & \text{otherwise}, \end{cases} \tag{12}$$

so that leaves whose estimated depth is shorter than the expected average path length $c(\psi)$ are labeled anomalous. The detailed algorithm description is shown in Algorithm 2.

Input:
X: a dataset with d attributes and n records;
e: current depth of the tree;
l: limit of the tree height;
Output:
T: an optimized isolation tree;
(1) if e ≥ l or |X| ≤ 1 then
(2) return a leaf node marked with the label given by (12)
(3) end if
(4) randomly select m distinct attributes from the d attributes of X
(5) pick out the attribute A* with the highest value of the separability index and its split point v* by Algorithm 1
(6) X_l ← {x ∈ X | x.A* ≤ v*}
(7) X_r ← {x ∈ X | x.A* > v*}
(8) Return a node T with children OIT(X_l, e + 1, l) and OIT(X_r, e + 1, l)
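A recursive Python sketch of Algorithm 2 follows; it reuses `best_split` from the sketch of Algorithm 1 above. The node structure, the default number of candidate attributes, and the use of c(ψ) as the labelling threshold of (12) are our assumptions.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search on n points, used in (11)."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

class OITNode:
    def __init__(self, attr=None, split=None, left=None, right=None, label=None):
        self.attr, self.split = attr, split
        self.left, self.right, self.label = left, right, label

def build_oit(X, depth, height_limit, n_total=None, n_candidate_attrs=5):
    """X: list of records (dicts mapping attribute name to value)."""
    if n_total is None:
        n_total = len(X)                                   # psi: size of the training sample
    if depth >= height_limit or len(X) <= 1:
        est_depth = depth + c(len(X))                      # eq. (11)
        return OITNode(label=0 if est_depth < c(n_total) else 1)   # eq. (12)
    attrs = random.sample(list(X[0].keys()), min(n_candidate_attrs, len(X[0])))
    best_attr, best_val, best_si = None, None, -math.inf
    for a in attrs:                                        # attribute with the highest index
        v, si = best_split([x[a] for x in X])              # Algorithm 1 sketch above
        if si > best_si:
            best_attr, best_val, best_si = a, v, si
    X_left = [x for x in X if x[best_attr] <= best_val]
    X_right = [x for x in X if x[best_attr] > best_val]
    return OITNode(attr=best_attr, split=best_val,
                   left=build_oit(X_left, depth + 1, height_limit, n_total, n_candidate_attrs),
                   right=build_oit(X_right, depth + 1, height_limit, n_total, n_candidate_attrs))
```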

To determine whether an instance is normal, one simply traverses the tree from the root, making the usual comparisons; if the leaf node reached carries label 0, the instance is judged anomalous, and otherwise it is normal. It is thus no longer necessary to calculate the anomaly score used in the literature [2]. The proposed iTree algorithm has a time complexity similar to that of the original version: suppose the training set contains $n$ samples and $m$ candidate features (attributes) are examined to construct an OIT of depth $h = \lceil \log_2 n \rceil$; building each level of the tree requires $O(mn)$ calculations, so the complexity of the algorithm is $O(mn\log n)$.
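Continuing the sketch above, labelling an instance amounts to one root-to-leaf walk:

```python
def classify(node, x):
    """Walk the OIT from the root to a leaf; return 1 for normal, 0 for anomalous."""
    while node.label is None:      # internal nodes carry an attribute/split test
        node = node.left if x[node.attr] <= node.split else node.right
    return node.label
```

Each comparison descends one level, so a single query costs at most the height limit of the tree.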

The algorithm for building the iForest is largely similar to the one in the literature [2]. The main difference between the new iForest and the original one is that it does not require the calculation of the anomaly score; instead, an instance $x$ is judged by the criterion

$$\operatorname{label}(x) = \begin{cases} 0, & \bar{h}(x) < c(\psi), \\ 1, & \text{otherwise}, \end{cases} \tag{13}$$

where $\bar{h}(x)$ is the average of the path lengths of $x$ over all the trees in the iForest.
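A forest-level sketch, reusing the helpers above, is shown below; the subsample size, the number of trees, and the aggregation of (13) by average path length are our assumptions about reasonable defaults rather than the authors' exact settings.

```python
import math
import random

def build_forest(X, n_trees=20, sample_size=256):
    """Grow an optimized iForest from random subsamples of the data."""
    height_limit = math.ceil(math.log2(sample_size))
    return [build_oit(random.sample(X, min(sample_size, len(X))), 0, height_limit)
            for _ in range(n_trees)]

def path_length(node, x):
    """Depth of the leaf reached by x in one tree (truncated, without the (11) correction)."""
    depth = 0
    while node.label is None:
        node = node.left if x[node.attr] <= node.split else node.right
        depth += 1
    return depth

def is_anomalous(forest, x, sample_size=256):
    """Criterion (13): a short average path length over the trees marks an anomaly."""
    avg = sum(path_length(tree, x) for tree in forest) / len(forest)
    return avg < c(sample_size)
```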

5. Experiments

5.1. Description of the Datasets and Evaluation Metric

The datasets used for evaluation in this paper come from two sources. The first dataset is provided by the China Union-pay company and contains statistical expenditure information about merchants (the data is available upon request). Of the 3074 merchants in this dataset, there are two types: anomalous merchants exhibiting tax-evasion behavior and normal ones. The remaining datasets come from the openly accessible UCI data repository (http://archive.ics.uci.edu/ml/index.php). If a dataset contains just two kinds of instances, the minority class is regarded as the anomaly data; for datasets with more than two classes, we merge all minority classes into a single anomaly class for convenience. Detailed information about the datasets used in this article is summarized in Table 2.
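For the UCI datasets, the relabelling described above can be done with a small helper like the following; the function and column names are hypothetical.

```python
import pandas as pd

def to_anomaly_labels(df, label_col):
    """Keep the majority class as normal (1) and merge all minority classes
    into a single anomaly class (0), as in the evaluation protocol above."""
    majority = df[label_col].value_counts().idxmax()
    y = (df[label_col] == majority).astype(int)
    X = df.drop(columns=[label_col])
    return X, y
```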

5.2. Detection Accuracy Comparisons with the State-of-the-Art Models

The evaluation metric used in this article is the AUC, i.e., the area under the ROC curve; the larger the area, the better the model's accuracy. We compare the anomaly detection accuracy of OIT, OIF, iForest, SCiForest, and LOF. To avoid biased outcomes, 10-fold cross validation is adopted to obtain the averaged AUC values and the corresponding standard deviations. According to statistical tests, the experimental results are significant at a significance level of 0.05. Based on the results shown in Table 3, apart from two of the datasets, OIF outperforms the other methods on the six remaining datasets. These encouraging results suggest that the proposed solution indeed gives an overall improvement over its predecessors, iForest and SCiForest.
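The evaluation protocol can be reproduced along the following lines. Since the OIF implementation is not publicly released, the sketch uses scikit-learn's IsolationForest as a stand-in baseline, and X, y are assumed to be NumPy arrays with y = 1 for normal and 0 for anomalous instances.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_auc(X, y, n_splits=10, seed=0):
    """Averaged AUC and standard deviation over stratified 10-fold cross validation."""
    aucs = []
    for train_idx, test_idx in StratifiedKFold(n_splits, shuffle=True,
                                               random_state=seed).split(X, y):
        model = IsolationForest(random_state=seed).fit(X[train_idx])
        scores = -model.score_samples(X[test_idx])          # higher score = more anomalous
        aucs.append(roc_auc_score(1 - y[test_idx], scores))  # anomalies are the positive class
    return float(np.mean(aucs)), float(np.std(aucs))
```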

The training time of the proposed method on the eight datasets is reported in Table 4. The testing machine is a desktop with 8 GB of memory and a 3.4 GHz Intel Core i3 processor.

6. Analysis

6.1. Tree Structure Comparisons between the Proposed Model and iForest

Both iForest and our model build a proper binary tree to isolate outliers. However, when the two models are applied to the same datasets, the generated trees look very different in structure. Figure 5 gives two examples, showing the trees generated by the two models on two of the datasets. The trees built by OIT are well balanced, whereas those built by iForest are extremely unbalanced, which leads to relatively low computational efficiency.

6.2. Convergence Comparisons between the Proposed Model and iForest

In this section, we evaluate the convergence speed of the proposed model. Since both OIF and iForest involve a convergence process when building the forest, we would like to find out which converges more quickly. In Figure 6, we randomly generate 1000 points, of which 800 are normal and 200 anomalous, and mark one anomalous point and one normal point to be detected by the models. To compare the convergence speed of OIF and iForest, we grow a number of trees to constitute a forest for each model and calculate the average path length of the two marked points over the generated trees. The standard deviation of the average path length over the trees serves as a criterion for deciding whether the average path length has converged as the size of the forest increases.

The experiments employ downsampling rates of 0.25, 0.5, and 0.75, respectively. The threshold is set to 0.15, meaning that when the deviation of the average path lengths over 50 generated forests is smaller than this threshold, the model is considered to have converged. Figure 7 shows that OIF requires far fewer trees than iForest to make the deviation of the average path lengths converge. Figure 8 further validates that our model converges faster than iForest on a real dataset.
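The convergence test can be sketched as follows, reusing the `path_length` helper from Section 4.5. The forest sizes and the `build_tree` callback are illustrative choices; only the 0.15 threshold and the 50 repetitions are taken from the text.

```python
import numpy as np

def trees_to_converge(build_tree, X, point, threshold=0.15, n_forests=50, max_trees=200):
    """Smallest forest size whose average path length for `point` is stable.

    For every candidate size t, n_forests independent forests of t trees are
    grown; convergence is declared once the standard deviation of the average
    path length of `point` across those forests falls below the threshold.
    """
    for t in range(1, max_trees + 1):
        avgs = []
        for _ in range(n_forests):
            forest = [build_tree(X) for _ in range(t)]
            avgs.append(np.mean([path_length(tree, point) for tree in forest]))
        if np.std(avgs) < threshold:
            return t
    return max_trees
```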

6.3. The Impact of Point Density on the Proposed Model

To further study the performance of the proposed models on datasets with different point densities, we generate the two synthetic datasets shown in Figure 9. In each, 800 normal points and 200 anomalous points with two-dimensional features are drawn from Gaussian distributions with varied standard deviations to represent different point densities. The experimental results in Table 5 indicate that OIT and OIF tend to perform well when the anomalous points are more scattered than the normal ones.

6.4. The Impact of Point Distribution on the Proposed Model

We select four functions available in scikit-learn (an open-source Python package) to generate synthetic datasets with specified distributions: Gaussian, Moons, Circles, and Classification. As shown in Figure 10, each distribution has 200 instances with a 4:1 ratio of normal instances to anomalies. The AUC results obtained in these experiments are shown in Table 6. They indicate that the compared models each have their own merits on distinct data distributions and, accordingly, their own suitable application scenarios.
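The four scikit-learn generators named above can be invoked as follows. The noise levels and random seeds are our choices, and obtaining the exact 4:1 normal-to-anomaly ratio would still require subsampling one class, since most of these generators produce balanced classes by default.

```python
from sklearn.datasets import make_blobs, make_circles, make_classification, make_moons

n = 200
datasets = {
    "Gaussian": make_blobs(n_samples=n, centers=2, random_state=0),
    "Moons": make_moons(n_samples=n, noise=0.05, random_state=0),
    "Circles": make_circles(n_samples=n, noise=0.05, factor=0.5, random_state=0),
    "Classification": make_classification(n_samples=n, n_features=2, n_informative=2,
                                          n_redundant=0, n_clusters_per_class=1,
                                          weights=[0.8, 0.2], random_state=0),
}
```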

7. Conclusion

In this article, we proposed a novel model for outlier detection that is an optimized version of iForest. Compared with iForest, the new model makes the following advances. (1) A separability index is devised that allows us to select a proper attribute and accurately locate the split point on it. (2) A fast, gradient-based search algorithm is proposed to speed up the retrieval of the best split point. (3) Compared with iForest, our model requires fewer trees to reach convergence while generating the forest. Furthermore, the detection accuracy results on eight real datasets validate that our model performs better overall than iForest and its improved version SCiForest. Of course, although we call it an optimized framework for iForest, the proposed model is not perfect and there is still room for further improvement. We see two directions for future work. (1) On the data with the circle distribution, iForest and its improved versions all performed much worse than LOF in terms of AUC; it would be of theoretical interest to investigate the possible causes. (2) All tree-based models for outlier detection rely on the strategy of randomization; although it is a powerful tool in applications, randomization is hard to interpret, and a deeper understanding of it is urgently needed.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Zhen Liu and Xin Liu are co-first authors for their equal contributions to this work.

Acknowledgments

Z. Liu acknowledges funding from the National Natural Science Foundation of China under Grant no. 60903073 and the Special Project of the Sichuan Youth Science and Technology Innovation Research Team (2013TD0006).