Mathematical Problems in Engineering

Volume 2018, Article ID 2318763, 13 pages

https://doi.org/10.1155/2018/2318763

## An Optimized Computational Framework for Isolation Forest

^{1}Web Sciences Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

^{2}Big Data Research Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

Correspondence should be addressed to Zhen Liu; quake.liu0625@gmail.com

Received 28 February 2018; Revised 9 April 2018; Accepted 26 April 2018; Published 11 June 2018

Academic Editor: Sotiris B. Kotsiantis

Copyright © 2018 Zhen Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Isolation Forest (iForest) is one of the most effective outlier detectors proposed in recent years. However, because its model is built mainly by randomization, it is unclear how to select a proper attribute and how to locate an optimized split point on a given attribute while building an isolation tree. To address these two issues, we propose an improved computational framework that seeks the most separable attributes and locates the corresponding optimized split points effectively. The experimental results show that the proposed model achieves better overall outlier-detection accuracy than the original model and its related variants.

#### 1. Introduction

In the information society, tremendous volumes of data are generated every second to record events. A few of these records are special: they do not conform to expectations and can be treated as anomalies. Anomalies exist in nearly every field; for example, a spam email, an incidental machine breakdown, or a hacking attack on the Web can all be regarded as anomalous events in real life. According to the definition given by Hawkins [1], an anomaly, also called an outlier, differs markedly from the observed normal objects and is suspected to be generated by a different mechanism. Anomalies should therefore be few and distinct. Detecting them accurately is the main challenge of the field, which is commonly called anomaly detection or outlier detection.

In practice, data annotated with normal or abnormal labels, which would provide the prior knowledge needed to identify anomalies, are usually unavailable. Moreover, anomalies in a dataset are typically rare, so the numbers of normal and abnormal instances tend to be heavily unbalanced. For these two reasons, anomaly detection cannot be treated as a typical classification problem solvable by traditional supervised machine learning methods. As a result, most models proposed over the past decades are unsupervised, including statistical, distance-based, density-based, and tree-based models.

If a dataset with multivariate attributes contains both normal and anomalous points, then, in contrast to the traditional viewpoints, we consider the essential distinction between anomalies and normal instances to lie in their discrepant values on each attribute. In other words, the different value distributions of normal and anomalous points on each attribute make them separable. iForest is one of the successful attempts to partition attribute values effectively through randomization [2]. Yet, in its problem setting, it remains unclear how to select a proper split point.

Despite the advantages of iForest, we consider that it still leaves room for improvement. For example, both the selection of the attribute and the determination of the split value on the chosen attribute are completely arbitrary when building the iForest, and smarter methods deserve study. Moreover, since each tree in the iForest, i.e., an iTree, is a binary tree whose growth is constrained by depth, the leaf (external) nodes of the tree should play a role in identifying whether an instance is normal, yet this has not been considered in previous studies. Aiming at these issues, we propose a new solution that optimizes the tree-building process by answering three key questions.

(1) How to select a suitable or distinguishable attribute to partition the instances?

(2) How to determine the appropriate split point on a given attribute? A split point is a chosen attribute value that partitions the data into two sets on that attribute.

(3) How to mark each leaf node with a proper category label? Since anomaly detection involves only two kinds of labels, we can mark a leaf node with label 1 for normal instances and 0 for anomalies. During a detection process in an iTree, when the decision steps reach a leaf node, the leaf's label (1 or 0) determines whether the instance is identified as normal.
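As background for these questions, the original iForest builds each iTree by choosing both the attribute and the split point at random. A minimal stdlib-only sketch of that baseline scheme (the function names and dict-based tree layout are illustrative, not the authors' implementation):

```python
import random

def build_itree(X, attrs, depth, max_depth):
    """Grow one isolation tree on a list of numeric records using the
    original iForest scheme: attribute and split point chosen at random.
    Returns nested dicts; a leaf records how many points it isolates."""
    if depth >= max_depth or len(X) <= 1:
        return {"size": len(X)}
    a = random.choice(attrs)                       # random attribute
    lo = min(x[a] for x in X)
    hi = max(x[a] for x in X)
    if lo == hi:                                   # attribute cannot split
        return {"size": len(X)}
    split = random.uniform(lo, hi)                 # random split point
    left = [x for x in X if x[a] < split]
    right = [x for x in X if x[a] >= split]
    return {"attr": a, "split": split,
            "left": build_itree(left, attrs, depth + 1, max_depth),
            "right": build_itree(right, attrs, depth + 1, max_depth)}

def path_length(x, node, depth=0):
    """Depth at which x is isolated; shorter paths suggest anomalies."""
    if "size" in node:
        return depth
    child = "left" if x[node["attr"]] < node["split"] else "right"
    return path_length(x, node[child], depth + 1)
```

The three questions above replace exactly the two `random` calls (attribute choice, split choice) and the bare leaf record with informed decisions.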

The optimized iTree and the corresponding improved iForest have four merits:

(1) Via a heuristic, gradient-based searching method, the new model is able to locate the best split point on a given attribute efficiently.

(2) Compared to the extremely unbalanced isolation tree generated by the original model, i.e., iForest, the iTree generated by the proposed model is well balanced.

(3) Compared to iForest, the new model needs fewer isolation trees to constitute the forest.

(4) It achieves more favorable outlier-detection accuracy, in terms of AUC, than the state-of-the-art methods.
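The AUC criterion in merit (4) is a ranking measure: the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal instance (ties counting half). A stdlib-only sketch of this rank-based computation (the function name and sample values are illustrative):

```python
def auc(y_true, scores):
    """Rank-based AUC: probability that a random anomaly (label 1)
    scores above a random normal point (label 0); ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two anomalies scored 0.35 and 0.8 against normals scored 0.1 and 0.4:
# three of four anomaly/normal pairs are ranked correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```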

The remainder of the article is organized as follows. Section 2 reviews related studies in the field of outlier detection. Section 3 introduces the symbol definitions used in the rest of this paper. Section 4 presents the motivation of our study and a quantified analysis of the proposed model. Section 5 gives detailed algorithm descriptions. Section 6 discusses extended experiments comparing our model with existing methods. Finally, Section 7 concludes our work.

#### 2. Related Work

Although extensive surveys of outlier detection have appeared in recent review articles [3–6], outlier detection remains an active and fast-moving area of data mining, so some new insights and trends are introduced here. In our view, the state-of-the-art approaches can be categorized as follows.

##### 2.1. Distance-Based Approaches

The distance-based approaches [7] are commonly regarded as the basic methods in outlier detection. The original idea is to compute the distance between a given object and its k-th nearest neighbor and to identify the objects with the largest such distances as outliers. Several variants followed, such as kNN-weight [8] and outlier detection using indegree number (ODIN) [9]. kNN-weight defines the weight of an object as the sum of the distances to its k nearest neighbors, which is better suited to high-dimensional data. ODIN formalizes the nearest-neighbor relations into a graph and detects the low-degree nodes of the graph as outliers. There are also related models that use pruning techniques for ranking the top-N outliers by distance [10, 11]. Together these constitute the family of distance-based approaches.
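The k-th-nearest-neighbor score and its weight-based variant described above can be sketched as follows (stdlib-only; names are illustrative, and the naive O(n²) pairwise scan ignores the pruning techniques of [10, 11]):

```python
import math

def knn_outlier_scores(points, k):
    """Distance-based score: the distance from each point to its k-th
    nearest neighbor; the largest scores flag the outliers."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def knn_weight(points, k):
    """kNN-weight variant [8]: sum of distances to the k nearest
    neighbors, which smooths the single k-th distance."""
    return [sum(sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)[:k])
            for i, p in enumerate(points)]
```

For a tight cluster plus one distant point, both scores rank the distant point highest.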

##### 2.2. Density-Based Approaches

These approaches introduce the concept of the Local Outlier Factor (LOF) [12]: each instance is assigned a score, based on its neighbors' local density, denoting its degree of outlierness. A potential outlier is identified by a relatively high LOF value. Based on this idea, several extended models have been proposed. INFLO uses both k-nearest neighbors and reverse k-nearest neighbors to calculate the LOF value [13]. LOCI further introduced the multigranularity deviation factor (MDEF), which uses r-neighborhoods rather than k-nearest neighborhoods [14]. To alleviate the sensitivity of parameter tuning, the improved model LDOF was proposed, though it performs similarly to the traditional density-based approaches [15]. For high-dimensional data, some works have tried to find the difference between normal and anomalous instances in meaningful subspaces of the data space [16, 17].
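For illustration, the basic LOF computation can be sketched as follows (a simplified stdlib-only version assuming no duplicate points; names are ours, and the INFLO, LOCI, and LDOF extensions are not covered):

```python
import math

def lof_scores(points, k):
    """Simplified Local Outlier Factor: scores near 1 mean a density
    similar to the neighbors'; clearly larger values flag outliers."""
    n = len(points)
    knn, kdist = [], []
    for i in range(n):  # k nearest neighbors and k-distance per point
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        knn.append(order[:k])
        kdist.append(math.dist(points[i], points[order[k - 1]]))

    def reach(i, j):  # reachability distance from i to neighbor j
        return max(kdist[j], math.dist(points[i], points[j]))

    # local reachability density, then LOF as a density ratio
    lrd = [k / sum(reach(i, j) for j in knn[i]) for i in range(n)]
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]
```

On four unit-square corners plus one far point, the far point's LOF is well above 1 while the cluster points score about 1.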

##### 2.3. Tree-Based Approaches

In recent years, tree-based approaches have attracted considerable attention in outlier detection for their outstanding detection accuracy and scalability. Typical models include IHCCUDT [18], iForest [2], SCiForest [19], and RRCT [20]. iForest and RRCT both adopt a randomization strategy for tree building; RRCT considers the impact of the outliers on the entire dataset, while iForest focuses merely on the cost of isolating instances. IHCCUDT and SCiForest both consider how to reduce the inconsistency within the subset produced by each partition. IHCCUDT uses a small number of labeled instances to label the remaining unlabeled ones. SCiForest follows the idea of iForest, identifying anomalies by the depth of the leaf nodes, and achieves enhanced performance by introducing random hyperplanes in subspaces.
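For reference, iForest converts the expected isolation depth E[h(x)] into the published anomaly score s(x, n) = 2^(−E[h(x)]/c(n)), where c(n) normalizes by the average path length of an unsuccessful binary-search-tree lookup. A small sketch of this scoring formula (function names are ours):

```python
import math

def c(n):
    """Average path length of an unsuccessful BST search over n points,
    using the harmonic-number approximation H(i) ~ ln(i) + Euler gamma."""
    if n <= 1:
        return 0.0
    h = math.log(n - 1) + 0.5772156649
    return 2.0 * h - 2.0 * (n - 1) / n

def anomaly_score(avg_depth, n):
    """iForest score s(x, n) = 2 ** (-E[h(x)] / c(n)): values near 1
    indicate anomalies; values well below 0.5 indicate normal points."""
    return 2.0 ** (-avg_depth / c(n))
```

Shallow average isolation depths thus map to scores near 1, deep ones toward 0, independently of the sample size n.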

##### 2.4. Scenario-Aware Approaches

Applying available outlier detection models to specific problem scenarios is also a trend in this field because of its evident practical interest. Schubert et al. proposed a unified framework that abstracts the notion of locality from the classic distance-based notion and applied it to spatial, video, and network outlier detection [21]. For communities in networks, Gao et al. used a generative model called CODA to detect outliers, defined as individuals deviating significantly from the rest of their community [22]. Pokrajac et al. devised an incremental LOF algorithm for outlier detection over data streams [23]. Other special applications in the literature include trajectory outlier detection [24] and outlier detection with uncertain data [25].

#### 3. Notation Definitions

The symbols used in this article are defined in Table 1.