Mathematical Problems in Engineering

Volume 2017, Article ID 2649535, 12 pages

https://doi.org/10.1155/2017/2649535

## A Distributed Algorithm for the Cluster-Based Outlier Detection Using Unsupervised Extreme Learning Machines

^{1}College of Information Science & Technology, Dalian Maritime University, Dalian, Liaoning 116000, China

^{2}College of Information Science & Engineering, Northeastern University, Shenyang, Liaoning 110819, China

Correspondence should be addressed to Xite Wang; xite-skywalker@163.com

Received 25 November 2016; Accepted 13 March 2017; Published 9 April 2017

Academic Editor: Alberto Borboni

Copyright © 2017 Xite Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Outlier detection is an important data mining task whose goal is to find the abnormal or atypical objects in a given dataset. Outlier detection techniques have many applications, such as credit card fraud detection and environment monitoring. Our previous work proposed the Cluster-Based (CB) outlier and gave a centralized method that uses unsupervised extreme learning machines to compute CB outliers. In this paper, we propose a new distributed algorithm for CB outlier detection (DACB). On the master node, we collect a small number of points from the slave nodes to obtain a threshold. On each slave node, we design a new filtering method that uses the threshold to speed up the computation. Furthermore, we propose a ranking method to optimize the order in which clusters are scanned. Finally, the effectiveness and efficiency of the proposed approaches are verified through extensive simulation experiments.

#### 1. Introduction

Outlier detection is an important problem in data mining and has been widely studied for years. According to the description in [1], “an outlier is an observation in a dataset which appears to be inconsistent with the remainder of that set of data.” Outlier mining techniques can be applied to many fields, such as credit card fraud detection, network intrusion detection, and environment monitoring.

Outlier detection involves two primary tasks. First, we need to define which data are considered outliers in a given set. Second, an efficient method to compute these outliers needs to be designed. The outlier problem was first studied by the statistics community [2, 3]. These approaches assume that the given dataset follows a distribution, and an object is considered an outlier if it deviates distinctly from this distribution. However, it is almost impossible to find an appropriate distribution for high-dimensional data. To overcome this drawback, the data management community proposed model-free approaches, such as distance-based outliers [4–6] and the density-based outlier [7]. Although these definitions make no assumption about the dataset, they still have shortcomings. Therefore, in this paper, we propose a new definition, the Cluster-Based (CB) outlier. The following example discusses the weaknesses of the existing model-free approaches and the motivation of our work.

As Figure 1 shows, a two-dimensional dataset contains a dense cluster and a sparse cluster. Intuitively, two points are outliers, one associated with each cluster, because they differ obviously from the other points. However, in the definitions of the distance-based outliers, a point $p$ is marked as an outlier depending on the distances from $p$ to its $k$-nearest neighbors. Most of the points in the sparse cluster are then more likely to be reported as outliers, whereas one of the real outliers will be missed. In fact, we must consider the *localization* of outliers. In other words, to determine whether a point $p$ in a cluster $C$ is an outlier, we should only consider the points in $C$, since the points in the same cluster usually have similar characteristics. In this way, the outlier in each cluster of Figure 1 can be selected correctly. The density-based outlier [7] also considers the localization of outliers. For each point $p$, it uses the Local Outlier Factor (LOF) to measure the degree to which $p$ is an outlier. To compute the LOF of $p$, we need to find the set of its $k$-nearest neighbors and all the $k$-nearest neighbors of each point in that set. This expensive computation limits the practicability of the density-based outlier. Therefore, we propose the CB outlier to overcome the above deficiencies. The formal definition will be described in Section 3.
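The contrast between a global distance-based score and a cluster-local score can be illustrated with a minimal Python sketch. It is not the CB outlier definition (which is given in Section 3), and it does not reproduce the data of Figure 1. The grid coordinates, the two planted outliers, the hard-coded cluster labels, the choice of k = 3, and the median-ratio score are all assumptions made only for illustration.

```python
import math
import statistics

def kth_nn_distance(i, points, k):
    """Distance from points[i] to its k-th nearest neighbor among the other points."""
    p = points[i]
    dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

K = 3  # assumed neighborhood size

# Hypothetical data: a dense 5x5 grid (spacing 0.1) and a sparse 5x5 grid (spacing 1.0).
dense = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
sparse = [(10.0 + i, 10.0 + j) for i in range(5) for j in range(5)]
outlier_a = (0.9, 0.9)    # planted outlier next to the dense cluster
outlier_b = (17.0, 17.0)  # planted outlier next to the sparse cluster

# Cluster membership is hard-coded here; in practice it would come from a clustering step.
clusters = {"dense": dense + [outlier_a], "sparse": sparse + [outlier_b]}
all_points = clusters["dense"] + clusters["sparse"]

# Global distance-based score: k-th nearest neighbor distance over the whole dataset.
global_scores = {p: kth_nn_distance(i, all_points, K) for i, p in enumerate(all_points)}
global_top2 = sorted(global_scores, key=global_scores.get, reverse=True)[:2]

# Cluster-local score: the same distance, computed inside the point's own cluster
# and divided by the median of that cluster, so that each cluster sets its own scale.
local_scores = {}
for members in clusters.values():
    raw = [kth_nn_distance(i, members, K) for i in range(len(members))]
    typical = statistics.median(raw)
    for p, d in zip(members, raw):
        local_scores[p] = d / typical
local_top2 = sorted(local_scores, key=local_scores.get, reverse=True)[:2]

print("global top-2:", global_top2)
print("local top-2: ", local_top2)
```

In this toy setting, every point of the sparse grid has a larger global score than the outlier planted next to the dense grid, so the global ranking misses it, while the cluster-local ranking recovers both planted outliers.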