Abstract

Outlier detection is an important data mining task whose goal is to find abnormal or atypical objects in a given dataset. Techniques for detecting outliers have many applications, such as credit card fraud detection and environment monitoring. Our previous work proposed the Cluster-Based (CB) outlier and gave a centralized method using unsupervised extreme learning machines to compute CB outliers. In this paper, we propose a new distributed algorithm for CB outlier detection (DACB). On the master node, we collect a small number of points from the slave nodes to obtain a threshold. On each slave node, we design a new filtering method that uses the threshold to efficiently speed up the computation. Furthermore, we propose a ranking method to optimize the order of cluster scanning. Finally, the effectiveness and efficiency of the proposed approaches are verified through a series of simulation experiments.

1. Introduction

Outlier detection is an important issue in data mining, and it has been widely studied for years. According to the description in [1], "an outlier is an observation in a dataset which appears to be inconsistent with the remainder of that set of data." The techniques for mining outliers can be applied to many fields, such as credit card fraud detection, network intrusion detection, and environment monitoring.

There exist two primary tasks in outlier detection. First, we need to define which data are considered outliers in a given set. Second, an efficient method to compute these outliers needs to be designed. The outlier problem was first studied by the statistics community [2, 3]. They assume that the given dataset follows a distribution, and an object is considered an outlier if it shows distinct deviation from this distribution. However, it is almost impossible to find an appropriate distribution for high-dimensional data. To overcome this drawback, some model-free approaches have been proposed by the data management community. Examples include distance-based outliers [4–6] and the density-based outlier [7]. Although these definitions do not need any assumption on the dataset, they still have shortcomings. Therefore, in this paper, we propose a new definition, the Cluster-Based (CB) outlier. The following example discusses the weaknesses of the existing model-free approaches and the motivation of our work.

As Figure 1 shows, there are a dense cluster and a sparse cluster in a two-dimensional dataset. Intuitively, the two isolated points are the outliers because they show obvious differences from the other points. However, under the definitions of distance-based outliers, whether a point p is marked as an outlier depends on the distances from p to its k-nearest neighbors. Then, most of the points in the sparse cluster are more likely to be reported as outliers, whereas the real outlier belonging to the dense cluster will be missed. In fact, we must consider the locality of outliers. In other words, to determine whether a point p in a cluster C is an outlier, we should only consider the points in C, since the points in the same cluster usually have similar characteristics. Therefore, in Figure 1, the outlier of each cluster can be selected correctly. The density-based outlier [7] also considers the locality of outliers. For each point p, it uses the Local Outlier Factor (LOF) to measure the degree of being an outlier. To compute the LOF of p, we need to find the set S of its k-nearest neighbors and all the k-nearest neighbors of each point in S. The expensive computational cost limits the practicability of the density-based outlier. Therefore, we propose the CB outlier to overcome the above deficiencies. The formal definition is described in Section 3.

To detect CB outliers in a given set, the data need to be clustered first. In this paper, we employ the unsupervised extreme learning machine (UELM) [8] for clustering. The extreme learning machine (ELM) is a technique proposed by Huang et al. [9–11] for pattern classification, which shows better prediction accuracy than the traditional Support Vector Machines (SVMs) [12–15]. Thus far, ELM techniques have attracted the attention of many scholars, and various extensions of ELM have been proposed [16]. UELM [8] is designed for dealing with unlabeled data, and it can efficiently handle clustering tasks. The authors show that UELM provides favorable performance compared with the state-of-the-art clustering algorithms [17–20].

In [21], we studied the problem of CB outlier detection using UELM in a centralized environment. Faced with the increasing data scale, the performance of the centralized method becomes too limited to meet timeliness requirements. Therefore, in this paper, we develop a new efficient distributed algorithm for CB outlier detection (DACB). The main contributions are summarized as follows:

(1) We propose a framework of distributed CB outlier detection, which adopts a master-slave architecture. The master node keeps monitoring the points with large weights on each slave node and obtains a threshold t. The slave nodes can use t to efficiently filter the unpromising points and accelerate the computation.

(2) We propose a new algorithm to compute the point weights on each slave node. Compared with our previous method in [21], the new algorithm adopts a filtering technique to further improve the efficiency. We can filter out a large number of unpromising points, instead of computing their exact weights.

(3) We propose a new method to optimize the order of cluster scanning on each slave node. With the help of this method, we can obtain a large threshold t early and improve the filtering performance.

The rest of the paper is organized as follows. Section 2 gives brief overviews of ELM and UELM. Section 3 formally defines the CB outlier. Section 4 gives the framework of DACB. Section 5 illustrates the details of DACB. Section 6 analyzes the experimental results. Section 7 gives the related work of outlier detection. Section 8 concludes the paper.

2. Preliminaries

2.1. Brief Introduction to ELM

The target of ELM is to train a single hidden layer feedforward network from a training set with N samples, {(x_i, y_i) | i = 1, 2, ..., N}. Here, x_i is a d-dimensional input vector, and y_i is an m-dimensional binary vector in which only one entry is "1", indicating the class that x_i belongs to.

The training process of ELM includes two stages. In the first stage, we build the hidden layer with n_h nodes using a number of mapping neurons. In detail, for the jth hidden layer node, a d-dimensional weight vector w_j and a bias b_j are randomly generated. Then, for each input vector x_i, the output value of the jth hidden layer node can be acquired using an activation function such as the Sigmoid function below:

g(w_j, b_j, x_i) = 1 / (1 + exp(−(w_j · x_i + b_j))).

Then, the N × n_h matrix H outputted by the hidden layer is

H = [h(x_1); h(x_2); ...; h(x_N)], where h(x_i) = [g(w_1, b_1, x_i), g(w_2, b_2, x_i), ..., g(w_{n_h}, b_{n_h}, x_i)].

In the second stage, an m-dimensional vector β_j is the output weight vector that connects the jth hidden layer node with the output nodes. The output matrix is acquired by Ŷ = Hβ, where β = [β_1; β_2; ...; β_{n_h}] is the n_h × m output weight matrix.

The matrices H and Y are known. The target of ELM is to solve for the output weights β by minimizing the squared losses of the prediction errors, leading to the following formulation:

min_β (1/2)‖β‖² + (C/2) Σ_{i=1}^{N} ‖e_i‖²,  s.t. h(x_i)β = y_i^T − e_i^T, i = 1, ..., N,     (5)

where ‖·‖ denotes the Euclidean norm, e_i is the error vector with respect to the ith training sample, and C is a penalty coefficient on the training errors. The first term in the objective function is a regularization term against overfitting.

If N ≥ n_h, which means H has more rows than columns and is of full column rank, (6) is the solution for (5). Hence,

β* = (I_{n_h}/C + H^T H)^{-1} H^T Y.     (6)

If N < n_h, a restriction that β is a linear combination of the rows of H, β = H^T α, is considered. Then, β can be calculated by

β* = H^T (I_N/C + H H^T)^{-1} Y,

where I_{n_h} and I_N are the identity matrices of dimensions n_h and N, respectively.
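As a concrete illustration, the following is a minimal NumPy sketch of the two-stage ELM training described above, assuming the Sigmoid activation and the two closed-form solutions; the parameter names (n_hidden, C) and helper names are illustrative, not part of the original method.

```python
import numpy as np

def elm_train(X, Y, n_hidden=100, C=1.0, seed=0):
    """Train a basic ELM: random hidden layer, then closed-form output weights.

    X: (N, d) inputs; Y: (N, m) one-hot targets; C: penalty coefficient.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.normal(size=(d, n_hidden))       # random input weights w_j
    b = rng.normal(size=n_hidden)            # random biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden layer output (Sigmoid)

    if N >= n_hidden:
        # beta = (I/C + H^T H)^{-1} H^T Y, as in Equation (6)
        beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ Y)
    else:
        # beta = H^T (I/C + H H^T)^{-1} Y, the N < n_h case
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta          # class = argmax over the m output columns
```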

2.2. Unsupervised ELM

Huang et al. [8] proposed UELM to process unlabeled datasets, and it shows good performance in clustering tasks. The unsupervised learning is based on the following assumption: if two points x_i and x_j are close to each other, their conditional probabilities P(y | x_i) and P(y | x_j) should be similar. To enforce this assumption on the data, we acquire the following term:

L_m = (1/2) Σ_{i,j} w_{ij} ‖P(y | x_i) − P(y | x_j)‖²,     (8)

where w_{ij} is the pairwise similarity between x_i and x_j, which can be calculated by the Gaussian function exp(−‖x_i − x_j‖² / (2σ²)).

Since it is difficult to calculate the conditional probabilities, the following can approximate (8):

L̂_m = Tr(Ŷ^T L Ŷ),     (9)

where Tr(·) denotes the trace of a matrix, Ŷ contains the predictions for the unlabeled dataset, L = D − W is known as the graph Laplacian, and D is a diagonal matrix with diagonal elements D_{ii} = Σ_j w_{ij}.
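For illustration, the graph Laplacian used in (9) can be built as sketched below, assuming a fully connected similarity graph with the Gaussian function; sigma is an assumed bandwidth parameter.

```python
import numpy as np

def graph_laplacian(X, sigma=1.0):
    """Build W (Gaussian similarities), D (degree matrix), and L = D - W."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared pairwise distances
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                        # no self-similarity
    D = np.diag(W.sum(axis=1))
    return D - W
```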

In unsupervised learning, the dataset is unlabeled. Substituting (9) into (5), the objective function of UELM is acquired:

min_β ‖β‖² + λ Tr(β^T H^T L H β),     (10)

where λ is a tradeoff parameter. In most cases, (10) reaches its minimum value at β = 0. In [18], Belkin and Niyogi introduced an additional constraint (Hβ)^T Hβ = I. On the basis of the conclusion in [8], if n_h ≤ N, we can obtain the following generalized eigenvalue problem:

(I_{n_h} + λ H^T L H) v = γ H^T H v.     (11)

Let γ_1, γ_2, ..., γ_{n_o+1} be the (n_o + 1) smallest eigenvalues of (11) and v_1, v_2, ..., v_{n_o+1} be the corresponding eigenvectors. Then, the solution of the output weights is given by

β* = [ṽ_2, ṽ_3, ..., ṽ_{n_o+1}],

where ṽ_i = v_i / ‖H v_i‖, i = 2, ..., n_o + 1, are the normalized eigenvectors.

If n_h > N, (11) is underdetermined. We obtain the alternative formulation below:

(I_N + λ L H H^T) u = γ H H^T u.     (13)

Again, let u_2, u_3, ..., u_{n_o+1} be the generalized eigenvectors corresponding to the 2nd through the (n_o + 1)th smallest eigenvalues of (13). Then, the final solution is

β* = H^T [ũ_2, ũ_3, ..., ũ_{n_o+1}],

where ũ_i = u_i / ‖H H^T u_i‖, i = 2, ..., n_o + 1, are the normalized eigenvectors. Algorithm 1 shows the process of UELM.

Input. The training data: X = {x_i ∈ R^d}, i = 1, 2, ..., N.
Output. The label vector y of cluster indexes corresponding to X.
(a) Construct the graph Laplacian L of X;
(b) Generate a pair of random values (w_j, b_j) for each hidden neuron,
  and calculate the output matrix H;
(c) if n_h ≤ N then
   Find the generalized eigenvectors v_2, v_3, ..., v_{n_o+1} of Equation (11). Let
     β = [ṽ_2, ṽ_3, ..., ṽ_{n_o+1}];
  else
   Find the generalized eigenvectors u_2, u_3, ..., u_{n_o+1} of Equation (13). Let
     β = H^T [ũ_2, ũ_3, ..., ũ_{n_o+1}];
(d) Calculate the embedding matrix: E = Hβ;
(e) Treat each row of E as a point, and cluster the N points into
   K clusters using the k-means algorithm. Let y be the label
   vector of cluster indexes for all the points.
return y;
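A rough end-to-end sketch of Algorithm 1 is given below, restricted to the n_h ≤ N branch and with the embedding dimension set to the number of clusters; the small ridge added to H^T H is a numerical-stability assumption, not part of the original algorithm, and all names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def uelm_cluster(X, n_clusters, n_hidden=200, lam=0.1, sigma=1.0, seed=0):
    """Sketch of Algorithm 1 (n_hidden <= N branch): UELM embedding + k-means."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # (a) graph Laplacian L = D - W with Gaussian similarities
    sq = np.sum(X ** 2, axis=1)
    W = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    # (b) random hidden layer
    W_in, b = rng.normal(size=(d, n_hidden)), rng.normal(size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))
    # (c) generalized eigenproblem of Equation (11): (I + lam H^T L H) v = gamma H^T H v
    A = np.eye(n_hidden) + lam * H.T @ L @ H
    B = H.T @ H + 1e-6 * np.eye(n_hidden)      # small ridge for numerical stability
    _, vecs = eigh(A, B)
    V = vecs[:, 1:n_clusters + 1]              # discard the first (trivial) eigenvector
    V = V / np.linalg.norm(H @ V, axis=0)      # normalize each v_i by ||H v_i||
    # (d)-(e) embed and run k-means on the rows of E
    E = H @ V
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(E)
```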

3. Defining CB Outliers

For a given dataset DS in a d-dimensional space, a point p is denoted by (p[1], p[2], ..., p[d]). The distance between two points p and q is denoted by dist(p, q). Suppose that there are K clusters in DS outputted by UELM. For each cluster C, the centroid point o_C can be computed by the following equation:

o_C[i] = (1/|C|) Σ_{p ∈ C} p[i],  i = 1, 2, ..., d.

Intuitively, in a cluster C, most of the normal points lie closely around the centroid point o_C of C. In contrast, an abnormal point p (i.e., an outlier) is usually far from the centroid point, and the number of points close to p is quite small. Based on this observation, the weight of a point is defined as follows.

Definition 1 (weight of a point). Given an integer k, for a point p in cluster C, one uses kNN(p) to denote the set of the k-nearest neighbors of p in C. Then, the weight of p is

w(p) = (k · dist(p, o_C)) / (Σ_{q ∈ kNN(p)} dist(q, o_C)).

Definition 2 (result set of CB outlier detection). For a dataset DS, given two integers k and n, let R be a subset of DS with n points. If, for every point p in R, there is no point q in DS − R such that w(q) > w(p), then R is the result set of CB outlier detection.

For example, in Figure 1, in the dense cluster, the centroid point is marked in red. For the abnormal point p of this cluster, consider its k-nearest neighbors. Because p is far from the centroid point, dist(p, o_C) is much larger than the distances from p's neighbors to o_C. Hence, the weight of p is large. In contrast, for a normal point q deep in the cluster, the distances from the k-nearest neighbors of q to o_C are similar to dist(q, o_C), so the weight of q is close to 1. Therefore, p is more likely to be considered as a CB outlier.
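As an illustration, the sketch below computes the weight of every point in a single cluster by brute force, assuming Euclidean distances and the weight of Definition 1; no pruning is applied yet, and the names are illustrative.

```python
import numpy as np

def cluster_weights(points, k):
    """Compute the CB weight of every point in one cluster (brute-force kNN).

    points: (m, d) array of the points in cluster C, with m > k.
    Returns w(p) = k * dist(p, o_C) / sum of dist(q, o_C) over p's k nearest neighbors in C.
    """
    o = points.mean(axis=0)                        # centroid o_C
    d_cent = np.linalg.norm(points - o, axis=1)    # dist(p, o_C) for every p
    weights = np.empty(len(points))
    for i, p in enumerate(points):
        d = np.linalg.norm(points - p, axis=1)
        d[i] = np.inf                              # exclude p itself
        knn = np.argpartition(d, k)[:k]            # indices of the k nearest neighbors
        weights[i] = k * d_cent[i] / d_cent[knn].sum()
    return weights
```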

4. Framework of Distributed CB Outlier Detection

The target of this paper is to detect CB outliers in a distributed environment, which consists of a master node and a number of slave nodes. The master node is the coordinator, and it can communicate with all the slave nodes. Each slave node stores a subset of the clusters outputted by UELM, and it is the main worker in the outlier detection. Figure 2 shows the framework of the distributed algorithm for CB outlier detection (DACB) proposed in this paper.

When a request for outlier detection arrives, each slave node starts to scan its local clusters. Basically, in a cluster C, we need to search the kNNs of each point p in C to compute the weight of p. Obviously, computing the kNNs of all the points is very time-consuming. Therefore, in DACB, we propose a filtering method to accelerate the computation. Specifically, each slave node keeps tracking the local top-n points with the largest weights among the scanned points and sends them to the master node. From the received points, the master reserves the global top-n points with the largest weights. We choose the smallest of these weights as a threshold t. The master broadcasts t to the slave nodes to efficiently filter the unpromising points (the detailed filtering method is described in Section 5.1). On each slave node, if there emerges a point with weight larger than t, the point is sent to the master. The master periodically updates the global top-n points and the threshold t. At last, the points stored on the master node are the CB outliers.
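The master node's bookkeeping can be sketched as follows: it keeps a min-heap of the n largest weights reported so far, and the threshold t is the smallest weight in that heap. This is only a schematic single-process version; the report method stands in for a message from a slave node.

```python
import heapq

class Master:
    """Keeps the global top-n weights reported by slave nodes and exposes the threshold t."""

    def __init__(self, n):
        self.n = n
        self.heap = []                     # min-heap of (weight, point_id)

    def report(self, point_id, weight):
        """Called whenever a slave node sends a candidate outlier."""
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (weight, point_id))
        elif weight > self.heap[0][0]:
            heapq.heapreplace(self.heap, (weight, point_id))

    def threshold(self):
        """t = smallest of the current top-n weights (0 until n points are known)."""
        return self.heap[0][0] if len(self.heap) == self.n else 0.0

    def outliers(self):
        return sorted(self.heap, reverse=True)
```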

5. DACB Description

5.1. The Computation of Point Weight

In this section, we introduce the techniques to compute the point weights on each slave node. According to Definitions 1 and 2, to determine whether a point p in a cluster C is an outlier, we need to search the k-nearest neighbors (kNNs) of p in C. In order to accelerate the kNN search, we design an efficient method to prune the search space.

For a cluster C, suppose that the points in C have been sorted according to their distances to the centroid point o_C in ascending order. For a point p in C, we scan the points to search for its kNNs. Let S be the set of the k points that are nearest to p among the scanned points, and let d_max be the maximum distance from the points in S to p. Then, the pruning method is described as follows.

Theorem 3. For a point q in front of p, if dist(p, o_C) − dist(q, o_C) > d_max, then q and the points in front of q cannot be the kNNs of p.

Proof. For a point q′ that is q itself or in front of q, because the points in C have been sorted, dist(q′, o_C) ≤ dist(q, o_C). Then, according to the triangle inequality, the distance from p to q′ satisfies dist(p, q′) ≥ dist(p, o_C) − dist(q′, o_C) ≥ dist(p, o_C) − dist(q, o_C) > d_max. Clearly, there already exist k points closer to p than q′, and thus q′ cannot be a kNN of p.

Similarly, for a point q at the back of p, if dist(q, o_C) − dist(p, o_C) > d_max, then q and the points at the back of q cannot be the kNNs of p. For example, Figure 3 shows a portion of points in a cluster. First, we sort the points according to their distances to the centroid point and obtain a point sequence. For a point p, we search its kNNs by visiting the points from p toward both sides of the sequence. After a few points are visited, the current top-k nearest neighbors of p determine d_max. When a visited point q at the back of p satisfies dist(q, o_C) − dist(p, o_C) > d_max, the points behind q in the sequence cannot be the kNNs of p. Similarly, when a visited point in front of p satisfies the condition of Theorem 3, we do not need to further scan the points before it in the sequence. Therefore, the kNN search stops early, and the exact kNNs of p are obtained.
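The following sketch illustrates the two-sided scan with the stopping conditions of Theorem 3 and its mirrored version, assuming the cluster is already sorted by distance to the centroid; the helper names are illustrative.

```python
import numpy as np

def knn_with_pruning(points, d_cent, order, pos, k):
    """Find the kNNs of the point at sorted position `pos` using Theorem 3 pruning.

    points: (m, d) cluster points; d_cent: distances to the centroid;
    order: point indices sorted by d_cent ascending; pos: position of the query in `order`.
    """
    p = points[order[pos]]
    dp = d_cent[order[pos]]
    best = []                                   # list of (dist to p, index), size <= k
    d_max = np.inf
    left, right = pos - 1, pos + 1
    go_left, go_right = True, True
    while (go_left and left >= 0) or (go_right and right < len(order)):
        if go_left and left >= 0:
            q = order[left]
            if len(best) == k and dp - d_cent[q] > d_max:
                go_left = False                 # Theorem 3: q and everything before it are pruned
            else:
                best, d_max = _update(best, np.linalg.norm(points[q] - p), q, k)
                left -= 1
        if go_right and right < len(order):
            q = order[right]
            if len(best) == k and d_cent[q] - dp > d_max:
                go_right = False                # mirrored condition: prune q and everything behind it
            else:
                best, d_max = _update(best, np.linalg.norm(points[q] - p), q, k)
                right += 1
    return best

def _update(best, dist, idx, k):
    """Insert a visited point into the current kNN candidate set and refresh d_max."""
    best = sorted(best + [(dist, idx)])[:k]
    d_max = best[-1][0] if len(best) == k else np.inf
    return best, d_max
```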

Furthermore, after the weights of some points have been computed, we send the current top-n points with the largest weights to the master node to obtain the threshold t (mentioned in Section 4). We can use t to filter the points that cannot be CB outliers, instead of searching for their exact kNNs. The detailed method is stated as follows.

Theorem 4. In the kNN search of p in a cluster C, S is the set of the current k-nearest neighbors of p from the scanned points, and S* is the set of the exact k-nearest neighbors. One sorts the points in S and S* according to their distances to p in ascending order, respectively. Then, for the ith point q_i in S and the ith point q_i* in S*, one asserts that dist(q_i*, o_C) ≥ dist(p, o_C) − dist(p, q_i).

Proof. If q_i and q_i* are the same point, the theorem can be proven easily. Otherwise, dist(p, q_i*) ≤ dist(p, q_i). Using the triangle inequality, dist(q_i*, o_C) ≥ dist(p, o_C) − dist(p, q_i*) ≥ dist(p, o_C) − dist(p, q_i). The theorem is proven.

Corollary 5. For a point p, if the current k-nearest neighbor set S meets the following condition, p cannot be a CB outlier:

(k · dist(p, o_C)) / (Σ_{q_i ∈ S} (dist(p, o_C) − dist(p, q_i))) ≤ t.     (17)

Proof. If (17) holds, the upper bound of the weight of p is smaller than or equal to t according to Theorem 4. Therefore, p is not a CB outlier.

By utilizing Corollary 5, a large number of points can be filtered out safely, and we do not need to compute their exact kNNs. For example, in Figure 4, we search the kNNs of a point p. After only a few points have been visited, the current k-nearest neighbors of p already satisfy condition (17). Then, we can assert that p is not a CB outlier even though its exact kNNs have not been found. Therefore, p can be filtered out safely.

Algorithm 2 shows the process of CB outlier detection on each slave node. For each cluster C in the local cluster set, the points in C are sorted according to their distances to the centroid point in ascending order (line (2)). Then, we scan the points in reversed order (line (3)), since the points far from the centroid point are more likely to have large weights and a good threshold can be obtained early. For each scanned point p, we visit the points from p toward both sides to search the kNNs (line (8)). If a visited point meets Theorem 3, a number of points can be erased from the visited list since they cannot be the kNNs of p (lines (10)–(13)). If the current kNNs of p meet Corollary 5, p is not an outlier, so we do not further search its kNNs (lines (16)–(18)). After all points are visited, we send p to the master node if p is still not filtered out (lines (19) and (20)). At last, the points stored on the master node are the CB outliers.

Input. The cluster set CS reserved on slave node i, integers k and n, the threshold t
Output. The CB outliers in CS
(1) for each cluster C in CS do
(2)   Sort the points in C according to their distances to the centroid point o_C
    in ascending order;
(3)   Scan the points in reversed order;
(4)   for each scanned point p do
(5)    Initialize a heap S; // to reserve the current kNNs of p
(6)    d_max = +∞; // the largest distance from the points in S to p
(7)   boolean flag = true;
(8)   Visit the points from p to both sides to search p's kNNs;
(9)   for each visited point q do
(10)      if q is before p and dist(p, o_C) − dist(q, o_C) > d_max
        then
(11)       Erase the points before q from the visited list;
(12)      else if q is behind p and dist(q, o_C) − dist(p, o_C) > d_max
        then
(13)       Erase the points behind q from the visited list;
(14)      else
(15)       Update S and d_max;
(16)      if S meets the condition of Corollary 5 then
(17)       flag = false;
(18)       break
(19)    if flag then
(20)     Send p to the master node to update t;
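To make the per-cluster scan concrete, the sketch below combines the reverse scan of line (3) with the Corollary 5 filter, assuming the weight of Definition 1 and Euclidean distances. For brevity it computes all pairwise distances directly instead of the incremental two-sided scan of lines (8)–(15), and report stands in for sending a candidate to the master node; it assumes each cluster has more than k points.

```python
import numpy as np

def weight_upper_bound(dp, knn_dists, k):
    """Upper bound of w(p) from a (possibly partial) kNN set, via Theorem 4."""
    denom = sum(max(dp - x, 0.0) for x in knn_dists[:k])
    return float("inf") if denom == 0.0 else k * dp / denom

def scan_cluster(points, k, threshold, report):
    """Sketch of Algorithm 2 for one cluster on a slave node."""
    o = points.mean(axis=0)                          # centroid o_C
    d_cent = np.linalg.norm(points - o, axis=1)      # dist(p, o_C) for every point
    order = np.argsort(d_cent)                       # ascending distance to o_C (line (2))
    for pos in range(len(order) - 1, -1, -1):        # reversed scan (line (3))
        i = order[pos]
        d = np.linalg.norm(points - points[i], axis=1)
        d[i] = np.inf                                # exclude p itself
        knn = np.argpartition(d, k)[:k]              # k nearest neighbors of p in C
        # Corollary 5: if even the upper bound cannot exceed t, p is filtered out
        if weight_upper_bound(d_cent[i], sorted(d[knn]), k) <= threshold:
            continue
        weight = k * d_cent[i] / d_cent[knn].sum()   # exact weight (Definition 1)
        if weight > threshold:
            report(weight, i)                        # candidate for the global top-n
```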
5.2. The Order of Cluster Scanning

As Section 4 describes, each slave node needs to visit the local clusters one by one, and it computes the weights of the points in each visited cluster. Meanwhile, the global top-n points with the largest weights thus far are kept in order to obtain the threshold t. Clearly, the filtering effectiveness can be significantly improved if we obtain a large t early, whereas a large t is unlikely to be obtained if we visit the clusters in a random order, because we cannot guarantee that the early scanned points have large weights. As a consequence, we design a new ranking method for the clusters. Figure 5 illustrates the main idea of the ranking method.

A common intuition is that points with large weights are likely to emerge in a sparse cluster (where the distance between every two points is large). However, this idea does not work well in many cases. For example, in Figure 5(a), the distances from the points to the centroid point of the sparse cluster are almost identical, and thus the weights of these points are not large. In contrast, in Figure 5(b), in the dense cluster, although most of the points are very close to each other, two abnormal points are far from the centroid point; their weights are large and suitable for the computation of t. From the observation in Figure 5(b), we can see that the distances of the points with large weights to the centroid point are large and quite different from those of the other points. Besides, note that we only need a few large weights to compute t. Based on the description above, we propose the following ranking method.

Definition 6 (the outlier factor of a cluster). For a cluster C, suppose that the points have been sorted according to their distances to o_C in descending order. One uses d_1 to denote the distance between the first point and o_C and d_{n+1} to denote the distance between the (n + 1)th point and o_C. Then, the outlier factor of C is

OF(C) = d_{n+1} / d_1.

For example, in Figure 5(a), n = 2, which means we consider at most 2 points in a cluster to obtain t. Thus, we sort the points according to their distances to the centroid point in descending order and obtain the top-3 points. Then, we compute the distance d_1 between the first point and the centroid point and the distance d_3 between the third point and the centroid point, and the outlier factor of the sparse cluster in Figure 5(a) is d_3/d_1. Similarly, we get the outlier factor of the dense cluster in Figure 5(b). Comparing the two outlier factors, we can see that the former is large, which means many points are far from the centroid point and they have similar distances to it. Therefore, we can hardly know whether there are points with large weights in that cluster. Conversely, the outlier factor of the cluster in Figure 5(b) is small. We can assert that most of its points closely surround the centroid point and there is at least one abnormal point far from the centroid. The weights of these abnormal points are large, and they contribute to selecting a large t.

As a consequence, we preferentially visit the clusters with small outlier factors, instead of visiting them in a random order. This method helps us to obtain a large t early and improves the filtering effectiveness.
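A sketch of the ranking step is given below, assuming the outlier factor is the ratio of the (n + 1)th largest centroid distance to the largest one, as inferred from the example above; the names are illustrative.

```python
import numpy as np

def outlier_factor(points, n):
    """OF(C) = d_{n+1} / d_1, with d_i the ith largest distance to the centroid (Definition 6)."""
    o = points.mean(axis=0)
    d = np.sort(np.linalg.norm(points - o, axis=1))[::-1]   # descending centroid distances
    if len(d) <= n or d[0] == 0.0:
        return 1.0                                           # too few or degenerate points: no preference
    return d[n] / d[0]                                       # d[0] = d_1, d[n] = d_{n+1}

def order_clusters(clusters, n):
    """Visit clusters with small outlier factors first (Section 5.2)."""
    return sorted(clusters, key=lambda c: outlier_factor(c, n))
```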

6. Experimental Evaluation

6.1. Method to Define Distance between Points

We use the method in [4] to define the distance between points in practical applications. First, we use a point with several measurable attributes to represent an object in the real world. All the objects can thus be mapped into a Euclidean space. Then, for a point p representing object A and a point q representing object B, we use the distance between p and q to measure the difference between A and B. Thus, a point with large distances to the others is more likely to be an outlier.

6.2. Result Analysis for the Centralized Algorithm

In this section, we first evaluate the performance of the proposed algorithm in a centralized environment using a PC with an Intel Core i7-2600 @3.4 GHz CPU, 8 GB main memory, and a 1 TB hard disk. A synthetic dataset is used for the experiments. In detail, given the data size, we generate a number of clusters and randomly assign each of them a center point and a radius. On average, each cluster has 1000 points following a Gaussian distribution. Finally, the remaining 1000 points are scattered into the space. We implement the proposed method to detect CB outliers (DACB) using the Java programming language. A naive method (naive) is also implemented as a comparison algorithm, in which we simply search each point's kNNs and compute its weight. In the experiments, we are mainly concerned with the runtime, which represents the computational efficiency, and the point accessing times (PATs), which indicate the disk IO cost. The parameters' default values and their variations are shown in Table 1.

As Figure 6(a) shows, DACB is much more efficient than the naive method because of the pruning and filtering strategies proposed in this paper. With the increase of k, we need to keep track of more neighbors for each point, so the runtime of both the naive method and DACB becomes larger. Figure 6(b) shows the effect of k on the PATs. For the naive method, each point needs to visit all the other points in the cluster to find its kNNs. Hence, the PATs are large. In contrast, for DACB, a point does not have to visit all the other points (Theorem 3), and a large number of points can be filtered out instead of finding their exact kNNs (Corollary 5). Therefore, the PATs are much smaller.

Figure 7 describes the effect of n. As n increases, more outliers are reported. Thus, the runtime of both the naive method and DACB becomes larger. The effect on the PATs is shown in Figure 7(b), whose trend is similar to that in Figure 6(b). Note that the PATs of DACB increase slightly with n, whereas the PATs of the naive method remain unchanged.

In Figure 8, with the increase of the dimensionality, a number of operations (e.g., computing the distance between two points) become more time-consuming. Hence, the time cost of the two methods becomes larger. But the variation of the dimensionality does not affect the PATs. The effect of the data size is described in Figure 9. Clearly, with the increase of the data size, we need to scan more points to find the outliers. Therefore, both the runtime and the PATs are linear in the data size.

6.3. Result Analysis for the Distributed Algorithm

In this section, we further evaluate the performance of the proposed algorithm for distributed outlier detection. In the experiments, we mainly consider the time cost and the network transmission quantity (NTQ). The data size and the cluster scale are shown in Table 2. The other parameter settings are identical to those described in Section 6.2.

Figure 10 shows the effect of parameter k. The curve "with SOO" represents the DACB algorithm. The curve "without SOO" represents a basic distributed algorithm for outlier detection without the Scanning Order Optimization (SOO) method described in Section 5.2. As we can see, as the value of k increases, both algorithms cost more time (the reason has been discussed in Section 6.2). But the threshold t is not sensitive to k, and thus NTQ changes very slightly. Comparing the two algorithms, we can see that, with the help of SOO, a large t can be obtained early and many more points can be filtered out efficiently. As a result, we can reduce the network overhead and improve the computing efficiency.

In Figure 11, we evaluate the effect of parameter n. With the increase of n, we consider more points as outliers. Thus, the threshold t becomes smaller and the filtering performance decreases. On the other hand, since a larger n means that more points will be transmitted to the master node to compute t, NTQ also increases.

Figure 12(a) evaluates the effect of the dimensionality on the time cost, which shows the same trend as the result in Figure 8(a). In Figure 12(b), we test the effect of the dimensionality on NTQ. Clearly, each slave node needs to send the local top-n points with the largest weights to the master node. The transmission contents include the points' IDs, the values of all the dimensions, and the weights. Therefore, higher dimensionality leads to a higher transmission quantity.

As Figure 13(a) shows, with the increase of the data size, more points need to be scanned to find the outliers; thus, the time cost becomes larger. We can also see, in Figure 13(b), that NTQ increases with the data size, but the change is very small. This is because the value of t becomes stable after a certain amount of computation, and it is not sensitive to the data size.

The effect of the cluster scale is tested in Figure 14. As more slave nodes are used, the workload on each node becomes smaller, and thus the computing speed is improved. However, to compute t, the master node needs to collect points from each slave node. Therefore, NTQ increases with the cluster scale. Note that, even for the largest dataset, NTQ is still maintained at the KB level. The network overhead of DACB is acceptable.

7. Related Work

Outlier detection is an important task in the area of data management, whose target is to find the abnormal objects in a given dataset. The statistics community [2, 3] proposed the model-based outliers. The dataset is assumed to follow a distribution, and an outlier is an object that shows obvious deviation from the assumed distribution. Later, the data management community pointed out that building a reasonable distribution is almost an impossible task for high-dimensional datasets. To overcome this weakness, they proposed several model-free approaches [22], including distance-based outliers [4–6] and density-based outliers [7].

A number of studies focus on developing efficient methods to detect outliers. Knorr and Ng [4] proposed the well-known nested-loop (NL) algorithm to compute distance-based outliers. Bay and Schwabacher [23] proposed an improved nested-loop approach, called ORCA, which efficiently prunes the search space by randomizing the dataset before outlier detection. Angiulli and Fassetti [24] proposed DOLPHIN, which can reduce the disk IO cost by maintaining a small subset of the input data in main memory. Several researchers adopt spatial indexes to further improve the computing efficiency; examples include the R-tree [25], the M-tree [26], and grids. However, the performance of these methods is quite sensitive to the dimensionality. To improve the computing efficiency, some researchers attempt to use distributed or parallel methods to detect outliers. Examples include [27–29].

8. Conclusion

In this paper, we studied the problem of CB outlier detection in a distributed environment and proposed an efficient algorithm, called DACB. This algorithm adopts a master-slave architecture. The master node monitors the points with large weights on each slave node and computes a threshold. On each slave node, we designed a pruning method to speed up the kNN search and a filtering method that uses the threshold to filter out a large number of unpromising points. We also designed an optimization method for cluster scanning, which can significantly improve the filtering performance of the threshold. Finally, we evaluated the performance of the proposed approaches through a series of simulation experiments. The experimental results show that our method can effectively reduce the runtime and the network transmission quantity for distributed CB outlier detection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants nos. 61602076 and 61371090, the Natural Science Foundation of Liaoning Province under Grant no. 201602094, and the Fundamental Research Funds for the Central Universities under Grant no. 3132016030.