Mathematical Problems in Engineering

Volume 2016, Article ID 6394253, 14 pages

http://dx.doi.org/10.1155/2016/6394253

## An Improved Semisupervised Outlier Detection Algorithm Based on Adaptive Feature Weighted Clustering

^{1}College of Science, Harbin Engineering University, Harbin 150001, China

^{2}College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China

Received 13 September 2016; Revised 2 November 2016; Accepted 20 November 2016

Academic Editor: Filippo Ubertini

Copyright © 2016 Tingquan Deng and Jinhong Yang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Various approaches to outlier detection already exist, among which semisupervised methods achieve encouraging performance owing to the introduction of prior knowledge. In this paper, an adaptive feature weighted clustering-based semisupervised outlier detection strategy is proposed. The method maximizes the membership degree of a labeled normal object to the cluster it belongs to and minimizes the membership degrees of a labeled outlier to all clusters. Because features differ in how much they contribute to deciding whether an object is an inlier or an outlier, each feature is adaptively assigned a weight according to the degree of deviation between that feature of all objects and that of a given cluster prototype. A series of experiments on a synthetic dataset and several real-world datasets is carried out to verify the effectiveness and efficiency of the proposal.

#### 1. Introduction

Outlier detection is an important topic in the data mining community. Unlike other data mining techniques, it aims at finding patterns that occur infrequently [1]. An outlier is an observation that deviates significantly from, or is inconsistent with, the main body of a dataset, as if it were generated by a different mechanism [2]. Outlier detection matters because outliers can provide raw patterns and valuable knowledge about a dataset. Current application areas include crime detection, credit card fraud detection, network intrusion detection, medical diagnosis, fault detection in safety-critical systems, and the detection of abnormal regions in image processing [3–9].

Research on outlier detection has recently been very active, and many approaches have been proposed. In general, existing work can be broadly classified into three modes depending on whether label information is available or can be used to build outlier detection models: unsupervised, supervised, and semisupervised methods.

Supervised outlier detection concerns the situation where the training dataset contains prior information about whether each instance is normal or abnormal. One-class support vector machine (OCSVM) [10] and support vector data description (SVDD) [11, 12], which consider the case where the training data are all normal instances, construct a hypersphere around the normal data and use it to classify an unknown sample as an inlier or an outlier. Supervised outlier detection is difficult to apply in many real-world applications, since acquiring label information for the whole training dataset is often expensive, time consuming, and subjective.

Unsupervised outlier detection, which requires no prior information about the class distribution, is generally classified into distribution-based [3], distance-based [13, 14], density-based [15, 16], and clustering-based [17–20] approaches. The distribution-based approach assumes that all data points are generated by a certain statistical model, while outliers do not obey the model. However, an underlying distribution of the data points cannot always be assumed in real-life applications. The distance-based approach was first investigated by Knox and Ng [14]. An object $o$ in a dataset $T$ is an outlier if at least a fraction $p$ of the objects in $T$ are further than the distance $D$ from $o$. The global parameters $p$ and $D$ are not suitable when the local information of the dataset varies greatly. Representatives of this type of approach include the $k$-nearest neighbor ($k$NN) algorithm [13] and its variants [21, 22]. The density-based approach was originally proposed by Breunig et al. [15]. A local outlier factor (LOF) is assigned to each data point based on its local neighborhood density, and a data point with a high LOF value is deemed an outlier. However, this method is very sensitive to the choice of the neighborhood parameter.
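The distance-based criterion described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the parameter names `frac` and `dist`, and the toy data are hypothetical, and the count is taken over the whole dataset, including the object itself.

```python
import math

def is_distance_based_outlier(point, data, frac, dist):
    """True if at least `frac` of the objects in `data` lie farther than `dist` from `point`."""
    far = sum(1 for q in data if math.dist(point, q) > dist)
    return far >= frac * len(data)

# Hypothetical toy data: four points packed together and one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

# The same global thresholds are applied to every object.
flags = [is_distance_based_outlier(p, data, frac=0.75, dist=1.0) for p in data]
print(flags)
```

Because `frac` and `dist` are global, a single setting has to serve both dense and sparse regions, which is the limitation noted in the text.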

Clustering-based approaches [17–20] partition the dataset into several clusters according to the similarity of objects and detect outliers by examining the relationship between objects and clusters. In general, clusters that contain significantly fewer data points than other clusters, or that are remote from other clusters, are considered outliers. The cluster structure of data can facilitate outlier detection, and a small body of related work has appeared. A classical clustering method is used to find anomalies in the intrusion detection domain [18]. In [19], clustering techniques iteratively detect outliers for multidimensional data analysis in subspaces. Zhao et al. [20] propose an adaptive fuzzy c-means (AFCM) algorithm by introducing sample weight coefficients into the objective function and apply it to anomalous data detection in the energy system of the steel industry. Since clustering-based approaches are unsupervised and use no labeled training data, their performance in outlier detection is limited. In addition, most existing clustering-based methods only pursue optimal clustering and do not incorporate optimal outlier detection into the clustering process.

In many real-world applications, one may encounter cases where a small set of objects is labeled as outliers or as belonging to a certain class, while most of the data are unlabeled. Studies indicate that introducing a small amount of prior knowledge can significantly improve the effectiveness of outlier detection [23–25]. Semisupervised approaches to outlier detection have therefore been developed to tackle such scenarios and have recently become a popular direction in outlier detection. To take advantage of the label information of a target dataset, entropy-based outlier detection based on semisupervised learning from few positive examples (EODSP) is proposed in [23]. That method extracts reliable normal instances from the unlabeled objects and regards them as labeled normal samples; an entropy-based outlier detection method is then used to detect the top outliers. However, when a dataset initially provides both labeled normal and labeled abnormal samples, the algorithm in [23] cannot make full use of the given label information. The work in [24] develops a semisupervised outlier detection method based on the assessment of deviation from known labeled objects by punishing poor clustering results and restricting the number of outliers. Xue et al. [25] present a semisupervised outlier detection proposal based on fuzzy rough c-means clustering, which detects outliers by minimizing the sum of squared errors of the clustering results, the deviation from known labeled examples, and the number of outliers. Unfortunately, some labeled normal objects are ultimately misidentified as outliers in [24, 25] owing to improper parameter selection.

Most previous research treats the different features of objects equally in the outlier detection process, which does not conform to the intrinsic characteristics of a dataset. It is more reasonable for different features to have different importance in each cluster, especially for high-dimensional sparse datasets, where the structure of each cluster is often limited to a subset of features rather than the entire feature set. Several works on feature weighted clustering have been published. Huang et al. [26] propose a W-c-means type clustering algorithm that automatically calculates feature weights. W-c-means adds a new step to the basic c-means algorithm to update the variable weights based on the current partition of the data. The work in [27] develops an approach called simultaneous clustering and attribute discrimination (SCAD). SCAD learns the feature relevance representation of each cluster independently in an unsupervised manner. Zhou et al. [28] propose a maximum-entropy-regularized weighted fuzzy c-means (EWFCM) clustering algorithm for “nonspherical” shaped data. A new objective function is developed in EWFCM to achieve the optimal clustering result by simultaneously minimizing the dispersion within clusters and maximizing the entropy of the attribute weights. These feature weighted clustering methods motivate the study of outlier detection based on feature weighted clustering.

To make full use of prior knowledge in clustering-based outlier detection, we develop a semisupervised outlier detection algorithm based on adaptive feature weighted clustering (SSOD-AFW), in which the feature weights are obtained iteratively. The proposed algorithm emphasizes the diversity of different features in each cluster and assigns lower weights to irrelevant features to reduce their negative influence on the outlier decision. Furthermore, based on the convention that outliers usually have a low membership to every cluster, we relax the constraint of fuzzy c-means (FCM) clustering that the membership degrees of a sample to all clusters must sum to one, and propose an adaptive feature weighted semisupervised possibilistic clustering-based outlier detection algorithm. The proposed method addresses the interaction between optimal clustering and outlier detection. The label information is introduced into the possibilistic clustering method according to the following principles: (1) maximizing the membership degree of a labeled normal object to the cluster it belongs to; (2) minimizing the membership degrees of a labeled normal object to the clusters it does not belong to; and (3) minimizing the membership degrees of a labeled outlier to all clusters. In addition to these principles, the new clustering objective function simultaneously minimizes the dispersion within clusters to achieve a proper cluster structure. Finally, the resulting optimal membership degrees are used to indicate the outlying degree of each sample in the dataset. Compared with typical outlier detection methods, the proposed algorithm shows promising improvements in accuracy, running time, and other evaluation metrics.

The remainder of this paper is organized as follows. Section 2 gives a short review of possibilistic clustering algorithms. Section 3 presents a detailed description of the feature weighted semisupervised clustering-based outlier detection algorithm. In Section 4, the experimental results of the proposed method against typical outlier detection algorithms are discussed on synthetic and real-world datasets. Finally, Section 5 presents our conclusions.

#### 2. Possibilistic Clustering Algorithms

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a given dataset of $n$ objects, where $x_i \in \mathbb{R}^d$ is the $i$th object characterized by $d$ features. Suppose that the dataset is divided into $c$ clusters and $v_k$ denotes the $k$th cluster prototype.

FCM is a well-known clustering algorithm [29], whose objective function is

$$J_{\mathrm{FCM}} = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ik}^{m} \, \| x_i - v_k \|^{2}, \tag{1}$$

$$\text{s.t.} \quad \sum_{k=1}^{c} u_{ik} = 1, \quad u_{ik} \in [0, 1], \quad i = 1, \ldots, n, \tag{2}$$

where $u_{ik}$ is the membership degree of the $i$th object to the $k$th cluster, $\| \cdot \|$ represents the $l_2$-norm of a vector, and $m > 1$ is the fuzzification coefficient. Note that the constraint condition in (2) indicates that the memberships of each object to all clusters sum to one. Outliers and noise points usually lie far away from all cluster prototypes, yet under (2) they must still carry a total membership of one, so FCM is sensitive to outliers. For this reason, Krishnapuram and Keller [30] proposed a possibilistic c-means (PCM) clustering algorithm, which relaxes the constraint on the sum of memberships and minimizes the following objective function:

$$J_{\mathrm{PCM}} = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ik}^{m} \, \| x_i - v_k \|^{2} + \sum_{k=1}^{c} \eta_k \sum_{i=1}^{n} (1 - u_{ik})^{m}, \tag{3}$$

$$\text{s.t.} \quad 0 \le u_{ik} \le 1, \tag{4}$$

where $\eta_k$ is a suitable positive number. In PCM, the constraint (4) allows an outlier to hold a low membership to all clusters, so an outlier has a low impact on the objective function (3). The membership information of each sample can therefore be used naturally to interpret its outlying characteristic: if a sample has a low membership to all clusters, it is likely to be an outlier.
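The contrast between the two membership schemes can be illustrated with the standard closed-form updates for FCM and PCM. This is a minimal sketch rather than the authors' implementation. The prototypes, the scale parameters `eta`, and the two test points are hypothetical values chosen for illustration.

```python
def squared_distances(x, prototypes):
    """Squared Euclidean distance from x to every cluster prototype."""
    return [sum((a - b) ** 2 for a, b in zip(x, v)) for v in prototypes]

def fcm_memberships(x, prototypes, m=2.0):
    """Standard FCM update; assumes x does not coincide with a prototype."""
    inv = [(1.0 / d2) ** (1.0 / (m - 1.0)) for d2 in squared_distances(x, prototypes)]
    total = sum(inv)
    return [w / total for w in inv]  # memberships are forced to sum to one

def pcm_memberships(x, prototypes, eta, m=2.0):
    """Standard PCM update; each membership depends only on its own cluster."""
    d2s = squared_distances(x, prototypes)
    return [1.0 / (1.0 + (d2 / e) ** (1.0 / (m - 1.0))) for d2, e in zip(d2s, eta)]

prototypes = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical cluster prototypes
eta = [1.0, 1.0]                       # hypothetical PCM scale parameters
inlier, outlier = (0.5, 0.0), (2.0, 10.0)

fcm_out = fcm_memberships(outlier, prototypes)      # far point still gets total membership 1
pcm_out = pcm_memberships(outlier, prototypes, eta)  # far point gets low membership everywhere
pcm_in = pcm_memberships(inlier, prototypes, eta)    # near point keeps a high membership
print(fcm_out, pcm_out, pcm_in)
```

Under FCM the distant point splits a full unit of membership between the two clusters, whereas under PCM its memberships are close to zero for both, which is what makes possibilistic memberships usable as an outlier score.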

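Afterward, another unsupervised possibilistic clustering algorithm (PCA) was proposed by Yang and Wu [31], and the objective function of PCA is described as

$$J_{\mathrm{PCA}} = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ik}^{m} \, \| x_i - v_k \|^{2} + \frac{\beta}{m^{2} \sqrt{c}} \sum_{k=1}^{c} \sum_{i=1}^{n} \left( u_{ik}^{m} \ln u_{ik}^{m} - u_{ik}^{m} \right), \tag{5}$$

where the parameter $\beta$ can be calculated by the sample covariance:

$$\beta = \frac{1}{n} \sum_{i=1}^{n} \| x_i - \bar{x} \|^{2}, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{6}$$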
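As a rough illustration of how the PCA parameter might be computed, the sketch below assumes that it equals the mean squared distance of the samples from the sample mean, and that the membership takes the exponential form obtained by setting the derivative of the PCA objective with respect to the membership to zero. The function names and toy data are hypothetical.

```python
import math

def pca_beta(data):
    """Assumed form of the PCA parameter: mean squared distance to the sample mean."""
    n, dim = len(data), len(data[0])
    mean = [sum(x[l] for x in data) / n for l in range(dim)]
    return sum(sum((x[l] - mean[l]) ** 2 for l in range(dim)) for x in data) / n

def pca_membership(x, v, beta, c, m=2.0):
    """Assumed stationary-point membership: exp(-m * sqrt(c) * ||x - v||^2 / beta)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-m * math.sqrt(c) * d2 / beta)

# Hypothetical toy data: the four corners of a square with side 2.
data = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
beta = pca_beta(data)

prototype = (0.0, 0.0)
near = pca_membership((0.1, 0.0), prototype, beta, c=2)  # close to the prototype
far = pca_membership((3.0, 3.0), prototype, beta, c=2)   # far from the prototype
print(beta, near, far)
```

Since the parameter is estimated from the data, the only quantities left to the user in this sketch are the number of clusters and the fuzzification coefficient.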

#### 3. Semisupervised Outlier Detection Framework Based on Feature Weighted Clustering

##### 3.1. Model Formulation

In this section, we introduce prior knowledge into the possibilistic c-means clustering method to improve the performance of outlier detection. First, a small subset of samples in a given dataset is labeled as normal or outlier objects. Each labeled normal object carries the label of the class it belongs to. A semisupervised indicator matrix is constructed to describe the semisupervised information and its entries are defined by the following:

(i) If an object is labeled as a normal point and it belongs to the th cluster, then , and for all , we let .

(ii) If is labeled as an outlier, then for all , we set .

(iii) If is unlabeled, then for all , it has .

Data often contain a number of redundant features. The cluster structure in a given dataset is often confined to a subset of features rather than the entire feature set, and irrelevant features only obscure the discovery of that structure by a clustering algorithm. A genuine outlier is then easily overlooked because the cluster structure is vague. Figure 1 presents an example of a three-dimensional dataset. The dataset has two clusters ($C_1$ and $C_2$) and three features ($A_1$, $A_2$, and $A_3$). In the full feature space $\{A_1, A_2, A_3\}$, neither of the clusters is discovered (see Figure 1(a)). In the subspace $\{A_1, A_2\}$, cluster $C_1$ can be found, but $C_2$ cannot (see Figure 1(b)). Conversely, only cluster $C_2$ can be clearly shown in $\{A_2, A_3\}$ (see Figure 1(c)). Therefore, if we assign weights 0.47, 0.45, and 0.08 to features $A_1$, $A_2$, and $A_3$, respectively, cluster $C_1$ will be recovered by a clustering algorithm. If the weights of features $A_1$, $A_2$, and $A_3$ are assigned as 0.13, 0.46, and 0.41, respectively, cluster $C_2$ will be recovered. Hence each cluster is relevant to a different subset of features, and the same feature may have different importance in different clusters.
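The effect of such weights can be seen in a small sketch of a feature weighted squared distance. It assumes the weights enter linearly and sum to one, which may differ from the exact form used in the proposed objective function. The sample point and prototype are hypothetical, and only the two weight vectors are taken from the example above.

```python
def weighted_sq_dist(x, v, weights):
    """Feature weighted squared distance; weights are assumed to enter linearly."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, v))

# Hypothetical point that matches the prototype in the first two features
# but deviates strongly in the third, irrelevant feature.
x = (1.0, 1.1, 4.0)
prototype = (1.0, 1.0, 2.0)

equal = weighted_sq_dist(x, prototype, (1 / 3, 1 / 3, 1 / 3))
first_two = weighted_sq_dist(x, prototype, (0.47, 0.45, 0.08))  # third feature down-weighted
last_two = weighted_sq_dist(x, prototype, (0.13, 0.46, 0.41))   # third feature emphasized
print(equal, first_two, last_two)
```

With the first weight vector the deviation in the third feature contributes little, so the point stays close to the prototype; with the second weight vector the same deviation dominates the distance.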