Tingquan Deng, Jinhong Yang, "An Improved Semisupervised Outlier Detection Algorithm Based on Adaptive Feature Weighted Clustering", Mathematical Problems in Engineering, vol. 2016, Article ID 6394253, 14 pages, 2016. https://doi.org/10.1155/2016/6394253
An Improved Semisupervised Outlier Detection Algorithm Based on Adaptive Feature Weighted Clustering
Various approaches to outlier detection already exist, among which semisupervised methods perform notably well because they exploit prior knowledge. In this paper, a semisupervised outlier detection strategy based on adaptive feature weighted clustering is proposed. The method maximizes the membership degree of a labeled normal object to the cluster it belongs to and minimizes the membership degrees of a labeled outlier to all clusters. Because the features of a dataset differ in how much they help decide whether an object is an inlier or an outlier, each feature is adaptively assigned a weight according to the degree of deviation between that feature of all objects and that of a given cluster prototype. Experiments on a synthetic dataset and several real-world datasets verify the effectiveness and efficiency of the proposal.
Outlier detection is an important topic in the data mining community; in contrast to other data mining techniques, it aims at finding patterns that occur infrequently. An outlier is an observation that deviates significantly from, or is inconsistent with, the main body of a dataset, as if it were generated by a different mechanism. Outlier detection matters because outliers can provide raw patterns and valuable knowledge about a dataset. Current application areas include crime detection, credit card fraud detection, network intrusion detection, medical diagnosis, fault detection in safety-critical systems, and detection of abnormal regions in image processing [3–9].
Research on outlier detection has recently been very active, and many approaches have been proposed. Existing work can be broadly classified into three modes, depending on whether label information is available or can be used to build outlier detection models: unsupervised, supervised, and semisupervised methods.
Supervised outlier detection concerns the situation where the training dataset contains prior information about the class of each instance, that is, whether it is normal or abnormal. One-class support vector machine (OCSVM) and support vector data description (SVDD) [11, 12], which consider the case where all training data are normal instances, construct a hypersphere around the normal data and use it to classify an unknown sample as an inlier or an outlier. Supervised outlier detection is difficult in many real-world applications, since acquiring label information for the whole training dataset is often expensive, time consuming, and subjective.
Unsupervised outlier detection, which uses no prior information about the class distribution, is generally classified into distribution-based, distance-based [13, 14], density-based [15, 16], and clustering-based [17–20] approaches. The distribution-based approach assumes that all data points are generated by a certain statistical model and that outliers do not obey the model. However, an underlying distribution of the data points cannot always be assumed in real-life applications. The distance-based approach was first investigated by Knorr and Ng. An object $o$ in a dataset $T$ is an outlier if at least a fraction $p$ of the objects in $T$ are farther than distance $D$ from $o$. The global parameters $p$ and $D$ are not suitable when the local information of the dataset varies greatly. Representatives of this type of approach include the $k$-nearest neighbor ($k$NN) algorithm and its variants [21, 22]. The density-based approach was originally proposed by Breunig et al. A local outlier factor (LOF) is assigned to each data point based on its local neighborhood density, and a data point with a high LOF value is deemed an outlier. However, this method is very sensitive to the choice of the neighborhood parameter.
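The distance-based criterion described above can be illustrated with a minimal sketch. The function name `db_outliers`, the parameter names `p` and `D` for the fraction and distance thresholds, the choice to exclude the object itself from the count, and the toy data are all assumptions made for illustration, not details taken from the cited work.

```python
import math

def db_outliers(points, p, D):
    """Return indices of points for which at least a fraction p of the
    other points lie farther away than distance D (Euclidean)."""
    outliers = []
    for i, a in enumerate(points):
        others = [b for j, b in enumerate(points) if j != i]
        far = sum(1 for b in others if math.dist(a, b) > D)
        if far >= p * len(others):
            outliers.append(i)
    return outliers

# Hypothetical toy data: a tight group plus one distant point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(db_outliers(data, p=0.9, D=1.0))
```

Because `p` and `D` are global, a single setting has to serve both dense and sparse regions of the data, which is the weakness noted in the text.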
Clustering-based approaches [17–20] partition the dataset into several clusters according to the similarity of objects and detect outliers by examining the relationship between objects and clusters. In general, clusters that contain significantly fewer data points than other clusters, or that are remote from other clusters, are considered outliers. The cluster structure of data can facilitate outlier detection, and a small body of related work exists. A classical clustering method has been used to find anomalies in the intrusion detection domain. In other work, clustering techniques iteratively detect outliers in subspaces for multidimensional data analysis. Zhao et al. propose an adaptive fuzzy c-means (AFCM) algorithm that introduces sample weight coefficients into the objective function, and they apply it to anomaly detection in the energy system of the steel industry. Since clustering-based approaches are unsupervised and use no labeled training data, their performance in outlier detection is limited. In addition, most existing clustering-based methods only pursue optimal clustering and do not incorporate optimal outlier detection into the clustering process.
In many real-world applications, a small set of objects is labeled as outliers or as belonging to a certain class, while most of the data are unlabeled. Studies indicate that introducing a small amount of prior knowledge can significantly improve the effectiveness of outlier detection [23–25]. Semisupervised approaches have therefore been developed for such scenarios and have recently become a popular direction in outlier detection. To take advantage of the label information of a target dataset, entropy-based outlier detection based on semisupervised learning from few positive examples (EODSP) has been proposed. That method extracts reliable normal instances from unlabeled objects, regards them as labeled normal samples, and then uses an entropy-based outlier detection method to detect the top outliers. However, when a dataset initially provides both labeled normal and labeled abnormal samples, EODSP cannot make full use of the given label information. Another semisupervised outlier detection method assesses the deviation from known labeled objects by punishing poor clustering results and restricting the number of outliers. Xue et al. present a semisupervised outlier detection proposal based on fuzzy rough c-means clustering, which detects outliers by minimizing the sum of squared errors of the clustering results, the deviation from known labeled examples, and the number of outliers. Unfortunately, some labeled normal objects are ultimately misidentified as outliers in [24, 25] because of improper parameter selection.
Most previous research treats the different features of objects equally in the outlier detection process, which does not conform to the intrinsic characteristics of a dataset. It is more reasonable that different features have different importance in each cluster, especially for high-dimensional sparse datasets, where the structure of each cluster is often limited to a subset of features rather than the entire feature set. Several works have studied feature weighted clustering. Huang et al. propose a W-c-means type clustering algorithm that automatically calculates feature weights; it adds a new step to the basic c-means algorithm that updates the variable weights based on the current partition of the data. Another approach, simultaneous clustering and attribute discrimination (SCAD), learns the feature relevance representation of each cluster independently in an unsupervised manner. Zhou et al. publish a maximum-entropy-regularized weighted fuzzy c-means (EWFCM) clustering algorithm for “nonspherical” shaped data, whose objective function achieves the optimal clustering result by simultaneously minimizing the dispersion within clusters and maximizing the entropy of the attribute weights. These methods motivate the study of outlier detection based on feature weighted clustering.
To make full use of prior knowledge to facilitate clustering-based outlier detection, we develop a semisupervised outlier detection algorithm based on adaptive feature weighted clustering (SSOD-AFW) in this paper, in which the feature weights are iteratively obtained. The proposed algorithm emphasizes the diversity of different features in each cluster and assigns lower weights to irrelevant features to reduce their negative influence on outlier decision. Furthermore, based on the convention that outliers usually have a lower membership to every cluster, we relax the constraint of fuzzy c-means (FCM) clustering where the membership degrees of a sample to all clusters must sum up to one and propose an adaptive feature weighted semisupervised possibilistic clustering-based outlier detection algorithm. The interaction problem between optimal clustering and outlier detection is addressed in the proposed method. The label information is introduced into the possibilistic clustering method according to the following principles: (1) maximizing the membership degree of a labeled normal object to the cluster it belongs to; (2) minimizing the membership degrees of a labeled normal object to the clusters it does not belong to; and (3) minimizing the membership degrees of a labeled outlier to all clusters. In addition to the above principles, we simultaneously minimize the dispersion within clusters in the new objective function of clustering to achieve a proper cluster structure. Finally the yielded optimal membership degrees are used to indicate the outlying degree of each sample in the dataset. The proposed algorithm is found promising in improving the performance of outlier detection in comparison with typical outlier detection methods in accuracy, running time as well as other evaluation metrics.
The remainder of this paper is organized as follows. Section 2 gives a short review of possibilistic clustering algorithms. Section 3 presents a detailed description of the feature weighted semisupervised clustering-based outlier detection algorithm. In Section 4, the experimental results of the proposed method against typical outlier detection algorithms are discussed on synthetic and real-world datasets. Finally, Section 5 presents our conclusions.
2. Possibilistic Clustering Algorithms
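Let $X = \{x_1, x_2, \ldots, x_n\}$ be a given dataset of $n$ objects, where $x_i$ is the $i$th object characterized by $N$ features. Suppose that the dataset is divided into $c$ clusters and $v_k$ denotes the $k$th cluster prototype.
FCM is a well-known clustering algorithm, whose objective function is
$$J_{\mathrm{FCM}} = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \, \|x_i - v_k\|^2, \tag{1}$$
$$\text{s.t.} \quad \sum_{k=1}^{c} u_{ik} = 1, \quad 0 \le u_{ik} \le 1, \tag{2}$$
where $u_{ik}$ is the membership degree of the $i$th object to the $k$th cluster, $\|\cdot\|$ represents the $\ell_2$-norm of a vector, and $m > 1$ is the fuzzification coefficient. Note that the constraint condition in (2) indicates that the memberships of each object to all clusters sum to one. FCM is therefore sensitive to outliers: outliers or noise points commonly lie far away from all cluster prototypes, yet the constraint still forces their memberships to sum to one. For this reason, Krishnapuram and Keller proposed a possibilistic c-means (PCM) clustering algorithm, which relaxes the constraint on the sum of memberships and minimizes the following objective function:
$$J_{\mathrm{PCM}} = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \, \|x_i - v_k\|^2 + \sum_{k=1}^{c} \eta_k \sum_{i=1}^{n} (1 - u_{ik})^{m}, \tag{3}$$
$$\text{s.t.} \quad 0 \le u_{ik} \le 1, \tag{4}$$
where $\eta_k$ is a suitable positive number. In PCM, the constraint (4) allows an outlier to hold a low membership to all clusters, so an outlier has a low impact on the objective function (3). The membership information of each sample can be naturally used to interpret its outlying characteristic: if a sample has a low membership to all clusters, it is likely to be an outlier.
Afterward, another unsupervised possibilistic clustering algorithm (PCA) was proposed by Yang and Wu, and the objective function of PCA is described as
$$J_{\mathrm{PCA}} = \sum_{i=1}^{n} \sum_{k=1}^{c} u_{ik}^{m} \, \|x_i - v_k\|^2 + \frac{\beta}{m^2 \sqrt{c}} \sum_{i=1}^{n} \sum_{k=1}^{c} \left( u_{ik}^{m} \ln u_{ik}^{m} - u_{ik}^{m} \right), \tag{5}$$
where the parameter $\beta$ can be calculated by the sample covariance:
$$\beta = \frac{1}{n} \sum_{i=1}^{n} \|x_i - \bar{x}\|^2, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \tag{6}$$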
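To make the contrast between the sum-to-one constraint and the possibilistic relaxation concrete, the sketch below compares the two membership rules for a point far from every prototype. It assumes the standard FCM membership update and the closed-form membership $u = \exp(-m\sqrt{c}\,d^2/\beta)$ that follows from a Yang and Wu type objective; the function names, the prototypes, the test point, and the value of `beta` are hypothetical.

```python
import math

def sq_dist(x, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, v))

def fcm_memberships(x, prototypes, m=2.0):
    """FCM memberships of x: inversely related to distance, normalized to sum to one.
    Assumes x does not coincide with any prototype (no zero distance)."""
    inv = [(1.0 / sq_dist(x, v)) ** (1.0 / (m - 1.0)) for v in prototypes]
    total = sum(inv)
    return [w / total for w in inv]

def possibilistic_memberships(x, prototypes, beta, m=2.0):
    """Assumed possibilistic rule u = exp(-m * sqrt(c) * d^2 / beta), no sum constraint."""
    c = len(prototypes)
    return [math.exp(-m * math.sqrt(c) * sq_dist(x, v) / beta) for v in prototypes]

prototypes = [(0.0, 0.0), (4.0, 0.0)]   # hypothetical cluster prototypes
far_point = (2.0, 6.0)                  # far from both prototypes
fcm_u = fcm_memberships(far_point, prototypes)
pos_u = possibilistic_memberships(far_point, prototypes, beta=10.0)
print(sum(fcm_u))   # forced to one even though the point fits no cluster
print(sum(pos_u))   # free to be close to zero
```

Under the relaxed rule the far point receives a low membership to every cluster, which is the property that clustering-based outlier scoring relies on.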
3. Semisupervised Outlier Detection Framework Based on Feature Weighted Clustering
3.1. Model Formulation
In this section, we introduce prior knowledge into possibilistic c-means clustering method to improve the performance of outlier detection. First, a small subset of samples in a given dataset is labeled as normal or outlier objects. Each labeled normal object carries the label of class it belongs to. A semisupervised indicator matrix is constructed to describe the semisupervised information and its entries are defined by the following:
(i) If an object is labeled as a normal point and it belongs to the th cluster, then , and for all , we let .
(ii) If is labeled as an outlier, then for all , we set .
(iii) If is unlabeled, then for all , it has .
Usually data often contain a number of redundant features. The cluster structure in a given dataset is often confined to a subset of features rather than the entire feature set. Irrelevant features can only obscure the discovery of the cluster structure by a clustering algorithm. An intrinsic outlier is easy to be neglected due to the vagueness of cluster structure. Figure 1 presents an example of a three-dimensional dataset. The dataset has two clusters ( and ) and features (, , and ). In the feature space , , , neither of the clusters is discovered (see Figure 1(a)). In the subspace , , cluster can be found, but cannot (see Figure 1(b)). Nevertheless, only cluster can be clearly shown in , (see Figure 1(c)). Therefore, if we assign weights 0.47, 0.45, and 0.08 to features , , and , respectively, cluster will be recovered by a clustering algorithm. If the weights of features , , and are assigned as 0.13, 0.46, and 0.41, respectively, cluster will be recovered. In this consideration, each cluster is relevant to different subsets of features, and the same feature may have different importance in different clusters.
Figure 1: Example of a three-dimensional dataset with two clusters. (a) Plot of the full three-dimensional feature space. (b), (c) Plots of two-dimensional subspaces.
In our research, let be the weight of the th dimensional feature with respect to the th cluster, which satisfies ; then the feature weighted distance between the th object and the th cluster prototype is defined aswhere the parameter is the feature weight index.
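As a hedged illustration, one common form of a feature weighted squared distance, assumed here rather than taken from the text, is $d_{ik}^2 = \sum_{j} w_{kj}^{q} (x_{ij} - v_{kj})^2$, where $w_{kj}$ is the weight of feature $j$ in cluster $k$ and $q$ plays the role of the feature weight index. The symbol `q`, the function name, the sample point, and the prototype are hypothetical; the weight vector reuses the kind of values quoted in the discussion of Figure 1.

```python
def weighted_sq_distance(x, v, w, q=2.0):
    """Assumed form: sum_j w_j**q * (x_j - v_j)**2 for a single cluster."""
    return sum((wj ** q) * (xj - vj) ** 2 for xj, vj, wj in zip(x, v, w))

x = (1.0, 2.0, 30.0)            # third feature is irrelevant noise for this cluster
v = (1.0, 2.5, 0.0)             # hypothetical cluster prototype
uniform = (1 / 3, 1 / 3, 1 / 3)
adapted = (0.47, 0.45, 0.08)    # low weight on the irrelevant feature

d_uniform = weighted_sq_distance(x, v, uniform)
d_adapted = weighted_sq_distance(x, v, adapted)
print(d_uniform, d_adapted)
```

With the adapted weights the noisy third feature contributes far less to the distance, so the point stays close to the prototype in the features that actually define the cluster.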
The points within clusters usually behave strongly correlated, while weak correlation is shown between outliers. That is, normal points belong to one of the clusters and outliers do not belong to any cluster. Therefore, a normal point should have a high membership to the cluster it belongs to, and an outlier has a low membership to all clusters. Based on this idea, we define a new objective function and minimize it as follows:where , , and are the number of objects, features, and clusters, respectively. , is the membership degree of the th object belonging to the th cluster. , denotes the feature weight of the th dimensional feature with respect to the th cluster. , indicates the th dimensional feature value of the th cluster prototype. denotes the feature weighted distance between object and the th cluster prototype. is the element in semisupervised indicator matrix . is the fuzzification coefficient and the parameter can be fixed as the sample covariance according to (6). The positive coefficient adjusts the significance of the label information of the th object with respect to the th cluster in objective function (8). The larger is, the larger the influence of label knowledge is.
The first term in (8) is equivalent to the FCM objective function which requires the distances of objects from the cluster prototypes to be as small as possible. The second term is constructed to force to be as large as possible. The third term focuses on minimizing the membership degrees of a labeled outlier to all the clusters and maximizing the membership degree of a labeled normal object to the cluster it belongs to. With a proper choice of , we can balance the label information weight of every object and achieve the optimal fuzzy partition.
The virtue of semisupervised indicator matrix in objective function (8) can be elaborated as follows. Recalling the construction of semisupervised indicator matrix and objective function (8), note that if we know that belongs to the th cluster, then and all the other entries in the th row equal 1. Thus, minimizing in (8) means maximizing the membership of to the th cluster and simultaneously minimizing the memberships of to the other clusters. If is labeled as an outlier, namely, where all the elements in the th row of equal 1, then minimizing in (8) means minimizing the memberships of to all clusters, for an outlier does not belong to any cluster. If is unlabeled, namely, where for all , then the term has no impact on objective function (8).
3.2. Solutions to the Objective Function
In this subsection, an iterative algorithm for minimizing with respect to , , and is derived similar to classical FCM.
First, in order to minimize with respect to , and are fixed and the parameters (; ) are constants. The Lagrange function is constructed as follows:where () are the Lagrange multipliers.
By taking the gradient of with respect to and setting it to zero, we obtain
It follows that
The updating criteria of feature weight (, ) are obtained:
The updating way of implies that the larger the deviation degrees from all samples to the th cluster prototype regarding the th feature are, the smaller the weight of the th feature is. That is, if the distribution of all data is compact around the th cluster prototype in the th feature space, the th feature plays a significant role in formulating the th cluster. Meanwhile, irrelevant features thus are assigned a smaller weight to reduce the negative impact of them on the clustering process.
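The behavior described above (larger deviation, smaller weight) can be sketched under stated assumptions. The sketch assumes that the weights of each cluster sum to one and enter the objective only through a term of the form $\sum_j w_{kj}^{q} D_{kj}$, with $D_{kj} = \sum_i u_{ik}^{m} (x_{ij} - v_{kj})^2$; the Lagrangian condition then gives $w_{kj} \propto (1/D_{kj})^{1/(q-1)}$. The symbol `q`, the function name, and the data are hypothetical, and this is not claimed to be the exact update (15).

```python
def update_feature_weights(X, u_k, v_k, m=2.0, q=2.0):
    """Weights of one cluster k under the assumed objective.
    D[j] is the membership-weighted deviation of feature j around the prototype."""
    n_features = len(v_k)
    D = [sum((u ** m) * (x[j] - v_k[j]) ** 2 for x, u in zip(X, u_k))
         for j in range(n_features)]
    # Assumes every D[j] > 0; a zero deviation would need special handling.
    inv = [(1.0 / d) ** (1.0 / (q - 1.0)) for d in D]
    total = sum(inv)
    return [w / total for w in inv]

# Hypothetical data: feature 0 is compact around the prototype, feature 1 is scattered.
X = [(1.0, 10.0), (1.1, -8.0), (0.9, 3.0), (1.0, -5.0)]
u_k = [1.0, 1.0, 1.0, 1.0]
v_k = (1.0, 0.0)
w = update_feature_weights(X, u_k, v_k)
print(w)
```

The compact feature receives nearly all of the weight, while the scattered feature is pushed toward zero, matching the qualitative statement in the text.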
To find the optimal cluster prototype , we assume and are fixed and the parameters (; ) are also constants. We take the gradient of with respect to and set it to zero:
The updating formula of cluster prototype is obtained as follows:
To solve the optimal fuzzy partition matrix , we assume and are fixed and the parameters (; ) are also constants. We set the gradient of with respect to to zero:
The updating formula of is derived as follows:
Formula (19) indicates that a large value of weighted distance leads to a smaller value of , for all , . It should be noted that the membership degree is also dependent on the coefficient . The choice of is important to the performance of the SSOD-AFW algorithm because it serves in distinguishing the importance of the third term relative to the other terms in objective function (8). If is too small, the third term will be neglected and the labels of objects will not work to promote the cluster structure. If is too large, the other terms will be neglected, and the negative influence of possible mislabels of objects will be enlarged. The value of should be chosen such that it has the same order of magnitude with the first term in (8). To determine the parameter in an adaptive way, in all experiments described in this paper, we choose proportional to as follows:where is a constant. Since the weighted distance is dynamically updated, the value of parameter is adaptively updated in each iteration.
3.3. Criteria for Outlier Identification
Based on the above analysis, outliers should hold low membership degrees to all clusters. Therefore, the sum of the memberships of an object to all clusters can be used to evaluate its outlying degree. For an object $x_i$ whose membership degree to the $k$th of the $c$ clusters is $u_{ik}$, the outlying degree is defined as
$$\mathrm{OD}(x_i) = \sum_{k=1}^{c} u_{ik}.$$
Thus, a small value of $\mathrm{OD}(x_i)$ indicates a high outlying possibility of object $x_i$. The outlying degree of each sample in a dataset is calculated, and the values are sorted in ascending order. The suspicious outliers are then found by extracting the top objects of the sorted sequence; the number extracted is either the given number of outliers contained in the dataset or the number of outliers one needs.
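The ranking step can be sketched as follows: sum each object's memberships over all clusters and report the objects with the smallest sums. The function name `top_outliers` and the membership matrix are hypothetical.

```python
def top_outliers(U, n_outliers):
    """U[i][k] is the membership of object i to cluster k.
    Returns the indices of the n_outliers objects with the smallest
    membership sums, together with all outlying degrees."""
    od = [sum(row) for row in U]
    order = sorted(range(len(U)), key=lambda i: od[i])
    return order[:n_outliers], od

# Hypothetical memberships of four objects to two clusters.
U = [[0.90, 0.05],
     [0.02, 0.03],
     [0.10, 0.85],
     [0.01, 0.02]]
idx, od = top_outliers(U, n_outliers=2)
print(idx)
```

Objects 0 and 2 each belong clearly to one cluster and therefore have large sums, whereas objects 1 and 3 have low memberships everywhere and are ranked first.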
In summary, the description of the SSOD-AFW algorithm is shown in Algorithm 1.
Algorithm 1 (semisupervised outlier detection based on adaptive feature weighted clustering (SSOD-AFW)).
Input. The dataset, the label information of some objects, the number of clusters, the algorithm parameters, and the number of outliers.
Output. The suspicious outliers.
(1) Calculate the parameter according to (6), randomly initialize matrix , and initialize all elements in as . Set iteration counter .
(2) Compute the matrix of cluster prototypes according to (17).
(3) Update the feature weight matrix by (15).
(4) Update the feature weighted distances by (7).
(5) Update the parameter by (20).
(6) Update the membership degree matrix according to (19).
(7) If the stopping condition holds, go to step (8); otherwise, increment the iteration counter and repeat steps (2) to (6).
(8) Compute the outlying degree of each object and sort the OD values in ascending order. Finally, output the top outliers with the smallest outlying degrees.
Computational complexity analysis: Step (2) needs $O(cNn)$ operations to compute the cluster prototypes. Computing the feature weights in Step (3) costs $O(cNn)$. Step (4) requires $O(cNn)$ operations to compute the weighted distances of objects to cluster prototypes. Step (5) needs $O(cn)$ operations to compute the parameter of (20) for objects with respect to cluster prototypes. Moreover, Step (6) needs $O(cn)$ operations to calculate the memberships of objects to clusters. Therefore, the overall computational complexity per iteration is $O(cNn)$, the same as that of the classical FCM algorithm.
3.4. Proof of Convergence
In this section, we discuss the convergence of the proposed SSOD-AFW algorithm. To prove the convergence of objective function in (8) by iterating , , and with formulas (15), (17), and (19), it only needs to prove that is monotonically decreasing and bounded after a finite number of iterations. Next Lemmas 2, 3, and 4 verify the monotonically decreasing property of with respect to , , and , respectively. Lemma 5 presents the boundedness of .
Proof. Due to the fact that and are fixed when updating by (19), here the objective function can be regarded as a function only associated with , denoted as . According to Lagrangian multiplier technique, computed via (19) is a stagnation point of . On the other hand, if Hessian matrix is proved to be positive definite at , it can be proved that attains its local minimum at . The Hessian matrix is expressed as is a diagonal matrix and its diagonal element isSince , Hessian matrix is positive definite. Having proved that is a stagnation point of () and is positive definite, we conclude is the local minimum of (). Then we have , where is the membership matrix after the th iteration in (19) and is the one after the th iteration. Therefore, objective function in (8) is nonincreasing by updating using formula (19).
Proof. Similar to Lemma 2, when and are fixed, we just need to prove that the Hessian matrix of Lagrangian of at is positive definite, where is computed by (15). The Hessian matrix is denoted as , whose element is expressed as follows:Since and , the diagonal entries of the diagonal matrix are apparently positive. Therefore, Hessian matrix is positive definite. attains its local minimum at computed by (15). This completes the proof.
Lemma 5. Objective function in (8) is bounded, there exists a constant , and it satisfies .
Proof. and , we have . Thus, is monotonically decreasing with respect to and hence . Apparently, the first term and the second term of the objective function are also bounded. In consequence, objective function in (8) is bounded.
Proof. Lemmas 2, 3, and 4 verify that objective function in (8) is nonincreasing under iterations according to (15), (17), and (19). Lemma 5 shows that has a finite bound. Though the parameter needs to be updated in the iteration process, it is a constant in the problem solving using Lagrangian multiplier technique. So does not affect the convergence of the SSOD-AFW algorithm. Combining the above conclusions, is sure to converge to a local minimum through iterations of , , and by (15), (17), and (19).
4. Experiments and Analysis
Comprehensive experiments and analysis on a synthetic dataset and several real-world datasets are conducted to show the effectiveness and superiority of the proposed SSOD-AFW. We compare the proposed algorithm with two state-of-the-art unsupervised outlier detection algorithms, LOF and $k$NN, one supervised method, SVDD, and one semisupervised method, EODSP.
4.1. Evaluation Metrics
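Consider the number of true outliers that a dataset contains and the number of those true outliers that an algorithm detects when its top-ranked, most suspicious instances are reported. The accuracy is then given by the ratio of the two:
$$\text{Accuracy} = \frac{\text{number of true outliers detected}}{\text{number of true outliers in the dataset}}.$$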
The receiver operating characteristic (ROC) curve represents the trade-off relationship between the detection rate and the false alarm rate. In general, the area under the ROC curve (AUC) is used to measure the performance of outlier detection method, and the value of AUC for ideal detection performance is close to one.
For a given outlier detection algorithm, if the true outliers occupy the top positions relative to the nonoutliers among the suspicious instances, then the rank-power (RP) of the algorithm is said to be high. If $s$ is the number of true outliers found within the top-ranked instances and $R_i$ denotes the rank of the $i$th true outlier, then RP is given by
$$\mathrm{RP} = \frac{s(s+1)}{2\sum_{i=1}^{s} R_i}.$$
RP reaches the maximum value 1 when all true outliers are in the top positions. Larger value of RP implies better performance of an algorithm.
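The two ranking-based metrics can be sketched as below. The sketch assumes that accuracy is the fraction of true outliers recovered among as many top-ranked objects as there are true outliers, and that rank-power follows the common definition $s(s+1) / (2\sum_i R_i)$ with 1-based ranks; the function names, the ranking, and the ground-truth set are hypothetical.

```python
def accuracy(ranked, true_outliers):
    """Fraction of true outliers found among the top len(true_outliers) ranked objects."""
    n_true = len(true_outliers)
    found = set(ranked[:n_true]) & set(true_outliers)
    return len(found) / n_true

def rank_power(ranked, true_outliers, top):
    """s(s+1) / (2 * sum of ranks) over the s true outliers within the top list."""
    ranks = [r for r, obj in enumerate(ranked[:top], start=1) if obj in true_outliers]
    s = len(ranks)
    return s * (s + 1) / (2 * sum(ranks)) if s else 0.0

ranked = [7, 2, 9, 4, 1]    # hypothetical object ids, most suspicious first
truth = {7, 9, 4}
acc = accuracy(ranked, truth)
rp = rank_power(ranked, truth, top=5)
print(acc, rp)
```

When every true outlier sits at the very top of the ranking, the ranks are 1, 2, ..., s, their sum is s(s+1)/2, and the rank-power equals one.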
4.2. Experiments on Synthetic Dataset
A two-dimensional synthetic dataset with two cluster patterns is generated from Gaussian distributions to compare intuitively the outlier detection results of the proposed method with those of the other four algorithms mentioned above. The mean vectors of the two clusters are and , respectively, and their covariance matrices are and . As Figure 2(a) shows, the synthetic dataset contains a total of 199 samples, of which 183 are normal samples (within the two clusters) and 16 are outliers (scattered between the two clusters). Thirteen normal objects and 5 outliers are labeled and drawn with distinct marker symbols, while the remaining samples are unlabeled. Figures 2(b)–2(f) illustrate the outlier detection results on the synthetic dataset obtained with LOF, $k$NN, SVDD, EODSP, and SSOD-AFW, respectively, where the red symbols denote the detected suspicious outliers. Here, the neighborhood size $k$ in LOF and $k$NN is set to 3. A Gaussian kernel function is chosen in SVDD and we set the bandwidth and the trade-off coefficient . In addition, the Euclidean distance threshold in EODSP is set to 0.1 and the percentage of negative set is set to . The parameter settings of the proposed algorithm are , , and . Except for SVDD, the top 16 objects with the highest outlying scores are taken as the results of the other four algorithms.
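Figure 2: Outlier detection results on the synthetic dataset. (a) Original synthetic dataset. (b)–(f) Results of LOF, $k$NN, SVDD, EODSP, and SSOD-AFW, respectively.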
In Figure 2, it is noticeable that the unsupervised methods LOF and $k$NN, as well as the supervised SVDD, fail to detect all 5 labeled outliers. Moreover, they misjudge some normal points inside the clusters as outliers. In contrast, the semisupervised EODSP algorithm and the proposed SSOD-AFW algorithm detect all 5 labeled outliers. However, EODSP misses some of the unlabeled true outliers and improperly identifies several true normal samples as outliers. Figure 2 thus shows that the proposed algorithm finds all the true outliers in the synthetic dataset and excludes the normal samples, while the other methods do not.
Figure 3 presents the numerical performance evaluation of outlier detection using LOF, $k$NN, SVDD, EODSP, and SSOD-AFW on the synthetic dataset. The accuracy, AUC, and RP of the proposed algorithm all reach 1, outperforming the other methods.
Furthermore, during the experimental process shown in Figure 3, the feature weights of the synthetic dataset learned by formula (15) in our method are , , , and . To strengthen the effectiveness of feature weights in the proposed SSOD-AFW algorithm, a comparative analysis of the weighted and the nonweighted versions is implemented on the synthetic dataset, respectively. Considering the nonweighted scenario of the proposed algorithm, the outlier detection result on the synthetic dataset is presented in Figure 4. As can be observed from Figure 4, the nonweighted SSOD-AFW ends up tagging 15 true outlying and 1 normal samples as outliers, with one unlabeled true outlier missed.
4.3. Experiments on Real-World Datasets
4.3.1. Introduction of Datasets
To further verify the effectiveness of the proposed algorithm, five real-world datasets from the UCI Machine Learning Repository (i.e., Iris, Abalone, Wine, Ecoli, and Breast Cancer Wisconsin (WDBC)) are employed to test its performance against LOF, $k$NN, SVDD, and EODSP. As mentioned in Aggarwal and Yu, one way to test the performance of an outlier detection algorithm is to run it on the dataset and calculate the percentage of detected points that belong to the rare classes. Therefore, for each of the five datasets, a small number of samples from the same class are randomly selected as outlying (target) objects. For instance, the original Iris dataset contains 150 objects, with 50 objects in each of three classes. We randomly select 26 objects from class “Iris-virginica” as target outliers, and all objects in the other two classes are considered normal objects. The other four datasets are preprocessed similarly, and a more detailed description of the five real-world datasets is given in Table 1.
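The rare-class preprocessing described above can be sketched as follows. The helper name `make_outlier_dataset`, the fixed seed, and the label strings are assumptions; only the class sizes (three classes of 50 objects) and the 26 selected outliers mirror the Iris example in the text.

```python
import random

def make_outlier_dataset(labels, outlier_class, n_outliers, seed=0):
    """Keep every object outside outlier_class as normal and a random subset of
    n_outliers objects from outlier_class as target outliers.
    Returns (kept_indices, is_outlier_flags) aligned with each other."""
    rng = random.Random(seed)
    rare = [i for i, y in enumerate(labels) if y == outlier_class]
    chosen = set(rng.sample(rare, n_outliers))
    kept = [i for i, y in enumerate(labels) if y != outlier_class or i in chosen]
    flags = [labels[i] == outlier_class for i in kept]
    return kept, flags

# Label vector with the class sizes of the Iris description.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
kept, flags = make_outlier_dataset(labels, "virginica", 26)
print(len(kept), sum(flags))
```

Repeating the call with different seeds gives different outlier subsets of the same size, which is how repeated trials with randomly selected outliers can be organized.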
4.3.2. Experimental Result Analysis
We compare the outlier detection performance of the proposed algorithm with LOF, NN, SVDD, and EODSP on real-world datasets. Each method has its own parameters, and the detailed parameter settings of each algorithm are as follows. The parameters of the proposed algorithm are , , and for all the five datasets. The strategy of parameter selection for SSOD-AFW will be discussed in the later subsection called parameter analysis. For the other algorithms, those parameters are set exactly as mentioned in their references. It is well known that LOF and NN have high dependency on the neighborhood parameter . In this paper we set for datasets Iris and WDBC, for dataset Abalone, for dataset Wine, and for dataset Ecoli. For SVDD method, Gaussian kernel function is employed and the bandwidths and on all of the five real-world datasets. In EODSP, the Euclidean distance threshold is set as 0.1 and the percentage of negative set is set as for Iris and Abalone datasets, and , for datasets Ecoli, Wine, and WDBC. Since we randomly select outliers from target classes for each dataset, each experiment is repeated 10 times with the same number of different outliers. The average accuracy, AUC, and RP are calculated as the criteria of performance of various detection methods.
Figure 5 illustrates the outlier detection results of the SSOD-AFW algorithm against LOF, $k$NN, SVDD, and EODSP on the five real-world datasets. As Figure 5 shows, the proposed algorithm can accurately identify outliers according to the cluster structure of a dataset, guided by the label knowledge. It shows distinct superiority over the other unsupervised (LOF, $k$NN), semisupervised (EODSP), and supervised (SVDD) methods. In particular, the outlier detection accuracy of SSOD-AFW in Figure 5(a) is significantly higher than that of the others, especially for the Iris and Wine datasets. Figure 5(b) shows that the AUC values of our method are higher than those of the others for all datasets except WDBC. In terms of RP (Figure 5(c)), SSOD-AFW performs better than the other four algorithms on Iris and Wine, but it is slightly poorer than SVDD on Abalone, poorer than LOF on Ecoli, and poorer than $k$NN on WDBC.
It is worth mentioning that the experiment of the proposed algorithm on WDBC involves a one-class clustering problem. Although a one-class clustering task is generally meaningless, one-class clustering-based outlier detection is meaningful and feasible in our proposal because our approach does not require the membership degrees of an object to sum to 1. This is one of the powerful and important characteristics of the proposed algorithm.
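A small sketch may help show why dropping the sum-to-1 constraint matters when there is only one cluster. It does not reproduce the SSOD-AFW update rule; as a stand-in it contrasts the standard fuzzy c-means membership (which is normalized across clusters) with the possibilistic typicality of Krishnapuram and Keller. The scale parameter `eta` and the sample distances are assumptions chosen for illustration.

```python
def fcm_memberships(distances_sq, m=2.0):
    """Fuzzy c-means memberships of ONE object, given its (positive) squared
    distance to every cluster prototype. The values are normalized to sum to 1."""
    inv = [d ** (-1.0 / (m - 1.0)) for d in distances_sq]
    total = sum(inv)
    return [v / total for v in inv]


def pcm_membership(distance_sq, eta=1.0, m=2.0):
    """Possibilistic typicality to a single cluster (Krishnapuram-Keller form).
    No normalization across clusters, so the value reflects absolute distance."""
    return 1.0 / (1.0 + (distance_sq / eta) ** (1.0 / (m - 1.0)))


# One cluster only: an object near the prototype and an object far from it.
near_sq, far_sq = 0.04, 25.0

# Under the sum-to-1 constraint both objects get membership 1.0,
# so the far object cannot be flagged as an outlier.
fcm_near = fcm_memberships([near_sq])[0]
fcm_far = fcm_memberships([far_sq])[0]

# Without the constraint the far object receives a much smaller membership.
pcm_near = pcm_membership(near_sq)
pcm_far = pcm_membership(far_sq)
```

With a single cluster the normalized memberships are identically 1, whereas the unconstrained memberships (about 0.96 and 0.04 here) separate the two objects.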
4.3.3. The Influence of the Proportion of Labeled Data on Outlier Detection
In this subsection, we investigate the influence of the proportion of labeled samples on the accuracy of our methodology. Two typical situations are considered and tested. In the first, the proportion of labeled outliers increases while the number of labeled normal objects is fixed at a constant. In the second, the percentage of labeled normal samples varies while the number of labeled outliers is fixed. Accordingly, two groups of experiments are designed to compare the accuracy of the proposed algorithm against EODSP under different percentages of labeled outliers and of labeled normal samples, respectively, on the datasets Iris, Abalone, Wine, Ecoli, and WDBC. In each group, the percentage of labeled outliers or labeled normal samples ranges from 0% to 40% while the number of the other kind of labeled objects is fixed. We randomly select the required number of labeled outliers or normal samples from each dataset, repeat each experiment 10 times, and compute the average accuracies of SSOD-AFW and EODSP.
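The first group of experiments can be outlined as the loop below. It is only a sketch of the protocol described above: `detect_accuracy` is a hypothetical callable standing in for a full run of a detector that returns its accuracy, and the 10% step between 0% and 40% is an assumption, since the text gives only the range.

```python
import random


def sample_labels(indices, percent, rng):
    """Randomly pick `percent`% of `indices` to serve as labeled objects."""
    k = round(len(indices) * percent / 100.0)
    return rng.sample(indices, k)


def sweep_labeled_outliers(outlier_idx, labeled_normal_idx, detect_accuracy,
                           percents=(0, 10, 20, 30, 40), n_runs=10, seed=0):
    """For each percentage of labeled outliers, average the accuracy over
    n_runs random selections while the labeled normal objects stay fixed."""
    rng = random.Random(seed)
    results = {}
    for pct in percents:
        accuracies = []
        for _ in range(n_runs):
            labeled_outliers = sample_labels(outlier_idx, pct, rng)
            # detect_accuracy is a placeholder for running the detector.
            accuracies.append(detect_accuracy(labeled_outliers, labeled_normal_idx))
        results[pct] = sum(accuracies) / n_runs
    return results
```

The second group follows the same pattern with the roles swapped: the labeled outliers are held fixed and the labeled normal samples are resampled at each percentage.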
Figure 6 shows the results of the first group of experiments, where the percentage of labeled outliers varies from 0% to 40%. One can see from Figure 6 that the accuracies of the two semisupervised algorithms generally increase as the proportion of labeled outliers grows. This supports the idea that semisupervised outlier detection algorithms can improve detection accuracy by using prior information. Furthermore, SSOD-AFW achieves a better accuracy than EODSP for the same proportion of labeled outliers on the five datasets. For Wine in particular, the accuracy of SSOD-AFW is 40% higher than that of EODSP. EODSP addresses the problem of detecting outliers with only a few labeled outliers as training data. The labeled normal instances are extracted according to the maximum entropy principle, where the entropy is computed using only the distance between each testing sample and all the labeled outliers. This lack of information makes EODSP less flexible than our proposed method.
Figure 7 compares the accuracy of the proposed algorithm and EODSP when the proportion of labeled normal samples increases from 0% to 40% and the percentage of labeled outliers is fixed. Note that our method obtains a better accuracy than EODSP on all five real-world datasets. The accuracy of the proposed algorithm increases as the percentage of labeled normal samples increases. As mentioned, EODSP performs semisupervised outlier detection with only a few labeled outliers in the initial dataset and does not consider any labeled normal objects. Therefore, the accuracy of the EODSP algorithm remains unchanged across the various proportions of labeled normal objects and always equals the accuracy obtained with no labeled normal samples.
4.3.4. Parameter Analysis
The parameters , , and are important in our proposed algorithm, which affect the performance of SSOD-AFW. In this section, the influence of each parameter on outlier detection accuracy is studied.
The parameter is the fuzzification coefficient. Figure 8(a) analyzes the relationship between the outlier detection accuracy of our proposed algorithm and parameter , with varying from 1.5 to 5.0. The results imply that the highest accuracy is achieved when ranges in . So it is reasonable that the value in the experiments shown in Figure 5 was set to 2.1. The parameter controls the importance of the label information in the result of outlier detection. Outlier detection accuracies are tested by varying from 0.1 to 0.9, and the results are shown in Figure 8(b). The overall tendency is that the accuracies become larger as increases. The best results of the proposed algorithm occur and remain stable when . Finally, from Figure 8(c), we conclude that the feature weight index has little influence on the accuracy of SSOD-AFW when the other parameters keep the same settings. So the proposed algorithm is not sensitive to the parameter . In general, it is suggested that the parameter be set to a constant from .
4.3.5. Execution Time Analysis
Figure 9 compares the average running time of the proposed algorithm with that of the other algorithms on the five real-world datasets. The experimental environment is a Windows XP system with the MATLAB 7.1 platform, a 3 GHz CPU, and 2 GB of RAM. Because the Abalone dataset is far larger than the other four datasets, the running times on the various datasets differ markedly. To make the display easier to read, the horizontal coordinate axis in Figure 9 is translated downward a short distance. The result indicates that the proposed algorithm is faster than the other four typical outlier detection algorithms, except for NN on dataset Wine. On the whole, the execution time of SSOD-AFW is comparable to that of NN and less than those of the other algorithms on most of the datasets.
In order to detect outliers more precisely, a semisupervised outlier detection algorithm based on adaptive feature weighted clustering, called SSOD-AFW, is proposed in this paper. Distinct weights of each feature with respect to different clusters are considered and obtained by adaptive iteration, so that the negative effects of irrelevant features on outlier detection are weakened. Moreover, the proposed method makes full use of the prior knowledge contained in datasets and detects outliers by virtue of the cluster structure. A series of experiments verifies that the proposed SSOD-AFW algorithm is superior to other typical unsupervised, semisupervised, and supervised algorithms in both outlier detection precision and running speed.
In this paper, we present a new semisupervised outlier detection method which utilizes the labels of a small number of objects. However, our method assumes that the labels of objects are reliable and does not include a penalty for mislabeling in the new objective function. Therefore, a robust version of the proposed method that deals with noisy or imperfect labels deserves further study. Moreover, since only one typical dissimilarity measure, the Euclidean distance, is discussed in our method, the SSOD-AFW algorithm is limited to outlier detection for numerical data. Future research aims at extending our method to mixed-attribute data in more real-life applications such as fault diagnosis in industrial processes or network anomaly detection.
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by the National Natural Science Foundation of China (Grant no. 11471001).
- J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Elsevier, 2011.
- D. M. Hawkins, Identification of Outliers, Chapman & Hall, London, UK, 1980.
- V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley & Sons, Chichester, UK, 1994.
- V. J. Hodge and J. Austin, “A survey of outlier detection methodologies,” Artificial Intelligence Review, vol. 22, no. 2, pp. 85–126, 2004.
- B. Sheng, Q. Li, W. Mao, and W. Jin, “Outlier detection in sensor networks,” in Proceedings of the 8th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc '07), pp. 219–228, Montréal, Canada, September 2007.
- A. P. James and S. Dimitrijev, “Inter-image outliers and their application to image classification,” Pattern Recognition, vol. 43, no. 12, pp. 4101–4112, 2010.
- J. Huang, Q. Zhu, L. Yang, and J. Feng, “A non-parameter outlier detection algorithm based on Natural Neighbor,” Knowledge-Based Systems, vol. 92, pp. 71–77, 2016.
- J. M. Shepherd and S. J. Burian, “Detection of urban-induced rainfall anomalies in a major coastal city,” Earth Interactions, vol. 7, no. 4, pp. 1–17, 2003.
- O. Alan and C. Catal, “Thresholds based outlier detection approach for mining class outliers: an empirical case study on software measurement datasets,” Expert Systems with Applications, vol. 38, no. 4, pp. 3440–3445, 2011.
- B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson, “Estimating the support of a high-dimensional distribution,” Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001.
- D. Tax, A. Ypma, and R. Duin, “Support vector data description applied to machine vibration analysis,” in Proceedings of the 5th Annual Conference of the Advanced School for Computing and Imaging, pp. 398–405, Heijen, The Netherlands, June 1999.
- D. M. J. Tax and R. P. W. Duin, “Support vector data description,” Machine Learning, vol. 54, no. 1, pp. 45–66, 2004.
- S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient algorithms for mining outliers from large data sets,” SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 29, no. 2, pp. 427–438, 2000.
- E. M. Knox and R. T. Ng, “Algorithms for mining distance based outliers in large dataset,” in Proceedings of the International Conference on Very Large Data Bases, pp. 392–403, Citeseer, New York, NY, USA, 1998.
- M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: identifying density-based local outliers,” ACM SIGMOD Record, vol. 29, no. 2, pp. 93–104, 2000.
- J. Ha, S. Seok, and J.-S. Lee, “A precise ranking method for outlier detection,” Information Sciences, vol. 324, pp. 88–107, 2015.
- R. N. Dave, “Characterization and detection of noise in clustering,” Pattern Recognition Letters, vol. 12, no. 11, pp. 657–664, 1991.
- R. Smith, A. Bivens, M. Embrechts, C. Palagiri, and B. Szymanski, “Clustering approaches for anomaly based intrusion detection,” Proceedings of Intelligent Engineering Systems Through Artificial Neural Networks, pp. 579–584, 2002.
- Y. Shi and L. Zhang, “COID: a cluster-outlier iterative detection approach to multi-dimensional data analysis,” Knowledge and Information Systems, vol. 28, no. 3, pp. 709–733, 2011.
- J. Zhao, K. Liu, W. Wang, and Y. Liu, “Adaptive fuzzy clustering based anomaly data detection in energy system of steel industry,” Information Sciences, vol. 259, pp. 335–345, 2014.
- F. Angiulli and C. Pizzuti, “Fast outlier detection in high dimensional spaces,” in Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '02), vol. 2, pp. 15–26, August 2002.
- M. Radovanović, A. Nanopoulos, and M. Ivanović, “Reverse nearest neighbors in unsupervised distance-based outlier detection,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, pp. 1369–1382, 2015.
- A. Daneshpazhouh and A. Sami, “Entropy-based outlier detection using semi-supervised approach with few positive examples,” Pattern Recognition Letters, vol. 49, pp. 77–84, 2014.
- J. Gao, H. Cheng, and P.-N. Tan, “Semi-supervised outlier detection,” in Proceedings of the ACM Symposium on Applied Computing, pp. 635–636, ACM, Dijon, France, April 2006.
- Z. Xue, Y. Shang, and A. Feng, “Semi-supervised outlier detection based on fuzzy rough C-means clustering,” Mathematics and Computers in Simulation, vol. 80, no. 9, pp. 1911–1921, 2010.
- J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, “Automated variable weighting in k-means type clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657–668, 2005.
- H. Frigui and O. Nasraoui, “Unsupervised learning of prototypes and attribute weights,” Pattern Recognition, vol. 37, no. 3, pp. 567–581, 2004.
- J. Zhou, L. Chen, C. P. Chen, Y. Zhang, and H. Li, “Fuzzy clustering with the entropy of attribute weights,” Neurocomputing, vol. 198, pp. 125–134, 2016.
- M. Hassan, A. Chaudhry, A. Khan, and M. A. Iftikhar, “Robust information gain based fuzzy c-means clustering and classification of carotid artery ultrasound images,” Computer Methods and Programs in Biomedicine, vol. 113, no. 2, pp. 593–609, 2014.
- R. Krishnapuram and J. M. Keller, “A possibilistic approach to clustering,” IEEE Transactions on Fuzzy Systems, vol. 1, no. 2, pp. 98–110, 1993.
- M.-S. Yang and K.-L. Wu, “Unsupervised possibilistic clustering,” Pattern Recognition, vol. 39, no. 1, pp. 5–21, 2006.
- S. M. Guo, L. C. Chen, and J. S. H. Tsai, “A boundary method for outlier detection based on support vector domain description,” Pattern Recognition, vol. 42, no. 1, pp. 77–83, 2009.
- T. Fawcett, “ROC graphs: notes and practical considerations for researchers,” Machine Learning, vol. 31, no. 1, pp. 1–38, 2004.
- C. Blake and C. J. Merz, “UCI repository of machine learning databases,” 1998.
- C. C. Aggarwal and P. S. Yu, “Outlier detection for high dimensional data,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '01), Santa Barbara, Calif, USA, May 2001.
Copyright © 2016 Tingquan Deng and Jinhong Yang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.