Abstract

Outlier detection is a challenging task, especially when outliers are defined by rare combinations of multiple variables. In this paper, we develop and evaluate a new method for detecting outliers in multivariate data that relies on Principal Component Analysis (PCA) and three-sigma limits. The proposed approach employs PCA to perform dimension reduction by regenerating the variables, i.e., producing fitted points from the original observations; observations lying outside the three-sigma limits of the resulting error series are identified as outliers. The proposed method has been successfully applied to two real-life and several artificially generated datasets. Its performance is compared with some existing methods using different evaluation criteria, including the percentage of correct classification, precision, recall, and F-measure. The superiority of the proposed method is confirmed by the abovementioned criteria and datasets. The F-measure for the first real-life dataset is highest for the proposed method, i.e., 0.6667, against 0.3333 and 0.4000 for the two existing approaches. Similarly, for the second real dataset, this measure is 0.8000 for the proposed approach against 0.5263 and 0.6315 for the two existing approaches. The simulation experiments also show that the performance of the proposed approach improves with increasing sample size.

1. Introduction

In most real-life datasets, there exist observations that do not conform to the general model and/or behavior of the data. Such observations, significantly inconsistent with the majority of the observations in the dataset, are known as outliers. The outlier detection problem needs to be addressed in a wide range of applications, including fraud detection (e.g., suspicious use of credit cards or other kinds of financial transactions), health data analysis (e.g., detecting unusual responses to treatment plans among patients), fault detection in production processes, and network intrusion detection. Moreover, several data analysis tasks are influenced by the presence of outliers and require minimizing the effect of outlying observations or eliminating them altogether. Detecting outliers in multivariate data is a nontrivial task that becomes even more problematic for high-dimensional datasets.

Existing techniques for the general outlier detection problem can be broadly categorized into four key approaches: statistical distribution-based, distance-based, density-based, and subspace-learning-based approaches [14].

Statistical distribution-based approaches assume a distribution or probability model (such as the normal or Poisson distribution) for the given dataset and find outlying observations with reference to the selected model by employing a "discordance test" with respect to some known parameters of the dataset, e.g., the mean, variance, and/or an assumed data distribution [3]. Most approaches in this category are designed for univariate datasets, i.e., those having a single attribute; however, many problems involve outlier detection in multidimensional datasets. Zhao et al. [5] presented the COPOD outlier detection method, which draws on statistical methods to model the multivariate data distribution. COPOD first builds the empirical copula and then uses the fitted model to predict the tail probability of each data observation in order to classify it as a regular or outlying observation. A key concern with approaches relying on the statistical distribution of the dataset is that the distribution and its related parameters may not be known a priori. Moreover, the statistical parameters of the dataset can also expose outlier detection to masking or swamping effects.

Distance-based approaches rely on the distances between observations to detect outliers: data observations that do not have enough neighboring observations within a distance threshold are considered outliers [3, 6]. The first effort towards outlier detection in multivariate functional data using graphical tools was the Functional Outlier Map (FOM) approach [7, 8]. These methods utilize statistical depth functions and distance measures derived from them for outlier detection. Prykhodko et al. addressed outlier detection in multivariate nonnormal data based upon univariate and multivariate normalizing transformations [9], using the squared Mahalanobis distance and a quantile of the chi-square distribution for the purpose. In a recent work, Caberoa et al. presented an outlier detection method [10] that performs archetype analysis to combine projections into relevant subspaces with a nearest-neighbor algorithm. In addition to their reliance on statistical characteristics of the dataset (e.g., mean values), a key concern with distance-based approaches stems from their reliance on global information about the dataset, as their performance depends on the neighborhood size of observations.

Density-based approaches [11] rely on the local outlier factors of data points, computed by considering the local density of their neighborhoods. While approaches in this category achieve good accuracy without making any assumptions about the dataset distribution, they have high computational complexity, especially for high-dimensional, large datasets.

Among subspace-learning-based approaches [11, 12] for outlier detection, Zhao et al. proposed LOMA [12], a local outlier detection approach for massive high-dimensional datasets. LOMA performs data reduction by employing attribute relevance analysis. Further, it employs particle swarm optimization for efficient searching of sparse subspaces, where the data density, i.e., the number of observations, is very small. Our proposed method is similar in that it also performs subspace learning; however, instead of attribute relevance analysis, we employ principal component analysis for dimension reduction.

Most existing outlier detection methods are either designed for univariate datasets or require a large number of data points to perform effectively. For example, with distance-based methods, it is difficult to identify outliers simply by computing distances from a few available data points to the mean value. Moreover, existing approaches that consider the entire variable set are computationally expensive for high-dimensional datasets. However, multivariate datasets of high dimensionality and varying sizes, in terms of their number of instances, are often encountered in real-life data analytics.

We employ PCA with three-sigma limits for the identification of outliers. PCA is one of the most prevalent linear dimension reduction techniques. It reduces the dimensionality of high-dimensional multivariate datasets with minimum loss of information by producing new uncorrelated variables that successively maximize variance; the new variables are linear combinations of all the original variables. Graph-based methods are useful tools for identifying outliers in multivariate data, especially when working with PCs, but they may not be effective for real-time detection applications. The validity of existing formal tests rests on assumptions such as the dataset having a multivariate normal distribution; if these assumptions are not satisfied, those methods cannot be applied. We propose an innovative outlier detection approach based upon the PCs and three-sigma limits. The proposed approach can be employed in real time and does not require any assumption or restriction on the dataset.

The rest of the paper is organized as follows. Section 2 describes the multivariate outlier detection problem and introduces important notation. Section 3 presents our proposed outlier detection method. Section 4 explains the datasets and presents the performance evaluation results of our proposed method. Section 5 provides a discussion of evaluation results and finally concludes the paper.

2. Multivariate Outlier Detection Problem

Most datasets contain one or more unusual observations, i.e., observations that are dissimilar from the majority of the observations in the dataset or are doubtful under the expected probability model of the dataset. In a dataset consisting of a single feature, observations that are either very large or very small compared to the others are unusual. If the distribution of the dataset is assumed to be normal, then an observation whose standardized value exceeds 3 in absolute value is usually considered an outlying observation. The situation becomes complex for a dataset having numerous features. In high-dimensional datasets, there can be outliers that cannot be identified when each dimension is considered independently and, hence, cannot be identified using the univariate criterion. Therefore, a multivariate approach is required, in which all the dimensions are considered together.

Let $Y_1, Y_2, \ldots, Y_p$ be a random sample of size $n$ from a multivariate distribution with $p$ variables. Each $Y_j$ is defined as a vector of $n$ observations, $Y_j = (y_{1j}, y_{2j}, \ldots, y_{nj})^T$, where $j = 1, 2, \ldots, p$.

The most commonly used approaches to identify outliers in multivariate data are based upon measuring the distances of observations from the central point of the dataset. If the observations $y_1, y_2, \ldots, y_n$ follow a multivariate normal distribution, then, for any forthcoming observation $y$ from the same multivariate normal distribution, a statistic that relies upon the squared Mahalanobis distance is defined as
$$D^2 = (y - \bar{y})^T S^{-1} (y - \bar{y}),$$
where $\bar{y}$ and $S$ are, respectively, the sample mean vector and sample covariance matrix, defined as
$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad S = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^T.$$
The scaled statistic $\frac{n(n-p)}{p(n^2 - 1)} D^2$ follows an F-distribution with $p$ and $n - p$ degrees of freedom [3]. A higher value of $D^2$ is an indication of a larger distance of the observation from the center of the data. Other distance measures, such as the Euclidean distance or the Canberra metric, can also be used in place of the Mahalanobis distance. An observation whose distance exceeds a threshold value is identified as an outlying observation. The threshold value is usually based upon the distribution of the distance measure; the distribution of these distances is not easy to derive, even under the assumption of normality.
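As an illustration, the following minimal Python sketch flags outliers using the squared Mahalanobis distance and the F-quantile threshold described above. The function name and significance level are our choices, and the F result strictly holds for a forthcoming observation; applying it to the sample points themselves is an approximation.

```python
import numpy as np
from scipy import stats

def mahalanobis_outliers(Y, alpha=0.05):
    """Sketch: flag outliers via squared Mahalanobis distance and an F threshold."""
    n, p = Y.shape
    ybar = Y.mean(axis=0)                               # sample mean vector
    S_inv = np.linalg.inv(np.cov(Y, rowvar=False))      # inverse sample covariance
    diff = Y - ybar
    d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)    # squared Mahalanobis distances
    # Scale so the statistic follows F(p, n - p) under multivariate normality;
    # used here on the in-sample points as an approximation.
    f_stat = (n * (n - p)) / (p * (n ** 2 - 1)) * d2
    return np.where(f_stat > stats.f.ppf(1 - alpha, p, n - p))[0]
```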

PCA-based methods have a long history in the identification of outliers in multivariate data [13–15]. The largest cumulative proportion of the total sample variance is explained by the leading (first few) PCs, which have large variances. These leading or major PCs tend to be strongly related to the dimensions that have larger variances and covariances. As a consequence, observations that are outlying with respect to the leading or major components typically correspond to outlying observations on one or more of the original variables. In our proposed approach, we employ PCA with three-sigma limits on the error series to identify outlier observations, as discussed in the following section.

Let $z_i = (z_{i1}, z_{i2}, \ldots, z_{iq})^T$ be the principal component score vector for observation vector $y_i$. The number of variables is $p$, $q$ is the number of retained PCs, and $q \le p$. The sum of squared principal component scores for observation $i$, scaled by the corresponding eigenvalues and given as
$$T_i^2 = \sum_{j=1}^{q} \frac{z_{ij}^2}{\lambda_j},$$
follows a chi-square distribution [16] with $q$ degrees of freedom under the assumptions that $\lambda_1 > \lambda_2 > \cdots > \lambda_q > 0$ and all $\lambda_j$ are distinct.

For a specified level of significance $\alpha$, an observation $y_i$ is identified as an outlier if $T_i^2 > \chi^2_{q}(\alpha)$.

Here, $\chi^2_{q}(\alpha)$ is the upper $\alpha$ percentage point of the chi-square distribution with $q$ degrees of freedom. The value of $\alpha$ refers to the false alarm rate in identifying a normal observation as an outlying observation.
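A minimal Python sketch of this test follows; the naming is ours, and the eigenvalue scaling of the standardized scores reflects the statistic above.

```python
import numpy as np
from scipy import stats

def pc_score_outliers(Y, q, alpha=0.05):
    """Sketch: flag outliers via the chi-square test on the leading q PC scores."""
    Z0 = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)   # standardize the data
    eigvals, A = np.linalg.eigh(np.corrcoef(Y, rowvar=False))
    order = np.argsort(eigvals)[::-1]                   # descending eigenvalues
    eigvals, A = eigvals[order], A[:, order]
    scores = Z0 @ A[:, :q]                              # leading q PC scores
    t2 = np.sum(scores ** 2 / eigvals[:q], axis=1)      # eigenvalue-scaled sum of squares
    return np.where(t2 > stats.chi2.ppf(1 - alpha, q))[0]
```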

3. Proposed Method

Our proposed method for outlier detection is based upon regenerating the variables using the major PCs by following [17, 18]. The step-by-step procedural details of the proposed method are presented below.

Step 1: Estimate the PCs of the original variables. In this step, we perform PCA by converting the original variables into a set of orthogonal variables, i.e., the principal components. These PCs are computed such that the first PC is a maximum-variance linear combination of the original variables, and the second PC is the linear combination of the original variables that accounts for the maximum remaining variation subject to having zero correlation with the first PC. The remaining PCs are computed in a similar manner so that they are all uncorrelated with each other.

For computing the PCs, we first subtract the mean value of each variable from the dataset in order to center the original values around the origin, and we compute the pair-wise correlations among variables to form the correlation matrix. Eigenvalues and eigenvectors of the correlation matrix are then computed. The scaled eigenvectors represent the PCs, with the corresponding eigenvalues representing the degree of variance among data observations in the eigenvectors' directions.
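As an illustration of Step 1, a minimal Python sketch is given below; the function and variable names are ours, assuming the correlation-matrix convention described above.

```python
import numpy as np

def compute_pcs(Y):
    """Step 1 sketch: PCs from the eigendecomposition of the correlation matrix.

    Returns the eigenvalues in descending order, the eigenvector matrix A,
    and the principal component scores Z of the standardized data.
    """
    Z0 = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)  # center and scale each variable
    R = np.corrcoef(Y, rowvar=False)                   # pair-wise correlation matrix
    eigvals, A = np.linalg.eigh(R)                     # eigenvalues/eigenvectors (ascending)
    order = np.argsort(eigvals)[::-1]                  # reorder to descending eigenvalues
    eigvals, A = eigvals[order], A[:, order]
    return eigvals, A, Z0 @ A                          # scores Z = Z0 A
```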

Given the $n \times p$ multivariate data matrix
$$Y = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1p} \\ y_{21} & y_{22} & \cdots & y_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{np} \end{pmatrix},$$
where $p$ is the number of variables and each of the $n$ rows denotes the data observations corresponding to these variables, let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ denote the eigenvalues of the correlation matrix of $Y$, such that all $\lambda_j$ are distinct. $A$ denotes the $p \times p$ matrix of eigenvectors corresponding to the eigenvalues of the correlation matrix of $Y$, given as
$$A = (a_1, a_2, \ldots, a_p),$$
where $a_1, a_2, \ldots, a_p$ are the eigenvectors.

The $n \times p$ matrix $Z$ of estimated principal component scores is then computed as
$$Z = Y A,$$
or, elementwise,
$$z_{ik} = \sum_{l=1}^{p} y_{il} a_{lk}, \quad i = 1, \ldots, n, \; k = 1, \ldots, p.$$

Since $A$ is orthogonal, let $A^{-1} = A^T$, and then
$$Y = Z A^T.$$

The $n \times p$ matrix of weighted PCs for the $j$th variable is then computed as
$$V^{(j)} = \begin{pmatrix} z_{11} a_{j1} & z_{12} a_{j2} & \cdots & z_{1p} a_{jp} \\ \vdots & \vdots & & \vdots \\ z_{n1} a_{j1} & z_{n2} a_{j2} & \cdots & z_{np} a_{jp} \end{pmatrix},$$
so that the $k$th column of $V^{(j)}$ contains the $k$th PC scores weighted by the corresponding eigenvector element $a_{jk}$ of variable $j$.

Step 2: Regenerating the series. This step involves regenerating the original series with an appropriate reduction in dimensions (as suggested by any rule, e.g., the scree plot). The original variables can be regenerated without any loss if all the PCs are used in the process; this, however, contributes nothing towards dimensionality reduction. In principle, the number of PCs used to regenerate the original variables should be smaller than the original number of variables.

Let $r$ be the retention level, i.e., the reduced number of PCs used for regenerating the variables. Then, the initial $r$ elements of each row of the weighted PC matrix $V^{(j)}$ are cumulated to construct the cumulative PC scores for observation $i$ and variable $j$ using the retention level $r$. Thus, the $i$th observation of the $j$th variable using retention level $r$ is regenerated as
$$\hat{y}_{ij} = \sum_{k=1}^{r} z_{ik} a_{jk}.$$
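In code, this regeneration is a truncated reconstruction from the leading $r$ score columns. The sketch below reuses the outputs of the Step 1 sketch above and therefore operates on the standardized series, which is an assumption of our illustration.

```python
import numpy as np

def regenerate(Z, A, r):
    """Step 2 sketch: regenerate all variables from the first r PCs.

    Implements y_hat[i, j] = sum over k = 1..r of z[i, k] * a[j, k]
    for every observation i and variable j at once.
    """
    return Z[:, :r] @ A[:, :r].T
```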

Step 3: Compute the error series for each variable. In the PCA-based proposed procedure, we treat the observations of the original variables as the original data points and the observations of the regenerated variables as fitted points. In this step, we compute their difference as the error, denoted by $e_{ij} = y_{ij} - \hat{y}_{ij}$.

Step 4: Employ three-sigma limits to detect outliers. Once the error series are computed by applying the abovementioned technique, we employ the three-sigma limits. Three-sigma limits are typically applied for identifying and/or removing anomalies or outliers in different datasets. Employing three-sigma limits implies that only a very small number of possible observations fall outside the specification limits of the corresponding dataset. Sigma is essentially a reference to the intervals under a normal or "Gaussian" curve, each interval being equal to one standard deviation or sigma. Three-sigma limits hence refer to three sigma from the mean of the data under the curve. In the case of a normal distribution, 68.27% of the data points are within one sigma from the mean, 95.45% are within two sigma, and 99.73% are within three sigma. A variation exceeding three sigma indicates room for improvement.

As discussed, the regenerated variables are based upon the major PCs, which account for the maximum of the variation in the data and are essentially linear combinations of all the original variables. Considering the difference between the regenerated and original series as the error, three-sigma limits for each of the error series are computed, and the observations lying outside the limits are treated as outliers.
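A minimal sketch of Steps 3 and 4 follows, flagging any observation whose error on any variable falls outside its three-sigma limits; names are illustrative.

```python
import numpy as np

def three_sigma_outliers(E):
    """Steps 3-4 sketch: flag observations outside the three-sigma limits.

    E is the n x p matrix of error series (original minus regenerated values).
    """
    mu = E.mean(axis=0)                        # mean of each error series
    sigma = E.std(axis=0, ddof=1)              # standard deviation of each error series
    outside = np.abs(E - mu) > 3 * sigma       # per-variable three-sigma check
    return np.where(outside.any(axis=1))[0]    # indices flagged on any variable
```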

Algorithm 1 summarizes the step-wise details of the proposed method.

Input: Y: n × p matrix of multivariate data, where p is the number of variables and each of the n rows denotes data observations/values corresponding to these variables; r: retention level, i.e., the reduced number of PCs to consider
Output: Identification of outlier data observations
Steps:
(1) A = computeEigenvectors(Y) /* compute the matrix of eigenvectors using equation (5) */
(2) Z = estimatePCscores(Y, A) /* calculate the principal component scores using equation (6) */
(3) V = weightedPCs(Z, A) /* compute weighted PCs of the original variables using equation (9) */
(4) Yhat = regenerateSeries(V, r) /* regenerate the data series using equation (10) */
/* Compute the error series by considering the difference of the original variables and regenerated variables as errors */
(5) for j = 1 to p
(6)   E_j = difference(Y_j, Yhat_j)
(7) for each error series E_j
(8)   mu_j = Mean(E_j)
(9)   sigma_j = StandardDeviation(E_j)
(10) for each data observation e_ij in E_j
(11)   classify y_ij as an outlier if it lies outside the three-sigma limits, i.e., e_ij < mu_j − 3·sigma_j or e_ij > mu_j + 3·sigma_j

Note that we determine the retention level r by considering the scree plot.
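Putting the steps together, the following compact Python sketch mirrors Algorithm 1. All identifiers are ours, and the standardization reflects the use of the correlation matrix, so the error series here are on the standardized scale.

```python
import numpy as np

def pca_three_sigma_outliers(Y, r):
    """Sketch of Algorithm 1: PCA regeneration plus three-sigma limits.

    Y: n x p data matrix; r: retention level chosen from the scree plot.
    Returns the indices of observations lying outside the three-sigma
    limits of any error series.
    """
    Z0 = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)  # center and scale
    eigvals, A = np.linalg.eigh(np.corrcoef(Y, rowvar=False))
    A = A[:, np.argsort(eigvals)[::-1]]                # eigenvectors, descending eigenvalues
    Z = Z0 @ A                                         # PC scores (equation (6))
    Y_hat = Z[:, :r] @ A[:, :r].T                      # regenerated series (equation (10))
    E = Z0 - Y_hat                                     # error series (Step 3)
    mu, sigma = E.mean(axis=0), E.std(axis=0, ddof=1)  # limits per error series
    outside = np.abs(E - mu) > 3 * sigma               # three-sigma check (Step 4)
    return np.where(outside.any(axis=1))[0]
```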

4. Numerical Evaluation

In this section, we evaluate and compare the performance of the proposed outlier detection method with two of the most commonly used existing methods by considering two real-world applications and a simulation study.

4.1. Real Applications

This subsection presents the performance evaluation results of our proposed outlier detection method using two real-life applications.

4.1.1. Silicon Wafer Thickness Data

The first application is related to silicon wafer thickness data; the data source is given in the Data Availability section. The thickness of a single wafer was measured at nine different locations for 184 consecutive lots. A single wafer was removed from the tray of wafers, always at the same position, for each lot after the completion of the chemical vapor deposition process. All the observations of the dataset had been approximately centered and scaled to disguise the original variables for privacy.

Figure 1 shows the matrix plot of each of the nine variables of the silicon wafer thickness dataset. All the variables are regenerated using the first two PCs, because the first two PCs account for almost 94% of the variation in the data.

4.1.2. Solvents Dataset

The second real-life application is related to nine physical properties of 103 chemical solvents: Melting Point, Boiling Point, Dielectric constant, Dipole Moment, Refractive Index, $E_T$ (an empirical solvent polarity parameter), Density, $\log P$ (the partition coefficient of a molecule between aqueous and lipophilic phases), and Solubility. The data source is given in the Data Availability section.

Figure 2 presents the matrix dot-plot of the nine variables of the dataset. The reconstruction of all nine variables is done using the first three PCs, as suggested by the scree plot; the first three PCs account for 77% of the total variation.

Table 1 presents the eigenvalues and the proportion and cumulative proportion of variance accounted for by the respective components for both datasets. The elbows in the scree plots presented in Figures 3(a) and 3(b) suggest retaining the first two PCs for the silicon wafer thickness data and the first three PCs for the solvents data.

Outlier detection is performed with the two previously explained existing methods, based upon major PCs and the Mahalanobis distance, and with our proposed method. The error series, i.e., the differences between the original and regenerated variables, are computed. The means of all these error series are approximately zero, which indicates how well the regenerated variables fit. Table 2 presents the means of the error series for both datasets (the $\bar{e}_j$'s are the means of the error series).

To gauge the performance of the existing and proposed methods, we use the confusion matrix [19], a standard tool for the performance evaluation of outlier detection methods (Table 3).

The performance of the outlier detection methods is also evaluated by their true detection rate.

Three other metrics, i.e., precision, recall [20], and F-measure [21], have also been used to evaluate the performance of the proposed and existing approaches. Precision and recall are defined as follows:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$

The F-measure combines the precision and recall measures and is defined as
$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}.$$

The value of $\beta$ is usually taken as 1 [22].
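As a check on the arithmetic, a minimal Python helper computing these metrics from confusion-matrix counts might look as follows; the example counts are our inference from the reported 0.6667 values, not taken from Table 4.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Compute precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_measure

# With beta = 1, counts of TP = 2, FP = 1, FN = 1 reproduce the 0.6667
# precision/recall/F-measure reported for the proposed method on the
# silicon wafer data (an inferred example, not read from Table 4).
print(precision_recall_f(2, 1, 1))
```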

The analysis of the results for the silicon wafer thickness data revealed that the proposed method detected observations 39, 111, and 155 as outliers. The outliers detected with the method based upon major PCs are observations 39, 72, 155, 161, and 174. Similarly, observations 39, 61, and 145 are detected as outliers with the method based upon the Mahalanobis distance.

For the solvents data, the proposed method detected observations 2, 5, 9, 15, 51, 70, 83, 97, and 101 as outliers. The outliers detected with the method based upon major PCs are observations 2, 5, 9, 19, 61, 92, 97, and 101. Similarly, observations 2, 5, 9, 15, 34, 83, 97, and 102 are detected as outliers with the method based upon the Mahalanobis distance.

Table 4 presents the True Positives, False Negatives, False Positives, and True Negatives detected in both datasets with the proposed and existing methods. The precision and recall computed using all three approaches are given in Table 5.

The results presented in Tables 4 and 5 indicate the superiority of the proposed method: all three evaluation criteria, i.e., precision, recall, and F-measure, are higher for the proposed method. The same can be observed for the solvents dataset, for which TPs and TNs are highest, and FNs and FPs are lowest, with the proposed method. Precision, recall, and F-measure are found to be 0.8889, 0.7273, and 0.8000, respectively, the highest, for the proposed method.

4.2. Simulated Datasets

In this subsection, we present the comparative performance evaluation results of the proposed and the two existing methods with the help of simulated datasets. For this purpose, ten variables are generated from a multivariate standard normal distribution with three different levels of correlation, i.e., 0.90, 0.95, and 0.975. Three different sample sizes are used, i.e., 200, 500, and 1000 observations, for each of the three sets of variables. Three levels of contamination, i.e., 2%, 5%, and 10%, are used to insert outliers in each of the ten variables. A total of $m$ (the number of outliers implied by the contamination level) random numbers between 1 and the sample size are produced to select the $m$ rows in which outliers are inserted. The mean and standard deviation of the data are calculated, and each observation of the selected rows is then multiplied by a factor based on these values to convert it into an outlier.

Hence, a total of twenty-seven datasets (three correlation levels × three sample sizes × three contamination levels) are generated in this way. The simulation experiments are replicated 1000 times to compute the percentage true detection rate, precision, and recall.
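For illustration, a sketch of this design in Python follows. The exact multiplier used to convert selected observations into outliers is not recoverable from the text, so shifting each selected row by three standard deviations is our assumption.

```python
import numpy as np

def simulate_dataset(n, p=10, rho=0.90, contamination=0.02, seed=0):
    """Generate one simulated dataset following the design above.

    The outlier mechanism here (adding three standard deviations to the
    selected rows) is an assumed stand-in for the paper's multiplier.
    """
    rng = np.random.default_rng(seed)
    cov = np.full((p, p), rho)           # equicorrelated covariance matrix
    np.fill_diagonal(cov, 1.0)
    Y = rng.multivariate_normal(np.zeros(p), cov, size=n)
    m = int(round(contamination * n))    # number of outliers to insert
    rows = rng.choice(n, size=m, replace=False)
    Y[rows] += 3.0 * Y.std(axis=0)       # assumed contamination shift
    return Y, rows
```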

The error series for all the simulated datasets are computed, and their means are presented in Table 6. It can be observed that the means of all these error series are approximately zero.

Tables 7 and 8 present the percentages of true positives, false negatives, false positives, and true negatives, along with precision, recall, and F-measure, for the simulated datasets. The percentage of true detection is highest, and that of false detection is lowest, for the proposed method as compared to the existing methods. The sample size has a direct effect on any research findings; it can undermine or strengthen the internal and external validity of a study, and the same can be observed in our results. A substantial improvement in results can be seen with increasing sample sizes. As the sample size increased from 200 to 1000, the percentage of true positives increased from 75% to 85% when the contamination level is 2%, from 90% to 94% when the contamination level is 5%, and from 90% to 96% when the contamination level is 10%. It can also be observed that the correlation level has no effect on the simulated results; similarly, the change in contamination levels has no effect on the results. The same can be confirmed from precision, recall, and F-measure. Figures 4(a)–4(c) show the percentages of true detection rates versus sample sizes for the three methods. The true detection rate is also higher for the proposed method and improves further with increasing sample size.

5. Discussion and Conclusion

This paper suggests a novel approach based upon PCA and three-sigma limits for outlier detection. The predictive model is developed using the major principal components suggested by the scree plots. The main advantage of the proposed approach is that it does not require any distributional assumptions. We performed outlier detection with our proposed method as well as with two existing classical approaches to gauge the performance of the proposed method. The performance comparison is made using two real-life and several simulated datasets. The examples from real-life data and the simulation experiments confirm the better performance of our proposed technique as compared to the two existing approaches. First, the three outlier detection methods were applied to the silicon wafer thickness data. The computed values of precision, recall, and F-measure were highest with the proposed method, i.e., 0.6667, 0.6667, and 0.6667, respectively, while using major PCs, the three measures were 0.2500, 0.5000, and 0.3333. The method based upon the Mahalanobis distance produced the three measures as 0.3333, 0.5000, and 0.4000, respectively. The major-PCs-based method produced the worst results. The same scenario can be observed from the application of these three outlier detection methods to the solvents data. Precision, recall, and F-measure were computed as 0.8889, 0.7273, and 0.8000 with the proposed method, 0.6250, 0.4545, and 0.5263 using major PCs, and 0.7500, 0.5454, and 0.6315 with the Mahalanobis distance method. The simulation experiments also confirmed the same conclusion: for all sample sizes, correlation levels, and contamination levels, the proposed method performed best among the three.

With the three increasing levels of sample size, i.e., 200, 500, and 1000, the percentages of true detections increase, and the false detection rates decrease. The performance improves with increasing sample size regardless of the level of correlation between variables and the contamination level. The results showed that, with a correlation level of 0.90, when the contamination level is 2%, the precision increases from 0.6000 to 0.8095, the recall from 0.7500 to 0.8500, and the F-measure from 0.6667 to 0.8293 as the sample size increases from 200 to 1000. When the contamination level is 5%, these three measures are 0.8182, 0.9000, and 0.8572 for sample size 200 and increase to 0.9038, 0.9400, and 0.9215 for sample size 1000. The same can be observed with a correlation level of 0.95: the three measures with sample size 200 and 2% contamination are 0.5000, 0.7500, and 0.6000 and increase to 0.8095, 0.8500, and 0.8293 as the sample size grows to 1000. A similar situation persists with 5% and 10% contamination. The results for the datasets having the maximum correlation level, i.e., 0.975, show the same scenario. For the 10% contamination level with sample size 200, the three measures were 0.9048, 0.9500, and 0.92685. When the sample size was increased to 500 and 1000, these measures increased to 0.9800, 0.9800, and 0.9800, and to 0.9804, 1.0000, and 0.99010, respectively.

The proposed method is not only useful for datasets whose variables have an interdependence relationship; it can also be applied to data whose variables have a dependence relationship, i.e., variables categorized as response and explanatory variables. The outlying observations in the set of explanatory variables can be detected by using the step-by-step approach of the proposed method. After taking care of outlying observations in the explanatory variables, the response variable can be checked for outlying observations by using studentized deleted residuals, or a formal test can be conducted by means of the Bonferroni test procedure. In future work, investigating the proposed method for variables having such relationships might prove important.

Data Availability

Previously reported data were used to support this study and are available at https://openmv.net/info/silicon-wafer-thickness and https://openmv.net/info/solvents.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Saima Afzal conceived, supervised, and designed the study. Ayesha Afzal performed simulation experiments and computational work. Muhammad Amin made the analysis of results. Sehar Saleem wrote the manuscript. Nouman Ali helped in conducting simulation studies and analysis of results. Muhammad Sajid reviewed and edited the manuscript. All the authors discussed the results and contributed to the final manuscript.