Abstract

Hotelling's $T^2$ control charts are widely used in industry to monitor multivariate processes. The classical estimators used in these charts, the sample mean and the sample covariance, are highly sensitive to outliers in the data. In Phase-I monitoring, control limits are obtained from historical data after identifying and removing multivariate outliers. We propose Hotelling's $T^2$ control charts with high-breakdown robust estimators based on the reweighted minimum covariance determinant (RMCD) and the reweighted minimum volume ellipsoid (RMVE) to monitor multivariate observations in Phase-I data. We assessed the performance of these robust control charts through a large number of Monte Carlo simulations under different data scenarios and found that the proposed control charts perform better than existing methods.

1. Introduction

Control charts are widely used in industry to monitor and control processes. Generally, the construction of a control chart is carried out in two phases. The Phase-I data are analyzed to determine whether they indicate a stable (or in-control) process, to estimate the process parameters, and thereby to construct the control limits. The Phase-II data analysis consists of monitoring future observations based on control limits derived from the Phase-I estimates to determine whether the process continues to be in control. However, trends, step changes, outliers, and other unusual data points in the Phase-I data can have an adverse effect on the estimation of parameters and the resulting control limits. That is, any deviation from the main assumption (in our case, independent and identically distributed observations from a normal distribution) may lead to an out-of-control situation. Therefore, it is very important to identify and eliminate these data points prior to calculating the control limits. In this paper, all these unusual data points are referred to as “outliers.”

Multivariate quality characteristics are often correlated, and Hotelling's $T^2$ control chart [1, 2] is widely used to monitor the multivariate process mean. To implement Hotelling's $T^2$ control chart for individual observations in Phase-I, for each observation we calculate $T^2(x_i) = (x_i - \bar{x})^T S^{-1} (x_i - \bar{x})$, where $x_i = (x_{i1}, \ldots, x_{ip})^T$ is the $i$th $p$-variate observation ($i = 1, \ldots, n$), and the sample mean $\bar{x}$ and sample covariance matrix $S$ are based on the $n$ Phase-I observations. In Phase-I monitoring, the $T^2(x_i)$ values are compared with the control limit derived by assuming that the $x_i$'s are multivariate normal, so that the control limits are based on the beta distribution with the parameters $p/2$ and $(n-p-1)/2$. However, the classical estimators, the sample mean and the sample covariance, are highly sensitive to outliers, and hence robust estimation methods are preferred as they have the advantage of not being unduly influenced by the outliers. Robust estimation methods are well suited to detecting multivariate outliers because their high breakdown points ensure that the control limits are reasonably accurate. Sullivan and Woodall [3] proposed a $T^2$ chart with an estimate of the covariance matrix based on the successive differences of observations and showed that it is effective in detecting a process shift. However, these charts are not effective in detecting multiple multivariate outliers because of their low breakdown point.
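For reference, the beta-based Phase-I limit mentioned above is commonly written as follows for $n$ individual observations of dimension $p$. This is the standard form found in the literature and is stated here as an assumption about the exact expression intended; $\alpha$ denotes the false-alarm probability and $B_{1-\alpha}(a, b)$ the $(1-\alpha)$ quantile of the beta distribution with parameters $a$ and $b$.

```latex
\[
\frac{n}{(n-1)^2}\, T^2(x_i) \sim \mathrm{Beta}\!\left(\frac{p}{2}, \frac{n-p-1}{2}\right),
\qquad
\mathrm{UCL} = \frac{(n-1)^2}{n}\, B_{1-\alpha}\!\left(\frac{p}{2}, \frac{n-p-1}{2}\right).
\]
```

One consequence of this form is that the classical statistic can never exceed $(n-1)^2/n$, which is part of the reason a single gross outlier can be masked when it also inflates $S$.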

Vargas [4] introduced two robust $T^2$ control charts based on robust estimators of location and scatter, namely, the minimum covariance determinant (MCD) and the minimum volume ellipsoid (MVE), for identifying the outliers in Phase-I multivariate individual observations. Jensen et al. [5] showed that the $T^2_{\mathrm{MCD}}$ and $T^2_{\mathrm{MVE}}$ control charts have better performance when outliers are present in the Phase-I data. Chenouri et al. [6] used reweighted MCD estimators for monitoring the Phase-II data, without constructing Phase-I control charts. However, in many situations Phase-I control charts are necessary to assess the performance of the process and also to identify the outliers. We propose $T^2$ control charts based on the reweighted minimum covariance determinant (RMCD) and the reweighted minimum volume ellipsoid (RMVE) for monitoring Phase-I multivariate individual observations. RMCD/RMVE estimators are statistically more efficient than MCD/MVE estimators and have a manageable asymptotic distribution. We empirically obtain the Phase-I control limits of the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ control charts for some specific sample sizes and fit a nonlinear model to determine control limits for any sample size for dimensions 2 to 10. Our simulation studies show that the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ control charts perform well compared to the $T^2_{\mathrm{MCD}}$/$T^2_{\mathrm{MVE}}$ control charts for monitoring the Phase-I data.

The organization of the remaining part of the paper is as follows. In Section 2, we discuss the properties of a good robust estimator and we briefly explain the MCD/MVE estimators and their reweighted versions. The proposed control charts are given in Section 3 along with the control limits arrived at based on Monte Carlo simulations. We assess the performance of the proposed control charts in Section 4, and the implementation of the proposed methods is illustrated in a case example in Section 5. Our conclusions are given in Section 6.

2. Robust Estimators

The affine equivariance property of an estimator is important because it makes the analysis independent of the measurement scale of the variables as well as of transformations or rotations of the data. The breakdown point concept introduced by Donoho and Huber [7] is often used to assess robustness. The breakdown point is the smallest proportion of the observations which can render an estimator meaningless. A higher breakdown point implies a more robust estimator, and the highest attainable breakdown point is 1/2, as for the median in the univariate case. For more details on affine equivariance and breakdown points one may refer to Chenouri et al. [6] or Jensen et al. [5].

An estimator is said to be relatively efficient compared to any other estimator if the mean square error for the estimator is the least for at least some values of the parameter compared to others. A robust estimator is considered to be good if it carries the property of affine equivariance along with a higher breakdown point and greater efficiency. In addition to the above three properties of a good robust estimator, it should be possible to calculate the estimator in a reasonable amount of time to make it computationally efficient.

It is difficult to obtain an estimator that is both affine equivariant and highly robust, because affine equivariance and a high breakdown point are hard to achieve simultaneously. Lopuhaä and Rousseeuw [8] and Donoho and Gasko [9] showed that the highest finite sample breakdown point is difficult to attain for an affine equivariant estimator. The largest attainable finite sample breakdown point of any affine equivariant estimator of the location and scatter matrix with sample size $n$ and dimension $p$ is $\lfloor (n-p+1)/2 \rfloor / n$ [10]. Therefore, relaxing the affine equivariance condition of the estimators to invariance under orthogonal transformations makes it easier to find an estimator with the highest breakdown point.

The classical estimators of the location and scatter parameters, the sample mean vector and the sample covariance matrix, are affine equivariant, but their finite sample breakdown point is as low as $1/n$. The MCD and MVE estimators have the highest possible finite sample breakdown point, $\lfloor (n-p+1)/2 \rfloor / n$. However, both of these estimators have very low asymptotic efficiency under normality. The reweighted versions of the MCD and MVE estimators have better efficiency without compromising on the breakdown point and rate of convergence of MCD and MVE. In the next two subsections, we discuss the MCD and MVE estimators and their reweighted versions in detail.
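To make the contrast between the two breakdown points concrete, the short sketch below evaluates them for a few sample sizes and dimensions. It assumes the usual expressions, $1/n$ for the classical estimators and $\lfloor (n-p+1)/2 \rfloor / n$ for the best affine equivariant estimators; the function names are illustrative only.

```python
def classical_breakdown(n):
    """Finite-sample breakdown point of the sample mean and covariance: 1/n."""
    return 1.0 / n


def max_affine_equivariant_breakdown(n, p):
    """Largest attainable breakdown point, floor((n - p + 1) / 2) / n."""
    return ((n - p + 1) // 2) / n


# Print both breakdown points for a few (n, p) combinations.
for n, p in [(30, 2), (30, 10), (100, 2), (100, 10)]:
    print(n, p, classical_breakdown(n), max_affine_equivariant_breakdown(n, p))
```

For a fixed sample size the attainable breakdown point shrinks as the dimension grows, so higher-dimensional data need larger samples to tolerate the same fraction of outliers.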

2.1. MCD and RMCD Estimators

The MCD estimators of the location and scatter parameters of the distribution are determined by a two-step procedure. In step 1, all possible subsets of observations of size $h$ are obtained. In step 2, the subset whose covariance matrix has the smallest possible determinant is selected. The MCD location estimator $\hat{\mu}_{\mathrm{MCD}}$ is defined as the average of this selected subset of $h$ points, and the MCD scatter estimator is given by $\hat{\Sigma}_{\mathrm{MCD}} = a\, b\, C_{\mathrm{MCD}}$, where $C_{\mathrm{MCD}}$ is the covariance matrix of the selected subset, the constant $a$ is the multiplication factor for consistency [11], and $b$ is the finite sample correction factor [12]. The fraction of observations left out of the subset, roughly $1 - h/n$, determines the breakdown point of the MCD estimators. The MCD estimator has its highest possible finite sample breakdown point when $h = \lfloor (n+p+1)/2 \rfloor$ and has an $n^{-1/2}$ rate of convergence but a very low asymptotic efficiency under normality. Computing the exact MCD estimators ($\hat{\mu}_{\mathrm{MCD}}$, $\hat{\Sigma}_{\mathrm{MCD}}$) is computationally expensive or even impossible for large sample sizes in high dimensions [13], and hence various algorithms have been suggested for approximating the MCD. Hawkins and Olive [14] and Rousseeuw and van Driessen [15] independently proposed a fast algorithm for approximating MCD. The FAST-MCD algorithm of Rousseeuw and van Driessen finds the exact MCD for small datasets and gives a good approximation for larger datasets; it is available in the standard statistical software SPLUS, R, SAS, and Matlab.
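The two-step definition above can be sketched by brute force for a tiny bivariate data set. This is only an illustration of the exhaustive search, not FAST-MCD; it omits the consistency and finite-sample correction factors, handles only two dimensions, and uses a made-up data set and illustrative function names.

```python
from itertools import combinations


def mean_and_cov(points):
    """Sample mean and (m-1)-divisor covariance of bivariate points."""
    m = len(points)
    mx = sum(x for x, _ in points) / m
    my = sum(y for _, y in points) / m
    sxx = sum((x - mx) ** 2 for x, _ in points) / (m - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (m - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (m - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def det2(c):
    """Determinant of a 2x2 matrix."""
    return c[0][0] * c[1][1] - c[0][1] * c[1][0]


def raw_mcd(points, h):
    """Step 1: enumerate all h-subsets. Step 2: keep the smallest determinant."""
    best = None
    for idx in combinations(range(len(points)), h):
        loc, cov = mean_and_cov([points[i] for i in idx])
        d = det2(cov)
        if best is None or d < best[0]:
            best = (d, idx, loc, cov)
    return best[1], best[2], best[3]


# Five points in the unit square plus one gross outlier (index 5).
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.3), (10.0, 3.0)]
n, p = len(data), 2
h = (n + p + 1) // 2  # subset size giving the highest breakdown point
subset, location, scatter = raw_mcd(data, h)
print(subset, location)
```

The number of subsets grows combinatorially with the sample size, which is why approximate algorithms are used in practice.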

MCD estimators are highly robust, carry equivariance properties, and can be calculated in a reasonable time using the FAST-MCD algorithm; however, they are not statistically efficient. A reweighting procedure helps to retain robustness while gaining efficiency. That is, first a highly robust but perhaps inefficient estimator is computed, which is used as a starting point to detect outliers, and the sample mean and covariance of the cleaned data set are then computed as in Rousseeuw and van Zomeren [16]. This consists of discarding those observations whose Mahalanobis distances exceed a certain fixed threshold value. MCD is the current best choice for the initial estimator of a two-step procedure as it combines robustness, equivariance, and computational efficiency with its $n^{-1/2}$ rate of convergence. Hence the RMCD estimators are the weighted mean vector $\hat{\mu}_{\mathrm{RMCD}} = \sum_{i=1}^{n} w_i x_i / \sum_{i=1}^{n} w_i$ and the weighted covariance matrix $\hat{\Sigma}_{\mathrm{RMCD}} = c\, d \sum_{i=1}^{n} w_i (x_i - \hat{\mu}_{\mathrm{RMCD}})(x_i - \hat{\mu}_{\mathrm{RMCD}})^T / \sum_{i=1}^{n} w_i$, where $c$ is the multiplication factor for consistency [11], $d$ is the finite sample correction factor [12], and the weights are defined as $w_i = 1$ if $RD(x_i) \le \sqrt{q_\eta}$ and $w_i = 0$ otherwise, where the robust distance is $RD(x_i) = \sqrt{(x_i - \hat{\mu}_{\mathrm{MCD}})^T \hat{\Sigma}_{\mathrm{MCD}}^{-1} (x_i - \hat{\mu}_{\mathrm{MCD}})}$ and $q_\eta$ is the $(1-\eta)100\%$ quantile of the chi-square distribution with $p$ degrees of freedom.
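The reweighting step can be sketched as follows for bivariate data. The sketch makes several assumptions: the initial robust estimates are supplied as hypothetical values rather than computed, the cutoff level `eta = 0.025` is only a common choice, the divisor of the weighted covariance is taken to be the sum of the weights, and the consistency and finite-sample correction factors are omitted. For two dimensions the chi-square quantile has the closed form $-2 \ln \eta$, which keeps the example free of external libraries.

```python
import math


def robust_distance_sq(x, loc, cov):
    """Squared robust distance for bivariate data, using the 2x2 inverse by hand."""
    dx, dy = x[0] - loc[0], x[1] - loc[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    return (cov[1][1] * dx * dx - 2.0 * cov[0][1] * dx * dy + cov[0][0] * dy * dy) / det


def reweight(points, loc0, cov0, eta=0.025):
    """Hard-rejection reweighting: weight 1 inside the chi-square cutoff, else 0."""
    cutoff = -2.0 * math.log(eta)  # (1 - eta) chi-square quantile, valid for p = 2 only
    weights = [1 if robust_distance_sq(x, loc0, cov0) <= cutoff else 0 for x in points]
    kept = [x for x, w in zip(points, weights) if w == 1]
    m = len(kept)
    mx = sum(x for x, _ in kept) / m
    my = sum(y for _, y in kept) / m
    sxx = sum((x - mx) ** 2 for x, _ in kept) / m
    syy = sum((y - my) ** 2 for _, y in kept) / m
    sxy = sum((x - mx) * (y - my) for x, y in kept) / m
    return weights, (mx, my), ((sxx, sxy), (sxy, syy))


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.3), (10.0, 3.0)]
# Hypothetical initial robust estimates standing in for the MCD output.
initial_location = (0.5, 0.5)
initial_scatter = ((0.2, 0.0), (0.0, 0.2))
weights, rw_location, rw_scatter = reweight(data, initial_location, initial_scatter)
print(weights, rw_location)
```

Only the last point is discarded here, so the reweighted mean is the ordinary mean of the five remaining points.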

This reweighting technique improves the efficiency of the initial MCD estimator while retaining (most of) its robustness. Hence the RMCD estimator inherits the affine equivariance, robustness, and asymptotic normality properties of the MCD estimators with an improved efficiency.

2.2. MVE and RMVE Estimators

The MVE estimators of the location and scatter parameters of the distribution are determined in much the same way as the MCD estimators. As in the case of MCD, all possible subsets of data points of size $h$ are obtained first. Then the ellipsoid of minimum volume that covers a subset is obtained to determine the MVE estimators. The MVE location estimator is the geometrical center of the ellipsoid, and the MVE scatter estimator is the matrix defining the ellipsoid itself, multiplied by an appropriate constant to ensure consistency [13, 16]. Thus the MVE estimator does not correspond to the sample mean vector and the sample covariance matrix of a subset, as the MCD estimator does. As in the case of MCD, the fraction of observations left out of the subset determines the breakdown point of the MVE estimators, and the highest possible finite sample breakdown point is attained when $h = \lfloor (n+p+1)/2 \rfloor$ [8, 17]. The MVE estimator has an $n^{-1/3}$ rate of convergence and a nonnormal asymptotic distribution [17].

As in the case for MCD estimators, MVE estimators are also not efficient. Hence, a reweighted version similar to that for MCD has been proposed by Rousseeuw and van Zomeren [16]. Note that it has been shown more recently that the RMVE estimators do not improve on the convergence rate (and thus the 0% asymptotic efficiency) of the initial MVE estimator [8, 12]. Therefore, as an alternative, a one-step M-estimator can be calculated with the MVE estimators as the initial solution [13, 18] which results in an estimator with the standard convergence rate to a normal asymptotic distribution. For more details on MCD/MVE estimators one may refer to Chenouri et al. [6] or Jensen et al. [5]. The algorithm to determine the MVE/RMVE estimators is available in the statistical software SPLUS, R, SAS, and Matlab.

3. Robust Control Charts

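We propose to use $T^2$ charts with robust estimators of the location and dispersion parameters based on RMCD/RMVE for monitoring the process mean of Phase-I multivariate individual observations. RMCD/RMVE estimators inherit the nice properties of the initial MCD estimators, such as affine equivariance, robustness, and asymptotic normality, while achieving a higher efficiency. We now define the robust $T^2$ control charts with RMCD and RMVE estimators for the $i$th multivariate observation as $T^2_{\mathrm{RMCD}}(x_i) = (x_i - \hat{\mu}_{\mathrm{RMCD}})^T \hat{\Sigma}_{\mathrm{RMCD}}^{-1} (x_i - \hat{\mu}_{\mathrm{RMCD}})$ and $T^2_{\mathrm{RMVE}}(x_i) = (x_i - \hat{\mu}_{\mathrm{RMVE}})^T \hat{\Sigma}_{\mathrm{RMVE}}^{-1} (x_i - \hat{\mu}_{\mathrm{RMVE}})$ (5), where $\hat{\mu}_{\mathrm{RMCD}}$, $\hat{\mu}_{\mathrm{RMVE}}$ are the mean vectors and $\hat{\Sigma}_{\mathrm{RMCD}}$, $\hat{\Sigma}_{\mathrm{RMVE}}$ are the dispersion matrices under the RMCD/RMVE methods based on $n$ multivariate observations.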
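The chart statistic is the same quadratic form whichever location and dispersion estimates are plugged in. The sketch below computes it for arbitrary dimension with a small Gaussian-elimination solver and contrasts classical estimates with a set of hypothetical robust estimates; the robust values are invented for illustration and are not the output of an RMCD or RMVE routine.

```python
def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    p = len(b)
    m = [list(a[i]) + [b[i]] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, p):
            f = m[r][c] / m[c][c]
            for k in range(c, p + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * p
    for r in range(p - 1, -1, -1):
        z[r] = (m[r][p] - sum(m[r][k] * z[k] for k in range(r + 1, p))) / m[r][r]
    return z


def t2_values(data, loc, cov):
    """(x - loc)' cov^{-1} (x - loc) for every observation x."""
    out = []
    for x in data:
        d = [xi - mi for xi, mi in zip(x, loc)]
        z = solve(cov, d)
        out.append(sum(di * zi for di, zi in zip(d, z)))
    return out


def classical_estimates(data):
    """Sample mean vector and (n-1)-divisor sample covariance matrix."""
    n, p = len(data), len(data[0])
    mean = [sum(x[j] for x in data) / n for j in range(p)]
    cov = [[sum((x[j] - mean[j]) * (x[k] - mean[k]) for x in data) / (n - 1)
            for k in range(p)] for j in range(p)]
    return mean, cov


data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.3), (10.0, 3.0)]
mean, cov = classical_estimates(data)
classical_t2 = t2_values(data, mean, cov)
# Hypothetical robust estimates, chosen by hand to describe the first five points.
robust_t2 = t2_values(data, [0.5, 0.46], [[0.25, 0.0], [0.0, 0.25]])
print(classical_t2[-1], robust_t2[-1])
```

With the classical estimates the outlying last point inflates the covariance and its own statistic stays small, whereas the robust plug-in values make it stand out clearly.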

The exact distributions of the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ statistics are not available; hence the control limits for Phase-I data are obtained empirically. In the next subsection we apply Monte Carlo simulation to estimate quantiles of the distribution of $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ for several combinations of sample sizes and dimensions. For each dimension, we further fit a smooth nonlinear model to obtain the control limits for any given sample size.

3.1. Computation of Control Limits

We performed a large number of Monte Carlo simulations to obtain the control limits. We generated samples of size $n$ from a standard multivariate normal distribution $\mathrm{MVN}(0, I_p)$ with dimension $p$. Due to the invariance of the $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ statistics, these limits are applicable for any values of $\mu$ and $\Sigma$. Using the reweighted MCD/MVE estimators $\hat{\mu}_{\mathrm{RMCD}}$, $\hat{\Sigma}_{\mathrm{RMCD}}$, $\hat{\mu}_{\mathrm{RMVE}}$, and $\hat{\Sigma}_{\mathrm{RMVE}}$ computed with the chosen breakdown value, the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ statistics for each observation in the data set were calculated using (5), and the maximum value attained for each data set of size $n$ was recorded. The empirical distribution of the maximum of $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ was inverted to determine the $(1-\alpha)100\%$ quantiles. We used the R-function “CovMcd()” in the “rrcov” package written by Todorov [19] to ascertain the RMCD/RMVE estimators.
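The simulation loop described above can be sketched as follows. To keep the example self-contained it is restricted to two dimensions and uses the classical mean and covariance as a stand-in estimator; substituting an RMCD or RMVE routine for the `estimator` argument would give the robust limits. The function names, the number of replications, and the seed are all illustrative choices.

```python
import random


def classical_estimates(points):
    """Stand-in estimator: sample mean and (n-1)-divisor covariance, bivariate only."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def max_t2(points, estimator):
    """Largest T^2 value in one data set, for the supplied estimator."""
    (mx, my), cov = estimator(points)
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    values = []
    for x, y in points:
        dx, dy = x - mx, y - my
        values.append((cov[1][1] * dx * dx - 2.0 * cov[0][1] * dx * dy
                       + cov[0][0] * dy * dy) / det)
    return max(values)


def simulated_limit(n, alpha, n_sim, estimator, seed=1):
    """Empirical (1 - alpha) quantile of the per-data-set maximum under MVN(0, I)."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_sim):
        sample = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]
        maxima.append(max_t2(sample, estimator))
    maxima.sort()
    k = min(n_sim - 1, int((1.0 - alpha) * n_sim))
    return maxima[k]


limit = simulated_limit(n=30, alpha=0.05, n_sim=2000, estimator=classical_estimates)
print(limit)
```

Because the limit is taken from the distribution of the maximum over each data set, it controls the overall false-alarm probability for the whole Phase-I sample rather than for a single observation.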

We have constructed the empirical distribution of as above for when and arrived at the control limits for = (0.05, 0.01, and 0.001). The scatter plots of the quantiles and sample sizes for different dimensions suggest a family of nonlinear models of the form where  , and are the model parameters. For clarity, the scatter plot of the actual and the fitted values of the quantiles of and for , and 10 are given in Figures 1, 2, and 3; other plots are omitted to save space.

From Figures 1, 2, and 3, we can see that the nonlinear fit is very well supported by the high values, which help us to determine the and control limits for any given sample size. The least square estimates of the parameters , and when for dimensions and = (0.05, 0.01 and 0.001) for / control charts are given in Table 1. Using these estimates, the control limits for and can be found using (6) for any sample size.

For the implementation of a robust control chart, first collect a sample of $n$ multivariate individual observations with dimension $p$. Compute the robust estimates of the mean vector and covariance matrix using R or any other software with the chosen breakdown value, and determine the $T^2_{\mathrm{RMCD}}$/$T^2_{\mathrm{RMVE}}$ value of each observation using (5). Outliers can be identified by comparing these values with the control limits obtained using (6) for the given sample size, dimension, and significance level, with the constants given in Table 1. The outlier-free data can then be used to construct the standard $T^2$ control chart for monitoring the Phase-II observations.

4. Performance Analysis

We assess the performance of the proposed charts when outliers are present due to the shift in the process mean. In their study, Jensen et al. [5] concluded that the control charts had better performance in terms of probability of signal. Hence, we compare the performance of our proposed method with charts as well as the standard charts based on classical estimators. Our study compares more combinations of dimension , sample size  ,  and  . For a particular combination of ,  ,  and  ,  a number of datasets are generated. Out of the   observations generated, of them are random data points generated from the out-of-control distribution, and the remaining   observations are generated from the in-control distribution so that the sample of data points may contain some outliers. We set and 0.20 to ensure that the sample contains few outliers. Without loss of generality, we consider the in-control distribution as . The out-of-control distribution is a multivariate normal with a small shift in the mean vector with same covariance matrix. The amount of mean shift is defined through a noncentrality parameter (), which is given by where () is the shift in the mean vector. The larger the value of is, the more extreme the outliers are. The proportion of datasets that had at least one or   statistic greater than the control limit was calculated, and this proportion becomes the estimated probability of signal. We compared the performance of these charts with standard charts, , and charts. The standard chart was included in our performance study as a reference because of its common usage.
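A contaminated Phase-I sample of the kind described above can be generated with the sketch below. It assumes the usual definition of the noncentrality parameter, $\delta = (\mu_1 - \mu)^T \Sigma^{-1} (\mu_1 - \mu)$, with in-control mean zero and identity covariance, and it places the whole shift on the first coordinate. The names `k` (number of out-of-control points), `shift_vector`, and `contaminated_sample`, as well as the numerical settings, are hypothetical and chosen only for illustration.

```python
import math
import random


def shift_vector(delta, p):
    """Mean shift along the first axis whose noncentrality equals delta when Sigma = I."""
    return [math.sqrt(delta)] + [0.0] * (p - 1)


def noncentrality(shift):
    """(mu1 - mu)' Sigma^{-1} (mu1 - mu) for Sigma equal to the identity matrix."""
    return sum(s * s for s in shift)


def contaminated_sample(n, k, p, delta, rng):
    """k observations from the shifted distribution followed by n - k in-control ones."""
    shift = shift_vector(delta, p)
    outliers = [[rng.gauss(s, 1.0) for s in shift] for _ in range(k)]
    clean = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n - k)]
    return outliers + clean


rng = random.Random(7)
sample = contaminated_sample(n=50, k=5, p=3, delta=9.0, rng=rng)
print(len(sample), noncentrality(shift_vector(9.0, 3)))
```

Repeating this generation many times and recording the fraction of data sets in which at least one statistic exceeds the control limit gives the estimated probability of signal for a given value of the noncentrality parameter.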

The probability of a signal for different values of and for some of the values of   and was considered in our study. Fifty thousand datasets of size were generated for each combination of , , and , and the probability of signal was estimated for , and . We considered various combinations of , , and which determine as per (7) and found that the probability of signal is the same irrespective of the combination of , and . Hence we have considered and for various values of . We have presented only a selected set of plots to save space. The plots of probability of signal for and 0.01, and 6, and and 100 are given in Figures 4, 5, 6, and 7 for easier understanding. For dimension , we used and 150, and the plots of probability of signal are given in Figures 8 and 9.

From Figures 4 to 9, we can see that when the value of the noncentrality parameter is zero or close to zero, the probability of signal is close to $\alpha$, which is expected for an in-control process. As the value of the noncentrality parameter increases, the probability of signal also increases. Using this criterion, we select the best method for identifying the outliers. If the probability of signal does not increase as the noncentrality parameter increases, then it is clear that the estimator has broken down and is not capable of detecting the outliers.

A careful examination of these plots of probability of signals corresponding to various values of , , and indicates that for small values of and , performs well. As    and    increase, chart is superior. For example, from Figures 4 and 5 we see that has slight advantage over . But compared to / charts, / charts are performing well which is evident from all the plots presented here. When is large (see Figures 8 and 9), the has clear advantage compared to . From these figures, we see that standard control chart possesses little ability to detect the outliers and the , and stands below the / charts throughout all the values of .

As $p$ increases for a fixed value of $n$, the breakdown points of RMCD and RMVE get smaller, as the breakdown value is given by $\lfloor (n-p+1)/2 \rfloor / n$. This suggests that the larger $p$ is, the larger $n$ will need to be in order to maintain the breakdown point, which is well demonstrated in Figures 8 and 9. In general, there was always one estimator, RMCD or RMVE, that was found to be superior across all the values of the noncentrality parameter as long as the proportion of outliers was not so big as to cause the estimators to break down. This greatly simplifies the conclusions that can be made about when the RMCD or RMVE estimators are preferred to the MCD and MVE estimators.

Nevertheless, and charts are preferred for the various combinations of , , and , and some broad recommendations can be made on the selection among these two charts. When , the will be the best for small dimension. When , the is preferred. As increases, then the percentage of outliers that can be detected by the chart decreases. It is true for both the charts that when is higher, the number of outliers that can be detected decreases for smaller sample sizes. Thus for Phase-I applications where the number of outliers is unknown, should be used only for smaller sample sizes, and it is also computationally feasible. should be used for larger sample sizes or when it is believed that there is a large number of outliers. When the dimension is large, larger sample sizes are needed to ensure that the estimator does not break down and lose its ability to detect outliers. Hence for larger dimension cases, is preferred with large sample sizes. For very small samples (), one may opt for higher values of , for which control limits need to be developed.

5. Case Example

To illustrate the applicability of the proposed control chart method, we discuss a real case example taken from the electronics industry. The data consist of 105 measurements of the 3 axial components of acceleration measured by an accelerometer on an e-compass unit fixed on the objects. The mean vector and covariance matrix of the sample data were computed under the classical, RMCD, and RMVE methods. A simple comparison of these estimators indicates that there are outliers in the Phase-I data. The plots of the $T^2$, $T^2_{\mathrm{RMCD}}$, and $T^2_{\mathrm{RMVE}}$ values along with the respective control limits at the 99% confidence level for the sample data are given in Figure 10.

The control limits for $T^2$ are based on the beta distribution, and those for $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ are calculated using (6) for the sample size and dimension of this data set ($n = 105$, $p = 3$). From Figure 10, it is clear that both the $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ control charts signal 3 outliers, whereas the standard $T^2$ control chart signals none, even though all the charts show the same pattern. This indicates the effectiveness of the proposed robust control charts in identifying the outliers.

6. Conclusions

The use of robust control charts in Phase-I monitoring is very important for assessing the performance of the process as well as for detecting outliers. We propose $T^2_{\mathrm{RMCD}}$ and $T^2_{\mathrm{RMVE}}$ control charts for Phase-I monitoring of multivariate individual observations. The control limits for these charts are obtained empirically, and a nonlinear regression model is used to obtain control limits for any sample size. The performance of the proposed charts was compared under various data scenarios using a large number of Monte Carlo simulations. Our simulation studies indicate that the preferred chart depends on the setting: one of the proposed charts performs well for smaller sample sizes and smaller dimensions, whereas the other performs well for larger sample sizes and larger dimensions. We illustrated our proposed robust control chart methodology using a case study from the electronics industry.

Acknowledgments

The authors would like to thank the editor and two anonymous referees for their valuable comments and suggestions that substantially improved the overall quality of an earlier version of this paper. The research is supported by a grant from the Natural Sciences and Engineering Research Council of Canada.