Abstract
This paper develops a distribution-free (or nonparametric) Shewhart-type statistical quality control chart for detecting a broad change in the probability distribution of a process. The proposed chart is designed for grouped observations, and it requires the availability of a reference (or training) sample of observations taken when the process was operating in-control. The charting statistic is a modified version of the two-sample Kolmogorov-Smirnov test statistic that allows the exact calculation of the conditional average run length using the binomial distribution. Unlike the traditional distribution-based control charts (such as the Shewhart X-Bar), the proposed chart maintains the same control limits and the in-control average run length over the class of all (symmetric or asymmetric) continuous probability distributions. The proposed chart aims at monitoring a broad, rather than a one-parameter, change in a process distribution. Simulation studies show that the proposed chart is more robust than the Shewhart X-Bar chart against increased skewness and/or outliers in the process output. Further, the proposed chart is shown to be more efficient than the Shewhart X-Bar chart when the underlying process distribution has tails heavier than those of the normal distribution.
1. Introduction
Most traditional statistical quality control charts assume that the monitored process has a prespecified known probability distribution (usually normal for continuous measurements). Consequently, the chart properties (control limits, false alarm rate, and the in-control average run length) would be in error if the process distribution were misspecified. To remedy this, a number of distribution-free (or nonparametric) schemes that maintain the same chart properties over a class of distributions have been proposed in the literature. For an overview of nonparametric control charts, see Chakraborti et al. [1, 2].
Another problem is that traditional control charts aim at monitoring a change in one parameter (usually location or scale) of a process distribution. Realistically, however, when a special cause influences a process, it may cause a shift in more than one parameter (location, scale, skewness, etc.) of the process distribution. To remedy this, we need control charts designed to monitor a broad rather than a one-parameter change in a process distribution. To our knowledge, Bakir [3] was the first to suggest such charts based on the two-sample Kolmogorov-Smirnov and the Cramer-von Mises statistics. Zou and Tsung [4] proposed a nonparametric likelihood ratio chart for monitoring broad changes in a process distribution. Ross and Adams [5] developed nonparametric charts based on the two-sample Kolmogorov-Smirnov and the Cramer-von Mises statistics. Their charts, however, are designed for individual observations, whereas the chart proposed in this paper is designed for grouped observations.
In this paper, we propose a nonparametric Shewhart-type control chart for monitoring a broad change in a process probability distribution. To develop the chart, we assume the availability of a training (or reference) sample taken when the process was operating in statistical control. The idea of assuming a training sample was first used by Park and Reynolds [6] to develop distribution-free charts based on the Orban and Wolfe [7] placement statistic. Later, Hackl and Ledolter [8], Willemain and Runger [9], and Chakraborti et al. [10] proposed nonparametric charts assuming the availability of a reference sample. Our proposed chart works by taking a random sample (test sample) from the process output at each monitoring stage. The charting statistic is a modified version of the two-sample Kolmogorov-Smirnov test statistic where the difference of the reference and test empirical distribution functions is maximized only over the training sample values. Such modification allows exact calculation of the conditional average run length of the proposed chart using the binomial distribution. Unlike the traditional distribution-based Shewhart X-Bar (Shew-XB) chart, the proposed chart maintains the same control limits and the same in-control average run length (ARL0) over the class of all (symmetric or asymmetric) continuous distributions. The Shew-XB and its average run length (ARL) will be discussed in Section 5. Given the training sample, the exact conditional ARL of the proposed nonparametric chart is computed using the binomial probability distribution. The unconditional ARL can then be computed approximately using simulations. A preliminary simulation study shows that the proposed nonparametric chart is more efficient (has smaller out-of-control ARL) than the Shew-XB chart under distributions with tails heavier than those of the normal distribution. If the process distribution is actually normal, then the Shew-XB chart is more efficient, as expected. 
The simulation study also indicates that the proposed chart is more robust against increased skewness and/or outliers in the process output.
The rest of the paper is organized as follows: Section 2 presents notational preliminaries. Section 3 develops the proposed nonparametric control chart, and Section 4 develops its ARL. Section 5 discusses the Shew-XB chart and its ARL. Section 6 investigates the effects of skewness and outliers on the two charts and presents efficiency comparisons. Section 7 gives a summary and suggestions for further research.
2. Preliminaries
We assume the availability of a training random sample, of size observations taken when the process was operating in-control. The in-control process distribution is assumed to have a continuous cumulative distribution function (CDF), . Let denote the empirical distribution function (EDF) of the training sample, as defined by Here, is the jth order statistic of the training sample, . Then at each sampling instance , , we obtain one test sample of size from the process output, which is assumed to have a continuous CDF, . Let be the empirical distribution function of the test sample, , given by Here, is the th order statistic of the test sample, .
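As a minimal sketch of the empirical distribution functions used in this section, the following Python fragment assumes the standard right-continuous definition: the EDF at a point is the fraction of sample observations less than or equal to that point. The function name `edf` and the example values are illustrative only and do not come from the paper.

```python
from bisect import bisect_right


def edf(sample):
    """Return the empirical distribution function of `sample` as a callable.

    Assumes the standard right-continuous definition: the EDF at x is the
    fraction of observations that are less than or equal to x.
    """
    ordered = sorted(sample)  # order statistics of the sample
    size = len(ordered)

    def evaluate(x):
        # bisect_right counts how many order statistics are <= x.
        return bisect_right(ordered, x) / size

    return evaluate


# Illustrative (made-up) training sample of four observations.
training_edf = edf([0.3, -1.2, 0.8, 1.5])
print(training_edf(0.5))
```

The same function serves for both the training sample and each test sample; only the input data differ.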
In practice, we may need to detect one of the following three situations.
Situation 1. Detect whether or not the process tends to produce stochastically smaller observations than the observations of the in-control state. In the terminology of statistical hypothesis testing, we are interested in testing the following null and alternative hypotheses: Figure 1 depicts Situation 1 graphically and it shows that the process CDF, , has shifted to the left of the in-control CDF .
Situation 2. Detect whether or not the process tends to produce stochastically larger observations than the observations of the in-control state. That is, we are testing the following null and alternative hypotheses: Figure 2 depicts Situation 2 graphically and it shows that the process CDF has shifted to the right of the in-control CDF .
Situation 3. Detect whether or not the process tends to produce smaller and/or larger observations than the in-control state. That is, we are testing the following null and alternative hypotheses: Figure 3 depicts Situation 3 graphically.
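The three situations correspond to the usual one-sided and two-sided stochastic-ordering hypotheses. The LaTeX sketch below states them under assumed notation: F denotes the in-control CDF and G the current process CDF. These symbols and the exact form of the alternatives are assumptions made for illustration, based on the verbal descriptions above (a process producing smaller observations has a CDF lying on or above the in-control CDF).

```latex
% Assumed notation: F = in-control CDF, G = current process CDF.
% Situation 1: process output stochastically smaller (G shifted to the left of F)
H_0 : G(x) = F(x) \ \text{for all } x
\quad \text{versus} \quad
H_1 : G(x) \ge F(x) \ \text{for all } x, \ \text{with strict inequality for some } x.

% Situation 2: process output stochastically larger (G shifted to the right of F)
H_0 : G(x) = F(x) \ \text{for all } x
\quad \text{versus} \quad
H_1 : G(x) \le F(x) \ \text{for all } x, \ \text{with strict inequality for some } x.

% Situation 3: any departure from the in-control distribution
H_0 : G(x) = F(x) \ \text{for all } x
\quad \text{versus} \quad
H_1 : G(x) \ne F(x) \ \text{for some } x.
```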
3. The Proposed Nonparametric Control Chart
In this section, we develop the steps for constructing a distribution-free control chart of the Shewhart type that is based on a modified version of the two-sample Kolmogorov-Smirnov statistic. The proposed chart is hereafter abbreviated as the Shew-KS chart.
Step 1: control characteristic
The characteristic to be controlled (monitored) is the process theoretical probability distribution represented by the CDF, . The purpose is to detect whether or not has shifted away from the process in-control CDF, .
Step 2: sampling plan
Obtain a training sample of size when the process was operating in-control. Then obtain a test sample of size from the process output at each sampling instance , .
Step 3: assumptions
Observations on the process output are independent. The test samples are drawn from an unknown continuous distribution with CDF, . The process in-control underlying distribution is assumed continuous with unknown CDF, .
Step 4: pivot statistics
Calculate , the EDF of the training sample . Then at each sampling instance , , calculate the EDF, , of the test sample . The pivot statistic for Situation 1 (the lower-sided Shew-KS chart) is
Note that tends to be negative when the process produces observations smaller than the in-control state, see Figure 1.
The pivot statistic for Situation 2 (the upper-sided Shew-KS chart) is
Note that tends to be positive when the process produces larger observations, see Figure 2.
The pivot statistic for Situation 3 (the two-sided Shew-KS chart) is
Note 1. The pivot statistics in (6), (7), and (8) will assume only integer values if each is multiplied by the constant .
Note 2. The pivot statistics are modified versions of the traditional two-sample Kolmogorov-Smirnov statistic ([11], pp 456–462) where maximization is taken only over the training sample observations, .
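The modification described in Note 2 can be sketched as follows. This is a hedged illustration, not the paper's displayed formulas: it assumes the difference is taken as the training EDF minus the test EDF (so that it is negative when the test sample is shifted toward smaller values, as stated above), that the lower-sided statistic is the minimum of these differences, the upper-sided statistic is the maximum, and the two-sided statistic is the maximum absolute difference. In every case the search runs only over the training sample values. The function names are hypothetical.

```python
from bisect import bisect_right


def edf_at(sorted_sample, x):
    """Fraction of observations in `sorted_sample` that are <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def modified_ks(training, test):
    """Return (lower, upper, two_sided) modified KS statistics.

    Assumption: differences are training EDF minus test EDF, evaluated
    only at the training sample values (not at the pooled sample).
    """
    xs = sorted(training)
    ys = sorted(test)
    diffs = [edf_at(xs, x) - edf_at(ys, x) for x in xs]
    lower = min(diffs)                      # most negative difference
    upper = max(diffs)                      # most positive difference
    two_sided = max(abs(d) for d in diffs)  # largest absolute difference
    return lower, upper, two_sided


# Made-up data: every test value lies below every training value.
print(modified_ks([1.0, 2.0, 3.0, 4.0], [0.0, 0.5]))
```

Restricting the search to the training values is what later allows the test EDF at the maximizing point to be treated as a binomial proportion once the training sample is fixed.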
Step 5: control sequence (or charting statistics)
The control sequences for the lower-sided, the upper-sided, and the two-sided Shew-KS charts, respectively, are
Step 6: control limits
For simplification, we consider one upper-sided control limit, , and let the lower-sided control limit be . Because the Shew-KS chart is distribution free, the control limit, , is a constant (design parameter) that depends only on , , and the desired in-control ARL0 of the chart. This control limit, however, does not depend on the functional form of the in-control process distribution.
Step 7: signaling rules
The two-sided Shew-KS signals if . The lower-sided and upper-sided control charts signal, respectively, if , and .
Illustration
Let N(μ, σ²) denote a normal probability distribution with mean μ and variance σ². As an illustration of the proposed Shew-KS chart, we generated 20 observations from the standard normal distribution N(0,1) to represent the in-control reference X-sample. Four test Y-samples, each of size 10, were generated. The first two samples, Y1 and Y2, have a N(0,1) distribution. The third and fourth samples, Y3 and Y4, have N(2,1) and N(3,4) distributions, respectively. Table 1 depicts the generated samples and the required calculations for the two-sided charting statistic. The resulting Shew-KS chart, shown in Figure 4, gives an out-of-control signal at the third sample when the process mean shifted from zero to two.
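A rough sketch of this kind of illustration is given below. It does not reproduce Table 1 or Figure 4: the random seed, the control limit of 0.7, and the function names are all assumptions chosen for demonstration, and the two-sided statistic is taken as the maximum absolute difference between the training and test EDFs over the training values. Note that N(3,4) has variance 4, so its standard deviation is 2.

```python
import random
from bisect import bisect_right


def two_sided_statistic(training, test):
    """Max absolute EDF difference, evaluated at training values only."""
    xs, ys = sorted(training), sorted(test)
    return max(
        abs(bisect_right(xs, x) / len(xs) - bisect_right(ys, x) / len(ys))
        for x in xs
    )


def first_signal(training, test_samples, limit):
    """Return the 1-based index of the first signaling sample, or None.

    Assumption: the two-sided chart signals when the statistic strictly
    exceeds the (hypothetical) control limit `limit`.
    """
    for t, test in enumerate(test_samples, start=1):
        if two_sided_statistic(training, test) > limit:
            return t
    return None


rng = random.Random(2024)  # arbitrary seed, for repeatability only
reference = [rng.gauss(0, 1) for _ in range(20)]
tests = [
    [rng.gauss(0, 1) for _ in range(10)],  # Y1 ~ N(0,1)
    [rng.gauss(0, 1) for _ in range(10)],  # Y2 ~ N(0,1)
    [rng.gauss(2, 1) for _ in range(10)],  # Y3 ~ N(2,1)
    [rng.gauss(3, 2) for _ in range(10)],  # Y4 ~ N(3,4): sd = 2
]
print(first_signal(reference, tests, limit=0.7))
```

In practice the limit would be chosen to achieve a desired in-control ARL rather than fixed arbitrarily as it is here.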
4. Calculating the ARL of the Shew-KS Chart
Values of the ARL are needed for the implementation and the performance evaluation of control charts. The implementation of a control chart requires values of the control limits that lead to some desired values of the in-control ARL0. When the successive charting statistics of a Shewhart-type control chart are independent, the run length distribution is geometric and the ARL equals the reciprocal of the probability of a signal. Unfortunately, this property of independence does not hold for the proposed Shew-KS chart because the successive charting statistics, , , all depend on the same training sample . In this section, we develop a method for calculating the ARL of the chart by first conditioning on the training sample , a method used by Chakraborti [12, 13] and by Vermaat et al. [14].
Recall that the proposed two-sided Shew-KS chart signals at the first sampling instance , for which , where . Suppose that the maximum occurs at a value, say, . Thus, a signal occurs if Equivalently, a signal occurs if or It is seen that (11) represents the branch of the chart that detects if the process output, , is stochastically smaller than the in-control output, , see Figure 1. Similarly, (12) detects if the process output is stochastically larger than the in-control output, see Figure 2.
We will work first on the lower branch, (11), of the chart. After rearranging terms and multiplying the inequality by the test sample size, , (11) becomes
Note that because maximization in (6)–(8) is defined over the values only, becomes fixed when we condition on . Consequently, given the training sample, , becomes a binomial random variable, , with number of trials = and probability of success . Given , the exact conditional probability of a signal and the exact conditional ARL of the lower branch of the chart, respectively, are Upon taking expectations over the training sample , the unconditional probability of a signal and the unconditional ARL for the lower branch of the Shew-KS chart are
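The binomial calculation can be sketched as follows. The code is a hedged illustration: `threshold` stands in for the rearranged right-hand side of the signaling inequality (whose exact form is not reproduced here), and `p` stands in for the binomial success probability, assumed to be the process CDF evaluated at the maximizing training value. The function names and example numbers are hypothetical.

```python
from math import ceil, comb, floor


def binom_upper_tail(n, p, threshold):
    """P(B > threshold) for B ~ Binomial(n, p)."""
    start = max(floor(threshold) + 1, 0)  # smallest integer above threshold
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(start, n + 1))


def binom_lower_tail(n, p, threshold):
    """P(B < threshold) for B ~ Binomial(n, p)."""
    end = min(ceil(threshold) - 1, n)  # largest integer below threshold
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(0, end + 1))


def conditional_arl(signal_prob):
    """Conditional ARL as the reciprocal of the conditional signal probability.

    Given the training sample, successive signaling events are treated as
    independent, so the conditional run length is geometric.
    """
    return float("inf") if signal_prob == 0 else 1.0 / signal_prob


# Hypothetical numbers: test sample size 10, success probability 0.2,
# and a signal when more than 7 test observations fall at or below
# the maximizing training value.
p_signal = binom_upper_tail(10, 0.2, 7)
print(p_signal, conditional_arl(p_signal))
```

One tail function would serve the lower branch and the other the upper branch; which is which depends on the sign convention of the pivot statistic.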
Similarly, (12) can be transformed to show that the signal conditional probability and the conditional ARL of the upper branch of the chart, respectively, are
The two-sided Shew-KS chart signals if either one of the lower or the upper branch signals. Therefore, given the training sample , the conditional probability of a signal and the conditional ARL of the two-sided chart, respectively, are Theoretically, the unconditional probability of a signal and the unconditional ARL of the one-sided and two-sided charts are the expectations, over the training sample, of the respective conditional expressions.
Unfortunately, the required unconditional expectations over the training sample, , cannot be expressed in closed form. In this paper we use a large number of simulations, one million runs, to estimate these unconditional expectations. At each simulation run, a training sample and a test sample are generated, and the conditional probability of a signal and the conditional ARL are calculated according to their exact formulas. The International Mathematical and Statistical Library (IMSL) is used to generate pseudorandom variables (assuming a certain probability distribution) for the training and test samples, calculate the empirical CDFs, identify , calculate the exact binomial probabilities, and finally calculate the exact conditional probabilities of a signal and the ARLs. We then average these conditional values over the number of simulations to get estimates of the required unconditional expectations. For example, the estimated values for the unconditional probability of a signal and the unconditional ARL for the two-sided Shew-KS chart, respectively, are where is the simulation run number. Similar calculations are applied to estimate the unconditional expectations of the one-sided charts. The above methods for calculating the signal probability and the ARL can play an important role in the design and implementation of the proposed Shew-KS chart because they allow for calculating control limits that correspond to certain desired values of the in-control ARL for various values of and .
5. Calculating the ARL of the Traditional Shew-XB Chart
In this section, we outline an efficient method for evaluating the ARL of the traditional Shew-XB chart in order to compare it to the proposed Shew-KS chart.
The traditional Shew-XB control chart is based on charting the sequence of means of the test samples , . The control limits are calculated using the sample mean and the sample standard deviation of the in-control training sample . The two-sided Shew-XB chart gives an out-of-control signal at the first sampling instance, , for which or , where is a constant (usually equal to 3) chosen to achieve a desired in-control ARL. One-sided Shew-XB charts can be obtained by employing one of the signaling rules.
Because the successive signaling events (e.g., ) of the Shew-XB control chart all use the same control limits as estimated from the same training sample, they are no longer independent. Therefore, we cannot use the geometric distribution argument that the ARL equals the reciprocal of the probability of a signal. Jensen et al. [15] presented a literature review on the effects of parameter estimation on control chart properties. Chakraborti [12, 13] used conditional expectation arguments to derive exact formulas for the run length distribution and the ARL of the Shew-XB chart when the in-control mean and/or the variance are estimated. However, almost all studies regarding the effects of parameter estimation on control chart properties assume that the underlying process distribution is normal. This dogmatic restriction to the normal distribution is not appropriate to the distribution-free world of nonparametric statistics where we need to compare the performance of the competing charts under distributions other than the normal.
In this section, we use a conditional expectation argument and simulations to obtain reasonable estimates for the values of the unconditional ARL of the Shew-XB chart under several underlying process distributions.
Given the training sample, , the exact conditional probabilities of signals for the lower and the upper branches of the Shew-XB chart, respectively, are where is the theoretical CDF of the sample mean of the test sample, . For example, if the test sample has a normal distribution with mean and variance , then the conditional probability of a signal of the lower and the upper branches of the Shew-XB charts, respectively, are where is the CDF of the standard normal distribution. The conditional probability of a signal for the two-sided chart is The exact CDFs of the sample mean are known for many populations besides the normal. We state some results concerning the CDF of the mean of a sample of size drawn from gamma, Cauchy and Laplace distributions.
(1) For a 3-parameter gamma distribution with probability density function (PDF) The mean of a random sample of size also has a gamma distribution.
(2) For a Cauchy distribution , the mean of a random sample of size has the same Cauchy distribution with PDF
(3) For a Laplace distribution with PDF the distribution of the sample mean is more complicated. Using basic results in Johnson et al. ([16], p. 167), we can express the CDF of the sample mean (when ) as where is a 2-parameter random variable.
The conditional ARLs for the lower, upper, and two-sided Shew-XB chart, respectively, are , , and .
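For the normal case, the conditional quantities can be sketched as below. The sketch assumes control limits of the form (training mean) ± k × (training standard deviation) / sqrt(n), which is the usual Shewhart X-Bar construction but is an assumption here since the displayed limits are not reproduced. The function name and the tiny example sample are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def xbar_conditional_signal(training, n, k=3.0, mu=0.0, sigma=1.0):
    """Conditional signal probabilities (lower, upper, two-sided).

    Assumptions: limits are mean(training) -/+ k * stdev(training) / sqrt(n),
    and test observations are normal with mean `mu` and standard deviation
    `sigma`, so the test sample mean is N(mu, sigma^2 / n).
    """
    half_width = k * stdev(training) / sqrt(n)
    lcl = mean(training) - half_width
    ucl = mean(training) + half_width
    xbar_dist = NormalDist(mu, sigma / sqrt(n))
    p_lower = xbar_dist.cdf(lcl)        # P(sample mean < LCL)
    p_upper = 1.0 - xbar_dist.cdf(ucl)  # P(sample mean > UCL)
    # The two events are disjoint, so the two-sided probability is the sum.
    return p_lower, p_upper, p_lower + p_upper


# Made-up training sample with mean 0 and standard deviation 1.
p_lo, p_up, p_two = xbar_conditional_signal([-1.0, 0.0, 1.0], n=4)
print(p_two, 1.0 / p_two)  # conditional signal probability and conditional ARL
```

For non-normal processes, `xbar_dist.cdf` would be replaced by the appropriate CDF of the sample mean, such as the gamma, Cauchy, or Laplace results listed above.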
The unconditional probabilities of signals and the unconditional ARLs for the Shew-XB chart are obtained by taking expectations of their respective conditional values over the training sample . In practice, we use a large number of simulations to estimate these unconditional quantities as described in Section 4, (18).
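The averaging step can be sketched as a small Monte Carlo routine. This is an illustration under assumptions, not the IMSL-based procedure of the paper: the training samples are drawn from N(0,1), the limits take the form mean ± k × stdev / sqrt(n), the process is normal with unit standard deviation, and the sample sizes, number of runs, and seed are arbitrary choices made to keep the example fast.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def conditional_arl_xbar(training, n, k, mu, sigma=1.0):
    """Exact conditional two-sided ARL given one training sample (normal case)."""
    half_width = k * stdev(training) / sqrt(n)
    xbar_dist = NormalDist(mu, sigma / sqrt(n))
    p_signal = xbar_dist.cdf(mean(training) - half_width) + (
        1.0 - xbar_dist.cdf(mean(training) + half_width)
    )
    return float("inf") if p_signal == 0 else 1.0 / p_signal


def unconditional_arl(m, n, k, runs, shift=0.0, seed=1):
    """Estimate the unconditional ARL by averaging conditional ARLs.

    Each run draws a fresh in-control N(0,1) training sample of size m and
    evaluates the conditional ARL for a process mean equal to `shift`.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        training = [rng.gauss(0.0, 1.0) for _ in range(m)]
        total += conditional_arl_xbar(training, n, k, shift)
    return total / runs


# Hypothetical design: m = 50, n = 5, k = 3, only 200 runs for speed.
print(unconditional_arl(50, 5, 3.0, 200))             # in-control estimate
print(unconditional_arl(50, 5, 3.0, 200, shift=1.0))  # out-of-control estimate
```

With far more runs (the paper uses one million) the estimates stabilize; a few hundred runs are enough only to show the mechanics.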
6. Effects of Skewness, Outliers, and Efficiency Comparisons
In this section, we conduct simulation studies to investigate the sensitivity to skewness and outliers, and the efficiency, of both the traditional Shew-XB and the proposed Shew-KS control charts.
6.1. Effects of Skewness
We now examine the effect of skewness on the in-control ARLs of the Shew-XB and the Shew-KS charts. The control limits for the two charts are adjusted so that both charts have the same in-control ARL of 170 under the standard normal distribution. The sample sizes of the training and test sample, respectively, are and . We used IMSL to generate pseudorandom numbers from the three-parameter gamma distribution in (22). We varied the shape parameter to obtain extremely skewed to almost symmetric distributions. To have a gamma distribution with mean = 0 and variance = 1, the scale and location parameters are chosen as and . The ARLs of both charts are calculated by first getting the exact conditional ARL and then using one million simulation runs to get the unconditional ARL by averaging the conditional ARL. Table 2 shows that the ARL0 of the Shew-KS chart is not affected at all by the skewness of the distribution. The ARL0 of the Shew-XB chart, however, changes dramatically as we move from extremely skewed to symmetric distributions. The ARL0 of the Shew-XB chart becomes close to the normal theory value of 170 only when the shape parameter of the gamma distribution is at least 16. Table 2 depicts two anomalies in the ARL of the Shew-XB chart when the shape parameter or 3. For explanation of these anomalies, refer to Vermaat et al. ([14], p. 343).
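The standardization of the gamma distribution can be sketched as follows. The sketch assumes the scale parameterization, in which a gamma variable with shape a and scale b, shifted by a location c, has mean a·b + c and variance a·b². Setting the mean to 0 and the variance to 1 then gives b = 1/sqrt(a) and c = -sqrt(a). The parameterization used in (22) may differ, and the function names are hypothetical.

```python
import random
from math import sqrt


def standardized_gamma_params(shape):
    """Scale and location giving mean 0 and variance 1 (scale parameterization)."""
    scale = 1.0 / sqrt(shape)  # from shape * scale**2 == 1
    location = -sqrt(shape)    # from shape * scale + location == 0
    return scale, location


def gamma_skewness(shape):
    """Skewness of a gamma distribution; it depends only on the shape."""
    return 2.0 / sqrt(shape)


def standardized_gamma_sample(size, shape, rng):
    """Draw `size` observations from the standardized three-parameter gamma."""
    scale, location = standardized_gamma_params(shape)
    # random.gammavariate takes (shape, scale) in that order.
    return [rng.gammavariate(shape, scale) + location for _ in range(size)]


for shape in (0.5, 1, 4, 16):
    print(shape, standardized_gamma_params(shape), gamma_skewness(shape))
```

Under this parameterization the skewness falls from about 2.83 at shape 0.5 to 0.5 at shape 16, which matches the description of moving from extremely skewed to almost symmetric distributions as the shape parameter grows.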
6.2. Effects of Outliers
There are situations in which the in-control process output is contaminated by a few outliers; for example, a process involving complex analytical measurements. A single extreme outlying observation may trigger an out-of-control signal while in fact the process is in-control, thus increasing the false alarm rate and decreasing the in-control ARL of the control chart. A good model for generating normally distributed processes with occasional outliers is the contaminated normal distribution, the CDF of which is where . We will refer to and as the percentage of contamination and the extremity of contamination, respectively. When , the process is in-control though producing occasional outliers. When , (26) becomes the standard normal CDF. In each simulation run, we generated 500 reference samples, of size each, from the standard normal distribution. For each reference sample thus generated, we generated 500 test samples, of size each, from the contaminated normal distribution all with and all the possible combinations of where and . Table 3 shows the simulated values of the two-sided in-control ARLs (in groups of size ) of the Shew-XB and the Shew-KS charts for various levels of contamination.
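A contaminated normal model can be sketched as below. The sketch assumes the common mixture form in which an observation comes from N(0,1) with probability 1 - eps and from N(0, sigma²) with probability eps, so that the CDF is (1 - eps)·Φ(x) + eps·Φ(x / sigma). Whether (26) uses exactly this form is an assumption, and the names `eps` and `sigma` are chosen here for illustration.

```python
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def contaminated_cdf(x, eps, sigma):
    """CDF of the assumed mixture (1 - eps) * N(0,1) + eps * N(0, sigma^2)."""
    return (1.0 - eps) * STANDARD_NORMAL.cdf(x) + eps * STANDARD_NORMAL.cdf(x / sigma)


def contaminated_sample(size, eps, sigma, rng):
    """Draw observations: each is an outlier with probability eps."""
    return [
        rng.gauss(0.0, sigma if rng.random() < eps else 1.0)
        for _ in range(size)
    ]


rng = random.Random(7)  # arbitrary seed
print(contaminated_sample(10, eps=0.05, sigma=3.0, rng=rng))
print(contaminated_cdf(1.645, eps=0.05, sigma=3.0))
```

Under this form the distribution stays centered at zero, so the process remains in-control in location while the tails become heavier, and setting eps to 0 (or sigma to 1) recovers the standard normal CDF.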
Table 3 shows that the effect of outliers depends on the contamination severity , and the effect is more pronounced on the Shew-XB than on the Shew-KS chart. Keeping in mind that the in-control ARL of the traditional Shew-XB chart for a process operating with no outliers is 163 (in groups of size ), we make the following observations on the results in Table 3.
(i) Under very light percentage % and light extremity of contamination, outliers have no effect on either chart, as the in-control ARLs of the two charts do not change.
(ii) Under very light percentage % but moderate extremity of contamination, outliers have a noticeable effect on the traditional Shew-XB chart as its ARL0 drops to 115, which entails about 163/115 = 1.4 times as many false alarms as the expected ARL0 of 163. When the extremity of contamination grows to , outliers have a substantial effect on the Shew-XB chart as its ARL0 drops to 70, which entails about 2.3 times as many false alarms. In contrast, the light percentage of contamination % has no effect on the Shew-KS chart either when or when .
(iii) Under a moderate percentage % and light extremity of contamination, outliers have a noticeable effect on the Shew-XB chart as its ARL0 drops to 61, which entails about 2.6 times as many false alarms as the expected ARL0 of 163. In contrast, the Shew-KS triggers 171/144 = 1.2 times as many false alarms. When the extremity of contamination grows to , outliers have a greater effect on the Shew-XB chart as its ARL0 drops to 22, entailing about 7.4 times as many false alarms. In contrast, the Shew-KS triggers 1.4 times as many false alarms. With severe extremity of contamination, , the ARL0 of the Shew-XB chart drops to 12, entailing 13.8 times as many false alarms. In contrast, the Shew-KS triggers 1.5 times as many false alarms.
(iv) Outliers can have a more dramatic effect on the traditional Shew-XB chart when the percentage of contamination is as large as %. With light extremity of contamination, , the ARL0 of the Shew-XB chart drops to 33, entailing about 4.9 times as many false alarms as the expected ARL0 of 163. In contrast, the Shew-KS triggers 1.5 times as many false alarms. With moderate extremity of contamination, , the ARL0 of the Shew-XB chart drops to 11, entailing 14.8 times as many false alarms. In contrast, the Shew-KS triggers 1.8 times as many false alarms. With severe extremity of contamination, , the ARL0 of the Shew-XB chart drops to just 6, entailing about 27 times as many false alarms. In contrast, the Shew-KS triggers 1.9 times as many false alarms.
To sum up the results of Table 3, we conclude that for monitoring processes contaminated by outliers, one should not use the traditional Shew-XB chart unless the percentage and the severity of contamination are both very light, around . Otherwise, the traditional Shew-XB chart would trigger many times as many false alarms as it would for an uncontaminated process. Outliers have some effect on the Shew-KS chart when the percentage of contamination is as high as .
6.3. A Simulation Study for Efficiency
To compare two control charts, we adjust their control limits so that their in-control ARLs become approximately equal and then compare their out-of-control ARLs at various levels of change in the monitored quality characteristic. The chart with the smaller out-of-control ARL is considered to be more efficient.
In this section, we perform a simulation study to compare the efficiencies of the Shew-XB and the Shew-KS charts. The competing charts are compared for processes operating under a normal distribution with a standard deviation of 1.0, a Cauchy distribution and a Laplace distribution. Equation (23) gives the PDF of the Cauchy distribution with center (=median, mean does not exist) and scale . Equation (24) gives the PDF of the Laplace distribution with center (=mean = median) and scale . In (23), the scale is set to equal 0.2605 so that the Cauchy distribution with center 0 has a probability of 0.05 to the right of 1.645, the same as that of the standard normal distribution. Since the variance of the Laplace distribution is twice the square of its scale, the scale in (24) is set to 1/√2 (about 0.7071) so that the Laplace distribution has a standard deviation of 1.0. Efficiency comparisons are made when the median of the process is shifted from the in-control value of 0.0 to 1.0 in increments of 0.2. We used a training sample size and a test sample size in all comparisons. As mentioned in Sections 4 and 5, the ARLs of both charts are calculated by first getting the exact conditional ARLs and then using one million simulation runs to get the unconditional ARL by averaging the conditional ARLs. Tables 4, 5, and 6 show the simulated values of the two-sided ARLs (in groups of size ). The in-control ARLs (when ) of the competing control charts are made approximately equal by adjusting the control limits under each distribution.
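The scale choices for the Cauchy and Laplace distributions can be checked with a short calculation. For a Cauchy distribution with center 0 and scale s, the upper tail probability at x is 1/2 - arctan(x / s) / π, so the scale that puts probability 0.05 to the right of 1.645 is 1.645 / tan(0.45π). For a Laplace distribution with scale b, the variance is 2b². The function names below are hypothetical.

```python
from math import atan, pi, sqrt, tan


def cauchy_upper_tail(x, center, scale):
    """P(X > x) for a Cauchy distribution with the given center and scale."""
    return 0.5 - atan((x - center) / scale) / pi


def cauchy_scale_for_tail(x, tail_prob):
    """Scale of a center-0 Cauchy distribution with P(X > x) == tail_prob."""
    return x / tan(pi * (0.5 - tail_prob))


def laplace_scale_for_unit_sd():
    """Laplace scale b solving 2 * b**2 == 1."""
    return 1.0 / sqrt(2.0)


print(cauchy_scale_for_tail(1.645, 0.05))    # close to the 0.2605 used above
print(cauchy_upper_tail(1.645, 0.0, 0.2605)) # close to 0.05
print(laplace_scale_for_unit_sd())           # about 0.7071
```

Matching a tail probability rather than a variance is needed for the Cauchy case because its mean and variance do not exist.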
Examination of Tables 4, 5, and 6 leads to the following findings.
(i) Table 4: For monitoring processes operating under a normal distribution, the Shew-KS chart is less efficient (has larger out-of-control ARLs) than the traditional Shew-XB chart.
(ii) Table 5: For monitoring processes operating under a Laplace distribution, the proposed Shew-KS chart is more efficient (has smaller out-of-control ARLs) than the traditional Shew-XB chart at all shifts in the process center.
(iii) Table 6: For monitoring processes operating under a Cauchy distribution, the Shew-KS chart becomes dramatically more efficient than the traditional Shew-XB chart at all shifts in the process center. For example, the Shew-KS chart signals about 4 times, 52 times, 115 times, 146 times, and 159 times more quickly than the traditional Shew-XB chart at respective shifts of 0.2, 0.4, 0.6, 0.8, and 1.0 in the process center.
To sum up, the results in Tables 4, 5, and 6 lead to the following recommendations.
To monitor processes operating under moderately or heavily tailed underlying distributions (with tails heavier than those of the normal), the proposed Shew-KS chart is more efficient than the traditional Shew-XB chart. This is in addition to the advantage that the Shew-KS chart maintains the same control limits over the class of (symmetric or asymmetric) continuous distributions. If one is sure that the underlying process distribution is normal, then the traditional Shew-XB chart is recommended over the Shew-KS chart.
7. Summary and Suggestions for Further Research
In this paper, a distribution-free (or nonparametric) Shewhart-type statistical quality control chart is developed for detecting broad changes in the underlying probability distribution of a process. We assume the availability of a random sample, called the training sample, taken when the process was operating in-control. At each sampling instance, we take a random sample from the process output and calculate a modified version of the two-sample Kolmogorov-Smirnov test statistic, which serves as the charting statistic. A signal is given if the charting statistic falls outside the control limits. Unlike the traditional distribution-based control charts (such as the Shew-XB), the proposed chart maintains the same in-control ARL0 value over the class of all (symmetric or asymmetric) continuous distributions. Consequently, the control limits of the proposed chart need not be adjusted according to an assumed underlying process distribution. Given the training sample, the conditional ARL of the proposed chart is computed exactly using the binomial probability distribution. The unconditional ARL can then be estimated by simulations. A preliminary simulation study shows that the proposed Shew-KS chart is more efficient than the Shew-XB chart if the underlying process distribution has tails heavier than those of the normal. If the underlying process distribution can be assumed normal, then the Shew-XB chart is more efficient, as expected. The simulation study also indicates that the proposed chart is more robust than the Shew-XB chart against increased skewness and/or outliers in the process output.
Further simulation studies are needed to expand the efficiency comparisons of the proposed Shew-KS chart with charts other than the Shew-XB. Tabulated values of the control limits are needed for the implementation of the proposed chart. It is worthwhile to investigate how the Kolmogorov-Smirnov statistic can be used with other charting schemes, such as the exponentially weighted moving average (EWMA) and the cumulative sum (CUSUM) charts.