Abstract

This paper aims to enlarge the family of one-class classification-based control charts, referred to as OC-charts, and extend their applications. We propose a new OC-chart using the $k$-means data description (KMDD) algorithm, referred to as the KM-chart. The proposed KM-chart gives the minimum closed spherical boundary around the in-control process data. It measures the distance between the center of the KMDD-based sphere and the new incoming sample to be monitored. Any sample having a distance greater than the radius of the KMDD-based sphere is considered an out-of-control sample. The phase I and phase II performance of the KM-chart was evaluated through a real industrial application. In a comparative study based on the average run length (ARL) criterion, the KM-chart was compared with the kernel distance-based control chart, referred to as the K-chart, and the $k$-nearest neighbor data description-based control chart, referred to as the KNN-chart. Results revealed that, in terms of ARL, the KM-chart performed better than the KNN-chart in detecting small shifts in the mean vector. Furthermore, the paper provides the MATLAB code for the KM-chart, developed by the authors.

1. Introduction

In recent years, several attempts have been made to integrate data mining with statistical process control (SPC) [1–6]. The objective was to overcome the limitations of traditional parametric control charts, especially the normality assumption, which may not hold in modern manufacturing systems. The most commonly applied data mining technique in SPC is one-class classification, and one-class classification methods have been widely used for process monitoring [7–12]. The principle of one-class classification consists in constructing a sphere of minimum volume that contains as much of the data as possible. This sphere distinguishes in-control process data, also known as target data, from out-of-control process data. The shape and volume of the one-class depend on the one-class classifier used, also known as the data description algorithm. The one-class approach was applied to develop a new family of control charts called one-class classification-based control charts, referred to as OC-charts.

Several types of one-class classifiers exist in the literature. However, only two one-class classifiers have been used to develop control charts, namely, the support vector data description (SVDD) algorithm [13] and the $k$-nearest neighbor data description (KNNDD) algorithm [14]. Sun and Tsung [15] used SVDD to develop the kernel distance-based multivariate control chart, also known as the K-chart, which is considered the first OC-chart that uses support vector principles. When monitoring more than two variables, the K-chart uses kernel methods that provide the advantage of dealing with high-dimensional data. During the last decade, the K-chart has received significant attention and has witnessed several improvements through many works [16–19].

The SVDD algorithm was also employed to develop other control charts such as the SVDD-based multivariate cumulative sum control chart [20], to monitor batch processes [21], to monitor nonlinear processes [22], and to perform industrial calibration [23]. Despite its successful application, the high computational cost of SVDD remains its main drawback. In particular, the SVDD algorithm loses its efficiency when the size of the training data becomes large. To overcome this shortcoming, Sukchotrat et al. [24] proposed the use of the KNNDD algorithm to develop the KNN-chart. KNNDD is a simple and fast algorithm that performs better with high-dimensional data and does not consume much time during the training phase. Gani and Limam [25] compared the performance of the K-chart and the KNN-chart and demonstrated that the K-chart is sensitive to small shifts in the mean vector, while the KNN-chart is sensitive to moderate shifts in the mean vector.

This paper investigates the use of another one-class classifier, the $k$-means data description (KMDD) algorithm, to construct a new OC-chart, referred to as the KM-chart. The objective of this work is twofold. First, we aim to enlarge the family of OC-charts and extend their applications by showing the methodology of their construction and providing the necessary software codes. Second, we attempt to propose an OC-chart that can compete with the K-chart and the KNN-chart in terms of the average run length (ARL) criterion.

The rest of this paper is organized as follows. A review of OC-charts is presented in Section 2. The proposed KM-chart is introduced in Section 3. Construction methodology of the KM-chart using a real data example is shown in Section 4, while performance analysis of the proposed control chart is discussed in Section 5. Section 6 summarizes this paper.

2. Background on OC-Charts

In the literature, there are two common OC-charts which are the K-chart and the KNN-chart. In the following, we give a review of these control charts.

2.1. The K-Chart

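The K-chart relies on the SVDD algorithm, which is an unsupervised one-class classifier, to fit a sphere around the target data. This sphere is determined by solving the following quadratic programming problem:

$$\min_{R,\,a} \; F(R, a) = R^2 \tag{1}$$

subject to

$$\lVert x_i - a \rVert^2 \le R^2, \quad \forall i, \tag{2}$$

where $F$, $a$, and $R$ are, respectively, the cost function to minimize, the center, and the radius of the sphere. Equation (2) shows that the vectors of quality characteristics, denoted by $x_i$, having a distance to the center smaller than the radius are considered as target. To allow the possibility of having outliers in the training set, the squared distance from $x_i$ to the center $a$ need not be strictly smaller than $R^2$, but larger distances should be penalized. Therefore, we introduce slack variables $\xi_i$ and the minimization problem becomes

$$\min_{R,\,a,\,\xi} \; F(R, a, \xi) = R^2 + C \sum_i \xi_i \tag{3}$$

subject to

$$\lVert x_i - a \rVert^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad \forall i, \tag{4}$$

where $C$ is a parameter introduced for the trade-off between the volume of the sphere and the errors.

Equation (4) can be incorporated into (3) by using Lagrange multipliers:

$$L(R, a, \alpha_i, \gamma_i, \xi_i) = R^2 + C \sum_i \xi_i - \sum_i \alpha_i \left\{ R^2 + \xi_i - \left( \lVert x_i \rVert^2 - 2\, a \cdot x_i + \lVert a \rVert^2 \right) \right\} - \sum_i \gamma_i \xi_i, \tag{5}$$

with the Lagrange multipliers $\alpha_i \ge 0$ and $\gamma_i \ge 0$. $L$ should be minimized with respect to $R$, $a$, and $\xi_i$ and maximized with respect to $\alpha_i$ and $\gamma_i$. Setting the partial derivatives of $L$ to zero, we obtain

$$\frac{\partial L}{\partial R} = 0: \quad \sum_i \alpha_i = 1, \tag{6}$$

$$\frac{\partial L}{\partial a} = 0: \quad a = \sum_i \alpha_i x_i, \tag{7}$$

$$\frac{\partial L}{\partial \xi_i} = 0: \quad C - \alpha_i - \gamma_i = 0. \tag{8}$$

From (8), $\alpha_i = C - \gamma_i$, and since $\alpha_i \ge 0$ and $\gamma_i \ge 0$, the Lagrange multipliers $\gamma_i$ can be removed and we have

$$0 \le \alpha_i \le C. \tag{9}$$

By substituting (6)–(8) into (5), we have

$$\max_{\alpha} \; L = \sum_i \alpha_i \,(x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j \,(x_i \cdot x_j) \tag{10}$$

subject to

$$0 \le \alpha_i \le C, \quad \sum_i \alpha_i = 1. \tag{11}$$

A test sample, denoted by $z$, is accepted when its distance to the center is smaller than or equal to the radius. This is equivalent to

$$\lVert z - a \rVert^2 = (z \cdot z) - 2 \sum_i \alpha_i \,(z \cdot x_i) + \sum_{i,j} \alpha_i \alpha_j \,(x_i \cdot x_j) \le R^2. \tag{12}$$

Generally, data is not spherically distributed. To make the method more flexible, the vectors $x_i$ are transformed to a higher dimensional feature space. The inner products in (10) and (12) are substituted by a kernel function $K(x_i, x_j)$. In a higher dimension, the sphere becomes a complex form called “hypersphere.” The problem of finding the optimal hypersphere is given by

$$\max_{\alpha} \; L = \sum_i \alpha_i K(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \tag{13}$$

subject to (11).

A test sample $z$ is accepted when

$$K(z, z) - 2 \sum_i \alpha_i K(z, x_i) + \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \le R^2. \tag{14}$$

The construction of the K-chart consists in determining which samples are support vectors (SVs), that is, the samples with $\alpha_i > 0$, by solving the following quadratic programming problem:

$$\max_{\alpha} \; \sum_i \alpha_i K(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \tag{15}$$

subject to

$$0 \le \alpha_i \le C, \quad \sum_i \alpha_i = 1. \tag{16}$$

Once the SVs are obtained, the kernel distance (KD) of each sample is computed. For a test sample $z$, the KD is computed as follows:

$$KD(z) = \sqrt{ K(z, z) - 2 \sum_{i \in S} \alpha_i K(z, x_i) + \sum_{i,j \in S} \alpha_i \alpha_j K(x_i, x_j) }, \tag{17}$$

where $S$ is the set of SVs.

The KD of the SVs, denoted by $R$, represents the upper control limit (UCL) for the K-chart used to monitor a new sample $z$. This can be illustrated by the following hypothesis test:

$$H_0: KD(z) \le R \quad \text{versus} \quad H_1: KD(z) > R. \tag{18}$$

Under $H_0$ the process is considered as in-control and under $H_1$ the process is considered as out-of-control, when sample $z$ was taken.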
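As a minimal illustration of how the kernel distance and the UCL comparison fit together, the following MATLAB sketch evaluates the KD of two test samples with a Gaussian (radial basis function) kernel. It is not the code of [15] or [25]: the data, the kernel width sigma, and the names X, alpha, kern, and kd are illustrative assumptions, and the multipliers alpha are placeholder uniform weights rather than the solution of the quadratic program.

```matlab
% Sketch only: kernel distance (KD) to the center of the one-class with a
% Gaussian kernel. alpha is a placeholder, NOT the solution of the SVDD QP.
% Requires implicit expansion (MATLAB R2016b or later).
X     = [0 0; 1 0; 0 1; 1 1];            % hypothetical training samples (rows)
alpha = ones(size(X,1),1) / size(X,1);   % assumed multipliers, summing to one
sigma = 1;                               % assumed kernel width

% Gaussian kernel between the rows of A and the rows of B
sqd  = @(A,B) sum(A.^2,2) + sum(B.^2,2)' - 2*(A*B');
kern = @(A,B) exp(-sqd(A,B) / sigma^2);

% KD(z)^2 = K(z,z) - 2*sum_i alpha_i*K(z,x_i) + sum_ij alpha_i*alpha_j*K(x_i,x_j)
const = alpha' * kern(X,X) * alpha;
kd    = @(Z) sqrt(max(1 - 2*kern(Z,X)*alpha + const, 0));   % K(z,z) = 1 here

% With uniform weights every training sample has alpha_i > 0, so the UCL is
% taken as the KD of the training samples (demo assumption)
ucl = max(kd(X));

Z = [0.5 0.5; 4 4];                      % one central and one distant test sample
out_of_control = kd(Z) > ucl
```

In this toy setting the central sample stays below the UCL and the distant sample exceeds it. In an actual K-chart the multipliers would come from solving the quadratic program, and only the SVs would contribute to the sums.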

2.2. The KNN-Chart

The KNN-chart uses an unsupervised one-class classifier called KNNDD to construct a one-class by estimating the local density of the data. To understand the mechanism of the KNN-chart, a brief description of the KNNDD algorithm, based on the work of Sukchotrat et al. [24], is presented below.

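Let $NN_k(z)$ be the $k$th nearest neighbor training observation of the data point $z$ that needs to be monitored. Let $V_k(\cdot)$ be the volume of the hypersphere containing the $k$ nearest neighbor training observations and $N$ the size of the training set. The local density of $z$, denoted by $d(z)$, can be determined as

$$d(z) = \frac{k/N}{V_k\left( \lVert z - NN_k(z) \rVert \right)}. \tag{19}$$

Similarly, the local density of $NN_k(z)$, denoted by $d(NN_k(z))$, can be determined by

$$d(NN_k(z)) = \frac{k/N}{V_k\left( \lVert NN_k(z) - NN_k(NN_k(z)) \rVert \right)}, \tag{20}$$

where $NN_k(NN_k(z))$ is the $k$th nearest neighbor of $NN_k(z)$ in the same training set.

The KNNDD method classifies $z$ as the target class when the ratio of its local density to that of $NN_k(z)$ is greater than or equal to one, which can be explained as follows:

$$\frac{d(z)}{d(NN_k(z))} = \frac{V_k\left( \lVert NN_k(z) - NN_k(NN_k(z)) \rVert \right)}{V_k\left( \lVert z - NN_k(z) \rVert \right)} \ge 1. \tag{21}$$

To make the algorithm more robust, the average of the $k$ nearest neighbor distances is considered (for $k > 1$). Thus, the single distances in (21) are replaced by averages over the $k$ nearest neighbors.

To construct the KNN-chart, the statistic $K^2$ representing the average distance between $z$ and its $k$ nearest observations is computed as follows:

$$K^2(z) = \frac{1}{k} \sum_{j=1}^{k} \lVert z - NN_j(z) \rVert. \tag{23}$$

The $K^2$ values are used as monitoring statistics.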
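The monitoring statistic described above, the average distance between a sample and its $k$ nearest training observations, can be sketched in base MATLAB as follows. The data, the value of k, and the variable names are illustrative assumptions and are not taken from the code of [24] or [25].

```matlab
% Sketch only: average distance from each monitored sample to its k nearest
% training observations (base MATLAB, no toolbox; needs R2016b or later).
Xtrain = [0 0; 1 0; 0 1; 1 1; 0.5 0.5];   % hypothetical in-control samples
Z      = [0.5 0.4; 3 3];                  % hypothetical samples to monitor
k      = 3;                               % assumed number of neighbors

% Pairwise Euclidean distances (rows: monitored samples, columns: training)
D2 = sum(Z.^2,2) + sum(Xtrain.^2,2)' - 2*(Z*Xtrain');
D  = sqrt(max(D2, 0));                    % guard against tiny negative values

% Sort each row and average the k smallest distances
Dsorted  = sort(D, 2);
knn_stat = mean(Dsorted(:,1:k), 2)
```

The second sample lies far from the training cloud and therefore receives a much larger statistic than the first. When the training samples themselves are charted in phase I, the zero distance of each sample to itself would have to be excluded before averaging.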

3. The Proposed KM-Chart

The proposed KM-chart gives the minimum closed spherical boundary around the in-control process data using the KMDD algorithm. The latter is an unsupervised one-class classifier based on the $k$-means algorithm, which is a very popular clustering method. The chart measures the distance between the center of the KMDD-based sphere and the new incoming sample to be monitored. The sphere is described by $k$ clusters placed such that the average distance to a cluster center is minimized.

Phase I of the KM-chart consists in determining the optimal KMDD-based one-class by estimating the optimal number of clusters. In this step, the $k$-means clustering algorithm aims to find $k$ clusters, denoted by $C_1, \dots, C_k$, that minimize the within-cluster sum of squares as follows:

$$\min_{C_1, \dots, C_k} \; \sum_{j=1}^{k} \sum_{i \in C_j} D(x_i)^2, \qquad D(x_i) = \lVert x_i - \bar{x}_j \rVert \ \text{for } i \in C_j, \tag{24}$$

where $C_1, \dots, C_k$ are the disjoint sets of cluster indices partitioning $\{1, \dots, n\}$, $n$ is the number of observations in the training phase, $\bar{x}_j$ is the sample mean of the observations in the $j$th cluster, and $D(x_i)$ is the Euclidean distance of the quality characteristic $x_i$ to the center of its cluster. It should be noted that the $D$ values are used as charting statistics for the KM-chart. The optimization problem in (24) can be solved by iterating the following two steps.
(i) Given cluster centers $\bar{x}_1, \dots, \bar{x}_k$, assign each point to the cluster with the closest center.
(ii) Given a cluster assignment, update the cluster centers to be the sample mean of the observations in each cluster.
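The two alternating steps can be sketched in base MATLAB as shown below. This is a minimal illustration and not the dd_tools implementation used later in the paper: the data, the number of clusters, and the initial centers are assumptions chosen so that the result is easy to check by hand.

```matlab
% Sketch only: the two alternating k-means steps in base MATLAB
% (needs implicit expansion, R2016b or later). All values are illustrative.
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];  % hypothetical training samples
k = 2;                                      % assumed number of clusters
centers = X([1 4], :);                      % assumed initial centers

for iter = 1:100
    % Step (i): assign each point to the cluster with the closest center
    D2 = sum(X.^2,2) + sum(centers.^2,2)' - 2*(X*centers');
    [~, labels] = min(D2, [], 2);

    % Step (ii): move each center to the mean of its assigned points
    new_centers = centers;
    for j = 1:k
        if any(labels == j)                 % keep the old center if a cluster is empty
            new_centers(j,:) = mean(X(labels == j, :), 1);
        end
    end

    if isequal(new_centers, centers), break; end   % centers stopped moving
    centers = new_centers;
end

% Charting statistic: Euclidean distance of each sample to its nearest center
D2   = sum(X.^2,2) + sum(centers.^2,2)' - 2*(X*centers');
dist = sqrt(max(min(D2, [], 2), 0))
```

With these six points the iteration settles on one center per group, and the resulting distances are the values that would be plotted in phase I.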

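In phase II, the distance of a new incoming sample, denoted by $z_t$, is computed as follows:

$$D(z_t) = \min_{1 \le j \le k} \lVert z_t - \bar{x}_j \rVert, \quad t = 1, \dots, m, \tag{25}$$

where $\bar{x}_1, \dots, \bar{x}_k$ are the cluster centers obtained in phase I and $m$ is the number of observations in the testing phase.

The test sample is accepted when its distance is smaller than or equal to the radius of the KMDD-based sphere, denoted by $R$. This is equivalent to

$$D(z_t) \le R, \tag{26}$$

where $R$ is the radius of the KMDD-based one-class, representing the UCL for the KM-chart. It is set according to the number of clusters used for the construction of the one-class.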
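The phase II decision can be sketched as follows. The rule used here to set the UCL, an empirical quantile of the training distances controlled by an assumed rejection fraction frac, is an assumption made for illustration only; the centers, data, and names are hypothetical as well.

```matlab
% Sketch only: phase II check of new samples against a UCL derived from the
% training distances (base MATLAB, R2016b or later). All values are illustrative.
centers = [0 0; 10 10];                 % hypothetical cluster centers from phase I
Xtrain  = [0.2 0.1; -0.3 0.4; 0.5 -0.2; 10.1 9.8; 9.7 10.3; 10.4 10.2];
Znew    = [0.1 0.2; 5 5];               % hypothetical new samples
frac    = 0;                            % assumed fraction of training samples rejected

% Distance of each row of A to its nearest cluster center
nearest = @(A) sqrt(max(min(sum(A.^2,2) + sum(centers.^2,2)' ...
                            - 2*(A*centers'), [], 2), 0));

% Assumed UCL rule: empirical (1 - frac) quantile of the training distances
d_train = sort(nearest(Xtrain));
idx     = max(1, ceil((1 - frac) * numel(d_train)));
ucl     = d_train(idx);

% A new sample triggers an alarm when its distance exceeds the UCL
d_new = nearest(Znew);
alarm = d_new > ucl
```

Here the first new sample lies close to a center and is accepted, while the second falls between the two clusters and triggers an alarm.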

4. A Real Industrial Application

To demonstrate the efficacy of the proposed KM-chart, we applied it to the “Cristal Light” cigarettes data set. “Cristal Light” cigarettes are a Tunisian trademark produced by the Kairouan Tobacco Manufacture. The production process of “Cristal Light” cigarettes comprises 12 sequences of operations, which are humidification of tobacco leaves, threshing tobacco leaves, strip processing, hashing, drying, expansion of edges, casing, flavoring, introduction of expanded tobacco, confection of cigarettes, packing and boxing, and conditioning. Details about the “Cristal Light” data set can be found in Hajlaoui [26].

The quality of “Cristal Light” cigarettes is defined by five main characteristics, which are as follows.
(1) The weight of a cigarette, which is made up of the tobacco, the filter, and the cigarette paper weights. It varies between 0.965 and 1 gram.
(2) The module of a cigarette, which corresponds to its diameter. It varies from 6.75 to 8.0 millimeters.
(3) The humidity rate of tobacco, which is the proportion of water contained in a cigarette. It is considered acceptable if it varies between 11.5% and 13.5%.
(4) The pulling resistance of a cigarette, which is defined by the difference in pressure between the two extremities of a cigarette when a quantity of air is passed through it. The pulling resistance is considered acceptable when it varies from 100 to 115 CE (colonne d’eau, i.e., water column).
(5) The folding density, which corresponds to the volume occupied by the mass of the tobacco inside a cigarette. It is considered acceptable within 450 ± 20 cm³.

The “Cristal Light” data set is composed of 65 observations. The first 60 cigarettes are used to construct OC-charts in phase I. Each cigarette took one minute to be collected. The five remaining cigarettes are used for testing out-of-control states in phase II. For the construction of the KM-chart, we follow the methodology of Gani and Limam [25], which consists of three main steps.

Step 1. The data set is analyzed using the principal component analysis (PCA) method to obtain independent and identically distributed data, which is a fundamental assumption for the one-class classification problem.

Step 2. The principal components (PCs) resulting from Step 1 are used to construct the one-class. In our application, we have three one-class classifiers which are SVDD, KNNDD, and KMDD.

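Step 3. The optimal one-class obtained from Step 2 is used to construct OC-charts by computing the charting statistics, which are the KD for the K-chart, the average distance to the $k$ nearest neighbors for the KNN-chart, and the Euclidean distance to the nearest cluster center for the KM-chart.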
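Step 1 can be sketched in base MATLAB as follows, using synthetic data in place of the “Cristal Light” measurements. Standardizing each variable before PCA and the 90% retention rule are written here as assumptions of the sketch, and all variable names are illustrative.

```matlab
% Sketch only: PCA via SVD, keeping the fewest components that explain at
% least 90% of the variance. Synthetic data; needs R2016b or later.
rng(1);                                     % reproducible synthetic data
n = 60;                                     % number of phase I observations
scores_true = randn(n,2) .* [5 2];          % two dominant latent directions
Q = scores_true * [1 0 1 0 1; 0 1 0 1 0] + 0.1*randn(n,5);   % five variables

% Standardize each variable (assumption: the variables have different units)
Qc = (Q - mean(Q,1)) ./ std(Q,0,1);

% Principal directions are the right singular vectors of the standardized data
[~, S, V] = svd(Qc, 'econ');
explained = diag(S).^2 / sum(diag(S).^2);   % share of variance per component

p   = find(cumsum(explained) >= 0.90, 1)    % number of retained PCs
PCs = Qc * V(:,1:p);                        % scores passed on to Step 2
```

Because the synthetic variables are driven by two latent directions, two components are retained here; the retained scores PCs would then be used to fit the one-class in Step 2.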

All calculations were carried out with MATLAB software. For the construction of K-chart and KNN-chart, we used the MATLAB codes of Gani and Limam [25]. For the construction of KM-chart, we used the MATLAB code developed by the authors (see Algorithm 1).

Algorithm 1: MATLAB code for the KM-chart.

%Let Q be the (n × p) matrix of quality variables, where n is the number
%of observations and p the number of quality variables in the training phase.
%Define the target class.
T = target_class(+Q);
%Use the KMDD algorithm to fit a sphere around the target class defined above,
%where c1 is the fraction of error on the target class and c2 is a parameter
%defining the number of clusters.
w = kmeans_dd(T,c1,c2);
%Extract the parameters of the trained KMDD classifier
%(cluster centers W.w and threshold W.threshold).
W = +w;
%Phase I of the KM-chart:
%Compute the Euclidean distance of each training observation and the UCL.
n = size(T,1);
D_training = [sqrt(min(sqeucldistm(+T,W.w),[],2)) repmat(W.threshold,n,1)];
%Phase II of the KM-chart:
%Let now R be the (m × p) matrix of quality variables, where m is the number
%of observations and p the number of quality variables in the testing phase.
%In phase II we repeat the same computation as in phase I, but here we use
%test data and compare it with the UCL to detect out-of-control states.
%Compute the Euclidean distance of each test observation.
m = size(R,1);
D_test = [sqrt(min(sqeucldistm(+R,W.w),[],2)) repmat(W.threshold,m,1)];
%Column 1 holds the charting statistic and column 2 the UCL.
y1 = D_training(:,1); y2 = D_training(:,2); x1 = (1:n);
y3 = [D_training(:,1); D_test(:,1)];
y4 = [D_training(:,2); D_test(:,2)]; x2 = (1:n+m);
%Display the KM-chart for phase I and II.
figure;
subplot(2,1,1), plot(x1,y1,'-o',x1,y2,'-'); title('Phase I of the KM-chart')
subplot(2,1,2), plot(x2,y3,'-o',x2,y4,'-'); title('Phase II of the KM-chart')

After performing PCA, two PCs explaining more than 90% of the variation were retained. Several numbers of clusters were tested for the construction of KMDD-based one-class and to determine the in-control state of the “Cristal Light” process. It is clear from Figure 1 that the number of clusters influences the shape of KMDD-based one-class and plays a pivotal role in determining the trade-off between oversmoothness and undersmoothness of the control boundary. In our application, KMDD-based one-class was constructed with , since the used sample size was not large.

The detection of an abnormal observation in the target class depends on the shape of the established one-class. The KMDD provided a spherical one-class, while SVDD gave a flexible nonspherical one-class due to the use of SVs. It is worth noting that the shape of the SVDD-based one-class depends on the width of the radial basis function, while the shape of the KNNDD-based one-class is a function of the number of nearest neighbors, denoted by $k$. Details about the characteristics of SVDD and KNNDD-based one-classes can be found in Gani and Limam [25].

The KM-chart exceeded its control limit of 6.001 at around the 19th, 25th, 28th, 40th, 48th, and 50th cigarettes, as shown in Figure 2. In other words, these out-of-control cigarettes have a distance greater than the radius of the established KMDD-based sphere. For these six abnormal cigarettes, at least one of their five quality characteristics did not respect its tolerance interval, as discussed above. In comparison with the two other control charts, the proposed KM-chart succeeded in detecting a new out-of-control observation, which is cigarette number 48. Both the K-chart and the KNN-chart failed to detect this abnormal cigarette. Once these out-of-control observations were removed, no additional outliers were detected, and the in-control process was established.

In phase II, five “Cristal Light” cigarettes were used to detect out-of-control states. The KM-chart triggered an alarm at around the 62nd cigarette and remained below its control limit for the last three cigarettes. Cigarette number 62 was also declared by K-chart as an out-of-control observation, while cigarettes number 62 and 65 were declared by KNN-chart as out-of-control observations. Figure 3 shows the discussed OC-charts for phase II.

5. Performance Comparison

In this section, we study the performance of the KM-chart and compare it with the K-chart and the KNN-chart. The performance study is based on the ARL criterion, which is defined as the expected number of samples taken before the shift is detected. It is given by $\mathrm{ARL} = 1/p$, where $p$ is the probability that a single point plots out-of-control.
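The ARL expression can be justified in one line under the assumption, not stated explicitly above, that successive samples are independent and that each one signals with the same probability $p$. The run length $RL$ then follows a geometric distribution:

```latex
\[
P(RL = r) = (1-p)^{r-1}\,p, \qquad
\mathrm{ARL} = E[RL] = \sum_{r=1}^{\infty} r\,(1-p)^{r-1}\,p = \frac{1}{p}.
\]
```

For example, a signal probability of $p = 0.005$ per sample corresponds to an ARL of $1/0.005 = 200$.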

A simulation study was conducted to estimate the ARL of OC-charts. In order to be consistent with Gani and Limam [25], we follow their simulation procedure given by the following.

Step 1. Five multivariate normal variables were generated with a mean vector = (0.986; 7.650; 0.121; 107.183; 451.527) and a covariance matrix similar to the mean vector and the covariance matrix of the “Cristal Light” data set used in Section 4. The K-chart, KNN-chart, and KM-chart were designed to achieve an overall in-control ARL of 200. The ARL value was estimated by averaging the run lengths obtained by running 1000 simulated charts.

Step 2. Multivariable shifts, denoted by , were introduced in the mean vector according to Table 1. Basically, large values of correspond to bigger shifts in the mean. The value is the in-control state.
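The run-length averaging in the two steps above can be sketched with a toy Monte Carlo experiment. This is not the simulation of [25]: it assumes bivariate standard normal data and a single chart center at the origin, so that the UCL giving an in-control ARL of 200 is known in closed form. The names shift, ucl, and arl_hat are illustrative.

```matlab
% Sketch only: Monte Carlo estimate of the ARL of a distance-based chart.
% Toy setting (assumption): bivariate standard normal data, center at the origin.
rng(0);                                % reproducible random numbers
target_arl = 200;                      % desired in-control ARL
ucl   = sqrt(2*log(target_arl));       % P(distance > ucl) = 1/200 when in control
shift = [0 0];                         % mean shift; [0 0] is the in-control state
nruns = 1000;                          % number of simulated charts
run_length = zeros(nruns,1);

for r = 1:nruns
    t = 0;
    signaled = false;
    while ~signaled
        t = t + 1;
        z = randn(1,2) + shift;        % one new sample
        signaled = sqrt(sum(z.^2)) > ucl;
    end
    run_length(r) = t;                 % samples taken until the first alarm
end

arl_hat = mean(run_length)             % Monte Carlo estimate of the ARL
```

With shift set to [0 0] the estimate should lie near the theoretical value of 200, up to Monte Carlo error. Setting shift to a nonzero vector gives an estimate of the out-of-control ARL for that shift.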

For detecting small shifts (), KM-chart performed better than KNN-chart since it gave an ARL = 192.308, while KNN-chart gave an ARL = 200. For the same shift level, K-chart yielded an ARL = 100, which was better than that of KM-chart.

For detecting moderate shifts (), KNN-chart behaved better than the other two control charts since it gave an ARL = 40 against an ARL of 50 and 147.059 of K-chart and KM-chart, respectively.

The difference in sensitivity to shifts in the mean vector between the three OC-charts is due to the difference in the nature of distance used by each control chart. The K-chart uses KD, whereas KM-chart and KNN-chart are based on the Euclidean distance. The advantage of KD in comparison with Euclidean distance lies essentially in the use of the kernel function. The latter is equivalent to the distance between two samples measured in a higher dimensional space. This allows K-chart to easily detect any small shift in the process. In terms of ARL and for small shifts in the mean vector, one can draw the conclusion that our proposed KM-chart is situated between KNN-chart and K-chart (KNN-chart < KM-chart < K-chart). Broadly speaking, each OC-chart has its advantages and disadvantages. For example, K-chart performs better than KM-chart and KNN-chart in quickly detecting changes in the process, while the computational cost of KM-chart and KNN-chart is lower than that of K-chart. Table 2 summarizes the characteristics of each OC-chart.

6. Conclusion

In this paper, we have developed a new OC-chart using KMDD algorithm, called KM-chart. Construction methodology of KM-chart is demonstrated through a real industrial application. Performance analysis of KM-chart in phase I and II showed that our proposed control chart is a competitive SPC tool. In phase I, the proposed KM-chart detected a new abnormal observation which is cigarette number 48. Both K-chart and KNN-chart failed to detect this abnormal cigarette. Based on the ARL criterion, our proposed control chart outperformed KNN-chart in detecting small shifts in the mean vector.

The proposed KM-chart can be extended to monitor nonlinear processes by using the global kernel $k$-means algorithm instead of the standard $k$-means algorithm. The global kernel $k$-means algorithm has the advantage of identifying nonlinearly separable clusters and therefore allows the KM-chart to monitor sophisticated manufacturing processes.

Appendix

The MATLAB Code for KM-Chart

The MATLAB code for the KM-chart requires the PRtools toolbox available at http://www.prtools.org and the dd_tools toolbox available at http://prlab.tudelft.nl/david-tax/dd_tools.html.

For more details see Algorithm 1.

Conflict of Interests

The authors declare that they do not have a direct financial relation with the software mentioned in this paper and no competing interests.

Acknowledgment

The authors express their appreciation to LARODEC of ISG, University of Tunis, for supporting this paper.