Abstract

Real-time monitoring of the breast cancer index is becoming increasingly important and can help advance the diagnosis and treatment of breast cancer. In modern medical processes, simultaneously monitoring changes in the location and scale of observations is convenient for implementing control schemes but can be challenging. In this paper, we consider a new nonparametric control scheme for monitoring location and scale parameters in multivariate processes. The proposed method is easy to implement, and its performance is discussed. We then compare the proposed scheme with some competing methods. Simulation results show that the proposed scheme can efficiently detect a range of shifts. The proposed chart can trigger an alert and detect changes in the breast cancer index in a timely manner.

1. Introduction

Control schemes play an important role in biosurveillance studies [1–9] and have been frequently used for fault detection in product quality control and health-care monitoring [10–14]. A process should be monitored using statistical means to determine whether a shift has occurred, and action should be taken once the process is considered out-of-control (OC) [15–18]. Researchers have proposed many useful charts, such as Shewhart charts [19, 20], cumulative sum (CUSUM) charts [21–30], and exponentially weighted moving average (EWMA) charts [31–38], to detect whether there is a change in the quality characteristics of a process. These control schemes can be used for data analysis, including control and forecasting, which is useful for fault diagnosis in practice. Most charts require that the observations be univariate and typically assume that they follow a normal distribution. Unfortunately, the assumption of multivariate normality is unrealistic in most cases, and chart performance can be poor when the underlying assumptions are invalid.

Nonparametric control charts are important in the manufacturing and service sectors when the observations are nonnormal. Some control schemes are used to monitor high-dimensional processes when little is known about the underlying distribution [39–42]. Most control schemes are designed to monitor location parameters. For example, Liu and Singh [43] introduced several multivariate rank tests based on data depth. Liu [44] used the concept of data depth to propose several new control charts for monitoring multivariate processes. Data depth provides an efficient metric of process performance without relying on parametric assumptions. In addition, Zou et al. [45] provided a multivariate spatial rank for monitoring high-dimensional processes with unknown parameters. For detecting location changes in nonparametric multivariate processes, we also recommend the discussions in [46, 47]. To detect changes in the location and scale of observations simultaneously, several monitoring methods have been proposed in the literature, including those of Mukherjee and Chakraborti [48] and Chowdhury et al. [49]. Recently, Mukherjee and Marozzi [50] considered the sum of the squares of the standardized Wilcoxon and Bradley statistics for monitoring high-dimensional processes with unknown parameters, which is advantageous for the simultaneous monitoring of multiple aspects.

Recently, some schemes have been proposed to monitor changes in location and scale simultaneously using a single chart. The performance advantages of these charts have been clearly established [51]. Lepage [52] discussed a nonparametric two-sample test for location and dispersion. Building on Lepage [52], Mukherjee and Marozzi [51] introduced new circular-grid charts for the simultaneous monitoring of process location and process scale based on Lepage-type statistics. Meanwhile, Mukherjee and Marozzi [53] investigated a new single distribution-free Phase-II CUSUM procedure based on the Cucconi statistic for simultaneously monitoring changes in the location and scale parameters of a process. In addition, Mukherjee and Sen [54] discussed a distribution-free (nonparametric) Shewhart-Lepage scheme for the simultaneous monitoring of location and scale parameters using an adaptive strategy. Li et al. [55] and Shi et al. [56] provided powerful control schemes aimed at simultaneously monitoring the location and scale parameters of any continuous process. Moreover, Zafar et al. [57] proposed a new parametric memory-type charting structure based on the progressive mean under the max statistic for the joint monitoring of location and dispersion parameters. Song et al. [58] introduced distribution-free adaptive Shewhart-Lepage-type schemes for the simultaneous monitoring of location and scale parameters using information about the symmetry and tail weights of the process distribution. Huang et al. [59] proposed a new statistical process monitoring scheme with a double-sampling plan for simultaneously monitoring location and scale shifts. Bai and Li [60] considered the monitoring of ordinal categorical factors, accounting for shifts in the location or scale parameters of latent variables. For multivariate processes, Cheng and Shiau [61] proposed a distribution-free phase I monitoring scheme for both location and scale parameters based on the multisample Lepage statistic.

Although the literature contains many control schemes for monitoring location and scale parameters simultaneously, much less attention has been paid to control strategies that do so in multivariate processes. In this study, we propose a useful and easy-to-implement control scheme for simultaneously monitoring location and scale parameters, which is based on nonparametric location and scale hypothesis testing. Reference samples are denoted as phase I data streams, and test samples are denoted as phase II data streams. One problem is that the size of the phase II sample increases with the number of data streams. To address this issue, we perform hypothesis testing repeatedly with each new data stream, so that the amount of phase II data remains constant at each acquisition time.

The remainder of this paper is organized as follows. In Section 2, we review nonparametric hypothesis testing in detail. In Section 3, we propose a new scheme based on a hypothesis testing statistic for monitoring location and scale parameters and discuss its performance and validity. In Section 4, we perform a simulation-based comparison of the proposed chart with other existing charts. In Section 5, breast cancer data are investigated to illustrate the performance of the proposed chart. Lastly, we briefly draw conclusions in Section 6.

2. Review of Nonparametric Hypothesis Testing

Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. We consider a reference sample of size and a test sample of size . Thus, null hypothesis versus alternative hypothesis or , where is the location parameter of reference sample; is the location parameter of test sample; and are the scale parameters of the reference and test samples, respectively. We can use a reasonable statistical decision procedure to reject the null hypothesis . In real situations, it is difficult to identify the exact distribution of data streams. Therefore, nonparametric hypothesis testing, which makes no assumption about the distribution of the original data, is introduced. For hypothesis testing about the location parameter, Mood [62] proposed the median test, which is based on the rank of each datum. Considering the interaction between the reference and test samples, Wilcoxon [63] and Mann and Whitney [64] introduced the Mann-Whitney-Wilcoxon statistic. In addition, rank-based nonparametric hypothesis testing of the scale parameter is used in the literature [65–67].

2.1. Methods for Location Detection

In general, practitioners often check whether there is a change in a given location parameter of a process. We often use the t-statistic under the assumption that the distribution is normal. However, there is a risk in using the t-statistic with unknown population distributions. Thus, some distribution-free statistics have been developed. Brown-Mood median testing is a useful nonparametric method. However, the bilateral test does not yield satisfactory results when . To use more information about the relative size of the reference sample and test sample, the Wilcoxon rank-sum test was developed. We assume that a reference sample of size and test sample of size are given, and we let . Considering the pooled sample at time , Mann and Whitney [64] developed the Mann-Whitney statistic as follows:

Therefore, the Wilcoxon rank-sum statistic is where , and is the rank of in the pooled sample . . It can be seen that [68]

Under the null hypothesis, we also calculate the approximate normal statistic when the sample is sufficiently large.
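As a minimal sketch of the computation described in this subsection, the following Python code evaluates the Mann-Whitney count, the Wilcoxon rank-sum statistic of the test sample, and its normal approximation. The notation is an assumption made for illustration: m and n denote the reference and test sample sizes, N = m + n, W is the rank sum of the test sample in the pooled sample, U = W - n(n + 1)/2, and the null moments are the textbook values E(W) = n(N + 1)/2 and Var(W) = mn(N + 1)/12. The function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Ranks 1..N of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_rank_sum(reference, test):
    """Return (W, U, Z) for the test sample ranked within the pooled sample."""
    m, n = len(reference), len(test)
    N = m + n
    ranks = average_ranks(list(reference) + list(test))
    W = sum(ranks[m:])              # rank sum of the test sample
    U = W - n * (n + 1) / 2         # Mann-Whitney count of (reference < test) pairs
    mean_W = n * (N + 1) / 2        # null mean
    var_W = m * n * (N + 1) / 12    # null variance (assumes no ties)
    Z = (W - mean_W) / sqrt(var_W)  # approximate standard normal for large samples
    return W, U, Z
```

For example, `wilcoxon_rank_sum([1.1, 2.3, 0.7, 1.9], [2.8, 3.1, 2.0])` ranks the three test values 6, 7, and 4 in the pooled sample, so W = 17 and U = 11.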

2.2. Methods for Scale Detection

A location parameter typically describes the position of a distribution, and a scale parameter is also an important characteristic that describes a distribution. When the distribution of observations is unknown, some distribution-free methods are typically used. Given two independent samples and , we assume that the location parameters of the two samples are equal . Based on the Mann-Whitney statistic, Siegel and Tukey [65] proposed the Siegel-Tukey statistic. The implementation design of this statistic consists of the following steps: (1) mix the two samples in ascending order, ; (2) assign the rank of as shown in Table 1; and (3) calculate the ; represents the rank of .
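The three steps above can be sketched in Python as follows. The rank assignment used here is the usual Siegel-Tukey pattern (rank 1 to the smallest value, ranks 2 and 3 to the two largest, ranks 4 and 5 to the next two smallest, and so on). Summing the ranks of the test sample is an assumption made for illustration, and the function names are hypothetical.

```python
def siegel_tukey_ranks(size):
    """Siegel-Tukey ranks for the positions of an ascending-sorted pooled sample."""
    ranks = [0] * size
    low, high = 0, size - 1
    next_rank = 1
    from_low = True  # the smallest value receives rank 1
    batch = 1        # take one value first, then alternate ends in pairs
    while low <= high:
        for _ in range(batch):
            if low > high:
                break
            if from_low:
                ranks[low] = next_rank
                low += 1
            else:
                ranks[high] = next_rank
                high -= 1
            next_rank += 1
        from_low = not from_low
        batch = 2
    return ranks


def siegel_tukey_statistic(reference, test):
    """Sum of the Siegel-Tukey ranks of the test sample (assumed convention)."""
    # Step 1: pool the two samples and sort in ascending order, keeping labels.
    pooled = [(x, 0) for x in reference] + [(y, 1) for y in test]
    pooled.sort(key=lambda pair: pair[0])
    # Step 2: assign ranks alternately from the two ends of the sorted sample.
    ranks = siegel_tukey_ranks(len(pooled))
    # Step 3: add up the ranks that belong to the test sample.
    return sum(r for r, (_, label) in zip(ranks, pooled) if label == 1)
```

Because extreme observations receive small ranks, a test sample that is more dispersed than the reference sample tends to produce a small rank sum.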

Mood [62] also provided a useful test statistic for scale parameters. As before, we consider two sequences of and , where . The Mood statistic can be described as follows: where is the rank of , in sample of size . For and constant . Additionally [68],
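As a minimal sketch under assumed notation (m and n for the reference and test sample sizes, N = m + n, and R_j for the pooled-sample rank of the j-th test observation), the Mood statistic can be computed as the sum of squared deviations of the test-sample ranks from the mid-rank (N + 1)/2. The null moments used below are the textbook values E(M) = n(N^2 - 1)/12 and Var(M) = mn(N + 1)(N^2 - 4)/180. The function name is hypothetical.

```python
from math import sqrt


def mood_statistic(reference, test):
    """Return (M, Z): the Mood scale statistic and its standardized version."""
    m, n = len(reference), len(test)
    N = m + n
    pooled = list(reference) + list(test)
    # Ranks 1..N in the pooled sample (continuous data, so ties are not expected).
    order = sorted(range(N), key=lambda i: pooled[i])
    rank = [0] * N
    for position, index in enumerate(order, start=1):
        rank[index] = position
    # Squared distance of each test-sample rank from the mid-rank (N + 1) / 2.
    M = sum((rank[i] - (N + 1) / 2) ** 2 for i in range(m, N))
    mean_M = n * (N ** 2 - 1) / 12
    var_M = m * n * (N + 1) * (N ** 2 - 4) / 180
    return M, (M - mean_M) / sqrt(var_M)
```

A large value of M indicates that the test observations sit in the tails of the pooled sample, which is consistent with an increase in scale.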

Fligner and Killeen [69] also introduced a test statistic for scale parameters that is based on the absolute rank. The statistic is defined as

is the rank of in pooled sample , where , . represents the median of the sample . has the distribution of Wilcoxon’s rank-sum statistic under the null hypothesis. Therefore,

3. Proposed Monitoring Strategy

We assume that there are -independent observations from an unknown multivariate continuous distribution with dimensionality . We assume that independent observations, , follow the model below: where and are the in-control (IC) location vector and the OC location vector, respectively; and represent the IC covariance matrix and the OC covariance matrix, respectively, where ; represents an unknown change point; and is an unknown continuous distribution function. In phase I, we assume that the IC sample of size is given at time , where , . In phase II, of size is obtained. After the phase I sample is analyzed, the phase II sample is monitored.

Inspired by Mukherjee and Marozzi [50] for multivariate processes, we consider the -dimension statistic of the Euclidean distance of new observations and the mean vector of phase I data, . That is, and , where . Now, a univariate phase II sequence is obtained, . Then, a Shewhart-type chart for monitoring location changes that is based on the Wilcoxon rank-sum statistic (i.e., S-W chart) can be constructed. The statistic of the S-W chart is with upper control limit (UCL) and lower control limit (LCL) where is an unknown constant. Shewhart-type charts can also be constructed based on three other test statistics for the scale parameter. The S-ST chart (i.e., the Shewhart-type chart based on the Siegel-Tukey statistic) is calculated using with and . The S-MD chart (i.e., the Shewhart-type chart based on the Mood statistic) is given as follows: with and . The S-FK chart (i.e., the Shewhart-type chart based on the Fligner-Killeen statistic) is given by with , and .

We then use the average run length (ARL) to evaluate the performance of these methods. The ARL is the average number of points plotted on a control chart before an OC signal is given. If the process is IC, ; otherwise, when the process is OC. In addition, is the probability of a type I error occurring, and is the probability of a type II error occurring. Therefore, we typically fix the IC ARL, which is denoted as , and compare the OC ARL, which is denoted as . A smaller OC ARL is considered better. Figure 1 shows the OC ARL of the three Shewhart-type schemes (the S-ST, S-MD, and S-FK charts) when detecting shifts in the scale parameter. We let , , and under the multivariate Gaussian distribution with expectations and the variance matrix, . For a fair comparison, we set for all control schemes. Figure 1 shows that the S-MD chart performs better than the other charts when detecting a range of scale shifts.
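The relation between the ARL and the error probabilities can be illustrated with a short sketch. It assumes a Shewhart-type chart whose plotted statistics are independent, so that the run length is geometric and the ARL equals 1/p, where p is the probability that a single point signals. With alpha and beta read in the usual sense as the type I and type II error probabilities, p = alpha when the process is IC and p = 1 - beta when it is OC. The function names are hypothetical.

```python
import random


def arl_from_signal_probability(p):
    """ARL of a chart whose points signal independently with probability p."""
    if not 0 < p <= 1:
        raise ValueError("signal probability must lie in (0, 1]")
    return 1 / p


def simulated_arl(p, runs=20000, seed=0):
    """Monte Carlo check: average number of points up to and including the first signal."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        length = 1
        while rng.random() >= p:  # no signal at this point, keep plotting
            length += 1
        total += length
    return total / runs
```

For instance, a false alarm probability of 0.0025 per point corresponds to an IC ARL of 400, and a per-point detection probability of 0.2 corresponds to an OC ARL of 5.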

When calculating the Mahalanobis distance, the sample size must exceed the dimension of the data; otherwise, the inverse of the sample covariance matrix does not exist. Thus, the Mahalanobis distance sometimes fails to meet practical requirements. It is also not appropriate to simply use the Euclidean distance to reduce the dimensionality of high-dimensional data, because this would treat differences in different data attributes (i.e., the dimensions of each index or variable) as equivalent. The standardized Euclidean distance is an improvement that overcomes this shortcoming of the simple Euclidean distance. Since the distribution of each component of the data is different, the first step is to "standardize" each component so that the associated means and variances are equal.
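The dimension reduction described above can be sketched as follows. The code assumes, for illustration only, that each component is standardized with the phase I sample mean and sample standard deviation of that component and that each phase II observation is then reduced to its Euclidean distance from the phase I mean vector. The function name and the exact standardization are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_distances(phase1, phase2):
    """Reduce each phase II vector to its standardized Euclidean distance
    from the phase I mean vector.

    phase1, phase2: lists of observation vectors of equal dimension.
    """
    columns = list(zip(*phase1))          # one tuple per component
    centers = [mean(c) for c in columns]  # phase I component means
    spreads = [stdev(c) for c in columns] # phase I component standard deviations

    def distance(x):
        return sqrt(sum(((xi - c) / s) ** 2
                        for xi, c, s in zip(x, centers, spreads)))

    return [distance(x) for x in phase2]
```

The resulting univariate sequence can then be passed to a rank-based two-sample statistic. Unlike the Mahalanobis distance, this computation needs no matrix inverse, although it also ignores correlations between components.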

Mukherjee and Marozzi [50] considered the sum of the squares of the standardized Wilcoxon and Bradley statistics for monitoring high-dimensional processes with unknown parameters. Inspired by Mukherjee and Marozzi [50], we combine the ideas of control schemes and hypothesis testing to propose an effective control scheme that simultaneously monitors expectation and variance. Based on this analysis, we propose an alternative control scheme, whose statistic is as follows: with

The term asymptotic distribution is used in the sense of convergence in law when and with the ratio constant [52]. Under , the statistics and are uncorrelated for all and . Since, for all and ,

Thus, we have

Equality (14) is the product of and . Therefore,

It is obvious that

Under , and with , , and the ratio constant.
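As a hedged illustration of a Lepage-type combination, the sketch below adds the squared standardized Wilcoxon rank-sum statistic to the squared standardized Mood statistic. The choice of these two components is an assumption made for illustration, and the proposed statistic may differ in its components. If the two standardized terms are uncorrelated and approximately standard normal under the null hypothesis, their sum of squares is approximately chi-square distributed with 2 degrees of freedom, whose upper tail probability is exp(-x/2). This gives the large-sample control limit 2 ln(ARL0) used below, where ARL0 is the desired IC ARL. In practice, control limits are usually calibrated by simulation. The function names are hypothetical.

```python
from math import log, sqrt


def pooled_ranks(reference, test):
    """Ranks 1..N of the pooled sample (continuous data, so no ties assumed)."""
    pooled = list(reference) + list(test)
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    rank = [0] * len(pooled)
    for position, index in enumerate(order, start=1):
        rank[index] = position
    return rank


def lepage_type_statistic(reference, test):
    """Squared standardized Wilcoxon plus squared standardized Mood statistic."""
    m, n = len(reference), len(test)
    N = m + n
    test_ranks = pooled_ranks(reference, test)[m:]
    W = sum(test_ranks)                                  # location component
    M = sum((r - (N + 1) / 2) ** 2 for r in test_ranks)  # scale component
    z_w = (W - n * (N + 1) / 2) / sqrt(m * n * (N + 1) / 12)
    z_m = (M - n * (N ** 2 - 1) / 12) / sqrt(
        m * n * (N + 1) * (N ** 2 - 4) / 180)
    return z_w ** 2 + z_m ** 2


def asymptotic_ucl(ic_arl):
    """Chi-square(2) upper control limit for a target IC ARL (large-sample only)."""
    return 2 * log(ic_arl)
```

A chart built this way signals when the statistic for the current test sample exceeds the upper control limit. With a target IC ARL of 400, the large-sample limit is about 11.98.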

4. Performance Evaluation

In this section, we compare the performances of these charts with different reference sample sizes and test sample sizes when shifts occur. We assume that the th future observation, , is collected over time using the following multivariate model: where , , and represents the identity matrix. We let and dimensionality . Table 2 shows the OC ARL of these charts. Table 3 presents the OC ARL of these charts when there is a correlation between variables: where

The results for detecting general distributional changes of the Weibull type are shown in Table 4, where represents the Weibull distribution with the shape parameter and the scale parameter . The IC distribution is , and the OC distribution is . We also consider three types of general changes (multivariate t with 3 degrees of freedom, multivariate exponential, and multivariate gamma distributions) in Table 5. Tables 2–5 show that the proposed method performs well for detecting a range of shifts.

5. Illustration

5.1. Data Source

To illustrate the proposed method, we analyze a real clinical case. Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. For each of the 599 clinical cases, several clinical features were observed or measured. The quantitative attributes include clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. The datasets are publicly available in the “Breast Cancer Wisconsin (Original) Data Set” of the UCI Machine Learning Repository and can be downloaded from the website http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29. Breast cancer screening is an important strategy that allows for early detection and increases the probability of a good treatment outcome. More details about these datasets can be found in [70–73]. In this work, we aim to monitor the Breast Cancer Wisconsin Data Set and identify whether there is a shift in the process.

5.2. Data Analysis

A quantile-quantile (Q-Q) plot of each index, based on the 599 historical observations, is shown in Figure 2. The plots indicate that the normality assumption is invalid, which leads us to reject the null hypothesis that the data are normally distributed. Thus, we use the proposed distribution-free control scheme to monitor the breast cancer data.

We let and . We use the first 350 observations as IC data to find the control limits of the S-W chart, S-MD chart, and proposed chart. For a fair comparison, the IC ARL of all control charts is set equal to 400, and the remaining 249 breast cancer observations are monitored. The S-W and S-MD charts for the monitored breast cancer data are shown in Figure 3, which indicates that the S-W chart produces a false alarm when the process is IC; conversely, the S-MD chart produces no OC signal when the process is OC. Figure 4 shows the proposed chart for monitoring the breast cancer data; the statistic of the proposed chart falls outside the control limits after 353 observations. Compared with the S-W and S-MD charts, the proposed chart detects the shift earlier and more accurately.

6. Conclusions and Discussion

This paper provided a new control scheme for detecting location and scale changes. Inspired by Mukherjee and Marozzi [50], we proposed an effective control chart that simultaneously monitors changes in both location and scale. In this paper, the Breast Cancer Wisconsin Data Sets are analyzed using the proposed method. Spectral analysis is also reviewed and conducted to investigate the periodicities of shorter time series, and then, nonlinear least squares fitting is used for fitting analysis. The real-data example shows that the proposed scheme performs well in detecting process changes. In this study, we mainly considered the standardized Euclidean distance to reduce the dimensionality of high-dimensional data; other methods of dimensionality reduction still need to be investigated in more detail.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Liu Liu and Jin Yue designed the study and performed the research, Jin Yue discussed the experiment and the related issues in the data analysis parts, and Jin Yue wrote the manuscript. Liu Liu and Jin Yue reviewed the manuscript.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 12075162), the Sichuan Sciences and Technology Program (No. 2020YJ0357), and the VC & VR Key Lab of Sichuan Province.