Abstract

Data privacy is a serious issue and therefore needs our attention. In this study, we propose masking through randomized response techniques (RRTs) to ensure privacy and thus to avoid falsification. We assume that the process characteristic is of a sensitive nature and that, owing to privacy concerns, the actual measurements cannot be shared with the monitoring team. In such situations, the producer is very likely to falsify the measurements, and the usual control charting techniques will then give a misleading picture of the process status. We discuss different data masking strategies to be used with Shewhart-type control charts. The usual Shewhart-type control chart is a special case of the proposed charts. Average run length (ARL) is used as the performance measure. We evaluate the performance of the proposed charts for different shift sizes and under different intensities of masking, carry out a comparative analysis of the various models under varying sensitivity parameters, and compare the proposals with the traditional Shewhart chart. It is observed that the B-L control chart under the RRT model performs better for smaller shifts, whereas for larger shifts the G-B chart under an unrelated question model performs better. A real-life application is also considered, in which the thickness of paint on refrigerators is monitored.

1. Introduction

Control charting deals with the early detection of special cause variation in a process. The idea was proposed by Shewhart for monitoring production processes and is now also widely applied in other fields such as health care, social sciences, and business decision-making. For example, Muhammad et al. [1] explored various phenomena, such as the number of surgeries and the length of stay in hospital, and interpreted various control charts for continuous and attribute data. Sherlaw-Johnson and Beardsley [2] used k-sigma and variable control charts to detect changes in quarterly observed emergency hospital admissions of individuals of a specific age group for different diseases. In their work, they highlighted the need for frequent and continuous data collection on different indicators via patient experience surveys. In other health settings, SPC has been used for monitoring mortality [3, 4], the frequency of health care acquired infection [5-7], and a number of process measures including readmission rates, lengths of stay, and bed days [8, 9].

On the extension side of control charting techniques, motivated by the Shewhart (1939, p. 134) statement, “the only way one can experience any quality characteristic quantitatively is by means of an operation of measurement,” Linna and Woodall [10] addressed the issue of measurement error in a production process and quantified its impact on the power of the Shewhart control chart for monitoring the characteristic under study. In the social and health sciences, these measurements, for monitoring purposes, may be collected directly from persons in the population or extracted from records. In both situations, there are variables on which one may need data, but privacy concerns make it hard to obtain accurate data. For example, one may be interested in monitoring the abortion rate, the average amount of underreporting in income tax returns, the number of unsuccessful surgeries, or the number of times illicit drugs were taken, but observing data directly on these issues will lead to misleading results; see, for example, Horvitz et al. [11], Eichhorn and Hayre [12], and Bar-Lev et al. [13]. Randomized response techniques (RRTs), pioneered by Warner [14], provide tools through which one can collect trustworthy data for drawing inferences about true quantities with high trust but low precision.

Nowadays, data privacy is a major issue for manufacturing industries and business-related organizations. In both fields, control charting techniques are widely suggested for improving quality and enhancing business decision-making, respectively. When procuring the data, one possible approach is to mask the actual measurement of a process before storing it, or at least when releasing the information for monitoring purposes. A simple, probabilistic way of masking the characteristic is through randomized response models. In quality control charting, the data are either qualitative or quantitative, and randomized response models provide masking mechanisms for both. It is important to note that an individual randomized observation cannot be used to recover the true observation, but the actual average behavior is estimable.

In the current electronically connected era, quality control of a manufacturing process by outside professionals is not uncommon. Where the involvement of outside inspectors is unavoidable, sharing exact information with other organizations may pose a serious threat to data security and privacy. The RR model is highly capable of addressing this issue: one only has to mask the data through the RR model before sharing them with the quality control department. In this paper, we show that the quality control department is still capable of monitoring the process parameters without knowing the actual measurements of the process characteristic.

The aim of this paper is to address the issue of false reporting and data privacy in control charting techniques by incorporating randomized responses. The background of the study, comprising an introduction to RRTs and statistical process control, is given in Section 2. In Section 3, data-masking strategies and the resulting Shewhart-type control charts are proposed for monitoring the process mean. Section 4 gives performance evaluations for the proposed charts. Section 5 is dedicated to a comparative analysis among the proposed charts along with the existing ones. Section 6 presents an application of the study proposals. The study is concluded, and some future research aspects are listed, in Section 7.

2. Background

This study is based on two prominent statistical techniques, namely, the randomized response technique and the statistical process control chart. A brief review of both is given in the following sections.

2.1. Randomized Response Technique

About half a century ago, Warner [14] proposed the randomized response method as a survey technique to reduce potential bias due to nonresponse and social desirability when asking questions about sensitive behaviors and beliefs. The method asks respondents to use a randomization device, such as a coin flip, whose outcome is unobserved by the interviewer. Depending on the particular design, the randomization device determines which question the respondent has to answer [14-17], the type of expression the respondent uses to answer the sensitive question [18], or whether the respondent should give a predetermined response [19, 20]. By introducing random noise, the randomized response method conceals individual responses and protects respondent privacy. As a result, respondents may be more inclined to answer truthfully.
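As an illustration of the design in which the device selects the question, the following minimal sketch simulates Warner's model for a yes/no trait. It assumes the standard formulation in which the respondent answers the sensitive statement with probability p and its complement otherwise, so that the probability of a "yes" is pπ + (1 - p)(1 - π). The function names, the 30% prevalence, and p = 0.7 are hypothetical choices made only for this example.

```python
import random

def warner_survey(true_status, p, rng):
    """Simulate answers under Warner's design.

    With probability p the respondent answers "Do you have the trait?",
    otherwise "Do you NOT have the trait?". Only the answer is recorded,
    never the question that was selected.
    """
    answers = []
    for has_trait in true_status:
        asked_direct = rng.random() < p
        answers.append(has_trait if asked_direct else not has_trait)
    return answers

def warner_estimate(answers, p):
    """Unbiased estimate of the trait proportion (requires p != 0.5)."""
    yes_rate = sum(answers) / len(answers)
    return (yes_rate - (1 - p)) / (2 * p - 1)

rng = random.Random(1)
# Hypothetical population in which about 30% carry the sensitive trait.
population = [rng.random() < 0.30 for _ in range(20000)]
answers = warner_survey(population, p=0.7, rng=rng)
estimate = warner_estimate(answers, p=0.7)
print(round(estimate, 3))
```

No single answer reveals a respondent's status, yet the proportion is recovered on average; the price is a larger variance than under direct questioning.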

The randomized response technique is basically a data-generating technique and has been extended in various ways. Depending on the type of parameter to be estimated from the data, RRTs are divided into two main types, namely, qualitative RRTs and quantitative RRTs. In qualitative RRTs, the estimation of proportions is the basic interest, whereas quantitative RRTs are used when the population mean or a similar quantity is to be estimated. For recent advancements in RRTs, we refer to Shah et al. [21] and Blair et al. [22]. In addition, RRTs have been effectively merged with other tools, and new research areas have been proposed. For example, Du and Zhan [23] used RRT for privacy preservation in data mining, and Shah et al. [24] successfully merged RRT and paired comparison modeling to ensure the privacy of the judges in paired comparison experiments and hence reduce the impact of social desirability bias.

There are various other fields with which RRT may be combined to produce results less affected by false reporting, and we think that statistical control charting is one of them. Thus, in the coming subsection, we give a brief review of control charting.

2.2. Statistical Control Charts

In statistical process control (SPC) settings, the variations in measurements are categorized into two types: special cause variations and common cause variations. The latter can never be eliminated fully, and that is why they are considered acceptable. The former are undesirable, and if they are present, the process is categorized as out-of-control. Furthermore, these variations are attributed to shifts in the process parameters. Early detection of a shift is encouraged to ensure quality. The control chart (CC) is one of the statistical tools used for the early detection of a shift. The idea was developed by Shewhart [25] about a century ago in America for monitoring manufacturing processes. Since then, there have been notable advancements in these techniques. For example, exponentially weighted moving average (EWMA) and cumulative sum (CUSUM) control charts were proposed with the magnitude of the shift in mind, and the p-chart, u-chart, s-chart, and many more were proposed for different quality characteristics of interest. Furthermore, much of the literature has been dedicated to adjusting the limits of the Shewhart chart for different environments. For example, Raji et al. [26] evaluated the impact of outliers on the Shewhart chart and assessed its applicability through data from the semiconductor-manufacturing industry, and Celano and Chakraborti [27] proposed a distribution-free Shewhart control chart for monitoring finite horizon productions.

We now turn to the basic Shewhart [25] control chart for monitoring the mean of a continuous random variable X, known as the x-bar control chart. The established ingredients of this chart are (1) an estimator of the process parameter, referred to as the plotting statistic, which is a function of the sample observation(s); (2) a central line corresponding to the average value of the estimator; and (3) the control limits, namely, the upper control limit (UCL) and the lower control limit (LCL), usually set at 3 standard errors around the average value. For the mathematical formulation, let us assume that a quality characteristic X follows a distribution with mean $\mu_X$ and variance $\sigma_X^2$, and consider that the mean $\bar{X}$ of a sample of size $n$ is the plotting statistic; then the upper and lower k-sigma limits are, respectively, defined as

$$\mathrm{UCL} = \mu_X + k\,\frac{\sigma_X}{\sqrt{n}}, \tag{1}$$

$$\mathrm{LCL} = \mu_X - k\,\frac{\sigma_X}{\sqrt{n}}. \tag{2}$$

There are several rules, called run rules, for classifying the process as out-of-control or in-control. The simplest among them is as follows: if a value of the plotting statistic falls outside the control limits, then the process is said to be out-of-control; otherwise, there is no evidence of special cause variation. One of the performance criteria of a CC is the ARL, defined as the average number of subsamples required for a CC to detect a shift in the process parameters. Common cause variations may also lead a CC to signal; in such situations, the calculated ARL is categorized as the in-control ARL and is denoted by ARL0. The false alarm rate, the reciprocal of ARL0, is the probability that the value of the plotting statistic falls outside the control limits given that the process parameters are in the in-control state, that is,

$$\alpha = \frac{1}{\mathrm{ARL}_0} = P\left(\bar{X} < \mathrm{LCL} \ \text{or} \ \bar{X} > \mathrm{UCL} \mid \mu_X = \mu_0\right),$$

where $\mu_0$ is the value for which the underlying process is considered to be in the in-control state. In (1) and (2), k is a positive constant, typically set equal to 3 for a process whose underlying quality characteristic is assumed to follow a normal distribution. In such cases, k equal to 3 yields ARL0 approximately equal to 370. When the plotting statistic does not follow a normal distribution, one may fix the false alarm rate at 0.0027 (or any other value) and find the corresponding value of k. For more details, we refer to Mehmood et al. [28] and Haridy et al. [29].
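The link between k, the false alarm rate, and ARL0 under normality can be checked numerically. The sketch below is only an illustration: the function names and the subgroup size n = 4 are assumptions of this example, not settings taken from the study.

```python
from statistics import NormalDist

def false_alarm_rate(k):
    """P(plotting statistic falls outside +/- k standard errors) under normality."""
    return 2 * (1 - NormalDist().cdf(k))

def xbar_limits(mu0, sigma, n, k=3.0):
    """Lower and upper k-sigma limits for the mean of n observations."""
    half_width = k * sigma / n ** 0.5
    return mu0 - half_width, mu0 + half_width

alpha = false_alarm_rate(3.0)   # about 0.0027
arl0 = 1 / alpha                # about 370

# Going the other way: the k that gives a false alarm rate of 0.0027.
k_for_alpha = NormalDist().inv_cdf(1 - 0.0027 / 2)

# Hypothetical in-control process: mean 3, standard deviation 1, subgroups of 4.
lcl, ucl = xbar_limits(mu0=3.0, sigma=1.0, n=4)
print(round(alpha, 5), round(arl0, 1), round(k_for_alpha, 3), lcl, ucl)
```

When the plotting statistic is not normal, as with masked data, the inverse step has no closed form, and k is typically found by simulation.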

3. Different Types of Masking and Proposed Charts

This section is devoted to the construction of a generalized control charting technique that generalizes the Shewhart chart and results in a flexible way of monitoring a sensitive characteristic. The mechanism exploits both quantitative randomized response models and the techniques used for monitoring the process mean in statistical process control. We start by introducing masking mechanisms that the producer may choose to mask the data. Then, we provide the control limits of Shewhart-type charts that account for the intensity of masking used by the producer. Lastly, we provide a theoretical comparison with the traditional Shewhart chart.

In the coming sections, let us denote the sensitive character of interest by a random variable X, let S be a random variable used in masking the response, generated from an arbitrary distribution, and let Z be the observed randomized response. In the RR studies, S is also called the scrambling variable.

3.1. Masking through Greenberg et al. [30] Model and G-B Chart

Greenberg et al. [30] introduced the idea of collecting quantitative data on sensitive issues in survey sampling. We exploit their model and state that the producer/respondent reports the actual measurement X with probability p, or the value S with probability (1 - p). Thus, the masked data may be modeled as

$$Z = \begin{cases} X, & \text{with probability } p, \\ S, & \text{with probability } 1 - p, \end{cases} \tag{3}$$

where p plays the same role as defined by Warner [14]; it is known as the masking parameter, is selected by the producer, and is known to the monitoring team. Furthermore, let us assume that X follows a distribution with mean $\mu_X$ and variance $\sigma_X^2$, and that S is independent of X with mean $\mu_S$ and variance $\sigma_S^2$. It is then straightforward to show that the mean and variance of Z are

$$\mu_Z = p\,\mu_X + (1-p)\,\mu_S, \qquad \sigma_Z^2 = p\,\sigma_X^2 + (1-p)\,\sigma_S^2 + p(1-p)(\mu_X - \mu_S)^2. \tag{4}$$

Then, an unbiased estimator of $\mu_X$ proposed by Greenberg et al. [30] is

$$\hat{\mu}_G = \frac{\bar{Z} - (1-p)\,\mu_S}{p}, \tag{5}$$

where $\bar{Z}$ is the mean of $n$ masked observations, and its variance is given as follows:

$$\operatorname{Var}(\hat{\mu}_G) = \frac{\sigma_Z^2}{n\,p^2}. \tag{6}$$

Using the above facts, we have a Shewhart-type control chart, for monitoring the mean of X, with lower control limit and upper control limit given, respectively, as

$$\mathrm{LCL}_1 = \mu_X - k\,\frac{\sigma_Z}{p\sqrt{n}}, \tag{7}$$

$$\mathrm{UCL}_1 = \mu_X + k\,\frac{\sigma_Z}{p\sqrt{n}}. \tag{8}$$

If we look at model (3) and limits (7) and (8), it is easy to see that for p = 1 we have the unmasked data ($\sigma_Z = \sigma_X$), and this chart reduces to the usual Shewhart-type chart whose limits are given in (1) and (2). In the coming sections, we refer to this chart as the G-B chart.
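The unrelated-value masking described in this subsection can be sketched in a few lines. The sketch assumes that the producer reports X with probability p and an independent draw S otherwise, and that the monitoring team knows p and the mean and variance of S. The function names and the normal scrambling distribution with mean 5 and standard deviation 2 are hypothetical choices made for illustration only.

```python
import random
from math import sqrt

def mask_gb(x, p, draw_s, rng):
    """Report the true value x with probability p, otherwise an unrelated draw S."""
    return x if rng.random() < p else draw_s(rng)

def gb_estimate(z_values, p, mu_s):
    """Estimate the mean of X from masked values (undefined at p = 0)."""
    z_bar = sum(z_values) / len(z_values)
    return (z_bar - (1 - p) * mu_s) / p

def gb_limits(mu_x, var_x, mu_s, var_s, p, n, k):
    """k-sigma limits for the estimator based on n masked observations."""
    var_z = p * var_x + (1 - p) * var_s + p * (1 - p) * (mu_x - mu_s) ** 2
    half_width = k * sqrt(var_z / n) / p
    return mu_x - half_width, mu_x + half_width

def draw_s(rng):
    # Hypothetical scrambling distribution: normal with mean 5 and sd 2.
    return rng.gauss(5.0, 2.0)

rng = random.Random(7)
true_values = [rng.gauss(3.0, 1.0) for _ in range(50000)]
masked = [mask_gb(x, 0.6, draw_s, rng) for x in true_values]
estimate = gb_estimate(masked, p=0.6, mu_s=5.0)   # close to the true mean 3

# With p = 1 the limits coincide with the ordinary x-bar limits.
unmasked_limits = gb_limits(3.0, 1.0, 5.0, 4.0, p=1.0, n=4, k=3.0)
masked_limits = gb_limits(3.0, 1.0, 5.0, 4.0, p=0.6, n=4, k=3.0)
print(round(estimate, 2), unmasked_limits, masked_limits)
```

The limits widen as p decreases, which is the cost of masking in terms of detection power.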

3.2. Masking through Eichhorn and Hayre [12] and Bar-Lev et al. [13] Models and the B-L Chart

Eichhorn and Hayre [12] proposed a randomized response technique for estimating the population mean of a quantitative sensitive characteristic. Each measurement is recorded in masked form. That is, the producer is directed to give a coded or scrambled observation obtained by multiplying the true measurement by a random number unknown to the monitoring authorities. The monitoring team does not know which random number the producer has used for coding, but the distribution from which the random number has been generated is known.

Let X be the variable representing the responses to the sensitive quantitative characteristic, let S be the variable representing the random number used for coding the responses, and let $Z_E = XS$ represent the coded response, where S and X are independent of each other. Then,

$$E(Z_E) = \mu_X\,\mu_S, \qquad \operatorname{Var}(Z_E) = \mu_S^2\left[\sigma_X^2\,(1 + C_S^2) + \mu_X^2\,C_S^2\right], \tag{9}$$

where $C_S = \sigma_S/\mu_S$ is the coefficient of variation of S.

Thus, the unbiased estimator of $\mu_X$, proposed by Eichhorn and Hayre [12], is given as

$$\hat{\mu}_E = \frac{\bar{Z}_E}{\mu_S}, \tag{10}$$

and its variance is

$$\operatorname{Var}(\hat{\mu}_E) = \frac{1}{n}\left[\sigma_X^2\,(1 + C_S^2) + \mu_X^2\,C_S^2\right]. \tag{11}$$

As before, let X denote the sensitive characteristic of interest, let S be a random variable used in masking the response, generated from an arbitrary distribution, and let Z be the observed randomized response. For simplicity and without loss of generality, we assume independence between X and S. According to Bar-Lev et al. [13], Z is distributed as

$$Z = \begin{cases} X, & \text{with probability } p, \\ XS, & \text{with probability } 1 - p, \end{cases} \tag{12}$$

where p plays the same role as defined by Warner [14]; it is known as the masking parameter and is specified by the investigator. Furthermore, let us assume that X follows a distribution with mean $\mu_X$ and variance $\sigma_X^2$ and that S is independent of X with mean $\mu_S$ and variance $\sigma_S^2$. It is then straightforward to show that the mean and variance of Z are

$$\mu_Z = \mu_X\left[p + (1-p)\,\mu_S\right], \qquad \sigma_Z^2 = (\sigma_X^2 + \mu_X^2)\left[p + (1-p)(\sigma_S^2 + \mu_S^2)\right] - \mu_Z^2. \tag{13}$$

Using the above information, Bar-Lev et al. [13] proposed an unbiased estimator of the true mean as

$$\hat{\mu}_B = \frac{\bar{Z}}{p + (1-p)\,\mu_S}, \tag{14}$$

where $\bar{Z}$ is the sample mean of the randomized data. The standard error of the estimator given in (14) is

$$\operatorname{SE}(\hat{\mu}_B) = \frac{\sigma_Z}{\sqrt{n}\,\left[p + (1-p)\,\mu_S\right]}. \tag{15}$$

Using equations (14) and (15), the k-sigma lower control limit (LCL2), central line (CL2), and upper control limit (UCL2) of the Shewhart control chart, for monitoring the mean of the sensitive process X, are defined, respectively, as

$$\mathrm{LCL}_2 = \mu_X - k\,\operatorname{SE}(\hat{\mu}_B), \qquad \mathrm{CL}_2 = \mu_X, \qquad \mathrm{UCL}_2 = \mu_X + k\,\operatorname{SE}(\hat{\mu}_B). \tag{16}$$

This model is a general form of both direct responses and the model proposed by Eichhorn and Hayre [12]; that is, for p = 1, this model reduces to the direct response mechanism, and for p = 0, it becomes the multiplicative randomized response model proposed by Eichhorn and Hayre. For simplicity, we denote this proposed chart by B-L (Bar-Lev) in the coming sections.
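The two limiting cases mentioned above can be verified numerically. The sketch below assumes the mixture in which X is reported with probability p and the product XS otherwise; the function names and the scrambling distribution (normal with mean 5 and standard deviation 2) are hypothetical and serve only as an illustration.

```python
import random
from math import sqrt

def mask_bl(x, p, draw_s, rng):
    """Report x with probability p, otherwise the scrambled product x * S."""
    return x if rng.random() < p else x * draw_s(rng)

def bl_estimate(z_values, p, mu_s):
    """Estimate the mean of X from masked values."""
    z_bar = sum(z_values) / len(z_values)
    return z_bar / (p + (1 - p) * mu_s)

def bl_limits(mu_x, var_x, mu_s, var_s, p, n, k):
    """k-sigma limits for the estimator based on n masked observations."""
    c = p + (1 - p) * mu_s                      # E(Z) = c * mu_x
    second_moment = (var_x + mu_x ** 2) * (p + (1 - p) * (var_s + mu_s ** 2))
    var_z = second_moment - (c * mu_x) ** 2
    half_width = k * sqrt(var_z / n) / c
    return mu_x - half_width, mu_x + half_width

def draw_s(rng):
    # Hypothetical scrambling distribution: normal with mean 5 and sd 2.
    return rng.gauss(5.0, 2.0)

rng = random.Random(3)
true_values = [rng.gauss(3.0, 1.0) for _ in range(50000)]
masked = [mask_bl(x, 0.5, draw_s, rng) for x in true_values]
estimate = bl_estimate(masked, p=0.5, mu_s=5.0)   # close to the true mean 3

direct = bl_limits(3.0, 1.0, 5.0, 4.0, p=1.0, n=4, k=3.0)          # ordinary x-bar limits
multiplicative = bl_limits(3.0, 1.0, 5.0, 4.0, p=0.0, n=4, k=3.0)  # fully scrambled case

# Half-width of the purely multiplicative case written with the
# coefficient of variation of S, for comparison with the p = 0 limits.
cs2 = 4.0 / 5.0 ** 2
eh_half_width = 3.0 * sqrt((1.0 * (1 + cs2) + 3.0 ** 2 * cs2) / 4)
print(round(estimate, 2), direct, multiplicative, round(eh_half_width, 4))
```

At p = 0 the half-width agrees with the coefficient-of-variation form of the multiplicative model, and at p = 1 the limits are those of the unmasked chart.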

3.3. Masking through Additive Models and the G-A Chart

Several techniques have been considered for gaining efficiency in the estimates. Such improvements usually cost the simplicity of the model, but additive randomized response models are among those that are simple in application and better in terms of the variances of the estimates. Gupta et al. [31] proposed an optional randomized additive response model, which, along with a gain in efficiency, is also capable of producing information for estimating the sensitivity level. Here, we simplify the Gupta et al. [31] model as

$$Z = \begin{cases} X, & \text{with probability } p, \\ X + S, & \text{with probability } 1 - p. \end{cases} \tag{17}$$

In this simplified version of the Gupta et al. [31] model, if p = 0, then we get the model proposed by Pollock and Beck [32]. The unbiased estimator of the sensitive mean may be reported as

$$\hat{\mu}_A = \bar{Z} - (1-p)\,\mu_S. \tag{18}$$

It is straightforward to show that the variance of (18) is

$$\operatorname{Var}(\hat{\mu}_A) = \frac{1}{n}\left[\sigma_X^2 + (1-p)\,\sigma_S^2 + p(1-p)\,\mu_S^2\right]. \tag{19}$$

By setting p = 0 in (17), (18), and (19), one may obtain the model, estimator, and variance proposed by Pollock and Beck [32], respectively.

The upper control limit (UCL3) and lower control limit (LCL3) for this model are, respectively, given as

$$\mathrm{UCL}_3 = \mu_X + k\,\sqrt{\operatorname{Var}(\hat{\mu}_A)}, \tag{20}$$

$$\mathrm{LCL}_3 = \mu_X - k\,\sqrt{\operatorname{Var}(\hat{\mu}_A)}. \tag{21}$$

Furthermore, this chart is named the G-A chart, which stands for the Gupta-additive chart, and by putting p = 1, the G-A chart reduces to the usual Shewhart chart.
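A small sketch of additive masking follows, under the assumption that the producer reports X with probability p and X + S otherwise, so that p = 0 gives the purely additive model and p = 1 gives direct reporting. The function names and the scrambling distribution (normal with mean 5 and standard deviation 2) are hypothetical.

```python
import random
from math import sqrt

def mask_ga(x, p, draw_s, rng):
    """Report x with probability p, otherwise x plus scrambling noise S."""
    return x if rng.random() < p else x + draw_s(rng)

def ga_estimate(z_values, p, mu_s):
    """Estimate the mean of X by removing the expected added noise."""
    return sum(z_values) / len(z_values) - (1 - p) * mu_s

def ga_limits(mu_x, var_x, mu_s, var_s, p, n, k):
    """k-sigma limits for the estimator based on n masked observations."""
    var_z = var_x + (1 - p) * var_s + p * (1 - p) * mu_s ** 2
    half_width = k * sqrt(var_z / n)
    return mu_x - half_width, mu_x + half_width

def draw_s(rng):
    # Hypothetical scrambling distribution: normal with mean 5 and sd 2.
    return rng.gauss(5.0, 2.0)

rng = random.Random(5)
true_values = [rng.gauss(3.0, 1.0) for _ in range(50000)]
masked = [mask_ga(x, 0.5, draw_s, rng) for x in true_values]
estimate = ga_estimate(masked, p=0.5, mu_s=5.0)   # close to the true mean 3

# Limit half-widths for several masking intensities (n = 4, k = 3).
half_widths = {
    p: ga_limits(3.0, 1.0, 5.0, 4.0, p, n=4, k=3.0)[1] - 3.0
    for p in (0.0, 0.5, 1.0)
}
print(round(estimate, 2), half_widths)
```

In this setting the limits are widest at p = 0.5 rather than at p = 0, because the mixture of masked and unmasked values adds the term p(1 - p) times the squared mean of S to the variance.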

4. Performance Evaluations

The performance of the proposed charts is evaluated through Monte Carlo simulations for various values of the masking parameter p. We assumed that X follows a normal distribution with mean 3 and variance 1. The distribution of the masking variable S is assumed to be Poisson with mean $\mu_S$. First, we searched for the values of k for different values of p by fixing ARL0 approximately equal to 370; then we introduced shifts of various magnitudes and calculated the out-of-control average run length (ARL) and the standard deviation of the run length (SDRL) to evaluate the power of the proposed charts to detect a shift in X. The shift size in the mean of X is defined as $\delta = \mu_1 - \mu_0$, where $\mu_1$ is the shifted mean of X (with $\sigma_X = 1$, this is also the shift in standard deviation units). The results are reported in Tables 1-3 for different settings. We compared the performance with the traditional Shewhart chart.
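The run-length part of such a simulation can be sketched as follows. This is not the simulation code of the study: it uses the unmasked x-bar chart with a hypothetical subgroup size n = 5, k = 3, and a shift of one standard deviation, so that the simulated ARL can be checked against the exact value available under normality. For a masked chart, the same routine would be called with the masked estimator as the plotting statistic, and k would be tuned by repeating the in-control run until ARL0 is close to 370.

```python
import random
from statistics import NormalDist, mean, pstdev

def run_length(draw_statistic, lcl, ucl, rng, max_len=100000):
    """Count samples until the plotting statistic first leaves [lcl, ucl]."""
    for t in range(1, max_len + 1):
        if not lcl <= draw_statistic(rng) <= ucl:
            return t
    return max_len  # truncated run; not expected for the settings below

def arl_sdrl(draw_statistic, lcl, ucl, runs, rng):
    """Average and standard deviation of simulated run lengths."""
    lengths = [run_length(draw_statistic, lcl, ucl, rng) for _ in range(runs)]
    return mean(lengths), pstdev(lengths)

# Assumed settings for this example only.
mu0, sigma, n, k, delta = 3.0, 1.0, 5, 3.0, 1.0
half_width = k * sigma / n ** 0.5
lcl, ucl = mu0 - half_width, mu0 + half_width

def shifted_xbar(rng):
    """Mean of n observations after the process mean has shifted by delta * sigma."""
    return sum(rng.gauss(mu0 + delta * sigma, sigma) for _ in range(n)) / n

rng = random.Random(2024)
arl, sdrl = arl_sdrl(shifted_xbar, lcl, ucl, runs=4000, rng=rng)

# Exact ARL for the normal case, used as a check on the simulation.
z = delta * n ** 0.5
signal_prob = 1 - NormalDist().cdf(k - z) + NormalDist().cdf(-k - z)
arl_exact = 1 / signal_prob
print(round(arl, 2), round(sdrl, 2), round(arl_exact, 2))
```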

4.1. The G-B Chart

According to the results provided in Table 2, this chart suffers from the ARL-bias issue for smaller shift sizes. As noted from (5), the plotting statistic does not exist for p = 0, which implies that fully unrelated data (when Z = S) are not usable for monitoring the process mean. Furthermore, for larger shifts (greater than or equal to 1), the performance of the chart improves as p increases and converges to that of the usual Shewhart chart when p becomes 1.

4.2. The B-L Chart

Since, as discussed in Section 3.2, the B-L model is the generalized form of the Eichhorn and Hayre [12] model, we consider the B-L model for evaluation. The ARL values reported in Table 2 show the usual behavior; that is, as the shift size increases, the shift is detected earlier. It is also observed that there is no linear relationship between p and the performance of the control chart, but the chart performs better at the extreme values of p; that is, when p = 0 or p = 1, the ARL decreases quickly as the shift size increases. It is worth mentioning that at p = 0 we have fully masked data, that is, privacy is fully ensured, whereas at p = 1 we have fully unmasked data and the proposed control chart reduces to the usual Shewhart chart, which illustrates the generality of the proposed chart. Furthermore, when p decreases from 0.9 to 0.1 (with a step size of 0.1), the detection power of the chart increases. In the given setting, it is also observed that for some combinations of p and shift size, the proposed chart is ARL biased, and these situations are highlighted in Table 2.

4.3. The G-A Chart

The control limits are provided in Table 1 for different values of the masking parameter p, with ARL0 fixed at 370. The ARL-bias issue is once again encountered for small shift sizes. The relationship between the extent of masking and the average run length is not linear; that is, there is an upward trend in the ARL for smaller shift sizes but gradually a downward trend for larger shift sizes (greater than or equal to 1). Similarly, this chart is a generalization of the simple Shewhart and Pollock and Beck [32] charts.

5. Comparative Analysis

This section provides a comparison among the proposed charts of this study along with the traditional Shewhart chart.

5.1. Comparisons among the Proposed Charts

In this section, a graphical comparison of the proposed G-B, G-A, and B-L charts is provided for different shift sizes. Figure 1 depicts the relationship between p, δ, and the ARL of the proposed charts. It is observed that the B-L chart outperforms the G-B and G-A charts when p is less than or equal to 0.7 for different shifts. As the shift size increases, the performance of the G-A chart improves relative to the G-B chart for smaller values of p. For p greater than 0.7, the G-B chart performs better than the other two charts, and when p = 1, all of the charts reduce to the simple Shewhart chart.

5.2. Proposed versus Shewhart Chart

As we have seen, the proposed charts are generalized forms of the traditional Shewhart chart. It is worth mentioning that masking has two effects: (i) it ensures privacy, and (ii) it introduces extra noise into the variances of the estimates. Because of the second effect, one may say that the traditional chart, provided in (1) and (2), has narrower limits than the proposed ones and hence has more power to detect a shift. However, we do not accept this argument and argue that the traditional Shewhart chart is incapable of detecting falsified responses, as seen in Table 2. To further elaborate on the applicability of the proposed work, we consider a situation where the process is assumed to be of a sensitive nature and the monitoring team does not know that the measurements are falsified. Owing to this sensitivity, the producer falsifies each response by combining the actual value X with the scrambling variable S defined earlier. Lacking prior information, the monitoring team has to use the traditional Shewhart-type control chart (with limits provided in (1) and (2)) for monitoring purposes. The deteriorating effect of this on the Shewhart chart can be seen in the last rows of Table 2; without any shift in the process parameters, the chart declares the process out-of-control. In contrast, the proposed technique is capable of reducing the impact of falsified responses and of detecting a shift in the process parameter.

6. An Application

In this section, we consider a data set from Wild and Seber [33], and we randomize the individual observations using the model proposed by Bar-Lev et al. [13] (in fact, any other model may be considered). The masked and unmasked data are given in Table 4 for p = 0, 0.1, 0.5, 0.8, 0.9, and 1.0. The thickness of paint given in the first row of Table 1 is unmasked; that is, the measurements corresponding to p = 1 are the original measurements. We believe that, in sensitive situations, the producer may falsify these observations before presenting them to the quality inspectors. Our procedure allows the producer to mask these measurements through the randomized response model discussed in the previous sections. By doing so, the inspector will not be able to trace the original measurements, and hence the privacy of the measurements will be ensured. Here, we used a negative binomial distribution with mean 3 and variance 7.5 to mask the observations for different values of the masking parameter p. It is evident from the values that the true measurements are masked; furthermore, the intensity of masking increases as the value of p decreases, and hence, for p = 0, the data are fully masked; that is, each measurement is very likely to differ from the actual one. In Figure 2, we show that after masking the observations, the monitoring team is still able to declare the process out-of-control when it is in fact out-of-control.
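The masking step of this application can be sketched as follows. A negative binomial count with mean 3 and variance 7.5 corresponds to the number of failures before the second success when each trial succeeds with probability 0.4, since 2(0.6)/0.4 = 3 and 2(0.6)/0.16 = 7.5. The thickness values below are hypothetical stand-ins, not the Wild and Seber [33] data, and the function names are assumptions of this example.

```python
import random
from statistics import mean, pvariance

def draw_negbin(rng, r=2, q=0.4):
    """Failures before the r-th success; mean r(1-q)/q = 3, variance r(1-q)/q**2 = 7.5."""
    failures = successes = 0
    while successes < r:
        if rng.random() < q:
            successes += 1
        else:
            failures += 1
    return failures

def mask_series(values, p, rng):
    """Keep each value with probability p, otherwise multiply it by a scrambling draw."""
    return [x if rng.random() < p else x * draw_negbin(rng) for x in values]

rng = random.Random(11)
# Hypothetical paint thickness measurements, used only to show the mechanics.
thickness = [2.7, 2.3, 2.6, 2.4, 2.7, 2.5]
masked = {p: mask_series(thickness, p, rng) for p in (0.0, 0.1, 0.5, 0.8, 0.9, 1.0)}

# Check of the scrambling distribution's first two moments.
draws = [draw_negbin(rng) for _ in range(40000)]
for p, series in masked.items():
    print(p, [round(z, 1) for z in series])
print(round(mean(draws), 2), round(pvariance(draws), 2))
```

At p = 1.0 the series is returned unchanged, and at p = 0.0 every value is scrambled; because the scrambling draw can be 0, a fully masked value may even be reported as 0.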

7. Conclusion and Recommendations

Data privacy is nowadays a serious issue, and we believe that if privacy is not protected, then it is very likely that the data owners (producers and patients) are left with no choice but to falsify the data. It has been established that falsified responses lead to misleading conclusions. To handle this situation, a win-win strategy is offered in this work. The producer is allowed to mask the data in a probabilistic way, to any extent, to ensure privacy, and then shares the value of the masking parameter, p, with the monitoring team. The empirical study showed that as the extent of masking increases, the difference between the actual and masked observations increases, and that even at the highest extent, the chart is still capable of detecting the actual (out-of-control) status of the process. The proposed technique is established for the Shewhart-type control chart to monitor the process mean. Hence, there is room for extending the proposed strategy to memory-type control charts and to attribute data. On the masking mechanism side, many advanced models exist, but only the basic models were considered here to avoid complications for practitioners. Therefore, a comparative study with different RR models will be interesting.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.