Research Article | Open Access
An Efficient Approximation Method for Calculating Confidence Level of Negative Survey
The confidence level of negative survey is one of its key scientific problems. Existing work analyses the confidence level with generating functions and calculates it with a greedy algorithm in order to evaluate how dependable a negative survey is. However, that method is complex and inefficient. This study presents an efficient approximation method for calculating the confidence level of negative survey. Based on the central limit theorem and the Bayesian method, the approximation obtains the results efficiently.
1. Introduction
Artificial immune systems simulate the mechanisms of the biological immune system in order to model and design effective algorithms for complex problems. The negative selection principle is one of the unique mechanisms of the biological immune system: an immature T cell dies if it matches self as it grows, and it survives if it does not. Inspired by this principle, the negative selection algorithm was proposed and has been used for network security, virus detection [3, 4], and anomaly detection.
Similarly, the negative survey, which is inspired by the negative selection principle, is a novel and promising indirect questioning method for information security and for protecting individual privacy when collecting sensitive data. A negative survey consists of a question and a set of categories for the interviewees to select from. In contrast to traditional surveys, the participants are required to select a category that does not agree with the fact [6, 8]; that is, each participant randomly selects one of the categories that are not true for her or him. For convenience, the positive category is defined as the category that agrees with the fact, and the negative categories are defined as the other categories, which do not agree with the fact.
The negative survey method protects privacy to a high degree at a low power cost and boosts participants' confidence. The main computation in collecting sensitive data with a negative survey is reconstructing the corresponding positive survey at the central processor. The privacy-preserving properties of negative survey do not rely on anonymity, cryptography, or legal contracts, but on participants never revealing their own private information. The method is therefore suitable for collecting data at high speed on low-powered mobile devices such as smartphones and tablets.
The positive survey can be reconstructed from a result of negative survey. For a survey consisting of a question and categories for interviewees to select from, a negative survey result is , where is the results of category in negative survey. Meanwhile, the original positive survey is , where is the number of interviewees belonging to category . Define as the probability that category is chosen given that a respondent positively belongs to category , where and . Define the probability matrix as as Formula (1), and and . In consequence, the positive survey can be reconstructed from a negative survey :
Generally, each negative category is selected with the same probability $1/(c-1)$, where $c$ is the number of categories; that is, the probability of selecting negative categories follows a uniform distribution. Following this work, Xie et al. proposed the Gaussian Negative Survey (GNS) [10], in which the probabilities of selecting negative categories follow a Gaussian distribution centered at the corresponding positive category. The GNS can attain higher accuracy but offers weaker privacy protection.
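The link between the two surveys through a probability matrix can be sketched as follows. This is only an illustration under stated assumptions: the expected negative result is taken to satisfy $R = TQ$, where `q[i][j]` is the probability that a respondent in category `i` reports category `j`, and the positive survey is recovered by solving that linear system. The helper names `uniform_matrix`, `solve`, and `reconstruct` are hypothetical and do not come from the article.

```python
def uniform_matrix(c):
    """Assumed uniform scheme: 0 on the diagonal, 1/(c-1) elsewhere."""
    return [[0.0 if i == j else 1.0 / (c - 1) for j in range(c)]
            for i in range(c)]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("probability matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(size):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][size] / m[i][i] for i in range(size)]


def reconstruct(r, q):
    """Recover t from r under the assumption r = t Q (so Q^T t^T = r^T)."""
    c = len(r)
    q_transposed = [[q[i][j] for i in range(c)] for j in range(c)]
    return solve(q_transposed, [float(v) for v in r])


# Three categories, negative counts 100, 200 and 300.
print(reconstruct([100, 200, 300], uniform_matrix(3)))
```

The same `reconstruct` call would accept a non-uniform matrix (for example a Gaussian-shaped one), provided that matrix is invertible.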
The traditional reconstruction method in [6, 8] may yield a reconstructed positive survey with negative values. To address this problem, two methods [11] were proposed for reconstructing a positive survey without negative values. In [12], Bao et al. analysed the confidence level with generating functions and proposed a greedy algorithm to calculate it. However, this method is complex and inefficient, which undermines the high efficiency of negative survey.
In this study, an efficient approximation method is proposed to calculate the confidence level of negative survey. This work reinforces the efficiency of negative survey.
In the remainder of this study, Section 2 introduces the related work. Section 3 formulates the problem. Section 4 describes the efficient approximation method. Section 5 presents simulation experiments. Section 6 discusses some open problems of the approximation method, and Section 7 concludes the study.
2. Related Work
In this study, the probability of selecting negative categories follows a uniform distribution (i.e., each negative category is selected with probability $1/(c-1)$), as in the general negative survey of [6, 8, 11, 12]. This section therefore introduces the related work on negative survey [6, 8, 11, 12]. For convenience, some definitions are given in the following list:
- $n$: the number of interviewees in a survey.
- $c$: the number of categories in a survey.
- $r_i$: the number of interviewees selecting category $i$ in the negative survey.
- $q_i$: the proportion of negative category $i$; that is, $q_i = r_i/n$.
- $t_i$: the original number of interviewees in positive category $i$.
- $\hat{t}_i$: the estimated number of $t_i$.
- $R$: the participant vector of the negative survey; that is, $R = (r_1, r_2, \ldots, r_c)$.
- $T$: the participant vector of the positive survey; that is, $T = (t_1, t_2, \ldots, t_c)$.
- $p_i$: the proportion of positive category $i$; that is, $p_i = t_i/n$.
- $\hat{p}_i$: the estimated value of $p_i$; that is, $\hat{p}_i = \hat{t}_i/n$.
Define $n$ as the number of interviewees participating in the negative survey and $c$ as the number of categories. The results of the negative survey are $R = (r_1, r_2, \ldots, r_c)$, where $r_i$ represents the total number of participants who select the $i$th category in the negative survey. Similarly, the real positive survey is $T = (t_1, t_2, \ldots, t_c)$, and $\sum_{i=1}^{c} r_i = \sum_{i=1}^{c} t_i = n$. In [6, 8], the reconstructed positive survey is calculated by Formula (2). In this study, a positive category $i$ in a survey with $n$ interviewees and $c$ categories, whose proportion is $p_i$, is written as $(n, c, p_i)$ for simplicity, and the corresponding negative category is written as $(n, c, q_i)$:
$$\hat{t}_i = n - (c-1)\, r_i. \qquad (2)$$
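A minimal sketch of this reconstruction follows. It assumes that Formula (2) takes the standard form for uniform negative selection, $\hat{t}_i = n - (c-1) r_i$, with $n$ interviewees, $c$ categories, and $r_i$ the number of interviewees who select category $i$ in the negative survey. The function names and the example counts are illustrative only.

```python
import random


def simulate_negative_survey(positive_counts, rng):
    """Each interviewee reports one category chosen uniformly at random
    among the categories that are NOT her or his true category."""
    c = len(positive_counts)
    negative_counts = [0] * c
    for true_category, count in enumerate(positive_counts):
        others = [k for k in range(c) if k != true_category]
        for _ in range(count):
            negative_counts[rng.choice(others)] += 1
    return negative_counts


def reconstruct_positive(negative_counts):
    """Estimate each t_i as n - (c - 1) * r_i (assumed form of Formula (2))."""
    n = sum(negative_counts)
    c = len(negative_counts)
    return [n - (c - 1) * r for r in negative_counts]


rng = random.Random(0)            # fixed seed so the run is repeatable
t = [500, 300, 150, 50]           # hypothetical positive survey
r = simulate_negative_survey(t, rng)
t_hat = reconstruct_positive(r)
print("negative survey:", r)
print("reconstructed  :", t_hat)
```

The estimates always sum to $n$, but an individual estimate can be negative when its $r_i$ exceeds $n/(c-1)$, which is the situation the corrected methods address.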
Although $\sum_{i=1}^{c} \hat{t}_i = n$, it can be observed that $\hat{t}_i < 0$ when $r_i > n/(c-1)$. Therefore, the traditional method is sometimes impractical. Following the traditional method in [6, 8], two methods were proposed for reconstructing the positive survey in [11]. Method I uses an iterative procedure to reconstruct the positive survey. Its advantage is that the reconstructed positive survey contains no negative values; that is, $\hat{t}_i \ge 0$. However, Method I reconstructs the positive survey only approximately through an implicit function, and its accuracy lacks a theoretical basis.
Method II eliminates the negative values by adjusting the reconstructed positive survey: it sets each negative value to 0 and then rescales the values of the other categories proportionally so that the sum of the reconstructed positive survey stays unchanged. This method is more efficient than Method I, but it has no theoretical analysis either. In [12], the confidence level of negative survey is analysed with generating functions and calculated with a greedy algorithm.
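The adjustment described for Method II can be sketched as below. This is a simplified one-pass reading of the description (clip negative estimates to 0, then rescale the remaining values proportionally so the total is preserved), not necessarily the exact procedure of the cited work. The function name `clip_and_rescale` is hypothetical.

```python
def clip_and_rescale(estimates):
    """Set negative estimates to 0 and rescale the others proportionally
    so that the total of the reconstructed survey stays the same."""
    total = sum(estimates)
    clipped = [max(value, 0.0) for value in estimates]
    positive_sum = sum(clipped)
    if positive_sum == 0:
        raise ValueError("no positive estimate left to rescale")
    scale = total / positive_sum
    return [value * scale for value in clipped]


# Hypothetical reconstruction for n = 1000 with one negative estimate.
adjusted = clip_and_rescale([700, 400, -100])
print(adjusted)
```

Because the scale factor is at most 1 whenever a negative estimate was clipped, the rescaling cannot create new negative values, so a single pass suffices in this simplified form.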
3. Problem Formulation
Efficiency is one of the greatest advantages of collecting data by the negative survey method, because each participant only needs to send one of her or his negative categories (i.e., unreal information). The positive survey reconstructed from a negative survey has inexact values, so two issues are important: the confidence level and the efficiency. Because the reconstructed values are inexact, calculating the confidence level exactly with generating functions [12] is unnecessary and inefficient. More importantly, the exact calculation is so complicated that a greedy algorithm has to be used [12].
This study proposes an efficient method, analysed with the central limit theorem and the Bayes method, to calculate the confidence level approximately; this approximation reinforces the efficiency of negative survey. Its core idea is to use the Normal Distribution to approximate the original distribution for fast calculation (more details in Section 4). The Bayes method is then used to calculate the confidence level of each category in the negative survey, based on an analysis of the distribution of possible positive survey results.
4. The Efficient Method of Approximation
This section presents the proposed efficient approximation method for calculating the confidence level. In Section 4.1, the central limit theorem is used to approximate the distribution of the negative survey result. In Section 4.2, the Bayes method is used to estimate the probability density function of the reconstructed positive survey result. In Section 4.3, the confidence level is calculated based on the Bayes method.
4.1. The Distribution of Negative Survey
Theorem 1 gives the distribution of category in negative survey when that of positive survey is known.
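Theorem 1. For a given positive category $(n, c, p_i)$ and the corresponding negative category $(n, c, q_i)$, $q_i$ approximately follows a Normal Distribution when $n$ goes to infinity:
$$q_i \sim N(\mu, \sigma^2), \qquad (3)$$
where $\mu = \dfrac{1-p_i}{c-1}$ and $\sigma^2 = \dfrac{(1-p_i)(c-2)}{n(c-1)^2}$.
Proof. Consider the negative category $i$ and calculate the probability distribution of $r_i$, the number of interviewees who select category $i$ in the negative survey. Let $t_i$ be the number of interviewees who belong to category $i$. In the negative survey, only the other $n - t_i$ interviewees can select the $i$th category. Define the random variable $X_k$: if the $k$th of these interviewees selects the $i$th category, $X_k = 1$; otherwise $X_k = 0$. The $X_k$ are independent and identically distributed, and each follows the Binomial Distribution $B(1, 1/(c-1))$. Let $r_i = \sum_{k=1}^{n-t_i} X_k$. So $r_i \sim B(n-t_i, 1/(c-1))$, and
$$E(r_i) = \frac{n-t_i}{c-1}, \qquad D(r_i) = \frac{(n-t_i)(c-2)}{(c-1)^2}.$$
Owing to the De Moivre-Laplace central limit theorem, $r_i$ approximately follows a Normal Distribution as $n$ goes to infinity; that is,
$$r_i \sim N\!\left(\frac{n-t_i}{c-1},\ \frac{(n-t_i)(c-2)}{(c-1)^2}\right).$$
So, with $q_i = r_i/n$ and $p_i = t_i/n$,
$$q_i \sim N\!\left(\frac{1-p_i}{c-1},\ \frac{(1-p_i)(c-2)}{n(c-1)^2}\right).$$
In consequence, $q_i$ follows the Normal Distribution when $n$ goes to infinity, and Theorem 1 and Formula (3) are both valid.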
According to the character of Normal Distribution, . So we can regard that and . Figure 2 illustrates the range of for different values of when and .
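The quality of the Normal approximation in this subsection can be checked numerically. The sketch below assumes that the count $r_i$ for one negative category follows a Binomial distribution with $n(1-p)$ trials and success probability $1/(c-1)$, where $p$ is the positive proportion of that category. It compares the exact Binomial mass within three standard deviations of the mean against the Normal value. The parameter values and function names are illustrative, and $c \ge 3$ is required.

```python
import math


def binomial_log_pmf(k, m, prob):
    """Logarithm of the Binomial(m, prob) probability of k successes."""
    return (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
            + k * math.log(prob) + (m - k) * math.log(1 - prob))


def normal_parameters(n, c, p):
    """Mean and standard deviation of r_i under the Normal approximation."""
    trials = n * (1 - p)
    prob = 1 / (c - 1)
    return trials * prob, math.sqrt(trials * prob * (1 - prob))


def exact_mass_within(n, c, p, width=3.0):
    """Exact Binomial probability that r_i lies within width * sigma of mu."""
    trials = round(n * (1 - p))
    prob = 1 / (c - 1)
    mu, sigma = normal_parameters(n, c, p)
    low = max(0, math.ceil(mu - width * sigma))
    high = min(trials, math.floor(mu + width * sigma))
    return sum(math.exp(binomial_log_pmf(k, trials, prob))
               for k in range(low, high + 1))


exact = exact_mass_within(n=1000, c=5, p=0.3)
normal = math.erf(3 / math.sqrt(2))   # Normal mass within three sigma
print(f"exact Binomial mass: {exact:.4f}, Normal mass: {normal:.4f}")
```

Repeating the comparison with a small `n` or with `p` close to 1 gives a rough feel for when the approximation starts to degrade.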
4.2. The Distribution of Reconstructed Positive Survey
Reconstructing a positive survey from a given negative survey differs from traditional parameter estimation, because the given negative survey result is only one sample of its original positive survey. Consequently, we use the Bayes method to reconstruct the positive survey. The distribution of the reconstructed positive category is given in Theorem 2.
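Theorem 2. If a negative category is $(n, c, q_i)$, the probability density function of the corresponding $p_i$ is
$$f(p_i \mid q_i) = \frac{\dfrac{1}{\sqrt{2\pi}\,\sigma(p_i)} \exp\!\left(-\dfrac{(q_i-\mu(p_i))^2}{2\sigma^2(p_i)}\right)}{\displaystyle\int_0^1 \dfrac{1}{\sqrt{2\pi}\,\sigma(p)} \exp\!\left(-\dfrac{(q_i-\mu(p))^2}{2\sigma^2(p)}\right) dp}, \quad 0 \le p_i \le 1, \qquad (8)$$
where $\mu(p) = \dfrac{1-p}{c-1}$ and $\sigma^2(p) = \dfrac{(1-p)(c-2)}{n(c-1)^2}$.
Proof. Define $\pi(p_i)$ to be the prior probability density function of $p_i$, and $f(q_i \mid p_i)$ to be the conditional probability density function of $q_i$ for a given $p_i$. According to the Bayes formula for probability density functions, the probability density function of $p_i$ with given $q_i$ is the following formula:
$$f(p_i \mid q_i) = \frac{\pi(p_i)\, f(q_i \mid p_i)}{f(q_i)}. \qquad (9)$$
Suppose that we have no knowledge of $p_i$. Based on the Bayesian assumption, the prior probability density function can be considered as the uniform distribution $U(0, 1)$. On this occasion, the density function $f(q_i)$ can be calculated by Formula (10). In addition, $f(q_i \mid p_i)$ is the Normal density given by Formula (7). Consider the following:
$$f(q_i) = \int_0^1 \pi(p)\, f(q_i \mid p)\, dp = \int_0^1 f(q_i \mid p)\, dp. \qquad (10)$$
So the conditional probability density function of $p_i$ with given $q_i$ is
$$f(p_i \mid q_i) = \frac{f(q_i \mid p_i)}{\int_0^1 f(q_i \mid p)\, dp}. \qquad (11)$$
Combining Formula (7) and Formula (11), Formula (8) is obtained and Theorem 2 is valid.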
Figure 3 illustrates the function curve of for different values of , , or . Figures 3(a) and 3(b) show that less makes centred around more closely, Figure 3(c) shows that greater makes that, and Figure 3(d) shows that less makes that, too. In addition, Figures 3(a) and 3(b) also show that greater may lead to , and the corresponding is 0 with a great probability.
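The density discussed in this subsection can be evaluated numerically. The following sketch assumes a uniform prior on $[0, 1]$ for the positive proportion $p$ and a Normal likelihood for the observed negative proportion $q$, with mean $(1-p)/(c-1)$ and variance $(1-p)(c-2)/(n(c-1)^2)$. The symbols, the grid size, and the function names are assumptions made for illustration, and $c \ge 3$ is required so that the variance is positive.

```python
import math


def likelihood(q, p, n, c):
    """Normal density of the negative proportion q given positive proportion p."""
    mean = (1 - p) / (c - 1)
    var = (1 - p) * (c - 2) / (n * (c - 1) ** 2)
    return math.exp(-(q - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior_density(q, n, c, grid=2000):
    """Posterior density of p on a midpoint grid over (0, 1), uniform prior.

    The midpoint grid avoids p = 1, where the assumed variance is zero.
    """
    step = 1.0 / grid
    points = [(k + 0.5) * step for k in range(grid)]
    values = [likelihood(q, p, n, c) for p in points]
    norm = sum(values) * step          # numerical value of the marginal f(q)
    return points, [v / norm for v in values]


points, density = posterior_density(q=0.175, n=1000, c=5)
peak = points[max(range(len(density)), key=density.__getitem__)]
print(f"posterior peaks near p = {peak:.3f}")
```

For these illustrative inputs the peak sits close to $1 - (c-1)q$, the value that the direct reconstruction would give.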
4.3. The Confidence Level
In this subsection, an approximation method is used for calculating confidence level of reconstructed positive survey.
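Theorem 3. If the confidence interval length is $l$, the confidence level is
$$\int_{\hat{p}_i - l/2}^{\hat{p}_i + l/2} f(p \mid q_i)\, dp,$$
where $\hat{p}_i = 1 - (c-1)\, q_i$, and $f(p \mid q_i)$ is given in Formula (8).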
Figure 4 illustrates the confidence level varying with with different values of and . From this figure, obviously the confidence level has the following two characters: when , the confidence level increases with (Figure 4(c)) and decreases with (Figure 4(a)) or (Figure 4(b)); when , the confidence level increases with firstly (Figure 4(d)). Because in this case the is 0 with a high probability, the confidence level decreases severely (Figure 4(d)). These values of are nearly impossible because the prior probability to attain such a large value of is very low, and may be the survey error (if as described in Section 4.1).
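A numerical sketch of the confidence level calculation is given below. It assumes that the confidence interval of length `length` is centred on the point estimate $1 - (c-1)q$, that the prior on the positive proportion is uniform on $[0, 1]$, and that the likelihood is the Normal approximation with mean $(1-p)/(c-1)$ and variance $(1-p)(c-2)/(n(c-1)^2)$. All names, the grid size, and the example inputs are illustrative, and $c \ge 3$ is required.

```python
import math


def confidence_level(q, n, c, length, grid=20000):
    """Posterior probability that p lies within length/2 of 1 - (c-1) q."""
    p_hat = 1 - (c - 1) * q
    low, high = p_hat - length / 2, p_hat + length / 2
    step = 1.0 / grid
    total = inside = 0.0
    for k in range(grid):
        p = (k + 0.5) * step                       # midpoint rule on (0, 1)
        mean = (1 - p) / (c - 1)
        var = (1 - p) * (c - 2) / (n * (c - 1) ** 2)
        weight = math.exp(-(q - mean) ** 2 / (2 * var)) / math.sqrt(var)
        total += weight
        if low <= p <= high:
            inside += weight
    if total == 0.0:
        raise ValueError("likelihood underflowed; inputs are too extreme")
    return inside / total


# Same observed proportion, two hypothetical sample sizes.
for n in (1000, 10000):
    level = confidence_level(q=0.175, n=n, c=5, length=0.1)
    print(f"n = {n}: confidence level = {level:.4f}")
```

Because the density is normalised on the same grid, constant factors of the Normal density cancel and are omitted from `weight`. The part of the interval that falls outside $[0, 1]$ contributes nothing.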
5. Simulation Experiments
In this section, some examples of negative survey (similar with that in ) are specially designed to verify this approximation method. In Tables 1 and 2, the confidence level is calculated independently by category when the confidence interval (abbreviated as CI) length is 0.1. As is indicated in Table 1, the confidence interval is as . In this case, the confidence level increases with and decreases with . As shown in Table 2, the confidence level is diverse and complicated. If , then the confidence level is very small as is large. That means an excessive rise of may even be a survey error because the prior probability to attain such a greater value of is very low. In addition, when is a negative value, the second method in  is needed to correct the reconstructed positive survey.
Table 3 shows the confidence levels of seven groups of negative survey. The confidence level includes three values, which is the confidence level of each category, respectively. It is worth reminding that the confidence levels in the last three groups of negative survey are less when . The reason that the probability to get such a large value of is rather low if . When , the confidence levels of the last three groups of negative survey are low, and the survey results may be faulty.
6. Discussion
In this study, we propose an efficient approximation method for calculating the confidence level of negative survey, but several issues remain for future study.
Firstly, this approximation method is based on central limit theorem, which is valid when is sufficiently large. However, the degree of “sufficiently large” (of ) is diverse when has various values. So the “sufficiently large” cannot only be measured in and should be measured in both and . If or is smaller in amount, the Poisson Distribution is the better approximation distribution rather than Normal Distribution. In addition, Normal Distribution, which is a symmetric distribution, is used to approximate the original distribution, but the original distribution is not perfectly symmetrical.
Secondly, this method in this study analyses each category independently. The correlation of different categories should be taken into account in future work. For example, the confidence level may be high when , because the corresponding has a high probability to be 0. But in this case, the sum of all the estimated is greater than 1, and the results are needed for further revision.
Thirdly, the confidence interval is set to be when . But strictly, is not a completely symmetrical function. So the confidence interval may not be the smallest one.
Finally, the confidence level in this study is calculated for each category independently. How to compare two close confidence levels (such as the first two examples in Table 3) still needs further study.
7. Conclusions
This study proposes an efficient approximation method for calculating the confidence level of negative survey. The Normal Distribution is first used to approximate the distribution of the negative survey result; the Bayes method is then used to calculate the confidence level approximately. With the proposed method, the confidence level of negative survey can be calculated approximately and efficiently.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The project was supported by the National Natural Science Foundation of China under Grants 61502440, 61502439, and 61272469 and the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan), under Grant CUGL 140840.
References
[1] S. A. Hofmeyr and S. Forrest, “Architecture for an artificial immune system,” Evolutionary Computation, vol. 8, no. 4, pp. 443–473, 2000.
[2] S. Forrest, A. S. Perelson, L. Allen, and R. Cherukuri, “Self-nonself discrimination in a computer,” in Proceedings of the IEEE Symposium on Research in Security and Privacy, pp. 202–212, May 1994.
[3] J. Kim and P. J. Bentley, “Towards an artificial immune system for network intrusion detection: an investigation of clonal selection with a negative selection operator,” in Proceedings of the Congress on Evolutionary Computation (CEC '01), vol. 2, pp. 1244–1252, May 2001.
[4] G. Du, T. Huang, B. Zhao, and L. Song, “Dynamic self-defined immunity model base on data mining for network intrusion detection,” in Proceedings of the 4th International Conference on Machine Learning and Cybernetics, vol. 6, pp. 3866–3870, 2005.
[5] Z. Ji and D. Dasgupta, “V-detector: an efficient negative selection algorithm with ‘probably adequate’ detector coverage,” Information Sciences, vol. 179, no. 10, pp. 1390–1406, 2009.
[6] F. Esponda, “Negative surveys,” http://arxiv.org/abs/math/0608176.
[7] J. Horey, M. M. Groat, S. Forrest, and F. Esponda, “Anonymous data collection in sensor networks,” in Proceedings of the 4th Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services (MobiQuitous '07), pp. 1–8, August 2007.
[8] F. Esponda and V. M. Guerrero, “Surveys with negative questions for sensitive items,” Statistics and Probability Letters, vol. 79, no. 24, pp. 2456–2461, 2009.
[9] J. L. Horey, S. Forrest, and M. Groat, “Reconstructing spatial distributions from anonymized locations,” in Proceedings of the 28th International Conference on Data Engineering Workshops, pp. 243–250, ACM, 2012.
[10] H. Xie, L. Kulik, and E. Tanin, “Privacy-aware collection of aggregate spatial data,” Data & Knowledge Engineering, vol. 70, no. 6, pp. 575–595, 2011.
[11] Y. Bao, W. Luo, and X. Zhang, “Estimating positive surveys from negative surveys,” Statistics and Probability Letters, vol. 83, no. 2, pp. 551–558, 2013.
[12] Y. Bao, W. Luo, and Y. Lu, “On the dependable level of the negative survey,” Statistics and Probability Letters, vol. 89, pp. 31–40, 2014.
Copyright © 2015 Ran Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.