Abstract

Negative information selection is an approach that protects privacy by replacing original information with negative information. In this paper, we prove several bounds for negative information selection. These bounds reveal the privacy protection strength through quantitative probability analysis. We also analyze the probability of reconstructing the original information from available negative information. The formal analysis specifies bounds on the security strength and utility of negative information selection. In addition, we simulate brute force attacks under different data leakage ratios. Specifically, we calculate the attacker's number of guesses before and after the data leakage. Experimental results indicate that a data leakage of over 30% puts the original information in a dangerous situation. Furthermore, we find that the leakage possibility has little relevance to the number of elements in the full set but is influenced by the ratio of the leaked information.

1. Introduction

Negative information selection is a general method for privacy protection: negative information is exposed in place of the original information, so as to protect the privacy of the original information [1]. Negative information selection can be widely employed in various scenarios such as data publication with privacy protection, privacy-aware data mining or machine learning, and federated learning [2, 3].

Suppose the full set is $S$, which could be a set of words, sentences, or objects. The user information is collected in a set $A$, but the information is stored in the negative format $A'$, which is a subset of $\bar{A}$ ($\bar{A}$ is the complementary set of $A$ in $S$). The relationship of $A$, $A'$, and $S$ is shown in Figure 1. As shown, users store $A$ locally and only expose $A'$ to the database. Since many systems suffer from central database attacks that frequently cause data leakage [4], attackers can steal the $A'$ information from the database and try to guess $A$ from $A'$. Users only pick some elements from the set $S$, and attackers launch brute force attacks to guess the exact combination of elements chosen by users (a brute force attack is a hacking method that uses trial and error to crack passwords and encryption keys). There is no doubt that the leakage of $A'$ reduces the number of guesses attackers need and increases their success rate. The major concern is thus to what extent negative information selection can protect the privacy of the original information and/or to what extent the negative information can recover the original information [5].

The negative survey (NS) was first proposed in 2006 [6] as a privacy-preserving method related to cryptography, anonymity, and legal guarantees. Take a direct working-hours survey as an example:

I work:
(i) Less than 3 hours a day
(ii) Between 3 and 6 hours a day
(iii) More than 6 hours a day

In a negative survey, the statement becomes: I do not work:
(i) Less than 3 hours a day
(ii) Between 3 and 6 hours a day
(iii) More than 6 hours a day

If the positive version of the survey is answered by an individual working less than 3 hours a day, the first option must be chosen. If the same person answers the negative version of the survey, one of the last two options should be selected. After a series of answers is selected, the negative information is stored in the database for every individual as a unique user profile. Once the negative information is exposed to attackers, they will try to guess the users' positive answers by brute force attack. The problem thus reduces to one point: the probability of revealing or guessing elements of $A$ after viewing $A'$, a set that lies outside of $A$. As the selection of negative information could be repeated multiple times, the problem becomes subtle.
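To make the selection mechanic concrete, here is a minimal Python sketch of a respondent answering the negative version of the survey; the category labels and function name are illustrative assumptions, not part of the original instrument:

```python
import random

# The three answer categories of the working-hours survey above.
CATEGORIES = [
    "less than 3 hours a day",
    "between 3 and 6 hours a day",
    "more than 6 hours a day",
]

def negative_answer(true_category: str) -> str:
    """Return one uniformly chosen category that is NOT the true one."""
    candidates = [c for c in CATEGORIES if c != true_category]
    return random.choice(candidates)

# A respondent who truly works less than 3 hours a day reports
# one of the other two categories.
print(negative_answer("less than 3 hours a day"))
```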

In this paper, we formally analyze the above probability. The main contributions of the paper are as follows:
(1) We prove the privacy strength in terms of probability analysis, and some key bounds are provided.
(2) We also analyze the possible methods and the probability of recovering the original information from negative information.
(3) We test the probability of recovering the original information from negative information with different bound values under a brute force attack.

The rest of the paper is organized as follows: related work is reviewed in Section 2; Section 3 presents the problem formulation and analysis; we discuss some further bounds in Section 4; Section 5 shows the experimental results on the bounds of negative information; and we conclude the paper in Section 6.

2. Related Work

Negative surveys (NSs) are designed to obtain accurate data from the negative answers given by interviewees, which are generated by a randomized response model. Research on NS mainly focuses on two aspects: data accuracy and data privacy [7, 8]. For data accuracy, users tend to avoid sensitive questions and give implicit or fake answers, thus producing inaccurate datasets. Therefore, NS asks the respondents to choose among the t possible answers to a single question; that is, the other t-1 answers are eliminated by users one by one. For data privacy, attackers can obtain user profiles and infer user identities from the negative answers they give [9]. As a result, there is a need to protect against the recovery of original information from negative information. For example, Bao et al. [10] designed a greedy algorithm to calculate the smallest confidence areas. With a dependable level of the negative survey, the reasonable range of the positive survey results could be estimated, which is analyzed by studying the confidence coefficient of the negative survey. However, there is currently no research on the bounds of negative information selection [11].

A recent study on negative surveys focuses on improving accuracy by requiring each respondent to select multiple false answers of the same category [12]. In 2016, Esponda et al. [13] introduced a statistical approach for collecting sensitive data that allows each participant to customize the amount of information that she is willing to reveal, as each respondent has a different criterion regarding the sensitivity of a specific topic. To improve the accuracy of estimation, Liu et al. [14] proposed the multiple negative survey (MNS), which collects multiple different negative categories from each user to obtain more accurate results; two crucial scientific problems (accuracy and confidence level) are analyzed, and an anonymity vote model is introduced. Jiang et al. [15] indicated that the typical type of NS fails to achieve satisfactory privacy preservation. As a result, they proposed two novel negative survey models that use negative combined categories (NCCs), namely, NCC-I and NCC-II; their experimental results show that the proposed methods achieve excellent privacy preservation with only two categories. Xu et al. [16] proposed to retain aggregate scores in negative surveys and designed an algorithm that exploits the aggregate scores to enhance accuracy during result reconstruction. Experimental results show that this approach outperforms existing algorithms, since it considers questionnaires with multiple questions and uses the aggregate scores across them to reconstruct global results.

When it comes to data privacy protection in NS, we found that NS can be leveraged for both user and object privacy protection. For example, Aoki and Sezaki [17] argued that it was difficult to run a complicated security system on resource-constrained mobile phones. Therefore, they proposed a method combining NS with randomized response techniques for privacy preservation. With this method, participatory sensing applications can ensure data integrity while protecting data privacy at low computational complexity. Similarly, Jiang et al. [18] introduced a privacy-preserving aggregation scheme based on NS for smart meters; however, they did not discuss the bounds of negative information selection. In 2017, Luo et al. [19] applied NS to the location- and trace-privacy protection of moving objects. They analyzed the effectiveness of both the single-selection NS and the multiple-selection NS for location- and trace-privacy protection and theoretically proved that the single-selection NS is more effective in this scenario. Yang et al. [20] introduced a privacy-preserving aggregation scheme based on NS for vehicle fuel consumption data. They found that although an individual real-time fuel consumption datum is meaningless, continuous real-time fuel consumption data may reveal the user's privacy. Therefore, they proposed an anonymization algorithm on the user side and an estimation algorithm on the server side that together prevent data collection by attackers. However, no existing research focuses on the bounds of negative information selection, which makes it important to find the balance between data accuracy and data privacy.

3. Problem Formulation and Analysis

Let $A$ be a set and $S$ be the full domain; $A$ is included in $S$, i.e., $A \subseteq S$.

$\bar{A} = S \setminus A$ is the complementary set of $A$ in $S$.

Definition 1 (negative information selection mapping $f$). $f$ is a mapping that takes a set $A$ as input and outputs a new set $A'$. That is, $f(A) = A'$, where $A' \subseteq \bar{A}$.

Remark 2.
(1) $f$ can be conducted by, in each run, selecting an element $x \in \bar{A}$ and then letting $A' = A' \cup \{x\}$.
(2) $f$ is a random mapping instead of a function, as $A'$ could be different for the same $A$.
(3) Next, we will analyze some bounds for the security of $f$. Roughly speaking, the security is a quantitative measurement of the information leakage on $A$ from $A'$.
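As an illustration of Definition 1 and Remark 2, the following Python sketch implements one possible run of the random mapping $f$; the function name and the choice of drawing $m'$ elements uniformly are assumptions for illustration:

```python
import random

def negative_selection(S: set, A: set, m_prime: int) -> set:
    """One run of the random mapping f: draw m' elements from the
    complement of A in S, so that A' is a subset of S - A."""
    complement = S - A
    assert m_prime <= len(complement), "m' cannot exceed |S| - |A|"
    return set(random.sample(sorted(complement), m_prime))

S = set(range(10))
A = {1, 3, 5}
A_prime = negative_selection(S, A, 4)
print(A_prime, A_prime.isdisjoint(A))  # A' is always disjoint from A
```

Because the draw is random, repeated calls generally return different sets $A'$ for the same $A$, which is exactly why $f$ is a random mapping rather than a function.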
Let $n = |S|$, where $|\cdot|$ returns the number of elements in a set. Let $m = |A|$ and $m' = |A'|$.

Remark 3.
(1) If $m' = n - m$, then $A' = \bar{A}$ and $A = S \setminus A'$ is fully determined, i.e., $H(A \mid A') = 0$, where $H(\cdot)$ is an entropy function that represents the entropy of a set.
(2) If $m' = 0$, then $H(A \mid A') = H(A)$.
(3) Because $A' \subseteq \bar{A}$, $|A'| \le |\bar{A}| = n - m$. Thus, $0 \le m' \le n - m$.

Suppose $S$ is public, $A'$ is public, and $A$ is private. We afterward intend to explore the privacy leakage of $A$ when given $A'$; that is, we want to quantitatively measure the privacy leakage in terms of the main parameters $m$ and $m'$.

Definition 4 (privacy of $A$ given $A'$). It is defined as the conditional probability $\Pr[A \mid A']$ that denotes the probability of successfully guessing the set $A$ upon given $A'$.

Let $C_n^m$ be the combination counts of selecting $m$ elements from $n$ elements.

Proposition 5. $\Pr[A] = 1/C_n^m$, where $\Pr[A]$ denotes the probability of successfully guessing the set $A$ without viewing $A'$.

Proof (straightforward). The combination counts for selecting $m$ elements from the $n$ elements of $S$ are $C_n^m$, and $A$ is exactly one of these combinations. Thus, $\Pr[A] = 1/C_n^m$.

Proposition 6. $\Pr[A \mid A'] = 1/C_{n-m'}^{m}$.

Proof (straightforward). Given $A'$, the adversary knows $A \cap A' = \emptyset$, so $A$ must be selected from the $n - m'$ elements of $S \setminus A'$. The combination counts for selecting $m$ elements from $n - m'$ elements are $C_{n-m'}^{m}$, and $A$ is exactly one of these combinations. Thus, $\Pr[A \mid A'] = 1/C_{n-m'}^{m}$.

Proposition 7. The privacy leakage of $A$ upon given $A'$ is denoted as $L$, and $L = \Pr[A \mid A'] - \Pr[A] = 1/C_{n-m'}^{m} - 1/C_n^m$.

Proof. It follows directly from Propositions 5 and 6; the gap between the two probabilities is the leakage.

Remark 8.
(1) $L$ grows with the decreasing $n - m'$. Recall Remark 3(3), $m' \le n - m$. Therefore, it is better for privacy when $n - m'$ is larger, i.e., when fewer negative elements are exposed.
(2) Whether $L$ grows or not with the increasing $m$ depends on the decreasing gap between $1/C_{n-m'}^{m}$ and $1/C_n^m$. Recall Remark 3(1): when $m' = n - m$, the leakage is maximal.

Proposition 9. $L > 0$. That is, privacy leakage cannot be averted.

Proof. As $m' \ge 1$, we have $n - m' < n$ and thus $C_{n-m'}^{m} < C_n^m$. Therefore, $1/C_{n-m'}^{m} > 1/C_n^m$, and $L > 0$.
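As a quick numeric check of Propositions 5-9, the following Python snippet computes the two guessing probabilities and the leakage with math.comb; the concrete values n = 60, m = 10, m' = 20 are illustrative, chosen to echo the set sizes used in Section 5:

```python
from math import comb

def guess_prob(n: int, m: int, m_prime: int = 0) -> float:
    """Probability of guessing the exact set A (|A| = m) out of the
    n - m' candidates left after removing A' (Propositions 5 and 6)."""
    return 1 / comb(n - m_prime, m)

n, m, m_prime = 60, 10, 20
p_before = guess_prob(n, m)           # Pr[A]      = 1 / C(n, m)
p_after = guess_prob(n, m, m_prime)   # Pr[A | A'] = 1 / C(n - m', m)
leakage = p_after - p_before          # Proposition 7; always positive (Proposition 9)
print(p_before, p_after, leakage)
```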

4. Advanced Discussion

4.1. Simplified Estimation

Next, we want to give a simplified estimation for the leakage.

Suppose, as before, that $n = |S|$, $m = |A|$, and $m' = |A'|$.

Proposition 10. $\Pr_1 = m/n$, where $\Pr_1$ is the probability that an adversary successfully guesses an element of $A$.

Proof (straightforward). The number of elements in $A$ over the number of elements in $S$ is $m/n$.

Proposition 11. $\Pr_2 = m/(n - m')$, where $\Pr_2$ is the probability that an adversary successfully guesses an element of $A$ after viewing $A'$.

Proof. After viewing $A'$, the adversary guesses among the $n - m'$ elements of $S \setminus A'$, all $m$ elements of $A$ being among them. Thus, $\Pr_2 = m/(n - m')$.

Proposition 12. $\Pr_2 - \Pr_1 = mm'/(n(n - m')) > 0$.

Proof. $\Pr_2 - \Pr_1 = m/(n - m') - m/n = (mn - m(n - m'))/(n(n - m')) = mm'/(n(n - m'))$.
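A one-off numeric check of Propositions 10-12, again with the illustrative values n = 60, m = 10, m' = 20:

```python
n, m, m_prime = 60, 10, 20

pr1 = m / n                               # Proposition 10: guess without A'
pr2 = m / (n - m_prime)                   # Proposition 11: guess after viewing A'
gap = m * m_prime / (n * (n - m_prime))   # Proposition 12: closed-form gap

print(pr1, pr2, pr2 - pr1, gap)  # pr2 - pr1 equals gap: 0.0833...
```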

4.2. Interleaved Impacts of Multiple Negative Information

Next, we will discuss the impact of multiple executions of $f$.

If $f$ is executed $k$ times, it outputs $A'_1, A'_2, \ldots, A'_k$. Let $A^{*} = \bigcup_{i=1}^{k} A'_i$ and $m^{*} = |A^{*}|$.

Proposition 13. $\Pr_3 = m/(n - m^{*})$, where $\Pr_3$ is the probability that an adversary successfully guesses an element of $A$ after viewing all $A'_i$.

Proof. After viewing all $A'_i$, the adversary guesses among the $n - m^{*}$ elements of $S \setminus A^{*}$, all $m$ elements of $A$ being among them. Thus, $\Pr_3 = m/(n - m^{*})$.

Remark 14.
(1) As $m^{*} \le n - m$, the upper bound of $\Pr_3$ is $m/(n - (n - m)) = 1$.
(2) The worst case is usually the concern, i.e., $m^{*} = n - m$, where $\Pr_3 = 1$.
(3) In the worst case, the probability grows with the increasing of $k$, since each additional run can only enlarge $A^{*}$.
(4) From the viewpoint of information recovery, the private collection of all $A'_i$ can recover $A$ with expected probability $\Pr_3$.
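The following Python sketch simulates the union effect described in Remark 14; the helper name and parameter values are illustrative assumptions:

```python
import random

def union_of_runs(S: set, A: set, m_prime: int, k: int) -> set:
    """Union of k independent runs of f, each exposing m' elements
    of the complement of A."""
    complement = sorted(S - A)
    union = set()
    for _ in range(k):
        union |= set(random.sample(complement, m_prime))
    return union

S, A = set(range(60)), set(range(10))     # n = 60, m = 10
for k in (1, 3, 10):
    m_star = len(union_of_runs(S, A, m_prime=20, k=k))
    print(k, len(A) / (len(S) - m_star))  # Pr3 = m / (n - m*)
```

As $k$ grows, the union tends to cover the whole complement, and the printed probability approaches 1, matching the worst case of Remark 14.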

4.3. Correlations in Mixing Selection

In this section, we will discuss the implicit impact of other negative information on designated information.

Let $a \subseteq A$ be the designated information and $m_a = |a|$. Let $A' = f(A)$ with $m' = |A'|$.

Proposition 15. $\Pr[x \in a \mid A'] = m_a/(n - m') > m_a/n$, where $x$ denotes the adversary's guess of an element of $a$.

Proof. $a \subseteq A$ and $A \cap A' = \emptyset$. Thus, $a \cap A' = \emptyset$ and $a \subseteq S \setminus A'$. Thus, $\Pr[x \in a \mid A'] = m_a/(n - m') > m_a/n$.
It means that the openness of $A'$ will damage the privacy of $a$.
Similarly, we have the following result.

Proposition 16. $\Pr[x \in a \mid A'_1, \ldots, A'_k] = m_a/(n - m^{*})$.

Proof. $a \subseteq A$, and $A \cap A'_i = \emptyset$ for each $i$. Thus, $a \cap A^{*} = \emptyset$ and $a \subseteq S \setminus A^{*}$. Thus, $\Pr[x \in a \mid A'_1, \ldots, A'_k] = m_a/(n - m^{*})$.

Remark 17.
(1) The addition of the openness of each $A'_i$ increases the probability of the leakage on an element in $a$.
(2) A sufficient number of $A'_i$ can recover an element of $a$.

5. Experimental Evaluation

5.1. Experiment Setup

In this section, we conduct attacker simulation experiments under different parameters. We assume that the attacker can launch brute force cracking attacks, a powerful way to guess the right set $a$ in $A$. As shown in Figure 2, attackers can guess either from the whole set $S$ or from the set $S - A'$. In the first situation, attackers do not know $A'$ and can only guess $a$ from $S$, while in the second situation, attackers can guess $a$ from $S - A'$. Attackers can make a query of $a$ to check whether their guess is right. If the guessed $a$ equals the real $a$, the attack succeeds.

The experiments are implemented as a Python project on Windows 10, and the main function is the random sample function used for combination calculation. The sizes of $S$ are 30, 48, and 60, which are flexible and customizable. As we need to calculate $C_n^m$, setting $n$ to 60 in our experiment already yields a very large number of combinations of possible subsets chosen from $S$.
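A minimal version of this simulation, under assumed small parameters so that full enumeration stays cheap (the helper name, set sizes, and ratios are illustrative, not the paper's exact configuration):

```python
import random
from itertools import combinations

def brute_force_guesses(S: set, a: set, A_prime: set) -> int:
    """Number of trials before a brute-force attacker, enumerating the
    |a|-subsets of the visible pool in random order, hits the true set a."""
    pool = sorted(S - A_prime)
    candidates = list(combinations(pool, len(a)))
    random.shuffle(candidates)
    return next(i for i, c in enumerate(candidates, 1) if set(c) == a)

S = set(range(30))                               # full set, n = 30
A = set(random.sample(sorted(S), 12))            # cover set A
a = set(random.sample(sorted(A), 4))             # true information a, a subset of A
A_prime = set(random.sample(sorted(S - A), 9))   # leaked negative info A', subset of S - A

g_full = brute_force_guesses(S, a, set())        # attacker guesses over S
g_leak = brute_force_guesses(S, a, A_prime)      # attacker guesses over S - A'
print(g_full, g_leak, 100 * (g_full - g_leak) / g_full)
```

A single trial is noisy, so the drop percentage printed here should be averaged over many repetitions, as is done for the results reported in Figures 3-5.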

5.2. Simulation Time

We calculate the time consumption when an attacker cracks the information successfully under different parameters. When the size of $S$ is 30, 48, or 60, brute force attacks take hours or days to crack the information, while after the leakage, we can see a clear decrease in the time consumed for information guessing. For example, when no data are leaked, it may take around 4 days to crack the user information. When half of the information is leaked and the size of $S$ is 48, the average cracking time is 17.3 s, 62.4 s, and 174.2 s for the different values of the ratio of $a$ to $A$. Such short cracking times indicate the low security of a system facing such leakage.

5.3. Guess Times Comparison

We first calculate the attacker's guess times under two sets, namely, $S$ and $S - A'$, and then calculate the drop percentage. Besides, we vary the ratio of $A$ to $A'$ and the ratio of $a$ to $A$, so as to see the effect of these two parameters on the drop percentage.

The results of the drop percentage of brute force attack times under $S$ and $S - A'$ are shown in Figures 3-5. The drop percentage is the difference between the guess times under $S$ and the guess times under $S - A'$, divided by the guess times under $S$. Note that a large drop percentage indicates low security, which means that once $A'$ is exposed to the attackers, the set $a$ is in danger.

In Figure 3, we can see that the drop percentage decreases when the ratio of $A$ to $A'$ increases. Besides, the drop percentage is highest when the ratio of $a$ to $A$ is highest as well. When the ratio of $A$ to $A'$ is 1:2, the drop percentages are 97%, 99.8%, and 99%, respectively. When it is 1:1, the drop percentages are 88.7%, 97.8%, and 98%. When it is 2:1, the drop percentages are 71.9%, 89%, and 97.8%. These values are still too high: an exposure of over 30% of set $S$ is dangerous. Therefore, we cannot expose much information, and we also need to mix the set $a$ into a large set $A$.

In Figure 4, we can draw the same conclusion that the drop percentage decreases as the ratio of $A$ to $A'$ increases, since $A'$ accounts for a smaller part of $S$. When the ratio is 1:2, the drop percentages are all around 99%. When it is 1:1, the drop percentages are 88.2%, 98.8%, and 99%. When it is 2:1, the drop percentages are 58.1%, 84%, and 94.3%. Notably, when the ratio of $A$ to $A'$ is 2:1 and the ratio of $a$ to $A$ is smallest, the drop percentage decreases to 58%, which means we can find a boundary that keeps the set $a$ hard to guess. As long as the ratio of $A$ to $A'$ is high and the ratio of $a$ to $A$ is small to a certain degree, we can protect the set $a$ better.

In Figure 5, when the ratio of $A$ to $A'$ is 1:2, the drop percentages are all around 99% under the different ratios of $a$ to $A$. When it is 1:1, the drop percentages are 82.2%, 88.9%, and 99%. When it is 2:1, the drop percentages are 47.9%, 73%, and 97%. In conclusion, the drop in guessing effort is smaller when fewer elements of $\bar{A}$ are exposed to attackers, and we can protect the information by increasing the ratio of $A$ to $A'$ and decreasing the ratio of $a$ to $A$. Besides, the possibility of guessing the exact $a$ in set $A$ is lower with a larger $A$, as the attackers must conduct more combination operations to guess the original information. Therefore, the final suggestion is to keep a large database and guarantee the lowest possible ratio of data exposure.

6. Conclusion

In this paper, we formally analyzed the bounds of negative information selection, an important approach for privacy protection that replaces original information with negative information. We proved bounds on privacy leakage, probability estimation, the implications of multiple negative information, and correlation implications. The experimental findings reveal that a data leakage exceeding 30% leads to a dangerous situation. Also, the leakage possibility shows minimal correlation with the number of elements in the full set but is influenced by the ratio of the leaked information. Future work will focus on the system application of the proposed method.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The research was financially supported by the Guangdong Basic and Applied Basic Research Foundation “Research on Key Technologies of Intelligent Interconnection of Industrial Internet” (2022B1515120054), the State Key Laboratory of Geo-Information Engineering and Key Laboratory of Surveying and Mapping Science and Geospatial Information Technology of MNR, CASM (2023-04-04), the Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University, and also the Fundamental Research Funds for the Central Universities (SCU2023D008).