Abstract
When differential privacy is used to publish high-dimensional data, the large number of dimensions leads to greater noise. High-dimensional binary data are especially easy to drown in excessive noise. Most existing methods cannot handle real high-dimensional data appropriately because of their high time complexity. In response to these problems, we propose PrivABN, a differentially private adaptive Bayesian network algorithm for publishing high-dimensional binary data. PrivABN uses a new greedy algorithm to accelerate the construction of Bayesian networks, which reduces the time complexity compared with the GreedyBayes algorithm. In addition, it uses an adaptive algorithm to adjust the network structure and the Exponential mechanism of differential privacy to preserve privacy, so as to generate a high-quality protected Bayesian network. We then use the Bayesian network to calculate noisy conditional distributions and generate a synthetic dataset for publication; this synthetic dataset satisfies differential privacy. Lastly, we carry out experiments on three real-life high-dimensional binary datasets to evaluate the performance of the method.
1. Introduction
With the continuous development of information technology, various data are continuously collected and stored in different information systems. In practice, people often encounter data such as medical records, market transactions, and travel trajectories. These data usually have hundreds or thousands of attribute dimensions, and some have even more. Because high-dimensional data usually contain a large amount of personal private information, releasing them may leak sensitive personal information. Measures are therefore needed to protect this information. However, protecting data privacy while ensuring data availability is very challenging, mainly because the publishing space grows exponentially as the attribute dimension increases.
The traditional privacy protection methods are mainly anonymity-based models [1, 2], $t$-closeness [3], and $l$-diversity [4]. However, all of these methods rely on specific attack assumptions and background knowledge, so they cannot be applied to general scenarios. By contrast, techniques such as differential privacy [5] and random perturbation [6, 7] do not need to make a series of assumptions about the attacker and can be applied universally to various problem scenarios. Therefore, in recent years, these technologies, especially differential privacy, have received increasing attention. Differential privacy is a typical data perturbation technique that perturbs information by adding noise drawn from a specific distribution to the data. The perturbed data still retain the original statistical characteristics, but the attacker cannot reconstruct the real original data.
Many high-dimensional data publishing methods based on differential privacy are available, but they solve the problem only to a certain extent, and several problems remain. First, these methods usually deal with the curse of dimensionality by forcibly converting high-dimensional data into low-dimensional data, which causes serious information loss.
Second, the time complexity of these methods is generally too high. Although in theory they can handle data of any dimension, in practice they can handle only low-dimensional data, because the running time becomes too long to meet the needs of higher-dimensional data.
Third, although most existing methods can handle high-dimensional binary data, they tend to add too much noise, so that the data are completely covered and the accuracy of publishing suffers.
To address the challenges above, we propose the PrivABN algorithm, a publishing method for high-dimensional binary data. Our main contributions are as follows:
(1) Instead of directly adding noise to the data, we use the Bayesian network method to avoid the impact of the curse of dimensionality. In this way, the growth of global sensitivity with the attribute dimension is avoided.
(2) To reduce the total time complexity of the algorithm and enable it to process real high-dimensional data, we propose a construction algorithm, ABN, that combines a greedy algorithm, an adaptive algorithm, and the Exponential mechanism of differential privacy.
(3) We propose a synthetic data generation algorithm, SDG, that exploits the characteristics of binary data and the topological order of the Bayesian network. This algorithm reduces the magnitude of the added noise and prevents excessive noise from covering the actual values.
3. Related Work
Many differentially private publishing methods for high-dimensional data are available. For the privacy protection of high-dimensional binary data, Qardaji et al. [8] proposed the PriView method. This method assumes that all attributes are independent of each other and then answers user queries by constructing a set of low-dimensional noisy views, thereby reducing the impact of the curse of dimensionality. Zhang et al. [9, 10] proposed the PrivBayes method for the private release of high-dimensional data. This method assumes that the attributes are correlated and constructs a Bayesian network over the attributes of the dataset with a greedy algorithm. The Bayesian network is then used to calculate the noisy joint distribution among attributes, and this joint distribution is used to generate a synthetic dataset for release. Based on PrivBayes, a series of derivative methods, such as weighted PrivBayes [11], JTree [12], PrivHD [13], and PrivMN [14], have been proposed one after another. Wang et al. [11] devised a method for calculating attribute weights. They argue that attributes differ in importance, so the more important attributes are chosen first when building the Bayesian network. Chen et al. [12] used sparse vector sampling techniques to explore the relationships among attributes and then used these relationships to build a Markov network (an undirected probabilistic graphical model). Based on the Markov network, the junction tree algorithm is used to accelerate the computation of the joint distribution, and differential privacy is achieved by adding Laplace noise to the joint distribution. Subsequently, Zhang et al. [13] introduced high-pass filtering into the JTree method to accelerate the construction of the Markov network. Finally, Wei et al. [14] proposed the PrivMN method, which is also based on the Markov network, to compute the joint distribution among attributes. The difference is that approximate inference is used to calculate the joint distribution.
The analysis above suggests that most existing methods focus on constructing a better Bayesian or Markov network in order to obtain a more precise joint distribution and reduce publishing errors. However, these methods have high time complexity, which makes it impossible to process real high-dimensional data in practical applications. Moreover, because the degree of the network is limited, the constructed Bayesian network still cannot reflect the true distribution well. Therefore, this study proposes the PrivABN algorithm to solve these problems.
4. Theorems and Definition
4.1. Differential Privacy
Definition 1 (differential privacy). Let $D_1$ and $D_2$ be two neighboring datasets, i.e., $D_1$ and $D_2$ differ in only one record. A randomized mechanism $M$ satisfies $\varepsilon$-differential privacy if, for every possible output $O$,
$$\Pr[M(D_1) = O] \le e^{\varepsilon} \cdot \Pr[M(D_2) = O], \tag{1}$$
where $\varepsilon$ is the privacy budget; the smaller the privacy budget is, the higher the degree of privacy protection will be. $\Pr[M(D_1) = O]$ and $\Pr[M(D_2) = O]$ represent the probability that the algorithm outputs $O$ on the datasets $D_1$ and $D_2$, respectively.
Generally, the Laplace mechanism [15] and the Exponential mechanism [16] can be used to realize differential privacy. Both mechanisms perturb the value or the selection made on the original data by introducing randomness. The magnitude of the generated noise is related to the global sensitivity of the query function.
Definition 2 (global sensitivity (see [15])). Let $f$ be the query function. The global sensitivity of $f$ is defined as
$$\Delta f = \max_{D_1, D_2} \lVert f(D_1) - f(D_2) \rVert_p, \tag{2}$$
where $D_1$ and $D_2$ are two neighboring datasets and $\lVert \cdot \rVert_p$ is the $p$-norm; the 1-norm is the most commonly used. Generally, the greater the global sensitivity is, the greater the noise generated by the mechanism and its impact on the algorithm results will be.
Theorem 1 (Laplace mechanism (see [15])). Let $f$ be the query function. A randomized mechanism $M$ satisfies $\varepsilon$-differential privacy if
$$M(D) = f(D) + \eta, \tag{3}$$
where $\eta$ is a noise variable that follows the Laplace distribution $\mathrm{Lap}(\lambda)$ with scale $\lambda = \Delta f / \varepsilon$. Equation (3) shows that the larger the privacy budget or the smaller the global sensitivity is, the smaller the generated noise will be.
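The Laplace mechanism can be illustrated with a minimal Python sketch. It assumes a scalar query whose global sensitivity is known, and it draws Laplace noise as the difference of two exponential samples; the function names `laplace_noise` and `laplace_mechanism` are illustrative and do not come from the paper.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one zero-mean Laplace sample with the given scale.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace(0, scale) distribution.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Perturb a scalar query answer with noise of scale sensitivity / epsilon."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

if __name__ == "__main__":
    # A count query has sensitivity 1; a smaller budget gives a noisier answer.
    for eps in (0.1, 1.0):
        print(eps, laplace_mechanism(100.0, 1.0, eps))
```

A larger privacy budget or a smaller sensitivity shrinks the noise scale, which is the behavior the theorem describes.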
Theorem 2 (Exponential mechanism (see [16])). Let the score function $q(D, r)$ denote the score of output $r$ on dataset $D$. A randomized algorithm $M$ satisfies $\varepsilon$-differential privacy if
$$\Pr[M(D) = r] \propto \exp\left(\frac{\varepsilon \, q(D, r)}{2 \Delta q}\right), \tag{4}$$
where $\Delta q$ is the global sensitivity of the score function $q$. This formula means that every candidate output $r$ of algorithm $M$ has some probability of being selected, and the higher the score of $r$ is, the greater its probability of being selected will be.
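The selection rule of the Exponential mechanism can be sketched in Python as follows. This is a minimal illustration that assumes a finite candidate list and a caller-supplied score function; the name `exponential_mechanism` and its parameters are illustrative.

```python
import math
import random

def exponential_mechanism(candidates, score, sensitivity, epsilon, rng=random):
    """Pick one candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    scores = [score(c) for c in candidates]
    best = max(scores)
    # Subtracting the best score leaves the probabilities unchanged
    # but keeps the exponentials from overflowing.
    weights = [math.exp(epsilon * (s - best) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]

if __name__ == "__main__":
    toy_scores = {"a": 0.1, "b": 0.7, "c": 0.4}  # made-up scores for illustration
    print(exponential_mechanism(list(toy_scores), toy_scores.get, 1.0, 1.0))
```

With a very small budget the choice is nearly uniform, and with a large budget the highest-scoring candidate is chosen almost always.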
In addition, when designing an algorithm and proving that it satisfies differential privacy, an important composition property of differential privacy is needed.
Property 1 (sequential composition (see [17])). Given a dataset $D$ and a set of differential privacy algorithms $M_1, M_2, \ldots, M_m$, where algorithm $M_i$ satisfies $\varepsilon_i$-differential privacy and the random processes of any two algorithms are independent of each other, the combination of these algorithms on $D$ satisfies $\left(\sum_{i=1}^{m} \varepsilon_i\right)$-differential privacy.
4.2. Bayesian Network
A Bayesian network is a probabilistic graphical model that is mainly used to explore the relationships among a group of objects. A Bayesian network is usually represented by a directed acyclic graph. The nodes in the graph represent objects, and the edges represent relationships.
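In general, given a set of attributes $A = \{X_1, X_2, \ldots, X_d\}$, its joint distribution can be expressed as
$$\Pr[A] = \Pr[X_1, X_2, \ldots, X_d] = \Pr[X_1] \cdot \Pr[X_2 \mid X_1] \cdots \Pr[X_d \mid X_1, \ldots, X_{d-1}]. \tag{5}$$
Through the Bayesian network $\mathcal{N}$ constructed over the attribute set, its joint distribution can be approximated as
$$\Pr_{\mathcal{N}}[A] = \prod_{i=1}^{d} \Pr[X_i \mid \Pi_i], \tag{6}$$
where $\Pi_i$ is the parent node set of node $X_i$. If the constructed Bayesian network represents the relationships between attributes well, then $\Pr_{\mathcal{N}}[A] \approx \Pr[A]$.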
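As a small worked illustration of this factorization, the Python sketch below evaluates the joint probability of a hypothetical three-attribute binary network. The structure and the probability values are made up for the example and are not taken from the paper.

```python
from itertools import product

# Hypothetical binary network X1 -> X2 -> X3 (structure and numbers are invented).
parents = {"X1": (), "X2": ("X1",), "X3": ("X2",)}
# cpt[attr][parent values] = Pr[attr = 1 | parents = parent values]
cpt = {
    "X1": {(): 0.3},
    "X2": {(0,): 0.2, (1,): 0.9},
    "X3": {(0,): 0.5, (1,): 0.1},
}

def joint_probability(assignment):
    """Multiply one conditional probability per attribute, given its parents."""
    prob = 1.0
    for attr, parent_names in parents.items():
        p_one = cpt[attr][tuple(assignment[p] for p in parent_names)]
        prob *= p_one if assignment[attr] == 1 else 1.0 - p_one
    return prob

# The factorized probabilities of all 2^3 assignments sum to 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
print(total)
```

Only one small table per attribute is stored, instead of one entry for every combination of all attribute values.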
Therefore, how to build a better Bayesian network is important.
4.3. Conditional Entropy
Conditional entropy can be used to measure the degree of interdependence among attributes. The larger the value, the higher the degree of dependence between attributes.
Definition 3 (conditional entropy). Giving two discrete random variables and , the conditional entropy between them iswhere is the joint distribution probability value of and .
Equation (7) shows that when , there is , that is, variables and are close to independent of each other.
5. The PrivABN Algorithm
In Table 1, the meanings of the commonly used symbols in this section are provided, and the other symbols are explained when used.
5.1. Differential Privacy Bayesian Network Algorithm
Zhang et al. [9] proposed GreedyBayes, a conditional-entropy-based algorithm for constructing a $k$-degree Bayesian network, as part of PrivBayes. The main idea of this algorithm is to select, in each round, the attribute and parent set pair with the largest conditional entropy and add it to the current Bayesian network.
The GreedyBayes algorithm is a common algorithm used to construct Bayesian networks, and its implementation is shown in Algorithm 1.
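Since the pseudocode is not reproduced here, the following Python sketch shows one non-private reading of the greedy construction. It assumes binary records stored as tuples, uses empirical mutual information as the score, and takes the true maximum where a private version would sample. The names `mutual_information` and `greedy_bayes` are illustrative, and the sketch is not the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(rows, x, parent_set):
    """Empirical mutual information (in bits) between attribute x and a parent set."""
    n = len(rows)
    joint = Counter((row[x], tuple(row[p] for p in parent_set)) for row in rows)
    px = Counter(row[x] for row in rows)
    ppa = Counter(tuple(row[p] for p in parent_set) for row in rows)
    return sum(
        (c / n) * math.log2(c * n / (px[xv] * ppa[pv]))
        for (xv, pv), c in joint.items()
    )

def greedy_bayes(rows, num_attrs, k, first=0):
    """Build a list of (attribute, parent set) pairs with at most k parents each."""
    network = [(first, ())]
    chosen = [first]
    remaining = [a for a in range(num_attrs) if a != first]
    while remaining:
        size = min(k, len(chosen))
        # Brute force: every remaining attribute with every size-limited parent set.
        best = max(
            ((x, pa) for x in remaining for pa in combinations(chosen, size)),
            key=lambda pair: mutual_information(rows, pair[0], pair[1]),
        )
        network.append(best)
        chosen.append(best[0])
        remaining.remove(best[0])
    return network

if __name__ == "__main__":
    # Toy data: attribute 1 copies attribute 0, attribute 2 is independent.
    data = [(a, a, b) for a in (0, 1) for b in (0, 1)] * 10
    print(greedy_bayes(data, 3, 1))
```

The nested enumeration over `combinations(chosen, size)` is the source of the high running time discussed next.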

Considering that Zhang et al. [9, 10] did not provide the time complexity formula of the algorithm in the article, this study demonstrates the time complexity of the GreedyBayes algorithm.
Theorem 3. The time complexity of the GreedyBayes algorithm is .
Proof. The time consumption of the GreedyBayes algorithm is mainly concentrated in the for loop of Step 3. The for loop is executed a total of times, and each time pairs of are generated. Therefore, a total of pairs will be generated in the whole process. In the worst case, when, any holds forFor each pair, the conditional entropy size needs to be calculated, and each calculation takes time. Therefore, the total time complexity of the algorithm is .
Evidently, due to the influence of time complexity, the GreedyBayes algorithm can only be applied when the number of attributes is small or the maximum degree of the Bayesian network is small.
To process real highdimensional data, a more complex Bayesian network is constructed. This paper proposes a simple and efficient Bayesian network construction algorithm ABN. This algorithm only needs the time complexity of to construct a complete Bayesian network.
To illustrate the time advantage of the ABN algorithm more intuitively, we take a dataset containing 50 attributes as an example. When the maximum degree of the Bayesian network is 5, 10, 15, 20, and 25, the numbers of pairs enumerated by the GreedyBayes and ABN algorithms are shown in Table 2.
Assume that the computer can calculate the mutual information of 10,000 pairs per second. When the maximum degree is $k = 10$, GreedyBayes already needs 184 days of uninterrupted processing to finish. When $k = 25$, the task is practically unsolvable, because it requires a total of 728 years and 11 days of uninterrupted processing. By contrast, the ABN algorithm finds a relatively good pair regardless of the value of $k$, and it takes only 250 evaluations. This is very valuable in practical applications.
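The running times quoted above can be checked with a short Python sketch. It assumes that, when some attributes are already in the network, GreedyBayes enumerates every remaining attribute together with every parent set of size min(k, number of attributes already chosen), and that a year has 365 days. The function name `greedy_bayes_pair_count` is illustrative.

```python
from math import comb

def greedy_bayes_pair_count(d, k):
    """Count the (attribute, parent set) pairs enumerated over all rounds."""
    total = 0
    for chosen in range(1, d):      # attributes already in the network
        remaining = d - chosen      # attributes still to be added
        total += remaining * comb(chosen, min(k, chosen))
    return total

PAIRS_PER_SECOND = 10_000
for k in (5, 10, 15, 20, 25):
    seconds = greedy_bayes_pair_count(50, k) / PAIRS_PER_SECOND
    print(k, f"{seconds / 86_400:.1f} days", f"{seconds / (365 * 86_400):.2f} years")
```

Under these assumptions, a maximum degree of 10 gives roughly 184 days and a maximum degree of 25 gives roughly 728 years, which agrees with the figures quoted above.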
The specific implementation of the ABN algorithm is shown in Algorithm 2.

In addition to solving the problem of the GreedyBayes’s execution efficiency, the ABN algorithm also introduces an Exponential mechanism of differential privacy to disturb the Bayesian network construction process to solve privacy leakage caused by the Bayesian network.
The algorithm flow shows that ABN removes the limit on the maximum degree of the Bayesian network and instead selects the in-degree of each node adaptively, which makes the network structure more complex and diverse. The resulting network contains more information, and the synthetic data generated from the network are more likely to match the real data.
The main difference between the ABN algorithm and the GreedyBayes algorithm is that ABN uses a greedy algorithm, GParentSet, to find the best parent attribute set for each in-degree of the current attribute. By contrast, GreedyBayes enumerates all parent attribute sets by brute force every time it searches for the best one, and in doing so performs many repeated and useless calculations. ABN therefore adopts memoization, which suits the optimal substructure of the problem: the parent set found for one in-degree is reused when searching for the next. While preserving the quality of the solution, this greatly reduces repeated and useless calculation and speeds up the construction of the Bayesian network.
The specific implementation of the GParentSet algorithm is shown in Algorithm 3.

In this algorithm, is a set of attributes, which indicates that the current attribute has selected attributes as the best choice of parent attributes. In fact, when the GParentSet algorithm is executed in the new round, has recorded the optimal parent set selection in this state in the previous round. Therefore, for this round, we only need to care about the selection of the newly added parent attribute in the previous round.
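One possible reading of this incremental search is sketched below in Python. It assumes that the best parent set for in-degree j is obtained by adding a single attribute to the set recorded for in-degree j - 1, and that the score is the empirical mutual information of binary records. The names `mutual_information` and `greedy_parent_sets` are illustrative, and the sketch is not the authors' GParentSet code.

```python
import math
from collections import Counter

def mutual_information(rows, x, parent_set):
    """Empirical mutual information (in bits) between attribute x and a parent set."""
    n = len(rows)
    joint = Counter((row[x], tuple(row[p] for p in parent_set)) for row in rows)
    px = Counter(row[x] for row in rows)
    ppa = Counter(tuple(row[p] for p in parent_set) for row in rows)
    return sum(
        (c / n) * math.log2(c * n / (px[xv] * ppa[pv]))
        for (xv, pv), c in joint.items()
    )

def greedy_parent_sets(rows, x, candidates, score=mutual_information):
    """Return one parent set of x for every in-degree 0..len(candidates)."""
    best_sets = [()]
    current = ()
    pool = list(candidates)
    while pool:
        # Reuse the set kept from the previous round; search only the new parent.
        added = max(pool, key=lambda a: score(rows, x, current + (a,)))
        current = current + (added,)
        pool.remove(added)
        best_sets.append(current)
    return best_sets

if __name__ == "__main__":
    # Toy data: attribute 2 copies attribute 0, attribute 1 is independent.
    data = [(a, b, a) for a in (0, 1) for b in (0, 1)] * 5
    print(greedy_parent_sets(data, 2, [0, 1]))
```

In this sketch the number of score evaluations per attribute grows quadratically with the number of candidate parents, rather than combinatorially as in brute-force enumeration.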
In Step 5 of the ABN algorithm, each candidate attribute and its best parent attribute set under every in-degree value are added to the candidate set. In Step 6, one pair is selected from this set through the Exponential mechanism of differential privacy. This process does not limit the maximum in-degree of the Bayesian network. Among all available in-degrees, those with greater conditional entropy are more likely to be selected. We therefore do not need to control the structure of the resulting Bayesian network: its construction is an adaptive process toward greater conditional entropy. This is more flexible and reasonable than the traditional way of constructing Bayesian networks, which requires a fixed maximum in-degree. Because the maximum in-degree is not fixed, the network adapts to each dataset and adjusts in the direction that makes the conditional entropy of the entire Bayesian network larger.
The ABN algorithm uses the Exponential mechanism of differential privacy in Step 6. The idea is to select, on a probabilistic basis, the pair to be added to the Bayesian network. We use conditional entropy as the scoring function of the Exponential mechanism. The greater the conditional entropy of a pair, the higher its scoring function value and the greater its probability of being selected.
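Zhang et al. [10] proved that the global sensitivity of conditional entropy on binary data is
$$\Delta = \frac{1}{n} \log_2 n + \frac{n-1}{n} \log_2 \frac{n}{n-1},$$
where $n$ is the number of records in the dataset.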
They further gave a scoring function with lower global sensitivity in the same article. However, the time required to calculate that scoring function is so high that it can be applied only when the attribute dimension is small. Therefore, we do not adopt it and still use conditional entropy as the scoring function.
Step 6 will be executed times, each time the Exponential mechanism is used to select a pair from to join the Bayesian network. From Property 1, we know that each choice needs to consume a part of the privacy budget. Here, we use the average division method to allocate the privacy budget, that is, is divided equally into shares. Then, combined with the Exponential mechanism, we can obtain the expression of as
Finally, we prove that the ABN algorithm satisfies differential privacy. The final output of the algorithm is a Bayesian network. In Step 6, the Exponential mechanism is used to select the attribute node to be added and its parent attribute set. This operation perturbs the construction of the Bayesian network, and according to Theorem 2, this step satisfies differential privacy. Moreover, because no other operation involves the original dataset, the entire ABN algorithm satisfies differential privacy.
5.2. Differential Privacy Synthetic Data Release
The Bayesian network can simplify the calculation of the joint distribution between attributes to a large extent, and the better the Bayesian network is, the closer the joint distribution is to the true value. However, if the Bayesian network is directly used to calculate the joint distribution between attributes, it may still cause privacy leakage. Therefore, we need to perturb the calculated joint distribution further to achieve the purpose of protecting privacy.
Zhang et al. [9] used the NoisyConditionals algorithm to compute the distributions of a Bayesian network privately. This algorithm adds Laplace noise to the joint distribution to obtain a noisy joint distribution. Although this guarantees that the resulting joint distribution satisfies differential privacy, the joint distribution may become very sparse as the attribute dimension increases, and the original probability values will generally be small. If Laplace noise is added directly to these small probability values, the noise may completely cover the true values and seriously affect the accuracy of the release.
Therefore, we do not directly use a joint distribution as the NoisyConditionals algorithm does. Instead, we exploit the fact that binary data take only the values 0 and 1 and use conditional distributions to generate synthetic data. The main reason is that, for a binary attribute $X_i$ with parent set $\Pi_i$, the conditional distribution satisfies
$$\Pr[X_i = 0 \mid \Pi_i] + \Pr[X_i = 1 \mid \Pi_i] = 1.$$
According to this equation, at least one of the conditional probabilities $\Pr[X_i = 0 \mid \Pi_i]$ and $\Pr[X_i = 1 \mid \Pi_i]$ is relatively large, namely at least 1/2. Hence, if noise is added directly to the conditional probabilities, at least the larger one will not be covered by the noise. Even if the other conditional probability is completely covered by noise, the effect on the accuracy of the final release is minimal, because that probability must be very small or negligible relative to the larger one. Therefore, even after normalization, the two values still reflect the original distribution.
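The following Python sketch illustrates adding noise to one binary conditional distribution and normalizing the result. The noise scale is passed in as a parameter because its exact expression is not given here, and clamping negative values to zero before normalizing is an assumption of this sketch. The name `noisy_binary_conditional` is illustrative.

```python
import random

def noisy_binary_conditional(p_one, scale, rng=random):
    """Perturb Pr[X=0 | parents] and Pr[X=1 | parents], then renormalize.

    Returns the pair (noisy Pr[X=0 | parents], noisy Pr[X=1 | parents]).
    """
    noisy = []
    for p in (1.0 - p_one, p_one):
        # Difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy.append(max(p + noise, 0.0))   # assumption: clamp negatives to zero
    total = sum(noisy)
    if total == 0.0:                        # both entries were wiped out
        return 0.5, 0.5
    return noisy[0] / total, noisy[1] / total

if __name__ == "__main__":
    print(noisy_binary_conditional(0.9, 0.05))
```

Because the two probabilities sum to 1, the larger one is at least 0.5 and is likely to stay dominant after a small perturbation.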
Synthetic data must then be generated from the noisy distributions. The traditional method generates them directly from the joint distribution, but this suffers from the same data sparseness problem. This study therefore uses conditional distributions to generate synthetic data. To this end, we design a synthetic data generation algorithm, SDG. Its main idea is to use the conditional distribution of each node given its parent nodes to generate synthetic data in the topological order of the Bayesian network.
The specific implementation of the SDG algorithm is shown in Algorithm 4.
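The sampling idea can be sketched in Python as follows. The sketch assumes that the (noisy) conditional probabilities have already been computed and are stored as the probability that each attribute equals 1 given its parent values. The names `generate_synthetic_rows`, `order`, `parents`, and `cpt` are illustrative and do not come from the paper.

```python
import random

def generate_synthetic_rows(order, parents, cpt, num_rows, rng=random):
    """Sample binary records attribute by attribute in topological order.

    order   : attribute names, with parents always listed before children
    parents : parents[attr] is a tuple of parent attribute names
    cpt     : cpt[attr][parent values] = Pr[attr = 1 | parents = parent values]
    """
    rows = []
    for _ in range(num_rows):
        row = {}
        for attr in order:
            # The parents were sampled earlier, so their values are available.
            key = tuple(row[p] for p in parents[attr])
            row[attr] = 1 if rng.random() < cpt[attr][key] else 0
        rows.append(row)
    return rows

if __name__ == "__main__":
    # Hypothetical two-attribute network X1 -> X2 with invented probabilities.
    parents = {"X1": (), "X2": ("X1",)}
    cpt = {"X1": {(): 0.3}, "X2": {(0,): 0.2, (1,): 0.9}}
    print(generate_synthetic_rows(["X1", "X2"], parents, cpt, 5))
```

Sampling touches only the conditional tables, so no further access to the original records is needed once those tables are fixed.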

The synthetic data generated by the SDG algorithm can make the attacker unable to infer a specific record in the original dataset ; thus, it protects the personal privacy information in the data from being leaked.
Finally, we prove that the SDG algorithm satisfies differential privacy. The algorithm adds Laplace noise to each conditional distribution in Step 5 and obtains noisy conditional distributions, which are then used to generate synthetic data for release. According to Theorem 1, this process satisfies differential privacy. Moreover, because no subsequent step involves the original dataset, the entire SDG algorithm satisfies differential privacy.
5.3. Differential Privacy Adaptive Bayesian Network Algorithm
The PrivABN algorithm can be divided into two independent steps:
(1) Through the ABN algorithm, construct a Bayesian network that satisfies differential privacy for the original dataset, and extract the noisy conditional distributions from the Bayesian network.
(2) Through the SDG algorithm, generate a synthetic dataset for release according to the topological order of the Bayesian network and the noisy conditional distributions.
The specific implementation process of the PrivABN algorithm is shown in Algorithm 5.

In the PrivABN algorithm, the subalgorithms ABN and SDG involve the use of the original dataset . According to property 1, in order for the PrivABN algorithm to satisfy differential privacy, privacy budgets must be allocated for these two subalgorithms first. According to the analysis in the previous chapters of this paper, the privacy budget and will, respectively, correspond to the two noise distributions and . The first noise distribution aims to make the value of (with the nondependent ) as large as possible. The second noise distribution aims to make the value of as small as possible. By simplifying the two formulas above, we can obtain
From the simplified results, the proportions of the two privacy budgets that need to be allocated are roughly the same. Therefore, we adopt an even distribution strategy to allocate the privacy budget .
Finally, we prove that the PrivABN algorithm satisfies differential privacy. As shown above, the ABN algorithm and the SDG algorithm satisfy $\varepsilon_1$-differential privacy and $\varepsilon_2$-differential privacy, respectively. Apart from these two algorithms, no other part of the PrivABN algorithm involves the original dataset. Therefore, according to Property 1, the PrivABN algorithm satisfies $(\varepsilon_1 + \varepsilon_2)$-differential privacy, that is, $\varepsilon$-differential privacy.
6. Experiments
6.1. Experimental Environment
The experimental platform is a 4-core Intel i5-6300HQ CPU (2.3 GHz) with 8 GB of memory running the Windows 10 operating system, and the compilation environment is Dev-C++ 5.11. All methods are implemented in the C++ programming language, and the implementation of the Bayesian network refers to the experimental code of Zhang et al. [10].
6.2. Datasets
Our experiments use three real-world datasets: NLTCS, ACS, and Retail. NLTCS [18] is an American long-term care survey record that includes the daily life and medical conditions of 21,574 elderly disabled persons. ACS [19] is the global census data released by IPUMS-USA, which records 47,461 pieces of personal information. Retail [20] consists of 88,162 shopping records from the US retail market. Each record contains the items purchased in one transaction. The dataset covers a total of 16,469 categories of goods, from which we have retained the 50 best-selling ones. The specific information of these three datasets is shown in Table 3.
6.3. Evaluation
For each set of experiments, we will compare the error (mean error) between the generated synthetic dataset and the original dataset . Moreover, the error is caused by the same number of queries on these two datasets.
Definition 4 ( error). The error between the original dataset and the synthetic dataset iswhere is the value of the th row of the th experiment using the original dataset and is the value of the th row of the th experiment using the synthetic dataset. is the number of experiments, and is the number of rows of the dataset.
Definition 5 ( error). Let the edge table generate by the original dataset through the th query is , and the edge table generate by the synthetic dataset is ; the error between them iswhere is the value of the th row of the edge table generated by the th query, is the number of queries, and is the number of rows of the edge table.
6.4. Result Analysis
The first part of the experiments is to analyze the availability of the PrivABN method. To know the size of the noise generated by the PrivABN method in a noisefree environment, we set up the NoPrivABN method without differential privacy protection for comparison. We will conduct 100 random repeated experiments on NLTCS and ACS. We set the privacy budget to 0.001, 0.005, 0.01, 0.05, 0.1, and 1.0. The experimental results will be verified by 200 random queries, the values of which are 3, 7, and 12. We will test their performance on error and error. The experimental results are shown in Figure 1.
Figure 1 shows that the PrivABN algorithm only needs a very small privacy budget. When , it is very close to the effect of NoPrivABN, which shows that the PrivABN algorithm has high availability.
Moreover, as the privacy budget increases, the error generated by PrivABN gradually approaches the error of NoPrivABN, which is in line with the differential privacy law. This finding further verified the credibility of the experiment.
Another finding is that the PrivABN algorithm without differential privacy protection will produce certain errors. The reason is that the algorithm itself generates a synthetic dataset based on the Bayesian network and conditional distribution. The process of constructing the Bayesian network itself may produce certain errors, and the process of generating synthetic data through conditional distribution is probabilistic, thereby resulting in the production of certain errors. Therefore, taking the error of NoPrivABN as the lower bound is in line with the experimental standard.
Finally, the performance of PrivABN on different queries under the same dataset is observed. Their overall trend of change and their turning points are also the same. Their error sizes under different privacy budgets do not differ greatly, and they are basically the same within a certain error range. Therefore, we can consider that the stability of PrivABN in the face of different conditional parameters is a manifestation of the PrivABN’s remarkable robustness.
According to the errors between the generated synthetic dataset and the original dataset, PrivABN needs only a very small privacy budget to approach the effect of NoPrivABN. To locate this threshold more accurately, we further subdivide the privacy budget: it increases from 0.05 to 0.5 in increments of 0.025 and from 0.5 to 1.0 in increments of 0.1. We then run experiments with a single query setting on the NLTCS and ACS datasets. (The previous experiments indicate that PrivABN is robust and that its error differs little across queries, so only one query needs to be compared.) The experimental results are shown in Figure 2.
Figure 2 shows that when the privacy budget is 0.4 (NLTCS dataset) and 0.225 (ACS dataset), the error is lower than the error bar of 0.01. This finding shows that the PrivABN method needs to consume only a very small privacy budget to achieve good privacy protection. Therefore, we conclude that PrivABN has high availability.
From another point of view, the PrivABN algorithm only needs to give a small amount of privacy budget. It can achieve a good privacy protection effect and can greatly reduce the error of differential privacy protection. This is the reason for its high availability.
In the second part of the experiments, the performance of PrivABN on the real high-dimensional dataset Retail is analyzed. To assess the results better, PrivABN is compared with three methods, namely, PriView [8], PrivBayes [9], and JTree [12]. We conduct 100 repeated random experiments on the Retail dataset and set the privacy budget to 0.1 and 1.0. The experimental results are verified by 200 random queries with query dimensions of 4, 6, and 8. We test the performance of the methods in terms of error. The experimental results are shown in Figure 3.
Figure 3 shows that, under both privacy budgets, the PrivABN method performs better than the three other methods on the Retail dataset. When the privacy budget is small ($\varepsilon = 0.1$), PrivABN performs significantly better than the other three methods. When the privacy budget is large ($\varepsilon = 1.0$), PrivABN is still superior to the other three methods, but the gap to the JTree method is small. This is in line with the law of differential privacy: as the privacy budget increases, the degree of privacy protection declines until the result is consistent with that of the algorithm without differential privacy protection. Therefore, the probabilistic graph model built by the PrivABN method is itself more accurate than the model built by the JTree method, and much more accurate than the model built by the PrivBayes method. This further verifies that the improvement strategies proposed in this paper are effective.
Moreover, with the increase in query dimension, the variation range of error of the PrivABN method is significantly smaller than those of the three other methods. Therefore, we can further infer that the PrivABN method has higher availability and better robustness.
7. Conclusion
The private release of high-dimensional data has been a research hotspot and a challenge in the field of differential privacy. This study proposes an efficient and low-noise differentially private publishing method, called PrivABN, for high-dimensional binary data. The method uses the ABN algorithm to construct the Bayesian network over the dataset quickly and adaptively, while using the Exponential mechanism of differential privacy to protect the privacy of the Bayesian network during construction. Subsequently, the SDG algorithm uses the Laplace mechanism of differential privacy to extract noisy conditional distributions from the Bayesian network and then uses these conditional distributions and the Bayesian network topology to generate synthetic data for release. Experiments on three real datasets demonstrate that PrivABN achieves higher usability and robustness than existing methods.
Our future work will continue to focus on the differentially private publication of high-dimensional data. We will investigate the differentially private publication of high-dimensional nonbinary data and explore differentially private publication in a streaming high-dimensional data environment.
Data Availability
NLTCS is an American long-term care survey record that includes the daily life and medical conditions of 21,574 elderly disabled persons. ACS is the global census data released by IPUMS-USA, which records 47,461 pieces of personal information. Retail consists of 88,162 shopping records from the US retail market.
Conflicts of Interest
The authors declare that they have no conflicts of interest.