Abstract
With the arrival of the big data era, distributed data mining is predicted to drive a revolution in information technology. To motivate different institutes to collaborate with each other, the crucial issue is to eliminate their concerns regarding data privacy. In this paper, we propose a privacy-preserving method for training a restricted Boltzmann machine (RBM). With our method, the parties can obtain the RBM without revealing their private data to each other. We provide a correctness and efficiency analysis of our algorithms. The comparative experiment shows that the accuracy is very close to that of the original RBM model.
1. Introduction
With the rapid development of information technology and modern networks, huge amounts of personal data are generated every day, and people care deeply about maintaining their privacy. Therefore, there is a need to focus on developing privacy-preserving data mining algorithms. With the rapid growth of social networks like Facebook and LinkedIn, more and more research, such as advertising recommendation, will be based on personal data. In another scenario, doctors routinely collect patients’ personal information before diagnosing a disease or treating an illness. To prevent the leakage of such private data, the Health Insurance Portability and Accountability Act (HIPAA) has set up a series of regulations that protect the privacy of individually identifiable health information.
Data mining is an important interdisciplinary field of computer science and has been widely extended to the fields of bioinformatics, medicine, and social networks. For example, when a research institute wants to study DNA sequences and related genetic diseases, it needs to collect patients’ DNA data and apply data mining or machine learning algorithms to obtain a relevant model. However, if scientists from other institutes also want to use these DNA sequences, ensuring that the patients’ personal information is protected is an example of the problem at hand. In another scenario, some researchers want to combine personal data from Facebook and LinkedIn to undertake a study. However, neither company wants to reveal the personal information of its subscribers, and they especially do not want to give it to a competitor. Therefore, we propose a privacy-preserving machine learning method to ensure that individuals’ privacy is protected.
The restricted Boltzmann machine (RBM) [1] is increasingly being used in supervised and unsupervised learning scenarios, such as classification. It is a variant of the Boltzmann machine (BM), a type of stochastic recurrent neural network invented by Hinton and Sejnowski. It has been used to model windows of mel-cepstral coefficients that represent speech [2], bags of words that represent documents [3], and user ratings of movies [4].
In this paper, we propose a privacy-preserving method for training the RBM, which can be used for information sharing without revealing personal data from different institutions to each other. We provide a correctness and efficiency analysis of our algorithms. The comparative experiment shows that the accuracy is very close to that of the original RBM model.
The rest of this paper is organized as follows. Section 2 describes the related work. We introduce the restricted Boltzmann machine, Gibbs sampling, contrastive divergence, and the cryptographic scheme in more detail in Section 3. In Section 4, we describe our privacy-preserving method for training the RBM and analyze its complexity, accuracy loss, and security. Section 5 gives the design of our experiments in detail. Last, Section 6 concludes the paper.
2. Related Work
In [5], Hinton gives a practical guide for training the restricted Boltzmann machine. The RBM is widely used in collaborative filtering [4]. Agrawal and Srikant [6] and Lindell and Pinkas [7] separately propose that much of the future research in data mining will be focused on the development of privacy-preserving techniques. Privacy-preserving data mining techniques can be divided into two classes: randomization-based methods like [6] and cryptography-based methods like [7].
Randomization-based privacy-preserving data mining, which perturbs data or reconstructs the distribution of the original data, can only provide a limited degree of privacy and accuracy but is more efficient when the database is very large. In [8], Du and Zhan present a method to build decision tree classifiers from disguised data. They conduct experiments to compare the accuracy of their decision tree with that of one built from the original undisguised data. In [9], Huang et al. study how correlations affect the privacy of a dataset disguised via the random perturbation scheme and propose two data reconstruction methods that are based on data correlations. In [10], Aggarwal and Yu develop a new flexible approach for privacy-preserving data mining, which does not require new problem-specific algorithms since it maps the original dataset into a new anonymous dataset.
Cryptography-based privacy-preserving data mining, which can provide a better guarantee of privacy when different institutes want to cooperate to meet a common research goal, is limited by its efficiency when the dataset is very large. In [11], Wright and Yang propose a cryptography-based privacy-preserving protocol for learning the Bayesian network structure. Chen and Zhong [12] present a cryptography-based privacy-preserving algorithm for backpropagation neural network learning. In [13], Laur et al. propose cryptographically secure protocols for kernel perceptrons and kernelized support vector machines. In [14], Vaidya et al. propose a privacy-preserving naive Bayes classifier for both vertically and horizontally partitioned data.
To the best of our knowledge, we are the first to provide a privacy-preserving RBM training algorithm for vertically partitioned data.
3. Technical Preliminaries
In this section, we give a brief review of the RBM and the cryptographic method we use in our privacy-preserving algorithm. First, we introduce the RBM and the learning method for binary units. Much of the description of the RBM and its training method in this section is adapted from [5, 15]. Second, we introduce the cryptographic technique [12] that we use in our work.
3.1. RBM
The Boltzmann machine (BM) [16] is a stochastic neural network with symmetric connections between units and no connection from a unit to itself. BMs can be used to learn important aspects of an unknown probability distribution based on its samples. Restricted Boltzmann machines (RBMs) further restrict BMs to have no visible-visible and no hidden-hidden connections [15], thus simplifying the learning process. A graphical depiction of an RBM is shown in Figure 1. $v_1, \dots, v_m$ are visible units and $h_1, \dots, h_n$ are hidden units. All visible units are connected with all hidden units through a weight matrix $W$.
Given $W$, a joint configuration $(v, h)$ of the visible and hidden units has an energy [17] defined as

$$E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i h_j w_{ij}, \tag{1}$$

where $v$ and $h$ are the vectors consisting of the states of all visible units and hidden units, respectively; $a_i$ and $b_j$ are the biases associated with visible unit $i$ and hidden unit $j$, respectively, and $w_{ij}$ is the weight between units $i$ and $j$. The energy determines the joint probability distribution over the visible and hidden state vectors as follows:

$$p(v, h) = \frac{e^{-E(v, h)}}{Z}, \tag{2}$$

where $Z$ is the sum of $e^{-E(v, h)}$ over all possible $(v, h)$ pairs.
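As a minimal sketch of the energy-based definition, the following Python snippet enumerates every joint configuration of a tiny RBM, assuming the standard form $E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij}$ and $p(v, h) = e^{-E(v, h)} / Z$. The function names and the toy parameter values are hypothetical and chosen only for illustration; brute-force enumeration is feasible only for very small models.

```python
import itertools
import math

def energy(v, h, a, b, W):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i h_j w_ij."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = sum(b_j * h_j for b_j, h_j in zip(b, h))
    pair_term = sum(v[i] * h[j] * W[i][j]
                    for i in range(len(v)) for j in range(len(h)))
    return -visible_term - hidden_term - pair_term

def joint_distribution(a, b, W):
    """Return {(v, h): p(v, h)} by enumerating all binary states."""
    m, n = len(a), len(b)
    states = [(v, h)
              for v in itertools.product((0, 1), repeat=m)
              for h in itertools.product((0, 1), repeat=n)]
    weights = {s: math.exp(-energy(s[0], s[1], a, b, W)) for s in states}
    Z = sum(weights.values())  # partition function
    return {s: w / Z for s, w in weights.items()}

# Toy parameters (assumed values): 2 visible units, 1 hidden unit.
a = [0.1, -0.2]
b = [0.3]
W = [[0.5], [-0.4]]
dist = joint_distribution(a, b, W)
```

Lower-energy configurations receive higher probability, and the probabilities of all $2^{m+n}$ configurations sum to one.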
3.2. RBM with Binary Units
When the units’ states are binary, according to [18], a probabilistic version of the usual neuron activation function can be written as

$$P(v_i = 1 \mid h) = \mathrm{sigm}(a_i + W_i h), \qquad P(h_j = 1 \mid v) = \mathrm{sigm}(b_j + W'_j v), \tag{3}$$

where sigm denotes the sigmoid function and $W_i$ (and $W'_j$, resp.) is the $i$th row vector (the $j$th column vector, resp.) of $W$.
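Based on (2) and (3), the log-likelihood gradients for an RBM with binary units [15] can be computed as

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}, \qquad \frac{\partial \log p(v)}{\partial a_i} = \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}}, \qquad \frac{\partial \log p(v)}{\partial b_j} = \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}}, \tag{4}$$

where $\langle \cdot \rangle_{\text{data}}$ and $\langle \cdot \rangle_{\text{model}}$ denote expectations under the data distribution and the model distribution, respectively. These gradients guide the updates of the weight matrix and the biases during the training procedure of the RBM.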
3.3. Sampling and Contrastive Divergence in an RBM
Using Gibbs sampling as the transition operator, samples of $p(v)$ can be obtained by running a Markov chain to convergence [15]. To sample a joint of $N$ random variables $S = (S_1, \dots, S_N)$, Gibbs sampling performs a sequence of $N$ sampling substeps of the form $S_i \sim p(S_i \mid S_{-i})$, where $S_{-i}$ represents the ensemble of the $N - 1$ random variables in $S$ other than $S_i$.
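An RBM consists of visible and hidden units. However, since the units within one layer are conditionally independent given the other layer, we can perform block Gibbs sampling [15]. In this setting, all hidden units are sampled simultaneously given fixed values of the visible units. Similarly, all visible units are sampled simultaneously given the hidden units. A step in the Markov chain is thus taken as follows [15]:

$$h^{(n+1)} \sim \mathrm{sigm}(W' v^{(n)} + b), \qquad v^{(n+1)} \sim \mathrm{sigm}(W h^{(n+1)} + a),$$

where $h^{(n)}$ refers to the set of all hidden units at the $n$th step of the Markov chain, $W'$ is the transpose of the weight matrix $W$, and $a$ and $b$ are the bias vectors of the visible and hidden units. What it means is that, for example, $h_j^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\mathrm{sigm}(W'_j v^{(n)} + b_j)$, and similarly $v_i^{(n+1)}$ is randomly chosen to be 1 (versus 0) with probability $\mathrm{sigm}(W_i h^{(n+1)} + a_i)$ [15], where $W_i$ is the $i$th row of $W$ and $W'_j$ is its $j$th column. This is illustrated graphically in Figure 2. Contrastive divergence does not wait for the chain to converge. Samples are obtained after only $k$ steps of Gibbs sampling. In practice, $k = 1$ has been shown to work surprisingly well [15].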
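The block Gibbs step and contrastive divergence with a single step (CD-1) can be sketched as follows. This is a plain, non-private illustration under stated assumptions: `W[i][j]` is taken to be the weight between visible unit `i` and hidden unit `j`, `a` and `b` are the visible and hidden biases, the update uses hidden probabilities rather than sampled hidden states in the statistics (a common practical choice), and all function names are hypothetical.

```python
import math
import random

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_hidden(v, b, W):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigm(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def p_visible(h, a, W):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigm(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(a))]

def sample(probs, rng):
    """Draw each unit as 1 with its probability, else 0."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(v0, a, b, W, lr, rng):
    """One CD-1 step on a single binary vector v0; updates a, b, W in place."""
    ph0 = p_hidden(v0, b, W)          # positive phase
    h0 = sample(ph0, rng)
    v1 = sample(p_visible(h0, a, W), rng)  # one block Gibbs step
    ph1 = p_hidden(v1, b, W)          # negative phase
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for i in range(len(a)):
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])
    return v1

rng = random.Random(0)
a = [0.0, 0.0, 0.0]
b = [0.0, 0.0]
W = [[0.0, 0.0] for _ in range(3)]
v1 = cd1_update([1, 0, 1], a, b, W, 0.1, rng)
```

Running more Gibbs steps before the negative phase would give CD-$k$ for larger $k$.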
3.4. ElGamal Scheme
In our privacy-preserving scheme, we use ElGamal [19], a typical public-key encryption method, as our cryptographic tool. Reference [20] has shown that the ElGamal encryption scheme is semantically secure [21] under a standard cryptographic assumption. In [12], the authors develop an elegant method for securely computing the sigmoid function and an algorithm for securely computing the product of two integers, both based on the homomorphic and probabilistic properties of ElGamal. Here we give a brief review of these two algorithms. As shown in Algorithm 1, Party $A$, holding $x_1$, first generates a random number $R$ and computes $m_i = y(x_1 + i) - R$, where $i$ ranges over all possible inputs of Party $B$. Specifically, $y$ is the sigmoid function. Similarly, as shown in Algorithm 2, Party $A$ holds $M$ and Party $B$ holds $N$. Party $A$ computes $M \cdot i - R$ for all possible inputs $i$ of Party $B$ and then sends all encrypted messages to Party $B$. Then, Party $A$ and Party $B$ can obtain secret shares of $MN$ [12].
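The secure product idea can be illustrated with a toy, single-process simulation. This is only a sketch under loud assumptions, not the protocol of [12] as published: the prime `P` and base `G` are tiny and insecure, the private key is assumed to be split additively between the two parties (which is what makes "partial decryption" possible here), plaintexts are shifted by a hypothetical `OFFSET` so they stay in the valid message range, and all function names are invented for illustration.

```python
import random

P = 2579       # small prime; toy size only, NOT secure
G = 2          # base element (assumed)
OFFSET = 100   # shifts possibly negative plaintexts into 1..P-1

def keygen(rng):
    """Each party holds one additive share of the private key."""
    x_a = rng.randrange(1, P - 1)
    x_b = rng.randrange(1, P - 1)
    h = pow(G, x_a + x_b, P)   # joint public key
    return x_a, x_b, h

def encrypt(m, h, rng):
    r = rng.randrange(1, P - 1)
    return pow(G, r, P), (m * pow(h, r, P)) % P

def rerandomize(c, h, rng):
    """Multiply by an encryption of 1 so the ciphertext looks fresh."""
    s = rng.randrange(1, P - 1)
    return (c[0] * pow(G, s, P)) % P, (c[1] * pow(h, s, P)) % P

def partial_decrypt(c, x_share):
    """Strip one key share: c2 * c1^(-x_share) mod P (Fermat inverse)."""
    return c[0], (c[1] * pow(c[0], P - 1 - x_share, P)) % P

def secure_product(M, N, n_max, rng):
    """A holds M, B holds N in 0..n_max; returns additive shares of M*N."""
    x_a, x_b, h = keygen(rng)
    R = rng.randrange(0, 50)                      # A's random mask
    table = [encrypt(M * i - R + OFFSET, h, rng)  # A: one entry per possible N
             for i in range(n_max + 1)]
    picked = rerandomize(table[N], h, rng)        # B: pick own entry
    half = partial_decrypt(picked, x_a)           # A: partial decryption
    _, encoded = partial_decrypt(half, x_b)       # B: final decryption
    return R, encoded - OFFSET                    # (A's share, B's share)

share_a, share_b = secure_product(6, 7, 10, random.Random(0))
```

Neither share alone determines the product, but their sum equals $MN$. With the toy bounds above (`M`, `N` at most 10 and `R` below 50), every shifted plaintext stays between 51 and 200, well inside the message range.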


4. Privacy-Preserving Restricted Boltzmann Machine
4.1. Overview and Algorithm of Our Privacy-Preserving Restricted Boltzmann Machine
In order to use cryptographic tools in our privacy-preserving RBM, we use a probability as the value of each hidden unit and visible unit. That is, during the Gibbs sampling process, we use the activation probability instead of the sampled binary state as the value of the hidden unit and visible unit. We can therefore use the ElGamal scheme to encrypt the probability after rounding it to a fixed number of decimal places. However, this approximation causes some accuracy loss. We evaluate this accuracy loss in Section 5.
In our privacy-preserving RBM training algorithm, we assume that the data are vertically partitioned. That is, each party owns some of the features of the dataset. Our privacy-preserving RBM is the first work on training a restricted Boltzmann machine over a vertically partitioned dataset. We now describe our training algorithm in detail.
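As a minimal sketch of vertical partitioning, assume that Party A holds the first `n_a` features of every sample and Party B holds the rest (the split point, the toy data, and the function names are hypothetical). The weighted input to a hidden unit then separates into two partial sums that each party can compute locally:

```python
def vertical_split(samples, n_a):
    """Split each sample's features: first n_a to Party A, the rest to Party B."""
    part_a = [row[:n_a] for row in samples]
    part_b = [row[n_a:] for row in samples]
    return part_a, part_b

def partial_sum(features, weights):
    """Weighted sum over the features that a single party holds."""
    return sum(x * w for x, w in zip(features, weights))

samples = [[1, 0, 1, 1], [0, 1, 0, 1]]
weights = [0.2, -0.5, 0.1, 0.4]   # weights into one hidden unit (toy values)
n_a = 2
part_a, part_b = vertical_split(samples, n_a)

for row, row_a, row_b in zip(samples, part_a, part_b):
    local_a = partial_sum(row_a, weights[:n_a])   # computed by Party A alone
    local_b = partial_sum(row_b, weights[n_a:])   # computed by Party B alone
    full = partial_sum(row, weights)
    assert abs((local_a + local_b) - full) < 1e-12
```

Only the nonlinear step that combines the two partial sums requires a joint, cryptographically protected computation.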
For each training iteration, two parties, $A$ and $B$, separately own their respective parts of the input. The main idea of our privacy-preserving RBM is that, when training our model, we use the cryptographic methods (Algorithms 1 and 2) [12] to secure each step without revealing the original data to the other party.
First, we let each party sum up its visible data for each sample. Then Party $A$ computes, for every possible value of Party $B$’s sum, the sigmoid of the combined sum minus a random number generated by Party $A$. Party $A$ then rounds all these results to integers, encrypts them, and sends the ciphertexts to Party $B$ in increasing order of the candidate value. Party $B$ picks the ciphertext that corresponds to its own summed value, rerandomizes it, and sends it to Party $A$, who partially decrypts this message and sends it back to Party $B$, who decrypts it and obtains the result. The resulting values are as shown in the Privacy-Preserving Distributed Algorithm for RBM (Algorithm 3). Then, using the same method, we can perform the rest of the privacy-preserving Gibbs sampling process.
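The table-lookup step just described can be simulated in plaintext to see how the two parties end up with additive shares of a rounded sigmoid value. This sketch omits encryption entirely and rests on assumptions: a fixed-point scale of 100, a small known range for Party B's sum, and hypothetical function names. In the real protocol the table would be sent encrypted and Party B's selection would stay hidden from Party A.

```python
import math
import random

SCALE = 100   # two decimal digits of fixed-point precision (assumed)

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def build_table(sum_a, b_values, R):
    """Party A: masked, rounded sigmoid for every possible sum of Party B."""
    return {sb: round(sigm(sum_a + sb) * SCALE) - R for sb in b_values}

def shared_sigmoid(sum_a, sum_b, b_values, rng):
    """Return (A's share, B's share) of round(sigm(sum_a + sum_b) * SCALE)."""
    R = rng.randrange(0, 1000)               # A's random mask
    table = build_table(sum_a, b_values, R)  # would be encrypted before sending
    share_b = table[sum_b]                   # B keeps only its own entry
    return R, share_b

share_a, share_b = shared_sigmoid(0.3, 2, range(-3, 4), random.Random(0))
```

Party A learns nothing beyond its own mask, and Party B sees only a masked value; adding the two shares recovers the fixed-point sigmoid output.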
For the second part, updating the weights, we use Algorithm 2 [12] to securely compute the two products that appear in the weight update separately. In the notation of Algorithm 3, the number on the top of each value indicates the Gibbs step and the number on the bottom indicates the party the data belongs to. Regardless of which party a value belongs to, we get the same result. Therefore, we use Algorithm 2 to securely compute these products. As one example, when the first factor belongs to Party $A$, Party $A$ computes the product minus a random number for all possible values of the second factor, rounds all these results to integers, encrypts them, and then sends the ciphertexts to Party $B$ in increasing order of the candidate value. Then Party $B$ picks the ciphertext that corresponds to its own value, rerandomizes it, and sends it to Party $A$, who partially decrypts this message and sends it back to Party $B$, who decrypts it and obtains the result. The resulting values are as shown in the Privacy-Preserving Distributed Algorithm for RBM (Algorithm 3). Then, using the same method, we can perform the rest of the privacy-preserving product process.

Lastly, if Party $A$ owns the visible unit in question, it can compute its part of the weight update, and Party $B$ computes the remaining part. Then Party $B$ sends its part to Party $A$, and Party $A$ sums up the two to get the final value of the difference between the two products. Party $A$ can then perform gradient descent to update the weight. Using the same method, we can update the biases of the visible units and the biases of the hidden units.
A privacy-preserving testing algorithm can be easily derived from the Gibbs sampling part of the privacy-preserving training algorithm.
4.2. Analysis of Algorithm Complexity and Accuracy Loss
The running time of one iteration of training consists of two parts, the Gibbs sampling and updating the weights. First, we analyze the execution time of the Gibbs sampling process. According to [12], Algorithm 1 takes , where is the total number of in Algorithm 1 and E and D are the costs of encryption and decryption. Therefore, in the Gibbs sampling process, we assume there are samples, hidden units, and visible units. We can get the time cost as .
In the updating weights process, Algorithm 2 also takes . Therefore, the total time used to encrypt and decrypt is .
Combining the time for the two stages, we obtain the running time of one round of privacy-preserving RBM learning as .
In order to preserve privacy, we introduced two approximations in our algorithm. First, we replaced the binary value by the probability. Second, we mapped the real numbers to fixed-point representations to enable the cryptographic operations in Algorithms 1 and 2 [12]. This is necessary because intermediate results, such as the values of visible and hidden units, are represented as real numbers in normal RBM learning, whereas cryptographic operations work on discrete finite fields. We empirically evaluate the impact of these two sources of approximation on the accuracy loss of our RBM learning algorithm in Section 5. Below we give a brief theoretical analysis of the accuracy loss caused by the fixed-point representations. We assume that the error ratio bound which is caused by truncating the real number is . In the Gibbs sampling process, Algorithm 1 is applied three times; therefore, the error ratio bound is . In the weight updating process, Algorithm 2 is applied once for each dataset. The error ratio bound for is .
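The effect of the fixed-point mapping can be illustrated with a small sketch. It assumes truncation to two decimal digits (the precision used in the experiments) and uses hypothetical helper names and sample probabilities; it shows only the per-value truncation error, not the error bounds of the full protocol.

```python
import math

def truncate(x, digits):
    """Keep `digits` decimal places of a non-negative number, dropping the rest."""
    scale = 10 ** digits
    return math.floor(x * scale) / scale

def to_fixed(x, digits):
    """Integer representation that a cryptographic operation could work on."""
    return math.floor(x * 10 ** digits)

probs = [0.7310585786, 0.5, 0.1192029220]   # sample unit probabilities (toy values)
errors = [p - truncate(p, 2) for p in probs]
```

Each truncated value differs from the original by less than $10^{-2}$, and the relative error is largest for probabilities close to zero.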
4.3. Analysis of Algorithm’s Security
In our distributed RBM training algorithm, except for the computations that a party can do by itself, all computations that have to be done jointly by the two parties protect their input data with semantically secure encryption. In addition, all intermediate results are protected using the secret sharing scheme. In the semi-honest model, in which both parties follow the algorithm without any deviation, our algorithm guarantees that the only additional knowledge a party gains from executing the algorithm is the final training result. Therefore, our algorithm protects both parties’ privacy in this model.
5. Experiments
In this section, we explain the experimental process for measuring the accuracy loss of our modified algorithms. We compare the testing error rates with those of the non-privacy-preserving case. We then distinguish the two types of approximation introduced by our algorithms, the use of probabilities instead of binary values and the conversion of real numbers to fixed-point numbers when applying cryptographic algorithms, and analyze how they affect the accuracy of the RBM.
5.1. Setup
The algorithms were implemented in MATLAB. The experiments were executed on a Windows computer with a 2.3 GHz Intel Core i5 processor and 3 GB of memory. The dataset used for testing was the MNIST database of handwritten digits. We chose the number of hidden nodes based on the number of attributes. Weights were initialized as uniformly random values in the range [−0.1, 0.1]. Feature values in each dataset were normalized to lie between 0 and 1.
5.2. Effects of Two Types of Approximation on Accuracy
In this section, we evaluate the loss of accuracy of our modified training model. Our model contains two approximations. The first is that we use the probability instead of the binary value as our Gibbs sampling result. The second is that we truncate the probability to a finite number of digits so that we can shift the decimal point and then use the resulting integer for encryption. We distinguish and evaluate the effects of these two approximation types without cryptographic operations (we call this the approximation test).
First, we measure the loss of accuracy caused by using the probability instead of the binary value on the MNIST dataset. We chose 5,000 samples as training data and 1,000 as testing data. We set the number of hidden units to 100 and evaluate the loss of accuracy for different numbers of training epochs. In Figure 3, we can see that the accuracy loss caused by this approximation is less than 1%. Since encryption and decryption do not influence the accuracy of our model, this is the actual accuracy loss of our privacy-preserving training method.
Second, we measure the accuracy loss caused by truncating the probability to a finite number of digits. Specifically, we truncate the number to two digits. We use the same parameters as in the first experiment. The results show that the error rate is still close to that of the algorithm without approximation.
6. Conclusion and Future Work
In this paper, we have presented a privacy-preserving algorithm for the RBM. The algorithm guarantees privacy in a standard cryptographic model, the semi-honest model. Although approximations are introduced in the algorithm, the experiments on real-world data show that the amount of accuracy loss is reasonable.
Using our techniques, it should not be difficult to develop privacy-preserving algorithms for RBM learning with three or more participants. In this paper, we have proposed only the RBM training method. A future research topic would be to apply it in a practical implementation and to extend our work to the training of deep networks.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.