Abstract

With the arrival of the big data era, it is predicted that distributed data mining will lead to an information technology revolution. To motivate different institutes to collaborate with each other, the crucial issue is to eliminate their concerns regarding data privacy. In this paper, we propose a privacy-preserving method for training a restricted Boltzmann machine (RBM). With our privacy-preserving method, the RBM can be trained without the participating parties revealing their private data to each other. We provide a correctness and efficiency analysis of our algorithms. A comparative experiment shows that the accuracy is very close to that of the original RBM model.

1. Introduction

With the rapid development of information technology and modern networks, huge amounts of personal data are generated every day, and people care deeply about maintaining their privacy. Therefore, there is a need to focus on developing privacy-preserving data mining algorithms. With the rapid growth of social networks like Facebook and LinkedIn, more and more research will be based on personal data, such as advertising recommendation. In another scenario, doctors always collect patients' personal information before diagnosing or treating an illness. To prevent the leakage of such private data, the Health Insurance Portability and Accountability Act (HIPAA) has established a series of regulations that protect the privacy of individually identifiable health information.

Data mining is an important interdisciplinary field of computer science and has been widely extended to bioinformatics, medicine, and social networks. For example, when a research institute wants to study DNA sequences and related genetic diseases, it needs to collect patients' DNA data and apply data mining or machine learning algorithms to obtain a relevant model. However, if scientists from other institutes also want to use these DNA sequences, protecting the patients' personal information becomes exactly the problem at hand. In another scenario, some researchers want to combine personal data from Facebook and LinkedIn to undertake a study. However, neither company wants to reveal the personal information of its subscribers, and they especially do not want to give it to a competitor. Therefore, we propose a privacy-preserving machine learning method to ensure that individuals' privacy is protected.

The restricted Boltzmann machine (RBM) [1] is increasingly being used in supervised and unsupervised learning scenarios, such as classification. It is a variant of the Boltzmann machine (BM), a type of stochastic recurrent neural network invented by Hinton and Sejnowski. RBMs have been applied to data such as windows of mel-cepstral coefficients that represent speech [2], bags of words that represent documents [3], and user ratings of movies [4].

In this paper, we propose a privacy-preserving method for training the RBM, which allows different institutions to share information without revealing their personal data to each other. We provide a correctness and efficiency analysis of our algorithms. A comparative experiment shows that the accuracy is very close to that of the original RBM model.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the restricted Boltzmann machine, Gibbs sampling, contrastive divergence, and the cryptographic scheme in more detail. Section 4 describes our privacy-preserving method for training the RBM, together with an analysis of its complexity, accuracy loss, and security. Section 5 presents the design of our experiments in detail. Finally, Section 6 concludes the paper.

2. Related Work

In [5], Hinton gives a practical guide for training the restricted Boltzmann machine, which is widely used in collaborative filtering [4]. In [6], Agrawal and Srikant and, in [7], Lindell and Pinkas separately propose that much of future research in data mining will be focused on the development of privacy-preserving techniques. Existing privacy-preserving data mining techniques can be divided into two classes: randomization-based methods such as [6] and cryptography-based methods such as [7].

Randomization-based privacy-preserving data mining, which perturbs data or reconstructs the distribution of the original data, can only provide a limited degree of privacy and accuracy but is more efficient when the database is very large. In [8], Du and Zhan present a method to build decision tree classifiers from the disguised data. They have conducted experiments to compare the accuracy of their decision tree with the one built from the original undisguised data. In [9], Huang et al. study how correlations affect the privacy of a dataset disguised via the random perturbation scheme and propose two data reconstruction methods that are based on data correlations. In [10], Aggarwal and Yu develop a new flexible approach for privacy-preserving data mining, which does not require new problem-specific algorithms since it maps the original dataset into a new anonymous dataset.

Cryptography-based privacy-preserving data mining, which can provide a better guarantee of privacy when different institutes want to cooperate toward a common research goal, is often limited by its efficiency when the dataset is very large. In [11], Wright and Yang propose a cryptography-based privacy-preserving protocol for learning the Bayesian network structure. Chen and Zhong [12] present a cryptography-based privacy-preserving algorithm for backpropagation neural network learning. In [13], Laur et al. propose cryptographically secure protocols for the kernel perceptron and kernelized support vector machines. In [14], Vaidya et al. propose a privacy-preserving naive Bayes classifier for both vertically and horizontally partitioned data.

To the best of our knowledge, we are the first to provide a privacy-preserving RBM training algorithm for vertically partitioned data.

3. Technical Preliminaries

In this section, we give a brief review of the RBM and of the cryptographic method used in our privacy-preserving algorithm. First, we introduce the RBM and the learning method for binary units; much of the description of the RBM and its training method in this section is adapted from [5, 15]. Second, we introduce the cryptographic technique [12] that we use in our work.

3.1. RBM

The Boltzmann machine (BM) [16] is a stochastic neural network with symmetric connections between units and no self-connections. BMs can be used to learn important aspects of an unknown probability distribution from samples of that distribution. Restricted Boltzmann machines (RBMs) further restrict the architecture so that there are no visible-visible or hidden-hidden connections [15], which simplifies the learning process. A graphical depiction of an RBM is shown in Figure 1: $v_1, \dots, v_m$ are the visible units, $h_1, \dots, h_n$ are the hidden units, and every visible unit is connected to every hidden unit through a weight matrix $W$.

Given the weight matrix $W$, a joint configuration $(\mathbf{v}, \mathbf{h})$ of the visible and hidden units has an energy [17] defined as
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i} b_i v_i - \sum_{j} c_j h_j - \sum_{i,j} v_i W_{ij} h_j,$$
where $\mathbf{v}$ and $\mathbf{h}$ are the vectors consisting of the states of all visible units and hidden units, respectively; $b_i$ and $c_j$ are the biases associated with unit $v_i$ and unit $h_j$, respectively; and $W_{ij}$ is the weight between units $v_i$ and $h_j$. The energy determines the probability distribution over the hidden units' and visible units' state vectors as follows:
$$P(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}, \qquad Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})},$$
where $Z$ is the sum of $e^{-E(\mathbf{v}, \mathbf{h})}$ over all possible pairs of visible and hidden state vectors.
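To make the notation concrete, here is a minimal NumPy sketch (our illustration, not code from the paper) that evaluates this energy and the corresponding unnormalized probability for small binary state vectors; the variable names v, h, W, b, and c mirror the symbols above.

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -b.v - c.h - v^T W h for binary state vectors v and h."""
    return -np.dot(b, v) - np.dot(c, h) - v @ W @ h

def unnormalized_prob(v, h, W, b, c):
    """exp(-E(v, h)); dividing by the partition function Z gives P(v, h)."""
    return np.exp(-energy(v, h, W, b, c))

# Tiny example: 3 visible units, 2 hidden units.
rng = np.random.default_rng(0)
W = rng.uniform(-0.1, 0.1, size=(3, 2))
b, c = np.zeros(3), np.zeros(2)
v, h = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0])
print(energy(v, h, W, b, c), unnormalized_prob(v, h, W, b, c))
```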

3.2. RBM with Binary Units

When the units' states are binary, according to [18], a probabilistic version of the usual neuron activation function that is commonly studied can be simplified to
$$P(h_j = 1 \mid \mathbf{v}) = \operatorname{sigm}\bigl(c_j + \mathbf{v}^{\top} W_{\cdot j}\bigr), \qquad P(v_i = 1 \mid \mathbf{h}) = \operatorname{sigm}\bigl(b_i + W_{i \cdot}\,\mathbf{h}\bigr),$$
where $\operatorname{sigm}(x) = 1/(1 + e^{-x})$ denotes the sigmoid function and $W_{i \cdot}$ (resp. $W_{\cdot j}$) is the $i$th row vector (resp. the $j$th column vector) of $W$.

Based on (2) and (3), the log-likelihood gradients for an RBM with binary units [15] can be computed as
$$\frac{\partial \log p(\mathbf{v})}{\partial W_{ij}} = \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{\mathrm{model}}, \qquad \frac{\partial \log p(\mathbf{v})}{\partial b_i} = \langle v_i \rangle_{\mathrm{data}} - \langle v_i \rangle_{\mathrm{model}}, \qquad \frac{\partial \log p(\mathbf{v})}{\partial c_j} = \langle h_j \rangle_{\mathrm{data}} - \langle h_j \rangle_{\mathrm{model}},$$
where $\langle \cdot \rangle_{\mathrm{data}}$ and $\langle \cdot \rangle_{\mathrm{model}}$ denote expectations under the data distribution and the model distribution, respectively. These gradients will be used to guide the updates of the weight matrix during the training procedure of the RBMs.
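The conditionals of Section 3.2 and the gradient estimates above can be written compactly as follows; this is a sketch of ours, with W of shape (number of visible units) × (number of hidden units) as in the notation above.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    """P(h_j = 1 | v) = sigm(c_j + v . W[:, j]) for every hidden unit j."""
    return sigm(c + v @ W)

def p_v_given_h(h, W, b):
    """P(v_i = 1 | h) = sigm(b_i + W[i, :] . h) for every visible unit i."""
    return sigm(b + W @ h)

def cd_gradient_estimate(v0, h0, vk, hk):
    """Sample-based estimates of <v_i h_j>_data - <v_i h_j>_model and of the bias gradients."""
    dW = np.outer(v0, h0) - np.outer(vk, hk)
    db = v0 - vk
    dc = h0 - hk
    return dW, db, dc
```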

3.3. Sampling and Contrastive Divergence in an RBM

Using Gibbs sampling as the transition operator, samples of $p(\mathbf{v})$ can be obtained by running a Markov chain to convergence [15]. To sample from a joint distribution of $N$ random variables $S = (S_1, \dots, S_N)$, Gibbs sampling performs a sequence of $N$ sampling substeps of the form $S_i \sim p(S_i \mid S_{-i})$, where $S_{-i}$ represents the ensemble of the random variables in $S$ other than $S_i$.

An RBM consists of visible and hidden units. However, since the units within one layer are conditionally independent given the other layer, we can perform block Gibbs sampling [15]. In this setting, all hidden units are sampled simultaneously given fixed values of the visible units, and similarly all visible units are sampled simultaneously given the hidden units. A step in the Markov chain is thus taken as follows [15]:
$$\mathbf{h}^{(k+1)} \sim \operatorname{sigm}\bigl(W^{\top}\mathbf{v}^{(k)} + \mathbf{c}\bigr), \qquad \mathbf{v}^{(k+1)} \sim \operatorname{sigm}\bigl(W\mathbf{h}^{(k+1)} + \mathbf{b}\bigr),$$
where $\mathbf{h}^{(k)}$ refers to the set of all hidden units at the $k$th step of the Markov chain. What this means is that, for example, $h_j^{(k+1)}$ is randomly chosen to be 1 (versus 0) with probability $\operatorname{sigm}\bigl((W^{\top}\mathbf{v}^{(k)} + \mathbf{c})_j\bigr)$, and similarly $v_i^{(k+1)}$ is randomly chosen to be 1 (versus 0) with probability $\operatorname{sigm}\bigl((W\mathbf{h}^{(k+1)} + \mathbf{b})_i\bigr)$ [15]. This process is illustrated graphically in Figure 2. Contrastive divergence does not wait for the chain to converge; instead, samples are obtained after only $k$ steps of Gibbs sampling. In practice, $k = 1$ has been shown to work surprisingly well [15].
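Putting the two half-steps together, a plain (non-private) CD-1 update can be sketched as below; the learning rate lr and the option to propagate probabilities instead of sampled binary states are our own illustrative choices (the latter anticipates the approximation used in Section 4).

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1, use_probabilities=True, rng=None):
    """One CD-1 step (v0 -> h0 -> v1 -> h1) followed by a gradient step on W, b, c."""
    rng = rng or np.random.default_rng(0)
    p_h0 = sigm(c + v0 @ W)                      # P(h = 1 | v0)
    h0 = p_h0 if use_probabilities else (rng.random(p_h0.shape) < p_h0).astype(float)
    p_v1 = sigm(b + W @ h0)                      # P(v = 1 | h0)
    v1 = p_v1 if use_probabilities else (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigm(c + v1 @ W)                      # P(h = 1 | v1)
    W = W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b = b + lr * (v0 - v1)
    c = c + lr * (p_h0 - p_h1)
    return W, b, c
```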

3.4. ElGamal Scheme

In our privacy-preserving scheme, we use ElGamal [19], a typical public-key encryption method, as our cryptographic tool. Reference [20] has shown that the ElGamal encryption scheme is semantically secure [21] under a standard cryptographic assumption. In [12], the authors develop an elegant method for securely computing the sigmoid function and an algorithm for securely computing the product of two integers, both based on ElGamal's homomorphic and probabilistic properties. Here we give a brief review of these two algorithms. As shown in Algorithm 1, Party A first computes $\operatorname{sigm}(x_A + m) - R$ for every $m$, where $m$ ranges over all possible inputs of Party B, $x_A$ is Party A's input, and $R$ is a random number held by Party A; the function being evaluated is the sigmoid function. Similarly, as shown in Algorithm 2, Party A holds $a$ and Party B holds $b$. Party A computes $a \cdot m - R$ for all possible inputs $m$ of Party B and then sends all encrypted messages to Party B. Then, Party A and Party B can obtain the secret shares of the product $a \cdot b$ [12].

Algorithm 1 (secure computation of the sigmoid function, adapted from [12]).
Initialize
 Party A holds $x_A$ and Party B holds $x_B$; the parties wish to obtain additive secret shares of $\operatorname{sigm}(x_A + x_B)$.
Step 1.
 Party A first generates a random number $R$ and computes $\operatorname{sigm}(x_A + m) - R$ for each $m$, where $m$ ranges over the possible inputs of Party B.
 We define $c_m = \operatorname{sigm}(x_A + m) - R$; $c_m$ is the plaintext. Party A encrypts each $c_m$ using the ElGamal scheme and gets $E(c_m, r_m)$,
 where each $r_m$ is a new random number. Party A sends each $E(c_m, r_m)$ to Party B in the increasing order of $m$.
Step 2.
 Party B picks $E(c_{x_B}, r_{x_B})$, rerandomizes it, and sends $E(c_{x_B}, r')$ back to Party A,
 where $r' = r_{x_B} + s$, and $s$ is only known to Party B.
Step 3.
 Party A partially decrypts $E(c_{x_B}, r')$ and sends the partially decrypted message to Party B.
Step 4.
 Party B finally decrypts the message (by doing partial decryption on the already partially decrypted message)
 to get $\operatorname{sigm}(x_A + x_B) - R$. Note that $R$ is only known to Party A and $\operatorname{sigm}(x_A + x_B) - R$ is only known to Party B.
 Furthermore, $R + (\operatorname{sigm}(x_A + x_B) - R) = \operatorname{sigm}(x_A + x_B)$.
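To illustrate the mechanics of Algorithm 1, the following sketch implements the same message flow with a toy two-party ElGamal in Python. The prime, the generator, the fixed-point scale, and all function names are our own assumptions chosen for readability; they are not the parameters of [12] or of the paper, and the group is far too small for any real security.

```python
import math
import random

# Toy group: P = 2**127 - 1 is prime but far too small for real security.
P = 2**127 - 1
G = 3
SCALE = 100                      # fixed-point scale: two decimal digits are kept

def keygen():
    """Each party keeps one share of the secret key; the joint public key is the product."""
    x = random.randrange(2, P - 1)
    return x, pow(G, x, P)

def encrypt(joint_pub, m):
    r = random.randrange(2, P - 1)
    return (pow(G, r, P), (m * pow(joint_pub, r, P)) % P)

def rerandomize(joint_pub, ct):
    s = random.randrange(2, P - 1)
    c1, c2 = ct
    return ((c1 * pow(G, s, P)) % P, (c2 * pow(joint_pub, s, P)) % P)

def partial_decrypt(secret_share, ct):
    """Strip this party's part of the mask: c2 * c1^(-secret_share) mod P."""
    c1, c2 = ct
    return (c1, (c2 * pow(c1, P - 1 - secret_share, P)) % P)

def sigm(x):
    return 1.0 / (1.0 + math.exp(-x))

def algorithm1_sketch(x_a, x_b, candidates, sk_a, sk_b, joint_pub):
    """Parties A and B end with additive shares (mod P) of round(sigm(x_a + x_b) * SCALE)."""
    # Step 1 (Party A): mask every candidate result with the random share R,
    # encode it as a fixed-point integer, encrypt, and send in increasing order of m.
    R = random.randrange(0, SCALE)
    table = [encrypt(joint_pub, (round(sigm(x_a + m) * SCALE) - R) % P) for m in candidates]
    # Step 2 (Party B): pick the ciphertext matching its own input x_b and rerandomize it.
    chosen = rerandomize(joint_pub, table[candidates.index(x_b)])
    # Step 3 (A) and Step 4 (B): two partial decryptions reveal the masked value to B only.
    partly_decrypted = partial_decrypt(sk_a, chosen)
    _, share_b = partial_decrypt(sk_b, partly_decrypted)
    return R, share_b                            # A keeps R, B keeps share_b

# Usage: Party B's input must lie on the agreed grid of candidate values.
sk_a, pub_a = keygen()
sk_b, pub_b = keygen()
joint_pub = (pub_a * pub_b) % P
candidates = [i / SCALE for i in range(-300, 301)]
R, share_b = algorithm1_sketch(0.37, candidates[450], candidates, sk_a, sk_b, joint_pub)
assert (R + share_b) % P == round(sigm(0.37 + candidates[450]) * SCALE)
```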

Algorithm 2 (secure computation of the product of two integers, adapted from [12]).
Initialize
 Party A holds $a$ and Party B holds $b$; the parties wish to obtain additive secret shares of the product $a \cdot b$.
Step 1.
 Party A first generates a random number $R$ and computes $a \cdot m - R$ for each $m$, where $m$ ranges over the possible inputs of Party B.
 Then define $c_m = a \cdot m - R$; $c_m$ is the plaintext. Then Party A encrypts each $c_m$
 using the ElGamal scheme and gets $E(c_m, r_m)$, where each $r_m$ is a new random number.
 After that, Party A sends each $E(c_m, r_m)$ to Party B in the increasing order of $m$.
Step 2.
 Party B picks $E(c_b, r_b)$, rerandomizes it, and sends
 $E(c_b, r')$ back to Party A, where $r' = r_b + s$, and $s$ is only known to Party B.
Step 3.
 Party A partially decrypts $E(c_b, r')$ and sends the partially decrypted message to Party B.
Step 4.
 Party B finally decrypts the message (by doing partial decryption on the already partially decrypted message)
  to get $a \cdot b - R$.
Note that $R$ is only known to Party A and $a \cdot b - R$ is only known to Party B. Furthermore, $R + (a \cdot b - R) = a \cdot b$.
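Algorithm 2 follows exactly the same pattern, with the masked sigmoid values replaced by masked products. The short sketch below reuses the toy helpers (P, SCALE, encrypt, rerandomize, partial_decrypt) from the Algorithm 1 sketch above and is likewise only an illustration.

```python
import random

def algorithm2_sketch(a_value, b_value, candidates, sk_a, sk_b, joint_pub):
    """Parties A (holding a_value) and B (holding b_value) obtain additive shares (mod P)
    of round(a_value * b_value * SCALE); all helpers come from the Algorithm 1 sketch."""
    # Step 1 (Party A): mask a_value * m for every candidate input m of Party B.
    R = random.randrange(0, SCALE)
    table = [encrypt(joint_pub, (round(a_value * m * SCALE) - R) % P) for m in candidates]
    # Step 2 (Party B): select and rerandomize the ciphertext for its own input.
    chosen = rerandomize(joint_pub, table[candidates.index(b_value)])
    # Steps 3 and 4: joint decryption; only Party B sees the masked product.
    _, share_b = partial_decrypt(sk_b, partial_decrypt(sk_a, chosen))
    return R, share_b            # R + share_b = round(a_value * b_value * SCALE) (mod P)
```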

4. Privacy-Preserving Restricted Boltzmann Machine

4.1. Overview and Algorithm of Our Privacy-Preserving Restricted Boltzmann Machine

In order to use cryptographic tools in our privacy-preserving RBM, we use the probability as the value of each hidden and visible unit. That is, during the Gibbs sampling process we use the probability itself, instead of a binary state sampled from it, as the value of the hidden and visible units. Therefore, we can use the ElGamal scheme to encrypt the probability after converting it to a fixed-point integer (shifting the decimal point and rounding). However, this approximation introduces some accuracy loss, which we will evaluate in Section 5.
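As a small illustration of the fixed-point conversion (assuming, as in the experiments of Section 5, that two decimal digits are kept):

```python
SCALE = 100   # keep two decimal digits, matching the truncation used in Section 5

def to_fixed_point(p):
    """Probability in [0, 1] -> small integer that can be encrypted."""
    return round(p * SCALE)

def from_fixed_point(n):
    """Inverse mapping used when a share is turned back into a probability."""
    return n / SCALE

assert from_fixed_point(to_fixed_point(0.8674)) == 0.87
```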

In our privacy-preserving RBM training algorithm, we assume the data are vertically partitioned; that is, each party owns some of the features of the dataset. Our privacy-preserving RBM is the first work on training a restricted Boltzmann machine over a vertically partitioned dataset. Below we describe the training algorithm in detail.

For each training iteration, the two parties, A and B, own the inputs $\mathbf{v}_A$ and $\mathbf{v}_B$ separately, where $\mathbf{v}_A$ and $\mathbf{v}_B$ are the two vertical partitions of the visible vector. The main idea of our privacy-preserving RBM is that, when training the model, we use the cryptographic methods (Algorithms 1 and 2) [12] to secure each step so that neither party reveals its original data to the other.

First, for each hidden unit, we let each party sum up the weighted contributions of its own visible data for each sample; denote Party A's partial sum by $x_A$ and Party B's by $x_B$. Then Party A computes $\operatorname{sigm}(x_A + m) - R$ for all possible values $m$ of Party B's partial sum, where $R$ is a random number generated by Party A. Party A rounds all these results to integers and encrypts them, and then sends the ciphertexts to Party B in the increasing order of $m$. Party B picks the ciphertext corresponding to $x_B$, its own sum, rerandomizes it, and sends it to Party A, who partially decrypts this message and sends it back to Party B, who decrypts it and obtains the value of $\operatorname{sigm}(x_A + x_B) - R$. Specifically, for hidden unit $h_j$, $x_A$ is $\sum_{i \in A} W_{ij} v_i$ and $x_B$ is $\sum_{i \in B} W_{ij} v_i + c_j$ (the bias can be added by either party), as shown in the Privacy-Preserving Distributed Algorithm for RBM (Algorithm 3). Using the same method, we can perform the rest of the privacy-preserving Gibbs sampling process.

For the second part, updating the weights, we use Algorithm 2 [12] to securely compute the products $v_i^{(0)} h_j^{(0)}$ and $v_i^{(1)} h_j^{(1)}$ separately. Specifically, we write $v_i^{(0)}$, $h_j^{(0)}$, $v_i^{(1)}$, and $h_j^{(1)}$, where the superscript indicates the Gibbs step and the subscript indicates the unit and, when needed, the party the value belongs to. The weight gradient is then $v_i^{(0)} h_j^{(0)} - v_i^{(1)} h_j^{(1)}$, and regardless of which party the visible unit $v_i$ belongs to, we obtain the same result. Therefore, we use Algorithm 2 to securely compute these products. As one example, suppose $v_i^{(0)}$ belongs to Party A while the hidden value is secret-shared. Party A computes $v_i^{(0)} \cdot m - R$ for all possible values $m$ of Party B's share, rounds all these results to integers and encrypts them, and then sends the ciphertexts to Party B in the increasing order of $m$. Party B picks the ciphertext corresponding to its own share, rerandomizes it, and sends it to Party A, who partially decrypts this message and sends it back to Party B, who decrypts it and obtains the masked product, so that the two parties hold additive shares of $v_i^{(0)} h_{j,B}^{(0)}$, as shown in the Privacy-Preserving Distributed Algorithm for RBM (Algorithm 3). Using the same method, we can compute the remaining privacy-preserving products.
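The identity behind this step is that a product of two additively shared values splits into two locally computable terms and two cross terms, and only the cross terms need Algorithm 2. The following sketch (ours, in plaintext and ignoring fixed-point issues, with a stand-in callback in place of Algorithm 2) checks that the shares recombine correctly:

```python
import numpy as np

def product_from_shares(v_share_a, v_share_b, h_share_a, h_share_b, secure_product):
    """Shares of v * h, where v = v_share_a + v_share_b and h = h_share_a + h_share_b.
    The two local terms are computed by each party alone; the two cross terms would use
    Algorithm 2, represented here by the callback secure_product returning additive shares."""
    local_a = v_share_a * h_share_a                  # computed by Party A alone
    local_b = v_share_b * h_share_b                  # computed by Party B alone
    p_a, p_b = secure_product(v_share_a, h_share_b)  # cross term 1, secret-shared
    q_a, q_b = secure_product(v_share_b, h_share_a)  # cross term 2, secret-shared
    return local_a + p_a + q_a, local_b + p_b + q_b  # Party A's and Party B's shares of v * h

# Plaintext stand-in for Algorithm 2, used only to check that the decomposition is exact.
def fake_secure_product(x, y):
    mask = np.random.default_rng(1).random()
    return mask, x * y - mask

share_a, share_b = product_from_shares(0.2, 0.3, 0.4, 0.1, fake_secure_product)
assert abs((share_a + share_b) - (0.2 + 0.3) * (0.4 + 0.1)) < 1e-12
```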

Algorithm 3 (privacy-preserving distributed training algorithm for the RBM).
Initialize
all weights $W_{ij}$ (and biases $b_i$, $c_j$) to small random numbers and make them known to both parties.
Repeat
for all training samples do
This part mainly uses (5) and (9). Samples are obtained after only one step of Gibbs sampling
because one-step Gibbs sampling has been shown to work surprisingly well [22].
Step 1. Gibbs Sampling
{
 For each hidden layer node $h_j$, Party A computes $x_A = \sum_{i \in A} W_{ij} v_i^{(0)}$, and Party B computes $x_B = \sum_{i \in B} W_{ij} v_i^{(0)} + c_j$.
 Using Algorithm 1, Parties A and B jointly compute the sigmoid function for $x_A + x_B$, obtaining the random shares $h_{j,A}^{(0)}$ and $h_{j,B}^{(0)}$,
 respectively, s.t. $h_{j,A}^{(0)} + h_{j,B}^{(0)} = \operatorname{sigm}(x_A + x_B) = h_j^{(0)}$.
 For each visible layer node $v_i$, Party A then computes $y_A = \sum_j W_{ij} h_{j,A}^{(0)}$, and Party B computes $y_B = \sum_j W_{ij} h_{j,B}^{(0)} + b_i$;
 then, again using Algorithm 1, Parties A and B jointly compute the sigmoid function for $y_A + y_B$,
 obtaining the random shares $v_{i,A}^{(1)}$ and $v_{i,B}^{(1)}$, respectively, s.t. $v_{i,A}^{(1)} + v_{i,B}^{(1)} = \operatorname{sigm}(y_A + y_B) = v_i^{(1)}$.
 For each hidden layer node $h_j$, Party A computes $z_A = \sum_i W_{ij} v_{i,A}^{(1)}$, and Party B computes $z_B = \sum_i W_{ij} v_{i,B}^{(1)} + c_j$.
 Using Algorithm 1, Parties A and B jointly compute the sigmoid function for $z_A + z_B$,
 obtaining the random shares $h_{j,A}^{(1)}$ and $h_{j,B}^{(1)}$, respectively, s.t. $h_{j,A}^{(1)} + h_{j,B}^{(1)} = \operatorname{sigm}(z_A + z_B) = h_j^{(1)}$.
}
return the shares of $h^{(0)}$, $v^{(1)}$, and $h^{(1)}$ to the two parties.
This part mainly uses (4), (6) and (7).
Step 2. Update Weights
{
 Parties A and B compute the local products $v_{i,A}^{(1)} h_{j,A}^{(1)}$ and $v_{i,B}^{(1)} h_{j,B}^{(1)}$, respectively.
 Parties A and B apply Algorithm 2 to securely compute the product $v_{i,A}^{(1)} \cdot h_{j,B}^{(1)}$,
 obtaining random shares $p_A$ and $p_B$, respectively,
s.t. $p_A + p_B = v_{i,A}^{(1)} h_{j,B}^{(1)}$. Similarly, they compute the random partitions $q_A$ and $q_B$ of $v_{i,B}^{(1)} \cdot h_{j,A}^{(1)}$, s.t. $q_A + q_B = v_{i,B}^{(1)} h_{j,A}^{(1)}$.
 Parties A and B apply Algorithm 2 to securely compute the product of the owned visible value $v_i^{(0)}$ with the other party's hidden share (e.g., $v_i^{(0)} \cdot h_{j,B}^{(0)}$ when Party A owns $v_i^{(0)}$),
 obtaining random shares $u_A$ and $u_B$, respectively,
s.t. $u_A + u_B = v_i^{(0)} h_{j,B}^{(0)}$. Similarly, when Party B owns $v_i^{(0)}$, they compute the random partitions of $v_i^{(0)} \cdot h_{j,A}^{(0)}$.
 If Party A owns the visible unit, which is $v_i^{(0)}$, then Party A computes $v_i^{(0)} h_{j,A}^{(0)} + u_A - v_{i,A}^{(1)} h_{j,A}^{(1)} - p_A - q_A$,
 and Party B computes $u_B - v_{i,B}^{(1)} h_{j,B}^{(1)} - p_B - q_B$. Then Party B sends this value to Party A,
 who can add these two numbers to get the log-likelihood gradient $v_i^{(0)} h_j^{(0)} - v_i^{(1)} h_j^{(1)}$
for the RBM.
If Party B owns the visible unit, then the same method can be used to calculate the value.
 Then, using these log-likelihood gradients, Party A can update the parameter $W_{ij}$ of this RBM.
}
return the new $W_{ij}$ to the two parties.
Using the same method, we can update the parameters $b_i$ and $c_j$.
end for
Until (termination condition)

Lastly, if Party A owns $v_i^{(0)}$, it can compute its share of the gradient $v_i^{(0)} h_j^{(0)} - v_i^{(1)} h_j^{(1)}$ as described above, and Party B computes its own share. Then Party B sends its share to Party A, and Party A sums up these two shares to get the final value of the gradient. Then Party A can perform gradient descent to update the weight $W_{ij}$. Using the same method, we can update the bias $b_i$ of the visible unit and the bias $c_j$ of the hidden unit.

A privacy-preserving testing algorithm can be easily derived from the Gibbs sampling part of the privacy-preserving training algorithm.
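To make the data flow of Algorithm 3 easier to follow, the sketch below simulates one privacy-preserving CD-1 iteration in NumPy, replacing the ElGamal-based Algorithms 1 and 2 with plain additive masking. It therefore demonstrates only the share arithmetic and the split of the visible layer between the parties, not the cryptography; the party names, array shapes, and learning rate are our own assumptions.

```python
import numpy as np

sigm = lambda x: 1.0 / (1.0 + np.exp(-x))
rng = np.random.default_rng(0)

def share_of(values):
    """Plaintext stand-in for Algorithms 1 and 2: split each value into two additive shares."""
    mask = rng.random(values.shape)
    return values - mask, mask

def private_cd1_iteration(vA, vB, W, b, c, lr=0.1):
    """One CD-1 step on a visible vector v = [vA | vB] split vertically between Parties A and B."""
    mA = len(vA)
    v0 = np.concatenate([vA, vB])          # assembled here only for the plaintext simulation
    # Step 1: Gibbs sampling on secret shares. Each party sums over the visible units it owns;
    # the secure sigmoid (Algorithm 1) then yields additive shares of each unit's value.
    xA = vA @ W[:mA, :]                    # Party A's partial pre-activation of the hidden layer
    xB = vB @ W[mA:, :] + c                # Party B's partial pre-activation (bias added by B)
    h0A, h0B = share_of(sigm(xA + xB))     # shares of h^(0)
    yA, yB = W @ h0A, W @ h0B + b
    v1A, v1B = share_of(sigm(yA + yB))     # shares of v^(1)
    zA, zB = v1A @ W, v1B @ W + c
    h1A, h1B = share_of(sigm(zA + zB))     # shares of h^(1)
    # Step 2: recombine the shares into the CD-1 gradients (revealed to the updating party,
    # as in Algorithm 3) and take a gradient step.
    dW = np.outer(v0, h0A + h0B) - np.outer(v1A + v1B, h1A + h1B)
    db = v0 - (v1A + v1B)
    dc = (h0A + h0B) - (h1A + h1B)
    return W + lr * dW, b + lr * db, c + lr * dc

# Usage: 4 visible units split 2/2 between the parties, 3 hidden units.
W = rng.uniform(-0.1, 0.1, size=(4, 3))
b, c = np.zeros(4), np.zeros(3)
W, b, c = private_cd1_iteration(np.array([1.0, 0.0]), np.array([0.0, 1.0]), W, b, c)
```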

4.2. Analysis of Algorithm Complexity and Accuracy Loss

The running time of one training iteration consists of two parts, the Gibbs sampling and the weight update. First, we analyze the execution time of the Gibbs sampling process. According to [12], Algorithm 1 takes $O(t(E + D))$ time, where $t$ is the total number of possible inputs of Party B in Algorithm 1 and $E$ and $D$ are the costs of one encryption and one decryption, respectively. In the Gibbs sampling process, assume there are $S$ samples, $n$ hidden units, and $m$ visible units; since Algorithm 1 is invoked once per unit in each of the three sampling passes, the time cost is $O(S(2n + m)\,t(E + D))$.

In the weight-updating process, Algorithm 2 also takes $O(t(E + D))$ time per invocation, and it is invoked a constant number of times for each of the $m \times n$ weights, so the total time used to encrypt and decrypt is $O(S\,m\,n\,t(E + D))$.

Combining the time for the two stages, we obtain the running time of one round of privacy-preserving RBM learning as $O(S(mn + 2n + m)\,t(E + D))$, which is dominated by the $O(S\,m\,n\,t(E + D))$ term of the weight update.
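Under our reading of the analysis above, the number of invocations of Algorithms 1 and 2 per epoch can be tallied as follows; the per-weight constant is an assumption of this sketch.

```python
def invocations_per_epoch(num_samples, num_visible, num_hidden, products_per_weight=2):
    """Tally of secure-protocol calls per epoch under the reconstruction above.
    Algorithm 1 runs once per unit in each of the three sampling passes (h0, v1, h1);
    Algorithm 2 runs a small constant number of times per weight (an assumed value here)."""
    algorithm1_calls = num_samples * (2 * num_hidden + num_visible)
    algorithm2_calls = num_samples * products_per_weight * num_visible * num_hidden
    return algorithm1_calls, algorithm2_calls

print(invocations_per_epoch(num_samples=5000, num_visible=784, num_hidden=100))
```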

In order to preserve privacy, we introduced two approximations into our algorithm. First, we replaced the sampled binary values by probabilities. Second, we mapped the real numbers to fixed-point representations to enable the cryptographic operations in Algorithms 1 and 2 [12]. This is necessary because intermediate results, such as the values of visible and hidden units, are real numbers in normal RBM learning, whereas cryptographic operations work on discrete finite fields. We will empirically evaluate the impact of these two sources of approximation on the accuracy of our RBM learning algorithm in Section 5. Below we give a brief theoretical analysis of the accuracy loss caused by the fixed-point representations. Assume that the relative error bound caused by truncating a real number is $\epsilon$. In the Gibbs sampling process, Algorithm 1 is applied three times in sequence, so the accumulated relative error remains bounded by a small constant multiple of $\epsilon$. In the weight-updating process, Algorithm 2 is applied once for each securely computed product, so the relative error of each product is likewise bounded by a small constant multiple of $\epsilon$.

4.3. Analysis of Algorithm’s Security

In our distributed RBM training algorithm, apart from the computations that a party can perform entirely on its own, all computations that must be done jointly by the two parties protect each party's input data with semantically secure encryption. In addition, all intermediate results are protected by the secret sharing scheme. In the semihonest model, in which both parties follow the algorithm without any deviation, our algorithm guarantees that the only additional knowledge a party gains from the execution is the final training result. Therefore, our algorithm protects both parties' privacy in this model.

5. Experiments

In this section, we describe the experiments that measure the accuracy loss of our modified algorithms. In the first set of experiments, we compare the testing error rates with those of the non-privacy-preserving case. In the second set, we separate the two types of approximation introduced by our algorithms, the use of probabilities in place of sampled binary values and the conversion of real numbers to fixed-point numbers required by the cryptographic algorithms, and analyze how each affects the accuracy of the RBM.

5.1. Setup

The algorithms were implemented in MATLAB. The experiments were executed on a Windows computer with a 2.3 GHz Intel Core i5 processor and 3 GB of memory. The testing dataset was the MNIST database of handwritten digits. We chose the number of hidden nodes based on the number of attributes. Weights were initialized as uniformly random values in the range [−0.1, 0.1]. Feature values in each dataset were normalized to lie between 0 and 1.
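For reference, the initialization and normalization described here correspond roughly to the following sketch (the original experiments were implemented in MATLAB; this Python rendering and the zero-initialized biases are our own assumptions):

```python
import numpy as np

def init_rbm(num_visible, num_hidden, rng=None):
    """Weights uniform in [-0.1, 0.1]; biases start at zero (an assumption of this sketch)."""
    rng = rng or np.random.default_rng(0)
    W = rng.uniform(-0.1, 0.1, size=(num_visible, num_hidden))
    return W, np.zeros(num_visible), np.zeros(num_hidden)

def normalize_features(X):
    """Scale every feature (column) of the dataset into [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)
```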

5.2. Effects of Two Types of Approximation on Accuracy

In this section, we evaluate the loss of accuracy of our modified training model. Our model introduces two approximations. The first is that we use the probability instead of the sampled binary value as the Gibbs sampling result. The second is that we truncate the probability to a finite number of digits so that we can shift the decimal point and use the resulting integer for encryption. We then separate and evaluate the effects of these two types of approximation without performing the cryptographic operations (we call this the approximation test).

First, we measure the loss of accuracy caused by using the probability instead of the binary value on the MNIST dataset. We chose 5,000 samples as training data and 1,000 as testing data, used 100 hidden units, and performed the experiments by varying the number of training epochs and evaluating the loss of accuracy at each setting. As shown in Figure 3, the accuracy loss caused by this approximation is less than 1%. Since encryption and decryption do not influence the accuracy of our model, this reflects the accuracy loss of our privacy-preserving training method.

Second, we measure the accuracy loss caused by truncating the probability to a finite number of digits; specifically, we truncate each probability to two decimal digits. We used the same parameter settings as in the first experiment. From the results we can see that the error rate remains close to that of the algorithm without this approximation.

6. Conclusion and Future Work

In this paper, we have presented a privacy-preserving algorithm for RBM. The algorithm guarantees privacy in a standard cryptographic model, the semihonest model. Although approximations are introduced in the algorithm, the experiments on real-world data show that the amount of accuracy loss is reasonable.

Using our techniques, it should not be difficult to develop privacy-preserving algorithms for RBM learning with three or more participants. In this paper, we have proposed only the RBM training method. A future research topic is to apply it in a practical implementation and to extend our work to the training of deep networks.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.