Abstract

Data sharing is challenging but important for healthcare research. Methods for privacy-preserving data dissemination based on the rigorous differential privacy standard have been developed, but they neither consider the characteristics of biomedical data nor make full use of the available information. This often results in too much noise in the final outputs. We hypothesized that this situation can be alleviated by leveraging a small portion of open-consented data to improve utility without sacrificing privacy. We developed a hybrid differentially private support vector machine (SVM) model that uses public data and private data together. Our model leverages the RBF kernel and can handle nonlinearly separable cases. Experiments showed that this approach outperforms two baselines: (1) SVMs that only use public data, and (2) differentially private SVMs that are built from private data. The performance of our method was very close to that of nonprivate SVMs trained on the private data.

1. Introduction

Data sharing is important for accelerating scientific discoveries, especially when there are not enough local samples to test a hypothesis [1, 2]. However, medical data are sensitive because they contain personal information and can reveal much about ethnicity, disease risk [3], and even family surnames [4]. To promote data sharing, it is important to develop privacy-preserving algorithms that respect data confidentiality and preserve data utility [5], especially when one wants to leverage cloud computing [6].

Privacy-preserving data analysis and publishing [7, 8] have received considerable attention in recent years as a promising approach for sharing information while preserving data privacy. Differential privacy [9–11] has recently emerged as one of the strongest privacy guarantees for statistical data release [12–17]. A statistical aggregation or computation is DP (we shorten differentially private to DP) if the outcome is formally indistinguishable when run with and without any particular record in the dataset. The level of indistinguishability is quantified as a privacy parameter $\epsilon$. A common mechanism to achieve differential privacy is the Laplace mechanism [18], which injects calibrated noise into a statistical measure; the noise is determined by the privacy parameter and by the sensitivity of the statistical measure to the inclusion or exclusion of a record in the dataset. A lower privacy parameter requires larger noise to be added and provides a higher level of privacy.

General purpose algorithms for privacy protection (e.g., [19, 20]) often introduce too much perturbation error, which renders the resulting information useless for healthcare research. Our contribution is to leverage a small portion of open-consented data to maximally explore information that resides in the private data through a hybrid framework. Figure 1 shows an example of an environment in this case. We recently published differentially private distributed logistic regression using public and private biomedical datasets [21], which demonstrated advantages over pure private or public models. However, logistic regression is a generalized linear model, which has limited flexibility in classifying complex patterns. In this paper, we sought to extend our previous effort to the more powerful, RBF-kernel based support vector machines.

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 gives background on differential privacy, SVM, and the RBF kernel. Section 4 describes the framework and details of our hybrid SVM mechanism. Section 5 presents an extensive set of experimental evaluations. Finally, Section 6 concludes the paper with limitations and directions for future work.

2. Related Work

Rubinstein et al. [22] propose a private kernel SVM algorithm (shortened as PrivateSVM) which only works for a translation-invariant kernel $k$. The method approximates the original infinite feature space $\phi$ of $k$ with a finite feature space $\hat{\phi}$ using the Fourier transform of $k$. It then adds noise to the weight parameters in the primal form based on the new space $\hat{\phi}$. One weakness is that the parameters $\rho$ used to construct $\hat{\phi}$ are randomly generated from the Fourier transform of $k$, which degrades the approximation accuracy of $\hat{\phi}$ to $\phi$. Another problem is that the utility bounds use the same regularization parameter value $C$ to compare the private and nonprivate classifiers; they do not account for the change in the regularization parameter incurred by privacy constraints. Chaudhuri et al. [23] investigated a general mechanism, namely, DPERM, to produce private approximations of classifiers by regularized empirical risk minimization (ERM) with good perturbation error. Akin to PrivateSVM, DPERM requires that the underlying kernel be translation invariant. In this paper, we compare our method to the PrivateSVM algorithm, since DPERM has comparable performance with PrivateSVM.

3. Preliminary

Consider an original dataset $\mathcal{D}$ that contains a small portion of public data $\mathcal{D}_{\text{pub}}$ and a large part of private data $\mathcal{D}_{\text{priv}}$. Our goal is to release a differentially private support vector machine using both public and private data. In this section, we first introduce the definition of differential privacy; then, we give a brief overview of SVM and the RBF kernel.

3.1. Differential Privacy

Differential privacy has emerged as one of the strongest privacy definitions for statistical data release. It guarantees that if an adversary knows complete information of all the tuples in $\mathcal{D}$ except one, the output of a differentially private randomized algorithm should not give the adversary too much additional information about the remaining tuple. We say that datasets $\mathcal{D}_1$ and $\mathcal{D}_2$ differ in only one tuple if we can obtain $\mathcal{D}_2$ by removing or adding only one tuple from $\mathcal{D}_1$. A formal definition of differential privacy is given as follows.

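Definition 1 ($\epsilon$-differential privacy [18]). Let $\mathcal{A}$ be a randomized algorithm over two datasets $\mathcal{D}_1$ and $\mathcal{D}_2$ differing in only one tuple, and let $S$ be any arbitrary set of possible outputs of $\mathcal{A}$. Algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy if and only if the following holds:

$$\Pr[\mathcal{A}(\mathcal{D}_1) \in S] \le e^{\epsilon} \, \Pr[\mathcal{A}(\mathcal{D}_2) \in S]. \tag{1}$$

Intuitively, differential privacy ensures that the released output distribution of $\mathcal{A}$ remains nearly the same whether or not an individual tuple is in the dataset.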

A common mechanism to achieve differential privacy is the Laplace mechanism [18], which adds a small amount of independent noise to the output of a numeric function $f$ to fulfill $\epsilon$-differential privacy of releasing $f$, where the noise $\eta$ is drawn from a Laplace distribution with probability density function $p(\eta) = \frac{1}{2b} e^{-|\eta|/b}$. A Laplace noise has a variance of $2b^2$ with a magnitude of $b$. The magnitude of the noise depends on the concept of sensitivity, which is defined as follows.

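Definition 2 (sensitivity [18]). Let $f$ denote a numeric function. The sensitivity of $f$ is defined as the maximal $L_1$-norm distance between the outputs of $f$ over two datasets $\mathcal{D}_1$ and $\mathcal{D}_2$ which differ in only one tuple. Formally,

$$\Delta f = \max_{\mathcal{D}_1, \mathcal{D}_2} \left\| f(\mathcal{D}_1) - f(\mathcal{D}_2) \right\|_1. \tag{2}$$

With the concept of sensitivity, the noise follows a zero-mean Laplace distribution with the magnitude $b = \Delta f / \epsilon$. To fulfill $\epsilon$-differential privacy for a numeric function $f$ over $\mathcal{D}$, it is sufficient to publish $f(\mathcal{D}) + \eta$, where $\eta$ is drawn from $\mathrm{Lap}(\Delta f / \epsilon)$.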
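The Laplace mechanism described above can be illustrated with a minimal Python sketch. The function names (`laplace_noise`, `laplace_mechanism`, `laplace_pdf`) are hypothetical and not part of the paper's MATLAB implementation; the sampler uses the standard fact that the difference of two independent unit exponentials follows a zero-mean Laplace distribution with scale 1.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two i.i.d. Exp(1) variables is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if sensitivity < 0:
        raise ValueError("sensitivity must be non-negative")
    return true_value + laplace_noise(sensitivity / epsilon, rng)


def laplace_pdf(x, scale):
    """Density of the zero-mean Laplace distribution with the given scale."""
    return math.exp(-abs(x) / scale) / (2.0 * scale)


if __name__ == "__main__":
    rng = random.Random(7)
    # Illustrative counting query: adding or removing one tuple changes
    # the count by at most 1, so the sensitivity is 1.
    print(laplace_mechanism(true_value=120, sensitivity=1.0, epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives a larger noise scale and hence stronger privacy, which matches the trade-off stated in the text.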

3.2. Review of SVM and RBF Kernel

SVM is one of the most popular supervised binary classification methods; it takes a sample and a predetermined kernel function as input and outputs a predicted class label for this sample. Consider training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ denotes the training input points, $y_i \in \{+1, -1\}$ are the training class labels, and $n$ is the size of training data. Here, $d$ is the dimension of input data and “+1” and “−1” are class labels. An SVM maximizes the geometric margin between two classes of data and minimizes the error from misclassified data points. The primal form of a soft-margin SVM can be written as

$$\min_{w \in \mathbb{R}^F} \; \frac{1}{2}\|w\|_2^2 + \frac{C}{n}\sum_{i=1}^{n} \ell\bigl(y_i, f_w(x_i)\bigr), \tag{3}$$

where $w$ is the normal vector to the hyperplane separating two classes of data, $\ell(y, \hat{y})$ is a loss function convex in $\hat{y}$, $C$ is a regularization parameter that weighs smoothness and errors (i.e., larger $C$ for fewer errors, smaller $C$ for increased smoothness), and $f_w(x_i) = \langle \phi(x_i), w \rangle$, where $\phi$ is a function mapping training data points from their input space $\mathbb{R}^d$ to a new $F$-dimensional feature space ($F$ may be infinite). Sometimes we map the training data from their input space to another high-dimensional feature space in order to classify nonlinearly separable data. When $F$ is large or infinite, the inner products in feature space may be computed efficiently by an explicit representation of the kernel function $k(x, y) = \langle \phi(x), \phi(y) \rangle$. For example, $k(x, y) = x^\top y$ is a linear kernel function for a linear SVM, and $k(x, y) = \exp\bigl(-\|x - y\|_2^2 / (2\sigma^2)\bigr)$ is an RBF kernel function, which is translation invariant.

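In this paper, we use an RBF kernel function. Our method can be applied to any translation-invariant kernel SVM. With the hinge loss $\ell(y, \hat{y}) = \max(0, 1 - y\hat{y})$, we can obtain a dual form SVM written as

$$\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t. } 0 \le \alpha_i \le \frac{C}{n}, \; i = 1, \dots, n, \tag{4}$$

where $\alpha \in \mathbb{R}^n$, $\alpha_i$ is a per-sample parameter and $w \in \mathbb{R}^F$, $w_j$ is a per-feature weight parameter. The weight vector $w$ can be converted from the sample weight vector $\alpha$ via $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ in the linear SVM.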
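To make the kernel functions and the dual-to-primal conversion concrete, the following is a small Python sketch. It assumes the common Gaussian parameterization of the RBF kernel with a bandwidth `sigma` (an assumption, since the source does not preserve the exact formula), and all function names and the toy values of the dual parameters are hypothetical.

```python
import math


def linear_kernel(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel exp(-||x - y||^2 / (2 sigma^2)); depends only on x - y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def primal_weights(alphas, labels, points):
    """Linear SVM: convert per-sample parameters into w = sum_i alpha_i y_i x_i."""
    dim = len(points[0])
    w = [0.0] * dim
    for alpha, y, x in zip(alphas, labels, points):
        for j in range(dim):
            w[j] += alpha * y * x[j]
    return w


def dual_decision(alphas, labels, points, x, kernel):
    """Decision value sum_i alpha_i y_i k(x_i, x) computed in the dual space."""
    return sum(a * y * kernel(p, x) for a, y, p in zip(alphas, labels, points))


if __name__ == "__main__":
    points = [[1.0, 2.0], [-1.0, 0.5], [0.0, -1.0]]
    labels = [1, -1, 1]
    alphas = [0.2, 0.5, 0.3]  # toy values, not the output of a solver
    w = primal_weights(alphas, labels, points)
    query = [0.3, -0.7]
    # For the linear kernel, the primal and dual decision values coincide.
    print(linear_kernel(w, query), dual_decision(alphas, labels, points, query, linear_kernel))
```

The linear relationship between the dual and primal parameters only yields an explicit finite weight vector when the feature map is finite dimensional, which is why the method later replaces the RBF kernel with a finite approximation.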

4. Privacy Preserving Hybrid SVM

In this section, we first introduce a framework overview and then the technical details of our hybrid SVM method. We assume that all original data from different data sets follow the same unknown joint multivariate distribution and that all data tuples are samples from this distribution.

4.1. The General Framework

Figure 2 illustrates the general framework of hybrid SVM. Algorithm 1 presents the hybrid SVM algorithm. First, we use the small amount of public data together with (5) and (6) to compute the parameter $\rho = (\rho_1, \dots, \rho_D)$ in the mapping function $\hat{\phi}$ of the approximation form to the RBF kernel. Second, with $\rho$, we transform the private data from the original sample space to the new $2D$-dimensional feature space via the mapping function in (7). Then we can compute the parameter $\alpha$ in the dual space with the transformed private data and $w$ in the primal space via the linear relationship between $\alpha$ and $w$ in the linear SVM. Finally, we draw $\mu$ from $\mathrm{Lap}(\lambda)^{2D}$, where $\lambda = \Delta w / \epsilon$ is determined by the sensitivity $\Delta w$ of $w$ and the privacy budget $\epsilon$, and return $\hat{w} = w + \mu$ and $\rho$. Users can then transform their test data to the new $2D$-dimensional feature space with $\rho$ and classify the transformed data with $\hat{w}$. Here the computation of the parameter $\rho$ has no privacy risk because it is retrieved directly from public data. More details about hybrid SVM are given in the following subsections.

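Algorithm 1: Hybrid SVM.
Input: Public data $\mathcal{D}_{\text{pub}}$, private data $\mathcal{D}_{\text{priv}}$, the dimensionality $D$ of $\hat{\phi}$, a regularization parameter $C$, and privacy budget $\epsilon$;
Output: Differentially private SVM;
(1) Use the public data to compute $\rho$ via (5), (6);
(2) Transform each record of the private data to new $2D$-dimensional data via the mapping function $\hat{\phi}$ defined by (7);
(3) Compute the parameter $\alpha$ in the dual space with the transformed private data, and $w$ in the primal space via $w = \sum_{i} \alpha_i y_i \hat{\phi}(x_i)$;
(4) Draw $\mu$ from $\mathrm{Lap}(\lambda)^{2D}$, $\lambda = \Delta w / \epsilon$ with $\Delta w$ the sensitivity of $w$, then return $\hat{w} = w + \mu$ and $\rho$.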

Privacy Properties. We present the following theorem showing the privacy property of Algorithm 1.

Theorem 3. Algorithm 1 guarantees $\epsilon$-differential privacy.

Proof. For step 1, no private data is used, and hence step 1 does not impact the privacy guarantee. Due to Corollary 15 in [22] and the fact that the hinge loss is convex and 1-Lipschitz in its second argument, the sensitivity of $w$ over a pair of neighbouring datasets is bounded; we denote this bound by $\Delta w$. Then the scale parameter in step 4 is set to $\lambda = \Delta w / \epsilon$, following the Laplace mechanism introduced in Section 3.1. Therefore, Algorithm 1 preserves $\epsilon$-differential privacy, which completes the proof.

4.2. The Computation of $\rho$

Rahimi and Recht [24] approximate a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ induced by an infinite dimensional feature mapping $\phi$ with a random RKHS $\hat{\mathcal{H}}$ induced by a random finite-dimensional mapping $\hat{\phi}$. The random finite-dimensional $\hat{\phi}$ can be constructed by drawing $D$ i.i.d. vectors $\rho_1, \dots, \rho_D$ from the Fourier transform of a positive-definite translation-invariant kernel function $k$, such as the RBF kernel function. Then we can obtain an approximation form $\hat{\phi}$ of $\phi$ using the real-valued mapping function defined by the following equation:

$$\hat{\phi}(x) = \sqrt{\frac{2}{D}} \bigl[\cos(\rho_1^\top x + b_1), \dots, \cos(\rho_D^\top x + b_D)\bigr]^\top, \tag{5}$$

where $b_1, \dots, b_D$ are i.i.d. samples drawn from a uniform distribution on $[0, 2\pi]$. $\hat{\phi}$ maps the data from its original $d$-dimensional input space to the new $D$-dimensional feature space. Their approach is based on the fact that a continuous positive-definite translation-invariant kernel function is the Fourier transform of a nonnegative measure. The uniform convergence of the approximation form to the kernel function has also been proved in [24]. In our context, the kernel function refers to the RBF kernel function.
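The random-feature approximation described above can be checked numerically with a short Python sketch. It assumes the Gaussian form of the RBF kernel with bandwidth `sigma`, for which the Fourier transform is a Gaussian with standard deviation `1 / sigma` per coordinate, and it uses the cosine features with uniform phase offsets from Rahimi and Recht's construction. All names and the test points are hypothetical; this illustrates random sampling of the frequency vectors, not the public-data optimization proposed in this paper.

```python
import math
import random


def rbf_kernel(x, y, sigma):
    """Exact RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sample_frequencies(num_features, dim, sigma, rng):
    """Draw frequency vectors i.i.d. from N(0, sigma^-2 I)."""
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
            for _ in range(num_features)]


def sample_offsets(num_features, rng):
    """Draw phase offsets i.i.d. from the uniform distribution on [0, 2 pi]."""
    return [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]


def random_feature_map(x, freqs, offsets):
    """Map x to sqrt(2 / D) * [cos(rho_i . x + b_i)] for i = 1..D."""
    scale = math.sqrt(2.0 / len(freqs))
    return [scale * math.cos(sum(r * v for r, v in zip(rho, x)) + b)
            for rho, b in zip(freqs, offsets)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    rng = random.Random(42)
    sigma, dim, num_features = 1.5, 3, 5000
    freqs = sample_frequencies(num_features, dim, sigma, rng)
    offsets = sample_offsets(num_features, rng)
    x, y = [0.2, -0.4, 1.0], [0.5, 0.1, 0.3]
    approx = dot(random_feature_map(x, freqs, offsets),
                 random_feature_map(y, freqs, offsets))
    print(approx, rbf_kernel(x, y, sigma))
```

The inner product of the mapped points approaches the exact kernel value as the number of features grows, at the cost of a larger feature space and therefore more noise coordinates in the private step.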

In our problem setting, since a small amount of public data $\mathcal{D}_{\text{pub}}$ is available in $\mathcal{D}$ and only the vectors $\rho_1, \dots, \rho_D$ are needed to construct the random finite-dimensional $\hat{\phi}$, we can compute these vectors from the public data by solving the optimization problem (6). Since (6) is an unconstrained nonlinear optimization problem, we solve it using the L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno) algorithm.

Thus, by using the public data to compute $\rho$, we can obtain a more accurate approximation form of the kernel function than by randomly sampling $\rho$ from the Fourier transform of the kernel function as in [25]. To guarantee differential privacy, we need only consider the data-dependent weight parameter $w$. Fortunately, we can employ the differentially private linear SVM approach in [25] to compute $w$ after transforming all private data to a new $2D$-dimensional feature space using the mapping $\hat{\phi}$ defined in (7) with the vectors $\rho_1, \dots, \rho_D$ as follows:

$$\hat{\phi}(x) = \frac{1}{\sqrt{D}} \bigl[\cos(\rho_1^\top x), \sin(\rho_1^\top x), \dots, \cos(\rho_D^\top x), \sin(\rho_D^\top x)\bigr]^\top. \tag{7}$$

4.3. The Computation of $w$

With the vectors $\rho_1, \dots, \rho_D$ to approximate the RBF kernel function, we can convert the RBF kernel SVM in the $d$-dimensional input space into a linear SVM in a new $2D$-dimensional feature space with (7), and then use the privacy-preserving linear SVM algorithm in [25]. The general idea of this algorithm is that, with the transformed $2D$-dimensional private data, we first compute the parameter $\alpha$ in the dual space and then $w$ in the primal space using $w = \sum_{i} \alpha_i y_i \hat{\phi}(x_i)$; then we draw $\mu$ from $\mathrm{Lap}(\lambda)^{2D}$, where $\lambda = \Delta w / \epsilon$, and compute the noisy $\hat{w}$ with $\hat{w} = w + \mu$.
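The transformation and perturbation steps can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: it assumes a paired cosine and sine feature map that yields two features per frequency vector, takes the non-private weight vector `w` and its sensitivity bound as given inputs (the SVM training step is omitted), and uses hypothetical function names throughout.

```python
import math
import random


def feature_map(x, freqs):
    """Map x to 2D features [cos(rho_i . x), sin(rho_i . x)] / sqrt(D)."""
    scale = 1.0 / math.sqrt(len(freqs))
    features = []
    for rho in freqs:
        angle = sum(r * v for r, v in zip(rho, x))
        features.append(scale * math.cos(angle))
        features.append(scale * math.sin(angle))
    return features


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample: difference of two i.i.d. exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def perturb_weights(w, sensitivity, epsilon, rng):
    """Add independent Laplace noise of scale sensitivity / epsilon to each weight."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return [wj + laplace_noise(scale, rng) for wj in w]


def classify(x, noisy_w, freqs):
    """Predict +1 or -1 from the sign of <noisy_w, feature_map(x)>."""
    score = sum(wj * fj for wj, fj in zip(noisy_w, feature_map(x, freqs)))
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    rng = random.Random(1)
    freqs = [[1.0, 0.0], [0.0, 2.0]]  # D = 2 toy frequency vectors
    w = [0.4, -0.1, 0.3, 0.2]         # toy 2D-dimensional weights, not trained
    noisy_w = perturb_weights(w, sensitivity=0.05, epsilon=1.0, rng=rng)
    print(classify([0.5, 0.25], noisy_w, freqs))
```

Because the frequency vectors are computed from public data only, releasing them alongside the noisy weights does not consume any privacy budget; only the weight vector needs perturbation.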

5. Experiments

In this section, we experimentally evaluate our hybrid SVM and compare it with one state-of-the-art method, called private SVM, and one baseline method. We evaluate the utility of the trained SVM classifier using the AUC metric. Hybrid SVM and private SVM are implemented in MATLAB R2010b, and all experiments were performed on a PC with a 3.2 GHz CPU and 8 GB of RAM.

5.1. Experiment Setup

Datasets. We used two open source datasets from the Integrated Public Use Microdata Series (Minnesota Population Center, Integrated public use microdata series—international: Version 5.0, 2009, https://international.ipums.org), the US and Brazil census datasets with 370,000 and 190,000 records collected in the US and Brazil, respectively. One motivation for using these public datasets is that they bear similar attributes (e.g., demographic features) to some medical records but are publicly available for testing and comparisons. From each dataset, we selected 40,000 records, with 10,000 records serving as the public data pool. There were 13 attributes in both datasets, namely, age, gender, marital status, education, disability, nationality, working hours per week, number of years residing in the current location, ownership of dwelling, family size, number of children, number of automobiles, and annual income. Among these attributes, marital status is the only categorical attribute containing more than 2 values, that is, single, married, and divorced/widowed. Because SVMs do not handle categorical features by default, we transformed marital status into two binary attributes, is single and is married (an individual divorced or widowed would have false on both of these attributes). With this transformation, our two datasets had 14 dimensions. For each dataset, we randomly extract a subset of the original data as a public data pool, from which public data is sampled uniformly, and use the remaining 30,000 tuples as the private data.
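The encoding of marital status into two binary attributes can be sketched as below. The function names and the list-based record layout are assumptions made for illustration; the paper does not specify how records are stored.

```python
MARITAL_VALUES = ("single", "married", "divorced/widowed")


def encode_marital_status(status):
    """Return (is_single, is_married); divorced/widowed maps to (0, 0)."""
    value = status.strip().lower()
    if value not in MARITAL_VALUES:
        raise ValueError("unknown marital status: " + status)
    return (1 if value == "single" else 0, 1 if value == "married" else 0)


def expand_record(record, marital_index):
    """Replace the marital-status field of a record with the two binary attributes."""
    is_single, is_married = encode_marital_status(record[marital_index])
    return record[:marital_index] + [is_single, is_married] + record[marital_index + 1:]


if __name__ == "__main__":
    # A toy record with 13 fields; marital status is at index 2.
    record = [34, 1, "married", 12, 0, 1, 40, 5, 1, 3, 1, 2, 30000]
    print(expand_record(record, marital_index=2))
```

Using two indicators instead of three avoids a redundant column, since the third category is implied when both indicators are zero.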

Comparison. We experimentally compared the performance of our hybrid SVM against two approaches, namely, the public data baseline and private SVM [25]. The public data baseline is an RBF kernel SVM that uses only public data. In our experiment figures, we use “Public—#” to denote the public data baseline method with # as the size of public data. The private SVM is a state-of-the-art differentially private RBF kernel SVM that uses private data only. The parameters in all methods are set to optimal values.

Metrics. We used the other attributes to predict the value of annual income by converting annual income into a binary attribute: values higher than a predefined threshold were mapped to 1, and otherwise to −1. Here, we set the predefined threshold as the median value of annual income. The classification accuracy was measured by the AUC (the area under an ROC curve) [26]. Boxplots were used to measure the stability of our method and private SVM. The boxplots of “Public—50,” “Public—100,” and “Public—200” are qualitatively similar to those of our hybrid SVM; hence, we do not report boxplots of these baseline methods. We performed 10-fold cross-validation 10 times for each algorithm and reported the average results. We varied three different parameters: the privacy budget $\epsilon$, the dataset dimensionality, and the data cardinality (i.e., the size of training data). To vary the data cardinality parameter, we randomly generate subsets of records in the training record set, with the sampling rate varying from 0.1 to 1. For the data dimensionalities 5, 8, 11, and 14, we select three attribute subsets in the US and Brazil datasets for classification, in addition to the full set of 14 dimensions. The 5-dimensional subset includes age, gender, education, family size, and annual income. The 8-dimensional subset contains the previous five attributes and, additionally, nativity, ownership of dwelling, and number of automobiles. The 11-dimensional subset consists of all the attributes in the 8-dimensional subset plus is single, is married, and number of children. Table 1 summarizes the parameters and their default values in the experiments.
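The label construction and the AUC metric described above can be sketched in a few lines of Python. The function names are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) formulation, which equals the area under the ROC curve; the paper does not state which implementation was used.

```python
import statistics


def binarize_at_median(values):
    """Map values above the median to +1 and all others to -1."""
    threshold = statistics.median(values)
    return [1 if v > threshold else -1 for v in values]


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    incomes = [10, 20, 30, 40, 50, 60]           # toy annual incomes
    labels = binarize_at_median(incomes)          # median is 35
    scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9]      # toy classifier scores
    print(labels, auc(labels, scores))
```

The quadratic pairwise loop is kept for clarity; a rank-based computation would be preferable for training sets with tens of thousands of records.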

5.2. AUC versus Privacy Budget

Figures 3 and 4 illustrate the AUCs of each method under various privacy budgets from 0.5 to 4, where “Public—#” denotes the public data baseline methods with various sizes of public data. Observe that our hybrid SVM outperforms the private SVM and performs better than the public data baseline trained on the public data alone. The AUC of our method remains stable under all privacy budgets and is very close to that of the public data baseline that uses the complete private data set as public data.

5.3. AUC versus Dataset Dimensionality

Figures 5 and 6 present the AUCs of each algorithm as a function of the dataset dimensionality for the US and Brazil datasets. With a higher number of dimensions, the AUCs of the hybrid SVM and of the SVM that uses the public data (baseline) increase. This is reasonable because the training data size with the default value being 27,000 is much larger than the number of data dimensions which are at most 14. When the number of dimensions grows, the performance improves. In contrast, the performance of the private SVM degrades in 14 dimensions with poor boxplots because more noise is introduced with higher dimensions.

5.4. AUC versus Data Cardinality

Figures 7 and 8 investigate the relationship between the sampling rate and the AUC of hybrid and private SVMs. From the figures, our method consistently outperformed the private SVM at different sampling rates. It is worth mentioning that the AUCs of the hybrid SVM are large even at small sampling rates and tend to stabilize when the size of training data grows (i.e., at large sampling rates). The boxplots reflect that the private SVM has larger variance than the hybrid SVM, because private SVM selects the values of $\rho$ randomly from the Fourier transform of the RBF kernel. In contrast, hybrid SVM computes $\rho$ via the public data. This helps improve the accuracy of $\hat{\phi}$ and leads to less variance.

5.5. Computation Time

Finally, Figure 9 shows the time cost of our proposed algorithm with varying dimensions and different sampling rates. We only report the results for the US dataset; the results for the Brazil dataset are very similar. One can notice that the dimensionality, rather than the sampling rate, determines the computational cost of the hybrid SVM. The overhead of the hybrid SVM comes from computing $\rho$ with the public data, since a nonlinear optimization problem needs to be solved. Like other private SVM methods, our hybrid SVM is intended for off-line use, and hence the time is generally acceptable even for 14-dimensional datasets.

6. Discussion and Conclusion

We proposed and developed an RBF kernel SVM using a small amount of public data and a large amount of private data to preserve differential privacy with improved utility. In this algorithm, we use public data to compute the parameters in an approximation form of the RBF kernel function and then train private classifiers with linear SVM after converting all private data into a new feature space defined by the approximation form. A limitation of our approach is that we used the L-BFGS method [27], which is not very efficient, to find the optimal solution. Because the objective function in (6) is not convex, computing even locally optimal values is computationally intensive, especially when the size of the public data set is large. We will develop more efficient methods and test the model on clinical records in future work. Another limitation is that we assume all original data from different data sets follow some unknown joint multivariate distribution. This assumption might not always be true in practice, and calibration needs to be investigated in future work. That is, in the presence of distributional differences, we will leverage transfer learning to build the global model.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Lucila Ohno-Machado and Xiaoqian Jiang are partially supported by NLM (R00LM011392) and iDASH (NIH Grant U54HL108460).