Abstract

Support vector machines (SVMs) are among the most robust and accurate of the well-known machine learning algorithms, especially for classification. An SVM trains a classification model by solving an optimization problem that decides which instances in the training dataset become the support vectors (SVs). However, the SVs are intact instances taken from the training dataset, so directly releasing the classification model of an SVM carries a significant privacy risk for individuals when the training dataset contains sensitive information. In this paper, we study how to release the classification model of kernel SVMs while preventing privacy leakage of the SVs and satisfying the requirement of privacy protection. We propose a new differentially private algorithm for kernel SVMs, named DPKSVMEL, based on an exponential and Laplace hybrid mechanism. The DPKSVMEL algorithm has two major advantages over existing private SVM algorithms. First, it protects the privacy of the SVs by postprocessing, so the training process of the non-private kernel SVM does not change. Second, the scoring function values are derived directly from the symmetric kernel matrix generated during training and require neither additional storage space nor complex sensitivity analysis. In the DPKSVMEL algorithm, we define a similarity parameter that denotes the correlation, or distance, between the non-SVs and every SV. Every non-SV is then assigned to a group around one of the SVs according to the maximal value of this similarity. For a given similarity threshold, if the number of non-SVs in a group is greater than k, we replace the SV with the mean value of k of the most similar non-SVs randomly selected by the exponential mechanism; otherwise, we add random noise to the SV by the Laplace mechanism. We theoretically prove that the DPKSVMEL algorithm satisfies differential privacy. Extensive experiments on real datasets show the effectiveness of the DPKSVMEL algorithm for kernel SVMs; meanwhile, it achieves higher classification accuracy than existing private SVM algorithms.

1. Introduction

In recent years, with the rapid growth of computing devices' capabilities for collecting, storing, and processing data, data sharing and analysis have become easier and more practical [1]. Data mining and machine learning techniques have attracted a great deal of attention for extracting useful information. The classification algorithm, one of the important data mining tasks, trains a classification model from labeled training datasets to classify unknown data in the future [2]. The support vector machine (SVM) [3, 4] is one of the most widely used machine learning algorithms for classification in practice [5]. SVMs train a classification model by solving a convex optimization problem. Like most other classification algorithms, SVMs raise privacy issues when the training datasets contain sensitive information such as user behavior records or electronic health records. In SVMs, the support vectors (SVs) are an important component of the classification model, and they are intact instances taken from the training datasets. Directly releasing the classification model of an SVM therefore carries a significant privacy risk for individuals, especially for kernel SVMs [2].

More and more researchers have made great efforts to address the privacy leakage problem. Differential privacy (DP) [6–8] is one of the state-of-the-art models and has become an accepted standard for privacy protection in sensitive data analysis since it was introduced in a series of works by Dwork et al. in 2006. Since then, two main research directions of DP have developed: differentially private data publishing and differentially private data analysis [9]. Differentially private data publishing aims to release aggregate information to the public without disclosing any individual record, including transaction data publishing [10], histogram publishing [11], stream data publishing [12], graph data publishing [13], batch query publishing [14], and synthetic dataset publishing [15]. The essential task of differentially private data analysis is extending current non-private algorithms into differentially private ones, including supervised learning [16], unsupervised learning [17], and frequent pattern mining [18].

In this paper, we study how to release the classification model of kernel SVMs while satisfying the requirement of privacy protection. To overcome the shortcomings of existing private SVM algorithms, such as the requirements on the differentiability of the objective function and low classification accuracy, we propose a new differentially private algorithm for kernel SVMs. The main contributions of this paper are summarized as follows:
(i) We propose an exponential and Laplace hybrid mechanism to prevent privacy leakage of the SVs. The hybrid mechanism takes advantage of both the exponential mechanism and the Laplace mechanism to improve classification accuracy.
(ii) We define a similarity parameter that denotes the correlation or distance between the non-SVs and every SV. It can be obtained easily from the symmetric kernel matrix produced during the training process. Every non-SV is then assigned to a group with one of the SVs according to the maximal value of the similarity.
(iii) Borrowing the idea of top-k frequent pattern mining [1, 19, 20], we use different methods to protect the privacy of the SVs under a given similarity threshold. When the number of non-SVs within a group is greater than k, we replace the SV with the mean value of k of the most similar non-SVs randomly selected by the exponential mechanism. Otherwise, we add random noise to the SV by the Laplace mechanism.
(iv) We theoretically prove that the DPKSVMEL algorithm satisfies DP. Extensive experiments on real datasets show the effectiveness of the DPKSVMEL algorithm for kernel SVMs; meanwhile, it achieves higher classification accuracy than existing private SVM algorithms.

The rest of the paper is organized as follows: In Section 2, we discuss the work related to private SVMs. In Section 3, we give a brief overview of the basic knowledge of SVMs, DP, and top-k frequent pattern mining. Section 4 proposes the DPKSVMEL algorithm, and Section 5 gives the experimental performance evaluation of the DPKSVMEL algorithm. Section 6 concludes the research work.

2. Related Work

In this section, we briefly review some work on private SVMs and then focus on the work related to differentially private SVMs.

Some works address private SVMs without DP. Mangasarian et al. [21] proposed PPSVM, a highly efficient privacy-preserving SVM based on random kernels for vertically partitioned data. Lin et al. [2] pointed out the privacy violation caused by the SVs in the classification model of an SVM and proposed PPSVC, a privacy-preserving SVM classifier that replaces the Gaussian kernel decision function with a precise approximation. These two methods achieve classification accuracy similar to the original non-private SVM classifier. Nevertheless, their degree of privacy protection cannot be formally proved, unlike that of private SVMs based on DP.

DP is a rigorous privacy definition and has become an accepted standard for privacy protection in sensitive data analysis. The degree of privacy protection is measured by the privacy budget parameter ε, and the classification model of an SVM should be released under a DP guarantee. Chaudhuri et al. [22, 23] proposed two popular perturbation-based techniques, output perturbation and objective perturbation, for designing privacy-preserving machine learning algorithms. Output perturbation introduces randomness into the weight vector after the optimization process, with a noise scale determined by the sensitivity of the weight vector; objective perturbation introduces randomness into the objective function before the optimization process, with a noise scale independent of that sensitivity. These two perturbation-based techniques have been applied to logistic regression and linear SVM algorithms. However, the sensitivity is difficult to analyze, and objective perturbation requires the loss function to satisfy certain convexity and differentiability criteria. For nonlinear kernel SVMs, Chaudhuri et al. used a random projection method to approximate the kernel function, transforming the problem into linear classification and avoiding the direct publication of private values from the training datasets. The disadvantage of this method is that the projection matrix must be released along with the private classification model for prediction, which increases the risk of privacy leakage; furthermore, choosing an appropriate projection dimension is also an issue. Rubinstein et al. [24] proposed two mechanisms for differentially private SVM learning, one with finite-dimensional feature mappings and one with potentially infinite-dimensional feature mappings. Both mechanisms add noise to the output classifier and are effective for all convex loss functions, including the most common hinge loss. They also proposed a utility metric that compares the similarity of the classifiers released by private and non-private SVMs. However, their mechanisms are valid only for translation-invariant kernels. Li et al. [25] developed a hybrid private SVM model that uses public data and private data together, leveraging a small portion of open-consented data to calculate the Fourier transformation and thereby alleviate excessive noise in the final outputs. However, such public data are hard to obtain in practice. Liu et al. [26] proposed LabSam, a private classification algorithm with high classification accuracy under DP when the labeled data are limited and the privacy budget is small; it implements random sampling under the exponential mechanism, differing from the perturbation-based methods. Zhang et al. [27] constructed DPSVMDVP, a novel private SVM classifier based on dual variable perturbation, which adds Laplace noise to the corresponding dual variables according to the ratio of errors.

3. Preliminaries

In this section, we give a brief overview of the basic knowledge of SVMs, DP, and top-k frequent pattern mining.

3.1. Support Vector Machines

The SVM is an efficient learning method for classification based on structural risk minimization [3]. It aims to find an optimal separating hyperplane with a maximal margin to separate two classes of the given instances. The maximal margin corresponds to the shortest distance between the closest data points and any point on the hyperplane. Xue et al. [28] described the complete calculation process of the decision function in detail. The main task in training an SVM is to solve the following quadratic programming optimization problem [29, 30]:

$\min_{\alpha} \; \frac{1}{2}\alpha^{T} Q \alpha - e^{T}\alpha \quad \text{subject to} \quad y^{T}\alpha = 0, \; 0 \le \alpha_i \le C, \; i = 1, \ldots, n. \qquad (1)$

In equation (1), Q denotes a symmetric kernel matrix with $Q_{ij} = y_i y_j K(x_i, x_j)$, K is the kernel function, e is the vector of all ones, C is the penalty parameter, α is the dual vector, and $x_i$ and $y_i$ denote the training instance and its label, respectively. The optimization problem in equation (1) can be solved efficiently by the sequential minimal optimization algorithm [30]. After the optimization process, we obtain the decision function:

$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b\right), \qquad (2)$

where only the instances with $\alpha_i > 0$, namely the SVs, contribute to the sum. From equation (2), we can conclude that the classification model of an SVM is composed of the dual variables α and the SVs. It is therefore a very serious privacy issue to directly release a classification model that contains original instances from the training datasets.
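To make this concrete, the following minimal sketch (using scikit-learn's SVC, which wraps LIBSVM; the synthetic data and parameter values are illustrative assumptions, not the paper's experimental setup) shows that the released model literally stores training instances as its SVs:

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative synthetic data; any sensitive dataset would raise the same issue.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

# Train a kernel SVM (RBF kernel), i.e., solve the dual problem of equation (1).
clf = SVC(kernel="rbf", gamma=1.0 / X.shape[1], C=1.0).fit(X, y)

# The classification model of equation (2) consists of dual variables and SVs.
print("dual coefficients y_i * alpha_i:", clf.dual_coef_.shape)
print("SVs are verbatim training rows:",
      np.allclose(clf.support_vectors_, X[clf.support_]))
```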

3.2. Differential Privacy

With the advent of the digital age, more and more personal information is collected and shared by mobile devices and web services to improve the quality of these services. At the same time, this raises privacy concerns among data contributors. DP [6–8, 31] provides a mathematically rigorous definition of privacy for private data analysis. It guarantees that the distribution over possible outcomes of the analysis changes very little whether or not any single individual participates in the database. The maximal difference between the outcome distributions is controlled by a small privacy budget parameter ε. Formally, the definitions related to DP are given in the following.

Definition 1. (ε-DP [6]). A randomized algorithm K satisfies ε-DP if, for all datasets D and D′ that differ on at most one instance and for all subsets of possible outcomes $S \subseteq \operatorname{Range}(K)$,

$\Pr[K(D) \in S] \le e^{\varepsilon} \cdot \Pr[K(D') \in S]. \qquad (3)$

Definition 2. (sensitivity [6]). For a given query function f and neighboring datasets D and D′, the sensitivity of f is defined as

$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_{1}. \qquad (4)$

Currently, there are two principal mechanisms used for realizing DP: the Laplace mechanism for numerical queries and the exponential mechanism for nonnumerical queries.

Definition 3. (Laplace mechanism [8]). For a numeric function $f: D \to \mathbb{R}^{d}$, the algorithm K that answers f as in equation (5) provides ε-DP:

$K(D) = f(D) + \operatorname{Lap}(\Delta f / \varepsilon), \qquad (5)$

where $\operatorname{Lap}(\Delta f / \varepsilon)$ is a random variable sampled from the Laplace distribution with mean 0 and scale $\Delta f / \varepsilon$.
The Laplace mechanism retrieves the true result of the numerical query and then perturbs it by adding independent random noise calibrated to the sensitivity.
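A minimal sketch of the Laplace mechanism (the helper name and example query are ours, not from the paper):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=np.random.default_rng()):
    """Perturb the true query answer with Laplace noise of scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon,
                                    size=np.shape(true_value))

# Example: a counting query has sensitivity 1.
noisy_count = laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.5)
```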

Definition 4. (exponential mechanism [7]). Let $q(D, r)$ be a scoring function on a dataset D that measures the quality of an output r, and let $\Delta q$ denote its sensitivity. The algorithm K satisfies ε-DP if it outputs r with probability

$\Pr[K(D) = r] \propto \exp\left(\frac{\varepsilon \, q(D, r)}{2\Delta q}\right). \qquad (6)$

The exponential mechanism is useful for selecting a discrete output in a differentially private manner; it employs the scoring function q to evaluate the quality of an output r, and every output is selected with a nonzero probability.
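A minimal sketch of the exponential mechanism with the weighting of equation (6) (the helper name is ours):

```python
import numpy as np

def exponential_mechanism(candidates, scores, sensitivity, epsilon,
                          rng=np.random.default_rng()):
    """Select one candidate with probability proportional to exp(eps * score / (2 * sensitivity))."""
    scores = np.asarray(scores, dtype=float)
    # Shift by the maximum score for numerical stability; the probabilities are unchanged.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Example: scores in [0, 1] have sensitivity 1.
best = exponential_mechanism(["a", "b", "c"], scores=[0.9, 0.5, 0.1],
                             sensitivity=1.0, epsilon=1.0)
```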

Definition 5. (composition properties [31, 32]). Let K1 and K2 be ε1-DP and ε2-DP algorithms, respectively. Then, we have the following:
Sequential composition: releasing the outputs of K1 (D) and K2 (D) satisfies (ε1 + ε2)-DP.
Parallel composition: for disjoint datasets D1 and D2, releasing K1 (D1) and K2 (D2) satisfies max(ε1, ε2)-DP.

3.3. Top-k Frequent Pattern Mining

Frequent pattern mining aims to discover items that frequently appear together in a transaction dataset. Directly releasing the discovered frequent patterns with their support counts would violate individual privacy. Therefore, the top-k most frequent patterns should be released under a DP guarantee. Zhang et al. [1] proposed the DFP-Growth algorithm, which accurately finds the top-k frequent patterns with noisy support counts while satisfying DP. The DFP-Growth algorithm performs two key steps: first, it selects the top-k frequent patterns by the exponential mechanism; second, it perturbs the true support count of each selected pattern by the Laplace mechanism.
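The sketch below illustrates this select-then-perturb idea only; it is not the actual DFP-Growth implementation, and the budget split, helper name, and sensitivity values are our assumptions:

```python
import numpy as np

def dp_top_k(patterns, counts, k, epsilon, rng=np.random.default_rng()):
    """Select k patterns via the exponential mechanism (score = support count, sensitivity 1),
    then release their counts with Laplace noise; the budget is split between the two steps."""
    eps_select, eps_perturb = epsilon / 2.0, epsilon / 2.0
    counts = np.asarray(counts, dtype=float)
    remaining = list(range(len(patterns)))
    chosen = []
    for _ in range(k):  # k selections, each with budget eps_select / k
        scores = counts[remaining]
        w = np.exp((eps_select / k) * (scores - scores.max()) / 2.0)
        chosen.append(remaining.pop(rng.choice(len(remaining), p=w / w.sum())))
    # Adding one transaction can change all k counts, so each count gets Laplace noise of scale k / eps_perturb.
    noisy = counts[chosen] + rng.laplace(0.0, k / eps_perturb, size=k)
    return [(patterns[i], c) for i, c in zip(chosen, noisy)]
```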

4. Materials and Methods

To solve the privacy leakage problem of the SVs in the classification model of kernel SVMs, we propose the DPKSVMEL algorithm based on an exponential and Laplace hybrid mechanism. The privacy of the SVs is protected by postprocessing the non-private classification model under DP, while the training process of the original SVM is unchanged. First, we train a non-private kernel SVM to obtain a classification model consisting of the dual vector α and the SVs. Second, we define a similarity parameter that denotes the correlation or distance between the non-SVs and every SV, and every non-SV is assigned to a group with one of the SVs according to the maximal value of the similarity. Third, for a given similarity threshold, we use either the exponential mechanism or the Laplace mechanism to generate a new SV that replaces the original SV within each group; which mechanism is used depends on whether the number of non-SVs in the group is greater than k. Lastly, we output the classification model with the private SVs. Figure 1 gives an example of the DPKSVMEL algorithm's implementation process.

In Figure 1, there are three SVs and eight non-SVs. A square represents an SV, a small circle represents a non-SV, a triangle represents a private SV, and a big circle represents a group. Every group is shown in a different color and viewed as a hypersphere with the SV as the center and the similarity threshold as the radius. Every non-SV is assigned to one group according to the maximal value of its similarity with every SV. In particular, non-SVs located in the intersection of multiple groups are still assigned to only one group, so that the parallel composition property of DP is satisfied. We set the parameter k to 2 in this example. In the red and yellow groups, the number of non-SVs is greater than k, so we use the exponential mechanism to randomly select the two most similar non-SVs and generate a new private SV as their mean value. In the blue group, there is only one non-SV, so we use the Laplace mechanism to generate a new private SV by adding noise. Therefore, the SVs in the final classification model are all private ones, which prevents privacy leakage.

4.1. Similarity Parameter and Sensitivity

In the DPKSVMEL algorithm, the similarity is a vital parameter. We view the symmetric kernel matrix Q in equation (1) as the probability of similarity between every two instances in the datasets, especially for the radial basis kernel.

Definition 6. (similarity). For a non-SV $x_i$ and an SV $x_j$, the similarity between them is defined as

$\operatorname{Similarity}(x_i, x_j) = y_i y_j K(x_i, x_j) = y_i y_j \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right). \qquad (7)$

In equation (7), Similarity is a subset of Q. It is obtained easily from the classification model and requires no extra complicated computation. The smaller the distance between a non-SV and an SV, the greater the value of the Similarity when they have the same label. If they have different labels, the value of the Similarity is less than zero, and the corresponding non-SV is discarded from participating in the calculation within the group.

The Similarity is viewed as the probability of the correlation between a non-SV and an SV. In the DPKSVMEL algorithm, we set a lower limit for the Similarity, named LLs, to denote the minimum value of the correlation or, equivalently, the maximum value of the distance between them. If the value of the Similarity is less than LLs, the correlation is too small or the distance is too large between the non-SV and the SV, and the non-SV is also discarded from the group. After all the non-SVs are divided into groups, we use the exponential and Laplace hybrid mechanism to generate a new SV in every group. When LLs is fixed, the radius of the hypersphere corresponding to each group is determined, and the sensitivities of the exponential mechanism and the Laplace mechanism can be calculated easily from LLs. They are denoted by Sensitivityem and Sensitivitylm, respectively.

In the exponential mechanism, we use the Similarity as the scoring function. Under a fixed LLs, the maximum value of the similarity is 1, attained when a non-SV coincides with the SV, and the minimum value within the group is LLs. Therefore, the sensitivity of the exponential mechanism is

$\text{Sensitivity}_{em} = 1 - LLs. \qquad (8)$

In the Laplace mechanism, we define the radius R of the hypersphere as the maximal distance between the non-SVs and the SV within a group, which corresponds to the lower limit LLs of the Similarity. The relationship between LLs and R follows from equation (7) in that

$LLs = \exp\left(-\gamma R^{2}\right),$

where γ is a scale parameter with a default value of 1/n for a dataset with n attributes. Then,

$R = \sqrt{-\log(LLs)/\gamma},$

where R denotes the maximal distance between every non-SV and the SV within the group. As all the attributes in a dataset are treated as independent, the biggest change of any single attribute within a group is at most $\sqrt{-\log(LLs)}$, based on the formula for the distance between two points. Therefore, the sensitivity of the Laplace mechanism is

$\text{Sensitivity}_{lm} = \sqrt{-\log(LLs)}. \qquad (9)$
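As a small sketch of these quantities (the helper names are ours, and the formulas follow equations (8) and (9) as reconstructed above):

```python
import numpy as np

def sensitivities_from_lls(lls):
    """Exponential- and Laplace-mechanism sensitivities implied by the similarity lower limit LLs."""
    assert 0.0 < lls < 1.0
    sensitivity_em = 1.0 - lls              # equation (8): score range within a group is [LLs, 1]
    sensitivity_lm = np.sqrt(-np.log(lls))  # equation (9): per-attribute change bound
    return sensitivity_em, sensitivity_lm

def group_radius(lls, gamma):
    """Hypersphere radius R implied by LLs for an RBF kernel with parameter gamma (default 1/n)."""
    return np.sqrt(-np.log(lls) / gamma)

# Example: LLs = 0.5 gives Sensitivity_em = 0.5 and Sensitivity_lm ~= 0.83.
print(sensitivities_from_lls(0.5), group_radius(0.5, gamma=1.0 / 10))
```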

4.2. Privacy Budget Allocation

In a DP algorithm, the privacy budget ε is another vital parameter. It determines the level of privacy protection of a randomized algorithm: the smaller the privacy budget, the higher the level of privacy protection. When the allocated privacy budget runs out, the randomized algorithm K loses its privacy protection. In the DPKSVMEL algorithm, every non-SV is assigned to a group with one of the SVs, and there are no common instances between groups. Therefore, the DPKSVMEL algorithm satisfies the parallel composition property of DP, and there is no need to split the privacy budget between the exponential mechanism and the Laplace mechanism.

4.3. Description of the DPKSVMEL Algorithm

In the DPKSVMEL algorithm, DP is achieved by the exponential and Laplace hybrid mechanism. The description of the DPKSVMEL algorithm is shown in Algorithm 1.

Input: Q: symmetric kernel matrix; ɛ: privacy budget; LLs: lower limit of the Similarity; Nns: the number of non-SVs in a group; k: the number of non-SVs selected in the exponential mechanism;
Output: SVp: private SV;
Begin
(1) obtain a non-private classification model including dual vector α and the SVs by training a kernel SVM;
(2) get the Similarity matrix from the subset of Q in which the Similarity value was no less than LLs;
(3) divide every non-SV into one group according to the maximal value of its similarity with every SV;
(4)  for i in every group
(5)   if Nns > k then
(6)    compute the probability Prns for every non-SVs with its Similarity value;
(7)    randomly select the most similar k non-SVs with probability Prns by the exponential mechanism;
(8)    SVpi = the mean value of the selected k non-SVs;
(9)   else
(10)    for every attribute of the SV
(11)    SVpij = SVij + Laplace (Sensitivitylm/ɛ);
(12)   end for
(13)  end if
(14) end for
(15) output the private classification model with SVp;
End

The DPKSVMEL algorithm protects the privacy of the SVs by postprocessing the non-private classification model. It builds one group for every SV and assigns every non-SV to one of the groups according to its similarity. A private SV is constructed by the exponential mechanism when the number of non-SVs within the group is greater than k; otherwise, it is constructed by the Laplace mechanism. Finally, the DPKSVMEL algorithm outputs the private classification model. Because the running time of the postprocessing depends only on the number of SVs, its time complexity is much lower than O(n), where n denotes the number of instances.
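The following Python sketch illustrates our reading of Algorithm 1; it is a simplified illustration, not the authors' implementation. The scikit-learn interface, the variable names, the use of sampling without replacement for the k selections, and labels in {−1, +1} are our assumptions, and the sensitivities follow equations (8) and (9):

```python
import numpy as np
from sklearn.svm import SVC

def dpksvmel_postprocess(X, y, epsilon, lls=0.7, k=2, gamma=None, C=1.0,
                         rng=np.random.default_rng()):
    """Train a non-private RBF SVM, group non-SVs around each SV by similarity, and
    replace every SV either with the mean of k non-SVs chosen by the exponential
    mechanism or with a Laplace-noised copy of itself.
    X: (n, d) NumPy array; y: labels in {-1, +1}."""
    n, d = X.shape
    gamma = 1.0 / d if gamma is None else gamma
    clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)        # step (1)

    sv_idx = clf.support_
    non_sv_idx = np.setdiff1d(np.arange(n), sv_idx)

    # Similarity(x_i, x_j) = y_i * y_j * exp(-gamma * ||x_i - x_j||^2), equation (7).
    diff = X[non_sv_idx, None, :] - X[None, sv_idx, :]
    sim = (y[non_sv_idx, None] * y[None, sv_idx]) * np.exp(-gamma * (diff ** 2).sum(-1))

    sens_em = 1.0 - lls                    # equation (8)
    sens_lm = np.sqrt(-np.log(lls))        # equation (9)

    private_svs = np.empty_like(clf.support_vectors_)
    for j in range(len(sv_idx)):                               # step (4): every group
        # steps (2)-(3): non-SVs whose highest similarity is with this SV and >= LLs
        members = np.where((sim.argmax(axis=1) == j) & (sim[:, j] >= lls))[0]
        if len(members) > k:                                   # exponential mechanism
            scores = sim[members, j]
            w = np.exp(epsilon * (scores - scores.max()) / (2.0 * sens_em))
            chosen = rng.choice(members, size=k, replace=False, p=w / w.sum())
            private_svs[j] = X[non_sv_idx[chosen]].mean(axis=0)
        else:                                                  # Laplace mechanism
            private_svs[j] = clf.support_vectors_[j] + rng.laplace(0.0, sens_lm / epsilon, size=d)
    return clf, private_svs
```

Each group consumes the full privacy budget ε in this sketch, in line with the parallel composition argument of Section 4.2.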

4.4. Privacy Analysis

In the DPKSVMEL algorithm, randomness is introduced by the exponential and Laplace hybrid mechanism. According to the definition of DP, we prove that the DPKSVMEL algorithm satisfies DP by Theorem 1.

Theorem 1. DPKSVMEL algorithm satisfies DP.

Proof. In the DPKSVMEL algorithm, DP is achieved by postprocessing the non-private classification model. Every SV is viewed as the center of a group, and there is no intersection between groups. We consider the impact on the classification model of adding one instance to the dataset in three cases. First, the new instance becomes an SV; then one new group has to be handled by the Laplace mechanism. Second, the new instance is a non-SV and is assigned to a group, which only adds one non-SV that may be randomly selected by the exponential mechanism. Third, the new instance is a non-SV and does not belong to any group, so the classification model does not change. Based on the sensitivity computations in equations (8) and (9), either the exponential mechanism or the Laplace mechanism applied to one group satisfies DP. By the parallel composition property of DP, the DPKSVMEL algorithm satisfies DP.

5. Results

In this section, we compare the performance of the DPKSVMEL algorithm with the most recent private SVM algorithms, LabSam [26] and DPSVMDVP [27]. PrivateSVM [24] does not provide practical, comparable experimental results, and the experimental datasets used by the hybrid SVM [25] can no longer be obtained.

5.1. Datasets

The datasets in our experiments are commonly used for testing SVM algorithms’ performance and are available at https://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/. Table 1 shows the basic information of the eight datasets and classification accuracy of the non-private SVM with the default parameters based on LIBSVM (version 3.24) [33]. We use the radial basis function as the kernel function in the experiments.
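As an illustration of how such a baseline can be reproduced (the file name is a placeholder for any of the LIBSVM-format files available at the URL above, and the scikit-learn interface is our assumption; the paper itself uses LIBSVM 3.24 directly):

```python
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder file name; download any dataset in LIBSVM format from the site above.
X, y = load_svmlight_file("australian_scale")

# Mimic the LIBSVM defaults: RBF kernel, C = 1, gamma = 1 / n_features.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / X.shape[1])
print("non-private accuracy: %.3f" % cross_val_score(clf, X, y, cv=5).mean())
```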

5.2. Algorithm Performance Experiments

In this section, we evaluate the performance of the DPKSVMEL algorithm by Accuracy and AUC (the area under the ROC curve); the higher their values, the better the usability of the algorithm. To evaluate the algorithm performance under different parameter settings, we set k to 2 and 3, set the privacy budget ε to 0.1, 0.5, and 1, and vary the lower limit of the Similarity from 0.5 to 0.9. To reduce the influence of randomness on the algorithm performance, we run the DPKSVMEL algorithm 10 times under every parameter setting. Tables 2–9 show the mean value, standard deviation, maximum value, and minimum value of Accuracy and AUC on the eight datasets. The values in bold represent the best mean value of Accuracy and AUC under the same privacy budget. The running time of the DPKSVMEL algorithm is shown in Table 10.

Based on the above experimental results, an N-way ANOVA (analysis of variance) was conducted to compare the effects of the three parameters, and of every two-parameter interaction, on Accuracy and AUC. Due to space constraints, we present only the p values of the ANOVA results, as shown in Tables 11 and 12. We can conclude that the effects of the parameters ε and Similarity are statistically significant, whereas the effect of the parameter k is not significant for most of the datasets, and the same holds for every two-parameter interaction.

To observe the experimental performance of the algorithm more intuitively, Figures 2–5 show Accuracy and AUC as bar graphs over the range of Similarity values under different k and ε values on the Australian and Breast datasets. The effect of the three parameters on the experimental performance is consistent with the ANOVA results. The larger the privacy budget ε, the higher the classification accuracy of the algorithm. The effect of the Similarity parameter is mainly determined by which privacy protection mechanism is used: when its value is small, there are more non-SVs within each group and the exponential mechanism plays the greater role; otherwise, the Laplace mechanism plays the greater role. The DPKSVMEL algorithm is more stable on the Breast dataset.

Figures 6–9 compare the performance of the DPKSVMEL algorithm and the non-private SVM under different privacy budgets on the Australian and Breast datasets. As the privacy budget increases, the performance of the DPKSVMEL algorithm gradually reaches, or even exceeds, that of the non-private SVM.

We then compare the Accuracy of the DPKSVMEL algorithm and the LabSam algorithm under different privacy budgets on the Heart, Ionosphere, Sonar, German, and Diabetes datasets in Figures 10–14. Finally, we compare the Accuracy of the DPKSVMEL algorithm and the DPSVMDVP algorithm under different privacy budgets on the Splice dataset in Figure 15. Compared with the LabSam algorithm and the DPSVMDVP algorithm, our DPKSVMEL algorithm has higher classification accuracy and is closer to the non-private SVM under the same privacy budget.

In addition, because the DPKSVMEL algorithm does not change the training process of the classical non-private SVMs, new optimization methods can easily be combined with our proposed algorithm to improve classification accuracy, for example, the CRP algorithm [34] for bilinear analysis and NI-SVM [35] for event analysis tasks.

6. Conclusions

In this paper, we study the privacy problem of the classification model of kernel SVMs. We propose the DPKSVMEL algorithm based on an exponential and Laplace hybrid mechanism. The privacy of the SVs is protected by postprocessing the non-private classification model under DP to prevent privacy leakage of the SVs. The DPKSVMEL algorithm is proved to satisfy DP theoretically and overcomes several shortcomings of existing private SVM algorithms. First, the postprocessing in the DPKSVMEL algorithm does not change the training process of the non-private kernel SVMs, so no complex sensitivity analysis is required, unlike output perturbation and objective perturbation. Second, the DPKSVMEL algorithm avoids the additional risk of privacy disclosure, and the need to choose a projection dimension, caused by transforming nonlinear SVMs into linear SVMs via random projection as in PrivateSVM and the hybrid SVM. Meanwhile, the DPKSVMEL algorithm achieves higher classification accuracy than the most recent private SVM algorithms, LabSam and DPSVMDVP, under the same privacy budget. However, the DPKSVMEL algorithm views the kernel function value as the probability of similarity between a non-SV and an SV, which is valid only for kernel functions with values in the range 0 to 1. Furthermore, the DPKSVMEL algorithm performs poorly on datasets with a high proportion of SVs, especially when the similarity lower limit is small. In the future, we will work in two directions: one is to extend the DPKSVMEL algorithm to more kernel functions; the other is to set different similarity lower limits for different groups.

Data Availability

The datasets in our experiments are commonly used for testing SVM algorithms’ performance and are available at https://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants 61672179, 61370083, 61402126, and 61501275, by the Natural Science Foundation of Heilongjiang Province under Grant F2015030, by the Science Fund for Youths of Heilongjiang Province under Grant QC2016083, by the Postdoctoral Fellowship of Heilongjiang Province under Grant LBH-Z14071, and by the Fundamental Research Funds in Heilongjiang Provincial Universities under Grant 135509312.