Abstract
Support vector machines (SVMs) are among the most robust and accurate of the well-known machine learning algorithms, especially for classification. An SVM trains a classification model by solving an optimization problem that decides which instances in the training dataset are the support vectors (SVs). However, SVs are intact instances taken from the training dataset, so directly releasing the classification model of an SVM carries a significant risk to individual privacy when the training dataset contains sensitive information. In this paper, we study the problem of how to release the classification model of kernel SVMs while preventing privacy leakage of the SVs and satisfying the requirements of privacy protection. We propose a new differentially private algorithm for kernel SVMs, named DPKSVMEL, based on an exponential and Laplace hybrid mechanism. The DPKSVMEL algorithm has two major advantages over existing private SVM algorithms. One is that it protects the privacy of the SVs by post-processing, so the training process of the non-private kernel SVM is unchanged. The other is that the scoring function values are derived directly from the symmetric kernel matrix generated during training, requiring no additional storage space or complex sensitivity analysis. In the DPKSVMEL algorithm, we define a similarity parameter that denotes the correlation, or distance, between the non-SVs and every SV. Each non-SV is then assigned to a group with the SV for which its similarity is maximal. For a given similarity parameter value, if the number of non-SVs in a group is greater than k, we replace the SV with the mean of the top-k most similar non-SVs, randomly selected by the exponential mechanism; otherwise, we add random noise to the SV by the Laplace mechanism. We theoretically prove that the DPKSVMEL algorithm satisfies differential privacy.
Extensive experiments on real datasets show the effectiveness of the DPKSVMEL algorithm for kernel SVMs; meanwhile, it achieves higher classification accuracy than existing private SVM algorithms.
1. Introduction
In recent years, with the rapid growth in the collecting, storing, and processing capabilities of computing devices, data sharing and analysis have become easier and more practical [1]. Data mining and machine learning techniques have been gaining a great deal of attention for extracting useful information. The classification algorithm, one of the important data mining tasks, trains a classification model from labeled training datasets to classify unknown data in the future [2]. The support vector machine (SVM) [3, 4] is one of the most widely used machine learning algorithms for classification in practice [5]. SVMs train a classification model by solving a convex optimization problem. Like most other classification algorithms, SVMs raise privacy issues when the training datasets contain sensitive information such as user behavior records or electronic health records. In SVMs, support vectors (SVs) are an important component of the classification model, and they are intact instances taken from the training datasets. Directly releasing the classification model of an SVM therefore carries a significant risk to individual privacy, especially for kernel SVMs [2].
More and more researchers have made great efforts on the privacy leakage problem. Differential privacy (DP) [6–8] is one of the state-of-the-art models and has become an accepted standard for privacy protection in sensitive data analysis since it was introduced in a series of works by Dwork et al. beginning in 2006. In recent years, two main research directions of DP have developed: differentially private data publishing and differentially private data analysis [9]. Differentially private data publishing aims to release aggregate information to the public without disclosing any individual record, including transaction data publishing [10], histogram publishing [11], stream data publishing [12], graph data publishing [13], batch query publishing [14], and synthetic dataset publishing [15]. The essential task of differentially private data analysis is extending current non-private algorithms into differentially private ones, including supervised learning [16], unsupervised learning [17], and frequent pattern mining [18].
In this paper, we study the problem of how to release the classification model of kernel SVMs while satisfying the requirements of privacy protection. To overcome the shortcomings of existing private SVM algorithms, such as the requirements on the differentiability of the objective function and low classification accuracy, we propose a new differentially private algorithm for kernel SVMs. The main contributions of this paper are summarized as follows:
(i) We propose an exponential and Laplace hybrid mechanism to prevent privacy leakage of the SVs. The hybrid mechanism takes advantage of both the exponential mechanism and the Laplace mechanism to improve classification accuracy.
(ii) We define a similarity parameter that denotes the correlation, or distance, between the non-SVs and every SV. It is easily obtained from the symmetric kernel matrix produced during training. Each non-SV is then assigned to a group with the SV for which its similarity is maximal.
(iii) Borrowing the idea of top-k frequent pattern mining [1, 19, 20], we use different methods to protect the privacy of the SVs for a given similarity parameter value. When the number of non-SVs within a group is greater than k, we replace the SV with the mean of the top-k most similar non-SVs, randomly selected by the exponential mechanism. Otherwise, we add random noise to the SV by the Laplace mechanism.
(iv) We theoretically prove that the DPKSVMEL algorithm satisfies DP. Extensive experiments on real datasets show the effectiveness of the DPKSVMEL algorithm for kernel SVMs; meanwhile, it achieves higher classification accuracy than existing private SVM algorithms.
The rest of the paper is organized as follows: Section 2 discusses work related to private SVMs. Section 3 gives a brief overview of the basic knowledge of SVMs, DP, and top-k frequent pattern mining. Section 4 proposes the DPKSVMEL algorithm, and Section 5 presents the experimental performance evaluation of the DPKSVMEL algorithm. Section 6 concludes the paper.
2. Related Work
In this section, we briefly review some work on private SVMs and then focus on work related to differentially private SVMs.
There are several works on private SVMs. Mangasarian et al. [21] proposed a highly efficient privacy-preserving SVM, PPSVM, via random kernels for vertically partitioned data. Lin et al. [2] pointed out the privacy violation problem of the SVs in the classification model of the SVM and proposed a privacy-preserving SVM classifier, PPSVC, which replaces the Gaussian kernel with a precise approximation of the decision function. These two methods achieve classification accuracy similar to that of the original non-private SVM classifier. Nevertheless, their degree of privacy protection cannot be formally proved, unlike private SVMs based on DP.
DP is a rigorous privacy definition and has become an accepted standard for privacy protection in sensitive data analysis. The degree of privacy protection is measured by the privacy budget parameter ε. The classification model of an SVM should be released under a DP guarantee. Chaudhuri et al. [22, 23] proposed two popular perturbation-based techniques, output perturbation and objective perturbation, for the design of privacy-preserving machine learning algorithms. Output perturbation introduces randomness into the weight vector after the optimization process, with the noise scale determined by the sensitivity of the weight vector. Objective perturbation introduces randomness into the objective function before the optimization process, with a noise scale independent of that sensitivity. These two perturbation-based techniques have been applied to logistic regression and linear SVM algorithms. However, their sensitivity is difficult to analyze, and for objective perturbation the loss function must satisfy certain convexity and differentiability criteria. For nonlinear kernel SVMs, Chaudhuri et al. used the random projection method to approximate the kernel function, transforming the problem into linear classification and thereby avoiding directly publishing private values from the training datasets. The disadvantage of this method is that the projection matrix must be provided along with the private classification model for the prediction process, which increases the risk of privacy leakage. Furthermore, choosing an appropriate projection dimension is also an issue. Rubinstein et al. [24] proposed two mechanisms for differentially private SVM learning, one with finite-dimensional feature mappings and one with potentially infinite-dimensional feature mappings. Both mechanisms add noise to the output classifier and are effective for all convex loss functions, including the most common hinge loss.
They also proposed a utility metric that compares the similarity of the classifiers released by the private and non-private SVMs. However, their mechanisms are valid only for translation-invariant kernels. Li et al. [25] developed a hybrid private SVM model that uses public and private data together. They leveraged a small portion of open-consented data to calculate the Fourier transform and thereby alleviate excessive noise in the final outputs. However, such public data are hard to obtain in practice. Liu et al. [26] proposed a private classification algorithm, LabSam, with high classification accuracy under DP when labeled data are limited and the privacy budget is small. Differing from the perturbation-based methods, their algorithm implements random sampling under the exponential mechanism. Zhang et al. [27] constructed a novel private SVM classifier, DPSVMDVP, based on dual variable perturbation, which adds Laplace noise to the corresponding dual variables according to the ratio of errors.
3. Preliminaries
In this section, we give a brief overview of the basic knowledge of SVMs, DP, and top-k frequent pattern mining.
3.1. Support Vector Machines
The SVM is an efficient learning method for classification based on structural risk minimization [3]. It aims to find an optimal separating hyperplane with a maximal margin to separate two classes of the given instances. The maximal margin corresponds to the shortest distance between the closest data points and any point on the hyperplane. Xue et al. [28] described the complete calculation process of the decision function in detail. The main task of training an SVM is to solve the following quadratic programming optimization problem [29, 30]:

min_α (1/2) α^T Q α − e^T α
subject to 0 ≤ α_i ≤ C, i = 1, …, n, and y^T α = 0, (1)

where e is the vector of all ones and C > 0 is the regularization parameter.
In equation (1), Q denotes a symmetric kernel matrix with Q_{ij} = y_{i}y_{j}K(x_{i}, x_{j}), where K is the kernel function, α is the dual vector, and x_{i} and y_{i} denote the training instance and its label, respectively. The optimization problem of equation (1) can be solved efficiently by the sequential minimal optimization algorithm [30]. After the optimization process, we obtain the decision function as follows:

f(x) = sgn( Σ_{i=1}^{n} α_{i} y_{i} K(x_{i}, x) + b ). (2)
From equation (2), we can conclude that the classification model of an SVM is composed of the dual variables α and the SVs. Directly releasing a classification model that contains original instances of the training dataset is therefore a very serious privacy issue.
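To make this concrete, the following is a minimal numpy sketch of evaluating the kernel SVM decision function of equation (2); the support vectors, dual values, and bias below are made-up placeholders, not values from the paper, and they illustrate that a released model stores the SVs verbatim.

```python
# Minimal sketch of the decision function f(x) = sign(sum_i alpha_i*y_i*K(sv_i, x) + b)
# for an RBF kernel. The model parameters below are illustrative placeholders.
import numpy as np

def rbf_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2), evaluated row-wise."""
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

def decision(x, support_vectors, alpha_y, bias, gamma):
    """Classify x using the dual variables (alpha_i * y_i) and the stored SVs."""
    k = rbf_kernel(support_vectors, x, gamma)  # kernel between each SV and x
    return int(np.sign(np.dot(alpha_y, k) + bias))

# Toy model: two support vectors with alpha_i * y_i = +1 and -1.
svs = np.array([[0.0, 0.0], [2.0, 2.0]])
alpha_y = np.array([1.0, -1.0])
label = decision(np.array([0.1, 0.1]), svs, alpha_y, bias=0.0, gamma=0.5)
```

Note that `svs` here would be verbatim training instances; releasing them unprotected is exactly the privacy leak the paper addresses.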
3.2. Differential Privacy
With the advent of the digital age, more and more personal information has been collected and shared by mobile devices and web services to improve the quality of those services. At the same time, this raises privacy concerns for data contributors. DP [6–8, 31] provides a mathematically rigorous definition of privacy for private data analysis. It guarantees that any possible outcome of the data analysis changes hardly at all whether or not an individual participates in the database; the maximal difference of the outcome is controlled by a small privacy budget parameter ε. Formally, the definitions related to DP are given in the following.
Definition 1. (ε-DP [6]). A randomized algorithm K satisfies ε-DP if, for all datasets D and D′ differing on at most one instance and for all subsets S ⊆ Range(K) of possible outcomes of the algorithm,

Pr[K(D) ∈ S] ≤ e^ε · Pr[K(D′) ∈ S]. (3)
Definition 2. (sensitivity [6]). For a given query function f and neighboring datasets D and D′, the sensitivity of f is defined as

∆f = max_{D, D′} ‖f(D) − f(D′)‖₁. (4)

Currently, there are two principal mechanisms used for realizing DP: the Laplace mechanism for numerical queries and the exponential mechanism for non-numerical queries.
Definition 3. (Laplace mechanism [8]). For a numeric function f, the algorithm K that answers f as in equation (5) provides ε-DP:

K(D) = f(D) + Lap(∆f/ε), (5)

where Lap(∆f/ε) is a random variable sampled from the Laplace distribution with mean 0 and scale parameter ∆f/ε.
The Laplace mechanism computes the true result of the numerical query and then perturbs it by adding independent random noise calibrated to the sensitivity.
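A minimal sketch of this mechanism, with illustrative function and parameter names (they are not from the paper):

```python
# Laplace mechanism (Definition 3): perturb a numeric query result with noise
# whose scale is sensitivity / epsilon.
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value + Lap(sensitivity / epsilon) noise, giving eps-DP."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale, size=np.shape(true_value))

# Example: a counting query has sensitivity 1, since adding or removing one
# record changes the count by at most 1.
noisy_count = laplace_mechanism(42.0, sensitivity=1.0, epsilon=0.5)
```

Smaller ε means larger noise scale and hence stronger privacy, matching the discussion of the privacy budget later in the paper.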
Definition 4. (exponential mechanism [7]). Let q(D, r) be a scoring function on a dataset D that measures the quality of an output r, and let ∆q represent its sensitivity. The algorithm K satisfies ε-DP if it outputs r with probability

Pr[K(D) = r] ∝ exp(ε q(D, r) / (2∆q)). (6)

The exponential mechanism is useful for selecting a discrete output in a differentially private manner; it employs the scoring function q to assign every candidate output r a nonzero selection probability.
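A minimal sketch of the exponential mechanism over a finite candidate set, with illustrative names (an assumption, not the paper's code):

```python
# Exponential mechanism (Definition 4): sample one candidate index with
# probability proportional to exp(eps * score / (2 * sensitivity)).
import numpy as np

def exponential_mechanism(scores, epsilon, sensitivity, rng=None):
    """Return the index of one candidate, selected eps-differentially privately."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    # Subtract the max score for numerical stability; it cancels after normalizing.
    logits = epsilon * (scores - scores.max()) / (2.0 * sensitivity)
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(scores), p=probs)

idx = exponential_mechanism([0.9, 0.7, 0.2], epsilon=1.0, sensitivity=1.0)
```

Higher-scoring candidates are exponentially more likely to be chosen, but every candidate retains nonzero probability, which is what provides the privacy guarantee.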
Definition 5. (composition properties [31, 32]). Let K_1 and K_2 be ε_1- and ε_2-DP algorithms, respectively. Then, we have the following:
Sequential composition: releasing the outputs of K_1(D) and K_2(D) satisfies (ε_1 + ε_2)-DP.
Parallel composition: for disjoint datasets D_1 and D_2, releasing K_1(D_1) and K_2(D_2) satisfies max(ε_1, ε_2)-DP.
3.3. Topk Frequent Pattern Mining
Frequent pattern mining aims to discover items that frequently appear together in a transaction dataset. Directly releasing the discovered frequent patterns with their support counts would violate individual privacy; therefore, the top-k most frequent patterns should be released under a DP guarantee. Zhang et al. [1] proposed the DFP-Growth algorithm, which accurately finds the top-k frequent patterns with noisy support counts while satisfying DP. DFP-Growth performs two key steps: first, it selects the top-k frequent patterns by the exponential mechanism; second, it perturbs the true support count of each top-k pattern by the Laplace mechanism.
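The two-step pattern above (exponential selection, then Laplace perturbation) can be sketched as follows. The function name, the even split of the privacy budget, and the per-step sensitivities are illustrative assumptions, not the exact design of the cited algorithm.

```python
# Illustrative two-step sketch: select top-k items by the exponential
# mechanism, then perturb their true counts with Laplace noise.
import numpy as np

def private_top_k(counts, k, epsilon, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    eps_select, eps_count = epsilon / 2.0, epsilon / 2.0  # assumed even split
    chosen = []
    for _ in range(k):
        remaining = [i for i in counts if i not in chosen]
        scores = np.array([counts[i] for i in remaining], dtype=float)
        # Exponential mechanism with per-selection budget eps_select / k,
        # assuming the score (support count) has sensitivity 1.
        logits = (eps_select / k) * (scores - scores.max()) / 2.0
        probs = np.exp(logits)
        probs /= probs.sum()
        chosen.append(remaining[rng.choice(len(remaining), p=probs)])
    # Perturb the support count of each selected pattern (k counts share
    # eps_count by sequential composition).
    return {i: counts[i] + rng.laplace(scale=k / eps_count) for i in chosen}

noisy = private_top_k({"a": 50, "b": 40, "c": 5}, k=2, epsilon=1.0)
```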
4. Materials and Methods
To solve the privacy leakage problem of the SVs in the classification model of kernel SVMs, we propose the DPKSVMEL algorithm based on an exponential and Laplace hybrid mechanism. The privacy of the SVs is protected by post-processing the non-private classification model with DP, while the training process of the original SVM is unchanged. Firstly, we train a non-private kernel SVM to obtain a classification model consisting of the dual vector α and the SVs. Secondly, we define a similarity parameter denoting the correlation, or distance, between the non-SVs and every SV, and each non-SV is assigned to a group with the SV for which its similarity is maximal. Thirdly, for a given similarity parameter value, we use either the exponential mechanism or the Laplace mechanism to generate a new SV that replaces the original SV within each group; which mechanism is used depends on whether the number of non-SVs is greater than k. Lastly, we output the classification model with the private SVs. Figure 1 gives an example of the DPKSVMEL algorithm's implementation process.
In Figure 1, there are three SVs and eight non-SVs. Squares represent SVs, small circles represent non-SVs, triangles represent private SVs, and big circles represent groups. Each group is shown in a different color and viewed as a hypersphere with the SV as the center and the similarity as the radius. Every non-SV is assigned to one group according to the maximal value of its similarity with each SV. In particular, non-SVs located at the intersection of multiple groups still belong to only one group, to satisfy the parallel composition property of DP. We set the parameter k to 2 in this example. In the red and yellow groups, the number of non-SVs is greater than k, so we use the exponential mechanism to randomly select the two most similar non-SVs and generate a new private SV as their mean. In the blue group, however, there is only one non-SV, so we use the Laplace mechanism to generate a new private SV by adding noise. The SVs in the final classification model are therefore all private ones, preventing privacy leakage.
4.1. Similarity Parameter and Sensitivity
In the DPKSVMEL algorithm, the similarity is a vital parameter. We view the symmetric kernel matrix Q in equation (1) as the probability of similarity between every two instances in the datasets, especially for the radial basis kernel.
Definition 6. (similarity). For a non-SV x_i and an SV x_j, the similarity between them is defined as

Similarity(x_i, x_j) = y_i y_j K(x_i, x_j). (7)

In equation (7), Similarity is a subset of Q. It is easily obtained from the classification model and requires no extra complicated computation. The smaller the distance between a non-SV and an SV, the greater the value of the Similarity when they have the same label. If they have different labels, the value of the Similarity is less than zero, and the corresponding non-SV is discarded from the calculation within the group.
The Similarity is viewed as the probability of correlation between a non-SV and an SV. In the DPKSVMEL algorithm, we set a lower limit for the Similarity, named LLs, to denote the minimum correlation, or equivalently the maximum distance, between them. If the value of the Similarity is less than LLs, the correlation between the non-SV and the SV is too small (the distance too large), and the non-SV is likewise discarded from the group. After all the non-SVs are divided into groups, we use the exponential and Laplace hybrid mechanism to generate a new SV in every group. Once LLs is fixed, the radius of the hypersphere corresponding to each group is determined, and the sensitivities of the exponential mechanism and the Laplace mechanism, denoted Sensitivity_em and Sensitivity_lm, respectively, are easily calculated from LLs.

In the exponential mechanism, we use the Similarity as the scoring function. Under a fixed LLs, the maximum value of the similarity within a group is 1, reached when a non-SV coincides with the SV, and the minimum value is LLs. The sensitivity of the exponential mechanism is therefore

Sensitivity_em = 1 − LLs. (8)

In the Laplace mechanism, we let R denote the radius of the hypersphere, that is, the maximal distance between the non-SVs and the SV within a group, corresponding to the similarity lower limit LLs. For the radial basis kernel, the relationship between LLs and R follows from equation (7):

LLs = exp(−γR²),

where γ is a scale parameter with a default value of 1/n for a dataset with n attributes. Then,

R = sqrt(−ln(LLs)/γ) = sqrt(−n ln(LLs)).

Assuming all the attributes in a dataset are independent, the largest change of any single attribute within a group is at most R/sqrt(n) = sqrt(−ln(LLs)), based on the formula for the distance between two points. Therefore, the sensitivity of the Laplace mechanism is

Sensitivity_lm = sqrt(−ln(LLs)). (9)
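The similarity of Definition 6 and the two sensitivities above can be sketched as follows, assuming the RBF kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖²); the function names are illustrative.

```python
# Sketch of Definition 6 and the sensitivities in equations (8) and (9),
# assuming the RBF kernel. Names are illustrative.
import numpy as np

def similarity(x_i, y_i, x_j, y_j, gamma):
    """Similarity = y_i * y_j * K(x_i, x_j); negative when labels differ."""
    sq_dist = np.sum((np.asarray(x_i) - np.asarray(x_j)) ** 2)
    return y_i * y_j * np.exp(-gamma * sq_dist)

def sensitivities(lls):
    """Exponential- and Laplace-mechanism sensitivities for a given LLs."""
    sens_em = 1.0 - lls               # similarity scores range over [LLs, 1]
    sens_lm = np.sqrt(-np.log(lls))   # per-attribute change bound
    return sens_em, sens_lm

s = similarity([0.0, 0.0], 1, [1.0, 1.0], 1, gamma=0.5)
em, lm = sensitivities(0.5)
```

For example, with LLs = 0.5 the exponential-mechanism sensitivity is 0.5 and the per-attribute Laplace sensitivity is sqrt(ln 2).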
4.2. Privacy Budget Allocation
In a DP algorithm, the privacy budget ε is another vital parameter. It determines the level of privacy protection of a randomized algorithm: the smaller the privacy budget, the higher the level of privacy protection. When the allocated privacy budget runs out, the randomized algorithm K loses its privacy protection. In the DPKSVMEL algorithm, every non-SV is assigned to a group with one of the SVs, and there are no common instances between groups. Therefore, the DPKSVMEL algorithm satisfies the parallel composition property of DP, and there is no need to split the privacy budget between the exponential mechanism and the Laplace mechanism.
4.3. Description of the DPKSVMEL Algorithm
In the DPKSVMEL algorithm, DP is achieved by the exponential and Laplace hybrid mechanism. The description of the DPKSVMEL algorithm is shown in Algorithm 1.

The DPKSVMEL algorithm protects the privacy of the SVs by post-processing the non-private classification model. It builds a group for every SV and assigns each non-SV to one of the groups according to its similarity. The private SVs are constructed by the exponential mechanism when the number of non-SVs within the group is greater than k; otherwise, they are constructed by the Laplace mechanism. Finally, the DPKSVMEL algorithm outputs the private classification model. Because the running time of the algorithm depends mainly on the number of SVs, its time complexity is much less than O(n), where n denotes the number of instances.
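An end-to-end sketch of this post-processing step is given below. All identifiers are illustrative, and the paper's Algorithm 1 may differ in detail; this is a reading of the grouping and hybrid-mechanism description above, not the authors' code.

```python
# Sketch of the DPKSVMEL post-processing: group non-SVs by similarity, then
# replace each SV via the exponential or Laplace mechanism.
import numpy as np

def dpksvmel_postprocess(svs, non_svs, similarity, k, epsilon, lls, rng=None):
    """svs: (m, d) support vectors; non_svs: (p, d) non-support vectors;
    similarity(a, b) -> float in (-1, 1]; lls: similarity lower limit."""
    rng = np.random.default_rng() if rng is None else rng
    # Assign each non-SV to its most similar SV, if the similarity reaches LLs.
    groups = {j: [] for j in range(len(svs))}
    for x in non_svs:
        sims = np.array([similarity(x, sv) for sv in svs])
        j = int(np.argmax(sims))
        if sims[j] >= lls:
            groups[j].append((sims[j], x))
    private_svs = []
    for j, sv in enumerate(svs):
        members = groups[j]
        if len(members) > k:
            # Exponential mechanism: sample k members, scored by similarity,
            # with sensitivity 1 - LLs as in equation (8).
            scores = np.array([s for s, _ in members])
            logits = epsilon * (scores - scores.max()) / (2.0 * (1.0 - lls))
            probs = np.exp(logits)
            probs /= probs.sum()
            idx = rng.choice(len(members), size=k, replace=False, p=probs)
            private_svs.append(np.mean([members[i][1] for i in idx], axis=0))
        else:
            # Laplace mechanism with per-attribute sensitivity sqrt(-ln LLs),
            # as in equation (9).
            scale = np.sqrt(-np.log(lls)) / epsilon
            private_svs.append(sv + rng.laplace(scale=scale, size=sv.shape))
    return np.array(private_svs)
```

A usage sketch with an RBF similarity: `dpksvmel_postprocess(svs, non_svs, lambda a, b: np.exp(-np.sum((a - b) ** 2)), k=2, epsilon=1.0, lls=0.5)` returns one private surrogate per original SV.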
4.4. Privacy Analysis
In the DPKSVMEL algorithm, randomness is introduced by the exponential and Laplace hybrid mechanism. According to the definition of DP, we prove in Theorem 1 that the DPKSVMEL algorithm satisfies DP.
Theorem 1. DPKSVMEL algorithm satisfies DP.
Proof. In the DPKSVMEL algorithm, DP is achieved by post-processing the non-private classification model. Every SV is viewed as the center of a group, and there is no intersection between groups. We consider the impact of adding one instance to the dataset on the classification model in three cases. In the first, the new instance becomes an SV; then one new group needs to be handled by the Laplace mechanism. In the second, the new instance is a non-SV assigned to a group, which only adds one non-SV that may be randomly selected by the exponential mechanism. In the third, the new instance is a non-SV that does not belong to any group, and the classification model does not change. Based on the sensitivity computations in equations (8) and (9), either the exponential mechanism or the Laplace mechanism applied to one group satisfies DP. By the parallel composition property of DP, the DPKSVMEL algorithm satisfies DP.
5. Results
In this section, we compare the performance of the DPKSVMEL algorithm with the newest private SVM algorithms, LabSam [26] and DPSVMDVP [27]. PrivateSVM [24] does not provide practical, comparable experimental results, and the experimental datasets used in the hybrid SVM [25] can no longer be obtained.
5.1. Datasets
The datasets in our experiments are commonly used for testing SVM algorithms’ performance and are available at https://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/. Table 1 shows the basic information of the eight datasets and classification accuracy of the nonprivate SVM with the default parameters based on LIBSVM (version 3.24) [33]. We use the radial basis function as the kernel function in the experiments.
5.2. Algorithm Performance Experiments
In this section, we evaluate the performance of the DPKSVMEL algorithm by Accuracy and AUC (the area under a ROC curve); the higher their values, the better the usability of the algorithm. To evaluate the algorithm under different parameter settings, we set k to 2 and 3, the privacy budget ε to 0.1, 0.5, and 1, and the lower limit of the Similarity from 0.5 to 0.9. To reduce the influence of randomness on performance, we executed the DPKSVMEL algorithm 10 times under every parameter setting. Tables 2–9 show the mean, standard deviation, maximum, and minimum of the Accuracy and AUC on the eight datasets. Values in bold represent the best mean Accuracy and AUC under the same privacy budget. The running time of the DPKSVMEL algorithm is shown in Table 10.
Based on the above experimental results, an N-way ANOVA (analysis of variance) was conducted to compare the effects of the three parameters, and of every two-parameter interaction, on Accuracy and AUC. Due to space constraints, we present only the p values of the ANOVA results, as shown in Tables 11 and 12. We can conclude that the results for parameters ε and Similarity are significantly different at the chosen significance level. However, the results for parameter k are not significantly different for most of the datasets, and the same holds for every two-parameter interaction.
To observe the experimental performance of the algorithm more intuitively, Figures 2–5 show Accuracy and AUC against Similarity under different k and ε values, as bar graphs, on the Australian and Breast datasets. The effect of the three parameters on performance is consistent with the ANOVA results. The larger the privacy budget ε, the higher the classification accuracy of the algorithm. The performance with respect to the Similarity parameter is mainly determined by which privacy protection mechanism is used: when its value is small, there are more non-SVs within each group and the exponential mechanism plays the greater role; otherwise, the Laplace mechanism plays the greater role. The DPKSVMEL algorithm is more stable on the Breast dataset.
We compared the performance of the DPKSVMEL algorithm and the non-private SVM under different privacy budgets on the Australian and Breast datasets in Figures 6–9. As the privacy budget increases, the performance of the DPKSVMEL algorithm gradually reaches, or even exceeds, that of the non-private SVM.
Then, we compared the Accuracy of the DPKSVMEL algorithm and the LabSam algorithm under different privacy budgets on the Heart, Ionosphere, Sonar, German, and Diabetes datasets in Figures 10–14. Finally, we compared the Accuracy of the DPKSVMEL algorithm and the DPSVMDVP algorithm under different privacy budgets on the Splice dataset in Figure 15. Compared with LabSam and DPSVMDVP, our DPKSVMEL algorithm achieves higher classification accuracy and is closer to the non-private SVM under the same privacy budget.
In addition, since the DPKSVMEL algorithm does not change the training process of classical non-private SVMs, new optimization methods can easily be combined with our proposed algorithm to improve classification accuracy, for example, the CRP algorithm [34] for bilinear analysis and NISVM [35] for event analysis tasks.
6. Conclusions
In this paper, we studied the privacy problem of the classification model of kernel SVMs and proposed the DPKSVMEL algorithm based on an exponential and Laplace hybrid mechanism. The privacy of the SVs is protected by post-processing the non-private classification model with DP, preventing privacy leakage of the SVs. The DPKSVMEL algorithm is proved to satisfy DP theoretically and overcomes several shortcomings of existing private SVM algorithms. Firstly, the post-processing in the DPKSVMEL algorithm does not change the training process of the non-private kernel SVM; therefore, no complex sensitivity analysis is required, unlike output perturbation and objective perturbation. Secondly, the DPKSVMEL algorithm avoids the additional risk of privacy disclosure, and the need to choose a projection dimension, caused by transforming nonlinear SVMs into linear SVMs via random projection, as in PrivateSVM and the hybrid SVM. Meanwhile, the DPKSVMEL algorithm achieves higher classification accuracy than the newest private SVM algorithms, LabSam and DPSVMDVP, under the same privacy budget. However, the DPKSVMEL algorithm views the value of the kernel function as the probability of similarity between a non-SV and an SV, which is valid only for kernel functions with values in the range 0 to 1. Furthermore, the DPKSVMEL algorithm performs poorly on datasets with a high proportion of SVs, especially when the similarity lower limit is small. In the future, we will work on two aspects. One is to extend the DPKSVMEL algorithm to more kernel functions. The other is to consider setting different similarity lower limits for different groups.
Data Availability
The datasets in our experiments are commonly used for testing SVM algorithms’ performance and are available at https://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grants 61672179, 61370083, 61402126, and 61501275, by the Natural Science Foundation of Heilongjiang Province under Grant F2015030, by the Science Fund for Youths of Heilongjiang Province under Grant QC2016083, by the Postdoctoral Fellowship of Heilongjiang Province under Grant LBHZ14071, and by the Fundamental Research Funds in Heilongjiang Provincial Universities under Grant 135509312.