Abstract

With the widespread application of big data, privacy-preserving data analysis has become a topic of increasing significance. Current research mainly focuses on privacy-preserving classification and regression. However, principal component analysis (PCA) is also an effective data analysis method that can be used to reduce data dimensionality and is commonly used in data processing, machine learning, and data mining. In order to implement approximate PCA while preserving data privacy, we apply the Laplace mechanism to propose two differential privacy principal component analysis algorithms: Laplace input perturbation (LIP) and Laplace output perturbation (LOP). We evaluate the performance of LIP and LOP in terms of noise magnitude and approximation error, both theoretically and experimentally. In addition, we explore how the performance of the two algorithms varies with parameters such as the number of samples, target dimension, and privacy parameter. Theoretical and experimental results show that algorithm LIP adds less noise and has lower approximation error than LOP. To verify the effectiveness of algorithm LIP, we compare it with other algorithms. The experimental results show that algorithm LIP provides a strong privacy guarantee and good data utility.

1. Introduction

In many modern information systems, the amount of data is very large. Massive data increase the difficulty of data analysis and processing. Principal component analysis (PCA) is a standard data analysis method that can be used to reduce data dimensionality. More specifically, it projects the original high-dimensional data onto the space of principal components, composed of the eigenvectors of the covariance matrix of the data, to obtain low-dimensional data that represent most of the information in the original data. PCA simplifies the data, making them easier to use while reducing the computational cost of subsequent algorithms. For example, face recognition is much faster when the data are first projected into a lower dimension.

Financial and medical data often contain private or sensitive information. If machine learning or data mining algorithms work directly on the original data, their outputs may leak private information, which poses potential threats to individuals. Therefore, privacy preservation has become an urgent problem that needs to be solved. Differential privacy (DP) [1] is an effective and provable privacy protection model. It aims to hide private information while preserving basic statistics of the original data. The notion of differential privacy has two types: $\varepsilon$-DP and $(\varepsilon, \delta)$-DP [2]. $\varepsilon$-DP is usually called pure differential privacy, while $(\varepsilon, \delta)$-DP with $\delta > 0$ is called approximate differential privacy. $(\varepsilon, \delta)$-DP is a weaker version of $\varepsilon$-DP, as it provides the freedom to violate strict differential privacy for some low-probability events.

There are several approaches to computing approximate PCA while satisfying differential privacy. Input perturbation adds noise to the data before computing the PCA, while output perturbation adds noise to the output of PCA. We can add Laplace noise to implement both input perturbation and output perturbation. Both approaches can effectively simplify data and preserve data privacy; however, there are few studies on their relative performance. At the same privacy protection level, better performance (less noise and lower error) means better data utility. In this paper, we propose two differential privacy principal component analysis algorithms and evaluate their performance.

Our main contributions are as follows:

(1) We apply the Laplace mechanism to propose two differential privacy principal component analysis algorithms, Laplace input perturbation (LIP) and Laplace output perturbation (LOP), and prove that both satisfy $\varepsilon$-DP.

(2) We offer two criteria, i.e., noise magnitude and approximation error, to evaluate the performance of the two algorithms. Less noise and lower approximation error mean better performance. Through theoretical analysis, we show that LIP performs better than LOP.

(3) We conduct experiments to verify the performance of LIP and LOP in terms of noise magnitude and approximation error on five real datasets. We further explore how the performance of the two algorithms varies with parameters such as the number of samples, target dimension, and privacy parameter. The experimental results show that, across the different parameter settings, algorithm LIP always adds less noise and has lower approximation error than LOP. Compared with other algorithms, LIP also provides good data utility.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces principal component analysis, differential privacy, and the Laplace mechanism. Section 4 first describes the two differential privacy principal component analysis algorithms and then analyzes their privacy and utility. Section 5 shows the performance of the two algorithms on five real datasets. Section 6 concludes the paper.

2. Related Work

Since Dwork proposed the concept of differential privacy, privacy preservation in the fields of data mining and machine learning has received considerable attention. Current research mainly focuses on privacy-preserving classification, regression, and frequent itemset mining.

Classification technology plays an important role in data prediction; it aims to build models that can describe and distinguish data. Typical privacy-preserving classification algorithms are SuLQ-Based ID3, DiffP-C4.5, and DiffGen. The basic idea of SuLQ-Based ID3 [3] is to add noise to the true count values before calculating the information gain of the attributes and then generate the corresponding decision tree. Although this method satisfies differential privacy, the added noise is too large. To overcome the disadvantages of SuLQ-Based ID3, DiffP-C4.5 [4] selects and splits attributes with the exponential mechanism. However, this method can only support a few analyses and queries. The classification accuracy of DiffGen [5] is higher than that of SuLQ-Based ID3 and DiffP-C4.5 both in theory and in practice; unfortunately, when the dimension of the classification attribute is very large, the selection method based on the exponential mechanism is inefficient and may exhaust the privacy budget.

Frequent itemset mining is an effective data analysis method; it aims to discover itemsets that frequently appear in a dataset. Bhaskar et al. proposed the truncated frequency (TF) algorithm [4]; it reduces the number of candidate itemsets depending on their frequency. However, when the number of target itemsets is large, this method fails. Considering this weakness, Li et al. proposed the PrivBasis algorithm [5], which uses the idea of a θ-basis (θ is a threshold) to generate candidate itemsets. However, generating a θ-basis is not easy. Inspired by Zeng and Li, Wang et al. proposed the PrivSuper algorithm [6], which randomly truncates transactions in a dataset and therefore incurs a large truncation error.

Regression is a common data analysis method in machine learning; it models the quantitative relationship of interdependence among two or more attributes. Typical regression algorithms based on differential privacy are logistic regression and linear regression. In the algorithms LPLog [7] and ObjectivePerb [8], the noise magnitude is decided by the sensitivity of the weight vector, and the cost of computing this sensitivity is high. Considering the disadvantages of these two algorithms, the functional mechanism (FM) [9] controls the noise magnitude by the sensitivity of the objective function itself instead of the weight vector.

However, there are few studies on differential privacy principal component analysis. Blum et al. [10] first proposed the early input perturbation framework SuLQ, but it was not designed for data publishing. Chaudhuri et al. [11] proposed a privacy-preserving PCA algorithm, MOD-SULQ, based on the exponential mechanism, which can be used for data publishing. Kapralov and Talwar [12] argued that the algorithm of Chaudhuri et al. lacks a convergence time guarantee, and they designed a more elaborate algorithm using the exponential mechanism, but it is complicated to implement for high-dimensional data. Dwork et al. [13] provided algorithms for $(\varepsilon, \delta)$-DP, adding Gaussian noise to the original sample covariance matrix. Inspired by Dwork, Imtiaz et al. [14, 15] and Jiang et al. [2] designed their algorithms for $\varepsilon$-DP. Both of them added Wishart noise with parameters chosen to obtain a better utility bound.

3. Preliminaries

Given a dataset $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$ is the i-th record, the matrix $X \in \mathbb{R}^{n \times d}$ contains information about d attributes of n individuals (generally $n \gg d$). Following previous work on privacy-preserving PCA, we also assume $\|x_i\|_2 \le 1$, where $\|\cdot\|_2$ denotes the $\ell_2$ norm. For a vector $v = (v_1, v_2, \ldots, v_d)$, $\|v\|_2 = \sqrt{\sum_{i=1}^{d} v_i^2}$.

The covariance matrix of the original data is
$$A = \frac{1}{n} X^T X,$$
where A is a symmetric $d \times d$ matrix.

The principal components are obtained by computing the eigenvalues and corresponding eigenvectors of the covariance matrix A:
$$A v_i = \lambda_i v_i, \quad i = 1, 2, \ldots, d,$$
where $\lambda_i$ is the eigenvalue, denoting the proportion of information that the corresponding component carries. A larger $\lambda_i$ means the component is more important. We assume the eigenvalues are ordered decreasingly, i.e., $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$, and $v_i$ is the corresponding eigenvector.

In order to reduce the data to a low dimension, a target dimension k is needed. We want to select the first k eigenvectors, which correspond to the top k eigenvalues. Given a threshold α, where α denotes the accumulative contribution rate of the principal components [16], the target dimension k can be decided by
$$k = \min\left\{m : \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \ge \alpha\right\}.$$
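As a concrete illustration of this selection rule, the following minimal Python (NumPy) sketch picks k from a descending list of eigenvalues; the function name and the default threshold are illustrative and are not part of the original paper.

```python
import numpy as np

def choose_k(eigvals, alpha=0.85):
    """Return the smallest k whose accumulative contribution rate reaches alpha.

    eigvals: eigenvalues of the covariance matrix, sorted in descending order.
    """
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.argmax(ratio >= alpha) + 1)
```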

Suppose $V_k = (v_1, v_2, \ldots, v_k) \in \mathbb{R}^{d \times k}$ contains the first k eigenvectors of A, which are orthonormal. We project the original data X onto $V_k$ to get the low-dimensional data
$$Y = X V_k,$$
where $Y \in \mathbb{R}^{n \times k}$; we can also get the rank-k approximation [17] of X:
$$Z = Y V_k^T = X V_k V_k^T.$$

Our algorithms aim to keep the statistics of X as much as possible, and the approximation error between Z and X can be measured by
$$\mathrm{MSE} = \frac{1}{n}\|X - Z\|_F^2.$$

A lower MSE provides better data utility. $\|\cdot\|_F$ denotes the Frobenius norm. For a matrix $M = (m_{ij}) \in \mathbb{R}^{n \times d}$, $\|M\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{d} m_{ij}^2}$.
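The whole non-private pipeline (covariance, eigendecomposition, projection, rank-k approximation, and MSE) can be sketched as follows. This is only a reference implementation of the textbook procedure above, assuming the data are already centred.

```python
import numpy as np

def pca_rank_k(X, k):
    """Project X onto the top-k eigenvectors of its covariance matrix.

    Returns the rank-k approximation Z and the approximation error MSE.
    """
    n, d = X.shape
    A = (X.T @ X) / n                                  # covariance matrix A
    eigvals, eigvecs = np.linalg.eigh(A)               # eigh: A is symmetric
    V_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # top-k eigenvectors
    Z = X @ V_k @ V_k.T                                # rank-k approximation
    mse = np.linalg.norm(X - Z, 'fro') ** 2 / n        # (1/n) * ||X - Z||_F^2
    return Z, mse
```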

Now, we introduce the definition of differential privacy.

Definition 1 (differential privacy) [18]. A randomized mechanism M satisfies $\varepsilon$-differential privacy if, for any neighbouring datasets X and X′ (differing in at most one record) and for all outputs $S \subseteq \mathrm{Range}(M)$,
$$\Pr[M(X) \in S] \le e^{\varepsilon} \Pr[M(X') \in S],$$
where ε is the privacy budget controlling the strength of the privacy guarantee; a lower ε ensures more privacy.
Sensitivity is the key parameter that determines how much noise is required.

Definition 2 (sensitivity) [19]. For a function $f$ and any neighbouring datasets X and X′, the sensitivity of f is defined as
$$\Delta f = \max_{X, X'} \|f(X) - f(X')\|.$$
The sensitivity describes the largest change in the output caused by replacing a single data entry. Sensitivity is only related to the function f.
The Laplace mechanism adds independent noise to the data; we use $\mathrm{Lap}(b)$ to represent noise sampled from the Laplace distribution with scale b.

Definition 3 (Laplace mechanism) [19]. Given a dataset X, for a function $f$ with sensitivity $\Delta f$, the mechanism M provides $\varepsilon$-DP if it satisfies
$$M(X) = f(X) + \mathrm{Lap}\left(\frac{\Delta f}{\varepsilon}\right),$$
where the noise is added independently to each output component. Here, $\mathrm{Lap}(\Delta f/\varepsilon)$ is a random variable. Its probability density function is
$$p(x) = \frac{\varepsilon}{2\Delta f}\exp\left(-\frac{\varepsilon |x|}{\Delta f}\right).$$
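A minimal Python sketch of the Laplace mechanism follows; the helper name is illustrative, and the sensitivity must be supplied by the caller.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps):
    """Add i.i.d. Laplace(sensitivity/eps) noise to a scalar or array-valued query result."""
    scale = sensitivity / eps
    return value + np.random.laplace(loc=0.0, scale=scale, size=np.shape(value))
```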

4. Proposed Algorithms and Analysis

In this section, we describe two differential privacy principal component analysis algorithms: LIP and LOP. Through theoretical analysis, we prove that the two algorithms satisfy $\varepsilon$-DP. Meanwhile, we investigate the utility of the proposed algorithms.

4.1. Algorithm Description

In algorithm LIP, we use the Laplace distribution to generate a symmetric noise matrix and add it to the data covariance matrix. After computing the eigenvalues and corresponding eigenvectors of the noised covariance matrix, we select the first k eigenvectors to form the principal component space. In the end, we obtain the low-dimensional data by projecting the original high-dimensional data onto the principal component space. Algorithm LIP is described in Algorithm 1.

Input: matrix $X \in \mathbb{R}^{n \times d}$, number of samples n, number of attributes d, privacy parameter ε;
Output: $\hat{Z}$: the rank-k approximation matrix
(1) Compute the covariance matrix $A = \frac{1}{n}X^T X$;
(2) Generate the noise matrix $E_1$, a symmetric $d \times d$ matrix whose upper triangle (including the diagonal) is sampled i.i.d. from $\mathrm{Lap}(\Delta f_1/\varepsilon)$, and each lower triangle entry is copied from the opposite position;
(3) Add noise: $\hat{A} = A + E_1$;
(4) Compute the eigenvalues $\hat{\lambda}_i$ and corresponding eigenvectors $\hat{v}_i$ of the noised covariance matrix $\hat{A}$;
(5) Given a threshold α, select the top k eigenvectors $\hat{V}_k$ of $\hat{A}$; low-dimensional data $\hat{Y} = X\hat{V}_k$;
(6) The rank-k approximation matrix $\hat{Z} = \hat{Y}\hat{V}_k^T$;
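A minimal NumPy sketch of Algorithm 1 is given below. The sensitivity value 2/n mirrors Lemma 1 in Section 4.2 and is an assumption of this sketch; k is passed in directly rather than chosen from the threshold α.

```python
import numpy as np

def lip(X, k, eps):
    """Laplace input perturbation (LIP): perturb the covariance matrix, then run PCA."""
    n, d = X.shape
    delta_f = 2.0 / n                                   # assumed sensitivity of A (see Lemma 1)
    A = (X.T @ X) / n                                   # covariance matrix
    upper = np.triu(np.random.laplace(0.0, delta_f / eps, size=(d, d)))
    E = upper + upper.T - np.diag(np.diag(upper))       # symmetric Laplace noise matrix
    A_hat = A + E                                       # noised covariance matrix
    eigvals, eigvecs = np.linalg.eigh(A_hat)
    V_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]     # top-k noisy eigenvectors
    Y_hat = X @ V_k                                     # low-dimensional data
    return Y_hat @ V_k.T                                # rank-k approximation
```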

$\hat{V}_k = (\hat{v}_1, \ldots, \hat{v}_k)$ are the first k eigenvectors of the noised covariance matrix $\hat{A}$, which are close to the true first k eigenvectors of the covariance matrix A [13].

Besides adding noise prior to computing PCA, we can also add noise to the output of PCA. According to differential privacy parallel composition [20], the whole dataset is private as long as each record is private; a simple idea is therefore to add noise to each record to protect private information. However, if this privacy preservation method is applied directly to big data, the introduced noise increases significantly, so data utility drops dramatically. In order to reduce noise without decreasing the level of privacy preservation, we can add noise only to the smaller but most important part of the data. Algorithm LOP projects the original high-dimensional data onto the principal component space to get low-dimensional data. The low-dimensional data are the most important part of the data, so we add noise to them to protect data privacy. Algorithm LOP is described in Algorithm 2.

Input: matrix $X \in \mathbb{R}^{n \times d}$, number of samples n, number of attributes d, privacy parameter ε;
Output: $\hat{Z}$: the rank-k approximation matrix
(1) Compute the covariance matrix $A = \frac{1}{n}X^T X$;
(2) Compute the eigenvalues $\lambda_i$ and corresponding eigenvectors $v_i$ of A;
(3) Given a threshold α, select the top k eigenvectors $V_k$ of A; low-dimensional data $Y = XV_k$;
(4) Generate the noise matrix $E_2$, an $n \times k$ matrix whose elements are i.i.d. samples from $\mathrm{Lap}(\Delta f_2/\varepsilon)$;
(5) Add noise: $\hat{Y} = Y + E_2$;
(6) The rank-k approximation matrix $\hat{Z} = \hat{Y}V_k^T$;
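The corresponding sketch of Algorithm 2 is shown below; the sensitivity value 2 mirrors Lemma 2 in Section 4.2 and is likewise an assumption of this sketch.

```python
import numpy as np

def lop(X, k, eps):
    """Laplace output perturbation (LOP): run PCA, then perturb the projected data."""
    n, d = X.shape
    delta_f = 2.0                                       # assumed sensitivity of Y = X V_k (see Lemma 2)
    A = (X.T @ X) / n                                   # covariance matrix (no noise here)
    eigvals, eigvecs = np.linalg.eigh(A)
    V_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]     # exact top-k eigenvectors
    Y = X @ V_k                                         # low-dimensional data
    E = np.random.laplace(0.0, delta_f / eps, size=(n, k))
    Y_hat = Y + E                                       # noisy low-dimensional data
    return Y_hat @ V_k.T                                # rank-k approximation
```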
4.2. Privacy Analysis

Before proving that LIP and LOP satisfy $\varepsilon$-DP, we should analyze the sensitivities of the two algorithms. Suppose there are two neighbouring datasets $X = \{x_1, \ldots, x_{n-1}, x_n\}$ and $X' = \{x_1, \ldots, x_{n-1}, x'_n\}$ that differ only in the last record; we assume the normalized data vectors satisfy $\|x_n\|_2 \le 1$ and $\|x'_n\|_2 \le 1$.

Lemma 1. In algorithm LIP, for all input data, denote $f_1(X) = A = \frac{1}{n}X^T X$; then, the sensitivity of the function $f_1$ equals $\Delta f_1 = 2/n$.

Proof. Suppose that A and A′ are the covariance matrices of X and X′, respectively.

According to Definition 2, the sensitivity of the function $f_1$ is $\Delta f_1 = \max_{X, X'}\|A - A'\|_F$. Then, we have
$$\|A - A'\|_F = \frac{1}{n}\left\|x_n x_n^T - x'_n {x'_n}^T\right\|_F \le \frac{1}{n}\left(\|x_n x_n^T\|_F + \|x'_n {x'_n}^T\|_F\right) = \frac{1}{n}\left(\|x_n\|_2^2 + \|x'_n\|_2^2\right),$$
where $\|\cdot\|_F$ denotes the Frobenius norm; for a matrix $M = (m_{ij})$, $\|M\|_F = \sqrt{\sum_{i,j} m_{ij}^2}$. For the normalized $x_n$ and $x'_n$, we have
$$\Delta f_1 \le \frac{2}{n}.$$

Theorem 1. Algorithm LIP satisfies $\varepsilon$-DP.

Proof. For the output $\hat{A}$ derived from algorithm LIP on X and X′, we obtain $\hat{A} = A + E_1$ and $\hat{A} = A' + E'_1$, where $E_1$ and $E'_1$ are the corresponding noise matrices. By the property of the Laplace distribution,
$$\frac{p_X(\hat{A})}{p_{X'}(\hat{A})} \le \exp\left(\frac{\varepsilon\,\|A - A'\|_F}{\Delta f_1}\right),$$
where $p_X(\hat{A})$ and $p_{X'}(\hat{A})$ are the density functions of the output at the neighbouring datasets X and X′. According to Lemma 1, we have
$$\|A - A'\|_F \le \Delta f_1.$$

Combining the two inequalities above, we can obtain
$$\frac{p_X(\hat{A})}{p_{X'}(\hat{A})} \le \exp(\varepsilon).$$

Therefore, algorithm LIP satisfies $\varepsilon$-DP.

Lemma 2. In algorithm LOP, given the target dimension k, denote $f_2(X) = Y = XV_k$; then, the sensitivity of the function $f_2$ equals $\Delta f_2 = 2$.

Proof. Suppose that Y, Y′ and $V_k$, $V'_k$ are the low-dimensional data and the first k orthonormal eigenvectors of X and X′, respectively.

According to Definition 2, the sensitivity of the function $f_2$ is $\Delta f_2 = \max_{X, X'}\|Y - Y'\|_F$. Since X and X′ differ only in the last record, the difference is dominated by the last rows, $V_k^T x_n$ and ${V'_k}^T x'_n$. Then, we have
$$\|Y - Y'\|_F \le \|V_k^T x_n\|_2 + \|{V'_k}^T x'_n\|_2.$$

Since $V_k$ and $V'_k$ are both composed of k orthonormal eigenvectors, the projection does not increase the norm, i.e., $\|V_k^T x_n\|_2 \le \|x_n\|_2$ and $\|{V'_k}^T x'_n\|_2 \le \|x'_n\|_2$.

For the normalized $x_n$ and $x'_n$, we have
$$\Delta f_2 \le \|x_n\|_2 + \|x'_n\|_2 \le 2.$$

Theorem 2. Algorithm LOP satisfies $\varepsilon$-DP.

Proof. For the output $\hat{Y}$ derived from algorithm LOP on X and X′, we obtain $\hat{Y} = Y + E_2$ and $\hat{Y} = Y' + E'_2$, where $E_2$ and $E'_2$ are the corresponding noise matrices. By the property of the Laplace distribution,
$$\frac{p_X(\hat{Y})}{p_{X'}(\hat{Y})} \le \exp\left(\frac{\varepsilon\,\|Y - Y'\|_F}{\Delta f_2}\right),$$
where $p_X(\hat{Y})$ and $p_{X'}(\hat{Y})$ are the density functions of the output at the neighbouring datasets X and X′. According to Lemma 2, we have
$$\|Y - Y'\|_F \le \Delta f_2.$$

Combining the two inequalities above, we can obtain
$$\frac{p_X(\hat{Y})}{p_{X'}(\hat{Y})} \le \exp(\varepsilon).$$

Therefore, algorithm LOP satisfies $\varepsilon$-DP.

4.3. Utility Analysis

In Section 4.2, we proved that algorithms LIP and LOP both satisfy $\varepsilon$-DP. Next, we evaluate the performance of the two algorithms. In order to protect data privacy, we add noise to the covariance matrix in LIP and to the low-dimensional matrix in LOP. Adding noise affects the performance of the algorithms, and the noise magnitude directly determines the size of this effect. In addition, the approximation error also describes the performance of the algorithms. Better data utility means less noise and lower approximation error, so we evaluate algorithms LIP and LOP in terms of noise magnitude and approximation error.

Theorem 3. For a given privacy parameter ε, algorithm LIP adds less noise than LOP. The larger the number of samples n and the target dimension k, the less noise algorithm LIP adds relative to LOP.

Proof. In algorithm LIP, the noise matrix $E_1$ has $d^2$ elements, each element adds noise $\mathrm{Lap}(\Delta f_1/\varepsilon) = \mathrm{Lap}(2/(n\varepsilon))$, and the total variance of the noise is about
$$\sigma_1^2 = 2d^2\left(\frac{2}{n\varepsilon}\right)^2 = \frac{8d^2}{n^2\varepsilon^2}.$$

In algorithm LOP, the noise matrix $E_2$ has $nk$ elements, each element adds noise $\mathrm{Lap}(\Delta f_2/\varepsilon) = \mathrm{Lap}(2/\varepsilon)$, and the total variance of the noise is about
$$\sigma_2^2 = 2nk\left(\frac{2}{\varepsilon}\right)^2 = \frac{8nk}{\varepsilon^2}.$$

Now, we compare $\sigma_1^2$ and $\sigma_2^2$ to measure the noise magnitudes of the two algorithms:
$$\theta = \frac{\sigma_1^2}{\sigma_2^2} = \frac{d^2}{n^3 k},$$
where $n \gg d$. From this ratio, we observe $\theta < 1$; that is, algorithm LIP adds less noise than LOP.

We call θ the noise ratio. Furthermore, θ is strongly negatively correlated with both n and k; that is, the larger the number of samples n and the target dimension k, the less noise LIP adds relative to LOP.
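To make the comparison tangible, the following snippet evaluates the two total noise variances and their ratio θ for one illustrative parameter setting; the numbers rely on the sensitivities $\Delta f_1 = 2/n$ and $\Delta f_2 = 2$ assumed above.

```python
def noise_ratio(n, d, k, eps):
    """theta = sigma_1^2 / sigma_2^2 under the assumed sensitivities; theta < 1 favours LIP."""
    var_lip = 2 * d * d * (2.0 / (n * eps)) ** 2   # d^2 entries, each Lap(2/(n*eps))
    var_lop = 2 * n * k * (2.0 / eps) ** 2         # n*k entries, each Lap(2/eps)
    return var_lip / var_lop

print(noise_ratio(n=1000, d=50, k=10, eps=0.5))    # d^2/(n^3*k) = 2.5e-07, so LIP adds far less noise
```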

Theorem 4. For a given privacy parameter ε, algorithm LIP has lower error than LOP in the rank-k approximation of raw data.

Proof. In algorithm LIP, the rank-k approximation of X is
$$\hat{Z}_1 = X\hat{V}_k\hat{V}_k^T.$$

In algorithm LOP, the rank-k approximation of X is
$$\hat{Z}_2 = \hat{Y}V_k^T = (XV_k + E_2)V_k^T = XV_kV_k^T + E_2V_k^T.$$

Let
$$\mathrm{MSE}_1 = \frac{1}{n}\|X - \hat{Z}_1\|_F^2, \qquad \mathrm{MSE}_2 = \frac{1}{n}\|X - \hat{Z}_2\|_F^2.$$

MSE1 and MSE2 denote the approximation errors between X and $\hat{Z}_1$ and between X and $\hat{Z}_2$, respectively. Now, we compare MSE1 and MSE2 to measure the approximation errors of the two algorithms. Based on linear algebra, we have
$$\mathrm{MSE}_1 = \frac{1}{n}\left\|X - X\hat{V}_k\hat{V}_k^T\right\|_F^2 = \frac{1}{n}\left(\|X\|_F^2 - \|X\hat{V}_k\|_F^2\right).$$

In the expression for MSE1, the noised eigenvectors $\hat{V}_k$ are computed from the perturbed matrix $\hat{A}$ of Algorithm 1, whereas $V_k$ and the eigenvalues $\lambda_i$ are computed from the accurate matrix A. Theorem 6 in Dwork et al. [13] provides the closeness between $X\hat{V}_k$ and $XV_k$: $X\hat{V}_k$ not only captures a large amount of variance but is also close to the best rank-k approximation obtained from A. Theorem 6 in Dwork et al. [13] also gives an upper bound on the gap $\|XV_k\|_F^2 - \|X\hat{V}_k\|_F^2$ when the eigenvalue gap condition in [13] holds; the bound is expressed in terms of the noise parameter σ of the Gaussian distribution. In the Gaussian mechanism [13], the entries of the noise matrix are sampled from $N(0, \sigma^2)$, which corresponds to $\mathrm{Lap}(\Delta f_1/\varepsilon)$ in our Algorithm 1. $\sigma_k$ is a singular value in the SVD of X; according to the relationship between PCA and SVD, we have $\sigma_k^2 = n\lambda_k$.

From this bound, we know that $\|XV_k\|_F^2$ and $\|X\hat{V}_k\|_F^2$ are very close but still differ slightly. Under the effect of this difference and the noise, we have
$$\mathrm{MSE}_1 = \frac{1}{n}\left(\|X\|_F^2 - \|X\hat{V}_k\|_F^2\right) \ge \frac{1}{n}\left(\|X\|_F^2 - \|XV_k\|_F^2\right).$$

Combining the expressions above, we conclude that MSE1 is only slightly larger than the error of noise-free PCA, $\frac{1}{n}\left(\|X\|_F^2 - \|XV_k\|_F^2\right)$.

In MSE2, the projection $V_k$ is computed from the accurate matrix A.

According to the orthonormality of $V_k$ (so that $\|E_2 V_k^T\|_F = \|E_2\|_F$ and the cross term vanishes), we have
$$\mathrm{MSE}_2 = \frac{1}{n}\left\|X - XV_kV_k^T - E_2V_k^T\right\|_F^2 = \frac{1}{n}\left(\|X\|_F^2 - \|XV_k\|_F^2\right) + \frac{1}{n}\|E_2\|_F^2.$$

Let $\eta = \mathrm{MSE}_1/\mathrm{MSE}_2$; $\eta < 1$ indicates that algorithm LIP has lower approximation error than LOP:
$$\eta = \frac{\mathrm{MSE}_1}{\mathrm{MSE}_2} \approx \frac{\|X\|_F^2 - \|XV_k\|_F^2}{\left(\|X\|_F^2 - \|XV_k\|_F^2\right) + \|E_2\|_F^2}.$$

$E_2$ is an $n \times k$ matrix. With the increase of the target dimension k, the value of $\|E_2\|_F^2$ increases, while $\|X\|_F^2 - \|XV_k\|_F^2$ decreases. Thus, the approximation error ratio η decreases and $\eta < 1$. Since $\eta < 1$, LIP has lower error than LOP in the rank-k approximation of the raw data.

Theorem 3 and Theorem 4 prove that algorithm LIP adds less noise and has lower approximation error than LOP; that is, algorithm LIP outperforms LOP in data utility.

5. Experimental Results and Analysis

In this section, we give experimental results to verify that algorithm LIP outperforms LOP in data utility. We compare algorithms LIP and LOP in terms of noise magnitude and approximation error. In addition, we investigate how the performance of the two algorithms varies with key parameters such as the number of samples n, target dimension k, and privacy parameter ε. Five UCI datasets are used in our experiments: Secom [21], Covtype [22], Musk [23], Handwritten [24], and Waveform [25]. We preprocess the data by subtracting the mean and normalizing the data to meet the condition $\|x_i\|_2 \le 1$. We select the target dimension k so that the accumulative contribution rate of the principal components α is at least 85%. In all cases, we report the average performance over 100 runs of each algorithm.
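The preprocessing step can be sketched as follows: centre the data and rescale it so that every record satisfies the norm bound assumed in Section 3. The single global scaling factor is one reasonable choice, not necessarily the exact procedure used in the experiments.

```python
import numpy as np

def preprocess(X):
    """Subtract the mean and scale the data so that max_i ||x_i||_2 <= 1."""
    X = X - X.mean(axis=0)                        # subtract the mean
    max_norm = np.linalg.norm(X, axis=1).max()    # largest record norm
    return X / max(max_norm, 1e-12)               # now every row satisfies ||x_i||_2 <= 1
```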

5.1. Experiments for Noise Magnitude

In this section, we evaluate the performance of algorithms LIP and LOP by comparing the magnitude of the introduced noise. Theorem 3 indicates that the noise ratio θ is negatively correlated with n and k and that $\theta < 1$. In the experiments, we verify that these conclusions hold (we use the Frobenius norms of the noise matrices $E_1$ and $E_2$ to represent $\sigma_1$ and $\sigma_2$).

In order to better present the experimental results for all datasets in one figure, we rescale the noise ratio θ so that all curves have a similar magnitude (θ for Secom is left at its original value, while θ for Covtype, Musk, Handwritten, and Waveform is multiplied by a different constant factor for each dataset). For this experiment, we keep the number of samples n fixed to investigate the relationship between the noise ratio θ and the target dimension k. In Figure 1, we observe that even after this scaling, θ is always less than 1; as k increases, the noise ratio θ on all five datasets continuously decreases. That is, θ and k are negatively correlated; the larger the target dimension k, the less noise LIP adds relative to LOP. The result is consistent with Theorem 3.

Then, we explore the effect of the number of samples n on the noise ratio θ. With the target dimension k fixed, Figure 2 shows that θ is always less than 1 and that θ decreases as n increases on all five datasets. That is, θ and n are negatively correlated; the larger the number of samples n, the less noise LIP adds relative to LOP. The result is consistent with Theorem 3.

5.2. Experiments for Approximation Error

In this section, we evaluate the performance of algorithms LIP and LOP by comparing the approximation error. Theorem 4 indicates that the approximation error ratio $\eta = \mathrm{MSE}_1/\mathrm{MSE}_2$ is less than 1. Thus, in the experiments, we verify that (1) η is less than 1 and (2) η and k are negatively correlated while η and ε are positively correlated.

In Theorem 3 and Section 5.1, we observed that the larger the target dimension k, the less noise algorithm LIP adds relative to LOP. Less noise results in lower error. In addition, the expression for η in Theorem 4 shows that $\eta < 1$ and that η and k are negatively correlated. As before, we rescale the approximation error ratio η so that all curves have a similar magnitude (η in each dataset is multiplied by a different constant factor). For this experiment, we keep the privacy parameter ε fixed to investigate the relationship between the approximation error ratio η and the target dimension k. As shown in Figure 3, when the value of k increases, the approximation error ratio η decreases and $\eta < 1$. In other words, η and k are negatively correlated, which means that for larger k, LIP has lower error than LOP in the rank-k approximation of the raw data. The experimental result is consistent with Theorem 4.

Finally, we explore the variation of η with the privacy parameter ε. As before, we rescale the approximation error ratio η so that all curves have a similar magnitude (η in the first two datasets is 10 times the original value, and η in the remaining datasets is multiplied by other constant factors). In Figure 4, for all the datasets, we observe that as ε increases, the approximation error ratio η increases; that is, η and ε are positively correlated. Even with very weak privacy protection (large ε), η is still less than 1. This can be explained as follows. Dwork et al. pointed out that, under input perturbation, $\hat{V}_k$ is close to the true top k eigenvectors [13]; that is, algorithm LIP is not very sensitive to the privacy parameter ε. Output perturbation, in contrast, adds noise directly to the output, so the privacy parameter ε has a direct effect on data utility: a lower ε means more noise and a higher approximation error. Thus, when ε increases, MSE1 decreases slightly while MSE2 decreases greatly, and η increases. Therefore, at the same privacy protection level, algorithm LIP has lower error than LOP in the rank-k approximation of the raw data.

5.3. Experiments for Accuracy

In Sections 5.1 and 5.2, we verified that algorithm LIP outperforms LOP in data utility. To verify the effectiveness of algorithm LIP compared with the existing algorithms AG [13] and PPM [26], we evaluate the classification accuracy on the Handwritten and Waveform datasets. The classifier used in the experiment is a linear support vector machine (SVM). In SVM, many parameters can affect accuracy; we mainly consider the influence of the privacy parameter ε on accuracy.
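The accuracy evaluation can be reproduced with a sketch like the one below, which trains a linear SVM (scikit-learn's LinearSVC) on the privately reduced data; the train/test split and solver settings are illustrative assumptions, since the paper does not fix them.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def svm_accuracy(Z, y, seed=0):
    """Train a linear SVM on the reduced data Z with labels y and return test accuracy."""
    Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.3, random_state=seed)
    clf = LinearSVC(max_iter=5000).fit(Z_tr, y_tr)
    return accuracy_score(y_te, clf.predict(Z_te))
```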

In Figure 5, we show the variation of accuracy with different values of ε. For both datasets, we observe that as ε increases (higher privacy risk), the accuracy increases significantly, which indicates that the value of ε has an important effect on accuracy. On the other hand, the accuracies of algorithms AG and LIP are higher than that of PPM on the two datasets. In addition, algorithm AG outperforms LIP in accuracy; this can be explained by the utility gap between $\varepsilon$-DP and $(\varepsilon, \delta)$-DP (AG satisfies $(\varepsilon, \delta)$-DP and LIP satisfies $\varepsilon$-DP). $\varepsilon$-DP provides a stronger privacy guarantee and weaker data utility than $(\varepsilon, \delta)$-DP. For large enough ε, our algorithm LIP can match the performance of AG. More importantly, it provides a stronger privacy guarantee than AG. In conclusion, algorithm LIP achieves both a strong privacy guarantee and good data utility.

6. Conclusions

In this paper, we propose two algorithms, Laplace input perturbation (LIP) and Laplace output perturbation (LOP), for differential privacy principal component analysis. We compare the performance of LIP and LOP in terms of noise magnitude and approximation error via theoretical analysis. Then, we conduct extensive experiments to verify the performance of the two algorithms on five real datasets. In the experiments, we show how the performance of the two algorithms varies with parameters such as the privacy parameter, target dimension, and number of samples. Our theoretical and experimental results indicate that algorithm LIP adds less noise and has lower approximation error than algorithm LOP. Last, to verify the effectiveness of algorithm LIP, we compare it with other recent algorithms, AG and PPM; the experimental results show that algorithm LIP provides a strong privacy guarantee and good data utility.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (61572263, 61502251, 61602263, and 61872197), the Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX18_0891), the Natural Science Foundation of Jiangsu Province (BK20161516 and BK20160916), the Postdoctoral Science Foundation Project of China (2016M601859), and the Natural Research Foundation of Nanjing University of Posts and Telecommunications (NY217119).