Mathematical Problems in Engineering

Volume 2019, Article ID 1460234, 11 pages

https://doi.org/10.1155/2019/1460234

## A Differential Privacy Framework for Collaborative Filtering

^{1}College of Computer Science and Technology, Harbin Engineering University, Harbin, Heilongjiang 150001, China

^{2}College of Computer and Control Engineering, Qiqihar University, Qiqihar, Heilongjiang 161006, China

Correspondence should be addressed to Xiaoye Li; xiaoyeli@hrbeu.edu.cn

Received 18 October 2018; Revised 23 November 2018; Accepted 18 December 2018; Published 9 January 2019

Academic Editor: A. M. Bastos Pereira

Copyright © 2019 Jing Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Focusing on the privacy issues in recommender systems, we propose a framework containing two perturbation methods for differentially private collaborative filtering to counter the threat of inference attacks against users. To conceal individual ratings while still providing valuable predictions, we consider several representative algorithms for calculating the predicted scores and give specific solutions for adding Laplace noise. The DPI (Differentially Private Input) method perturbs the original ratings and can be followed by any recommendation algorithm. By contrast, the DPM (Differentially Private Manner) method operates on the original ratings, perturbing the measurements during execution of the algorithms and releasing only the predicted scores. The experimental results show that both methods provide valuable prediction results while guaranteeing DP, which suggests that the framework is a feasible solution for making private recommendations.

#### 1. Introduction

In the Internet age, users are constantly troubled by information overload, since they cannot extract the genuinely useful parts of large amounts of information. As a promising solution, recommender systems with personalized technologies have been widely used to enhance the user experience in various online services. A typical case is Netflix, which has long worked on recommending the movies best suited to each user's taste. Collaborative filtering (CF for short) is one of the most dominant techniques used in recommender systems. The basic idea is to predict a user's preference based on the preferences of other, similar users. The methods are generally divided into two classes, memory-based methods and model-based methods [1]. However, the users' rating data collected for recommendation is a potential source of privacy leakage, as it can be used to infer users' sensitive information [2]. Calandrino et al. [3] developed algorithms demonstrating several inference attacks that exploit continual recommendations together with auxiliary information. In this work, we focus on the privacy issues in recommender systems and seek feasible solutions based on differential privacy (DP for short) [4], which is recognized as a promising privacy framework.

Zhu et al. [5] proposed a private neighbor collaborative filtering algorithm consisting of two major steps. First, a redesigned exponential mechanism is used to privately select neighbors of higher quality to enhance the performance of the recommendations. The recommendation-aware sensitivity involved is a new sensitivity based on the notion of local sensitivity. Then, the original ratings of the selected neighbors are perturbed by adding Laplace noise. Zhu et al. [6] designed two differentially private algorithms with sampling, named DP-IR and DP-UR, for item-based and user-based recommendation, respectively. Both algorithms use the exponential mechanism with a carefully designed quality function. Jorgensen et al. [7] proposed a privacy preserving framework for personalized social recommendations. There are two distinct graphs in the model settings, an unweighted preference graph and an insensitive social graph. The users are clustered according to the natural community structure of the social network, which significantly reduces the amount of noise required to guarantee DP. However, relationships in social networks are themselves sometimes considered sensitive information.

Friedman et al. [8] proposed a generic framework and evaluated several approaches to differentially private matrix factorization for recommender systems. The specific methods are input perturbation, stochastic gradient perturbation, and ALS with output perturbation. Through comparison and analysis, input perturbation performs best in the recommendation results. McSherry et al. [9] adapted several leading algorithms from the Netflix Prize competition to the framework of DP. Concretely, Laplace noise is incorporated into various global effects and into the covariance matrix of user rating vectors based on item-item similarities. Given these noisy measurements, several algorithms (the k-Nearest Neighbor method [10] and the standard SVD-based prediction mechanism) are employed to make private recommendations directly. Liu et al. [11] proposed a hybrid approach for a privacy preserving recommender system to hide users' private data and prevent privacy inference. The users' original data are disguised through randomized perturbation (RP for short). Similar to [9], the covariance matrix and some averages are then masked with a calibrated amount of noise to achieve DP. Existing algorithms can then run directly on the published noisy measurements.

Unlike the works in [5, 6], we choose to calculate the predicted scores using all users' ratings rather than a recommended list, which makes full use of the data for estimating the noise error. In [6], detailed theoretical results are presented on both the privacy and the accuracy of the proposed method, but experimental results are lacking. By comparison, more experimental results are provided in this work to demonstrate the relationship between privacy and accuracy. In this work, the similarity is calculated from the row vectors of the rating matrices. That is, the calculation is based on user-user similarities, not item-item similarities as in [9]. This mainly reflects recommendation based on users with similar preferences. Furthermore, the experimental results of this paper show that the DPI method performs better than the DPM method, which is consistent with the conclusion in [8].

In this paper, we propose a differential privacy framework for collaborative filtering, which includes three existing algorithms to calculate the predicted scores and adopts two methods of adding Laplace noise to conceal individual ratings and provide valuable prediction results. The rest of the paper is organized as follows. Section 2 introduces the background knowledge. A detailed description of the framework is presented in Section 3. Section 4 reports the experimental evaluations. Finally, Section 5 concludes the study and provides further research directions.

#### 2. Background

##### 2.1. Differential Privacy

DP offers a mathematical definition of privacy and a provable privacy guarantee for each record in the dataset. Intuitively, the output of the computation should not reveal too much information about any record in the dataset. The probability of the output is insensitive to small input changes, whether one record is in the dataset or not. DP is presented in a series of papers [12–16], mainly used in data publishing [17–19] and data mining [20–22].

*Definition 1 (ε-DP [4]). *A randomized computation *K* satisfies *ε*-DP if, for any neighboring datasets *A* and *B* differing on at most one record, and for all subsets *S* of possible outputs,

$$\Pr[K(A) \in S] \leq e^{\varepsilon} \cdot \Pr[K(B) \in S]$$

where *ε* is the privacy budget that controls the trade-off between privacy and accuracy. The value of *ε* is generally set to a small positive value: the smaller it is, the higher the privacy and the lower the accuracy, and vice versa.

Specifically, in this context the neighboring datasets contain the same ratings except for one: the rating in *A* that a user *u* assigns to an item *i* differs from the corresponding rating in *B*.

The noise mechanism is suitable for perturbing the numerical outputs, which is one of the common ways to achieve DP. The amount of noise required is dependent on global sensitivity of the function. There are important combination properties in differentially private algorithms. Formally, the relevant definitions and propositions are described as follows.

*Definition 2 (global sensitivity [4]). *For a function $f: D \to R^{d}$, the global sensitivity of $f$ is

$$\Delta f = \max_{A, B} \left\| f(A) - f(B) \right\|_1$$

where $R^{d}$ is a real vector space of *d* dimensions and *A* and *B* are neighboring datasets. The global sensitivity denotes the maximum extent to which a single record could affect the output results.

*Definition 3 (Laplace mechanism [23]). *For a function $f: D \to R^{d}$, the randomized algorithm *M* satisfies *ε*-DP if

$$M(D) = f(D) + \langle Y_1, \ldots, Y_d \rangle$$

where the $Y_i$ are *i.i.d.* random variables sampled from the Laplace distribution with mean 0 and scale parameter $\Delta f / \varepsilon$.
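To make the mechanism concrete, here is a minimal Python sketch of the Laplace mechanism using NumPy; the function name and the count-query example (global sensitivity 1) are illustrative and not part of the paper.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Perturb a numeric output with Laplace noise of scale sensitivity/epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=np.shape(value))
    return value + noise

# Example: a counting query changes by at most 1 when one record changes,
# so its global sensitivity is 1.
true_count = 42
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
```

A smaller *ε* yields a larger noise scale $\Delta f / \varepsilon$, matching the privacy/accuracy trade-off in Definition 1.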

*Proof (see [24]). *Suppose $M(D) = f(D) + \mathrm{Lap}(\Delta f / \varepsilon)^{d}$. For any output $t \in R^{d}$,

$$\frac{\Pr[M(A) = t]}{\Pr[M(B) = t]} = \prod_{i=1}^{d} \frac{\exp\left(-\varepsilon \left| f(A)_i - t_i \right| / \Delta f\right)}{\exp\left(-\varepsilon \left| f(B)_i - t_i \right| / \Delta f\right)} \leq \exp\left(\frac{\varepsilon \left\| f(A) - f(B) \right\|_1}{\Delta f}\right) \leq e^{\varepsilon}.$$

Similarly, $\Pr[M(B) = t] \leq e^{\varepsilon} \Pr[M(A) = t]$. Thus, *M* satisfies *ε*-DP.

Proposition 4 (sequential composition [25]). *Let each algorithm $A_i$ provide $\varepsilon_i$-DP. The combination algorithm A (A_{1} (D), A_{2} (D),…, A_{k} (D)) over the dataset D provides $\left(\sum_{i=1}^{k} \varepsilon_i\right)$-DP.*

*Proof (see [24]). *For any output $(t_1, \ldots, t_k)$,

$$\frac{\Pr[A(A) = (t_1, \ldots, t_k)]}{\Pr[A(B) = (t_1, \ldots, t_k)]} = \prod_{i=1}^{k} \frac{\Pr[A_i(A) = t_i]}{\Pr[A_i(B) = t_i]} \leq \prod_{i=1}^{k} e^{\varepsilon_i} = e^{\sum_{i=1}^{k} \varepsilon_i}.$$

Proposition 5 (parallel composition [25]). *Let each algorithm $A_i$ provide $\varepsilon_i$-DP. The combination algorithm A (A_{1} (D_{1}), A_{2} (D_{2}),…, A_{k} (D_{k})) over the disjoint subsets $D_i$ of dataset D provides $\left(\max_{i} \varepsilon_i\right)$-DP.*

*Proof (see [24]). *Without loss of generality, assume that $D_j$ differs in one element from $D'_j$, while the other subsets are exactly the same. For any output $(t_1, \ldots, t_k)$,

$$\frac{\Pr[A(D) = (t_1, \ldots, t_k)]}{\Pr[A(D') = (t_1, \ldots, t_k)]} = \frac{\Pr[A_j(D_j) = t_j]}{\Pr[A_j(D'_j) = t_j]} \leq e^{\varepsilon_j} \leq e^{\max_i \varepsilon_i}.$$

In privacy preserving computations, the measurements need to be allocated a reasonable privacy budget based on these combination properties.
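As an illustration of sequential composition, the sketch below (hypothetical function name and sensitivity values) splits a total privacy budget evenly across several measurements of the same dataset, so the overall release stays within the total budget.

```python
import numpy as np

def private_measurements(values, sensitivities, total_epsilon, rng=None):
    """Release k measurements of one dataset under a total budget.

    By sequential composition, spending total_epsilon / k on each
    measurement keeps the combined release total_epsilon-DP.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps_each = total_epsilon / len(values)
    return [v + rng.laplace(0.0, s / eps_each) for v, s in zip(values, sensitivities)]

# Example: a rating sum (sensitivity 5, assuming ratings in [0, 5]) and a
# rating count (sensitivity 1), released under a total budget of 1.0.
noisy_sum, noisy_count = private_measurements([1234.0, 300.0], [5.0, 1.0], 1.0)
```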

##### 2.2. Collaborative Filtering

The CF system usually presents a sorted list of predicted items to the active user. The performance can be evaluated by the expected utility of the items in the recommended list. Alternatively, the system may provide numeric scores directly for the predicted items. The performance can be measured by a normed distance between the predicted scores and the actual preference values (i.e., the original ratings). There are two common ways of calculating the similarity, Pearson correlation coefficient (Pcc) and Cosine-based similarity (Cos) [26].
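The second evaluation style can be made concrete with a mean absolute error (MAE) computation; the rating values below are hypothetical.

```python
import numpy as np

def mae(predicted, actual):
    """Mean absolute error between predicted scores and original ratings."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return float(np.mean(np.abs(predicted - actual)))

mae([3.5, 4.0, 2.0], [4.0, 4.0, 3.0])  # → 0.5
```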

Pcc is a linear correlation coefficient used to measure the correlation between two random variables *X* and *Y*. Its values range between -1 and 1, and the larger the absolute value is, the stronger the correlation. When *X* is linearly dependent on *Y*, the value is 1 (positive linear correlation) or -1 (negative linear correlation). It is defined as the ratio of the covariance of *X* and *Y* to the product of their standard deviations:

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X - E[X])(Y - E[Y])\right]}{\sigma_X \sigma_Y}$$

where *E* is the expected value.

Cos evaluates the similarity by calculating the cosine of the angle between two vectors *X* and *Y*, which pays more attention to the difference in directions between the vectors. Its values range between -1 and 1, where 0 denotes that the two vectors are orthogonal. When the directions of the two vectors coincide, the cosine takes the maximum value of 1; when they are opposite, it takes the minimum value of -1. The vector similarity is defined as follows:

$$\cos(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|} = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2} \sqrt{\sum_{i} y_i^2}}$$
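The two similarity measures can be sketched in Python as follows; the sample rating vectors are hypothetical, and in practice each pair of vectors would be restricted to the items both users have rated.

```python
import numpy as np

def pearson_sim(x, y):
    """Pearson correlation between two users' rating vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def cosine_sim(x, y):
    """Cosine of the angle between two users' rating vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

u = [5, 3, 4, 4]  # hypothetical ratings of user u
v = [3, 1, 2, 3]  # hypothetical ratings of user v
s_pcc, s_cos = pearson_sim(u, v), cosine_sim(u, v)
```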

#### 3. The Proposed Method

DP is essentially a property that the system should maintain, rather than a specific way of computing. Therefore, the designed framework includes different perturbation methods to carry out predictions in a differentially private manner. The simple technique of noise addition is well suited to predicting the scores while protecting the original ratings from leakage.

As shown in Figure 1, the framework includes three different algorithms from previous research [1]. The calculation methods of the Pcc and Cos algorithms are similar, except for the different measurements of similarity between users. For a detailed description, see Algorithm 1. The Avg algorithm makes use of the average of the original ratings to predict the scores. According to where the noise is added, the perturbation methods take two forms. In the DPI method, the noise is added to each entry in the rating matrix to mask the original data. In the DPM method, the noise is added to the various measurements of the algorithms, which are computed from the original rating matrix.
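The difference between the two perturbation locations can be sketched as follows. This is an illustrative simplification, not the paper's exact algorithms: it assumes ratings lie in [1, 5] (so changing one rating moves any entry by at most 4) and uses an Avg-style per-user average as the DPM measurement.

```python
import numpy as np

DELTA = 4.0  # assumed rating range: ratings lie in [1, 5]

def dpi_perturb(ratings, epsilon, rng=None):
    """DPI: perturb every entry of the rating matrix; any recommendation
    algorithm can then run on the noisy matrix."""
    rng = np.random.default_rng() if rng is None else rng
    return ratings + rng.laplace(0.0, DELTA / epsilon, size=ratings.shape)

def dpm_avg_predict(ratings, epsilon, rng=None):
    """DPM (Avg-style): compute per-user averages on the original ratings,
    perturb these measurements, and release them as predicted scores."""
    rng = np.random.default_rng() if rng is None else rng
    n_users, n_items = ratings.shape
    avgs = ratings.mean(axis=1)
    # One changed rating moves a user's average by at most DELTA / n_items.
    noisy = avgs + rng.laplace(0.0, (DELTA / n_items) / epsilon, size=n_users)
    return np.clip(noisy, 1.0, 5.0)

R = np.array([[5.0, 3.0, 4.0], [2.0, 2.0, 1.0]])  # toy rating matrix
```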