Special Issue: Social Security and Privacy for Social IoT
Polymorphic Value Set: A Solution to Inference Attacks on Social Networks
Social Internet of Things (SIoT) integrates social network schemes into Internet of Things (IoT), which provides opportunities for IoT objects to form social communities. Existing social network models have been adopted by SIoT paradigm. The wide distribution of IoT objects and openness of social networks, however, make it more challenging to preserve privacy of IoT users. In this paper, we present a novel framework that preserves privacy against inference attacks on social network data through ranked retrieval models. We propose PVS, a privacy-preserving framework that involves the design of polymorphic value sets and ranking functions. PVS enables polymorphism of private attributes by allowing them to respond to different queries in different ways. We begin this work by identifying two classes of adversaries, authenticity-ignorant adversary, and authenticity-knowledgeable adversary, based on their knowledge of the distribution of private attributes. Next, we define the measurement functions of utility loss and propose PVSV and PVST that preserve privacy against authenticity-ignorant and authenticity-knowledgeable adversaries, respectively. We take into account the utility loss of query results in the design of PVSV and PVST. Finally, we show that PVSV and PVST meet the privacy guarantee with acceptable utility loss in extensive experiments over real-world datasets.
SIoT integrates social network schemes into IoT systems, which provides opportunities for IoT objects to form social networks. The SIoT paradigm is promising because SIoT structures are believed to help enhance the navigability of IoT networks, identify levels of trustworthiness, and reuse existing social network models. In this scenario, privacy and security issues have been extensively studied [6, 7, 24, 48]. However, current studies in privacy preservation of IoT systems focus on access control [23, 46], communication and authentication protocols [4, 34, 44], and attribute-based encryption [39, 43]. The features of social networks have not been thoroughly considered.
The nature of online social networks (OSN) requires sharing of information. User information, including activity patterns and descriptive attributes, is mined and analyzed to improve the user experience of OSN applications. Third-party users also take advantage of the huge amount of data collected by social networks. As part of the improvement of user experience, ranked retrieval models have been extensively studied and applied to many OSN features, e.g., link prediction and recommendation systems [32, 36, 47]. For instance, given a user’s information (which can be a tuple in a database), a ranked retrieval model returns a ranking result that serves OSN features, e.g., “People you may know” and “Recommended for you.” Furthermore, many OSN providers improve the accuracy of ranked results by taking into account private attributes of users in the ranked retrieval model. For example, sensitive demographics such as race, religion, and income can help in friend recommendation features because, intuitively, people sharing similar demographics are more likely to be interested in each other.
OSN providers relieve users’ concerns about privacy leakage by allowing them to mark attributes as “private” and by hiding private attributes from profile pages and ranked results. Users believe that their privacy is well protected since private attributes are invisible to the public in their profiles and in any ranked results. However, Rahman et al. proposed Rank Inference and showed that the privacy of private attributes in the ranked retrieval model is not guaranteed. In their approach, the value of a private attribute can be inferred through a ranked retrieval interface given the premises that the domain of the private attribute is finite and that the ranking function has both the monotonicity property and the additivity property. To be more specific, given the monotonicity and additivity conditions, an adversary is always able to find a pair of differential queries, $q$ and $q'$, such that (a) $q$ and $q'$ share the same predicate on all attributes except for a private attribute $B_j$, (b) the value of $B_j$ in $q$ is θ while the value of $B_j$ in $q'$ is not, and (c) the ranked result of $q'$ contains the victim tuple $v$ while the ranked result of $q$ does not. Rahman et al. showed that given the above conditions, the adversary can conclude that the value of $v$’s $B_j$ is not equal to θ. Furthermore, the adversary can infer the value of $v$’s $B_j$ by finding more differential queries and excluding more values from the domain of $B_j$, as long as the domain is finite.
Privacy issues arise when private attributes of users are taken into account in the ranked retrieval model. Intuitively, the issues can be solved by removing private attributes from ranking functions, but doing so decreases the utility of many OSN features; for example, the recommendation system can no longer provide accurate results. From OSN providers’ perspective, removing private attributes is not a practical solution. Therefore, this work aims not only to address the issue of rank inference but also to propose a framework that preserves the privacy of users against all inference attacks through the ranked retrieval model while minimizing utility loss.
1.2. Related Work
Encryption technologies have been used throughout history in security and privacy preservation. Searching on encrypted data [10, 38] has been introduced to ensure data privacy. Cao et al. proposed a scheme that allows privacy-preserving ranked search over encrypted data. A query consisting of multiple keywords is conducted by searching over encrypted documents with the secure k-nearest neighbor (kNN) technique. Chen et al. took into consideration correlations between documents before conducting a search query and achieved better performance. Vertical fragmentation [13, 15, 16, 22] has been applied to encrypted data, which hides identities of users by separating identifier attributes from descriptive attributes. However, encrypted searching is developed to preserve privacy against adversaries in a cloud computing environment. Adversaries can still access decrypted search results, from which private attribute values can be inferred.
As OSN has been emerging as an important source for big data, many studies have been carried out for privacy-preserving data mining (PPDM) and privacy-preserving data publishing (PPDP). Perturbative methods implement the “camouflage” paradigm where original data are directly modified. Agrawal et al. proposed an algorithm that perturbs data with random additive noise. Liu et al. proposed data perturbation with multiplicative noise. However, random noise has predictable structures in the spectral domain, and thus, the privacy provided by additive noise is questionable. Furthermore, additive or multiplicative noise can only be applied to numerical data. Data swapping approaches [19, 30, 31] perturb data by swapping values between records that are close to each other. Other distance-based approaches either perturb data points without changing their relative closeness relationships or cluster data points and replace each data point’s value with the value of the cluster center. However, those approaches rely on a universal measurement of closeness between data points in multidimensional space. Furthermore, they are limited by the distributions of data points.
Generalization is the process of replacing a group of values with a more general value that can represent the group. Suppression is the ultimate state of generalization such that the representative value is “not applicable” and, as a result, the group of values is removed from the dataset. k-anonymity [40, 41] is a widely studied approach that preserves privacy of records by grouping at least k records into an equivalence class. The attribute values of the k records are suppressed so that the k records are indistinguishable to adversaries. Machanavajjhala et al. proposed l-diversity, which focuses on attribute privacy. l-diversity forces each equivalence class to have at least l different values for each attribute. Li et al. proposed t-closeness, which further considers the distribution of attribute values. t-closeness sets a threshold on the difference between the distribution of a private attribute in each equivalence class and the distribution of the same attribute in the entire dataset. However, those approaches are developed to preserve privacy of published data, while the ranked retrieval model of OSN does not directly reveal private attributes to the public. Furthermore, suppression of private attribute values introduces unnecessary utility loss to the ranked retrieval model. Equivalent Set was proposed to preserve privacy against inference attacks through the ranked retrieval model. This approach groups different tuples into a set such that they are indistinguishable in ranked results. However, this approach requires that tuples in the same equivalent set have different values in every private attribute and have the same value in every public attribute. Assume that a kNN query q is sent to a ranked retrieval interface protected by Equivalent Set and that a tuple t is equal to q in most private attributes. Since the other tuples in the same equivalent set of t are different from t in every private attribute, they must be different from q in most private attributes too. Therefore, the original rank of t given q should be much higher than that of any other tuple in the same equivalent set. In this case, the rank of t given q will be significantly lowered by Equivalent Set in order to achieve indistinguishability, which reduces the accuracy of the ranked result of q. Furthermore, it is possible that there is no tuple $t'$ that is different from t in every private attribute and is the same as t in every public attribute. In this case, we have to suppress the private attribute values of t, which further introduces utility loss.
Differential privacy [8, 17, 18] is another widely studied framework that preserves privacy of published datasets or hidden databases. It imposes a strong guarantee of privacy on tuples in statistical databases by adding noise to query results. However, the ranked retrieval model outputs ranks of tuples instead of their values or aggregate statistics. We cannot directly add noise to ranked results because the rank of a tuple is determined not only by the tuple itself but also by other tuples in the database. Furthermore, the optimization of the ranked retrieval model has not been considered.
This work presents a novel privacy-preserving scheme for the ranked retrieval model. We start with an introduction to the adversary model and introduce our definition of the privacy guarantee. We identify two categories of adversaries based on their prior knowledge and assume that adversaries can launch optimal inference attacks through ranked results.
We propose the polymorphic value set (PVS), a privacy-preserving framework for the ranked retrieval model. Different from existing methods, PVS does not directly modify values of tuples or query results. Instead, PVS enables polymorphism of private attributes such that a private attribute of a tuple can respond to different queries in different ways. We prove that our framework meets the privacy guarantee stated in Problem Statement. For adversaries with and without prior knowledge, we design and implement the polymorphic value set with true values (PVST) and the polymorphic value set with virtual values (PVSV), respectively. In the design of PVST and PVSV, we consider utility loss in the ranked retrieval model and propose a practical measurement of utility loss. We prove that the task of minimizing utility loss is NP-hard and present two heuristic algorithms that implement PVST and PVSV, respectively. We run our implementations of PVST and PVSV on a real-world dataset from eHarmony that contains 486,464 tuples. In these experiments, both implementations meet the privacy guarantee with acceptable utility loss.
The remainder of this paper is organized as follows. Problem Statement introduces the adversary model and the privacy guarantee. Privacy-Preserving Framework introduces our privacy-preserving framework. Framework with Virtual Values presents the design and implementation of PVSV, along with our analysis of utility loss. The implementation of PVST and analysis of utility loss are presented in Framework with True Values. Experimental Results contains our experimental evaluation of PVSV and PVST. In Conclusions, we conclude this paper with a summary of our key contributions and a discussion of some open problems.
2. Problem Statement
2.1. Ranked Retrieval Model
In information retrieval, we have witnessed extensive research in the ranked retrieval model. Unlike the Boolean retrieval model where only results that exactly match the predicates can be returned, the ranked retrieval model allows users to retrieve a list of records sorted by a proprietary ranking function. Therefore, the ranked retrieval model provides an alternative solution for users seeking results sorted by their relevance to the query.
As discussed in the introduction, many OSN applications have been using the ranked retrieval model to process incoming queries. Upon a query q, the ranked retrieval model would calculate each tuple t’s score according to a proprietary score function and return top-k tuples in descending order of their scores. The attributes of tuples could be either categorical or numerical. In this paper, we consider only categorical data, which does not limit the scope of our research. Actually, numerical data can be treated as categorical data by categorizing the numerical domain into small intervals such that no more than one tuple in the database falls into the same interval. Without loss of generality, we also assume that there is no duplicate tuple that is equal to another tuple in every attribute.
We now formalize our ranked retrieval model with categorical attributes. Consider an n-tuple database D with m public attributes $A_1, \ldots, A_m$ and $m'$ private attributes $B_1, \ldots, B_{m'}$. Let $V^A_i$ (resp. $V^B_j$) denote the value domain of $A_i$ (resp. $B_j$). Let $t[A_i]$ (resp. $t[B_j]$) denote the value of t in $A_i$ (resp. $B_j$). Upon query q, the score function $s(t \mid q)$ computes a score for each tuple $t \in D$. The ranked retrieval model then sorts all tuples in D in descending order of their scores and returns them as the ranked result. We consider the case where the score function is linear. Therefore, $s(t \mid q)$ can be defined as
$$s(t \mid q) = \sum_{i=1}^{m} w^A_i \, \rho(q, t, A_i) + \sum_{j=1}^{m'} w^B_j \, \rho(q, t, B_j), \quad (1)$$
where $w^A_i$ (resp. $w^B_j$) is the weight of attribute $A_i$ (resp. $B_j$) in the score function, and the matching function $\rho(q, t, A_i)$ (resp. $\rho(q, t, B_j)$) indicates whether t matches q in attribute $A_i$ (resp. $B_j$). That is, the value of $\rho(q, t, A_i)$ is 1 if $t[A_i]$ is equal to $q[A_i]$ and 0 if $t[A_i]$ is not equal to $q[A_i]$, and likewise for $B_j$. Note that our ranked retrieval model satisfies the monotonicity and additivity properties discussed in the introduction.
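To make the linear model concrete, the following is a minimal Python sketch of a score function that adds an attribute's weight whenever the tuple matches the query on that attribute, followed by a sort in descending score order. The attribute names, weights, tuple identifiers, and the tie-breaking rule (by identifier) are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List

Record = Dict[str, str]  # attribute name -> categorical value


def score(t: Record, q: Record, weights: Dict[str, float]) -> float:
    """Linear score: add an attribute's weight whenever t matches q on it."""
    return sum(w for attr, w in weights.items()
               if attr in q and t.get(attr) == q[attr])


def ranked_result(db: Dict[str, Record], q: Record,
                  weights: Dict[str, float]) -> List[str]:
    """Return tuple ids in descending score order.

    Ties are broken by tuple id; this tie-breaking rule is an assumption.
    """
    return sorted(db, key=lambda tid: (-score(db[tid], q, weights), tid))


# Hypothetical data: "city" plays the role of a public attribute,
# "income" the role of a private attribute.
db = {
    "t1": {"city": "NYC", "income": "high"},
    "t2": {"city": "NYC", "income": "low"},
    "t3": {"city": "LA", "income": "high"},
}
weights = {"city": 1.0, "income": 2.0}
query = {"city": "NYC", "income": "high"}

order = ranked_result(db, query, weights)
print(order)  # ['t1', 't3', 't2']
```

Because the private attribute carries weight in the score, t3 outranks t2 here even though only t2 matches the query on the public attribute; this dependence is what an inference attack exploits.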
2.2. Adversary Model
In Motivation, we mentioned that we do not make any assumption about the method adopted by an adversary when attacking a database. We also assume that the adversary has prior knowledge about the metadata of tables in the database, as well as the proprietary ranking function. Furthermore, the adversary is assumed to be able to issue queries to the ranked retrieval model, view ranked results, and insert tuples into the database. As a result, the adversary is able to retrieve all public attribute values by crawling the database through the query interface. We denote the set of queries issued by the adversary as $Q_A$, the set of tuples inserted by the adversary as $T_A$, and the corresponding set of ranked results as $R_A$. $R_A$ is fully determined by $Q_A$ and $T_A$ given fixed D. We name all information regarding a tuple t that an adversary can find in $R_A$ as the trace of t and denote the trace of t as $\mathrm{trace}(t)$. The trace of t includes, but is not limited to, the rank of t and the relationship between t and any other tuple (e.g., t has a higher or lower rank than another tuple $t'$) in a ranked result. Therefore, given fixed D, $Q_A$, and $T_A$, $\mathrm{trace}(t)$ is fully determined by the attribute values of t.
Another capability of the adversary that we model is the adversary’s prior knowledge. Consider an extreme case where the adversary knows the equivalence relation between two attributes $A_i$ and $B_j$. In this case, even without $R_A$, the adversary is still able to infer the value of $t[B_j]$ of any $t \in D$. In reality, an adversary can acquire such attribute correlations by being or consulting an expert in the domain of the dataset or by adopting data mining methods. For example, based on the personal information (e.g., gender, ethnicity, age, and blood type, which can be used to infer the gene) stored as public attributes and published in public medical data repositories, genetic epidemiologists can generally conclude that an individual does not have some diseases, merely based on the fact that these diseases would never be found with the candidate gene among historic medical datasets. Therefore, an adversary with prior knowledge could eliminate the possibility of a certain tuple in the database. We model prior knowledge as a function $K(t, D)$ that takes as input t and D and returns either 0 or 1. $K(t, D) = 0$ indicates that, given prior knowledge, the possibility of $t \in D$ is zero. $K(t, D) = 1$ indicates that the adversary cannot eliminate the possibility of $t \in D$. As prior knowledge helps adversaries in launching an inference attack, adversaries can be partitioned into two classes: adversaries with prior knowledge and adversaries without prior knowledge of the dataset.
Definition 1. We name adversaries without prior knowledge of the authenticity of any tuple as authenticity-ignorant adversaries. For authenticity-ignorant adversaries, the prior knowledge function $K(t, D)$ always outputs 1. We name adversaries with such prior knowledge as authenticity-knowledgeable adversaries. For authenticity-knowledgeable adversaries, $K(t, D) = 1$ if $t \in D$ and $K(t, D) = 0$ if $t \notin D$.
The objective of both classes of adversaries is to maximize the following value when inferring the value of victim tuple $v$ in attribute $B_j$:
$$\Pr\big(a = v[B_j]\big), \quad (2)$$
where $\Pr(a = v[B_j])$ is the probability of $a = v[B_j]$ and a is the value inferred by the adversary given prior knowledge and ranked results.
For authenticity-ignorant adversaries, the prior knowledge function $K(t, D)$ always outputs 1 regardless of t and D. We only consider the cases where users input true information to the databases. Therefore, for authenticity-knowledgeable adversaries, $K(t, D) = 1$ if $t \in D$ and $K(t, D) = 0$ if $t \notin D$. In this paper, we assume a strong adversary that can infer the private attribute values of an arbitrary tuple, given the premise that the trace of the target tuple is unique. The premise can be easily met because, as long as there is no duplicate tuple in the dataset, the adversary can always construct a well-designed query set and inserted tuple set such that the trace of the target tuple is different from the trace of any other tuple. Therefore, without any countermeasure, such an adversary can always identify the private attribute value of the victim tuple.
2.3. Problem Statement
A privacy breach can be described by a successful inference of a private attribute value in the database. We view the privacy of $t[B_j]$ as the upper bound on the probability that an adversary succeeds in inferring the value of $t[B_j]$. Note that we do not make any assumption about the adversary’s attacking method. Our objective in this paper is to present a framework that sets an upper bound on the probability of successful inference of an arbitrary private attribute $B_j$ for tuple $t \in D$. Therefore, we define the objective of the framework as
$$\forall t \in D,\ \forall j \in \{1, \ldots, m'\}: \quad \Pr\big(a = t[B_j]\big) \le \epsilon. \quad (3)$$
We present the upper bound ϵ as our privacy guarantee.
However, a privacy-preserving framework should provide not only a privacy guarantee but also a notion of utility; after all, a framework that removes all private attribute values or replaces them with randomly generated values can surely preserve privacy. Therefore, we use a measurement based on the difference between ranked results before and after adopting our framework. Given D and the set of all possible queries, denoted as Q, we define the utility loss for our ranked retrieval model as follows:
$$U = \sum_{q \in Q} \sum_{t \in D} \big| r(t \mid q) - r'(t \mid q) \big|, \quad (4)$$
where $r(t \mid q)$ and $r'(t \mid q)$ refer to the ranks of tuple t in the ranked result given query q before and after applying our frameworks, respectively.
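The rank-based utility loss can be illustrated with a short Python sketch. It assumes that the loss sums the absolute difference between each tuple's rank before and after the framework is applied, over every query; the absolute-value form, the function names, and the sample rankings are assumptions made for illustration only.

```python
from typing import Dict, List


def ranks(order: List[str]) -> Dict[str, int]:
    """Map each tuple id to its 1-based position in a ranked result."""
    return {tid: pos for pos, tid in enumerate(order, start=1)}


def utility_loss(before: Dict[str, List[str]],
                 after: Dict[str, List[str]]) -> int:
    """Sum of absolute rank changes over all queries and tuples.

    `before` and `after` map a query id to the ranked list of tuple ids
    returned without and with the privacy-preserving framework.
    """
    total = 0
    for query_id, original_order in before.items():
        r_before = ranks(original_order)
        r_after = ranks(after[query_id])
        total += sum(abs(r_before[tid] - r_after[tid]) for tid in r_before)
    return total


# Hypothetical ranked results for two queries.
before = {"q1": ["t1", "t3", "t2"], "q2": ["t2", "t1", "t3"]}
after = {"q1": ["t1", "t2", "t3"], "q2": ["t2", "t1", "t3"]}

print(utility_loss(before, after))  # 2: t2 and t3 each move by one position in q1
```

A loss of zero means the framework left every ranked result untouched, which is the ideal that the later optimization sections try to approximate.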
3. Privacy-Preserving Framework
The only information an adversary can obtain from a database through the ranked retrieval model is public attribute values and ranked results. For an adversary without prior knowledge, information regarding private attribute values can only be retrieved from ranked results. Therefore, in order to preserve privacy, we have to modify the ranked retrieval model such that the adversary cannot retrieve any useful information about private attributes from ranked results.
One idea is to group different tuples together in ranked results. As in our prior work, we can group two tuples $t_1$ and $t_2$ together such that they share the same rank in any ranked result. This can be achieved by adopting a new ranking function $s'$ such that $s'(t_1 \mid q) = s'(t_2 \mid q)$ for all q. If $t_1$ and $t_2$ have different values on every private attribute, then the adversary is unable to infer the private values of $t_1$ since $t_1$ and $t_2$ are indistinguishable in any ranked result. However, this method suffers from high utility loss. In order to preserve the privacy of all private attributes, $t_1$ and $t_2$ have to be different over all private attributes. Thus, the original scores of $t_1$ and $t_2$, i.e., $s(t_1 \mid q)$ and $s(t_2 \mid q)$, differ substantially, which leads to a larger difference between the rank of $t_1$ before and after adopting this method.
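In this work, we present a novel framework that preserves the privacy of private attributes while minimizing utility loss. We observe that for a tuple $t$’s private attribute $B_j$, if there are at least two potential values for $t[B_j]$ and an adversary cannot differentiate between them, then the privacy of $t[B_j]$ can be preserved. For instance, if the probability of $t[B_j] = a$ is equal to the probability of $t[B_j] = b$, given ranked results and prior knowledge, then the adversary cannot exclude either of them. If both probabilities are 50%, then the adversary may choose to randomly pick a value from a and b as the inferred result. In this case, the adversary’s probability of success will not exceed 50% and the privacy of $t[B_j]$ can be preserved. To prove this statement, suppose that $t_1$ is an arbitrary tuple in database D, and we want to preserve the privacy of $t_1[B_j]$. Let $b$ be an arbitrary value in $V^B_j$ with $b \neq t_1[B_j]$. We construct tuple $t_2$ such that $t_1$ and $t_2$ differ in only one attribute $B_j$: $t_2[B_j] = b$. We also construct database $D'$ such that D and $D'$ differ in only one tuple: $t_1 \in D$ while $t_2 \in D'$. We define a new score function $s'$:
$$s'(t \mid q) = \sum_{i=1}^{m} w^A_i \, \rho(q, t, A_i) + \sum_{k=1}^{m'} w^B_k \, \rho'(q, t, B_k), \quad (5)$$
where
$$\rho'(q, t, B_k) = \begin{cases} 1 & \text{if } k = j,\ t \in \{t_1, t_2\},\ \text{and } q[B_j] \in \{t_1[B_j], b\}, \\ \rho(q, t, B_k) & \text{otherwise.} \end{cases} \quad (6)$$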
Imagine a case where an adversary queries D and $D'$ with the same query workload $Q_A$. We denote the ranked results from D as $R_A$ and the ranked results from $D'$ as $R'_A$. Under the new score function, for all q, $s'(t_1 \mid q)$ is identical to $s'(t_2 \mid q)$. Therefore, given only ranked results, the adversary cannot tell the difference between the two databases being queried. Furthermore, even if we exchange the values of $t_1[B_j]$ and $t_2[B_j]$, the adversary still cannot observe any change in $R_A$ or $R'_A$. As a result, the values of $t_1[B_j]$ and $t_2[B_j]$ are equivalent from the perspective of the adversary, and the privacy of $t_1[B_j]$ can be well preserved. Intuitively, $t_1$ can be seen as a tuple that has two polymorphic forms in $B_j$: $t_1[B_j]$ and $b$. When calculating the score of $t_1$ with $s'$, we always choose the value that maximizes $s'(t_1 \mid q)$.
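We can further extend the statement to a more general case. For each tuple $t \in D$ and each private attribute $B_j$, we can select e distinct values $b_1$, $b_2$, …, $b_e$ from $V^B_j$, with $b_k \neq t[B_j]$ for $1 \le k \le e$. The new score function can be defined as
$$s'(t \mid q) = \sum_{i=1}^{m} w^A_i \, \rho(q, t, A_i) + \sum_{j=1}^{m'} w^B_j \, \rho'(q, t, B_j), \quad (7)$$
where
$$\rho'(q, t, B_j) = \begin{cases} 1 & \text{if } q[B_j] \in \{t[B_j], b_1, \ldots, b_e\}, \\ 0 & \text{otherwise.} \end{cases} \quad (8)$$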
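Consider e tuples $t_1, \ldots, t_e$ which are identical to $t$ in all attributes except for $B_j$. Let $t_k[B_j]$ be $b_k$, $1 \le k \le e$, where $b_1, \ldots, b_e$ are the values that share a polymorphic match with $t[B_j]$ under the new score function $s'$. Then for all q, we have $s'(t \mid q) = s'(t_k \mid q)$. As a result, from the adversary’s perspective, there are $e + 1$ potential values for $t[B_j]$: $t[B_j]$, $b_1$, …, $b_e$, which are indistinguishable from each other in any ranked result. The privacy of $t[B_j]$ can be preserved by grouping it with e equivalent values.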
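As such, we introduce the construction of the polymorphic value set (PVS). We put the value $t[B_j]$ into a set in which all values are indistinguishable when calculating the score with respect to $B_j$ in the ranking function, i.e., the set consisting of $t[B_j]$ and the values selected to be equivalent to it. We name the above set the polymorphic value set and denote the polymorphic value set of tuple $t$ in attribute $B_j$ as $\mathrm{PVS}(t, B_j)$. We define $\mathrm{PVS}(t, B_j)$ as follows.
Definition 2. $\mathrm{PVS}(t, B_j)$ is a set containing all indistinguishable values of tuple $t$’s private attribute $B_j$. Assigning $t[B_j]$ an arbitrary value in $\mathrm{PVS}(t, B_j)$ will not change the value of the new score function $s'(t \mid q)$, i.e., the score is the same for every $b \in \mathrm{PVS}(t, B_j)$ and every query q.
Since the adversary cannot distinguish different values in $\mathrm{PVS}(t, B_j)$ by launching any inference attack based only on ranked results, the privacy guarantee of $t[B_j]$ is
$$\Pr\big(a = t[B_j]\big) \le \frac{1}{|\mathrm{PVS}(t, B_j)|}. \quad (9)$$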
In order to achieve the privacy guarantee defined in (3), for each $t \in D$ and each private attribute $B_j$, we have to ensure that (a) there is one and only one polymorphic value set for $t$’s attribute $B_j$ and (b) the privacy guarantee defined in (9) is always valid.
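The polymorphic matching described in this section can be sketched in Python as follows. The sketch assumes that a private attribute counts as a match whenever the query value falls anywhere in the tuple's polymorphic value set, so the stored private value is never consulted directly. The attribute names, weights, and helper name `poly_score` are hypothetical.

```python
from typing import Dict, Set

Record = Dict[str, str]


def poly_score(t: Record, q: Record,
               public_weights: Dict[str, float],
               private_weights: Dict[str, float],
               pvs: Dict[str, Set[str]]) -> float:
    """Score t against q using polymorphic matching on private attributes.

    Public attributes match on equality. A private attribute matches when
    the query value is any member of the tuple's polymorphic value set.
    """
    public = sum(w for a, w in public_weights.items()
                 if a in q and t[a] == q[a])
    private = sum(w for b, w in private_weights.items()
                  if q.get(b) in pvs[b])
    return public + private


# Hypothetical tuple whose true income is "high"; "mid" is its polymorphic value.
t = {"city": "NYC", "income": "high"}
pvs_t = {"income": {"high", "mid"}}
public_w = {"city": 1.0}
private_w = {"income": 2.0}

queries = [
    {"city": "NYC", "income": "high"},
    {"city": "NYC", "income": "mid"},
    {"city": "NYC", "income": "low"},
]
scores = [poly_score(t, q, public_w, private_w, pvs_t) for q in queries]
print(scores)  # [3.0, 3.0, 1.0]

# Swapping the stored value for the other member of the set changes nothing,
# which is why an observer of ranked results cannot tell the two apart.
t_alt = dict(t, income="mid")
alt_scores = [poly_score(t_alt, q, public_w, private_w, pvs_t) for q in queries]
```

The price of this indistinguishability is visible in the second query: the tuple receives the private-attribute weight even though its true value does not match, which is the source of the utility loss analyzed later.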
4. Framework with Virtual Values
In this section, we introduce how polymorphic value sets can be constructed with generated values to meet the privacy guarantee in (9) against authenticity-ignorant adversaries. We name values generated by our framework as virtual values.
As we proved in Privacy-Preserving Framework, an authenticity-ignorant adversary cannot distinguish the value of $t[B_j]$ from other valid values in $\mathrm{PVS}(t, B_j)$, given ranked results. Since an adversary without prior knowledge cannot validate the authenticity of any value in $\mathrm{PVS}(t, B_j)$, all values in $\mathrm{PVS}(t, B_j)$ are “valid” from the perspective of the adversary, no matter whether they are generated by our framework or collected from real data in D. Therefore, we observe that $\mathrm{PVS}(t, B_j)$ can be formed by any values in $V^B_j$.
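In order to achieve the privacy guarantee of $\epsilon = 1/l$, we have to ensure that $|\mathrm{PVS}(t, B_j)| \ge l$ for all $t \in D$ and $1 \le j \le m'$. An intuitive algorithm to generate $\mathrm{PVS}(t, B_j)$ of size $l$ is to randomly pick values from $V^B_j$. Specifically, let the initial $\mathrm{PVS}(t, B_j) = \{t[B_j]\}$. Then, we can randomly pick distinct values from $V^B_j$ and insert them into $\mathrm{PVS}(t, B_j)$ until $\mathrm{PVS}(t, B_j)$ contains at least $l$ distinct values. In the same manner, we can construct a polymorphic value set for every private attribute of every tuple.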
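The intuitive random construction can be sketched in a few lines of Python. The function name `random_pvsv`, the sample domain, and the use of a seeded `random.Random` instance are assumptions for illustration; the sketch simply starts from the true value and adds distinct random values from the attribute domain until the set reaches the requested size.

```python
import random
from typing import Iterable, Set


def random_pvsv(true_value: str, domain: Iterable[str], size: int,
                rng: random.Random) -> Set[str]:
    """Build a polymorphic value set of `size` values containing `true_value`.

    The remaining size - 1 members are drawn uniformly without replacement
    from the other values of the attribute domain.
    """
    others = sorted(v for v in domain if v != true_value)
    if size < 1 or size - 1 > len(others):
        raise ValueError("domain too small for the requested set size")
    return {true_value, *rng.sample(others, size - 1)}


# Hypothetical income domain; a set of size 3 corresponds to a bound of 1/3.
income_domain = ["low", "mid", "high", "very_high"]
pvs = random_pvsv("high", income_domain, size=3, rng=random.Random(7))
print(sorted(pvs))
```

Random selection meets the size requirement but ignores which values are frequently queried, so it gives no control over utility loss.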
4.1.1. Privacy Guarantee
For a database D where every private attribute $B_j$ of every tuple $t$ is included in one polymorphic value set with virtual values (PVSV) whose size is at least l, if the adversary has no prior knowledge of D, a privacy level of $1/l$ is achieved.
For an authenticity-ignorant adversary, the prior knowledge function always outputs 1. As proved in Privacy-Preserving Framework, for an authenticity-ignorant adversary, it is impossible to distinguish $t[B_j]$ from at least $l - 1$ other values. Thus we have $\Pr(a = t[B_j]) \le 1/l$ for all $t \in D$ and $1 \le j \le m'$. According to equation (3), a privacy guarantee of $\epsilon = 1/l$ can be achieved.
4.2. Utility Optimization
In this section, we discuss how to reduce the utility loss caused by polymorphic value sets. We introduced a metric of utility loss in (4) that calculates the sum of differences in ranked results given all possible queries. To practically calculate utility loss, we limit the range of queries to a finite set named the query workload. In practice, the query workload of a database D can be a set of queries that are more frequently issued than any other queries. A query workload may contain duplicate queries, which reflect the distribution of frequent queries. With a query workload, denoted as Q, we can define the practical utility loss as
$$U_Q = \sum_{q \in Q} \sum_{t \in D} \big| r(t \mid q) - r'(t \mid q) \big|, \quad (10)$$
where $r(t \mid q)$ and $r'(t \mid q)$ are the ranks of t given q before and after applying the framework.
In order to reduce utility loss, we have to find assignments of $\mathrm{PVS}(t, B_j)$ for all $t \in D$ and $1 \le j \le m'$ such that the overall practical utility loss can be minimized. Without loss of generality, we only consider constructing polymorphic value sets of size 2. In this case, the privacy guarantee is $\epsilon = 1/2$. For each $t[B_j]$, we need to find a value that is indistinguishable from $t[B_j]$. We denote the polymorphic value of $t[B_j]$ as $\beta_{t,j}$.
Definition 3. We define the 2-PVSV problem as follows: given database D and query workload Q, find a polymorphic value $\beta_{t,j}$ from $V^B_j$ for each $t[B_j]$, $t \in D$ and $1 \le j \le m'$, such that the practical utility loss defined in (10) is minimized.
Theorem 1. The 2-PVSV problem is NP-hard.
The proof of Theorem 1 in detail can be found in Appendix A.
4.3. Heuristic Algorithm
We have proved that the 2-PVSV problem is NP-hard, so an exact solution is unlikely to be computable in polynomial time. Therefore, we propose PVSV-Constructor, a heuristic algorithm that returns an approximate solution in polynomial time.
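We observe that the rank $r(t \mid q)$ is closely related to the score $s(t \mid q)$. A smaller difference between $s(t \mid q)$ and $s'(t \mid q)$ leads to a smaller difference between $r(t \mid q)$ and $r'(t \mid q)$. Therefore, the practical utility loss can be approximately minimized by a solution that minimizes the difference in scores. As such, we propose an approximation of the practical utility loss that calculates the score difference before and after adopting our framework. We denote the score difference as $\Delta s$ and define $\Delta s$ as follows:
$$\Delta s = \sum_{q \in Q} \sum_{t \in D} \big( s'(t \mid q) - s(t \mid q) \big). \quad (11)$$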
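As shown in Algorithm 1, given input database D, the numbers of public and private attributes m and $m'$, respectively, query workload Q, and privacy guarantee ϵ, PVSV-Constructor constructs a polymorphic value set $\mathrm{PVS}(t, B_j)$ for each $t \in D$ and $1 \le j \le m'$ that minimizes the score difference $\Delta s$. Recalling the definition of the new score function $s'$ in (7), whose polymorphic matching function is $\rho'$, we have
$$\Delta s = \sum_{q \in Q} \sum_{t \in D} \sum_{j=1}^{m'} w^B_j \big( \rho'(q, t, B_j) - \rho(q, t, B_j) \big). \quad (12)$$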
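We denote the score difference contributed by $\mathrm{PVS}(t, B_j)$ as $\Delta s(t, B_j)$:
$$\Delta s(t, B_j) = \sum_{q \in Q} w^B_j \big( \rho'(q, t, B_j) - \rho(q, t, B_j) \big). \quad (13)$$
Therefore, the overall score difference $\Delta s$ is the sum of $\Delta s(t, B_j)$ over all tuples and private attributes:
$$\Delta s = \sum_{t \in D} \sum_{j=1}^{m'} \Delta s(t, B_j). \quad (14)$$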
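Since the construction of each $\mathrm{PVS}(t, B_j)$ is independent of the other polymorphic value sets, we can minimize the overall score difference by minimizing the score difference $\Delta s(t, B_j)$ contributed by each $\mathrm{PVS}(t, B_j)$ for $t \in D$ and $1 \le j \le m'$. Note that $\rho'(q, t, B_j) = \rho(q, t, B_j)$ if $q[B_j] = t[B_j]$ or $q[B_j] \notin \mathrm{PVS}(t, B_j)$. Also note that $\rho'(q, t, B_j) = 1$ and $\rho(q, t, B_j) = 0$ when $q[B_j] \in \mathrm{PVS}(t, B_j)$ and $q[B_j] \neq t[B_j]$. Therefore, if $q[B_j] = t[B_j]$ or $q[B_j] \notin \mathrm{PVS}(t, B_j)$, then q cannot contribute to a higher $\Delta s(t, B_j)$ under any assignment of $\mathrm{PVS}(t, B_j)$, since $\rho'(q, t, B_j) - \rho(q, t, B_j)$ is always zero. On the contrary, if $q[B_j] \in \mathrm{PVS}(t, B_j)$ and $q[B_j] \neq t[B_j]$, then $\rho'(q, t, B_j) - \rho(q, t, B_j) = 1$, and thus, q contributes $w^B_j$. According to equation (13), the value of $\Delta s(t, B_j)$ is
$$\Delta s(t, B_j) = w^B_j \cdot \big| \{ q \in Q : q[B_j] \in \mathrm{PVS}(t, B_j) \setminus \{t[B_j]\} \} \big|. \quad (15)$$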
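In order to minimize the overall score difference, we have to find an assignment of $\mathrm{PVS}(t, B_j)$ which minimizes the score difference $\Delta s(t, B_j)$ that it contributes. Consider the simplest case where we want to construct $\mathrm{PVS}(t, B_j)$ of size 2: $\mathrm{PVS}(t, B_j) = \{t[B_j], \beta\}$. We have
$$\Delta s(t, B_j) = w^B_j \cdot \big| \{ q \in Q : q[B_j] = \beta \} \big|. \quad (16)$$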
In this case, β has to be a value in $V^B_j$ such that β has the lowest frequency among all values for $q[B_j]$, $q \in Q$. Note that β does not have to be an element of $\{q[B_j] : q \in Q\}$. If β does not appear in the query workload, its frequency is 0. Similarly, if we want to construct $\mathrm{PVS}(t, B_j)$ of size k, then we should insert the $k - 1$ least frequent values among all values into $\mathrm{PVS}(t, B_j)$.
In line 4 of Algorithm 1, we initialize the polymorphic value set of $t[B_j]$ by inserting $t[B_j]$ into $\mathrm{PVS}(t, B_j)$. For $B_j$, a candidate set $C_j$ is constructed from $V^B_j$, which contains all values that can be inserted into $\mathrm{PVS}(t, B_j)$. In order to minimize the score difference, we insert the least frequent values from $C_j$ into $\mathrm{PVS}(t, B_j)$, ordered by their frequencies in Q. The above construction of $\mathrm{PVS}(t, B_j)$ is repeated for each $t \in D$ and $1 \le j \le m'$. Algorithm 1 runs in polynomial time.
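The greedy idea behind the heuristic can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than a transcription of Algorithm 1: it handles a single tuple and a single private attribute, represents the query workload as the list of values that queries request for that attribute, and breaks frequency ties alphabetically. The function names `build_pvsv` and `score_difference` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_pvsv(true_value: str, domain: Iterable[str],
               workload_values: List[str], size: int) -> Set[str]:
    """Start from the true value and add the least frequently queried values.

    `workload_values` lists the value each workload query asks for in this
    attribute; duplicates reflect how often a value is queried.
    """
    freq = Counter(workload_values)  # missing values count as frequency 0
    candidates = sorted((v for v in domain if v != true_value),
                        key=lambda v: (freq[v], v))
    return {true_value, *candidates[:size - 1]}


def score_difference(true_value: str, pvs: Set[str],
                     workload_values: List[str], weight: float) -> float:
    """Extra score granted by polymorphic matches over the whole workload."""
    extra_matches = sum(1 for v in workload_values
                        if v in pvs and v != true_value)
    return weight * extra_matches


# Hypothetical domain and workload for an "income" attribute with weight 2.0.
domain = ["low", "mid", "high", "very_high"]
workload = ["high", "high", "low", "mid", "mid", "mid"]

pvs2 = build_pvsv("high", domain, workload, size=2)
pvs3 = build_pvsv("high", domain, workload, size=3)
print(sorted(pvs2), score_difference("high", pvs2, workload, 2.0))
print(sorted(pvs3), score_difference("high", pvs3, workload, 2.0))
```

In this example the value "very_high" never appears in the workload, so adding it costs nothing, while enlarging the set to three values forces in "low" and adds one extra match.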
5. Framework with True Values
5.1. Authenticity-Knowledgeable Adversaries
As we mentioned in the adversary model, an authenticity-knowledgeable adversary is able to tell whether $t \in D$ is possible. As a result, the authenticity-knowledgeable adversary can launch a more efficient attack on private attributes by examining the authenticity of values learned from the ranked results $R_A$. We show a simple case where an authenticity-knowledgeable adversary breaks the privacy guarantee provided by polymorphic value sets constructed with virtual values. Consider that the objective of the adversary is to infer the value of $v[B_1]$ and $B_1$ is the only private attribute in D. $\mathrm{PVS}(v, B_1)$ is the polymorphic value set generated for $v$ and $B_1$. Without prior knowledge, $\Pr(a = v[B_1]) \le 1/|\mathrm{PVS}(v, B_1)|$ and the privacy guarantee is achieved. However, for a value $\beta \in \mathrm{PVS}(v, B_1)$, if the adversary can conclude that $K(t, D) = 0$ for any tuple $t$ such that $t$ is equal to $v$ in all public attributes and $t[B_1] = \beta$, then the adversary can exclude β from $\mathrm{PVS}(v, B_1)$. Therefore, the adversary’s probability of success can reach
$$\frac{1}{|\mathrm{PVS}(v, B_1)| - 1} > \frac{1}{|\mathrm{PVS}(v, B_1)|}, \quad (17)$$
and equation (9) no longer holds.
As shown above, if values in a polymorphic value set are marked by an adversary as invalid given prior knowledge, then the adversary can successfully break the privacy guarantee defined in the framework with virtual values.
In this section, we propose polymorphic value sets with true values (PVST), which construct polymorphic value sets with values that cannot be excluded by prior knowledge. PVST considers nontrivial prior knowledge of adversaries and presents the same degree of privacy guarantee introduced in (9).
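We have shown above that the implementation with virtual values can be compromised by adversaries with prior knowledge. Consider a tuple $t \in D$. Given privacy guarantee ϵ, we construct polymorphic value sets for each private attribute $B_j$ of $t$ where $|\mathrm{PVS}(t, B_j)| \ge 1/\epsilon$. Let set $E_t$ be
$$E_t = \{t[A_1]\} \times \cdots \times \{t[A_m]\} \times \mathrm{PVS}(t, B_1) \times \cdots \times \mathrm{PVS}(t, B_{m'}). \quad (18)$$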
The set $E_t$ defined in (18) is the Cartesian product of sets each of which contains all of $t$’s equivalent values in an attribute. For public attribute $A_i$, the corresponding set is $\{t[A_i]\}$ since public attribute values are open to the adversary. For private attribute $B_j$, the corresponding set is $\mathrm{PVS}(t, B_j)$. Therefore, $E_t$ contains all possible tuples that are indistinguishable from $t$ with respect to the ranked results (including $t$ itself).
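With prior knowledge on D, an adversary is able to exclude a value β from $\mathrm{PVS}(t, B_j)$ if
$$\forall t' \in E_t \text{ with } t'[B_j] = \beta: \quad K(t', D) = 0, \quad (19)$$
where $E_t$ is the set of tuples indistinguishable from $t$ and $K$ is the prior knowledge function.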
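As described above, it is safe for the adversary to conclude that $t[B_j] \neq \beta$ if there is no $t' \in E_t$ such that $t'[B_j] = \beta$ and $K(t', D) = 1$. Alternatively, if for every β in $\mathrm{PVS}(t, B_j)$ there is a $t' \in E_t$ such that $t'[B_j] = \beta$ and $K(t', D) = 1$, then the adversary cannot exclude any value in $\mathrm{PVS}(t, B_j)$, and therefore, the guarantee in (9) holds. Since $K(t', D) = 1$ iff $t' \in D$, we construct polymorphic value sets with true values from D.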
Definition 4. If there exist l distinct values such that , holds the following property: Then, we say that covers l true values. We denote as the true value set of t in .
Privacy guarantee: for a database D where every tuple’s every private attribute is included by one equivalent value set which covers at least l true values, a privacy level of is achieved.
Assume that the adversary’s objective is to infer the value of . As mentioned in the adversary model, we make no assumption on the attacking methods adopted by an adversary. Consider defined in (18), , and , . Therefore, the adversary cannot distinguish tuples in by observing . In this situation, the adversary would use to exclude all tuples t in such that . However, as covers l true values, there exist l tuples such that and for arbitrary . As a result, values in are indistinguishable given and . Thus, . According to (3), a privacy level of is achieved.
5.3. Utility Optimization
An intuitive method of constructing is to insert the value of of all tuples that share the same public attribute values with into , i.e., . The privacy guarantee is met if covers at least true values. Nevertheless, utility loss cannot be ignored: in the intuitive method, would be the same for all , and thus information about private attributes is missing from the ranked result of . Therefore, the size of is critical in balancing privacy and utility. Without loss of generality, we limit the size of each polymorphic value set to 2. We show that constructing such polymorphic value sets is an NP-hard problem.
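The intuitive construction can be sketched as a group-by over the public attributes. This is an illustration under assumed names (`intuitive_value_sets`, toy attributes), not the paper's implementation; it shows why every tuple in a group ends up with the same set.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set

Row = Dict[str, str]


def intuitive_value_sets(db: Sequence[Row], public_attrs: Sequence[str],
                         private_attr: str) -> List[Set[str]]:
    """For each tuple, collect the private values of all tuples sharing its public values."""
    groups = defaultdict(set)
    for row in db:
        key = tuple(row[a] for a in public_attrs)
        groups[key].add(row[private_attr])
    return [groups[tuple(row[a] for a in public_attrs)] for row in db]


# Toy data (assumed for illustration only).
db = [
    {"city": "NYC", "income": "high"},
    {"city": "NYC", "income": "mid"},
    {"city": "LA", "income": "low"},
]
value_sets = intuitive_value_sets(db, ["city"], "income")
```

Both NYC tuples receive the set {"high", "mid"}, so a ranking on income can no longer separate them, while the LA tuple's set covers only one true value and would not meet a guarantee that requires two.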
Definition 5. We define the 2-PVST problem as follows: given database D and a query workload Q, construct of size 2 for each tuple , , and minimize defined in (10).
Theorem 2. The 2-PVST problem is NP-hard.
The proof of Theorem 2 can be found in Appendix B.
5.4. Heuristic Algorithm
We have shown that the 2-PVST problem is NP-hard. In this subsection, we present PVST-Constructor, a heuristic algorithm that constructs PVST within polynomial time. PVST-Constructor tries to minimize and for each and with the greedy algorithm.
The pseudo-code of PVST-Constructor is shown in Algorithm 2. In lines 2 to 6, we initialize each with . Then, for each , PVST-Constructor constructs by finding a with the greedy algorithm and inserting into , . The above process is repeated until . Since every is a real tuple existing in D and we insert at least k tuples in the above processes, covers at least true values, and thus, the privacy guarantee is met. As mentioned in Utility Optimization, the size of is critical in minimizing utility loss. Therefore, the heuristic algorithm tries to minimize the utility loss by minimizing the size of each . We count the number of polymorphic value sets of that contain fewer than k values and can be enlarged by inserting ’s private attribute values. We denote this count as :
Tuple with a higher value can enlarge the size of more polymorphic value sets of , and thus, we can reduce the number of tuples that we have to insert into .
Furthermore, we take into consideration queries in Q. In order to minimize , we have to minimize for . Note that for each attribute , if and . We denote the value of as . Thus, we have where if and and otherwise. Therefore, tuple with a smaller can reduce the value of , and thus, we can reduce the value of .
The computation of and is done in line 11. Then, we compute , the score of that indicates how preferable is, relative to other tuples in . We adopt the greedy algorithm to find the next for , i.e., in each iteration, we choose the tuple that has the highest value. In line 15, we insert the private attribute values of the chosen tuple (denoted as ) into . The above process is repeated until the sizes of are no less than k.
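The greedy selection described above can be sketched as follows. Algorithm 2 and the exact score formula are not reproduced in this text, so the sketch rests on stated assumptions: candidates are tuples sharing the target's public attribute values, `gain` counts the value sets that are still smaller than k and would grow, `query_cost` counts query attribute values that the candidate would newly match, and candidates are ordered by higher gain first and lower cost second. All function and variable names are hypothetical.

```python
from typing import Dict, List, Sequence, Set

Row = Dict[str, str]


def pvst_greedy(target: Row, db: Sequence[Row], public_attrs: Sequence[str],
                private_attrs: Sequence[str], queries: Sequence[Row],
                k: int = 2) -> Dict[str, Set[str]]:
    """Greedy sketch: grow each private attribute's value set to at least k true values."""
    # Each set starts with the tuple's own (true) value.
    sets = {a: {target[a]} for a in private_attrs}
    # Assumption: only tuples that agree with the target on all public attributes qualify.
    candidates: List[Row] = [
        r for r in db
        if r is not target and all(r[a] == target[a] for a in public_attrs)
    ]

    def gain(r: Row) -> int:
        # Number of undersized sets that inserting r would enlarge.
        return sum(1 for a in private_attrs
                   if len(sets[a]) < k and r[a] not in sets[a])

    def query_cost(r: Row) -> int:
        # Number of (query, attribute) pairs that r matches but the target does not.
        return sum(1 for q in queries for a in private_attrs
                   if q.get(a) is not None and q.get(a) == r[a] and q.get(a) != target[a])

    while candidates and any(len(s) < k for s in sets.values()):
        best = max(candidates, key=lambda r: (gain(r), -query_cost(r)))
        if gain(best) == 0:
            break  # no remaining candidate can enlarge an undersized set
        for a in private_attrs:
            sets[a].add(best[a])  # all private values of the chosen tuple are inserted
        candidates.remove(best)
    return sets


# Toy data (assumed for illustration only).
target = {"pub": "x", "p1": "a", "p2": "b"}
db = [
    target,
    {"pub": "x", "p1": "a", "p2": "c"},
    {"pub": "x", "p1": "d", "p2": "e"},
    {"pub": "y", "p1": "z", "p2": "z"},
]
result = pvst_greedy(target, db, ["pub"], ["p1", "p2"], [{"p1": "d"}], k=2)
```

Because all private values of a chosen tuple are inserted at once, some sets can grow beyond k, which is consistent with the observation in the experiments that PVST sets may contain more than 2 values.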
6. Experimental Results
6.1. Experimental Setup
To validate the PVSV-Constructor and PVST-Constructor algorithms, we conducted experiments on a real-world dataset from eHarmony which contains 58 attributes and 486,464 tuples. We removed 5 noncategorical attributes and randomly picked 20 categorical attributes from the remaining 53 attributes. The domain sizes of the 20 attributes range from 2 to 15. After removing duplicate tuples, we randomly picked 300,000 tuples as our testing bed.
By default, we use the ranking function from the ranked retrieval model with all weights set to 1. All experimental results were obtained on a Mac machine running Mac OS with 8 GB of RAM. The algorithms were implemented in Python.
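As a rough illustration of the default setup, the sketch below assumes the ranking function is a weighted count of attributes on which a tuple matches the query, with null query attributes contributing nothing and all weights defaulting to 1. The score function in (1) is not reproduced in this text, so this form, the tie-breaking rule, and the names `score` and `rank` are assumptions.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def score(query: Row, row: Row, weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted number of query attributes matched by the tuple (weights default to 1)."""
    total = 0.0
    for attr, wanted in query.items():
        if wanted is None:
            continue  # a null query attribute contributes nothing
        if row.get(attr) == wanted:
            total += 1.0 if weights is None else weights.get(attr, 1.0)
    return total


def rank(db: Sequence[Row], query: Row,
         weights: Optional[Dict[str, float]] = None) -> List[int]:
    """Tuple indices ordered by descending score; ties keep database order."""
    return sorted(range(len(db)), key=lambda i: -score(query, db[i], weights))


# Toy data (assumed for illustration only).
db = [
    {"a": "1", "b": "2"},
    {"a": "1", "b": "3"},
    {"a": "0", "b": "3"},
]
query = {"a": "1", "b": "3", "c": None}
order = rank(db, query)
```

Raising the weight of a public attribute in `weights` makes the private attributes count for proportionally less in the score, which is the effect studied in the weight ratio experiment.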
The privacy guarantees of PVSV-Constructor and PVST-Constructor were tested by performing Rank Inference attacks, including the Point-Query, In-Query, Point-Query&Insert, and In-Query&Insert attacking methods, on the dataset. From a total of 20 attributes, 5 attributes were randomly chosen as public attributes and another 5 attributes were randomly chosen as private attributes. We randomly picked 20,000 distinct tuples from the dataset as the testing bed and randomly generated 10 tuples as the query workload. For PVSV-Constructor, we constructed a polymorphic value set of size 2 for each private attribute and each tuple in the testing bed. For PVST-Constructor, we constructed a polymorphic value set that covers at least two true values for each private attribute and each tuple. We randomly picked 1,000 tuples from the testing bed as our targets and performed 1,000 Rank Inference attacks (250 attacks for each of the four methods) on the five private attributes of target tuples. We measured the attack success guess rates based on the frequency of successful inference among all inference attempts. Figure 1 shows the success guess rates of Rank Inference attacks on the unprotected testing bed, the testing bed with PVSV, and the testing bed with PVST. As the size of each polymorphic value set is 2, the success guess rates on PVSV are around 50%, which are significantly lower than those on the unprotected dataset. We also observe that the success guess rates on PVST are slightly lower than those on PVSV. The reason is that PVST-Constructor keeps inserting values into tuple ’s polymorphic value sets until all of ’s polymorphic value sets cover at least 2 true values. Thus, some polymorphic value sets of may contain more than 2 values.
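The relation between set size and success guess rate can be made concrete with a small calculation. It assumes an adversary who has narrowed a private attribute down to its polymorphic value set and then guesses uniformly at random within it; this adversary model and the name `expected_guess_rate` are assumptions for illustration, not the attack used in the experiments.

```python
from fractions import Fraction
from typing import Sequence


def expected_guess_rate(value_set_sizes: Sequence[int]) -> Fraction:
    """Average probability of a correct uniform guess over the given value sets."""
    return sum(Fraction(1, n) for n in value_set_sizes) / len(value_set_sizes)


# All sets of size 2, as constructed for PVSV in this experiment.
pvsv_like = expected_guess_rate([2, 2, 2, 2])
# One set grew to 3 values, as can happen with PVST.
pvst_like = expected_guess_rate([2, 2, 3, 2])
```

Under this assumption, sets of size exactly 2 give a rate of 1/2, and any set that grows beyond 2 values pulls the average below 1/2.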
In this subsection, we quantify the utility loss of the PVSV-Constructor and PVST-Constructor algorithms. The privacy guarantee of both PVSV-Constructor and PVST-Constructor is , i.e., the polymorphic value sets constructed by PVSV-Constructor contain 2 values, and the polymorphic value sets constructed by PVST-Constructor contain at least 2 true values. The key parameters here are the size of query workload Q, the size of database D, the number of public and private attributes, and the weight ratios in the ranking function. We randomly generated 20 tuples as the query workload. By default, we picked 10 tuples from Q and set , , and . Therefore, we randomly picked 10 attributes from the testing bed and set them as public attributes. The remaining 10 attributes were set as private attributes.
Many recommendation systems of ONS applications feature top-k recommendation [1, 14, 47] where the ranked result contains a set of k tuples that will be of interest to a certain user, as it is impractical and unnecessary to return all tuples in the database to the user. Therefore, we introduce average top-k utility loss , a variant of that focuses on utility loss of the top-k tuples in a ranked result. is defined as where . Intuitively, represents the average percentage rank difference relative to k over all queries and all tuples in top-k. is equivalent to when . By default, we set .
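The intuitive description of the average top-k utility loss can be sketched as follows. The defining equation is not reproduced in this text, so the sketch assumes the loss is the absolute rank difference of each original top-k tuple divided by k, averaged over the k tuples and over all queries; the function name and input layout are hypothetical.

```python
from typing import List, Sequence


def topk_utility_loss(original: Sequence[List[str]], protected: Sequence[List[str]],
                      k: int) -> float:
    """Average rank shift, relative to k, of the original top-k tuples over all queries.

    original[i] and protected[i] are the ranked tuple ids for query i before and
    after polymorphic value sets are applied.
    """
    total = 0.0
    for orig_order, prot_order in zip(original, protected):
        new_rank = {tid: pos for pos, tid in enumerate(prot_order, start=1)}
        for pos, tid in enumerate(orig_order[:k], start=1):
            total += abs(pos - new_rank[tid]) / k
    return total / (len(original) * k)


# One query whose two top tuples swap places after protection.
loss_swapped = topk_utility_loss([["a", "b", "c", "d"]], [["b", "a", "c", "d"]], k=2)
# Identical rankings incur no loss.
loss_same = topk_utility_loss([["a", "b", "c", "d"]], [["a", "b", "c", "d"]], k=2)
```

Setting k to the number of tuples in the database makes the sum run over every tuple, which matches the remark that the top-k variant then coincides with the full utility loss.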
6.3.1. Evaluation of with Varying
We first discuss the average top-k utility loss of PVSV-Constructor, PVST-Constructor, and a baseline algorithm on varying k with other parameters set to default values. For a tuple ’s attribute , the baseline algorithm constructs with and a value randomly picked from . The results are presented in Figure 2, which shows that the of PVSV-Constructor is significantly lower than that of the baseline algorithm. The of PVST-Constructor is also lower than that of the baseline algorithm when , even though the baseline algorithm cannot preserve privacy against authenticity-knowledgeable adversaries. The experimental results show that both PVSV and PVST can reduce utility loss with respect to rank differences. Also, note that PVST-Constructor constructs polymorphic value sets with true values, and thus, from Figure 3, we can see that some polymorphic value sets constructed by PVST-Constructor contain more than 2 values.
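The baseline can be sketched in a few lines. The sketch assumes the random value is drawn from the attribute domain excluding the true value, so that the set always has size 2; the paper does not state how collisions with the true value are handled, and the name `baseline_value_set` is hypothetical.

```python
import random
from typing import Sequence, Set


def baseline_value_set(true_value: str, domain: Sequence[str],
                       rng: random.Random) -> Set[str]:
    """Pair the true value with one other value drawn uniformly from the domain."""
    others = [v for v in domain if v != true_value]
    return {true_value, rng.choice(others)}


# Toy domain (assumed for illustration only); the seed is fixed for repeatability.
rng = random.Random(0)
value_set = baseline_value_set("mid", ["low", "mid", "high"], rng)
```

Since the second value ignores both the query workload and the database contents, it may be a value that no matching tuple carries, which is why the baseline offers no protection against authenticity-knowledgeable adversaries.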
6.3.2. Evaluation of with Varying Sizes of
Figure 4 presents the average top-k utility loss of PVSV-Constructor on varying . When is set to 1, 5, or 10, we randomly picked 1, 5, or 10 queries from the original Q, respectively. With an increasing number of queries in the query workload, the of PVSV-Constructor increases monotonically. The reason is that PVSV-Constructor always generates the least frequent value (denoted as ) in for . If is small, then it is possible that , and thus, . However, a larger query workload covers more private attribute values, and thus, it will be harder for PVSV-Constructor to generate a value for that has no impact on the rank of for any . Figure 5 presents the average top-k utility loss of PVST-Constructor on varying . We can see that the size of Q has no significant impact on the of PVST-Constructor, as PVST-Constructor always picks a tuple that is different from in most attributes and then inserts into .
6.3.3. Evaluation of with Varying Sizes of the Dataset
Figures 6 and 7 depict the impact of the size of datasets on of PVSV-Constructor and PVST-Constructor. Datasets of 100,000 and 200,000 tuples were randomly sampled from the testing bed of 300,000 tuples. As expected, has no significant impact on the of PVSV-Constructor since PVSV-Constructor generates values from the domain of each private attribute. has no impact on the of PVST-Constructor, which indicates that a dataset containing 100,000 tuples is sufficient for PVST-Constructor to generate polymorphic value sets with true values.
6.3.4. Evaluation of with Varying m
We investigate the impact of the number of private and public attributes on average top-k utility loss. Figure 8 presents the of PVSV-Constructor with fixed and varying m and with fixed m and varying . When is set to 5, we randomly removed 5 attributes from the testing bed. When is set to 15, we added 5 more categorical attributes randomly chosen from the unused attributes. We observe that the monotonically decreases with an increasing number of public attributes and monotonically increases with an increasing number of private attributes. As expected, a higher proportion of public attributes leads to less variance between and . The results of the same experiment with PVST-Constructor are shown in Figure 9. increases with an increasing number of private attributes, as expected. However, also increases slightly with an increasing number of public attributes. This is because, with more public attributes, fewer tuples share the same public attribute values. Since PVST-Constructor inserts only private attribute values from tuples sharing the same public attribute values, more values will be inserted into each polymorphic value set, which introduces higher utility loss.
6.3.5. Evaluation of with Varying Weight Ratios
Figures 10 and 11 illustrate the impact of weight ratios on the average utility loss. The experiment was conducted with a fixed private attribute weight of 1 and varying public attribute weights of 1, 2, and 3. As expected, of both PVSV-Constructor and PVST-Constructor decreases as the weight ratio of public attributes increases. The reason is that as the public attribute weight increases, the part in caused by private attributes decreases. Therefore, PVSV and PVST have less impact on . As a result, would be closer to and the utility loss decreases.
In this paper, we proposed a novel framework that preserves privacy of private attributes against arbitrary attacks through the ranked retrieval model. Furthermore, we identified two categories of adversaries based on varying adversarial capabilities. For each kind of adversary, we presented an implementation of our framework. Our experimental results suggest that our implementations efficiently preserve privacy against the Rank Inference attack. Moreover, the implementations significantly reduce utility loss with respect to the variance in ranked results.
It is our hope that this paper can motivate further research in privacy preservation of SIoT with consideration of social network features and/or variant information retrieval models, e.g., text mining.
In this subsection, we prove that constructing an optimal PVSV for each attribute of a tuple is an NP-hard problem.
Definition A.1. For a tuple in D, we create for each . We say satisfies query q if the following hold for any tuple t () in D: (1) If , then , and (2) if , then
For a database containing 2 tuples, the 2-PVSV problem can be redefined as follows: given a query workload Q, a database D, construct of size 2 such that satisfies the most queries in Q.
Definition A.2. Max-3Sat Problem: given a 3-CNF formula , find the truth assignment that satisfies the most clauses.
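For readers less familiar with Max-3Sat, the following brute-force sketch makes the objective concrete. The encoding of literals as signed integers (i for a positive literal of variable i, -i for its negation) and the function names are choices made here for illustration; the exhaustive search is exponential in the number of variables and only serves to show what is being maximized.

```python
from itertools import product
from typing import Dict, Sequence, Tuple

Clause = Tuple[int, int, int]


def satisfied_count(clauses: Sequence[Clause], assignment: Dict[int, bool]) -> int:
    """Number of clauses with at least one true literal under the assignment."""
    return sum(
        1 for clause in clauses
        if any(assignment[abs(lit)] == (lit > 0) for lit in clause)
    )


def max_3sat_bruteforce(clauses: Sequence[Clause], n_vars: int) -> int:
    """Largest number of clauses any assignment satisfies (tries all 2^n assignments)."""
    best = 0
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: bit for i, bit in enumerate(bits)}
        best = max(best, satisfied_count(clauses, assignment))
    return best


# Example formula (assumed): (x1 v x2 v x3) ^ (~x1 v ~x2 v ~x3) ^ (x1 v ~x2 v x3)
clauses = [(1, 2, 3), (-1, -2, -3), (1, -2, 3)]
all_true = satisfied_count(clauses, {1: True, 2: True, 3: True})
best = max_3sat_bruteforce(clauses, n_vars=3)
```

Setting every variable to true satisfies two of the three clauses here, while the best assignment satisfies all three.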
Lemma A.1. Max-3Sat $\leq_p$ 2-PVSV Problem.
Proof. We construct a reduction function f which takes a Max-3Sat instance as input and returns a 2-PVSV instance. Without loss of generality, we suppose that is a conjunction of l clauses and each clause is a disjunction of 3 literals from set . We construct database D as follows: D has 0 public attributes and private attributes . Let be the attribute domain of , and be the attribute domain of . Let and . Two tuples, and , are inserted into D: and for , while and . We simplify the score function defined in (1) by setting all weights to 1. Also, note that , if is null.
We construct query workload Q based on . For each clause , we construct a query such that iff the corresponding literal is a positive literal in , and iff is a negative literal in the clause. For example, given a clause (), the corresponding query q should satisfy , and . All other attributes in q are set to by default. Therefore, and , .
Since , in order to minimize utility loss, the value of should be as small as possible. We observe that the minimum possible is 1 because must be as . Without loss of generality, we assume that we have already constructed for such that , .
Now, we have an instance of the 2-PVSV problem that, given D and Q, constructs that satisfies the most queries in Q. Since , must be too as contains two distinct values. Therefore, for an arbitrary query q ∈ Q, we have and . In order to let satisfy q, we have to ensure that . Thus, we have

Now, we show that the solution of the 2-PVSV problem constructed above can answer the corresponding Max-3Sat Problem. Suppose that we have the solution such that satisfies the most queries in Q. As in (A.1), if satisfies , we have . If does not satisfy , then . Recall that we assign value i or to if is a positive or negative literal in clause , respectively. Therefore, the assignment of given is

Furthermore, from equation (A.1), we have where L is q’s corresponding clause in . Note that as and . Thus, the value of is either true or false.
As we proved above, satisfies if and only if is true given assignment constructed according to equation (A.2). Therefore, satisfies the most clauses in ϕ if and only if satisfies the most queries in Q.
We now prove that function f can be computed in polynomial time. Given a formula with n variables and l clauses, we construct a 2-PVSV instance with 2 tuples each of which has attributes and l queries each of which has 4 attributes. Therefore, assignments are needed and f can be computed in polynomial time.
Proof of Theorem 1. We now prove that the 2-PVSV problem is NP-hard. In Lemma A.1, we proved that the Max-3Sat Problem can be reduced to the 2-PVSV problem in polynomial time. Furthermore, as the Max-3Sat Problem is an NP-hard problem, the 2-PVSV problem is NP-hard.
In this section, we prove that constructing an optimal PVST for each private attribute of a tuple is NP-hard. We use the definition of satisfying q from (9). The 2-PVST problem can be redefined as follows.
Definition B.1. 2-PVST problem: given a query workload Q and a database D, the optimization problem of 2-PVST is to construct an arbitrary tuple ’s polymorphic value sets, , that satisfy the most queries in Q. The size of each polymorphic value set is 2.
Lemma B.1. Max-3Sat $\leq_p$ 2-PVST problem.
Proof. We construct a reduction function which takes a Max-3Sat instance as input and returns a 2-PVST instance. We assume that is a conjunction of l clauses and each clause is a disjunction of 3 literals from set . Let D have no public attribute and private attributes . For each literal , we insert two tuples, and , into D where

Then, we insert tuple into D where and . We also set the domain of each as and the domain of as .
Without loss of generality, the score function defined in (1) is simplified by setting all weights to 1.
Next, we construct query workload Q. For each clause , we construct a query in which iff is a positive literal in , and iff is a negative literal in . We also set . The rest of the attribute values are set to by default. Therefore, if , then we have , , , , and for .
Note that if is null. Thus, and , . We observe that for and , and . In order to reduce , we have to maximize the number of such that and . The maximum value of and is 4, which can be achieved by inserting into and inserting into . Since , the value of is at least 1. Therefore, we have