Abstract
Modern applications and services built on interactive cyber-physical systems (CPS) bring significant convenience to many aspects of daily life. Clients submit requests containing query contents to CPS servers to use services such as health care, automatic driving, and location-based services. However, these requests raise privacy concerns. Content privacy has been recognized, and much effort has been devoted to privacy preservation in interactive cyber-physical systems such as location-based services. Nevertheless, neither cloaking based solutions nor existing client based solutions achieve effective content privacy by optimizing proper content privacy metrics. In this paper, we formulate the problem of achieving the optimal content privacy in interactive cyber-physical systems using anonymity solutions, based on two content privacy metrics defined using the concepts of entropy and differential privacy. We then propose an algorithm, Multilayer Alignment (MLA), to establish anonymity mechanisms for preserving content privacy in interactive cyber-physical systems. MLA is theoretically proved to achieve the optimal content privacy in terms of both the entropy based and the differential privacy style content privacy metrics. An evaluation on real-life datasets validates the effectiveness of the proposed algorithm.
1. Introduction
Cyber-physical systems (CPS), which deeply integrate computing, communication, control, and monitoring components, underpin modern services in daily life such as smart grids, intelligent transportation, and automatic driving. Recent developments in mobile communication and networking have enabled many applications built on interactive cyber-physical systems, in which client software or devices take actions according to their interactions with CPS servers. In more detail, a client sends a request to the CPS server and takes an action that depends on the server’s reply. Health care, automatic driving, and location-based services fall into this category of interactive CPS applications. Suppose an elderly user, Bob, wears a health care device that is connected to a CPS server through the mobile Internet. Bob could send “stomachache” to the CPS server to ask for instructions. The CPS server may reply by telling Bob what to do or where the nearest hospital is, and Bob then acts according to the reply. Similar processes apply to automatic driving and location-based services. These services over interactive CPS are attractive because they bring convenience to people’s daily life. However, privacy concerns arise when users’ requests are submitted to the CPS servers through the Internet. These requests disclose users’ query contents to the CPS server and even to malicious eavesdroppers on the communication channels. Here we refer to the parameters in users’ requests as query content, such as “stomachache” in Bob’s request to the health care CPS server. Query contents should be kept as sensitive information for individuals, and the abuse or further leakage of this information makes users vulnerable with respect to their private life or even personal security [1].
To address this privacy concern, content privacy should be recognized: users’ query contents should be kept as sensitive information in interactive cyber-physical systems. Many research efforts have been made to protect different types of query contents, such as locations and keywords, in the literature of interactive cyber-physical systems such as location-based services. The major body of these efforts consists of two parts, cloaking based solutions and client based solutions. Cloaking based solutions employ a trusted third-party server. When a user $u$ issues a query with a query content $q$, the query is sent to the trusted server, which then generates a cloaking region containing $u$’s location and at least $k-1$ other users. Here $k$ specifies the level of privacy guarantee. Then the query with the cloaking region is sent to the CPS server, and the CPS server cannot determine where $u$ is or even whether $u$ is querying. In this process, cloaking based solutions aim to make $u$ indistinguishable from $k-1$ other users; however, they suffer from inherent drawbacks brought by the trusted server, which may become a single point of privacy failure and the bottleneck of query performance. More seriously, when the CPS server holds side information such as the prior probability of query contents, cloaking based solutions suffer from further privacy breaches. To address this issue, client based solutions have been presented, aiming at anonymity provided on the client side. Reference [2] generates dummy query contents for continuous scenarios. To prevent the adversary from inferring the actual query content, [2] requires that the selected dummies have prior probability larger than a predefined threshold. In practice, it is difficult to determine a proper threshold; moreover, dummies selected by [2] could still be eliminated if their prior probabilities differ considerably. Reference [3] employs an entropy based privacy metric and generates dummy locations in a random manner. However, improper dummies could be eliminated from the reported query contents due to the process of [3]; thus the provided privacy is degraded.
In this paper, we investigate the problem of preserving content privacy in interactive cyber-physical systems with a client based solution. Since the actual query content must be sent to the CPS server for a meaningful reply, we adopt $k$-anonymity to guarantee the utility of requests while preventing the adversary from recognizing the actual query content among the reported contents. The major challenge arises in two aspects. First, the anonymity provided should be carefully designed so that the adversary is not able to eliminate any query content. Second, the overall privacy produced by anonymity should be optimized. We present two privacy metrics, expected entropy and dp-coefficient, which depict the achieved content privacy using an entropy based concept and a differential privacy style measurement, respectively. Then an algorithm, Multilayer Alignment (MLA), which establishes anonymity based mechanisms for preserving content privacy, is proposed. Given the prior probability of query contents together with an integer $k$ which specifies the privacy level, MLA generates a set of reports, each of which consists of $k$ distinct query contents, together with a probability distribution on the report set for each query content. Given any report $r$ submitted to the CPS server, MLA guarantees that the posterior probability of each query content in $r$ is larger than 0. Hence the adversary is not able to eliminate any query content from a report. Here a report can be taken as a set of different query contents, and we give its formal definition in Section 3. We establish the properties of MLA by proving that it achieves the optimal expected entropy and the optimal dp-coefficient at the same time. These properties make MLA the optimal anonymity solution for preserving content privacy in interactive cyber-physical systems.
The major contributions of this paper are as follows.
(i) We formulate the problem of achieving the optimal anonymity based mechanisms for preserving content privacy in interactive cyber-physical systems. The problem formulation is based on two content privacy metrics with entropy and differential privacy concepts.
(ii) We propose the Multilayer Alignment (MLA) algorithm, which establishes anonymity based mechanisms for preserving content privacy. The MLA algorithm prevents adversaries from eliminating query contents from reports using Bayes inference.
(iii) We prove that MLA achieves the optimal anonymity mechanisms in terms of both of our content privacy metrics simultaneously.
(iv) We evaluate our proposed MLA algorithm using real-life datasets. The evaluation results validate that MLA achieves effective content privacy in terms of the entropy based and differential privacy style content privacy metrics.
The rest of this paper is organized as follows. Section 2 introduces necessary preliminaries, including the process of preserving content privacy using a client based solution, together with commonly accepted privacy metrics. Section 3 formulates the problem of achieving the optimal anonymity for content privacy in interactive cyber-physical systems. Section 4 proposes the MLA algorithm, which establishes effective mechanisms for preserving content privacy. Section 5 theoretically proves that MLA achieves the optimal anonymity for content privacy in terms of the content privacy metrics introduced in Section 3. Section 6 evaluates the MLA algorithm on real-life datasets, and Section 7 discusses related work. Finally, Section 8 concludes this paper.
2. Preliminary
This section introduces necessary preliminaries including the process of preserving content privacy using a client based anonymity solution. Then two accepted privacy notions, i.e., anonymity and differential privacy, and their corresponding metrics are introduced.
2.1. Client Based Anonymity Solution
In the process of a client based $k$-anonymity solution for content privacy preservation, $k$ query contents are reported to the CPS server when a user $u$ wants to submit a request. The reported query contents are determined on $u$’s device, and no trusted third-party servers are employed; thus the potential single point of privacy failure and the query performance bottleneck are eliminated. It is worth noticing that the actual query content that $u$ queries must be included in the reported ones; otherwise $u$ is not able to receive a meaningful reply from the CPS server. After receiving the reported query contents, the CPS server processes $k$ queries, one for each reported query content, and then returns the query results to $u$. Irrelevant results are filtered on $u$’s mobile device and only the actual results are presented to $u$. In this process, the CPS server receives $k$ distinct query contents instead of a single actual one, so the actual query content is hidden. Nevertheless, careful design is required to avoid ineffective dummies in the reported query contents. The way of generating reports to the CPS server determines the level of content privacy achieved, and this motivates our work.
2.2. Privacy Notions and Metrics
2.2.1. Anonymity
One widely adopted notion of privacy is $k$-anonymity, which was first introduced in the database community by [4]. The principle of $k$-anonymity is to hide the sensitive information among dummies so that the adversary is unable to recognize the actual one. In the literature of privacy protection in interactive cyber-physical systems, anonymity solutions can be categorized into cloaking based solutions and client based solutions. Cloaking based solutions such as [5] employ a trusted third-party server, which is responsible for hiding the actual user among at least $k-1$ dummy users by spatial generalization. Client based solutions, including more recent work such as [2, 3, 6], run on users’ devices and generate dummies locally, using certain side information such as the prior probability of each query content. The trusted server in cloaking based solutions may become a single point of failure if it is compromised by attackers, and it is a performance bottleneck that incurs long latency for requests. Moreover, most cloaking based solutions are unaware of the side information held by the adversary, such as the prior probability of query contents. At the same time, the existing client based solutions provide only specious anonymity, since attackers may violate the principle of anonymity by rerunning the algorithms or launching probability inference for each of the query contents.
The quality of anonymity can be measured by the concept of entropy borrowed from information theory. When the CPS server receives a report $r$ consisting of $k$ query contents, the entropy of $r$ is formulated as follows:
$$H(r) = -\sum_{q \in r} \Pr(q \mid r) \log \Pr(q \mid r).$$
Note that the above formulation is slightly different from [3]. Actually, [3] takes the prior probability $\Pr(q)$ as an approximation of the posterior probability $\Pr(q \mid r)$.
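As a minimal sketch of the entropy measure just described, the snippet below computes the entropy of a single report from the posterior probabilities of its query contents. The function name `report_entropy`, the base-2 logarithm, and the example posteriors are illustrative assumptions, not taken from the paper.

```python
import math

def report_entropy(posterior):
    """Entropy (in bits, base 2 is an assumption) of one report.

    posterior: dict mapping each query content in the report to its
    posterior probability given the report; values should sum to 1.
    """
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)

# Hypothetical 4-content reports: one balanced, one skewed.
balanced = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
skewed = {"a": 0.70, "b": 0.10, "c": 0.10, "d": 0.10}

h_balanced = report_entropy(balanced)  # maximal for four contents
h_skewed = report_entropy(skewed)      # lower: "a" stands out
```

A balanced posterior gives the largest entropy a report of that size can have, which is why dummies with very different probabilities provide weaker anonymity.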
2.2.2. Differential Privacy
Differential privacy was first introduced and applied in statistical databases, and it aims to prevent the leakage of any individual’s information during query processing. Generally speaking, to satisfy the notion of differential privacy, a randomized algorithm should return query results with similar distributions for two databases differing in just one tuple. In other words, under differential privacy a single modification in a database brings only a minor change to the distribution of query results. The definition of differential privacy is given below.
Definition 1 (differential privacy). Given $\epsilon > 0$, a randomized algorithm $A$ satisfies $\epsilon$-differential privacy if for all neighboring databases $D_1$ and $D_2$, $\Pr[A(D_1) \in S] \le e^{\epsilon} \cdot \Pr[A(D_2) \in S]$. Here $S \subseteq \mathrm{Range}(A)$. Any pair of neighboring databases $D_1$ and $D_2$ satisfies one of the following conditions: (1) (for unbounded differential privacy) $D_1$ can be transformed to $D_2$ with exactly one insertion or deletion; (2) (for bounded differential privacy) $D_1$ can be transformed to $D_2$ with exactly one modification.
Bounded differential privacy prevents distinguishing two datasets of the same size that differ in exactly one tuple. Unbounded differential privacy prevents distinguishing two datasets which are the same except that one of them holds exactly one additional tuple.
The metric for differential privacy is the coefficient $\epsilon$ in Definition 1. Intuitively, a smaller $\epsilon$ leads to better privacy but larger noise in the query result, while a larger $\epsilon$ leads to less noise in the query result but a weaker privacy guarantee.
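To make the role of the coefficient concrete, the following sketch computes the smallest coefficient for which two discrete output distributions satisfy the inequality of Definition 1 in both directions. The helper `dp_epsilon` and the toy mechanism (report a private bit truthfully with probability 3/4) are illustrative assumptions and do not appear in the paper.

```python
import math

def dp_epsilon(dist_a, dist_b):
    """Smallest epsilon with Pr[out | a] <= e^eps * Pr[out | b] and vice
    versa for every output. For discrete outputs, checking single outputs
    is enough because the worst-case ratio over sets is attained there.
    Both distributions must give every output a positive probability."""
    eps = 0.0
    for out, pa in dist_a.items():
        pb = dist_b[out]
        eps = max(eps, abs(math.log(pa / pb)))
    return eps

# Hypothetical mechanism on two neighboring inputs (bit 0 versus bit 1).
on_zero = {0: 0.75, 1: 0.25}
on_one = {0: 0.25, 1: 0.75}
eps = dp_epsilon(on_zero, on_one)  # equals ln(3) for this toy mechanism
```

Making the two output distributions closer (for example 0.6 versus 0.4) lowers the coefficient, at the price of a noisier answer.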
3. Problem Definition
This section formulates the problem of achieving the optimal anonymity based mechanisms for content privacy in interactive cyber-physical systems. Before introducing the problem definition, we provide several definitions of the concepts on which it relies.
When a user queries a content $q$, the client based solution first generates a report consisting of $k$ distinct query contents and then sends the report to the CPS server for response. A formal definition of a report is given as follows.
Definition 2 (report). Given the global set $Q$ of query contents and an integer $k$, a report $r$ is a subset of $Q$ with size $k$. When a user queries content $q$, the generated report $r$ must contain $q$; i.e., $q \in r$. Denote the set of all the reports for the given $Q$ and $k$ by $R_{Q,k}$. The set of reports containing the query content $q$ is denoted by $R_{Q,k}(q)$.
In the rest of this paper, we focus on specified $Q$ and $k$, so we write $R$ instead of $R_{Q,k}$, and we also use the notation $R(q)$ (instead of $R_{Q,k}(q)$) to refer to the set of reports containing the query content $q$.
For a client based solution, multiple reports can include an identical query content $q$. When $q$ is queried, one of these reports is submitted to the CPS server. The following definition of reporting probability depicts the process of selecting such a report.
Definition 3 (reporting probability). Given the global set $Q$ of query contents and an integer $k$, a reporting probability is a function $p : Q \times R \to [0, 1]$ satisfying the following constraints:
(i) for any $q \in Q$ and $r \in R$, $0 \le p(q, r) \le 1$;
(ii) for any $q \in Q$, $\sum_{r \in R} p(q, r) = 1$;
(iii) for any $q \in Q$ and $r \in R$ with $q \notin r$, $p(q, r) = 0$.
The first two constraints in the definition of reporting probability illustrate that when querying a content $q$, a report $r$ is selected according to the probability $p(q, r)$. The third constraint specifies that a report $r$ will not be selected for $q$ if $q \notin r$.
Next we formulate a client based solution as a mechanism in a probabilistic manner based on the concepts of report and reporting probability.
Definition 4 (mechanism). Given the global set $Q$ of query contents and an integer $k$, a $k$-anonymity based mechanism $M$ consists of two components: the set $R$ of reports and the reporting probability $p$. When a user queries content $q$, $M$ randomly selects a report $r$ from $R(q)$, the set of reports containing $q$, and the probability of selecting $r$ is $p(q, r)$.
The above definition of a $k$-anonymity based mechanism may look unusual at first; nevertheless, existing solutions can be taken as instances of it. Their reporting probability can be specified using the additional parameters of these works, for instance, the predefined prior probability threshold in [2].
This paper adopts two privacy metrics to measure a given anonymity based mechanism in terms of privacy. As formulated in the following definition, the first metric integrates the entropy measures of all the reports generated by the mechanism.
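Definition 5 (expected entropy). Given the global set $Q$ of query contents, an integer $k$, and the prior probability of query contents as $\Pr(q)$, the expected entropy of a mechanism $M$ with report set $R$ and reporting probability $p$ is calculated as
$$E(M) = -\sum_{r \in R} \Pr(r) \sum_{q \in r} \Pr(q \mid r) \log \Pr(q \mid r).$$
Here $\Pr(r) = \sum_{q \in r} \Pr(q)\, p(q, r)$ is the probability of report $r$, and $\Pr(q \mid r) = \Pr(q)\, p(q, r) / \Pr(r)$ is the posterior probability of $q$ given report $r$. $E(M)$ measures the overall content privacy achieved by considering all the generated reports. The probability of each report is taken as the weight, and the entropy of each report is integrated in the above formulation. A larger $E(M)$ indicates that better content privacy is obtained with respect to the concept of entropy.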
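The computation behind the expected entropy can be sketched as follows: weight the entropy of each report by the probability that the report is submitted, with posteriors obtained by Bayes' rule. The data layout (reports as tuples, a dictionary keyed by content and report), the base-2 logarithm, and the toy mechanism are assumptions made only for illustration.

```python
import math

def expected_entropy(prior, p):
    """prior: {content: prior probability}.
    p: {(content, report): reporting probability}, report = tuple of contents.
    Returns the report-probability-weighted entropy (bits) of all reports."""
    reports = {r for (_, r) in p}
    total = 0.0
    for r in reports:
        # Joint probability that q is queried and r is the submitted report.
        joint = [prior[q] * p.get((q, r), 0.0) for q in r]
        pr_r = sum(joint)                       # probability of report r
        if pr_r == 0:
            continue                            # report is never submitted
        post = [j / pr_r for j in joint]        # Bayes posterior per content
        total += pr_r * -sum(x * math.log2(x) for x in post if x > 0)
    return total

# Hypothetical 2-anonymity mechanism over three contents.
prior = {"a": 0.5, "b": 0.25, "c": 0.25}
r1, r2 = ("a", "b"), ("a", "c")
p = {("a", r1): 0.5, ("a", r2): 0.5, ("b", r1): 1.0, ("c", r2): 1.0}
value = expected_entropy(prior, p)  # both reports are perfectly balanced
```

In this toy case the popular content "a" is split evenly across two reports, so each report has two equally likely contents and the metric reaches its maximum of one bit for reports of size two.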
The second metric incorporates the notions of anonymity and differential privacy. It measures a mechanism by the most distinguishable pair of query contents in the generated reports. The following definition formulates our second metric, named dp-coefficient.
Definition 6 (dp-coefficient). Given the global set $Q$ of query contents, an integer $k$, and the prior probability of query contents as $\Pr(q)$, the dp-coefficient of a mechanism $M$ is calculated from the posterior probabilities of the most distinguishable pair of query contents in any generated report. Here the terms $\Pr(q_i \mid r)$ and $\Pr(q_j \mid r)$ are the posterior probabilities of query contents $q_i$ and $q_j$ given a report $r$, and they are calculated in the same way as the posterior probability in Definition 5.
Based on the content privacy metrics, i.e., expected entropy and dp-coefficient, we formulate the problem of achieving the optimal anonymity for content privacy in interactive cyber-physical systems as follows.
Problem Definition. Given the global set $Q$ of query contents, an integer $k$, and the prior probability of query contents as $\Pr(q)$, compute a mechanism $M$ with the optimal content privacy. The optimal content privacy is achieved if the expected entropy of $M$ is maximized.
The expected entropy and the dp-coefficient depict the content privacy achieved by $M$ from the holistic and the individual point of view, respectively. Although our problem definition aims at the optimized expected entropy, in the next section we propose an algorithm which achieves the optimal expected entropy and the optimal dp-coefficient simultaneously.
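The exact formula of the dp-coefficient is not reproduced here, so the sketch below uses one plausible reading as a stated assumption: the largest ratio between the posterior probabilities of two query contents in the same report, in analogy with the ratio bounded in Definition 1. The function name, data layout, and toy mechanisms are all hypothetical.

```python
def dp_coefficient(prior, p):
    """ASSUMED reading of the metric: max over reports r and pairs
    (q_i, q_j) in r of Pr(q_i | r) / Pr(q_j | r). Smaller is better;
    1.0 means no pair can be told apart within any report.
    prior: {content: prior}; p: {(content, report): reporting probability}."""
    reports = {r for (_, r) in p}
    worst = 1.0
    for r in reports:
        # Posterior ratios equal joint ratios: the normalizer Pr(r) cancels.
        joint = [prior[q] * p.get((q, r), 0.0) for q in r]
        if sum(joint) == 0:
            continue                    # report is never submitted
        if min(joint) == 0:
            return float("inf")         # some content can be eliminated
        worst = max(worst, max(joint) / min(joint))
    return worst

prior = {"a": 0.5, "b": 0.25, "c": 0.25}
r1, r2 = ("a", "b"), ("a", "c")
balanced = {("a", r1): 0.5, ("a", r2): 0.5, ("b", r1): 1.0, ("c", r2): 1.0}
skewed = {("a", r1): 0.8, ("a", r2): 0.2, ("b", r1): 1.0, ("c", r2): 1.0}
coef_balanced = dp_coefficient(prior, balanced)
coef_skewed = dp_coefficient(prior, skewed)
```

Under this reading, an infinite value corresponds to a report from which the adversary can eliminate a content outright, which is exactly the failure mode attributed to improper dummies.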
4. Achieving the Optimal Anonymity
This section first provides a short discussion of a naïve solution to the problem defined in Section 3. Then we propose our Multilayer Alignment (MLA) algorithm, which achieves the optimal anonymity for content privacy in interactive cyber-physical systems. MLA exhibits the attractive property that it achieves the maximized expected entropy and the minimized dp-coefficient simultaneously.
4.1. A Naïve Solution
According to the problem definition in Section 3, the essential challenge of establishing the optimal mechanism lies in building the reporting probability $p$. A naïve approach to achieving the optimal anonymity is to formulate the problem as a nonlinear program with the linear constraints in Definition 3, using expected entropy or dp-coefficient as the optimization objective. However, this formulation employs one variable for each entry in $p$. When the number of query contents grows to 100 and $k$ is set to 10, there are already $\binom{100}{10} > 10^{13}$ candidate reports, each contributing entries to $p$. Thus this naïve approach is impractical due to its computational expense.
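The growth of the naïve formulation can be checked with a few lines of arithmetic. The sketch assumes that the variables are the entries of the reporting probability that are allowed to be nonzero, i.e., one per pair of a report and a query content contained in it; the helper names are hypothetical.

```python
import math

def count_reports(n, k):
    """Number of size-k subsets of n query contents."""
    return math.comb(n, k)

def count_variables(n, k):
    """Entries of the reporting probability that may be nonzero:
    one per (report, content in that report) pair (an assumption)."""
    return k * math.comb(n, k)

reports = count_reports(100, 10)
variables = count_variables(100, 10)
print(f"{reports:.3e} reports, {variables:.3e} variables")
```

Even for a modest catalogue of 100 query contents the number of candidate reports is beyond what a general-purpose nonlinear solver can enumerate, which motivates a constructive algorithm instead.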
4.2. The Multilayer Alignment Algorithm
MLA computes the optimal anonymity mechanism in two phases, namely, (1) Segment Alignment and (2) Mechanism Initiation. The major idea of MLA is to generate a mechanism where query contents have as similar posterior probability as possible in each report. To accomplish this goal in a holistic manner, Segment Alignment amortizes each query content with large prior probability to multiple query contents with small prior probability. What is more, the reports generated by Mechanism Initiation have the same distribution of posterior probability of the included query contents. Next we introduce the two phases of MLA.
4.2.1. Segment Alignment
Given the prior probability of query contents, MLA represents each query content $q$ using a segment with length $\Pr(q)$. The segments for all the query contents are sorted in descending order of length. Then MLA aligns the segments onto $k$ layers in order. The aligning process has two modes, i.e., aligning dominant and aligning dominated. At the beginning of aligning, the mode of aligning dominant is active, and the number of rest layers (denoted $m$) is set to $k$. MLA checks whether the current segment is dominant according to the dominance condition. If the current segment is dominant, MLA aligns it onto the current layer, where it takes up the entire layer. The aligning stays in mode aligning dominant, and the next segment in sorted order is taken as the current segment when the aligning continues. If the current segment is not dominant, the aligning turns to mode aligning dominated. Then MLA sets the length of each of the remaining layers to the total length of the segments not yet aligned divided by the number of remaining layers, and it aligns the segments along the rest layers. When aligning a segment $q_i$ while the current layer has blank length $b$ less than $\Pr(q_i)$, $q_i$ is divided into two parts with lengths $b$ and $\Pr(q_i) - b$. The first part is aligned onto the current layer, and the second part is aligned onto the beginning of the next layer. In the mode of aligning dominated, MLA goes on aligning all the remaining segments, and it never turns back to mode aligning dominant. After all the remaining segments are aligned, the first phase of MLA terminates and MLA continues to the second phase.
Example 7. Suppose the global set contains seven query contents and Alice wants to query one of them nearby, with the prior probability of each query content at her location given. Alice desires 4-anonymity ($k = 4$). What is the optimal mechanism for Alice?
Here we use the instance in Example 7 to illustrate the process of segment aligning. There are 4 layers in the process of aligning. The segment for the query content with the largest prior probability is aligned in mode aligning dominant. The segments for the remaining query contents are aligned in mode aligning dominated, and the three remaining layers have equal length. The aligning result is shown in Figure 1.
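The Segment Alignment phase can be sketched as below. This is an illustrative reading rather than the paper's Algorithm 1: the dominance test used here (a segment is dominant when it is at least as long as the average length of the layers still to be filled) is an assumption, the prior values are hypothetical, and exact fractions are used so that layers fill up exactly.

```python
from fractions import Fraction as F

def segment_alignment(prior, k):
    """Align one segment per query content onto k layers.
    prior: {label: prior probability}, positive values summing to 1,
    with at least k contents. Returns k layers of (label, length) pairs."""
    items = sorted(prior.items(), key=lambda kv: kv[1], reverse=True)
    remaining = sum(p for _, p in items)      # length not yet aligned
    layers, i = [], 0
    # Mode "aligning dominant" (ASSUMED test: p >= remaining / rest layers).
    while i < len(items) and len(layers) < k:
        label, p = items[i]
        if p * (k - len(layers)) < remaining:
            break                             # switch to dominated mode
        layers.append([(label, p)])           # takes up an entire layer
        remaining -= p
        i += 1
    if len(layers) == k or i == len(items):
        return layers
    # Mode "aligning dominated": equal-length layers filled left to right.
    layer_len = remaining / (k - len(layers))
    current, space = [], layer_len
    for label, p in items[i:]:
        while p > 0:
            part = min(p, space)              # split if the layer is too short
            current.append((label, part))
            p -= part
            space -= part
            if space == 0:                    # layer is full, open the next
                layers.append(current)
                current, space = [], layer_len
    return layers

# Hypothetical priors for seven contents, 4-anonymity.
prior = {"a": F(2, 5), "b": F(3, 20), "c": F(3, 20), "d": F(1, 10),
         "e": F(1, 10), "f": F(1, 20), "g": F(1, 20)}
layers = segment_alignment(prior, 4)
```

With these values, "a" occupies the first layer alone, and the other six contents share three layers of length 1/5, with "c" split across layers 2 and 3. Because a dominated segment is shorter than a layer, its two parts never overlap horizontally, so no report will contain the same content twice.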
4.2.2. Mechanism Initiation
Denote the length of the $i$th layer by $l_i$, where $1 \le i \le k$. The second phase of MLA first shrinks each layer so that all the layers have the same length after shrinking; the shrinking ratio of the $i$th layer is recorded as $s_i$. In the next step, MLA sets a vertical line at the beginning of each layer. Then the vertical line moves to the right until it touches the first point on any layer at which a segment ends, and the scanned parts of the layers are packed into a report. Denote the scanned length on the shrunk layers by $d$; the scanned part on the $i$th layer then has original length $d \cdot s_i$, so the probability of this report is $d \sum_{j=1}^{k} s_j$ and the posterior probability of the query content on the $i$th layer is $s_i / \sum_{j=1}^{k} s_j$. The vertical line continues moving to the right, and MLA packs the next report whenever a segment ends on a layer. The process terminates after the vertical line reaches the end of each layer and generates the last report.
Continuing with Example 7, as shown in Figure 2, the shrinking ratios of the 4 layers are 2, 1, 1, and 1. Then a vertical line starts moving to the right from the left end of all the layers. It first touches the end points of the segments on layers 3 and 4, and the first report is generated. Then it keeps moving to the right and touches the end points of the segments on layers 2 and 4, and the second report is generated. Finally, the vertical line touches the end points of all the layers and generates the last report. In the end, MLA generates 3 reports. The reporting probability is given in Table 1. Take the dominant content on layer 1, for instance; half of its prior probability is assigned to the first report, and one-fourth of its prior probability is assigned to each of the second and third reports. Thus its reporting probability for the three reports is $1/2$, $1/4$, and $1/4$, respectively, as shown in Table 1. The reporting probability of the other contents can be calculated in the same manner.
Figure 2: (a) After shrinking. (b) Mechanism initiation.
The pseudocode of the Multilayer Alignment algorithm is shown in Algorithm 1. In the beginning, MLA sorts the query contents according to their prior probability in descending order (line 1). Then it initiates the necessary structures and variables. A list per layer stores the alignment of layers (lines 2-3). One variable indicates the layer being processed (line 4), and a flag indicates whether the alignment is under aligning dominant mode (line 5). A further variable indicates how much prior probability of the current query content is taken up by the last layer, while another indicates the length of each layer processed in dominated mode; they are initiated in line 6. The array keeping the lengths of layers is initiated in line 7. The loop in lines 8-28 aligns the sorted query contents in order onto layers. The aligning mode is set to dominant in line 5 before processing the first query content. Under aligning dominant mode, MLA checks whether the current query content should be aligned under aligning dominant mode. If the answer is yes, a segment for the current query content is created with length equal to its prior probability, and it is added to the list for the current layer. Here the constructor of a segment specifies the label and the length of the segment. Then the alignment of the current query content and the current layer terminates (lines 10-12). If the current query content should not be aligned under aligning dominant mode, MLA turns to the mode of aligning dominated and calculates the length of each of the remaining layers (lines 13-15). In aligning dominated mode, MLA executes the code in lines 16-28. If the current query content can be entirely aligned onto the current layer (line 16), MLA creates a segment for it with length equal to its prior probability and adds it to the list of the current layer (line 17). The length of the current layer and the related variables are updated (lines 18-19). If the current layer does not have sufficient space to hold the current query content, its segment is split into two segments. Lines 21-23 align the first segment onto the current layer, and lines 26-28 align the second segment onto the next layer. Lines 24-25 deal with the special case where the current query content exactly uses up the space of the current layer.
At this point Segment Alignment terminates and MLA proceeds to the phase of Mechanism Initiation. It packs the heads of the lists into a report (lines 30-31). Then MLA determines how far the vertical line moves to the right (line 32). The reporting probability related to the current report is calculated in line 35. For each layer, MLA updates the length of the head. If the head of a layer is entirely packed into a report, it is popped from the list (lines 36-39).
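The Mechanism Initiation phase described above can be sketched as a sweep over already aligned layers. The code is a hedged illustration and not the paper's Algorithm 1: the shrinking ratio is assumed to be each layer's length divided by the shortest layer's length, and the layers and priors are the same hypothetical values used for illustration, hard-coded so the block runs on its own.

```python
from fractions import Fraction as F

def mechanism_initiation(layers, prior):
    """Sweep a vertical line over the shrunk layers and pack reports.
    layers: list of layers, each a list of (label, length) segments.
    Returns (reports, p): reports is a list of (contents, probability);
    p[(label, report_index)] is the reporting probability."""
    lengths = [sum(seg for _, seg in layer) for layer in layers]
    base = min(lengths)
    ratios = [l / base for l in lengths]      # ASSUMED shrinking ratios
    # Per-layer queues of [label, shrunk length still unscanned].
    queues = [[[lab, seg / r] for lab, seg in layer]
              for layer, r in zip(layers, ratios)]
    reports, p = [], {}
    while all(queues):                        # stop once the layers are used up
        step = min(q[0][1] for q in queues)   # move to the nearest segment end
        idx = len(reports)
        contents = tuple(q[0][0] for q in queues)
        reports.append((contents, sum(step * r for r in ratios)))
        for q, r in zip(queues, ratios):
            label = q[0][0]
            # Share of the content's prior mass that falls into this report.
            p[(label, idx)] = step * r / prior[label]
            q[0][1] -= step
            if q[0][1] == 0:
                q.pop(0)                      # head fully packed
    return reports, p

# Hypothetical aligned layers (one dominant layer, three dominated layers).
prior = {"a": F(2, 5), "b": F(3, 20), "c": F(3, 20), "d": F(1, 10),
         "e": F(1, 10), "f": F(1, 20), "g": F(1, 20)}
layers = [
    [("a", F(2, 5))],
    [("b", F(3, 20)), ("c", F(1, 20))],
    [("c", F(1, 10)), ("d", F(1, 10))],
    [("e", F(1, 10)), ("f", F(1, 20)), ("g", F(1, 20))],
]
reports, p = mechanism_initiation(layers, prior)
```

For these inputs the sweep yields three reports with probabilities 1/2, 1/4, and 1/4, and the content on the first layer is reported in them with probabilities 1/2, 1/4, and 1/4. In every report the posterior of the first-layer content is 2/5 and that of each other content is 1/5, so all reports share the same posterior distribution.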

The computation cost of MLA consists of three parts: sorting and initiating variables, Segment Alignment, and Mechanism Initiation. The first part is dominated by sorting the query contents. Segment Alignment is linear in the number of aligned segments, since each alignment costs constant time. Mechanism Initiation is dominated by packing the reports and calculating the entries of the reporting probability. In practice, $k$ should be set smaller than the number of query contents. The total cost of MLA is the sum of these three parts.
5. Properties of MLA Algorithm
This section formally proves that MLA achieves the optimal expected entropy and the optimal dp-coefficient simultaneously. We first introduce some concepts which build the necessary foundation for our formal proof.
Definition 8 (dominant content). Given the global set of query contents, an integer , and the prior probability , , let be the number of query contents larger than . A query content is a dominant content iff the following conditions hold:(i);(ii).
Definition 9 (dominated content). Given the global set of query contents, an integer $k$, and the prior probability of query contents, a query content is a dominated content iff it is not a dominant content.
According to the process of segment alignment in MLA, each dominant content takes up an entire layer. If there are remaining layers, the dominated contents take up these layers and none of them takes up an entire layer. In the rest of this paper, we use dominant layer and dominated layer to denote a layer taken up by a dominant content or by dominated contents, respectively. Recall the instance in Figure 1; the query content on layer 1 is a dominant content and layer 1 is a dominant layer. The other query contents are dominated contents, and layers 2, 3, and 4 are dominated layers.
Definition 10 (layering strategy). Given the global set of the query contents, an integer , the prior probability of query contents, and a reporting probability , let the query contents in each report be permuted arbitrarily and denote the th query content of report by . A layering strategy induced by is a dimensional vector, whose th component is calculated as . When query contents are sorted by the value of in descend order, the standard layering strategy is induced.
According to the above definition of layering strategy, a mechanism has multiple layering strategies. Intuitively, a mechanism assigns the prior probability of each query content to one or multiple reports. A report contains parts from distinct query contents, and they can be viewed as segments on layers. When query contents in each report are permuted, we can build layers by connecting all the segments on the same layer from different reports together. To this end, we call the dimensional vector a layering strategy. Next we define the entropy of a layering strategy.
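Definition 11 (entropy of layering strategy). Given a layering strategy $S = (s_1, \dots, s_k)$, the entropy of $S$ is calculated as $H(S) = -\sum_{i=1}^{k} s_i \log s_i$.
Lemma 12. Given a mechanism $M$, let $S$ be any induced layering strategy of $M$; then the expected entropy of $M$ is no larger than the entropy of $S$.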
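As a small illustration of layering strategies and their entropy, the sketch below builds the vector induced by a chosen ordering of the contents in each report and evaluates its entropy. It assumes that the component for a layer is the total probability mass of the segments placed on that layer; the function names, report identifiers, and toy mechanism are hypothetical.

```python
import math

def layering_strategy(prior, p, ordering):
    """ordering: {report_id: tuple of contents in the chosen order}.
    p: {(content, report_id): reporting probability}.
    ASSUMPTION: component i sums prior[q] * p[(q, report)] over the
    contents q placed at position i of each report."""
    k = len(next(iter(ordering.values())))
    s = [0.0] * k
    for rid, contents in ordering.items():
        for i, q in enumerate(contents):
            s[i] += prior[q] * p[(q, rid)]
    return s

def strategy_entropy(s):
    """Entropy (bits, base 2 is an assumption) of a layering strategy."""
    return -sum(x * math.log2(x) for x in s if x > 0)

prior = {"a": 0.6, "b": 0.2, "c": 0.2}
p = {("a", 1): 0.5, ("a", 2): 0.5, ("b", 1): 1.0, ("c", 2): 1.0}
s_first = layering_strategy(prior, p, {1: ("a", "b"), 2: ("a", "c")})
s_mixed = layering_strategy(prior, p, {1: ("a", "b"), 2: ("c", "a")})
h_first, h_mixed = strategy_entropy(s_first), strategy_entropy(s_mixed)
```

The same mechanism induces different layering strategies depending on how the contents of each report are permuted: keeping "a" on the first layer gives lengths 0.6 and 0.4, while swapping the order in the second report gives two layers of length 0.5 and a larger entropy.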
Proof. Suppose the query contents in each report of the mechanism are arbitrarily sorted, and we get an induced layering strategy $S$. Consider the set of reports with probability larger than 0, and for each such report $r$ let $q_{r,i}$ be its $i$th query content in the process of inducing $S$. The expected entropy of the mechanism can then be rewritten as a sum over layers of the contributions of the segments on each layer. By applying the log-sum inequality [7] to each layer, this sum is bounded from above by the entropy of $S$. So we prove that the expected entropy of the mechanism is no larger than the entropy of $S$.
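The log-sum step can be written out as follows. This is a sketch under assumed notation, not the paper's own derivation: $E(M)$ denotes the expected entropy, $q_{r,i}$ the $i$th content of report $r$, $a_{r,i} = \Pr(r)\Pr(q_{r,i} \mid r)$ the mass of the segment that report $r$ places on layer $i$, and $s_i = \sum_r a_{r,i}$ the $i$th component of the layering strategy.

```latex
\begin{align*}
E(M) &= -\sum_{r} \Pr(r) \sum_{i=1}^{k} \Pr(q_{r,i} \mid r)\,\log \Pr(q_{r,i} \mid r) \\
     &= -\sum_{i=1}^{k} \sum_{r} a_{r,i}\,\log \frac{a_{r,i}}{\Pr(r)} \\
     &\le -\sum_{i=1}^{k} \Bigl(\sum_{r} a_{r,i}\Bigr)
          \log \frac{\sum_{r} a_{r,i}}{\sum_{r} \Pr(r)}
          && \text{(log-sum inequality, applied per layer)} \\
     &= -\sum_{i=1}^{k} s_i \log s_i \;=\; H(S)
          && \text{(since } \textstyle\sum_{r} \Pr(r) = 1\text{)}.
\end{align*}
```

Equality in the log-sum inequality holds when the ratio $a_{r,i} / \Pr(r)$, i.e., the posterior on layer $i$, is the same for every report, which is the situation produced by the shrinking step of Mechanism Initiation.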
Lemma 13. Given a mechanism $M$ generated by MLA with standard layering strategy $S$, and an arbitrary mechanism $M'$, then $M'$ has at least one induced layering strategy $S'$ whose entropy is no larger than that of $S$.
Proof. Given the mechanism $M$ produced by MLA together with its standard layering strategy $S$, we prove Lemma 13 by constructing an induced layering strategy $S'$ for an arbitrary mechanism $M'$, so that the entropy of $S'$ is no larger than that of $S$. To this end, we sort the query contents in each report of $M'$ as follows.
For each report of $M'$, we iterate over all its query contents. If a query content is a dominant query content determined by MLA and lies on the $i$th layer of $S$, we place it at position $i$ in the report. After arranging all the dominant query contents, we sort the rest of the query contents in the report in descending order and use them to fill the blanks in the ordering of the report. In this way we construct an induced layering strategy $S'$ of $M'$, and in the following we prove that the entropy of $S'$ is no larger than that of $S$.
Let be the number of dominant layers in , and we first investigate the first layers of . For each dominant content on the th layer of , it is also aligned only on the th layer of . At the same time, on the th layer of there are possibly dominated contents. So we get that for each dominant layer the length of is no smaller than that of , i.e., , . As a consequence, the total length of dominated layers in is no larger than that of if ; i.e., .
Here we turn to a necessary observation of modifying a layering strategy at two layers with increased entropy. Suppose is an arbitrary layering strategy, and its values on layer and layer are different. With no loss of generality, assume . Then we move a length of from layer to layer ; here . It is easy to see that the entropy of the modified layering strategy is larger than the entropy of . Next we transform to with a series of modifications of the above type between two layers with different lengths.
The transformation consists of two phases. In the first phase, we make the dominated layers (here the dominated layers and dominant layers are determined by ) have the same length. Let be the average length of dominated layers for . We repeat the following modification. Each time we pick the dominated layers with the smallest and the largest length, and move length from the longer to the shorter until either one of them reaches . Then the number of layers with length increases by at least one. After at most modifications, phase 1 terminates, and each modification increases the entropy of . If the dominated layers of have the same length at the beginning of phase 1, its entropy remains unchanged.
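Phase 1 can be sketched as a simple leveling loop. The sketch below is an illustration under stated assumptions: it operates on the dominated layers only, computes the entropy over those layers in isolation, and uses hypothetical names (`equalize_layers`, `tol`). The step bound of "number of layers minus one" follows from the fact that each step pins at least one more layer to the average in this sketch.

```python
import math


def entropy(lengths):
    """Shannon entropy (natural log) of lengths, normalized to sum to 1."""
    total = sum(lengths)
    return -sum((x / total) * math.log(x / total) for x in lengths if x > 0)


def equalize_layers(lengths, tol=1e-12):
    """Level a list of layer lengths to their average, one pair at a time.

    Returns the final lengths and the entropy recorded after every step.
    """
    layers = list(lengths)
    avg = sum(layers) / len(layers)
    history = [entropy(layers)]
    # Every step pins at least one more layer to the average, so
    # len(layers) - 1 steps always suffice.
    for _ in range(len(layers) - 1):
        lo = min(range(len(layers)), key=layers.__getitem__)
        hi = max(range(len(layers)), key=layers.__getitem__)
        if layers[hi] - layers[lo] <= tol:
            break  # all layers already have (nearly) the same length
        surplus = layers[hi] - avg
        deficit = avg - layers[lo]
        if surplus <= deficit:
            # The longest layer reaches the average first.
            layers[lo] += surplus
            layers[hi] = avg
        else:
            # The shortest layer reaches the average first.
            layers[hi] -= deficit
            layers[lo] = avg
        history.append(entropy(layers))
    return layers, history
```

Calling `equalize_layers([0.1, 0.2, 0.6, 0.1])` levels all four layers to about 0.25 in at most three steps, and the recorded entropy never decreases along the way.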
In phase 2, we investigate each of the first layers. For a layer , let the th dominant content of MLA be . Then we remove a length of and distribute it evenly to the dominated layers. After that, the length of each dominated layer for is no larger than that of (denoted ). Meanwhile, the remaining length of layer is larger than . According to the observation above, each modification of phase 2 will increase the entropy of . After at most modifications, will be transformed to , and no modification will decrease the entropy.
Combining phase 1 and phase 2, we construct a transformation from an induced layering strategy of an arbitrary mechanism to , which is an induced layering strategy of the mechanism produced by MLA. Each step of the constructed transformation increases the entropy or keeps it unchanged, so we prove that .
Lemma 14. Given a mechanism generated by MLA, and is the standard layering strategy of , then .
Proof. The standard layering strategy of restores the result of segment alignment in the process of MLA. Let be the lengths of the generated layers. Due to the shrinking process of MLA, the initiated reports have the same ratios between pairs of corresponding query contents on two given layers. Thus the standard layering strategy of could be calculated as . For each produced report in the process of inducing , we use to denote the th query content in . Then we have , for . So we can get that the entropy of each report equals the entropy of . As a consequence the expected entropy of can be calculated as follows: So we prove that .
Lemmas 12, 13, and 14 illustrate the relationship between the expected entropy achieved by MLA and the entropy of induced layering strategies of any other mechanisms. Based on these facts, we get the following theorem.
Theorem 15. MLA achieves the optimal expected entropy.
Proof. According to Lemma 14, the mechanism produced by MLA achieves the expected entropy of , where is the standard induced layering strategy of . Assume is an arbitrary mechanism; it has at least one induced layering strategy so that due to Lemma 13. At the same time, according to Lemma 12. Then we have , so we prove that MLA achieves the optimal expected entropy through .
Theorem 16. MLA achieves the optimal dpcoefficient.
Proof. Given the mechanism produced by MLA together with its standard layering strategy , we construct an induced layering strategy for an arbitrary mechanism in the same way as in the proof of Lemma 13. We sort the query contents in each report of as follows. For each report of , we traverse its query contents. For a query content , if it is a dominating query content determined by MLA and its order in is , we set the order of in by . After arranging all the dominating query contents, we sort the rest of the query contents in by in descending order, and then fill the blanks in the ordering of . The first layer of only contains the first dominant content; however, the first layer of contains not only the first dominant content entirely but also possibly dominated contents. So we have . On the other hand, we know that the total length of dominated layers in is no smaller than that of . At the same time, each dominated layer has the same length in while the th layer in has the smallest length. Then we have . In the dpcoefficient is actually . Denote the set of reports produced by by , and let be the th query content in report , and . Then we have and . Thus . The dpcoefficient achieved by is . Since we have shown that and , we conclude that . That is to say, the dpcoefficient of is no larger than that of an arbitrary mechanism . So we prove that MLA achieves the optimal dpcoefficient through .
6. Evaluation
This section evaluates the performance of our proposed MLA algorithm on three real-life datasets, and the evaluation results compare MLA with three existing approaches including , [8], and [3].
6.1. Evaluation Setting
Datasets. To obtain the prior probability of query contents, we employ three real-life datasets: TX and CA from [9] and POIs from [10]. TX and CA contain street objects in the states of Texas and California. Each object is labeled with a coordinate and a set of keywords. POIs contains worldwide coordinates and geotags. We use the coordinates as locations, and take the keywords and geotags as query contents. We divide TX, CA, and POIs into regions and calculate a prior distribution of query contents for each region. Given the number of query contents , we pick that many of the most frequent query contents, and they are used to compute the prior probability. In TX and CA, some keywords such as city names and state names are removed since they dominate the frequency but carry no meaning. For each dataset, the average measures of its regions are reported in the evaluation results. The details of TX, CA, and POIs are given in Table 2.
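The per-region prior computation described above can be sketched as follows. This is a minimal illustration, not the implementation used in the evaluation: the function name, the `removed` parameter, and the choice to normalize over the selected top contents only are assumptions made for the example.

```python
from collections import Counter


def prior_distribution(keywords, n, removed=frozenset()):
    """Estimate a prior over the n most frequent query contents of one region.

    `keywords` is the list of keywords/geotags observed in the region and
    `removed` holds keywords to drop first (e.g. city and state names).
    """
    counts = Counter(k for k in keywords if k not in removed)
    top = counts.most_common(n)  # n most frequent (keyword, count) pairs
    total = sum(count for _, count in top)
    # Assumption: probabilities are normalized over the selected contents only.
    return {keyword: count / total for keyword, count in top}
```

For a toy region `["cafe", "texas", "cafe", "bank", "texas", "texas", "park"]` with `n=2` and `removed={"texas"}`, the sketch keeps "cafe" and "bank" with probabilities 2/3 and 1/3.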
Testbed. We implement our proposed MLA and its competitors, including , , and , in Java. The JDK version is jdk1.8.0_151. All of the evaluation is conducted on a PC with an i7-7700 CPU, 8 GB of memory, and a 1 TB 7200 rpm hard disk.
Query Generation. For each prior distribution obtained for a region, we generate 1000 queries that follow the prior distribution to test . The number of internal loop iterations is set to 50 as in [3]. For MLA, , and we directly evaluate the privacy measures using the mechanism obtained for each prior distribution.
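Drawing queries that follow a prior distribution can be sketched with weighted sampling. The function name, the fixed seed, and the dictionary representation of the prior are assumptions made for this example; the evaluation itself was implemented in Java.

```python
import random


def generate_queries(prior, count=1000, seed=0):
    """Draw `count` query contents independently according to `prior`.

    `prior` maps each query content to its prior probability.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    contents = list(prior)
    weights = [prior[c] for c in contents]
    return rng.choices(contents, weights=weights, k=count)
```

With a fixed seed the same list of queries is produced on every run, which makes results for a sampling-based approach reproducible.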
Privacy Measures. We employ three privacy measures to evaluate content privacy achieved by MLA and its competitors. These privacy measures are (1) expected entropy; (2) dpcoefficient; and (3) effective k. Expected entropy and dpcoefficient are introduced in Section 3. Effective k measures the number of query contents whose posterior probability is positive, and it measures the uncertainty of the reports in a mechanism.
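Two of these measures can be illustrated with a short sketch. It assumes that a report is summarized by the posterior probabilities of its query contents, that entropy uses the natural logarithm, and that expected entropy is the probability-weighted average of per-report entropies; these are assumptions for illustration, and the authoritative definitions are those of Section 3. The dpcoefficient is omitted because its definition is not restated in this section.

```python
import math


def report_entropy(posterior):
    """Shannon entropy (natural log) of one report's posterior probabilities."""
    return -sum(p * math.log(p) for p in posterior if p > 0)


def effective_k(posterior):
    """Number of query contents in a report whose posterior probability is positive."""
    return sum(1 for p in posterior if p > 0)


def expected_entropy(reports):
    """Probability-weighted average entropy over (report_probability, posterior) pairs."""
    return sum(weight * report_entropy(posterior) for weight, posterior in reports)
```

A report with posterior `[0.5, 0.5, 0.0]` has effective k equal to 2 rather than 3, because the adversary can eliminate the third query content.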
Parameters. We test the effects of two parameters on the privacy measures we employ: the number of query contents in a report (denoted ) and the number of query contents in the global set (denoted ). In the following evaluation, the former is set to 5, 10, 15, 20, and 25 with a default value of 10, and the latter is set to 50, 60, 70, 80, 90, and 100 with a default value of 80.
6.2. Evaluation Results
Figures 3 and 4 depict the expected entropy achieved by MLA and its competitors. We first study the effect of the number of query contents in a report (the report size) on the expected entropy in Figure 3. Here the total number of query contents in the global set is set to 80 and the report size is increased from 5 to 25. As shown in Figure 3, our proposed MLA achieves the best expected entropy in the real-life datasets TX, CA, and POIs. This is consistent with the fact that MLA achieves the optimal expected entropy. The expected entropy achieved by MLA and its competitors grows with the report size, since a larger report increases the uncertainty of reports in a mechanism. In the more skewed dataset, i.e., POIs, MLA outperforms and to a larger degree than in TX and CA. The reason is that MLA splits a larger prior probability of a query content into a larger number of reports; thus it is better suited to a skewed prior distribution of query contents. On the other hand, in , and (approximately) keep the ratio of posterior probability for two query contents the same as that of their prior probability. Compared to TX and CA, the expected entropy achieved in POIs is correspondingly smaller, since a more skewed prior distribution of query contents decreases the optimal expected entropy.
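Figure 3: Expected entropy for a varying number of query contents in a report. (a) CA dataset. (b) TX dataset. (c) POIs dataset.
Figure 4: Expected entropy for a varying number of query contents in the global set. (a) CA dataset. (b) TX dataset. (c) POIs dataset.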
Figure 4 presents the achieved expected entropy when the number of query contents in the global set grows from 50 to 100 while the number of query contents in a report is fixed at its default value of 10. MLA again outperforms its competitors in terms of expected entropy. As the global set grows, the expected entropy of MLA, , and increases slightly, while that of decreases. The reason is that a larger global set relieves the skewness of the prior distribution of query contents, so MLA, , and achieve better expected entropy. However, due to the process of , querying the most frequent contents causes fewer query contents in reports to be eliminated. When a growing global set weakens the effect of the most frequent contents, more query contents in the reports of are eliminated. Consistent with what is shown in Figure 3, a larger improvement is obtained in POIs when we compare MLA with and . Meanwhile, better expected entropy is achieved in the more uniform datasets TX and CA than in POIs.
Next we investigate the dpcoefficient of MLA and its competitors. The privacy measure of dpcoefficient depicts the uncertainty of reports in a mechanism. A smaller dpcoefficient means that it is more difficult for the adversary to eliminate a query content from any report. The effect of the number of query contents in a report (the report size) on the dpcoefficient is studied in Figure 5. We fix the number of query contents in the global set at 80 and increase the report size from 5 to 25. In all the datasets, MLA achieves a significantly better dpcoefficient than , , and . When the report size increases, the dpcoefficient of all the algorithms grows, since more query contents are packed into the same report. In the more skewed dataset, POIs, a larger dpcoefficient is obtained, because the skewness increases the difference in prior probability among query contents in the same report. In datasets TX and CA, a very small dpcoefficient is achieved when the report size is set to 5 to 10. For other report sizes, the dpcoefficient is almost always smaller than 1, which means very good uncertainty among the query contents in any report. On the other hand, , , and suffer a larger dpcoefficient around 4 and 2 in different datasets, respectively.
Figure 5: Dpcoefficient for a varying number of query contents in a report. (a) CA dataset. (b) TX dataset. (c) POIs dataset.
The effect of the number of query contents in the global set on the dpcoefficient is investigated in Figure 6. When we fix the number of query contents in a report at 10 and increase the global set size from 50 to 100, , , and produce a nearly constant dpcoefficient. The dpcoefficient of , , and in POIs is larger than 2.5, while a dpcoefficient around 1.5 is obtained for TX and CA. In contrast, the dpcoefficient of MLA decreases as the global set grows, since a larger global set relieves the skewness of the prior distribution. In POIs, MLA achieves a dpcoefficient smaller than 1.5. For the more uniform datasets TX and CA, MLA produces very small dpcoefficients: it obtains the ideal dpcoefficient of 0 for one of them, and its dpcoefficient for the other is also very close to 0. This makes it significantly more difficult for the adversary to infer the actual query content from any report of MLA. Generally speaking, MLA achieves a much better dpcoefficient than , , and , and it is able to produce a dpcoefficient close to 0 for more uniform datasets.
Figure 6: Dpcoefficient for a varying number of query contents in the global set. (a) CA dataset. (b) TX dataset. (c) POIs dataset.
Finally we test the effective k of MLA and its competitors in Figure 7. Given the number of query contents in a report and in the global set, the same effective k is obtained for different datasets, so we report the effects of these two parameters on effective k in Figures 7(a) and 7(b), respectively (not for each dataset individually). As shown in Figure 7, MLA, , and achieve the optimal effective k, which equals the number of query contents in a report. In contrast, provides a smaller effective k. This illustrates the effectiveness of MLA, , and in that the adversary is unable to eliminate any query content from a report. We argue that an effective anonymity mechanism should provide an effective k equal to the number of query contents in a report.
Figure 7: Effective k. (a) Varying the number of query contents in a report. (b) Varying the number of query contents in the global set.
In summary, MLA achieves the best privacy measures of expected entropy, dpcoefficient, and effective k simultaneously, which is consistent with our theoretical analysis in Section 5.
7. Related Work
Privacy issues are attracting more and more attention in people’s daily life, and studies for protecting privacy in various fields have been proposed, for instance, [11–13] for social network data, [14] for cloud storage, [15–17] for mobile crowd sensing systems, [18, 19] for wireless sensor networks, [1, 20] for sensory data and devices, [21, 22] for cyberphysical systems, and [23] for IoT applications.
Location privacy and content privacy are recognized in location-based services. Solutions to preserving location privacy and content privacy in location-based services mainly focus on cloaking techniques such as [5]. A cloaking technique employs a third-party server to execute spatial generalization algorithms so that the querier is hidden among at least k users. However, the third-party server may become a single point of failure for privacy or a performance bottleneck for query processing. To this end, a number of client based solutions [2, 3, 6, 8] have been proposed recently. Reference [3] works on the problem of generating proper dummies for locations in reported queries to CPS servers for hiding the user’s actual locations. In [3], locations with similar probability to the user’s location are chosen as dummy candidates, and of them are randomly selected as final dummies. This approach obtains good entropy for the locations in the reported query. Although this solution is randomized, the posterior probability of the reported location is still different due to the process of dummy selection, and the privacy guarantee is not clear. Reference [6] employs a cache to avoid submitting queries to CPS servers as much as possible and thus prevents the leakage of the user’s location. Reference [2] proposes a mechanism for protecting content privacy in a continuous manner. A set of query contents is generated for a traveling path, and the user submits the same queries along the path to avoid privacy breach. This fits continuous querying; however, there is no privacy guarantee since it simply chooses query contents with probability larger than a given threshold as candidates. In summary, server-based anonymity suffers from a single point of failure, and existing client based solutions do not provide a provable privacy guarantee based on the location/query contents reported to the CPS server.
Reference [24] studies improving geo-indistinguishability with multiple criteria for better location privacy; however, this approach cannot be adopted for content privacy due to utility concerns. Reference [25] studies protecting privacy for smartphone usage, and this is parallel to our work. Recommendation [26] in location-based systems is getting more and more attention, and a location privacy preservation method is proposed for review publication in location-based systems in [27]. The notion of anonymity is also developed in statistical databases in [28, 29].
Differential privacy was first introduced in statistical databases [30]. The intuitive idea of differential privacy is that a single change of the input should not modify the output significantly. With this guarantee the adversary cannot recognize the input among all possible inputs similar to the real one. Due to the simple and clean nature of differential privacy, it has been adopted widely, for example in machine learning [31], statistical databases [32–37], data mining [38], graphs [39], data analytics [40], and crowdsourcing [15]. Recent research has started to incorporate correlation [41] and personality [42] into the original differential privacy. Our work is parallel to the large body of differential privacy research. We combine differential privacy and anonymity to provide guaranteed privacy in interactive cyberphysical systems. Differential privacy has been adopted in the literature of privacy protection in location-based services, and [43] ensures that an adversary will not get significant information about a user’s location after a query is reported. This is achieved by making the ratio of two nearby locations’ posterior probability similar to that of their prior probability. Mechanisms following or adopting a similar privacy guarantee are presented to optimize privacy or utility [44]. Besides anonymity and differential privacy, a number of works customize semantic privacy metrics such as [45–47] in social networks. This paper also defines privacy measures based on entropy and differential privacy.
8. Conclusion
This paper investigates preserving content privacy in interactive cyberphysical systems through anonymity based mechanisms. We present two privacy metrics, expected entropy and dpcoefficient, which are based on entropy and differential privacy, respectively, and formulate the problem of achieving the optimal anonymity for content privacy in interactive cyberphysical systems based on these privacy metrics. An algorithm, MLA, consisting of two phases, namely, segment alignment and mechanism initiation, is proposed to establish mechanisms for achieving the optimal anonymity. Theoretical analysis shows that MLA achieves the optimal expected entropy and the optimal dpcoefficient simultaneously. We conduct an evaluation on three real-life datasets and test three privacy metrics, namely, expected entropy, dpcoefficient, and effective k, which depict the uncertainty of reports in mechanisms. The evaluation results demonstrate that MLA outperforms its competitors, including recent client based solutions, over all the employed privacy metrics, and these results are consistent with the fact that MLA achieves the optimal anonymity for content privacy in interactive cyberphysical systems.
Data Availability
All the data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work is supported by Project (Nos. 61602129, 61772157, and U1509216) supported by the National Natural Science Foundation of China; Heilongjiang Postdoctoral Science Foundation Funded Project (Grant No. LBHZ14118); Sichuan Science and Technology Foundation funded Project (Grant No. 2017JZ0031).