Abstract
In the Internet of Things (IoT), the aggregation and release of real-time data can often be used to mine more useful information and thus make human lives more convenient and efficient. However, privacy disclosure is one of the most pressing concerns, because aggregated data usually contains users' sensitive information. Various data encryption technologies have therefore emerged to achieve privacy preservation. These technologies, however, not only introduce complicated computation and high communication overhead but also fail to protect endless data streams. Considering these challenges, we propose a real-time stream data aggregation framework with adaptive w-event differential privacy (ReADP). Based on adaptive w-event differential privacy, the framework can protect any data collected by sensors over any dynamic window of time stamps along an infinite stream. It is designed for the fog computing architecture, which dramatically extends cloud computing to the edge of the network. In our proposed framework, fog servers send only aggregated secure data to cloud servers, which relieves the computing overhead of cloud servers, improves communication efficiency, and protects data privacy. Finally, experimental results demonstrate that our framework outperforms existing methods and improves data availability with stronger privacy preservation.
1. Introduction
Driven by the development of cyber-physical networks, cloud computing, the mobile Internet, and context-aware smart devices, data has experienced explosive growth [1]. Cloud computing provides a good solution to deal with this explosive growth and to realize resource sharing [2]. However, cloud-based services face many challenges, such as high latency and high overhead at cloud servers, due to their centralized structure and the limitation of network bandwidth. Some studies therefore present a distributed service computing paradigm called fog networking [3–5]. It allocates the capabilities of data gathering, data processing, computing, and applications to devices located at the edge of the network, so as to provide intelligent services for nearby users.
Although fog computing provides great benefits, sensitive and private information mined from raw data (e.g., social relationships and financial transactions) is also exposed to the risk of disclosure. Moreover, due to the complexity and diversity of fog nodes, user privacy in a fog network can easily be disclosed. For example, the numerous electronic eyes in Beijing may lead to privacy leakage (e.g., of vehicle location information) through data sharing in vehicular ad hoc networks (VANETs) [6–8]. Similarly, an adversary may gain illegal access to personal health datasets gathered from various physical-sign sensors in body sensor networks (BSNs) and publish these private data without permission [9–11]. As a result, protecting user privacy is one of the important research issues in fog computing.
Currently, approaches to protecting aggregated data privacy are mainly divided into two types. The first type is designed on the basis of various encryption technologies, such as homomorphic encryption [6, 8–10]. Here, the encryption technology may cause huge computational overhead and consume many computing resources of cloud services [12]. In addition, cryptography-based schemes may lower communication efficiency, especially when the system contains many sensors with high reporting frequencies, because a great number of communication resources are wasted on transmitting encrypted information and the corresponding keys. As a result, this type is not suitable for energy-limited sensor networks.
The other type of aggregated data privacy preservation is explored using differential privacy [13]. Compared with traditional cryptography-based schemes, differential privacy can protect individuals' privacy while preserving data accuracy as much as possible. For example, the authors of [14] protect the privacy of aggregated data with differential privacy by using machine learning. Although many studies based on differential privacy exist, some challenges remain unaddressed. In particular, these studies do not consider the high correlation of time series and thus cannot generate real-time aggregated data with high accuracy. Moreover, a practical framework should be able to satisfy batch queries over continuous time by exchanging information only once.
To address these challenges, we propose a real-time privacy-preserving stream data aggregation framework based on adaptive w-event differential privacy under the fog computing architecture. In fog computing, data storage, processing, and applications are concentrated in devices at the edge of the network rather than entirely in the cloud. This architecture reduces the amount of data transmitted to the cloud, increases efficiency, and significantly lowers the overhead on the cloud servers themselves. In addition, a fog center acts as a data aggregator in our framework and reports only the secure aggregated results to the cloud server, so communication efficiency is greatly improved. Moreover, sensors report raw data instead of encrypted data because our framework does not rely on complex encryption technology. Finally, several techniques for processing time-series data are exploited in our framework to improve the accuracy of the aggregated data, such as adaptive sampling, time-series prediction, and filtering.
In a nutshell, the main contributions of the paper are summarized as follows.
(i) We propose a real-time privacy-preserving stream data aggregation framework based on adaptive w-event differential privacy under the fog computing architecture. The framework relieves the overhead of cloud servers and generates aggregated data with differential privacy preservation.
(ii) To improve w-event differential privacy, we pioneer a novel metric, quality of privacy (QoP). The QoP design takes into account both the window size and the errors of published statistics. Using this metric, we adjust the window size adaptively through a QoP-based adaptive w-event mechanism.
(iii) We exploit long short-term memory (LSTM) networks to predict time-series data and design an adaptive sampling scheme to improve the accuracy of aggregated data.
(iv) We theoretically analyze the privacy of the proposed ReADP framework and demonstrate the high accuracy of aggregated data through numerical simulation results.
The rest of the paper is organized as follows. In Section 2, we introduce preliminaries of differential privacy and w-event privacy. Then, we provide the system model, the adversary model, and the whole ReADP framework to illustrate our problem. In Section 3, we present a QoP-based adaptive w-event privacy algorithm that includes a method for dynamically adjusting the window size. Section 4 presents a smart grouping-based perturbation algorithm, which significantly reduces the noise added to data. In Section 5, we analyze whether the ReADP framework satisfies differential privacy and provide a series of simulation results to discuss the performance of each mechanism in our framework. We then review previous works related to the privacy preservation of aggregated data and differential privacy in Section 6. Finally, Section 7 concludes our paper and outlines promising directions for future work.
2. Problem Statement and Preliminaries
2.1. System Model
The system model, shown in Figure 1, is composed of four layers: the things layer, the fog layer, the core layer, and the cloud layer. The function of each layer is described as follows.
(i) The things layer, consisting of various smart devices, e.g., sensors, mobiles, and actuators, generates and reports raw data to the fog layer.
(ii) The fog layer, typically located between IoT devices and the core network, is composed of many fog devices. Fog devices can be traditional network devices, such as routers, switches, and gateways, or specially deployed local servers. In this paper, the devices are mainly local servers and are responsible for (a) gathering and storing data reported from the things layer, (b) computing and aggregating data to satisfy differential privacy, and (c) responding to query requests from the cloud layer.
(iii) The core layer is in charge of transferring and exchanging data between the fog layer and the cloud layer through network protocols such as IP and MPLS.
(iv) The cloud layer deploys many cloud servers that can analyze massive amounts of aggregated data. Using the analysis results, cloud servers can provide a wide range of services.
2.2. Adversary Model
In this paper, we assume that both the cloud layer and the core layer are untrustworthy: they may try to acquire the actual values of gathered data or to maliciously tamper with the data. The fog layer is considered trusted, which means it can acquire raw data but does not disclose data to any third party.
2.3. Differential Privacy Basics
Differential privacy is one of the most popular notions of privacy in current privacy-preservation research. The basic idea is that the record of an individual, regardless of whether or not it is in the dataset, has little impact on the final output, thus protecting the privacy of the individual.
Definition 1 (differential privacy [13]). A randomized algorithm M over datasets provides ε-differential privacy if, for any neighboring datasets D and D' differing in at most one record, and any output set S ⊆ Range(M), it holds that Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S], where Range(M) denotes the range of the randomized algorithm M.
Note that ε, called the privacy budget, is an important parameter in differential privacy. It represents the privacy level of the randomized algorithm M: the level of privacy is inversely proportional to ε. A commonly used method to achieve ε-differential privacy is the Laplacian mechanism, shown below.
Theorem 2 (the Laplacian mechanism [15]). Let D denote a set of datasets. For a function f : D → R^d, the Laplacian mechanism for any dataset D ∈ D is M(D) = f(D) + Lap(Δf/ε), where the noise follows a Laplacian distribution with mean zero and scale Δf/ε. Here, Δf denotes the sensitivity of f, defined as the maximum L1 norm ||f(D) − f(D')||₁ over any neighboring datasets D and D'.
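As a minimal sketch of Theorem 2, the Laplacian mechanism amounts to adding one calibrated noise draw per query answer. The function name and the count-query example below are illustrative, not part of the paper:

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return a differentially private version of a numeric query answer.

    Noise is drawn from Lap(0, sensitivity / epsilon), the standard
    Laplacian mechanism of Theorem 2.
    """
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Example: a count query has L1 sensitivity 1, since adding or removing
# one record changes the count by at most 1.
noisy_count = laplace_mechanism(true_value=1000.0, sensitivity=1.0, epsilon=0.5)
```

A smaller ε yields a larger noise scale and therefore stronger privacy, matching the inverse relationship noted above.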
Theorem 3 (sequential composition [16]). Let each mechanism M_i provide ε_i-differential privacy. Then the sequence of mechanisms M_1, …, M_n over the same dataset provides (Σ_i ε_i)-differential privacy.
Obviously, Theorem 3 shows that the privacy level of a combination of several differential-privacy-preserving algorithms is determined by the sum of all their budgets.
2.4. w-Event Privacy
w-event privacy, the abbreviation of w-event ε-differential privacy, is a privacy model proposed by Kellaris et al. [17]. It can protect any event sequence occurring in any window of w time stamps.
We denote two neighboring datasets at the i-th time stamp by D_i and D_i', and the stream prefix of an infinite series at the t-th time stamp by S_t = (D_1, D_2, …, D_t).
Definition 4 (w-neighboring [17]). Two stream prefixes S_t and S_t' are w-neighboring if both of the following conditions hold: (i) for each i ∈ [1, t] such that D_i ≠ D_i', D_i and D_i' are neighboring; (ii) for all i < j with D_i ≠ D_i' and D_j ≠ D_j', it holds that j − i + 1 ≤ w. Here w, a positive integer, denotes the length of the sequence that can be protected at the same time.
Definition 5 (w-event privacy [17]). A mechanism M satisfies w-event ε-differential privacy if, for all w-neighboring stream prefixes S_t and S_t', all t, and all sets O of possible outputs of M, it holds that Pr[M(S_t) ∈ O] ≤ e^ε · Pr[M(S_t') ∈ O]. A mechanism satisfying w-event privacy protects the sensitive information that may be disclosed from any event sequence of length w.
According to the above definitions, we follow [17] to conclude Theorem 6. The theorem enables a w-event private scheme to view ε as the total available privacy budget in any sliding window of size w and to appropriately allocate portions of it across the time stamps.
Theorem 6. Assume that M is a mechanism with input stream prefix S_t and output o = (o_1, …, o_t). Suppose M can be decomposed into mechanisms M_1, …, M_t, where each M_i generates independent randomness and achieves ε_i-differential privacy. Then M satisfies w-event ε-differential privacy if, for all t, Σ_{i=t−w+1}^{t} ε_i ≤ ε.
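Theorem 6's sliding-window constraint can be sketched as a small budget tracker: at every timestamp, the budgets spent in the most recent w timestamps must sum to at most ε. The class and method names below are our own illustration, not the paper's actual allocation algorithm:

```python
from collections import deque

class WEventBudget:
    """Track per-timestamp privacy budgets so that, per Theorem 6, the sum
    of budgets inside any sliding window of w timestamps never exceeds
    the total budget epsilon."""

    def __init__(self, w: int, epsilon: float):
        self.window = deque(maxlen=w)   # budgets of the most recent timestamps
        self.epsilon = epsilon

    def spend(self, eps_i: float) -> bool:
        """Try to spend eps_i at the current timestamp.

        Returns False (recording a zero spend, e.g. publishing an
        approximation instead of a fresh perturbation) if spending eps_i
        would violate the window constraint."""
        # Budgets that remain inside the window once this timestamp is added.
        active = list(self.window)
        if len(active) == self.window.maxlen:
            active = active[1:]         # the oldest budget slides out
        if sum(active) + eps_i > self.epsilon + 1e-12:
            self.window.append(0.0)
            return False
        self.window.append(eps_i)
        return True
```

The `deque` with `maxlen=w` automatically discards the budget that falls out of the window, so old spending is "refunded" exactly as the theorem permits.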
Based on this fundamental theorem, we explore a novel adaptive w-event differential privacy mechanism in our work. The proposed mechanism is designed for real-time privacy-preserving stream data aggregation under the fog computing architecture.
2.5. Motivation and System Framework
Our motivation is to design a real-time stream data aggregation framework that can protect user privacy at any time stamp, allow batch queries, and deliver high-accuracy results. To achieve this, we divide our work into two main tasks.
(i) Protect privacy in any window of w time stamps. Servers may query aggregated data for w time stamps via only one round of communication, so the proposed framework must protect the privacy of all data generated within those w time stamps. Besides, the window size w should be adaptively adjusted according to how the data changes.
(ii) Improve the accuracy of aggregated data. Because of the Laplacian mechanism of differential privacy, the proposed framework needs to add random noise to data to guarantee privacy protection. Thus, the framework must reduce the extra errors of aggregated data as much as possible on the premise of privacy preservation.
In this article, we design an adaptive w-event-based differential privacy-preserving strategy. As shown in Figure 2, this strategy is composed of adaptive w-event privacy analysis, smart grouping-based perturbation, and a filtering mechanism. The complete process of the proposed ReADP strategy is outlined in Algorithm 1. The first component, described in Section 3, is built on adaptive sampling and the QoP measurement. The second, presented in Section 4, is designed around K-means smart grouping and the corresponding perturbation mechanism. Finally, we exploit a filtering mechanism similar to that in [18] to reduce the errors of aggregated data and thereby improve data availability.

3. QoP-Based Adaptive w-Event Privacy Design
For privacy protection over infinite aggregated data streams, w-event privacy is a convincing model. The objective is to strike a tradeoff between utility and privacy while protecting all data sequences that occur within any window of w time stamps. However, the model is not applicable to many realistic scenarios because of its fixed sliding-window size. The key issue with this unrealistic assumption is that most real-time aggregate data streams collected from sensors differ significantly across time periods. For example, over successive time stamps, traffic data varies sharply in the daytime but is relatively stable at night. Thus, in this section we introduce a new QoP-based adaptive w-event privacy mechanism to dynamically adjust the window size across time stamps. The following three subsections describe the key parts of this mechanism: the QoP definition, the adaptive sampling design, and the adaptive w-event privacy design.
3.1. Quality of Privacy
Considering both the window size and the errors of aggregated statistics, QoP is proposed to measure the corresponding privacy quality. Assume X and R represent the raw time series in a window and the sanitized time series, respectively. We exploit the mean absolute error (MAE) to measure the difference between these two series, MAE(X, R) = (1/w) Σ_{i=1}^{w} |x_i − r_i|.
Next, we employ a sampling mechanism in the proposed ReADP: it perturbs statistics at selected time stamps and approximates the non-sampled statistics with the perturbed sampled statistics. Thus, (5) can be rewritten accordingly.
As a result, the QoP in a window is defined as a weighted combination of the window size w and the error term, where λ is the weight between them; λ is set to 0.002 in our experiments. In addition, φ(·) is the logistic sigmoid function φ(x) = 1/(1 + e^{−x}). We employ the logistic sigmoid function for normalization because it does not require knowledge of the general characteristics of the data. Intuitively, since sensor data generated at contiguous time stamps is not independent, the data is closely correlated when it changes slowly. Meanwhile, because sensitive information is then more likely to be disclosed, the window size should be increased when data changes slowly.
3.2. The Adaptive Sampling Design
In general, each report of noisy data expends a fixed budget. When protecting all time stamps, the budget allocated to each time stamp will be small if the window size is large, and the report will then show enormous errors. This problem can be addressed with a sampling mechanism that perturbs sampled statistics while skipping non-sampled ones: skipping some data points saves budget for future perturbation and also improves communication efficiency. Compared with a model-based controller, a proportional-integral-derivative (PID) controller offers strong robustness and low complexity, so we exploit the PID controller to change the sampling rate based on dynamic historical data. First, we define the feedback error of a sensor in terms of the difference between the current sampling data point and the last sampling data point; the error grows when the data changes rapidly. The full PID error of a sensor then combines a proportional term, an integral term, and a derivative term, weighted by the proportional gain, the integral gain, and the derivative gain, respectively.

Intuitively, the sampling interval needs to be small when the data changes rapidly. Thus, a new sampling interval is calculated in (11) from the current sampling interval and the PID error, where one parameter regulates the magnitude of the interval adjustment and another controls the sensitivity to the PID error.
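The interval-update rule can be sketched as follows. The gains and scaling constants are placeholders rather than the paper's tuned values, and the exact error and update formulas are approximated from the prose, so treat this as an assumption-laden sketch:

```python
class PIDSampler:
    """Adaptive sampling interval driven by a PID controller.

    The gains (kp, ki, kd) and the scaling constants (theta, xi) are
    illustrative placeholders; the paper tunes them on the training set."""

    def __init__(self, kp=0.8, ki=0.1, kd=0.1, theta=5.0, xi=0.5, t_init=4):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.theta, self.xi = theta, xi
        self.interval = t_init
        self.errors = []                      # history of feedback errors

    def update(self, current, last_sampled):
        # Feedback error: relative change between consecutive sampled points.
        e = abs(current - last_sampled) / max(abs(last_sampled), 1.0)
        self.errors.append(e)
        integral = sum(self.errors[-5:]) / min(len(self.errors), 5)
        derivative = e - self.errors[-2] if len(self.errors) > 1 else 0.0
        pid = self.kp * e + self.ki * integral + self.kd * derivative
        # Rapid change (large PID error) shrinks the interval, and vice versa.
        self.interval = max(1, int(round(self.interval + self.theta * (1 - pid / self.xi))))
        return self.interval
```

A sharp data change drives the PID error above the sensitivity constant, shrinking the interval toward 1; stable data lets the interval grow, saving budget for future perturbations.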
3.3. The Adaptive w-Event Privacy Algorithm
On the basis of the two subsections above, the adaptive w-event privacy algorithm is proposed in Algorithm 2. Note that the first part of the pseudocode is executed offline over the training set.

We assume that the starting and ending points of the window are both sampling points and that the current window contains several sampling points, which determine the window size. According to (6) and (7), the QoP in a window can then be calculated.
After obtaining the result over the training set, the adaptive w-event privacy mechanism is described in the remaining lines of the pseudocode. In particular, we can adjust the new window size by moving the start point of the window forward or backward by a number of time stamps.
4. Smart Grouping-Based Perturbation
A naive method to achieve differential privacy is to inject Laplacian noise into the statistics. Nonetheless, this is likely to introduce large perturbation errors, especially for statistics with small values. Therefore, we present a smart grouping-based perturbation that aggregates sensors with small statistics together, dynamically adapting as the statistics change.
The Smart Grouping Algorithm is presented in Algorithm 3. It is mainly divided into three steps. First, it screens out the sensors that need to be grouped, according to statistics predicted by the LSTM model. Then, it groups those sensors using the K-means algorithm. Finally, the aggregated data is perturbed based on the grouping result. We elaborate on each step in the following subsections.

4.1. Statistics Prediction with LSTM
To protect the privacy of raw data, we use predicted data instead of real values in the smart grouping-based perturbation algorithm. As mentioned above, whether a sensor needs to be grouped depends on the sensor's predicted data, and which group each sensor is assigned to also depends on the predicted value. This means that the accuracy of the predicted value is critical to the accuracy of the final aggregated data. Thus, a good model must be formulated that describes the characteristics of data change well and predicts data accurately.
To achieve accurate prediction, we introduce the LSTM model. LSTM networks [19] have gradually been applied to time-series analysis [20–22] thanks to several advantages. In particular, an LSTM is a special type of recurrent neural network (RNN) that skillfully solves the vanishing-gradient problem of RNNs. A common LSTM unit is composed of a memory cell, an update gate, an output gate, and a forget gate. The memory cell stores a value (or state) over either long or short terms, and information can be removed from or added to the cell state through the three well-designed gates. We therefore adopt an LSTM network to characterize the nonlinear dynamics of the data in our algorithm.
For the efficiency of our Smart Grouping Algorithm, our LSTM network consists of only three layers (shown in Figure 3): the input layer, the hidden layer, and the output layer. The number of input neurons is determined by how many previously aggregated data points are used for prediction. The output layer has just one neuron because we only need to predict the value at the next time stamp. The hidden layer consists of several LSTM units. One weight matrix connects the input layer to the hidden layer, and another connects the hidden layer to the output layer. In addition, each context unit corresponds to a neuron in the hidden layer and records the output of the hidden layer in one recurrence.
As shown in Figure 3, historical aggregated data is used as training input to the LSTM model so as to predict the current value for each sensor. For example, suppose we need to predict the value generated by a sensor at time t. The previously aggregated data is fed in, and we first calculate the output of a hidden-layer unit. Figure 4 shows the detailed structure of an LSTM unit, whose output is calculated as follows.
First, the LSTM unit determines what information should be forgotten from the cell state through the forget gate f_t = σ(W_f · [h_{t−1}, x_t] + b_f), where σ is the logistic sigmoid function, W_f is the weight matrix of the forget gate, and b_f is the bias vector of the forget gate layer. Here h_{t−1} is the output of the hidden layer at time t − 1, while x_t is the input of the hidden layer at the current time, computed from the previously aggregated data.
Next, the LSTM decides what new information should be stored in the cell state via the update gate layer: i_t = σ(W_i · [h_{t−1}, x_t] + b_i) indicates which values will be updated, and C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C) represents a vector of new candidate values, where W_i and W_C are the weight matrices of the input gate layer and b_i, b_C are the corresponding bias vectors. Then, the cell state at the current time t is updated as C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t. Here, the new cell state is controlled jointly by the update gate i_t and the forget gate f_t.
Finally, based on the latest cell state C_t, the output of the hidden layer at the current time is h_t = o_t ∗ tanh(C_t), where o_t = σ(W_o · [h_{t−1}, x_t] + b_o) is the output gate that determines which part of the cell state is output.
According to (13), (15), and (19), the LSTM unit can intelligently determine which information to forget, update, and output. This ability enables our network to predict time series more accurately.
The final prediction for a sensor at time t is obtained by applying the activation function of the output layer to the hidden-layer output h_t.
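The gate equations above are the standard LSTM recurrences, which can be sketched as a single NumPy forward step. The dictionary keys and shapes below are our own convention, not the paper's notation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of a standard LSTM unit, following Figure 4.

    W maps each gate name ('f', 'i', 'c', 'o') to a weight matrix acting on
    the concatenation [h_prev, x_t]; b maps each name to a bias vector."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])       # forget gate: what to discard
    i_t = sigmoid(W['i'] @ z + b['i'])       # update gate: what to store
    c_tilde = np.tanh(W['c'] @ z + b['c'])   # candidate cell values
    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    o_t = sigmoid(W['o'] @ z + b['o'])       # output gate
    h_t = o_t * np.tanh(c_t)                 # hidden-layer output
    return h_t, c_t
```

Iterating this step over the window of previously aggregated data and passing the final h_t through the output layer yields the predicted statistic for the next time stamp.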
Training of the LSTM network: to achieve real-time prediction, we train the network parameters offline in advance. In addition, we employ the true statistics of the training set to ensure the accuracy of the trained model. Therefore, for a sensor at time t, the input consists of the previous true statistics and the expected output is the true statistic at time t. The loss function of our network is the error between the predicted statistic and the true statistic.
Using the backpropagation algorithm [23], the training error is propagated back to the neurons of the LSTM network. We then calculate the training error attributable to each neuron and adjust the corresponding weights to reduce it. Details of the training process can be found in [23]. Finally, given the historical aggregated data, the trained LSTM model can predict sensors' data in real time.
4.2. K-Means Based Smart Grouping Algorithm
In this subsection, we present a Smart Grouping Algorithm based on the K-means method. The algorithm smartly aggregates small statistics obtained from sensors in noisy scenarios. First of all, we allocate a budget to each sampling point and then generate an anti-noise threshold dynamically; the threshold can be characterized as inversely proportional to the allocated budget. Then, we obtain the predicted data of each sensor at each time stamp using the trained LSTM model. On this basis, we exploit the K-means algorithm [24] to aggregate the sensors whose predicted data is smaller than the anti-noise threshold.
Compared with other clustering algorithms, K-means is fast and efficient, making it suitable for large-data scenarios; it thus matches the data size and real-time requirements of our algorithm. Next, we introduce how the K-means algorithm works in our scenario. The input at each time stamp is the predicted data of the sensors that need to be grouped. We first randomly initialize the cluster centers and then assign each sensor to the cluster of its nearest center, using the Euclidean distance from the current point to each center point. Next, we update each cluster center to the mean of all points in its cluster. The algorithm repeats these two steps until convergence, i.e., until the minimum squared error of every point to its center falls below a threshold or the preset maximum number of iterations is reached.
Figure 5 is an example explaining the whole process of the Smart Grouping Algorithm. Assume four sensors need to be sampled at a given time stamp, with given predicted statistics and a given anti-noise threshold. The first sensor forms an independent group because its predicted statistic exceeds the threshold, and it is added to the group strategy of the current time stamp. The remaining three sensors are input to the K-means algorithm; two of them are clustered into one group while the third becomes a single group, yielding the final group strategy.
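The screening-plus-clustering procedure can be sketched as follows. The sensor names, predicted values, and threshold used in the test are hypothetical, since the concrete numbers of the Figure 5 example were not recoverable, and the 1-D K-means here is a plain re-implementation rather than a library call:

```python
import numpy as np

def smart_grouping(predictions: dict, threshold: float, k: int = 2,
                   iters: int = 20, seed: int = 0):
    """Sketch of the Smart Grouping Algorithm on 1-D predicted statistics.

    Sensors whose predicted statistic reaches the anti-noise threshold stay
    in singleton groups; the rest are clustered with a plain K-means."""
    singles = [[s] for s, p in predictions.items() if p >= threshold]
    small = [(s, p) for s, p in predictions.items() if p < threshold]
    if not small:
        return singles
    ids = [s for s, _ in small]
    vals = np.array([p for _, p in small], dtype=float)
    k = min(k, len(vals))
    rng = np.random.default_rng(seed)
    centers = rng.choice(vals, size=k, replace=False)
    for _ in range(iters):
        # Assign each small sensor to its nearest center (Euclidean distance).
        labels = np.argmin(np.abs(vals[:, None] - centers[None, :]), axis=1)
        # Update each center to the mean of its cluster.
        centers = np.array([vals[labels == j].mean() if np.any(labels == j)
                            else centers[j] for j in range(k)])
    groups = [[ids[i] for i in range(len(ids)) if labels[i] == j]
              for j in range(k)]
    return singles + [g for g in groups if g]
```

Grouping only the below-threshold sensors keeps large statistics individually accurate while small statistics share a single noise draw per group.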
4.3. Smart Grouping-Based Perturbation
To add noise, we exploit the Laplacian mechanism to inject noise directly into the aggregated statistics [15], based on the results of adaptive sampling. The aggregated statistics do not include the non-sampled statistics, which are approximated by the last aggregated statistics. In this article, our smart grouping-based perturbation scheme is composed of a perturbation component and an allocation component. Thanks to the grouping algorithm, we apply the Laplacian mechanism to each group rather than to each sensor.
We assume that a group has several sensors and that an aggregation function counts the number of data contributors in the group. Intuitively, because each contributor can appear in the collection range of only one sensor at a given time stamp, the sensitivity of this counting function is 1. The Laplacian mechanism can then be applied to the group, with noise scaled to the group's budget. To avoid exceeding the total budget, our scheme takes the smallest budget of any sensor in a group as the budget of the whole group. With this rule alone, as in the RescueDP strategy, the total budget is not fully used. We therefore additionally fix the sampling points in our scheme and allocate the total budget to each sampling point uniformly, which makes full use of the total budget while ensuring it is not exceeded.
Next, considering the predicted statistics of each sensor, we allocate the perturbed group statistic back to individual sensors. This allocation method avoids the errors that result from the averaging operation in the RescueDP strategy: the weight of a sensor is computed from its predicted statistic as its share of the group's total prediction.
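The perturbation and allocation components can be sketched together: one Laplacian draw per group, then a prediction-weighted split. The function and parameter names are illustrative, not the paper's notation:

```python
import numpy as np

def perturb_and_allocate(true_counts: dict, predictions: dict,
                         epsilon_i: float, seed=None):
    """Perturb one group's aggregate with a single Laplacian draw and
    redistribute it to members in proportion to their predicted statistics
    (instead of the uniform split used by RescueDP).

    A count aggregate has sensitivity 1, so the noise scale is 1 / epsilon_i."""
    rng = np.random.default_rng(seed)
    group_sum = sum(true_counts.values())
    noisy_sum = group_sum + rng.laplace(0.0, 1.0 / epsilon_i)
    total_pred = sum(predictions.values())
    # Weight of each sensor = its predicted share of the group total.
    weights = {s: (p / total_pred if total_pred > 0 else 1.0 / len(predictions))
               for s, p in predictions.items()}
    return {s: noisy_sum * w for s, w in weights.items()}
```

Because only one noise draw is spent per group, small statistics are shielded from noise that would otherwise swamp them, while the weighted split keeps each sensor's published value close to its predicted share.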
According to the smart grouping-based perturbation scheme, the perturbed statistics of each sensor are therefore more accurate.
5. Performance Discussion
In this section, we first analyze the privacy of our proposed ReADP framework in theory and then provide several numerical simulations to study the performance of our framework in terms of MAE and QoP.
5.1. Privacy Analysis
Theorem 7. The proposed ReADP framework satisfies w-event ε-differential privacy.
Proof. In the ReADP framework, the perturbation component is the only mechanism that could disclose private information, because it is the only one that accesses raw data. Consequently, if the perturbation mechanism satisfies differential privacy, the whole ReADP framework does as well.
On the basis of the smart grouping strategy at time stamp t, each group includes several sensors. We consider an arbitrary group. According to (12), the Laplacian mechanism is applied to this group, adding noise scaled to the group's budget.
Based on Definition 1, the group perturbation satisfies differential privacy. According to Axiom 2.1.1 in [25], post-processing sanitized data will not reveal privacy as long as sensitive information is not directly available to the post-processing algorithm. As a result, the allocated statistic of each sensor in the group also satisfies differential privacy. Let the budget consumed and the budget allocated for a sensor at each timestamp be distinguished; since all allocated budget is employed for perturbation in our algorithm, the consumed budget never exceeds the allocated one.
Based on Theorem 6, the perturbation mechanism of a sensor satisfies w-event ε-differential privacy if the budgets consumed in any sliding window of w timestamps sum to at most ε. This always holds because our budget allocation algorithm never allocates more than ε within any window. Thus, the perturbation mechanism on each group satisfies w-event ε-differential privacy, and so does the ReADP algorithm. This completes the proof of Theorem 7.
5.2. Numerical Simulation
We compare the performance of the proposed ReADP strategy with MLDP [14] and the RescueDP strategy [26] over two real datasets. MLDP is a privacy-preserving data aggregation scheme under fog computing based on machine learning, while RescueDP is the latest strategy providing w-event privacy for real-time aggregate data publishing. In the simulation, we employ MAE and QoP as metrics to study the performance of the three schemes; their expressions are given by (6) and (12). Our experiments are conducted in a Python environment under the Windows operating system. Each experiment is run multiple times, and the points in the results are the averages over those runs.
The real-world test datasets used in our experiments are the Bike dataset [27] and the Station dataset [28]. The Bike dataset records bike-share trips in Washington, DC, over one year, from January to December. Each trajectory consists of the bike number, the end station and time, and the start station and time. We transform it into a dataset of sensors counting the number of bikes at each bike parking spot in real time. The first three-quarters of the data is used as the training set and the final quarter as the test set. The Station dataset records the number of passengers at stations over one year; each record reports the passenger count of one station. Because many stations have very little throughput, we choose the sensors with higher throughput to report. The split into training and test sets is the same as for the Bike dataset.
In our experiments, the parameters of the PID controller for the adaptive sampling mechanism, as well as the threshold in Algorithm 2, are set in advance. In addition, the window size for the Bike dataset and that for the Station dataset are obtained by iterating over the training set, and the number of clusters in the K-means based Smart Grouping Algorithm is chosen for each dataset as the value with the best performance on the training set. The parameters of the LSTM network are set as follows: previous history data is used as the input of the LSTM network, so the number of input-layer neurons matches the history length, and the hidden layer contains several LSTM units; the network is trained for a fixed number of iterations. Note that a training process takes about two minutes, which is time-consuming; however, we only conduct this process offline, so it does not affect the real-time nature of the algorithm.
5.2.1. Utility versus Privacy
Figure 6 provides the tradeoff analysis between utility and privacy. It is clear that as the privacy budget increases, the MAE of all three schemes decreases gradually, because a larger privacy budget means that less noise needs to be injected. Moreover, on both real-world test datasets the ReADP scheme greatly outperforms the other two schemes, especially when the privacy budget is small. Likewise, given a sufficient privacy budget, the QoP of ReADP is clearly superior to that of the other two schemes.
Figure 6: (a) MAE (Bike dataset); (b) QoP (Bike dataset); (c) MAE (Station dataset); (d) QoP (Station dataset)
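The inverse relationship between the privacy budget and the injected noise follows from the Laplace mechanism, in which the noise scale is sensitivity/epsilon. The sketch below is a generic illustration of that effect, not the paper's perturbation component.

```python
import numpy as np

def laplace_perturb(value, epsilon, sensitivity=1.0, rng=None):
    """Laplace mechanism: noise drawn with scale sensitivity/epsilon,
    so a larger privacy budget epsilon injects less noise."""
    if rng is None:
        rng = np.random.default_rng()
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Empirically, the mean absolute error shrinks as epsilon grows.
rng = np.random.default_rng(42)
for eps in (0.1, 1.0, 10.0):
    noisy = np.array([laplace_perturb(100.0, eps, rng=rng) for _ in range(1000)])
    print(eps, np.mean(np.abs(noisy - 100.0)))
```

The printed error is close to 1/epsilon for each setting, matching the trend visible in Figure 6.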
The superior performance of the ReADP scheme results from the following three aspects. First, due to the design of the optimal number of sampling points and the corresponding privacy budget allocation mechanism, the privacy budget is fully used for private perturbation. Second, the adaptive event privacy mechanism in the ReADP scheme satisfies the privacy window adaptively, which improves the practicability of the scheme. Finally, LSTMbased prediction can provide a highaccuracy prediction result for the smart grouping mechanism.
5.2.2. Effect of Adaptive Event Privacy Mechanism
To highlight the advantages of the adaptive event privacy mechanism, we compare our ReADP scheme with a variant, ReADP(f), which adopts only a fixed event privacy mechanism. Figure 7 shows the comparison results in terms of MAE and QoP. It can be clearly seen that the adaptive mechanism increases QoP while decreasing MAE significantly on both real-world datasets. We therefore conclude that the adaptive event privacy mechanism improves the quality of the reported data considerably.
Figure 7: (a) MAE comparison; (b) QoP comparison
5.2.3. Effect of Smart Grouping Mechanism
In this part, we investigate the performance of our smart grouping mechanism. As shown in Figure 8, smart grouping improves both the MAE and the QoP compared with ReADP without it. This superior performance chiefly benefits from the K-means-based grouping algorithm and the application of the deep learning algorithm.
Figure 8: (a) MAE comparison; (b) QoP comparison
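As a toy sketch of the grouping idea, the code below clusters sensors by their one-dimensional (predicted) counts with a tiny K-means so that sensors with similar statistics can be perturbed jointly. This is an illustration under simplified assumptions (deterministic quantile initialization, 1-D values), not the paper's Smart Grouping Algorithm.

```python
import numpy as np

def kmeans_group(values, k, iters=20):
    """Toy 1-D K-means: group sensors whose (predicted) counts are similar.

    Returns an integer group label per sensor. Centers are initialized at
    evenly spaced quantiles so the result is deterministic.
    """
    values = np.asarray(values, dtype=float)
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        # Assign each sensor to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its assigned sensors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels

# Hypothetical predicted counts for eight sensors.
counts = [3, 4, 5, 40, 42, 41, 100, 98]
print(kmeans_group(counts, k=3))  # sensors with similar counts share a label
```

Grouping similar sensors lets one noisy group sum stand in for several individual releases, which is where the utility gain of smart grouping comes from.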
6. Related Work
Many methods have been proposed to ensure the privacy of aggregated data generated by IoT devices [29–33]. Wu et al. [34] proposed a Dynamic Trust Relationships Aware Data Privacy Protection (DTRPP) mechanism for Mobile Crowd Sensing (MCS), which ingeniously evaluates the trust value of a public key. Zhang et al. [35] designed a priority-based health data aggregation scheme (PHDA) in cloud-assisted wireless body area networks, in which a credible relay node is selected according to the social relationships between nodes to help aggregate data and forward it to cloud servers. PHDA also provides a lightweight privacy-preserving aggregation scheme, which can not only resist forgery attacks but also reduce communication overhead. Li et al. [36] presented a privacy-aware data aggregation protocol for mobile sensing, which aggregates time-series data to prevent untrustworthy aggregators from disclosing privacy; using additive homomorphic encryption and a novel key management scheme, the aggregator can obtain only the sum of all users' data. Still, these schemes cannot cope with sophisticated attacks that can mine private information even from the raw sum data.
In addition, the methods above all achieve privacy preservation through encryption technologies. Such complicated encryption technologies usually introduce high computation overhead, which may not be suitable for energy-constrained sensor networks. Some researchers therefore suggest exploiting differential privacy, a convincing model for providing privacy, to protect aggregated data generated by IoT devices. Han et al. [37] proposed a scheme providing privacy preservation for health data aggregation; it employs a differential privacy model to resist the differential attacks that most existing data aggregation schemes suffer from. Yang et al. [14] also proposed a differential privacy model based on machine learning algorithms, which can reduce communication overhead as well as rigorously protect the privacy of sensitive data in the fog computing architecture. Also in fog computing, Wang et al. [38] put forward a privacy-preserving content-based publish-subscribe scheme with differential privacy, which can protect against collusion attacks.
Although these works apply differential privacy to protect the privacy of aggregated data, they have a serious deficiency in real scenarios: they may greatly reduce the availability of aggregated data streams. Thus, some studies are committed to solving this challenge. Cao et al. [39] studied a protection method for sensitive streams within a window instead of the whole infinite stream; targeting window-based applications, they explored a stream-based management system to cope with numerous aggregate queries simultaneously. In [18], Fan and Xiong intended to hide all events of users and designed a user-level privacy strategy for a finite stream; to improve the accuracy of differentially private data release, they applied the Kalman filter [40] to the perturbed data. Considering multiple events occurring over continuous time segments, Kellaris et al. presented a w-event differential privacy model in [17], which skillfully combines the advantages of the event-level and user-level privacy models. In the model, they employed a sliding window to capture a wide range of event privacy and designed a scheme to distribute and absorb the privacy budget under the assumption that the statistics do not change significantly. On this basis, Wang and Zhang further designed an online aggregate monitoring scheme for infinite streams in [26]; their scheme integrates adaptive sampling, a budget mechanism, and dynamic grouping and perturbation to preserve the privacy of the statistics.
Although ongoing studies of differential privacy for stream data aggregation have played a vital role, challenges remain. We point out that the fixed sliding windows employed in most existing frameworks may not be practical. Moreover, existing metrics are suitable only for static data rather than streaming data. Motivated by these challenges, in this paper we present a real-time privacy-preserving stream data aggregation framework based on adaptive event differential privacy for the fog computing architecture.
7. Conclusion
Considering the privacy disclosure of aggregated data in fog computing, we presented a real-time stream data aggregation framework with adaptive event differential privacy (ReADP) in this paper. Built on the four layers of our system model, the framework is composed of three components: adaptive event privacy analysis, smart grouping-based perturbation, and a filtering mechanism. In particular, the first component protects the privacy of an infinite stream over any successive time stamps; the second achieves smart grouping based on K-means and injects noise into the aggregated data; and the third exploits an existing filter to improve data availability. We theoretically prove that the proposed ReADP framework satisfies differential privacy. Extensive experiments on real-world datasets show that the ReADP scheme outperforms existing methods and improves the utility of real-time data publishing with strong privacy preservation.
Data Availability
The datasets used to support the findings of this study are openly accessible. The Bike dataset can be found at https://www.capitalbikeshare.com/systemdata, and the Station dataset can be accessed at https://www.kaggle.com/saulfuh/bartridership/data. These datasets are also cited in the References.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Grant nos. 61471028 and 61571010) and the Fundamental Research Funds for the Central Universities (Grant nos. 2017JBM004 and 2016JBZ003).