Abstract
Privacy protection is one of the major obstacles to data sharing. Time-series data are autocorrelated, continuous, and large in scale. Current research on time-series data publication largely ignores the correlation of time-series data, which leads to insufficient privacy protection. In this paper, we study the problem of correlated time-series data publication and propose a sliding-window-based autocorrelated time-series data publication algorithm, called SWATS. Instead of using the global sensitivity of traditional differential privacy mechanisms, we propose periodic sensitivity to provide a stronger privacy guarantee. SWATS introduces a sliding window mechanism, with the correlation between the noise-added sequence and the original time-series data guaranteed by sequence indistinguishability, to protect the privacy of the latest data. We prove that SWATS satisfies ε-differential privacy. Compared with the state-of-the-art algorithm, SWATS reduces the mean absolute error (MAE) by about 25%, improves the utility of the data, and provides stronger privacy protection.
1. Introduction
Time-series data are a set of sequential, large-scale, and continuous data sequences. In general, time-series data can be regarded as a dynamic dataset that grows infinitely over time. Using the correlation between data values to analyze and mine time-series data can bring considerable benefits to governments, enterprises, and public services. For example, during the COVID-19 outbreak, monitoring and analyzing a patient's physical condition can help treat the disease effectively and control the spread of the epidemic. Navigation software needs to count the total traffic volume on each road within a specific time range to calculate the best route to a destination.
The above examples illustrate the importance of publishing time-series data for knowledge discovery and acquisition. However, if the curator publishes the data directly without appropriate privacy protection technology, sensitive personal information will be leaked and citizens' privacy violated.
Traditional data publishing mainly uses anonymization techniques, such as the k-anonymity [1] model and its derivatives [2, 3], for privacy protection. However, these methods depend strongly on assumptions about the attacker's background knowledge and cannot rigorously prove their level of privacy protection. Some studies [4, 5] combine blockchain and artificial intelligence (AI) to protect data privacy, but this approach is inefficient and, if vulnerabilities exist, exposes the system to significant attacks. Differential privacy [6] is a strict and provable privacy protection technology that can keep users' sensitive information from being leaked [7]. By adding random noise, it limits the impact of any single record on the released statistical results, blurring the presence of that record in the dataset, so that users' privacy is fundamentally protected. This model is widely used for releasing various kinds of data in many application scenarios [8]. The privacy leakage problem in time-series data publication can also be addressed with the differential privacy model. Dwork et al. [9] achieved event-level differential privacy in the scenario of continual statistics publication. To reduce the noise added to the original time-series data, Chan et al. [10] proposed a binary-tree-based divide-and-conquer method to decompose and store time-series data.
Motivations and Contributions. Traditional differential privacy is widely used for data publication, but Kifer et al. [11] pointed out that it still risks leaking personal privacy when publishing correlated time-series data. Current differential privacy publication methods for correlated time-series data mainly either build correlation models, such as the covariance matrix [12] and Markov models [13, 14], or apply data transformations, e.g., the Fourier transform [15] and the discrete wavelet transform (DWT). These differential privacy methods focus on publishing independent and identically distributed (IID) data, which leads to the following problems. Insufficient privacy protection: adding IID noise to correlated data allows an attacker to filter out the noise through filtering attacks and similar methods, thus disclosing the user's privacy. Low data utility: since IID noise added to correlated data reduces the achieved level of privacy protection, more noise must be added to maintain the same level of differential privacy, causing a sharp decrease in the utility of the published data.
These issues indicate that current differential privacy methods are not suitable for processing correlated time-series data. Although Wang et al. [16] proposed the CTSDP method, which resists filtering attacks by adding noise consistent with the correlation of the original data, it ignores the periodicity of time-series data and fails to provide adequate privacy protection. It also does not apply to the publication of dynamic data. Compared with existing work, our main contributions are summarized as follows.
First, because time-series data exhibit periodic changes and strong autocorrelation, even if a single record is deleted from the dataset, an attacker can infer information about the missing record from other correlated records. We propose periodic sensitivity to replace the global sensitivity of traditional differential privacy, avoiding this situation and providing stronger privacy protection under the same privacy budget. Second, based on periodic sensitivity, we propose a sliding window mechanism to process infinitely growing, correlated time-series data. Third, we theoretically prove that our proposed sliding-window-based correlated time-series data publication algorithm (SWATS) satisfies differential privacy. Compared with the state-of-the-art method, the experimental results show that SWATS achieves lower error and provides stronger privacy protection.
2. Related Work
In early research on differential privacy data publication, most studies assumed that the data are independent, and research on differential privacy for correlated data is still relatively limited. The main obstacle is that correlated records provide additional information to attackers, which traditional mechanisms can hardly model; in this setting, meeting the definition of differential privacy is a complex task. Kifer et al. [11] first showed that differential privacy provides weaker privacy guarantees on correlated datasets if the correlation between data is not considered. For example, suppose that a record r influences a group of records. Even if r is deleted from the dataset, information about r can be derived from that group, so traditional differential privacy cannot provide enough protection. Chen et al. [17] treated social networks as correlated datasets and addressed the problem of insufficient privacy protection by multiplying the global sensitivity by the number of correlated records. However, this method introduces too much noise, making the utility of the dataset decline sharply.
In research on correlated time-series data, Cao et al. [18] modeled related information with intra-coupling and inter-coupling behavior functions and used these functions in an association framework to express the degree of association between behaviors. They proposed a hidden-Markov detection model to detect abnormal transaction behavior based on grouping, defining a time interval and assuming that behaviors falling within the same interval are related. Song et al. [19] proposed a hybrid coupling framework that uses special attributes to identify relationships between records. Zhang et al. [20] proposed a correlated network traffic classification algorithm that uses IP addresses to identify correlated network traffic records. Zhou et al. [21] mapped correlated records to an undirected graph and proposed a multi-instance learning algorithm.
Wang et al. [16] proposed the concept of sequence indistinguishability and proved that if the correlations of the original time series and the noise-added time series are consistent, then the added noise satisfies differential privacy. Their differentially private time-series data publication algorithm, CTSDP, adds correlated noise to preserve the correlation structure. Zhu et al. [12] defined correlated sensitivity: they considered the correlation between records and proposed an effective correlated differential privacy solution, CIM (correlated iteration mechanism). CIM uses the covariance matrix to describe the correlation between sequences and takes it as the weight when calculating the sensitivity function. Experimental results show that this solution outperforms traditional differential privacy in terms of mean squared error when responding to large batches of queries, which also shows that correlated differential privacy can protect privacy while maintaining the utility of the data.
Some scholars convert correlated time-series data into another, independent domain for processing while retaining the main characteristics of the original sequence. Rastogi et al. [15] proposed a Fourier perturbation algorithm (FPA) for this problem. In FPA, the discrete Fourier transform (DFT) converts the correlated data into an independent Fourier domain, and the DFT coefficients of the original sequence are then approximately reconstructed. To overcome the shortcomings of FPA on short-term and nonstationary sequences, the discrete wavelet transform (DWT) was adopted in [22, 23]; DWT extends the applicable range of FPA and retains more features of the sequence. Although ensuring differential privacy is difficult in this setting, the literature [24–26] uses principal component analysis (PCA) to project the features of the dataset into another dimension, and the published perturbed data can be applied to some common statistical learning applications. Table 1 summarizes recent studies on differentially private publication of correlated time-series data.
Summary. In current work on differentially private publication of correlated time-series data, some methods add independent noise to correlated time-series data, which is vulnerable to attack. Other methods add correlated noise but ignore the periodic changes of time-series data, resulting in insufficient privacy protection. Moreover, current methods can only be applied to the publication of static data. This article attempts to solve the following problems: How can correlated time-series data be published dynamically? How can the lack of privacy strength caused by the periodic changes of correlated time-series data be addressed?
3. Preliminary Knowledge
3.1. Differential Privacy
Dwork et al. [6] first proposed the differential privacy model, a strong privacy protection framework. By limiting the influence of any single record's change on the query results, it ensures that an attacker cannot accurately obtain the sensitive information in a record even when knowing all records except that one.
Definition 1. (ε-differential privacy [26]). Consider two neighboring datasets, D and D′, which differ in at most one record. A random algorithm A satisfies ε-differential privacy if, for every set of outputs S ⊆ Range(A),

Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S].
Definition 2. (Global sensitivity [28]). Suppose there is a query function f: D → R^d, which takes a dataset D as input and outputs a d-dimensional real vector. For any neighboring datasets D and D′, the global sensitivity of the function f is defined as

Δf = max_{D, D′} ||f(D) − f(D′)||₁.
Definition 3. (Laplace mechanism [28]). Given a dataset D and a function f: D → R^d with sensitivity Δf, the random algorithm

A(D) = f(D) + (Lap₁(Δf/ε), …, Lap_d(Δf/ε))

provides ε-differential privacy protection, where each Lap_i(Δf/ε) is an independent Laplace random variable with scale Δf/ε.
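As a concrete illustration, here is a minimal Python sketch of the Laplace mechanism for a scalar query (our own illustration, not the paper's MATLAB implementation; all function names are ours):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """One draw from Lap(0, scale) via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value with epsilon-differential privacy,
    given the global sensitivity of the query."""
    return true_value + laplace_noise(sensitivity / epsilon)
```

The noise scale Δf/ε makes the guarantee explicit: a smaller ε (stronger privacy) or a larger sensitivity both widen the noise distribution.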
Theorem 1. (Parallel composition property [29]). Let A₁, A₂, …, A_n be a sequence of random algorithms whose random processes are mutually independent, with privacy budgets ε₁, ε₂, …, ε_n, respectively. For disjoint datasets D₁, D₂, …, D_n, the combined algorithm composed of A₁(D₁), …, A_n(D_n) provides max_i ε_i-differential privacy protection.
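Theorem 1 is what later allows each disjoint sliding window to be perturbed with the full budget ε. A hedged Python sketch (the helper and function names are ours):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """One draw from Lap(0, scale) via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_partition_counts(partitions, epsilon, sensitivity=1.0):
    """Answer a count query on each disjoint partition with the full budget.

    Because the partitions are disjoint, parallel composition gives an
    overall guarantee of epsilon, not len(partitions) * epsilon.
    """
    return [len(p) + laplace_noise(sensitivity / epsilon) for p in partitions]
```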
3.2. Problem Definition
Time-series data are a set of sequential, large, and continuous data sequences; in general, they can be regarded as a dynamic dataset that grows infinitely over time. For example, Table 2 shows time-series blood glucose data collected from different users within one month.
Consider the following scenarios: user A wants to query the average blood glucose value within the range T₁ to T₂; user B wants to query the number of people whose blood pressure is greater than 140 mmHg at time T₃; and so on. The goal of this article is to use differential privacy technology to publish correlated time-series data so that users can obtain meaningful query results while personal privacy in the database is not leaked. The curator aggregates the time-series data of all users and divides it into subdatasets according to the data attributes. Each subdataset is divided into pieces of disjoint time-series data along the user dimension. The curator finally publishes all data under the constraint of differential privacy and responds to user queries, as shown in Figure 1.
Any piece of time-series data X can be treated as a short-term stationary sequence, and its autocorrelation can be expressed using an autocorrelation function.
Definition 4. (Autocorrelation function [30]). The correlation of time-series data can be expressed by the autocorrelation function. For the original time-series data X, the autocorrelation function can be expressed as

R_X(τ) = E[X(t)X(t + τ)];

for Gaussian white noise in particular, R(τ) = (N₀/2)δ(τ), where N₀ represents the power spectral density and δ(τ) represents the impulse function.
Definition 5. (Sequence indistinguishability [16]). If the original time-series data X and the noise sequence Z to be released have the same normalized autocorrelation functions, that is,

R_X(τ)/R_X(0) = R_Z(τ)/R_Z(0),

then the noise sequence and the original sequence are indistinguishable to the attacker, who cannot simply use knowledge about the correlation of the original sequence to launch an attack.
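Definition 5 can be checked empirically. The sketch below (our own Python illustration; the particular sample estimator is an assumption) computes the normalized autocorrelation function compared in the definition:

```python
def autocorr(x):
    """Sample autocorrelation R(tau) for tau = 0 .. len(x) - 1."""
    n = len(x)
    return [sum(x[t] * x[t + tau] for t in range(n - tau)) / (n - tau)
            for tau in range(n)]

def normalized_autocorr(x):
    """R(tau) / R(0), the quantity compared in Definition 5."""
    r = autocorr(x)
    return [v / r[0] for v in r]
```

Note that normalization removes amplitude: a sequence and any scaled copy of it share the same normalized autocorrelation function, which is exactly why the attacker cannot separate signal from correlation-matched noise on this basis alone.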
4. Correlated TimeSeries Data Publishing Algorithm Based on Sliding Window
In real life, time-series data form a dynamic dataset that grows infinitely over time. Therefore, building on the CTSDP algorithm, this paper uses a sliding window mechanism for time-series data of any length to realize continuous publication while satisfying differential privacy. To solve the problem of insufficient privacy protection in the CTSDP algorithm, we propose periodic sensitivity instead of global sensitivity to achieve stronger privacy protection.
4.1. Sliding Window Model
Define time-series data X = {x₁, x₂, …, x_t, …}, where x_t represents the data value at time t. The sliding window model is used to partition X: each sliding window is denoted W_i, and the sliding window size is L. The data contained in the i-th sliding window is X_i = {x_{(i−1)L+1}, …, x_{iL}}, and the data to be published after processing by the algorithm is Y_i.
A sliding window over time-series data specifies an interval containing the latest data; its purpose is to bound the infinite data stream and capture data characteristics. As new data arrive, the data in the sliding window are processed once the amount of data reaches the set window size; the window then slides forward to wait for the next batch of data. Figure 2 shows the process of publishing time-series data using the sliding window model.
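The windowing step described above can be sketched as follows (a Python illustration of ours; the paper's implementation is in MATLAB):

```python
def sliding_windows(stream, window_size):
    """Group an unbounded stream into successive disjoint windows.

    Each full window is released for processing; a trailing partial
    window is held until enough data arrive (or, as Section 4.4 notes,
    the window size can be shrunk to match the remainder).
    """
    buf = []
    for value in stream:
        buf.append(value)
        if len(buf) == window_size:
            yield list(buf)
            buf.clear()
```

For example, `sliding_windows(range(10), 4)` yields `[0, 1, 2, 3]` and then `[4, 5, 6, 7]`, holding back the two remaining items.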
Differential privacy protection for time-series data is divided into two levels: the event level and the user level [9]. The former protects every event in the time-series sequence, while the latter protects all of a user's behavior. This paper targets event-level privacy protection, protecting each event in the time-series sequence.
4.2. The Sampling Period of TimeSeries Data
Time-series data usually exhibit strong periodic changes, and this characteristic can be used to determine the sampling period of the data. For example, the blood glucose of healthy people remains within a constant range before three meals a day and before bedtime; usually, the sampling frequency of health data within one day is taken as a period. Taking blood glucose data sampled four times a day as an example, the sampling period is T = 4. For data that yield only a single statistical value per day, such as step counts, the sampling frequency within a week or a month can be used as the period, that is, T = 7 or T = 30.
4.3. Periodic Sensitivity
Since time-series data exhibit strong periodic changes, continuing to use the global sensitivity increases the risk of privacy leakage.
For example, suppose someone's blood pressure surged recently due to staying up late. If users query that day's blood pressure value, they can infer the other nearby blood pressure samples with high probability. Therefore, to ensure that the data are not leaked, all sampled data in the periods before and after this blood pressure value would have to be deleted. In this case, generating Laplacian noise from the global sensitivity clearly cannot adequately protect the data from leakage. Based on this observation, this paper proposes periodic sensitivity to replace global sensitivity and provide stronger privacy protection.
Definition 6. (Periodic sensitivity). According to the attribute N of the time-series data, determine the sampling period T of this attribute; the periodic sensitivity is then defined as

Δ_T = max_i ||Q(X) − Q(X₋ᵢ)||₁,

where X represents a piece of time-series data of attribute N, Q represents the query function, X₋ᵢ means X with all data in the i-th sampling period removed, and T represents the number of sampled data points in one period.
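For a scalar query such as a sum, Definition 6 can be sketched as follows (our own Python illustration; for vector-valued queries the absolute difference would be replaced by the L1 norm):

```python
def periodic_sensitivity(x, period, query):
    """Definition 6 for a scalar query: the largest change in the query
    result when every data point of one sampling period is removed."""
    base = query(x)
    worst = 0.0
    for i in range(len(x) // period):
        # Drop the whole i-th sampling period, not just one record.
        reduced = x[:i * period] + x[(i + 1) * period:]
        worst = max(worst, abs(query(reduced) - base))
    return worst
```

For example, with x = [1, 2, 3, 4, 5, 6], period 2, and a sum query, removing the periods in turn changes the sum by 3, 7, and 11, so the periodic sensitivity is 11 — larger than the global sensitivity of a single record, reflecting the stronger protection goal.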
4.4. Algorithm Design
The SWATS algorithm can iteratively process and publish both the existing data in the database (static data) and recently arrived data (dynamic data); the latter are processed and published once their volume reaches the sliding window size, or the window size can be adjusted to match the amount of newly added data before publishing. The procedure of the SWATS algorithm is shown in Algorithm 1.

Algorithm 1 shows the basic framework of SWATS. SWATS divides the original time series X into n subsequences according to the sliding window length L (line 1) and iteratively processes the subsequence in each sliding window (lines 2-9). First, it calculates the autocorrelation function of the subsequence (line 3) and the periodic sensitivity (line 4). It then generates four groups of Gaussian white noise with the same length as the subsequence (line 5), whose power spectral density is determined by b = Δ_T/ε, the ratio of sensitivity to privacy budget (line 4). The four groups of Gaussian white noise are convolved with the impulse response to obtain four groups of Gaussian noise sequences matching the autocorrelation function (line 6). Next, Laplacian noise is obtained by subtracting the sum of squares of two of the Gaussian noise groups from the sum of squares of the other two; sampling from the result at intervals of 1 yields Laplacian noise of length L (line 7). Finally, by splicing all Laplace noise segments of length L and adding them to the original time series, the final noise-added sequence is obtained and ready for publication (lines 8 and 9).
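The Gaussian-to-Laplace combination in line 7 rests on a standard identity: the sum of squares of two i.i.d. zero-mean Gaussians is exponentially distributed, and the difference of two independent exponentials is Laplace distributed. The sketch below (our own Python illustration) demonstrates only this distributional step, omitting the convolution with the impulse response that gives the noise its autocorrelation:

```python
import math
import random

def gaussian_white(n, variance):
    """A length-n Gaussian white noise sequence with the given variance."""
    return [random.gauss(0.0, math.sqrt(variance)) for _ in range(n)]

def laplace_from_gaussians(n, scale):
    """Laplace(scale) noise of length n from four Gaussian sequences.

    If G1..G4 are i.i.d. N(0, scale/2), each pair of squared sums is
    Exp(mean = scale), and their difference is Laplace with the given scale.
    """
    g = [gaussian_white(n, scale / 2.0) for _ in range(4)]
    return [g[0][t] ** 2 + g[1][t] ** 2 - g[2][t] ** 2 - g[3][t] ** 2
            for t in range(n)]
```

In the full algorithm, the four Gaussian sequences would first be filtered to match the subsequence's autocorrelation before being combined, so that the released noise satisfies sequence indistinguishability.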
For newly added data, once the amount of data reaches the size of the sliding window, the new subsequence is obtained and steps 3-7 of Algorithm 1 are executed directly to obtain the noise-added sequence, which is then published.
4.6. Privacy Analysis
Theorem 2. Algorithm SWATS satisfies εdifferential privacy.
Proof. The literature [16] has proved that if the original time-series data and the noise sequence added to them satisfy Definition 5, then the published noise-added sequence satisfies ε-differential privacy. Since the sliding windows contain disjoint subsets of the data, by Theorem 1 the algorithm SWATS satisfies ε-differential privacy.
Theorem 3. The noise sequence generated by the algorithm SWATS in each sliding window is correlated with the original sequence.
Proof. The literature [16] has proved that a noise sequence generated by convolving Gaussian white noise with the impulse response derived from the autocorrelation function R_X(τ) has the same normalized autocorrelation function as the original sequence; therefore, within the same sliding window, the noise sequence and the original time series are correlated.
4.6. Time Complexity Analysis
In Algorithm 1, the costs of steps 3, 4, 5, and 6 all depend on the sliding window length L. Since the algorithm iterates over each sliding window, the total computational complexity grows with the number of windows times the per-window cost, where L is the length of the sliding window. When the window length equals the length of the original sequence, the algorithm processes the whole sequence at once. As new data continuously arrive, only the latest window needs to be processed, so for recently arrived data the per-release cost depends only on L rather than on the total length of the sequence.
4.7. Utility Analysis
This paper uses the differential privacy utility definition proposed by Blum et al. [31] to perform utility analysis.
Definition 7. ((α, β)-accuracy [31]). For a query set Q, if for each query Q ∈ Q and the original dataset X the privacy protection mechanism M satisfies

|M(X) − Q(X)| ≤ α

with probability at least 1 − β, then M satisfies (α, β)-accuracy. For any query Q ∈ Q, the Laplace mechanism with noise scale Δ/ε satisfies |M(X) − Q(X)| ≤ α with probability of at least 1 − β for

α = (Δ/ε) ln(1/β).
Proof. Let η ∼ Lap(b) represent the error introduced by the Laplace noise, where b = Δ/ε. By the tail property of the Laplace distribution, Pr[|η| > α] = exp(−α/b). Setting α = b ln(1/β) = (Δ/ε) ln(1/β) gives Pr[|η| > α] = β, so |M(X) − Q(X)| ≤ α holds with probability at least 1 − β.
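The tail bound used in the proof can be checked numerically (a Python sketch of ours; the sampler and function names are assumptions):

```python
import math
import random

def laplace_sample(scale: float) -> float:
    """One draw from Lap(0, scale) via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def exceedance_frequency(scale: float, alpha: float, n: int) -> float:
    """Empirical Pr[|Lap(scale)| > alpha] over n draws."""
    return sum(1 for _ in range(n) if abs(laplace_sample(scale)) > alpha) / n
```

With scale b = 1 and α = ln(1/β), the observed exceedance frequency should be close to β, matching Pr[|η| > α] = exp(−α/b).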
5. Experimental Evaluation
The correlated time-series differential privacy publication algorithm based on the sliding window was implemented in MATLAB. The experimental environment is an Intel(R) Core(TM) i5 2.7 GHz CPU with 4 GB of memory running the Windows 7 operating system. We used two real-world datasets in our evaluations to illustrate the effectiveness of our approach in real-world applications.
Diabetes (http://archive.ics.uci.edu/ml/datasets/Diabetes). The Diabetes dataset is a representative standard classification dataset from the UCI machine learning repository. The records were obtained from two sources: an automatic electronic recording device and paper records. The automatic device had an internal clock to timestamp events, whereas the paper records only provided "logical time" slots (breakfast, lunch, dinner, and bedtime). For paper records, fixed times were assigned to breakfast (08:00), lunch (12:00), dinner (18:00), and bedtime (22:00).
Steps. These data were collected by teachers and students through smart bracelets and mobile phones. Table 3 shows some of the fields in the dataset, including start date, end date, and value; for example, the number of steps someone took from 2019-05-14 10:37:07 to 2019-05-14 11:49:32 is 956. Moreover, the start and end times of each sample are not fixed, indicating that the smart bracelets and mobile phones collect and count steps over multiple periods within a day. After preprocessing, the step data collected in each period of the day are merged to obtain daily step counts.
Metrics. To verify the effectiveness of the proposed algorithm, the SWATS and CTSDP algorithms are compared. For data utility evaluation, the mean absolute error (MAE) is used, defined as

MAE = (1/N) Σ_{i=1}^{N} |x_i − x̂_i|,

where N represents the length of the time series, x_i is the original value, and x̂_i is the published value; a lower MAE means better data utility.
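For reference, the MAE metric is straightforward to compute (a Python sketch of ours):

```python
def mae(original, published):
    """Mean absolute error between original and published series (lower is better)."""
    if len(original) != len(published):
        raise ValueError("series must have equal length")
    return sum(abs(a - b) for a, b in zip(original, published)) / len(original)
```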
5.1. Experimental Results
CTSDP is currently the state-of-the-art method for publishing correlated time-series data, so we choose it as the baseline for comparison.
5.1.1. Impact of Sliding Window Size on Data Utility
Figure 3 shows the experimental results of the two algorithms under different sliding window sizes when the privacy budget ε is 1 and 0.5, respectively. In the Diabetes dataset, a piece of time-series data was randomly selected for processing; each algorithm was run 1000 times, and the results were averaged over the 1000 runs. The results of SWATS are clearly better than those of CTSDP, with the average error reduced by 37.5%. As the size of the sliding window increases, the MAE of SWATS also increases. In the Steps dataset, the data were first divided into 7 intervals according to the number of steps (from an interval of at most 3000 steps up to an interval of more than 21000 steps), and the number of people in each interval was counted each day to form 7 statistical time series. The experimental results again show that SWATS outperforms CTSDP, with the average error reduced by 24.9%. As the sliding window size increases, the MAE of SWATS keeps growing but always remains smaller than that of CTSDP.
5.1.2. Impact of Epsilon on Data Utility
Figure 4 compares the results of the two algorithms under different privacy budgets for two fixed sliding window sizes. As the privacy budget increases, the MAE of both algorithms decreases, and the proposed SWATS algorithm is always better than CTSDP.
In the Diabetes dataset, the average error of SWATS is 25.1% lower than that of CTSDP; in the Steps dataset, the reduction in average error is 12.5%.
5.1.3. Privacy Protection Strength Calculation
In this paper, we use the filtering-based attack method proposed by Xiong et al. [32] to calculate the privacy protection strength. The privacy protection strength after the attack is denoted ε′; it is computed from R, the autocorrelation function of the noise sequence, P, the cross-correlation function of the original sequence and the noisy sequence, and ε, the privacy budget. The smaller ε′ is, the higher the privacy protection strength. Figure 5 compares the privacy protection strength of the two algorithms. On both datasets, as the privacy budget increases, the privacy protection strength of both algorithms shows a downward trend; however, the privacy protection strength of SWATS is always higher than that of CTSDP. This shows that the periodic sensitivity proposed in this paper is effective and that SWATS can better protect users' privacy from being leaked.
5.2. Experimental Conclusions
Each time CTSDP releases data, it must process all the time-series data involved in the query; when new data arrive, it must recalculate the entire series to be released, performing a great deal of unnecessary computation. As the data stream grows, the computational cost of CTSDP becomes larger and larger and may, in extreme cases, crash the system. The SWATS algorithm proposed in this paper introduces a sliding window mechanism on top of CTSDP, which can both process the latest data and respond to queries with different starting points and lengths. This eliminates much unnecessary computation and greatly saves system resources. The experimental results show that, under sliding windows of different sizes, the error of SWATS is about 31% lower than that of CTSDP, and under different privacy budgets, the error is about 19% lower.
6. Conclusions and Future Works
In this paper, we proposed a sliding-window-based differential privacy publication algorithm for autocorrelated time series, applied to the publication of time-series data. We proved that SWATS satisfies ε-differential privacy. The experimental results show that the algorithm is significantly better than the comparison algorithm for time-series data publication and can be applied to the publication of dynamic data.
Although SWATS is effective, several aspects remain to be improved. First, the periodic sensitivity depends on the sampling period of the time-series data: only when the data have an obvious sampling period can SWATS provide good protection, and if the data are sampled randomly, the privacy protection strength may fall short of expectations. Moreover, to calculate the periodic sensitivity, the length of the sliding window must be greater than three times the length of the sampling period. Second, the SWATS algorithm currently considers only the autocorrelation of a single attribute and can process the time-series data of only one attribute at a time. In practice, the data of different attributes are not only autocorrelated but also mutually correlated. Considering the correlation between multiple attributes and publishing multidimensional correlated time-series data is our next research direction.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article. This work was supported by the National Natural Science Foundation of China (Grant no. 41971407), Major Technical Innovation Project of Hubei (Grant no. 2018AAA046), and Applied Basic Research Project of Wuhan (Grant no. 2017060201010162).