Mathematical Problems in Engineering

Volume 2015, Article ID 105128, 13 pages

http://dx.doi.org/10.1155/2015/105128

## Chebyshev Similarity Match between Uncertain Time Series

^{1}School of Information Science and Technology, Donghua University, Shanghai 201620, China

^{2}School of Computer Science and Technology, Donghua University, Shanghai 201620, China

^{3}School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004, China

Received 27 April 2015; Revised 11 June 2015; Accepted 25 June 2015

Academic Editor: Hamed O. Ghaffari

Copyright © 2015 Wei Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In real application scenarios, the inherent imprecision of sensor readings, the intentional perturbation introduced by privacy-preserving transformations, and error-prone mining algorithms introduce considerable uncertainty into time series data. This uncertainty poses serious challenges for measuring the similarity of time series. In this paper, we first propose a model of uncertain time series inspired by Chebyshev's inequality. At each time slot, it estimates the range of possible sample values and the range of the central tendency by means of a sample estimation interval and a central tendency estimation interval, respectively. Compared with traditional models based on repeated measurements or random variables, the Chebyshev model reduces the overall computational cost and requires no prior knowledge. We convert a Chebyshev uncertain time series into a matrix of certain time series, so that noise reduction and dimensionality reduction become available for uncertain time series. Secondly, we propose a new similarity matching method based on the Chebyshev model. It relies on the overlap between the sample estimation intervals and the overlap between the central tendency estimation intervals of two different uncertain time series. Finally, we conduct extensive experiments and analyze the results in comparison with prior work.

#### 1. Introduction

Over the past decade, a large amount of continuous sensor data has been collected in many applications, such as logistics management, traffic flow management, astronomy, and remote sensing. In most cases, these applications organize the sequential sensor readings into time series, that is, sequences of data points ordered along the temporal dimension. The problem of processing and mining time series with incomplete, imprecise, and even error-prone measurements has been a major concern in recent studies [1–6]. Typically, uncertainty arises from the imprecision of equipment and methods during physical data collection. For example, the inaccuracy of a wireless temperature sensor follows a certain error distribution. In addition, the intentional deviation introduced by privacy-preserving transformations also causes considerable uncertainty. For example, the real-time location information of a VIP may be perturbed [7, 8].

The management and processing of uncertain data were studied in the traditional database area during the 1980s [9], and these techniques have been borrowed for the investigation of uncertain time series in recent years. Two methods are widely adopted for modeling uncertain time series. In the first, a probability density function (pdf) over the uncertain values, represented by a random variable, is estimated in accordance with a priori knowledge; here the hypothesis of a Normal distribution is ubiquitous [10–12]. This hypothesis is, however, quite restrictive: uncertain time series data following a Uniform or Exponential distribution are frequently found in other applications, for example, Monte Carlo simulation of power load and reliability evaluation of electronic components [13, 14]. In the second, the unknown data distribution is summarized by repeated measurements (i.e., samples or observations) [15]. An accurate estimate of the data distribution requires a large number of repeated measurements, which incurs high computational cost and more storage space.

In this paper, we propose a new model for uncertain time series that combines the two methods above and uses descriptive statistics (i.e., central tendency) to resolve the uncertainty. On this basis, we present an effective matching method for measuring the similarity between two uncertain time series, which adapts to distinct error distributions. Our model estimates the range of sample values and the range of the central tendency by means of Chebyshev's inequality, extracting the sample estimation interval and the central tendency estimation interval from the repeated measurements at each time slot. Unlike traditional similarity matching methods for uncertain time series, which are based on distance measurement, we use the overlap between sample estimation intervals and the overlap between central tendency estimation intervals to evaluate similarity. If both estimation intervals of two uncertain time series at corresponding time slots have a chance of being equal, the extent of similarity is larger than in the case in which they can never be the same.

The rest of this paper is organized as follows. Section 2 reviews related work. In Section 3, we propose the model of Chebyshev uncertain time series. Section 4 covers the preprocessing of uncertain time series based on the Chebyshev model. Section 5 describes the similarity matching process with the new method. Section 6 presents the experiments. Finally, Section 7 draws conclusions.

To sum up, we list our contributions as follows:

(i) We propose a new model of uncertain time series based on the sample estimation interval and the central tendency estimation interval derived from Chebyshev's inequality, and we convert a Chebyshev uncertain time series into a matrix of certain time series for dimensionality reduction and noise reduction.

(ii) We present an effective method to measure the similarity between two uncertain time series with distinct error distributions, without a priori knowledge.

(iii) We conduct extensive experiments and demonstrate the effectiveness and efficiency of our new method in similarity matching between two uncertain time series.

#### 2. Related Work

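The problem of similarity matching for certain time series has been extensively studied over the past decade; the same problem arises for uncertain time series. Aßfalg et al. first proposed the probabilistic bounded range query (PBRQ) [15]. Formally, let $\mathcal{D}$ be a set of uncertain time series and let $\mathbf{Q}$ be an uncertain time series given as query input; let $\varepsilon$ be a distance bound and let $\tau$ be a probability threshold. The $\mathrm{PBRQ}_{\varepsilon,\tau}(\mathbf{Q}, \mathcal{D})$ is given by

$$\mathrm{PBRQ}_{\varepsilon,\tau}(\mathbf{Q}, \mathcal{D}) = \left\{ \mathbf{X} \in \mathcal{D} \mid \Pr\left(\mathrm{dist}(\mathbf{Q}, \mathbf{X}) \le \varepsilon\right) \ge \tau \right\}. \tag{1}$$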

The method that Dallachiesa et al. refer to as MUNICH [16] represents uncertainty by means of repeated observations at each time slot [15]. An uncertain time series is regarded as a set of certain time series, each of which is constructed by choosing one sample observation for each time slot. The distance between two uncertain time series is defined as the set of distances between all combinations of one certain time series from each of the two sets. The distance measures adopted by MUNICH are based on the $L_p$-norm and DTW distances; if $p = 2$, the $L_p$-norm is the Euclidean distance. The naive computation of the result set is not practical, because the large result space causes exponential computational cost.
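The growth of the result space can be made concrete with a minimal sketch. It is not the implementation of [15] or [16]: the function names are hypothetical, only the Euclidean case is shown, and `fraction_within` assumes that every combination of samples is equally likely.

```python
import math
from itertools import product


def materialize(uncertain_series):
    """Yield every certain series obtained by picking one sample per time slot."""
    return product(*uncertain_series)


def euclidean(a, b):
    """Euclidean (L2) distance between two certain series of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def all_distances(u, v):
    """Naively compute the distance for every pair of materialized series."""
    return [euclidean(a, b) for a in materialize(u) for b in materialize(v)]


def fraction_within(u, v, eps):
    """Share of pairwise distances not exceeding eps (assumes equally likely pairs)."""
    distances = all_distances(u, v)
    return sum(d <= eps for d in distances) / len(distances)


# Two uncertain series of length 3 with 2 samples per slot:
# each materializes into 2**3 certain series, so 8 * 8 distances are computed.
u = [[1.0, 1.2], [2.0, 2.1], [3.0, 2.9]]
v = [[1.1, 0.9], [2.2, 2.0], [3.1, 3.0]]
distances = all_distances(u, v)
```

With $s$ samples per slot and length $n$, each series materializes into $s^n$ certain series, so the number of distances is $s^{2n}$; this is why the enumeration is only feasible for toy inputs.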

PROUD [12] processes similarity queries over uncertain time streams. It employs the Euclidean distance and models the similarity measurement as the sum of the differences of the time series random variables, where each random variable represents the uncertainty of the value at the corresponding time slot. The standard deviation of the uncertainty and a single observation for each time slot are prerequisites for modeling the uncertain time series. Sarangi and Murthy proposed a distance measure called DUST [11]. It is derived from the Euclidean distance under the assumption that all time series values follow some specific distribution. If the errors of the time series values at different time slots follow a Normal distribution, DUST is equivalent to the weighted Euclidean distance. Compared with MUNICH, it does not need multiple observations and is thus more efficient. Inspired by the moving average, Dallachiesa et al. proposed a simple similarity measure that previous studies had not considered: it applies Uncertain Moving Average (UMA) and Uncertain Exponential Moving Average (UEMA) filters to handle the uncertainty in time series data [16]. Although the experimental results show that these filters outperform the more sophisticated techniques described above, a priori knowledge of the error standard deviation is indispensable.

Most of the above techniques assume that the values of a time series are independent of one another. Obviously, this assumption is a simplification: adjacent values in a time series are correlated to a certain extent. The effect of correlations is studied in [16], and the research shows that there is a great benefit if the correlations are taken into account. Likewise, we implicitly embed correlations into the estimation intervals by means of repeated observation values and adopt the degree of overlap to evaluate the similarity of uncertain time series. Our approach reduces the overall computational cost and outperforms the existing methods in accuracy; the new model requires no prior knowledge and makes dimensionality reduction available for uncertain time series.

#### 3. Chebyshev Uncertain Time Series Modeling

As shown in [15], let $\mathbf{X} = \langle X_1, X_2, \ldots, X_n \rangle$ be an uncertain time series of length $n$; each $X_i$ is a random variable represented by a set of measurements (i.e., random sample observations), $X_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,m_i}\}$, where $m_i$ denotes the sample size of $X_i$. The distribution of the points in $X_i$ describes the uncertainty at time slot $i$. The larger the sample size $m_i$ is, the more accurately the data distribution is estimated; however, the computational cost becomes prohibitive. To solve this problem, we present a new model for uncertain time series based on Chebyshev's inequality, stated below.

Lemma 1. *Let $X$ (integrable) be a random variable with finite expected value $\mu$ and finite nonzero variance $\sigma^2$. Then, for any real number $k > 0$,*

$$P\left(|X - \mu| < k\sigma\right) \ge 1 - \frac{1}{k^2}. \tag{2}$$

Formula (2) (Chebyshev's inequality) [17] gives a lower bound on the probability of $|X - \mu| < k\sigma$; provided that $\mu$ and $\sigma$ are known, the distribution of $X$ need not be considered. The real number $k$ determines the lower bound: for an appropriate $k$, the probability that the possible values of the random variable $X$ fall within the boundaries $\mu \pm k\sigma$ meets a desired threshold. The estimation of the possible value range is as follows.

*Theorem 2. Given a random variable with the finite expected value and finite nonzero variance , if the in inequality (2) equals , thenno matter which probability distribution obeys.*

*Proof. *Consider

*The above proof shows that when equals , the probability of within interval exceeds 0.9; nearly all possible measurements fall in the interval. We substitute the random variable with to express the uncertainty.*
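The role of the real number in inequality (2) can be explored with a short sketch. It treats that number as a free parameter `k` (an assumption of this sketch, as are all function names) and compares the distribution-free bound with the exact coverage of two non-Normal distributions mentioned earlier, Uniform(0, 1) and Exponential with rate 1.

```python
import math


def chebyshev_lower_bound(k):
    """Distribution-free lower bound on P(|X - mu| < k * sigma)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1.0 - 1.0 / k ** 2)


def k_for_coverage(p):
    """Smallest k whose Chebyshev bound reaches p, for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return 1.0 / math.sqrt(1.0 - p)


def estimation_interval(mu, sigma, k):
    """Interval (mu - k*sigma, mu + k*sigma) covered with the bound above."""
    return (mu - k * sigma, mu + k * sigma)


def uniform_coverage(k):
    """Exact P(|X - mu| < k*sigma) for X ~ Uniform(0, 1), sigma = 1/sqrt(12)."""
    return min(1.0, 2.0 * k / math.sqrt(12.0))


def exponential_coverage(k):
    """Exact P(|X - mu| < k*sigma) for X ~ Exponential(rate 1), mu = sigma = 1."""
    cdf = lambda x: 1.0 - math.exp(-x) if x > 0 else 0.0
    return cdf(1.0 + k) - cdf(1.0 - k)


# A coverage of at least 0.9 needs k = sqrt(10), roughly 3.162.
k90 = k_for_coverage(0.9)
```

Because the bound holds for every distribution with finite mean and variance, it is usually loose: for both example distributions the exact coverage at `k90` is far closer to 1 than to 0.9.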

Given the probability distribution of the random variable, describing the uncertainty by the possible value range alone is insufficient; a central or typical value is another feature of a probability distribution. It indicates the center or location of the distribution and is called the central tendency [18]. The most common measure of central tendency is the arithmetic mean (mean for short), so the central tendency of a random sample set in the form of the mean is defined below.

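Given a random sample set $\{x_1, x_2, \ldots, x_m\}$ of size $m$ drawn from $X$ with $E(X) = \mu$ and $D(X) = \sigma^2$, where the samples satisfy the i.i.d. hypothesis, then

$$\bar{X} = \frac{1}{m} \sum_{j=1}^{m} x_j. \tag{5}$$

As $\bar{X}$ is itself a random variable, its expectation and variance are evaluated below:

$$E(\bar{X}) = \mu, \qquad D(\bar{X}) = \frac{\sigma^2}{m}. \tag{6}$$

Analogously, for the central tendency variable $\bar{X}$, in accordance with Lemma 1, the corresponding estimation interval can be obtained.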
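The expectation and variance of the sample mean can be checked exactly on a small example. The sketch below is only an illustration under stated assumptions: the population is a fair six-sided die (a hypothetical choice), samples are drawn independently with replacement, and all names are illustrative. It enumerates every possible sample of size 3 with rational arithmetic, so no randomness is involved.

```python
from fractions import Fraction
from itertools import product


def moments(values):
    """Exact mean and population variance of equally likely values."""
    n = len(values)
    mean = sum(values, Fraction(0)) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


def sample_mean_moments(population, m):
    """Exact mean and variance of the sample mean over all i.i.d. draws of size m."""
    means = [sum(draw, Fraction(0)) / m for draw in product(population, repeat=m)]
    return moments(means)


die = [Fraction(face) for face in range(1, 7)]
mu, sigma2 = moments(die)                       # population mean and variance
mean_mu, mean_var = sample_mean_moments(die, 3)  # moments of the mean of 3 draws
```

The sample mean keeps the population expectation while its variance shrinks by the factor given by the sample size, which is why the interval for the central tendency is narrower than the interval for single measurements.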

*Theorem 3. Given a random variable with and , a random sample set drawn from the population of , for the variable with and , if the in inequality (2) equals , then*

*Proof. *Consider

In summary, the sample estimation interval of a random variable is the range of its possible measurements, and the central tendency estimation interval is the range of its central tendency. The uncertainty of the random variable is represented by the combination of the two intervals at each time slot. An uncertain time series can then be defined as follows.

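*Definition 4.* For an uncertain time series $\mathbf{X} = \langle X_1, X_2, \ldots, X_n \rangle$ of length $n$, each element $X_i$ is a random variable with $E(X_i) = \mu_i$ and $D(X_i) = \sigma_i^2$, and $\bar{X}_i$ is the central tendency of the random sample set drawn from the population corresponding to $X_i$. A Chebyshev uncertain time series is defined below:

$$\left\langle \left( \left[\mu_i - k\sigma_i,\ \mu_i + k\sigma_i\right],\ \left[\mu_i - \frac{k\sigma_i}{\sqrt{m_i}},\ \mu_i + \frac{k\sigma_i}{\sqrt{m_i}}\right] \right) \right\rangle_{i=1}^{n},$$

where $m_i$ is the cardinality of the random sample set of $X_i$ and $k$ is the real number of inequality (2) as fixed in Theorems 2 and 3. In the Chebyshev uncertain time series above, $\mu_i$ and $\sigma_i$ are difficult to obtain because the distribution of the population is unidentified. We choose two statistics to estimate $\mu_i$ and $\sigma_i$: one is the arithmetic mean $\bar{x}_i$ of the sample set, as in (5); the other is the sample standard deviation $s_i$, calculated by the following equation:

$$s_i = \sqrt{\frac{1}{m_i - 1} \sum_{j=1}^{m_i} \left(x_{i,j} - \bar{x}_i\right)^2}. \tag{12}$$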

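Equations (12) and (6) show that $\bar{x}_i$ and $s_i^2$ are unbiased estimators of $\mu_i$ and $\sigma_i^2$. Hence $\mu_i$ and $\sigma_i$ in Definition 4 can be replaced with $\bar{x}_i$ and $s_i$, and $X_i$ is rewritten as follows.

*Definition 5.* Given a sample set $\{x_{i,1}, x_{i,2}, \ldots, x_{i,m_i}\}$ at time slot $i$, $X_i$ is represented as follows:

$$X_i = \left( \left[\bar{x}_i - k s_i,\ \bar{x}_i + k s_i\right],\ \left[\bar{x}_i - \frac{k s_i}{\sqrt{m_i}},\ \bar{x}_i + \frac{k s_i}{\sqrt{m_i}}\right] \right),$$

where $\bar{x}_i$, $s_i$, and $m_i$ are the sample mean, sample standard deviation, and sample size at time slot $i$, and $k$ is the real number of inequality (2).

According to the descriptions above, the expression at each time slot can be transformed into a vector. It consists of four elements (excluding the time value), namely, $\bar{x}_i - k s_i$, $\bar{x}_i - k s_i/\sqrt{m_i}$, $\bar{x}_i + k s_i/\sqrt{m_i}$, and $\bar{x}_i + k s_i$, in ascending order, denoted as $v_i$; consider

$$v_i = \left( \bar{x}_i - k s_i,\ \ \bar{x}_i - \frac{k s_i}{\sqrt{m_i}},\ \ \bar{x}_i + \frac{k s_i}{\sqrt{m_i}},\ \ \bar{x}_i + k s_i \right)^{\mathsf{T}}.$$

*Definition 6.* An uncertain time series of length $n$ can be rewritten in terms of a matrix with the following formula:

$$M = \left( v_1, v_2, \ldots, v_n \right).$$

Additionally, it can be expanded as follows:

$$M = \begin{pmatrix} L_X \\ L_{\bar{X}} \\ U_{\bar{X}} \\ U_X \end{pmatrix} = \begin{pmatrix} \bar{x}_1 - k s_1 & \cdots & \bar{x}_n - k s_n \\ \bar{x}_1 - k s_1/\sqrt{m_1} & \cdots & \bar{x}_n - k s_n/\sqrt{m_n} \\ \bar{x}_1 + k s_1/\sqrt{m_1} & \cdots & \bar{x}_n + k s_n/\sqrt{m_n} \\ \bar{x}_1 + k s_1 & \cdots & \bar{x}_n + k s_n \end{pmatrix},$$

where $L_X$ is the lower bound sequence of the random variable $X$, composed of the values $\bar{x}_i - k s_i$, $L_{\bar{X}}$ is referred to as the lower bound sequence of the variable $\bar{X}$, $U_{\bar{X}}$ is named its upper bound sequence, and the upper bound sequence of $X$ is denoted as $U_X$, as illustrated in Figure 1. Four certain time series thus constitute an uncertain time series based on the Chebyshev model.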
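The conversion described in Definitions 5 and 6 can be sketched in a few lines. This is an illustration under assumptions rather than the authors' implementation: the function names are hypothetical, the real number of inequality (2) is passed in as the parameter `k`, and the value `math.sqrt(10)` in the usage lines is only an example choice that corresponds to a Chebyshev bound of 0.9.

```python
import math
import statistics


def chebyshev_vector(samples, k):
    """Four interval bounds of one time slot, in ascending order."""
    m = len(samples)
    if m < 2:
        raise ValueError("at least two samples are needed for the sample std")
    mean = statistics.fmean(samples)
    s = statistics.stdev(samples)          # sample std, denominator m - 1
    half_sample = k * s                    # half-width of the sample interval
    half_center = k * s / math.sqrt(m)     # half-width of the mean interval
    return (mean - half_sample, mean - half_center,
            mean + half_center, mean + half_sample)


def chebyshev_matrix(series, k):
    """Turn an uncertain series (one sample list per slot) into 4 certain series.

    Rows: lower bounds of the samples, lower bounds of the mean,
    upper bounds of the mean, upper bounds of the samples.
    """
    columns = [chebyshev_vector(slot, k) for slot in series]
    return [list(row) for row in zip(*columns)]


# Hypothetical sensor readings: three time slots with repeated measurements.
series = [[20.1, 19.8, 20.4], [21.0, 21.3, 20.9, 21.1], [19.5, 19.9, 19.7]]
lower_x, lower_mean, upper_mean, upper_x = chebyshev_matrix(series, k=math.sqrt(10))
```

Each of the four rows is an ordinary certain time series of the same length as the input, so standard noise reduction or dimensionality reduction techniques can be applied to the rows individually.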