PRESEE: An MDL/MML Algorithm to Time-Series Stream Segmenting

Xu, Kaikuo; Jiang, Yexi; Tang, Mingjie; Yuan, Changan; Tang, Changjie

doi:https://doi.org/10.1155/2013/386180

The Scientific World Journal


Research Article | Open Access

Volume 2013 | Article ID 386180 | https://doi.org/10.1155/2013/386180

Kaikuo Xu,¹ Yexi Jiang,² Mingjie Tang,³ Changan Yuan,⁴ and Changjie Tang⁵

Academic Editor: R. Haber, J. Ma, R. Valencia-Garcia, S.-S. Liaw

Received 31 Mar 2013

Accepted 09 May 2013

Published 20 Jun 2013

Abstract

Time-series streams are among the most common data types in data mining and are prevalent in fields such as the stock market, ecology, and medical care. Segmentation is a key step in accelerating time-series stream mining. Previous segmenting algorithms focused mainly on improving precision and paid little attention to efficiency. Moreover, their performance depends heavily on parameters that are hard for users to set. In this paper, we propose PRESEE (parameter-free, real-time, and scalable time-series stream segmenting algorithm), which greatly improves the efficiency of time-series stream segmenting. PRESEE is based on both the MDL (minimum description length) and MML (minimum message length) methods, which allow it to segment the data automatically. To evaluate PRESEE, we conduct several experiments on time-series streams of different types and compare it with a state-of-the-art algorithm. The empirical results show that PRESEE is very efficient on real-time stream datasets, improving segmenting speed nearly tenfold. The novelty of the algorithm is further demonstrated by applying PRESEE to real-time stream data from the ChinaFLUX sensor network.

1. Introduction

Time-series streams are everywhere in daily life. They are widely used in fields such as ecology, medical care, and the environment. These applications make the time-series stream possibly the most frequently encountered data type in data mining problems [1]. Hence, in recent years, a large number of works have focused on time-series stream mining.

To process massive data efficiently, time-series stream segmenting is employed. Its primary purpose is dimensionality reduction: to accelerate later mining tasks, segmenting decomposes the time-series stream into a smaller number of segments. After segmenting, each segment can be described by a simple model such as a linear segment or a monotonic segment [2]. An example of time-series segmenting is shown in Figure 1.

Figure 1: An example of time-series segmenting, with panels (a) and (b).

Several time-series stream fitting models have been proposed, including symbolic mappings [3], multivariate adaptive regression splines [4], hybrid adaptive splines [5], wavelets [6], Fourier transforms [7], and piecewise linear representation [8, 9]. However, none of them can handle different types of time-series streams, and none is parameter free.

For real-time applications, the algorithm must handle a continuous real-time stream, which means that the stream can be scanned only once. Many real applications, such as sensor network data [10], stock market trading data [11], and intensive-care unit (ICU) data [12], take this form, since the data are generated very fast and the processing time is limited. Therefore, for time-series stream segmentation, the issues of scalability, numerical stability, and efficiency cannot be avoided.

In this paper, we propose PRESEE, which segments time-series streams based on the MDL/MML method [13, 14]. MDL/MML is an information-encoding method from information theory. By capturing the characteristics of the information distribution in the data, it can reduce the size of the data while retaining most of the critical information. PRESEE has the following characteristics.

(1) High scalability. It processes a time-series stream in linear time. PRESEE adopts a slide window to process data of gigabytes or even larger scale.
(2) Parameter free. An entry-level user does not need to set any parameters, which avoids misleading the algorithm with improper settings. Domain experts who are confident in choosing proper values can set some optional parameters to accelerate segmenting.
(3) Adaptive. PRESEE segments the time-series data according to the characteristics of the data. Since the segmenting strategy is based on MDL/MML, segmentation is automatic: strongly fluctuating regions receive more character points, while other regions receive fewer.
(4) Pipeline. PRESEE can output earlier data while processing newly arrived data. Thus, the downstream time-series stream mining algorithm and PRESEE can run simultaneously.

The rest of this paper is organized as follows. Related work is described in Section 2. Necessary concepts are introduced in Section 3. The time-series stream segmenting algorithm PRESEE is presented in Section 4. The experimental results are evaluated in Section 5. Finally, Section 6 concludes the paper and discusses future work.

2. Related Work

2.1. Time Series Stream Mining

Time-series stream mining is possibly the most frequent mining task in the recent data mining community. In particular, in the last several years, a large number of papers have addressed this area [15–17]. Time-series stream mining derives from traditional time-series mining [1–4, 11, 18]. With the further requirement of a deep understanding of the time series, it turns into a high-dimensional data mining problem.

2.1.1. Segmenting

Segmenting is one of the major tasks in time-series stream mining and a key step for other time-series mining tasks, since it allows the data to be processed efficiently and effectively. Many algorithms focus on finding a good global segmentation of the time-series data.

These algorithms share three main characteristics. First, they are mainly based on dynamic programming [19, 20], top-down [21], or bottom-up [22] strategies. Second, they require domain expert knowledge to set parameters, either a parameter that measures the error [2, 22] or a parameter that controls the number of segments [19, 21]. Third, they can handle at most millions of data points and can hardly handle stream data (gigabytes at least).

Segmenting with a slide window can handle large-scale data. This method is attractive because it can easily be implemented as an online algorithm. Some existing slide-window-based algorithms work well, but their performance is parameter dependent. Since different time-series data types, such as electrocardiogram (ECG), water level, and stock market data, have quite different characteristics, it is hard to find a general set of parameters for all of them.

2.1.2. MDL/MML

The theory of minimum message length (MML) and minimum description length (MDL) first appeared in the computational complexity community [23, 24] and then in the categorization community [25]. Its applications in the data mining community include climate data segmentation [26], trajectory clustering [27], and social network mining [28]. To the best of our knowledge, among works that segment time-series streams with MDL/MML, ours offers the most features.

3. Preliminary

This section reviews the concepts for time-series data mining. Section 3.1 introduces terminology about the time series. Section 3.2 presents the distance function used in this paper. Section 3.3 is the problem statement.

3.1. Terminology

We begin with the definition of the time-series data type.

Definition 1. Time-series: consider the set of observed values for the given variables in the research domain, and let each element be a value observed at a particular time. A time-series S is an ordered sequence of n such elements. From the stream view, the length of S is infinite.

A slide window is a general and effective way to handle massive data that cannot be processed as a whole. We therefore employ the slide window idea for the segmenting task.

Definition 2. Slide window: let B be a user-defined buffer that holds elements and w be the number of elements that B can hold. The slide window is the buffer that holds a continuous subsequence of S at any time. All the data in the slide window can be processed by the algorithm at one time.
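Definition 2 can be illustrated with a minimal Python sketch. The function name `slide_windows` and the choice to advance the window by one element at a time are assumptions made for illustration; they are not taken from the PRESEE implementation.

```python
from collections import deque
from typing import Iterable, Iterator, List


def slide_windows(stream: Iterable[float], w: int) -> Iterator[List[float]]:
    """Yield the contents of a buffer B of capacity w each time it is full."""
    buffer = deque(maxlen=w)  # the oldest element is dropped automatically
    for value in stream:
        buffer.append(value)
        if len(buffer) == w:
            # B now holds a continuous subsequence of the stream
            yield list(buffer)


windows = list(slide_windows(range(6), w=4))
```

Because the buffer never holds more than w elements, memory use stays constant regardless of the stream length, and each element is read from the stream exactly once.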

3.2. Distance Function for Time Series Segments

To ease segmenting, some data transformation is needed. Almost all kinds of time-series data can be discretized and transformed into lines, each connecting two consecutive points. The goal of segmenting is to generate lines that represent most of the characteristics of the original lines. A distance function is needed to measure the distance between an original time-series line and a candidate segment. To measure the distance between the original time-series stream and its segmenting result well, the distance function should first be simple, so that the stream can be processed very fast. Additionally, the measurement should consider the shape of the stream and of its segmenting result. Finally, the emphasis placed on each factor in the measurement can vary with the application. After studying the characteristics of time-series data, we find that the Hausdorff metric best balances conciseness and preciseness when measuring the distance between time series. The Hausdorff metric has previously been used in pattern recognition and trajectory mining [27, 29], where it was shown to be precise for shape similarity measurement. For time-series segmenting, we define the segmenting distance from the perpendicular and angular relationships between lines, based on the Hausdorff metric.

Segmenting distance is a quantitative criterion for the quality of segmenting: a smaller distance indicates a better segmenting result for the original stream. The final form of the distance between an original line and the segment it belongs to is given in Definition 3.

Definition 3. Segmenting distance. Consider an original line and a candidate segment. Formula (1) gives the distances from the start point and the end point of the original line to the candidate segment; the smaller intersection angle between the two lines is also used. Then the following can be considered.
(1) The perpendicular distance between the two lines is defined in Formula (2). Formula (1) uses the start point and end point of each original time-series line together with the coordinates of the start point and end point of a candidate segment (one possible segment solution in the process of segmenting computation).
(2) The angle distance between the two lines is defined in Formula (3).
(3) The segmenting distance between the two lines is defined in Formula (4) as the weighted sum of the perpendicular distance and the angle distance.

Consider

The two weights should sum to 1, and both can be set to 1/2 if there is no special requirement. Figure 2 and Example 4 show how to compute the distance.

Example 4. As shown in Figure 2, there are three original lines and one segment. Because the start point of one original line and the start point of the segment are the same point, the distance between them is 0. The distances between the segment and the other original lines are obtained from Formulas (1)–(4) in the same way.
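The following Python sketch shows one way the segmenting distance of Definition 3 could be computed. The exact Formulas (1)–(4) are not reproduced in this text, so the forms used here are assumptions borrowed from the trajectory-clustering distance of [27]: the perpendicular distance combines the two endpoint offsets as (d1² + d2²)/(d1 + d2), and the angle distance is the length of the original line times the sine of the smaller intersection angle. All function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Line = Tuple[Point, Point]


def point_to_line(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the infinite line through a and b."""
    length = math.hypot(b[0] - a[0], b[1] - a[1])
    if length == 0.0:  # degenerate segment: fall back to point distance
        return math.hypot(p[0] - a[0], p[1] - a[1])
    cross = (b[0] - a[0]) * (a[1] - p[1]) - (a[0] - p[0]) * (b[1] - a[1])
    return abs(cross) / length


def perpendicular_distance(line: Line, seg: Line) -> float:
    """Combine the start-point and end-point offsets (assumed form, as in [27])."""
    d1 = point_to_line(line[0], seg[0], seg[1])
    d2 = point_to_line(line[1], seg[0], seg[1])
    if d1 + d2 == 0.0:
        return 0.0
    return (d1 * d1 + d2 * d2) / (d1 + d2)


def angle_distance(line: Line, seg: Line) -> float:
    """Length of the original line times the sine of the smaller angle."""
    lx, ly = line[1][0] - line[0][0], line[1][1] - line[0][1]
    sx, sy = seg[1][0] - seg[0][0], seg[1][1] - seg[0][1]
    seg_len = math.hypot(sx, sy)
    if seg_len == 0.0:
        return 0.0
    # |l| * sin(theta) equals |l x s| / |s|
    return abs(lx * sy - ly * sx) / seg_len


def segmenting_distance(line: Line, seg: Line,
                        w_perp: float = 0.5, w_angle: float = 0.5) -> float:
    """Weighted sum of perpendicular and angle distance; weights sum to 1."""
    return (w_perp * perpendicular_distance(line, seg)
            + w_angle * angle_distance(line, seg))


original = ((0.0, 0.0), (1.0, 1.0))
candidate = ((0.0, 0.0), (2.0, 0.0))
distance = segmenting_distance(original, candidate)
```

For the sample lines, the start points coincide, the end-point offset is 1, and the angle distance is also 1, so with both weights at 1/2 the segmenting distance is 1.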

3.3. Problem Statement

Given a time series S of length n (n can be infinite), our algorithm generates a sequence of character points C. Each character point in C has a projection relationship to a point in S; that is, each character point has a counterpart at a corresponding position in S. Between each pair of consecutive character points, there exist several points of S. Every pair of consecutive character points represents a segment, which approximates the lines represented by the pairs of consecutive points of S that lie between them. Thus, these m character points partition the original stream into continuous segments. A point of S that is the start or end point of a segment belongs to two segments; otherwise, it belongs to only one segment. Figure 3 shows the lines representing S compared with the lines representing C.

The segmenting algorithm is implemented under the pipeline framework shown in Figure 4. Besides the segmenting algorithm, we have already implemented a time-series stream motif mining algorithm. This framework is designed specifically for time-series stream mining. It has the following advantages.

(1) The data stream is scanned only once. Once data flow out of the slide window, they never return to it.
(2) Mining tasks can be processed simultaneously. Earlier data that have already been segmented can be processed by the downstream mining task while later data are still being segmented.

4. PRESEE Algorithm

This section first introduces the MDL/MML-based segmenting strategy used in PRESEE and then describes the algorithm in detail.

4.1. Information-Theory-Based Segmenting Strategy

Our algorithm aims at finding the best segments for a time series. Two properties measure the quality of a segmentation: preciseness and conciseness. Preciseness measures the distance between the lines formed by consecutive character points in C and the lines formed by consecutive points of the original stream S that lie between the corresponding two character points. A smaller distance indicates better preciseness. Conciseness measures how few character points are used to depict a given length of the original stream. Fewer character points indicate better conciseness.

Clearly, preciseness reaches its maximum when every point in the original stream is a character point. However, such a segmentation is meaningless, since it merely lets the stream pass through the slide window without compressing it. Conciseness reaches its maximum when the stream has only two character points, the start point and the end point.

The best preciseness and the best conciseness cannot be achieved at the same time because they are contradictory. Therefore, we need to find the optimum tradeoff between preciseness and conciseness, which yields the best segmenting.

To find the optimum tradeoff, we treat the problem from an information perspective by employing the MDL/MML principle from information theory. We use MDL/MML because it is parameter free: it can automatically find a proper estimate of the original information. If no proper segmenting solution exists, the data S are deemed random data, and in that case we simply keep all the information.

The code of MDL/MML is composed of two parts: $L(H)$ and $L(D \mid H)$ [30, 31]. $H$ specifies the hypothesis about the information (normally selected from a limited set of possible hypotheses), while $D \mid H$ specifies the code for the information based on the hypothesis. The shortest code over all combinations of $H$ and $D \mid H$ is the optimum solution for the piece of information. In our scenario, it is the optimum for stream segmenting.

The cost of a code is represented by its length. In Shannon's theory, the optimal code length for an event $E$ with probability $P(E)$ is $-\log_2 P(E)$. In the time-series segmenting scenario, the cost is computed as follows:

In the first two formulas, w represents the size of the data and C represents the character points. The optimum segmenting is the one that minimizes the sum of $L(H)$ and $L(D \mid H)$. The following is a concrete computation example for Figure 3, in which a single line is the optimum segmenting for a run of consecutive points.

Consider
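As a rough illustration of how the two code lengths might be evaluated, the Python sketch below assumes the encoding used for trajectory partitioning in [27]: $L(H)$ is the base-2 logarithm of the candidate segment length, and $L(D \mid H)$ sums the base-2 logarithms of the perpendicular and angle distances between each original line and the candidate segment. The function names `mdl_seg` and `mdl_noseg` mirror the MDL_seg and MDL_noseg costs used in Section 4.2, but their bodies, and the convention that values not exceeding 1 cost zero bits, are assumptions of this sketch rather than the formulas of this paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def _bits(x: float) -> float:
    """Code length log2(x); values not exceeding 1 cost nothing (assumed)."""
    return math.log2(x) if x > 1.0 else 0.0


def _length(a: Point, b: Point) -> float:
    return math.hypot(b[0] - a[0], b[1] - a[1])


def _offset(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    seg_len = _length(a, b)
    if seg_len == 0.0:
        return _length(p, a)
    return abs((b[0] - a[0]) * (a[1] - p[1])
               - (a[0] - p[0]) * (b[1] - a[1])) / seg_len


def mdl_noseg(points: Sequence[Point], i: int, j: int) -> float:
    """Cost of keeping every original line between i and j; L(D|H) is 0."""
    return sum(_bits(_length(points[k], points[k + 1])) for k in range(i, j))


def mdl_seg(points: Sequence[Point], i: int, j: int) -> float:
    """L(H) + L(D|H) when points[i] and points[j] are the character points."""
    a, b = points[i], points[j]
    seg_len = _length(a, b)
    cost = _bits(seg_len)  # L(H): encode the candidate segment
    for k in range(i, j):  # L(D|H): encode each original line's deviation
        p, q = points[k], points[k + 1]
        d1, d2 = _offset(p, a, b), _offset(q, a, b)
        d_perp = (d1 * d1 + d2 * d2) / (d1 + d2) if d1 + d2 > 0.0 else 0.0
        cross = abs((q[0] - p[0]) * (b[1] - a[1]) - (q[1] - p[1]) * (b[0] - a[0]))
        d_angle = cross / seg_len if seg_len > 0.0 else 0.0
        cost += _bits(d_perp) + _bits(d_angle)
    return cost


flat = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)]
peak = [(0.0, 0.0), (4.0, 8.0), (8.0, 0.0)]
```

With these assumed costs, the collinear points in `flat` are cheaper to encode as one segment than as two lines, whereas replacing the two lines in `peak` by a single segment is more expensive than keeping them, which is the situation in which a character point would be placed.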

4.2. Algorithm Details

Finding the global optimum requires computing all possible partitions of the points, which is prohibitive for real applications. We therefore present a greedy algorithm that finds a local optimum.

Algorithm 1 shows the details of the segmenting process. At first, only the data that have flowed into the slide window are processed. In lines 5 and 6, the costs MDL_seg and MDL_noseg are computed and compared. MDL_seg denotes the MDL cost obtained by considering the points at startIndex and curIndex as the character points for the points between them; it is the sum of $L(H)$ and $L(D \mid H)$ for that hypothesis. MDL_noseg denotes the MDL cost obtained by assuming that there are no character points between the two points. In the greedy strategy, the local optimum solution is the longest segment that satisfies inequality (9).

Input: points in slideWindow Seg = 〈p_1, p_2, …, p_len〉, batchSize
Output: character points set C
(1) Put p_1 into C
(2) startIndex = 1, length = 1, curSize = 0;
(3) While startIndex + length < len do
(4)   curIndex = startIndex + length;
(5)   cost_seg = MDL_seg(p_startIndex, p_curIndex); cost_noseg = MDL_noseg(p_startIndex, p_curIndex);
(6)   If cost_seg > cost_noseg then
(7)     Put p_(curIndex − 1) into C;
(8)     ++curSize;
(9)     If curSize == batchSize then
          // enough batch has been processed
(10)      Return C;
(11)    Else startIndex = curIndex − 1; length = 1;
(12)  Else length = length + 1;
(13) If C has only one point do
(14)   Return NULL;
(15) Else Put p_len into C; Return C;

In the algorithm above, the points in the slide window are scanned sequentially only once. The candidate segment (the segment whose start point and end point are the points at startIndex and curIndex, respectively) grows by one point at a time to test whether it satisfies inequality (9).

The algorithm has one parameter, batchSize. Its default value is 1, and the user can set it to a larger integer. The algorithm then returns batchSize + 1 character points per call, so it can run faster.
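The greedy scan of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' C++ code: indices are zero based, the loop bound lets the candidate end reach the last point in the window, a guard skips candidates of length 1 so that the scan always advances, and the two cost functions are passed in as parameters. The stand-in costs `chord_deviation` and `fixed_threshold` are hypothetical and are not MDL costs; they only make the example runnable.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Cost = Callable[[Sequence[Point], int, int], float]


def mdl_slide_window(points: Sequence[Point], batch_size: int,
                     mdl_seg: Cost, mdl_noseg: Cost) -> Optional[List[Point]]:
    """Greedy scan: extend the candidate until segmenting costs more."""
    if not points:
        return None
    chars = [points[0]]
    start, length, found = 0, 1, 0
    while start + length < len(points):
        cur = start + length
        # Guard (assumption): a candidate spanning one line is never split.
        if length > 1 and mdl_seg(points, start, cur) > mdl_noseg(points, start, cur):
            chars.append(points[cur - 1])  # previous point becomes a character point
            found += 1
            if found == batch_size:
                return chars  # enough segments for this batch
            start, length = cur - 1, 1
        else:
            length += 1
    if len(chars) == 1:
        return None  # no character point found in this window
    chars.append(points[-1])
    return chars


def chord_deviation(points: Sequence[Point], i: int, j: int) -> float:
    """Stand-in cost: largest vertical gap between the chord i..j and the data."""
    (x0, y0), (x1, y1) = points[i], points[j]
    worst = 0.0
    for x, y in points[i + 1:j]:
        on_chord = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        worst = max(worst, abs(y - on_chord))
    return worst


def fixed_threshold(points: Sequence[Point], i: int, j: int) -> float:
    """Stand-in cost: a constant tolerance."""
    return 0.5


window = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 3.0), (4.0, 6.0), (5.0, 9.0)]
all_chars = mdl_slide_window(window, 10, chord_deviation, fixed_threshold)
first_batch = mdl_slide_window(window, 1, chord_deviation, fixed_threshold)
```

On the sample window, the flat run ends at (2, 0), so that point is reported as a character point; with a batch size of 1 the function returns as soon as this first segment is found.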

Consider

The PRESEE algorithm calls Algorithm 1 every time the slide window is full. Algorithm 2 describes PRESEE in pseudocode.

Input: windowSize, time-series stream S, batchSize
Output: character points set C (held in apprSet)
(1) slideWindow = ∅, apprSet = ∅
(2) Read data into slideWindow
(3) While ReadIn() do
(4)   tmpSet = MDLSlideWindow(slideWindow, batchSize);
(5)   If tmpSet.size() < 2 do
        // no proper hypothesis is found, data deemed as random noise
(6)     Add all data in slideWindow into apprSet
(7)     Empty slideWindow
(8)   Else
(9)     Add at most first batchSize + 1 points in tmpSet into apprSet
(10)    Take out used points from slideWindow
(11) Output apprSet

In line 2, the slide window is filled for the first time. Then the method ReadIn() is called in the while loop. ReadIn() is responsible for refilling the slide window and checking whether there are new data; it returns false when no new data exist. In line 4, Algorithm 1 is called to provide the locally optimal segmenting result for the data in the slide window. It is possible that no proper segment exists, in which case the size of tmpSet is less than 2. In this scenario, we simply put all the data in the slide window into apprSet and empty the slide window. Otherwise, at most the first batchSize + 1 points in tmpSet are added to apprSet. We say "at most" because the points in the slide window may yield fewer than batchSize segments.

From the pseudocode, we can see that the data are input and output simultaneously (line 4 and line 13), which guarantees that earlier data can be processed by the later mining algorithm. Additionally, the stream is scanned and processed only once. Thus the time complexity of both algorithms is O(n), where n is the length of the time-series stream.
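A minimal Python sketch of the streaming loop in Algorithm 2 is given below, written as a generator so that character points are available to a downstream consumer while the stream is still being read. Several details are assumptions of this sketch: the window-level segmenter is passed in as `segment_window`, the last character point of each batch is kept in the window as the start of the next segment, timestamps are taken to be unique, and data remaining after the stream ends are drained. `endpoints_only` is a hypothetical stand-in for Algorithm 1 that treats each window as one segment.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Point = Tuple[float, float]
Segmenter = Callable[[List[Point], int], Optional[List[Point]]]


def presee_stream(stream: Iterable[Point], window_size: int, batch_size: int,
                  segment_window: Segmenter) -> Iterator[Point]:
    """Yield character points while the stream is still being read."""
    source = iter(stream)
    window: List[Point] = list(islice(source, window_size))
    while window:
        chars = segment_window(window, batch_size)
        last = 0
        if chars is not None and len(chars) >= 2:
            chars = chars[:batch_size + 1]
            last = window.index(chars[-1])  # assumes unique timestamps
        if last == 0:
            # No proper hypothesis: keep every point and empty the window.
            yield from window
            window = []
        else:
            # Emit all but the last character point; it stays in the window
            # as the start of the next segment (an assumption of this sketch).
            yield from chars[:-1]
            window = window[last:]
        # Refill the window with newly arrived data, if any.
        window.extend(islice(source, window_size - len(window)))


def endpoints_only(window: List[Point], batch_size: int) -> Optional[List[Point]]:
    """Hypothetical stand-in for Algorithm 1: one segment per window."""
    return [window[0], window[-1]] if len(window) >= 2 else None


data = [(float(t), float(t)) for t in range(5)]
result = list(presee_stream(data, window_size=3, batch_size=1,
                            segment_window=endpoints_only))
```

Each point enters the window once and leaves it once, which matches the single-scan property described above.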

5. Empirical Comparison of the Segmenting Algorithms

In this section, we demonstrate the effectiveness and efficiency of the proposed algorithm through several sets of experiments on large collections of real and synthetic time-series datasets. For the effectiveness test, the precision of the proposed method is compared with that of a nonstream segmenting algorithm. Then the speed and scalability of the algorithm are tested on datasets of different scales, ranging from 10 M to 10 G.

All the experiments are performed on a laptop computer with a 2 GHz Intel Core 2 Duo CPU and 3 GB of main memory. The C++ implementation of the algorithm and the related source code are available at http://code.google.com/p/otsm/.

5.1. Benchmark Algorithm

The performance of the proposed algorithm is compared with that of a well-known benchmark algorithm. A good candidate is the BU (bottom-up) algorithm [22], which is used as the counterpart in the precision experiment. Since BU cannot handle stream-like datasets and its time complexity is uncertain for large datasets, we use a slide-window-based bottom-up segmenting algorithm (SWBU for short) in the efficiency and scalability experiments instead.

5.2. Dataset

Real Datasets. We consider two sets of real datasets to evaluate our algorithm. The first set includes three classical datasets: the IBM stock price dataset from 1.2.1961 to 1.6.2010 [32], and "Dodgers" and "ICU" from the UCI Machine Learning Repository [33], which are used by most papers in the data mining and machine learning community. The second set consists of real-time stream data collected from the Chinese Terrestrial Ecosystem Flux Research Network (ChinaFLUX) [34], a long-term national network of micrometeorological flux measurement sites that measures the net exchange of carbon dioxide, water vapor, and energy. In this paper, we use only part of the flux data from one wild field survey site located in Yucheng, Shandong, China. The data cover the period from 2009-03-04 to 2009-11-12 and contain 38850980 records.

Synthetic Datasets. The synthetic datasets are generated according to Formula (10) with the parameters in Table 1. The data are in the format (index, value), where index represents the timestamp of the record. The monotonically increasing record serial number is also available. In our experiments, without loss of generality, we simply use the latter.

For synthetic data, the range of values is bounded within [lb, ub]. The current record value fluctuates (increases, decreases, or remains the same) relative to the previous record value.

Consider

The sign is randomly selected as either + or −, and sharpness is a parameter that controls the power of the fluctuation. It is easy to observe that the value suffers more resistance the farther it moves from the mean.
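Formula (10) is not reproduced in this text, so the Python sketch below is only one possible generator with the properties described above: values stay within [lb, ub], each step has a random sign, sharpness scales the step, and steps that move away from the mean shrink as the value approaches a bound. The function name, the linear resistance term, and the `seed` parameter are assumptions introduced for illustration.

```python
import random
from typing import Iterator, Tuple


def synthetic_stream(n: int, lb: float, ub: float, sharpness: float,
                     seed: int = 0) -> Iterator[Tuple[int, float]]:
    """Yield n (index, value) records of a bounded random walk (ub > lb)."""
    rng = random.Random(seed)
    mean = (lb + ub) / 2.0
    half_range = (ub - lb) / 2.0
    value = mean
    for index in range(n):
        yield index, value
        sign = rng.choice((1.0, -1.0))
        step = sharpness * rng.random()
        if (value - mean) * sign > 0.0:
            # Moving away from the mean: resistance grows with the distance.
            step *= 1.0 - abs(value - mean) / half_range
        # Clamp as a safety net so the value never leaves [lb, ub].
        value = min(ub, max(lb, value + sign * step))


records = list(synthetic_stream(1000, lb=0.0, ub=100.0, sharpness=5.0, seed=42))
```

Fixing the seed makes a generated dataset reproducible, which is convenient when the same synthetic stream must be fed to several segmenting algorithms.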

6. Results

Visualization of Segmenting Result. To make the segmenting result easy to observe, Figure 5 visualizes three real datasets (IBM stock, ICU, and Dodgers). The first row shows the original datasets, the second row the segmenting result generated by BU, and the last row the result generated by PRESEE. The charts in the first row are the most complex because they contain the most detailed information about the time series. The charts in the other two rows are more concise because the segmenting algorithms remove some unimportant information from the original time series. Nevertheless, the main trends are preserved, which means that both algorithms keep the characteristics of the original time-series data.

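Figure 5: Visualization of three real datasets (IBM stock, ICU, and Dodgers), panels (a)–(i): original data (first row), BU segmenting result (second row), and PRESEE segmenting result (last row).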

Precision. We compare our algorithm with BU, which can process general types of time-series datasets and can find the global optimum segmenting solution. We evaluate the segmenting result via the error rate measure in Formula (11), which is defined in terms of the number of segments and the number of points in each segment. The error rate is computed as follows:

Error rates for fifteen datasets are reported in Table 2. When approximately the same number of segments is generated, the error rates of BU and PRESEE stay at the same level. That is, PRESEE, which sees only partial data (the data in the slide window), generates segments no worse than those generated from a global perspective.

Problem of Getting Proper Number of Segments. In Table 2, for the real ChinaFLUX dataset, BU fails to find a proper number of segments because this kind of data changes tremendously. This is a flaw of BU. One significant advantage of PRESEE over BU is that users do not need to set any threshold parameter. The parameter of BU is hard to set: a small deviation generates a quite different number of segments. To find the relationship between the number of segments and the error threshold that BU requires, we run each dataset 100 times to find a parameter value that generates the same number of segments as PRESEE. Figure 6 shows this relationship for BU. In the experiment, all three real datasets show a big drop in the number of segments when the error threshold increases. In particular, for the Dodgers dataset, the number of segments drops from 6082 to 1958 when the threshold changes from 1.0 to 1.1. Further experiments show that, even if we increase the threshold by only 0.001, the change is also tremendous. This phenomenon indicates that the error threshold of BU must be set with care, a difficulty that users of PRESEE avoid.

Efficiency and Scalability. We test only the relative speed of the algorithms, since the absolute speed (running time) varies across machines. The speed of the synthetic dataset generator is used as a benchmark; it reflects the maximum processing speed that a given machine can reach.

Since the IBM stock, ICU, and Dodgers datasets are too small for a horizontal comparison of efficiency and scalability, we use the real-time ChinaFLUX datasets and synthetic datasets 1–4 and 9–11 in this experiment. First, we compare the efficiency of the data generator, SWBU, and PRESEE. The error rate threshold of SWBU is set to 1.1 so that SWBU generates roughly the same number of segments as PRESEE. Figure 7 shows the efficiency of the algorithms in a logarithmic plot for synthetic datasets 1–4 and 9–11. The data generator is naturally the fastest, since it just generates data without any extra processing. The curve of PRESEE is very near the curve of the data generator, and both are far from the curve of SWBU (nearly one order of magnitude). Figure 8 shows the efficiency on the ChinaFLUX datasets and indicates that efficiency is not affected by the characteristics of the datasets. Table 3 shows the relative speed of the two segmenting algorithms, that is, the time cost of the segmenting algorithm relative to the time cost of the data generator. PRESEE is only a little slower than the data generator, with a value of 1.1311, while SWBU is nearly one order of magnitude slower. This is because PRESEE is an O(n) algorithm, whereas SWBU uses BU, an O(n log n) algorithm, as the base algorithm in the slide window. The larger the error rate threshold, the slower SWBU runs.

Features Affecting Algorithm Efficiency. In this section, we run experiments to explore the characteristics of our algorithm. PRESEE has two optional parameters: window size and batch size. Window size controls the number of points processed at once; batch size controls how many result segments are output at a time. As mentioned before, PRESEE has no required parameter, so the user can simply use the default values for both. To see how the two parameters affect efficiency, we run synthetic dataset number 3 100 times, with window size ranging from 100 to 1000 and batch size from 1 to 10. Figure 9 shows the relationship among efficiency, window size, and batch size. The time cost first decreases as the batch size increases, but if the batch size is too large, efficiency decreases. The reason is that outputting more segments at a time initially accelerates the algorithm and lets the slide window move faster. As the batch size grows further, efficiency begins to fall, for two reasons. First, the points in the slide window do not contain that many segments, so efficiency cannot increase forever. Second, the algorithm spends extra time identifying the character points of each segment.

Compress Rate. There is no parameter to control the precision of the result, since MDL/MML has a self-adaptive property. We run an experiment to measure the compress rate (compress rate = size of result/size of original dataset) on different kinds of datasets. We choose the ChinaFLUX datasets, the IBM stock price, ICU, and Dodgers datasets, and synthetic datasets 4–8 for the experiment. The four real datasets are quite different in appearance: they differ in being cyclical or noncyclical, sharp or smooth, and in degree of noise, dimensions, and length. Synthetic datasets 4–8 share the same data size and range and are generated in the same way; the only difference is the degree of sharpness.

From Table 4, we can conclude that, for the same kind of data with different sharpness, the compress rate is steady.

7. Conclusion and Future Work

In this paper, we have proposed a new algorithm for time-series stream segmenting that is parameter free, scalable, and self-adaptive. We have conducted several sets of experiments on a variety of time-series data types and compared PRESEE with state-of-the-art algorithms. The empirical results show that PRESEE can generate a proper number of segments for time-series streams. Moreover, it can handle large datasets of up to gigabytes. Finally, the parameters of PRESEE affect only the efficiency and not the segmenting result.

In the future, we plan to design a series of time-series stream mining algorithms under the pipeline framework. Those algorithms should connect smoothly to the segmenting algorithm. Another direction is to design algorithms that can perform time-series mining tasks across multiple streams in real time.

Acknowledgments

This work was supported by NSF under Grant no. 31071700, the Natural Science Key Foundation of Guangxi under Grant no. 2011GXNSFD018025, and the Development Foundation of CUIT under Grant no. KYTZ201108. The authors would like to thank Dr. Yue Wang for discussing an early version of this paper.

References

J. F. Roddick and M. Spiliopoulou, “A survey of temporal knowledge discovery paradigms and methods,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 4, pp. 750–767, 2002.
View at: Publisher Site | Google Scholar
M. Brooks, Y. Yan, and D. Lemire, “Scale-based monotonicity analysis in qualitative modelling with flat segments,” in Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), pp. 400–405, 2005.
View at: Google Scholar
C. S. Perng, H. Wang, S. R. Zhang, and D. S. Parker, “Landmarks: a new model for similarity-based pattern querying in time series databases,” in Proceedings of the 16th IEEE International Conference on Data Engineering (ICDE '00), pp. 33–42, March 2000.
View at: Google Scholar
J. Friedman, “Multivariate adaptive regression splines,” Annals of Statistics, vol. 19, pp. 1–141, 1991.
View at: Google Scholar
Z. Luo and G. Wahba, “Hybrid adaptive splines,” Journal of the American Statistical Association, vol. 92, no. 437, pp. 107–116, 1997.
View at: Google Scholar
D. B. Percival and A. T. Walden, Wavelet Methods for Time Series Analysis, Cambridge University Press, New York, NY, USA, 2000.
R. A. C. Faloutsos and A. Swami, “Efficient similarity search in sequence databases,” in Proceedings of the 4th Conference on Foundations of Data Organization and Algorithms, pp. 69–84, 1993.
View at: Google Scholar
V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan, “Mining of concurrent text and time series,” in Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining, pp. 37–44, 2000.
View at: Google Scholar
X. Ge and P. Smyth, “Segmental semi-Markov models for endpoint detection in plasma etching,” IEEE Transactions on Semiconductor Engineering. In press.
View at: Google Scholar
K. Teymourian, M. Rohde, and A. Paschke, “Knowledge-based processing of complex stock market events,” in Proceedings of the 15th International Conference on Extending Database Technology (EDBT '12), pp. 594–597, 2012.
L. A. Tang, X. Yu, S. Kim, J. Han, C. C. Hung, and W. C. Peng, “Tru-alarm: trustworthiness analysis of sensor networks in cyber-physical systems,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), pp. 1079–1084, Sydney, Australia, December 2010.
ICU USER GUIDE, 2012, http://userguide.icu-project.org/icudata.
D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, New York, NY, USA, 2003.
P. Grünwald, The Minimum Description Length Principle, MIT Press, Boston, Mass, USA, 2007.
X. Lian and L. Chen, “Efficient similarity search over future stream time series,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 1, pp. 40–54, 2008.
M. Kontaki, A. N. Papadopoulos, and Y. Manolopoulos, “Continuous trend-based classification of streaming time series,” in Advances in Databases and Information Systems, vol. 3631 of Lecture Notes in Computer Science, pp. 294–308, 2005.
X. S. Wang, L. Gao, and M. Wang, “Condition evaluation for speculative systems: a streaming time series case,” in Proceedings of the 2nd Workshop on Spatio-Temporal Database Management (STDBM '04), pp. 65–72, 2004.
P. P. Rodrigues, J. Gama, and J. P. Pedroso, “ODAC: hierarchical clustering of time series data streams,” in Proceedings of the 6th SIAM International Conference on Data Mining, pp. 499–503, Bethesda, Md, USA, April 2006.
E. Bingham, A. Gionis, N. Haiminen, H. Hiisilä, H. Mannila, and E. Terzi, “Segmentation and dimensionality reduction,” in Proceedings of the 6th SIAM International Conference on Data Mining, pp. 372–383, April 2006.
E. Terzi and P. Tsaparas, “Efficient algorithms for sequence segmentation,” in Proceedings of the 6th SIAM International Conference on Data Mining, pp. 316–327, April 2006.
D. Lemire, “A better alternative to piecewise linear time series segmentation,” in Proceedings of the 7th SIAM International Conference on Data Mining, pp. 545–550, April 2007.
J. Hunter and N. McIntosh, “Knowledge-based event detection in complex time series data,” in Proceedings of the Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making (AIMDM '99), pp. 271–280, 1999.
A. R. Barron and T. M. Cover, “Minimum complexity density estimation,” IEEE Transactions on Information Theory, vol. 37, no. 4, pp. 1034–1054, 1991.
J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5, pp. 465–471, 1978.
C. S. Wallace and D. M. Boulton, “An information measure for classification,” Computer Journal, vol. 11, no. 2, pp. 185–194, 1968.
Q. Q. Lu, “An MDL approach to the climate segmentation problem,” Annals of Applied Statistics, vol. 4, no. 1, pp. 299–319, 2010.
J. G. Lee, J. Han, and K. Y. Whang, “Trajectory clustering: a partition-and-group framework,” in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '07), pp. 593–604, June 2007.
K. Xu, C. Tang, C. Li, Y. Jiang, and R. Tang, “An MDL approach to efficiently discover communities in bipartite network,” in Database Systems for Advanced Applications, vol. 5981 of Lecture Notes in Computer Science, pp. 595–611, 2010.
C. S. Wallace and D. L. Dowe, “Minimum message length and Kolmogorov complexity,” Computer Journal, vol. 42, no. 4, pp. 281–283, 1999.
P. Grünwald, I. J. Myung, and M. Pitt, Advances in Minimum Description Length: Theory and Applications, MIT Press, Boston, Mass, USA, 2005.
IBM Stock Price, http://finance.yahoo.com.
A. Asuncion and D. J. Newman, UCI Machine Learning Repository, Irvine, University of California, School of Information and Computer Science, 2007, http://archive.ics.uci.edu/ml/datasets.
G. R. Yu, X. F. Wen, X. M. Sun, B. D. Tanner, X. Lee, and J. Y. Chen, “Overview of ChinaFLUX and evaluation of its eddy covariance measurement,” Agricultural and Forest Meteorology, vol. 137, no. 3-4, pp. 125–137, 2006.
J. Cheng, Y. Zhou, B. Wang, X. Wang, and J. Li, “SROS: sensor-based real-time observing system for ecological research,” in Proceedings of the International Conference on Web Information Systems and Mining (WISM '09), pp. 396–400, Shanghai, China, November 2009.

Copyright

Copyright © 2013 Kaikuo Xu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
