Abstract

This paper presents an efficient algorithm, called dynamic fuzzy cluster (DFC), for dynamically clustering time series by introducing the definition of key point and improving the FCM algorithm. The proposed algorithm works by identifying those time series whose class labels are vague and further partitioning them into different clusters over time. The main advantage of this approach over existing algorithms is that it can partially reveal how some time series belong to different clusters over time. Simulation-based experiments on geographical data show that the algorithm performs well and produces the desired results. The proposed algorithm can also be applied to other clustering problems in data mining.

1. Introduction

Time series clustering problems arise when we observe a sample of time series data and want to group them into different categories or clusters. This is an important area of research for different disciplines because time series is a very popular type of data which exists in many domains, such as environmental monitoring, market research, and quality control. It is well known that the goal of time series clustering is to discover the natural grouping(s) of a set of patterns. An operational definition of time series clustering can be stated as follows. Given a representation of series, find groups based on a measure of similarity such that the similarities between objects in the same group are high while the similarities between objects in different groups are low.

Traditionally, clustering methods are divided into two categories: crisp clustering and fuzzy clustering. With crisp clustering methods, such as k-means and spectral methods, each sample belongs to exactly one cluster. Instead of assigning each sample a single cluster label, fuzzy partition methods allow each sample to have different membership degrees across clusters.

However, time series often display dynamic behavior in their evolution over time. From Figure 1, one can see that during a certain period, a time series might belong to a certain cluster; afterwards its dynamics might be closer to that of another cluster. This switch from one state to another is a typical dynamic behavior of time series. Thus, this dynamic behavior should be taken into account when attempting to cluster time series. In this case, the traditional clustering approaches are unlikely to locate and effectively represent the underlying structure of the given time series. D’Urso and Maharaj [1, 2] pointed out the existence of switching time series and studied them by autocorrelation-based and wavelets-based methods, respectively. That is to say, the cluster labels of switching series vary over time. Therefore, it is worthwhile to further investigate how the cluster of a switching series changes over time. Motivated by their work, our proposal investigates the problem of evolutionary clustering and proposes a dynamic fuzzy cluster algorithm based on an improved FCM algorithm and key points. Some properties of switching time series are further detected over time.

The rest of the paper is organized as follows. Related works are reviewed in the next section. In Section 3, we introduce the definition of key point, improve the FCM algorithm, and conclude by proposing a dynamic fuzzy cluster algorithm. In Section 4, we provide experimental results to validate the proposal, and finally in Section 5 we give discussion and conclusions.

2. Related Works

There are some previous references in the literature that have considered the problem of clustering time series. In 2005, Liao [3] presented a survey of clustering of time series data. He introduced and summarized previous works that investigated the clustering of time series data in various application domains, including general-purpose clustering algorithms, the criteria for evaluating the performance of the clustering results, and the measures to determine the similarity/dissimilarity between two time series being compared. Chakrabarti et al. [4] first presented a generic framework for evolutionary clustering and discussed evolutionary versions of two widely used clustering algorithms within this framework: k-means and agglomerative hierarchical clustering. To achieve evolutionary clustering, a measure of temporal smoothness was integrated into the overall measure of clustering quality, and two frameworks that incorporate temporal smoothness in evolutionary spectral clustering were also proposed [5]. Corduas and Piccolo investigated time series clustering and classification by the autoregressive metric [6]. Xiong and Yeung studied the clustering of data patterns that are represented as sequences or time series, possibly of different lengths, by using mixtures of autoregressive moving average (ARMA) models [7]. Without assuming any parametric model for the true autoregressive structure of the series, a general class of nonparametric autoregressive models was studied [8]. Combining the tools of symbolic time series analysis with the nearest neighbor single linkage clustering algorithm, Brida et al. [9] introduced a new method to describe dynamic patterns of the real exchange rate comovements time series and to analyze their influence in currency crises. Using the principle of complex networks, Zhang et al. [10] proposed a novel algorithm for shape-based time series clustering. It can reduce the size of the data and improve the efficiency of the algorithm. An efficient pattern reduction algorithm for reducing the computation time of k-means and k-means-based clustering algorithms was proposed and applied to cluster time series in [11]. Keogh and his group [12–15] have done a lot of work on time series classification and clustering and provide many useful datasets and benchmarks for testing time series classification and clustering. Furthermore, they also argued that clustering of time series subsequences, extracted via a sliding window, is meaningless. For the latest developments in time series clustering, see Fu’s work [16].

Recently, Jain [17] undertook a review of the 50-year existence of the k-means algorithm and pointed out some of the emerging and useful research directions, including semisupervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large-scale data clustering. Fuzzy c-means (FCM), proposed by Dunn [18] and later improved by many authors, is an extension of k-means in which each data point can be a member of multiple clusters with a membership value. By modifying the FCM algorithm, Höppner and Klawonn [19] proposed a cross-correlation clustering (CCC) algorithm to solve the problem of clustering unaligned time series. It can be applied not only to short time series (whole series clustering) but also to time series subsequence (STS) clustering. More works related to fuzzy clustering can be found in [20, 21]. Data reduction by replacing group examples with their centroids before clustering them was used to speed up k-means and fuzzy c-means [22]. To incorporate fuzziness in the clustering procedure, the so-called membership degree of each time series to different groups is considered as a means of evaluating the fuzziness in the assignment procedure. Möller-Levet et al. [23] introduced a new algorithm in the fuzzy c-means family, which is designed to cluster time series and is particularly suited for short time series and those with unevenly spaced sampling points. Considering the dynamic behavior of time series, D’Urso and Maharaj proposed a fuzzy clustering approach based on the autocorrelation functions of time series, in which each time series is not assigned exclusively to one cluster but is allowed to belong to different clusters with various membership degrees [1]. In the evaluation of the time series of labels, the fuzzy c-means clustering method is performed on the merged dataset and the time series of labels of each dataset are derived [24]. In order to deal with more complicated data, Kannan et al. [25] proposed an alternative generalization of FCM clustering techniques called quadratic entropy based fuzzy c-means.

Various applications of time series have also been investigated in geography. Since voluminous time series have been, and continue to be, collected with modern data acquisition techniques, there is an urgent need for effective and efficient methods to extract unknown and unexpected information from spatial datasets of unprecedentedly large size, high dimensionality, and complexity. To address these challenges, spatial data mining and geographic knowledge discovery have emerged as an active research field [26]. Clustering spatial data is a basic and important problem in geography. Some published examples of cluster analysis in time series have been based on environmental data, where we have time series from different locations and wish to group locations which show similar behavior. See, for instance, Macchiato et al. [27] for a spatial clustering of daily ambient temperature, or Cowpertwait and Cox [28] for an application to a rainfall problem. Other examples can be found in medicine, economics, engineering, and so forth. A method for clustering multidimensional nonstationary meteorological time series was presented by Horenko [29]. The approach was based on optimization of the regularized averaged clustering functional describing the quality of data representation in terms of several regression models and a metastable hidden process switching between them. Wang and Chen [30] presented a new method to predict the temperature and the Taiwan Futures Exchange, based on automatic clustering techniques and two-factors high-order fuzzy time series. Change-point analysis was used to detect changes in variability within GOMOS hindcast time series for significant wave heights of storm peak events across the Gulf of Mexico for the period 1900–2005. The change-point procedure can be readily applied to other environmental time series [21]. A statistical approach based on the k-means clustering technique has also been presented to manage environmental sampled data in order to evaluate and forecast the energy deliverable by different renewable sources in a given site. Clustering of industrialized countries according to historical data of CO2 emissions was investigated by Alonso et al. [31].

3. Dynamic Fuzzy Cluster Algorithm

A detailed description of the proposed algorithm is presented in this section.

3.1. Key Point

Mathematically, time series is defined as a set of observations , each one being recorded at a specified time . Generally, we can denote a -dimensional time series as follows: If all the time intervals are equal, that is, ), then can be simply written as .

Definition 1. A point is a change point of time series , if it satisfies the following conditions: or where , and is a parameter.

Definition 2. A change point is called key point of time series if the following condition is satisfied: where is a parameter, and is a neighbor change point of .

For convenience, we set and to be key points of time series and denote the set of key point of time series by .

A key point is a typical point that describes, implicitly, how a series changes at a certain time. These points usually represent special moments, such as the start or the end of an upward or downward tendency, or the peak or the bottom of the series. The key points clearly reveal the dynamic aspect of a given time series, so much more attention should be paid to them. Another advantage of key points is that they can effectively reduce the impact of singular points on the clustering result.
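To make the idea concrete, the following is a minimal Python sketch of key-point extraction. It does not reproduce the exact conditions of Definitions 1 and 2. It assumes that a change point is an interior local extremum that differs from at least one neighbor by at least a threshold `eps`, and that a key point is a change point whose value differs from the previously kept key point by at least `delta`. The names `change_points`, `key_points`, `eps`, and `delta` are hypothetical. The first and last observations are always kept as key points, as stated above.

```python
import numpy as np

def change_points(x, eps):
    """Indices of interior local extrema that deviate from a neighbor by >= eps.

    The extremum-plus-threshold rule is an assumption made for this sketch.
    """
    x = np.asarray(x, dtype=float)
    idx = []
    for i in range(1, len(x) - 1):
        left, right = x[i] - x[i - 1], x[i] - x[i + 1]
        is_peak = left > 0 and right > 0
        is_valley = left < 0 and right < 0
        if (is_peak or is_valley) and max(abs(left), abs(right)) >= eps:
            idx.append(i)
    return idx

def key_points(x, eps, delta):
    """Keep change points that differ from the last kept key point by >= delta.

    The first and last indices are always included.
    """
    x = np.asarray(x, dtype=float)
    kept = [0]
    for i in change_points(x, eps):
        if abs(x[i] - x[kept[-1]]) >= delta:
            kept.append(i)
    if kept[-1] != len(x) - 1:
        kept.append(len(x) - 1)
    return kept

series = [0.0, 0.1, 0.0, 5.0, 0.0, 0.05, 0.0]
print(change_points(series, eps=1.0))
print(key_points(series, eps=1.0, delta=1.0))
```

In this toy series the small bumps at indices 1 and 5 fall below `eps` and are ignored, which illustrates how a threshold keeps minor fluctuations out of the key-point set.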

3.2. Improved FCM Algorithm

In this subsection, we suggest a fuzzy clustering model for classifying time series. Fuzzy clustering is an overlapping clustering method which allows cases to belong to more than one cluster simultaneously, as opposed to traditional clustering, which results in mutually exclusive clusters. In the proposed fuzzy clustering model, we take into account both the fuzziness and the key points. The model incorporates information on the key points of each time series, which carry much more information than the ordinary points in a series, together with fuzziness, which represents the uncertainty associated with the assignment of time series to different clusters. To incorporate fuzziness into the clustering process, the so-called membership degree of each time series to each cluster must be considered.

The term of weighted-matrix should be introduced firstly before we start to describe the modified FCM model. The weighted-matrix is defined as where indicates the weight of th time series at th observation and refers to following criteria:

The importance of each key point is described initially by the nonnegative difference between the value of key point and the average of two adjacent points (ordinary points) of key point. Logically, the larger difference represents the more important status of key point. We map into 1 to 10 levels to understand and compute the importance of key point by function which is monotonically increasing function with range from −1 to 1. Doing so, we can easily measure the importance of each key point in terms of the value in the interval . Consider the fact that, in most cases, start points () are less important than the end points () which represent the last status of a time series, and they are assigned the average value 5 and the maximal value 10, respectively.
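The weighting rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' exact formula: `tanh` stands in for the unspecified monotonically increasing function, ordinary points receive a hypothetical weight of 1, and the function name `weight_row` is invented for this example. The start and end points receive the values 5 and 10 as described in the text.

```python
import numpy as np

def weight_row(x, key_idx, ordinary_weight=1.0):
    """Weights for one series: levels 1..10 at key points, fixed values at the ends.

    Assumptions: tanh is the squashing function, ordinary points get weight 1.
    """
    x = np.asarray(x, dtype=float)
    w = np.full(len(x), ordinary_weight)
    for i in key_idx:
        if 0 < i < len(x) - 1:
            # Nonnegative difference between the key point and the mean of its neighbors.
            d = abs(x[i] - 0.5 * (x[i - 1] + x[i + 1]))
            # Map the squashed difference onto integer levels 1..10.
            w[i] = min(10, max(1, int(np.ceil(10 * np.tanh(d)))))
    w[0] = 5.0    # start point: average level
    w[-1] = 10.0  # end point: maximal level
    return w

print(weight_row([1.0, 1.0, 1.2, 1.0, 1.0], key_idx=[0, 2, 4]))
```

Because `tanh` saturates quickly, series on a large numeric scale would map almost every key point to level 10, so in practice the differences would probably need to be rescaled first.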

The modified fuzzy clustering model based on the key point can be formalized as follows: where is the membership degree of the th time series to the th cluster, is the squared Euclidean distance measure between the th time series and the centroid time series of the th cluster based on weighted , and is a parameter that controls the fuzziness of the partition. Usually , hence we take in our later experiments.
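One form consistent with this description is the standard FCM objective with the weighted-matrix entering the squared distance. It is given here as an assumption, not as the authors' exact model. The symbols $n$ (number of series), $C$ (number of clusters), $T$ (series length), $x_{it}$, $v_{ct}$, $w_{it}$, $u_{ic}$, and $m$ are notation introduced for this sketch:

```latex
\min_{U,V}\; J_m(U,V) = \sum_{i=1}^{n}\sum_{c=1}^{C} u_{ic}^{m}\, d_{ic}^{2},
\qquad
d_{ic}^{2} = \sum_{t=1}^{T} w_{it}\,\bigl(x_{it} - v_{ct}\bigr)^{2},
\qquad
\text{s.t.}\;\; \sum_{c=1}^{C} u_{ic} = 1,\;\; u_{ic} \ge 0 .
```

With all weights equal to 1, this reduces to the classical FCM objective with squared Euclidean distance.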

Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the update of membership degree and the cluster centers . This procedure converges to a local minimum or a saddle point of .

The updating steps are defined as
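As an illustration of such an iteration, the sketch below implements the standard FCM alternating updates with per-observation weights inside the squared distance. Placing the weights multiplicatively in the distance is an assumption, as are the function name `weighted_fcm`, the default `m=2.0`, and the random initialization. Under that assumption the updates are u_ic = d_ic^(-2/(m-1)) / sum_j d_ij^(-2/(m-1)) and v_ct = sum_i u_ic^m w_it x_it / sum_i u_ic^m w_it.

```python
import numpy as np

def weighted_fcm(X, W, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fuzzy c-means with a per-observation weight matrix W (same shape as X).

    Returns the membership matrix U (n x C) and the centroid series V (C x T).
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # rows of U sum to 1
    for _ in range(n_iter):
        Um = U ** m
        # Centroid update: weighted mean of the series at each time index.
        V = (Um.T @ (W * X)) / np.maximum(Um.T @ W, 1e-12)
        # Weighted squared distances between every series and every centroid.
        diff = X[:, None, :] - V[None, :, :]
        D2 = np.maximum((W[:, None, :] * diff ** 2).sum(axis=2), 1e-12)
        # Membership update.
        inv = D2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

X = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.1],
              [10.0, 10.0, 10.0], [10.0, 9.9, 10.1]])
U, V = weighted_fcm(X, np.ones_like(X), n_clusters=2)
print(U.round(3))
```

The small floor `1e-12` only guards against division by zero when a series coincides with a centroid; it is a numerical convenience rather than part of the model.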

3.3. The Switching Time Series

Because switching time series may belong to different clusters over time, their dynamic property cannot be sufficiently revealed if we simply state the degree to which they belong to each cluster. Instead, we should further determine their cluster within each time period.

If there exists an , then we can say that is stable and belongs to th cluster. If , it is known that should not belong to th cluster. When (), it is necessary to further judge the relation between and th cluster. That is to say, is to be a switch series.

It is easy to find these switch series by modified FCM algorithm. Suppose that is a switch series with membership degrees and the set of maintaining key point . We compute the distance and the centroid time series of the th () cluster in interval by . If , the series is assigned cluster label .
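The segment-wise assignment can be sketched as follows. The threshold values `low` and `high`, the function name `segment_labels`, and the use of an unweighted squared distance within each segment are assumptions made for illustration. A series is treated as stable when its largest membership reaches `high`. Otherwise each interval between consecutive key points is assigned to the nearest candidate cluster, where candidates are the clusters with membership above `low`.

```python
import numpy as np

def segment_labels(x, V, u, key_idx, low=0.2, high=0.8):
    """Label each key-point interval of a switching series with its nearest cluster.

    x: series (T,), V: centroid series (C, T), u: memberships (C,).
    Returns None for a stable series, else a list of (start, end, cluster).
    The thresholds low and high are hypothetical values.
    """
    x = np.asarray(x, dtype=float)
    V = np.asarray(V, dtype=float)
    u = np.asarray(u, dtype=float)
    if u.max() >= high:
        return None                               # stable: no segment-wise analysis
    candidates = np.flatnonzero(u > low)
    if candidates.size == 0:                      # fall back to all clusters
        candidates = np.arange(len(u))
    out = []
    for a, b in zip(key_idx[:-1], key_idx[1:]):
        seg = slice(a, b + 1)                     # interval between two key points
        d = ((x[seg] - V[candidates, seg]) ** 2).sum(axis=1)
        out.append((a, b, int(candidates[d.argmin()])))
    return out

V = np.array([[0.0] * 6, [10.0] * 6])
x = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
print(segment_labels(x, V, u=[0.5, 0.5], key_idx=[0, 2, 5]))
```

Adjacent intervals share their boundary key point in this sketch; whether the boundary belongs to the earlier or the later interval is a design choice left open here.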

3.4. The Description of Dynamic Fuzzy Cluster Algorithm

Given a set of , the set of the key points for each is first determined by using Definitions 1 and 2. Considering the importance of key point, we improve the FCM algorithm by proposing a novel measurement of dissimilarity. The traditional distances in the FCM model would be replaced by the suggested dissimilarity measurement, leading to an enhancement of the FCM model.

The dynamic fuzzy cluster algorithm is presented as follows.
Input: the set of time series.
Output: the clustering result and the switching time series.
Step 1: For each time series, compute the set of change points by Definition 1 and the set of the maintaining key points by Definition 2.
Step 2: Calculate the weighted-matrix as defined in Section 3.2.
Step 3: Cluster the time series by the modified FCM algorithm.
Step 4: If there exist switching time series, further determine their cluster in each time period according to the principle stated in Section 3.3; otherwise, output the clustering results.

4. Experiment

To test the validity of our proposal, we conducted experiments using real datasets and simulation studies. We chose the daily temperature data of three states in America for the following reasons: (1) There are distinct differences among their average temperatures because these three states are geographically far from each other. (2) The temperature data recorded by these stations is relatively complete. (3) The clustering result is easy to interpret. The temperature series recorded by stations located in the same state should be grouped in the same cluster, and the clustering results obtained here show exactly this. The second dataset, called Beef, was created by Keogh and is commonly used to test classification or clustering results for time series. To some extent, our results may explain why some series are misclassified or misclustered when testing on this dataset.

Example A (daily temperature data of the United States). The daily temperature data (http://www.ncdc.noaa.gov/IPS/coop/coop.html) from various stations of New Mexico (NM), Montana (MO), Hawaii (HA), Wyoming (WY), and South Dakota (SD) in the United States from 2010/09 to 2010/11 has been collected for our experiments. The data comprise temperature recordings from 8 stations in New Mexico, 8 stations in Hawaii, 8 stations in Montana, and 4 stations in South Dakota and Wyoming, which lie between Montana and New Mexico in latitude. We clustered these 28 time series using the proposed algorithm. In our experiment we considered the thresholds, , , and . As a result, the stable time series and switching time series are obtained. The stable time series form three clusters, which correspond exactly to the real case. The temperature data recorded by the stations located in the same state are grouped successfully in the same cluster; see Table 1 (the entries in the column are the names of the stations where the daily temperature data is recorded). In our result, four switching time series are obtained. From Figure 2, it is easy to see the dynamic behavior of the switching time series over time. The Colony station is located in the south of Wyoming, and this state lies between Montana and New Mexico. The temperature recorded at Colony station is similar to the average temperature in New Mexico, since both are inland areas and share some geographical characteristics. The meteorological information shows that the temperature in the Colony region suddenly decreased in the middle of September and November because of the arrival of cold air from the north. This result is reflected accurately in our figure. This phenomenon can also be observed in the other three stations located in South Dakota. The latter indicates the arrival of winter in the north of America. Unlike the other three stations, the Bison station, like the Ft Meade station, is located in the south of South Dakota. Both belong to the low basin topography. Meanwhile, this region is usually cloudy and the ground antiradiation is intense. These reasons suggest that the climate in this region is changeable. Thus its temperature series bounces among three clusters.

Example B (test on a dataset). The Beef dataset created by Keogh et al. [15] is commonly used to test classification or clustering results for time series. There are 30 time series in this dataset. They are classified into 5 groups, and the length of each series is 470. Here, we choose its testing set to show the validity of the proposed algorithm. There are 7 switching time series in this dataset, namely, the 4th, 6th, 7th, 10th, 14th, 18th, and 28th. Their membership matrix is as follows: The clustering accuracy of our proposal is 74.5% if we assign each series to the cluster with the largest membership. In Figure 3, we can see the cluster changes of the 6th, 7th, and 18th series. The 6th series primarily belongs to the 5th cluster. In the segment from about 320 to 380, this series is clearly close to another cluster. The 18th series switches between two clusters, and this result corresponds to its membership degrees. These results provide an analysis that helps in predicting some properties of time series. Compared with existing classification or clustering methods (http://www.cs.ucr.edu/~eamonn/time_series_data/), our result is acceptable.

5. Discussion and Conclusion

At present, most clustering methods for time series directly adopt methods designed for grouping static datasets. They usually consider the time series as points in a high-dimensional space. By doing so, the dynamic behavior of time series over time is neglected. However, the dynamic behavior or evolutionary nature of time series is a very important property when clustering them. For instance, a switching series belongs to different clusters in different time segments. For some sets of time series, it is possible that the set can initially be grouped into one number of clusters and then into a different number of clusters after a certain time point. These problems suggest the need to develop new methods for clustering time series which do not directly employ the clustering methods for static data. In this sense, this paper can be considered as an attempt in this direction.

Whether we employ an existing crisp or fuzzy clustering method, it is impossible to find the switching property of switching time series over time. Furthermore, this evolutionary property is a basic nature of time series and should be reflected when studying them. Thus, we propose a dynamic fuzzy cluster algorithm to reveal the evolutionary property of time series by finding key points and improving the FCM algorithm. Different from the existing fuzzy clustering methods, the proposed algorithm not only allows each time series to belong to different clusters with various membership degrees but also reveals how the cluster of a switching series changes over time. This is helpful in predicting and analyzing the evolutionary properties of time series.