Abstract

Multivariate time series (MTS) datasets are very common in financial, multimedia, and hydrological fields. In this paper, a dimension-combination method is proposed to search for similar sequences in MTS data. Firstly, the similarity of each single-dimension series is calculated; then the overall similarity of the MTS is obtained by synthesizing the single-dimension similarities with a weighted BORDA voting method. The dimension-combination method can leverage existing univariate similarity searching methods. Several experiments, using classification accuracy as the measure, were performed on six datasets from the UCI KDD Archive to validate the method. The results show the advantages of the approach over traditional similarity measures, such as Euclidean distance (ED), dynamic time warping (DTW), point distribution (PD), the PCA similarity factor, and the extended Frobenius norm (Eros), on several MTS datasets. Our experiments also demonstrate that no measure fits all datasets, and the proposed measure is a competitive choice for similarity searching.

1. Introduction

With the growing demand of industries for information and the rapid development of information technology, more and more datasets are obtained and stored in the form of multivariate time series, in fields such as hydrology, finance, medicine, and multimedia. In hydrology, water level, flow, evaporation, and precipitation are monitored for hydrological forecasting. In finance, stock price information, which generally includes the opening price, average price, trading volume, and closing price, is used to forecast stock market trends. In medicine, the electroencephalogram (EEG) from 64 electrodes placed on the scalp is measured to examine the correlation of genetic predisposition to alcoholism [1]. In multimedia, for sign language recognition, Australian sign language (AUSLAN) data are gathered from 22 sensors on the hands (gloves) of a native Australian signer using high-quality position trackers and instrumented gloves [2].

A time series is a series of observations $x_i(t)$, $i = 1, \dots, m$, $t = 1, \dots, n$, made sequentially through time, where $i$ indexes the measurements made at each time point $t$ [3]. It is called a univariate time series when $m$ is equal to 1 and a multivariate time series (MTS) when $m$ is equal to or greater than 2.

Univariate time series similarity searches have been broadly explored, and the research mainly focuses on representation, indexing, and similarity measures [4]. A univariate time series is often regarded as a point in multidimensional space, so one of the goals of time series representation is to reduce the dimensionality (i.e., the number of data points) because of the curse of dimensionality. Many approaches extract a pattern, which contains the main characteristics of the original time series, to represent the original series. Piecewise linear representation (PLA) [5, 6], piecewise aggregate approximation (PAA) [7], adaptive piecewise constant approximation (APCA) [8], and so forth use adjacent segments to represent a time series of length $n$. Furthermore, perceptually important points (PIP) [9, 10], the critical point model (CMP) [11], and so on reduce the dimensionality by preserving the salient points. Another common family of representation approaches transforms time series into discrete symbols and performs string operations on them, for example, symbolic aggregate approximation (SAX) [12], the shape description alphabet (SDA) [13], and other symbol generation methods based on clustering [14, 15]. Representing time series in a transformed domain is another large family, such as the discrete Fourier transform (DFT) [4] and the discrete wavelet transform (DWT) [16], which transform the original time series into the frequency domain. After transformation, only the first few or the best few coefficients are chosen to represent the original time series [3]. Many of the representation schemes are incorporated with different multidimensional spatial indexing techniques (e.g., the k-d tree [17] and the R-tree and its variants [18, 19]) to index sequences and improve the query efficiency of similarity searching. Given two time series $X$ and $Y$ and their representations $\hat{X}$ and $\hat{Y}$, a similarity measure function $Sim(\hat{X}, \hat{Y})$ calculates the distance between the two time series to describe their similarity/dissimilarity, such as the Euclidean distance (ED) [4] and the other $L_p$ norms, dynamic time warping (DTW) [20, 21], the longest common subsequence (LCSS) [22], the slope distance [23], and the pattern distance [24].
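As one concrete example of the segment-based representations mentioned above, here is a minimal PAA sketch in Python (our illustration, not tied to the cited implementations):

```python
import numpy as np

# Illustrative piecewise aggregate approximation (PAA): represent a series of
# length n by the means of w adjacent segments.
def paa(x, w):
    x = np.asarray(x, dtype=float)
    return np.array([seg.mean() for seg in np.array_split(x, w)])

print(paa([1, 2, 3, 4, 5, 6, 7, 8], 4))  # -> [1.5, 3.5, 5.5, 7.5]
```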

Multivariate time series similarity searches mainly study two aspects: overall matching and match-by-dimension. Overall matching treats the MTS as a whole because of the important correlations among the variables in MTS datasets. Many overall matching similarity measures are based on principal component analysis (PCA): the original time series are represented by eigenvectors and eigenvalues after transformation, and the distance between the eigenvectors, weighted by the eigenvalues, is used to describe the similarity/dissimilarity, as in, for example, Eros [3], [25], and [26]. Lee and Choi [27] combined PCA with the hidden Markov model (HMM) to propose two methods, PCA + HMM and PCA + HMM + SVM, to find similar MTS. With the principal components as the input of several HMMs, the similarity is calculated by combining the likelihood of each HMM. Guan et al. [28] proposed a pattern matching method based on point distribution (PD) for multivariate time series. Local important points of a multivariate time series and their distribution are used to construct a pattern vector, and the Euclidean distance between pattern vectors is used to measure the similarity of the original time series.

By contrast, match-by-dimension breaks an MTS into multiple univariate time series that are processed separately and then aggregates the results. Li et al. [29] searched for similar sequences in each dimension series and then synthesized the per-dimension results with the traditional BORDA voting method to obtain the overall similarity of the multivariate time series. Compared to overall matching, match-by-dimension can take advantage of existing univariate time series similarity analysis approaches.

In this paper, a new algorithm based on the weighted BORDA voting method for MTS $k$ nearest neighbor ($k$NN) searching is proposed. Each MTS dimension series is considered a separate univariate time series. Firstly, a univariate similarity searching approach is used to find similar sequences for each dimension series; then the similar sequences of each dimension series are synthesized with the weighted BORDA voting method to generate the multivariate similar sequences. Compared to the measure in [29], our proposed method considers the dimension importance and the similarity gaps between similar sequences, and it generates more accurate similar sequences.

In the next section, we briefly describe the BORDA voting method and some widely used similarity measures. Section 3 presents the proposed algorithm for searching the $k$NN sequences. Datasets and experimental results are presented in Section 4. Finally, we conclude the paper in Section 5.

2. Preliminaries

In this section, we briefly discuss the BORDA voting method, the similarity measure in [29], and DTW, on which our proposed technique is based. The Notations section contains the notations used in this paper.

2.1. BORDA Voting Method

BORDA voting, a classical voting method in group decision theory, was proposed by Jean-Charles de BORDA [30]. Suppose there are a given number of winners to elect from a set of candidates; each elector ranks the candidates by preference from high to low. In every elector's vote, the candidate ranked first is provided as many points as there are candidates (called the voting score), the second candidate one point fewer, and so on by analogy; the last one is provided 1 point. The accumulated voting score of a candidate is its BORDA score, and the candidates whose BORDA scores rank in the top positions are called BORDA winners.
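To make the scoring rule concrete, here is a minimal Python sketch (the function name and ballot layout are illustrative, not from the paper):

```python
# Minimal BORDA scoring sketch: with c candidates, rank 1 earns c points,
# rank 2 earns c - 1, ..., the last rank earns 1 point.
def borda_scores(ballots, num_candidates):
    """ballots: list of rankings, each a list of candidate ids ordered
    from most to least preferred. Returns accumulated BORDA scores."""
    scores = [0] * num_candidates
    for ranking in ballots:
        c = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += c - position  # 1st gets c points, last gets 1
    return scores

# Three electors ranking four candidates (ids 0..3):
ballots = [[0, 1, 2, 3], [1, 0, 3, 2], [0, 2, 1, 3]]
print(borda_scores(ballots, 4))  # -> [11, 9, 6, 4]; candidate 0 wins
```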

2.2. Similarity Measure on Traditional BORDA Voting

Li et al. [29] proposed a multivariate similarity measure based on BORDA voting, denoted here by SBORDA. The measure is divided into two parts: the first is similarity mining on univariate time series, and the second is the integration of the first-stage results by BORDA voting. In the first stage, a certain similarity measure is used to query the $k$NN sequences of each dimension's univariate series in the MTS. In the second stage, scores are assigned to each univariate similar sequence by the BORDA voting rule: the most similar sequence scores $k$ points, the second $k - 1$, and so on in decreasing order; the last scores 1. Sequences with the same or very close time periods will be found in different univariate time series. According to the election rule, the sequences whose votes are fewer than half the number of dimensions are eliminated; the BORDA scores of the remaining sequences are then calculated. If a sequence of some time period appears in the results of several univariate series, its similarity score is the sum of all the scores it received. In the end, the sequence with the highest score is the most similar to the query sequence.
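A rough sketch of this two-stage synthesis, under the assumption that each candidate time period is identified by a hashable id (all names are illustrative):

```python
from collections import defaultdict

# Hypothetical sketch of the second stage of SBORDA: accumulate BORDA points
# per candidate time period across dimensions, dropping periods that received
# votes from fewer than half of the dimensions.
def sborda_synthesize(per_dim_rankings, num_dims):
    """per_dim_rankings: one list per dimension, each ordered from most to
    least similar and containing candidate period ids."""
    votes = defaultdict(int)   # in how many dimensions a period was found
    score = defaultdict(int)   # accumulated BORDA points
    for ranking in per_dim_rankings:
        k = len(ranking)
        for pos, period in enumerate(ranking):
            votes[period] += 1
            score[period] += k - pos          # most similar gets k points
    kept = {p: s for p, s in score.items() if votes[p] >= num_dims / 2}
    return sorted(kept, key=kept.get, reverse=True)

# Three dimensions, 3NN each; period "a" is found in all three dimensions.
print(sborda_synthesize([["a", "b", "c"], ["a", "c", "d"], ["b", "a", "e"]], 3))
# -> ['a', 'b', 'c']
```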

2.3. Dynamic Time Warping Distance

Dynamic programming is the theoretical basis of dynamic time warping (DTW). DTW is a nonlinear planning technique combining time and distance measures, first introduced to time series mining by Berndt and Clifford [20] to measure the similarity of two univariate time series. Defined by the minimum cost of a time warping path, the DTW distance supports stretching along the time axis but does not satisfy the triangle inequality.
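For reference, a minimal DTW sketch in Python (our illustration; the Sakoe-Chiba window argument anticipates the 5% warping constraint used in Section 4.2):

```python
import numpy as np

# Minimal DTW sketch with an optional Sakoe-Chiba warping window.
def dtw_distance(x, y, window=None):
    n, m = len(x), len(y)
    w = max(window if window is not None else max(n, m), abs(n - m))
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x, y = [0.0, 1.0, 2.0, 1.0], [0.0, 0.9, 2.1, 2.0, 1.0]
print(dtw_distance(x, y, window=max(1, int(0.05 * len(x)))))
```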

3. The Proposed Method

In the previous section, we reviewed the similarity measure based on traditional BORDA voting, SBORDA, for multivariate time series. In this section, we propose a dimension-combination similarity measure based on weighted BORDA voting, called SWBORDA, for $k$NN searching on MTS datasets. The similarity measure can be applied to both whole sequence matching and subsequence matching similarity searching.

3.1. SWBORDA: Multivariate Similarity Measure on Weighted BORDA Voting

SBORDA takes just the rank order into consideration, without the actual similarity gap between two adjacent similar sequences, which may fail to rank the similar sequences correctly. For example, assume four candidates take part in a four-round race whose finishing orders rotate: in the first round the order is A, B, C, D; in the second, B, C, D, A; in the third, C, D, A, B; and in the last, D, A, B, C. The four runners are all ranked number 1 by the traditional BORDA score (10 points each), because only the rank order is considered, not the speed gaps between the runners in each round. In our proposed approach, we use the complete information about each candidate, including the rank order and the actual gap to its neighbors.

The multivariate data sequences with $m$ dimensions are divided into $m$ univariate time series, each dimension being one univariate time series. Given a multivariate query sequence $Q$, to search for the multivariate $k$NN sequences, each univariate time series is searched separately. For the $i$th dimension time series, the $k'$ nearest neighbor sequences are $X_{i1}, X_{i2}, \dots, X_{ik'}$, where $k'$ is equal to or greater than $k$, and $X_{i1}$ is $Q_i$, the $i$th dimension series of $Q$, which is considered the sequence most similar to itself. The distances between $Q_i$ and $X_{i1}, X_{i2}, \dots, X_{ik'}$ are $d_{i1}, d_{i2}, \dots, d_{ik'}$, respectively, where $d_{ij} \le d_{i(j+1)}$; $d_{ij}$ describes the similarity gap of $X_{ij}$ to $Q_i$. Let the voting score of $X_{i1}$ be $k'$ and that of $X_{ik'}$ be 1; the weighted voting score $s_{ij}$ of the sequence $X_{ij}$ is defined by
$$s_{ij} = w_i \left( k' - (k' - 1)\,\frac{d_{ij} - d_{i1}}{d_{ik'} - d_{i1}} \right),$$
where $w$ is a weight vector based on the eigenvalues of the MTS dataset, $\sum_i w_i = 1$, and $w_i$ represents the importance of the $i$th dimension series among the MTS. $s_{ij}$ is inversely proportional to $d_{ij}$, with $d_{i1}$ as the baseline: the higher the similarity gap between $X_{ij}$ and $X_{i1}$ is, the lower the weighted BORDA score $X_{ij}$ will get.
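A small sketch of this scoring, assuming the linear interpolation form given above (the formula reconstruction and all names are ours, not a definitive implementation):

```python
import numpy as np

# Sketch of the weighted voting score s_ij: the closest sequence scores k',
# the k'-th scores 1, and everything is scaled by the dimension weight w_i.
def weighted_voting_scores(distances, w_i, k_prime):
    """distances: sorted d_i1 <= ... <= d_ik' for one dimension."""
    d = np.asarray(distances, dtype=float)
    span = d[-1] - d[0]
    if span == 0.0:                         # all sequences equally similar
        raw = np.full(len(d), float(k_prime))
    else:                                   # score shrinks as the gap grows
        raw = k_prime - (k_prime - 1) * (d - d[0]) / span
    return w_i * raw

print(weighted_voting_scores([0.0, 0.2, 0.5, 1.0], w_i=0.4, k_prime=4))
# -> [1.6, 1.36, 1.0, 0.4]
```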

We accumulate the weighted voting scores of each item in a candidate multivariate similar sequence and thereby obtain its weighted BORDA score. The candidate sequences are ranked by weighted BORDA score, and the top $k$ are the final similar sequences to $Q$. The model of similarity searching based on weighted BORDA voting is shown in Figure 1.

In the model of Figure 1, firstly, PCA is applied to the original MTS, transforming it into a new dataset $\hat{X}$ whose variables are uncorrelated with each other. The first $p$ dimension series, which contain most of the characteristics of the original MTS, are retained to reduce the dimensionality. Then, univariate time series similarity searching is performed on each dimension series of $\hat{X}$ to find the $k'$ univariate similar sequences; $k'$ should be equal to or greater than the final $k$. Next, the $k'$ sequences are truncated to obtain the candidate multivariate similar sequences. Finally, SWBORDA is performed on the candidate multivariate similar sequences to obtain the $k$NN of the query sequence. Intuitively, SWBORDA measures the similarity from different aspects (dimensions) and synthesizes them: the more dimensions of a measured sequence are similar to the query sequence, the more similar the sequence is to the query sequence as a whole. The following sections describe the similarity searching in detail.

3.2. Performing PCA on Original MTS

In our proposed method, all MTS dimension series are considered independent of one another, but in fact correlation exists among them more or less, so PCA is applied to the MTS, which can be represented as an $n \times m$ matrix $X$, where $n$ represents the length of the series and $m$ is the number of dimensions (variables). Each row of $X$ can be considered a point in $m$-dimensional space. Intuitively, PCA transforms the dataset by rotating the original $m$-dimensional axes and generating a new set of axes; the principal components are the projected coordinates of $X$ on the new axes [3].

Performing PCA on a multivariate dataset $X$ is based on the correlation matrix or covariance matrix of $X$ and results in two matrices, the eigenvector matrix $U$ and the variance matrix $\Lambda$. Each column of $U$, called an eigenvector, is a unit vector that, geometrically, presents a new axis position in the original $m$-dimensional space. The variance matrix element $\lambda_i$, called an eigenvalue, provides the variance of the $i$th principal component. The matrix $Y$ of the new projected coordinates of the original data can be calculated by $Y = XU$. The first dimension univariate time series of $Y$ is the first principal component and accounts for the largest part of the variance presented in the original $X$; the $i$th principal component accounts for the largest part of the remaining variance and is orthogonal to the 1st, 2nd, $\dots$, and $(i-1)$th dimensions. We select the first $p$ components, which retain more than, for example, 90% of the total variation presented in the original data, to represent $X$. Thus, dimensionality reduction is achieved as long as $p < m$. Geometrically, the original $X$ is projected onto the new $p$-dimensional space.
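A compact numpy sketch of this step, selecting $p$ so that at least 90% of the variance is retained (variable names are illustrative):

```python
import numpy as np

# Sketch: PCA on one MTS item X (n observations x m variables), keeping the
# first p components that retain at least `threshold` of the total variance.
def pca_reduce(X, threshold=0.90):
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = np.cov(Xc, rowvar=False)          # m x m covariance matrix
    eigvals, U = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending variance
    eigvals, U = eigvals[order], U[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    p = int(np.searchsorted(ratio, threshold) + 1)
    return Xc @ U[:, :p], U[:, :p], eigvals[:p]   # Y = XU (first p columns)

X = np.random.default_rng(0).normal(size=(100, 6))
Y, U_p, lam = pca_reduce(X)
print(Y.shape, lam)
```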

In whole sequence matching similarity searching, we apply PCA to all MTS items and retain $p$ components such that more than, for example, 90% of the total variation is retained in every MTS item.

3.3. Truncating Univariate Similar Sequences

In a candidate similar MTS, each dimension series starts at the same time. However, the similar sequences found for each dimension time series may not start at the same time. Similar sequences whose start times are close across dimensions can be treated as belonging to the same candidate similar MTS and truncated accordingly. The truncation includes four steps: grouping the sequences, deleting the isolated sequences, aligning the overlapping sequences, and reordering the sequences; a code sketch of the first two steps follows Step (4) below. After truncation, the candidate multivariate similar sequences are obtained.

The truncation for whole sequence matching similarity searching is just a special case of that for subsequence matching, so we introduce the truncation for subsequence matching. In Figure 2, 3NN sequences are searched for a multivariate query sequence of a given length, and the application of PCA on the data MTS results in a principal component series with three dimensions. 3NN searching is performed on each dimension of the principal component series. Denote the 3NN sequences of the first dimension by $c_{11}$, $c_{12}$, and $c_{13}$, each a subsequence whose start and end times are marked in Figure 2. The univariate similar sequences are presented according to their occurrence time, and the presentation order does not reflect the similarity order to the query sequence. Likewise, the 3NN sequences of the second dimension are denoted $c_{21}$, $c_{22}$, and $c_{23}$, and those of the third dimension $c_{31}$, $c_{32}$, and $c_{33}$.

(1) Grouping the Univariate Similar Sequences. The univariate similar sequences of all dimensions are divided into groups so that, in each group, for any sequence at least one other sequence can be found that overlaps with it over half of the sequence length. A univariate similar sequence that does not overlap with any other similar sequence is put into a single group including only itself. In Figure 2, all the similar sequences are divided into five groups, g1 through g5: g1 includes one similar sequence from each of the three dimensions, whose pairwise overlapping lengths all exceed half of the sequence length; g3 and g4 each include two similar sequences from different dimensions; and g2 and g5 each include a single sequence that overlaps no other.

(2) Deleting the Isolated Sequences. A group in which the number of similar sequences is less than half the number of dimensions is called an isolated group, and the similar sequences in an isolated group are called isolated similar sequences. In Figure 2, the numbers of similar sequences in groups g2 and g5 are less than half the number of dimensions (which is 3), so the similar sequences in them are deleted.

(3) Aligning the Overlapping Sequences. The sequences in the same group are aligned to generate a candidate multivariate similar sequence. For each group, the average start time $\bar{t}$ of all the included sequences is calculated; then the subsequence from $\bar{t}$ to $\bar{t} + l - 1$, where $l$ is the query length, denoted by cs, is the candidate multivariate similar sequence. Each dimension series of cs is regarded as a univariate similar sequence. The similarity distance between cs and the query sequence is recalculated with the selected univariate similarity measure dimension by dimension; if the group already contains the $i$th dimension similar sequence, then its known similarity distance is assigned to the $i$th dimension series of cs to reduce computation. In Figure 2, for group g1, the average of the start times of its three sequences is calculated, and the subsequence starting there is the candidate multivariate similar sequence cs1. For group g3, which lacks a second-dimension similar sequence, the similarity distance between the 2nd dimension series of its candidate cs2 and the query sequence must be recalculated. The same alignment operation is performed on group g4 to obtain the candidate multivariate sequence cs3.

(4) Reordering the Candidate Similar Sequences. For each dimension, the candidate univariate similar sequences are reordered by the similarity distances calculated in Step (3). After reordering, SWBORDA is used to synthesize the candidate similar sequences and generate the multivariate $k$NN sequences.
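As referenced above, here is a rough sketch of Steps (1) and (2), representing each univariate similar sequence as a (dimension, start, end) interval; the greedy grouping and the half-overlap test on the shorter sequence are simplifying assumptions on our part:

```python
# Sketch of Steps (1)-(2): group sequences by time overlap, then drop
# isolated groups (fewer members than half the number of dimensions).
def group_and_filter(sequences, num_dims):
    def overlaps(a, b):
        inter = min(a[2], b[2]) - max(a[1], b[1])
        return inter > 0.5 * min(a[2] - a[1], b[2] - b[1])

    groups = []
    for seq in sequences:
        home = next((g for g in groups if any(overlaps(seq, s) for s in g)), None)
        if home is None:
            groups.append([seq])      # isolated so far: its own group
        else:
            home.append(seq)
    return [g for g in groups if len(g) >= num_dims / 2]

# Three dimensions; the third sequence overlaps neither of the first two.
seqs = [(0, 10, 20), (1, 12, 22), (2, 50, 60)]
print(group_and_filter(seqs, 3))  # -> [[(0, 10, 20), (1, 12, 22)]]
```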

In whole sequence matching searching, the similar sequences either overlap completely or do not overlap each other at all, and the truncation steps are the same as those of subsequence matching.

3.4. Computing Weights

By applying PCA to the MTS, the principal component series and the eigenvalues, which represent the variances of the principal components, are obtained. When we calculate the weighted BORDA score for SWBORDA, we take into consideration both the similarity gap and the dimension importance. The heuristic algorithm proposed in [3] is used to calculate the weight vector $w$ based on the eigenvalues. The variances are aggregated by a certain strategy $f$, for example, min, mean, or max, on the eigenvalue vectors dimension by dimension. The weight vector element $w_i$ is defined by
$$w_i = \frac{f(V_i)}{\sum_{j=1}^{p} f(V_j)}, \quad i = 1, 2, \dots, p,$$
where $f$ denotes the aggregating strategy and $V_i$ is the variance vector of the $i$th dimension across the MTS items. Generally, for subsequence matching, $V_i$ includes one element, and for whole matching its length is greater than 1. Intuitively, each $w_i$ in the weight vector represents the aggregated variance of all the $i$th principal components. The original variance vectors may be normalized before aggregation.
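A small sketch of this weight computation, assuming $f$ is applied componentwise to the stacked variance vectors and the result is normalized to sum to 1 (names and data layout are illustrative):

```python
import numpy as np

# Sketch: aggregate the per-item eigenvalue vectors componentwise with a
# strategy f (min, mean, or max), then normalize so the weights sum to 1.
def compute_weights(variance_vectors, f=np.mean):
    """variance_vectors: s x p array with one row of p eigenvalues per MTS
    item (s = 1 for subsequence matching)."""
    V = np.asarray(variance_vectors, dtype=float)
    aggregated = f(V, axis=0)               # one value per principal component
    return aggregated / aggregated.sum()

print(compute_weights([[5.0, 3.0, 1.0], [6.0, 2.0, 2.0]], f=np.min))
# -> [0.625 0.25  0.125]
```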

4. Experimental Evaluation

To evaluate the performance of our proposed techniques, we performed experiments on six real-world datasets. In this section, we first describe the datasets used in the experiments and the experimental methods, followed by the results.

4.1. Datasets

The experiments have been conducted on four UCI datasets [31]: electroencephalogram (EEG), Australian sign language (AUSLAN), Japanese vowels (JV), and robot execution failure (REF), which are all labeled MTS datasets.

The EEG dataset contains measurements from 64 electrodes placed on the scalp and sampled at 256 Hz to examine EEG correlates of genetic predisposition to alcoholism. Three versions, the small, the large, and the full, are included in this dataset according to the volume of the original data. We utilized the large dataset, containing 600 samples and 2 classes.

The AUSLAN2 dataset consists of samples of AUSLAN (Australian sign language) signs: 27 examples of each of 95 signs were captured from 22 sensors placed on the hands (gloves) of a native signer, for a total of 2565 signs in the dataset.

The JV dataset contains 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers, describing successive utterances of the Japanese vowels /ae/ by each speaker. The length of each time series is in the range 7–29. The dataset contains two parts, training and test data; we utilized the training data, which contain 270 time series.

The REF dataset contains force and torque measurements on a robot after failure detection. Each failure is characterized by 6 forces/torques and 15 force/torque samples. The dataset contains five subdatasets, LP1 to LP5, each defining a different learning problem; the LP1, LP4, and LP5 subdatasets were utilized in the experiments. LP1, which defines failures in approach to grasp position, contains 88 instances and 4 classes; LP4 contains 117 instances and 3 classes; and LP5 contains 164 instances and 5 classes. A summary is shown in Table 1.

4.2. Method

To validate our proposed similarity measure SWBORDA, 1NN classification and 10-fold cross validation are performed [32]. That is, each dataset is divided into ten subsets, 1 fold for testing and the remaining 9 for training. For each query item in the testing set, its 1NN is searched in the training set, the query item is classified according to the label of the 1NN, and the average precision is computed across all the testing items. The experiment is repeated 10 times for different testing and training sets, yielding 10 different error rates; the average error across all 10 trials is the 10-fold cross validation error. We performed 10-fold cross validation 10 times and computed the average error across all the cross validations to estimate the classification error rate. The similarity measures tested in our experiments include SBORDA, PD, Eros, and SWBORDA. DTW is selected as the univariate similarity measure for SBORDA and SWBORDA; the resulting measures are denoted SBORDA-DTW and SWBORDA-DTW, respectively. For DTW, the maximum amount of warping is restricted to 5% of the series length. DTW has been extensively employed in various applications of time series similarity searching, because it can be applied to two MTS items warped in the time axis and with different lengths.
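For concreteness, here is a sketch of the evaluation loop (1NN classification with 10-fold cross validation) given any pairwise distance function; the data layout and names are illustrative, not the paper's code:

```python
import numpy as np

# Sketch: estimate the 1NN classification error rate with k-fold CV.
def cv_error_rate(items, labels, distance, folds=10, seed=0):
    idx = np.random.RandomState(seed).permutation(len(items))
    errors = 0
    for f in range(folds):
        test = idx[f::folds]                  # every folds-th shuffled index
        train = np.setdiff1d(idx, test)
        for t in test:
            nn = min(train, key=lambda j: distance(items[t], items[j]))
            errors += labels[nn] != labels[t] # count 1NN misclassifications
    return errors / len(items)
```

Repeating `cv_error_rate` with 10 different seeds and averaging mirrors the "10 times 10-fold cross validation" protocol described above.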

All measures except PD and Eros require determining the number $p$ of components to be retained. Classification has been conducted for consecutive values of $p$ that retain more than 90% of the total variation, until the error rate reaches its minimum; values of $p$ that retain less than 90% of the total variation are not considered. For Eros, the experiments are conducted as proposed in [3]. The Wilcoxon signed-rank test is used to ascertain whether SWBORDA-DTW yields an improved classification performance on multiple datasets in general. PCA is performed on the covariance matrices of the MTS datasets.
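The test itself is a single scipy call; the error rates below are made-up placeholders, not the results of Table 2:

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset error rates (percent) for two paired measures.
errors_swborda = [10.2, 1.5, 4.0, 12.1, 8.3, 15.0]
errors_sborda = [12.0, 1.9, 4.4, 14.8, 9.1, 16.2]
stat, p_value = wilcoxon(errors_swborda, errors_sborda)  # two-sided by default
print(p_value, p_value < 0.05)  # significant at the 5% level?
```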

4.3. Results

The classification error rates are presented in Table 2 as percentages. Although experiments have been conducted for various values of $p$, that is, the number of principal components (for SBORDA-DTW and SWBORDA-DTW), only the best classification accuracies are presented.

Firstly, we compare the similarity measures with respect to each dataset. For the EEG dataset, the best-performing measure is significantly better than the others. With regard to the AUSLAN2 dataset, Eros produces the lowest classification error rate and PD gives a very poor performance; overall matching methods such as Eros perform better than the others on this dataset. For the JV dataset, SWBORDA-DTW does not produce the best performance. Finally, for the LP1, LP4, and LP5 datasets, SWBORDA-DTW gives the best performance.

Finally, the similarity measures are compared across all the datasets. Between SWBORDA-DTW and SBORDA-DTW, the Wilcoxon signed-rank test reports a two-sided $p$ value of 0.043, showing that the algorithms are significantly different; at the 5% significance level, SWBORDA-DTW performs better than SBORDA-DTW. Compared to Eros and SBORDA-DTW, SWBORDA-DTW has better performance on LP1, LP4, and LP5, but it shows poor performance on EEG, AUSLAN2, and JV. Table 3 shows the number of principal components that just retain more than 90% of the total variation in the experiment datasets. For LP1, LP4, and LP5, the first few principal components retain most of the variation after PCA, but for EEG, AUSLAN2, and JV, more principal components must be retained to reach 90% of the total variation. SWBORDA searches the similar sequences dimension by dimension and then synthesizes them, and it is hard to generate aligned candidate multivariate similar sequences when many principal components are contained in the principal component series. Furthermore, for such datasets, for example, EEG, AUSLAN2, and JV, the first few principal components cannot retain sufficient information about the original series, and SWBORDA produces poor precision. SWBORDA performs better on datasets that aggregate most of the variation in the first few principal components after PCA.

5. Conclusion and Future Work

A match-by-dimension similarity measure for MTS datasets, SWBORDA, is proposed in this paper. The measure is based on principal component analysis, the weighted BORDA voting method, and a univariate time series similarity measure. To compute the similarity between two MTS items, SWBORDA performs PCA on the MTS and retains the $p$-dimensional principal component series that presents more than 90% of the total variance. Then univariate similarity analysis is applied to each dimension series, and the univariate similar sequences are truncated to generate candidate multivariate similar sequences. At last, the candidate sequences are ordered by weighted BORDA score, and the $k$NN sequences are obtained. Experiments demonstrate that our proposed approach is suitable for small datasets. The experimental results have also shown that the proposed method is sensitive to the number of classes in the datasets; this will be investigated further in future work.

In the literature, there are at present still not many studies on similarity analysis for MTS. In the future, we will explore new integration methods.

Notations

$m$: The number of MTS dimensions
$n$: The number of observations in an MTS
$p$: The number of retained principal components
$k$: The number of multivariate nearest neighbor sequences
$k'$: The number of univariate nearest neighbor sequences, usually greater than or equal to $k$
$X_{ij}$: The $j$th univariate similar sequence of the $i$th dimension, $j = 1, 2, \dots, k'$
$d_{ij}$: The distance between $Q_i$ and $X_{ij}$, $j = 1, 2, \dots, k'$
$s_{ij}$: The weighted voting score of the $j$th univariate similar sequence of the $i$th dimension
$U$: The eigenvector matrix of PCA
$\Lambda$: The variance matrix of PCA
$f$: The aggregating strategy for variances
$w$: The weight vector of the principal component series
$V_i$: The variance vector of the $i$th dimension of the MTS items, $i = 1, 2, \dots, p$.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is supported by the Fundamental Research Funds for the Central Universities (no. 2009B22014) and the National Natural Science Foundation of China (no. 61170200, no. 61370091, and no. 61202097).