Abstract

Change-Point (CP) detection has attracted considerable attention in the fields of data mining and statistics; it is very meaningful to discuss how to quickly and efficiently detect abrupt change from large-scale bioelectric signals. Currently, most of the existing methods, like Kolmogorov-Smirnov (KS) statistic and so forth, are time-consuming, especially for large-scale datasets. In this paper, we propose a fast framework for abrupt change detection based on binary search trees (BSTs) and a modified KS statistic, named BSTKS (binary search trees and Kolmogorov statistic). In this method, first, two binary search trees, termed as BSTcA and BSTcD, are constructed by multilevel Haar Wavelet Transform (HWT); second, three search criteria are introduced in terms of the statistic and variance fluctuations in the diagnosed time series; last, an optimal search path is detected from the root to leaf nodes of two BSTs. The studies on both the synthetic time series samples and the real electroencephalograph (EEG) recordings indicate that the proposed BSTKS can detect abrupt change more quickly and efficiently than KS, -statistic (), and Singular-Spectrum Analyses (SSA) methods, with the shortest computation time, the highest hit rate, the smallest error, and the highest accuracy out of four methods. This study suggests that the proposed BSTKS is very helpful for useful information inspection on all kinds of bioelectric time series signals.

1. Introduction

Abrupt change detection is to identify abrupt changes in the statistical properties of a signal series, which occur at unknown instants [13]. These changes are interesting because they are indicative of qualitative transitions in the data generation mechanism (DGM) underlying the signals. Currently, CP detection has attracted considerable attention in the fields of data mining and statistics, and it has been widely studied in many real-world problems, such as atmospheric and financial analyses [1], fault detection in engineering system [4, 5], climate change detection [6], genetic time series analyses [7], signal segmentation [8, 9], and intrusion detection in computer network [4].

In community of statistics, some nonparametric approaches for CP detection have been widely explored. For example, KS statistic quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution or between the empirical distribution function of two samples [10, 11]. Also, KS statistic and its modified versions are broadly investigated on many application fields, for example, testing hypotheses regarding activation in blood oxygenation level-dependent functional MRI data [12], modeling the cumulative distribution function of rub-induced AE signals, quantifying the goodness of fit to offer a suitable signal feature for diagnosis [13], as well as abrupt change detecting on EEG signals [14], and gene expression time series [15]. Meanwhile, as for the model-related statistic approaches, some modified cumulative sum (CUSUM) methods provide the asymptotic distributions of test statistics and the consistency of procedures and behave better in finite samples and have a higher stability with respect to the time of change than ordinary CUSUM procedures [16]. The CUSUM method and its revised versions have been widely applied to detect the structural breaks in the parameters of stochastic models, as well as the abrupt changes in the regression parameters of multiple time series regression models, such as multiple CP detection in biological sequences [17], abrupt change detection in the regression parameters of a set of capital asset pricing data related to the Fama-French extension of the CAPM [16], and abrupt change detection in a shape-restricted regression model [18].

On the other hand, SSA is a powerful technique for time series analyses. SSA is nonparametric and requires no prior knowledge on the properties of time series signal [19]. The main idea of SSA is applied in the principal component analyses on the trajectory matrix with subsequent reconstruction of the original time series. SSA has been proved to be very successful and has already become a standard tool in the analyses of climatic [10], meteorological, and geophysical time series [11, 19]. Currently, SSA has been successfully applied in the real time series recordings, for example, abrupt change analyses on EMG-onset detection [12] and CP detection in time series [13]. Although SSA is a model-free method, it is not scalable to large-scale datasets, because it is time-consuming and sometimes invalid for time series analyses with less significant data fluctuation.

In addition, Wavelet Transform (WT) is another important tool for time series analyses [14, 15, 2023]. WT has been widely applied in anomaly detection, time series prediction, image processing, and noise reduction [15, 2325]. WT can represent general function at different scales and positions in a versatile and sophisticated manner, so that the data distribution features can be easily extracted from different time or space scales [25, 26]. As a simple WT, Haar Wavelet (HW) owns some attractive features including fast implementation and ability to analyze the local features. HW is very useful to find abrupt changes of discontinuity and high frequency in time series, so it is a potential candidate in modern electrical and computer engineering applications, such as signal and image compression, eye detection [27], abnormality detection on time series [28, 29], and abrupt change detection on autoregressive conditional heteroscedastic processes [30].

However, all of these methods above are time-consuming and sometime invalid for abrupt change detection near the left or the right boundary, especially for insignificant data fluctuation in large-scale time series. To resolve these problems, we propose a fast framework for CP detection based on binary search trees and a modified KS statistic, termed BSTKS for short. In this novel method, first, two BSTs are derived from a diagnosed time series. Second, three search criteria are introduced in terms of the statistic and variance fluctuations between two adjacent time series segments, and then an optimal search path is detected from the root to leaf nodes of two BSTs. Last, the proposed BSTKS and other KS, , and SSA methods are tested on both the synthetic time series and real EEG recordings and evaluated in terms of computation time, hit rate, error, accuracy, and area under curve (AUC) of Receiver Operating Characteristic (ROC) curve analyses.

In general, for a certain bioelectric signal, an abrupt change means an important transition of biological functions or health states before and after a strong attack or an acute perturbation from internal or external environment. Therefore, it is very necessary to not only discern abrupt change from all kinds of physiological and psychological time series signals, but also inspect the significant fluctuation between adjacent time series segments with different scales. The following sections focused on not only presenting the framework of the proposed BSTKS method through theoretical foundation, simulation, and evaluation, but also discussing how it can more quickly and efficiently detect abrupt change on both synthetic and real bioelectric EEG signals than other existing KS, , and SSA methods. The rest of this paper is organized as follows. Section 2 gives the preliminary of abrupt change by introducing the statistic and variance fluctuations between two adjacent time series segments. Section 3 implements the integrated framework of the BSTKS method in terms of three search criteria in detail. Section 4 provides some representative experiments by using the synthetic time series and real EEG recordings and then analyzes the performance of BSTKS by comparing with other KS, , and SSA methods. Section 5 gives summary and conclusion from previous sections.

2. Preliminary

2.1. Statistic Fluctuation

KS statistic is sensitive to differences in both location and shape of the cumulative distribution functions (c.d.f) of two samples. The null distribution of KS statistic is calculated under the null hypothesis that the two samples are drawn from the same distribution or one sample is drawn from the reference distribution. To detect an abrupt change from a diagnosed time series , we define the statistic fluctuation between two adjacent segments within by means of KS statistic as follows [1, 4, 19].

Definition 1. Supposing a time series sample, , one observes where is a set of the discrete and centred i.i.d random variables and is a noisy mean signal with unknown distribution. The statistic fluctuation between two adjacent segments and is defined asin which and are the c.d.f of and , respectively; , , and . Supposing the hypothesized and in (2) are not available, we can derive the empirical cumulative distribution functions (e.c.d.f) of and from and . Then, and can be redefined aswhere and count the proportion of the sample points below level .

Hypothesis 1. In order to discern an abrupt change on in terms of statistic fluctuation defined above, we introduce KS test for two adjacent segments and in asif , no abrupt change occurs in ;if , abrupt change occurs in , in which is a threshold of the statistic fluctuation within belonging to an identical distribution. Then, we test against from observations. If an abrupt change occurs in , there exists a value satisfying , , and . In this hypothesis, we assume that the number, the location, and the size of the function in (1) are unknown, and the upper bound of the statistic fluctuation is supposed to be known.

2.2. Variance Fluctuation

Provided the statistic fluctuation defined in (2) is insignificant enough, it is difficult to detect abrupt change near the left or the right boundary within , especially when sample size gets smaller. Therefore, we need to introduce another variable to calculate the variance fluctuation between two adjacent parts within a time series sample.

Definition 2. Supposing two adjacent segments and in , the variance fluctuation between and is defined asin which , , and .

Hypothesis 2. If , no abrupt change occurs at in ; if , abrupt change occurs at in .
Here, is a variance threshold of time series which obeys an identical distribution. If there exists a value satisfying , then an abrupt change occurs at in .

3. Method

3.1. Two BSTs’ Construction

In the first part of the proposed BSTKS method, two BSTs, that is, BSTcA and BSTcD, are constructed from a time series sample , by using multilevel HWT. Generally, as shown in Figure 1, a discrete time series signal can be decomposed into the th-level trend and -level fluctuations, that is, , . The -level HWT is the mapping defined as [13]and then, the mapping can be represented by the approximation and detail coefficient matrices, termed McA and McD as follows: where and .

Supposing the size of a diagnosed is divisible times by 2, the th element in and the th element in can be denoted aswhere and ; , , and .

During two BSTs’ construction, as shown in Figure 2, the non-leaf nodes in BSTcA and BSTcD are assembled by the -level coefficient vectors of McA and McD, respectively; and then the leaf nodes are derived directly from the original time series . Therefore, the features of abrupt change in can be reflected and distributed into the different non-leaf nodes of BSTcA and BSTcD, in accordance with the level coefficient vectors in McA and McD.

3.2. CP Detection Based on Three Search Criteria

To find an optimal path towards the potential CP within a given time series quickly and efficiently, some search criteria need to be introduced, and then the data exceptions can be detected from the root to leaf nodes of two BSTs. As for the statistic fluctuation within BSTcA, first, a new variable is defined according to a current non-leaf node in BSTcA,where , ; , , and . Then, the statistic fluctuation between two adjacent segments and can be defined by a modified KS statistic asin which is a new element defined in (8); and stand for the sizes of and , respectively; , , , and . measures the e.c.d.f difference between and , and the larger means the more significant statistic fluctuation between and . Therefore, a potential abrupt change might occur at in with more probability.

Definition 3. For a current non-leaf node in BSTcA, with its left and right-child nodes and , the distance of e.c.d.f, , and can be defined aswhere , ; , , ; , ; and . To estimate an optimal path towards the potential change position within , without loss of generality, the first search criterion is introduced based on the statistic fluctuations and .

Criterion 1. Given two statistic fluctuation variables and in accordance with two non-leaf child nodes and of the current selected node in BSTcA, and , (a)if holds true, then the left-child node is selected and involved into the current search path; meanwhile, the right-child is discarded;(b)if holds true, then the right-child node is selected and involved into the current search path; meanwhile, the left-child is discarded.

Proof. For a selected non-leaf node in BSTcA, as shown in Figure 3, the original time series is divided equally into two adjacent segments and , which are covered by two non-leaf child nodes and , respectively. According to the definitions of and in (10), the satisfied indicates that the statistic fluctuation within is more significant than that one within ; that is, a potential abrupt change might be contained in with more probability than in , and vice versa. Furthermore, if holds true, then of Hypothesis 1 is satisfied; that is, abrupt change occurs in , and vice versa, where is the critical value predefined in an identical distribution and is the significance level. Therefore, one of the two child nodes and is selected and involved into the current search path; meanwhile, the remaining one is discarded. Once the statistic fluctuation is significant enough, an optimal search path can be detected by Criterion 1 from the top to the last non-leaf level in BSTcA. However, the search procedure is probably forced to cease because the statistic fluctuation is so insignificant that Criterion 1 is invalid for detecting it, especially for the left or the right boundary when sample is with smaller size . Therefore, it is necessary to introduce another search criterion based on the variance fluctuations within BSTcD.

Definition 4. For a current non-leaf node in BSTcD, with its left and right-child nodes and , respectively, the variance fluctuations and are defined in terms of (4) aswhere , ; , , ; , ; , ; and , .
Suppose Criterion 1 is invalid as holds true; the second search criterion needs to be introduced in terms of the two variance fluctuation variables and as follows.

Criterion 2. Given two variance fluctuation variables and according to the two non-leaf child nodes and of the selected node in BSTcD, and , (a)if holds true, then the left-child node in BSTcA is accordingly selected and involved into the current search path; meanwhile the right one is ignored;(b)if holds true, then the right-child node in BSTcA is accordingly selected and involved into the current search path; meanwhile the left one is ignored.

Proof. Similarly, as illustrated in Figure 4, the satisfied in Criterion 2 means that the variance fluctuations within are stronger than that one within , in terms of the definitions of and in (11). That is, a potential abrupt change might exist in with more probability than in , and vice versa. Meanwhile, if holds true, then in Hypothesis 2 is satisfied; that is, abrupt change occurs in , and vice versa, where is the critical value predefined in an identical distribution and is the significance level. As a result, one of the two non-leaf child nodes and in BSTcA can be accordingly selected, and the other one is neglected. Therefore, if Criterion 1 is invalid for less significant statistic fluctuation within BSTcA, Criterion 2 ensures that the search procedure can keep going forward to the potential abrupt change in , especially when abrupt change occurs near the left or the right boundary of with smaller size .

Based on Criterions 1 and 2 above, a search path can be obtained from the top root to the last non-leaf levels of BSTcA. In order to estimate an abrupt change from the original elements of , another criterion needs to be introduced to discern which one can be selected from two adjacent leaf nodes in BSTcA.

Definition 5. Supposing the current node is selected in the last non-leaf level of BSTcA, , with two child leaf nodes and , two statistic fluctuation variables and are defined based on KS test aswhere and ; and refer to the e.c.d.f of and , respectively; or and .
Consider that the largest statistic fluctuation between and is achieved either before or after one of the jumps, that is,Then, another two variables and are defined asTherefore, the maximal statistic fluctuations and can be selected from and , as well as and . Then, the third search criterion is introduced in terms of and as follows.

Criterion 3. Given and in accordance with two child leaf nodes and of the selected non-leaf node in BSTcA, , (a)if holds true, then the left leaf node in is taken as the estimated CP, and the right one is neglected;(b)if holds true, then the right leaf node in is taken as the estimated CP, and the left one is neglected;(c)otherwise, no abrupt change is detected from .

Proof. Obviously, if is satisfied in Criterion 3, then the statistic fluctuation overtakes the critical value which is given in an identical data distribution, and is the significance level. Therefore, one of the two leaf nodes and is taken as the estimated CP within .

Supposing a non-leaf node is selected in BSTcA, the statistic and variance fluctuations are accordingly calculated between two adjacent segments and . Meanwhile, the search procedure is implemented from the root to non-leaf nodes in the last second level of BSTcA, in terms of Criterions 1 and 2. Then, the estimated CP can be obtained from the leaf nodes in BSTcA, by using Criterion 3. Thereafter, an optimal path towards a potential CP within is detected from BSTcA, after about binary search steps.

3.3. Methods Compared with BSTKS

There are many methods proposed for abrupt change detection in time series, and the following are some typical methods, to evaluate the proposed BSTKS framework.

KS Statistic (see [31]). In this method, a diagnosed time series is divided into two adjacent segments and , and then KS statistic is applied to calculate the statistic distance between and aswhere and stand for the e.c.d.f of , and , respectively; , is the total length of , and refers to a current test position within .

t-Statistic (see [32]). also known as Welch’s -test is used only when the two population variances are assumed different (the two sample sizes may or may not be equal) and hence must be estimated separately. Suppose a diagnosed is divided into and . Then, -statistic is calculated aswhere and are the sample means of and , respectively; is an unbiased estimator of the standard deviation, , and and are the sizes of two segments and , respectively.

SSA (see [12, 13, 33]). In SSA method, a windowed portion is chosen within a time series , where is large enough and a window width and the lag parameter are set such that , . For each , this method takes an interval of the time series and then defines the trajectory matrix and describes the structure of the windowed portion as an -dimensional subspace. If the structure changes further, it will not be well described by the computed subspace. Then, the distance between this subspace and the new trajectory vectors will increase; therefore, this increase will signal that an abrupt change occurs in .

4. Results and Discussion

In this section, the proposed BSTKS is evaluated on the synthetic time series and real EEG recordings with different size . By comparing with existing KS, , and SSA methods, the efficiency, sensitivity, and performance are analyzed in terms of the computation time, error and accuracy, hit rate, and AUC of ROC analyses. Furthermore, the novelty of our algorithm and necessity for real application are discussed in the following paragraphs.

4.1. CP Detection on Synthetic Time Series

In our simulations, some typical time series samples were derived from the normally distributed datasets (mean, , and standard deviation, sd = 1). Each diagnosed sample of size is composed of a normal segment of size and an adjacent segment of size , in which the abnormal part is simulated by adding a constant variation into the random numbers of size . The proposed BSTKS and other three methods, namely, KS, , and SSA, were tested, respectively, on 200 samples which were derived from each time series group with , , and , where and . For each sample in , a series of test positions were arranged by , , and .

First, simulations were carried out according to different value of sample size and test position . The average analyses on four methods were listed in Table 1, and the results of simulations on datasets were illustrated in Figure 5. In general, our BSTKS is the most promising with the shortest computation time, the highest hit rate, the smallest error, and the highest accuracy out of all four methods. Particularly, as sample size increases from to , all four methods take longer time for bigger , and BSTKS is always the fastest one. Meanwhile, BSTKS owns the highest level of hit rate against the low tracks of other three methods; and BSTKS is much more efficient with the smallest error and the highest accuracy, though all four methods tend to be better with increasing. However, BSTKS has smaller AUC of ROC analyses, that is, bigger search space, than SSA and KS.

Second, simulations were carried out based on the datasets , , and . The proposed BSTKS and other three methods were tested according to the different value of variance , , and , respectively. The average results of four methods on , , and were summarized in Table 2, and the typical simulations were selected on and represented in Figure 6. Generally, when gets larger, all four methods get better hit rate, accuracy, and AUC of ROC analysis, except for longer computation time for bigger size . Compared with other three methods, the proposed BSTKS is more encouraging because of the shortest computation time, especially when gets bigger, as well as the highest hit rate and accuracy, especially when gets smaller. Moreover, the simulations on with different variance (Figure 6) explicitly illustrate that BSTKS has the best performance when gets larger, in terms of the shortest time and the biggest increase of the hit rate out of four methods. For the accuracy and AUC, both BSTKS and KS keep higher sensitivity than and SSA, as increases from 0.5 to 3.0. Moreover, the simulations on and were omitted, because similar results can be obtained like above.

Third, simulations were implemented based on different CP test positions within and . The proposed BSTKS and other three methods were analyzed according to the different value of test position CPK and variance . The results of simulations on and were illustrated in Figure 7, and the results near the left and right boundaries in and were summarized in Table 3. In general, all four methods tend to be better, when increases under a fixed , or when increases under a fixed . Meanwhile, for test position CPK near the left and right boundaries, the proposed BSTKS produces better performance than other three methods, because of the highest hit rate, the smallest error in all four methods, and higher accuracy and AUC than and SSA. Moreover, the simulations on and near the left and right boundaries were illustrated in Figure 8 in detail. In terms of the distribution of estimated CP (e-CP), PDF of e-CP, and AUC of ROC analysis, these simulations indicate that BSTKS is more sensitive for both left and right boundaries than other three methods, especially when sample size and variance get smaller.

Therefore, all simulation results above suggest that our proposed BSTKS is an encouraging and efficient method for abrupt change detection from the synthetic time series datasets, because of the shortest computation time, the highest hit rate, and accuracy out of four methods, especially for less significant statistic fluctuation when gets smaller, as well as for less significant variance fluctuation when gets bigger, and gets smaller.

4.2. Abrupt Change Analyses on EEG Recordings

To verify the proposed method further, we take some representative samples from the CHBMIT Scalp EEG Database. In the PhysioBank platform, the CHBMIT Scalp EEG Database (CHBMIT) was collected at the Children’s Hospital Boston; it consists of EEG recordings from pediatric subjects with intractable seizures [34, 35]. In this CHBMIT EEG database, some subjects were monitored up to several days after withdrawal of antiseizure medication in order to characterize their seizures and assess their candidacy for surgical intervention. Based on these EEG recordings in the CHBMIT EEG database, as well as some existing experiments in [3639], the proposed BSTKS and other three methods were tested according to different value of test position CPK and sample size .

First, a diagnosed EEG sample was assembled from two significantly different segments, in which and were derived from chb01_04_edfm and chb01_05_edfm, respectively. Then, BSTKS and other three methods were tested on the assembled EEG recordings , respectively, according to the different value of assigned test position CPK and sample size . The results of abrupt change detection on these assembled EEG samples were illustrated in Figure 8 and summarized in Table 4. Generally speaking, all four methods can roughly estimate the assigned test position from each assembled EEG recording and then divide it into two adjacent segments and . It is worth stressing that the proposed BSTKS can discern the different EEG segments accurately with the smallest error and the highest accuracy out of four methods. Also, BSTKS is the most efficient and encouraging with the shortest time in all four methods.

Moreover, for CPK near the left and right boundaries in , BSTKS has much better sensitivity than other KS, , and SSA methods because of the smallest error and the highest accuracy, especially for less statistic fluctuation when gets smaller, as well as less significant variance fluctuation when gets bigger. Supposing the assembled EEG sample indicates that a sharp transition of one’s mental situation occurs before and after a sudden attack or acute stimulation, it is meaningful to estimate the location of the abrupt change and the maximal difference of data distribution exists between two adjacent EEG segments. These experiments above suggest that the proposed BSTKS can successfully detect the change position where a sudden change occurs under a potential mental shock, more quickly and efficiently than KS, , and SSA methods.

Second, the original EEG samples were selected directly from different recordings in the chb01_05_edfm; then the proposed BSTKS and other three methods were tested according to different sample size . Because the distance of e.c.d.f (V.e.c.d.f) can partly reflect the data fluctuation between two adjacent EEG segments, we use this V.e.c.d.f variable to distinguish different performance of BSTKS and other three methods. The results of abrupt change analyses on these original EEG recordings were shown in Figure 9 and summarized in Table 5. For all methods above, they can estimate an abrupt change from each of these original EEG samples and then divide it into two adjacent EEG segments. Compared with other three methods, the proposed BSTKS is encouraging for the shortest time out of four methods. Moreover, BSTKS has bigger V.e.c.d.f than and SSA, which means that it can more reasonably distinguish two adjacent EEG segments with different state of mental health. Although KS has the biggest V.e.c.d.f in all four methods, it takes much more search time than BSTKS, especially when sample size is getting larger. In addition, needs the longest search time out of four methods, and it is invalid occasionally, for example, for , , and .

For these original EEG recordings with intractable seizures, it is of great concern to predict when and where a significant change happens from these EEG signals. This abrupt change probably indicates that a patient encounters a vertical transition from a previous mental status, and it is very important and helpful for diagnosing the patients with intractable seizures. These experiments on original EEG samples above indicate that the proposed BSTKS can not only accurately detect the change position, but also estimate the maximal difference of data distribution existing between two adjacent EEG segments, more quickly and efficiently than existing KS, , and SSA methods.

5. Conclusion

In this paper, a novel BSTKS method is proposed based on binary search trees and a modified KS statistic. In this method, two BSTs were constructed from a diagnosed time series by multilevel HWT, and then an optimal search path is detected from the root to leaf nodes of two BSTs in terms of three search criteria. The novelty of the proposed method is addressed by comparing with other KS, , and SSA methods, and simulations on the synthetic time series indicate that the proposed BSTKS is more efficient due to the shortest time, the highest hit rate, and the smallest error and highest accuracy out of four methods. Meanwhile, BSTKS has better sensitivity than KS near the left and right boundaries, because of shorter search time, higher hit rate, and bigger AUC, especially when sample size gets smaller with less significant statistic fluctuation. In addition, the necessity of the proposed method in the real domain is analyzed on real EEG recordings, and the results indicate that the proposed method can successfully discern an abrupt change and then obviously distinguish two adjacent EEG segments from the real EEG recordings. Through inspecting the significant fluctuation between adjacent segments signals, it is encouraging further for useful information inspection on all kinds of physiological and psychological time series signals. In a word, our BSTKS is a novel, efficient, and promising method for abrupt change analysis, and it is very helpful for useful information inspection on all kinds of real time series with different scales.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The authors would like to thank Professor Mohan Karunanithi in the Australia e-Health Research Centre, CSIRO Computation Informatics, for his assistance, support, and advice for this paper. This paper is supported by National Natural Science Foundation of China (no. 13K10414 and no. 61104154) and Specialized Research Fund for Natural Science Foundation of Shanghai (nos. 16ZR1401300 and 16ZR1401200).