A New Fuzzy Logic-Based Similarity Measure Applied to Large Gap Imputation for Uncorrelated Multivariate Time Series
The completion of missing values is a prevalent problem in many domains of pattern recognition and signal processing. Analyzing incomplete data may lead to a loss of power and unreliable results, especially for large missing subsequence(s). Therefore, this paper introduces a new approach for filling successive missing values in low/uncorrelated multivariate time series, which allows managing a high level of uncertainty. To this end, we propose a novel fuzzy weighting-based similarity measure. The proposed method involves three main steps. Firstly, for each incomplete signal, the data before a gap and the data after this gap are considered as two separate reference time series, each with its own query window. We then find the most similar subsequence to the query before the gap and the most similar one to the query after the gap. To find these similar windows, we build a new similarity measure based on fuzzy grades of basic similarity measures and on fuzzy logic rules. Finally, we fill in the gap with the average of the window following the first similar subsequence and the window preceding the second one. The experimental results demonstrate that the proposed approach outperforms the state-of-the-art methods in the case of multivariate time series having low/noncorrelated data but effective information on each signal.
Nowadays, huge time series are available thanks to effective low-cost sensors, the wide deployment of remote sensing systems, internet-based measurement networks, etc. However, collected data are often incomplete for various reasons such as sensor errors, transmission problems, incorrect measurements, bad weather conditions (outdoor sensors), for manual , etc. This is particularly the case for the marine samples that we consider in this paper. For example, the MAREL-Carnot database characterizes sea water in the eastern English Channel, in France. The data contain nineteen time series, such as nitrate, fluorescence, phosphate, and pH, that are measured by sensors every 20 minutes. The analysis of these data, with their remarkable size and shape, allows sea biologists to reveal events such as algal blooms, understand phytoplankton processes in detail, or detect sea pollution. But the data have a lot of missing values: 62.2% for phosphate, 59.9% for nitrate, 27.22% for , etc., and the size of missing data varies from one-third of an hour (a single 20-minute sample) to several months.
Most proposed models for multivariate time series analysis often have difficulties processing incomplete datasets, despite their powerful techniques. They usually require complete data. Then the question is how can missing values be dealt with? Ignoring or deleting is a simple way to solve this drawback. But serious problems regularly arise when applying this solution. This is prominent in time series data where the considered values depend on the previous ones. Furthermore, an analysis based on the systematic differences between observed and unobserved data leads to biased and unreliable results . Thus, it is important to propose a new technique to estimate the missing values. The imputation technique is a conventional method to handle incompleteness problems .
Considering imputation methods for multivariate time series, taking advantage of the correlations between variables is commonly applied to predict lacking data [6–11]. This means that relations permit using the values of available features to estimate the missing values of other features. However, considering multivariate datasets having low/noncorrelations (for instance the MAREL-Carnot dataset), the observed values of full variables cannot be utilized to complete attributes containing missing values. To handle missing data in this case, we must employ the observed values of the unique variable with the missing data to compute the incomplete values. Therefore the proposed method has to manage the high level of uncertainty of this kind of signal.
In particular, imperfect time series can be modelled using fuzzy sets. The fuzzy approach makes it possible to handle incomplete data as well as vague and imprecise circumstances, which create a highly uncertain environment for decision making. This property enables modelling and short-term forecasting of traffic flow in urban arterial networks using multivariate traffic data [13, 14]. Recent works on urban traffic flow prediction and on lane-change prediction have also been proposed with success. Furthermore, the successful use of fuzzy-based similarity measures in pattern recognition, in retrieval systems, and in recommendation systems leads us to study their ability to complete missing values in uncorrelated multivariate time series. Wang et al. proposed using information granules and fuzzy clustering for long-term time series forecasting. But to our knowledge, there is no application devoted to completing large gap(s) in uncorrelated multivariate time series using a fuzzy-weighted similarity measure.
Thus, this paper aims to propose a new approach, named FSMUMI, to fill large missing values in low/uncorrelated multivariate time series by developing a new similarity measure based on fuzzy logic. However, estimating the distribution of missing values and whole signals is very difficult, so our approach makes an assumption of effective patterns (or recurrent data) on each signal.
The rest of this paper is organized as follows. In Section 2, related works to imputation methods and fuzzy similarity measure are reviewed. Section 3 introduces our approach for completing large missing subsequences in low/uncorrelated multivariate time series. Next, Section 4 demonstrates our experimental protocol for the imputation task. Section 5 presents results and discussion. Conclusions are drawn and future work is presented in the last section.
2. Related Works
This section presents, first, related work about multivariate imputation methods, followed by a review on the fuzzy similarity measure and its applications.
2.1. Classical Multivariate Imputation Methods
Up to now, numerous studies have been devoted to completing missing data in multivariate time series, such as [10, 11, 20–28]. Imputation techniques can be categorized from different perspectives: model-based, machine learning-based, and clustering-based imputation techniques.
Regarding model-based imputation, two main methods have been proposed. The first method was introduced by Schafer. Under the hypothesis that all variables follow a multivariate normal distribution, this approach relies on the multivariate normal (MVN) model to determine completion values. The second method, namely MICE, was developed by van Buuren et al. and Raghunathan et al. This method uses chained equations to fill in incomplete data: for each variable with missing values, MICE computes the imputation data by exploiting the relations with all other variables.
Regarding machine learning-based imputation, many studies focus on the completion of missing data in multivariate time series. Stekhoven and Bühlmann implemented missForest, based on the Random Forest (RF) method, for multivariate imputation. Bonissone et al. proposed a fuzzy version of RF that they named fuzzy random forest (FRF). At the moment FRF is only devoted to classification, and in our case FRF may only be of interest for separating correlated and uncorrelated variables in multivariate time series if necessary. Shah et al. investigated a variant of MICE which fills in each variable using the estimation generated from RF. The results showed that the combination of MICE and RF was more efficient than the original methods for multivariate imputation. k-Nearest Neighbors (k-NN)-based imputation is also a popular method for completing missing values [11, 26, 27, 30–32]. This approach identifies the most similar patterns in the space of available features to impute missing data.
Besides these principal techniques, clustering-based imputation approaches are considered powerful tools for completing missing values thanks to their ability to detect similar patterns. The objective of these techniques is to separate the data into several clusters while satisfying the following conditions: maximizing the intracluster similarity and minimizing the intercluster similarity. Li et al. proposed the k-means clustering imputation technique, which estimates missing values using the final cluster information. The fuzzy c-means (FcM) clustering is a common extension of k-means. The squared norm is applied to measure the similarity between cluster centers and data points. Different applications based on FcM have been investigated for the imputation task [7–9, 34–38]. Wang et al. used FcM based on DTW to successfully predict time series in long-term forecasting.
In general, most of the imputation algorithms for multivariate time series take advantage of dependencies between attributes to predict missing values.
2.2. Methods Based on Fuzzy Similarity Measure
Indeed similarity-based approaches are a promising tool for time series analysis. However, many of these techniques rely on parameter tuning, and they may have shortcomings due to dependencies between variables. The objective of this study is to fill large missing values in uncorrelated multivariate time series. Thus, we have to deal with a high level of uncertainty. Mikalsen et al.  proposed using GMM (Gaussian mixture models) and cluster kernel to deal with uncertainty. Their method needs ensemble learning with numerous learning datasets that are not available in our case at the moment (marine data). So we have chosen to model this global uncertainty using fuzzy sets (FS) introduced by Zadeh . These techniques consider that measurements have inherent vagueness rather than randomness.
Uncertainty is classically presented using three conceptually distinctive characteristics: fuzziness, randomness and incompleteness. This classification is interesting for many applications, like sensor management (image processing, speech processing, and time series processing) and practical decision-making. This paper focuses on (sensor) measurements treatment but is also relevant for other applications.
Incompleteness often affects time series prediction (for example, time series obtained from marine data such as salinity). So it seems natural to use fuzzy similarity between subsequences of time series to deal with these three kinds of uncertainty (fuzziness, randomness, and incompleteness). Fuzzy sets are now well known, and we only need to recall the basic definition. Considering the universe $X$, a fuzzy set $A$ in $X$ is characterized by a membership function $\mu_A$:
$$\mu_A : X \to [0, 1] \tag{1}$$
where $\mu_A(x)$ represents the membership of $x$ to $A$ and is associated to the uncertainty of $x$. In our case, we will consider similarity values between the subsequences as defined in the following. One solution to deal with the uncertainty brought by multivariate time series is to use the concept of fuzzy time series. In this framework, the variable observations are considered as fuzzy numbers instead of real numbers. In our case the same modelling is applied to distance measures between subsequences, and we then compute the fuzzy similarity between these subsequences to find similar windows in order to estimate the missing values in observations.
Fuzzy similarity is a generalization of the classical concept of equivalence and defines the resemblance between two objects (here subsequences of time series). Similarity measures of fuzzy values have been compared and later extended in the literature. Pappis and Karacapilidis presented three main kinds of similarity measures of fuzzy values, including
(i) measures based on the operations of union and intersection,
(ii) measures based on the maximum difference,
(iii) measures based on the difference and the sum of membership grades.
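As an illustration of these three families, the following minimal sketch uses their commonly cited forms for two fuzzy values given as equal-length lists of membership grades. The function names are hypothetical, and the exact formulations in the cited works may differ in detail.

```python
def sim_union_intersection(a, b):
    """Ratio of intersection (min) to union (max) of the membership grades."""
    union = sum(max(x, y) for x, y in zip(a, b))
    if union == 0:
        return 1.0  # two empty fuzzy values are treated as identical here
    return sum(min(x, y) for x, y in zip(a, b)) / union


def sim_max_difference(a, b):
    """One minus the largest pointwise difference between the grades."""
    return 1.0 - max(abs(x - y) for x, y in zip(a, b))


def sim_difference_sum(a, b):
    """One minus the ratio of summed differences to summed grades."""
    total = sum(x + y for x, y in zip(a, b))
    if total == 0:
        return 1.0  # same convention as above for empty fuzzy values
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / total
```

All three return 1 for identical fuzzy values and 0 for fully disjoint crisp ones, but they penalize partial overlap differently.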
In [44, 45], the authors used these definitions to propose a distance metric for a space of linguistic summaries based on fuzzy protoforms. Almeida et al. extended this work to put forward linguistic summaries of categorical time series. The introduced similarity measure takes into account not only the linguistic meaning of the summaries but also the numerical characteristic attached to them. In the same way, Gupta et al. used this approach to create a hybrid similarity measure based on fuzzy logic, which is used to retrieve relevant documents. In another study, Al-shamri and Al-Ashwal presented fuzzy weightings of popular similarity measures for memory-based collaborative recommender systems.
Concerning the similarity between two subsequences of time series, we can use the DTW cost as a similarity measure. However, to deal with the high level of uncertainty of the processed signals, numerous other measures can be used to compute similarity, such as the cosine similarity, the Euclidean distance, and the Pearson correlation coefficient. Moreover, a fuzzy-weighted combination of scores generated from different similarity measures can achieve better retrieval results than the use of a single similarity measure [12, 18].
Based on the same concepts, we propose using a fuzzy rules interpolation scheme between grades of membership of fuzzy values. This method makes it possible to build a new hybrid similarity measure for finding similar values between subsequences of time series.
3. Proposed Approach
The proposed imputation method is based on the retrieval and the similarity comparison of available subsequences. In order to compare the subsequences, we create a new similarity measure applying a multiple fuzzy rules interpolation. This section is divided into two parts. Firstly, we focus on the way to compute a new similarity measure between subsequences. Then, we provide details of the proposed approach (namely, Fuzzy Similarity Measure Based Uncorrelated Multivariate Imputation, FSMUMI) to impute the successive missing values of low/uncorrelated multivariate time series.
3.1. Fuzzy-Weighted Similarity Measure between Subsequences
To introduce a new similarity measure using multiple fuzzy rules interpolation to solve the missing problem, we have to define an information granule, as introduced by Pedrycz . The principle of justifiable granularity of experimental data is based on two conditions: (i) the numeric evidence accumulated within the bounds of numeric data has to be as high as possible and, (ii) at the same time, the information granule should be as specific as possible .
To answer the first condition, we take into account 3 different distance measures between two subsequences: the Cosine distance, the Euclidean distance (these two measures are widely used in the literature), and the Similarity distance (presented in our previous study). These three measures are defined as follows:
(i) Cosine distance is computed by (2). This coefficient is the cosine of the angle between the two subsequences.
(ii) Euclidean distance is calculated by (3). To satisfy the input condition of fuzzy logic rules, we normalize this distance.
(iii) Similarity measure is defined by the function (4). This measure indicates the similarity percentage between the two subsequences.
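The three base measures can be sketched as follows. The cosine is the standard definition. The `1 / (1 + d)` normalization of the Euclidean distance and the range-scaled form of the similarity measure are assumptions made for illustration, as are all function names; they are not taken from the equations of this paper.

```python
import math


def cosine_measure(q, r):
    """Cosine of the angle between two equal-length subsequences."""
    dot = sum(a * b for a, b in zip(q, r))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_r = math.sqrt(sum(b * b for b in r))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0  # convention chosen here for all-zero windows
    return dot / (norm_q * norm_r)


def euclidean_measure(q, r):
    """Euclidean distance mapped into (0, 1] by 1 / (1 + d) (assumed normalization)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))
    return 1.0 / (1.0 + d)


def sim_measure(q, r):
    """Mean pointwise closeness, scaled by the range of q (assumed form)."""
    span = max(q) - min(q)
    if span == 0.0:
        span = 1.0  # avoid division by zero on constant windows
    return sum(1.0 / (1.0 + abs(a - b) / span) for a, b in zip(q, r)) / len(q)
```

Note that the cosine can be negative for signals that change sign, so a real implementation would need to map it into the input range expected by the fuzzy system.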
To answer the second condition, we use these 3 distance measures (or attributes) to generate 4 fuzzy similarities (see Figure 2), then applied to a fuzzy inference system (see Figure 1) using the cylindrical extension of the 3 attributes which provides 3 coefficients to calculate a new similarity measure. The universe of discourse of each distance measure is normalized to the value .
Finally, the new similarity measure is determined by a weighted combination of the Cosine, ED, and Sim measures, each with its own weight. Thus the uncertainty modelled using FS is kept during the similarity computation, which makes it possible to deal with a high level of uncertainty, as shown in the sequel. The weights are generated from the fuzzy interpolation system (Figure 1). We use the FuzzyR R-package to develop this system. All input and output variables are expressed by 4 linguistic terms: low, medium, medium-high, and high. A trapezoidal membership function is used to map the input and output spaces to a degree of membership (Figure 2). Multiple rules interpolation is applied to create the fuzzy rule base. With 3 input variables and 4 linguistic terms each, this gives 4 × 4 × 4 = 64 fuzzy rules. Each fuzzy rule has the following form: IF (Cosine is a term) and (ED is a term) and (Sim is a term) THEN (each of the three weights is a term), where every term is one of the four linguistic values.
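The building blocks of this system can be sketched as follows: a trapezoidal membership function, the enumeration of the 64 rule antecedents, and a weighted sum of the three measures. The breakpoints passed to `trapezoid`, the plain weighted sum in `combine`, and all names are assumptions for illustration; the actual rule consequents and defuzzification are handled by the fuzzy inference system and are not reproduced here.

```python
from itertools import product

TERMS = ("low", "medium", "medium-high", "high")


def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear in between."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# One antecedent per combination of terms over the three inputs: 4**3 = 64.
ANTECEDENTS = list(product(TERMS, repeat=3))


def combine(cosine, ed, sim, w_cos, w_ed, w_sim):
    """Weighted sum of the three base measures (assumed form of the combination)."""
    return w_cos * cosine + w_ed * ed + w_sim * sim
```

Shoulder-shaped terms such as low and high can be obtained by setting `a == b` or `c == d`, which the function handles without dividing by zero.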
3.2. FSMUMI Approach
Let us consider some notations about multivariate time series and the concept of large gap. A multivariate time series is represented as a matrix that stores the collected signals, all of the same size. Each entry of this matrix is the value of one signal at one time, and the values of all variables at a given observation form the feature vector of that observation. The series is called an incomplete time series when it contains missing values. We define a gap of size T at a given position as a portion of the series, starting at that position, where at least one signal contains T consecutive missing values.
Here, we deal with large gaps in low/uncorrelated multivariate time series. For isolated missing values (T = 1) or small gaps, conventional techniques can be applied, such as the mean or the median of available values [50, 51]. A gap is large when its duration T is longer than the known change process. For instance, in phytoplankton studies, T is equal to one hour to characterize Langmuir cells and one day for algal bloom processes. For small time series without prior knowledge of an application and its change process, we set a size threshold above which a gap is considered large.
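A minimal sketch of how gaps could be located on a single signal and classified as large is given below. Missing values are assumed to be encoded as NaN, and `change_duration` is a hypothetical parameter standing for the duration of the known change process, expressed in samples.

```python
import math


def find_gaps(signal):
    """Return (start, size) for every run of consecutive NaN values."""
    gaps, start = [], None
    for i, value in enumerate(signal):
        missing = isinstance(value, float) and math.isnan(value)
        if missing and start is None:
            start = i                       # a new gap begins here
        elif not missing and start is not None:
            gaps.append((start, i - start))  # the gap ended just before i
            start = None
    if start is not None:                    # gap running to the end of the signal
        gaps.append((start, len(signal) - start))
    return gaps


def large_gaps(signal, change_duration):
    """Keep only the gaps longer than the known change process (in samples)."""
    return [(s, t) for s, t in find_gaps(signal) if t > change_duration]
```

For hourly data and a one-day change process, `large_gaps(signal, 24)` would keep only the gaps of more than 24 consecutive missing samples.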
The mechanism of the FSMUMI approach is shown in Figure 3. Without loss of generality, in this figure, we consider a multivariate time series including 3 variables whose correlations are low. The proposed approach involves three major stages. The first stage is to build two queries, one before and one after the gap. The second stage is devoted to finding the most similar windows to the queries. This stage includes two minor steps: comparing sliding windows to the queries using the new similarity measure, and selecting the most similar window for each query. Finally, the imputation values are computed by averaging the values of the window following the match of the query before the gap and the window preceding the match of the query after the gap.
This method concentrates on filling missing values in low/uncorrelated multivariate time series. For this type of data, we cannot take advantage of the relations between features to estimate missing values. So we must base our approach on the observed values of each signal to complete the missing data of that same signal. This means that we complete missing data on each variable, one by one. Further, an important point of our approach is that each incomplete signal is processed as two separate time series, one before the considered gap and one after this gap. This increases the search space for similar values. Moreover, by applying the proposed process one variable at a time, FSMUMI makes it possible to handle the problem of wholly missing observations (missing data at the same index in all variables).
The proposed model is described in Algorithm 1 and is mainly divided into three phases:
(i) The first phase: Building queries (cf. 1 in Figure 3). For each incomplete signal and each gap, two reference databases are extracted from the original time series and two query windows are built to retrieve similar windows. The data before the gap and the data after this gap are considered as two separate time series. One query is the subsequence just before the gap and the other is the subsequence just after the gap. These query windows have the same size as the gap.
(ii) The second phase: Finding the most similar windows (cf. 2 and 3 in Figure 3). For the first database, we build sliding reference windows of the same size as the query. From these windows, we retrieve the most similar window to the query using the new similarity measure defined in Section 3.1. In detail, we first find the threshold above which two windows are considered similar. For each increment, we compute a similarity measure between a sliding window and the query. The threshold is the maximum value obtained from all the calculated similarities (Step a in Algorithm 1). We then find the most similar window to the query. For each increment, the similarity between a sliding reference window and the query is estimated. We then compare this similarity to the threshold to determine whether this reference window is similar to the query. We finally choose, among all the similar windows, the one with the maximum similarity (Step b in Algorithm 1). The same process is performed to find the most similar window in the other database. In the proposed approach, the dynamics and the shape of the data before and after a gap are a key point. This means we take into account both queries (after the gap and before the gap). This makes it possible to find the windows that have the most similar dynamics and shape to the queries.
(iii) The third phase (cf. 4 in Figure 3). When results from both reference time series are available, we fill in the gap by averaging the values of the window following the match of the query before the gap and the window preceding the match of the query after the gap. Average values are used in our approach because model averaging makes the final results more stable and unbiased.
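The three phases can be sketched on a single signal as follows. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: the similarity is a stand-in based on the Euclidean distance rather than the fuzzy-weighted measure, the two-pass threshold search is collapsed into a single pass that keeps the best window, and the names `best_window` and `fill_gap` are hypothetical.

```python
import math


def similarity(q, r):
    """Stand-in similarity in (0, 1]; a fuzzy-weighted measure would go here."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))
    return 1.0 / (1.0 + d)


def best_window(data, query, lo, hi):
    """Start index in [lo, hi] of the window most similar to the query."""
    size = len(query)
    best_i, best_s = lo, -1.0
    for i in range(lo, hi + 1):
        s = similarity(query, data[i:i + size])
        if s > best_s:
            best_i, best_s = i, s
    return best_i


def fill_gap(signal, start, size):
    """Return a copy of signal with signal[start:start + size] imputed.

    Assumes this gap is the only missing part of the signal and that at
    least 2 * size observed values exist on each side of it.
    """
    before = signal[:start]              # reference data before the gap
    after = signal[start + size:]        # reference data after the gap
    if len(before) < 2 * size or len(after) < 2 * size:
        raise ValueError("not enough data around the gap")
    q_before = before[-size:]            # query just before the gap
    q_after = after[:size]               # query just after the gap

    # Match of the pre-gap query, searched before the gap without
    # overlapping the query; the window that follows it mimics the gap.
    i = best_window(before, q_before, 0, len(before) - 2 * size)
    following = before[i + size:i + 2 * size]

    # Match of the post-gap query, searched after the gap; the window
    # that precedes it mimics the gap.
    j = best_window(after, q_after, size, len(after) - size)
    preceding = after[j - size:j]

    filled = list(signal)
    filled[start:start + size] = [(f + p) / 2.0 for f, p in zip(following, preceding)]
    return filled
```

On a strictly periodic signal this procedure recovers the removed values almost exactly, which reflects the assumption of recurrent patterns on each signal.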
4. Experiment Protocol
The experiments are performed on three multivariate time series with the same experiment process and the same gaps, described in detail below.
4.1. Datasets Description
For the assessment of the proposed approach and the comparison of its performance with several published algorithms, we use 3 multivariate time series: one from the UCI Machine Learning repository, one simulated dataset (which allows us to control the correlations between variables and the percentage of missing values), and finally a real time series hourly sampled by IFREMER (France) in the eastern English Channel.
(i) Synthetic dataset: The data are synthetic time series, including 10 features and 100,000 sampled points. All data points are in the range -0.5 to +0.5. The data appear highly periodic but never exactly repeat. They have structure at different resolutions. Each of the 10 features is generated by independent invocations of the function (6), which involves a term producing a random integer between 0 and a given bound. These data are very large, so we choose only a subset of 3 signals for performing experiments.
(ii) Simulated dataset: In the second experiment, a simulated dataset including 3 signals is produced as follows: for the first variable, we use 5 sine functions that have different frequencies and amplitudes. Next, 3 different noise levels are added to the data. We then repeat the resulting signal 4 times (this dataset has 32,000 sampled points). In this study, we deal with missing data in low/uncorrelated multivariate time series. To satisfy this condition, the two remaining signals are generated from the first signal so that the correlations between these signals are low. We apply the corgen function of the ecodist R-package to create the second and the third variables.
(iii) MAREL-Carnot dataset: The third experiment is conducted on the MAREL-Carnot dataset. This dataset consists of nineteen series, such as phosphate, salinity, turbidity, water temperature, fluorescence, and water level, that characterize sea water. These signals were collected from January 2005 to February 2009 at a 20-minute frequency. Here they were resampled hourly, so they have 35,334 time samples. But the data include many missing values, and the size of missing data varies from signal to signal. To assess the performance of the proposed method and compare it with other approaches, we choose a subgroup including fluorescence, water level, and water temperature (the water level and the fluorescence signals are complete, while water temperature contains isolated missing values and many gaps). We selected these signals because their correlations are low. After completing missing values, the imputed data will be compared with the actual values in the complete series to evaluate the ability of different imputation methods. Therefore, it is necessary to first fill the missing values in the water temperature. To ensure fairness across all algorithms, filling in the water temperature series is performed using the na.interp method.
4.2. Multivariate Imputation Approaches
In the present study, we compare the proposed algorithm with 7 other approaches (Amelia II, FcM, MI, MICE, missForest, na.approx, and DTWUMI) for the imputation of multivariate time series. We use the R language to execute all these algorithms.
(1) Amelia II (Amelia II R-package): The algorithm uses the familiar expectation-maximization algorithm on multiple bootstrapped samples of the original incomplete data to draw values of the complete data parameters. The algorithm then draws imputed values from each set of bootstrapped parameters, replacing the missing values with the drawn values.
(2) FcM, fuzzy c-means based imputation: This approach involves 2 steps. The first step is to group the whole data into clusters using the fuzzy c-means technique. A cluster membership for each sample and a cluster center are generated for each feature. The second step is to fill in the incomplete data by using the membership degrees and the cluster centroids. We follow the principles of earlier FcM-based imputation work and use the c-means function to develop this approach.
(3) MI: Multiple Imputation (MI R-package): This method uses predictive mean matching to estimate missing values of continuous variables. For each missing value, its imputation value is randomly selected from a set of observed values whose predicted means are closest to that of the variable with the missing value.
(4) MICE: Multivariate Imputation via Chained Equations (MICE R-package): For each incomplete variable, under the assumption of MAR (missing at random), the algorithm performs a completion by full conditional specification of predictive models. The same process is implemented with the other variables having missing data.
(5) missForest (missForest R-package): This algorithm uses the random forest method to complete missing values. For each variable containing missing data, missForest builds a random forest model on the available data. This model is then applied to the variable to estimate the missing data, and the procedure is repeated until it meets a stopping condition.
(6) Linear interpolation: na.approx (zoo R-package): This method is based on an interpolation function to predict each missing point.
(7) DTWUMI: For each gap, this approach finds the most similar window to the subsequence after (resp. before) the gap based on the combination of shape-feature extraction and Dynamic Time Warping algorithms. Then, the previous (resp. following) window of the most similar one in the incomplete signal is used to complete the gap.
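To make the simplest baseline concrete, the sketch below fills interior gaps of a single signal by linear interpolation between the nearest observed neighbours. It is a plain Python illustration of the idea behind method (6), not a reproduction of the zoo implementation; leading and trailing gaps are left as NaN here by choice.

```python
import math


def linear_interpolate(signal):
    """Fill interior NaN runs by a straight line between their neighbours."""
    out = list(signal)
    n = len(out)
    i = 0
    while i < n:
        if math.isnan(out[i]):
            j = i
            while j < n and math.isnan(out[j]):
                j += 1                       # j is the first index after the gap
            if i > 0 and j < n:              # interior gap with both neighbours
                left, right = out[i - 1], out[j]
                span = j - (i - 1)
                for k in range(i, j):
                    out[k] = left + (right - left) * (k - (i - 1)) / span
            i = j
        else:
            i += 1
    return out
```

A straight line ignores any oscillation inside the gap, which is why this kind of baseline is expected to degrade as gaps grow.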
4.3. Imputation Performance Measurements
In order to estimate the quantitative performance of imputation approaches, six criteria that are usual in the literature are used:
(1) Similarity evaluates the percentage of similarity between the estimated values and the respective real values. In its definition, T is the number of missing values. The similarity tends to 1 when the two curves are identical and tends to 0 when the amplitudes are strongly different.
(2) The $R^2$ score is determined as the square of the correlation coefficient between the estimated and the real values. This indicator makes it possible to assess the quality of an imputation model. A method presents better performance when its score is higher ($R^2 \in [0, 1]$).
(3) RMSE (Root Mean Square Error) is computed as the square root of the average squared difference between the estimated and the real values. This is an appropriate coefficient to measure the global ability of a completion method. In general, a lower RMSE indicates a better imputation performance. It is now well admitted that good imputation performance does not automatically lead to good estimation performance. This is why other indices such as FSD, FA2, and FB (which enable evaluating the shape of the two signals) are used in this study.
(4) FSD (Fraction of Standard Deviation) points out whether a method is acceptable or not. Applied to the imputation task, the closer the FSD value is to 0, the better the imputation method.
(5) FB (Fractional Bias) determines whether the predicted values are overestimated or underestimated relative to the observed values. This indicator is given by (10). An imputation model is considered ideal when its FB equals 0.
(6) FA2 is related to the percentage of outliers between the estimated and the real values: it is the fraction of pairs whose ratio lies within a factor of two. When the FA2 value is close to 1, a model is considered perfect.
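The six indices can be sketched as follows, with `y` the estimated values and `x` the real ones. The formulas below are the forms commonly used for these indicators and are assumptions for illustration (in particular the range scaling in `sim_index`, the absolute value in `fb`, and the `[0.5, 2]` ratio bounds in `fa2`); they may differ in detail from the equations of this paper.

```python
import math


def _mean(v):
    return sum(v) / len(v)


def _sd(v):
    m = _mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))


def sim_index(y, x):
    """Mean pointwise closeness, scaled by the range of the real values."""
    span = (max(x) - min(x)) or 1.0
    return sum(1.0 / (1.0 + abs(a - b) / span) for a, b in zip(y, x)) / len(x)


def r2(y, x):
    """Square of the Pearson correlation coefficient."""
    my, mx = _mean(y), _mean(x)
    cov = sum((a - my) * (b - mx) for a, b in zip(y, x))
    var = sum((a - my) ** 2 for a in y) * sum((b - mx) ** 2 for b in x)
    return 0.0 if var == 0 else cov * cov / var


def rmse(y, x):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, x)) / len(x))


def fsd(y, x):
    """Fraction of standard deviation; 0 when both spreads are equal."""
    sy, sx = _sd(y), _sd(x)
    return 0.0 if sy + sx == 0 else 2.0 * abs(sy - sx) / (sy + sx)


def fb(y, x):
    """Fractional bias (absolute value); 0 when both means are equal."""
    my, mx = _mean(y), _mean(x)
    return 0.0 if my + mx == 0 else 2.0 * abs((my - mx) / (my + mx))


def fa2(y, x):
    """Fraction of pairs whose ratio y/x lies within a factor of two."""
    ok = sum(1 for a, b in zip(y, x) if b != 0 and 0.5 <= a / b <= 2.0)
    return ok / len(x)
```

Pairs with a real value of zero are counted as outside the factor-of-two band in `fa2`, which is one possible convention among several.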
4.4. Experimental Process
Evaluating the ability of imputation methods cannot be done directly on real gaps because the actual values are lacking. So we must produce artificial missing data on complete time series in order to compare the performance of imputation approaches. We use a three-step technique to assess the results:
(i) The first step: Generate simulated missing values by removing data values from full time series.
(ii) The second step: Apply the imputation methods to fill in the missing data.
(iii) The third step: Evaluate the ability of the proposed approach and compare it with state-of-the-art methods using the performance indices mentioned above.
In this paper, we perform experiments with seven missing data levels on three large datasets. On each signal, we create simulated gaps whose sizes are 1%, 2%, 3%, 4%, 5%, 7.5%, and 10% of the data in the complete signal (here the biggest gap of the MAREL-Carnot data is 3,533 missing values, corresponding to about 5 months of hourly samples). For every missing ratio, the approaches are run 5 times, with the positions of the missing data chosen randomly. We thus perform 35 iterations (7 missing ratios × 5 runs) for each dataset.
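This protocol can be sketched on a single signal as follows. The helper names, the seeding, and the choice of one contiguous gap per run are assumptions made for illustration; `impute` and `score` stand for any imputation method and any performance index.

```python
import math
import random

RATES = (0.01, 0.02, 0.03, 0.04, 0.05, 0.075, 0.10)
RUNS = 5


def make_gap(signal, rate, rng):
    """Remove one contiguous block covering `rate` of the signal at a random position."""
    n = len(signal)
    size = max(1, round(n * rate))
    start = rng.randrange(0, n - size + 1)
    truth = signal[start:start + size]       # kept aside for the evaluation step
    damaged = list(signal)
    damaged[start:start + size] = [math.nan] * size
    return damaged, start, truth


def run_protocol(signal, impute, score, seed=0):
    """Average score per missing rate over RUNS random gap positions."""
    rng = random.Random(seed)
    results = {}
    for rate in RATES:
        scores = []
        for _ in range(RUNS):
            damaged, start, truth = make_gap(signal, rate, rng)
            filled = impute(damaged)
            scores.append(score(filled[start:start + len(truth)], truth))
        results[rate] = sum(scores) / RUNS
    return results
```

With 7 rates and 5 runs, `impute` is called 35 times per signal, and a 10% gap on 35,334 samples removes 3,533 values.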
5. Results and Discussion
This section provides experiment results obtained from the proposed approach and compares its ability with the seven published approaches. Results are discussed in three parts, i.e., quantitative performance, visual performance, and execution times.
5.1. Quantitative Performance Comparison
Tables 1, 2, and 3 illustrate the average ability of various imputation methods for synthetic, simulated, and MAREL-Carnot time series using measurements as previously defined. For each missing level, the best results are highlighted in bold. These results demonstrate the improved performance of FSMUMI to complete missing data in low/uncorrelated multivariate time series.
Synthetic Dataset. Table 1 presents a comparison of 8 imputation methods on the synthetic dataset at 7 missing data levels (1–10%). The results clearly show that when the gap size is greater than 2%, the proposed method yields the highest similarity, $R^2$, and FA2, and the lowest RMSE and FB. With this dataset, na.approx gives the best performance at the smallest missing data level for all indices and is ranked second at other missing ratios for similarity and FA2 (2–5%), for RMSE (2–4%), and, at some of these rates, for $R^2$. These results can be explained by the fact that the synthetic data are generated by a function (6), while na.approx applies an interpolation function to estimate missing values. It is therefore easy to find a function that generates values close to the real ones when missing data rates are small. But this task becomes more difficult as the missing sample size rises; that is why the ability of na.approx decreases as missing data levels increase, especially at the 7.5% and 10% rates. Although this dataset never exactly repeats itself and our approach assumes recurrent data, FSMUMI proves its performance for the imputation task even as the missing size increases.
Among the considered methods, the FcM-based approach is less accurate at lower missing rates but it provides better results at larger missing ratios as regards the accuracy indices.
Simulated Dataset. Table 2 reports the evaluation results of the various imputation algorithms on the simulated dataset. The best values for each missing level are highlighted in bold. Our proposed method outperforms the other methods on the accuracy indices: it obtains the highest similarity and $R^2$ and the lowest RMSE at every missing ratio. However, on the other indices (FA2, FSD, and FB), FSMUMI is no longer the best method. It ranks first only at the 4% rate for the FB index and at the 10% ratio for FA2. In contrast to FSMUMI, DTWUMI provides the best results for the FSD indicator at all missing levels and for FA2 at the first 5 missing ratios (from 1% to 5%).
Different from the synthetic dataset, on the simulated dataset, the FcM-based method is always ranked the third at all missing rates for similarity and RMSE indicators. Following FcM is missForest algorithm for the both indices.
Although, in the second experiment, data are built by various functions but they are quite complex so that na.approx does not provide good results.
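The accuracy and shape indices discussed above can be computed as in the following sketch. The formulas are the commonly used definitions of similarity, RMSE, FB, FSD, and FA2 and are assumed here, since this section does not restate them. The function names and the toy values are hypothetical.

```python
import math
import statistics

def rmse(x, y):
    """Root mean square error between true values x and imputed values y."""
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)) / len(x))

def similarity(x, y):
    """Mean of 1 / (1 + |y_i - x_i| / range(x)); equals 1 for a perfect imputation."""
    span = max(x) - min(x)
    return sum(1 / (1 + abs(yi - xi) / span) for xi, yi in zip(x, y)) / len(x)

def fb(x, y):
    """Fractional bias (assumed absolute form): 2|mean(y) - mean(x)| / (mean(y) + mean(x))."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return 2 * abs(my - mx) / (my + mx)

def fsd(x, y):
    """Fraction of standard deviation: 2|sd(y) - sd(x)| / (sd(y) + sd(x))."""
    sx, sy = statistics.stdev(x), statistics.stdev(y)
    return 2 * abs(sy - sx) / (sy + sx)

def fa2(x, y):
    """Fraction of points with 0.5 <= y_i / x_i <= 2 (points with x_i == 0 count as misses)."""
    hits = sum(1 for xi, yi in zip(x, y) if xi != 0 and 0.5 <= yi / xi <= 2)
    return hits / len(x)

# Toy example (hypothetical values): an imputation that doubles every true value.
truth = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0, 4.0, 6.0, 8.0]
for name, index in (("similarity", similarity), ("RMSE", rmse),
                    ("FB", fb), ("FSD", fsd), ("FA2", fa2)):
    print(name, index(truth, doubled))
```

Under these definitions a good imputation has similarity and FA2 close to 1 and RMSE, FB, and FSD close to 0, which matches the way the indices are ranked in the tables.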
MAREL-Carnot Dataset. Once again, as reported in Table 3, our algorithm demonstrates its capability for the imputation task. The FSMUMI method generates the best results on the accuracy indices for almost all missing ratios (excluding the 2% missing level on all indices, and the 5% missing rate on score). But when considering shape indicators, FSMUMI only provides the highest FA2 values at several missing levels (3%, 5%-10%). In particular, our method demonstrates the ability to fill in incomplete data with large missing rates (7.5% and 10%): the highest similarity, , FA2, and the lowest RMSE, FSD (excluding at 7.5%), and FB. These gaps correspond to 110.4 and 147.2 days sampled at hourly frequency.
In contrast to the two datasets above, on the MAREL-Carnot data, na.approx gives quite good results: a consistent second or third rank for the accuracy indices (the order at 5% missing rate on score), the lowest FSD (from 3% to 5% missing rates), and FB at some other levels of missing data. But when looking at the shape of the imputation values generated by this method, it clearly gives the worst results (Figure 6).
Other approaches (including FcM-based imputation, MI, MICE, Amelia, and missForest) exploit the relations between attributes to estimate missing values. However, the three considered datasets have low correlations between variables (roughly 0.2 for MAREL-Carnot data, for simulated and synthetic datasets). As a result, these methods do not perform well when completing missing values in low/uncorrelated multivariate time series. In contrast, our algorithm shows its ability and stability when applied to the imputation task for this kind of data.
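A quick way to check whether a dataset falls into this low-correlation regime is to compute the Pearson correlation between pairs of variables. The sketch below is only an illustration: the two signals are hypothetical and are not taken from the datasets of this study, and the function name pearson is an assumption.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Two hypothetical signals with unrelated frequencies: each one is highly
# structured on its own, but neither is informative about the other.
x = [math.sin(0.10 * i) for i in range(1000)]
y = [math.cos(0.37 * i) for i in range(1000)]

r_xy = pearson(x, y)
print(round(r_xy, 3))  # close to 0
```

When the coefficient is this small, a method that predicts one variable from the others has little to work with, whereas a method that searches for similar patterns within each signal can still succeed.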
The DTWUMI approach was proposed to fill large gaps in low/uncorrelated multivariate time series. However, this method is not as powerful as the FSMUMI method. DTWUMI only produces the best results at the 2% missing level on the MAREL-Carnot dataset and is always at the second or the third rank at all the remaining missing rates on the MAREL-Carnot and the simulated datasets. This is because the DTWUMI method only finds the most similar window to a query either before a gap or after this gap, and it uses only one similarity measure, the DTW cost, to retrieve the most similar window. Another reason may be that DTWUMI directly uses the data from the window following or preceding the most similar window to complete the gap.
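The difference between the one-sided and the two-sided strategy can be sketched as follows. This is a simplified illustration under stated assumptions, not the FSMUMI implementation: it works on a single signal, uses a plain squared Euclidean distance as a stand-in for the fuzzy similarity measure, and sets the query length equal to the gap length. The names best_match and fill_gap are hypothetical.

```python
import math

def best_match(series, query, candidates):
    """Return the candidate start index whose window is closest to `query`
    (squared Euclidean distance, a stand-in for the fuzzy similarity measure)."""
    def distance(s):
        return sum((series[s + k] - query[k]) ** 2 for k in range(len(query)))
    return min(candidates, key=distance)

def fill_gap(series, gap_start, gap_len):
    """Fill series[gap_start:gap_start+gap_len] by averaging two candidates:
    the window following the best match of the query before the gap, and
    the window preceding the best match of the query after the gap."""
    q = gap_len                       # query length (assumption: same as the gap)
    gap_end = gap_start + gap_len
    query_before = series[gap_start - q:gap_start]
    query_after = series[gap_end:gap_end + q]

    # Search before the gap; the window that follows a match must end before the gap.
    sb = best_match(series, query_before, range(0, gap_start - q - gap_len + 1))
    candidate_b = series[sb + q:sb + q + gap_len]

    # Search after the gap; the window that precedes a match must start after the gap.
    sa = best_match(series, query_after, range(gap_end + gap_len, len(series) - q + 1))
    candidate_a = series[sa - gap_len:sa]

    return [(b + a) / 2 for b, a in zip(candidate_b, candidate_a)]

# Hypothetical recurrent signal with a gap of 20 samples.
truth = [math.sin(2 * math.pi * i / 50) for i in range(600)]
observed = truth[:300] + [None] * 20 + truth[320:]

imputed = fill_gap(observed, 300, 20)
max_error = max(abs(t - v) for t, v in zip(truth[300:320], imputed))
print(max_error)
```

A one-sided method would return only one of the two candidate windows. Averaging both makes the completed segment depend on the context on each side of the gap.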
5.2. Visual Performance Comparison
In this paper, we also compare the visual quality of the completion values yielded by the various algorithms. Figures 4 and 5 illustrate the form of the imputed values generated by different approaches on the synthetic series at two missing ratios, 1% and 5%.
At a 1% missing rate, the shape of the imputation values produced by the na.approx method is closer to that of the true values than the shape of the completion values given by our approach. However, at a 5% level of missing data, na.approx no longer performs well (Figure 5). In this case, the proposed method proves its relevance for the imputation task: the shape of FSMUMI's imputation data is very similar to that of the true values (Figure 5).
Looking at Figure 6, FSMUMI once again proves its capability for uncorrelated multivariate time series imputation: the completion values yielded by FSMUMI are virtually identical to the real data on the MAREL-Carnot dataset. When comparing DTWUMI with FSMUMI, it is clear that FSMUMI gives improved results (Figures 4, 5, and 6).
5.3. Computation Time
We also compare the computation time of each method on the synthetic series (in seconds). Table 4 indicates that the na.approx method requires the shortest running time and the DTWUMI approach takes the longest. The proposed method, FSMUMI, demands more execution time as missing rates increase. However, considering the quantitative and visual performance of FSMUMI for the imputation task (Table 1, Figures 5 and 6), the time required by the proposed approach is acceptable.
This paper proposes a novel approach for uncorrelated multivariate time series imputation using a fuzzy logic-based similarity measure, namely, FSMUMI. This method makes it possible to manage uncertainty with the comprehensibility of linguistic variables. FSMUMI has been tested on different datasets and compared with published algorithms (Amelia II, FcM, MI, MICE, missForest, na.approx, and DTWUMI) on accuracy and shape criteria. The visual quality of the values imputed by these approaches has also been investigated. The experimental results show that the proposed approach yields improved accuracy over previous methods in the case of multivariate time series having large gaps and low or no correlation between variables. However, the algorithm relies on the assumption of recurrent data and requires a sufficiently large dataset. That is, the patterns (in our case, the two queries before and after the considered gap) must exist somewhere in the database, so that missing values can be predicted from occurrences of these patterns in the data preceding or following the considered position.
In future work, we plan to (i) combine the FSMUMI method with other algorithms such as Random Forest or deep learning in order to efficiently fill incomplete values in any type of multivariate time series and (ii) investigate this approach applied to short-term/long-term forecasts in multivariate time series. We could also investigate complex fuzzy sets (), which have given good results with an adaptive scheme in the case of bivariate time series with a small dataset, instead of ordinary fuzzy sets.
The data used to support this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was kindly supported by the Ministry of Education and Training Vietnam International Education Development, the French government, and FEDER, the region Hauts-de-France (CPER 2014-2020 MARCO). The experiments were carried out using the CALCULCO computing platform, supported by SCoSI/ULCO (Univ. Littoral).
H. T. Ceong, H. J. Kim, and J. S. Park, “Discovery of and Recovery from Failure in a Coastal Marine USN Service,” Journal of Information and Communication Convergence Engineering, vol. 10, no. 1, pp. 11–20, 2012.
A. Lefebvre, MAREL Carnot Data and Metadata from Coriolis Data Centre, SEANOE, 2015.
H. Ichihashi, K. Honda, A. Notsu, and T. Yagi, “Fuzzy c-means classifier with deterministic initialization and missing value imputation,” in Proceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence, FOCI 2007, pp. 214–221, USA, April 2007.
P. Saravanan and P. Sailakshmi, “Missing value imputation using fuzzy possibilistic c means optimized with support vector regression and genetic algorithm,” Journal of Theoretical and Applied Information Technology, vol. 72, no. 1, pp. 34–39, 2015.
T. Furukawa, Ohnishi Shin-ichi, Yamanoi Takahiro. Missing Categorical Data Imputation for FCM Clusterings of Mixed Incomplete Data, 2014.
S. Oehmcke, O. Zielinski, and O. Kramer, “kNN ensembles with penalized DTW for multivariate time series imputation,” in Proceedings of the 2016 International Joint Conference on Neural Networks, IJCNN 2016, pp. 2774–2781, Canada, July 2016.
A. Stathopoulos, M. G. Karlaftis, and L. Dimitriou, “Fuzzy rule-based system approach to combining traffic count forecasts,” Transportation Research Record, no. 2183, pp. 120–128, 2010.
J. L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, New York, NY, USA, 1997.
E. R. Trivellore, M. L. James, H. Van John, and P. Solenberger, “A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models,” Survey Methodology, vol. 27, no. 1, pp. 85–96, 2001.
P. Royston, “Multiple imputation of missing values: Further update of ice, with an emphasis on interval censoring,” Stata Journal, vol. 7, no. 4, pp. 445–464, 2007.
A. D. Shah, J. W. Bartlett, J. Carpenter, O. Nicholas, and H. Hemingway, “Comparison of random forest and parametric imputation models for imputing missing data using MICE: A CALIBER study,” American Journal of Epidemiology, vol. 179, no. 6, pp. 764–774, 2014.
G. Andrew, H. Jennifer, S. Yu-Sung et al., Su Yu-Sung, 2015.
H.-H. Hsu, A. C. Yang, and M.-D. Lu, “KNN-DTW based missing value imputation for microarray time series data,” Journal of Computers, vol. 6, no. 3, pp. 418–425, 2011.
C. Y. Andy, H. Hui-Huang, and L. Ming-Da, in Microarray Gene Expression Data. In, Kinmen, Taiwan, 2009.
E. Kostadinova, V. Boeva, L. Boneva, and E. Tsiporkova, “An integrative DTW-based imputation method for gene expression time series data,” in Proceedings of the 2012 6th IEEE International Conference Intelligent Systems, IS 2012, pp. 258–263, Bulgaria, September 2012.
D. Li, J. Deogun, W. Spaulding, and B. Shuart, “Towards missing data imputation: a study of fuzzy K-means clustering method,” in Rough sets and current trends in computing, vol. 3066 of Lecture Notes in Comput. Sci., pp. 573–579, Springer, Berlin, 2004.
J. Tang, G. Zhang, Y. Wang, H. Wang, and F. Liu, “A hybrid approach to integrate fuzzy C-means based imputation method with genetic algorithm for missing traffic volume data estimation,” Transportation Research Part C: Emerging Technologies, vol. 51, pp. 29–40, 2015.
T. Furukawa, S.-I. Ohnishi, and T. Yamanoi, “On a fuzzy c-means algorithm for mixed incomplete data using partial distance and imputation,” in Proceedings of the International MultiConference of Engineers and Computer Scientists, IMECS 2014, pp. 319–323, Hong Kong, March 2014.
R. J. Almeida, M.-J. Lesot, B. Bouchon-Meunier, U. Kaymak, and G. Moyses, “Linguistic summaries of categorical time series for septic shock patient data,” in Proceedings of the 2013 IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2013, India, July 2013.
W. Pedrycz and F. Gomide, Fuzzy Systems Engineering: Toward Human-Centric Computing, John Wiley, Hoboken, NJ, USA, 2007.
T.-T. Phan, É. Poisson Caillault, A. Lefebvre, and A. Bigand, “Dynamic time warping-based imputation for univariate time series data,” Pattern Recognition Letters, 2017.
J. Garibaldi, C. Chao, and F. Tajul, “FuzzyR: Fuzzy Logic Toolkit for R2017,” R package version 2.1, 2017.
D. A. Paul, Missing Data Quantitative Applications in the Social Sciences, vol. 136, Sage Publication, 2001.
M. B. Christopher, Pattern Recognition and Machine Learning (Information Science and Statistics), Secaucus, NJ, USA, Springer-Verlag, 2006.
E. J. Keogh and M. J. Pazzani, “An indexing scheme for fast similarity search in large time series databases,” in Proceedings of the 11th International Conference on Scientific and Statistical Database Management (SSDBM '99), pp. 56–67, Cleveland, Ohio, USA, July 1999.
S. C. Goslee and D. L. Urban, “The ecodist package for dissimilarity-based analysis of ecological data,” Journal of Statistical Software, vol. 22, no. 7, pp. 1–19, 2007.
J. Honaker, G. King, and M. Blackwell, “Amelia II: a program for missing data,” Journal of Statistical Software, vol. 45, no. 7, pp. 1–47, 2011.
M. David, E. Dimitriadou, K. Hornik, A. Weingessel, and L. Friedrich, e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2015. R package version 1.6-7.
Y.-S. Su, A. Gelman, J. Hill, and M. Yajima, “Multiple imputation with diagnostics (mi) in R: Opening windows into the black box,” Journal of Statistical Software, vol. 45, no. 2, pp. 1–31, 2011.
S. van Buuren and K. Groothuis-Oudshoorn, “Mice: multivariate imputation by chained equations in R,” Journal of Statistical Software, vol. 45, no. 3, pp. 1–67, 2011.
A. Zeileis and G. Grothendieck, Andrews Felix. Zoo: S3 Infrastructure for Regular and Irregular Time Series (Z’s Ordered Observations), vol. 14, 2016.