Applied Computational Intelligence and Soft Computing

Volume 2018, Article ID 9095683, 15 pages

https://doi.org/10.1155/2018/9095683

## A New Fuzzy Logic-Based Similarity Measure Applied to Large Gap Imputation for Uncorrelated Multivariate Time Series

^{1}Univ. Littoral Côte d’Opale, EA 4491-LISIC, F-62228 Calais, France

^{2}Vietnam National University of Agriculture, Department of Computer Science, Hanoi, Vietnam

Correspondence should be addressed to Thi-Thu-Hong Phan; hongptvn@gmail.com and André Bigand; bigand@univ-littoral.fr

Received 30 May 2018; Accepted 25 July 2018; Published 9 August 2018

Academic Editor: Shyi-Ming Chen

Copyright © 2018 Thi-Thu-Hong Phan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The completion of missing values is a prevalent problem in many domains of pattern recognition and signal processing. Analyzing incomplete data may lead to a loss of power and unreliable results, especially when large subsequences are missing. Therefore, this paper introduces a new approach for filling successive missing values in low/uncorrelated multivariate time series, which allows managing a high level of uncertainty. To this end, we propose a novel fuzzy weighting-based similarity measure. The proposed method involves three main steps. Firstly, for each incomplete signal, the data before a gap and the data after this gap are considered as two separate reference time series, each with its own query window: the subsequence before the gap and the subsequence after the gap. We then find the subsequence most similar to the query window before the gap and the subsequence most similar to the query window after the gap. To find these similar windows, we build a new similarity measure based on fuzzy grades of basic similarity measures and on fuzzy logic rules. Finally, we fill in the gap with the average of the window following the first match and the window preceding the second. The experimental results demonstrate that the proposed approach outperforms state-of-the-art methods for multivariate time series having low/noncorrelated data but effective information on each signal.

#### 1. Introduction

Nowadays huge time series can be collected thanks to the availability of effective low-cost sensors, the wide deployment of remote sensing systems, Internet-based measurement networks, etc. However, collected data are often incomplete for various reasons such as sensor errors, transmission problems, incorrect measurements, bad weather conditions (outdoor sensors), manual maintenance, etc. This is particularly the case for marine samples [1] that we consider in this paper. For example, the MAREL-Carnot database characterizes sea water in the eastern English Channel, in France [2]. The data contain nineteen time series measured by sensors every 20 minutes, such as nitrate, fluorescence, phosphate, pH, and so on. The analysis of these data, with their remarkable size and shape, allows sea biologists to reveal events such as algal blooms, understand phytoplankton processes [3] in detail, detect sea pollution, and so on. But the data have a lot of missing values: 62.2% for phosphate, 59.9% for nitrate, 27.22% for another variable, etc., and the size of the gaps varies from one-third of an hour to several months.

Most proposed models for multivariate time series analysis have difficulties processing incomplete datasets, despite their powerful techniques: they usually require complete data. The question is then how missing values can be dealt with. Ignoring or deleting incomplete observations is a simple way around the problem, but it regularly causes serious difficulties. This is especially true for time series, where each value depends on the previous ones. Furthermore, when observed and unobserved data differ systematically, the analysis leads to biased and unreliable results [4]. Thus, it is important to propose a new technique to estimate the missing values. Imputation is a conventional way to handle incompleteness problems [5].

Imputation methods for multivariate time series commonly take advantage of the correlations between variables to predict missing data [6–11]. These relations make it possible to use the values of available features to estimate the missing values of other features. However, in multivariate datasets with low or no correlation (for instance the MAREL-Carnot dataset), the observed values of complete variables cannot be used to fill in attributes containing missing values. To handle missing data in this case, we must rely on the observed values of the very variable that contains the gap. Therefore the proposed method has to manage the high level of uncertainty of this kind of signal.

In particular, imperfect time series can be modelled using fuzzy sets. The fuzzy approach makes it possible to handle incomplete data and vague or imprecise circumstances [12], which create a highly uncertain environment for decision-making. This property enables modelling and short-term forecasting of traffic flow in urban arterial networks using multivariate traffic data [13, 14]. Recent approaches to urban traffic flow prediction [15] and to lane-change prediction [16] have been proposed with success. Furthermore, the successful use of fuzzy-based similarity measures in pattern recognition [17], in retrieval systems [12], and in recommendation systems [18] leads us to study their ability to complete missing values in uncorrelated multivariate time series. Wang et al. [19] proposed using information granules and fuzzy clustering for long-term time series forecasting. But to our knowledge, no existing application is devoted to completing large gap(s) in uncorrelated multivariate time series using a fuzzy-weighted similarity measure.

Thus, this paper proposes a new approach, named FSMUMI, to fill large gaps in low/uncorrelated multivariate time series by developing a new similarity measure based on fuzzy logic. However, estimating the distribution of missing values and of whole signals is very difficult, so our approach assumes that each signal contains effective patterns (or recurrent data).

The rest of this paper is organized as follows. Section 2 reviews related work on imputation methods and fuzzy similarity measures. Section 3 introduces our approach for completing large missing subsequences in low/uncorrelated multivariate time series. Next, Section 4 describes our experimental protocol for the imputation task. Section 5 presents results and discussion. Conclusions are drawn and future work is presented in the last section.

#### 2. Related Works

This section presents, first, related work about multivariate imputation methods, followed by a review on the fuzzy similarity measure and its applications.

##### 2.1. Classical Multivariate Imputation Methods

Up to now, numerous studies have been devoted to completing missing data in multivariate time series, such as [10, 11, 20–28]. Imputation techniques can be categorized from different perspectives: model-based, machine learning-based, and clustering-based imputation techniques.

From the model-based perspective, two main methods have been proposed. The first was introduced by Schafer [20]: under the hypothesis that all variables follow a multivariate normal distribution, this approach uses the multivariate normal (MVN) model to determine completion values. The second method, MICE, was developed by van Buuren et al. [21] and Raghunathan et al. [22]. It uses chained equations to fill in incomplete data: for each variable with missing values, MICE computes the imputed data by exploiting its relations with all other variables.

From the machine learning-based perspective, many studies focus on the completion of missing data in multivariate time series. Stekhoven and Bühlmann [6] implemented missForest, based on the Random Forest (RF) method, for multivariate imputation. Bonissone et al. [29] proposed a fuzzy version of RF that they named fuzzy random forest (FRF). At the moment FRF is only devoted to classification, and in our case FRF may only be of interest for separating correlated and uncorrelated variables in multivariate time series if necessary. In [25], Shah et al. investigated a variant of MICE that fills in each variable using the estimation generated from RF. The results showed that the combination of MICE and RF was more efficient than the original methods for multivariate imputation. K-Nearest Neighbors (k-NN)-based imputation is also a popular method for completing missing values [11, 26, 27, 30–32]. This approach identifies the most similar patterns in the space of available features to impute missing data.
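As a rough illustration of the k-NN idea described above, the following sketch fills a missing entry with the mean of that feature over the k most similar rows, where similarity is measured on the features that both rows have observed. It is a toy version written for this explanation, not the implementation of any cited work; the function name `knn_impute`, the Euclidean distance, and the unweighted mean are all assumptions.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Hypothetical helper: distance is Euclidean over the features that are
    observed in both rows being compared.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A neighbour must be a different row that has feature j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


data = [[1.0, 10.0], [1.1, 11.0], [5.0, 50.0], [1.05, None]]
result = knn_impute(data, k=2)
```

Because neighbours are selected through the other features, this strategy relies on relations between variables, which is the limitation discussed in this paper for low/uncorrelated series.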

Besides these principal techniques, clustering-based imputation approaches are considered powerful tools for completing missing values thanks to their ability to detect similar patterns. The objective of these techniques is to separate the data into several clusters while maximizing intracluster similarity and minimizing intercluster similarity. Li et al. [33] proposed the k-means clustering imputation technique, which estimates missing values using the final cluster information. Fuzzy c-means (FcM) clustering is a common extension of k-means. The squared norm is applied to measure the similarity between cluster centers and data points. Different applications based on FcM have been investigated for the imputation task [7–9, 34–38]. Wang et al. [19] used FcM based on DTW to successfully predict time series in long-term forecasting.
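To make the difference with hard clustering concrete, the sketch below computes the standard fuzzy c-means membership degrees of one data point with respect to fixed cluster centers. It is a minimal illustration under stated assumptions: the function name `fcm_memberships` is hypothetical, the fuzzifier `m` defaults to 2, distances are Euclidean, and the iterative update of the centers is left out.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Membership of one point in each cluster (standard FcM update, m > 1)."""
    dists = [math.dist(point, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    # u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)); the memberships sum to 1.
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


u = fcm_memberships((1.0,), [(0.0,), (3.0,)])
```

Each point thus receives a graded membership in every cluster instead of a single label, and the closer center receives the larger degree.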

In general, most of the imputation algorithms for multivariate time series take advantage of dependencies between attributes to predict missing values.

##### 2.2. Methods Based on Fuzzy Similarity Measure

Similarity-based approaches are a promising tool for time series analysis. However, many of these techniques rely on parameter tuning, and they may have shortcomings due to dependencies between variables. The objective of this study is to fill large gaps in *uncorrelated multivariate time series*. Thus, we have to deal with a high level of uncertainty. Mikalsen et al. [39] proposed using Gaussian mixture models (GMM) and a cluster kernel to deal with uncertainty. Their method needs ensemble learning with numerous learning datasets, which are not available in our case at the moment (marine data). So we have chosen to model this global uncertainty using fuzzy sets (FS), introduced by Zadeh [40]. These techniques consider that measurements have inherent vagueness rather than randomness.

**Uncertainty** is classically presented using three conceptually distinctive characteristics: fuzziness, randomness and incompleteness. This classification is interesting for many applications, like sensor management (image processing, speech processing, and time series processing) and practical decision-making. This paper focuses on (sensor) measurements treatment but is also relevant for other applications.

**Incompleteness** often affects time series prediction (for example, time series obtained from marine data such as salinity). So it seems natural to use fuzzy similarity between subsequences of time series to deal with these three kinds of uncertainty (fuzziness, randomness, and incompleteness). Fuzzy sets are now well known, so we only recall the basic definition of a fuzzy set. Considering the universe $X$, a fuzzy set $A$ in $X$ is characterized by a fuzzy membership function $\mu_A$:

$$\mu_A : X \rightarrow [0, 1] \qquad (1)$$

where $\mu_A(x)$ represents the membership of $x$ to $A$ and is associated with the uncertainty of $x$. In our case, we will consider similarity values between the subsequences, as defined in the following. One solution to deal with the uncertainty brought by multivariate time series is to use the concept of fuzzy time series [41]. In this framework, the variable observations are considered as fuzzy numbers instead of real numbers. In our case the same modelling is applied to distance measures between subsequences, and we then compute the fuzzy similarity between these subsequences to find similar windows in order to estimate the missing values in observations.
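The notion of a membership function can be illustrated with a short sketch. The triangular shape and the function name `triangular_membership` are assumptions chosen for simplicity; they are not taken from this paper, whose own fuzzy sets are specified through its figures.

```python
def triangular_membership(x, left, peak, right):
    """Degree in [0, 1] to which x belongs to a triangular fuzzy set.

    Assumes left < peak < right. The degree is 0 outside (left, right),
    1 at the peak, and varies linearly in between.
    """
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# A hypothetical fuzzy set centered on 0.5 over the universe [0, 1].
grades = [triangular_membership(v, 0.0, 0.5, 1.0) for v in (0.25, 0.5, 0.75, 1.2)]
```

Unlike a crisp set, which would only answer "inside" or "outside", the function returns intermediate degrees, which is what allows a similarity value to be partly "high" and partly "medium" at the same time.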

**Fuzzy similarity** is a generalization of the classical concept of equivalence and defines the resemblance between two objects (here subsequences of time series). Similarity measures of fuzzy values have been compared in [42] and extended in [43]. In [42], Pappis and Karacapilidis presented three main kinds of similarity measures of fuzzy values: (i) measures based on the operations of union and intersection, (ii) measures based on the maximum difference, and (iii) measures based on the difference and the sum of membership grades.
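The three families listed above can be sketched on two fuzzy values given as lists of membership grades over the same elements. The formulas below are commonly quoted forms of these measures (ratio of summed minima to summed maxima, one minus the largest grade difference, one minus the ratio of summed differences to summed grades); the function names and the convention of returning 1.0 for two empty fuzzy values are assumptions made for this illustration.

```python
def sim_union_intersection(a, b):
    """(i) |A intersect B| / |A union B| with min as intersection, max as union."""
    den = sum(max(x, y) for x, y in zip(a, b))
    return 1.0 if den == 0 else sum(min(x, y) for x, y in zip(a, b)) / den

def sim_max_difference(a, b):
    """(ii) One minus the largest difference between membership grades."""
    return 1.0 - max(abs(x - y) for x, y in zip(a, b))

def sim_difference_sum(a, b):
    """(iii) One minus summed grade differences over summed grades."""
    den = sum(x + y for x, y in zip(a, b))
    return 1.0 if den == 0 else 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / den


A = [0.5, 1.0, 0.0]
B = [0.5, 0.5, 0.5]
scores = (sim_union_intersection(A, B), sim_max_difference(A, B), sim_difference_sum(A, B))
```

All three return 1 for identical fuzzy values and decrease as the grades diverge, but they weight local and global differences differently, which is one motivation for combining several measures.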

In [44, 45], the authors used these definitions to propose a distance metric for a space of linguistic summaries based on fuzzy protoforms. Almeida et al. extended this work to put forward linguistic summaries of categorical time series [46]. The introduced similarity measure takes into account not only the linguistic meaning of the summaries but also the numerical characteristic attached to them. In the same way, Gupta et al. [12] used this approach to create a hybrid similarity measure based on fuzzy logic, which is used to retrieve relevant documents. In other research, Al-shamri and Al-Ashwal presented fuzzy weightings of popular similarity measures for memory-based collaborative recommender systems [18].

Concerning the similarity between two subsequences of time series, we can use the DTW cost as a similarity measure. However, to deal with the high level of uncertainty of the processed signals, numerous other measures can be used, such as the cosine similarity, the Euclidean distance, and the Pearson correlation coefficient. Moreover, a fuzzy-weighted combination of scores generated from different similarity measures can achieve better retrieval results than a single similarity measure [12, 18].

Based on the same concepts, we propose using a fuzzy rules interpolation scheme between grades of membership of fuzzy values. This method makes it possible to build a new hybrid similarity measure for finding similar values between subsequences of time series.

#### 3. Proposed Approach

The proposed imputation method is based on the retrieval and the similarity comparison of available subsequences. In order to compare the subsequences, we create a new similarity measure applying a multiple fuzzy rules interpolation. This section is divided into two parts. Firstly, we focus on the way to compute a new similarity measure between subsequences. Then, we provide details of the proposed approach (namely, Fuzzy Similarity Measure Based Uncorrelated Multivariate Imputation, FSMUMI) to impute the successive missing values of low/uncorrelated multivariate time series.

##### 3.1. Fuzzy-Weighted Similarity Measure between Subsequences

To introduce a new similarity measure that uses multiple fuzzy rules interpolation to solve the missing-data problem, we have to define an information granule, as introduced by Pedrycz [47]. The principle of justifiable granularity of experimental data is based on two conditions: (i) the numeric evidence accumulated within the bounds of numeric data has to be as high as possible and, (ii) at the same time, the information granule should be as specific as possible [19].

To answer the first condition, we take into account 3 different distance measures between two subsequences $Q$ ($Q = \{q_i, i = 1, \ldots, T\}$) and $R$ ($R = \{r_i, i = 1, \ldots, T\}$): Cosine distance and Euclidean distance (these two measures are widely used in the literature), and Similarity distance (presented in our previous study [48]). These three measures are defined as follows:

(i) Cosine distance is computed by (2). This coefficient is the cosine of the angle between $Q$ and $R$:

$$\mathrm{Cosine}(Q, R) = \frac{\sum_{i=1}^{T} q_i r_i}{\sqrt{\sum_{i=1}^{T} q_i^2}\,\sqrt{\sum_{i=1}^{T} r_i^2}} \qquad (2)$$

(ii) Euclidean distance is calculated by (3):

$$\mathrm{ED}(Q, R) = \sqrt{\sum_{i=1}^{T} (q_i - r_i)^2} \qquad (3)$$

To satisfy the input condition of the fuzzy logic rules, we normalize this distance to $[0, 1]$.

(iii) Similarity measure is defined by (4). This measure indicates the similarity percentage between $Q$ and $R$.
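The three measures can be sketched in a few lines for two windows of equal length, written here as `q` and `r`. The cosine and Euclidean functions follow their standard definitions. Two parts are assumptions introduced only for illustration and should not be read as the paper's exact formulas: the mapping `1 / (1 + d)` used to bring the Euclidean distance into $[0, 1]$, and the form of `similarity`, which averages point-wise closeness scaled by the range of `q`.

```python
import math

def cosine(q, r):
    """Cosine of the angle between q and r (assumes neither is all zeros)."""
    num = sum(a * b for a, b in zip(q, r))
    den = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in r))
    return num / den

def euclidean(q, r):
    """Standard Euclidean distance between two equal-length windows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, r)))

def normalized_euclidean(q, r):
    # ASSUMPTION: 1 / (1 + d) is one common way to map [0, inf) into (0, 1].
    return 1.0 / (1.0 + euclidean(q, r))

def similarity(q, r):
    # ASSUMPTION: illustrative "similarity percentage"; each point scores
    # 1 / (1 + |q_i - r_i| / range(q)) and the scores are averaged.
    span = max(q) - min(q)
    if span == 0:
        span = 1.0  # arbitrary fallback for a constant window
    return sum(1.0 / (1.0 + abs(a - b) / span) for a, b in zip(q, r)) / len(q)


q = [1.0, 2.0, 3.0]
attributes = (cosine(q, q), normalized_euclidean(q, q), similarity(q, q))
```

With these conventions all three attributes lie in $[0, 1]$ for nonnegative data and equal 1 for identical windows, so they can be fed to the same fuzzy rules.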

To answer the second condition, we use these 3 distance measures (or attributes) to generate 4 fuzzy similarities (see Figure 2), which are then applied to a fuzzy inference system (see Figure 1) using the cylindrical extension of the 3 attributes; this system provides 3 coefficients used to calculate a new similarity measure. The universe of discourse of each distance measure is normalized to the interval $[0, 1]$.