Computational Intelligence and Neuroscience

Volume 2017, Article ID 8734214, 11 pages

https://doi.org/10.1155/2017/8734214

## A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method

Department of Information Management, National Yunlin University of Science and Technology, Yunlin, Taiwan

Correspondence should be addressed to Ching-Hsue Cheng; chcheng@yuntech.edu.tw

Received 20 June 2017; Revised 18 September 2017; Accepted 27 September 2017; Published 9 November 2017

Academic Editor: Amparo Alonso-Betanzos

Copyright © 2017 Jun-He Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Reservoirs are important for households and impact the national economy. This paper proposes a time-series forecasting model based on estimating missing values followed by variable selection to forecast a reservoir’s water level. This study collected data from the Taiwan Shimen Reservoir, together with daily atmospheric data, from 2008 to 2015. The two datasets were concatenated by date into a single integrated research dataset. The proposed time-series forecasting model has three foci. First, this study compares five imputation methods against directly deleting records with missing values. Second, we identified the key variables via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir’s water level, which is compared with the listing methods in terms of forecasting error. The experimental results indicate that the Random Forest forecasting model, when applied with variable selection, has better forecasting performance than the listing models using full variables. In addition, the experiments show that the proposed variable selection helps all five forecasting methods used here to improve their forecasting capability.

#### 1. Introduction

Shimen Reservoir is located between Taoyuan City and Hsinchu County in Taiwan. The Shimen Reservoir offers irrigation, hydroelectricity, water supply, flood control, tourism, and so on. This reservoir is very important to the area and offers livelihood, agriculture, flood control, and economic development. Thus, the authorities should plan and manage water resources comprehensively via accurate forecasting.

Previous studies of reservoir water levels have identified three important problems:

(1) There are few studies of reservoir water levels: related studies [1–4] in the hydrological field use machine learning methods to forecast water levels. They focused on water level forecasting of the flood stages in pumping stations, reservoirs, lakes, basins, and so on. Most of these studies collected data about typhoons, specific climate conditions, seasonal rainfall, or water levels.

(2) Only a few variables have been used in reservoir water level forecasting: the literature shows only a few related forecasting studies [5, 6]. These used water level as the dependent variable, while the independent variables were limited to rainfall, water level, and time lags of these two variables. Because so few independent variables were considered, it is difficult to determine the key variable set for the reservoir water level.

(3) No imputation method has been applied to reservoir water level datasets: previous studies of water level forecasting in the hydrological field show that the collected data are continuous and long-term, but most did not explain how to deal with missing values caused by human error or mechanical failure.

To address these problems, this study collected data on the Taiwan Shimen Reservoir and the corresponding daily atmospheric datasets. The two datasets were concatenated into a single dataset based on the date. Next, this study imputed the missing values and selected the best imputation method to build the forecasting models. We then evaluated the variables based on different models.
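The date-based concatenation of the two datasets can be sketched as a join on a shared date key. This is a minimal illustration with pandas; the column names (`date`, `water_level`, `rainfall`) are placeholders, not the study’s actual schema.

```python
# Sketch of merging reservoir records and atmospheric records on the date.
# Column names are illustrative, not the study's actual variables.
import pandas as pd

reservoir = pd.DataFrame({
    "date": pd.to_datetime(["2008-01-01", "2008-01-02", "2008-01-03"]),
    "water_level": [240.1, 240.3, None],   # None marks a missing observation
})
atmosphere = pd.DataFrame({
    "date": pd.to_datetime(["2008-01-01", "2008-01-02", "2008-01-03"]),
    "rainfall": [0.0, 12.5, 3.2],          # mm
})

# Inner join on the shared date column yields one integrated record per day.
merged = pd.merge(reservoir, atmosphere, on="date", how="inner")
print(merged.shape)  # → (3, 3)
```

Joining on the date (rather than positionally stacking rows) keeps the daily records aligned even when one source has gaps.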

This paper includes five sections: Section 2 is related work; Section 3 proposes the research methodology and introduces the concepts of the imputation methods, variable selection, and the forecasting model; Section 4 verifies the proposed model and compares it with the listing models; Section 5 concludes.

#### 2. Related Work

This section introduces a forecast method of machine learning, imputation techniques, and variable selection.

##### 2.1. Machine Learning Forecast (Regression)

###### 2.1.1. RBF Network

Radial Basis Function Networks were proposed by Broomhead and Lowe in 1988 [7]. The RBF Network is a simple supervised feedforward network that avoids iterative training processes and trains the data in one stage [8]. It is a type of ANN applied to supervised learning problems such as regression, classification, and time-series prediction [9]. The RBF Network consists of three layers: an input layer, a hidden layer, and an output layer. The input layer is the set of source nodes, the second layer is a hidden layer of high dimensionality, and the output layer gives the response of the network to the activation patterns applied to the input layer [10]. The advantages of the RBF approach are the (partial) linearity in the parameters and the availability of fast and efficient training methods [11]. The use of radial basis functions draws on a number of different concepts, including function approximation, noisy interpolation, density estimation, and optimal classification theory [12].
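The one-stage training and linearity in the parameters mentioned above can be made concrete: with the hidden-layer centres fixed, the output weights are obtained by a single linear least-squares solve. A minimal NumPy sketch on a synthetic 1-D target (all values here are illustrative):

```python
# Minimal RBF regression sketch: fixed Gaussian centres in the hidden layer,
# output weights found in one least-squares step (no iterative training),
# illustrating the network's linearity in its output parameters.
import numpy as np

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)                  # synthetic target function

centres = np.linspace(0.0, 1.0, 10)        # hidden-layer centres (assumed fixed)
width = 0.1                                # Gaussian width (illustrative choice)

# Hidden-layer activations: one Gaussian basis function per centre.
phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

# Output weights via ordinary least squares -- the "one stage" of training.
w, *_ = np.linalg.lstsq(phi, y, rcond=None)
pred = phi @ w
print(round(float(np.mean((pred - y) ** 2)), 6))  # small training error
```

Because only the output weights are learned, training reduces to solving a linear system, which is what makes RBF training fast relative to backpropagation-based networks.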

###### 2.1.2. Kstar

Kstar is an instance-based classifier that differs from other instance-based learners in that it uses an entropy-based distance function [13]. The lazy family of data mining classifiers supports incremental learning; it contains classifiers such as Kstar and takes less time for training but more time for prediction [14]. Kstar provides a consistent approach to handling symbolic attributes, real-valued attributes, and missing values [15]. Its entropy-based distance function can also be used for instance-based regression: the predicted value of a test instance is derived from the values of the training instances that are most similar to it under the Kstar distance [16].

###### 2.1.3. KNN

The k-Nearest-Neighbor (kNN) classifier offers a good classification accuracy rate for activity classification [17]. The kNN algorithm is based on the notion that similar instances exhibit similar behavior; thus, new input instances are predicted according to the most similar stored neighboring instances [18].
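In the regression setting used for water level forecasting, kNN predicts a query’s value as the mean of its k closest stored instances. A self-contained sketch (data and k are illustrative):

```python
# Toy k-nearest-neighbour regression: predict a query value as the mean of
# the target values of its k closest training instances (Euclidean distance).
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to each instance
    nearest = np.argsort(dists)[:k]                    # indices of k closest
    return float(np.mean(y_train[nearest]))            # average their targets

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])                # y = x for illustration

print(knn_predict(X, y, np.array([2.1]), k=3))  # → 2.0 (mean of y at x = 1, 2, 3)
```

All computation happens at prediction time, which matches the “lazy” behaviour noted for the Kstar family above: cheap training, more expensive prediction.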

##### 2.2. Random Forest

A Random Forest can be applied to classification, regression, and unsupervised learning [19]. It is an ensemble learning method similar to bagging: each decision tree acts as an individual classifier, and the Random Forest produces its output by voting over (or, for regression, averaging) the predictions of all trees. It can solve both classification and regression problems, and it is simple and easily parallelized [20].
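A brief regression example with scikit-learn’s `RandomForestRegressor` shows the ensemble behaviour described above; the synthetic features stand in for atmospheric variables and are not the study’s actual variable set.

```python
# Illustrative Random Forest regression: each tree is fit on a bootstrap
# sample and the forest averages the trees' predictions.
# The features and target are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 3))                  # e.g. rainfall, inflow, temperature
y = 2 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2]       # simple linear target for illustration

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # training R^2, close to 1 on this easy target
```

Because each tree is trained independently on its own bootstrap sample, fitting the trees parallelizes naturally, which is the simplicity advantage cited above.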

##### 2.3. Random Tree

Random Tree is an ensemble learning algorithm that generates many individual learners and employs a bagging idea to produce a random set of data in the construction of a decision tree [21]. Random Tree classifiers can deal with regression and classification problems. Random trees can be generated efficiently and can be combined into large sets of random trees. This generally leads to accurate models [22]. The Random Tree classifier takes the input feature vector and classifies it with every tree in the forest. It then outputs the class label that received the majority of the votes [23].

##### 2.4. Imputation

The daily atmospheric data may have missing values due to human error or machine failure. Many previous studies have shown that statistical bias occurs when records with missing values are simply deleted; imputing the data can therefore significantly improve the quality of the dataset, whereas biased results may cause poor performance in the models built afterwards [24]. Single imputation methods have several advantages, such as a wider scope than multiple imputation methods; sometimes it is more important to fill in the missing values than to estimate the parameters [25]. The median of nearby points method orders the nearby values and selects their median to replace the missing value; its advantage is that the replacement is an actual value occurring in the data [26]. The series mean method replaces a missing value directly with the variable’s average. The regression imputation method uses simple linear regression to estimate missing values and replace them. The mean of nearby points method uses the mean of the nearby values, where the number of nearby values is set with a “span of nearby points” option [27]. Linear imputation is most readily applicable to continuous explanatory variables [28].
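Three of these single-imputation strategies can be sketched in a few lines of pandas on a short series with one gap (the values are illustrative; on this symmetric series all three methods happen to agree):

```python
# Sketch of three single-imputation strategies on a series with one gap.
import pandas as pd

s = pd.Series([10.0, 12.0, None, 16.0, 18.0])

series_mean = s.fillna(s.mean())          # series mean: overall average of observed values
linear = s.interpolate(method="linear")   # linear: straight line between the neighbours
# Median of nearby points: median over a centred window around the gap
# (here a span of two points on each side).
nearby_median = s.fillna(s.rolling(5, min_periods=1, center=True).median())

print(series_mean[2], linear[2], nearby_median[2])  # → 14.0 14.0 14.0
```

On real, asymmetric data the methods diverge, which is why the proposed model compares all of them (plus regression imputation and deletion) before choosing one.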

##### 2.5. Variable Selection

The variable selection method identifies the key variables that actually influence the forecasting target from among many candidate variables, and then deletes the unimportant variables to improve the model’s efficiency. It can address high-dimensional and complex problems. Previous studies in several fields have shown that variable selection can improve the forecasting efficiency of machine learning methods [29–32].

Variable selection is an important technique in data preprocessing. It removes irrelevant data and improves the accuracy and comprehensibility of the results [33]. Variable selection methods can be categorized into three classes: filter, wrapper, and embedded. Filter methods use statistical techniques to select variables and have better generalization ability and lower computational demands. Wrapper methods use classifiers to identify the best subset of variables. Embedded methods have a deeper interaction between variable selection and the construction of the classifier [34].

Filter models utilize statistical techniques such as principal component analysis (PCA), factor analysis (FA), independent component analysis, and discriminant analysis as indirect performance measures; these are mostly based on distance and information measures [35].

PCA transforms a set of feature columns in the dataset into a projection of the feature space with lower dimensionality. FA is a generalization of PCA; the main difference between PCA and FA is that FA allows noise to have nonspherical shape while transforming the data. The main goal of both PCA and FA is to transform the coordinate system such that correlation between system variables is minimized [36].

There are several methods to decide how many factors should be extracted. The most widely used method for determining the number of factors is to retain those with eigenvalues greater than one [10].
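The eigenvalues-greater-than-one rule can be sketched on the correlation matrix of synthetic data generated from two latent factors; the data and dimensions here are hypothetical, chosen only so the criterion recovers the known factor count.

```python
# The "eigenvalues greater than one" criterion for choosing the number of
# factors, sketched on synthetic data driven by two latent factors.
import numpy as np

rng = np.random.default_rng(1)
base = rng.normal(size=(300, 2))           # two latent factors

# Six observed variables: three noisy copies of each latent factor.
cols = ([base[:, [0]] + 0.1 * rng.normal(size=(300, 1)) for _ in range(3)]
        + [base[:, [1]] + 0.1 * rng.normal(size=(300, 1)) for _ in range(3)])
X = np.hstack(cols)

corr = np.corrcoef(X, rowvar=False)        # 6x6 correlation matrix
eigvals = np.linalg.eigvalsh(corr)[::-1]   # eigenvalues, descending

n_factors = int(np.sum(eigvals > 1.0))     # keep factors with eigenvalue > 1
print(n_factors)  # → 2, matching the two latent factors
```

Each group of three correlated variables contributes one eigenvalue close to 3 while the remaining eigenvalues fall well below 1, so the criterion selects exactly two factors.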

#### 3. Proposed Model

Reservoirs are important domestically as well as for national defense and economic development. Thus, reservoir water levels should be forecast over a long period, and water resources should be planned and managed comprehensively to achieve high cost-effectiveness. This paper proposes a time-series forecasting model based on the imputation of missing values and variable selection. First, the proposed model applies five imputation methods (i.e., median of nearby points, series mean, mean of nearby points, linear, and regression imputation) and compares them with a deletion strategy for handling the missing values. Second, to identify the key variables that influence the daily water level, the proposed method ranks the importance of the atmospheric variables via factor analysis and then sequentially removes the unimportant variables. Finally, the proposed model uses the Random Forest machine learning method to build a forecasting model of the reservoir water level and compares it with the other methods. The proposed model can be partitioned into four parts: data preprocessing, imputation and feature selection, model building, and accuracy evaluation. The procedure is shown in Figure 1.
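The four-part procedure can be sketched end to end on synthetic data. The data, variable names, and the correlation-based ranking below are placeholders (the paper ranks variables via factor analysis and compares several imputation methods); this is only a skeleton of the pipeline, not the study’s implementation.

```python
# End-to-end sketch of the pipeline: (1) impute missing values,
# (2) rank variables, (3) drop the weakest, (4) fit a Random Forest.
# Data and variable names are synthetic placeholders, not the Shimen dataset;
# correlation ranking stands in for the paper's factor-analysis ranking.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.uniform(size=(150, 4)),
                  columns=["rainfall", "inflow", "temperature", "noise"])
df.loc[rng.choice(150, 10, replace=False), "rainfall"] = np.nan  # inject gaps
y = 3 * df["rainfall"].fillna(df["rainfall"].mean()) + df["inflow"]

# (1) Imputation: linear interpolation, backfilling any leading gap.
df_imp = df.interpolate(method="linear").bfill()

# (2)-(3) Rank variables by |correlation| with the target and drop the weakest.
ranking = df_imp.corrwith(y).abs().sort_values(ascending=False)
selected = list(ranking.index[:-1])

# (4) Fit the forecasting model on the selected variables only.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(df_imp[selected], y)
print(selected, round(model.score(df_imp[selected], y), 3))
```

Dropping variables one at a time and refitting, as the proposed model does, turns this skeleton into the sequential variable selection evaluated in Section 4.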