#### Abstract

Data-driven methods are particularly useful for streamflow forecasting when the underlying physical relationships are not entirely clear. However, obtaining a data-driven model that is sufficiently accurate for streamflow forecasting often remains challenging. This study proposes a new data-driven model that combines variational mode decomposition (VMD) with prediction models for daily streamflow forecasting. The prediction models include the autoregressive moving average (ARMA), the gradient boosting regression tree (GBRT), the support vector regression (SVR), and the backpropagation neural network (BPNN). The recently developed VMD algorithm was first applied to extract the multiscale features of the entire time series and to decompose it into several subseries, which were then predicted using the forecast models. The ensemble forecast was finally reconstructed by summing. Historical daily streamflow series recorded at the Wushan and Weijiabao hydrologic stations in China from 1 January 2001 to 31 December 2014 were investigated using the proposed VMD-based models. Three quantitative evaluation indexes, the Nash–Sutcliffe efficiency coefficient (NSE), the root mean square error (RMSE), and the mean absolute error (MAE), were used to evaluate the predicted results of the proposed VMD-based models and to compare them with those of two benchmark models: the nondecomposition method (BPNN) and the BPNN based on ensemble empirical mode decomposition (EEMD-BPNN). Furthermore, a comparative analysis of the performance of the VMD-BPNN model under different forecast periods (1, 3, 5, and 7 days) was performed. The results evidenced that the proposed VMD-based models consistently achieved good performance in the testing stage and had relatively good stability and representativeness. Specifically, the VMD-BPNN model offered the best balance between prediction accuracy and computational efficiency.
The results show that the reliability of the forecasts decreased as the lead time increased; nevertheless, the model performed satisfactorily up to a 7-day lead time. The VMD-BPNN model can therefore be applied as a promising, reliable, and robust prediction tool for short-term streamflow forecasting.

#### 1. Introduction

Streamflow forecasting, especially daily streamflow forecasting, is an important task for optimizing the allocation of water resources and providing effective flood control measures [1]. For this reason, streamflow forecasting has received significant attention from the scientific community in recent decades, and many models have proved instrumental in improving the accuracy of river flow forecasts [2, 3].

Numerous classical black-box time series models, including the autoregression (AR) model, the autoregressive moving average (ARMA) model, and the autoregressive integrated moving average (ARIMA) model [4, 5], have been used in streamflow forecasting since the 1970s. These models are linear and therefore miss the nonlinear and nonstationary characteristics that are hidden in real streamflow series. Hence, researchers have turned to machine learning techniques with strong nonlinear mapping abilities to overcome these drawbacks, including decision trees such as the gradient boosted regression tree (GBRT) [6, 7] and kernel methods such as the support vector machine (SVM) [8, 9] and the support vector regression (SVR) [10]. The SVM [11, 12] and the SVR [13, 14] have both been used in streamflow forecasting research. It is important to remark that artificial neural networks (ANNs) represent the most widely applied artificial intelligence technique for modelling [15–17] and have been used extensively in hydrology [18–20]. The backpropagation neural network (BPNN), which trains an ANN by the error backpropagation algorithm, is the most popular neural network. Many extensions and modifications of the BPNN have been developed in different fields in the past few years [21–23], including their relatively successful application to rainfall-runoff modelling [24–26]. However, some limitations still need to be addressed, chiefly slow learning speeds and overfitting [27, 28] caused by insufficient generalization ability. In hydrology in particular, it is difficult for the BPNN to achieve satisfactory prediction accuracy owing to the great heterogeneity of the rainfall-runoff process.

Although the abovementioned data-driven models have become appropriate alternatives to knowledge-driven models in hydrological forecasting and are both flexible and useful, they have limitations regarding the highly nonstationary character of hydrological series, which vary over a range of scales (e.g., from daily to multidecadal) [29]. For this reason, recent developments in signal processing provide data-driven models with tools that can deal with nonstationary hydrological signals and provide time-scale localization. Many streamflow forecasting models are based on multidimensional feature extraction and feature learning. Wavelet transforms (WT) [30], empirical mode decomposition (EMD) [31], and ensemble empirical mode decomposition (EEMD) [32] are commonly used preprocessing techniques for feature extraction from the original data. WT has been applied in reservoir inflow modelling [33]. Although it possesses good time-frequency localization characteristics, its decomposition results depend mainly on the choice of mother wavelet and decomposition level, and its adaptability is relatively poor [34]. Several EMD-based and EEMD-based modified models have been applied successfully to river flow forecasting [35–37]. However, one of the main drawbacks of the EMD is the frequent appearance of mode mixing (i.e., the scale separation problem). EEMD is a significant improvement of the EMD since it adds white noise to compensate for this drawback [32, 38]. Li and Liang [39] also showed that, compared with the EMD and WT, EEMD has some special characteristics: it is self-adaptive, direct, intuitive, and empirical [40]. However, the EEMD involves a large number of calculations, and the number of modal components is uncontrollable, which easily leads to nonconvergence and subsequently affects the accuracy of the algorithm [41].

To fill the aforementioned gaps, this paper adopts a theoretically well-founded and robust decomposition method (VMD) and combines it with data-driven models, referred to as VMD-based models, to improve the accuracy and stability of runoff forecasting. VMD was introduced into the prediction problem as a new adaptive time-frequency decomposition method that is substantially more robust to sampling and noise than existing decomposition approaches such as EMD and EEMD. Different VMD-based methods have been proposed and successfully applied for chatter detection in milling processes [42], vibroacoustic feature extraction [43], and container throughput forecasting [44]. However, they have rarely been applied to forecasting highly nonlinear and nonstationary streamflow series. In general, in streamflow forecasting, the model's performance (in terms of accuracy) deteriorates as the lead time increases [45, 46]. Therefore, a single model based on VMD and BPNN, referred to as VMD-BPNN, is proposed in this study to predict daily streamflow for different forecast periods (1, 3, 5, and 7 d).

In summary, the primary objective of this paper is to introduce a specific and novel decomposition method (VMD) and data-driven models, referred to as VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN, which preserve the characteristics of hydrologic time series. To the best of the authors' knowledge, no existing study has compared the VMD and EEMD decomposition methods together with the GBRT, ARMA, SVR, and BPNN feature learning methods for streamflow forecasting with respect to both model accuracy and forecasting precision. The paper explores for the first time a new realm of hydrological modelling by combining VMD with prediction models and assessing both the model's accuracy and its flow forecasting capability, which is a novel application of these decomposition algorithms in hydrological forecasting. In addition, different lead-time forecasts were investigated to assess the accuracy of the models.

#### 2. Methodology

##### 2.1. VMD-Based Hybrid Model

To deal with the nonstationary problem of streamflows, a VMD-based hybrid model that incorporates the VMD algorithm with the ARMA, GBRT, SVR, and BPNN models was built for simulating daily streamflow in the Wei River Basin. VMD was used to decompose the original time series into several subseries, and ARMA, GBRT, SVR, and BPNN were used to build the forecast model for each subseries. The subseries obtained by VMD were relatively stationary and could provide information about the original data structure and its periodicity. Therefore, the performance of the forecast models was expected to improve by providing useful information at various resolution levels. The VMD-based models were based on the decomposition-ensemble framework, which includes the following three stages: (1) decompose the runoff time series into a collection of IMFs using VMD, (2) forecast each subseries using prediction models such as ARMA, GBRT, SVR, and BPNN, and (3) obtain the final forecast by summing the outputs of the subseries. A series of experiments based on the VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN models was conducted. A flowchart of the VMD-based models is illustrated in Figure 1.
The framework of the decomposition-based data-driven model is as follows:

Step 1: decompose the entire original daily streamflow time series into *n* IMFs using VMD and divide the entire period into the training and validation periods.

Step 2: develop the ARMA, GBRT, SVR, and BPNN models for each subseries of the training period, using the appropriate lags obtained by the PACF as inputs, and then select the optimal model parameters using the error-index minimization criteria.

Step 3: use the selected prediction models to forecast the IMFs.

Step 4: obtain the final forecast of the training period by summing the outputs of all the prediction models.

Step 5: apply the ARMA, GBRT, SVR, and BPNN models to forecast the subseries in the test period and obtain the final forecast of the test period by summing the outputs of all the prediction models.
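As a concrete illustration, the five steps above can be sketched end to end in Python. This is only a toy sketch: a smooth/residual split of a synthetic flow series stands in for the VMD decomposition, and a least-squares lag regression stands in for the ARMA/GBRT/SVR/BPNN learners. All data and parameter choices are hypothetical.

```python
import numpy as np

def make_lagged(series, lags):
    """Build (X, y) pairs from a 1-D series using the given lag indices."""
    max_lag = max(lags)
    X = np.column_stack([series[max_lag - lag:len(series) - lag] for lag in lags])
    y = series[max_lag:]
    return X, y

def fit_linear(X, y):
    """Least-squares linear forecaster (stand-in for ARMA/GBRT/SVR/BPNN)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Step 1 (toy): "decompose" a synthetic flow series into a smooth component
# and a residual, standing in for the VMD IMFs.
rng = np.random.default_rng(0)
t = np.arange(600)
flow = 50 + 20 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 3, t.size)
smooth = np.convolve(flow, np.ones(15) / 15, mode="same")
subseries = [smooth, flow - smooth]

lags = [1, 2, 3]          # Step 2: lag inputs (selected by PACF in the paper)
split = 500               # training/validation split
forecast = np.zeros(t.size - split)
for s in subseries:       # Steps 2-3 and 5: model and forecast each subseries
    X, y = make_lagged(s, lags)
    coef = fit_linear(X[: split - max(lags)], y[: split - max(lags)])
    forecast += predict_linear(coef, X[split - max(lags):])  # Step 4: sum outputs
```

The key structural point is that each subseries gets its own model, and the final forecast is simply the sum of the subseries forecasts.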

##### 2.2. Multidimensional Feature Extraction with VMD

The VMD algorithm was used to decompose the original daily streamflow series into multiple components. It decomposed a complicated dataset into several intrinsic mode functions (IMFs) with different dominant frequencies and amplitudes.

Variational mode decomposition (VMD) first appeared in 2014 [47], and it is mainly used to decompose the input signal adaptively into several subsignals (modal component functions) $u_k$. VMD decomposed the daily streamflow series $f(t)$ into *n* + 1 IMFs. Each mode had a finite bandwidth with a different centre frequency $\omega_k$, and the sum of the estimated bandwidths of all modes was required to be minimal. For the sake of conciseness and convenience, all the IMFs and the original series are uniformly called subseries in the following section. The specific steps for the daily streamflow sequence decomposition are as follows:

Step 1: using the streamflow sequence as the original input signal $f(t)$, the demodulated signal of each mode $u_k(t)$ is calculated using the Hilbert transform to obtain the corresponding unilateral spectrum. The spectrum of each mode is shifted to the baseband by mixing it with an exponential tuned to the estimated centre frequency $\omega_k$. The squared $L^2$-norm of the gradient of the demodulated signal is calculated to estimate the bandwidth of $u_k$.

Step 2: decompose the daily streamflow sequence $f(t)$ into the set of *K* modes and construct the corresponding constrained variational model [47] as follows:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{s.t.} \quad \sum_{k} u_k = f, \tag{1}$$

where $\{u_k\} := \{u_1, \ldots, u_K\}$ and $\{\omega_k\} := \{\omega_1, \ldots, \omega_K\}$ are shorthand notations for all the *K* modes and their centre frequencies, respectively, $\delta(t)$ represents the Dirac distribution, and *t* is the time. Equally, $\sum_k := \sum_{k=1}^{K}$ refers to the summation over the set of all modes.

Step 3: to render the constrained variational problem into an unconstrained one, the quadratic penalty term $\alpha$ and the Lagrangian multipliers $\lambda(t)$ are introduced. The augmented Lagrangian function is expressed as follows [47]:

$$L\left(\{u_k\},\{\omega_k\},\lambda\right) = \alpha \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k} u_k(t) \right\rangle, \tag{2}$$

where the quadratic penalty $\alpha$ is a parameter that encourages reconstruction fidelity and the Lagrangian multipliers $\lambda(t)$ guarantee the strictness of the constraint.
Step 4: using the alternate direction method of multipliers (ADMM) [48], $u_k$, $\omega_k$, and $\lambda$ are alternately updated to find the saddle point of the augmented Lagrangian function of equation (2), such that $u_k^{n+1}$ is expressed as follows:

$$u_k^{n+1} = \arg\min_{u_k} \left\{ \alpha \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{i} u_i(t) + \frac{\lambda(t)}{2} \right\|_2^2 \right\}, \tag{3}$$

where $\omega_k$ is equivalent to $\omega_k^{n+1}$ and $\sum_i u_i(t)$ is equivalent to $\sum_{i \ne k} u_i^{n+1}(t)$.

Step 5: based on the Parseval/Plancherel Fourier isometry, equation (3) is converted to equation (4) to update $\hat{u}_k$ in the frequency domain, and then the centre frequency $\omega_k$ is updated using equation (5). The multiplier update equation is shown as equation (6):

$$\hat{u}_k^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \ne k} \hat{u}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha\left(\omega - \omega_k\right)^2}, \tag{4}$$

$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \left| \hat{u}_k^{n+1}(\omega) \right|^2 \mathrm{d}\omega}{\int_0^{\infty} \left| \hat{u}_k^{n+1}(\omega) \right|^2 \mathrm{d}\omega}, \tag{5}$$

$$\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \tau \left( \hat{f}(\omega) - \sum_{k} \hat{u}_k^{n+1}(\omega) \right), \tag{6}$$

where $\tau$ denotes the time step of the dual ascent.

Step 6: in the process of solving the iterative solution of the variational model, repeat steps 2 to 5. The frequency centre and bandwidth of each streamflow sequence component are continuously updated to complete the adaptive segmentation of the signal band until the iteration stop condition is met:

$$\sum_{k} \frac{\left\| \hat{u}_k^{n+1} - \hat{u}_k^{n} \right\|_2^2}{\left\| \hat{u}_k^{n} \right\|_2^2} < e. \tag{7}$$

Given the discriminative precision *e* > 0 in equation (7), the whole loop ends. Finally, the modes $u_k$ and the corresponding centre frequencies $\omega_k$ are obtained according to the frequency-domain characteristics of the daily streamflow sequence.
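The iteration in steps 4–6 can be condensed into a minimal frequency-domain implementation. The sketch below is illustrative only: it omits the signal mirroring and half-spectrum bookkeeping of the reference VMD implementation, and the default parameter values are assumptions rather than the paper's settings.

```python
import numpy as np

def vmd(signal, K=3, alpha=2000.0, tau=0.1, tol=1e-7, max_iter=300):
    """Minimal sketch of VMD: ADMM updates carried out in the frequency domain."""
    T = len(signal)
    f_hat = np.fft.fftshift(np.fft.fft(signal))
    freqs = np.arange(T) / T - 0.5                  # centred frequency axis
    u_hat = np.zeros((K, T), dtype=complex)         # mode spectra
    omega = np.linspace(0, 0.25, K)                 # initial centre frequencies
    lam = np.zeros(T, dtype=complex)                # Lagrangian multiplier spectrum
    half = freqs >= 0
    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Wiener-filter update of mode k (equation (4))
            u_hat[k] = (f_hat - others + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # centre-of-gravity update of the centre frequency (equation (5))
            power = np.abs(u_hat[k, half]) ** 2
            omega[k] = (freqs[half] * power).sum() / (power.sum() + 1e-14)
        # dual ascent on the reconstruction constraint (equation (6))
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))
        diff = np.sum(np.abs(u_hat - u_prev) ** 2) / (np.sum(np.abs(u_prev) ** 2) + 1e-14)
        if diff < tol:                              # stop condition (equation (7))
            break
    modes = np.real(np.fft.ifft(np.fft.ifftshift(u_hat, axes=-1), axis=-1))
    return modes, omega
```

On a sum of two sinusoids, the recovered centre frequencies settle near the true frequencies and the modes sum back to the input, which is the behaviour the paper relies on when decomposing the streamflow series.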

##### 2.3. Preliminary Processing-PACF

The development of the prediction model depends largely on the determination of the input variables. This requires some degree of a priori knowledge of the system to be modelled [49, 50]. However, it is difficult to select the initial input variables based on the underlying physical processes, especially in complex systems. Therefore, an analytical method, such as auto- or cross-correlation, is often employed [51]. This paper used a statistical method, the partial autocorrelation function (PACF), to address the limitation of neglecting the relationship between the number of parameters and the different antecedent values. Detailed descriptions of the PACF can be found in the previous literature [52]. The output variable $Q_{t+1}$ was strongly correlated with $Q_t$ (the current flow) and $Q_{t-i}$ (the antecedent flow with an *i*-day time lag), which were thus selected as input variables if the PACF at lag *i* exceeded the 95% confidence interval ($\pm 1.96/\sqrt{N}$).
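The PACF-based lag selection can be sketched with a self-contained Durbin–Levinson recursion; this is an illustrative implementation (with the 95% bound $\pm 1.96/\sqrt{N}$), not the exact routine used in the paper.

```python
import numpy as np

def pacf(x, nlags):
    """Partial autocorrelation via the Durbin-Levinson recursion."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    r = np.array([x[: n - k] @ x[k:] for k in range(nlags + 1)]) / n
    r /= r[0]                                   # autocorrelations rho_0..rho_nlags
    phi = np.zeros((nlags + 1, nlags + 1))
    pac = np.zeros(nlags + 1)
    pac[0] = 1.0
    for k in range(1, nlags + 1):
        if k == 1:
            phi[1, 1] = r[1]
        else:
            num = r[k] - phi[k - 1, 1:k] @ r[1:k][::-1]
            den = 1.0 - phi[k - 1, 1:k] @ r[1:k]
            phi[k, k] = num / den
            for j in range(1, k):
                phi[k, j] = phi[k - 1, j] - phi[k, k] * phi[k - 1, k - j]
        pac[k] = phi[k, k]                      # PACF at lag k = phi_kk
    return pac

def select_lags(x, nlags=20):
    """Keep lags whose PACF exceeds the 95% confidence bound 1.96/sqrt(N)."""
    bound = 1.96 / np.sqrt(len(x))
    pac = pacf(x, nlags)
    return [k for k in range(1, nlags + 1) if abs(pac[k]) > bound]
```

For an AR(1) process, only lag 1 should survive the cut, which matches the intuition that the PACF truncates at the true autoregressive order.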

##### 2.4. Models for Learning Prediction

Several techniques are employed for feature learning in this study, including the ARMA, GBRT, SVR, and BPNN models. A brief overview of these methods is presented here.

###### 2.4.1. GBRT Model for Learning Prediction

The traditional GBRT model, according to Friedman [53], solves classification and regression problems by combining the output of many weak prediction models or "learners" into a powerful "committee" [54]. At each stage of the gradient boosting algorithm, the model is as follows:

$$F(x) = \sum_{m=1}^{M} \gamma_m h_m(x), \tag{8}$$

where $h_m(x)$ are the basis functions, referred to as weak learners, which are small regression trees of fixed size in the case of the GBRT. The GBRT model is thus a summation of small regression trees. For each boosting iteration from $m = 1$ to $M$, the gradient boosting algorithm improves $F_m$ by adding a new regression tree $h(x)$ to the previous model, yielding $F_{m+1}(x) = F_m(x) + h(x)$. The procedure estimates the target value $y$ based on a perfect $h$ from the training set, such that $h$ satisfies the following:

$$F_{m+1}(x_i) = F_m(x_i) + h(x_i) = y_i, \tag{9}$$

and also

$$h(x_i) = y_i - F_m(x_i). \tag{10}$$

Thus, $h(x_i)$ in equation (10) is the regression tree model fitting the residual at iteration $m$. Also, the current residuals for a given model $F_m$ are the negative gradients of the squared error loss function:

$$y_i - F_m(x_i) = -\frac{\partial}{\partial F_m(x_i)} \left[ \frac{1}{2} \left( y_i - F_m(x_i) \right)^2 \right]. \tag{11}$$

This shows that the residual $y_i - F_m(x_i)$ equals the negative gradient of the squared loss function. Therefore, equation (11) demonstrates that gradient boosting is a gradient descent algorithm minimizing the squared error loss function. This explanation can be generalized to other loss functions by replacing the squared error with a different loss function and its gradient. Further details of the gradient boosted regression trees are provided in Hastie et al. [54].

The objective of the model is to find an estimate $\hat{F}(x)$ that minimizes the value of the loss function $L(y, F(x))$ using the training set $\{(x_i, y_i)\}_{i=1}^{N}$. A simplification is applied, and the steepest descent step is used to solve this minimization problem. The model is updated by the following equations [54, 55]:

$$F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{N} \nabla_{F_{m-1}} L\left( y_i, F_{m-1}(x_i) \right), \tag{12}$$

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{N} L\left( y_i,\, F_{m-1}(x_i) - \gamma \nabla_{F_{m-1}} L\left( y_i, F_{m-1}(x_i) \right) \right), \tag{13}$$

where the derivatives are taken with respect to the functions $F_{m-1}$ for $m = 1, 2, \ldots, M$.

In the $m$th step, a generic gradient boosting algorithm fits a regression tree $h_m(x)$ to the pseudoresiduals. Let $J$ be the number of leaves. The regression tree model splits the input space into $J$ separated regions $R_{1m}, \ldots, R_{Jm}$ and obtains a constant value for each region. For the input $x$, $h_m(x)$ is then written as the following sum [56]:

$$h_m(x) = \sum_{j=1}^{J} b_{jm} \mathbf{1}\left( x \in R_{jm} \right), \tag{14}$$

where $b_{jm}$ is the constant value calculated for the region $R_{jm}$ and $\mathbf{1}(\cdot)$ is the indicator function. Each coefficient is multiplied by a different optimal $\gamma_{jm}$ for each of the regions, which is carried out by a modified algorithm proposed by Friedman instead of using only one $\gamma_m$ for the whole tree. After that, the model is updated as follows [54, 55]:

$$F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} \mathbf{1}\left( x \in R_{jm} \right), \qquad \gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L\left( y_i, F_{m-1}(x_i) + \gamma \right). \tag{15}$$

To obtain the optimal model, understanding the effect of the different parameter values on the model's performance is critical. The well-known machine learning toolkit scikit-learn [57] was applied to train the GBRT model using the training and developing datasets. To achieve lower prediction errors, 6-fold cross-validation was used. The hyperparameters, i.e., the learning rate (LR), the maximum depth (MD), the maximum features (MF), the minimum sample split (MSS), and the minimum sample leaf (MSL), were searched using Bayesian optimization based on Gaussian processes.
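The same five hyperparameters can be tuned with scikit-learn directly. The paper used Bayesian optimization over Gaussian processes; as a dependency-light stand-in, the sketch below uses `GridSearchCV` with 6-fold cross-validation. The data and all grid values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))                       # stand-in for lagged-flow inputs
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] ** 2 + 0.1 * rng.normal(size=400)

param_grid = {
    "learning_rate": [0.05, 0.1],     # LR
    "max_depth": [2, 3],              # MD
    "max_features": [None, "sqrt"],   # MF
    "min_samples_split": [2, 4],      # MSS
    "min_samples_leaf": [1, 3],       # MSL
}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, random_state=0),
    param_grid, cv=6, scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best_gbrt = search.best_estimator_
```

Swapping in a Bayesian optimizer (e.g., over the same search space) only changes how candidate points are proposed; the cross-validated scoring loop is identical.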

###### 2.4.2. ARMA Model for Learning Prediction

ARMA models [58] are linear stochastic models that are obtained by combining the AR and the MA models, and they are used to model the dependent stochastic components of a time series. ARMA models can be written as follows:

$$\phi(B) x_t = \theta(B) \varepsilon_t, \tag{16}$$

where $\phi(B) = 1 - \phi_1 B - \cdots - \phi_p B^p$ represents the autoregressive operator of order $p$, $B$ denotes the backward (backshift) operator, $\theta(B) = 1 - \theta_1 B - \cdots - \theta_q B^q$ is the moving average operator of order $q$, and $\varepsilon_t$ is the white noise sequence (residuals). The orders $(p, q)$ of the model were identified in order to fit the model to the data. Various types of models should be validated to find the best choice for each series. For this purpose, the AIC is generally used, such that the model with the lowest AIC value is accepted as the best among the alternative models; it is written as follows:

$$\mathrm{AIC} = N \ln\left( \hat{\sigma}_{\varepsilon}^{2} \right) + 2(p + q), \tag{17}$$

where $N$ is the total number of observations and $\hat{\sigma}_{\varepsilon}^{2}$ is the variance of the residual terms.
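The AIC criterion of equation (17) is easy to evaluate directly. The illustrative sketch below restricts itself to pure AR(*p*) candidates (so *q* = 0) fitted by least squares, which keeps the example self-contained; a full ARMA fit would add the MA terms but use the AIC in exactly the same way.

```python
import numpy as np

def ar_aic(x, p):
    """AIC of a least-squares AR(p) fit: N ln(sigma_e^2) + 2(p + q), here with q = 0."""
    x = np.asarray(x, float)
    n = len(x)
    y = x[p:]
    # design matrix of lagged values x_{t-1}, ..., x_{t-p} plus an intercept
    X = np.column_stack([x[p - j:n - j] for j in range(1, p + 1)] + [np.ones(n - p)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ coef) ** 2)           # residual variance
    return len(y) * np.log(sigma2) + 2 * p
```

On data simulated from an AR(2) process, the AIC drops sharply from *p* = 1 to *p* = 2 and then flattens, so the minimum-AIC rule recovers (roughly) the true order.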

###### 2.4.3. SVR Model for Learning Prediction

The support vector regression (SVR) is a powerful nonlinear regression model that was developed from the support vector machine (SVM) [59]. In the SVR, the aim is to solve the following minimization problem [60] to find the parameters of an appropriate function $f(x) = w^{T} \varphi(x) + b$ with a small weight vector $w$ that has the least possible deviation $\varepsilon$ for a set of $N$ training data $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{n}$, $y_i \in \mathbb{R}$, and $\varphi(\cdot)$ is a nonlinear mapping function. Another parameter that defines the performance of an SVR model is the kernel parameter $\gamma$, which nonlinearly maps samples into a high-dimensional feature space. The loss function is defined as follows:

$$\min_{w, b, \xi, \xi^{*}} \; \frac{1}{2} \left\| w \right\|^{2} + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^{*} \right), \tag{18}$$

which is subjected to the following:

$$y_i - w^{T} \varphi(x_i) - b \le \varepsilon + \xi_i, \qquad w^{T} \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \qquad \xi_i, \xi_i^{*} \ge 0, \tag{19}$$

where $\xi_i$ and $\xi_i^{*}$ are slack variables, $C$ is a positive regularization parameter, and $w$ is the weight vector.

$C$, $\varepsilon$, and $\gamma$ are determined through a trial-and-error process. The modelling process for an SVR model is as follows: (1) set all the possible combinations of $C$, $\varepsilon$, and $\gamma$; (2) split the data into 5 folds; (3) train and validate with all the possible $C$, $\varepsilon$, $\gamma$, and fold combinations; (4) keep the $C$, $\varepsilon$, and $\gamma$ that yield the lowest RMSE and the highest $R^2$ in the training and developing sets; (5) train the model using all the data with the selected $C$, $\varepsilon$, and $\gamma$; and (6) test the SVR model using the testing data.
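The six steps above map almost one-to-one onto scikit-learn's `GridSearchCV` with an RBF-kernel `SVR`. The sketch below is illustrative: the synthetic data and the candidate values for $C$, $\varepsilon$, and $\gamma$ are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(300, 3))               # normalized lagged inputs
y = np.sin(2 * X[:, 0]) + 0.3 * X[:, 1] + 0.05 * rng.normal(size=300)

# step (1): candidate combinations of C, epsilon, gamma
grid = {"C": [1, 10, 100], "epsilon": [0.01, 0.1], "gamma": [0.1, 1.0]}
# steps (2)-(4): 5-fold split, train/validate every combination, keep the best
search = GridSearchCV(SVR(kernel="rbf"), grid, cv=5,
                      scoring="neg_root_mean_squared_error")
# step (5): refit on all the data with the selected parameters
search.fit(X, y)
svr_model = search.best_estimator_
# step (6): svr_model.predict(...) would then be applied to the testing data
```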

###### 2.4.4. BPNN Model for Learning Prediction

The BPNN is an ANN trained by the error backpropagation algorithm. Developed by Rumelhart et al. [61], it is one of the most common and effective approaches among all neural networks and consists of an input layer, a hidden layer, and an output layer. The number of hidden neurons depends on the complexity of the mathematical problem [62] and is often determined by trial and error. The output $y_j$ of each neuron is calculated by

$$y_j = f\left( \sum_{i} w_{ij} x_i + b_j \right), \tag{20}$$

where $w_{ij}$ is the weight of the connection from the input to the hidden layer, or from the hidden to the output layer, $f(\cdot)$ is an activation function, and $b_j$ is the bias input to the neuron.

For the training or learning process, where the weights are selected, the neural network uses the gradient descent learning method to modify the weights and to minimize the error between the actual output values and the expected values [63]. Training stops when the errors are minimized or when another stopping criterion is met. The weights and biases are modified using the following formulas:

$$w_{ij}(t+1) = w_{ij}(t) + \eta\, \delta_j x_i, \qquad b_j(t+1) = b_j(t) + \eta\, \delta_j, \tag{21}$$

where $t$ is the number of epochs, $\delta_j$ is the backpropagated error at neuron $j$ in the hidden or output layer, and $\eta$ is the learning rate.
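A one-hidden-layer BPNN with these gradient-descent updates fits in a few dozen lines of NumPy. This is an illustrative sketch (tanh hidden layer, linear output, squared error loss); the architecture and hyperparameters are assumptions, not the paper's exact configuration.

```python
import numpy as np

def train_bpnn(X, y, hidden=7, lr=0.05, epochs=2000, seed=0):
    """One-hidden-layer BPNN trained by gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # forward pass
        h = np.tanh(X @ W1 + b1)                 # hidden activations
        out = (h @ W2 + b2).ravel()              # linear output neuron
        err = out - y
        losses.append(np.mean(err ** 2))
        # backward pass: propagate the error signal (the delta terms)
        d_out = (2 / n) * err[:, None]
        dW2 = h.T @ d_out; db2 = d_out.sum(0)
        d_h = d_out @ W2.T * (1 - h ** 2)        # derivative of tanh
        dW1 = X.T @ d_h; db1 = d_h.sum(0)
        # gradient descent weight/bias updates (equation (21) form)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return (W1, b1, W2, b2), losses
```

Monitoring the loss sequence is also how the "stop when the error is minimized" criterion in the text is implemented in practice.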

##### 2.5. Model Evaluation Criteria

Several error indices are commonly used to evaluate the prediction performance of the model. These include the root mean squared error (RMSE), the mean absolute error (MAE), and the Nash–Sutcliffe efficiency coefficient (NSE). The RMSE and MAE were used to characterize the overall precision and accuracy of the prediction results, respectively. The NSE was used to characterize the stability of the prediction results. If the forecasting errors exhibited large fluctuations, the model was deemed unreliable, even if the other error indices were good. The NSE was sensitive to the fluctuation of the data series and could describe the tracking ability of the prediction to the measurements. The RMSE, MAE, and NSE are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Q_{o,i} - Q_{f,i} \right)^2},$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Q_{o,i} - Q_{f,i} \right|,$$

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} \left( Q_{o,i} - Q_{f,i} \right)^2}{\sum_{i=1}^{n} \left( Q_{o,i} - \bar{Q}_{o} \right)^2},$$

where *n* is the number of samples; $Q_{o,i}$ and $Q_{f,i}$ are the observed and forecasted values, respectively; and $\bar{Q}_{o}$ is the mean of the observed values. The best fit between the observed and forecasted values would yield RMSE = 0, NSE = 1, and MAE = 0.
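The three indices translate directly into code; a minimal sketch:

```python
import numpy as np

def rmse(obs, sim):
    """Root mean squared error: overall precision of the forecasts."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return np.sqrt(np.mean((obs - sim) ** 2))

def mae(obs, sim):
    """Mean absolute error: overall accuracy of the forecasts."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return np.mean(np.abs(obs - sim))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean of obs."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)
```

A perfect forecast gives RMSE = MAE = 0 and NSE = 1, while forecasting the observed mean everywhere gives NSE = 0, which is the usual baseline interpretation of the index.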

##### 2.6. Data Normalization

To achieve better performance and faster convergence, data normalization was performed on the raw series according to the following equation:

$$x_{\mathrm{norm}} = 2 \times \frac{x - x_{\min}}{x_{\max} - x_{\min}} - 1, \tag{22}$$

where $x_{\mathrm{norm}}$ is the normalized data, $x$ is the raw data, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the sequence, respectively. The data series were thus normalized to the range of −1 to 1. First, the original streamflow series were decomposed by the VMD method. The resulting decomposed data were then normalized using equation (22), and the normalized data were subsequently used for feature learning by the different forecasting models.
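Equation (22) and its inverse (needed to map forecasts back to flow units) can be sketched as:

```python
import numpy as np

def normalize(x):
    """Scale a series linearly to [-1, 1] (equation (22))."""
    x = np.asarray(x, float)
    xmin, xmax = x.min(), x.max()
    return 2 * (x - xmin) / (xmax - xmin) - 1

def denormalize(xn, xmin, xmax):
    """Invert the scaling to recover values in the original units."""
    return (np.asarray(xn, float) + 1) / 2 * (xmax - xmin) + xmin
```

In a forecasting pipeline, `xmin` and `xmax` must be stored from the training data so that the same transform is applied to the test period and to the model outputs.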

#### 3. Case Study and Catchment Description

The Wei River (Figure 2) was used as a case study in this research to demonstrate the effectiveness of the proposed hybrid model. The Wei River, the largest tributary of the Yellow River, originates from Bird-Mouse Mountain in Gansu Province, flows through the Gansu and Shaanxi provinces, and converges into the Yellow River at Tongguan, with a total length of 818 km. The basin covers an area of 135,000 km^{2} (103.5°E–110.5°E and 33.5°N–37.5°N). Located in the continental monsoon climate zone, the Wei River Basin (WRB) is characterized by abundant precipitation in summer and scarce precipitation in winter [27]. The basin has an annual precipitation of approximately 559 mm. Topographically, the altitude of the basin is high in the northwestern mountainous areas and low in the Guanzhong Plain. It is worth mentioning that the Wei River basin is an important grain production area and a state key economic development zone in China's "One Belt and One Road" construction. Therefore, its economic development will directly affect the sustainable development of the economy and society of the Gansu and Shaanxi provinces. Considering economic development and population growth, the demand for water resources in the WRB is increasing noticeably. Therefore, it is of great significance to develop a hybrid forecasting model for streamflow in the WRB.

Because observations of meteorological factors such as rainfall and temperature are incomplete in the study area, only the historical runoff data from the Wushan and Weijiabao hydrological stations in the WRB were collected to assess the proposed model. The Wushan and Weijiabao hydrological stations are the main upstream and midstream stations of the Wei River, respectively. The daily streamflow data were collected from the published hydrologic manual of the WRB and the Hydrological and Water Resources Information Network of Shaanxi Province (http://www.shxsw.com.cn/7/39/list.aspx). The data quality was strictly controlled before release, and a consistency analysis demonstrated that all the daily streamflow data used in this study were reliable. Based on the importance and representativeness of the hydrological series, the daily streamflow records (5113 samples in total, from 1 January 2001 to 31 December 2014) of the Wushan and Weijiabao hydrological stations were used (Figure 3). To avoid local minima problems, the historical daily runoff series of the Wushan and Weijiabao stations from 1 January 2001 to 31 December 2011 were selected for training the model structures during calibration, and the remainder (from 1 January 2012 to 31 December 2014) was used for validation. The statistical properties of the daily streamflow at the two hydrological stations are shown in Table 1, indicating that the standard deviation values are higher than the skewness coefficients at both stations.

#### 4. Results and Discussion

##### 4.1. Data Decomposition with VMD

According to the model established in Section 2.1, the initial daily streamflow series were decomposed into several IMFs using the VMD approach. The optimal number of modes *K* could be determined by analyzing the centre frequencies [64]. Noticeable aliasing of the centre frequency in the frequency spectrum appeared when the number of components increased beyond a certain value. To keep spurious components in the decomposition results to a minimum in this study, six multidimensional features (IMF_{1}–IMF_{6}) and eleven decomposition components (IMF_{1}–IMF_{11}) were extracted by the VMD algorithm at the Wushan and Weijiabao hydrological stations, respectively. Figure 4 shows the decomposition results with different dominant frequencies and amplitudes for the Weijiabao station.


##### 4.2. Selection of the Best Input Variables by PACF

Figure 5 depicts the PACF analysis results of the subseries for the Weijiabao station, where PACF_{1}–PACF_{11} represent the respective PACFs of the normalized subseries. The input variables were determined by analyzing the resulting partial autocorrelation diagram, which corresponds to the plots of the PACFs with respect to the lag length (Figure 5). In the iterative process of the daily streamflow prediction, the input variables of the eleven subseries and the number of inputs for the prediction models were determined, as illustrated in Table 2.

##### 4.3. Performance of the Different Forecasting Models and Decomposition in Preprocessing Methods

###### 4.3.1. Performance of the GBRT Forecasting Model and Decomposition in Preprocessing Methods

The feasibility of the GBRT modelling was demonstrated by comparing the predicted streamflow with the observed results using the training set, for which Bayesian optimization was employed in the hyperparameter search. Table 3 presents the appropriate values of the five parameters, including the LR, MD, MF, MSS, and MSL, for IMF_{1}–IMF_{11} at the Weijiabao study site. Table 3 shows that the RMSE is relatively small and tends towards zero, and the NSE values exceed 0.94, which indicates that the VMD-GBRT method performed very well during both the training and developing periods. Therefore, this optimal model can be used to learn the features of the decomposed results in the test phase.

###### 4.3.2. Performance of the ARMA Forecasting Model and Decomposition in Preprocessing Methods

According to the AIC model selection criterion, the values of the parameters *p* and *q* were determined during the training and developing phases. The best ARMA model was determined as the one that yielded the lowest RMSE and the highest NSE values. Table 4 presents the values of the parameters and their errors for all the IMFs of the Weijiabao station. For all sequences, the RMSE tends towards zero and the NSE values exceed 0.995, which indicates that the VMD-ARMA method performs well. Thus, this optimal model could be used to learn the features of the decomposed results in the test phase.

###### 4.3.3. Performance of SVR Forecasting Model and Decomposition in Preprocessing Methods

For the SVR model, Bayesian optimization was employed to optimize the balance parameter $C$, the least possible deviation $\varepsilon$, and the kernel function parameter $\gamma$ in the SVR models. Using the Weijiabao station as an example, eleven experiments, designated as the IMF_{1}–IMF_{11} series, were conducted for each training set of each mode. The parameter optimization results are listed in detail in Table 5. The error results show that these parameters were appropriate and effectively reduced the errors. Hence, the optimal model and parameters in Table 5 can be further applied to the learning of the decomposed IMFs in the test period.

###### 4.3.4. Performance of the BPNN Forecasting Model and Decomposition in the Preprocessing Method

The BPNN model was developed for feature learning of the decomposed IMFs. The optimum structure of the BPNN was derived through trial and error by decreasing or increasing the number of layers and of neurons in each layer to determine the optimal parameter settings. In this experiment, the BPNN training structure of IMF_{1} for the Weijiabao station was initialized with 11 input neurons and 1 output neuron. The number of hidden layers was initially set to 1, the number of hidden neurons was varied from 4 to 13 in steps of 1, and the stopping condition was the convergence of the BPNN. Table 6 shows the BPNN evaluation results with 11 input neurons and 1 hidden layer based on the two performance criteria. With 7 hidden neurons, the NSE was 0.9999942 and the RMSE was 0.2182 m^{3}/s, and the model exhibited the best learning performance. The analysis of He et al. [65] revealed that the prediction performance of a neural network model with multiple hidden layers is not always better than that of a model with a single hidden layer, which implies that excess hidden layers might lead to an overfitting problem [66, 67]. The BPNN can have one or multiple hidden layers, but a BPNN with one hidden layer can complete the mapping of any continuous function with arbitrary accuracy. In order to avoid an excessive slowdown of the training speed, one hidden layer was adopted for the BPNN model at the two selected hydrological stations.

Through the above method, the corresponding BPNN model structure of the other IMFs could be determined, and the prediction performance is also summarized in Table 7.

##### 4.4. Forecasting Results of the Models and Model Comparison

By summing all the modelled IMF components of the daily streamflow during the testing period from 1 January 2012 to 31 December 2014, the resulting daily flow rates for the Wushan and Weijiabao hydrological stations were derived. In order to verify the prediction accuracy and stability of the different VMD-based models, six types of daily-scale streamflow prediction models were established: the BPNN, EEMD-BPNN, VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN. The comparison between the predicted results and the observed streamflow series is shown in Figures 6 and 7.

Figures 6 and 7 show that the test-set forecasting results of the various models at the Wushan and Weijiabao stations, respectively, are in line with expectations. In terms of prediction accuracy, taking Weijiabao Station as an example, the models can be ranked from highest to lowest as follows: VMD-BPNN > VMD-SVR > VMD-ARMA > VMD-GBRT > EEMD-BPNN > BPNN. Among the single prediction models, the BPNN exhibited poor fitting ability and high forecasting error, particularly for days with relatively high flow rates. Compared with the single BPNN, the combined prediction models preprocessed by EEMD and VMD modal decomposition consistently performed better, fitting the high flows more closely and yielding better prediction accuracy. Moreover, the match between the recorded and modelled daily runoff was better with the VMD-BPNN model than with the EEMD-BPNN model, mainly because the two decomposition algorithms control the number of modes differently, which affects the prediction error. The centre frequency of the VMD method is controllable, which effectively avoids the modal aliasing that affects the EEMD method [68]. To further evidence the superiority of the VMD-based models' feature learning capability, the four VMD-based models (VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN) were also compared in terms of reproducing the statistical characteristics of the observations (Figure 6). All the VMD-based models closely reproduced the observed flows, with no major difference among their performances. For both stations, the results show that the VMD-BPNN model consistently outperformed all the other models while requiring the lowest computational effort [69].

##### 4.5. Evaluating All Models’ Forecasting Accuracy

The scatter plots of the observed and forecasted flows in Figures 8 and 9 show that the fitting lines of the VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN models lie close to the ideal fitting line (the black line in Figures 8 and 9), indicating that the overall prediction performance of the VMD-based models was very good and their high-flow prediction errors were the smallest, with the EEMD-BPNN and, especially, the BPNN performing worse. The VMD-BPNN model fits better than the BPNN model, which exhibits the worst fit. The scatter points of the streamflow forecasted by the VMD-BPNN against the observed flows follow the 45° line, whereas the least squares regression (LSR) line of the BPNN deviates substantially from it. These results confirm the underestimation of the flows, especially the high flows, and demonstrate that the BPNN model had relatively poor robustness and a weak forecasting ability when not integrated with any data preprocessing technique. This can be attributed to the use of a single forecasting model to learn features directly from the original runoff series, which insufficiently represents the intrinsic information of nonstationary and nonperiodic data [70]. The points for the EEMD-BPNN model form an obviously disorganized point set. Overall, the EEMD-BPNN model tends to overestimate the medium flows and underestimate the high flows. The superior performance of the VMD-BPNN method becomes obvious once the VMD decomposition algorithm is employed. The EEMD-BPNN results are close to the observed values on some days, but the EEMD-BPNN model consistently yielded less accurate forecasts than the VMD-BPNN model at higher flows; the reason is that the modes obtained by VMD are much smoother than the components decomposed by EEMD [44].

As shown in the scatter plots in Figures 8 and 9, no significant difference in performance could be observed among the VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN models in the test phase. The results demonstrate that the VMD-based models performed well at forecasting both the maximum and minimum streamflows. Compared with the VMD-ARMA, VMD-GBRT, and VMD-SVR models, the VMD-BPNN model has a powerful approximation ability for solving complex nonlinear problems, which improved the prediction accuracy. Moreover, by using the VMD-based decomposition to sufficiently represent the intrinsic information of the streamflow series, the generalization ability of the BPNN model can be effectively enhanced.
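The scatter-plot diagnostic used above can be expressed numerically: fit an LSR line to forecast-observation pairs and compare its slope with the ideal 45° line. The flow values below are hypothetical and chosen only to mimic high-flow underestimation.

```python
# Sketch of the LSR-line diagnostic: the ideal 45-degree line has slope 1
# and intercept 0, so an LSR slope well below 1 signals systematic
# underestimation, especially of the high flows. Values are made up.
import numpy as np

obs = np.array([10.0, 50.0, 120.0, 300.0, 800.0])   # observed flows, m^3/s
sim = np.array([9.0, 48.0, 110.0, 260.0, 640.0])    # forecasts undershoot peaks

slope, intercept = np.polyfit(obs, sim, 1)          # least squares fit
print("LSR slope:", round(slope, 3))
```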

The statistical indicators of performance can more accurately assess the predictive power of each model. Table 8 shows the model performance indices of the BPNN, EEMD-BPNN, VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN models at the Wushan and Weijiabao stations during their respective testing period.

Taking Weijiabao Station as an example, the performance statistics provided in Table 8 show that the flows forecasted by the VMD-BPNN reproduced the statistical characteristics of the observations fairly well overall compared to the BPNN and EEMD-BPNN. The NSE values were relatively high (0.9865 for the VMD-BPNN vs. 0.8132 for the BPNN, 0.9236 for the EEMD-BPNN, 0.9862 for the VMD-ARMA, 0.9750 for the VMD-GBRT, and 0.9857 for the VMD-SVR model); the NSE of the VMD-BPNN model was thus higher by 21.31%, 6.81%, 0.03%, 1.18%, and 0.09%, respectively, than those of the other models. The RMSE values of the models are as follows: 16.80 m^{3}/s (VMD-BPNN) vs. 68.86 m^{3}/s (BPNN), 39.66 m^{3}/s (EEMD-BPNN), 16.84 m^{3}/s (VMD-ARMA), 22.70 m^{3}/s (VMD-GBRT), and 17.18 m^{3}/s (VMD-SVR); the RMSE of the VMD-BPNN model was lower by 75.60%, 57.63%, 0.24%, 25.96%, and 2.20%, respectively. The MAE values are as follows: 7.95 m^{3}/s (VMD-BPNN) vs. 24.79 m^{3}/s (BPNN), 13.96 m^{3}/s (EEMD-BPNN), 7.99 m^{3}/s (VMD-ARMA), 10.21 m^{3}/s (VMD-GBRT), and 8.13 m^{3}/s (VMD-SVR); the MAE of the VMD-BPNN model was lower by 67.95%, 43.08%, 0.60%, 22.19%, and 2.30%, respectively. The calculation and comparison of the error statistics (NSE, RMSE, and MAE) on the testing set of each model show that the VMD-BPNN model provided the most accurate and reliable streamflow forecasts of all the models (Table 8): the forecasting performance of the BPNN was unsatisfactory, and the forecasts obtained by the EEMD-BPNN, although more accurate than those of the BPNN, still fell short of the VMD-based models. These results imply that neither the nondecomposition method nor the EEMD-based decomposition method may be suitable for daily streamflow forecasts.
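For reference, the three indices and the relative-change calculation used above can be written out directly; the check below reproduces the 75.60% RMSE reduction of the VMD-BPNN relative to the BPNN at Weijiabao from the figures quoted in the text.

```python
# The three evaluation indices used in this study, plus the relative
# reduction formula (other - VMD) / other * 100 implied by the text.
import numpy as np

def nse(obs, sim):
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def rmse(obs, sim):
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def mae(obs, sim):
    return float(np.mean(np.abs(obs - sim)))

# Reported Table 8 values (Weijiabao): RMSE 16.80 (VMD-BPNN) vs 68.86 (BPNN)
reduction = (68.86 - 16.80) / 68.86 * 100.0
print(round(reduction, 2))  # 75.6, matching the 75.60% quoted in the text
```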
The consistent forecast errors (Table 8) produced by the VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN models imply that the VMD-based models can be used for daily streamflow forecasts. The means of the flows predicted by these four models appear to be equivalent and closely match the measured data in the test phase, indicating that the amplitude and shape of the subseries decomposed by VMD are well forecasted by the VMD-based models. There was no obvious difference in performance among the four models in the test phase, but the errors of the VMD-BPNN model were slightly smaller than those of the other three, and the computational run times show that the VMD-BPNN model also had the highest computational efficiency (Table 8).

All these observations demonstrate the following: (1) hybrid models based on decomposition preprocessing methods such as VMD and EEMD surpass traditional models that perform no decomposition; (2) the VMD-based models are able to achieve almost perfect forecasting performance; and (3) EEMD cannot extract useful information from the daily streamflow time series as effectively as counterparts such as VMD [71]. The performance of the models can be explained by the observed mode-mixing phenomena. For the BPNN, as discussed previously for Figure 4, the centred daily streamflow time series was heavily mode-mixed, with various frequencies along the spectrum, and the different frequencies are associated with different physical meanings. As a result, the forecasting performance of the BPNN was unsatisfactory, as observed in Table 8.

##### 4.6. Forecasting Accuracy over Long Time Periods (>1 Day Ahead Time)

The above studies show that the BPNN, EEMD-BPNN, VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN models all yielded NSE coefficients greater than 0.8 when the foresight period was 1 day, with small RMSE and MAE values and satisfactory overall performance. Besides prediction accuracy, the length of the foresight period is also of great significance for guiding the operation and dispatching of reservoirs. Taking the VMD-BPNN model as an example, the effects of foresight periods of 1, 3, 5, and 7 days on the prediction results were therefore examined. The forecasting results using the VMD-BPNN model for 1, 3, 5, and 7 days ahead at the Wushan and Weijiabao stations are displayed in Figures 10 and 11. The histograms of the four performance metrics (NSE, RMSE, MAE, and time cost) for the different lead times of 1–7 days are presented in Figure 12.
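One common way to realize the 1-, 3-, 5-, and 7-day foresight periods is a direct multi-step strategy, in which a separate model is trained per lead time; this is a sketch under that assumption (the text does not detail the exact scheme), with a synthetic series and hypothetical helper `make_supervised`.

```python
# Direct multi-step setup: for each lead time L, build supervised pairs
# (last n_lags values, value L days ahead). The series here is synthetic.
import numpy as np

def make_supervised(series, n_lags, lead):
    """Pair each window of n_lags values with the value `lead` steps later."""
    X, y = [], []
    for t in range(n_lags, len(series) - lead + 1):
        X.append(series[t - n_lags:t])
        y.append(series[t + lead - 1])
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0.0, 20.0, 200))  # placeholder daily series
for lead in (1, 3, 5, 7):
    X, y = make_supervised(series, n_lags=11, lead=lead)
    print(lead, X.shape)                      # longer leads -> fewer samples
```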


Figures 10 and 11 show that the predicted data track the recorded data strongly for 3 and 5 days ahead but perform slightly worse when forecasting streamflow 7 days ahead, consistent with the scatter plots. If the NSE value is less than 0.8, the model forecasts are considered poor and are therefore not applied to predict streamflow [72]. The VMD-BPNN model clearly performed best when the foresight period was 1 day, exceeding the prediction accuracy obtained for foresight periods of 3, 5, and 7 days. It is worth noting that performance worsened as the foresight period lengthened (Figure 12), which was caused by the continuous accumulation of errors. As the foresight period increases, the RMSE and MAE increase gradually while the NSE coefficient decreases, but the NSE remains above 0.8 and can therefore satisfy the formulation of the daily operation plan of the reservoir. Figure 12 shows the acceptable reliability of the VMD-BPNN forecasts for foresight periods of 1, 3, 5, and 7 days at the Wushan and Weijiabao stations.
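The NSE ≥ 0.8 acceptability rule from [72] amounts to a simple threshold check per lead time; in the sketch below, only the 1-day NSE (0.9865, from Table 8 for Weijiabao) comes from the text, while the 3-, 5-, and 7-day values are illustrative placeholders.

```python
# Acceptability rule: forecasts with NSE < 0.8 are treated as poor and
# excluded from operational use. Only the 1-day NSE is from the text;
# the other lead-time values are hypothetical placeholders.
nse_by_lead = {1: 0.9865, 3: 0.95, 5: 0.90, 7: 0.85}
acceptable = {lead: v >= 0.8 for lead, v in nse_by_lead.items()}
print(acceptable)
```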

#### 5. Conclusions

This paper presented a comparative study of VMD-based models for forecasting the daily streamflow of the Wushan and Weijiabao hydrological stations in China. To improve the forecasting accuracy, five aspects were considered in a decompose-and-integrate fashion: (1) multiscale feature extraction, in which the subseries (IMFs) were extracted by the VMD; (2) determination of the input variables by the PACF, which was used to analyze the characteristics of each subseries; (3) prediction of each subseries with the VMD-ARMA, VMD-GBRT, VMD-SVR, and VMD-BPNN models, with the final forecasting results derived by summing; (4) comparison, using the NSE, RMSE, and MAE performance evaluation indices, of the forecasting results of the VMD-based models with those of two other prediction models, the BPNN and the EEMD-BPNN; and (5) a study of the effect of different foresight periods on the streamflow forecasts of the VMD-BPNN model.

In general, the VMD-based models proposed in this paper could significantly improve the prediction accuracy of nonstationary streamflow series at the daily time scale. Unlike traditional daily streamflow forecasts, the key features of the proposed method lie in the following aspects: (1) the multiscale feature extraction using the VMD approach was much more robust to sampling and noise and could effectively separate the original nonstationary series; (2) the combination of the VMD decomposition method with prediction models such as the ARMA, GBRT, SVR, and BPNN improved the forecasting accuracy; and (3) the VMD-BPNN method in particular improved the algorithm's stability with regard to accuracy while limiting the computation time. The prediction results showed that, among all the methods considered, the VMD-BPNN model was the best in both prediction accuracy and computational efficiency. The model can effectively predict highly nonlinear and nonstationary hydrological streamflow series and is a powerful tool for daily streamflow forecasting.

Even though the hybrid models obtain better prediction accuracy than several traditional methods, they still have a limitation: as black-box models, they lack an explanation of the intrinsic features they learn. However good the predicted results are, such models cannot be relied upon blindly in practical engineering. Hence, interpretable machine learning is a direction for future research; in addition, the models can be retrained with the complexity and representativeness of the data taken into account, to alleviate the potential influence of limited data samples and to achieve higher generalization ability.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This research was supported financially by the State Key Laboratory Base of Eco-Hydraulic Engineering in Arid Area, China (Grant no. 2017ZZKT-5), National Natural Science Foundation of China (Grant no. 51609197), CAS “Light of West China” Program (Grant no. XAB2016AW06), and Xi’an Science and Technology Program (Grant no. SF1335).