Abstract

Hydrological forecasting is one of the key research areas in hydrology. Innovative forecasting tools will reform water resources management systems, flood early warning mechanisms, and agricultural and hydropower management schemes. Hence, in this study, we compared Stacked Long Short-Term Memory (S-LSTM), Bidirectional Long Short-Term Memory (Bi-LSTM), and Gated Recurrent Unit (GRU) networks with the classical Multilayer Perceptron (MLP) network for one-step daily streamflow forecasting. The analysis used daily time series data collected from the Borkena (in the Awash River basin) and Gummera (in the Abay River basin) streamflow stations. All datasets passed through rigorous quality control processes, and null values were filled using linear interpolation. Partial autocorrelation was also applied to select the appropriate time lag for input series generation. Then, each dataset was split into training and testing sets at a ratio of 80 : 20. Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination (R2) were used to evaluate the performance of the proposed models. Finally, the findings are summarized under three themes: model variability, lag time variability, and time series characteristics. As a result, time series characteristics (climatic variability) had a more significant impact on streamflow forecasting performance than input lagged time steps and deep learning model architecture variations. Thus, Borkena's river catchment forecasting results are more accurate than Gummera's, with RMSE, MAE, MAPE, and R2 values ranging between (0.81 to 1.53, 0.29 to 0.96, 0.16 to 1.72, 0.96 to 0.99) and (17.43 to 17.99, 7.76 to 10.54, 0.16 to 1.03, 0.89 to 0.90) for the two catchments, respectively. Although the performance depends on lag time variations, MLP and GRU outperform S-LSTM and Bi-LSTM on a nearly equal basis.

1. Introduction

The science of streamflow forecasting is still one of the crucial research topics in hydrology. Accurate and reliable streamflow forecasting is vital for water resources planning, management, and disaster mitigation response authorities [1]. Streamflow forecasting can be classified into two categories. The first, short-term or real-time forecasting, includes daily and hourly timestamps and is widely applicable in flood management systems. The second, long-term forecasting, usually covers weekly, monthly, and annual streamflow forecasting and is crucial for reservoir operation, irrigation system management, and hydropower generation [2].

Generally, streamflow forecasting models can also be categorized into conceptual, physically based, and data-driven models [3]. Conceptual models are lumped in nature and typically rely on empirical relationships among various hydrological variables. Due to their reliance on observed data, they are rarely applicable to data-scarce catchments. Hydrological processes can also be represented in physical models through mass, momentum, and energy conservation equations. These models may account for spatial variability, but since they are distributed in nature, they require a large amount of data on land use, slope, soil, and climate [4]. Lastly, data-driven models form a nonlinear input-output relationship without physical catchment information and with minimal data requirements. Hence, the popularity of data-driven models is exploding these days with the advancement of computational capability and dataset availability [5].

Zhang et al. [6] classified the data-driven approach into conventional and Artificial Intelligence (AI) based models. Conventional data-driven models such as Auto-Regressive Moving Average with exogenous terms (ARMAX), Multiple Linear Regression (MLR), and Auto-Regressive Integrated Moving Average (ARIMA) are easy to implement [6]. However, they leave the nonlinearity of hydrological processes out of their equations. On the other hand, AI-based data-driven models can detect nonlinearity and perform better in streamflow forecasting. As a result, machine learning models have become a hot research topic these days.

AI-based data-driven streamflow forecasting models can be univariate when the model's input and output are designed with a single time series variable. Univariate forecasting models are straightforward to train using sparse data and provide ease of inference when evaluating forecast performance. Due to the complexity of agrometeorological data, it is simpler and more efficient to forecast the variables individually [7]. On the other hand, multivariate models are designed with multiple variables such as precipitation, temperature, evaporation, and other variables as input and a streamflow variable as output [6]. Thus, in data-scarce catchments, the application of univariate modelling is more feasible and has received wide attention in recent years [3, 6, 8, 9].

A wide variety of classical and deep learning models are present in the literature for time series forecasting, including Artificial Neural Network (ANN), Support Vector Machine (SVM), Fuzzy Logic (FL), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Adaptive Neuro-Fuzzy Inference System (ANFIS), and Genetic Programming (GP). However, because of the nonlinearity present in streamflow time series, the forecasting performance of these models is often debatable [3, 10]. Under one-step and multistep-ahead forecast scenarios, Suradhaniwar et al. [7] compared the performance of Machine Learning (ML) and Deep Learning (DL) based time series forecasting algorithms. They also evaluated recursive one-step-forward forecasts using walk-forward validation. With fixed forecast horizons, the Seasonal Auto-Regressive Integrated Moving Average (SARIMA) and Support Vector Regression (SVR) models outperformed their DL-based counterparts: Neural Network (NN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM).

ANN (MLP) is the most widely used classical machine learning architecture in hydrology [11]. Cui et al. [12] demonstrated that, when used for hourly river flow forecasting, the new generation of ANN, or Emotional Neural Network (ENN), models outperformed the Multivariate Adaptive Regression Splines (MARS), Minimax Probability Machine Regression (MPMR), and Relevance Vector Machine (RVM) models. Yaseen et al. [2] also conducted a detailed review of literature from high-impact journals over the 2000–2015 time frame on the state-of-the-art application of Artificial Neural Network (ANN), Support Vector Machine (SVM), Fuzzy Logic (FL), Evolutionary Computation (EC), and Wavelet-Artificial Intelligence (W-AI) for streamflow forecasting. The same study concluded that time series preprocessing, input variable selection, and time scale choice are the critical parameters for high-performing forecasting models.

RNN is a popular deep learning architecture optimized for time series analysis. However, it has drawbacks, such as vanishing and exploding gradients. Hochreiter and Schmidhuber [13] introduced the improved RNN version, or LSTM, a popular time series model for long time-step forecasting. Recently, various fields of study have been experimenting with these models [14–20]. Moreover, Cho et al. [21] first introduced GRU as a simplified version of the LSTM model. GRU merges short-term and long-term memory cells into a single gate with reasonably good performance and a fast running time [22]. Lara-Benítez et al. [23] evaluated the accuracy and efficiency of seven popular deep learning architectures: Multilayer Perceptron (MLP), Elman Recurrent Neural Network (ERNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Echo State Network (ESN), Convolutional Neural Network (CNN), and Temporal Convolutional Network (TCN). Additionally, they constructed over 38,000 distinct models to search for optimal architecture configurations and training hyperparameters, with LSTM achieving the lowest weighted absolute percentage error, followed by GRU.

Even though multiple effective GRU and LSTM forecasting models have been reported in different fields, specifically in hydrology, the accuracy of these models must be further fine-tuned with different data processing techniques and data input variations [24–27]. For univariate time series forecasting, we can restructure the previous time-step data series as predictor input variables and the current or next step as the output variable. However, the number of input time steps is difficult to decide without prior knowledge. Hence, studying the effect of lagged time selection in streamflow forecasting is essential for obtaining accurate models. Lagged variables in univariate streamflow forecasting are significant factors that can change the model's performance positively or negatively and hold temporal dependency information as predictor variables [28]. Tyralis and Papacharalampous [29] concluded that using a low number of recent predictor variables achieves higher accuracy for time series forecasting with the Random Forest (RF) algorithm. It is vital to expand this finding to other popular deep learning models and to different climatic conditions.

Papacharalampous et al. [30] tested 20 one-step-ahead univariate time series forecasting methods on extensive time series data. They concluded by identifying the most and least accurate approaches for one-step-ahead forecasting and by highlighting the importance of time series length for the performance of the various forecasting methods. Furthermore, the same study underlined that machine learning model optimization relies heavily on hyperparameter optimization and lagged variable selection. Torres et al. [31] also identified research gaps in various fields by analyzing the most successful deep learning architectures for time series prediction, highlighting MLP, RNN, GRU, and Bi-LSTM architectures in particular.

In the present study, we compared different forms of LSTM architectures, S-LSTM, Bi-LSTM, and GRU, with the classical MLP network to forecast single-step streamflow amounts from the available records of daily streamflow data. To the best of our knowledge, the LSTM has mainly been studied for monthly multivariate time series rather than for daily univariate streamflow forecasting. Even though machine learning can model hydrological forecasting efficiently, researchers should carefully examine the impact of suitable input variable and model parameter selection on model accuracy [32].

2. Study Area and Data

Two river basin subcatchments in Ethiopia were selected for this study: (a) Gummera subcatchment in Abay River basin (Figure 1(a)), (b) Borkena subcatchment in Awash River basin (Figure 1(b)).

2.1. Borkena Catchment (Awash River Basin/Ethiopia)

Borkena River originates at Kutaber woreda, at the conjunction of the Abay and Awash River basins (Ethiopia). The Awash River basin is usually classified into three main catchments: Lower Awash, Middle Awash, and Upper Awash. Borkena River is found in Lower Awash, with tributaries including the Berbera, Arawutie, Abasharew/Wuranie, Abba Abdela/Desso, Worka, and Leyole Rivers. The total length of the river is estimated at around 165 km. The Borkena River catchment hosts major cities, including Kombolcha, Dessie, and Kemissie. The study area's streamflow outlet is at Kombolcha station, and the catchment covers an area of 1709 km2, bounded between 39°30′E and 40°0′E and between 10°15′N and 11°30′N. Moreover, the area's elevation varies from 1775 m to 2638 m above sea level. The rainfall pattern of this catchment is unimodal, with 84% of the rainfall occurring from July to September [33].

2.2. Gummera Catchment (Abay River Basin/Ethiopia)

The second case study area is the Gummara subbasin, one of the main tributaries of Lake Tana in the Abay River basin. The lake is located in the north-western highlands at 12°00′N and 37°15′E and collects runoff from more than 40 rivers. The lake receives water from several major rivers, including Gilgel Abay in the south, Megech River in the north, and the Ribb and Gummara in the east. Small streams drain into the lake from its western side. Gummara River originates from the Guna mountains southeast of Debre Tabor at an altitude of approximately 3250 m.a.s.l. The Gummara catchment covers a total area of about 1592 km2. Many small intermittent and perennial rivers and springs flow into the mainstream of the Gummara River. The catchment's topography is undulating, ranging from 1788 m.a.s.l. to 3750 m.a.s.l.

2.3. Data

Daily streamflow time series of two hydrometeorological stations were collected from Ethiopia's Ministry of Water, Irrigation, and Energy (MoWIE). Then, these data were used to forecast single-step streamflow. At Borkena station, 6575 available streamflow data points were collected over the time window of January 1, 1972, to December 31, 1989. Similarly, at Gummera station, 9496 streamflow values from January 1, 1981, to December 31, 2006, were collected. A total of 866 null values were identified in the time series (i.e., 658 at Borkena and 208 at Gummera). The options for filling these gaps range from simple interpolation to complex statistical methods [34]. The method chosen depends on the length and season of the missing data, the availability of hydrometeorological data, the climatic region, and the length of previous observations. The sample mean or a subgroup mean can be used to fill in missing values in daily streamflow data. However, replacing missing values with sample means may underestimate the variance and cause incorrect subgroup identification [35]. Instead, the linear interpolation method is quick and straightforward to use, and it may be sufficient for data with small gaps [36]. Thus, we implemented linear interpolation in this study, and since the majority of the null values were concentrated at low flows, the interpolation is acceptable [37–40].
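As a minimal illustration of this gap filling (a sketch assuming the series is loaded into a pandas DataFrame with a datetime index; the file and column names are hypothetical):

```python
import pandas as pd

# Load the daily streamflow series (file and column names are illustrative).
df = pd.read_csv("borkena_daily_flow.csv", parse_dates=["date"], index_col="date")

# Count the null values before filling.
print("missing values:", df["flow"].isna().sum())

# Fill gaps by linear interpolation between the nearest valid observations.
df["flow"] = df["flow"].interpolate(method="linear")
```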

After passing these rigorous quality control processes, the raw data were split into training and testing datasets at a ratio of 80 : 20. Accordingly, sliding windows of different sizes with a single overlapping step were used to rebuild the input time series into a supervised learning format. The subsets were further standardized using a standard scaler approach. Figure 2 shows the descriptive statistics and the corresponding plots of the split data for both catchments.
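A sketch of this preprocessing pipeline (reusing the df from the previous snippet; the 4-step window anticipates the lag selected in Section 3.3):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def make_windows(series, n_lags):
    """Rebuild a 1-D series into (samples, n_lags) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - n_lags):
        X.append(series[i:i + n_lags])
        y.append(series[i + n_lags])
    return np.array(X), np.array(y)

flow = df["flow"].to_numpy()
split = int(len(flow) * 0.8)                      # 80 : 20 chronological split
train_raw, test_raw = flow[:split], flow[split:]

# Standardize using statistics of the training subset only.
scaler = StandardScaler()
train = scaler.fit_transform(train_raw.reshape(-1, 1)).ravel()
test = scaler.transform(test_raw.reshape(-1, 1)).ravel()

X_train, y_train = make_windows(train, n_lags=4)  # 4-day lag (Section 3.3)
X_test, y_test = make_windows(test, n_lags=4)
```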

3. Methods

This study compared three types of recurrent network architectures (GRU, Bi-LSTM, and S-LSTM) with the classical neural network architecture (MLP).

3.1. Deep Learning Models

Deep learning models are usually distinguished from nondeep machine learning models by the depth of the network, that is, by the number of stacked neuron layers and the designed architectures. Nondeep learning models usually cannot accurately learn the strong nonlinearity present between the input and output variables [41]. In contrast, deep learning models are widely applied in different tasks, including processing, analyzing, designing, estimating, filtering, and detection tasks [42]. Popular deep learning models applied in different fields of study include Multilayer Perceptron (MLP), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), Radial Basis Function Networks (RBFN), and Generative Adversarial Networks (GAN) [21, 25, 43–46]. The time series models used in this study are discussed in the following sections.

3.1.1. Multilayer Perceptron (MLP)

ANN, or the feed-forward multilayer perceptron (MLP), is a widely used architecture in the hydrological literature [47]. A perceptron loosely replicates the processing of biological neurons by applying a series of mathematical and probabilistic operations [48]. MLP contains three main layers: an input layer, a hidden layer, and an output layer. The network becomes deep and can extract higher-order statistics by adding more hidden layers [44]. A three-layered MLP is common in hydrological time series modelling.

A typical diagram of one node (the jth) of an ANN is displayed in Figure 3. Depending on its layer location, the series of inputs forms an input vector $X = (x_1, x_2, \ldots, x_n)$. The equivalent series of weights applied to each input forms a weight vector $W_j = (w_{j1}, w_{j2}, \ldots, w_{jn})$. The output of node $j$ is calculated by applying a function $f$ to the inner product of the input vector $X$ and the weight vector $W_j$ minus the bias $b_j$ [49]:

$$y_j = f\left(\sum_{i=1}^{n} w_{ji}\, x_i - b_j\right) \qquad (1)$$

The activation function (f) helps decide whether a neuron is activated or not; commonly used activation functions include the Rectified Linear Unit (ReLU), Leaky ReLU, Sigmoid, Hyperbolic Tangent (Tanh), and Softmax [50]. Even though ANN has been applied enormously in hydrological modelling for the past decades, its ability to capture extreme events is doubtful for complex problems such as rainfall-runoff processes [47].
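For concreteness, a minimal Keras sketch of such a network for one-step forecasting (assuming the 4-lag windows from Section 2.3; the layer widths are illustrative, not the tuned values from Table 1):

```python
from tensorflow import keras
from tensorflow.keras import layers

n_lags = 4  # input window length from Section 3.3

mlp = keras.Sequential([
    layers.Input(shape=(n_lags,)),
    layers.Dense(64, activation="relu"),   # first hidden layer (width illustrative)
    layers.Dense(32, activation="relu"),   # second hidden layer
    layers.Dense(1, activation="linear"),  # single-step forecast output
])
mlp.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
```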

3.1.2. Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) networks differ from MLP in that they have recurrent connections; information from long previous time-step memories is used to formulate forecasts. Overcoming vanishing and exploding gradients makes LSTM more popular in sequence and time series analysis than the traditional Recurrent Neural Network (RNN) [51]. LSTM networks have memory cells and three gates: input, forget, and output. These gates are responsible for saving, forgetting, attending to, or passing information to other cells [27]. Figure 4 displays a typical LSTM memory cell with three gated layers, and the network components are formulated as follows:

$$f_t = \sigma\left(w_f x_t + u_f h_{t-1} + b_f\right) \qquad (2)$$
$$i_t = \sigma\left(w_i x_t + u_i h_{t-1} + b_i\right) \qquad (3)$$
$$o_t = \sigma\left(w_o x_t + u_o h_{t-1} + b_o\right) \qquad (4)$$
$$\tilde{c}_t = \tanh\left(w_c x_t + u_c h_{t-1} + b_c\right) \qquad (5)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \qquad (6)$$
$$h_t = o_t \odot \tanh\left(c_t\right) \qquad (7)$$

where $f_t$ (equation (2)) is the forget gate, which chooses the information to reject from the cell; $i_t$ (equation (3)) is the input gate, which decides which input values update the memory state; and $o_t$ (equation (4)) is the output gate, which decides the output value after analyzing the input and the memory of the cell. The weight matrices $w_i$, $w_f$, $w_o$, and $w_c$ correspond to the input gate, forget gate, output gate, and cell gate units, respectively, while the weight matrices $u_i$, $u_f$, $u_o$, and $u_c$ map the previous hidden state to these gates; $b_i$, $b_f$, $b_o$, and $b_c$ are the corresponding bias vectors. Moreover, $c_t$ (equation (6)) and $h_t$ (equation (7)) are the memory cell state and the hidden state, $\tilde{c}_t$ (equation (5)) is the candidate cell state, $\sigma$ is the sigmoid function, and $\odot$ denotes the Hadamard product [24].
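A minimal stacked LSTM (S-LSTM) sketch under the same assumptions (recurrent layers expect inputs shaped (samples, timesteps, features), so the windowed inputs gain a trailing feature axis):

```python
from tensorflow import keras
from tensorflow.keras import layers

s_lstm = keras.Sequential([
    layers.Input(shape=(4, 1)),              # 4 lagged steps, 1 feature
    layers.LSTM(64, return_sequences=True),  # first LSTM layer passes full sequences on
    layers.LSTM(32),                         # second LSTM layer returns the final state
    layers.Dense(1, activation="linear"),
])
s_lstm.compile(optimizer="adam", loss="mse")

# The windowed inputs need a trailing feature axis, e.g. X_train.reshape(-1, 4, 1).
```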

3.1.3. Bidirectional LSTM (Bi-LSTM)

Bi-LSTM is another option for getting the most out of recurrent networks by stepping through the input time steps both forward and backward. Although Bi-LSTMs were developed for speech recognition, the use of bidirectional input sequences is nowadays one of the principal options for sequence prediction. The hidden layer of the Bi-LSTM model needs to save two values: $\overrightarrow{h}_t$ from the forward calculation and $\overleftarrow{h}_t$ from the backward calculation. The final output value $Y_t$ is obtained by combining the outputs of the forward and backward layers [52].

Each point in the input sequence is thus provided to the output layer with complete past and future contextual information. There is no information flow between the forward and backward hidden layers, ensuring that the unfolded graph is acyclic [53]. Figure 5 displays the structure of the bidirectional LSTM architecture.
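A corresponding Bi-LSTM sketch; Keras's Bidirectional wrapper runs the wrapped LSTM forward and backward over the window and, by default, concatenates the two final hidden states (unit count illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

bi_lstm = keras.Sequential([
    layers.Input(shape=(4, 1)),
    layers.Bidirectional(layers.LSTM(64)),  # forward and backward passes, outputs combined
    layers.Dense(1, activation="linear"),
])
bi_lstm.compile(optimizer="adam", loss="mse")
```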

3.1.4. Gated Recurrent Unit (GRU)

GRU is a newer variant of LSTM that merges the input and forget gates into a single update gate. Hence, it has fewer parameters and a faster running time than LSTM, with comparable though still debated performance [24, 26, 27, 40]. Update and reset gates are the two gates available in GRU. The update gate renews the current memory, which enables the memorization of valuable information; on the contrary, the reset gate clears the current memory to forget unhelpful information at any time step. Figure 6 shows the structure of the GRU network, and the hidden units are formulated as follows:

$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \qquad (8)$$
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \qquad (9)$$
$$\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \qquad (10)$$
$$h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad (11)$$

where $x_t$ is the input vector, $z_t$ (equation (8)) is the update gate vector, $r_t$ (equation (9)) is the reset gate vector, $\tilde{h}_t$ (equation (10)) is the candidate activation vector, $h_t$ (equation (11)) is the output vector, $W$, $U$, and $b$ are parameter matrices and vectors, and the sign $\odot$ denotes the Hadamard product.
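And a matching GRU sketch, again with an illustrative unit count:

```python
from tensorflow import keras
from tensorflow.keras import layers

gru = keras.Sequential([
    layers.Input(shape=(4, 1)),
    layers.GRU(64),                        # single GRU layer; fewer parameters than LSTM
    layers.Dense(1, activation="linear"),
])
gru.compile(optimizer="adam", loss="mse")
```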

3.2. Main Model Development

Optimizing deep learning models requires decisions on a large combination of hyperparameters, including the number of layers, number of units, batch size, epochs, and learning rate [54]. Of the four major hyperparameter optimization techniques (trial-and-error, grid, random, and probabilistic), random search can sample a practically unlimited number of hyperparameter combinations at a moderate computational cost [31]. Hence, in this study, we used Keras Tuner, a computationally efficient randomized search library developed by the Google team, to search random combinations of parameters for optimized performance. The detailed flow chart of the proposed methodology is presented in Figure 7.

The proposed models applied two fully connected hidden layers. The minimum and maximum numbers of neurons in each hidden layer were set based on prior experience. Then, the output layer is linked to a dense layer with a single output neuron and a linear activation function. The network is compiled with the Adam optimizer and the mean squared error loss function, and the details of the hyperparameter value ranges and choices are listed in Table 1. A sketch of the resulting search space is given below, and the following subsections discuss each hyperparameter optimized using Keras Tuner.
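A hedged sketch of how such a search space can be declared with Keras Tuner (shown for the MLP variant; the recurrent models swap the Dense hidden layers for LSTM/GRU layers, and the exact ranges are those of Table 1, which we only paraphrase here):

```python
import keras_tuner as kt
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    model = keras.Sequential([layers.Input(shape=(4,))])
    for i in range(2):  # two fully connected hidden layers
        model.add(layers.Dense(hp.Int(f"units_{i}", 16, 128, step=16),
                               activation="relu"))
        model.add(layers.Dropout(hp.Float(f"dropout_{i}", 0.0, 0.5, step=0.1)))
    model.add(layers.Dense(1, activation="linear"))        # single-output linear head
    lr = hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])    # choices from Section 3.2.2
    model.compile(optimizer=keras.optimizers.Adam(lr), loss="mse")
    return model
```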

3.2.1. Activation Function

In deep learning models, the activation function defines the output from the inputs each node receives. In our case, we applied Rectified Linear Units (ReLU) in all layers except the output layer.

3.2.2. Learning Rate

In deep learning, the learning rate is one of the hyperparameters that decides the step size taken each time the model progresses toward a minimum of the loss function. Hence, it is crucial to optimize the learning rate properly; otherwise, the model may converge slowly with a learning rate that is too small or diverge from the optimal error points with one that is too large [55]. This study sets three values (1e − 2, 1e − 3, or 1e − 4) for Keras Tuner to choose from.

3.2.3. Number of Epochs

The number of epochs is the hyperparameter that decides how many times the deep learning algorithm works through the entire training set. With more epochs, the model weights get more chances to update themselves. The loss curve goes through different stages, such as underfitting, an optimal state, or overfitting; even though there are no strict rules for configuring this hyperparameter, we set minimum (10) and maximum (100) values for optimization using Keras Tuner.

3.2.4. Batch Size

The batch size is the number of training samples processed before the model updates its weights.

3.2.5. Dropout

The dropout rate is a hyperparameter that prevents overfitting and enhances training performance. At each iteration, dropout freezes a fraction of the neurons during training, and this fraction is defined on a range of 0 to 1. Putting these hyperparameters together, the randomized search can be launched as sketched below.
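A sketch of the full tuning run (assuming build_model and the imports above; the directory and project names are hypothetical, and early stopping stands in here for explicitly tuning the epoch count):

```python
tuner = kt.RandomSearch(
    build_model,
    objective="val_loss",
    max_trials=20,               # number of random combinations to try (illustrative)
    directory="tuning",
    project_name="borkena_mlp",
)
stop_early = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)
tuner.search(X_train, y_train,
             epochs=100,         # upper bound from Section 3.2.3
             batch_size=32,      # illustrative; Table 1 lists the tuned choices
             validation_split=0.2,
             callbacks=[stop_early])
best_model = tuner.get_best_models(num_models=1)[0]
```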

Different open-source Python libraries were used for model development, including TensorFlow, Keras, and Scikit-Learn, along with Matplotlib for visualization and Statsmodels for performance evaluation. Moreover, the simulations were conducted on a computer with an Intel(R) Core(TM) i7-6500U CPU at 2.50 GHz and 8 GB of RAM.

3.3. Input Time Lag Selection

To select the highly correlated time series lags that are restructured and used as inputs to the deep learning networks, an autoregressive model with the partial autocorrelation function (PACF) was applied. Equation (12) shows the autoregressive model AR(p):

$$x_t = \sum_{i=1}^{p} \varphi_i\, x_{t-i} + \varepsilon_t \qquad (12)$$

where $\varphi_i$ are the autoregressive parameters, $x_t$ is the observation at time $t$, and $\varepsilon_t$ is the white noise at time $t$. The autoregressive model explores the correlation between current and past values [56]. As shown in Figure 8, which plots the partial autocorrelation of the daily streamflow time series with a 95% confidence interval, a time delay of 4 days was selected for both case study areas.
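A short sketch of this lag selection with Statsmodels (assuming the gap-filled series from Section 2.3):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

# Partial autocorrelation of daily flow with a 95% confidence band (alpha=0.05).
fig, ax = plt.subplots(figsize=(8, 3))
plot_pacf(df["flow"], lags=30, alpha=0.05, ax=ax)
plt.show()
# Lags whose bars fall outside the confidence band are significant;
# here, the first 4 lags were retained as model inputs.
```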

3.4. Performance Measures

The following performance measures were used to evaluate the accuracy of the developed models: coefficient of determination (R2), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) [57].

(i) Coefficient of Determination (R2):
$$R^2 = \left[\frac{\sum_{i=1}^{n}\left(Q_{obs,i} - \overline{Q}_{obs}\right)\left(Q_{sim,i} - \overline{Q}_{sim}\right)}{\sqrt{\sum_{i=1}^{n}\left(Q_{obs,i} - \overline{Q}_{obs}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(Q_{sim,i} - \overline{Q}_{sim}\right)^2}}\right]^2 \qquad (13)$$

(ii) Root Mean Square Error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Q_{obs,i} - Q_{sim,i}\right)^2} \qquad (14)$$

(iii) Mean Absolute Error (MAE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Q_{obs,i} - Q_{sim,i}\right| \qquad (15)$$

(iv) Mean Absolute Percentage Error (MAPE):
$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{Q_{obs,i} - Q_{sim,i}}{Q_{obs,i}}\right| \qquad (16)$$

where $Q_{obs}$ is the observed discharge, $Q_{sim}$ is the simulated discharge, $\overline{Q}_{obs}$ and $\overline{Q}_{sim}$ are their means, and $n$ is the number of observations. The range of R2 lies between 0 and 1, representing, respectively, no correlation and a perfect correlation between observed and simulated values, whereas, for RMSE, MAE, and MAPE, better performance is reached as the value approaches 0. If R2 > 0.90, the simulation is very acceptable; if 0.60 < R2 < 0.90, the simulation is acceptable; and if R2 < 0.60, the simulation is unacceptable [58].
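These measures can be computed directly; a minimal NumPy sketch (the function name is ours, and R2 is taken as the squared Pearson correlation, consistent with the 0-to-1 interpretation above):

```python
import numpy as np

def evaluate(q_obs, q_sim):
    """Return the four evaluation metrics for observed vs. simulated discharge."""
    err = q_obs - q_sim
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = 100 * np.mean(np.abs(err / q_obs))   # undefined where q_obs == 0
    r = np.corrcoef(q_obs, q_sim)[0, 1]
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r ** 2}
```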

4. Results and Discussion

Previous analyses have revealed a wide range of results, varying according to the type of deep learning architecture used, the degree of climatic variability, and the timescale used. The GRU model outperformed both the extreme learning machine (ELM) and the least-squares support vector machine (LSSVM) on monthly univariate streamflow data from the Shangjingyou and Fenhe reservoir stations in the upper reaches of the Fenhe River [59]. Sahoo et al. [19] also demonstrated the superiority of LSTM over RNN on univariate daily discharge data from the Basantapur gauging station in India's Mahanadi River basin. Suradhaniwar et al. [7] demonstrated that SARIMA and SVR models outperform NN, LSTM, and RNN models when hourly averaged univariate time series data is used. Even though generalizing about the best model is difficult, case-based analysis is the most effective method for determining which model best fits a given situation [60]. For the first time in this study, we attempted to organize the findings around three distinct themes: model variability, lag time variability, and time series characteristics (climatic variability).

The following section discusses the performance of four selected models under four different input time lag scenarios; in total, 16 experimental results are presented. Tables 2 and 3 show the optimized hyperparameter values for both stations and all scenarios using Keras Tuner. Additionally, Tables 4 and 5 illustrate performance metrics in terms of RMSE, MAE, MAPE, R2, and TTPE (sec). In the testing phase, single-step streamflow forecasting results demonstrated very acceptable performance for both case study areas.

4.1. Model Variability

We used four evaluation metrics to investigate the effect of model variability on the accuracy of single-step streamflow forecasting. Then, using box plots, we visualized the spread of the prediction error (m3/s). Additionally, we plotted a bar chart (Figure 9) with different prediction error classes to identify the class limits with the highest error concentration. As a result, as shown in Table 4, the GRU model has a slight performance advantage over MLP and S-LSTM at Borkena station for lag times (T + 2 and T + 3), whereas, for the Gummera catchment (Table 5), MLP outperformed GRU and S-LSTM for lag times (T + 1 and T + 3). Prediction error box plots and bar charts (Figures 10 and 11) were used to investigate these high-performing architectures further. The prediction error of GRU is typically concentrated in small ranges: 0 to 0.5 m3/s for Borkena and 0 to 2.5 m3/s for the Gummera catchment. Moreover, as shown in Tables 4 and 5, when considering computational speed, MLP demonstrated the quickest training time per epoch, followed by S-LSTM, GRU, and Bi-LSTM.

4.2. Time Series Characteristics (Climatic Variability)

The other major issue affecting deep learning model performance is the time series characteristics. In this study, the four evaluation metrics showed that Borkena's river catchment forecasting results are more accurate than Gummera's, with RMSE, MAE, MAPE, and R2 ranging between (0.81 to 1.53, 0.29 to 0.96, 0.16 to 1.72, 0.96 to 0.99) and (17.43 to 17.99, 7.76 to 10.54, 0.16 to 1.03, 0.89 to 0.90) for the two catchments, respectively. The possible cause of this performance variation between the two catchments is the significant natural streamflow variability in the Borkena catchment. Furthermore, the spread of the prediction error (m3/s) in Figures 10 and 11 shows that the error class limits are, in most cases, smaller at Borkena station than at Gummera station.

4.3. Lag Time Variability

In univariate streamflow forecasting, lagged variables are the other significant factors that hold temporal information and affect model performance. The Taylor diagram in Figure 12 shows the forecasting ability of the proposed models against the observed test data for both case study areas. The diagram summarizes three statistics at once: the standard deviation (radial distance from the origin), the correlation coefficient (angular position), and the centered root mean square error (distance from the reference point). It shows that, irrespective of the deep learning model, forecasting with a lag time of four produces a time series whose standard deviation is closest to that of the actual test observations. Moreover, Figures 13 and 14 also display the actual and predicted values of the optimized high-score deep learning model for each time lag and both case study catchments.

5. Conclusions

This study presented a comparative analysis of different deep learning algorithms for one-step daily streamflow forecasting at two subcatchment streamflow outlets. MLP, S-LSTM, Bi-LSTM, and GRU are the four algorithms used in this study. The study clearly showed the impacts of climatic (time series characteristics) and lagged time variability on the performance of the proposed deep learning models. The following key points elaborate on the outcome of this research:
(i) Deep learning models have excellent potential for forecasting short-term daily streamflow under different time series characteristics.
(ii) The performance of deep learning models for short-term streamflow forecasting varies with time series characteristics and input time lag variations. Moreover, the Borkena station has more significant natural streamflow variability than the Gummera station, which is also reflected in the model results. Hence, this study showed that catchment response variability impacts deep learning model performance.
(iii) MLP and GRU outperform S-LSTM and Bi-LSTM on a nearly equal basis for single-step short-term streamflow forecasting in both case study areas. However, the performance is relative to the lagged time variations.
(iv) Catchment characteristics had a higher impact on streamflow forecasting performance than deep learning model architectures and lagged time variations.
(v) The study also showed that the classical MLP can perform almost equally with the S-LSTM and GRU deep learning networks on a small amount of streamflow time series data.

Future research may further expand this study's findings to other climatic regions, hybrid deep learning model architectures, hyperparameter tuning, and lagged time selection methods. The full implications of large input variability for deep learning models in univariate streamflow forecasting should also be investigated. As part of our future work, we plan to implement an ensemble learning approach to simulate streamflow from remote sensing-derived data products (precipitation and vegetation indexes) using a combination of neural networks, decision trees, and boosting algorithms.

Data Availability

The raw meteorological and hydrological datasets used for the Borkena and Gummera watersheds are available from the corresponding author upon request. However, authorization letters are required from the Ethiopian National Meteorological Agency (NMA) and the Ethiopian Ministry of Water and Energy (MoWE) (https://mowe.gov.et/).

Conflicts of Interest

The authors declare that they have no conflicts of interest for this article.

Acknowledgments

The authors acknowledge the Ethiopian Ministry of Water and Energy (MoWE) for the hydrological data and the Ethiopian National Meteorological Agency (NMA) for the meteorological data.