Deep sequential (DS) models have been extensively employed for forecasting time series data since the dawn of the deep learning era, and they provide forecasts for the values required in subsequent time steps. Unlike traditional statistical models for forecasting time series data, DS models can learn hidden patterns in temporal sequences and can memorize data from prior time points. Given the widespread usage of deep sequential models in several domains, a comprehensive study describing their applications is necessary. This work presents a comprehensive review of contemporary deep learning time series models, their performance in diverse domains, and an investigation of the models that were employed in various applications. Three deep sequential models, namely, artificial neural network (ANN), long short-term memory (LSTM), and temporal convolutional neural network (TCNN), along with their applications for forecasting time series data, are elaborated. We present a comprehensive comparison of these models in terms of application fields, model structure and activation functions, optimizers, and implementation, with the goal of learning more about the optimal model to use. Furthermore, the challenges and perspectives of future development of deep sequential models are presented and discussed. We conclude that the LSTM model is widely employed, particularly in the form of a hybrid model, and that hybrid forms yield the most accurate predictions.

1. Introduction

There are many different forms of data, and among them is time series data. This data type makes it possible to predict future data, at the same rate, by means of forecasting analysis [1]. Time series frequently exhibit time dependencies, so that two comparable time points may be classified differently or may predict divergent behavior [2]. Time periods are represented by years, months, weeks, days, or hours [3]. Many real-world applications, such as biological sciences [4], healthcare [5], and financial and weather prediction [6], use time series to capture data over time. Traditional approaches, such as exponential smoothing [7, 8], autoregressive (AR) models [7, 9], or structural time series models [10], focus on parametric models driven by domain knowledge. Machine learning techniques have recently made it possible to learn temporal dynamics purely from data [11]. As data availability and computing power have increased in recent years, machine learning has emerged as a vital component of the next generation of time series forecasting algorithms [12]. A variety of machine learning algorithms and models have been employed to forecast time series data. Some of them employ hybrid approaches, which mix multiple models or integrate an optimization procedure into the prediction process [3].

Deep learning (DL) models, a subset of machine learning that uses deep neural networks, are employed in a range of research domains, including speech recognition [13], text mining, and image analysis [14]. DL models combine numerous layers to represent data at multiple levels of abstraction. Although a DL model takes a long time to train due to its large number of parameters, it runs faster at test time than other machine learning techniques [15]. Moreover, DL models can learn hidden patterns from data. Deep neural networks can learn complicated data representations [16] by including unique architectural assumptions [17], which take into account the nuances of the underlying datasets. This eliminates the requirement for manual model construction and human feature engineering. Additionally, the number of available frameworks for back-propagation has made network training much easier by allowing for modification of network elements and loss functions [18, 19]. We emphasize diverse aspects of sequence learning and how they compare to one another with regard to deep learning. The goal of this paper is to give a comprehensive analysis of three different deep sequential forecasting models that have been utilized to produce predictions. Our analysis may aid researchers in identifying the best model to use based on the application domain, as well as in the development of novel or hybrid algorithms.

With the widespread use of deep sequential models in several domains of science and application, one may be confused as to which model is appropriate for solving a given problem. Our research is driven by the goal of providing a comprehensive review of current deep learning time series models. It is worth guiding the user communities toward the best model for their problem in terms of model performance in various fields. This study intends to answer which models are typically used for which applications, which programs best implement the algorithms, how data splitting is utilized in the studies, and which optimization approaches are employed by the researchers. Our evaluation may aid researchers and practitioners in making the best model selection based on scope similarity and system requirements, consequently assisting in the development of novel algorithms. The main contributions of this paper are summarized as follows:
(i) To the best of our knowledge, this is the first review work that covers the usage of deep sequential models for forecasting time series data to serve the research community.
(ii) We review the three common deep sequential models.
(iii) A comprehensive review of deep sequential models in different applications, such as healthcare, finance and stock prediction, weather forecasting, environment, and pollution, is provided.
(iv) The main challenges in deep sequential models are highlighted.
(v) Some research gaps and open issues in the field that need further investigation are discussed.

The remainder of the paper is organized as follows. Section 2 gives a quick overview of three popular deep sequential models for time series data. In Section 3, we review selected articles from the last three years that used deep sequential models in various applications. Section 4 gives an in-depth discussion of the models, their implementation, the activation functions used, and a comparison of applications based on the models used. Finally, Section 5 concludes the paper with closing remarks on the methodologies that have been examined.

2. Deep Sequential Model

Deep sequential models are deep learning techniques used when the input, the output, or both are sequence data [20]. Sequences are made up of data points arranged so that observations at one point in the sequence provide meaningful information about observations at other points. Sequence data arise in a large number of supervised learning tasks. The input may be a sequence and the output a single data point, as in video activity detection, sentiment classification, and stock price forecasting. Conversely, the input may be a single data point and the output a sequence, as in image captioning, music generation, and speech synthesis. There are also cases in which both the input and the output are sequences, as in voice recognition, natural language understanding, and DNA sequence analysis; in these cases, the lengths of the input and output sequences may be the same or different [21].
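As an illustration of the sequence-to-single-point case in forecasting, a series is commonly turned into supervised samples with a sliding window. The following is a minimal sketch, not a method taken from the reviewed works; the function name `make_windows` and the parameters `window` and `horizon` are hypothetical names chosen for this example.

```python
import numpy as np

def make_windows(series, window, horizon=1):
    """Split a 1-D series into (input window, target) pairs.

    Each input holds `window` consecutive observations; each target is the
    value `horizon` steps after the end of that window.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - window - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this window and horizon")
    X = np.stack([series[i:i + window] for i in range(n_samples)])
    first_target = window + horizon - 1
    y = series[first_target:first_target + n_samples]
    return X, y

# Illustrative values only: three past observations predict the next one.
X, y = make_windows([10, 11, 12, 13, 14, 15], window=3)
```

Here each row of `X` is an input sequence and the matching entry of `y` is the single value to be forecast.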

Deep neural networks have proven capable of modeling extremely complex input-output mappings, particularly in disciplines like computer vision, voice recognition, and natural language processing. This has led to the development of several deep learning techniques for time series forecasting that outperform traditional approaches in terms of accuracy [22]. In this section, we briefly describe the three most commonly used deep sequential models in the literature: ANN, LSTM, and TCNN.

2.1. Artificial Neural Network (ANN)

ANN is a technique for processing data that mimics the human brain. A brain acquires knowledge from experience, and an ANN processes data in a similar way. It is categorized as supervised or unsupervised learning based on the amount of information available about the values of the output variables [23]. The basis of an ANN is made up of nodes, or neurons, arranged in arrays of parallel, interconnected processing units. A multilayer perceptron (MLP) is a feed-forward ANN model that maps input datasets onto a set of appropriate outputs [24]. ANNs are composed of three fundamental layers: input, hidden, and output. The input layer is the information source for the network. The hidden layer learns the nonlinear relationships between the input(s) and the output(s) by altering the weights. Layers are composed of varying numbers of neurons, which process input through activation functions. The output layer contains the forecasted data [25], and the output of unit i can be given as follows:

$$y_i = f\left(\sum_{j=1}^{p} w_{ij} x_j + b_i\right) \tag{1}$$

where $w_i = (w_{i1}, \ldots, w_{ip})$ is the parameter vector of unit i, p is the number of neurons in the layer preceding unit i, $x = (x_1, \ldots, x_p)$ is the input of unit i, and $b_i$ is the bias of unit i. The weighted sum $\sum_{j=1}^{p} w_{ij} x_j + b_i$, referred to as the incoming signal of unit i, is then passed through a transfer function $f$ as shown in equation (1) [26]. The ANN’s general architecture is described in Figure 1.
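The computation of a unit described above (a weighted sum of the inputs plus a bias, passed through a transfer function) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names `unit_output` and `layer_output`, the choice of tanh as the transfer function, and all numeric values are hypothetical and not taken from the reviewed works.

```python
import numpy as np

def unit_output(x, w, b, transfer=np.tanh):
    """Output of one unit: transfer(weighted sum of inputs + bias)."""
    incoming_signal = np.dot(w, x) + b
    return transfer(incoming_signal)

def layer_output(x, W, b, transfer=np.tanh):
    """Outputs of a whole layer; row i of W holds the weights of unit i."""
    return transfer(W @ x + b)

# Illustrative values: p = 3 inputs feeding a layer of two units.
x = np.array([0.5, -1.0, 2.0])
W = np.array([[0.1, 0.2, 0.3],
              [-0.4, 0.0, 0.5]])
b = np.array([0.0, 0.1])

y = layer_output(x, W, b)
```

Computing the layer as one matrix product gives the same result as evaluating each unit separately, which is why feed-forward networks are usually implemented layer by layer.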

The function that produces a neuron’s output is known as the activation function. The weight associated with the connection between two neurons reflects the strength of that connection. Typically, a neuron’s computation is separated into two stages: aggregation and activation. The aggregation function computes the total of all the inputs received by the neuron through its incoming connections. The resulting value is sent to the activation function. There are several types of activation functions; the sigmoid and hyperbolic tangent are frequently used. The rectified linear unit (ReLU) is another function that has grown in popularity [2]. The mathematical formulas of these activation functions are given in equations (2)–(5).
(i) ReLU is the most often used activation function, owing to its simplicity and effectiveness. Only a subset of neurons fires at any one time, which makes the network sparse and increases efficiency.
$$\mathrm{ReLU}(x) = \max(0, x) \tag{2}$$
(ii) Sigmoid is another fairly common activation function, which constrains the output to a value between 0 and 1 [28].
$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{3}$$
(iii) Tanh is similar to the sigmoid function, but its output values lie within the range (−1, 1).
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{4}$$
(iv) Softmax: this function produces continuous values in the range (0, 1) and is often applied as a classifier at the output layer, since it produces probabilities spread over the classes in multi-class predictions.
$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{k=1}^{K} e^{x_k}}, \quad i = 1, \ldots, K \tag{5}$$
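The four activation functions listed above can be written in a few lines of NumPy. The sketch below uses the standard textbook definitions; the function names are chosen for this example, tanh is taken directly from `np.tanh`, and the shift by the maximum inside `softmax` is a common numerical-stability device that does not change the result.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative inputs are set to zero."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic sigmoid: squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Softmax over a 1-D vector: positive values that sum to 1."""
    shifted = x - np.max(x)  # same result, avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])  # illustrative incoming signals
relu_out = relu(z)
sigmoid_out = sigmoid(z)
tanh_out = np.tanh(z)
softmax_out = softmax(z)
```

Softmax is applied to the whole output vector at once, whereas the other three functions act on each element independently.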

2.2. RNN

Deep learning algorithms are a subset of machine learning algorithms that aim to identify various representations of incoming data. One of the most widely used models for deep sequential learning is the recurrent neural network (RNN), which was developed in the 1980s [29]. RNNs are used to handle sequence data [15]. They contain memory cells, which enable them to remember data from the past; this is important for forecasting future outcomes. They have been extensively utilized to solve supervised learning problems involving sequential data. Owing to the sequential structure of time series, comparable designs have been utilized for time series prediction as well [22]. RNNs can be trained with the back-propagation through time (BPTT) algorithm introduced by Werbos [30]. The drawback of RNN training is that the RNN design is backward dependent over time [31]. We will cover some RNN designs that have proven effective at predicting time series data, such as the LSTM model and the gated recurrent unit (GRU) [32]. The GRU structure is similar to the LSTM structure; however, it is easier to compute and implement and has fewer parameters.

2.2.1. LSTM

The LSTM is a type of RNN developed by Hochreiter and Schmidhuber [33] to model the structure of sequential data. The disadvantage of standard RNNs is that their ability to learn diminishes as the relevant data move further away from the current input. When the initial weight is no longer accessible, the vanishing gradient problem occurs. The LSTM addresses this by managing, for each memory cell, both state and output values throughout the learning process [31]. LSTM networks are specialized in learning and analyzing sequential data, in tasks such as data classification [34], processing [35], and forecasting of time series with time lags of unknown size [36]. The LSTM has a chain-like architecture suited to memorizing information over the long term, using four network layers as represented in Figure 2. The LSTM network is made up of memory blocks called cells. The cell state is crucial because it allows data to flow forward unaltered. However, data can be added to or removed from it through gates that use the sigmoid activation function. These gates consist of a sequence of matrix operations with distinct weights. With gates, it is possible to avoid the long-term dependency problem that affects memorization in standard RNNs. In the following paragraphs, we go through how the LSTM operates [37].

In the LSTM procedure, the first step is to identify the data that is unnecessary for the operation. This is accomplished by using the sigmoid function, which takes the previous output $h_{t-1}$ at time $t-1$ and the current input $x_t$ at time $t$. The sigmoid function assesses which portion of the old output should be deleted. This process is referred to as the forget gate $f_t$, as in equation (6). The values of the vector $f_t$ vary from 0 to 1, with one value for each number in the cell state $C_{t-1}$.

$$f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \tag{6}$$

The sigmoidal function is denoted by $\sigma$ as in equation (3), the weights are denoted by $W_f$, and the bias of the forget gate is denoted by $b_f$.

The following equations cover two cases: one that ignores and one that stores the current input in the cell state $C_t$. Two layers comprise the process: a sigmoidal layer and a tanh layer. With values between 0 and 1, the sigmoidal layer $i_t$ determines whether the latest information is updated or disregarded. In the second layer, the tanh function produces candidate values $\tilde{C}_t$ in the range (−1, 1), which are weighted according to their importance:

$$i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \tag{7}$$

$$\tilde{C}_t = \tanh\left(W_C [h_{t-1}, x_t] + b_C\right) \tag{8}$$

where $W_i$, $W_C$ and $b_i$, $b_C$ are the corresponding weights and biases. As can be seen in equation (9), the two values are combined with the previous state, and a new cell state is created:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{9}$$

where $\odot$ denotes element-wise multiplication.

Finally, the output gate $o_t$ is computed by a sigmoidal layer, and the output $h_t$ is obtained by multiplying $o_t$ by the values that the tanh function forms from the new cell state:

$$o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \tag{10}$$

$$h_t = o_t \odot \tanh(C_t) \tag{11}$$

where $W_o$ and $b_o$ denote the weights and bias of the output gate, respectively.
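The gate computations described above (forget gate, input gate with candidate values, cell-state update, and output gate) can be sketched as a single LSTM time step in NumPy. This is an illustrative sketch of the standard LSTM formulation, not the implementation of any reviewed study; the function name `lstm_step`, the `params` dictionary layout, the random initialization, and the sizes are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a single sample.

    params maps "f", "i", "c", "o" to a (W, b) pair, where W has shape
    (hidden, hidden + inputs) and acts on the concatenation [h_prev, x_t].
    """
    z = np.concatenate([h_prev, x_t])
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_c, b_c = params["c"]
    W_o, b_o = params["o"]

    f_t = sigmoid(W_f @ z + b_f)       # forget gate: what to drop from the old state
    i_t = sigmoid(W_i @ z + b_i)       # input gate: what to write
    c_hat = np.tanh(W_c @ z + b_c)     # candidate values in (-1, 1)
    c_t = f_t * c_prev + i_t * c_hat   # new cell state
    o_t = sigmoid(W_o @ z + b_o)       # output gate
    h_t = o_t * np.tanh(c_t)           # new output
    return h_t, c_t

# Hypothetical sizes and randomly initialized (untrained) parameters.
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
params = {
    k: (rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)), np.zeros(n_hidden))
    for k in "fico"
}

# Run the cell over a short random input sequence of five time steps.
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, params)
```

In practice, deep learning frameworks provide trained, batched versions of this step; the loop here only shows how the cell state and output are carried from one time step to the next.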

2.2.2. TCNN

TCNNs are generic convolutional networks proposed in [39]. The TCNN is generally utilized to perform sequence modeling tasks under a causal constraint. A sequence modeling network learns to predict the output sequence $(\hat{y}_0, \ldots, \hat{y}_T)$ given the input sequence $(x_0, \ldots, x_T)$; the network is trained by minimizing some loss function between the estimated sequence and the matching output sequence $(y_0, \ldots, y_T)$. The causal constraint imposed on the network means that the prediction $\hat{y}_t$ is conditional on the inputs $(x_0, \ldots, x_t)$ but not on the future inputs $(x_{t+1}, \ldots, x_T)$. As can be seen in Figure 3, the TCNN is a hierarchical structure composed of many convolutional hidden layers of equal size to the input layer [39].
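The causal constraint can be illustrated with a single dilated convolution that only looks backward in time. The sketch below is a simplified, single-channel illustration and omits everything else a full TCNN contains (multiple channels, residual connections, training); the function name `causal_conv1d`, the kernel values, and the dilations are assumptions made for this example.

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Causal dilated 1-D convolution with output the same length as x.

    y[t] = sum_k kernel[k] * x[t - k * dilation], where positions before
    the start of the sequence count as zero (left padding only).
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, w in enumerate(kernel):
        start = pad - k * dilation
        y += w * padded[start:start + len(x)]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# One layer: each output adds the current value and the value two steps back.
single = causal_conv1d(x, kernel=[1.0, 1.0], dilation=2)

# Two stacked layers with dilations 1 and 2 see the last four inputs.
stacked = causal_conv1d(causal_conv1d(x, [0.5, 0.5], dilation=1), [0.5, 0.5], dilation=2)
```

Because padding is applied only on the left, changing a later input never alters an earlier output, and stacking layers with growing dilation widens the history each output can see while keeping the layers the same length as the input.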

3. Application

To demonstrate the importance of the time series prediction problems, a state-of-the-art study is conducted by categorizing deep learning research works by different application domains, such as healthcare, finance, energy, traffic, and weather prediction, based on the most widely used network model designs (ANN, LSTM, TCNN). The following paragraphs provide a summary of the articles for each application domain, highlighting the aims achieved for each method and field.

3.1. Healthcare

Deep sequential models are frequently used in healthcare. Deep learning’s predicting capabilities and automated feature detection make it a compelling tool for disease diagnosis. There are numerous studies in the field of healthcare applications.

Nikparvar et al. [40] proposed a multivariate and multi-time series long short-term memory (MTS-LSTM) network for forecasting the COVID-19 pandemic in terms of confirmed cases, mortality, and mobility concurrently. The results indicated that including mobility as a variable and training the network with many samples improve predictive performance in terms of prediction bias and variance. Additionally, the study demonstrated that the projected outcomes are comparable in terms of accuracy and spatial patterns to those of a typical ensemble model employed as a benchmark [40].

TEG-net was presented by Hong et al. [41] as a novel deep learning approach for diagnosis from physiological time series that is both precise and explainable. TEG-net constructs T-net (a multi-scale bidirectional TCNN) for directly modeling physiological time series, E-net (a personalized linear model) for modeling expert features extracted from physiological time series, and G-net (a gating neural network) for combining T-net and E-net for diagnosis. Through G-net, the combination of T-net and E-net enhances diagnostic precision, and E-net may be used for explanation. TEG-net exceeds the second-best benchmark in terms of area under the receiver operating characteristic curve and area under the precision-recall curve by 13.68% and 11.49%, respectively [41].

In another work, the LSTM was used by Da Silva et al. [42] to anticipate the patient’s vital signs and then to estimate the severity of the patient’s health status using prognostic indexes, which are extensively employed in medicine. According to experiments, it is possible to estimate vital signs with a high degree of accuracy (>80%) and so forecast prognostic indexes in advance, enabling patients to get treatment before they deteriorate [42].

To anticipate COVID-19 cases in India and Chennai, multivariate LSTM models were examined by Devaraj et al. [43]. Both short- and long-term infected cases were predicted, and the study showed that the stacked LSTM and LSTM models beat other models in terms of accuracy. When compared to other algorithms, the stacked LSTM models predicted more reliably and generated results with an error of less than 2% [43].

Kafieh et al. [44] examined different models, including multilayer perceptrons, random forests, and various versions of LSTM. The models were trained on three datasets, including COVID-19 cases, to predict the outbreak in nine different countries (Germany, Iran, Japan, Italy, Switzerland, Korea, China, Spain, and the USA), and their performance was evaluated using four metrics: MAPE, RMSE, NRMSE, and $R^2$. According to the authors, promising results were obtained for predicting the pandemic’s future trajectory by using a modified version of LSTM termed M-LSTM [44].
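Since error metrics such as MAPE, RMSE, NRMSE, and the coefficient of determination recur throughout the studies reviewed in this section, a short sketch of how they are commonly computed may help in comparing reported results. The definitions below are the usual textbook ones and are not taken from any specific reviewed study; in particular, normalizing the RMSE by the range of the observations is an assumption, as some works divide by the mean instead. The sample values are illustrative only.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def rmse(y_true, y_pred):
    """Root mean square error, in the units of the data."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def nrmse(y_true, y_pred):
    """RMSE normalized by the observed range (assumed convention)."""
    return rmse(y_true, y_pred) / (np.max(y_true) - np.min(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Illustrative observations and forecasts, not data from any study.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

scores = {
    "MAPE": mape(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "NRMSE": nrmse(y_true, y_pred),
    "R2": r2(y_true, y_pred),
}
```

MAPE and NRMSE are scale-free, which makes them easier to compare across datasets than RMSE, whose value depends on the units of the series.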

A unique deep learning architecture based on the LSTM was also presented by Balaji et al. [45] for the purpose of grading the severity of Parkinson’s disease (PD) based on gait pattern. Three independent gait datasets were used to train the LSTM network. Each dataset included vertical ground reaction force (VGRF) records of distinct types of walking. The experimental findings demonstrated that Adam-optimized LSTM networks were capable of successfully learning gait kinematic data and provided a rate of accuracy for binary and multi-class classification, as well as an accuracy increase over similar approaches [45].

Shastri et al. [46] suggested three types of LSTM, including stacked LSTMs, bidirectional LSTMs, and convolutional LSTMs, for forecasting COVID-19 cases one month in advance in India and the USA. As a result, convolutional LSTM outperformed the other two models and forecasted the mortality rate of COVID-19 with high accuracy and very little error for all four datasets in both countries. Based on the mean absolute percentage error (MAPE), a convolutional LSTM model obtained an error rate of 2.0% to 3.3% [46].

Kara [47] presented a hybrid technique for predicting multi-step influenza outbreaks based on LSTM neural networks combined with genetic algorithms (GAs). The study used weekly data on influenza-like illnesses (ILI), which were obtained in the USA. The experimental findings indicated that the hybrid model outperformed other well-developed machine learning techniques, a statistical model, and a fully connected neural network when various performance measures were considered during peak times [47].

Shetty and Pai [48] showed the capacity of the multilayer perceptron (MLP), an ANN model, to anticipate the number of infected cases in the Indian state of Karnataka. The forecasting model’s parameters were chosen using the partial autocorrelation function, and their performance was compared to parameters chosen using the cuckoo search (CS) technique. The use of the CS algorithm resulted in improved prediction performance when MAPE was used. Additionally, to validate the model’s effectiveness, data from Hungary’s COVID-19 cases were employed, which produced a MAPE of 1.55%, confirming the stability of the developed ANN model for predicting COVID-19 instances in the Karnataka state [48].

Hawas [49] utilized the GRU version of the RNN structure to predict daily COVID-19 infections in Brazil using forecasting models trained on sparse raw data (30 time-steps and 40 time-steps alternatives). Since the GRU has a smaller number of parameters and has a simpler structure, it converged faster and provided better performance than LSTM, which has a lot more parameters and is more complicated to implement [49]. Table 1 summarizes the abovementioned studies that applied deep sequential models in healthcare predictions.

3.2. Energy

Deep learning’s recent success in a variety of domain applications has drawn researchers to the energy field, particularly in areas like demand forecasting, renewable energy, and building energy efficiency. The breadth of methods presented, as well as the growing number of publications, demonstrates its enormous potential.

Two suggested methodologies for forecasting the power consumption of office buildings were validated by Chen et al. [50] using actual data from the University of Glasgow as a case study. To minimize the impacts of occupants’ activities, the first suggested technique divided power consumption data into set time periods including working and nonworking hours. The second proposed approach combined ANNs and fuzzy logic approaches to match the building’s base load, peak demand, and occupancy rate to many meteorological factors. The simulation findings indicated that, when compared to the conventional ANN technique, both suggested approaches had lower RMSE for forecasting power consumption [50].

Lin et al. [51] compared the performance of TCNN for solar energy forecasting to that of multilayer feedforward neural networks and recurrent networks, including the state-of-the-art LSTM and GRU recurrent networks. The assessment was based on two Australian datasets that comprise historical solar and climate data, as well as weather forecast data for the future day. The results indicated that TCNN beat the other models in terms of accuracy and was capable of exploiting a longer effective history than recurrent networks [51].

Lin et al. [52] also proposed a new deep learning technique for probabilistic forecasting, the temporal convolutional attention network (TCAN), which combines attention mechanisms with temporal convolution, and showed its effectiveness in a case study of solar energy forecasting. TCAN extracts temporal dependence using the hierarchical convolutional structure of TCNN and then employs sparse attention to concentrate on critical time steps. The authors claimed that the TCAN beat various state-of-the-art deep learning prediction models, such as TCNN, in terms of accuracy. They also concluded that the TCAN needs fewer convolutional layers for an enlarged receptive field and is quicker to train [52].

A method for forecasting very short-term solar production based on the LSTM with a temporal attention mechanism (TA-LSTM) was also suggested by Pan et al. [53]. The length of the input time series was first determined using partial autocorrelation and was then used as an input to the LSTM forecasting model. Trials were carried out to verify that the proposed approach worked well. The solar production predicted by each forecasting technique from 1 May 2016 to 10 May 2016 showed that the suggested strategy was feasible and effective [53].

Dudek et al. [54] developed a hybrid and hierarchical deep learning approach for forecasting midterm electricity demand in 35 European countries, which combines exponential smoothing (ETS) and LSTM. Their experiments on monthly electricity demand time series demonstrated the proposed model’s superior performance and competitiveness with both classical models such as ARIMA and ETS and the latest machine learning models [54].

Ghanbari and Borna [55] analyzed several LSTM models on a multi-step time series dataset. The research examined whether a model could be developed to forecast how much power a residence would use over the following seven days. The dataset they used contains four years’ worth of data on residential power use. The intended prediction results were obtained with the minimum errors possible when contrasted with current state-of-the-art models [55].

Kumar Dubey et al. [56] used the LSTM model to contribute to the establishment of accurate predictions and forecasting for the energy unit supply. The dataset in their experiments was associated with energy consumption measurements from 5,567 London households that participated in the UK Power Network-led Low Carbon London initiative during November 2011 to February 2014. The findings showed that energy usage had a significant positive correlation with humidity and a strong negative correlation with temperature, and the LSTM outperformed ARIMA and SARIMA models with an average mean absolute error (MAE) of 0.23.

In another work, Mustapa et al. [57] estimated energy consumption using multiple linear regression (MLR) and a nonlinear autoregressive network with exogenous input (NARX-ANN). The NARX-ANN architecture was optimized using the particle swarm optimization (PSO) technique, and the predicted values were compared to the actual values. It was shown that the NARX-ANN-PSO model had a smaller error [57].

Dang et al. [58] proposed a technique for forecasting next-day electricity prices at the 5-minute level using a model that integrates eight ANN components. The combined ANN model was then used to forecast electricity or time-of-use (TOU) prices for the following day at 5-minute resolution. The authors concluded that the model achieved a MAPE of roughly 13%, which might be used as a benchmark for EV charging decision-making [58]. Table 2 summarizes the abovementioned studies that applied deep sequential models in energy predictions.

Wei et al. [59] presented CFML (complementary ensemble empirical mode decomposition (CEEMD)-fuzzy time series (FTS)-multi-objective grey wolf optimizer (MOGWO)-long short-term memory (LSTM)) as a novel hybrid prediction system. Four wind speed datasets and two electrical power load datasets were used for energy forecasting. The CEEMD, FTS, and MOGWO components each contributed their strengths, effectively improving the prediction performance of the CFML model in terms of stability and accuracy [59].

Another work by Wen et al. [60] proposed a novel approach for predicting power load. In this approach, stages of weather station were selected and predicted using the Takagi–Sugeno (TS) fuzzy model, which is an upgraded self-organizing radial basis function recurrent neural network using the TS fuzzy model (ISO-TS-RBF-RFNN). The suggested ISO-TS-RBF-RFNN model developed a technique for determining the current firing strength of fuzzy modes by using a new type of activation mechanism and robust-type fuzzy rules to enhance the prediction model’s resilience. The suggested model outperformed the other five models (MLR, SVR, FIR, GA-LSTM, and classic SO-TS-RBF-RFNN) when confronted with a variety of uncertainty about feature components [60].

3.3. Environment and Pollution

Global climate change and its associated consequences have created a great deal of issues for farmers and land management. Not only does climate change have a noteworthy influence on the resources, environment, and socioeconomic situations of all areas, but it also influences water resources, air quality, agriculture, rural development, and health. Therefore, its mitigation requires precise forecasting.

By integrating the novel cooperation search algorithm (CSA) into the ANN learning process, Feng and Niu [61] presented a hybrid ANN model for river flow forecasting. The suggested model was evaluated using data from two real-world hydrological sites in China. During both the training and testing stages, the hybrid technique based on ANN and CSA consistently outperformed the control methods and improved forecasting results [61].

To forecast hourly PM2.5 (particles with a diameter of less than 2.5 micrometers) concentrations, Jiang et al. [62] developed a novel hybrid forecasting model on the basis of complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and a deep temporal convolutional neural network (deep-TCNN). Using the PM2.5 concentrations in Beijing as a sample, experimental results revealed that the proposed CEEMDAN-deep-TCNN model had the highest forecasting accuracy when compared to classical time series models, ANN, and the prevalent DL model [62].

With the purpose of improving forecasts for smart agriculture production, the authors of [63] employed an LSTM model and compared it to the back-propagation method. Experimental results based on the smart agriculture dataset, which was developed by the Electronic Systems Lab of Gadjah Mada University, showed that the LSTM outperformed back-propagation in terms of prediction accuracy [63].

In another work by Belavadi et al. [64], the LSTM and RNN models were presented as a scalable approach for monitoring and collecting real-time data on air pollution concentrations from diverse locations and for forecasting future air pollutant concentrations. The data were obtained through a wireless sensor network that collects and transmits pollutant concentrations to a server, with sensor nodes distributed around Bengaluru, South India, and from the Government of India’s real-time air quality data as part of its Open Data project. They found that there is no such thing as a “one-size-fits-all” strategy that will work in every environment or circumstance [64].

Hamami and Dahlan [65] constructed an LSTM model for forecasting air pollution levels. The model was able to forecast five measures of air pollution: PM10, SO2, CO, O3, and NO2. The findings indicated that the LSTM model was efficient, with a root mean square error of 5.58 [65].

In order to improve the accuracy of water quality forecasting, an LSTM-BP combination model technique was presented by Jia and Zhou [66]. This algorithm used both LSTM and BP neural networks. A framework for time series prediction was created using the water temperature data from the No.6 large-scale integrated observation signal in the Yangtze estuary as an example. The proposed approach was compared to the LSTM and the BP models, and it was determined that the time series predicted by LSTM-BP were more accurate [66].

Four ANN models were calibrated by Sharma et al. [67] to evaluate the effect of different variables on river flow prediction. Compared to model 3 (without river flows), model 4 (which included lag-1 river flow as one of the inputs) greatly improved the results. In terms of the performance metrics, it was concluded that the developed ANN model 4 was more accurate at modeling river flow, except for extremely high peaks, and was able to solve highly complex nonlinear river flow prediction problems [67].

The hybrid model proposed by Zhao et al. [68] to forecast monthly streamflow first applied improved complete ensemble empirical mode decomposition with adaptive noise (ICEEWT) and then used a GRU to learn the relationship between historical and future streamflow series. The model was then used in conjunction with the improved grey wolf optimizer (IGWO) to identify the approximate parameter combination while balancing underfitting and overfitting. For two stations, the proposed ICEEWT-IGWO-GRU model outperformed the single GRU model, decreasing MAE and RMSE by 50% and 52% on average, respectively, which confirmed its efficiency and more reliable predictions [68]. Table 3 summarizes the abovementioned studies that applied deep sequential models in environment and pollution predictions.

3.4. Finance and Stocks

Stock market time series forecasting is a concept used in quantitative finance. Finance prediction is a traditional but difficult topic that has attracted the interest of both economists and computer scientists. Researchers are using deep sequential models to predict the stock market. For example, three machine learning methods were used by Zekić-Sušac et al. [69] to forecast the energy costs of public buildings, and an actual dataset of general buildings of the Croatian public sector’s localized database was developed. The prediction models were developed in two stages: the first stage compared ANNs with two types of regression trees, CART and random forest (RF); the second stage combined RF-Boruta feature selection with the machine learning approach for forecasting. The ANN and regression tree topologies were optimized using a cross-validation approach. The findings indicated that combining Boruta with machine learning yields the most precise forecasts in terms of symmetric mean absolute percentage error (sMAPE) [69].

Ahmed et al. [70] suggested a strategy that incorporates the forex loss function (FLF) into an LSTM model, named FLF-LSTM, which reduces the discrepancy between the actual and forecast averages of forex candles. Using data from 10,078 four-hour candles for the EURUSD pair, they found that, compared to the standard LSTM model, the suggested FLF-LSTM system showed a 10.96% reduction in total mean absolute percentage error (MAPE). Additionally, they stated that the prediction error for high and low prices was lowered. The research demonstrated that introducing domain expertise into the training of machine learning models yields substantial improvements in forex price prediction [70].

In another work, Dave et al. [71] forecast Indonesia’s future exports by applying an LSTM to the nonlinear component of the data and an autoregressive integrated moving average (ARIMA) model to the linear part. These two most widely used time series prediction algorithms (ARIMA and LSTM) were combined to generate the best model. In comparison with the separate models, the hybrid (LSTM-ARIMA) model produced the lowest mean absolute percentage error (MAPE) and root mean square error (RMSE) values [71].

In another work, Hryhorkiv et al. [72] forecast time series of stock indices with a sophisticated hybrid model that combines ARIMA and ANN to exploit the benefits of both approaches. Mean square error (MSE) calculations demonstrated that the suggested algorithm produced more accurate projections of the 500 stock indexes [72].

Livieris et al. [73] suggested a CNN-LSTM model for accurate gold price and movement prediction. The model made use of the convolutional layers’ ability to extract relevant information and learn the internal representation of time series data, as well as the LSTM layers’ efficacy in recognizing short- and long-term relationships. The initial experimental study demonstrated that combining the LSTM with extra convolutional layers might significantly improve predictive ability [73].

Farahani and Hajiagha [74] proposed a framework for forecasting stock price indices using an artificial neural network (ANN) trained with various novel metaheuristic algorithms, including social spider optimization (SSO) and the bat algorithm (BA). They selected features using a genetic algorithm (GA) as a heuristic technique. The standard artificial neural network was compared to a hybrid metaheuristic-based ANN. In comparison with earlier approaches, SSO and BA had the lowest errors, indicating that they could better forecast stock prices. However, when error is used as the measure of superiority, the social spider algorithm outperformed the others. The ANNs outperformed ARIMA-based time series models in predicting stock prices. Their experimental results demonstrated that the hybrid models achieved a high degree of accuracy [74].

A new intuitionistic fuzzy time series (IFTS) forecasting method based on LSTM was proposed by Kocak et al. [75]. The model, named DIFTS-LSTM, was applied to forecasting the Giresun temperature data and the Nikkei 225 stock exchange time series. In the experiments, the proposed model demonstrated better forecasting performance than some alternative methods, namely, fundamental fuzzy time series models and the classical time series models [75]. Table 4 summarizes the abovementioned studies that applied deep sequential models in finance and stock predictions.

3.5. Weather Forecasting

Meteorological forecasting is the process of predicting the condition of the atmosphere at a particular area using a variety of weather characteristics. Weather predictions are created by collecting data on the status of the atmosphere at any given time.

Farmers often rely on more precise short- to medium-range predictions from regional forecasting models. For this purpose, Hewage et al. [76] investigated current state-of-the-art models (TCN and LSTM), with the goal of designing and testing a lightweight and innovative weather forecasting system. According to the findings, the suggested model that uses TCN delivered better predictions than the LSTM and other conventional machine learning techniques [76].

In another work, an ANN model was created by Yadav and Malik [77] to forecast wind speed at 10, 20, 30 minutes, and 1 hour ahead of time, in the mountainous region of India. They proposed a graph to depict wind speed data that have been averaged over a 10-minute period. The study concluded that their experiment can be used to track wind power over the Internet [77].

The application of ANN, a computational intelligence approach, was proposed by Rahul et al. [78] as a significant step forward in the creation of an intuitive framework capable of comprehending and predicting nonlinear weather phenomena. The suggested research focused on creating a user-friendly framework that can accurately forecast the weather with minimal error and a more appropriate design [78].

In another study, Bou-Rabee et al. [79] suggested a hybrid ANN technique for estimating wind power capacity and electricity generation at coastal sites, based on the properties of wind-related dependent variables. Data from three Kuwaiti coastal sites were used to confirm the suggested approach. For the computation of wind speed, the hybrid model, which integrated ANN and particle swarm optimization (PSO), forecasted wind direction one month ahead of time. Before finding the best ANN architecture, the NN started by assessing its output using a variable number of neurons in the hidden layer. In terms of RMSE and MSE, the hybrid ANN-PSO framework outperformed marginal systems based on ANN [79].

In another study on forecasting Indian summer monsoon rainfall (ISMR), Johny et al. [80] examined an adaptive hybrid modeling framework called the adaptive ensemble empirical mode decomposition-artificial neural network (AEEMD-ANN). The AEEMD-ANN method performed reasonably well in capturing hydrologic extremes when compared to the EEMD-ANN forecasting model, with a high degree of precision in capturing dry seasons [80].

Unnikrishnan and Jothiprakash [81] developed an integrated SSA-ARIMA-ANN model for daily rainfall forecasting. Singular spectrum analysis (SSA) was used to preprocess the data before combining it with ARIMA and ANN models. The statistical performance of the proposed model revealed that the hybrid SSA-ARIMA-ANN model was capable of forecasting daily rainfall in the catchment with a high degree of confidence [81].

Niu et al. [82] introduced a new sequence-to-sequence model based on the attention-based gated recurrent unit (AGRU) that significantly enhances the quality of wind power forecasting (WPF) systems. The AGRU model was assessed using National Renewable Energy Laboratory (NREL) data. The model successfully handled the error accumulation issue associated with a recursive approach. The authors concluded that the GRU model can be used instead of the LSTM in multi-step-ahead WPF problems because the GRU can produce the same results as the LSTM while taking less time to compute [82]. Table 5 summarizes the abovementioned studies that applied deep sequential models in weather predictions.

3.6. Traffic Flow

Traffic flow forecasting is a critical task that predicts in advance the amount of time traffic needs to flow. Numerous intriguing studies using deep learning have been conducted in this field. For traffic flow forecasting at isolated sites, Lu et al. [83] developed a novel LSTM network supplemented with temporal-aware convolutional context (TCC) blocks and a novel loss-switch mechanism (LSM). Extensive trials on two widely used benchmark traffic flow datasets from California’s metropolitan regions demonstrated that the suggested approaches were effective and reliable when compared to state-of-the-art traffic flow forecasting methods [83].

To improve the accuracy of short-term traffic flow forecasting, Ma et al. [84] presented a model based on traffic flow time series analysis with an upgraded LSTM. To begin, time series analysis was performed on traffic flow data to produce a reliable time series. Then, an enhanced LSTM model was created using LSTM and bidirectional LSTM networks. The bidirectional long short-term memory network (BiLSTM) was incorporated into the prediction model to combine the benefits of sequential data with the long-term dependence captured by the forward and reverse LSTMs. Finally, the suggested method’s performance was assessed by comparing predicted results to real traffic data. With respect to precision and stability, the suggested strategy outperformed both examined methods [84].

Wang et al. [85] analyzed the road section’s traffic flow data and weather conditions in detail and produced a model for short-term traffic flow prediction based on the attention mechanism and the 1DCNN-LSTM network. The proposed model combined the temporal expansion capabilities of the CNN with the long-term memory benefits of the LSTM. The experimental findings indicated that the 1DCNN-LSTM-attention model’s predictions were superior to those obtained without incorporating the weather element. The suggested model also showed a quicker convergence rate and greater forecast accuracy [85].

Bohan and Yun [86] proposed a model to predict traffic flow data. A database was built from GPS data for the K5, 32, and 73 buses, received from the Hohhot Bus Corporation. The data were collected from 6:00 to 20:00 on weekdays between May and October 2017. They chose a bidirectional recurrent neural network (BRNN) model, a type of recurrent neural network (RNN), and compared it to the LSTM model and a gated recurrent unit (GRU) model. Their results showed that the BRNN outperformed the other two models in terms of prediction accuracy and model performance [86].

George and Santra [87] proposed a new hybrid model based on agglomerated hierarchical K-means (AHK) clustering and fuzzy optimum long short-term memory (FOLSTM), called AHK-FOLSTM. In the LSTM model, the whale optimization algorithm (WOA) was used to better manage the fuzzy rule parameters and calculate the best weight value. Experimental results demonstrated that their proposed method outperformed other state-of-the-art approaches on the different evaluation metrics used [87]. Table 6 summarizes the abovementioned studies that applied deep sequential models in traffic flow predictions.

3.7. Other Fields

Deep sequential models are applied in various other fields for forecasting time series data. For instance, a new hybrid method based on biased random key genetic algorithms was suggested by Cicek and Ozturk [88] to identify the optimal network design and parameter combinations. The BRKGA-NN method estimated the number of hidden neurons, their bias values, and the weights of the connections between nodes. The proposed BRKGA-NN model was assessed on some of the most well-known time series datasets. The performance of the BRKGA-NN was compared against genetic algorithm-based ANNs, ANNs with back-propagation, support vector regression, and autoregressive integrated moving average. Forecasting findings indicated that the BRKGA-NN algorithm delivered more accurate predictions than the other approaches [88].

A deep model structure was created by Hong et al. [31] to improve the prediction accuracy of the turbofan engine’s remaining useful life (RUL). The proposed model improved performance by sequentially stacking one-dimensional convolutional neural network (1D-CNN), LSTM, and bidirectional LSTM algorithms. Additionally, it integrated a residual network and a dropout strategy to enhance the proposed model’s learning ability and minimized model complexity via correlation analysis. When compared to former studies, the outcomes indicated that the model performed better at predicting RUL [31]. In another study on RUL prediction, Fu et al. [89] created a deep residual LSTM featuring domain-invariance (DIDR-LSTM) to optimize the prognostic efficacy of RUL estimation across domains. Experimental results showed that the DIDR-LSTM achieved a superior cross-domain RUL prediction accuracy [89].

Chen et al. [90] provided a probabilistic forecasting framework for multiple linked time series using a TCNN model. The proposed approach was applied to produce both parametric and nonparametric estimates of probability density. In both point and probabilistic forecasting, the framework outperformed existing state-of-the-art approaches [90].

In another study, Pandey and Wang [91] suggested a CNN model with an encoder-decoder architecture and an extra temporal convolutional module (TCM) placed between the encoder and decoder for real-time speech enhancement in the time domain, termed TCNN. The suggested model was trained independently of speaker and noise. It consistently outperformed the state-of-the-art real-time convolutional recurrent model in terms of enhancement results [91].

Another technique was developed by Golshani and Ashtiani [92] to improve forecasting and decision-making accuracy in comparison with the existing literature, using TCNN and RNN for cloud service workload prediction and usage prediction. The suggested approach’s decisions set up a degree of compromise between the choice criteria. The technique for order of preference by similarity to ideal solution (TOPSIS), one of the most well-known multi-criteria decision-making procedures, was used. It was offered as a technique for deciding on scalability [92].

In the work of Smyl [93], the M4 forecasting competition’s winning entry model was proposed. The proposed model employed a dynamic computational graph neural network architecture that provides a hybrid and hierarchical forecasting technique by combining traditional exponential smoothing with an LSTM model [93].

In the work of Li et al. [94], a unique multi-objective speech enhancement algorithm called a stacked and temporal convolutional neural network (STCNN) was developed. The STCNN architecture outperformed other neural network models in terms of feature extraction and sequential modeling. The proposed STCNN also outperformed LSTM, TCNN, and CRNN in all noise types [94].

Fan et al. [95] devised another novel hybrid model for efficient and precise well production forecasting, which is critical for prolonging the life of a well and optimizing reservoir recovery. This approach combined ARIMA and LSTM models. Production time series from three real wells were evaluated to compare the efficiency of the suggested models, ARIMA-LSTM-DP (daily production time series) and ARIMA-LSTM, with the LSTM, ARIMA, and LSTM-DP models. The hybrid ARIMA-LSTM-DP model outperformed the other models, and its results were more reliable.

He et al. [96] used a SARIMA-CNN-LSTM model to predict daily visitor demand, which is critical for the functioning of the tourism industry, in 6 countries: China, Hong Kong, Singapore, Korea, the Philippines, and Taiwan. The SARIMA-CNN-LSTM model outperformed other models in terms of prediction accuracy and was capable of extracting more information from high-frequency data throughout the forecasting process.

A novel fuzzy time series approach based on an ARMA-type recurrent pi-sigma artificial neural network was presented by Kocak et al. [97], and particle swarm optimization (PSO) was used to execute the optimization. The experimental results of using the proposed model (FTS-ARMATPS-ANN), with a recurrent structure based on both ARMA-type and PSO algorithms, revealed improved prediction performance for numerous real-life time series [97].

To develop a novel forecasting model, Egrioglu et al. [98] introduced new fuzzy time series and intuitionistic fuzzy time series definitions. The data were fuzzified with an intuitionistic fuzzy c-means approach, and the fuzzy relations were created with pi-sigma ANNs, which were trained with an artificial bee colony approach. Temperature and stock exchange datasets were used to predict future data with the suggested technique. Among all the competitor methods, the proposed method produced the best forecasts for all test sets of the investigated data [98]. Table 7 summarizes the abovementioned studies that applied deep sequential models in different applications.

4. Discussion

In this section, we go over the reviewed articles, discuss the papers’ implementation settings, compare each model’s applications, and focus on the significant differences.

4.1. Models

We looked at 60 papers from the last three years, most of which are from 2021, as shown in Figure 4. The models we reviewed in this study were LSTM, ANN, and TCNN, with LSTM (26 articles) being the most popular, followed by ANNs (15 articles) and TCNN (9 articles), as illustrated in Figure 5. The LSTM network offers considerable benefits for processing long-term sequences, which is why it is used in the majority of the examined publications. This is owing to its ability to independently correlate information on the time axis, as well as its high memory learning capacity [99].

The TCNN model, on the other hand, is used less often. During evaluation, a TCNN must store the raw sequence up to the length of its effective history, which may require additional memory. A domain transfer may also require parameter changes, because different domains may need different amounts of history for prediction. Thus, a TCNN may underperform when a model is transferred from a domain where only a limited amount of memory is required (i.e., small filter sizes (k) and dilation factors (d)) to a domain where considerably more memory is required (i.e., much bigger k and d) [39].
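
The link between k, d, and the amount of history a TCNN can see can be made concrete with a short calculation. The sketch below is an illustration only: it assumes a stack of dilated causal convolutions with one convolution per layer, stride 1, and the same filter size in every layer, in which case the effective history is 1 + (k - 1) times the sum of the dilation factors. The function name and the two example configurations are hypothetical and are not taken from [39].

```python
def receptive_field(kernel_size, dilations):
    """Effective history (in time steps) of stacked dilated causal convolutions.

    Assumes one convolution per layer, stride 1, and the same kernel size
    in every layer. A layer with dilation d extends the history by
    (kernel_size - 1) * d time steps.
    """
    if kernel_size < 1:
        raise ValueError("kernel_size must be at least 1")
    return 1 + (kernel_size - 1) * sum(dilations)


# Hypothetical source domain: small filters and few dilation levels.
small = receptive_field(kernel_size=2, dilations=[1, 2, 4])

# Hypothetical target domain: larger filters and more dilation levels.
large = receptive_field(kernel_size=7, dilations=[1, 2, 4, 8, 16, 32])

print("history needed (small k, d):", small)
print("history needed (large k, d):", large)
```

Under these assumptions, a model configured for the first domain sees only a handful of past time steps, so it would need a different k and d (and would have to buffer a much longer raw sequence) before it could serve the second domain.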

It is also worth noting that ANN models are not used as often as LSTM models for sequential data prediction. This is due to the disadvantages of ANNs, which include sensitivity to how the weight values are initialized, local minima, overfitting, and difficulty in generalizing [100].

4.2. Implementation

Deep sequential (DS) models are difficult to implement and require a high level of technical understanding as well as a significant amount of time. Several firms have focused their efforts on developing frameworks for implementing, training, and using DS models in order to make them more accessible. The basic purpose of DS frameworks is to provide an interface that allows models to be implemented regardless of their mathematical complexity [101]. The frameworks most often used in the literature are listed in Table 8. The table covers the ANN, TCNN, and LSTM models and shows that Python, together with the Keras package, is the most popular choice for building deep learning models. ANN was primarily implemented in MATLAB, whereas TCNN was implemented in PyTorch. Only one paper used the R programming language.

4.3. Application Comparison

An up-to-date study was undertaken to demonstrate the problem’s importance, categorizing deep learning research works by application domain (finance and stocks, energy, healthcare, weather, environment, traffic flow, and others); the most often utilized network architectures were LSTM, TCNN, and ANN. The LSTM was frequently used for sequential data in most fields, as shown in Figure 6. For weather prediction, the ANN model was preferred over the other models, and the same result was obtained in another survey by Jaseena and Kovoor [102]. According to the publications we evaluated, the ANN models were not used to predict traffic or healthcare, and the TCNN models were not used for weather forecasting. We found no evidence of clear differences in the other fields.

4.4. Optimization Methods

Deep learning models require an optimization method as a fundamental component. We believe that optimization for neural networks is an important topic for theoretical study because, despite its nonconvexity, it is tractable and can considerably improve our understanding of tractable problems. Furthermore, traditional optimization theory is insufficient to account for a wide range of events [103]. Table 9 shows the optimization approaches used with the deep sequential models that have been examined, with Adam and stochastic gradient descent (SGD) being the most commonly utilized optimizers. We may conclude from this review that a variety of optimization algorithms are used, including Adam, SGD, global optimizer, cooperation search algorithm (CSA), PSO, cross-entropy, BRKGA, and NADAM. The Adam optimizer was employed in practically all deep sequential models, most notably in LSTM, then TCNN, and only infrequently in ANN. The SGD optimizer was utilized in LSTM model optimization, and ANNs were optimized using several types of optimizers.

The Adam optimizer has a number of advantages over other optimizers that make it more useful. The magnitudes of its parameter updates are invariant to gradient rescaling, its step sizes are roughly bounded by the step size hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing. SGD, in turn, has established itself as an effective and dependable optimization strategy that was crucial to the success of various machine learning initiatives [104]. We noticed that Adam proved to be dependable and well suited to a wide range of nonconvex optimization problems in deep learning models.
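
Two of these properties, the bounded step size and the invariance to gradient rescaling, can be checked with a small experiment. The following is a minimal sketch of the standard Adam update for a single scalar parameter, written in plain Python rather than with any framework; the helper names `adam_step` and `minimize`, the toy quadratic objective, and the choice of learning rate are assumptions made for illustration and do not come from the reviewed papers.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m and v are the running first and second moment estimates and t is the
    1-based step count. Returns the updated (param, m, v).
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


def minimize(grad_fn, x0, steps, lr=0.1):
    """Run Adam for a fixed number of steps, starting from x0."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        x, m, v = adam_step(x, grad_fn(x), m, v, t, lr=lr)
    return x


# Toy objective f(x) = (x - 3)^2 and the same objective scaled by 1000.
grad_plain = lambda x: 2.0 * (x - 3.0)
grad_scaled = lambda x: 2000.0 * (x - 3.0)

first_plain = minimize(grad_plain, x0=0.0, steps=1)
first_scaled = minimize(grad_scaled, x0=0.0, steps=1)
final_plain = minimize(grad_plain, x0=0.0, steps=200)
final_scaled = minimize(grad_scaled, x0=0.0, steps=200)

print(first_plain, first_scaled)
print(final_plain, final_scaled)
```

In this toy setting the first update has magnitude close to the learning rate regardless of how large the gradient is, and multiplying the objective by 1000 leaves the trajectory essentially unchanged. Plain SGD, by contrast, would take a step 1000 times larger on the scaled objective.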

4.5. Activation Function

The activation functions are the fundamental decision-making units of any neural network. They also determine the output of the network’s neural nodes, making them critical for the network’s overall performance. As a result, when designing neural networks, it is critical to select the most appropriate activation function [105].

The activation functions most frequently used in the reviewed studies are ReLU, sigmoid, Tanh, scaled exponential linear unit (SeLU), and Softmax. ReLU is the most commonly used function, as shown in Table 10, especially in LSTM and TCNN models. This could be because it is the most widely used function for hidden layers. Lee [106] found that ReLU6 (a variant of ReLU) outperforms other activation functions. Additionally, the sigmoid was commonly employed in ANN models, even though it is not zero-centered and is known to be prone to the vanishing gradient problem and zigzagging during training [105]. Because each activation function is unique, we cannot readily state which one is the best; the only option is to pick an activation function that is appropriate for the intended network structure [107].
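
For reference, the activation functions named above can be written in a few lines each. The sketch below uses the standard textbook definitions in plain Python; it is not the implementation used by any of the reviewed studies, and the SELU constants are the commonly quoted values for its scale and alpha parameters.

```python
import math

# Commonly quoted SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)


def relu6(x):
    """ReLU capped at 6."""
    return min(max(0.0, x), 6.0)


def sigmoid(x):
    """Logistic sigmoid, computed in a numerically stable way.

    Outputs lie in (0, 1), so they are never centered on zero.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    """Hyperbolic tangent: outputs lie in (-1, 1) and are zero-centered."""
    return math.tanh(x)


def selu(x):
    """Scaled exponential linear unit."""
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * (math.exp(x) - 1.0)


def softmax(values):
    """Softmax over a list; subtracting the maximum avoids overflow."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


for name, fn in [("relu", relu), ("relu6", relu6), ("sigmoid", sigmoid),
                 ("tanh", tanh), ("selu", selu)]:
    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0, 8.0)])
print("softmax", [round(p, 4) for p in softmax([1.0, 2.0, 3.0])])
```

The printed values show the contrast discussed above: sigmoid outputs are always positive, whereas Tanh outputs are symmetric around zero, and ReLU6 differs from ReLU only once the input exceeds 6.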

4.6. Model Evaluation

In this work, three prominent deep sequential models (ANN, LSTM, and TCNN) were investigated along with hybrid models that outperformed standard models in forecasting accuracy. Hybrid models combine at least two deep learning algorithms. These models are more resilient because they often increase forecast accuracy by combining the benefits of the particular methods applied [100]. In energy applications, hybrid LSTM models are the most widely used architectures, followed by the standalone LSTM model, which agrees with the results of Alkhayat and Mehmood [108]. When compared to the LSTM model, the TCNN model produced highly accurate predictions when used to forecast solar power [51].

One of the initial preprocessing steps necessary in deep learning is splitting the dataset into training and testing data. Creating separate training and testing samples makes it possible to evaluate model performance. Instead of segmenting data into only training and testing sets, best practice advocates segmenting data into training, validation, and testing sets. Splitting the dataset in this way lets us assess the model’s generalization performance while also determining the model’s hyperparameters. The training set is used to fit the model, whereas the validation set is used to evaluate it during development [109]. Almost all of the articles evaluated in this review used a significant portion of the data for training. About ten studies, for example, used more than 80% of their data for training and the rest for testing. Validation data, on the other hand, were employed in only a few of the studies reviewed.
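
A three-way split of a time series can be sketched as follows. This is a minimal illustration, not the procedure of any particular reviewed study: it assumes the observations are already in chronological order and keeps that order, so the validation and test sets always lie after the training data in time. The function name and the 70/15/15 proportions are hypothetical choices.

```python
def chronological_split(series, train_frac=0.7, val_frac=0.15):
    """Split an ordered series into training, validation, and test parts.

    The series is not shuffled: the training part comes first, then the
    validation part, and the remaining observations form the test part.
    """
    if not 0 < train_frac < 1:
        raise ValueError("train_frac must be between 0 and 1")
    if not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("train_frac + val_frac must be below 1")
    n = len(series)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test


# Hypothetical series of 100 observations indexed by time step.
observations = list(range(100))
train, val, test = chronological_split(observations)

print(len(train), len(val), len(test))
```

Keeping the split chronological is a design choice that suits forecasting: a random shuffle would let the model train on observations that occur after those it is tested on.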

Evaluation metrics are used to assess and compare the performance of models [110, 111]. There are, in general, four sorts of evaluation measures for sequential models: classification metrics, regression metrics, profit analysis, and significance analysis [112]. Only one publication utilized a classification measure, while the others used regression measures, suggesting that regression metrics are regarded as better suited to these forecasting tasks.
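
The regression measures that recur throughout the reviewed studies (MAE, RMSE, MAPE, and sMAPE) can be computed as in the sketch below. The definitions are the standard ones; note that several sMAPE variants exist, and the one shown here, which divides by the sum of the absolute actual and forecast values, is an assumption rather than the formula of any specific paper. The sample values are made up for illustration.

```python
import math


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean square error."""
    squared = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    return math.sqrt(squared / len(actual))


def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero.
    """
    ratios = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * ratios / len(actual)


def smape(actual, forecast):
    """Symmetric MAPE, in percent (one common variant)."""
    ratios = sum(2.0 * abs(a - f) / (abs(a) + abs(f))
                 for a, f in zip(actual, forecast))
    return 100.0 * ratios / len(actual)


# Made-up observations and forecasts.
actual = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 380.0]

print("MAE  :", mae(actual, forecast))
print("RMSE :", round(rmse(actual, forecast), 4))
print("MAPE :", round(mape(actual, forecast), 4))
print("sMAPE:", round(smape(actual, forecast), 4))
```

MAE and RMSE are expressed in the units of the series, so they cannot be compared across datasets with different scales, whereas MAPE and sMAPE are scale-free percentages.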

5. Conclusion

Deep sequential models are frequently employed in a variety of real-world applications for forecasting time series data. In this review paper, three deep sequential models for time series data forecasting were extensively investigated across 60 research studies from the last three years. This study provided an in-depth examination of the most effective and widely utilized deep sequential models for forecasting. We offered a comprehensive overview and comparison of current state-of-the-art deep sequential models. The LSTM model was the most widely utilized in a number of fields, including healthcare, finance, weather, energy, environment, and other disciplines. Hybrid models, on the other hand, were able to forecast extremely well. Furthermore, the use of time series forecasting methods like TCNN and LSTM in conjunction with optimization techniques like Adam worked quite well. Python, together with the Keras package, was the most commonly used choice for implementing deep sequential models for real-world problems. The ReLU activation function was employed in the majority of the publications we looked at, but we could not say which activation function was the best. In addition, we explored a variety of topics, including common deep model applications, in-depth feature discussions, and substantial research gaps, before offering several possible future research pathways in this subject. We believe that this review article will benefit researchers from a variety of fields and will serve as a guideline for using time series data in deep learning-based applications.

Conflicts of Interest

The authors declare that they have no conflicts of interest.