An Improved Long Short-Term Memory Model for Dam Displacement Prediction
Displacement is a key quantity in dam safety monitoring because it responds to security risks such as flood water pressure, extreme temperature, structural deterioration, and bedrock damage. To make accurate predictions, previous researchers established various models. However, the input variables of these models cannot efficiently reflect the delays between the external environment and displacement. Therefore, a long short-term memory (LSTM) model is proposed that makes full use of the historical data to reflect the delays. Furthermore, the LSTM model is improved by making its variables more physically reasonable. Finally, a real-world radial displacement dataset is used to compare the performance of the LSTM models with multiple linear regression (MLR), multilayer perceptron (MLP) neural networks, support vector machine (SVM), and boosted regression tree (BRT). The results indicate that the LSTM models can efficiently reflect the delays and make variable selection more convenient, and that the improved LSTM model achieves the best performance by optimizing the input form and network structure on the basis of a clearer physical meaning.
Health monitoring is an effective way to analyze various phenomena through the whole life cycle of a dam, such as displacement, leakage, and vibration. Among all the monitoring data, displacement plays an important role because it is easy to measure and understand. Moreover, when the dam is exposed to security risks such as flood water pressure, extreme temperature, structural deterioration, and bedrock damage, the variation of displacement can be captured in time. To understand how dam displacement varies, predict it accurately, and detect anomalies, researchers have established models that reflect the linear or nonlinear relationship between dam displacement and external factors, such as the well-known hydrostatic-season-time (HST) model proposed by Beaujoint in 1967.
Dam displacement prediction models have traditionally been classified as statistical, deterministic, hybrid, and mixed models. The statistical model is established only from observed data. Multiple linear regression (MLR) and its advanced forms, which belong to the statistical model, are still widely used because they are intuitive and easy to operate [3–8]. The deterministic model builds solid models of the dam, rock mass, faults, and other components by using several assumptions and simplifications such as the constitutive relation, antishear slippage, and boundary conditions. A numerical method, usually the finite element method, is then used to solve and analyze the model. The complexity and uncertainty of this process make it hard for the deterministic model to predict the measured data of the dam [9, 10]. The hybrid model alleviates the problem through feedback adjustment according to the measured data [11, 12]. The mixed model combines the statistical model and the deterministic model. It handles the water level factor by using the deterministic model to generate statistical data and handles the other factors, such as temperature and aging, by the statistical method.
Recently, machine learning algorithms have been accepted as useful tools for modeling complex nonlinear systems and have been widely used for displacement prediction on dam bodies. For example, Mata compared neural network (NN) and MLR models for the horizontal displacement of a large Portuguese arch dam and concluded that the NN model was more flexible and more adequate than the MLR model for months with extreme temperature. Stojanovic adopted artificial neural networks (ANN), optimized for given conditions using genetic algorithms (GA), to establish a dam displacement model. The analysis showed that the ANN/GA model gives better accuracy than the MLR/GA model. Yu applied principal component analysis (PCA) to achieve data reduction, lower data redundancy, and reduce noise and the false alarm rate in multivariate analysis of dam monitoring data. Su used the support vector machine (SVM) to build a static model and a real-time updated model to predict the displacement of an actual roller-compacted dam. Rankovic adopted support vector regression (SVR) and the nonlinear autoregressive with exogenous inputs (NARX) model for dam tangential displacement prediction. Kang proposed an extreme learning machine- (ELM-) based health monitoring model for displacement prediction of the Fengman gravity dam. That study shows that ELM can achieve better prediction performance for horizontal displacements than backpropagation (BP) neural networks, MLR, and stepwise regression models. Salazar used boosted regression trees (BRT), random forests (RF), NN, SVM, and multivariate adaptive regression splines (MARS) models for the displacement analysis of the La Baells arch dam. Across the radial displacement, tangential displacement, and leakage data of the dam, the BRT model obtained the best overall performance.
Because of the delays between the external environment and displacement, historical data such as air temperature need to be used as input variables when the model is established. Previous models adopt complex and diverse input variables to account for the delays, for example, moving averages, trigonometric functions (whose delayed form is also a trigonometric function), and previous displacements as used in the autoregressive (AR) model and its variants, including the autoregressive with exogenous inputs (ARX) model, the NARX model, and singular spectrum analysis with autoregressive model (SSA-AR). These choices introduce problems such as dependence on prior knowledge, multicollinearity, and noncausality [22, 23]. Meanwhile, a limited set of variables leads to partial loss of the original information or to impaired long-term memory during data processing. Therefore, it is difficult for previous models to design appropriate input variables that deal with the arbitrary, unknown causal delays between the inputs and dam displacement.
The recurrent neural network (RNN) evolved from the Hopfield network, whose internal feedback connections can deal with delays [24, 25]. It is a kind of ANN that allows nodes to be connected into a directed cycle. The traditional RNN model suffers from vanishing or exploding gradients after passing through many time steps. The long short-term memory (LSTM) proposed by Hochreiter solved this problem by using a method called “constant error carrousel (CEC)”. In addition, its gate units protect stored information and already correct outputs against perturbation. Gers improved the LSTM with the forget gate, which learns how to segment the input stream into training subsequences without prior knowledge; this variant has become the most typical LSTM model. Owing to these advantages of LSTM, the historical data of the dam can be fully utilized.
In this study, the LSTM model is applied to dam radial displacement to make accurate predictions by making full use of historical data such as water level and temperature to reflect the delays. In addition, an improved LSTM model is proposed to optimize the performance by making the variables more physically reasonable. It divides the variables into two parts, the delay variables and the no-delay variables, which pass through different kinds of hidden nodes: LSTM memory blocks and ordinary nodes. The results are also compared with those of the MLR, multilayer perceptron (MLP), SVM, and BRT models by using a real-world dataset from a concrete arch dam. The rest of the paper is organized as follows. In Section 2, the LSTM model and a pretreatment method for the dam health monitoring (DHM) data are introduced, and an improved LSTM model is then presented in detail. In Section 3, the performance of the LSTM, MLR, MLP, SVM, and BRT models is demonstrated. In Section 4, conclusions are provided.
2. LSTM Model
2.1. Model Establishment
The RNN model and LSTM memory block are shown in Figure 1.
Figure 1: (a) RNN model; (b) LSTM memory block.
The LSTM model contains subnetworks called memory blocks, which replace the hidden layer nodes of the RNN shown in Figure 1(a). One memory block consists of a memory cell, a forget gate, input and output squashing units, input and output gates, and input and output gating. The components of the memory block are described as follows.
The input squashing unit applies a nonlinear transformation to the memory block output at time step t-1, $h_{t-1}$, and the input at time step t, $x_t$. Their weighted sum is passed through a tanh activation function to generate the input squashing $g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g)$.
The input gate applies a sigmoid function $\sigma$ to the weighted sum of $x_t$ and $h_{t-1}$ to generate $i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$. The input gating is the Hadamard product of $g_t$ and $i_t$. Thus, the information flow of the input squashing unit is cut off when the input gate output is 0 and passes through when it is 1. Multiplicative gates are distinctive features of the LSTM model.
The memory cell with a self-connected recurrent edge is the core of the memory block. It is updated by a constant linear function. The self-connected recurrent edge has constant weight 1 to ensure that the gradient can pass across many time steps without vanishing or exploding. The state of the memory cell is called the internal state $s_t$. In vector notation, $s_t = g_t \odot i_t + s_{t-1}$, where $\odot$ denotes the Hadamard product.
The forget gate applies a sigmoid function to the weighted sum of $x_t$ and $h_{t-1}$ to generate $f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$. The internal state uses $f_t$ to flush its contents. This design is especially useful in continuously running networks. Thus $s_t = g_t \odot i_t + f_t \odot s_{t-1}$.
The output gate applies a sigmoid function to the weighted sum of $x_t$ and $h_{t-1}$ to generate $o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$.
The output squashing unit transforms the internal state $s_t$ using a tanh function to generate $\tanh(s_t)$.
The memory block output $h_t$ is the Hadamard product of $o_t$ and the output squashing $\tanh(s_t)$, that is, $h_t = o_t \odot \tanh(s_t)$.
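The components listed above can be sketched as a single forward step of one memory block. This is a minimal NumPy illustration, not the authors' implementation: the variable names `g`, `i`, `f`, `o`, `s`, and `h` (input squashing, input gate, forget gate, output gate, internal state, block output), the parameter dictionary layout, and the helper `init_params` are assumptions made here for readability.

```python
import numpy as np

def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, p):
    """One forward step of a memory block with a forget gate.

    x_t: input at time step t, h_prev: block output at t-1,
    s_prev: internal state at t-1, p: dict of weights and biases
    (key names are hypothetical, chosen to mirror Wxg, Whg, bg, ...).
    """
    g = np.tanh(p["Wxg"] @ x_t + p["Whg"] @ h_prev + p["bg"])  # input squashing
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["bi"])  # input gate
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["bf"])  # forget gate
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["bo"])  # output gate
    s = g * i + f * s_prev      # internal state (Hadamard products)
    h = o * np.tanh(s)          # output gate times output squashing
    return h, s

def init_params(n_in, n_hidden, seed=0):
    """Illustrative initializer: N(0, 1) weights, zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "gifo":
        p["Wx" + gate] = rng.normal(0.0, 1.0, (n_hidden, n_in))
        p["Wh" + gate] = rng.normal(0.0, 1.0, (n_hidden, n_hidden))
        p["b" + gate] = np.zeros(n_hidden)
    return p
```

Calling `lstm_step` repeatedly over a subsequence, feeding each returned `h` and `s` into the next call, gives the block output at the last time step.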
2.2. Learning Algorithm of LSTM
The LSTM can contain many hidden layers, each hidden layer can contain many memory blocks, and each memory block can contain one or more memory cells. Here we mainly consider the case in which one memory block has only one memory cell. The backpropagation outside the memory block uses the backpropagation through time (BPTT) algorithm. For the sake of clarity, only the backpropagation inside the memory block at time step t is given in Figure 2.
As can be seen, $W_{xg}$ denotes the weight matrix between the input $x_t$ and the input squashing $g_t$, and $W_{hg}$ represents the weight matrix between the previous memory block output $h_{t-1}$ and $g_t$. $W_{xi}$ represents the weight matrix between $x_t$ and the input gate $i_t$, and $W_{hi}$ represents the weight matrix between $h_{t-1}$ and $i_t$. $W_{xf}$ represents the weight matrix between $x_t$ and the forget gate $f_t$, and $W_{hf}$ represents the weight matrix between $h_{t-1}$ and $f_t$. $W_{xo}$ represents the weight matrix between $x_t$ and the output gate $o_t$, and $W_{ho}$ represents the weight matrix between $h_{t-1}$ and $o_t$. $b_g$, $b_i$, $b_f$, and $b_o$ denote the biases of $g_t$, $i_t$, $f_t$, and $o_t$, respectively. There are several parameters outside the memory block: the output layer weight matrix W and bias b, the loss function L, the sample size N, and the last time step k. It is noteworthy that all time steps share the same weight and bias matrices. Therefore, the number of time steps can be made as large as needed without adding parameters.
2.3. Pretreatment of Dam Health Monitoring (DHM) Data for LSTM
Unlike the traditional models, LSTM feeds the variable values of different times into the network sequentially, thus adding a sequence-related dimension, the time step mentioned above. In essence, it uses input variable values at different times to fit the displacement. The input for LSTM needs to be processed into three-dimensional data $x_{i,j,v}$. The subscript i denotes the i-th displacement sample. The subscript j denotes the j-th time step. The last time step k corresponds to the observation time of the displacement sample, and k also represents the length of the input subsequence. The time interval of the time step does not need to be consistent with the displacement sampling interval. Since the displacement may be related to the historical daily temperature, the time interval of the time step should be set to one day. The subscript v denotes the v-th input variable. For input variable selection, the LSTM model can directly use the 3 variables of water level, temperature, and aging factor time in raw subsequence form owing to its memory and nonlinear advantages. Figure 3 shows the LSTM calculating process for dam displacement. The network structure involves three layers: the input layer, the hidden layer, and the output layer. The input layer is fed by the input $x_{i,j,v}$. There is one memory block in the hidden layer. The output layer applies a sigmoid transformation to the memory block output at the last time step to compute the model output.
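The pretreatment described above can be sketched as a windowing step over the daily series. This is a minimal sketch under stated assumptions: the function name `make_subsequences`, the argument `obs_days` (row indices of the days on which displacement was observed), and the array layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def make_subsequences(daily, obs_days, k):
    """Build the 3-D LSTM input of shape (samples, time steps, variables).

    daily:    array of shape (n_days, n_vars), one row per day
    obs_days: row index of each displacement observation in `daily`
    k:        subsequence length; time step k is the observation day
    """
    daily = np.asarray(daily, dtype=float)
    windows = []
    for d in obs_days:
        start = d - k + 1
        if start < 0:
            raise ValueError("not enough history before observation day %d" % d)
        windows.append(daily[start:d + 1])   # k consecutive daily rows
    return np.stack(windows)
```

Because each window is cut from the daily series, the displacement samples themselves may be irregularly spaced, which matches the remark that the time-step interval need not equal the displacement sampling interval.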
2.4. Improved LSTM Model for Dam Displacement
As far as we know, not all input variables need to account for delays; examples are the aging factor, which governs the irreversible deformation of the dam, and the water level as used in the HST model. However, past water levels influence the temperature distribution and water pressure of the dam and the surrounding rock, which potentially affects the displacement of the dam. Therefore, it is reasonable to consider the delay of the water level, for example through the moving-average form used by Kang. Meanwhile, redundant input subsequences increase the complexity and training time of the model unnecessarily.
An improved model is proposed to solve the above problems. The input variables are divided into delay variables and no-delay variables. The delay variables pass through the LSTM model (see Figure 3) and generate the memory block output at the last time step, $h_k$, while the no-delay variables merge with $h_k$ to form a new input. A feedforward neural network is then established for the new input. One more hidden layer, hidden layer 2, is added to enhance the nonlinear expression ability of the network. The no-delay variables can therefore directly use the aging factor time at the last time step, without complex transformation and without the subsequence form. The improved LSTM model saves training time because the no-delay variables need no recurrent calculation inside the memory block, but the additional hidden layer 2 adds some time consumption. Therefore, the number of neurons and the transfer function of hidden layer 2 should be set appropriately to control the time consumption. Figure 4 shows the calculating process of the improved LSTM model.
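The data flow of the improved model can be sketched as a forward pass. This is a rough illustration only: the concatenation used for the merge, the parameter names `W2`, `b2`, `W3`, `b3`, and the ReLU and linear transfer functions (taken from the settings reported later in Section 3.4) are assumptions about details that the text does not fully specify.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ilstm_forward(x_seq, a, p):
    """Forward pass of an improved-LSTM-style network (illustrative).

    x_seq: (k, n_delay) delay variables, one row per time step
    a:     (n_nodelay,) no-delay variables at the last time step
    p:     dict of weights and biases (hypothetical key names)
    """
    n_h = p["bg"].shape[0]
    h = np.zeros(n_h)
    s = np.zeros(n_h)
    for x_t in x_seq:                      # recurrent part: delay variables only
        z = {q: p["Wx" + q] @ x_t + p["Wh" + q] @ h + p["b" + q] for q in "gifo"}
        g = np.tanh(z["g"])
        i, f, o = sigmoid(z["i"]), sigmoid(z["f"]), sigmoid(z["o"])
        s = g * i + f * s
        h = o * np.tanh(s)
    merged = np.concatenate([h, a])        # last block output merged with no-delay variables
    hidden2 = np.maximum(0.0, p["W2"] @ merged + p["b2"])  # hidden layer 2, ReLU
    return p["W3"] @ hidden2 + p["b3"]     # linear output layer
```

Only the delay variables enter the loop over time steps, which is where the saving in recurrent computation comes from; the no-delay variables are used once, after the loop.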
The backpropagation algorithm in Figure 2 also needs to be modified to accommodate this improved model. The loss function only considers the output at the last time step and can be expressed as
Steps 1 and 2 are the same as in Figure 2. Step 3, however, must be modified to include the matrix of no-delay variables, the weight and bias matrices of the no-delay variables, and the output of hidden layer 1 before activation.
Because of the additional hidden layer 2, the next two steps involve the weight and bias matrices of the output layer and the output of hidden layer 2 before activation.
The remaining steps are expressed as follows:
The last step is to update the weights and biases.
3. Case Study
In this section, the performance of the LSTM model and its improved model for the radial displacement prediction of the Dongjiang arch dam is demonstrated. The MLR, MLP, SVM, and BRT models are used for comparison. All the algorithms are implemented in a Python 3.6 environment. The MLR, MLP, SVM, and BRT models are built with the scikit-learn library (v0.19.0). The LSTM algorithm is implemented using TensorFlow (v1.8.1), developed by Google. All these algorithms are used in their original versions and are run on a laptop computer with an Intel Core i5-4200U CPU (1.60 GHz, 2 cores), 8 GB of RAM, and a 64-bit operating system. Before running these models, the data were normalized.
3.1. Dongjiang Arch Dam
Dongjiang arch dam was constructed between 1978 and 1980. It is a double-curvature arch dam consisting of 29 blocks. It is provided with 4 horizontal inspection galleries and a base gallery located near the foundation rock. It has a crest elevation of 294 m, a height of 157 m, a crest length of 438 m, a crest thickness of 7 m, and a base thickness of 35 m. Figure 5 shows the dam.
3.2. Dataset and Model Variables
The horizontal displacements are obtained with the pendulum method as shown in Figure 6. The pendulums consist of the direct pendulum and inverted pendulum. They are used to measure the relative displacement from the reference point at the dam crest and rock foundation, respectively. By assuming that the displacement at the rock base is zero, the displacement of each point can be calculated.
L5 is the label of the pendulum at the crown cantilever situated in the 15th block. L5H291R represents the radial displacement of L5 at 291 m elevation. The available time series data of L5H291R are used for modeling. These data cover 14 years (from 11/6/2000 to 29/5/2014, with 341 displacement samples) and are accompanied by the daily reservoir water level and air temperature shown in Figures 7 and 8. Finally, the dataset is split into a training dataset and a validation dataset with a proportion of 0.8, giving 272 training samples and 69 validation samples.
For input variable selection, MLR, MLP, SVM, and BRT adopt 10 variables according to the HST model, involving the water level, temperature, and aging factor: $X = [H, H^2, H^3, H^4, \sin(2\pi t/365), \cos(2\pi t/365), \sin^2(2\pi t/365), \sin(2\pi t/365)\cos(2\pi t/365), \theta, e^{-\theta}]$, where H stands for the reservoir water level; the trigonometric functions simulate the temperature field, which is in a quasi-stable state for a dam that has been running for many years; t represents the number of days between the beginning of the year (January 1) and the date of observation; and the aging factor time θ stands for the number of days since the initial sample date. The air temperature (see Figure 8) is ultimately not used in these four models because the variable selection of the moving-average method [18, 19] is diverse and, in our tests on this dataset, the results were worse than those of the HST model.
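The ten HST input variables can be assembled per sample as in the following sketch. The function name `hst_features` is hypothetical, and the ninth variable is taken here to be the aging factor time θ itself, which is an assumption suggested by the accompanying $e^{-\theta}$ term. Any rescaling of θ before use is not specified in the text and is left to the caller.

```python
import math

def hst_features(H, t, theta):
    """Return the 10 HST-style input variables for one sample.

    H:     reservoir water level
    t:     days since January 1 of the observation year
    theta: aging factor time (assumed to be passed in already scaled)
    """
    w = 2.0 * math.pi * t / 365.0
    return [
        H, H ** 2, H ** 3, H ** 4,          # water level terms
        math.sin(w), math.cos(w),           # seasonal (temperature) terms
        math.sin(w) ** 2,
        math.sin(w) * math.cos(w),
        theta, math.exp(-theta),            # aging terms
    ]
```

For example, `hst_features(2.0, 0, 0.0)` gives water level terms 2, 4, 8, 16, seasonal terms 0, 1, 0, 0, and aging terms 0 and 1.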
The LSTM model uses 3 variables of water level H, air temperature T, and aging factor time in the raw subsequence form as discussed in Section 2.3. The improved LSTM model uses the delay variables water level H and air temperature T in the raw subsequence form and the no-delay variable aging factor time at the last time step. In this study, the unimproved model is named LSTM and the improved model is named ILSTM.
3.3. Model Evaluation Index
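The models are evaluated with the mean square error (MSE), the mean absolute error (MAE), and the correlation coefficient (R), as follows:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|, \qquad R = \frac{\sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})^2 \sum_{i=1}^{N}(y_i - \bar{y})^2}},$$
where $\hat{y}_i$ and $\bar{\hat{y}}$ are the simulated values and the mean simulated value from the model, $y_i$ and $\bar{y}$ are the measured values and the mean measured value, and N is the number of samples.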
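The three indexes can be computed in a few lines. This is a small sketch using the standard definitions of MSE, MAE, and the Pearson correlation coefficient; the function name `evaluate` is hypothetical.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Return (MSE, MAE, R) for measured and simulated values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    r = float(np.corrcoef(y_true, y_pred)[0, 1])   # Pearson correlation
    return mse, mae, r
```

Note that R measures only linear association: a prediction with a constant offset from the measurements still has R equal to 1, which is why MSE and MAE are reported alongside it.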
3.4. Model Performance Analysis
The MLR model parameters are listed in Table 1. It can be seen from Table 1 that the coefficients of H, H^2, H^3, and H^4 are significantly greater than the other coefficients and that these variables are strongly linearly correlated. It can be deduced that when this balance among the correlated terms is broken, for example at an extreme water level, the predicted value may be considerably biased.
A detailed comparison of MLP, SVM, and BRT for dam behavior modeling was performed by Salazar. For the sake of clarity, we mainly focus on the parameter selection of these models in this work. For MLP, the candidates are a structure with one hidden layer of 4, 5, 6, 7, 8, or 9 neurons; 100, 300, 500, or 700 training epochs; regularization parameters of 0.01, 0.001, and 0.0001; and the hyperbolic tangent, sigmoid, and rectified linear unit (Relu) transfer functions. The solver is limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), which converges faster and gives better solutions on small datasets. For the SVM model, the “cost” parameter C = 10, 100, 300, 500, and 700, the ε-insensitive error function parameter ε = 0.1, and the radial basis kernel function parameter γ = 1, 0.1, 0.01, 0.001, and 0.0001 are considered. For the BRT model, the candidates are learning rates ν = 0.001 and 0.0001; 1000, 2000, 3000, 4000, and 5000 maximum trees; and tree depths of 1, 2, and 3. The best parameters are selected by 5-fold cross-validation to prevent overfitting. Finally, the MLP with one hidden layer of 6 neurons, 500 training epochs, a regularization parameter of 0.01, and the Relu transfer function is selected. The SVM model with C = 500 and γ = 0.01 and the BRT model with learning rate ν = 0.001, 4000 estimators, and tree depth 2 are the best.
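The parameter selection described above amounts to an exhaustive grid search scored by 5-fold cross-validation. The sketch below shows that procedure in plain NumPy rather than through the scikit-learn calls used in the study. The function names, the contiguous (unshuffled) folds, and the `fit_predict` callback interface are assumptions made for illustration.

```python
import itertools
import numpy as np

def kfold_indices(n, k=5):
    """Yield (train_idx, val_idx) for k contiguous folds over range(n)."""
    folds = np.array_split(np.arange(n), k)
    for j in range(k):
        train = np.concatenate([folds[m] for m in range(k) if m != j])
        yield train, folds[j]

def grid_search_cv(fit_predict, grid, X, y, k=5):
    """Return the parameter combination with the lowest mean validation MSE.

    fit_predict(X_train, y_train, X_val, **params) must return predictions
    for X_val; it stands in for any of the compared models.
    """
    best_params, best_mse = None, np.inf
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        fold_mse = []
        for train, val in kfold_indices(len(y), k):
            pred = fit_predict(X[train], y[train], X[val], **params)
            fold_mse.append(np.mean((pred - y[val]) ** 2))
        score = float(np.mean(fold_mse))
        if score < best_mse:
            best_params, best_mse = params, score
    return best_params, best_mse

# SVM candidates listed in the text: 5 values of C times 5 values of gamma.
svm_grid = {"C": [10, 100, 300, 500, 700],
            "gamma": [1, 0.1, 0.01, 0.001, 0.0001]}
```

With the 272 training samples, each of the five folds holds out 54 or 55 samples, and the SVM grid alone requires 25 combinations times 5 fits.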
For the LSTM models (LSTM, ILSTM), the hyperparameters include the training epochs and convergence tolerance, the number of hidden layers and of nodes in each hidden layer, the kind of nodes in each hidden layer, the transfer function, the learning rate and its updater, and the input subsequence length. These parameters can be combined in many ways, and they are hard to design because no earlier reference exists in the dam displacement prediction field. In this work, they are obtained by trial and error, including 5-fold cross-validation. Finally, the training epochs are set to 3000 and the convergence tolerance to 0.001. Hidden layer 1 with 3 memory blocks is used in the network structures of both the LSTM and ILSTM models. Hidden layer 2 with 4 neurons is applied in the ILSTM model. Hidden layer 2 is not used in the LSTM model because we found that it brings no benefit and can even make the model prone to overfitting. After several trials, the Relu transfer function is used in hidden layer 2, and the linear transfer function is used in the output layer in place of the sigmoid function in Figures 2–4. The initial learning rate is set to 0.01, and the Adam updater is used, which computes an independent adaptive learning rate for each parameter from first-order and second-order moment estimates of the gradient. It is very robust, usually converges quickly, and gives good performance with the hyperparameters β1 = 0.9, β2 = 0.999, and ϵ = 10^−8.
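The Adam updater mentioned above can be written as a single update step. This sketch follows the standard published form of the Adam update with bias correction and uses the learning rate and hyperparameters quoted in the text as defaults, with ϵ taken as 1e-8. It is an illustration and is not the TensorFlow implementation used in the study.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update.

    theta: parameters, grad: gradient of the loss at theta,
    m, v: running first and second moment estimates,
    t: 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-order moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-order moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Because the step is divided elementwise by the square root of the second moment, each parameter effectively receives its own learning rate, which is the property the text refers to.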
The weight and bias matrices of LSTM and ILSTM models are initialized by a normal distribution with a mean of 0 and a variance of 1. Five runs are conducted for considering the impact of different initial weights and biases.
With the parameters above, the effect of the input subsequence length on the performance of the LSTM models is analyzed, and the results are shown in Figure 9. As can be seen, the training MSE and validation MSE are relatively small when the length is greater than 60 (delay days), and the validation MSE is smallest at length 135. It can be inferred that a short length suffers from insufficient information and that the input data within 45 days before the observed displacement date cannot fit the displacement well. Meanwhile, once the length is long enough, further increases bring little improvement in accuracy. Thus, the input subsequence length (delay days) is set to 135. The evaluation indexes of the LSTM and ILSTM models are listed in Tables 2 and 3, respectively.
The performance of the six models on both the training and the validation datasets is shown in Table 4.
It can be seen from Table 4 that the training and validation accuracy of the MLR model is satisfactory, without significant prediction deviation. The training and validation accuracy of the MLP model is relatively high, which further supports the existence of a nonlinear relationship between the ten input variables and the output displacement. The training accuracy of the SVM model is the worst, while its validation accuracy is better than that of the MLR and BRT models and better than its own training accuracy. The training accuracy of BRT is the best, while its prediction accuracy shows a significant deviation. This phenomenon is visible in Figure 13 in the validation phase (from 2011/08/28 to 2014/05/29). Although cross-validation prevents overfitting for models such as MLP and SVM, it does not seem to work well for the BRT model. Figures 10–13 show that the results of these ten-variable models have a slightly rising trend in the validation phase.
The accuracy of the LSTM model is better than that of the MLR and SVM models and almost the same as that of MLP, with relatively higher training accuracy and lower validation accuracy. Figure 14 shows that the LSTM model has a slight downward trend in the validation phase. ILSTM has the best accuracy in the validation phase, and its training accuracy is lower only than that of the BRT model. Its accuracy is also similar in the training and validation phases, with no rising or downward trend (see Figure 15) and no overfitting problem. Therefore, considering the performance of all these models together, the ILSTM model is the best choice.
Interestingly, the errors of all models except BRT show a large negative deviation at 2007/08/19, mainly because of a sudden increase in water level (see Figure 7). Therefore, these models may not be suitable for displacement prediction when the water level jumps rapidly in the short term. Although the BRT model can simulate the change in the training phase, its large deviation in the validation phase compared with the training phase points to potential instability. In general, a sudden increase or drop in water level requires special attention because it may be connected with floods, leakage channels, and so forth. Since such events call for separate attention in any case, high prediction accuracy in this situation is not essential.
The training time listed in Tables 2, 3, and 4 is the time the model takes to train once on the 272 training samples with the selected parameters, excluding the time consumed by the 5-fold cross-validation. As can be seen from Tables 2–4, the LSTM models need more than 300 seconds to train once, which is one of the main problems of deep learning models. The parallel processing capabilities of GPUs can accelerate the LSTM training and inference processes and alleviate this problem.
4. Conclusions
The LSTM models can make full use of the historical data to predict dam displacement. Their variables, in subsequence form, reflect the delays more conveniently and practically than complex and diverse variable selection does. The improved LSTM model splits the input data into two sets: delay variables, which pass through LSTM memory blocks, and no-delay variables, which are fed through the network without memory blocks, making the variables more physically reasonable. The improved LSTM model is trained and validated on Dongjiang arch dam displacement data and achieves the best performance compared with the typical LSTM as well as the MLP, MLR, SVM, and BRT models.
LSTM models also have some deficiencies compared with the previous models. The sequence form of the input makes it harder to calculate the impact of each variable. Meanwhile, the hyperparameters need to be tuned, and searching their combinations is hard when computation is slow.
In the future, LSTM models will be further studied for monitoring data that depend on the historical data of the external environment, with the aim of replacing models that are less accurate or that demand complex variable selection.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
The authors would like to thank Deng Zhenhua, Li Changsen, and Liu Zixing of Dongjiang power plant for providing them with detailed and reliable data.
N. G. Beaujoint, “Les méthodes de surveillance des barrages au service de la production hydraulique d'Electricité de France, problèmes anciens et solutions nouvelles,” in Proceedings of the IXth International Congress on Large Dams, pp. 529–550, 1967.
“Methods of analysis for the prediction and the verification of dam behavior,” Swiss Committee on Dams, ICOLD, 2003.
G. Prakash, A. Sadhu, S. Narasimhan, and J.-M. Brehe, “Initial service life data towards structural health monitoring of a concrete arch dam,” Structural Control and Health Monitoring, vol. 25, no. 1, 2018.
R. Vanatwerp, “Engineering and design: deformation monitoring and control surveying,” Engineer manual, US Army corps of engineering EM:1110-1111, 1994.
K.-T. T. Bui, D. T. Bui, J. Zou, C. Van Doan, and I. Revhaug, “A novel hybrid artificial intelligent approach based on neural fuzzy inference model and particle swarm optimization for horizontal displacement modeling of hydropower dam,” Neural Computing and Applications, vol. 29, pp. 1–12, 2016.
H. Mirzabozorg, M. A. Hariri-Ardebili, M. Heshmati, and S. M. Seyed-Kolbadi, “Structural safety evaluation of Karun III Dam and calibration of its finite element model using instrumentation and site observation,” Case Studies in Structural Engineering, vol. 1, no. 1, pp. 6–12, 2014.
F. Kang, J. Liu, J. Li, and S. Li, “Concrete dam deformation prediction model for health monitoring based on extreme learning machine,” Structural Control and Health Monitoring, vol. 24, no. 10, 2017.
G. Lombardi, F. Amberg, and G. R. Darbre, “Algorithm for the prediction of functional delays in the behaviour of concrete dams,” International Journal on Hydropower and Dams, vol. 15, no. 3, pp. 111–116, 2008.
F. Salazar, M. Á. Toledo, J. M. González, and E. Oñate, “Early detection of anomalies in dam performance: A methodology based on boosted regression trees,” Structural Control and Health Monitoring, vol. 24, no. 11, 2017.
C. Shao, C. Gu, M. Yang, Y. Xu, and H. Su, “A novel model of dam displacement based on panel data,” Structural Control and Health Monitoring, vol. 25, no. 1, 2018.
P. Pławiak and K. Rzecki, “Approximation of phenol concentration using computational intelligence methods based on signals from the metal-oxide sensor array,” IEEE Sensors Journal, vol. 15, no. 3, pp. 1770–1783, 2015.
F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” in Proceedings of the 9th International Conference on Artificial Neural Networks (ICANN99), pp. 850–855, September 1999.
Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” Computer Science: Machine Learning, 2015.
J. Appleyard, T. Kocisky, and P. Blunsom, “Optimizing performance of recurrent neural networks on gpus,” Computer Science: Machine Learning, 2016.
V. Nair and G. E. Hinton, “Rectified linear units improve Restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML '10), pp. 807–814, Haifa, Israel, June 2010.
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.