Fluid Catalytic Cracking (FCC), a key unit for secondary processing of heavy oil, is one of the main sources of NOx emissions in refineries, and these emissions can be harmful to human health. Owing to the complex behaviour of reaction, product separation, and regeneration, it is difficult to accurately predict NOx emissions during the FCC process. In this paper, a novel deep learning architecture formed by integrating a Convolutional Neural Network (CNN) and a Long Short-Term Memory Network (LSTM) for nitrogen oxide emission prediction is proposed and validated. The CNN is used to extract features among multidimensional data. The LSTM is employed to identify the relationships between different time steps. Data from the Distributed Control System (DCS) of one refinery were used to evaluate the performance of the proposed architecture. The results indicate the effectiveness of CNN-LSTM in handling multidimensional time series datasets, with an RMSE of 23.7098 and an R2 of 0.8237. Compared with previous methods (CNN and LSTM), CNN-LSTM overcomes the limitation of high-quality feature dependence and handles large amounts of high-dimensional data with better efficiency and accuracy. The proposed CNN-LSTM scheme would be a beneficial contribution to the accurate and stable prediction of irregular trends in NOx emissions from the refining industry, providing more reliable information for NOx risk assessment and management.

1. Introduction

Fluid Catalytic Cracking (FCC) is one of the most important technologies for secondary processing of heavy oil in refining and chemical enterprises [1]. The catalytic cracking reaction and catalyst regeneration are the main chemical processes of FCC. In the catalytic cracking reaction, crude oil is transformed into gasoline and diesel under catalysis, during which 40%–50% of the nitrogen in the feedstock is transferred to coke and deposited on the catalyst [2–4]. Then, the coke-covered spent catalysts are burned in the regenerator to restore catalyst activity and to maintain heat balance, energy recovery, and stable operation. During the catalyst regeneration process, about 90% of the nitrogen in coke is converted into N2 and the rest into NOx and other reduced nitrogen compounds (NH3, HCN, etc.). NO and NO2 are the most frequently detected NOx species and pose potential risks to human health. As a blood poison, NO causes hemic hypoxia and depresses the central nervous system by binding strongly with hemoglobin (Hb); NO2 causes bronchiectasis (and even toxic pneumonia and pulmonary edema) by irritating and corroding the lung tissue [5, 6]. Furthermore, with the development of refining and chemical technology, especially catalytic techniques, more heavy oil with a high percentage of nitrogen (such as residual oil and wax oil) is being utilized. Therefore, it is urgent to accurately predict the NOx produced during the FCC process so as to effectively control the noxious gas discharged into the environment, subject to technical and economic conditions.

The FCC process is complex from both the modelling and the control points of view [7–11]. Fortunately, many researchers have explored and developed semiempirical models, lumped kinetic models, and molecular-based kinetic models [12]. A comprehensive review of FCC process modelling, simulation, and control was reported in [13]. Many research studies have been conducted using different models for modelling, controlling, and optimizing the FCC process with promising results [14–16]. With the development of statistical learning theory, machine learning algorithms have proved to be effective methods for simulating natural systems, capturing nonlinearity at limited computational cost. The application of machine learning algorithms in the field of FCC is still at an early stage. Michalopoulos et al. [17] and Bollas et al. [18] proved the applicability of Artificial Neural Networks (ANN) in predicting FCC products and optimized the operating conditions by developing ANN models for determining the steady-state behaviour of industrial FCC units. Zhang [19] established a NOx emission model using a Support Vector Machine (SVM) and further optimized the parameters with an improved adaptive genetic algorithm. Gu et al. [20] constructed a boiler combustion model on the basis of Least Squares Support Vector Machines (LSSVM) and successfully forecasted NOx emissions and other parameters, which were verified by field data. Recent advances in artificial intelligence (AI), led by deep learning, offer powerful predictive tools for handling highly complex chemical processes such as FCC. Shao et al. proposed a new fault diagnosis method for chemical processes by combining LSTM (Long Short-Term Memory) and CNN (Convolutional Neural Network) [21]. Yang et al. integrated a deep neural network (“black box model”) with a lumped kinetic model (“white box model”) to create a novel “gray box model” for improving the efficiency and accuracy of simulating the FCC process [22].
However, to the best of the authors’ knowledge, there are few research studies using deep learning algorithms for predicting the NOx emission in FCC unit. Some research studies of pollution emission problems have been conducted in power plants [23]. Compared to power plants, the FCC process is relatively complex with more factors involved. Therefore, it is of great difficulty to predict NOx emissions in FCC units.

In this paper, a novel deep learning architecture for predicting NOx emissions in the FCC unit is proposed. The architecture is formed by integrating a Convolutional Neural Network (CNN) and a Long Short-Term Memory Network (LSTM) (referred to as CNN-LSTM hereafter), with the CNN layers extracting features among several variables and the LSTM layers learning time series dependencies. Data from the Distributed Control System (DCS) of one refinery were used to demonstrate the performance of CNN-LSTM in the FCC unit. The main contributions of this paper are (1) the proposal of a novel hybrid CNN-LSTM scheme which is able to extract the features among different data sequences and the features between different time steps and (2) the application of the proposed scheme to predict NOx emission during the FCC process with significant results.

2. Deep Learning Algorithms

2.1. Convolutional Neural Network Model (CNN)

CNN is a special kind of neural network which is widely used in the field of image processing [24, 25]. In a CNN, the convolution layer extracts features from the output of the previous layer with a convolution operation and stores them in a feature map. The pooling layer reduces the computational complexity by reducing the size of the output passed from one layer to the next while preserving important information. Many pooling techniques are available, among which max pooling, which keeps the maximum element of each pooling window, is the most widely used. The pooling layer takes the outputs of the convolution layer and maps them to the next layer. The last layer of a CNN is usually fully connected for data classification. Figure 1 shows the basic architecture of CNN.

In neural network training, the accuracy and training speed can be affected by many factors [26], for example, the number of input layer nodes, the number of hidden layers, the number of hidden layer nodes, and the Internal Covariate Shift (ICS). ICS means that the inputs of the current layer change as the parameters of the previous layers vary, which leads to longer training time. In addition, if the inputs are distributed in ranges where the gradient of the activation function is low, the ICS would cause the gradient to vanish. To mitigate these problems, a Dropout method was included as follows.

Dropout (Figure 2) was first proposed by Hinton et al. to reduce the overfitting problem in neural networks [27–32]. In the dropout procedure, each unit is temporarily removed with a probability of P, which reduces the dependency of the model on local features and, consequently, improves its generalization ability.

2.2. Long Short-Term Memory Network (LSTM)

RNN is a kind of deep neural network which is specially designed to process sequential data [33]. Compared with the traditional ANN, the distinguishing characteristic of RNN is the inclusion of dependencies through time. The basic structure of an RNN is shown in Figure 3.

The left side and the right side of the architecture are the folded form and the unfolded form, respectively. In equations (1)∼(4), t is time, x is the sequence of input data, h is the hidden layer state of the network, o is the output vector of the neuron, U is the parameter matrix from the input layer to the hidden layer, V is the parameter matrix from the hidden layer to the output layer, W is the parameter matrix between the hidden layers at different times, and ŷt represents the probability output of the predicted value after normalization. All the parameter matrices are shared across the hidden states at different times.
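
For reference, a standard formulation of a simple RNN that is consistent with the symbols above can be written as follows. The bias vectors b and c, the choice of tanh as the hidden activation, and softmax as the normalization are assumptions made for illustration; the original equations (1)∼(4) may differ in detail.

```latex
\begin{aligned}
a_t &= U x_t + W h_{t-1} + b, \\
h_t &= \tanh(a_t), \\
o_t &= V h_t + c, \\
\hat{y}_t &= \operatorname{softmax}(o_t).
\end{aligned}
```

Because U, V, and W carry no time index, the same matrices are reused at every time step, which is the parameter sharing mentioned above.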

To address the vanishing or exploding gradient problem that arises when training RNNs, researchers proposed LSTM by introducing a gate mechanism into RNN [34, 35]. The gate mechanism is composed of an input gate, an output gate, and a forget gate. LSTM is a special type of RNN, so the neurons in the LSTM model are connected to each other in a directed cycle. The basic structure of LSTM is shown in Figure 4.

The LSTM model preserves long-term dependencies effectively by using three different gates. The structure of LSTM (shown in Figure 4) is similar to that of RNN. LSTM uses the three gates to regulate and preserve the information in every node state. The LSTM gates and cells are described in equations (5)∼(8), where b represents the bias vector; W is the weight matrix; xt is the input vector at time t; and In, f, C, and O represent the input gate, forget gate, cell memory, and output gate, respectively.


Given the complementary characteristics of CNN and LSTM, a natural way to combine their advantages is to integrate the two. In this study, a new deep learning scheme was proposed by integrating CNN and LSTM. Two CNN layers were used to ensure the correlation and effective extraction of features from multidimensional data. The feature sequence from the CNN layers was used as the input for LSTM. The time dependencies were further extracted in the LSTM layer. The architecture contains three fully connected layers, referred to as FC1, FC2, and FC3. FC1 and FC2 are used to obtain the features extracted by the CNN layer, and FC3 is used to conduct the final data prediction. Figure 5 shows the architecture of the proposed CNN-LSTM.

3.1. CNN Layer

The input data (train_x) and output data (train_y) are defined as follows:

where p represents the time steps and q represents the data features.

The ith sample from the training set is fed into the network. In the first convolution layer (1stConV), the convolution kernel size, number, and step length are denoted as filter_size = (m, n), filter_num, and strides, respectively.

The jth convolution kernel Wj is defined as follows:

The operation between the jth convolution kernel Wj and the input train_xi can be described as follows:

The operation for convolution layer is denoted as ⊙, where

The element x in the feature map is obtained by multiplying Wj elementwise by the receptive field, which is recorded as follows:

where ○ denotes elementwise multiplication.

1stConV is calculated as follows:

where W = [W1, W2, …, Wk].

ReLU is used as the activation function:

$$\mathrm{ReLU}(x) = \max(0, x).$$

The output of the convolutional layer is mapped nonlinearly by the activation function. In the pooling layer, the data are compressed, and the pooling window size is recorded as pooling_size = (m′, n′).

For every feature map,where

Thus, the ith sample after convolutional, activation, and pooling layer is

The convolution, activation, and pooling operations in 2ndConV are similar to those in 1stConV.
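
The convolution, activation, and pooling steps described above can be sketched in NumPy as follows. This is a minimal illustration for a single sample and a single kernel, not the authors' implementation: the helper names `conv2d_valid`, `relu`, and `max_pool`, the toy input, and the kernel values are all assumptions, and no padding is applied. As in Keras convolution layers, the kernel is not flipped before it is applied.

```python
import numpy as np

def conv2d_valid(x, kernel, strides=(1, 1)):
    """Slide `kernel` over the 2-D input `x` (no padding) and sum the elementwise products."""
    m, n = kernel.shape
    sh, sw = strides
    out_h = (x.shape[0] - m) // sh + 1
    out_w = (x.shape[1] - n) // sw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            field = x[r * sh:r * sh + m, c * sw:c * sw + n]  # receptive field
            out[r, c] = np.sum(kernel * field)
    return out

def relu(z):
    return np.maximum(z, 0.0)

def max_pool(fmap, pool_size):
    """Non-overlapping max pooling; rows or columns that do not fill a window are dropped."""
    ph, pw = pool_size
    out_h, out_w = fmap.shape[0] // ph, fmap.shape[1] // pw
    trimmed = fmap[:out_h * ph, :out_w * pw]
    return trimmed.reshape(out_h, ph, out_w, pw).max(axis=(1, 3))

# Toy sample with p = 4 time steps and q = 5 features (values are made up).
sample = np.outer(np.arange(1, 5), np.arange(5) ** 2).astype(float)
kernel = np.array([[-1.0, 0.0, 1.0]])      # filter_size = (1, 3), hypothetical weights

feature_map = relu(conv2d_valid(sample, kernel))   # shape (4, 3)
pooled = max_pool(feature_map, (2, 1))             # pooling_size = (2, 1), shape (2, 3)
print(pooled)
```

With a (1, 3) kernel the convolution only mixes neighbouring features within one time step, while the (2, 1) pooling window halves the number of time steps.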

Dropout is denoted as dropout (λ), where λ takes a value between 0 and 1 and gives the fraction of the data that should be discarded. For instance, dropout (0.5) means that 50% of the neuron outputs are discarded randomly.
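
As a rough sketch of what dropout (λ) does during training, the NumPy snippet below zeroes each element with probability `rate` and rescales the remaining ones by 1 / (1 − rate), which is the behaviour documented for the Keras `Dropout` layer. The function name `dropout`, the seed, and the all-ones input are assumptions chosen for illustration.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero each element with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return x                      # dropout is inactive at inference time
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10_000)         # made-up layer outputs
dropped = dropout(activations, 0.5, rng)
fraction_zeroed = float(np.mean(dropped == 0.0))
print(f"fraction discarded: {fraction_zeroed:.3f}")
```

The rescaling keeps the expected value of each output unchanged, so no correction is needed when dropout is switched off for prediction.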

In the FC layer dense (α), α is the size of the output in the last dimension. For the above input of shape [none, a′, b′, k], only the last dimension is changed after full connection, giving [none, a′, b′, α].

The data are transformed from [samples, height, width, channels] to [samples, timesteps, features] and then fed into the LSTM layer. The modular construction of LSTM is shown as follows, in which forget, input, and output gates are included.

3.2. LSTM Layer

The forget gate is expressed as follows:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right),$$

where Wf represents the weight matrix for the forget gate; [ht−1, xt] denotes the concatenation of ht−1 and xt; bf represents the bias of the forget gate; and σ represents the sigmoid function. The dimensionalities of the input layer, hidden layer, and cell state are dx, dh, and dc, respectively. In general, dc = dh, and the dimensionality of the weight matrix for the forget gate, Wf, is dc × (dh + dx). In fact, the weight matrix Wf is composed of two matrices, Wfh (acting on ht−1; dimensionality dc × dh) and Wfx (acting on xt; dimensionality dc × dx), so Wf can be written as follows:

$$W_f \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} = \begin{bmatrix} W_{fh} & W_{fx} \end{bmatrix} \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} = W_{fh} h_{t-1} + W_{fx} x_t.$$

The input gate can be expressed as follows:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right),$$

where Wi represents the weight matrix for the input gate and bi represents the bias of the input gate.

The candidate cell state, which describes the current input, is calculated from the last output and the current input:

$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right).$$

The current cell state (Ct) is as follows:

$$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t,$$

where the last cell state (Ct−1) is multiplied elementwise by the forget gate (ft) and the current input cell state (C̃t) is multiplied elementwise by the input gate (it).

The new cell state (Ct) thus combines the current memory (C̃t) and the long-term memory Ct−1. On the one hand, owing to the forget and input gates, the new cell state can store information from a long time ago or forget irrelevant content. On the other hand, the output gate controls the effect of long-term memory on the current output:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right).$$

The final output of LSTM is decided by the output gate and the cell state:

$$h_t = o_t \circ \tanh(C_t).$$
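
The gate computations of this section can be illustrated with a single-step NumPy sketch, given here under stated assumptions rather than as the authors' code. The function name `lstm_step`, the parameter dictionary, the random weights, and the toy tensor sizes are hypothetical. The reshape treats the height axis of the CNN output as the time axis and flattens width and channels into features, which is one plausible reading of the transformation to [samples, timesteps, features].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; each W has shape (d_h, d_h + d_x), each b has shape (d_h,)."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # final output
    return h_t, c_t

rng = np.random.default_rng(0)
d_x, d_h = 6, 4                        # toy sizes, with d_c = d_h
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
    params[f"b_{gate}"] = np.zeros(d_h)

# Made-up CNN output [samples, height, width, channels] = [2, 3, 2, 3],
# reshaped to [samples, timesteps, features] = [2, 3, 6].
feature_maps = rng.normal(size=(2, 3, 2, 3))
sequences = feature_maps.reshape(2, 3, -1)

h_t, c_t = np.zeros(d_h), np.zeros(d_h)
for x_t in sequences[0]:               # run the first sample through all time steps
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)

# The concatenated form W_f [h, x] equals the split form W_fh h + W_fx x.
W_fh, W_fx = params["W_f"][:, :d_h], params["W_f"][:, d_h:]
x_0 = sequences[0][0]
concat_form = params["W_f"] @ np.concatenate([h_t, x_0])
split_form = W_fh @ h_t + W_fx @ x_0
```

Since the output gate lies in (0, 1) and tanh lies in (−1, 1), every component of the output stays strictly between −1 and 1.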

3.3. Realization of CNN-LSTM

The CNN-LSTM was realized in Keras using TensorFlow backend based on Figure 5 and the theory described in the previous sections (shown in Algorithm 1). After normalization, the training data (train_x, train_y) was fed into the constructed CNN model (1st ConV_model) to train the parameters with loss function (loss_function which is “mae” in our case) and optimizer (optimizer, which is “adam” in our case). The feature map of CNN was then extracted and reshaped to train the LSTM layer.

Input: train_x, train_y
Hyper-parameters: filters, kernel_size, pool_size, batch_size, rate
Initialize ()
Normalization (train_x, train_y)
//The first convolution layer
1st ConV_model = Sequential ([Convolution2D (filters, kernel_size, name = “Conv2D_1”), MaxPooling2D (pool_size), Flatten (), Dense (units, activation), Dropout (rate), Dense (units, activation)])
1st ConV_model.compile (loss_function, optimizer)
1st ConV_model.fit (train_x, train_y, epochs, batch_size)
//Extract the feature map
1st ConV_feature_model = Model (inputs = 1st ConV_model.input, outputs = 1st ConV_model.get_layer (“Conv2D_1”).output)
1st ConV_feature_output = 1st ConV_feature_model.predict (train_x)
//LSTM layer
1st ConV_feature_output = reshape (1st ConV_feature_output)
LSTM_model = Sequential ([LSTM (units, activation, recurrent_activation), Dense (units, activation)])
LSTM_model.compile (loss_function, optimizer)
LSTM_model.fit (1st ConV_feature_output, train_y, epochs, batch_size)

4. Experiments

4.1. Datasets

Several key production factors that affect the nitrogen oxide concentration in the plant were selected from the 276 production factors of the catalytic cracking unit. Based on expert consultation, the key production factors include the nitrogen content in raw materials, the process control parameters of the reactor (FCC reaction temperature, catalyst/oil ratio, and residence time), the regeneration process control parameters (regeneration mode, dense bed temperature, oxygen content in furnace, and carbon monoxide concentration), and the catalyst species (platinum CO combustion catalyst and nonplatinum CO combustion catalyst).

A total of 2.592 × 10^5 samples collected over half a year were divided into training and validation sets in proportions of 70% and 30%, respectively. As shown in Table 1, the key production factors were used as input data and the NOx emission was used as the label.
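
The sizes implied by this split can be checked with a few lines of Python. The sample count and the 70%/30% proportions come from the text; the helper `chronological_split` and the idea of cutting the series in time order are assumptions, since the splitting procedure is not described. The last line only observes that the sample count equals the number of minutes in 180 days, which would be consistent with one record per minute, although the sampling interval is not stated.

```python
def chronological_split(rows, train_percent=70):
    """Split a time-ordered list into leading training rows and trailing validation rows."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]

samples_total = 259_200                      # 2.592 x 10^5, as stated in the text
train_rows, val_rows = chronological_split(list(range(samples_total)))
n_train, n_val = len(train_rows), len(val_rows)

minutes_in_180_days = 180 * 24 * 60          # equals samples_total
print(n_train, n_val, minutes_in_180_days)
```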

In order to eliminate the dimensional effects among different variables, the original data were rescaled using the MinMaxScaler function in Python (equations (25) and (26)):

$$X_{\mathrm{std}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}, \tag{25}$$

$$X_{\mathrm{scaled}} = X_{\mathrm{std}} \times (\max - \min) + \min, \tag{26}$$

where X_max and X_min are the maximum and minimum values of the data and max and min are the maximum and minimum values of the zoom range. In addition, the time series prediction problem was reframed as a supervised learning problem.
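
The two preprocessing steps can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' pipeline: the function names `min_max_scale` and `to_supervised`, the window length, and the NOx readings are made up, and the window construction (predict the next step from the preceding `timesteps` rows) is one common way to reframe a time series as supervised learning.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale a list of numbers linearly into `feature_range`."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        return [lo for _ in values]          # constant column: map everything to the lower bound
    return [(v - v_min) / span * (hi - lo) + lo for v in values]

def to_supervised(rows, targets, timesteps):
    """Pair each window of `timesteps` consecutive rows with the target of the following step."""
    xs, ys = [], []
    for end in range(timesteps, len(rows)):
        xs.append(rows[end - timesteps:end])
        ys.append(targets[end])
    return xs, ys

nox = [120.0, 150.0, 90.0, 180.0, 135.0]     # made-up NOx readings
scaled = min_max_scale(nox)
rows = [[value] for value in scaled]         # one feature per time step in this toy case
windows, labels = to_supervised(rows, scaled, timesteps=2)
print(labels)
```

In practice the minimum and maximum would usually be taken from the training set only and then reused for the validation set, so that no information from the validation period leaks into training.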

4.2. Hyperparameters
4.2.1. CNN

The hyperparameters in CNN mainly include weight initialization, learning rate, activation function, number of epochs, number of iterations, etc. Several important hyperparameters, including the number of convolution layers, the number of convolution kernels, and the size of the convolution kernel, are discussed in this study.

4.2.2. LSTM

Long Short-Term Memory (LSTM) is a kind of RNN, in which tanh could be replaced by the sigmoid activation function, resulting in faster training. For LSTM, Adam was used as the optimizer, MSE was used as the loss function, and the identity activation function was used to complete the weight initialization. The hyperparameters in LSTM mainly include the number of hidden layer nodes and the batch size. The number of hidden layer nodes in LSTM has a direct influence on the learning results by affecting the capacity for nonlinear mapping, as in feedforward neural networks. The batch size influences the computation cost and the learning accuracy by determining the amount of data used for each gradient update.

4.2.3. CNN-LSTM

As a neural network model combining CNN with LSTM, CNN-LSTM has basically the same hyperparameters as CNN and LSTM, which mainly include the learning rate η, the regularization parameter λ, the number of neurons in each hidden layer (such as the fully connected layers and the LSTM layer), the batch size, the convolution kernel size, the neuron activation function, the pooling layer size, and the dropout rate. All the related hyperparameters are investigated and analysed in Section 5.

4.3. Performance Criteria

The performance of the different algorithms was evaluated by the Root Mean Square Error (RMSE) (equation (27)) and the coefficient of determination (R2) (equation (28)) [36]. The RMSE value reflects the deviation between predicted and observed values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(y_n - \hat{y}_n\right)^2}, \tag{27}$$

where N is the data length, yn is the nth observed value, and ŷn is the nth predicted value.

The R2 value reflects the accuracy of the model; it ranges from 0 to 1, with 1 denoting a perfect match, and is computed from the predicted values, the observed values, and the average value.
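
Both criteria are straightforward to compute; a minimal sketch in plain Python is given below. The function names and the sample values are made up. The R2 function uses the common definition 1 − SS_res / SS_tot, which is an assumption here because the exact form used in the study is not reproduced above; note that this definition can become negative for a model that fits worse than the mean of the observations.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / n)

def r2_score(observed, predicted):
    """Coefficient of determination in the form 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

observed = [100.0, 120.0, 140.0, 160.0]      # made-up NOx observations
predicted = [110.0, 120.0, 130.0, 160.0]     # made-up model outputs
print(rmse(observed, predicted), r2_score(observed, predicted))
```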

5. Results and Discussion

5.1. CNN-LSTM


The hyperparameters mentioned above were determined by the trial-and-error method. RMSE and R2 were used as objective functions to optimize the size and number of convolution kernels, the batch size, the number of convolution layers, and the dropout probability. The results shown in Figure 6 indicate the process of optimizing the hyperparameters for the proposed method.

The network structure adopts two convolutional parts as the CNN layer; the kernel size is for the first and second CNN layer. Each convolution layer is followed by a Rectified Linear Unit (ReLU) layer (equation (29)) and a maximum pooling layer. The output of the CNN part is a 32-dimensional vector after these operations. All the vectors form a sequence and are fed into the LSTM layer:

$$a_{i,j,k} = \max\left(z_{i,j,k}, 0\right), \tag{29}$$

where zi,j,k is the input of the activation function at location (i, j) on the kth channel. ReLU allows neural networks to compute faster than sigmoid or tanh activation functions and to train deep networks more effectively. In order to train a neural network with a strict backpropagation algorithm, the contribution of all samples to the gradient must be considered simultaneously.

With the incorporation of the LSTM network, the proposed CNN-LSTM network can be trained with time series data of the FCC unit. An LSTM layer followed by the FC layer is used to assign the predicted value to each frame in the sequence.

The output of the CNN layer passes through two dropout layers and two FC layers to combine the features extracted by the CNN layer. During the training stage, the dropout layer randomly removes connections between the CNN layer and the FC layer in each iteration. In our experiments, we set the dropout rate to an empirical value of 0.25, which proved effective in improving performance (the experiment on the dropout rate is shown in Figure 6(e)). The convolution layer, pooling layer, and activation function layer map the raw data to the feature space of the hidden layer, and the fully connected layer plays the role of a “classifier” which maps the learned feature representation to the label space of the sample.

5.2. CNN

The hyperparameters of CNN were also determined by the trial-and-error method. RMSE and R2 were used as objective functions to optimize the size and number of convolution kernels and the number of convolution layers. The results shown in Figure 7 indicate the process of optimizing the hyperparameters for CNN, from which one can conclude that the optimal values for the number of convolution layers, the number of convolution kernels, and the size of the convolution kernel are 1, 16, and 1 × 5, respectively.

5.3. LSTM

The optimization process of the hyperparameters for LSTM is shown in Figure 8. RMSE and R2 were used as quantitative performance criteria to evaluate the hyperparameters (i.e., the number of hidden layer nodes and the batch size). The process and results shown in Figure 8 indicate that the optimal values for the number of hidden layer nodes and the batch size are 40 and 500, respectively.

5.4. Experiments on Different Methods

The accuracy of CNN, LSTM, and the proposed CNN-LSTM method in the training and validation stages was evaluated by R2 and RMSE. All methods were well tuned, and ten test runs were conducted for each method to reduce the effect of random errors. The average criteria for each method were calculated to evaluate the performance. The results are presented in Tables 2 and 3 and Figure 9. Compared with traditional CNN, the proposed CNN-LSTM was more accurate in NOx emission prediction, with an average RMSE of 23.7089 and an R2 of 0.8237. The combination integrates the advantages of CNN and LSTM, which are capable of extracting the features among different data sequences and the features between different time steps, respectively. This ability suits the characteristics of datasets from refining and chemical enterprises. By describing the local feature relationships under multidimensional and long-term conditions, CNN-LSTM matches the observations better than the other methods.

6. Conclusions

In this paper, a novel CNN-LSTM scheme combining CNN and LSTM was proposed for the prediction of the NOx concentration observed during the FCC process. Dropout was introduced to accelerate network training and address the overfitting issue. In our study, a series of hyperparameters (learning rate, regularization parameter, number of neurons in each hidden layer, batch size, convolution kernel size, neuron activation function, pooling layer size, and dropout rate) and conditions (raw materials, process control parameters of the reactor, regeneration process control parameters, and catalyst species) were selected and optimized. Experiments were conducted to evaluate the proposed scheme against traditional methods (CNN and LSTM) as baseline models. The hyperparameters of all the methods were optimized to obtain the best results. RMSE and R2 were used to evaluate the performance of the different methods. Owing to its capability of extracting features among different data sequences and different time steps, CNN-LSTM achieved better efficiency and accuracy than the baseline models. This study indicates a potential direction for deep learning methods that integrate different architectures to combine their individual advantages. The CNN-LSTM scheme proposed in this paper would be a beneficial contribution to the accurate and stable prediction of irregular trends in NOx emissions from the refining industry and provides more reliable information for NOx risk assessment and management. Future work will focus on attention and transformer mechanisms to obtain better results and will explore the application of the proposed scheme to other datasets.

Data Availability

All data and program files included in this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments


This work was supported by CNPC Basic Research Fund Projects “Research on Prediction and Early Warning System of Air Pollution Emission in FCC and New Control Model” (no. 2017D-5008) and Science Foundation of China University of Petroleum, Beijing (nos. 2462018YJRC007 and 2462020YXZZ025).