Abstract
Fluid Catalytic Cracking (FCC), a key unit for secondary processing of heavy oil, is one of the main sources of NO_{x} emissions in refineries, and these emissions can be harmful to human health. Owing to the complex behaviour of the reaction, product separation, and regeneration steps, it is difficult to accurately predict NO_{x} emissions during the FCC process. In this paper, a novel deep learning architecture that integrates a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM) is proposed and validated for nitrogen oxide emission prediction. The CNN extracts features among multidimensional data, and the LSTM identifies the relationships between different time steps. Data from the Distributed Control System (DCS) of one refinery were used to evaluate the performance of the proposed architecture. The results indicate the effectiveness of CNN-LSTM in handling multidimensional time series datasets, with an RMSE of 23.7098 and an R^{2} of 0.8237. Compared with the individual CNN and LSTM methods, CNN-LSTM overcomes the dependence on high-quality features and handles large amounts of high-dimensional data with better efficiency and accuracy. The proposed CNN-LSTM scheme contributes to the accurate and stable prediction of irregular trends in NO_{x} emissions from the refining industry, providing more reliable information for NO_{x} risk assessment and management.
1. Introduction
Fluid Catalytic Cracking (FCC) is one of the most important technologies for secondary processing of heavy oil in refining and chemical enterprises [1]. The catalytic cracking reaction and catalyst regeneration are the main chemical processes of FCC. In the catalytic cracking reaction, crude oil is transformed into gasoline and diesel under catalysis, during which 40%–50% of the nitrogen in the feedstock is transferred to coke and deposited on the catalyst [2–4]. The coke-covered spent catalyst is then burned in the regenerator to restore catalyst activity, maintain heat balance, recover energy, and keep operation stable. During catalyst regeneration, about 90% of the nitrogen in the coke is converted into N_{2} and the rest into NO_{x} and other reduced nitrogen compounds (NH_{3}, HCN, etc.). NO and NO_{2} are the most commonly detected NO_{x} species, and both pose potential risks to human health. As a blood poison, NO causes hemic hypoxia and depresses the central nervous system by binding strongly with hemoglobin (Hb); NO_{2} causes bronchiectasis (and even toxic pneumonia and pulmonary edema) by irritating and corroding lung tissue [5, 6]. Furthermore, with the development of refining technology, especially catalytic techniques, more heavy oil with a high percentage of nitrogen (such as residual oil and wax oil) is being processed. Therefore, it is urgent to accurately predict the NO_{x} produced during the FCC process so that the noxious gas discharged into the environment can be effectively minimized subject to technical and economic conditions.
The FCC process is complex from both the modelling and the control points of view [7–11]. Researchers have explored and developed semi-empirical models, lumped kinetic models, and molecular-based kinetic models [12]. A comprehensive review of FCC process modelling, simulation, and control was reported in [13]. Many studies have used different models for modelling, controlling, and optimizing the FCC process with promising results [14–16]. With the development of statistical learning theory, machine learning algorithms have proved to be effective for simulating natural systems, capturing nonlinearity at limited computational cost. The application of machine learning algorithms in the field of FCC is still at an early stage. Michalopoulos et al. [17] and Bollas et al. [18] demonstrated the applicability of Artificial Neural Networks (ANN) in predicting FCC products and optimized the operating conditions by developing ANN models of the steady-state behaviour of industrial FCC units. Zhang [19] established a NO_{x} emission model using a Support Vector Machine (SVM) and further optimized the parameters with an improved adaptive genetic algorithm. Gu et al. [20] constructed a boiler combustion model based on Least Squares Support Vector Machines (LSSVM) and successfully forecasted NO_{x} emissions and other parameters, which were verified with field data. Recent advances in artificial intelligence (AI), led by deep learning, offer powerful predictive tools for highly complex chemical processes such as FCC. Shao et al. proposed a new fault diagnosis method for chemical processes by combining LSTM (Long Short-Term Memory) and CNN (Convolutional Neural Network) [21]. Yang et al. integrated a deep neural network (“black box model”) with a lumped kinetic model (“white box model”) to create a novel “gray box model” that improves the efficiency and accuracy of FCC process simulation [22].
However, to the best of the authors’ knowledge, few studies have used deep learning algorithms to predict NO_{x} emissions in FCC units. Some studies of pollution emission problems have been conducted for power plants [23]. Compared with power plants, the FCC process is more complex and involves more factors, which makes NO_{x} emissions in FCC units considerably harder to predict.
In this paper, a novel deep learning architecture for predicting NO_{x} emissions in the FCC unit is proposed. The architecture integrates a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM) (referred to as CNN-LSTM hereafter), with the CNN layers extracting features among several variables and the LSTM layers learning time series dependencies. Data from the Distributed Control System (DCS) of one refinery were used to demonstrate the performance of CNN-LSTM in the FCC unit. The main contributions of this paper are (1) the proposal of a novel hybrid CNN-LSTM scheme that is able to extract the features among different data sequences and the features between different time steps and (2) the application of the proposed scheme to predict NO_{x} emissions during the FCC process with significant results.
2. Deep Learning Algorithms
2.1. Convolutional Neural Network Model (CNN)
CNN is a special kind of neural network that is widely used in the field of image processing [24, 25]. In a CNN, convolution kernels are applied to the output of the previous layer, and the convolution operation produces feature maps that capture the extracted features. The pooling layer reduces the computational complexity by shrinking the size of the output passed from one stack of layers to the next while preserving important information. Many pooling techniques are available; the most widely used is maximum pooling, which keeps the maximum element of each pooling window. The output of each convolution layer is passed to a pooling layer, whose output is mapped to the next layer. The last layer of a CNN is usually fully connected and is used for data classification. Figure 1 shows the basic architecture of CNN.
In neural network training, the accuracy and training speed can be affected by many factors [26], for example, the number of input layer nodes, the number of hidden layers, the number of hidden layer nodes, and the Internal Covariate Shift (ICS). ICS means that the inputs of the current layer change as the parameters of the previous layers vary, which leads to longer training time. In addition, if the inputs fall in ranges where the gradient of the activation function is small, the ICS can cause the gradient to vanish. To address these problems, the Dropout method was included as follows.
Dropout (Figure 2) was first proposed by Hinton et al. to reduce overfitting in neural networks [27–32]. In the dropout procedure, neurons are randomly deactivated with a probability of P, which reduces the dependency of the model on local features and consequently improves its generalization ability.
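The dropout procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the common "inverted dropout" variant in which the surviving activations are rescaled during training; the function name `dropout` and the rescaling step are choices made for this sketch and are not taken from the paper.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each element with probability `rate` during training."""
    if not training or rate == 0.0:
        return x                                  # inference: activations pass through unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob        # True where the unit is kept
    return x * mask / keep_prob                   # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
activations = np.ones((1000, 32))                 # placeholder activations, not plant data
dropped = dropout(activations, rate=0.5, rng=rng)
print("fraction zeroed:", np.mean(dropped == 0.0))
print("mean activation:", dropped.mean())
```

With a rate of 0.5, roughly half of the activations are zeroed in each call, while the mean activation stays close to its original value because of the rescaling.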
2.2. Long Short-Term Memory Network (LSTM)
RNN is a kind of deep neural network designed specifically to process sequential data [33]. Compared with the traditional ANN, the distinguishing characteristic of an RNN is that it includes dependencies through time. The basic structure of an RNN is shown in Figure 3.
The left side and the right side of the architecture are the folded form and the expanded form, respectively. In equations (1)∼(4), t is time, x is the sequence of input data, h is the hidden layer state of the network, o is the output vector of the neuron, U is the parameter matrix from the input layer to the hidden layer, V is the parameter matrix from the hidden layer to the output layer, W is the parameter matrix between the hidden layers at different times, and ŷ_{t} represents the probability output of the predicted value after normalization. All the parameter matrices are shared across the hidden states at different times.
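A commonly used form of the RNN recurrence that is consistent with the symbols defined above can be written as follows. The activation function \phi (for example, tanh), the bias vectors b and c, and the softmax normalization are assumptions of this sketch and are not necessarily the exact form of equations (1)∼(4).

```latex
\begin{aligned}
h_t       &= \phi\left(U x_t + W h_{t-1} + b\right), \\
o_t       &= V h_t + c, \\
\hat{y}_t &= \operatorname{softmax}(o_t).
\end{aligned}
```

Because U, V, and W do not carry a time index, the same matrices are reused at every time step, which is the parameter sharing mentioned in the text.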
To address the vanishing or exploding gradients that occur when training an RNN, researchers proposed the LSTM, which introduces a gate mechanism into the RNN [34, 35]. The gate mechanism is composed of an input gate, an output gate, and a forget gate. As a special type of RNN, the neurons in the LSTM model are connected to each other in a directed cycle. The basic structure of LSTM is shown in Figure 4.
The structure of LSTM (shown in Figure 4) is similar to that of RNN, but the LSTM preserves long-term dependencies effectively by using three gates to regulate the information stored in every node state. The LSTM gates and cells are described by equations (5)∼(8), where b represents the bias vector; W is the weight matrix; x_{t} is the input vector at time t; and In, f, C, and O represent the input gate, forget gate, cell memory, and output gate, respectively.
3. CNN-LSTM
Given the complementary characteristics of CNN and LSTM, a natural way to combine their advantages is to integrate the two. In this study, a new deep learning scheme was proposed by integrating CNN and LSTM. Two CNN layers were used to ensure the correlation and effective extraction of multidimensional data features. The feature sequence from the CNN layers served as the input to the LSTM, and the time dependencies were further extracted in the LSTM layer. The architecture contains three fully connected layers, referred to as FC1, FC2, and FC3. FC1 and FC2 are used to combine the features extracted by the CNN layers, and FC3 is used to produce the final prediction. Figure 5 shows the architecture of the proposed CNN-LSTM.
3.1. CNN Layer
The input data (train_x) and output data (train_y) are defined as follows, where p represents the number of time steps and q represents the number of data features.
The ith sample from the training set is fed into the network. In the first convolution layer (1stConV), the convolution kernel size, the number of kernels, and the stride are denoted as filter_size = (m, n), filter_num, and strides, respectively.
The jth convolution kernel W_{j} is a weight matrix of size m × n.
The operation between the jth convolution kernel W_{j} and the input train_x_{i} is the convolution operation, denoted as ⊙. Each element x in the feature map is obtained by multiplying W_{j} element-wise with the corresponding receptive field and summing the products, where ○ denotes element-wise multiplication.
The output of 1stConV is calculated by applying all k kernels to the input, where W = [W_{1}, W_{2}, …, W_{k}].
ReLU is used as the activation function:
f(x) = max(0, x).
The activation function applies a nonlinear mapping to the output of the convolutional layer. In the pooling layer, the data are compressed using a pooling window of size pooling_size = (m′, n′), which is applied to every feature map.
Thus, after the convolutional, activation, and pooling layers, the ith sample has the shape [a′, b′, k].
The convolutional, activation, and pooling operations in 2ndConV are similar to those in 1stConV.
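The convolution, activation, and pooling steps of one CNN layer can be illustrated with a small NumPy sketch. It assumes "valid" padding, non-overlapping maximum pooling, and the cross-correlation form of convolution used by most deep learning libraries; the function names and the toy input are hypothetical and only serve to show how each feature-map element is the sum of the element-wise product between a kernel and its receptive field.

```python
import numpy as np

def conv2d_valid(x, kernels, strides=(1, 1)):
    """Slide k kernels of shape (m, n) over one 2-D sample x; returns (a', b', k)."""
    k, m, n = kernels.shape
    sh, sw = strides
    out_h = (x.shape[0] - m) // sh + 1
    out_w = (x.shape[1] - n) // sw + 1
    out = np.zeros((out_h, out_w, k))
    for j in range(k):
        for r in range(out_h):
            for c in range(out_w):
                field = x[r * sh:r * sh + m, c * sw:c * sw + n]  # receptive field
                out[r, c, j] = np.sum(kernels[j] * field)        # element-wise product, then sum
    return out

def relu(z):
    return np.maximum(z, 0.0)

def max_pool(fmap, pool_size):
    """Non-overlapping max pooling over the first two axes of a (h, w, k) feature map."""
    ph, pw = pool_size
    h, w, k = fmap.shape
    h2, w2 = h // ph, w // pw
    trimmed = fmap[:h2 * ph, :w2 * pw, :]          # drop rows/columns that do not fill a window
    return trimmed.reshape(h2, ph, w2, pw, k).max(axis=(1, 3))

sample = np.arange(12.0).reshape(3, 4)             # toy sample: p = 3 time steps, q = 4 features
kernels = np.array([[[-1.0, 1.0]]])                # one kernel with filter_size = (1, 2)
feature_map = relu(conv2d_valid(sample, kernels))
pooled = max_pool(feature_map, pool_size=(1, 2))
print(feature_map.shape, pooled.shape)
```

In this toy case the kernel computes the difference between neighbouring features, so every element of the feature map equals 1 before pooling.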
Dropout is denoted as dropout (λ), where λ takes a value between 0 and 1 and gives the fraction of the data that should be discarded. For instance, dropout (0.5) means that 50% of the neuron outputs are discarded randomly.
The FC layer is denoted as dense (α), where α is the size of the output in the last dimension. For the above input of shape [none, a′, b′, k], only the last dimension changes after the full connection, giving [none, a′, b′, α].
The data are then transformed from [samples, height, width, channels] to [samples, timesteps, features] and fed into the LSTM layer. The modular construction of the LSTM, which includes the forget, input, and output gates, is described as follows.
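The change of layout from [samples, height, width, channels] to [samples, timesteps, features] can be sketched with a NumPy reshape. This example assumes that the height axis plays the role of the time step and that the width and channel axes are flattened into the feature axis; the array sizes are illustrative only.

```python
import numpy as np

samples, height, width, channels = 4, 10, 2, 16    # illustrative sizes, not from the paper
cnn_out = np.arange(samples * height * width * channels, dtype=float)
cnn_out = cnn_out.reshape(samples, height, width, channels)

# Keep the time axis (height) and merge width and channels into one feature axis.
lstm_in = cnn_out.reshape(samples, height, width * channels)
print(lstm_in.shape)   # (4, 10, 32)
```

Each time step of the reshaped array holds the channels of the first width position followed by the channels of the second, so no values are reordered across time steps.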
3.2. LSTM Layer
The forget gate is expressed as follows:
f_{t} = σ(W_{f} · [h_{t−1}, x_{t}] + b_{f}),
where W_{f} represents the weight matrix for the forget gate; [h_{t−1}, x_{t}] means the concatenation of h_{t−1} and x_{t}; b_{f} represents the bias of the forget gate; and σ represents the sigmoid function. The dimensionalities of the input layer, hidden layer, and cell state are d_{x}, d_{h}, and d_{c}, respectively. In general, d_{c} = d_{h}, and the dimensionality of the weight matrix W_{f} for the forget gate is d_{c} × (d_{h} + d_{x}). In fact, W_{f} is the concatenation of two matrices, W_{fh} (applied to h_{t−1}; dimensionality: d_{c} × d_{h}) and W_{fx} (applied to x_{t}; dimensionality: d_{c} × d_{x}), so the product can be written as follows:
W_{f} · [h_{t−1}, x_{t}] = [W_{fh}, W_{fx}] · [h_{t−1}, x_{t}] = W_{fh} h_{t−1} + W_{fx} x_{t}.
The input gate is expressed as follows:
i_{t} = σ(W_{i} · [h_{t−1}, x_{t}] + b_{i}),
where W_{i} represents the weight matrix for the input gate and b_{i} represents the bias of the input gate.
The candidate cell state describing the current input is calculated from the last output and the current input:
C̃_{t} = tanh(W_{c} · [h_{t−1}, x_{t}] + b_{c}),
where W_{c} and b_{c} are the corresponding weight matrix and bias.
The current cell state (C_{t}) is as follows:
C_{t} = f_{t} ○ C_{t−1} + i_{t} ○ C̃_{t},
where the last cell state (C_{t−1}) is multiplied element-wise by the forget gate (f_{t}) and the current input cell state (C̃_{t}) is multiplied element-wise by the input gate (i_{t}).
The new cell state (C_{t}) thus combines the current memory (C̃_{t}) and the long-term memory C_{t−1}. On the one hand, owing to the forget and input gates, the new cell state can store information from a long time ago or forget irrelevant content. On the other hand, the output gate controls the effect of the long-term memory on the current output:
o_{t} = σ(W_{o} · [h_{t−1}, x_{t}] + b_{o}),
where W_{o} and b_{o} are the weight matrix and bias of the output gate.
The final output of the LSTM is determined by the output gate and the cell state (equation (29)):
h_{t} = o_{t} ○ tanh(C_{t}).
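The gate computations of this section can be traced in a single forward step written in NumPy. This is a minimal sketch of the standard LSTM cell, assuming d_{c} = d_{h}; the parameter names (`W_f`, `W_i`, `W_c`, `W_o` and the matching biases), the random initialization, and the toy dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state h_t and cell state C_t."""
    concat = np.concatenate([h_prev, x_t])                       # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])        # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])        # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                           # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])        # output gate
    h_t = o_t * np.tanh(c_t)                                     # final output
    return h_t, c_t

d_x, d_h = 3, 4                                   # toy dimensions
rng = np.random.default_rng(0)
params = {}
for gate in ("f", "i", "c", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(d_h, d_h + d_x))
    params[f"b_{gate}"] = np.zeros(d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):             # a random sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Each weight matrix has shape d_{h} × (d_{h} + d_{x}), which matches the dimensionality stated for W_{f} when d_{c} = d_{h}.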
3.3. Realization of CNN-LSTM
The CNN-LSTM was implemented in Keras with the TensorFlow backend, based on Figure 5 and the theory described in the previous sections (shown in Algorithm 1). After normalization, the training data (train_x, train_y) were fed into the constructed CNN model (1^{st} ConV_model) to train its parameters with the loss function (loss_function, which is “mae” in our case) and the optimizer (optimizer, which is “adam” in our case). The feature map of the CNN was then extracted and reshaped to train the LSTM layer.
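A rough Keras sketch of such an architecture is given below. It is not the authors' Algorithm 1: it trains the whole network end to end instead of training the CNN first and then the LSTM, and the filter counts, kernel sizes, pooling sizes, layer widths, number of LSTM units, window length p, and feature count q are all assumptions made for illustration. Only the "mae" loss, the "adam" optimizer, the two convolution layers, the dropout layers, and the three fully connected layers follow the description in the text. The training data are random placeholders.

```python
import numpy as np
from tensorflow.keras import layers, models

def build_cnn_lstm(p=10, q=8):
    """Two conv blocks, two dropout + FC layers, reshape, LSTM, and a final FC layer."""
    model = models.Sequential([
        layers.Input(shape=(p, q, 1)),                       # [height=p, width=q, channels=1]
        layers.Conv2D(16, kernel_size=(1, 3), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(1, 2)),               # pool over features only
        layers.Conv2D(32, kernel_size=(1, 3), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(1, 2)),
        layers.Dropout(0.25),
        layers.Dense(32, activation="relu"),                 # FC1 (acts on the last axis)
        layers.Dropout(0.25),
        layers.Dense(16, activation="relu"),                 # FC2
        layers.Reshape((p, -1)),                             # [timesteps, features]
        layers.LSTM(40),
        layers.Dense(1),                                     # FC3: predicted NOx value
    ])
    model.compile(loss="mae", optimizer="adam")
    return model

model = build_cnn_lstm()
x = np.random.rand(64, 10, 8, 1).astype("float32")          # placeholder inputs
y = np.random.rand(64, 1).astype("float32")                  # placeholder labels
model.fit(x, y, epochs=1, batch_size=32, verbose=0)
print(tuple(model(x[:4]).shape))
```

Pooling only along the feature axis keeps all p time steps intact, so the reshape layer can hand the LSTM one feature vector per time step.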

4. Experiments
4.1. Datasets
Several key production factors that affect the nitrogen oxide concentration in the plant were selected from the 276 production factors of the catalytic cracking unit. Based on consultation with experts, the key production factors include the nitrogen content in the raw materials, the process control parameters of the reactor (FCC reaction temperature, catalyst/oil ratio, and residence time), the regeneration process control parameters (regeneration mode, dense bed temperature, oxygen content in the furnace, and carbon monoxide concentration), and the catalyst species (platinum CO combustion catalyst and nonplatinum CO combustion catalyst).
A total of 2.592 × 10^{5} samples collected over half a year were divided into training and validation sets in proportions of 70% and 30%, respectively. As shown in Table 1, the key production factors were used as input data and the NO_{x} emission was used as the label.
In order to eliminate the dimensional effects among different variables, the original data were scaled using the MinMaxScaler function in Python (equations (25) and (26)):
X_{std} = (X − X_{min}) / (X_{max} − X_{min}),
X_{scaled} = X_{std} × (max − min) + min,
where X_{max} and X_{min} are the maximum and minimum values of the data and max and min are the maximum and minimum values of the zoom range. In addition, the time series prediction problem was reframed as a supervised learning problem.
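The two preprocessing steps, min-max scaling and the reframing of the time series as a supervised learning problem, can be sketched with NumPy as follows. The sketch implements the scaling formula directly instead of calling the library function, and it assumes that each input window contains p consecutive time steps and that the label is the target value at the step immediately after the window. The window convention, the function names, and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Scale each column of X to the given range using its own minimum and maximum."""
    lo, hi = feature_range
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # avoid division by zero for constant columns
    X_std = (X - x_min) / span
    return X_std * (hi - lo) + lo

def to_supervised(features, target, p):
    """Build windows of p consecutive rows; the label is the target right after each window."""
    X = np.stack([features[i:i + p] for i in range(len(features) - p)])
    y = target[p:]
    return X, y

features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]])  # toy data
target = np.array([0.0, 1.0, 2.0, 3.0, 4.0])                                            # toy labels
scaled = min_max_scale(features)
X, y = to_supervised(scaled, target, p=2)
print(scaled[:, 0], X.shape, y)
```

Five rows with a window of p = 2 give three samples, each of shape (2, 2), paired with the labels of the third, fourth, and fifth rows.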
4.2. Hyperparameters
4.2.1. CNN
The hyperparameters in CNN mainly include the weight initialization, learning rate, activation function, number of epochs, number of iterations, etc. Several important hyperparameters, including the number of convolution layers, the number of convolution kernels, and the size of the convolution kernel, are discussed in this study.
4.2.2. LSTM
Long short-term memory (LSTM) is a kind of RNN in which tanh could be replaced by the sigmoid activation function, resulting in faster training speed. In LSTM, Adam was used as the optimizer, MSE was used as the loss function, and the identity activation function was used to complete the weight initialization. The hyperparameters in LSTM mainly include the number of hidden layer nodes and the batch size. The number of hidden layer nodes in LSTM has a direct influence on the learning results by affecting the ability of nonlinear mapping, as is the case in feedforward neural networks. The batch size influences the computation cost and the learning accuracy by affecting the amount of data used for each gradient update.
4.2.3. CNN-LSTM
As a neural network model that combines CNN with LSTM, CNN-LSTM has essentially the same hyperparameters as CNN and LSTM, mainly including the learning rate η, the regularization parameter λ, the number of neurons in each hidden layer (such as the fully connected layer and the LSTM layer), the batch size, the convolution kernel size, the neuron activation function, the pooling layer size, and the dropout rate. All the related hyperparameters were investigated and analysed in Section 5.
4.3. Performance Criteria
The performances of different algorithms were evaluated by the Root Mean Square Error (RMSE) (equation (27)) and the coefficient of determination (R^{2}) (equation (28)) [36]. The RMSE value reflects the deviation between predicted and observed values:
RMSE = sqrt((1/N) Σ_{n=1}^{N} (y_{n} − ŷ_{n})^{2}),
where N is the data length, y_{n} is the nth observed value, and ŷ_{n} is the nth predicted value.
The R^{2} value reflects the accuracy of the model; it ranges from 0 to 1, with 1 denoting a perfect match, and is computed from the predicted values, the average value, and the observed values.
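Both criteria are straightforward to compute. The sketch below assumes the common definition R^{2} = 1 − SS_res/SS_tot, which may differ from the exact form of equation (28); note that under this definition the value can fall below 0 for a model that performs worse than predicting the mean of the observations. The function names and the toy values are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination, assuming R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

observed = [1.0, 2.0, 3.0, 4.0]       # toy values, not plant data
predicted = [1.0, 2.0, 3.0, 5.0]
print(rmse(observed, predicted), r2_score(observed, predicted))
```

For these toy values the single error of 1 in four samples gives an RMSE of 0.5, and the residual and total sums of squares are 1 and 5, which gives an R^{2} of 0.8.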
5. Results and Discussion
5.1. CNN-LSTM
The hyperparameters mentioned above were determined by the trial-and-error method. RMSE and R^{2} were used as the objective functions to optimize the size and number of convolution kernels, the batch size, the number of convolution layers, and the dropout probability. The results shown in Figure 6 illustrate the process of optimizing the hyperparameters for the proposed method.
The network structure adopts two convolutional parts as the CNN layer; the kernel size is for the first and second CNN layer. Each convolution layer is followed by a Rectified Linear Unit (ReLU) layer (equation (29)) and a maximum pooling layer. The output of the CNN part is a 32-dimensional vector after these operations. All the vectors form a sequence that is fed into the LSTM layer. ReLU is defined as
a_{i,j,k} = max(z_{i,j,k}, 0),
where z_{i,j,k} is the input of the activation function at location (i, j) on the kth channel. ReLU allows neural networks to compute faster than sigmoid or tanh activation functions and to train deep networks more effectively. In order to train a neural network with the strict backpropagation algorithm, the contribution of all samples to the gradient must be considered simultaneously.
With the incorporation of the LSTM network, the proposed CNN-LSTM network can be trained with the time series data of the FCC unit. An LSTM layer followed by the FC layer is used to assign the predicted value to each frame in the sequence.
The output of the CNN layer passes through two dropout layers and two FC layers, which combine the features extracted by the CNN layer. During the training stage, the dropout layer randomly removes connections between the CNN layer and the FC layer in each iteration. In our experiments, the dropout rate was set to an empirical value of 0.25, which proved effective in improving performance (the experiment on the dropout rate is shown in Figure 6(e)). The convolution layer, pooling layer, and activation function layer map the raw data to the feature space of the hidden layer, and the fully connected layer plays the role of a “classifier” that maps the learned feature representation to the label space of the samples.
5.2. CNN
The hyperparameters of CNN were also determined by the trial-and-error method. RMSE and R^{2} were used as the objective functions to optimize the size and number of convolution kernels and the number of convolution layers. The results shown in Figure 7 illustrate the process of optimizing the hyperparameters for CNN, from which one can conclude that the optimal values for the number of convolution layers, the number of convolution kernels, and the size of the convolution kernel are 1, 16, and 1 × 5, respectively.
5.3. LSTM
The optimization process of the hyperparameters for LSTM is shown in Figure 8. RMSE and R^{2} were used as quantitative performance criteria to evaluate the hyperparameters (i.e., the number of hidden layer nodes and the batch size). The results shown in Figure 8 indicate that the optimal values for the number of hidden layer nodes and the batch size are 40 and 500, respectively.
5.4. Experiments on Different Methods
The accuracy of CNN, LSTM, and the proposed CNN-LSTM method in the training stage and the validation stage was evaluated by R^{2} and RMSE. All methods were well tuned, and ten test runs were conducted to reduce the random errors of each method. The average criteria for each method were calculated to evaluate the performance. The results are presented in Tables 2 and 3 and Figure 9. Compared with the traditional CNN, the proposed CNN-LSTM was more accurate in NO_{x} emission prediction, with an average RMSE of 23.7089 and an R^{2} of 0.8237. The combination integrates the advantages of CNN and LSTM, which are capable of extracting the features among different data sequences and the features between different time steps, respectively. This ability suits the characteristics of datasets from refining and chemical enterprises. By describing the local feature relationships under multidimensional and long-term conditions, CNN-LSTM matches the observations better than the other methods.
6. Conclusions
In this paper, a novel CNN-LSTM scheme combining CNN and LSTM was proposed for the prediction of the NO_{x} concentration observed during the FCC process. Dropout was introduced to accelerate network training and address the overfitting issue. In our study, a series of hyperparameters (learning rate, regularization parameter, number of neurons in each hidden layer, batch size, convolution kernel size, neuron activation function, pooling layer size, and dropout rate) and conditions (raw materials, process control parameters of the reactor, regeneration process control parameters, and catalyst species) were selected and optimized. Experiments were conducted to evaluate the proposed scheme against traditional methods (CNN and LSTM) as baseline models. The hyperparameters of all the methods were optimized to obtain the best results. RMSE and R^{2} were used to evaluate the performance of the different methods. Owing to its capability of extracting features among different data sequences and different time steps, CNN-LSTM achieved better efficiency and accuracy than the baseline models. This study points to a potential direction for deep learning methods: integrating different architectures to combine their individual advantages. The CNN-LSTM scheme proposed in this paper contributes to the accurate and stable prediction of irregular trends in NO_{x} emissions from the refining industry and provides more reliable information for NO_{x} risk assessment and management. Future work will focus on attention and transformer mechanisms to obtain better results and will explore the application of the proposed scheme to other datasets.
Data Availability
All data and program files included in this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by CNPC Basic Research Fund Projects “Research on Prediction and Early Warning System of Air Pollution Emission in FCC and New Control Model” (no. 2017D5008) and Science Foundation of China University of Petroleum, Beijing (nos. 2462018YJRC007 and 2462020YXZZ025).