Abstract

As gasoline is the main fuel of small vehicles, the exhaust emissions from its combustion affect air quality. The focus of gasoline cleaning is to reduce the sulfur and olefin content of gasoline while maintaining its RON as much as possible, because any reduction of RON brings great economic losses to enterprises. Constructing a RON loss model for the gasoline refining process is therefore very important for petrochemical enterprises, and building such a model to reduce RON loss during gasoline refining is the main question of this paper. Using Python and SPSS software, we applied two variable filtering methods, random forest importance filtering and PCA filtering, and combined them with SVR and random forest models to predict the RON and sulfur content of the product. The original data were filtered in Excel and Python in the following order: removal of values outside the maximum and minimum limits, removal by the 3σ criterion, deletion of variables with too many incomplete sites, and filling of empty values with the mean within two hours. Several RON prediction models were then established in Python, and the variables selected by the two filtering methods were compared: one model is the SVR model based on Gaussian, linear, polynomial, and Sigmoid kernel functions; the other is the random forest model. The sulfur content and RON prediction models were evaluated with functions such as MSE, $R^2$, and RMSE, with the sulfur content as a constraint condition. We then convert the problem into linear and nonlinear model variable optimization problems: the linear model uses the variables selected by the SVR linear kernel function model and random forest; the nonlinear model uses the combination of variables selected by the random forest model and random forest. Each sample is optimized individually: the optimization method finds the optimal value of each variable and takes these variable-wise optima as the local optimal solution of the sample. The two models are evaluated from the perspectives of degree of optimization, optimization rate, and model running speed.

1. Introduction

In China, more than 95% of the sulfur and olefins in finished gasoline come from catalytic cracking gasoline. Therefore, catalytic cracking gasoline must be refined to satisfy gasoline quality requirements. RON is the most important indicator of the combustion performance of gasoline and is used as the commercial grade of gasoline, such as 89#, 92#, and 95#. In modern catalytic cracking units, desulfurization and olefin reduction technology reduces the RON of gasoline, and this reduction brings great economic losses to the enterprise: each 1-unit reduction in RON corresponds to a loss of about 150 yuan/ton. Taking a 1-million-ton/year catalytic cracking gasoline refining unit as an example, if the RON loss can be reduced by 0.3 units, the economic benefit reaches 45 million yuan [1]. A model of RON loss in the gasoline refining process is therefore critical to petrochemical companies. At the same time, reducing RON loss not only brings huge economic benefits to petrochemical companies but also brings new opportunities and challenges to materials science, engineering geology, and other energy disciplines [28]. Because of the complexity of the refining process and the diversity of equipment, the operating variables (control variables) are highly nonlinear and strongly coupled with each other; traditional data association models involve relatively few variables, and mechanism modeling places high analytical demands on the raw materials, so the results cannot meet the needs of industry.

According to industrial demand, the smaller the RON reduction, the higher the economic benefit of the company. Given the average RON loss of existing petrochemical companies and the related references, if the RON loss can be controlled at 0.5-1 units, the economic benefit to enterprises will be very considerable.

Since the refining process of catalytic cracking gasoline is continuous, the operating variables are sampled every 3 minutes, whereas the measurement of RON (the dependent variable) is more laborious and is performed only twice a week. According to the actual situation, however, the measured RON value can be regarded as the comprehensive effect of the manipulated variables during the two hours before the measurement time, so in preprocessing the average value of each manipulated variable over those two hours is matched to the measured RON value. The model for reducing RON loss is built from sample data involving 7 raw material properties, 2 spent adsorbent properties, 2 regenerated adsorbent properties, 2 product properties, and 354 other operating variables (367 variables in total). Reducing the dimensionality first and then modeling is common practice in engineering applications; it makes it easier to ignore minor factors and to discover and analyze the main variables and factors that affect the model.
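To make the matching step concrete, the following is a minimal sketch of the two-hour averaging, assuming hypothetical file and column names (operating_variables.csv, ron_measurements.csv, a shared time column) rather than the paper's actual data layout:

```python
import pandas as pd

# Hypothetical inputs: 3-minute operating-variable samples and
# twice-weekly RON measurements, both carrying a timestamp column.
ops = pd.read_csv("operating_variables.csv", parse_dates=["time"]).set_index("time")
ron = pd.read_csv("ron_measurements.csv", parse_dates=["time"])

# For each RON measurement, average every operating variable over the
# two hours preceding the measurement time.
rows = []
for t in ron["time"]:
    window = ops.loc[t - pd.Timedelta(hours=2): t]
    rows.append(window.mean())

X = pd.DataFrame(rows, index=ron["time"])  # one averaged row per RON sample
```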

2. Data Processing

Although most of the collected raw variable data are normal, some data of certain devices have problems at some locations, some variables only contain data for part of the time, and the data of some variables are entirely or partly empty. The quality of the data directly affects the results of the research, so the original data must be processed first. The processing steps are as follows:

(1) Convert the 2-D index into a 1-D index, naming each property as xx property_xx, e.g., raw material properties_sulfur content, product properties_RON, and regenerated adsorbent_coke, wt%. Because some of the data lack Chinese names, English names are adopted uniformly.

(2) Construct a new sheet named "sample property," keeping the original format, and copy the raw material, product, spent adsorbent, and regenerated adsorbent data to the corresponding positions of the sample property table. Then split the operating variable table into sample 285 and sample 313; the header of the two new tables is the second row of the operating variable table, such as time|S-ZORB.CAL_H2.PV|S-ZORB.PDI_2102.PV|…. Python is then used for the subsequent data processing.

(3) Import the data of sample 285 and sample 313 and filter each column by its maximum and minimum limits, removing samples that are not in this range. Zeros denoting missing data are replaced with NA, columns whose values are all NA are deleted, and the remaining missing values are filled with the mean over the corresponding two-hour window. Since the data cover two-hour periods, the processed data are then combined with appendix 1 using the mean-fill method.

(4) The 3σ criterion is used to remove irregular values. Suppose the variable is measured with equal accuracy; compute the arithmetic mean $\bar{x}$ and the residual errors $v_i = x_i - \bar{x}$, and calculate the standard error $\sigma$ according to the Bessel formula. If the residual error of a measured value satisfies $|v_i| > 3\sigma$, the value is considered a bad value containing a gross error and is eliminated [9] (a sketch of this step follows the list). The Bessel formula is

$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} v_i^{2}}.$$
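As an illustration of step (4), the following sketch applies the 3σ criterion with the standard error computed by the Bessel formula; it assumes a one-dimensional array of equally accurate measurements:

```python
import numpy as np

def remove_gross_errors(x):
    """Drop values whose residual exceeds 3*sigma (3-sigma criterion)."""
    x = np.asarray(x, dtype=float)
    v = x - x.mean()                                # residual errors v_i
    sigma = np.sqrt((v ** 2).sum() / (len(x) - 1))  # Bessel formula
    return x[np.abs(v) <= 3 * sigma]
```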

3. Analysis

The model for reducing RON loss involves 7 raw material properties, 2 spent adsorbent properties, 2 regenerated adsorbent properties, 2 product properties, and another 354 operating variables (367 variables in total). Reducing the dimensionality first and then modeling helps to ignore secondary factors and to discover and analyze the main variables and factors that affect the model. The procedure can be divided into five parts: missing data processing, low variance filtering, correlation analysis, principal component analysis, and random forest feature selection.

3.1. Missing Data Processing

Data analysis shows that the sample values are missing at random; we therefore reduce the dimensionality, filter the operating variables, and handle the missing data on the processed data set. There are three ways to deal with missing values: deleting the data, imputing the data, and leaving them unprocessed. Imputation replaces the unknown value with a subjective estimate, which introduces some error; imputation methods include mean imputation, similar mean imputation, maximum likelihood estimation, and multiple imputation [10]. In this paper, columns with more than 50% missing values are deleted (8 columns are deleted here), and the remaining missing data are filled by mean imputation.
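A minimal pandas sketch of this rule, assuming the cleaned samples are already in a DataFrame df:

```python
import pandas as pd

def handle_missing(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    # Drop columns where more than 50% of the values are missing ...
    df = df.loc[:, df.isna().mean() <= max_missing]
    # ... and fill the remaining gaps with each column's mean.
    return df.fillna(df.mean(numeric_only=True))
```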

3.2. Low Variance Filtering Processing

Low variance filtering is similar in spirit to missing value deletion: it assumes that a column whose values change very little contains little information. All columns with small variance are therefore removed. Because variance depends on the data range, the data must be normalized first. In this paper, the data are normalized and then the columns with variance less than 0.1 are deleted; 34 columns are deleted here.
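A sketch of this filter, assuming min-max normalization (the paper does not state which normalization was used) and a numeric DataFrame df:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def low_variance_filter(df: pd.DataFrame, threshold: float = 0.1) -> pd.DataFrame:
    # Normalize first, since variance depends on the data range.
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
    # Keep only the columns whose normalized variance reaches the threshold.
    return df.loc[:, scaled.var() >= threshold]
```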

3.3. Correlation Analysis

Correlation analysis studies the direction and closeness of the relationship between variables. A correlation matrix is obtained by calculating the Pearson correlation coefficient between the 359 features; of each pair of variables with a correlation greater than 0.9, only one is retained. The number of variables filtered out is 153, of which 5 had already been removed by the previous filtering step, leaving 177 variables.
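A sketch of the pairwise filter, assuming a numeric DataFrame df; it scans the upper triangle of the correlation matrix and keeps the first variable of each highly correlated pair:

```python
import numpy as np
import pandas as pd

def correlation_filter(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    corr = df.corr().abs()  # absolute Pearson correlation matrix
    # Mask the lower triangle and diagonal so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=drop)
```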

3.4. Principal Component Analysis (PCA)

PCA is a method for reducing the dimensionality of high-dimensional data: it turns many variables into a few principal components while removing noise. Its purpose is to explain most of the variation in the original data with fewer features and to convert many highly correlated features into mutually independent or uncorrelated ones. The idea is to select several new features, fewer than the original features, that can explain the variation of most of the data; these are the so-called principal components [11].

3.4.1. The Principle of PCA

Suppose we have $n$ samples and $p$ features, which can be denoted by $x_{ij}$ $(i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, p)$; it is more efficient to write them in matrix form:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix},$$

where $x_{ij}$ is the $j$th feature value of the $i$th sample.

The basic implementation steps of PCA can be divided into data standardization, calculation of the covariance matrix, calculation of eigenvalues and eigenvectors, calculation of the principal component contribution rate and cumulative contribution rate, calculation of the principal component loadings, calculation of the principal component scores, and calculation of the operating variable weights. How many principal components are retained depends on the cumulative contribution rate of the retained part: to ensure that the main information is not lost, the cumulative contribution rate of the retained principal components should be greater than 85%.
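The paper carries out this analysis in SPSS; as a cross-check, the same steps can be sketched in Python with scikit-learn, assuming the processed feature matrix X (standardization, eigendecomposition, and retention by cumulative contribution rate):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)        # standardize the data
pca = PCA().fit(X_std)                           # eigendecomposition of covariance
cum = np.cumsum(pca.explained_variance_ratio_)   # cumulative contribution rate
k = int(np.searchsorted(cum, 0.85)) + 1          # smallest k covering >= 85%
scores = PCA(n_components=k).fit_transform(X_std)  # principal component scores
```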

3.4.2. Implementation of PCA

After missing data processing, the data are subjected to principal component analysis in SPSS, and the 359-dimensional features are described by 24 principal components. The contribution rate and cumulative contribution rate of each principal component are shown in Table 1.

From Table 1, we can see that the cumulative contribution rate of the first 24 principal components reaches 85.268%, and the eigenvalue of the 24th principal component is already small, so the first 24 components contain most of the information of the 359-dimensional features.

Figure 1 is the scree plot drawn from the contributions of the principal components to the eigenvalues in Table 1; the two are in one-to-one correspondence. The advantage of a scree plot is that the flattening of the curve shows where additional components stop explaining meaningful variation. In Figure 1, each feature is called a factor, and there are 359 features, i.e., 359 factors. The eigenvalue of the 24th principal component drops noticeably compared with the previous one, the eigenvalue itself is small, and the following eigenvalues change little, indicating that adding the factors corresponding to those eigenvalues would contribute very little information. Therefore, according to Figure 1, the first 24 principal components cover the information of the 359 features.

3.5. Random Forest Feature Selection

When the number of features in a data set exceeds 300 dimensions, it is necessary to select the features that have the greatest impact on the result for modeling. The feature selection method we choose is random forest.

Random forest evaluates the importance of a feature by measuring its contribution on each tree, averaging these contributions, and comparing them across features. The evaluation indicators of contribution include the Gini index (Gini) and the error rate on out-of-bag (OOB) data [12].

In general, the Gini value is used as the criterion for splitting nodes in a random forest. In a weighted random forest (WRF), the class weights serve two functions. The first is in selecting the split point by the decrease in the weighted Gini value, which can be written as

$$\Delta G = G(t) - \frac{n_{t_L}}{n_t} G(t_L) - \frac{n_{t_R}}{n_t} G(t_R),$$

where $t$ is the unseparated node; $t_L$ and $t_R$ are the left and right nodes after separation, respectively; $w_c$ is the class weight of the samples of class $c$; $n_c$ is the number of samples of each class inside the node (so the class proportions entering $G(\cdot)$ are weighted by $w_c n_c$); and $\Delta G$ is the reduction in impurity. The larger the value of $\Delta G$, the better the separation effect of the split point. The second function is that the class weights determine the class label of a terminal node:

$$\hat{y}(t) = \arg\max_{c}\; w_c\, n_c(t).$$

The Gini value is used as the evaluation index of the contribution rate in this paper; the importance score of a variable is denoted by VIM, and GI denotes the Gini value. Suppose there are $J$ features $X_1, X_2, \ldots, X_J$; the Gini index score $\mathrm{VIM}_j^{(\mathrm{Gini})}$ of each feature $X_j$ is the average change of split impurity caused by the $j$th feature over all decision trees of the random forest. The formula of the Gini index is

$$\mathrm{GI}_m = 1 - \sum_{k=1}^{K} p_{mk}^{2},$$

where $k = 1, 2, \ldots, K$ are the categories and $p_{mk}$ represents the proportion of category $k$ in node $m$. The importance of $X_j$ at node $m$, which is the change of the Gini index before and after the branch of node $m$, can be denoted by

$$\mathrm{VIM}_{jm}^{(\mathrm{Gini})} = \mathrm{GI}_m - \mathrm{GI}_l - \mathrm{GI}_r,$$

where $\mathrm{GI}_l$ and $\mathrm{GI}_r$ represent the Gini indices of the two new nodes after branching, respectively.

If the nodes at which feature $X_j$ appears in decision tree $i$ form the set $M$, then the importance of $X_j$ in the $i$th tree is

$$\mathrm{VIM}_{ij}^{(\mathrm{Gini})} = \sum_{m \in M} \mathrm{VIM}_{jm}^{(\mathrm{Gini})}.$$

Assuming there are $n$ trees in the random forest, then

$$\mathrm{VIM}_j^{(\mathrm{Gini})} = \sum_{i=1}^{n} \mathrm{VIM}_{ij}^{(\mathrm{Gini})}.$$

Normalize the importance score:

$$\mathrm{VIM}_j = \frac{\mathrm{VIM}_j^{(\mathrm{Gini})}}{\sum_{j'=1}^{J} \mathrm{VIM}_{j'}^{(\mathrm{Gini})}}.$$

In practice, the feature importances are returned by the random forest implementation in Sklearn.
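A sketch of obtaining the normalized importance scores, assuming a feature DataFrame X and target y (product RON); note that for regression trees scikit-learn uses variance reduction rather than the classification Gini index as the impurity measure:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X, y)

# Impurity-based importances, already normalized to sum to 1 (the VIM above).
vim = pd.Series(rf.feature_importances_, index=X.columns)
print(vim.sort_values(ascending=False).head(10))  # ten most important variables
```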

The operating variable weights and filtered variables obtained by PCA, and the operating variables and filtered variables obtained by RF, are shown in Tables 2-5. The conclusion is that the two methods select the operating variables differently, which leads to different final weights: the first ten variables selected by the two methods are the same, and they differ from the eleventh onward. According to industrial demand, however, the variables selected by the random forest have a more important and direct impact on RON loss. The reliability of this result can also be seen in the subsequent model calculations.

4. Establishing Model

The model for predicting RON loss is established on the basis of processing the original data and filtering and extracting the data by dimensionality reduction; the data are then analyzed with various theoretical models or mathematical methods to obtain the final prediction results. Since the RON loss is the RON of the raw material minus the RON of the product, it is more accurate to predict the RON of the product and then compute the loss. There are two sets of selected variables: the partly linear variables obtained by PCA and the partly nonlinear variables obtained by RF. By combining the two variable sets with different models, the best combination of variables and model can be obtained. The filtered features and the model selection therefore play a decisive role in the final RON loss prediction model.

4.1. Regression Model of Support Vector Machine (SVR)

The support vector machine (SVM) is mainly applied in pattern recognition, classification, and regression analysis. As shown in Figure 2, 2-D data points of red and blue can be separated by a straight line, which is called a linearly separable problem in pattern recognition; the black solid line is the dividing line, also known as the "decision surface." Each decision surface corresponds to a linear classifier [13]. The decision surface of the SVM can be expressed as

$$\omega^{T} x + b = 0.$$

When SVM is applied to regression, it is denoted by SVR. Given the training samples $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $y_i \in \mathbb{R}$, the general procedure is to find a decision function of the form

$$f(x) = \omega^{T} x + b,$$

where $\omega$ and $b$ are undetermined model parameters, which can be obtained by fitting the data $x_i$ and $y_i$. In order to solve for $\omega$ and $b$, the problem is transformed into the optimization problem

$$\min_{\omega, b}\; \frac{1}{2}\|\omega\|^{2} + C \sum_{i=1}^{n} \ell_{\epsilon}\bigl(f(x_i) - y_i\bigr), \quad (11)$$

where $C$ is the regularization constant and $\ell_{\epsilon}$ is the $\epsilon$-insensitive loss. Usually, equation (11) is not solved directly; instead, its dual problem is introduced:

$$\max_{\alpha, \hat{\alpha}}\; \sum_{i=1}^{n} \bigl[\, y_i (\hat{\alpha}_i - \alpha_i) - \epsilon (\hat{\alpha}_i + \alpha_i) \,\bigr] - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\hat{\alpha}_i - \alpha_i)(\hat{\alpha}_j - \alpha_j)\, x_i^{T} x_j,$$

subject to $\sum_{i=1}^{n} (\hat{\alpha}_i - \alpha_i) = 0$ and $0 \le \alpha_i, \hat{\alpha}_i \le C$. After obtaining $\alpha_i$ and $\hat{\alpha}_i$, if $0 < \alpha_i < C$, then $b = y_i + \epsilon - \sum_{j=1}^{n} (\hat{\alpha}_j - \alpha_j)\, x_j^{T} x_i$, so

$$f(x) = \sum_{i=1}^{n} (\hat{\alpha}_i - \alpha_i)\, x_i^{T} x + b.$$

In the nonlinear case, $f(x)$ can be expressed as

$$f(x) = \sum_{i=1}^{n} (\hat{\alpha}_i - \alpha_i)\, \kappa(x, x_i) + b,$$

where $\kappa(\cdot, \cdot)$ is a kernel function. The linear kernel function formula is as follows:

$$\kappa(x_i, x_j) = x_i^{T} x_j.$$

The polynomial kernel function is

$$\kappa(x_i, x_j) = \bigl(x_i^{T} x_j\bigr)^{d},$$

where $d \ge 1$ is the degree of the polynomial.

The Gauss kernel function is

$$\kappa(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right),$$

where $\sigma > 0$ is the bandwidth of the Gaussian kernel.

The Sigmoid kernel function is

$$\kappa(x_i, x_j) = \tanh\!\bigl(\beta\, x_i^{T} x_j + \theta\bigr),$$

where $\beta > 0$ and $\theta < 0$.
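A sketch comparing the four kernels with scikit-learn's SVR, assuming the selected feature matrix X and target y; the hyperparameter values are illustrative defaults, not the settings tuned in this paper:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

kernels = {
    "linear": SVR(kernel="linear", C=1.0),
    "poly": SVR(kernel="poly", degree=3, C=1.0),
    "rbf": SVR(kernel="rbf", gamma="scale", C=1.0),  # Gaussian kernel
    "sigmoid": SVR(kernel="sigmoid", C=1.0),
}
for name, model in kernels.items():
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.4f}")
```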

4.2. Random Forest Model

Random forest is a relatively new ensemble machine learning method, also called a nonlinear tree-based model [14, 15]. It is composed of decision trees and bagging: the principle of random forest is to build a forest of decision trees in a random way, yielding an ensemble classification model whose constituent trees are not correlated with each other. After the random forest model is constructed, a new sample is input into the model and judged by every decision tree, and the trees' outputs are aggregated.
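To illustrate this composition, the following sketch builds a forest-like regressor explicitly from bagged decision trees; in practice, scikit-learn's RandomForestRegressor implements the same idea (plus the per-split random feature selection) in one class:

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Bagging over decision trees; max_features="sqrt" adds the random
# feature subsampling that distinguishes a random forest from plain bagging.
forest_like = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_features="sqrt"),
    n_estimators=100,
    random_state=0,
).fit(X_train, y_train)

y_pred = forest_like.predict(X_test)  # the trees' predictions are averaged
```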

We choose several error measures to quantify the deviation between the predicted product RON and the true RON in the model; the selected measures are as follows [16, 17].

MSE (mean square error) measures the deviation between the predicted product RON and the true value: the closer MSE is to 0, the better the predictive ability of the model; conversely, the larger it is, the worse. MSE is calculated as

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}.$$

The interval of the $R^2$ score is $[0, 1]$; the closer $R^2$ is to 1, the better the predictive ability of the model. The formula of the $R^2$ score is as follows:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}.$$

MAE is the mean of the absolute errors, which better reflects the actual size of the prediction errors. The formula is as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|.$$

RMSE is the square root of MSE, which can be calculated by the following formula:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}.$$
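All four measures are available in scikit-learn; a minimal sketch, assuming arrays y_true and y_pred of true and predicted product RON:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
print(f"MSE={mse:.4f}, R2={r2:.4f}, MAE={mae:.4f}, RMSE={rmse:.4f}")
```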

5. Results and Conclusion

The problem of predicting the RON loss is transformed into the problem of predicting the product RON, where the RON loss equals the RON of the raw material minus the RON of the product. According to the indicators in Tables 2 and 4, the following four models were established: the prediction model of product RON based on PCA and SVR [18-21], the prediction model of product RON based on RF and SVR, the prediction model of product RON based on PCA and RF, and the prediction model of product RON based on RF [22, 23].

The indicators of each model are shown in Tables 6-9.

According to Tables 6-9, the evaluation indicators of the random forest models rank at the forefront. Compared on MSE and $R^2$, SVR+PCA and SVR+RF do not perform as well as the random forest models. Therefore, the evaluation indicators obtained by random forest measure the range of RON reduction more convincingly for industry and are more convenient in practice.

The variables selected by PCA and the variables selected by random forest perform similarly on the random forest model. The comparisons between their respective predictions and the original values are shown in Figures 3 and 4.

According to Figures 3 and 4, the random forest model with PCA-selected variables and the random forest model with RF-selected variables have similar fitting results, but the RF-selected variables are more in line with industrial needs and closer to the variables required in the traditional octane number prediction formula. Therefore, the random forest model built on the variables selected by random forest is used to predict the octane number (RON) loss, as shown in Figure 4.

Data Availability

The data used to support the results of this study are included within the manuscript.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was financially supported by the Shaanxi Provincial Natural Science Foundation Research Project (2017JQ4005) and the Shaanxi Province Natural Science Basic Research Plan (2020JM-543); we are grateful for the support of these projects.