#### Abstract

In the middle and late stages of heavy oil development, formulating a scientific and reasonable mining plan is the key to improving oilfield efficiency. At present, steam stimulation is still the main development method for heavy oil. Its production is limited not only by boiler conditions, surface pipelines, and wellbore conditions but also by the steam absorption capacity of the formation. Therefore, analyzing each component in isolation cannot optimize the steam stimulation process as a whole. The mechanism model is the most commonly used method to predict heavy oil production, but its many idealized assumptions make the predictions differ considerably from actual production. With the rapid development of machine learning, production can be predicted rapidly from field data. However, when the range of the actual parameters is small, the generalization ability of the model is weak and overfitting occurs. Against this background, this paper conducts a coupled study of surface steam pipeline flow, steam injection wellbore flow, and formation flow from a data-driven perspective. Firstly, based on the correlation coefficient and Random Forest feature selection, the features affecting liquid production and water content are ranked by importance. Secondly, by comparing five typical machine learning algorithms, we select the optimal prediction model and the optimal features for the samples in this paper. Finally, because the prediction model generalizes poorly, we sample the mechanism model to increase the diversity of steam dryness samples. After training on the new samples, both the accuracy and the generalization ability of the optimal prediction model improve.
This paper provides a new idea for the production prediction of heavy oil steam stimulation reservoirs, which is helpful for the efficient development of heavy oil reservoirs.

#### 1. Introduction

Heavy oil is an abundant resource, and developing it efficiently and economically is of great practical significance. However, due to the high viscosity and poor flowability of heavy oil, it is difficult to achieve ideal results with conventional technology. Therefore, steam stimulation is still the main development method for heavy oil. The local analysis theory of steam stimulation technology in surface pipelines, wellbores, and formations is relatively mature and has been applied to actual oilfield production [1, 2]. For a given heavy oil block, the mining effect of steam stimulation depends on the injection and production parameters and the degree of thermal energy utilization of the injected steam. However, when the steam injection parameters are designed only through such local analysis, the steam stimulation process as a whole cannot be optimized.

The dynamic prediction of steam stimulation wells is the basis of injection parameter design and production design optimization. To improve the mining effect of steam stimulation, researchers have extensively studied index prediction for steam stimulation wells. Marx and Langenheim used the energy balance to calculate the heated area of the oil layer [3]. Boberg proposed a steam stimulation production prediction model that reflects the mechanism of heating, viscosity reduction, and oil increase in the process of steam stimulation, but it has many limitations [4]. Hou and Chen proposed an improved steam stimulation productivity prediction model based on previous studies and introduced a shape coefficient to correct for the influence of the overlap phenomenon in the steam injection process [5]. Zheng et al. established a new analytical model for steam stimulation productivity prediction based on the Marx–Langenheim model [6]. The model shows an exponential change in the temperature field in the hot oil area, which is more in line with the actual reservoir. Below a certain temperature, heavy oil behaves as a non-Newtonian fluid. Yang et al. proposed a steam stimulation productivity prediction model that accounts for the non-Newtonian behavior of heavy oil [7].

From the perspective of percolation mechanics and cybernetics, the reservoir system is a distributed parameter system. The basic physical quantities describing the reservoir state are the water saturation field and the reservoir pressure field. Different parameters represent different underground conditions. The mechanism model reflects our induction and summary of real phenomena and is reliable prior knowledge of the flow law of underground fluid. Although the mechanism model has been steadily refined, it considers far fewer parameters than the reservoir numerical simulation method. In 1953, Bruce et al. simulated one-dimensional unsteady radial and linear gas-phase flow [8]. Although limited by the computers and solution algorithms of the time, it was a milestone in the history of reservoir numerical simulation. With the breakthrough in the numerical solution of linear equations, Stone introduced the first numerical solver, SIP, in 1968 [9]. In 1974, Coats et al. developed a three-dimensional three-phase steam injection thermal oil recovery model [10]. On this basis, several reservoir numerical simulation packages such as the CMG series and the Eclipse series have been developed.

So far, reservoir numerical simulation software has made great progress in the integration of functions. It can now handle almost all types of oil and gas reservoirs and mining methods [11–14]. We sample different underground conditions by reservoir numerical simulation and then describe the reservoir mining state by partial differential equations, but the accuracy of this description depends on an accurate geological model. Therefore, some idealized assumptions are needed. Since the production law is affected by many main control factors that cannot be quantified, the predicted results may differ greatly from the actual production data.

In recent years, artificial intelligence methods have been widely used in petroleum engineering [15–19], mainly for production control and optimization, information prediction, and model simulation [20–24]. However, in practice, the stratigraphic conditions and production systems of steam stimulation wells in the same block differ little, so a model trained on actual oilfield data generalizes weakly and overfits. Therefore, it is difficult to reveal the relationship between some key variables and output indicators from data analysis alone. This is because the basis of the approximating function space is uncertain and directionless when the simulation is carried out directly by the black-box method. The parameters can only be fitted blindly, and the stability of the fit cannot be guaranteed.

The innovations of this paper are as follows. (1) Building on previous studies, we conduct a coupled study of surface steam pipeline flow, steam injection wellbore flow, and formation flow using a data-driven approach. (2) Based on the correlation coefficient and Random Forest feature selection, this paper ranks the features that affect liquid production and water content by importance. (3) For a heavy oil field in eastern China, we used five typical machine learning algorithms to model and compare its field data. We find that six features, namely, produced degree, dynamic liquid surface, soaking time, stroke, stroke times, and well pattern mode, have little effect on liquid production and water content, so they are eliminated. At the same time, the Random Forest prediction models of liquid production and water content have the highest accuracy, 86% and 83%, respectively, but their generalization ability is poor. (4) We sampled the mechanism model to increase the diversity of steam dryness samples and retrained the model on the new samples. The accuracy of the previously obtained optimal prediction model improved, making the prediction results more accurate and reliable, and its generalization ability also improved.

The content of this paper is arranged as follows. Section 2 introduces the data source and data preprocessing. Section 3 establishes and verifies the data-driven input and output model of the reservoir system. Section 4 establishes and verifies the hybrid data-driven input and output model of the reservoir system. Section 5 concludes the paper.

#### 2. Data Source and Preprocessing

##### 2.1. Data Source

The data used in this paper are collected from the dynamic and static information, steam injection data, and production data of 109 heavy oil blocks in a heavy oil field in eastern China. Among them, the static information includes oil area, produced reserves, porosity, permeability, and other information data. Dynamic indicators include cumulative oil and cumulative water production. Steam injection data include steam quantity at the boiler outlet, steam pressure at the boiler outlet, and so on. Production data include liquid production and water content.

##### 2.2. Data Preprocessing

Data preprocessing is an important part of data-driven index prediction and greatly affects prediction accuracy. The actual production data contain many missing or abnormal values and cannot be used directly for training. Therefore, data cleaning and other operations must be carried out first to obtain higher prediction accuracy.

###### 2.2.1. Outlier Processing

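We remove outliers according to the PauTa criterion ($3\sigma$ criterion). Assuming that the measured variable is measured with equal accuracy, the values $x_{1}, x_{2}, \ldots, x_{n}$ are obtained. If the residual error $v_{b}$ of a measurement value $x_{b}$ satisfies $\left|v_{b}\right| > 3\sigma$, then $x_{b}$ is considered to be a bad value containing a gross error, and it is deleted. The formula for the standard error is as follows:

$$\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} v_{i}^{2}},$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}$ is the arithmetic mean, and the residual error is $v_{i} = x_{i} - \bar{x}$.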
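The criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `pauta_filter` is hypothetical, the Bessel-corrected standard error is assumed, and a single pass is applied (in practice the test is often repeated until no value is rejected).

```python
import math


def pauta_filter(values):
    """Split `values` into (kept, removed) with one pass of the 3-sigma rule."""
    n = len(values)
    if n < 2:
        # Not enough data to estimate a standard error; keep everything.
        return list(values), []
    mean = sum(values) / n
    residuals = [v - mean for v in values]
    # Bessel-corrected standard error (assumed form of the criterion).
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 1))
    kept, removed = [], []
    for v, r in zip(values, residuals):
        if abs(r) > 3 * sigma:
            removed.append(v)  # gross error: residual exceeds 3 sigma
        else:
            kept.append(v)
    return kept, removed


if __name__ == "__main__":
    # Made-up data: 19 ordinary readings and one obviously bad value.
    readings = [10.0] * 19 + [100.0]
    kept, removed = pauta_filter(readings)
    print(len(kept), removed)
```

Note that the rule can only reject a value when the sample is reasonably large, because a single extreme value also inflates the standard error it is compared against.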

###### 2.2.2. Missing Value Filling

For the collected samples, if a sample has too much missing data or is missing the two key outputs, liquid production and water content, the sample is deleted. Missing values of other parameters, such as steam temperature and steam pressure, are filled with the *K*-nearest neighbor algorithm [25]. We compare the features of the new data with the corresponding features in the original dataset and calculate the distance between the new data and each sample in the original dataset. Then, the value of the new data is determined by voting among the *K* samples with the smallest distance. The sample distance calculation formula is as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n}\left(x_{i} - y_{i}\right)^{2}},$$

where $d(x, y)$ is the relative distance between two feature samples $x$ and $y$, and $x_{i}$ and $y_{i}$ are the corresponding point data of the two feature samples, respectively.
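The filling step can be illustrated with a small, self-contained sketch. It assumes a Euclidean distance, assumes that only one column is missing in the rows to be filled, and takes the mean of the *K* nearest complete rows as the filled value; the function name `knn_impute` and the toy data are hypothetical.

```python
import math


def knn_impute(rows, target_col, k=3):
    """Fill None entries in column `target_col` with the mean over the
    k nearest complete rows (assumes at least one complete row exists)."""
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if r[target_col] is not None:
            filled.append(list(r))
            continue

        def dist(other, row=r):
            # Euclidean distance over every column except the missing one.
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(row, other))
                                 if i != target_col))

        nearest = sorted(complete, key=dist)[:k]
        new_row = list(r)
        new_row[target_col] = sum(n[target_col] for n in nearest) / len(nearest)
        filled.append(new_row)
    return filled


if __name__ == "__main__":
    # Made-up (feature, steam pressure) pairs; the last pressure is missing.
    data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.1, None]]
    print(knn_impute(data, target_col=1, k=2)[3])
```

Because the distance mixes all features, columns with very different units would normally be scaled to a common range before this step; that scaling is omitted here.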

After outlier processing and missing value filling, we finally sorted out 97 heavy oil blocks from 109 heavy oil blocks, a total of 780 groups of samples.

###### 2.2.3. Feature Selection

Feature selection is also called feature subset selection or attribute selection. It is a data preprocessing operation that selects from the original features to reduce the data dimension and improve the generalization ability of the model. In practical applications, although more parameters carry more information, too many parameters reduce learning efficiency and can even degrade prediction accuracy.

Since many factors affect the development index of heavy oil steam stimulation, a systematic index analysis is needed to identify the relevant factors accurately. Based on the basic theory of reservoir engineering and combined with related research [2, 13, 26–28], we obtained the factors affecting the production of heavy oil steam stimulation, which can be divided into the following five categories:

(1) Reservoir characteristic: reservoir type, surface crude oil viscosity, initial formation temperature, reservoir buried depth, edge-bottom water, oil area, dynamic reserve, primitive oil-bearing saturability, reservoir effective thickness, porosity, net total ratio of oil layers, permeability, original formation pressure, and dynamic liquid surface, denoted in turn by *x*_{1}∼*x*_{14}

(2) Productive regulation: soaking time, well distance, well spacing density, well pattern mode, startup well number, stroke, stroke times, production time, and annual turnover, denoted in turn by *x*_{15}∼*x*_{23}

(3) Characteristics of historical production: cumulative oil production, cumulative water production, and produced degree, denoted in turn by *x*_{24}∼*x*_{26}

(4) Control variable: steam quantity at the boiler outlet, steam flow rate at the boiler outlet, steam pressure at the bottom of the steam injection well, and steam dryness at the bottom of the steam injection well

(5) Output variable: liquid production and water content, represented by *y*_{1} and *y*_{2}, respectively

In the data-driven process, interactions between features may have a negative impact on the final result, so the features must be chosen carefully. The four control variables directly affect the final mining effect and are therefore always used as model inputs; feature selection is applied only to the remaining 26 variables.

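The correlation coefficient is a statistical index that is usually used to determine the direction and degree of linear correlation between variables. The formula is as follows:

$$r = \frac{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)\left(y_{i} - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i} - \bar{x}\right)^{2}}\sqrt{\sum_{i=1}^{n}\left(y_{i} - \bar{y}\right)^{2}}},$$

where $x_{i}$ and $y_{i}$ are the $i$-th observations of the two variables, $\bar{x}$ and $\bar{y}$ are their arithmetic means, and $n$ is the number of samples.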

We get the correlation coefficient between 26 independent variables and 2 dependent variables, as shown in Table 1.
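For illustration, the linear (Pearson) correlation coefficient between one feature and one output can be computed as in the sketch below. The function name `pearson_r` is hypothetical, and returning 0 for a constant column is an assumption made here to avoid a division by zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        # Assumption: a constant column is treated as uncorrelated.
        return 0.0
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Made-up values for one feature and one output.
    feature = [1.0, 2.0, 3.0, 4.0]
    output = [2.0, 4.0, 6.0, 8.0]
    print(pearson_r(feature, output))
```

The sign gives the direction of the linear relationship and the magnitude its strength, so a ranking would typically use the absolute value.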

Feature screening based on Random Forest evaluates how much each feature contributes on each tree in the Random Forest [29, 30], averages these contributions, and compares the results across features. The Gini index is usually used as the evaluation index; its calculation formula is as follows:

$$GI_{m} = \sum_{k=1}^{K} p_{mk}\left(1 - p_{mk}\right) = 1 - \sum_{k=1}^{K} p_{mk}^{2},$$

where *K* represents the number of categories and *p*_{mk} represents the proportion of category *k* in node *m*.

Then, the importance of feature *x*_{j} at node *m* is as follows:

$$VIM_{jm} = GI_{m} - GI_{l} - GI_{r},$$

where *GI*_{l} and *GI*_{r}, respectively, represent the Gini index of the two new nodes after branching.

If the set of nodes at which feature *x*_{j} appears in decision tree *i* is *M*, then the importance of feature *x*_{j} in tree *i* is as follows:

$$VIM_{ij} = \sum_{m \in M} VIM_{jm}.$$

Assuming that there are *J* trees in the Random Forest, the importance of feature *x*_{j} throughout the Random Forest is as follows:

$$VIM_{j} = \sum_{i=1}^{J} VIM_{ij}.$$

We get the importance of 26 characteristics that affect liquid production and water content, as shown in Table 2.
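The Gini-based importance described above can be sketched as follows. This is a simplified illustration under stated assumptions: the node importance is the unweighted difference between the parent Gini index and the two child Gini indices, each tree is represented by a hypothetical list of splits rather than a fitted model, and the final normalisation to a sum of 1 is added here for readability. Library implementations commonly weight each node by the number of samples it holds, which this sketch omits.

```python
def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def node_importance(parent, left, right):
    """Unweighted Gini decrease of one split (parent minus both children)."""
    return gini(parent) - gini(left) - gini(right)


def forest_importance(trees):
    """Sum node importances per feature over all trees, then normalise.

    `trees` is a list of trees; each tree is a list of splits given as
    (feature_name, parent_labels, left_labels, right_labels).
    """
    totals = {}
    for splits in trees:
        for feature, parent, left, right in splits:
            gain = node_importance(parent, left, right)
            totals[feature] = totals.get(feature, 0.0) + gain
    norm = sum(totals.values())
    return {f: v / norm for f, v in totals.items()} if norm else totals


if __name__ == "__main__":
    # Two made-up trees with hand-written splits.
    toy_trees = [
        [("x1", [0, 0, 1, 1], [0, 0], [1, 1])],
        [("x2", [0, 1], [0], [1]), ("x1", [0, 0, 1, 1], [0, 0], [1, 1])],
    ]
    print(forest_importance(toy_trees))
```

For a continuous output such as liquid production, a variance-based impurity usually replaces the class-based Gini index; the bookkeeping over nodes and trees stays the same.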

We obtained the correlation coefficient and the importance ranking based on Random Forest feature selection and then added them to make a comprehensive comparison and to obtain the importance ranking of variables affecting liquid production and water content. The results are shown in Table 3.
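One possible reading of this combination step is sketched below: the absolute correlation coefficient and the Random Forest importance of each feature are added, and the features are sorted by the sum. Whether the scores or the rank positions are added is not specified in the text, so the additive score, the function name `combined_ranking`, and the sample numbers are all assumptions for illustration.

```python
def combined_ranking(corr, importance):
    """Rank feature names by |correlation| + importance, highest first."""
    scores = {name: abs(corr[name]) + importance[name] for name in corr}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    # Made-up values for three features.
    corr = {"x1": -0.8, "x2": 0.1, "x3": 0.4}
    importance = {"x1": 0.3, "x2": 0.2, "x3": 0.5}
    print(combined_ranking(corr, importance))
```

The absolute value is used so that a strong negative correlation counts as much as a strong positive one.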

#### 3. Establishment and Verification of the Input and Output Model of the Reservoir System Based on Data-Driven

The steam stimulation oil recovery process is composed of a steam injection system, a reservoir system, and a lifting system, which carry out steam injection, soaking, and production, as shown in Figure 1. The reservoir system is the hub of the entire oil production system and directly affects the energy consumption and efficiency of the steam injection and lifting systems. At the same time, due to the complexity of heavy oil formation conditions, it is very difficult to study the reservoir system from the perspective of the mechanism. Therefore, this paper explores the flow law of steam in the formation through a data-driven approach to further improve the mining effect of steam stimulation. We convert the steam injection data from the boiler outlet to the bottom of the well through a simplified mechanism model, as shown in Figure 2. This paper assumes that only the steam dryness and steam pressure change during the steam flow process, while steam quantity and steam flow rate remain unchanged.

##### 3.1. Calculation of Steam Pressure and Steam Dryness at the Bottom of the Steam Injection Well

This paper uses the steam injection wellhead and the bottom hole as nodes to couple surface steam pipeline flow, steam injection wellbore flow, and formation flow. To explore the complex formation flow law, we first convert the field data from the boiler outlet to the bottom of the well through a simplified mechanism, as shown in Figure 1. We then explore the formation flow law through a data-driven approach to predict heavy oil steam stimulation production. This paper assumes that only the steam dryness and steam pressure change during the steam flow process, and the other injection and production parameters remain unchanged.

###### 3.1.1. Steam Dryness Change of the Steam Pipeline

We make the following assumptions [2]:

(1) The pressure loss when steam flows in the pipeline is not considered

(2) The steam temperature and atmospheric temperature are fixed

(3) There is an insulating layer outside the steam pipeline

Since the steam reaching the wellhead is still saturated and we ignore the change of pressure, its temperature is constant. We also neglect changes of kinetic and potential energy and consider only the change of steam internal energy. Then, the wellhead dryness can be calculated by the energy balance principle. We have

The dryness loss of the steam pipeline is as follows:
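As a minimal sketch of the energy balance described in this subsection, the code below assumes that, at constant pressure and temperature, all pipeline heat loss is supplied by condensing part of the steam, so the dryness drops by the heat lost per unit mass divided by the latent heat of vaporization. The function name, the parameter names, and the numbers are hypothetical and are not taken from the paper's notation; the heat loss per unit length is treated as a given input rather than computed from the insulation properties.

```python
def wellhead_dryness(boiler_dryness, heat_loss_w_per_m, pipe_length_m,
                     mass_rate_kg_s, latent_heat_j_kg):
    """Steam dryness at the wellhead from a constant-pressure energy balance.

    Assumes the total heat loss (W) equals mass rate (kg/s) times latent
    heat (J/kg) times the dryness drop.
    """
    if mass_rate_kg_s <= 0 or latent_heat_j_kg <= 0:
        raise ValueError("mass rate and latent heat must be positive")
    total_loss_w = heat_loss_w_per_m * pipe_length_m
    dryness_loss = total_loss_w / (mass_rate_kg_s * latent_heat_j_kg)
    # Dryness cannot fall below zero (fully condensed steam).
    return max(0.0, boiler_dryness - dryness_loss)


if __name__ == "__main__":
    # Illustrative numbers only: 200 W/m lost over a 1000 m pipeline,
    # 2 kg/s of steam, latent heat 1.6e6 J/kg.
    print(wellhead_dryness(0.8, 200.0, 1000.0, 2.0, 1.6e6))
```

With these illustrative inputs the dryness loss is the ratio of 2.0e5 W to 3.2e6 W, which shows how a longer pipeline or a lower injection rate increases the loss.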

###### 3.1.2. Steam Pressure Change in the Steam Injection Wellbore

We make the following assumptions [2]:

(1) The steam injection rate, steam pressure, and steam quality at the wellhead remain unchanged

(2) The heat transfer from the oil well to the cement ring is one-dimensional and steady, the heat transfer from the cement ring to the formation is one-dimensional and unsteady, and heat transfer along the well depth direction is ignored

(3) Pressure changes in the wellbore are considered

(4) The thermal conductivity of the formation is constant

This paper only considers the case of vertical injection wells. Since saturated steam is injected into the well, it becomes a two-phase flow of water and vapor. Therefore, according to the pressure balance equation, the pressure drop formula is expressed as follows:

We obtain the steam pressure change of the steam injection wellbore as follows:

For brevity, the derivation is given in Appendix A.

###### 3.1.3. Steam Dryness Change in the Steam Injection Wellbore

In unit time, the heat loss on the length of the wellbore is . According to the assumptions in Section 3.1.2, we have

The heat loss of the wellbore will inevitably lead to a decrease in saturated steam energy, which will result in a decrease in steam dryness. We have

Furthermore,

We make the following transformation:

Therefore, the solution of equation (14) is as follows:

We obtain the following dryness loss of the steam injection wellbore:

For brevity, the derivation is given in Appendix B. See Appendix C for the parameter description.

##### 3.2. Introduction and Evaluation of the Data-Driven Model

According to the importance ranking of the features affecting liquid production and water content in Section 2.2, this section uses five typical machine learning algorithms, namely, *N*-Neighbours, Linear Regression, Random Forest, AdaBoost, and Support Vector Regression, to predict the liquid production and water content of heavy oil steam stimulation and to select the optimal prediction model and the optimal number of features for the samples in this paper. To evaluate the prediction effect of the model, we use the *R*^{2} (determination coefficient) of the model on the liquid production and water content as the measurement standard. The larger the *R*^{2}, the better the model accuracy. The formula for *R*^{2} is as follows:

$$R^{2} = 1 - \frac{\sum_{j}\left(x_{j,c} - x_{j,p}\right)^{2}}{\sum_{j}\left(x_{j,c} - x_{j,a}\right)^{2}},$$

where *x*_{j,c} is the actual observation value, *x*_{j,p} is the predicted value, and *x*_{j,a} is the average value of the actual observation value.

For the 780 groups of samples sorted out in Section 2.2, we used the above five typical machine learning algorithms to conduct ten-fold cross validation on liquid production and water content, and the average value of *R*^{2} of cross-validation results was used as the estimation of algorithm accuracy. The effects of the feature number on the determination coefficients of liquid production and water content are shown in Figures 3 and 4.
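The evaluation procedure can be sketched without any library as follows. The code computes the determination coefficient and averages it over ten folds; the fold layout (contiguous, unshuffled), the helper names, and the one-feature least-squares line used as a stand-in model are all assumptions for illustration and are not the five algorithms compared in the paper.

```python
def r2_score(actual, predicted):
    """Determination coefficient; assumes `actual` is not constant."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_r2(xs, ys, fit, k=10):
    """Mean R^2 over k folds; `fit(train_x, train_y)` returns a predictor."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        preds = [predict(xs[i]) for i in test_idx]
        scores.append(r2_score([ys[i] for i in test_idx], preds))
    return sum(scores) / len(scores)


def fit_line(train_x, train_y):
    """Stand-in model: least-squares line through one feature."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    sxx = sum((x - mx) ** 2 for x in train_x)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sxx
    return lambda x: my + slope * (x - mx)


if __name__ == "__main__":
    # Made-up, perfectly linear data.
    xs = [float(i) for i in range(20)]
    ys = [2.0 * x + 1.0 for x in xs]
    print(cross_val_r2(xs, ys, fit_line, k=10))
```

In practice the samples would usually be shuffled before splitting so that each fold mixes blocks and production periods.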

It can be seen that when the number of features is 24, the prediction accuracy of liquid production and water content based on the Random Forest algorithm is the highest, at 86% and 83%, respectively. The determination coefficients of the five algorithms for liquid production and water content at this feature count are shown in Table 4.
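The search for the optimal number of features can be sketched as a loop over prefixes of the ranked feature list. The function name `best_feature_count` and the `evaluate` callback are hypothetical; in the setting of this section `evaluate` would return the mean cross-validated *R*^{2} of a model trained on the given features.

```python
def best_feature_count(ranked_features, evaluate, min_count=1):
    """Try the top-n features for each n and return (best_n, best_score).

    `ranked_features` is ordered from most to least important and
    `evaluate(features)` returns a score where larger is better.
    """
    best_n, best_score = None, float("-inf")
    for n in range(min_count, len(ranked_features) + 1):
        score = evaluate(ranked_features[:n])
        if score > best_score:  # ties keep the smaller feature set
            best_n, best_score = n, score
    return best_n, best_score


if __name__ == "__main__":
    # Toy scoring function that peaks at three features.
    ranked = ["x3", "x1", "x7", "x2", "x5"]
    print(best_feature_count(ranked, lambda feats: -abs(len(feats) - 3)))
```

Keeping the smaller set on ties favours the simpler model, which is one reasonable choice rather than the only one.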

##### 3.3. Model Validation

In order to further verify the accuracy of the model, we randomly selected two blocks (A and B) from 97 heavy oil blocks and used the established model to simulate the influence of steam quantity, bottom-hole steam pressure, and bottom-hole steam dryness on oil production and liquid production by the control variable method. The results are shown in Figures 5–7.

According to Figure 5, we can see that the oil production and liquid production increase with the increase of steam quantity, but the rising range gradually decreases, which is consistent with the actual change law.

According to Figure 6, we can see that the oil production and liquid production first increase with the increase of bottom-hole steam pressure and then gradually decrease after a “peak” appears.

According to Figure 7, it can be seen that, with the increase of bottom-hole steam dryness, the oil production and liquid production are gradually reduced, which is inconsistent with the actual changes. The reason for the poor consistency is that the actual data indicators fluctuate slightly, which leads to insufficient sample diversity and weak generalization ability after training.
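The control variable method used in this validation can be sketched as a one-at-a-time sweep: one input is varied over a range while all other inputs are held at their baseline values, and the trend of the prediction is inspected. The function names, the input names, and the toy predictor below are hypothetical stand-ins for the trained model.

```python
def sweep_one_variable(predict, baseline, name, values):
    """Vary input `name` over `values`, holding the other inputs fixed."""
    results = []
    for v in values:
        sample = dict(baseline)  # copy so the baseline is not modified
        sample[name] = v
        results.append((v, predict(sample)))
    return results


def is_non_decreasing(results):
    """True if the predicted output never drops as the input increases."""
    outputs = [y for _, y in results]
    return all(b >= a for a, b in zip(outputs, outputs[1:]))


def toy_predict(sample):
    """Made-up predictor used only to exercise the sweep."""
    return sample["steam_quantity"] ** 0.5 + 10.0 * sample["dryness"]


if __name__ == "__main__":
    baseline = {"steam_quantity": 1600.0, "dryness": 0.6}
    curve = sweep_one_variable(toy_predict, baseline, "dryness",
                               [0.2, 0.4, 0.6, 0.8])
    print(curve, is_non_decreasing(curve))
```

A check such as `is_non_decreasing` makes the comparison with the expected physical trend explicit, for example flagging a model whose predicted production falls as dryness rises.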

#### 4. Establishment and Verification of the Input and Output Model of the Oil Reservoir System Based on Hybrid Data-Driven

The essence of training the model through field data is function fitting, and the fitting function has no clear direction, as shown in Figure 8(a). If the variation range of parameters is small, the generalization ability of the model is weak and there may be overfitting. When predicting simply based on the mechanism model, it is essentially an abstract description of physical laws, as shown in Figure 8(b). Although the generalization ability of the model is strong, because the theoretical basis is the ideal model, the results are not necessarily consistent with the actual situation. Therefore, this paper samples the mechanism model and combines it with the field data to train the model. In this way, it can implicitly and automatically realize the parameter adjustment and fitting work that originally required a large amount of manual operation during the machine learning training process and improve the fitting accuracy. It can also artificially adjust the parameters of mechanism simulation to increase the data diversity and improve the generalization ability of the training model, which is conducive to the reliability of the established prediction model, as shown in Figure 8(c).

##### 4.1. Introduction and Evaluation of the Hybrid Data-Driven Model

In Section 3.3, the effect of steam dryness on liquid production and water content is inconsistent with the actual change. Therefore, in this section, we sample the mechanism model by reservoir numerical simulation to increase the diversity of steam dryness samples and add them to the field data samples. To verify whether the model accuracy is improved after increasing the number of samples, we select the number of features as 24 and then use the above five typical machine learning algorithms to re-predict the liquid production and water content. The determination coefficients of the five algorithms for liquid production and water content are shown in Table 5.
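The augmentation step can be sketched as follows: a mechanism model is evaluated over a wider range of steam dryness than the field data cover, and the resulting samples are appended to the field samples before retraining. Everything named here is an assumption for illustration: `toy_mechanism_model` is a made-up function standing in for a reservoir numerical simulation run, and the dictionary keys are hypothetical.

```python
def toy_mechanism_model(inputs):
    """Hypothetical stand-in for a simulator run (not a physical model)."""
    x = inputs["dryness"]
    return 50.0 + 40.0 * x - 15.0 * x ** 2


def augment_with_mechanism(field_samples, mechanism_model, baseline,
                           dryness_values):
    """Return field samples plus mechanism-model samples over a dryness range.

    Each returned sample carries a "source" tag so the two origins can be
    told apart later (for weighting or diagnostics).
    """
    synthetic = []
    for x in dryness_values:
        inputs = dict(baseline)
        inputs["dryness"] = x
        sample = dict(inputs)
        sample["liquid_production"] = mechanism_model(inputs)
        sample["source"] = "mechanism"
        synthetic.append(sample)
    tagged_field = [dict(s, source="field") for s in field_samples]
    return tagged_field + synthetic


if __name__ == "__main__":
    # One made-up field sample with dryness near 0.7.
    field = [{"dryness": 0.70, "steam_quantity": 2000.0,
              "liquid_production": 75.0}]
    combined = augment_with_mechanism(field, toy_mechanism_model,
                                      {"steam_quantity": 2000.0},
                                      [0.2, 0.4, 0.6])
    print(len(combined), [s["source"] for s in combined])
```

Tagging the origin of each sample is a design choice that keeps the option of weighting simulated and measured data differently during training.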

According to Table 5, the prediction accuracy of liquid production and water content based on the Random Forest algorithm is the highest, at 88% and 85%, respectively. Comparing Tables 4 and 5, we find that, after sampling the mechanism model and combining it with the field data, only the fitting effect of the water content prediction model based on AdaBoost and the liquid production prediction model based on Support Vector Regression did not change, while the fitting effect of the other models improved.

##### 4.2. Model Validation

In order to further verify the accuracy of the model after adding dryness samples, we applied the new prediction model to blocks *A* and *B* to simulate the influence of steam quantity, bottom-hole steam pressure, and bottom-hole steam dryness on oil production and liquid production by the control variable method. The results are shown in Figures 9–11.

According to Figure 9, we can see that the oil production and liquid production increase with the increase of steam quantity, but the rising range gradually decreases. Eventually, it tends to be flat, which is consistent with the actual change law.

According to Figure 10, we can see that the oil production and liquid production first increase and then decrease with the increase of bottom-hole steam pressure, which is consistent with the actual change law.

Figure 11 shows that the oil production and liquid production increase with the increase of steam dryness, but the rising range gradually decreases. It is consistent with the actual change law. At the same time, compared with Figure 7, we can see that the generalization ability of the algorithm has been improved, which lays the foundation for further exploring the deep learning algorithm based on field data and surrogate model.

#### 5. Conclusions

(1) Building on previous studies, this paper conducts a coupled study of surface steam pipeline flow, steam injection wellbore flow, and formation flow using a data-driven approach. This provides a new idea for the prediction of heavy oil steam stimulation production and a theoretical basis for further formulating scientific and reasonable development plans.

(2) Based on the correlation coefficient and Random Forest feature selection, this paper ranks the features that affect liquid production and water content by importance.

(3) For a heavy oil field in eastern China, we compared five typical machine learning algorithms on the field data and selected the optimal prediction model and the optimal number of features for the samples in this paper, but the generalization ability of the prediction model is poor. Therefore, we sampled the mechanism model to increase the diversity of steam dryness samples and retrained the model on the new samples. Both the accuracy and the generalization ability of the previously obtained optimal prediction model improved.

It is feasible to study the steam stimulation production of heavy oil by combining the mechanism model and field data as in this paper. However, this paper still has some limitations. Firstly, there is a certain error in the collection of field data, which may affect our results. Secondly, the lack of samples leads to weak generalization ability after training. Thirdly, steam stimulation is complex, and many factors affect its production. In the selection of features, this paper did not consider the influence of heavy oil lifting methods and viscosity reduction technology.

#### Appendix

#### A. Calculation of the Bottom-Hole Steam Pressure Based on the Mechanism Model

Based on the assumptions in Section 3.1.2, we know that wellbore pressure drop is the sum of friction energy loss, potential energy change, and kinetic energy change. According to the pressure balance equation, the pressure drop formula of vertical injection wells can be expressed as

The change of kinetic energy is significant only in the case of mist flow. For mist flow, the gas volume flow is much larger than the liquid volume flow. Therefore, according to the ideal gas law, we have

At the same time,

So,

Substituting equation (A.4) into equation (A.1), we obtain the following change in steam pressure in the steam injection wellbore:
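As a hedged sketch of how the steps of this appendix fit together, the derivation can be written out under assumed notation, which is not necessarily that of Appendix C: $z$ is depth measured downward, $p$ is pressure, $\rho_{m}$ and $v_{m}$ are the density and velocity of the steam-water mixture, $v_{sg}$ is the superficial gas velocity, $\tau_{f}$ is the friction pressure gradient, and $g$ is gravitational acceleration. For downward flow in a vertical well, the balance of potential energy, friction, and kinetic energy reads

$$\frac{dp}{dz} = \rho_{m} g - \tau_{f} - \rho_{m} v_{m}\frac{dv_{m}}{dz}.$$

For mist flow, taking $v_{m} \approx v_{sg}$ and treating the gas as ideal and isothermal so that $p\,v_{sg}$ is constant along the well gives

$$dv_{m} = -\frac{v_{sg}}{p}\,dp .$$

Substituting this into the balance and collecting the terms in $dp$ yields

$$\frac{dp}{dz} = \frac{\rho_{m} g - \tau_{f}}{1 - \dfrac{\rho_{m} v_{m} v_{sg}}{p}} .$$

In this form the kinetic term only rescales the pressure gradient, and it becomes negligible when $\rho_{m} v_{m} v_{sg} \ll p$, which is consistent with the statement that the kinetic energy change matters only for mist flow.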

#### B. Calculation of the Bottom-Hole Steam Dryness Based on the Mechanism Model

In unit time, the heat loss on the length of the wellbore is . Under the assumptions in Section 3.1.2, we have

The heat loss of the wellbore will inevitably lead to a decrease in saturated steam energy, which will result in a decrease in steam dryness. We have

Among them,Here,

At the same time,

So,

We make the following transformation:

Therefore, the solution of equation (12) is as follows:

We obtain the following dryness loss of the steam injection wellbore:

#### C. Parameter Description

Table 6

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

The authors are grateful to all of the anonymous reviewers for their careful reading and valuable comments on how to improve this work. This work was supported by the National Natural Science Foundation of China (no. 11601451), the International Cooperation Program of Chengdu City (no. 2020-GH02-00023-HZ) and the Scientific Research Project of Sinopec Corporation “Heavy oil steam stimulation low-consumption and high-efficiency development of overall optimization technology” (no: P19018-5).