Abstract

Existing artificial intelligence models use single-point logging data as input features to predict shear wave travel times (DTS); they neither consider the longitudinal continuity of logging data along the reservoir nor provide a multiwell data processing method. Low prediction accuracy of shear wave travel time degrades the derived elastic parameters and results in inaccurate sand production prediction. This paper establishes shear wave prediction models with five artificial intelligence methods (linear regression, random forest, support vector regression, XGBoost, and ANN), based on the standardization, normalization, and depth correction of conventional logging data. Data points adjacent in depth are used as machine learning features to improve the practicability of interwell prediction and the accuracy of single-well prediction. The results show that the model built with XGBoost using five points outperforms the other models: R2 values of 0.994 and 0.964 are obtained for the training set and testing set, respectively. Every model that considers the vertical geological continuity of the reservoir predicts test set DTS more accurately than single-point prediction. The developed model provides a tool to determine geomechanical parameters and to give a preliminary indication of the possibility of sand production where shear wave travel times are not available. Its implementation offers an economical and reliable alternative for the oil and gas industry.

1. Introduction

Sonic logs (compressional and shear wave travel times) are an important tool in production and exploration geophysics characterization [1]. Rock elastic parameters are critical to mitigating the risks associated with drilling and wellbore stability [2]. Consequently, the availability of sonic logs is essential for the development and production phases of hydrocarbon recovery. In sand production management, using acoustic travel time, density, and other logging data to predict the sand production index of a single well [3, 4], and using multiwell data to predict the regional sand production distribution [5], can guide macrolevel decisions on sand production and sand control for the whole block in weakly consolidated sandstone reservoirs.

Shear wave travel times are usually obtained from full-waveform sonic logging, but because of its high cost, most wells lack shear wave logging data. Logging data reflect reservoir information, and different logging curves are essentially responses of the same reservoir measured as different physical quantities, so mapping relationships exist among logging curves. When conventional logging data are used to invert shear wave travel times by multiple regression [1], the derived physical model often involves a certain degree of simplification, assumption, and subjective experience, which cannot guarantee the quality of the synthetic logging curve.

In recent years, with the application of machine learning in various fields of science and engineering [6–11], many researchers have used data-driven methods to solve geological problems. Intelligent systems [2, 12, 13] have been used to improve the accuracy of sonic wave velocity prediction when sonic logs have been lost due to poor storage, poor logging, failure of logging instruments, or bad hole conditions. In 2012, Quantico began to use artificial intelligence to generate acoustic and density logs from existing data streams [14]. Ramcharitar and Hosein established an artificial neural network with 10 hidden layers to estimate shear wave and P-wave velocities using depth, porosity, clay content, and bulk density [15]. Compared with the empirical model, the neural network model shows a lower average absolute error, making it more suitable for estimating rock mechanical properties. Tariq et al. established an artificial neural network model based on conventional logging data (density, gamma ray, and neutron porosity) to predict acoustic travel time [2]. Zou predicted the shear wave travel times of shale gas horizontal wells in the Jiaoshiba area with a random forest algorithm after comparing various machine learning methods [16]. Ni et al. used support vector regression on conventional logging data, such as natural gamma ray, density, and resistivity, to predict shear wave travel times and achieved good application results [17]. Onalo et al. proposed a three-layer feedforward multilayer perceptron artificial neural network model to estimate the P-wave and shear wave transit times from gamma ray and formation density logging data [3].

The sedimentary environment and the source, origin, and dynamic evolution of sediments have some continuity in the vertical direction. However, most machine learning models used for acoustic logging prediction obtain the input data only from the same depth as the output data and construct a point-to-point mapping: the output logging value is related only to the input logging values at the same depth, and the prediction results are completely independent of each other in space. This ignores the trend of the logging curve with depth and the information contained in the spatial correlation between preceding and succeeding data points, so the vertical continuity of logging data is not fully utilized. In addition, existing models lack a unified processing method for regional logging data. Establishing a unified prediction model is therefore very important for improving the comparability of cross-well calculation results, which places higher demands on data preprocessing.

In this paper, the regional logging data are standardized, normalized, and depth aligned, and longitudinally continuous sample points are selected as the training features. Data-driven shear wave logging curve prediction models are established with five machine learning methods, namely, linear regression, random forest, support vector regression, XGBoost, and ANN, and the best model is selected based on the largest R2 on the validation dataset. The model can extract the information contained in the input more effectively according to its spatial correlation, and the output is generated from a series of data inputs. It considers both the internal relationships among logging curves and the variation trend of different logging curves with depth, which is more consistent with geological reasoning and helps to improve the practicability of cross-well prediction. Each scenario is evaluated based on the coefficient of determination, RMSE, and MSE between actual and predicted log data. The results show that multipoint training features can effectively extract the information contained in logging data, and the prediction accuracy is higher than that of single-point mapping.

These prediction results can serve as a reliable tool for predicting reservoir geomechanical properties (including sand production potential) and provide a reliable method to determine the sand production potential of the formation in real time when data are limited or sonic logging data are not available. The significance of the model for the industry is that limited logging data no longer need to be sent for full geoscience analysis to preliminarily determine the possibility of sand production, reducing the cost and time of exploration operations, as shown in Figure 1.

2. Methodology to Develop the Artificial Intelligence Model

Artificial intelligence (AI) is a tool that simulates the intelligent behavior of human learning from examples to solve complex problems [18]. Compared with other computational automation methods, AI is a very efficient decision-making system requiring less time and cost. Artificial intelligence technology uses data to solve complex problems of classification, diagnosis, prediction, estimation, optimization, selection, and control. Machine learning is the core of artificial intelligence and an important method for realizing it. Its statistical, implicitly physics-driven methods can learn autonomously from large amounts of training data, effectively find the complex mapping relationships between variables, and build prediction models. This type of solution is called data-driven because it "learns" directly from data without assuming a predetermined equation as a model. The rapid development of artificial intelligence technology and machine learning methods has greatly promoted their application in geophysical logging. Using machine learning methods to analyze and generate logging curves provides a new way to solve technical problems in logging data processing. Machine learning is divided into supervised learning and unsupervised learning according to whether the training data are labeled. Supervised learning comprises regression algorithms and classification algorithms: regression methods are supervised learning algorithms for forecasting and modeling continuous numerical variables, while classification methods are supervised learning algorithms for modeling or forecasting discrete variables.

In supervised learning, a data-driven model is built by processing a labeled dataset, which includes the expected input (features) and output (labels/responses). A physics-driven model is a mathematical mapping based on theory that connects input and output, while supervised learning identifies patterns in the available datasets, learns from observation, and makes the necessary predictions based on a statistical mapping between input and output. In the process of establishing a supervised learning model, the predicted value is compared with the output value, and the model is improved based on the loss function. This process continues until the data-driven model reaches a high level of accuracy and performance, thus minimizing the loss function.

Machine learning methods are mainly used in geophysical logging to solve the classification and regression problems encountered in logging data processing and interpretation. Training a prediction model on existing shear wave data and then predicting an unknown shear wave curve is a supervised regression problem. By mining the internal relationships between the shear wave curve and conventional logging curves, predictions with high accuracy can be obtained, which makes up for the lack of shear wave data in the study area and lays the foundation for subsequent sand production prediction. This paper involves the following methods.

Linear regression, the earliest fitting method, is estimated by the least-squares method and provides a linear relationship between input variables (x) and one output variable (y).

Random forest (RF) is a machine learning method proposed by Breiman [19] that uses multiple decision trees to build an ensemble model with predictive power. A decision tree derives simple decision rules from data features to predict the value of the target variable. Each tree depends on the value of a random vector, sampled independently and with the same distribution for all trees in the forest. The random forest algorithm has excellent performance: it can detect correlations between feature variables and rank the importance of each feature, the generated model has good interpretability, it has high prediction accuracy, it tolerates outliers and noise well, and it is not prone to overfitting.

Support vector machine (SVM) applied to regression analysis is called support vector regression (SVR). The algorithm has the advantages of strong adaptability, global optimization, rigorous theory, high training efficiency, and good generalization performance. SVR is often used as an effective machine learning method to predict rock properties [20]. By using a kernel function [21] to map data points into an inner product space, an efficient regression function in high-dimensional space can be obtained.

XGBoost is a supervised learning model that can be used for regression and classification problems. The concept of XGBoost combines gradient boosting and tree ensemble learning. Through additive training, one new tree at a time is added to the previous XGBoost model to improve its prediction performance and optimize a prespecified objective function [22].

The artificial neural network model consists of an input layer, one or more hidden layers, and an output layer, each with a different number of neurons. Adjacent layers are fully connected, and each connection is assigned a weight. An artificial neural network with a hidden layer can approximate any nonlinear mapping from n dimensions to m dimensions with arbitrary precision on a closed set [23].
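For reference, the five regressors can be assembled with widely used Python libraries. The minimal sketch below uses scikit-learn and the xgboost package; the hyperparameter values shown are illustrative placeholders, not the tuned settings reported later in Table 4.

```python
# Illustrative instantiation of the five regressors compared in this study;
# all hyperparameter values here are placeholders, not the tuned values.
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

models = {
    "Linear regression": LinearRegression(),
    "Random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "SVR": SVR(kernel="rbf", C=10.0, epsilon=0.1),
    "XGBoost": XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=6),
    "ANN": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                        random_state=0),
}
```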

2.1. Data Preprocessing
2.1.1. Single-Well Logging Data Preprocessing

Data preprocessing is of great significance for machine learning to obtain an accurate prediction model. Well log data serve as the input for the proposed models. Logging data come in different formats because different data coding formats and acquisition software are adopted by the various technical service companies. All the LAS data files are converted into a single CSV dataset.

The case study presented in this work uses actual well log data from four wells (W1, W2, W3, and W4) in an offshore sandstone oil field in North China. The data contain conventional logging data (including CAL, CNL, AC, GR, PE, RD, RMLL, RS, SP, and DEN) and orthogonal dipole array acoustic logging data (in this paper, the shear wave logging curve DTS is used). Figure 2 shows the original wireline log curves, sampled at an interval of 0.1 m, with partial characteristics. First, the data are analyzed to identify the suitable depth range that provides an accurate representation of each well. Table 1 shows the measured depth range of the logging curves of each well.

In order to restore the objectivity of the data and obtain more accurate analysis results, the original data are processed: missing values are handled, and abnormal and null data are removed. Caliper logs are used to eliminate borehole irregularities, keyseats, and wash-out sections where the tools may have generated false readings. For sand production prediction, the logging data of the reservoir section are selected according to the logging interpretation results. Table 2 lists the statistical summary of the well log data (count, mean, standard deviation, min, median, and max) after preprocessing.
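A minimal pandas sketch of this preprocessing step is shown below; the file name, column names, bit size, caliper tolerance, and depth range are assumptions for illustration, not values taken from the study.

```python
import pandas as pd

# Hypothetical preprocessing sketch: "wells_merged.csv", the caliper cutoff,
# and the depth window are placeholders, not values from the paper.
df = pd.read_csv("wells_merged.csv")       # LAS files already merged to CSV

# Drop null readings and clearly abnormal values.
df = df.dropna()
df = df[(df["DTS"] > 0) & (df["AC"] > 0)]

# Discard wash-out intervals flagged by the caliper log, where tool
# readings may be unreliable (bit size and 10% tolerance are assumed).
BIT_SIZE = 8.5
df = df[df["CAL"] < 1.1 * BIT_SIZE]

# Keep only the reservoir section identified by log interpretation.
df = df[(df["DEPT"] >= 1500.0) & (df["DEPT"] <= 1800.0)]
```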

2.1.2. Standardization of Regional Logging Data

To establish a unified shear wave travel time prediction model for each block, a unified dataset must be built from multiwell data. Establishing a unified mathematical model for multiwell logging data is of great significance for improving the prediction accuracy of regional shear wave logging. Based on the editing, standardization, normalization, and depth drift correction of logging data, the original logging curves are corrected by the standardization method, and standardized logging data for the whole work area are obtained. Only then can the logging interpretation of the whole work area achieve a unified accuracy.

The key to standardization is selecting a reasonable standard layer. Sandstone, mudstone, or other lithologic bodies of the same layer system generally share the same sedimentary environment and similar parameter distribution characteristics, so their logging responses are consistent. Using this characteristic, strata with a stable regional distribution, similar physical properties or regular changes, and a certain thickness are selected as the standard layer. According to the standard layer, the logging curves of each well are translated to standardize the regional logging curves. Common standardization methods include the histogram method and the trend surface analysis method. The histogram method first determines the standard interval of each well, makes the frequency histogram of each well's standard interval, analyzes the histograms and logging curves of all wells, determines the most representative frequency distribution, and takes it as the standard value with which the logging curves of the other wells are checked and corrected, so that their logging characteristics become consistent with the standard value.

The difference between the average logging value of the standard histogram of the base well and the reference well is taken as the correction value. The mean translation method is used to standardize logging curves.

Let the log mean value before standardization be $\bar{x}$ and that after standardization be $\bar{x}'$. Then

$$\bar{x}' = \bar{x} + \Delta x,$$

where $\Delta x$ is the difference between the average logging value of the base well and the average logging value of the reference well:

$$\Delta x = \bar{x}_{\text{base}} - \bar{x}_{\text{ref}}.$$

Different types of logging data have different units. Scaling the data so that each input feature lies within the same range of values minimizes the bias of one feature relative to another and speeds up model training. The training dataset is normalized by mean and variance, and the mean and variance of the training set must also be used when normalizing the test dataset. Because the test data simulate the real environment, normalization is part of the algorithm, and the real environment may not provide all the test data at once; therefore, the mean and variance of the training set must be saved. The data of multiple wells are merged to generate a historical dataset. For a given feature, the following formulas can be used for standardization and normalization. The random forest and XGBoost algorithms impose no feature normalization requirements, while the SVR, ANN, and linear regression algorithms need normalized data.

There are many types of normalization used to scale data, including Z-score normalization, min-max normalization, sigmoid normalization, and statistical column normalization. In this study, to transform and normalize the data, the mean value is expressed as

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i.$$

The feature scaling expression is

$$x_{\text{norm}} = \frac{x - \mu}{\sigma},$$

where $x_{\text{norm}}$ is the normalized parameter, $x$ is the actual parameter, and $\sigma$ is the standard deviation of the actual parameters.
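A short sketch of this scaling scheme, assuming scikit-learn's StandardScaler and pre-split feature arrays X_train and X_test: the scaler is fitted on the training set only, and the saved training mean and variance are reused to transform the test set, exactly as required above.

```python
from sklearn.preprocessing import StandardScaler

# Fit the Z-score scaler on the training data only, then reuse the saved
# training mean and variance to transform the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and variance
X_test_scaled = scaler.transform(X_test)        # reuses training statistics
```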

2.2. Feature Engineering

The purpose of feature engineering is to reduce the computational cost and raise the upper limit of model performance. The best practice is to find which input parameters have more influence on the output parameter by determining the correlation coefficient (CC). The value of CC between two parameters always lies in the range of −1 to 1. A CC close to −1 indicates a strong inverse relationship between two parameters, a CC close to 1 indicates a strong direct relationship, and a CC of 0 indicates no relationship. The correlation matrix is shown in Figure 3. Based on the correlation matrix, the learning variables CAL, CNL, SP, and AC are finally selected.
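A sketch of this correlation screening with pandas is shown below; the 0.5 cutoff on |CC| is an illustrative assumption, not a threshold stated in the paper.

```python
import pandas as pd

# Pearson correlation between each candidate log and the target DTS;
# the 0.5 cutoff is an assumed threshold for illustration only.
corr = df[["CAL", "CNL", "GR", "SP", "DEN", "PE", "AC", "DTS"]].corr()
cc_with_dts = corr["DTS"].drop("DTS").sort_values(key=abs, ascending=False)
selected = cc_with_dts[cc_with_dts.abs() > 0.5].index.tolist()
print(selected)  # e.g., ["AC", "CNL", "SP", "CAL"] per Figure 3
```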

In addition to correlation analysis, this paper also adopts a multipoint mapping viewpoint to construct the feature engineering. The principle is based on the vertical continuity of strata. The point-to-point feature mapping method does not reflect actual geological analysis experience very well and is not the optimal method for generating logging curves. The sampling interval of logging is usually small (0.1 m), and the logging measurements of formations at different depths influence one another along the vertical direction. Therefore, each data point in a logging curve lies within the range of mutual influence of several neighboring data points, which means that the prediction of shear wave travel times can be regarded as a sequence data analysis problem with spatial correlation. To make better use of the longitudinal continuity of the logging curve, we take the input vector of a training sample as the logging values at three points, namely, the point corresponding to the shear wave travel time and the points above and below it, and compare the training effect with that of the single-point mapping model. The principle of feature selection is shown in Figure 4. The input variable is expressed as

$$X = \{x_1, x_2, \dots, x_N\}, \quad x_i \in \mathbb{R}^K,$$

where $\mathbb{R}^K$ is the K-dimensional input space, $x_i$ are the input variables, K is the total number of features, and N is the number of samples.

The output variable is expressed as

$$Y = \{y_1, y_2, \dots, y_N\}, \quad y_i \in \mathbb{R},$$

where $\mathbb{R}$ is the output space and $y_i$ is the output.

The training sample consists of feature parameters and a label. According to the correlation coefficient matrix, CAL, CNL, SP, and AC are taken as the n features and DTS as the training label. The N samples are stored row by row as an $N \times (nm + 1)$ matrix, where m is the number of feature points. When modeling with a single-point feature, the training sample at a given depth consists of that point's features and label, as shown in Figure 5. When modeling with a 3-point feature, the constructed feature vector of a sample contains the features of 3 adjacent points, namely, 3 × 4 = 12 features, and the label is the DTS at that depth, as shown in Figure 6. In the notation $x_{p,q}$, p stands for the point and q stands for the feature of a single point. The principle for five points is the same as that for three points; a minimal sketch of this construction follows.
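The sketch below builds the multipoint feature matrix with NumPy; the function name and interface are hypothetical, introduced only to illustrate the sliding-window idea described above.

```python
import numpy as np

def multipoint_features(X, y, n_points=3):
    """Stack each sample with its (n_points - 1) vertical neighbors.

    X: (N, q) array of per-depth features (q = 4 here: CAL, CNL, SP, AC);
    y: (N,) array of DTS labels. Returns p*q features per remaining sample,
    e.g., 3 x 4 = 12 for three points. A sketch of the paper's idea only.
    """
    half = n_points // 2
    rows = []
    for i in range(half, len(X) - half):
        # Concatenate the logs of the point itself and its neighbors
        # above and below into one flat feature vector.
        rows.append(X[i - half : i + half + 1].ravel())
    return np.asarray(rows), y[half : len(X) - half]
```

Calling `multipoint_features(X, y, n_points=5)` yields the 5 × 4 = 20-feature samples used by the best-performing model later in the paper.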

2.3. Model Development
2.3.1. Data Set Partition

Generalization is an important property of any model. The developed model must describe the relationship in the training dataset well enough that the relationship also applies to datasets outside the training set. If a model successfully describes the relationship in the training dataset but fails validation and testing on an external dataset, the model is considered poor. To avoid this, 70% of the dataset is used for training and cross-validation and 30% for testing.
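Assuming scikit-learn, the 70/30 split can be written as follows; the random seed is an assumption added for reproducibility.

```python
from sklearn.model_selection import train_test_split

# 70/30 split; the 30% hold-out set is never seen during training or
# cross-validation. random_state is an assumed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```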

2.3.2. Optimized Hyperparameters for the Models

Model configuration choices are generally referred to as hyperparameters, such as the number of trees in the random forest algorithm and the kernel function in SVM. In most cases, the space of hyperparameter choices is infinite. The minimum generalization error on the test set is taken to indicate the optimum parameters of the models, as shown in Figure 7. When the model is too complex or too simple, it will be overfitted or underfitted, and the generalization error will be high. The balance point with the optimum parameters is sought by drawing the learning curves of the training set and test set. To evaluate the performance of the models, the evaluation indices R2 (coefficient of determination), mean squared error (MSE), mean absolute error (MAE), and root mean square error (RMSE) are recorded for both the training and testing phases. A highly efficient parallel approach is used for computing XGBoost, with excellent utilization of GPU resources offering a significant computational speedup without sacrificing predictive accuracy. In this paper, five machine learning models are implemented with third-party Python modules on a server with 64 cores and 256 GB RAM. The hyperparameter settings of the proposed models and the optimal hyperparameter solutions are listed in Tables 3 and 4.
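A hedged sketch of this tuning and evaluation workflow for the XGBoost model, assuming scikit-learn's GridSearchCV; the parameter grid is illustrative and does not reproduce the grids or optima of Tables 3 and 4.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from xgboost import XGBRegressor

# Illustrative hyperparameter grid; not the tuned values from Table 4.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1, 0.2],
}
search = GridSearchCV(XGBRegressor(tree_method="hist"), param_grid,
                      cv=5, scoring="r2", n_jobs=-1)
search.fit(X_train, y_train)

# Report the evaluation indices named in the text on the hold-out set.
best = search.best_estimator_
y_pred = best.predict(X_test)
print("R2  :", r2_score(y_test, y_pred))
print("MAE :", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
```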

2.4. Results and Discussion
2.4.1. Comparison of Model Prediction Results

The corresponding model fit is generated with the optimal hyperparameters after each machine learning model passes through the training data. Table 5 lists the performance of each model during training and testing. The average R2 over all models is 0.946 for training and 0.93 for testing. The difference is small, which indicates that the training process is reliable (i.e., no overfitting). The comparison clearly shows that the model built with XGBoost using five points outperforms the other models: R2 values of 0.994 and 0.964 are obtained for the training set and testing set, respectively.

The comparisons between actual and predicted values for the testing phase are shown in Figure 8.

Figure 9 shows that the proposed XGBoost model using five points gives the highest R2 and lowest RMSE between actual and predicted data in the training set. During testing on unseen data, XGBoost outperforms the other models, meaning that for the given dataset the trained XGBoost model generalizes well and gives good results on unseen data.

Computational efficiency, apart from prediction accuracy, is another important aspect in evaluating an algorithm. In total, 16,701 points are used in training. Table 6 shows that the running time of linear regression is the shortest, but its accuracy is the lowest. Even with the development of graphics processing unit- (GPU-) accelerated computing in recent years, the other learning algorithms still cannot compete with XGBoost; for example, the running time of XGBoost remains stable when generating predictions. As the number of training features increases, the training time of the ANN model grows rapidly: the running time of ANN is almost 36 times that of XGBoost. When the dataset size increases dramatically, this difference in running time cannot be neglected. Considering both calculation time and accuracy, five points are selected, which meets the prediction requirements.

2.4.2. Influence of Feature Point Density on Model Accuracy

For a machine learning model, when an example of logging data is input into the model, the final output is often determined by the training-set samples whose features are close to the example. The density of feature points in the training set therefore has a certain impact on the final prediction accuracy. In theory, if a large number of feature points are gathered in a certain region of the input space, that region is better covered, so the model can better describe the mapping relationship between the input and output spaces there. The logging data features selected in this paper form a four-dimensional input space.

The Gaussian kernel density estimation algorithm [24] is used to calculate the Gaussian kernel density distribution of the sample feature points in the four-dimensional input space of the training set and to estimate the training-set feature point density at the sample points of the test set. The calculation results and the absolute value of the relative error are projected into the "AC-SP" plane and plotted as a two-dimensional scatter diagram, as shown in Figure 10. The blue-red scatter shows the distribution of the 7158 test-set feature points colored by the absolute relative error of DTS, the purple-yellow circles show the Gaussian kernel density at the current point, and the grey scatter shows the distribution of the training-set samples. The absolute relative error of test-set sample points in high-density areas is generally low. Although the absolute relative error is not strictly inversely proportional to the Gaussian kernel density, low-density regions contain many high-error points, whereas high-density regions contain only low-error points. If the feature points of the test set are distributed in high-density areas of the input space, the prediction accuracy in those areas will be higher.
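A sketch of this density analysis using scipy.stats.gaussian_kde is shown below; the variable names and the 10th-percentile cutoff defining "low density" are assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Fit a Gaussian KDE on the 4-D training features (CAL, CNL, SP, AC) and
# evaluate it at the test-set points. gaussian_kde expects arrays of
# shape (n_dims, n_samples), hence the transposes.
kde = gaussian_kde(X_train_scaled.T)
test_density = kde(X_test_scaled.T)

# Relate local training density to the absolute relative DTS error of
# each test point (10th percentile is an assumed "low density" cutoff).
rel_err = np.abs(y_pred - y_test) / y_test
low_density = test_density < np.percentile(test_density, 10)
print("mean |rel. error| in low-density regions:", rel_err[low_density].mean())
print("mean |rel. error| elsewhere             :", rel_err[~low_density].mean())
```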

In practice, the logging curves of different wells in the same block are highly similar, so the feature range of the input space is small. In this case, as long as enough samples are collected for the input-space feature points to reach a certain density, the DTS within this range can be predicted with good accuracy.

3. Model Validation and Sanding Potential Prediction

The developed XGBoost model with five-point mapping has been validated using field data from another well within the study area. The dataset used for validation was not used in building the model. The validation data cover a sandstone reservoir section containing 1500 data points. Sanding potential can be determined from the shear modulus and bulk modulus estimated from the sonic travel times.

Bulk modulus (K) can be estimated from the sonic transit times using

$$K = \rho_b \left( \frac{1}{\Delta t_c^2} - \frac{4}{3\,\Delta t_s^2} \right).$$

Shear modulus (G) is estimated from the sonic transit time using

$$G = \frac{\rho_b}{\Delta t_s^2},$$

where $\rho_b$ is the formation rock bulk density, $\Delta t_s$ is the shear wave travel time, and $\Delta t_c$ is the compressional wave travel time.

Sand production index B (MPa), combined modulus $E_c$ (MPa), and Schlumberger ratio R (MPa²) are calculated using

$$B = K + \frac{4}{3}G, \qquad E_c = \frac{\rho_b}{\Delta t_c^2}, \qquad R = K \cdot G.$$

Discriminant criteria: when DTC > 345 μs/m, or when B, $E_c$, and R fall below their respective critical values (in MPa, MPa, and MPa², respectively), there is a tendency toward sand production in the oil and gas reservoir; otherwise, no sand production is expected. Figure 11 shows the vertical distribution of the sand production indices, based on the distribution of rock mechanics parameters along depth according to the logging data; DTS is obtained by machine learning prediction. The results of the four sand production indices show that the tendency toward sand production is serious.
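The moduli and indices above can be combined into one helper; the sketch below assumes consistent units (velocities taken as reciprocals of the travel times) and omits the unit-conversion constants, which depend on the units of the input logs.

```python
import numpy as np

def sanding_indicators(rho_b, dtc, dts):
    """Elastic moduli and sanding indices from density and travel times.

    A sketch only: consistent units are assumed, and the conversion
    factors required for field units are deliberately omitted.
    """
    vp, vs = 1.0 / dtc, 1.0 / dts                # velocities from travel times
    G = rho_b * vs ** 2                          # shear modulus
    K = rho_b * (vp ** 2 - 4.0 * vs ** 2 / 3.0)  # bulk modulus
    B = K + 4.0 * G / 3.0                        # sand production index
    Ec = rho_b * vp ** 2                         # combined modulus
    R = K * G                                    # Schlumberger ratio
    return K, G, B, Ec, R
```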

4. Conclusions

(1) This paper demonstrates the potential of machine learning in sand management. The problem of lacking shear wave data when building a geomechanical model is solved. Considering the geological continuity of the reservoir, data-driven shear wave logging prediction models have been established with five artificial intelligence methods (linear regression, random forest, support vector regression, XGBoost, and ANN) from the perspective of geological analysis. Longitudinally continuous points are extracted as the machine learning feature parameters according to the logging curve trend and background information.

(2) Establishing a unified mathematical model for artificial intelligence prediction over the whole area is important for improving the comparability of calculation results between wells, but it also places higher demands on data preprocessing. The selection and supplementation, standardization, normalization, and depth alignment of regional logging data solve the matching problems between logging data and measured data from different aspects, ensuring the rationality of the model and the accuracy of the prediction. These methods are worth popularizing when machine learning is used to predict well-logging reservoir parameters.

(3) The XGBoost model using five points gives the highest R2 and lowest RMSE between actual and predicted data in the training set, and it also outperforms the other models on unseen test data. No matter which machine learning method is used, the accuracy of multipoint prediction is higher than that of single-point prediction. Considering calculation time and accuracy, five points are selected, which meets the prediction requirements. The influence of feature point density on model accuracy is also discussed.

Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported by the Scientific Research and Technological Development Project of CNPC, Research and Development of Integrated Software of Drilling and Completion Engineering Design and Optimization Decision (smartdrilling) (2020B-4019).