High-precision permeability prediction is of great significance to tight sandstone reservoirs. However, while considerable progress has recently been made in the machine learning based prediction of reservoir permeability, the generalization of this approach is limited by weak interpretability. Hence, an interpretable XGBoost model is proposed herein based on particle swarm optimization to predict the permeability of tight sandstone reservoirs with higher accuracy and robust interpretability. The porosity and permeability of 202 core plugs and 6 logging curves (namely, the gamma-ray (GR) curve, the acoustic curve (AC), the spontaneous potential (SP) curve, the caliper (CAL) curve, the deep lateral resistivity (RILD) curve, and eight lateral resistivity (RFOC) curve) are extracted along with three derived variables (i.e., the shale content, the AC slope, and the GR slope) as data sets. Based on the data preprocessing, global and local interpretations are performed according to the Shapley additive explanations (SHAP) analysis, and the redundant features in the data set are screened to identify the porosity, AC, CAL, and GR slope as the four most important features. The particle swarm optimization algorithm is then used to optimize the hyperparameters of the XGBoost model. The prediction results of the PSO-XGBoost model indicate a superior performance compared with that of the benchmark XGBoost model. In addition, the reliable application of the interpretable PSO-XGBoost model in the prediction of tight sandstone reservoir permeability is examined by comparing the results with those of two traditional mathematical regression models, five machine learning models, and three deep learning models. Thus, the interpretable PSO-XGBoost model is shown to have more advantages in permeability prediction along with the lowest root mean square error, thereby confirming the effectiveness and practicability of this method.

1. Introduction

Permeability is an important parameter in tight sandstone reservoir evaluation and oil and gas field development and is the basis for establishing geological models, accurately estimating oil and gas reserves, and determining reasonable development plans [1–4]. However, because tight sandstone reservoirs have experienced complex diagenesis and exhibit strong heterogeneity, it is difficult to predict their permeability with high accuracy [5, 6]. At present, the main methods for the high-precision prediction of permeability are (i) mathematical regression methods such as porosity and permeability regression based on petrophysical data or the response of logging curves [7–10]; (ii) theoretical modeling based on high-pressure mercury intrusion data, e.g., Winland, Purcell, Swanson, and Katz-Thompson [11–13]; and (iii) prediction based on special logging series, e.g., nuclear magnetic resonance logging and dipole sonic logging [14–16]. These methods differ in their data requirements and parameter selection and are challenged by two main problems: (1) the accuracy of the mathematical regression method is poor. Whether the regression is based on porosity alone or on various logging curves, the relationship with permeability is complex and nonlinear. Hence, it is impossible to obtain high-precision permeability prediction results by using simple linear analytical expressions; (2) petrophysical experiments and special logging series each incur prohibitive costs and rather long evaluation periods. For petrophysical experiments, only a limited number of core plugs can be collected, and these are unable to provide the same resolution as logging curves. The arbitrary selection of core plugs at irregular intervals also restricts the accurate evaluation of the heterogeneity of the permeability of the tight sandstone reservoir.
Therefore, a convenient, cheap, and reliable method that can examine data and express complex nonlinear relationships to generate a reliable conclusion is desirable as an effective technical approach to characterizing the permeability of tight sandstone reservoirs [17].

Machine learning is the core method of research and application in the field of artificial intelligence and can quickly find the mapping relationship between large amounts of data [18]. In recent years, the application of machine learning to the prediction of the petrophysical properties of reservoirs has attracted increasing attention from geologists [19–25]. Logging curves reflect the formation characteristics at different depths and contain various information on the formation parameters. Moreover, the data sets used for training are all easily obtained, high-resolution logging curve data, thus solving the cost problem. The studies of Al-Anazi and Gates [19] and Gholami et al. [20] both applied support vector machines (SVM) to construct the relationship between logging data and core permeability and achieved good results. Ao et al. [21] introduced the application of the random forest (RF) algorithm to predict the permeability and other formation characteristics and compared the reliability of the results with those of eight other algorithms for predicting the petrophysical properties of the reservoir. Although the above research has had good predictive effect, the accuracy of prediction is still limited because the SVM has no universal solution to nonlinear problems and because it is difficult to find a suitable kernel function. Moreover, the RF algorithm is prone to overfitting when solving the regression problem and generalizes poorly. To address these problems, Chen and Guestrin proposed a powerful machine learning framework termed the extreme gradient boosting decision tree (XGBoost) [22].
The XGBoost framework has the following advantages: (i) it draws on the random forest algorithm to sample features; (ii) it establishes a new model by continuously reducing residuals, thus decreasing any overfitting while reducing the amount of calculation; (iii) it uses standardized regularization terms that further mitigate overfitting in the trained model; and (iv) it supports parallel calculation to achieve high predictive efficiency and precision [23]. For example, Otchere et al. [24] explored the reservoir characterization method based on the XGBoost framework and carried out feature selection based on the integrated random forest algorithm and lasso regularization. Their results were superior to those of the benchmark XGBoost and PCA-XGBoost models for predicting the petrophysical properties of the reservoir in the high permeability zone. In addition, Gu et al. [25] proposed an application of the particle swarm optimization (PSO) algorithm to improve the hyperparameter selection of the XGBoost framework to more accurately predict the permeability of tight sandstone reservoirs and favorably compared the performance with that of the stepwise, SVR, and GBDT algorithms.

Although the above methods show superiority in permeability prediction, the majority of machine learning methods are equivalent to black box processes in which a decision result is obtained from a batch of input data, but the basis of the intermediate decision is unknown and, hence, lacks interpretability [26]. While some studies have used feature importance methods to improve the interpretability of machine learning algorithms, this only reveals the key variables that affect the overall prediction result; it does not reveal whether each variable pushes the final prediction up or down, nor can it explain individual predictions [27–29]. In 2017, Lundberg and Lee proposed an interpretable method based on the game theory approach, known as Shapley additive explanations (SHAP) [30]. This is a method of model interpretation after the fact, in which calculation of the Shapley value of each feature enables both a local and global interpretation (the latter being obtained by averaging the absolute Shapley values of each feature over all samples). Hence, the SHAP method has attracted widespread attention in recent years for enabling the capture of meaningful features in the application of machine learning and for improving the interpretability and transparency of the algorithm [31–33].

In view of the above discussion, the present study attempts to use the SHAP method to introduce the interpretable XGBoost framework into the permeability prediction field. In addition, the particle swarm optimization algorithm is used to search for hyperparameters of the model and find the best combination in order to optimize the performance of the XGBoost. Finally, the performance of the proposed SHAP-PSO-XGBoost model in predicting the permeability of the tight sandstone reservoir is compared with that of traditional mathematical regression methods and other commonly used machine learning algorithms, including the SVM, decision tree (DT), RF, gradient boosting decision tree (GBDT), and LightGBM. The proposed model is also compared with deep learning technologies such as convolutional neural networks (CNNs), long short-term memory (LSTM), and the gated recurrent unit (GRU). The experimental results indicate that the SHAP-PSO-XGBoost model has the highest accuracy, thus demonstrating the feasibility and superiority of the proposed model and laying the foundation for the fine evaluation of heterogeneous tight sandstone reservoirs based on permeability.

The remainder of this article is organized as follows: Section 2 describes the data set and data preprocessing methods used in the study; Section 3 introduces the theoretical basis of the proposed method; Section 4 presents and discusses the results of the computational experiments, and Section 5 presents a brief summary.

2. Data Collecting and Preprocessing

2.1. Geological Setting and Data Acquisition

The Ordos Basin is located in central China and is the second largest sedimentary basin in China (Figure 1). The study area is the Yishan slope belt, which is the main body of the Ordos Basin and the main area of oil and gas enrichment. The morphology of the area is gentle and monoclinic in the west, with an inclination angle of less than 1°. During the Yanchang period of the Late Triassic, the basin formed a typical continental clastic sedimentary system characterized by river-lake facies under the background of overall subsidence. From bottom to top, it can be divided into ten oil-bearing formations. Among these, the Chang 6 and Chang oil-bearing formations were formed in the early stage of contraction of the lacustrine basin. The lithology is dominated by fine-grained feldspar sandstone. These are the important oil and gas exploration target layers in the study area [34–36].

Overlying pressure porosity and permeability analyses were performed on 202 core plugs of the Chang and Chang 6 tight sandstone reservoirs from two coring wells in the study area. The porosity of the core samples was found to be between 2.0 and 13.2%, with an average porosity of 10.3%, and the permeability was between and , with an average of . The exponential regression relationship between the porosity and permeability of the core samples is presented in Figure 2, where the poor correlation is indicated by a correlation coefficient of only 0.441. This also indicates that the pore-throat structure of tight sandstone reservoirs is more complex than that of conventional oil reservoirs and, hence, it is more difficult to predict the permeability with high accuracy.

The core porosity (Por), core permeability, and sample points of the six logging curves and three derived variables corresponding to the coring depths were selected as input data on the basis of core homing, giving a total of 202 data points. The six logging curves were the gamma-ray (GR) curve, acoustic curve (AC), spontaneous potential (SP) curve, caliper (CAL) curve, deep lateral resistivity (RILD) curve, and eight lateral resistivity (RFOC) curve. Among these, the RILD and RFOC were used in logarithmic form. The three derived variables were the shale volume, the AC slope, and the GR slope.

2.2. Data Normalization

To eliminate the systematic errors in the various measurement tools and the dimensional influence between the various logging curves [37], the input data were normalized according to the following equation:

$$x^{*} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \tag{1}$$

where $x$ and $x^{*}$ are, respectively, the actual and normalized values of each index, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of each index. This process also facilitates the comparison of indicators having diverse units or magnitudes, as the quantified indicators will be distributed in the interval [0, 1]. In addition, the data normalization helps to accelerate the training process.
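As an illustration of the min-max normalization described above, the following is a minimal sketch in plain Python. The function name `min_max_normalize`, the handling of constant columns, and the gamma-ray readings are assumptions made for this example and are not taken from the study.

```python
def min_max_normalize(values):
    """Scale a sequence of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical gamma-ray readings (API units), for illustration only.
gr = [62.0, 76.0, 90.0, 118.0]
gr_norm = min_max_normalize(gr)  # smallest reading maps to 0, largest to 1
```

In a real workflow the minimum and maximum would normally be taken from the training set only and then reused for the test set, so that no test information leaks into training.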

The normalized core porosity and permeability are presented in Figure 3, along with the statistical characteristics of the wireline logging curves. Here, the diagonal direction shows the distribution of the 11 features, all of which are discrete data, and none (except for the GR, AC slope, and GR slope) conform to the normal distribution. Therefore, the Spearman correlation coefficient was used to measure the dependence between two variables and is presented on the upper half of the figure, while the lower half shows the linear fitting relationship between the two features. It can be seen that, except for the slightly higher correlation with the porosity (), the correlation between the permeability and the other features is very poor. This indicates that the use of any of these features to predict the permeability directly will be inefficient and unreliable.
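The Spearman coefficient used above is the Pearson correlation of the ranks of two variables, which is why it suits data that are not normally distributed. The sketch below shows one way to compute it in plain Python; the helper names and the sample values are hypothetical, and a library routine such as `scipy.stats.spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical normalized porosity and permeability values.
por = [0.10, 0.35, 0.50, 0.80, 0.95]
perm = [0.01, 0.02, 0.10, 0.40, 0.90]
rho = spearman(por, perm)  # monotone increasing pair, so rho is 1 up to rounding
```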

3. Methodology

3.1. Explainable Machine Learning
3.1.1. XGBoost

The XGBoost is an efficient implementation of the GBDT algorithm proposed by Chen and Guestrin [22]. Compared with the GBDT, the XGBoost not only provides parallel computing but also significantly improves the accuracy of the algorithm; hence, it is widely used in various industries [38–40]. The XGBoost framework uses decision trees as base learners and combines them to form a strong learner. The algorithm flow is shown schematically in Figure 4.

The principle of the XGBoost algorithm is as follows.

Given a data set $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), the number of features in the data set is $m$, the number of samples is $n$, and the predicted value $\hat{y}_i$ is the cumulative value of the output results of $K$ decision trees based on the input $x_i$. The formula is given by the following equation:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F, \tag{2}$$

where $F = \{f(x) = w_{q(x)}\}$ ($q: \mathbb{R}^m \to \{1, \dots, T\}$, $w \in \mathbb{R}^T$) is the set of regression trees, CART is a common regression tree type, $q$ represents the structure of the regression tree, $T$ is the number of leaves in the regression tree, $f_k$ is the input-output function of the $k$-th regression tree (with each $f_k$ corresponding to the structure $q$ of the $k$-th regression tree and the leaf node weight $w$), and $w_j$ is the score of the $j$-th leaf node.

In addition, the XGBoost adds a regularization term to the loss function to control the complexity of the model. The loss function is defined by the following equation:

$$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \tag{3}$$

where $y_i$ is the true value of the $i$-th sample; $\sum_{i} l(y_i, \hat{y}_i)$ is the sum of the errors of the model’s predicted values for all samples; and $\Omega(f_k)$ is the regularization term, given by the following equation:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2, \tag{4}$$

where $T$ is the number of leaf nodes in the tree $f$; $\lVert w \rVert^2$ is the squared L2 norm of the leaf node weights in the tree; and $\gamma$ and $\lambda$ are hyperparameters: $\gamma$ is the penalty coefficient, with a value range of [0,1], and $\lambda$ is the regularization coefficient.

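Next, we need to find the tree $f_t$ that minimizes the loss function at the $t$-th iteration, where $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$. Different from the traditional GBDT, XGBoost carries out a second-order Taylor expansion of the loss function, so Eq. (3) can be approximated as the following equation:

$$L^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t), \tag{5}$$

where $g_i$ represents the first derivative of $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ with respect to $\hat{y}_i^{(t-1)}$, and $h_i$ represents the second derivative of $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ with respect to $\hat{y}_i^{(t-1)}$.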
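As a concrete instance of the first and second derivatives used in the Taylor expansion, consider the squared-error loss. This choice is an assumption made for illustration; the loss function used in the study is not stated in this section.

```latex
l(y_i, \hat{y}_i) = \tfrac{1}{2}\,(y_i - \hat{y}_i)^2
\;\Longrightarrow\;
g_i = \frac{\partial l}{\partial \hat{y}_i^{(t-1)}} = \hat{y}_i^{(t-1)} - y_i,
\qquad
h_i = \frac{\partial^2 l}{\partial \big(\hat{y}_i^{(t-1)}\big)^2} = 1 .
```

Under this loss, the first derivative is simply the negative residual of the current ensemble and the second derivative is constant, so each new tree is fitted to the residuals of the previous trees.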

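Since the prediction scores of the first $t-1$ trees, and hence the term $l\left(y_i, \hat{y}_i^{(t-1)}\right)$, are constant at iteration $t$, this term does not affect the optimization of the objective function and can be directly removed. The loss function can be simplified as the following equation:

$$\tilde{L}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t). \tag{6}$$

Define $I_j = \{ i \mid q(x_i) = j \}$ as the sample set in leaf node $j$. Since all samples that fall into the same leaf node receive the same weight, their terms can be merged, so the loss function can be transferred from the sample dimension to the leaf node dimension. The formula is defined by the following equation:

$$\tilde{L}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T. \tag{7}$$

Taking the derivative of the loss function with respect to $w_j$ and setting it to 0, the optimal weight of leaf node $j$ is obtained as the following equation:

$$w_j^{*} = - \frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}. \tag{8}$$

The corresponding loss function is Eq. (9), which can be used as the value to evaluate the quality of a tree:

$$\tilde{L}^{(t)}(q) = - \frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T. \tag{9}$$

Let $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$; Eq. (9) can then be simplified to the following equation:

$$\tilde{L}^{(t)}(q) = - \frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T. \tag{10}$$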
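To make the leaf-level quantities concrete, the sketch below evaluates the optimal leaf weight, -G/(H + lambda), and the tree quality score, -1/2 * sum_j G_j^2/(H_j + lambda) + gamma * T, for a three-sample toy problem. It assumes a squared-error loss (so g is the prediction minus the target and h is 1); the function names, data, and hyperparameter values are hypothetical.

```python
def leaf_weight(g, h, lam):
    """Optimal leaf weight w* = -G / (H + lambda) for one leaf."""
    return -sum(g) / (sum(h) + lam)


def structure_score(leaves, lam, gamma):
    """Tree quality: -1/2 * sum_j G_j^2 / (H_j + lambda) + gamma * T (lower is better)."""
    total = 0.0
    for g, h in leaves:
        G, H = sum(g), sum(h)
        total += G * G / (H + lam)
    return -0.5 * total + gamma * len(leaves)


# Toy targets and the previous ensemble's predictions (hypothetical values).
y = [0.2, 0.4, 0.9]
y_prev = [0.5, 0.5, 0.5]
g = [p - t for p, t in zip(y_prev, y)]  # squared error: g_i = y_prev_i - y_i
h = [1.0, 1.0, 1.0]                     # squared error: h_i = 1

lam, gamma = 1.0, 0.01
# Candidate split: samples 0 and 1 go left, sample 2 goes right.
left = (g[:2], h[:2])
right = (g[2:], h[2:])
w_left = leaf_weight(*left, lam)    # G = 0.4, H = 2, so about -0.133
w_right = leaf_weight(*right, lam)  # G = -0.4, H = 1, so about 0.2

score_single = structure_score([(g, h)], lam, gamma)
score_split = structure_score([left, right], lam, gamma)
```

Here the split tree has the lower score, so the split would be accepted; raising `gamma` penalizes the extra leaf and eventually reverses that decision.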

3.1.2. Shapley Additive Explanations

Although many methods of machine learning, including the XGBoost framework, can generate high-precision prediction results, they are essentially black box models. As noted above, this makes the prediction process difficult for humans to intuitively understand and, hence, the credibility and practical application of the machine learning models are limited by poor interpretability [26]. To address this issue, the SHAP is employed as an interpretable algorithm based on the Shapley value, or feature attribution value, that was proposed by Lloyd Shapley in the field of game theory [30]. In this approach, all features are considered as “contributors.” For each permeability prediction sample, the model produces a prediction value, and each feature of that sample is assigned a Shapley value. The sum of the contributions of all the features, added to the baseline value, is the final prediction of the model [31, 32], which can be expressed as the following equation:

$$f(x) = g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i, \tag{11}$$

where $f(x)$ is the predicted value of the machine learning model, $g$ is the explanatory model, and $z' \in \{0, 1\}^M$. $M$ is the dimension of the input features. When the $i$-th input feature is observed, $z'_i = 1$; otherwise, it is 0. $\phi_0$ is the average predicted value (baseline value) of the machine learning model on the data set, and $\phi_i$ is the contribution value of the $i$-th feature (Shapley value). For a single sample, the feature with the larger absolute Shapley value has a greater impact on the sample’s prediction result. The sign of the Shapley value reflects whether the feature makes the predicted value of the model increase or decrease. The $\phi_i$ is given by the following equation:

$$\phi_i = \sum_{S \subseteq \{1, \dots, M\} \setminus \{i\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \left[ f_x(S \cup \{i\}) - f_x(S) \right], \tag{12}$$

where $M$ is the number of features; $S$ is a feature subset that does not contain feature $i$, and $|S|$ represents the number of elements in the set $S$; $f_x(S \cup \{i\})$ and $f_x(S)$ are the predicted values that contain feature $i$ and that do not contain feature $i$, respectively.
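The Shapley value definition above can be checked on a small example by enumerating every feature subset directly. The sketch below does this for a hypothetical three-feature model, replacing "missing" features with fixed baseline values; this substitution rule, the toy model, and all names are assumptions for illustration. It is a brute-force calculation, not the tree-specific algorithm that the SHAP library uses for XGBoost models.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every subset S that excludes feature i."""
    phi = []
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Weight |S|! (M - |S| - 1)! / M! for subsets of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi.append(total)
    return phi


# Hypothetical sample and baseline (e.g. feature means), for illustration only.
x = [0.8, 0.7, 0.4]
baseline = [0.5, 0.5, 0.5]


def model(v):
    """Toy model with one additive term and one interaction term."""
    return 2.0 * v[0] + v[1] * v[2]


def value_fn(subset):
    """Model output when only the features in `subset` take their observed values."""
    mixed = [x[j] if j in subset else baseline[j] for j in range(len(x))]
    return model(mixed)


phi = shapley_values(value_fn, n_features=3)
phi0 = value_fn(frozenset())  # baseline prediction
# Additivity: phi0 + sum(phi) reproduces model(x) up to floating-point rounding.
```

The enumeration grows as 2^M, which is manageable for the eleven or fewer features considered here but is the reason that specialized algorithms are used for larger feature sets.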

As shown schematically in Figure 5, red indicates that the feature contribution is positive, and blue indicates that the feature contribution is negative. The SHAP method can not only rank the importance of each feature but also reveal the positive and negative influence of each feature upon the prediction result, thereby effectively combining global and local interpretability.

3.2. Particle Swarm Optimization

The particle swarm algorithm is an intelligent search algorithm that simulates the foraging behavior of birds. Each individual is called a particle. Potential solutions of the optimization problem are searched for in a $D$-dimensional space: the swarm is initialized as a group of random particles, and the optimal solution is found by iteration. Each particle has two attributes, its velocity and its position. In each iteration, the particle tracks its own best position so far, or individual extreme value ($p_{\mathrm{best}}$), and the swarm’s best position, or global extreme value ($g_{\mathrm{best}}$), in order to update itself. This process is repeated until the global optimal solution is found or the maximum number of iterations is reached [41, 42]. The particles update their velocity and position through the following equations:

$$v_{id}^{k+1} = \omega v_{id}^{k} + c_1 r_1 \left( p_{id}^{k} - x_{id}^{k} \right) + c_2 r_2 \left( p_{gd}^{k} - x_{id}^{k} \right), \tag{13}$$

$$x_{id}^{k+1} = x_{id}^{k} + v_{id}^{k+1}, \tag{14}$$

where $i = 1, 2, \dots, N$ indexes the particles; $d = 1, 2, \dots, D$, which represents the dimension of the particle; $k$ represents the number of iterations; $v_{id}^{k}$ is the velocity at the $k$-th iteration of the $i$-th particle in the $d$-th dimension; $x_{id}^{k}$ is the position at the $k$-th iteration of the $i$-th particle in the $d$-th dimension; the individual extreme value found by the particle is represented by $p_{id}^{k}$, and the global extreme value found by the particle swarm is represented by $p_{gd}^{k}$; $\omega$ is the inertia weight coefficient, which reflects the influence of the previous velocity of the particle on its current velocity; $c_1$ and $c_2$ are the learning factors, which reflect the extent to which the particle learns from its own best state and from the global best state, respectively; and $r_1$ and $r_2$ are random numbers in [0,1].
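The velocity and position updates described above can be sketched in a few lines of plain Python. The population size (25), inertia weight (0.5), and iteration limit (200) follow the settings reported in Section 4.2; the learning factors `c1 = c2 = 1.5`, the clamping of positions to the search bounds, the fixed random seed, and the sphere test function are assumptions added for this example.

```python
import random


def pso_minimize(objective, bounds, n_particles=25, n_iter=200,
                 w=0.5, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer; returns (best_position, best_value)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # individual best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best position

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive term + social term.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clamped to the search interval (an assumption).
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(p):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(v * v for v in p)


best, best_val = pso_minimize(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
```

For hyperparameter tuning, `objective` would instead return a cross-validated error for a candidate parameter vector, with integer-valued parameters such as tree depth rounded before use.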

3.3. Using the SHAP-PSO-XGBoost Framework for Estimating Permeability

The proposed modeling process of permeability prediction based on the SHAP-PSO-XGBoost is shown schematically in Figure 6. In brief, the following six primary steps are involved:
(1) The sample data set is normalized, and the range of each feature is transformed to [0, 1].
(2) The 202 samples in the data set are randomly divided into a training set and a test set. The training set accounts for 80%, and the test set accounts for 20% of the data set. The training set is used to train the model, and the test set is used to evaluate the performance of the model.
(3) The XGBoost model is constructed for the sample data, the grid parameters are initialized according to the grid search method, and the hyperparameters are determined.
(4) Based on the established model, the SHAP attribution theory is used to analyze the factors that influence the evaluation results in terms of both the overall and individual aspects. Then, according to the global importance result, the feature combination is optimized to construct a new data set.
(5) The XGBoost model is reapplied to the new data set, and the particle swarm algorithm is used to search for the best hyperparameters of the model, including the n_estimators, learning_rate, and max_depth. Ten-fold cross-validation is used for verification during the search process. That is to say, the training set is divided into ten parts, and each part in turn serves as the verification part while the other nine are used for training. The average value over the ten runs is then used to express the general performance of each set of parameters, thus avoiding any deviation due to randomly dividing the training set.
(6) After determining the optimal hyperparameters, the test set is brought into the model to determine the predicted permeability. Then, the true performance of the model is evaluated according to the evaluation index.
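The data partitioning in steps (2) and (5) can be sketched with index lists alone. The helper names, the random seed, and the rounding of the 20% test share to 40 of the 202 samples are assumptions for illustration; the exact split sizes used in the study are not stated.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction of them as the test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists; each fold is validation exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


train_idx, test_idx = train_test_split_indices(202)
cv_splits = list(k_fold_indices(train_idx, k=10))
# For each candidate hyperparameter set, a model would be fitted on `train`
# and scored on `validation` for every split, and the ten scores averaged.
```

The test indices are never used inside the cross-validation loop, which keeps the final evaluation in step (6) independent of the hyperparameter search.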

The experimental environment was as follows: the CPU was an Intel(R) Core(TM) i5-3210M with a frequency of 2.5 GHz, and the machine had 16 GB of RAM. The programming language was Python 3.7.6.

3.4. Model Evaluation Indicators

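The root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE) were used to evaluate the prediction performance of the proposed model [43–45]. These were calculated using Eqs. (15)–(17):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}, \tag{15}$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2, \tag{16}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \tag{17}$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the total number of samples.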
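The three error indicators can be computed directly from their definitions, as in the following plain-Python sketch. The function names and the sample values are hypothetical and serve only to show how the indicators relate to one another.

```python
from math import sqrt


def mse(y_true, y_pred):
    """Mean square error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: square root of the MSE."""
    return sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical normalized permeability values.
y_true = [0.2, 0.5, 0.9]
y_pred = [0.1, 0.5, 1.1]
errors = {"RMSE": rmse(y_true, y_pred),
          "MSE": mse(y_true, y_pred),
          "MAE": mae(y_true, y_pred)}
```

Because the RMSE is the square root of the MSE, a given relative reduction in RMSE always corresponds to a larger relative reduction in MSE, which is worth keeping in mind when the two are reported side by side.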

4. Results and Discussion

4.1. Feature Selection and Interpretation

The results obtained according to the Shapley values of the features at the global level are presented in Figure 7. In Figure 7(a), the sample points for each feature are plotted against the corresponding Shapley values, with high feature values being indicated by the degree of redness, and low feature values indicated by the degree of blueness. Here, the variance of the Shapley values of porosity (Por) is seen to be significantly higher than that of the other features, especially for different types of sample, thus enabling the model to distinguish between samples. Hence, this feature makes the largest contribution to the prediction of permeability. This understanding also conforms to the laws of petrophysics. Further, the features are ranked in order of importance and are plotted against the mean of the absolute Shapley value in Figure 7(b). Here, the Por, AC, CAL, and GR slope are seen to be the four most important features, and their average absolute Shapley values are all greater than 0.02. The other features contribute little to the model output and can thus be treated as redundant for the prediction of permeability. Consequently, only the Por, AC, CAL, and GR slope values are considered hereafter as suitable input variables.

At the local level, the use of the SHAP interpretation model to predict the permeability is demonstrated for two typical samples in Figure 8, where the red part represents features with positive Shapley values, and the blue part represents features with negative values. Moreover, the quantitative influence of each feature is revealed, as a larger area indicates a larger absolute Shapley value and, hence, a greater impact upon the predicted permeability. Finally, the sum of the Shapley values of all features plus the baseline value is the output result of the model, i.e., the predicted permeability of the sample. Taken together, the results in Figures 7 and 8 indicate that although different features have the greatest impact on permeability prediction for different samples, the feature with the highest global importance (i.e., the porosity) consistently has an important impact on the prediction results. For example, when the positive and negative contributions of each feature in Figure 8(a) are considered for sample number 20, the Por feature is seen to increase the permeability prediction of the model by 0.05, while the CAL, GR slope, and AC features decrease the permeability prediction by 0.03, 0.03, and 0.02, respectively. This gives a final prediction of 0.607 (Figure 9), which is close to the experimental value of 0.653.
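The additivity of the Shapley values can be illustrated with the rounded contributions quoted for sample number 20. The baseline value below is not reported in the text; it is inferred here under the assumption that the four listed features account for essentially all of the deviation from the baseline, so it should be read as approximate.

```latex
\sum_{j} \phi_j \approx 0.05 - 0.03 - 0.03 - 0.02 = -0.03,
\qquad
\phi_0 \approx 0.607 - (-0.03) = 0.637 .
```

In other words, the positive porosity contribution is outweighed by the three negative contributions, so the prediction for this sample sits slightly below the inferred baseline.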

4.2. Optimization Results

As the XGBoost framework contains multiple hyperparameters, the particle swarm optimization search method was used to automatically determine the optimal value of each of the three parameters with the greatest impact on the performance of the algorithm. The search interval of the parameters was based on the results of previous studies, the population size () was 25, the initial weight () and termination weight () were both 0.5, the learning factor was , and the maximum number of iterations () was 200. As detailed in Section 3.3, a 10-fold cross-validation process was used to verify the optimization results. As indicated in Table 1, the three optimized hyperparameters are quite different from the default values, thus, demonstrating the necessity of using the particle swarm optimization process. Further, the prediction results obtained using the XGBoost models before and after the hyperparameter optimization are presented in Table 2. Here, the significantly improved accuracy of the particle-swarm optimized (PSO) XGBoost model is clearly evidenced. Compared with the benchmark XGBoost framework, the PSO-XGBoost model reduces the RMSE, MSE, and MAE by 22.2%, 43.8%, and 22.3%, respectively.

4.3. Performance of the Proposed Model

To verify the validity and superiority of the interpretable PSO-XGBoost method for predicting the permeability of tight sandstone reservoirs, the results are compared with those obtained using the SVM, DT, RF, GBDT, and LightGBM approaches, each of which also uses the PSO method. The results presented in Table 3 and Figure 10 clearly confirm the accuracy and superiority of the proposed machine learning technology. Thus, the PSO-XGBoost framework has an average RMSE of 0.117, which is 25.6% lower than that of the SVR, 19.7% lower than that of the PSO-DecisionTree, 7.7% lower than that of the PSO-RandomForest, 6.8% lower than that of the PSO-GBDT, and 29.1% lower than that of PSO-LightGBM. Similarly, the average MSE of the PSO-XGBoost framework (0.016) is 50% lower than that of the SVR, 43.8% lower than that of the PSO-DecisionTree, 12.5% lower than that of both the PSO-RandomForest and the PSO-GBDT, and 62.5% lower than that of the PSO-LightGBM. Finally, the average MAE of the PSO-XGBoost framework (0.094) is 34.0% lower than that of the SVR, 27.7% lower than that of the PSO-DecisionTree, 7.4% lower than that of the PSO-RandomForest, 8.5% lower than that of the PSO-GBDT, and 33.0% lower than that of the PSO-LightGBM. These results fully demonstrate that the proposed interpretable PSO-XGBoost exhibits the smallest permeability prediction error rate.

4.4. Performance Comparison
4.4.1. Comparison with Traditional Methods

The linear fitting relationship between the core porosity and core permeability after data normalization is presented in Figure 11 and is given by the following equation:

After normalization, a multivariate regression analysis was performed using the following equation:

Using the linear and multivariate regression equations above and the test data, a 10-fold cross-validation was performed for comparative analysis, and the results are presented in Table 4. Here, the PSO-XGBoost model is seen to decrease the RMSE, MSE, and MAE by 25.6%, 50.0%, and 39.4%, respectively, compared to those of the linear regression model. Moreover, the PSO-XGBoost model decreases the RMSE, MSE, and MAE by 14.5%, 25.0%, and 23.4%, respectively, compared to those of the multivariate regression model.

In addition, a cross-plot of the actual and predicted core permeabilities obtained by the traditional mathematical regression method and the proposed PSO-XGBoost model is presented in Figure 12. Here, it can be seen that both the training set and the test set have a good match with the 45-degree line. Therefore, it can be concluded that the interpretable PSO-XGBoost model has better prediction accuracy with respect to the permeability of tight sandstone reservoirs than do the traditional mathematical regression methods.

4.4.2. Comparison with Deep Learning Technologies

In recent years, with the rapid development of deep learning, more and more network models have been proposed and widely used in the field of oil exploration and development. Hence, the permeability prediction performance of the proposed interpretable PSO-XGBoost model is compared with that of the three main deep learning models, namely, the CNN, LSTM, and GRU [46–48]. The accuracy of each deep learning model was increased via a grid search for hyperparameters, and the configuration parameters for each model are presented in Table 5. The 10-fold cross-validation method was then used to compare and analyze the various models using the test data set, and the results are presented in Table 6. Here, the CNN, LSTM, and GRU are all seen to perform poorly in predicting the permeability of tight sandstone reservoirs, with average RMSEs that are, respectively, 2.18, 3.75, and 3.62 times that of the PSO-XGBoost model. Similarly, the average MSEs of the CNN, LSTM, and GRU are, respectively, 8.63, 15.56, and 10.88 times that of the PSO-XGBoost model, and the average MAEs are 3.36, 4.73, and 4.40 times that of the PSO-XGBoost model.

From the perspective of error indicators, there is a big gap between the three deep learning models and the various types of machine learning model examined in previous sections (Table 3 and Figure 10), again indicating that the PSO-XGBoost gives the best performance. It is speculated that this is because the data set used in the present study was small, whereas the popular deep learning models are more suitable for larger data sets. It is difficult to obtain a good performance when there are only a few hundred data samples, and overfitting is likely. Therefore, it is suggested that machine learning algorithms should be preferred for small data sets, and that the interpretable machine learning method with the most obvious relationship between the features and the results should be selected. Thus, among all kinds of ensemble learning algorithms based on decision tree models (including the RF, GBDT, XGBoost, and LightGBM), the particle-swarm optimized XGBoost framework should be selected first.

4.5. Practical Application

Based on the high-precision permeability prediction of tight sandstone reservoirs in the study area, the interpretable PSO-XGBoost model was used to process Well X190. Well X190 is also a coring well and has been used in petrophysical experiments, but was not included in the sample data set. Hence, it can be used to test the prediction results of the PSO-XGBoost model. The results are presented in Figure 13, where Core_Por is the core porosity, Core_Perm is the core permeability, and Pred_Perm is the permeability predicted by the interpretable PSO-XGBoost model. The perforation section was 562–564.5 m, and its logging interpretation indicated an oil-water layer. The first-month oil production was 1.63 t/d, which meets the industrial oil flow standard of 0.85 t/d in the study area. The core plugs were located in the tight sandstone reservoir 569.1–570.1 m below the perforation section, which the logging interpretation indicates as a poor oil layer, i.e., an oil layer whose petrophysical properties are worse than those of the other conventional oil layers in the tight sandstone reservoir and which therefore has lower production. The upper limit of the petrophysical properties of the poor oil layer should be the lower limit of the conventional oil layers in the tight sandstone reservoir. According to the statistical analysis, the evaluation criteria for poor oil layers in the study area are a porosity of 9–10%, a permeability of , an RILD of 17–22 Ω, an AC of at least 217 μs/m, and a GR of no more than 90 API. The core permeability of the sample at 569.1–570.1 m is between and , and the predicted permeability is between and . These prediction results are consistent with the electrical characteristics of poor oil layers, thus, demonstrating the importance of the high-precision PSO-XGBoost model as an indicator for evaluating the permeability of tight sandstone reservoir types.

5. Conclusions

To improve the accuracy and interpretability of the permeability prediction of tight sandstone reservoirs, an improved XGBoost model based on the PSO algorithm and attributable interpretation was proposed herein. The following conclusions were drawn:

(1) The SHAP can not only explain the importance of input features from a global perspective, mine the key features that affect the permeability prediction, and reduce the dimensionality of the samples according to the key features, but can also clarify the quantitative contribution of each feature towards the permeability prediction based on the evaluation results of each individual sample. This unifies the global-local interpretability of the model and improves the credibility of the prediction results.

(2) Particle swarm optimization was used to realize the automatic optimization of the hyperparameters of the XGBoost framework. The results of the computational experiments based on sample data showed that, compared with the baseline XGBoost model, the prediction error of the PSO-XGBoost model is significantly reduced, and the accuracy of the permeability prediction is higher.

(3) The interpretable PSO-XGBoost model demonstrates powerful predictive capabilities. By comparing the predictions of five types of machine learning models, two types of mathematical regression methods, and three types of deep learning models, along with the RMSE, MSE, and MAE data of the measured and predicted values, it was concluded that the PSO-XGBoost model is better in predicting the permeability of tight sandstone reservoirs. The proposed method is clearly more accurate and practical for use with small data sets and lays a solid foundation for the evaluation of tight sandstone reservoirs.

(4) In future, the proposed improved model could be further studied from the following three aspects: (i) the petrophysical data and wireline logging data from multiple areas could be selected to further test the performance and explore the generalizability of the proposed model with respect to different data sets, (ii) strategies for adding more geological constraint information, such as the sedimentary characteristics of the reservoir, lithological changes, and the relative position of the sample wells, etc., could be developed in order to further improve the accuracy of the model, and (iii) it is possible to explore the reasonable predictions of reservoir parameters and evaluation methods of high-quality reservoirs in the areas far away from the wellbore by establishing a fine three-dimensional geological model of the study area.
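The particle swarm search over hyperparameters summarized in conclusion (2) can be illustrated with a minimal, generic sketch. This is not the authors' implementation: the function `pso_minimize`, the inertia and acceleration settings, and the toy objective are assumptions made for the example, and the quadratic objective merely stands in for a validation error that would in practice come from training and scoring an XGBoost model. The two tuned quantities are labelled as a learning rate and a maximum tree depth purely for illustration.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the bounds, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia term plus pulls toward personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clip it back into the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_validation_error(params):
    """Toy stand-in for a validation error surface (assumed, not from the study)."""
    learning_rate, max_depth = params
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 6.0) ** 2


best_params, best_error = pso_minimize(
    toy_validation_error, bounds=[(0.01, 0.5), (2.0, 12.0)])
```

To tune a real model, `toy_validation_error` would be replaced by a function that trains the model with the candidate hyperparameters and returns, for example, the cross-validated RMSE; integer-valued hyperparameters such as tree depth would need to be rounded inside that function.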

Data Availability

The data are available in this article.

Conflicts of Interest

The authors declare that there are no conflicts of interest.


Acknowledgments

The authors would like to thank Zichang Oil Production Plant for providing samples and data. This research was supported by the Fundamental Research Funds for the Central Universities (no. 300102278402).