Abstract

The use of machine learning for seismic interpretation is a growing area of interest for researchers. Manual interpretation demands time and specialized effort, and machine learning models can expedite the process. Convolutional Neural Networks (CNNs) are a class of deep learning algorithms widely used for image data. In this paper, seismic facies segmentation using the encoder-decoder architecture of CNNs is proposed. The proposed method fills a gap by using a multimodel approach for seismic interpretation; its novelty is that it is not limited to the current dataset or to particular semantic segmentation models. In the encoder-decoder architecture, the input and output sizes are the same, which allows the labelling of each pixel of the image. Four models are trained on the open-sourced F3 block Netherlands dataset, from which images were extracted. Data augmentation is used in two of the models to increase the data size for better model learning. Results of the individual models and their ensemble are compared. The ensemble is formed by averaging the class probabilities obtained from the trained models, and it gave the best results. Seven classes are segmented with a global pixel accuracy (GPA) of 98.52%, mean class accuracy (MCA) of 96.88%, and mean intersection over union (MIoU) of 93.92%.

1. Introduction

Discovery of new oil and gas reserves is strategic for countries. Seismic reflection surveying is used to obtain subsurface information for locating drilling sites in the oil and gas industry [1]. Structural and stratigraphic geometric features and potential hydrocarbon reservoirs can be inferred from the seismic reflections. Accurate interpretation of seismic amplitudes and zones is significant for the discovery of oil and gas [2]; it leads to fewer drills and significantly impacts the characterization of the reservoir. Seismic amplitude is manually interpreted by field experts and geophysicists to differentiate between bodies of rock (strata) with different physical properties [3]. The geophysicist must manually analyse the images generated from a seismic survey to mark the boundaries between different strata, known as horizons, and this process must be repeated for thousands of seismic images. The traditional method of manual interpretation for seismic delineation is time-consuming and prone to false positives [4]. Accurate interpretation saves economic resources, since substantial resources are spent on drilling the reservoirs.

For data and image classification, methods in computer vision and pattern recognition look for unique features, and their success depends on the feature extraction techniques [5]. Convolutional Neural Networks (CNNs) provide an alternative by automatically learning domain-specific features [6]. A CNN consists of a large number of convolutional layers, which learn the most useful features automatically and eliminate the need for manual feature extraction techniques. The initial layers of a CNN detect basic features like edges, colours, and shapes; the deeper layers then combine these basic features to extract more complex features that represent the whole image. This hierarchical structure in the convolutional nodes of CNNs identifies complex objects.

A deep CNN contains millions of parameters and requires a large amount of data for training the network. CNNs are used for various image applications such as image classification, object detection, localization, and object segmentation. Semantic segmentation networks extend CNNs into an encoder-decoder architecture [7]: the encoder is used for object detection and classification, and the decoder is used for object localization. Semantic segmentation is the process of taking an image and labelling each pixel in that image with a certain class.

Seismic facies interpretation is an important step for hydrocarbon exploration and exploitation [8]. A major issue in applying deep learning to seismic data is the lack of labeled datasets, and limited datasets undermine the potential of deep learning. Annotating seismic data is a time-consuming task that requires subject-level expertise, so very few labeled seismic datasets are available online. It is also not feasible to take a trained model from one location and apply it directly to another, because the structures and dynamics differ for each location. For these reasons, researchers have annotated their own seismic facies datasets [9–12]. The authors of [13] annotated the Netherlands F3 block (North Sea) dataset for training deep neural networks and open sourced it for further research; they used the 3D seismic reflection data in conjunction with well log data to manually annotate the 3D seismic dataset.

The North Sea contains hydrocarbon deposits, and the area is well studied. Part of the continental shelf of the North Sea lies in the waters of the Netherlands, where a rectangular survey area known as the F3 block is located. A 3D seismic survey was conducted there in 1987 to search for hydrocarbons and understand the lithostratigraphy of the area. The data was open sourced by dGB Earth Sciences, and the F3 dataset is now used extensively in research and studies [12].

Fully Convolutional Network (FCN), SegNet, U-Net, ENet, and DeepLab are a few of the semantic segmentation networks, each with its distinct architecture. DeepLab's variant DeepLabv3Plus and SegNet are fine-tuned in the proposed method using the pretrained encoders VGG-16 and ResNet-18. The results show a mean class accuracy of 96.88% and a mean intersection over union of 93.92%.

1.1. Artificial Intelligence

Machine intelligence is another name for artificial intelligence. When humans are born, they have natural sensors in their bodies. They start to observe the world, their parents first; listen to voices; feel the hotness or coldness of a body; and so on. With the help of such observations, they develop their responses and actions. This can be termed the training phase. With the passage of time, the brain develops, memory becomes stronger, and the actions and responses become well connected to previous experiences. This example is given as an analogy for understanding human intelligence.

Similarly, a computer can be equipped with sensors and storage and then trained by feeding it thousands of images and voices. From these images, the computer can learn the features that differentiate them. After the training phase, when an unseen image (one not used to train the computer) is shown to it, the computer can tell whether or not it belongs to a known class, with very promising results using modern AI techniques, especially deep CNNs.

1.2. Machine Learning

Machine learning (ML) brings the promise of deriving meaning from all the data presently available in the form of audio, images, and datasets. Google Search is one example: each time it is used, several ML models run in the background to understand and correct the text and to adjust the results according to user interests, based on previous searches and data collected from other applications the user is using. Several ML algorithms are available to train models, including Random Forest, Decision Tree, Naïve Bayes, SVM, and kNN, to name a few.

The application of computational techniques to seismic data has followed the development of the machine learning community. Initially, different mathematical features and attributes were calculated to assist geologists in making predictions. The growing popularity of machine learning then enabled researchers to feed these calculated features into different machine learning models to extract results. In the third stage, GPUs made it possible to build extremely complex models, which gave rise to Convolutional Neural Networks. With CNNs, there is no requirement for feature engineering, and images are fed directly to the network.

Initially, different computational techniques were used to classify an image by its geological attributes, such as an application of textural analysis to 3D seismic volumes [13]. In that paper, the authors combine image textural analysis with neural network classification to quantitatively map seismic facies in three-dimensional data. In 2011, seismic texture analysis was a developing concept in exploration geology and geophysics, and a large number of different algorithms were published in the literature. In [14], a review of seismic texture concepts and methodologies is presented, focusing on the latest developments in seismic amplitude texture analysis, with particular reference to the gray-level cooccurrence matrix (GLCM) and texture model regression (TMR) methods. Seismic images contain discontinuities with varying illumination and contrast. In [15], a solution using the congruency of phase in Fourier components is proposed; the algorithm shows far better and more efficient results in terms of accuracy compared to texture-based methods for salt dome boundary detection. A data-driven algorithm is proposed in [16], which overcame the limitations of existing texture-attribute-based salt dome detection techniques, namely their dependence on the relevance of the attributes to the geological nature of salt domes and on the number of attributes used for classification. The authors used a gray-level cooccurrence matrix (GLCM) with attributes extracted from Gabor filters to delineate salt domes in seismic data. In [17], seismic attributes are combined with their spatial locations for unsupervised seismic analysis using the fuzzy c-means algorithm; this method reduced the effect of seismic noise present in discontinuous regions.
A comprehensive evaluation of the accuracy and performance of three texture descriptors, Gabor filters, GLCM, and Local Binary Patterns (LBP), is presented in [18] for seismic image retrieval to assist the human interpreter in selecting a region of interest (ROI).

Before deep learning techniques became popular, features were hand engineered and fed into machine learning algorithms such as Random Forest regressors, Support Vector Machines, and boosting algorithms. In [19], an extremely randomized tree to automatically identify salt boundaries is presented; the method extracted signal amplitude, curve length, and second amplitude features for each voxel and made predictions using extremely randomized trees, with a postprocessing step added afterwards to further increase the accuracy. Reference [20] used a machine learning approach for classifying facies on 3D seismic reflection data of the North Sea. Fifteen different attributes were extracted for each pixel, such as reflector dip, continuity, and frequency range, and trained on twenty ML models such as Support Vector Machines (SVM), k-nearest neighbours, regression trees, and neural networks; the best result, an accuracy of 98.3%, was obtained using SVM. Reference [21] uses 3D seismic survey data from New Zealand and calculates four features (peak spectral frequency, GLCM, homogeneity, and curvedness); these four features are input to Artificial Neural Networks (ANN) and Support Vector Machines (SVM), and the ANN gives better accuracy on the test set than the SVM. Reference [22] uses six to seven features (four to five measured properties and two geologically derived features) and trains models using k-nearest neighbours (KNN), Naive Bayes, fuzzy logic, and ANNs; the ANN is the most effective model and gives the best result.

Unsupervised learning techniques like self-organizing maps and principal component analysis (PCA) are also used to classify lithostratigraphic zones. Reference [23] uses unsupervised techniques including k-means clustering, PCA, projection pursuit, vector quantization, and Kohonen self-organizing maps. Reference [24] uses competitive neural networks on seismic data to distinguish the distinct seismic behaviour of facies; heterogeneity of classes was indicated in the results, but the resulting classes were not labeled. In [25], Kohonen self-organizing maps are used to estimate the number of seismic facies and make maps, and the wavelet transform is used to detect seismic trace singularities. Others tried deep convolutional autoencoders (DCAE) for facies classification. Reference [26] uses a DCAE, as it can learn nonlinear, discriminant, and invariant features from unlabeled seismic data; the results show that the DCAE outperforms conventional methods like k-means or SOMs, and its results are of much higher resolution and highlight important information. Reference [27] uses a sparse autoencoder architecture that can detect major geological features from unlabeled seismic data; the model is tested on real and synthetic seismic data in order to extract relevant structures from the data.

The development of computational power led to a greater emphasis on supervised CNN algorithms for seismic applications. The supervised CNN approach can be divided into seismic classification and seismic segmentation methodologies. Seismic classification uses convolutional, max pooling, and fully connected layers to predict the class of the centre pixel of the image. In [28], a novel CNN consisting of six convolutional layers followed by a fully connected layer is presented and used for salt detection; cubes are extracted from the 3D seismic data, and the centre of each cube represents the class (salt or not salt) of that pixel. Reference [29] uses CNNs (vgg-16, ResNet-50, and Waldeland) to classify seismic images of the F3 dataset; the centre pixel of each image is classified, and this step is repeated until all pixels are classified. The results for the vgg-16 and Waldeland architectures show significant improvements in accuracy; however, the approach was found to be ineffective for ResNet-50. An investigation of fully supervised CNNs and semisupervised Generative Adversarial Networks (GANs) is presented in [8], with the models tested on realistic synthetic images; the results show that CNNs perform better in scenarios where abundant data is available, whereas GANs work better on new sites with limited data. Reference [30] uses a custom-built CNN architecture to detect faults from a 3D seismic cube; the input to the network consists of three orthogonal slices, and the voxel at their intersection is classified as fault or not fault. The network is trained using synthetic images and tested on both synthetic and real data, obtaining a classification accuracy of 74% on the real dataset.

In [31], an encoder-decoder structure is introduced for CNNs: the encoder learns the distinctive features from the image, and the decoder maps the features back semantically. In [2], a novel segmentation network (Danet-FCN) is proposed in which Danet is combined with an FCN for pixel classification; for validation, the F3 and Penobscot datasets are used, and a mean IoU of more than 98% is obtained on both. A modification of Danet-FCN is presented in [4], proposing the new architectures Danet-FCN2 and Danet-FCN3 by removing the fused connections; Danet-FCN3 improves the IoU to 99% on the Penobscot dataset. Reference [32] uses a U-Net architecture with dilated convolutions and soft attention mechanisms, where the soft attention mechanism allows the model to suppress noise and learn the main features; the authors trained the models from scratch, and the CNN results improve with the use of dilated convolutions and soft attention. Reference [33] presents work on the TGS salt classification dataset at Kaggle, using a semisupervised technique for salt classification with an ensemble of CNNs. An iterative process is used in which predictions at each stage are treated as pseudolabels and the network is retrained on the training data plus the confident pseudolabels; the ensemble of U-Net architectures with ResNet34 and ResNeXt50 encoders achieves an IoU of 0.896. In [34], a modified U-Net architecture is proposed to detect salt in seismic images; the model is trained and tested on a synthetic dataset, and the results show a mean IoU of 90.53%. A comparison of a 3D patch-based model and an encoder-decoder architecture on the F3 dataset is presented in [9]; the dataset is divided into 9 facies and manually interpreted for training, and the work concludes that the encoder-decoder model gives better results at near real-time speeds, at the expense of a long training time.
Reference [13] presents a fully annotated 3D geological model of the Netherlands F3 block, based on the study of the 3D seismic data together with 26 well logs and grounded in a careful study of the geology of the region. The study proposed two baseline models for facies classification based on a deconvolution network architecture and made the code publicly available. The first model is patch based: patches are extracted from the inlines and crosslines, and the model is trained on them. The second approach is section based: complete inlines and crosslines are fed to the model. Results show that the section-based model (MCA of 0.817) gives better results than the patch-based model (MCA of 0.705).

2.1. Dataset

A major challenge in seismic facies classification is the availability of annotated datasets. The data needs to be manually labeled by domain experts; the labelling process requires the availability of a geophysicist and is subject to human bias. Due to these limitations, researchers working on applications of deep learning have labeled their own datasets. The authors in [13] labeled the F3 Netherlands dataset and open sourced it for further research. In this paper, the same labeled F3 seismic dataset is used for model verification and results.

The Netherlands F3 block dataset is in 3D NumPy array format. First, the dataset is converted into image/label form (see Figure 1). There were a total of 22,368 images. The dataset is split into train, validation, and test sets with a ratio of 60%, 20%, and 20%, respectively (see Table 1); the split is performed randomly. The image is divided into 7 facies, which are as follows:
(i) Upper North Sea Group (upperNS)
(ii) Middle North Sea Group (middleNS)
(iii) Lower North Sea Group (lowerNS)
(iv) Rijnland/Chalk Group (rijanlandChalk)
(v) Scruff Group (scruff)
(vi) Zechstein Group (zechstein)
(vii) Background

3. Methodology

In this paper, two deep learning models for semantic segmentation are used: DeepLabv3Plus and SegNet. The models will be discussed along with a brief overview of semantic segmentation and data augmentation.

The main elements of the framework are
(a) Dataset preprocessing
(b) Selection of semantic segmentation networks
(c) Selection of pretrained CNNs to be used as encoders
(d) Ensemble technique

The dataset is converted from NumPy arrays to images and labels, which makes it readily usable on any platform. Images and corresponding labels are resized to suit the available hardware resources. Resizing the images to a suitable size is an iterative process to avoid GPU memory issues while training, and a suitable size is chosen through these iterations. The dataset is split into three parts, namely, training, validation, and testing.
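The conversion and split described above can be sketched as follows. This is a minimal NumPy sketch only: the volume shape, the slicing axis, and the image counts here are illustrative stand-ins, not the actual F3 dimensions.

```python
import numpy as np

# Hypothetical stand-in for the F3 volume; the real dataset is a larger
# 3D NumPy array (the shape below is illustrative only).
rng = np.random.default_rng(0)
volume = rng.random((40, 64, 64)).astype(np.float32)

# Slice the 3D volume along one axis into 2D images.
images = [volume[i] for i in range(volume.shape[0])]

# Random 60/20/20 split into train, validation, and test sets.
idx = rng.permutation(len(images))
n_train = int(0.6 * len(images))
n_val = int(0.2 * len(images))
train = [images[i] for i in idx[:n_train]]
val = [images[i] for i in idx[n_train:n_train + n_val]]
test = [images[i] for i in idx[n_train + n_val:]]
```

The same index permutation would be applied to the corresponding label volume so that each image keeps its label after the split.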

3.1. Semantic Segmentation

Semantic segmentation is the process of categorizing each pixel of an image into a class. It is applicable to a variety of computer vision tasks like autonomous driving and medical image diagnosis and is one of the most challenging problems in computer vision. In a classification problem, the objective is to predict the presence of an object in the image [35]; in a segmentation problem, by contrast, the objective is to predict the class of each pixel within the image. For autonomous driving applications, for example, the image needs to be segmented into different objects like cars, pedestrians, roads, and road signals.

In [31], the use of CNNs for semantic segmentation is proposed. The authors use skip connections to join the semantic information from a deep layer with the localization information from a shallow layer to produce a pixel-wise segmentation of the image. The decoder part is implemented as bilinear interpolation. The method improved the PASCAL VOC results by 20%; however, one of its major drawbacks is that it tends to ignore small objects.

In [36], a deep deconvolutional network for decoding, consisting of deconvolution, unpooling, and activation layers, is presented; the model performs well on PASCAL VOC 2012 segmentation with an accuracy of 72.5% (see Figure 2). In the encoder-decoder architecture used for semantic segmentation in this paper, the encoder phase is similar to a conventional CNN classification model and consists of multiple convolutional and pooling layers. Each convolutional layer first convolves its input and then applies batch normalization and an activation function. The pooling layer downsamples the image: a sliding window is passed over the image and summarizes the information by selecting the minimum, maximum, or average of the window. The encoder is used to classify the objects within an image, and at the end of the encoder stage a low-resolution feature map is obtained. The encoder is followed by a decoder, which works in the opposite manner: it consists of multiple upsampling and convolutional layers that bring the output back to the same size as the input. The decoder is used for the localization of objects, generating the boundaries of the objects within the image.
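The symmetric size behaviour described above (encoder downsamples to a low-resolution feature map, decoder restores the input resolution) can be illustrated with a toy NumPy sketch. This is not the paper's network, only a shape-level illustration: max pooling and nearest-neighbour repetition stand in for the learned convolutional stages.

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample by taking the max over non-overlapping 2x2 windows."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_2x2(x):
    """Upsample by nearest-neighbour repetition, mirroring the pooling."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

image = np.arange(64, dtype=np.float32).reshape(8, 8)
# "Encoder": two pooling stages produce a low-resolution feature map.
feat = max_pool_2x2(max_pool_2x2(image))   # 8x8 -> 2x2
# "Decoder": two upsampling stages restore the input resolution,
# so a class label can be assigned to every pixel.
out = upsample_2x2(upsample_2x2(feat))     # 2x2 -> 8x8
```

Because the output has the same height and width as the input, per-pixel labelling becomes possible, which is exactly why the encoder-decoder layout suits segmentation.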

For computer vision problems of semantic segmentation, the encoder-decoder architecture gave better results than other CNN architectures. We are using two of the encoder-decoder architectures that are DeepLabv3Plus and SegNet.

3.2. DeepLabv3Plus

DeepLab is one of the popular semantic segmentation architectures and is used in [37]. DeepLab is a model designed and open sourced by Google. It uses Atrous convolution in place of deconvolutional networks; Atrous convolution enlarges the field of view of filters without increasing the number of parameters or the amount of computation. Multiple Atrous convolutions are used in parallel to capture contextual information at multiple scales.

A deep convolutional network consists of multiple layers, due to which information about smaller-scale objects is lost, because the input feature map shrinks as we move deeper into the network. Atrous convolutions are used in the convolutional layers of DeepLab to counter this problem. Atrous convolution has an additional parameter, the rate r, which is the stride at which the input signal is sampled; normal convolution is the specific case r = 1. In [38], denser features are extracted by the use of Atrous convolutions without the need for extra parameters.
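The rate parameter can be illustrated with a small 1D sketch. The helper below is an illustrative NumPy implementation of atrous (dilated) cross-correlation, not DeepLab's actual code: with rate 1 it reproduces the standard valid-mode correlation, while a larger rate widens the receptive field of the same 3-tap kernel without adding parameters.

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1D atrous (dilated) cross-correlation: the input is sampled with
    stride `rate` between kernel taps. rate=1 is ordinary convolution."""
    span = rate * (len(w) - 1)          # receptive field minus one
    out_len = len(x) - span
    return np.array([sum(x[i + rate * k] * w[k] for k in range(len(w)))
                     for i in range(out_len)])

x = np.arange(10, dtype=float)
w = np.array([1.0, 2.0, 1.0])

# rate = 1 matches the standard valid cross-correlation.
r1 = atrous_conv1d(x, w, 1)
# rate = 2: the same 3 weights now span 5 input samples.
r2 = atrous_conv1d(x, w, 2)
```

The parameter count is identical in both calls; only the spacing of the sampled inputs changes, which is the "enlarged field of view at no extra cost" property the text describes.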

DeepLabv3Plus consists of multiple Atrous convolutions (see Figure 3). A different Atrous rate is applied to each of the convolutional layers, which enables feature extraction at different resolutions and thus the extraction of denser features. DeepLabv3Plus obtained an accuracy of 89% on the PASCAL VOC 2012 test dataset.

This research work uses the upgraded DeepLabv3Plus; the results show improvement along object boundaries. ResNet-18 is used as the encoder for DeepLabv3Plus. Two variants of the model are trained: one with data augmentation and the other without.

3.3. SegNet

SegNet is a semantic segmentation model developed by the University of Cambridge [7]. Its encoder consists of the 13 convolutional layers of the VGG16 network. At the decoder stage, it performs nonlinear upsampling using the indices computed in the max-pooling step of the corresponding encoder layer.

The major difference between SegNet and other encoder-decoder architectures is the way it upsamples data. During downsampling in the encoder stage, SegNet stores the pooling indices; during the decoder stage, these indices are used to place values back in their original positions from before the downsampling. In [7], this scheme is presented: the pooling-index information is passed on to the upsampling stage to produce dense feature maps (see Figure 4). The U-Net presented in [39], on the other hand, transfers the entire feature map from the encoder to the decoder, which requires greater memory and training time. The first advantage of SegNet is that the upsampling layer in the decoder keeps the high-frequency details intact. The second advantage is that convolutional layers are used in place of fully connected layers; the convolutional layers can remember the indices of image features, as discussed in [38, 40–43].
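The index-passing mechanism can be sketched in NumPy. This is a simplified single-channel illustration of the idea (not SegNet's actual implementation): the "encoder" pooling records where each maximum came from, and the "decoder" unpooling places each value back at exactly that position, leaving the rest zero.

```python
import numpy as np

def pool_with_indices(x):
    """2x2 max pooling that also records the flat index of each maximum,
    analogous to what SegNet's encoder stores."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2), dtype=x.dtype)
    indices = np.zeros((h // 2, w // 2), dtype=np.int64)
    for i in range(h // 2):
        for j in range(w // 2):
            window = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            k = int(window.argmax())
            pooled[i, j] = window.flat[k]
            indices[i, j] = (2 * i + k // 2) * w + (2 * j + k % 2)
    return pooled, indices

def unpool_with_indices(pooled, indices, shape):
    """SegNet-style unpooling: place each value back at its recorded
    position; all other locations stay zero (a sparse feature map)."""
    out = np.zeros(shape, dtype=pooled.dtype).ravel()
    out[indices.ravel()] = pooled.ravel()
    return out.reshape(shape)

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 6., 3., 1.],
              [2., 1., 7., 2.]])
p, idx = pool_with_indices(x)
y = unpool_with_indices(p, idx, x.shape)
```

Only the small index array travels from encoder to decoder, which is why this needs less memory than U-Net's transfer of entire feature maps.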

In this paper, SegNet with VGG16 encoder is used in two variants. One of the variants is with data augmentation, and the other is without data augmentation.

3.4. Data Augmentation

CNNs require a large quantity of data to learn from images and perform well. Data augmentation is a technique for increasing the size of a dataset from the original data: new artificial training data is created by transforming the original training data. The transformations include a range of operations such as flipping, rotation, padding, scaling, cropping, and colour changes.

The purpose of data augmentation in the models presented in this research work is to increase the data size, which allows the models to generalize and learn better from the data (see Figure 5 for the effect of reflection on the data). Translation and reflection are used as the data augmentation techniques.
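The two augmentations used here, translation and reflection, can be sketched as follows. This is an illustrative NumPy helper, not the training pipeline's actual code; note that `np.roll` wraps pixels around, standing in for a padded translation, and that the same transform must be applied to the image and its label so they stay aligned.

```python
import numpy as np

def augment(image, label, rng):
    """Randomly apply horizontal reflection and a small translation,
    identically to the image and its per-pixel label map."""
    if rng.random() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]   # reflection
    shift = int(rng.integers(-4, 5))                    # translation in pixels
    # Wrap-around roll used here as a stand-in for padded translation.
    image = np.roll(image, shift, axis=1)
    label = np.roll(label, shift, axis=1)
    return image, label

rng = np.random.default_rng(0)
img = np.arange(64, dtype=np.float32).reshape(8, 8)
lab = (img > 31).astype(np.int64)
aug_img, aug_lab = augment(img, lab, rng)
```

Each pass over the training set can draw fresh random transforms, effectively multiplying the amount of training data the model sees.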

3.5. Application

In this paper, 5 models are developed for seismic facies segmentation of the images generated from the F3 Netherlands block:
(i) DeepLabv3Plus with ResNet-18 encoder (without augmentation)
(ii) DeepLabv3Plus with ResNet-18 encoder (with augmentation)
(iii) SegNet with VGG16 encoder (without augmentation)
(iv) SegNet with VGG16 encoder (with augmentation)
(v) Ensemble of the models 1-4

Two semantic segmentation networks, DeepLabv3Plus and SegNet, are used, both based on the encoder-decoder architecture. Two models are trained using DeepLabv3Plus with ResNet-18 as the encoder: one without data augmentation and one with data augmentation (translation and reflection). Two more models are trained using SegNet with vgg16 as the encoder, again one with augmented training data and one without. An ensemble of the four models is then created: for each pixel, the predicted scores of the four models for each class are averaged, and the class with the highest average probability is assigned to the pixel (see Figure 6). Results of the individual models and the ensemble are compared.
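The averaging ensemble can be sketched in a few lines of NumPy. The softmax score maps below are random placeholders standing in for the four trained models' outputs; the shape convention (models, classes, height, width) is an assumption for illustration.

```python
import numpy as np

# Hypothetical per-pixel class probabilities from four trained models:
# shape (n_models, n_classes, height, width).
rng = np.random.default_rng(1)
scores = rng.random((4, 7, 8, 8))
scores /= scores.sum(axis=1, keepdims=True)   # normalise to probabilities

# Ensemble: average the class probabilities across models, then take the
# class with the highest average probability for each pixel.
avg = scores.mean(axis=0)                     # (n_classes, h, w)
prediction = avg.argmax(axis=0)               # (h, w) class map
```

Averaging lets a wrong prediction by one model be outvoted by the others, which is why errors at class boundaries tend to shrink in the ensemble.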

The numbers of pixels of the seven classes in the generated training patches are imbalanced (see Figure 7). This imbalance is detrimental to the learning process because learning becomes biased in favor of the dominant classes. Class weighting is used to handle this issue: median frequency class weights were calculated (see Table 2).
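Median frequency balancing can be sketched as follows. This is an illustrative NumPy implementation of the standard scheme (assumed, not taken from the paper's code): the frequency of a class is its pixel count divided by the total pixels of the images in which it appears, and each weight is the median frequency divided by the class frequency, so rare classes receive larger weights.

```python
import numpy as np

def median_frequency_weights(label_images, n_classes):
    """Median frequency balancing: weight(c) = median(freq) / freq(c),
    where freq(c) = pixels of class c / total pixels of images containing c."""
    counts = np.zeros(n_classes)
    present_pixels = np.zeros(n_classes)
    for lab in label_images:
        for c in range(n_classes):
            n = np.count_nonzero(lab == c)
            if n > 0:
                counts[c] += n
                present_pixels[c] += lab.size
    freq = counts / present_pixels
    return np.median(freq) / freq

# Tiny hypothetical label maps: class 0 dominates, class 1 is rare.
labels = [np.array([[0, 0], [0, 1]]), np.array([[0, 0], [1, 1]])]
w = median_frequency_weights(labels, 2)
```

Using these weights in the loss makes a misclassified pixel of a rare class cost more, counteracting the bias toward dominant classes.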

The weights in the pixel classification layers of both DeepLabv3Plus and SegNet are replaced with the calculated median class weights (see Table 2) to compensate for the class imbalance so that training is unbiased. The training options used for the four models are as follows: the Adam optimizer is used for weight optimization, the initial learning rate is set to 0.001, the squared gradient decay factor is set to 0.99, and the minibatch size is 32. Due to the pretrained encoders, training for all four models converged in a few epochs. Training of each model was started with ten epochs but, as there was no change in accuracy and RMSE after five epochs, it was stopped early.

The validation loss and accuracy plots over epochs are presented in Figures 8 and 9 for the DeepLabv3Plus and SegNet models, respectively. There is no change in RMSE and accuracy for any of the models after five epochs, so early stopping is applied to all four models at five epochs.

4. Results

To assess the performance of the proposed architecture, the following evaluation metrics are used.

(i) Class accuracy (CA): the percentage of correctly classified pixels in a class c,
CA_c = |P_c ∩ T_c| / |T_c|,
where c represents the class, P_c the predicted pixels of class c, and T_c the true pixels of class c.

(ii) Intersection over union (IoU): this metric measures how well the predicted pixels overlap the ground truth,
IoU_c = |P_c ∩ T_c| / |P_c ∪ T_c|.
An output of 1 means a perfect match, and 0 means no match.

(iii) Mean intersection over union (MIoU): the mean of the IoU over all classes,
MIoU = (1/N) Σ_c IoU_c,
where N is the total number of classes.
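These metrics can be computed directly from predicted and ground-truth label maps. The sketch below is an illustrative NumPy implementation (the 2x2 label maps are hypothetical); classes absent from both maps are skipped, matching the NaN convention described later in the results.

```python
import numpy as np

def segmentation_metrics(pred, true, n_classes):
    """Return global pixel accuracy (GPA), mean class accuracy (MCA), and
    mean IoU from predicted and ground-truth label maps. Classes present
    in neither map are excluded (the NaN convention)."""
    ca, iou = [], []
    for c in range(n_classes):
        p, t = pred == c, true == c
        if not p.any() and not t.any():
            continue                          # NaN class: excluded from means
        inter = np.count_nonzero(p & t)
        union = np.count_nonzero(p | t)
        ca.append(inter / max(np.count_nonzero(t), 1))
        iou.append(inter / union)
    gpa = np.count_nonzero(pred == true) / pred.size
    return gpa, float(np.mean(ca)), float(np.mean(iou))

# Hypothetical 2x2 example: one pixel of class 0 is mispredicted as class 1.
true = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
gpa, mca, miou = segmentation_metrics(pred, true, 3)
```

Here class 0 has CA 1/2 and IoU 1/2, class 1 has CA 1 and IoU 2/3, and class 2 is skipped, giving GPA 0.75, MCA 0.75, and MIoU 7/12.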

Moreover, the results of the individual models are presented in Tables 3–6. The results show a comprehensive analysis of the presented models in terms of accuracy and intersection over union.

In Table 7, the results of the ensemble of the four proposed models are presented. The ensemble gives the best result for GPA, MCA, and MIoU. Global pixel accuracy (GPA) is the percentage of pixels over all classes that are correctly classified. Mean class accuracy (MCA) is the average of the class accuracy over all classes, where the class accuracy of a class is the percentage of pixels of that class that are correctly classified. Intersection over union (IoU) measures the overlap between the two sets and equals 1 if and only if all pixels are correctly classified; averaging IoU over all classes gives the mean intersection over union (mean IoU).

The MCA and MIoU achieved with the ensemble method are 0.9655 and 0.9392 which are the highest amongst all five of the models. This shows that using an ensemble of various models is a useful technique that improves the results. The error in individual models generally occurs at the boundaries of various classes. The ensemble takes the average for each pixel, by which a wrong prediction made by one model can be compensated by the right prediction made by the other models.

Amongst the individual models, DeepLabv3Plus with a ResNet-18 base network and data augmentation gives the best results, with an MCA and MIoU of 0.9655 and 0.9355, respectively. For both DeepLabv3Plus and SegNet, the results with data augmentation are better than those without. This shows that augmentation is a useful technique for increasing the data size in seismic applications: the models learn better from the increased data and give better results.

A random image and its labels are taken from the test set to calculate MIoU using individual models and ensemble (see Figure 10). Ensemble did not give the highest MIoU on this random test image (see Table 8).

Visual results of the ensemble are better than those of the other models (see Figure 11), because the ensemble achieves the highest IoU (highlighted in italic) for two classes, whereas each of the other models achieves the highest IoU for at most one class. In Figure 11, where the predicted pixels of a class extend beyond the boundary of that class, those pixels are marked in green; where the predicted pixels of a class fall short of the boundary, the gap is marked in magenta.

When pixels of a certain class are present in both the prediction and the ground truth but in nonoverlapping regions, the IoU is calculated as 0. When a class is present in neither the prediction nor the ground truth, its IoU is recorded as NaN, and classes with NaN IoU are not included in calculating GPA, MCA, or MIoU.

5. Conclusions

The ensemble of semantic segmentation networks gave better results than the individual models. For future work, an ensemble of Fully Convolutional Network (FCN), SegNet, U-Net, ENet, and DeepLabv3Plus with ResNet-50 and ResNet-101 as encoders is proposed. The dataset in image form is open sourced so that researchers may try other semantic segmentation networks and ensemble them.

Automatic seismic facies segmentation proves to be a promising alternative to manual labelling of seismic facies by geologists. Manual labelling requires a high degree of subject expertise, which can introduce bias into the results; the process is complicated and laborious and requires a high degree of accuracy. It is not feasible for geologists to label an entire area, so labelling is usually performed on only a few of the images or portions of the block. These limitations can be countered by using deep learning techniques, since they prove to give accurate results: deep learning requires labelling only some of the images, on which the models can be trained, after which the trained models can be applied to the complete area to make predictions.

6. Future Work

For future work, pretrained CNNs like VGG-19, ResNet-34, ResNet-50, ResNet-101, ResNet-152, Inception-v1, Inception-v3, SE-ResNet, ResNeXt, SENet-154, DenseNet-121, DenseNet-169, and DenseNet-201 can be used as encoders in semantic segmentation networks like U-Net, LinkNet, PSPNet, and FPN. Moreover, the average ensemble results can be compared with a voting ensemble. The success of the proposed architecture encourages experimenting not only with other semantic segmentation networks and encoders but also with the data augmentation. In the proposed architecture, the cross-entropy loss function is used; it can be replaced with other loss functions to check whether an improvement in IoU can be achieved.

Data Availability

The dataset is publicly available in [12].

Conflicts of Interest

The authors declare that they have no conflicts of interest.