Abstract

Semantic segmentation has been a significant research topic for decades and has been employed in several applications. In recent years, work on semantic segmentation in computer vision has centered on deep learning approaches, which aim for superior efficiency in analyzing aerial and remote-sensing images. The main aim of this review is to provide a clear algorithmic categorization and analysis of the diverse contributions to semantic segmentation of aerial images and to give comprehensive details of the recent developments. The emerging deep learning methods have demonstrated much improved performance on several public datasets, and incredible effort has been dedicated to advancing pixel-level accuracy. Hence, the datasets used in each contribution are studied, and the best performance measures achieved by the existing semantic segmentation models are evaluated. This survey can thus help researchers understand the development of semantic segmentation in a shorter time, simplify understanding of its latest advancements, research gaps, and challenges, and serve as a reference for developing new semantic image segmentation models in the future.

1. Introduction

Semantic segmentation is an image analysis task that assigns a label to each pixel of an input image to describe the class of its enclosing region [1]. Semantic segmentation of aerial images assigns one land cover category to each pixel, a complex task owing to the huge variation in the appearance of ground objects. Several works have been presented in recent years [2]. Earlier approaches to semantic segmentation relied on hand-crafted features, which failed to achieve satisfactory performance and were restricted by the representational ability of the features [3]. Compared with object detection and image classification, semantic segmentation is the highest level of the image analysis process, as it yields complete scene information about the entire input image [4]. In several remote-sensing tasks, semantic segmentation is treated as pixel-wise classification [5]. Semantic segmentation of aerial imagery has been employed in diverse applications such as hazard identification and avoidance, traffic management and evaluation, and urban area planning and monitoring [6]. However, the growth of semantic segmentation techniques stagnated for years owing to the low accuracy of image analysis methods based on the extraction of hand-crafted features [7].

Aerial and satellite imagery have been utilized in different applications such as regional planning, cartography, landscaping, and agriculture [8]. In 2020, Maddikunta et al. [9] focused on the applications, requirements, and challenges of images captured from UAVs for smart agriculture systems. Multirotor UAVs are usually used for airborne surveillance, photography, and other similar tasks; they are the simplest to produce and the least expensive of all types of UAVs. These images cover the visible colors and other spectra. There is also elevation imagery, which is generally prepared through light detection and ranging (LiDAR) and radar [10]. Moreover, alongside the emergence of satellite and aerial images, remote sensing has also been implemented. Remote-sensing images are gathered from a remote object by a device that has no physical contact with the object [11]. Even in recent years, data analysis and interpretation are still performed by human experts. Although semantic segmentation offers superior abilities in object detection, deploying it in real use cases remains difficult [12]. In 2020, Ch et al. [13] addressed the security and privacy of UAV data using blockchain technology. The value of virtual circuit (VC)-based devices (UAVs, drones, and similar other IoT-based devices) has grown tremendously in recent years; these gadgets are mostly utilized for aerial surveying in sensitive and isolated locations. Object detection in aerial images is complex owing to the bird's-eye view, which introduces huge variations in orientation, highly nonuniform object densities, large aspect ratios, and scale variations of objects. Further challenges in detecting objects in aerial images include limited GPU memory capacity, the need to downsample large images, and the lack of inference on large images [14]. In aerial images, sensor and resolution differences are factors that produce dataset biases [15]. Standard datasets are prepared by collecting images at several resolutions from different platforms and sensors, including aerial images, satellite images, the Gaofen-2 (GF-2) satellite, and Google Earth [16].

Currently, many DL applications are in use all over the world: healthcare, social network analysis, audio and speech processing (such as recognition and enhancement), visual data processing (such as multimedia data analysis and computer vision), and NLP (translation and sentence classification) are examples. These applications fall into five groups: classification, localization, detection, segmentation, and registration. Although each of these tasks has its own aim, as seen in Figure 1, there is significant overlap in the pipeline implementation of these applications.

In recent years, semantic segmentation has been taken up by deep learning approaches, which have attained high efficiency in diverse conventional computer vision applications, including the detection and classification of objects as well as semantic segmentation [15]. These approaches automatically derive features customized for the classification task, which makes them suitable options for managing complex cases [17]. Their huge success in other fields has motivated the extension and adoption of deep learning approaches for solving challenges in remote sensing. Although deep learning offers noteworthy performance, assigning meaningful labels to the components of a remote-sensing image remains difficult [18]. The large number and enormous variety of modalities of remote-sensing data have made deep neural networks well suited to feature extraction [19]. They also offer great benefits to practitioners and researchers who require less programming-intensive tools for high-level data analysis that remain understandable in the geosciences [20]. In 2021, Kumar et al. [21] presented a secure privacy-preserving framework for smart agriculture unmanned aerial vehicles in both blockchain and nonblockchain settings. Balamurugan et al. [22] presented direction-of-arrival (DOA) tracking for seamless connectivity in beamformed IoT-based drones, improving their communication and beamforming performance.

The primary and significant deep learning approaches, comprising restricted Boltzmann machines, autoencoders, and convolutional neural networks (CNNs), have focused on understanding satellite or aerial imagery [23]. Hence, this study reviews several semantic segmentation models with diverse deep learning algorithms to inform future work.

The major contributions of this survey are (i) to provide a detailed survey of existing semantic segmentation models for diverse imaging modalities in recent years by gathering the noteworthy information from each semantic segmentation model along with its machine learning and deep learning algorithms, (ii) to present a comprehensive study of the datasets, simulation platforms, chronological review, performance metrics, features, and challenges of the conventional semantic segmentation models and the algorithms they focus on, and (iii) to identify the research gaps and the limitations of existing semantic segmentation systems, motivating researchers to design new semantic segmentation models.

The remaining sections of this survey are organized as follows. Section 2 discusses the literature review of state-of-the-art semantic segmentation models. Section 3 presents the algorithmic categorization and the features and challenges of existing semantic segmentation models. Section 4 describes the simulation platforms and dataset descriptions for conventional semantic segmentation models. Section 5 demonstrates the performance measures and the best accuracy rates attained by the conventional semantic segmentation models. Section 6 gives the research gaps and challenges. Section 7 concludes this survey.

2. Literature Review on State-of-the-Art Semantic Segmentation Models

2.1. Literature Survey

In 2015, Saito et al. [24] utilized a CNN to train pixel labeling to extract building areas for semantic segmentation of aerial images; they then used Dijkstra's algorithm to discover the optimal seam line as the shortest path on the map. In 2016, Marmanis et al. [25] described a semantic segmentation model for high-resolution aerial images using an ensemble of CNNs (an FCN and modified CNNs), showing superior efficiency on a standard dataset. In 2017, Holliday et al. [26] addressed semantic segmentation by applying model compression techniques to obtain superior segmentation accuracy, also using a ConvNet to determine the significance of segmentation.

In 2018, Chen et al. [27] suggested shuffling CNNs for semantic segmentation of aerial images in a periodic shuffling manner and also proposed a field-of-view enhancement to improve the predictions; this model attained effective and promising results on two datasets. In 2018, Yu et al. [28] designed an end-to-end scheme for semantically segmenting high-resolution aerial images, combining a CNN structure with a pyramid pooling phase to extract feature maps at diverse scales. In 2018, Chen et al. [29] exploited digital surface models (DSMs) and presented the deeply supervised shuffling convolutional neural network (DSCNN) for efficient upsampling of feature maps and the acquisition of multiscale features. In 2018, Volpi and Tuia [30] suggested a semantic segmentation model for aerial images that learns shallow-to-deep visual features, semantic boundaries across classes, and semantic class likelihoods through a multitask CNN; here, the top-down and bottom-up information was combined and encoded with a conditional random field model. In 2018, Sun et al. [31] implemented a new semantic segmentation model from LiDAR data and high-resolution aerial images using a multifilter CNN to offer multiresolution segmentation; it also delineated object boundaries to reduce salt-and-pepper artifacts. In 2018, Kemker et al. [32] designed a semantic segmentation method using DCNNs on multispectral remote-sensing images, achieving efficient performance on the RIT-18 dataset. In 2018, Marmanis et al. [33] designed a semantic segmentation model for high-resolution aerial images, applying a DCNN to represent and extract the boundaries among regions of diverse semantic classes. In 2018, Vo and Woong [34] designed a semantic segmentation method by investigating the effects of network depth and a cascaded framework of dilated convolutions, which improved localization efficiency; this model trained efficiently.

In 2019, Peng et al. [35] presented a new architecture combining dense connections and fully convolutional networks (FCNs) to provide fine-grained semantic segmentation maps for remote-sensing images; the suggested model achieved state-of-the-art efficiency on two datasets without any postprocessing or pretraining. In 2019, Luo et al. [36] proposed a new deep FCN with a channel attention mechanism (CAM-DFCN) for semantic segmentation of high-resolution aerial images, which includes an encoder-decoder architecture and facilitates the integration of multilevel feature maps; it offered accurate segmentation by providing spatial location information and weighted semantic information. In 2019, Li et al. [37] designed a road segmentation system combining adversarial networks with multiscale context aggregation, focusing on extracting roads from UAV remote-sensing images; the model used morphological techniques to eliminate small independent patches from the results. In 2019, Azimi et al. [38] designed a symmetric FCN enhanced with the wavelet transform to segment lane markings from aerial imagery, using a customized loss function to improve the accuracy of pixel-wise localization. In 2019, Wang et al. [39] designed semantic segmentation of UAV-taken images to generate defect detection outcomes by applying matrix operations with a segment connection technique that connects the segment features of objects; an artificial contour segment feature generator with a background filter was used for line accessory detection, which enhanced the detection efficiency. In 2019, Cao et al. [40] suggested digital surface model fusion (DSMF) to improve the semantic segmentation results, along with four end-to-end networks named DSMFNets, to achieve high overall accuracy in segmenting high-resolution aerial images. In 2019, Nguyen et al. [41] suggested MAVNet, a deep neural network for semantic segmentation on microaerial vehicles (MAVs), which demonstrated superior efficiency on standard datasets. In 2019, Guo et al. [42] integrated super-resolution approaches, the efficient subpixel convolutional neural network (ESPCN) and UNet, to improve segmentation efficiency on remote-sensing imagery, attaining significantly more precise and highly accurate segmentation results. In 2019, Igonina and Tiumentseva [43] focused on identifying known neuroarchitectures to solve problems that persist in remote sensing of the Earth's surface, including semantic segmentation of UAV images. In 2019, Wu et al. [44] studied the attention dilation-LinkNet (AD-LinkNet) neural network, adopting an encoder-decoder framework with a pretrained encoder, a channel-wise attention scheme, and serial-parallel integrated dilated convolutions for semantic segmentation of high-resolution satellite images. In 2019, Masouleh and Shah-Hosseini [45] presented a Gaussian–Bernoulli restricted Boltzmann machine (GB-RBM) for the semantic segmentation of UAV-based thermal infrared images, evaluating efficiency in terms of average processing time and average precision for the extraction of ground vehicles on roads. In 2019, Audebert et al. [46] introduced a regression-based semantic segmentation regularization model through a distance transform, in which the FCN was trained in both continuous and discrete spaces through joint classification and distance regression learning. In 2019, Mohammadi et al. [47] implemented a semantic segmentation model for polarimetric synthetic aperture radar images using an FCN architecture, which extracted discriminative polarimetric features for finding wetlands in complex land cover ecosystems. In 2019, Hua et al. [48] presented a CNN for processing the extracted features to enhance the efficiency of semantic segmentation of aerial images, using two modules, a patch attention module and an attention embedding module, to obtain the significant information of low-level features. In 2019, Panboonyuen et al. [49] designed a global convolutional network (GCN) for semantic segmentation of remotely sensed images, extracting multiscale features from diverse phases of the network.

In 2020, Liu et al. [50] proposed a semantic segmentation model for high-resolution remote-sensing images using a multichannel segmentation network termed DAPN, which fully extracts the multiscale features of the images while retaining the spatial features of the objects. In 2020, Mou et al. [51] considered two efficient networks, a channel relation module and a spatial relation module, for learning and reasoning about the global correlations among feature maps or positions; the suggested model was termed a relation module-equipped FCN. In 2020, Wang et al. [52] designed a context and semantic enhanced high-resolution network (CSE-HRNet) with two comprehensive processes for tackling the intraclass heterogeneity problem and enhancing the representational ability of multiscale contexts. In 2020, Martinez-Soltero et al. [53] utilized a CNN for terrain detection in aerial images, aiming to solve robot mapping and navigation tasks with pixel-level segmentation to generate a highly detailed map. In 2020, Jiawe et al. [54] proposed a real-time semantic segmentation model, designing a new asymmetric depth-wise separable convolution network (ADSCNet) that offers better prediction efficiency. In 2020, Deng et al. [55] developed a semantic segmentation network for UAV images for real-time weed mapping, reducing the time gap between image collection and herbicide treatment; this model focused on implementing a hardware system with combined processes. In 2020, Niu et al. [56] designed a new hybrid multiple attention network (HMANet) to adaptively capture global relationships by computing category-based relations and recalibrating class-level details; this study also introduced an efficient region shuffle attention (RSA) module to enhance the effectiveness of semantic segmentation. In 2020, Chai et al. [57] proposed a semantic segmentation model for high-resolution aerial images that addresses the problem of learning spatial context through deep CNNs (DCNNs); the model predicts a distance map rather than a score map for every class, which enhanced the segmentation efficiency. In 2020, Song et al. [58] offered a sunflower lodging detection method for remote-sensing images based on deep semantic segmentation and fusion of UAV images, attained with an improved SegNet. In 2020, Diakogiannis et al. [59] suggested a reliable framework, ResUNet-a, for semantic segmentation of high-resolution aerial images with a dice loss function and a UNet encoder-decoder network. In 2020, Ye et al. [60] introduced the UAVid dataset for semantic segmentation of urban scenes through ensemble learning, including multispectral dilation with feature space optimization (FSO). In 2020, Bianco et al. [61] suggested a semantic segmentation model for detecting road participants and road lanes through a multitask instance segmentation neural network; they developed an ad hoc training process for composing the final annotations used to train the suggested CNN-based model. In 2020, Mi and Chen [62] introduced a superpixel-enhanced deep neural forest (SDNF) to improve the classification capability in semantic segmentation of remote-sensing images, also designing a superpixel-enhanced region module (SRM) that reduces noise and improves the edges of ground objects. In 2020, Zhang et al. [63] proposed a new fused network combining model-agnostic metalearning (MAML) and an FCNN for semantic segmentation of RGB remote-sensing images, optimized with the particle swarm optimization (PSO) algorithm. In 2020, Boonpook et al. [64] proposed multifeature semantic segmentation of UAV photogrammetry images using deep learning, in which the accuracy of building extraction was improved with the help of SegNet. In 2020, Yang et al. [65] focused on understanding pixel-level information in high-spatial-resolution remote-sensing images using an end-to-end residual network (ResNet), also considering several additional losses to enhance the suggested model through the optimization of multilevel features. In 2020, Mehra et al. [66] suggested a semantic segmentation method for classifying land cover using six deep learning architectures, pyramid scene parsing, UNet, DeepLabv3, path aggregation network, encoder-decoder network, and feature pyramid network, which attained superior results. In 2020, Tasar et al. [67] proposed a semantic segmentation method using a color mapping GAN named ColorMapGAN, which uses element-wise matrix manipulation to learn the transformation from the colors of the training data to the colors of the test data. In 2020, Venugopal [68] suggested a feature learning method named deep lab dilated CNN (DL-DCNN) for automatic semantic segmentation, determining the correlation between two images and showing superior efficiency over existing methods.

In 2021, Girisha et al. [69] proposed an improved encoder-decoder-based CNN architecture termed UVid-Net for semantic segmentation of UAV video frames. This architecture incorporates temporal smoothness, capturing the correlation among sequences of frames using multibranch CNNs. In 2021, Huang et al. [70] suggested an attention-guided label refinement network (ALRNet) to enhance the semantic labeling of very high-resolution remote-sensing images with an encoder-decoder framework; here, an attention-guided feature fusion (AGFF) module was developed to reduce the semantic gap among diverse levels of features. In 2021, Abdollahi et al. [71] suggested a GAN for segmenting roads in high-resolution aerial imagery, also using a modified UNet model (MUNet) to attain suitable results. In 2021, Alam et al. [72] suggested an integrated framework using a CNN with an enhanced UNet and an encoder-decoder CNN structure, SegNet with index pooling, for semantic segmentation of remote-sensing images, attaining appropriate segmentation results on multiple targets. In 2021, Anagnostis et al. [73] suggested a semantic segmentation approach for extracting orchard trees from aerial images, using UNet to improve performance in terms of accuracy; this model focused on automatic localization and detection of the canopies of orchard trees under different constraints. In 2021, Li et al. [74] proposed a semantic segmentation model for analyzing the properties of photovoltaics, also enhancing the recommendations for segmenting PV; it addressed the highly nonconcentrated and class-imbalanced distribution of photovoltaic panel image data through hard sampling and soft sampling. In 2021, Wang et al. [75] designed a real-time semantic segmentation network for high-resolution aerial images, an aerial bilateral segmentation network (Aerial-BiSeNet), offering superior accuracy; the suggested model uses two modules, a feature attention module (FAM) and a channel attention-based feature fusion module (CAFFM), to analyze the features. In 2021, Vasquez-Espinoza et al. [76] suggested a semantic segmentation scheme for indoor imagery that exploits the details offered by the metadata utilized in the training stage of UNet. In 2021, Chen et al. [77] considered different existing approaches, DeepLabv3, the generative adversarial network Pix2Pix, and UNet, for semantic segmentation of partially occluded apple trees, providing more details on branch paths, where the recovery of finer details from occlusions was offered. In 2021, Tasar et al. [78] suggested a network coined DAugNet for the semantic segmentation of satellite images, including a data augmentor and a classifier, which performs on life-long, multitarget, multisource, single-source, and single-target problems. In 2021, Li et al. [79] recommended a dual attention deep fusion semantic segmentation network for large-scale satellite remote-sensing images (DASSN_RSI) for obtaining significant results, also analyzing the challenges of conventional semantic segmentation approaches on remote-sensing images. In 2021, Jiang [80] suggested a semantic segmentation model for high-resolution remote-sensing images through a CNN and mask generation, in which the NN architecture was intended to obtain a precise mask. In 2021, Liu et al. [81] designed a new semantic segmentation model for remote-sensing images using the Inception-v4 network to obtain enhanced classification information; this model introduced feature fusion to solve the classification of object edges. In 2021, Zheng et al. [82] implemented an end-to-end CNN named GAMNet for balancing the controversies between local and global information, also realizing boundary recovery and multiscale feature extraction. In 2021, Ouyang and Li [83] offered a new DSSN called the attention residual U-shaped network (AttResUNet), which encodes the feature maps and refines features through an attention module, also using a GCN for classification.

2.2. Chronological Review

The chronological review of deep learning-based semantic segmentation models over the past years is given in Figure 2. Semantic segmentation emerged as a major research area after 2015, and this survey therefore gathers research works from 2015 to 2021. The years 2015, 2016, and 2017 each account for 1.67% of the contributions. Similarly, 13.3% of the gathered research works date from 2018, and 31.6% of the contributions date from 2020. The years 2019 and 2021 account for 25% of the research works each.

2.3. Security and Privacy Issues in Deep Learning

Deep learning now features in many everyday applications: self-driving cars, biometric security, health prediction, speech processing, financial technology, and retail [84]. Depending on the nature of the data and the user's intent, each application has its own set of requirements. Researchers have offered many models to fit the needs, users, and features of each sort of application, including LeNet, VGG, GoogleNet, Inception, and ResNet. Despite the fact that many studies on both attacking and safeguarding users' privacy and security have been published, they remain fragmented. Tramèr evaluated different attack strategies based on FGSM and GAN before proposing the R-FGSM algorithm [85]. Xiaoyong Yuan also discusses security vulnerabilities in the deep learning approach [86]. The preceding research has focused solely on the security of the deep learning model and does not provide an overview of preserving privacy in the deep learning model [87, 88].

In this work, we cover current studies on model security and data privacy that have led to the development of secure and private artificial intelligence (SPAI). To address the demand for strong artificial intelligence (AI) systems, we compiled fragmented results and methodologies with the goal of delivering insights important to future study.

To conclude, we examine current research on privacy and security problems related to DL in the areas listed below.
(1) DL model attacks: the two primary forms of DL attacks are evasion and poisoning attacks, with evasion attacks targeting the inference phase and poisoning attacks targeting the training phase.
(2) Defense of DL models: the different defense mechanisms may be divided into two broad categories based on the kind of attack, evasion or poisoning; tactics applied against evasion attacks can be further divided into empirical (e.g., gradient masking, robustness, and detection) and certified approaches.
(3) Privacy attacks on AI systems: the potential privacy threats to DL-based systems arising from service providers, information silos, and users.
(4) Defense against a privacy breach: the most modern cryptographic protection approaches, such as homomorphic encryption, secure multiparty computation, and differential privacy.

In deep learning model security, attack techniques are categorized according to the training and testing stages; this research emphasises threats at the testing stage. Furthermore, the categorization is based on the attacker's expertise, distinguishing black-box and white-box attack patterns. To safeguard user privacy, attack strategies are classified based on the system architecture and the attacker's knowledge: in terms of system architecture, attack strategies are divided into centralized and distributed, and in terms of available information, attacks are split into white-box and black-box attacks. Defensive techniques are classified based on the stages of the deep learning model.

The assumptions for implementing particular threats in deep learning security are scenario based. Threat models are classified depending on the adversary's knowledge, the attacker's target, and the frequency of attacks.

2.3.1. The Adversary’s Knowledge

A black-box attack occurs when the attacker lacks knowledge of the system: the attacker submits input and receives output without understanding the system parameters. In contrast, in a white-box attack, the attacker has access to all system information, including the model's structure and parameter values.

2.3.2. Attacker’s Target

Targeted attacks select certain data or object types and cause that data collection to be misclassified. These attacks are common where classification systems are used: in face recognition or authentication systems, for example, an attacker selects a certain face to be misclassified among the adversarial samples. Nontargeted attacks, on the contrary, choose arbitrary data and are simpler to execute than targeted attacks.

2.3.3. Frequency of Attacks

One-time attacks require only one adversarial example to be created, whereas iterative attacks build adversarial instances through multiple updates. Iterative attacks consistently outperform one-time attacks, but they need more queries to the deep learning system and take longer.

Deep learning security threats are classified into two types: adversarial and poisoning; we concentrate on adversarial attacks in this research. During a system query, an adversarial attack introduces noise into the normal data. When the attacker receives the reported results, he or she utilizes this information to generate adversarial instances. This type of attack appears in image processing, audio processing, and virus detection. It can trick deep learning machines, but not humans, particularly in the field of image processing. The gap between the source data and the adversarial example is represented by the noise value.
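As an illustration of this process, the following is a minimal sketch of a single-step, white-box adversarial attack in the FGSM style mentioned earlier; the classifier `model`, the input batch `x` in [0, 1], the labels `y`, and the noise budget `epsilon` are illustrative assumptions, not the attack of any specific cited work.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Craft adversarial inputs x' = x + epsilon * sign(grad_x loss)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # white-box: gradients are available
    loss.backward()
    # One-time (single-step) attack: a single gradient-sign perturbation.
    noise = epsilon * x_adv.grad.sign()
    return (x_adv + noise).clamp(0.0, 1.0).detach()
```

The `noise` tensor here is exactly the "gap between the source data and the adversarial example" described above; an iterative attack would repeat the same update several times with a smaller step.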

2.4. Limitations and Alternate Solutions of Deep Learning

Several challenges frequently arise when adopting DL. The more difficult ones are discussed next, with various viable solutions supplied.

2.4.1. Training Data

Because it also requires representation learning, DL is tremendously data-hungry. To produce a well-behaved performance model, DL necessitates a massive quantity of data; as the data accumulate, an even better-behaved model may be achieved. Most of the time, the supplied data are adequate to obtain a solid performance model; however, in some situations there is insufficient data to use DL directly. Three techniques are proposed for dealing with this issue. The first entails using the transfer-learning idea after collecting data from similar tasks: while the transferred data will not directly enhance the real data, it aids in improving both the original input data representation and its mapping function, so the model's performance improves. A variant of this method is to take a well-trained model from a comparable task and fine-tune the last two layers, or even one layer, on the limited original data. The second option involves data augmentation: because picture translation, mirroring, and rotation frequently do not modify the image label, this activity is extremely useful for supplementing image data. In contrast, caution is needed when using this approach in some circumstances, such as with bioinformatics data: when mirroring an enzyme sequence, for example, the resulting data may not represent a real enzyme sequence. In the third way, simulated data may be used to increase the size of the training set. If the problem is sufficiently understood, it is sometimes possible to construct simulators based on the physical process, and the end product comprises the simulation of as much data as is required.

2.4.2. Transfer Learning

Deep CNNs, which provide ground-breaking help for solving numerous classification issues, have been widely used in recent research. Deep CNN models, in general, need a large amount of data in order to function well, and the most prevalent problem with employing such models is the lack of training data. Gathering a large amount of data is a demanding task, and no fully satisfactory solution is currently available. As a result, the undersized-dataset problem is now addressed using the TL approach, which is very efficient at handling the lack of training data. The TL technique entails training the CNN model with vast amounts of data and then fine-tuning the model for training on a small target dataset.

The student-teacher interaction is an effective way of explaining TL. The teacher first learns everything there is to know about the subject and then gives a "course," imparting the material over time through a "lecture series." Simply put, the instructor transmits information to the pupil; more specifically, the expert (teacher) imparts knowledge (information) to the learner (student). Similarly, the DL network is trained using a large amount of data and learns the biases and weights during training. These weights are then transferred to other networks in order to retrain or test a comparable new model; the new model can thus start from pretrained weights rather than requiring training from scratch. A minimal sketch of this recipe follows.
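This is a minimal sketch of the weight-transfer recipe, assuming PyTorch/torchvision; the ResNet-18 backbone and the 10-class target task are illustrative assumptions.

```python
import torch.nn as nn
from torchvision import models

# "Teacher" weights learned on a large dataset (ImageNet here).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False               # freeze the transferred knowledge
# Replace the head for the small target task ("fine-tune the last layers").
model.fc = nn.Linear(model.fc.in_features, 10)
# Only model.fc.parameters() are now trainable on the small dataset.
```

Freezing the transferred layers and training only the new head mirrors the strategy described above; unfreezing more layers trades data efficiency for flexibility.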

2.4.3. Data Augmentation Techniques

Data augmentation techniques are one viable answer if the aim is to expand the quantity of accessible data while avoiding overfitting. These strategies are data-space solutions to any problem with little data. Data augmentation refers to a set of approaches for improving the properties and quantity of training datasets, and DL networks perform better when these strategies are used. Some common data augmentation solutions follow; a code sketch of these transformations appears after the list.
(i) Flipping: vertical-axis flipping is a less prevalent procedure than horizontal-axis flipping. On datasets such as ImageNet and CIFAR-10, flipping has been shown to be beneficial, and it is very simple to implement. However, it is not a label-conserving transformation on datasets involving text recognition (such as SVHN and MNIST).
(ii) Color space: digital picture data is often encoded as a tensor of dimensions (height × width × colour channels). Performing enhancements in the colour space of the channels is an alternative method that is particularly practical to implement. Color augmentation can be as simple as isolating a channel of a certain colour, such as red, green, or blue: by keeping that matrix and filling the remaining two colour channels with zeros, a picture is quickly transformed into one using a single colour channel. Furthermore, the picture brightness may be increased or decreased by utilising simple matrix operations to modify the RGB values. Better colour augmentations can be acquired by generating a colour histogram that represents the image; lighting changes can also be made by altering the intensity values in histograms, similar to those used in photo-editing software.
(iii) Cropping: cropping a prominent region of every single image is a technique used as a specialised processing step for image data with combined dimensions of height and width. Furthermore, random cropping can be used to achieve an effect similar to translations. The distinction between translations and random cropping is that translations preserve the image's spatial dimensions, whereas random cropping decreases the input size. Depending on the chosen cropping reduction threshold, the transformation may not be label-preserving.
(iv) Rotation: rotation augmentations are created by rotating a picture left or right by 0 to 360° around the axis. The rotation degree parameter has a significant impact on the applicability of rotation augmentations: small rotations (0 to 20°) are quite useful in digit identification tasks, but as the rotation degree rises, the data label may not be preserved after the transformation.
(v) Translation: shifting the picture up, down, left, or right is a highly useful transformation for avoiding positional bias in image data. For example, it is typical for all of the photos in a dataset to be centred, in which case the tested dataset should also be fully composed of centred images in order to test the model. Note that, after translating the starting pictures in a certain direction, the remaining space should be filled with Gaussian or random noise, or a constant value such as 255 or 0; with this padding, the spatial dimensions of the picture after augmentation are preserved.
(vi) Noise injection: this method entails adding a matrix of arbitrary values, typically generated from a Gaussian distribution. Injecting noise into photos allows the CNN to learn more robust features.
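This is a minimal sketch mapping the augmentations above onto torchvision transforms; the parameter values (flip probability, jitter strengths, crop size, rotation degrees, translation fractions, and noise scale) are illustrative assumptions.

```python
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),            # (i) flipping
    transforms.ColorJitter(brightness=0.2, hue=0.05),  # (ii) colour-space changes
    transforms.RandomCrop(224, padding=16),            # (iii) cropping
    transforms.RandomRotation(degrees=15),             # (iv) small label-preserving rotation
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1)),     # (v) translation, constant padding
    transforms.ToTensor(),
    transforms.Lambda(lambda t:                        # (vi) Gaussian noise injection
                      (t + 0.01 * torch.randn_like(t)).clamp(0.0, 1.0)),
])
```

For segmentation specifically, the geometric transforms (flips, crops, rotations, translations) must be applied jointly to the image and its label map so that the pixel-wise labels stay aligned.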

2.4.4. Interpretability of Data

DL approaches are occasionally regarded as a black box; in fact, they can be interpreted. Many areas, such as bioinformatics, require a way of interpreting DL that can recover the valuable motifs and patterns detected by the network. It is necessary to understand not only the disease diagnosis or prediction findings of a trained DL model but also the verifications behind them, so as to improve confidence in the prediction outcomes, as the model bases its choices on these verifications. To do this, each section of a specific example can be assigned a weighted score. Backpropagation-based techniques or perturbation-based approaches are employed in this solution. In the perturbation-based techniques, a fraction of the input is altered and the effect of this modification on the model output is monitored; this notion has high computational complexity, yet it is easy to grasp. In contrast, in backpropagation-based approaches, the signal from the output propagates back to the input layer to score the relevance of distinct input sections.
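This is a minimal sketch of the perturbation-based idea, assuming a generic PyTorch classifier `model`, an input whose spatial size is divisible by the patch size, and an illustrative patch size; it occludes one region at a time and records how much the predicted class score drops.

```python
import torch

@torch.no_grad()
def occlusion_map(model, image, target_class, patch=16):
    """image: (1, C, H, W) tensor; returns a relevance map of patch scores."""
    base = model(image).softmax(dim=1)[0, target_class].item()
    _, _, H, W = image.shape
    n_h, n_w = H // patch, W // patch
    scores = torch.zeros(n_h, n_w)
    for i in range(n_h):
        for j in range(n_w):
            occluded = image.clone()
            occluded[:, :, i*patch:(i+1)*patch, j*patch:(j+1)*patch] = 0.0
            prob = model(occluded).softmax(dim=1)[0, target_class].item()
            scores[i, j] = base - prob   # large drop = relevant region
    return scores
```

The quadratic number of forward passes over patches is the computational cost mentioned above; backpropagation-based methods obtain a relevance score in a single backward pass instead.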

2.4.5. Overfitting

Because of the large number of parameters involved, which are complexly interrelated, DL models carry an extremely high risk of overfitting the data during the training stage. Such circumstances limit the model's capacity to perform well on test data. This issue is not restricted to a single field but spans a variety of tasks; consequently, when proposing DL approaches, it should be thoroughly examined and handled correctly. According to current research, the inherent bias of the training process helps the model to overcome critical overfitting concerns in DL. Nonetheless, strategies for dealing with overfitting must be developed. The various DL techniques for easing the overfitting problem may be divided into three classes. The first class contains the most well-known methods, such as weight decay, batch normalisation, and dropout, and operates on both the model architecture and the model parameters. Weight decay is the default approach in DL and is widely used as a universal regularizer in practically all ML algorithms. The second class is concerned with the model inputs, such as data corruption and data augmentation: one cause of overfitting is a paucity of training data, which causes the learnt distribution to differ from the true distribution; data augmentation increases the size of the training data, while marginalised data corruption improves the solution beyond data augmentation alone. The last class is concerned with the model's output: a recently developed method regularises the model by penalising overconfident outputs and has been shown to be capable of regularising RNNs and CNNs.
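This is a minimal sketch of the first class of remedies, combining dropout and batch normalisation in the architecture with weight decay in the optimizer; the layer sizes and hyperparameters are illustrative assumptions.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),   # batch normalisation stabilises activations
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout randomly disables units during training
    nn.Linear(128, 10),
)
# Weight decay (L2 penalty) acts as the "universal regularizer" noted above.
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```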

2.4.6. Vanishing Gradient Problem

In general, when utilising backpropagation- and gradient-based learning approaches with ANNs, an issue known as the vanishing gradient problem emerges, particularly during the training stage. In each training iteration, every weight of the neural network is updated in proportion to the partial derivative of the error function with respect to the current weight. However, owing to a vanishingly tiny gradient, this weight update may not occur in some situations, implying that no further training is feasible and the neural network will stop entirely. The sigmoid function, like some other activation functions, compresses a huge input space into a compact region; because a huge fluctuation at the input results in only a little variation at the output, the derivative of the sigmoid function is small. In a shallow network, where only a few layers employ these activations, this is not a big problem; with additional layers, however, the gradient becomes too tiny during the training stage for the network to operate effectively. The gradients of neural networks are determined using the backpropagation approach, which computes the derivatives of each layer in reverse order, beginning with the final layer and moving back to the first, and then multiplies the derivatives of the layers along the network. When there are N hidden layers using an activation function such as the sigmoid, N small derivatives are multiplied together; the gradient therefore decreases exponentially as it propagates back to the first layer. Because the gradient is small, the biases and weights of the initial layers cannot be updated efficiently during the training stage; and because these early layers are typically vital in detecting the main aspects of the input data, this circumstance reduces total network accuracy. Such an issue may be avoided by using activation functions that lack the squashing property of compressing the input space into a tiny region. The ReLU, which maps x to max(0, x), is the most popular choice since it does not produce a small derivative in the positive domain. Another option is the batch normalisation layer. As previously stated, the difficulty arises when a huge input space is squeezed into a tiny space, causing the derivative to vanish; batch normalisation mitigates this by simply normalising the input so that |x| does not reach the outer borders of the sigmoid function. The normalisation procedure causes most inputs to fall into the central region of the sigmoid, ensuring that the derivative is large enough for further updates. Furthermore, faster hardware, such as that supplied by GPUs, can address the above issue: compared to the time necessary to recognize the vanishing gradient problem, it enables standard backpropagation over many deeper layers of the network.
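This is a small numeric sketch of the mechanism described above: the sigmoid derivative never exceeds 0.25, so a product of such derivatives over many layers shrinks exponentially, whereas the ReLU passes a gradient of 1 through its active units.

```python
import torch

x = torch.linspace(-5, 5, steps=11)
sig = torch.sigmoid(x)
sig_grad = sig * (1 - sig)        # sigmoid derivative, maximal at x = 0
print(sig_grad.max().item())      # 0.25
print(0.25 ** 10)                 # ~9.5e-07: upper bound after 10 sigmoid layers
relu_grad = (x > 0).float()       # ReLU derivative: 1 for active units, no shrinkage
```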

2.4.7. Exploding Gradient Problem

The exploding gradient problem is the inverse of the vanishing gradient problem: during backpropagation, huge error gradients accumulate. These result in extraordinarily big updates to the network's weights, making the system unstable, so the model's capacity to learn successfully deteriorates. Moving backward through the network during backpropagation causes the gradient to grow exponentially through repeatedly multiplied gradients; as a result, the weight values may become extremely large and may overflow to produce a not-a-number (NaN) value. Some potential solutions, sketched below, include
(1) using different weight regularization techniques and
(2) redesigning the architecture of the network model.
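This is a minimal sketch of remedy (1), weight regularization, together with gradient clipping, a further common safeguard that is an addition to the list above; the model choice and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64)  # recurrent nets are prone to exploding gradients
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)  # L2 regularization

def training_step(loss):
    opt.zero_grad()
    loss.backward()
    # Rescale gradients whose global norm exceeds 1.0 to avoid NaN overflow.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
```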

2.4.8. Underspecification

In 2020, a Google team of computer scientists identified a new difficulty known as underspecification. When evaluated in real-world applications such as computer vision, medical imaging, natural language processing, and medical genomics, machine learning models, particularly deep learning models, frequently exhibit startlingly low performance, and underspecification is to blame. It has been demonstrated that small changes may push a model to an entirely new solution and result in different predictions in deployment domains. There are several methods for dealing with underspecification. One of them is to create "stress tests" to see how well a model performs on real-world data and to identify potential problems; nonetheless, this necessitates a solid grasp of the ways in which the model can perform incorrectly. "Designing stress tests that are well-matched to application criteria and that give adequate 'covering' of probable failure modes is a huge problem," the researchers concluded. Underspecification severely limits the trustworthiness of ML predictions and may necessitate reconsidering some applications; because ML is tied to humans through applications such as medical imaging and self-driving automobiles, this issue demands careful consideration.

2.5. Computational Approaches and Comparison between Different Aspects Related to Devices

Complex ML and DL algorithms have quickly emerged as the most significant techniques for computationally demanding applications, and they are widely applied in a variety of domains. The creation and refinement of algorithms, together with well-behaved computational performance and massive datasets, allow the successful execution of various applications that were previously either impossible or difficult to conceive.

2.5.1. CPU-Based Approach

The CPU nodes’ well-behaved performance frequently aids robust network connectivity, storage capabilities, and huge memory. Although CPU nodes are more general purpose than FPGA or GPU nodes, they lack the ability to compete in raw compute facilities since this demands improved network capability and a bigger memory capacity.

2.5.2. GPU-Based Approach

GPUs are exceptionally effective for various fundamental DL primitives, including highly parallel computing operations such as activation functions, matrix multiplication, and convolutions. Incorporating HBM stacked memory onto modern GPU models dramatically improves bandwidth; this enhancement enables a wide range of primitives to make efficient use of all available computational resources on GPUs. For dense linear algebra computations, the speedup of GPUs over CPUs is typically 10–20:1.

2.5.3. FPGA-Based Approach

FPGAs are widely used in a variety of functions, including deep learning, and are widely used to create inference accelerators. An FPGA can be effectively configured to reduce the number of unnecessary or overhead functions found in GPU systems. Compared to the GPU, however, the FPGA is limited to poorer floating-point performance and integer inference. The key FPGA feature is the ability to dynamically modify the array characteristics (at run-time), as well as to configure the array with an effective design with little or no overhead. Table 1 [89] compares different aspects of these devices.

3. Algorithmic Categorization and Features and Challenges of Existing Semantic Segmentation Models

3.1. Algorithmic Classification

This section presents different deep learning approaches utilized for developing a semantic segmentation model as given in Figure 3.

Semantic segmentation models mostly use deep learning algorithms to attain superior accuracy with better quality. The techniques are categorized into two sections, namely, deep learning and miscellaneous approaches. Within deep learning, CNN architectures play a major role in semantic segmentation and are extended by adopting different convolutional layers or other frameworks.

Supervised learning: in this model, training data consist of both inputs and desired results. Supervised learning algorithms are often accurate and fast, and they generalize well, giving precise results when processing new data without prior knowledge about the target.

CNNs [24, 48, 53, 61, 80] inspire researchers because of their superior efficiency in the area of computer vision, and they have been adopted in diverse applications such as object detection, image recognition, and other fields. Figure 4 represents the architecture of a convolutional neural network. This architecture enhances the accuracy of prediction or classification thanks to the large number of training samples and the construction of neural networks with several layers. A CNN is a hierarchical system that takes raw data as input and stacks a set of operations such as nonlinear activation mappings, convolutions, and pooling operations; this procedure is named the "feedforward operation" (a minimal sketch follows this paragraph). Owing to this effective operation, CNNs have attained superior results in data mining and natural language processing tasks compared with plain deep neural networks. Owing to the efficiency of CNN architectures, multiple CNN-based approaches have been designed by integrating many ideas or by integrating the FCN architecture. The adoption of several networks into one framework is named ensemble learning [60, 66, 77], which attains superior results compared to a single architecture because of the utilization of multiple networks. The ensemble of CNNs [25] utilizes several layers of CNN architecture to reduce the computational cost and avoid aliasing problems, providing promising performance compared to existing models. DP-DCN [35] focuses on extracting the significant features from DSM data and spectral channels and fusing them through an encoder-decoder framework. Extended versions of the CNN include shuffling CNNs [27], DSMFNets [28], UVid-Net [69], ESPCN [42], neuroarchitectures [72], the ensemble of CNNs [25], ADSCNet [54], DSCNN [29], DCNN [32, 33, 57], multitask CNN [30], multifilter CNN [31], ConvNet [26], GAMNet [82], DL-DCNN [68], and GCN [49, 83]. These modified or integrated CNN concepts are designed for efficient semantic segmentation.
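This is a minimal sketch of the stacked convolution-activation-pooling pipeline just described, assuming PyTorch; the channel counts and the 10-class head are illustrative assumptions.

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution over raw pixels
    nn.ReLU(),                                   # nonlinear activation mapping
    nn.MaxPool2d(2),                             # pooling halves the spatial size
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),                     # collapse remaining spatial dims
    nn.Flatten(),
    nn.Linear(32, 10),                           # class scores
)
```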

FCN (see [38, 46, 47, 51]): the basic idea of the FCN includes processes such as multilayer convolution, deconvolution, and fusion, where the fully connected layers are replaced with convolutional layers. The image score is computed using pixel-wise convolution. UNet [42, 73, 76] is a type of FCN that is efficient for small training datasets; it includes convolution and deconvolution layers with filters along with the ReLU activation function. Modified versions of the FCN include an integrated algorithm [71], ResUNet-a [59], CAM-DFCN [36], relation module-equipped FCN [52], the FCN-AlexNet model [55], and AD-LinkNet [44]. Improved SegNet [58] and SegNet [64] follow an FCN structure with encoder and decoder networks; SegNet saves the element indices for the upsampling process of the decoder network to resolve the ambiguous spatial information in the output of deeper layers. Figure 5 depicts the architecture of a fully convolutional network.

FCN introduces several significant ideas: (i) end-to-end learning of the upsampling algorithm via an encoder/decoder structure that first downsamples the activations and then upsamples them again, (ii) a fully convolutional architecture that allows the network to take images of arbitrary size as input, since there is no fully connected layer at the end requiring a specific activation size, and (iii) skip connections as a way of fusing information from different depths in the network for multiscale inference. A minimal sketch of these ideas follows.
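This is a minimal sketch of the three ideas above, assuming PyTorch; the channel counts and the 6-class head are illustrative assumptions, and any even-sized input resolution works since the network is fully convolutional.

```python
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # (i) Learned upsampling (transposed convolution), not fixed interpolation.
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        # (ii) A 1x1 conv head instead of a fully connected layer: any input size works.
        self.head = nn.Conv2d(16, num_classes, 1)

    def forward(self, x):
        f1 = self.enc1(x)              # shallow, high-resolution features
        f2 = self.enc2(self.down(f1))  # deep, low-resolution features
        fused = self.up(f2) + f1       # (iii) skip connection for multiscale inference
        return self.head(fused)        # per-pixel class scores
```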

GAN (see [37, 71]): the generative adversarial network (GAN) model considers a softmax layer in which the discriminator of the GAN produces label types for efficient classification of unlabeled samples and labeled examples (a minimal sketch follows this paragraph). The architecture of the generative adversarial network is shown in Figure 6. The modified version ColorMapGAN [67] aims at minimizing the computational complexity and improving the accuracy.
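This is a minimal sketch of such a discriminator, assuming PyTorch; its softmax output spans K real label types plus one extra "fake" class, so labeled, unlabeled, and generated samples can all contribute to training. The layer sizes and K are illustrative assumptions.

```python
import torch.nn as nn

K = 6  # number of real semantic label types (assumed)
discriminator = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, K + 1),   # softmax over K real label types + 1 "fake" class
)
# Labeled data trains classes 0..K-1; generator outputs are pushed to class K;
# unlabeled samples are only pushed away from the "fake" class.
```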

DNN: deep neural networks (DNNs) for semantic segmentation of high-resolution images contain many parameters and need a large number of labeled examples for training; a general scheme for constructing a deep network to process a rich dataset is complex. Improved DNN models include the modified Inception-v4 network [50], NDRB [52], ResNet101-v2 [39], ALRNet [70], HMANet [56], ResNet [65], the Inception-v4 network [81], MAVNet [41], and SDNF [61], all of which aim to achieve superior segmentation accuracy.

Unsupervised learning: this model is not provided with the desired results during training and can be employed for clustering the input data into classes through statistical properties.

DAugNet [78]: DAugNet generates precise maps and provides life-long adaptation settings, giving superior semantic segmentation results. GB-RBM [73] was introduced to enhance the segmentation results and improve speed and accuracy. Figure 7 gives the training procedure of the data augmentation network.

3.2. Features and Challenges

The features and challenges of the conventional semantic segmentation models using deep learning techniques are listed in Table 2. This description helps researchers focus on new semantic segmentation models for aerial images that solve the existing challenges by adopting deep learning techniques.

4. Simulation Platforms and Dataset Description for Conventional Semantic Segmentation Models

4.1. Simulation Platforms

The simulation environments used for implementing semantic segmentation models with different imaging modalities are presented in Figure 8. Tools such as CUDA version 8.0, the Edge Detection and Image Segmentation (EDISON) library, MXNet, TensorRT, and a two-fold validation tool are each used in 1.7% of the contributions. MATLAB and Tesla are each used in 3.3% of the contributions for implementation, and Pascal and Keras (with TensorFlow as the platform) are each utilized in 8.3% of the research works. TensorFlow is used as the simulation environment in 18.3% of the works, and NVIDIA is considered in 5% of the contributions. Finally, the Python tool is used in 6.6% of the research works, and other platform environments are used in 20% of the contributions.

4.2. Dataset Description and Imaging Modalities Focused

The datasets used for implementing the semantic segmentation models, along with the different imaging modalities, are given in tabular form (Tables 3–7). Most of the contributions consider aerial images for semantic segmentation, used in 23.3% of the works, and high-resolution aerial imagery is used in 16.6% of the contributions. Similarly, remote-sensing and high-resolution remote-sensing images are used in 25% of the research works. Unmanned aerial vehicle (UAV) images are gathered in 11.7% of the contributions.

Multiscale and multispatial-resolution images are each included in 1.7% of the research papers, and satellite images are used in 5% of the contributions. Other high-resolution images are used in 13.4% of the research works.

4.3. Datasets for Image Segmentation

In this section, we give a synopsis of some of the most generally utilized datasets for image segmentation. We group these datasets into three classes, 2-dimensional images, 2.5-dimensional RGB-D (depth + colour) images, and 3-dimensional images, and give details of the attributes of each dataset. The listed datasets have pixel-wise labels, which can be used for assessing model performance.

It is worth noting that a portion of these works use data augmentation to expand the quantity of labeled samples, especially the ones dealing with little datasets, such as in the medical domain. Data augmentation serves to expand the quantity of training samples by applying a set of transformations either in the data space or the feature space, or occasionally both, to the images, i.e., both the input image and the segmentation map. Some common transformations include translation, reflection, rotation, warping, scaling, colour-space shifting, cropping, and projections onto principal components. Data augmentation has been demonstrated to improve the performance of models, particularly when learning from restricted datasets, like those in medical image investigation.

Most image segmentation research has concentrated on 2-dimensional images. In Figure 9 [91], the pink, green, and yellow blocks denote semantic, instance, and panoptic segmentation algorithms, respectively. Accordingly, several 2-dimensional image segmentation datasets exist, including PASCAL Visual Object Classes (VOC) [92], PASCAL Context [93], Microsoft Common Objects in Context (MS COCO) [94], Cityscapes [95], ADE20K/MIT Scene Parsing (SceneParse150) [96], SiftFlow [97], Stanford Background [98], the Berkeley Segmentation Dataset [99], YouTube-Objects [100], KITTI [101], the Semantic Boundaries Dataset (SBD) [102], PASCAL Part [103], SYNTHIA [104], and Adobe's Portrait Segmentation [105]. With the availability of affordable range scanners, RGB-D images have become standard in both research and industrial applications. Some of the most standard 2.5-dimensional RGB-D datasets are NYU-D V2 [106], SUN-3D [107], SUN RGB-D [108], the UW RGB-D Object Dataset [109], and ScanNet [110]. Three-dimensional image datasets are standard in robotics, medical image analysis, 3D scene analysis, and construction applications. Three-dimensional images are generally provided via meshes or other volumetric representations, such as point clouds. Some of the standard 3-dimensional datasets are Stanford 2D-3D [111], ShapeNet Core [112], and the Sydney Urban Objects Dataset [113].

4.4. Frameworks and Benchmark Datasets Employed for Different DL Tasks

Several deep learning frameworks and datasets have been developed in the last few years. Various frameworks and libraries have been used to expedite this work and achieve good results; through their use, the training process has become easier. Tables 8 and 9 [89] list the most utilized frameworks and libraries and the benchmark datasets.

4.5. Algorithms Comparison Based on Different Datasets

Comparisons of different algorithmic features and the results obtained based on clustering methods, conditional random fields, the PASCAL VOC2012 dataset, the CamVid dataset, and the MS COCO dataset are tabulated (Tables 10–14).

5. Performance Measures and Best Accuracy Rate Attained by the Conventional Semantic Segmentation Models

5.1. Performance Metrics

A model should ideally be assessed in a variety of ways, including quantitative accuracy, speed, and memory requirements. The majority of previous research has concentrated on metrics for assessing model accuracy. The most commonly used metrics for evaluating the accuracy of segmentation algorithms are summarized below [91, 136]; a minimal code sketch that derives most of them from a confusion matrix follows the list. On benchmarks, quantitative measurements are used to compare various models, although the visual quality of the model outputs is also significant.

(i) Pixel accuracy (PA): pixel accuracy is the ratio of correctly classified pixels to the total number of pixels. For $N + 1$ classes, pixel accuracy is defined as
$$\mathrm{PA} = \frac{\sum_{i=0}^{N} a_{ii}}{\sum_{i=0}^{N}\sum_{j=0}^{N} a_{ij}},$$
where $a_{ij}$ is the number of pixels of class $i$ predicted as belonging to class $j$.

(ii) Average/mean pixel accuracy (MPA): mean pixel accuracy is a slight refinement of PA, in which the ratio of correct pixels is computed on a per-class basis and then averaged over the total number of classes:
$$\mathrm{MPA} = \frac{1}{N + 1}\sum_{i=0}^{N} \frac{a_{ii}}{\sum_{j=0}^{N} a_{ij}}.$$

(iii) Intersection over union (IoU): this is possibly the most widely used metric in semantic segmentation. It is computed as the area of intersection between the predicted segmentation map and the ground truth, divided by the area of their union:
$$\mathrm{IoU} = \frac{|P \cap Q|}{|P \cup Q|},$$
where $P$ is the true segmentation map and $Q$ is the predicted segmentation map. The value of the intersection over union lies between 0 and 1.

(iv) Mean-IoU: mean intersection over union is an alternative standard metric, defined as the average IoU across all classes. It is commonly used in reporting the performance of contemporary segmentation algorithms [91].

(v) Precision/recall: for numerous classical image segmentation models, precision and recall are the standard reported metrics. For every class, they are defined as
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},$$
where TP = true positive, FP = false positive, and FN = false negative. Usually, we are interested in a combined form of the precision and recall rates.

(vi) F1 score: the F1 score is also a standard metric, defined as the harmonic mean of precision and recall:
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

(vii) Dice coefficient: the Dice coefficient is another standard metric, used in medical image analysis for image segmentation, defined as "twice the overlap area of predicted and ground truth maps, divided by the total number of pixels in both images. The Dice coefficient is very identical to the IoU" [91]:
$$\mathrm{Dice} = \frac{2\,|P \cap Q|}{|P| + |Q|}.$$
When applied to Boolean data, the Dice coefficient is nearly equal to the F1 score:
$$\mathrm{Dice} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$
where TP indicates the true positive fraction, FP the false positive fraction, and FN the false negative fraction.

(viii) Frequency-weighted mIoU: frequency-weighted mean intersection over union is an improvement over the raw mIoU that weights the importance of each class by its appearance frequency [136]:
$$\mathrm{FWIoU} = \frac{1}{\sum_{i=0}^{N}\sum_{j=0}^{N} a_{ij}} \sum_{i=0}^{N} \frac{\big(\sum_{j=0}^{N} a_{ij}\big)\, a_{ii}}{\sum_{j=0}^{N} a_{ij} + \sum_{j=0}^{N} a_{ji} - a_{ii}}.$$

(ix) Jaccard index: the Jaccard index, commonly known as the Jaccard similarity coefficient, is a statistic used to assess the similarity between sample sets. It stresses the similarity between finite sample sets and is formally defined as the size of the intersection divided by the size of the union of the sample sets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

(x) Confusion matrix: a confusion matrix is an $N \times N$ matrix used to assess the effectiveness of a classification model, where $N$ is the number of target classes. Figure 10 represents the confusion matrix. The matrix compares the actual target values with the machine learning model's predictions, providing a comprehensive picture of how well the classification model is working and the kinds of errors it is producing. For a binary classification task, we would have a 2 × 2 matrix with four values, as illustrated in the figure [137]. To decode the matrix: the target variable has two values, positive and negative; the columns represent the actual values of the target variable; and the rows represent the predicted values of the target variable.

(xi) Kappa coefficient: the kappa coefficient is used to assess the level of agreement between two human evaluators or raters (for example, psychologists) when assessing subjects (patients). The machine learning community later "appropriated" it to quantify classification performance. The kappa score, also known as Cohen's kappa coefficient [138], is named after Jacob Cohen, an American statistician and psychologist who produced the foundational study on the subject. This measure is also known as Cohen's kappa and the kappa statistic. To compute the kappa score, it is convenient to first summarize the ratings in a matrix, as shown in Figure 11.

The columns show the ratings by Professor A, and the rows show the ratings by Professor B. The value in each cell is the number of candidates given the corresponding pair of ratings by the two professors.
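To make the definitions above concrete, the following minimal Python sketch is our own illustration (not code from any surveyed contribution; helper names such as segmentation_metrics are ours) and derives most of the listed metrics, including Cohen's kappa, from a single confusion matrix:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Build a (num_classes x num_classes) matrix whose entry [i, j]
    counts pixels of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true.ravel(), y_pred.ravel()), 1)
    return cm

def segmentation_metrics(cm):
    """Derive the accuracy metrics summarized above from a confusion matrix."""
    cm = cm.astype(np.float64)
    total = cm.sum()
    diag = np.diag(cm)
    row = cm.sum(axis=1)   # ground-truth pixels per class
    col = cm.sum(axis=0)   # predicted pixels per class
    eps = 1e-12            # avoid division by zero for absent classes

    pa = diag.sum() / total                        # (i) pixel accuracy
    mpa = np.mean(diag / (row + eps))              # (ii) mean pixel accuracy
    iou = diag / (row + col - diag + eps)          # (iii) per-class IoU
    miou = iou.mean()                              # (iv) mean IoU
    precision = diag / (col + eps)                 # (v) per-class precision
    recall = diag / (row + eps)                    # (v) per-class recall
    # (vi) F1; on Boolean data this also equals the per-class Dice (vii).
    f1 = 2 * precision * recall / (precision + recall + eps)
    fwiou = np.sum((row / total) * iou)            # (viii) frequency-weighted IoU
    # (xi) Cohen's kappa: observed agreement vs. chance agreement.
    p_observed = pa
    p_expected = np.sum(row * col) / total**2
    kappa = (p_observed - p_expected) / (1 - p_expected + eps)
    return {"PA": pa, "MPA": mpa, "IoU": iou, "mIoU": miou, "FWIoU": fwiou,
            "precision": precision, "recall": recall, "F1": f1, "kappa": kappa}

# Toy usage: a 4x4 label map with two classes.
truth = np.array([[0, 0, 1, 1]] * 4)
pred = np.array([[0, 1, 1, 1]] * 4)
print(segmentation_metrics(confusion_matrix(truth, pred, num_classes=2)))
```

On this toy example, PA is 0.75, mIoU is about 0.58, and the kappa score is 0.5; FWIoU coincides with mIoU here because both classes cover the same number of ground-truth pixels.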

The performance metrics employed for analyzing the diverse semantic segmentation models through deep learning are given in Table 15. From the set of research works, 63.3% of the works use OA, 48.3% of the contributions use the F1 score, and 25% of the works each consider the recall and precision measures. The mIoU metric is taken in 28.3% of the research works; 5% of the papers each use the Jaccard index, kappa coefficient, and Dice coefficient as performance metrics; the confusion matrix and PA are each considered in 4% of the research works; and 23.3% of the contributions consider the IoU measure. Furthermore, some additional measures, such as FWIoU, MCC, and average accuracy, are also taken for evaluating the efficiency of semantic segmentation.

5.2. Best Performance Measures

The best performance measures obtained by the diverse semantic segmentation models are depicted in Figure 12. From this comprehensive survey, Figure 12(a) shows that contributions such as [32, 64] attain the highest accuracy rate, 97%. Secondly, the work in [50] obtains 94.49%, and research works such as [26, 43, 47, 56, 59, 81] attain a 92.63% accuracy rate when compared with other works. The best performances for some of the metrics, such as overall accuracy, F1 score, intersection over union, and recall, were noted and tabulated as shown in Table 16.

5.3. Research Gaps and Challenges

In recent decades, several semantic segmentation approaches have been designed for different applications such as surveillance systems, traffic monitoring, and analysis of environmental changes. However, manual segmentation methods are time-consuming and complex. Thus, automated semantic segmentation of aerial images has emerged as a recent hot topic [139]. On the other hand, semantic segmentation of aerial images is a complex task owing to several constraints, such as the demand for pixel-level accuracy, nonconventional data, and the lack of training examples. Each object in a remote-sensing image conveys important information, which needs to be precisely distinguished from its neighbors. Numerous works have been proposed to solve this problem, focusing on improving regularization and FCNs, for example by preserving object boundary details. Many public datasets have been considered for evaluating the performance of deep learning approaches. Here, infrared and colour satellite images have achieved noteworthy performance, comparable to the image sets used in portrait and scenic computer vision tasks. From the comprehensive review, public datasets such as the ISPRS datasets have gained importance and have supported the implementation of deep learning approaches for semantic segmentation [140]. However, performing semantic segmentation across different data or imaging modalities and analysis metrics makes evaluation complex. Moreover, the different modalities of remote-sensing images, such as UAV, hyperspectral, infrared, and RGB images, are complex to process, resulting in a lack of accuracy when estimating nonconventional data.

Sometimes, a large volume of data combined with a lack of training examples poses complexities in aerial imaging applications. The task is even more challenging with nonconventional data sources such as LiDAR, hyperspectral images, and synthetic aperture radar images [141]. When deep learning techniques are utilized for processing labeled nonconventional remote-sensing datasets, complexities arise, and these deep learning methods suffer from the lack of training data. Any deep learning model may need a huge set of training images owing to the number of classes and the complexity of the problem [142]. Moreover, the utilization of deep learning is more complicated when considering the expense of collecting additional remote-sensing data [143]. Thus, different augmentation approaches are commonly employed to increase the variation and size of the dataset. Consequently, the most common datasets, the "ISPRS 2D labeling dataset" and the "IEEE GRSS dataset," have attempted to address this data insufficiency by offering very high-resolution remote-sensing images gathered from UAVs [141].

An additional limitation of deep learning-based semantic segmentation is the need for a large number of labeled samples, which generally require manual annotation. This issue has been considerably alleviated by public datasets that offer annotations [142]. However, annotation is still tedious when working with one's own or manually created datasets. Existing research works have utilized conventional approaches for producing annotations; similarly, labeled datasets can be created using the features of pretrained models. From the meta-analysis results, deep learning provides enhanced efficiency and superior performance compared with conventional approaches [143]. Many challenges of deep learning-based techniques have been solved or reduced in recent decades, which has increased performance. Future research on the semantic segmentation of aerial images can integrate well-known deep learning models with hybrid or new-variant metaheuristic approaches. As deep learning-based semantic segmentation models mature, new application areas can be explored using intelligent algorithms to increase the accuracy rate [144]. In the future, the nonconventional data and labeling problems must be solved when preparing new datasets. Thus, this research helps researchers to understand semantic segmentation models, with several other possibilities for developing new future research perspectives.

6. Conclusion

This study has presented a comprehensive review of conventional semantic segmentation models based on deep learning approaches. For this purpose, a set of research works from recent years has been examined. This study has given information regarding the machine learning and deep learning techniques used, simulation tools, performance metrics, features and challenges of conventional semantic segmentation models, the different imaging modalities, and the datasets utilized. Finally, the research gaps and limitations were analyzed to explore future research perspectives for semantic segmentation systems. On the whole, this study has offered detailed information on semantic segmentation models, which is helpful for assisting researchers in presenting new semantic segmentation models in the upcoming years.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors acknowledge the Vellore Institute of Technology, Vellore, India, for providing excellent resources for this work. The authors would also like to thank the individual copyright holders for the consent granted to include the referenced figures in this work.