Abstract

With the emergence of deep learning, computer vision has witnessed extensive advancement and has seen immense applications in multiple domains. Specifically, image captioning has become an attractive focal direction for many machine learning experts, as it requires object identification, localization, and semantic understanding. In this paper, semantic segmentation and image captioning are comprehensively investigated based on traditional and state-of-the-art methodologies. In this survey, we examine the use of deep learning techniques for the segmentation of both 2D and 3D images using fully convolutional networks and other high-level hierarchical feature extraction methods. First, each domain’s preliminaries and concepts are described, and then semantic segmentation is discussed alongside its relevant features, available datasets, and evaluation criteria. Also, the capturing of semantic information about objects and their attributes is presented in relation to the generation of annotations. Finally, the existing methods, their contributions, and their relevance are analyzed, highlighting the importance of these methods and illuminating possible directions for future research on semantic image segmentation and image captioning.

1. Introduction

Data from optical perception are becoming increasingly available in large volumes, finding crucial use in several real-world applications such as quality assurance, medical analysis, surveillance, autonomous vehicles, face recognition, forensics and biometrics, and 3D reconstruction [1–4]. This upsurge in the bulk of digital images and video has driven the growth of computer vision (CV), a branch of computer science (CS). From a general overview, computer vision relates to the use of computers to gain a high-level understanding of images and videos [5]. Rather than manual operations, it encompasses the automatic acquisition, processing, and analysis of large data for the sole purpose of extracting patterns and intuition. In most cases, computer vision seeks to apply artificial intelligence (AI) theory, equations, tools, frameworks, and algorithms to help computers see and understand the content of both the digital and analog world by mimicking the human visual system [6]. Although seeing and understanding seem trivial or very easy for humans, they are nevertheless complex problems for computers, partly because of our limited understanding of how the human brain works and how it processes information [7]. However, through years of research and technological advancement, considerable feats have been achieved, and computer vision has extensively evolved [8–11]. Today, semantic segmentation remains a huge challenge in the scope of image and video understanding, alongside image captioning, which combines computer vision with another branch of artificial intelligence called natural language processing (NLP) to derive sentence descriptions of an image [12]. Notwithstanding, as with all other AI-related tasks, a modern subset of machine learning (ML), namely, deep learning (DL), has emerged as the evolution of machine learning, producing state-of-the-art results in almost all tasks compared to traditional algorithms such as decision trees, naive Bayes, support vector machines (SVMs), ensembles, and clustering algorithms [13–16].

Deep learning, as a branch of machine learning, uses layers of artificial neural networks to imitate the human neural network in automatically decoding intuition from large amounts of data [17]. It is unlike other machine learning algorithms, which rely heavily on feature engineering and domain knowledge in the creation of feature extractors [18]. The stacked layers of neural networks represent a feature hierarchy, as simple features at the initial layers are recombined from one layer to another to form complex features [19]. As a result, deeper networks are computationally intensive to model and train, leading to the manufacture of more advanced computational chips, including Graphical Processing Units (GPUs) and Tensor Processing Units (TPUs) [20, 21]. Presently, several deep learning models exist, and some of the most popular ones include recurrent neural networks (RNNs), autoencoders, convolutional neural networks (CNNs), deep belief networks (DBNs), and deep Boltzmann machines (DBMs) [22–25]. Among the most common deep learning algorithms, the convolutional neural network is the most suitable for analyzing visual imagery because of its shift- and space-invariant characteristics, taking advantage of hierarchical learning to combine simpler patterns into complex patterns and structures [26]. In the shared-weight architecture of filters [27–29], each filter represents different features of the input data which, when combined, can yield more complex structures [30–32].

In this paper, our prime motivation is to focus on recent deep learning techniques for the segmentation of both 2D and 3D images using fully convolutional networks and other high-level hierarchical feature extraction methods as an integral component of computer vision. This is further expanded into the generation of captions for images, which draws on artificial intelligence’s natural language processing. Furthermore, we review the discussed models’ accomplishments by comparing their evaluations, indicating the most effective and efficient approaches for different tasks and the challenges encountered. This, we believe, is enlightening as it provides insight for the further evolution of practical model design.

This paper is organized as follows: Section 2 introduces segmentation, popular segmentation algorithms, their characteristics, datasets, and evaluation. Section 3 introduces captioning and its various models, alongside available datasets, evaluation metrics, and a comparative discussion of the models. Finally, Section 4 concludes the paper with an overall summary of the typical problems, solutions, and possible directions in semantic segmentation and image captioning.

2. Semantic Segmentation

Semantic segmentation relates to the process of pixel-level classification of images such that each pixel in an image is assigned to a distinct class [33]. Since the inception of deep learning, semantic segmentation has been a pivotal area of image processing and computer vision which has seen major research and application in several domains [34]. Image segmentation recognizes boundaries between objects in an image by using line and curve segments to delineate such objects, while instance segmentation classifies instances of all the available classes in an image such that every object is identified as a separate entity. All the same, semantic segmentation differs from ordinary segmentation which, on the one hand, only expresses the partitioning of an image into clusters without a tangible attempt at understanding the partitioned clusters or relating them to one another [35]. Semantic segmentation, on the other hand, as the name implies, tries to describe semantically meaningful objects in an image based on their well-defined association and understanding [36]; these differences are well depicted in Figure 1.

2.1. Methods and Approaches
2.1.1. Traditional Methods

During the pre-ANN era, most segmentation and semantic segmentation approaches were predominantly thresholding and clustering algorithms, which are largely unsupervised methods. In most cases, traditional semantic segmentation methods consume less time for model computation. Also, most of these approaches require less data than the modern-day era of artificial neural networks and deep learning. The simplest, by all means, are the thresholding techniques, which apply pixel intensity as the criterion for partitioning. For binary segmentation, a single threshold value is required, and pixels on either side of the threshold are classified separately into two distinct classes. There are also advanced forms of thresholding involving more classes, and they are often grouped as histogram shape-based, entropy-based, object attribute-based, and spatial-based [37].
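A minimal sketch of binary thresholding is shown below, assuming a grayscale image stored as a NumPy array; the threshold value of 128 is an arbitrary illustrative choice rather than a value taken from the surveyed methods.

```python
# Binary thresholding: pixels above the threshold become foreground (1),
# all others become background (0).
import numpy as np

def binary_threshold(gray_image: np.ndarray, threshold: int = 128) -> np.ndarray:
    return (gray_image > threshold).astype(np.uint8)

# Example on a synthetic 4x4 grayscale image.
image = np.array([[ 10,  20, 200, 210],
                  [ 15,  25, 190, 220],
                  [ 12, 130, 140, 230],
                  [ 11,  18, 150, 240]], dtype=np.uint8)
print(binary_threshold(image))
```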

K-means clustering uses a predefined number of centroids to determine the number of clusters into which objects are categorized. The centroids are randomly selected at the beginning and then iteratively adjusted by computing the distance to the other points in the dataset and assigning each point to the closest centroid [38]. Fuzzy C-means (FCM), a technically advanced form of K-means, allows classification of data points into many label classes based on the degree of membership [39]. This is advantageous in situations where the dataset texture overlaps or does not have well-defined clusters [40]. The Gaussian mixture model (GMM) is also often used for both hard clustering and soft clustering by assigning each pixel to the component with the highest posterior probability [41]. GMM assumes that the data’s Gaussian distributions represent the number of clusters available in the data, and it uses the expectation-maximization (EM) algorithm to determine missing latent variables [42]. Random forest [43], naive Bayes [44], ensemble modeling [45], Markov random field (MRF) [46], and support vector machines (SVMs) [47] are also techniques that are useful for several tasks, especially classification and regression [48].
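As a simple illustration of this clustering view of segmentation, the sketch below clusters pixel colors with K-means, assuming scikit-learn is available; the number of clusters (k = 3) and the random image are illustrative assumptions.

```python
# Unsupervised color-based segmentation via K-means clustering of pixels.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_segment(image: np.ndarray, k: int = 3) -> np.ndarray:
    """Cluster pixels by color and return a label map of shape (H, W)."""
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(np.float64)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pixels)
    return labels.reshape(h, w)

# Example on a random RGB image; real use would load an actual photograph.
rng = np.random.default_rng(0)
label_map = kmeans_segment(rng.integers(0, 256, size=(64, 64, 3)))
print(label_map.shape, np.unique(label_map))
```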

2.1.2. Region-Based Models

In the region-based semantic segmentation design, regions are first extracted from an image and described based on their constituent features [49]. Then, a trained region classifier is used to label the pixels of each region with the class that scores highest for that region. The region-based approaches use the divide-and-conquer method such that many features are captured using multiscale features and then combined to form a whole. In cases where objects overlap several regions, the classifier either determines the most suitable region or the model is set to select the region with the maximum value [50]. This often causes pixel mismatch, but a postprocessing operation is mostly used to reduce its effects [51].

Region CNN (R-CNN) uses bounding boxes to identify and classify objects in an image by proposing several boxes over the image and identifying whether they correspond to an object [52, 53]. The process of selective search is used to create boundary boxes of varying window sizes for region proposal, and each of the boxes is classified based on different object properties, making the algorithm quite impressive but slow [54]. To overcome the drawbacks of R-CNN, Fast R-CNN [55] was proposed, which eliminates the redundant computation over the proposed regions, thereby lessening the computational requirements. The per-region CNN was replaced with a single CNN per image whose computation is shared among the proposals, using the region of interest pooling technique and training all the components, including the convolutional neural network used to classify the images and the bounding box regressor, as a single entity. Faster R-CNN [56] instead uses a region proposal network (RPN) to obtain proposals from the network itself, while Mask R-CNN [57] extends it to include pixel-level segmentation. Technically, Mask R-CNN replaces the region of interest pooling module of Faster R-CNN with a more accurate alignment module. It also includes an additional parallel branch for segmentation mask prediction [58].
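For a concrete sense of this family of models, the sketch below runs inference with a pretrained Mask R-CNN from torchvision, assuming torchvision ≥ 0.13 (for the weights argument); it illustrates the general R-CNN pipeline rather than the training setup of any specific surveyed paper.

```python
# Instance segmentation with a pretrained Mask R-CNN (torchvision model zoo).
import torch
from torchvision.models.detection import (maskrcnn_resnet50_fpn,
                                           MaskRCNN_ResNet50_FPN_Weights)

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()

# One dummy 3-channel image; in practice this would be a normalized photo tensor.
image = torch.rand(3, 480, 640)
with torch.no_grad():
    prediction = model([image])[0]

# Each detection comes with a bounding box, class label, confidence score, and mask.
print(prediction["boxes"].shape, prediction["labels"].shape, prediction["masks"].shape)
```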

2.1.3. Fully Convolutional Network-Based Models

Fully convolutional network (FCN) models do not have dense layers as in traditional CNNs; they are composed of 1 × 1 convolutions that accomplish the task of dense (fully connected) layers. Also, the fully convolutional network, as displayed in Figure 2, takes images of arbitrary sizes as the input and returns outputs of corresponding spatial dimensions. This model principally builds on the encoder-decoder design to classify pixels in an image into predefined classes, using a convolutional network in the encoder to extract features, thereby reducing the dimensionality of the feature maps before they are upsampled by the decoder (SegNet) [60]. During convolutional neural network computation, input images are downsized, resulting in a smaller output with reduced spatial features. This problem is solved via upsampling techniques, which transpose the downsized feature maps to a larger size, making pixelwise comparison efficient and effective. Some upsampling methods such as transposed convolutions are learned, thus increasing model complexity and computation, while several others exclude learning, including nearest neighbor, bed of nails, and max unpooling [61]. The fully convolutional network is typically trained end to end with a pixelwise loss using the backpropagation approach. The FCN was first proposed by Long et al. [59], using a popular classification CNN architecture such as AlexNet in the encoder and transposed convolution layers in the decoder to upsample the features to the desired dimension. A variant FCN with skip connections from previous layers in the network was proposed, named UNet [62]. UNet intends to complement the learned features with fine-grained details from the contracting path to capture context and enhance classification accuracy.
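The sketch below is a minimal FCN-style model in PyTorch, assuming a ResNet-18 backbone as the encoder; the 1 × 1 convolution replaces the dense layers to produce per-class score maps, and a transposed convolution upsamples them back to the input resolution. It is an illustrative simplification, not the exact architecture of Long et al.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TinyFCN(nn.Module):
    def __init__(self, num_classes: int = 21):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep all layers up to the last convolutional block (output stride 32).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolution in place of a fully connected classifier.
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)
        # Learned upsampling back to the input resolution (stride-32 transpose conv).
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=64, stride=32, padding=16)

    def forward(self, x):
        features = self.encoder(x)          # (N, 512, H/32, W/32)
        scores = self.classifier(features)  # per-class score maps
        return self.upsample(scores)        # roughly (N, C, H, W)

logits = TinyFCN()(torch.rand(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 21, 224, 224])
```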

Residual blocks were introduced with shorter skip connections between the encoder and decoder, granting faster convergence of deeper models during training [63]. Multiscale induction in dense blocks conveys low-level features across layers to those with high-level features, resulting in feature reuse [64]. This makes the model easier to train and improves performance as well. An architecture with a two-stream CNN uses its gates to process image shape in different branches, which are then connected and fused at a later stage. The model also proposed a new loss function that aligns segmentation predictions with the ground truth boundaries [65]. An end-to-end single-pass model applying dense decoder shortcut connections extracts semantics from high-level features such that propagation of information from one block to another combines multiple-level features [66]. The model, designed on ResNeXt’s residual building blocks, aggregates captured feature blocks, which are fused into the output resolution [67].
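A minimal residual block sketch is given below to make the skip-connection idea concrete: the input is added back to the transformed features, which eases gradient flow and speeds convergence of deeper encoder-decoder models. The channel size and layer choices are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.activation = nn.ReLU(inplace=True)

    def forward(self, x):
        # Identity skip connection: output = F(x) + x.
        return self.activation(self.body(x) + x)

print(ResidualBlock(64)(torch.rand(1, 64, 32, 32)).shape)
```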

ExFuse aims to bridge the gap between low- and high-level features in convolutional networks by introducing semantic information into the lower-level features and high resolution into the high-level features. This was achieved by proposing two fusion methods named explicit channel resolution embedding and densely adjacent prediction [68]. Contrary to most models, a balance between model accuracy and speed was achieved in ICNet, which consolidates several multiresolution branches by introducing an image cascade network that allows real-time inference. The network cascades image inputs into low-, medium-, and high-resolution branches before being trained with label guidance [69].

2.1.4. Refinement Network

Because of the resolution reduction caused by the encoder in the typical FCN-based model, the decoder has inherent problems producing fine-grained segmentation, especially at boundaries and edges [70]. Though this problem has been tackled by incorporating skip connections, adding global information, and other means, it is by no means solved, and some algorithms have involved additional features or, in some cases, certain postprocessing functions to find alternative solutions [71]. DeepLab1 [72] combines ideas from deep convolutional neural networks and probabilistic graphical models to achieve pixel-level classification. The localization problem of the neural network output layer was remedied using a fully connected conditional random field (CRF) as a means of refining the segmentation output. DeepLab1 applied atrous convolutions instead of regular convolutions, which accomplishes the learning of aggregated multiscale contextual features. As visible in Figure 3, DeepLab1 allows the expansion of kernel window sizes without increasing the number of weights. The multiscale atrous convolutions help to overcome the insensitivity to fine details of other models and decrease output blurriness. This can result in additional computation and time depending on the postprocessing network’s operations.
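The sketch below illustrates atrous (dilated) convolution in PyTorch: the dilation rate enlarges the receptive field without adding weights or reducing resolution. The channel count and rates are illustrative choices, not those of any specific DeepLab configuration.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 256, 32, 32)

standard  = nn.Conv2d(256, 256, kernel_size=3, padding=1)               # 3x3 receptive field
atrous_r2 = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)   # effective 5x5
atrous_r4 = nn.Conv2d(256, 256, kernel_size=3, padding=4, dilation=4)   # effective 9x9

# Same number of parameters and same output resolution, but larger context per filter.
for conv in (standard, atrous_r2, atrous_r4):
    params = sum(p.numel() for p in conv.parameters())
    print(conv(x).shape, params)
```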

The ResNet deep convolutional network architecture was applied in DeepLab2, which enables training deeper networks while preserving performance [73]. In addition, DeepLab2 uses atrous spatial pyramid pooling (ASPP) to capture long-range context. The problems of objects existing at different scales and reduced feature resolution in semantic segmentation are tackled by designing a cascade of atrous convolutions which can run in parallel to capture context at various scales, alongside global average pooling (GAP) to embed global context features [74]. FastFCN implements joint pyramid upsampling, which substitutes atrous convolutions to free up memory and lessen computation. Using a fully convolutional network framework, the joint pyramid upsampling technique formulates the extraction of high-resolution feature maps as a joint upsampling problem. The model uses atrous spatial pyramid pooling to extract the last three layers’ output features and a global context module to map out the final predictions [75]. The atrous spatial pyramid pooling limitation of lacking dense feature resolution across scales is addressed by concatenating multiple branches of atrous-convolved features at different rates, which are later fused into a final representation, resulting in dense multiscale features [76].
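A simplified ASPP sketch follows: parallel atrous convolutions at several rates plus global average pooling are concatenated and fused with a 1 × 1 convolution. The channel sizes and rates are illustrative assumptions, not the exact DeepLab2 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASPP(nn.Module):
    def __init__(self, in_ch: int = 256, out_ch: int = 256, rates=(1, 6, 12, 18)):
        super().__init__()
        # One branch per rate: a 1x1 conv for rate 1, atrous 3x3 convs otherwise.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3 if r > 1 else 1,
                      padding=r if r > 1 else 0, dilation=r)
            for r in rates
        ])
        # Image-level context via global average pooling.
        self.global_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.global_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

print(SimpleASPP()(torch.rand(1, 256, 32, 32)).shape)
```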

2.1.5. Weakly Supervised and Semisupervised Approaches

Though most models depend on a large number of images and their annotated labels, the process of manually annotating labels is quite daunting and time-consuming, so semantic segmentation models have also been attempted with weakly supervised approaches. Given weakly annotated data at the image level, a model was trained to assign higher weights to pixels corresponding to the class label. Trained on a subset of the ImageNet dataset, the network learns during training to recognize important pixels relating to a previously labeled single-class object and to match them to the class through inference [77]. Bounding box annotation has been used to train semantic labeling of image segmentation, accomplishing 95% of the quality of fully supervised models; the bounding box and information about the constituent object were used as priors for training [78]. A model combining both labeled and weakly annotated images, with a clue of the presence or absence of a semantic class, was developed using a deep convolutional neural network and the expectation-maximization (EM) algorithm [79].

BoxSup iteratively generates automatic region proposals while training convolutional networks to obtain segmentation masks as well as improve the model’s ability to classify objects in an image. The network uses bounding box annotations as a substitute for full supervision such that regions are proposed during training to determine candidate masks, over time improving the confidence of the segmentation masks [80]. A variant of the generative adversarial learning approach, which consists of a generator and a discriminator, was used to design a semisupervised semantic segmentation model. The model was first trained using fully labeled data, which enables the generator to learn the domain sample space of the dataset; this is then leveraged to supervise unlabeled data. Alongside the cross-entropy loss, an adversarial loss was proposed to optimize the objective of tutoring the generator to generate images as close as possible to the image labels [81].

2.2. Datasets

Deep learning requires an extensive amount of training data to comprehensively learn patterns and fine-tune the number of parameters needed for its gradient convergence. Accordingly, there are several available datasets specifically designed for the task of semantic segmentation which are as follows:

PASCAL VOC: PASCAL Visual Object Classes (VOC) [82] is arguably the most popular semantic segmentation dataset, with 21 classes of predefined object labels, background included. The dataset contains images and annotations which can be used for detection, classification, action classification, person layout, and segmentation tasks. The training, validation, and test sets contain 1464, 1449, and 1456 images, respectively. The dataset has been used for yearly public competitions since 2005.

MS COCO: Microsoft Common Objects in Context [83] was created to push the computer vision state of the art with standard images, annotation, and evaluation. Its object detection task dataset uses either instance/object annotated features or a bounding box. In total, it has 80 object categories with over 800,000 available images for its training, test, and validation sets, as well as over 500,000 object instances that are segmented.

Cityscapes: the Cityscapes dataset [84] contains a huge number of images taken from 50 cities during different seasons and times of the year. It was initially a video recording, and the frames were extracted as images. It has 30 label classes in about 5000 densely annotated images and 20,000 weakly annotated images, which have been categorized into 8 groups: humans, vehicles, flat surfaces, constructions, objects, void, nature, and sky. It was primarily designed for urban scene segmentation and understanding.

ADE20K: ADE20K dataset [85] has 20,210 training images, 2000 validation images, and 3000 test images which are well suited for scene parsing and object detection. Alongside the 3-channel images, the dataset contains segmentation masks, part segmentation masks, and a text file that contains information about the object classes, identification of instances of the same class, and the description of each image’s content.

CamVid: CamVid [86] is also a video sequence of scenes which has been extracted into high-resolution images for segmentation tasks. It consists of 101 images of 960 × 720 dimension and their annotations, which have been classified into 32 object classes including void, indicating areas which do not belong to a proper object class. The dataset’s RGB class values are also available, ranging from 0 to 255.

KITTI: KITTI [87] is popularly used for robotics and autonomous car training, focusing extensively on 3D tracking, stereo, optical flow, 3D object detection, and visual odometry. The images were obtained through two high-resolution cameras attached to a car driving around the city of Karlsruhe, Germany, while their annotations were done by a Velodyne laser scanner. The data aim to reduce bias in existing benchmarks with a standard evaluation metric and website.

SYNTHIA: SYNTHIA (SYNTHetic Collection of Imagery and Annotations) [88] is a collection of synthetic images rendered from a virtual city with high pixel-level resolution. The dataset has 13 predefined label classes, including road, sidewalk, fence, sky, building, sign, pedestrian, vegetation, pole, and car. It has a total of 13,407 training images.

2.3. Evaluation Metrics

The performance of segmentation models is computed mostly in the supervised scope, whereby the model’s prediction is compared with the ground truth at the pixel level. The common evaluation metrics are pixel accuracy (PA) and intersection over union (IoU). Pixel accuracy refers to the ratio of all the pixels classified in their correct classes to the total number of pixels in the image. Pixel accuracy is trivial and suffers from class imbalance, whereby certain classes immensely dominate other classes:

$$\mathrm{PA} = \frac{\sum_{i=1}^{K} p_{ii}}{\sum_{i=1}^{K}\sum_{j=1}^{K} p_{ij}},$$

where $K$ is the number of classes, $p_{ii}$ is the number of pixels of class $i$ which are predicted as class $i$, $p_{ij}$ is the number of pixels of class $i$ which are predicted as class $j$, and the total number of pixels of a particular class $i$ is $t_i = \sum_{j=1}^{K} p_{ij}$.

Mean pixel accuracy (mPA) improves slightly on pixel accuracy; it computes the accuracy per class instead of a global computation over all the classes. The per-class accuracies are then averaged over the total number of classes.

Intersection over union (IoU) metric, which is also referred to as the Jaccard index, measures the percentage overlap of the ground truth to the model prediction at the pixel level, thereby computing the amount of pixels common with the ground truth label and mask prediction [89].
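The sketch below computes pixel accuracy, mean pixel accuracy, and mean IoU from flattened ground-truth and predicted label maps using only NumPy; it mirrors the definitions above but is not taken from any particular benchmark toolkit, and the toy label maps are illustrative.

```python
import numpy as np

def segmentation_metrics(gt: np.ndarray, pred: np.ndarray, num_classes: int):
    gt, pred = gt.ravel(), pred.ravel()
    # confusion[i, j] = number of pixels of class i predicted as class j
    confusion = np.bincount(gt * num_classes + pred,
                            minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(confusion).astype(np.float64)
    per_class_total = confusion.sum(axis=1).astype(np.float64)   # pixels of class i
    predicted_total = confusion.sum(axis=0).astype(np.float64)   # pixels predicted as i

    pixel_accuracy = tp.sum() / confusion.sum()
    mean_pixel_accuracy = np.nanmean(tp / per_class_total)
    iou = tp / (per_class_total + predicted_total - tp)           # per-class IoU
    return pixel_accuracy, mean_pixel_accuracy, np.nanmean(iou)

gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
print(segmentation_metrics(gt, pred, num_classes=3))
```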

2.4. Discussion

Different machine learning and deep learning algorithms and backbones yield different results based on the models’ ability to learn mappings from input images to the label. In tasks involving images, CNN-based approaches are by far the most expedient. Although they can be computationally expensive compared to other simpler models, such models occupy a bulk of the present state of the art. Traditional machine learning algorithms such as random forest, naive Bayes, ensemble modeling, Markov random field (MRF), and support vector machines (SVMs) are too simple and rely heavily on domain feature understanding or handcrafted feature engineering, and in some cases, they are not easy to fine-tune. Also, clustering algorithms such as K-means and fuzzy C-means mostly require that the number of clusters is specified beforehand, and they are not very effective with multiple boundaries.

Because of the CNN’s invariant property, it is very effective for spatial data and object detection and localization. Besides, the modern backbone of the fully convolutional network has informed several methods of improving segmentation localization. First, the decoder uses upsampling techniques to increase the features' resolution, and then skip connections are added to achieve the transfer of fine-grain features to the other layers. Furthermore, postprocessing operations as well as the context and attention networks have been exploited.

The supervised learning approach still remains the dominant technique, as there have been many options for generating datasets, as displayed in Table 1. Data augmentation involving operations such as scaling, flipping, rotating, cropping, and translating has made multiplication of data possible. Also, the application of the generative adversarial network (GAN) has played a major role in the replication of images and annotations.

3. Image Captioning

Image captioning relates to the general idea of automatically generating the textual description of an image, and it is often also referred to as image annotation. It involves the application of both computer vision and natural language processing tools to achieve the transformation of imagery depiction into a textual composition [111]. Tasks such as captioning were almost impossible prior to the advent of deep learning, but with advances in sophisticated algorithms, multimodal techniques, efficient hardware, and large bulks of datasets, such tasks are becoming easier to accomplish [112]. Image captioning has several applications to real-world problems, including aid to the visually impaired, autonomous cars, academic bots, and military purposes. To a large extent, the majority of image captioning success so far has come from the supervised domain, whereby huge amounts of data consisting of images and about two to five label captions describing the content of the images are provided [113]. This way, the network is tasked with learning the images’ feature representation and mapping it to a language model such that the end goal of a captioning network is to generate a textual representation of an image’s depiction [114].

Though characterizing an image in the form of text seems trivial and straightforward for humans, it is by no means simple to replicate in an artificial system and requires advanced techniques to extract the features from the images as well as to map the features to the corresponding language model. Generally, a convolutional neural network (CNN) is tasked with feature extraction, while a recurrent neural network (RNN) is relied upon to align the training annotations with the image features [115]. Aside from determining and extracting salient and intricate details in an image, it is equally important to extract the interactions and semantic relationships between such objects and to illustrate them in the right manner using appropriate tenses and sentence structures [116]. Also, because the training labels, which are texts, differ from the features obtained from the images, language model techniques are required to analyze the form, meaning, and context of a sequence of words. This becomes even more complex as keywords are required to be identified for emphasizing the action or scene being described [117].

Visual features: a deep convolutional neural network (DCNN) is often used as the feature extractor for images and videos because of the CNN’s invariance property [118], such that it is able to recognize objects regardless of variations in appearance such as size, illumination, translation, or rotation, as displayed in Figure 4. Distortion in pixel arrangement has little impact on the architecture’s ability to learn the essential patterns for identification and localization of the crucial features. Essential feature extraction is paramount, and this is easily achieved via the CNN’s operation of convolving filters over images, subsequently generating feature maps from the receptive fields over which the filters are applied. Using backpropagation techniques, the filter weights are updated to minimize the loss of the model’s prediction compared to the ground truth [119]. There have been several evolutions over the years, and this has ushered in considerable architectural development in the extraction methodology. Recently, the use of pretrained models has been explored with the advantage of reducing time and computational cost while preserving efficiency. These extracted features are passed along to other models, such as the language decoder in visual space-based methods, or to a shared model as in the multimodal space for image captioning training [120].
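The sketch below shows one common way of obtaining such visual features, assuming torchvision’s pretrained ResNet-50 as the backbone; the final classification layer is dropped so the network outputs a fixed-length feature vector per image. The backbone choice is an assumption for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

# Pretrained CNN with the final fully connected layer removed.
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1]).eval()

with torch.no_grad():
    images = torch.rand(4, 3, 224, 224)               # a batch of preprocessed images
    features = feature_extractor(images).flatten(1)    # shape (4, 2048)
print(features.shape)
```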

Captioning: image captioning or annotation is an independent scope of artificial intelligence, and it mostly combines two models consisting of a feature extractor as the encoder and a recurrent decoder model. While the extractor obtains salient features from the images, the decoder model, which is similar in pattern to a language model, utilizes a recurrent neural network to learn sequential information [121]. Most captioning tasks are undertaken in a supervised manner whereby the image features act as the input, which is learned and mapped to a textual label [122]. The label captions are first transformed into word vectors and are combined with the feature vector to generate a new textual description. Most captioning architectures follow the partial-caption technique, whereby part of the label vector is combined with the image vector to predict the next word in the sequence [123]. The previously generated words are then combined to predict the next word, and this continues until an end token is reached. In most cases, to infuse semantics and intuitive representation into the label vector, a pretrained word embedding is used to map the dimensional representation of the embeddings into the word vector, enriching its content and generalization [124].
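The partial-caption scheme can be illustrated with the short sketch below: each caption is expanded into (prefix, next word) training pairs that are later paired with the image feature vector. The toy vocabulary and caption are purely illustrative.

```python
# Build (partial caption, next word) training pairs from a single caption.
caption = ["<start>", "a", "dog", "chases", "a", "ball", "<end>"]
vocab = {word: idx for idx, word in enumerate(sorted(set(caption)))}

pairs = []
for i in range(1, len(caption)):
    prefix = [vocab[w] for w in caption[:i]]   # partial caption seen so far
    target = vocab[caption[i]]                 # next word to predict
    pairs.append((prefix, target))

for prefix, target in pairs:
    print(prefix, "->", target)
```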

3.1. Image Captioning Techniques
3.1.1. Retrieval-Based Captioning

Early works in image captioning were based on caption retrieval. Using this technique, the caption for a target image is generated by retrieving the descriptive sentence of such an image from a set of predefined caption databases. In some cases, the newly generated caption is one of the existing retrieved sentences or, in other cases, is composed of several existing retrieved sentences [125]. Initially, the features of an image are compared to those of the available candidate captions, or retrieval is achieved by tagging the image properties in a query. Certain properties such as color, texture, shape, and size were used for similarity computation between a target image and predefined images [126]. Captioning via the retrieval method can be performed through the visual and multimodal spaces, and these approaches generally produce good results but are overdependent on the predefined captions [127].

Specific details such as the object in an image or the action or scene depicted were used to connect images to the corresponding captions. This was computed by measuring the similarity between such information and the available sentences [128]. Using the kernel canonical correlation analysis technique, images and related sentences were ranked based on their cosine similarities, after which the most similar ones were selected as suitable labels [129], while the image features were used for reranking the image-text correlation [130]. The edges and contours of images were utilized to obtain patterns and sketches such that they could be used as a query for image retrieval, whereby the generated sketches and the original images were structured into a database [131]. Building on the logic of the original image and its sketch, more images were dynamically included alongside their sketches to enhance learning [132]. Furthermore, deep learning models were applied to retrieval-based captioning by using convolutional neural networks (CNNs) to extract features from different regions in an image [133].
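A minimal sketch of the retrieval idea follows: the query image feature is compared against features of images with known captions, and the caption of the nearest neighbor under cosine similarity is reused. The features and captions here are random, illustrative placeholders rather than outputs of any surveyed system.

```python
import numpy as np

def retrieve_caption(query_feat, database_feats, database_captions):
    q = query_feat / np.linalg.norm(query_feat)
    db = database_feats / np.linalg.norm(database_feats, axis=1, keepdims=True)
    similarities = db @ q                       # cosine similarity to each stored image
    return database_captions[int(np.argmax(similarities))]

rng = np.random.default_rng(0)
database_feats = rng.normal(size=(3, 2048))
database_captions = ["a dog on grass", "a red car on a road", "two people cooking"]
print(retrieve_caption(rng.normal(size=2048), database_feats, database_captions))
```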

3.1.2. Template-Based Captioning

Another common approach to image annotation is the template-based approach, which involves the identification of certain attributes such as object type, shape, actions, and scenes in an image, which are then used to form sentences in a prestructured template [134]. In this method, the predetermined template has a constant number of slots such that all the detected attributes are positioned in the slots to make up a sentence. In this case, the words representing the detected features make up the caption and are arranged such that they are syntactically and semantically related, thus generating grammatically correct representations [135]. The fixed-template limitation of the template-based approach was overcome by incorporating a parsed language model [136], giving the network higher flexibility and the ability to generate better captions.
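A toy sketch of the slot-filling idea is shown below: detected attributes fill fixed slots in a predefined sentence template. The detector output shown here is assumed, standing in for the attribute classifiers used in the surveyed methods.

```python
def template_caption(detections: dict) -> str:
    # Fixed template with slots for the detected subject, action, object, and scene.
    template = "A {subject} is {action} {object} in the {scene}."
    return template.format(**detections)

detections = {"subject": "man", "action": "riding", "object": "a bicycle", "scene": "park"}
print(template_caption(detections))
# -> "A man is riding a bicycle in the park."
```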

The underlying nouns, scenes, verbs, and prepositions determining the main idea of a sentence were explored and trained using a language model to obtain the probability distribution of such parameters [137]. Certain human postures and orientations which do not involve the movement of the hands, such as walking, standing, and looking, together with the position of the head, were used to generate captions for an image [138]. Furthermore, the postures were extended to describe human behavior and interactions by incorporating motion features and associating them with the corresponding action [139]. Each body part and the action it is undergoing are identified, and then this is compiled and integrated to form a description of the complete human body. Human posture, position and direction of the head, and position of the hands were selected as geometric information for network modeling.

3.1.3. Neural Network-Based Captioning

Compared to other machine learning algorithms or preexisting approaches, deep learning has achieved astonishing heights in image captioning, setting new benchmarks with almost all of the datasets in all of the evaluation metrics. These deep learning approaches are mostly in the supervised setting which requires a huge amount of training data including both images and their corresponding caption labels. Several models have been applied such as artificial neural network (ANN), convolutional neural network (CNN), recurrent neural network (RNN), autoencoder, generative adversarial network (GAN), and even a combination of one or more of them.

Dense captioning: dense captioning emerges as a branch of computer vision whereby pictorial features are densely annotated depending on the objects, the objects’ motion, and their interaction. The concept depends on the identification of features as well as their localization and, finally, expressing such features with short descriptive phrases [140]. This idea is drawn from the understanding that providing a single description for a whole picture can be complex or sometimes biased; therefore, a couple of annotations are generated relating to different recognized salient features of the image. The training data of a dense caption differ from those of a global caption in that various labels are given for individual features identified by bounding boxes, whereas a general description is given in global captioning without a need for placing bounding boxes on the images [141]. As visible in Figure 5, a phrase is generated from each region in the image, and these regions can be compiled to form a complete caption of the image. Generally, dense captioning models face a huge challenge as most of the target regions in the images overlap, which makes accurate localization challenging and daunting [143].

The intermodal alignment of text and images was investigated on region-level annotations, pioneering a new approach for captioning that leverages the alignment between the feature embedding and the word vector semantic embedding [144]. A fully convolutional localization network (FCLN) was developed to determine important regions of interest in an image. The model combines a recurrent neural network language model and a convolutional neural network to enforce the logic of object detection and image description. The designed network uses a dense localization layer and convolution anchors built on the Faster R-CNN technique to predict region proposals from the input features [142]. A contextual information model that combines past and future features of a video spanning up to two minutes achieved dense captioning by transforming the video input into slices of frames. With this, an event proposal module helps to extract the context from the frames, which is fed into a long short-term memory (LSTM) unit, enabling it to generate different proposals at different time steps [145]. Also, a framework having separate detection and localization captioning networks accomplished improved dense captioning with faster speed by directly producing detected features rather than via the common use of the region proposal network (RPN) [146].

Encoder-decoder framework: most image captioning tasks are built on the encoder-decoder structure whereby the images and texts are managed as separate entities by different models. In most cases, a convolutional neural network is presented as the encoder which acts as a feature extractor, while a recurrent neural network is presented as the decoder which serves as a language model to process the extracted features in parallel with the text label, consequently generating predicted captions for the input images [147]. CNN helps to identify and localize objects and their interaction, and then this insight is combined with long-term dependencies from a recurrent network cell to predict a word at a time, depending on the image context vector and previously generated words [148]. Multiple CNN-based encoders were proposed to provide a more comprehensive and robust capturing of objects and their interaction from images. The idea of applying multiple encoders is suggested to complement each unit of the encoder to obtain better feature extraction. These interactions are then translated to a novel recurrent fusion network (RFNet) which could fuse and embed the semantics from the multiple encoders to generate meaningful textual representations and descriptions [149].
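A compact sketch of this encoder-decoder setup is given below: a CNN feature vector conditions an LSTM language model that predicts the caption one word at a time. The vocabulary size, dimensions, and wiring are illustrative assumptions rather than a specific published architecture.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feature_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.project = nn.Linear(feature_dim, embed_dim)   # map image feature into embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, captions):
        # Prepend the projected image feature as the first "token" of the sequence.
        img_token = self.project(image_features).unsqueeze(1)   # (N, 1, E)
        word_tokens = self.embed(captions)                      # (N, T, E)
        sequence = torch.cat([img_token, word_tokens], dim=1)
        hidden, _ = self.lstm(sequence)
        return self.out(hidden)                                 # (N, T+1, vocab)

decoder = CaptionDecoder()
logits = decoder(torch.rand(4, 2048), torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 13, 10000])
```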

Laid out in Figure 6, a combination of two CNN models as both encoder and decoder was explored to speed up computational time for image captioning tasks. Because RNN’s long-range information is computed step by step, this causes very expensive computation and is solved by stacking layers of convolution to mimic tree structure learning of the sentences [150]. Three distinct levels of features which are regional, visual, and semantic features were encoded in a model to represent different analyses of the input image, and then this is fed into a two-layer LSTM decoder for generating well-defined captions [151]. A concept-based sentence reranking technique was incorporated into the CNN-LSTM model such that concept detectors are added to the underlying sentence generation model for better image description with minimal manual annotation [152]. Furthermore, the generative adversarial network (GAN) was conditioned on a binary vector for captioning. The binary vector represented some form of sentiment which the image portrays and then was used to train the adversarial model. The model took both images and an adjective or adjective-noun pair as the input to determine if the network could generate a caption describing the intended sentimental stance [153].

Attention-guided captioning: attention has become increasingly paramount, and it has driven better benchmarks in several tasks such as machine translation, language modeling, and other natural language processing tasks, as well as computer vision tasks. In fact, attention has proven to correlate the meaning between features, and this helps in understanding how such features relate to one another [154]. Incorporated into a neural network, it encourages the model to focus on salient and relevant features and to pay less consideration to other noisy aspects of the data distribution [155]. To estimate the concept of attention in image annotation, a model is trained to concentrate its computation on the identified salient regions while generating captions using both soft and hard attention [156]. Deterministic soft attention, which is trainable via standard backpropagation, is learned by weighting the annotated vector of the image features, while stochastic hard attention is trained by maximizing a variational lower bound, setting the weight to 1 when the feature is salient [157].
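A minimal sketch of deterministic soft attention over image region features follows: attention weights are computed from the decoder’s hidden state and used to form a weighted context vector, and the whole module is trainable by backpropagation. The dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feature_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feature_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, region_features, hidden_state):
        # region_features: (N, R, F) for R image regions; hidden_state: (N, H).
        energy = torch.tanh(self.feat_proj(region_features) +
                            self.hidden_proj(hidden_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)    # (N, R)
        context = (weights.unsqueeze(-1) * region_features).sum(dim=1)    # (N, F)
        return context, weights

context, weights = SoftAttention()(torch.rand(2, 49, 512), torch.rand(2, 512))
print(context.shape, weights.shape)
```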

Following the where-and-what analysis of what the model should concentrate on, adaptive attention used a hierarchical structure to fuse both high-level semantic information and visual information from an image to form an intuitive representation [120]. The top-down and bottom-up approaches are fused using semantic attention, which first defines attribute detectors that dynamically enable it to switch between concepts. This empowers the detectors to determine suitable candidates for attention computation based on the specified inputs [158]. The limitations of long-distance dependency and inference speed in medical image captioning were tackled using a hierarchical transformer. The model includes an image encoder that extracts features with the support of a bottom-up attention mechanism to capture and extract top-down visual features, as well as a nonrecurrent transformer captioning decoder which helps to compile the generated medical illustration [159]. Salient regions in an image are also extracted using a convolutional model, with region features represented as pooled feature vectors. These intraimage region vectors are appropriately attended to obtain suitable weights describing their influence before they are fed into a recurrent model that learns their semantic correlation. The corresponding sequence of correlated features is transformed into representations which are illustrated as sentences describing the features’ interactions [160].

3.1.4. Unsupervised or Semisupervised Captioning

Supervised captioning has so far been productive and successful, partly due to the availability of sophisticated deep learning algorithms and an increasing outpouring of data. Through supervised deep learning techniques, a combination of models and frameworks which learn the joint distribution of images and labels has displayed very intuitive and meaningful illustrations of images, even similar to those of humans. However, despite this achievement, the process of completely creating a captioning training set is quite daunting, and the manual effort required to annotate the myriad of images is very challenging. As a result, other means which are free of excessive training data have been explored. An unsupervised captioning approach that combines two steps of query and retrieval was researched in [161]. First, several target images are obtained from the internet as well as a huge database of words describing such images. For any chosen image, words representing its visual display are used to query captions from a reference dataset of sentences. This strategy helps to eliminate manual annotation and also uses multimodal textual-visual information to reduce the effect of noisy words in the vocabulary dataset.

Transfer learning, which has seen increasing application in other deep learning domains, especially in computer vision, was applied to image captioning. First, the model is trained on a standard dataset in a supervised manner, and then the knowledge from the supervised model is transferred and applied to a different dataset whose sentences and images are not paired. For this purpose, two autoencoders were designed to train on the textual and visual datasets, respectively, using the distribution of the learned supervised embedding space to infer the unstructured dataset [162]. Also, the process of manual annotation of the training set was semiautomated by projecting an image into several feature spaces which are individually estimated by an unsupervised clustering algorithm. The centers of the clustered groups are then manually labeled and compiled into a sentence through a voting scheme which compiles all the opinions suggested by each cluster [163]. A set of naive Bayes models with AdaBoost was used for automatic image annotation by first using a Bayesian classifier to identify unlabeled images, which are then labeled by a succeeding classifier based on the confidence measurement of the prior classifier [164]. A combination of keywords which have been associated with both labeled and unlabeled images was trained using a graph model. The semantic consistency of the unlabeled images is computed and compared to that of the labeled images. This continues until all the unlabeled images are successfully annotated [165].

3.1.5. Difference Captioning

As presented in Figure 7, a spot-the-difference task which describes the differences between two similar images using advanced deep learning techniques was first investigated in [166]. Their model used a latent variable to capture visual salience in an image pair by aligning pixels which differ in both images. Their work included different model designs such as a nearest neighbor matching scheme, a captioning masked model, and Difference Description with Latent Alignment uniform for obtaining difference captions. The Difference Description with Latent Alignment (DDLA) approach compares both input images at the pixel level via a masked L2 distance function.

Furthermore, the Siamese Difference Captioning Model (SDCM) also combined techniques from deep Siamese convolutional neural network, soft attention mechanism, word embedding, and bidirectional long short-term memory [167]. The features in each image input are computed using the Siamese network, and their differences are obtained using a weighted L1 distance function. Different features are then recursively translated into text using a recurrent neural network and an attention network which focuses on the relevant region on the images. The idea of the Siamese Difference Captioning Model was extended by converting the Siamese encoder into a Fully Convolutional CaptionNet (FCC) through a fully convolutional network [168]. This helps to transform the extracted features into a larger dimension of the input images which makes difference computation more efficient. Also, a word embedding pretrained model was used to embed semantics into the text dataset and beam search technique to ensure multiple options for robustness.
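A minimal sketch in the spirit of these Siamese designs is shown below: the two images pass through a shared-weight CNN, and a per-location L1 difference of their feature maps summarizes what changed between them. The backbone choice and the exact distance are illustrative simplifications, not the published SDCM or FCC architectures.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Shared-weight encoder: the same module processes both images.
backbone = resnet18(weights=None)
shared_encoder = nn.Sequential(*list(backbone.children())[:-2]).eval()

def difference_features(image_a, image_b):
    with torch.no_grad():
        feat_a = shared_encoder(image_a)   # (N, 512, H/32, W/32)
        feat_b = shared_encoder(image_b)
    return torch.abs(feat_a - feat_b)      # L1-style feature difference

diff = difference_features(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
print(diff.shape)  # these difference features would feed an attention + LSTM decoder
```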

3.2. Datasets

There are several publicly available datasets which are useful for training image captioning tasks. The most popular datasets include Flickr8k [169], Flickr30k [170], MS COCO dataset [83], Visual Genome dataset [171], Instagram dataset [172], and MIT-Adobe FiveK dataset [173].

Flickr30K dataset: it has about 30,000 images from Flickr and about 158,000 captions describing the content of the images. Because of the huge volume of the data, users are able to determine their preferred split size for using the data.

Flickr8K dataset: it has a total of 8,000 images which are divided as 6,000, 1,000, and 1,000 for the training, test, and validation set, respectively. All the images have 5 label captions which are used as a supervised setting for training the images.

Microsoft COCO dataset: it is perhaps the largest captioning dataset, and it also includes training data for object recognition and image segmentation tasks, respectively. The dataset contains around 300,000 images with 5 captions for each image.

3.3. Evaluation Metrics

The automatically generated captions are evaluated to confirm their correctness in describing the given image. In machine learning, some of the common image captioning evaluation measures are as follows.

BLEU (BiLingual Evaluation Understudy) [174]: as a metric, it counts the number of matching n-grams in the model’s prediction compared to the ground truth. Precision is calculated from the proportion of matched n-grams, averaged over the n-gram orders, while the lack of an explicit recall term is compensated for by a brevity penalty that punishes candidate captions shorter than the reference.
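A quick sketch of computing BLEU for a generated caption with NLTK is shown below, assuming the nltk package is installed; smoothing is applied because short captions often have zero higher-order n-gram matches. The tokenized captions are illustrative.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "dog", "is", "running", "on", "the", "grass"],
              ["a", "dog", "runs", "across", "the", "grass"]]
candidate = ["a", "dog", "is", "running", "in", "the", "grass"]

smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(round(score, 3))
```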

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [175]: it is useful for summary evaluation and is calculated as the overlap of either unigrams or bigrams between the reference caption and the predicted sequence. Using the longest common subsequence, the F-score combining the predicted sequence’s recall and precision is obtained.

METEOR (Metric for Evaluation of Translation with Explicit Ordering) [176]: it addresses the drawback of BLEU, and it is based on a weighted F-score computation as well as a penalty function meant to check the order of the candidate sequence. It adopts synonyms matching in the detection of similarity between sentences.

CIDEr (Consensus-based Image Description Evaluation) [177]: it determines the consensus between a reference sequence and a predicted sequence via cosine similarity, stemming, and TF-IDF weighting. The predicted sequence is compared to the combination of all available reference sequences.

SPICE (Semantic Propositional Image Caption Evaluation) [178]: it is a relatively new caption metric which relates with the semantic interrelationship between the generated and referenced sequence. Its graph-based methodology uses a scene graph of semantic representations to indicate details of objects and their interaction to describe their textual illustrations.

3.4. Discussion

With an increase in the generation of data, production of sophisticated computing hardware, and complex machine learning algorithms, a lot of achievements have been accomplished in the field of image captioning. Though there have been several implementations, the best results in almost all of the metrics have been recorded through the use of deep learning models. In most cases, the common implementation has been the encoder-decoder architecture which has a feature extractor as the encoder and a language model as the decoder.

Compared to other approaches, this has proven useful as it has become the backbone for more recent designs. To achieve better feature computation, attention mechanism concepts have been applied to help focus on the salient sections of images and their features, thereby improving feature-text capturing and translation. In the same manner, other approaches such as generative adversarial networks and autoencoders have been thriving in achieving concise image annotation, and to this end, such ideas have been incorporated with other unsupervised concepts for captioning purposes as well. For example, reinforcement learning techniques have also generated sequences which are able to succinctly describe images in a timely manner. Furthermore, analyses of several model designs and their results are displayed in Table 2, depicting their efficiency and effectiveness in the BLEU, METEOR, ROUGE-L, CIDEr, and SPICE metrics.

4. Conclusion

In this survey, the state-of-the-art advances in semantic segmentation and image captioning have been discussed. The characteristics and effectiveness of the important techniques have been considered, as well as their processes for achieving both tasks. Some of the methods which have accomplished outstanding results have been illustrated, including the extraction, identification, and localization of objects in semantic segmentation. Also, the process of feature extraction and transformation into a language model has been studied in the image captioning section. In our estimation, we believe that because of the daunting task of manually segmenting images into semantic classes, as well as the human annotation of images involved in segmentation and captioning, future research will move in the direction of accomplishing these tasks in an unsupervised setting. This would ensure that more energy and focus are invested solely in the development of complex machine learning algorithms and mathematical models which could improve the present state of the art.

Data Availability

The datasets used for the evaluation of the models presented in this study have been discussed in the manuscript along with their respective references and publications: PASCAL VOC (PASCAL Visual Object Classes) [82], MS COCO (Microsoft Common Objects in Context) [83], Cityscapes [84], ADE20K [85], CamVid [86], Flickr8k [169], Flickr30k [170], Visual Genome [171], Instagram [172], and the MIT-Adobe FiveK dataset [173].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the NSFC-Guangdong Joint Fund (Grant no. U1401257), the National Natural Science Foundation of China (Grant nos. 61300090, 61133016, and 61272527), the Science and Technology Plan Projects in Sichuan Province (Grant no. 2014JY0172), and the Opening Project of Guangdong Provincial Key Laboratory of Electronic Information Products Reliability Technology (Grant no. 2013A061401003).