#### Abstract

Scene understanding aims to predict a class label for each pixel of an image. In this study, we propose a semantic segmentation framework based on classic generative adversarial nets (GAN) that trains a fully convolutional semantic segmentation model along with an adversarial network. To improve the consistency of the segmented image, high-order potentials are adopted instead of unary or pairwise potentials. We realize the high-order potentials by substituting an adversarial network for the CRF model; the adversarial network continuously improves the consistency and details of the segmented image until it can no longer discriminate the segmented result from the ground truth. Experiments are conducted on the PASCAL VOC 2012 and Cityscapes datasets, and the quantitative and qualitative assessments show the effectiveness of the proposed approach.

#### 1. Introduction

Scene understanding, based on semantic segmentation, is a core problem in computer vision and has been applied to 2D images, video, and even volumetric data. Its goal is to assign a label to each pixel and thereby provide a complete understanding of a scene. Two examples of scene understanding are shown in Figure 1. The importance of scene understanding is highlighted by its growing number of applications, such as autonomous driving [1], human-computer interaction [2], robot technology, and augmented reality, to name a few.

**Figure 1.** Two examples of scene understanding, panels (a) and (b).

The earliest scene parsing work [3] labels 33 object categories across the 2688 images of the LMO dataset; it adopts label transfer to establish dense correspondences between the input image and each of its nearest neighbors using the SIFT flow algorithm. State-of-the-art scene parsing frameworks are mostly based on the fully convolutional network (FCN) [4]. FCN transforms well-known classification networks (AlexNet, VGG, GoogLeNet, and ResNet) into fully convolutional ones by replacing the fully connected layers with convolutional ones. The key insight of FCN is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly sized output with efficient inference and learning, realizing an end-to-end, image-to-image deep learning system. For these reasons, FCN is considered a milestone of deep learning. Although the many pooling operations enlarge the receptive fields of the convolution kernels of FCN, they discard detailed location information, which results in coarse segmentation and hinders further application.

To refine the segmentation result, a postprocessing stage using a conditional random field (CRF) is applied to the output of the system [5]; it uses a fully connected pairwise CRF to capture the dependencies between pixels and recover fine local details. Dilated convolution is a generalization of Kronecker-factored convolutional filters [6] that expands receptive fields exponentially without losing resolution, which allows some pooling layers to be removed. The works [7] that use this technique allow dense feature extraction at arbitrary resolution and combine dilated convolutions of different scales to obtain wider receptive fields at no additional cost. Combining CRF with dilated convolution, Chen et al. [8] propose the “DeepLab” system, which enlarges the receptive fields of filters at multiple scales and overcomes the loss of localization accuracy by combining the responses of the final network layer with a fully connected CRF. To make the dense CRF with pairwise potentials an integral part of the network, Zheng et al. [9] propose a model called CRFasRNN to refine the segmentation of FCN; it makes it possible to fully integrate the CRF with an FCN and train the whole network end to end. Although a CRF that takes the correlation of pixels into account improves segmentation accuracy, it also increases computational complexity. To incorporate suitable global features, Zhao et al. [10] propose the pyramid scene parsing network (PSPNet), which extends the pixel-level features with a specially designed pyramid pooling module in addition to traditional dilated convolution. This algorithm won the ImageNet scene parsing challenge 2016.
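
The receptive-field growth attributed to dilated convolution above can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions with 3×3 kernels and dilation rates that double from layer to layer; these settings and the helper name `receptive_field` are illustrative assumptions, not values taken from the cited works.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions."""
    rf = 1
    for d in dilations:
        # A dilated kernel spans (kernel_size - 1) * d + 1 input positions,
        # so each stride-1 layer adds (kernel_size - 1) * d to the receptive field.
        rf += (kernel_size - 1) * d
    return rf


# Four plain 3x3 layers versus four 3x3 layers with doubling dilation rates.
plain = receptive_field(3, [1, 1, 1, 1])
dilated = receptive_field(3, [1, 2, 4, 8])
print(plain, dilated)
```

With the same number of layers and weights, the dilated stack covers 31 pixels per axis instead of 9, and no pooling is needed to get there.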

In the above-mentioned algorithms, a common property is that all label variables are predicted using either unary potentials, as in FCN, or pairwise potentials, as in the CRF-based methods. Although pairwise potentials improve the accuracy of semantic segmentation, they only consider the correlation between two pixels. In an image, many pixels are consistent across superpixels, so high-order potentials should be effective in refining the segmentation accuracy. Arnab et al. [11] have integrated specific classes of high-order potentials into CNN-based segmentation models. Such a class may be based on objects, superpixels, and so on, and each requires a different energy function to calculate the high-order potentials, whose computation is complicated.

The generative adversarial nets (GAN) proposed by Goodfellow et al. [12] in 2014 are characterized by training a pair of networks in competition with each other, in which an adversarial network makes it possible to estimate the generative model without approximating many intractable probabilistic computations. Because there is no need for Markov chains or unrolled approximate inference networks, GAN has drawn many researchers’ attention in domains such as superresolution [13], image-to-image translation [14, 15], and image synthesis [16, 17]. We are interested in higher-order consistency that is not confined to a certain class, and we also want to avoid complex probability or inference computation. Motivated by these GAN variants, we propose a semantic segmentation framework based on GAN, which consists of two components: a generative network and an adversarial network. The former generates the segmented image, and the latter encourages the segmentation model to continuously improve the semantic segmentation result until, according to the value of the loss function, it cannot be distinguished from the ground truth. Different from the classic GAN, we take the original image as the input of the generative network, and the output of the generative network or the corresponding ground truth as the input of the adversarial network; the adversarial network then discriminates between the two inputs. If the value of the loss function of the framework is large, backpropagation is performed to adjust the parameters of the network; if the value satisfies the termination criterion, the output of the generative network is the final semantic segmentation result. The semantic segmentation framework based on GAN is shown in Figure 2. This approach takes into account the high-order potentials of an image because it assesses the similarity between the segmented image and the corresponding ground truth over the whole image.

#### 2. The Proposed Semantic Segmentation Approach

The aim of the proposed framework is to generate the semantic image from an original image. To achieve this goal, we design a generator network G and an adversarial network D. The generator is trained as a parameterized network; its parameters are the network weights and are obtained by minimizing the loss function. The output of the generator and the ground truth are then fed into the adversarial network, which has its own parameters and whose discriminator is trained to distinguish real inputs from fake ones. To achieve the desired result, it is important to design both the network architecture and the loss function.

##### 2.1. The Architecture of Networks

Some works have shown that a deeper network model can improve segmentation performance, but it also makes the network architecture more complex and harder to train [18]. We therefore make a compromise between the depth of the network and the performance of the algorithm.

In the generative network, shown in the first row of Figure 3, there are two modules: convolution and deconvolution. The convolution module, which consists of 10 layers, extracts the feature maps of an image. Each layer is composed of convolution, an activation function, and batch normalization. The convolution kernels produce 64 feature maps and are followed by a ReLU layer as the activation function, which introduces nonlinearity. Batch normalization is performed in each layer to avoid overfitting. Although pooling operations enlarge the receptive field of the network, they also reduce the accuracy of the segmentation. To improve the fine details of the feature maps, the last three pooling outputs are merged into one, on which deconvolution is performed to produce an output of the same size as the original image.

To discriminate the ground truth from the segmented image, we train a discriminator network, illustrated in the second row of Figure 3. The architecture follows [13], and the discriminator is optimized in an alternating manner along with the generator to solve (4). It contains eight convolution layers and uses LeakyReLU as the activation function. The convolutions result in 512 final feature maps, which are followed by two dense layers and a final sigmoid activation function that yields a probability for classification.

##### 2.2. Loss Function

In terms of information theory, cross entropy measures how closely two distributions match: the more similar the two distributions, the smaller the cross entropy. We therefore adopt cross entropy as the loss function. It is defined as

$$H(p, q) = -\sum_{x} p(x) \log q(x), \tag{1}$$

where $p$ and $q$ are the real value and the predicted value, respectively. Equation (1) reduces to the Shannon entropy when $p$ and $q$ are equal. In the multiclass classification task, we use cross entropy with one-hot encoding, and Equation (1) can be rewritten as

$$H(y, \hat{y}) = -\sum_{c=1}^{C} y_{c} \log \hat{y}_{c}, \tag{2}$$

where $y$ specifies one pixel of the ground truth, each $y_{c}$ is 0 or 1, and $\hat{y}_{c}$ is the predicted probability of class $c$.
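
As a minimal sketch of the two properties stated above, the following code evaluates the cross entropy of discrete distributions with the standard library only. The function names and the numeric values are illustrative assumptions; `q` stands for the predicted distribution.

```python
import math


def cross_entropy(p, q):
    """Cross entropy -sum_x p(x) * log q(x) between two discrete distributions."""
    # Terms with p(x) == 0 contribute nothing; skipping them also avoids
    # evaluating log(0) when q(x) is 0 at those positions.
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)


def entropy(p):
    """Shannon entropy, i.e. the cross entropy of a distribution with itself."""
    return cross_entropy(p, p)


predicted = [0.7, 0.2, 0.1]   # hypothetical class probabilities for one pixel
one_hot = [1.0, 0.0, 0.0]     # ground truth: the pixel belongs to class 0

# With a one-hot target only the true-class term survives: -log(0.7).
loss = cross_entropy(one_hot, predicted)
```

For a fixed target distribution, the cross entropy is smallest when the prediction equals the target, which is why it can serve as a training loss.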

The loss function of the proposed networks is a weighted sum of two terms. The first is a multiclass cross entropy term for the generator that encourages the segmented output to be similar to the ground truth. We use $G(x)$ to denote the class probability map over $C$ classes of size $H \times W \times C$ that the segmentation model generates given an input image $x$ of size $H \times W \times 3$. This segmentation model predicts the right class label at each pixel independently, which is described in the following:

$$\ell_{mce}(\hat{y}, y) = -\sum_{i=1}^{H \times W} \sum_{c=1}^{C} y_{ic} \log \hat{y}_{ic}, \tag{3}$$

where $\ell_{mce}(\hat{y}, y)$ represents the multiclass cross entropy loss on an image of size $H \times W$, $y$ is the one-hot ground truth, and the per-pixel class probability is predicted as $\hat{y} = G(x)$.
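
The per-pixel term can be illustrated on a tiny probability map. This is a sketch under stated assumptions: the map is stored as nested Python lists, the ground truth is given as integer class indices rather than one-hot vectors, the loss is summed (not averaged) over pixels, and the function name `multiclass_cross_entropy` is hypothetical.

```python
import math


def multiclass_cross_entropy(prob_map, label_map):
    """Sum of per-pixel cross entropies.

    prob_map:  H x W x C nested lists of predicted class probabilities.
    label_map: H x W nested lists of integer ground-truth class indices.
    """
    total = 0.0
    for prob_row, label_row in zip(prob_map, label_map):
        for probs, label in zip(prob_row, label_row):
            # One-hot target: only the probability of the true class contributes.
            total -= math.log(probs[label])
    return total


# Hypothetical 1 x 2 image with C = 3 classes.
prob_map = [[[0.8, 0.1, 0.1], [0.25, 0.5, 0.25]]]
label_map = [[0, 1]]
loss = multiclass_cross_entropy(prob_map, label_map)
```

Each pixel is scored on its own, so this term alone cannot penalize inconsistencies that only show up across many pixels.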

The second loss term represents the loss of the adversarial network. If the adversarial network can distinguish the output of the generator from the ground truth, the loss value is large; otherwise, the loss is small. Because the loss is calculated on the whole image or a large portion of it, dissimilarity in high-order statistics can be penalized by the adversarial loss term. We take the output of the adversarial network as $D(\cdot) \in [0, 1]$. Over $N$ training pairs $(x_n, y_n)$, training the adversarial model is equivalent to minimizing the following binary classification loss:

$$\sum_{n=1}^{N} \ell_{bce}(D(y_n), 1) + \ell_{bce}(D(G(x_n)), 0), \tag{4}$$

where $\ell_{bce}(\hat{z}, z) = -[z \log \hat{z} + (1 - z) \log(1 - \hat{z})]$ denotes the binary cross entropy loss, and the targets 1 and 0 are the labels of the adversarial network when the network input is the ground truth $y_n$ or the output of the generator $G(x_n)$, respectively.
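
The binary classification loss minimized by the adversarial network can be sketched for a single training pair as follows. The labeling convention (1 for ground truth, 0 for generator output), the function names, and the scalar discriminator outputs are assumptions made for illustration.

```python
import math


def binary_cross_entropy(prediction, target):
    """Binary cross entropy between a probability in (0, 1) and a 0/1 target."""
    return -(target * math.log(prediction) + (1 - target) * math.log(1 - prediction))


def adversarial_loss(d_real, d_fake):
    """Discriminator loss: ground truth is labeled 1, generator output is labeled 0."""
    return binary_cross_entropy(d_real, 1) + binary_cross_entropy(d_fake, 0)


# Hypothetical discriminator outputs (probability that the input is ground truth).
confident = adversarial_loss(d_real=0.9, d_fake=0.1)   # tells the two inputs apart
fooled = adversarial_loss(d_real=0.5, d_fake=0.5)      # cannot tell them apart
```

The discriminator's own loss is low when it separates the two inputs and rises to 2 log 2 when it outputs 0.5 for both, which is the regime the generator is trained to reach.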

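Given a data set of $N$ original images $x_n$ and the corresponding ground truth $y_n$, we define the total loss function of the proposed GAN-based semantic segmentation networks as follows:

$$\ell = \sum_{n=1}^{N} \ell_{mce}(G(x_n), y_n) - \lambda \left[ \ell_{bce}(D(y_n), 1) + \ell_{bce}(D(G(x_n)), 0) \right], \tag{5}$$

where $\lambda$ denotes the weight factor, which we set to 0.01 in this paper.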

#### 3. Experiments

To evaluate the proposed GAN-based scene understanding algorithm, we conduct experiments on two widely used datasets: PASCAL VOC 2012 [19] and the urban scene understanding dataset Cityscapes [1]. We train the networks on an NVIDIA Tesla K40 GPU and an Intel Xeon E5 CPU for 2000 iterations with a batch size of 16.

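To quantitatively assess the accuracy of scene parsing, four performance indices are adopted: pixel accuracy (PA), mean pixel accuracy (MPA), mean intersection over union (MeanIoU), and frequency weighted intersection over union (FWIoU), whose formulations [20] are given in (6) to (9). We assume a total of $k+1$ classes, and $p_{ij}$ is the number of pixels of class $i$ inferred to belong to class $j$. Thus $p_{ii}$ denotes the number of true positives, while $p_{ij}$ and $p_{ji}$ are usually interpreted as false positives and false negatives, respectively:

$$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}}, \tag{6}$$

$$\mathrm{MPA} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}, \tag{7}$$

$$\mathrm{MeanIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}, \tag{8}$$

$$\mathrm{FWIoU} = \frac{1}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}} \sum_{i=0}^{k} \frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}. \tag{9}$$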
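
The four indices can be computed from a confusion matrix whose entry in row i and column j counts the pixels of class i predicted as class j. The following is a minimal sketch using the usual definitions of these metrics; the function name, the list-of-lists layout, and the toy matrix are assumptions.

```python
def segmentation_metrics(confusion):
    """PA, MPA, MeanIoU and FWIoU from a square confusion matrix.

    confusion[i][j] is the number of pixels of class i predicted as class j.
    Assumes every class appears at least once in the ground truth.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    true_pos = [confusion[i][i] for i in range(n)]
    gt_count = [sum(confusion[i]) for i in range(n)]                         # row sums
    pred_count = [sum(confusion[i][j] for i in range(n)) for j in range(n)]  # column sums

    pa = sum(true_pos) / total
    mpa = sum(true_pos[i] / gt_count[i] for i in range(n)) / n
    # Per-class IoU: true positives over the union of ground truth and prediction.
    iou = [true_pos[i] / (gt_count[i] + pred_count[i] - true_pos[i]) for i in range(n)]
    mean_iou = sum(iou) / n
    # FWIoU weights each class IoU by the class frequency in the ground truth.
    fw_iou = sum(gt_count[i] * iou[i] for i in range(n)) / total
    return pa, mpa, mean_iou, fw_iou


# Hypothetical two-class example with 10 pixels.
confusion = [[4, 1],
             [2, 3]]
pa, mpa, mean_iou, fw_iou = segmentation_metrics(confusion)
```

Because the two classes in this toy matrix are equally frequent, MeanIoU and FWIoU coincide; they differ once the class frequencies are imbalanced.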

We use adaptive moment estimation (Adam) [21] to optimize the algorithm because it requires little parameter tuning; its exponential decay rates $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999, respectively. We have also compared how the algorithm converges under different learning rates to select the optimal value, as shown in Figure 4. The learning rate used in these experiments is selected according to this figure.
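
For reference, a single Adam update with decay rates 0.9 and 0.999 can be sketched as below for a scalar parameter. The toy objective and the learning rate of 0.1 are arbitrary demonstration values, not the settings of the experiments, and the function name `adam_step` is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns the new (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction, step t >= 1
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Minimize f(w) = (w - 3)^2 starting from w = 0; lr = 0.1 is an arbitrary demo value.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
```

Because the update is normalized by the second-moment estimate, the first step has a magnitude close to the learning rate regardless of the gradient scale, which is one reason the method needs little tuning.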

##### 3.1. Experiment 1: PASCAL VOC 2012

We carry out experiments on the PASCAL VOC 2012 segmentation dataset, which contains 20 object categories and 1 background class. Its augmented dataset [22] includes 10582, 1449, and 1456 images for training, validation, and testing, respectively. We compare our method with the classic FCN [4] and the popular DeepLab [5]; the accuracy for every class is shown in Table 1. Our approach achieves the highest accuracy on all classes except bicycle. Table 2 lists the four performance indices (PA, MPA, MeanIoU, and FWIoU) for the different algorithms. From the left column to the right, the accuracy increases steadily, and the proposed approach achieves the highest value on all four performance indices.

To qualitatively validate the proposed method, several examples are shown in Figure 5. For the “cat” in row one, our method segments the cat in accordance with the ground truth, whereas FCN and DeepLab also segment spurious noise regions. For the “cow” and “child” in rows two and five, details such as the legs are segmented by our method but are missing from the results of the other two methods. In the fourth image, the small cow and the person are segmented with finer contours than with the other two methods. In short, the subjective quality of the DeepLab results is better than that of FCN, and the results of our method outperform both.

**Figure 5.** Qualitative segmentation examples on PASCAL VOC 2012, panels (a) to (e).

##### 3.2. Experiment 2: Cityscapes

Cityscapes [1] is a dataset for semantic urban scene understanding that was released in 2016. It contains 5000 high-quality images with fine pixel-level annotations, collected from 50 cities in different seasons. The images are split into 2975, 500, and 1525 images for training, validation, and testing, respectively, and are annotated with 19 categories. Because this dataset was released only recently, code for previous algorithms is not available for it, so we perform only a subjective assessment on Cityscapes, comparing our method with FCN.

Several examples are shown in Figure 6. Our proposed method outperforms FCN: it recovers more details and better distinguishes roads, buildings, cars, and so on.

**Figure 6.** Qualitative segmentation examples on Cityscapes, panels (a) to (d).

#### 4. Conclusion

In this paper, we propose a scene understanding framework based on generative adversarial networks, which trains the fully convolutional semantic segmentation network with an adversarial network and adopts high-order potentials to achieve fine details and consistency in the segmented image. We perform a number of experiments on two well-known datasets, PASCAL VOC 2012 and Cityscapes. We analyze not only the per-class accuracy but also four accuracy indices for the different semantic segmentation algorithms. The quantitative and qualitative assessments show that the proposed method achieves the best accuracy among the compared algorithms. In the future, we will conduct more experiments on the Cityscapes dataset and address the misclassification caused by class imbalance.

#### Data Availability

The data used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This work is supported by Shanghai Science and Technology Committee (no. 15590501300).