Abstract

Semantic segmentation aims to assign a semantic label to each pixel in an image. In the fully supervised setting, this is achieved by a segmentation model trained with pixel-level annotations. However, pixel-level annotation is expensive and time-consuming. To reduce this cost, this paper proposes an extreme learning machine (ELM) method trained on semantic candidate regions, which uses only image-level labels to achieve the mapping from image-level to pixel-level labels. The paper casts the pixel mapping problem as a candidate-region semantic inference problem. Specifically, after segmenting each image into a set of superpixels, the superpixels are automatically merged into candidate regions according to the number of image-level labels. Semantic inference over the candidate regions is then performed based on the association relationships among semantic labels and a neighborhood rough set. Finally, the ELM is trained on the candidate regions with inferred labels and used to classify test candidate regions. The method is evaluated on the MSRC and PASCAL VOC 2012 datasets, which are widely used in semantic segmentation. The experimental results show that the proposed method outperforms several state-of-the-art weakly supervised semantic segmentation approaches.

1. Introduction

Image semantic segmentation is the task of understanding the semantic information contained in images: a computer extracts the semantics of the captured scene from the image in order to understand its content, which can be applied to image recognition, classification, and analysis [1]. Semantic segmentation has been widely used in scene understanding for intelligent robots, streetscape recognition for automatic driving systems, and medical image detection [2]. However, it remains one of the most challenging computer vision tasks because of changes in the scale, position, illumination, and texture of objects in images [3].

In most cases, image semantic segmentation is formulated as a fully supervised task. Fully supervised methods require strong pixel-level annotations, which are scarce, expensive, and time-consuming to produce, and which also vary with the subjective understanding of individual annotators [4]. Weakly supervised semantic segmentation, in contrast, requires only weaker forms of supervision, which are much cheaper and less time-consuming to collect than pixel-level annotations. It can be divided into three categories according to the supervision used: bounding boxes [5], partial markings [6], and image-level labels. At present, with the increasing popularity of image-sharing websites (for example, Flickr) that provide a large number of user-labeled images, many studies have focused on weakly supervised semantic segmentation from image-level labels.

Accordingly, work on weakly supervised image semantic segmentation based on image-level labels has grown rapidly in recent years. According to the method used for semantic label inference, weakly supervised image semantic segmentation can be roughly divided into classifier-based, multigraph-model-based, and deep convolutional neural network (DCNN) based methods. Classifier-based methods use superpixels, or candidate regions generated from superpixels, as the basic processing unit to infer semantic labels and then train a classifier model on the inferred labels. The underlying assumption is that superpixels or candidate regions with the same semantic label have similar appearance [7]. However, superpixel-based label inference carries considerable redundant information, which degrades accuracy. Candidate-region-based methods contain less redundancy, but current image segmentation techniques find it difficult to segment an image completely and accurately into exactly as many objects as there are labels. Multigraph-model-based methods use all pixels or superpixels in the image as graph nodes and build a graph model over the relationships between them. However, these methods compute a unary potential energy function for every superpixel, so their algorithmic complexity is high [8]. Fortunately, sparse representation and image hashing are powerful tools for data representation, and their combination enables scalable image retrieval. High-dimensional features can be replaced by a low-dimensional Hamming space while preserving the similarity between features, which reduces the computational cost of the energy function and thereby the complexity of the algorithm [9–14]. Finally, DCNN-based methods use a pretrained classification network to locate objects in the image and then fine-tune a segmentation network with image-level labels. These methods are sensitive to the accuracy of the pretrained classification network and to its training dataset, and the classification network can identify only small, highly discriminative regions, which is insufficient for inferring semantic labels at large scale [15].

Although new weakly supervised image semantic segmentation methods based on image-level labels are constantly being proposed, their segmentation accuracy still leaves considerable room for improvement compared with fully supervised semantic segmentation. The main obstacle lies in accurately performing semantic label inference, that is, the accurate mapping from image-level labels to pixel positions. In addition, since dense pixel-level label prediction is the goal, not all features are equally important and discriminative for learning classification models [16]. Therefore, constructing an effective model to infer semantic labels is also essential for improving the accuracy of weakly supervised image semantic segmentation.

Under the weak supervision condition, this paper proposes a deep semantic segmentation method using a CNN and an ELM with semantic candidate regions. The proposed method uses candidate regions instead of superpixels as the basic processing unit and combines a neighborhood rough set with the semantic association relationships between image-level labels to infer semantic labels. In addition, the ELM is trained on candidate regions carrying semantic information and used to classify test candidate regions. The algorithm flow chart is shown in Figure 1, and the main contributions of this paper are as follows. (1) A method for merging superpixels into candidate regions is proposed. The method guides superpixel merging using the number of image-level labels as supervised information and generates candidate regions with high precision, which handles the case where multiple instances are not adjacent in an image; the merging also substantially reduces the complexity of subsequent processing. (2) An inference method for candidate region semantic labels is proposed. The method uses the neighborhood rough set to generate neighborhood particles and starts inference from the most frequent semantic label; the remaining candidate region labels are then inferred following the strongest association relationships, which addresses the difficulty of semantic label mapping. (3) An ELM training method is proposed. It trains the ELM on candidate regions with inferred semantic labels, which reduces the number of negative-sample pixels introduced into the training data and improves classification accuracy.

2. Related Work

As the simplest and cheapest form of weak supervision, image-level labels are widely used in weakly supervised image semantic segmentation. However, it is difficult to associate labels with image objects when only image-level labels are available for training, since such labels are inherently ambiguous and provide no precise information about object boundaries and locations. According to the method used for semantic label inference, this paper divides weakly supervised image segmentation algorithms into three categories: classifier-based, multigraph-model-based, and deep convolutional neural network based methods.

Classifier-based methods use image-level labels as supervised information: all pixels or superpixels in images containing the target label are treated as positive samples, and those in images without the target label as negative samples. A classifier is then trained directly, and the best classifier is obtained by iteratively optimizing a loss function. For example, Wei et al. [23] trained a multilabel classification network, classified images through the network, and matched the high-confidence classification information back to the original image to obtain associations between semantic labels and locations. However, treating the pixels of the target image block directly as object regions introduces many negative-sample pixels, such as background pixels. Subsequently, Wei et al. [19, 22] proposed the simple-to-complex (STC) framework in 2017, which first trains an initial segmentation network on simple images, then uses this network to predict labels for the simple images and employs these labels to further train the segmentation network; finally, the enhanced network predicts labels for more complex images to train a still better segmentation network. However, this method requires collecting a large number of simple images, without which it is difficult to train a high-performance initial network and keep improving it, and it involves many training samples and long training times. Zhang et al. [18] proposed a spatially sparse reconstruction method to obtain an effective SVM classifier: the classifier is trained on noisy data, subspace reconstruction is used for denoising, and the optimal SVM classifier is found by iterative optimization. These methods iterate between generating temporary segmentation masks and learning with this interim supervision; they benefit from pixel-level supervision, but errors easily accumulate over the iterations.

Multigraph-model-based methods use all pixels or superpixels in the image as graph nodes and build a graph model over the relationships between them. Vezhnevets et al. [8] proposed a multi-instance learning (MIL) framework for weakly supervised image segmentation: each superpixel is regarded as an instance, each image is represented as a bag of instances, and since only the bag labels are known, image segmentation is converted into instance label inference. However, the algorithm ignores the relationships between superpixel pairs. To address this, Vezhnevets et al. [17] proposed a multi-image model (MIM) based on a graphical model, building a joint probabilistic graphical model over the training and test sets with a conditional random field: in addition to the unary potential of each superpixel, a pairwise potential is established between superpixel pairs, and the parameters of the conditional random field are approximated by graph cuts. However, this method computes a unary potential for every superpixel, so its complexity is high. To enrich the description of superpixel features, Vezhnevets et al. [24] further proposed a family of parameterized structured models in which the pairwise potentials are formed from multichannel visual features, with the weight of each channel learned by minimizing the disagreement among segmentation models trained to distinguish superpixel labels. These graph-based algorithms improve segmentation performance in the weakly supervised setting, but they are limited by the low descriptive power of the unary and pairwise potential energy functions.

Deep convolutional neural network based methods build on a DCNN framework that is trained to localize objects. Oquab et al. [25] applied a DCNN to generate a single point indicating the location of each object, but this method cannot detect multiple objects of the same class in one image. Pinheiro et al. [21] and Pathak et al. [20] added segmentation constraints to the final cost function to optimize DCNN parameters from image-level labels. However, both methods produce coarse predictions because the algorithms generally do not use low-level cues.

3. The Proposed Method

This paper proposes a weakly supervised image semantic segmentation framework based on candidate regions and an ELM. The framework consists of a learning phase and a testing phase. The learning phase has three basic steps: candidate region segmentation from superpixels; candidate region semantic inference using semantic label association; and candidate region classification using the ELM. In the testing phase, the paper first performs superpixel segmentation and merging on the test image and then predicts the semantic label of each pixel, with the candidate region as the basic processing unit.

3.1. Segmentation of Candidate Regions Using Superpixels

Compared with superpixels, the number of candidate regions in an image is smaller, which helps improve the accuracy of semantic label inference. It is therefore necessary to merge the oversegmented superpixels into a library of candidate regions. Several low-level visual features are extracted so that the boundary information of each superpixel is preserved as much as possible during merging: the paper represents each superpixel with colour, texture, SIFT, and SURF features. Specifically, because of the wide colour gamut of the LAB space, this paper chooses LAB as the colour feature, and it selects Gabor filters to represent the texture of each superpixel, since Gabor filters cope well with spatial transformations [26].

First, the initial image is divided into superpixels by the simple linear iterative clustering (SLIC) algorithm. Compared with other superpixel segmentation methods, SLIC has the following advantages [27]: (a) the resulting superpixels are of roughly equal size; (b) the number of superpixels can be controlled by adjusting the parameter k; (c) it is fast, and the superpixel boundaries fit the target boundaries well; (d) the feature differences between pixels within each superpixel are small.
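As an illustration, this step can be realized with an off-the-shelf SLIC implementation; the sketch below assumes scikit-image, which the paper does not name, with n_segments playing the role of the parameter k:

```python
# Minimal SLIC sketch (assumed implementation: scikit-image).
from skimage import io
from skimage.segmentation import slic

image = io.imread("example.jpg")  # hypothetical input image
# n_segments controls the superpixel count (the parameter k in the text);
# compactness trades color similarity against spatial proximity.
labels = slic(image, n_segments=200, compactness=10, start_label=0)
print(f"{labels.max() + 1} superpixels of roughly equal size")
```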

Then, a 196-dimensional visual feature vector is extracted to describe each superpixel, comprising colour features (3 dimensions), texture features (65 dimensions), SIFT features (64 dimensions), and SURF features (64 dimensions). Finally, respecting the spatial adjacency of superpixels, the most similar superpixels are iteratively merged according to their similarity until the number of regions is no more than three times the number of image labels, as shown in Figure 2.

Suppose an image contains n superpixels $S = \{s_1, s_2, \ldots, s_n\}$, where each superpixel is described by a 196-dimensional visual feature vector, and the image carries labels $L = \{l_1, \ldots, l_l\}$, where l is the number of image semantic labels. The similarity of any two superpixels $s_i$ and $s_j$ is described as
$$D(s_i, s_j) = \omega_1 d_{c}(s_i, s_j) + \omega_2 d_{t}(s_i, s_j) + \omega_3 d_{s}(s_i, s_j) + \omega_4 d_{u}(s_i, s_j), \tag{1}$$
where $\omega_1, \ldots, \omega_4$ are weight factors adjusting the distances and satisfy $\sum_{k=1}^{4} \omega_k = 1$; $d_c$, $d_t$, $d_s$, and $d_u$ are the Euclidean distances between the colour, texture, SIFT, and SURF features of superpixels $s_i$ and $s_j$; and an adjacency matrix $A$ stores the adjacency relationships between superpixels. The specific steps of the superpixel merging algorithm are given in Algorithm 1.

Input: Image data set; image-level label number l.
Output: Cluster center of each target superpixel; the number n of target superpixels in the image.
Step 1. SLIC superpixel segmentation, producing n superpixels.
Step 2. While n > 3l:
(a) Extract visual features of each superpixel: LAB (3 dim), Gabor (65 dim), SIFT (64 dim), SURF (64 dim);
(b) Count the adjacency relationships between superpixels and store them in matrix A;
(c) Compute the superpixel similarity according to formula (1);
(d) Merge the most similar superpixel pair, taking adjacency into account;
(e) Take the mean of the merged superpixels' cluster centers as the new cluster center;
(f) Update n.
End
Step 3. Reclassify disconnected areas.
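For concreteness, a simplified sketch of the merging loop in Algorithm 1 is given below; feature extraction is stubbed out, and the data structures (a feature dictionary and an adjacency set) are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def merge_superpixels(features, adjacency, num_labels, w=(0.25, 0.25, 0.25, 0.25)):
    """features: dict region_id -> tuple of (color, texture, sift, surf) arrays.
    adjacency: set of frozensets {i, j} of spatially adjacent region ids.
    Merging stops once at most 3 * num_labels regions remain (Section 3.1)."""
    def distance(pair):
        i, j = tuple(pair)
        # formula (1): weighted sum of per-channel Euclidean distances
        return sum(wk * np.linalg.norm(a - b)
                   for wk, a, b in zip(w, features[i], features[j]))

    while len(features) > 3 * num_labels and adjacency:
        pair = min(adjacency, key=distance)      # most similar adjacent pair
        i, j = tuple(pair)
        # new cluster center: mean of the merged feature vectors (step (e))
        features[i] = tuple((a + b) / 2 for a, b in zip(features[i], features[j]))
        del features[j]
        # redirect j's adjacencies to i; drop the merged pair and self-loops
        adjacency = {frozenset(i if r == j else r for r in p)
                     for p in adjacency if p != pair}
        adjacency = {p for p in adjacency if len(p) == 2}
    return features
```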
3.2. Candidate Region Semantic Inference Using Semantic Label Association

The inference from image-level to pixel-level semantic labels is the key step of the whole weakly supervised image semantic segmentation algorithm. Since the classification of candidate regions directly affects the label inference results, rich visual features must be extracted, so the paper adopts a CNN for feature extraction to ensure effective classification. However, extracting multilayer visual features increases the data dimension, which greatly complicates the subsequent label clustering. The neighborhood classifier [28] has the important advantage that attribute reduction yields the subset of features that matters for decision making; that is, it can isolate the discriminative features that are important for semantic label inference.

With the candidate region as the basic processing unit, the paper regards semantic label inference as the problem of extracting the most similar neighborhood particles. The scheme is distinctive in three respects: (1) inference starts from the semantic label associated with the most images, which helps ensure the accuracy of the predicted labels; (2) the number of candidate regions assigned to each semantic label to be inferred is determined by the number of image-level labels and the proportion of images corresponding to that label; (3) each semantic label is inferred following the semantic label association relationships, which reduces the interference of noise. The detailed steps are as follows.

First, the semantic labels can be represented as $L = \{l_1, l_2, \ldots, l_k\}$, where k is the total number of semantic label categories, and, according to the image-level labels, the number of images corresponding to each semantic label is $N = \{n_1, n_2, \ldots, n_k\}$. From the relationship between L and N, the semantic label covering the most images in the data set can be identified. The number of candidate regions assigned to semantic label $l_i$ can then be expressed as
$$M_i = \alpha \, n_i, \tag{2}$$
where $\alpha$ is a proportional parameter that depends on the multiple of the number of image-level labels and on the complexity of the training set images. Therefore, the proportion of the candidate region set corresponding to semantic label $l_i$ in the entire candidate region library can be expressed as
$$P_i = \frac{M_i}{\sum_{j=1}^{k} M_j}. \tag{3}$$

This yields the proportion of the candidate region library allotted to each semantic label, and the inference of a semantic label is thus transformed into finding the candidate regions that make up its proportion of the library.
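As a small worked illustration of Equations (2) and (3) in the reconstructed notation above (all numbers hypothetical): with $k = 3$ labels whose image counts are $N = (50, 30, 20)$ and $\alpha = 3$,
$$M = (150, 90, 60), \qquad P_1 = \frac{150}{150 + 90 + 60} = 0.5,$$
so inference begins with label $l_1$, which covers the most images, and half of the candidate region library is assigned to it.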

Second, given the set of semantic labels to be associated, the semantic association relationships between labels are obtained by computing the semantic association strength, and the relationships are stored in a symmetric relationship matrix
$$W = \left[ w_{ij} \right]_{k \times k}, \tag{4}$$
$$w_{ij} = \frac{f(l_i, l_j)}{f(l_i) + f(l_j) - f(l_i, l_j)}, \tag{5}$$
where $w_{ij}$ is the connection strength of the two labels $l_i$ and $l_j$ in the data set, $f(l_i, l_j)$ is the frequency with which labels $l_i$ and $l_j$ occur simultaneously, and $f(l_i) + f(l_j) - f(l_i, l_j)$ is the frequency with which either of the labels $l_i$ and $l_j$ occurs. The semantic association strengths are shown in Figure 3, where colors from blue to red indicate association strength from weak to strong; each label's self-association is the strongest and is shown in red.

As can be seen from Equations (4) and (5), this paper favors inference between semantic labels that appear simultaneously in many images. The semantic labels are then inferred in order of strongest association: given an already-inferred semantic label and its association relationships, the proportion of candidate regions for the next semantic label can be obtained.
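A sketch of how the association matrix of Equations (4) and (5) can be computed from image-level label sets is given below; the Jaccard-style normalization is an assumption based on the description above:

```python
import numpy as np

def association_matrix(image_labels, k):
    """image_labels: list of sets of label ids, one set per training image."""
    co = np.zeros((k, k))    # co[i, j]: images in which labels i and j co-occur
    occ = np.zeros(k)        # occ[i]: images in which label i occurs
    for labels in image_labels:
        for i in labels:
            occ[i] += 1
            for j in labels:
                if j != i:
                    co[i, j] += 1
    W = np.eye(k)            # self-association is the strongest (Figure 3)
    for i in range(k):
        for j in range(k):
            if i != j:
                union = occ[i] + occ[j] - co[i, j]   # images where either occurs
                W[i, j] = co[i, j] / union if union else 0.0
    return W
```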

To fully characterize each candidate region in the candidate region library, the paper adopts a CNN to extract features. The CNN structure is shown in Table 1: it consists of five convolutional layers (conv1~conv5) and three fully connected layers (fc6~fc8). In this paper, the five convolutional layers and two of the fully connected layers are used for learning; max pooling follows the conv2 and conv5 convolution operations, and the 4096-dimensional feature vector of the fc7 layer is taken as the image feature output. For the CNN input, each sample patch is an image block of 27×27 pixels centered on the candidate region center. For the CNN output, the feature extraction model directly uses the 4096-dimensional fc7 feature vector as the visual feature of the candidate region.
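A sketch of this feature extraction step follows. The paper does not specify its network weights; since the Table 1 structure (five convolutional and three fully connected layers) matches AlexNet, torchvision's pretrained AlexNet is assumed here, with each patch resized to the network's expected input size:

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
# keep the classifier only up to fc7, so the output is the 4096-dim fc7 vector
model.classifier = nn.Sequential(*list(model.classifier.children())[:6])
model.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),   # upsample the 27x27 patch to the network input size
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def fc7_feature(patch):
    """patch: HxWx3 uint8 array cropped around a candidate region center."""
    with torch.no_grad():
        x = preprocess(patch).unsqueeze(0)     # shape 1x3x224x224
        return model(x).squeeze(0).numpy()     # 4096-dim feature vector
```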

According to the feature vectors of the candidate regions, we construct an information table $IT = \langle U, F, V, f \rangle$, where $U = \{x_1, x_2, \ldots, x_m\}$ is the sample set of candidate regions, each described by a series of features; m is the number of candidate regions in the candidate region library; F is the feature set describing U; V is the set of attribute values; and $f: U \times F \rightarrow V$ is the information function. The neighborhood particle of each candidate region is constructed as
$$\delta(x_i) = \{\, x_j \mid x_j \in U,\ \Delta(x_i, x_j) \le \delta \,\}, \tag{6}$$
where $\delta \ge 0$; $\delta(x_i)$ is called the generated neighborhood information particle, and $\delta$ determines the size of the neighborhood particle. $\Delta$ is the similarity measure based on the p-norm,
$$\Delta(x_i, x_j) = \left( \sum_{q=1}^{Q} \left| f(x_i, a_q) - f(x_j, a_q) \right|^{p} \right)^{1/p}, \tag{7}$$
where Q is the dimension of the attribute matrix. From the nature of a metric, it can be known that
$$\Delta(x_i, x_j) \ge 0, \qquad \Delta(x_i, x_j) = \Delta(x_j, x_i), \qquad \Delta(x_i, x_k) \le \Delta(x_i, x_j) + \Delta(x_j, x_k).$$

For a fixed neighborhood particle size, the neighborhood particle containing the most similar candidate regions can be obtained, since $\delta$ determines the particle's size. The paper therefore computes the candidate neighborhood thresholds and takes the smallest threshold $\delta_{\min}$ that yields the required particle size, which determines the most similar neighborhood particle and its candidate regions.
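The following sketch illustrates neighborhood-particle extraction under the notation reconstructed in Equations (6) and (7); the function names and the use of NumPy are illustrative:

```python
import numpy as np

def neighborhood_particle(X, i, delta, p=2):
    """X: m x Q feature matrix of candidate regions. Returns the indices of
    regions within distance delta of region i under the p-norm metric."""
    d = np.linalg.norm(X - X[i], ord=p, axis=1)
    return np.where(d <= delta)[0]

def smallest_threshold(X, i, size, p=2):
    """Smallest delta whose particle around region i contains `size` regions."""
    d = np.sort(np.linalg.norm(X - X[i], ord=p, axis=1))
    return d[min(size, len(d)) - 1]
```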

Finally, the paper obtains the candidate regions corresponding to the semantic label being inferred, together with their neighborhood particles, and completes the inference of that label. The inferred candidate regions are then removed from the candidate region library, and the process iterates until the remaining semantic labels have all been inferred.

3.3. Candidate Regions Classification Using ELM

After all semantic labels have been inferred, the paper uses an ELM to learn from the inferred candidate regions. The main reason is that the ELM is a fast machine learning algorithm: a supervised method based on a single-hidden-layer feedforward neural network [29]. In addition, the ELM determines its parameters without iterative training, which improves efficiency.

First, in the training stage, the ELM is trained on the candidate regions with inferred semantic labels, yielding a trained classifier. The candidate region remains the basic processing unit for semantic label prediction, because candidate regions adhere well to object boundaries and are not easily corrupted by noise. To obtain the candidate regions of the test images, the paper performs superpixel segmentation and merging with the same parameter settings and steps as in training, and then extracts 4096-dimensional features from the resulting candidate regions to keep the testing stage consistent with the training stage.

After that, in the ELM testing stage, given the test candidate regions $\{x'_1, \ldots, x'_t\}$ of an image, where t is the number of test candidate regions, each candidate region's feature vector is fed directly to the ELM, which predicts its semantic label. The specific steps of the ELM classification algorithm are shown in Algorithm 2.

Input: Training samples $(x_i, t_i)$, $i = 1, \ldots, N$; the number of semantic label categories k;
activation function $g(\cdot)$; the number of hidden layer nodes L; test sample $x'$.
Output: Predicted result $y$.
Step 1. Initialize the weights and biases between the input layer and the hidden layer: randomly set
the values of $w_i$ and $b_i$, $i = 1, \ldots, L$, given the value of L.
Step 2. Select the activation function $g(\cdot)$ of the hidden layer and calculate the hidden layer output matrix $H$.
Step 3. Calculate the output weights of the network: $\beta = H^{\dagger} T$, where $H^{\dagger} = (H^{T} H)^{-1} H^{T}$ and $H^{T}$ is the transpose of $H$.
Step 4. Calculate the hidden layer output of the test sample: $H' = g(w \cdot x' + b)$.
Step 5. Output the predicted result: $y = H' \beta$.
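A minimal NumPy sketch of Algorithm 2 follows (an illustration, not the authors' implementation); it uses a sigmoid activation and the Moore-Penrose pseudoinverse for Step 3:

```python
import numpy as np

class ELM:
    """Minimal extreme learning machine following Algorithm 2."""
    def __init__(self, n_hidden=64, seed=0):
        self.L = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # hidden layer output with a sigmoid activation g(x)
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, T):
        """X: N x d feature matrix; T: N x k one-hot label targets."""
        d = X.shape[1]
        self.W = self.rng.standard_normal((d, self.L))  # Step 1: random weights
        self.b = self.rng.standard_normal(self.L)       # and biases
        H = self._hidden(X)                             # Step 2: output matrix H
        self.beta = np.linalg.pinv(H) @ T               # Step 3: beta = H^+ T
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta              # Steps 4-5

# usage: labels = ELM(64).fit(X_train, T_train).predict(X_test).argmax(axis=1)
```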

4. Experiment

4.1. Dataset and Evaluation

The performance of our algorithm was evaluated on the MSRC dataset [30], which has 591 images covering natural scenes (such as trees), structured scenes (such as buildings and roads), and other structured scenes. The dataset provides pixel-level semantic annotations, and all annotation maps are 213×320 pixels in size. The scenes contain 23 semantic object categories in total. Following the usual protocol for this dataset, the horse and mountain classes are ignored. This paper uses 276 images for training and 256 images for testing.

In addition, our method is also evaluated on the PASCAL VOC 2012 segmentation benchmark [31], one of the most widely used benchmark datasets for semantic segmentation. It contains one background category and 20 object categories and consists of three parts: a training set (1464 images), a validation set (1449 images), and a test set (1456 images). In our experiments, we also use the augmented training set (10582 images) provided by Hariharan et al. [32], which supplies image-level labels for training.

In this paper, the evaluation indices are pixel accuracy (PA), mean pixel accuracy (MPA), and mean intersection over union (mIoU), calculated as follows:
$$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}}, \qquad \mathrm{MPA} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}, \qquad \mathrm{mIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}},$$
where $k+1$ is the number of categories in the ground truth, $p_{ij}$ is the number of pixels of category $i$ predicted as category $j$, and $\sum_{j=0}^{k} p_{ij}$ is the total number of pixels of category $i$ in the ground truth.
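These three indices can be computed directly from a confusion matrix, as in the illustrative sketch below, where p[i, j] counts pixels of ground-truth class i predicted as class j:

```python
import numpy as np

def segmentation_metrics(p):
    """p: (k+1) x (k+1) confusion matrix over pixel counts."""
    p = p.astype(float)
    tp = np.diag(p)
    gt = p.sum(axis=1)       # pixels per ground-truth class
    pred = p.sum(axis=0)     # pixels per predicted class
    pa = tp.sum() / p.sum()                    # pixel accuracy
    mpa = np.nanmean(tp / gt)                  # mean pixel accuracy
    miou = np.nanmean(tp / (gt + pred - tp))   # mean intersection over union
    return pa, mpa, miou                       # classes absent from gt give NaN,
                                               # which nanmean skips
```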

4.2. Parameter Settings

The CNN model parameters are set as follows. The learning rate was set to 0.001, and the performance of three CNN visual features in image clustering was analyzed and compared: the outputs of the last three fully connected layers, of dimension 4096, 4096, and 1000, respectively, are each considered as feature representations of the candidate regions. Figure 4 compares the three visual features on the MSRC dataset. The fc7 output gives the highest precision, so it is selected as the visual feature for image clustering.

The ELM parameters are set as follows. When designing the ELM, cross-validation is used to determine the optimal number of hidden layer nodes L within a preset range. Simulations were performed on the MSRC-21 data with L increasing from 1 to 200, and the classification accuracy on the test set was recorded, as shown in Figure 5. The test accuracy peaks when L reaches 64; as L continues to increase, the accuracy generally decreases. Thus the ELM shows good test accuracy for 60 ≤ L ≤ 68.
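This sweep over L can be sketched as follows, reusing the ELM class from Section 3.3; X_train/T_train and X_val/y_val are hypothetical held-out splits:

```python
# Model selection over the hidden node count L (illustrative sketch).
accuracy = {}
for L in range(1, 201):
    pred = ELM(n_hidden=L).fit(X_train, T_train).predict(X_val).argmax(axis=1)
    accuracy[L] = float((pred == y_val).mean())
best_L = max(accuracy, key=accuracy.get)   # peaks near L = 64 on MSRC-21
```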

4.3. Experimental Results

To evaluate the performance of the proposed weakly supervised image semantic segmentation method, experiments compare it with current weakly supervised image semantic segmentation algorithms on the MSRC-21 and PASCAL VOC 2012 datasets. The comparison algorithms include STC [19], AE [22], SR [18], MIM [17], MIL+ILP+SP-sppxl [21], and CCNN [20], all of which are based on image-level labels.

First, Tables 2 and 3 report the per-class IoU and the mean IoU (mIoU) over all classes for the proposed method and the comparison algorithms on the MSRC-21 and PASCAL VOC 2012 datasets, respectively. Each column gives the accuracy of the different algorithms on one semantic class, the last column gives the average over all classes, and bold values mark the best segmentation performance.

As shown in Tables 2 and 3, the proposed algorithm obtains competitive per-class IoU and mIoU results compared with existing weakly supervised semantic segmentation methods based on image-level labels. Although its IoU on some semantic classes is lower than that of the compared algorithms on the MSRC dataset and the PASCAL VOC 2012 validation set, the proposed algorithm achieves the best mIoU. In addition, the segmentation accuracy of weakly supervised methods on the MSRC dataset is significantly higher than on PASCAL VOC 2012, because PASCAL VOC 2012 images contain more complex objects and backgrounds than MSRC images. Although many weakly supervised image semantic segmentation algorithms have been proposed, the per-class segmentation accuracy across the whole dataset still leaves considerable room for improvement.

Then, to show the segmentation performance of the proposed algorithm more intuitively, some qualitative segmentation examples from the MSRC and PASCAL VOC 2012 datasets are given in Figure 6.

As shown in Figure 6, the proposed weakly supervised deep semantic segmentation using a CNN and an ELM with semantic candidate regions achieves good segmentation performance, and segmenting at the candidate region level preserves the edge information of objects in the image. However, because the proposed method performs semantic label inference and classifier learning at the candidate region level, objects composed of multiple regions with strong contrast may be misclassified.

5. Conclusions

In this paper, a weakly supervised semantic segmentation method using an ELM with semantic candidate regions is proposed. By merging superpixels into candidate regions instead of operating on the large number of superpixels in an image, and by effectively combining the semantic association relationships with a neighborhood rough set, the method addresses the difficulty of mapping semantic labels onto image objects. The number of image semantic labels is used as the condition for terminating superpixel merging, which avoids manually set parameters and helps handle nonadjacent multiple instances. The candidate regions are classified based on the neighborhood rough set, with their labels inferred from the semantic association relationships; as a result, more reliable candidate region semantic labels are obtained, improving classification accuracy. Future work could combine saliency detection [33, 34] and heuristic optimization in a data fusion framework [35–38].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to express their gratitude for the support from the National Natural Science Foundation of China (61503271; 61603267), Shanxi Scholarship Council of China (2015-045; 2016-044), 100 People Talents Programme of Shanxi, Shanxi Natural Science Foundation of China (201801D121144), and Shanxi Natural Science Foundation of China (201801D221190).