Abstract

The popularity of social networks has brought rapid growth in social images, which have become an increasingly important image type. One of the most distinctive attributes of social images is the tag. However, state-of-the-art methods fail to fully exploit tag information for saliency detection. This paper therefore focuses on salient region detection of social images using both image appearance features and image tag cues. First, a deep convolutional neural network is built that considers both appearance features and tag features. Second, tag neighbor and appearance neighbor based saliency aggregation terms are added to the saliency model to enhance salient regions. The aggregation method adapts to individual images and accounts for the performance gaps between methods appropriately. Finally, we have also constructed a new large dataset of challenging social images with pixel-wise saliency annotations to promote further research and evaluation of visual saliency models. Extensive experiments show that the proposed method performs well not only on the new dataset but also on several state-of-the-art saliency datasets.

1. Introduction

Images and videos are two of the main media for social entertainment and communication. With the popularity of photo sharing websites, social images have become an important image type. Their most distinctive feature is that they typically carry several tags describing their contents. How to use these tags for multimedia tasks, such as image indexing and retrieval [1, 2], has attracted increasing attention [3]. However, tags are seldom considered in state-of-the-art salient region detection models. Therefore, in this paper, we focus on salient region detection of social images using both appearance features and tag features.

With the development of saliency detection, a large number of saliency detection algorithms have been developed [4–6]. It has been found that relying only on low-level features cannot achieve satisfactory results. Research has shown that hierarchical and deep architectures [7–12] are very effective for salient region detection. Thus, a salient region detection method based on deep learning is proposed in this paper. In addition, various priors are also very important in salient region detection [13], for example, face [14–16], car [17], color [14], center bias [13], and objectness [18–20]. Intuitively, tags could be important high-level semantic cues for salient region detection [16, 21]. Thus, tags are incorporated into our salient region detection models.

It is observed that different methods perform differently in saliency analysis [22]: detection performance varies from image to image. This also holds between deep feature based methods and handcrafted feature based methods, so handcrafted feature based detection methods can be regarded as complements to deep feature based ones. However, the fusion process has no access to ground truth, and it is nontrivial to determine which saliency map is better. A good saliency aggregation model should work on each individual image and account for the performance gaps between methods appropriately. Therefore, how to fuse the saliency maps of different detection methods is a key issue addressed in this paper.

The framework of salient region detection is shown in Figure 1. It includes two parts: deep learning based salient region detection and handcrafted feature based salient region detection. Deep features include CNN (convolutional neural network) features and tag features. Finally, the spatial coherence of the saliency maps is optimized with a fully connected conditional random field model.

There are a variety of saliency detection benchmark datasets, from either the saliency detection field [7, 8, 23–26] or the image segmentation field [27–29]. To promote further research and evaluation of visual saliency detection for social images, it is necessary to construct a new dataset of social images.

This paper focuses on salient region detection of social images. Its contributions are twofold. First, a deep learning based salient region detection method for social images is proposed that considers both appearance features and tag features. Second, a tag neighbor and appearance neighbor based saliency aggregation method is proposed, which fuses state-of-the-art handcrafted feature based detection methods with our deep learning based detection method. The aggregation method adapts to each individual image and accounts for the saliency performance gaps appropriately. The detection model thus takes full advantage of image tags.

The rest of the paper is organized as follows. The deep learning based model is proposed in Section 2. Section 3 discusses the handcrafted feature based detection models. In Section 4, the saliency aggregation method is proposed. Spatial coherence optimization is discussed in Section 5. In Section 6, the new saliency dataset of social images is introduced. In Section 7, extensive experiments are performed and analyzed. Finally, conclusions are given in Section 8.

2. Deep Learning Based Salient Region Detection

Deep learning based salient region detection uses two types of features: appearance based CNN (convolutional neural network) features and social image tag features. They are discussed in the following subsections.

2.1. CNN Based Salient Region Detection
2.1.1. Network Architecture

The deep network for appearance feature extraction has 8 layers [30], as shown in Figure 2: 5 convolution layers, 2 fully connected layers, and 1 output layer. The bottom layer represents the input image, and the adjacent upper layer represents the regions for deep feature extraction.

The convolution layers are responsible for multiscale feature extraction. To achieve translation invariance, a max pooling operation is performed after each convolution. The learned feature is composed of 4096 elements. The fully connected layers are followed by ReLU (Rectified Linear Units) for nonlinear mapping, and dropout is used to avoid overfitting. ReLU performs the following element-wise operation:
$$y_i = \max(0, x_i), \tag{1}$$
where $x = (x_1, \dots, x_{4096})$ is the 4096-element feature; if $x_i > 0$, then $y_i = x_i$; otherwise $y_i = 0$.

The output layer uses softmax regression to calculate the probability of image patches being salient.

2.1.2. Multiscale CNN Feature Computation

In an image, salient regions exhibit uniqueness, scarcity, and obvious differences from their neighborhoods. Inspired by [8], three types of differences are computed to measure saliency effectively: the difference between a region and its neighborhoods, the difference between a region and the whole image, and the difference between a region and the image boundaries. To compute these differences, four types of regions are extracted, as shown in Figure 3 and sketched below: (1) a rectangular sample taken in a sliding window fashion; (2) the neighborhood of the rectangular sample; (3) the boundaries of the image; (4) the image area excluding the rectangular sample.
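As a rough illustration, the following sketch extracts the four region types for one sliding-window position; the window size, neighborhood extent, and boundary width used here are hypothetical values, not the paper's settings.

```python
import numpy as np

def extract_regions(image, x, y, win=51, border=10):
    """Sketch of the four region types; `win` (sample size) and
    `border` (boundary width) are illustrative values."""
    h, w, _ = image.shape
    # (1) Rectangular sample at one sliding-window position.
    sample = image[y:y + win, x:x + win]
    # (2) Neighborhood: a larger window centered on the sample.
    y0, x0 = max(0, y - win // 2), max(0, x - win // 2)
    neighborhood = image[y0:y0 + 2 * win, x0:x0 + 2 * win]
    # (3) Image boundaries: a thin frame around the image.
    boundary = np.zeros((h, w), dtype=bool)
    boundary[:border, :] = True
    boundary[-border:, :] = True
    boundary[:, :border] = True
    boundary[:, -border:] = True
    # (4) The image area excluding the sample.
    rest = np.ones((h, w), dtype=bool)
    rest[y:y + win, x:x + win] = False
    return sample, neighborhood, boundary, rest
```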

2.1.3. Training of CNN Network

Caffe [30], an open source framework, is used for CNN training and testing. The deep convolutional neural network is initially trained on the ImageNet dataset; we then extract multiscale features for each region and fine-tune the network parameters. For each image in the training set, we crop RGB patches in a sliding window fashion with a stride of 10 pixels. To label the sample patches, if more than 70% of the pixels in a sample are salient, the sample label is 1; otherwise it is 0. Using this annotation strategy, we obtain a set of sample regions and their corresponding binary labels.

In the fine-tuning process, the cost function is the softmax loss with weight decay:
$$L(\theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k\in\{0,1\}} \mathbf{1}\{l_i = k\}\log P(l_i = k) + \lambda \sum_{j} \|W_j\|_2^2, \tag{2}$$
where $\theta$ denotes the learnable parameters of the convolutional neural network, including the biases and weights of all layers; $\mathbf{1}\{\cdot\}$ is the indicator function; $P(l_i = 1)$ is the probability of the $i$-th sample being salient; $\lambda$ is the weight decay parameter; and $W_j$ is the weight of the $j$-th layer. We use mini-batch stochastic gradient descent to train the network. The initial learning rate is 0.01; when the cost stabilizes, the learning rate is decreased by a factor of 0.1. Training runs for 80 epochs. The dropout rate is set to 0.5 to avoid overfitting.
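As a sanity check of the cost function above, a minimal NumPy version of the softmax loss with weight decay might look as follows; the function name and the decay coefficient are illustrative, not the paper's values.

```python
import numpy as np

def softmax_loss_with_weight_decay(scores, labels, layer_weights, lam=5e-4):
    """Minimal NumPy version of the cost in (2); `lam` is illustrative.

    scores: (N, 2) raw network outputs; labels: (N,) int array in {0, 1};
    layer_weights: list of weight arrays, one per layer."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # Cross-entropy: the indicator selects the true-class probability.
    data_loss = -np.log(probs[np.arange(len(labels)), labels]).mean()
    # Weight decay: L2 penalty over the weights of all layers.
    decay = lam * sum((W ** 2).sum() for W in layer_weights)
    return data_loss + decay
```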

2.2. Tag Semantic Feature Computation

Because objects are closely related to salient regions, we use object tags to compute semantic features. The probability that a region depicts a particular object reflects, to some extent, its likelihood of being a salient region. Therefore, the probabilities that regions correspond to specific objects can be regarded as priors.

RCNN (Regions with CNN features) [31] is based on deep learning and has been widely used because of its excellent object detection accuracy. In this paper, RCNN is used to detect objects; tag semantics are thus transformed into RCNN features.

Suppose there are $T$ object detectors. For the $t$-th detector, the detection process is as follows.

(1) Select the top $n$ proposals that are most likely to contain the specific object.

(2) Compute the probability $p_{j,t}$ of the $j$-th proposal being the $t$-th object, $j = 1, \dots, n$. Each pixel in the $j$-th proposal is assigned the same probability $p_{j,t}$.

(3) Over the $n$ proposals, each pixel $x$ receives a score $s_t(x)$ for the $t$-th object: if the pixel is contained in the $j$-th proposal, then $s_t(x) = p_{j,t}$; otherwise $s_t(x) = 0$.

A $T$-dimensional feature is thus obtained for each pixel after all $T$ object detectors have been applied. The feature is normalized to $\hat{s}(x) = (\hat{s}_1(x), \dots, \hat{s}_T(x))$, with each dimension in $[0, 1]$ indicating the probability of the pixel belonging to a specific object.
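A minimal sketch of steps (1)–(3) is given below; the data layout and the max over overlapping proposals are assumptions, since the paper does not state how overlapping proposals are combined.

```python
import numpy as np

def pixel_object_scores(shape, detections):
    """Turn proposal-level detection scores into per-pixel features.

    detections[t]: list of (x1, y1, x2, y2, prob) proposals for the
    t-th object detector (integer pixel coordinates assumed)."""
    h, w = shape
    T = len(detections)
    s = np.zeros((T, h, w))
    for t, proposals in enumerate(detections):
        for (x1, y1, x2, y2, p) in proposals:
            # Every pixel inside a proposal inherits its probability;
            # overlapping proposals keep the highest score (assumption).
            s[t, y1:y2, x1:x2] = np.maximum(s[t, y1:y2, x1:x2], p)
    # Normalize each dimension to [0, 1].
    for t in range(T):
        if s[t].max() > 0:
            s[t] /= s[t].max()
    return s
```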

2.3. Fusion of CNN Based Saliency and Tag Semantic Features

Assume that the CNN based saliency map is $S$ and the RCNN based semantic feature of pixel $x$ is $\hat{s}(x)$; the fusion is
$$S_f(x) = S(x)\left(1 + \max_{t}\hat{s}_t(x)\right). \tag{3}$$

The tags are priors and act as weights in the fusion; $S_f$ represents the fused saliency map.
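Since the fusion formula above is a reconstruction, the following sketch implements that assumed multiplicative weighting; both the function and the `1 + max` weighting are assumptions rather than the paper's verified form.

```python
import numpy as np

def fuse_saliency_with_tags(S, s_hat):
    """Assumed fusion of (3): tag evidence scales the CNN saliency map.

    S: (h, w) CNN saliency map in [0, 1].
    s_hat: (T, h, w) normalized per-pixel object probabilities."""
    w = s_hat.max(axis=0)      # strongest object evidence per pixel
    S_f = S * (1.0 + w)        # tag priors act as multiplicative weights
    return S_f / S_f.max()     # rescale to [0, 1] (an extra assumption)
```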

3. Handcrafted Feature Based Salient Region Detection

It is observed that different methods perform differently in saliency analysis [22]. Although the overall detection performance of deep features is better than that of handcrafted features, differences remain on individual images, so handcrafted feature based saliency maps can be regarded as complements to deep feature based ones. In Figure 4, the first column shows the original social images; the second shows the ground truth masks; the third shows the saliency maps of the DRFI method [25], which is based on handcrafted features; the last shows the saliency maps of the MDF method [8], which is based on deep features. The last column includes incomplete parts, unclear boundaries, and false detections. Therefore, in this paper, several state-of-the-art handcrafted feature based salient region detection methods are selected as complements to our deep detection method.

4. Saliency Aggregation

4.1. Main Idea

It is observed that if a salient region detection method performs well on a social image, it is very likely to perform well on similar images. The aggregation is based on this assumption.

In the training process, a sorted list ranking all detection methods is computed for each image. These sorted lists serve as priors in testing.

In the testing process, we search the training set for the KNN (K nearest neighbor) images of the test image. The sorted lists of these KNN images are known from the training stage, so the KNN images can vote for the detection methods through their sorted lists. The test image thus obtains its own ranking of methods from the voting, and its saliency map is computed by aggregating the saliency maps of the different methods according to this ranking.

The training and testing processes are shown in Figures 5 and 6, respectively.

4.2. Training Process

Given an image $I$ in the training set, its ground truth is denoted by $G$, and its saliency maps produced by different detection methods are denoted as $\{S_1, \dots, S_M\}$, where $M$ is the number of detection methods and $S_m$ is the saliency map of the $m$-th method.

For every detection method, its saliency maps are compared with the ground truth to compute AUC (Area under ROC Curve) values; the greater the AUC value, the better the saliency detection performance. After the AUC values are computed, the sorted lists of all methods can be obtained.

For convenience, assume there are four detection methods. The sorted lists are shown in Figure 7. The data structure is a singly linked list: the data domain of the header node denotes the image, and its pointer domain points to the first data node. Each non-header node has three domains: the AUC value, the method index, and a pointer to the next node.
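In place of the linked list, a plain Python list of (AUC, method) pairs captures the same structure; this sketch assumes precomputed AUC values keyed by image and method index.

```python
def build_sorted_lists(train_images, methods, auc):
    """Build the per-image sorted lists used as priors.

    auc[(img, m)]: AUC of method index m on training image img."""
    sorted_lists = {}
    for img in train_images:
        pairs = [(auc[(img, m)], m) for m in range(len(methods))]
        # Descending AUC: the head of the list is the best method.
        sorted_lists[img] = sorted(pairs, reverse=True)
    return sorted_lists
```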

4.3. Testing Process

A social image has two parts: the image and its corresponding tags. In the testing set, an image $I$ and its tag set $\{t_1, \dots, t_q\}$ are given, where $q$ is the number of tags. We search for its neighbors through tag semantics and image appearance; the sorted lists of the neighbors then vote for the saliency maps of image $I$.

4.3.1. Tag Based Neighbor Search

There are two types of tags: object tags and scene tags. Because objects are closely related to salient regions, object tags are used in the semantic search.

There are 37 object tags in the new dataset, including animal, bear, birds, cat, fox, zebra, horses, tiger, cow, dog, elk, fish, whale, vehicles, boats, cars, plane, train, person, police, military, tattoo, computer, coral, flowers, flags, tower, statue, sign, book, sun, leaf, sand, tree, food, rocks, and toy.

Among these categories, animal is the superclass of bear, birds, cat, fox, zebra, horses, tiger, cow, dog, elk, fish, and whale; vehicles is the superclass of boats, cars, plane, and train; person is the superclass of police, military, and tattoo.

Although a superclass and its subclasses are closely related by definition, many subclasses differ greatly in environment and appearance. Therefore, for the animal and vehicles classes, subclasses require exact matching to find neighbors; for the person class, owing to its particularity, if there is no exact subclass match, matching can be performed at the person level. A possible encoding of these rules is sketched below.
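The dictionary and function names in this sketch are hypothetical, and the superclass table is abbreviated.

```python
# Hypothetical (abbreviated) encoding of the super/subclass relations.
SUPERCLASS = {"bear": "animal", "birds": "animal", "cat": "animal",
              "boats": "vehicles", "cars": "vehicles", "plane": "vehicles",
              "police": "person", "military": "person", "tattoo": "person"}

def tags_match(query_tag, candidate_tag):
    """Animal and vehicle subclasses require exact matches; person
    subclasses may fall back to matching at the person level."""
    if query_tag == candidate_tag:
        return True
    # Fall back to the superclass only for the person category.
    q = SUPERCLASS.get(query_tag, query_tag)
    c = SUPERCLASS.get(candidate_tag, candidate_tag)
    return q == "person" and c == "person"
```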

4.3.2. Appearance Based Neighbor Search

A 256-dimensional histogram in RGB color space is used, and the distance between histograms is computed to measure appearance similarity.
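A minimal version of this search could look as follows; the 8×8×4 binning and the chi-square distance are assumptions, since the paper does not specify the binning scheme or the metric.

```python
import numpy as np

def rgb_histogram(image, bins=(8, 8, 4)):
    """256-bin RGB histogram; the 8x8x4 binning is an assumption."""
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=bins,
                             range=[(0, 256)] * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def appearance_neighbors(query_hist, train_hists, k=4):
    """Return indices of the k nearest training images; the chi-square
    distance is an assumption, as the paper does not name the metric."""
    d = [0.5 * np.sum((query_hist - h) ** 2 / (query_hist + h + 1e-10))
         for h in train_hists]
    return np.argsort(d)[:k]
```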

4.4. Vote Based Saliency Maps Aggregation

Suppose the test image is $I$, the number of tag neighbors is $K_t$, and the number of appearance neighbors is $K_a$.

After the tag based search in the training set, the number of detected neighbors is $n_t$. If $n_t$ is larger than $K_t$, then $K_t$ images are selected from the $n_t$ images according to appearance similarity. Finally, the tag based neighbor set is given as
$$N_t = \{I_1^t, I_2^t, \dots, I_k^t\}, \tag{4}$$
where $k$ is the final number of neighbors; if $n_t > K_t$, then $k = K_t$; otherwise $k = n_t$.

After the appearance based similarity computation in the training set, the $K_a$ nearest neighbors are selected as
$$N_a = \{I_1^a, I_2^a, \dots, I_{K_a}^a\}. \tag{5}$$

Merging sets (4) and (5) gives the neighbor set
$$N = N_t \cup N_a. \tag{6}$$

Each neighbor image has a sorted list containing the AUC values of all detection methods, and these AUC values vote for each detection method. The vote weights are summed as
$$w_m = \sum_{i \in N} \mathrm{AUC}_{i,m}, \quad m = 1, \dots, M. \tag{7}$$

In (7), $i$ indexes the $i$-th neighbor, $m$ indexes the $m$-th detection method, and $M$ is the number of detection models.

The saliency map set of image $I$ is $\{S_1, \dots, S_M\}$, where $S_m$ is the saliency map of the $m$-th detection method.

The fused saliency map is computed as
$$S_{\mathrm{fused}} = \frac{\sum_{m=1}^{M} w_m S_m}{\sum_{m=1}^{M} w_m}. \tag{8}$$
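Putting (6)–(8) together, a compact sketch of the voting and aggregation might be the following; the data structures are the ones assumed in the earlier sketches.

```python
import numpy as np

def aggregate_saliency(saliency_maps, neighbors, sorted_lists):
    """Neighbors vote with AUC values, as in (7)-(8).

    saliency_maps: list of (h, w) maps, one per detection method.
    sorted_lists[img]: list of (auc, method) pairs from training."""
    M = len(saliency_maps)
    w = np.zeros(M)
    for img in neighbors:            # sum AUC votes per method
        for auc, m in sorted_lists[img]:
            w[m] += auc
    fused = sum(w[m] * saliency_maps[m] for m in range(M)) / w.sum()
    return fused
```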

5. Spatial Coherence Optimization

Saliency computation does not consider the spatial relationship of adjacent regions, which results in noise in the salient regions. In the field of image segmentation, researchers use the fully connected CRF (conditional random field) model [49] to achieve better segmentation results. Therefore, we use the fully connected CRF model to optimize the spatial coherence of the saliency maps.

The objective function is defined as
$$E(x) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j), \tag{9}$$
where $x_i \in \{0, 1\}$ is the binary variable indicating whether pixel $i$ is salient, and the unary term is $\psi_u(x_i) = -\log P(x_i)$, with $P(x_i)$ the probability of pixel $i$ being salient. Initially, $P(x_i = 1) = S_i$ and $P(x_i = 0) = 1 - S_i$, where $S_i$ is the saliency of pixel $i$.

The pairwise term is defined as
$$\psi_p(x_i, x_j) = \mu(x_i, x_j)\left[w_1 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|c_i - c_j\|^2}{2\sigma_\beta^2}\right) + w_2 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\right)\right], \tag{10}$$
where $\mu(x_i, x_j) = 1$ if $x_i \neq x_j$, or else 0.

Both position information and color information are considered in the first kernel: $p_i$ and $p_j$ are the positions of pixels $i$ and $j$, and $c_i$ and $c_j$ are their colors. This kernel suggests that adjacent pixels with similar colors should have similar saliency; $\sigma_\alpha$ and $\sigma_\beta$ control distance proximity and color similarity. The second kernel considers only position information; its purpose is to remove small isolated areas.
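A possible implementation of this refinement uses the pydensecrf package, whose two pairwise terms correspond to the appearance and smoothness kernels above; the kernel widths and compatibility values here are illustrative, not the paper's.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_saliency(image, saliency, iters=5):
    """CRF refinement sketch; sxy/srgb/compat are illustrative values.

    image: (h, w, 3) uint8 RGB array; saliency: (h, w) map in [0, 1]."""
    h, w = saliency.shape
    # Unary term from the initial saliency: P(not salient), P(salient).
    probs = np.stack([1.0 - saliency, saliency])
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    # Appearance kernel: position + color, so nearby similar-colored
    # pixels receive similar labels.
    d.addPairwiseBilateral(sxy=60, srgb=10,
                           rgbim=np.ascontiguousarray(image), compat=10)
    # Smoothness kernel: position only, removes small isolated regions.
    d.addPairwiseGaussian(sxy=3, compat=3)
    Q = np.array(d.inference(iters)).reshape(2, h, w)
    return Q[1]  # marginal probability of being salient
```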

6. Construction of Saliency Dataset of Social Images

This paper focuses on salient region detection of social images, so it is necessary to construct a new dataset of social images to promote further research and evaluation of visual saliency models. The construction process is discussed in detail below.

6.1. Data Source

The NUS-WIDE dataset [50] is a web image dataset constructed by the NUS lab for media search. Its images and tags come from Flickr, a popular social web site. We randomly selected 10000 images from the NUS-WIDE dataset. The images come from thirty-eight folders of the NUS-WIDE dataset, including carvings, castle, cat, cell phones, chairs, chrysanthemums, classroom, cliff, computers, cooling tower, coral, cordless, cougar, courthouse, cow, coyote, dance, dancing, deer, den, desert, detail, diver, dock, close-up, cloverleaf, cubs, doll, dog, dogs, fish, flag, eagle, elephant, elk, f-16, facade, and fawn.

6.2. Salient Region Annotation

Since bounding boxes for salient regions are rough and cannot reveal region boundaries, we adopt pixel-wise annotation. In the annotation process, nine subjects are asked to mark the attractive regions according to their first glance at each image.

To reduce inconsistency among the annotation results, a pixel consistency score is computed: a pixel is considered salient if at least 50% of the subjects have selected it [23].
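The consistency rule amounts to a simple majority vote over the subjects' binary masks, as in this sketch (the function name is hypothetical):

```python
import numpy as np

def consensus_mask(subject_masks):
    """A pixel is kept as salient if at least 50% of subjects selected it.

    subject_masks: (n_subjects, h, w) binary annotation maps."""
    votes = subject_masks.astype(float).mean(axis=0)
    return votes >= 0.5
```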

Finally, two subjects use Adobe Photoshop to segment salient regions.

6.3. Image Selection

First, 10000 images are randomly selected from the NUS-WIDE dataset. Then, the images are further selected by the following criteria.
(1) The color contrast between any salient region and the corresponding image is less than 0.7.
(2) Salient regions are rich in size: the proportion of a salient region to the corresponding image covers 10 grades.
(3) At least ten percent of the salient regions are connected with the image boundaries.

After 5 rounds of selection, the dataset contains 5429 images.

In the new dataset, images have one or more salient regions; the positions of salient regions are not limited to image centers; the sizes of salient regions vary widely; and a great number of images have complex or cluttered backgrounds. There are 78 tags, drawn from the 81 tags of the NUS-WIDE dataset. All of these properties make salient region detection challenging.

6.4. Typical Images of the New Dataset

In this section, typical examples of images, ground truth masks, and tags are listed. Figure 8 shows that images can have one or multiple salient regions; Figure 9 shows images with cluttered and complex backgrounds; Figure 10 shows the wide variety of salient region sizes.

7. Experiments

7.1. Experimental Setup
7.1.1. Experiments on the New Dataset

The aim of this paper is salient region detection in social images, so the main experimental dataset is our new dataset, abbreviated as TBD (Tag Based Dataset).

We selected 20 object tags, including bear, birds, boats, buildings, cars, cat, computer, coral, cow, dog, elk, fish, flowers, fox, horses, person, plane, tiger, train, and zebra. Correspondingly, 20 RCNN object detectors were chosen to extract RCNN features. The top 1000 proposals of each detector were used to compute RCNN features.

The proposed deep learning based detection method is abbreviated as DBS (Deep Based Saliency). In Section 7.2.1, DBS is compared with 27 state-of-the-art methods: CB [34], FT [23], SEG [44], RC [14], SVO [17], LRR [39], SF [45], GS [37], CA [33], SS [47], HS [7], TD [48], MR [24], DRFI [25], PCA [41], HM [38], GC [36], MC [40], DSR [35], SBF [43], BD [42], SMD [46], BL [32], MCDL [9], MDF [8], LEGS [10], and RFCN [11]. These methods are not only very popular but also cover many types.

In addition, we also verify the performance of the aggregation method in Section 7.2.2.

7.1.2. Experiments on State-of-the-Art Datasets

We also carried out experiments on six state-of-the-art datasets to validate our method: MSRA1000 [23], DUT-OMRON [24], ECSSD [7], HKU-IS [8], PASCAL-S [51], and SOD [27]. Among these, SOD [27] comes from the segmentation field; the others come from the saliency field. Because these datasets have no image-level tags, we extract the objectness feature [19] instead. Objectness is a kind of high-level semantic cue, so the objectness cue is similar to the tag feature. The variant of DBS that uses the objectness feature instead of the tag feature is abbreviated as OBS (Objectness Based Saliency).

The OBS method was compared with 11 state-of-the-art methods: FT [23], RC [14], SF [45], HS [7], MR [24], DRFI [25], GC [36], MC [40], BD [42], MDF [8], and LEGS [10].

7.1.3. Evaluation Criteria

We adopted popular performance measures to quantitatively evaluate the results: PR (Precision-Recall) curves, ROC (Receiver Operating Characteristic) curves, the F-measure, the AUC (Area under ROC Curve), and the MAE (Mean Absolute Error).
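For reference, a minimal sketch of these criteria on a single image follows; the adaptive threshold (twice the mean saliency) and the weight $\beta^2 = 0.3$ are common choices in the saliency literature but are assumptions here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(saliency, gt, beta2=0.3):
    """Compute AUC, MAE, and F-measure for one saliency map.

    saliency: (h, w) map in [0, 1]; gt: (h, w) binary ground truth."""
    s, g = saliency.ravel(), gt.ravel().astype(bool)
    auc = roc_auc_score(g, s)                     # AUC
    mae = np.abs(s - g).mean()                    # MAE
    binary = s >= 2 * s.mean()                    # adaptive threshold (assumed)
    tp = np.logical_and(binary, g).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(g.sum(), 1)
    f = ((1 + beta2) * precision * recall /
         max(beta2 * precision + recall, 1e-10))  # F-measure
    return auc, mae, f
```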

7.2. Experiments on the New Dataset TBD
7.2.1. Experiments of Deep Learning Based Detection Method

DBS is compared with 27 state-of-the-art methods. The results are given in Table 1 and Figure 11.

Among the 28 methods in Table 1, the top four are all deep learning based methods: MCDL [9], RFCN [11], MDF [8], and DBS. To some extent, deep learning based detection methods outperform handcrafted feature based methods in terms of both the completeness and the accuracy of saliency maps. The AUC value of DBS is the highest; the F-measure of DBS is slightly lower than that of RFCN [11]; and the MAE of DBS is the third lowest. The overall performance of DBS is good.

Typical saliency maps are shown in Figure 11.

7.2.2. Experiments of Aggregation Method

The handcrafted feature based detection methods used as complements to DBS are DRFI [25], SMD [46], BL [32], and MC [40].

In the neighbor search, the number of tag neighbors $K_t$ and the number of appearance neighbors $K_a$ are both set to 4.

To verify the effect of each neighbor type, appearance neighbor based and tag neighbor based methods are evaluated separately. The appearance neighbor based aggregation method is abbreviated as ABS (Appearance Based Saliency), the tag neighbor based method as TBS (Tag Based Saliency), and the combined tag and appearance neighbor based method as FBS (Fusion Based Saliency).

The detection performances of DBS, ABS, TBS, and FBS are compared in Table 2.

The performance of TBS is better than that of ABS, for the following reason. ABS relies on appearance based neighbor search, and images with similar appearance cannot guarantee similar saliency maps. TBS, in contrast, uses object information, and the same or similar objects ensure similar salient regions to some extent.

The PR and ROC curves are shown in Figures 12 and 13. The PR and ROC curves of FBS are higher than those of the 27 state-of-the-art methods.

Typical saliency maps of the FBS and DBS methods are shown in Figure 14. The aggregated results are more complete, with better detail.

7.3. Experiments on State-of-the-Art Datasets

The experimental results are given in Table 3. The AUC values of OBS are the highest on all datasets, the F-measure values of OBS are the highest on all datasets, and the MAE values are the lowest or second lowest; the overall performance of OBS is the best. However, the improvements of OBS are not very pronounced, because the objectness feature is not as accurate as a true tag feature. We therefore believe that the results would improve markedly with accurate tag annotations of the images.

Experiments on state-of-the-art datasets validate the effectiveness of our proposed method DBS.

8. Conclusions

This paper focuses on salient region detection of social images. First, the proposed deep learning based salient region detection method considers both appearance features and tag features, with tag features derived from RCNN models. Second, tag neighbor and appearance neighbor features are added to the saliency aggregation model. Finally, a new dataset of challenging social images with pixel-wise saliency annotations is constructed, which can promote further research and evaluation of visual saliency models.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the Program Project of Beijing Municipal Education Commission (KM201511417008), the National Natural Science Foundation of China (Grant no. 62372148), the National Natural Science Foundation of China (Grant no. 61272352), and Beijing Natural Science Foundation (4152016).