Abstract

Automatic estimation of salient objects without any prior knowledge can greatly benefit many computer vision tasks. This paper proposes a novel bottom-up framework for salient object detection that first models the background and then separates salient objects from it. We model the background distribution with a feature clustering algorithm, which allows the statistical and structural information of the background to be fully exploited. A coarse saliency map is then generated according to the background distribution. To be more discriminative, the coarse saliency map is enhanced by a two-step refinement composed of edge-preserving element-level filtering and upsampling based on geodesic distance. We provide an extensive evaluation and show that the proposed method performs favorably against other outstanding methods on two of the most commonly used datasets. Most importantly, the proposed approach is shown to be more effective in highlighting the salient object uniformly and more robust to background noise.

1. Introduction

As an important preprocessing technique in computer vision to reduce computational complexity, saliency has attracted much attention in recent years. Saliency detection has boosted a wide variety of applications in computer vision tasks, such as image retrieval [1, 2], image compression [3], content-aware image editing [4, 5], and object detection and recognition [6].

According to previous works, saliency models can be categorized into bottom-up approaches [7-9] and top-down approaches [10, 11]. Most top-down approaches make use of prior knowledge and are based on supervised learning. While these supervised approaches can effectively detect salient regions and generally perform better than bottom-up approaches, the training process, especially data collection, remains expensive. In contrast, bottom-up approaches need neither data collection nor training and require little prior knowledge. These advantages make bottom-up approaches more efficient and easier to implement in a wide range of real computer vision applications. In this paper, we focus on the bottom-up approach.

In recent years, most works have used color contrast [7, 8] as the measurement of saliency, and some others have used color rarity [12] to produce the saliency map. Algorithms based on these cues usually work well when the foreground object is distinctive in terms of contrast or rarity, but they may fail on images containing varied background patterns. The work in [13] focuses on pattern distinctness. It defines a pixel to be salient if the patch corresponding to that pixel contributes little to the whole image in the PCA system, so it performs badly when the salient object occupies most of the image. By assuming that most of the narrow border of the image is background, backgroundness priors can be exploited in calculating saliency maps, as done in [14, 15]. Although backgroundness priors help in locating salient objects, treating patches on the border directly as background patches causes serious problems, especially when salient objects touch the image border or share features with part of the image border. Thus, an effective way to construct a background distribution, from which salient objects can be separated, is needed.

In this work, we suggest that clustering features within the pseudobackground region (the border region of the image) gives a suitable representation of the background distribution. We therefore first construct the background distribution in this way and then calculate the saliency value of every location as its distance from the background distribution. Subsequently, we employ a series of refinement methods to enhance the salient region and make salient objects stand out uniformly.

There are two main contributions underlying our proposed approach: (1) we propose a novel way to construct the background distribution of an image based on a clustering method. In this way, statistical and structural information of the background can be fully exploited to construct the background distribution, which allows for a subsequently more accurate saliency measurement, and (2) we propose a new refinement framework, which is composed of edge-preserving element-level filtering and upsampling based on geodesic distance. The new refinement framework has an excellent ability to uniformly highlight the salient object and remove background noise.

2. Proposed Approach

Our proposed approach adopts a coarse-to-fine strategy in which every step fully utilizes the information from the previous step. The approach consists of three main steps. First, the patches from the image border are used to construct the background distribution, from which a coarse saliency map is derived. Second, the coarse saliency map is enhanced by edge-preserving element-level filtering. Third, the final saliency map is obtained by refinement based on geodesic distance. The main framework is depicted in Figure 1. Each step is described in detail in the sections below.

2.1. Background Distribution

Since the backgroundness prior has shown its importance in salient object detection, we aim to find an effective way to establish the background distribution, from which the saliency map can be calculated. Similar to the work in [11], we define the pseudobackground region as the border region of the image. Some improved methods [15-17] separate the border region into several fixed parts, such as top, bottom, left, and right, and deal with each of them separately. In our view, this may fail to exploit the statistical information of the backgroundness prior as fully as possible. In our approach, we construct the background distribution with the help of a clustering algorithm. Clustering all patch features within the pseudobackground region provides a way to construct the background distribution that automatically and substantially exploits the statistical information of the backgroundness prior. The work in [18] also uses a clustering algorithm to establish the background distribution and generates a coarse saliency map based on that distribution. Different from our method, it considers only color information at the superpixel level. By contrast, the patch feature used in our method represents not only color information but also structural information of the neighborhood of a location. The effectiveness of patch features has already been demonstrated in [13, 19].

For the feature representation of each patch, we use the CIE LAB color space, which possesses the property of perceptual uniformity and has proved its effectiveness in many saliency algorithms [20, 21]. Thus, if we take a patch from an image, the feature of this patch is represented by a column vector of length pixelNumber × channelNumber. This kind of representation captures both color and pattern information. Here, we denote the vectorized feature of the image patch at location $i$ by $\mathbf{f}_i$.
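The vectorization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `patch_feature`, the nested-list image layout, and the patch size are assumptions, and the pixels are assumed to be already converted to CIE LAB.

```python
def patch_feature(image, row, col, size):
    """Flatten the size x size patch whose top-left corner is (row, col).

    `image` is a list of rows; each pixel is a tuple of channel values
    (assumed to be L, a, b). The result has size * size * channels entries,
    i.e. pixelNumber x channelNumber.
    """
    feature = []
    for r in range(row, row + size):
        for c in range(col, col + size):
            feature.extend(image[r][c])
    return feature


# Toy 3x3 "image" with three channels per pixel (values are made up).
image = [[(r, c, r + c) for c in range(3)] for r in range(3)]
vec = patch_feature(image, 0, 0, 2)
print(len(vec))  # 2 * 2 pixels * 3 channels = 12
```

Because neighboring pixels keep a fixed position in the vector, two patches with the same colors but a different spatial arrangement yield different features, which is how the representation carries pattern information.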

After patch features have been drawn from the border region of an image, we treat them as a collection of background feature samples and use the $k$-means clustering algorithm [22] to cluster the samples. Compared to separating the border region into fixed parts and dealing with each of them, clustering the features within the border region into different groups provides a novel way to exploit the statistical and structural information of the backgroundness prior. Consequently, it finds the clusters that best represent the background feature distribution. Larger clusters contribute more to the background distribution and smaller clusters contribute less. By experimenting with different numbers of clusters, we found that clustering the samples into four clusters maintains relatively good performance; thus we cluster the features into four groups. From the above process, we obtain four feature clusters; the number of samples in the $k$th cluster is denoted $n_k$, and the centroid of the $k$th cluster is denoted $\mathbf{c}_k$, $k = 1, \ldots, 4$.
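A small sketch of the clustering step is given below. It is a plain Lloyd-style k-means written for illustration only; the deterministic initialization from evenly spaced samples and the fixed iteration count are simplifying assumptions, and the toy data uses two clusters rather than the four used in the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(sample, centroids):
    """Index of the centroid closest to `sample`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(sample, centroids[j]))


def kmeans(samples, k, iterations=20):
    """Return (centroids, cluster sizes, labels) for the given samples."""
    n = len(samples)
    # Assumption: initialize from evenly spaced samples to stay deterministic.
    centroids = [list(samples[i * n // k]) for i in range(k)]
    for _ in range(iterations):
        labels = [nearest(s, centroids) for s in samples]
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(s, centroids) for s in samples]
    sizes = [labels.count(j) for j in range(k)]
    return centroids, sizes, labels


# Toy border features: two well-separated groups of 2-D vectors.
samples = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [11, 11]]
centroids, sizes, labels = kmeans(samples, k=2)
print(sizes)  # [3, 4]
```

The returned cluster sizes and centroids are the two quantities the later steps rely on: sizes determine how much each cluster contributes, and centroids are the reference points for the distance maps.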

In previous work, a variety of measurements have been used to measure image saliency. The most commonly used measurements are rarity and contrast. However, as stated in Section 1, such measurements may not be suitable for more general cases. A more intuitive saliency measurement can be defined as the difference from the background distribution, which describes the backgroundness prior. In our approach, we consider a location to be salient if the patch feature corresponding to that location is distant from the background distribution. We adopt Euclidean distance as the distance metric.

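Since we obtain four background feature clusters, we then calculate a distance map for each cluster. In each distance map, the value at every location is defined as the Euclidean distance between the patch feature at that location and the centroid of the corresponding feature cluster. The distance map with respect to the $k$th cluster is computed by
$$D_k(i) = \lVert \mathbf{f}_i - \mathbf{c}_k \rVert_2, \tag{1}$$
where $i$ denotes the patch location in the image, $\mathbf{f}_i$ is the patch feature at that location, $\mathbf{c}_k$ is the centroid of the $k$th cluster, and $k = 1, \ldots, 4$.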
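As an illustration of the distance map, the sketch below computes the Euclidean distance from every location's patch feature to one cluster centroid. The grid-of-lists layout and the function name `distance_map` are assumptions made for the example.

```python
import math


def distance_map(features, centroid):
    """Euclidean distance from each location's patch feature to one centroid.

    `features` is a 2-D grid (list of rows) of feature vectors, one per
    patch location; the result is a grid of the same shape.
    """
    return [[math.dist(f, centroid) for f in row] for row in features]


# Toy 2x2 grid of 2-D patch features and a centroid at the origin.
features = [[[0, 0], [3, 4]],
            [[6, 8], [0, 0]]]
dmap = distance_map(features, [0, 0])
print(dmap)
```

Calling `distance_map` once per centroid yields one map per background cluster, so four clusters give four maps.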

In this way, four distance maps are derived, and each map is normalized to a common range. We then create a coarse saliency map by combining the four distance maps. In the combination formula (2), “$\circ$” denotes element-wise multiplication and $w_k$ denotes the weight of the $k$th distance map. The weight $w_k$ is defined as
$$w_k = \frac{n_k}{n}, \tag{3}$$
where $n_k$ is the number of samples in the $k$th cluster and $n = \sum_{k=1}^{4} n_k$. The weight is thus the ratio of the number of samples in each cluster to the total number of samples. As a result, the coarse saliency map is more similar to the distance maps produced from larger clusters. Larger clusters have stronger power to represent the real background, so the distance maps produced from them measure departure from the background more reliably and are more likely to highlight salient regions. The whole process of producing the coarse saliency map is shown in Figure 2 (patches are zoomed in for observation purposes).
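The weighting idea can be sketched as below. Two points are assumptions rather than the paper's definitions: the maps are min-max scaled to [0, 1], and the combination is taken to be a plain weighted sum of the normalized maps, chosen only because it makes larger clusters dominate the result as the text describes.

```python
def normalize(grid):
    """Min-max scale a 2-D grid to [0, 1] (assumed normalization)."""
    lo = min(min(row) for row in grid)
    hi = max(max(row) for row in grid)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / span for v in row] for row in grid]


def coarse_saliency(distance_maps, cluster_sizes):
    """Combine distance maps with weights n_k / n (assumed weighted sum)."""
    total = sum(cluster_sizes)
    weights = [n_k / total for n_k in cluster_sizes]
    maps = [normalize(m) for m in distance_maps]
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, maps))
             for c in range(cols)]
            for r in range(rows)]


# Two toy 1x2 distance maps; the first cluster holds 3 of the 4 samples.
coarse = coarse_saliency([[[0, 2]], [[4, 0]]], cluster_sizes=[3, 1])
print(coarse)  # [[0.25, 0.75]]
```

In the toy output the result follows the first map more closely, because that map comes from the cluster with three quarters of the samples.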

2.2. Edge-Preserving Element-Level Filtering

In the previous section, a coarse saliency map was generated. In this section, we aim to further discriminate the salient object from the background and to highlight the salient object uniformly. Before the main operation, we decompose the image into homogeneous basic elements and define saliency at the element level hereafter. This is based on the observation that an image can be decomposed into basic, structurally representative elements that abstract away unnecessary detail while allowing a clear and intuitive definition of saliency. Each element should locally abstract the image into a perceptually homogeneous region. In our implementation, we use a linear spectral clustering method [23] to abstract the image into homogeneous elements.

In this section, we adopt a method called edge-preserving element-level filtering to enhance the salient region. This step is motivated by the observations that (1) a salient object tends to have sharp edges against the background and (2) a region is likely to be salient if it is surrounded by salient regions. Edge-preserving element-level filtering operates on the basic elements described above. Its mathematical formula is given in (4), which we explain in the next paragraph. The operation preserves sharp edges and refines the saliency value of an element by combining the saliency values of its surrounding elements.

After decomposing the image into homogeneous elements, the coarse saliency value of each element (superpixel) is calculated by averaging the coarse saliency values of all pixels inside it. Then, for the $i$th superpixel with coarse saliency value $S_c(i)$, the refined saliency value $S_e(i)$ is given by formula (4), where $\mathcal{N}_i$ is the set of superpixels that lie within a certain distance of superpixel $i$, $w_{ij}$ is a weight based on the Euclidean distance between superpixels $i$ and $j$, and $\delta_i$ indicates whether the $i$th superpixel touches sharp edges in the image. Formula (4) can be interpreted as follows: if the $i$th superpixel touches sharp edges in the image, its saliency value remains unchanged from its coarse saliency value; if it is not directly connected to sharp edges, its saliency value is determined by the saliency values of its surrounding superpixels.
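The per-superpixel averaging can be sketched as follows. The label map would normally come from a superpixel algorithm; here it is a hand-written toy grid, and the function name `superpixel_means` is an assumption.

```python
def superpixel_means(saliency, labels):
    """Average the pixel-level saliency inside each superpixel.

    `saliency` and `labels` are 2-D grids of the same shape; `labels`
    holds the superpixel id of every pixel. Returns {id: mean saliency}.
    """
    sums, counts = {}, {}
    for sal_row, lab_row in zip(saliency, labels):
        for value, lab in zip(sal_row, lab_row):
            sums[lab] = sums.get(lab, 0.0) + value
            counts[lab] = counts.get(lab, 0) + 1
    return {lab: sums[lab] / counts[lab] for lab in sums}


# Toy 2x2 saliency map split into a left (0) and a right (1) superpixel.
pixel_saliency = [[0.0, 1.0],
                  [0.5, 1.0]]
pixel_labels = [[0, 1],
                [0, 1]]
means = superpixel_means(pixel_saliency, pixel_labels)
print(means)  # {0: 0.25, 1: 1.0}
```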

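Given superpixel positions $\mathbf{p}_i$ and $\mathbf{p}_j$, the weight $w_{ij}$ in (4) is defined by the Euclidean distance between the positions of the $i$th and $j$th superpixels, $\lVert \mathbf{p}_i - \mathbf{p}_j \rVert_2$.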

The indicator $\delta_i$ in (4) is determined as follows: all superpixels contiguous to the $i$th superpixel are found first, then the variance of the saliency values of these superpixels is computed, and finally, if the variance is larger than a certain threshold, the superpixel is regarded as touching sharp edges and $\delta_i = 1$; otherwise $\delta_i = 0$.
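The behavior described for the filtering step can be sketched as below. Several details are assumptions made for illustration and are not taken from the paper: the variance is computed over the superpixel together with its contiguous neighbors, the spatial weight is a Gaussian of the Euclidean distance with a free parameter `sigma`, the weighted sum is normalized by the total weight, and `radius` and `var_threshold` are hypothetical parameters.

```python
import math
from statistics import pvariance


def edge_preserving_filter(saliency, positions, adjacency,
                           radius, var_threshold, sigma):
    """Refine superpixel saliency while leaving edge superpixels untouched."""
    refined = []
    for i, s_i in enumerate(saliency):
        # Indicator: large local variance means the superpixel touches an edge.
        local = [s_i] + [saliency[j] for j in adjacency[i]]
        if pvariance(local) > var_threshold:
            refined.append(s_i)  # keep the coarse value
            continue
        # Otherwise average the surrounding superpixels within `radius`.
        num = den = 0.0
        for j, s_j in enumerate(saliency):
            d = math.dist(positions[i], positions[j])
            if j != i and d <= radius:
                w = math.exp(-d * d / (2 * sigma * sigma))  # assumed weight
                num += w * s_j
                den += w
        refined.append(num / den if den > 0 else s_i)
    return refined


# Four superpixels on a line; the last two straddle a sharp saliency edge.
coarse = [0.1, 0.3, 0.1, 0.9]
positions = [(0, 0), (1, 0), (2, 0), (3, 0)]
adjacency = [[1], [0, 2], [1, 3], [2]]
refined = edge_preserving_filter(coarse, positions, adjacency,
                                 radius=1.5, var_threshold=0.05, sigma=1.0)
print(refined)
```

In this toy run the two superpixels next to the jump from 0.1 to 0.9 keep their coarse values, while the two superpixels in the flat region take on the values of their neighbors.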

Results of edge-preserving element-level filtering are illustrated in Figure 3. The coarse saliency maps are strongly enhanced by this procedure: sharp edges of the salient object are well preserved and background noise is largely removed. The work in [14] adopted an operation called “context-based error propagation,” which is similar to the edge-preserving element-level filtering in this section. It smooths region saliency by taking all other region elements in the same feature cluster into consideration, while our method refines region saliency using spatially adjacent regions. Our method is simpler and more intuitive and, above all, preserves sharp edges well.

2.3. Geodesic Distance Refinement

The final step of our proposed approach is refinement with geodesic distance [24]. This step is motivated by the consideration that determining the saliency of an element as a weighted sum of the saliency of its surrounding elements, where the weights correspond to Euclidean distance, has limited ability to highlight the salient object uniformly. We therefore look for a solution that enhances the regions of the salient object more uniformly. From recent works [25, 26], we found that the weights may be sensitive to geodesic distance. Thus, the saliency value based on geodesic distance is defined as follows:
$$S_g(i) = \sum_{j=1}^{N} w^{\mathrm{geo}}_{ij}\, S_e(j), \tag{6}$$
where $S_e(j)$ is the saliency value of the $j$th superpixel obtained in Section 2.2, $N$ is the total number of superpixels, and $w^{\mathrm{geo}}_{ij}$ is the weight based on the geodesic distance between the $i$th and $j$th superpixels.

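The weight values are produced in a way similar to the work in [25]. First, an undirected weighted graph is constructed by connecting all adjacent superpixels and assigning each edge a weight equal to the Euclidean distance between the saliency values of its two superpixels, as derived in Section 2.2. Then, the geodesic distance between two superpixels is defined as the accumulated edge weights along their shortest path on the graph:
$$d_{\mathrm{geo}}(i, j) = \min_{p_1 = i,\, p_2,\, \ldots,\, p_m = j} \sum_{t=1}^{m-1} \lvert S_e(p_t) - S_e(p_{t+1}) \rvert, \tag{7}$$
where the minimum is taken over all paths of adjacent superpixels from $i$ to $j$ and $S_e(\cdot)$ is the saliency value derived in Section 2.2.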
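The shortest-path definition above can be sketched with Dijkstra's algorithm, which applies because every edge weight (an absolute saliency difference) is nonnegative. The adjacency-list layout and the function name `geodesic_distances` are assumptions for the example.

```python
import heapq


def geodesic_distances(saliency, adjacency, source):
    """Geodesic distance from `source` to every superpixel.

    Edge weights are absolute differences between the saliency values of
    adjacent superpixels; the distance is the cheapest accumulated weight.
    """
    dist = [float("inf")] * len(saliency)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in adjacency[u]:
            nd = d + abs(saliency[u] - saliency[v])
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Four superpixels arranged in a ring: 0-1-2-3-0.
sal = [0.0, 0.5, 1.0, 0.0]
ring = [[1, 3], [0, 2], [1, 3], [2, 0]]
print(geodesic_distances(sal, ring, source=0))  # [0.0, 0.5, 1.0, 0.0]
```

Superpixel 3 is at geodesic distance zero from superpixel 0 because they are adjacent and equally salient, which is the "flat region" case; superpixel 2 is separated from both by a full step in saliency.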

In this way, we obtain the geodesic distance $d_{\mathrm{geo}}(i, j)$ between any two superpixels $i$ and $j$ in the image. Then, the weight is defined as
$$w^{\mathrm{geo}}_{ij} = \exp\!\left(-\frac{d_{\mathrm{geo}}^{2}(i, j)}{2\sigma^{2}}\right), \tag{8}$$
where $\sigma$ is the deviation of all $d_{\mathrm{geo}}$ values. From formula (8), we can conclude that when superpixels $i$ and $j$ are in a flat region, the saliency value of $j$ contributes more to the saliency value of $i$, and when $i$ and $j$ are in different regions separated by a steep slope, the saliency value of $j$ contributes less to the saliency value of $i$.
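A sketch of the refinement is shown below. It assumes a Gaussian falloff of the weight with geodesic distance and a plain (unnormalized) weighted sum over all superpixels; both are assumptions for illustration, as is taking the deviation over every entry of the distance matrix, diagonal included.

```python
import math
from statistics import pstdev


def geodesic_refine(saliency, geo):
    """Refine saliency using weights that decay with geodesic distance.

    `geo[i][j]` is the geodesic distance between superpixels i and j.
    """
    sigma = pstdev([d for row in geo for d in row])
    n = len(saliency)
    refined = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if sigma > 0:
                w = math.exp(-geo[i][j] ** 2 / (2 * sigma ** 2))  # assumed form
            else:
                w = 1.0  # all distances equal: every superpixel counts equally
            total += w * saliency[j]
        refined.append(total)
    return refined


# Two superpixels separated by a geodesic distance of 1.
result = geodesic_refine([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
print(result)
```

In practice the output would usually be rescaled before display, since an unnormalized sum can exceed the range of the input values.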

To demonstrate the effectiveness of refinement using geodesic distance, we compare geodesic distance and Euclidean distance when they are used as combination weights. Figure 4 shows the experimental results. Refinement based on geodesic distance performs much better than refinement based on Euclidean distance, in that salient objects are distinctively and uniformly highlighted.

3. Experiment

We provide an extensive comparison of our approach with several state-of-the-art methods on two of the most commonly used datasets. The two benchmark datasets are ASD [7] and MSRA10K [8]. The ASD dataset contains 1000 images, while MSRA10K contains 10000 images, both with pixel-level ground truth.

3.1. Qualitative Evaluation

Figure 5 presents a qualitative comparison of our approach with seven other outstanding methods: Itti [27], GBVS [28], FT [7], CAS [19], PCA [13], wCtr [25], and DRFI [11]. The results of Itti and GBVS detect only the rough location of the salient object, which keeps them far from practical application. CAS places more emphasis on object edges. While PCA performs better overall than the methods just mentioned, it still fails when the salient object occupies most of the image. FT has a simple and efficient implementation, but problems occur when global cues play a role in saliency detection, as FT focuses on local cues. The wCtr method has discriminative power to highlight salient objects but becomes unstable when some nonsalient regions are also clustered as salient objects and are hardly connected to the image boundary. Unlike these methods, which are all bottom-up approaches, DRFI incorporates high-level priors and supervised learning, which makes it stand out among the other approaches. Although our method is based on bottom-up processing, the experimental results show that it provides more favorable results than the others and even performs better than a supervised approach such as DRFI. It is worth noting that our method operates on the original resolution of the input image without any rescaling.

3.2. Metric Evaluation

Similar to [7, 8], we use precision and recall as the evaluation metrics. Precision corresponds to the percentage of salient pixels correctly assigned, while recall corresponds to the fraction of assigned salient pixels in relation to the ground truth number of salient pixels. A precision-recall (PR) curve is produced by binarizing the saliency map at thresholds varied over the full range of saliency values and comparing each result with the ground truth. The PR curve is an important metric in performance evaluation. The PR curve for each dataset is given in Figure 6.
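The construction of a PR curve can be sketched as follows. The threshold list is supplied by the caller, and reporting a precision of 1.0 when no pixel is predicted salient is a convention chosen for this example rather than something stated in the paper.

```python
def pr_curve(saliency, truth, thresholds):
    """Return (threshold, precision, recall) for each threshold.

    `saliency` and `truth` are 2-D grids of the same shape; nonzero
    entries of `truth` mark ground-truth salient pixels.
    """
    values = [v for row in saliency for v in row]
    labels = [bool(t) for row in truth for t in row]
    positives = sum(labels)
    curve = []
    for th in thresholds:
        predicted = [v >= th for v in values]
        tp = sum(p and t for p, t in zip(predicted, labels))
        n_pred = sum(predicted)
        precision = tp / n_pred if n_pred else 1.0  # assumed convention
        recall = tp / positives if positives else 0.0
        curve.append((th, precision, recall))
    return curve


# Toy 2x2 saliency map; the top row is the ground-truth object.
sal_map = [[1.0, 0.5], [0.25, 0.0]]
gt = [[1, 1], [0, 0]]
for point in pr_curve(sal_map, gt, thresholds=[0.0, 0.5, 1.0]):
    print(point)
```

Raising the threshold trades recall for precision: at 0.0 every pixel is predicted salient, at 0.5 the prediction matches the ground truth exactly, and at 1.0 only one of the two object pixels survives.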

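In addition to precision and recall, we compute the $F$-measure, which is defined as follows:
$$F_\beta = \frac{(1 + \beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Precision} + \mathrm{Recall}}, \tag{9}$$
where $\beta^{2}$ is a preset weighting parameter. In Figure 7, we show the precision, recall, and $F$-measure values for the adaptive threshold, which is defined as twice the mean saliency of the image. Figures 6 and 7 show that the proposed method achieves the best performance among all these methods on the two datasets.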
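The adaptive-threshold evaluation can be sketched as below, using the standard weighted F-measure. The value `beta_sq=0.3` in the usage example is a choice that is common in saliency benchmarks and is used here only as an assumption; the function name and grid layout are also illustrative.

```python
def adaptive_f_measure(saliency, truth, beta_sq):
    """Precision, recall, and F-measure at twice the mean saliency."""
    values = [v for row in saliency for v in row]
    labels = [bool(t) for row in truth for t in row]
    threshold = 2 * sum(values) / len(values)  # adaptive threshold
    predicted = [v >= threshold for v in values]
    tp = sum(p and t for p, t in zip(predicted, labels))
    n_pred, n_true = sum(predicted), sum(labels)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = beta_sq * precision + recall
    f = (1 + beta_sq) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f


# Toy 2x2 map: mean saliency 0.375, so the adaptive threshold is 0.75.
sal_map = [[1.0, 0.5], [0.0, 0.0]]
gt = [[1, 1], [0, 0]]
precision, recall, f = adaptive_f_measure(sal_map, gt, beta_sq=0.3)
print(precision, recall, f)
```

With a weighting parameter below 1, the measure leans toward precision, so a map that detects only part of the object but makes no false detections still scores relatively well.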

3.3. Performance

In order to evaluate algorithms thoroughly, we also compare the average running time of our approach to other methods on the benchmark images. Timing has been taken on a 2.6 GHz CPU with 8 GB RAM. The environment is Matlab 2016a installed on Windows. Running times of algorithms are listed in Table 1.

The CAS method is slower because of its exhaustive nearest-neighbor search. Our running time is similar to that of the PCA method, since both methods involve location-wise patch feature extraction and comparison. The largest shares of our running time go to the first step of feature clustering (21%) and the subsequent coarse saliency map computation (25%). The running time of our method is moderate among these methods. Considering that our method operates on the original resolution of the input image, its computational cost is still favorable compared with the other methods.

4. Conclusion

In this paper, an effective bottom-up salient object detection method is presented. To utilize the statistical and structural information of backgroundness priors, a background distribution based on feature clustering is constructed. That distribution allows a coarse saliency map to be computed. Then two refinement steps, edge-preserving element-level filtering and upsampling based on geodesic distance, are applied to obtain the final saliency map. In the final map, salient objects are uniformly highlighted and background noise is removed thoroughly. The experimental evaluation demonstrates the effectiveness of the proposed approach compared with state-of-the-art methods.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National High Technology Research and Development Program of China (863 Program, Program no. 2011AA7031002G).