Abstract

Saliency detection is an important preprocessing step in many application fields such as computer vision, robotics, and graphics, where it reduces computational cost by focusing on significant positions and neglecting nonsignificant ones in the scene. Different from most previous methods, which mainly utilize the contrast of low-level features and fuse various feature maps in a simple linear weighting form, in this paper we propose a novel salient object detection algorithm that takes both background and foreground cues into consideration and integrates a bottom-up coarse salient region extraction and a top-down background measure via boundary label propagation into a unified optimization framework to acquire a refined saliency detection result. The coarse saliency map is itself fused from three components: a local contrast map that accords more closely with psychological law, a global frequency prior map, and a global color distribution map. To form the background map, we first construct an affinity matrix, select nodes lying on the image border as labels representing the background, and then carry out a propagation to generate the regional background map. The proposed model is evaluated on four datasets. As demonstrated in the experiments, our proposed method outperforms most existing saliency detection models with robust performance.

1. Introduction

Many studies in cognitive psychology have shown that, given a visual scene, human attention is directed to particular parts by a visual selective mechanism, and these parts are called salient regions [1]. In computer vision, salient region detection simulates the functionality of selective attention and localizes and tags the attention-grabbing regions or pixels in a digital image. Borji et al. [2] provided a more precise definition: a salient region detection model should first detect the salient, attention-grabbing objects in a context and then segment the whole objects. Usually, a generated saliency map is the output of the model, and the intensity of each pixel in the map represents its probability of belonging to salient regions. According to this definition, the issue is essentially a figure/ground segmentation problem whose goal is to segment only the salient foreground object from the background. However, it is slightly different from the traditional image segmentation problem, which aims to partition an image into perceptually coherent regions.

Saliency detection is an important preprocessing step in many areas such as computer vision, graphics, and robotics, reducing computational cost by focusing on salient regions and neglecting nonsalient ones. The value of salient region detection models lies in their wide range of applications, for instance, object detection and recognition [3], image and video compression [4], thumbnailing [5], image quality assessment [6], image segmentation [7], content-based image retrieval [8], and so on.

Viewed from the information processing perspective, existing models of visual saliency detection fall into two main categories: bottom-up and top-down. Bottom-up methods [9–11], also known as stimuli-driven or task-independent models, mainly detect saliency based on low-level feature attributes (color, orientation, motion, etc.) without any prior knowledge. Top-down approaches [12–15] are often task-driven or scene-dependent models learned through a training process, which requires specific prior knowledge or high-level information.

Most bottom-up saliency methods are based on an intuitive assumption: appearance contrast between object and background. Depending on the range over which the contrast is computed, bottom-up methods can be further divided into local methods [16–19] and global methods. Local methods compute various contrast measures in a local neighborhood of the pixel/patch, such as self-information [16], edge contrast [17], center-surround discriminative power [18], and center-surround differences [17, 19].

As a pioneer, Itti et al. [19] proposed employing center-surround differences across multiscale image features to implement image saliency detection. Hereafter, many extended approaches based on this local contrast method have been developed, including the graph-based (GB) visual saliency model by Harel et al. [20] and a model (AC) presented by Achanta et al. [21] that determines salient regions by exploiting low-level information of color and luminance. Goferman et al. [22] proposed a context-aware (CA) algorithm realizing three principles involving local low-level clues, global considerations, and visual organization rules to highlight salient objects along with their contexts. However, these local contrast methods still have several unresolved issues; typically, the boundaries of the salient object are found well, but the saliency is not propagated to the object interior, so the object is not highlighted uniformly, as shown in Figure 1(c). Moreover, not all regions that are unique in terms of local contrast are salient; a small region with large local contrast may be viewed by a human as a meaningless distractor.

Despite the fact that local contrast conforms to the psychobiology principle, global methods should also be taken into account to capture holistic rarity in the image, for example when a region is similar to its neighborhood yet stands out in the whole scene.

Global methods generally exploit the whole image to compute the saliency of each pixel/region. Moreover, they assign comparable saliency values across similar regions. Specifically, they suppress features that occur frequently, while maintaining those that deviate from the norm. Some methods assume that globally less frequent features are more salient and use frequency analysis in the spectral domain [23, 24]. Other methods compare each individual pixel/patch to the rest of the image and use the averaged appearance dissimilarity as the saliency measure [25]; the averaging is usually weighted by spatial distances between pixels/patches to reflect the fact that salient parts are usually grouped together to form a salient object.

However, global methods have inherent disadvantages (illustrated in Figure 1(d)). When a foreground region is globally compared with the remainder of the image, its contrast with the background is less prominent and the salient object is unlikely to be uniformly highlighted. Also, most global methods rely mainly on color uniqueness in terms of global statistics, which can be insufficient to deal with the complex variations common in natural images. Moreover, these methods ignore the spatial relations between different parts, which can be vital for reliable saliency detection.

Even when combining the local method with the global one, some critical problems remain, such as objects containing salient small-scale patterns or backgrounds containing complex patterns, as shown in Figure 1(e). In the figure, the yellow flowers on the grass are likewise highlighted by the method in [17]; however, they are actually part of the background when the picture is viewed as a whole.

Comparatively speaking, bottom-up saliency detection models based on local or global computation are usually better at detecting fine details than at capturing global shape information. Top-down models, on the other hand, are typically used to detect objects of specific sizes and categories based on more representative features learned from training samples. In addition, these models can leverage high-level prior knowledge, for example, the prior that most background regions can be easily connected to the image boundaries, as noted in [26]. Nevertheless, their results tend to be coarse with fewer details. In terms of computational complexity, bottom-up approaches are often more efficient than top-down ones.

Because both bottom-up and top-down models have drawbacks, it is natural to integrate them to detect salient objects more effectively and efficiently. Reference [27] proposed a new computational visual attention model that combines bottom-up and top-down mechanisms for man-made object detection in scenes. This model shows that the statistical characteristics of orientation features can be used as top-down clues to help determine the location of salient objects in natural scenes. Reference [28] developed a computational model that performs such a combination in a principled way. The system learns an optimal representation of the influences of task and context and thereby constructs a biased saliency map representing the top-down information. This map is combined with bottom-up saliency maps in a process progressing over time as a function of the input. However, this method is not universal; it is only applied to the study of visual fixation shifts. Reference [29] proposed a salient region model based on both bottom-up and top-down mechanisms: color contrast and orientation contrast are adopted to calculate the bottom-up feature maps, while the top-down cue of depth-from-focus from the same single image is used to guide the generation of the final salient regions, since depth-from-focus reflects the photographer’s preference and knowledge of the task. Liu et al. [17] formulated salient object detection as a binary labeling problem and developed a conditional random field method to effectively combine bottom-up and top-down approaches. At the same time, a set of novel features including multiscale contrast, center-surround histogram, and color spatial distribution was put forward to describe a salient object locally, regionally, and globally.

However, most existing integrated methods, including those mentioned above, adopt a single linear weighting scheme to combine the feature maps computed by bottom-up or top-down approaches. Such a fixed linear weighting lacks universality, and performance decreases significantly when it is applied to diverse types of images. In addition, existing integration models lack effectiveness and sufficient flexibility because several parameters must be specified empirically.

In this paper, a method is proposed to address the above problems. Figure 2 outlines the proposed saliency detection algorithm. The contributions of our work include the following.
(1) A new local contrast map measurement method that aligns more closely with physiological law.
(2) A bottom-up coarse salient region extraction method that combines the local method with the global method, where the global method is derived from two kinds of prior information.
(3) A novel top-down background measure via boundary label propagation. We first construct an affinity matrix, then select nodes lying on the border as labels representing the background, and finally carry out a propagation to generate the regional background map.
(4) Unlike the conventional, commonly used linear weighted combination, a unified approach that integrates bottom-up, low-level features and top-down, high-level priors for saliency optimization and refinement.

Figure 3 depicts examples of the proposed method. To verify the effectiveness of our approach, we performed extensive comparisons on benchmark datasets. Performance was evaluated in terms of precision-recall, F-measure, and mean absolute error (MAE), and our method was compared with eight state-of-the-art techniques. Both quantitative and qualitative experimental results demonstrate that our proposed method achieves stable and excellent performance against other methods.

2. Coarse Salient Regions Extraction

In this section, we formulate the computations of the three components used in the proposed coarse salient region extraction method: the local contrast map, the frequency prior map, and the global color distribution map. After obtaining these three maps, we combine them to perform coarse salient region extraction. Note that, before computing the local contrast map, we first employ the structure extraction algorithm [30] for preprocessing.

2.1. Local Contrast Map Measure
2.1.1. Structure Extraction

Some physiology experiments [31] have demonstrated that the human visual system is more sensitive to patterns in middle frequencies than to those in high frequencies. Figure 4 depicts input images and the corresponding texture-suppressed images. At first glance, viewers are attracted by the flowers in the images, while ignoring the minor variations of the leaves surrounding them. However, the effects of such minor variations within textures may accumulate and cause difficulties in the detection of salient objects. Therefore, we adopt the structure extraction algorithm proposed in [30] to suppress these small variations in textures. This algorithm smoothes the local gradients in textures, preserves the global structures of the objects, and diminishes insignificant details, thereby producing more accurate boundary information. The objective function is expressed as
$$S = \arg\min_{S} \sum_{p} \left( (S_p - I_p)^2 + \lambda \left( \frac{D_x(p)}{L_x(p)+\varepsilon} + \frac{D_y(p)}{L_y(p)+\varepsilon} \right) \right), \tag{1}$$
where $I$ is a given image, which could be the luminance (or log luminance) channel, and $p$ indexes pixels. $S$ is the extracted structure image. The data term $(S_p - I_p)^2$ serves to make the extracted structures similar to those in the given image. The second term is a regularizer named relative total variation, which is effective at emphasizing the main structures. Different values of the parameter $\lambda$ in (1) output images with various smoothness levels, and $\varepsilon$ is a small positive number to avoid division by zero. The larger $\lambda$ is, the smoother the image is. $D_x(p)$ and $D_y(p)$ are the sums of the absolute spatial differences in the $x$ and $y$ directions weighted by a Gaussian function within a window around pixel $p$. $L_x(p)$ and $L_y(p)$ are the moduli of the directional spatial difference sums in the $x$ and $y$ directions weighted by a Gaussian function within a window around pixel $p$.

After this structure extraction, the texture-suppressed image is segmented into superpixels by the simple linear iterative clustering (SLIC) algorithm [32] for the subsequent processing.
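For concreteness, the following is a minimal Python sketch of this preprocessing stage. Because the relative total variation solver of [30] is not available as a standard library call, a total-variation denoiser is used here purely as a stand-in for texture suppression; the SLIC call and the parameter values (e.g., 600 superpixels, as used later in Section 5.1) are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
from skimage import io, img_as_float
from skimage.restoration import denoise_tv_chambolle   # stand-in for RTV structure extraction [30]
from skimage.segmentation import slic

def preprocess(image_path, n_superpixels=600):
    """Suppress texture, then segment the smoothed image into superpixels."""
    img = img_as_float(io.imread(image_path))

    # Texture suppression: a TV denoiser roughly mimics the effect of the
    # relative total variation model (larger weight -> smoother result).
    # On older scikit-image versions, replace channel_axis=-1 with multichannel=True.
    smoothed = denoise_tv_chambolle(img, weight=0.1, channel_axis=-1)

    # SLIC superpixel segmentation on the texture-suppressed image [32].
    labels = slic(smoothed, n_segments=n_superpixels, compactness=10, start_label=0)
    return smoothed, labels

# Example usage: smoothed, labels = preprocess("example.jpg")
```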

2.1.2. Weber Local Descriptor

The Weber local descriptor (WLD) [33] is inspired by Weber’s law, a physiological law [34]. Weber’s law implies that the just-noticeable change of a stimulus (such as lighting) is a constant ratio of the original stimulus. If the change is smaller than this constant ratio of the original stimulus, the human visual system does not recognize it as a valid signal; rather, it is perceived as background noise. Accordingly, WLD extracts features from the image by simulating the way a viewer perceives the surroundings. It consists of two components: differential excitation and orientation. The differential excitation of a given pixel is computed from the ratio between two terms: the relative intensity difference of the current pixel against its neighbors (e.g., a 3 × 3 region) and the intensity of the current pixel itself. After obtaining the differential excitation, we can derive the local salient patterns from the input image.

Based on the implication of Weber’s law, we must compute the two constituent parts of the WLD feature for each pixel of the input image: differential excitation and gradient orientation. In addition, four filtering structures, $f_{00}$, $f_{01}$, $f_{10}$, and $f_{11}$, are required in the calculation process (see Figure 5).

Suppose the differences between the center pixel $x_c$ and its $p$ neighbors $x_i$ ($i = 0, \ldots, p-1$) are the output of filter $f_{00}$, calculated as
$$v_s^{00} = \sum_{i=0}^{p-1} (x_i - x_c).$$
Then the differential excitation on pixel $x_c$ is computed as
$$\xi(x_c) = \arctan\left( \frac{v_s^{00}}{v_s^{01}} \right) = \arctan\left( \sum_{i=0}^{p-1} \frac{x_i - x_c}{x_c} \right),$$
where $v_s^{01} = x_c$ is the output of filter $f_{01}$.

Here, the value $\xi(x_c)$ may be negative when the intensities of the neighbors are smaller than that of the current pixel. That is, a positive $\xi(x_c)$ means that the surroundings are lighter than the current pixel, while a negative $\xi(x_c)$ indicates darker surroundings.

Similarly, we compute the orientation component of WLD as the gradient orientation described in [33], which is defined as
$$\theta(x_c) = \arctan\left( \frac{v_s^{11}}{v_s^{10}} \right),$$
where $v_s^{11}$ and $v_s^{10}$ are the outputs of the filters $f_{11}$ and $f_{10}$, respectively.

As described in [33], WLD provides several advantages. First, it can detect edges that conform elegantly to the subjective criterion because it relies on the perceived luminance difference. Furthermore, WLD is a powerful feature extraction method for textures and is robust to noise in a given image. Most importantly, WLD is based on a physiological law: it extracts features from an image by simulating a human’s perception of his or her surroundings.
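As a concrete illustration, the following Python sketch computes the WLD differential excitation over a whole image with a 3 × 3 neighborhood. It is a minimal reading of the formulas above; the small constant added to the denominator to avoid division by zero is an implementation assumption, not part of [33].

```python
import numpy as np
from scipy.ndimage import convolve

def wld_differential_excitation(gray, eps=1e-6):
    """Differential excitation xi = arctan(sum(neighbors - center) / center),
    computed densely with a 3x3 neighborhood (filter f_00)."""
    gray = gray.astype(np.float64)

    # f_00: each of the 8 neighbors contributes +1, the center contributes -8,
    # so the filter response equals sum_i (x_i - x_c).
    f00 = np.array([[1, 1, 1],
                    [1, -8, 1],
                    [1, 1, 1]], dtype=np.float64)
    v00 = convolve(gray, f00, mode='reflect')

    # f_01 simply passes the center intensity through (v01 = x_c).
    v01 = gray

    # eps guards against division by zero in completely dark regions (assumption).
    return np.arctan(v00 / (v01 + eps))
```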

2.1.3. Local Contrast Map

In this section, we define a function to measure the local contrast map based on the following three properties:
(1) The higher the degree to which a superpixel contrasts with its neighbors, the greater its saliency value.
(2) A superpixel has a much higher likelihood of being salient if it is closer to the image center.
(3) A region is more likely to belong to the background if it has a large number of pixels on the image boundary. Accordingly, we calculate the integrity of a region in consideration of the number of its pixels located on the boundary.

First, a given image is segmented into superpixels. Consider a superpixel together with its corresponding neighborhood of adjacent superpixels. The saliency of the superpixel is then defined in (5) as a weighted sum, over its neighbors, of the feature distance between the superpixel and each neighbor, where the feature distance combines the distance of LAB color histograms and the distance of Weber local descriptor (WLD) histograms [33]. The histogram distance is the Euclidean distance, and each neighbor’s contribution is weighted by the ratio between the area of that neighbor and the total area of the neighborhood.

An enclosing function guarantees that the output is positive, and we apply it to highlight the salient regions.

In (5), the coordinates denote the average position of the pixels in the superpixel, normalized to [0, 1]. Moreover, the weight involves a normalized distance between the superpixel center and the image center, in which the horizontal scale is set as 1/3 of the image width and the vertical scale is set as 1/3 of the image height. Therefore, a superpixel closer to the center of the image attains a greater weight.

We define the integrity in (5) by the ratio between the number of pixels of a superpixel lying on the image boundary and the total number of pixels on the image boundary, together with an adjustment parameter that tunes the strength of its impact and a threshold. A superpixel with a larger boundary ratio is unlikely to be an integral salient object. If a region contains no boundary pixels, its integrity equals 1; otherwise, the integrity takes a value within (0, 1).
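To make the above description concrete, the sketch below computes a simple superpixel-level contrast map in Python. It is only an illustrative approximation under stated assumptions: the exact weighting, center-prior, and integrity formulas of (5) are not reproduced; mean LAB colors stand in for the combined LAB/WLD histogram distances; and the exponential center weight and boundary-ratio integrity are plausible stand-ins.

```python
import numpy as np

def local_contrast_map(lab, labels, sigma_ratio=1/3.0):
    """Illustrative superpixel contrast: neighbor color distance weighted by
    neighbor area, a center prior, and a boundary-based integrity factor.
    Assumes labels take contiguous values 0..N-1 (e.g., SLIC with start_label=0)."""
    h, w = labels.shape
    ids = np.unique(labels)

    # Per-superpixel mean color, mean position, area, and boundary-pixel count.
    mean_color = np.array([lab[labels == i].mean(axis=0) for i in ids])
    ys, xs = np.mgrid[0:h, 0:w]
    cy = np.array([ys[labels == i].mean() / h for i in ids])
    cx = np.array([xs[labels == i].mean() / w for i in ids])
    area = np.array([(labels == i).sum() for i in ids], dtype=np.float64)
    border = np.zeros_like(labels, dtype=bool)
    border[0, :] = border[-1, :] = border[:, 0] = border[:, -1] = True
    b_cnt = np.array([(border & (labels == i)).sum() for i in ids], dtype=np.float64)

    # Adjacency: superpixels that touch horizontally or vertically.
    adj = {i: set() for i in ids}
    for a, b in zip(labels[:, :-1].ravel(), labels[:, 1:].ravel()):
        if a != b: adj[a].add(b); adj[b].add(a)
    for a, b in zip(labels[:-1, :].ravel(), labels[1:, :].ravel()):
        if a != b: adj[a].add(b); adj[b].add(a)

    sal = np.zeros(len(ids))
    for k, i in enumerate(ids):
        nbrs = list(adj[i])
        if not nbrs:
            continue
        w_area = area[nbrs] / area[nbrs].sum()                 # area-ratio weights
        dist = np.linalg.norm(mean_color[nbrs] - mean_color[k], axis=1)
        contrast = (w_area * dist).sum()
        center_w = np.exp(-((cx[k] - 0.5) ** 2 + (cy[k] - 0.5) ** 2) / (2 * sigma_ratio ** 2))
        integrity = 1.0 - b_cnt[k] / max(border.sum(), 1)      # fewer boundary pixels -> more integral
        sal[k] = contrast * center_w * integrity

    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)
    return sal[labels]  # map region values back to pixels
```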

2.2. Frequency Prior

Achanta et al. [23] argue that the mechanism by which the human visual system detects salient objects in a scene can be well modeled by integrating band-pass filtering responses from the color channels, and they achieve the desired band-pass behavior by means of the DoG (Difference of Gaussians) filter. Inspired by their work, in this paper we also apply band-pass filtering to salient object detection. However, in view of the following considerations, we choose the log-Gabor filter instead of DoG.

Firstly, Field [35] suggests that natural images are better coded by filters that have Gaussian transfer functions when viewed on a logarithmic frequency scale. On the linear frequency scale, the log-Gabor function has a transfer function of the form
$$G(\omega) = \exp\left( -\frac{(\log(\omega/\omega_0))^2}{2(\log(\sigma/\omega_0))^2} \right),$$
where $\omega_0$ is the filter’s center frequency and $\sigma$ controls the filter’s bandwidth. By definition, log-Gabor functions have no DC component, and a log-Gabor filter with an arbitrary bandwidth can be constructed.

Secondly, the transfer function of the log-Gabor filter has an extended tail at the high-frequency end, which enables it to encode images more efficiently than ordinary Gabor functions. The spatial-domain filter cannot be expressed analytically because of the singularity of the log function at the origin; instead, it can only be obtained approximately by performing a numerical inverse Fourier transform of $G(\omega)$. An illustration of a 2D log-Gabor filter in the frequency domain is shown in Figure 6.

Suppose an image is defined on a spatial domain in RGB color space, with each position holding a triple vector of three intensity values. Similar to [23], we obtain its frequency prior map by band-pass filtering as follows: first, the image is converted to an opponent color space, giving three resulting channels; then the frequency prior map is defined in terms of the log-Gabor-filtered responses of these three channels, where the filtering is carried out by convolution. For more details, refer to our previous work [36]. An example is shown in Figure 7.
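The following Python sketch builds a 2D log-Gabor filter in the frequency domain and applies it to the three channels of a Lab image. Combining the filtered channel magnitudes by a Euclidean norm is one common choice in frequency-prior formulations and is an assumption here, as are the parameter values.

```python
import numpy as np

def log_gabor_2d(rows, cols, omega0=0.1, sigma_over_omega0=0.55):
    """2D isotropic log-Gabor transfer function G(omega) on an FFT frequency grid."""
    u = np.fft.fftfreq(cols)
    v = np.fft.fftfreq(rows)
    uu, vv = np.meshgrid(u, v)
    radius = np.sqrt(uu ** 2 + vv ** 2)
    radius[0, 0] = 1.0                      # avoid log(0) at the DC position
    g = np.exp(-(np.log(radius / omega0) ** 2) /
               (2 * np.log(sigma_over_omega0) ** 2))
    g[0, 0] = 0.0                           # log-Gabor filters have no DC component
    return g

def frequency_prior(lab):
    """Band-pass each Lab channel with the log-Gabor filter and combine magnitudes."""
    rows, cols = lab.shape[:2]
    g = log_gabor_2d(rows, cols)
    responses = []
    for c in range(3):
        spectrum = np.fft.fft2(lab[:, :, c])
        responses.append(np.abs(np.fft.ifft2(spectrum * g)))
    fp = np.sqrt(sum(r ** 2 for r in responses))        # Euclidean norm across channels
    return (fp - fp.min()) / (fp.max() - fp.min() + 1e-12)
```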

2.3. Color Spatial Distribution

From Figure 8 we can observe that the more extensively one type of color is distributed across an image, the more likely this color belongs to the background, and the less likely a salient object contains it. In other words, a specific color with a smaller spatial variance is more likely to be part of the salient object. The goal of using the color spatial distribution is to take the global color information of the image into consideration.

One simple approach to characterize the spatial variance of colors is based on the Gaussian Mixture Model (GMM), one of the most commonly used statistical clustering algorithms [17].

Generally, each pixel $I_x$ is assigned to a color component $c$ of the GMM with probability
$$p(c \mid I_x) = \frac{w_c \, \mathcal{N}(I_x \mid \mu_c, \Sigma_c)}{\sum_{c} w_c \, \mathcal{N}(I_x \mid \mu_c, \Sigma_c)},$$
where $w_c$, $\mu_c$, and $\Sigma_c$ are the weight, mean color, and covariance of the component, respectively, and $\mathcal{N}(\cdot)$ is the multivariate normal distribution.

For each associated color component $c$, its horizontal variance of the spatial position can be computed by
$$V_h(c) = \frac{1}{|X|_c} \sum_{x} p(c \mid I_x)\, \bigl| x_h - M_h(c) \bigr|^2,$$
where $x_h$ is the $x$-coordinate of the pixel $x$, $|X|_c = \sum_{x} p(c \mid I_x)$ is a normalizing factor, and $M_h(c)$ is the horizontal mean of the Gaussian component:
$$M_h(c) = \frac{1}{|X|_c} \sum_{x} p(c \mid I_x)\, x_h.$$

Similarly, we can define the vertical variance $V_v(c)$ and thus derive the spatial variance of a component by combining the horizontal and vertical variances: $V(c) = V_h(c) + V_v(c)$.

Then a min–max approach is employed to normalize this composite variance to the range [0, 1].

Finally, the color spatial distribution feature map is defined for each pixel as a weighted sum over color components, based on the assumption that a salient object tends to have a less widely distributed color:
$$\mathrm{CSD}(x) \propto \sum_{c} p(c \mid I_x)\, \bigl(1 - V(c)\bigr).$$
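A minimal Python sketch of this color spatial distribution feature is given below, using scikit-learn's GaussianMixture as the clustering back end; the number of components and the omission of any extra weighting are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def color_spatial_distribution(img, n_components=5):
    """CSD(x) = sum_c p(c|I_x) * (1 - V(c)), with V(c) the normalized spatial
    variance of component c (horizontal plus vertical)."""
    h, w = img.shape[:2]
    colors = img.reshape(-1, 3).astype(np.float64)
    gmm = GaussianMixture(n_components=n_components, covariance_type='full',
                          random_state=0).fit(colors)
    p = gmm.predict_proba(colors)                      # p(c | I_x), shape (h*w, C)

    ys, xs = np.mgrid[0:h, 0:w]
    xs, ys = xs.ravel().astype(np.float64), ys.ravel().astype(np.float64)

    norm = p.sum(axis=0) + 1e-12                       # |X|_c
    mx, my = (p * xs[:, None]).sum(0) / norm, (p * ys[:, None]).sum(0) / norm
    vh = (p * (xs[:, None] - mx) ** 2).sum(0) / norm   # horizontal variance V_h(c)
    vv = (p * (ys[:, None] - my) ** 2).sum(0) / norm   # vertical variance V_v(c)
    v = vh + vv
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)    # min-max normalization

    csd = (p * (1.0 - v)).sum(axis=1).reshape(h, w)
    return (csd - csd.min()) / (csd.max() - csd.min() + 1e-12)
```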

The feature map is also normalized to fall within [0, 1]. Figure 8 shows several example images and their color spatial distribution feature maps, from which we can observe that the salient objects are highlighted effectively by this global feature.

Clearly, exploiting this global feature can detect salient objects with high accuracy when the image background is monotonous. However, in the case of a complex and colorful background, the color spatial distribution map fails to distinguish the salient object. Figure 9 illustrates this undesired property.

2.4. Map Integration

After generating these three maps, the local contrast map (LCM), the frequency prior map (FP), and the color spatial distribution map (CSD), we derive the coarse saliency map through their combination, in which all three maps are normalized to [0, 1]. LCM is computed from a local-region perspective, while FP and CSD are computed from a global perspective, so they are complementary to each other.
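The paper's exact combination rule is given by its fusion equation, which is not reproduced here; as a placeholder, the short sketch below shows one plausible way to fuse the three normalized maps (the local map gated by the averaged global priors), purely as an illustrative assumption.

```python
import numpy as np

def fuse_coarse(lcm, fp, csd):
    """Illustrative fusion of the three normalized maps into a coarse saliency map.
    The multiplicative form below is an assumption, not the paper's exact rule."""
    def norm(m):
        return (m - m.min()) / (m.max() - m.min() + 1e-12)
    lcm, fp, csd = norm(lcm), norm(fp), norm(csd)
    coarse = lcm * 0.5 * (fp + csd)     # local contrast gated by global priors
    return norm(coarse)
```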

3. Background Extraction Based on Boundary Label Propagation

3.1. Affinity Matrix Construction

In this section, we construct an affinity matrix for the subsequent background extraction based on boundary label propagation. As discussed in Section 2.1.1, the input image first undergoes texture suppression. Then, it is segmented into superpixels by the SLIC algorithm. We regard the generated superpixels as nodes or regions in this affinity matrix construction procedure. In addition, the regions on the image border form a boundary set.

Thereafter, we can define a superpixel-based graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of undirected edges weighted by an affinity matrix $W = [w_{ij}]$ that indicates the similarity between any pair of nodes. The degree matrix of the graph is defined as $D = \mathrm{diag}\{d_1, \ldots, d_n\}$, where $d_i = \sum_{j} w_{ij}$.

In this work, two adaptations are made when constructing the affinity matrix entries, by adding two constraints.

First, we connect each node to its 2-ring neighbors to fully employ the adjacency information of nearby nodes, because neighboring nodes are more likely to have a similar appearance and saliency (see Figure 10). That is, each node is connected not only to its immediate neighbors but also to the nodes sharing a common boundary with those neighbors. By extending the scope of node connection, we effectively exploit local smoothness cues.

Second, based on the background prior revealed in [26], most background regions can be easily connected to the image boundary, while salient objects rarely touch it. This cue is more general than the frequently used center-bias prior. Thus, we enforce that all superpixels lying on the four sides of the image are connected; hence, any pair of boundary nodes is handled as adjacent (see Figure 10). This reduces the geodesic distance between similar superpixels and further enhances the relationship among boundary nodes. Because all boundary nodes serve as propagation labels in Algorithm 1, a tight connection among them can more effectively distinguish the background from the foreground.

Input:
  A: row-normalized affinity matrix
  L: set of given boundary nodes' labels
  U: set of unlabelled nodes
Initialize: set f(i) = 1 for i in L and f(i) = 0 for i in U
while VAR > threshold do
  for each i in U do
    f(i) = sum_j a_ij * f(j)
  end for
  f(i) = 1 for all i in L
  VAR = var(f over the most recent iterations)
end while
BG = normalize(f)
BG(x) = map(BG)
Output:
  Regional background probability weight map BG(x) generated by the background labels

Using the definitions on the graph and the constraints on the edges, we define the affinity entry between node $i$ and a node $j$ as
$$w_{ij} = \begin{cases} \exp\left( -\dfrac{\lVert c_i - c_j \rVert}{\sigma^2} \right), & j \in N(i), \\ 0, & \text{otherwise}, \end{cases} \tag{18}$$
where $c_i$ and $c_j$ denote the mean feature vectors, in the CIELAB color space, of the pixels included in the superpixels corresponding to the two nodes and $\sigma$ is a constant controlling the strength of the weight. Furthermore, $N(i)$ indicates the set of neighboring nodes within the 2-ring range of node $i$ (including, for boundary nodes, all other boundary nodes). According to (18), we can additionally obtain a row-wise normalized color affinity matrix
$$A = D^{-1} W. \tag{19}$$
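A compact Python sketch of this graph construction is shown below; the 2-ring expansion, the boundary clique, and the value of sigma are written directly from the description above, but it should be treated as an illustrative sketch rather than the reference implementation.

```python
import numpy as np

def build_affinity(mean_lab, adjacency, boundary, sigma2=0.1):
    """Affinity matrix with 2-ring neighbors and fully connected boundary nodes,
    followed by row normalization (cf. Eqs. (18)-(19)).
    mean_lab : (N, 3) mean CIELAB color per superpixel
    adjacency: list of sets; adjacency[i] holds the immediate neighbors of node i
    boundary : iterable of node indices lying on the image border
    """
    n = len(mean_lab)
    # 2-ring neighborhood: neighbors plus neighbors of neighbors.
    two_ring = [set(adjacency[i]) for i in range(n)]
    for i in range(n):
        for j in adjacency[i]:
            two_ring[i] |= set(adjacency[j])
        two_ring[i].discard(i)

    # All boundary superpixels are treated as mutually adjacent.
    for i in boundary:
        two_ring[i] |= (set(boundary) - {i})

    W = np.zeros((n, n))
    for i in range(n):
        for j in two_ring[i]:
            W[i, j] = np.exp(-np.linalg.norm(mean_lab[i] - mean_lab[j]) / sigma2)
    W = np.maximum(W, W.T)                           # keep the graph undirected

    A = W / (W.sum(axis=1, keepdims=True) + 1e-12)   # row-normalized affinity
    return A
```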

3.2. Propagation via Boundary Labels

In this section, we introduce a label propagation algorithm that computes the background probability weights of the unlabeled superpixels by setting the boundary nodes as the initial propagation labels.

Given a dataset $X = \{x_1, \ldots, x_l, x_{l+1}, \ldots, x_n\} \subset \mathbb{R}^m$, where $m$ denotes the data dimension and the first $l$ data points are labeled queries, the remaining data points can be evaluated by propagating their similarity correlation with the queries. Let $f$ denote a mapping function that assigns a propagating value $f(x_i)$ to each point $x_i$, indicating the degree to which that point is similar to the labels; $f$ can be regarded as a vector $[f(x_1), \ldots, f(x_n)]^{T}$. The similarity measure satisfies
$$f_{t+1}(x_i) = \sum_{j} a_{ij}\, f_t(x_j),$$
where $t$ denotes the iteration number and $a_{ij}$ is the affinity value in (19).

Suppose that the nodes of our constructed graph constitute the given dataset. During the recursive calculation, the similarity indications of the query labels are fixed to 1, while the initial indications of the unlabeled nodes are set to 0. For a given node, the similarity value is learned by iteratively propagating the similarity measures of its neighbors; hence, its ultimate similarity to the labels is influenced by its surroundings.

Algorithm 1 outlines the entire procedure of label propagation via boundary nodes. Its output, the regional background probability weight map, forms the extracted background. The convergence condition checks whether the average variance of the similarity measure over the last 50 iterations (e.g., constant = 49) is lower than a given threshold. The regional saliency measures are converted into a pixel-wise map by map(·). This type of propagation process is similar to the proposal in [37]. Label propagation via boundary nodes works well in most cases, as shown in Figure 3(c).
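Algorithm 1 can be rendered in a few lines of Python as follows; this is a literal reading of the procedure as described above, and the convergence window, threshold, and iteration cap are illustrative assumptions.

```python
import numpy as np

def propagate_background(A, boundary, threshold=1e-6, window=50, max_iter=5000):
    """Iteratively propagate background labels (boundary nodes) through the
    row-normalized affinity matrix A, following the structure of Algorithm 1."""
    n = A.shape[0]
    f = np.zeros(n)
    f[boundary] = 1.0                      # boundary nodes are the labels

    history = []
    for _ in range(max_iter):
        f = A.dot(f)                       # f_{t+1}(i) = sum_j a_ij f_t(j)
        f[boundary] = 1.0                  # keep label values fixed at 1
        history.append(f.copy())
        if len(history) >= window:
            recent = np.stack(history[-window:])
            if recent.var(axis=0).mean() < threshold:   # average variance over the window
                break

    bg = (f - f.min()) / (f.max() - f.min() + 1e-12)    # BG = normalize(f)
    return bg   # use bg[labels] to obtain a pixel-wise map from superpixel labels
```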

4. Saliency Refinement

The extracted background map, BG, from Algorithm 1 and the coarse saliency map obtained in Section 2 are complementary and mutually reinforcing because they characterize the background and the foreground, respectively. Nevertheless, they are still coarse and noisy. Thus, we propose a principled framework to integrate them and generate the final refined saliency map, as depicted in Figure 2.

Generally speaking, after extracting multiple saliency cues according to different feature measures, a critical step is the strategy for combining these feature maps. So far, many kinds of frameworks have been proposed for this computation, but most of them are based on the hypothesis that similar features compete for saliency while different modalities contribute independently to saliency detection [38]. In fact, there is considerable psychological evidence that different feature types contribute differently to saliency, rather than being combined heuristically with a simple weighted summation or multiplication. Thus, many methods have been put forward to integrate different kinds of feature maps for visual saliency recognition. As mentioned in [39], four combination methods have been proposed: (1) simple normalized summation, (2) linear combination with learned weights, (3) global nonlinear normalization followed by summation, and (4) local nonlinear competition between salient locations followed by summation. In addition, [38], inspired by the role of sparsity in human perception and the sparse property of human visual neurons, defined a feature sparsity indicator that measures a feature's contribution to the saliency map. Reference [40] proposed AdaBoost as the central computational module, taking into account feature selection, weight assignment, and integration in a principled nonlinear learning framework. One of the approaches closest to our idea was presented in [17], in which a conditional random field is learned by defining an energy function to effectively combine multiple visual cues for salient object detection.

In this paper, the proposed framework combines the saliency cues derived from the foreground and the background and formulates saliency detection as an optimization problem over the saliency values of all superpixels in the given image. It then achieves an optimal saliency map by minimizing our designed objective cost function, which assigns a value of 1 to the object region and a value of 0 to the background region.

Our cost function consists of three different constraint terms and is defined as
$$\sum_{i=1}^{N} w_i^{bg}\, s_i^{2} + \sum_{i=1}^{N} w_i^{fg}\, (s_i - 1)^{2} + \sum_{i,j} w_{ij}\, (s_i - s_j)^{2}, \tag{21}$$
where $s_i$ ($i = 1, \ldots, N$) represents the final saliency value assigned to superpixel $i$ after minimizing the cost.

The first term prompts a superpixel with a high background probability $w_i^{bg}$ to obtain a small value close to 0. As mentioned in Section 3, $w_i^{bg}$ is highly accurate thanks to our effective background detection. Accordingly, in the second term, $w_i^{fg}$ denotes the foreground weight associated with superpixel $i$, and a large $w_i^{fg}$ pushes $s_i$ to take a value close to 1. After optimization, the foreground weight map is significantly improved on account of the proposed optimization framework.

The last term is a smoothness term that encourages continuous saliency values. For any pair of adjacent superpixels $i$ and $j$, the weight can be defined as
$$w_{ij} = \exp\left( -\frac{d^{2}(c_i, c_j)}{2\sigma_c^{2}} \right) + \mu,$$
which takes a large value in flat regions and a small value on region boundaries. To minimize the influence of small noise in both the background and foreground terms and to regularize the optimization in cluttered regions, we empirically introduce the parameter $\mu$ and set it to 0.1; $d(c_i, c_j)$ denotes the Euclidean distance between the average colors of the two superpixels in the CIELAB color system, and $\sigma_c$ is a scaling parameter.

Equation (21) consists of three squared error terms. The optimization problem of the saliency map can be solved by using the least squares method.
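Since (21) is quadratic in the saliency vector, setting its gradient to zero yields a sparse linear system. The Python sketch below solves it with dense algebra for clarity; the variable names are illustrative, and it assumes the pairwise smoothness sum runs over unordered adjacent pairs.

```python
import numpy as np

def refine_saliency(w_bg, w_fg, w_smooth):
    """Minimize sum_i w_bg_i s_i^2 + sum_i w_fg_i (s_i - 1)^2
       + sum_{i,j} w_smooth_ij (s_i - s_j)^2 in closed form.
    w_bg, w_fg : (N,) background / foreground weights per superpixel
    w_smooth   : (N, N) symmetric adjacency weights (zero for non-adjacent pairs)
    """
    # Graph Laplacian of the smoothness weights: s^T L s equals the pairwise
    # smoothness term when the sum runs over unordered adjacent pairs (assumption).
    L = np.diag(w_smooth.sum(axis=1)) - w_smooth
    # Setting the gradient to zero gives (diag(w_bg) + diag(w_fg) + L) s = w_fg.
    A = np.diag(w_bg) + np.diag(w_fg) + L
    s = np.linalg.solve(A, w_fg)
    return np.clip(s, 0.0, 1.0)
```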

5. Experiments

In this section, we compare our proposed method with 8 state-of-the-art salient object detection schemes on four typical publicly available datasets.

5.1. Datasets and Evaluation Metrics

Datasets. The four benchmark datasets used for evaluation are the commonly used MSRA10K [17] (10000 images), Judd [41] (900 images), SED2 [42] (a two-object set containing 100 images), and ECSSD [43] (1000 images with textured backgrounds). MSRA10K provides per-pixel ground truth annotation for 10000 MSRA images, each of which has an unambiguous salient object accurately annotated with pixel-wise ground truth labeling. ECSSD includes 1000 semantically meaningful but structurally complex images. Judd contains 900 images with multiple objects and high background clutter. SED2 is one of the two subsets of SED and has 100 images, each containing two salient objects; we employ this dataset to verify whether saliency detection algorithms can work well on images containing two salient objects.

We compare our proposed method with 8 state-of-the-art salient region detection methods: GMR [44], LMLC [45], SVO [46], HS [43], CA [22], PCA [47], CB [48], and GR [49].

Evaluation Metrics. We use precision, recall, F-measure, and mean absolute error (MAE) for evaluation and comparison. These metrics are defined as follows:
$$\mathrm{Precision} = \frac{|M \cap GT|}{|M|}, \qquad \mathrm{Recall} = \frac{|M \cap GT|}{|GT|}, \tag{23}$$
$$F_\beta = \frac{(1 + \beta^2) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^2 \times \mathrm{Precision} + \mathrm{Recall}}, \tag{24}$$
$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \bigl| \mathrm{Sal}(x, y) - GT(x, y) \bigr|. \tag{25}$$

In the above formulas, GT is the ground truth map, M denotes a binary mask generated by thresholding the obtained saliency map Sal with a threshold value, and $|\cdot|$ in (23) denotes the area (i.e., pixel number) of the corresponding image region. Moreover, to account for both precision and recall, we evaluate the F-measure as in (24), where the parameter $\beta^2$ is usually set to 0.3 to give more importance to precision. $\mathrm{Sal}(x, y)$ in (25) is the saliency value at pixel location $(x, y)$, and $W$ and $H$ are the width and height of the saliency map and ground truth map, respectively.
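For reference, a straightforward Python implementation of these metrics might look as follows (binary masks as boolean arrays, saliency maps normalized to [0, 1]); it is a sketch under those assumptions.

```python
import numpy as np

def precision_recall(mask, gt):
    """Precision and recall of a binary mask against a binary ground truth."""
    inter = np.logical_and(mask, gt).sum()
    precision = inter / (mask.sum() + 1e-12)
    recall = inter / (gt.sum() + 1e-12)
    return precision, recall

def f_measure(precision, recall, beta2=0.3):
    """F-measure with beta^2 = 0.3, emphasizing precision."""
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-12)

def mae(sal, gt):
    """Mean absolute error between a continuous map and the binary ground truth."""
    return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()
```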

In theory, a saliency map should agree with the ground truth map. Thus, the more similar the saliency map is to the ground truth, the more effective the generation of the saliency map algorithm is. Nevertheless, the precision-recall curves have limitations in this performance measure. We therefore additionally introduce MAE as a complementary metric. It is a normalized value denoting the average per-pixel errors between the continuous saliency map, Sal, and the binary ground truth map, GT. MAE measures how consistent a saliency map is with the ground truth map. A good saliency detection method should achieve high precision-recall curves while maintaining low MAE.

Parameters Setup. In the experiments, all parameters are determined as follows: the number of superpixels in the segmentation is set as 600; the filter's center frequency and the bandwidth-controlling parameter appear in Section 2.2; and the constant controlling the color distance in (18) is also fixed. This parameter configuration achieves the desired performance in our experiments.

5.2. Quantitative Comparisons

Here we use three standard measures—precision-recall (P-R) curve, F-measure, and MAE for evaluating saliency maps produced by different models on four datasets and making comparisons.

Fixed Threshold. For each method, we first binarize the generated saliency map using a fixed threshold varied from 0 to 255. At each threshold, a precision/recall pair is computed by comparing the binarized map with the ground truth. These pairs are then combined into PR curves that evaluate model performance across operating points, as Figures 11(a), 12(a), 13(a), and 14(a) show. As we can see, compared with all the other approaches, our method shows the most consistent performance and ranks among the top three on all four datasets. Its performance is comparable to outstanding mainstream methods such as GMR and HS. LMLC and GR behave unstably across the four datasets: they perform well on the MSRA10K dataset but give unsatisfactory results on the other three datasets due to their weak background suppression ability. Our method significantly outperforms CB, CA, and LMLC on ECSSD, Judd, and SED2. Moreover, judging from the PR curves, our method achieves performance close to GMR and HS on the MSRA10K and ECSSD datasets and outperforms GMR on the SED2 dataset.

Adaptive Threshold. In addition, adaptive threshold experiments were conducted to compute the F-measure, where the binarizing threshold is defined as twice the mean value of the saliency map Sal:
$$T_a = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \mathrm{Sal}(x, y).$$

Results are shown in Figures 11(b), 12(b), 13(b), and 14(b). Our method achieves the highest precision and the second-best F-measure on the two largest datasets, MSRA10K and ECSSD, as well as on the most challenging dataset, Judd, which further proves its effectiveness, since precision is the ratio of the correctly detected area to the whole detected area. On SED2, our method obtains the fourth-best precision and F-measure. For SED2, whose images contain two objects in more complex backgrounds, the performance of our method is close to GMR. A major factor affecting the performance of the GMR algorithm is that SED2 has many images whose labeled objects touch the boundaries, violating the boundary prior used in GMR. In such cases, it may be better to carry out a rough detection using only contrast rather than relying on the boundary prior knowledge. This also explains why SVO performs inferiorly on the other datasets yet achieves relatively good performance on SED2.

MAE. Since precision-recall curves do not reflect how uniformly an entire object is highlighted, we also compute MAE values. As Figures 11(c), 12(c), 13(c), and 14(c) show, our method consistently produces small errors on all four datasets: it achieves the lowest error on ECSSD and the second-lowest error on MSRA10K, Judd, and SED2, indicating a more robust performance across different datasets. In addition, despite its relatively good precision-recall and F-measure performance, SVO has the largest MAE due to its weak background suppression.

Summary. Although our method does not rank first in every experiment on all four datasets, it delivers consistently excellent and robust performance compared with the other eight approaches.

5.3. Qualitative Comparisons

In addition to the quantitative comparison, we also provide visual comparisons of the different methods in Figures 15 and 16, from which we can see that our method generates the best detection results even in challenging cases, such as extremely cluttered backgrounds or images with complex texture. For example, in rows 3, 7, and 10 of Figure 15, our approach generates reasonable saliency maps and more successfully highlights the entire salient object, while the other methods may be distracted and influenced by the cluttered background textures.

Comparatively speaking, we can clearly observe the distinct behavior of different types of models; a good detection approach tries to highlight the whole salient object while suppressing the background. Some models that incorporate a center-bias component also produce appealing results, for example, CB. However, this method fails when the objects are located far from the image center. Interestingly, compared with pixel-based or patch-based approaches, the region-based models, for example, CB, HS, GMR, GR, and ours, always preserve object boundaries well. PCA detects the outlines of salient objects while missing their interiors, as it relies mostly on patterns. Although LMLC and SVO detect the salient regions, parts of the background are mistakenly detected as salient. HS tends to miss small targets, as in the first row of Figure 16.

Another side effect of some methods is that useful object parts may be lost if a thresholding operation is involved. This happens occasionally on the SED2 dataset (e.g., the 11th row in Figure 16): one of the two objects in an image can be missed after thresholding, since the two objects probably have different saliency levels. In contrast, such misjudgments are reduced in our method because no threshold is used to binarize the saliency map produced in intermediate steps, as is done in GMR.

5.4. Computation Cost Analysis

Table 1 shows the average execution time for processing one image over all 10K images of MSRA10K (typical 400 × 300 image resolution). Experiments are conducted on an Intel Xeon E5645 2.40 GHz machine equipped with 8 GB RAM. LMLC takes the longest time because of its Laplacian sparse subspace clustering process. SVO also takes a long time because it introduces an objectness measure for each single image. The most time-consuming step of our method is the coarse salient region extraction, which takes about 3.57 s (51%).

6. Conclusions

In this work, we proposed a new approach that first comprehensively utilizes boundary prior information together with local and global information to extract a variety of saliency features from an image and construct a coarse saliency map, then obtains a background map by means of boundary label propagation, and finally integrates these two maps into a unified optimization framework to acquire a refined saliency detection result. We validated and compared our method on four popular databases against 8 mainstream approaches. The experimental results demonstrate that our method exhibits refined and consistently excellent saliency detection.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is supported by the National Science and Technology Support Project under Grant 2015IM030300.