Abstract

We introduce a novel approach to detecting salient regions of an image via feature combination and a discriminative classifier. Our method, which is based on hierarchical image abstraction, uses logistic regression to map a regional feature vector to a saliency score. Four saliency cues are used as atomic features of the segmented regions in the image: color contrast in a global context, center-boundary priors, spatially compact color distribution, and objectness. By mapping the four-dimensional regional feature to a fifteen-dimensional feature vector, we can linearly separate salient regions from the cluttered background by finding an optimal linear combination of feature coefficients in the fifteen-dimensional feature space, and we finally fuse the saliency maps across multiple levels. Furthermore, we introduce the weighted salient image center into our saliency analysis. Extensive experiments on two large benchmark datasets show that the proposed approach outperforms several state-of-the-art approaches.

1. Introduction

Humans have the ability to locate the most interesting region in a cluttered visual scene through selective visual attention. Simulating this ability is a long-standing task of computer vision, and the related research has been carried out for many years. The study of human visual systems suggests that saliency is related to the rarity, uniqueness, and surprise of a scene. Saliency detection has recently gained much attention [1–23], as it has been brought into various applications, including image classification [24], object recognition [25], and content-aware image editing [26].

Existing salient region detection methods can be roughly classified into two categories: bottom-up, data-driven approaches and top-down, task-driven approaches. Bottom-up methods utilize low-level image features, such as color, intensity, and texture, to determine the contrast of image regions to their surroundings, while top-down methods make use of high-level knowledge about “interesting” objects. Most bottom-up models can be further divided into local and global schemes.

Inspired by the early work of Treisman and Gelade [27] and Koch and Ullman [28], Itti et al. [1] propose a highly influential, biologically plausible saliency analysis method that defines image saliency using local center-surround operators across multiscale image features, including intensity, color, and orientation. Harel et al. [4] propose a method that generates a saliency map by nonlinearly combining local uniqueness maps from different feature channels. Ma and Zhang [29] propose a novel approach that directly computes the center-surround color difference in a fixed neighborhood for each pixel and then uses a fuzzy growth model to extract the salient region; they classify saliency into three levels: attended views, attended areas, and attended points. Liu et al. [19] propose a set of novel features, including center-surround histogram, multiscale contrast, and color spatial distribution, which are unified in a CRF learning framework to detect salient regions in images.

Later on, many saliency models were proposed that exploit various types of image features in a global scope for saliency detection. Hou and Zhang [3] propose a spectral residual method that relies on frequency domain processing. Zhai and Shah [2] define pixel-level saliency based on a pixel's contrast to all other pixels and introduce the color histogram to improve computational efficiency. Achanta et al. [5] propose a frequency-tuned method that achieves globally consistent results by defining saliency as the distance between a pixel's color and the mean image color. Cheng et al. [6] also utilize the color histogram and segmented regions to analyze image saliency, which enables the assignment of comparable saliency values across similar image regions.

High-level priors have been used to analyze image saliency in recent years. Judd et al. [30] train an SVM model using a combination of low-, middle-, and high-level image features, making their approach potentially suitable for specific high-level computer vision tasks; the concept of a center prior is considered in their approach. Shen and Wu [8] unify three higher-level priors, namely, location, semantic, and color priors, in a low-rank matrix recovery framework. A shape prior is proposed by Jiang et al. [7], and concavity context is utilized in [31]. Wei et al. [32] turn to background priors to analyze image saliency, assuming that the image boundary is mostly background. Subsequently, many recent approaches use the boundary prior to guide saliency detection, such as GMR [11], SO [17], PDE [33], AMC [15], and DSR [34], and these methods obtain state-of-the-art performance on several publicly available datasets.

Recent studies indicate that a single saliency cue is far from comprehensive. Some methods, such as LC [2], FT [5], and HC [6], only use the contrast cue, and the generated saliency maps are disappointing; the contrast cue sometimes produces high saliency values for background regions, especially regions with complex structures. To alleviate these problems, approaches such as SF [18], PD [9], GC [12], PISA [21], PR [22], UFO [14], and HI [10] use multiple cues. Perazzi et al. [18] formulate saliency estimation using high-dimensional Gaussian filters, by which region color and region position are, respectively, exploited to measure region uniqueness and distribution. Cheng et al. [12] and Tong et al. [22] also consider the color contrast cue and the color distribution cue when computing the saliency map. Margolin et al. [9] combine pattern distinctness, color uniqueness, and organization priors to generate the saliency result. Shi et al. [21] present a generic framework for saliency detection employing three terms: a color-based contrast term, a structure-based contrast term, and spatial priors. Jiang et al. [14] propose a novel algorithm that integrates three saliency cues, namely, uniqueness, focusness, and objectness. Yan et al. [10] propose a multilayer approach to analyze image saliency: for the single-layer saliency cue, they exploit local contrast and a location heuristic, and a hierarchical inference framework then generates the final saliency map. The above-mentioned algorithms compute saliency maps from various cues and heuristically combine them to get the final results.

These methods can generate ideal saliency maps when dealing with simple images. On images with complex backgrounds, however, some methods such as [9, 12, 18] can only highlight part of the salient object, and although methods such as [10, 14, 22] can highlight the entire object uniformly, the background may be highlighted too. Thus, to differentiate real salient regions from high-contrast parts, more saliency cues, including low-level features and high-level priors, need to be integrated. To the best of our knowledge, there are few works that model the interaction between different saliency cues. Inspired by the work of [10, 23], we propose a feature combination strategy that can capture the interaction between different cues. Our main contributions lie in three aspects. Firstly, we introduce feature combination to model the interaction between different cues, which differs from most existing methods that generate saliency maps heuristically from various cues. Secondly, we formulate saliency estimation as a classification problem and learn a logistic classifier that directly maps a fifteen-dimensional feature vector to a saliency value. Thirdly, the use of smoothing and the weighted salient image center further improves the detection performance. The experimental results show that our method can generate reasonable saliency maps even when the image contains a complex background and the salient object has a color similar to the background.

The framework of the approach is presented in Figure 1. Our approach includes four main parts. The first is hierarchical image abstraction, which segments the image into homogeneous regions across several layers using efficient graph-based image segmentation [35]. Second, four saliency cues, namely, color contrast in a global context, center-boundary priors, spatially compact color distribution, and objectness, are used as atomic regional features; we then map the four-dimensional regional feature to a fifteen-dimensional feature vector that can capture the interaction between different features. Third, a logistic regression classifier is trained to map a fifteen-dimensional feature vector to a saliency value. Finally, we combine the saliency maps at different layers to obtain our final saliency map. Figure 2 shows samples of saliency maps generated by state-of-the-art methods and by ours.

The remainder of this paper is organized as follows. The proposed model is introduced in Section 2. Section 3 presents experiments and results. This paper is summarized in Section 4.

2. The Proposed Approach

Our method can be divided into four main stages: hierarchical image abstraction, regional feature generation, training a logistic regression classifier, and multilayer saliency map integration and reinforcement. In the following, we describe the details of the proposed approach.

2.1. Hierarchical Image Abstraction

Given an image $I$, hierarchical image abstraction can be described as $\{A^1, A^2, \ldots, A^M\}$, where $M$ is the number of layers of the image pyramid and $A^m$ is the abstraction result of the $m$th layer, comprising $n_m$ regions. We use image-level hierarchical image abstraction, which is different from the regional-level image abstraction of [10, 23], as is shown in Figure 1(b). The result of segmentation in the first layer of the image is described as $A^1 = \{R^1_1, R^1_2, \ldots, R^1_{n_1}\}$, and the segmentation results of the other layers can be described in a similar way. Each superpixel is represented by a mean color $c_i$ (in CIELab) and a spatial position $p_i$ ($x$-coordinate and $y$-coordinate), which are defined as $c_i = \frac{1}{|R_i|} \sum_{k \in R_i} c_k$ and $p_i = \frac{1}{|R_i|} \sum_{k \in R_i} p_k$, where $k$ stands for a pixel in the region $R_i$, $c_k$ represents the color vector of pixel $k$, $p_k$ represents the coordinate vector of pixel $k$, and $|R_i|$ is the number of pixels in the segmented region $R_i$.
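For concreteness, the following is a minimal sketch of the per-region statistics described above, assuming a precomputed label map (e.g., from a graph-based segmentation); the function name and normalized-coordinate convention are illustrative, not prescribed by the paper.

```python
import numpy as np

def region_statistics(lab_image, labels):
    """Return the mean CIELab color and mean (x, y) position of each region.

    lab_image : (H, W, 3) float array, image in CIELab color space.
    labels    : (H, W) int array, region index of every pixel (0..n-1).
    """
    h, w = labels.shape
    n_regions = labels.max() + 1
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalized coordinates so spatial terms are independent of image size.
    coords = np.stack([xs / w, ys / h], axis=-1).reshape(-1, 2)
    colors = lab_image.reshape(-1, 3)
    flat = labels.ravel()

    counts = np.bincount(flat, minlength=n_regions).astype(float)  # |R_i|
    mean_color = np.stack(
        [np.bincount(flat, weights=colors[:, c], minlength=n_regions)
         for c in range(3)], axis=1) / counts[:, None]             # c_i
    mean_pos = np.stack(
        [np.bincount(flat, weights=coords[:, c], minlength=n_regions)
         for c in range(2)], axis=1) / counts[:, None]             # p_i
    return mean_color, mean_pos, counts
```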

2.2. Regional Feature Generation

(1) Color Contrast Cue. Given the result of segmentation in the $m$th layer of the image pyramid, described as $A^m = \{R^m_1, R^m_2, \ldots, R^m_{n_m}\}$, the color contrast of a region $R_i$ can be formulated as follows:
$$f_1(R_i) = \sum_{j \neq i} w(R_i, R_j)\, d_c(R_i, R_j),$$
where $w(R_i, R_j)$ is the smooth term which considers the spatial distance between the two regions and $d_c(R_i, R_j)$ is the color distance between $R_i$ and $R_j$.
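A minimal sketch of this cue is given below, assuming the Gaussian falloff over region distance that is commonly used for such smooth terms; the value of sigma_s is a free parameter here, not one taken from the paper.

```python
import numpy as np

def color_contrast(mean_color, mean_pos, sigma_s=0.25):
    """Global color contrast f1 for every region."""
    # Pairwise color distances d_c(R_i, R_j) and spatial distances.
    dc = np.linalg.norm(mean_color[:, None, :] - mean_color[None, :, :], axis=-1)
    ds = np.linalg.norm(mean_pos[:, None, :] - mean_pos[None, :, :], axis=-1)
    w = np.exp(-(ds ** 2) / (sigma_s ** 2))  # smooth term over region distance
    np.fill_diagonal(w, 0.0)                 # exclude the region itself
    return (w * dc).sum(axis=1)
```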

(2) Color Distribution Cue. Inspired by Liu et al. [19], we use the nonoverlapping regions as the computing units for the region color distribution. First, all region colors are represented by a Gaussian Mixture Model (GMM) $\{w_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, where $w_k$ is the weight, $\mu_k$ the mean color, and $\Sigma_k$ the covariance matrix of the $k$th Gaussian component. The probability of a region $R_i$ belonging to the $k$th component is given by
$$p(k \mid R_i) = \frac{w_k\, \mathcal{N}(c_i \mid \mu_k, \Sigma_k)}{\sum_{l=1}^{K} w_l\, \mathcal{N}(c_i \mid \mu_l, \Sigma_l)}.$$
The number of Gaussian components $K$ is set to 5 in the subsequent experiments. We exploit the $k$-means algorithm to initialize the parameters of the GMM and the EM algorithm to train it. Referring to [19], the horizontal spatial variance of the $k$th clustered component of the GMM is defined as
$$V_h(k) = \frac{1}{|X_k|} \sum_{i} p(k \mid R_i)\, \bigl(x_i - M_h(k)\bigr)^2,$$
where $|X_k| = \sum_i p(k \mid R_i)$, $M_h(k) = \frac{1}{|X_k|} \sum_i p(k \mid R_i)\, x_i$, and $x_i$ is the $x$-coordinate of $p_i$. The vertical spatial variance can be defined in the same way. Different from Liu et al. [19], who use both variances to compute the saliency cue, we only use the horizontal spatial variance. The color distribution of region $R_i$ can then be defined as
$$f_2(R_i) = \sum_{k=1}^{K} p(k \mid R_i)\, \bigl(1 - V_h(k)\bigr).$$
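The following sketch uses scikit-learn's GaussianMixture (whose default k-means initialization matches the text, with EM fitting); normalizing the variance to [0, 1] before the final sum is an assumption made to keep the cue bounded.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def color_distribution(mean_color, mean_pos, n_components=5):
    """Spatially compact color distribution cue f2 for every region."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='full',
                          init_params='kmeans', random_state=0)
    gmm.fit(mean_color)
    p = gmm.predict_proba(mean_color)           # p[i, k] = p(k | R_i)
    x = mean_pos[:, 0]                          # horizontal coordinates x_i

    weight = p.sum(axis=0)                      # |X_k|
    mh = (p * x[:, None]).sum(axis=0) / weight  # M_h(k): weighted mean x
    vh = (p * (x[:, None] - mh) ** 2).sum(axis=0) / weight  # V_h(k)
    vh = (vh - vh.min()) / (vh.max() - vh.min() + 1e-12)    # normalize (assumed)
    return (p * (1.0 - vh)).sum(axis=1)         # compact components score high
```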

(3) Center-Boundary Prior Cue. Location is an important factor in saliency detection, and the center and boundary are two priors widely used in previous saliency detection methods. Considering both priors, our center-boundary heuristic is defined as
$$f_3(R_i) = C(R_i)\, B(R_i),$$
where $C(R_i)$ is the center prior term, which measures the distance between the region and the image center and is defined as $C(R_i) = \exp(-\|p_i - p_c\|^2 / \sigma_c^2)$, where $p_c$ is the center of the image, set to $(0.5, 0.5)$ in normalized coordinates, and the parameter $\sigma_c$ controls the sensitivity of the center prior and is set to 1 in the experiments. $B(R_i)$ is the boundary prior term, which measures the color distance between the region and the image boundary. Inspired by the approach proposed by Yang et al. [11], we define the background feature of region $R_i$ with respect to the top boundary as
$$B_{\mathrm{top}}(R_i) = \frac{D_{\mathrm{top}}(R_i)}{n_{\mathrm{top}}},$$
where $D_{\mathrm{top}}(R_i) = \sum_{R_j \in \mathcal{T}} d_c(R_i, R_j)$ is the sum of color distances from region $R_i$ to the regions touching the top boundary of the image (which differs from Yang et al. [11]) and $n_{\mathrm{top}} = |\mathcal{T}|$ is the number of regions that intersect the top image boundary. $B_{\mathrm{bottom}}$, $B_{\mathrm{left}}$, and $B_{\mathrm{right}}$ can be computed in a similar way, and the boundary term $B(R_i)$ combines the four.
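A sketch of this cue follows, using the Gaussian center prior above (center $(0.5, 0.5)$, $\sigma_c = 1$) and the mean color distance to the regions touching each image boundary; fusing the four boundary terms by a product is an assumption, as the extracted text does not state the combination.

```python
import numpy as np

def center_boundary(mean_color, mean_pos, boundary_sets, sigma_c=1.0):
    """Center-boundary cue f3. boundary_sets: dict with keys 'top', 'bottom',
    'left', 'right', each a list of region indices touching that boundary."""
    center = np.array([0.5, 0.5])
    c_term = np.exp(-np.sum((mean_pos - center) ** 2, axis=1) / sigma_c ** 2)

    b_term = np.ones(len(mean_color))
    for side in ('top', 'bottom', 'left', 'right'):
        idx = np.asarray(boundary_sets[side])
        # Mean color distance from every region to this boundary's regions.
        d = np.linalg.norm(mean_color[:, None, :] - mean_color[None, idx, :],
                           axis=-1).mean(axis=1)
        b_term *= d                    # product fusion of the four sides (assumed)
    return c_term * b_term
```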

(4) Objectness Cue. Recently, a generic objectness measure has been proposed to quantify how likely an image window is to contain an object of any class; the measure is based on low-level image cues. As our goal is to obtain a saliency map for the whole image, we first transfer the objectness values from the bounding boxes to the pixel level and then obtain a region-level objectness measure; for more details, please refer to UFO [14]. For each region, we compute its region-level objectness as
$$f_4(R_i) = \frac{1}{|R_i|} \sum_{k \in R_i} O(k),$$
where $O(k)$ is the objectness value of pixel $k$.
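The sketch below illustrates the box-to-pixel-to-region transfer under a simple voting assumption: each sampled window adds its score to the pixels it covers, and the accumulated map is averaged per region. The window sampling itself (e.g., the objectness measure used by UFO [14]) is outside this sketch.

```python
import numpy as np

def region_objectness(boxes, scores, labels):
    """Region-level objectness f4.
    boxes: (N, 4) int array of (x0, y0, x1, y1); scores: (N,) window scores;
    labels: (H, W) int array of region indices."""
    h, w = labels.shape
    pixel_obj = np.zeros((h, w))
    for (x0, y0, x1, y1), s in zip(boxes, scores):
        pixel_obj[y0:y1, x0:x1] += s          # pixel-level accumulation O(k)
    pixel_obj /= pixel_obj.max() + 1e-12
    counts = np.bincount(labels.ravel()).astype(float)
    sums = np.bincount(labels.ravel(), weights=pixel_obj.ravel())
    return sums / counts                      # mean objectness per region
```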

(5) Cue Smoothing. We thus obtain four saliency cues, which are normalized to the range $[0, 1]$ using minimum-maximum normalization. Although the four cues can be computed efficiently, at least two problems remain: firstly, some regions with similar properties may receive very different saliency values, and secondly, some adjacent regions may be assigned very different saliency values. To reduce the noisy saliency results caused by these issues, we use two smoothing procedures to refine the saliency value of each region.

$K$-Means Clustering Based Smoothing. Given the result of segmentation in the $m$th layer of the image pyramid, described as $A^m = \{R^m_1, \ldots, R^m_{n_m}\}$, we first exploit the $k$-means clustering algorithm to divide the segmented regions into different clusters in each layer. Referring to [36], we define an objective function, sometimes called a distortion measure, given by
$$J = \sum_{i=1}^{n_m} \sum_{k=1}^{K} r_{ik}\, \|c_i - \mu_k\|^2,$$
which we can easily minimize with respect to $\mu_k$ to give
$$\mu_k = \frac{\sum_i r_{ik}\, c_i}{\sum_i r_{ik}}.$$
Here $r_{ik} \in \{0, 1\}$ are binary indicator variables; if a region $R_i$ is assigned to cluster $k$, then $r_{ik} = 1$ and $r_{ij} = 0$ for $j \neq k$. This is known as the 1-of-$K$ coding scheme. The two phases of reassigning data points to clusters and recomputing the cluster means are repeated in turn until there is no further change in the assignments. We then obtain the number of regions $n_k$ in each cluster $C_k$ and replace the saliency value of each region by the weighted average of the saliency values of the regions in the same cluster (weighted by color distance). The saliency value of each region is refined by
$$f_{\mathrm{tmp}}(R_i) \leftarrow \frac{1}{(n_k - 1)\, T_i} \sum_{R_j \in C_k,\, j \neq i} \bigl(T_i - d_c(R_i, R_j)\bigr)\, f_{\mathrm{tmp}}(R_j),$$
where tmp can be replaced by 1, 2, 3, or 4, $T_i = \sum_{R_j \in C_k,\, j \neq i} d_c(R_i, R_j)$ is the sum of color distances between region $R_i$ and the other regions in cluster $C_k$, and $d_c(R_i, R_j)$ is the color distance between regions $R_i$ and $R_j$. The number of clusters controls the strength of the color-space smoothing; its value is fixed empirically in our experiments.
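A sketch of this smoothing step, following the refinement formula as reconstructed above (a within-cluster weighted average whose weights decrease with color distance); the cluster count is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_smooth(cue, mean_color, n_clusters=8):
    """Refine one cue by weighted averaging inside each color cluster."""
    assign = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(mean_color)
    refined = cue.copy()
    for k in range(n_clusters):
        idx = np.flatnonzero(assign == k)
        if len(idx) < 2:
            continue
        d = np.linalg.norm(mean_color[idx, None, :] - mean_color[None, idx, :],
                           axis=-1)             # pairwise d_c inside cluster
        for a, i in enumerate(idx):
            t = d[a].sum()                      # T_i: total distance in cluster
            wgt = t - d[a]                      # closer regions weigh more
            wgt[a] = 0.0                        # exclude the region itself
            refined[i] = (wgt * cue[idx]).sum() / ((len(idx) - 1) * t + 1e-12)
    return refined
```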

Spatial Based Smoothing. We assume that two adjacent regions are likely to have similar saliency values. Therefore, we use a spatially based procedure to refine saliency between adjacent regions, which is very similar to the color-space smoothing: we replace the saliency value of each region by the weighted average of the saliency values of its neighbors.

(6) Regional Feature. After completing the above steps, we have four atomic features ($f_1$, $f_2$, $f_3$, and $f_4$) for each segmented region: color contrast, center-boundary prior, color distribution, and objectness. In order to capture the interaction between the four features, a novel feature vector is generated by mapping the four-dimensional regional feature to a fifteen-dimensional feature vector. There are four kinds of combinations: single terms, double terms, triple terms, and a quadruple term. For the single terms, we use the four atomic features themselves ($x_1$–$x_4$ in the vector). For the double terms, there are six elements, the products of any two atomic features ($x_5$–$x_{10}$): $f_1 f_2$, $f_1 f_3$, $f_1 f_4$, $f_2 f_3$, $f_2 f_4$, and $f_3 f_4$. For the triple terms, the products of any three atomic features form $x_{11}$–$x_{14}$: $f_1 f_2 f_3$, $f_1 f_2 f_4$, $f_1 f_3 f_4$, and $f_2 f_3 f_4$. For the quadruple term, there is only one element, the product of the four atomic features: $x_{15} = f_1 f_2 f_3 f_4$. Finally, we obtain a fifteen-dimensional feature vector for each region.
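This mapping is compactly expressed in code, assuming (as in the reconstruction above) that each combined term is the product of the selected atomic cues:

```python
import numpy as np
from itertools import combinations

def combine_features(f):
    """Map atomic cues to the 15-dimensional feature vector.

    f: (n_regions, 4) array of the atomic cues (f1, f2, f3, f4).
    Returns an (n_regions, 15) array: 4 singles + 6 pairs + 4 triples + 1 quad.
    """
    cols = []
    for order in (1, 2, 3, 4):
        for idx in combinations(range(4), order):
            cols.append(np.prod(f[:, list(idx)], axis=1))
    return np.stack(cols, axis=1)
```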

2.3. Learning Framework for Saliency Estimation

The logistic function is useful because it can take an input with any value from negative to positive infinity, whereas its output always lies between zero and one [37]. We take full advantage of this property, so our saliency estimation can be formulated in a probabilistic framework. Let us assume that
$$P(y = 1 \mid x; \theta) = h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}, \quad (10)$$
where $h_\theta(x)$ is our hypothesis and $g(z) = 1/(1 + e^{-z})$ is called the logistic function or the sigmoid function. Notice that $g(z)$ tends towards 1 as $z \to +\infty$ and towards 0 as $z \to -\infty$. Hence, our hypothesis is always bounded between 0 and 1, and a higher value indicates that the region is more likely to belong to a salient object.

The parameter $\theta$ is what we want to learn from the data. We use the first layer of the image pyramid for training; the result of segmentation in the first layer is described as $A^1 = \{R^1_1, \ldots, R^1_{n_1}\}$. A segmented region is considered positive if the fraction of its pixels belonging to the salient object exceeds a fixed upper threshold, and its saliency value is set to 1. Conversely, a region is considered negative if this fraction falls below a fixed lower threshold, and its saliency value is set to 0. As aforementioned, each segmented region is described by a fifteen-dimensional vector $x$. We learn a logistic regression classifier from the training data $x$ and the saliency values $y \in \{0, 1\}$. Once the parameter $\theta$ is obtained, we can quickly perform saliency estimation using (10).
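A minimal sketch of this learning stage with scikit-learn's logistic regression is given below; the overlap thresholds hi and lo are placeholders, since the paper's exact values are not recoverable from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_labels(labels_map, gt_mask, hi=0.8, lo=0.2):
    """Label each region from its overlap with the ground-truth mask."""
    counts = np.bincount(labels_map.ravel()).astype(float)
    pos = np.bincount(labels_map.ravel(), weights=gt_mask.ravel().astype(float))
    frac = pos / counts                   # fraction of salient pixels per region
    y = np.full(len(counts), -1)          # -1 = ambiguous, dropped from training
    y[frac > hi] = 1                      # positive regions
    y[frac < lo] = 0                      # negative regions
    return y

def train_classifier(X, y):
    """X: (n_regions, 15) combined features; y: labels from make_labels."""
    keep = y >= 0
    clf = LogisticRegression()
    clf.fit(X[keep], y[keep])
    return clf

# Region saliency is then clf.predict_proba(x)[:, 1], i.e., h_theta(x) in (10).
```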

2.4. Multilayer Saliency Map Integration and Reinforcement
2.4.1. Multilayer Saliency Map Integration

We combine the image pyramid, a multiscale representation of the image, to suppress background regions. Similar to [1], the final saliency map is obtained by rescaling the per-layer saliency maps to the same scale and adding them point by point. The fusion strategy is given by
$$S(I) = \sum_{m=1}^{M} \Gamma\bigl(S(I^m)\bigr),$$
where $I$ is the input image, $I^m$ is the $m$th layer of the image pyramid, $S(I^m)$ is the saliency detection result of the $m$th layer, and $\Gamma(\cdot)$ rescales a map to the size of $I$.
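A sketch of the fusion step follows; OpenCV's resize stands in for the rescaling operator $\Gamma$, and the final [0, 1] normalization is an assumption for display.

```python
import cv2
import numpy as np

def fuse_layers(layer_maps, out_size):
    """Sum per-layer saliency maps after rescaling to the input resolution.

    layer_maps: list of 2-D saliency maps, one per pyramid layer.
    out_size: (width, height) of the input image (OpenCV dsize order).
    """
    fused = np.zeros(out_size[::-1], dtype=np.float64)
    for s in layer_maps:
        fused += cv2.resize(s.astype(np.float64), out_size,
                            interpolation=cv2.INTER_LINEAR)
    fused -= fused.min()
    return fused / (fused.max() + 1e-12)   # normalize to [0, 1] (assumed)
```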

2.4.2. Reinforcement of Salient Region

A salient object is usually concentrated in a local region of the image, while the background has a high degree of dispersion. To use this property, we introduce the weighted salient image center into our saliency estimation; the newly defined salient center is
$$p_c^{s} = \frac{\sum_{k=1}^{N} S(k)\, p_k}{\sum_{k=1}^{N} S(k)},$$
where $N$ is the number of pixels in the image and $p_k$ is the position of the $k$th pixel. Hence, the final pixel-level saliency can be defined as
$$S_{\mathrm{final}}(k) = S(k)\, \exp\!\left(-\frac{d(k)^2}{\sigma_w^2}\right),$$
where $k$ is a pixel in the image, $d(k)$ is the Euclidean distance between the pixel and the weighted salient image center, and the parameter $\sigma_w$ is the smooth term which controls the strength of the spatial weight; we set $\sigma_w$ to 0.4.
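The sketch below implements this reinforcement under the reconstruction above: a saliency-weighted image center followed by a Gaussian falloff of the distance to it, with $\sigma_w = 0.4$ on normalized coordinates (the Gaussian form of the falloff is an assumption).

```python
import numpy as np

def reinforce(sal, sigma_w=0.4):
    """Attenuate saliency by distance to the weighted salient image center."""
    h, w = sal.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs, ys = xs / w, ys / h                   # normalized coordinates
    total = sal.sum() + 1e-12
    cx = (sal * xs).sum() / total             # weighted salient center (x)
    cy = (sal * ys).sum() / total             # weighted salient center (y)
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2      # squared distance d(k)^2
    return sal * np.exp(-d2 / sigma_w ** 2)
```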

3. Experiments and Results

To validate our proposed approach, we performed experiments on two publicly available datasets. The first is the MSRA dataset [19], which contains 5000 images with pixel-level ground truth. We used the same training, validation, and test sets as Jiang et al. [23]: the training set contains 2500 images, the validation set contains 500 images, and the test set contains 2000 images. The second is the ECSSD dataset, which contains 1000 images with multiple objects, making the detection task much more challenging; its pixel-level ground truth is provided by Yan et al. [10]. On the two datasets, we compare our method with 18 state-of-the-art saliency detection methods: IT [1], LC [2], SR [3], GB [4], FT [5], HC [6], RC [6], CB [7], LR [8], PD [9], HI [10], GMR [11], GC [12], BMS [13], UFO [14], AMC [15], HDCT [16], and SO [17]. Three parameters of the graph-based image segmentation [35] are used in the algorithm; we set conservative values, sigma = 0.5 and min = 10, for all images. The third parameter is set to a different value for each image layer (one value for the first layer, another for the second layer, and a third for the remaining layers). To evaluate these methods, we either run our own implementations or use the results from the original authors.

3.1. Evaluation Methods

Following [5, 6, 8], we evaluate the performance of our method by measuring its precision and recall rates. Precision measures the percentage of salient pixels correctly assigned, while recall measures the percentage of the salient object detected. To study the performance of saliency detection approaches, we use two kinds of objective comparison measures used in previous studies.

Firstly, the saliency map is segmented by a fixed threshold. Given a threshold $T_f$, pixels whose saliency values are lower than $T_f$ are marked as background; otherwise they are marked as foreground. Varying $T_f$ from 0 to 255 produces a sequence of precision-recall pairs, from which a precision-recall curve can be obtained.
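This evaluation is straightforward to implement; a sketch follows (the saliency map is assumed to be scaled to 0–255, and small constants guard against empty predictions).

```python
import numpy as np

def pr_curve(sal, gt):
    """Fixed-threshold precision-recall curve.

    sal: saliency map scaled to 0..255; gt: binary ground-truth mask.
    """
    sal = np.round(sal).astype(int)
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in range(256):
        pred = sal >= t
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / (pred.sum() + 1e-12))
        recalls.append(tp / (gt.sum() + 1e-12))
    return np.array(precisions), np.array(recalls)
```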

Secondly, we follow [5, 6, 8] and segment a saliency map by adaptive thresholding. The image is first segmented by the mean-shift clustering algorithm, and then the average saliency value of each nonoverlapping region is calculated, as well as the overall mean saliency value of the entire saliency map. The mean-shift segments whose saliency value is larger than twice the overall mean saliency value are marked as foreground; that is, the threshold is defined as
$$T_a = \frac{2}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y),$$
where $W$ and $H$ are the width and height of the saliency map, respectively. In many applications, both high precision and high recall are required. In addition to precision and recall, we therefore report the $F_\beta$ measure, which is defined as
$$F_\beta = \frac{(1 + \beta^2)\, \mathrm{Precision} \times \mathrm{Recall}}{\beta^2\, \mathrm{Precision} + \mathrm{Recall}},$$
where we set $\beta^2 = 0.3$ as suggested in [5, 6, 18].
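A minimal sketch of this measure is given below; for brevity, the mean-shift pre-segmentation is omitted and the adaptive threshold $T_a$ is applied per pixel, which is a simplification of the procedure described above.

```python
import numpy as np

def f_measure(sal, gt, beta2=0.3):
    """Adaptive-threshold precision, recall, and F_beta (beta^2 = 0.3)."""
    t = 2.0 * sal.mean()                     # adaptive threshold T_a
    pred = sal >= t
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-12)
    recall = tp / (gt.sum() + 1e-12)
    f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-12)
    return precision, recall, f
```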

3.2. Performance on MSRA Dataset

We report both quantitative and qualitative comparisons of our method with 18 state-of-the-art saliency detection approaches on the MSRA dataset.

Quantitative Comparison. Figures 3(a) and 3(b) show the precision-recall curves of all the algorithms on the MSRA-5000 dataset. As observed from Figure 3, the curve of our method is consistently higher than the others on this dataset. We also compare the performance of the various methods using adaptive thresholding. Our precision, recall, and $F_\beta$ values (0.8524, 0.7794, and 0.8343, respectively) each rank first among the 18 state-of-the-art methods.

Qualitative Comparison. The visual comparison is given in Figure 4. To save space, we only consider the thirteen most recent models: FT [5], HC [6], RC [6], LR [8], PD [9], GC [12], HI [10], GMR [11], BMS [13], UFO [14], AMC [15], HDCT [16], and SO [17]. Our method produces the best detection results on these images. It is also worth pointing out that our method can handle challenging cases where the background is extremely cluttered.

3.3. Performance on ECSSD Dataset

The ECSSD dataset is a more challenging dataset provided by Yan et al. [10]. As shown in Figure 5, our approach achieves the best precision-recall curve. We also evaluate the average precision, recall, and $F_\beta$ using adaptive thresholding; our recall and $F_\beta$ values rank first among all the methods.

We also provide a visual comparison of the different approaches in Figure 6, from which we see that our approach produces the best detection results on these images and can highlight the entire salient object uniformly. As in Figure 4, we only consider the thirteen most recent models: FT [5], HC [6], RC [6], LR [8], PD [9], GC [12], HI [10], GMR [11], BMS [13], UFO [14], AMC [15], HDCT [16], and SO [17].

3.4. Evaluation on Different Feature Combination

To verify the effectiveness of the proposed feature combination method, we plot the precision-recall curves and the histograms of four combination schemes on the ASD dataset. Four logistic classifiers are trained to obtain the parameters of the four combination schemes: Scheme 1 uses only the four atomic features; Scheme 2 uses any combination of two atomic features together with Scheme 1; Scheme 3 uses any combination of three atomic features together with Scheme 2; and Scheme 4 uses the combination of all four atomic features together with Scheme 3. We use the learned parameters to detect saliency and provide a quantitative comparison of the different combination schemes in Figure 7. As can be seen from Figure 7(b), our approach obtains better results when more features are used. Similar conclusions can be drawn from the precision-recall curves of the different schemes (shown in Figure 7(a)): the weighted combination of atomic features without any higher-order combinations (Scheme 1) yields the lowest precision-recall curve over most of the range; the curve of Scheme 2 is very close to those of Schemes 3 and 4 when recall is less than 0.9; and Schemes 3 and 4 produce almost the same curve.

3.5. Analysis of the Influencing Factors of Segmentation

Recently, low-level image segmentation methods have been widely used for saliency analysis. SLIC [38] and the graph-based superpixel approach [35] are two efficient algorithms whose source code is publicly available. Because they adopt different segmentation criteria, their segmentation results are quite different from each other, as shown in Figure 8: the SLIC result exhibits more local compactness than that of the graph-based superpixel method. We also provide a visual comparison of the four saliency cues and the final saliency maps produced with the two segmentation algorithms. Figure 9 shows that different segmentation algorithms produce different saliency cues and final saliency maps: the graph-based superpixel approach generates a high-quality saliency map, while the SLIC segmentation may highlight some nonsalient regions. Finally, we provide a quantitative comparison of the SLIC and graph-based superpixel algorithms by plotting the corresponding precision-recall curves on the ASD dataset. As observed from Figure 10, using the graph-based superpixel algorithm yields better precision-recall curves than the SLIC clustering algorithm.

4. Conclusion

In this paper, a novel salient region detection approach based on feature combination and a discriminative classifier is presented. We use four saliency cues as atomic features of the segmented regions in the image. To capture the interaction among different features, a novel feature vector is generated by mapping the four-dimensional regional feature to a fifteen-dimensional feature vector, and a logistic regression classifier is trained to map this regional feature to a saliency value. We further introduce multilayer saliency map integration and the weighted salient center for improvement. We evaluate the proposed approach on two publicly available datasets, and the experimental results show that our model generates high-quality saliency maps that uniformly highlight the entire salient object.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research is partly supported by the National Natural Science Foundation of China (no. 61305113, no. 61501394, and no. 61462038), Hebei Province Science and Technology Support Program, China (no. 13211801D), and The Specialized Research Foundation for the Doctoral Program of Higher Education by the Ministry of Education of PR China (no. 20131333110015).