Special Issue: Optimization for Detection and Recognition in Images and Videos
Graph-Based Salient Region Detection through Linear Neighborhoods
Pairwise neighboring relationships estimated by a Gaussian weight function have recently been widely adopted in graph-based salient region detection methods. However, learning the parameters remains a problem, as nonoptimal models significantly affect the detection results. To tackle this challenge, we first use the adjacency information provided by all neighbors of each node to construct an undirected weighted graph, based on the assumption that every node can be optimally reconstructed by a linear combination of its neighbors. Saliency detection is then modeled as graph labelling, learning from partially selected seeds (labeled data) in the graph. Experimental results on several datasets demonstrate the effectiveness and reliability of the proposed graph-based saliency detection method through linear neighborhoods.
1. Introduction
The goal of saliency detection is to identify and locate the most interesting and important region that pops out from the rest of an image. It has been widely used in computer vision applications, including object detection and recognition [1, 2], image compression, image segmentation, content-based image retrieval, image cropping, and photo collage.
Numerous studies have been conducted to design algorithms for salient region detection. Among these works, graph-based saliency detection models have attracted considerable interest in recent years. Previous works on detecting salient regions from images represented as graphs include [8–15]. These models describe the input image as an undirected weighted graph, in which vertices represent the image elements (pixels or regions) and edges represent the pairwise dissimilarity between vertices. The salient object detection problem is then formulated as random walks [8–10], binary segmentation [11, 12], a labelling (ranking) task [13, 14], or a distance metric on the graph, with the aim of finding the pop-out vertices at some local or global locations.
In random-walk methods on graphs [8–10], salient regions are identified by the frequency of visits to each node at equilibrium. In one of these works, results are presented on only two synthetic images, with no evaluation of how the method performs on real images. Harel et al. constructed a fully connected directed graph to represent the image, in which the weight of the edge between two vertices is proportional to their dissimilarity as well as their closeness in the spatial domain. Salient regions are defined as the most frequently visited vertices in a local context. Wang et al. analyzed multiple cues in a unified energy minimization framework and used a previously proposed model to detect salient objects. A major problem is that cluttered backgrounds usually yield higher saliency values because they possess high local contrast. Lu et al. and Liu et al. regarded the saliency detection problem as binary segmentation on a graph. Lu et al. developed a hierarchical graph model and utilized concavity context to compute weights between nodes, from which the graph is bipartitioned for salient object detection. Gopalakrishnan et al. and Yang et al. defined saliency as a labelling or ranking task on a graph and applied semisupervised learning to infer the binary labels of the unlabeled vertices from the salient seeds. However, it is difficult to determine the number and location of the salient seeds that the semisupervised method requires, which is a known problem in graph labelling. In addition, the geodesic distance metric has been applied to measure the feature contrast along paths on the graph.
The graph model can be associated with saliency detection because the prior consistency or cluster assumption [16, 17] observed in semisupervised learning and manifold learning, which has been shown to preserve the intrinsic structure hidden in a dataset effectively, is also appropriate for uncovering the relationships between pixels in an image. The prior consistency consists mainly of two aspects: (1) nearby pixels are likely to have the same saliency; (2) pixels on the same structure (such as an object or a homogeneous region) are likely to have the same saliency. Note that the first assumption is local, while the second is global. The cluster assumption advises us to consider both the local and global information contained in the image during learning. It is straightforward to apply the cluster assumption to the graph-based saliency detection models developed in recent years, since the central idea of these methods is to find the pop-out or salient nodes while preserving the global structure hidden in the image.
Although graph-based saliency detection approaches have had some success, identifying salient objects in natural scenes remains a challenge because factors such as the local and global structure information are not fully described. Graph-based semisupervised learning and manifold learning methods model the whole dataset as a graph, and graph-based saliency detection methods model the input image in the same way. In most graph-based models, superpixels are extracted and used as the basic graph nodes for reasons of computational efficiency and perceptual meaning. In addition, the complete graph [9, 13], the nearest-neighbor graph or $k$-regular graph [13, 15], or the closed-loop graph is applied to simulate the local graph structure in different saliency models. However, how to estimate the weight of each edge has not been fully studied. More concretely, most methods adopt a Gaussian function to calculate the edge weights of the graph [9, 13–15], and the variance of the Gaussian function affects the detection results significantly. This problem has been demonstrated for semisupervised learning methods and also occurs in graph-based saliency models (illustrated in Figure 1). Moreover, there is no reliable approach for model selection if only very few labeled seeds are available; that is, it is hard to determine an optimal variance parameter $\sigma$, as pointed out by Zhou et al.
[Figure 1 panels: (a) original image; (b) ground truth.]
To address the above issues, we propose a more reliable and stable graph-based saliency detection model in this paper. First, the nodes of the graph are made up of neighboring image superpixels, and the edges represent the neighborhood relationships between different superpixels. Instead of the pairwise neighborhood relationships adopted in current graph-based saliency detection methods, we use the adjacency information provided by all neighbors of each image node to estimate the edge weights in the graph, based on the locally linear assumption that nearby superpixels are likely to have the same saliency. The estimated weights of all nodes are then assigned to the edges to construct the undirected weighted graph. Finally, we model saliency detection as a labelling process that learns from partially selected seeds (labeled data) in the graph. Experiments on several datasets demonstrate the effectiveness and higher parameter stability of the proposed graph-based saliency method.
The remainder of the paper is organized as follows. Section 2 describes the proposed model in detail. Section 3 presents experiments on some popular datasets, followed by conclusions and future work in Section 4.
2. Proposed Method
In the proposed method, a spatially neighboring graph is constructed to represent the local structure of the image: superpixels are extracted as the basic graph nodes, and the linear relationships among all neighbors of each node are used to estimate the edge weights. Saliency detection is then tackled as a labelling task that uses the selected seeds over the whole graph. The emphasis of this section, and the major contribution of this paper, is the construction of the graph from all neighbors of each node under the prior consistency assumption, together with the subsequent graph labelling for saliency detection.
2.1. Graph Construction
As shown in Figure 2, an undirected weighted graph $G = (V, E)$ is constructed to represent the input image, where $V$ is a set of nodes and $E$ is a set of undirected edges with weight matrix $W$. In this paper, the nodes are visually homogeneous superpixels, which are computationally efficient and perceptually meaningful compared to regular image patches, and are generated by the Simple Linear Iterative Clustering (SLIC) algorithm proposed by Achanta et al. We choose SLIC because the resulting superpixels are almost regular, compact image patches with good boundary adherence, which helps preserve object edges in the saliency map [14, 15, 20]. In addition, spatially neighboring superpixels are connected to simulate the local neighborhood relationships for all the nodes in the graph. Each edge is assigned a weight to represent the relationship between its two nodes.
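As an illustration of how spatially neighboring superpixels can be connected, the following Python sketch derives the edge set from a superpixel label map. It assumes 4-connectivity between pixels and a label map given as a nested list of integer superpixel indices; the function name `superpixel_adjacency` and the toy label map are hypothetical and are not taken from the original implementation, which obtains its labels from SLIC.

```python
def superpixel_adjacency(labels):
    """Return the set of undirected edges (a, b), a < b, between superpixels
    whose pixels touch horizontally or vertically (4-connectivity assumed)."""
    rows, cols = len(labels), len(labels[0])
    edges = set()
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            # Looking only right and down visits every neighboring pixel pair once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = labels[rr][cc]
                    if a != b:
                        edges.add((min(a, b), max(a, b)))
    return edges


# Toy 3x3 label map with three superpixels (illustrative data only).
toy_labels = [
    [0, 0, 1],
    [0, 2, 1],
    [2, 2, 1],
]
edges = superpixel_adjacency(toy_labels)
```

Each returned pair corresponds to one undirected edge of the graph; the weights on these edges are estimated separately.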
[Figure 2 panels: (a) input image; (c) the weighted graph.]
The Gaussian weight function used in conventional graph-based models mainly describes the pairwise dissimilarity between two neighboring superpixels. However, it is parameter dependent and sensitive, because most natural images are cluttered and complicated. Moreover, the locally linear information is ignored. We therefore derive a more reliable and stable way to construct the graph based on all neighbors of each node, which is capable of recovering the global nonlinear structure from the locally linear fits in the graph.
Based on the cluster assumption that each superpixel and its neighbors lie on or close to a locally linear region of the same manifold in the image, we characterize the local structure of these superpixels by linear coefficients that reconstruct each node from its neighbors in the corresponding graph. We measure the reconstruction errors by the following cost function:
$$\varepsilon = \sum_{i} \Bigl\| x_i - \sum_{j:\, x_j \in N(x_i)} w_{ij} x_j \Bigr\|^2, \tag{2}$$
where $N(x_i)$ denotes all the neighbors of $x_i$ and $w_{ij}$ summarizes the contribution of the node $x_j$ to the reconstruction of the node $x_i$. To estimate the weight $w_{ij}$, our objective is to minimize the cost function subject to three constraints: (1) each node $x_i$ is reconstructed only from its neighbors $N(x_i)$; if $x_j$ does not belong to the set of neighbors of node $x_i$, $w_{ij}$ is set to 0; (2) the sum of the contributions from all neighbors to node $x_i$ equals 1, that is, $\sum_{j} w_{ij} = 1$; (3) we only consider nonnegative edge weights, $w_{ij} \ge 0$, in our saliency detection model. Clearly, the more similar $x_j$ is to the node $x_i$, the larger $w_{ij}$ will be, which means that $x_j$ plays a more important role in reconstructing the node $x_i$. Thus the reconstruction weights of each node can be solved by the least-squares algorithm with the three constraints:
$$\min_{w_{ij}} \; \Bigl\| x_i - \sum_{j:\, x_j \in N(x_i)} w_{ij} x_j \Bigr\|^2 \quad \text{s.t.} \quad \sum_{j} w_{ij} = 1, \; w_{ij} \ge 0. \tag{3}$$
Note that usually $w_{ij} \neq w_{ji}$; here we apply a previously proposed algorithm to obtain a symmetric matrix. Once the reconstruction weights for all nodes in the graph are computed, we obtain a sparse weight matrix $W$ for graph $G$. This weighted graph describes the local neighboring relationships by synthesizing the linear neighborhood around each node, which facilitates the discovery of the globally hidden structure and, subsequently, of the pop-out salient nodes in the image. Furthermore, our graph structure avoids the problem of determining an optimal parameter that arises in the conventional Gaussian function method.
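The constrained reconstruction of a single node from its neighbors can be sketched as follows. This is a minimal illustration rather than the authors' solver: it minimizes the squared reconstruction error under the sum-to-one and nonnegativity constraints with projected gradient descent onto the probability simplex, which is a stand-in for whatever constrained least-squares routine was actually used. The function names, the iteration count, and the toy feature vectors are assumptions made for the example, and the symmetrization step is not covered.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # the last k satisfying the test fixes the shift
    return [max(x - theta, 0.0) for x in v]


def reconstruction_weights(x_i, neighbors, iterations=500):
    """Weights that reconstruct x_i from its neighbors, subject to
    nonnegativity and sum-to-one (projected gradient descent sketch)."""
    k = len(neighbors)
    # With sum(w) = 1 the error equals w^T G w, where G is the local Gram
    # matrix of the differences (x_i - x_j).
    diffs = [[a - b for a, b in zip(x_i, x_j)] for x_j in neighbors]
    gram = [[sum(p * q for p, q in zip(diffs[a], diffs[b])) for b in range(k)]
            for a in range(k)]
    trace = sum(gram[a][a] for a in range(k))
    w = [1.0 / k] * k
    if trace == 0.0:
        return w  # all neighbors coincide with x_i; any feasible w is optimal
    step = 1.0 / (2.0 * trace)  # safe step: gradient Lipschitz constant <= 2*trace
    for _ in range(iterations):
        grad = [2.0 * sum(gram[a][b] * w[b] for b in range(k)) for a in range(k)]
        w = project_to_simplex([w[a] - step * grad[a] for a in range(k)])
    return w


# Toy example: the node lies midway between its first two neighbors,
# while the third neighbor is far away (illustrative data only).
node = [0.0, 0.0]
neighbors = [[1.0, 0.0], [-1.0, 0.0], [0.0, 5.0]]
weights = reconstruction_weights(node, neighbors)
```

In the toy example the first two neighbors share the weight almost equally and the distant neighbor receives a weight close to zero, which matches the observation that more similar neighbors contribute more to the reconstruction.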
2.2. Saliency Model
2.2.1. Graph Labelling
In this paper, the problem of saliency detection is modeled as the task of graph labelling, which aims to propagate the labels of the selected seeds to the unlabeled nodes using the graph constructed in Section 2.1. Given a dataset $X = \{x_1, \ldots, x_l, x_{l+1}, \ldots, x_n\}$, the first $l$ data points belong to the labeled set and the remaining ones belong to the unlabeled set, which must be labeled according to their relevance to the labeled ones. Let $f$ denote a classifying function defined on $X$, which assigns a real value $f_i$ to each data point $x_i$. Let $y_i$ denote the label indicator for each data point $x_i$: if $x_i$ is labeled, $y_i = 1$; otherwise, $y_i = 0$. The problem of labelling the unlabeled data can then be solved by an iterative procedure.
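In each iteration, each data point "absorbs" a fraction of label information from its neighbors and retains some label value of its present state. In other words, all data points spread their label scores to their neighbors via the weighted graph. Therefore, the label of data point $x_i$ at time $t + 1$ becomes
$$f_i^{t+1} = \alpha \sum_{j:\, x_j \in N(x_i)} w_{ij} f_j^{t} + (1 - \alpha)\, y_i, \tag{4}$$
where $f^t = [f_1^t, \ldots, f_n^t]^T$ is the classifying function learned at iteration $t$ and $y = [y_1, \ldots, y_n]^T$. The parameter $\alpha \in (0, 1)$ specifies the relative contributions to the labelling scores from the neighbors and the initial labels. In matrix form,
$$f^{t+1} = \alpha W f^{t} + (1 - \alpha)\, y. \tag{5}$$
Iterating (5) to update the label scores of all the data until convergence, let $f^*$ be the limit of the sequence $\{f^t\}$:
$$f^* = \lim_{t \to \infty} f^{t} = (1 - \alpha)(I - \alpha W)^{-1} y, \tag{6}$$
where $I$ is the $n$-dimensional identity matrix. In (6), the constant term $(1 - \alpha)$ does not affect the labelling results of $f^*$. Hence, $f^*$ is equivalent to
$$f^* = (I - \alpha W)^{-1} y. \tag{7}$$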
The resulting label function $f^*$ provides a general framework for semisupervised learning, in which multiple labels are predefined for applications including clustering, segmentation, and classification. Yang et al. applied $f^*$ to learn ranking scores for saliency detection according to the relations with a single label, which they call the manifold ranking based algorithm.
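The iterative labelling procedure described above can be sketched in a few lines of Python. The sketch assumes a small dense weight matrix stored as nested lists, a label indicator vector with 1 for seeds and 0 elsewhere, and an illustrative value of the mixing parameter (0.9); none of these settings are taken from the paper's experiments, and the function name `propagate_labels` is hypothetical. Each pass mixes the neighbor-weighted scores with the initial labels until the scores stop changing.

```python
def propagate_labels(W, y, alpha=0.9, tol=1e-10, max_iter=10000):
    """Iterate f <- alpha * W f + (1 - alpha) * y until the update is below tol."""
    n = len(y)
    f = list(y)
    for _ in range(max_iter):
        new_f = [alpha * sum(W[i][j] * f[j] for j in range(n)) + (1.0 - alpha) * y[i]
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_f, f)) < tol:
            return new_f
        f = new_f
    return f


# Toy chain graph with four nodes; each row of W sums to 1 (illustrative data only).
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
y = [1.0, 0.0, 0.0, 0.0]  # node 0 is the only labeled seed
scores = propagate_labels(W, y)
```

On this chain the converged scores decrease with distance from the seed. The iteration returns the limit including the constant factor that multiplies the initial labels; dropping that factor only rescales the scores, so it does not change their ordering or their normalized values.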
2.2.2. Saliency Measure
For saliency detection, determining the number and location of the salient labels that (7) requires remains a major problem. An earlier method used the most salient node together with some background nodes for label extraction of the salient region; however, the results are sensitive to the choice of seeds. Observing that the background often presents local or global appearance consistent with the image boundary [15, 20, 22–24], Yang et al. used the nodes on the image boundary to label most of the background nodes, with the aim of producing some salient labels. Following a similar scheme, the salient regions are estimated by two-phase graph labelling on the constructed weighted graph (shown in Figure 3). In other words, the image boundary nodes and the salient nodes are used separately to generate the final saliency maps.
[Figure 3 panels: (a) input image; (b) ground truth; (c) saliency map for first phase; (d) saliency map for second phase.]
(1) Labelling with the Background Nodes. Based on the assumption that the image boundary is more likely to be background, we use the superpixels on the image boundary as seeds to remove some background clutter, which in turn leads to better salient seeds for labelling. According to the basic rules of photographic composition, the four sides of the image boundary often show different background appearances, so they are marked with different labels for stronger learning. If all of the boundary superpixels were given the same label during learning, the learned classifying function would usually be less optimal because these nodes are dissimilar. Therefore, the first-phase saliency maps are generated using each of the four sides of the image boundary in turn as the labeled nodes and the remaining superpixels as the unlabeled ones.
In this paper, we first obtain the four label indicator vectors of the top, bottom, left, and right image boundaries. All the nodes on the graph are then labelled using $f^*$ in (7) with the four label indicator vectors separately, yielding four $n$-dimensional vectors in which each element denotes the relevance to the labelled image boundary superpixels. We then normalize them to the range between 0 and 1. If an image element is similar to the boundary, its labelling value obtained by $f^*$ is closer to 1. However, saliency values in saliency detection models correspond to labelling with salient seeds. So the final label score for each superpixel is modified as
$$S(i) = 1 - \bar{f}^*(i), \quad i = 1, 2, \ldots, n, \tag{8}$$
where $i$ is a node on the constructed graph and $\bar{f}^*$ denotes the normalized labelling vector. To obtain candidate salient superpixels in the image, (8) is applied to each of the four label scores generated from the four sides of the image boundary, and the first-phase saliency in (9) is defined by integrating the four resulting maps.
Saliency maps generated by (9) suppress most of the background superpixels and thus highlight some candidate salient ones. However, the candidate salient regions inferred from the image boundary alone are weak, and some background superpixels remain. To tackle this problem, we use the label function in (7) with the candidate salient nodes to further improve the performance of detecting salient regions.
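The first-phase computation can be sketched as follows. The sketch takes four lists of labelling scores (one per boundary side), normalizes each to the range 0 to 1, takes the complement so that boundary-like nodes score low, and integrates the four maps. The integration operator is an assumption here: the sketch multiplies the four maps elementwise, as in the manifold ranking scheme of Yang et al., and the original equation may differ. The function names and the toy scores are hypothetical.

```python
def normalize(values):
    """Min-max normalize a list of scores to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def first_phase_saliency(side_scores):
    """Combine boundary-relevance scores from the four sides into one map.

    side_scores: four lists (top, bottom, left, right), one score per node.
    Assumption: the maps are integrated by elementwise multiplication."""
    n = len(side_scores[0])
    saliency = [1.0] * n
    for scores in side_scores:
        complement = [1.0 - v for v in normalize(scores)]  # low near the boundary
        saliency = [s * c for s, c in zip(saliency, complement)]
    return saliency


# Toy scores for three nodes: node 0 resembles every boundary side,
# node 1 resembles none, and node 2 lies in between (illustrative data only).
top = [0.9, 0.1, 0.5]
bottom = [0.8, 0.2, 0.5]
left = [1.0, 0.0, 0.4]
right = [0.7, 0.1, 0.4]
phase_one = first_phase_saliency([top, bottom, left, right])
```

With a product, a node must look unlike all four sides to keep a high value, which is why a node resembling any one side is suppressed strongly.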
(2) Labelling with the Candidate Salient Nodes. The second phase aims to eliminate some background superpixels and highlight salient regions using the labelling function in (7) with the possible salient nodes produced in the first-phase labelling. Because the possible salient regions inferred from the image boundary are weak and prior dependent, stronger label seeds that are salient or belong to the actual salient regions are needed. Otsu's method is therefore applied to partition the saliency maps generated in the first phase adaptively, and the remaining foreground superpixels are treated as strong salient seeds for labelling. The label indicator vector $y$ is then established to compute the relevance function $f^*$ in (7). Here, the saliency is defined as the labelling scores normalized to the range between 0 and 1:
$$S(i) = \bar{f}^*(i), \quad i = 1, 2, \ldots, n. \tag{10}$$
Despite some imprecise foreground labels, the salient objects can be well detected in the final saliency maps. The reason is that salient regions are usually spatially compact rather than dispersed compared to background regions, and are homogeneous and consistent in appearance in terms of feature distribution, such as color and texture. All of these priors affect the detection results greatly.
In conventional semisupervised learning methods for graph labelling, the diagonal elements of the matrix $(I - \alpha W)^{-1}$ play an important role in computing the relevance to the labelled data. However, nonzero diagonal elements mean that the relevance of each superpixel to itself is taken into account in saliency detection, which can weaken the contributions of the other labelled nodes because the self-relevance is usually large. To obtain better detection results, we set the diagonal elements of this matrix to 0.
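The seed selection for the second phase can be illustrated with a small Otsu thresholding sketch. It assumes first-phase saliency values already normalized to the range 0 to 1, quantizes them into 256 bins, and picks the split that maximizes the between-class variance; nodes at or above the threshold become salient seeds in the label indicator vector. The bin count, the function name `otsu_threshold`, and the toy values are assumptions, and zeroing the diagonal of the propagation matrix is not shown.

```python
def otsu_threshold(values, bins=256):
    """Return the threshold in [0, 1] that maximizes between-class variance."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_var, best_bin = -1.0, 0
    count_low, sum_low = 0, 0.0
    for t in range(bins):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0:
            continue
        if count_high == 0:
            break
        mean_low = sum_low / count_low
        mean_high = (sum_all - sum_low) / count_high
        between = count_low * count_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_var, best_bin = between, t
    # Bins up to best_bin form the background class.
    return (best_bin + 1) / bins


# Toy first-phase saliency for seven superpixels (illustrative data only).
phase_one_map = [0.05, 0.10, 0.08, 0.12, 0.85, 0.90, 0.80]
threshold = otsu_threshold(phase_one_map)
seed_indices = [i for i, s in enumerate(phase_one_map) if s >= threshold]
indicator = [1.0 if s >= threshold else 0.0 for s in phase_one_map]
```

The resulting `indicator` list plays the role of the label indicator vector for the second labelling pass.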
3. Experiments
Dataset. Two standard benchmark datasets, ASD and SOD, are adopted to evaluate the proposed graph-based saliency model.
(1) ASD. It contains 1000 images with accurate human-labeled segmentation masks for salient objects and has been used for testing almost all saliency models.
(2) SOD. The dataset comes from the well-known Berkeley segmentation dataset of 300 images, in which the images are cluttered and complex in terms of the scales, appearances, and positions of the foreground objects, as well as the appearance of the background regions. The pixel-labeled ground truth is taken from prior work.
Evaluation Metrics. For average performance evaluation, we use the standard PR (precision-recall) curve and the $F$-measure. For each detected saliency map and the corresponding ground truth, the precision rate is the fraction of pixels detected as salient that are truly salient, while the recall rate is the fraction of salient pixels in the ground truth that are detected. To generate a PR curve, the saliency map is first normalized into $[0, 255]$. A series of binary masks is then produced by segmenting the saliency map with a threshold varying from 0 to 255. We compare these binary masks with the ground truth to obtain the PR curve for each saliency map. The curves obtained from all images in each dataset are averaged to generate an overall PR curve.
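The fixed-threshold PR curve for a single saliency map can be sketched as follows. The sketch assumes the map is flattened into a list of integers in the range 0 to 255 and the ground truth into a list of 0/1 values; treating a pixel as detected when its value is at or above the threshold, and reporting a precision of 1 when nothing is detected, are conventions chosen for the example rather than details given in the paper.

```python
def precision_recall_curve(saliency, truth):
    """Return a list of (threshold, precision, recall) for thresholds 0..255.

    saliency: flat list of ints in [0, 255]; truth: flat list of 0/1 labels."""
    positives = sum(truth)
    curve = []
    for threshold in range(256):
        detected = [s >= threshold for s in saliency]
        true_pos = sum(1 for d, g in zip(detected, truth) if d and g)
        n_detected = sum(detected)
        precision = true_pos / n_detected if n_detected else 1.0
        recall = true_pos / positives if positives else 0.0
        curve.append((threshold, precision, recall))
    return curve


# Toy four-pixel map: two salient pixels with high values (illustrative data only).
toy_saliency = [250, 200, 30, 10]
toy_truth = [1, 1, 0, 0]
curve = precision_recall_curve(toy_saliency, toy_truth)
```

Averaging the precision and recall values at each threshold over all images in a dataset gives the overall curve.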
Although commonly used, the PR curve is limited in that it only considers whether the saliency of the object is higher than that of the background. Since high precision can be achieved at the cost of lower recall and vice versa, the $F$-measure is used to balance the precision rate and the recall rate:
$$F_\beta = \frac{(1 + \beta^2)\, \text{Precision} \times \text{Recall}}{\beta^2\, \text{Precision} + \text{Recall}}, \tag{11}$$
and $\beta^2$ is set to 0.3, following previous work. Because the optimal binary threshold differs from image to image, the fixed thresholds used for the PR curve ignore this variation. An adaptive threshold is therefore used, determined as the mean saliency of all pixels in each saliency map, to measure the average precision, recall, and $F$-measure values on each dataset:
$$T_a = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} S(x, y), \tag{12}$$
where $W$ and $H$ are the width and height of the saliency map in pixels, respectively, and $S(x, y)$ is the saliency value of the pixel $(x, y)$.
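The adaptive-threshold evaluation can be sketched as follows. The code binarizes a saliency map at its mean value, as stated above, and then computes precision, recall, and the weighted harmonic mean with the weight 0.3. The flat-list data layout, the at-or-above comparison, the zero fallbacks for empty detections, and the function name are assumptions made for the example.

```python
def adaptive_f_measure(saliency, truth, beta_sq=0.3):
    """Precision, recall, and F-measure at the mean-saliency threshold.

    saliency: flat list of saliency values; truth: flat list of 0/1 labels."""
    threshold = sum(saliency) / len(saliency)  # mean saliency of the map
    detected = [s >= threshold for s in saliency]
    true_pos = sum(1 for d, g in zip(detected, truth) if d and g)
    n_detected = sum(detected)
    positives = sum(truth)
    precision = true_pos / n_detected if n_detected else 0.0
    recall = true_pos / positives if positives else 0.0
    denom = beta_sq * precision + recall
    f_measure = (1.0 + beta_sq) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy six-pixel map with two truly salient pixels (illustrative data only).
toy_map = [0.9, 0.8, 0.6, 0.1, 0.1, 0.1]
toy_mask = [1, 1, 0, 0, 0, 0]
precision, recall, f_measure = adaptive_f_measure(toy_map, toy_mask)
```

In the toy example three pixels exceed the mean, two of them correctly, so precision is 2/3, recall is 1, and the F-measure is 13/18. Because the weight is below 1, the measure emphasizes precision over recall.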
3.1. Validation of the Graph Constructed through Linear Neighborhood
In this section, the proposed LNR (linear neighborhood relationship) graph is compared with the conventional pairwise relationship based graph weighted by the Gaussian function with different variances to generate saliency maps for all images on the ASD dataset in our framework of saliency detection. Figure 4(a) illustrates the results of average PR curves, and Figure 4(b) shows the precision, recall, and -measure values generated by the adaptive threshold, respectively. The experimental results indicate the reliability and stability of our LNR graph. The reason is that our LNR graph is more capable of representing the local information through the linear neighborhood around each superpixel, and discovering the global manifold structure through the semisupervised learning and labelling.
3.2. Validation of the Two-Phase Saliency
This paper uses the image boundary nodes and some candidate salient nodes, respectively, to generate the label indicator vectors for saliency detection based on graph labelling. We compare the performance of the proposed approach after each phase. Figure 5 demonstrates that the second phase, which uses the strong candidate foreground superpixels selected by Otsu's method, further improves on the first phase, which uses only the image boundary superpixels.
3.3. Comparison with the State of the Art on Some Datasets
To verify the effectiveness of the proposed saliency detection method through linear neighbors, we compare it with recent state-of-the-art approaches (PCA, SF, FT, RC, and HS), some graph-based methods (GS_SP, MR, and GB), and two traditional ones (IT and MZ). We select these methods for their variety in the design of the saliency model or the description of saliency. In the experiments, we use the implementation from Achanta et al. for FT, IT, GB, and MZ. For RC, PCA, HS, and MR, we run the authors' public code. For SF and GS_SP, we directly use the saliency maps provided by the authors. Figure 6 shows the comparison of the various saliency methods on the ASD dataset and demonstrates the performance of our method.
For the SOD dataset, we compare with the PCA, HS, GS_SP, RC, and MR methods. The PR curves and the precision, recall, and $F$-measure values obtained with the adaptive threshold are illustrated in Figure 7. We note that, when measured with fixed thresholds, our PR curve does not perform well compared to some models at high recall rates (lower thresholds). As the threshold rises, however, the proposed method performs best among the methods shown in Figure 7(a), which means that our method can highlight the whole object region uniformly. The reason is that the pixel values of the saliency maps computed by our method are distributed within a fixed interval that occupies a small range of the gray space. When the threshold is smaller than a certain value, the segmented saliency maps cannot represent all the salient objects in the images, which results in a lower precision rate.
Results of visual comparison on the two datasets are shown in Figure 8.
[Figure 8 panel: (b) ground truth.]
3.4. Experimental Settings and Run Time
The proposed method is applied to the ASD and SOD datasets on a machine with an Intel Core i5-4590 3.30 GHz CPU and 8 GB of RAM, and the saliency model is implemented in MATLAB. Specifically, our method spends 0.201 s, 0.515 s, and 0.342 s on superpixel generation, graph construction, and saliency map computation, respectively, for each image in the ASD dataset; the run time of superpixel generation is measured by segmenting each image into 200 superpixels.
4. Conclusions
Recent graph-based salient region detection methods have applied the Gaussian weight function to measure the pairwise neighboring relationships between image elements, which is parameter dependent and sensitive. To solve this problem, we propose a graph-based saliency detection model through linear neighbors, in which each node is represented by a linear combination of all its neighbors to better capture the local information in the image. Saliency is defined through semisupervised learning that labels the remaining nodes from partially selected seeds (labeled nodes) in the whole graph, which takes the global data structure hidden in the image into account. As a result, both the local grouping information and the global intrinsic structure of the cluster assumption are captured by the graph construction and labelling used to learn the pop-out nodes in our linear neighborhood relationship based saliency detection method. In addition, our LNR graph detects salient regions reliably and stably without requiring a predefined parameter for optimal model selection. We evaluate the proposed model on benchmark datasets, and the experimental results indicate the effectiveness of our graph-based saliency detection method through linear neighborhoods when compared with other state-of-the-art approaches.
It should be noted that the proposed method can be further improved if stronger labels are provided. Our future work will focus on seed selection.
The authors declare that they have no competing interests.
This research was supported by “National Natural Science Foundation of China” (no. 61272523, no. 61572103), “the National Key Project of Science and Technology of China” (no. 2011ZX05039-003-4), and “the Fundamental Research Funds for the Central Universities” (no. DUT15QY33).
C. Kanan and G. Cottrell, “Robust classification of objects, faces, and flowers using natural image statistics,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 2472–2479, IEEE, San Francisco, Calif, USA, June 2010.
U. Rutishauser, D. Walther, C. Koch, and P. Perona, “Is bottom-up attention useful for object recognition?” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04), vol. 2, pp. II37–II44, July 2004.
L. Marchesotti, C. Cifarelli, and G. Csurka, “A framework for visual saliency detection with applications to image thumbnailing,” in Proceedings of the 12th International Conference on Computer Vision (ICCV '09), pp. 2232–2239, IEEE, Kyoto, Japan, October 2009.
J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS '06), pp. 545–552, December 2006.
Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in Computer Vision—ECCV 2012, pp. 29–42, Springer, Berlin, Germany, 2012.
D. Zhou, O. Bousquet, T. N. Lal et al., “Learning with local and global consistency,” Advances in Neural Information Processing Systems, vol. 16, no. 16, pp. 321–328, 2004.
O. Chapelle, J. Weston, and B. Schölkopf, “Cluster kernels for semi-supervised learning,” in Proceedings of the 16th Annual Neural Information Processing Systems Conference (NIPS '02), pp. 585–592, Vancouver, Canada, December 2002.
D. Zhou, J. Weston, A. Gretton et al., “Ranking on data manifolds,” in Advances in Neural Information Processing Systems 16, pp. 169–176, MIT Press, 2004.
H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: a discriminative regional feature integration approach,” in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), pp. 2083–2090, IEEE, Portland, Ore, USA, June 2013.
N. Otsu, “A threshold selection method from gray-level histograms,” Automatica, vol. 11, no. 285–296, pp. 23–27, 1975.
D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings of the 8th International Conference on Computer Vision, pp. 416–423, July 2001.
F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, “Saliency filters: contrast based filtering for salient region detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '12), pp. 733–740, IEEE, Providence, RI, USA, June 2012.
Y.-F. Ma and H.-J. Zhang, “Contrast-based image attention analysis by using fuzzy growing,” in Proceedings of the 11th ACM International Conference on Multimedia (MM '03), pp. 374–381, November 2003.