Abstract

The detection of salient regions has attracted increasing attention in machine vision. In this study, a novel and effective framework for salient region detection is proposed to address the low detection accuracy of traditional methods. First, we divide the image into three levels. Second, at each level, three different feature methods are used to generate distinct feature saliency maps. Subsequently, a novel integration mechanism, termed the competition mechanism, is introduced for the coarse saliency maps at the same level: the two coarse saliency maps with the highest similarity are selected for fusion to ensure the effectiveness of the salient region map. Then, after the fused saliency maps of the different levels are adjusted to a common scale, the two maps with the most significant difference among the three levels are selected and fused to obtain the final refined saliency map. Finally, experiments using the proposed method were conducted on three benchmark datasets. As demonstrated by the experimental results, the proposed algorithm is superior to other state-of-the-art methods.

1. Introduction

With the development of image processing technology, large amounts of image data can be acquired quickly, which poses a challenge to the efficiency of image processing algorithms. According to the visual attention mechanism, the human eye can quickly select the most interesting objects in a scene and ignore useless information, and the saliency detection task aims to identify these important regions. In recent years, saliency detection has gradually become a research hotspot in machine vision, and an increasing number of researchers have focused on this problem. As an important preprocessing procedure, saliency detection contributes to many machine vision and image processing tasks such as visual tracking [1], image compression [2], object recognition [3], and image segmentation [4].

The saliency detection task consists of salient object detection and gaze prediction [5]. Existing salient object detection methods can be divided into two categories: top-down [6] and bottom-up [7, 8]. Top-down approaches are task-driven; supervised learning frameworks, such as deep learning or convolutional neural networks, typically use top-down strategies. Although deep learning has achieved good results in salient object detection, it requires large datasets for training and involves many parameters, which complicates parameter tuning. Conversely, bottom-up methods are vision-driven; their performance relies on various low-level visual features, such as color, texture, and orientation.

Many salient object detection models exist, among which graph-based methods measure the similarity between nodes when generating saliency maps [9]. A superpixel image is treated as a graph model with superpixels as nodes and similarity weights as edges, and the saliency information of the original image is extracted through node propagation. In these methods, initial node selection is the key step in the salient object detection process. Existing node selection methods introduce various types of prior knowledge to ensure the accuracy of foreground or background node selection. For example, the boundary prior [10] assumes that regions along the image boundary belong mostly to the background, so that regions strongly connected to the boundary are unlikely to be salient. The center prior [11] assumes that the salient object is located at the center of the image. However, these assumptions often fail for natural images, where salient object locations are randomly distributed. This is a significant challenge for existing prior-based methods.

Considering the advantages and disadvantages of node-selection-based graph methods for salient object detection, we can conclude that the determination of prior knowledge plays an important role in node selection; however, the saliency maps generated from initial nodes determined by different priors also differ. Some researchers have attempted to improve the accuracy of saliency maps through initial node determination, but if the initial node is misjudged at the outset, the accuracy of the saliency map is significantly reduced, and the corresponding feature method may even fail. In addition, for superpixel images generated using the simple linear iterative clustering (SLIC) algorithm [12], images of different sizes yield nodes of different shapes and numbers. In other words, image layering enriches the input features of salient objects, thereby improving the robustness of the detection results. It also provides a basis for subsequent feature selection.

Therefore, image hierarchies and competition mechanisms are particularly important for saliency detection. In this study, we focus on bottom-up saliency detection models. In contrast to traditional methods, the proposed scheme improves the accuracy and consistency of saliency results by fusing multilevel saliency priors with an integration mechanism based on global similarity cues. The main contributions of the proposed method are as follows:
(1) The use of multilayered images overcomes the nonrobustness of saliency measurements with respect to the size of salient objects. Simultaneously, the shapes and numbers of nodes generated at different image scales differ, enriching the input features of the salient objects.
(2) Three different feature methods are run in parallel to prevent any single prior from misjudging nodes at the outset, which benefits the later feature selection.
(3) A novel feature saliency map fusion mechanism is established. First, feature maps are fused within the same level, selecting the two most similar coarse saliency maps. Then, two feature saliency maps with large differences across levels are selected for fusion. The first competition removes features with low detection accuracy, ensuring the effectiveness of saliency maps at the same level; the second competition enriches the boundary features of the map and ensures the detection accuracy of the final saliency map.

The remainder of this paper is organized as follows. In Section 2, we review related studies on saliency detection. In Section 3, the construction of various hierarchical coarse saliency maps is presented. Section 4 presents a saliency graph refinement of the competition mechanism. Section 5 presents the experimental data and analysis. Finally, Section 6 presents conclusions and discussion.

2. Related Work

Existing salient object detection methods can be divided into two categories: bottom-up and top-down. Bottom-up methods are vision-driven, and their performance relies on various low-level visual features such as color, texture, and orientation. To date, many low-level features have been used in saliency detection, and it has been confirmed that features such as color, salient object size, location, and image boundaries directly influence the visual attention mechanism and play an important role in saliency detection. Top-down approaches, in contrast, are task-driven; supervised learning frameworks, such as deep learning or convolutional neural networks, typically use top-down strategies.

Visual saliency detection is mostly based on bottom-up methods that use low-level visual features in images and videos. A pioneering study on saliency detection was presented by Itti et al. [13], in which the final saliency map of an input image was obtained by exploiting center-surround differences in low-level visual features (color, intensity, and orientation maps). Ma and Zhang [14] proposed a method to detect regions of visual attention in images based on local contrast analysis, using center-surround and chromatic aberration calculations. Wei et al. [15] proposed two priors, boundary and connectivity, assuming that the image boundary is mostly background; combining these priors with the geodesic distance, they designed an effective salient object detection model, called the GS model, which performs saliency detection using background priors and geodesic distances. Perazzi et al. [16] decomposed a given image into compact, perceptually homogeneous elements, evaluated the uniqueness and spatial distribution of these elements, and inferred element contrast to derive saliency measures and generate pixel-accurate saliency maps. Yang et al. [17] represented an image as a closed-loop graph with superpixels as nodes and ranked the similarity of image elements to foreground or background queries via graph-based manifold ranking (MR) using the affinity matrix, effectively separating background regions from foreground salient objects. Wu et al. [7] used two perceptual cues, proximity and similarity, to generate background nodes through background probability measures and established a label propagation model to generate saliency maps. Wu et al. [18] proposed a boundary-guided graph structure to explore the correlations between superpixels and built an iterative propagation mechanism to refine the saliency map. Wang et al. [19] used the convex hull of interest points to estimate the locations of salient objects, employing random walks to propagate nodes and generate saliency maps. Jian et al. [20] used a discrete wavelet transform to extract directional features, calculated the centroids of salient objects, and combined color contrast and background features to generate robust saliency maps. Al-Azawi et al. [21] proposed a two-stage irregularity-based saliency measure in which local saliency detection uses object structure, whereas global saliency detection relies on background-related contrast. Chakraborty and Mitra [22] used random walks on Markov chains to obtain saliency maps, with k-dense subgraphs enhancing the salient regions in the image.

Among these saliency detection methods, graph-based methods generate saliency maps by measuring the similarity between nodes according to different rules. In this class of methods, initial node selection is the key step in the salient object detection process. Existing node selection methods introduce various types of prior knowledge to ensure the accuracy of foreground or background node selection. These priors work well for carefully composed samples such as ID photos. However, they often fail for natural images, where salient object locations are randomly distributed; this is a significant challenge for existing prior-based methods. In addition, for superpixel images generated using the SLIC algorithm, images of different sizes yield nodes of different shapes and numbers. In other words, image layering enriches the input features of salient objects, thereby improving the robustness of the detection results and providing a basis for subsequent feature selection.

Considering the advantages and disadvantages of node-selection-based graph methods, we can conclude that the determination of prior knowledge plays an important role in node selection; however, the saliency maps generated from initial nodes determined by different priors also differ. Some researchers have attempted to improve the accuracy of saliency maps through initial node determination, but if the initial node is misjudged at the outset, the accuracy of the saliency map is significantly reduced, and the corresponding feature method may even fail.

Consequently, we propose a saliency map refinement method based on a competition mechanism. First, multiple feature maps are generated in parallel; the feature saliency maps are then screened to retain useful maps and eliminate useless ones. Specifically, the similarity between the saliency maps generated by a set of features is compared, and the effective feature maps are selected. For convenience, this study adopts three different types of features and compares the similarity of the saliency maps they generate.

In contrast to bottom-up saliency detection models, top-down methods are based on high-level priors and are thus typically task-specific training and learning methods. In [6], Yang et al. used joint learning of a conditional random field (CRF) and a visual dictionary for salient object detection, with a hierarchical structure from top to bottom: CRF, sparse coding, and image patches. The CRF is learned in a feature-adaptive manner as an intermediate layer and is also used as an output layer to learn a dictionary under structured supervision, yielding a top-down saliency detection model. Shan et al. [23] proposed a three-stage layered neural network: first, Fast R-CNN extracts features for each superpixel; second, an attention mechanism expands the receptive field from one superpixel to surrounding and related superpixels; finally, a global regression model produces saliency scores to generate a top-down saliency map. Although deep learning-based salient object detection has achieved good results [24], these methods are extremely sensitive to data attributes, which may limit their adaptability to different situations.

3. Construction of Multiple Layered Coarse Saliency Maps

When the selection of initial nodes differs, the detection results of the saliency maps also differ significantly, and a competitive and complementary relationship exists between them. Based on various factors, such as salient object size, color, location, and image boundary, a multilevel hierarchical saliency detection algorithm with different types of prior knowledge is designed to determine the initial nodes. The framework of this study is illustrated in Figure 1. First, we divide the input image into three levels of different sizes: Z1, Z2, and Z3. Subsequently, we process Z1, Z2, and Z3 in parallel by building three different types of saliency graph models. Finally, we define a competition mechanism that fuses the saliency maps twice to obtain a refined saliency map. In Figure 1, ① adopts the SLIC algorithm to generate superpixel images; ②, ③, and ④ represent the feature methods based on the center prior, the color prior, and the boundary connectivity prior, respectively; ⑤, ⑥, and ⑦ represent the feature analysis of the respective feature maps; ⑧ denotes feature map comparison; and ⑨ denotes feature fusion.
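To make the pipeline concrete, the following is a minimal Python sketch of the three-level, three-feature framework. The function names, the scale factors, the plain averaging used for fusion, and the mean-absolute-difference similarity measure are illustrative assumptions of this sketch; the weighted fusion actually used is described in Section 4.

```python
import numpy as np
from itertools import combinations
from skimage.transform import resize

def _pair_by_similarity(maps, most_similar=True):
    """Pick the pair of maps with the highest (or lowest) similarity,
    measured here as negative mean absolute difference (an assumption;
    the exact similarity measure is not fixed in this sketch)."""
    pairs = list(combinations(range(len(maps)), 2))
    sims = [-np.abs(maps[i] - maps[j]).mean() for i, j in pairs]
    k = int(np.argmax(sims) if most_similar else np.argmin(sims))
    i, j = pairs[k]
    return maps[i], maps[j]

def detect_saliency(image, feature_methods, scales=(1.0, 0.75, 0.5)):
    """Three levels (Z1, Z2, Z3) x three prior-based feature methods,
    fused by two competitions; `feature_methods` are callables that map
    an image to a 2-D saliency map in [0, 1]."""
    h, w = image.shape[:2]
    level_maps = []
    for s in scales:                                      # levels Z1-Z3
        img_s = resize(image, (int(h * s), int(w * s)), anti_aliasing=True)
        coarse = [f(img_s) for f in feature_methods]      # three coarse maps
        a, b = _pair_by_similarity(coarse, most_similar=True)
        level_maps.append((a + b) / 2.0)                  # first fusion
    level_maps = [resize(m, (h, w)) for m in level_maps]  # unify the scale
    c, d = _pair_by_similarity(level_maps, most_similar=False)
    return (c + d) / 2.0                                  # second fusion
```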

3.1. Coarse Saliency Map Generation Based on Boundary Prior

Zhu et al. [10] proposed a measure of the degree of connection between a region Y and the image boundary, that is, the boundary connectivity, which remains effective even when a region touches the image boundary. It is expressed as the ratio of the length of the region's perimeter on the boundary to the square root of the region's area:

$$\mathrm{BndCon}(Y)=\frac{\mathrm{Len}_{\mathrm{bnd}}(Y)}{\sqrt{\mathrm{Area}(Y)}},$$

where $p_i$ is a node in region Y, $B$ is the set of image boundary nodes, $\mathrm{Len}_{\mathrm{bnd}}(Y)$ is the length of the part of the region's perimeter on the boundary, and $\mathrm{Area}(\cdot)$ is the area of a region.

To facilitate the calculation, region Y is taken as a set of nodes with similar colors, and its area is computed as

$$\mathrm{Area}(Y)=\sum_{i=1}^{N_Y} n_i,$$

where $n_i$ is the total number of pixels in node $p_i$ and $N_Y$ is the number of nodes in Y. Similarly, the boundary length is defined as

$$\mathrm{Len}_{\mathrm{bnd}}(Y)=\sum_{i=1}^{N_Y}\delta(p_i\in B),$$

where $\delta(p_i\in B)$ is 1 if $p_i$ lies on the image boundary and 0 otherwise.

A color similarity also exists among these nodes. The salient object in an image exhibits small spatial variation, whereas background areas distributed over the entire image exhibit large spatial variation. In other words, compactness is an appropriate supplement to boundary connectivity and can suppress erroneously highlighted areas. The spatial compactness $D$ can be calculated from the intraclass distribution:

$$D(Y)=\frac{1}{N_Y}\sum_{i=1}^{N_Y}\lVert x_i-\mu_Y\rVert^2,$$

where $x_i$ is the position of node $p_i$ and $\mu_Y$ is the center position of Y, given by

$$\mu_Y=\frac{1}{N_Y}\sum_{i=1}^{N_Y} x_i.$$

Furthermore, the above formulation is extended to a background probability based on the boundary connectivity value of node $p_i$:

$$P_{\mathrm{bg}}(p_i)=1-\exp\!\left(-\frac{\mathrm{BndCon}^2(p_i)}{2\sigma_b^2}\right),$$

where $\sigma_b$ is the adjustment parameter that controls the intensity of the boundary connectivity. Because the background probability evaluates the likelihood that a node belongs to the background, probable background nodes can be distinguished by it. Therefore, background exclusion can be expressed as

$$S_{\mathrm{bd}}(p_i)=\begin{cases}0, & P_{\mathrm{bg}}(p_i)>\tau,\\ 1-P_{\mathrm{bg}}(p_i), & \text{otherwise},\end{cases}$$

where $\tau$ is the parameter that controls the background probability threshold.
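A minimal sketch of this computation follows, assuming superpixel nodes have already been grouped into color clusters; the array layout and the default $\sigma_b = 1$ are assumptions of this sketch and follow [10] only loosely.

```python
import numpy as np

def boundary_connectivity(labels, on_boundary, pixels_per_node):
    """Per-cluster boundary connectivity: Len_bnd(Y) / sqrt(Area(Y)).
    labels[i]      : color-cluster id of superpixel node i
    on_boundary[i] : True if node i touches the image border
    pixels_per_node: pixel count n_i of node i
    """
    bndcon = np.zeros(labels.max() + 1)
    for y in range(labels.max() + 1):
        members = labels == y
        area = pixels_per_node[members].sum()      # Area(Y) = sum of n_i
        len_bnd = on_boundary[members].sum()       # Len_bnd(Y)
        bndcon[y] = len_bnd / np.sqrt(area) if area > 0 else 0.0
    return bndcon

def background_probability(bndcon, sigma_b=1.0):
    """P_bg = 1 - exp(-BndCon^2 / (2 * sigma_b^2))."""
    return 1.0 - np.exp(-bndcon ** 2 / (2.0 * sigma_b ** 2))
```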

3.2. Coarse Saliency Map Generation Based on Color Prior

Kim et al. [25] proposed creating an image saliency map from a linear combination of colors in a high-dimensional color space, which does not rely on the position of the salient object.

The histogram feature is an effective measure of saliency. The histogram feature of the i-th superpixel is measured using the chi-square distance to the histograms of the other superpixels and is defined as

$$f_h(i)=\sum_{j=1}^{N}\sum_{b=1}^{B}\frac{\left(h_i(b)-h_j(b)\right)^2}{h_i(b)+h_j(b)},$$

where $b$ indexes the histogram bins and each histogram has $B=8$ bins.

The global color contrast feature of the i-th superpixel is

$$f_c(i)=\sum_{j=1}^{N}\lVert c_i-c_j\rVert,$$

where $\lVert c_i-c_j\rVert$ is the Euclidean distance between the color value $c_i$ of the i-th superpixel node and the color value $c_j$ of the j-th superpixel node; eight color channels (RGB, CIELab, hue, and saturation) are used to calculate the color contrast feature. The local color contrast feature is defined as

$$f_{lc}(i)=\frac{1}{Z_i}\sum_{j=1}^{N}\exp\!\left(-\frac{\lVert p_i-p_j\rVert^2}{2\sigma_p^2}\right)\lVert c_i-c_j\rVert,$$

where $p_i$ is the normalized position of the i-th superpixel node and $Z_i$ is a normalization term. Here, $\sigma_p$ is set according to [16].
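The three color features above can be sketched as follows; the array shapes, the value $\sigma_p = 0.25$ (taken from [16]), and the numerical guard against zero denominators are assumptions of this illustration.

```python
import numpy as np

def histogram_feature(hists):
    """Chi-square distance of each superpixel histogram to all others.
    hists: (N, B) array with B = 8 bins per histogram."""
    n = len(hists)
    f = np.zeros(n)
    for i in range(n):
        num = (hists[i] - hists) ** 2
        den = hists[i] + hists + 1e-12        # avoid division by zero
        f[i] = (num / den).sum()
    return f

def global_contrast(colors):
    """Sum of Euclidean color distances to all other superpixels."""
    diff = colors[:, None, :] - colors[None, :, :]
    return np.linalg.norm(diff, axis=2).sum(axis=1)

def local_contrast(colors, positions, sigma_p=0.25):
    """Color contrast weighted by spatial proximity."""
    dcol = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=2)
    dpos = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
    w = np.exp(-dpos ** 2 / (2 * sigma_p ** 2))
    w /= w.sum(axis=1, keepdims=True)         # normalization term Z_i
    return (w * dcol).sum(axis=1)
```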

For texture and shape features, superpixel regions, histograms of gradients, and singular value features were used. The histogram of gradients uses pixel gradient information to quickly provide appearance features. The singular value feature [26] is based on eigenimages: an image is decomposed into a weighted sum of multiple eigenimages, where each weight is a singular value obtained through singular value decomposition. The eigenimages corresponding to larger singular values determine the overall outline of the original image, whereas those with smaller singular values describe the details. Therefore, in blurred images, the larger singular values carry higher weight.

First, the initial image is divided into 2 × 2, 3 × 3, and 4 × 4 regions, and a threshold is applied to each region. Because changes in color saturation or illumination influence the threshold setting to a certain extent, Otsu's multilevel adaptive threshold [27] is used here to control the ratio between the foreground, background, and unknown regions, with seven-level thresholds for each subarea. After the three threshold saliency maps are fused, we obtain a 21-level saliency map with local thresholds, giving the saliency map better local contrast. Thus, even if a local area is not the most prominent area globally, the most prominent area can still be captured locally. Finally, a global threshold is applied to obtain the coarse saliency map.
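As a hedged illustration, scikit-image's threshold_multiotsu can play the role of the multilevel Otsu threshold inside each grid cell. The grid indexing, the choice of classes = 8 (seven thresholds), and the final normalization are assumptions of this sketch, and the call requires each cell to contain enough distinct intensity values.

```python
import numpy as np
from skimage.filters import threshold_multiotsu

def local_multiotsu_map(saliency, grid=(2, 2), classes=8):
    """Apply a multilevel Otsu threshold inside each grid cell; seven
    thresholds (classes=8) give a seven-level quantization per subarea."""
    out = np.zeros_like(saliency, dtype=float)
    h, w = saliency.shape
    gh, gw = grid
    for r in range(gh):
        for c in range(gw):
            rows = slice(r * h // gh, (r + 1) * h // gh)
            cols = slice(c * w // gw, (c + 1) * w // gw)
            cell = saliency[rows, cols]
            t = threshold_multiotsu(cell, classes=classes)
            out[rows, cols] = np.digitize(cell, t)   # levels 0..classes-1
    return out / (classes - 1)                       # normalize to [0, 1]
```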

3.3. Coarse Saliency Map Generation Based on Center Prior

Lou et al. [28] proposed a color name method based on the center prior, which applies when the salient object is located near the middle of the image. For input images of different sizes, the input image is first scaled to a fixed pixel width to obtain the optimal parameters of the structural elements at a given scale. Subsequently, the color name model provides the im2c function and the mapping matrix, converting the scaled image I from RGB space to color name space C. For RGB images with pixel values in the integer interval [0, 255], the im2c function computes a mapping index from the RGB color values at image coordinates (x, y):

$$\mathrm{idx}(x,y)=\left\lfloor\frac{R(x,y)}{8}\right\rfloor+32\left\lfloor\frac{G(x,y)}{8}\right\rfloor+32^2\left\lfloor\frac{B(x,y)}{8}\right\rfloor+1,$$

where $\lfloor\cdot\rfloor$ denotes rounding down; the im2c function then takes the row of the mapping matrix corresponding to this index to obtain an 11-dimensional vector.

Through the above conversion, the RGB color value of each pixel in I is mapped to an 11-dimensional color name vector. Each element of the vector is a floating-point number in the interval [0, 1], indicating the probability that the pixel belongs to the corresponding color name, and the 11 elements sum to 1. Because only one color name probability map can be obtained per call, the Color Name Space (CNS) method calls the im2c function 11 times to obtain 11 color name probability maps. For convenience of expression, each color probability map is called a color name channel. The entire color name space C is composed of M color name channels, $C=\{C_1, C_2, \ldots, C_M\}$, where $M=11$. To obtain Boolean maps via serialized threshold segmentation, each color channel is normalized to the integer interval [0, 255]. Although each channel is a probability map, it essentially reflects the intensity information of the color name; that is, the color names formed by these channels contain the perceptual color characteristics under a linguistic description.

The color name channels are used to generate Boolean maps as follows. First, a sampling interval is taken in [0, 255] to obtain a set of serialized segmentation thresholds. Assuming that the total number of segmentation thresholds is n, with the j-th threshold denoted as $\theta_j$, each color name channel is segmented using these n thresholds:

$$B_i^j=\mathrm{THRESH}(C_i,\theta_j),\qquad B_i^j(x,y)=\begin{cases}1, & C_i(x,y)\ge\theta_j,\\ 0, & \text{otherwise},\end{cases}$$

where $B_i^j$ is the Boolean map obtained by thresholding color name channel $C_i$; the subscript is the color name channel number, and the superscript is the sequence number of the segmentation threshold. When the salient object area has high color intensity, serialized segmentation of the positive image can recover the object area. Conversely, if the salient object has low color intensity, segmenting the positive image alone would assign the object area to the background; in this case, an inverted map is also required.
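A sketch of the index computation and the serialized thresholding follows; the 32×32×32 lookup-table layout follows the common color-name convention (0-based here rather than the 1-based matrix row above), and the sampling step of 8 is an assumed parameter.

```python
import numpy as np

def im2c_index(rgb):
    """Mapping index into a 32x32x32 color-name lookup table:
    idx = floor(R/8) + 32*floor(G/8) + 32^2*floor(B/8), 0-based."""
    r, g, b = rgb[..., 0] // 8, rgb[..., 1] // 8, rgb[..., 2] // 8
    return (r + 32 * g + 32 * 32 * b).astype(int)

def boolean_maps(channel, step=8):
    """Serialized threshold segmentation of one color name channel that
    has been normalized to [0, 255]; returns positive and inverted maps."""
    thresholds = np.arange(0, 256, step)
    pos = [(channel >= t).astype(np.uint8) for t in thresholds]
    inv = [(channel < t).astype(np.uint8) for t in thresholds]
    return pos, inv
```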

In this way, 2Mn Boolean maps are obtained, comprising the positive and inverted maps of the M color name channels.

Three morphological operations are performed on the serialized Boolean maps generated above, namely, closing, hole filling, and clearing objects that touch the border, to obtain saliency maps based on surroundedness cues.

After the saliency maps of all color name channels are obtained, they are combined by linear averaging to produce a more stable saliency map:

$$\bar{S}=\frac{1}{2M}\sum_{i=1}^{M}\left(S_i^{+}+S_i^{-}\right),$$

where $S_i^{+}$ is the positive saliency map of the i-th color name channel, $S_i^{-}$ is the inverted saliency map of the i-th color name channel, $C_i$ is the i-th color name channel, and $\bar{S}$ is the average saliency map of the M color name channels.
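The morphological post-processing and the linear averaging can be sketched with standard SciPy/scikit-image operations; the structuring-element radius is an assumed parameter of this sketch.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes
from skimage.morphology import binary_closing, disk
from skimage.segmentation import clear_border

def attention_map(boolean_map, radius=3):
    """Closing, hole filling, and clearing border-touching objects."""
    m = binary_closing(boolean_map.astype(bool), disk(radius))
    m = binary_fill_holes(m)
    return clear_border(m).astype(float)

def average_saliency(pos_maps, inv_maps):
    """Linear average over positive and inverted channel saliency maps."""
    maps = [attention_map(b) for b in pos_maps + inv_maps]
    return np.mean(maps, axis=0)
```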

4. Integration Mechanism of the Saliency Map

Useful coarse saliency maps are fused from the coarse saliency maps produced by the above three feature methods, and coarse saliency maps with low detection accuracy are eliminated. The initial nodes determined by different types of prior knowledge affect salient object detection differently, so the feature maps generated by each feature exhibit certain differences. To ensure that effective feature saliency maps are selected when merging within the same level, the two feature saliency maps with the highest similarity are chosen for fusion. This also effectively prevents a low-accuracy feature saliency map from weakening the saliency of salient objects.

Therefore, a novel integration mechanism is defined. Among the three feature saliency maps at each level (Z1, Z2, and Z3), the two most similar maps are selected for fusion, which effectively prevents a method with low or failed detection accuracy from introducing new errors. The competition mechanism is given by

$$(S^{a},S^{b})=\arg\max_{m\neq n}\ \mathrm{sim}\!\left(S_{Z_i}^{m},S_{Z_i}^{n}\right),\quad m,n\in\{1,2,3\},$$

where $S_{Z_i}^{m}$ is the m-th feature saliency map at level $Z_i$ and, similarly, $S_{Z_i}^{n}$ is the n-th feature saliency map at the same level. The two selected saliency maps are then fused as

$$S_{Z_i}=\frac{1}{Z}\left(w_a S^{a}+w_b S^{b}\right),$$

where $S^{a}$ and $S^{b}$ are the saliency maps selected by the first screening, $w_a$ and $w_b$ are weights calculated from $S^{a}$ and $S^{b}$, and $Z$ is the normalization term.

The feature saliency maps are then further refined. The saliency maps of the three levels are adjusted to the original image size, and the levels (Z1, Z2, and Z3) are merged again. To ensure the effectiveness of the feature saliency map, maximally enrich its boundary information, and improve the boundary accuracy of the salient object, the two level maps with the largest mutual difference are selected for fusion. This avoids both the direct fusion of all three level maps, which reduces boundary saliency, and the fusion of two similar maps, which loses the boundary information of the salient object. The selection is performed as

$$(S^{c},S^{d})=\arg\max_{m\neq n}\ \mathrm{diff}\!\left(\tilde{S}_{m},\tilde{S}_{n}\right),\quad m,n\in\{Z_1,Z_2,Z_3\},$$

where $\tilde{S}_{Z_1}$, $\tilde{S}_{Z_2}$, and $\tilde{S}_{Z_3}$ are the feature saliency maps of levels Z1, Z2, and Z3 adjusted to a uniform scale. After this second screening, the two selected feature saliency maps are fused as

$$S_{\mathrm{final}}=\frac{1}{Z'}\left(w_c S^{c}+w_d S^{d}\right),$$

where $S^{c}$ and $S^{d}$ are the two saliency maps selected in the second screening, $w_c$ and $w_d$ are weights calculated from $S^{c}$ and $S^{d}$, and $Z'$ is the normalization term.
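A sketch of the two screenings follows; the Pearson-correlation similarity and the mean-saliency weights are stand-ins, since the exact similarity measure and weight formula are not reproduced here. The same selection routine serves both screenings by flipping `most_similar`.

```python
import numpy as np
from itertools import combinations

def select_pair(maps, most_similar=True):
    """First screening: the two most similar maps of one level;
    second screening: the two most different level maps
    (most_similar=False). Similarity is Pearson correlation here,
    an assumption of this sketch."""
    pairs = list(combinations(range(len(maps)), 2))
    corr = [np.corrcoef(maps[i].ravel(), maps[j].ravel())[0, 1]
            for i, j in pairs]
    k = int(np.argmax(corr) if most_similar else np.argmin(corr))
    return maps[pairs[k][0]], maps[pairs[k][1]]

def weighted_fusion(sa, sb):
    """S = (w_a*S^a + w_b*S^b) / Z, taking each map's mean saliency
    as its weight (an assumed form) and normalizing to [0, 1]."""
    wa, wb = sa.mean(), sb.mean()
    fused = wa * sa + wb * sb
    return fused / (fused.max() + 1e-12)
```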

Here, based on the first fusion, the two saliency maps with the largest difference are selected so that the boundaries of the salient object become richer. The integration does not simply fuse all feature saliency maps. In the first stage, two similar feature maps at the same level are selected for fusion, which effectively prevents the failure of a single-feature method from weakening the salient object after fusion, thereby ensuring the validity of the feature saliency map. The second fusion selects feature saliency maps with significant differences across levels, building on the first stage and ensuring the accuracy of the salient object boundary.

The proposed integration mechanism is based on two considerations:
(1) When fusing feature maps of the same level (Z1, Z2, or Z3), to prevent any single feature method from failing, the two most similar feature maps are selected for fusion; their pixel regions with saliency values close to 1 reinforce each other, thereby ensuring the saliency of these regions.
(2) When fusing feature saliency maps of different levels, building on the first fusion, the two feature saliency maps with the most significant mutual difference are selected to enrich the boundary features of the salient objects; feature redundancy is thereby suppressed to a certain extent, and the boundary accuracy of the salient objects is increased.

5. Experimental Results and Analysis

In this section, a series of experiments is performed on the proposed method. To provide a fair experimental evaluation, we first introduce the evaluation metrics and benchmark datasets. We then compare the proposed algorithm with seven other saliency detection algorithms on three benchmark datasets to illustrate the robustness of the proposed framework.

5.1. Evaluation Metrics and Datasets

In this study, we verified the performance of the proposed algorithm using four evaluation metrics: the precision-recall (PR) curve, the receiver operating characteristic (ROC) curve, the F-measure (curve and score), and the mean absolute error (MAE). The PR curve is widely used in salient object detection and is computed as follows: given a saliency map, we segment the map with thresholds from 0 to 255 and compare each result with the ground truth to generate the precision-recall curve. As a supplement, we also use the F-measure score, which summarizes the combined performance of precision and recall and is calculated as

$$F_{\beta}=\frac{(1+\beta^{2})\times\mathrm{Precision}\times\mathrm{Recall}}{\beta^{2}\times\mathrm{Precision}+\mathrm{Recall}},$$

where $\beta^{2}$ is set to 0.3, according to [29]. The combination of the PR curve and F-measure score is common in existing studies.
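The thresholded precision-recall sweep and the F-measure score can be computed as follows; the input conventions (saliency values in [0, 1], binary ground truth) are assumptions of this sketch.

```python
import numpy as np

def f_measure_curve(saliency, gt, beta2=0.3):
    """Precision, recall, and F-measure over thresholds 0..255."""
    s = (saliency * 255).astype(np.uint8)
    gt = gt.astype(bool)
    curves = []
    for t in range(256):
        pred = s >= t
        tp = np.logical_and(pred, gt).sum()
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(gt.sum(), 1)
        f = ((1 + beta2) * precision * recall /
             max(beta2 * precision + recall, 1e-12))
        curves.append((precision, recall, f))
    return np.array(curves)   # shape (256, 3)
```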

The MAE measures the average difference between the estimated saliency map $S$ and the ground truth $GT$:

$$\mathrm{MAE}=\frac{1}{W\times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|S(x,y)-GT(x,y)\right|,$$

where W and H are the width and height of the given image, respectively.
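Correspondingly, the MAE reduces to a few lines, again assuming both maps are scaled to [0, 1]:

```python
import numpy as np

def mae(saliency, gt):
    """Mean absolute error between saliency map S and ground truth GT."""
    return np.abs(saliency.astype(float) - gt.astype(float)).mean()
```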

In this study, all the methods were compared on three benchmark datasets. The ECSSD dataset [30] contains 1000 images and is composed of many complex scenes, such as animals, trees, and people. The Image Pair dataset [31] consists of 210 challenging images. Finally, the ASD dataset [32] contains 300 images.

5.2. Comparison with Other Methods

To verify the robustness and effectiveness of the proposed framework, we compared the proposed method with seven other state-of-the-art methods: Robust Background Detection (RBD) [10], Manifold Ranking (MR) [17], Saliency Filters (SF) [16], Deformed Graph Label (DGL) [33], Geodesic Saliency (GS) [15], CNS [28], and High-Dimensional Color Transform (HDCT) [25].

The quantitative results of all methods are shown in Figure 2, which presents the PR curves on the three benchmark datasets. As observed, the proposed method achieves the best overall performance on all datasets, particularly on the information-rich ECSSD dataset. Among the comparison methods, RBD, MR, SF, DGL, GS, CNS, and HDCT also achieve strong results but remain inferior to our method. The three benchmark datasets contain a large number of complex scenes, and the experimental results demonstrate the superiority of our method in such scenarios.

Figure 2: PR curves on the benchmark datasets; from top to bottom: ASD, ECSSD, and Image Pair.

Table 1 shows the F-measure scores, that is, the combined performance of precision and recall. The proposed algorithm outperforms the seven methods above.

Table 2 compares the MAE values, where the proposed algorithm achieves the best result. In summary, the proposed algorithm was validated on three benchmark datasets containing a large number of complex scenes, and the results demonstrate its superiority in such scenarios.

5.3. Visual Contrast

A visual comparison of the results is presented in Figure 3. As observed, the proposed algorithm can simultaneously highlight salient regions and suppress irrelevant background regions, as in the second and fifth rows. For salient objects with rich texture features, the competition mechanism between coarse and refined saliency maps can overcome the failure of any single feature map, as in the sixth row. The proposed method correctly labels salient regions and completely suppresses background regions, particularly shadow regions. This is because the proposed graph model not only guarantees that the foreground is not influenced by the surrounding background regions but also enhances the contrast between the foreground and background. The visual comparison further demonstrates the superiority of the proposed method.

6. Conclusions

In this study, we proposed a novel saliency detection framework. In the first stage, we divide the image into three different levels. On the one hand, the shapes and numbers of nodes generated at different image scales differ, which enriches the features of salient objects; on the other hand, this compensates for the details that the feature saliency map may lose in the first fusion. Different types of coarse saliency maps are then constructed using three different types of prior knowledge features. In the second stage, we fully exploit the competition and complementarity between the feature saliency maps by introducing a novel competition mechanism. In the first fusion, the two most similar feature saliency maps within a level are selected, which ensures the validity of the selected features and the saliency of the salient objects. In the second fusion, the feature maps of different levels are first adjusted to a unified scale, and then the two feature saliency maps with the most significant difference are selected for fusion, which reduces feature redundancy and preserves the boundary information of salient objects. The proposed framework overcomes the limitations of previous graph-based methods in complex scenes. We compared the proposed method with seven other state-of-the-art methods, including graph-based and nongraph-based methods. Furthermore, in future studies, the proposed competition mechanism can be applied to nongraph-based methods, such as machine learning-based methods, because it can better exploit their feature maps.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 62075241).