Abstract

Ideal color image segmentation needs both low-level cues and high-level semantic features. This paper proposes a two-hierarchy segmentation model based on merging homogeneous superpixels. First, a region growing strategy is designed to produce homogeneous and compact superpixels in different partitions. Total variation smoothing features are adopted in the growing procedure to locate real boundaries. Before merging, we define a combined color-texture histogram feature to describe superpixels, and a novel objectness feature is proposed to supervise the region merging procedure for reliable segmentation. Both color-texture histograms and objectness are used to measure the similarity between region pairs, and the mixed standard deviation of the combined features is exploited as the stopping criterion of the merging process. Experimental results on the popular Berkeley benchmark dataset show that the proposed model achieves better segmentation performance than other well-known segmentation algorithms.

1. Introduction

Color image segmentation is an important task in image analysis and understanding, and it serves as a critical step in many image applications. The primary purpose of segmentation is to divide an image into meaningful and spatially connected regions (such as object and background) on the basis of diverse properties. Existing segmentation approaches can broadly be divided into four categories: region-based methods, feature-based methods, boundary-based methods, and model-based methods. Most existing image segmentation methods consider only low-level features [14], and it is difficult to obtain ideal segmentation results without high-level visual features. Therefore, many researchers aim to build a bridge between low-level features and human visual cognition. Most segmentation methods that integrate high-level knowledge learn cognitive information from underlying features and obtain the corresponding labels of image elements and textons. The goal is to find high-level summary tags or labels that represent all the underlying features. With these elements and textons, the segmentation can employ affinity computing to decide the pixel labels of semantic objects and regions. This procedure employs a knowledge-assisted multimedia analysis model to bridge the gap between semantics and low-level visual features [5, 6].

Many segmentation methods based on object detection have shown their capability by utilizing low-level visual features [7, 8]. They integrate basic visual features to detect homogeneous regions for image segmentation. Even though these low-level features have shown favorable results, they are ill-suited to many complex scenes, such as those with strong texture, luminance variation, or overlapping objects. To deal with this problem, some researchers have attempted to combine different features [9]. However, most of them integrate these features in a simple way, and few studies evaluate the significance of object features to the overall segmentation. Besides, since most previous segmentation methods lack high-level knowledge about objects, it is difficult for them to discriminate semantic regions in cluttered images.

Objectness is a pixel-level characteristic learned from underlying low-level features, so in this paper we combine it with color and texture to achieve better segmentation. Objectness was proposed by Alexe et al. [10] as a novel high-level property estimated from multiple low-level features, and it is widely used for saliency computing, object detection, cosegmentation, object tracking, and so forth. It was originally designed to measure the likelihood that a window contains a complete object. This measurement employs four cues: multiscale saliency, color contrast, edge density, and superpixels straddling. The four cues have proved useful for separating whole objects from the background. Jiang et al. [11] proposed a strategy for computing regional objectness by averaging the objectness values of salient regions over sliding windows. From the above analysis, we find that objectness is an effective feature for describing the integrity and the semantic regions of objects in an image.

To tackle the issues remaining in existing segmentation methods, we develop a novel color image segmentation method based on merging superpixels, supervised by regional objectness and underlying low-level characteristics. First, we adopt a seed growing strategy for computing superpixels. The seed pixels are selected from the image blurred by total variation (TV) diffusion. Next, we grow these seed pixels into superpixels according to the similarity of color and texture features in local areas. In the merging procedure, we define a combined feature consisting of color, texture, and objectness to measure the similarity between adjacent superpixels, and we merge superpixels accordingly to form the segmentation results.

The main contributions of this paper are summarised as follows: (1) we develop an efficacious superpixel extraction method to detect homogeneous regions with hybrid features, (2) we propose a combined histogram descriptor of color and texture features for identifying semantic regions, and (3) we design a fast objectness computing model that requires fewer sampling windows and integrate it with low-level features for reasonable merging.

The rest of the paper is organized as follows: our segmentation method is shown in Section 2 including superpixels computing, objectness computing, combined feature definition of superpixels, and merging rules. The experiments on the Berkeley Segmentation Datasets are performed in Section 3 and the conclusion is made in Section 4. The segmentation process and intermediate results are shown in Figure 1.

2. Objectness Supervised Superpixels Merging Segmentation

In our method, the segmentation process consists of the following two stages. First, the input image is segmented into hundreds of uniform homogeneous superpixels with hybrid features including RGB channels and diffusion image. Second, we construct a descriptor for superpixels with invariant color-texture histograms as low-level features and define the objectness of all pixels based on sparse sampling of objectness windows as high-level features. Then, we develop the merging rules combined with two levels of features for reasonable segmentation. The detailed procedure of our method is as follows.

2.1. Compact Superpixels Computing

In color image segmentation, computing homogeneous superpixels is an ideal preprocessing step that simplifies the segmentation task. A superpixel is a connected homogeneous image region, and the resulting partition is often called an oversegmentation. Representing an image by hundreds of superpixels instead of many redundant pixels simplifies both image representation and segmentation. There are many strategies for computing superpixels, such as SLIC [12], N-cut [13], watershed [14], and Mean-Shift [15], and different strategies yield superpixels of different properties and quality. Each method has its own advantages; however, none of them performs well in textured areas.

For the segmentation task, preserving real boundaries and extracting homogeneous superpixels are critical to an ideal final segmentation. A texture feature is formed by a group of pixels, so a superpixel needs enough pixels to represent a texture pattern. Keeping superpixel boundaries around complete texture regions is therefore a difficult problem for superpixel algorithms. On the other hand, smaller superpixels help to preserve real and accurate boundaries. An important branch of superpixel extraction methods predefines the number of segments before segmentation, which makes it easy to control the running time and the segmentation quality by adjusting that number. It is also convenient to integrate multiple features in these methods. In contrast, the other, data-driven superpixel extraction methods behave less predictably. We therefore develop our method based on the above analysis: we select seeds in the smooth areas of uniform image grids and grow them into superpixels by similarity computing. Implementation details are as follows.

2.1.1. Texture Smoothing

In order to enhance the descriptive ability of local regions, we detect texture information by integrating the diffusion feature and the original channels into the region growing procedure. Perona and Malik [16] proposed an anisotropic diffusion filtering method for multiscale smoothing and edge detection; this diffusion process is a powerful image processing tool that encourages intraregion smoothing in preference to smoothing across boundaries. The filter estimates the homogeneous regions of an image from the surface degradation and the edge strength of the local structure. To compute texture features, we adopt nonlinear diffusion to extract texture areas and enhance the edges simultaneously. In texture areas, the TV diffusion model makes it easier to smooth regions and detect the real boundaries of the image [17]. Therefore, the TV flow diffusion method is used to blur local areas of the color image before extracting nonoverlapping superpixels. The nonlinear diffusion equation of TV is

$$\frac{\partial u}{\partial t} = \operatorname{div}\left(g(|\nabla u|)\,\nabla u\right). \tag{1}$$

The initial value of $u$ is a single channel of the original image in RGB space, and $t$ is the iteration time. $g$ is the diffusion function defined in [17]:

$$g(|\nabla u|) = \frac{1}{|\nabla u|}. \tag{2}$$

The number of iterations depends on the superpixel scale so that local areas are smoothed properly. A reasonable number of diffusion iterations is in proportion to the radius of the superpixels (≈13 in this paper).
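The smoothing step can be illustrated with a minimal sketch of TV flow on one channel. It is not the scheme of [17]: the explicit 4-neighbor discretization, the regularization constant `eps`, the step size `tau`, and the function names are all assumptions made for illustration. With `tau * 4 / eps <= 1` each update is a convex combination of neighboring values, and the scheme acts as gradient descent on the regularized total variation computed by `tv_energy`.

```python
import math

def tv_energy(img, eps=1.0):
    """Regularized total variation: sum of sqrt(d^2 + eps^2) over 4-neighbor pairs."""
    h, w = len(img), len(img[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                total += math.sqrt((img[y][x + 1] - img[y][x]) ** 2 + eps ** 2)
            if y + 1 < h:
                total += math.sqrt((img[y + 1][x] - img[y][x]) ** 2 + eps ** 2)
    return total

def tv_flow(img, iterations=13, tau=0.2, eps=1.0):
    """Explicit TV flow on a single channel (assumed discretization, not from [17])."""
    h, w = len(img), len(img[0])
    u = [[float(v) for v in row] for row in img]
    for _ in range(iterations):
        nxt = [row[:] for row in u]
        for y in range(h):
            for x in range(w):
                flux = 0.0
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        d = u[ny][nx] - u[y][x]
                        # diffusivity g = 1 / sqrt(d^2 + eps^2): small across strong edges
                        flux += d / math.sqrt(d * d + eps * eps)
                nxt[y][x] = u[y][x] + tau * flux
        u = nxt
    return u

# Toy channel: a vertical step edge (0 vs 100) with small deterministic noise.
noisy_step = [[(0 if x < 4 else 100) + ((7 * y + 3 * x) % 5 - 2) for x in range(8)]
              for y in range(8)]
smoothed = tv_flow(noisy_step)
```

Because the flux across a strong edge saturates, the step between the two halves of the toy image is largely preserved while the small fluctuations inside each half are flattened.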

2.1.2. Superpixels Growing Scheme

In this paper, we adopt a seed growing method to obtain compact superpixels composed of homogeneous pixels. As in SLIC [12] and Turbo [18], the number of superpixels is preset. In [12], seed pixels are selected by choosing the minimum gradient in each cell of a uniform grid, which places the seeds in the smoothest areas of the grid. This is an effective way to find good seed pixels, so we employ this seed extraction strategy and improve it by choosing the seeds in the TV diffusion image as the foundation for growing superpixels.

The smoothed image is divided into a number of blocks, and the gradient minimum point of each block is taken as its seed pixel. The filtered color image provides a robust feature for computing homogeneous superpixels. We first locate the initial seeds at the local minima of the gradient map of the smoothed image within uniform grids; the grid setting follows [12]. In the growing process, the original image and the filtered image are integrated into a combined feature, and the similarity between two pixels is measured by the Euclidean distance between their combined features. To keep the superpixels uniform, the growing procedure is limited to a local region around each seed, whose extent is set by the average radius of the grids. Each initial seed of a superpixel has a corresponding growing area. The growing algorithm is outlined as follows:
(1) Initialize the selected seed pixels as growing areas, and assign a label to each area.
(2) Compute the average feature of each growing area and of its neighbor regions; check the neighbor pixels of the area, and if a neighbor pixel satisfies the similarity criterion, merge it into the area; repeat until no more pixels can be labeled.
(3) Connect all scattered regions smaller than 20 pixels to adjacent and similar regions.
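A minimal sketch of the seed selection and growing steps is given here under simplifying assumptions: the combined feature is reduced to a single scalar per pixel, regions are grown one after another rather than simultaneously, and the names `pick_seed`, `grow`, `radius`, and `threshold` are hypothetical. The small-region cleanup of step (3) is omitted.

```python
from collections import deque

def pick_seed(img, y0, y1, x0, x1):
    """Return the pixel with the smallest squared gradient inside one grid cell."""
    h, w = len(img), len(img[0])

    def grad(y, x):
        gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
        gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
        return gx * gx + gy * gy

    cell = ((y, x) for y in range(y0, y1) for x in range(x0, x1))
    return min(cell, key=lambda p: grad(*p))

def grow(img, seeds, radius, threshold):
    """Grow each seed over 4-connected pixels close to the running region mean."""
    h, w = len(img), len(img[0])
    labels = [[-1] * w for _ in range(h)]
    for k, (sy, sx) in enumerate(seeds):
        if labels[sy][sx] != -1:
            continue  # seed already absorbed by an earlier region
        labels[sy][sx] = k
        total, count = float(img[sy][sx]), 1
        queue = deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if not (0 <= ny < h and 0 <= nx < w) or labels[ny][nx] != -1:
                    continue
                # keep growth local to the seed so superpixels stay uniform in size
                if abs(ny - sy) > radius or abs(nx - sx) > radius:
                    continue
                if abs(img[ny][nx] - total / count) <= threshold:
                    labels[ny][nx] = k
                    total += img[ny][nx]
                    count += 1
                    queue.append((ny, nx))
    return labels

# Toy image: left half dark, right half bright; one grid cell per half.
image = [[10] * 3 + [200] * 3 for _ in range(6)]
seeds = [pick_seed(image, 0, 6, 0, 3), pick_seed(image, 0, 6, 3, 6)]
labels = grow(image, seeds, radius=5, threshold=20)
```

In the toy image the second seed avoids the column next to the edge, because the gradient there is large, and each region stops growing at the intensity step.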

Figure 2 shows the superpixels of some test images. Since precise boundaries are necessary for successful segmentation, we adopt the standard measure of boundary recall. The average boundary precision with regard to the number of superpixels on BSD500 is shown in Figure 3; the smallest distance between ground truth and superpixel boundaries is 2 pixels in the experiment. More superpixels lead to more accurate boundaries, and 900 superpixels are sufficient for the merging segmentation in the next stage.

2.2. Combined Features and Superpixels Merging

A reliable region merging technique is critical for hierarchical segmentation. In this section, the similarity measurement of regions and the corresponding stopping criterion are proposed based on the color-texture features and the objectness of the image. The merging process starts from the primitive superpixels of the image and continues until the termination criterion is met, at which point the segmentation is finished.

To improve the accuracy and dependability of the segmentation, we define a combined feature to describe the oversegmented regions. This feature has two components, His and Obj, which integrate color-texture information and objectness values, respectively. His is a histogram containing the color information and the local color distribution of pixels. As high-level visual information, Obj is the objectness value, that is, the probability that a region belongs to an identifiable object.

2.2.1. Color-Texture Histograms

The number of colors that the human eye can discriminate is limited, no more than 60. Therefore, a quantized image is introduced to compute the color distribution of superpixels for segmentation: the image is quantized to 64 colors by the minimum variance method, which leads to the least distortion and reduces the computing time [19]. After quantization, the connection property among pixels of the same color within a superpixel is computed as the texture feature of the superpixel. Let $Con_i$ denote the connection relation between a pixel $p$ and its neighbor pixel $p_i$; the eight neighbor pixels ($i$ = 0–7) are coded as a local color pattern (LCP). If the color difference between $p$ and $p_i$ is greater than $\sigma$ (the standard deviation of all pixels in the quantized image), then $Con_i$ = 0; otherwise $Con_i$ = 1. After coding all pixels, the LCP values of the pixels with the same color in the superpixel are summed up as the color distribution weight of the corresponding color bin. Thus, the composite color-texture histogram feature (CF) is computed by multiplying the probability of each quantized color by its pixel distribution weight in the superpixel. The bins of the color-texture histogram are defined as follows:

$$His(k) = P(k)\sum_{p \in R,\; c(p) = k} LCP(p), \tag{3}$$

where $R$ is the superpixel, $c(p)$ is the quantized color of pixel $p$, and $P(k)$ is the probability of color $k$ in $R$.
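The construction of the color-texture bins can be sketched as follows. Several simplifications are assumed: colors are scalar intensities looked up in a hypothetical `palette`, the whole patch stands in for both the quantized image and the superpixel, and the LCP value of a pixel is taken to be the number of connected neighbors, which is only one possible coding.

```python
import statistics

def color_texture_histogram(quantized, palette, n_bins):
    """Bin k = probability of color k times the summed LCP of its pixels.

    quantized: 2D list of color indices; palette: index -> scalar intensity.
    """
    h, w = len(quantized), len(quantized[0])
    values = [palette[c] for row in quantized for c in row]
    sigma = statistics.pstdev(values)  # std of all pixels in the quantized patch
    counts = [0] * n_bins
    weights = [0] * n_bins
    for y in range(h):
        for x in range(w):
            c = quantized[y][x]
            lcp = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy or dx) and 0 <= ny < h and 0 <= nx < w:
                        # Con = 1 when the neighbor's color is within sigma
                        if abs(palette[quantized[ny][nx]] - palette[c]) <= sigma:
                            lcp += 1
            counts[c] += 1
            weights[c] += lcp
    n = h * w
    return [(counts[k] / n) * weights[k] for k in range(n_bins)]

# Two dark pixels above two bright pixels: each pixel has one connected neighbor.
his = color_texture_histogram([[0, 0], [1, 1]], palette=[0, 100], n_bins=2)
```

For the toy patch every pixel is connected only to the neighbor of its own color, so each of the two bins receives probability 0.5 times a summed weight of 2.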

2.2.2. Objectness Based on Sparse Sampling of Regions

As mentioned in [10], objectness is an attribute of pixels computed with multiple cues for a set of sampled sliding windows, and the objectness map is biased towards a rectangular intensity response. On its own, objectness cannot discriminate regions, but it provides a potentially useful value for all pixels in the rectangle. Jiang et al. [11] proposed a novel method to assign an objectness value to every pixel based on the definition of objectness.

Inspired by the objectness computing method proposed by Jiang et al. [11], we propose an improved objectness computing method based on sparse sampling of sliding windows for all pixels. We also employ four cues to estimate the objectness of all pixels in an image: multiscale saliency, color contrast, edge density, and superpixels straddling. The multiscale saliency (MS cue) in the original paper [10] is calculated from the spectral residual of the image. This method highlights only a few regions of an image, so it is not suitable for segmentation. To expand the global significance of objectness, we use the histogram-based contrast (HC) [20] saliency map as the MS cue. The HC method defines the saliency values of image pixels using the color statistics of the input image. In the HC model, the saliency of a pixel is defined by its color contrast to all other pixels in the image, and the computation is sped up by sparse histograms. The saliency [20] value of pixel $I_k$ in image $I$ is defined as

$$S(I_k) = S(c_l) = \sum_{j=1}^{n} f_j\, D(c_l, c_j), \tag{4}$$

where $c_l$ is the color value of pixel $I_k$, $n$ is the number of pixel colors after histogram quantization, $f_j$ is the frequency with which pixel color $c_j$ appears in image $I$, and $D(c_l, c_j)$ is the color distance between $c_l$ and $c_j$. In order to reduce the noise caused by the randomness of histogram quantization, the saliency results are smoothed by the average saliency of the nearest neighbor colors.
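As a small illustration of histogram-based contrast, the sketch below assigns each quantized color a saliency equal to its frequency-weighted distance to all other colors. Scalar colors with an absolute-difference distance are an assumption made for brevity, and the smoothing over nearest neighbor colors is left out; `hc_saliency` is a hypothetical name.

```python
from collections import Counter

def hc_saliency(pixels):
    """Histogram-based contrast for a flat list of quantized scalar colors."""
    n = len(pixels)
    freq = {color: count / n for color, count in Counter(pixels).items()}
    # saliency is computed once per color, then looked up for every pixel
    color_saliency = {
        color: sum(f * abs(color - other) for other, f in freq.items())
        for color in freq
    }
    return [color_saliency[p] for p in pixels]

saliency = hc_saliency([0, 0, 0, 10])
```

The rare color receives the larger value, since three quarters of the image differ from it, while the dominant color contrasts with only one quarter of the pixels.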

The color contrast (CC) measures the dissimilarity of a window to its immediate surrounding area; edge density (ED) measures the density of edges near the window borders; superpixels straddling (SS) is a cue for estimating whether a window covers an object. These three cues are described in detail in [10]. With the above four cues, the objectness score of a test window is defined by the Naive Bayes posterior probability in [10]:

$$p(\mathrm{obj}\mid A) = \frac{p(\mathrm{obj})\prod_{cue\in A} p(cue\mid \mathrm{obj})}{\sum_{c\in\{\mathrm{obj},\,\mathrm{bg}\}} p(c)\prod_{cue\in A} p(cue\mid c)}, \tag{5}$$

where $A$ = {MS, CC, ED, SS} is the combined cue set and $p(\mathrm{obj})$ is the prior estimate of the objectness. The specific definitions of $p(cue\mid c)$ and $p(c)$ are described in [10].
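The Naive Bayes combination of the four cues can be sketched in a few lines. The per-cue likelihood values used here are made-up numbers for illustration only; in [10] they come from trained models, and `objectness_score` is a hypothetical helper name.

```python
def objectness_score(likelihood_obj, likelihood_bg, prior_obj=0.5):
    """Naive Bayes posterior that a window contains an object.

    likelihood_obj / likelihood_bg map each cue name to p(cue | obj) and
    p(cue | bg) evaluated for the window under test.
    """
    joint_obj = prior_obj
    joint_bg = 1.0 - prior_obj
    for cue in likelihood_obj:
        joint_obj *= likelihood_obj[cue]
        joint_bg *= likelihood_bg[cue]
    return joint_obj / (joint_obj + joint_bg)

# Illustrative (invented) likelihoods for one window.
score = objectness_score(
    {"MS": 0.8, "CC": 0.6, "ED": 0.7, "SS": 0.9},
    {"MS": 0.2, "CC": 0.4, "ED": 0.3, "SS": 0.1},
)
```

When a cue is equally likely under both classes it leaves the posterior unchanged, so uninformative cues reduce the score to the prior.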

In the objectness computing methods of [10, 11], the sampling windows are selected randomly over scales for the calculation of objectness, and 100000 windows are sampled for training. The other cues are then computed with these windows.

In this paper, we try to sample fewer windows, choosing those that most likely contain objects. Presegmentation results can provide regions for locating object windows with higher probability. We calculate the circumscribed rectangular windows of all presegmentation regions produced by Mean-Shift and take the vertices of these rectangles as the candidate vertices of sliding windows. Next, we collect all the upper left vertices and all the lower right vertices into two sets, respectively: left-up (LU) and right-down (RD).

In order to find the sampling windows most likely to include real objects or homogeneous semantic regions, we take each vertex from LU and pair it with every vertex in RD such that the coordinates of the LU vertex are less than those of the RD vertex. These coupled vertices provide more reasonable coordinates of sampling windows that contain objects. All the sample windows can be seen in Figure 4. We calculate the objectness values of all sampling windows by (5) and sum them by pixel location to obtain the pixel-wise objectness

$$Obj(p) = \sum_{w \in W,\; p \in w} S(w), \tag{6}$$

where $W$ is the set of sample windows, $w$ is a single window in $W$, and $S(w)$ is the score of sample window $w$ computed by (5). The pixel-wise objectness cannot represent the similarity of regions, so we compute the average objectness value of each Mean-Shift segment region as the new objectness value of all its pixels, and the objectness of a superpixel is the average value of all pixels in it.
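The window sampling and the accumulation of window scores into a pixel-wise map can be sketched as follows. The bounding boxes and window scores are invented inputs, inclusive pixel coordinates are assumed, and the helper names are hypothetical; in the method itself the boxes come from the Mean-Shift presegmentation and the scores from (5).

```python
def sample_windows(boxes):
    """Pair upper-left (LU) with lower-right (RD) vertices of region boxes.

    boxes: bounding boxes (x0, y0, x1, y1) of presegmentation regions.
    """
    lu = sorted({(x0, y0) for x0, y0, _, _ in boxes})
    rd = sorted({(x1, y1) for _, _, x1, y1 in boxes})
    # keep only pairs that form a valid window (LU strictly above-left of RD)
    return [(x0, y0, x1, y1) for x0, y0 in lu for x1, y1 in rd
            if x0 < x1 and y0 < y1]

def pixel_objectness(windows, scores, height, width):
    """Sum the score of every window over the pixels it covers."""
    obj = [[0.0] * width for _ in range(height)]
    for (x0, y0, x1, y1), s in zip(windows, scores):
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                obj[y][x] += s
    return obj

def region_mean(obj, pixels):
    """Average objectness over the (y, x) pixels of a region or superpixel."""
    return sum(obj[y][x] for y, x in pixels) / len(pixels)

windows = sample_windows([(0, 0, 1, 1), (1, 1, 3, 2)])
obj_map = pixel_objectness(windows, [0.5, 0.2, 0.3], height=3, width=4)
```

Two region boxes yield three candidate windows here, far fewer than random sampling would use, and pixels covered by several windows accumulate a higher objectness.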

2.2.3. Merging Rules

The region merging method and the stopping criterion are crucial for the segmentation results. We establish an adjacency table and a similarity matrix for all superpixels to carry out the merging procedure, and the similarity between two regions is measured on their combined features, weighting the color-texture term and the objectness term. The weights are set according to two observations: (1) a fixed weight assignment is not well adapted to obtaining the real object regions by the merging process; (2) when the change of a feature is greater, its weight in the similarity computation should be smaller, otherwise overmerging results. Therefore, we define a balance parameter using the two types of features to regulate the similarity computation. The balance parameter depends on the difference of the given variables, the range of that difference, the mean value of the His features of the superpixels, and the standard deviation over the currently merged superpixels. We set a threshold based on this standard deviation to control the merging process. If the similarity between regions A and B is greater than the threshold, the two adjacent regions are merged, the adjacency relations and labels are updated, and the next regions are checked until no pair of regions satisfies the merging criterion. The threshold uses the combined standard deviation of all combined features of the currently merged superpixels.

The merging strategy is described in Algorithm 1. Each time after merging, the combined feature of the new region is updated as follows. First, since the connection relations of all pixels are unchanged in the new region, we recompute the statistics for each bin of quantized colors and update the His feature of the new region using formula (3). Second, the Obj feature of the new region is obtained by computing the average objectness value of all the pixels in the new region.

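Input:
  superpixels, adjacency table of the superpixels
Output:
  The merged result.
(1) For each region A in the adjacency table, let B be a neighbor region of A
(2) if the similarity between A and B is greater than the threshold, then
(3)   merge B into A,
(4)   change the region label of B to the label of A,
(5)   update the histogram feature of the merged region,
(6)   update the objectness feature, the adjacency relations, and the threshold;
(7) end if
(8) If no pair of regions satisfies the merging criterion, end the merge process; otherwise, return to (1);
(9) Merge meaningless and small areas into adjacent regions
(10) Return the relabeled image.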
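The merging loop can be sketched in runnable form under strong simplifications. The similarity used here (histogram intersection mixed with objectness closeness under a fixed `weight`) and the fixed `threshold` are stand-ins chosen for illustration; they are not the adaptive balance parameter or the standard-deviation-based threshold described in the text. Region features are size-weighted averages after each merge, and the final cleanup of small areas is omitted.

```python
def similarity(a, b, weight=0.5):
    """Assumed stand-in: histogram intersection mixed with objectness closeness."""
    hist_sim = sum(min(p, q) for p, q in zip(a["his"], b["his"]))
    obj_sim = 1.0 - abs(a["obj"] - b["obj"])
    return weight * hist_sim + (1.0 - weight) * obj_sim

def merge_regions(regions, adjacency, threshold):
    """Greedily merge adjacent regions until no pair exceeds the threshold."""
    regions = {k: dict(v) for k, v in regions.items()}
    adjacency = {k: set(v) for k, v in adjacency.items()}
    label = {k: k for k in regions}
    merged = True
    while merged:
        merged = False
        for a in sorted(regions):
            for b in sorted(adjacency[a]):
                if similarity(regions[a], regions[b]) <= threshold:
                    continue
                ra, rb = regions[a], regions[b]
                n = ra["size"] + rb["size"]
                # size-weighted update of both feature components
                ra["his"] = [(p * ra["size"] + q * rb["size"]) / n
                             for p, q in zip(ra["his"], rb["his"])]
                ra["obj"] = (ra["obj"] * ra["size"] + rb["obj"] * rb["size"]) / n
                ra["size"] = n
                # rewire the neighbors of b to point at a
                for c in adjacency.pop(b):
                    adjacency[c].discard(b)
                    if c != a:
                        adjacency[c].add(a)
                        adjacency[a].add(c)
                del regions[b]
                for k, v in label.items():
                    if v == b:
                        label[k] = a
                merged = True
                break
            if merged:
                break  # restart the scan with the updated tables
    return label

toy_regions = {
    0: {"his": [1.0, 0.0], "obj": 0.9, "size": 10},
    1: {"his": [0.9, 0.1], "obj": 0.8, "size": 10},
    2: {"his": [0.0, 1.0], "obj": 0.1, "size": 10},
}
toy_adjacency = {0: {1}, 1: {0, 2}, 2: {1}}
final_labels = merge_regions(toy_regions, toy_adjacency, threshold=0.8)
```

In the toy input, regions 0 and 1 share both color distribution and objectness and are merged, while region 2 differs in both components and keeps its own label.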

3. Experimental Results

3.1. Dataset

The proposed segmentation model has been tested on the Berkeley Segmentation Datasets [21, 22] and compared with the human ground truth segmentation results. Specifically, the BSD images were selected from the Corel Image Database with a size of 481 × 321 pixels. The Berkeley Dataset and Benchmark is widely used by the computer vision community for two reasons: (1) it provides a large number of color images with complex scenes, and (2) for each image, multiple ground truth segmentations are provided for the numerical evaluation of the tested segmentation algorithms.

3.2. Visual Analysis of Segmentation Results

For a subjective visual comparison, we selected 5 segmentation results of test images and compared them with three classic algorithms (JSEG [23], CTM [24], and SAS [25]). The number of superpixels of our method is preset for the comparison; the JSEG [23] algorithm needs three predefined parameters: quantized colors = 10, number of scales = 5, and merging parameter = 0.78; CTM [24] is run with its own parameter setting; the SAS [25] method is implemented with default parameters. From the results shown in Figure 5, it is clear that the proposed method performs better in detecting complete objects in the images (such as the flowers, zebra, cheetah, person, and house in the test images). For all five test images, our method shows clear and complete contours of the primary targets. The SAS method is more accurate in homogeneous region detection than the other methods, and it produces little oversegmentation except in complex areas. CTM performs better in texture regions but loses the completeness of the objects. JSEG is also a texture-oriented algorithm and has the same weakness as CTM. Thus, the other three methods produce more oversegmented and meaningless regions on the test images in Figure 5. The region number versus the number of iterations in the merging process of some test images is shown in Figure 6, and the region number versus the parameter in the merging process of some test images is shown in Figure 7.

Figure 8 displays fourteen segmentation results from test images. These results cover different types of images, such as animals, buildings, persons, vehicles, and planes with various texture features. The segmentation results show that the proposed growing-merging framework can segment the images into compact regions and restrain over- and undersegmentation.

3.3. Quantitative Evaluation on BSD300

The comparison is based on four quantitative performance measures: probabilistic rand index (PRI) [26], variation of information (VoI) [27], global consistency error (GCE) [21], and boundary displacement error (BDE) [28]. PRI compares the segmented result against a set of ground truths by evaluating the relationships between pairs of pixels as a function of the variability in the ground truth set. VoI defines the distance between two segmentations as the average conditional entropy of one segmentation given the other, and GCE is a region-based measure that evaluates the consistency of segmented regions. BDE measures the average displacement error of boundary pixels between two boundary images. Many researchers have used these indexes to compare the performance of image segmentation methods. The values for five example images on four algorithms are shown in Table 1, and the average performance on BSD300 [21] is shown in Table 2. Under this comparison strategy, our algorithm produces boundaries closer to the manual expert segmentations. The number of bins of the combined histograms is a critical parameter for segmentation, so we test several examples with varying numbers of bins in Figure 9 to verify that 64 bins are more suitable for accurate segmentation.
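Of the four measures, variation of information is the simplest to state in code. The sketch below computes it for two labelings given as flat lists of per-pixel labels, using the identity VoI = H(A) + H(B) - 2 I(A; B); the natural logarithm is an assumption, since the logarithm base is not specified in the text, and the function name is hypothetical.

```python
import math
from collections import Counter

def variation_of_information(seg_a, seg_b):
    """VoI between two labelings given as equal-length flat label lists."""
    n = len(seg_a)
    count_a = Counter(seg_a)
    count_b = Counter(seg_b)
    count_ab = Counter(zip(seg_a, seg_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * math.log((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )
    return h_a + h_b - 2.0 * mutual

same_partition = variation_of_information([0, 0, 1, 1], [5, 5, 3, 3])
independent = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The measure depends only on the partition, not on the label values, so relabeled but identical segmentations are at distance zero, and lower values indicate closer agreement with the ground truth.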

3.4. Boundary Precision Comparison on BSD500

Estrada and Jepson [29] proposed a thorough quantitative evaluation and matching strategy to assess the accuracy of boundary extraction. The algorithms are evaluated using an efficient algorithm for computing precision and recall with regard to human ground truth boundaries. Precision and recall are reasonable and equitable measures of segmentation quality because they are not biased in favor of over- or undersegmentation. An excellent segmentation algorithm should yield low segmentation error and high boundary recall. The strategy defines precision and recall to be proportional to the total number of unmatched pixels between two segmentations $S_{source}$ and $S_{target}$, where $S_{source}$ is the set of boundaries extracted from the segmentation results of the computer algorithms and $S_{target}$ is the set of boundaries of the human segmentation provided by BSD500 [22]. A boundary pixel is identified as a true positive when the smallest distance between the extracted boundary and the ground truth is less than a preset threshold in pixels. The precision, recall, and F-measure are defined as follows:

$$P = \frac{\mathrm{Matched}(S_{source}, S_{target})}{|S_{source}|}, \qquad R = \frac{\mathrm{Matched}(S_{target}, S_{source})}{|S_{target}|}, \qquad F = \frac{2PR}{P + R}.$$

The precision and recall measures provide a more reliable evaluation strategy for two reasons: (1) they refine boundaries and avoid duplicate comparison of identical boundary pixels, and (2) they provide dynamic displacement parameters for a comprehensive and flexible evaluation of segmentation performance. In this section, we display the boundary precision and recall curves of our method together with five segmentation algorithms (MS [15], FH [2], JSEG [23], CTM [24], and SAS [25]) in Figure 10. The precision-recall curves characterize the performance in a direct comparison of the segmentation quality provided by the different algorithms. Figure 10 shows that the boundaries produced by our method are closer to the ground truth.
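A simplified sketch of the boundary precision, recall, and F-measure computation is given here. It matches each boundary pixel to the nearest pixel of the other set within a distance tolerance and does not enforce the one-to-one matching used by the benchmark, so it is only an approximation of that protocol; the tolerance value and the function name are assumptions.

```python
def boundary_pr(detected, truth, tol=2.0):
    """Precision, recall, and F-measure for two lists of (y, x) boundary pixels.

    A pixel counts as matched when some pixel of the other set lies at a
    Euclidean distance strictly below tol (no one-to-one assignment).
    """
    def matched(src, dst):
        return sum(
            1 for (y, x) in src
            if any((y - v) ** 2 + (x - u) ** 2 < tol * tol for (v, u) in dst)
        )

    precision = matched(detected, truth) / len(detected) if detected else 0.0
    recall = matched(truth, detected) / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

truth_pixels = [(0, 0), (0, 1), (0, 2), (0, 3)]
detected_pixels = [(1, 0), (1, 1), (5, 5), (5, 6)]
precision, recall, f_measure = boundary_pr(detected_pixels, truth_pixels)
```

In the toy example half of the detected pixels lie near the true boundary and three of the four true boundary pixels are recovered, which shows how spurious boundaries lower precision while missed boundaries lower recall.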

4. Conclusion

In this paper, we propose a novel hierarchical color image segmentation model based on region growing and merging. First, a growing procedure is designed to extract compact and complete superpixels from the TV diffusion filtered image. Second, we integrate the color-texture and objectness information of superpixels into the merging procedure and develop effective merging rules for ideal segmentation. In addition, by using a histogram-based color-texture distance metric, the merging strategy can obtain complete objects with smooth boundaries and locate the boundaries accurately. The subjective and objective experimental results on natural color images show good segmentation performance and demonstrate that the objectness feature gives the proposed method a better discriminative ability on complex images. In the future, we will focus on extending objectness to different kinds of segmentation strategies.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

Haifeng Sima acknowledges the support of the Key Projects of National Natural Foundation, China (U1261206), National Natural Science Foundation of China (61572173 and 61602157), Science and Technology Planning Project of Henan Province, China (162102210062), the Key Scientific Research Funds of Henan Provincial Education Department for Higher School (15A520072), and Doctoral Foundation of Henan Polytechnic University (B2016-37).