Abstract

Natural image segmentation is an important topic in digital image processing, and it can be addressed with clustering methods. In this paper we present an SOM-based k-means method (SOM-K) and a saliency map-enhanced variant (SOM-KS). In SOM-K, pixel features consisting of intensity and the $L^*u^*v^*$ color space are used to train an SOM; the resulting prototype vectors are filtered with the hits map and then clustered by k-means. The variant, SOM-KS, adds a modified saliency map to improve the segmentation performance. Both SOM-K and SOM-KS segment the image under the guidance of an entropy evaluation index. Compared to SOM-K, SOM-KS produces a more precise segmentation in most cases by segmenting an image into a smaller number of regions. At the same time, the salient object of an image stands out, while other minor parts are restrained. The computational load of SOM-K and SOM-KS is compared with that of J-image-based segmentation (JSEG) and k-means, and the segmentations are evaluated against JSEG and k-means with the entropy index. It is observed that SOM-K and SOM-KS, being unsupervised methods, can achieve better segmentation results with less computational load and no human intervention.

1. Introduction

As a basis of high-level image analysis tasks such as object recognition, image retrieval, and scene understanding, image segmentation is an important subject in image processing. Natural image segmentation in particular is considered a difficult task, because no a priori knowledge is available. In such cases, salient objects are crucial for further image retrieval or scene understanding.

In an 8-bit RGB color space, each pixel in an image is represented by one of $2^{24}$ combinations. Image segmentation can be considered a kind of clustering, which groups similar pixels into the same cluster. A successful segmentation depends on good choices of similarity measure, feature description of the image, evaluation of the segmentation, and the a priori information available. A natural image, unlike a GIF animation image, a magnetic resonance image, or any other man-made image for which advance information is usually available, could be anything in the colorful, natural world. It contains a large quantity of mixed elements with different colors and lighting, projected from a 3-dimensional world onto a 2-dimensional plane. Humans recover the world from these compressed 2-dimensional images, although they are sometimes deceived by optical illusions. When such images are submitted to a machine, there is always a tradeoff among segmentation results, computational complexity, and time cost, no matter what method is selected.

Researchers have investigated natural image segmentation for many years. According to [1], segmentation methods are mainly classified as follows: feature-space-based techniques, image-domain-based techniques, and physics-based techniques. Clustering, histogram thresholding, and so forth are feature-space segmentation techniques; split-and-merge, region growing, edge-based and neural-network-based methods are image-domain segmentation techniques [1]. Some researchers categorized the segmentation methods based on two basic properties of the pixels in relation to their local neighborhood: discontinuity and similarity [2]. Corresponding methods based on discontinuity property of the pixels are called boundary-based methods, while those based on similarity property are called region-based methods.

In fact, with the crossover of different disciplines, it is impossible to "segment" the segmentation techniques cleanly, and each method can be placed in different categories according to its different features. For example, the neural-network method is, in fact, a kind of learning/training process that touches every aspect of segmentation, and it makes use of feature or domain information of an image too. Methods involving multiple disciplines usually get better results. In recent years, methods like JSEG [3], mean shift [4], and normalized cuts [5] have achieved certain success and are often used as benchmarks for natural image segmentation. Mean shift [4] detects modes of the probability density of colors occurring jointly in the image and color domains and performs well on images with strong variations of density. Shi and Malik [5] treated segmentation as a graph partitioning problem. A weighted undirected graph is constructed by taking the pixels as the nodes of the graph, and the graph is partitioned by optimizing the normalized cut criterion based on the computation of eigenvalues. JSEG [3] makes a multiscale analysis in the image domain of a clustering map obtained from quantization in the color domain. Consisting of two independent stages, color quantization and spatial segmentation, it is an unsupervised segmentation method that is especially robust when applied to scenes where texture predominates.

Our work in this paper is motivated, on one hand, by the aim of improving natural image segmentation performance, that is, reaching a good tradeoff among segmentation results, time complexity, and algorithm complexity. On the other hand, an oversegmented or undersegmented natural image slows down subsequent image retrieval or scene understanding. Enhancing the salient objects and weakening the minor parts during segmentation is therefore required to reduce the time of the following processing.

This paper presents a clustering method, SOM-K, which is based on the self-organizing map (SOM) and k-means. The feature vectors of the pixels are first used to train an SOM neural network, which places similar input feature vectors near each other topologically; the output prototype vectors of the SOM are then clustered by the k-means method. A second proposed method with a saliency map feature (SOM-KS) further improves the performance of SOM-K. Both SOM-K and SOM-KS are guided by an entropy index for image segmentation evaluation to select the best segmentation. To our knowledge, the combination of SOM and k-means first appeared in [6], where a k-means clustering method with the Davies-Bouldin (DB) validity index is applied to SOM prototype vectors to segment remote sensing images, and the results are reported to be acceptable and efficient. Although there was no further evaluation of the image segmentation results and the method was only applied to remote sensing images, it was still appealing to our research. In [7], the k-means clustering method was modified and combined with an SOM to process $L^*a^*b^*$ color space images; that work demonstrated some image segmentations and comparisons of different training parameters for the SOM. In this paper, the method SOM-K(S), which includes SOM-K and SOM-KS, implements the SOM and k-means in a different way. It balances segmentation results against time cost with less complexity.

The main contributions of our proposed method are the following.

(1) SOM-K, a new unsupervised natural image segmentation method based on SOM and k-means. Intensity and the $L^*$, $u^*$, and $v^*$ values of a color image are taken as features to train an SOM network. The output prototype vectors are first filtered by the hits map and then clustered by the k-means method. The best image segmentation result is selected according to the entropy-based segmentation evaluation method. Experiments show the method to be robust for natural color image segmentation.

(2) A modified saliency map. The Itti-Koch visual attention model [8] is modified to be more efficient for our application. A Gaussian pyramid is constructed from the original image (level 0) for levels 1 to 5, and a group of orientation operators is applied to each level to get the orientation saliency map. We follow [8] in using intensity, orientation, and the color components $R$, $G$, $B$, and $Y$, and adopt the contrast-based image attention model [9] within a 3 × 3 window. After that, all saliency maps are resized to the original size and combined into a single saliency map.

(3) SOM-KS, an unsupervised natural color image segmentation method guided by the image saliency map. Unlike other saliency map-based segmentation methods (refer to Section 2.2 for details), the modified saliency map information is directly combined with intensity and $L^*u^*v^*$ to segment a natural color image through SOM and k-means in an automatic manner. This method enhances the attractive objects in the image and restrains the less salient parts, which can reduce the workload of further processing.

The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 describes our proposed method in detail. Section 4 analyzes the performance and compares it with other methods. Section 5 concludes the paper.

2. Related Work

2.1. Clustering and Image Segmentation

Clustering methods provide a different view of image segmentation. However, direct clustering methods like k-means and its variants are not acceptable, considering their computational cost and the need to specify the cluster number k a priori. SOM, with properties such as input space approximation, topological ordering, and density matching, allied with the simplicity of the model and the ease of implementation of the learning algorithm, is a promising clustering tool [10]. SOM is also helpful for visualization, cluster extraction, and data mining, and it has proved successful for high-dimensional data, where traditional methods may often fail or be insufficient.

A simple SOM is rarely applied directly to image segmentation. Some researchers modified and expanded the typical SOM [11], while others combined SOM with other methods [6, 7, 10, 12–15]. Araújo and Costa [11] presented a new SOM with a variable topology for image segmentation. The proposed fast convergent network is capable of color-segmenting images satisfactorily with a self-adaptive topology. In [10], an SOM-based clustering method was applied to the spectral data of remotely sensed images. Taking the activity level (hits) information into account, the zero-hit nodes and heterogeneous nodes are filtered first; then, the CDbw (composing density between and within clusters) clustering index is applied to the dendrogram to get the best cluster number. Although the target images are simple compared with natural images and some a priori information could be obtained, their ideas have been enlightening to natural image segmentation research. Besides applying SOM directly to color components, others used derived features in SOM [12, 14, 15]. Reference [12] presented an unsupervised image segmentation framework, CTex, which is based on the adaptive inclusion of color and texture in the process of data partition. In [14], a fully automatic three-level clustering algorithm for color-texture segmentation was presented, in which SOM is used to identify the number of components and initialize the Gaussian mixture model. Experimental results indicate that the algorithm is efficient and competitive with the popular CTex and JSEG algorithms.

Inspired by [16], some SOM-based clustering methods focus on visualizing the input data by analyzing SOM-derived parameters [17, 18]. In [17], data topology is integrated into the visualization of the SOM, which provides a more elaborate view of the cluster structure than existing schemes. The prototypes are often combined with hit numbers to implement automatic detection of clusters [18].

Recently, researchers have explored color image segmentation with methods such as distance metric learning [19], contour deformation and region-based methods [20], morphological clustering [21], watershed variants [22, 23], edge information [24], local features measured by Gabor filters and clustered by expectation maximization (EM) [25], mean-shift variants [26], an ant colony-fuzzy $c$-means hybrid [27], an enhanced gradient network [28], and dynamic region growth/multiresolution merging [29], all of which have broadened the road ahead.

2.2. Saliency Map and Image Segmentation

In most cases, the aim of image segmentation is object recognition, image retrieval, scene understanding, and so forth; segmentation serves as a necessary first step of high-level, object-based applications. Therefore, correct segmentation of the salient objects in the image is more important than segmenting other minor parts correctly. It is an interesting topic covered by many researchers. Object recognition in an image follows either a top-down or a bottom-up approach. The former needs a priori knowledge at the top level, such as a face detector or a human body detector, to facilitate the recognition, while the latter deals with natural images for which no further information is available.

Itti et al. introduced their bottom-up visual attention model [8], inspired by the behavior and the neuronal architecture of the early primate visual system. In their method, multiscale image feature maps of color, intensity, and orientation are extracted, and local spatial contrast is estimated for each feature at each location, providing a separate conspicuity map for each feature. These maps are combined into a single topographical saliency map that guides the focus of attention in a bottom-up manner. However, the high computational cost and the parameter selection are still drawbacks of the Itti-Koch model. Other saliency map models were developed afterwards. In contrast to the Itti-Koch attention model, which derives attention from the spatial location hypothesis, Sun and Fisher presented mechanisms that are object-driven as well as feature-driven. They suggested that object-based and space-based attention can be integrated by using grouping-based salience to deal with dynamic visual tasks [30]. A spectral residual (SR) approach based on the Fourier transform was proposed in [31]; its advantages are that it does not rely on parameters and that it detects salient objects rapidly. Hu et al. [32] proposed a different method that extracts visually attentive regions in images using subspace estimation and analysis techniques. Ma and Zhang proposed a feasible and fast approach to attention area detection in images based on contrast analysis [9]. Some researchers also focused on saliency models for video [33]. Rather than trying to model human attention like traditional approaches, [34] proposed a new bottom-up validated stochastic model to estimate the probability that an image part is of interest. Reference [35] presented another method that calculates the spatiotemporal saliency map of an image from its quaternion representation. Overall, some methods consider saliency over multiscale images, while others work on a single scale. In general, most methods use the local contrast of image regions with their surroundings across multiscale images, using one or more of the features of color, intensity, and orientation.

As a basic step in image processing, image segmentation and object extraction are also facilitated by the saliency map. On one hand, some authors defined their own saliency maps for their research [36, 37]. On the other hand, some researchers exploited typical saliency map models, like the Itti-Koch model [38–42] or the spectral residual approach [43]. With the saliency map serving as guidance for image processing and object detection [44], much research has been carried out and many results achieved. Applications like image retrieval [45–48], image retargeting [49], image content analysis [50], image fusion [51, 52], and image quality assessment [53] are all more or less based on the saliency map.

3. Proposed Method

In our proposed method, the features of a color image, with or without the saliency map, are first used to train an SOM neural network; the output prototype vectors are then filtered by the hits map and clustered by k-means under the guidance of an entropy-based image segmentation evaluation index. The method also includes the necessary preprocessing and postprocessing. The detailed flowchart of the SOM-KS method is shown in Figure 1.

3.1. Self-Organizing Map

SOM, first put forward by Kohonen [54], is a widely used unsupervised artificial neural network. The map is a group of node units represented by prototype vectors, usually lying in a 2-dimensional space, though nodes are occasionally arranged in a one-dimensional or higher-dimensional space. These units are connected to adjacent units by a neighborhood function. Prototype vectors are initialized with random or linear methods and are "folded" in the 2-dimensional space. Then, they are trained iteratively with randomly selected input samples, either sequentially or in batches, and updated according to the neighborhood function. After training, the prototype vectors become stable and have "unfolded" themselves in the 2-dimensional map. The typical features of SOM are topology visualization of the input patterns and representation of a large number of input patterns ($p \times m$, say, where $p$ is the number of samples and $m$ the dimension of the input pattern) with a small number of nodes ($n \times m$, where $n$ is the number of nodes or prototype vectors). The most important attribute of SOM is that input patterns that are similar in the input space are also near each other topologically in the output space, the node map.

According to Kohonen [54], for each node $i$ there exists a prototype vector $\mathbf{w}_i = (w_{i1}, w_{i2}, \ldots, w_{im})$. For each input sample $\mathbf{x}$, a winner node $c$ is chosen using the similarity rule

$$c = \arg\min_i \{\lVert \mathbf{x} - \mathbf{w}_i \rVert\}, \tag{1}$$

which means the same as

$$\lVert \mathbf{x} - \mathbf{w}_c \rVert = \min_i \lVert \mathbf{x} - \mathbf{w}_i \rVert, \tag{2}$$

where $\lVert \cdot \rVert$ represents the Euclidean distance. The weight (prototype vector) of the winner node $c$, together with the weights of its neighbor nodes, is updated according to the following equation:

$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + h_{ci}(t)\,[\mathbf{x}(t) - \mathbf{w}_i(t)], \tag{3}$$

where $t$ indicates the iteration of the training process, $\mathbf{x}(t)$ is the input sample of the current iteration $t$, and $h_{ci}(t)$ is the neighborhood function of the winner node $c$. The factor $h_{ci}(t)$ in the last term of (3) is a decreasing function of the iteration time $t$ and of the distance between node $i$ and the winner node $c$; it combines the learning rate $\alpha(t)$ and the neighborhood function $h$:

$$h_{ci}(t) = \alpha(t)\, h(\lVert \mathbf{r}_c - \mathbf{r}_i \rVert, t), \tag{4}$$

where $\alpha(t)$ is the learning rate, and $\mathbf{r}_i$ and $\mathbf{r}_c$ are the positions of node $i$ and the winner node $c$ in the topological map, respectively.
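
The winner search of (1) and the update of (3) and (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Gaussian form of the neighborhood function, the `radius` parameter, and the function names are not taken from the paper, which leaves $h$ unspecified.

```python
import math

def winner(x, prototypes):
    """Index c of the prototype closest to x in Euclidean distance, as in Eq. (1)."""
    return min(range(len(prototypes)), key=lambda i: math.dist(x, prototypes[i]))

def som_step(x, prototypes, positions, alpha, radius):
    """One sequential update, Eq. (3), with an ASSUMED Gaussian neighborhood."""
    c = winner(x, prototypes)
    for i, w in enumerate(prototypes):
        d = math.dist(positions[c], positions[i])            # map distance ||r_c - r_i||
        h = alpha * math.exp(-d * d / (2.0 * radius ** 2))   # h_ci(t), Eq. (4)
        prototypes[i] = [wk + h * (xk - wk) for wk, xk in zip(w, x)]
    return c
```

In a full training loop, `alpha` and `radius` would both shrink with the iteration count, so that late updates affect only the winner and its closest neighbors.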

Also, a more effective SOM can be trained in batches [54].

(1) For the initial prototype vectors, take, for instance, the first $K$ training examples, where $K$ is the number of nodes (prototype vectors).

(2) For each map unit $i$, collect a list of copies, $N_i$, of all those training samples $x$ whose nearest prototype vector belongs to unit $i$.

(3) Take for each new prototype vector the mean over the union of the lists in $N_i$.

(4) Repeat from step (2) a few times.

This algorithm is particularly effective if the initial values of the prototype vectors are already roughly ordered, even if they do not yet approximate the distribution of the input samples. Note that the above algorithm contains no learning rate parameter; therefore, it has no convergence problem and yields more stable asymptotic values for the $\mathbf{w}_i$ than the original algorithm. In particular, a few iterations of this algorithm will usually suffice [54].
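
One pass of the batch procedure might look like the following sketch. It reads the union in step (3) as pooling the sample lists of all units within a fixed map distance of unit $i$; that reading, the `radius` parameter, and the function name are assumptions made for illustration.

```python
import math

def batch_som_epoch(samples, prototypes, positions, radius=1.0):
    """One batch pass: each new prototype is the mean of the samples won by nearby units."""
    lists = [[] for _ in prototypes]
    for x in samples:                                    # step (2): per-unit sample lists
        c = min(range(len(prototypes)), key=lambda i: math.dist(x, prototypes[i]))
        lists[c].append(x)
    updated = []
    for i, w in enumerate(prototypes):                   # step (3): mean over the pooled lists
        pooled = [x for j, xs in enumerate(lists)
                  if math.dist(positions[i], positions[j]) <= radius
                  for x in xs]
        if pooled:
            updated.append([sum(col) / len(pooled) for col in zip(*pooled)])
        else:
            updated.append(list(w))                      # no samples nearby: keep the old value
    return updated
```

No learning rate appears anywhere in the function, which matches the remark above that the batch algorithm has no such parameter.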

For every input sample, there exists one node with maximal output. The node is said to be hit by the input sample. The number of input samples hitting a node is named as the hits value of the node, which forms a hits map. The larger the hits value, the more input samples are represented by the node. According to the attribute of SOM, except those unhit nodes, every node of SOM represents a group of input feature patterns and the topology of the nodes in a 2-dimension SOM map shows the topology of a multi-dimensional space within input feature patterns, so the input feature patterns (pixels for image) can be clustered through clustering the nodes themselves. A common idea of the SOM prototype vectors and hit map is shown in Figure 2.
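
As a small illustration of the hits map, the sketch below counts how many samples select each node as their winner and then keeps only the nodes that were hit. The function names are hypothetical, and the nearest prototype in Euclidean distance is used as the winner.

```python
import math
from collections import Counter

def hits_map(samples, prototypes):
    """Count, for every node, how many samples select it as their winner."""
    counts = Counter(
        min(range(len(prototypes)), key=lambda i: math.dist(x, prototypes[i]))
        for x in samples
    )
    return [counts.get(i, 0) for i in range(len(prototypes))]

def drop_unhit(prototypes, hits):
    """Keep only nodes hit by at least one sample, paired with their original index."""
    return [(i, w) for i, (w, h) in enumerate(zip(prototypes, hits)) if h > 0]
```

Keeping the original node index alongside each surviving prototype makes it possible to map cluster labels back to the pixels later.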

3.2. k-Means and Image Evaluation

After the large quantity of pixel data has been projected, simply and quickly, onto a small group of nodes in a 2-dimensional space, a typical k-means method is adopted to cluster the prototype vectors. Clustering the SOM prototype vectors instead of directly clustering the data is a computationally effective approach [16].

As an unsupervised learning method, clustering is either hierarchical or partitional. The former can be agglomerative (bottom-up) or divisive (top-down). The latter, partitional clustering, decides all clusters at once. As a typical partitional clustering method, k-means assigns each point to the cluster with the nearest center. The main steps of a standard k-means algorithm are the following [55].

(1) Set the number of clusters to k, and either randomly generate k clusters and determine their centers or directly generate k random points as cluster centers.

(2) Assign each point to the nearest cluster center, usually measured by Euclidean distance.

(3) Recalculate the cluster centers.

(4) Repeat steps (2) and (3) until the convergence criterion is met.
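
The four steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the initial centers are drawn from the data points, "convergence" is taken to mean that the assignments stop changing, and the `iters` and `seed` parameters are assumptions.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Standard k-means on a list of equal-length vectors; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # step (1): random initial centers
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]                    # step (2): nearest center
        if new_labels == labels:                          # step (4): assignments are stable
            break
        labels = new_labels
        for j in range(k):                                # step (3): recompute the centers
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

Because the result depends on the random initial centers, calling the function with several different `seed` values is one simple way to expose the run-to-run variation.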

The main advantage of the k-means algorithm is its simplicity. Its disadvantages are heavy computation when the amount of data is large and run-to-run variation: it may not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intracluster variance, but it does not ensure that the result is a global minimum of variance. To reduce the instability caused by random assignments, k-means is run at least ten times. Because the number of prototype vectors is very small, the computational cost is not a problem. With the standard k-means algorithm, the prototype vectors are clustered into 2 to $\lceil\sqrt{n}\rceil$ clusters ($n$ is the number of SOM nodes), and the image segmentation result for each cluster number is obtained.
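
As a quick check of the range of cluster numbers, the helper below (a hypothetical name) lists the values from 2 to the ceiling of the square root of the node count. For the 9 × 9 map used in Section 3.4, with 81 nodes, it gives 2 through 9.

```python
import math

def candidate_cluster_counts(n_nodes):
    """Cluster numbers from 2 up to ceil(sqrt(n_nodes)), inclusive."""
    return list(range(2, math.ceil(math.sqrt(n_nodes)) + 1))
```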

There are several metrics of cluster evaluation for deciding the best cluster number, like the DB index [56] and the CDbw index [57]. The DB index has been commonly used to estimate the best number of clusters in remotely sensed image clustering [6] or employed in combination with another index in an unsupervised classification method [58]. As stated in [59], the DB index is suitable only for spherical clusters and is sensitive to noisy points. In practice, it is better to use the DB validity index values as a guideline rather than as absolute truth [16, 60]. CDbw, on the other hand, puts emphasis on the geometric features of clusters and handles arbitrarily shaped clusters efficiently. This is achieved by representing each cluster by a certain fixed number of representative points rather than a single center point. It was reported to be a reliable index that performs favorably in all cases, selecting, independently of the clustering algorithm, the scheme that best fits the data under consideration [57].

Besides, the results of image clustering can be evaluated directly on the segmentation results with an image segmentation evaluation index, like the entropy-based index [61] or the quantitative index [62]. A lower quantitative value or entropy value indicates a better segmentation. A typical quantitative image evaluation method is the $Q$ index, defined as

$$Q(I) = \frac{1}{10000\,(N \times M)} \sqrt{R} \times \sum_{i=1}^{R} \left[ \frac{e_i^2}{1 + \log A_i} + \left( \frac{R(A_i)}{A_i} \right)^2 \right], \tag{5}$$

where $N \times M$ is the size of image $I$, $R$ is the number of regions of the segmented image, $A_i$ is the area (as measured by the number of pixels) of the $i$th region, and $e_i$ is the sum of the Euclidean distances between the RGB color vectors of the pixels of region $i$ and the color vector attributed to region $i$ in the segmented image. $R(A_i)$ represents the number of regions having an area equal to $A_i$. The $Q$ index was reported to be an effective guide in tuning segmentation algorithms [62]. Compared to the quantitative evaluation, the entropy method [61] is based on information theory instead of empirical analysis and was reported to be able to indicate local minima over a wider range of values than the quantitative method, which cannot distinguish local minima when the number of regions is large. For region $j$ of a segmented image and a feature value $m$, $L_j(m)$ denotes the number of pixels in region $j$ that have feature value $m$ in the original image, and $V_j$ denotes the set of all possible feature values in region $j$. $S_I$ and $S_j$ denote the areas of image $I$ and region $j$, respectively. The entropy of region $j$ is defined as [61]

$$H(R_j) = -\sum_{m \in V_j} \frac{L_j(m)}{S_j} \log \frac{L_j(m)}{S_j}. \tag{6}$$

The expected region entropy of image $I$ is

$$H_r(I) = \sum_{j=1}^{N} \frac{S_j}{S_I} H(R_j). \tag{7}$$

The layout entropy is defined as

$$H_l(I) = -\sum_{j=1}^{N} \frac{S_j}{S_I} \log \frac{S_j}{S_I}. \tag{8}$$

The entropy index of the current segmentation is

$$E = w_l \ast H_l(I) + w_r \ast H_r(I). \tag{9}$$
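
To make (6) through (9) concrete, the sketch below computes the entropy index from a flat list of region labels and a flat list of per-pixel feature values. Several choices here are assumptions rather than details from the paper: the natural logarithm, a single discrete feature per pixel, the function name, and default weights taken from the values 0.8 and 0.2 used in this work.

```python
import math
from collections import Counter, defaultdict

def entropy_index(labels, features, w_l=0.8, w_r=0.2):
    """Return (E, H_l, H_r) for per-pixel region labels and feature values."""
    total = len(labels)                                   # S_I, the image area
    by_region = defaultdict(list)
    for region, value in zip(labels, features):
        by_region[region].append(value)
    h_r = 0.0   # expected region entropy, Eq. (7)
    h_l = 0.0   # layout entropy, Eq. (8)
    for values in by_region.values():
        area = len(values)                                # S_j
        share = area / total                              # S_j / S_I
        region_entropy = -sum((n / area) * math.log(n / area)
                              for n in Counter(values).values())   # Eq. (6)
        h_r += share * region_entropy
        h_l -= share * math.log(share)
    return w_l * h_l + w_r * h_r, h_l, h_r                # Eq. (9)
```

Splitting an image into more regions lowers the first returned component's region term and raises its layout term, which is why the two weights trade oversegmentation against undersegmentation.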

The region entropy generally decreases with the number of regions, while the layout entropy increases with the number of regions. Hence, the two weights $w_r$ and $w_l$ can be used to counteract the effects of oversegmenting or undersegmenting [61]. Because we aim at salient objects in the images, $w_l$ and $w_r$ are set to 0.8 and 0.2, respectively, to favor undersegmentation. According to our experiments, the entropy-based index with different weights on $H_r$ and $H_l$ generates fewer segmented regions than the quantitative method, which tends to oversegment. In fact, for most natural images in our experiments, the entropy-based evaluation nearly always leads to 2-cluster results with fewer regions, which makes sense for most natural images. As for the DB index and the CDbw index, they are suited to cluster evaluation rather than image evaluation. The entropy-based image evaluation index is therefore adopted.

3.3. Modified Saliency Map

As discussed in Section 1, salient objects are crucial for further image retrieval or scene understanding. In many cases an image is oversegmented, and a saliency map may provide valuable information. It can be observed in Section 1 that for some specific applications, a modified saliency map is a good choice. In this paper, to enhance salient regions in a natural color image and avoid breaking an object into several regions, the typical Itti-Koch model is modified as follows.

At first, the intensity image $I$ is obtained from the $r$, $g$, and $b$ components of the original image. $I$ is calculated as

$$I = \frac{r + g + b}{3}. \tag{10}$$

For those pixels with intensity less than 10% of its maximum over the entire image, the $r$, $g$, and $b$ values are set to zero, because hue variations are not perceivable at very low luminance (and hence are not salient).

$R$, $G$, $B$, and $Y$ are formed in this way:

$$R = r - \frac{g + b}{2}, \quad G = g - \frac{r + b}{2}, \quad B = b - \frac{g + r}{2}, \quad Y = \begin{cases} \dfrac{r + g}{2} - \dfrac{|r - g|}{2} - b, & \text{if } \dfrac{r + g}{2} - \dfrac{|r - g|}{2} - b > 0, \\[2ex] 0, & \text{if } \dfrac{r + g}{2} - \dfrac{|r - g|}{2} - b \le 0. \end{cases} \tag{11}$$

$Y$ is included here because there exists a so-called "color-double-opponent" system: in the center of their receptive fields, neurons are excited by one color and inhibited by another, while the converse is true in the surround. Such spatial and chromatic opponency exists for the red/green, green/red, blue/yellow, and yellow/blue color pairs [8]. According to our experiments, a saliency map with component $Y$ in addition to $R$, $G$, and $B$ shows better segmentation results than one without it. Five groups of Gaussian pyramid levels $I(\sigma)$, $R(\sigma)$, $G(\sigma)$, $B(\sigma)$, and $Y(\sigma)$ are generated from $I$, $R$, $G$, $B$, and $Y$, respectively, where $\sigma$ is the Gaussian pyramid level index, $\sigma \in \{1, 2, 3, 4, 5\}$. Then, following the contrast-based saliency map of Ma and Zhang [9], the center-neighborhood distances of $I$, $R$, $G$, $B$, and $Y$, which carry the luminance and chromatic saliency information, are calculated within a 3 × 3 window using the Euclidean distance.

As for orientations, a different approach using four line detection masks $R_0$, $R_{45}$, $R_{90}$, and $R_{135}$ [63] is proposed to filter the intensity image $I$, which simplifies the calculation compared with the Gabor filters of Itti-Koch. Masks $R_0$, $R_{45}$, $R_{90}$, and $R_{135}$ correspond to the four directions horizontal, 45°, vertical, and 135°, respectively:

$$R_0 = \begin{bmatrix} -1 & -1 & -1 \\ 2 & 2 & 2 \\ -1 & -1 & -1 \end{bmatrix}, \quad R_{45} = \begin{bmatrix} -1 & -1 & 2 \\ -1 & 2 & -1 \\ 2 & -1 & -1 \end{bmatrix}, \quad R_{90} = \begin{bmatrix} -1 & 2 & -1 \\ -1 & 2 & -1 \\ -1 & 2 & -1 \end{bmatrix}, \quad R_{135} = \begin{bmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{bmatrix}. \tag{12}$$

After each level of the intensity image is filtered by the four direction masks, contrast saliency maps are calculated and combined into an orientation saliency map $O(\sigma)$.
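
The response of the four masks in (12) at one pixel can be sketched as below. The dictionary layout and the function name are illustrative, and the sketch handles interior pixels only; how the image border is treated is not stated in the paper.

```python
MASKS = {
    0:   [[-1, -1, -1], [2, 2, 2], [-1, -1, -1]],
    45:  [[-1, -1, 2], [-1, 2, -1], [2, -1, -1]],
    90:  [[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]],
    135: [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]],
}

def mask_response(image, mask, row, col):
    """Correlate a 3 x 3 mask with the neighborhood centered on an interior pixel."""
    return sum(mask[i][j] * image[row - 1 + i][col - 1 + j]
               for i in range(3) for j in range(3))
```

On a patch containing a one-pixel-wide horizontal line, the horizontal mask responds strongly while the other three masks cancel out.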

At last, all saliency maps are resized to the original image size (level 0) and combined into a single saliency map. To enhance the highly salient regions and restrain the low-salience regions, the saliency map is transformed by a square power:

$$S = \left( \frac{I(\sigma) + R(\sigma) + G(\sigma) + B(\sigma) + Y(\sigma) + O(\sigma)}{6} \right)^2. \tag{13}$$

The proposed saliency map (flowchart shown in Figure 3) is shown through experiments to be efficient and effective: first, the luminance, chrominance, and orientation factors are all taken into account to find the salient region; second, the simplified orientation calculation does not decrease the orientation salience; third, the power transformation makes the salient region more significant without changing the smooth tone of the saliency map.
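
The combination in (13) amounts to a pixel-wise mean followed by squaring, as in this sketch. It assumes the six maps have already been resized to a common size and scaled to the range 0 to 1; the scaling and the function name are assumptions, not details given in the paper.

```python
def combine_saliency(maps):
    """Average equally sized feature maps pixel by pixel, then square the result."""
    count = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[(sum(m[y][x] for m in maps) / count) ** 2 for x in range(cols)]
            for y in range(rows)]
```

With values in the range 0 to 1, squaring leaves a fully salient pixel at 1.0 but reduces a mean of 0.5 to 0.25, which is the enhancing and restraining effect described above.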

Figure 4 shows two natural images segmented by SOM-K and SOM-KS. Obviously, compared to SOM-K, SOM-KS with saliency map enhances the edge of salient regions and improves the segmentation performance.

3.4. SOM-K and SOM-KS

In the preprocessing stage, a median filter with a 3 × 3 window is applied to each channel of the RGB color space separately, pixel by pixel. The median filter removes salt-and-pepper noise and smooths the image, which helps to decrease the instability of the segmentation result caused by random noise.
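
A 3 × 3 median filter for one color channel can be sketched as follows. The border handling, which uses only the neighbors that exist, and the function name are assumptions; the paper does not say how image borders are treated.

```python
from statistics import median_low

def median_filter_3x3(channel):
    """Median-filter one channel given as a list of rows; borders use a smaller window."""
    rows, cols = len(channel), len(channel[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = [channel[j][i]
                      for j in range(max(0, y - 1), min(rows, y + 2))
                      for i in range(max(0, x - 1), min(cols, x + 2))]
            out[y][x] = median_low(window)   # median_low keeps integer pixel values
    return out
```

Applied to each of the three RGB channels in turn, the filter replaces an isolated outlier, such as a single bright pixel in a flat area, with the value of its surroundings.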

There are many color spaces that can be selected as features in image segmentation, like RGB, CIE XYZ, CIE $L^*u^*v^*$, CIE $L^*a^*b^*$, YUV, and so forth. A color space is uniform if equal distances in the color space correspond to equal perceived color differences. Many color spaces are nonuniform. For example, the RGB color space, which is designed for display on screens rather than for perceptual uniformity, does not model the way that humans perceive colors. By comparison, $L^*u^*v^*$, being a perceptually uniform color space, performs better than other color spaces. According to our experiments, the features $L^*$, $u^*$, and $v^*$ combined with intensity are a good choice. Figure 5 shows a comparison between different features applied to the SOM-K and k-means methods for one image.

In this paper, the RGB color space of the image is transformed into the $L^*u^*v^*$ color space. Then, the intensity and saliency map are integrated with the color space to form the input patterns $T$ of the SOM-KS method:

$$T_{\text{SOM-KS}} = \{(i, l, u, v, s)_p,\ p = 1, 2, \ldots, P\}, \tag{14}$$

where $P$ is the number of pixels in the image and $m = 5$ is the dimension of the input pattern, which includes the intensity and color space values $i$, $l$, $u$, and $v$ and the salience value $s$. The only difference between SOM-K and SOM-KS is that the feature $s$ is not included in $T$ for SOM-K:

$$T_{\text{SOM-K}} = \{(i, l, u, v)_p,\ p = 1, 2, \ldots, P\}. \tag{15}$$

$T$ is used to train a 2-dimensional SOM with hexagonal topology and 9 × 9 nodes, and the 81 output prototype vectors are clustered with the k-means algorithm for cluster numbers from 2 to 9. Blank nodes (not hit by any pixel) are deleted before clustering. The best cluster number for the image is finally selected with the entropy-based validity index [61], which helps to find the best segmentation. In most cases, it selects 2 as the best cluster number. Note that the cluster number is not necessarily the number of regions in an image. The image may be divided into many regions even though each of them belongs to one of two clusters. Two clusters imply at least two regions. At the same time, three or more clusters do not guarantee more regions than two clusters produce. At the end of segmentation, as a postprocessing step, regions with an area of less than 0.1% of the original image area are neglected.

4. Performance of Color Image Segmentation and Comparisons

4.1. Segmentation Experiments and Analysis

Our experiments take the natural color images of the Berkeley segmentation dataset (BSD) [64] as source images. The goal of the dataset and benchmark is to provide an empirical basis for research on image segmentation and boundary detection. The dataset images are divided into a training set of 200 images and a test set of 100 images [64]. All 100 test images are processed with four methods: the proposed SOM-K and SOM-KS methods (jointly written SOM-K(S); SOM-KS uses the saliency map as a feature, while SOM-K does not), the JSEG method, and the k-means method. SOM-K, SOM-KS, and k-means are implemented with MATLAB 6.5 and SOM Toolbox 2.0 [65], while JSEG runs in a Windows console. For JSEG, a widely used image segmentation benchmark, images processed with the default parameters are found to be oversegmented. Therefore, following its documentation, the color quantization threshold and the number of scales are set to 600 and 1, respectively, to obtain coarse segmentations; with this setting, 3 of the 100 test images are not segmented (196073: sand snake, 38082: reindeer grassland, and 227092: jar). The cluster number of the k-means method varies from 2 to $\lceil\sqrt{n}\rceil$ and is selected with an entropy-based image segmentation evaluation index, just as in the proposed SOM-K(S) method. The main parameter settings of the different methods are listed in Table 1. In most cases, the segmented images tend to have fewer regions, corresponding to 2 clusters under the entropy-based validity index, for both the k-means and SOM-K(S) methods, as discussed before.

The instability of the k-means method is mitigated by running each method at least ten times, for SOM-KS, SOM-K, and k-means alike. For most images, the segmentation “converges” to one or several results for the proposed methods.
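One way to see which results the repeated runs settle on is to tally identical partitions across runs. The sketch below is an illustration only and not the authors' procedure: it relabels clusters by order of first appearance, so that runs differing only in label names count as the same result. The function names are hypothetical.

```python
from collections import Counter

def canonical(labels):
    """Relabel clusters by order of first appearance.

    Two runs that produce the same partition with permuted label names
    map to the same tuple.
    """
    mapping = {}
    return tuple(mapping.setdefault(c, len(mapping)) for c in labels)

def most_frequent_partition(runs):
    """Return the partition seen most often and the fraction of runs giving it."""
    counts = Counter(canonical(r) for r in runs)
    partition, hits = counts.most_common(1)[0]
    return list(partition), hits / len(runs)
```

With the label sequences `[0, 0, 1, 1]`, `[1, 1, 0, 0]`, `[0, 1, 1, 1]`, and `[2, 2, 5, 5]`, three of the four runs describe the same partition, so the function returns `[0, 0, 1, 1]` with a fraction of 0.75.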

Because the evaluation of image segmentation is still more subjective than objective, the segmentation results are roughly categorized by human selection into three types: good, common, and unacceptable. The “converged” results number 40, 35, and 25 in the three categories, respectively. Figures 6, 7, and 8 show some segmentation results of the three categories.

As can be seen in Figure 6, image segmentation with SOM-K(S) performs better than JSEG. JSEG in fact performs consistently across the 100 test images: it usually oversegments, but the shapes of the main objects in the image can be recognized clearly. k-means shows segmentation performance similar to that of SOM-K(S) in many images, although its time cost is higher. Because the entropy evaluation index is involved, the segmentation results of SOM-KS tend to have fewer regions than those of JSEG in most cases. Compared to SOM-K, SOM-KS benefits from the saliency map in the following respects (see Figure 6).
(1) With the help of the saliency map, SOM-KS segments the images more precisely in detail, and object contours become more complete. For example, in image 3096, the zigzag along the airplane contour is clearly corrected by SOM-KS, compared to the other three methods. In 69020 and 160068, the disconnected contours of the kangaroo and the big cat produced by the other three methods are completed by SOM-KS. The same holds for images 62096, 300091, and so forth. With the saliency map, the main object in the image is segmented clearly.
(2) With the help of the saliency map, nonsalient regions are restrained. This leads to crisp and clean results compared to those of the other three methods, as in 14037, 58060, and 167083.
(3) With the help of the saliency map, the main object in the image stands out from the background. In Figure 6, most of the images either have a main object or have a foreground that is clearly different from the background. This characteristic suits SOM-K(S), which first clusters the pixels with SOM; a clearly separated background and foreground makes clustering easier for the SOM.

The images in Figure 7 are more complex than those in Figure 6. The segmentation results diverge within and between images. For example, in images 38092, 41069, 123074, 175043, and 241048, some parts are segmented while others are not. Compared to JSEG, the results are not satisfactory, although SOM-K(S) still keeps its characteristics, such as fewer objects and better contours. In 41069, for example, a small creature does not appear in the SOM-K result, which segments the rock well, but stands out in the SOM-KS result, where the rock is not segmented correctly. This also happens in other images in Figure 7. It is difficult to decide whether such a segmentation is good or not: on the one hand, it clearly brings out some important objects in the image; on the other hand, it misses some other region boundaries. The same happens with JSEG in some images.

In Figure 8, the segmentation results are unacceptable for all methods: the segmented images convey no useful information. Most importantly, the methods cannot be compared with each other on the basis of these segmentation results. These images are characterized by complex backgrounds and foregrounds with similar color and luminance. Some are difficult even for a human to segment, let alone for a machine without any prior information or human interaction. JSEG maintains its performance here, and several of its segmented images are recognizable, such as 69015 and 108082.

4.2. Entropy Index Comparison

Whether a segmentation is good or unacceptable, comparisons made by eye are subjective. Objective comparisons are made with the entropy index [61] in this part. Although the index is no substitute for human selection, it is at least an objective measure. Figure 9 shows the entropy index comparisons of the JSEG (red circle), k-means (magenta square), SOM-K (green downward-pointing triangle), and SOM-KS (blue upward-pointing triangle) methods. The entropy indexes of the 100 segmented images are computed from the original images and their label maps. For the entropy index, a smaller value means a better segmentation. For most images, SOM-K, SOM-KS, and k-means show good and stable performance, with lower values than JSEG. JSEG gives scattered results ranging from 0.8 to 2.1, whereas the other three methods give concentrated results ranging from 0.9 to 1.6. The reason is that the $H_l$ value of the entropy index, with weight 0.8, is 0 for a single region (unsuccessful segmentation with JSEG, as for 227092 and 196073, the 32nd and 58th images in Figure 9) and higher for more regions. The $H_r$ value, with weight 0.2, varies with the image and the region: a smooth region means a low $H_r$ value (167062, the 28th image in Figure 9), while a region with more detail leads to a high $H_r$ value. That is, both the region number (with high weight) and the region attributes (with low weight) affect the entropy index value. For JSEG, the region numbers of the 100 images range from 1 to 8. For the other three methods, which are guided by the entropy index, nearly all cluster numbers are 2, which means that the region number is mostly 2 or slightly more. Most entropy index values of these three methods are therefore small and concentrated in a narrow range compared with JSEG. A close look at the right figure shows that SOM-K, SOM-KS, and k-means have similar entropy index values, which in many cases overlap. This also agrees with their segmentation results. The right figure also shows that most SOM-KS entropy index values are slightly larger than those of SOM-K and k-means. This is because the segmentation results of SOM-KS usually consist of an object and a background, which leads to more variance within the background and within the object; that is, the $H_r$ values are slightly larger than those of the SOM-K and k-means segmentations, whose regions have less internal variance. After all, the so-called objective entropy index is still far from human selection. Every image has its own attributes, which influence the entropy index in complex ways.
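The interplay of the two terms can be made concrete with a short sketch. It assumes that the index is the weighted sum $0.8\,H_l + 0.2\,H_r$, that $H_l$ is the entropy of the region-size distribution, and that $H_r$ is the area-weighted entropy of the feature values inside each region, computed with natural logarithms. These definitions are assumptions chosen to match the behavior described above and should be checked against [61]; the function names are hypothetical.

```python
import math
from collections import Counter

def _entropy(counts, total):
    """Shannon entropy (natural log) of a histogram given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c)

def entropy_index(values, regions, w_l=0.8, w_r=0.2):
    """Return (index, h_l, h_r) for per-pixel feature values and region ids.

    h_l: entropy of the region-size distribution (0 for a single region).
    h_r: area-weighted entropy of the values inside each region
         (0 when every region is perfectly flat).
    """
    total = len(values)
    by_region = {}
    for v, r in zip(values, regions):
        by_region.setdefault(r, Counter())[v] += 1
    sizes = {r: sum(c.values()) for r, c in by_region.items()}
    h_l = _entropy(sizes.values(), total)
    h_r = sum((sizes[r] / total) * _entropy(c.values(), sizes[r])
              for r, c in by_region.items())
    return w_l * h_l + w_r * h_r, h_l, h_r
```

Under these assumptions, a single region gives `h_l == 0`, as noted above for the unsegmented JSEG images, while two equally sized, perfectly smooth regions give `h_r == 0` and `h_l == math.log(2)`.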

4.3. Computational Complexity

Computational complexity is a key characteristic in assessing a method. SOM-KS, SOM-K, and k-means run on the MATLAB platform, and JSEG runs in a Windows console environment. The computer runs the Microsoft Windows XP Professional 2002 operating system and has an Intel Duo P8700 CPU at 2.53 GHz and 2.89 GB of memory. All 100 test images are processed, and the average computation times of these methods are listed in Table 2. All BSD test images have the same area, 154401 (481 × 321) pixels.

The computational complexity of k-means is $O\left(Nm\sum_{k=2}^{C} k\right)$, where $N \times m$ is the dimension of the sample set and $k$ runs over the trial cluster numbers from 2 to $C$. For SOM-K, which combines SOM with k-means, the computational complexity is $O(Nmne)$, where $n$ and $e$ are the numbers of nodes and training epochs. That is, both k-means and the SOM-based methods have complexity linear in the number of samples. In practice, the running time varies with the implementation and the parameter settings of each algorithm. The cores of both SOM and k-means are implemented in MATLAB and tuned carefully. As can be seen in Table 2, JSEG keeps a relatively good time-to-performance ratio compared to the other methods, although its segmentation results need further processing. SOM-K needs slightly less time than JSEG, because SOM simplifies and accelerates the clustering process; even with k-means clustering as a second step, the small number of prototype vectors does not cost much time. By contrast, k-means alone costs more time than the other methods. Compared to SOM-K, SOM-KS needs more than twice the time to finish a segmentation. The extra time comes from the saliency map; it is the cost of a more precise segmentation, which the improved result justifies. In summary, Table 2 shows that, compared with k-means, SOM-K achieves similar segmentation performance in less time. SOM-KS further enhances the segmentation performance with a moderate increase in computational time, which still remains lower than that of k-means. SOM-K takes a similar time to JSEG but gives a substantial segmentation improvement.
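As a brief worked step, the sum over trial cluster numbers in the k-means complexity has a closed form, which shows that the cost stays linear in $N$ and $m$ and grows quadratically with the largest trial cluster number $C$. The value $C = 9$ below is only an illustration, taken from the 2-to-9 sweep applied to the prototype vectors.

```latex
\[
\sum_{k=2}^{C} k \;=\; \frac{C(C+1)}{2} - 1
\quad\Longrightarrow\quad
O\!\left(Nm\sum_{k=2}^{C} k\right) \;=\; O\!\left(NmC^{2}\right),
\qquad
\text{e.g. } C = 9:\ \sum_{k=2}^{9} k = 44 .
\]
```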

4.4. Limitations

As discussed before, the SOM-K(S) method can handle most of the natural color images in the test dataset. However, the k-means step can lead to unexpected results because of its random initialization. Figure 10 shows different segmentation results for the same images, obtained with the same SOM-K(S) settings and different initializations of k-means. Occasionally, the results are ambiguous: from one point of view the image is segmented correctly, but from another it is not. For example, in row 3 of Figure 10, the left circle land is a correct segmentation if it is taken as a whole, while the right one is correct if the detail of the circle land matters. A similar case appears in rows 4 and 5 (the person, the jar, and the red fish). If the focus is on the person, the 1st and the 3rd results are correct segmentations; if the focus is on the background, the other two may be good segmentations. The same holds for the other images. Strictly speaking, the segmentations in Figure 10 are not good, but the judgment depends on what the viewer is interested in: if the interest is in the objects in the images, they might be good; if it is in the segmentation as a whole, they are not.

As mentioned in [12, 66], color information alone is not enough to segment complicated natural scenes. The intensity and $L^*u^*v^*$ features used in the proposed SOM-K(S) are therefore insufficient, and a richer description of a natural image, with texture and so forth, can improve the segmentation result. The idea in [11] of an optimum self-adaptive topology is also enlightening.

5. Conclusion

In this paper, a natural color image segmentation method based on SOM and k-means clustering is proposed. The method trains an SOM neural network on features of intensity and the $L^*u^*v^*$ color space, and the output prototype vectors are clustered with the k-means method (SOM-K). A variant of the proposed method combines a modified saliency map with intensity and the $L^*u^*v^*$ color space as new features to enhance the segmentation (SOM-KS). The SOM-K(S) method is shown to be effective by experiments on the BSD. The saliency map makes the segmentation more precise; at the same time, the salient object in the image stands out and other minor parts are restrained, that is, oversegmentation is reduced in such areas. The entropy index evaluations of JSEG, SOM-K, SOM-KS, and k-means are compared, and the computational complexity is measured by the computation time and compared across the methods. It is shown that the proposed SOM-K(S) method, being automatic, obtains better segmentation results in less time and with no need to set any parameters in advance. The only limitation is that it relies heavily on the initialization of k-means, so several runs are needed to obtain a better result. More segmentation images and the source code of the proposed method are available on the web at http://sites.google.com/site/chidongxiang/.

Acknowledgments

The research was supported by the Science and Research Program of Shanghai Dianji University under Grant no. 07C402 and Shanghai Dianji University Key Discipline Program no. 10XKF01. The authors thank Professor Xue Ping of Nanyang Technological University for his guidance and kind help, and Luo Ye, Wang Junyan, Huang Wei, and Shen Minmin of Nanyang Technological University for their valuable suggestions.