Research Article | Open Access
Automatic Image Tagging Model Based on Multigrid Image Segmentation and Object Recognition
With the rapid growth of Internet technologies and mobile devices, multimedia data such as images and videos are growing explosively on the Internet. Managing large-scale multimedia data with correct tags and annotations is an important task: incorrect tags and annotations make the data hard to manage, whereas accurate ones ease management and yield high-quality retrieval results. Fully manual tagging is the most accurate approach when users supply correct information. Nevertheless, most users put little effort into tagging, so collections suffer from many noisy tags. The best remedy for inaccurate tags is to tag images automatically. Many researchers have proposed robust automatic image tagging models, and the topic remains an active research field. Since existing automatic image tagging models still have many limitations, we propose an efficient automatic image tagging model that uses multigrid-based image segmentation and feature extraction. Our model improves the object descriptions of images and image regions. We tested our method on the Corel dataset, and the results show that our model is efficient and effective compared to other models.
1. Introduction
Nowadays, we are always online. Desktop computers, laptops, and even smartphones are connected anytime and anywhere. Sharing multimedia data from mobile devices is very easy, and the explosive growth of social network services such as Facebook, Flickr, and Twitter contributes to the tremendous growth of multimedia data on the Internet. To manage these data, tag and annotation information must become more reliable. How to manage multimedia at such a large scale is a prominent topic these days. A well-tagged image is effective for management and retrieval. We focus on an automatic image tagging model that uses image segmentation and feature extraction. Since a single image often presents multiple objects, we mainly focus on how to extract multiple objects successfully. To this end, we examine image segmentation techniques and propose a multigrid-based image segmentation method. An image may sometimes contain a single object, but most user-created content contains multiple objects per image (Figure 1). Therefore, extracting visual features from the whole image alone is of limited use for tagging or annotating it. Feng et al. [1] also proposed a grid-based method, which is more effective than basic image segmentation models, but it is still limited when a segmented region contains multiple objects. Therefore, we propose a multigrid image segmentation method that can extract features of multiple objects presented in an image. Experimental results show that our model gives efficient, effective, and more accurate image tagging results than the other models.
We present related work in Section 2 and propose our multigrid image segmentation model in Section 3. In Section 4, we present our novel automatic image tagging model based on the multigrid image segmentation method. We present experimental results in Section 5. Finally, we give conclusions and future work in Section 6.
2. Related Work
Typically, there are three types of image tagging models: automatic, manual, and half-automatic. Manual tagging gives the most accurate and reliable tags, but tagging images manually is extremely costly. Half-automatic models such as Google Image Labeler tag images fairly accurately, but users have to spend time playing a game, and the results may still contain noisy tags. Therefore, fully automatic image tagging is an active research field despite its lower performance compared to manual and half-automatic models, and many researchers are working to increase its accuracy.
Learning-based automatic image tagging models are the most recent research interest. Starting from keyword-based methods, semantic keyword methods were proposed. Most recently, the main topic has been establishing the relationship between textual features and visual features for more effective image tagging. Jeon et al. [2] and Yang et al. [3] proposed cross-media models, which tag images with joint probabilities of semantic information and visual features. They used discrete features to tag images, which can lose helpful visual information. Carneiro et al. [4] proposed the SML model, a supervised learning model that is not suitable for image segmentation. Wang et al. [5] combined global and local regions and used contextual features to improve tagging performance. Lindstaedt et al. [6] proposed visual folksonomy-based automatic image tagging, especially for fruits and vegetables. In addition, Manh and Lee [7] focused on small object segmentation based on visual saliency in natural images, and Divya et al. [8], Santosh and Shyam [9], and Patil and Kokare [10] demonstrated that image segmentation and automatic image tagging models help with semantic image retrieval.
Unlike these algorithms, our model focuses on efficient and effective multigrid-based image segmentation and object recognition, and we propose an image tagging model built on this segmentation method (Figure 2). The next section presents our multigrid-based image segmentation and object recognition, and Section 4 presents the automatic image tagging model.
3. Image Segmentation Model
Most image region segmentation depends on contrast with the surroundings. Cheng et al. [11] proposed a global contrast-based salient region detection algorithm. For image segmentation, Felzenszwalb and Huttenlocher [12] proposed a graph-based method, and Xiong et al. [13] proposed a hierarchical deformable model for face detection. The color contrast of each image region is then calculated. In this paper, we calculate a weight for each region for image region segmentation.
Let be the distance between image regions and ; then can be calculated as follows: where is the probability of color on image region and is the probability of color on image region . The distance between on region and on region is the distance between the two pixels. Calculating the distance over the whole color space takes a long time because the Lab color space has 255^3 possible colors. Therefore, we use a histogram-based compressed Lab color space and recalculate as follows: where is the th bin of color in region and is the th bin of color in region . is the number of histogram bins. Now we can calculate as follows: where is the number of bin colors and denotes the number of pixels in region . If pixels of a certain color appear many times, that color is a main color of the region. If we calculate directly with (3), similar colors may be assigned to different bins, which introduces noise, especially when the region is small. To overcome this problem, we redefine (3) as follows: where is the number of colors in the histogram similar to . is the distance between and the th similar color of . and are normalization factors and is a linearly transformed weight. Now we calculate the weight of a region by comparing it with the other regions. We calculate region importance as follows: can be calculated with (2). denotes the number of pixels in region and also serves as the weight of the region. Since (5) does not take spatial relationships into account, we recalculate (5) with spatial relationships as follows: where denotes the spatial distance between regions and . is calculated using the Euclidean distance. is used to control the spatial weight.
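For orientation, the region contrast formulation of Cheng et al., which the definitions above closely follow, can be written as below. The notation ($D_r$, $f$, $w$, $D_s$, $\sigma_s$, $T$, $m$) is borrowed from that work and is an assumption here; it is a sketch of the general form and is not necessarily term-for-term identical to equations (1) to (6) of this paper.

```latex
\begin{align}
% Color distance between regions r_1 and r_2, where f(c_{k,i}) is the
% probability of the i-th color among the n_k colors of region r_k
D_r(r_1, r_2) &= \sum_{i=1}^{n_1} \sum_{j=1}^{n_2}
    f(c_{1,i})\, f(c_{2,j})\, D(c_{1,i}, c_{2,j}) \\
% Smoothing over the m colors most similar to c; (m-1)T normalizes the
% linearly transformed weights T - D(c, c_i) so that they sum to one
S'(c) &= \frac{1}{(m-1)\,T} \sum_{i=1}^{m} \bigl(T - D(c, c_i)\bigr)\, S(c_i),
    \qquad T = \sum_{i=1}^{m} D(c, c_i) \\
% Region contrast, weighting each other region by its pixel count w(r_i)
S(r_k) &= \sum_{r_i \neq r_k} w(r_i)\, D_r(r_k, r_i) \\
% Spatially weighted region contrast; D_s is the Euclidean distance between
% regions and \sigma_s controls the strength of the spatial weighting
S(r_k) &= \sum_{r_i \neq r_k}
    \exp\!\left(-\frac{D_s(r_k, r_i)}{\sigma_s^{2}}\right)
    w(r_i)\, D_r(r_k, r_i)
\end{align}
```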
We now propose a method that segments an image into multiple grids to recognize objects. We segment each image with the multigrid method, extract object features as described above, and finally extract visual features from each segmented image. Because visual features are extracted from images segmented at several grid sizes, we can extract most of the objects in the image. In this paper, we segment images in three steps. In the first step, we extract features from the entire image. In the second step, we extract features from the 2 by 2 grid segmented images. In the third step, we extract features from the 3 by 3 grid segmented images. When the number of steps goes beyond the 3 by 3 grid, the number of well-extracted objects (Figure 3) and the accuracy (Figure 4) decrease.
The number of well-extracted objects and their accuracy are best at step 3. With finer grids, the smaller objects that become visible are not very important to the image, while a more important object can be split across regions, which means that important object features can be lost.
Figure 5 shows objects extracted from segmented images. When we extract features from the entire image, only one object is extracted. With 2 by 2 segmented images, we can recognize more detailed objects, and with 3 by 3 segmented images, we can recognize still more detailed objects.
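The three-step multigrid procedure described in this section can be sketched as follows. This is a minimal illustration, assuming the image is a NumPy array of shape (height, width, channels); the function names `grid_cells`, `multigrid_regions`, and `mean_color_feature` are hypothetical, and the mean-color descriptor is only a placeholder for the visual features used in the paper.

```python
import numpy as np

def grid_cells(image, n):
    """Split an H x W x C array into an n-by-n grid of sub-arrays (row-major)."""
    h, w = image.shape[:2]
    # n + 1 evenly spaced boundaries; rounding keeps the cells nearly equal
    # in size when h or w is not divisible by n.
    ys = np.linspace(0, h, n + 1).round().astype(int)
    xs = np.linspace(0, w, n + 1).round().astype(int)
    return [image[ys[r]:ys[r + 1], xs[c]:xs[c + 1]]
            for r in range(n) for c in range(n)]

def multigrid_regions(image, levels=(1, 2, 3)):
    """Collect regions from the whole image, the 2x2 grid, and the 3x3 grid."""
    regions = []
    for n in levels:
        regions.extend(grid_cells(image, n))
    return regions

def mean_color_feature(region):
    """Placeholder descriptor (assumption): the per-channel mean of a region."""
    return region.reshape(-1, region.shape[-1]).mean(axis=0)

if __name__ == "__main__":
    image = np.random.rand(90, 120, 3)
    regions = multigrid_regions(image)
    features = np.stack([mean_color_feature(r) for r in regions])
    print(len(regions), features.shape)
```

With the default levels, one image yields 1 + 4 + 9 = 14 regions, each of which gets its own feature vector.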
4. Automatic Image Tagging Model
In this section, we introduce our automatic image tagging model, which builds on the multigrid segmented images introduced in Section 3. The visual features extracted from each segmented region are treated as describing a single object.
For every input image , is segmented into a 3 by 3 grid. Let be the number of segmented image regions. We extract a d-dimensional feature vector from each region and define the visual generation probability . We use the Multiple Bernoulli Distribution [1] to calculate the visual generation probability. is an unlabeled image and is the feature vector of . is a subset of the tag labels. is the similarity between and . The process of jointly generating and is as follows:
(1) select an image from the training set with ;
(2) obtain the segmented image regions;
(3) for each training image , ;
(4) generate visual descriptions from the th region by using the conditional probability;
(5) for each word in the tag set;
(6) generate the tag set by using the Multiple Bernoulli Distribution;
(7) using (7), calculate the joint probability of visual description and labels in our model:
In (7), is the probability of image from the training set. Since there is no prior knowledge, can be assumed to obey a uniform distribution: where is the size of the training image set.
The probability is used to estimate the visual generation probability of regions. Assume is the set of visual features of the regions in the 3 by 3 segmentation; can be calculated as follows: where is the number of image regions and is the dimension of visual features. Equation (9) uses a Gaussian kernel function to estimate the visual description of each region in image . The Gaussian kernel is determined by the covariance matrix .
is the th component of the Multiple Bernoulli Distribution. It is the probability that the tag set is generated by training image . Bayesian estimation is used for each tag label as follows: where is the number of labels in the training set and is the size of the training image set. is a binary function (1 if contains label , and 0 otherwise). is a weight parameter.
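The generative process of this section (uniform prior over training images, Gaussian kernel estimate for region features, Bayesian-smoothed Bernoulli estimate for each tag) can be sketched in the style of the Multiple Bernoulli Relevance Model of Feng et al. This is an illustration under stated assumptions, not the authors' implementation: the covariance is taken as `beta` times the identity, the standard Gaussian density with its factor of 1/2 is used, the Bernoulli estimate is taken as (mu * delta + N_w) / (mu + N), and the names `log_region_likelihood`, `bernoulli_word_probs`, `tag_scores`, `beta`, and `mu` are hypothetical.

```python
import numpy as np

def log_region_likelihood(query_regions, train_regions, beta=1.0):
    """Log of prod_a P(g_a | J) for one training image J.

    Each P(g_a | J) is the mean of isotropic Gaussian kernels centred on
    the region features of J (covariance assumed to be beta * I).
    """
    n, d = train_regions.shape
    diff = query_regions[:, None, :] - train_regions[None, :, :]
    sq = np.sum(diff ** 2, axis=2) / beta            # shape (q, n)
    log_norm = 0.5 * d * np.log(2 * np.pi * beta)    # log sqrt((2 pi)^d |Sigma|)
    log_kernels = -0.5 * sq - log_norm
    # Log of the mean kernel value per query region, computed stably.
    m = log_kernels.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_kernels - m).mean(axis=1))
    return log_p.sum()

def bernoulli_word_probs(train_tags, mu=1.0):
    """Smoothed P(w | J) for a binary (N images x V words) tag matrix."""
    train_tags = np.asarray(train_tags, dtype=float)
    n_images = train_tags.shape[0]
    n_w = train_tags.sum(axis=0)   # number of training images containing each word
    return (mu * train_tags + n_w) / (mu + n_images)

def tag_scores(query_regions, train_region_sets, train_tags, beta=1.0, mu=1.0):
    """P(w | query) = sum_J P(J | query) P(w | J), with a uniform prior P(J)."""
    log_lik = np.array([log_region_likelihood(query_regions, r, beta)
                        for r in train_region_sets])
    log_post = log_lik - np.log(len(train_region_sets))   # add log of uniform prior
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()                               # posterior over training images
    return weights @ bernoulli_word_probs(train_tags, mu)

if __name__ == "__main__":
    train_region_sets = [np.zeros((9, 2)), np.full((9, 2), 10.0)]
    train_tags = np.array([[1, 0], [0, 1]])
    query = np.zeros((9, 2)) + 0.1
    print(tag_scores(query, train_region_sets, train_tags))
```

An unlabeled image would then be tagged with the highest-scoring words. Working in log space avoids numerical underflow when the product runs over many regions.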
5. Experimental Results
To evaluate our automatic image tagging model based on multigrid image segmentation, we compare it with other models on the Corel dataset, a popular dataset in automatic image tagging that includes over 5,000 images. This section focuses on how to construct an effective automatic image tagging model. For ease of comparison with other models, we do not use any new visual features. We use the same 30-dimensional features: 9-dimensional RGB color moments, 9-dimensional Lab color moments, and 12-dimensional Gabor texture features. We evaluate tagging results with precision, recall, and F-measure (Figure 6). In addition, we count the labels that are correctly tagged at least once, denoted as NZR, which reflects the coverage of annotation words.
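The four evaluation measures can be computed as in the sketch below. It assumes the per-word protocol that is common on the Corel benchmark (precision and recall are computed for each word and then averaged over the vocabulary); the text does not state the averaging scheme, so that choice, like the function name `evaluate_tagging`, is an assumption.

```python
def evaluate_tagging(predicted, truth, vocabulary):
    """Return (precision, recall, F-measure, NZR) for per-image tag sets.

    predicted, truth: lists of sets of words, one set per test image.
    Precision and recall are averaged over the words in `vocabulary`.
    NZR counts the words that are correctly predicted at least once.
    """
    precisions, recalls, nzr = [], [], 0
    for word in vocabulary:
        n_pred = sum(word in p for p in predicted)
        n_true = sum(word in t for t in truth)
        n_correct = sum(word in p and word in t
                        for p, t in zip(predicted, truth))
        precisions.append(n_correct / n_pred if n_pred else 0.0)
        recalls.append(n_correct / n_true if n_true else 0.0)
        if n_correct > 0:
            nzr += 1
    precision = sum(precisions) / len(vocabulary)
    recall = sum(recalls) / len(vocabulary)
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure, nzr

if __name__ == "__main__":
    predicted = [{"sky", "water"}, {"sky"}, {"tree"}]
    truth = [{"sky"}, {"water"}, {"tree", "grass"}]
    print(evaluate_tagging(predicted, truth, ["sky", "water", "tree", "grass"]))
```

The F-measure here is the harmonic mean of precision and recall. For instance, a precision of 0.27 and a recall of 0.29 give 2 × 0.27 × 0.29 / 0.56 ≈ 0.28.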
We need to determine the value of parameter in (4) experimentally. is the number of similar bins in the histogram. The horizontal axis shows the ratio of similar color bins in the histogram (Figure 4). The best annotation results are obtained when the ratio of similar color bins is 20%. Precision and recall fall as increases, because a higher reduces the region contrast, which in turn degrades the features extracted from image regions to some degree.
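The role of this parameter can be illustrated with a small sketch of similar-color smoothing in the style of Cheng et al.: each color's value is replaced by a weighted average over its m most similar colors, with linearly transformed weights T - D normalized by (m - 1)T. The details are assumptions rather than the authors' implementation: m is taken as `ratio` times the number of histogram colors, a color counts among its own neighbors, and the name `smooth_color_scores` is hypothetical.

```python
import numpy as np

def smooth_color_scores(colors, scores, ratio=0.2):
    """Smooth per-color scores over the m most similar colors.

    colors: (n, 3) array of histogram bin colors (for example in Lab space).
    scores: (n,) array of per-color values to be smoothed.
    ratio:  fraction of the n colors treated as similar (m = ratio * n).
    """
    colors = np.asarray(colors, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n = len(colors)
    m = min(n, max(2, int(round(ratio * n))))
    # Pairwise Euclidean distances between bin colors.
    dist = np.linalg.norm(colors[:, None, :] - colors[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :m]   # includes the color itself
    smoothed = np.empty(n)
    for c in range(n):
        d = dist[c, nearest[c]]
        total = d.sum()                          # T in the formula
        if total == 0:                           # all neighbors identical
            smoothed[c] = scores[nearest[c]].mean()
            continue
        weights = (total - d) / ((m - 1) * total)   # weights sum to 1
        smoothed[c] = weights @ scores[nearest[c]]
    return smoothed

if __name__ == "__main__":
    colors = [[0, 0, 0], [1, 0, 0], [10, 0, 0], [11, 0, 0], [20, 0, 0]]
    print(smooth_color_scores(colors, [1, 0, 0, 0, 0], ratio=0.6))
```

A larger ratio averages each color over more neighbors, which flattens differences between colors. This is consistent with the observation that a higher value reduces region contrast.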
To assess the performance of our multigrid image segmentation method, we compared it with the methods introduced in Section 3: global contrast-based salient region detection [11], graph-based image segmentation [12], and the hierarchical deformable model [13]. The experimental results give the accuracy of each method, and our method performs best (Figure 7).
We also compare our model with the other image segmentation methods in terms of precision, recall, F-measure, and NZR (Table 1).
Finally, we present the performance of our automatic image tagging model. We compared our model with several state-of-the-art models: the Cross Media Relevance Model [2, 3], the Multiple Bernoulli Relevance Model [1], Transductive Multi-Instance Multilabel learning [14], and the Supervised Learning Model [4]. Our model is very effective, and its tagging results are better than those of the state-of-the-art models (Table 2). Our model obtained the highest precision, 0.27, which is at least 12% higher than the other models. Its recall is 0.29, which equals that of the Supervised Learning Model and is clearly higher than that of the remaining models. The F-measure of our model is 0.28, approximately 8% higher than that of the Supervised Learning Model, which had the highest F-measure among the previous state-of-the-art models. In addition, on NZR, which reflects the coverage of annotation words, our model reaches 144, the highest of all models.
We compare our model with the MBRM model, listing tagging labels in descending order of tagging probability (Table 3). Labels that are in the ground truth are set in bold. We did not specifically select test images that are perfectly tagged by our model. The tagging results of our model are clearly better than those of the MBRM model. We also find that some tag words do not appear in the ground-truth annotations of the dataset but still describe the contents of the images; that is, some correct tags were omitted by the annotators. These labels are set in italics. For example, clouds, water, and sky do not belong to the ground truth of the first image, but they unquestionably describe its contents. Some labels in other images show a similar situation.
6. Conclusions
In this paper, we proposed a multigrid image segmentation method, motivated by the fact that a segmented image may contain multiple objects, and an automatic image tagging model based on it. To evaluate the segmentation method, we compared it with other image segmentation methods, against which it showed high performance. To evaluate the tagging model, we used the Corel dataset and compared it with other well-known models: the Cross Media Relevance Model, the Multiple Bernoulli Relevance Model, Transductive Multi-Instance Multilabel learning, and the Supervised Learning Model. Our tagging model performed better than these state-of-the-art models, especially on object feature extraction, and was efficient, effective, and accurate on all evaluation measures: precision, recall, F-measure, and NZR.
Many limitations remain and much work is left to do. Because multimedia data keeps growing at every moment, more powerful, reliable, and accurate models must be developed. With large amounts of data being created continuously, we also need to focus on real-time automatic annotation models.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] S. L. Feng, R. Manmatha, and V. Lavrenko, “Multiple Bernoulli relevance models for image and video annotation,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1002–1009, 2004.
[2] J. Jeon, V. Lavrenko, and R. Manmatha, “Automatic image annotation and retrieval using cross-media relevance models,” in Proceedings of the 26th Annual International ACM SIGIR, pp. 119–126, 2003.
[3] Y. Yang, Z. Huang, and Z. Ma, “Robust cross-media transfer for visual event detection,” in ACM Multimedia'12, pp. 1045–1048, 2012.
[4] G. Carneiro, A. B. Chan, P. J. Moreno, and N. Vasconcelos, “Supervised learning of semantic classes for image annotation and retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 394–410, 2007.
[5] Y. Wang, T. Mei, S. Gong, and X.-S. Hua, “Combining global, regional and contextual features for automatic image annotation,” Pattern Recognition, vol. 42, no. 2, pp. 259–266, 2009.
[6] S. Lindstaedt, R. Mörzinger, R. Sorschag, V. Pammer, and G. Thallinger, “Automatic image annotation using visual content and folksonomies,” Multimedia Tools and Applications, vol. 42, no. 1, pp. 97–113, 2009.
[7] H. T. Manh and G. Lee, “Small object segmentation based on visual saliency in natural images,” Journal of Information Processing Systems, vol. 9, no. 4, pp. 592–601, 2013.
[8] U. J. Divya, K. Hyunseoul, L. Jun, and K. Jee-In, “Fractal based method on hardware acceleration for natural environments,” Journal of Convergence, vol. 4, no. 3, pp. 6–12, 2013.
[9] K. V. Santosh and K. N. Shyam, “Color directional local quinary patterns for content based indexing and retrieval,” Human-Centric Computing and Information Sciences, vol. 4, no. 6, 2014.
[10] P. B. Patil and M. B. Kokare, “Interactive semantic image retrieval,” Journal of Information Processing Systems, vol. 9, no. 3, pp. 349–364, 2013.
[11] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, “Global contrast based salient region detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), pp. 409–416, June 2011.
[12] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
[13] Y. Xiong, P. Gang, C. Zhaoquan, and Z. Kehan, “Occluded and low resolution face detection with hierarchical deformable model,” Journal of Convergence, vol. 4, no. 2, pp. 11–14, 2013.
[14] S. Feng and D. Xu, “Transductive Multi-Instance Multi-Label learning algorithm with application to automatic image annotation,” Expert Systems with Applications, vol. 37, no. 1, pp. 661–670, 2010.
Copyright © 2014 Woogyoung Jun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.