Advances in Multimedia
Volume 2018, Article ID 6153607, 11 pages
Research Article

Combining Convolutional Neural Network and Markov Random Field for Semantic Image Retrieval

1School of Information Technology in Education, South China Normal University, Guangzhou, China
2Guangdong Engineering Research Center for Smart Learning, South China Normal University, Guangzhou, China
3School of Computing and Mathematics, Charles Sturt University, Albury, NSW, Australia
4School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China

Correspondence should be addressed to Changqin Huang; cqhuang@scnu.edu.cn

Received 4 May 2018; Accepted 12 June 2018; Published 1 August 2018

Academic Editor: Yong Luo

Copyright © 2018 Haijiao Xu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


With the rapidly growing number of images on the Internet, efficient and scalable semantic image retrieval becomes increasingly important. This paper presents a novel approach for semantic image retrieval by combining Convolutional Neural Network (CNN) and Markov Random Field (MRF). As a key step, image concept detection, that is, automatically recognizing multiple semantic concepts in an unlabeled image, plays an important role in semantic image retrieval. Unlike previous work that applies single-concept classifiers one by one, we detect a semantic multiconcept by using a multiconcept scene classifier. In other words, our approach takes multiple concepts as a holistic scene for multiconcept scene learning. Specifically, we first train a CNN as a concept classifier, which includes two types of classifiers: a single-concept fully connected classifier that is best suited to single-concept detection and a multiconcept scene fully connected classifier that is good for holistic scene detection. Then we propose an MRF-based late fusion approach that is able to effectively learn the semantic correlation between the single-concept classifier and the multiconcept scene classifier. Finally, the semantic correlation among the subconcepts of images is captured to further improve detection precision. To investigate the feasibility and effectiveness of our proposed approach, we conduct comprehensive experiments on two publicly available image databases. The results show that our proposed approach outperforms several state-of-the-art approaches.

1. Introduction

With the rapid development of information technology, a large number of multimedia objects such as images are available on the Web. Given a semantic query, effectively finding relevant images in such a large-scale Web database remains a challenge. For semantic image retrieval, image concept detection is a vital step. To address this issue, many approaches have been proposed, such as Markov random walk [1], group sparsity [2], ensemble learning [3], and multiview semantic learning [4]. Although effective, these approaches work only for single-concept-based image retrieval. That is, each semantic query is assumed to contain only one semantic concept, which restricts their practical usability.

In this paper, we specifically consider the problem of multiconcept-based image retrieval. This paradigm allows users to employ multiple semantic concepts to search for relevant images. Its critical step is image multiconcept detection, that is, identifying multiple semantic concepts in an unseen image. Most previous studies [5, 6] utilize multiple independent single-concept classifiers to detect such a semantic multiconcept scene. Nonetheless, this method may be ineffective, since a visual multiconcept scene (e.g., “grass, person, soccer, and sports”) is hard to detect solely with single-concept classifiers. Therefore, further studies on image multiconcept detection are necessary.

In recent years, CNNs have achieved state-of-the-art performance in many image tasks, such as single-concept-based image retrieval [7, 8], face recognition [9], image segmentation [10], and image reconstruction [11]. This indicates that a CNN can learn robust visual features by capturing semantic structures of images. A natural idea is to devise a specific CNN for image multiconcept detection. However, most conventional CNNs focus only on single-concept detection of images. As a result, they perform suboptimally on images with multiconcept scenes. We hence design a specific CNN that suits holistic scene detection, with two kinds of fully connected classifiers: a single-concept classifier and a multiconcept scene classifier. The former suits single-concept detection, while the latter is for holistic scene detection. Differing from the existing works that use single-concept classifiers, our method employs a multiconcept scene classifier to detect a semantic multiconcept scene, regarding multiple concepts as a holistic scene for multiconcept scene learning. Using our proposed MRF-based fusion method, we model the semantic correlation between the single-concept classifier and the multiconcept scene classifier and estimate the relevance score for an image multiconcept scene. The semantic link among the subconcepts present in the images is further used to improve detection accuracy. Experimental results on the MIR Flickr 2011 [12] and NUS-WIDE [13] datasets demonstrate the effectiveness of our proposed method. The major contribution of this paper is twofold:

Combining CNN and MRF, we propose a unified, novel CNN framework for image multiconcept scene detection.

We model the semantic link between a single-concept classifier and a holistic scene classifier in a way that effectively detects the semantic multiconcept scene in an unlabeled image.

The remainder of this paper is organized as follows. Section 2 briefly reviews some related works. Section 3 details our proposed approach. Section 4 reports our experiments with setup, results, and analysis, and Section 5 concludes this paper with some remarks on further studies.

2. Related Work

Image concept detection, a vital step for semantic image retrieval, can be grouped into discriminative, generative, and nearest-neighbor methods. A discriminative method learns a classifier that projects visual images to semantic concepts, for example, Stochastic Configuration Networks (SCN) [14], while a generative method (e.g., a feature-word-topic model [15]) concentrates on learning the correlation between visual images and semantic concepts. A nearest-neighbor method assigns a semantic concept to an image by a majority vote of its nearest neighbors. An influential work is TagProp [6], which employed a weighted nearest-neighbor graph to learn semantic concepts of unseen images, achieving competitive learning performance. To simplify system design and computation, the above-mentioned methods ignore the valuable semantics latently embedded in image concepts. Alternatively, some other methods integrate the semantic information under a unified learning framework, achieving sound concept detection performance. In [16], the Google semantic distance was proposed to extract the semantics of semantic concepts and phrases. In [17], a semantic ontology-based hierarchical pooling method was proposed to improve the coverage or diversity of the training images.

In the research field of image retrieval, MRF-based methods are also widely used, achieving promising performance. Laferte et al. [18] proposed a discrete MRF approach, which employed maximum a posteriori estimation on the quadtree so as to reduce the computational expense. Metzler et al. [19] proposed an MRF-based query expansion approach that provided an effective mechanism for modeling semantic dependencies of image concepts. In [20], a potential function was proposed for parameter estimation and model inference, which strengthened the learning ability of a concept classifier. Kawanabe et al. [1] utilized Markov random walks on graphs of textual tags to improve the performance of image retrieval. Lu et al. [21] utilized maximum-likelihood estimation to train a spatial Markov model and then employed this model for image concept detection. Dong et al. [22] proposed a sub-Markov random walk approach with concept prior for image retrieval, which can be interpreted as a conventional random walker on a graph with added auxiliary nodes. Most traditional methods concentrate on single-concept-based image retrieval. For an image multiconcept query, they employ a combination of single-concept classifiers [5, 6] to detect the image multiconcept scene.

CNN-based deep learning has recently achieved state-of-the-art performance in single-concept-based image tasks. Simonyan et al. [23] trained a deep CNN termed VGG, achieving competitive performance on the large-scale dataset ImageNet. Szegedy et al. [7] proposed a deeper CNN architecture termed GoogLeNet, achieving better learning performance on ImageNet. To improve the performance of image retrieval, Hoang et al. [24] proposed three masking schemes to select a representative subset of local convolutional features. Girshick et al. [8] proposed a scalable object detection approach, Regions with CNN features (R-CNN), which applied high-capacity CNNs to bottom-up region proposals. Ren et al. [25] proposed a Region Proposal Network (RPN) that shared full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. In [26], a Multi-Loss regularized Deep Neural Network (ML-DNN) framework was proposed, which exploited multiple loss functions with different theoretical motivations to mitigate overfitting during semantic concept learning. He et al. [27] proposed a residual learning framework to ease the training of deep neural networks. Wang et al. [28] proposed a deep ensemble learning approach for large-scale data analytics. Huang et al. [29] proposed a Dense Convolutional Network (DenseNet) that connected each layer to every other layer in a feed-forward fashion, strengthening feature propagation and reducing training expense. Despite their effectiveness, these methods are confined to single-concept-based image retrieval, which limits their practical usability. This motivates us to devise a new model to resolve this issue.

3. Proposed Approach

Our approach, called CMMR, combines CNN and MRF for multiconcept-based image retrieval. Suppose that $\mathcal{T}$ and $\mathcal{U}$ stand for a training set and a test set, respectively. Each image $x$ in $\mathcal{T}$ or $\mathcal{U}$ is represented as a low-level visual feature vector. Given a vocabulary $V$ with $|V|$ unique semantic concepts, each concept $w \in V$ is a single concept, for example, “grass” or “person.” Each image in the training set $\mathcal{T}$ is labeled with several semantic single concepts $w \in V$, while the images in the test set $\mathcal{U}$ have no concept labels. Each semantic scene with the multiconcept $c$, for example, “clouds, sky, and sunset,” is an element of the power set of $V$, that is, $c \in 2^{V}$ or $c \subseteq V$. Given a multiconcept query $q$ (e.g., “grass, person, soccer, and sports”) with scene $c$ and the target set $\mathcal{U}$, our goal is to find a result set $R \subseteq \mathcal{U}$ with relevant images. The result set satisfies the following conditions: each relevant image in $R$ includes all target single concepts $w \in c$; and $p(c \mid x) \ge p(c \mid x')$ for all $x \in R$ and $x' \in \mathcal{U} \setminus R$, where $p(c \mid x)$ and $p(c \mid x')$ stand for the relevance scores for $c$.

Figure 1 shows our proposed CMMR framework and its working mechanism. The framework consists of three main components: the CNN framework, MRF-based fusion, and online retrieval. The CNN component learns the concept classifiers. Normally the last layer of a CNN is a single-concept classifier. We replace it with two types of classifiers: a single-concept fully connected classifier for single-concept detection and a multiconcept scene fully connected classifier for holistic scene detection. The MRF-based fusion component learns the semantic correlation between these two types of classifiers and produces the ultimate semantic score for a given multiconcept query $q$ with a semantic scene $c$. Online retrieval obtains the search result for this $q$ in four steps. First, CMMR generates the detection context by using a semantic neighbor approach. Second, the proposed CNN learns a single-concept classifier and a multiconcept scene classifier. Third, the MRF-based fusion approach learns the ultimate semantic scores of $c$. Finally, CMMR employs the learned semantic scores to perform semantic image retrieval.

Figure 1: The proposed CMMR framework.

3.1. Multiconcept Vocabulary Generation

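CMMR regards each multiconcept $c$ as a whole, that is, one concept of a holistic scene. To avoid meaningless concept combinations, CMMR keeps only the meaningful multiconcepts when generating a multiconcept vocabulary $C$, according to the following cooccurrence rule over the training set $\mathcal{T}$:

$$|c| \le \alpha, \tag{1}$$

$$f(c) > \beta, \tag{2}$$

where $|c|$ is the cardinality of $c$, for example, $|\{\text{clouds}, \text{sky}, \text{sunset}\}| = 3$, and $f(c)$ is the multiconcept frequency of $c$, that is, the number of training images in $\mathcal{T}$ that contain every concept of $c$. If the size of $C$ is too large, we can adjust the thresholds $\alpha$ and $\beta$ to reduce the computational expense. In this way, $C$ containing $|C|$ multiconcepts is generated.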
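The cooccurrence rule can be illustrated with a minimal Python sketch. It assumes that a multiconcept has at least two concepts, that it is kept when its cardinality does not exceed one threshold and its training frequency exceeds another, and that candidates are enumerated only from label sets that actually cooccur in some training image. The names `build_multiconcept_vocabulary`, `max_cardinality`, and `frequency_threshold` are hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def build_multiconcept_vocabulary(annotations, max_cardinality, frequency_threshold):
    """Return {multiconcept: frequency} for concept sets that cooccur often enough.

    annotations: list of sets of single-concept labels, one set per training image.
    max_cardinality, frequency_threshold: hypothetical names for the two thresholds.
    """
    counts = Counter()
    for labels in annotations:
        ordered = sorted(labels)
        # Enumerate only the label subsets that actually cooccur in this image.
        for size in range(2, min(max_cardinality, len(ordered)) + 1):
            for combo in combinations(ordered, size):
                counts[frozenset(combo)] += 1
    # Keep a multiconcept only if its frequency exceeds the threshold.
    return {c: n for c, n in counts.items() if n > frequency_threshold}


train_annotations = [
    {"clouds", "sky", "sunset"},
    {"clouds", "sky"},
    {"grass", "person"},
    {"clouds", "sky", "sea"},
]
vocab = build_multiconcept_vocabulary(
    train_annotations, max_cardinality=3, frequency_threshold=2
)
```

In this toy example only “clouds, sky” survives, since it is the only combination that appears in more than two images. Raising either threshold shrinks the vocabulary, which is how the computational expense would be controlled.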

3.2. CNN Network of Our Proposal

Normally a CNN has multiple convolutional layers followed by fully connected classifier layers. The convolutional layers learn and extract robust visual features, while the classifier layers learn a concept classifier. Any CNN for image tasks can be incorporated into our framework. Without loss of generality, we choose an influential model, GoogLeNet [7], to build our convolutional layers.

Image concept detection serves as a critical step in semantic image retrieval. Most conventional CNNs concentrate on image single-concept detection, thus performing suboptimally on image multiconcept scene detection. Furthermore, an original CNN (e.g., GoogLeNet) aims to predict one concept label of an unseen image, whereas in our case each image is labeled with multiple concepts. Therefore, we modify the GoogLeNet so as to fit multiconcept scene detection.

First, we design a specific fully connected classifier layer that suits holistic scene detection, comprising two kinds of classifiers: a multiconcept scene classifier and a single-concept classifier. They share one convolutional layer, since this convolutional layer generates a general visual representation. Second, we follow [30] to define our softmax loss function for multiconcept learning. With this definition, the normalized prediction of the image $x_i$ in the $j$th multiconcept $c_j$ is calculated as

$$p_{ij} = \frac{\exp\left(f_j(x_i)\right)}{\sum_{k=1}^{m} \exp\left(f_k(x_i)\right)}, \tag{3}$$

where $c_j$ (e.g., “grass, person, soccer, and sports”) is one holistic scene concept, $f_j(\cdot)$ is the activation function, and $m$ is the number of multiconcepts. Following [30], we use a rectified linear unit as our nonlinear activation function. We minimize the Kullback-Leibler divergence between the prediction and the ground truth; the loss $J$ is defined as

$$J = -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \bar{p}_{ij} \log p_{ij}, \tag{4}$$

where $n$ is the number of images and $\bar{p}_{ij}$ is the ground truth of the image $x_i$ in the $j$th multiconcept $c_j$. It is obvious that we have $\bar{p}_{ij} = 1$ if $c_j$ appears in $x_i$ and $\bar{p}_{ij} = 0$ otherwise.
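As an illustration of the softmax prediction and the loss described above, the following pure-Python sketch computes both for a toy batch. It assumes the common form in which each image's activations are normalized by a softmax and the loss is the mean, over images, of the negative log-probability summed over the multiconcepts present in the image. The function names and the toy numbers are hypothetical.

```python
import math


def softmax(activations):
    """Numerically stable softmax over one image's multiconcept activations."""
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def multiconcept_loss(batch_activations, batch_targets):
    """Mean over images of -sum_j target_ij * log(prediction_ij).

    batch_targets holds 0/1 indicators: 1 if the j-th multiconcept appears in
    the image and 0 otherwise.
    """
    loss = 0.0
    for activations, targets in zip(batch_activations, batch_targets):
        probs = softmax(activations)
        for p, t in zip(probs, targets):
            if t:
                loss -= t * math.log(p)
    return loss / len(batch_activations)


# Two images, three candidate multiconcepts (toy values).
activations = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
targets = [[1, 1, 0], [0, 0, 1]]
loss = multiconcept_loss(activations, targets)
```

The loss decreases as probability mass moves toward the multiconcepts that are actually present, which is the behavior the training objective relies on.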

3.3. CNN Training

Training a CNN is a two-stage process: convolution layer training and classifier layer training. The former extracts deep features, while the latter learns a reasonable concept classifier. This process is time-consuming, especially on large image databases. Therefore, a publicly released pretrained GoogLeNet is employed to accelerate training. The procedure includes three steps. First, our CNN model is initialized with the pretrained GoogLeNet so that it is able to extract deep features. Next, these deep features are fed into the classifier layer, which is then trained. Finally, the CNN is retrained by freezing the bottom convolution blocks and fine-tuning the top convolution block and the classifier.

For learning multiconcept of a scene , the positive sample set and the negative sample set are built as follows:where is the annotation set for training image . Based on above positive samples and negative samples, we train the multiconcept classifier. For traditional single-concept classifier training, the images labeled with the concept are employed as positive samples and the rest as negative samples.
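The construction of training samples for one multiconcept scene can be sketched as follows. The sketch assumes, by analogy with the single-concept case described above, that an image is a positive sample when its annotation set contains every concept of the scene and a negative sample otherwise. The function name `split_samples` is hypothetical.

```python
def split_samples(annotations, scene):
    """Partition training image indices for one multiconcept scene.

    Assumption: an image is positive when its annotation set contains every
    concept of the scene; all remaining images are negative.
    """
    scene = frozenset(scene)
    positives = [i for i, labels in enumerate(annotations) if scene <= labels]
    negatives = [i for i, labels in enumerate(annotations) if not scene <= labels]
    return positives, negatives


annotations = [
    {"clouds", "sky", "sunset"},
    {"clouds", "sky"},
    {"grass", "person"},
    {"clouds", "sky", "sea"},
]
pos, neg = split_samples(annotations, {"clouds", "sky"})
```

With a single-element scene such as `{"grass"}`, the same function reproduces the traditional single-concept split.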

3.4. Detection Context Generation

Given a multiconcept query $q$ with a semantic scene $c$, the concept neighbors of $c$ participate in the concept detection and output the relevance scores. These concept neighbors are tightly linked to $c$ and hence can be taken as the detection context of $c$. Some details on the procedure of generating the detection context are given below.

First, we generate a semantic neighbor set by choosing neighbor concepts with probabilities . This symmetric semantic probability measures the interdependency between two concepts and , which is represented aswhere and are the occurrence frequency of and , respectively, and is the number of images simultaneously including two multiconcepts and . Each multiconcept is seen as its own semantic neighbor and hence .

Second, we assign all subconcepts into the context set . Finally, we assign top- related concepts into the context set from the rest of . Thus, the detection context is generated, with elements. The interdependency probability should be normalized as follows:
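The three steps above can be sketched in Python. The exact form of the symmetric interdependency probability is not reproduced here; the sketch assumes a Jaccard-style ratio built from the two occurrence frequencies and the cooccurrence count, and it normalizes the weights so that they sum to one. The names `interdependency` and `detection_context`, the string identifiers for concepts, and the toy counts are all hypothetical.

```python
def interdependency(freq, cofreq, a, b):
    """Symmetric cooccurrence score in [0, 1]; the Jaccard form is an assumption."""
    if a == b:
        return 1.0  # each concept is its own semantic neighbor
    both = cofreq.get(frozenset((a, b)), 0)
    union = freq[a] + freq[b] - both
    return both / union if union else 0.0


def detection_context(query, subconcepts, candidates, freq, cofreq, k):
    """Return {concept: normalized weight} for the query, its subconcepts, and top-k neighbors."""
    context = {query: 1.0}
    for s in subconcepts:
        context[s] = interdependency(freq, cofreq, query, s)
    # Rank the remaining candidates by their link to the query and keep the top k.
    rest = [c for c in candidates if c not in context]
    rest.sort(key=lambda c: interdependency(freq, cofreq, query, c), reverse=True)
    for c in rest[:k]:
        context[c] = interdependency(freq, cofreq, query, c)
    total = sum(context.values())
    return {c: w / total for c, w in context.items()}


freq = {"clouds+sky": 3, "clouds": 3, "sky": 3, "sunset": 1, "grass": 1}
cofreq = {
    frozenset(("clouds+sky", "clouds")): 3,
    frozenset(("clouds+sky", "sky")): 3,
    frozenset(("clouds+sky", "sunset")): 1,
}
ctx = detection_context(
    "clouds+sky", ["clouds", "sky"], ["sunset", "grass"], freq, cofreq, k=1
)
```

In this toy case the context holds the query scene, its two subconcepts, and “sunset” as the single most related neighbor, with weights 0.3, 0.3, 0.3, and 0.1.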

3.5. MRF-Based Fusion for Multiconcept Scene Learning

With our CNN, the concept classifier has been learned. This concept classifier projects visual images to semantic concepts. If a semantic concept and its related concepts frequently appear in images, the relevance prediction of this semantic concept will be boosted in our model. Given a multiconcept query with the semantic scene , all concepts in the detection context are used for estimating the relevance. The relevance prediction is estimated as follows:The relevance prediction predicted by a multiconcept classifier is seen as an evidence of in an image , while the semantic correlation is treated as a weight of this relevance prediction. In view of the promising performance in single-concept learning reported in [6, 7], a single-concept classifier is integrated into the classifier layer of our CNN. Following [6], this single-concept prediction between and can be estimated as follows:where is the cardinality of and is a conventional single concept that is predicted by a single-concept classifier .

As a graphic model, MRF provides a basis for modeling contextual constraints in image retrieval. Hence, we employ MRF to analyze the semantic link between two types of classifiers mentioned above and produce the ultimate semantic score for . We first construct a specific MRF for the two types of classifiers and the query concept, that is, , so as to model their correlation. Then we infer the MRF-based fusion method for image concept detection.

Given a set of random variables on an MRF graph, the joint probability of MRF is a Gibbs distribution [31]:where is a normalization factor and is the energy function, that is, the sum of clique potentials over all possible cliques. If using random variable represents absence or presence of a multiconcept for an image , the joint probability of the random variable set can be defined aswhereWe define the potential functions aswhere are the CMMR parameters to be estimated and .
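For intuition, the standard Gibbs form $P(x) = \frac{1}{Z}\exp(-E(x))$, with $E(x)$ the sum of clique potentials and $Z$ the normalization factor, can be evaluated by brute force on a very small graph. The sketch below is a toy illustration only: the three-node graph and the `agree` potential are hypothetical and are not the potential functions of CMMR.

```python
import math
from itertools import product


def gibbs_distribution(num_vars, cliques):
    """Enumerate P(x) = exp(-E(x)) / Z over all binary assignments.

    cliques: list of (variable_indices, potential) pairs, where potential maps
    the clique's values to an energy contribution.
    """
    energies = {}
    for x in product((0, 1), repeat=num_vars):
        energies[x] = sum(pot(*(x[i] for i in idx)) for idx, pot in cliques)
    z = sum(math.exp(-e) for e in energies.values())  # normalization factor
    return {x: math.exp(-e) / z for x, e in energies.items()}


def agree(a, b):
    """Toy pairwise potential: lower energy when the two nodes agree."""
    return 0.0 if a == b else 1.0


# Hypothetical graph: x0 is the query scene, x1 and x2 are two classifier decisions.
cliques = [((0, 1), agree), ((0, 2), agree)]
dist = gibbs_distribution(3, cliques)
```

Exhaustive enumeration is only feasible for a handful of variables, but it makes explicit that configurations with lower total energy receive higher probability.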

3.6. Parameter Optimization

A widely used technique for parameter optimization is a maximum likelihood, which chooses the parameters that maximize the joint probabilities over the training set. As such, we maximize the log-likelihood function of the query . The final relevance prediction of the image is given bywhere and are equivalent to and , respectively. Therefore, is written asBy substituting in (17) with (15)-(16) and (11)-(14), we obtain the following log-likelihood function :By using the gradient descent method [32], the log-likelihood for optimizing is maximized. The gradient of with respect to () can be expressed as the following form:where .
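The optimization idea, gradient-based maximization of a log-likelihood over fusion parameters, can be illustrated with a deliberately simplified stand-in. The sketch below fuses a multiconcept score and a single-concept score through a logistic model and fits three weights by gradient ascent. This logistic form, the learning rate, the step count, and all names are assumptions chosen for illustration; they are not the CMMR potentials or its actual gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_fusion_weights(samples, steps=500, lr=0.5):
    """Gradient ascent on the mean log-likelihood of a two-score logistic fusion.

    samples: list of (multiconcept_score, single_concept_score, label) triples.
    Returns [bias, weight_multi, weight_single].
    """
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for s_multi, s_single, label in samples:
            features = (1.0, s_multi, s_single)
            p = sigmoid(sum(w * f for w, f in zip(weights, features)))
            for j, f in enumerate(features):
                grad[j] += (label - p) * f  # d(log-likelihood) / d(weight_j)
        weights = [w + lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights


def log_likelihood(samples, weights):
    total = 0.0
    for s_multi, s_single, label in samples:
        p = sigmoid(weights[0] + weights[1] * s_multi + weights[2] * s_single)
        total += math.log(p) if label else math.log(1.0 - p)
    return total


samples = [(0.9, 0.8, 1), (0.7, 0.9, 1), (0.2, 0.3, 0), (0.1, 0.4, 0)]
learned = fit_fusion_weights(samples)
```

Because the log-likelihood of this model is concave, each small ascent step increases it, so the learned weights score the training labels better than the zero initialization.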

3.7. Online Retrieval

CMMR concentrates on semantic image retrieval, including single-concept-based image retrieval and multiconcept-based image retrieval. A user employs one or more concepts to search for the top-K semantically similar images in a database. In short, we perform four steps for semantic image retrieval.

Step 1. Employ a semantic neighbor method to build the detection context of the query scene.

Step 2. Learn a multiconcept scene classifier A and a single-concept classifier B by our proposed CNN.

Step 3. Learn the final relevance score of the query scene by using MRF-based fusion.

Step 4. Perform semantic image retrieval by using the learned relevance scores. Images with higher relevance scores rank higher.

The detailed process of semantic image retrieval is presented in Algorithm 1. From Algorithm 1, we conduct complexity analysis of time and space. Computing a set of multiconcept scene is an offline process, costing time. Training a CNN is also an offline process, including deep feature extracting and classifier layer learning. This consumes time, where and are the trainable parameter number of CNN networks and the size of image set, respectively. By initializing our CNN with a pretrained GoogLeNet and using a very small classifier layer, the number is substantially reduced, boosting training efficiency. Computing the detection context is an online process, with time and space. For each test image, time and space complexity of computing predictions and fusing predictions are all . Therefore, all test images spend time and space. Ultimately, ranked images are returned through heap sort, consuming time and space. Hence, the complexities of time and space of Algorithm 1 are and , respectively.
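The final ranking step can be sketched with Python's standard `heapq` module, which selects the K best items with a bounded heap rather than sorting the whole list. The function name `retrieve_top_k` and the toy scores are hypothetical; the relevance scores are assumed to have been produced by the earlier steps.

```python
import heapq


def retrieve_top_k(scores, k):
    """Return the k image ids with the highest relevance scores, best first."""
    best = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
    return [image_id for image_id, _ in best]


scores = {"img_a": 0.12, "img_b": 0.87, "img_c": 0.45, "img_d": 0.66}
top = retrieve_top_k(scores, 2)
```

Selecting with a heap of size K keeps the cost of this step close to linear in the number of test images when K is small.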

Algorithm 1: Semantic image retrieval process.

4. Experiments

Our experiments on semantic image retrieval include multiconcept-based image retrieval and single-concept-based image retrieval.

4.1. Datasets

We conducted comprehensive experiments on two public datasets: MIR Flickr 2011 and NUS-WIDE. Since they include large vocabularies, we chose them to evaluate the performance of multiconcept-based image retrieval. Both datasets are publicly available and contain images and ground truth for single-concept task evaluation. MIR Flickr 2011 contains 18,000 images labeled with 99 semantic concepts. We split it into 8000 training images and 10,000 test images. NUS-WIDE comprises 269,648 images with a vocabulary of 81 semantic concepts. We downloaded 230,708 images in total for our experiments. This dataset is randomly divided into two sets: 138,375 images for training and the remaining 92,333 images for testing.

On MIR Flickr 2011, we follow literature [33], by using GIST, HOG, SIFT, and RGB histograms as visual features. To compare two features, we employ distance for GIST, for HOG, for SIFT, and for RGB. On NUS-WIDE, we use six visual features [13]. Similarly, we employ distance for wavelet texture, for an edge direction, for SIFT, and for LAB and HSV, which are used in [33].

The average number of images associated with a concept is around 940 in MIR Flickr 2011 and 5381 in NUS-WIDE. The average number of concepts associated with an image is approximately 11 in MIR Flickr 2011 and about 3 in NUS-WIDE. The label vocabularies consist of dozens of label concepts, and around two-thirds of the semantic concepts have frequencies less than the mean concept frequency. Hence, semantic scene retrieval on these imbalanced datasets is challenging.

4.2. Evaluation Measures

Given a query $q$ with semantic scene $c$, the ground truth for $q$ is defined as follows: if an image depicts all target concepts $w \in c$, it is considered to be relevant; otherwise it is irrelevant. To evaluate the performance of semantic retrieval, we use three evaluation measures: Mean Average Precision (MAP), Precision at $k$ (P@$k$), and the Precision-Recall (PR) curve. For each semantic query, Average Precision (AP) can be computed as $\mathrm{AP} = \frac{1}{R} \sum_{r=1}^{|S|} \delta(r) P(r)$, where $R$ is the total number of relevant images in the test set, $r$ is the rank in the retrieved image list $S$, $\delta(r)$ is an indicator function that equals 1 if the $r$th image is relevant to $q$ and equals 0 otherwise, and $P(r)$ is the precision at cut-off $r$ in $S$, which is defined as the ratio between the number of relevant images among the top $r$ and the number of retrieved images $r$. MAP is the mean value of APs over all the queries. For AP, the correctness of highly ranked retrieved images counts more. Clearly, the higher the MAP, the better the retrieval performance. P@$k$ is a variant of precision in which only the top-$k$ ranked images are considered. Higher P@$k$ means better retrieval performance. Besides MAP and P@$k$, we employ the PR curve to measure semantic retrieval performance.
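The measures described above can be computed with a few lines of Python. The sketch takes a ranked list of 0/1 relevance flags per query together with the total number of relevant images for that query. The function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_relevance, num_relevant):
    """AP: sum of precision at each relevant rank, divided by the number of relevant images."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / num_relevant if num_relevant else 0.0


def precision_at_k(ranked_relevance, k):
    """Fraction of relevant images among the top-k ranked images."""
    return sum(ranked_relevance[:k]) / k


def mean_average_precision(runs):
    """runs: list of (ranked_relevance, num_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


ranked = [1, 0, 1, 0, 0]  # relevance flags of the retrieved list, best rank first
ap = average_precision(ranked, num_relevant=2)
p_at_3 = precision_at_k(ranked, 3)
map_score = mean_average_precision([(ranked, 2), ([0, 1], 1)])
```

For the toy ranking, the relevant images sit at ranks 1 and 3, so AP is (1/1 + 2/3) / 2, and moving the second relevant image up to rank 2 would raise AP to 1.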

4.3. Experimental Configurations

In (1) and (2), and , respectively, control concept cardinality and concept frequency. Since training images with 11 and 3 concepts appear the most frequently, we set for MIR Flickr 2011 and for NUS-WIDE, respectively. To reduce computational cost, the size of is limited to an acceptable one. This means that if the frequency of a concept exceeds , it is put into ; otherwise it is discarded. We set for MIR Flickr 2011 and for NUS-WIDE in our experiments. Thus, contains 15,970 and 2084 multiconcepts, respectively. In (7), is used to control the size of , which is determined by 5-fold cross-validation. By testing from a candidate set , we observe that the best performance is achieved when setting on MIR Flickr 2011 and on NUS-WIDE, respectively. Therefore, we set their values accordingly. In addition, all the parameters of the compared methods are tuned to the best performance reported in the relevant literature.

The basic structure of the convolution layer we use is the same as the one used in [7]. The classifier layer starts with a densely connected layer with an output size of 1024, followed by a 20% dropout. For all layers, a rectified linear unit is employed as the nonlinear activation function. The whole CNN is optimized by stochastic gradient descent with a mini-batch size of 128 and a momentum of 0.9. The initial learning rate is set to 0.01. After 20 epochs, a staircase weight decay is used.

4.4. Comparisons

Our method is compared with several state-of-the-art concept-based methods, including TagProp [6], FastTag [34], VGG [23], DBM [35], and GoogLeNet [7]. As a classical nearest-neighbor method, TagProp uses single-concept techniques to resolve multiconcept-based image retrieval. FastTag learns two linear classifiers coregularized in a joint convex loss function that can be efficiently optimized in closed form on large-scale datasets. The others are influential single-concept-based deep learning methods. After experimenting with TagProp on the large-scale dataset NUS-WIDE, we found that this method is difficult to scale up to a large-scale dataset due to its time and space complexity. As such, we perform TagProp experiments by using 25,000 examples on NUS-WIDE. In addition, following literature [6], we use (9) to compute relevance prediction, given a query with multiconcept scene .

4.5. Experiments on Semantic Image Retrieval

To evaluate retrieval performance, we construct a test query set , by following two steps. First, all single-concept queries are added to . Then 1500 randomly generated queries with multiconcept scenes are put into , with 500 2-concepts, 500 3-concepts, and 500 4-concepts, where -concept is a multiconcept with cardinality . In this way, is built. On MIR Flickr 2011, is comprised of 1500 multiconcepts and 99 single concepts, while contains 1500 multiconcepts and 81 single concepts on NUS-WIDE. The MAPs and P@10s are used for evaluation on semantic image retrieval with varying query lengths. Tables 1 and 2 report MAP scores and P@10 scores, where MAP scores and P@10 scores are given in the format MAP/P@10.
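The construction of the test query set can be sketched as follows. The sketch assumes that the multiconcept queries are drawn uniformly at random, without replacement, from a pool of candidate multiconcepts grouped by cardinality; the source of that pool, the seed, and the name `build_query_set` are assumptions for illustration.

```python
import random


def build_query_set(single_concepts, multiconcepts, per_cardinality=500,
                    cardinalities=(2, 3, 4), seed=0):
    """Return all single-concept queries plus random multiconcept queries per cardinality."""
    rng = random.Random(seed)  # fixed seed so the query set is reproducible
    queries = [frozenset([w]) for w in single_concepts]
    for size in cardinalities:
        # Sort the pool so that sampling does not depend on set iteration order.
        pool = sorted((c for c in multiconcepts if len(c) == size), key=sorted)
        queries.extend(rng.sample(pool, min(per_cardinality, len(pool))))
    return queries


singles = ["sky", "grass"]
multis = [
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b", "c", "d"}),
]
queries = build_query_set(singles, multis, per_cardinality=1)
```

With 99 single concepts and 500 queries per cardinality, the same procedure would yield the 1599 queries used on MIR Flickr 2011, provided the pool holds at least 500 multiconcepts of each cardinality.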

Table 1: MAPs (%) and P@10s (%) of semantic image retrieval over all 1599 semantic queries on MIR Flickr 2011. MAP scores and P@10 scores are given in the format MAP/P@10.
Table 2: MAPs (%) and P@10s (%) of semantic image retrieval over all 1581 semantic queries on NUS-WIDE. MAP scores and P@10 scores are given in the format MAP/P@10.

From the results, we can see that our method, CMMR, outperforms the other methods. Clearly, multiconcept queries perform much worse than single-concept queries on both datasets. This is because detecting a multiconcept scene is more difficult than detecting a single concept. A multiconcept scene may have its own characteristic visual appearance, while the goal of traditional single-concept models is to achieve precise single-concept detection. To search for a holistic scene, traditional methods use a combination of single-concept techniques. However, in some cases, this may lose some semantics latently embedded in the holistic scene. Therefore, it is difficult to detect a sophisticated multiconcept scene by using only single-concept classifiers. This observation motivates us to jointly consider the multiconcept scene classifier and the single-concept classifier in devising our CNN. Moreover, the MRF-based fusion method can effectively learn the semantic correlation between the multiconcept scene classifier and the single-concept classifier, boosting the detection accuracy of a semantic scene.

We further conduct comparisons under different experimental settings. More specifically, we construct a comparative group consisting of a difficult query set, whose queries have fewer than 100 relevant images, and an easy query set, whose queries have more than 100 relevant images. The experimental results are shown in Figure 2. Our method still achieves the best results. Figure 3 shows the PR curves of all compared methods on the two datasets, illustrating how precision varies with recall. As can be seen, our method CMMR has better precision than the compared methods at every level of recall.

Figure 2: Semantic retrieval performance (MAPs % and P@10s %) over the comparative group: a difficult query set and an easy query set on two datasets.
Figure 3: The PR curves on two datasets.

4.6. Experiments on Rare Concept Queries

Most existing approaches assume balanced concept distributions or equal misclassification costs. Nevertheless, real-world datasets are commonly highly imbalanced [36]. When presented with complex imbalanced datasets, these methods fail to properly represent the distributive characteristics of the data and consequently yield unfavorable precision. On MIR Flickr 2011 and NUS-WIDE, the frequencies of most concepts are below average, leading a concept classifier to overclassify the frequent concepts in the learning stage. This makes it hard to derive a proper model for rare concepts with low occurrence frequencies. In such situations, a concept classifier commonly performs well on frequent concepts but very poorly on rare concepts. This observation suggests that, when developing a classifier, we should consider the varying frequencies of concepts.

Two groups of experiments are devised: rare concept queries and frequent concept queries. In the first group, the top-50 rare single concepts, the top-100 rare 2-concepts, and the top-100 rare 3-concepts from are selected as three respective sets of the single-concept rare queries, the 2-concept rare queries, and the 3-concept rare queries, respectively, denoted by , , and . In the second group, the top-50 frequent single concepts, the top-100 frequent 2-concepts, and the top-100 frequent 3-concepts from are, respectively, chosen as the set of single-concept frequent query, a set of the 2-concept frequent query, and a set of the 3-concept frequent query.

As shown in Figures 4 and 5, concept classifiers achieve higher MAPs and P@10s on the three frequent concept sets but far lower MAPs and P@10s on the three rare concept sets, significantly impacting retrieval performance and user experience. For the single-concept, 2-concept, and 3-concept rare query sets on MIR Flickr 2011, our approach outperforms the compared methods, improving on the second best method by 30%, 24%, and 26% in terms of MAP, respectively. On NUS-WIDE, a similar improvement is observed. During detection of a rare concept, a group of weighted concept classifiers from its detection context takes part in concept detection through the MRF-based fusion method. Some concepts in the detection context may be frequent concepts, which significantly boosts the relevance prediction of the rare concept and makes it easier to detect. Moreover, our maximization of the log likelihood of semantic concepts compensates for the varying frequencies of concepts. Consequently, our approach can mitigate the issue of concept imbalance, thus boosting retrieval performance.

Figure 4: MAPs and P@10s (%) of semantic image retrieval for rare concepts and frequent concepts on MIR Flickr 2011.
Figure 5: MAPs and P@10s (%) of semantic image retrieval for rare concepts and frequent concepts on NUS-WIDE.

5. Conclusion

Retrieving semantic images with high accuracy has become important because of the vast number of real-world applications, such as cognitive educational resource retrieval. As a key step, image scene detection plays an important role in semantic image retrieval. In this paper, we have presented a novel CNN framework for semantic image retrieval, which combines CNN and MRF in a way that enhances the capacity of multiconcept scene detection. Compared with previous methods, our CNN framework seamlessly incorporates three components: a single-concept classifier, a multiconcept scene classifier, and semantics. The combination of these three components enhances the capability of the CNN for detecting semantic scenes. We have conducted comprehensive experiments on two public datasets. The favorable results indicate that our proposed method outperforms the compared approaches.

For future work, we intend to develop a better learning and fusion method for multiconcept scene detection. We would also like to explore the links among concepts, for example, through a concept graph or a semantic hierarchy, to boost retrieval performance.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.


Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61370229 and 61702388), the GDUPS (2015), the CSC (no. 201706755023), and the China Postdoctoral Science Foundation (nos. 2016M600657 and 2017T100637).


References

  1. M. Kawanabe, A. Binder, C. Müller, and W. Wojcikiewicz, “Multi-modal visual concept classification of images via Markov random walk over tags,” in Proceedings of the IEEE Workshop on Applications of Computer Vision, pp. 396–401, 2011.
  2. S. Zhang, J. Huang, H. Li, and D. N. Metaxas, “Automatic image annotation and retrieval using group sparsity,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 3, pp. 838–849, 2012.
  3. S. Tang, Y.-D. Zhang, Z.-X. Xu, H.-J. Li, Y.-T. Zheng, and J.-T. Li, “An efficient concept detection system via sparse ensemble learning,” Neurocomputing, vol. 169, pp. 124–133, 2015.
  4. Z. Guan, L. Zhang, J. Peng, and J. Fan, “Multi-View Concept Learning for Data Representation,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 3016–3028, 2015.
  5. D. Grangier and S. Bengio, “A discriminative kernel-based approach to rank images from text queries,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 1371–1384, 2008.
  6. M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, “TagProp: discriminative metric learning in nearest neighbor models for image auto-annotation,” in Proceedings of the International Conference on Computer Vision, pp. 309–316, 2009.
  7. C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
  8. F. Radenović, G. Tolias, and O. Chum, “CNN image retrieval learns from BoW: unsupervised fine-tuning with hard examples,” in Proceedings of the European Conference on Computer Vision, pp. 3–20, 2016.
  9. G. Hu, X. Peng, Y. Yang, T. M. Hospedales, and J. Verbeek, “Frankenstein: learning deep face representations using small data,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 293–303, 2018.
  10. Z. Liu, X. Li, P. Luo, C.-C. Loy, and X. Tang, “Semantic image segmentation via deep parsing network,” in Proceedings of the International Conference on Computer Vision, pp. 1377–1385, 2015.
  11. C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, pp. 295–307, 2016.
  12. S. Nowak, K. Nagel, and J. Liebetrau, “The CLEF 2011 photo annotation and concept-based retrieval tasks,” in Proceedings of the CLEF Conference and Labs of the Evaluation Forum, pp. 1–25, 2011.
  13. T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, “NUS-WIDE: a real-world web image database from National University of Singapore,” in Proceedings of International Conference on Image and Video Retrieval, pp. 48–56, 2009.
  14. D. Wang and M. Li, “Stochastic Configuration Networks: Fundamentals and Algorithms,” IEEE Transactions on Cybernetics, vol. 47, no. 10, pp. 3466–3479, 2017.
  15. C.-T. Nguyen, N. Kaothanthong, T. Tokuyama, and X.-H. Phan, “A feature-word-topic model for image annotation and retrieval,” ACM Transactions on the Web (TWEB), vol. 7, no. 3, pp. 1–24, 2013.
  16. R. L. Cilibrasi and P. M. B. Vitányi, “The Google similarity distance,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 3, pp. 370–383, 2007.
  17. S. Zhu, C.-W. Ngo, and Y.-G. Jiang, “Sampling and ontologically pooling web images for visual concept learning,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 1068–1078, 2012.
  18. J.-M. Laferte, P. Perez, and F. Heitz, “Discrete Markov image modeling and inference on the quadtree,” IEEE Transactions on Image Processing, vol. 9, no. 3, pp. 390–404, 2000.
  19. D. Metzler and W. B. Croft, “Latent concept expansion using Markov random fields,” in Proceedings of the ACM International SIGIR Conference, pp. 311–318, 2007.
  20. Y. Xiang, X. Zhou, T.-S. Chua, and C.-W. Ngo, “A revisit of generative model for automatic image annotation using Markov random fields,” in Proceedings of the 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pp. 1153–1160, June 2009.
  21. Z. Lu and H. H. S. Ip, “Spatial Markov kernels for image categorization and annotation,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 4, pp. 976–989, 2011.
  22. X. Dong, J. Shen, L. Shao, and L. Van Gool, “Sub-Markov random walk for image segmentation,” IEEE Transactions on Image Processing, vol. 25, no. 2, pp. 516–527, 2016.
  23. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,”
  24. T. Hoang, T.-T. Do, D.-K. Le Tan, and N.-M. Cheung, “Selective deep convolutional features for image retrieval,” in Proceedings of the 25th ACM International Conference on Multimedia, MM 2017, pp. 1600–1608, October 2017.
  25. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
  26. C. Xu, C. Lu, X. Liang et al., “Multi-loss Regularized Deep Neural Network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 12, pp. 2273–2283, 2016.
  27. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 770–778, July 2016.
  28. D. Wang and C. Cui, “Stochastic configuration networks ensemble with heterogeneous features for large-scale data analytics,” Information Sciences, vol. 417, pp. 55–71, 2017.
  29. G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, Honolulu, Hawaii, USA, July 2017.
  30. Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, “Deep convolutional ranking for multilabel image annotation,”
  31. C. Wang, N. Komodakis, and N. Paragios, “Markov Random Field modeling, inference and learning in computer vision and image understanding: a survey,” Computer Vision and Image Understanding, vol. 117, no. 11, pp. 1610–1627, 2013.
  32. X. Li, “Preconditioned Stochastic Gradient Descent,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1454–1466, 2018.
  33. J. Verbeek, M. Guillaumin, T. Mensink, and C. Schmid, “Image annotation with tagprop on the MIRFLICKR set,” in Proceedings of the 2010 ACM SIGMM International Conference on Multimedia Information Retrieval, MIR 2010, pp. 537–546, USA, March 2010.
  34. M. Chen, A. Zheng, and K. Weinberger, “Fast image tagging,” in Proceedings of the International Conference on Machine Learning, pp. 1274–1282, 2013.
  35. N. Srivastava and R. Salakhutdinov, “Multimodal learning with deep Boltzmann machines,” Journal of Machine Learning Research, vol. 15, pp. 2949–2980, 2014.
  36. G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing, “Learning from class-imbalanced data: Review of methods and applications,” Expert Systems with Applications, vol. 73, pp. 220–239, 2017.