Abstract

A novel image representation is proposed for content-based image retrieval (CBIR). The core idea of the proposed method is to perform deep learning on the local features of an image and to embed a semantic component into the representation through a hierarchical architecture built to simulate the human visual perception system; on this basis, a new image descriptor, the features conduction neural response (FCNR), is constructed. Compared with the classical neural response (NR), the FCNR has lower computational complexity and is more suitable for CBIR tasks. The results of experiments on a commonly used image database demonstrate that, compared with NR-related methods and other image descriptors originally developed for CBIR, the proposed method achieves excellent retrieval efficiency and effectiveness.

1. Introduction

Driven by the demand of the search service market, content-based image retrieval (CBIR) has been an active topic in the research fields of pattern recognition and artificial intelligence for many years. The common ground of CBIR systems is to extract a signature for every image based on its pixel values and to define a rule for comparing images. The components of the signature are called features. An obvious advantage of a signature over the original pixel values is the significant compression of the image representation. However, a more important reason for using a signature is to gain an improved correlation between the image representation and the image semantics. In fact, the main task of designing a signature is to bridge the gap between image semantics and the pixel representation, that is, to create a better correlation with image semantics [1].

Researchers have tried to use machine learning techniques either to derive a similarity measure for the high-level semantics of images from existing image representations [2] or to first cluster the images by self-organizing maps and then perform retrieval [3]. Examples of the former include bandletized regions through support vector machines (BRSVM) learning and online multiple kernel similarity (OMKS) learning [4–6]; examples of the latter include tree-structured self-organizing maps (TS-SOM) and the growing hierarchical quadtree self-organizing map (GHSOQM) [7, 8]. These methods are often used in combination with relevance feedback technology, which can enhance retrieval effectiveness to a certain extent [8, 9]. However, these methods are highly technical and often require a great deal of training time, which makes them difficult to apply in practice.

On the other hand, research on image representation for CBIR is constantly advancing, and many creative image representation methods have been proposed. These methods can be broadly divided into two categories: global feature based approaches and local feature based approaches. For example, the edge histogram descriptor (EHD) [10], the multiple texture histogram (MTH) [11], and the color difference histogram (CDH) [12] are all based on global characteristics. The features extracted by these algorithms have good discrimination ability and robustness. However, an overly complex feature representation is not always applicable to CBIR [1, 13]. At the same time, local feature extraction methods have also attracted great attention [14, 15]. These methods build the feature representation of an image from key points [16] or salient blocks in the image [17, 18]. Determining the key points and the salient regions of an image often depends on complex image segmentation technology. So far, however, image segmentation remains one of the difficult problems in image processing, which limits the application of these methods in CBIR.

In recent years, neural science on the human visual cortex and the related hierarchical learning methods have provided a new direction for studying this problem. Research has shown that the human visual perception system has very good abilities of learning and generalization from a few examples, and these abilities are conferred by the hierarchical structure of the visual cortex [19–21]. Based on the hierarchical structure of the visual cortex, Smale et al. proposed the concept of the derived kernel and the related theory of neural responses (NR) [22]. They established a mathematical model to simulate the hierarchical information processing of the human visual system. In the NR model, the inner product defined by the neural response leads to a similarity measure between images called the derived kernel. Based on a hierarchical architecture, a recursive definition of the neural response and the associated derived kernel is given. The derived kernel can be used in a variety of application domains, such as classification of images, strings of text, and genomics data. Theoretical analysis and experimental results show that the NR model is an effective feature extraction method, with the potential to be further improved and enhanced in many applications [17, 23–25]. Most important of all, the NR model has a key semantic component: a system of templates which fuses the visual features and the semantic features of an image together, which is very important in CBIR.

However, because the underlying neural response uses the pixel values of the bottom-level subblocks of the image and then passes them to the upper-level subblocks, this algorithm is not suitable for CBIR. In CBIR tasks, the image databases are usually very large and the resolution of the images is usually very high; the pixel-to-pixel exhaustive algorithm can hardly bridge the “semantic gap” in complex scene images, and its huge amount of computation also limits its application in practice. In order to capture the high-level semantic features of the image and at the same time improve the efficiency of retrieval, we propose the concept and the corresponding algorithm of the features conduction neural response (FCNR) on the basis of the related theory of NR.

In the proposed method, we first divide the spatial domain of an image in a simple way and then obtain the local feature representation of the image by extracting basic characteristics such as color, texture, and shape features on each local area of the image. Next, we establish a hierarchical structure for the local feature representation of the image; at the same time, for each layer of the structure, a local feature template set is constructed. In the first layer of the hierarchical structure, local features are used to construct the initial neural responses, and then these features are conducted to the higher-level subblocks layer by layer through the normalized inner product of the neural responses. Finally, the image is expressed as a vector called the FCNR, which can be used as an image representation for CBIR. The major advantages of the proposed method can be summarized as follows.
(i) The FCNR is derived from a local feature array rather than raw pixel values, which overcomes the overlearning problem of the classical NR method.
(ii) The high-level semantic component of the image is introduced into the feature representation through the interaction between the subblocks of the image and the templates in every layer of the constructed hierarchical structure.
(iii) Without loss of its excellent discrimination ability and invariance, the FCNR avoids the pixel-to-pixel exhaustive algorithm of the NR and reduces the computational complexity significantly, which is essential for CBIR purposes.

The rest of this paper is organized as follows. In Section 2, the FCNR model is constructed. In Section 3, the image retrieval method based on FCNR is introduced. Then, in Section 4, we verify the effectiveness of the proposed method with extensive experiments on popular data sets and compare it with other CBIR methods. Finally, conclusions are drawn and some future research issues are discussed in Section 5.

2. Feature Conduction Neural Response (FCNR)

The starting point of NR was to establish the mathematical model for visual mechanism of primate visual cortex [19, 26, 27]. In order to simulate the hierarchical information processing of visual cortex, Smale et al. [22] divided the image domain into some nested blocks as shown in Figure 1.

The NR of an image is defined in a bottom-up fashion based on the hierarchical architecture. As a feature vector of an image, the NR can be used to define the similarity between images. Theoretical analysis and experimental results show that the NR model has good discrimination performance and is robust to transformations, which suggests that the learning process of the NR model possesses the characteristics of the human visual system to a certain degree.

2.1. Notation and Preliminaries

In this paper, we consider the case of a three-layer hierarchical architecture.

As shown in Figure 2, let regions , , and in  () be pieces of the domain on which the patches or subpatches of images are defined. In the vision interpretation, these regions can be considered as receptive fields of different sizes. When working with grayscale images, an image or an image patch can be seen as a discrete function of two variables which takes the corresponding gray values as its functional values. That is to say, an image of size can be seen as a function defined on the domain . In this case, an image set consisting of the images defined on the domain can be denoted by . For convenience of description, we denote the cardinality of a set as in the rest of this paper. Accordingly, the images in can be denoted by ; that is, . Similarly, the sets of image patches of size and size can be denoted by and with and , respectively.

As an example, Figure 3 shows the nested architectures and the relationship of the image and the image patches. In Figure 3, is a whole image of size , are two examples of image patches of size cut out from image , and are four examples of image patches of size cut out from image , respectively.

The elements and in Figure 3, which restrict an image or image patch to a specific subpatch, are the other key ingredients of the NR model: transformations associating two adjacent domains. Formally, the set is called the transformation set, in which the map is a transformation from the smallest patch to the next larger patch . Similarly, with , can be defined. In this paper, the transformations are limited to translations and take the form . Consequently, we can consider as a set of translations corresponding to moving a sliding window of size in patch , and similarly as a set of translations corresponding to moving a sliding window of size in patch . For example, given an image of size , if the step length equals one pixel, image patches can be obtained by restricting the image on the given subpatch of size .

The following fundamental assumption related to image sets and transformation sets is supposed to be satisfied throughout this paper [22].

Axiom 1. If and , then , where denotes the restriction of image patch on region by transformation of . Similarly, if and .

The last essential ingredient of the NR model is a series of template sets. Finitely many elements are selected as the first-layer templates and constitute the first-layer template set . In the same way, the second-layer template set can be obtained. Obviously, templates are image patches which can be seen as frequently encountered image elements and serve as building blocks to represent other images. These templates carry abundant higher-level semantic information of the images, which can be used to promote discrimination ability in image retrieval.

2.2. The Construction of FCNR

The first step of constructing the FCNR is to segment the whole image in a simple way. Differently from other feature extraction methods based on region segmentation technology [8, 14], here we just use a perpendicular line network to segment the image into small rectangular areas of the same size. Then, in each small region, we extract features such as color, texture, and shape, and all these characteristics are represented by a vector. Thus, an image can be represented as a three-dimensional feature array. On the basis of this three-dimensional array, the local characteristics are conducted step by step to the higher layers following the same mode as the NR, and finally the FCNR is obtained. The specific process is given below.

For any image , we divide it into rectangular blocks of the same size using a perpendicular line network; that is,

We extract some visual characteristics from each rectangular block in the same way. The details of the feature extraction methods will be presented in Section 3. Normalizing the vectors with these characteristics as components and denoting them by , we can get an array

which is the local feature representation of the image .

It should be emphasized that these are all normalized vectors of the same dimension and that each component of these vectors represents a feature of the image block. An obvious advantage of normalization is invariance to brightness changes of the image. If characteristics are extracted from each rectangular image block , then is a three-dimensional array and can be simply represented as

in which denotes the th feature of the image block in the th row and th column of the image .
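As a concrete illustration of this step, the following sketch (in Python with NumPy; the block size, the number of features, and the toy feature functions are hypothetical placeholders, not the paper's exact configuration) divides a grayscale image into equal rectangular blocks and stacks one normalized feature vector per block into a three-dimensional array:

```python
import numpy as np

def local_feature_array(image, block_h, block_w, feature_fns):
    """Divide `image` into equal blocks and build the 3-D feature array F,
    where F[i, j, :] is the normalized feature vector of block (i, j)."""
    H, W = image.shape
    rows, cols = H // block_h, W // block_w
    F = np.zeros((rows, cols, len(feature_fns)))
    for i in range(rows):
        for j in range(cols):
            block = image[i*block_h:(i+1)*block_h, j*block_w:(j+1)*block_w]
            v = np.array([fn(block) for fn in feature_fns], dtype=float)
            norm = np.linalg.norm(v)
            F[i, j, :] = v / norm if norm > 0 else v  # normalize each vector
    return F

# Toy feature functions standing in for the paper's 14 color/texture/shape features.
features = [np.mean, np.std, lambda b: b.max() - b.min()]
img = np.random.rand(64, 64)
F = local_feature_array(img, 8, 8, features)
print(F.shape)  # (8, 8, 3)
```

The array `F` plays the role of the local feature representation in (2); in the paper each block vector has 14 components rather than 3.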

In this circumstance, the related notations and their meanings introduced in the previous section should be adjusted accordingly. The set of local feature representations of the images in the set is denoted by ; that is,

where denotes the area of size . Accordingly, we use and to denote the sets of patches of size and size , respectively. It should be emphasized that the elements of the previous and are obtained by sampling from the rows and columns of the array with moving windows rather than directly sampling from the image .

For example, assume that is an image of size . It is divided into square subblocks and we extract characteristics from each subblock. At this point, the local feature representation is a three-dimensional array of size and the size of is . If -size is and -size is , then and are three-dimensional arrays of size and size , respectively.

The notations and still denote the transformation sets of the transformations from to and to . The template sets and are obtained from in a similar way by moving the window on and , respectively. The elements in these template sets denoted by and are also some three-dimensional arrays.

Now we can define the features conduction neural response. Firstly, assume and, for any , we have according to Axiom 1. Taking a template , we call

the neural response of to the template , where denotes the three-dimensional array obtained by multiplying the corresponding elements of the two three-dimensional arrays and , and represents the element of in the th row and th column of the th page. When runs over the template set , we get a -dimensional vector

which is called the first-layer neural response of to the template set . After normalization, it is denoted as ; that is,

where is the inner product of two vectors in the usual sense.
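A minimal sketch of this first-layer computation is given below (Python/NumPy; the patch and template sizes are toy values, and we assume, as in the classical NR, that the response to a template is the maximum inner product over all translations of the template-sized window):

```python
import numpy as np

def inner3d(a, b):
    """Inner product of two 3-D feature arrays: sum of elementwise products."""
    return float(np.sum(a * b))

def first_layer_response(f_patch, templates, u_shape):
    """First-layer neural response of a v-size feature patch `f_patch` to a
    set of u-size templates: for each template, take the maximum inner
    product over all translations of a u-size window in `f_patch`, then
    normalize the resulting vector."""
    uh, uw = u_shape
    H, W, _ = f_patch.shape
    resp = np.zeros(len(templates))
    for k, t in enumerate(templates):
        best = -np.inf
        for i in range(H - uh + 1):          # slide the u-size window
            for j in range(W - uw + 1):
                best = max(best, inner3d(f_patch[i:i+uh, j:j+uw, :], t))
        resp[k] = best
    n = np.linalg.norm(resp)
    return resp / n if n > 0 else resp

# Toy data: a 4x4 feature patch with 3 features, two 2x2 templates.
rng = np.random.default_rng(0)
f = rng.random((4, 4, 3))
templates = [rng.random((2, 2, 3)) for _ in range(2)]
r = first_layer_response(f, templates, (2, 2))
print(r.shape)  # (2,)
```

The returned vector has one entry per first-layer template and unit Euclidean norm, matching the normalization step above.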

Next, set , and, according to Axiom 1, we know that . For any template , we call

the neural response of to the template . When runs over the template set , we get a -dimensional vector

which is called the second-layer neural response of to the template set .

Finally, for any image , we define as the features conduction neural response (FCNR) of the image , denoted by ; that is,
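Putting the two layers together, the whole construction can be sketched as follows (Python/NumPy; all sizes and template counts are toy values, and the max-over-translations convention of the classical NR is assumed):

```python
import numpy as np

def inner3d(a, b):
    return float(np.sum(a * b))

def n1(patch, templates_u, u_shape):
    """Normalized first-layer response of a v-size feature patch."""
    uh, uw = u_shape
    H, W, _ = patch.shape
    r = np.array([max(inner3d(patch[i:i+uh, j:j+uw, :], t)
                      for i in range(H - uh + 1)
                      for j in range(W - uw + 1)) for t in templates_u])
    n = np.linalg.norm(r)
    return r / n if n > 0 else r

def fcnr(F, templates_u, templates_v, u_shape, v_shape):
    """Second-layer response of the whole feature array F: for each v-size
    template, take the maximum normalized inner product between the
    first-layer responses of F's v-size subpatches and the template's own
    first-layer response. The resulting vector is the FCNR of the image."""
    vh, vw = v_shape
    H, W, _ = F.shape
    tv_resp = [n1(t, templates_u, u_shape) for t in templates_v]
    out = np.zeros(len(templates_v))
    for k, tr in enumerate(tv_resp):
        best = -np.inf
        for i in range(H - vh + 1):
            for j in range(W - vw + 1):
                best = max(best, float(np.dot(n1(F[i:i+vh, j:j+vw, :],
                                                 templates_u, u_shape), tr)))
        out[k] = best
    return out

rng = np.random.default_rng(1)
F = rng.random((6, 6, 3))                       # local feature array of an image
Tu = [rng.random((2, 2, 3)) for _ in range(4)]  # u-size templates
Tv = [rng.random((3, 3, 3)) for _ in range(5)]  # v-size templates
x = fcnr(F, Tu, Tv, (2, 2), (3, 3))
print(x.shape)  # (5,)
```

Note that the dimension of the output equals the number of second-layer templates, consistent with remark (i) below.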

We add some remarks.
(i) The FCNR of an image is a vector whose dimension equals the number of templates in the second layer and has nothing to do with the dimensions of the image itself. Therefore, in the process of image processing, we can transform all the images into vectors of the same dimension, regardless of whether the sizes of the images are the same or not.
(ii) Owing to the use of low-level visual features in the underlying layer, the FCNR model effectively overcomes the shortcomings of the pixel-to-pixel exhaustive algorithm of the NR model. At the same time, the low-level visual features of the image are conducted to the upper layers through the interaction between the subblocks of the image and the templates in every layer of the constructed hierarchical structure, which makes the FCNR contain high-level semantic elements of the image; this is very important in the task of CBIR.
(iii) From the perspective of learning theory, the feature extraction method of the FCNR belongs to the category of unsupervised learning [2, 19, 22], and the hierarchical structure is introduced to perform deep learning on the low-level visual features.

2.3. Computational Complexity Analysis

In image retrieval tasks, we often have real-time requirements. As a result, the complexity of the algorithm is very important when constructing a feature representation for CBIR. Here we analyze the computational complexity of the method proposed in this paper.

Consider the case of the -layer hierarchical architecture as shown in Figure 1. We define a set of global transformations, where the range is always the entire image domain rather than the next larger patch, by recursively setting for any , where contains only the identity . In the above formula, denotes the transformation set from a patch of the th layer to the next larger patch.

We denote the template in the th layer by . Ignoring the cost of normalization and of precomputing the neural responses of the templates, the number of operations required to compute the NR is given by

where, for notational convenience, we denote the cost of computing the initial kernel by [22].

Because the image is preprocessed, in (12) in the calculation of the FCNR will be less than that in the calculation of the NR. This will eventually lead to the fact that of FCNR is far less than that of NR. In order to illustrate this point intuitively, we give a specific example.

Suppose is an original image of size . In the calculation of the NR of , we take the -size as pixels and the -size as pixels. On the other hand, before the calculation of the FCNR of , the image is divided into square subblocks of size and features are extracted from each block. In this case, we take the -size as and the -size as , which correspond to pixels and pixels in the original image, respectively. We also assume that the numbers of templates selected in each layer of the two methods are equal. Noting that in this paper, we can calculate of the NR and the FCNR using (12), respectively. The values of the parameters in (12) and the results for NR and FCNR are listed in Table 1.

From Table 1, we can see that the number of operations of the FCNR is less than one five-thousandth of the number of operations of the NR. This means that the computational complexity of the NR is much higher than that of the FCNR. In fact, for images with high resolution, it is not practical to directly calculate the NR; usually, we do some simple preprocessing of the image before calculating its NR. Therefore, the difference in computation is not so great in practice (see Section 4).

3. CBIR System Based on FCNR

For a given image library, we divide all the images in the library into rectangular blocks of appropriate size using mutually perpendicular lines. In each rectangular block, the low-level features are extracted in the same way, and thus the local feature representation of an original image is obtained. The local feature representations of all images in the library constitute a local characteristic database. We can then construct a hierarchical architecture for the local feature representation, and the template sets of every layer of the architecture can be obtained from the local characteristic database. On this basis, using the algorithm described in Section 2.2 to compute the FCNR of every image in the library, we can establish an FCNR library associated with the original image library. If a proper similarity measure is defined on the feature space of the FCNR, then image retrieval can be carried out.

When the user enters a query image for relevant image retrieval, the system first calculates the FCNR of the query image according to the above-mentioned steps and then calculates the similarity between the query image and all images in the image database according to the defined similarity measure. Finally, the system sorts the images in the library in decreasing order of similarity, and a number of top-ranked images are output to the user. The flow diagram of CBIR based on FCNR is shown in Figure 4.

3.1. Local Low-Level Feature Extraction

In this paper, some simple and robust methods are used to extract fourteen basic low-level features, including color features, texture features, and shape features, from each image block.

Similar to some CBIR related literature, we use the well-known YCbCr color space in the extraction of color features [1, 8]. In this color space, the luminance information is stored in a single component , and the color information is stored in two color difference components and . We calculate the mean and standard deviation of , , and for each subblock, where the mean values are denoted as , , and and the standard deviations are denoted as , , and . In this way we get six color features (for monochrome images, only the two brightness features can be extracted).
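A sketch of this color feature extraction is shown below (Python/NumPy; the standard BT.601 full-range RGB-to-YCbCr conversion is assumed, as the paper does not spell out which variant it uses):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """BT.601 full-range RGB -> YCbCr conversion (8-bit values)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128
    return np.stack([y, cb, cr], axis=-1)

def color_features(block_rgb):
    """Six color features of one block: mean and standard deviation
    of the Y, Cb, and Cr components."""
    ycc = rgb_to_ycbcr(block_rgb.astype(float))
    means = ycc.mean(axis=(0, 1))
    stds = ycc.std(axis=(0, 1))
    return np.concatenate([means, stds])  # [mY, mCb, mCr, sY, sCb, sCr]

block = np.random.randint(0, 256, (16, 16, 3))
f_color = color_features(block)
print(f_color.shape)  # (6,)
```

For a uniform gray block, the Cb and Cr means are 128 and all standard deviations are zero, as expected for an achromatic region.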

Next, we use the Haar wavelet transform to extract texture features from the component of each rectangular image block. First, we apply the Haar wavelet transform to each subblock of the rectangular image block, obtaining four matrices: a sampling approximation and three detail matrices in three directions (horizontal, vertical, and diagonal). Denote the three detail matrices by , respectively, and let

After the wavelet transformation, we assign the three variables to each pixel of the rectangular image block. Then we compute the averages and standard deviations of the three variables , , and for each rectangular image block and denote the averages as , , and and the standard deviations as , , and , respectively.
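The single-level 2-D Haar decomposition can be sketched as follows (Python/NumPy; since the paper's exact combination formulas are elided in this copy, the sketch uses the magnitude of each detail coefficient as the per-pixel texture value, which is one common choice):

```python
import numpy as np

def haar_detail_features(block):
    """Single-level 2-D Haar transform of a block (even side lengths),
    followed by the mean and standard deviation of the magnitudes of
    the horizontal, vertical, and diagonal detail coefficients."""
    a = block.astype(float)
    er, orr = a[0::2, :], a[1::2, :]         # even/odd rows
    lo_r, hi_r = (er + orr) / 2, (er - orr) / 2
    def cols(m):
        ec, oc = m[:, 0::2], m[:, 1::2]      # even/odd columns
        return (ec + oc) / 2, (ec - oc) / 2
    ll, lh = cols(lo_r)                      # approximation, horizontal detail
    hl, hh = cols(hi_r)                      # vertical detail, diagonal detail
    feats = []
    for d in (lh, hl, hh):
        mag = np.abs(d)
        feats += [mag.mean(), mag.std()]
    return np.array(feats)                   # [mH, sH, mV, sV, mD, sD]

block = np.random.rand(16, 16)
f_tex = haar_detail_features(block)
print(f_tex.shape)  # (6,)
```

A constant block yields all-zero detail coefficients, so its six texture features vanish, matching the intuition that a flat region has no texture.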

Noting that the standard deviation of the component of the rectangular image block has already been obtained, we take the thirteenth feature as

which is the smoothness of the image block and reflects the relative smoothness of the brightness in the corresponding region. The last feature is the entropy of the component of the rectangular image block; that is,

where is the gray-level histogram of the component of the rectangular image block and is the number of possible gray levels. Entropy is a measure of the randomness of the image elements [13].
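The two formulas are elided in this copy; the sketch below assumes the standard definitions from the image processing toolbox literature the paper cites [13], namely relative smoothness R = 1 − 1/(1 + σ²) and histogram entropy −Σ p·log₂(p):

```python
import numpy as np

def smoothness(sigma):
    """Relative smoothness R = 1 - 1/(1 + sigma^2), where sigma is the
    standard deviation of the luminance component of the block."""
    return 1.0 - 1.0 / (1.0 + sigma ** 2)

def entropy(block, levels=256):
    """Entropy -sum p*log2(p) of the gray-level histogram of a block."""
    hist, _ = np.histogram(block, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                  # ignore empty bins (0 * log 0 := 0)
    return float(-np.sum(p * np.log2(p)))

block = np.random.randint(0, 256, (16, 16))
print(smoothness(block.std()), entropy(block))
```

A perfectly flat block gives smoothness 0 and entropy 0; as brightness variation and randomness grow, both features increase (entropy up to log₂ of the number of gray levels).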

Combining the 14 features mentioned above, we get the low-level visual feature representation of the rectangular image block, which we denote as ; that is,

After obtaining the low-level visual feature representations of all rectangular blocks of an image, we get the local feature representation of the whole image as shown in (2).

3.2. The Similarity Measure

Retrieval accuracy depends not only on a robust feature representation but also on a good similarity measure. In order to highlight the advantages of the FCNR in image feature representation, we adopt a very basic and natural choice in this paper: the similarity between two images is defined as the normalized inner product of their FCNRs. Specifically, for any , its FCNR is a vector of , where represents the number of templates in the second layer. Normalizing , we obtain

and the similarity of two images can be defined as

It is not difficult to see that this definition of image similarity is of a piece with the definition of the similarity of image patches at all layers in the process of constructing the FCNR.
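The normalized inner product is just cosine similarity, and ranking a library by it is a one-liner, as the sketch below shows (Python/NumPy; the library of random FCNR vectors and the cutoff of 10 are hypothetical):

```python
import numpy as np

def similarity(x, y):
    """Normalized inner product of two FCNR vectors (cosine similarity)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0 or ny == 0:
        return 0.0
    return float(np.dot(x, y) / (nx * ny))

a = np.array([1.0, 2.0, 3.0])
print(similarity(a, 2 * a))  # ≈ 1.0 for parallel vectors

# Hypothetical FCNR library: id -> 5-dimensional FCNR vector.
db = {i: np.random.rand(5) for i in range(100)}
q = np.random.rand(5)        # FCNR of the query image
top = sorted(db, key=lambda i: similarity(q, db[i]), reverse=True)[:10]
print(len(top))              # the 10 most similar image ids
```

Because each FCNR is normalized before comparison, the similarity lies in [0, 1] for nonnegative responses, with 1 attained only by parallel vectors.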

Thus, when the user inputs the query image , the system first computes according to (10) and according to (18) and then calculates the similarities between the query image and all images in the database according to (18) and (19). Finally, the system sorts the images in descending order of similarity and outputs the top images to the user as the query result, where the parameter is specified by the user according to the query requirements.

4. Experiments

In this section, we discuss simulation experiments that demonstrate the performance of the proposed method in image retrieval. Firstly, the evaluation standards for the performance of a CBIR system are given and the appropriate parameters of the FCNR method are selected. Then, we compare the performance of the FCNR with the classical NR and the local neural response (LNR) [17] in image retrieval. Finally, we also compare the proposed method with several feature extraction methods originally designed for image retrieval, including a benchmark method and some relatively new methods.

The image library used in the experiments contains 1000 images of size or selected from the COREL database, which is a general-purpose image database including about 60,000 pictures [1]. The selected images belong to ten classes, each of which has a semantic name and contains 100 pictures. For the sake of clarity, these 1000 images are numbered from to . The semantic name and the corresponding number range of each class are listed in Table 2 [8]. We randomly selected four pictures from each class and show them in Figure 5.

It is necessary to emphasize that the templates used in the experiments for calculating the FCNR are randomly cut out from the local feature arrays of the images (and not from the original images) by moving or rotating a window of a specific size. The experiments were conducted on a computer with 4 GB random access memory and a 2.60 GHz Intel(R) Core(TM) i5-3230M processor, and the code was implemented in MATLAB, in which the image processing toolbox functions are called [13].

4.1. Evaluation Standards and Parameter Determinations

There are a variety of ways to evaluate retrieval performance. In this paper, we mainly use the recall-precision graph, which is the most commonly used in the image retrieval community, to evaluate the performance of the FCNR. Precision is defined as

where is the number of retrieved images and is the number of relevant images among the retrieved images. Recall is defined as

where is the number of all relevant images in the library. An optimal recall-precision graph would be a horizontal line, that is, precision always at 1. Typically, when recall increases, precision decreases accordingly.
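These two measures can be computed directly from the retrieved and relevant id sets, as in the sketch below (Python; the image ids and set sizes are hypothetical, chosen only to mimic a 30-image retrieval against a 100-image relevant class):

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / number retrieved;
    recall = relevant retrieved / number of relevant images in the library."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 30 images retrieved, 100 relevant images in the library.
retrieved = list(range(30))                          # ids of retrieved images
relevant = list(range(20)) + list(range(100, 180))   # 100 relevant ids
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 20/30 ≈ 0.667 precision, 20/100 = 0.2 recall
```

Averaging the precision over many queries at each fixed recall value yields the recall-average precision graph used in the comparisons below.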

However, the results of one or two retrievals cannot fully exhibit the advantages and disadvantages of an algorithm, and they are not convenient for comparison with other methods. Therefore, we randomly selected 50 images from the image database to form a set of query images. For fixed recall, averaging the precision over the 50 queries, we obtain the recall-average precision graph, which is a relatively reliable evaluation standard. In general, high average precision at high recall means that the algorithm has good performance; that is, the algorithm whose recall-average precision graph lies toward the upper right is better. In addition, due to the real-time requirements of the CBIR task, the shorter the query time, the better the performance of the algorithm.

In our experiments, the images of size are first transformed into images of size through rotation, and then all the images are divided into square subblocks of size , a total of blocks for each image. For each image block, we extract local features by the methods described in Section 3.1, so that we get a three-dimensional array of size for every image. The template sets are constructed by randomly extracting 500 patches of -size and 300 patches of -size, respectively, from the local feature arrays of 10 images per class. In the process of constructing the FCNR, two very important parameters are the sizes of and . In order to select proper sizes, we carried out a series of experiments with different sizes of and .

Figure 6 shows four recall-precision graphs corresponding to four different patch sizes. In these experiments, the number of retrieved images is taken as 30. It is not difficult to see from Figure 6 that sizes of and that are too small or too large do not give good results. In contrast, when the -size is pixels and the -size is pixels, the retrieval results are the best. Therefore, we use these sizes in the subsequent experiments.

Figure 7 shows the top 20 images of two queries. The image at the front of each list is the query image, and the number at the bottom of each image is its number in the library. As seen from Figure 7, the proposed CBIR system based on FCNR performed efficiently on the COREL image library. For the query semantics “flower,” the top 20 output images all have the theme of flowers, and these flowers have different colors, sizes, backgrounds, and forms. This suggests that the high-level semantics of “flower” can be correctly identified by the system. It is worth mentioning that the two images numbered 674 and 677 were rotated before the local feature extraction and can still be retrieved, indicating that the FCNR algorithm preserves the rotation invariance of the NR [24, 27, 28]. For the query semantics “elephant,” the top 13 output images are relevant to the query image, and among the top twenty images, only four are inconsistent with the query image (the images with added borders in Figure 7).

4.2. Comparison with NR-Based and LNR-Based Methods

Next, we compare the performance of the FCNR with the classical NR method and the LNR method in CBIR [17]. The local neural response is an improved version of the neural response, which uses sparsity techniques in the representation of an image and its subblocks. Before calculating the NR and LNR, it is necessary to preprocess the images as mentioned in Section 2.3. In order to be relatively fair, we adopt the configuration that makes each algorithm perform best as reported in the relevant literature: we convert the images into gray images, and the -size is pixels and the -size is . We select the templates for the three methods in a similar manner, that is, by randomly cutting out 500 templates of -size and 300 templates of -size from the gray images or the local feature arrays of the images.

Table 3 shows the time consumption of the three different methods at different stages, and the retrieval performance is shown in Figure 8. In these experiments, the number of retrieved images is still taken as 30.

It can be seen from Table 3 that both the hierarchical feature learning time and the query time of the FCNR-based method are significantly shorter than those of the other two methods. This is mainly because the latter two methods use an exhaustive algorithm of pixel-to-pixel translations. In particular, the LNR method, which introduces the solution of quadratic optimization problems, is the most time-consuming. Therefore, although the retrieval method based on FCNR spends some time on the extraction of local features, the time for learning the FCNR and the query time are greatly reduced. This point is very important for the image retrieval task, because real-time performance is a basic requirement in image retrieval [29].

Besides that, it is not hard to see from Figure 8 that the retrieval precision of the method based on FCNR is better than that of the methods based on NR or LNR. The main reason is that the FCNR-based method effectively overcomes the shortcoming of comparing raw pixel values in the underlying image blocks, as in the NR and LNR methods. At the same time, the loss of color information also affected the performance of LNR and NR to a certain extent. Note that the results based on LNR are better than those based on NR; this is mainly because the localization and the sparse encoding in the LNR method give the image a high neural response value at the target location.

4.3. Comparison with Some Other Methods Proposed for Image Retrieval

Finally, we will compare FCNR with some other methods originally proposed for image retrieval, which include the edge histogram descriptor (EHD) method [10], the color difference histogram descriptor (CDH) [12], and the latest methods such as the error diffusion block truncation coding (EDBTC) [18] and the bandletized regions through support vector machines (BRSVM) [6].

As a benchmark, the EHD was initially used for texture image retrieval. For fairness, we extract features from the R, G, and B components of the image, and image blocks whose edge intensity exceeds 11 are used in the histogram calculation. Each component yields an 80-dimensional feature vector, so the resulting EHD feature vector has 240 dimensions. For the CDH representation we use the YCbCr color space, and the color and orientation parameters are set to 90 and 18, the best-performing configuration reported in the literature [12]; thus an image is represented by CDH as a 108-dimensional vector. EDBTC produces two color quantizers and a bitmap image, which are further processed with vector quantization to generate the image feature descriptor. EDBTC introduces two features, a color histogram feature and a bit pattern histogram feature, to measure the similarity between a query image and the target images in the database. The BRSVM method was proposed to overcome the drawbacks and limitations of traditional image segmentation techniques. In BRSVM, a bandelet-transform-based image representation is presented, which reliably captures information about the major objects in an image, and a support vector machine is applied for image retrieval.
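The per-channel histogram concatenation used for the EHD baseline can be sketched as below. This is a simplified stand-in, not the full MPEG-7 EHD (which bins five edge types over 16 subimages); it only illustrates how three 80-dimensional channel histograms are concatenated into a 240-dimensional descriptor. The function name and the use of gradient magnitude as "edge intensity" are our assumptions.

```python
import numpy as np

def edge_histogram_per_channel(img_rgb, n_bins=80):
    """Toy per-channel edge-strength histogram (not the full MPEG-7 EHD).

    For each of the R, G, and B channels, gradient magnitudes are binned
    into an n_bins histogram; the three histograms are concatenated,
    yielding a 3 * n_bins descriptor (240-D for n_bins = 80).
    """
    feats = []
    for ch in range(3):
        plane = img_rgb[:, :, ch].astype(float)
        gy, gx = np.gradient(plane)          # simple finite-difference edges
        mag = np.hypot(gx, gy)               # edge strength per pixel
        hist, _ = np.histogram(mag, bins=n_bins,
                               range=(0.0, mag.max() + 1e-9))
        feats.append(hist / max(hist.sum(), 1))  # L1-normalize each channel
    return np.concatenate(feats)
```

Comparing such descriptors with a histogram distance (e.g., L1) then gives the baseline ranking against which FCNR is evaluated.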

Similarly, we randomly select 50 images from the image database to form a query image set , and the number of output images is set to 10, 20, and 50, respectively. Table 4 lists the average precision and recall in the three cases for the five methods, and Figure 9 shows the PR curves when the number of output images is 50.
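The precision and recall figures reported in Table 4 follow the standard top-k definitions, which can be computed as in the following sketch (the function name is ours; the relevant set for a COREL query is the set of images sharing the query's category).

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall over the top-k retrieved image IDs.

    retrieved: ranked list of image IDs returned for a query;
    relevant:  set of ground-truth relevant IDs (e.g., the images in
               the query's COREL category).
    """
    top_k = retrieved[:k]
    hits = sum(1 for img_id in top_k if img_id in relevant)
    precision = hits / k                                  # fraction of output that is relevant
    recall = hits / len(relevant) if relevant else 0.0    # fraction of relevant items returned
    return precision, recall
```

Averaging these values over the 50 queries, at k = 10, 20, and 50, yields the entries of Table 4; sweeping k produces the PR curves in Figure 9.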

Table 4 and Figure 9 show that, on the COREL-1000 image database, the proposed method not only is significantly better than MPEG-7 standard feature extraction methods such as EHD but also outperforms the latest methods, such as EDBTC and BRSVM. In our view, this is mainly due to two reasons: first, based on the hierarchical structure, FCNR is the result of deep learning on the low-level features of the image; second, the high-level semantic elements of images are integrated into the FCNR feature representation through the template sets.

5. Conclusions

This research has been devoted to constructing a new image feature representation, FCNR, for CBIR. It preserves the excellent characteristics of NR and LNR, such as invariance to translation and illumination and robustness to local distortion and clutter. More importantly, it takes both visual and semantic features into account in image recognition. The experimental results on the COREL-1000 image database show that, compared with NR and LNR, FCNR is more suitable for image retrieval tasks. In addition, the proposed method achieves higher retrieval accuracy on the COREL database than other methods originally proposed for image retrieval. We attribute the effectiveness of the proposed method to the local feature extraction and to the hierarchical architecture, which performs deep learning on low-level visual features and incorporates the high-level semantics of the image into the feature representation.

Although both theoretical analysis and experimental results show that FCNR is an applicable image representation for CBIR, some problems remain to be studied further.

(i) The template sets and the numbers of selected templates in this paper are determined empirically, without qualitative or quantitative analysis. How to select more representative templates, and what the optimal numbers of templates for constructing FCNR are, remain important issues for future work [30].

(ii) In this paper, we used simple inner product kernels as the similarity measure. How to combine FCNR with the mainstream similarity learning techniques of recent years, so as to obtain better retrieval performance, is a problem worth studying.

(iii) As is well known, the relevance feedback technique plays an important role in image retrieval [9]; how to introduce relevance feedback into the algorithm proposed in this paper is another interesting subject.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is supported by Educational Commission of Hubei Province of China (D20131803), the Doctoral Scientific Research Fund of Hubei Automotive Industries Institute (BK201209), and the Youth Foundation of Hubei Automotive Industrial Institute (X2012XQ09).