Abstract

With the fast-growing number of images uploaded every day, efficient content-based image retrieval becomes important. Hashing methods, which represent images as binary codes and use Hamming distance to judge similarity, are widely accepted for their advantages in storage and search speed. A good binary representation of images is the determining factor of retrieval quality. In this paper, we propose a new deep hashing method for efficient image retrieval. We propose an algorithm to calculate target hash codes that encode the relationships between images of different contents. The target hash codes are then fed to a deep network for training. Two variants of the deep network, DBR and DBR-v3, are proposed for different sizes and scales of image databases. After training, our deep network produces hash codes with large Hamming distances for images of different contents. Experiments on standard image retrieval benchmarks show that our method outperforms other state-of-the-art methods, including unsupervised, supervised, and deep hashing methods.

1. Introduction

With the rapid development of storage techniques, millions of images are uploaded and stored on the Internet every second. Given a query image, efficiently locating a certain number of images with similar content in a large database is a big challenge: speed and accuracy need to be carefully balanced. This kind of task is content-based image retrieval (CBIR) [1–4], a technique for retrieving images by automatically derived features such as colour, texture, and shape. There are also specialized applications of CBIR, such as free-hand sketch-based image retrieval [5], whose query images are abstract and ambiguous sketches. In CBIR, the derived features are not easy to store, and searching among millions or even billions of images is very time-consuming.

The binary representation of images is an emerging approach for dealing with both the storage and the search speed of the CBIR task. This approach is called hashing, and it works in three steps. First, a hash function maps database images (gallery images) into binary codes, which are stored on the storage device; a typical code length is 48 bits. Second, the Hamming distance between the binary code of the query image and each stored binary code is calculated. Third, the images with the smallest Hamming distances to the query image are taken to have similar content and are retrieved. Some examples of proposed hashing methods are in [6–11].

The critical part of a hashing method is the features from which it derives the hash code. Every hashing method includes a feature extraction step, and the quality of the features directly affects retrieval accuracy. Recently, the convolutional neural network (CNN) has proved its remarkable performance in tasks that depend heavily on feature extraction, such as image classification [12], natural language processing [13], and video analysis [14]. CNN based methods outperform the previous leading ones in these areas, which shows that CNNs can learn robust features representing the semantic information of images. A very natural idea is therefore to use deep learning to learn compact binary hash codes. Following semantic hashing [15], deep hashing methods using CNNs show high performance in content-based image retrieval.

In this paper, we propose a new supervised deep hashing method, called deep binary representation (DBR), for learning compact hash codes to perform content-based image retrieval. This paper is an extended version of the work [16]. Our method is an end-to-end learning framework with three main steps. The first step is to generate an optimal target hash code set from pointwise label information. The second step is to learn image features and the hash function simultaneously by training a carefully designed deep network. The third step is to map image pixels to compact binary codes through the hash function and perform image retrieval. Compared to other deep hashing methods, our method has the following merits.

Our deep hash network is trained with calculated target hash codes. The target hash code set is optimal in the sense that the minimum Hamming distance between codes of different labels is maximized. Methods like [17] derive hash codes from a middle layer of the deep network; our method produces hash codes from the output layer, which is more direct and shows better performance.

Our training process is pointwise; one training sample consists of one image and one target hash code. Compared to pairwise [18] and triplet methods [19, 20], whose training samples consist of two or three images, the training time is largely shortened: the number of training samples grows linearly with the dataset size rather than quadratically or cubically as in the methods mentioned above.

Our method reaches state-of-the-art performance on both small image datasets like CIFAR-10 and relatively large datasets like ImageNet. For large image datasets, we further propose an architecture based on the inception-v3 net [21], which we call DBR-v3. DBR-v3 achieves state-of-the-art performance on image retrieval on the ImageNet dataset. When we apply DBR-v3 to CIFAR-10, a 15 percent performance improvement is achieved.

2. Overview of Hashing Methods

Hashing methods include data-independent methods [22] and data-dependent methods [10, 23–26]. Methods of the first category were proposed in earlier days; the most representative ones are locality-sensitive hashing (LSH) [22] and its variants. Their hash functions are not related to the training data; instead, they use random projections to map images into a feature space. The second category learns the hash function from the training data. Because of this extra information, data-dependent methods outperform data-independent ones.

Data-dependent methods can be further divided into unsupervised methods and supervised methods. Unsupervised methods include spectral hashing (SH) [23] and iterative quantization (ITQ) [26]; these methods learn hash functions from unlabelled training sets. To deal with more complicated image databases, supervised methods are proposed to learn a better hash function from the label information of training images. For example, supervised hashing with kernels (KSH) [24] requires only a limited amount of supervised information and achieves high-quality hashing. Minimal loss hashing (MLH) [25] is based on structured prediction with latent variables and a hinge-like loss function. Binary reconstructive embedding (BRE) [10] develops an algorithm for learning hash functions by explicitly minimizing the reconstruction error between the original distances and the Hamming distances. Asymmetric inner-product binary coding (AIBC) [27] is a special hashing method based on asymmetric hash functions; it learns two different hash functions for query images and dataset images and can be applied to both unsupervised and supervised datasets.

The hashing methods mentioned above use hand-crafted features, which are not powerful enough to capture more complicated semantic similarity. Moreover, their feature extraction procedure is independent of hash function learning. Recently, CNN based hashing methods, called deep hashing methods, have been proposed to address these problems. CNNs can learn more representative features than hand-crafted ones. Furthermore, most deep hashing methods perform feature learning and hash function learning simultaneously and show great improvement over previous methods. Several deep hashing methods have been proposed and shown to have better accuracy in content-based image retrieval. For example, CNNH [18] is a two-stage deep hashing method that learns features and hash functions based on learned approximate hash codes. Deep pairwise-supervised hashing (DPSH) [28] performs learning based on pairwise labels. Reference [29] poses hash learning as a problem of regularized similarity learning and simultaneously learns the hash function and image features from triplet samples. Our approach proposed in this paper outperforms the above methods.

3. Proposed Method

Our method aims to find a hash function solving the content-based image retrieval task. Given $n$ training images $\{I_1, I_2, \dots, I_n\}$ belonging to $m$ categories, each $I_i$ is in the form of raw RGB values. Label information is noted as $l_i \in \{1, 2, \dots, m\}$, $i = 1, \dots, n$. Our goal is to learn a hash function $F$ mapping input images to compact binary codes $b_i = F(I_i)$ with $b_i \in \{0, 1\}^q$, where $q$ stands for the hash length. The hash function satisfies the following:

(1) $b_i$ and $b_j$ are similar (close) in the Hamming space when $l_i = l_j$.

(2) $b_i$ and $b_j$ are far away in the Hamming space when $l_i \neq l_j$.

Figure 1 shows the whole flowchart of our system and Figure 2 shows the proposed network. A target hash code generation component is proposed to generate the optimal hash code set for training based on the code length $q$ and the category number $m$. Our framework contains a CNN model as the main component. Normally the last layer of a CNN is a Softmax classification layer; we replace it with a hash layer of $q$ nodes. Since the output layer of the CNN model has been changed, we need new output information to replace the labels. The hash function $F$ is the trained model concatenated with a revised round function (Section 3.3). Finally, we use the trained hash function to perform content-based image retrieval.

3.1. Target Hash Code Generation
3.1.1. Normal Situation

The target hash code set is the mathematically optimal code set with $m$ codewords in which the minimum Hamming distance between codewords is maximized. We use the target hash codes together with the raw images as training samples to train the whole network. We hope to get a network which accepts a raw image as input and maps it to a binary code close to its target hash code. The trained network, which is used as the hash function $F$, produces binary codes satisfying the goal. Learning the relationship between images with different labels is not our purpose; instead, our purpose is to teach the network how to map an image to a binary code. That is why we calculate the target hash codes and feed them to the network instead of letting the network learn from the original labels. This is the major difference between our method and others. Furthermore, the target hash code generation component makes our learning pointwise. We require no pairwise inputs like [18], and the training speed is much faster. The role of our target hash code is similar to that of the prototype code in adaptive binary quantization (ABQ) [30]. The difference is that, in ABQ, the binary code of any data point is represented by its nearest prototype, so the output of the hash function always lies in the prototype code set. In our method, the target hash codes are only used for training; the hash function $F$ can produce binary codes outside the target hash code set.

To fit the target hash code length, we replace the last layer of the original CNN classification model with a fully connected layer called the hash layer, which has $q$ nodes. How to generate the target hash codes for images of different labels is the main focus of this part. The following is the detailed problem description and the main algorithm.

Since the training images are in $m$ categories, our target is to find a binary code set with $m$ codewords in which the minimum Hamming distance between any two codewords is as large as possible. More specifically, given binary code length $q$ and codeword number $m$, we want to find a code set $C = \{c_1, c_2, \dots, c_m\}$, $c_i \in \{0, 1\}^q$, whose minimum Hamming distance is maximized. This optimization problem can be divided into smaller subproblems: given code length $q$ and minimum Hamming distance $d$, find a code set with no fewer than $m$ codewords. We then repeat this subproblem with larger $d$ until no such code set can be found; the last solvable $d$ is the maximized minimum Hamming distance. The whole process is described in Algorithm 1. Please note that this is a hard combinatorial problem with no closed-form result: for given $q$ and $d$, the maximum size of such a code set is in general unknown [31]. In our experimental cases, our algorithm always finds at least a second-best solution. Consider a 24-bit code set for a 12-category dataset: the best solution is a code set whose minimum Hamming distance is 13 bits, and our algorithm finds one with a minimum Hamming distance of 12 bits.

Input: binary code length $q$, number of categories $m$
Output: code set $C$ with $|C| \ge m$ whose minimum pairwise Hamming distance is at least $d$
(1) codeset.add(0)
(2) for ($i = 1$; $i < 2^q$; $i{+}{+}$) do
(3)  flag = 0
(4)  for each code $c$ in codeset do
(5)   if Hamming($i$, $c$) $< d$ then
(6)    flag = 1
(7)    break
(8)  if flag == 0 then
(9)   codeset.add($i$)
(10) return codeset
Repeat: Perform the algorithm with $d = 1, 2, \dots$ until the resulting code set has fewer than $m$ codewords.
Choose $m$ codewords from the code set found with the largest feasible $d$; that will be the target hash code set.

For instance, given code length $q = 8$ for a dataset with $m = 10$ categories, our algorithm with minimum Hamming distance $d = 4$ results in a code set with 16 codewords. We further try $d = 5$, which results in a set with only 4 codewords and fails to meet our need. We therefore randomly choose 10 codewords from the former set, and the target hash code set is constructed as Table 1 shows.
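To make the procedure concrete, the following Python sketch implements Algorithm 1 together with the outer search over $d$. It is an illustrative implementation under our own naming; note that the exhaustive scan over all $2^q$ candidates is only practical for short code lengths such as the $q = 8$ example above.

```python
from itertools import count

def hamming(a: int, b: int) -> int:
    """Hamming distance between two codewords stored as integers."""
    return bin(a ^ b).count("1")

def greedy_code_set(q: int, d: int) -> list:
    """Algorithm 1: greedily collect q-bit codewords whose pairwise
    Hamming distance is at least d."""
    codeset = [0]
    for i in range(1, 2 ** q):
        if all(hamming(i, c) >= d for c in codeset):
            codeset.append(i)
    return codeset

def target_hash_codes(q: int, m: int) -> list:
    """Outer loop: increase d until fewer than m codewords survive,
    then keep m codewords from the last feasible set."""
    best = [0]
    for d in count(1):
        codeset = greedy_code_set(q, d)
        if len(codeset) < m:
            break
        best = codeset
    return best[:m]  # the paper picks m codewords at random; we take the first m

# Example: reproduces the q = 8, m = 10 case discussed above
# (16 codewords at d = 4, only 4 at d = 5).
print(target_hash_codes(8, 10))
```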

3.1.2. Semantically Uneven Situation

In some situations, the semantic relations between different labels are not evenly distributed. For example, suppose the image samples in a dataset are divided into 3 categories labelled cat, dog, and car. The images belonging to cat and dog are of different categories but quite similar, whereas images of car are really far away from both. When we input a cat as a query image, we hope to retrieve dogs before cars. The target hash code set therefore needs to be redesigned; an evenly distributed Hamming distance between each pair of labels is not reasonable here. In this example, we need a target hash code set with a small Hamming distance between cat and dog; a code set in which the Hamming distance between cat and dog is 2 and all other pairwise distances are 3 is a reasonable one.

To generate such a target hash code set, we need further information called the semantic relation matrix $S$. $S$ is an $m \times m$ matrix, and each element $S_{ij}$ gives the semantic relation between label $i$ and label $j$; thus $S_{ij}$ always equals $S_{ji}$ and $S_{ii} = 0$. A negative value means a closer relation than average, for example, cat and dog; a positive value means a more dissimilar relation than other label pairs; zero means a normal relation between two labels, and most values should be zero. For the example above, $S$ is a $3 \times 3$ matrix in which the cat/dog entries are $-1$ and all other values are 0. To generate this kind of semantically uneven target hash code, we need a slight revision of Algorithm 1. The whole process is shown in Algorithm 2; the only difference is in line (5). When comparing the Hamming distance between the code being generated and an already generated code, we add the corresponding entry of $S$ to the required minimum distance $d$. In Algorithm 2, $S_{ij}$ denotes the entry of $S$ for the two codes currently being compared; for example, if we are generating the fifth code and comparing it to the first code in the current code set, $S_{ij}$ is the value of $S_{51}$.

Input: binary code length $q$, number of categories $m$, semantic relation matrix $S$. $S_{ij}$ denotes the entry of $S$ for the two codes currently being compared.
Output: code set $C$ with $|C| \ge m$ satisfying the semantically adjusted minimum Hamming distances
(1) codeset.add(0)
(2) for ($i = 1$; $i < 2^q$; $i{+}{+}$) do
(3)  flag = 0
(4)  for each code $c$ in codeset do
(5)   if Hamming($i$, $c$) $< d + S_{ij}$ then
(6)    flag = 1
(7)    break
(8)  if flag == 0 then
(9)   codeset.add($i$)
(10) return codeset
Repeat: Perform the algorithm with $d = 1, 2, \dots$ until the resulting code set has fewer than $m$ codewords.
Choose $m$ codewords from the code set found with the largest feasible $d$; that will be the target hash code set.

For instance, consider a dataset in which three labels are very similar and their Hamming distances should be smaller, so we set $S_{ij} = -2$ for those pairs, while three other labels are very dissimilar and their Hamming distances should be larger, so we set $S_{ij} = +2$ for those pairs; all other values are zero. The target hash code set is constructed as Table 2 shows. The first three codes have Hamming distance 4 and the last three codes have Hamming distance 8; all other pairs have the minimum Hamming distance 6. This code set satisfies the semantic relations between the different labels. To the best of our knowledge, there is no dataset that includes numerically defined semantic relations between different labels; we state our algorithm here to give a solution to such semantically uneven label situations.
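The following sketch adapts the earlier Python implementation to Algorithm 2. The matrix indexing assumes, as in the text, that the $k$-th generated codeword is assigned to label $k$, so generation stops once $m$ codewords are found; the function names and the early-stopping detail are our own.

```python
import numpy as np

def hamming(a: int, b: int) -> int:
    """Hamming distance between two codewords stored as integers."""
    return bin(a ^ b).count("1")

def semantic_code_set(q: int, d: int, m: int, S: np.ndarray) -> list:
    """Algorithm 2: a candidate code for label i must keep Hamming distance
    at least d + S[i, j] from the codeword already assigned to label j."""
    codeset = [0]
    for cand in range(1, 2 ** q):
        if len(codeset) == m:
            break
        i = len(codeset)  # label index the candidate would be assigned
        if all(hamming(cand, c) >= d + S[i, j] for j, c in enumerate(codeset)):
            codeset.append(cand)
    return codeset

# Example: cat/dog/car with S_cat,dog = -1, so the cat and dog codes may be
# one bit closer than the base distance d = 3 (cf. the example in the text).
S = np.zeros((3, 3))
S[0, 1] = S[1, 0] = -1
print(semantic_code_set(4, 3, 3, S))  # e.g. [0, 3, 13]: distances 2, 3, 3
```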

3.2. Learning Hash Function

With the label information and the target hash code set $C = \{c_1, \dots, c_m\}$, we can construct our new training set with $n$ training samples, where $I_i$ is the raw RGB value of training image $i$ and $c_{l_i}$ is the target hash code for its label $l_i$:

$$D = \{(I_i, c_{l_i}) \mid i = 1, 2, \dots, n\}.$$

After preparing the training samples, we build a deep network that learns to map images to hash codes. For small image datasets with sizes of around $32 \times 32$ pixels, we present the DBR network based on a relatively shallow convolutional neural network. For large image datasets with image sizes around $299 \times 299$, we build our network, called DBR-v3, based on inception-v3 [21].

3.2.1. Deep Network for Small Images

For the small-image deep network, we call our method deep binary representation (DBR). We take CIFAR-10 training as an example and adopt a widely used simple CNN model for CIFAR-10 for fast retrieval. A CNN has the powerful ability to learn image features through the concatenation of convolution layers and fully connected layers. As shown in Figure 2, we use 32, 32, 64, and 64 convolution kernels for the four convolution layers. $2 \times 2$ max pooling and dropout are added after the 2nd and 4th convolution layers. Following the convolution layers are two fully connected layers with 512 nodes each and a dropout after the first one. All these layers are activated with the ReLU function to add nonlinearity. The hash layer is a fully connected layer with $q$ nodes, matching the length of the target hash code; with larger $q$, the network can learn more features from the input image, leading to better performance. Each node implies a hidden feature of the input image. The hash layer is activated with the sigmoid function, which ranges in $(0, 1)$ and in most cases saturates near $0$ or $1$, making it very suitable for rounding the output to binary codes.

The target hash codes include all the information needed to learn features from images, so the loss function need not be specially designed; a simple mean squared error (MSE) loss works well. For the training optimizer, we choose Adadelta [32] for its good balance of speed and convergence point. Without heavy modifications to the CNN model, our proposed model can learn a robust hash function quickly, within a few hundred training epochs.
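As a concrete illustration, a minimal Keras sketch of the DBR network is given below. The layer sizes follow the description above; the $3 \times 3$ kernel size (stated in the text only for the MNIST variant) and the dropout rates are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dbr(q: int) -> keras.Model:
    """DBR network sketch for 32x32 RGB images; q is the hash length."""
    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Dropout(0.25),          # dropout rate: our assumption
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Dropout(0.25),          # dropout rate: our assumption
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),           # dropout rate: our assumption
        layers.Dense(512, activation="relu"),
        layers.Dense(q, activation="sigmoid"),  # hash layer
    ])
    model.compile(optimizer="adadelta", loss="mse")
    return model

# Training is pointwise: model.fit(train_images, target_codes, epochs=300)
```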

3.2.2. Deep Network for Large Images

For input images with relatively large sizes, such as $299 \times 299$ pixels or similar, we call our method DBR-v3. We build our deep network based on inception-v3 [21], a very deep convolutional neural network with more than 20 layers that evolved from inception-v1 [33]. This network achieves 21.2% top-1 and 5.6% top-5 error in ILSVRC single-frame evaluation, with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. We adopt the ImageNet-pretrained inception-v3 model as our base model and apply our revisions and training to it.

We make some changes to inception-v3 to make it fit our hash function. After the final global pooling layer, we add one fully connected layer with 1024 nodes activated with the ReLU function. Following this layer is the hash layer, a fully connected layer with $q$ nodes. The weights of the newly added layers are randomly initialized; the others are the pretrained weights of the original inception-v3. Training is a two-step process: first, we train the whole network for several epochs; then, after the top layers are well trained, we freeze the bottom layers and fine-tune the top 2 inception blocks for several epochs. The loss function and training optimizer are again MSE and Adadelta [32].
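A minimal sketch of this construction with the Keras applications API is shown below. The sigmoid hash-layer activation is carried over from DBR, and the exact freezing boundary for the "top 2 inception blocks" is our assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import InceptionV3

def build_dbr_v3(q: int) -> keras.Model:
    """DBR-v3 sketch: pretrained inception-v3 trunk plus new top layers."""
    base = InceptionV3(weights="imagenet", include_top=False,
                       pooling="avg", input_shape=(299, 299, 3))
    x = layers.Dense(1024, activation="relu")(base.output)
    out = layers.Dense(q, activation="sigmoid")(x)  # hash layer
    model = keras.Model(base.input, out)
    model.compile(optimizer="adadelta", loss="mse")
    return model

# Second training phase: freeze the lower layers and fine-tune only the
# top blocks (the boundary index 249 is our assumption).
model = build_dbr_v3(48)
for layer in model.layers[:249]:
    layer.trainable = False
model.compile(optimizer="adadelta", loss="mse")  # recompile after freezing
```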

This network accepts input images of $299 \times 299$ pixels. For small image datasets, we use an upsampling algorithm to make the images fit the network. This yields a further performance improvement over the original shallow network in Section 3.2.1, coming from two aspects: the power of the pretrained weights and the deeper network.

3.3. Image Retrieval

After training, we combine all the components to perform image retrieval. Our trained network accepts an input image $I$ in raw pixels and gives an output $O = (o_1, o_2, \dots, o_q)$. To convert the output to a binary hash code $b$, we redefine the round function:

$$\mathrm{round}(o_i) = \begin{cases} 0, & o_i < 0.5, \\ 1, & o_i \ge 0.5. \end{cases}$$

Finally, we get our hash function:

$$F(I) = \mathrm{round}(O),$$

where $O$ is the output of our proposed model.

For image retrieval, we regard the training images as the image database and the test images as query images. The image retrieval process searches for the top $k$ most similar images in the database. Three steps are performed to do the image retrieval; a minimal sketch of the pipeline is given after the steps.

Step 1. Map the training images $I_i$ to hash codes $b_i = F(I_i)$.

Step 2. For each query image $I_{\mathrm{query}}$, first calculate $b_{\mathrm{query}} = F(I_{\mathrm{query}})$ and then retrieve images ranked by the Hamming distance between $b_{\mathrm{query}}$ and each $b_i$; a smaller Hamming distance ranks higher.

Step 3. Compare the similarities of the retrieved images and the query image; then evaluate the performance in terms of MAP according to the result.
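The following NumPy sketch implements the thresholding and Hamming ranking of Steps 1 and 2; `model` is assumed to be a trained DBR or DBR-v3 network as above.

```python
import numpy as np

def to_codes(model, images) -> np.ndarray:
    """Round the network outputs to binary hash codes (threshold 0.5)."""
    return (model.predict(images) >= 0.5).astype(np.uint8)

def retrieve(query_code: np.ndarray, db_codes: np.ndarray, top_k: int = 12) -> np.ndarray:
    """Return indices of the top_k database codes closest in Hamming distance."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists, kind="stable")[:top_k]

# db_codes = to_codes(model, train_images)                       # Step 1
# ranked = retrieve(to_codes(model, query[None])[0], db_codes)   # Step 2
```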

4. Experiments

In this part, we state our experiment settings and results. We calculate the MAP (mean average precision) of image retrieval on different datasets and list it in Table 3. We apply our DBR method to MNIST and CIFAR-10 and the DBR-v3 method to ImageNet; furthermore, we upsample the images in CIFAR-10 and apply DBR-v3 to them. For each method, we list the time the network costs to train and to calculate the hash code of one image; the running times are listed in Table 6. We choose not to report the time to retrieve one image: once the hash length is determined, the time to retrieve images according to the hash code is the same for all hashing methods, so what matters is the time to map one image to its hash code. Please note that some results are missing because they are not available in the corresponding papers and those methods are not fully open-source.

4.1. Results on MNIST

The MNIST dataset [34] consists of 70000 grey-scale images belonging to 10 categories of handwritten Arabic numerals from 0 to 9.

For MNIST, we use 32 convolution kernels of size $3 \times 3$ for each of the two convolution layers. $2 \times 2$ max pooling and dropout are added after the 2nd convolution layer. Following the convolution layers are two fully connected layers with 128 nodes each and a dropout after the first fully connected layer. The last layer is the hash layer, whose number of nodes equals the hash length. The model is trained with the Adadelta optimizer and the mean squared error loss function.

Our proposed method is compared with state-of-the-art hashing methods, including the data-independent method LSH [22], two unsupervised methods SH [23] and ITQ [26], four supervised methods KSH [24], MLH [25], BRE [10], and ITQ-CCA [26], and the CNN based deep hashing method CNNH [18] and its variant CNNH+ [18].

We follow the experiment configuration of [18] and derive results from the same resource. We randomly select 100 images per class, 1000 images in total, as test query images. For the unsupervised methods, we use all the remaining images as the training set; for the supervised methods, including CNNH, CNNH+, and ours, we select 5000 images (500 images per class) as the training set. The whole training process lasts around 120 s for 100 epochs on a GTX1060 6 GB graphics processing unit. It costs around 80 μs to map one MNIST image to its hash code.

To evaluate the retrieval performance, we use mean average precision (MAP). For each query image, we calculate the average precision of the retrieved images; MAP is the mean of these average precisions. Note that, for each query image, the correctness of highly ranked retrieved images counts more. The MAP results of our test are shown in Table 3; the DBR column is our method. Our method outperforms the other methods in grey-scale image retrieval.
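For reference, a small NumPy sketch of the MAP computation over ranked retrieval lists follows; the binary relevance vectors (1 if a retrieved image shares the query's label, 0 otherwise) are assumed inputs.

```python
import numpy as np

def average_precision(relevant: np.ndarray) -> float:
    """AP for one query: `relevant` holds 1/0 per ranked retrieval position."""
    hits = np.cumsum(relevant)
    ranks = np.arange(1, len(relevant) + 1)
    prec_at_hits = (hits / ranks)[relevant.astype(bool)]
    return float(prec_at_hits.mean()) if prec_at_hits.size else 0.0

def mean_average_precision(relevance_lists) -> float:
    """MAP: mean of the per-query average precisions."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))
```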

4.2. Results on CIFAR-10

The CIFAR-10 dataset [35] consists of 60000 images belonging to 10 categories, including airplane, automobile, and bird. The layer information of the CIFAR-10 implementation is stated in Section 3.2.

In addition to the methods compared in Section 4.1, we compare our method with two more CNN based deep hashing methods, DHN [36] and DNNH [20], and we follow their experiment configuration. We randomly select 100 images per class as the query set. For the unsupervised methods, we use all the remaining images as the training set; for the supervised methods, 5000 images (500 images per class) are randomly selected as the training set. The whole training process lasts around 600 s for 300 epochs on a GTX1060 6 GB graphics processing unit. It costs around 160 μs to map one CIFAR-10 image to its hash code. Two images are considered similar if they have the same label. The top 12 retrieved images for two query images are shown in Figure 3 as an illustration.

Furthermore, we upsample the images in CIFAR-10 to the size of $299 \times 299$ and apply the DBR-v3 network to them. We train the whole network for 50 epochs and perform the fine-tuning for 20 epochs on a GTX1060 6 GB graphics processing unit. For DBR-v3, it costs around 3 ms to map one CIFAR-10 image to its hash code.

The MAP results are in Table 3; the results of our method are shown in the columns DBR and DBR-v3. Our method is better than the other methods, including unsupervised methods, supervised methods, and deep methods with feature learning. DBR-v3 has a big advantage over DBR because its network is much deeper and pretrained on ImageNet; however, this comes at a substantial cost in training time and hash code calculation time.

We also conduct an experiment on the semantically uneven situation. Among the ten categories of the CIFAR-10 dataset, we suppose that automobile and truck are semantically similar, so we set the automobile/truck entries of the semantic relation matrix $S$ to a negative value and all other entries to zero. When the query image is an automobile, we observe that trucks are retrieved with higher rank than categories other than automobile. At the same time, Table 4 shows that the overall MAP remains at the same level. This experiment indicates that our target hash codes can encode the desired semantic relations between different categories.

Following [28], we further compare with some deep hashing methods under different experiment settings, including DSCH, DRSCH [29], DSRH [19], and DPSH [28]; the results are derived directly from [28]. More specifically, we use the 10000 test images as the query set and the 50000 training images as the training set. The total training time is 1.5 hours for 300 epochs on a GTX1060 6 GB graphics processing unit. The MAP values are calculated according to the top 50000 returned neighbours and are shown in Table 5. Our method still leads the MAP results.

4.3. Results on ImageNet

ImageNet is an image database with more than 1.2 million images in the training set and more than 50 thousand images in the validation set; each image belongs to one of 1000 categories. The image size varies, with common sizes of hundreds by hundreds of pixels. ImageNet is currently the largest image database for various tasks, so experiments on it demonstrate the ability to deal with large-scale high-definition images.

The network details, including the loss function and training optimizer, are stated in Section 3.2.2. For a fair comparison, we follow the experiment settings in [37]. We randomly select 100 categories; all the images of these categories in the training set are used as training images, and all the images of these categories in the validation set are used as query images. We train the whole network for 50 epochs and fine-tune the top 2 inception blocks and the hash layer for 20 epochs. The total training time is about 18 hours on a GTX1060 6 GB graphics processing unit. It costs around 3 ms to map one ImageNet image to its hash code.

Our proposed method is compared to state-of-the-art hashing methods, including HashNet [37] and most of the methods mentioned in Section 4.1. The data are derived directly from [37], and the test set is the same.

To evaluate the retrieval performance, we use mean average precision (MAP); the results are shown in Table 3. They show that our method has a great advantage over the other methods, which indicates that our DBR-v3 method can handle large-scale retrieval of high-definition images.

5. Conclusion

In this paper, we present a novel end-to-end hash learning network for content-based image retrieval. We design the optimal target hash code for each label to feed the network with the relations between different labels. Since the target hash codes of different labels have maximized Hamming distance, the deep network can map images of different categories to hash codes with significant distance, while for similar images it tends to produce exactly the same hash code. The deep network is based on a convolutional neural network, and we design two variants of our method: DBR, for small images, which trains and computes hash codes fast (with powerful clusters, online training is even possible), and DBR-v3, based on the inception-v3 net, which benefits from the powerful learning ability of the inception net and performs very well on high-definition image retrieval. Finally, we conduct experiments on standard image retrieval benchmarks; the results show that our method outperforms previous works.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work is supported by NSFC (61671296, 61521062, and U1611461) and the National Key Research and Development Program of China (BZ0300013).