#### Abstract

Hashing has been widely deployed to perform Approximate Nearest Neighbor (ANN) search for large-scale image retrieval, where it addresses the problems of storage and retrieval efficiency. Recently, deep hashing methods have been proposed to perform feature learning and hash code learning simultaneously with deep neural networks. Even though deep hashing has shown better performance than traditional hashing methods with handcrafted features, the compact hash code learned by a single deep hashing network may not provide a full representation of an image. In this paper, we propose a novel hash-based indexing method, called the Deep Hashing based Fusing Index (DHFI), to generate a more compact hash code with stronger representational and discriminative power. In our method, we train two deep hashing subnetworks with different architectures and fuse the hash codes generated by the two subnetworks to represent each image. Experiments on two real datasets show that our method can outperform state-of-the-art hashing methods for image retrieval.

#### 1. Introduction

With the rapid growth of images on the Internet, it is extremely difficult to find relevant images according to different people’s needs. The volume of images keeps increasing, and a database with millions of images is now quite common. A linear search through the whole database therefore costs a great deal of time and memory. Moreover, images are usually represented by real-valued features, so the curse of dimensionality often arises in content-based image search engines and applications.

To address the inefficiency and memory cost of real-valued features, ANN search [1] has become a popular method and an active research topic in recent years. Among existing ANN techniques, hashing approaches map images to compact binary codes that approximately preserve the data structure in the original space [2–6]. Owing to their high query speed and low memory cost, hashing and image binarization techniques have become the most popular and effective techniques for content-based image identification and retrieval [4, 7–16]. Because images are represented by binary codes instead of real-valued features, the time and memory costs of search can be greatly reduced [17]. However, the retrieval performance of most existing hashing methods depends heavily on the features they use, which are typically extracted in an unsupervised manner and are therefore better suited to visual similarity search than to semantic similarity search.

The Convolutional Neural Network (CNN) has demonstrated impressive learning power on image classification [5, 18–20], object detection [21], face recognition [22], and many other vision tasks [23–25]. The CNN used in these tasks can be regarded as a feature extractor guided by an objective function designed specifically for the individual task [5]. The successful application of CNNs to various tasks implies that the features learned by a CNN capture the underlying semantic structure of images well, in spite of significant appearance variations. Moreover, hashing with deep networks has shown that both the feature representation and the hash coding can be learned more effectively.

Inspired by the robustness of CNN features and the high performance of deep hashing methods, we propose a binary code generating and fusing framework to index large-scale image datasets, named Deep Hashing based Fusing Index (DHFI).

Our method has two stages. First, we train two different deep pairwise hashing networks. Each network takes image pairs as training inputs, together with labels indicating whether the two images are similar, and produces binary codes as outputs. Second, we merge the hash codes produced by the two subnetworks and regard the merged hash code as a fingerprint, or binary index, of an image. With these two stages, an image is encoded by forward propagating it through both networks and then merging the network outputs into a single binary hash code.

The rest of the paper is organized as follows: Section 2 discusses work related to our method. Section 3 describes the DHFI method in detail. Section 4 extensively evaluates the proposed method on two large-scale datasets. Section 5 gives concluding remarks.

#### 2. Related Work

Existing hashing methods can be divided into two categories: data-independent methods and data-dependent methods [8, 24, 26, 27].

The hash function in data-independent methods is typically randomly generated and is independent of any training data. Representative data-independent methods include locality-sensitive hashing (LSH) [1] and its variants. Data-dependent methods, which are also called learning to hash (L2H) methods [15, 26], try to learn the hash function from training data. L2H methods can achieve comparable or better accuracy with shorter hash codes than data-independent methods. In real applications, L2H methods have become more popular than data-independent methods.
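To make the notion of a data-independent hash function concrete, the following minimal sketch implements sign random projection, one well-known LSH variant. It is an illustration only, not necessarily the scheme of [1]; the function names `make_hyperplanes` and `lsh_code` and the fixed seed are assumptions introduced here.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes; no training data is used."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_code(x, hyperplanes):
    """Encode x as a {-1, +1} code: one bit per hyperplane side."""
    code = []
    for h in hyperplanes:
        projection = sum(w * xi for w, xi in zip(h, x))
        code.append(1 if projection >= 0 else -1)
    return code


planes = make_hyperplanes(dim=3, n_bits=8, seed=42)
code = lsh_code([1.0, 2.0, 3.0], planes)
# Scaling a vector does not change which side of each hyperplane it lies on.
same = lsh_code([2.0, 4.0, 6.0], planes)
print(code == same)  # True
```

Because the hyperplanes are drawn without looking at the data, such codes usually need more bits than learned codes to reach the same accuracy, which is the motivation for L2H methods.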

Existing L2H methods can be further divided into two categories: unsupervised hashing and supervised hashing. We refer the reader to a comprehensive survey [28].

Unsupervised hashing methods use the unlabeled training data only to learn hash functions and encode the input data points to binary codes. Typical unsupervised hashing methods include reconstruction error minimization [29, 30], graph based hashing [3, 31], isotropic hashing (IsoHash) [9], discrete graph hashing (DGH) [32], scalable graph hashing (SGH) [33], and iterative quantization (ITQ) [8].

Supervised hashing utilizes supervised information, such as class labels, to learn compact hash codes. Representative supervised hashing methods include binary reconstruction embedding (BRE) [7], Minimal Loss Hashing (MLH) [34], Supervised Hashing with Kernels (KSH) [4], two-step hashing (TSH) [35], fast supervised hashing (FastH) [12], and latent factor hashing (LFH) [36]. In the pipelines of these methods, images are first represented by handcrafted visual descriptors (e.g., GIST [37], HOG [38]), followed by separate projection and quantization steps that encode the feature vectors into binary hash codes. However, such handcrafted features represent only low-level information about an image, and their construction is independent of the hash function learning process, so the resulting features might not be optimally compatible with the hash codes.

Because deep learning represents the high-level semantic information in an image effectively, many feature-learning-based deep hashing methods have recently been proposed and have outperformed traditional hashing methods with handcrafted features. Examples include convolutional neural network hashing (CNNH) [39], network in network hashing (NINH) [40], deep hashing network (DHN) [41], and deep pairwise supervised hashing (DPSH) [15]. CNNH, proposed by Xia et al., first learns the hash codes from the pairwise labels and then learns the hash function and feature representation from image pixels based on those hash codes. Lai et al. improved the two-stage CNNH by proposing NINH, which uses a triplet ranking loss to preserve relative similarities and encodes images through divide-and-encode modules. Because NINH performs feature learning and hash coding simultaneously in one deep network, image representations and hash codes can improve each other during joint learning. DHN further improves NINH by controlling the quantization error in a principled way and by devising a pairwise cross-entropy loss that links the pairwise Hamming distances with the pairwise similarity labels, while DPSH learns features and hash codes simultaneously from pairwise labels. Because the different components of DPSH can give feedback to each other, DPSH outperforms the other methods in image retrieval, to the best of our knowledge.

In this work, we further improve the retrieval accuracy in two steps: (1) training two deep hashing subnetworks with different architectures and (2) fusing the hash codes generated by the two subnetworks to represent each image, so that the merged codes carry more semantic information and complement each other. These two stages constitute the DHFI approach.

#### 3. The Proposed Approach

In this section, we describe our method in detail. We first train two deep hashing subnetworks with different architectures. Then, we pass each image through the subnetworks to generate binary hash codes and fuse the hash codes generated for the same image. For the first step, discussed in Section 3.1, we follow the simultaneous feature learning and hash code learning method of [15]. The major novelty of our method is training two deep hashing subnetworks and fusing the hash codes generated by the two subnetworks to index images.

##### 3.1. Subnetwork Training

We have $n$ images (feature points) $X = \{x_i\}_{i=1}^{n}$, and the training set of supervised hashing with pairwise labels also contains a set of pairwise labels $S = \{s_{ij}\}$ with $s_{ij} \in \{0, 1\}$, where $s_{ij} = 1$ means that $x_i$ and $x_j$ are similar and $s_{ij} = 0$ means that $x_i$ and $x_j$ are dissimilar. Here, the pairwise labels typically refer to semantic labels provided with manual effort.

The goal of supervised hashing with pairwise labels is to learn a binary code $b_i \in \{-1, 1\}^{c}$ for each point $x_i$, where $c$ is the code length. The binary codes $B = \{b_i\}_{i=1}^{n}$ should preserve the similarity in $S$. More specifically, if $s_{ij} = 1$, the binary codes $b_i$ and $b_j$ should have a low Hamming distance; if $s_{ij} = 0$, the binary codes $b_i$ and $b_j$ should have a high Hamming distance. In general, we can write the binary code as $b_i = h(x_i) = [h_1(x_i), h_2(x_i), \ldots, h_c(x_i)]^{T}$, where $h(x_i)$ is the hash function to learn. For the subnetwork training step, we use the model and learning method called deep pairwise supervised hashing (DPSH) from Li et al. [15]. The model is an end-to-end deep learning method, which consists of two parts: the feature learning part and the objective function part.
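For codes with entries in $\{-1, +1\}$ (the convention of DPSH [15], assumed here), the Hamming distance and the inner product are linked by $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}(c - b_i^{T} b_j)$, which is why a large inner product corresponds to a small Hamming distance. The short sketch below checks this identity; the helper names are illustrative.

```python
def hamming_distance(b_i, b_j):
    """Number of positions where two {-1, +1} codes differ."""
    if len(b_i) != len(b_j):
        raise ValueError("codes must have the same length")
    return sum(1 for x, y in zip(b_i, b_j) if x != y)


def inner_product(b_i, b_j):
    """Plain dot product of two codes."""
    return sum(x * y for x, y in zip(b_i, b_j))


b1 = [1, -1, 1, 1]
b2 = [1, 1, 1, -1]
c = len(b1)

# The codes differ in positions 1 and 3, so the Hamming distance is 2.
print(hamming_distance(b1, b2))            # 2
# Identity: dist_H = (c - <b1, b2>) / 2
print((c - inner_product(b1, b2)) // 2)    # 2
```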

The feature learning part has seven layers, which are the same as those of the fast CNN architecture (CNN-F) in [42, 43].

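As for the objective function part, given the binary codes $B = \{b_i\}_{i=1}^{n}$ for all the images, the likelihood of the pairwise labels $S = \{s_{ij}\}$ can be defined as that of LFH [36]:

$$p(s_{ij} \mid B) = \begin{cases} \sigma(\Omega_{ij}), & s_{ij} = 1, \\ 1 - \sigma(\Omega_{ij}), & s_{ij} = 0, \end{cases}$$

where $\Omega_{ij} = \frac{1}{2} b_i^{T} b_j$ and $\sigma(\Omega_{ij}) = \frac{1}{1 + e^{-\Omega_{ij}}}$. Please note that $b_i \in \{-1, 1\}^{c}$. When taking the negative log-likelihood of the observed pairwise labels in $S$, the problem becomes an optimization problem:

$$\min_{B} J_1 = -\log p(S \mid B) = -\sum_{s_{ij} \in S} \left( s_{ij} \Omega_{ij} - \log\left(1 + e^{\Omega_{ij}}\right) \right).$$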
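The negative log-likelihood can be evaluated directly for a small set of codes. The sketch below assumes the DPSH form of the loss, $-\sum (s_{ij}\Omega_{ij} - \log(1 + e^{\Omega_{ij}}))$ with $\Omega_{ij} = \frac{1}{2} b_i^{T} b_j$; the function names and the dictionary layout of the labels are choices made for this illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_nll(codes, labels):
    """Negative log-likelihood of the pairwise labels.

    codes:  list of {-1, +1} lists, one code per image.
    labels: dict mapping (i, j) -> 1 (similar) or 0 (dissimilar).
    """
    total = 0.0
    for (i, j), s_ij in labels.items():
        omega = 0.5 * sum(a * b for a, b in zip(codes[i], codes[j]))
        total -= s_ij * omega - softplus(omega)
    return total


codes = [[1, 1, 1, 1], [1, 1, 1, 1], [-1, -1, -1, -1]]
consistent = {(0, 1): 1, (0, 2): 0}    # labels agree with the codes
inconsistent = {(0, 1): 0, (0, 2): 1}  # labels contradict the codes

# The loss is lower when similar pairs share codes and dissimilar pairs do not.
print(pairwise_nll(codes, consistent) < pairwise_nll(codes, inconsistent))  # True
```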

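The optimization problem above makes the Hamming distance between two similar images (points) as small as possible and, simultaneously, the Hamming distance between two dissimilar images (points) as large as possible. Because the problem is a discrete optimization problem, which is difficult to solve, we follow the strategy designed by Li et al. [15] and reformulate the problem as follows:

$$\min_{B, U} J_2 = -\sum_{s_{ij} \in S} \left( s_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) \quad \text{s.t.} \quad u_i = b_i, \; u_i \in \mathbb{R}^{c \times 1}, \; b_i \in \{-1, 1\}^{c}, \; \forall i,$$

where $\Theta_{ij} = \frac{1}{2} u_i^{T} u_j$ and $U = \{u_i\}_{i=1}^{n}$. The problem can then be relaxed to a continuous one by moving the equality constraints into a regularization term:

$$\min_{B, U} J_3 = -\sum_{s_{ij} \in S} \left( s_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \eta \sum_{i=1}^{n} \left\| b_i - u_i \right\|_2^2,$$

where $\eta$ is the regularization hyperparameter.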

A fully connected hash layer is placed between the two parts to integrate them into a single framework. The framework is shown in Figure 1. Please note that two images are fed into the framework at each training step, and the loss function is based on the pairwise labels of the images.

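For the hash layer, we set

$$u_i = W^{T} \phi(x_i; \theta) + v,$$

where $\theta$ denotes all the parameters of the first seven layers in the feature learning part, $\phi(x_i; \theta)$ denotes the output of the seventh layer associated with image (point) $x_i$, $W \in \mathbb{R}^{4096 \times c}$ denotes a weight matrix, and $v \in \mathbb{R}^{c \times 1}$ is a bias vector.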
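The hash layer is an ordinary affine map from the seventh-layer features to a $c$-dimensional real vector. The sketch below writes it out for small toy dimensions, assuming the DPSH form $u = W^{T}\phi + v$; the function name `hash_layer` and the toy numbers are illustrative and stand in for the 4096-dimensional CNN features.

```python
def hash_layer(features, W, v):
    """Compute u = W^T * features + v.

    features: list of length d (output of the feature learning part).
    W:        d x c nested list (weight matrix).
    v:        list of length c (bias vector).
    """
    d, c = len(W), len(v)
    if len(features) != d or any(len(row) != c for row in W):
        raise ValueError("shape mismatch between features, W, and v")
    return [
        sum(W[k][m] * features[k] for k in range(d)) + v[m]
        for m in range(c)
    ]


# Toy example with d = 2 features and c = 2 output dimensions.
features = [1.0, 2.0]
W = [[0.5, -1.0],
     [0.25, 0.0]]
v = [0.0, 0.5]

print(hash_layer(features, W, v))  # [1.0, -0.5]
```

The output is real-valued; it is turned into a binary code only by taking the sign of each entry.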

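After connecting the feature learning part and the objective function together, the problem of learning becomes

$$\min_{B, W, v, \theta} J = -\sum_{s_{ij} \in S} \left( s_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \eta \sum_{i=1}^{n} \left\| b_i - \left( W^{T} \phi(x_i; \theta) + v \right) \right\|_2^2.$$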

In each subnetwork, following Li et al. [15], we also adopt the minibatch-based strategy and an alternating method to learn the parameters $W$, $v$, $\theta$, and $B$. We sample a minibatch of images (points) from the whole training set, and each subnetwork learns from these sampled images (points). Then, we optimize one parameter with the other parameters fixed. $b_i$ can be directly optimized as follows:

$$b_i = \operatorname{sgn}(u_i) = \operatorname{sgn}\left( W^{T} \phi(x_i; \theta) + v \right).$$

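We use back-propagation to learn the other parameters $W$, $v$, and $\theta$. Specifically, we can compute the derivatives of the loss function with respect to $u_i$ as follows:

$$\frac{\partial J}{\partial u_i} = \frac{1}{2} \sum_{j: s_{ij} \in S} (a_{ij} - s_{ij})\, u_j + \frac{1}{2} \sum_{j: s_{ji} \in S} (a_{ji} - s_{ji})\, u_j + 2\eta\, (u_i - b_i),$$

where $a_{ij} = \sigma\left( \frac{1}{2} u_i^{T} u_j \right)$. Then, we can update the parameters $W$, $v$, and $\theta$ by back-propagation:

$$\frac{\partial J}{\partial W} = \phi(x_i; \theta) \left( \frac{\partial J}{\partial u_i} \right)^{T}, \qquad \frac{\partial J}{\partial v} = \frac{\partial J}{\partial u_i}, \qquad \frac{\partial J}{\partial \phi(x_i; \theta)} = W \frac{\partial J}{\partial u_i}.$$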
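The derivative with respect to $u_i$ can be sanity-checked numerically. The sketch below assumes the DPSH loss $J = -\sum (s_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}})) + \eta \sum_i \|b_i - u_i\|_2^2$ with $\Theta_{ij} = \frac{1}{2} u_i^{T} u_j$ and compares the analytic gradient with a central finite difference. All function names and the toy values are assumptions made for this illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relaxed_loss(U, B, labels, eta):
    """Pairwise negative log-likelihood plus the quantization penalty."""
    pair_term = 0.0
    for (i, j), s in labels.items():
        theta = 0.5 * dot(U[i], U[j])
        pair_term -= s * theta - softplus(theta)
    quant_term = sum(
        (b - u) ** 2 for b_i, u_i in zip(B, U) for b, u in zip(b_i, u_i)
    )
    return pair_term + eta * quant_term


def grad_wrt_u(U, B, labels, eta, i):
    """Analytic gradient of relaxed_loss with respect to U[i]."""
    c = len(U[i])
    grad = [2.0 * eta * (U[i][m] - B[i][m]) for m in range(c)]
    for (p, q), s in labels.items():
        a = sigmoid(0.5 * dot(U[p], U[q]))
        if p == i:  # pairs where image i is the first element
            for m in range(c):
                grad[m] += 0.5 * (a - s) * U[q][m]
        if q == i:  # pairs where image i is the second element
            for m in range(c):
                grad[m] += 0.5 * (a - s) * U[p][m]
    return grad


def numeric_grad(U, B, labels, eta, i, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grad = []
    for m in range(len(U[i])):
        plus = [row[:] for row in U]
        minus = [row[:] for row in U]
        plus[i][m] += eps
        minus[i][m] -= eps
        diff = relaxed_loss(plus, B, labels, eta) - relaxed_loss(minus, B, labels, eta)
        grad.append(diff / (2.0 * eps))
    return grad


U = [[0.3, -0.7], [1.1, 0.2], [-0.4, 0.9]]
B = [[1, -1], [1, 1], [-1, 1]]  # B = sgn(U)
labels = {(0, 1): 1, (0, 2): 0, (1, 2): 0}
eta = 0.1

analytic = grad_wrt_u(U, B, labels, eta, 0)
numeric = numeric_grad(U, B, labels, eta, 0)
print(all(abs(a - n) < 1e-5 for a, n in zip(analytic, numeric)))  # True
```

In a real network this gradient is passed on to $W$, $v$, and the convolutional layers by the framework's automatic back-propagation; the explicit check is only meant to show that the closed form matches the loss.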

In our method, we train two deep hashing subnetworks using the learning algorithm in [15]. More specifically, the pretrained CNN-F and Caffe-alex [18] networks are used separately in the feature learning parts of the two subnetworks.

##### 3.2. Hash Codes Generating and Fusing

After training the subnetworks, we have hash codes only for the images in the training data. We still have to predict the hash codes for images that did not appear in the training set. For any such image $x_q$, we pass it through each subnetwork and predict its hash code by forward propagation:

$$b_q = h(x_q) = \operatorname{sgn}\left( W^{T} \phi(x_q; \theta) + v \right),$$

where $W$, $v$, and $\theta$ are the learned parameters of the respective subnetwork.

Thus we obtain two hash codes for the image $x_q$, one from each subnetwork. We concatenate the two hash codes into a single vector and use the concatenated code as the final hash code of $x_q$. The hash code generating and fusing process is shown in Figure 2.
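The fusing step described above can be sketched as follows. The real-valued inputs stand in for the hash layer outputs of the two subnetworks; the function names, the toy values, and the convention that a zero output maps to $+1$ are assumptions of this illustration.

```python
def binarize(u):
    """Element-wise sign; zero is mapped to +1 by convention here."""
    return [1 if x >= 0 else -1 for x in u]


def fuse_codes(code_a, code_b):
    """Concatenate the codes of the two subnetworks into one index code."""
    return list(code_a) + list(code_b)


def hamming_distance(b_i, b_j):
    return sum(1 for x, y in zip(b_i, b_j) if x != y)


# Hypothetical hash layer outputs of the two subnetworks for one image.
u_net1 = [0.8, -0.3, 0.1]
u_net2 = [-1.2, 0.4, 0.9]

fused = fuse_codes(binarize(u_net1), binarize(u_net2))
print(fused)  # [1, -1, 1, -1, 1, 1]
```

Two properties follow directly from concatenation: the fused code length is the sum of the two subnetwork code lengths, and the Hamming distance between two fused codes is the sum of the Hamming distances computed by each subnetwork separately, so ranking by the fused code combines the evidence of both subnetworks.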

#### 4. Experiments

##### 4.1. Experimental Settings

All our experiments for DHFI are run with MatConvNet [43] on an NVIDIA K40 GPU server.

In this section, we conduct extensive evaluations of the proposed method on two widely used benchmark datasets with different kinds of images: CIFAR-10 and NUS-WIDE. (1) The CIFAR-10 [44] dataset consists of 60K tiny color images categorized into 10 classes (6K images per class). It is a single-label dataset in which each image belongs to one of the 10 classes. (2) The NUS-WIDE dataset [45, 46] has nearly 270K images collected from the web. It is a multilabel dataset in which each image is annotated with one or more class labels from 81 semantic concepts. Following [15, 40], we only use the images from the 21 most frequent classes. Each of these classes contains at least 5K images.

The experimental protocols in [15] are also employed in our experiments. In CIFAR-10, 1000 images (100 images per class) are randomly selected as the query set. For the unsupervised methods, we use the remaining images as the training set. For the supervised methods, we randomly select 5000 images (500 images per class) from the remaining images as the training set. The pairwise label set is constructed from the image class labels: two images are considered similar if they share the same class label.

In NUS-WIDE, 2100 images from the 21 most frequent labels (100 images per class) are randomly sampled as the query set, following the strategy used in [15, 39, 40]. For the supervised methods, we randomly select 500 images per class from the remaining images as the training set. The pairwise label set is again constructed from the image class labels: two images are considered similar if they share at least one common label.
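Both similarity rules can be expressed with one function if each image carries a set of labels: a single-label image (CIFAR-10) has a one-element set, and two images are similar when their sets intersect. The sketch below is a minimal illustration; the function name and the example labels are hypothetical.

```python
def build_pairwise_labels(label_sets):
    """Return the n x n similarity matrix S with S[i][j] in {0, 1}.

    label_sets: list of sets, one set of class labels per image.
    Two images are similar (1) if they share at least one label.
    """
    n = len(label_sets)
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            S[i][j] = 1 if label_sets[i] & label_sets[j] else 0
    return S


# Multilabel case (NUS-WIDE style): shared label "sky" makes images 1 and 2 similar.
labels = [{"cat"}, {"cat", "sky"}, {"sky"}, {"car"}]
S = build_pairwise_labels(labels)
for row in S:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [0, 1, 1, 0]
# [0, 0, 0, 1]
```

The full matrix grows quadratically with the number of training images, which is manageable for training sets of a few thousand images such as the ones used here.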

Following [15], we compare our method to several state-of-the-art hashing methods, including SH [31], ITQ [8], SPLH [47], KSH [4], FastH [12], LFH [36], SDH [13], DPSH [15], CNNH [39], DHN [41], DSH [5], and NINH [40]. Note that SH and ITQ are unsupervised hashing methods and the other methods are supervised hashing methods. DPSH, CNNH, DHN, and DSH are four deep hashing methods with pairwise labels, while NINH is a triplet-based method. In addition, we evaluate the nondeep hashing methods with deep features extracted by CNN-F.

For hashing methods that use handcrafted features, we represent each image in CIFAR-10 by a 512-dimensional GIST vector, and we represent each image in NUS-WIDE by a 1134-dimensional low-level feature vector, consisting of a 64-D color histogram, a 144-D color correlogram, a 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments, and 500-D SIFT features.

For deep hashing methods, we first resize all images to $224 \times 224$ pixels and then directly use the raw image pixels as input. We adopt the CNN-F network, pretrained on the ImageNet dataset, to initialize the layers of the feature learning part. A similar initialization strategy has also been adopted by other deep hashing methods [48].

For our method, we learn the hash codes separately from pretrained networks with different architectures: we use the CNN-F network and the Caffe-alex network to initialize the parameters of the two subnetworks.

##### 4.2. Results and Discussion

The mean average precision (MAP) is often used to measure accuracy in large-scale image retrieval applications. As in most existing work on hashing, we use the MAP to measure the accuracy of the proposed method. For a fair comparison, all of the methods use identical training and test sets. In this paper, the MAP value for the NUS-WIDE dataset is calculated based on the top 5000 returned neighbors. The best MAP for each category in the tables is shown in boldface.
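For reference, the sketch below computes MAP over the top $k$ returned neighbors under one common convention: the average precision of a query is the mean of the precision values at the ranks where a relevant image is returned, normalized by the number of relevant images within the top $k$. Whether [15] uses exactly this normalization is an assumption, and the function names are illustrative.

```python
def average_precision(relevance, k=None):
    """Average precision of one ranked result list.

    relevance: list of 0/1 flags, ordered by increasing Hamming distance
               to the query (1 = the returned image is relevant).
    k:         evaluate only the top-k results (None = whole list).
    """
    ranked = relevance if k is None else relevance[:k]
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_lists, k=None):
    """Mean of the per-query average precision values."""
    if not relevance_lists:
        raise ValueError("at least one query is required")
    return sum(average_precision(r, k) for r in relevance_lists) / len(relevance_lists)


# Query 1: relevant results at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2
# Query 2: relevant result at rank 2         -> AP = (1/2) / 1
queries = [[1, 0, 1, 0], [0, 1]]
print(round(mean_average_precision(queries), 4))  # 0.6667
```

Setting `k=5000` corresponds to the top-5000 evaluation used for NUS-WIDE.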

First, to verify the effectiveness of fusing deep binary hash codes, we compare our method to two deep pairwise supervised hashing models with different architectures: one uses the pretrained CNN-F model in the feature learning part, and the other uses the pretrained Caffe-alex model. The MAP results are listed in Table 1. Please note that DPSH1 uses the CNN-F pretrained model and DPSH2 uses the Caffe-alex pretrained model. Comparing DHFI to DPSH1 and DPSH2, we find that DHFI outperforms both of them. This means that the fused hash codes learned from deep hashing subnetworks with different architectures give a better solution than the hash codes generated by a single subnetwork.

Second, the MAP results of all methods are listed in Tables 2 and 3. Please note that, in Table 2, DPSH, DSH, DHN, NINH, and CNNH are deep hashing methods, and all the other methods are nondeep methods with handcrafted features. The results of NINH, CNNH, KSH, and ITQ are from [15, 39, 40], the results of DPSH are from [15], the results of DSH are from [5], and the results of DHN are from [41]. Our experimental settings and evaluation metrics are exactly the same as those in [15, 39, 40], so the comparison is reasonable. We find that our method outperforms the other baselines, including unsupervised methods, supervised methods with handcrafted features, and deep hashing methods with feature learning.

To further verify the effectiveness of fusing deep binary hash codes, we compare DHFI to nondeep methods that use deep features extracted by CNN-F. The results are shown in Table 3, where the notation “+CNN” denotes that a method uses deep features as input. We find that our method outperforms all the nondeep baselines with deep features.

#### 5. Conclusion

In this paper, we proposed a “two-stage” deep hashing based fusing index method for image retrieval. In the proposed method, we first train two deep hashing networks with different architectures and then merge the hash codes generated by the separate networks to represent an image. Because the hash codes are learned by different networks, they may carry different information and complement each other, which allows the proposed method to learn better codes than other hashing methods. Experiments on real datasets show that our method outperforms state-of-the-art hashing methods for image retrieval.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was supported by the Natural Science Foundation of China [Grant nos. 61370113, 61572004, 61650201, and 91546111], the Beijing Municipal Natural Science Foundation [Grant nos. 4152005 and 4162058], the Key Project of Beijing Municipal Education Commission [Grant no. KZ201610005009], the Science and Technology Program of Tianjin [Grant no. 15YFXQGX0050], and the Science and Technology Planning Project of Qinghai Province [Grant no. 2016-ZJ-Y04].