Applied Computational Intelligence and Soft Computing

Volume 2017 (2017), Article ID 9635348, 8 pages

https://doi.org/10.1155/2017/9635348

## Deep Hashing Based Fusing Index Method for Large-Scale Image Retrieval

^{1}Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
^{2}Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data, Beijing, China
^{3}National Engineering Laboratory for Critical Technologies of Information Security Classified Protection, Beijing 100124, China
^{4}School of Computer Science, Beijing Information Science and Technology University, Beijing 100101, China
^{5}College of Applied Science, Beijing University of Technology, Beijing 100124, China

Correspondence should be addressed to Xing Su; xingsu@bjut.edu.cn

Received 31 March 2017; Accepted 26 April 2017; Published 24 May 2017

Academic Editor: Ridha Ejbali

Copyright © 2017 Lijuan Duan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Hashing has been widely deployed to perform the Approximate Nearest Neighbor (ANN) search for large-scale image retrieval, addressing the problems of storage and retrieval efficiency. Recently, deep hashing methods have been proposed to perform simultaneous feature learning and hash code learning with deep neural networks. Even though deep hashing has shown better performance than traditional hashing methods with handcrafted features, the compact hash code learned by a single deep hashing network may not provide a full representation of an image. In this paper, we propose a novel hashing indexing method, called the Deep Hashing based Fusing Index (DHFI), to generate a more compact hash code with stronger expression and discrimination ability. In our method, we train two deep hashing subnetworks with different architectures and fuse the hash codes generated by the two subnetworks into a unified representation of each image. Experiments on two real datasets show that our method outperforms state-of-the-art methods in image retrieval applications.

#### 1. Introduction

With the rapid growth of images on the Internet, it is extremely difficult to find relevant images according to different people’s needs. Nowadays the volume of images is becoming larger and larger, and a database holding millions of images is quite common; thus a great deal of time and memory would be consumed by a linear search through the whole database. Moreover, images are usually represented by real-valued features, so the curse of dimensionality often occurs in content-based image search engines and applications.

To address the inefficiency and memory cost of real-valued features, the ANN search [1] has become a popular method and a hot research topic in recent years. Among existing ANN techniques, hashing approaches map images to compact binary codes that approximately preserve the data structure of the original space [2–6]. Due to their high query speed and low memory cost, hashing and image binarization techniques have become the most popular and effective techniques to enhance identification and retrieval of information using content-based image recognition [4, 7–16]. Instead of real-valued features, images are represented by binary codes, so the time and memory costs of search can be greatly reduced [17]. However, the retrieval performance of most existing hashing methods heavily depends on the features they use, which are basically extracted in an unsupervised manner and thus are more suitable for visual similarity search than for semantic similarity search.
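To make the efficiency gain concrete, the following sketch (our illustration, not part of the cited work; the dimensions are arbitrary) shows how binary codes support fast search: ranking a database against a query reduces to counting differing bits, i.e., Hamming distance computations.

```python
import numpy as np

# A minimal sketch (assumed setup): rank a database of 48-bit
# binary codes against a query by Hamming distance.
def hamming_distances(query_code, db_codes):
    """Count the bits where each database code differs from the query."""
    return np.count_nonzero(db_codes != query_code, axis=1)

rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(1000, 48), dtype=np.uint8)  # 1000 images, 48-bit codes
query = rng.integers(0, 2, size=48, dtype=np.uint8)

dists = hamming_distances(query, db_codes)
nearest = np.argsort(dists)[:10]  # indices of the 10 nearest codes
```

A 48-bit code occupies 6 bytes per image, versus 16 KB for a 4096-dimensional float feature vector, which is where the memory savings come from.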

As we all know, the Convolutional Neural Network (CNN) has demonstrated its impressive learning power on image classification [5, 18–20], object detection [21], face recognition [22], and many other vision tasks [23–25]. The CNN used in these tasks can be regarded as a feature extractor guided by the objective function, specifically designed for the individual task [5]. The successful applications of CNN in various tasks imply that the features learned by CNN can well capture the underlying semantic structure of images in spite of significant appearance variations. Moreover, hashing with the deep learning network has shown that both feature representation and hash coding can be learned more effectively.

Inspired by the robustness of CNN features and the high performance of deep hashing methods, we propose a binary code generating and fusing framework to index large-scale image datasets, named Deep Hashing based Fusing Index (DHFI).

In our method, we first train two different deep pairwise hashing networks, which take as training inputs image pairs along with labels indicating whether the two images are similar, and produce binary codes as outputs. Then, we merge the hash codes produced by the two subnetworks and regard the merged hash code as a fingerprint or binary index of an image. With these two stages, images can be easily encoded by forward propagation through the networks, followed by merging the network outputs into a binary hash code representation.
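The fusion step can be sketched as concatenation of the two subnetworks’ codes; the snippet below is a minimal illustration under that assumption, with the subnetwork outputs simulated by random ±1 codes rather than real CNN outputs.

```python
import numpy as np

def fuse_codes(code_a, code_b):
    """Merge the codes from the two subnetworks into one fingerprint
    (assumed here: simple concatenation)."""
    return np.concatenate([code_a, code_b], axis=-1)

rng = np.random.default_rng(42)
# stand-ins for the 48-bit outputs of the two subnetworks for one image
code_a = np.where(rng.standard_normal(48) >= 0, 1, -1)
code_b = np.where(rng.standard_normal(48) >= 0, 1, -1)

fingerprint = fuse_codes(code_a, code_b)  # 96-bit fused index code
```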

The rest of the paper is organized as follows: Section 2 discusses the related work to the method. Section 3 describes DHFI method in detail. Section 4 extensively evaluates the proposed method on two large-scale datasets. Section 5 gives concluding remarks.

#### 2. Related Work

Existing hashing methods can be divided into two categories: data-independent methods and data-dependent methods [8, 24, 26, 27].

The hash function in data-independent methods is typically randomly generated and is independent of any training data. The representative data-independent methods include locality-sensitive hashing (LSH) [1] and its variants. Data-dependent methods try to learn the hash function from some training data, which is also called learning to hash (L2H) methods [15, 26]. L2H methods can achieve comparable or better accuracy with shorter hash codes when compared to data-independent methods. In real applications, L2H methods have become more popular than data-independent methods.
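As an illustration of a data-independent method, the sketch below implements the classic random-projection variant of LSH: the hash function is drawn at random and never looks at training data (the feature dimension and code length are arbitrary choices for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
projections = rng.standard_normal((512, 32))  # random hyperplanes: 512-d features -> 32 bits

def lsh_hash(features, projections):
    """Sign of each random projection gives one bit of the code."""
    return (features @ projections > 0).astype(np.uint8)

features = rng.standard_normal((5, 512))  # five dummy feature vectors
codes = lsh_hash(features, projections)   # shape (5, 32), entries in {0, 1}
```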

Existing L2H methods can be further divided into two categories, unsupervised hashing and supervised hashing; for details, refer to a comprehensive survey [28].

Unsupervised hashing methods use the unlabeled training data only to learn hash functions and encode the input data points to binary codes. Typical unsupervised hashing methods include reconstruction error minimization [29, 30], graph based hashing [3, 31], isotropic hashing (IsoHash) [9], discrete graph hashing (DGH) [32], scalable graph hashing (SGH) [33], and iterative quantization (ITQ) [8].

Supervised hashing utilizes supervised information, such as class labels, to learn compact hash codes. Representative supervised hashing methods include binary reconstruction embedding (BRE) [7], Minimal Loss Hashing (MLH) [34], Supervised Hashing with Kernels (KSH) [4], two-step hashing (TSH) [35], fast supervised hashing (FastH) [12], and latent factor hashing (LFH) [36]. In the pipelines of these methods, images are first represented by handcrafted visual descriptor feature vectors (e.g., GIST [37], HOG [38]), followed by separate projection and quantization steps that encode the vectors into binary hash codes. However, such handcrafted features capture only low-level information of a picture, and their construction process is independent of the hash function learning process, so the resulting features might not be optimally compatible with hash codes.

Recently, as deep learning has shown its effective image representation power on high-level semantic information, many feature learning based deep hashing methods have been proposed and have shown better performance than traditional hashing methods with handcrafted features; examples include convolutional neural network hashing (CNNH) [39], network in network hashing (NINH) [40], deep hashing network (DHN) [41], and deep pairwise supervised hashing (DPSH) [15]. CNNH, proposed by Xia et al., first learns hash codes from the pairwise labels and then learns the hash function and feature representation from image pixels based on those codes. Lai et al. improved the two-stage CNNH by proposing NINH, which uses a triplet ranking loss to preserve relative similarities and encodes the hash codes of images with divide-and-encode modules. Moreover, NINH is a simultaneous feature learning and hash coding deep network, so image representations and hash codes can improve each other in the joint learning process. DHN further improves NINH by controlling the quantization error in a principled way and devising a more principled pairwise cross-entropy loss to link the pairwise Hamming distances with the pairwise similarity labels, while DPSH learns features and hash codes simultaneously with pairwise labels. Because the different components of DPSH can give feedback to each other, DPSH outperforms the other methods in image retrieval applications as far as we know.

In this work, we further improve the retrieval accuracy in two steps: (1) training two deep hashing subnetworks with different architectures and (2) fusing the hash codes generated by the two subnetworks into a unified representation of each image, so that the merged codes carry more semantic information and the two codes complement each other. These two stages constitute the DHFI approach.

#### 3. The Proposed Approach

In this section, we describe our method in detail. We first train two deep hashing subnetworks with different architectures. Then, we pass each image through the subnetworks to generate binary hash codes and fuse the codes generated for the same image. For the first step, discussed in Section 3.1, we follow the simultaneous feature learning and hash code learning method of [15]. The major novelty of our method lies in training two deep hashing subnetworks and fusing the hash codes generated by the two subnetworks to index images.

##### 3.1. Subnetwork Training

Suppose we have $n$ images (feature points) $\mathcal{X} = \{x_i\}_{i=1}^{n}$, and the training set of supervised hashing with pairwise labels also contains a set of pairwise labels $\mathcal{S} = \{s_{ij}\}$ with $s_{ij} \in \{0, 1\}$, where $s_{ij} = 1$ means that $x_i$ and $x_j$ are similar and $s_{ij} = 0$ means that $x_i$ and $x_j$ are dissimilar. Here, the pairwise labels typically refer to semantic labels provided with manual efforts.

The goal of supervised hashing with pairwise labels is to learn a binary code $b_i \in \{-1, +1\}^c$ for each point $x_i$, where $c$ is the code length. The binary codes $\mathcal{B} = \{b_i\}_{i=1}^{n}$ should preserve the similarity in $\mathcal{S}$. More specifically, if $s_{ij} = 1$, the binary codes $b_i$ and $b_j$ should have a low Hamming distance; if $s_{ij} = 0$, the binary codes $b_i$ and $b_j$ should have a high Hamming distance. In general, we can write the binary code as $b_i = h(x_i) = [h_1(x_i), h_2(x_i), \ldots, h_c(x_i)]^T$, where $h(\cdot)$ is the hash function to learn. For the subnetwork training step, we use the model and learning method of deep pairwise supervised hashing (DPSH) from Li et al. [15]. The model is an end-to-end deep learning method, which consists of two parts: the feature learning part and the objective function part.
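For ±1 codes of length $c$, the Hamming distance and the inner product are linked by $\mathrm{dist}_H(b_i, b_j) = (c - b_i^T b_j)/2$, which is why pushing the inner product of similar pairs up (and of dissimilar pairs down) directly controls the Hamming distance. A small sketch of this relation (our illustration, not the paper’s code):

```python
import numpy as np

def binarize(u):
    """b = sgn(u): map real-valued network outputs to a {-1, +1} code."""
    return np.where(u >= 0, 1, -1)

def hamming(b_i, b_j):
    """For {-1, +1} codes of length c: dist_H = (c - b_i^T b_j) / 2."""
    return (len(b_i) - int(b_i @ b_j)) // 2

b = binarize(np.array([0.3, -1.2, 2.0, -0.1]))   # -> [ 1, -1,  1, -1]
assert hamming(b, b) == 0                        # identical codes: distance 0
assert hamming(b, -b) == len(b)                  # opposite codes: maximal distance
```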

The feature learning part has seven layers, which are the same as those of the fast Convolutional Neural Network architecture (CNN-F) in [42, 43].

As for the objective function part, given the binary codes $\mathcal{B} = \{b_i\}_{i=1}^{n}$ for all the images, the likelihood of the pairwise labels can be defined as that of LFH [36]:

$$p(s_{ij} \mid \mathcal{B}) = \begin{cases} \sigma(\Theta_{ij}), & s_{ij} = 1, \\ 1 - \sigma(\Theta_{ij}), & s_{ij} = 0, \end{cases}$$

where $\sigma(\Theta_{ij}) = 1/(1 + e^{-\Theta_{ij}})$ and $\Theta_{ij} = \frac{1}{2} b_i^T b_j$. Please note that $b_i \in \{-1, +1\}^c$. Taking the negative log-likelihood of the observed pairwise labels in $\mathcal{S}$, the problem becomes the optimization problem

$$\min_{\mathcal{B}} J_1 = -\log p(\mathcal{S} \mid \mathcal{B}) = -\sum_{s_{ij} \in \mathcal{S}} \left( s_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right).$$
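Under these definitions the loss takes only a few lines; the sketch below (our illustration, summing over all ordered pairs of a dense 0/1 similarity matrix for simplicity) evaluates $J_1$ for a given code matrix.

```python
import numpy as np

def pairwise_loss(B, S):
    """Negative log-likelihood of pairwise labels.

    B: (n, c) matrix of {-1, +1} codes; S: (n, n) 0/1 similarity matrix.
    Theta_ij = 0.5 * b_i^T b_j; loss = -sum_ij (s_ij * Theta_ij - log(1 + e^Theta_ij)).
    """
    Theta = 0.5 * B @ B.T
    # np.logaddexp(0, x) is a numerically stable log(1 + e^x)
    return -np.sum(S * Theta - np.logaddexp(0.0, Theta))

rng = np.random.default_rng(0)
B = np.where(rng.standard_normal((6, 12)) >= 0, 1.0, -1.0)  # six 12-bit codes
S = (rng.random((6, 6)) > 0.5).astype(float)                # dummy pairwise labels
loss = pairwise_loss(B, S)
```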

The optimization problem above makes the Hamming distance between two similar images (points) as small as possible and, simultaneously, the Hamming distance between two dissimilar images (points) as large as possible. Since this is a discrete optimization problem, which is difficult to solve, we follow the strategy designed by Li et al. [15] and reformulate the problem as

$$\min_{\mathcal{B}, \mathcal{U}} J_2 = -\sum_{s_{ij} \in \mathcal{S}} \left( s_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right), \quad \text{s.t. } u_i = b_i, \; \forall i,$$

where $\Theta_{ij} = \frac{1}{2} u_i^T u_j$ and $\mathcal{U} = \{u_i\}_{i=1}^{n}$ are the real-valued outputs of the network. The problem can then be optimized continuously by moving the equality constraints into a regularization term:

$$\min_{\mathcal{B}, \mathcal{U}} J_3 = -\sum_{s_{ij} \in \mathcal{S}} \left( s_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \eta \sum_{i=1}^{n} \left\| b_i - u_i \right\|_2^2,$$

where $\eta$ is the regularization hyperparameter.
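A minimal sketch of the relaxed objective $J_3$ (ours, using the same dense-matrix simplification; `U` stands in for the real-valued network outputs):

```python
import numpy as np

def relaxed_loss(U, S, eta):
    """Relaxed objective: pairwise term on real-valued outputs U plus quantization penalty.

    U: (n, c) real-valued network outputs; S: (n, n) 0/1 labels; eta: regularization weight.
    """
    B = np.where(U >= 0, 1.0, -1.0)            # b_i = sgn(u_i)
    Theta = 0.5 * U @ U.T
    likelihood = -np.sum(S * Theta - np.logaddexp(0.0, Theta))
    quantization = eta * np.sum((B - U) ** 2)  # eta * sum_i ||b_i - u_i||^2
    return likelihood + quantization

rng = np.random.default_rng(1)
U = rng.standard_normal((6, 12))
S = (rng.random((6, 6)) > 0.5).astype(float)
loss = relaxed_loss(U, S, eta=10.0)
```

The quantization term drives the network outputs toward their binarized values, so the sign function loses little information at encoding time.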

A fully connected hash layer is designed between the two parts to integrate them into a whole framework. The framework is shown in Figure 1. Please note that two images are input into the framework at each training iteration, and the loss function is based on the pairwise labels of the images.