Abstract

In recent years, with the explosion of multimedia data from search engines, social media, and e-commerce platforms, there is an urgent need for fast retrieval methods for massive amounts of data. Hashing is widely used in large-scale and high-dimensional data search because of its low storage cost and fast query speed. Thanks to the great success of deep learning in many fields, deep learning has been introduced into hashing retrieval, where a deep neural network learns image features and hash codes simultaneously; compared with traditional hashing methods, it achieves better performance. However, existing deep hashing methods still have some limitations; for example, most methods consider only one kind of supervised loss, which leads to insufficient utilization of supervised information. To address this issue, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). The proposed JLTDH combines triplet likelihood loss and linear classification loss; moreover, the triplet supervised label is adopted, which contains richer supervised information than pointwise and pairwise labels. At the same time, in order to overcome the cubic increase in the number of triplets and make triplet training more effective, we adopt a novel triplet selection method. The whole process is divided into two stages: in the first stage, the triplets generated by the triplet selection method are taken as the input of the CNN, three CNNs with shared weights are used for image feature learning, and the last layer of the network outputs a preliminary hash code; in the second stage, relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision. We perform extensive experiments on three public benchmark datasets: CIFAR-10, NUS-WIDE, and MS-COCO. Experimental results demonstrate that the proposed method outperforms the compared methods and is also superior to all previous deep hashing methods based on the triplet label.

1. Introduction

In recent years, because of the explosive growth of Internet big data, the Internet is filled with a large number of multimedia resources, including pictures, videos, and text, and there is an urgent need for fast search methods over such massive data. Approximate nearest neighbor (ANN) [1] search has been widely used in many fields such as image retrieval, computer vision, and data mining. Because of its high speed and low memory cost, hashing has become an important branch of ANN search and one of the most widely used techniques in image retrieval [29]. Hashing techniques encode images, documents, videos, or other types of data into short binary codes while preserving the similarity structure of the original data, which makes nearest neighbor search on large datasets efficient.

A series of different hashing methods have been proposed to implement efficient ANN search using the Hamming distance [3, 5, 8, 9]. More recently, deep hashing methods [10, 11, 12, 13, 14] have shown that image representations and hash codes can be learned more effectively using deep neural networks, resulting in state-of-the-art results on many benchmark datasets.

Recently, triplet loss has been studied for computer vision problems. Triplet labels contain richer information than pairwise labels: each triplet label can be naturally decomposed into two pairwise labels. In particular, a triplet label ensures that, in the learned hash code space, the query image is close to the positive image and far away from the negative image, whereas a pairwise label only enforces a single such constraint. The retrieval performance of triplet loss is better than that of pointwise and pairwise losses. Therefore, the triplet likelihood loss is introduced in this paper.

At the same time, the existing deep hashing methods still have some shortcomings in the utilization of classification information. The classification information only plays a role in deep neural network image representation but has no direct impact on the optimization of the hash function. Therefore, this paper proposes a linear classification loss to deal with this situation.

Therefore, combining the triplet likelihood loss and the linear classification loss, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). The supervised information used in this method is in the form of triplet labels. To fully utilize both the triplet information and the classification information, we propose a joint loss function consisting of two parts: the triplet negative log-likelihood loss and the linear classification loss. Based on this joint loss function, the hash codes can be further optimized by the linear classifier, which captures the relationship between the label information and the hash codes. We choose a convolutional neural network (CNN), such as AlexNet, VGG, or ResNet, as our deep learning model, which learns the image representation and the hash function at the same time. The whole process is divided into two stages: in the first stage, the triplets generated by the triplet selection method are taken as the input of the CNN, three CNNs with shared weights are used for image feature learning, and the last layer of the network outputs a preliminary hash code; in the second stage, relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision.

The contributions of this work are summarized as follows:
(1) We propose a triplet deep hashing method with a negative log-likelihood loss that performs both image feature representation and hash code learning in a convolutional neural network. To overcome the cubic increase in the number of triplets and make triplet training more effective, we adopt a novel triplet selection method.
(2) To fully utilize the supervised triplet information, JLTDH adopts a joint loss function combining the triplet negative log-likelihood loss and the linear classification loss. Relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision.
(3) We perform extensive experiments on three public benchmark datasets: CIFAR-10, NUS-WIDE, and MS-COCO. Experimental results demonstrate that the proposed method outperforms the compared methods and is also superior to all previous deep hashing methods based on the triplet label.

2. Related Work

Hashing methods are divided into data-independent and data-dependent methods. Locality-sensitive hashing (LSH) [1] is a typical representative of data-independent hashing methods: its hash functions are generated by random projections that map data points from the original space to the Hamming space, and the training data are not used to learn the hash functions. One drawback of LSH is that it usually requires a very long bit length, which leads to huge storage overhead. Data-dependent methods [15] learn the hash function from the training data and are also known as learning to hash (L2H) methods. Compared to data-independent methods, L2H methods can achieve better accuracy with shorter hash codes; therefore, L2H methods are more widely used in practical applications.

The L2H methods include two types: unsupervised hashing and supervised hashing [4, 16]. Unsupervised hashing does not use supervised information for hash function learning, and the purpose of unsupervised hashing is to preserve the metric structure in the training data. Typical unsupervised approaches include iterative quantization (ITQ) [3], isotropic hashing (IsoHash) [4], discrete graph hashing (DGH) [15], and scalable graph hashing (SGH) [17]. Unsupervised hashing often fails to achieve satisfactory retrieval performance in practical applications.

On the contrary, supervised hashing tries to learn the hash function by utilizing supervised information: it maps data points from the original space to the Hamming space while using the supervised information to optimize the learning of the hash function and thus obtain better hash codes. Because it exploits supervised information for learning, supervised hashing achieves higher precision than unsupervised hashing, so more and more researchers have studied it in depth [5, 12, 18]. Typical supervised hashing methods include supervised hashing with kernels (KSH) [5], fast supervised hashing (FastH) [8], supervised discrete hashing (SDH) [9], latent factor hashing (LFH) [19], adaptive binary quantization (ABQ) [20], hash bit selection [21], and structure-sensitive hashing (SSH) [10]. ABQ [20] jointly pursues a set of prototypes in the original space and a subset of binary codes in the Hamming space; the prototypes and the codes are correspondingly associated and together define the hash function for compact hash codes. Hash bit selection [21] presents two related selection methods via dynamic programming and quadratic programming, incorporating bit reliability and complementarity. SSH [10] simultaneously captures the two types of structures among data in an alternating way and learns discriminative hash functions that quantize data into cluster prototypes associated with unique binary codes.

However, most traditional supervised hashing methods cannot extract features very well. In recent years, researchers have proposed deep hashing methods, which can effectively extract image features to identify similar images and perform better than traditional hashing methods. Among representative deep hashing methods, convolutional neural network-based hashing (CNNH) [18] adopts a two-stage strategy: learning binary hash codes in the first stage and learning a deep-network-based hash function to fit the codes in the second stage. DNNH [12] improves CNNH with a simultaneous feature learning and hash coding pipeline in which deep representations and hash codes are optimized by a triplet loss. The deep hashing network (DHN) [14] improves DNNH by jointly preserving pairwise semantic similarity and controlling the quantization error, simultaneously optimizing the pairwise cross-entropy loss and the quantization loss in a multitask approach. Other typical deep hashing methods include deep pairwise supervised hashing (DPSH) [13] and deep supervised hashing (DSH) [22].

Recently, some new deep hashing methods have emerged, such as cross-modal hashing and hashing based on generative adversarial networks. For example, progressive generative hashing (PGH) [23] learns a discriminative hashing network in an unsupervised way by exploiting the power of hash-conditioned GANs and progressive learning. Triplet-based deep hashing (TDH) [24] is used for cross-modal retrieval, where triplet labels are exploited as supervised information to capture the relative semantic correlation between heterogeneous data from different modalities. UCH [25] is a cross-modal retrieval method in which an outer-cycle network is used to learn a powerful common representation and an inner-cycle network is employed to generate reliable hash codes. Deep hashing methods can significantly outperform nondeep supervised hashing in many applications, so we focus on deep hashing.

3. The Framework of JLTDH

3.1. Problem Definition

In our proposed method JLTDH, the input of the convolutional neural network is triplets. We denote the image triplet set as $\mathcal{T} = \{(x_q, x_p, x_n)\}$; in each triplet $(x_q, x_p, x_n)$, $x_q$, $x_p$, and $x_n$ indicate, respectively, the anchor point, the positive point, and the negative point; the positive image $x_p$ is defined as an image similar to $x_q$ ($x_q$ and $x_p$ belong to the same category), and the negative image $x_n$ is defined as an image dissimilar to $x_q$ ($x_q$ and $x_n$ belong to different categories). The distance between $x_q$ and $x_p$ should be smaller than the distance between $x_q$ and $x_n$.

Our goal is to learn a hash code $b_i \in \{-1, +1\}^K$ for each image $x_i$, where $K$ is the code length; the similarity between two hash codes is measured by the Hamming distance. The hash codes should satisfy all the triplet labels in the Hamming space as much as possible; that is, for each triplet $(x_q, x_p, x_n)$, $\mathrm{dist}_H(b_q, b_p)$ should be smaller than $\mathrm{dist}_H(b_q, b_n)$, where $\mathrm{dist}_H(\cdot, \cdot)$ represents the Hamming distance between two hash codes.

3.2. Framework

We introduce the proposed framework of JLTDH; this is an end-to-end hash learning framework based on the convolutional neural network.

As shown in Figure 1, we propose a triplet deep hashing method with joint supervised loss, a deep learning framework capable of both automatic feature learning and hash coding that joins triplet deep learning and linear classification quantization. It is an end-to-end approach supervised by triplet labels and contains three main components: an image feature learning part, a hash code learning part, and a joint loss function part, all integrated into the same end-to-end framework.

We generally generate triplets based on the category information of the samples: the anchor image and the positive image are selected from the same category, and the negative image is selected from a different category. However, as the dataset grows, the number of all possible triplets becomes very large; using all triplets is computationally difficult and not optimal, is not helpful for training, and leads to slow convergence. Existing triplet hashing methods do not solve the problem of triplet selection very well, so the mining and selection of triplets is an urgent problem. We adopt a novel triplet selection method, which is discussed in detail in Section 4.

3.2.1. Image Feature Learning Part

In this part, we use three CNNs with shared weights to extract feature representations suitable for binary hash code learning. We use the AlexNet [26] architecture, although VGG [27] and ResNet [28] can also be applied here. Each CNN contains five convolutional layers and three fully connected layers. The last layer of AlexNet is replaced with a fully connected (FC) layer whose output is projected to the hash code. The configuration of AlexNet is shown in Table 1.

3.2.2. Hash Code Learning Part

This part generates the hash code of an image from the image features of the previous part. The last FC layer uses the sign function as its activation function, so the binary code is obtained by applying the sign function to its output, and the length of the hash code is determined by the number of neurons in this layer.
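For concreteness, the following is a minimal PyTorch sketch of the feature learning and hash code learning parts described above: an AlexNet backbone whose final classification layer is replaced by a K-dimensional FC hash layer, applied with shared weights to the three images of a triplet, with binary codes obtained by taking the sign of the output. The class and function names are illustrative and not taken from the authors' released code; the exact torchvision argument for loading pretrained weights may differ across versions.

```python
import torch
import torch.nn as nn
from torchvision import models


class TripletHashNet(nn.Module):
    """AlexNet backbone whose last FC layer is replaced by a K-dimensional hash layer."""

    def __init__(self, code_length=32):
        super().__init__()
        backbone = models.alexnet(pretrained=True)  # newer torchvision: weights="IMAGENET1K_V1"
        # Replace the 1000-way classification layer with a K-dimensional hash layer.
        backbone.classifier[-1] = nn.Linear(4096, code_length)
        self.backbone = backbone

    def forward(self, anchor, positive, negative):
        # The same network (shared weights) processes all three images of the triplet.
        u_a = self.backbone(anchor)
        u_p = self.backbone(positive)
        u_n = self.backbone(negative)
        return u_a, u_p, u_n

    @torch.no_grad()
    def encode(self, images):
        # Binary codes are the sign of the hash-layer output.
        return torch.sign(self.backbone(images))
```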

3.2.3. Joint Loss Function Part

The joint loss function is the weighted combination of two kinds of supervised loss: the triplet negative log-likelihood loss and the linear classification loss. It is designed to further optimize the hash code so that the hash code and the classification information maintain the semantic relationships between points.

4. Triplet Selection Method

In large datasets such as NUS-WIDE and MS-COCO, the number of all possible triplets is very large, so using all triplets is computationally difficult and not optimal; it is not helpful for training and leads to slow convergence. For example, the training set of the MS-COCO dataset used in this paper contains 10,000 images, and the number of all possible triplets is enormous, which makes enumerating them impractical.

Inspired by the study in [29], we adopt a novel triplet selection method to reduce the computational cost. We randomly split the training data into several groups $\{G_1, G_2, \ldots, G_m\}$ and select triplets only within each group; within a group, $x_q$, $x_p$, and $x_n$ respectively represent an anchor point, a positive point, and a negative point. The positive set of an anchor consists of the samples in the group that are similar to it, and the negative set consists of the dissimilar samples in the group. We randomly choose negative samples from the negative set; we found that negative points far away from the anchor point are not helpful for training, so we exclude such negative points.
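The following Python sketch illustrates one plausible reading of this grouping strategy; the group size, the number of negatives per anchor, the distance-based filter for discarding overly distant negatives, and the single-label similarity test are all illustrative assumptions (on multilabel datasets such as NUS-WIDE and MS-COCO, two images would instead be treated as similar if they share at least one label).

```python
import random
from collections import defaultdict

import numpy as np


def select_triplets(features, labels, group_size=2000, negs_per_anchor=2,
                    max_neg_distance=None, seed=0):
    """Select triplets only inside randomly formed groups of the training set."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    triplets = []
    for start in range(0, len(indices), group_size):
        group = indices[start:start + group_size]
        by_label = defaultdict(list)
        for i in group:
            by_label[labels[i]].append(i)
        for anchor in group:
            positives = [i for i in by_label[labels[anchor]] if i != anchor]
            negatives = [i for i in group if labels[i] != labels[anchor]]
            if not positives or not negatives:
                continue
            if max_neg_distance is not None:
                # Discard negatives that are far from the anchor, since such "easy"
                # negatives contribute little to training; keep all if none remain.
                close = [i for i in negatives
                         if np.linalg.norm(features[i] - features[anchor]) <= max_neg_distance]
                negatives = close or negatives
            positive = rng.choice(positives)
            for negative in rng.sample(negatives, min(negs_per_anchor, len(negatives))):
                triplets.append((anchor, positive, negative))
    return triplets
```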

Using the proposed triplet selection method, we find that the number of triplets is much smaller than the number of possible triplets in our dataset. The specific results are shown in Table 2. The CPU running the code of the triplet selection method is Intel Xeon CPU E5-2687W @ 3.0 GHz with 12 cores, and the RAM is 32 GB. The time cost of the triplet selection method on three datasets is low and acceptable.

5. Joint Loss Function

The joint loss function combines two kinds of supervised loss functions: the triplet negative log-likelihood loss and the linear classification loss. We introduce them in the following.

5.1. Triplet Negative Log-Likelihood Loss

The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. It is often used to measure the similarity between two hash codes: the smaller the Hamming distance between two hash codes, the more similar they are, and vice versa. $\mathrm{dist}_H(b_i, b_j)$ denotes the Hamming distance between $b_i$ and $b_j$ and can be calculated from the inner product $b_i^\top b_j$ as $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}(K - b_i^\top b_j)$, where $K$ is the length of the hash code. From this relation, it can be concluded that the larger the inner product of two hash codes, the smaller their Hamming distance and the more similar they are. Let $\Theta_{ij} = \frac{1}{2} b_i^\top b_j$ represent half of the inner product between two binary codes $b_i, b_j \in \{-1, +1\}^K$. DPSH [13] is a method for simultaneously learning feature representations and hash codes using the pairwise label likelihood as the loss function, and the likelihood is formulated as
$$p(s_{ij} \mid B) = \begin{cases} \sigma(\Theta_{ij}), & s_{ij} = 1, \\ 1 - \sigma(\Theta_{ij}), & s_{ij} = 0, \end{cases} \tag{1}$$
where $B$ is the set of all hash codes, $s_{ij}$ indicates whether $x_i$ and $x_j$ are similar, and $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function. Two hash codes with a large inner product should therefore have a high similarity probability.
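The relation between the Hamming distance and the inner product is easy to verify numerically; the short NumPy check below (our own illustration, not part of the proposed method) confirms that $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}(K - b_i^\top b_j)$ for codes in $\{-1, +1\}^K$.

```python
import numpy as np

K = 16
rng = np.random.default_rng(0)
b_i = rng.choice([-1, 1], size=K)
b_j = rng.choice([-1, 1], size=K)

hamming = np.sum(b_i != b_j)        # positions where the two codes differ
from_inner = 0.5 * (K - b_i @ b_j)  # the same value computed from the inner product

assert hamming == from_inner
```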

The supervised information used by DPSH [13] is the pairwise label. Inspired by DPSH, this paper proposes a hashing method based on a triplet label likelihood function, implemented with a deep neural network. Given the set of triplet labels $\mathcal{T}$ and assuming the triplets are conditionally independent, according to Bayes' theorem, with some prior $p(B)$, the posterior probability of $B$ can be computed as follows:
$$p(B \mid \mathcal{T}) \propto p(\mathcal{T} \mid B)\, p(B) \tag{2}$$

We can define the triplet label likelihood function as follows:
$$p(\mathcal{T} \mid B) = \prod_{(x_q, x_p, x_n) \in \mathcal{T}} p\big((x_q, x_p, x_n) \mid B\big) \tag{3}$$

We can learn the optimal hash codes $B$ through maximum a posteriori estimation.

In order to relate $\Theta_{ij}$ to the Hamming distance, according to $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}(K - b_i^\top b_j)$, we can get
$$\Theta_{qp} - \Theta_{qn} = \mathrm{dist}_H(b_q, b_n) - \mathrm{dist}_H(b_q, b_p) \tag{4}$$

From equation (4), we can conclude that the smaller the distance between the anchor point and the positive point and the greater the distance between the anchor point and the negative point, the larger the value of $\Theta_{qp} - \Theta_{qn}$.

According to equation (4), we can make the following definition:
$$p\big((x_q, x_p, x_n) \mid B\big) = \sigma\big(\Theta_{qp} - \Theta_{qn} - \alpha\big), \tag{5}$$
where $\alpha$ is a superparameter whose role is to adjust the gap between $\mathrm{dist}_H(b_q, b_p)$ and $\mathrm{dist}_H(b_q, b_n)$. When $\Theta_{qp} - \Theta_{qn} > \alpha$, we can conclude that $p((x_q, x_p, x_n) \mid B) > \frac{1}{2}$ and that $\mathrm{dist}_H(b_q, b_n) - \mathrm{dist}_H(b_q, b_p) > \alpha$. The smaller the distance between the anchor point and the positive point and the greater the distance between the anchor point and the negative point, the larger the value of the triplet likelihood function, and vice versa. This is consistent with the intention of our objective function.

By maximizing the triplet likelihood $p(\mathcal{T} \mid B)$, we can enforce the Hamming distance between the anchor image and the positive image to be smaller than that between the anchor image and the negative image. Taking the negative log-likelihood of the triplet labels, the following optimization problem can be obtained:
$$\min_{B} \ -\log p(\mathcal{T} \mid B) = -\sum_{(x_q, x_p, x_n) \in \mathcal{T}} \log p\big((x_q, x_p, x_n) \mid B\big) \tag{6}$$

The above optimization problem can minimize the Hamming distance between the anchor point and the positive point and simultaneously maximize the Hamming distance between the anchor point and the negative point. This exactly matches the goal of supervised hashing with the triplet label.

Substituting formula (5) into formula (6), the following formula can be obtained:
$$\min_{B} \ -\sum_{(x_q, x_p, x_n) \in \mathcal{T}} \Big(\Theta_{qp} - \Theta_{qn} - \alpha - \log\big(1 + e^{\Theta_{qp} - \Theta_{qn} - \alpha}\big)\Big) \tag{7}$$

A common problem with deep hashing methods is how to train the neural network so that its output is binary codes [11, 20, 25, 26]. Minimizing the loss in equation (7) is a difficult discrete optimization problem [30]. The usual approach is to relax the problem from discrete to continuous. In the last layer of the neural network, we use the sign function as the activation function to obtain the binary code. However, the sign function is nondifferentiable; its gradient is zero almost everywhere, so backpropagation of the loss cannot proceed. A good solution is to relax the binary codes and add a quantization error term to the objective function during training. This approach is also utilized in [25, 30].

This method is described below: we use $u_i$ to denote the continuous output of the last layer before the sign function for image $x_i$; the binary code can be obtained by $b_i = \mathrm{sign}(u_i)$. We relax the binary codes $b_i$ to the continuous vectors $u_i$ and redefine $\Theta_{ij}$ as follows:
$$\Theta_{ij} = \tfrac{1}{2}\, u_i^\top u_j \tag{8}$$

Then, we approximate equation (7) as
$$\min \ \mathcal{L}_1 = -\sum_{(x_q, x_p, x_n) \in \mathcal{T}} \Big(\Theta_{qp} - \Theta_{qn} - \alpha - \log\big(1 + e^{\Theta_{qp} - \Theta_{qn} - \alpha}\big)\Big) + \eta \sum_{i} \|b_i - u_i\|_2^2, \tag{9}$$
where $\eta \sum_i \|b_i - u_i\|_2^2$ represents the quantization error term and $\eta$ is the superparameter used to balance the original objective function and the quantization error.

To integrate the above image feature representation part and the loss function part into a deep neural network framework, we set
$$u_i = M^\top \phi(x_i; \theta) + v, \tag{10}$$
where $\theta$ stands for all the parameters of the neural network except for the last layer, $\phi(x_i; \theta)$ represents the output of the neural network for image $x_i$, $M$ represents the weight matrix of the last layer, and $v$ is a bias term.
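A minimal PyTorch sketch of the relaxed objective in equation (9) is given below; it assumes the relaxed outputs $u$ of the triplet come from a network such as the one sketched in Section 3.2, and the reduction over the minibatch (summation) as well as the function name are our own choices.

```python
import torch
import torch.nn.functional as F


def relaxed_triplet_loss(u_a, u_p, u_n, alpha, eta):
    """Negative log triplet likelihood plus quantization error, as in equation (9)."""
    theta_ap = 0.5 * (u_a * u_p).sum(dim=1)
    theta_an = 0.5 * (u_a * u_n).sum(dim=1)
    # -log(sigmoid(x)) == softplus(-x), with x = theta_ap - theta_an - alpha.
    likelihood_term = F.softplus(-(theta_ap - theta_an - alpha)).sum()
    # Quantization error between the relaxed outputs and their binary codes.
    u = torch.cat([u_a, u_p, u_n], dim=0)
    b = torch.sign(u).detach()
    quantization_term = ((b - u) ** 2).sum()
    return likelihood_term + eta * quantization_term
```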

5.2. Joint Learning with Linear Classification Quantization Loss

Triplet label information is used to learn hash codes through equation (9), but the label information is not yet fully utilized. To fully exploit the label information, we jointly learn a linear classifier that further optimizes the hash codes, making the learned hash codes optimal. Inspired by the study in [31], we use the following classifier, which represents the relationship between the learned hash code and the label information:
$$y_i = W^\top b_i, \tag{11}$$
where $W \in \mathbb{R}^{K \times C}$ is the classifier weight and $y_i \in \mathbb{R}^{C}$ is the ground-truth label vector, in which $C$ is the number of categories in the dataset. We choose the $L_2$ loss for the linear classifier and define the loss function as follows:
$$\mathcal{L}_2 = \sum_{i=1}^{N} \|y_i - W^\top b_i\|_2^2 + \lambda \|W\|_F^2, \tag{12}$$
where $\mathcal{L}_2$ is the linear classifier loss function, $\lambda$ is the regularization parameter, and $\|\cdot\|_F$ is the Frobenius norm of a matrix. Equations (9) and (12) are combined by a weight parameter, and the following formula can be obtained:
$$\min_{B, W, \theta} \ \mathcal{L} = \mathcal{L}_1 + \mu\, \mathcal{L}_2, \tag{13}$$
where $\|\cdot\|_2$ is the $L_2$ norm of a vector and $\mu$ is the trade-off parameter used to balance the triplet likelihood term and the linear classification loss.
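The classification term and the joint objective can be evaluated as sketched below, with illustrative tensor shapes ($B$ and the relaxed outputs of size $K \times N$, $W$ of size $K \times C$, and $Y$ of size $C \times N$); the helper names are hypothetical.

```python
import torch


def classification_loss(B, W, Y, lam):
    # ||Y - W^T B||_F^2 + lam * ||W||_F^2, i.e., equation (12) in matrix form.
    residual = Y - W.t() @ B
    return (residual ** 2).sum() + lam * (W ** 2).sum()


def joint_loss(triplet_term, B, W, Y, lam, mu):
    # Joint objective of equation (13): triplet term plus mu-weighted classification loss.
    return triplet_term + mu * classification_loss(B, W, Y, lam)
```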

5.3. Optimization

Equation (13) is a joint loss optimization problem, which is nonconvex and difficult to solve directly. Here, we adopt the discrete cyclic coordinate descent (DCC) optimization method. Equation (13) can be decomposed into two suboptimization problems that are solved iteratively by alternating minimization. Writing $B = [b_1, \ldots, b_N]$ and $Y = [y_1, \ldots, y_N]$ in matrix form, the linear classification loss in equation (12) can be rewritten as
$$\mathcal{L}_2 = \|Y - W^\top B\|_F^2 + \lambda \|W\|_F^2 \tag{14}$$

By fixing $B$ and using the matrix trace $\mathrm{tr}(\cdot)$, the following formula is obtained:
$$\mathcal{L}_2 = \mathrm{tr}\big((Y - W^\top B)^\top (Y - W^\top B)\big) + \lambda\, \mathrm{tr}(W^\top W) \tag{15}$$

Taking the derivative of equation (15) with respect to $W$ and setting it to zero, we obtain the minimizer of $\mathcal{L}_2$:
$$W = \big(BB^\top + \lambda I\big)^{-1} B\, Y^\top \tag{16}$$

Once $W$ is solved, we treat it as a constant matrix. By fixing $W$, the subproblem in $B$ derived from equations (13) and (14) becomes
$$\min_{B \in \{-1,+1\}^{K \times N}} \ \|Y - W^\top B\|_F^2 + \tfrac{\eta}{\mu}\,\|B - U\|_F^2 \;=\; \min_{B} \ \mathrm{const} - 2\,\mathrm{tr}\big(B^\top Q\big) + \|W^\top B\|_F^2, \tag{17}$$
where $U = [u_1, \ldots, u_N]$ and $Q = WY + \tfrac{\eta}{\mu}\, U$.

We can obtain a closed-form solution for one row of $B$ by fixing the other rows, so we use the discrete cyclic coordinate descent method to solve $B$ iteratively row by row. Let $z^\top$ be the $l$-th row of $B$, $l \in \{1, \ldots, K\}$ ($K$ is the length of the hash code), and let $B'$ be the matrix of $B$ excluding $z^\top$. Let $w^\top$ be the $l$-th row of $W$ and $W'$ the matrix of $W$ excluding $w^\top$. The third term in equation (17) can then be expressed as
$$\|W^\top B\|_F^2 = 2\, z^\top B'^\top W' w + \mathrm{const}. \tag{18}$$

Similarly, let $q^\top$ be the $l$-th row of $Q$ and $Q'$ the matrix of $Q$ excluding $q^\top$. The second term in equation (17) can be expressed as
$$-2\,\mathrm{tr}\big(B^\top Q\big) = -2\, q^\top z + \mathrm{const}. \tag{19}$$

Putting equations (17), (18), and (19) together, the problem for a single row reduces to
$$\min_{z \in \{-1,+1\}^N} \ z^\top B'^\top W' w - q^\top z, \tag{20}$$
which has the optimal solution
$$z = \mathrm{sign}\big(q - B'^\top W' w\big). \tag{21}$$

According to equation (21), each bit (row) $z$ of the hash code matrix is calculated based on the bits that have already been learned. We iteratively update each bit until the procedure converges to a good set of hash codes $B$.
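A NumPy sketch of the alternating optimization described above is given below: the closed-form update of $W$ in equation (16) followed by SDH-style bit-by-bit updates of $B$. The weight `nu` on the fitting term, the composition of the auxiliary matrix $Q$, and the fixed number of alternations are assumptions made for illustration.

```python
import numpy as np


def dcc_step(B, U, Y, lam, nu, n_iters=5):
    """Alternate the closed-form W update with row-wise (bit-wise) updates of B.

    Shapes (illustrative): B and U are K x N, Y is C x N, W is K x C.
    nu is the weight on the ||B - U||_F^2 fitting term.
    """
    K, _ = B.shape
    for _ in range(n_iters):
        # W-step: ridge-regression closed form, W = (B B^T + lam I)^-1 B Y^T.
        W = np.linalg.solve(B @ B.T + lam * np.eye(K), B @ Y.T)
        # B-step: discrete cyclic coordinate descent, one row (bit) at a time.
        Q = W @ Y + nu * U  # auxiliary matrix collecting the linear terms in B
        for l in range(K):
            mask = np.arange(K) != l
            B_rest, W_rest = B[mask, :], W[mask, :]   # B and W without the l-th row
            w_l, q_l = W[l, :], Q[l, :]
            B[l, :] = np.where(q_l - B_rest.T @ W_rest @ w_l >= 0, 1.0, -1.0)
    return B, W
```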

The proposed method JLTDH is briefly summarized in Algorithm 1.

Input: training images $X = \{x_i\}_{i=1}^{N}$ with labels $Y$; code length $K$; epochs = 150; superparameters $\alpha$, $\eta$, $\mu$, $\lambda$; and minibatch size = 128
Initialization: initialize the neural network parameters $\theta$, $M$, and $v$ with the pretrained AlexNet model, and set the iteration number $t = 0$
 Generate the triplet training set $\mathcal{T}$;
for $t = 1, 2, \ldots,$ epochs do
 Randomly sample a minibatch of points from $\mathcal{T}$, and for each sampled point $x_i$:
(i) Calculate $\phi(x_i; \theta)$ by forward propagation;
(ii) Compute $u_i = M^\top \phi(x_i; \theta) + v$;
(iii) Compute the binary code of $x_i$ with $b_i = \mathrm{sign}(u_i)$;
(iv) Compute the derivatives of the loss for point $x_i$;
(v) Update the parameters $\theta$, $M$, and $v$ by utilizing BP;
(vi) Compute $U$ and $B$ for the minibatch;
for each DCC iteration do
  Discrete cyclic coordinate descent (DCC) optimization:
(i) Compute $W$ according to (16);
(ii) Iteratively update $B$ bit by bit using the DCC method according to (21) in the minibatch;
End for
 Update the joint loss in the minibatch according to the joint loss function in (13);
End for
Output: the hash function $h(x) = \mathrm{sign}(M^\top \phi(x; \theta) + v)$ with the parameters $\theta$, $M$, $v$;
 Hash codes $B = \{b_i\}_{i=1}^{N}$.

6. Experiment and Analysis

In this section, we describe our experiments and results. Three commonly used datasets are used to verify the effectiveness of our algorithm. We calculate the precision and mean average precision (MAP) of the retrieval results and show the performance of our method on CIFAR-10, NUS-WIDE, and MS-COCO. Specifically, given an anchor (query) $x_q$, its average precision (AP) is calculated as
$$\mathrm{AP}(x_q) = \frac{1}{R_q} \sum_{k=1}^{n} P_q(k)\, \delta_q(k),$$
where $R_q$ is the number of relevant samples, $P_q(k)$ is the precision at cutoff $k$ in the returned sample list, and $\delta_q(k)$ is an indicator function that equals 1 if the $k$-th returned sample is a ground-truth neighbor of $x_q$ and 0 otherwise. Given $Q$ queries, MAP is the mean of the AP values over all queries:
$$\mathrm{MAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}(x_q).$$
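These definitions translate directly into code; the following sketch (our own helper functions, computing AP over a ranked relevance list that can be truncated to the top 5,000 neighbors as in our experiments) mirrors the two formulas above.

```python
import numpy as np


def average_precision(ranked_relevance, cutoff=5000):
    """AP for one query; ranked_relevance[k] is 1 if the k-th returned item is relevant."""
    rel = np.asarray(ranked_relevance, dtype=float)[:cutoff]
    num_relevant = rel.sum()
    if num_relevant == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_k * rel).sum() / num_relevant)


def mean_average_precision(all_ranked_relevance, cutoff=5000):
    """MAP: mean of the per-query AP values over all queries."""
    return float(np.mean([average_precision(r, cutoff) for r in all_ranked_relevance]))
```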

6.1. Experimental Settings

Our server configuration is as follows: the CPU is Intel Xeon CPU E5-2687W @ 3.0 GHz with 12 cores, the GPU is NVIDIA GTX 1080 8 GB, and the RAM is 32 GB. The Linux operating system is Ubuntu 16.04, and the deep learning framework is PyTorch [32].

We use three widely used benchmark datasets of different scales: CIFAR-10 [33], NUS-WIDE [34], and MS-COCO [35]. The CIFAR-10 dataset contains 60,000 images (50,000 training images and 10,000 test images) belonging to 10 categories. The size of each image is 32 × 32 pixels. We randomly selected 5,000 images (500 per class) as the training set and 1,000 images (100 per class) as the test query set.

The NUS-WIDE dataset contains 269,648 images in 81 categories. We used the 21 most commonly used categories. We randomly selected 2,100 images (100 images per class) as the query set and 10,500 images (500 images per class) as the training set.

MS-COCO is an image dataset widely used for image recognition, segmentation, and captioning. It contains 82,783 training images and 40,504 validation images, in which each image is labeled with some of the 80 semantic concepts. We randomly selected 5,000 images as the query set, used the remaining images as the database, and randomly sampled 10,000 images from the database for training. Table 3 shows some sample points from the three datasets, and Table 4 shows the dataset settings used in our experiments.

We compare our method with several representative hashing methods in terms of MAP; the compared methods are divided into two groups: traditional hashing methods and deep hashing methods. Traditional unsupervised hashing methods include SH [36] and ITQ [3]. Traditional supervised hashing methods include KSH [5], FastH [8], SDH [9], SPLH [37], and LFH [19]. The deep hashing methods include DSRH [30], DSCH [38], DRSCH [38], CNNH [18], NINH [12], DPSH [13], DHN [14], DQN [39], DTSH [40], VDSH [41], and DSDH [42]. In this paper, we also emphasize the comparison with deep hashing methods based on triplet labels, including DSRH [37], DSCH [38], DRSCH [38], and NINH [12]. For fairness, some results are taken directly from previously published papers. Following [41], CNN features on CIFAR-10 were extracted using the pretrained CNN-F model. The hyperparameters of this method are set according to the standard cross-validation procedure: $\lambda$ in equation (12) is set to 0.1, $\mu$ to 1, $\eta$ to 55, and $\alpha$ to half the length of the hash code.

6.2. Empirical Analysis
6.2.1. Comparison to Other Deep Methods and Nondeep Methods

As shown in Table 5 and Figure 2, the MAP is calculated based on the top 5,000 returned neighbors. The NINH, CNNH, KSH, and ITQ results were obtained from the studies in [11, 18], and the results of the other methods were obtained from the study in [42]. Our method performs better on these three datasets than the existing hashing methods: compared with nondeep learning methods, the improvement is substantial, and compared with the current best deep learning methods, our method further improves performance by 2–6%. At the same time, we found that our method performs particularly well with shorter hash codes, a regime in which most deep hashing methods already hold a significant performance advantage.

The MAP results for different numbers of bits on the MS-COCO dataset are shown in Table 6; except for our method, the results were obtained from the study in [29]. To be consistent with the settings in [29], we set the hash code length to 8, 16, 24, and 32 bits. The images in MS-COCO are more complex, which makes classification more difficult and may lead to less accurate feature extraction and classification results. Comparing with the NUS-WIDE MAP results in Table 5, the performance on the MS-COCO dataset decreases to a certain extent. In spite of this, our method is still much better than the compared methods.

Our method achieves excellent performance with shorter hash codes. At the same time, we also examined how performance changes with longer hash codes. Since the compared methods do not provide results for long hash codes, we only discuss our own method here. As can be seen from Figure 3, CIFAR-10 reaches a good MAP value at 32 bits, while NUS-WIDE reaches a good MAP value between 48 and 64 bits. As the hash code length grows, the MAP on CIFAR-10 decreases, while the MAP on NUS-WIDE does not decrease but shows no obvious increase either.

6.2.2. Comparison to Nondeep Hashing Methods Using Deep Features

One might argue that the performance improvements come from the neural network rather than from our approach. To further verify our method, we compared it with traditional hashing methods that use the CNN-F network [43] to extract deep features. As shown in Figure 4, our method is significantly superior to the traditional methods. It should be noted that most of the traditional methods have not been evaluated on MS-COCO and we cannot obtain the corresponding data, so we did not conduct this comparative test on MS-COCO.

6.2.3. Comparison to Hashing Methods with Triplet Labels

This paper adopts triplet labels; therefore, we focus on the comparison with deep hashing methods that also use triplet labels, including DSRH [37], DSCH [38], DRSCH [38], and NINH [12]. The results of DSRH, DSCH, and DRSCH were adopted from the study in [38]. As shown in Figure 5, compared with previous deep hashing methods based on the triplet label, our method is clearly better and leads the compared methods by a wide margin.

6.3. Sensitivity to Hyperparameters

The most important hyperparameters for JLTDH are $\mu$ and $\eta$. $\mu$ is the trade-off parameter used to balance the triplet likelihood term and the linear classification loss, and $\eta$ is used to balance the likelihood term and the quantization error. We explore the influence of these two hyperparameters.

We report the MAP values for $\mu$ in the range [0.1, 200] on two datasets with code lengths of 12 bits and 32 bits. We find that JLTDH is not sensitive to $\mu$ over a large range of values. Figure 6(a) shows the value of $\mu$ at which CIFAR-10 obtains the best MAP performance, and Figure 6(b) shows the corresponding value for NUS-WIDE.

Furthermore, we present the influence of $\eta$ in Figures 7(a) and 7(b). As can be seen, there is a wide range of $\eta$ in which our method performs well; the preferred value of $\eta$ differs between the 12-bit and 32-bit settings, but in each case a single value gives good MAP accuracy on both datasets. The other superparameter settings are obtained through cross-validation: $\lambda$ is set to 0.1, and $\alpha$ is half the length of the hash code.

6.4. Analysis of Three Loss Functions

By observing the convergence of the loss function, we can judge whether the selected loss function is reasonable or not. Figure 8 shows the change of three kinds of losses for different lengths of hash codes during training. We only take CIFAR-10 as an example; the results for the other two datasets are similar. It is reasonable to conclude that the joint loss combining triplet likelihood loss and linear classification quantization loss successfully optimizes the loss during training. Figures 8(a) and 8(b) are similar: they show that the joint loss and triplet likelihood loss converge rapidly and are kept at a low level for different bits, that our method is reasonable and effective, and that the optimization process only needs a few iterations.

In order to further reveal the respective influence of the two parts of the joint loss function, we show the loss curves of the triplet likelihood loss and the linear classification loss separately in Figures 8(b) and 8(c). The magnitude of the loss in Figure 8(c) is much smaller than that in Figure 8(b), because the triplet likelihood loss drives the first stage and plays the major role in training, while the linear classification loss is used in the second stage, which further fine-tunes the result of the first stage.

6.4.1. Ablation Study about Loss Function

In order to confirm the contribution of the different losses to the final performance, we compare against a variant of JLTDH: JLTDH-T, whose loss function contains only the triplet loss and no linear classification loss. As can be seen from Figure 9, on the CIFAR-10 and NUS-WIDE datasets, JLTDH-T already achieves good MAP performance with only the triplet loss. With the further optimization provided by the linear classification loss, JLTDH achieves better MAP performance using the joint loss, leading JLTDH-T by about 2% on average.

6.5. Visualization of Query Results

We show the top 12 returned images to give a better sense of the performance improvement of the proposed method. Figure 10 illustrates the top 12 returned images of the proposed method for three query images on CIFAR-10, NUS-WIDE, and MS-COCO, with a hash code length of 32. It shows that the proposed method truly preserves the features of an image in its hash code. For the query image in CIFAR-10, only one of the returned results is wrong, and the wrong image appears only at the bottom of the ranked list. Similarly, for the query image in NUS-WIDE, only 1 of the top 12 images is incorrect, and the top 12 images have a shape or color similar to the query image. For the query image in MS-COCO, ten of the top 12 images are correct; compared with the previous two datasets, the query accuracy is slightly lower, possibly because MS-COCO is a multiobject dataset. This shows that our method can provide the desired search results.

7. Conclusion

In this work, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). To fully utilize the supervised triplet information, this paper proposes a joint loss function combining two kinds of supervised loss: the triplet negative log-likelihood loss and the linear classification loss. At the same time, in order to overcome the cubic increase in the number of triplets and make triplet training more effective, we design a triplet selection method. The whole process is divided into two stages: first, the last layer of the network outputs a preliminary hash code; second, relying on the joint loss function and the backpropagation algorithm, the parameters of the neural network are further updated so that the generated hash code has higher query precision. We perform extensive experiments on three public benchmark datasets: CIFAR-10, NUS-WIDE, and MS-COCO. Experimental results demonstrate that the proposed method outperforms the compared methods and is also superior to all previous deep hashing methods based on the triplet label.

Data Availability

Data are owned by a third party: the experimental part of this paper uses three widely used public datasets (CIFAR-10, MS-COCO, and NUS-WIDE), which can be publicly accessed. CIFAR-10 can be accessed at http://www.cs.toronto.edu/kriz/cifar.html, MS-COCO at http://mscoco.org, and NUS-WIDE at http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm. The implementation code of the algorithm in this paper is written in PyTorch. The code can be obtained from Mingyong Li upon request at [email protected].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the science and technology project of Chongqing Education Commission of China (No. KJ1600332), the humanities and social science project of Ministry of Education (18YJA880061), the humanities and social science project of Chongqing Education Commission of China (No. 19SKGH035), Chongqing Education Scientific Planning Project (No. 2017-GX-116), and Chongqing Graduate Education Reform Project (No. yjg193093).