Abstract
In recent years, with the explosion of multimedia data from search engines, social media, and e-commerce platforms, there is an urgent need for fast retrieval methods for massive data. Hashing is widely used in large-scale and high-dimensional data search because of its low storage cost and fast query speed. Thanks to the great success of deep learning in many fields, deep learning has been introduced into hashing retrieval, where a deep neural network learns image features and hash codes simultaneously and performs better than traditional hashing methods. However, existing deep hashing methods have some limitations; for example, most methods consider only one kind of supervised loss, which leads to insufficient utilization of supervised information. To address this issue, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). JLTDH combines triplet likelihood loss and linear classification loss; moreover, it adopts triplet supervised labels, which contain richer supervised information than pointwise and pairwise labels. At the same time, in order to overcome the cubic increase in the number of triplets and make triplet training more effective, we adopt a novel triplet selection method. The whole process is divided into two stages. In the first stage, the triplets generated by the triplet selection method are taken as the input of the CNN, three CNNs with shared weights are used for image feature learning, and the last layer of the network outputs a preliminary hash code. In the second stage, relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision. We perform extensive experiments on the three public benchmark datasets CIFAR10, NUSWIDE, and MSCOCO.
Experimental results demonstrate that the proposed method outperforms the compared methods, and the method is also superior to all previous deep hashing methods based on the triplet label.
1. Introduction
In recent years, because of the explosive growth of Internet data, the Internet is filled with a large number of multimedia resources, including pictures, videos, and text, and there is an urgent need for fast search methods for massive data. Approximate nearest neighbor (ANN) [1] search has been widely used in many fields such as image retrieval, computer vision, and data mining. Because of its speed and low memory cost, hashing has become an important branch of ANN search and is one of the most widely used techniques in image retrieval [2–9]. Hashing techniques encode images, documents, videos, or other types of data as short binary codes while preserving the similarity of the original data. These binary encodings make nearest neighbor search over large datasets efficient.
A series of different hashing methods have been proposed to implement efficient ANN search using Hamming distance [3, 5, 8, 9]. More recently, deep hashing methods [10–14] have shown that image representation and hash coding can be learned more effectively using deep neural networks, resulting in state-of-the-art results on many benchmark datasets.
Recently, triplet loss has been studied for computer vision problems. Triplet labels contain richer information than pairwise labels, and each triplet label can be naturally decomposed into two pairwise labels. In particular, a triplet label ensures that the query image is close to the positive image and far away from the negative image in the learned hash code space, whereas a pairwise label only ensures that one constraint is observed. The retrieval performance of triplet loss is better than that of pointwise and pairwise losses. Therefore, the triplet likelihood loss is introduced in this paper.
At the same time, the existing deep hashing methods still have some shortcomings in the utilization of classification information. The classification information only plays a role in deep neural network image representation but has no direct impact on the optimization of the hash function. Therefore, this paper proposes a linear classification loss to deal with this situation.
Combining triplet likelihood loss and linear classification loss, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). The supervised information used in this method is in the form of triplet labels. To fully utilize the triplet information and the classification information, we propose a joint loss function, which consists of two parts: the triplet negative log-likelihood loss and the linear classification loss. With this joint loss function, the hash codes can be further optimized by the linear classifier, which captures the relationship between the label information and the hash codes. We choose a convolutional neural network (CNN), for example, AlexNet, ResNet, or VGG, as our deep learning model, which can learn the image representation and the hash function at the same time. The whole process is divided into two stages. In the first stage, the triplets generated by the triplet selection method are taken as the input of the CNN, three CNNs with shared weights are used for image feature learning, and the last layer of the network outputs a preliminary hash code. In the second stage, relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision.
This work is summarized as follows:
(1) In this paper, a triplet deep hashing method with negative log-likelihood is proposed, and the method performs both image feature representation and hash code learning in a convolutional neural network. In order to overcome the cubic increase in the number of triplets and make triplet training more effective, we adopt a novel triplet selection method.
(2) To fully utilize the supervised triplet information, JLTDH proposes a joint loss function combining the triplet negative log-likelihood loss and the linear classification loss. Relying on the hash code of the first stage and the joint loss function, the neural network model is further optimized so that the generated hash code has higher query precision.
(3) We perform extensive experiments on the three public benchmark datasets CIFAR10, NUSWIDE, and MSCOCO. Experimental results demonstrate that the proposed method outperforms the compared methods, and the method is also superior to all previous deep hashing methods based on the triplet label.
2. Related Works
Hashing methods are divided into data-independent methods and data-dependent methods. Locality-sensitive hashing (LSH) [1] is a typical representative of data-independent hashing methods; its hash function is generated by random projection, which maps data points from the original space to the Hamming space, and the training data are not used to learn the hash functions. One drawback of LSH is that it usually requires a very long bit length, which leads to huge storage overhead. Data-dependent methods [15], also known as learning to hash (L2H) methods, learn the hash function from the features of the training data. Compared to data-independent methods, L2H methods can achieve better accuracy with shorter hash codes. Therefore, L2H methods are more widely used than data-independent methods in practical applications.
The L2H methods include two types: unsupervised hashing and supervised hashing [4, 16]. Unsupervised hashing does not use supervised information for hash function learning, and the purpose of unsupervised hashing is to preserve the metric structure in the training data. Typical unsupervised approaches include iterative quantization (ITQ) [3], isotropic hashing (IsoHash) [4], discrete graph hashing (DGH) [15], and scalable graph hashing (SGH) [17]. Unsupervised hashing often fails to achieve satisfactory retrieval performance in practical applications.
On the contrary, supervised hashing tries to learn the hash function by utilizing supervised information. The purpose of supervised hashing is to map the data points in the original space to the Hamming space under the guidance of supervised information, which optimizes the learning of the hash function so as to learn better hash codes. Because it learns from supervised information, supervised hashing achieves higher precision than unsupervised hashing, so more and more researchers have studied it in depth in recent years [5, 12, 18]. Typical supervised hashing methods include supervised hashing with kernels (KSH) [5], fast supervised hashing (FastH) [8], supervised discrete hashing (SDH) [9], latent factor hashing (LFH) [19], adaptive binary quantization (ABQ) [20], hash bit selection [21], and structure-sensitive hashing (SSH) [10]. ABQ [20] jointly pursues a set of prototypes in the original space and a subset of binary codes in the Hamming space. The prototypes and the codes are correspondingly associated and together define the hash function for small hash codes. Hash bit selection [21] presents two related selection methods via dynamic programming and quadratic programming, incorporating bit reliability and complementarity. SSH [10] simultaneously captures the two types of structures among data in an alternating way. It learns discriminative hash functions that quantize data into the cluster prototypes associated with unique binary codes.
However, most traditional supervised hashing methods cannot extract features very well. In recent years, researchers have proposed deep hashing methods, which can effectively extract image features to identify similar images and perform better than traditional hashing methods. Representative deep hashing methods include convolutional neural network-based hashing (CNNH) [18], which adopts a two-stage strategy: learning binary hash codes in the first stage and learning a deep-network-based hash function to fit the codes in the second stage. DNNH [12] improved CNNH with a simultaneous feature learning and hash coding pipeline such that deep representations and hash codes are optimized by the triplet loss. The deep hashing network (DHN) [14] improves DNNH by jointly preserving the pairwise semantic similarity and controlling the quantization error, simultaneously optimizing the pairwise cross-entropy loss and the quantization loss via a multitask approach. Other typical deep hashing methods include deep pairwise supervised hashing (DPSH) [13] and deep supervised hashing (DSH) [22].
Recently, some new deep hashing methods have emerged, such as cross-modal hashing and hashing-based generative adversarial networks. Representative methods include progressive generative hashing (PGH) [23], which learns a discriminative hashing network in an unsupervised way by exploiting the power of hash-conditioned GANs and progressive learning. Triplet-based deep hashing (TDH) [24] is used for cross-modal retrieval, and triplet labels are exploited as supervised information to capture relative semantic correlation between heterogeneous data from different modalities. UCH [25] is a cross-modal retrieval method, where the outer-cycle network is used to learn a powerful common representation and the inner-cycle network is used to generate reliable hash codes. Deep hashing methods can significantly outperform non-deep supervised hashing in many applications, so we focus on deep hashing.
3. The Framework of JLTDH
3.1. Problem Definition
In our proposed method JLTDH, the input of the convolutional neural network is triplets. We denote the image triplet set as $T = \{(x_i, x_i^{+}, x_i^{-})\}_{i=1}^{M}$; in each triplet $(x_i, x_i^{+}, x_i^{-})$, $x_i$, $x_i^{+}$, and $x_i^{-}$ indicate, respectively, the anchor point, the positive point, and the negative point; the positive image is defined as $x_i^{+}$ ($x_i$ and $x_i^{+}$ are similar and belong to the same category), and the negative image is defined as $x_i^{-}$ ($x_i$ and $x_i^{-}$ are dissimilar and belong to different categories). The distance between $x_i$ and $x_i^{+}$ is smaller than the distance between $x_i$ and $x_i^{-}$.
Our goal is to learn the hash codes $b_i \in \{-1,+1\}^{K}$ for the image $x_i$; the similarity between two hash codes is calculated using the Hamming distance. The hash codes should satisfy all the triplet labels in the Hamming space as much as possible; for a triplet label $(x_i, x_i^{+}, x_i^{-})$, $\operatorname{dist}_H(b_i, b_i^{+})$ should be as small as possible compared with $\operatorname{dist}_H(b_i, b_i^{-})$, where $\operatorname{dist}_H(\cdot,\cdot)$ represents the Hamming distance between two hash codes.
3.2. Framework
We introduce the proposed framework of JLTDH, an end-to-end hash learning framework based on the convolutional neural network.
As shown in Figure 1, we propose a triplet deep hashing method with joint supervised loss, a deep learning framework capable of both automatic feature learning and hash coding that joins triplet deep learning and linear classification quantization. The approach is supervised by triplet labels and contains three main components: the image feature learning part, the hash code learning part, and the joint loss function part. It integrates these three components into the same end-to-end framework.
We generally generate triplets based on the category information of the sample, select the anchor image and positive image from the same category, and select the negative image from different categories. However, as the dataset increases, the number of all possible triplets is very large; using all triplets is computationally difficult and not optimal, and at the same time, it is not helpful for training and will lead to slow convergence of training. The existing triplet hashing method does not solve the problem of triplet selection very well. Therefore, the mining and selection of triplets is an urgent problem to be solved. We adopt a novel triplet selection method, which will be discussed in detail in Section 4.
3.2.1. Image Feature Learning Part
In this part, we use three CNNs with shared weights to extract the appropriate feature representation for binary hash code learning. We use the AlexNet [26] network architecture for this part; VGG [27] and ResNet [28] can also be used. Each CNN contains five convolutional layers and three fully connected layers. The last layer of AlexNet is replaced with a fully connected (FC) layer whose output is projected to a hash code. The configuration of AlexNet is shown in Table 1.
3.2.2. Hash Code Learning Part
This part generates the hash code of the image from the image features of the previous part. The FC layer uses the sign function as its activation function, which yields the binary code. The length of the hash code is determined by the number of neurons in this last FC layer.
3.2.3. Joint Loss Function Part
The joint loss function is a weighted combination of two supervised loss functions, the triplet negative log-likelihood loss and the linear classification loss. It is designed to further optimize the hash code so that the hash code and the classification information together maintain the semantic relationship between points.
4. Triplet Selection Method
In large datasets like NUSWIDE and MSCOCO, the number of all possible triplets is very large. Thus, using all triplets is computationally difficult and not optimal. Specifically, it is not helpful for training and will lead to slow convergence. For example, the training set of the MSCOCO dataset used in this paper contains 10,000 images, so the number of all possible triplets is on the order of $10{,}000^{3} = 10^{12}$, which is far too many to process.
Inspired by the study in [29], we adopt a novel triplet selection method to reduce the computational cost. We randomly split the training data into several groups $\{G_j\}$ and select triplets only within each group $G_j = (A_j, P_j, N_j)$, where $A_j$, $P_j$, and $N_j$, respectively, represent the anchor points, positive points, and negative points in the group. $P_j$ is the group of positive samples, consisting of the samples similar to the anchor point in the group. We randomly choose negative samples from the group of negative samples $N_j$. We found that negative points far away from the anchor point were not helpful for training, so we excluded these negative points.
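The group-wise selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `select_triplets`, its parameters, the single-label setting, and the use of squared Euclidean distance on hypothetical feature vectors to decide which negatives are "far away" are all assumptions, since the text does not specify them.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_triplets(labels, features, num_groups, negs_per_pair,
                    max_neg_dist, seed=0):
    """Select (anchor, positive, negative) index triplets within random groups.

    Assumptions (not from the paper): one label per sample, and a negative is
    treated as "too far" when its squared distance to the anchor exceeds
    max_neg_dist.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    # Randomly split the training indices into num_groups groups.
    groups = [indices[g::num_groups] for g in range(num_groups)]

    triplets = []
    for group in groups:
        for a in group:
            # Positives: same label as the anchor, within the same group.
            positives = [p for p in group
                         if p != a and labels[p] == labels[a]]
            # Negatives: different label, and not too far from the anchor.
            negatives = [n for n in group
                         if labels[n] != labels[a]
                         and sq_dist(features[a], features[n]) <= max_neg_dist]
            for p in positives:
                k = min(negs_per_pair, len(negatives))
                for n in rng.sample(negatives, k):
                    triplets.append((a, p, n))
    return triplets
```

Because triplets are formed only inside a group and each anchor-positive pair receives at most `negs_per_pair` negatives, the number of selected triplets grows far more slowly than the cubic count of all possible triplets.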
Using the proposed triplet selection method, the number of selected triplets is much smaller than the number of possible triplets in each dataset. The specific results are shown in Table 2. The triplet selection code was run on an Intel Xeon CPU E5-2687W @ 3.0 GHz with 12 cores and 32 GB of RAM. The time cost of the triplet selection method on the three datasets is low and acceptable.
5. Joint Loss Function
The joint loss function combines two kinds of supervised loss functions: the triplet negative log-likelihood loss and the linear classification loss. We introduce them in the following.
5.1. Triplet Negative Log-Likelihood Loss
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. It is often used to measure the similarity between two hash codes: the smaller the Hamming distance between two hash codes, the more similar they are, and vice versa. $\operatorname{dist}_H(b_i, b_j)$ indicates the Hamming distance between $b_i$ and $b_j$, which can be calculated from the inner product $\langle b_i, b_j \rangle$: $\operatorname{dist}_H(b_i, b_j) = \frac{1}{2}\bigl(K - \langle b_i, b_j \rangle\bigr)$, where $K$ is the length of the hash code. From the above equation, it can be concluded that the larger the inner product of the two hash codes, the smaller the Hamming distance and the more similar they are. Let $\Theta_{ij}$ represent half of the inner product between two binary codes $b_i, b_j \in \{-1,+1\}^{K}$: $\Theta_{ij} = \frac{1}{2} b_i^{\top} b_j$. DPSH [13] is a method for simultaneously learning feature representation and hash codes using the pairwise label likelihood function as the loss function, and the likelihood function is formulated as

$$p(s_{ij} \mid B) = \begin{cases} \sigma(\Theta_{ij}), & s_{ij} = 1, \\ 1 - \sigma(\Theta_{ij}), & s_{ij} = 0, \end{cases} \tag{1}$$

where $s_{ij}$ is the pairwise similarity label, $B$ is the set of all hash codes, and $\sigma$ is the sigmoid function: $\sigma(x) = 1/(1 + e^{-x})$. Two hash codes with a large inner product should have high similarity.
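The relation between Hamming distance and inner product can be checked numerically. The sketch below assumes hash codes are stored as Python lists with entries in {-1, +1}; the helper names `inner`, `hamming`, `hamming_from_inner`, and `theta` are illustrative and do not come from the paper.

```python
def inner(b1, b2):
    """Inner product of two codes with entries in {-1, +1}."""
    return sum(x * y for x, y in zip(b1, b2))


def hamming(b1, b2):
    """Hamming distance: number of positions where the codes differ."""
    return sum(1 for x, y in zip(b1, b2) if x != y)


def hamming_from_inner(b1, b2):
    """Hamming distance recovered from the inner product: (K - <b1, b2>) / 2."""
    return (len(b1) - inner(b1, b2)) / 2


def theta(b1, b2):
    """Half of the inner product between two codes."""
    return inner(b1, b2) / 2
```

Each agreeing position contributes +1 to the inner product and each differing position contributes -1, so the inner product equals K minus twice the Hamming distance, which is the identity used in the text.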
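The supervised information used by DPSH [13] is a pairwise label. Inspired by DPSH, this paper proposes a deep hashing method that uses the triplet label likelihood function. A set of triplet labels $T = \{(x_i, x_i^{+}, x_i^{-})\}_{i=1}^{M}$ is given. Suppose the triplet labels are conditionally independent of each other; then, according to Bayes' theorem with some prior $p(B)$, the posterior probability of $B$ can be computed as follows:

$$p(B \mid T) \propto p(T \mid B)\, p(B). \tag{2}$$

We can define the triplet label likelihood function as follows:

$$p(T \mid B) = \prod_{i=1}^{M} p\bigl((x_i, x_i^{+}, x_i^{-}) \mid B\bigr). \tag{3}$$

We can learn the optimal $B$ through maximum a posteriori estimation.
In order to solve $p\bigl((x_i, x_i^{+}, x_i^{-}) \mid B\bigr)$, according to $\operatorname{dist}_H(b_i, b_j) = \frac{1}{2}\bigl(K - \langle b_i, b_j \rangle\bigr)$, we can get

$$\operatorname{dist}_H(b_i, b_i^{-}) - \operatorname{dist}_H(b_i, b_i^{+}) = \Theta_i^{+} - \Theta_i^{-}, \tag{4}$$

where $\Theta_i^{+} = \frac{1}{2} b_i^{\top} b_i^{+}$ and $\Theta_i^{-} = \frac{1}{2} b_i^{\top} b_i^{-}$.
From equation (4), we can conclude that the smaller the distance between the anchor point and the positive point, and the greater the distance between the anchor point and the negative point, the larger the value of $\Theta_i^{+} - \Theta_i^{-}$.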
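The step behind equation (4) can be written out explicitly. The short derivation below is a sketch that assumes codes in $\{-1,+1\}^{K}$ and writes $b$, $b^{+}$, and $b^{-}$ for the anchor, positive, and negative codes, with $\Theta^{+} = \frac{1}{2} b^{\top} b^{+}$ and $\Theta^{-} = \frac{1}{2} b^{\top} b^{-}$; these symbol names are chosen here for illustration.

```latex
\begin{aligned}
\operatorname{dist}_H(b, b^{+}) &= \tfrac{1}{2}\bigl(K - b^{\top} b^{+}\bigr) = \tfrac{K}{2} - \Theta^{+}, \\
\operatorname{dist}_H(b, b^{-}) &= \tfrac{1}{2}\bigl(K - b^{\top} b^{-}\bigr) = \tfrac{K}{2} - \Theta^{-}, \\
\operatorname{dist}_H(b, b^{-}) - \operatorname{dist}_H(b, b^{+})
  &= \Bigl(\tfrac{K}{2} - \Theta^{-}\Bigr) - \Bigl(\tfrac{K}{2} - \Theta^{+}\Bigr)
   = \Theta^{+} - \Theta^{-}.
\end{aligned}
```

For example, with $K = 4$, $b = (1,1,1,1)$, $b^{+} = (1,1,1,-1)$, and $b^{-} = (-1,-1,-1,1)$, the Hamming distances are 1 and 3, while $\Theta^{+} = 1$ and $\Theta^{-} = -1$, so both sides equal 2.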
According to equation (4), we can make the following definition:

$$p\bigl((x_i, x_i^{+}, x_i^{-}) \mid B\bigr) = \sigma\bigl(\Theta_i^{+} - \Theta_i^{-} - \alpha\bigr), \tag{5}$$

where $\alpha$ is a hyperparameter whose role is to adjust the gap between $\Theta_i^{+}$ and $\Theta_i^{-}$. When we can conclude that and . The smaller the distance between the anchor point and the positive point, and the greater the distance between the anchor point and the negative point, the larger the value of the triplet likelihood function, and vice versa. This is consistent with the intention of our objective function.
By maximizing the triplet likelihood $p(T \mid B)$, we can enforce the Hamming distance between the anchor image and the positive image to be smaller than that between the anchor image and the negative image. Taking the negative log-likelihood of the triplet labels, the following optimization problem can be obtained:

$$\min_{B} \mathcal{L}_t = -\log p(T \mid B) = -\sum_{i=1}^{M} \log p\bigl((x_i, x_i^{+}, x_i^{-}) \mid B\bigr). \tag{6}$$

The above optimization problem minimizes the Hamming distance between the anchor point and the positive point and simultaneously maximizes the Hamming distance between the anchor point and the negative point. This exactly matches the goal of supervised hashing with the triplet label.
Substituting formula (5) into formula (6), the following formula can be obtained:

$$\mathcal{L}_t = -\sum_{i=1}^{M} \Bigl( \Theta_i^{+} - \Theta_i^{-} - \alpha - \log\bigl(1 + e^{\Theta_i^{+} - \Theta_i^{-} - \alpha}\bigr) \Bigr). \tag{7}$$
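As a concrete illustration of the triplet negative log-likelihood, the following sketch evaluates the loss for a list of index triplets. It assumes the likelihood of a triplet is the sigmoid of (half inner product with the positive) minus (half inner product with the negative) minus a margin `alpha`; the function names and the list-based data layout are illustrative assumptions, not the authors' code.

```python
import math


def half_inner(u, v):
    """Half of the inner product between two code vectors."""
    return 0.5 * sum(x * y for x, y in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def triplet_nll(codes, triplets, alpha):
    """Sum over triplets of -log sigmoid(theta_pos - theta_neg - alpha).

    codes:    list of code vectors (binary or relaxed real-valued)
    triplets: list of (anchor, positive, negative) index tuples
    alpha:    margin hyperparameter (assumed to enter as a subtraction)
    """
    total = 0.0
    for a, p, n in triplets:
        theta_pos = half_inner(codes[a], codes[p])
        theta_neg = half_inner(codes[a], codes[n])
        total -= log_sigmoid(theta_pos - theta_neg - alpha)
    return total
```

A triplet whose anchor is closer to its positive than to its negative yields a small loss, and swapping the positive and negative roles yields a much larger one, which is the behavior the objective is meant to enforce.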
A common problem with deep hashing methods is how to train the neural network to output binary codes [11, 20, 25, 26]. Minimizing the loss of equation (7) is a very difficult discrete optimization problem [30]. The usual method is to relax the problem from discrete to continuous. In the last layer of the neural network, we use the sign function as the activation function to obtain the binary code. However, the sign function is nondifferentiable at zero and its gradient is zero everywhere else, so the loss cannot be backpropagated through it. A good method is to relax the binary codes and add a quantization error term to the objective function during training. This method is also used in [25, 30].
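This method is described below: we use $u_n$ to denote the continuous output of the last layer before the sign function for the $n$th image. $b_n$ can be obtained by $b_n = \operatorname{sgn}(u_n)$. We relax $b_n$ to the continuous vectors $u_n$, and we redefine $\Theta$ as follows:

$$\Theta_i^{+} = \tfrac{1}{2} u_i^{\top} u_i^{+}, \qquad \Theta_i^{-} = \tfrac{1}{2} u_i^{\top} u_i^{-}. \tag{8}$$

Then, we approximate equation (7) as

$$\mathcal{L}_t = -\sum_{i=1}^{M} \Bigl( \Theta_i^{+} - \Theta_i^{-} - \alpha - \log\bigl(1 + e^{\Theta_i^{+} - \Theta_i^{-} - \alpha}\bigr) \Bigr) + \eta \sum_{n=1}^{N} \lVert b_n - u_n \rVert_2^2, \tag{9}$$

where $\sum_{n=1}^{N} \lVert b_n - u_n \rVert_2^2$ represents the quantization error term over the $N$ training images and $\eta$ is the hyperparameter used to balance the original objective function and the quantization error.
To integrate the above image feature representation part and the loss function part into a deep neural network framework, we set

$$u_n = M^{\top} \phi(x_n; \theta) + v, \tag{10}$$

where $\theta$ stands for all the parameters of the neural network except for the last layer, $\phi(x_n; \theta)$ represents the output of the neural network, $M$ represents a weight matrix, and $v$ is a bias term.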
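The relaxation with a quantization error term can be illustrated with a small sketch. It assumes the relaxed objective is the triplet negative log-likelihood evaluated on the continuous network outputs plus a weight `eta` times the squared distance between each output and its sign; the function names are illustrative, and mapping an output of exactly zero to +1 is an assumed convention.

```python
import math


def sgn(u):
    """Elementwise sign; mapping 0 to +1 is an assumed convention."""
    return [1 if x >= 0 else -1 for x in u]


def quantization_error(outputs):
    """Sum over images of the squared distance between sgn(u) and u."""
    return sum((b - x) ** 2 for u in outputs for b, x in zip(sgn(u), u))


def relaxed_triplet_loss(outputs, triplets, alpha, eta):
    """Triplet NLL on continuous outputs plus eta times the quantization error."""
    loss = 0.0
    for a, p, n in triplets:
        theta_pos = 0.5 * sum(x * y for x, y in zip(outputs[a], outputs[p]))
        theta_neg = 0.5 * sum(x * y for x, y in zip(outputs[a], outputs[n]))
        x = theta_pos - theta_neg - alpha
        # -log(sigmoid(x)) = log(1 + exp(-x)), computed in a stable form.
        loss += max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
    return loss + eta * quantization_error(outputs)
```

The triplet term is differentiable in the continuous outputs, so gradients can flow through the network, while the quantization term pulls each output toward a binary vector so that little is lost when the sign function is applied at retrieval time.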
5.2. Joint Learning with Linear Classification Quantization Loss
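Triplet label information was used to learn hash codes by equation (9), but the label information was not fully utilized. To fully utilize the label information, we jointly learn a linear classifier to further optimize the hash code. Inspired by the study in [31], we use the following classifier, which represents the relationship between the learned hash code and the label information:

$$y_n = W^{\top} b_n, \tag{11}$$

where $W \in \mathbb{R}^{K \times C}$ is the classifier weight and $y_n \in \mathbb{R}^{C}$ is the ground-truth label vector, in which $C$ is the number of categories in the dataset. We usually choose the $\ell_2$ loss for the linear classifier, and we define the loss function as follows:

$$\mathcal{L}_c = \sum_{n=1}^{N} \lVert y_n - W^{\top} b_n \rVert_2^2 + \lambda \lVert W \rVert_F^2, \tag{12}$$

where $\mathcal{L}_c$ is the linear classifier loss function, $\lambda$ is the regularization parameter, and $\lVert \cdot \rVert_F$ is the Frobenius norm of a matrix. Equations (9) and (12) are combined by weight parameters, and the following formula can be obtained:

$$\mathcal{L} = -\sum_{i=1}^{M} \Bigl( \Theta_i^{+} - \Theta_i^{-} - \alpha - \log\bigl(1 + e^{\Theta_i^{+} - \Theta_i^{-} - \alpha}\bigr) \Bigr) + \eta \sum_{n=1}^{N} \lVert b_n - u_n \rVert_2^2 + \mu \Bigl( \sum_{n=1}^{N} \lVert y_n - W^{\top} b_n \rVert_2^2 + \lambda \lVert W \rVert_F^2 \Bigr), \tag{13}$$

where $\lVert \cdot \rVert_2$ is the $\ell_2$ norm of a vector and $\mu$ is the tradeoff parameter used to balance the triplet likelihood term and the linear classification loss.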
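The linear classification term and its weighted combination with the other terms can be sketched as follows. The sketch assumes a squared-error classifier loss with a squared Frobenius-norm regularizer and a joint loss of the form triplet term + `eta` × quantization term + `mu` × classification term; the function names, argument names, and list-of-lists matrix layout are illustrative assumptions.

```python
def classification_loss(W, codes, label_vectors, lam):
    """Squared error between label vectors and W^T b, plus lam * ||W||_F^2.

    W:             K x C classifier weights as a list of K rows
    codes:         list of N hash codes, each of length K
    label_vectors: list of N label vectors, each of length C
    lam:           regularization weight
    """
    K, C = len(W), len(W[0])
    fit = 0.0
    for b, y in zip(codes, label_vectors):
        # Prediction W^T b has one entry per category.
        pred = [sum(W[k][c] * b[k] for k in range(K)) for c in range(C)]
        fit += sum((yc - pc) ** 2 for yc, pc in zip(y, pred))
    reg = lam * sum(w * w for row in W for w in row)
    return fit + reg


def joint_loss(triplet_term, quantization_term, classification_term, eta, mu):
    """Weighted combination of the three terms (weights are hyperparameters)."""
    return triplet_term + eta * quantization_term + mu * classification_term
```

Keeping the three terms separate in this way makes it easy to monitor each one during training and to see how the weights trade them off against each other.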
5.3. Optimization
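Equation (13) is a joint loss optimization problem, which is nonconvex and difficult to solve. Here, we adopt the discrete cyclic coordinate descent (DCC) optimization method. Equation (13) can be decomposed into two suboptimization problems, and the linear classification loss can be solved iteratively by alternating minimization. For equation (12), the linear classification loss can be rewritten as

$$\mathcal{L}_c = \lVert Y - W^{\top} B \rVert_F^2 + \lambda \lVert W \rVert_F^2, \tag{14}$$

where $Y = [y_1, \ldots, y_N]$ and $B = [b_1, \ldots, b_N]$.
By fixing $B$ and using the matrix trace function $\operatorname{Tr}(\cdot)$, the following formula is obtained:

$$\mathcal{L}_c = \operatorname{Tr}\bigl((Y - W^{\top} B)^{\top} (Y - W^{\top} B)\bigr) + \lambda \operatorname{Tr}(W^{\top} W). \tag{15}$$

Taking the derivative of $\mathcal{L}_c$ with respect to $W$ and setting $\partial \mathcal{L}_c / \partial W = 0$, we get the minimizer of $\mathcal{L}_c$:

$$W = (B B^{\top} + \lambda I)^{-1} B Y^{\top}. \tag{16}$$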
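Once we have solved for $W$, we treat it as a constant matrix. By fixing $W$, equation (14) becomes

$$\min_{B} \; \lVert Y \rVert_F^2 - 2 \operatorname{Tr}(Y^{\top} W^{\top} B) + \lVert W^{\top} B \rVert_F^2, \quad \text{s.t. } B \in \{-1,+1\}^{K \times N}. \tag{17}$$

We can get a closed-form solution for one row of $B$ by fixing the other rows. We use the discrete cyclic coordinate descent method to iteratively solve $B$ row by row. Let $z^{\top}$ be the $k$th row of $B$, $k = 1, \ldots, K$ ($K$ is the length of the hash code), and $B'$ the matrix of $B$ excluding $z$. Let $w^{\top}$ be the $k$th row of $W$ and $W'$ the matrix of $W$ excluding $w$. The third term in equation (17) can be solved as follows:

$$\lVert W^{\top} B \rVert_F^2 = \lVert W'^{\top} B' + w z^{\top} \rVert_F^2 = \text{const} + 2\, w^{\top} W'^{\top} B' z, \tag{18}$$

since $\lVert w z^{\top} \rVert_F^2 = N \lVert w \rVert_2^2$ does not depend on $z$.
Similarly, let $Q = WY$, and let $q^{\top}$ be the $k$th row of $Q$ and $Q'$ the matrix of $Q$ excluding $q$. The second term in equation (17) can be solved as follows:

$$\operatorname{Tr}(Y^{\top} W^{\top} B) = \operatorname{Tr}(B^{\top} Q) = \text{const} + q^{\top} z. \tag{19}$$

Putting equations (17), (18), and (19) together, the problem for a single row becomes

$$\min_{z} \; \bigl(w^{\top} W'^{\top} B' - q^{\top}\bigr) z, \quad \text{s.t. } z \in \{-1,+1\}^{N}, \tag{20}$$

which has the optimal solution

$$z = \operatorname{sgn}\bigl(q - B'^{\top} W' w\bigr). \tag{21}$$

According to equation (21), each bit $z$ is computed from the $K - 1$ bits $B'$ that have already been learned. We can iteratively update each bit until the procedure converges to a better set of hash codes $B$.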
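A single row update of discrete cyclic coordinate descent can be sketched as below. The sketch assumes the subproblem is to minimize the squared error between a C × N label matrix `Y` and the product of the transposed K × C classifier `W` with the K × N code matrix `B`, and that the update for row k is the sign of (row k of W·Y) minus the contribution of the other rows; the function names and the list-of-lists layout are illustrative assumptions, not the authors' implementation.

```python
def fit_error(B, W, Y):
    """Squared Frobenius error between Y (C x N) and W^T B."""
    K, N, C = len(B), len(B[0]), len(W[0])
    err = 0.0
    for j in range(N):
        for c in range(C):
            pred = sum(W[k][c] * B[k][j] for k in range(K))
            err += (Y[c][j] - pred) ** 2
    return err


def dcc_update_row(B, W, Y, k):
    """Return the optimal k-th row of B with all other rows held fixed."""
    K, N, C = len(B), len(B[0]), len(W[0])
    w = W[k]
    # q is the k-th row of Q = W Y.
    q = [sum(w[c] * Y[c][j] for c in range(C)) for j in range(N)]
    # Inner products between every other row of W and w.
    cross = {l: sum(W[l][c] * w[c] for c in range(C))
             for l in range(K) if l != k}
    new_row = []
    for j in range(N):
        # Contribution of the other bits of sample j.
        other = sum(B[l][j] * cross[l] for l in cross)
        new_row.append(1 if q[j] - other >= 0 else -1)
    return new_row
```

Cycling this update over k = 0, ..., K-1 and repeating until no row changes gives one way to run the row-by-row procedure; each update cannot increase `fit_error`, because it solves the single-row problem exactly.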
The proposed method JLTDH is briefly summarized in Algorithm 1.

6. Experiment and Analysis
In this section, we describe our experiments and results. Three commonly used datasets are used to verify the effectiveness of our algorithm. We calculated the precision and mean average precision (MAP) of the retrieval results and report the performance of our method on CIFAR10, NUSWIDE, and MSCOCO. Specifically, given an anchor $x_q$, we can calculate its average precision (AP) using the following equation:

$$AP(x_q) = \frac{1}{R_q} \sum_{k=1}^{n} P_q(k)\, \delta_q(k), \tag{22}$$

where $n$ is the number of returned samples, $R_q$ is the number of relevant samples, $P_q(k)$ is the precision at cutoff $k$ in the returned sample list, and $\delta_q(k)$ is an indicator function which equals 1 if the $k$th returned sample is a ground-truth neighbor of $x_q$ and 0 otherwise. Given $Q$ queries, MAP is the mean of the AP over all queries; we can compute the MAP as follows:

$$MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP(x_q). \tag{23}$$
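The AP and MAP computation can be sketched as follows. The sketch assumes each query is summarized by a ranked list of 0/1 relevance flags for its returned samples and that the number of relevant samples is counted within that returned list; the function names are illustrative.

```python
def average_precision(relevance):
    """AP for one query from a ranked list of 0/1 relevance flags."""
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            # Precision at cutoff k, counted only at relevant positions.
            total += hits / k
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """MAP: mean of the per-query AP values."""
    if not relevance_lists:
        return 0.0
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)
```

To evaluate on the top 5,000 returned neighbors, for instance, each relevance list would be truncated to its first 5,000 entries before calling `mean_average_precision`.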
6.1. Experimental Settings
Our server configuration is as follows: the CPU is an Intel Xeon CPU E5-2687W @ 3.0 GHz with 12 cores, the GPU is an NVIDIA GTX 1080 with 8 GB of memory, and the RAM is 32 GB. The operating system is Ubuntu 16.04, and the deep learning framework is PyTorch [32].
We use three widely used benchmark datasets of different scales: CIFAR10 [33], NUSWIDE [34], and MSCOCO [35]. The CIFAR10 dataset contains 60,000 images (50,000 training images and 10,000 test images) belonging to 10 categories. The size of each image is $32 \times 32$ pixels. We randomly selected 5,000 images (500 for each class) as the training set and 1,000 images (100 for each class) as the test query set.
The NUSWIDE dataset contains 269,648 images in 81 categories. We used the 21 most commonly used categories. We randomly selected 2,100 images (100 images per class) as the query point and randomly selected 10,500 images (500 images per class) as the training set.
MSCOCO is an image dataset widely used for image recognition, segmentation, and captioning. It contains 82,783 training images and 40,504 validation images, in which each image is labeled with some of the 80 semantic concepts. We randomly selected 5,000 images as the query set, used the remaining images as the database, and randomly sampled 10,000 images from the database for training. Table 3 shows some sample images from the three datasets. Table 4 shows the dataset settings used in our experiment.
We compared our method with several representative hashing methods in terms of MAP; the comparison methods are divided into two groups: traditional hashing methods and deep hashing methods. Traditional unsupervised hashing methods include SH [36] and ITQ [3]. Traditional supervised hashing methods include KSH [5], FastH [8], SDH [9], SPLH [37], and LFH [19]. The deep hashing methods include DSRH [30], DSCH [38], DRSCH [38], CNNH [18], NINH [12], DPSH [13], DHN [14], DQN [39], DTSH [40], VDSH [41], and DSDH [42]. In this paper, we also emphasize the comparison with deep hashing methods that use triplet labels, including DSRH [37], DSCH [38], DRSCH [38], and NINH [12]. To be fair, some results are taken directly from previously published papers. Following [41], CNN features on CIFAR10 were extracted using the pretrained CNN-F model. The hyperparameters of our method are set according to the standard cross-validation procedure. In equation (12), $\lambda$ is set to 0.1, $\mu$ is 1, $\eta$ is 55, and $\alpha$ is half the length of the hash code.
6.2. Empirical Analysis
6.2.1. Comparison to Other Deep Methods and Nondeep Methods
As shown in Table 5 and Figure 2, the MAP is calculated based on the top 5,000 returned neighbors. NINH, CNNH, KSH, and ITQ results were obtained from the study in [11, 18]. Results of other methods were obtained from the study in [42]. Our method performs better on these three datasets than the existing hashing methods; compared with non-deep methods, the improvement is significant. Compared with the current best deep learning methods, our method further improves performance by 2–6%. At the same time, we found that our method performed better with shorter hash codes. Most deep hashing methods have significant performance advantages over the non-deep methods.
MAP results for different numbers of bits on the MSCOCO dataset are shown in Table 6; except for our method, the results were obtained from the study in [29]. In order to be consistent with the settings in [29], we set the hash code length to 8, 16, 24, and 32 bits. The images in MSCOCO are more complex, which makes classification more difficult and may lead to inaccurate feature extraction and classification results. As can be seen from Table 5, the performance on the MSCOCO dataset decreased to a certain extent compared with the MAP results on the NUSWIDE dataset. In spite of this, our method is still much better than the comparison methods.
Our method can achieve excellent performance under shorter hash codes. At the same time, we also discussed the performance change under the long hash code. Since the comparison method does not provide the results of long hash codes, we only discuss our own method here. As can be seen from Figure 3, CIFAR10 gets a good MAP value at 32 bits, while NUSWIDE gets a good MAP value between 48 and 64 bits. With the growth of hash code length, the MAP of CIFAR10 is decreasing, while the MAP of NUSWIDE is not decreasing, but there is no obvious increase.
6.2.2. Comparison to Nondeep Hashing Methods Using Deep Features
One might argue that the performance improvements come from the neural network rather than from our approach. In order to further verify our method, we compared it with traditional hashing methods that use deep features extracted by the CNN-F network [43]. As shown in Figure 4, our method is significantly superior to the traditional methods. It should be noted that most of the traditional methods have not been tested on MSCOCO, so the corresponding data are not available, and we did not perform this comparison on MSCOCO.
6.2.3. Comparison to Hashing Methods with Triplet Labels
Because this paper adopts triplet labels, we further compare our method with other deep hashing methods that use triplet labels. These methods include DSRH [37], DSCH [38], DRSCH [38], and NINH [12]. The results of DSRH, DSCH, and DRSCH were adopted from the study in [38]. As shown in Figure 5, our method leads the previous triplet-label-based deep hashing methods by a wide margin.
6.3. Sensitivity to Hyperparameters
The most important hyperparameters for JLTDH are $\mu$ and $\eta$. $\mu$ is the tradeoff parameter used to balance the triplet likelihood term and the linear classification loss. $\eta$ is used to balance the likelihood term and the quantization error. We explore the influence of these two hyperparameters.
We report the MAP values for different from the range of [0.1, 200] on two datasets with the code length being 12 bits and 32 bits. We can find that JLTDH is not sensitive to in a large range when . According to Figure 6(a), we found that when , CIFAR10 can obtain the best MAP performance. As shown in Figure 6(b), when , NUSWIDE can achieve better MAP performance.
Furthermore, we present the influence of $\eta$ in Figures 7(a) and 7(b). As can be seen, there is a wide range of $\eta$ in which our method performs well. When hash code = 12 bits, MAP accuracy is better on both datasets with , and when hash code = 32 bits, MAP accuracy is better on both datasets with . Other hyperparameter settings are obtained through cross-validation: $\lambda$ is set to 0.1, and $\alpha$ is half the length of the hash code.
6.4. Analysis of Three Loss Functions
By observing the convergence of the loss function, we can judge whether the selected loss function is reasonable or not. Figure 8 shows the change of three kinds of losses for different lengths of hash codes during training. We only take CIFAR10 as an example; the results for the other two datasets are similar. It is reasonable to conclude that the joint loss combining triplet likelihood loss and linear classification quantization loss successfully optimizes the loss during training. Figures 8(a) and 8(b) are similar: they show that the joint loss and triplet likelihood loss converge rapidly and are kept at a low level for different bits, that our method is reasonable and effective, and that the optimization process only needs a few iterations.
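Figure 8: Training losses on CIFAR10 for different hash code lengths: (a) joint loss; (b) triplet likelihood loss; (c) linear classification loss.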
To further reveal the respective influence of the two parts of the joint loss function, we show the loss curves of the triplet likelihood loss and the linear classification loss in Figures 8(b) and 8(c), respectively. The magnitude of the loss value in Figure 8(c) is about of that in Figure 8(b). This is because the triplet likelihood loss optimizes the first stage and plays the major role in training, whereas the linear classification loss optimizes the second stage, which fine-tunes the result of the first stage.
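The interplay of the two parts of the joint loss can be illustrated with a minimal sketch. The exact formulation is given earlier in the paper; the forms below are assumptions chosen for illustration: a triplet negative log-likelihood with inner-product similarity and a margin, a squared quantization error, and a squared-error linear classifier. The names `margin`, `quant_weight`, `cls_weight`, and `W` are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triplet_nll(q, p, n, margin):
    """-log sigmoid(theta_qp - theta_qn - margin), where theta_xy is
    assumed to be 0.5 * <x, y> (query q, positive p, negative n)."""
    theta_qp = 0.5 * dot(q, p)
    theta_qn = 0.5 * dot(q, n)
    return softplus(-(theta_qp - theta_qn - margin))


def quantization_error(u):
    """Squared distance between a real-valued output and its sign code."""
    return sum((x - (1.0 if x >= 0 else -1.0)) ** 2 for x in u)


def classification_loss(u, onehot, W):
    """Squared error of a linear classifier W (classes x bits) on output u."""
    return sum((y - dot(row, u)) ** 2 for y, row in zip(onehot, W))


def joint_loss(q, p, n, labels, W, margin, quant_weight, cls_weight):
    """labels = (y_q, y_p, y_n) as one-hot vectors; weights are hypothetical."""
    outputs = (q, p, n)
    loss = triplet_nll(q, p, n, margin)
    loss += quant_weight * sum(quantization_error(u) for u in outputs)
    loss += cls_weight * sum(classification_loss(u, y, W)
                             for u, y in zip(outputs, labels))
    return loss
```

In this sketch, setting `cls_weight` to zero leaves only the triplet and quantization terms, which corresponds to a triplet-only objective.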
6.4.1. Ablation Study about Loss Function
To confirm the contribution of each loss to the final performance, we compare JLTDH with a variant, JLTDHT, whose loss function contains only the triplet loss and no linear classification loss. As can be seen from Figure 9, on the CIFAR10 and NUSWIDE datasets, JLTDHT achieves good MAP performance with the triplet loss alone. With the further optimization provided by the linear classification loss, JLTDH achieves better MAP performance: on average, it is about 2% ahead of JLTDHT.
6.5. Visualization of Query Results
To give a better understanding of the performance improvement of the proposed method, Figure 10 shows the top 12 returned images for three query images, one from each of the three datasets CIFAR10, NUSWIDE, and MSCOCO, with a hash code length of 32 bits. The results show that the proposed method preserves the features of an image in its hash code. For the query image in CIFAR10, only one of the returned results is wrong, and the wrong image appears at the bottom of the ranked list. Similarly, for the query image in NUSWIDE, only 1 of the top 12 images is incorrect, and the top 12 images have a shape or color similar to that of the query image. For the query image in MSCOCO, ten of the top 12 images are correct; compared with the previous two datasets, the query accuracy is slightly lower. A possible reason is that MSCOCO images contain multiple objects. Overall, these examples show that our method can provide the desired search results.
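Retrieving the top 12 images for a query amounts to ranking the database by Hamming distance to the query code. The following is a minimal sketch of that step, not the implementation used in this paper; the function names are hypothetical, and codes are assumed to be lists of ±1 values that are packed into integers so that XOR exposes the differing bits.

```python
def pack(code):
    """Pack a +1/-1 hash code into an int (bit 1 for +1, bit 0 for -1)."""
    value = 0
    for bit in code:
        value = (value << 1) | (1 if bit > 0 else 0)
    return value


def top_k(query_code, db_codes, k=12):
    """Indices of the k database codes closest to the query in Hamming
    distance; ties are broken by database index."""
    q = pack(query_code)
    dists = [bin(q ^ pack(c)).count("1") for c in db_codes]
    return sorted(range(len(db_codes)), key=lambda i: (dists[i], i))[:k]
```

In practice the database codes would be packed once and reused across queries rather than repacked on every call.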
7. Conclusion
In this work, we propose a triplet deep hashing method with joint supervised loss based on the convolutional neural network (JLTDH). To fully utilize the supervised triplet information, we propose a joint loss function that combines two kinds of supervised loss: the triplet negative log-likelihood loss and the linear classification loss. To overcome the cubic increase in the number of triplets and make triplet training more effective, we also design a triplet selection method. The whole process is divided into two stages. In the first stage, the last layer of the network outputs a preliminary hash code. In the second stage, relying on the joint loss function and the backpropagation algorithm, the parameters of the neural network are further updated so that the generated hash code has higher query precision. We perform extensive experiments on the three public benchmark datasets CIFAR10, NUSWIDE, and MSCOCO. Experimental results demonstrate that the proposed method outperforms the compared methods and is also superior to all previous deep hashing methods based on the triplet label.
Data Availability
Data are owned by a third party: the experimental part of this paper uses three widely used public datasets (CIFAR10, MSCOCO, and NUSWIDE), which can be publicly accessed. CIFAR10 can be accessed at http://www.cs.toronto.edu/~kriz/cifar.html, MSCOCO at http://mscoco.org, and NUSWIDE at http://lms.comp.nus.edu.sg/research/NUSWIDE.htm. The implementation code of the algorithm in this paper is written in PyTorch. The code can be obtained from Mingyong Li upon request at limingyong@cqnu.edu.cn.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the science and technology project of Chongqing Education Commission of China (No. KJ1600332), the humanities and social science project of Ministry of Education (18YJA880061), the humanities and social science project of Chongqing Education Commission of China (No. 19SKGH035), Chongqing Education Scientific Planning Project (No. 2017GX116), and Chongqing Graduate Education Reform Project (No. yjg193093).