Abstract

In recent years, hash learning has received increasing attention in supervised video retrieval. However, most existing supervised video hashing approaches design hash functions based on pairwise similarity or triplet relationships and focus only on local information, which results in low retrieval accuracy. In this work, we propose a novel supervised framework called discriminative codebook hashing (DCH) for large-scale video retrieval. The proposed DCH encourages samples within the same category to converge to the same code word and maximizes the mutual distances among different categories. Specifically, we first construct the discriminative codebook via a predefined distance among code words and a Bernoulli distribution over each hash bit. Then, we use the composite Kullback–Leibler (KL) divergence to align the neighborhood structures between the high-dimensional feature space and the Hamming space. The proposed DCH is optimized via the gradient descent algorithm. Experimental results on three widely used video datasets verify that the proposed DCH performs better than several state-of-the-art methods.

1. Introduction

With the rapid proliferation of smartphones, the amount of video data has grown explosively [1–3]. For example, TikTok has over 400 million daily active users, who upload approximately 2,000 videos every minute, and YouTube receives a total of 100 hours of video per minute [4–6]. Owing to the economical storage and computational efficiency of binary codes, hash-based methods have been widely applied to visual retrieval tasks [7–13].

Previous hash-related work [14] mainly focused on image hashing and can be divided into data-independent and data-dependent methods. Data-independent approaches learn binary codes not from the data themselves but through random space projections. The most representative algorithm is locality-sensitive hashing (LSH) [15], which generates considerable redundant information through random mapping and therefore achieves satisfactory performance only with long hash codes. Data-dependent hash methods [16–18], which can be further divided into unsupervised and supervised hashing, generate more efficient hash codes by preserving the neighborhood structure of the data. For example, Gong et al. [19] proposed iterative quantization hashing (ITQ), which minimizes the quantization error by rotating the principal component analysis (PCA)-projected data. Spectral hashing (SH) [20] assumes that the data obey a uniform distribution and partitions the data along the principal directions of the data manifold. Density-sensitive hashing (DSH) [21] extends LSH by exploiting structural information. Zhang et al. [22] developed a convergence-preserving parametric learning algorithm, called latent factor hashing (LFH), to learn similarity-preserving binary codes based on latent factor models. Liu et al. [23] proposed kernel supervised hashing (KSH), which applies kernel-based formulations to accommodate linearly inseparable data and designs a greedy algorithm to solve the hash function optimization problem.
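To make the data-independent paradigm concrete, the following minimal sketch (ours, not the code of [15]; all names are illustrative) hashes feature vectors with LSH-style random hyperplane projections:

```python
import numpy as np

def lsh_codes(X, n_bits, seed=0):
    """Data-independent hashing: sign of random projections.
    X is an (n_samples, d) feature matrix."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))  # random hyperplanes
    return (X @ W > 0).astype(np.uint8)            # one bit per hyperplane

# Toy usage: near-duplicate features agree on most bits.
X = np.array([[1.0] * 8, [1.01] * 8, [-1.0] * 8])
codes = lsh_codes(X, n_bits=16)
print((codes[0] ^ codes[1]).sum(), (codes[0] ^ codes[2]).sum())  # small vs. large
```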

In recent years, hashing methods for video retrieval have also received extensive attention [24–31]; they fall into two categories: traditional machine learning methods and deep hashing. Machine learning methods, resembling image hashing approaches, learn binary codes of video keyframes from low-level handcrafted features and then compute video hash codes via averaging. Wu et al. [4] applied hashing to video retrieval using color histograms as global features; this was the first application of hash learning in the video field. Multiple-feature hashing (MFH) [32] adopts a weight-based scheme to combine different features. Ye et al. [33] used video structural information in a supervised learning paradigm to obtain optimal binary codes. Stochastic multiview hashing (SMVH) [34] separately calculates the probability similarity matrices of video frames in the feature space and in the Hamming space and then minimizes the difference between the two matrices using the KL divergence. Nie et al. [35] defined joint multiview hashing (JMVH), which maximizes the interclass distance and minimizes the intraclass distance to preserve both the global and local structures of multiple features. Boosting temporal video hashing (BTVH) [36] studies the multitable learning problem to boost performance and captures the inherent similarity of videos from both visual and temporal perspectives. In addition, some researchers have recently used deep networks to capture the temporal and spatial information between keyframes. For instance, central similarity quantization (CSQ) [37] learns temporal information with 3D convolutional neural networks and introduces the concept of hash centers to enhance central similarity.

However, most existing video hashing approaches suffer from the following problems. (1) Low discriminability among different categories: hash functions based on pairwise similarity or triplet relationships consider only local information, which preserves the information of similar samples well but performs poorly in distinguishing samples from different categories. (2) Poor performance in real-world scenarios: in real applications, similar data often account for only a small proportion, and most samples are dissimilar, which leads to low efficiency when the data are imbalanced [37]. (3) High time cost of deep learning: deep learning frameworks are time-consuming to train, and the spatiotemporal information extracted by the network does not bring significant performance gains. Hence, these video hashing functions cannot learn discriminative hash codes to enhance retrieval performance.

To solve the above problems, in this work, we propose a novel framework for supervised video retrieval, called discriminative codebook hashing, which considers the global structure when constructing the hash function. DCH encourages samples within the same category to converge to an identical code word and maximizes the mutual distances between different categories. Specifically, the discriminative codebook is first generated according to two properties: a predefined distance between code words and a Bernoulli distribution that ensures each hash bit stores more information. Then, to preserve the similarity structure between the feature space and the Hamming space, the composite KL divergence is adopted. Finally, the gradient descent algorithm is used to optimize the model. In this way, we can obtain discriminative binary codes for video retrieval. Figure 1 shows the framework of DCH. The main contributions of this work are as follows:
(i) We propose a discriminative codebook based on a predefined distance between code words and a Bernoulli distribution that ensures each hash bit stores more information.
(ii) We propose the DCH method, which maximizes the distances between the code words of the predefined codebook to learn discriminative binary codes for supervised video retrieval.
(iii) We evaluate the proposed method on three widely used datasets, and the results show that DCH achieves significant improvements over several state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 introduces preliminary work. Section 3 describes the proposed discriminative codebook hashing in detail. The experimental work is presented in Section 4, and the conclusion is given in Section 5.

2. Preliminary Work

In this section, we briefly introduce the preliminary work, namely, stochastic multiview hashing (SMVH) [34]. It is a supervised video retrieval method that aims to preserve the similarity structure from the original space to the Hamming space.

Let $V = \{v_1, v_2, \ldots, v_{N_v}\}$ be the video set, where $v_j$ indicates the $j$-th video of $V$ and $N_v$ is the number of videos. $H = \{h_1, h_2, \ldots, h_{N_v}\}$ is the hash code set of the videos, where $h_j \in \{0,1\}^L$ is the $L$-bit binary code transformed from $v_j$. The video features are extracted based on the set of keyframe features $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $x_i \in \mathbb{R}^d$, $n$ is the number of keyframes, and $d$ is the dimension of each keyframe. $B = [b_1, b_2, \ldots, b_n]$ represents the corresponding binary codes of the keyframes, where $b_i \in \{0,1\}^L$. The conversion relationships between the above variables are formulated as

$$y_i = \sigma\!\left(W^{\top} x_i + c\right), \tag{1}$$

$$b_i = \Theta(y_i), \tag{2}$$

$$h_j = \Theta\!\left(\frac{1}{|F_j|} \sum_{x_i \in F_j} y_i\right), \tag{3}$$

where $y_i$ is the intermediate result of the linear projection, $c$ is a bias parameter, $W \in \mathbb{R}^{d \times L}$ is the projection matrix, $F_j$ is the set of frames of video $v_j$, and $|F_j|$ is the number of samples in the set. The high-dimensional keyframe feature matrix $X$ is first projected into the lower-dimensional matrix $Y$. Then, the sigmoid function $\sigma(\cdot)$ is used to map the variables between $0$ and $1$. Finally, a thresholding function $\Theta(\cdot)$ is used to change the data into binary codes, with $b_{ik} = 1$ if $y_{ik} \geq 0.5$ and $b_{ik} = 0$ otherwise.
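As a rough illustration of equations (1)–(3), the following NumPy sketch (our own; the assumption that the video code averages the relaxed frame codes before thresholding is our reading of equation (3)) maps keyframe features to keyframe and video binary codes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def frame_codes(X, W, c):
    """Eqs. (1)-(2): linear projection, sigmoid to (0, 1), threshold at 0.5.
    X: d x n keyframe features; W: d x L projection; c: L-dim bias."""
    Y = sigmoid(W.T @ X + c[:, None])        # L x n relaxed codes
    B = (Y >= 0.5).astype(np.uint8)          # L x n binary keyframe codes
    return B, Y

def video_code(Y):
    """Eq. (3): average the relaxed frame codes of one video, then threshold."""
    return (Y.mean(axis=1) >= 0.5).astype(np.uint8)

# Toy usage: one video with n = 10 frames, d = 4096, L = 32.
rng = np.random.default_rng(0)
X = rng.standard_normal((4096, 10))
W = 0.01 * rng.standard_normal((4096, 32))
B, Y = frame_codes(X, W, rng.standard_normal(32))
h = video_code(Y)
```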

SMVH preserves the similarity structure between the feature space and the Hamming space using a composite KL divergence measure. In particular, it separately calculates the similarity probability matrix $P$ in the original space and the pairwise similarity matrices among samples in the Hamming space. Then, the KL divergence is used to examine how well these probability matrices match. Therefore, the objective function of SMVH is defined as follows:

$$\min_{W,\, c}\ \mathcal{L}_{\mathrm{CKL}} + \lambda \|W\|_F^2, \tag{4}$$

where $\lambda$ controls the weight of the regularization term to prevent overfitting and $\mathcal{L}_{\mathrm{CKL}}$ is the composite KL divergence. The latter can be represented as

$$\mathcal{L}_{\mathrm{CKL}} = \mathrm{KL}(P \,\|\, Q) + \beta\, \mathrm{KL}(P \,\|\, \hat{Q}), \tag{5}$$

where $\beta$ controls the influence of the composite KL divergence, $Q$ is the similarity structure based on the relaxed codes $Y$, and $\hat{Q}$ is another probability matrix preserving the similarity information of $B$ in the Hamming space. In addition, the KL divergence is defined as follows:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}, \tag{6}$$

where $p_{j|i}$ is a conditional probability that reflects the similarity between samples $x_i$ and $x_j$, and the conditional probability $q_{j|i}$ represents the probability of returning $x_j$ given the query $x_i$.
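The following sketch (ours; the Gaussian affinity for $P$ is an assumed choice, not necessarily the one used in [34]) shows how the conditional probability matrices and the KL term of equation (6) can be computed:

```python
import numpy as np

def sq_dists(Z):
    """Pairwise squared Euclidean distances between rows of Z."""
    s = (Z * Z).sum(axis=1)
    return np.maximum(s[:, None] + s[None, :] - 2.0 * Z @ Z.T, 0.0)

def cond_probs(Z, sigma=1.0):
    """Row-normalized Gaussian affinities: p_{j|i} ~ exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    A = np.exp(-sq_dists(Z) / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1, keepdims=True)

def kl_div(P, Q, eps=1e-12):
    """Eq. (6): sum_i sum_j p_{j|i} * log(p_{j|i} / q_{j|i})."""
    return float((P * (np.log(P + eps) - np.log(Q + eps))).sum())

# Composite KL of eq. (5): P from features, Q from relaxed codes Y,
# Q_hat from binary codes B (treated as real vectors), weighted by beta.
```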

3. Discriminative Codebook Hashing

In this section, we present the proposed DCH in detail through four parts, including the proposed discriminative codebook, the objective function, algorithmic optimization, and complexity analysis.

3.1. Discriminative Codebook

Motivated by CSQ [37], we propose a novel discriminative codebook $C = \{c_1, c_2, \ldots, c_K\} \in \{0,1\}^{K \times L}$ for supervised video retrieval, where $c_k$ is the code word of the $k$-th category. The proposed codebook is defined according to two properties. The first is that the values in the same bit position of different code words obey a Bernoulli distribution. Specifically, the proportions of $0$ and $1$ in the same bit across different categories are both $1/2$; that is, each bit has a probability of $1/2$ of being $0$ or $1$, which maximizes the entropy and lets each bit store more information. The other is that the mutual distances among code words satisfy

$$d_H(c_i, c_j) \geq \left(\frac{1}{2} - \epsilon\right) L, \quad \forall\, i \neq j, \tag{7}$$

where $d_H(c_i, c_j)$ is the Hamming distance between code words $c_i$ and $c_j$, $L$ is the length of the binary codes, and $\epsilon$ represents the fault tolerance. Under the constraint in equation (7), the mutual distances between code words are made as large as possible.

Overall, the proposed codebook encourages samples within the same category to converge to the same codeword and maximizes the mutual distance between different categories. Therefore, the proposed codebook can preserve global structures and help generate discriminative binary codes for video retrieval. The scheme of the proposed discriminative codebook is presented in Algorithm 1.

Input: the number of categories $K$; the number of samples per category $m$; code length $L$; maximum number of iterations $T$; fault tolerance rate $\epsilon$.
Output: codebook $C \in \{0,1\}^{K \times L}$.
(1) for iteration $t = 1 : T$
(2)  for category $k = 1 : K$
(3)   $c_k[\text{a random half of the coordinates}] = 1$
(4)   $c_k[\text{the rest of the coordinates}] = 0$
(5)  end
(6)  if any two rows of $C$ satisfy equation (7)
(7)   break
(8)  end
(9) end
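A runnable sketch of Algorithm 1 (our own rendering; the acceptance threshold $(1/2 - \epsilon)L$ is our reading of equation (7), and all names are illustrative) is given below:

```python
import numpy as np

def build_codebook(K, L, T=1000, eps=0.1, seed=0):
    """Algorithm 1 sketch: resample K codewords, each with exactly L/2 ones
    (the Bernoulli property), until every pair of rows satisfies eq. (7)."""
    rng = np.random.default_rng(seed)
    for _ in range(T):
        C = np.zeros((K, L), dtype=np.uint8)
        for k in range(K):
            half = rng.choice(L, size=L // 2, replace=False)
            C[k, half] = 1                           # random half coordinates -> 1
        D = (C[:, None, :] ^ C[None, :, :]).sum(-1)  # pairwise Hamming distances
        np.fill_diagonal(D, L)                       # ignore self-distances
        if D.min() >= (0.5 - eps) * L:               # all pairs far enough apart
            return C
    raise RuntimeError("no valid codebook found within T iterations")

C = build_codebook(K=10, L=64)
```
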
3.2. Objective Function

According to the proposed discriminative codebook $C$, we expand each row of the codebook matrix into $\tilde{C} \in \{0,1\}^{n \times L}$ according to the number of samples, where the $i$-th row $\tilde{c}_i$ is the code word of the category that sample $x_i$ belongs to. The detailed generation process of $\tilde{C}$ is shown in Algorithm 2. We minimize the error between the binary codes and the predefined codebook as

$$\min_{W,\, c}\ \sum_{i=1}^{n} \left\| b_i - \tilde{c}_i \right\|_2^2. \tag{8}$$

Input: training data $X$; codebook $C$; maximum number of iterations $T$; code length $L$; parameters $\lambda$, $\beta$, $\gamma$; learning rate $\eta$.
Output: hash codes $H$.
(1) Initialization: initialize the projection matrix $W$ and the bias $c$ as a random matrix and a random vector.
(2) Generating $\tilde{C}$ according to the number of samples:
(3) for category $k = 1 : K$
(4)  set the rows of $\tilde{C}$ corresponding to the samples of category $k$ to the code word $c_k$
(5) end
(6) Gradient descent:
(7) for iteration $t = 1 : T$
(8)  $W$-Step: $W \leftarrow W - \eta\, \partial \mathcal{L} / \partial W$
(9)  $c$-Step: $c \leftarrow c - \eta\, \partial \mathcal{L} / \partial c$
(10) end
(11) Video binary code computation: video hash codes $H$ are obtained by equations (1)–(3).

Specifically, for each sample $x_i$, we take $\tilde{c}_i$ as its target code word so that samples in the same category share the same code word and samples in different categories have discriminative binary codes.
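For illustration, the expansion from $C$ to $\tilde{C}$ amounts to a label-indexed lookup (a sketch with our own naming):

```python
import numpy as np

def expand_codebook(C, labels):
    """Build C~: row i of the result is the code word of sample i's category."""
    return C[np.asarray(labels)]   # (n, L) from the (K, L) codebook

# e.g., labels = [0, 0, 2, 1] gives rows c_0, c_0, c_2, c_1 of C.
```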

To preserve the similarity structure between the feature space and the Hamming space, we combine the composite KL divergence with our proposed codebook to construct the overall objective function of DCH as follows:

$$\min_{W,\, c}\ \mathcal{L} = \mathcal{L}_{\mathrm{CKL}} + \gamma \sum_{i=1}^{n} \left\| b_i - \tilde{c}_i \right\|_2^2 + \lambda \|W\|_F^2, \tag{9}$$

where $\gamma$ controls the weight of the error loss between the codebook and the learned hash codes, and the second term of equation (9) aligns the binary codes with their corresponding code words.
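Putting the pieces together, a sketch of evaluating equation (9) (ours; we follow the common relaxation of measuring the alignment term on the relaxed codes $Y$ so that it stays differentiable) could look as follows, reusing the helpers sketched earlier:

```python
import numpy as np

def dch_objective(X, W, c, Ct, P, lam, beta, gamma):
    """Eq. (9): composite KL + codebook alignment + regularization.
    X: d x n features; Ct: n x L expanded codebook; P: n x n feature-space probs."""
    B, Y = frame_codes(X, W, c)                       # from the earlier sketch
    Q = cond_probs(Y.T)                               # relaxed-code similarities
    Q_hat = cond_probs(B.T.astype(float))             # Hamming-space similarities
    ckl = kl_div(P, Q) + beta * kl_div(P, Q_hat)      # eqs. (5)-(6)
    align = ((Y.T - Ct) ** 2).sum()                   # relaxed surrogate of term 2
    return ckl + gamma * align + lam * (W ** 2).sum()
```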

In this way, the proposed DCH overcomes the limitation of algorithms that consider only pairwise relationships and ensures that samples in the same category share the same code word. Furthermore, DCH maximizes the mutual distances between different categories and thereby obtains discriminative binary codes.

3.3. Algorithmic Optimization

The optimization problem has two main variables: $W$ and $c$. Our solution is to use the gradient descent algorithm to find good solutions. To facilitate the derivation, we split the objective function in equation (9) into three parts:

$$\mathcal{L} = \underbrace{\mathcal{L}_{\mathrm{CKL}}}_{\mathcal{L}_1} + \underbrace{\gamma \sum_{i=1}^{n} \|b_i - \tilde{c}_i\|_2^2}_{\mathcal{L}_2} + \underbrace{\lambda \|W\|_F^2}_{\mathcal{L}_3}. \tag{10}$$

The detailed optimization procedure is presented as follows.

$W$-Step: the corresponding problem is to minimize the following loss function:

$$\min_{W}\ \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3. \tag{11}$$

To compute the optimal $W$, the relevant derivative can be expressed as

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}_1}{\partial W} + \frac{\partial \mathcal{L}_2}{\partial W} + \frac{\partial \mathcal{L}_3}{\partial W}. \tag{12}$$

The derivative of $\mathcal{L}_1$ w.r.t. $W$ can be computed by the chain rule as follows:

$$\frac{\partial \mathcal{L}_1}{\partial W} = \sum_{i=1}^{n} x_i \left( \frac{\partial \mathcal{L}_1}{\partial y_i} \circ \frac{\partial y_i}{\partial z_i} \right)^{\!\top}, \tag{13}$$

where $z_i = W^{\top} x_i + c$, and $\partial \mathcal{L}_1 / \partial y_i$ and $\partial y_i / \partial z_i$ are represented as

$$\frac{\partial \mathcal{L}_1}{\partial y_i} = 2 \sum_{j} \left[ (p_{j|i} - q_{j|i}) + \beta\, (p_{j|i} - \hat{q}_{j|i}) \right] (y_i - y_j), \qquad \frac{\partial y_i}{\partial z_i} = y_i \circ (1 - y_i). \tag{14}$$

Following the norm derivation law, $\mathcal{L}_2$ can be differentiated as follows:

$$\frac{\partial \mathcal{L}_2}{\partial W} = 2\gamma \sum_{i=1}^{n} x_i \left[ (y_i - \tilde{c}_i) \circ y_i \circ (1 - y_i) \right]^{\top}, \tag{15}$$

where $\circ$ indicates that the elements in the same position of two matrices are multiplied. For $\mathcal{L}_3$, we have the derivative

$$\frac{\partial \mathcal{L}_3}{\partial W} = 2\lambda W. \tag{16}$$

$c$-Step: the subproblem of $c$ is given by

$$\min_{c}\ \mathcal{L}_1 + \mathcal{L}_2. \tag{17}$$

The derivative w.r.t. $c$ can be expressed as

$$\frac{\partial \mathcal{L}}{\partial c} = \frac{\partial \mathcal{L}_1}{\partial c} + \frac{\partial \mathcal{L}_2}{\partial c}. \tag{18}$$

The derivative of $\mathcal{L}_1$ is described as follows:

$$\frac{\partial \mathcal{L}_1}{\partial c} = \sum_{i=1}^{n} \frac{\partial \mathcal{L}_1}{\partial y_i} \circ \frac{\partial y_i}{\partial c}, \tag{19}$$

where

$$\frac{\partial y_i}{\partial c} = y_i \circ (1 - y_i). \tag{20}$$

The second term of equation (18) is described as follows:

$$\frac{\partial \mathcal{L}_2}{\partial c} = 2\gamma \sum_{i=1}^{n} (y_i - \tilde{c}_i) \circ y_i \circ (1 - y_i). \tag{21}$$

Algorithm 2 describes the overall algorithm optimization process of the proposed DCH.
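To illustrate the $W$-step and $c$-step updates, here is a sketch of one gradient iteration for the codebook and regularization terms (equations (15), (16), and (21) as reconstructed above); the composite-KL gradient of equations (13)–(14) and (19)–(20) is omitted for brevity, and all names are ours:

```python
import numpy as np

def dch_step(X, W, c, Ct, eta, lam, gamma):
    """One gradient-descent step on L2 + L3 w.r.t. W and c.
    X: d x n; W: d x L; c: L-dim bias; Ct: n x L expanded codebook."""
    Y = 1.0 / (1.0 + np.exp(-(W.T @ X + c[:, None])))   # L x n relaxed codes
    E = (Y - Ct.T) * Y * (1.0 - Y)                      # residual through sigmoid
    grad_W = 2.0 * gamma * (X @ E.T) + 2.0 * lam * W    # eqs. (15) + (16)
    grad_c = 2.0 * gamma * E.sum(axis=1)                # eq. (21)
    return W - eta * grad_W, c - eta * grad_c
```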

3.4. Complexity Analysis

The time complexity of the entire training process of SMVH [34] is approximately $O(T n^2 L)$, dominated by constructing the pairwise similarity matrices, and the proposed DCH adds two time-consuming parts on this basis. The first part is the learning process of the codebook $C$, whose time complexity is $O(T K^2 L)$. The second part is the optimization of equations (15) and (21), whose joint time complexity is $O(n d L)$ in each iteration. Therefore, the overall time complexity of DCH is $O(T(n^2 + nd + K^2)L)$. In this work, the complexities $O(T K^2 L)$ and $O(T n d L)$ can be ignored because $K \ll n$ and $d < n$, so our complexity is nearly $O(T n^2 L)$. Additionally, the calculation of the hash codes for a query is a linear projection with a time complexity of approximately $O(dL)$, and the online search can be performed with XOR operations. Although the algorithm proposed in this paper adds a constraint to SMVH, the maximum number of iterations directly affects the training time of the algorithm. Subsequent experiments show that DCH converges in fewer iterations; thus, the time complexity of DCH is within a reasonable range.

4. Experiments

In this section, we first introduce the datasets used in this paper; then, we describe the baselines and some experimental details; finally, we present the experimental results.

4.1. Datasets

CC_WEB_VIDEO [4] is the most widely used dataset in near-duplicate video retrieval (NDVR) research and contains videos collected from YouTube, Google, and Yahoo. It consists of 12,877 videos divided into 24 query sets, and keyframes are extracted by uniform sampling to represent each video. Since some videos have no label information, we take the 3,482 labeled videos as the experimental dataset. In each category, we select a portion of the videos as the training set and the remainder as the testing set. We uniformly extract 10 keyframes for each video and represent each keyframe with 4096-dimensional features from the pretrained VGG-19 network.

HMDB51 [38] contains 6,766 human action videos collected from movies and other public sources such as YouTube. The dataset is divided into 51 categories, each of which includes approximately 100 clips. In each category, we randomly select 45 video samples; 25 of them are added to the training set, and the rest are assigned to the testing set. We uniformly extract 10 keyframes for each video, and the pretrained VGG-19 network is used to extract 4096-dimensional deep features.

UCF101 [39] contains 13,320 videos divided into 101 human behavior categories, such as sports, playing instruments, and human-object interactions, and is widely used for action recognition. We randomly select 70 videos from each category for the training set and 30 videos for the testing set. For each video, 10 keyframes are uniformly selected to represent the video, and VGG-19 is used to extract 4096-dimensional features for each keyframe.

4.2. Experimental Setting
4.2.1. Baselines

Several state-of-the-art hashing methods, including ITQ [19], SH [20], DSH [21], LFH [22], KSH [23], JMVH [35], and SMVH [34], are used for comparison. Among these methods, ITQ, SH, and DSH are unsupervised hashing methods, while LFH, KSH, JMVH, and SMVH are supervised hashing methods. For the comparative tests, we use the publicly released source codes. JMVH and SMVH can also be used for multiview video retrieval, but in this paper, we test them as single-view methods. It is worth noting that all the experimental results are obtained in MATLAB R2016a on the same computer with an Intel Core i7-6700 CPU @ 3.40 GHz, 72 GB of RAM, and the 64-bit Windows 10 operating system.

4.2.2. Evaluation Metrics

We use four popular evaluation metrics to comprehensively evaluate the experimental results. The mean average precision (mAP) is widely used in the retrieval field; the higher the mAP score, the better the retrieval performance. The precision@K curve plots the precision against the number K of first retrieved samples, where precision is the proportion of correctly retrieved videos among all retrieved videos. The recall@K curve plots the average recall against the number K of first retrieved samples, where recall is the proportion of correctly retrieved videos among all relevant (e.g., near-duplicate) video samples. The precision-recall (PR) curve evaluates the reliability of retrieval results and is widely used in fields such as medicine and machine learning.
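For reference, a minimal sketch (ours) of Hamming-distance ranking and the mAP computation over a set of queries:

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query (XOR + popcount)."""
    return np.argsort((db_codes ^ query_code).sum(axis=1), kind="stable")

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for k, idx in enumerate(ranked_ids, start=1):
        if idx in relevant:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(query_codes, db_codes, relevant_lists):
    """mAP: average AP over all queries."""
    aps = [average_precision(hamming_rank(q, db_codes), rel)
           for q, rel in zip(query_codes, relevant_lists)]
    return float(np.mean(aps))
```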

4.2.3. Parameter Selection

We have three model parameters, $\lambda$, $\beta$, and $\gamma$, as well as the maximum number of iterations $T$. Following SMVH [34], we set $\lambda$ and $\beta$ to the values used in the original work. As shown in Figure 2(a), the results are stable across the three datasets when $\gamma$ lies in a wide range up to 1; therefore, we choose $\gamma$ empirically for our model. The maximum number of iterations $T$ determines the training time cost and affects the performance, so it is worth discussing. Figure 2(b) shows the effect of the maximum number of iterations, in the range of 100 to 1400, on mAP performance. For HMDB51, the best mAP is obtained at an intermediate value of $T$ before the performance decreases. However, that value is not optimal on the other two datasets. Therefore, after comprehensive consideration, we fix $T$ as a compromise across the three datasets.

4.3. Results and Discussion

Table 1 shows the mAP results for different hash code lengths on the three datasets, and the results for the other evaluation metrics are shown in Figures 3–5. We give a detailed analysis of the results on the three datasets below.

According to Table 1, the mAPs on the CC_WEB_VIDEO dataset are very high because videos of the same category are near-duplicates. As shown in Table 1, the proposed DCH outperforms all the other methods from 32 to 64 bits. When the code length is 96 bits, the mAP of DCH is slightly lower than that of LFH. As shown in Figure 3, the precision@K and recall@K results of our method are equal to or slightly higher than those of most other methods. Moreover, as the code length increases, the performance of the proposed DCH gradually surpasses that of the other methods; Figures 3(i)–3(l) show that the area under the PR curve of DCH gradually increases.

Table 1 shows that the proposed DCH performs better than the other hashing methods in most cases on the HMDB51 dataset. Although JMVH surpasses DCH in mAP at 32 bits, the mAPs of the proposed DCH are better than those of all the comparison methods at the longer code lengths. Figure 4 shows that when the hash code length is larger than 32 bits, DCH outperforms the other methods on the precision@K, recall@K, and PR curves.

For the UCF101 dataset, DCH obtains the optimal experimental results at 32, 48, and 64 bits. It is worth noting that the UCF101 dataset is relatively large, and SMVH cannot obtain discriminative video hash codes when the hash code length is very small; therefore, SMVH has no experimental results for the smallest code lengths. As shown in Figure 5, the performance of DCH is much higher than those of the other methods except JMVH. Based on Figures 5(e)–5(h), the recall of DCH for positive samples is slightly lower than that of JMVH. Figures 5(i)–5(k) show that the PR performance of DCH from 32 to 48 bits is better than those of all the other methods.

5. Conclusion

In this paper, we propose a novel supervised video hashing framework, termed discriminative codebook hashing (DCH), which generates discriminative binary codes for video retrieval. The proposed DCH encourages samples within the same category to converge to the same code word and maximizes the mutual distances between different categories. Specifically, we generate a discriminative codebook to distinguish samples of different categories more accurately. Extensive experimental results show that DCH achieves significant improvements over several state-of-the-art methods. In future work, we will use a smaller matrix to store the similarity information between samples, which avoids the considerable training time and space consumed when the amount of data is large; this will improve the performance of the model while reducing the time complexity.

Data Availability

CC_WEB_VIDEO dataset can be downloaded from http://vireo.cs.cityu.edu.hk/webvideo/, the HMDB51 dataset can be downloaded from https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#dataset, and the UCF101 dataset can be downloaded from https://www.crcv.ucf.edu/data/UCF101.php.

Conflicts of Interest

The authors declare that there are no conflicts of interest in the publication of this paper.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (nos. 61902087, 61772149, 61936002, and 6202780103), Guangxi Science and Technology Project (nos. 2019GXNSFFA245014, AD18281079, AA18118039, and AD18216004), and Guangxi Key Laboratory of Image and Graphic Intelligent Processing (no. GIIP2001).