Abstract

Image perceptual hashing is broadly applied in image content authentication, recognition, retrieval, and social media hotspot event detection. This paper puts forward an image authentication algorithm based on the Itti visual saliency model and geometric invariant vector distance. To begin with, the image is preprocessed and weighted by the Itti model and the contourlet transform. After that, the weighted image is randomly divided into blocks, and the image feature vector is constructed by calculating the geometric invariant vector distance on both the Hu invariant moment vector and the maximum singular value vector of the random blocks. In the end, the feature vector is quantized and encrypted to generate the final hash. Experimental results illustrate that when the threshold T = 70, the true positive rate for duplicate images is 0.96574, the false positive rate for different images is merely 0.0224, and the total error rate reaches its minimum value (0.0566). Furthermore, the AUC value of the proposed algorithm is 0.9951, higher than those of the comparison algorithms, indicating that the algorithm outperforms other state-of-the-art algorithms under various visual content-preserving attacks.

1. Introduction

As a wide variety of image editing tools, such as ACDSee, Corel Paint Shop Pro Photo, and Photoshop, are developed and popularized, malicious users can easily modify, copy, edit, and tamper with images. As a result, the number of duplicate images on networks grows exponentially, which poses serious challenges to the copyright protection and content authentication of images [1–3]. Content-preserving operations refer to image modifications, such as rescaling, occlusion, noise addition, and luminance and color changes, that leave the perceptual content of the image unaltered. Therefore, content-based image copy detection needs to be addressed urgently. The purpose of content-based image copy detection is to find the duplicate versions of a given copyright image in a database. Both duplicate images and similar images are visually very similar to the copyright image. Figure 1 shows similar images and duplicate images of a copyright image. Consequently, how to effectively distinguish similar images from duplicate images is a difficult problem in image copy detection. This study focuses on how to use image hashing technology to effectively detect content-based copy images.

At present, there are two kinds of effective image content detection methods. The first is digital watermarking technology [4], but the limited robustness of watermarking against various content-preserving attacks restricts its practical application. The other is image hashing technology [5]. Compared with the former, image hashing depends only on the image content itself and does not need to damage the image quality. Image hashing generates a brief summary by extracting and compressing the perceived content and information of an image, which is used for recording or representing the image content, or further for locating where the image content has been tampered with [6, 7]. It provides a highly efficient technology for image content authentication. Because image hashing can effectively reduce image storage cost and computational complexity, it enables fast image processing. The hashing technique has also been applied in image retrieval [8] and other applications [9–13].

In recent years, image hashing algorithms [14–20] have ignored the impact of the human visual model on the perceptual characteristics of hashing and have paid little attention to the impact of geometric invariant vector distance on hash robustness. However, in reference [21], Liu and Huang presented an effective approach based on geometric invariant vector distance, that is, constructing the image hash from the vector distances between Hu invariant moment vectors in the spatial domain and DCT (discrete cosine transform) coefficients in the frequency domain. Although this algorithm focused on geometric invariance and showed good classification performance in image copy detection and under geometric attacks, it did not fully consider the perceptual characteristics of the human visual model. In reference [22], an image hashing algorithm was introduced using the Itti visual saliency model and DWT (discrete wavelet transform), which had good robustness and uniqueness, but DWT is not as good as the contourlet transform (CT) at sparsely representing images. In view of these observations, this study chooses the Itti visual saliency model, the contourlet transform, and geometric invariant vector distance to build the image hashing algorithm. Compared with previous algorithms, the main contributions of this hashing algorithm are as follows:

(1) A weighted image representation is proposed by making use of the Itti visual saliency model and the contourlet transform. The Itti visual saliency model can efficiently capture the visually salient region of the image and strikes a compromise between computation speed and saliency detection performance. Compared with the wavelet transform, the contourlet transform has good multiresolution, localization, and directionality, and its coefficient energy is more concentrated, so it can represent the image more sparsely. Weighting these two factors to generate a weighted image and extracting hash features from it ensures that the algorithm achieves a good balance between perceptual robustness and discrimination.

(2) The geometric invariant vector distances on the Hu invariant moment vector and the maximum singular value vector are extracted to construct the hash. The Hu invariant moments and the maximum singular value vector have strong stability against various attacks. At the same time, the geometric invariant vector distance is not only robust to geometric transformations but also effectively distinguishes different images, so the algorithm achieves good classification performance.

(3) Two keys ensure the security of the algorithm. First, the image is randomly partitioned with key K1, and then the hash eigenvalues are chaotically encrypted with key K2.

Extensive experiments are done to verify the performance of our hashing algorithm. The results illustrate that when the threshold T = 70, the true positive rate for duplicate images is 0.96574, the false positive rate for different images is merely 0.0224, and the total error rate reaches its minimum value (0.0566). Furthermore, the AUC value of the proposed algorithm is 0.9951, higher than those of the comparison algorithms, indicating that the algorithm outperforms other state-of-the-art algorithms under various visual content-preserving attacks. The rest of this study is organized as follows. Section 2 describes the related work, Section 3 describes the steps of the proposed algorithm, Section 4 analyzes the experimental results, and Section 5 concludes this study.

2. Related Work

At present, image hashing algorithms can be roughly divided into two categories: signal processing and machine learning. Generally, machine learning is mainly used in image retrieval, while signal processing is mainly applied in image authentication, image quality evaluation, and copy detection [7]. Since this study focuses on hashing-based image copy detection, it puts emphasis on the hashing algorithms based on signal processing.

2.1. Image Hashing Algorithms Based on Signal Processing

The scheme [14] extracted image boundary information using the Canny operator and applied the two-dimensional DWT to the detected edges. The hash sequence was obtained by weighting the DWT coefficients of different wavelet subbands, and it performed well in semi-reference image quality evaluation. In reference [15], Abdullahi et al. proposed an algorithm based on the FMT (Fourier–Mellin transform) and fractal encoding, which is resistant to most content-invariant operations and runs fast. The scheme [16] used shape context and SURF to generate a faster hash. The scheme [17] combined Kaze features and SVD (singular value decomposition) to generate hash sequences, which were robust to gamma correction. The scheme [18] took advantage of quaternion Zernike moments to generate the hash, with good classification performance. To construct the hash, Huang and Liu [19] extracted the largest gradient magnitude and the corresponding orientation information from the three color channels and the generated image gradient field. In reference [20], Su et al. integrated a distance metric learning technique and a quantization strategy into the image authentication process, forming a personal authentication framework. In reference [23], Shen and Zhao extracted hashes from the secondary image and quadtree-structured features by concatenating color components. The scheme [24] drew stable and distinctive image features from the CIELAB color space to generate a robust hash with SVD. These image hashing algorithms ignore the impact of the human visual model on the perceptual characteristics of hashing.

2.2. Image Hashing Algorithms Based on Machine Learning

Tang et al. [25] proposed an image hashing approach based on DCT and LLE (locally linear embedding), which has good robustness to image scaling and watermark embedding, although the calculation speed of the LLE algorithm is relatively slow. Liang et al. [26] proposed a fast image hashing algorithm based on LC (luminance contrast) and 2D PCA (principal components analysis), which leads to low computational cost. Recently, some scholars have proposed image authentication algorithms based on deep learning. For example, an integrated feature matching scheme was proposed in reference [3], which integrates the matching of SIFT (scale-invariant feature transform) features and CNN (convolutional neural network) features between images for partial-duplicate image detection. Since both the good robustness of SIFT features and the high discriminative power of CNN features are sufficiently exploited, the scheme attains accurate detection. Li et al. [27] proposed an image hashing approach based on a neural network, which was the first time a deep neural network was applied to this field; by comparison, this hash algorithm can achieve better performance. Reference [28] presented an image hashing algorithm using a multiconstrained CNN, which displays especially good performance in tampered image detection. Although learning-based hashing methods have achieved satisfactory results in image semantic retrieval, they are not very suitable for content-based image copy detection.

In order to more prominently represent the salient features of the human visual system and improve the robustness and geometric invariance of hashing, this paper chooses the Itti visual saliency model, contourlet transform, and geometric invariant distance to study image hashing algorithms.

3. Proposed Algorithm

In this paper, image hashing is studied by making use of techniques such as the Itti visual model, the contourlet transform, SVD, and geometric invariant vector distance, in which the Itti visual model can efficiently detect the visually salient area of the image, the Hu invariant moments can resist common geometric attacks, and the maximum singular value vector is robust to various attacks. Firstly, the image is weighted with the Itti visual model and the contourlet transform; secondly, the geometric invariant vector distances of the Hu invariant moment vectors and the maximum singular value vectors are constructed as the hash. In this way, a better compromise between the robustness and discrimination of the perceptual hashing algorithm can be achieved. The framework of hash generation is displayed in Figure 2.

3.1. Algorithm Steps
3.1.1. Preprocessing

To begin with, the original image is transformed into a normalized image of size n × n by bicubic interpolation, which primarily ensures that images of different sizes produce hashes of the same length. After that, a Gaussian low-pass filter is applied to the normalized image through a convolution mask in order to reduce the effect of white noise, interpolation error, and speckle noise on the hash. Generally, the filtering can be implemented with a convolution mask. Let G(i, j) be the element in the ith row and jth column of the convolution mask. The convolution template can be computed by the following formula:

$$G(i,j) = \frac{g(i,j)}{\sum_{i}\sum_{j} g(i,j)}, \tag{1}$$

where g(i, j) is defined as

$$g(i,j) = \exp\left(-\frac{i^{2}+j^{2}}{2\sigma^{2}}\right). \tag{2}$$

In equation (2), σ is the standard deviation of the Gaussian function that generates the convolution template. The image after preprocessing is marked as I_p.
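As a concrete illustration, the following MATLAB sketch implements the preprocessing step under assumed parameter values (the 3 × 3 mask with σ = 0.5 and the input file name are our illustrative choices, not values fixed by the paper); imresize and imfilter come from the Image Processing Toolbox.

```matlab
% Preprocessing sketch: bicubic size normalization followed by Gaussian
% low-pass filtering with the mask of equations (1) and (2).
img   = imread('input.png');                  % hypothetical input file
img   = imresize(img, [256 256], 'bicubic');  % normalize to n x n, n = 256
sigma = 0.5;                                  % assumed standard deviation
[i, j] = meshgrid(-1:1, -1:1);                % coordinates of a 3 x 3 mask
g = exp(-(i.^2 + j.^2) / (2 * sigma^2));      % equation (2)
G = g / sum(g(:));                            % equation (1): normalize mask
Ip = imfilter(double(img), G, 'replicate');   % preprocessed image I_p
```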

3.1.2. Constructing a Weighted Image

In this section, the weighted image is obtained by unifying the calculation of the Itti visual saliency map and the contourlet transform decomposition coefficients, as shown in Algorithm 1. The specific calculation steps are as follows.

Input: the original image
Output: the weighted image I
Begin
 begin preprocessing
   standardize the image size to n × n
   apply the Gaussian low-pass filter
   return the preprocessed image I_p
 end preprocessing
 M = Itti_Model(I_p);
 begin calculating N
   I_HSI = HSI_Model(I_p)
   N = Contourlet_transform(I_HSI)
 end calculating N
 standardize M and N to n × n, return J and K
 Return the weighted image I = J ∘ K (element-wise product)
End

(1) Calculate the Itti Visual Saliency Map M. The Itti model [29] mainly utilizes techniques such as the dyadic Gaussian pyramid, center-surround computation, cross-scale combination, and linear fusion to extract the visual saliency map. Generally, the Itti model calculation is divided into four steps. The first step is to extract the color saliency map, denoted as M_color, through techniques such as the Gaussian pyramid, center-surround computation, and cross-scale combination. The second step is to extract the intensity saliency map M_intensity, whose calculation is similar to that of the color saliency map. Likewise, the third step is to extract the orientation saliency map M_orientation. Finally, the above three saliency maps are combined to generate the final visual saliency map M. The calculation formula is shown as

$$M = \frac{1}{3}\left(M_{\mathrm{color}} + M_{\mathrm{intensity}} + M_{\mathrm{orientation}}\right). \tag{3}$$
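A minimal sketch of this fusion step, assuming the three saliency maps have already been produced by an Itti model implementation (itti_saliency is a hypothetical helper, not a built-in function):

```matlab
% Fuse the color, intensity, and orientation saliency maps, equation (3).
[Mcolor, Mintensity, Morientation] = itti_saliency(Ip);  % assumed helper
M = (Mcolor + Mintensity + Morientation) / 3;            % final saliency map
```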

(2) Calculate the Contourlet Transform Coefficient N. In this study, the contourlet transform [30] is selected for weighting because of its good suitability for the sparse representation of 2D images. On the one hand, the contourlet transform has the multiresolution time-frequency analysis features of the wavelet transform. On the other hand, the contourlet transform also has good anisotropy, which can approximate singular curves along the edges of image contours with the fewest coefficients. Owing to this better sparse representation of the image, rudimentary data compression can be achieved, and the representation is less affected by common digital operations, for instance, noise pollution, low-pass filtering, and JPEG compression. According to formula (4), the preprocessed image I_p is converted to its intensity component I_HSI in the HSI color space, which is used as the input of the contourlet transform:

$$I_{\mathrm{HSI}} = \frac{R + G + B}{3}, \tag{4}$$

where R, G, and B are the three color channels of I_p. Let N be the low-frequency subband coefficient matrix of the contourlet transform of I_HSI.

(3) Calculate the Weighted Image. Assuming that the size of the visual saliency map M is S × S and that of the contourlet coefficient matrix N is P × Q, in order to calculate the weighted image, both M and N are resampled to n × n by bicubic interpolation. Let J and K be the resampled versions of M and N; the weighted image I can then be calculated by

$$I(x,y) = J(x,y) \times K(x,y), \tag{5}$$

where I(x, y), J(x, y), and K(x, y) represent the elements of the matrices I, J, and K at row x and column y, respectively.
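The sketch below illustrates this weighting step, assuming M and N are already available (obtaining N in practice requires a contourlet toolbox, which is not part of base MATLAB):

```matlab
% Resample the saliency map and contourlet coefficients to n x n and
% combine them element-wise, equation (5).
n = 256;
J = imresize(M, [n n], 'bicubic');   % resampled saliency map
K = imresize(N, [n n], 'bicubic');   % resampled contourlet coefficients
I = J .* K;                          % weighted image
```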

The weighted image is randomly divided into L blocks, and the random key is denoted as K1. Let B_k be the kth block, whose size is b × b.
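A sketch of the key-controlled blocking, under the assumption that the blocks are drawn at uniformly random (possibly overlapping) positions; seeding the generator with K1 makes the positions reproducible only for a holder of the key:

```matlab
% Randomly select L blocks of size b x b from the weighted image I.
K1 = 100; L = 50; b = 60;
rng(K1);                                 % secret key K1 seeds the generator
blocks = cell(1, L);
for k = 1:L
    x = randi(size(I, 1) - b + 1);       % random top-left row
    y = randi(size(I, 2) - b + 1);       % random top-left column
    blocks{k} = I(x:x+b-1, y:y+b-1);     % the kth block B_k
end
```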

3.1.3. Hu Invariant Moment Vector Extraction

For each block B_k, seven Hu invariant moments are calculated, giving the column feature vector h_k = (h_{1,k}, h_{2,k}, ..., h_{7,k})^T. Let H = [h_1, h_2, ..., h_L] be the invariant moment feature matrix of the entire image. To build a normalized matrix for further calculation, data normalization is carried out. Let H_j be the jth (1 ≤ j ≤ 7) row of H. Each element h_{j,k} in the jth row is normalized as

$$\hat{h}_{j,k} = \frac{h_{j,k} - \mu_j}{\sigma_j}. \tag{6}$$

In formula (6), μ_j and σ_j are the mean and the standard deviation of all elements of H_j.

Finally, the normalized Hu invariant moment matrix is obtained, which is denoted as Ĥ.
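For reference, the following MATLAB sketch computes the seven Hu moments of one block from the standard Hu (1962) formulas (hu_moments is our hypothetical helper name; the paper does not prescribe an implementation). Applying it to all L blocks yields H, whose rows are then z-scored as in formula (6), e.g., Hn = (H - mean(H, 2)) ./ std(H, 0, 2).

```matlab
% Seven Hu invariant moments of a grayscale block B (standard formulas).
function h = hu_moments(B)
    B = double(B);
    [rows, cols] = size(B);
    [X, Y] = meshgrid(1:cols, 1:rows);
    m00 = sum(B(:));
    xc = sum(sum(X .* B)) / m00;           % centroid coordinates
    yc = sum(sum(Y .* B)) / m00;
    % normalized central moment eta_pq
    eta = @(p, q) sum(sum(((X - xc).^p) .* ((Y - yc).^q) .* B)) ...
                  / m00^(1 + (p + q) / 2);
    n20 = eta(2,0); n02 = eta(0,2); n11 = eta(1,1);
    n30 = eta(3,0); n03 = eta(0,3); n21 = eta(2,1); n12 = eta(1,2);
    h = zeros(7, 1);
    h(1) = n20 + n02;
    h(2) = (n20 - n02)^2 + 4*n11^2;
    h(3) = (n30 - 3*n12)^2 + (3*n21 - n03)^2;
    h(4) = (n30 + n12)^2 + (n21 + n03)^2;
    h(5) = (n30 - 3*n12)*(n30 + n12)*((n30 + n12)^2 - 3*(n21 + n03)^2) ...
         + (3*n21 - n03)*(n21 + n03)*(3*(n30 + n12)^2 - (n21 + n03)^2);
    h(6) = (n20 - n02)*((n30 + n12)^2 - (n21 + n03)^2) ...
         + 4*n11*(n30 + n12)*(n21 + n03);
    h(7) = (3*n21 - n03)*(n30 + n12)*((n30 + n12)^2 - 3*(n21 + n03)^2) ...
         - (n30 - 3*n12)*(n21 + n03)*(3*(n30 + n12)^2 - (n21 + n03)^2);
end
```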

3.1.4. The Largest Singular Value Vector Extraction

Similarly, SVD is performed on each block B_k. We extract the left singular vector corresponding to the largest singular value, denoted as u_k. Let U = [u_1, u_2, ..., u_L] be the largest singular value feature matrix. In the same way, each row of U is normalized as in formula (6), and the normalized largest singular value matrix is obtained, which is denoted as Û.
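A short sketch of this step (MATLAB's svd returns singular values in descending order, so the first column of the left factor corresponds to the largest singular value):

```matlab
% Largest-singular-value vectors of all blocks, then row-wise z-scoring.
Umat = zeros(b, L);
for k = 1:L
    [Uk, ~, ~] = svd(double(blocks{k}));
    Umat(:, k) = Uk(:, 1);            % vector for the largest singular value
end
Un = (Umat - mean(Umat, 2)) ./ std(Umat, 0, 2);   % normalized matrix U-hat
```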

3.1.5. Calculating Geometric Invariant Vector Distance

Through the above steps, we get the normalized feature matrices Ĥ and Û. Since the vector distance is invariant to content-preserving digital operations on the image, we use the invariant vector distance to construct the perceptual hash, which is extracted in detail as follows.

(1) Calculating the geometric invariant vector distance of the matrix Ĥ. For the matrix Ĥ, in order to calculate the geometric invariant vector distance, we provide a reference vector h̄ = (h̄_1, h̄_2, ..., h̄_7)^T, where h̄_j (1 ≤ j ≤ 7) is the mean of the jth row. The Euclidean distance between the kth column vector ĥ_k and the reference vector h̄ is defined as

$$d_1(k) = \left\|\hat{h}_k - \bar{h}\right\|_2 = \sqrt{\sum_{j=1}^{7}\left(\hat{h}_{j,k} - \bar{h}_j\right)^2}. \tag{7}$$

After calculation, the geometric invariant vector distance of the Hu invariant moment eigenvectors is obtained, which is denoted as D1 = [d_1(1), d_1(2), ..., d_1(L)].

(2) Calculating the geometric invariant vector distance of the matrix Û. Also, for the matrix Û, to calculate the geometric invariant vector distance, we provide a reference vector ū = (ū_1, ū_2, ..., ū_b)^T, where ū_j (1 ≤ j ≤ b) is the mean of the jth row. The Euclidean distance between the kth column vector û_k and the reference vector ū is defined as

$$d_2(k) = \left\|\hat{u}_k - \bar{u}\right\|_2 = \sqrt{\sum_{j=1}^{b}\left(\hat{u}_{j,k} - \bar{u}_j\right)^2}. \tag{8}$$

After calculation, the geometric invariant vector distance of the maximum singular value vectors is obtained, which is denoted as D2 = [d_2(1), d_2(2), ..., d_2(L)]. After steps (1) and (2), we get the feature vector D = [D1, D2].

(3) Encryption. In order to improve the security of the algorithm, we use a chaotic encryption algorithm to encrypt the vector D. Assuming that the initial value x_1 = K2 is a secret key shared by the sender and the receiver, the logistic equation is shown as

$$x_{i+1} = \mu x_i \left(1 - x_i\right), \tag{9}$$

where μ is the control parameter of the logistic map. Iterating equation (9) yields the key vector X = [x_1, x_2, ..., x_{2L}]. Let Y denote the hashing sequence after encryption and randomization, obtained by combining D with the key vector X element by element.

(4) Quantization. Since the encrypted eigenvalues are all floating-point data, in order to reduce the storage space and improve the efficiency of image authentication, the eigenvector Y is quantized by

$$\mathrm{Hash}(i) = \left[\,Y(i)\,\right], \quad i = 1, 2, \ldots, 2L, \tag{10}$$

where [·] is the round-up or round-down operation.
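The following end-to-end sketch covers steps (1)–(4). Two points are our assumptions rather than the paper's specification: the control parameter μ = 4 for the logistic map, and the element-wise combination Y = D + X (the extracted text does not state how D and X are combined).

```matlab
% Steps (1)-(4): invariant distances, chaotic encryption, quantization.
href = mean(Hn, 2);                     % reference vector of H-hat
uref = mean(Un, 2);                     % reference vector of U-hat
D1 = sqrt(sum((Hn - href).^2, 1));      % equation (7)
D2 = sqrt(sum((Un - uref).^2, 1));      % equation (8)
D  = [D1, D2];                          % 2L feature values
K2 = 0.27; mu = 4;                      % mu = 4 is an assumed parameter
X = zeros(size(D)); x = K2;
for i = 1:numel(D)
    x = mu * x * (1 - x);               % logistic map, equation (9)
    X(i) = x;
end
Y = D + X;                              % assumed combination of D and X
Hash = round(Y);                        % equation (10)
```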

Finally, the final hashing sequence is generated, denoted as Hash = [Hash(1), Hash(2), ..., Hash(2L)]. In order to determine the hashing length in binary form, the hashing values of 1000 different images are calculated and analyzed to determine the number of bits occupied by each hashing value, which is discussed in Section 4.6.

From Section 3.1.3 to Section 3.1.5, we can calculate the final hash sequence by Algorithm 2.

Input: the weighted image I
Output: the hashing sequence Hash
Begin
 for k = 1 : L (number of random blocks)
   B_k = overlapping_block(I, b) with secret key K1;
 end
 for k = 1 : L
   h_k = Hu_Feature(B_k);
 end
 Ĥ = Normalize(H);
 for k = 1 : L
   u_k = SVD_Feature(B_k);
 end
 Û = Normalize(U);
 D1 = geometric_invariant_distance(Ĥ);
 D2 = geometric_invariant_distance(Û);
 D = [D1, D2];
 Y = chaotic_encryption(D) with secret key K2;
 Hash = quantize(Y);
End
3.2. Image Copy Detection

Let IS be the image set, H(·) be the hash function, and C(·) be the content-preserving operation function. For two images x and y in IS with different perceptual contents, H(x) and H(y) represent the hash sequences of x and y, respectively. Let C(x) be a duplicate version of x, and let H(C(x)) represent its hash.

The Euclidean distance d(·,·) is used to measure the difference between two image hashes during image copy detection. When the Euclidean distance between the hashes of two different images is greater than the threshold T, that is, d(H(x), H(y)) > T, the two images are judged to have different perceptual contents. When the Euclidean distance between the original image and one of its duplicates is less than the threshold T, that is, d(H(x), H(C(x))) < T, the duplicate is judged to come from the copyright image.
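A sketch of this decision rule (hash1 and hash2 are hypothetical hash vectors of the two images under comparison):

```matlab
% Threshold decision for copy detection, with the recommended T = 70.
T = 70;
d = norm(double(hash1) - double(hash2));   % Euclidean distance of hashes
if d < T
    disp('duplicate: same perceptual content');
else
    disp('different perceptual content');
end
```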

4. Experiment Results

The proposed algorithm is programmed in the MATLAB language and implemented on a Windows 10 platform with MATLAB 2018b, an Intel Core i7-4770K CPU @ 3.50 GHz, and 8.00 GB of memory. The image dataset used in the experiments includes 1022 images. There are 22 images sized 768 × 512 or 512 × 768 from the Kodak dataset [31], and the other 1000 images, sized 512 × 384 or 384 × 512, are from the UCID [32]. The simulation results and analysis are as follows.

4.1. Experimental Parameters

The experimental parameters are as follows. The original image is interpolated to a fixed size of 256 × 256, that is, n = 256, and the size of each random block is 60 × 60, that is, b = 60. The number of blocks is 50, that is, L = 50. The convolution mask size and the standard deviation of the Gaussian low-pass filter are fixed throughout the experiments. The random block key K1 is 100, and the initial key K2 of the chaotic encryption of the eigenvector is 0.27; that is to say, the key pair is (100, 0.27). According to the procedure of Algorithm 2, the length of the hashing sequence equals twice the number of blocks, that is, 100 integers. Next, the simulation results and analysis of the experiments are given below. Section 4.2 verifies robustness, Section 4.3 discusses discrimination, Section 4.4 analyzes the hashing security, Section 4.5 explores the influence of different parameters, Section 4.6 probes hash length and storage cost, Section 4.7 analyzes which components are crucial for the proposed algorithm, and Section 4.8 discusses the performance comparison with other state-of-the-art algorithms.

4.2. Robustness

To evaluate the performance of the proposed algorithm, the 22 images sized 768 × 512 or 512 × 768 in the Kodak dataset were subjected to robustness attacks using Photoshop, MATLAB, and StirMark [33] to acquire their duplicate images. These robustness attacks comprise a wide range of operations that leave the perceptual content unaltered, and the details are shown in Table 1, containing the tools used, the operation types, the parameters and their values, and the number of variants of each operation. The total number of robustness attacks utilized in the experiment is 69. In other words, there are 69 copy versions of each original image after the robustness attacks. Consequently, there are 22 × 69 = 1518 duplicate images, bringing the total number of images to 1518 + 22 = 1540. We extract the hashing sequences of the 22 test images and their copy images and calculate the Euclidean distance between each original image and its duplicate versions. In this experiment, we select the threshold T = 70 (for the selection of the threshold, please see the relevant instructions in Section 4.3). As described in Section 3.2, when the Euclidean distance between the original image and one of its duplicate images is less than the threshold T, the duplicate image is judged to come from the copyright image.

The results of the Euclidean distance for each operation are listed in Table 2, including the maximum value, the minimum value, the mean value, and the standard deviation. The variation curves of the Euclidean distance mean under different robustness operations and parameters are shown in Figure 3, where the abscissa represents the different parameter settings under each digital processing operation and the ordinate represents the Euclidean distance mean. It is clear from Table 2 and Figure 3 that the algorithm has good robustness to the majority of robustness operations, especially JPEG compression, affine transformation, salt-and-pepper noise, speckle noise, gamma correction, watermark embedding, scaling, and contrast and brightness changes: the maximum Euclidean distances corresponding to these operations are lower than the threshold (T = 70), and their average values are lower than 21. In addition, it can be seen from Table 2 and Figure 3(g) that although the maximum value for Gaussian noise is 75.1961, the range of the average values is [20.1225, 37.5264], and the overall average value is 29.2106. In this experiment, only three image pairs have Euclidean distances exceeding 70; therefore, the proposed algorithm also performs well against Gaussian noise. However, the algorithm adapts poorly to rotation. Table 3 lists the statistical analysis results corresponding to different rotation angle ranges for the 22 images, including the number of distances exceeding the threshold (T ≥ 70), the mean value, and the true positive rate P_tpr (defined in formula (11)). It can be seen from Tables 2 and 3 and Figure 3(j) that when the rotation angle is less than 2 degrees, the proposed algorithm adapts well, and P_tpr can reach 96.59%, with merely six pairs exceeding 70. However, when the rotation angle is greater than 3 degrees, the performance of the algorithm decreases, and P_tpr is less than 90%.

Therefore, the experiments show that the raised algorithm has good robustness to the majority of robust operations.

4.3. Discrimination

In order to verify the discrimination, 1000 visually different images are selected from the UCID image database. To begin with, we obtain the perceptual hashing sequences of these images using the proposed algorithm, match all distinct image pairs, and calculate the Euclidean distance of each pair. After that, we get 1000 × 999/2 = 499,500 Euclidean distances. In the end, the frequency statistics of these distances are counted, and we plot the frequency histogram, which reveals the overall distribution of these Euclidean distances, as shown in Figure 4. It is apparent that the experimental result is nearly in line with a normal distribution.

In order to determine a reasonable threshold, the true positive rate (P_tpr), the false positive rate (P_fpr), and the total error rate (1 − P_tpr + P_fpr) under a variety of thresholds are calculated and itemized in Table 4, with the detailed definitions of P_tpr and P_fpr given in formulas (11) and (12). The threshold T ranges between 50 and 150, with a step size of 5. The selection criterion here is to minimize the total error rate. It is observed from Table 4 that when the threshold T = 70, P_tpr stands at 0.96574, while P_fpr is merely 0.0224, with the total error rate reaching its minimum value (0.0566). When the threshold is lower or higher than 70, the total error rate gradually increases; it attains its minimum only at T = 70. Remarkably, the overall performance of the algorithm can be evaluated by the total error rate. Therefore, this algorithm can take T = 70 as a recommended threshold.

Obviously, a small threshold means a low collision probability, that is, good discrimination. A large threshold helps improve the robustness of the algorithm but inevitably reduces the discrimination performance. Therefore, in practice, the threshold can be selected on the basis of the specific application scenario to obtain a balance between discrimination and robustness. For image similarity detection, for example, a smaller threshold T can be adopted; in that case, P_fpr drops to a very small order of magnitude, meaning that among a large number of received images, only very rarely would a pair of different images be regarded as visually identical. Therefore, it is believed that this method can guarantee the discrimination of images with different perceptual content.

4.4. Security Experiment

In this paper, the key K1 is used for the random selection of image blocks, and the initial secret key K2 is used for chaotic encryption; that is, the key pair (K1, K2) is exploited to enhance the security of the algorithm. Different key pairs should produce different image hashes. To verify the dependence of the algorithm on the key pair, we first use the initial key pair (201, 0.27) to generate the hash of an image and then select 200 different key pairs (K1, K2) at random to extract hashes of the same image. Next, the Euclidean distances among the 201 hash values are matched, and a total of 19,900 matching results are generated. The experimental result is given in Figure 5. Clearly, the smallest Euclidean distance is 155.5796, which is far larger than the given threshold (T = 70). This illustrates that hash sequences generated from the same image with different key pairs will not be falsely matched. Therefore, when the key pair is unknown, it is impossible to forge the correct image hash, which heightens the security of the algorithm.

4.5. Influence of Different Parameters on Hashing Performance

This section mainly explores the influence of the image block size (denoted as b) and the number of random blocks (denoted as L) on the classification performance of the algorithm with respect to robustness and discrimination. The image databases of Sections 4.2 and 4.3 are still employed in this experiment. In each experiment, only the parameter currently under discussion is changed, and the other parameters remain unchanged. In this section, the well-known receiver operating characteristic (ROC) curve [34] is exploited to analyze the classification performance of the algorithm. In the ROC curve, the y-axis represents the true positive rate P_tpr, while the x-axis represents the false positive rate P_fpr. The specific calculation formulas of P_tpr and P_fpr are defined as follows:

$$P_{\mathrm{tpr}} = \frac{N_1}{N_2}, \tag{11}$$

$$P_{\mathrm{fpr}} = \frac{N_3}{N_4}, \tag{12}$$

where N_1 and N_2, respectively, represent the number of pairs of copy images judged as visually identical images and the total number of pairs of visually identical images, and N_3 and N_4, respectively, represent the number of pairs of different images regarded as visually identical images and the total number of pairs of visually different images. Apparently, P_tpr and P_fpr, respectively, reflect the robustness and discrimination of the algorithm. Therefore, the closer the ROC curve is to the upper-left corner (i.e., the smaller the value of P_fpr and the larger the value of P_tpr), the better the classification performance of the algorithm. For quantitative comparison, the area under the ROC curve (AUC) [34] can be regarded as a judgment standard. The range of the AUC value is (0, 1), and a larger AUC stands for better classification performance. The influence of the parameters L and b on the algorithm is discussed below.
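A minimal sketch of formulas (11) and (12), assuming the hash distances have already been collected into two hypothetical vectors, dist_same (visually identical pairs) and dist_diff (visually different pairs):

```matlab
% True positive rate, false positive rate, and total error rate at T.
T = 70;
Ptpr = sum(dist_same < T) / numel(dist_same);   % formula (11): N1 / N2
Pfpr = sum(dist_diff < T) / numel(dist_diff);   % formula (12): N3 / N4
total_error = 1 - Ptpr + Pfpr;                  % criterion of Section 4.3
```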

4.5.1. The Number of Random Blocks

Figure 6 is the comparison chart of ROC curves under diverse L values when b = 60, including L = 30, 40, 50, 60, and 70. It is obvious that all ROC curves lie close to the upper-left corner, indicating that these L values keep the presented algorithm well balanced between robustness and discrimination. The ROC curves in the upper-left corner are enlarged, as displayed in the small graph below the middle. Obviously, the ROC curve with L = 50 is closest to the upper-left corner, and the corresponding AUC value is 0.9948, which is higher than the AUC values of the ROC curves with other L values. Therefore, the number of random blocks L is selected as 50. It can be seen from Figure 6 that when L = 30 and L = 70, the ROC curves cross each other, and the corresponding AUC values are 0.9868 and 0.9922, respectively, which are also lower. This is because, when the block size b is fixed, too few blocks fail to capture the key features of the image, so the image cannot be represented well, and performance degrades. If the value of L is too large, unnecessary redundant features are introduced, leading to overfitting. In addition, the ROC curves for L = 30, 40, 50, 60, and 70 are very close, and the corresponding AUC values are also very close, which shows that the number of random blocks L has little influence on the algorithm and, in turn, reflects the good stability of the algorithm.

4.5.2. The Image Block Size

Figure 7 is a comparison graph of ROC curves under diverse b values when L = 50, including b = 30, 40, 50, 60, and 70. It is clear that all ROC curves approach the upper-left corner, showing that these b values keep the proposed algorithm well balanced with regard to robustness and discrimination. The ROC curves in the upper-left corner are enlarged, as exhibited in the small figure below the middle. Obviously, the ROC curves of b = 60 and b = 70 cross each other and are significantly higher than the ROC curves of b = 30, 40, and 50. This is because, when the number of random blocks L is fixed, the smaller the block, the more easily image features are lost, so the image cannot be represented well, resulting in a decrease in classification performance. If the value of b is too large, some repeated image regions are introduced, leading to feature redundancy and overfitting. From Figure 7, the AUC value of b = 60 is 0.99482, slightly higher than that of b = 70, whose AUC value is 0.99451. Therefore, we choose the image block size b = 60.

4.6. The Length of Hashing

The storage cost of the hashing is analyzed through the hashing sequences extracted in the discrimination experiments of Section 4.3. As mentioned before, every hashing sequence contains 100 elements, so the total number of hashing elements is 1000 × 100 = 100,000. The hashing value distribution curve of each image is shown in Figure 8. According to the calculations and statistics, the minimum value is 1, the maximum value is 183, and the average value is 28.3104. We calculate the frequency distribution of these hashing element values and obtain the frequency distribution bar graph shown in Figure 9, where the x-axis is the value range of the hash elements and the y-axis is their frequency. From Figures 8 and 9, it is clear that the hashing element values distributed in [0, 128) account for 99.985%, and only 15 hash values lie in [128, 183]. Since 8 bits can represent an unsigned integer in the interval [0, 255], each hashing element of the algorithm needs only 8 bits of storage. It follows that the storage cost of the hash of the proposed algorithm is 100 × 8 = 800 bits.
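A one-line illustration of this storage cost, assuming Hash is the 100-element integer sequence produced earlier:

```matlab
% Each element fits in [0, 255], so one byte per element suffices.
hash_bytes = uint8(Hash);   % 100 elements x 8 bits = 800 bits in total
whos hash_bytes             % reports a 1x100 uint8 array (100 bytes)
```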

4.7. Component Analysis

The algorithm in this paper mainly includes two steps: one is weighted image construction (see Algorithm 1 for details), and the other is hashing feature extraction (see Algorithm 2 for details). To begin with, for Algorithm 1, we test the impact of each single component (the Itti visual model or the contourlet transform (CT)) on the algorithm performance, with the remaining Algorithm 2 unchanged. Similarly, for Algorithm 2, we check the influence of each single component (the geometric invariant vector distance of the Hu eigenvectors or of the SVD eigenvectors), keeping Algorithm 1 fixed. Because different components directly affect the Euclidean distance of the image hashes, different threshold values are selected in the experimental tests of the different components, and the selection standard is again to minimize the total error rate. Table 5 shows P_tpr, P_fpr, the total error rate, and the corresponding thresholds under the different components. It is clear that the overall performance of the SVD component is the best among the four, with P_tpr reaching 0.9447, P_fpr standing at 0.0090, and a total error rate of 0.0643. The Itti component is second only to SVD. However, the overall performance of the other two factors, the contourlet and Hu components, is poorer. Figure 10 shows the ROC curves of the four components: the Itti visual model, the contourlet transform, the Hu feature vector, and the SVD feature vector. It can also be seen from the ROC curves that the SVD feature vector and the Itti visual model are two important components of the proposed algorithm, and their corresponding ROC curves are very close (AUC values of 0.9938 and 0.9904, respectively), significantly higher than the ROC curves of the other two factors, the contourlet transform and the Hu eigenvector. From both Table 5 and Figure 10, we can clearly find that the performance of the proposed algorithm is best when the four components are combined. In the experiment, it is also found that the Euclidean distances of the hash sequences extracted by the contourlet transform and the Hu feature vector differ little in robustness and discrimination. To sum up, the SVD feature vector and the Itti visual model are the two key components of the proposed algorithm.

4.8. Performance Comparisons of Different Algorithms

For the sake of illustrating the superiority of this algorithm, it is compared with some state-of-the-art algorithms recently published in academic journals or conferences. The comparison algorithms comprise geometric invariant vector distance (GIVD) [21], visual attention model and invariant moments (VAM and IM) [22], and hybrid features hashing [35]. The experimental image library is the same as in Sections 4.2 and 4.3. For an impartial comparison, all parameter settings follow their original settings. In this algorithm, L = 50, b = 60, and the key pair (100, 0.27) are selected for the comparison among algorithms.

Figure 11 shows the ROC curves of the different algorithms. By comparison, it can be seen clearly that the AUC value of the proposed algorithm is 0.9951, while the AUC values of the GIVD algorithm, the algorithm based on VAM and IM, and the hybrid features algorithm are 0.9887, 0.9914, and 0.9946, respectively. Obviously, the AUC value of the proposed algorithm is higher than those of the comparison algorithms. Therefore, by comparing the ROC curves and AUC values, we find that the proposed algorithm outperforms the other algorithms in terms of classification performance.

In addition, we analyze other properties of the different algorithms, as displayed in Table 6, such as the running time, the AUC value, and the length of the hash. For the calculation of running time, we first recorded the total time spent generating the hash sequences of 1000 different images and then calculated the average time cost of hashing one image. It can be seen from Table 6 that the proposed algorithm is faster than the hybrid features algorithm, slower than the GIVD algorithm, and comparable to the VAM and IM algorithm. From the overall perspective of running time, hash length, and classification performance, the proposed algorithm has the better combined properties.

5. Conclusion

In this study, an image hashing algorithm is proposed based on the Itti visual saliency model and geometric invariant vector distance, combining features of the frequency domain and the spatial domain. The Itti visual saliency model and the low-frequency subband of the contourlet transform are used to construct a weighted image, which ensures the perceptual robustness of the algorithm. The geometric invariant distance is calculated on the Hu invariant moment vectors and the maximum singular value vectors to construct the hash, so that the algorithm reaches a good classification balance between robustness and uniqueness. Apart from that, the random selection of image blocks and the chaotic encryption of the hash sequence ensure high security. The simulation experiments and analysis indicate that the proposed perceptual hash has good comprehensive performance.

With the growing popularity of smart terminals such as mobile phones and tablets, a large number of screen content images have been generated. Therefore, we can further study image quality evaluation hashing algorithms for screen content in the future. In addition, the computing and storage capabilities of mobile terminal equipment are limited, and current mainstream image hashing algorithms have limitations when applied to mobile devices. Therefore, the next step is to study lightweight image hashing algorithms for mobile device applications.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Anhui Provincial Key Research and Development Plan (201904a05020091), the Provincial Natural Science Research Program of Higher Education Institutions of Anhui Province (KJ2021A1030), the Key Scientific Research Projects of Chaohu University (XLZ-202108), Special Support Plan for Innovation and Entrepreneurship Leaders in Anhui Province, the Provincial Humanities and Social Science Research Program of Higher Education Institutions of Anhui Province (SK2021A0613), and the Key Scientific Research Projects of Chaohu University (XLZ-202104).