Abstract

Perceptual hashing techniques for tamper detection have been intensively investigated owing to their speed and memory efficiency. Recent research has shown that leveraging supervised information can lead to high-quality hashing codes. However, most existing methods generate hashing codes by treating each region equally, ignoring the different perceptual saliency associated with the semantic information. We argue that the integrity of salient objects is more critical to verify, since the semantic content is closely tied to them. In this paper, we propose a Multi-View Semi-supervised Hashing algorithm with Perceptual Saliency (MV-SHPS), which incorporates supervised information and multiple features into hashing learning simultaneously. Our method calculates the image hashing distance by taking the perceptual saliency into account rather than directly using the distance between whole images. Extensive experiments on benchmark datasets validate the effectiveness of the proposed method.

1. Introduction

With the widespread availability of low-cost and even free editing software, people can easily create tampered images. Compared with pristine images, fake images may have undergone various kinds of manipulation, such as color changes, changes to salient objects, and copy-move forgery. Generally, there are two main problems in image forensics: tamper detection and tamper localization. Recently, more researchers have paid attention to image tamper detection, which aims to determine whether a given image is pristine or fake. Image hashing based tamper detection approaches have been extensively studied for their efficiency. Image hashing supports content forensics by representing the semantic content in a compact signature, which should be robust against a wide range of content-preserving attacks but sensitive to malicious manipulations.

For image hash generation, state-of-the-art hashing methods can be divided into two main categories: data independent hashing and data dependent hashing. In conventional (data independent) image hashing methods, hash generation is a robust feature compression process without any learning stage. These methods include (1) invariant feature transform based methods, such as the wavelet transform [1], Radon transform [2], Fourier-Mellin transform [3], DCT [4], and QFT [5], which aim to extract robust features from transform domains; (2) local feature point based methods, such as SIFT [6] and end-stopped wavelets [7], which take advantage of local features that are invariant under some content-preserving image processing attacks; (3) dimension reduction based methods, such as singular value decomposition (SVD) [8], nonnegative matrix factorization (NMF) [9], and the Fast Johnson-Lindenstrauss transform (FJLT) [10], which embed low-level features from a high-dimensional space into a lower-dimensional one; (4) statistical feature based methods, such as robust image hashing with ring partition and invariant vector distance [11]. Moreover, Wang et al. [12] propose a perceptual image hashing method that combines image block based features and key-point-based features. Yan et al. [13] use a multiscale image hashing method based on adaptive local features.

Since hash generation is independent of the data distribution, data independent hashing methods do not take the characteristics of the data distribution into account. Currently, more researchers are beginning to focus on data dependent methods with learning for image tamper detection. Lv et al. [14] propose a semi-supervised spectral embedding method for image hashing. Efficient learning is incorporated into image hash generation by taking advantage of a virtual prior attacked hash space (VPAHS). However, this algorithm only focuses on the postprocessing of image hashes: it assumes the availability of real-valued image hashes and concentrates on compressing them into a short binary image hash. Deep learning is also becoming widely used in image forensics. Chen et al. [15] and Qian et al. [16] propose median filtering detection and steganalysis methods based on convolutional neural networks (CNNs). Bayar et al. [17] propose a universal forensic approach to manipulation detection using deep learning, developing a new form of convolutional layer that is specifically designed to suppress an image's content and adaptively learn manipulation detection features. Bondi et al. [18] propose a tampering detection and localization algorithm based on clustering of camera-based CNN features. The CNN is exploited to extract characteristic camera model features from image patches, and forged patches are detected using the descriptors learned by the CNN. Likewise, Yarlagadda et al. [19] propose a satellite image forgery detection and localization method using a generative adversarial network (GAN), which is also used for feature representation learning of pristine satellite images. More recently, video forgery detection [20] and camera model identification with CNNs [21, 22] have been proposed. However, most of these algorithms only emphasize feature learning with deep networks.

The abovementioned methods leave two aspects insufficiently considered. First, most of the methods describe image content with a single feature. Most features are robust against only one or a few types of attacks, and it may not be feasible to extract a single, absolutely robust feature that satisfies the needs of users. Lv et al. [14] propose an image hashing algorithm based on semi-supervised spectral embedding, in which two real-valued intermediate hashing methods are adopted for learning. Likewise, Yan et al. [5] propose a quaternion-based image hashing method for tampering localization, in which four types of feature maps are selected for quaternion image formation. Second, current hashing methods usually obtain detection results by treating each local region equally. We argue that the integrity of salient objects, which is violated by object adding, deleting, and semantic modification, is more critical to verify, since the semantic content of the image is closely tied to them. Zhang et al. [23] extract local texture features from salient regions to represent content and combine them with global features to compute the final hash sequence. However, the saliency weights of the selected regions are not taken into account in the hashing metric distance. Therefore, how to efficiently combine different image features to enhance overall performance and how to design an image hashing approach based on perceptual saliency are topics of great importance but less studied in current research.

In this paper, we present a Multi-View Semi-supervised Hashing with Perceptual Saliency (MV-SHPS) algorithm. The contributions are as follows:

(1) We simultaneously incorporate supervised information and multiple features into the hashing learning.

(2) Instead of learning the metric distance on the global image, we explore the local hashing distance by considering the perceptual saliency effect among different regions.

(3) An extensive set of experiments on image datasets demonstrates that the proposed method outperforms several state-of-the-art perceptual image hashing techniques.

2. Proposed Method

2.1. Preprocessing

To alleviate the effects of commonly used digital signal processing manipulations, preprocessing is needed, as shown in Figure 1. All input images are first resized to a standard size by bilinear interpolation. Resizing makes the hash resistant to possible resizing operations and ensures that images of different sizes yield a fixed hash length. Gaussian low-pass filtering is then applied to the standard image (Figure 1(b)), which reduces the influence of minor modifications such as noise contamination or filtering. Because the CIE LAB color space is more perceptually uniform than other color spaces and its L component closely matches human perception of lightness, the RGB color image is first converted into the XYZ color space, which is then converted into the LAB color space [24, 25]. Here R, G, and B are the red, green, and blue components of a pixel, X, Y, and Z are the CIE XYZ tristimulus values (1), and L, A, and B ((2), (4), and (3)) are the color lightness and chromaticity coordinates, respectively. Xw=0.950456, Yw=1.0, and Zw=1.088754 are the CIE XYZ tristimulus values of the reference white point, and f(t) is the piecewise function of the CIE LAB definition. The L component is then taken for image representation (Figure 1(c)). The Integer Wavelet Transform (IntWT) yields an approximation of the original image that is more robust against signal processing attacks. Therefore, we finally apply a one-level IntWT to the L component and take the low-frequency subband (LL) as the semantic perceptual image (Figure 1(d)), from which multiple types of features are extracted for hash generation.
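The color conversion step can be illustrated with a minimal Python sketch. It assumes the standard CIE XYZ to LAB formulas, RGB values already scaled to [0, 1] with no gamma linearization, and an RGB to XYZ matrix (ITU-R BT.709 primaries, D65 white) whose rows sum to the reference white point quoted above. The matrix and the function names `f` and `rgb_to_lab` are choices made for this illustration, not necessarily the exact coefficients used by the authors.

```python
import numpy as np

# Assumed RGB -> XYZ matrix (BT.709 primaries, D65 white point).
# Its rows sum to the reference white (Xw, Yw, Zw) quoted in the text.
RGB_TO_XYZ = np.array([
    [0.412453, 0.357580, 0.180423],
    [0.212671, 0.715160, 0.072169],
    [0.019334, 0.119193, 0.950227],
])
WHITE = np.array([0.950456, 1.0, 1.088754])  # Xw, Yw, Zw from the text


def f(t):
    """Standard CIE helper: cube root above 0.008856, linear segment below."""
    t = np.asarray(t, dtype=float)
    return np.where(t > 0.008856, np.cbrt(t), 7.787 * t + 16.0 / 116.0)


def rgb_to_lab(rgb):
    """Convert an (..., 3) array of RGB values in [0, 1] to CIE LAB."""
    rgb = np.asarray(rgb, dtype=float)
    xyz = rgb @ RGB_TO_XYZ.T                      # per-pixel matrix product
    fx, fy, fz = np.moveaxis(f(xyz / WHITE), -1, 0)
    L = 116.0 * fy - 16.0                         # lightness
    A = 500.0 * (fx - fy)                         # first chromaticity axis
    B = 200.0 * (fy - fz)                         # second chromaticity axis
    return np.stack([L, A, B], axis=-1)
```

Only the first channel of the result (the L component) would be passed on to the wavelet step; a pure white pixel maps to a lightness of 100 and a pure black pixel to 0.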

2.2. Hashing Learning

Suppose there are images in the given whole set, represented as , , where represents feature vector. For each image, we extract their types of features. The task of multiview perceptual image hashing is to learn hash functions by simultaneously utilizing the feature matrices , with corresponding to the type of feature matrix. Let denote the combined matrix for multiview feature, where , , and is the dimension of type feature. The goal of our algorithm is to learn hash functions that map to a compact representation in a low-dimensional Hamming space, where is the digits length.

In the set , there are labeled images, , which are associated with at least one of the two categories and . Specifically, a pair is denoted as a perceptually similar pair when are images that have undergone content-preserving, non-malicious distortions and attacks. is denoted as a perceptually dissimilar pair when the two samples are the original image and one that has suffered malicious manipulations or perceptually significant attacks such as object insertion and removal. Let us denote the feature matrix formed by the corresponding columns of as . Note that the feature matrices are normalized to be zero-centered.

We define the perceptual confidence measurement for each image example. The matrix incorporating the pairwise labeled information from , is the pairwise relationship for , which is defined as

Suppose we want to learn hash functions that lead to a -digit representation of . For each digit , its hash function is defined as where is the coefficient vector. Let and the representation of the feature matrix for image set is Our goal is to learn a that simultaneously maximizes the empirical accuracy on the labeled images and the variance of the hash bits over all images. The empirical accuracy on the labeled images is defined as The objective function for empirical accuracy can be represented as Then, the empirical accuracy is presented as Moreover, to maximize the information provided by each bit, the variance of the hash bits over all data is also measured and taken as a regularization term: Maximizing the above function with respect to is still hard because it is nondifferentiable. As the maximum variance of a hash function is lower bounded by the scaled variance of the projected data, the information theoretic regularization is represented as
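In explicit notation, the quantities described in this subsection can be sketched as follows. This is a reconstruction under assumptions, following the general form of the semi-supervised hashing objective referenced as [26] in the next paragraph. The symbols are chosen here for illustration and may differ from the original equations: $X$ is the feature matrix of all $n$ images, $X_l$ its labeled columns, $\mathcal{M}$ and $\mathcal{C}$ the sets of similar and dissimilar pairs, $S$ the pairwise label matrix, $w_k$ and $W = [w_1, \dots, w_K]$ the projections, and $\beta$ a positive scaling constant.

```latex
\begin{align*}
S_{ij} &=
  \begin{cases}
    1,  & (x_i, x_j) \in \mathcal{M} \\
    -1, & (x_i, x_j) \in \mathcal{C} \\
    0,  & \text{otherwise}
  \end{cases} \\
h_k(x) &= \operatorname{sgn}\!\left(w_k^{\top} x\right), \qquad
H(X) = \operatorname{sgn}\!\left(W^{\top} X\right) \\
J(H) &= \sum_{k=1}^{K} \Bigg[ \sum_{(x_i, x_j) \in \mathcal{M}} h_k(x_i)\, h_k(x_j)
        - \sum_{(x_i, x_j) \in \mathcal{C}} h_k(x_i)\, h_k(x_j) \Bigg]
      = \tfrac{1}{2} \operatorname{tr}\!\left\{ H(X_l)\, S\, H(X_l)^{\top} \right\} \\
J(W) &= \tfrac{1}{2} \operatorname{tr}\!\left\{ W^{\top} X_l\, S\, X_l^{\top} W \right\}
      \quad \text{(sign function dropped)} \\
R(W) &= \sum_{k=1}^{K} \operatorname{var}\!\left[ h_k(x) \right]
      \;\longrightarrow\;
      \frac{1}{\beta} \sum_{k=1}^{K} \mathbb{E}\!\left[ \lVert w_k^{\top} x \rVert^{2} \right]
      = \frac{1}{n\beta} \operatorname{tr}\!\left\{ W^{\top} X X^{\top} W \right\}
\end{align*}
```

The factor $\tfrac{1}{2}$ in the trace form accounts for each unordered pair appearing twice in the symmetric matrix $S$, and the arrow marks the replacement of the bit variance by its lower-bound surrogate on zero-centered data.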

Finally, the overall semi-supervised objective function combines the relaxed empirical fitness term from (11) and the regularization term from (13). We get the following optimization problem [26]: with where , is a tradeoff parameter, and the constraint makes the projection directions orthogonal. The optimal projections can be obtained by eigenvalue decomposition of the matrix .
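A minimal Python sketch of this learning step is given below. It assumes the usual semi-supervised hashing form, in which the matrix to decompose is `M = X_l S X_l^T + eta * X X^T` and the projections are its top eigenvectors. The function names, the argument layout (one feature vector per column), and the default value of `eta` are hypothetical choices for this illustration.

```python
import numpy as np


def learn_projections(X, X_l, S, n_bits, eta=1.0):
    """Return a D x n_bits projection matrix W with orthonormal columns.

    X   : D x n matrix of all zero-centered feature vectors (one per column)
    X_l : D x l matrix of the labeled feature vectors
    S   : l x l symmetric pairwise label matrix (+1 similar, -1 dissimilar, 0 unknown)
    eta : tradeoff weight on the variance regularizer (hypothetical default)
    """
    M = X_l @ S @ X_l.T + eta * (X @ X.T)   # assumed "adjusted covariance" matrix
    M = 0.5 * (M + M.T)                     # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)    # eigenvalues come back in ascending order
    top = np.argsort(eigvals)[::-1][:n_bits]
    return eigvecs[:, top]                  # eigenvectors of the n_bits largest eigenvalues


def hash_codes(W, X):
    """Binarize the projections: one n_bits-long {0, 1} code per column of X."""
    return (W.T @ X > 0).astype(np.uint8).T
```

Because `np.linalg.eigh` returns orthonormal eigenvectors for a symmetric matrix, the orthogonality constraint on the projection directions is satisfied by construction. The sketch requires `n_bits` to be no larger than the feature dimension.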

2.3. Perceptual Saliency

Image forgeries are often created by combining several images through object adding, deleting, replacing, etc., operations that are highly relevant to human perception. In other words, these object modifications usually affect the perceptual saliency of the corresponding image. In this paper, we refer to the variation in the saliency map between the trusted and test images as perceptual saliency and treat it as a hint of tampering. Therefore, in our proposed method, we compute the image hashing by considering the perceptual saliency effect rather than deriving the hash from the whole image directly. Following [27], we adopt the structured matrix decomposition (SMD) model, which treats (salient) foreground/background separation as a problem of low-rank and structured-sparse matrix decomposition, to compute the saliency map of a given image.

Given the feature matrix of an input image, it can be decomposed into a low-rank matrix corresponding to the nonsalient background and a sparse matrix corresponding to the salient foreground objects. The structured matrix decomposition model can be formulated as where is a low-rank constraint that allows identification of the intrinsic feature subspace of the redundant background patches, is the structured-sparsity regularization, is the Laplacian regularization, and and are positive tradeoff parameters.
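For orientation, the decomposition described in this paragraph can be written in the following form. The notation is assumed for this sketch and follows the general shape of the SMD model of [27]: $F$ is the feature matrix, $L$ the low-rank part, $S$ the sparse part (unrelated to any pairwise label matrix), $\Psi$ the low-rank constraint, $\Omega$ the structured-sparsity regularization, $\Theta$ the Laplacian regularization, and $\alpha$, $\beta$ the positive tradeoff parameters.

```latex
\min_{L,\,S} \;\; \Psi(L) + \alpha\,\Omega(S) + \beta\,\Theta(L, S)
\qquad \text{s.t.} \qquad F = L + S
```

The saliency of a region is then read off from the magnitude of the corresponding entries of the sparse part, which is why object insertion or removal tends to change the saliency map.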

2.4. Tamper Detection

For tamper detection, a forensic hash should be calculated from a trusted image and sent to a destination after encoding. The original image is divided into overlapping, pseudo-randomly selected rectangular regions . For each region, we extract type of feature matrix and obtain the corresponding hashing code . Likewise, the same procedure is applied to the test image to calculate the hashing code with respect to feature matrix . Considering the perceptual saliency, the metric distance between two hashing codes is calculated by where is the number of randomly selected regions and and are the saliency weights for each region of the original and tampered images. We finally find the distance that leads to the highest difference value and call it , which is obtained by

Finally, the threshold is defined to judge whether the test image is a similar image or a tampered image. The metric distance threshold for tamper detection in our method is set to 0.16. We examine the probability distribution of the detection results under varying thresholds on our newly created database and determine the value empirically.
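The detection procedure of this subsection can be sketched in Python as follows. The exact weighting formula is not reproduced here, so the rule used below (scaling each region's normalized Hamming distance by the absolute change in its saliency weight) is an assumption made for illustration, as is the convention that a distance above the threshold means "tampered". All function names are hypothetical; only the threshold value 0.16 comes from the text.

```python
from typing import Callable, Sequence

THRESHOLD = 0.16  # metric distance threshold quoted in the text


def hamming(a: Sequence[int], b: Sequence[int]) -> float:
    """Normalized Hamming distance between two equal-length bit sequences."""
    if len(a) != len(b):
        raise ValueError("hashing codes must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def saliency_gap(w: float, w_test: float) -> float:
    """Assumed weighting rule: absolute change in a region's saliency weight."""
    return abs(w - w_test)


def max_weighted_distance(hashes, hashes_test, weights, weights_test,
                          weight_fn: Callable[[float, float], float] = saliency_gap) -> float:
    """Largest saliency-weighted distance over all corresponding region pairs."""
    return max(
        weight_fn(w, wt) * hamming(h, ht)
        for h, ht, w, wt in zip(hashes, hashes_test, weights, weights_test)
    )


def is_tampered(distance: float, threshold: float = THRESHOLD) -> bool:
    """Flag the test image as tampered when the distance exceeds the threshold."""
    return distance > threshold
```

Passing a different `weight_fn` makes it easy to try other weighting rules without touching the rest of the pipeline.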

3. Experiments

3.1. Experiment Setting
3.1.1. Dataset

In our experiments, we employ four real-world datasets for evaluation. Our training dataset is generated on the basis of Kodak (http://r0k.us/graphics/kodak/). We adopt 18 thousand unique images generated from Kodak as our training data and randomly sample 5K images as the labeled subset. The training set includes similar images with different types of content-preserving attacks and tampered images with a particular logo inserted. For each image, the ground-truth similar images are derived from the index label; i.e., images with the same index are deemed to be similar. For testing, the three other real-world datasets are CASIA v1.0 [26], the Realistic Tampering Dataset (RTD) [28, 29], and our newly created database. CASIA consists of 800 original images and 921 tampered images, which are in JPEG format, with a size of , and belong to various categories according to their content: scene, animal, architecture, character, plant, article, nature, texture, etc. RTD consists of 220 original images and 220 tampered images, which are in TIFF format, with a size of . Our newly created database includes 280 original color images and 280 tampered images with a size of about . The tampered images are generated by changing the colors of scene elements, inserting objects into or deleting objects from the source images, and substituting the image background.

3.1.2. Metric and Parameter Setting

For algorithm parameters, we set the dimension of the test images , hashing learning parameter , and metric distance parameter for all datasets. For each image, we extract three types of features (view number ): wavelet [1], SVD [8], and statistical [11] features as multiview observations. In this paper, we use the probability of true authentication as the evaluation metric for all algorithms; it is the ratio of similar (or tampered) images judged as similar (or tampered) to the total number of images of the corresponding type.
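The evaluation metric amounts to a per-class recall, which can be sketched in a few lines of Python. The function name and the string labels are hypothetical and used only for this illustration.

```python
def true_authentication_probability(predicted, actual, label):
    """Fraction of images whose true class is `label` that were also judged as `label`."""
    relevant = [p for p, a in zip(predicted, actual) if a == label]
    if not relevant:
        raise ValueError(f"no images with true label {label!r}")
    return sum(p == label for p in relevant) / len(relevant)
```

Evaluated with `label="similar"` the value reflects robustness, and with `label="tampered"` it reflects discrimination.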

We evaluate image authentication performance using CASIA, the Realistic Tampering Dataset (RTD), and the newly created database. In addition to simple tampering (TP) of image content, six types of content-preserving attacks are performed to verify the robustness of our proposed method: scaling with a percentage of 1.5, JPEG compression with a quality factor of 50, sharpening with a value of 0.49, Gaussian blurring with a filter size of 3 and a filter standard deviation of 10, motion blurring with a linear motion amount of 3 and a filter angle of 45, and salt & pepper noise with a noise density of 0.005.
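As an illustration of how one of these content-preserving attacks can be simulated, the following Python sketch applies salt & pepper noise at the stated density of 0.005. It is an assumption-laden stand-in: it corrupts an exact number of pixels in a grayscale `uint8` image, whereas toolbox implementations typically corrupt each pixel independently with the given probability. The function name and the `seed` argument are hypothetical.

```python
import numpy as np


def salt_and_pepper(image, density=0.005, seed=None):
    """Return a copy of a uint8 grayscale image with a `density` share of pixels set to 0 or 255."""
    rng = np.random.default_rng(seed)
    noisy = image.copy()
    n_corrupt = int(round(density * image.size))
    # Pick distinct pixel positions, then assign each one black or white at random.
    idx = rng.choice(image.size, size=n_corrupt, replace=False)
    values = rng.choice(np.array([0, 255], dtype=image.dtype), size=n_corrupt)
    noisy.flat[idx] = values
    return noisy
```

A robust hash should change very little between `image` and `salt_and_pepper(image)`, so the resulting pair would count as perceptually similar.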

For threshold determination, we analyze the probability distribution of the authentication results with varying threshold . As shown in Figure 2, the solid and dashed lines indicate the probability distributions of similar images and tampered images, respectively, under six types of content-preserving manipulations. The two distributions approximately intersect at 0.16, the threshold value adopted in Section 2.4. Therefore, in our experiments, we set the threshold to 0.16 to distinguish similar images from forged images. For the compared methods, we tune all parameters for best performance.
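One way to read this selection rule is as a search for the threshold at which the true-authentication rates of the two classes cross. The Python sketch below implements that reading; it is an interpretation for illustration, not necessarily the procedure behind Figure 2, and the function name is hypothetical.

```python
def crossing_threshold(similar_distances, tampered_distances, candidates):
    """Pick the candidate threshold where the two true-authentication rates are closest."""
    def rates(t):
        # Similar pairs are authenticated correctly when their distance is at most t,
        # tampered pairs when their distance is above t.
        similar_ok = sum(d <= t for d in similar_distances) / len(similar_distances)
        tampered_ok = sum(d > t for d in tampered_distances) / len(tampered_distances)
        return similar_ok, tampered_ok

    return min(candidates, key=lambda t: abs(rates(t)[0] - rates(t)[1]))
```

For example, with made-up distances `[0.02, 0.05, 0.08, 0.12, 0.20]` for similar pairs and `[0.10, 0.18, 0.25, 0.30, 0.40]` for tampered pairs, the candidates `[0.05, 0.10, 0.16, 0.25, 0.35]` yield 0.16, where both rates equal 0.8.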

3.2. Perceptual Saliency Analysis

To evaluate the impact of perceptual saliency on tamper detection results, Figure 3 shows some examples of metric distance for tamper detection. Here, we extract the image saliency map and then randomly select ten regions () with size . Accuracy improves with smaller region size and a larger number of regions. However, this also leads to a longer hashing code, which decreases memory efficiency. We set the final parameter values by making a tradeoff between accuracy and efficiency. For each region, we resize it to and compute the hashing code using (12). As shown in Figure 3, the modifications to the original images (columns (a) and (c)) are effectively mapped into the saliency maps (columns (b) and (d)). For example, the semantic content of region six () of image A is modified by adding a red flower, leading to a higher perceptual difference between the two images. As shown in (13), we mark such differences and take them as weights for the final hashing distance computation. Likewise, for object deletion and modification, regions and also reflect such tampering. Figure 4 illustrates the hashing distance for different regions with perceptual saliency, corresponding to the three images in Figure 3. Our perceptual saliency design for image hashing fully considers and strengthens the impact of local features. Figure 5 shows the probability of true authentication for tamper detection on the newly created database with and without perceptual saliency under different threshold settings.
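The pseudo-random region selection used here can be sketched in Python as follows. The sketch assumes that the same secret key is used for the trusted and the test image so that both are cut into identical regions. The function name, the `key` argument, and the box layout `(top, left, bottom, right)` are hypothetical; the region size is left as a parameter.

```python
import random


def select_regions(height, width, region_h, region_w, count=10, key=0):
    """Return `count` (top, left, bottom, right) boxes drawn from a keyed pseudo-random generator."""
    if region_h > height or region_w > width:
        raise ValueError("region is larger than the image")
    rng = random.Random(key)  # the same key reproduces the same regions
    boxes = []
    for _ in range(count):
        top = rng.randint(0, height - region_h)    # inclusive bounds keep the box inside
        left = rng.randint(0, width - region_w)
        boxes.append((top, left, top + region_h, left + region_w))
    return boxes
```

Because the boxes are drawn independently, they may overlap, and increasing `count` or shrinking the region size trades hashing code length for localization accuracy.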

3.3. Comparison Results

We compare our method with the following baselines. Wavelet-based image hashing [1] develops an image hash based on an image statistics vector extracted from the various subbands of a wavelet decomposition of the image. SVD-based image hashing [8] uses spectral matrix invariants as embodied by the singular value decomposition. RPIVD-based image hashing [11] incorporates ring partition and invariant vector distance into the image hashing algorithm to enhance rotation robustness and discriminative capability. Quaternion-based image hashing [5] constructs a quaternion image, which combines the advantages of both color and structural features, and applies the quaternion Fourier transform to generate the image feature hash. We report tamper detection results, since these are the focus of our work. Table 1 shows the probability of true authentication of the proposed method compared with the methods proposed in [1, 5, 8, 11] under different content-preserving attacks on the three databases. Note that the region size in our method is set as . We use the probability of true authentication to comprehensively evaluate performance. For similar images, it records the ratio of similar images judged as similar to the total number of similar images, which indicates the robustness of the algorithm. For tampered images, it records the ratio of tampered images judged as tampered to the total number of tampered images, which indicates the discrimination of the algorithm. We conducted many experiments and calculated the corresponding results under various attacks. Note that higher values indicate better performance for all metrics. Overall, our approach outperforms all the baselines. For tamper detection, including removal, insertion, and replacement of objects, color modification, and background substitution, our method outperforms the other methods, especially under various attacks. It should be noted that, for all experiments, we set our hashing length to 64 digits, which is relatively short compared with other methods.

3.4. Complexity Analysis

The complexity analysis of the proposed image hashing algorithm covers semi-supervised learning, saliency map generation, and tamper detection. Semi-supervised learning is the most time-consuming step in our method, because most of the time is spent on learning . We sample a subset of the training items (e.g., containing items). The pairwise similarity preservation considers the similarities of all pairs of items in the subset. The time complexity is . is the number of hash code and is the dimension of image feature. It is important to note that the semi-supervised learning process is an offline procedure, and the resulting optimal projections W are then fixed for the whole procedure of the proposed method. In practice, the training procedure was carried out with nonoptimized MATLAB code on a regular personal computer, so it can be run in advance by any user on a personal computer with a common configuration. For the proposed scheme itself, the computational complexity mainly depends on the saliency map and hashing distance computations. For saliency map generation, the complexity depends on the saliency detection algorithm. The current fast model takes about 0.017 seconds per image. For tamper detection, our algorithm efficiently produces a sequence ordered by increasing distance between the original and tampered images. The time complexity is .

3.5. Comparison with Deep Learning Based Methods

For current forensics applications, Chen et al. [15] and Qian et al. [16] propose median filtering detection and steganalysis methods based on convolutional neural networks (CNNs). Likewise, Bayar et al. [17] propose image manipulation detection using deep learning. All of these methods focus on image manipulations that are content-preserving attacks. In a hashing application, by contrast, the hashing code is robust against a wide range of content-preserving attacks but sensitive to malicious manipulations. Bondi et al. [18] and Yarlagadda et al. [19] propose tampering detection and localization algorithms. However, these algorithms are not based on hashing. In image hashing based algorithms, the semantic content of the image is represented in a compact signature. Moreover, video forgery detection [20] and camera model identification with CNNs [21, 22] have been proposed. In summary, most currently proposed algorithms either focus on content-preserving image manipulations or are not based on a hashing representation. Our proposed method simultaneously incorporates supervised information and multiple features into the hashing learning and performs tamper detection by considering the perceptual saliency effect among different regions. The comparison with deep learning based methods for image forensics is shown in Table 2.

3.6. Discussion

From the description of our MV-SHPS algorithm and the experimental results, we conclude that three aspects significantly affect a perceptual hashing algorithm.

(1) Learning based image hashing: In our proposed method, we incorporate supervised information into the hashing learning. The experimental results show that data dependent methods with learning can lead to high-quality hashing. The model is trained to fit the data distribution and specific objective functions, which produces better hashing codes that preserve local similarity. Therefore, how to efficiently learn hashing codes from image data is the first topic of great importance for future research.

(2) Image hashing based on multiview embedding: Most of the methods describe image content with a single feature. Since most features are robust against only one or a few types of attacks, it may not be feasible to extract a single, absolutely robust feature that satisfies the needs of users. Therefore, how to efficiently combine different image features to enhance overall performance is the second topic of great importance but less studied in current research.

(3) Image hashing based on saliency detection: Current hashing methods usually obtain detection results by treating each local region equally. Instead of learning the metric distance on the global image, we explore the local hashing distance by considering the perceptual saliency effect among different regions. Therefore, how to efficiently design an image hashing approach based on perceptual saliency is the third topic of great importance for perceptual hashing algorithms for tamper detection.

4. Conclusion

In this paper, we proposed a novel Multi-View Semi-supervised Hashing algorithm with Perceptual Saliency (MV-SHPS). In summary, our proposed method makes the following contributions. First, we simultaneously incorporated supervised information and multiple features into the hashing learning. Second, instead of assuming that only global image hashing contributes to the metric distance of the hashing code for tamper detection, we explored the local hashing distance by considering the perceptual saliency effect among different regions. We performed extensive experiments on three image datasets in comparison with state-of-the-art hashing techniques. Experimental results demonstrated that the proposed semi-supervised hashing with multiview features and perceptual saliency yields superior performance. The current work can be extended with the design of coregularized hashing for multiple features, which is expected to show even better performance.

Data Availability

All data generated or analysed during this study are included in this paper. Of the datasets used in this paper, CASIA v1.0 and RTD can be downloaded from http://forensics.idealtest.org/ and http://kt.agh.edu.pl/~korus/downloads/dataset-realistic-tampering/, respectively. The newly created dataset is available from the corresponding author ([email protected]) on request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant no. 61602344), the Science & Technology Development Fund of Tianjin Education Commission for Higher Education (Grant no. 2017KJ091), and the Natural Science Foundation of Tianjin (Grant no. 17JCQNJC00100).