Abstract

The importance of digital image authentication has grown in the last decade, particularly with the widespread availability of digital media and image manipulation tools. As a result, different techniques have been developed to detect fraudulent alterations in digital images and restore the original data. In this paper, a new algorithm is proposed to authenticate images by hiding a copy of the approximation band inside the pixels of the original image. The embedding intensity is decided using a perceptual map that simulates the human visual system and applies higher intensity in areas where the human eye cannot perceive changes. The perceptual map consists of three parts: a luminance mask, a texture mask, and an edge detection mask. Results show a high ability to blindly recover images after different attacks, such as removal and blocking attacks. At the same time, the structural similarity index of the resultant images was higher than 0.99 for all tested images.

1. Introduction

Modifying digital image content has become quick and easy with the widespread adoption of digital image editing applications on personal computers and mobile devices. Forged images frequently appear in reports, science experiments, and even legal evidence such as crime investigations and images of traffic accidents. As a result, we can no longer take the authenticity of images for granted [1]. Therefore, an essential aim of digital image forensics is to authenticate images and to identify areas of an image that have been manipulated or altered in one form or another.

Different algorithms have specialized in validating digital images, and distinguishing original images from forged ones and authenticating digital photographs has become one of the biggest challenges today [2].

Watermarking, the process of hiding information inside original data, can be used to address the problem of authenticating digital images: important image characteristics can be included in the watermark, which is then embedded in the cover image. On the receiver side, the integrity of the image can be verified by comparing the retrieved watermark with the characteristics of the image [3]. A watermark can be considered additive noise; hence, different noise estimation algorithms can be used when the watermark is added to test its invisibility, as in the model described in [4, 5].

An important feature of digital image recovery algorithms is the ability to detect and recover alterations without needing the original image as a reference.

Another feature is the algorithm's simplicity, which allows it to run on devices with limited hardware specifications or in real-time applications.

In this paper, a simplified, reference-free image recovery algorithm is developed based on the integer lifting wavelet transform (LWT) with perceptual mapping. The methodology distributes the approximation band coefficients over the original image pixels with intensities adjusted according to the sensitivity of the human visual system (HVS), calculated using a low-complexity perceptual mapping model. Recovering these coefficients retrieves the untampered approximation band of the original image, which is then used to restore tampered areas. The lifting wavelet transform was chosen for its integer-based calculations, which keep it simple, while it retains key features of the discrete wavelet transform (DWT), such as the parent-child relation and localization [6].

The paper is organized as follows: the literature survey is presented in the following section, and a brief explanation of the LWT is given in the third section. In section four, the methodology of the proposed model is explained, followed by experimental results and comparisons in section five. The paper is concluded in the last section.

2. Literature Review

Different approaches have been used in the literature to detect image and video tampering and recover the original data. These approaches rely on different methodologies, such as transform-domain techniques (DWT, SVD, or DCT) or spatial-domain techniques such as modifying the least significant bits (LSBs) of image pixels.

In [7], the discrete wavelet transform (DWT) was used to embed the features of the digital image, represented by the approximation sub-band, into the high-frequency bands. Although the recovery method was not explained in the paper, such methods typically apply the DWT on the receiver side and retrieve the data from the high-frequency bands. A video-tampering detection algorithm based on feature conversion in the multimedia space and multimedia fusion was proposed in [8]. The features obtained from different intra-/interframe pixel blocks were used to detect any video alteration, allowing the detection of tampering in low-bandwidth videos. This was achieved using a passive tamper detection method, and an approach was introduced to model signatures included in the camera's preprocessing sequence. These characteristics help the detection algorithm, especially when the identity of the camera source is not accessible. To perform tamper detection in the camera, a SIFT (scale-invariant feature transform)-based function was adopted in [9]: the SIFT algorithm was enhanced to create key points, and a SIFT-based image form was established for image representation.

The existence of a watermark is regarded as a critical component in investigating authenticity and data integrity [10]; as a result, it is used in different tamper detection and self-recovery studies. Sarika et al. detected image manipulation using the wavelet transform and singular value decomposition (SVD): the transmitted and retrieved images are compared to determine whether the image has been changed, whereas the recovery of the original image's information is achieved by SVD [11]. Chen Xiaoling and Zhao Huimin [12] developed an algorithm that uses key values, calculated by applying a hash transform to the P-frame table, as the embedded watermark; the watermark was embedded in the P-frame motion vectors. Several surveys summarize studies on forgery detection, as in [13], while issues and challenges in the design of video authentication systems, such as tampering attack categories and robustness, were presented in [14].

SVD was used by Tafti and Hassannia [15], where statistical information from the original image's lower-upper (LU) decomposition and SVD is computed with the addition of cellular automata. This combination is used to construct a cipher key that is unique to the host image and changes if any tampering occurs. In [16], an interframe tampering detection model was proposed in which the Consistency of Correlation Coefficients of Gray Values (CCCoGV) was used as the detection property: in original videos, CCCoGV stays stable, whereas intended alteration of the frame stream results in unrealistic values. In [17], a survey was presented to summarize image recovery studies.

In [18], image blocks were classified into different types, and according to the chosen type, different watermark embedding, tamper detection, and recovery strategies were employed to enhance the efficiency of data hiding. Watermarking was also used in [19] for the Audio/Video Interleaved (AVI) video file format: two different time-domain watermarking techniques were proposed, with each pixel represented by two bytes in the AVI video format.

The work proposed in [20] identified the similarities in the spatial correlations between frames and inside frames. A two-step blind detection algorithm for video using average texture variation (ATV) was proposed: the ATV value of each frame is calculated to obtain the ATV curve of the video, and the curve is then processed further to highlight the properties that estimate the original frame rate. A self-recovery method for tampered regions using chaotic maps was proposed in [21], in which chaotic maps are utilized to generate 2 × 2 image blocks used for authentication.

In [22], blind image tamper detection and self-recovery were presented using the lifting scheme, which is characterized by simplicity, integer-based calculations, and LSB modification. The algorithm in [23] consists of feature extraction and the localization of unusual points: in the feature extraction step, the 2D phase congruency of each frame is obtained, and the unusual points are then detected using the k-means clustering method. In 2019, Rajput and Ansari presented a method in which the original image, after its size was reduced, was copied four times, and the copies were hidden in the 4 LSBs of the original image using four pseudorandom codes; these copies are the references for detecting any tampering in the image [24].

In [25], the methodology obtains the visual content scales of the video frames by the Gaussian pyramid transform and finds the resemblance within a single visual content scale. Using information theory, the normalized mutual information between two frames was determined, and the local outlier factor technique was used to locate tampering. In [26], a coarse-to-fine video manipulation detection method was proposed, combining spatial constraints with a stable characteristic. In the approximate tamper detection stage, the low-motion and high-texture regions are extracted using the spatial restriction criteria; these two regions are combined to obtain rich quantitative correlation regions, which are then used to extract optimal similarity properties from the video. Suspicious tampering points are then found by combining the previous properties.

The research gap in previous works is that few of them take into consideration perceptual maps and the varying sensitivity of the HVS to modifications in different areas. In addition, frequency-domain transformations rely on complex, floating-point calculations that are time-consuming, especially on resource-limited embedded systems. Hence, relatively fast, low-complexity methods that also account for HVS sensitivity are required.

3. Lifting Wavelet Transform

In 1995, the lifting wavelet transform (LWT) was proposed by Sweldens [27] as the second generation of the wavelet transform, constructing wavelet coefficients efficiently with forward, integer-based equations. The LWT implementation has low complexity overhead and requires fewer resources compared with the traditional DWT. Its integer-to-integer computation makes it appropriate for embedded systems, and it can be implemented in three main phases [27, 28].

3.1. Split

In the split step, the digital signal is divided into two smaller subdivisions. In digital images, the image is split into odd- and even-indexed pixels.

3.2. Predict

In the prediction step (as applied to digital images), the value of each pixel at an odd position (referred to as Xo) is predicted from the values of its two neighbors at even positions (each referred to as Xe). The difference between the predicted value and the actual value of the odd pixel is stored at that pixel's position. The signal produced by this step is the detail band (Dn). In flat areas, where pixel intensity varies approximately linearly, the predicted values are close to the actual values, so the detail band coefficients are near zero. However, in locations with wide intensity changes, such as textured surfaces, the detail band coefficients take higher values. The larger the coefficient values in a certain area, the greater the deviation between the pixel values in that area; this feature is exploited by the texture mask. Equation (1) demonstrates the calculation of the prediction step.

3.3. Update

The number of samples in a signal can be halved while keeping its general structure if the average of every two samples is kept instead of the complete signal [29]. For image pixels, each even pixel approximates the average of its two neighboring odd pixels; because pixel intensities do not vary strictly linearly, the value of the pixel in the middle must be updated by the difference found in the prediction step. The signal obtained from the update step is known as the approximation band, since it is similar to the original signal but with half the number of samples (equation (2)). The approximation band is used for edge detection in the perceptual map algorithm.

The LWT steps produce the detail and approximation band coefficients (Figure 1). The original signal is reconstructed using the inverse lifting wavelet transform (ILWT), which applies the same LWT equations in the reverse direction [28], as shown in Figure 2.
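To make the three phases concrete, the following is a minimal sketch of one level of 1-D integer lifting and its inverse. It assumes a 5/3-style predictor (each odd sample predicted from its two even neighbors, as described above) with a matching update step; the exact filter and boundary handling used in equations (1) and (2) may differ from this illustration.

```python
import numpy as np

def lwt_1d(x):
    """One lifting level on a 1-D signal of even length:
    split -> predict -> update (illustrative 5/3-style integer step)."""
    x = np.asarray(x, dtype=np.int64)
    even, odd = x[0::2], x[1::2]            # split into even/odd samples
    right = np.roll(even, -1)               # right even neighbor of each odd
    right[-1] = even[-1]                    # replicate the edge sample
    d = odd - ((even + right) >> 1)         # predict: detail band
    left = np.roll(d, 1)                    # neighboring details for update
    left[0] = d[0]
    a = even + ((left + d + 2) >> 2)        # update: approximation band
    return a, d

def ilwt_1d(a, d):
    """Inverse lifting: the same equations applied in reverse order."""
    left = np.roll(d, 1)
    left[0] = d[0]
    even = a - ((left + d + 2) >> 2)        # undo update
    right = np.roll(even, -1)
    right[-1] = even[-1]
    odd = d + ((even + right) >> 1)         # undo predict
    x = np.empty(even.size + odd.size, dtype=np.int64)
    x[0::2], x[1::2] = even, odd            # merge (inverse of split)
    return x

signal = np.array([52, 60, 61, 58, 40, 37, 120, 123])
a, d = lwt_1d(signal)
assert np.array_equal(ilwt_1d(a, d), signal)  # integer-perfect reconstruction
```

The integer-to-integer round trip shown by the final assertion is the property that makes the lifting scheme attractive for embedded systems.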

4. Methodology

The main contribution of this work is the use of perceptual mapping to efficiently hide the approximation band of the LWT in the original image: the band is hidden with high intensity in places where the human eye cannot perceive the noise, and with lower intensity in places where the human eye is sensitive to changes.

The proposed algorithm consists of two main phases, embedding and recovery. The embedding phase consists of perceptual mapping and band distribution, while the recovery phase consists of retrieving the original band and repairing alterations if they exist.

The methodology of the process is shown in Figure 3.

4.1. Embedding

In the embedding phase, the first step is to find the maximum embedding intensity that can be used without affecting the visual quality of the image; this is done using the perceptual map. Then, the approximation band is distributed with intensities matching that tolerance.

4.1.1. Creating a Perceptual Map

The perceptual map proposed in [30] is reused in this paper, and it consists of the following three factors:

1. Luminance Mask. The HVS is less sensitive to changes that occur in very dark and very bright regions. Hence, in digital images, the embedding weight can be larger at intensities near 0 and 255 than at middle intensities. The luminance mask is given by

LM is the luminance mask, and S(i, j) is the intensity of the approximation band coefficient at location (i, j).
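Since the exact mask formula from [30] is not reproduced above, the following is an illustrative sketch of a luminance mask with the stated behavior: the weight grows as S(i, j) moves away from mid-gray toward 0 or 255. The weight range (w_min, w_max) is an assumed parameter, not a value from [30].

```python
import numpy as np

def luminance_mask(S, w_min=1.0, w_max=4.0):
    """Illustrative luminance mask (not the exact formula of [30]):
    the embedding weight is smallest at mid-gray (128), where the HVS
    is most sensitive, and largest near intensities 0 and 255."""
    S = np.asarray(S, dtype=np.float64)
    distance = np.abs(S - 128.0) / 128.0   # 0 at mid-gray, 1 at extremes
    return w_min + (w_max - w_min) * distance
```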

2. Texture Masking. A large variance in the intensity distribution consistently indicates greater texture, and higher texture means lower sensitivity for the human eye. Accordingly, a higher noise tolerance weight is assigned to highly textured areas.

Figure 4 depicts the detail band calculation: the detail band of the lifting wavelet transform is divided into blocks of 5 × 5 coefficients. Adjacent values in each row are subtracted from each other, and the absolute values of the subtraction results are accumulated; hence, the model is referred to as Accumulative Lifting Difference (ALD) [30]. Each block thus receives a value that indicates the amount of its texture. The ALD is given in equation (4):

I and J are the center coefficients of each 5 × 5 block in the detail band D2, and i and j denote the coefficient's index within the block.
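A minimal sketch of the ALD computation as described above follows; the boundary handling and any normalization used in [30] are assumptions here.

```python
import numpy as np

def ald_map(D2, block=5):
    """Accumulative Lifting Difference sketch: for each 5 x 5 block of
    the detail band D2, accumulate the absolute differences between
    horizontally adjacent coefficients as a per-block texture score."""
    h, w = D2.shape
    ald = np.zeros((h // block, w // block))
    for bi in range(h // block):
        for bj in range(w // block):
            blk = D2[bi * block:(bi + 1) * block,
                     bj * block:(bj + 1) * block]
            # row-wise adjacent differences, accumulated as absolute values
            ald[bi, bj] = np.abs(np.diff(blk, axis=1)).sum()
    return ald
```

Blocks with large ALD values are textured and can tolerate stronger embedding, subject to the edge elimination step below.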

3. Edges Elimination. The human eye is more sensitive to changes near edges. Hence, an edge detection algorithm is used to detect edges and exclude them from the texture areas.

Four kernels were used in [30] for edge detection, as shown in Figure 5, where (a) and (b) are used to extract the horizontal and vertical edges, and (c) and (d) are used to extract the diagonal edges.
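The exact kernels of Figure 5 are not reproduced here; the sketch below uses generic 3 × 3 difference kernels for the four orientations and an assumed threshold, only to illustrate how an edge mask can be formed and excluded from the texture areas.

```python
import numpy as np
from scipy.ndimage import convolve

# Stand-in orientation kernels (not the kernels of Figure 5):
# horizontal, vertical, and the two diagonals.
KERNELS = [
    np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]]),
    np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]),
    np.array([[0, 1, 1], [-1, 0, 1], [-1, -1, 0]]),
    np.array([[1, 1, 0], [1, 0, -1], [0, -1, -1]]),
]

def edge_mask(approx, threshold=32):
    """Mark a coefficient as an edge when any orientation response is
    strong; edge locations are then excluded from the texture mask."""
    responses = [np.abs(convolve(approx.astype(float), k)) for k in KERNELS]
    return np.maximum.reduce(responses) > threshold
```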

4.1.2. Band Distribution

After reading the image, three operations are applied to it. The first is applying three LWT decompositions to obtain the approximation band of the third decomposition. This band holds an approximate image whose size is 1/12 of the original one, and it is distributed as binary slices in the original image.

The second operation is to convert the image into a one-dimensional array in which each pixel is replaced by the average of its two neighboring pixels; for instance, pixel 2 is the average of pixels 1 and 3. Then, a certain value is added to or subtracted from that average: the magnitude of the addition or subtraction is decided by the value of the perceptual map at that location, while the choice between addition and subtraction is decided by the binary digit in the slice. If the bit value in the slice is 0, the value is subtracted; otherwise, it is added (Figure 6).
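The following sketch illustrates the distribution step on a 1-D pixel array. The choice of odd indices as carrier positions is an assumption made here so that the receiver can recompute the same neighbor averages from unmodified pixels; the actual traversal and slice ordering in the proposed scheme may differ.

```python
import numpy as np

def embed_slice(pixels, bits, pmap):
    """Replace each carrier pixel with the average of its two neighbors,
    plus the perceptual-map weight when the bit is 1 and minus it when
    the bit is 0 (carriers at odd indices are an illustrative choice)."""
    out = pixels.astype(np.int32)
    for n, (bit, w) in enumerate(zip(bits, pmap)):
        k = 2 * n + 1                                  # carrier position
        avg = (int(pixels[k - 1]) + int(pixels[k + 1])) // 2
        out[k] = avg + w if bit else avg - w           # encode one bit
    return np.clip(out, 0, 255).astype(np.uint8)
```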

4.2. Recovery

The recovery process starts by extracting the hidden approximation band from the LWT coefficients. The extracted approximation band is then compared with the band decomposed from the target image, which may have been altered: the difference between the hidden approximation band and the decomposed one is marked for each pixel, and the decomposed band is then replaced by the hidden one. Eventually, two images are obtained: a small image, which is the recovered approximation band, and a full-size image in which the ILWT is applied so that the hidden approximation band replaces the altered one (Figure 7).
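A sketch of the two recovery steps follows, mirroring the embedding sketch above: bit extraction reads each bit from the sign of the carrier's deviation from its neighbor average, and tamper detection marks the coefficients where the two bands disagree. The detection threshold is an assumed parameter for absorbing embedding noise.

```python
import numpy as np

def extract_slice(pixels, n_bits):
    """Blind extraction: recompute each carrier's neighbor average and
    read the bit from the sign of the deviation (mirrors embed_slice)."""
    bits = np.zeros(n_bits, dtype=np.uint8)
    for n in range(n_bits):
        k = 2 * n + 1
        avg = (int(pixels[k - 1]) + int(pixels[k + 1])) // 2
        bits[n] = 1 if int(pixels[k]) >= avg else 0
    return bits

def detect_and_recover(hidden_band, decomposed_band, threshold=8):
    """Mark coefficients where the band decomposed from the (possibly
    tampered) image deviates from the hidden band, then keep the hidden
    band as the trusted approximation for the ILWT reconstruction."""
    diff = np.abs(hidden_band.astype(int) - decomposed_band.astype(int))
    tamper_map = diff > threshold
    return tamper_map, hidden_band
```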

5. Results

Experimental results were obtained on a Core i7 CPU running Windows 10. The approximate execution times of the embedding and distribution processes were 1.19 and 1.23 seconds, respectively; however, these values do not reflect the pure execution time, as the algorithm runs alongside the operating system and other applications. The results are divided into three subsections: the perceptual quality of the resultant images, the ability to recover images after attacks, and a comparison showing the impact of using the perceptual map.

5.1. Perceptual Quality

Two measurements were used. PSNR is a simple assessment metric that evaluates image quality by comparing each pixel intensity in the altered image with its counterpart in the original one. It is calculated as a logarithmic transformation of the mean squared error (MSE), which measures the amount of noise added to the image. The PSNR and MSE equations are given as

\[ \mathrm{MSE} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\big(X(i,j)-Y(i,j)\big)^{2}, \qquad \mathrm{PSNR} = 10\log_{10}\!\left(\frac{255^{2}}{\mathrm{MSE}}\right). \]

X and Y are the original and altered images, respectively, with dimensions m and n, and i and j denote the pixel index. PSNR has the drawback that it does not take into account the bias of the human eye when observing the same amount of noise in different image structures. On the other hand, SSIM [31] yields more realistic values and gives better performance than PSNR [32], as it considers three components: luminance, contrast, and structural information. SSIM assigns a specific equation to each component: the luminance of a digital image can be estimated as a function of the mean intensity, the contrast as a function of the standard deviation, and the structural information can be extracted after luminance subtraction and variance normalization. The factors are then combined into a single equation:

\[ \mathrm{SSIM}(X,Y) = \frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})}, \]

where X and Y are two nonnegative image signals, the original and altered images, respectively; µx and µy are the mean intensities; σx and σy are the standard deviations; σxy is the covariance between the two images; and C1 and C2 are constants. The SSIM factors simulate human eye observation better than the intensity differences used in PSNR. The PSNR and SSIM values obtained for the 10 tested images (thumbnails are shown in Figure 8) are listed in Table 1.
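Both metrics are available off the shelf; as a quick sanity check, the following uses scikit-image (the images here are random stand-ins, not the test set of Figure 8):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
original = rng.integers(0, 256, (512, 512), dtype=np.uint8)   # stand-in image
noise = rng.integers(-2, 3, original.shape)                   # mild distortion
altered = np.clip(original.astype(int) + noise, 0, 255).astype(np.uint8)

psnr = peak_signal_noise_ratio(original, altered, data_range=255)
ssim = structural_similarity(original, altered, data_range=255)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```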

As can be noticed in Table 1, the objective metrics represented by PSNR and SSIM are high. All SSIM values are greater than 0.99, which denotes high-quality images; SSIM is the more meaningful value here, as it accounts for perceptual structure rather than raw intensity differences.

5.2. Recovery Results

Figure 9 shows the retrieved data after different attacks. Column (a) shows the original images, column (b) shows the block attack, and column (c) shows the recovery after the block attack, while columns (d) and (e) display the clone attack and the recovery after the clone attack, respectively. Column (f) shows the removal attack, and the recovery after the removal attack is shown in column (g). The recovery results show that the proposed algorithm can successfully retrieve the missing or altered data, as important information such as a blocked car plate number, removed face details, and removed text was clearly retrieved.

5.3. Perceptual Mask Effect Comparison

To show the effectiveness of using perceptual maps, the approximation band distribution was performed once with the perceptual mask and once without it. Then, a removal attack was applied in the case where no perceptual mask was used (with the embedding strength kept at 1 in all cases), and the retrieved image was compared with the one retrieved using the perceptual mask. The PSNR and SSIM values of the retrieved images are compared in Figures 10 and 11, respectively. As noticed, the PSNR and SSIM of most retrieved images improve when perceptual mapping is used, and the improvement also appears in subjective evaluation. For instance, the restoration of the car plate number was not clear when perceptual maps were not used, as shown in Figure 12.

6. Conclusion

A perceptual mapping-based image tamper detection and retrieval method was presented. The LWT was efficiently utilized in two operations: band distribution and texture masking. For band distribution, the approximation band was used to hide the important features of the image within the image pixels, while the texture mask, together with the edge detection and luminance masks, forms a perceptual mask that decides the embedding strength. The obtained results show high-quality images after distribution, with SSIM above 0.99. Moreover, the manipulated data could be retrieved after blocking, copying, and removal attacks. In comparison with a previous study, using perceptual masks increases the perceptual quality of the images and enhances the retrieved images after alteration. In future work, the proposed method can be combined with deep learning to obtain better recovered images and decrease noise. Also, a region of interest can be located so that only a part of the image is embedded instead of all of it, which reduces the payload and allows multiple copies of the image to be embedded.

Data Availability

The new algorithm is supported by equations and results. Further details can be obtained from the first author on request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is an enhancement of the manuscript presented at the 1st International Conference on Engineering and Technology (ICoEngTech) 2021, 15–16 March 2021, Perlis, Malaysia: https://iopscience.iop.org/article/10.1088/1742-6596/1962/1/012021.