Abstract

Cloud computing has spread to application areas such as medical imaging and bioinformatics, where it raises concerns about privacy and about tampering with images. Digital images are vulnerable to tampering by interceptors, and falsified information in an image can alter the credibility of individuals. Image tampering detection is the task of identifying and locating the tampered components of an image. Efficient detection requires a sufficient number of features, which deep learning architecture-based models can provide without manual feature extraction. In this research work, we present and implement cloud-based, residual exploitation-based deep learning architectures to detect whether or not an image has been tampered with. The proposed approach is implemented on the publicly available benchmark MICC-F220 dataset with k-fold cross-validation to avoid the overfitting problem and to evaluate the performance metrics.

1. Introduction

Computer vision and its applications are emerging research areas in the field of computing and bioinformatics. The applications include object detection, video analytics, image segmentation, and image pixel analysis. Image forensics and forgery detection are important computer vision applications. It has become easy for online users to digitally tamper with and alter different aspects of images, and the detection of forgery in images is one of the key challenges in computer vision applications such as video surveillance. The different types of tampering include copy-move forgery and splicing. Recent research works on image tampering detection focus on splicing [1] and copy-move methods [2–4]. However, image tampering can be done in many ways and is not restricted to copy-move and splicing methods. In some recent works, residual methods are used for image tampering detection [5–7]. In this paper, the problem of image tampering detection with residual exploitation is addressed with Convolutional Neural Network (CNN) models and a fusion-based approach.

Recent advances in machine learning (ML) and deep learning (DL) technologies are used for the tampering detection of images. ML algorithms identify the underlying relationships in the data in order to analyse it and make decisions. Speech and vision problems use ML algorithms to understand and evaluate the different parameters of speech and vision data [8]. However, these algorithms were limited to making predictions on smaller amounts of data. DL techniques with CNNs became more popular for solving challenging problems in speech and vision [9]. These techniques are used for image segmentation, classification, detection, image context, and retrieval-related tasks [10, 11]. The optimal solutions to computer vision problems obtained with DL have paved the way for solving different types of problems in image tampering detection as well. In this paper, the residual nature of the CNN models is exploited for image tampering detection.

CNNs have given researchers insight into image tampering detection through the feature maps provided at each layer. Initially, CNNs were used to detect whether an image is tampered or not, but not to locate the tampered regions. There have been some attempts to locate the tampered regions using CNNs, but they are not accurate [12]. A nonoverlapping image patch method was used for image segmentation in [13]. However, it fails to identify the tampering when the image size is small. Because such methods use an image patch as part of the input to the network, the contextual information of the image is lost, which leads to incorrect predictions. As the CNN goes deeper, the gradient degradation problem and weak discrimination of features also arise, and weak discrimination of the features again leads to incorrect predictions. In this paper, the traditional image patch extraction method for image tampering detection is replaced with a fusion-based model built on residual nets for optimal image tampering detection.

In this paper, image forgery detection is carried out through a novel decision method based on residual exploitation-based deep learning models. The proposed approach consists of three phases built on the pretrained and fine-tuned residual exploitation-based CNN models, namely, ResNet-18, ResNet-50, and ResNet-101 [5]. In the first phase, features are extracted using the residual exploitation-based CNN models; in the second phase, a machine learning-based classifier is deployed on the extracted features; and in the final phase, the decision outcomes of these residual exploitation-based CNN models are fused to evaluate the accuracy of the model.

The main contributions of this paper are as follows:
(i) A decision fusion-based system is proposed using the CNN-based approach for image tampering detection. The residual exploitation-based CNN models used for the fusion decision are ResNet-18, ResNet-50, and ResNet-101.
(ii) The fusion decision system is implemented in two phases. First, the pretrained weights for the residual exploitation-based CNN models are used to evaluate the tampering of the images. Second, the fine-tuned weights are used to compare the results of the tampering of the images with the pretrained model.
(iii) The utilization of the residual exploitation-based CNN models leads to the reduction of the number of false matches, thereby reducing the false-positive rate and ultimately increasing the accuracy of the approach.

The rest of the paper is organized as follows. Section 2 discusses the related work on image tampering detection methods and on the CNN methods with residual exploitation that are used for image tampering detection. Section 3 proposes the fusion model using the residual exploitation-based CNN models, together with the regularization applied to the fusion model. The experiments and results are discussed in Section 4, followed by the conclusion in Section 5.

2. Related Work

The different areas of research identified in the image tampering detection domain are resampling detection, JPEG artefacts, detection of copy-move operations, splicing, and object removal [14]. Digital content has evolved over time with advances in computer graphics and the internet. These advances are utilized for many applications in computer vision and image recognition [15]. However, the downside is that the same advances can be exploited to create fake images and videos. Current research focuses on identifying forgeries in images using different DL models and techniques. In this section, we discuss related work on image tampering detection methods and on the residual exploitation of CNN models used for image tampering detection.

In copy-move forgery detection, the image is divided into overlapping blocks, and the correlation between blocks is computed to find the cloned blocks. A patch detection-based algorithm [16] was used to approximate the nearest neighbours for forgery detection. Geometrical transformations with invariant features of the image were used for copy-move forgery detection. Local binary pattern (LBP) and steerable pyramid transform (SPT) were used for image forgery detection [17, 18]. These methods rely on the traditional extraction of the tampered regions for forgery detection. However, they fail for images that are small in size and provide inaccurate tampered regions of the image.

DL methods are widely used for image forgery detection in recent works [19, 20]. The image manipulation tasks that are generally considered are generic manipulations, resampling [21], and splicing [20]. The work in [22] used a Gaussian-based CNN for steganalysis. In [23], a stacked autoencoder was used for image tampering detection. Further, a CNN combined with LSTM was used for image tampering detection using the various layers of the CNN. Residual-based networks such as ResNet-50 were used in [15] for image tampering detection with computer-generated images as input.

In 2015, a variant of CNN called U-Net was proposed in [24]. U-Net achieved great success in neuronal structure segmentation, and because its features are propagated among layers, its framework is groundbreaking in that field. The context information in U-Net is captured by successive layers; the output feature is then upsampled and finally combined with the high-resolution features propagated by a symmetric expanding path. This enables precise localization and also reduces the loss of detail information. It led to several image segmentation methods [25, 26] based on U-Net. In image splicing forgery detection, the tampered region of an image usually needs to be segmented out, which is impossible to do with the human eye. Hence, image splicing forgery detection can be understood as a complicated image segmentation task that is independent of the human visual system. Extraction of discriminative features plays a vital role in locating the tampered regions of an image by exposing the differences in image attributes. Even though U-Net can extract relatively shallow discriminative features, only the two sides of the U-Net structure interact, which is not enough to locate the tampered regions. Besides, the gradient degradation problem [26] is observed when the network architecture becomes much deeper.

VLAD [27] is a representation used in image recognition that encodes residual vectors with respect to a dictionary. A probabilistic version of VLAD [27] can be formulated as the Fisher vector [28]. Both representations are powerful for image retrieval and classification. For vector quantization, encoding residual vectors [28] is preferred over encoding the original vectors. The Multigrid method [29] is widely used in computer graphics and low-level vision to solve Partial Differential Equations (PDEs). This method formulates subproblems at multiple scales, where each subproblem gives the residual solution between a finer and a coarser scale. Hierarchical basis preconditioning [30] is an alternative to Multigrid that relies on variables representing the residual vectors between the coarser and finer scales. As standard solvers are unaware of the residual nature of the solutions, they converge more slowly than the Multigrid or Hierarchical basis preconditioning solvers [30]. These methods suggest that preconditioning or a good reformulation can make the optimization easier.

Image classification using deep learning techniques involves the same three steps that are followed in machine learning algorithms for image classification: preprocessing, feature extraction, and classification. First, the input dataset is divided into two sets for training and testing. Both training and testing images are then preprocessed to resize them to the input size of the pretrained network [31]. Next, the preprocessed images are passed through the layers of the network up to the fully connected layer (FC-1000), which extracts the features of the images. The classifier model is trained on the features extracted by FC-1000, and the test images are predicted using the trained classifier model. Naïve Bayes, k-nearest neighbour, and a multiclass model using an SVM learner are the three classifiers used in this model.
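The classification step can be illustrated with a minimal sketch, assuming the feature vectors have already been extracted by the network. The k-nearest-neighbour rule below is one of the three classifiers named above; the function name `knn_predict`, the value k = 3, and the toy three-dimensional features are hypothetical and serve only as an illustration (FC-1000 features would have 1000 dimensions).

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training vectors."""
    # Euclidean distance from the query to every training feature vector.
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    ]
    # Keep the k closest neighbours and vote on their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy stand-ins for extracted features (hypothetical values).
train_features = [
    [0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.7, 0.3, 0.2],  # tampered
    [0.1, 0.9, 0.8], [0.2, 0.8, 0.9], [0.3, 0.7, 0.7],  # authentic
]
train_labels = ["tampered"] * 3 + ["authentic"] * 3

prediction = knn_predict(train_features, train_labels, [0.85, 0.15, 0.15])
```

The same interface (features in, label out) would apply to the Naïve Bayes and SVM classifiers; only the decision rule changes.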

In 2015, the ResNet architecture was proposed [32]; it won first place in the classification task of the ImageNet competition. For a few stacked layers in ResNet, the residual mapping is defined as in Equation (1), $y = F(x, \{W_i\}) + x$, where $x$ represents the input, the operation $F + x$ is performed by a shortcut connection and element-wise addition, and $y$ represents the output. The gradient degradation problem is serious in image splicing forgery detection. It is generally seen in deeper networks; hence, the residual mapping technique is proposed to overcome it. The differences in the essential attributes of an image are hard to discover through a multilayer structure, because the discrimination of these attribute features becomes weaker with depth. To solve this issue and simultaneously strengthen the learning of the CNN, the residual mapping should be utilized more efficiently.
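As a minimal sketch of the residual mapping, the shortcut connection can be written as an element-wise addition of the input to the output of the stacked layers, y = F(x) + x, as in the standard ResNet formulation. The function `toy_transform` is a hypothetical stand-in for the stacked layers F and is not the transformation used in any of the models discussed here.

```python
def residual_block(x, transform):
    """Return y = F(x) + x, where `transform` plays the role of F."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same length as x for the identity shortcut")
    # Shortcut connection: element-wise addition of the input to F(x).
    return [f + xi for f, xi in zip(fx, x)]


def toy_transform(x):
    # Hypothetical F: scale each element by 0.5, then apply ReLU.
    return [max(0.0, 0.5 * xi) for xi in x]


y = residual_block([1.0, -2.0, 4.0], toy_transform)
```

When F outputs zeros, the block reduces to the identity mapping, which is why stacking such blocks does not by itself make the input signal harder to propagate through a deeper network.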

To make full use of the features for detecting tampering and to fuse them, an adaptive attention mechanism and a residual refinement network [33] are used, which are robust to various postprocessing operations such as blur, noise, and JPEG recompression. Residual-based descriptors [34] have proven extremely effective for a number of image forensic applications. In experiments on various datasets, a residual-based fully convolutional network [35] for image tampering detection performed better than some existing methods in generalization ability, localization ability, and robustness against additional operations.

CNNs, with their numerous parameters, weights, layers (spatial filters), and biases, are now widely used for detecting image forgery. The convolution operation in a CNN considers the neighbourhood of the pixels in an image, which results in layers (spatial features) of different sizes. Filters of various sizes capture the image at different levels of granularity: the coarse-grained features of the image are extracted using large filters, while the fine-grained features are extracted using small filters. Various studies on adjusting the size of the filters have been conducted to optimize the performance of CNNs in extracting both coarse-grained and fine-grained features of an image.

The evaluation metrics play a vital role in estimating the tampering in images. There are two types of metrics used for the evaluation, pixel-based and image-based [36]. In the pixel-based method, each pixel is classified as copy-move or authentic, whereas in the image-based method, each image is classified as either tampered or authentic. The measures used at image level are TP (true positive): tampered images are detected as tampered images, TN (true negative): nontampered images are detected as nontampered images, FP (false positive): nontampered images are detected as tampered images, and FN (false negative): tampered images are detected as nontampered images. In this paper, the proposed method uses image-based methods to evaluate the accuracy. Among the existing methods discussed in this section, the CNN model is used to extract the spatial features of the image, which include the geometry, texture, wavelet, and transformations. The weights of the majority of the above-discussed models need to be altered each time for a new dataset of images as they use pretrained weights. In the proposed system, a fusion of decision-making is involved for image tampering detection based on the CNN models. The proposed fusion model is discussed in further sections.

3. Proposed System

The architecture of the proposed fusion system of residual exploitation-based CNN models is shown in Figure 1. The residual exploitation-based CNN models chosen are ResNet-18, ResNet-50, and ResNet-101. The system consists of three stages, namely, data preprocessing, the fusion model, and classification. In the data preprocessing stage, the input image is preprocessed based on the dimensions required by the fusion models. A support vector machine (SVM) is used to classify the image as tampered/forged or not.

The proposed system is implemented in two parts, i.e., pretrained and fine-tuned. In the pretrained implementation, regularization is not applied, and the pretrained weights are used for the classification. Regularization is a set of techniques that can prevent overfitting in neural networks and thus improve the accuracy of convolutional neural network-based models. Thus, to minimize the effect of overfitting, regularization is applied for the classification in the fine-tuned implementation. We first discuss the residual exploitation-based CNN models and then the strategy used for the regularization in the following sections.

3.1. Data Preprocessing

In this stage, the input image that is to be checked for tampering is subjected to preprocessing. The input image dimensions required by ResNet-18, ResNet-50, and ResNet-101 are each 224 × 224. The input image is first preprocessed based on the dimensions required by each of the residual exploitation-based CNN models. Each CNN model then takes the preprocessed image and produces the feature vector in the further stages.
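The resizing step can be sketched as follows, assuming nearest-neighbour sampling on a single channel. The interpolation method is an assumption made for illustration (the preprocessing is described here only in terms of the required input dimensions), and `resize_nearest` and the synthetic 480 × 722 source image are hypothetical.

```python
TARGET_SIZE = 224  # input side length stated for ResNet-18, ResNet-50, and ResNet-101


def resize_nearest(image, height=TARGET_SIZE, width=TARGET_SIZE):
    """Resize a 2D grid of pixel values with nearest-neighbour sampling."""
    src_h, src_w = len(image), len(image[0])
    # Map every target coordinate back to the closest lower source coordinate.
    return [
        [image[(r * src_h) // height][(c * src_w) // width] for c in range(width)]
        for r in range(height)
    ]


# Hypothetical 480 x 722 single-channel image filled with a simple gradient.
source = [[(r + c) % 256 for c in range(722)] for r in range(480)]
resized = resize_nearest(source)
```

For a colour image, the same function would be applied to each of the three channels; in practice an image library with bilinear or bicubic interpolation would typically be used instead.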

3.2. Residual Exploitation-Based CNN Models for Image Classification

The residual exploitation-based CNN models considered for fusion are ResNet-18, ResNet-50, and ResNet-101. These models are widely used for image classification problems and are discussed briefly in this section. The residual deep learning models considered are summarized in Table 1.

3.2.1. ResNet-18

ResNet-18 is an 18-layer-deep CNN trained on the ImageNet dataset that can classify images into up to 1000 categories. The network has learnt rich representations of the images and has 11.7 million parameters. Its image input size is 224-by-224.

3.2.2. ResNet-50

ResNet-50 is a 50-layer-deep CNN trained on the ImageNet dataset that can classify images into up to 1000 categories. The network has learnt rich representations of the images and has 25.6 million parameters. Its image input size is 224-by-224.

3.2.3. ResNet-101

ResNet-101 is a 101-layer-deep CNN trained on the ImageNet dataset that can classify images into up to 1000 categories. The network has learnt rich representations of the images and has 44.6 million parameters. Its image input size is 224-by-224.

3.2.4. SVM

SVM is used as the classifier because it is popular, efficient, and widely used for binary classification problems. The performance of the proposed approach is evaluated at image level by calculating the following performance metrics: precision, false-positive rate (FPR), recall (also known as true positive rate (TPR)), F-score, and accuracy.

3.3. Fusion Model and Regularization

The proposed system is first implemented as a CNN with pretrained weights for the image classification. Afterwards, the proposed system is implemented as a fusion of the residual exploitation-based CNN models discussed in the previous section. Initially, the input image is passed to each of the residual exploitation-based CNN models to obtain one feature map each from ResNet-18, ResNet-50, and ResNet-101. For the fusion model, the pretrained CNN output feature mapping fp is used. This feature map is a combination of the feature maps obtained from the three residual exploitation-based CNN models, as shown in Equation (1).
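One simple way to combine per-model features is concatenation, sketched below. This is an assumption made for illustration: the text states only that the fused feature map is a combination of the three feature maps, so the helper `fuse_features` and the short toy vectors are hypothetical and do not reproduce Equation (1).

```python
def fuse_features(*feature_vectors):
    """Concatenate per-model feature vectors into one fused vector."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


# Hypothetical, very short feature vectors standing in for each model's output.
features_resnet18 = [0.12, 0.80]
features_resnet50 = [0.33, 0.05, 0.61]
features_resnet101 = [0.47]

fused = fuse_features(features_resnet18, features_resnet50, features_resnet101)
```

With concatenation, the fused vector keeps every model's features intact and leaves the weighting between models to the downstream classifier.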

The fusion model uses the fused feature map as a local descriptor of each input patch to extract the features of the image. The image is represented for the fusion model as a function of the patches in the input image. For a test image, a sliding window of fixed size is moved over the image with a given stride, and the local descriptor of each patch is computed as shown in Equation (2). The new image representation is obtained as a concatenation of the descriptors of all the input patches and is given by Equation (3), which depends on the stride used for transforming the input patches. This new image representation is used as the feature map that the SVM classifies as tampered or nontampered. Equation (2) also involves the weights of the shortcut connections from the residual features of the ResNet-18, ResNet-50, and ResNet-101 models.
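The sliding-window step can be sketched as follows, under the assumption of a square window and equal stride in both directions. The names `extract_patches` and `patch_count` and the toy 8 × 10 image are hypothetical; the closed-form count is the usual floor((H - w) / s) + 1 per dimension and is not taken from Equations (2) and (3).

```python
def extract_patches(image, window, stride):
    """Return all window x window patches of a 2D image, scanned with the given stride."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            # Slice the rows first, then the columns, to cut out one patch.
            patch = [row[left:left + window] for row in image[top:top + window]]
            patches.append(patch)
    return patches


def patch_count(height, width, window, stride):
    """Closed-form number of patches produced by extract_patches."""
    if height < window or width < window:
        return 0
    return ((height - window) // stride + 1) * ((width - window) // stride + 1)


# Toy 8 x 10 image whose pixel value encodes its row and column.
image = [[r * 10 + c for c in range(10)] for r in range(8)]
patches = extract_patches(image, window=4, stride=2)
```

Each patch would then be passed through the feature extractor, and the per-patch descriptors concatenated to form the image representation given to the classifier.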

For fine-tuning the parameters of the fusion model, the weight kernels are initialized as shown in Equation (4), which relates the weights of the fusion model to the weights of the ResNet-18, ResNet-50, and ResNet-101 models. The weight of the fusion model is initialized as shown in Equation (5). This initialization of the weights acts as a regularization term and helps the fusion model learn robust features for detecting forgery rather than complex image representations.

4. Experiments and Results

In this section, the experiments and results of the proposed fusion model are discussed. The experiment is carried out in two stages. In the first stage, the residual exploitation-based CNN models are used with the pretrained weights. In the second stage, the fusion model is used with the weight initialization strategy discussed in the previous section. The configuration of the system used for the experiments is shown in Table 2.

4.1. Dataset

The dataset used for the experiment is the publicly available benchmark MICC-F220 [37], which contains 110 nonforged images and 110 forged images with 3 channels, i.e., color images of size 722 × 480 to 800 × 600 pixels, with 10 different combinations of geometrical and transformation attacks as shown in Figures 2 and 3. To avoid the problem of overfitting and to generalize the approach, k-fold cross-validation with k = 5 is used to sample the images for training and testing.
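A 5-fold split of the 220 images can be sketched as below. Stratifying by class and fixing the random seed are assumptions added for illustration; only the value k = 5 is stated for the experiments, and `stratified_k_fold` is a hypothetical helper rather than the sampling code actually used.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs with class balance kept in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the k folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield train, test


# 110 nonforged (0) and 110 forged (1) images, as in MICC-F220.
labels = [0] * 110 + [1] * 110
splits = list(stratified_k_fold(labels, k=5))
```

Under these assumptions each fold holds out 44 images (22 per class) and trains on the remaining 176, so every image is used for testing exactly once.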

4.2. Baseline Models and Metrics

The baseline models used for comparison with the fusion model are summarized as follows.
(i) SIFT: It uses the forensic method of image tampering detection based on a scale invariant features transform (SIFT) approach [37].
(ii) SURF: It uses speeded up robust features (SURF) and hierarchical agglomerative clustering (HAC) for image tampering detection [38].
(iii) DCT: It uses discrete cosine transform (DCT) features for each block and lexicographical sorting of block-wise DCT coefficients for image tampering detection [39].
(iv) PCA: It uses PCA on the image blocks to reduce the dimension space and performs lexicographical sorting for image tampering detection [40].
(v) CSLBP: It uses the center-symmetric local binary pattern (CSLBP) based on the combined features of Hessian points for image tampering detection [41].
(vi) SYMMETRY: It uses the local symmetry value of an image to compute the key points for image tampering detection [42].
(vii) CLUSTERING strategy: It uses SIFT features with a clustering strategy to detect image tampering [43].

The basic metrics used for the evaluation of the fusion model are false-positive rate (FPR), recall (R), precision (P), F-score, and accuracy, as shown in Equations (7) to (10). The confusion matrix is used as the basis for the evaluation of the tampered and nontampered images as shown in Table 3, and the notations used are as follows:
(i) TP: Tampered image detected as tampered.
(ii) FN: Tampered image detected as nontampered.
(iii) FP: Nontampered image detected as tampered.
(iv) TN: Nontampered image detected as nontampered.
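The metrics can be computed from the four confusion-matrix counts as in the sketch below, which uses the standard definitions of precision, recall, FPR, F-score, and accuracy (assumed here to match the equations referenced above). The function name `image_level_metrics` and the example counts for a balanced 220-image set are hypothetical.

```python
def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def image_level_metrics(tp, fn, fp, tn):
    """Compute image-level detection metrics from confusion-matrix counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # also the true positive rate (TPR)
    fpr = _ratio(fp, fp + tn)
    f_score = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(tp + tn, tp + fn + fp + tn)
    return {
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "f_score": f_score,
        "accuracy": accuracy,
    }


# Hypothetical counts: 110 tampered images all detected, 15 of 110 authentic images flagged.
metrics = image_level_metrics(tp=110, fn=0, fp=15, tn=95)
```

Note that FPR is normalized by the number of nontampered images (FP + TN), whereas a confusion matrix expressed as percentages of all images normalizes by the whole dataset, so the two figures differ for the same FP count.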

4.3. Pretrained Residual Exploitation-Based CNN Models

In this section, the results of the pretrained residual exploitation-based CNN models are discussed. The three models, namely, ResNet-18, ResNet-50, and ResNet-101, are used with the pretrained weights for image tampering detection. Table 4 shows the confusion matrix for the ResNet-18 model. The accuracy of the ResNet-18 model is 92.27%: correctly predicted tampered images account for 50% of the images and correctly predicted nontampered images for 42.27%. However, the wrong tampered image prediction is 7.73%. Table 5 shows the confusion matrix for the ResNet-50 model. The accuracy of the ResNet-50 model is 92.27%: correctly predicted tampered images account for 50% and correctly predicted nontampered images for 42.27%. However, the wrong tampered image prediction is 7.73%. Table 6 shows the confusion matrix for the ResNet-101 model. The accuracy of the ResNet-101 model is 91.81%: correctly predicted tampered images account for 50% and correctly predicted nontampered images for 41.82%. However, the wrong nontampered prediction is 8.18%.

The ROC curve is used to estimate the AUC values for the pretrained residual exploitation-based convolutional neural networks as shown in Figure 4. Figure 4(a) represents the ROC curve for the pretrained ResNet-18 model with AUC of 97.57%. Figure 4(b) represents the ROC curve for the pretrained ResNet-50 model with AUC of 97.57%. Figure 4(c) represents the ROC curve for the pretrained ResNet-101 model with AUC of 96.52%.
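For reference, the AUC of a ROC curve can be computed without plotting, as in the sketch below. It relies on the equivalence between the area under the ROC curve and the probability that a randomly chosen tampered image receives a higher score than a randomly chosen nontampered one, with ties counted as one half. The function `roc_auc` and the toy scores are hypothetical and are not the values behind Figure 4.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy classifier scores: label 1 = tampered, label 0 = nontampered.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of images, which is acceptable for a dataset of 220 images; a sort-based rank formulation would be preferable for large datasets.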

4.4. Fine-Tuned Residual Exploitation-Based CNN Models

In this section, the results of the fine-tuned residual exploitation-based models are discussed. The three models, namely, ResNet-18, ResNet-50, and ResNet-101, are used with the fine-tuned weights for image tampering detection. Table 7 shows the confusion matrix for the fine-tuned ResNet-18 model. The accuracy of the fine-tuned ResNet-18 model is 95.0%: correctly predicted tampered images account for 50% of the images and correctly predicted nontampered images for 45.0%. However, the wrong tampered image prediction is 5.0%. Table 8 shows the confusion matrix for the fine-tuned ResNet-50 model. The accuracy of the fine-tuned ResNet-50 model is 90.90%: correctly predicted tampered images account for 50% and correctly predicted nontampered images for 40.91%. However, the wrong tampered image prediction is 9.09%. Table 9 shows the confusion matrix for the fine-tuned ResNet-101 model. The accuracy of the fine-tuned ResNet-101 model is 87.27%: correctly predicted tampered images account for 45.45% and correctly predicted nontampered images for 41.82%. However, the wrong tampered image prediction is 8.18%, and the wrong nontampered image prediction is 4.55%.

The ROC curve is used to estimate the AUC values for the fine-tuned residual exploitation-based models as shown in Figure 5. Figure 5(a) represents the ROC curve for the fine-tuned ResNet-18 model with AUC of 97.0%. Figure 5(b) represents the ROC curve for the fine-tuned ResNet-50 model with AUC of 93.91%. Figure 5(c) represents the ROC curve for the fine-tuned ResNet-101 model with AUC of 92.0%.

4.5. Fusion Model

In this section, the results of the fusion models are discussed. Table 10 shows the confusion matrix for the pretrained fusion model. The accuracy of the pretrained fusion model is 90.91%: correctly predicted tampered images account for 49.55% of the images and correctly predicted nontampered images for 41.36%. However, the wrong tampered image prediction is 8.64%, and the wrong nontampered image prediction is 0.45%.

Table 11 shows the confusion matrix for the fine-tuned fusion model. The accuracy of the fine-tuned fusion model is 93.18%: correctly predicted tampered images account for 50% of the images and correctly predicted nontampered images for 43.18%. However, the wrong tampered image prediction is 6.82%. This percentage of wrong tampered image predictions is lower than that of the pretrained residual exploitation-based models. The accuracy of the fine-tuned fusion model is higher than that of the pretrained fusion model.

4.6. Performance Comparison
4.6.1. Performance Comparison with Pretrained Residual Exploitation-Based Models

In this section, the performance of the fusion model is compared with that of the pretrained residual exploitation-based CNN models. The metrics used for the comparison are precision, recall, F-score, and accuracy. The results of the performance comparison are shown in Table 12. The precision and recall metrics are important for determining the effectiveness of the CNN models. The values in Table 12 were obtained according to Equations (7) to (10).

4.6.2. Performance Comparison with Fine-Tuned Residual Exploitation-Based Models

In this section, the performance of the fusion model is compared with that of the fine-tuned residual exploitation-based CNN models. The metrics used for the comparison are precision, recall, F-score, and accuracy. The results of the performance comparison are shown in Table 13.

It can be observed from the values in Table 13 that the proposed fusion model achieves comparatively higher precision, recall, and F-score than the fine-tuned residual exploitation-based CNN models. The results of the performance comparison of the fusion model with the baseline models are shown in Table 14. The metrics used for the comparison are the FPR and TPR, as they indicate the correctness of the model for image tampering detection. The FPR for baseline 1 [37] is 8%, baseline 2 [38] is 3.64%, baseline 3 [39] is 84%, baseline 4 [40] is 86%, baseline 5 [41] is 2.89%, baseline 6 [42] is 5.45%, baseline 7 [43] is 7.63%, proposed pretrained fusion model is 17.27%, and proposed fine-tuned fusion model is 13.63%. The TPR for baseline 1 [37] is 100%, baseline 2 [38] is 73.64%, baseline 3 [39] is 89%, baseline 4 [40] is 87%, baseline 5 [41] is 96%, baseline 6 [42] is 83.64%, baseline 7 [43] is 97.87%, proposed pretrained fusion model is 99.09%, and proposed fine-tuned fusion model is 100%. Therefore, the fine-tuned fusion model attains a TPR of 100%, equal to that of baseline 1 [37] and higher than those of the other baseline models, which is attributed to the weight initialization strategy used for the fusion model.

5. Conclusion

Image tampering detection helps to differentiate between original images and manipulated or fake images. In this paper, a decision fusion of residual exploitation-based CNN models is implemented for image tampering detection. The idea is to use the residual exploitation-based CNN models, namely, ResNet-18, ResNet-50, and ResNet-101, and then combine these models to decide whether the image has been tampered with. Regularization of the weights of the pretrained models is implemented to arrive at this decision. The experiments carried out indicate that the fusion-based approach gives higher accuracy than the state-of-the-art approaches. In the future, the fusion decision can be improved with other weight initialization strategies for image tampering detection.

Data Availability

The data used for the research is taken from a public repository, and the link is provided in the paper.

Conflicts of Interest

The authors declare that they have no conflicts of interest.