Abstract

With the continuous development of deep learning techniques, it is now easy for anyone to swap faces in videos. Researchers find that the abuse of these techniques threatens cyberspace security; thus, face forgery detection has become a popular research topic. However, current detection methods do not fully use the semantic features of deepfake videos. Most previous work has divided semantic features, whose importance may be unequal, only by experimental experience. To solve this problem, we propose a new framework, the multisemantic pathway network (MSPNN), for fake face detection. This method comprehensively captures forged information at the microscopic, mesoscopic, and macroscopic feature levels, and these three kinds of semantic information are given learnable weights. Because the artifacts of deepfake images are harder to observe in a compressed video, a preprocessing step is also proposed for detecting low-quality deepfake videos, consisting of multiscale detail enhancement and channel information screening based on the compression principle. Center loss and cross-entropy loss are combined to further reduce intraclass spacing. Experimental results show that MSPNN outperforms the compared methods, especially on low-quality deepfake video detection.

1. Introduction

Automated video editing techniques have made great strides in the past few years with the development of deep learning. In particular, people have shown great interest in face manipulation. It is now easy to transfer facial expressions from one video to another based on generative adversarial networks (GANs) and autoencoders [1]. Even those who do not know deep learning can easily change one person’s face to another in a few minutes [2], and a fake face is difficult for human eyes to distinguish. It is easy to change who the speaker is or what is said. While deepfake techniques bring benefits, there are hidden dangers.

These techniques open a new window for film and television. For example, dead movie stars can reappear through face manipulation, and people who do not exist in the real world can be created through GANs. However, malicious uses such as targeted attacks and revenge porn are also part of face manipulation, and it even influences politics, such as by tampering with speech content and spreading fake news [3]. As a result, deepfake videos have attracted the interest of researchers, and methods to detect whether a face has been manipulated have become paramount.

Deepfake videos can exhibit forgery characteristics at three levels: microscopic, mesoscopic, and macroscopic. Microscopic features correspond to differences that are hard to see, such as anomalies in small regions. Macroscopic or semantic features refer to whole-image semantics that the human eye can perceive. Mesoscopic features lie in between. Afchar et al. [4] designed MesoNet to detect mesoscopic features. Current deepfake video detection methods do not take full advantage of these three levels of features; usually, authenticity discrimination has been based only on high-level semantic features, and the performance needs improvement. It is possible to design a network that integrates the three levels for deepfake detection. However, a semantic division method that is guaranteed to improve accuracy has not yet been proposed; similar work has divided the feature hierarchy based on practical experience. In addition, it is uncertain whether the three hierarchical feature levels should carry the same weight.

Deepfake video detection methods have achieved accuracy of nearly 100% on high-quality videos, but their accuracy on low-quality videos with high compression rates needs to be improved [5]. For example, the accuracy of Xception [6] was 99.26% on the uncompressed FaceForensics++ dataset [7], but only 72.93% at the C40 compression rate without pretraining. High compression makes the video very blurry and the forgery traces unclear, so it becomes more difficult to distinguish real videos from fake ones. Most videos on the Internet are compressed due to upload size limitations; thus, low-quality video forgery detection is significant. For this kind of video, we studied the commonly used H.264 video compression format, which includes interframe and intraframe compression [8]. Removing only the original adjacent frames through interframe compression would indeed improve accuracy in theory; however, this would be inconsistent with the creation requirements of benchmark datasets such as FaceForensics++, so we only consider intraframe compression, which preserves the Y channel information in YCrCb space and compresses the CbCr information as much as possible. Figure 1 shows the changes of different channels in an image at different compression rates. After comparative experimental analysis, we find that using the Y channel of the image as the input yields higher accuracy than using other channels. In addition, to highlight the high-frequency information of low-quality videos, multiscale detail enhancement is performed on images before channel separation. Based on these two findings, we propose a deepfake detection method that integrates different semantics in the network. We find no standard for semantic division at the channel level, but division according to the receptive field of the convolution kernel is reasonable.

When considering the importance of each semantic level, instead of assigning weights manually, we use channel-spatial attention to assign them automatically. Therefore, a multipath network with different receptive fields is proposed to integrate features at different levels and capture forgery features. In constructing the neural network, the essential information is extracted through preprocessing and fed to the network. We concatenate the feature maps of the multiple pathways, automatically assign weights to the three semantics through a channel-spatial attention module, perform feature fusion, and classify. We train and test our model on FaceForensics++ and DeepFake-TIMIT [9] and perform cross-dataset validation on Celeb-DFv2 [10]. Experimental results show that our network achieves better accuracy than current methods, especially on low-quality (highly compressed) deepfake videos.

This work makes the following contributions:
(1) A multiscale detail enhancement method is introduced into deepfake detection. Fuzzy features are extracted with three Gaussian kernels, residuals are calculated with respect to the original image, and the detailed texture features of the forged image are highlighted.
(2) Based on a study of video compression methods, the extraction of the most significant channel information assists in detecting forged images with high compression rates.
(3) A multipath network for multisemantic information fusion is proposed. The three kinds of semantic information are automatically assigned weights by a channel-spatial attention module, so the low-, medium-, and high-level semantic information of forged images can be effectively divided and interpreted.
(4) Our method is evaluated on manipulated video datasets. It performs well on DeepFake-TIMIT and FaceForensics++ and generalizes satisfactorily to Celeb-DFv2. The proposed preprocessing improves the detection of low-quality counterfeit videos, and the network comprehensively captures different semantic information of images.

2. Related Work

We summarize current fake video generation methods, analyze deepfake detection methods, and relate them to our method.

2.1. Deepfake Image Generation

Image generation techniques have developed rapidly over the past two decades, and methods such as StyleGAN [11] can produce fake images or videos that are credible to the eye. It is especially difficult to see traces of forgery after a video is compressed. Juefei-Xu et al. [12] produced a comprehensive report on counterfeiting generation and detection. Deep learning generation techniques for deepfake videos include autoencoders and GANs. Forgery methods can be categorized by the generated results as entire face synthesis, attribute manipulation, identity swap, and expression swap, as shown in Figure 2. Entire face synthesis generates a face that does not exist in the world; the input of these networks is a random vector, and the output is a realistic fake face image. Many models can be used, such as WGAN [13], StyleGAN, and PGGAN [14]. Attribute manipulation can modify the attributes of a person's head, including simple attributes such as expression, hair color, and baldness and complex attributes such as gender, age, and the wearing of glasses; classic examples are StarGAN [15] and STGAN [16]. Identity swapping, which replaces a face in a source image with a target's face, has attracted much interest in recent years. Apps such as Zao [17] allow one to swap identities with a favorite star, but there are also malicious uses. Examples of identity swapping methods include FaceSwap [18] and CycleGAN [19]. Expression swap, also known as face reenactment, is somewhat similar to identity swapping: it replaces the facial expression in the source image with that of the target image. Representative methods include Face2Face [1] and A2V [20].

Methods of forgery generation include AAMS [21] for style transfer, SC-FEGAN [22] for image repair, and SAN [23] for super-resolution, but most of these methods are not the focus of face manipulation detection. According to the risk rank, identity swapping entails the most risk, followed by expression swap. Entire face synthesis and attribute manipulation are not very dangerous.

2.2. Deepfake Image Detection

Methods to detect deepfakes are based on spatial (image pixel) features, the frequency domain, or biological signals. Spatial-based methods use either conventional feature forensics or deep learning. Conventional image forensics relies on specific manipulation evidence [24], using frequency domain and statistical features such as local noise analysis, illumination, and device fingerprints to distinguish deepfakes. Nataraj et al. [25] extracted co-occurrence matrices on three color channels in the pixel domain and trained a classifier on these features. Although conventional forensics techniques are mature, they have several shortcomings in dealing with deepfake videos because they focus on abnormal features of local image regions, and deepfake videos are usually processed to avoid detection, such as by varying compression methods and compression rates. Therefore, conventional feature forensics cannot be directly applied to detect deepfake videos.

Methods based on or combined with deep learning have recently gained attention [26–29]. Sabir et al. [30] used recurrent neural networks to capture temporal differences in fake videos. Liu et al. [31] conducted an empirical study on real and fake faces and obtained some important findings, one of which is that the texture of a fake face is fundamentally different from that of a real face. Deep learning techniques and large datasets make it easier to capture the features associated with forgery [32]. Such methods can judge the authenticity of single-frame images and detect videos through a frame-combination strategy, but they have limitations. Most learning models rely on training and test data drawn from the same dataset with the same distribution and are weak in the face of unknown tampering types [33]. At the same time, the ability of deep learning models to detect highly compressed video frames is greatly reduced.

Frequency-domain methods analyze the differences of deepfake images through, for example, a Fourier or wavelet transform [34]. Durall et al. [35] showed that standard upsampling methods cause the forged images generated by these models to fail to correctly reproduce the spectral distribution of natural training data. Most such methods compute feature maps from the differences between real and fake images in the frequency domain and combine them with a classifier such as a support vector machine (SVM). Because the usable spectrum of compressed, low-resolution images is much smaller than that of high-resolution images, identifying compressed videos remains challenging.

Biometric authentication techniques have also developed in recent years [36]. Detection methods based on biological signals exploit the fact that fake videos cannot reproduce natural physiological characteristics, so the physiological characteristics of fake faces are inconsistent with those of real faces [37]. Therefore, biological signal-based detection methods are constantly being developed. For example, by monitoring minimal periodic changes in skin color, Qi et al. [38] speculated that the normal heartbeat rhythm would be interrupted by deepfakes and proposed a dual temporal attention network. Although detection methods based on physiological characteristics can effectively exploit the defects of deepfake techniques, they gradually become invalid as generation methods improve, for example, through the addition of physiological characteristics such as blink frequency. Besides, methods based on hard-to-detect biological signals, such as heart rate, become far less accurate under video compression and other processing [39].

Because conventional forensic techniques are easily evaded by new deepfake techniques, statistical methods based on frequency domain features are weak at detecting highly compressed forged videos, and biological signal-based methods weaken as generation techniques improve, most current work still adopts data-driven deep learning methods. As far as we know, current deep learning methods do not fully use the three semantic levels of images. For example, MesoNet used only mesoscopic semantics, while later networks such as Xception [7], FDFtnet [40], and AMTEN [41] judged authenticity using macroscopic semantics. Zhao et al. [42] used microscopic and macroscopic semantics. Although some previous work mentioned semantics, it could not explain the relationship between network depth and the three types of semantics. Our work develops a targeted solution to this problem; specifically, the three semantics are arranged across the width of the network (its parallel pathways), which has better interpretability. Moreover, ablation experiments show that the proposed method is effective and surpasses current methods at detecting forged images, especially in low-quality videos. In addition, according to the compression principle, we propose a preprocessing method for low-quality video.

3. Proposed Method

Based on the above analysis, we design the multisemantic pathway network (MSPNN) for deepfake detection to capture deepfake features under different semantics, as shown in Figure 3.

3.1. Multiscale Detail Enhancement

We use a multiscale approach to enhance the details of the source image. We first define three Gaussian filters:

$$G_{\sigma_i}(x, y) = \frac{1}{2\pi\sigma_i^{2}} \exp\!\left(-\frac{x^{2} + y^{2}}{2\sigma_i^{2}}\right), \quad i = 1, 2, 3,$$

where $(x, y)$ are the coordinates within the kernel window and $\sigma_i$ is the standard deviation of the $i$-th filter.

Then, we obtain three blurred images using the Gaussian filters:

$$B_1 = G_{\sigma_1} * I, \quad B_2 = G_{\sigma_2} * I, \quad B_3 = G_{\sigma_3} * I,$$

where $G_{\sigma_1}$, $G_{\sigma_2}$, and $G_{\sigma_3}$ are Gaussian kernels with respective kernel sizes of 5 × 5, 9 × 9, and 19 × 19 and standard deviations $\sigma_1 = 1.0$, $\sigma_2 = 2.0$, and $\sigma_3 = 4.0$; $*$ represents convolution; $I$ is the source image; and $B_1$, $B_2$, and $B_3$ are the three filtered images. The fine, intermediate, and coarse details are, respectively, extracted as

$$D_1 = I - B_1, \quad D_2 = B_1 - B_2, \quad D_3 = B_2 - B_3.$$

We combine the three layers to generate an overall detail image and add it back to the source image:

$$D^{*} = \big(1 - w_1 \cdot \operatorname{sgn}(D_1)\big) \cdot D_1 + w_2 \cdot D_2 + w_3 \cdot D_3, \qquad I^{*} = I + D^{*}.$$

According to experience, $w_1$, $w_2$, and $w_3$ are fixed as 0.5, 0.5, and 0.25, respectively. Figure 4 shows the process of image detail enhancement, and Figure 5 shows its effect. The faces at the top of Figure 5 are slightly blurred, while at the bottom, detail enhancement makes local details visually clearer, which aids in the detection of forged images with high compression.
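For concreteness, the following is a minimal sketch of this enhancement step in Python with OpenCV. The kernel sizes, standard deviations, and weights follow the values stated in this section, while the sgn-weighted fusion form, clipping, and data-type handling are assumptions of this sketch rather than the exact published implementation.

```python
import cv2
import numpy as np

def multiscale_detail_enhance(img_bgr, w1=0.5, w2=0.5, w3=0.25):
    """Sketch of multiscale detail enhancement on a face crop."""
    img = img_bgr.astype(np.float32)

    # Three Gaussian-blurred versions of the input (5x5/sigma 1.0, 9x9/2.0, 19x19/4.0).
    b1 = cv2.GaussianBlur(img, (5, 5), 1.0)
    b2 = cv2.GaussianBlur(img, (9, 9), 2.0)
    b3 = cv2.GaussianBlur(img, (19, 19), 4.0)

    # Fine, intermediate, and coarse detail layers.
    d1 = img - b1
    d2 = b1 - b2
    d3 = b2 - b3

    # Weighted fusion of the detail layers (assumed form), added back to the source.
    detail = (1.0 - w1 * np.sign(d1)) * d1 + w2 * d2 + w3 * d3
    enhanced = img + detail
    return np.clip(enhanced, 0, 255).astype(np.uint8)

# Usage: enhanced = multiscale_detail_enhance(cv2.imread("face.png"))
```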

3.2. Compressed Videos Analysis

According to our research, the detection accuracy for high- and medium-quality deepfake videos, i.e., uncompressed and moderately compressed videos, is close to 100%, while that for highly compressed videos is much worse, especially for videos with more realistic tampering effects. Therefore, the detection of highly compressed forged video must be improved. Since human eyes are sensitive to the brightness of an image but not to its chromaticity, image compression retains as much brightness information as possible and compresses the chromaticity information to save storage space. Since the luminance information of the compressed video hardly changes, the perceived definition of the video does not change significantly. Because compression is carried out in the YCrCb color space and our datasets contain RGB images, a color space conversion is required first:

$$\begin{aligned} Y &= 0.299R + 0.587G + 0.114B,\\ C_r &= 0.500R - 0.4187G - 0.0813B + 128,\\ C_b &= -0.1687R - 0.3313G + 0.500B + 128, \end{aligned}$$

where R, G, and B are the gray values of the three RGB components.

Figure 1 shows images at different compression rates; the compression rate increases from the first to the third row. The first row contains the original images, the second row contains slightly compressed images that are almost visually lossless, and the third row contains low-quality images. Column (a) shows images in the RGB color space, and column (b) shows images in the YCrCb color space. Columns (c), (d), and (e) show the separated Y, Cr, and Cb channels, respectively. The change in the Y channel is the least obvious, and the changes in the Cr and Cb channels are the most obvious. Inspired by these observations, we convert the RGB image information into one type of luminance information and two types of chrominance information, i.e., the YCrCb channels. We then conducted four experiments using the Y channel, the Cr channel, the Cb channel, and the original image separately to verify our idea. Experimental results show that using only the Y channel information improves the accuracy on highly compressed videos and has little effect on slightly compressed videos.
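A minimal sketch of this channel-screening step is shown below, assuming OpenCV's BGR image layout; how the retained Y channel is subsequently fed to the network (e.g., as a single-channel input) is an implementation choice not prescribed here.

```python
import cv2

def extract_y_channel(img_bgr):
    """Convert a face crop to YCrCb and keep only the luminance (Y) channel,
    which H.264 intraframe compression preserves most faithfully."""
    ycrcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    return y  # the Cr and Cb channels are discarded in this setting
```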

3.3. Multisemantic Path

MSPNN can output feature maps with multiple semantics through different receptive fields and network depths. The features of these different layers are finally connected, and a learnable weight is added to the three feature layers for fusion classification. The final classification relies on the deep feature map and considers the shallow and middle feature maps. The overall framework is shown in Figure 3.

The network has three parts. First, simple image preprocessing generates 32 feature maps. Different feature maps are then generated through three semantic pathways; the network details are shown in Figure 6. Since low-level semantics can be understood as microscopic features, all filters in this semantic pathway adopt a 3 × 3 window. The high-level semantics are the macroscopic features of the image, and the corresponding receptive field is larger, so the filter size of that pathway is 7 × 7. Inspired by Inception [43], we replace a 7 × 7 convolutional kernel with three 3 × 3 convolutional kernels, which reduces computation without reducing the receptive field and adds more nonlinear transformations, as shown in Figure 6. Mesoscopic semantics lies between microscopic and macroscopic semantics; the receptive field of that pathway is 5 × 5, and we use two 3 × 3 convolution kernels. Considering the influence of network depth on semantics, the depths of the three pathways increase accordingly, as sketched below.
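The following PyTorch sketch illustrates how stacked 3 × 3 convolutions can realize the three receptive fields described above. The channel widths, the number of stages, and the omission of the residual blocks and pooling used in the full network are simplifying assumptions rather than the exact MSPNN configuration.

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SemanticPathways(nn.Module):
    """Three parallel pathways: one 3x3 conv (microscopic, 3x3 receptive field),
    two stacked 3x3 convs (mesoscopic, ~5x5), and three stacked 3x3 convs
    (macroscopic, ~7x7), in place of single 5x5 and 7x7 kernels."""

    def __init__(self, in_ch=32, out_ch=64):
        super().__init__()
        self.micro = conv_bn_relu(in_ch, out_ch)                  # 3x3
        self.meso = nn.Sequential(conv_bn_relu(in_ch, out_ch),    # ~5x5
                                  conv_bn_relu(out_ch, out_ch))
        self.macro = nn.Sequential(conv_bn_relu(in_ch, out_ch),   # ~7x7
                                   conv_bn_relu(out_ch, out_ch),
                                   conv_bn_relu(out_ch, out_ch))

    def forward(self, x):
        # Feature maps of the three semantic levels, concatenated along the
        # channel dimension for later attention-based fusion.
        return torch.cat([self.micro(x), self.meso(x), self.macro(x)], dim=1)
```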

3.4. Semantic Integration

Although the microscopic, mesoscopic, and macroscopic semantics of images are juxtaposed, their importance is not the same. Hence, we apply a weight to each of the semantics instead of feeding them directly to the discriminator. In our model, these weights are learnable, which we accomplish through a channel-spatial attention module that combines spatial and channel attention; this achieves better results than SENet [44], which only attends to channels. The channel attention of a feature map $F$ is given as

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{\mathrm{avg}})) + W_1(W_0(F^{c}_{\mathrm{max}}))\big),$$

where $\sigma$ denotes the sigmoid function, $W_0 \in \mathbb{R}^{C/r \times C}$, and $W_1 \in \mathbb{R}^{C \times C/r}$. Note that the MLP weights $W_0$ and $W_1$ are shared for both inputs, and a ReLU activation follows $W_0$. The spatial attention is then

$$M_s(F) = \sigma\big(f^{7 \times 7}\big([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]\big)\big),$$

where $f^{7 \times 7}$ represents convolution with a 7 × 7 filter, and AvgPool(·) and MaxPool(·) are average and maximum pooling, respectively. The fused feature map is fed to the final classifier.
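As an illustration, a CBAM-style channel-spatial attention module consistent with the equations above could be implemented as follows; the reduction ratio r = 16 is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Channel-then-spatial attention used to weight the concatenated
    semantic features; a sketch following the equations above."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP (W0 -> ReLU -> W1)
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, f):
        # Channel attention: sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))).
        avg = self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))
        f = f * torch.sigmoid(avg + mx)
        # Spatial attention: sigmoid(conv7x7([AvgPool(F); MaxPool(F)])).
        s = torch.cat([f.mean(dim=1, keepdim=True),
                       f.amax(dim=1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.spatial(s))
```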

3.5. Loss Function

According to our investigation, the center loss function, while used in many face recognition tasks [45], does not improve performance in tasks such as handwritten digit recognition. We conclude that the center loss function is more suitable for fine-grained classification tasks. To this end, we introduce a center loss function into our model:

$$L_C = \frac{1}{2}\sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^{2},$$

where $c_{y_i}$ represents the distribution center of the data of category $y_i$, that is, the feature center of true or fake faces; $x_i$ represents the feature before the fully connected layer; and $m$ is the batch size. We use this loss to continually decrease the sum of squared distances between the feature of each sample and its class center, i.e., to make the intraclass distance as small as possible.

Normally, $c_{y_i}$ should be updated as the deep features change. Ideally, the feature centers would be computed over the entire training set by averaging the features of each class in every iteration, which is impractical. Instead, $c_{y_i}$ is updated in mini-batches: in each iteration, the centers are computed by averaging the features of the corresponding classes present in the batch. Second, to avoid large perturbations caused by a few mislabeled samples, we use a scalar $\alpha$, limited to the range [0, 1], to control the learning rate of the centers. The update of the centers is

$$\Delta c_j = \frac{\sum_{i=1}^{m} \delta(y_i = j)\,(c_j - x_i)}{1 + \sum_{i=1}^{m} \delta(y_i = j)}, \qquad c_j^{t+1} = c_j^{t} - \alpha\,\Delta c_j^{t},$$

where $\delta(\text{condition}) = 1$ if the condition is satisfied and $\delta(\text{condition}) = 0$ otherwise; that is, when the label $y_i$ and the class $j$ differ, sample $i$ does not contribute to updating $c_j$. We use joint supervision with the cross-entropy loss and the center loss to train the network to learn true and fake features. The final loss function is

$$L = L_S + \lambda L_C.$$

We first consider $L_S$ and $L_C$ in the final loss function to be equally important, so we set $\lambda$ to 1. Different values can affect the result, and we believe that a more suitable value could be found through multiple attempts. The classification loss is the binary cross-entropy

$$L_S = -\frac{1}{N}\sum_{i=1}^{N} \big( y_i \log \sigma(s_i) + (1 - y_i)\log\big(1 - \sigma(s_i)\big) \big),$$

where $s_i$ is the score of the $i$-th face and $y_i$ is the related face label, with label 0 associated with faces from real, original videos and 1 associated with fake videos; $N$ is the total number of faces used to train each batch; and $\sigma$ is the sigmoid function.
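A hedged PyTorch sketch of this joint supervision is given below. The feature dimension, the use of a learnable center parameter updated by its own optimizer (in line with the SGD-based center update described in Section 4.2), and batch averaging (which only rescales $\lambda$ relative to the summed form above) are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Center loss for the two classes (real / fake); feat_dim is assumed."""

    def __init__(self, num_classes=2, feat_dim=512):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2, averaged over the batch here.
        diff = feats - self.centers[labels]
        return 0.5 * (diff ** 2).sum(dim=1).mean()

# Joint supervision: L = L_S + lambda * L_C, with lambda = 1.
center_loss = CenterLoss()
lam = 1.0

def total_loss(logits, feats, labels):
    # logits: shape (N, 1); labels: 0 for real faces, 1 for fake faces.
    l_s = F.binary_cross_entropy_with_logits(logits.squeeze(1), labels.float())
    l_c = center_loss(feats, labels)
    return l_s + lam * l_c
```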

4. Experimental Results and Analysis

We describe the datasets, the video processing and implementation details, the preprocessing ablation experiments and comparative experiments with other methods, and the verification of generalization.

4.1. Datasets

Our experiments use the FaceForensics++, DeepFake-TIMIT, and Celeb-DFv2 datasets. FaceForensics++ is one of the largest and most diverse face forgery datasets and is widely used in deepfake detection; it contains 1,000 YouTube videos. Its authors used four types of face tampering to create fake videos: FaceSwap, DeepFakes, Face2Face, and NeuralTextures. A total of 1,000 deepfake videos are generated with each tampering method, provided at the original compression rate (C0), at a light compression rate (C23), and as low-quality videos (C40). For each compression rate, FaceForensics++ has 1,000 fake videos and 1,000 real videos. When detecting forged videos, we divided the dataset into training, validation, and test sets according to the FaceForensics++ standard split: 720 videos for training, 140 for validation, and 140 for testing.

DeepFake-TIMIT was generated from the VidTIMIT dataset using the faceswap-GAN face exchange algorithm and was the first deepfake dataset generated by a GAN. The 640 generated fake videos are available in high and low quality. The production quality is better than that of FaceForensics++, but the video resolution is not high. We divided the dataset according to the settings of FaceForensics++: for each of the two quality levels, there are 320 videos, with 230 for training, 45 for validation, and 45 for testing.

Celeb-DFv2 is a challenging deepfake video dataset that addresses some weaknesses of earlier datasets; for example, UADFV, FaceForensics++, and DeepFake-TIMIT suffer from low image resolution, poor quality of synthesized videos, rough tampering traces, and excessive flicker of video faces. The dataset consists of 590 real videos and 5,639 deepfake videos. The real videos, from YouTube, show celebrities of different genders, ages, and races.

For a fair comparison, we processed the videos according to the clipping procedure of FaceForensics++. All videos were split into frames, and dlib [46] was used to extract facial feature points in each frame to locate and crop the face area, which was expanded by a factor of 1.3. Thirty face crops were taken from each video. For frame-level data preparation, we used OpenCV to extract frames. Since the datasets only manipulate the faces in the videos, not all frame information is helpful for deepfake detection [7]; we therefore focused our analysis on the subject's face, using dlib for face detection, which further reduces the amount of data to process. When extracting faces, dlib sometimes fails to recognize a face in a video frame, in which case we skipped the frame, keeping the number of faces captured per video constant. A sketch of this extraction procedure is given below.
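The sketch below illustrates this extraction pipeline with OpenCV and dlib. It uses dlib's frontal face detector rather than the landmark model, and the square crop, resizing to 224 × 224, and sequential frame sampling are simplifying assumptions.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def extract_faces(video_path, num_frames=30, scale=1.3):
    """Read a video, detect the face per frame, enlarge the box by 1.3x,
    and keep a fixed number of face crops; frames with no face are skipped."""
    cap = cv2.VideoCapture(video_path)
    crops = []
    while len(crops) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        dets = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 1)
        if not dets:
            continue  # skip frames where dlib finds no face
        d = dets[0]
        cx, cy = (d.left() + d.right()) / 2, (d.top() + d.bottom()) / 2
        half = max(d.width(), d.height()) * scale / 2
        x1, y1 = int(max(cx - half, 0)), int(max(cy - half, 0))
        x2, y2 = int(cx + half), int(cy + half)
        crops.append(cv2.resize(frame[y1:y2, x1:x2], (224, 224)))
    cap.release()
    return crops
```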

Figure 7 shows input image samples and output feature maps from three experiments. The first row uses the lightly compressed DeepFakes subset of FF++ for training and testing. The second row uses the same forgery generation method but a higher compression rate. The third row uses the lightly compressed DeepFakes subset of FF++ for training and Celeb-DFv2 for testing to verify generalization performance. The output feature maps are the result of fusing the three paths. It can be seen from Figure 7 that for real images the higher-brightness responses are concentrated in the center of the feature map, while for forged images they are concentrated in the lower part.

4.2. Implementation

All experiments were performed on an RTX 3090. Because the baseline [7] already achieves high accuracy on uncompressed data, we only evaluated our model on lightly and highly compressed data. We implemented MSPNN using the PyTorch deep learning library. Cross-entropy was selected as the classification loss in the training phase, and the output of the network was distributed between 0 and 1. We adopted the adaptive Adam algorithm for optimization, with an initial learning rate of 1e-4 and a cosine annealing learning-rate schedule; the center loss was optimized with SGD. Batch normalization was used after each convolution to reduce overfitting, and dropout with a ratio of 0.5 was introduced in the final fully connected layer. The batch size of the input data was 32, and we trained our models for 100 epochs, with the learning-rate curve over epochs resembling a cosine function. The remaining settings were default values, the random seed was 43, and the input image size was 224 × 224.
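The following PyTorch snippet sketches this training configuration; `model`, `center_loss`, `total_loss`, and `train_loader` refer to the sketches above and are assumptions, as is the learning rate of the center-loss optimizer, which is not stated here.

```python
import torch

def train(model, center_loss, total_loss, train_loader, epochs=100):
    """Training loop matching the reported settings: Adam (lr=1e-4) with
    cosine annealing for the network, SGD for the center-loss centers,
    batch size 32, 100 epochs, random seed 43."""
    torch.manual_seed(43)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    center_opt = torch.optim.SGD(center_loss.parameters(), lr=0.5)  # assumed lr

    for epoch in range(epochs):
        for images, labels in train_loader:        # 224x224 inputs, batch size 32
            optimizer.zero_grad()
            center_opt.zero_grad()
            logits, feats = model(images)          # assumes model returns both
            loss = total_loss(logits, feats, labels)
            loss.backward()
            optimizer.step()
            center_opt.step()
        scheduler.step()                           # cosine-shaped lr per epoch
```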

4.3. Preprocessing Analysis

Preprocessing had two steps. Multiscale detail enhancement highlights face textures, especially in low-quality images, which are so blurred that forged traces are difficult to see. In this process, three filters of different sizes, $G_{\sigma_1}$, $G_{\sigma_2}$, and $G_{\sigma_3}$, were used to filter the image to obtain blurred images $B_1$, $B_2$, and $B_3$. The blurred image $B_1$ was subtracted from the original image to obtain detail image $D_1$; detail image $D_2$ was obtained from the difference between $B_1$ and $B_2$, and detail image $D_3$ from the difference between $B_2$ and $B_3$. The three detail images were fused with the original image to obtain the enhanced image; the results are shown in Figure 5. Ablation experiments were performed on the FaceForensics++ datasets at compression rates C23 and C40, as shown in Table 1, from which we can see that detection performance on the highly compressed dataset improved more than on the lightly compressed dataset, showing the effectiveness of the proposed preprocessing for low-quality data. It is worth noting that detection improved at every compression rate on the most challenging NeuralTextures subset. This tampering method only modifies the facial expression corresponding to the mouth, leaving the eye area unchanged, and therefore requires more subtle detection.

The second preprocessing step was channel separation for highly compressed images with low detection accuracy. We investigated the H.264 video compression standard and found that it keeps the information of the Y channel as much as possible while compressing the other two channels. Figure 1 shows the changes in the information of the three channels after compression. We therefore converted the RGB image to a YCrCb image, and the Y, Cr, and Cb channels were extracted separately for training. We found that the accuracy with the channel containing brightness information is much higher than with the channels containing chroma information, and the accuracy of the chromaticity channels is much lower than that of the ordinary RGB input, as shown in Table 2. According to Table 2, the accuracy on most subsets of the FaceForensics++ dataset can be improved by using only Y channel information, especially on the highly compressed C40 data. On some subsets the results become slightly worse, but the change is small; we believe that forged images with a low compression rate are close to the original images, so the effect is not apparent there.

4.4. Experimental Results

Most detection methods are based on macroscopic semantics, i.e., the final feature maps of the network. The difference between a natural face and a fake one is often subtle and occurs in local areas, and the minor artifacts caused by deepfake methods are usually preserved in the shallow texture features. We believe that microscopic or shallow semantic features therefore cannot be ignored. However, focusing only on details is also flawed: a microscopic analysis based on image noise cannot be applied to compressed videos, where the image noise is strongly reduced. At the same time, it is difficult for the human eye to distinguish forged images at the higher semantic level, especially in fine-grained analyses such as face discrimination. Therefore, our work takes into account all three kinds of semantic information, which can be captured by receptive fields of various sizes and are more interpretable, as shown in Figure 8.

Gradient-weighted class activation mapping is used to visualize the attention of the three pathways. It is evident that the microscopic semantic pathway pays more attention to details and the mesoscopic semantic pathway to multiple blocks, while the macroscopic semantics focus more on areas that are difficult to forge, such as the eyes, nose, and mouth, because these are the hardest to reproduce when generating forged images. In addition, small convolution kernels are used in place of large ones, which reduces the computation of convolution and adds multiple nonlinear activations while keeping the receptive field unchanged. We add residual blocks to the mesoscopic and macroscopic semantic pathways to ensure that information is not lost as the network depth increases. In Table 3, we can observe that the three pathways improve accuracy on most of the datasets. They perform worse on some datasets, in particular NeuralTextures, which only tampers with parts of the images, so our microscopic semantic pathway captures much information that is not helpful for detection there; the addition of preprocessing makes up for this problem, as shown in Table 1. We also conducted experiments to verify the effectiveness of the proposed channel and spatial attention modules, which are effective on most datasets, as shown in Table 4. In particular, we find that NeuralTextures and Face2Face, the most challenging subsets of FaceForensics++, benefit noticeably.

Our overall accuracy on the FaceForensics++ datasets exceeds that of many previous methods, as shown in Table 5. Most work on the TIMIT dataset reports the AUC indicator. To evaluate overall detection performance, we therefore also calculated the area under the curve (AUC), i.e., the area under the receiver operating characteristic (ROC) curve, whose maximum value is 1, and display the results in Table 5. The AUC of our proposed method is higher than that of the other methods, indicating better performance on compressed deepfake video detection.
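For reference, the AUC can be computed from per-face scores with scikit-learn; averaging frame-level scores per video before computing the AUC is an assumed aggregation strategy in this sketch.

```python
from sklearn.metrics import roc_auc_score
import numpy as np

def video_auc(frame_scores, frame_labels, video_ids):
    """Average frame-level sigmoid scores per video, then compute the AUC."""
    scores = np.asarray(frame_scores, dtype=float)
    labels = np.asarray(frame_labels)
    ids = np.asarray(video_ids)
    vid_scores = [scores[ids == v].mean() for v in np.unique(ids)]
    vid_labels = [labels[ids == v][0] for v in np.unique(ids)]
    return roc_auc_score(vid_labels, vid_scores)
```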

4.5. Validation of Generalization on Celeb-DFv2

Cross-dataset validation was carried out to evaluate the generalization ability of the proposed method. The model was trained on FaceForensics++ and tested on Celeb-DFv2. We followed the setup of Celeb-DFv2 [10] to divide the test set and report the AUC scores in Table 6. The results show that our method generalizes better than most methods. Masi's method [55] generalizes to Celeb-DFv2 better than ours, but its AUC score on the original dataset is far lower. Our approach has limitations, but balancing accuracy and generalization has always been a challenge.

5. Conclusion

Although methods for deepfake detection of videos and images have made much progress, few consider multiple levels of semantic information. This work proposes a new face forgery detection method, MSPNN, which simultaneously captures microscopic, mesoscopic, and macroscopic semantics to comprehensively distinguish forged images, with weights assigned automatically to the three semantics, so that the network can comprehensively capture different semantic information of an image. In view of the challenges of small-region face tampering, highly compressed datasets, and cross-dataset evaluation, the proposed framework effectively captures both minor forged artifacts and macroscopic forged traces, which further improves the detection of highly compressed forged images, and it generalizes well. Furthermore, the proposed preprocessing method improves the detection of low-quality counterfeit videos. Our future work will consider combining frequency domain information and brightness information at the channel-separation stage to integrate the corresponding features for deepfake detection.

Data Availability

The data supporting this work are from previously reported studies and datasets, which have been cited. The processed data are available at https://github.com/ondyari/FaceForensics/blob/master/dataset/README.md, https://conradsanderson.id.au/vidtimit/#downloads and https://github.com/yuezunli/celeb-deepfakeforensics.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (Nos. 62002313, 62101481, 62166047), Key Areas Research Program of Yunnan Province in China (No. 202001BB050076), Major Scientific and Technological Project of Yunnan Province (No. 202202AD080002), Yunnan Fundamental Research Projects of China (Nos. 202201AU070033, 202201AT070112), the fund project of Yunnan Province Education Department (No. 2022j0008), Key Laboratory in Software Engineering of Yunnan Province (No. 2020SE408), the open project of Engineering Research Center of Cyberspace in 2021-2022 (No. KJAQ202112012), and Industrial Internet Security Situation Awareness Platform of Yunnan Province.