Abstract

The development of deep forgery technology has brought new challenges to media content forensics, and identifying forged audio and video with deep forgery detection methods has become an important and difficult research topic. Deep forgery technology and forensic technology compete with and thereby promote each other. This paper proposes a spatiotemporal local feature abstraction (STLFA) framework for facial forgery identification to address the challenges that deep forgery technology poses to the media industry. To fully exploit local facial features, we combine facial key points, key point movement, and facial corner points to detect forged content. The framework establishes a spatiotemporal relation and realizes face forgery detection by identifying abnormalities of facial key points and corner points across frames. Meanwhile, we utilize RNNs to model the interframe sequences of key point movements and corner point statistics. Experimental results show that our method outperforms several existing methods and achieves good anticompression forgery face detection performance on FF++.

1. Introduction

Media content forgery has brought security problems to society. Especially with the development of autoencoders (AEs) [1] and generative adversarial networks (GANs) [2], media content forgery has become easy to achieve through deep forgery techniques. These techniques usually use deep learning to alter a person’s identity in a video and synthesize media content that never existed. Deep forgery identification techniques include both image-level and video-level detection.

Forgery detection on images or video frames mostly detects artifacts of forged content, such as color inconsistencies and semantic inconsistencies. According to the detection scope, image forgery detection can be divided into detection of the image as a whole and detection of the facial area. Detection of the image as a whole mainly examines the physical properties of the image, such as the direction of the light source [3], the saturated pixel frequency [4], and the spectral sensitivity [4], and classifies images by the differences between forged and authentic ones. Forgery detection for facial regions targets cues such as inconsistent iris color, missing tooth gaps, and inconsistent eye reflections, including detection of facial artifacts using light estimation, global consistency, and geometric estimation [5], corneal highlight consistency detection [6], and facial warping artifact detection [7].

The detection of video sequences is mainly performed by combining optical flow anomalies, motion incoherence, or anomalies between video frames. Forgery detection based on optical flow mainly calculates the optical flow field of the target in the video and detects the inconsistency of the optical flow field [8]. Some authors utilize eye blinks [9], abnormal head movements [10], and facial distortions [11] to detect incoherent motion or abnormal behaviors in consecutive frames.

However, these early works focused mainly on global features. We notice that forgery cues are particularly evident in key facial organs such as the eyes, nose, and mouth [5, 6, 12]. For example, Xue et al. [12] found that using only facial organs such as the nose, lips, eyes, eyebrows, and chin can already detect deep forgeries very well.

Based on this, we first construct the relation among facial organs. These organs can be abstracted into local features and represented by sequential vectors. We then adopt recurrent neural networks (RNNs) to capture their internal properties and differences and obtain instructive guidance on whether the face is falsified. For comprehensive detection, we perform face forgery detection on key local facial regions such as the lips, eyes, nose, eyebrows, and chin, achieving impressive performance. The contributions of our work are summarized as follows:
(1) We propose a spatiotemporal local feature abstraction (STLFA) framework for facial forgery identification, which establishes the relation among local features via an organ-specific method.
(2) In STLFA, we combine abnormal facial movement detection and facial landmark time discontinuity detection to analyze the facial key point and corner point features frame by frame. Meanwhile, we judge the key point movement and corner point number variation of video sequences to achieve forgery identification of images and videos.
(3) We demonstrate the effectiveness and robustness of the proposed method and discuss and analyze the advantages and disadvantages of STLFA.

2. Related Work

2.1. Deep Forgery Discrimination Based on Image or Video Frames

Currently, most forgery detection on images or video frames relies on handcrafted features. By detection subject, these methods fall into two categories: detection over the whole image and inconsistency detection restricted to human faces.

Image forgery detection mainly targets inconsistent lighting conditions and color inconsistencies in images. Chen et al. [13] proposed a robust dual-stream network based on an improved Xception model that integrates the RGB and YCbCr color spaces, considering both luminance and chrominance components to enhance robustness. Johnson and Farid [3] proposed a method to detect lighting inconsistencies by estimating the direction of point light sources in a single image and checking the consistency of light sources across the whole image. McCloskey and Albright [4] analyzed the structure of popular GAN networks and found that GAN-generated images differ from captured images in color processing; they proposed a forgery classification method based on saturated pixel frequency detection and spectral sensitivity detection.

Forgery detection based on facial inconsistencies exploits the deep forgery methods’ incomplete modeling of semantics during content generation, which can produce faces with inconsistent iris colors between the left and right eyes, inconsistent reflections, and uneven tooth gaps. Matern et al. [5] detected intraframe facial artifacts using light estimation, global consistency, and geometric estimation. Hu et al. [6] proposed a scheme that checks whether the highlight patterns on the corneas of the two eyes are consistent to determine whether a face is fake. Li and Lyu [7] identified forgery traces by detecting artifacts left by the affine transformation during face forgery.

To integrate the features of facial regions, several novel approaches have been proposed. Wang et al. [14] proposed a forgery determination method that fuses facial region feature descriptors extracted from facial feature points. Xue et al. [12] built an organ-based transformer model to obtain deepfake features for detection. Yang et al. [15] proposed a method that detects differences in face textures by amplifying the texture differences between genuine and fake images, using a bootstrap filter to enhance postprocessing-induced texture artifacts and reveal the underlying features of the artifacts.

2.2. Deep Forgery Discrimination Based on Video Sequences

Video sequence-based deep forgery detection approaches offer more detection cues than image-based approaches. Because forged videos are generated frame by frame, they exhibit optical flow inconsistencies between adjacent frames as well as motion anomalies.

In terms of forgery identification based on optical flow detection, Amerini et al. [8] proposed a forgery detection method based on optical flow anomalies between different frames by extracting the correlation of the optical flow field and using a CNN classifier for classification. Trinh et al. [16] proposed a forgery detection framework by superimposing optical flow fields on RGB images for forgery detection. Caldelli et al. [17] proposed a CNN-based classification method to distinguish motion dissimilarities in the temporal structure of video sequences by using optical flow fields.

In terms of forgery identification based on abnormal motion detection, Li et al. [9] observed that GAN-based models fail to reproduce natural blinking in synthetic videos and proposed detecting forgeries through blink inconsistencies. Yang et al. [10] proposed a detection method based on the inconsistency of 3D head pose estimation, extracting the coordinates of facial key points and calculating the difference between the direction vectors derived from the central and peripheral key points to achieve deep forgery detection. Sun et al. [11] proposed a geometric feature calibration module that improves the accuracy of interframe geometric features to identify abnormal facial movements.

3. Methods

3.1. Framework

In this section, we provide a detailed illustration of our proposed method. Figure 1 illustrates the architecture of STLFA. We use a facial preprocessing module to crop the eight facial organ regions, including the left eyebrow, right eyebrow, left eye, right eye, nose, mouth, inner mouth, and chin. We build sequence groups from the facial key points, key point movements, corner point numbers, and corner point number variations. Meanwhile, an RNN model is trained for each region until it acquires detection ability. After that, we integrate the results of the RNNs to obtain the final prediction.

3.2. Facial Preprocessing

The facial preprocessing module mainly contains three steps: face detection, face landmark detection, and landmark alignment. Following [11], we use tracking and denoising to match key points across video frames and obtain complete facial key point coordinates and their movements. In the tracking step, we use the Lucas–Kanade (LK) operation to track the coordinate points and a forward-backward check to eliminate inaccurate predictions. In the denoising step, a Kalman filter integrates the predictions to suppress the noise introduced by the LK operation and keep the landmarks stable.
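To make the preprocessing step concrete, the following is a minimal sketch (not the authors' exact pipeline) of dlib-based landmark detection combined with Lucas–Kanade tracking and a forward-backward consistency check; the predictor model path, the error threshold, and the fallback re-detection are illustrative assumptions, and the Kalman-filter smoothing step is omitted.

```python
# Minimal preprocessing sketch: dlib face/landmark detection plus LK tracking
# with a forward-backward check. Paths and thresholds are placeholders.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # placeholder path

def detect_landmarks(gray):
    """Return the 68 landmarks of the first detected face as a (68, 2) float32 array."""
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)

def track_landmarks(prev_gray, gray, prev_pts, fb_thresh=2.0):
    """Track landmarks with LK optical flow; drop points that fail the forward-backward check."""
    pts = prev_pts.reshape(-1, 1, 2)
    fwd, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    bwd, _, _ = cv2.calcOpticalFlowPyrLK(gray, prev_gray, fwd, None)
    fb_err = np.linalg.norm(pts - bwd, axis=2).ravel()   # forward-backward error per point
    good = (status.ravel() == 1) & (fb_err < fb_thresh)
    tracked = fwd.reshape(-1, 2)
    # Fall back to re-detection for unreliable points (a further Kalman filter
    # could smooth the merged predictions, as in the paper's denoising step).
    redetected = detect_landmarks(gray)
    if redetected is not None:
        tracked[~good] = redetected[~good]
    return tracked
```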

3.3. Facial Key Points Extraction
3.3.1. Facial Key Points Coordinates Extraction

The facial key point coordinate detection method requires cropping the preprocessed image. After that, we detect 68 facial key points that represent the facial shape, as shown in Figure 2(a). Based on the 68 key points, we select the key point frames to extract eight key facial organ regions, as demonstrated in Figure 2(b). We create a coordinate vector $P^i$ for each key organ region.

Each region $i$ can be expressed as $P^i = (x_1^i, y_1^i, x_2^i, y_2^i, \ldots, x_{n_i}^i, y_{n_i}^i)$, where $x_1^i$ is the horizontal coordinate of the first key point in region $i$ and $y_1^i$ is the corresponding vertical coordinate.

3.3.2. Corner Extraction

(1) Motivation for Using FAST Feature Points. The FAST algorithm is a corner detection algorithm mainly used to extract feature points from an image. Based on the feature point information, translated, distorted, and rotated objects in a dynamic process can be associated, enabling target tracking and positioning across a series of frames. Wang et al. [14] found that although a fake video face is highly similar to the original face, it still loses many of the fine details that determine FAST feature points, and that this phenomenon is more evident in local facial areas. Based on this observation, we design a FAST feature descriptor to capture the occasional failure of face swapping in local areas of fake videos and thereby complete face forgery detection.

(2) FAST Feature Point Extraction Algorithm. Features from accelerated segment test (FAST) [19] is an efficient corner detection method mainly used to extract image corner features. The FAST method takes the intensity $I_p$ of a pixel point $p$, sets a threshold value $t$, and creates a Bresenham circle of 16 pixels around $p$, as shown in Figure 3(a).

Pixel $p$ is designated a corner point if there is a set of $n$ consecutive pixels on the circle that are all brighter than $I_p + t$ or all darker than $I_p - t$.

To speed up the operation, the pixels compared with $p$ can be reduced to positions 1, 5, 9, and 13, as shown in Figure 3(b). This paper establishes FAST corner point detection for the eight extracted regions, such as the eyes, nose, lips, and eyebrows, and compares the corner points between frames, as shown in Figure 3(c).
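As an illustration of the corner statistics used later, the sketch below counts FAST corners inside the bounding box of one organ region with OpenCV; the FAST threshold, the crop margin, and the bounding-box cropping scheme are assumptions rather than the paper's exact settings.

```python
# Illustrative per-region FAST corner counting with OpenCV.
import cv2
import numpy as np

fast = cv2.FastFeatureDetector_create(threshold=20, nonmaxSuppression=True)  # assumed settings

def region_corner_count(gray, region_pts, margin=5):
    """Count FAST corners inside the (padded) bounding box of one organ region's landmarks."""
    x, y, w, h = cv2.boundingRect(region_pts.astype(np.int32))
    x0, y0 = max(x - margin, 0), max(y - margin, 0)
    crop = gray[y0:y + h + margin, x0:x + w + margin]
    keypoints = fast.detect(crop, None)
    return len(keypoints)
```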

3.4. Abnormal Facial Movement Detection
3.4.1. Facial Shape Movement Abnormal Detection

Facial shape movement detection builds on the 68 extracted facial feature points: the facial area is divided into eight regions, and the temporal movement pattern of the feature points in each region is established to detect abnormal facial shape movement. We analyze the movement of the key points in each region and build a key point coordinate vector $P_t^i$ for region $i$ in frame $t$.

The collection of the eight regions’ key point coordinate vectors in frame $t$ can be expressed as $S_t = \{P_t^1, P_t^2, \ldots, P_t^8\}$, where $P_t^i$ represents the vector of region $i$ in frame $t$. The corresponding key points are as follows: points 6∼10 represent the chin, 17∼21 the left eyebrow, 22∼26 the right eyebrow, 36∼41 the left eye, 42∼47 the right eye, 27∼35 the nose, 48∼60 the mouth, and 61∼67 the inner mouth.

Then, we use $S_t$, extracted frame by frame, to provide clues for the subsequent temporal discontinuity detection of facial motion morphology.
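The grouping of the 68 landmarks into the eight regions listed above can be expressed, for illustration, as follows; the flattening order of the coordinates is an assumption.

```python
# Grouping the 68 landmarks into the eight organ regions and building the
# per-region vectors P_t^i for one frame.
import numpy as np

REGION_INDICES = {
    "chin":          range(6, 11),
    "left_eyebrow":  range(17, 22),
    "right_eyebrow": range(22, 27),
    "nose":          range(27, 36),
    "left_eye":      range(36, 42),
    "right_eye":     range(42, 48),
    "mouth":         range(48, 61),
    "inner_mouth":   range(61, 68),
}

def region_vectors(landmarks):
    """Split a (68, 2) landmark array into the eight per-region vectors [x1, y1, x2, y2, ...]."""
    return {name: landmarks[list(idx)].ravel() for name, idx in REGION_INDICES.items()}
```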

3.4.2. Facial Corner Abnormal Detection

Following [14], we use FAST to obtain feature points with a 32-dimensional descriptor. Let $c_t^i$ denote the number of corner points of the focused organ region $i$ in frame $t$; $c_t^i$ is obtained by counting the FAST corner points detected within region $i$ of frame $t$.

In this way, a feature vector can be created over the eight regions: $C_t = \{c_t^1, c_t^2, \ldots, c_t^8\}$, where $c_t^i$ is the corner point statistic of region $i$, namely the number of corner points in region $i$ at frame $t$. We create time series based on $C_t$ to detect clues of alternation between authentic and forged faces in forgery videos.

3.5. Facial Landmark Time Discontinuity Detection
3.5.1. Facial Key Points Time Discontinuity Detection

We detect the temporal discontinuity of facial key point displacement based on the displacement of facial key points between consecutive frames. We analyze the movement of the key points in each region and build a key point coordinate movement vector $M_t^i$; for each region $i$ it can be expressed as $M_t^i = (\Delta x_1^i, \Delta y_1^i, \Delta x_2^i, \Delta y_2^i, \ldots, \Delta x_{n_i}^i, \Delta y_{n_i}^i)$.

The collection of the eight regions’ key point coordinate movement vectors in frame $t$ can be expressed as $D_t = \{M_t^1, M_t^2, \ldots, M_t^8\}$, where $\Delta x_j^i$ is the variation of the horizontal coordinate between adjacent frames, calculated as $\Delta x_j^i = x_j^i(t) - x_j^i(t-1)$; $\Delta y_j^i$ is obtained in the same way.
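A minimal sketch of these inter-frame movement features, assuming the per-region coordinates are stacked into a (frames × coordinates) array; the same differencing applies to the corner counts in Section 3.5.2.

```python
# Inter-frame movement features M_t^i as consecutive-frame differences.
import numpy as np

def movement_sequence(region_coords):
    """region_coords: (T, 2k) array of one region's flattened landmark coordinates.
    Returns a (T-1, 2k) array of per-frame coordinate displacements."""
    return np.diff(region_coords, axis=0)
```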

3.5.2. FAST Feature Time Discontinuity Detection

The vector $C_t$ in Section 3.4.2 records the corner numbers of the local regions; we use it to build the corner number difference between consecutive frames: $\Delta c_t^i = c_t^i - c_{t-1}^i$.

Here, $\Delta c_t^i$ is the difference between the number of corners in region $i$ at frame $t$ and at frame $t-1$. The statistical vector of corner number differences over the whole facial region can be expressed as $\Delta C_t = \{\Delta c_t^1, \Delta c_t^2, \ldots, \Delta c_t^8\}$.

We use $\Delta C_t$ to detect nonsmooth changes in the facial corner numbers across the video.

3.6. Facial Forgery Prediction
3.6.1. Facial Feature Vector Association

Based on $P_t^i$, $c_t^i$, $M_t^i$, and $\Delta c_t^i$ obtained in Sections 3.4 and 3.5, the local facial feature fusion vector $V_t^i$ is formed by concatenating the four types of features: $V_t^i = [P_t^i, M_t^i, c_t^i, \Delta c_t^i]$.

Then, the local facial feature fusion sequence for region $i$ over the entire video of $T$ frames can be expressed as $V^i = (V_1^i, V_2^i, \ldots, V_T^i)$.

We utilize the series of local facial feature fusion vectors to represent the facial fusion features. After that, we use the concatenated feature sequence to train a dual-stream RNN model for each of the eight regions to classify forgery videos.
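For illustration, one possible way to assemble the fusion sequence for a region is sketched below; dropping the first frame (whose difference features are undefined) and the exact concatenation order are implementation assumptions, not choices stated in the paper.

```python
# Assembling the per-region fusion sequence V^i from coordinates and corner counts.
import numpy as np

def fuse_region_features(coords, corner_counts):
    """coords: (T, 2k) landmark coordinates; corner_counts: (T,) FAST corner counts.
    Returns a (T-1, 2k + 2k + 1 + 1) fused feature sequence for one region."""
    movement = np.diff(coords, axis=0)              # M_t^i
    count_diff = np.diff(corner_counts)             # delta c_t^i
    return np.concatenate(
        [coords[1:], movement, corner_counts[1:, None], count_diff[:, None]],
        axis=1,
    )
```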

3.6.2. RNN-Based Deep Forgery Detection

We utilize RNNs to model the local facial feature sequences. To keep the RNN input dimension identical and to achieve deep forgery detection at the video level, each input video sample is cut to a fixed length and a fixed number of key frames is extracted. Based on these extracted frames, the RNNs are trained to achieve deep forgery detection of the overall video.

Through the embedding process, the RNNs model the feature sequences of each local region, learning the shape movement pattern, the landmark difference pattern, and the FAST feature point variation pattern. A fully connected (FC) network is then attached to each RNN output layer. Finally, the outputs of the eight FC layers are averaged as the final prediction, achieving deep forgery detection based on the local regions of the face. This process can be expressed as $\hat{y} = \frac{1}{8}\sum_{i=1}^{8}\mathrm{FC}_i(\mathrm{RNN}_i(V^i))$.
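The following PyTorch sketch illustrates the kind of per-region classifier and eight-way averaging described above and in Section 4.2 (a bidirectional GRU with hidden size 64, dropout, and an FC head); the dropout rate, the two-class output, and the use of the last time step are assumptions, not the authors' released implementation.

```python
# Per-region bidirectional GRU classifier and eight-region score averaging (sketch).
import torch
import torch.nn as nn

class RegionGRUClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim=64, dropout=0.5):
        super().__init__()
        self.in_drop = nn.Dropout(dropout)
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(2 * hidden_dim, 2),   # real vs. fake logits
        )

    def forward(self, x):                   # x: (batch, T, input_dim)
        out, _ = self.gru(self.in_drop(x))
        return self.head(out[:, -1])        # last time step of the sequence

def ensemble_predict(models, region_sequences):
    """Average the softmax scores of the eight per-region models."""
    probs = [torch.softmax(m(x), dim=-1) for m, x in zip(models, region_sequences)]
    return torch.stack(probs).mean(dim=0)
```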

4. Experiments

4.1. Datasets
(1) FaceForensics++ (FF++) [18]: FF++ is one of the benchmark datasets for large-scale deep forgery detection, with over 1,000 video segments, more than 1.5 million frames, and over 1.5 TB of data in the original video format. A face detector is used to filter the footage. The dataset provides three video qualities, Raw, c23, and c40, and is characterized by a large number of forged video segments generated by a variety of deep forgery methods.
(2) Celeb-DF [20]: The Celeb-DF (v2) dataset is a large-scale deepfake forensic dataset that addresses the shortcomings of poor forged video quality, apparent forgery traces, and flickering video faces. It improves the deep forgery generation method and the facial key point localization method to obtain stable fake video quality. The dataset contains 590 raw videos collected from YouTube covering different ages, ethnicities, and genders, plus 5,639 HD deepfake videos of the same quality as online streaming video.
(3) DFDC preview dataset [21]: This dataset comes from the Deepfake Detection Challenge hosted by Facebook and is the preliminary dataset of the competition. It consists of 5,214 videos with a real-to-fake ratio of 1 : 0.28, and the forged data are generated by two deep forgery methods. Each video is a clip of about 15 s.
4.2. Experiment Settings

During preprocessing, DLIB was used for face cropping and facial landmark detection, and the FAST detector with BRIEF descriptors was used for corner point detection and description. In the classification process, a bidirectional recurrent neural network is applied to the feature sequences of the respective regions. Each RNN in the detection framework consists of a GRU (gated recurrent unit) with a hidden output dimension of 64. A dropout layer is placed between the input and the RNN, and a fully connected network is attached to the output of the RNN layer, with two further dropout layers placed between the RNN layer and the fully connected layer and inside the latter. These experimental parameter settings partly follow existing research [22].

For the experimental datasets, the ratio of training data to test data was 7 : 3, with 120 frames drawn from each video. The model was optimized with the Adam optimizer; the learning rate was initialized to 0.005, the batch size was set to 1024, and the maximum number of training epochs was 800. The experiments use the AUC (area under curve) to evaluate the performance of the deep forgery detection model, calculated as $\mathrm{AUC} = \frac{K}{M \times N}$, where $P_{\mathrm{pos}}$ is the predicted probability of a positive sample, $P_{\mathrm{neg}}$ is the predicted probability of a negative sample, $M$ is the number of positive samples, $N$ is the number of negative samples, and $K$ is the number of positive-negative sample pairs in which the predicted probability of the positive sample is greater than that of the negative sample.
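For reference, the pairwise AUC defined above can be computed as sketched below (ties are ignored, matching the formula); in practice a library routine such as sklearn.metrics.roc_auc_score gives an equivalent result.

```python
# Pairwise AUC: fraction of positive/negative pairs where the positive scores higher.
import numpy as np

def pairwise_auc(pos_scores, neg_scores):
    pos = np.asarray(pos_scores)[:, None]    # shape (M, 1)
    neg = np.asarray(neg_scores)[None, :]    # shape (1, N)
    return np.mean(pos > neg)                # K / (M * N)
```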

4.3. Experiments
4.3.1. Partial Organ Comparison

Experiments are conducted on the FF++ dataset to compare the detection performance of each organ region module and verify each region’s contribution to deep forgery detection. Following the idea of [14], eight key regions, namely the left eyebrow, right eyebrow, left eye, right eye, nose, mouth, inner mouth, and chin, were set up and compared, as shown in Table 1. The “Points” results are obtained using facial key point coordinate detection together with facial key point coordinate movement detection; “Coordinate” indicates the detection result using only the facial key point coordinates, and “Movement” indicates the result using only the facial key point movement. “C + M” indicates the result obtained by combining key point coordinate detection and key point coordinate movement detection. “Corners” is the result obtained using FAST corner number detection and corner number change detection. “All” means that the “Points” results are combined with the “Corners” (FAST feature) results, with the RNN of each region trained separately.

As shown in Table 1, every local organ can individually be used on the FF++ dataset to detect whether images contain forgeries. Among the eight organ regions, the eyebrows, eyes, and mouth have the highest accuracy, while the nose and chin have lower accuracy. In the “Points” detection group, where three experiments were set up, “Coordinate” can perform single-frame detection with an average detection rate of 87.2%, and “Movement”, which works on video sequences, has an average detection rate of 82.6%. Combining “Coordinate” and “Movement” unites abnormal facial movement detection and facial landmark time discontinuity detection, capturing the key facial features more effectively with an accuracy of 91.1%.

4.3.2. Ablation Study

We use the frame-level AUC to separately verify the effectiveness of facial key point detection and corner point detection for deep forgery detection. The models in these experiments are trained on FF++ (raw) and tested on three datasets: FF++, DFDC Preview, and Celeb-DF. The results are shown in Table 2.

The experimental results show that “Points” and “Corners” achieve similar AUC results, with averages of 71.3% and 74.1%, respectively, while the best detection is achieved by “All,” with an AUC of 75.9%. On each of the FF++, DFDC Preview, and Celeb-DF datasets, the AUC of “All” is higher than those of “Points” and “Corners.” This shows that combining facial key point and corner point detection, as proposed in this paper, is reasonable and effective.

4.3.3. Comparison Experiments

Using frame-level AUC evaluation, we selected mainstream deep forgery detection methods for comparison: full-frame face region forgery detection [18], fake face blending boundary detection [23], facial landmark feature enhancement detection [11], visual distortion detection [24], and capsule network forgery detection [25]. Tests were carried out on the FF++, DFDC Preview, and Celeb-DF datasets. We refer to the detection results reported in [11, 14], as shown in Table 3. In the FF++ dataset, “raw” represents the uncompressed data and “c40” the compressed low-quality (LQ) data.

As can be seen from Table 3, the AUC results of the proposed method on FF++ are better than those of mainstream methods such as Xception [18], Face X-ray [23], LRNet [11], DSP-FWA [24], and Capsule [25]. In particular, in the experimental group of “c40,” the proposed method has better robustness for low-quality forged video identification, with a 1.7% improvement over LRNet [11] and a 35.8% improvement over Face X-ray [23].

In anticompression forgery face detection, our work shows good performance. The method extracts geometric features of local facial regions by combining local facial key points and corner points. The extracted features are more robust, cheaper to compute, and highly sensitive to changes in the number of corner points. The strategy of detecting face forgery through eight local facial regions improves the accuracy of the overall detection by reducing the detection error of any single region. The effectiveness of this strategy is also verified on FF++ (Raw, c40).

The low-complexity, high-performance geometric feature extraction method designed in this paper effectively reduces the impact of image compression on face forgery detection, as the experimental results further demonstrate. Table 3 compares the training and testing results of this method and other methods on the FF++ (Raw, c40) dataset. The results show that our method outperforms some existing methods, differing by only 0.4% in AUC from the Single XceptionNet [26] method on FF++ (c40), and offers better anticompression forgery face detection; the detection performance suffers less interference on the c40 data.

4.3.4. Cross-Dataset Experiments

Our method can perform detection on local areas alone, such as the eyes, nose, and other organs, which makes it suitable for detecting forged videos with stains and occlusions. To further demonstrate its robustness, the models trained on FF++ (raw) were tested on the DFDC Preview and Celeb-DF datasets. Table 4 reports the results of training and testing on FF++ (raw, c40) for cross-dataset experiments on individual organs and organ combinations.

The experimental results show that our method can use individual organs alone to detect forged videos affected by contamination or occlusion, while using all organ regions yields higher average accuracy. To further verify the ability of our method, we set up cross-dataset experiments comparing it with the state of the art in Table 5.

Table 5 compares Xception [18], LRNet [11], DSP-FWA [24], Capsule [25], Single XceptionNet [26], FWA [7], LipForensics [31], STIL [33], ADDNet-3D [34], and our method. Our method has certain advantages on the DFDC Preview cross-dataset test, but its cross-dataset performance still needs further improvement. The reasons are analyzed as follows. The framework utilizes spatial and temporal features, such as the spatial positions of facial feature points and the statistics of FAST corner points, and shows good performance on the FF++ dataset. Using geometric features strengthens the description and discrimination of forged faces to a certain extent, and the RNN models the time series of these features to complete fake face detection, which verifies the effectiveness of the framework. Applying geometric features also improves the sensitivity to facial feature point motion patterns and their differential changes. Still, when the forgery characteristics and scenes around the face differ across datasets, the feature extraction method in this framework needs further optimization. Obtaining more effective forged face features is the direction of further optimization.

4.4. Discussion

The proposed method utilizes RNNs to model local facial feature sequences, achieves deepfake discrimination through abnormal facial movement detection and facial landmark time discontinuity detection, and exhibits good detection performance and compression resistance. It mainly mines the detection ability of each local facial region for deep forgery and can effectively learn and model the forgery features and patterns of local facial regions. However, since the sample distribution of the FF++ dataset cannot represent all deep forgery techniques, the generalization of the method to new data distributions is not explicitly guaranteed, which may degrade performance in cross-dataset testing. Research on this generalization problem is our future goal.

5. Conclusion

The development of deep forgery technology has brought new challenges to the authenticity of media content, and the mutual promotion of deep forgery and forensic technologies is prominent in addressing these challenges for the media industry. Focusing on the consistency of facial key point and corner point coordinates, we propose a spatiotemporal local feature abstraction (STLFA) framework for facial forgery identification. The framework establishes the relation among local features via an organ-specific method and combines abnormal facial movement detection with facial landmark time discontinuity detection to analyze facial key point and corner point features frame by frame, mainly checking the consistency of facial key point coordinate movements and the variation in the number of facial corner points. At the same time, the method uses bidirectional RNNs over sequences from eight local facial regions to model the facial shape pattern, the key point movement pattern, and the corner point number variation.

Experimental results show that our method performs better than some existing methods and achieves good anticompression forgery face detection performance on FF++. At the same time, for the detection of face forgery, the generalization ability under cross-dataset testing is also important. Therefore, a robust method with strong generalization ability is the goal of our future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Fiscal Expenditure Program of China under grant 130016000000200003.