Research Article

Vision Transformer-Based Video Hashing Retrieval for Tracing the Source of Fake Videos

Table 2

Evaluation of cross-dataset performance on five subsets of FaceForensics++ (FF++): DeepFake (DF), Face2Face (F2F), Faceswap (FS), NeuralTexture (NT), and FaceShifter (FSh).

| Training set | Method | DF | F2F | FS | NT | FSh |
|---|---|---|---|---|---|---|
| DF | Xception [54] | 99.3 | 73.6 | 49.0 | 73.6 | |
| DF | HRNet [51] | 99.3 | 68.2 | 39.1 | 71.4 | |
| DF | Face X-ray [52] | 98.7 | 63.3 | 60.0 | 69.8 | |
| DF | ADD [1] | 98.7 | | | | |
| DF | Grad-CAM [53] | 99.2 | 76.4 | 49.7 | 81.4 | |
| DF | Ours | 98.8 | 98.8 | 98.8 | 99.1 | 98.6 |
| F2F | Xception [54] | 80.3 | 99.4 | 76.2 | 69.6 | |
| F2F | HRNet [51] | 83.6 | 99.5 | 56.6 | 61.3 | |
| F2F | Face X-ray [52] | 63.0 | 98.4 | 93.8 | 94.5 | |
| F2F | ADD [1] | | 96.8 | | | |
| F2F | Grad-CAM [53] | 83.7 | 99.4 | 98.7 | 98.4 | |
| F2F | Ours | 99.2 | 99.4 | 99.2 | 99.2 | 99.2 |
| FS | Xception [54] | 66.4 | 88.8 | 99.4 | 71.3 | |
| FS | HRNet [51] | 63.6 | 64.1 | 99.2 | 68.9 | |
| FS | Face X-ray [52] | 45.8 | 96.1 | 98.1 | 95.7 | |
| FS | ADD [1] | | | 97.9 | | |
| FS | Grad-CAM [53] | 68.5 | 99.3 | 99.5 | 98.0 | |
| FS | Ours | 99.9 | 99.8 | 99.9 | 99.8 | 99.9 |
| NT | Xception [54] | 79.9 | 81.3 | 73.1 | 99.1 | |
| NT | HRNet [51] | 94.1 | 87.3 | 64.1 | 98.6 | |
| NT | Face X-ray [52] | 70.5 | 91.7 | 91.0 | 92.5 | |
| NT | ADD [1] | | | | 88.5 | |
| NT | Grad-CAM [53] | 89.4 | 99.5 | 99.3 | 99.4 | |
| NT | Ours | 99.3 | 99.2 | 99.3 | 99.3 | 99.3 |
| FSh | ADD [1] | | | | | 96.6 |
| FSh | Ours | 98.8 | 98.8 | 99.0 | 99.3 | 99.1 |

Entries are test-set accuracy (ACC, %); blank cells indicate results not reported for that method.

We trained on one subset and tested on the other four subsets. Bold values represent the best result in the corresponding column.
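The cross-dataset protocol behind this table (train on one subset, evaluate on every subset) can be sketched as a small loop that assembles a train-by-test accuracy matrix. This is only an illustrative sketch: `cross_dataset_matrix`, `dummy_evaluate`, and the placeholder accuracy values are assumptions for demonstration, not the paper's actual pipeline or results.

```python
# Sketch of the cross-dataset evaluation protocol used for Table 2.
# The evaluate() callable is a stand-in for a real train/test pipeline.

SUBSETS = ["DF", "F2F", "FS", "NT", "FSh"]  # FaceForensics++ manipulation subsets

def cross_dataset_matrix(evaluate):
    """Build a {train_subset: {test_subset: accuracy}} matrix.

    `evaluate(train, test)` should return the accuracy (%) of a model
    trained on `train` and evaluated on `test`.
    """
    return {
        train: {test: evaluate(train, test) for test in SUBSETS}
        for train in SUBSETS
    }

# Toy stand-in: high in-domain accuracy, lower cross-domain accuracy
# (hypothetical numbers, not from the paper).
def dummy_evaluate(train, test):
    return 99.0 if train == test else 75.0

matrix = cross_dataset_matrix(dummy_evaluate)
print(matrix["DF"]["F2F"])  # → 75.0 (trained on DF, tested on F2F)
```

Reading the table through this matrix view: row "DF / Ours" corresponds to `matrix["DF"]`, with the diagonal entry being the in-domain result and the off-diagonal entries measuring generalization to unseen manipulation types.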