Research Article

Vision Transformer-Based Video Hashing Retrieval for Tracing the Source of Fake Videos

Table 2

Evaluation of cross-dataset performance on five subsets of FaceForensics++ (FF++): DeepFake (DF), Face2Face (F2F), Faceswap (FS), NeuralTexture (NT), and FaceShifter (FSh).

| Training set | Method | DF | F2F | FS | NT | FSh |
|---|---|---|---|---|---|---|
| DF | Xception [54] | 99.3 | 73.6 | 49.0 | 73.6 | |
| DF | HRNet [51] | 99.3 | 68.2 | 39.1 | 71.4 | |
| DF | Face X-ray [52] | 98.7 | 63.3 | 60.0 | 69.8 | |
| DF | ADD [1] | 98.7 | | | | |
| DF | Grad-CAM [53] | 99.2 | 76.4 | 49.7 | 81.4 | |
| DF | Ours | 98.8 | 98.8 | 98.8 | 99.1 | 98.6 |
| F2F | Xception [54] | 80.3 | 99.4 | 76.2 | 69.6 | |
| F2F | HRNet [51] | 83.6 | 99.5 | 56.6 | 61.3 | |
| F2F | Face X-ray [52] | 63.0 | 98.4 | 93.8 | 94.5 | |
| F2F | ADD [1] | | 96.8 | | | |
| F2F | Grad-CAM [53] | 83.7 | 99.4 | 98.7 | 98.4 | |
| F2F | Ours | 99.2 | 99.4 | 99.2 | 99.2 | 99.2 |
| FS | Xception [54] | 66.4 | 88.8 | 99.4 | 71.3 | |
| FS | HRNet [51] | 63.6 | 64.1 | 99.2 | 68.9 | |
| FS | Face X-ray [52] | 45.8 | 96.1 | 98.1 | 95.7 | |
| FS | ADD [1] | | | 97.9 | | |
| FS | Grad-CAM [53] | 68.5 | 99.3 | 99.5 | 98.0 | |
| FS | Ours | 99.9 | 99.8 | 99.9 | 99.8 | 99.9 |
| NT | Xception [54] | 79.9 | 81.3 | 73.1 | 99.1 | |
| NT | HRNet [51] | 94.1 | 87.3 | 64.1 | 98.6 | |
| NT | Face X-ray [52] | 70.5 | 91.7 | 91.0 | 92.5 | |
| NT | ADD [1] | | | | 88.5 | |
| NT | Grad-CAM [53] | 89.4 | 99.5 | 99.3 | 99.4 | |
| NT | Ours | 99.3 | 99.2 | 99.3 | 99.3 | 99.3 |
| FSh | ADD [1] | | | | | 96.6 |
| FSh | Ours | 98.8 | 98.8 | 99.0 | 99.3 | 99.1 |

Entries are test-set accuracy (ACC, %); blank cells indicate results not reported for that method.

We trained on one subset and tested on the other four subsets. Bold values represent the best result in the corresponding column.
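The cross-dataset protocol behind this table (train on one subset, evaluate on every subset) can be sketched as a small loop that assembles a train-by-test accuracy matrix. This is only an illustrative sketch: `cross_dataset_matrix`, `dummy_evaluate`, and the placeholder accuracy values are assumptions for demonstration, not the paper's actual pipeline or results.

```python
# Sketch of the cross-dataset evaluation protocol used for Table 2.
# The evaluate() callable is a stand-in for a real train/test pipeline.

SUBSETS = ["DF", "F2F", "FS", "NT", "FSh"]  # FaceForensics++ manipulation subsets

def cross_dataset_matrix(evaluate):
    """Build a {train_subset: {test_subset: accuracy}} matrix.

    `evaluate(train, test)` should return the accuracy (%) of a model
    trained on `train` and evaluated on `test`.
    """
    return {
        train: {test: evaluate(train, test) for test in SUBSETS}
        for train in SUBSETS
    }

# Toy stand-in: high in-domain accuracy, lower cross-domain accuracy
# (hypothetical numbers, not from the paper).
def dummy_evaluate(train, test):
    return 99.0 if train == test else 75.0

matrix = cross_dataset_matrix(dummy_evaluate)
print(matrix["DF"]["F2F"])  # → 75.0 (trained on DF, tested on F2F)
```

Reading the table through this matrix view: row "DF / Ours" corresponds to `matrix["DF"]`, with the diagonal entry being the in-domain result and the off-diagonal entries measuring generalization to unseen manipulation types.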