Abstract
Digital videos have become essential to broadcast news that reaches audiences around the world, so it is important to ensure the reliability of broadcast videos. Unfortunately, digital videos can be manipulated by replacing a person’s face or expressions with those of another person without leaving visible traces. Detecting this facial manipulation is a challenging problem because of the lack of digital forensic techniques that can be used to verify the originality of video content. In this paper, we propose a novel approach, dubbed FaceMD, that fuses three streams of convolutional neural networks to detect facial manipulation. FaceMD incorporates spatiotemporal information by fusing video frames, motion residuals, and 3D gradients to improve facial manipulation detection accuracy. We combine these three streams using different fusion methods and fusion locations to make the best use of this spatiotemporal information and thereby increase detection performance. Experimental results show that FaceMD achieves state-of-the-art accuracy on two different facial manipulation data sets.
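As a rough illustration of the three inputs named above, the following minimal sketch derives motion residuals and 3D gradients from a grayscale clip and fuses per-stream scores. It rests on assumptions that are not taken from the paper: motion residuals are approximated by consecutive-frame differences, 3D gradients by the gradient magnitude along time, height, and width, and fusion by a weighted average of class probabilities. The function names `motion_residuals`, `gradients_3d`, and `late_fusion` are hypothetical, and the convolutional networks themselves are omitted.

```python
import numpy as np


def motion_residuals(clip):
    """Assumed definition: difference between consecutive frames.

    clip has shape (T, H, W); the first residual frame is left as zeros.
    """
    clip = clip.astype(np.float32)
    residuals = np.zeros_like(clip)
    residuals[1:] = clip[1:] - clip[:-1]
    return residuals


def gradients_3d(clip):
    """Assumed definition: gradient magnitude along time, height, and width."""
    g_t, g_y, g_x = np.gradient(clip.astype(np.float32))
    return np.sqrt(g_t ** 2 + g_y ** 2 + g_x ** 2)


def late_fusion(stream_scores, weights=None):
    """Score-level fusion: weighted average of per-stream class probabilities.

    stream_scores has shape (num_streams, num_classes). This is only one of
    several possible fusion methods and is chosen here for simplicity.
    """
    stream_scores = np.asarray(stream_scores, dtype=np.float32)
    if weights is None:
        # Equal weight for every stream when none are supplied.
        weights = np.full(len(stream_scores), 1.0 / len(stream_scores))
    return np.average(stream_scores, axis=0, weights=weights)
```

In this sketch, each of the three arrays (the raw frames, `motion_residuals(clip)`, and `gradients_3d(clip)`) would be passed to its own network, and `late_fusion` would combine the three resulting score vectors. Fusing intermediate feature maps instead of scores is the alternative implied by varying the fusion location.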