Abstract
With the development of social media, people tend to express their opinions through multimedia such as text, photos, audio, and video. Meanwhile, more rumors hidden in multi-modal content are misleading social media users. The early rumor detection task aims to detect rumors before they spread. However, annotating multi-modal data requires a large amount of manual effort. Existing approaches rely on transfer learning to overcome this, but they ignore the differences between the source domain of the pre-trained models and the task domain. In this paper, a Multi-task and Domain Adaptation based Multi-modal Network (MDMN) is proposed, which consists of three components: a Textual Feature Extractor, a Visual Feature Extractor, and a Fusion & Classification Network. To improve the diversity and stability of the textual representation, a Multi-task Sharing Layer, a task-specific Transformer Encoder, and a Selection Layer are applied. Domain Adaptation is used to train an adaptive model for extracting the visual representation. The adaptive model can encode task data better than a fine-tuned pre-trained model. The multi-modal representations are then fused through two strategies, feature-level and decision-level fusion, each with its own benefits. Experiments on multi-modal datasets collected from Weibo and Twitter show that the proposed MDMN outperforms the baseline methods. The decision-level fusion strategy achieves a recall of over 92%.
• Proposed a Multi-modal Network for early rumor detection.
• Developed a multi-task framework for few-shot NLP tasks.
• Reduced the gap between social network images and ImageNet.
• Compared feature-level and decision-level fusion of textual and visual representations.