MemesViTa: A Novel Multimodal Fusion Technique for Troll Memes Identification
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10766576/ |
| Summary: | The proliferation of troll memes on social media platforms has become a pressing issue due to their potential to spread misinformation and incite conflict. Detecting troll memes is complex because of the intricate interplay between visual and textual elements. In this research, we propose MemesViTa, a novel multimodal fusion model that combines a Vision Transformer (ViT) for image processing with DeBERTa for textual analysis, designed specifically to detect troll memes with high accuracy. This research uses a public dataset of 4,368 training and 1,092 testing image-text pairs. The primary goal of this study is to identify the most effective approach for troll meme detection by evaluating the performance of different multimodal models. We comprehensively assess multimodal deep learning approaches for troll meme detection, leveraging visual, textual, and Large Language Models in zero-shot and few-shot scenarios. Our study integrates models including BERT, GPT-2, GPT-4, VGG16, ResNet50, and ViT to analyze their performance. Experimental results indicate that the proposed ViT+DeBERTa model, named MemesViTa, achieves superior results, with 94.287% accuracy and a 95.82% F1 score, representing a 41.3% improvement over the best visual model (VGG16) and a 48.1% improvement over the best textual model (a CNN). Standalone LLMs like GPT-4, even in few-shot scenarios, show promising yet inferior results, with GPT-4 in a 50-shot scenario reaching an accuracy of 80.00% and an F1 score of 37.00%. Multimodal combinations, including BERT+VGG16 and BERT+VGG16+CLIP, highlight the importance of integrating multiple data modalities. Our findings suggest that combining visual and textual data, together with the strategic use of LLMs, can significantly improve the robustness and accuracy of troll meme detection systems. In the future, we intend to enhance our dataset by incorporating a broader spectrum of troll memes, which will help our model generalize across different social media platforms and contexts. |
| ISSN: | 2169-3536 |
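
The summary describes MemesViTa as a fusion of ViT image features and DeBERTa text features, but this record does not include the paper's architecture details. Below is a minimal PyTorch sketch of one plausible reading: a late-fusion design that concatenates each encoder's [CLS]-position embedding and classifies the result. The checkpoints, fusion head, and hyperparameters here are illustrative assumptions, not the authors' reported configuration.

```python
# Hypothetical sketch of a ViT + DeBERTa late-fusion troll-meme classifier.
# The fusion mechanism (CLS concatenation + small MLP) and the checkpoints
# below are assumptions for illustration only; the catalog record does not
# specify the paper's actual architecture or hyperparameters.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, DebertaV2Model, ViTImageProcessor, ViTModel


class MemesViTaSketch(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Placeholder backbones, not necessarily those used in the paper.
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.deberta = DebertaV2Model.from_pretrained("microsoft/deberta-v3-base")
        fused_dim = self.vit.config.hidden_size + self.deberta.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, num_classes),  # troll vs. non-troll
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        # Use each encoder's [CLS]-position embedding as the modality summary.
        img = self.vit(pixel_values=pixel_values).last_hidden_state[:, 0]
        txt = self.deberta(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]
        # Late fusion: concatenate the two embeddings, then classify.
        return self.classifier(torch.cat([img, txt], dim=-1))


if __name__ == "__main__":
    from PIL import Image

    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    image = Image.new("RGB", (224, 224))  # stand-in for a meme image
    inputs = {
        **processor(images=image, return_tensors="pt"),
        **tokenizer("sample overlaid caption", return_tensors="pt",
                    truncation=True, return_token_type_ids=False),
    }
    with torch.no_grad():
        print(MemesViTaSketch()(**inputs))  # logits, shape [1, 2]
```

Late fusion of this kind keeps each pretrained encoder intact and learns only the joint classification head, which is one common way to combine image and text signals for the image-text pairs the dataset described above provides.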