CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating

Bibliographic Details
Main Authors: Shanshan Li, Qingjie Liu, Zhian Pan, Xucheng Wu
Format: Article
Language: English
Published: MDPI AG 2025-08-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/15/8758
Description
Summary: During humanitarian crises, social media generates over 30 million multimodal tweets daily, but 20% textual noise, 40% cross-modal misalignment, and severe class imbalance (4.1% rare classes) hinder effective classification. This study presents CLIP-BCA-Gated, a dynamic multimodal framework that integrates bidirectional cross-attention (Bi-Cross-Attention) and adaptive gating within the CLIP architecture to address these challenges. The Bi-Cross-Attention module enables fine-grained cross-modal semantic alignment, while the adaptive gating mechanism dynamically weights modalities to suppress noise. Hierarchical learning rate scheduling and multidimensional data augmentation further optimize feature fusion for real-time multiclass classification. On the CrisisMMD benchmark, CLIP-BCA-Gated achieves 91.77% classification accuracy (1.55% higher than baseline CLIP and 2.33% higher than state-of-the-art ALIGN), with exceptional recall for critical categories: infrastructure damage (93.42%) and rescue efforts (92.15%). The model processes tweets at 0.083 s per instance, meeting real-time deployment requirements for emergency response systems. Ablation studies show that Bi-Cross-Attention contributes a 2.54% accuracy improvement and adaptive gating a further 1.12%. This work demonstrates that dynamic multimodal fusion enhances resilience to noisy social media data, directly supporting SDG 11 through scalable real-time disaster information triage. The framework’s noise-robust design and sub-second inference make it a practical solution for humanitarian organizations requiring rapid crisis categorization.
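The fusion mechanism described in the abstract can be illustrated with a minimal sketch: each modality attends to the other (bidirectional cross-attention), and a sigmoid gate then weights the pooled text and image features before fusion. This is not the authors' implementation; the random projection standing in for the learned gate weights, the dimensions, and all function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    # scaled dot-product attention: queries from one modality,
    # keys/values from the other modality
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_feats

def bi_cross_attention_gated(text, image):
    # bidirectional: text attends to image, and image attends to text
    t2i = cross_attention(text, image)   # text tokens enriched with image context
    i2t = cross_attention(image, text)   # image patches enriched with text context
    t_pool, i_pool = t2i.mean(axis=0), i2t.mean(axis=0)
    # adaptive gate: sigmoid over the concatenated pooled features
    # (a fixed random projection stands in for learned gate weights)
    rng = np.random.default_rng(0)
    W = rng.standard_normal((2 * t_pool.size, t_pool.size)) * 0.01
    g = 1.0 / (1.0 + np.exp(-np.concatenate([t_pool, i_pool]) @ W))
    # gated fusion: per-dimension weighting of the two modalities
    return g * t_pool + (1.0 - g) * i_pool

text = np.random.default_rng(1).standard_normal((6, 16))   # 6 token embeddings
image = np.random.default_rng(2).standard_normal((4, 16))  # 4 patch embeddings
fused = bi_cross_attention_gated(text, image)
print(fused.shape)  # (16,)
```

Because the gate is a per-dimension value in (0, 1), a noisy modality can be down-weighted feature by feature rather than discarded wholesale, which is the behavior the abstract attributes to the adaptive gating mechanism.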
ISSN: 2076-3417