CLIP-BCA-Gated: A Dynamic Multimodal Framework for Real-Time Humanitarian Crisis Classification with Bi-Cross-Attention and Adaptive Gating
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-08-01 |
| Series: | Applied Sciences |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2076-3417/15/15/8758 |
| Summary: | During humanitarian crises, social media generates over 30 million multimodal tweets daily, but 20% textual noise, 40% cross-modal misalignment, and severe class imbalance (4.1% rare classes) hinder effective classification. This study presents CLIP-BCA-Gated, a dynamic multimodal framework that integrates bidirectional cross-attention (Bi-Cross-Attention) and adaptive gating within the CLIP architecture to address these challenges. The Bi-Cross-Attention module enables fine-grained cross-modal semantic alignment, while the adaptive gating mechanism dynamically weights modalities to suppress noise. Hierarchical learning rate scheduling and multidimensional data augmentation further optimize feature fusion for real-time multiclass classification. On the CrisisMMD benchmark, CLIP-BCA-Gated achieves 91.77% classification accuracy (1.55% higher than baseline CLIP and 2.33% higher than state-of-the-art ALIGN), with exceptional recall for critical categories: infrastructure damage (93.42%) and rescue efforts (92.15%). The model processes tweets at 0.083 s per instance, meeting real-time deployment requirements for emergency response systems. Ablation studies show that Bi-Cross-Attention contributes a 2.54% accuracy improvement and adaptive gating a further 1.12%. This work demonstrates that dynamic multimodal fusion enhances resilience to noisy social media data, directly supporting SDG 11 through scalable real-time disaster information triage. The framework’s noise-robust design and sub-second inference make it a practical solution for humanitarian organizations requiring rapid crisis categorization. |
| ISSN: | 2076-3417 |
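The abstract's core fusion idea (bidirectional cross-attention between text and image features, followed by an adaptive gate that weights the two modalities) can be sketched in minimal NumPy form. This is an illustrative sketch of the general technique, not the authors' implementation: the dimensions, pooling, function names, and the sigmoid-gate form are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Queries from one modality attend over keys/values of the other."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (Nq, Nk) similarity
    return softmax(scores, axis=-1) @ kv_feats   # (Nq, d) attended features

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_bi_cross_fusion(text, image, w_gate, b_gate):
    """Bi-Cross-Attention in both directions, then an adaptive scalar gate
    that weights text-attended vs. image-attended features (assumed form)."""
    t2i = cross_attention(text, image)   # text tokens attend to image patches
    i2t = cross_attention(image, text)   # image patches attend to text tokens
    # Mean-pool token/patch features to one vector per modality
    t_vec = t2i.mean(axis=0)
    i_vec = i2t.mean(axis=0)
    # Adaptive gate: scalar weight computed from the concatenated features,
    # letting the model down-weight a noisy modality
    g = sigmoid(np.concatenate([t_vec, i_vec]) @ w_gate + b_gate)
    return g * t_vec + (1.0 - g) * i_vec

# Toy example with random features standing in for CLIP encoder outputs
rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))    # 5 text tokens
image = rng.standard_normal((7, d))   # 7 image patches
w_gate = rng.standard_normal(2 * d) * 0.1
fused = gated_bi_cross_fusion(text, image, w_gate, 0.0)
print(fused.shape)  # (8,)
```

In a real system the pooled `fused` vector would feed a classification head over the CrisisMMD categories; the gate lets the model lean on the cleaner modality when one side is noisy, which is the noise-suppression role the abstract attributes to adaptive gating.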