A multimodal transformer-based visual question answering method integrating local and global information.
Current visual question answering (VQA) models face limitations in multimodal feature fusion and often give insufficient consideration to local information. To address this, this study proposes a multimodal Transformer VQA network based on local and global information integration…
| Main Authors: | Cuiyang Huang, Zihan Hu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2025-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0324757 |
Similar Items
- Multimodal representative answer extraction in community question answering
  by: Ming Li, et al.
  Published: (2023-10-01)
- Enhancing Visual Question Answering for Multiple Choice Questions
  by: Rashi Goel, et al.
  Published: (2025-01-01)
- Visual Question Answering Using Semantic Information from Image Descriptions
  by: Tasmia Tasmia, et al.
  Published: (2021-04-01)
- Informed-Learning-Guided Visual Question Answering Model of Crop Disease
  by: Yunpeng Zhao, et al.
  Published: (2024-01-01)
- Designing and Evaluating a Dual-Stream Transformer-Based Architecture for Visual Question Answering
  by: Faheem Shehzad, et al.
  Published: (2024-01-01)