Attention Mechanism-Based Cognition-Level Scene Understanding

Bibliographic Details
Main Authors: Xuejiao Tang, Wenbin Zhang
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Information
Subjects: visual commonsense reasoning; visual understanding
Online Access: https://www.mdpi.com/2078-2489/16/3/203
collection DOAJ
description Given a question–image input, a visual commonsense reasoning (VCR) model predicts an answer together with a supporting rationale, which requires inference grounded in real-world knowledge. The VCR task is a cognition-level scene understanding challenge: it calls for exploiting multi-source information, learning different levels of understanding, and drawing on extensive commonsense knowledge. The task has attracted research interest because of its wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches have generally relied on pre-training or on memory-based models that encode long-term dependencies. However, these approaches suffer from limited generalizability and from information loss over long sequences. In this work, we propose a parallel attention-based cognitive VCR network, termed PAVCR, which fuses visual–textual information efficiently and encodes semantic information in parallel so that the model captures rich information for cognition-level inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides an intuitive interpretation of visual commonsense reasoning.
id doaj-art-c0d76b70e9db4eb9bda29de85a84975e
institution OA Journals
issn 2078-2489
doi 10.3390/info16030203
volume 16
issue 3
article_number 203
affiliation Xuejiao Tang: Institute for Information Processing, Leibniz University Hannover, Welfengarten 1, 30167 Hannover, Germany
affiliation Wenbin Zhang: Knight Foundation School of Computing & Information Sciences, Florida International University, Miami, FL 33199, USA
timestamp 2025-08-20T02:11:23Z
topic visual commonsense reasoning
visual understanding