A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments

Human visual attention is influenced by multiple factors, including visual, auditory, and facial cues. While integrating auditory and visual information enhances prediction accuracy, many existing models rely solely on visual-temporal data. Inspired by cognitive studies, we propose a computational model that combines spatial, temporal, face (low-level and high-level visual cues), and auditory saliency to predict visual attention more effectively. Our approach processes video frames to generate spatial, temporal, and face saliency maps, while an audio branch localizes sound-producing objects. These maps are then integrated to form the final audio-visual saliency map. Experimental results on the audio-visual dataset demonstrate that our model outperforms state-of-the-art image and video saliency models as well as the basic model, and aligns more closely with behavioral and eye-tracking data. Additionally, ablation studies highlight the contribution of each information source to the final prediction.

Bibliographic Details
Main Authors: Hamideh Yazdani, Alireza Bosaghzadeh, Reza Ebrahimpour, Fadi Dornaika
Format: Article
Language: English
Published: MDPI AG, 2025-05-01
Series: Big Data and Cognitive Computing
Subjects: visual attention; audio-visual saliency; face saliency; saliency prediction; fixation prediction; attention fusion
Online Access: https://www.mdpi.com/2504-2289/9/5/120
collection DOAJ
description Human visual attention is influenced by multiple factors, including visual, auditory, and facial cues. While integrating auditory and visual information enhances prediction accuracy, many existing models rely solely on visual-temporal data. Inspired by cognitive studies, we propose a computational model that combines spatial, temporal, face (low-level and high-level visual cues), and auditory saliency to predict visual attention more effectively. Our approach processes video frames to generate spatial, temporal, and face saliency maps, while an audio branch localizes sound-producing objects. These maps are then integrated to form the final audio-visual saliency map. Experimental results on the audio-visual dataset demonstrate that our model outperforms state-of-the-art image and video saliency models as well as the basic model, and aligns more closely with behavioral and eye-tracking data. Additionally, ablation studies highlight the contribution of each information source to the final prediction.
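The pipeline described above (separate spatial, temporal, face, and auditory saliency maps integrated into one audio-visual map) can be sketched as follows. This is a minimal illustration only: the equal weights, the min-max normalization, and the function names `normalize` and `fuse_saliency` are assumptions for the sketch, not the paper's actual fusion rule.

```python
import numpy as np

def normalize(m):
    """Rescale a saliency map to [0, 1]; a constant map becomes all zeros."""
    m = m.astype(np.float64)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

def fuse_saliency(spatial, temporal, face, audio,
                  weights=(0.25, 0.25, 0.25, 0.25)):
    """Fuse per-frame saliency maps into one audio-visual map.

    Each branch map is normalized and combined by a weighted sum
    (an illustrative fusion rule, not the one from the paper).
    """
    maps = [normalize(m) for m in (spatial, temporal, face, audio)]
    fused = sum(w * m for w, m in zip(weights, maps))
    return normalize(fused)

# Toy frame-sized maps standing in for the four branch outputs.
rng = np.random.default_rng(0)
spatial, temporal, face, audio = (rng.random((4, 4)) for _ in range(4))
out = fuse_saliency(spatial, temporal, face, audio)
```

In a real model the weights would typically be learned or tuned per dataset; the sketch only shows the map-level integration step.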
id doaj-art-15d60f7b7d2a430b93dd4f58a205120e
issn 2504-2289
doi 10.3390/bdcc9050120
published Big Data and Cognitive Computing, vol. 9, no. 5, art. 120 (2025-05-01), MDPI AG
affiliations:
Hamideh Yazdani: Faculty of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran 16785163, Iran
Alireza Bosaghzadeh: Faculty of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran 16785163, Iran
Reza Ebrahimpour: Center for Cognitive Science, Institute for Convergence Science and Technology (ICST), Sharif University of Technology, Tehran 14588-89694, Iran
Fadi Dornaika: Faculty of Computer Engineering, University of the Basque Country, 20018 San Sebastian, Spain
topic visual attention
audio-visual saliency
face saliency
saliency prediction
fixation prediction
attention fusion