A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments
Human visual attention is influenced by multiple factors, including visual, auditory, and facial cues. While integrating auditory and visual information enhances prediction accuracy, many existing models rely solely on visual-temporal data. Inspired by cognitive studies, we propose a computational model that combines spatial, temporal, face (low-level and high-level visual cues), and auditory saliency to predict visual attention more effectively. Our approach processes video frames to generate spatial, temporal, and face saliency maps, while an audio branch localizes sound-producing objects. These maps are then integrated to form the final audio-visual saliency map. Experimental results on the audio-visual dataset demonstrate that our model outperforms state-of-the-art image and video saliency models and the basic model and aligns more closely with behavioral and eye-tracking data. Additionally, ablation studies highlight the contribution of each information source to the final prediction.
Saved in:
| Main Authors: | Hamideh Yazdani, Alireza Bosaghzadeh, Reza Ebrahimpour, Fadi Dornaika |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | Big Data and Cognitive Computing |
| Subjects: | visual attention; audio-visual saliency; face saliency; saliency prediction; fixation prediction; attention fusion |
| Online Access: | https://www.mdpi.com/2504-2289/9/5/120 |
| _version_ | 1849711392071876608 |
|---|---|
| author | Hamideh Yazdani; Alireza Bosaghzadeh; Reza Ebrahimpour; Fadi Dornaika |
| author_facet | Hamideh Yazdani; Alireza Bosaghzadeh; Reza Ebrahimpour; Fadi Dornaika |
| author_sort | Hamideh Yazdani |
| collection | DOAJ |
| description | Human visual attention is influenced by multiple factors, including visual, auditory, and facial cues. While integrating auditory and visual information enhances prediction accuracy, many existing models rely solely on visual-temporal data. Inspired by cognitive studies, we propose a computational model that combines spatial, temporal, face (low-level and high-level visual cues), and auditory saliency to predict visual attention more effectively. Our approach processes video frames to generate spatial, temporal, and face saliency maps, while an audio branch localizes sound-producing objects. These maps are then integrated to form the final audio-visual saliency map. Experimental results on the audio-visual dataset demonstrate that our model outperforms state-of-the-art image and video saliency models and the basic model and aligns more closely with behavioral and eye-tracking data. Additionally, ablation studies highlight the contribution of each information source to the final prediction. |
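The fusion step described in the abstract, where per-frame spatial, temporal, face, and auditory saliency maps are integrated into one audio-visual saliency map, can be sketched as below. The normalization scheme, fusion weights, map shapes, and function names are illustrative assumptions for a simple weighted linear combination, not the authors' published model:

```python
import numpy as np

def normalize(m):
    """Rescale a saliency map to [0, 1]; a flat map becomes all zeros."""
    m = m.astype(float)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

def fuse_saliency(spatial, temporal, face, audio,
                  weights=(0.3, 0.3, 0.2, 0.2)):
    """Weighted linear fusion of four per-frame saliency maps
    into a single audio-visual saliency map (illustrative weights)."""
    maps = [normalize(m) for m in (spatial, temporal, face, audio)]
    fused = sum(w * m for w, m in zip(weights, maps))
    return normalize(fused)

# Toy example: four random 4x4 "saliency maps" for one frame.
rng = np.random.default_rng(0)
spatial, temporal, face, audio = (rng.random((4, 4)) for _ in range(4))
out = fuse_saliency(spatial, temporal, face, audio)
```

In practice each branch (e.g. the audio branch that localizes sound-producing objects) would produce its map from the video frames and audio track; the weights could also be learned rather than fixed.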
| format | Article |
| id | doaj-art-15d60f7b7d2a430b93dd4f58a205120e |
| institution | DOAJ |
| issn | 2504-2289 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Big Data and Cognitive Computing |
| spelling | doaj-art-15d60f7b7d2a430b93dd4f58a205120e; 2025-08-20T03:14:38Z; eng; MDPI AG; Big Data and Cognitive Computing; 2504-2289; 2025-05-01; vol. 9, iss. 5, art. 120; 10.3390/bdcc9050120; A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments; Hamideh Yazdani (Faculty of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran 16785163, Iran); Alireza Bosaghzadeh (Faculty of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran 16785163, Iran); Reza Ebrahimpour (Center for Cognitive Science, Institute for Convergence Science and Technology (ICST), Sharif University of Technology, Tehran 14588-89694, Iran); Fadi Dornaika (Faculty of Computer Engineering, University of the Basque Country, 20018 San Sebastian, Spain); Human visual attention is influenced by multiple factors, including visual, auditory, and facial cues. While integrating auditory and visual information enhances prediction accuracy, many existing models rely solely on visual-temporal data. Inspired by cognitive studies, we propose a computational model that combines spatial, temporal, face (low-level and high-level visual cues), and auditory saliency to predict visual attention more effectively. Our approach processes video frames to generate spatial, temporal, and face saliency maps, while an audio branch localizes sound-producing objects. These maps are then integrated to form the final audio-visual saliency map. Experimental results on the audio-visual dataset demonstrate that our model outperforms state-of-the-art image and video saliency models and the basic model and aligns more closely with behavioral and eye-tracking data. Additionally, ablation studies highlight the contribution of each information source to the final prediction. https://www.mdpi.com/2504-2289/9/5/120; visual attention; audio-visual saliency; face saliency; saliency prediction; fixation prediction; attention fusion |
| spellingShingle | Hamideh Yazdani; Alireza Bosaghzadeh; Reza Ebrahimpour; Fadi Dornaika; A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments; Big Data and Cognitive Computing; visual attention; audio-visual saliency; face saliency; saliency prediction; fixation prediction; attention fusion |
| title | A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments |
| title_full | A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments |
| title_fullStr | A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments |
| title_full_unstemmed | A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments |
| title_short | A Computational–Cognitive Model of Audio-Visual Attention in Dynamic Environments |
| title_sort | computational cognitive model of audio visual attention in dynamic environments |
| topic | visual attention; audio-visual saliency; face saliency; saliency prediction; fixation prediction; attention fusion |
| url | https://www.mdpi.com/2504-2289/9/5/120 |
| work_keys_str_mv | AT hamidehyazdani acomputationalcognitivemodelofaudiovisualattentionindynamicenvironments AT alirezabosaghzadeh acomputationalcognitivemodelofaudiovisualattentionindynamicenvironments AT rezaebrahimpour acomputationalcognitivemodelofaudiovisualattentionindynamicenvironments AT fadidornaika acomputationalcognitivemodelofaudiovisualattentionindynamicenvironments AT hamidehyazdani computationalcognitivemodelofaudiovisualattentionindynamicenvironments AT alirezabosaghzadeh computationalcognitivemodelofaudiovisualattentionindynamicenvironments AT rezaebrahimpour computationalcognitivemodelofaudiovisualattentionindynamicenvironments AT fadidornaika computationalcognitivemodelofaudiovisualattentionindynamicenvironments |