On the Nuances of Multimodal Laughter Detection: Where, Why, and When Visual Insight Augments Audio in Classification
| Main Authors: | |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11037826/ |
| Summary: | This paper explores the impact of visual input on traditional audio classification, focusing on the compilation of audiovisual data and the interaction between modalities in laughter detection. While assessing user reactions to measure potential engagement is distinct from evaluating a content’s entertainment value, incorporating user-centered audio and visual cues enhances both the contextual understanding of stimuli and the completeness of the captured emotional response. Among the results, exploiting facial expressions improves audio-only prediction accuracy by up to 26% when cross-attention mechanisms are employed (see the fusion sketch after this record). To evaluate the robustness of the proposed approach, experiments were systematically conducted under unimodal conditions, omitting each modality in turn to assess its impact on performance. As audio and visual insights contribute unequally to the user’s overall emotional profile, the accuracy gained by adding visual features on top of the audio setup provides evidence of the classifier’s ability to capture the nuances of human laughter behavior. |
| ISSN: | 2169-3536 |
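
The abstract attributes the accuracy gain to cross-attention between the audio and visual streams. As context for readers, the sketch below shows one common way such audio-visual cross-attention fusion can be wired up; the module name `CrossAttentionFusion`, the embedding dimensions, the residual-plus-pooling layout, and the two-class head are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of cross-attention fusion for audio-visual laughter
# classification, assuming pre-extracted per-frame embeddings for both
# modalities. This is NOT the authors' implementation.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Attend from audio tokens to visual tokens, then classify laughter."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 2):
        super().__init__()
        # Audio frames act as queries; facial-expression features serve as
        # keys/values, letting visual context re-weight the audio evidence.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_audio, dim)  e.g. log-mel frame embeddings
        # visual: (batch, T_video, dim)  e.g. face-crop embeddings
        attended, _ = self.cross_attn(query=audio, key=visual, value=visual)
        fused = self.norm(audio + attended)   # residual connection
        pooled = fused.mean(dim=1)            # temporal average pooling
        return self.classifier(pooled)        # laughter / no-laughter logits


if __name__ == "__main__":
    model = CrossAttentionFusion()
    audio = torch.randn(8, 100, 256)   # 8 clips, 100 audio frames each
    visual = torch.randn(8, 25, 256)   # 25 video frames per clip
    print(model(audio, visual).shape)  # torch.Size([8, 2])
```

Note that dropping the `visual` branch (classifying the pooled `audio` tokens alone) reproduces the unimodal ablation the abstract describes: the difference between the two configurations is what isolates the contribution of the visual features.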