Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss

The art music of North India is rich in the use of hand gestures that accompany vocal performance. However, such gestures are idiosyncratic and are neither taught nor rehearsed by the singer. The recent availability of computer vision techniques allows us to use computational methods to analyze the...

Bibliographic Details
Main Authors: Sujoy Roychowdhury, Preeti Rao
Format: Article
Language: English
Published: Ubiquity Press 2025-07-01
Series: Transactions of the International Society for Music Information Retrieval
Subjects: multimodal raga classification; gradient reversal; disentanglement; multimodal fusion
Online Access: https://account.transactions.ismir.net/index.php/up-j-tismir/article/view/221
author Sujoy Roychowdhury
Preeti Rao
collection DOAJ
description The art music of North India is rich in the use of hand gestures that accompany vocal performance. However, such gestures are idiosyncratic and are neither taught nor rehearsed by the singer. The recent availability of computer vision techniques allows us to use computational methods to analyze the accompanying gestures and look for complementarity with the audio. Using an available dataset of Hindustani raga performances by 11 singers, we extract features from audio and video (gesture) and apply deep learning models to classify the raga from short excerpts. With the gesture-based classification approximately at chance, we attempt to disentangle the singer information from the raga classification embeddings by using a gradient reversal approach. We next investigate a framework that considers the body of existing multimodal fusion techniques via experiments for the multimodal raga classification. Despite the much weaker performance of the video modality relative to audio, we achieve a singer–feature-disentangled multimodal fusion system that slightly, but consistently, outperforms the audio-only classification.
format Article
id doaj-art-d342c663b8144a9e8d3ea5b40f1da022
institution Kabale University
issn 2514-3298
language English
publishDate 2025-07-01
publisher Ubiquity Press
record_format Article
series Transactions of the International Society for Music Information Retrieval
spelling doaj-art-d342c663b8144a9e8d3ea5b40f1da022 | 2025-08-21T12:49:43Z | eng | Ubiquity Press | Transactions of the International Society for Music Information Retrieval | 2514-3298 | 2025-07-01 | 8 | 1 | 195–212 | 10.5334/tismir.221 | 221 | Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss | Sujoy Roychowdhury (Indian Institute of Technology Bombay) | Preeti Rao (Indian Institute of Technology Bombay) | https://account.transactions.ismir.net/index.php/up-j-tismir/article/view/221 | multimodal raga classification | gradient reversal | disentanglement | multimodal fusion
title Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss
topic multimodal raga classification
gradient reversal
disentanglement
multimodal fusion
url https://account.transactions.ismir.net/index.php/up-j-tismir/article/view/221
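
Illustrative note on gradient reversal: the description above says the authors use a gradient reversal approach to disentangle singer information from the raga classification embeddings. The record gives no implementation details, so the following is only a generic, minimal sketch of the idea in plain Python, not the article's method. The class name GradientReversal, the helper combined_embedding_gradient, the strength parameter lam, and all numeric values are hypothetical. The layer acts as the identity in the forward pass and flips (and scales) the gradient in the backward pass, so that a shared embedding is pushed to help the raga head while hindering the singer head.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lam in the backward pass."""

    lam: float = 1.0  # hypothetical reversal strength, not taken from the article

    def forward(self, x: Vector) -> Vector:
        # The embedding is passed to the singer head unchanged.
        return list(x)

    def backward(self, grad_output: Vector) -> Vector:
        # The gradient coming back from the singer head is negated and scaled.
        return [-self.lam * g for g in grad_output]


def combined_embedding_gradient(
    grad_raga: Vector, grad_singer: Vector, grl: GradientReversal
) -> Vector:
    """Sum the gradients that reach a shared embedding from two heads.

    The raga-head gradient is used as-is; the singer-head gradient first
    passes through the reversal layer (assumed two-head setup).
    """
    reversed_singer = grl.backward(grad_singer)
    return [r + s for r, s in zip(grad_raga, reversed_singer)]


# Toy values, chosen only to make the arithmetic easy to follow.
grl = GradientReversal(lam=0.5)
embedding = [0.2, -1.0, 3.0]
passed_through = grl.forward(embedding)  # same values as `embedding`
grad = combined_embedding_gradient([1.0, 0.0, -2.0], [4.0, 2.0, -2.0], grl)
print(grad)  # [-1.0, -1.0, -1.0]
```

In a deep learning framework the same behavior would normally be expressed as a custom autograd operation; the manual forward and backward methods here only make the sign flip explicit.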