Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss

Bibliographic Details
Main Authors:	Sujoy Roychowdhury, Preeti Rao
Format:	Article
Language:	English
Published:	Ubiquity Press 2025-07-01
Series:	Transactions of the International Society for Music Information Retrieval
Subjects:	multimodal raga classification; gradient reversal; disentanglement; multimodal fusion
Online Access:	https://account.transactions.ismir.net/index.php/up-j-tismir/article/view/221

_version_	1849230046793826304
author	Sujoy Roychowdhury; Preeti Rao
author_facet	Sujoy Roychowdhury; Preeti Rao
author_sort	Sujoy Roychowdhury
collection	DOAJ
description	The art music of North India is rich in the use of hand gestures that accompany vocal performance. However, such gestures are idiosyncratic and are neither taught nor rehearsed by the singer. The recent availability of computer vision techniques allows us to use computational methods to analyze the accompanying gestures and look for complementarity with the audio. Using an available dataset of Hindustani raga performances by 11 singers, we extract features from audio and video (gesture) and apply deep learning models to classify the raga from short excerpts. With the gesture-based classification approximately at chance, we attempt to disentangle the singer information from the raga classification embeddings by using a gradient reversal approach. We next investigate a framework that considers the body of existing multimodal fusion techniques via experiments for the multimodal raga classification. Despite the much weaker performance of the video modality relative to audio, we achieve a singer–feature-disentangled multimodal fusion system that slightly, but consistently, outperforms the audio-only classification.
format	Article
id	doaj-art-d342c663b8144a9e8d3ea5b40f1da022
institution	Kabale University
issn	2514-3298
language	English
publishDate	2025-07-01
publisher	Ubiquity Press
record_format	Article
series	Transactions of the International Society for Music Information Retrieval
spelling	doaj-art-d342c663b8144a9e8d3ea5b40f1da022; 2025-08-21T12:49:43Z; eng; Ubiquity Press; Transactions of the International Society for Music Information Retrieval; 2514-3298; 2025-07-01; 8; 1; 195–212; 10.5334/tismir.221; 221; Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss; Sujoy Roychowdhury (Indian Institute of Technology Bombay); Preeti Rao (Indian Institute of Technology Bombay); https://account.transactions.ismir.net/index.php/up-j-tismir/article/view/221; multimodal raga classification; gradient reversal; disentanglement; multimodal fusion
title	Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss
topic	multimodal raga classification; gradient reversal; disentanglement; multimodal fusion
url	https://account.transactions.ismir.net/index.php/up-j-tismir/article/view/221
