Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss
The art music of North India is rich in the use of hand gestures that accompany vocal performance. However, such gestures are idiosyncratic and are neither taught nor rehearsed by the singer. The recent availability of computer vision techniques allows us to use computational methods to analyze the...
| Main Authors: | Sujoy Roychowdhury, Preeti Rao |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Ubiquity Press, 2025-07-01 |
| Series: | Transactions of the International Society for Music Information Retrieval |
| Subjects: | multimodal raga classification, gradient reversal, disentanglement, multimodal fusion |
| Online Access: | https://account.transactions.ismir.net/index.php/up-j-tismir/article/view/221 |
| _version_ | 1849230046793826304 |
|---|---|
| author | Sujoy Roychowdhury, Preeti Rao |
| author_facet | Sujoy Roychowdhury, Preeti Rao |
| author_sort | Sujoy Roychowdhury |
| collection | DOAJ |
| description | The art music of North India is rich in the use of hand gestures that accompany vocal performance. However, such gestures are idiosyncratic and are neither taught nor rehearsed by the singer. The recent availability of computer vision techniques allows us to use computational methods to analyze the accompanying gestures and look for complementarity with the audio. Using an available dataset of Hindustani raga performances by 11 singers, we extract features from audio and video (gesture) and apply deep learning models to classify the raga from short excerpts. With the gesture-based classification approximately at chance, we attempt to disentangle the singer information from the raga classification embeddings by using a gradient reversal approach. We next investigate a framework that considers the body of existing multimodal fusion techniques via experiments for the multimodal raga classification. Despite the much weaker performance of the video modality relative to audio, we achieve a singer–feature-disentangled multimodal fusion system that slightly, but consistently, outperforms the audio-only classification. |
| format | Article |
| id | doaj-art-d342c663b8144a9e8d3ea5b40f1da022 |
| institution | Kabale University |
| issn | 2514-3298 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Ubiquity Press |
| record_format | Article |
| series | Transactions of the International Society for Music Information Retrieval |
| spelling | doaj-art-d342c663b8144a9e8d3ea5b40f1da022; 2025-08-21T12:49:43Z; eng; Ubiquity Press; Transactions of the International Society for Music Information Retrieval; 2514-3298; 2025-07-01; 8; 1; 195–212; 195–212; 10.5334/tismir.221; 221; Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss; Sujoy Roychowdhury 0; Preeti Rao 1; Indian Institute of Technology Bombay; Indian Institute of Technology Bombay; The art music of North India is rich in the use of hand gestures that accompany vocal performance. However, such gestures are idiosyncratic and are neither taught nor rehearsed by the singer. The recent availability of computer vision techniques allows us to use computational methods to analyze the accompanying gestures and look for complementarity with the audio. Using an available dataset of Hindustani raga performances by 11 singers, we extract features from audio and video (gesture) and apply deep learning models to classify the raga from short excerpts. With the gesture-based classification approximately at chance, we attempt to disentangle the singer information from the raga classification embeddings by using a gradient reversal approach. We next investigate a framework that considers the body of existing multimodal fusion techniques via experiments for the multimodal raga classification. Despite the much weaker performance of the video modality relative to audio, we achieve a singer–feature-disentangled multimodal fusion system that slightly, but consistently, outperforms the audio-only classification.; https://account.transactions.ismir.net/index.php/up-j-tismir/article/view/221; multimodal raga classification; gradient reversal; disentanglement; multimodal fusion |
| spellingShingle | Sujoy Roychowdhury Preeti Rao Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss Transactions of the International Society for Music Information Retrieval multimodal raga classification gradient reversal disentanglement multimodal fusion |
| title | Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss |
| title_full | Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss |
| title_fullStr | Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss |
| title_full_unstemmed | Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss |
| title_short | Multimodal Raga Classification from Vocal Performances with Disentanglement and Contrastive Loss |
| title_sort | multimodal raga classification from vocal performances with disentanglement and contrastive loss |
| topic | multimodal raga classification, gradient reversal, disentanglement, multimodal fusion |
| url | https://account.transactions.ismir.net/index.php/up-j-tismir/article/view/221 |
| work_keys_str_mv | AT sujoyroychowdhury multimodalragaclassificationfromvocalperformanceswithdisentanglementandcontrastiveloss AT preetirao multimodalragaclassificationfromvocalperformanceswithdisentanglementandcontrastiveloss |