A Study on the Impact of Lombard Effect on Recognition of Hindi Syllabic Units Using CNN Based Multimodal ASR Systems

Research on robust multimodal speech recognition systems that use acoustic and visual cues, captured with relatively noise-robust alternate speech sensors, has been gaining interest in the speech processing research community. The primary objective of this...


Bibliographic Details
Main Authors:	Sadasivam UMA MAHESWARI, A. SHAHINA, Ramesh RISHICKESH, A. NAYEEMULLA KHAN
Format:	Article
Language:	English
Published:	Institute of Fundamental Technological Research Polish Academy of Sciences 2020-07-01
Series:	Archives of Acoustics
Subjects:	Lombard speech; multimodal ASR; throat microphone; visual speech; Convolutional Neural Network; Hidden Markov Model
Online Access:	https://acoustics.ippt.pan.pl/index.php/aa/article/view/2623

_version_	1850057279634669568
author	Sadasivam UMA MAHESWARI, A. SHAHINA, Ramesh RISHICKESH, A. NAYEEMULLA KHAN
author_facet	Sadasivam UMA MAHESWARI, A. SHAHINA, Ramesh RISHICKESH, A. NAYEEMULLA KHAN
author_sort	Sadasivam UMA MAHESWARI
collection	DOAJ
description	Research on robust multimodal speech recognition systems that use acoustic and visual cues, captured with relatively noise-robust alternate speech sensors, has been gaining interest in the speech processing research community. The primary objective of this work is to study the exclusive influence of the Lombard effect on the automatic recognition of the confusable syllabic consonant-vowel units of the Hindi language, as a step towards building robust multimodal ASR systems for adverse environments in the context of Indian languages, which are syllabic in nature. The dataset comprises 145 confusable consonant-vowel (CV) syllabic units of Hindi, recorded simultaneously using three modalities that capture the acoustic and visual speech cues: a normal acoustic microphone (NM), a throat microphone (TM), and a camera that captures the associated lip movements. The Lombard effect is induced by feeding crowd noise into the speaker's headphones while recording. Convolutional Neural Network (CNN) models are built to categorise the CV units by place of articulation (POA), manner of articulation (MOA), and vowel, under clean and Lombard conditions. For validation, corresponding Hidden Markov Models (HMM) are also built and tested. Unimodal Automatic Speech Recognition (ASR) systems built using each of the three speech cues from Lombard speech show a loss in recognition of MOA and vowels, while POA recognition improves in all the systems due to the Lombard effect. Combining the three complementary speech cues to build bimodal and trimodal ASR systems reduces the recognition loss due to the Lombard effect for MOA and vowels compared to the unimodal systems, while POA recognition still benefits from the Lombard effect. A bimodal system is proposed that uses only the alternate acoustic and visual cues and discriminates place and manner of articulation better than even a standard ASR system. Among the multimodal ASR systems studied, the proposed trimodal system based on Lombard speech gives the best recognition accuracy of 98%, 95%, and 76% for the vowels, MOA, and POA, respectively, with an average improvement of 36% over the unimodal ASR systems and a 9% improvement over the bimodal ASR systems.
format	Article
id	doaj-art-dd0cdaffe1eb4436bc3f72af25c00905
institution	DOAJ
issn	0137-5075 2300-262X
language	English
publishDate	2020-07-01
publisher	Institute of Fundamental Technological Research Polish Academy of Sciences
record_format	Article
series	Archives of Acoustics
spelling	doaj-art-dd0cdaffe1eb4436bc3f72af25c00905; 2025-08-20T02:51:28Z; eng; Institute of Fundamental Technological Research Polish Academy of Sciences; Archives of Acoustics; 0137-5075; 2300-262X; 2020-07-01; 45; 3; 10.24425/aoa.2020.134058; A Study on the Impact of Lombard Effect on Recognition of Hindi Syllabic Units Using CNN Based Multimodal ASR Systems; Sadasivam UMA MAHESWARI (SSN College of Engineering); A. SHAHINA (SSN College of Engineering); Ramesh RISHICKESH (SSN College of Engineering); A. NAYEEMULLA KHAN (VIT University); https://acoustics.ippt.pan.pl/index.php/aa/article/view/2623; Lombard speech; multimodal ASR; throat microphone; visual speech; Convolutional Neural Network; Hidden Markov Model
title	A Study on the Impact of Lombard Effect on Recognition of Hindi Syllabic Units Using CNN Based Multimodal ASR Systems
topic	Lombard speech; multimodal ASR; throat microphone; visual speech; Convolutional Neural Network; Hidden Markov Model
url	https://acoustics.ippt.pan.pl/index.php/aa/article/view/2623


Similar Items