Speech recognition using an English multimodal corpus with integrated image and depth information


Bibliographic Details
Main Author: Bing Wang
Format: Article
Language: English
Published: Nature Portfolio 2024-11-01
Series: Scientific Reports
Subjects:
Online Access: https://doi.org/10.1038/s41598-024-78557-2
author Bing Wang
author_facet Bing Wang
author_sort Bing Wang
collection DOAJ
description Abstract Traditional English corpora mainly collect information from a single modality and lack multimodal information, which lowers the quality of the corpus and limits recognition accuracy. To address these problems, this paper introduces depth information into multimodal corpora and studies both the construction of an English multimodal corpus that integrates electronic images and depth information and a speech recognition method for that corpus. The adopted multimodal fusion strategy integrates speech signals with image information, including key visual cues such as the speaker's lip movements and facial expressions, and uses deep learning to mine acoustic and visual features. The acoustic models in the Kaldi toolkit are used for the experiments, which lead to the following conclusions. With 15-dimensional lip features at an SNR (signal-to-noise ratio) of 10 dB, the accuracy of corpus A was 2.4% higher than that of corpus B under the monophone model, and 1.7% higher under the triphone model. With 32-dimensional lip features at an SNR of 10 dB, the accuracy of corpus A was 1.4% higher than that of corpus B under the monophone model, and 2.6% higher under the triphone model. The English multimodal corpus with image and depth information thus achieves high accuracy, and the depth information helps to improve it.
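The abstract describes two recurring technical steps: mixing noise into speech at a fixed SNR (e.g. 10 dB) and fusing acoustic features with visual lip features before Kaldi acoustic modelling. The sketch below illustrates only the general idea under stated assumptions; the paper's actual pipeline, feature dimensions, and frame rates are not given in this record, so all names and numbers here are illustrative, not the author's implementation.

```python
# Illustrative sketch (not the paper's code): (1) mix noise into speech at a
# target SNR in dB, (2) early fusion of acoustic and visual (lip) feature
# streams by upsampling the slower video stream and concatenating per frame.
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture speech+noise has the requested SNR,
    using SNR_dB = 10 * log10(P_speech / P_noise)."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)


def fuse_features(acoustic: np.ndarray, visual: np.ndarray) -> np.ndarray:
    """Early fusion: repeat video frames up to the audio frame rate
    (nearest-neighbour alignment), then concatenate feature vectors."""
    n_frames = acoustic.shape[0]
    idx = np.minimum(np.arange(n_frames) * visual.shape[0] // n_frames,
                     visual.shape[0] - 1)
    return np.hstack([acoustic, visual[idx]])


rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # 1 s of dummy audio at 16 kHz
noise = rng.standard_normal(16000)
noisy = mix_at_snr(speech, noise, snr_db=10.0)

mfcc_like = rng.standard_normal((100, 13))  # 100 audio frames, 13-dim features
lip_feats = rng.standard_normal((25, 15))   # 25 video frames, 15-dim lip features
fused = fuse_features(mfcc_like, lip_feats)
print(fused.shape)                          # (100, 28)
```

In this assumed setup the fused 28-dimensional frames would then feed a monophone or triphone acoustic model; in Kaldi that corresponds to the standard GMM training recipes, though the record does not say which configuration the author used.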
format Article
id doaj-art-d70f12a97cec45bdb7ff7c5ee4e0594e
institution OA Journals
issn 2045-2322
language English
publishDate 2024-11-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-d70f12a97cec45bdb7ff7c5ee4e0594e 2025-08-20T02:13:55Z eng Nature Portfolio Scientific Reports 2045-2322 2024-11-01 14 1 1 11 10.1038/s41598-024-78557-2 Speech recognition using an English multimodal corpus with integrated image and depth information Bing Wang (School of Foreign Languages, Henan University of Science and Technology) https://doi.org/10.1038/s41598-024-78557-2 English multimodal corpus; Speech recognition methods; Depth information; Electronic images
title Speech recognition using an English multimodal corpus with integrated image and depth information
topic English multimodal corpus
Speech recognition methods
Depth information
Electronic images
url https://doi.org/10.1038/s41598-024-78557-2