End-to-end audiovisual speech recognition based on attention fusion of SDBN and BLSTM

An end-to-end audiovisual speech recognition algorithm was proposed. In the algorithm, a sparse DBN was constructed by introducing a mixed l<sub>1/2</sub> norm and l<sub>1</sub> norm into a Deep Belief Network with a bottleneck structure to extract the spars...

Bibliographic Details
Main Authors: Yiming WANG, Ken CHEN, Aihaiti ABUDUSALAMU
Format: Article
Language: zho
Published: Beijing Xintong Media Co., Ltd 2019-12-01
Series: Dianxin kexue
Subjects: end-to-end; audiovisual speech recognition; sparse bottleneck features; attention mechanism
Online Access: http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2019290/
author Yiming WANG
Ken CHEN
Aihaiti ABUDUSALAMU
collection DOAJ
description An end-to-end audiovisual speech recognition algorithm was proposed. In the algorithm, a sparse DBN was constructed by introducing a mixed l<sub>1/2</sub> norm and l<sub>1</sub> norm into a Deep Belief Network with a bottleneck structure to extract sparse bottleneck features and thereby reduce the dimension of the data features, and a BLSTM was then used to model the features as a time series. Next, an attention mechanism was used to automatically align and fuse the visual lip information and the audio information. Finally, the fused audiovisual information was classified by a BLSTM with a Softmax layer attached. Experiments show that the algorithm can effectively recognize visual and auditory information and achieves a good recognition rate and robustness compared with similar algorithms.
format Article
id doaj-art-fb4c49696475486e80ec2c68012af815
institution Kabale University
issn 1000-0801
language zho
publishDate 2019-12-01
publisher Beijing Xintong Media Co., Ltd
record_format Article
series Dianxin kexue
title End-to-end audiovisual speech recognition based on attention fusion of SDBN and BLSTM
topic end-to-end
audiovisual speech recognition
sparse bottleneck features
attention mechanism
url http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2019290/
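
The description field names two ideas that a small sketch may make more concrete: a mixed l1/2 and l1 sparsity penalty, and attention-based fusion of audio and visual feature sequences. The record does not give the paper's formulas, so everything below is an assumption made for illustration: the penalty form (a weighted sum of the two norms), the weights `lam_half` and `lam_one`, the use of scaled dot-product attention, and the function names `mixed_sparsity_penalty` and `attention_fuse` are all hypothetical and are not taken from the article.

```python
import numpy as np


def mixed_sparsity_penalty(h, lam_half=0.1, lam_one=0.1):
    """Assumed penalty: lam_half * ||h||_{1/2} + lam_one * ||h||_1.

    h holds hidden-unit activations. The weighting scheme is an
    illustrative assumption, not the article's formulation.
    """
    a = np.abs(h)
    l_half = np.sum(np.sqrt(a)) ** 2  # l_{1/2} quasi-norm: (sum |h_i|^(1/2))^2
    l_one = np.sum(a)                 # l_1 norm: sum |h_i|
    return lam_half * l_half + lam_one * l_one


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


def attention_fuse(audio, visual):
    """Align visual frames to audio frames, then concatenate.

    audio:  (T_a, d) audio feature sequence
    visual: (T_v, d) visual (lip) feature sequence
    Uses scaled dot-product attention (an assumption) with audio frames
    as queries and visual frames as keys and values.
    """
    d = audio.shape[1]
    scores = audio @ visual.T / np.sqrt(d)  # (T_a, T_v) similarity scores
    weights = softmax(scores, axis=1)       # each audio frame attends over visual frames
    context = weights @ visual              # (T_a, d) visual context per audio frame
    fused = np.concatenate([audio, context], axis=1)  # (T_a, 2d)
    return fused, weights


# Example with random stand-in features (not real data).
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((5, 8))
visual_feats = rng.standard_normal((3, 8))
fused_feats, attn = attention_fuse(audio_feats, visual_feats)
```

Under this assumed penalty, a vector with a single large entry is penalized less than a dense vector with the same l1 norm, which is the sense in which such a term encourages sparse features. In the pipeline the description outlines, a sequence like `fused_feats` would then be passed to a BLSTM classifier with a Softmax layer; that stage is not sketched here.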