Audio Visual Speech Recognition with Multimodal Recurrent Neural Networks
Studies on modern human-machine interfaces have demonstrated that visual information can enhance speech recognition accuracy, especially in noisy environments. Deep learning has been widely used to tackle the audio visual speech recognition (AVSR) problem, owing to its remarkable achievements in both speech recognition and image recognition. Although existing deep learning models succeed in incorporating visual information into speech recognition, none of them simultaneously considers the sequential characteristics of both the audio and visual modalities. To overcome this deficiency, we propose a multimodal recurrent neural network (multimodal RNN) model that takes into account the sequential characteristics of both modalities for AVSR. In particular, the multimodal RNN comprises three components: an audio part, a visual part, and a fusion part, where the audio and visual parts capture the sequential characteristics of their respective modalities, and the fusion part combines the outputs of both. We model the audio modality with an LSTM RNN, model the visual modality with a convolutional neural network (CNN) followed by an LSTM RNN, and combine the two via a multimodal layer in the fusion part. We validate the effectiveness of the proposed multimodal RNN model on a multi-speaker AVSR benchmark dataset termed AVletters. The experimental results show performance improvements over the highest previously reported audio visual recognition accuracies on AVletters, and confirm the robustness of our multimodal RNN model.
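The three-part architecture in the abstract (an LSTM audio branch, a CNN-plus-LSTM visual branch, and a multimodal fusion layer) can be sketched in miniature with NumPy. This is an illustrative forward pass only, not the paper's implementation: the feature dimensions, the single 3x3 convolution, the sequence length, and the softmax read-out over 26 letter classes (matching AVletters) are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, b):
    """One LSTM step: all four gates computed jointly from [x; h]."""
    z = np.concatenate([x, h]) @ W + b          # (4H,)
    H = h.shape[0]
    i = 1 / (1 + np.exp(-z[:H]))                # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))             # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))           # output gate
    g = np.tanh(z[3*H:])                        # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(seq, W, b, hidden):
    """Run an LSTM over a sequence; the final hidden state summarizes it."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in seq:
        h, c = lstm_step(x, h, c, W, b)
    return h

# Hypothetical sizes (not from the paper): 8 frames, 26-dim audio
# features, a 12x12 lip region image, 16 hidden units, 26 letters.
T, audio_dim, hidden, n_classes = 8, 26, 16, 26

# Audio part: LSTM over per-frame acoustic feature vectors.
Wa = rng.standard_normal((audio_dim + hidden, 4 * hidden)) * 0.1
ba = np.zeros(4 * hidden)
audio_seq = rng.standard_normal((T, audio_dim))
h_audio = run_lstm(audio_seq, Wa, ba, hidden)

# Visual part: one 3x3 conv + ReLU per frame (a stand-in for the CNN),
# flattened and fed to a second LSTM.
kernel = rng.standard_normal((3, 3)) * 0.1
def conv_feat(img):
    out = np.array([[(img[r:r+3, s:s+3] * kernel).sum()
                     for s in range(img.shape[1] - 2)]
                    for r in range(img.shape[0] - 2)])
    return np.maximum(out, 0).ravel()

frames = rng.standard_normal((T, 12, 12))
vis_seq = np.stack([conv_feat(f) for f in frames])
Wv = rng.standard_normal((vis_seq.shape[1] + hidden, 4 * hidden)) * 0.1
bv = np.zeros(4 * hidden)
h_visual = run_lstm(vis_seq, Wv, bv, hidden)

# Fusion part: the multimodal layer joins both final hidden states,
# then a softmax scores the 26 letter classes.
fused = np.concatenate([h_audio, h_visual])
Wf = rng.standard_normal((2 * hidden, n_classes)) * 0.1
logits = fused @ Wf
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # class distribution over the 26 letters
```

Because each branch keeps its own recurrence, the sketch preserves the key property the abstract emphasizes: the sequential structure of both modalities is modeled before fusion, rather than fusing per-frame features and discarding temporal order.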
Feng Weijiang, Luo Zhigang, Zhang Xiang, Li Yuan, Guan Naiyang
Subject areas: Applied Electronic Technology; Computing Technology, Computer Technology; Communications
Keywords: computer application; deep learning; multimodal learning; recurrent neural networks; LSTM
Feng Weijiang, Luo Zhigang, Zhang Xiang, Li Yuan, Guan Naiyang. Audio Visual Speech Recognition with Multimodal Recurrent Neural Networks [EB/OL]. (2017-05-12)[2025-08-21]. http://www.paper.edu.cn/releasepaper/content/201705-848.