Real-time Generation of Various Types of Nodding for Avatar Attentive Listening System
In human dialogue, nonverbal information such as nodding and facial expressions is as crucial as verbal information, and spoken dialogue systems are also expected to express such nonverbal behaviors. We focus on nodding, which is critical in an attentive listening system, and propose a model that predicts both its timing and type in real time. The proposed model builds on the voice activity projection (VAP) model, which predicts voice activity from both listener and speaker audio. Unlike conventional models, we extend it to predict various types of nodding continuously and in real time. In addition, the proposed model incorporates multi-task learning with verbal backchannel prediction and pretraining on general dialogue data. In the timing and type prediction task, multi-task learning yielded a significant improvement. We also confirmed that reducing the processing rate enables real-time operation without a substantial drop in accuracy, and integrated the model into an avatar attentive listening system. Subjective evaluations showed that it outperformed the conventional method, which always nods in sync with verbal backchannels. The code and trained models are available at https://github.com/MaAI-Kyoto/MaAI.
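To make the described architecture concrete, below is a minimal, hypothetical sketch (not the authors' released code; see the repository above for that) of a VAP-style multi-task head: a shared causal module over two-channel (speaker and listener) audio features feeds two frame-level outputs, one for nod type and one for verbal backchannel prediction. The class name, feature dimensions, and the set of nod types are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoddingVAPHead(nn.Module):
    """Hypothetical multi-task head over per-channel audio features."""

    def __init__(self, feat_dim=256, hidden_dim=256, num_nod_types=4):
        super().__init__()
        # Fuse per-frame features from the speaker and listener channels.
        self.fuse = nn.Linear(2 * feat_dim, hidden_dim)
        # Causal temporal context so inference can run frame by frame.
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        # Head 1: nod type per frame (e.g., none / small / large / repeated).
        self.nod_head = nn.Linear(hidden_dim, num_nod_types)
        # Head 2: binary verbal backchannel prediction (multi-task objective).
        self.bc_head = nn.Linear(hidden_dim, 1)

    def forward(self, spk_feats, lis_feats):
        # spk_feats, lis_feats: (batch, frames, feat_dim) encoder outputs.
        x = torch.cat([spk_feats, lis_feats], dim=-1)
        x, _ = self.rnn(torch.relu(self.fuse(x)))
        return self.nod_head(x), self.bc_head(x).squeeze(-1)

def multitask_loss(nod_logits, bc_logits, nod_labels, bc_labels, bc_weight=0.5):
    # Joint objective: cross-entropy over nod types plus weighted BCE
    # for the auxiliary backchannel task.
    ce = nn.functional.cross_entropy(nod_logits.transpose(1, 2), nod_labels)
    bce = nn.functional.binary_cross_entropy_with_logits(bc_logits, bc_labels)
    return ce + bc_weight * bce
```

In such a setup, real-time operation follows from the causal recurrence and from lowering the frame rate at which predictions are emitted, consistent with the reduced processing rate mentioned in the abstract.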
Kazushi Kato, Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
Subject: Computing Technology, Computer Technology
Kazushi Kato, Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara. Real-time Generation of Various Types of Nodding for Avatar Attentive Listening System [EB/OL]. (2025-08-04) [2025-08-07]. https://arxiv.org/abs/2507.23298.