
Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models

Source: arXiv
English Abstract

Multi-speaker automatic speech recognition (MS-ASR) faces significant challenges in transcribing overlapped speech, a task critical for applications like meeting transcription and conversational analysis. While serialized output training (SOT)-style methods serve as common solutions, they often discard absolute timing information, limiting their utility in time-sensitive scenarios. Leveraging recent advances in large language models (LLMs) for conversational audio processing, we propose a novel diarization-aware multi-speaker ASR system that integrates speaker diarization with LLM-based transcription. Our framework processes structured diarization inputs alongside frame-level speaker and semantic embeddings, enabling the LLM to generate segment-level transcriptions. Experiments demonstrate that the system achieves robust performance in multilingual dyadic conversations and excels in complex, high-overlap multi-speaker meeting scenarios. This work highlights the potential of LLMs as unified back-ends for joint speaker-aware segmentation and transcription.
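To make the described framework concrete, below is a minimal, hypothetical Python/PyTorch sketch of how structured diarization segments and frame-level speaker/semantic embeddings could be paired and handed to an LLM back-end for segment-level transcription. The class name, embedding dimensions, frame rate, and prompt format are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
from dataclasses import dataclass
from typing import List

@dataclass
class DiarSegment:
    speaker: str   # diarization-assigned speaker label, e.g. "spk1"
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds

class DiarizationAwarePrompt(nn.Module):
    """Hypothetical front-end: fuses frame-level semantic and speaker
    embeddings, projects them into the LLM embedding space, and pairs each
    diarized segment with a structured text prompt for transcription."""
    def __init__(self, sem_dim=512, spk_dim=192, llm_dim=2048, frame_rate=25):
        super().__init__()
        self.frame_rate = frame_rate                        # frames per second (assumed)
        self.proj = nn.Linear(sem_dim + spk_dim, llm_dim)   # fuse and project to LLM space

    def forward(self, sem_frames, spk_frames, segments: List[DiarSegment]):
        # sem_frames: (T, sem_dim) semantic encoder outputs
        # spk_frames: (T, spk_dim) frame-level speaker embeddings
        fused = self.proj(torch.cat([sem_frames, spk_frames], dim=-1))  # (T, llm_dim)
        prompts, audio_slices = [], []
        for seg in segments:
            lo = int(seg.start * self.frame_rate)
            hi = int(seg.end * self.frame_rate)
            audio_slices.append(fused[lo:hi])  # frames belonging to this segment
            prompts.append(f"<{seg.speaker}|{seg.start:.2f}-{seg.end:.2f}> transcribe:")
        return prompts, audio_slices

# Toy usage: 10 s of audio at 25 frames/s with two overlapping diarized segments.
T = 250
module = DiarizationAwarePrompt()
prompts, slices = module(
    torch.randn(T, 512), torch.randn(T, 192),
    [DiarSegment("spk1", 0.0, 6.0), DiarSegment("spk2", 4.0, 10.0)],
)
for p, s in zip(prompts, slices):
    print(p, tuple(s.shape))
```

The key design point suggested by the abstract is that diarization provides absolute segment boundaries and speaker labels up front, so the LLM only needs to transcribe each labeled segment rather than implicitly serializing speakers as in SOT-style training.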

Yuke Lin, Ming Cheng, Ze Li, Beilong Tang, Ming Li

Subject: Computing Technology, Computer Technology

Yuke Lin, Ming Cheng, Ze Li, Beilong Tang, Ming Li. Diarization-Aware Multi-Speaker Automatic Speech Recognition via Large Language Models [EB/OL]. (2025-06-06) [2025-07-23]. https://arxiv.org/abs/2506.05796.