arxivApril 6, 2026 at 4:00 AM1 min readneutral

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints.

We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window.

Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking. The details of the model and its performance can be found in the paper, which is available at https://doi.org/10.48550/arXiv.2604.03074, and is classified under the subjects Audio and Speech Processing, Computation and Language, and Sound.

The paper was submitted by Zhennan Lin and is available on arXiv as arXiv:2604.03074, with the first version being submitted on Fri, 3 Apr 2026 14:52:20 UTC.

Read original article ↗

No replies yet. Be first.

arxiv6h ago

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

Related Articles

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?