Model Detail
speaker-diarization-community-1
▲ 0.1%A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech
arXiv:2604.06327v1 Announce Type: cross Abstract: Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon underm
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
arXiv:2409.18512v2 Announce Type: replace-cross Abstract: Recent advancements in speech synthesis have enabled large language model (LLM)-based systems to perform zero-shot generation with controllable content, timbre, speaker identity, and emotion through input prompts. As a result, these models he
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
arXiv:2604.03074v1 Announce Type: cross Abstract: Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping sp
Pashto Common Voice: Building the First Open Speech Corpus for a 60-Million-Speaker Low-Resource Language
arXiv:2603.27021v1 Announce Type: new Abstract: We present the Pashto Common Voice corpus -- the first large-scale, openly licensed speech resource for Pashto, a language with over 60 million native speakers largely absent from open speech technology. Through a community effort spanning 2022-2025, t