Model Detail
speaker-diarization-community-1
speaker-diarization-community-1 is an audio model released by pyannote. It is registered under the automatic-speech-recognition pipeline tag on Hugging Face and distributed under the cc-by-4.0 license.
The cc-by-4.0 license is permissive, allowing commercial deployment and derivative work without per-seat fees, though attribution requirements still apply.
Downloads of speaker-diarization-community-1 have moved -0.4% over the past 24 hours and -0.2% over the trailing thirty days. That is a slight downtrend, consistent with normal cooling as newer models compete for the same workloads. Treat these numbers as a signal, not a guarantee: download counts on Hugging Face also reflect mirror traffic, CI scrapes, and one-off benchmarking runs.
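As its name indicates, speaker-diarization-community-1 targets speaker diarization, that is, determining who spoke when in a recording. The automatic-speech-recognition tag reflects how the model is categorized on Hugging Face; diarization output is typically combined with a separate transcription model rather than producing transcripts or synthesized speech on its own. Treat this as a starting point rather than a benchmark verdict: the right deployment usually depends on an evaluation suite that mirrors your workload.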
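One way to make "an evaluation suite that mirrors your workload" concrete for a diarization model is to score its output against hand-labeled segments using diarization error rate (DER): missed speech plus false-alarm speech plus speaker confusion, divided by total reference speech. The sketch below is a deliberately simplified, frame-based version and is not pyannote's own scoring code. The function names, the 0.1 s frame step, and the segment format are assumptions made for illustration. It also assumes hypothesis speaker labels are already mapped to reference labels, applies no collar, and does not handle overlapping speech (a later segment overwrites an earlier one).

```python
def frame_labels(segments, n_frames, step):
    """Turn (start, end, speaker) segments into one label per frame.

    None marks a frame with no speech. Overlap is NOT modeled:
    a later segment simply overwrites an earlier one (simplification).
    """
    labels = [None] * n_frames
    for start, end, speaker in segments:
        first = int(round(start / step))
        last = int(round(end / step))
        for i in range(first, min(last, n_frames)):
            labels[i] = speaker
    return labels


def simple_der(reference, hypothesis, duration, step=0.1):
    """Simplified frame-level DER (hypothetical helper, not a library API).

    Assumes hypothesis labels are already mapped to reference labels.
    """
    n_frames = int(round(duration / step))
    ref = frame_labels(reference, n_frames, step)
    hyp = frame_labels(hypothesis, n_frames, step)

    missed = false_alarm = confusion = ref_speech = 0
    for r, h in zip(ref, hyp):
        if r is not None:
            ref_speech += 1
            if h is None:
                missed += 1        # reference speech, system silent
            elif h != r:
                confusion += 1     # speech attributed to the wrong speaker
        elif h is not None:
            false_alarm += 1       # system speech where reference is silent

    if ref_speech == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / ref_speech


if __name__ == "__main__":
    # Toy 10-second recording with two speakers.
    reference = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
    hypothesis = [(0.0, 4.0, "A"), (4.0, 9.0, "B")]
    print(simple_der(reference, hypothesis, duration=10.0))
```

In the toy example, one second of speaker A is attributed to B and one second of B is missed, so two of ten reference seconds are in error. For real evaluations a maintained metric implementation that handles overlap and optimal speaker mapping is the safer choice; this sketch only shows what the number measures.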
Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
arXiv:2606.01909v1. Abstract: We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised by stages to carry speaker identity, phonetic content, and dynamic source routing
G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition
arXiv:2603.10468v2. Abstract: We study timestamped speaker-attributed automatic speech recognition (SA-ASR) for long-form, multi-party speech with overlap. In this setting, chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-
Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech
arXiv:2603.07551v2. Abstract: Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynam
Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
arXiv:2505.10975v3. Abstract: Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances h
Explainable AI in Speaker Recognition -- Making Latent Representations Understandable
arXiv:2604.23354v2. Abstract: Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls within the Explainable AI (XAI) domain. This paper proposes to study an XAI topic: uncovering the unknown o
Interactive In-Meeting Speaker Correction with Human Feedback
arXiv:2509.18377v2. Abstract: Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what, yet human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted in-meeting speaker correction syste