speaker-diarization-community-1 news

47 articles mentioning speaker-diarization-community-1

arxiv 3d ago

Autoregressive Guidance of Deep Spatially Selective Filters using Bayesian Tracking for Efficient Extraction of Moving Speakers

arXiv:2603.23723v2 Announce Type: replace-cross Abstract: Deep spatially selective filters achieve high-quality enhancement with real-time capable architectures for stationary speakers of known directions. To retain this level of performance in dynamic scenarios where only the speakers' initial dire

arxiv 3d ago

Large Audio Language Models for Spoofing-Aware Speaker Verification

arXiv:2607.14753v1 Announce Type: cross Abstract: Recent advances in text-to-speech and voice cloning make high-quality spoofing inexpensive and scalable, threatening voice authentication systems, especially automatic speaker verification (ASV). Existing defenses mainly address this threat through b

arxiv 4d ago

Diarization-Guided Qwen-ASR Adaptation for Multilingual Two-Speaker Conversational Speech

arXiv:2607.08208v2 Announce Type: replace Abstract: This paper describes our self-designed system for Task 1 of the MLC-SLM 2026 Challenge for multilingual two-speaker conversational speech. The system combines a modular speaker diarization front end with a challenge-adapted Qwen3-ASR-1.7B recognize

techcrunch 6d ago

OpenAI’s first hardware device is reportedly a screenless speaker that can move

The device is weirdly described as involving "mechanical elements that can move on their own" and the Bloomberg report includes the detail that the device is designed to "feel like a companion and become a physical manifestation of OpenAI’s ChatGPT."

theverge 6d ago

OpenAI may announce a ChatGPT smart speaker this year

OpenAI's first device is set to be a smart speaker that lets you talk with ChatGPT, according to a report from Bloomberg. The device apparently won't have a screen, but will use a camera and additional sensors to "understand" your environment. The report comes just days after Apple filed a lawsuit a

arxiv Jul 14

TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding

arXiv:2601.06896v2 Announce Type: replace-cross Abstract: We present TagSpeech, a unified LLM-based framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Seri

arxiv Jul 14

Breaking the Quality–Intelligibility Trade-off in Streaming Target Speaker Extraction via Deep-Feature-Anchored Preference Optimization

arXiv:2607.10191v1 Announce Type: cross Abstract: Generative streaming models for Target Speaker Extraction (TSE) commonly exhibit a quality–intelligibility trade-off, wherein naive optimization for perceptual audio quality tends to degrade speech intelligibility, and conversely. We reveal that thi

arxiv Jul 10

PS4: Proxy-Supervised Joint Training for Real Target Speaker Extraction

arXiv:2607.08111v1 Announce Type: cross Abstract: Training target speaker extraction (TSE) models for real conversational mixtures remains challenging because large-scale training corpora and clean target speech for supervision are unavailable. We present PS4, a proxy-supervised training framework f

arxiv Jul 3

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

arXiv:2607.02504v1 Announce Type: cross Abstract: Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storyline often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective cha

arxiv Jul 3

SPARCLE: SPeaker-aware Aligned Representations via Contrastive Language Embeddings

arXiv:2607.01238v1 Announce Type: cross Abstract: Recent advances in speech synthesis have shifted from phoneme representations to direct grapheme modeling. While phonemes address the one-to-many mapping between text and acoustics, they rely on grapheme-to-phoneme (G2P) systems that fail to capture

arxiv Jul 2

L-Proto: Language-Aware Episodic Prototypical Training for Multilingual Speaker Verification

arXiv:2606.17416v2 Announce Type: replace-cross Abstract: Multilingual speaker verification remains challenging because language-dependent acoustic variability causes speaker identity to become entangled with linguistic characteristics, degrading generalization across languages. In multilingual trai

arxiv Jul 2

Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation

arXiv:2605.07694v2 Announce Type: replace-cross Abstract: Single-channel speaker distance estimation has recently achieved centimeter-level accuracy in simulated environments, yet it remains unclear which components of the room impulse response (RIR) the model exploits and how performance depends on

arxiv Jul 2

Disentangling Speaker and Language Effects in Cross-Lingual Speaker Verification for Iberian Languages

arXiv:2607.01161v1 Announce Type: cross Abstract: Cross-lingual speaker verification (SV) systems typically exhibit performance degradation when enrollment and test utterances are spoken in different languages. However, standard evaluation protocols confound language mismatch with inter-speaker vari

theverge Jul 1

Google built a great smart speaker, but Gemini isn’t ready for it

Smart speakers have spent the past few years searching for a compelling second act. Beyond music, timers, and controlling your lights, they've struggled to justify taking up space on the kitchen counter. AI promised to change that. Amazon debuted its new hardware powered by a revamped Alexa last fal

arxiv Jul 1

UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

arXiv:2606.31128v1 Announce Type: cross Abstract: Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion editing as separate tasks,

arxiv Jun 30

BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings

arXiv:2509.15001v3 Announce Type: replace-cross Abstract: Child-centered daylong recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, a self-supervi

arxiv Jun 30

AMR: Adaptive Modality Routing for Multimodal Polyglot Speaker Identification

arXiv:2606.29335v1 Announce Type: cross Abstract: Multimodal speaker identification systems face two key challenges in real-world deployment: missing modalities and language mismatch between training and testing conditions. In practical scenarios, background multi-speaker conversations, ambient nois

arxiv Jun 30

Child-Centric Voice Anonymization in Single and Multi-Speaker Speech via Domain-Adapted SSL Models

arXiv:2606.29897v1 Announce Type: cross Abstract: Voice anonymization aims to protect speaker identity while preserving linguistic content and speech usability. However, most anonymization systems are developed on adult speech, leading to degraded performance when applied to child speech. This paper

arxiv Jun 30

Explainable AI in Speaker Recognition – Making Latent Representations Understandable

arXiv:2604.23354v3 Announce Type: replace-cross Abstract: Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls within the Explainable AI (XAI) domain. This paper proposes to study an XAI topic: analysing, visualising a

arxiv Jun 30

KM-Speaker: Keypoint-Based Style Control for High-Quality Speech-Driven 3D Facial Animation and Dialogue Localization

arXiv:2606.28568v1 Announce Type: cross Abstract: Speech-driven 3D facial animation methods face significant challenges in simultaneously achieving high-fidelity motion and precise artistic control at production quality. Existing controllable models typically learn global style control by relying on

arxiv Jun 29

Advancing Speaker-Based Vocal Effort Classification with WavLM and Data Augmentation in Naturalistic Non-Calibrated Speech Recordings

arXiv:2606.27543v1 Announce Type: cross Abstract: The variations in vocal effort range (e.g. whisper, soft, neutral, loud, shout) alter production and speech acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. Classification is challenging since effor

arxiv Jun 29

DG^VoiC: Speaker Clustering for Fraud Investigation under Real Call-Centre Conditions

arXiv:2606.28048v1 Announce Type: cross Abstract: Insurance fraud remains costly and operationally difficult, particularly in call-centre workflows where many customer interactions begin at FNOL. While recent fraud detection methods mainly rely on structured data, text, or images, repeated speaker i

arxiv Jun 26

Neural Speaker Diarization via Multilingual Training: Evaluation on Low-Resource Nepali-Hindi Speech

arXiv:2606.26144v1 Announce Type: cross Abstract: Speaker diarization, the task of determining "who spoke when" in a multi-speaker recording, is a critical component in applications such as meeting transcription, accessibility tools, and multilingual information retrieval. While end-to-end neural di

theverge Jun 24

The Google Home Speaker sounds good and looks great — but it’s finicky

Right out of the box, the new Google Home Speaker passed a couple of important tests. Even with the volume at 100 percent and music blaring out of the speaker, it quickly ducked the audio and listened every time I said "Hey, Google." In fact, in two days of testing, the speaker's three microphones h

arxiv Jun 24

VieSpeaker: A Large-Scale Vietnamese Speaker Recognition Dataset Beyond Visual Dependency

arXiv:2606.24066v1 Announce Type: cross Abstract: Speaker recognition has advanced rapidly with large-scale training datasets, yet Vietnamese remains under-resourced, with existing corpora limited in scale and acoustic diversity. Most large-scale datasets rely on facial cues to link speech with spea

arxiv Jun 18

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

arXiv:2603.10827v2 Announce Type: replace-cross Abstract: Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker

arxiv Jun 18

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

arXiv:2606.19325v1 Announce Type: cross Abstract: Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean voc

arxiv Jun 18

Revisiting Active Speaker Detection: An In-the-Wild Benchmark for Generalization and Robustness

arXiv:2505.21954v2 Announce Type: replace-cross Abstract: We present UniTalk, a novel dataset emphasizing challenging scenarios to enhance model generalization for the task of active speaker detection (ASD). Previously established benchmarks such as AVA predominantly comprise old movies and thus exh

techcrunch Jun 17

Google bets on Gemini to reinvent the smart home speaker

Google is betting generative AI can breathe new life into the smart speaker. The company's new $99.99 Google Home Speaker replaces the rigid commands of the Google Assistant era with more conversational Gemini interactions.

theverge Jun 17

Google’s first smart speaker in six years arrives next week

Google's first new smart speaker in six years starts shipping on June 25th, narrowly missing its promised spring launch window. Preorders for the Google Home Speaker open today, June 17th. Nothing has changed hardware-wise in the nine months since the $99 speaker was announced. It has the same sligh

arxiv Jun 17

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

arXiv:2606.17542v1 Announce Type: new Abstract: We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised

arxiv Jun 15

Efficiency-Performance Trade-offs in Neural Speaker Diarization via Structured Pruning and Low-Bit Quantization

arXiv:2606.14030v1 Announce Type: cross Abstract: Streaming speaker diarization is crucial for time-critical medical dispatch, but deploying it on resource-constrained hardware requires smaller, faster models. Using SIMSAMU, a dataset of simulated medical-dispatch conversations, we evaluate streamin

arxiv Jun 15

Multimodal Speaker Identification in Classroom Environments

arXiv:2606.13712v1 Announce Type: cross Abstract: Automated analysis of K-12 classroom dynamics faces challenges due to background noise and variable child speech, often confounding acoustic-only models. This study evaluates a multimodal speaker identification framework anchoring acoustic embeddings

theverge Jun 10

Microsoft, like, totally gets why students are booing AI-pilled graduation speakers

New college graduates around the country have been booing and heckling commencement speakers who hype up AI. Microsoft would like everyone to talk it out. In a blog post running more than 3,100 words, Microsoft vice chair and president Brad Smith addressed the recent spate of viral clips from gradua

arxiv Jun 10

Speaker Group Encoding in Self-supervised Speech Recognition Models

arXiv:2606.10654v1 Announce Type: new Abstract: We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech recognition (ASR), and ASR-fi

arxiv Jun 5

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

arXiv:2606.06211v1 Announce Type: new Abstract: Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), in

arxiv Jun 2

Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

arXiv:2606.01909v1 Announce Type: cross Abstract: We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised by stages to carry speaker identity, phonetic content, and dynamic source routing

arxiv Jun 1

Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

arXiv:2603.07551v2 Announce Type: replace-cross Abstract: Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynam

arxiv Jun 1

G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

arXiv:2603.10468v2 Announce Type: replace-cross Abstract: We study timestamped speaker-attributed automatic speech recognition (SA-ASR) for long-form, multi-party speech with overlap. In this setting, chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-

arxiv May 29

Interactive In-Meeting Speaker Correction with Human Feedback

arXiv:2509.18377v2 Announce Type: replace Abstract: Most automatic speech processing systems operate in "open loop" mode without user feedback about who said what, yet human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted in-meeting speaker correction syste

arxiv May 29

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

arXiv:2505.10975v3 Announce Type: replace-cross Abstract: Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances h

arxiv May 27

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

arXiv:2510.10774v3 Announce Type: replace-cross Abstract: Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly

arxiv May 27

FalAR: A Large-scale Speaker-Annotated European Portuguese Speech Corpus of Parliamentary Sessions

arXiv:2605.27062v1 Announce Type: new Abstract: State-of-the-art performance for Automatic Speech Recognition (ASR) largely depends on the availability of large-scale labeled corpora. This creates a demand for increased data collection efforts, particularly for under-represented languages and dialec

arxiv May 26

WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification

arXiv:2605.26070v1 Announce Type: new Abstract: Annotating speaker attributes from text is inherently ambiguous, particularly in multilingual settings where demographic and social cues are implicit and culturally variable. We propose a human-large language model (LLM) collaborative re-annotation fra

arxiv May 26

Continual Speaker Identity Unlearning with Minimal Interference

arXiv:2605.25962v1 Announce Type: cross Abstract: Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to repl

arxiv May 22

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

arXiv:2603.00086v2 Announce Type: replace-cross Abstract: Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between

arxiv May 19

VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

arXiv:2605.18547v1 Announce Type: new Abstract: Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers' emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they in