Model Detail
audio-flamingo-next-hf
Audio Video Verbal Analysis (AVVA) for Capturing Classroom Dialogues
arXiv:2604.22043v1 Announce Type: cross Abstract: Background: Classroom discourse analysis has been transformed by the growing use of audio-video multimodal data, which demands analytical methods that balance interpretive depth with computational scalability. Methods: This study introduces the …
AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
arXiv:2604.21766v1 Announce Type: new Abstract: Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, or dataset-specific biases. …
Misinformation Span Detection in Videos via Audio Transcripts
arXiv:2604.21767v1 Announce Type: new Abstract: Online misinformation has become one of the most challenging issues of recent years, yielding severe consequences including political polarization, attacks on democracy, and public health risks. Misinformation manifests on any platform with a large user base, including …
ATIR: Towards Audio-Text Interleaved Contextual Retrieval
arXiv:2604.20267v1 Announce Type: cross Abstract: Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has …
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
arXiv:2604.19782v1 Announce Type: cross Abstract: Recent advances in large audio language models (LALMs) have enabled multilingual speech understanding. However, benchmarks for evaluating LALMs remain scarce for non-English languages, with Korean being one such underexplored case. In this paper, we …
DASB: Discrete Audio and Speech Benchmark
arXiv:2406.14294v4 Announce Type: replace-cross Abstract: Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information …