arxiv
Published May 14, 2026 at 4:00 AM
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
Publisher summary (verbatim)
arXiv:2604.27389v2. Abstract: In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension.
Originally published on arXiv.