arxiv
Published June 11, 2026 at 4:00 AM
MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models
Publisher summary · verbatim
arXiv:2606.11792v1 Announce Type: cross Abstract: Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimod
Originally published on arXiv