arxivApril 10, 2026 at 4:00 AM2 min readneutral

VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

View PDF Abstract:Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile builds two core evaluation pillars: defining what to evaluate via user-intent-driven task design that reflects real mobile usage, and how to evaluate through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which are largely obscured by coarse-grained evaluations. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents. Code and data are available at this https URL. Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI) Cite as: arXiv:2604.06182 [cs.HC] (or arXiv:2604.06182v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2604.06182 arXiv-issued DOI via DataCite Submission history From: Yichen Gong [view email] [v1] Fri, 6 Feb 2026 18:54:42 UTC (7,314 KB)

Read original article ↗

No replies yet. Be first.

arxiv5h ago

VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

Related Articles

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

FluidFlow: a flow-matching generative model for fluid dynamics surrogates on unstructured meshes

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?