ClawBench: Can AI Agents Complete Everyday Online Tasks?
Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, and Kelsey R. Allen introduce ClawBench, an evaluation framework that tests whether AI agents can complete everyday online tasks. AI agents can already automate tasks such as managing an inbox, but can they handle the other routine chores of daily life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents.
ClawBench comprises 153 simple tasks that people routinely perform in their lives and work, spanning 144 live platforms across 15 categories. The tasks include completing purchases, booking appointments, and submitting job applications, and they demand capabilities beyond those exercised by existing benchmarks: extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations such as filling in many detailed form fields correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic content, and challenges of real-world web interaction.
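To make this concrete, a task in a benchmark of this shape might be specified roughly as below. This is a hypothetical sketch, not the actual ClawBench format: the field names, the `Task` dataclass, and the example values are all illustrative assumptions.

```python
# Hypothetical sketch of what a ClawBench-style task record might contain.
# Field names and example values are assumptions, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str       # unique identifier within the benchmark
    category: str      # one of the 15 platform categories
    platform_url: str  # the live production website the agent visits
    instruction: str   # natural-language description of the goal
    # Documents the agent must read to extract names, dates, payment info, etc.
    user_documents: list[str] = field(default_factory=list)
    # Checked against the intercepted final submission payload during grading.
    expected_fields: dict[str, str] = field(default_factory=dict)

example = Task(
    task_id="booking-0042",
    category="appointment-booking",
    platform_url="https://example-clinic.com/book",
    instruction="Book a dental cleaning for next Tuesday at 10am "
                "using the patient details in the attached intake form.",
    user_documents=["intake_form.pdf"],
    expected_fields={"patient_name": "Jane Doe", "time": "10:00"},
)
```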
A lightweight interception layer captures and blocks only the final submission request, enabling safe evaluation on live sites without real-world side effects. The authors evaluated 7 frontier models on ClawBench and found that both proprietary and open-source models complete only a small fraction of the tasks; Claude Sonnet 4.6, for example, achieves only a 33.3% success rate. These results show that substantial work remains before AI agents can serve as reliable general-purpose assistants, and ClawBench provides a concrete yardstick for measuring and driving that progress.
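To illustrate the interception idea, here is a minimal sketch of how such a layer could be built with Playwright's request routing. The paper does not publish the implementation, so this is an assumption-laden illustration: `SUBMIT_PATTERNS` and the `looks_like_submission` heuristic are hypothetical stand-ins for however ClawBench actually identifies the final submission request.

```python
# Minimal sketch of a submission-blocking interception layer using Playwright.
# Illustrative only; not ClawBench's actual implementation.
from playwright.sync_api import sync_playwright, Route, Request

# Hypothetical URL fragments that mark a request as a final submission.
SUBMIT_PATTERNS = ("/checkout", "/submit", "/apply", "/book")

def looks_like_submission(request: Request) -> bool:
    """Heuristic: a POST whose URL matches a known submission endpoint."""
    return request.method == "POST" and any(
        p in request.url for p in SUBMIT_PATTERNS
    )

captured = []  # records of blocked submissions, kept for later grading

def handle_route(route: Route) -> None:
    request = route.request
    if looks_like_submission(request):
        # Capture the payload for evaluation, then short-circuit the request
        # so the live site never receives the side-effecting submission.
        captured.append({"url": request.url, "body": request.post_data})
        route.fulfill(status=200, body="{}", content_type="application/json")
    else:
        route.continue_()  # all other traffic passes through unchanged

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", handle_route)  # intercept every network request
    # ... the agent drives the page here; the final submit is captured, not sent
    browser.close()
```

Routing everything and passing non-submission traffic through keeps the page fully dynamic and realistic; only the one side-effecting request is swallowed, after which the captured payload can be graded against the task's expected fields.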
A project page for ClawBench is available, and the paper can be cited as arXiv:2604.08523 [cs.CL]. The authors have released their work for others to build upon, and ClawBench should be a valuable resource for researchers and developers working on AI agents and related technologies.