arxiv · May 16
arXiv:2605.15188v1 Announce Type: cross Abstract: AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay
arxiv · Apr 17 · bullish
arXiv:2603.13683v2 Announce Type: replace Abstract: Although debiased large language models (LLMs) excel at handling known or low-bias prompts, they often fail on unfamiliar and high-bias prompts. We demonstrate via out-of-distribution (OOD) detection that these high-bias prompts cause a distributio
arxiv · Apr 17 · bullish
arXiv:2604.13552v1 Announce Type: cross Abstract: Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and need s