Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

Publisher summary (verbatim)

arXiv:2606.00103v1. Announce type: new.

Abstract: We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. Wherein, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate
