arxiv 4d ago bullish

LeanFlow: A Case Study in Workflow-Driven Lean Autoformalization

arXiv:2607.20503v1 Announce Type: new Abstract: We present and evaluate LeanFlow, an LLM agent system specialized for translating mathematical papers into buildable Lean projects. Recent verifier-in-the-loop systems show that large formal artifacts can be produced, but it remains unclear which runti

KIGP 2 models #mathematics #formalization #large language models

arxiv Jul 14 bullish

SETA: Scaling Environments for Terminal Agents

arXiv:2607.10891v1 Announce Type: new Abstract: Large language models (LLMs) are rapidly shifting toward agents that solve tasks through diverse interfaces, including web and graphical user interfaces (GUIs). Among these, the terminal command line provides a text-based, general-purpose interface, co

QWDE 2 models #reinforcement learning #large language models #open-source

arxiv May 22

Robust Reasoning Benchmark

arXiv:2604.08571v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13

CL 1 model #benchmark #mathematical reasoning #large language models

arxiv Apr 30 bullish

Information Extraction from Electricity Invoices with General-Purpose Large Language Models

arXiv:2604.25927v1 Announce Type: new Abstract: Information extraction from semi-structured business documents remains a critical challenge for enterprise management. This study evaluates the capability of general-purpose Large Language Models to extract structured information from Spanish electrici

GEMI 2 models #information extraction #large language models #document processing

arxiv Apr 24 bullish

DryRUN: On the Role of Public Tests in LLM-Driven Code Generation

arXiv:2604.21598v1 Announce Type: cross Abstract: Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally correct code by incorporating simulation-driven plan

DRCO 2 models #autonomous code generation #software engineering #large language models

arxiv Apr 7 bullish

TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

arXiv:2601.22776v2 Announce Type: replace Abstract: Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on s

QWQW 2 models #reinforcement learning #large language models #reasoning