arxiv · 5d ago · bullish

Toward Anthropomorphic Dialogue: A Closed-Loop Framework for Human-Like Chat Generation, Evaluation, and Preference Alignment

arXiv:2607.17191v2 Announce Type: replace Abstract: Human-like private chat requires more than fluent response generation: a system must preserve persona, relationship, memory, bounded knowledge, medium-specific timing, and a coherent multi-turn arc. We present AnthroDial, a closed-loop framework th

QW · 1 model · #anthropomorphic dialogue #dialogue systems #reinforcement learning

arxiv · Jul 18 · bullish

Branching Policy Optimization: Sandbox-Native Language Agent Reinforcement Learning

arXiv:2607.14171v1 Announce Type: new Abstract: Reinforcement learning has emerged as the dominant paradigm for training large language model (LLM) agents that interact with executable sandboxes. State-of-the-art algorithms such as PPO, RLOO, and GRPO inherit their rollout topology from RLHF: for ea

PP RL GR · 5 models · +2 · #reinforcement learning #language models #optimization

arxiv · Jul 14 · bullish

SETA: Scaling Environments for Terminal Agents

arXiv:2607.10891v1 Announce Type: new Abstract: Large language models (LLMs) are rapidly shifting toward agents that solve tasks through diverse interfaces, including web and graphical user interfaces (GUIs). Among these, the terminal command line provides a text-based, general-purpose interface, co

QW DE · 2 models · #reinforcement learning #large language models #open-source

arxiv · Jun 17

A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation

arXiv:2605.12227v2 Announce Type: replace Abstract: Existing approaches to post-train models for long-context tasks face complementary limitations: (i) supervised fine-tuning (SFT) provides stable supervision but suffers from exposure bias; (ii) reinforcement learning methods such as Group Relative

GR ON · 2 models · #long-context #reinforcement learning #distillation

arxiv · May 14 · bullish

HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning

arXiv:2605.13536v1 Announce Type: cross Abstract: High-Level Synthesis (HLS) compiles algorithmic C/C++ descriptions into hardware, with Quality of Results (QoR) -- latency and resource utilization -- critically governed by pragma configurations and code structure. Existing LLM-based HLS approaches

HL GP · 2 models · #reinforcement learning #high-level synthesis #quality of results

arxiv · Apr 7 · bullish

Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

arXiv:2603.13842v2 Announce Type: replace-cross Abstract: End-to-end autonomous driving is typically built upon imitation learning (IL), yet its performance is constrained by the quality of human demonstrations. To overcome this limitation, recent methods incorporate reinforcement learning (RL) thro

PA TR DI · 3 models · #autonomous driving #imitation learning #reinforcement learning

arxiv · Apr 7 · bullish

TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

arXiv:2601.22776v2 Announce Type: replace Abstract: Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on s

QW QW · 2 models · #reinforcement learning #large language models #reasoning

arxiv · Apr 3

Semantic Interaction Information mediates compositional generalization in latent space

arXiv:2603.27134v2 Announce Type: replace Abstract: Are there still barriers to generalization once all relevant variables are known? We address this question via a framework that casts compositional generalization as a variational inference problem over latent variables with parametric interactions

RE EC FU · 4 models · +1 · #machine learning #generalization #reinforcement learning

arxiv · Apr 3 · bullish

Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control

arXiv:2603.19136v2 Announce Type: replace Abstract: Stock markets exhibit regime-dependent behavior where prediction models optimized for stable conditions often fail during volatile periods. Existing approaches typically treat all market states uniformly or require manual regime labeling, which is

AU DU SO · 3 models · #machine learning #stock market #prediction