arxiv · 4d ago

Approximate Quantum State Preparation Through Proximal Policy Optimization

arXiv:2607.21121v1 Announce Type: cross Abstract: In this work, a quantum architecture search framework for approximate quantum state preparation (QSP) is proposed. QSP is a challenging task, since the search space grows exponentially with the number of qubits, making the identification of the optim

PR · 1 model · #quantum-physics #reinforcement-learning #state-preparation

arxiv · 4d ago · bullish

Adaptive Multi-Horizon Reinforcement Learning

arXiv:2607.20656v1 Announce Type: cross Abstract: Effective decision-making in complex and changing environments requires balancing short-term and long-term consequences. In reinforcement learning (RL), this trade-off is typically controlled through a fixed discount factor, which imposes a single ex

#reinforcement-learning #continual-learning #machine-learning

arxiv · 4d ago

Error Amplification Limits ANN-to-SNN Conversion in Continuous Control

arXiv:2601.21778v3 Announce Type: replace-cross Abstract: Spiking Neural Networks (SNNs) can achieve competitive performance by converting already existing well-trained Artificial Neural Networks (ANNs), avoiding further costly training. This property is particularly attractive in Reinforcement Lear

#neural-networks #reinforcement-learning #conversion

arxiv · 4d ago · bullish

Relative Value Learning

arXiv:2607.21120v1 Announce Type: cross Abstract: In reinforcement learning, critics typically estimate absolute state values $V(s)$, estimating how good a particular situation is in isolation. However, it turns out that only differences in value are relevant for control. Motivated by this, we propo

PP · 1 model · #reinforcement-learning #value-estimation #policy-gradient

arxiv · 6d ago · bullish

From Trajectories to Instructions: Language-Conditioned Meta-Reinforcement Learning

arXiv:2607.18830v1 Announce Type: cross Abstract: Model-Agnostic Meta-Learning (MAML) is a widely used framework for reinforcement learning (RL) that enables efficient transfer by learning global policy parameters that can be rapidly adapted to new tasks. MAML training proceeds in two loops: an inne

MA LA · 2 models · #reinforcement-learning #meta-learning #language-instructions

arxiv · 6d ago · bullish

A Reinforcement-Learning-Augmented Liquid-Fueled Reactor Network Model for Predicting Lean Blowout in Gas Turbine Combustors

arXiv:2607.19281v1 Announce Type: new Abstract: This study introduces a reinforcement learning (RL) framework for generating optimal liquid-fueled reactors to improve lean blowout (LBO) predictions in gas turbine combustors. Existing approaches for determining cluster boundaries rely on manual heuri

K- AC · 2 models · #reinforcement-learning #clustering #gas-turbine

arxiv · Jul 21 · bullish

Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL

arXiv:2607.16204v1 Announce Type: new Abstract: Recent growth in reinforcement learning (RL) has surfaced a need for diverse, specialized training environments. Hand-curated environments with fixed task and reward difficulties become ineffective signals as model performance improves, and sparse rewa

LL MD LF · 5 models · +2 · #reinforcement-learning #world-models #autoregressive-models

arxiv · Jul 18 · bullish

RAD: Retrieval High-quality Demonstrations to Enhance Decision-making

arXiv:2507.15356v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) learns policies from fixed datasets, thereby avoiding costly or unsafe environment interactions. However, its reliance on finite static datasets inherently restricts the ability to generalize beyond the training

RE · 1 model · #reinforcement-learning #offline-rl #generative-models

arxiv · Jul 18 · bullish

Reachability-Aware Pretraining for Efficient Target-Oriented Path Exploration in Temporal Knowledge Graph Reasoning

arXiv:2607.14886v1 Announce Type: new Abstract: Temporal Knowledge Graph (TKG) reasoning under the extrapolation setting focuses on forecasting future time-stamped events (facts) from historical data in a temporal knowledge graph. Existing approaches, reinforcement learning (RL)-based multi-hop reas

RA · 1 model · #temporal-knowledge-graph #reinforcement-learning #pretraining

arxiv · Jul 18

PAC Learning in Turn-Based Stochastic Games with Reachability Objectives: A Decentralized Private Approach via Expected Conditional Distance

arXiv:2607.14877v1 Announce Type: new Abstract: Reachability is the most fundamental logical objective, yet it is notoriously difficult to learn in reinforcement learning settings: even for Markov decision processes, PAC learning of reachability is impossible without additional assumptions. This dif

#reinforcement-learning #game-theory #machine-learning

arxiv · Jul 16

Operator-on-F complements value-equivalence: a planning-time diagnostic for latent world models

arXiv:2607.04464v2 Announce Type: replace-cross Abstract: World-model evaluation for model-based reinforcement learning typically asks whether the learned model predicts reward and value well, which can leave planning-relevant errors in the model's latent rollouts unmeasured. We introduce a compleme

TD PU · 2 models · #model-evaluation #reinforcement-learning #diagnostics

arxiv · Jul 16

Flow-aware Optimal Navigation in Unsteady Flows through Reinforcement Learning

arXiv:2607.13553v1 Announce Type: cross Abstract: Autonomous robotic navigation in nonstationary time-varying fluid flows remains a fundamental challenge due to partial observability and the unpredictability of realistic environments. While classical optimal control frameworks employed in robotics r

TD · 1 model · #robotics #reinforcement-learning #navigation

arxiv · Jul 14 · bullish

Active Offline-to-Online Reinforcement Learning

arXiv:2607.11720v1 Announce Type: cross Abstract: Background: Offline reinforcement learning (RL) enables effective policies to be trained from large, previously collected datasets and subsequently improved through limited online interaction. This offline-to-online RL (O2O-RL) paradigm is particular

#reinforcement-learning #offline-learning #fine-tuning

arxiv · Jul 3 · bullish

FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning

arXiv:2607.01440v1 Announce Type: new Abstract: Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evidence. Current medical LLMs either lack active access to evidence or use retrieved evidence without supervising how it shoul

FA QW · 2 models · #medicine #llms #reinforcement-learning

arxiv · Jul 1 · bullish

Safe Online Learning via Smooth Safety-Structured Policy Composition

arXiv:2606.31320v1 Announce Type: new Abstract: Safe online reinforcement learning requires policies to respect safety constraints while maintaining smooth optimization dynamics. Existing approaches typically rely on either strict safety enforcement via action interventions, which introduce disconti

AU · 1 model · #reinforcement-learning #safety #robotics

arxiv · Jun 29 · bullish

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

arXiv:2605.03065v4 Announce Type: replace Abstract: Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algori

#robotics #machine-learning #optimization

arxiv · Jun 25 · bullish

ExTra: Exploratory Trajectory Optimization for Language Model Reinforcement Learning

arXiv:2606.24994v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for language-model reasoning can fail at both extremes of task difficulty: easy prompts often produce all-correct, low-diversity rollout groups with little gradient signal, while hard prompts can pr

QW · 1 model · #reinforcement-learning #language-models #exploration

arxiv · Jun 25 · bullish

Themis: An explainable AI-enabled framework for Reinforcement Learning with Human Feedback

arXiv:2606.24622v1 Announce Type: new Abstract: Training safe Reinforcement Learning (RL) systems is inherently challenging, with no guarantee of avoiding unwanted behaviors. The most effective defenses against this are (i) transparency through explainability and (ii) alignment via human feedback. W

#reinforcement-learning #explainability #human-feedback

arxiv · Jun 25

Reinforcement Learning Improves Traversal of Parametric Knowledge in LLMs

arXiv:2511.05933v2 Announce Type: replace Abstract: Reinforcement learning (RL) is often credited with improving language model reasoning at the expense of knowledge. We challenge this narrative by showing that reasoning models consistently outperform their instruction-tuned versions on pure knowled

#reinforcement-learning #language-models #knowledge-recall

arxiv · Jun 25

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

arXiv:2606.26027v1 Announce Type: new Abstract: Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use task

#machine-learning #reinforcement-learning #large-language-models

arxiv · Jun 25

Bias Fitting to Mitigate Length Bias of Reward Model in RLHF

arXiv:2505.12843v2 Announce Type: replace Abstract: Reinforcement Learning from Human Feedback (RLHF) relies on reward models to align large language models with human preferences. However, RLHF often suffers from reward hacking, wherein policy learning exploits flaws in the trained reward model to

FI DI BE · 3 models · #reinforcement-learning #bias-mitigation #language-models

arxiv · Jun 20 · bullish

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

arXiv:2506.14990v3 Announce Type: replace Abstract: Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3-10 sequential tasks, as C

#reinforcement-learning #benchmark #multi-agent

arxiv · Jun 19 · bullish

VIMPO: Value-Implicit Policy Optimization for LLMs

arXiv:2606.20008v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO av

VI GR PP · 3 models · #reinforcement-learning #language-models #optimization

arxiv · Jun 19

A Model-Driven Approach for Developing Families of Reinforcement Learning Environments

arXiv:2606.20324v1 Announce Type: cross Abstract: Virtual training environments are software-intensive systems in which reinforcement learning (RL) agents learn, adapt, and demonstrate meaningful behavior. Virtual training environments offer a safe and cost-efficient alternative to training agents i

#reinforcement-learning #software-engineering #model-driven-development

arxiv · Jun 17

Learning Upper Lower Value Envelopes to Shape Online RL: A Principled Approach

arXiv:2510.19528v2 Announce Type: replace-cross Abstract: We investigate the fundamental problem of leveraging offline data to accelerate online reinforcement learning - a direction with strong potential but limited theoretical grounding. Our study centers on how to learn and apply val

#reinforcement-learning #offline-data #value-functions

arxiv · Jun 16 · bullish

Reinforcement Learning-Guided Retrieval with Soft Fusion for Robust Multimodal Imitation Learning under Missing Modalities

arXiv:2606.15514v1 Announce Type: cross Abstract: Robotic systems perceive the world through multiple input modalities -- including visual camera streams and natural language instructions -- and must select appropriate actions based on these signals. However, assuming the permanent availability of a

RL · 1 model · #robotics #imitation-learning #reinforcement-learning

arxiv · Jun 15 · bullish

UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems

arXiv:2606.13683v1 Announce Type: new Abstract: To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes a User Portrait based Nested Rollout Policy Adaptation (UP-NRPA) online framework with Large Langu

LA · 1 model · #dialogue-systems #personalization #reinforcement-learning

arxiv · Jun 15 · bullish

Learning Coordinated Preference for Multi-Objective Multi-Agent Reinforcement Learning

arXiv:2606.14693v1 Announce Type: cross Abstract: Cooperative multi-objective multi-agent reinforcement learning (MOMARL) models team decision making under multiple, potentially conflicting objectives. In this setting, conflicts arise not only across objectives but also across agents with different

PC · 1 model · #multi-agent #reinforcement-learning #cooperative-learning

arxiv · Jun 12

Phi-Actor-Critic: Steering General-Sum Games to Pareto-Efficient Correlated Equilibria

arXiv:2606.11284v1 Announce Type: cross Abstract: Real-world multi-agent systems, from traffic coordination to resource allocation, are often modeled as general-sum games where individual incentives conflict with collective welfare. In these settings, the central challenge is not merely finding an e

PH · 1 model · #multi-agent #reinforcement-learning #game-theory

arxiv · Jun 12

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

arXiv:2605.26418v2 Announce Type: replace Abstract: A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reprod

PP DQ A2 · 6 models · +3 · #reinforcement-learning #benchmark #resource-control

arxiv · Jun 10 · bullish

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

arXiv:2606.11119v1 Announce Type: cross Abstract: Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, aris

QW · 1 model · #reinforcement-learning #language-models #optimization

arxiv · Jun 10 · bullish

Offline Reinforcement Learning for Rotation Profile Control in Tokamaks

arXiv:2605.05857v2 Announce Type: replace Abstract: Tokamaks remain leading candidates for achieving practical fusion energy, yet many important control problems inside these devices are still difficult or unsolved. One such challenge is controlling the plasma rotation profile, which strongly influe

RE · 1 model · #fusion #energy #reinforcement-learning

arxiv · Jun 10 · bullish

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

arXiv:2606.08982v2 Announce Type: replace Abstract: Baichuan-M4 is Baichuan Intelligence's clinical-grade medical large model, designed for continuous care rather than single-turn medical question answering. It is built as a coordinated medical agent system around three pillars: Baichuan-Harness, a

BA · 1 model · #medical-ai #reinforcement-learning #clinical-trials

arxiv · Jun 6 · bullish

Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

arXiv:2606.05950v1 Announce Type: new Abstract: Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context edit

ED · 1 model · #image-editing #diffusion-models #multimodal-models

arxiv · Jun 4 · bullish

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

arXiv:2606.04272v1 Announce Type: new Abstract: The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to interme

#reinforcement-learning #pre-training #fine-tuning

arxiv · Jun 3 · bullish

Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

arXiv:2606.03113v1 Announce Type: new Abstract: Large Language Models suffer from slow autoregressive inference. While self-speculative decoding accelerates this process, its efficiency is hampered by static configurations like fixed exit layers and speculation lengths. We reframe this optimization

ME ME · 2 models · #optimization #reinforcement-learning #language-models

arxiv · Jun 2 · bullish

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

arXiv:2606.01230v1 Announce Type: new Abstract: Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and p

HO HO GP · 3 models · #smart-home #language-models #benchmark

arxiv · Jun 2 · bullish

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

arXiv:2605.12969v3 Announce Type: replace-cross Abstract: Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy

GR CO · 2 models · #reinforcement-learning #language-models #optimization

arxiv · Jun 1

The Challenges of Using Reinforcement Learning for Controlling Industrial Energy Systems

arXiv:2605.31044v1 Announce Type: new Abstract: Reinforcement learning has shown promising results for optimizing the control of industrial energy systems, yet most existing studies remain limited to the application in simulation environments. We investigate the challenges of deploying reinforcement

#reinforcement-learning #industrial-energy #real-world-deployment

arxiv · May 29 · bullish

From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

arXiv:2601.21909v2 Announce Type: replace Abstract: Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach do

#llm #reinforcement-learning #cognitive-architecture

arxiv · May 29 · bullish

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

arXiv:2605.29788v1 Announce Type: new Abstract: Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made; standard bandit and reinforcement-learning theory does not capture this causal coupling betwe

NE RF · 2 models · #reinforcement-learning #causal-inference #bandits

arxiv · May 29 · bullish

Moment Matching Q-Learning

arXiv:2605.29033v1 Announce Type: new Abstract: Score-based and flow-based generative models exhibit remarkable expressive capacity in capturing complex distributions, and have been extensively deployed in tasks ranging from image generation to reinforcement learning. Nevertheless, these models suff

#reinforcement-learning #generative-models #efficiency

arxiv · May 28

DSSE: a drone swarm search environment

arXiv:2307.06240v2 Announce Type: replace-cross Abstract: The Drone Swarm Search project is an environment, based on PettingZoo, that is to be used in conjunction with multi-agent (or single-agent) reinforcement learning algorithms. It is an environment in which the agents (drones) have to

#reinforcement-learning #multi-agent #machine-learning

arxiv · May 26

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

arXiv:2605.24202v1 Announce Type: new Abstract: Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-ag

LL · 1 model · #multi-agent #reinforcement-learning #workflow

arxiv · May 26

Reinforcement Learning for Reachability: Guaranteeing Asymptotic Optimality

arXiv:2605.24740v1 Announce Type: new Abstract: Reinforcement learning (RL) for reachability specifications is fundamental in sequential decision-making, yet theoretical guarantees remain less explored. A recent work achieves asymptotic convergence to optimal policies. However, this approach provide

#reinforcement-learning #machine-learning #convergence

arxiv · May 22 · bullish

Token-weighted Direct Preference Optimization with Attention

arXiv:2605.21883v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existi

LA · 1 model · #optimization #language-models #reinforcement-learning

arxiv · May 22 · bullish

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

arXiv:2605.20740v1 Announce Type: cross Abstract: Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensur

#machine-learning #regression #reinforcement-learning

arxiv · May 22 · bullish

Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

arXiv:2506.21039v3 Announce Type: replace-cross Abstract: Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, their reliance on

#reinforcement-learning #hierarchical-rl #goal-conditioned-rl

arxiv · May 22 · bullish

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

arXiv:2605.20255v1 Announce Type: cross Abstract: Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real human crossing behavior. This limits the realism of safety assessments, es

MU · 1 model · #reinforcement-learning #self-driving-cars #safety-assessment

arxiv · May 19

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

arXiv:2511.07288v2 Announce Type: replace-cross Abstract: Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this

GA TR · 2 models · #reinforcement-learning #imitation-learning #machine-learning

arxiv · May 16

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

arXiv:2605.14246v1 Announce Type: cross Abstract: Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although b

#reinforcement-learning #safety #markov-decision-processes

arxiv · May 13 · bullish

Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

arXiv:2605.11467v1 Announce Type: new Abstract: Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of *reasoning theater*: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability

ME QW CL · 3 models · #reasoning #reinforcement-learning #interpretability

arxiv · May 11

Mitigating Cognitive Bias in RLHF by Altering Rationality

arXiv:2605.06895v1 Announce Type: new Abstract: How can we make models robust to even imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards a

LL · 1 model · #reinforcement-learning #human-feedback #cognitive-biases

arxiv · May 11 · bullish

Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

arXiv:2512.20974v3 Announce Type: replace-cross Abstract: Bayesian Reinforcement Learning (BRL), a subclass of Meta-Reinforcement Learning (Meta-RL), provides a principled framework for generalisation by explicitly incorporating Bayesian task parameters into transition and reward models. However, cl

#reinforcement-learning #bayesian-inference #deep-learning

arxiv · May 8

Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

arXiv:2603.18257v2 Announce Type: replace-cross Abstract: When an RL agent's observations contain distractors driven by the same confounders as its true state, observational data alone cannot identify which dimensions the agent controls. In our benchmarks, even state-conditioned observational select

SA · 1 model · #machine-learning #artificial-intelligence #reinforcement-learning

arxiv · May 8

Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

arXiv:2605.05373v1 Announce Type: new Abstract: A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement learning address this

#reinforcement-learning #optimal-control #partial-observability

arxiv · May 8 · bullish

PPO-Based Dynamic Positioning of HAPS-BS in Wind-Disturbed Stratospheric Maritime Networks

arXiv:2605.05240v1 Announce Type: cross Abstract: High-Altitude Platform Stations (HAPS) offer a promising solution for wide-area wireless coverage in maritime regions lacking terrestrial infrastructure. However, maintaining reliable performance is challenging due to dynamic ship mobility and atmosp

PR · 1 model · #wireless-coverage #reinforcement-learning #maritime-networks

arxiv · May 7 · bullish

Adaptive Ensemble Aggregation for Actor-Critics

arXiv:2507.23501v2 Announce Type: replace Abstract: Ensembles are ubiquitous in off-policy actor-critic learning, yet their efficacy depends critically on how they are aggregated. Current methods typically rely on static rules or task-specific hyperparameters to balance overestimation bias and varia

#reinforcement-learning #ensemble-methods #machine-learning

arxiv · May 5 · bullish

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

arXiv:2605.00425v1 Announce Type: new Abstract: Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it diffic

#reinforcement-learning #large-language-models #exploration-exploitation

arxiv · Apr 30 · bullish

Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning

arXiv:2508.19900v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL t

#offline-rl #reinforcement-learning #machine-learning

arxiv · Apr 27 · bullish

An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments

arXiv:2604.22199v1 Announce Type: cross Abstract: Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered

LL · 1 model · #autonomous-robots #open-environments #large-language-models

arxiv · Apr 24 · bullish

ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

arXiv:2604.21357v1 Announce Type: new Abstract: This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, includi

RE · 1 model · #geocoding #language-models #reinforcement-learning

arxiv · Apr 24 · bullish

Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection

arXiv:2601.06498v3 Announce Type: replace Abstract: Due to the limited generalization and interpretability of deep learning classifiers, the final vetting of rare celestial object candidates still relies on expert visual inspection--a manually intensive process. In this process, astronomers leverage

SP · 1 model · #astronomy #spectroscopy #multimodal

arxiv · Apr 24 · bullish

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

arXiv:2604.21896v1 Announce Type: new Abstract: This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineeri

NE · 1 model · #game-playing #ai-agents #reinforcement-learning

arxiv · Apr 24 · bullish

Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling

arXiv:2604.01577v2 Announce Type: replace Abstract: We extend the recent latent recurrent modeling to sequential input streams. By interleaving fast, recurrent latent updates with self-organizational ability between slow observation updates, our method facilitates the learning of stable internal str

LS TR · 2 models · #machine-learning #reinforcement-learning #sequential-modeling

arxiv · Apr 23 · bullish

MOA: Multi-Objective Alignment for Role-Playing Agents

arXiv:2512.09756v2 Announce Type: replace Abstract: Role-playing agents (RPAs) require balancing multiple objectives, such as instruction following, persona consistency, and stylistic fidelity, which are not always perfectly aligned across different dimensions. While prior work has primarily relied

MO · 1 model · #reinforcement-learning #role-playing-agents #multi-objective-optimization

arxiv · Apr 21 · bullish

Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

arXiv:2510.10959v3 Announce Type: replace-cross Abstract: Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy

LA1 model #machine-learning #reasoning #reinforcement-learning Read on arxiv →

arxiv · Apr 21 · bullish

Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints

arXiv:2507.16727v3 Announce Type: replace Abstract: Improving the reliability of large language models (LLMs) is critical for deploying them in real-world scenarios. In this paper, we propose Deliberative Searcher, the first framework to integrate certainty calibration with retrieval-based

#reliability #research #question-answering Read on arxiv →

arxiv · Apr 18

Continuous-time reinforcement learning: ellipticity enables model-free value function approximation

arXiv:2602.06930v2 Announce Type: replace Abstract: We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage function

#reinforcement-learning #markov-diffusions #function-approximation Read on arxiv →

arxiv · Apr 18

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

arXiv:2604.14258v1 Announce Type: cross Abstract: Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a tra

#language-models #fine-tuning #reinforcement-learning Read on arxiv →

arxiv · Apr 18

Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

arXiv:2509.12833v2 Announce Type: replace Abstract: Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe env

#reinforcement-learning #safety #optimization Read on arxiv →

arxiv · Apr 18 · bullish

Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization

arXiv:2604.14267v1 Announce Type: new Abstract: Search agents extend Large Language Models (LLMs) beyond static parametric knowledge by enabling access to up-to-date and long-tail information unavailable during pretraining. While reinforcement learning has been widely adopted for training such agent

LLQWQW3 models #machine-learning #reinforcement-learning #search-agents Read on arxiv →

arxiv · Apr 13 · bullish

Sample-Efficient Neurosymbolic Deep Reinforcement Learning

arXiv:2601.02850v2 Announce Type: replace Abstract: Reinforcement Learning (RL) is a well-established framework for sequential decision-making in complex environments. However, state-of-the-art Deep RL (DRL) algorithms typically require large training datasets and often struggle to generalize beyond

#reinforcement-learning #deep-learning #neuro-symbolic Read on arxiv →

arxiv · Apr 10

Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

arXiv:2603.28281v2 Announce Type: replace Abstract: We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensio

#machine-learning #reinforcement-learning #robustness Read on arxiv →

arxiv · Apr 10 · bullish

SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

arXiv:2604.07791v1 Announce Type: cross Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn

#reinforcement-learning #self-evolving-agents #knowledge-reasoning Read on arxiv →

arxiv · Apr 8 · bullish

Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

arXiv:2604.05808v1 Announce Type: new Abstract: Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision-making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limit

ST1 model #reinforcement-learning #hierarchical-learning #large-language-models Read on arxiv →

arxiv · Apr 7

Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems

arXiv:2604.04237v1 Announce Type: cross Abstract: Reinforcement learning (RL) is increasingly used to personalize instruction in intelligent tutoring systems, yet the field lacks a formal framework for defining and evaluating pedagogical safety. We introduce a four-layer model of pedagogical safety

#education #reinforcement-learning #safety Read on arxiv →

arxiv · Apr 6 · bullish

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

arXiv:2604.02869v1 Announce Type: new Abstract: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Rela

QWQWGP5 models · +2 #reinforcement-learning #conversational-ai #benchmark Read on arxiv →

arxiv · Apr 3

Reinforcement Learning-based Task Offloading in the Internet of Wearable Things

arXiv:2510.07487v2 Announce Type: replace Abstract: Over the years, significant contributions have been made by the research and industrial sectors to improve wearable devices towards the Internet of Wearable Things (IoWT) paradigm. However, wearables are still facing several challenges. Many stem f

#wearables #edge-computing #reinforcement-learning Read on arxiv →