DataBubble
Tag

#reinforcement-learning

44 articles tagged #reinforcement-learning

arxiv · Jun 3 · bullish

Experience-Driven Dynamic Exits for LLMs with Reinforcement Learning

arXiv:2606.03113v1 Announce Type: new Abstract: Large Language Models suffer from slow autoregressive inference. While self-speculative decoding accelerates this process, its efficiency is hampered by static configurations like fixed exit layers and speculation lengths. We reframe this optimization

2 models (ME, ME) · #optimization #reinforcement-learning #language-models
arxiv · Jun 2 · bullish

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

arXiv:2606.01230v1 Announce Type: new Abstract: Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and p

3 models (HO, HO, GP) · #smart-home #language-models #benchmark
arxiv · Jun 2 · bullish

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

arXiv:2605.12969v3 Announce Type: replace-cross Abstract: Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy

2 models (GR, CO) · #reinforcement-learning #language-models #optimization
arxiv · Jun 1

The Challenges of Using Reinforcement Learning for Controlling Industrial Energy Systems

arXiv:2605.31044v1 Announce Type: new Abstract: Reinforcement learning has shown promising results for optimizing the control of industrial energy systems, yet most existing studies remain limited to the application in simulation environments. We investigate the challenges of deploying reinforcement

#reinforcement-learning #industrial-energy #real-world-deployment
arxiv · May 29 · bullish

From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

arXiv:2601.21909v2 Announce Type: replace Abstract: Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach do

#llm #reinforcement-learning #cognitive-architecture
arxiv · May 29 · bullish

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

arXiv:2605.29788v1 Announce Type: new Abstract: Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made; standard bandit and reinforcement-learning theory does not capture this causal coupling betwe

2 models (NE, RF) · #reinforcement-learning #causal-inference #bandits
arxiv · May 29 · bullish

Moment Matching Q-Learning

arXiv:2605.29033v1 Announce Type: new Abstract: Score-based and flow-based generative models exhibit remarkable expressive capacity in capturing complex distributions, and have been extensively deployed in tasks ranging from image generation to reinforcement learning. Nevertheless, these models suff

#reinforcement-learning #generative-models #efficiency
arxiv · May 28

DSSE: a drone swarm search environment

arXiv:2307.06240v2 Announce Type: replace-cross Abstract: The Drone Swarm Search project is an environment, based on PettingZoo, that is to be used in conjunction with multi-agent (or single-agent) reinforcement learning algorithms. It is an environment in which the agents (drones) have to

#reinforcement-learning #multi-agent #machine-learning
arxiv · May 26

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

arXiv:2605.24202v1 Announce Type: new Abstract: Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-ag

1 model (LL) · #multi-agent #reinforcement-learning #workflow
arxiv · May 26

Reinforcement Learning for Reachability: Guaranteeing Asymptotic Optimality

arXiv:2605.24740v1 Announce Type: new Abstract: Reinforcement learning (RL) for reachability specifications is fundamental in sequential decision-making, yet theoretical guarantees remain less explored. A recent work achieves asymptotic convergence to optimal policies. However, this approach provide

#reinforcement-learning #machine-learning #convergence
arxiv · May 22 · bullish

Token-weighted Direct Preference Optimization with Attention

arXiv:2605.21883v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existi

1 model (LA) · #optimization #language-models #reinforcement-learning
arxiv · May 22 · bullish

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

arXiv:2605.20740v1 Announce Type: cross Abstract: Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensur

#machine-learning #regression #reinforcement-learning
arxiv · May 22 · bullish

Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

arXiv:2506.21039v3 Announce Type: replace-cross Abstract: Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, their reliance on

#reinforcement-learning #hierarchical-rl #goal-conditioned-rl
arxiv · May 22 · bullish

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

arXiv:2605.20255v1 Announce Type: cross Abstract: Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real human crossing behavior. This limits the realism of safety assessments, es

1 model (MU) · #reinforcement-learning #self-driving-cars #safety-assessment
arxiv · May 19

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

arXiv:2511.07288v2 Announce Type: replace-cross Abstract: Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this

2 models (GA, TR) · #reinforcement-learning #imitation-learning #machine-learning
arxiv · May 16

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

arXiv:2605.14246v1 Announce Type: cross Abstract: Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although b

#reinforcement-learning #safety #markov-decision-processes
arxiv · May 13 · bullish

Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

arXiv:2605.11467v1 Announce Type: new Abstract: Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of *reasoning theater*: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability

3 models (ME, QW, CL) · #reasoning #reinforcement-learning #interpretability
arxiv · May 11

Mitigating Cognitive Bias in RLHF by Altering Rationality

arXiv:2605.06895v1 Announce Type: new Abstract: How can we make models robust to even imperfect human feedback? In reinforcement learning from human feedback (RLHF), human preferences over model outputs are used to train a reward model that assigns scalar values to responses. Because these rewards a

1 model (LL) · #reinforcement-learning #human-feedback #cognitive-biases
arxiv · May 11 · bullish

Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

arXiv:2512.20974v3 Announce Type: replace-cross Abstract: Bayesian Reinforcement Learning (BRL), a subclass of Meta-Reinforcement Learning (Meta-RL), provides a principled framework for generalisation by explicitly incorporating Bayesian task parameters into transition and reward models. However, cl

#reinforcement-learning #bayesian-inference #deep-learning
arxiv · May 8

Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

arXiv:2603.18257v2 Announce Type: replace-cross Abstract: When an RL agent's observations contain distractors driven by the same confounders as its true state, observational data alone cannot identify which dimensions the agent controls. In our benchmarks, even state-conditioned observational select

1 model (SA) · #machine-learning #artificial-intelligence #reinforcement-learning
arxiv · May 8

Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

arXiv:2605.05373v1 Announce Type: new Abstract: A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement learning address this

#reinforcement-learning #optimal-control #partial-observability
arxiv · May 8 · bullish

PPO-Based Dynamic Positioning of HAPS-BS in Wind-Disturbed Stratospheric Maritime Networks

arXiv:2605.05240v1 Announce Type: cross Abstract: High-Altitude Platform Stations (HAPS) offer a promising solution for wide-area wireless coverage in maritime regions lacking terrestrial infrastructure. However, maintaining reliable performance is challenging due to dynamic ship mobility and atmosp

1 model (PR) · #wireless-coverage #reinforcement-learning #maritime-networks
arxiv · May 7 · bullish

Adaptive Ensemble Aggregation for Actor-Critics

arXiv:2507.23501v2 Announce Type: replace Abstract: Ensembles are ubiquitous in off-policy actor-critic learning, yet their efficacy depends critically on how they are aggregated. Current methods typically rely on static rules or task-specific hyperparameters to balance overestimation bias and varia

#reinforcement-learning #ensemble-methods #machine-learning
arxiv · May 5 · bullish

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

arXiv:2605.00425v1 Announce Type: new Abstract: Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it diffic

#reinforcement-learning #large-language-models #exploration-exploitation
arxiv · Apr 30 · bullish

Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning

arXiv:2508.19900v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL t

#offline-rl #reinforcement-learning #machine-learning
arxiv · Apr 27 · bullish

An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments

arXiv:2604.22199v1 Announce Type: cross Abstract: Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered

1 model (LL) · #autonomous-robots #open-environments #large-language-models
arxiv · Apr 24 · bullish

ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

arXiv:2604.21357v1 Announce Type: new Abstract: This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, includi

1 model (RE) · #geocoding #language-models #reinforcement-learning
arxiv · Apr 24 · bullish

Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Vetting via Automated Spectral Inspection

arXiv:2601.06498v3 Announce Type: replace Abstract: Due to the limited generalization and interpretability of deep learning classifiers, the final vetting of rare celestial object candidates still relies on expert visual inspection--a manually intensive process. In this process, astronomers leverage

1 model (SP) · #astronomy #spectroscopy #multimodal
arxiv · Apr 24 · bullish

Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models

arXiv:2604.21896v1 Announce Type: new Abstract: This paper introduces a new paradigm for AI game programming, leveraging large language models (LLMs) to extend and operationalize Claude Shannon's taxonomy of game-playing machines. Central to this paradigm is Nemobot, an interactive agentic engineeri

1 model (NE) · #game-playing #ai-agents #reinforcement-learning
arxiv · Apr 24 · bullish

Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling

arXiv:2604.01577v2 Announce Type: replace Abstract: We extend the recent latent recurrent modeling to sequential input streams. By interleaving fast, recurrent latent updates with self-organizational ability between slow observation updates, our method facilitates the learning of stable internal str

2 models (LS, TR) · #machine-learning #reinforcement-learning #sequential-modeling
arxiv · Apr 23 · bullish

MOA: Multi-Objective Alignment for Role-Playing Agents

arXiv:2512.09756v2 Announce Type: replace Abstract: Role-playing agents (RPAs) require balancing multiple objectives, such as instruction following, persona consistency, and stylistic fidelity, which are not always perfectly aligned across different dimensions. While prior work has primarily relied

1 model (MO) · #reinforcement-learning #role-playing-agents #multi-objective-optimization
arxiv · Apr 21 · bullish

Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning

arXiv:2510.10959v3 Announce Type: replace-cross Abstract: Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy

1 model (LA) · #machine-learning #reasoning #reinforcement-learning
arxiv · Apr 21 · bullish

Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints

arXiv:2507.16727v3 Announce Type: replace Abstract: Improving the reliability of large language models (LLMs) is critical for deploying them in real-world scenarios. In this paper, we propose Deliberative Searcher, the first framework to integrate certainty calibration with retrieval-based

#reliability #research #question-answering
arxiv · Apr 18

Continuous-time reinforcement learning: ellipticity enables model-free value function approximation

arXiv:2602.06930v2 Announce Type: replace Abstract: We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage function

#reinforcement-learning #markov-diffusions #function-approximation
arxiv · Apr 18

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

arXiv:2604.14258v1 Announce Type: cross Abstract: Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a tra

#language-models #fine-tuning #reinforcement-learning
arxiv · Apr 18

Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

arXiv:2509.12833v2 Announce Type: replace Abstract: Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe env

#reinforcement-learning #safety #optimization
arxiv · Apr 18 · bullish

Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization

arXiv:2604.14267v1 Announce Type: new Abstract: Search agents extend Large Language Models (LLMs) beyond static parametric knowledge by enabling access to up-to-date and long-tail information unavailable during pretraining. While reinforcement learning has been widely adopted for training such agent

3 models (LL, QW, QW) · #machine-learning #reinforcement-learning #search-agents
arxiv · Apr 13 · bullish

Sample-Efficient Neurosymbolic Deep Reinforcement Learning

arXiv:2601.02850v2 Announce Type: replace Abstract: Reinforcement Learning (RL) is a well-established framework for sequential decision-making in complex environments. However, state-of-the-art Deep RL (DRL) algorithms typically require large training datasets and often struggle to generalize beyond

#reinforcement-learning #deep-learning #neuro-symbolic
arxiv · Apr 10

Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

arXiv:2603.28281v2 Announce Type: replace Abstract: We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensio

#machine-learning #reinforcement-learning #robustness
arxiv · Apr 10 · bullish

SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

arXiv:2604.07791v1 Announce Type: cross Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn

#reinforcement-learning #self-evolving-agents #knowledge-reasoning
arxiv · Apr 8 · bullish

Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

arXiv:2604.05808v1 Announce Type: new Abstract: Large language model (LLM) agents have demonstrated strong capabilities in complex interactive decision-making tasks. However, existing LLM agents typically rely on increasingly long interaction histories, resulting in high computational cost and limit

1 model (ST) · #reinforcement-learning #hierarchical-learning #large-language-models
arxiv · Apr 7

Pedagogical Safety in Educational Reinforcement Learning: Formalizing and Detecting Reward Hacking in AI Tutoring Systems

arXiv:2604.04237v1 Announce Type: cross Abstract: Reinforcement learning (RL) is increasingly used to personalize instruction in intelligent tutoring systems, yet the field lacks a formal framework for defining and evaluating pedagogical safety. We introduce a four-layer model of pedagogical safety

#education #reinforcement-learning #safety
arxiv · Apr 6 · bullish

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

arXiv:2604.02869v1 Announce Type: new Abstract: Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Rela

5 models (QW, QW, GP, +2) · #reinforcement-learning #conversational-ai #benchmark
arxiv · Apr 3

Reinforcement Learning-based Task Offloading in the Internet of Wearable Things

arXiv:2510.07487v2 Announce Type: replace Abstract: Over the years, significant contributions have been made by the research and industrial sectors to improve wearable devices towards the Internet of Wearable Things (IoWT) paradigm. However, wearables are still facing several challenges. Many stem f

#wearables #edge-computing #reinforcement-learning