arXiv — Apr 14
MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization
arXiv:2601.07208v2 | Announce Type: replace-cross

Abstract: Group-Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs), yet its efficacy is primarily confined to domains with verifiable ground truths. Extending GRPO to open-domain settings