Papers

Trending research and the full catalog - each paper linked to the benchmarks, methods, and models it introduces.

Filtered by domain: Reinforcement Learning

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

25 Jun 2026

Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards provide little guidance on which intermediate decisions should be reinforced or suppressed.

Question Answering, Reinforcement Learning

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

24 Jun 2026

Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks.

Language Modeling, Reinforcement Learning

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

23 Jun 2026

The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mixtures during training, has emerged as a…

Language Modeling, Reinforcement Learning

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

22 Jun 2026

Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable.

Reinforcement Learning

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

24 Jun 2026

Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo…

Language Modeling, Reinforcement Learning

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

22 Jun 2026

On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally regardless of their informativeness.

Language Modeling, Reasoning, Reinforcement Learning