Papers

Trending research and the full catalog, with each paper linked to the benchmarks, methods, and models it introduces.

Filtered by domain: Reinforcement Learning

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

25 Jun 2026

Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards offer little guidance on which intermediate decisions should be reinforced or suppressed.

Question Answering, Reinforcement Learning

33, 0.5/h

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

15 Jun 2026

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime.

Language Modeling, Reinforcement Learning

1.4k, 0.7/h

Understanding the Behaviors of Environment-aware Information Retrieval

15 Jun 2026

Recent retrieval-augmented generation (RAG) approaches have demonstrated strong capability in handling complex queries, yet current research overlooks a critical challenge: different retrievers require fundamentally different query formulation strategies for optimal performance.

Language Modeling, Reinforcement Learning

7, 0.0/h

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

12 Jun 2026

We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using…

Agents, Coding Agents, Instruction Following, Language Modeling

1.5k, 0.2/h

Rethinking the Divergence Regularization in LLM RL

8 Jun 2026

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization.

Language Modeling, Reinforcement Learning

722, 0.2/h

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

6 Jun 2026

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions.

Image Understanding, Language Modeling, Reinforcement Learning

273, 0.4/h

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

1 Jun 2026

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked.

Question Answering, Reinforcement Learning, Retrieval

785, 0.2/h

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

24 Jun 2026

Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks.

Language Modeling, Reinforcement Learning

2, 0.1/h

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

23 Jun 2026

The composition of training data, governed by the diversity of sources and their mixing strategy, is a cornerstone of Large Language Model (LLM) pre-training. Online Data Mixing (ODM), the technique of adaptively adjusting data mixtures during training, has emerged as a…

Language Modeling, Reinforcement Learning

201

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

22 Jun 2026

Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable.

Reinforcement Learning

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

24 Jun 2026

Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo…

Language Modeling, Reinforcement Learning

ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

22 Jun 2026

On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally regardless of their informativeness.

Language Modeling, Reasoning, Reinforcement Learning