OpenTinker: Separating Concerns in Agentic Reinforcement Learning

We introduce OpenTinker, an open infrastructure for training large language model (LLM) agents with many LoRA-backed policies over shared execution resources. Modern agent workloads mix supervised fine-tuning (SFT), online reinforcement learning (RL), rollout generation, validation, and multi-turn environment interaction. In such workloads, LoRA adapters are not static inference artifacts: they are frequently updated policy states whose optimizer state, rollout snapshot, and training data attribution must remain consistent. OpenTinker centers the runtime around this policy lifecycle. Users define environments, agents, and learning objectives, while the system manages training clients, rollout samplers, checkpoint handles, and policy-version refresh. The same data path supports SFT and RL by converting trajectories into token sequences with explicit masks: context and environment observations condition the model, while generated action tokens carry supervised weights or RL advantages. This design enables multi-LoRA SFT/RL training in which many users, tasks, or agents can share a base model while keeping adapter updates, checkpoints, and rollout snapshots isolated. We describe the system architecture, the adapter lifecycle, the service-backed snapshot handoff used by the current implementation, the backend contract for mixed-adapter rollout kernels, and the training scheduler that isolates adapter-local gradients and optimizer state. Representative validation tasks exercise single-turn, multi-turn, LoRA, and multi-agent agentic training.
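The shared SFT/RL data path described above can be illustrated with a minimal sketch. It is not OpenTinker's actual API: the `Segment` class, the `flatten_trajectory` function, and the convention that a missing advantage means SFT are all hypothetical names and assumptions introduced here for illustration. The sketch shows only the masking idea: context and observation tokens receive zero loss weight, while generated action tokens receive either a supervised weight or an RL advantage.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """One contiguous span of a trajectory (hypothetical structure)."""
    token_ids: List[int]
    is_action: bool                    # True for model-generated tokens
    advantage: Optional[float] = None  # RL advantage; None is treated as SFT


def flatten_trajectory(segments: List[Segment]) -> Tuple[List[int], List[float]]:
    """Flatten segments into token ids and per-token loss weights.

    Context and observation tokens get weight 0.0, so they condition the
    model without contributing to the loss. Action tokens get weight 1.0
    for SFT, or the segment's advantage for RL.
    """
    tokens: List[int] = []
    weights: List[float] = []
    for seg in segments:
        tokens.extend(seg.token_ids)
        if not seg.is_action:
            weight = 0.0
        elif seg.advantage is None:
            weight = 1.0
        else:
            weight = seg.advantage
        weights.extend([weight] * len(seg.token_ids))
    return tokens, weights


# A two-turn RL trajectory: prompt, action, environment observation, action.
rl_trajectory = [
    Segment([101, 102, 103], is_action=False),
    Segment([7, 8], is_action=True, advantage=0.5),
    Segment([201], is_action=False),
    Segment([9], is_action=True, advantage=-0.25),
]
rl_tokens, rl_weights = flatten_trajectory(rl_trajectory)

# The same function handles SFT: action tokens simply carry weight 1.0.
sft_trajectory = [
    Segment([1], is_action=False),
    Segment([2, 3], is_action=True),
]
sft_tokens, sft_weights = flatten_trajectory(sft_trajectory)
```

Under these assumptions, a trainer could multiply per-token log-probabilities by the weights and sum the result, so one loss routine would serve both objectives.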
