The Training Loop
When you launch an RL training run, three decoupled components coordinate:

- Inference generates model completions turn-by-turn via a vLLM server.
- Orchestrator samples prompts from the environment’s dataset, dispatches rollouts to inference, drives the multi-turn interaction loop (calling the environment’s response logic and tool execution between each model turn), scores completed rollouts via the rubric, packs training batches, and relays weight updates between trainer and inference.
- Trainer receives scored rollouts, computes advantages, and produces weight updates via GRPO.
This loop is implemented by prime-rl.
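The division of labor can be illustrated with a toy loop in which stand-in functions play the roles of inference, rubric scoring, and the trainer. None of the names below are prime-rl APIs; this is a sketch of the control flow only.

```python
def train(steps, sample, rollout, score, update):
    """Toy orchestrator loop: sample -> rollout -> score -> update.

    sample/rollout/score/update stand in for the dataset, the
    inference server, the rubric, and the trainer respectively.
    """
    weights = 0.0            # stand-in for model weights
    mean_rewards = []
    for _ in range(steps):
        prompts = sample()                                    # orchestrator samples prompts
        completions = [rollout(p, weights) for p in prompts]  # inference generates rollouts
        rewards = [score(c) for c in completions]             # rubric scores them
        weights = update(weights, rewards)                    # trainer produces a weight update
        mean_rewards.append(sum(rewards) / len(rewards))      # relayed to inference next step
    return weights, mean_rewards

# Trivial stand-ins, just to show the loop runs end to end.
dataset = ["2+2=?", "3+3=?"]
weights, log = train(
    steps=3,
    sample=lambda: dataset,
    rollout=lambda prompt, w: prompt + " 4",
    score=lambda completion: 1.0 if completion.endswith("4") else 0.0,
    update=lambda w, rewards: w + sum(rewards),
)
```

The key property the sketch preserves is decoupling: the orchestrator only passes values between the other two components, so inference and training can run in separate processes.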
What is an Environment?
An environment is a self-contained Python module that packages three things:

- A dataset of prompts (with optional ground-truth answers or metadata).
- A harness that controls how the model interacts with the task: single-turn Q&A, multi-turn tool calling, stateful sandbox sessions, etc.
- A rubric of reward functions that score the model’s output and produce scalar signals.
The same environment powers RL training, standalone evaluation (via `prime eval run`), and synthetic data generation. RL environments and agent evals are the same abstraction: dataset + harness + scoring rules.
Environments are distributed as versioned Python wheels declared in `pyproject.toml`, installable and shareable via the Environments Hub.
The Type Hierarchy
All user-facing environment types descend from `Environment`, the abstract base class that defines the dataset, rubric, and rollout interface. `MultiTurnEnv` extends `Environment` and implements the core rollout loop; the `rollout()` method is finalized there. The concrete types (`SingleTurnEnv`, `ToolEnv`, `StatefulToolEnv`) each extend `MultiTurnEnv`, layering interaction complexity progressively.
SingleTurnEnv
Prompt in, completion out, reward computed. This is `MultiTurnEnv` with `max_turns=1`. Use it for standard Q&A benchmarks.
MultiTurnEnv
The base class for anything conversational or interactive. Override two hooks:
- `env_response(messages, state)`: produce the environment's next message given the conversation so far. Game logic, simulation steps, or external calls go here.
- `@vf.stop` methods: async methods decorated with `@vf.stop` act as termination conditions.

The rollout loop alternates model turns with `env_response` until a stop condition returns `True` or `max_turns` is reached.
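To make the loop concrete, here is a library-independent sketch of what the finalized rollout does. `GuessEnv`, `is_solved`, and the stub model are illustrative assumptions, not verifiers APIs; `is_solved` plays the role of a `@vf.stop` method.

```python
class GuessEnv:
    """Toy number-guessing task: the environment answers higher/lower."""

    def __init__(self, target, max_turns=10):
        self.target = target
        self.max_turns = max_turns

    def env_response(self, messages, state):
        """Environment's next message, given the conversation so far."""
        guess = int(messages[-1]["content"])
        state["last_guess"] = guess
        if guess == self.target:
            return {"role": "user", "content": "correct"}
        return {"role": "user", "content": "higher" if guess < self.target else "lower"}

    def is_solved(self, messages, state):
        """Stand-in for a @vf.stop method: True ends the rollout."""
        return state.get("last_guess") == self.target


def rollout(env, model, prompt):
    """Alternate model turns with env_response until a stop fires or max_turns."""
    messages = [{"role": "user", "content": prompt}]
    state = {}
    for _ in range(env.max_turns):
        messages.append({"role": "assistant", "content": model(messages)})
        messages.append(env.env_response(messages, state))
        if env.is_solved(messages, state):
            break
    return messages, state


def make_model(lo=0, hi=100):
    """Stub 'model' that binary-searches using the environment's hints."""
    last = None
    def model(messages):
        nonlocal lo, hi, last
        hint = messages[-1]["content"]
        if hint == "higher":
            lo = last + 1
        elif hint == "lower":
            hi = last - 1
        last = (lo + hi) // 2
        return str(last)
    return model
```

Running `rollout(GuessEnv(37), make_model(), "Guess a number between 0 and 100.")` converges in a few turns; the point is that all task logic lives in the two hooks, while the alternating loop itself is fixed.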
ToolEnv
Adds native tool/function-calling. Pass a list of Python functions; verifiers extracts JSON schemas from type hints and docstrings, wires them into the OpenAI-compatible tool-calling protocol, and dispatches calls during rollouts.
Tools in a `ToolEnv` must be stateless and idempotent: each call is fully determined by its arguments. The rollout terminates when the model responds without issuing any tool calls (or hits `max_turns`).
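Tools are ordinary Python functions whose type hints and docstrings describe them to the model. The `tool_schema` helper below is a minimal sketch of that extraction, not the verifiers implementation; it shows only that hints and docstrings carry enough information to build a schema.

```python
import inspect

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

def word_count(text: str) -> int:
    """Count whitespace-separated words in text."""
    return len(text.split())

def tool_schema(fn):
    """Sketch: derive a schema-like dict from a function's hints and docstring."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {name: p.annotation.__name__ for name, p in sig.parameters.items()},
    }
```

Note both functions are stateless: their outputs depend only on their arguments, so repeated dispatch during a rollout is safe.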
StatefulToolEnv
When tools need per-rollout state (a sandbox handle, a database connection, a session token), StatefulToolEnv adds two hooks:
- `setup_state(state)`: called at rollout start to initialize a fresh state dict.
- `update_tool_args(tool_name, args, messages, state)`: intercepts each tool call to inject state into arguments before dispatch.
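A library-independent sketch of these two hooks; the `sandbox_id` field and `run_in_sandbox` tool are illustrative assumptions, not verifiers APIs.

```python
def setup_state(state):
    """Called at rollout start: initialize per-rollout resources."""
    state["sandbox_id"] = "sbx-001"   # stand-in for a real sandbox handle
    return state

def update_tool_args(tool_name, args, messages, state):
    """Intercept a tool call and inject state into its arguments."""
    if tool_name == "run_in_sandbox":
        args = {**args, "sandbox_id": state["sandbox_id"]}
    return args

def run_in_sandbox(cmd: str, sandbox_id: str) -> str:
    """Pretend tool: run cmd inside the rollout's sandbox."""
    return f"[{sandbox_id}] ran: {cmd}"

# One tool call as the dispatcher would see it:
state = setup_state({})
args = update_tool_args("run_in_sandbox", {"cmd": "ls"}, [], state)
result = run_in_sandbox(**args)
```

The interception is the point: the model's tool schema never needs to expose `sandbox_id`; the environment injects it per rollout.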
Environment Structure
Every environment module exposes a `load_environment()` function, which returns an `Environment` instance. This is what the training and evaluation infrastructure calls.
One-time setup belongs in `load_environment()` or `__init__()`. Per-rollout setup goes in `setup_state()`.
Rubrics
A `Rubric` is a set of scoring functions evaluated after each rollout. Functions can be sync or async, and receive keyword arguments like `prompt`, `completion`, `answer`, `state`, and `parser`, depending on what they need.
Functions are weighted and combined into a scalar reward. You can also define non-reward metrics (token count, tool call frequency) that get logged without affecting gradients. For subjective criteria, JudgeRubric delegates scoring to an external LLM judge.
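A sketch of rubric scoring with plain functions: each receives keyword arguments and returns a float, and weights combine them into one scalar reward. A weight of `0.0` turns a function into a logged-only metric. The function names and the `score` combiner are illustrations of the convention, not the library's code.

```python
def correct_answer(completion, answer, **kwargs):
    """1.0 if the ground-truth answer appears in the completion."""
    return 1.0 if answer in completion else 0.0

def brevity(completion, **kwargs):
    """Mild preference for short answers."""
    return 1.0 / (1.0 + len(completion.split()))

def token_count(completion, **kwargs):
    """Non-reward metric: weight 0.0 keeps it out of the reward."""
    return float(len(completion.split()))

def score(funcs, weights, **kwargs):
    """Weighted sum of scoring functions, as a sketch of how rewards combine."""
    return sum(w * f(**kwargs) for f, w in zip(funcs, weights))

reward = score(
    [correct_answer, brevity, token_count],
    [1.0, 0.1, 0.0],
    completion="The answer is 42",
    answer="42",
)
```

Here `token_count` still gets computed (and could be logged) but contributes nothing to the reward, which is how non-reward metrics stay out of the gradient.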