The Training Loop

When you launch an RL training run, three decoupled components coordinate:
  • Inference generates model completions turn-by-turn via a vLLM server.
  • Orchestrator samples prompts from the environment’s dataset, dispatches rollouts to inference, and drives the multi-turn interaction loop, calling the environment’s response logic and tool execution between model turns. It also scores completed rollouts via the rubric, packs training batches, and relays weight updates between trainer and inference.
  • Trainer receives scored rollouts, computes advantages, and produces weight updates via GRPO.
One step of training:
Orchestrator samples prompts from the environment
  -> Inference generates a rollout (multi-turn, driven by env logic)
  -> Environment scores the rollout via its rubric
  -> Orchestrator packs the batch and computes advantages
  -> Trainer updates policy weights
  -> Orchestrator broadcasts new weights to inference
  -> Repeat
The full loop is asynchronous. Rollout generation at step N overlaps with training at step N-1, and a single trajectory may span multiple policy versions as weights update mid-rollout. The environment is the only part you write. Everything else (batching, weight sync, GPU scheduling, async rollout management) is handled by the infrastructure. An environment plugs directly into this loop, whether you run it via Hosted Training or self-hosted with prime-rl.
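The step sequence above can be sketched in plain Python. Everything here is an illustrative stub, not the real prime-rl API; the policy is modeled as a bare version counter to keep the control flow visible:

```python
# Illustrative stubs only -- each function stands in for a whole subsystem.

def sample_prompts(dataset, n):
    # Orchestrator: sample a batch of prompts from the environment's dataset.
    return dataset[:n]

def generate_rollout(prompt, policy):
    # Inference: multi-turn generation driven by the environment's logic.
    return {"prompt": prompt, "completion": f"v{policy}:{prompt}"}

def score(rollout):
    # Environment: the rubric reduces a rollout to a scalar reward.
    return 1.0 if rollout["completion"] else 0.0

def train_step(policy, scored_batch):
    # Trainer: advantages + GRPO update, stubbed as a version bump.
    return policy + 1

def run_steps(dataset, policy, steps):
    for _ in range(steps):
        batch = [generate_rollout(p, policy) for p in sample_prompts(dataset, 2)]
        scored = [(r, score(r)) for r in batch]
        policy = train_step(policy, scored)  # new weights then broadcast to inference
    return policy
```

In the real system these stages run asynchronously and overlap across steps; the synchronous loop here only shows the data flow of a single step.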

What is an Environment?

An environment is a self-contained Python module that packages three things:
  1. A dataset of prompts (with optional ground-truth answers or metadata).
  2. A harness that controls how the model interacts with the task: single-turn Q&A, multi-turn tool calling, stateful sandbox sessions, etc.
  3. A rubric of reward functions that score the model’s output and produce scalar signals.
The same environment definition drives RL training, standalone evaluation (prime eval run), and synthetic data generation. RL environments and agent evals are the same abstraction: dataset + harness + scoring rules. Environments are distributed as versioned Python wheels declared in pyproject.toml, installable and shareable via the Environments Hub.

The Type Hierarchy

All user-facing environment types descend from Environment, the abstract base class that defines the dataset, rubric, and rollout interface. MultiTurnEnv extends Environment and implements the core rollout loop — the rollout() method is finalized there. The concrete types (SingleTurnEnv, ToolEnv, StatefulToolEnv) each extend MultiTurnEnv, layering interaction complexity progressively.
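The hierarchy can be summarized with stub classes. These are illustrative placeholders mirroring the relationships described above, not the real classes from the verifiers package:

```python
# Stub classes mirroring the type hierarchy (illustrative only).

class Environment:                    # abstract base: dataset + rubric + rollout interface
    pass

class MultiTurnEnv(Environment):      # implements the core rollout() loop
    pass

class SingleTurnEnv(MultiTurnEnv):    # one model turn: prompt in, completion out
    pass

class ToolEnv(MultiTurnEnv):          # native tool/function calling
    pass

class StatefulToolEnv(MultiTurnEnv):  # per-rollout state hooks for tools
    pass
```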

SingleTurnEnv

Prompt in, completion out, reward computed. This is MultiTurnEnv with max_turns=1. Use it for standard Q&A benchmarks.
import verifiers as vf

dataset = vf.load_example_dataset("gsm8k")

async def correct_answer(completion, answer) -> float:
    return 1.0 if completion[-1]["content"] == answer else 0.0

env = vf.SingleTurnEnv(
    dataset=dataset,
    rubric=vf.Rubric(funcs=[correct_answer]),
)

MultiTurnEnv

The base class for anything conversational or interactive. Subclass it and define two things:
  • env_response(messages, state): produce the environment’s next message given the conversation so far. Game logic, simulation steps, or external calls go here.
  • @vf.stop methods: async methods decorated with @vf.stop act as termination conditions. The rollout ends when any stop condition returns True or max_turns is reached.
The rollout loop alternates between model inference and env_response until a stop condition fires or max_turns is reached.
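That alternation can be sketched in plain Python. All names here are illustrative stand-ins (the real loop lives inside verifiers, and the real stop conditions are async methods on the environment):

```python
# Simplified sketch of the MultiTurnEnv rollout loop (illustrative only).

def run_rollout(model_fn, env_response, stop_conditions, max_turns):
    messages = [{"role": "user", "content": "start"}]
    for _ in range(max_turns):
        # Model turn: inference produces the next assistant message.
        messages.append({"role": "assistant", "content": model_fn(messages)})
        # Stop check: any @vf.stop-style condition ends the rollout.
        if any(cond(messages) for cond in stop_conditions):
            break
        # Environment turn: game logic / simulation produces the reply.
        messages.append({"role": "user", "content": env_response(messages)})
    return messages

msgs = run_rollout(
    model_fn=lambda m: "guess",
    env_response=lambda m: "try again",
    stop_conditions=[lambda m: len(m) >= 5],
    max_turns=10,
)
```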

ToolEnv

Adds native tool/function-calling. Pass a list of Python functions; verifiers extracts JSON schemas from type hints and docstrings, wires them into the OpenAI-compatible tool-calling protocol, and dispatches calls during rollouts.
async def calculate(expression: str) -> str:
    """Evaluate a mathematical expression.

    Args:
        expression: A math expression to evaluate (e.g. "2 + 2 * 3")
    """
    # NOTE: eval() is shown for brevity; use a safe expression parser in production.
    return str(eval(expression))

env = vf.ToolEnv(
    dataset=dataset,
    tools=[calculate, search_tool],
    rubric=rubric,
    max_turns=10,
)
Tools in ToolEnv must be stateless and idempotent: each call is fully determined by its arguments. The environment terminates when the model responds without issuing any tool calls (or hits max_turns).
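The schema-extraction step can be approximated in plain Python. This is a simplified stand-in for what verifiers does internally, not its actual code:

```python
import inspect
import typing

# Minimal mapping from Python annotations to JSON-schema type names.
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(fn):
    """Build a minimal JSON-schema-style tool description from hints + docstring."""
    hints = typing.get_type_hints(fn)
    hints.pop("return", None)
    params = {name: {"type": PY_TO_JSON.get(tp, "string")} for name, tp in hints.items()}
    return {
        "name": fn.__name__,
        "description": (inspect.getdoc(fn) or "").split("\n")[0],
        "parameters": {
            "type": "object",
            "properties": params,
            "required": list(params),
        },
    }

def calculate(expression: str) -> str:
    """Evaluate a mathematical expression."""
    return str(eval(expression))  # eval() shown for brevity; unsafe in production

schema = tool_schema(calculate)
```

The resulting dict has the shape of an OpenAI-compatible tool definition: the function name, the first docstring line as the description, and one typed property per annotated argument.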

StatefulToolEnv

When tools need per-rollout state (a sandbox handle, a database connection, a session token), StatefulToolEnv adds two hooks:
  • setup_state(state): called at rollout start to initialize a fresh state dict.
  • update_tool_args(tool_name, args, messages, state): intercepts each tool call to inject state into arguments before dispatch.
All mutable per-rollout state lives in the state dict; never store it in module-level globals, or concurrent rollouts will interfere with each other.
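The injection pattern can be sketched in plain Python. The tool name, argument names, and dispatch helper below are all hypothetical, standing in for what StatefulToolEnv does around each tool call:

```python
# Illustrative sketch of state injection before tool dispatch (not the verifiers API).

def update_tool_args(tool_name, args, messages, state):
    # Inject the per-rollout sandbox handle; the model never sees this argument.
    if tool_name == "run_code":
        args = {**args, "sandbox_id": state["sandbox_id"]}
    return args

def dispatch(tool_name, model_args, state, tools):
    args = update_tool_args(tool_name, model_args, messages=[], state=state)
    return tools[tool_name](**args)

def run_code(code, sandbox_id):
    # Hypothetical stateful tool: executes code in a per-rollout sandbox.
    return f"ran {code!r} in {sandbox_id}"

state = {"sandbox_id": "sb-1"}  # initialized by setup_state() at rollout start
out = dispatch("run_code", {"code": "print(1)"}, state, {"run_code": run_code})
```

The model only ever supplies the `code` argument; the sandbox handle is spliced in server-side, which keeps credentials and session identifiers out of the conversation.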

Environment Structure

my_env/
├── my_env.py         # Implementation
├── pyproject.toml    # Dependencies + metadata
└── README.md         # Documentation
The entry point is load_environment(), which returns an Environment instance. This is what the training and evaluation infrastructure calls.
import verifiers as vf
from datasets import load_dataset

def load_environment(split: str = "train") -> vf.Environment:
    dataset = load_dataset("my-org/my-dataset", split=split)

    rubric = vf.Rubric(
        funcs=[correctness, format_score],
        weights=[0.8, 0.2],
    )

    return vf.ToolEnv(
        dataset=dataset,
        tools=[search_tool, execute_tool],
        rubric=rubric,
        max_turns=5,
    )
Expensive setup (loading datasets, building indices) goes in load_environment() or __init__(). Per-rollout setup goes in setup_state().

Rubrics

A Rubric is a set of scoring functions evaluated after each rollout. Functions can be sync or async, and receive keyword arguments like prompt, completion, answer, state, and parser depending on what they need. Functions are weighted and combined into a scalar reward. You can also define non-reward metrics (token count, tool call frequency) that get logged without affecting gradients. For subjective criteria, JudgeRubric delegates scoring to an external LLM judge.
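The weighted combination can be sketched in plain Python. The scoring helper and both reward functions below are illustrative, assuming the keyword-argument convention described above:

```python
# Sketch of how a rubric combines weighted reward functions into one scalar.

def score_rollout(funcs, weights, **kwargs):
    # Each function pulls only the keyword arguments it needs.
    per_fn = {fn.__name__: fn(**kwargs) for fn in funcs}
    reward = sum(w * per_fn[fn.__name__] for fn, w in zip(funcs, weights))
    return reward, per_fn

def correctness(completion, answer, **_):
    return 1.0 if completion == answer else 0.0

def format_score(completion, **_):
    # Hypothetical formatting check: no leading/trailing whitespace.
    return 1.0 if completion.strip() == completion else 0.0

reward, per_fn = score_rollout(
    [correctness, format_score],
    [0.8, 0.2],
    completion="42",
    answer="42",
)
```

Per-function scores are kept alongside the combined reward, which is how non-reward metrics can be logged without contributing to the gradient signal (give them a weight of zero).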