> ## Documentation Index
> Fetch the complete documentation index at: https://docs.primeintellect.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# The Environment Model

> Understand environment types, rubrics, and the RL training loop

## The Training Loop

When you launch an RL training run, three decoupled components coordinate:

* **Inference** generates model completions turn-by-turn via a vLLM server.
* **Orchestrator** samples prompts from the environment's dataset, dispatches rollouts to inference, drives the multi-turn interaction loop (calling the environment's response logic and tool execution between each model turn), scores completed rollouts via the rubric, packs training batches, and relays weight updates between trainer and inference.
* **Trainer** receives scored rollouts, computes advantages, and produces weight updates via GRPO.

One step of training:

```
Orchestrator samples prompts from the environment
  -> Inference generates a rollout (multi-turn, driven by env logic)
  -> Environment scores the rollout via its rubric
  -> Orchestrator packs the batch and computes advantages
  -> Trainer updates policy weights
  -> Orchestrator broadcasts new weights to inference
  -> Repeat
```

The full loop is asynchronous. Rollout generation at step N overlaps with training at step N-1, and a single trajectory may span multiple policy versions as weights update mid-rollout.

**The environment is the only part you write.** Everything else (batching, weight sync, GPU scheduling, async rollout management) is handled by the infrastructure. An environment plugs directly into this loop, whether you run it via Hosted Training or self-hosted with `prime-rl`.

***

## What is an Environment?

An environment is a self-contained Python module that packages three things:

1. **A dataset** of prompts (with optional ground-truth answers or metadata).
2. **A harness** that controls how the model interacts with the task: single-turn Q\&A, multi-turn tool calling, stateful sandbox sessions, etc.
3. **A rubric** of reward functions that score the model's output and produce scalar signals.

The same environment definition drives RL training, standalone evaluation (`prime eval run`), and synthetic data generation. RL environments and agent evals are the same abstraction: dataset + harness + scoring rules.

Environments are distributed as versioned Python wheels declared in `pyproject.toml`, installable and shareable via the Environments Hub.

***

## The Type Hierarchy

All user-facing environment types descend from `Environment`, the abstract base class that defines the dataset, rubric, and rollout interface. `MultiTurnEnv` extends `Environment` and implements the core rollout loop — the `rollout()` method is finalized there. The concrete types (`SingleTurnEnv`, `ToolEnv`, `StatefulToolEnv`) each extend `MultiTurnEnv`, layering interaction complexity progressively.

### `SingleTurnEnv`

Prompt in, completion out, reward computed. This is `MultiTurnEnv` with `max_turns=1`. Use it for standard Q\&A benchmarks.

```python theme={null}
import verifiers as vf

dataset = vf.load_example_dataset("gsm8k")

async def correct_answer(completion, answer) -> float:
    return 1.0 if completion[-1]["content"] == answer else 0.0

env = vf.SingleTurnEnv(
    dataset=dataset,
    rubric=vf.Rubric(funcs=[correct_answer]),
)
```

### `MultiTurnEnv`

The base class for anything conversational or interactive. Override two hooks:

* **`env_response(messages, state)`**: produce the environment's next message given the conversation so far. Game logic, simulation steps, or external calls go here.
* **`@vf.stop` methods**: async methods decorated with `@vf.stop` act as termination conditions. The rollout ends when any stop condition returns `True` or `max_turns` is reached.

The rollout loop alternates between model inference and `env_response` until a stop condition fires or `max_turns` is reached.

### `ToolEnv`

Adds native tool/function-calling. Pass a list of Python functions; `verifiers` extracts JSON schemas from type hints and docstrings, wires them into the OpenAI-compatible tool-calling protocol, and dispatches calls during rollouts.

```python theme={null}
async def calculate(expression: str) -> str:
    """Evaluate a mathematical expression.

    Args:
        expression: A math expression to evaluate (e.g. "2 + 2 * 3")
    """
    return str(eval(expression))

env = vf.ToolEnv(
    dataset=dataset,
    tools=[calculate, search_tool],
    rubric=rubric,
    max_turns=10,
)
```

Tools in `ToolEnv` must be **stateless and idempotent**: each call is fully determined by its arguments. The environment terminates when the model responds without issuing any tool calls (or hits `max_turns`).

### `StatefulToolEnv`

When tools need per-rollout state (a sandbox handle, a database connection, a session token), `StatefulToolEnv` adds two hooks:

* **`setup_state(state)`**: called at rollout start to initialize a fresh state dict.
* **`update_tool_args(tool_name, args, messages, state)`**: intercepts each tool call to inject state into arguments before dispatch.

All mutable state lives in the state dict. Never store globals.

***

## Environment Structure

```
my_env/
├── my_env.py         # Implementation
├── pyproject.toml    # Dependencies + metadata
└── README.md         # Documentation
```

The entry point is `load_environment()`, which returns an `Environment` instance. This is what the training and evaluation infrastructure calls.

```python theme={null}
def load_environment(split: str = "train") -> vf.Environment:
    dataset = load_dataset("my-org/my-dataset", split=split)

    rubric = vf.Rubric(
        funcs=[correctness, format_score],
        weights=[0.8, 0.2],
    )

    return vf.ToolEnv(
        dataset=dataset,
        tools=[search_tool, execute_tool],
        rubric=rubric,
        max_turns=5,
    )
```

Expensive setup (loading datasets, building indices) goes in `load_environment()` or `__init__()`. Per-rollout setup goes in `setup_state()`.

***

## Rubrics

A `Rubric` is a set of scoring functions evaluated after each rollout. Functions can be sync or async, and receive keyword arguments like `prompt`, `completion`, `answer`, `state`, and `parser` depending on what they need.

Functions are weighted and combined into a scalar reward. You can also define non-reward metrics (token count, tool call frequency) that get logged without affecting gradients. For subjective criteria, `JudgeRubric` delegates scoring to an external LLM judge.
