Table of Contents
- Your First Environment
- Datasets
- Rubrics
- Tool Environments
- Custom Multi-Turn Environments
- Developing Environments
- Environment Groups
- Integrations and Experimental Environments
Your First Environment
The simplest single-turn environments need only a dataset of tasks and a reward function for scoring responses:
- The prompt is sent to the model
- The model generates a response, which becomes the completion
- The reward function scores the result

In SingleTurnEnv, the simplest environment type, only a single model response occurs per rollout. More complex environment types allow adding tool use or other custom interaction protocols.
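As a minimal sketch (a toy dataset and an exact-match reward function, assuming the usual verifiers constructor arguments):

```python
import verifiers as vf
from datasets import Dataset

# A toy dataset: each row has a prompt (list of messages) and an answer.
dataset = Dataset.from_list([
    {"prompt": [{"role": "user", "content": "What is 2 + 2?"}], "answer": "4"},
    {"prompt": [{"role": "user", "content": "What is 3 + 5?"}], "answer": "8"},
])

def exact_match(completion, answer) -> float:
    # completion is a list of messages; score the final assistant message
    return 1.0 if completion[-1]["content"].strip() == answer else 0.0

env = vf.SingleTurnEnv(
    dataset=dataset,
    rubric=vf.Rubric(funcs=[exact_match]),
)
```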
Datasets
Environments use the datasets library from Hugging Face for loading and manipulating datasets. Each row typically has a prompt column, containing a list of initial messages to send to the model. Additionally, there are optional columns for scoring:
- answer — a simple string for ground truth comparisons
- info — structured metadata (dict or JSON string)

Rows may include answer, info, both, or neither.
When using info, prefer using JSON strings if rows may have different schemas, e.g. different fields or nested structures:
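For example (a sketch; the field names here are arbitrary):

```python
import json
from datasets import Dataset

# Rows can carry different info schemas when stored as JSON strings.
dataset = Dataset.from_list([
    {
        "prompt": [{"role": "user", "content": "Summarize the passage."}],
        "info": json.dumps({"source": "wiki", "max_words": 50}),
    },
    {
        "prompt": [{"role": "user", "content": "Translate 'hello' to French."}],
        "info": json.dumps({"target_lang": "fr", "register": "informal"}),
    },
])
```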
When stored as a JSON string, info is parsed into a dict by the environment when running rollouts.
Building the Prompt
The examples above use prompt directly, providing a list of messages ready to send to the model. Alternatively, you can provide a question column containing a string, and the environment will wrap it in a user message:
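For example (a sketch with made-up rows):

```python
from datasets import Dataset

# question is a plain string; the environment wraps it in a user message
dataset = Dataset.from_list([
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is the capital of Japan?", "answer": "Tokyo"},
])
```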
You can also pass a system_prompt to the environment, which prepends a system message:
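A sketch, reusing the exact_match reward function from the earlier example:

```python
env = vf.SingleTurnEnv(
    dataset=dataset,
    rubric=vf.Rubric(funcs=[exact_match]),
    system_prompt="Answer with just the city name.",
)
```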
If the dataset already has a prompt column, both question and system_prompt are ignored.
Evaluation Datasets
Environments can be initialized with a separate eval_dataset for evaluation, distinct from the training dataset:
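A sketch, where train_dataset, eval_dataset, and rubric stand for objects built as in the earlier examples:

```python
env = vf.SingleTurnEnv(
    dataset=train_dataset,      # used for training
    eval_dataset=eval_dataset,  # used for evaluation
    rubric=rubric,
)
```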
When running evaluations with vf-eval, the evaluation dataset is used by default. If no eval_dataset is provided, evaluation falls back to the training dataset.
Rubrics
Each environment has a Rubric that manages scoring. The rubric holds reward functions, combines their outputs into a final reward score, and tracks metrics for observability.
Reward Functions
Reward functions evaluate rollouts and return floats, typically between 0.0 and 1.0. They can request data from the rollout by naming arguments directly:
- completion — the model's output (list of messages)
- prompt — the input messages
- answer — from the dataset
- info — from the dataset
- state — the full rollout state (used in more complex environments)
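For example, a reward function requests only the arguments it needs (a sketch):

```python
def ends_politely(completion) -> float:
    # requests only completion; other rollout data is simply not passed
    return 1.0 if completion[-1]["content"].rstrip().endswith(("!", ".")) else 0.0

def matches_answer(completion, answer) -> float:
    # requests both completion and the dataset answer
    return 1.0 if answer in completion[-1]["content"] else 0.0
```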
Multiple Reward Functions
Rubrics can combine multiple reward functions with custom weights. A reward function can also be tracked purely as a metric, without affecting the final reward, by giving it weight=0:
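A sketch combining the two functions above, plus a length metric tracked with weight=0:

```python
def response_length(completion) -> float:
    # tracked as a metric only (weight=0 below), not part of the reward
    return float(len(completion[-1]["content"]))

rubric = vf.Rubric(
    funcs=[matches_answer, ends_politely, response_length],
    weights=[1.0, 0.2, 0.0],
)
```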
Execution Order and State
Reward functions execute in the order they are added to the rubric. Since state is mutable and shared across all reward functions, earlier functions can store computed values for later functions to use:
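A sketch: the first function parses once and stashes the result in state, and the second reuses it:

```python
def parse_final_answer(completion, state) -> float:
    # store the parsed answer for later reward functions; given weight 0 below
    text = completion[-1]["content"]
    state["parsed_answer"] = text.split("Answer:")[-1].strip()
    return 0.0

def check_parsed_answer(answer, state) -> float:
    return 1.0 if state.get("parsed_answer") == answer else 0.0

rubric = vf.Rubric(
    funcs=[parse_final_answer, check_parsed_answer],
    weights=[0.0, 1.0],
)
```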
Group-Based Reward Functions
During evaluation and RL training, rollouts are organized into groups of rollouts from the same input example. When evaluating, group structure enables per-example aggregate statistics (e.g., pass@k). When training with RL, groups are used for advantage computation relative to other rollouts for the same example. For a dataset with 100 example rows, running 4 rollouts per example yields 100 groups of 4 rollouts each. In some cases, it is useful for reward functions to operate at the group level, such as to measure diversity or compute relative rankings. To define a group reward function, use plural argument names (completions, prompts, answers, infos) and return a list of scores:
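A sketch of a group-level diversity reward (the name and heuristic are illustrative):

```python
def distinct_response_reward(completions) -> list[float]:
    # reward each rollout by how rare its final response is within the group
    texts = [c[-1]["content"].strip() for c in completions]
    return [1.0 / texts.count(t) for t in texts]
```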
Shared Objects
Beyond rollout data, reward functions can request static objects that live within the Rubric class. These are stored in the Rubric's class_objects dictionary, and can be added after initialization via add_class_object():
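A sketch, assuming add_class_object(name, obj) registers the object under the name that reward functions use as an argument:

```python
import tiktoken  # an arbitrary example of a shared, static object

def completion_tokens(completion, tokenizer) -> float:
    # tokenizer is resolved from the rubric's class_objects by argument name
    return float(len(tokenizer.encode(completion[-1]["content"])))

rubric = vf.Rubric(funcs=[completion_tokens], weights=[0.0])
rubric.add_class_object("tokenizer", tiktoken.get_encoding("cl100k_base"))
```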
For example, JudgeRubric provides a judge callable to reward functions for scoring responses:
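A sketch, assuming the judge callable takes (prompt, completion, answer, state), is awaited (reward functions may be async), and that add_reward_func attaches the function; exact signatures may differ between versions:

```python
judge_rubric = vf.JudgeRubric(judge_model="gpt-4.1-mini")

async def judge_reward(prompt, completion, answer, state, judge) -> float:
    # judge returns the judge model's verdict as text
    verdict = await judge(prompt, completion, answer, state)
    return 1.0 if "yes" in verdict.lower() else 0.0

judge_rubric.add_reward_func(judge_reward)
```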
The judge callable formats a prompt comparing the model's response to the ground truth and returns the judge model's verdict.
For more control, JudgeRubric accepts a custom judge_prompt template and exposes its internals (judge_client, judge_model, judge_prompt, judge_sampling_args) as class objects:
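A sketch of a custom template; the placeholder names ({question}, {answer}, {response}) are an assumption about the expected format:

```python
judge_rubric = vf.JudgeRubric(
    judge_model="gpt-4.1-mini",
    judge_prompt=(
        "Question:\n{question}\n\n"
        "Reference answer:\n{answer}\n\n"
        "Model response:\n{response}\n\n"
        "Does the response match the reference? Reply yes or no."
    ),
)
```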
Rubric Groups
Environments can include multiple rubrics by combining them into a RubricGroup (which itself behaves as a single rubric), aggregating all rewards and metrics from constituent rubrics. This is particularly useful for combining multiple rubrics of different types.
For example, MathRubric is a built-in rubric that uses symbolic verification to check mathematical correctness:
It provides a correct_answer reward function that parses \boxed{} answers and uses the math-verify library for symbolic equivalence checking. To add LLM-based evaluation alongside it:
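A sketch, reusing the dataset from earlier (constructor arguments may vary slightly by version):

```python
math_rubric = vf.MathRubric()
judge_rubric = vf.JudgeRubric(judge_model="gpt-4.1-mini")

rubric = vf.RubricGroup(rubrics=[math_rubric, judge_rubric])
env = vf.SingleTurnEnv(dataset=dataset, rubric=rubric)
```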
Metrics and Monitor Rubrics
For simple cases, metrics can be added directly to a rubric via add_metric(). Monitor rubrics extend this pattern by packaging metrics into separate rubrics that are combined via add_rubric(). This allows each environment type in a class hierarchy to contribute its own metrics automatically.
Many environment types automatically include a monitor rubric that tracks metrics specific to their level of the environment class hierarchy:
| Environment | Tracked Metrics |
|---|---|
| MultiTurnEnv | num_turns |
| ToolEnv | total_tool_calls, per-tool counts |
| SandboxEnv | sandbox_ready_wait_time, sandbox_command_execution_time |
| PythonEnv | python_ready_wait_time |
Custom environment classes can contribute their own monitor rubrics via add_rubric():
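A sketch of a custom environment contributing a monitor rubric; the add_rubric() call shape is an assumption:

```python
class MyGameEnv(vf.MultiTurnEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

        def num_hints_used(state) -> float:
            # hypothetical metric read from rollout state
            return float(state.get("hints_used", 0))

        monitor = vf.Rubric(funcs=[num_hints_used], weights=[0.0])
        self.add_rubric(monitor)
```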
Internally, rubrics are combined into a RubricGroup as needed, so monitor rubrics stack up the class hierarchy: PythonEnv inherits metrics from both SandboxEnv and ToolEnv.
Tool Environments
All currently-supported environment types in Verifiers are built on MultiTurnEnv, which implements the core single-agent rollout loop (even SingleTurnEnv is simply a MultiTurnEnv with max_turns=1 and a placeholder env_response method). ToolEnv adds tool calling to this foundation.
Tools are defined as Python functions. Verifiers extracts tool schemas from function signatures and docstrings for use with OpenAI-compatible tool calling:
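For example, a simple tool whose schema is derived from its type hints and docstring:

```python
def add_numbers(a: float, b: float) -> str:
    """Add two numbers and return the result.

    Args:
        a: The first number.
        b: The second number.
    """
    return str(a + b)
```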
Pass the tool functions to ToolEnv directly:
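A sketch, reusing the add_numbers tool and the dataset/rubric from earlier examples:

```python
env = vf.ToolEnv(
    dataset=dataset,
    rubric=rubric,
    tools=[add_numbers],  # schemas are extracted from signatures and docstrings
    max_turns=10,
)
```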
The rollout continues until the model responds without tool calls or reaches the turn limit (max_turns). Each turn consists of a model response followed by the environment's tool execution. Tool call counts are tracked automatically via monitor rubrics (see above).
MCP Tool Environments
For tools implemented as MCP (Model Context Protocol) servers, MCPEnv extends ToolEnv to provide an integration that automatically connects to MCP servers and exposes their tools to the model.
Stateful Tool Environments
ToolEnv and MCPEnv are designed for stateless, read-only tools where no session state needs to persist across calls within a rollout. For tools that require per-rollout state—such as a sandbox container, database connection, or session ID—use StatefulToolEnv.
The setup_state method is called at the beginning of each rollout for all environments which extend MultiTurnEnv, but is a no-op by default (including in ToolEnv).
StatefulToolEnv overrides this to initialize per-rollout resources, and introduces two additional concepts:
- Hidden arguments: Tool functions can have parameters that are injected by the environment but hidden from the model's tool schema (via args_to_skip)
- update_tool_args: An abstract method you implement to inject state into tool calls at runtime
For example, the model sees run_code(code: str) in its tool schema, but the environment injects session_id from rollout state before each call.
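A sketch of this pattern; the method signatures and the add_tool/args_to_skip mechanics shown here are assumptions that may differ between versions:

```python
def run_code(code: str, session_id: str) -> str:
    """Run code in a persistent session and return its output.

    Args:
        code: The code to execute.
        session_id: Injected by the environment; hidden from the model.
    """
    return execute_in_session(session_id, code)  # hypothetical helper

class CodeSessionEnv(vf.StatefulToolEnv):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # expose run_code, hiding session_id from the model-facing schema
        self.add_tool(run_code, args_to_skip=["session_id"])

    async def setup_state(self, state, **kwargs):
        state = await super().setup_state(state, **kwargs)
        state["session_id"] = create_session()  # hypothetical helper
        return state

    def update_tool_args(self, tool_args, messages, state, **kwargs):
        # inject per-rollout state into every tool call
        return {**tool_args, "session_id": state["session_id"]}
```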
Verifiers includes several built-in stateful environment classes: SandboxEnv provides a containerized bash shell, and PythonEnv extends it with a persistent Python REPL (both of which are configured for use with Prime Intellect’s Sandboxes). These handle sandbox lifecycle management automatically.
Stateful environments often define methods decorated with @vf.cleanup (called after each rollout) or @vf.teardown (called once at environment shutdown) for resource management. These decorators, along with @vf.stop for custom stop conditions (boolean functions checked after each turn), are powerful tools for rollout lifecycle control in custom MultiTurnEnv subclasses.
Custom Multi-Turn Environments
For interaction patterns beyond tool calling—games, simulations, or other custom protocols—MultiTurnEnv can be subclassed directly, exposing full control over the rollout loop’s behavior.
The Rollout Loop
Each rollout follows this structure:
- Initialize state — setup_state(state) is called to prepare per-rollout resources
- Loop until done:
  - Get prompt messages (initial prompt, or previous conversation + environment response)
  - Get model response
  - Check stop conditions — if any @vf.stop method returns True, exit the loop
- Render completion — the final conversation is assembled into state["completion"]
- Cleanup — all @vf.cleanup methods are called
The env_response method is an abstract method that must be overridden by all MultiTurnEnv subclasses, and defines how the environment responds after each model turn:
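A sketch of a simple guessing game, assuming a parser was passed to the environment and that the dataset answer is available as state["answer"]; the async signature and return type follow the description below but may vary by version:

```python
class GuessingGameEnv(vf.MultiTurnEnv):
    async def env_response(self, messages, state, **kwargs):
        state["guess_count"] = state.get("guess_count", 0) + 1
        # parse the model's latest guess from its structured output
        guess = self.parser.parse_answer(messages)
        if guess is not None and guess.strip() == state["answer"]:
            state["solved"] = True
            return [{"role": "user", "content": "Correct!"}]
        return [{"role": "user", "content": "Incorrect, try again."}]
```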
env_response receives the full conversation history thus far (and state) and returns a list of new messages to append. When a parser is passed to the environment, it becomes available as self.parser. Passing the same parser to the rubric makes it available to reward functions by name. For tool environments, env_response typically executes tool calls and returns results. For games or other custom protocols, this might involve parsing structured output (as above) and returning state updates or feedback.
Several other methods can optionally be overridden for more control in complex custom environments:
- setup_state(state) — add environment-specific state fields at rollout start
- get_prompt_messages(state) — customize how messages are assembled (e.g. for non-linear conversations)
- render_completion(state) — customize how the final completion is assembled
- add_trajectory_step(state, step) — set intermediate rewards, advantages, or extra metadata per turn
Stop Conditions
Rollouts continue until a stop condition is met, checked after each model response. Custom stop conditions are defined with the @vf.stop decorator:
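A sketch continuing the guessing-game example, assuming stop conditions are methods that take state and return a bool:

```python
class GuessingGameEnv(vf.MultiTurnEnv):
    @vf.stop
    def solved(self, state) -> bool:
        # end the rollout once env_response has marked the game as solved
        return state.get("solved", False)
```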
MultiTurnEnv includes built-in stop conditions for errors, prompt length limits, and max_turns by default.
Execution order can be controlled with priority (higher runs first). This is useful for checking cheap conditions before expensive ones:
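A sketch, assuming priority is passed as a decorator argument:

```python
class GuessingGameEnv(vf.MultiTurnEnv):
    @vf.stop(priority=10)
    def too_many_guesses(self, state) -> bool:
        # cheap check, runs before lower-priority conditions
        return state.get("guess_count", 0) > 20

    @vf.stop(priority=0)
    def judged_unwinnable(self, state) -> bool:
        # expensive check (e.g. an external solver), runs last
        return expensive_unwinnable_check(state)  # hypothetical helper
```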
Error Handling
Verifiers defines a hierarchy of error types under vf.Error:
- vf.ModelError — errors from model interactions (e.g., vf.EmptyModelResponseError)
- vf.OverlongPromptError — prompt exceeds model context length
- vf.ToolError — tool-related errors (vf.ToolParseError, vf.ToolCallError)
- vf.InfraError — infrastructure errors (e.g., vf.SandboxError)
When a vf.Error is raised during a rollout, it is automatically caught and stored in state["error"], triggering the built-in has_error stop condition at the next check. This allows rollouts to terminate gracefully rather than crashing.
For tool environments, you can configure which errors should stop the rollout immediately via stop_errors:
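A sketch, reusing the earlier ToolEnv setup and assuming stop_errors is accepted as a constructor argument:

```python
env = vf.ToolEnv(
    dataset=dataset,
    rubric=rubric,
    tools=[add_numbers],
    stop_errors=[vf.InfraError],  # end the rollout immediately on these
)
```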
Errors not listed in stop_errors are caught and returned as tool response messages, giving the model a chance to recover.
State Initialization
Override setup_state to initialize per-rollout state:
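A sketch continuing the guessing-game example (whether setup_state is async may depend on the version):

```python
class GuessingGameEnv(vf.MultiTurnEnv):
    async def setup_state(self, state, **kwargs):
        state = await super().setup_state(state, **kwargs)
        # per-rollout fields used by env_response and the stop conditions above
        state["guess_count"] = 0
        state["solved"] = False
        return state
```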
Cleanup and Teardown
For resource management, use @vf.cleanup (per-rollout) and @vf.teardown (at environment shutdown):
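A sketch; the method signatures shown (state for cleanup, none for teardown) and the hypothetical resources are assumptions:

```python
class BrowserEnv(vf.StatefulToolEnv):
    @vf.cleanup
    async def close_page(self, state):
        # called after each rollout
        await state["page"].close()  # hypothetical per-rollout resource

    @vf.teardown
    async def shutdown_browser(self):
        # called once at environment shutdown
        await self.browser.close()  # hypothetical shared resource
```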
Signaling Early Termination
To end a rollout from within env_response (e.g., when the game ends), set state["final_env_response"]:
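A simplified variant of the guessing-game example; the exact interplay between setting state["final_env_response"] and the return value is an assumption:

```python
class GuessingGameEnv(vf.MultiTurnEnv):
    async def env_response(self, messages, state, **kwargs):
        guess = messages[-1]["content"].strip()
        if guess == state["answer"]:
            # mark this as the final environment message; the rollout ends here
            state["final_env_response"] = [
                {"role": "user", "content": "Correct! Game over."}
            ]
            return state["final_env_response"]
        return [{"role": "user", "content": "Incorrect, try again."}]
```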
Developing Environments
Environments are packaged as installable Python projects. We recommend developing environments in a workspace with environments/ and configs/ folders. The vf-setup command initializes this structure.
The vf-init command initializes a new environment project.
Each environment module defines a load_environment() function that returns a vf.Environment. Explicitly declare any arguments your environment accepts:
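A minimal sketch of an environment module:

```python
import verifiers as vf
from datasets import Dataset

def load_environment(num_examples: int = 100) -> vf.Environment:
    # num_examples is an arbitrary example of an explicitly declared argument
    dataset = Dataset.from_list([
        {"question": f"What is {i} + {i}?", "answer": str(2 * i)}
        for i in range(num_examples)
    ])

    def exact_match(completion, answer) -> float:
        return 1.0 if completion[-1]["content"].strip() == answer else 0.0

    return vf.SingleTurnEnv(dataset=dataset, rubric=vf.Rubric(funcs=[exact_match]))
```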
pyproject.toml
The pyproject.toml defines package metadata, dependencies, and evaluation defaults.
Key pyproject.toml sections:
- [project] — Package name (used by vf-install and vf-eval), description, version, and dependencies. The tags field is optional metadata for categorizing environments.
- [build-system] — Hatchling is used as the build backend for the Environments Hub.
- [tool.hatch.build] — Lists files to include in the package. Always include pyproject.toml alongside your environment file to ensure that environment metadata is available when the environment is installed. Add any additional source files here.
- [tool.verifiers.eval] — Default parameters for vf-eval when flags aren't provided.
Managing Dependencies
All packages your environment needs must be declared in the dependencies array. Always include verifiers with a minimum version. If your environment uses additional libraries, add them here; they will be installed automatically when the environment is installed.
Installation
Install a local environment with vf-install.
Under the hood, this runs uv pip install -e for local environments, making them importable by vf-eval and other integrations.
Environment Groups
EnvGroup combines multiple environments into a single environment class, enabling multi-task evaluation and training across heterogeneous environments from a unified entrypoint. Each sub-environment maintains its own dataset, rubric, and rollout logic, while the group handles routing and metric aggregation:
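A sketch; the environment names are hypothetical, and the constructor arguments (envs, env_names) are assumptions:

```python
math_env = vf.load_environment("my-math-env")
code_env = vf.load_environment("my-code-env")

group = vf.EnvGroup(
    envs=[math_env, code_env],
    env_names=["math", "code"],
)
```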
Each example is tagged with a task column that routes rollouts to the appropriate environment for generation and scoring. Metrics from all environments are tracked together.
Integrations and Experimental Environments
Beyond the core environment types, Verifiers includes integrations with several third-party environment libraries, as well as a few newer and more experimental environment classes (which are less stable and subject to more frequent change). Supported third-party environment integrations include:
- TextArenaEnv — wraps TextArena text-based game environments
- ReasoningGymEnv — wraps reasoning-gym procedural datasets
These integrations require optional dependency extras (e.g., uv add 'verifiers[ta]' for TextArena).
Newer and more experimental environment classes include:
- GymEnv — universal runner for Gym-compatible environments (OpenAI Gym / Gymnasium API)
- CliAgentEnv — runs custom agent code inside sandboxes, intercepting API requests
- HarborEnv — loads Harbor-format agent benchmark tasks
- RLMEnv — implements Recursive Language Models for unbounded context processing