Table of Contents
- Type Aliases
- Data Types
- Classes
- Client Classes
- Configuration Types
- Prime CLI Plugin
- Decorators
- Utility Functions
Type Aliases
Messages
ChatMessage
role, content, and optional tool_calls / tool_call_id fields.
SystemMessage
vf.SystemMessage.from_path(...) to load a system prompt from a UTF-8 text file while preserving the file contents verbatim.
Info
SamplingArgs
temperature, top_p, max_tokens).
Tasks
load_tasks(split=...) on a
vf.Taskset subclass and return vf.Tasks. In vf.Env, an empty split is treated as an absent base
dataset.
SystemPrompt
vf.SystemPromptConfig(path="system_prompt.txt") for file-backed prompts, or
override load_system_prompt(config) when prompt construction belongs to the
class. System prompt resolution is per task: task prompt overrides taskset
prompt for the taskset side, then the harness applies
system_prompt_strategy. The default strategy is HT.
RewardFunc
ClientType
Client implementation to use. Set via ClientConfig.client_type.
Data Types
State
dict subclass that tracks rollout information. Accessing keys in INPUT_FIELDS automatically forwards to the nested input object.
Fields set during initialization:
| Field | Type | Description |
|---|---|---|
input | RolloutInput | Nested input data |
client | Client | Client instance |
model | str | Model name |
sampling_args | SamplingArgs | None | Generation parameters |
is_completed | bool | Whether rollout has ended |
is_truncated | bool | Whether generation was truncated |
tool_defs | list[Tool] | None | Available tool definitions |
trajectory | list[TrajectoryStep] | Multi-turn trajectory |
trajectory_id | str | UUID for this rollout |
timing | RolloutTiming | Timing information |
| Field | Type | Description |
|---|---|---|
completion | Messages | None | Final completion |
reward | float | None | Final reward |
advantage | float | None | Advantage over group mean |
metrics | dict[str, float] | None | Per-function metrics |
stop_condition | str | None | Name of triggered stop condition |
error | Error | None | Error if rollout failed |
RolloutInput
RolloutOutput
dict subclass that provides typed access to known fields while supporting arbitrary additional fields from state_columns. All values must be JSON-serializable. Used in GenerateOutputs and for saving results to disk.
TrajectoryStep
RoutedExpertsPayload
TrajectoryStepTokens
TimeSpan
TimeSpans
RolloutTiming
total = scoring.end - generation.startoverhead = total - setup.duration - model.duration - env.duration - scoring.duration
generation.start is stamped at the top of the rollout (before setup_state), so total covers the entire rollout including setup, generation loop, finalize, and scoring. overhead captures any time not attributed to the named phases.
TokenUsage
| Field | Description |
|---|---|
input_tokens | Sum of prompt tokens across all turns. Shared context is counted each time it appears in a prompt. |
output_tokens | Sum of completion tokens across all turns. |
final_input_tokens | Non-completion tokens in the final turn’s context (system prompts, user messages, tool results, etc.). |
final_output_tokens | Completion tokens in the final turn’s context. Equals output_tokens for single-turn rollouts. |
input_tokens == final_input_tokens and output_tokens == final_output_tokens. In a multi-turn rollout, input_tokens > final_input_tokens because earlier turns’ prompts are counted again.
The final_* metrics assume a single, continuously extended trajectory. Non-linear trajectories (multi-agent, context summarization, history rewriting) are not accounted for.
GenerateOutputs
Environment.generate(). Contains a list of RolloutOutput objects (one per rollout) and generation metadata. Each RolloutOutput is a serialized, JSON-compatible dict containing the rollout’s prompt, completion, answer, reward, metrics, timing, and other per-rollout data.
GenerateMetadata
base_url is always serialized as a string. For multi-endpoint runs (e.g., using ClientConfig.endpoint_configs), it is stored as a comma-separated list of URLs.
shuffle records whether evaluation inputs were shuffled before selecting examples. shuffle_seed records the seed used for that shuffle; when shuffle is enabled without an explicit seed, the saved value is 0.
version_info captures the verifiers framework version/commit and the environment package version/commit at generation time. Populated automatically by GenerateOutputsBuilder.
RolloutScore / RolloutScores
Classes
Environment Classes
Environment
| Method | Returns | Description |
|---|---|---|
generate(inputs, client, model, ...) | GenerateOutputs | Run rollouts asynchronously. client accepts Client | ClientConfig. |
generate_sync(inputs, client, ...) | GenerateOutputs | Synchronous wrapper |
evaluate(client, model, ...) | GenerateOutputs | Evaluate on eval_dataset |
evaluate_sync(client, model, ...) | GenerateOutputs | Synchronous evaluation |
| Method | Returns | Description |
|---|---|---|
get_dataset(n=-1, seed=None) | Dataset | Get training dataset (optionally first n, shuffled) |
get_eval_dataset(n=-1, seed=None) | Dataset | Get evaluation dataset |
make_dataset(...) | Dataset | Static method to create dataset from inputs |
| Method | Returns | Description |
|---|---|---|
rollout(input, client, model, sampling_args) | State | Abstract: run single rollout |
init_state(input, client, model, sampling_args) | State | Create initial state from input |
get_model_response(state, prompt, ...) | Response | Get model response for prompt |
is_completed(state) | bool | Check all stop conditions |
run_rollout(sem, input, client, model, sampling_args) | State | Run rollout with semaphore |
run_group(group_inputs, client, model, ...) | list[State] | Generate and score one group |
| Method | Description |
|---|---|
set_kwargs(**kwargs) | Set attributes using setter methods when available |
set_concurrency(concurrency) | Set concurrency and scale all registered thread-pool executors to match |
add_rubric(rubric) | Add or merge rubric |
set_max_seq_len(max_seq_len) | Set maximum sequence length |
set_score_rollouts(bool) | Enable/disable scoring |
SingleTurnEnv
Single-response Q&A tasks. Inherits fromEnvironment.
MultiTurnEnv
env_response.
Abstract method:
has_error, prompt_too_long, max_turns_reached, timeout_reached, max_total_completion_tokens_reached, has_final_env_response
Hooks:
| Method | Description |
|---|---|
setup_state(state) | Initialize per-rollout state |
get_prompt_messages(state) | Customize prompt construction |
render_completion(state) | Customize completion rendering |
add_trajectory_step(state, step) | Customize trajectory handling |
set_max_total_completion_tokens(int) | Set maximum total completion tokens |
ToolEnv
no_tools_called (ends when model responds without tool calls)
Methods:
| Method | Description |
|---|---|
add_tool(tool) | Add a tool at runtime |
remove_tool(tool) | Remove a tool at runtime |
call_tool(name, args, id) | Override to customize tool execution |
StatefulToolEnv
Tools requiring per-rollout state. Overridesetup_state and update_tool_args to inject state.
SandboxEnv
prime sandboxes.
Key parameters:
| Parameter | Type | Description |
|---|---|---|
sandbox_name | str | Name prefix for sandbox instances |
docker_image | str | Docker image to use for the sandbox |
cpu_cores | int | Number of CPU cores |
memory_gb | int | Memory allocation in GB |
disk_size_gb | int | Disk size in GB |
gpu_count | int | Number of GPUs |
timeout_minutes | int | Sandbox timeout in minutes |
timeout_per_command_seconds | int | Per-command execution timeout |
environment_vars | dict[str, str] | None | Environment variables to set in sandbox |
labels | list[str] | None | Labels for sandbox categorization and filtering |
PythonEnv
Persistent Python REPL in sandbox. ExtendsSandboxEnv.
OpenEnvEnv
.build.json), supports both gym and MCP contracts, and requires a prompt_renderer to convert observations into chat messages.
SandboxDebugEnv
SandboxTaskSet instances. It creates the task sandbox, optionally runs task setup, runs one debug step (none, gold_patch, command, or script), and optionally runs tests and scores the result. SWEDebugEnv remains as a deprecated wrapper for older callers.
EnvGroup
info["env_id"] as internal routing metadata; it is not a top-level input,
state, or output field.
v1 Taskset/Harness Classes
The v1 API is exposed from the top-levelverifiers namespace and documented
in BYO Harness. Its core unit is:
Taskset and Env package that runner for datasets, evals, and training.
Task
Taskset, but can be run directly through a standalone Harness.
Common top-level fields:
| Field | Description |
|---|---|
prompt | User/developer/tool messages for the rollout. Must not contain system messages. |
system_prompt | Per-task system messages or string. |
answer | Reference answer or target data. Stays on task, not state. |
info | Serializable metadata. |
max_turns | Per-task base-loop turn limit. |
tools | Toolset-keyed tool visibility: {"wiki": {"show": [...]}} or {"wiki": {"hide": [...]}}. |
toolsets | Toolset visibility: {"show": [...]} or {"hide": [...]}. |
sandbox | Per-task sandbox overrides for sandboxed programs. |
artifacts | Per-task text/JSON files collected after program execution. |
program | Task-owned files, dirs, env, setup, artifacts, bindings, and command args. |
task.runtime is not public schema. Runtime metadata belongs on State.
State
is_completed, stop_condition,
is_truncated, and error cannot be written directly. Use state.stop(...) or
raise vf.Error subclasses.
State.for_task(...) can borrow selected active runtime handles from another
state:
Taskset
load_tasks(split="train" | "eval"). During rollout,
records are always materialized as vf.Task.
Taskset.__init__ is final; subclasses customize behavior through config,
task-loading methods, lifecycle handlers, load_toolsets, load_user, and
other public load methods.
Harness
Harness.__init__ is final; subclasses customize behavior through config,
load_sandbox, load_toolsets, load_system_prompt, lifecycle handlers, and
program config.
HarnessConfig.program is a ProgramConfig. Dict/TOML inputs are accepted as
shorthand for the same config object:
| Form | Meaning |
|---|---|
ProgramConfig() | Default endpoint-backed tool loop. |
ProgramConfig(base=True, ...) | Explicit default loop, usually with sandbox options. |
ProgramConfig(fn="pkg.module:run", ...) | Importable Python program. |
ProgramConfig(command=["cmd", "arg"], ...) | Local or sandboxed command. |
ProgramConfig and implement
resolve() with self.resolve_command(command=..., ...) so typed harness
settings resolve to a canonical command program without a second loader layer.
Sandboxed program.fn refs resolve their owning local package from the resolved
module root: single-file modules use pyproject.toml in the same directory as
the module file, and package modules use pyproject.toml inside the package
directory. v1 uploads and installs that package in the program sandbox. Package
dependencies come from normal [project.dependencies].
Env
config is an EnvConfig object or mapping with nested taskset and
harness sections.
Toolset And MCPTool
objects.* bindings are
private to the owning toolset/user and are not directly accessible from state.
String binding sources are framework paths; literal strings should be bound via
callable sources.
Tasks show all toolsets/tools by default and can restrict them with show or
hide visibility at task["toolsets"] and task["tools"].
v1 Config Models
"module:object" refs for Python callables and loaders. Users are typed
UserConfig objects materialized through registered User subclasses, not
string refs. Unknown fields fail validation.
Parser Classes
Parser
XMLParser
| Method | Returns | Description |
|---|---|---|
parse(text) | SimpleNamespace | Parse XML into object with field attributes |
parse_answer(completion) | str | None | Extract answer field from completion |
get_format_str() | str | Get format description string |
get_fields() | list[str] | Get canonical field names |
format(**kwargs) | str | Format kwargs into XML string |
ThinkParser
</think> tag. For models that always include <think> tags but don’t parse them automatically.
MaybeThinkParser
Handles optional<think> tags (for models that may or may not think).
Rubric Classes
Rubric
1.0. Functions with weight=0.0 are tracked as metrics only.
Methods:
| Method | Description |
|---|---|
add_reward_func(func, weight=1.0) | Add a reward function |
add_metric(func, weight=0.0) | Add a metric (no reward contribution) |
add_class_object(name, obj) | Add object accessible in reward functions |
JudgeRubric
LLM-as-judge evaluation.MathRubric
Math-specific evaluation usingmath-verify.
RubricGroup
Combines rubrics forEnvGroup.
Client Classes
Client
vf types (Messages, Tool, Response) and provider-native formats. The client property exposes the underlying SDK client (e.g., AsyncOpenAI, AsyncAnthropic).
get_response() is the main public method — it converts the prompt and tools to the native format, calls the provider API, validates the response, and converts it back to a vf.Response. Errors are wrapped in vf.ModelError unless they are already vf.Error or authentication errors.
Abstract methods (for subclass implementors):
| Method | Description |
|---|---|
setup_client(config) | Create the native SDK client from ClientConfig |
to_native_prompt(messages) | Convert Messages → native prompt format + extra kwargs |
to_native_tool(tool) | Convert Tool → native tool format |
get_native_response(prompt, model, ...) | Call the provider API |
raise_from_native_response(response) | Raise ModelError for invalid responses |
from_native_response(response) | Convert native response → vf.Response |
close() | Close the underlying SDK client |
Built-in Client Implementations
| Class | client_type | SDK Client | Description |
|---|---|---|---|
OpenAIChatCompletionsClient | "openai_chat_completions" | AsyncOpenAI | Chat Completions API (default) |
OpenAICompletionsClient | "openai_completions" | AsyncOpenAI | Legacy Completions API |
OpenAIChatCompletionsTokenClient | "openai_chat_completions_token" | AsyncOpenAI | Custom vLLM token route (/v1/chat/completions/tokens) — server-side templating + token IDs returned alongside content |
OpenAIResponsesClient | "openai_responses" | AsyncOpenAI | OpenAI Responses API |
RendererClient | "renderer" | AsyncOpenAI | Renderer-backed token-in generate client (client-side tokenization via the renderers package) |
AnthropicMessagesClient | "anthropic_messages" | AsyncAnthropic | Anthropic Messages API |
NeMoRLChatCompletionsClient | "nemorl_chat_completions" | AsyncOpenAI | NeMo-RL Chat Completions variant |
vf.OpenAIChatCompletionsClient, vf.AnthropicMessagesClient, etc. RendererClient requires the optional renderer package; install it with uv add "verifiers[renderers]" before importing vf.RendererClient or using client_type="renderer".
Response
Client implementations return Response from get_response().
Tool
Client converts them to its native format via to_native_tool().
Configuration Types
v1 Config
EnvConfig is the typed v1 loader envelope. TOML [env.taskset] and
[env.harness] sections populate EnvConfig.taskset and EnvConfig.harness.
The normal environment package loader stays typed as load_environment(config: vf.EnvConfig) and delegates child coercion to vf.load_taskset /
vf.load_harness. Environment-specific fields belong on the taskset or harness
config that owns them; do not subclass EnvConfig just to narrow child config
types in ordinary environment packages.
Config subclasses are strict Pydantic config models. Validate raw mappings
with MyConfig.model_validate(...) or use the typed object directly.
ClientConfig
extra_headers_from_state maps HTTP header names to state field names. For each inference request, the header value is dynamically read from the rollout state dict. For example, {"X-Session-ID": "trajectory_id"} adds a X-Session-ID header with the value of state["trajectory_id"], enabling sticky routing at the inference router level.
client_type selects which Client implementation to instantiate (see Client Classes). Use endpoint_configs for multi-endpoint round-robin. In grouped scoring mode, groups are distributed round-robin across endpoint configs.
preserve_all_thinking and preserve_thinking_between_tool_calls are forwarded to the underlying renderer when client_type == "renderer". They control whether past-assistant reasoning_content is re-emitted on subsequent renders — preserve_all_thinking keeps every past-assistant turn’s thinking, and preserve_thinking_between_tool_calls keeps thinking only inside the in-flight assistant→tool→…→assistant block after the most recent user turn (when that block contains at least one tool response). Both default to False (template default applies).
When api_key_var is "PRIME_API_KEY" (the default), credentials are loaded with the following precedence:
- API key:
PRIME_API_KEYenv var >~/.prime/config.json>"EMPTY" - Team ID:
PRIME_TEAM_IDenv var >~/.prime/config.json> not set
prime login.
EndpointClientConfig
ClientConfig.endpoint_configs. Has the same fields as ClientConfig except endpoint_configs itself, preventing recursive nesting.
EvalConfig
EndpointConfig
api_key_var is a credential reference. Endpoint configs never serialize the
materialized API key.
Endpoints maps an endpoint id to one or more endpoint variants. A single variant is represented as a one-item list.
Prime CLI Plugin
Verifiers exposes a plugin contract consumed byprime for command execution.
PRIME_PLUGIN_API_VERSION
prime and verifiers.
PrimeCLIPlugin
build_module_command returns a subprocess command list for python -m <module> ....
get_plugin
prime.
Decorators
@vf.stop
is_completed().
@vf.cleanup
@vf.teardown
Utility Functions
Data Utilities
\boxed{} format. When strict=True, returns "" if no \boxed{} is found (used by MathRubric to avoid scoring unformatted responses). When strict=False (default), returns the original text as a passthrough.
#### marker (GSM8K format).
Environment Utilities
"primeintellect/gsm8k").
Configuration Utilities
MissingKeyError (a ValueError subclass) with a clear message listing all missing keys and instructions for setting them.
Logging Utilities
VF_LOG_LEVEL env var to change default.
vf.log_level("WARNING").