SingleTurnEnv, ToolEnv, or MultiTurnEnv examples, see
Environments. For API signatures, see
Reference.
Start From The Template
Initialize the v1 Taskset/Harness template first:my_env.py and edit it in this order:
- Add user-facing task settings to
MyTasksetConfig. - Fill
MyTaskset.load_tasks(split=...)with train/eval task records. - Add task-owned tools with
MyTaskset.load_toolsets()when the task defines an action space. - Add task behavior with
@vf.setup,@vf.update,@vf.reward,@vf.metric,@vf.cleanup, and related lifecycle methods onMyTaskset. - Add a
Usersubclass andload_user()when the task owns simulated user behavior. - If
--with-harnessis used, put execution-level program, sandbox, endpoint, model, or harness lifecycle behavior onMyHarness. - Keep the generated loaders as the typed entrypoints:
load_taskset, optionalload_harness, and the rootload_environment.
Golden Shape
Every v1 environment has one root loader and typed child loaders:load_taskset(config: MyTasksetConfig) defines the [env.taskset] schema; load_harness(config: MyHarnessConfig) defines the [env.harness] schema. Keep
load_environment(config: vf.EnvConfig) as-is: implement the config surface through taskset and harness configs, not root loader kwargs.
Start with a taskset and the base harness. Add a custom harness only when the
environment owns a reusable execution protocol such as a command agent,
third-party framework adapter, browser loop, endpoint interceptor, primary
sandbox placement, or program runner.
Implementation Map
- Import as
import verifiers as vf. - Use
XXXConfigclasses for structured settings. - Put task behavior on the taskset config/class.
- Put execution behavior on the harness config/class.
- Use
vf.Envandvf.EnvConfigfor ordinary environment packages. - Let the base
Taskset,Harness, andUserconstructors handle construction; customize with config fields and public methods. - Put taskset/harness behavior on the owning class with standard public methods
or
@vf.*decorators. - Use
system_promptfor system messages. - Keep reusable multi-line internals in utility modules with clear names.
Ownership
| Object | Owns |
|---|---|
Taskset | Task loading, task data, task prompts, task controls, task-owned tools, user behavior, task-specific lifecycle, metrics, rewards, advantages, and task-owned program/sandbox inputs. |
Harness | Rollout execution, execution-level system prompts, model/client defaults, programs, command agents, framework adapters, endpoint interception, primary sandbox placement, harness-owned tools, and execution artifacts. |
Env | The adapter that makes one taskset/harness pair usable by eval and training workers. |
- Wikispeedia link tools belong to the Wikispeedia taskset.
- TextArena game state and user responses belong to the TextArena taskset.
- Harbor task directories, uploads, and tests belong to
HarborTaskset. - OpenCode, Pi, mini-swe-agent, Terminus, and RLM execution belong to harness classes.
- Endpoint routing and interception belong to the harness/runtime, not task rows.
Config
Config values must be serializable. Use import-ref strings such as"my_env.module:factory" when config needs to name a callable across TOML, CLI,
or package boundaries. Python constructors may pass runtime objects only where
the constructor explicitly accepts them, such as vf.Toolset(tools=[...]) or
standalone vf.Harness(model=..., client=...).
Common owner config fields:
| Field | Meaning |
|---|---|
system_prompt | String, system-message list, or vf.SystemPromptConfig. |
user | UserConfig subclass that materializes a registered User. |
toolsets | Configured toolset collection. |
objects | Private dependency factories owned by this object. |
bindings | Hidden argument bindings for handlers, tools, users, and programs. |
artifacts | Text/JSON artifacts owned by this object. |
| lifecycle lists | Import-ref setups, updates, metrics, rewards, cleanups, etc. |
scoring | Per-handler tuning or skipping by handler name. |
TasksetConfig; put harness fields on HarnessConfig.
Avoid broad unions and untyped mappings unless arbitrary JSON is the actual
task payload.
Tasks
Tasksets load train and eval data throughload_tasks(split=...):
vf.Tasks may be a datasets.Dataset, an iterable of serializable records, or
an iterable of vf.Task objects. During rollout, records become immutable
vf.Task objects.
Return a datasets.Dataset directly when the source already has standard
columns such as question and answer; the framework derives prompt from
question. Hardcode fixed upstream split names inside load_tasks(split=...).
Only expose split-name config when the upstream split choice is
genuine user-space configuration, not the way v1 decides whether eval exists.
Return [] for split == "eval" when the taskset has no explicit eval source;
vf.Env treats the empty split as an absent eval dataset, so
Environment.get_eval_dataset() falls back to train data with the standard
warning.
Common task fields:
| Field | Meaning |
|---|---|
prompt | User/developer/tool messages. No system messages. |
system_prompt | Per-task taskset-side system prompt override. |
answer | Reference answer or target data. |
info | Serializable metadata. |
max_turns | Per-task base-loop turn limit. |
toolsets | Toolset visibility: {"show": [...]} or {"hide": [...]}. |
tools | Per-toolset tool visibility: {"search": {"show": [...]}}. |
sandbox | Per-task sandbox override. |
program | Task-owned files, dirs, setup, env, artifacts, bindings, and args. |
artifacts | Task-owned artifacts collected after program execution. |
max_turns, sandbox,
program, and tool visibility fields in task records only when they genuinely
vary by example.
System Prompts
System prompt resolution happens per task during rollout setup. There are two sides:T: the resolved taskset side.task["system_prompt"]wins for that task; otherwise the taskset usesTasksetConfig.system_prompt.H: the harness side fromHarnessConfig.system_prompt.
HarnessConfig.system_prompt_strategy decides how those two sides resolve:
| Strategy | Meaning |
|---|---|
HT | Harness side followed by resolved taskset side. Default. |
TH | Resolved taskset side followed by harness side. |
H_OR_T | Harness side when present, otherwise resolved taskset side. |
T_OR_H | Resolved taskset side when present, otherwise harness side. |
H | Harness side only. |
T | Resolved taskset side only. |
REJECT | Error if both sides are present. |
load_system_prompt(config) only when prompt loading is computed from
other config fields or package resources.
Toolsets
Toolsets package model-visible schemas, hidden bindings, private objects, artifacts, lifecycle hooks, and optional runtime scope.task.*, state.*, objects.*, and tools.*.
Tasks show all toolsets and tools by default. Restrict visibility in task data:
state; keep task rows serializable. Dynamic schemas use
state.add_tool("toolset_name", vf.Tool(...)) during rollout setup against a
named rollout toolset.
MCP servers are normal tool entries:
Users
AUser simulates environment/user responses between model turns. It is not a
callable; subclass vf.User and implement get_response.
Programs, Harnesses, And Sandboxes
HarnessConfig.program is a vf.ProgramConfig:
| Form | Meaning |
|---|---|
vf.ProgramConfig() | Base endpoint-backed tool loop. |
vf.ProgramConfig(base=True) | Explicit base loop, usually with sandbox options. |
vf.ProgramConfig(fn="my_env:run") | Importable Python program. |
vf.ProgramConfig(command=["agent", "run"]) | Local or sandboxed command. |
task["program"].
Harnesses still own the program kind, channel wiring, and primary sandbox
placement. Duplicate files, env vars, artifacts, or bindings fail fast.
Put sandbox config on the harness when it is part of the execution mechanism:
Prompt Preparation Hook
Harnesses can reshape the prompt right before each model request usingprepare_prompt(prompt, state). The default implementation is identity; override it to compact context, redact content, or inject reminders. The prepared messages are what the runtime sends to the model and records in the trajectory, and the base-program loop uses the prepared prompt on subsequent turns.
Lifecycle And Scoring
Lifecycle decorators attach behavior to the owning class:task, state, completion, prompt, and bound
hidden args. Group handlers use tasks and states and must return one value
per state when scoring.
Objects, Bindings, And Artifacts
objects are private dependency factories. bindings connect those objects,
task fields, state fields, or runtime values to hidden callable arguments.
Nested Harnesses
Nested harnesses are ordinary harness runs. Create a child task, create a child state, and run the child harness.Packaged Tasksets And Harnesses
Reusable implementations live in standalone packages underpackages/:
TOML And CLI
Eval and training config owns the run: model, endpoint, sampling, examples, and rollout count. v1 child config owns environment behavior:[...scoring.function_name] to tune or skip an existing class-defined
metric/reward without creating a new signal:
Checklist
Before publishing or asking for review:load_environment(config: vf.EnvConfig)is the only root loader shape.- Custom tasksets have
load_taskset(config: MyTasksetConfig). - Custom harnesses have
load_harness(config: MyHarnessConfig). - No
Taskset,Harness, orUsersubclass overrides__init__. - No ordinary environment subclass of
vf.Envorvf.EnvConfigexists. - Config fields are serializable and named
XXXConfig. - Taskset-owned behavior is not hidden in the harness.
- Harness-owned execution is not hidden in task rows.
- Static prompts live in config; computed prompts use
load_system_prompt. - Tools are exposed through
vf.Toolset; task rows only show/hide them. - Runtime-only resources live on state or runtime-managed owners.
- Metrics/rewards/setup/update/cleanup are decorated with
@vf.*. - Generated component loaders remain the typed taskset/harness entrypoints.
- One-off helper methods and bottom-of-file helper functions are absent.
- Install/load/eval has been validated with
prime eval runor the relevant package-install test.