BYO Harness is the preferred verifiers.v1 Taskset/Harness authoring path for
new environments that need a clean separation between the task being attempted
and the way a model attempts it.
Use this path when you want to bring your own harness: a tool loop, CLI program,
third-party Python program, sandboxed program, user simulator, MCP server, or
nested sub-harness workflow. For simple one-off environments, the core
Environments guide remains the shortest path.
Core Shape
- Taskset: task rows, task-owned tools, user behavior, metrics, rewards, and cleanup.
- Harness: rollout behavior, model endpoint forwarding, program execution, harness-owned tools, sandboxes, and nested harness calls.
- Env: adapter that makes a taskset/harness pair usable by eval and training workers.
vf.Env uses the base endpoint-backed harness.
Keep the boundary strict: if a tool defines the task’s action space,
observations, success condition, or domain state, put it on the Taskset.
Harnesses should own only execution adapters and framework-specific mechanics.
For example, a Wikispeedia taskset owns click_link and go_back; a
LangChain, OpenAI Agents, CLI, or base harness should consume those tools from
runtime state instead of constructing its own copy.
Tasksets
Taskset(source=...) accepts either a direct iterable of rows or a zero-argument
loader. Direct iterables are fine for tiny examples. Real tasksets should use a
zero-argument loader so imports and constructors stay cheap.
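As a sketch, a zero-argument loader might look like the following. The Taskset name comes from this guide; the row fields and loader body are illustrative, and the real constructor signature may differ:

```python
# Sketch of a zero-argument loader. Heavy imports and row construction stay
# inside the function, so merely importing the module remains cheap.
def load_rows():
    return [
        {"task": "countdown", "prompt": "Reach 24 from 3, 3, 8, 8.", "info": {}},
        {"task": "countdown", "prompt": "Reach 10 from 2, 3, 4.", "info": {}},
    ]

# With the real library this would be passed roughly as:
#   taskset = vf.Taskset(source=load_rows)
rows = load_rows()
```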
Each row carries a task field for routing. v1 tasksets serialize
the full task payload through info["task"] for worker compatibility, and
environment routing uses info["env_id"].
Task Controls
Tasks can request rollout behavior through top-level serializable fields:
- max_turns: per-rollout turn limit for the base harness loop.
- tools: tool visibility as {"show": [...]} or {"hide": [...]}.
- toolsets: toolset visibility or rollout-local toolsets.
- sandbox: per-task overrides for a sandboxed program.
- program: per-task files, dirs, env, setup, artifacts, bindings, and command args.
System prompts are resolved separately from the task prompt. v1 resolves
system_prompt from the task, taskset, and harness as a separate field; the
base harness concatenates the resolved system messages with prompt only when
it submits a model request.
If more than one source provides a system prompt, resolution fails unless the
harness explicitly sets a merge policy.
state.runtime comes from explicit standalone state passing, Taskset.init_group
customization, or eval/training model controls. For normal tasksets, use the
top-level task controls instead.
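A task row requesting these controls is plain serializable data. In this sketch, the control field names come from the list above, while the task and payload values are hypothetical:

```python
# Hypothetical task row exercising the top-level controls described above.
task = {
    "task": "wiki-race",
    "prompt": "Navigate from Cat to Philosophy.",
    "max_turns": 12,                                # per-rollout turn limit
    "tools": {"show": ["click_link", "go_back"]},   # tool visibility
    "sandbox": {"timeout": 300},                    # per-task sandbox override (keys illustrative)
}
```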
task.runtime is not part of the public task schema. Runtime metadata lives on
state.runtime and is written by the harness, the taskset group initializer, or
the eval/training worker.
Use task.program when a taskset owns files or environment variables that a
reusable harness should consume. The taskset cannot change the harness command
or tool interface; duplicate keys across the taskset and harness fail.
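A task.program payload in that spirit might look like this; the key names mirror the program field list above, and the file contents and setup command are purely illustrative:

```python
# Hypothetical task.program payload: the taskset contributes files and env
# for a reusable harness to consume. Per the boundary above, it does not set
# the harness command or tool interface, and duplicate keys across the
# taskset and harness fail.
task = {
    "task": "fix-the-tests",
    "prompt": "Make the failing test pass.",
    "program": {
        "files": {"repo/test_math.py": "def test_add():\n    assert 1 + 1 == 2\n"},
        "env": {"PYTHONDONTWRITEBYTECODE": "1"},
        "setup": ["pip install -e repo"],  # illustrative setup command
    },
}
```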
Toolsets
Tools are packaged as Toolset objects. A taskset can own tools directly.
Hidden tool arguments can bind from task.*, state.*, and tools.*; tool and
user callables can also bind objects.* from their own private dependency
factories.
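As a sketch of task-owned tools: the real Toolset wrapper lives in verifiers.v1, while here the tools are plain callables and the state argument stands in for a hidden state.* binding the framework would inject:

```python
# Task-owned tools for a Wikispeedia-style taskset (simplified).
def click_link(state, link: str) -> str:
    """Follow a wiki link; returns the new page title."""
    state["page"] = link
    return link

def go_back(state) -> str:
    """Return to the start page (simplified for this sketch)."""
    state["page"] = state["start"]
    return state["page"]

# With the real library these would be packaged roughly as:
#   toolset = vf.Toolset(tools=[click_link, go_back])
state = {"start": "Cat", "page": "Cat"}
click_link(state, "Felidae")
```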
Custom harness programs can adapt taskset-owned tools through state.get_tools().
That keeps the same taskset reusable across the base harness, a third-party
agent framework, and CLI or sandbox harnesses.
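A minimal sketch of that adaptation: state.get_tools() is the name from this guide, but the exact call shape is assumed, and the stand-in state object below only exists so the sketch runs without the verifiers runtime:

```python
from types import SimpleNamespace

# Custom harness program that consumes taskset-owned tools from runtime
# state instead of constructing its own copies.
def my_program(task, state):
    tools = {t.__name__: t for t in state.get_tools()}
    # Hand the same tool handles to any agent framework (LangChain,
    # OpenAI Agents, a CLI loop, ...) rather than redefining them here.
    return tools["echo"](task["prompt"])

def echo(text: str) -> str:
    return text

# Stand-in for the runtime state the framework would normally provide.
state = SimpleNamespace(get_tools=lambda: [echo])
result = my_program({"prompt": "hi"}, state)
```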
Harnesses
Create a harness when rollout behavior is no longer just “call the model with
the resolved taskset tools.” Harness.program can be:
| Form | Meaning |
|---|---|
| None | default endpoint-backed tool loop |
| callable | Python program called in-process |
| {"fn": "pkg.module:run"} | importable Python program |
| {"command": ["cmd", "arg"]} | local or sandboxed command |
| {"sandbox": True} | sandboxed default loop |
Command programs can also run sandboxed with MCP tool access via
program={"command": [...], "sandbox": True, "tools": "mcp"}. Python programs
receive callable tool handles by default, or can set
program={"sandbox": True, "tools": "callable"} when the base loop is moved
into a sandbox.
Programs are also the right shape for LLM-free replay.
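An LLM-free replay program might be sketched like this: instead of querying a model, it replays recorded tool calls stored on the task row. The "replay" field name and the call shape are hypothetical:

```python
# Hypothetical replay program: no model calls, just recorded tool calls.
def replay_program(task, state):
    transcript = []
    tools = {t.__name__: t for t in state["tools"]}
    for name, args in task.get("replay", []):
        transcript.append(tools[name](*args))
    return transcript

def add(a, b):
    return a + b

task = {"task": "arith", "replay": [("add", (1, 2)), ("add", (3, 4))]}
out = replay_program(task, {"tools": [add]})
```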
Subclass Harness only when packaging reusable behavior with a new
config surface; do not subclass Env just to bypass inference.
Packaged CLI harnesses should use the same boundary. These implementations live
under verifiers.v1.packages while the v1 surface stabilizes, and are
re-exported through verifiers.v1. CLIHarness is the generic command wrapper;
OpenCode, Pi, MiniSWEAgent, and RLM are bundled leaf wrappers.
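A command-style program spec for a generic CLI wrapper might look like the following sketch; CLIHarness and the "mcp" tools interface are names from this guide, while the command and remaining keys are hypothetical:

```python
# Hypothetical program spec a generic CLI harness wrapper could consume.
program = {
    "command": ["my-agent", "run", "--prompt-file", "prompt.txt"],
    "sandbox": True,   # run the CLI inside a sandbox
    "tools": "mcp",    # expose taskset tools over MCP
}
# With the real library, roughly:
#   harness = vf.CLIHarness(program=program)
```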
HarborTaskset owns Harbor task loading, sandbox overrides, task uploads, and
test scoring. CLI harnesses own CLI installation/config/run behavior and work
with any taskset that supplies a prompt.
Setup, Updates, Signals, And Cleanup
Rollout signals accept task, state, plus any Toolset-bound hidden args. Group
signals accept exactly tasks and states, and return one value per state. Setup
functions use @vf.setup and run before the program body; update functions use
@vf.update and run before scoring; cleanup functions use @vf.cleanup and run
after scoring; teardown functions use @vf.teardown.
For sandbox command/Python programs, program files, directories, setup commands,
state handoff, and tool-interface setup are framework setup contributions with
fixed priorities. User @vf.setup(priority=...) handlers can intentionally run
before or after those built-ins without adding new lifecycle hooks.
env.requires_group_rollouts is true when group-stage updates, scoring,
cleanup, or group setup are part of the environment contract.
env.provides_advantages is true when the environment has explicit advantage
handlers.
TOML Config
Eval and RL TOML own the outer run: model, endpoint, sampling, rollout count,
and trainer/eval settings. v1 config owns taskset and harness behavior inside
the environment package. The recommended loader takes one config object and
routes its taskset and harness sections.
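A hypothetical environment-package config in this shape: the taskset and harness section names come from the split above, while every key inside them is illustrative:

```toml
# Hypothetical v1 config. The outer eval/RL TOML owns model and sampling;
# this file owns only taskset and harness behavior.
[taskset]
split = "train"      # illustrative key
max_rows = 500       # illustrative key

[harness]
max_turns = 16       # illustrative key
```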
Workers forward environment args and v1 config through the taskset/harness
sections.
The taskset/harness sections stay the most specific source and override any
package-level defaults.
An env entry can be a bare string, or a table with fn = "module:callable"
when metadata is needed.
Use a [...scoring.function_name] table to tune or skip an existing
metric/reward without creating a new signal.
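As a sketch of such an override: the full table path and the metric name format_reward below are hypothetical, standing in for the elided [...scoring.function_name] path:

```toml
# Hypothetical scoring override: tune or skip an existing reward signal
# without defining a new one.
[taskset.scoring.format_reward]
weight = 0.0   # skip this reward for the run
```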
For command harnesses, keep endpoint and tool registration under the requested
program.tools interface.
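A hypothetical command-harness section in that shape; "mcp" is the interface name used earlier in this guide, and the command and other keys are illustrative:

```toml
# Endpoint and tool registration stay under the requested
# program.tools interface.
[harness.program]
command = ["my-agent", "--task", "task.json"]
sandbox = true
tools = "mcp"
```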
When To Use Which Path
Use the core SingleTurnEnv, ToolEnv, and MultiTurnEnv docs when you want
the shortest path through the established environment classes.
Use BYO Harness when you want reusable tasksets, reusable harnesses, task-owned
or harness-owned toolsets, third-party Python programs, sandboxed programs,
stateful users, MCP tools, or nested harness calls.
The repository also includes a deeper implementation guide at
verifiers/v1/README.md.