BYO Harness is the preferredDocumentation Index
Fetch the complete documentation index at: https://docs.primeintellect.ai/llms.txt
Use this file to discover all available pages before exploring further.
verifiers.v1 Taskset/Harness authoring path for
new environments that need a clean separation between the task being attempted
and the way a model attempts it.
Use this path when you want to bring your own harness: a tool loop, CLI program,
third-party Python program, sandboxed program, user simulator, MCP server, or
nested sub-harness workflow. For simple one-off environments, the core
Environments guide remains the shortest path.
Core Shape
Taskset: task rows, task-owned tools, user behavior, metrics, rewards, and cleanup;Harness: rollout behavior, model endpoint forwarding, program execution, harness-owned tools, sandboxes, and nested harness calls;Env: adapter that makes a taskset/harness pair usable by eval and training workers.
vf.Env uses the base endpoint-backed harness.
Keep the boundary strict: if a tool defines the task’s action space,
observations, success condition, or domain state, put it on the Taskset.
Harnesses should own only execution adapters and framework-specific mechanics.
For example, a Wikispeedia taskset owns click_link and go_back; a
LangChain, OpenAI Agents, CLI, or base harness should consume those tools from
runtime state instead of constructing its own copy.
Tasksets
Tasksets own row loading throughload_tasks() and load_eval_tasks() methods.
Config should hold user-facing knobs, such as dataset name, split, or size
limits; taskset methods read those knobs from self.config and return
vf.Tasks.
task field for routing. v1 tasksets serialize
the full task payload through info["task"] for worker compatibility, and
environment routing uses info["env_id"].
Shared Dependencies
Shared dependencies live on the taskset and are injected into named lifecycle or scoring functions through bindings:Message Access
Taskset/harness environments expose one transcript selector:vf.get_messages(...) to get the transcript as typed message objects,
optionally filtered by role. Index or slice the returned list with ordinary
Python. The helper does not parse answers; task-specific extraction belongs in
ordinary Python or a taskset-bound object.
Keep rollout-loop data manipulation explicit. A few lines that read
state["completion"], select messages, inspect task fields, or build a prompt
should usually be written directly where they are used, not hidden behind a
library helper or a one-off private function. Helpers are appropriate when the
logic is reused in multiple places, when a taskset-bound object is part of the
environment contract, or when complex behavior belongs in a named secondary
module. Do not create buried utils imports just to avoid three clear lines in
a reward, update, setup, or program function.
Task Controls
Tasks can request rollout behavior through top-level serializable fields:max_turns: per-rollout turn limit for the base harness loop;tools: tool visibility as{"show": [...]}or{"hide": [...]};toolsets: toolset visibility or rollout-local toolsets;sandbox: per-task overrides for a sandboxed program;program: per-task files, dirs, env, setup, artifacts, bindings, and command args.
prompt. v1 resolves system_prompt from the
task, taskset, and harness as a separate field; the base harness concatenates
the resolved system messages with prompt only when it submits a model request.
If more than one source provides a system prompt, resolution fails unless the
harness explicitly sets a merge policy.
state.runtime comes from explicit standalone state passing, Taskset.init_group
customization, or eval/training model controls. For normal tasksets, use
top-level task controls:
task.runtime is not part of the public task schema. Runtime metadata lives on
state.runtime and is written by the harness, the taskset group initializer, or
the eval/training worker.
Use task.program when a taskset owns files or environment variables that a
reusable harness should consume. The taskset cannot change the harness command
or tool channel; duplicate keys across the taskset and harness fail.
Toolsets
Tools are packaged asToolset objects. A taskset can own tools directly:
task.*, state.*, and tools.*. Tasksets, toolsets, and users can
also bind objects.* from their own private dependency factories.
String binding sources are always framework paths. Use a callable source for
literal string values so misspelled paths fail during setup.
Custom harness programs can adapt taskset-owned tools through state.get_tools().
That keeps the same taskset reusable across the base harness, a third-party
agent framework, and CLI or sandbox harnesses:
Harnesses
Create a harness when rollout behavior is no longer just “call the model with the resolved taskset tools.”Harness.program can be:
| Form | Meaning |
|---|---|
None | default endpoint-backed tool loop |
| callable | Python program called in-process |
{"fn": "pkg.module:run"} | importable Python program |
{"command": ["cmd", "arg"]} | local or sandboxed command |
{"sandbox": True} | sandboxed default loop |
program={"command": [...], "sandbox": True, "channels": "mcp"}. Python programs
receive callable tool handles by default, or can set
program={"sandbox": True, "channels": "callable"} when the base loop is moved
into a sandbox. program.channels supports only the generic callable and mcp
channels. Harness-specific tool carriers, such as RLM skill uploads, should
live on the taskset upload directory contract or the harness config.
Sandbox package installs, sandboxed Python programs, and the MCP proxy share
the managed Python runtime at /tmp/verifiers/python. v1 installs or reuses a
uv binary, creates that runtime as a Python 3.11 venv, and uses uv pip for
package installs, so command harnesses should not add ad hoc MCP package setup.
For sandboxed program.fn refs, v1 resolves the owning local package from the
resolved module root: single-file modules use pyproject.toml in the same
directory as the module file, and package modules use pyproject.toml inside
the package directory. v1 uploads that package and installs it in the program
sandbox. Package dependencies are normal [project.dependencies].
Programs are also the right shape for LLM-free replay:
Harness only when packaging reusable behavior with a new
config surface; do not subclass Env just to bypass inference.
Packaged CLI harnesses should use the same boundary. These implementations live
under verifiers.v1.packages. OpenCode, Pi, MiniSWEAgent, Terminus2,
and RLM are bundled Harness leaf wrappers for common command-line agents:
HarborTaskset(config=HarborTasksetConfig()) loads Harbor-format task
directories from the environment package’s reserved tasks/ directory. Set
dataset = "owner/name" on the config to fetch a Harbor Hub dataset. The
taskset owns Harbor task loading, sandbox overrides, task uploads, and test
scoring.
TextArenaTaskset(config=TextArenaTasksetConfig(...)) wraps compatible
TextArena single-player text games as v1 task rows plus a taskset-owned user
callback. The reusable taskset owns TextArena lifecycle, answer injection, row
sampling, and <guess>...</guess> parsing. Environment packages own
task-specific defaults such as game, answer_state_key, system_prompt,
observation formatting, and rewards.
CLI harnesses own CLI installation/config/run behavior and work with any
taskset that supplies a prompt.
Tasksets can expose package-owned upload directories with get_upload_dirs().
The base Taskset discovers a sibling skills/ directory by default, and
RLM uploads that directory to /task/rlm-skills unless skills= is passed
explicitly to the harness. RLM also registers v1 rollout tools as generated
skills in the same directory during setup. Generated skills run simple callable
tools inside the RLM sandbox by default; tools that need verifier runtime state,
toolset bindings, tool sandboxes, MCP sessions, borrowed handles, or other
nonlocal resources fall back to /vf/tools. Explicit or taskset skill
directories take precedence and generated tool skills get a suffixed name when
there is a collision.
Use RLMConfig in env.harness for RLM-specific settings such as
rlm_repo_ref, rlm_tools, rlm_max_turns, and summarize_at_tokens.
Setup, Updates, Signals, And Cleanup
task, state,
completion, and prompt, plus hidden args supplied by taskset or toolset
bindings. Group signals can request tasks, states, and bound hidden args,
and must return one value per state. Setup functions use @vf.setup and run
before the program body; update functions use @vf.update and run before
scoring; cleanup functions use @vf.cleanup and run after scoring; teardown
functions use @vf.teardown.
For sandbox command/Python programs, program files, directories, setup commands,
state handoff, and channel setup are framework setup contributions with
fixed priorities. User @vf.setup(priority=...) handlers can intentionally run
before or after those built-ins without adding new lifecycle hooks.
env.requires_group_rollouts is true when group-stage updates, scoring,
cleanup, or group setup are part of the environment contract.
env.provides_advantages is true when the environment has explicit advantage
handlers.
TOML Config
Eval and RL TOML own the outer run: model, endpoint, sampling, rollout count, and trainer/eval settings. v1 config owns taskset and harness behavior inside the environment package. The recommended loader takes onevf.EnvConfig object, asserts the child config
types supplied by the child factory annotations, and routes its taskset and
harness sections:
taskset/harness sections:
load_taskset annotation fixes the taskset config
type; define load_harness(config: MyHarnessConfig) only when the environment
owns a custom harness.
env:
fn = "module:callable" when metadata is needed:
[...scoring.function_name] to tune or skip an existing metric/reward without
creating a new signal.
For command harnesses, keep endpoint and tool registration under the requested
program.channels channel:
program.channels, and put skill uploads on the
taskset upload directory contract or the harness config. If a skill runner needs
Python packages in the sandbox, declare them through the sandbox package/setup
path instead of baking MCP proxy setup into the skill.
The implementation details for TOML refs, toolset tables, row loading, program
bindings, and custom config subclasses are in
verifiers/v1/README.md.
When To Use Which Path
Use the coreSingleTurnEnv, ToolEnv, and MultiTurnEnv docs when you want
the shortest path through the established environment classes.
Use BYO Harness when you want reusable tasksets, reusable harnesses, task-owned
or harness-owned toolsets, third-party Python programs, sandboxed programs,
stateful users, MCP tools, or nested harness calls.
The repository also includes a deeper implementation guide at
verifiers/v1/README.md.