Table of Contents
Useprime eval to execute rollouts against any supported model provider and report aggregate metrics. Supported providers include OpenAI-compatible APIs (the default) and the Anthropic Messages API (via --api-client-type anthropic_messages).
Basic Usage
Run evaluations directly against a local or Hub environment:prime eval resolves and installs the environment when needed, imports the environment module using Python’s import system, calls its load_environment() function, runs 5 examples with 3 rollouts each (the default), scores them using the environment’s rubric, and prints aggregate metrics.
Hosted Evaluations
You can also run evaluations on Prime-managed infrastructure withprime eval run --hosted. Hosted evaluations require an environment that has already been published to the Environments Hub, and they are useful when you want Prime to manage execution, monitor logs remotely, or run against a shared Hub environment slug instead of a local package.
--follow, --timeout-minutes, --allow-sandbox-access, and --custom-secrets, see the official Hosted Evaluations guide.
Command Reference
Environment Selection
| Flag | Short | Default | Description |
|---|---|---|---|
env_id_or_path | (positional) | — | Environment ID(s) or path to TOML config |
--taskset.FIELD / --harness.FIELD | — | — | Typed v1 EnvConfig overrides for single-environment runs |
--env-args | -a | {} | JSON object passed to load_environment() |
--extra-env-kwargs | -x | {} | JSON object passed to environment constructor |
--timeout | — | None | Per-rollout wall-clock timeout in seconds. Wins over equivalent values in --extra-env-kwargs or TOML [eval.extra_env_kwargs]. Bounds generation only — scoring is not bounded. |
--env-dir-path | -p | ./environments | Base path for saving output files |
- Single environment:
gsm8k— evaluates one environment - TOML config path:
configs/eval/benchmark.toml— evaluates multiple environments defined in the config file
my-env → my_env) and imported after prime eval run resolves the environment package.
For v1 load_environment(config: vf.EnvConfig) loaders, prefer typed
taskset/harness overrides. These flags are parsed against the concrete child
config types from load_taskset(config: ...) and load_harness(config: ...):
--taskset.id and --harness.id select the
taskset and harness loader packages.
For legacy or direct-constructor environments, the --env-args flag passes
arguments to your load_environment() function:
--extra-env-kwargs flag passes arguments directly to the environment constructor, useful for overriding defaults like max_turns which may not be exposed via load_environment():
--timeout flag. It wins over equivalent values set via --extra-env-kwargs or TOML’s [eval.extra_env_kwargs]:
timed_out=True with stop_condition="timeout_reached" and ends cleanly through the same finalize path as a normal completion (timing, completion rendering, and cleanup handlers all run). is_truncated is not set on timeout — that field tracks model-output truncation (token caps), not wall-clock pressure. Note that --timeout bounds the generation phase only — scoring (reward function execution) is not bounded by this flag.
The same key works in TOML configs as a top-level entry of an [[eval]] table:
Executor autoscaling
Thread-pool executors are automatically sized to match the evaluation concurrency. Duringprime eval run, if concurrency is not explicitly provided via --extra-env-kwargs, it is computed from the concurrency level (max_concurrent, or num_examples * rollouts_per_example when unlimited) using recommended_max_workers(). This value is passed to Environment.set_concurrency(), which resizes both the default event-loop executor and all registered executors.
To override the automatic value:
set_concurrency() directly at runtime:
Model Configuration
| Flag | Short | Default | Description |
|---|---|---|---|
--model | -m | openai/gpt-4.1-mini | Model name or endpoint alias |
--api-base-url | -b | https://api.pinference.ai/api/v1 | API base URL |
--api-key-var | -k | PRIME_API_KEY | Environment variable containing API key |
--api-client-type | — | openai_chat_completions | Client type: openai_completions, openai_chat_completions, openai_chat_completions_token, openai_responses, renderer, anthropic_messages, or nemorl_chat_completions |
--endpoints-path | -e | ./configs/endpoints.toml | Path to TOML endpoints registry |
--header | — | — | Extra HTTP header (Name: Value), repeatable |
--header-from-state | — | framework session id | Per-request header whose value is read from rollout state (Name: state_key), repeatable |
renderer client type requires the optional renderer package. Install it with uv add "verifiers[renderers]" before running evals with --api-client-type renderer.
For convenience, define model endpoints in ./configs/endpoints.toml to avoid repeating URL and key flags.
api_client_type field to select the client implementation (defaults to "openai_chat_completions"). Use "anthropic_messages" for Anthropic models when calling the Anthropic API directly, and "openai_responses" for OpenAI-compatible Responses endpoints.
Optional HTTP headers for inference requests use a short TOML key headers (inline table). The alias extra_headers is accepted with the same shape; do not set both on one row.
[[eval]] TOML configs you can set extra headers as headers = { ... } and/or as a list header = ["Name: Value", ...] (same form as repeated --header). Merge order is: registry row, then the headers table, then each header / --header line, with later entries overriding the same name.
For per-request headers that need to vary per rollout, use headers_from_state = { "X-Name" = "state_key" } and/or header_from_state = ["X-Name: state_key", ...] (same form as repeated --header-from-state). The value for each request is resolved at send time as state[state_key]. If unset, Verifiers supplies a framework-managed X-Session-ID.
To define equivalent replicas, add multiple [[endpoint]] entries with the same endpoint_id.
Then use the alias directly:
--api-base-url and/or --api-key-var. If the model name isn’t found, the CLI flags are used (falling back to defaults when omitted).
In other words, -m/--model is treated as an endpoint alias lookup when present in the registry, and otherwise treated as a literal model id.
When using eval TOML configs, you can set endpoint_id in [[eval]] sections to resolve from the endpoint registry. endpoint_id is only supported when endpoints_path points to a TOML registry file.
Sampling Parameters
| Flag | Short | Default | Description |
|---|---|---|---|
--max-tokens | -t | model default | Maximum tokens to generate |
--temperature | -T | model default | Sampling temperature |
--sampling-args | -S | — | JSON object for additional sampling parameters |
--sampling-args flag accepts any parameters supported by the model’s API:
[sampling]:
reasoning_effort and enable_thinking stay in sampling_args and are also
mirrored into extra_body.chat_template_kwargs for OpenAI-compatible servers
that read chat template options there. Keeping the top-level values lets the
client translate them for the selected provider.
Evaluation Scope
| Flag | Short | Default | Description |
|---|---|---|---|
--num-examples | -n | 5 | Number of dataset examples to evaluate |
--rollouts-per-example | -r | 3 | Rollouts per example (for pass@k, variance) |
--shuffle | — | false | Shuffle the evaluation dataset before selecting examples |
--shuffle-seed | — | 0 when shuffled | Seed used with --shuffle |
num_examples × rollouts_per_example.
When --shuffle is enabled, Verifiers shuffles the full evaluation dataset first, then selects --num-examples, then repeats each selected example according to --rollouts-per-example. This applies to standard environments, composable tasksets, and v1 Taskset/Harness environments because it happens in the shared evaluation input-selection layer.
Concurrency
| Flag | Short | Default | Description |
|---|---|---|---|
--max-concurrent | -c | 32 | Maximum concurrent requests |
--max-concurrent-generation | — | same as -c | Concurrent generation requests |
--max-concurrent-scoring | — | same as -c | Concurrent scoring requests |
--no-interleave-scoring | -N | false | Disable interleaved scoring |
--independent-scoring | -i | false | Score each rollout individually instead of by group |
--max-retries | — | 0 | Retries per rollout on transient InfraError |
--num-workers | -w | auto | Number of env server worker processes (auto = concurrency ÷ 256, minimum 1) |
--no-interleave-scoring to score all rollouts after generation completes.
The --max-retries flag enables automatic retry with exponential backoff when rollouts fail due to transient infrastructure errors (e.g., sandbox timeouts, API failures).
The --num-workers flag controls how many worker processes the env server spawns. Each worker owns its own environment instance and runs rollouts independently. The default auto scales with concurrency.
Display
When evaluating multiple environments, the display shows an overview panel at the top with a compact status line per environment, and a detail panel below with full progress, metrics, and logs for one environment at a time. Use the left/right arrow keys to switch between environments. The overview scrolls to keep the selected environment visible and is capped at half the terminal height. When an eval runs against a Prime Inference endpoint and the model id has pricing available fromprime inference models, token usage rows also show the estimated total USD cost as cost (all). Cost appears after the final input/output token metrics in the live display and in the final summary. If pricing or token usage is unavailable, the cost field is omitted.
Output and Saving
| Flag | Short | Default | Description |
|---|---|---|---|
--verbose | -v | false | Enable debug logging |
--fullscreen | -f | false | Use alternate screen buffer (fullscreen) for the Rich display |
--disable-tui | -d | false | Disable Rich display; use normal logging and tqdm progress |
--abbreviated-summary | -A | false | Abbreviated summary: show settings and stats, skip example prompts |
--output-dir | -o | — | Custom output directory for evaluation results and logs |
--save-results | -s | false | Save results to disk |
--resume [PATH] | -R | — | Resume from a previous run (auto-detect latest matching incomplete run if PATH omitted) |
--state-columns | -C | — | Extra state columns to save (comma-separated) |
--save-to-hf-hub | -H | false | Push results to Hugging Face Hub |
--hf-hub-dataset-name | -D | — | Dataset name for HF Hub |
--heartbeat-url | — | — | Heartbeat URL for uptime monitoring |
./outputs/evals/{env_id}--{model}/{run_id}/. Use --output-dir to override the base output directory — when set, results (and logs) are saved under {output_dir}/evals/{env_id}--{model}/{run_id}/ instead. The directory contains:
results.jsonl— rollout outputs, one per linemetadata.json— evaluation configuration and aggregate metrics
metadata.json includes a cost object with total-run input_usd, output_usd, and total_usd. The field is omitted for unpriced models, third-party providers, unavailable pricing, or missing usage.
Resuming Evaluations
Long-running evaluations can be interrupted and resumed using checkpointing. When--save-results is enabled, results are saved incrementally after each completed group of rollouts. Use --resume to continue from where you left off. Pass a path to resume a specific run, or omit the path to auto-detect the latest incomplete matching run.
Running with checkpoints:
-s (save results) enabled, partial results are written to disk after each group completes. If the evaluation is interrupted, the output directory will contain all completed rollouts up until the interruption.
Resuming from a checkpoint:
results.jsonl and metadata.json. With --resume and no path, verifiers scans the environment/model output directory and picks the most recent incomplete run matching env_id, model, rollouts_per_example, shuffle, and shuffle_seed where saved num_examples is less than or equal to the current run. When resuming:
- Existing completed rollouts are loaded from the checkpoint
- Remaining rollouts are computed based on the example ids and group size
- Only incomplete rollouts are executed
- New results are appended to the existing checkpoint
--model, --env-args, --rollouts-per-example, --shuffle, or --shuffle-seed can lead to undefined behavior. For reliable results, resume with the same configuration used to create the checkpoint, only increasing --num-examples if you need additional rollouts beyond the original target.
Example workflow:
--state-columns flag allows saving environment-specific state fields that your environment stores during rollouts:
Environment Defaults
Environments can specify default evaluation parameters in theirpyproject.toml (See Developing Environments):
- TOML per-environment settings (when using a config file)
- CLI flags
- Environment defaults (from
pyproject.toml) - Global defaults
Multi-Environment Evaluation
You can evaluate multiple environments usingprime eval with a TOML configuration file. This is useful for running comprehensive benchmark suites.
TOML Configuration
For multi-environment evals or fine-grained control over settings, use a TOML configuration file. When using a config file, CLI arguments are ignored.[[eval]] sections to define each evaluation. You can also specify global defaults at the top:
[[eval]] section:
[[eval]] section usually contains an id field. env_id is accepted as
a legacy alias and normalizes to the same internal field. For Taskset/Harness
configs, id may be omitted when [eval.taskset].id names the taskset loader
package; the taskset id becomes the environment id for loading and outputs. All
other fields are optional:
| Field | Type | Description |
|---|---|---|
id | string | Environment module name; optional when [eval.taskset].id is set |
name | string | Optional eval label for display and saved result paths |
args | table | Arguments passed to load_environment() |
taskset | table | v1 taskset config passed through EnvConfig.taskset |
harness | table | v1 harness config passed through EnvConfig.harness |
num_examples | integer | Number of dataset examples to evaluate |
rollouts_per_example | integer | Rollouts per example |
shuffle | boolean | Shuffle the evaluation dataset before selecting examples |
shuffle_seed | integer | Seed used when shuffle = true |
extra_env_kwargs | table | Arguments passed to environment constructor |
model | string | Model to evaluate |
endpoint_id | string | Endpoint registry id (requires TOML endpoints_path) |
sampling | table | Shorthand for sampling_args generation parameters |
name to run the same environment more than once with different args:
taskset and harness sections:
sampling_args = { ... } spelling is still accepted and
normalizes the same way as [sampling] / [eval.sampling].
Ablation Sweeps
Use[[ablation]] blocks to automatically generate eval configs from a cartesian product of parameter values. This is useful for hyperparameter sweeps and ablation studies without manually writing each combination.
- Fixed fields in the
[[ablation]]block (likeid) apply to all expanded configs [ablation.sweep]keys are lists of values crossed as a cartesian product[ablation.sweep.args]keys are swept and merged into environment args- Fixed
argscan be set alongside swept ones (e.g.args = {split = "test"}keepssplitfixed while sweeping other env args). The same key cannot appear in both fixed and swept args. - Multiple
[[ablation]]blocks are independent (no cross-product between blocks) [[ablation]]and[[eval]]blocks can coexist in the same config fileidcan be a fixed field or a sweep key (e.g.id = ["env-a", "env-b"]), but note that all swept envs must accept the sameargs— use separate[[ablation]]blocks for envs with different argument schemas
--abbreviated-summary (-A) to get a compact summary focused on settings and stats, which is useful when comparing many ablation runs.
Configuration Precedence
When using a config file, CLI arguments are ignored. Settings are resolved as:- TOML per-eval settings — Values specified in
[[eval]]sections - TOML global settings — Values at the top of the config file
- Environment defaults — Values from the environment’s
pyproject.toml - Built-in defaults — (
num_examples=5,rollouts_per_example=3)
- CLI arguments — Flags passed on the command line
- Environment defaults — Values from the environment’s
pyproject.toml - Built-in defaults — (
num_examples=5,rollouts_per_example=3)