Use prime eval to execute rollouts against any supported model provider and report aggregate metrics. Supported providers include OpenAI-compatible APIs (the default) and the Anthropic Messages API (via --api-client-type anthropic_messages).
Basic Usage
Environments must be installed as Python packages before evaluation (via prime env install, or uv pip install from a local environment). prime eval imports the environment module using Python’s import system, calls its load_environment() function, runs 5 examples with 3 rollouts each (the default), scores them using the environment’s rubric, and prints aggregate metrics.
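A minimal sketch, assuming an environment named gsm8k with a local checkout under ./environments/gsm8k (both names are illustrative):

```bash
# Install the environment package from a local checkout, then evaluate it
# with the defaults: 5 examples, 3 rollouts each, openai/gpt-4.1-mini.
uv pip install -e ./environments/gsm8k
prime eval gsm8k
```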
Command Reference
Environment Selection
| Flag | Short | Default | Description |
|---|---|---|---|
env_id_or_path | (positional) | — | Environment ID(s) or path to TOML config |
--env-args | -a | {} | JSON object passed to load_environment() |
--extra-env-kwargs | -x | {} | JSON object passed to environment constructor |
--env-dir-path | -p | ./environments | Base path for saving output files |
- Single environment: gsm8k — evaluates one environment
- TOML config path: configs/eval/benchmark.toml — evaluates multiple environments defined in the config file

Environment IDs are normalized to module names (e.g., my-env → my_env) and imported. Modules must be installed (via prime env install or uv pip install).
The --env-args flag passes arguments to your load_environment() function:
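For example (the keys shown are illustrative; valid keys depend on your load_environment() signature):

```bash
# Pass JSON keyword arguments through to load_environment()
prime eval gsm8k --env-args '{"num_train_examples": 1000, "split": "test"}'
```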
The --extra-env-kwargs flag passes arguments directly to the environment constructor, which is useful for overriding defaults like max_turns that may not be exposed via load_environment():
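A sketch, assuming the environment’s constructor accepts a max_turns parameter:

```bash
# Override a constructor default that load_environment() does not expose
prime eval my-env --extra-env-kwargs '{"max_turns": 10}'
```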
Model Configuration
| Flag | Short | Default | Description |
|---|---|---|---|
--model | -m | openai/gpt-4.1-mini | Model name or endpoint alias |
--api-base-url | -b | https://api.pinference.ai/api/v1 | API base URL |
--api-key-var | -k | PRIME_API_KEY | Environment variable containing API key |
--api-client-type | — | openai_chat_completions | Client type: openai_chat_completions, openai_completions, openai_chat_completions_token, or anthropic_messages |
--endpoints-path | -e | ./configs/endpoints.toml | Path to TOML endpoints registry |
--header | — | — | Extra HTTP header (Name: Value), repeatable |
You can register model endpoint aliases in ./configs/endpoints.toml to avoid repeating URL and key flags.
Each endpoint entry can set an api_client_type field to select the client implementation (defaults to "openai_chat_completions"). Use "anthropic_messages" for Anthropic models when calling the Anthropic API directly.
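A sketch of a registry entry. Aside from endpoint_id and api_client_type, which this section names, the field names below (model, url, key) and the values are assumptions for illustration and may differ in your installation:

```toml
# ./configs/endpoints.toml (field names other than endpoint_id and
# api_client_type are assumed for illustration)
[[endpoint]]
endpoint_id = "sonnet"                    # alias used with -m/--model
model = "claude-sonnet-4-5"               # illustrative model id
url = "https://api.anthropic.com/v1"      # illustrative base URL
key = "ANTHROPIC_API_KEY"                 # env var holding the API key
api_client_type = "anthropic_messages"
```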
To define equivalent replicas, add multiple [[endpoint]] entries with the same endpoint_id.
Then use the alias directly:
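Continuing the hypothetical registry above:

```bash
# -m resolves against the endpoint registry before being treated as a literal model id
prime eval gsm8k -m sonnet
```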
When the model name matches an alias in the registry, the registered endpoint supplies the base URL and key variable, taking precedence over --api-base-url and/or --api-key-var. If the model name isn’t found, the CLI flags are used (falling back to defaults when omitted).
In other words, -m/--model is treated as an endpoint alias lookup when present in the registry, and otherwise treated as a literal model id.
When using eval TOML configs, you can set endpoint_id in [[eval]] sections to resolve from the endpoint registry. endpoint_id is only supported when endpoints_path points to a TOML registry file.
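For example, continuing the hypothetical alias above:

```toml
[[eval]]
env_id = "gsm8k"
endpoint_id = "sonnet"   # resolved from the endpoints registry
```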
Sampling Parameters
| Flag | Short | Default | Description |
|---|---|---|---|
--max-tokens | -t | model default | Maximum tokens to generate |
--temperature | -T | model default | Sampling temperature |
--sampling-args | -S | — | JSON object for additional sampling parameters |
The --sampling-args flag accepts any parameters supported by the model’s API:
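For example (the specific parameters shown are illustrative and depend on the provider):

```bash
prime eval gsm8k -T 0.7 -t 2048 \
  --sampling-args '{"top_p": 0.95, "presence_penalty": 0.1}'
```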
Evaluation Scope
| Flag | Short | Default | Description |
|---|---|---|---|
--num-examples | -n | 5 | Number of dataset examples to evaluate |
--rollouts-per-example | -r | 3 | Rollouts per example (for pass@k, variance) |
The total number of rollouts is num_examples × rollouts_per_example.
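For example, this hypothetical run executes 20 × 4 = 80 rollouts:

```bash
prime eval gsm8k -n 20 -r 4
```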
Concurrency
| Flag | Short | Default | Description |
|---|---|---|---|
--max-concurrent | -c | 32 | Maximum concurrent requests |
--max-concurrent-generation | — | same as -c | Concurrent generation requests |
--max-concurrent-scoring | — | same as -c | Concurrent scoring requests |
--no-interleave-scoring | -N | false | Disable interleaved scoring |
--independent-scoring | -i | false | Score each rollout individually instead of by group |
--max-retries | — | 0 | Retries per rollout on transient InfraError |
By default, scoring is interleaved with generation; pass --no-interleave-scoring to score all rollouts after generation completes.
The --max-retries flag enables automatic retry with exponential backoff when rollouts fail due to transient infrastructure errors (e.g., sandbox timeouts, API failures).
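A hypothetical run that raises concurrency and enables retries:

```bash
# 64 concurrent requests; retry each failed rollout up to 3 times on transient errors
prime eval my-env -c 64 --max-retries 3
```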
Output and Saving
| Flag | Short | Default | Description |
|---|---|---|---|
--verbose | -v | false | Enable debug logging |
--tui | -u | false | Use alternate screen mode (TUI) for display |
--debug | -d | false | Disable Rich display; use normal logging and tqdm progress |
--save-results | -s | false | Save results to disk |
--resume [PATH] | -R | — | Resume from a previous run (auto-detect latest matching incomplete run if PATH omitted) |
--state-columns | -C | — | Extra state columns to save (comma-separated) |
--save-to-hf-hub | -H | false | Push results to Hugging Face Hub |
--hf-hub-dataset-name | -D | — | Dataset name for HF Hub |
--heartbeat-url | — | — | Heartbeat URL for uptime monitoring |
When --save-results is enabled, each run writes to ./outputs/evals/{env_id}--{model}/{run_id}/, containing:
- results.jsonl — rollout outputs, one per line
- metadata.json — evaluation configuration and aggregate metrics
Resuming Evaluations
Long-running evaluations can be interrupted and resumed using checkpointing. When --save-results is enabled, results are saved incrementally after each completed group of rollouts. Use --resume to continue from where you left off. Pass a path to resume a specific run, or omit the path to auto-detect the latest incomplete matching run.
Running with checkpoints:
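A sketch of an initial run with incremental saving (environment name and sizes are illustrative):

```bash
prime eval gsm8k -n 100 -r 3 -s
```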
With -s (save results) enabled, partial results are written to disk after each group completes. If the evaluation is interrupted, the output directory will contain all completed rollouts up until the interruption.
Resuming from a checkpoint:
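Either form below should work; the explicit run directory is illustrative:

```bash
# Auto-detect the latest matching incomplete run
prime eval gsm8k -n 100 -r 3 -s -R

# Or resume a specific run directory
prime eval gsm8k -n 100 -r 3 -s -R ./outputs/evals/gsm8k--gpt-4.1-mini/<run_id>
```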
A checkpoint consists of the run directory’s results.jsonl and metadata.json. With --resume and no path, verifiers scans the environment/model output directory and picks the most recent incomplete run matching env_id, model, and rollouts_per_example where the saved num_examples is less than or equal to the current run. When resuming:
- Existing completed rollouts are loaded from the checkpoint
- Remaining rollouts are computed based on the example ids and group size
- Only incomplete rollouts are executed
- New results are appended to the existing checkpoint
Changing --model, --env-args, or --rollouts-per-example can lead to undefined behavior. For reliable results, resume with the same configuration used to create the checkpoint, only increasing --num-examples if you need additional rollouts beyond the original target.
Example workflow:
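A hypothetical end-to-end workflow:

```bash
# 1. Start a long evaluation with incremental saving
prime eval my-env -n 500 -r 3 -s

# 2. The run is interrupted (Ctrl-C, crash, preemption, ...)

# 3. Resume the latest incomplete run with the same configuration
prime eval my-env -n 500 -r 3 -s -R
```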
The --state-columns flag allows saving environment-specific state fields that your environment stores during rollouts:
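For instance, assuming the environment records fields named plan and tool_calls in its rollout state (both column names are illustrative):

```bash
prime eval my-env -s -C plan,tool_calls
```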
Environment Defaults
Environments can specify default evaluation parameters in their pyproject.toml (see Developing Environments):
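A sketch of what such defaults might look like; the exact table name is defined by the environment tooling (see Developing Environments), and [tool.prime.eval] below is assumed purely for illustration:

```toml
# pyproject.toml (table name assumed for illustration)
[tool.prime.eval]
num_examples = 20
rollouts_per_example = 4
```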
Settings are applied with the following precedence (highest first):
- TOML per-environment settings (when using a config file)
- CLI flags
- Environment defaults (from pyproject.toml)
- Global defaults
Multi-Environment Evaluation
You can evaluate multiple environments using prime eval with a TOML configuration file. This is useful for running comprehensive benchmark suites.
TOML Configuration
For multi-environment evals or fine-grained control over settings, use a TOML configuration file. When using a config file, CLI arguments are ignored. Use [[eval]] sections to define each evaluation. You can also specify global defaults at the top:
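A sketch of such a config; the environment names, model, and values are illustrative, and whether global defaults are written as top-level keys is an assumption (per-eval field names match the table below):

```toml
# configs/eval/benchmark.toml
# Global defaults applied to every [[eval]] section (layout assumed)
model = "openai/gpt-4.1-mini"
num_examples = 20
rollouts_per_example = 3

[[eval]]
env_id = "gsm8k"

[[eval]]
env_id = "my-env"
num_examples = 50
env_args = { difficulty = "hard" }   # passed to load_environment()
```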
Each [[eval]] section must contain an env_id field. All other fields are optional:
| Field | Type | Description |
|---|---|---|
env_id | string | Required. Environment module name |
env_args | table | Arguments passed to load_environment() |
num_examples | integer | Number of dataset examples to evaluate |
rollouts_per_example | integer | Rollouts per example |
extra_env_kwargs | table | Arguments passed to environment constructor |
model | string | Model to evaluate |
endpoint_id | string | Endpoint registry id (requires TOML endpoints_path) |
Environment arguments are specified as TOML tables via env_args:
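For example (the keys under env_args are illustrative and depend on the environment’s load_environment() signature):

```toml
[[eval]]
env_id = "my-env"

# Sub-table attached to the [[eval]] entry above
[eval.env_args]
difficulty = "hard"
max_examples = 1000
```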
Configuration Precedence
When using a config file, CLI arguments are ignored. Settings are resolved as:
- TOML per-eval settings — Values specified in [[eval]] sections
- TOML global settings — Values at the top of the config file
- Environment defaults — Values from the environment’s pyproject.toml
- Built-in defaults — (num_examples=5, rollouts_per_example=3)
When running without a config file, settings are resolved as:
- CLI arguments — Flags passed on the command line
- Environment defaults — Values from the environment’s pyproject.toml
- Built-in defaults — (num_examples=5, rollouts_per_example=3)