This section explains how to run evaluations with Verifiers environments. See Environments for information on building your own environments.

Use prime eval to execute rollouts against any supported model provider and report aggregate metrics. Supported providers include OpenAI-compatible APIs (the default) and the Anthropic Messages API (via --api-client-type anthropic_messages).

Basic Usage

Environments must be installed as Python packages before evaluation. From a local environment:
prime env install my-env           # installs ./environments/my_env as a package
prime eval run my-env -m gpt-4.1-mini -n 10
prime eval imports the environment module using Python’s import system, calls its load_environment() function, runs 10 examples (from -n 10) with 3 rollouts each (the default), scores them using the environment’s rubric, and prints aggregate metrics.

Command Reference

Environment Selection

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| env_id_or_path | (positional) | | Environment ID(s) or path to TOML config |
| --env-args | -a | {} | JSON object passed to load_environment() |
| --extra-env-kwargs | -x | {} | JSON object passed to environment constructor |
| --env-dir-path | -p | ./environments | Base path for saving output files |
The positional argument accepts two formats:
  • Single environment: gsm8k — evaluates one environment
  • TOML config path: configs/eval/benchmark.toml — evaluates multiple environments defined in the config file
Environment IDs are converted to Python module names (my-env → my_env) and imported. Modules must be installed (via prime env install or uv pip install). The --env-args flag passes arguments to your load_environment() function:
prime eval run my-env -a '{"difficulty": "hard", "num_examples": 100}'
The --extra-env-kwargs flag passes arguments directly to the environment constructor; this is useful for overriding defaults such as max_turns that may not be exposed via load_environment():
prime eval run my-env -x '{"max_turns": 20}'
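The two flags can be combined in a single invocation; the argument names below are illustrative and depend on what your environment actually accepts:
prime eval run my-env -a '{"difficulty": "hard"}' -x '{"max_turns": 20}'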

Model Configuration

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --model | -m | openai/gpt-4.1-mini | Model name or endpoint alias |
| --api-base-url | -b | https://api.pinference.ai/api/v1 | API base URL |
| --api-key-var | -k | PRIME_API_KEY | Environment variable containing API key |
| --api-client-type | | openai_chat_completions | Client type: openai_chat_completions, openai_completions, openai_chat_completions_token, or anthropic_messages |
| --endpoints-path | -e | ./configs/endpoints.toml | Path to TOML endpoints registry |
| --header | | | Extra HTTP header (Name: Value), repeatable |
For convenience, define model endpoints in ./configs/endpoints.toml to avoid repeating URL and key flags.
[[endpoint]]
endpoint_id = "gpt-4.1-mini"
model = "gpt-4.1-mini"
url = "https://api.openai.com/v1"
key = "OPENAI_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

[[endpoint]]
endpoint_id = "claude-sonnet"
model = "claude-sonnet-4-5-20250929"
url = "https://api.anthropic.com"
key = "ANTHROPIC_API_KEY"
api_client_type = "anthropic_messages"
Each endpoint entry supports an optional api_client_type field to select the client implementation (defaults to "openai_chat_completions"). Use "anthropic_messages" for Anthropic models when calling the Anthropic API directly. To define equivalent replicas, add multiple [[endpoint]] entries with the same endpoint_id. Then use the alias directly:
prime eval run my-env -m qwen3-235b-i
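For example, equivalent replicas for that alias could be declared as in the following sketch (the second URL and key are placeholders for another deployment of the same model):
[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://api.pinference.ai/api/v1"
key = "PRIME_API_KEY"

[[endpoint]]
endpoint_id = "qwen3-235b-i"
model = "qwen/qwen3-235b-a22b-instruct-2507"
url = "https://my-replica.example.com/v1"
key = "MY_REPLICA_API_KEY"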
If the model name is in the registry, those values are used by default, but you can override them with --api-base-url and/or --api-key-var. If the model name isn’t found, the CLI flags are used (falling back to defaults when omitted). In other words, -m/--model is treated as an endpoint alias lookup when present in the registry, and otherwise treated as a literal model id. When using eval TOML configs, you can set endpoint_id in [[eval]] sections to resolve from the endpoint registry. endpoint_id is only supported when endpoints_path points to a TOML registry file.
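As a sketch of the fallback path, a model id that is not in the registry can be passed literally together with explicit connection flags:
prime eval run my-env -m qwen/qwen3-235b-a22b-instruct-2507 -b https://api.pinference.ai/api/v1 -k PRIME_API_KEY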

Sampling Parameters

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --max-tokens | -t | model default | Maximum tokens to generate |
| --temperature | -T | model default | Sampling temperature |
| --sampling-args | -S | | JSON object for additional sampling parameters |
The --sampling-args flag accepts any parameters supported by the model’s API:
prime eval run my-env -S '{"temperature": 0.7, "top_p": 0.9}'
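The dedicated shortcuts cover the two most common parameters; the values below are illustrative:
prime eval run my-env -t 2048 -T 0.2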

Evaluation Scope

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --num-examples | -n | 5 | Number of dataset examples to evaluate |
| --rollouts-per-example | -r | 3 | Rollouts per example (for pass@k, variance) |
Multiple rollouts per example enable metrics like pass@k and help measure variance. The total number of rollouts is num_examples × rollouts_per_example.
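For example, a pass@8-style run over 20 examples issues 20 × 8 = 160 rollouts in total:
prime eval run my-env -n 20 -r 8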

Concurrency

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --max-concurrent | -c | 32 | Maximum concurrent requests |
| --max-concurrent-generation | | same as -c | Concurrent generation requests |
| --max-concurrent-scoring | | same as -c | Concurrent scoring requests |
| --no-interleave-scoring | -N | false | Disable interleaved scoring |
| --independent-scoring | -i | false | Score each rollout individually instead of by group |
| --max-retries | | 0 | Retries per rollout on transient InfraError |
By default, scoring runs interleaved with generation. Use --no-interleave-scoring to score all rollouts after generation completes. The --max-retries flag enables automatic retry with exponential backoff when rollouts fail due to transient infrastructure errors (e.g., sandbox timeouts, API failures).
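As an illustrative combination, assuming a scoring backend that tolerates less parallelism than generation:
prime eval run my-env -c 64 --max-concurrent-scoring 8 --max-retries 2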

Output and Saving

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --verbose | -v | false | Enable debug logging |
| --tui | -u | false | Use alternate screen mode (TUI) for display |
| --debug | -d | false | Disable Rich display; use normal logging and tqdm progress |
| --save-results | -s | false | Save results to disk |
| --resume [PATH] | -R | | Resume from a previous run (auto-detect latest matching incomplete run if PATH omitted) |
| --state-columns | -C | | Extra state columns to save (comma-separated) |
| --save-to-hf-hub | -H | false | Push results to Hugging Face Hub |
| --hf-hub-dataset-name | -D | | Dataset name for HF Hub |
| --heartbeat-url | | | Heartbeat URL for uptime monitoring |
Results are saved to ./outputs/evals/{env_id}--{model}/{run_id}/, containing:
  • results.jsonl — rollout outputs, one per line
  • metadata.json — evaluation configuration and aggregate metrics
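To also push saved results to the Hugging Face Hub, combine the saving flags; the dataset name below is a placeholder:
prime eval run my-env -s -H -D my-org/my-env-evals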

Resuming Evaluations

Long-running evaluations can be interrupted and resumed using checkpointing. When --save-results is enabled, results are saved incrementally after each completed group of rollouts. Use --resume to continue from where you left off. Pass a path to resume a specific run, or omit the path to auto-detect the latest incomplete matching run. Running with checkpoints:
prime eval run my-env -n 1000 -s
With -s (save results) enabled, partial results are written to disk after each group completes. If the evaluation is interrupted, the output directory will contain all completed rollouts up until the interruption. Resuming from a checkpoint:
prime eval run my-env -n 1000 -s --resume ./environments/my_env/outputs/evals/my-env--openai--gpt-4.1-mini/abc12345
When a resume path is provided, it must point to a valid evaluation results directory containing both results.jsonl and metadata.json. With --resume and no path, verifiers scans the environment/model output directory and picks the most recent incomplete run matching env_id, model, and rollouts_per_example where saved num_examples is less than or equal to the current run. When resuming:
  1. Existing completed rollouts are loaded from the checkpoint
  2. Remaining rollouts are computed based on the example ids and group size
  3. Only incomplete rollouts are executed
  4. New results are appended to the existing checkpoint
If all rollouts are already complete, the evaluation returns immediately with the existing results.

Configuration compatibility: when resuming, the current run configuration should match the original run. Mismatches in parameters like --model, --env-args, or --rollouts-per-example can lead to undefined behavior. For reliable results, resume with the same configuration used to create the checkpoint, only increasing --num-examples if you need additional rollouts beyond the original target.

Example workflow:
# Start a large evaluation with checkpointing
prime eval run math-python -n 500 -r 3 -s

# If interrupted, find the run directory
ls ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/

# Resume from the checkpoint
prime eval run math-python -n 500 -r 3 -s \
  --resume ./environments/math_python/outputs/evals/math-python--openai--gpt-4.1-mini/abc12345
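If you prefer auto-detection, the same resume can be expressed without a path (the latest matching incomplete run is picked, as described above):
# Resume the most recent matching incomplete run
prime eval run math-python -n 500 -r 3 -s --resume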
The --state-columns flag allows saving environment-specific state fields that your environment stores during rollouts:
prime eval run my-env -s -C "judge_response,parsed_answer"

Environment Defaults

Environments can specify default evaluation parameters in their pyproject.toml (see Developing Environments):
[tool.verifiers.eval]
num_examples = 100
rollouts_per_example = 5
These defaults are used when higher-priority sources don’t specify a value. The full priority order is:
  1. TOML per-environment settings (when using a config file; CLI flags are ignored in this mode)
  2. CLI flags (when running without a config file)
  3. Environment defaults (from pyproject.toml)
  4. Global (built-in) defaults
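For instance, with the pyproject.toml defaults above and no config file, CLI flags still take precedence for a quick smoke test:
prime eval run my-env -n 10 -r 1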
See Configuration Precedence for more details on multi-environment evaluation.

Multi-Environment Evaluation

You can evaluate multiple environments using prime eval with a TOML configuration file. This is useful for running comprehensive benchmark suites.

TOML Configuration

For multi-environment evals or fine-grained control over settings, use a TOML configuration file. When using a config file, CLI arguments are ignored.
prime eval run configs/eval/my-benchmark.toml
The TOML file uses [[eval]] sections to define each evaluation. You can also specify global defaults at the top:
# configs/eval/my-benchmark.toml

# Global defaults (optional)
model = "openai/gpt-4.1-mini"
num_examples = 50

[[eval]]
env_id = "gsm8k"
num_examples = 100  # overrides global default
rollouts_per_example = 5

[[eval]]
env_id = "alphabet-sort"
# Uses global num_examples (50)
rollouts_per_example = 3

[[eval]]
env_id = "math-python"
# Uses global defaults and built-in defaults for unspecified values
A minimal config requires only a single [[eval]] section:
[[eval]]
env_id = "gsm8k"
Each [[eval]] section must contain an env_id field. All other fields are optional:
| Field | Type | Description |
| --- | --- | --- |
| env_id | string | Required. Environment module name |
| env_args | table | Arguments passed to load_environment() |
| num_examples | integer | Number of dataset examples to evaluate |
| rollouts_per_example | integer | Rollouts per example |
| extra_env_kwargs | table | Arguments passed to environment constructor |
| model | string | Model to evaluate |
| endpoint_id | string | Endpoint registry id (requires TOML endpoints_path) |
Example with env_args:
[[eval]]
env_id = "math-python"
num_examples = 50

[eval.env_args]
difficulty = "hard"
split = "test"

Configuration Precedence

When using a config file, CLI arguments are ignored. Settings are resolved as:
  1. TOML per-eval settings — Values specified in [[eval]] sections
  2. TOML global settings — Values at the top of the config file
  3. Environment defaults — Values from the environment’s pyproject.toml
  4. Built-in defaults — (num_examples=5, rollouts_per_example=3)
When using CLI only (no config file), settings are resolved as:
  1. CLI arguments — Flags passed on the command line
  2. Environment defaults — Values from the environment’s pyproject.toml
  3. Built-in defaults — (num_examples=5, rollouts_per_example=3)