This section explains how to run evaluations with Verifiers environments. See Environments for information on building your own environments.

Use prime eval to execute rollouts against any OpenAI-compatible model and report aggregate metrics.

Basic Usage

Environments must be installed as Python packages before evaluation. From a local environment:
prime env install my-env           # installs ./environments/my_env as a package
prime eval run my-env -m gpt-4.1-mini -n 10
prime eval imports the environment module using Python's import system, calls its load_environment() function, runs the rollouts (here, 10 examples as requested by -n 10, with the default 3 rollouts each), scores them using the environment's rubric, and prints aggregate metrics.
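Written out with the scope flags explicit (see Evaluation Scope below), the command above is equivalent to:
prime eval run my-env -m gpt-4.1-mini -n 10 -r 3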

Command Reference

Environment Selection

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| env_id | (positional) | | Environment module name (e.g., my-env or gsm8k) |
| --env-args | -a | {} | JSON object passed to load_environment() |
| --extra-env-kwargs | -x | {} | JSON object passed to the environment constructor |
| --env-dir-path | -p | ./environments | Base path for saving output files |
The env_id is converted to a Python module name (my-env → my_env) and imported. The module must be installed (via prime env install, vf-install, or uv pip install). The --env-args flag passes arguments to your load_environment() function:
prime eval run my-env -a '{"difficulty": "hard", "num_examples": 100}'
The --extra-env-kwargs flag passes arguments directly to the environment constructor, which is useful for overriding defaults such as max_turns that may not be exposed via load_environment():
prime eval run my-env -x '{"max_turns": 20}'
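Both flags can be combined in a single invocation (the values here are illustrative):
prime eval run my-env -a '{"difficulty": "hard"}' -x '{"max_turns": 20}'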

Model Configuration

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --model | -m | openai/gpt-4.1-mini | Model name or endpoint alias |
| --api-base-url | -b | https://api.pinference.ai/api/v1 | API base URL |
| --api-key-var | -k | PRIME_API_KEY | Environment variable containing API key |
| --endpoints-path | -e | ./configs/endpoints.py | Path to endpoints registry |
| --header | | | Extra HTTP header (Name: Value), repeatable |
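As a sketch, these flags can point an evaluation at an explicit OpenAI-compatible endpoint without a registry entry (the header name and value are illustrative):
prime eval run my-env \
  -m gpt-4.1-mini \
  -b https://api.openai.com/v1 \
  -k OPENAI_API_KEY \
  --header "X-Request-Id: eval-run-01"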
For convenience, define model endpoints in ./configs/endpoints.py to avoid repeating URL and key flags:
ENDPOINTS = {
    "gpt-4.1-mini": {
        "model": "gpt-4.1-mini",
        "url": "https://api.openai.com/v1",
        "key": "OPENAI_API_KEY",
    },
    "qwen3-235b-i": {
        "model": "qwen/qwen3-235b-a22b-instruct-2507",
        "url": "https://api.pinference.ai/api/v1",
        "key": "PRIME_API_KEY",
    },
}
Then use the alias directly:
prime eval run my-env -m qwen3-235b-i
If the model name is in the registry, those values are used by default, but you can override them with --api-base-url and/or --api-key-var. If the model name isn’t found, the CLI flags are used (falling back to defaults when omitted).
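For example, a registry alias can be combined with an override when the same model is served from a different gateway (the URL and key variable below are hypothetical):
prime eval run my-env -m qwen3-235b-i -b https://my-gateway.example.com/v1 -k MY_GATEWAY_API_KEY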

Sampling Parameters

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --max-tokens | -t | model default | Maximum tokens to generate |
| --temperature | -T | model default | Sampling temperature |
| --sampling-args | -S | | JSON object for additional sampling parameters |
The --sampling-args flag accepts any parameters supported by the model’s API:
prime eval run my-env -S '{"temperature": 0.7, "top_p": 0.9}'
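The dedicated flags cover the two most common parameters directly (values are illustrative):
prime eval run my-env -t 2048 -T 0.2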

Evaluation Scope

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --num-examples | -n | 5 | Number of dataset examples to evaluate |
| --rollouts-per-example | -r | 3 | Rollouts per example (for pass@k, variance) |
Multiple rollouts per example enable metrics like pass@k and help measure variance. The total number of rollouts is num_examples × rollouts_per_example.
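For example, the following run produces 50 × 8 = 400 rollouts in total:
prime eval run my-env -n 50 -r 8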

Concurrency

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --max-concurrent | -c | 32 | Maximum concurrent requests |
| --max-concurrent-generation | | same as -c | Concurrent generation requests |
| --max-concurrent-scoring | | same as -c | Concurrent scoring requests |
| --no-interleave-scoring | -N | false | Disable interleaved scoring |
By default, scoring runs interleaved with generation. Use --no-interleave-scoring to score all rollouts after generation completes.
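For example, to raise overall concurrency while capping scoring concurrency separately and deferring scoring until generation finishes (values are illustrative):
prime eval run my-env -c 64 --max-concurrent-scoring 8 -N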

Output and Saving

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| --verbose | -v | false | Enable debug logging |
| --save-results | -s | false | Save results to disk |
| --save-every | -f | -1 | Save checkpoint every N rollouts |
| --state-columns | -C | | Extra state columns to save (comma-separated) |
| --save-to-hf-hub | -H | false | Push results to Hugging Face Hub |
| --hf-hub-dataset-name | -D | | Dataset name for HF Hub |
Results are saved to ./outputs/evals/{env_id}--{model}/ as a Hugging Face dataset. The --state-columns flag allows saving environment-specific state fields that your environment stores during rollouts:
prime eval run my-env -s -C "judge_response,parsed_answer"
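To also push the saved results to the Hugging Face Hub, combine the saving flags (the dataset name is hypothetical):
prime eval run my-env -s -H -D my-org/my-env-eval-results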

Environment Defaults

Environments can specify default evaluation parameters in their pyproject.toml (see Developing Environments):
[tool.verifiers.eval]
num_examples = 100
rollouts_per_example = 5
These defaults are used when flags aren’t explicitly provided. Priority order: CLI flags → environment defaults → global defaults.
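With the defaults above, a plain prime eval run my-env evaluates 100 examples with 5 rollouts each; an explicit flag still takes precedence:
prime eval run my-env -n 10   # overrides num_examples = 100; rollouts_per_example stays 5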