This section explains how to run evaluations with Verifiers environments. See Environments for information on building your own environments.


Use `vf-eval` to execute rollouts against any OpenAI-compatible model and report aggregate metrics.

Basic Usage

Environments must be installed as Python packages before evaluation. From a local environment:
```bash
uv run vf-install my-env           # installs ./environments/my_env as a package
uv run vf-eval my-env -m gpt-4.1-mini -n 10
```
`vf-eval` imports the environment module using Python's import system, calls its `load_environment()` function, runs 10 examples with 3 rollouts each (the default), scores them using the environment's rubric, and prints aggregate metrics.
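For reference, the invocation above is equivalent to writing out the documented defaults explicitly (all flags are listed in the tables below):

```bash
# Same run with defaults spelled out: 3 rollouts per example,
# 32 concurrent requests, endpoints registry at ./configs/endpoints.py
uv run vf-eval my-env \
  -m gpt-4.1-mini \
  -n 10 \
  -r 3 \
  -c 32 \
  -e ./configs/endpoints.py
```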

Command Reference

Environment Selection

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `env_id` | (positional) | | Environment module name (e.g., `my-env` or `gsm8k`) |
| `--env-args` | `-a` | `{}` | JSON object passed to `load_environment()` |
| `--extra-env-kwargs` | `-x` | `{}` | JSON object passed to the environment constructor |
| `--env-dir-path` | `-p` | `./environments` | Base path for saving output files |
The `env_id` is converted to a Python module name (`my-env` → `my_env`) and imported. The module must be installed (via `vf-install` or `uv pip install`). The `--env-args` flag passes arguments to your `load_environment()` function:

```bash
vf-eval my-env -a '{"difficulty": "hard", "num_examples": 100}'
```

The `--extra-env-kwargs` flag passes arguments directly to the environment constructor, which is useful for overriding defaults such as `max_turns` that may not be exposed via `load_environment()`:

```bash
vf-eval my-env -x '{"max_turns": 20}'
```
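The two flags can be combined: `--env-args` feeds `load_environment()`, while `--extra-env-kwargs` bypasses it and goes to the constructor. The keys below are illustrative and depend on what your environment accepts:

```bash
# Illustrative keys: "difficulty" and "num_examples" are load_environment()
# arguments, "max_turns" goes straight to the environment constructor
vf-eval my-env \
  -a '{"difficulty": "hard", "num_examples": 100}' \
  -x '{"max_turns": 20}'
```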

Model Configuration

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--model` | `-m` | `gpt-4.1-mini` | Model name or endpoint alias |
| `--api-base-url` | `-b` | `https://api.openai.com/v1` | API base URL |
| `--api-key-var` | `-k` | `OPENAI_API_KEY` | Environment variable containing the API key |
| `--endpoints-path` | `-e` | `./configs/endpoints.py` | Path to the endpoints registry |
| `--header` | | | Extra HTTP header (`Name: Value`), repeatable |
For convenience, define model endpoints in `./configs/endpoints.py` to avoid repeating URL and key flags:

```python
ENDPOINTS = {
    "gpt-4.1-mini": {
        "model": "gpt-4.1-mini",
        "url": "https://api.openai.com/v1",
        "key": "OPENAI_API_KEY",
    },
    "qwen3-235b-i": {
        "model": "qwen/qwen3-235b-a22b-instruct-2507",
        "url": "https://api.pinference.ai/api/v1",
        "key": "PRIME_API_KEY",
    },
}
```
Then use the alias directly:

```bash
vf-eval my-env -m qwen3-235b-i
```

If the model name isn't found in the registry, the `--api-base-url` and `--api-key-var` flags are used instead.
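If an endpoint sits behind a proxy or gateway that requires extra request headers, pass `--header` once per header. The header names and values below are placeholders, not anything the tool requires:

```bash
# Hypothetical gateway headers; repeat --header for each one
vf-eval my-env \
  -m qwen3-235b-i \
  --header "X-Org-Id: my-org" \
  --header "X-Request-Source: vf-eval"
```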

Sampling Parameters

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--max-tokens` | `-t` | model default | Maximum tokens to generate |
| `--temperature` | `-T` | model default | Sampling temperature |
| `--sampling-args` | `-S` | | JSON object of additional sampling parameters |
The `--sampling-args` flag accepts any parameters supported by the model's API:

```bash
vf-eval my-env -S '{"temperature": 0.7, "top_p": 0.9}'
```
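The dedicated flags and the JSON object can be mixed; which extra parameters are accepted depends on the model's API:

```bash
# -T and -t set temperature and max tokens directly;
# -S carries any additional provider-specific sampling parameters
vf-eval my-env -T 0.7 -t 2048 -S '{"top_p": 0.9}'
```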

Evaluation Scope

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--num-examples` | `-n` | `5` | Number of dataset examples to evaluate |
| `--rollouts-per-example` | `-r` | `3` | Rollouts per example (for pass@k, variance) |
Multiple rollouts per example enable metrics like pass@k and help measure variance. The total number of rollouts is `num_examples × rollouts_per_example`.
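For example, a pass@5-style run over 20 examples:

```bash
# 20 examples x 5 rollouts each = 100 total rollouts
vf-eval my-env -n 20 -r 5
```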

Concurrency

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--max-concurrent` | `-c` | `32` | Maximum concurrent requests |
| `--max-concurrent-generation` | | same as `-c` | Concurrent generation requests |
| `--max-concurrent-scoring` | | same as `-c` | Concurrent scoring requests |
| `--no-interleave-scoring` | `-N` | `false` | Disable interleaved scoring |
By default, scoring runs interleaved with generation. Use `--no-interleave-scoring` to score all rollouts after generation completes.
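For example, to raise generation concurrency while capping scoring requests (say, if scoring calls a rate-limited judge) and to defer scoring until all rollouts finish:

```bash
# 64 concurrent requests for generation, at most 8 concurrent scoring
# requests, and scoring deferred until generation completes (-N)
vf-eval my-env -c 64 --max-concurrent-scoring 8 -N
```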

Output and Saving

| Flag | Short | Default | Description |
| --- | --- | --- | --- |
| `--verbose` | `-v` | `false` | Enable debug logging |
| `--save-results` | `-s` | `false` | Save results to disk |
| `--save-every` | `-f` | `-1` | Save a checkpoint every N rollouts |
| `--state-columns` | `-C` | | Extra state columns to save (comma-separated) |
| `--save-to-hf-hub` | `-H` | `false` | Push results to the Hugging Face Hub |
| `--hf-hub-dataset-name` | `-D` | | Dataset name for the HF Hub |
Results are saved to `./outputs/evals/{env_id}--{model}/` as a Hugging Face dataset. The `--state-columns` flag allows saving environment-specific state fields that your environment stores during rollouts:

```bash
vf-eval my-env -s -C "judge_response,parsed_answer"
```
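A fuller saving setup might checkpoint periodically and push the final results to the Hugging Face Hub; the dataset name below is a placeholder:

```bash
# Save locally, checkpoint every 50 rollouts, and push to the HF Hub
# under a hypothetical dataset name
vf-eval my-env -s -f 50 -H -D my-org/my-env-evals
```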

Environment Defaults

Environments can specify default evaluation parameters in their `pyproject.toml` (see Developing Environments):

```toml
[tool.verifiers.eval]
num_examples = 100
rollouts_per_example = 5
```
These defaults are used when flags aren’t explicitly provided. Priority order: CLI flags → environment defaults → global defaults.
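With the `pyproject.toml` above, an explicit flag on the command line wins, while any unspecified value still comes from the environment:

```bash
# -n overrides the environment default (num_examples becomes 10);
# rollouts_per_example still comes from pyproject.toml (5)
vf-eval my-env -n 10
```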