- Your dataset (`str` or `List[ChatMessage]`)
- Your rubric (one or more reward functions)
- Your interaction protocol, extended from `MultiTurnEnv`
## Core Concept: Interaction Protocols

Verifiers allows defining arbitrary interaction patterns between models and environments.

### Example Protocols
- Q&A Tasks: Single model response → evaluation
- Tool Use: Model request → tool execution → model continues
- Games: Model move → game state update → environment feedback → repeat
- Tutoring: Model attempt → hint/correction → retry until correct
- Debate: Model A argument → Model B rebuttal → judge evaluation
## Environment Types

### MultiTurnEnv: Maximum Flexibility

The base class for custom interaction protocols.
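A minimal sketch of a subclass, assuming `env_response` and `is_completed` are the override points (hook signatures may differ across versions):

```python
import verifiers as vf

class GuessingEnv(vf.MultiTurnEnv):
    """Toy protocol: the environment replies after each model turn."""

    async def is_completed(self, messages, state, **kwargs) -> bool:
        # End the rollout once the model says "done" (illustrative condition).
        return "done" in messages[-1]["content"].lower()

    async def env_response(self, messages, state, **kwargs):
        # Return the environment's next message(s) plus updated state.
        reply = [{"role": "user", "content": "Keep going, or say 'done'."}]
        return reply, state
```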
### ToolEnv: Native Tool Calling

Leverages models' built-in tool calling for agentic workflows. For tools that need per-rollout state, use `StatefulToolEnv` and override `update_tool_args` instead of relying on global state.
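A sketch of a tool-calling setup, assuming `dataset` and `rubric` are defined as in the other examples on this page (`max_turns` is an assumed parameter name):

```python
import verifiers as vf

def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

# Tools are plain Python functions; type hints and docstrings
# let the environment expose them to the model.
env = vf.ToolEnv(
    dataset=dataset,
    tools=[word_count],
    rubric=rubric,
    max_turns=4,
)
```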
### SingleTurnEnv: Simple Evaluation
For straightforward Q&A tasks without interaction:

```python
env = vf.SingleTurnEnv(
    dataset=dataset,
    system_prompt="Answer the question.",
    rubric=rubric,
)
```

## Key Components
### Rubrics: Multi-Criteria Evaluation
Rubrics define how to evaluate model responses by combining multiple criteria (see the sketch after this list):

- Single criterion: One reward function (e.g., exact match)
- Multi-criteria: Weighted combination of multiple aspects
- Judge-based: Using LLMs to evaluate quality
- Stateful: Tracking patterns across interactions
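For example, a weighted multi-criteria rubric might look like this (the reward-function names and criteria are illustrative):

```python
import verifiers as vf

def exact_match(completion, answer, **kwargs) -> float:
    # 1.0 if the final message matches the reference answer exactly.
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return 1.0 if text.strip() == answer.strip() else 0.0

def brevity(completion, **kwargs) -> float:
    # Mild preference for short answers (illustrative criterion).
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return 1.0 if len(text) < 200 else 0.0

rubric = vf.Rubric(funcs=[exact_match, brevity], weights=[1.0, 0.2])
```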
### Environment Modules
Package your interaction protocol as a reusable module (layout sketched below):

- Easy sharing and versioning
- Dependency isolation
- Standardized interfaces
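A typical module exposes a `load_environment` entry point; this is a sketch of the convention, and the file layout may vary:

```python
# my_env/my_env.py
import verifiers as vf
from datasets import Dataset

def contains_answer(completion, answer, **kwargs) -> float:
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return 1.0 if answer in text else 0.0

def load_environment(**kwargs) -> vf.Environment:
    """Entry point called when the module is loaded."""
    dataset = Dataset.from_list([{"question": "2 + 2 = ?", "answer": "4"}])
    rubric = vf.Rubric(funcs=[contains_answer])
    return vf.SingleTurnEnv(dataset=dataset, rubric=rubric, **kwargs)
```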
### State Management

Environments maintain state throughout interactions in a per-rollout `state` dict, which persists across turns and is available to reward functions.
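For instance, a custom `env_response` might track progress in `state` (the `"hints_given"` key is illustrative, not a built-in field):

```python
import verifiers as vf

class HintingEnv(vf.MultiTurnEnv):
    async def env_response(self, messages, state, **kwargs):
        # Count how many hints we've given in per-rollout state.
        state["hints_given"] = state.get("hints_given", 0) + 1
        reply = [{"role": "user",
                  "content": f"Hint #{state['hints_given']}: check your arithmetic."}]
        return reply, state
```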
## Design Philosophy

### 1. Protocol-First Design

Start by defining your interaction pattern:

- When should the environment respond?
- What information should it provide?
- How should the conversation end?
### 2. Composable Evaluation

Build complex evaluation from simple parts:

- Individual reward functions for specific criteria
- Rubrics to combine and weight them
- Environments to orchestrate the process
### 3. OpenAI-Compatible Integration

Works with any OpenAI-compatible API:
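A sketch of wiring up a client; the URL, key, model name, and keyword arguments below are placeholders for a local vLLM-style server, and `env` is assumed from the examples above:

```python
from openai import AsyncOpenAI

# Any OpenAI-compatible endpoint works; this one is local.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
results = env.evaluate_sync(client, model="my-model", num_examples=5)
```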
## Data Flow

- Dataset provides prompts and ground truth
- Environment orchestrates the interaction protocol
- Model generates responses via OpenAI-compatible client
- Rubric evaluates quality through reward functions
- Results include full interaction traces and scores
## Evaluation Lifecycle

- Inputs expected by environments:
  - `prompt`: `str` or `list[ChatMessage]` (chat-style). If you use `question` in your dataset, environments will turn it into a chat message, adding `system_prompt`/`few_shot` if provided.
  - `answer` or `info`: optional. `answer` is a string; `info` is a dict for richer metadata. Both can be omitted for environments that evaluate based solely on completion quality (e.g., format adherence, length constraints, style assessment). A minimal dataset example follows this list.
  - `task`: optional string used by `EnvGroup`/`RubricGroup` to route behavior.
- Running evaluation:
  - `rollouts_per_example > 1` repeats dataset entries internally.
  - `max_concurrent` throttles concurrent rollouts.
  - `save_every` (when > 0) checkpoints intermediate progress during interleaved rollouts (set `interleave_scoring=True`).
- Scoring:
  - Each reward function returns a float. Weights applied inside `Rubric` combine them into `results.reward`.
  - All individual scores are logged under `results.metrics`, keyed by function name (even if weight is 0.0).
- Outputs (`GenerateOutputs`): `prompt`, `completion`, `answer`, `state`, `info`, `task`, `id`, `reward`, `metrics: dict[str, list[float]]`, plus a `metadata` block summarizing the run.
- Message types: `message_type="chat"` (default) expects chat messages; `"completion"` expects raw text continuation. Choose based on your task (e.g., continuation quality uses completion).
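As a concrete illustration of those input columns, here is a minimal chat-style dataset (all values are placeholders):

```python
from datasets import Dataset

# Illustrative rows: `question` is auto-wrapped into a chat prompt;
# `answer` is a plain string; `info` carries optional metadata.
dataset = Dataset.from_list([
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Name the capital of France.", "answer": "Paris",
     "info": {"difficulty": "easy"}},
])
```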
## Optional Utilities

### Parsers

For extracting structured information when needed:

- `XMLParser`: Extract XML-tagged fields
- `ThinkParser`: Separate reasoning from answers
- Custom parsers for domain-specific formats
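A sketch of `XMLParser` usage, assuming the field names below and that `parse` returns an object exposing each field as an attribute (details may differ by version):

```python
import verifiers as vf

# Field names here are chosen for this example, not a fixed schema.
parser = vf.XMLParser(fields=["think", "answer"])
parsed = parser.parse("<think>2 + 2 = 4</think>\n<answer>4</answer>")
print(parsed.answer)  # -> "4"
```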
## Integration Points

### For Evaluation
The most convenient way to run quick evaluations is via the `vf-eval` CLI tool, which can also save (`-s`) eval results:
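An illustrative invocation (the environment id and model are placeholders; run `vf-eval --help` for the full flag list):

```bash
vf-eval my-env -m gpt-4.1-mini -n 10 -s
```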
`env.evaluate` is async: wrap it with `asyncio.run(...)` (as below) or call `env.evaluate_sync` when you must stay in synchronous code.
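A sketch of a programmatic run; the environment id, model name, and keyword arguments are assumptions:

```python
import asyncio
from openai import AsyncOpenAI
import verifiers as vf

env = vf.load_environment("my-env")  # hypothetical installed environment
client = AsyncOpenAI()

# evaluate() is a coroutine, so drive it with asyncio.run().
results = asyncio.run(env.evaluate(client, model="gpt-4.1-mini", num_examples=10))
print(results.reward)
```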
### For Training

Run RL training via `vf-rl`, using a single TOML file to configure the model, environment, inference, and trainer. See the training guide for a minimal example.
### For Custom Workflows

All components can be used independently:
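For instance, reward functions are plain callables, so you can exercise them directly without an environment or client (a trivial illustrative example):

```python
def exact_match(completion, answer, **kwargs) -> float:
    # Accept either a raw string or a chat-style message list.
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return 1.0 if text.strip() == answer.strip() else 0.0

print(exact_match(completion="4", answer="4"))  # -> 1.0
```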
## Next Steps

- To create custom interactions, see Environments
- For advanced component usage and examples, see Components
- To train models with your environments, see Training