Documentation Index

Fetch the complete documentation index at: https://docs.primeintellect.ai/llms.txt

Use this file to discover all available pages before exploring further.

This section covers how to use Verifiers environments for RL training with our Hosted Training platform, our open-source prime-rl trainer, or other supported libraries.

Hosted Training

Hosted Training, available within our Lab platform, enables you to automatically train models via prime-rl without needing to manage your own infrastructure. Hosted Training supports LoRA for RL training, and can be used with any environment built with Verifiers.

Configuration

Use the prime lab setup script to download example configuration files for Hosted Training into your workspace:
prime lab setup
This downloads example TOML configs for Hosted Training into configs/rl/ and example eval configs into configs/eval/, along with configs/endpoints.toml and GEPA starter configs in configs/gepa/:
configs/
├── endpoints.toml
├── eval/
│   ├── qwen-3-5.toml
│   ├── qwen-3-5-moe.toml
│   ├── nemotron-3.toml
│   └── llama-3.toml
├── rl/
│   ├── qwen-3-5.toml
│   ├── qwen-3-5-moe.toml
│   ├── nemotron-3.toml
│   └── llama-3.toml
└── gepa/
    ├── qwen-3-5.toml
    ├── qwen-3-5-moe.toml
    ├── nemotron-3.toml
    └── llama-3.toml
Example configuration file for the primeintellect/reverse-text environment with Qwen/Qwen3.5-4B:
# Qwen3.5 dense models. Uncomment exactly one model.
# model = "Qwen/Qwen3.5-0.8B"
# model = "Qwen/Qwen3.5-2B"
model = "Qwen/Qwen3.5-4B"
# model = "Qwen/Qwen3.5-9B"

max_steps = 100
batch_size = 128
rollouts_per_example = 8

[sampling]
max_tokens = 1024

[[env]]
id = "primeintellect/reverse-text"
For v1 BYO Harness environments, place taskset and harness configuration under the [env.taskset] and [env.harness] tables:
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 100
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 4096

[[env]]
id = "primeintellect/my-v1-env"

[env.args]
arg1 = "non-th-arg"

[env.harness]
max_turns = 8

[env.taskset.toolsets.search]
tools = ["my_env.tools:search"]
bindings = { "search.index" = "objects.index" }

[[env.taskset.rewards]]
fn = "my_env.signals:exact_answer"
weight = 1.0
See BYO Harness for the matching eval config shape and v1 callable/toolset patterns.

We currently support the following models for Hosted Training:
  • Qwen/Qwen3-30B-A3B-Instruct-2507
  • Qwen/Qwen3-30B-A3B-Thinking-2507
  • Qwen/Qwen3-4B-Instruct-2507
  • Qwen/Qwen3-4B-Thinking-2507
  • Qwen/Qwen3-VL-4B-Instruct
  • Qwen/Qwen3.5-0.8B
  • Qwen/Qwen3.5-2B
  • Qwen/Qwen3.5-4B
  • Qwen/Qwen3.5-9B
  • Qwen/Qwen3.5-35B-A3B
  • Qwen/Qwen3.5-122B-A10B
  • Qwen/Qwen3.5-397B-A17B
  • meta-llama/Llama-3.2-1B-Instruct
  • meta-llama/Llama-3.2-3B-Instruct
  • nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  • nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
  • openai/gpt-oss-20b
  • openai/gpt-oss-120b
  • zai-org/GLM-4.7
Hosted Training is currently in Private Beta. For access, please fill out this form.

Training with prime-rl

Our prime-rl trainer is a production-ready async RL training framework that supports large-scale multi-node training, agentic rollouts with Verifiers environments, Mixture-of-Experts (MoE) models, LoRA adapters, and other training algorithms such as SFT and online distillation. We recommend using prime-rl for training with Verifiers environments on self-managed GPU infrastructure. The default configuration distills the best practices from our research team’s experience and the broader community into a stable, easy-to-use recipe, including advanced features such as online difficulty filtering, continuous batching, in-flight weight updates, importance sampling and logprob clipping for stability, and more.
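To give a feel for two of the stability features mentioned above, here is a toy per-token importance-sampling weight with clipping. This is a minimal sketch in plain Python, not prime-rl's actual implementation; the function name and clip range are illustrative:

```python
import math

def clipped_is_weight(logprob_new, logprob_old, clip_range=0.2):
    """Toy per-token importance-sampling weight with clipping.

    logprob_new: log-prob of the sampled token under the current policy.
    logprob_old: log-prob recorded at rollout time (the behavior policy).
    The ratio exp(new - old) corrects for the policy having moved since
    the rollout was generated; clipping bounds the size of the update.
    """
    ratio = math.exp(logprob_new - logprob_old)
    return max(min(ratio, 1.0 + clip_range), 1.0 - clip_range)

# If the policy has not moved, the weight is exactly 1.
assert clipped_is_weight(-1.5, -1.5) == 1.0
# Large shifts are clipped into the trust region [0.8, 1.2].
assert clipped_is_weight(-0.5, -3.0) == 1.2
assert clipped_is_weight(-3.0, -0.5) == 0.8
```

In an async setup, rollouts are generated by a slightly stale policy, which is exactly when this correction matters.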

Setup and Configuration

To set up your workspace for training with prime-rl, run:
prime lab setup --prime-rl
This clones and installs the prime-rl trainer and its dependencies. For configuration files and launch commands, see the prime-rl documentation.

Prompt Optimization with prime gepa run

prime gepa run is the CLI entrypoint for automatic system prompt optimization using GEPA (Genetic-Pareto prompt optimization). It iteratively refines your environment’s system prompt using a teacher LLM to reflect on evaluation results, without requiring gradient-based training. Current support is for system prompt optimization only.

Usage

Basic usage mirrors prime eval run:
prime gepa run wiki-search --model google/gemini-3-flash-preview
This will optimize the system prompt for the wiki-search environment using the specified model for both evaluation rollouts and reflection. Results are saved to environments/wiki-search/outputs/gepa/. Key options:
  • --model / -m: Model for evaluation rollouts
  • --reflection-model / -M: Teacher model for prompt reflection (defaults to --model)
  • --max-calls / -B: Evaluation budget (default: 500)
  • --num-train / -n: Training examples (default: 100)
  • --num-val / -N: Validation examples (default: 50)
  • --minibatch-size: Number of examples evaluated together per reflection step (default: 3)
  • --perfect-score: Maximum achievable rollout score in your environment, if known; minibatches that reach this score are skipped during reflection, since there is nothing left to improve
  • --state-columns: Additional state columns to copy into the reflection dataset. By default, query, completion, expected_answer, reward, and error are included. Use this to add environment-specific state fields (e.g., --state-columns tool_calls reasoning_trace)
In TOML configs, set GEPA parameters such as max_calls, num_train, num_val, minibatch_size, and max_concurrent under [gepa]. Put generation parameters such as max_tokens and temperature under [sampling]; the CLI passes that table through as sampling_args. Use [[env]] for one or more environments; GEPA samples train and validation examples uniformly by environment. A single [env] table is still accepted for older configs.
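A GEPA config following this layout might look like the fragment below. All values and the environment id are illustrative, and the exact set of supported keys may differ; treat this as a sketch of the table structure, not a reference config:

```toml
model = "Qwen/Qwen3.5-4B"

[gepa]
max_calls = 500
num_train = 100
num_val = 50
minibatch_size = 3
max_concurrent = 16

[sampling]
max_tokens = 1024
temperature = 1.0

[[env]]
id = "primeintellect/wiki-search"
```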

Output

After optimization, you’ll find:
  • system_prompt.txt - The optimized system prompt. Load it with vf.SystemMessage.from_path("/path/to/system_prompt.txt").
  • results.jsonl - Candidate prompt rows for evaluation upload; GEPA-specific fields live under info.
  • pareto_frontier.jsonl - Best candidate references per validation example
  • metadata.json - Run configuration and summary
Use prime eval run to verify performance before and after optimization.
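Because results.jsonl and pareto_frontier.jsonl are plain JSONL, candidates are easy to inspect with a few lines of Python. The rows below are fabricated for illustration; beyond reward and the info nesting described above, the exact field names are assumptions:

```python
import json

# Illustrative rows in the shape described above: one JSON object per
# line, with GEPA-specific fields nested under "info" (keys may differ).
sample = "\n".join(json.dumps(r) for r in [
    {"reward": 0.42, "info": {"system_prompt": "You are a careful searcher."}},
    {"reward": 0.61, "info": {"system_prompt": "Cite sources for every claim."}},
])

rows = [json.loads(line) for line in sample.splitlines()]
best = max(rows, key=lambda r: r["reward"])
print(best["info"]["system_prompt"])  # highest-reward candidate prompt
```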

RL Rules of Thumb

RL training can be sensitive to implementation details and hyperparameters. Some simple practical guidance:

Before Training

  1. Evaluate baseline performance: If your model gets 0% reward after 10+ attempts, the task is too hard
  2. Check task difficulty: If baseline is already 80%+, consider harder examples
  3. Ensure reward diversity: You want varied scores within each generation group

Performance Trade-offs

For more aggressive training (higher risk of collapse):
  • Increase learning rate (1e-5 to 1e-4 for LoRA, 1e-6 to 1e-5 for full finetuning)
  • Decrease rollouts_per_example and batch_size for faster generation
For more stable training (slower progress):
  • Increase rollouts_per_example (16-32)
  • Increase batch_size (512-1024)
  • Use larger models (14B+)
The best way to improve training is to ensure appropriate task difficulty for your model. When using Hosted Training or prime-rl, you can enable online difficulty filtering to ensure that rollout groups used for training always contain a diversity of rewards.
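The idea behind online difficulty filtering can be sketched in a few lines: drop rollout groups whose rewards are all identical (all-fail or all-pass), since a group without reward variance carries no signal for group-relative advantages. This is a toy sketch, not the actual prime-rl implementation:

```python
def has_reward_diversity(rewards):
    """A group is useful for training only if its rollouts disagree."""
    return len(set(rewards)) > 1

def filter_groups(groups):
    """Keep only rollout groups with at least two distinct rewards."""
    return [g for g in groups if has_reward_diversity(g)]

groups = [
    [0.0, 0.0, 0.0, 0.0],  # too hard: every rollout failed
    [1.0, 1.0, 1.0, 1.0],  # too easy: every rollout succeeded
    [0.0, 1.0, 0.0, 1.0],  # informative: mixed outcomes
]
assert filter_groups(groups) == [[0.0, 1.0, 0.0, 1.0]]
```

In an online setting, filtered-out groups are discarded and replaced with fresh rollouts so that every training batch contains informative groups.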

Inference Client Types

The rollout client’s client_type controls how prompt assembly and token state flow between the inference server and the trainer. For RL the trainer must see the exact tokens the server sampled — re-tokenization across turns drifts under BPE round-trip and fragments multi-turn rollouts into multiple training samples.
  • openai_chat_completions (MITO, messages-in): standard OpenAI-compatible path. Server-side chat templating, returns text. The trainer re-tokenizes — fine for eval and short single-turn training, but can fragment multi-turn rollouts.
  • openai_chat_completions_token (TITO, token-in): server-side templating, but returns prompt and completion token IDs alongside text so the trainer doesn’t re-tokenize. Use when you trust the server’s chat template to be stable across turns.
  • renderer (experimental): client-side tokenization via a per-model renderer in the renderers package. Install it with uv add "verifiers[renderers]" before using client_type="renderer". The trainer renders messages to token IDs locally and sends those to vLLM’s /v1/generate endpoint. The renderer’s bridge_to_next_turn extends prior-turn tokens verbatim across multi-turn boundaries (the extension property) and synthesizes the canonical turn-close on mid-completion truncation, so multi-turn rollouts merge into one training sample with one clean loss mask.
For production RL training, use openai_chat_completions_token — it’s the tried-and-tested path with broad model coverage. The renderer client is newer and offers stronger token-preservation guarantees in theory, but is experimental: hand-coded renderers exist only for a subset of models, and corner cases are still being shaken out. See reference § Built-in Clients for the full list.
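The re-tokenization drift described above can be demonstrated with a toy greedy-merge tokenizer (purely illustrative; real BPE vocabularies are learned, not hand-written). A server may sample "a" and "b" as two separate tokens, but once that text is decoded and re-encoded, a greedy tokenizer prefers the merged "ab" token, so the trainer would see different token IDs than the ones actually sampled:

```python
# Toy vocabulary: "ab" has its own merged token alongside "a" and "b".
VOCAB = {"ab": 0, "a": 1, "b": 2}
DECODE = {v: k for k, v in VOCAB.items()}

def encode(text):
    """Greedy longest-match tokenization over the toy vocab."""
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(VOCAB[piece])
                i += len(piece)
                break
    return ids

# The server sampled "a" then "b" as two separate tokens...
sampled_ids = [VOCAB["a"], VOCAB["b"]]
text = "".join(DECODE[i] for i in sampled_ids)

# ...but re-encoding the decoded text yields the merged token instead.
assert encode(text) == [VOCAB["ab"]]
assert encode(text) != sampled_ids
```

This is why the token-preserving paths (openai_chat_completions_token and renderer) pass token IDs through rather than round-tripping text.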

Common Issues

Non-Increasing Chat Templates: The Qwen3 and DeepSeek-R1 model series both remove <think> sections from messages when processing inputs, which violates the increasing-context requirement for multi-turn training. We provide versions of many of these models with modified chat templates here.
OOM during generation:
  • Reduce rollouts_per_example or micro_batch_size
  • Use LoRA instead of full finetuning
  • Check vLLM server has sufficient memory
Training instability:
  • Decrease learning rate
  • Increase rollouts_per_example
  • Increase batch_size
Slow training:
  • Increase learning rate
  • Leverage continuous rewards
  • Use online difficulty filtering
  • Calibrate task difficulty appropriately, e.g. by using a smarter model or easier tasks

Other Trainers

verifiers is intended to be largely trainer-agnostic and is straightforward to integrate with any trainer that can expose an OpenAI-compatible inference client for rollouts.

vf.RLTrainer (Legacy)

The legacy vf.RLTrainer still exists for educational and experimental purposes via the optional verifiers-rl package and the legacy RL CLI entrypoint, but it is not actively maintained. It is a compact single-node async RL trainer with a narrower feature set than production trainers. Its core implementation (trainer.py and orchestrator.py under packages/verifiers-rl/verifiers_rl/rl/trainer/) remains intentionally lightweight for algorithm experimentation. For production training and current guidance, use prime-rl.

Tinker

Tinker supports Verifiers environments via the tinker-cookbook recipes.

SkyRL

SkyRL supports Verifiers environments via its skyrl-train integration.

rLLM

rLLM supports Verifiers environments with both verl (local GPU) and Tinker (remote GPU) backends.