Models can be trained on environments built with Verifiers via Hosted Training, the prime-rl trainer, or other supported libraries.
Table of Contents
- Hosted Training
- Training with prime-rl
- Prompt Optimization with prime gepa run
- RL Rules of Thumb
- Other Trainers
Hosted Training
Hosted Training, available within our Lab platform, enables you to automatically train models via prime-rl without needing to manage your own infrastructure. Hosted Training supports LoRA for RL training and can be used with any environment built with Verifiers.
Configuration
Use the prime lab setup script to download example configuration files for Hosted Training into your workspace:
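A minimal sketch of the invocation, assuming the setup script is invoked directly as a prime CLI subcommand with no required arguments:

```bash
# Download example Hosted Training, eval, and GEPA configs into the current workspace
# (invocation assumed; check the CLI help for the exact syntax)
prime lab setup
```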
This downloads example RL configs into configs/rl/ and example eval configs into configs/eval/, along with configs/endpoints.toml and GEPA starter configs in configs/gepa/.
The example RL config pairs the primeintellect/alphabet-sort environment with Qwen/Qwen3-30B-A3B-Instruct-2507.
Hosted Training currently supports the following models:
- Qwen/Qwen3-4B-Instruct-2507
- Qwen/Qwen3-4B-Thinking-2507
- Qwen/Qwen3-30B-Instruct-2507
- Qwen/Qwen3-30B-Thinking-2507
- Qwen/Qwen3-235B-Instruct-2507
- Qwen/Qwen3-235B-Thinking-2507
- PrimeIntellect/INTELLECT-3
Training with prime-rl
Our prime-rl trainer is a production-ready async RL training framework that supports large-scale multi-node training, agentic rollouts with Verifiers environments, Mixture-of-Experts (MoE) models, LoRA adapters, and other training algorithms such as SFT and online distillation. We recommend using prime-rl for training with Verifiers environments on self-managed GPU infrastructure. The default configuration distills the best practices from our research team’s experience and the broader community into a stable, easy-to-use recipe, including advanced features such as online difficulty filtering, continuous batching, in-flight weight updates, importance sampling and logprob clipping for stability, and more.
Setup and Configuration
To set up your workspace for training with prime-rl, run the workspace setup command. This installs the prime-rl trainer and its dependencies, and sets up a default TOML config for training with the included wiki-search environment on 8 GPUs.
Then, you can launch a training run with prime-rl using the generated config.
Prompt Optimization with prime gepa run
prime gepa run is the CLI entrypoint for automatic system prompt optimization using GEPA (Genetic-Pareto prompt optimization). It iteratively refines your environment’s system prompt using a teacher LLM to reflect on evaluation results, without requiring gradient-based training. Current support is for system prompt optimization only.
Usage
Basic usage mirrors prime eval run:
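A minimal sketch, assuming the environment id is passed positionally as with prime eval run (check the CLI help for the exact syntax):

```bash
# Sketch: optimize the wiki-search environment's system prompt,
# using the same model for evaluation rollouts and reflection.
# The positional environment argument is assumed to mirror `prime eval run`.
prime gepa run wiki-search -m Qwen/Qwen3-30B-A3B-Instruct-2507
```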
This optimizes the system prompt for the wiki-search environment, using the specified model for both evaluation rollouts and reflection. Results are saved to environments/wiki-search/outputs/gepa/.
Key options:
- --model/-m: Model for evaluation rollouts
- --reflection-model/-M: Teacher model for prompt reflection (defaults to --model)
- --max-calls/-B: Evaluation budget (default: 500)
- --num-train/-n: Training examples (default: 100)
- --num-val/-N: Validation examples (default: 50)
- --minibatch-size: Number of examples evaluated together per reflection step (default: 3)
- --perfect-score: Maximum score for a rollout in your environment (if applicable); minibatches achieving this score are skipped during reflection (useful if your environment has a known max score)
- --state-columns: Additional state columns to copy into the reflection dataset. By default, query, completion, expected_answer, reward, and error are included. Use this to add environment-specific state fields (e.g., --state-columns tool_calls reasoning_trace)
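As an illustration, a run combining several of these options might look like the sketch below. The flag names are taken from the list above; the positional environment argument and the specific model and budget values are assumptions:

```bash
# Sketch: smaller model for rollouts, stronger teacher for reflection,
# reduced evaluation budget, and skip reflection on already-perfect minibatches.
# Positional environment argument and values are assumed for illustration.
prime gepa run wiki-search \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  -M Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --max-calls 300 \
  --num-train 100 \
  --num-val 50 \
  --perfect-score 1.0  # assumes rewards in this environment are capped at 1.0
```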
Output
After optimization, you’ll find:
- best_prompt.txt - The optimized system prompt
- pareto_frontier.jsonl - Best prompts per validation example
- metadata.json - Run configuration and summary
Use prime eval run to verify performance before and after optimization.
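For instance, you might inspect the optimized prompt and then re-run an evaluation to compare against the pre-optimization baseline. This is a sketch: the -m flag for prime eval run is assumed to mirror prime gepa run.

```bash
# Inspect the optimized system prompt produced by GEPA
cat environments/wiki-search/outputs/gepa/best_prompt.txt

# After updating your environment's system prompt with its contents, re-run an eval
# and compare scores against the pre-optimization baseline
# (assumes prime eval run accepts -m for the model, mirroring prime gepa run)
prime eval run wiki-search -m Qwen/Qwen3-4B-Instruct-2507
```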
RL Rules of Thumb
RL training can be sensitive to implementation details and hyperparameters. Some simple practical guidance:
Before Training
- Evaluate baseline performance: If your model gets 0% reward after 10+ attempts, the task is too hard
- Check task difficulty: If baseline is already 80%+, consider harder examples
- Ensure reward diversity: You want varied scores within each generation group
Performance Trade-offs
For more aggressive training (higher risk of collapse):
- Increase learning rate (1e-5 to 1e-4 for LoRA, 1e-6 to 1e-5 for full finetuning)
- Decrease rollouts_per_example and batch_size for faster generation

For more conservative training (greater stability):
- Increase rollouts_per_example (16-32)
- Increase batch_size (512-1024)
- Use larger models (14B+)
When training with prime-rl, you can enable online difficulty filtering to ensure that rollout groups used for training always contain a diversity of rewards.
Common Issues
Non-Increasing Chat Templates: The Qwen3 and DeepSeek-R1 model series both remove <think> sections from messages when processing inputs, which violates the increasing context requirement for multi-turn training. We provide versions of many of these models with modified chat templates here.
OOM during generation:
- Reduce rollouts_per_example or micro_batch_size
- Use LoRA instead of full finetuning
- Check vLLM server has sufficient memory
Training instability or reward collapse:
- Decrease learning rate
- Increase rollouts_per_example
- Increase batch_size

Slow or no learning progress:
- Increase learning rate
- Leverage continuous rewards
- Use online difficulty filtering
- Calibrate difficulty appropriately via smarter models, easier tasks
Other Trainers
verifiers is intended to be largely trainer-agnostic, and is straightforward to support in any trainer that can expose an OpenAI-compatible inference client for rollouts.
vf.RLTrainer (Legacy)
The legacy vf.RLTrainer still exists for educational and experimental purposes via the optional verifiers-rl package and the legacy RL CLI entrypoint, but it is not actively maintained. It is a compact single-node async RL trainer with a narrower feature set than production trainers. Its core implementation (trainer.py and orchestrator.py under packages/verifiers-rl/verifiers_rl/rl/trainer/) remains intentionally lightweight for algorithm experimentation. For production training and current guidance, use prime-rl.
Tinker
Tinker supports Verifiers environments via the tinker-cookbook recipes.
SkyRL
SkyRL supports Verifiers environments via its skyrl-train integration.