Models can be trained in Verifiers environments via Hosted Training, the prime-rl trainer, or other supported libraries.
Hosted Training
Hosted Training, available within our Lab platform, enables you to automatically train models via prime-rl without needing to manage your own infrastructure. Hosted Training supports LoRA for RL training, and can be used with any environment built with Verifiers.
Configuration
Use the prime lab setup script to download example configuration files for Hosted Training into your workspace:
This will place the example configuration files in configs/lab/, along with endpoints.py.
Example configuration file for the primeintellect/alphabet-sort environment with Qwen/Qwen3-30B-A3B-Instruct-2507:
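A minimal sketch of what such a configuration might look like is shown below; the section and key names are illustrative assumptions rather than the exact Hosted Training schema, so refer to the downloaded files in configs/lab/ for the authoritative format:

```toml
# Illustrative sketch only: the section and key names here are assumptions,
# not the authoritative Hosted Training schema; see the files downloaded
# into configs/lab/ for the real format.

[model]
name = "Qwen/Qwen3-30B-A3B-Instruct-2507"   # one of the supported base models

[environment]
id = "primeintellect/alphabet-sort"         # Environment Hub ID to train on

[training]
lora = true        # Hosted Training uses LoRA for RL training
max_steps = 100    # illustrative value
```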
Hosted Training currently supports the following models:
- Qwen/Qwen3-4B-Instruct-2507
- Qwen/Qwen3-4B-Thinking-2507
- Qwen/Qwen3-30B-A3B-Instruct-2507
- Qwen/Qwen3-30B-A3B-Thinking-2507
- Qwen/Qwen3-235B-A22B-Instruct-2507
- Qwen/Qwen3-235B-A22B-Thinking-2507
- PrimeIntellect/INTELLECT-3
Training with prime-rl
Our prime-rl trainer is a production-ready async RL training framework that supports large-scale multi-node training, agentic rollouts with Verifiers environments, Mixture-of-Experts (MoE) models, LoRA adapters, and other training algorithms such as SFT and online distillation. We recommend using prime-rl for training with Verifiers environments on self-managed GPU infrastructure. The default configuration distills the best practices from our research team’s experience and the broader community into a stable, easy-to-use recipe, including advanced features such as online difficulty filtering, continuous batching, in-flight weight updates, importance sampling and logprob clipping for stability, and more.
Setup and Configuration
To set up your workspace for training with prime-rl, run:
This will install the prime-rl trainer and its dependencies, and set up a default TOML config for training with the included wiki-search Environment on 8 GPUs.
Then, you can start training with:
Training with vf.RLTrainer
If you want to hack on new training algorithms and are less concerned with maximum performance or advanced features, you can use the included RLTrainer (via vf-rl), whose core files are under 1000 lines of code and include only the most essential logic for fairly performant async off-policy training (with a core algorithm similar to prime-rl's).
The included RLTrainer is a minimal, hackable training loop based on transformers.Trainer that supports both full-parameter finetuning and LoRA training. RLTrainer can be viewed as a “baby” prime-rl that adopts a similar default training recipe (async CISPO with one-step off-policy overlap), intended for single-node test runs with dense models. The primary files (trainer.py and orchestrator.py, located in verifiers/rl/trainer/) are under 1000 lines of code, and are designed to be a convenient starting point for writing your own training loop.
The feature set is intentionally kept minimal and focused. Users seeking maximum performance, MoE support, multi-node training, multidimensional parallelism, and other advanced features should use the prime-rl trainer.
Setup and Configuration
To use vf.RLTrainer in your own project, install with RL extras:
Then, use the vf-setup script to download example configuration files for vf.RLTrainer into your workspace:
This will place the example configuration files for vf.RLTrainer in configs/vf-rl/, along with endpoints.py.
vf-rl can be used with a single TOML file, largely mirroring the configuration options for prime-rl but with some key differences in organization and feature sets.
Example configuration file for the primeintellect/wiki-search Environment with Qwen/Qwen3-4B-Instruct-2507:
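A minimal sketch under assumed section names is shown below; the [env] and [model] sections are illustrative placeholders, while the [trainer.args] keys are the ones documented in the following sections, with illustrative rather than default values:

```toml
# Sketch only: the [env] and [model] section/key names are assumptions for
# illustration; the [trainer.args] keys are the ones documented below, with
# illustrative values rather than the shipped defaults.

[env]
id = "primeintellect/wiki-search"

[model]
name = "Qwen/Qwen3-4B-Instruct-2507"

[trainer.args]
rollouts_per_example = 16   # completions per prompt (group size)
micro_batch_size = 4        # rollouts per GPU per step
batch_size = 512            # total rollouts per global batch
learning_rate = 1e-6
max_steps = 200
```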
To launch a training run with vf.RLTrainer, do:
Batch sizing is controlled by the following fields in [trainer.args]:
- rollouts_per_example: completions per prompt (group size)
- micro_batch_size: rollouts per GPU per step
- batch_size: rollouts per global batch (must be divisible by micro_batch_size * world_size)

Guidance for choosing these values (a worked example follows the list):
- rollouts_per_example: larger groups (16-32) increase reward diversity but also increase training time and memory usage
- micro_batch_size: limited by GPU memory remaining after model weights
- batch_size: total rollouts per global batch (must be divisible by micro_batch_size and rollouts_per_example)
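As a worked example of the divisibility constraints above, assume a single node with 8 GPUs (world_size = 8) and the illustrative values below:

```toml
[trainer.args]
rollouts_per_example = 16
micro_batch_size = 4
batch_size = 512
# batch_size must be divisible by micro_batch_size * world_size = 4 * 8 = 32:
#   512 / 32 = 16 micro-batches per optimizer step, so this is valid.
# batch_size must also be divisible by rollouts_per_example = 16:
#   512 / 16 = 32 prompts per global batch, so this is valid too.
```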
Generation Parameters
Both prime-rl and vf-rl support configurable generation parameters, including:
- max_tokens: maximum number of tokens to generate per turn
- temperature: temperature for sampling
- top_p: top-p sampling
- top_k: top-k sampling
- min_p: minimum probability for sampling
- repetition_penalty: repetition penalty for sampling
In prime-rl, these parameters are configured in the [orchestrator.sampling] section; in vf-rl, they are configured in the [trainer.args] section.
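For example, the same sampling parameters expressed for each trainer (illustrative values; in practice these would live in separate config files):

```toml
# prime-rl: sampling parameters go in [orchestrator.sampling].
[orchestrator.sampling]
max_tokens = 2048
temperature = 1.0
top_p = 1.0

# vf-rl: the equivalent parameters go in [trainer.args].
[trainer.args]
max_tokens = 2048
temperature = 1.0
top_p = 1.0
```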
Training Schedule
Core fields in [trainer.args]:
- learning_rate, lr_scheduler_type, warmup_steps, max_steps
- max_grad_norm, bf16, gradient_checkpointing
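For instance, a sketch with illustrative values rather than the shipped defaults:

```toml
[trainer.args]
learning_rate = 1e-6
lr_scheduler_type = "constant"    # illustrative choice of scheduler
warmup_steps = 10
max_steps = 500
max_grad_norm = 1.0
bf16 = true
gradient_checkpointing = true
```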
Model loading
By default, vf.RLTrainer will use Liger Kernel for optimized training. To disable Liger Kernel, set use_liger = false in [trainer.args].
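For example:

```toml
[trainer.args]
use_liger = false   # Liger Kernel is enabled by default
```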
RL Rules of Thumb
RL training can be sensitive to implementation details and hyperparameters. Some simple practical guidance:

Before Training
- Evaluate baseline performance: If your model gets 0% reward after 10+ attempts, the task is too hard
- Check task difficulty: If baseline is already 80%+, consider harder examples
- Ensure reward diversity: You want varied scores within each generation group
Performance Trade-offs
For more aggressive training (higher risk of collapse):
- Increase learning rate (1e-5 to 1e-4 for LoRA, 1e-6 to 1e-5 for full finetuning)
- Decrease rollouts_per_example and batch_size for faster generation
For more conservative training (lower risk of collapse; a sketch of both regimes follows this list):
- Increase rollouts_per_example (16-32)
- Increase batch_size (512-1024)
- Use larger models (14B+)
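A hedged sketch of how these two regimes might look in [trainer.args]; the values are illustrative points inside the ranges above, not recommended defaults:

```toml
# Aggressive LoRA run (faster feedback, higher risk of collapse); values are
# illustrative points within the ranges suggested above.
[trainer.args]
learning_rate = 5e-5          # 1e-5 to 1e-4 is the suggested LoRA range
rollouts_per_example = 8      # smaller groups generate faster
batch_size = 256

# A more conservative full-finetuning run would move the other way, e.g.
# learning_rate = 2e-6, rollouts_per_example = 32, batch_size = 1024,
# on a larger (14B+) base model.
```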
In prime-rl, you can enable online difficulty filtering to ensure that rollout groups used for training always contain a diversity of rewards.
Common Issues
Non-Increasing Chat Templates: The Qwen3 and DeepSeek-R1 model series both remove <think> sections from messages when processing inputs, which violates the increasing context requirement for multi-turn training. We provide versions of many of these models with modified chat templates here.
OOM during generation:
- Reduce rollouts_per_example or micro_batch_size
- Use LoRA instead of full finetuning
- Check vLLM server has sufficient memory
Training instability or reward collapse:
- Decrease learning rate
- Increase rollouts_per_example
- Increase batch_size
Reward not improving:
- Increase learning rate
- Leverage continuous rewards
- Use online difficulty filtering
- Calibrate difficulty appropriately via smarter models or easier tasks
Other Trainers
verifiers is intended to be largely trainer-agnostic. It is supported by SkyRL and Tinker, and is straightforward to support in any trainer that can expose an OpenAI-compatible inference client for rollouts.