Training options
You can train with the included `RLTrainer` (via `vf-rl`) or with external projects like `prime-rl`.
- Use the included trainer when you want a simple, hackable training loop and LoRA-first defaults.
- Use `prime-rl` when you want FSDP-first orchestration and large-scale features.
Summary of similarities and differences
- Similarities
  - OpenAI-compatible inference (vLLM) and async rollouts
  - One-step off-policy overlap by default (generate at step n-1 while training at step n)
- Differences
  - `RLTrainer`: Accelerate/DeepSpeed-based; optional LoRA/PEFT; easy to script and extend in Python
  - PRIME-RL: FSDP-first; `rl` entrypoint; strong checkpointing; extensive CLI/TOML configuration
Train with vf-rl (included trainer)
The included trainer runs alongside a vLLM server, managed automatically by `vf-rl` inside a tmux session. Configure everything in a single TOML.
Quick Start
Install RL extras and set up default configs.
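A sketch of this step, assuming a pip-installable `verifiers[all]` extra; the exact extras name and any config-bootstrapping command are version-dependent, so check the verifiers README:

```bash
# Extras name is an assumption -- see the verifiers README for the exact
# install target and for the command that generates the default vf-rl configs.
uv pip install 'verifiers[all]'
```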
TOML Configuration
Minimal TOML example:
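The sketch below is illustrative rather than canonical: the `[trainer.args]` keys come from the sections that follow, while the `[model]` and `[env]` table names and all values are assumptions to be checked against the default configs for your version.

```toml
# Illustrative layout: [trainer.args] keys are documented below; the
# [model]/[env] table names and all values are placeholders.
[model]
name = "Qwen/Qwen2.5-7B-Instruct"

[env]
id = "my-environment"

[trainer.args]
rollouts_per_example = 16      # group size per prompt
micro_batch_size = 2           # rollouts per GPU per step
batch_size = 128               # must be divisible by micro_batch_size * world_size
max_tokens = 1024
temperature = 1.0
learning_rate = 1e-6
max_steps = 200
use_lora = true
lora_rank = 16
```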
Key Hyperparameters
Batch Configuration
Key fields in `[trainer.args]`:
- `rollouts_per_example`: completions per prompt (group size)
- `micro_batch_size`: rollouts per GPU per step
- `batch_size`: rollouts per global batch (must be divisible by `micro_batch_size * world_size`)
- `num_generations`: larger groups (16-32) increase reward diversity but use more memory
- `per_device_train_batch_size`: limited by GPU memory after model weights
- `gradient_accumulation_steps`: use to achieve larger effective batch sizes
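As a worked example of the divisibility constraint (GPU count and values are illustrative):

```toml
# 8 GPUs (world_size = 8), 2 rollouts per GPU per step
# -> batch_size must be a multiple of 2 * 8 = 16
[trainer.args]
rollouts_per_example = 16
micro_batch_size = 2
batch_size = 128   # 128 / (2 * 8) = 8 accumulation passes per global batch
```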
Generation Parameters
Specify in `[trainer.args]`:
- `max_tokens` (per-turn), `temperature`, `top_p`, `top_k`, `min_p`, `repetition_penalty`
- `max_prompt_len`, `max_seq_len`
- High temperature (0.8-1.0) increases diversity within groups
- Consider your model’s context window when setting lengths
- Longer completions allow more complex reasoning but increase memory usage
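A sketch of these fields with illustrative values; exact defaults and the relationship between the length fields depend on your version:

```toml
# Illustrative values; check the defaults for your model and environment.
[trainer.args]
max_tokens = 1024          # per-turn completion budget
temperature = 1.0          # 0.8-1.0 encourages diversity within groups
top_p = 1.0
max_prompt_len = 2048
max_seq_len = 4096         # often the total prompt + completion budget; keep it
                           # within the model's context window
```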
Training Schedule
Core fields in `[trainer.args]`:
- `learning_rate`, `lr_scheduler_type`, `warmup_steps`, `max_steps`
- `max_grad_norm`, `bf16`, `gradient_checkpointing`
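An illustrative schedule block; the values are placeholders rather than recommendations, beyond the warmup guidance under Best Practices below:

```toml
# Illustrative schedule; tune for your model size and task.
[trainer.args]
learning_rate = 1e-6
lr_scheduler_type = "constant_with_warmup"   # scheduler name is illustrative
warmup_steps = 20            # 10-20+ warmup steps are suggested below
max_steps = 500
max_grad_norm = 0.01         # see the stability guidance below for tighter clipping
bf16 = true
gradient_checkpointing = true
```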
Async Generation
`RLTrainer` is asynchronous (one step off-policy) by default. Generation is controlled via `[trainer.args]` and the environment:
- `generation_timeout`, `max_concurrent`
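For example (the field semantics and units here are assumptions; check your version's defaults):

```toml
[trainer.args]
generation_timeout = 600   # assumed to be a per-batch rollout timeout
max_concurrent = 64        # assumed cap on concurrent environment rollouts
```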
Evaluation During Training
Set `eval_strategy`/`eval_steps` in `[trainer.args]` and provide an eval split via your environment configuration if supported.
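For example, assuming HF-style values for `eval_strategy`:

```toml
[trainer.args]
eval_strategy = "steps"    # "steps" is an assumed HF-style value
eval_steps = 50
```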
Parameter-Efficient Training
LoRA is enabled by default; configure via `[trainer.args]` fields like `use_lora`, `lora_rank`, `lora_alpha`, `lora_dropout`, and optionally `lora_target_modules`.
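An illustrative adapter block; the target module names are model-dependent and shown only as an example:

```toml
[trainer.args]
use_lora = true
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
# Optional: restrict adaptation to specific modules (names vary by model family)
lora_target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
```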
RL Rules of Thumb
RL is notoriously sensitive to implementation details. Here's practical guidance.
Before Training
- Evaluate baseline performance: If your model gets 0% reward after 10+ attempts, the task is too hard
- Check task difficulty: If baseline is already 80%+, consider harder examples
- Ensure reward diversity: You want varied scores within each generation group
Stability vs Performance Trade-offs
For more aggressive training (higher risk of collapse; see the config sketch after this list):
- Set `beta = 0` (no KL penalty)
- Increase learning rate (2e-6 to 5e-6)
- Increase `num_iterations` (2-4)
- Increase `num_generations` (32-64)
- Increase batch size via `gradient_accumulation_steps`
For more conservative training (lower risk of collapse):
- Decrease `max_grad_norm` (0.001-0.005)
- Use larger models (14B+)
- Keep `num_iterations = 1` (stay on-policy)
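As a rough sketch, the two regimes might translate into `[trainer.args]` like this; the placement of `beta` and `num_iterations` under `[trainer.args]` is an assumption, and the values are drawn from the ranges listed above:

```toml
# Aggressive regime (higher risk of collapse)
[trainer.args]
beta = 0.0                       # no KL penalty
learning_rate = 3e-6             # within the 2e-6 to 5e-6 range above
num_iterations = 3               # 2-4
num_generations = 48             # 32-64
gradient_accumulation_steps = 8  # larger effective batch

# Conservative regime (lower risk of collapse) -- use instead of the above
# beta = 0.1
# learning_rate = 1e-6
# num_iterations = 1             # stay on-policy
# max_grad_norm = 0.003          # within the 0.001-0.005 range above
```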
Best Practices
Likely beneficial:
- Learning rate warmup (10-20 steps minimum)
- Periodic reference model updates for 500+ step runs
- One-step off-policy training (`num_batches_ahead = 1`)
- High `beta` values (0.1+): more conservative
- Overlong filtering: depends on task
- Tool response masking: useful for multi-turn
Troubleshooting
Common Issues
Non-Increasing Chat Templates: The Qwen3 and DeepSeek-R1 model series both remove `<think>` sections from messages when processing inputs, which violates the increasing-context requirement for multi-turn training. We provide versions of many of these models with modified chat templates here.
OOM during generation:
- Reduce `num_generations` or `per_device_train_batch_size`
- Use LoRA instead of full finetuning
- Check that the vLLM server has sufficient memory
Training instability:
- Reduce learning rate
- Decrease `max_grad_norm`
- Increase `beta` for stronger KL regularization
Weak or undifferentiated reward signal:
- Increase temperature
- Check if task difficulty matches model capability
- Ensure your rubric differentiates quality levels
Infrastructure
- Ensure `huggingface` and `wandb` logins are configured
- Set `OPENAI_API_KEY` (can be a dummy value for vLLM)
- Increase ulimit for high concurrency: `ulimit -n 4096`
- For NCCL issues: try `NCCL_P2P_DISABLE=1`
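A shell sketch of the checklist above (the login commands are interactive; the dummy key value is illustrative):

```bash
# Log in to Hugging Face and Weights & Biases
huggingface-cli login
wandb login

# vLLM's OpenAI-compatible endpoint accepts a dummy key
export OPENAI_API_KEY=dummy

# Raise the open-file limit for high rollout concurrency
ulimit -n 4096

# If NCCL hangs or errors, try disabling peer-to-peer transfers
export NCCL_P2P_DISABLE=1
```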
Advanced Configuration
Custom Sampling
Resource Optimization
Monitoring
Train with PRIME-RL
If you prefer an FSDP-first setup with higher throughput, you can train the same `verifiers` Environments using `prime-rl`.
- Install `prime-rl` (see its README for CUDA requirements).
- Create or install a Verifiers Environment module (inside your `prime-rl` checkout if developing there).
- Configure the orchestrator to use your Environment in your orchestrator TOML (e.g. `configs/my_exp/orch.toml`).
- Launch a single-node run (adjust GPU split to your hardware); a command sketch follows.
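A sketch of the launch step, assuming your configs live under `configs/my_exp/` and that the `rl` entrypoint loads TOML files with the flag syntax described in the prime-rl README; verify the exact flags for your version:

```bash
# Single-node run; config paths follow the configs/my_exp/ example above.
# Flag names and the "@ file.toml" loading syntax are assumptions to check
# against the prime-rl README.
uv run rl \
  --trainer @ configs/my_exp/train.toml \
  --orchestrator @ configs/my_exp/orch.toml \
  --inference @ configs/my_exp/infer.toml
```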
- Use `bash scripts/tmux.sh` in `prime-rl` to open a panes layout for trainer/orchestrator/inference logs.
- Log to W&B by adding `--wandb.project <proj> --wandb.name <run>` to `uv run rl` (shared to trainer + orchestrator).
- For checkpointing/resume, see the `prime-rl` README (supports step-tagged checkpoints across trainer/orchestrator).
Next Steps
- Explore Environments to create custom tasks
- Review Components for advanced patterns
- See the examples directory on GitHub for complete training scripts