Training options
You can train with the included `RLTrainer` (via `vf-rl`) or with external projects like `prime-rl`.
- Use the included trainer when you want a simple, hackable training loop and LoRA-first defaults.
- Use `prime-rl` when you want FSDP-first orchestration and large-scale features.
Summary of similarities and differences
- Similarities
  - OpenAI-compatible inference (vLLM) and async rollouts
  - One-step off-policy overlap by default (generate at step n-1 while training at step n)
- Differences
  - `RLTrainer`: Accelerate/DeepSpeed-based; optional LoRA/PEFT; easy to script and extend in Python
  - PRIME-RL: FSDP-first; `rl` entrypoint; strong checkpointing; extensive CLI/TOML configuration
Train with vf-rl (included trainer)
The included trainer runs alongside a vLLM server, managed automatically by `vf-rl` inside a tmux session. Configure everything in a single TOML.
Quick Start
Install RL extras and set up default configs.
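A sketch of this step, assuming a pip-installable `verifiers[all]` extra; the exact extras name and any config-bootstrapping command are version-dependent, so check the verifiers README:

```bash
# Extras name is an assumption -- see the verifiers README for the exact
# install target and for the command that generates the default vf-rl configs.
uv pip install 'verifiers[all]'
```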
TOML Configuration
Minimal TOML example:
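The sketch below is illustrative rather than canonical: the `[trainer.args]` keys come from the sections that follow, while the `[model]` and `[env]` table names and all values are assumptions to be checked against the default configs for your version.

```toml
# Illustrative layout: [trainer.args] keys are documented below; the
# [model]/[env] table names and all values are placeholders.
[model]
name = "Qwen/Qwen2.5-7B-Instruct"

[env]
id = "my-environment"

[trainer.args]
rollouts_per_example = 16      # group size per prompt
micro_batch_size = 2           # rollouts per GPU per step
batch_size = 128               # must be divisible by micro_batch_size * world_size
max_tokens = 1024
temperature = 1.0
learning_rate = 1e-6
max_steps = 200
use_lora = true
lora_rank = 16
```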
Key Hyperparameters
Batch Configuration
Key fields in `[trainer.args]`:
- `rollouts_per_example`: completions per prompt (group size)
- `micro_batch_size`: rollouts per GPU per step
- `batch_size`: rollouts per global batch (must be divisible by `micro_batch_size * world_size`)
- `num_generations`: larger groups (16-32) increase reward diversity but use more memory
- `per_device_train_batch_size`: limited by GPU memory after model weights
- `gradient_accumulation_steps`: use to achieve larger effective batch sizes
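As a worked example of the divisibility constraint (GPU count and values are illustrative):

```toml
# 8 GPUs (world_size = 8), 2 rollouts per GPU per step
# -> batch_size must be a multiple of 2 * 8 = 16
[trainer.args]
rollouts_per_example = 16
micro_batch_size = 2
batch_size = 128   # 128 / (2 * 8) = 8 accumulation passes per global batch
```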
Generation Parameters
Specify in `[trainer.args]`:
- `max_tokens` (per-turn), `temperature`, `top_p`, `top_k`, `min_p`, `repetition_penalty`
- `max_prompt_len`, `max_seq_len`
- High temperature (0.8-1.0) increases diversity within groups
- Consider your model’s context window when setting lengths
- Longer completions allow more complex reasoning but increase memory usage
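A sketch of these fields with illustrative values; exact defaults and the relationship between the length fields depend on your version:

```toml
# Illustrative values; check the defaults for your model and environment.
[trainer.args]
max_tokens = 1024          # per-turn completion budget
temperature = 1.0          # 0.8-1.0 encourages diversity within groups
top_p = 1.0
max_prompt_len = 2048
max_seq_len = 4096         # often the total prompt + completion budget; keep it
                           # within the model's context window
```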
Training Schedule
Core fields in `[trainer.args]`:
- `learning_rate`, `lr_scheduler_type`, `warmup_steps`, `max_steps`
- `max_grad_norm`, `bf16`, `gradient_checkpointing`
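An illustrative schedule block; the values are placeholders rather than recommendations, beyond the warmup guidance under Best Practices below:

```toml
# Illustrative schedule; tune for your model size and task.
[trainer.args]
learning_rate = 1e-6
lr_scheduler_type = "constant_with_warmup"   # scheduler name is illustrative
warmup_steps = 20            # 10-20+ warmup steps are suggested below
max_steps = 500
max_grad_norm = 0.01         # see the stability guidance below for tighter clipping
bf16 = true
gradient_checkpointing = true
```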
Async Generation
`RLTrainer` is asynchronous (one step off-policy) by default. Generation is controlled via `[trainer.args]` and the environment:
- `generation_timeout`, `max_concurrent`
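For example (the field semantics and units here are assumptions; check your version's defaults):

```toml
[trainer.args]
generation_timeout = 600   # assumed to be a per-batch rollout timeout
max_concurrent = 64        # assumed cap on concurrent environment rollouts
```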
Evaluation During Training
Set `eval_strategy`/`eval_steps` in `[trainer.args]` and provide an eval split via your environment configuration if supported.
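For example, assuming HF-style values for `eval_strategy`:

```toml
[trainer.args]
eval_strategy = "steps"    # "steps" is an assumed HF-style value
eval_steps = 50
```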
Parameter-Efficient Training
LoRA is enabled by default; configure via `[trainer.args]` fields like `use_lora`, `lora_rank`, `lora_alpha`, `lora_dropout`, and optionally `lora_target_modules`.
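An illustrative adapter block; the target module names are model-dependent and shown only as an example:

```toml
[trainer.args]
use_lora = true
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
# Optional: restrict adaptation to specific modules (names vary by model family)
lora_target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
```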
RL Rules of Thumb
RL is notoriously sensitive to implementation details. Here's practical guidance.
Before Training
- Evaluate baseline performance: If your model gets 0% reward after 10+ attempts, the task is too hard
- Check task difficulty: If baseline is already 80%+, consider harder examples
- Ensure reward diversity: You want varied scores within each generation group
Stability vs Performance Trade-offs
For more aggressive training (higher risk of collapse; see the config sketch after this list):
- Set `beta = 0` (no KL penalty)
- Increase learning rate (2e-6 to 5e-6)
- Increase `num_iterations` (2-4)
- Increase `num_generations` (32-64)
- Increase batch size via `gradient_accumulation_steps`
For more conservative training (lower risk of collapse):
- Decrease `max_grad_norm` (0.001-0.005)
- Use larger models (14B+)
- Keep `num_iterations = 1` (stay on-policy)
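As a rough sketch, the two regimes might translate into `[trainer.args]` like this; the placement of `beta` and `num_iterations` under `[trainer.args]` is an assumption, and the values are drawn from the ranges listed above:

```toml
# Aggressive regime (higher risk of collapse)
[trainer.args]
beta = 0.0                       # no KL penalty
learning_rate = 3e-6             # within the 2e-6 to 5e-6 range above
num_iterations = 3               # 2-4
num_generations = 48             # 32-64
gradient_accumulation_steps = 8  # larger effective batch

# Conservative regime (lower risk of collapse) -- use instead of the above
# beta = 0.1
# learning_rate = 1e-6
# num_iterations = 1             # stay on-policy
# max_grad_norm = 0.003          # within the 0.001-0.005 range above
```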
Best Practices
Likely beneficial:
- Learning rate warmup (10-20 steps minimum)
- Periodic reference model updates for 500+ step runs
- One-step off-policy training (`num_batches_ahead = 1`)
- High `beta` values (0.1+): more conservative
- Overlong filtering: depends on task
- Tool response masking: useful for multi-turn
Troubleshooting
Common Issues
Non-Increasing Chat Templates: The Qwen3 and DeepSeek-R1 model series both remove `<think>` sections from messages when processing inputs, which violates the increasing-context requirement for multi-turn training. We provide versions of many of these models with modified chat templates here.
OOM during generation:
- Reduce `num_generations` or `per_device_train_batch_size`
- Use LoRA instead of full finetuning
- Check that the vLLM server has sufficient memory
Training instability:
- Reduce learning rate
- Decrease `max_grad_norm`
- Increase `beta` for stronger KL regularization
Weak or undifferentiated reward signal:
- Increase temperature
- Check if task difficulty matches model capability
- Ensure your rubric differentiates quality levels
Infrastructure
- Ensure `huggingface` and `wandb` logins are configured
- Set `OPENAI_API_KEY` (can be a dummy value for vLLM)
- Increase ulimit for high concurrency: `ulimit -n 4096`
- For NCCL issues: try `NCCL_P2P_DISABLE=1`
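A shell sketch of the checklist above (the login commands are interactive; the dummy key value is illustrative):

```bash
# Log in to Hugging Face and Weights & Biases
huggingface-cli login
wandb login

# vLLM's OpenAI-compatible endpoint accepts a dummy key
export OPENAI_API_KEY=dummy

# Raise the open-file limit for high rollout concurrency
ulimit -n 4096

# If NCCL hangs or errors, try disabling peer-to-peer transfers
export NCCL_P2P_DISABLE=1
```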
Advanced Configuration
Custom Sampling
Resource Optimization
Monitoring
Train with PRIME-RL
If you prefer an FSDP-first setup with higher throughput, you can train the same `verifiers` Environments using `prime-rl`.
- Install `prime-rl` (see its README for CUDA requirements).
- Create or install a Verifiers Environment module (inside your `prime-rl` checkout if developing there).
- Configure the orchestrator to use your Environment in your orchestrator TOML (e.g. `configs/my_exp/orch.toml`).
- Launch a single-node run (adjust GPU split to your hardware); a command sketch follows.
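A sketch of the launch step, assuming your configs live under `configs/my_exp/` and that the `rl` entrypoint loads TOML files with the flag syntax described in the prime-rl README; verify the exact flags for your version:

```bash
# Single-node run; config paths follow the configs/my_exp/ example above.
# Flag names and the "@ file.toml" loading syntax are assumptions to check
# against the prime-rl README.
uv run rl \
  --trainer @ configs/my_exp/train.toml \
  --orchestrator @ configs/my_exp/orch.toml \
  --inference @ configs/my_exp/infer.toml
```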
- Use `bash scripts/tmux.sh` in `prime-rl` to open a panes layout for trainer/orchestrator/inference logs.
- Log to W&B by adding `--wandb.project <proj> --wandb.name <run>` to `uv run rl` (shared to trainer + orchestrator).
- For checkpointing/resume, see the `prime-rl` README (supports step-tagged checkpoints across trainer/orchestrator).
Next Steps
- Explore Environments to create custom tasks
- Review Components for advanced patterns
- See the examples directory on GitHub for complete training scripts