Training

This page covers everything you need to launch, observe, checkpoint, and recover a prime-rl training run — the RL trainer (and the distillation algorithms that run through it) and the SFT trainer. For multi-node and cluster layouts, see Scaling. For the loss math and algorithm knobs, see Algorithms.

AI agents working in this repo: the equivalent runbooks are at skills/training/ — top-level routing in skills/training/SKILL.md, launch details in skills/training/start-run/SKILL.md, and check-in / restart procedures in skills/training/monitor-run/SKILL.md.

Entrypoints
RL Trainer
SFT Trainer
Checkpointing
Observability
Rules of Thumb

Entrypoints

Command	Purpose	Notes
`uv run rl`	Wraps the trainer, orchestrator, and inference server in one launch from a merged TOML.	The default for any RL run. Runs locally for single-node experiments; submits to SLURM for single- or multi-node when `[slurm]` is set (see Scaling § SLURM).
`uv run sft`	Supervised fine-tuning on a HF dataset.	Launches torchrun internally; never call torchrun directly.
`uv run inference`	vLLM server.	Always use this entrypoint over `vllm serve` — it adds `/update_weights`, `/load_lora_adapter`, and `/init_broadcaster`.
`uv run trainer`	Standalone trainer process group.	Use only when launching the trainer separately from the orchestrator (e.g. multi-node RL without the `rl` wrapper).
`uv run orchestrator`	Standalone orchestrator process.	Pair with a separately-launched trainer + inference.

RL Trainer

Launch

The minimal RL run trains an SFT-warmed Qwen3-0.6B on the reverse-text task — the env is bundled with the verifiers submodule, so nothing else needs to be installed:

uv run rl @ examples/basic/reverse-text/rl.toml

Useful Knobs

A condensed view of the knobs you’ll most often tune. For trainer-side parallelism, sampling, optimizer, and loss knobs see Scaling and Algorithms. Data and algorithm:

Knob	What it does
`orchestrator.batch_size`	Tasks per trainer step.
`orchestrator.group_size`	Rollouts generated per task.
`orchestrator.max_off_policy_steps`	How many distinct policies may have contributed to one rollout before it’s discarded (default 8). The main off-policy dial on long agentic rollouts — bump for throughput, lower for tighter on-policyness. Watch `errored_rollouts` and `mismatch_kl/all/mean` when tuning.
`[orchestrator.algo]`	Training algorithm — its `type` names it (`grpo` default, `max_rl`, `opd`, `opsd`, `sft`, `echo`). See Algorithms.
`[[orchestrator.train.env]]`	Training environments. List multiple tables for multi-env training; weight them via `ratio`. See Configuration § Environments.
`[[orchestrator.eval.env]]` + `orchestrator.eval.interval`	Eval environments and cadence (default every 100 steps).

Monitoring:

Knob	What it does
`log.level`	Process log level for trainer + orchestrator (`info` default; falls back to `$PRIME_LOG_LEVEL`). Set per-process via `trainer.log.level` / `orchestrator.log.level`, or globally on the `rl` entrypoint to propagate to both.
`orchestrator.log.vf_level`	Env-worker / `verifiers` log level (`info` default; `debug` is noisy but useful for env debugging).
`--wandb` (+ `--wandb.project`, `--wandb.name`)	Enable Weights & Biases logging. See Weights & Biases.
`--orchestrator.prime-monitor`	Stream metrics to the Prime Intellect platform (Prime Lab). See Platform monitoring.

Run management:

Knob	What it does
`--clean-output-dir`	Wipe `<output_dir>` before starting. Useful when re-running an experiment with the same name during iteration.
`--output-dir outputs/<name>`	Per-run output directory. Always set this when running more than one experiment in parallel.
`--max-steps N`	Stop after `N` trainer steps. Overrides the config value.
`--dry-run`	Resolve + validate the full config, write per-process TOMLs to `<output_dir>/configs/`, and exit without launching. The fastest way to debug a misbehaving config.

Algorithms

The RL entrypoint supports several training algorithms, switched via [orchestrator.algo]’s type (see Algorithms for the full reference, model references, and per-algorithm customization):

`algo.type`	Frozen model	Use case
`grpo` (default)	None	Standard group-relative RL
`max_rl`	None	MaxRL: GRPO with mean-normalized advantages (maximum-likelihood RL)
`opd`	Required, must be vLLM (needs `prompt_logprobs`)	On-policy distillation: the policy generates rollouts, the trainer minimizes per-token reverse KL to a reference model
`sft`	Required, any OpenAI-compatible endpoint	Hard-distill: a frozen model generates rollouts, the policy trains on its tokens
`opsd`	None — the live policy is its own reference (no deployment)	SDFT: the model is its own reference conditioned on expert demonstrations
`echo`	None	GRPO plus cross-entropy on env-observation tokens

A new algorithm is a named class in code, not a config — see Algorithms § Authoring an Algorithm. Frozen models are declared inline on the algorithm, named where the model is used — [orchestrator.algo.teacher] for opd (the frozen model scored against), [orchestrator.algo.sampling.source] for sft (the model it samples from) — each with name + base_url. opsd declares no frozen model: it self-distills against the live policy. The rl entrypoint only manages policy inference — start frozen-model servers yourself and point base_url at them:

CUDA_VISIBLE_DEVICES=1 uv run inference \
  --model.name <frozen-model> --server.port 8001

The standalone uv run sft entrypoint is the more traditional SFT path — pure dataset-based, no orchestrator. Use the sft algorithm only when you want a frozen model to generate the supervision on the fly.

Important Metrics

Pulled from the console logs and mirrored to W&B. Progress (orchestrator):

reward/{all,env}/mean — main signal. Should trend upward over hundreds of steps.
seq_len/{all,env}/mean and is_truncated/{all,env}/mean — rollout length and truncation rate.
num_turns/{all,env}/mean — for multi-turn envs.
empty_rollouts/{all,env}, errored_rollouts/{all,env} — non-zero is fine in small numbers; sustained > 5% is a smell.
eval/{env}/{avg@k,pass@k} — eval scores when [orchestrator.eval] is set.

Stability (trainer):

mismatch_kl/{all,env}/{mean,std,max} — KL between trainer’s current policy and the (older) inference policy that generated the rollouts. A sustained, growing mean is the early-warning sign for off-policy collapse.
entropy/{all,env}/mean — too low means mode-collapse; too high means the model isn’t committing.
masked_advantage_{positive,negative}/mean — fraction of DPPO-masked tokens, split by sign.
optim/grad_norm — spikes precede divergence; check the loss config or lower the LR.

Performance (trainer + orchestrator step independently):

Source	Metric	Reading
trainer	`time/wait_for_batch`	high → orchestrator bottleneck
orchestrator	`time/wait_for_ckpt`	high → trainer bottleneck

SFT Trainer

uv run sft runs supervised fine-tuning from a HF dataset. It shares model loaders, FSDP setup, checkpointing, and the chat-template plumbing with the RL trainer, so a typical workflow is SFT → RL → SFT → … without any reformatting.

Dataset Format

Two accepted layouts:

Prompt-completion: a HF dataset with prompt and completion columns (TRL format). The trainer masks out the prompt and computes loss only over the completion.
Messages: a HF dataset with a single messages column containing a list of chat turns. The trainer interprets the whole conversation as one sample, applies role-based loss masking, and trains over all assistant turns.

If both columns are present, messages takes precedence. Tool definitions and renderer controls. For tool-use SFT, add a tools column (OpenAI function-calling format) or tool_defs (verifiers rollout format). Each row’s value can be either a list of dicts or a JSON-encoded string of a list — both are accepted, and tool_defs rows are auto-converted to OAI shape before being passed into the renderer. Renderer-backed SFT reads template controls from the typed [renderer] config in the SFT TOML. For example:

[renderer]
name = "qwen3"
enable_thinking = false

If a model needs another template control, add it to that model’s renderer config in renderers (for example a new field on the relevant *RendererConfig) and consume it in the renderer implementation. Renderer-backed tokenization. SFT tokenization is renderer-only. The renderers package owns message-to-token conversion and loss attribution end-to-end, so position-dependent chat templates (for example templates that strip past <think> blocks across user turns) do not corrupt the loss mask. [renderer] defaults to name = "auto"; set a typed renderer config only when you need model-specific template controls. Hand-coded renderers ship for Qwen3, Qwen3.5, GLM-5, GLM-4.5, Kimi K2/K2.5, MiniMax M2, DeepSeek V3, Nemotron 3, GPT-OSS, and VLM families such as Qwen3-VL/Qwen3.5. VLM training requires a custom PrimeRL implementation. Training a model with [model.vlm] set (SFT or RL) requires model.impl = "custom" and only works for models with a registered PrimeRL VLM class (currently Qwen3.5 dense and MoE). See Algorithms § Multi-Turn Trajectories for the full picture.

Launch

The minimal SFT run trains Qwen3-0.6B on the reverse-text SFT dataset:

uv run sft @ examples/basic/reverse-text/sft.toml --wandb

Multi-GPU and multi-node use torchrun under the hood (the sft entrypoint manages this for you — see Scaling § SFT and Torchrun for non-default layouts; multi-node SFT goes through SLURM).

SFT-Specific Knobs

Knob	What it controls
`data.name`	HF dataset name or local path
`data.batch_size`	Tokens per trainer step (packed)
`data.seq_len`	Per-sample sequence length
`loss_mask.*`	Which roles contribute to loss (system / user / assistant / tool).
`val.interval`	Run validation every N steps; `val.data` mirrors `data`

Important Metrics

Pulled from the console log and mirrored to W&B. Progress and loss:

loss/mean — main signal. Should decrease through the run.
val/loss — validation loss when [val] is set, logged every val.interval steps.
progress/epoch, progress/num_samples, progress/num_tokens — dataset progress.
progress/<subset>/ratio_{samples,tokens} — when training on multiple HF subsets/splits, the realized mixing ratio.

Stability and optimization:

optim/grad_norm — spikes precede divergence.
optim/lr, optim/zero_grad_ratio — LR schedule and the fraction of params that received zero gradients (high → dead path or wrong loss masking).
For MoE: max_vio/mean (load-balancing violation), routing_confidence/mean — both are logged when non-zero.

Performance:

Metric	Reading
`perf/throughput`, `perf/throughput_per_gpu`	tokens/s overall and per GPU
`perf/mfu`	MFU
`perf/peak_memory`	peak GPU memory (GiB)
`time/step`, `time/forward_backward`, `time/save_ckpt`	step breakdown

Checkpointing

Checkpointing is split across processes because the orchestrator and trainer can be on different machines and on different steps at any given time. Inference is stateless.

Process	What’s saved	Where
Trainer	FSDP-sharded model (DCP), optimizer, scheduler, progress	`<output_dir>/checkpoints/step_N/trainer/`
Orchestrator	Step counter, total tokens / samples / problems	`<output_dir>/checkpoints/step_N/orchestrator/`
Inference	nothing — re-pushed from the latest checkpoint on restart	n/a
Trainer (HF weights)	HF-compatible weight snapshot for serving	`<output_dir>/weights/step_N/`

Enabling Checkpoints

Checkpointing is off by default to save disk. Enable it with --ckpt:

uv run rl @ rl.toml --ckpt                              # default: end-of-training only
uv run rl @ rl.toml --ckpt.interval 25                  # every 25 steps
uv run rl @ rl.toml --ckpt.interval 25 --ckpt.keep-last 3  # rolling window of 3
uv run rl @ rl.toml --ckpt.interval 25 --ckpt.keep-interval 100  # …plus permanent every 100

Resuming a Run

Re-run the same launch command and pass --ckpt.resume-step <N> (or -1 for “latest”). Make sure --max-steps is at least the target final step, not the remaining delta:

# First run: steps 1–10
uv run rl @ rl.toml --max-steps 10 --ckpt

# Resume: continue to step 20
uv run rl @ rl.toml --max-steps 20 --ckpt.resume-step 10

Serving Checkpoints

HF-compatible weight snapshots are written under <output_dir>/weights/step_N/ whenever a full checkpoint runs (or you can write weights-only via --ckpt.weights-only for cheaper snapshots). Upload directly:

uv run hf upload <user>/<model>-RL outputs/weights/step_100

For LoRA runs, set ckpt.weights.save_adapter_separately = true to also write the raw adapter alongside the merged weights — useful when serving the adapter through a separate /load_lora_adapter call.

Observability

Log Files

The launcher tees every process’s stdout/stderr into <output_dir>/logs/. The full layout (single-node runs skip the node_*.log and router_*.log files):

<output_dir>/logs/
├── trainer.log                  # rank 0 only; symlink → trainer/node_0.log on multi-node
├── orchestrator.log             # single instance, single file
├── inference.log                # symlink → inference/node_0.log on multi-node
├── trainer/
│   ├── node_*.log               # per-node trainer stdout (multi-node only)
│   └── torchrun/<rdzv>/attempt_0/<rank>/{stdout,stderr}.log   # per-rank
├── inference/
│   ├── node_*.log               # per-node inference stdout (multi-node only)
│   └── router_*.log             # vllm-router per replica (multi-node only)
└── envs/{train,eval}/<env_name>/
    ├── env_server.log
    └── env_worker_<id>.log

Env worker logs are the first place to look for env-side errors (most user code lives there). Verbosity is controlled by orchestrator.log.vf_level. For multi-rank trainer debugging, drop into logs/trainer/torchrun/<rdzv>/attempt_0/<rank>/{stdout,stderr}.log — verbose and per-rank. Live tailing from a single point (works on the head node for multi-node runs over a shared filesystem):

tail -F <output_dir>/logs/{trainer,orchestrator,inference}.log
tail -F <output_dir>/logs/trainer/node_*.log     # multi-node only
tail -F <output_dir>/logs/inference/router_*.log # multi-node only

Console Output

scripts/tmux.sh opens a 4-pane tmux session that follows trainer.log, orchestrator.log, inference.log, and the union of env worker logs. Start it before launching:

bash scripts/tmux.sh
# then in the Launcher window:
uv run rl @ ... --output-dir outputs/my-run

Pass -s <session> and -o <output_dir> to run multiple parallel experiments side-by-side in different sessions. The helper also works on a SLURM head node — bash scripts/tmux.sh my-rl-job /shared/outputs/my-rl-job.

Weights & Biases

W&B is off by default. Enable with --wandb:

uv run rl @ rl.toml --wandb                               # default project, random name
uv run rl @ rl.toml --wandb.project my-proj --wandb.name run-42
uv run rl @ rl.toml --no-wandb                            # force-disable even if the TOML enables it

The trainer and orchestrator log into a single shared W&B run, so all metrics from both processes land in one place. Shared mode requires the W&B SDK ≥ 0.19.9 and is incompatible with wandb.offline = true. By default, every 10 steps each process also logs a sample of prompts/completions (with rewards and advantages) and reward/advantage/entropy distributions as W&B tables. Tune via --wandb.log-extras.interval and --wandb.log-extras.sample-ratio, or disable subsets:

uv run rl @ rl.toml --wandb \
  --orchestrator.wandb.log-extras.interval 50 \
  --no-trainer.wandb.log-extras.distributions

prime-rl deliberately logs a large number of metrics for maximum observability: every rollout metric is emitted per subset (all/effective), per statistic (mean/max/min/p10/p90), and per environment alongside a cross-env aggregate, so a multi-env run can emit thousands of series. To keep that navigable, W&B mode auto-creates an overview saved view on the first run into a project — curating the handful of metrics that matter into train, eval, stability, and performance sections (with per-env breakdowns). The view is created once per project and adapts to the run’s environments; if a later run uses a different set of environments, a new versioned view (overview-v2, …) is created instead of overwriting the first.

Platform Monitoring

Register a run on the Prime Intellect platform (Prime Lab) and stream training metrics, samples, and distributions to the platform dashboard. Bare flag uses defaults:

uv run rl @ rl.toml --orchestrator.prime-monitor

Or set it in TOML:

[orchestrator.prime_monitor]
run_name = "my-experiment"

Requires PRIME_API_KEY (set via prime login or env var) and an allowlisted team. Currently internal-only.

Rules of Thumb

Start small. Run examples/basic/reverse-text/rl.toml end-to-end on 2 GPUs before scaling. If the smoke run finishes cleanly, your install is good.
Batch size ≥ 64. Smaller batches give noisy gradient estimates and the trainer’s overhead-per-step dominates throughput. 64 is the practical floor; 128–512 is the range for quick ablations; production RL often runs at 1024+.
Group size ≥ 8. Bigger groups (orchestrator.group_size) make it more likely that a task produces a mix of high- and low-reward rollouts, which is what gives the trainer a usable signal — if all rollouts in a group succeed or all fail, the within-group advantage collapses to zero and the trainer learns nothing from that task. Bigger groups also tighten advantage normalization. 8 is the floor; 16–32 is common.
Pin output_dir per run. Sharing a directory across runs will mix rollouts and break resumes. --output-dir outputs/<unique-name> is the simplest discipline.
Use --dry-run before SLURM. Validators (e.g. CP needs flash-attention) fail fast in dry-run and slow in queue.

Getting Started

Lab

Libraries

Compute

Table of Contents

Entrypoints

RL Trainer

Launch

Useful Knobs

Algorithms

Important Metrics

SFT Trainer

Dataset Format

Launch

SFT-Specific Knobs

Important Metrics

Checkpointing

Enabling Checkpoints

Resuming a Run

Serving Checkpoints

Observability

Log Files

Console Output

Weights & Biases

Platform Monitoring

Rules of Thumb

​Table of Contents

​Entrypoints

​RL Trainer

​Launch

​Useful Knobs

​Algorithms

​Important Metrics

​SFT Trainer

​Dataset Format

​Launch

​SFT-Specific Knobs

​Important Metrics

​Checkpointing

​Enabling Checkpoints

​Resuming a Run

​Serving Checkpoints

​Observability

​Log Files

​Console Output

​Weights & Biases

​Platform Monitoring

​Rules of Thumb

Table of Contents

Entrypoints

RL Trainer

Launch

Useful Knobs

Algorithms

Important Metrics

SFT Trainer

Dataset Format

Launch

SFT-Specific Knobs

Important Metrics

Checkpointing

Enabling Checkpoints

Resuming a Run

Serving Checkpoints

Observability

Log Files

Console Output

Weights & Biases

Platform Monitoring

Rules of Thumb