Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.primeintellect.ai/llms.txt

Use this file to discover all available pages before exploring further.

This page covers everything you need to launch, observe, checkpoint, and recover a prime-rl training run — the RL trainer, the SFT trainer, and the related on-policy distillation mode. For multi-node and cluster layouts, see Scaling. For the loss math and algorithm knobs, see Algorithms.
AI agents working in this repo: the equivalent runbooks are at skills/training/ — top-level routing in skills/training/SKILL.md, launch details in skills/training/start-run/SKILL.md, and check-in / restart procedures in skills/training/monitor-run/SKILL.md.

Table of Contents

Entrypoints

CommandPurposeNotes
uv run rlWraps the trainer, orchestrator, and inference server in one launch from a merged TOML.The default for any RL run. Runs locally for single-node experiments; submits to SLURM for single- or multi-node when [slurm] is set (see Scaling § SLURM).
uv run sftSupervised fine-tuning on a HF dataset.Launches torchrun internally; never call torchrun directly.
uv run inferencevLLM server.Always use this entrypoint over vllm serve — it adds /update_weights, /load_lora_adapter, and /init_broadcaster.
uv run trainerStandalone trainer process group.Use only when launching the trainer separately from the orchestrator (e.g. multi-node RL without the rl wrapper).
uv run orchestratorStandalone orchestrator process.Pair with a separately-launched trainer + inference.

RL Trainer

Launch

The minimal RL run trains an SFT-warmed Qwen3-0.6B on the reverse-text task — the env is bundled with the verifiers submodule, so nothing else needs to be installed:
uv run rl @ examples/reverse_text/rl.toml

Useful Knobs

A condensed view of the knobs you’ll most often tune. For trainer-side parallelism, sampling, optimizer, and loss knobs see Scaling and Algorithms. Data and algorithm:
KnobWhat it does
orchestrator.batch_sizeTasks per trainer step.
orchestrator.group_sizeRollouts generated per task.
orchestrator.max_off_policy_stepsHow many distinct policies may have contributed to one rollout before it’s discarded (default 8). The main off-policy dial on long agentic rollouts — bump for throughput, lower for tighter on-policyness. Watch errored_rollouts and mismatch_kl/all/mean when tuning.
orchestrator.training_moderl (default), opd, or sft. See Training modes.
[[orchestrator.train.env]]Training environments. List multiple tables for multi-env training; weight them via ratio. See Configuration § Environments.
[[orchestrator.eval.env]] + orchestrator.eval.intervalEval environments and cadence (default every 100 steps).
Monitoring:
KnobWhat it does
log.levelProcess log level for trainer + orchestrator (info default; falls back to $PRIME_LOG_LEVEL). Set per-process via trainer.log.level / orchestrator.log.level, or globally on the rl entrypoint to propagate to both.
orchestrator.log.vf_levelEnv-worker / verifiers log level (info default; debug is noisy but useful for env debugging).
--wandb (+ --wandb.project, --wandb.name)Enable Weights & Biases logging. See Weights & Biases.
--orchestrator.prime-monitorStream metrics to the Prime Intellect platform (Prime Lab). See Platform monitoring.
Run management:
KnobWhat it does
--clean-output-dirWipe <output_dir> before starting. Useful when re-running an experiment with the same name during iteration.
--output-dir outputs/<name>Per-run output directory. Always set this when running more than one experiment in parallel.
--max-steps NStop after N trainer steps. Overrides the config value.
--dry-runResolve + validate the full config, write per-process TOMLs to <output_dir>/configs/, and exit without launching. The fastest way to debug a misbehaving config.

Training Modes (RL / OPD / SFT)

The RL entrypoint supports three training modes, switched via orchestrator.training_mode:
ModeStudentTeacherUse case
rlRequiredForbiddenStandard RL
opdRequiredRequired, must be vLLM (needs prompt_logprobs)On-policy distillation: student generates rollouts, trainer minimizes KL to teacher logprobs
sftRequiredRequired, any OpenAI-compatible endpointHard-distill: teacher generates rollouts, student trains on them
The rl entrypoint only manages student-policy inference. For OPD and (local-vLLM) SFT, start the teacher inference server manually and point [orchestrator.teacher.client] at it:
CUDA_VISIBLE_DEVICES=1 uv run inference \
  --model.name <teacher> --server.port 8001
The standalone uv run sft entrypoint is the more traditional SFT path — pure dataset-based, no teacher, no orchestrator. Use orchestrator.training_mode = "sft" only when you want a teacher to generate the supervision on the fly.

Important Metrics

Pulled from the console logs and mirrored to W&B. Progress (orchestrator):
  • reward/{all,env}/mean — main signal. Should trend upward over hundreds of steps.
  • seq_len/{all,env}/mean and is_truncated/{all,env}/mean — rollout length and truncation rate.
  • num_turns/{all,env}/mean — for multi-turn envs.
  • empty_rollouts/{all,env}, errored_rollouts/{all,env} — non-zero is fine in small numbers; sustained > 5% is a smell.
  • eval/{env}/{avg@k,pass@k} — eval scores when [orchestrator.eval] is set.
Stability (trainer):
  • mismatch_kl/{all,env}/{mean,std,max} — KL between trainer’s current policy and the (older) inference policy that generated the rollouts. A sustained, growing mean is the early-warning sign for off-policy collapse.
  • entropy/{all,env}/mean — too low means mode-collapse; too high means the model isn’t committing.
  • masked_advantage_{positive,negative}/mean — fraction of DPPO-masked tokens, split by sign.
  • optim/grad_norm — spikes precede divergence; check the loss config or lower the LR.
Performance (trainer + orchestrator step independently):
SourceMetricReading
trainertime/wait_for_batchhigh → orchestrator bottleneck
orchestratortime/wait_for_ckpthigh → trainer bottleneck

SFT Trainer

uv run sft runs supervised fine-tuning from a HF dataset. It shares model loaders, FSDP setup, checkpointing, and the chat-template plumbing with the RL trainer, so a typical workflow is SFT → RL → SFT → … without any reformatting.

Dataset Format

Two accepted layouts:
  • Prompt-completion: a HF dataset with prompt and completion columns (TRL format). The trainer masks out the prompt and computes loss only over the completion.
  • Messages: a HF dataset with a single messages column containing a list of chat turns. The trainer interprets the whole conversation as one sample, applies role-based loss masking, and trains over all assistant turns.
If both columns are present, messages takes precedence. Tool definitions. For tool-use SFT, add a tools column (OpenAI function-calling format) or tool_defs (verifiers rollout format). Each row’s value can be either a list of dicts or a JSON-encoded string of a list — both are accepted, and tool_defs rows are auto-converted to OAI shape before being passed into the chat template’s tools=... argument. The chat_template_kwargs column, if present, is forwarded verbatim into apply_chat_template. Position-dependent chat templates. Multi-turn SFT under the default tokenization path (build_incremental_token_mask) requires that tokenizing the first k turns of a conversation be a strict prefix of tokenizing all n ≥ k turns. Qwen3’s upstream template violates this — it strips past <think> blocks across user turns, silently corrupting the loss mask. Two fixes:
  • Enable the renderer (set a typed [renderer] config, e.g. name = "qwen3", recommended; defaults to "auto" for RL). The renderers package owns tokenization end-to-end and is robust to position-dependent templates. Hand-coded renderers ship for Qwen3, Qwen3.5, GLM-5, GLM-4.5, Kimi K2/K2.5, MiniMax M2, DeepSeek V3, Nemotron 3, GPT-OSS. Not supported for VLMs.
  • Patched chat template — the prime-rl–patched checkpoints (e.g. PrimeIntellect/Qwen3-0.6B, used in examples/reverse_text/sft.toml) ship a chat template that preserves thinking. Or supply your own.
See Algorithms § Multi-Turn Trajectories for the full picture.

Launch

The minimal SFT run trains Qwen3-0.6B on the reverse-text SFT dataset:
uv run sft @ examples/reverse_text/sft.toml --wandb
Multi-GPU and multi-node use torchrun under the hood (the sft entrypoint manages this for you — see Scaling § SFT and Torchrun for non-default layouts; multi-node SFT goes through SLURM).

SFT-Specific Knobs

KnobWhat it controls
data.nameHF dataset name or local path
data.batch_sizeTokens per trainer step (packed)
data.seq_lenPer-sample sequence length
loss_mask.*Which roles contribute to loss (system / user / assistant / tool).
val.intervalRun validation every N steps; val.data mirrors data

Important Metrics

Pulled from the console log and mirrored to W&B. Progress and loss:
  • loss/mean — main signal. Should decrease through the run.
  • val/loss — validation loss when [val] is set, logged every val.interval steps.
  • progress/epoch, progress/num_samples, progress/num_tokens — dataset progress.
  • progress/<subset>/ratio_{samples,tokens} — when training on multiple HF subsets/splits, the realized mixing ratio.
Stability and optimization:
  • optim/grad_norm — spikes precede divergence.
  • optim/lr, optim/zero_grad_ratio — LR schedule and the fraction of params that received zero gradients (high → dead path or wrong loss masking).
  • For MoE: max_vio/mean (load-balancing violation), routing_confidence/mean — both are logged when non-zero.
Performance:
MetricReading
perf/throughput, perf/throughput_per_gputokens/s overall and per GPU
perf/mfuMFU
perf/peak_memorypeak GPU memory (GiB)
time/step, time/forward_backward, time/save_ckptstep breakdown

Checkpointing

Checkpointing is split across processes because the orchestrator and trainer can be on different machines and on different steps at any given time. Inference is stateless.
ProcessWhat’s savedWhere
TrainerFSDP-sharded model (DCP), optimizer, scheduler, progress<output_dir>/checkpoints/step_N/trainer/
OrchestratorStep counter, total tokens / samples / problems<output_dir>/checkpoints/step_N/orchestrator/
Inferencenothing — re-pushed from the latest checkpoint on restartn/a
Trainer (HF weights)HF-compatible weight snapshot for serving<output_dir>/weights/step_N/

Enabling Checkpoints

Checkpointing is off by default to save disk. Enable it with --ckpt:
uv run rl @ rl.toml --ckpt                              # default: end-of-training only
uv run rl @ rl.toml --ckpt.interval 25                  # every 25 steps
uv run rl @ rl.toml --ckpt.interval 25 --ckpt.keep-last 3  # rolling window of 3
uv run rl @ rl.toml --ckpt.interval 25 --ckpt.keep-interval 100  # …plus permanent every 100

Resuming a Run

Re-run the same launch command and pass --ckpt.resume-step <N> (or -1 for “latest”). Make sure --max-steps is at least the target final step, not the remaining delta:
# First run: steps 0–10
uv run rl @ rl.toml --max-steps 10 --ckpt

# Resume: continue to step 20
uv run rl @ rl.toml --max-steps 20 --ckpt.resume-step 10

Serving Checkpoints

HF-compatible weight snapshots are written under <output_dir>/weights/step_N/ whenever a full checkpoint runs (or you can write weights-only via --ckpt.weights-only for cheaper snapshots). Upload directly:
uv run hf upload <user>/<model>-RL outputs/weights/step_100
For LoRA runs, set ckpt.weights.save_adapter_separately = true to also write the raw adapter alongside the merged weights — useful when serving the adapter through a separate /load_lora_adapter call.

Observability

Log Files

The launcher tees every process’s stdout/stderr into <output_dir>/logs/. The full layout (single-node runs skip the node_*.log and router_*.log files):
<output_dir>/logs/
├── trainer.log                  # rank 0 only; symlink → trainer/node_0.log on multi-node
├── orchestrator.log             # single instance, single file
├── inference.log                # symlink → inference/node_0.log on multi-node
├── trainer/
│   ├── node_*.log               # per-node trainer stdout (multi-node only)
│   └── torchrun/<rdzv>/attempt_0/<rank>/{stdout,stderr}.log   # per-rank
├── inference/
│   ├── node_*.log               # per-node inference stdout (multi-node only)
│   └── router_*.log             # vllm-router per replica (multi-node only)
└── envs/{train,eval}/<env_name>/
    ├── env_server.log
    └── env_worker_<id>.log
Env worker logs are the first place to look for env-side errors (most user code lives there). Verbosity is controlled by orchestrator.log.vf_level. For multi-rank trainer debugging, drop into logs/trainer/torchrun/<rdzv>/attempt_0/<rank>/{stdout,stderr}.log — verbose and per-rank. Live tailing from a single point (works on the head node for multi-node runs over a shared filesystem):
tail -F <output_dir>/logs/{trainer,orchestrator,inference}.log
tail -F <output_dir>/logs/trainer/node_*.log     # multi-node only
tail -F <output_dir>/logs/inference/router_*.log # multi-node only

Console Output

scripts/tmux.sh opens a 4-pane tmux session that follows trainer.log, orchestrator.log, inference.log, and the union of env worker logs. Start it before launching:
bash scripts/tmux.sh
# then in the Launcher window:
uv run rl @ ... --output-dir outputs/my-run
Pass -s <session> and -o <output_dir> to run multiple parallel experiments side-by-side in different sessions. The helper also works on a SLURM head node — bash scripts/tmux.sh my-rl-job /shared/outputs/my-rl-job.

Weights & Biases

W&B is off by default. Enable with --wandb:
uv run rl @ rl.toml --wandb                               # default project, random name
uv run rl @ rl.toml --wandb.project my-proj --wandb.name run-42
uv run rl @ rl.toml --no-wandb                            # force-disable even if the TOML enables it
The trainer and orchestrator log into a single shared W&B run, so all metrics from both processes land in one place. Shared mode requires the W&B SDK ≥ 0.19.9 and is incompatible with wandb.offline = true. By default, every 10 steps each process also logs a sample of prompts/completions (with rewards and advantages) and reward/advantage/entropy distributions as W&B tables. Tune via --wandb.log-extras.interval and --wandb.log-extras.sample-ratio, or disable subsets:
uv run rl @ rl.toml --wandb \
  --orchestrator.wandb.log-extras.interval 50 \
  --no-trainer.wandb.log-extras.distributions

Platform Monitoring

Register a run on the Prime Intellect platform (Prime Lab) and stream training metrics, samples, and distributions to the platform dashboard. Bare flag uses defaults:
uv run rl @ rl.toml --orchestrator.prime-monitor
Or set it in TOML:
[orchestrator.prime_monitor]
run_name = "my-experiment"
Requires PRIME_API_KEY (set via prime login or env var) and an allowlisted team. Currently internal-only.

Rules of Thumb

  • Start small. Run examples/reverse_text/rl.toml end-to-end on 2 GPUs before scaling. If the smoke run finishes cleanly, your install is good.
  • Batch size ≥ 64. Smaller batches give noisy gradient estimates and the trainer’s overhead-per-step dominates throughput. 64 is the practical floor; 128–512 is the range for quick ablations; production RL often runs at 1024+.
  • Group size ≥ 8. Bigger groups (orchestrator.group_size) make it more likely that a task produces a mix of high- and low-reward rollouts, which is what gives the trainer a usable signal — if all rollouts in a group succeed or all fail, the within-group advantage collapses to zero and the trainer learns nothing from that task. Bigger groups also tighten advantage normalization. 8 is the floor; 16–32 is common.
  • Pin output_dir per run. Sharing a directory across runs will mix rollouts and break resumes. --output-dir outputs/<unique-name> is the simplest discipline.
  • Use --dry-run before SLURM. Validators (e.g. CP needs flash-attention) fail fast in dry-run and slow in queue.