> ## Documentation Index
> Fetch the complete documentation index at: https://docs.primeintellect.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Training

This page covers everything you need to launch, observe, checkpoint, and recover a `prime-rl` training run — the RL trainer (and the distillation algorithms that run through it) and the SFT trainer. For multi-node and cluster layouts, see [Scaling](/prime-rl/scaling). For the loss math and algorithm knobs, see [Algorithms](/prime-rl/algorithms).

> **AI agents working in this repo:** the equivalent runbooks are at [`skills/training/`](https://github.com/PrimeIntellect-ai/prime-rl/tree/main/skills/training) — top-level routing in [`skills/training/SKILL.md`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/skills/training/SKILL.md), launch details in [`skills/training/start-run/SKILL.md`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/skills/training/start-run/SKILL.md), and check-in / restart procedures in [`skills/training/monitor-run/SKILL.md`](https://github.com/PrimeIntellect-ai/prime-rl/blob/main/skills/training/monitor-run/SKILL.md).

## Table of Contents

* [Entrypoints](#entrypoints)
* [RL Trainer](#rl-trainer)
  * [Launch](#launch)
  * [Useful Knobs](#useful-knobs)
  * [Algorithms](#algorithms)
  * [Important Metrics](#important-metrics)
* [SFT Trainer](#sft-trainer)
  * [Dataset Format](#dataset-format)
  * [Launch](#launch-1)
  * [SFT-Specific Knobs](#sft-specific-knobs)
  * [Important Metrics](#important-metrics-1)
* [Checkpointing](#checkpointing)
  * [Enabling Checkpoints](#enabling-checkpoints)
  * [Resuming a Run](#resuming-a-run)
  * [Serving Checkpoints](#serving-checkpoints)
* [Observability](#observability)
  * [Log Files](#log-files)
  * [Console Output](#console-output)
  * [Weights & Biases](#weights--biases)
  * [Platform Monitoring](#platform-monitoring)
* [Rules of Thumb](#rules-of-thumb)

## Entrypoints

| Command               | Purpose                                                                                 | Notes                                                                                                                                                                                    |
| --------------------- | --------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `uv run rl`           | Wraps the trainer, orchestrator, and inference server in one launch from a merged TOML. | The default for any RL run. Runs locally for single-node experiments; submits to SLURM for single- or multi-node when `[slurm]` is set (see [Scaling § SLURM](/prime-rl/scaling#slurm)). |
| `uv run sft`          | Supervised fine-tuning on a HF dataset.                                                 | Launches torchrun internally; never call torchrun directly.                                                                                                                              |
| `uv run inference`    | vLLM server.                                                                            | Always use this entrypoint over `vllm serve` — it adds `/update_weights`, `/load_lora_adapter`, and `/init_broadcaster`.                                                                 |
| `uv run trainer`      | Standalone trainer process group.                                                       | Use only when launching the trainer separately from the orchestrator (e.g. multi-node RL without the `rl` wrapper).                                                                      |
| `uv run orchestrator` | Standalone orchestrator process.                                                        | Pair with a separately-launched trainer + inference.                                                                                                                                     |

## RL Trainer

### Launch

The minimal RL run trains an SFT-warmed `Qwen3-0.6B` on the `reverse-text` task — the env is bundled with the [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) submodule, so nothing else needs to be installed:

```bash theme={null}
uv run rl @ examples/reverse_text/rl.toml
```

### Useful Knobs

A condensed view of the knobs you'll most often tune. For trainer-side parallelism, sampling, optimizer, and loss knobs see [Scaling](/prime-rl/scaling) and [Algorithms](/prime-rl/algorithms).

**Data and algorithm:**

| Knob                                                       | What it does                                                                                                                                                                                                                                                                 |
| ---------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `orchestrator.batch_size`                                  | Tasks per trainer step.                                                                                                                                                                                                                                                      |
| `orchestrator.group_size`                                  | Rollouts generated per task.                                                                                                                                                                                                                                                 |
| `orchestrator.max_off_policy_steps`                        | How many distinct policies may have contributed to one rollout before it's discarded (default 8). The main off-policy dial on long agentic rollouts — bump for throughput, lower for tighter on-policyness. Watch `errored_rollouts` and `mismatch_kl/all/mean` when tuning. |
| `[orchestrator.algo]`                                      | Training algorithm — its `type` names it (`grpo` default, `max_rl`, `opd`, `opsd`, `sft`, `echo`). See [Algorithms](#algorithms).                                                                                                                                            |
| `[[orchestrator.train.env]]`                               | Training environments. List multiple tables for multi-env training; weight them via `ratio`. See [Configuration § Environments](/prime-rl/configuration#environments-orchestratortrainenv).                                                                                  |
| `[[orchestrator.eval.env]]` + `orchestrator.eval.interval` | Eval environments and cadence (default every 100 steps).                                                                                                                                                                                                                     |

**Monitoring:**

| Knob                                            | What it does                                                                                                                                                                                                                  |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `log.level`                                     | Process log level for trainer + orchestrator (`info` default; falls back to `$PRIME_LOG_LEVEL`). Set per-process via `trainer.log.level` / `orchestrator.log.level`, or globally on the `rl` entrypoint to propagate to both. |
| `orchestrator.log.vf_level`                     | Env-worker / [`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) log level (`info` default; `debug` is noisy but useful for env debugging).                                                                         |
| `--wandb` (+ `--wandb.project`, `--wandb.name`) | Enable Weights & Biases logging. See [Weights & Biases](#weights--biases).                                                                                                                                                    |
| `--orchestrator.prime-monitor`                  | Stream metrics to the Prime Intellect platform (Prime Lab). See [Platform monitoring](#platform-monitoring).                                                                                                                  |

**Run management:**

| Knob                          | What it does                                                                                                                                                       |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `--clean-output-dir`          | Wipe `<output_dir>` before starting. Useful when re-running an experiment with the same name during iteration.                                                     |
| `--output-dir outputs/<name>` | Per-run output directory. Always set this when running more than one experiment in parallel.                                                                       |
| `--max-steps N`               | Stop after `N` trainer steps. Overrides the config value.                                                                                                          |
| `--dry-run`                   | Resolve + validate the full config, write per-process TOMLs to `<output_dir>/configs/`, and exit without launching. The fastest way to debug a misbehaving config. |

### Algorithms

The RL entrypoint supports several training algorithms, switched via `[orchestrator.algo]`'s `type` (see [Algorithms](/prime-rl/algorithms#the-algorithm-abstraction) for the full reference, model references, and per-algorithm customization):

| `algo.type`      | Frozen model                                                | Use case                                                                                                                                                                           |
| ---------------- | ----------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `grpo` (default) | None                                                        | Standard group-relative RL                                                                                                                                                         |
| `max_rl`         | None                                                        | [MaxRL](https://arxiv.org/abs/2602.02710): GRPO with mean-normalized advantages (maximum-likelihood RL)                                                                            |
| `opd`            | Required, must be vLLM (needs `prompt_logprobs`)            | [On-policy distillation](https://thinkingmachines.ai/blog/on-policy-distillation/): the policy generates rollouts, the trainer minimizes per-token reverse KL to a reference model |
| `sft`            | Required, any OpenAI-compatible endpoint                    | Hard-distill: a frozen model generates rollouts, the policy trains on its tokens                                                                                                   |
| `opsd`           | None — the live policy is its own reference (no deployment) | [SDFT](https://arxiv.org/abs/2601.19897): the model is its own reference conditioned on expert demonstrations                                                                      |
| `echo`           | None                                                        | GRPO plus cross-entropy on env-observation tokens                                                                                                                                  |

A new algorithm is a named class in code, not a config — see [Algorithms § Authoring an Algorithm](/prime-rl/algorithms#authoring-an-algorithm).

Frozen models are declared inline on the algorithm, named where the model is used — `[orchestrator.algo.teacher]` for `opd` (the frozen model scored against), `[orchestrator.algo.sampling.source]` for `sft` (the model it samples from) — each with `name` + `base_url`. `opsd` declares no frozen model: it self-distills against the live policy. The `rl` entrypoint only manages policy inference — start frozen-model servers yourself and point `base_url` at them:

```bash theme={null}
CUDA_VISIBLE_DEVICES=1 uv run inference \
  --model.name <frozen-model> --server.port 8001
```

The standalone `uv run sft` entrypoint is the more traditional SFT path — pure dataset-based, no orchestrator. Use the `sft` algorithm only when you want a frozen model to generate the supervision on the fly.

### Important Metrics

Pulled from the console logs and mirrored to W\&B.

**Progress** (orchestrator):

* `reward/{all,env}/mean` — main signal. Should trend upward over hundreds of steps.
* `seq_len/{all,env}/mean` and `is_truncated/{all,env}/mean` — rollout length and truncation rate.
* `num_turns/{all,env}/mean` — for multi-turn envs.
* `empty_rollouts/{all,env}`, `errored_rollouts/{all,env}` — non-zero is fine in small numbers; sustained > 5% is a smell.
* `eval/{env}/{avg@k,pass@k}` — eval scores when `[orchestrator.eval]` is set.

**Stability** (trainer):

* `mismatch_kl/{all,env}/{mean,std,max}` — KL between trainer's current policy and the (older) inference policy that generated the rollouts. A sustained, growing mean is the early-warning sign for off-policy collapse.
* `entropy/{all,env}/mean` — too low means mode-collapse; too high means the model isn't committing.
* `masked_advantage_{positive,negative}/mean` — fraction of DPPO-masked tokens, split by sign.
* `optim/grad_norm` — spikes precede divergence; check the loss config or lower the LR.

**Performance** (trainer + orchestrator step independently):

| Source       | Metric                | Reading                            |
| ------------ | --------------------- | ---------------------------------- |
| trainer      | `time/wait_for_batch` | **high → orchestrator bottleneck** |
| orchestrator | `time/wait_for_ckpt`  | **high → trainer bottleneck**      |

## SFT Trainer

`uv run sft` runs supervised fine-tuning from a HF dataset. It shares model loaders, FSDP setup, checkpointing, and the chat-template plumbing with the RL trainer, so a typical workflow is *SFT → RL → SFT → …* without any reformatting.

### Dataset Format

Two accepted layouts:

* **Prompt-completion**: a HF dataset with `prompt` and `completion` columns ([TRL format](https://huggingface.co/docs/trl/en/dataset_formats#prompt-completion)). The trainer masks out the prompt and computes loss only over the completion.
* **Messages**: a HF dataset with a single `messages` column containing a list of chat turns. The trainer interprets the whole conversation as one sample, applies role-based loss masking, and trains over all assistant turns.

If both columns are present, `messages` takes precedence.

**Tool definitions.** For tool-use SFT, add a `tools` column (OpenAI function-calling format) or `tool_defs` ([`verifiers`](https://github.com/PrimeIntellect-ai/verifiers) rollout format). Each row's value can be either a list of dicts or a JSON-encoded string of a list — both are accepted, and `tool_defs` rows are auto-converted to OAI shape before being passed into the chat template's `tools=...` argument. The `chat_template_kwargs` column, if present, is forwarded verbatim into `apply_chat_template`.

**Position-dependent chat templates.** Multi-turn SFT under the default tokenization path (`build_incremental_token_mask`) requires that tokenizing the first *k* turns of a conversation be a strict prefix of tokenizing all *n ≥ k* turns. Qwen3's upstream template *violates* this — it strips past `<think>` blocks across user turns, silently corrupting the loss mask. Two fixes:

* **Enable the renderer** (set a typed `[renderer]` config, e.g. `name = "qwen3"`, recommended; defaults to `"auto"` for RL). The [`renderers`](/prime-rl/algorithms#renderers) package owns tokenization end-to-end and is robust to position-dependent templates. Hand-coded renderers ship for Qwen3, Qwen3.5, GLM-5, GLM-4.5, Kimi K2/K2.5, MiniMax M2, DeepSeek V3, Nemotron 3, GPT-OSS. Not supported for VLMs.
* **Patched chat template** — the prime-rl–patched checkpoints (e.g. `PrimeIntellect/Qwen3-0.6B`, used in `examples/reverse_text/sft.toml`) ship a chat template that preserves thinking. Or supply your own.

See [Algorithms § Multi-Turn Trajectories](/prime-rl/algorithms#multi-turn-trajectories) for the full picture.

### Launch

The minimal SFT run trains `Qwen3-0.6B` on the `reverse-text` SFT dataset:

```bash theme={null}
uv run sft @ examples/reverse_text/sft.toml --wandb
```

Multi-GPU and multi-node use torchrun under the hood (the `sft` entrypoint manages this for you — see [Scaling § SFT and Torchrun](/prime-rl/scaling#sft-and-torchrun) for non-default layouts; multi-node SFT goes through [SLURM](/prime-rl/scaling#slurm)).

### SFT-Specific Knobs

| Knob              | What it controls                                                   |
| ----------------- | ------------------------------------------------------------------ |
| `data.name`       | HF dataset name or local path                                      |
| `data.batch_size` | Tokens per trainer step (packed)                                   |
| `data.seq_len`    | Per-sample sequence length                                         |
| `loss_mask.*`     | Which roles contribute to loss (system / user / assistant / tool). |
| `val.interval`    | Run validation every N steps; `val.data` mirrors `data`            |

### Important Metrics

Pulled from the console log and mirrored to W\&B.

**Progress and loss:**

* `loss/mean` — main signal. Should decrease through the run.
* `val/loss` — validation loss when `[val]` is set, logged every `val.interval` steps.
* `progress/epoch`, `progress/num_samples`, `progress/num_tokens` — dataset progress.
* `progress/<subset>/ratio_{samples,tokens}` — when training on multiple HF subsets/splits, the realized mixing ratio.

**Stability and optimization:**

* `optim/grad_norm` — spikes precede divergence.
* `optim/lr`, `optim/zero_grad_ratio` — LR schedule and the fraction of params that received zero gradients (high → dead path or wrong loss masking).
* For MoE: `max_vio/mean` (load-balancing violation), `routing_confidence/mean` — both are logged when non-zero.

**Performance:**

| Metric                                                 | Reading                      |
| ------------------------------------------------------ | ---------------------------- |
| `perf/throughput`, `perf/throughput_per_gpu`           | tokens/s overall and per GPU |
| `perf/mfu`                                             | MFU                          |
| `perf/peak_memory`                                     | peak GPU memory (GiB)        |
| `time/step`, `time/forward_backward`, `time/save_ckpt` | step breakdown               |

## Checkpointing

Checkpointing is split across processes because the orchestrator and trainer can be on different machines and on different steps at any given time. Inference is stateless.

| Process              | What's saved                                                | Where                                           |
| -------------------- | ----------------------------------------------------------- | ----------------------------------------------- |
| Trainer              | FSDP-sharded model (DCP), optimizer, scheduler, progress    | `<output_dir>/checkpoints/step_N/trainer/`      |
| Orchestrator         | Step counter, total tokens / samples / problems             | `<output_dir>/checkpoints/step_N/orchestrator/` |
| Inference            | *nothing* — re-pushed from the latest checkpoint on restart | n/a                                             |
| Trainer (HF weights) | HF-compatible weight snapshot for serving                   | `<output_dir>/weights/step_N/`                  |

### Enabling Checkpoints

Checkpointing is **off by default** to save disk. Enable it with `--ckpt`:

```bash theme={null}
uv run rl @ rl.toml --ckpt                              # default: end-of-training only
uv run rl @ rl.toml --ckpt.interval 25                  # every 25 steps
uv run rl @ rl.toml --ckpt.interval 25 --ckpt.keep-last 3  # rolling window of 3
uv run rl @ rl.toml --ckpt.interval 25 --ckpt.keep-interval 100  # …plus permanent every 100
```

### Resuming a Run

Re-run the same launch command and pass `--ckpt.resume-step <N>` (or `-1` for "latest"). Make sure `--max-steps` is at least the target final step, not the remaining delta:

```bash theme={null}
# First run: steps 1–10
uv run rl @ rl.toml --max-steps 10 --ckpt

# Resume: continue to step 20
uv run rl @ rl.toml --max-steps 20 --ckpt.resume-step 10
```

### Serving Checkpoints

HF-compatible weight snapshots are written under `<output_dir>/weights/step_N/` whenever a full checkpoint runs (or you can write weights-only via `--ckpt.weights-only` for cheaper snapshots). Upload directly:

```bash theme={null}
uv run hf upload <user>/<model>-RL outputs/weights/step_100
```

For LoRA runs, set `ckpt.weights.save_adapter_separately = true` to also write the raw adapter alongside the merged weights — useful when serving the adapter through a separate `/load_lora_adapter` call.

## Observability

### Log Files

The launcher tees every process's stdout/stderr into `<output_dir>/logs/`. The full layout (single-node runs skip the `node_*.log` and `router_*.log` files):

```
<output_dir>/logs/
├── trainer.log                  # rank 0 only; symlink → trainer/node_0.log on multi-node
├── orchestrator.log             # single instance, single file
├── inference.log                # symlink → inference/node_0.log on multi-node
├── trainer/
│   ├── node_*.log               # per-node trainer stdout (multi-node only)
│   └── torchrun/<rdzv>/attempt_0/<rank>/{stdout,stderr}.log   # per-rank
├── inference/
│   ├── node_*.log               # per-node inference stdout (multi-node only)
│   └── router_*.log             # vllm-router per replica (multi-node only)
└── envs/{train,eval}/<env_name>/
    ├── env_server.log
    └── env_worker_<id>.log
```

Env worker logs are the first place to look for env-side errors (most user code lives there). Verbosity is controlled by `orchestrator.log.vf_level`. For multi-rank trainer debugging, drop into `logs/trainer/torchrun/<rdzv>/attempt_0/<rank>/{stdout,stderr}.log` — verbose and per-rank.

Live tailing from a single point (works on the head node for multi-node runs over a shared filesystem):

```bash theme={null}
tail -F <output_dir>/logs/{trainer,orchestrator,inference}.log
tail -F <output_dir>/logs/trainer/node_*.log     # multi-node only
tail -F <output_dir>/logs/inference/router_*.log # multi-node only
```

### Console Output

`scripts/tmux.sh` opens a 4-pane tmux session that follows `trainer.log`, `orchestrator.log`, `inference.log`, and the union of env worker logs. Start it before launching:

```bash theme={null}
bash scripts/tmux.sh
# then in the Launcher window:
uv run rl @ ... --output-dir outputs/my-run
```

Pass `-s <session>` and `-o <output_dir>` to run multiple parallel experiments side-by-side in different sessions. The helper also works on a SLURM head node — `bash scripts/tmux.sh my-rl-job /shared/outputs/my-rl-job`.

### Weights & Biases

W\&B is off by default. Enable with `--wandb`:

```bash theme={null}
uv run rl @ rl.toml --wandb                               # default project, random name
uv run rl @ rl.toml --wandb.project my-proj --wandb.name run-42
uv run rl @ rl.toml --no-wandb                            # force-disable even if the TOML enables it
```

The trainer and orchestrator log into a **single shared W\&B run**, so all metrics from both processes land in one place. Shared mode requires the W\&B SDK ≥ 0.19.9 and is incompatible with `wandb.offline = true`.

By default, every 10 steps each process also logs a sample of prompts/completions (with rewards and advantages) and reward/advantage/entropy distributions as W\&B tables. Tune via `--wandb.log-extras.interval` and `--wandb.log-extras.sample-ratio`, or disable subsets:

```bash theme={null}
uv run rl @ rl.toml --wandb \
  --orchestrator.wandb.log-extras.interval 50 \
  --no-trainer.wandb.log-extras.distributions
```

prime-rl deliberately logs a **large number of metrics** for maximum observability: every rollout metric is emitted per subset (`all`/`effective`), per statistic (`mean`/`max`/`min`/`p10`/`p90`), and per environment alongside a cross-env aggregate, so a multi-env run can emit thousands of series. To keep that navigable, W\&B mode **auto-creates an `overview` saved view** on the first run into a project — curating the handful of metrics that matter into `train`, `eval`, `stability`, and `performance` sections (with per-env breakdowns). The view is created once per project and adapts to the run's environments; if a later run uses a different set of environments, a new versioned view (`overview-v2`, …) is created instead of overwriting the first.

### Platform Monitoring

Register a run on the Prime Intellect platform (Prime Lab) and stream training metrics, samples, and distributions to the platform dashboard. Bare flag uses defaults:

```bash theme={null}
uv run rl @ rl.toml --orchestrator.prime-monitor
```

Or set it in TOML:

```toml theme={null}
[orchestrator.prime_monitor]
run_name = "my-experiment"
```

Requires `PRIME_API_KEY` (set via `prime login` or env var) and an allowlisted team. Currently internal-only.

## Rules of Thumb

* **Start small.** Run `examples/reverse_text/rl.toml` end-to-end on 2 GPUs before scaling. If the smoke run finishes cleanly, your install is good.
* **Batch size ≥ 64.** Smaller batches give noisy gradient estimates and the trainer's overhead-per-step dominates throughput. 64 is the practical floor; 128–512 is the range for quick ablations; production RL often runs at 1024+.
* **Group size ≥ 8.** Bigger groups (`orchestrator.group_size`) make it more likely that a task produces a mix of high- and low-reward rollouts, which is what gives the trainer a usable signal — if all rollouts in a group succeed or all fail, the within-group advantage collapses to zero and the trainer learns nothing from that task. Bigger groups also tighten advantage normalization. 8 is the floor; 16–32 is common.
* **Pin `output_dir` per run.** Sharing a directory across runs will mix rollouts and break resumes. `--output-dir outputs/<unique-name>` is the simplest discipline.
* **Use `--dry-run` before SLURM.** Validators (e.g. CP needs flash-attention) fail fast in dry-run and slow in queue.