Configuration

Every prime-rl entrypoint uses pydantic-config: TOML files for reproducible base configs, CLI flags for one-off overrides.

AI agents working in this repo: the equivalent runbook is at skills/configs/SKILL.md, with extra runtime hints (where config classes live, validator conventions, the trainer-side enable_token_export flag) that aren’t surfaced here.

Sources and Precedence
TOML Composition
CLI Overrides
Inspecting and Validating
Syntax
Examples

Sources and Precedence

Field values come from three sources — Pydantic defaults, TOML files (passed with @), and CLI flags. They’re layered in this order, with later sources winning:

Defaults declared on the Pydantic model.
TOML files passed with @, left to right — later files override earlier ones.
CLI flags in dotted, kebab-case form (--model.name).

TOML Composition

The @ token introduces a TOML file. Multiple @ arguments compose left-to-right, deep-merged — unset fields in an overlay keep the base value:

uv run rl @ examples/basic/reverse-text/rl.toml                      # one file
uv run rl @ base.toml @ overlay.toml                           # left to right
uv run rl --trainer @ trainer.toml --orchestrator @ orch.toml  # per-section
uv run rl @ base.toml --trainer @ trainer.toml                 # mixed

Mind the space: @ path/to/x.toml, not @path/to/x.toml.

CLI Overrides

CLI flags mirror the TOML tree using dots:

--max-steps 50                              # top-level
--model.name Qwen/Qwen3-4B                  # nested
--trainer.optim.lr 1e-5                     # double-nested
--inference.parallel.tp 4

Field names are snake_case in TOML (max_model_len) and kebab-case on the CLI (--max-model-len).

Renamed fields keep their old name as a validation alias — e.g. rollouts_per_example is still accepted in TOML and CLI after being renamed to group_size. Mixing the two names across sources is safe.

Inspecting and Validating

uv run rl --help                                       # full schema
uv run rl @ rl.toml --dry-run --output-dir /tmp/check  # write resolved configs

Syntax

Booleans

CLI uses paired flags: bare --flag sets True, --no-flag sets False. TOML must be explicit:

uv run rl @ rl.toml --clean-output-dir       # True
uv run rl @ rl.toml --no-clean-output-dir    # False

clean_output_dir = true

Lists

CLI accepts space-separated values or a JSON literal. TOML uses an array literal. Both forms target the same field:

uv run rl @ rl.toml --trainer.model.lora.target-modules q_proj k_proj v_proj
uv run rl @ rl.toml --trainer.model.lora.target-modules '["q_proj", "k_proj", "v_proj"]'

[trainer.model.lora]
target_modules = ["q_proj", "k_proj", "v_proj"]

Overlay TOMLs replace lists wholesale — an overlay that wants to add one item must still spell out the full list. For arrays of tables (e.g. environments), see Environments.

Dicts

CLI takes a JSON literal. TOML uses a table or inline-table. CLI dicts deep-merge with TOML dicts — CLI keys win on conflict but don’t wipe the file’s keys:

uv run rl @ rl.toml --orchestrator.train.env.0.args \
  '{"dataset_name": "openai/gsm8k", "dataset_subset": "main"}'

[[orchestrator.train.env]]
args = { dataset_name = "openai/gsm8k", dataset_subset = "main" }

Optional Sub-Configs

Many sub-configs are typed SomeConfig | None. Two patterns enable them:

Bare flag with defaults: --model.compile or, in TOML, an empty section [model.compile]. The sub-config materializes with all-default values.
Enable and set fields together: --model.compile.fullgraph (CLI) or any populated [model.compile] table (TOML).

To disable a sub-config that’s on by default, use --no-<name> on the CLI or assign the string "None" in TOML (see None). This is how [ckpt], [model.lora], [model.compile], [trainer.wandb], etc. are turned on and off.

None

TOML has no null. Use the string "None", which the loader coerces:

[inference.model]
max_model_len = "None"

On the CLI: --inference.model.max-model-len None.

Discriminated Unions

Loss, advantage, optimizer, scheduler, weight broadcast transport, and several others are discriminated unions. Set the type field to pick a variant:

[trainer.optim]
type = "muon"
lr = 1e-5
mu = 0.95

Omit type to keep the default variant.

Environments (`[[orchestrator.train.env]]`)

Training environments are an array of tables — set one per env, optionally with sampling weights:

[[orchestrator.train.env]]
name = "gsm8k"
env.taskset = { id = "gsm8k-v1", split = "train" }
env.agent.harness = { id = "null", runtime = { type = "subprocess" } }
ratio = 3  # 75% of batches

[[orchestrator.train.env]]
name = "reverse-text"
env.taskset = { id = "reverse-text-v1" }
env.agent.harness = { id = "null", runtime = { type = "subprocess" } }
ratio = 1  # default — 25% of batches

[[orchestrator.eval.env]]
name = "gsm8k-eval"
env.taskset = { id = "gsm8k-v1", split = "test" }
env.agent.harness = { id = "null", runtime = { type = "subprocess" } }

ratio defaults to 1 (equal weight per env); values are relative weights normalized to probabilities across envs. Everything environment lives under the env block (verifiers’ [env] shape): env.taskset configures the v1 taskset, and each agent is a field on the env — env.agent.harness selects how the single-agent env’s tasks are run, and per-run caps are per-agent (env.agent.max_turns, env.agent.timeout, env.agent.max_output_tokens). A multi-agent env declares its own seats (env.<role>.*). The same taskset can appear multiple times across train and eval (or with different settings) — useful for evaluating on a held-out split or comparing two configurations side by side. When it is reused, set a distinct name on each entry; name defaults to the taskset id and must be unique across all envs in the same group.

Environment Variables

OS environment variables exported into launched component process(es). In rl configs, top-level [env_vars] applies to trainer, inference, and orchestrator:

[env_vars]
HF_HUB_OFFLINE = "1"
TOKENIZERS_PARALLELISM = "false"

Component-specific tables layer on top:

[trainer.env_vars]
NCCL_DEBUG = "INFO"
PYTORCH_CUDA_ALLOC_CONF = "expandable_segments:False"

[inference.env_vars]
VLLM_USE_DEEP_GEMM = "1"

[orchestrator.env_vars]
PI_USAGE_BASE_URL = "https://..."

The rl launcher applies these the same way in both single-node and multi-node (SLURM) runs. Precedence, low to high:

The launcher’s own defaults — your env_vars override these.
Your top-level [env_vars].
Your [component.env_vars].
Orchestration-critical vars the launcher always sets last — CUDA_VISIBLE_DEVICES (GPU partitioning) and WANDB_SHARED_* (the single shared W&B run) — these cannot be overridden from env_vars.

For standalone sft and inference configs, [env_vars] applies to that entrypoint’s process(es). For disaggregated P/D inference, the role-specific deployment.{prefill,decode}_env_vars layer on top of any shared inference env vars.

Examples

The shipped end-to-end examples in examples/ are the canonical, kept-up-to-date references — the rest of the repo’s TOMLs (under configs/) are CI- and debug-internal and may drift. Each basic example directory has its own README with the full launch story; the advanced examples are config-only. Basic (1–8 GPUs):

Reverse Text — Qwen3-0.6B reversing a chunk of text. Tiny single-turn SFT + RL; runs on a single consumer GPU in minutes.
Wordle — Qwen3-1.7B playing Wordle. Multi-turn SFT + RL; 2–4 H100s.
Alphabet Sort — Qwen3-4B-Instruct-2507 sorting names alphabetically. Multi-turn LoRA RL without SFT warmup; one H100.
Wiki Search — Qwen3-4B-Instruct-2507 answering trivia by searching a Wikipedia corpus. Multi-turn with tool use.
Hendrycks Sanity — DeepSeek-R1-Distill-Qwen-1.5B on a filtered MATH subset. Useful for algorithm ablations.

Advanced (32–2048 GPUs, SLURM):

Qwen3-30B-A3B — Qwen3-30B-A3B on math, SWE, and tool use.
GLM-4.5-Air — GLM-4.5-Air on search, SWE, and terminal.
Nemotron-3-Super — Nemotron-3-Super-120B hybrid-Mamba MoE on SWE at 131k context.
MiniMax-M2.5 SWE — MiniMax-M2.5 on agentic SWE.
INTELLECT-3.1 — reproduces our INTELLECT-3.1 training run.
High-throughput GLM-5 — large-scale GLM-5/GLM-5.2 inference with P/D disaggregation and FP8.

Worked Example: Compose, Override, Dry-Run

Start from a shipped base config, override two fields on the CLI, and dry-run:

uv run rl @ examples/basic/reverse-text/rl.toml \
  --wandb.name my-experiment \
  --trainer.optim.lr 5e-6 \
  --output-dir /tmp/reverse-dry \
  --dry-run

Then inspect the resolved config:

ls /tmp/reverse-dry/configs/
# rl.toml  trainer.toml  orchestrator.toml  inference.toml

Each per-process TOML reflects the final, validated configuration that the actual run would consume — exactly what each process sees when started standalone (uv run trainer @ /tmp/reverse-dry/configs/trainer.toml, etc.). This is the easiest way to bisect a misbehaving config: dry-run a known-good base, dry-run your overlay, diff the two.

Getting Started

Lab

Libraries

Compute

Table of Contents

Sources and Precedence

TOML Composition

CLI Overrides

Inspecting and Validating

Syntax

Booleans

Lists

Dicts

Optional Sub-Configs

None

Discriminated Unions

Environments (`[[orchestrator.train.env]]`)

Environment Variables

Examples

Worked Example: Compose, Override, Dry-Run

​Table of Contents

​Sources and Precedence

​TOML Composition

​CLI Overrides

​Inspecting and Validating

​Syntax

​Booleans

​Lists

​Dicts

​Optional Sub-Configs

​None

​Discriminated Unions

​Environments ([[orchestrator.train.env]])

​Environment Variables

​Examples

​Worked Example: Compose, Override, Dry-Run

Table of Contents

Sources and Precedence

TOML Composition

CLI Overrides

Inspecting and Validating

Syntax

Booleans

Lists

Dicts

Optional Sub-Configs

None

Discriminated Unions

Environments (`[[orchestrator.train.env]]`)

Environment Variables

Examples

Worked Example: Compose, Override, Dry-Run