Every prime-rl entrypoint uses pydantic-config: TOML files for reproducible base configs, CLI flags for one-off overrides.
AI agents working in this repo: the equivalent runbook is at skills/configs/SKILL.md, with extra runtime hints (where config classes live, validator conventions, the trainer-side token_export flag) that aren't surfaced here.
Sources and Precedence
Field values come from three sources: Pydantic defaults, TOML files (passed with @), and CLI flags. They are layered in this order, with later sources winning:
- Defaults declared on the Pydantic model.
- TOML files passed with @, applied left to right; later files override earlier ones.
- CLI flags in dotted, kebab-case form (--model.name).
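A minimal sketch of that precedence order, assuming each source has already been flattened into a dict keyed by dotted field path. The resolve helper and all values are hypothetical; pydantic-config's real loader also handles nesting, validation, and aliases.

```python
def resolve(defaults, toml_layers, cli_overrides):
    """Layer config sources; later sources win on conflicting keys."""
    resolved = dict(defaults)        # 1. defaults from the Pydantic model
    for layer in toml_layers:        # 2. @ TOML files, left to right
        resolved.update(layer)
    resolved.update(cli_overrides)   # 3. CLI flags are applied last
    return resolved


# Hypothetical values, for illustration only.
defaults = {"model.name": "some-default", "inference.model.max_model_len": 4096}
base_toml = {"model.name": "Qwen3-0.6B"}
overlay_toml = {"inference.model.max_model_len": 8192}
cli = {"model.name": "Qwen3-1.7B"}

config = resolve(defaults, [base_toml, overlay_toml], cli)
print(config)
# {'model.name': 'Qwen3-1.7B', 'inference.model.max_model_len': 8192}
```

The CLI value for model.name beats the base TOML, and the overlay's max_model_len beats the default.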
TOML Composition
The @ token introduces a TOML file. Multiple @ arguments compose left to right and are deep-merged, so fields left unset in an overlay keep the base value.
Mind the space: @ path/to/x.toml, not @path/to/x.toml.
CLI Overrides
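CLI flags mirror the TOML tree using dots. For example, max_model_len in the [inference.model] table is set on the CLI with --inference.model.max-model-len. Field names are snake_case in TOML (max_model_len) and kebab-case on the CLI (--max-model-len).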
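The naming convention is mechanical enough to express as two small functions. These helpers are hypothetical and illustrate the mapping only; the real argument parsing is done by pydantic-config.

```python
def toml_path_to_flag(path):
    """Dotted TOML path -> CLI flag (snake_case segments become kebab-case)."""
    return "--" + ".".join(part.replace("_", "-") for part in path.split("."))


def flag_to_toml_path(flag):
    """CLI flag -> dotted TOML path (kebab-case segments become snake_case)."""
    return ".".join(part.replace("-", "_") for part in flag.removeprefix("--").split("."))


print(toml_path_to_flag("inference.model.max_model_len"))
# --inference.model.max-model-len
print(flag_to_toml_path("--inference.model.max-model-len"))
# inference.model.max_model_len
```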
Renamed fields keep their old name as a validation alias: for example, rollouts_per_example is still accepted in TOML and on the CLI after being renamed to group_size. Mixing the two names across sources is safe.
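One way to see why mixing names is safe: if each source's keys are normalized to the current name before layering, the usual "later source wins" rule still applies. This is a hypothetical sketch with flat keys. In Pydantic v2 such aliases are typically declared with Field(validation_alias=AliasChoices(...)), but this page does not say exactly how prime-rl wires them.

```python
# Hypothetical alias table: old field name -> current field name.
ALIASES = {"rollouts_per_example": "group_size"}


def normalize_aliases(layer):
    """Rename aliased keys in one source to the current field name."""
    return {ALIASES.get(key, key): value for key, value in layer.items()}


def resolve(layers):
    """Normalize each source first, then let later sources win."""
    resolved = {}
    for layer in layers:
        resolved.update(normalize_aliases(layer))
    return resolved


toml_layer = {"rollouts_per_example": 8}  # old name, e.g. in a TOML file
cli_layer = {"group_size": 16}            # new name, e.g. on the CLI
print(resolve([toml_layer, cli_layer]))   # {'group_size': 16}
```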
Inspecting and Validating
Syntax
Booleans
CLI uses paired flags: a bare --flag sets True and --no-flag sets False. TOML must be explicit: write flag = true or flag = false.
Lists
CLI accepts space-separated values or a JSON literal. TOML uses an array literal. Both forms target the same field.
Dicts
CLI takes a JSON literal. TOML uses a table or inline table. CLI dicts deep-merge with TOML dicts: CLI keys win on conflict but don't wipe the file's keys.
Optional Sub-Configs
Many sub-configs are typed SomeConfig | None. Two patterns enable them:
- Bare flag with defaults: --model.compile or, in TOML, an empty [model.compile] section. The sub-config materializes with all-default values.
- Enable and set fields together: --model.compile.fullgraph (CLI) or any populated [model.compile] table (TOML).
To disable one, pass --no-<name> on the CLI or assign the string "None" in TOML (see None). This is how [ckpt], [model.lora], [model.compile], [trainer.wandb], etc. are turned on and off.
None
TOML has no null. Use the string "None", which the loader coerces to None. The CLI accepts the same spelling, e.g. --inference.model.max-model-len None.
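The list, dict, and None rules in this Syntax section all describe how raw CLI tokens become a value. The sketch below is a hypothetical, type-blind approximation. The real parser knows each field's declared type, so it can tell a one-element list from a scalar and convert element types. This sketch cannot do either.

```python
import json


def coerce_cli_values(tokens):
    """Turn the raw tokens that follow one CLI flag into a Python value.

    Type-blind sketch of the rules on this page:
      "None"                         -> None
      one JSON list/object literal   -> parsed list/dict
      several space-separated tokens -> list of strings
      anything else                  -> the raw string
    """
    if len(tokens) == 1:
        raw = tokens[0]
        if raw == "None":
            return None
        if raw.startswith(("[", "{")):
            return json.loads(raw)
        return raw
    return list(tokens)


# Tokens as the shell would split them after some flag.
print(coerce_cli_values(["None"]))        # None
print(coerce_cli_values(["a", "b"]))      # ['a', 'b']
print(coerce_cli_values(['["a", "b"]']))  # ['a', 'b']
print(coerce_cli_values(['{"key": 1}']))  # {'key': 1}
```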
Discriminated Unions
Loss, advantage, optimizer, scheduler, weight broadcast transport, and several others are discriminated unions. Set the type field to pick a variant, or omit type to keep the default variant.
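The dispatch works roughly like this: the type value selects a class, and the remaining keys become that class's fields. The optimizer variants and their fields below are hypothetical, because this page does not list the real variant names. In Pydantic v2 this pattern is usually written as a union of models tagged with Field(discriminator="type").

```python
from dataclasses import dataclass


# Hypothetical variants; the real names and fields live in prime-rl's
# config classes and are not listed on this page.
@dataclass
class AdamWConfig:
    lr: float = 1e-6
    type: str = "adamw"


@dataclass
class SGDConfig:
    lr: float = 1e-6
    momentum: float = 0.0
    type: str = "sgd"


VARIANTS = {"adamw": AdamWConfig, "sgd": SGDConfig}
DEFAULT_TYPE = "adamw"  # hypothetical default variant


def build_union(table):
    """Pick the variant named by `type`, falling back to the default."""
    fields = dict(table)
    variant = VARIANTS[fields.pop("type", DEFAULT_TYPE)]  # KeyError if unknown
    return variant(**fields)


print(build_union({"type": "sgd", "momentum": 0.9}))  # an SGDConfig
print(build_union({"lr": 3e-6}))                      # type omitted: AdamWConfig
```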
Environments ([[orchestrator.train.env]])
Training environments are an array of tables: add one [[orchestrator.train.env]] entry per env, optionally with sampling weights.
args is forwarded verbatim to the environment’s load_environment(**args).
The same id can appear multiple times across train and eval (or with different args) — useful for evaluating on a held-out split of the env you’re training on, or comparing two configurations of the same env side by side. When id is reused, set a distinct name on each entry; name defaults to id and must be unique across all envs in the same group.
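The naming rule can be expressed as a small validation step over one group of envs. This is a hypothetical sketch, not prime-rl's validator. The "math" id and split arguments are illustrative.

```python
def assign_names(envs):
    """Default each env's name to its id and reject duplicates within a group."""
    names = [env.get("name", env["id"]) for env in envs]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate env names: {duplicates}; set a distinct name")
    return names


# Same id twice, disambiguated by name: accepted.
ok = [
    {"id": "math", "name": "math-easy", "args": {"split": "easy"}},
    {"id": "math", "name": "math-hard", "args": {"split": "hard"}},
]
print(assign_names(ok))  # ['math-easy', 'math-hard']

# Same id twice, both names defaulting to "math": rejected.
clash = [{"id": "math"}, {"id": "math", "args": {"split": "hard"}}]
try:
    assign_names(clash)
except ValueError as err:
    print(err)
```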
Examples
The shipped end-to-end examples in examples/ are the canonical, kept-up-to-date references. The rest of the repo's TOMLs (under configs/) are CI- and debug-internal and may drift. Each example directory has its own README with the full launch story.
Basic (1–8 GPUs):
- Reverse Text: Qwen3-0.6B reversing a chunk of text. Tiny single-turn SFT + RL; runs on a single consumer GPU in minutes.
- Wordle: Qwen3-1.7B playing Wordle. Multi-turn SFT + RL; 2–4 H100s.
- Alphabet Sort: Qwen3-4B-Instruct-2507 sorting names alphabetically. Multi-turn LoRA RL without SFT warmup; one H100.
- Wiki Search: Qwen3-4B-Instruct-2507 answering trivia by web-searching Wikipedia. Multi-turn with tool use.
- Hendrycks Sanity: DeepSeek-R1-Distill-Qwen-1.5B on a filtered MATH subset. Useful for algorithm ablations.
- Qwen 3 30B – A3B Math: Qwen3-30B-A3B on hard math.
- Qwen 3 30B – A3B SWE: Qwen3-30B-A3B on hard SWE.
- INTELLECT-3.1: reproduces our INTELLECT-3.1 training run.
- MiniMax-M2.5 SWE: MiniMax-M2.5 on agentic SWE.
- High-throughput GLM-5: GLM-5 with P/D disaggregation and FP8 inference.
Worked Example: Compose, Override, Dry-Run
Start from a shipped base config, override two fields on the CLI, and dry-run. The resolved per-component configs can then be run individually (for example, uv run trainer @ /tmp/reverse-dry/configs/trainer.toml). This is the easiest way to bisect a misbehaving config: dry-run a known-good base, dry-run your overlay, and diff the two.