This page covers the math and the configurable algorithmic components: how off-policy training works, the default loss and advantage functions, how to plug in your own, the filters applied between rollout and training, and how multi-turn rollouts get merged into training samples.Documentation Index
Fetch the complete documentation index at: https://docs.primeintellect.ai/llms.txt
Use this file to discover all available pages before exploring further.
Table of Contents
- Async / Off-Policy Training
- Loss
- Advantage
- Filters
- Difficulty Pools
- Online Difficulty Filtering
- Multi-Turn Trajectories
Async / Off-Policy Training
prime-rl is asynchronous by default. The trainer and inference always run one step overlapped: while the trainer is producing from rollouts at step , inference is already generating the rollouts for step using . With matched trainer and inference step times this produces fully-overlapped pipeline parallelism — neither side ever idles.

- Trainer produces policy with weights from rollouts .
- Inference produces rollouts from policy .
Loss
Default Loss
The default RL loss is a DPPO policy-gradient term combined with a KL regularizer similar to Kimi-K2.5. For each prompt we sample a group of rollouts , score them to get , then optimize: where the policy-gradient term is and the KL regularizer penalizes drift between trainer and inference policies via the squared log importance ratio: is the policy that generated the rollout (inference), is the current policy (trainer), is the token-level advantage, is the importance-sampling clipping ratio, and is the KL temperature. Themin clamps the importance ratio from above so a stale rollout assigning very low probability to a high-reward token doesn’t produce a runaway gradient.
The knobs (under [trainer.loss] with type = "default"):
| Knob | Default | What it does |
|---|---|---|
dppo_mask_low / dppo_mask_high | 0.2 / 0.2 | Lower / upper thresholds for DPPO-style token-level masking. |
adv_tau | 1.0 | Temperature on the advantage term. Set to 0 for pure distillation (no RL signal). |
kl_tau | 1e-3 | Temperature on the KL regularizer. Set to 0 to disable. |
orchestrator.training_mode):
rlmode → DPPO + KL with the advantage signal.opdmode → KL distillation against the teacher’s per-token logprobs. The teacher must be a vLLM server (it’s the only one that exposesprompt_logprobs).sftmode → standard token-level NLL on teacher-generated rollouts.
[trainer.loss] type = "default" and configure via the knobs above. SFT and OPD modes ignore the policy-gradient–specific fields.
Custom Loss
The loss is computed per sequence: you write a function that takes one sequence’s tensors and returns a scalar loss. The trainer iterates and aggregates.metrics is averaged across sequences and logged with the other trainer metrics.
Advantage
Default Advantage
The default advantage is per-group reward minus per-group baseline (DR-GRPO without std normalization). For each prompt’s group ofgroup_size rollouts, every token in rollout receives advantage where is the group mean.
This is intentionally simple — it does the right thing for most envs. Switch to a custom advantage when you need group-aware shaping that depends on trajectory metadata (sub-agent rollouts, relative-rank shaping, …).
Two built-in length penalties can be layered on top of any advantage to discourage rambling:
[orchestrator.length_penalty] type = "tokens"— penalizes long completions in tokens, with configurable target and slope.[orchestrator.length_penalty] type = "turns"— penalizes long multi-turn rollouts by turn count.
Custom Advantage
Advantages are computed per group. You write a function that takes one group of rollouts and returns one advantage scalar per rollout. The orchestrator handles groups of varying size automatically — partial-group training kicks in when some rollouts in a group errored.AdvantageInputs.rollouts is a list of verifiers.RolloutOutput, so you have access to the full rollout (turns, tool calls, custom metadata) — not just the reward. Use this for anything reward-shaping-like that needs trajectory context.
Filters
Filters drop rollouts between scoring and training. Built-ins (composable):| Filter | Effect |
|---|---|
gibberish | Drops rollouts whose mean log-prob fall below a threshold — usually a sign of degenerate output. |
repetition | Drops rollouts with high n-gram repetition. |
zero_advantage | Drops rollouts whose advantage is zero, so the trainer doesn’t waste tokens on them. |
[orchestrator] config already includes all three filters with their defaults. To override, set filters explicitly — the list replaces the defaults wholesale:
Difficulty Pools
Difficulty pools gradually retire problems the model has solved or never solves. After each rollout, the average reward across a problem’s group is compared to two thresholds:buffer.easy_threshold— at or above this, the problem moves into theeasypool and is no longer sampled.buffer.hard_threshold— at or below this, the problem moves into thehardpool and is no longer sampled.- Otherwise the problem stays in
normaland remains in the sampling rotation.
easy_examples.jsonl / hard_examples.jsonl under each step’s orchestrator checkpoint). When you resume — or want to broaden the curriculum mid-run — buffer.easy_fraction / buffer.hard_fraction randomly lift that fraction of pooled problems back into normal so they re-enter sampling.
pool/{env}/{easy,normal,hard} (current pool ratios) and evicted_examples/{env}/{easy,hard} (per-step eviction rate).
Online Difficulty Filtering
Online difficulty filtering (ODF) drops collapsed-advantage groups on the way into the buffer. Setbuffer.online_difficulty_filtering = true (default false) to enable:
- Average reward across the group is 0.0 (every rollout failed) → drop the group, count under
filtered_rollouts/{env}/hard. - Average reward 1.0 (every rollout succeeded) → drop, count under
filtered_rollouts/{env}/easy. - Otherwise → into the buffer.
time/wait_for_batch high on the trainer), ODF can starve the loop. Bump orchestrator.oversampling_factor so inference produces enough groups per step to absorb the drops.
ODF is orthogonal to the pools: ODF reacts to the current group’s reward distribution, the pools track the running per-problem average. Many configs use both — ODF for per-step density, pools for long-horizon curriculum cleanup.
Multi-Turn Trajectories
Multi-turn rollouts (tool use, browser environments, long conversations) used to be stitched into a single fake “single-turn” sample, which silently corrupted the importance ratio when chat templates didn’t roundtrip. Sinceverifiers v0.1.8, prime-rl records each LLM request/response as an independent trajectory step and merges them at training time using best-effort interleaving — with renderers as the mechanism that keeps the merge safe by construction.
Extension Property
A sequence of trajectory steps has the extension property when each successive step’s prompt contains all previous prompts and completions as an exact prefix. The trainer relies on this property — when it holds:- Multiple steps merge into one training sample.
- Compute scales as in the trajectory length.
- Graceful fallback to multiple samples — no corrupted data.
- Worst case (every step breaks extension) is .
Best-Effort Interleaving
Concretely:U1, A1', U2 while A1' ≠ A1, the orchestrator can’t safely merge — either choice produces logprob drift between trainer and inference. Starting a fresh sample is the only correct behavior, so that’s what happens.
Renderers
Best-effort interleaving works because the renderer guarantees the exact-prefix invariant by construction — it never re-renders prior turns, so it can’t lose tokens to chat-template normalization, BPE retokenization drift, or thinking stripping. A renderer turns a model’s chat template into a Python object that can:render_ids(messages)— tokenize messages to ids the inference engine accepts.parse_response(completion_ids)— recover structured(content, reasoning_content, tool_calls)from sampled ids.bridge_to_next_turn(prev_prompt_ids, prev_completion_ids, new_messages)— extend the previous turn’s tokens verbatim with the new environment turn, instead of re-rendering history.
bridge_to_next_turn succeeds, the trainer sees the exact token stream the sampler produced; when it can’t be proven safe (e.g. the renderer is DefaultRenderer and the template’s stop sequence is unknown), it returns None and the orchestrator falls back to a full re-render — which triggers the new-sample fallback above.
A common source of breakage in the absence of a hand-coded renderer is models like Qwen3 whose chat templates strip past <think> blocks across user turns:
qwen3, qwen3-vl, qwen3.5, glm5, glm4.5, minimax-m2, deepseek-v3, kimi-k2, kimi-k2.5, nemotron-3, gpt-oss; anything else falls back to DefaultRenderer (a generic apply_chat_template wrapper). Pick one via:
apply_chat_template, when to write a hand-coded renderer), see the renderers writeup on the Prime Intellect blog — the canonical reference.