Verifiers v0.1.8 introduced trajectory-based rollouts, where each LLM request/response pair in a multi-turn interaction is recorded as an independent step. For details on this design decision, see the design document in the verifiers repository.
Best-Effort Interleaved Rollouts
PRIME-RL uses a best-effort interleaving strategy that automatically merges consecutive trajectory steps when possible and starts a new training sample whenever the extension property breaks.

The Extension Property
A sequence of trajectory steps has the extension property when each successive step's prompt contains all previous prompts and completions as a prefix. When this holds:

- Multiple steps can be merged into a single training sample
- Compute scales as O(T) for a trajectory of length T

When the extension property breaks:

- A new training sample is started from that step
- Compute scales as O(T²) in the worst case (every step breaks extension)
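As a rough sketch, the extension property between two consecutive steps can be checked message-by-message. The function name and the message representation below are illustrative, not part of the PRIME-RL API:

```python
def has_extension_property(prev_prompt, prev_completion, next_prompt):
    """True if next_prompt begins with prev_prompt + prev_completion,
    message for message (the extension property)."""
    prefix = prev_prompt + prev_completion
    return next_prompt[:len(prefix)] == prefix

u1 = {"role": "user", "content": "What is 2+2?"}
a1 = {"role": "assistant", "content": "4"}
u2 = {"role": "user", "content": "And 3+3?"}

# Turn 2 replays turn 1's prompt and completion verbatim: extension holds.
assert has_extension_property([u1], [a1], [u1, a1, u2])

# Turn 2 replays a modified assistant message: extension breaks.
a1_mod = {"role": "assistant", "content": "Four."}
assert not has_extension_property([u1], [a1], [u1, a1_mod, u2])
```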
How It Works
- When extension holds: O(T) compute, single merged sample
- When extension breaks: graceful fallback, no corrupted data
- Mixed scenarios: optimal merging where possible
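A minimal sketch of the merging logic, assuming steps are (prompt, completion) pairs of message lists (this is an illustration of the strategy, not the PRIME-RL implementation):

```python
def interleave_best_effort(steps):
    """Greedily merge consecutive steps into training samples.

    A new sample is started whenever a step's prompt does not extend the
    running sample (the extension property breaks).
    """
    samples = []
    current = None  # flat message list for the sample being built
    for prompt, completion in steps:
        if current is not None and prompt[:len(current)] == current:
            # Extension holds: the prompt replays the sample so far, so we
            # can keep growing a single merged sample.
            current = prompt + completion
        else:
            # Extension broken (or first step): start a new training sample.
            if current is not None:
                samples.append(current)
            current = prompt + completion
    if current is not None:
        samples.append(current)
    return samples

u1 = {"role": "user", "content": "U1"}
a1 = {"role": "assistant", "content": "A1"}
u2 = {"role": "user", "content": "U2"}
a2 = {"role": "assistant", "content": "A2"}

# Extension holds at turn 2: one merged sample.
assert len(interleave_best_effort([([u1], [a1]), ([u1, a1, u2], [a2])])) == 1

# Turn 2 replays a modified A1: graceful fallback to two samples.
a1_mod = {"role": "assistant", "content": "A1'"}
assert len(interleave_best_effort([([u1], [a1]), ([u1, a1_mod, u2], [a2])])) == 2
```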
The Exact Prefix Invariant
Interleaving enforces a strict invariant: the prompt at each turn must be the exact concatenation of the prior messages, exactly as the LLM originally generated them. We call this the "exact prefix" invariant. For example, at turn 2 the LLM should see U1, A1, U2 as the prompt, where U1 exactly matches the user message from turn 1 and A1 exactly matches the assistant message produced in turn 1. Any violation of this invariant causes downstream problems when computing the importance sampling ratio during training. For example, suppose that at turn 2 the prompt is U1, A1', U2, where A1' differs from A1. It is then unclear whether to add A1 or A1' to the interleaved rollout:
- If we add A1’, the logprobs from turn 1 might be off because the inference LLM produced A1 but the trainer LLM is computing logprobs for A1’
- If we add A1, the logprobs from turn 2 might be off because the inference LLM is attending to A1’ but the trainer LLM is attending to A1
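The A1 vs. A1' dilemma can be made concrete with a small script (the message contents are placeholders mirroring the example above):

```python
u1 = {"role": "user", "content": "U1"}
a1 = {"role": "assistant", "content": "A1"}       # what the inference LLM actually sampled
a1_mod = {"role": "assistant", "content": "A1'"}  # what the environment replayed at turn 2
u2 = {"role": "user", "content": "U2"}

turn2_prompt = [u1, a1_mod, u2]

# The exact prefix invariant requires turn 2's prompt to begin with
# [U1, A1] verbatim; here it does not, so the invariant is violated.
assert turn2_prompt[:2] != [u1, a1]

# No interleaved rollout is consistent now: including A1' mismatches the
# tokens whose logprobs were recorded at turn 1, while including A1
# mismatches what the inference LLM actually attended to at turn 2.
# The only safe option is to start a new training sample at turn 2.
```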
Arbitrary Chat Templates
Some chat templates add, modify, or remove tokens across turns. A good example is the chat template of the Qwen3 series of models, which strips thinking content from earlier turns, so a later prompt is no longer an exact extension of the earlier completion.

Discontinuous Trajectories by Design
Some multi-turn environments are intentionally discontinuous. For example, in a sub-agent calling scenario:

- Main agent receives a task and decides to delegate to a sub-agent
- Sub-agent runs independently (possibly multiple turns with its own context)
- Control returns to main agent with only the sub-agent’s final result
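Such a trajectory necessarily breaks the extension property at each hand-off. A sketch with illustrative messages (these are not PRIME-RL data structures) shows why:

```python
# Each step is a (prompt, completion) pair of message lists.
task    = {"role": "user", "content": "Summarize these reports."}
deleg   = {"role": "assistant", "content": "Delegating to the summarizer sub-agent."}
sub_in  = {"role": "user", "content": "Summarize report 1 ..."}   # sub-agent's own context
sub_out = {"role": "assistant", "content": "Report 1 says ..."}
result  = {"role": "user", "content": "Sub-agent result: ..."}
final   = {"role": "assistant", "content": "Overall summary: ..."}

steps = [
    ([task], [deleg]),                 # main agent, turn 1
    ([sub_in], [sub_out]),             # sub-agent step: fresh context
    ([task, deleg, result], [final]),  # main agent, turn 2: only the final result
]

# The sub-agent's prompt does not extend the main agent's first step ...
assert steps[1][0][:2] != steps[0][0] + steps[0][1]
# ... and the main agent's second prompt does not extend the sub-agent's step,
assert steps[2][0][:2] != steps[1][0] + steps[1][1]
# so best-effort interleaving starts a new training sample at each boundary.
```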
Deprecated: Branching Mode
The --trajectory-strategy branching option is deprecated. The best-effort interleaving strategy now handles all cases automatically, falling back to separate samples (equivalent to branching) when the extension property breaks.