prime-rl is a framework for large-scale, asynchronous reinforcement learning of large language models. It is designed to be easy to use and hackable, yet capable of training 1T+-parameter MoE models on 1000+ GPU clusters.
Architecture
A prime-rl RL run consists of three cooperating processes:

- Inference: vLLM-backed server (or fleet) holding the current policy. The orchestrator drives rollouts through the token-in /v1/generate route via the renderers package (OpenAI-compatible chat/completions routes are also exposed for external clients). Supports data + tensor + expert parallelism (with deepep and flashinfer all-to-all backends and EPLB), FP8 inference, prefill/decode disaggregation behind a vllm-router, CPU KV-cache offload, and router replay (the routed-expert mask is returned to the trainer for FP8 MoE numerical parity). Weights are pushed in place through a custom update_weights endpoint over filesystem or NCCL transports.
- Orchestrator: Lightweight CPU process that owns the data plane across many verifiers training and eval environments. Each env runs in an isolated subprocess with a variable-size pool of env workers for scalability. The orchestrator drives multi-turn rollouts against the inference fleet (tool use, browsers, sandboxes, long horizons) without re-tokenizing across turns, computes advantages, packs the rollouts into training batches, and relays new weights from trainer to inference.
- Trainer: FSDP2 process group that consumes packed rollouts and steps the optimizer. We ship optimized custom modeling code for many MoE / dense / VLM families that unlocks advanced trainer parallelism, namely expert parallelism (EP, with DeepEP kernels) and context parallelism (CP) for long-sequence training, plus selective activation checkpointing, FP8 training on Hopper+, LoRA, and multi-tenant training (many concurrent LoRA tenants sharing one trainer + inference deployment).
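To make the orchestrator's bookkeeping concrete (computing advantages, then packing rollouts into training batches), here is a minimal Python sketch. It is not prime-rl's implementation: the Rollout class, the group-mean baseline, and the first-fit-decreasing packing are all assumptions chosen for illustration, and the real advantage and packing logic may differ.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical rollout record; field names are illustrative only."""
    prompt_id: int      # rollouts sharing a prompt form one group
    tokens: list        # token ids, carried through without re-tokenizing
    reward: float
    advantage: float = 0.0


def compute_advantages(rollouts):
    """Assumed baseline: subtract the mean reward of each prompt group."""
    groups = {}
    for r in rollouts:
        groups.setdefault(r.prompt_id, []).append(r)
    for group in groups.values():
        mean = sum(r.reward for r in group) / len(group)
        for r in group:
            r.advantage = r.reward - mean


def pack(rollouts, seq_len):
    """First-fit-decreasing packing into bins of at most seq_len tokens."""
    bins = []
    for r in sorted(rollouts, key=lambda r: len(r.tokens), reverse=True):
        if len(r.tokens) > seq_len:
            raise ValueError("rollout longer than seq_len")
        for b in bins:
            if sum(len(x.tokens) for x in b) + len(r.tokens) <= seq_len:
                b.append(r)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([r])
    return bins


rollouts = [
    Rollout(0, [1, 2, 3], 1.0),
    Rollout(0, [1, 2, 3, 4, 5], 0.0),
    Rollout(1, [7, 8], 0.5),
    Rollout(1, [7, 8, 9, 10], 0.5),
]
compute_advantages(rollouts)
batches = pack(rollouts, seq_len=8)
```

Packing variable-length rollouts into fixed-size sequences keeps padding low, which matters when long-horizon, multi-turn rollouts sit next to short ones in the same batch.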
Installation
verifiers / renderers / research-environments submodules, installs uv, and runs uv sync --all-extras. For manual setup, MoE-only installs (DeepGEMM / DeepEP / NIXL), or troubleshooting, see the README.
You need at least one NVIDIA GPU (RTX 3090/4090/5090, A100, H100, H200, or B200). Single-GPU runs are supported for debugging; production RL typically uses one inference node plus one or more trainer nodes.
Quick Run
Train an SFT-warmed Qwen3-0.6B on the reverse-text task. The env is bundled with the verifiers submodule, so no separate install is needed. This config ships in the repo and runs on two GPUs (one for inference, one for the trainer):
The rl entrypoint reads examples/reverse_text/rl.toml, splits it into per-process sub-configs, picks GPU 0 for inference and GPU 1 for the trainer, launches all three processes, and tees their stdout into outputs/logs/{trainer,orchestrator,inference}.log. Within a minute the trainer should log step 1 and a reward sample; after 20 steps the run completes and the final HF-compatible weights land at outputs/weights/step_20.
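To check on a run from a script, the paths named above can be assembled as follows. This is a small sketch, not part of prime-rl: expected_artifacts and run_finished are hypothetical helper names, and it assumes the default outputs directory and the 20-step example config described here.

```python
from pathlib import Path

# The three processes whose stdout is teed into outputs/logs/.
PROCESSES = ("trainer", "orchestrator", "inference")


def expected_artifacts(output_dir="outputs", final_step=20):
    """Return the per-process log paths and the final weights path."""
    root = Path(output_dir)
    logs = {name: root / "logs" / f"{name}.log" for name in PROCESSES}
    weights = root / "weights" / f"step_{final_step}"
    return logs, weights


def run_finished(output_dir="outputs", final_step=20):
    """Assumption: the run is done once the final weights path exists."""
    _, weights = expected_artifacts(output_dir, final_step)
    return weights.exists()


logs, weights = expected_artifacts()
for name, path in logs.items():
    status = "present" if path.exists() else "missing"
    print(f"{name}: {path} ({status})")
print("finished:", run_finished())
```

If the trainer log stays empty well past the first minute, the inference and orchestrator logs are the natural next place to look, since the trainer only steps once packed rollouts arrive.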
Documentation
- Configuration — TOML composition, CLI overrides, dry-run.
- Training — Launch and observe RL and SFT runs.
- Scaling — Single-GPU through multi-node clusters via FSDP / EP / CP and SLURM.
- Algorithms — Async semantics, loss / advantage / filter plugins, trajectory merging.
- Advanced — Custom modeling, multimodal, LoRA, multi-tenant, P/D inference.
- Development — Test suite, pre-commit hooks, adding a new model.