RL
The main use case of PRIME-RL is RL training, which is facilitated by three main abstractions: the orchestrator, the trainer, and the inference service.
Orchestrator
The orchestrator is a lightweight CPU process that handles the core data and scheduling logic, serving as an intermediary between the trainer and inference service with bidirectional relays. In one direction, it collects rollouts from the inference server, assembles them into packed batches, and dispatches them to the trainer; in the other direction, it relays updated model weights from the trainer to the inference service. The orchestrator utilizes `verifiers` environments to abstract multi-turn rollout generation and scoring, leveraging async OpenAI-compatible inference clients.
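As a minimal sketch (not PRIME-RL's actual orchestrator code), collecting rollouts through an async OpenAI-compatible client could look as follows; the server URL, model name, and prompts are placeholders, and real `verifiers` environments would handle the multi-turn logic and scoring:

```python
# Sketch: concurrent rollout collection via an async OpenAI-compatible client.
# URL, model name, and prompts are placeholders, not PRIME-RL defaults.
import asyncio
from openai import AsyncOpenAI

# Point the client at the inference service (address is hypothetical).
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def rollout(prompt: str) -> str:
    # One single-turn rollout; a multi-turn environment would loop here,
    # appending assistant/tool messages between requests.
    response = await client.chat.completions.create(
        model="my-policy-model",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main():
    prompts = ["What is 2 + 2?", "Name a prime number greater than 10."]
    # Requests are issued concurrently, which is what makes async clients
    # attractive for batched rollout generation.
    completions = await asyncio.gather(*(rollout(p) for p in prompts))
    for prompt, completion in zip(prompts, completions):
        print(prompt, "->", completion)

asyncio.run(main())
```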
Trainer
The trainer is responsible for producing an updated policy model given rollouts and advantages. We use FSDP2 as the backend with compatibility for any HuggingFace (HF) model. For some models we also provide custom implementations, mostly for performance reasons. FSDP shards model parameters, gradients, and optimizer states, allowing training large models with data parallelism and a minimal GPU memory footprint. We support a variety of popular training objectives, such as GRPO, GSPO, OPO, RLOO, and CISPO. The trainer is inspired by `torchtitan` and relies on native PyTorch features to implement advanced parallelism techniques, such as tensor, context, or expert parallelism.
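To illustrate the kind of objective the trainer optimizes, here is a simplified, token-level GRPO-style loss with group-normalized advantages and a PPO-like clipped surrogate; it omits KL regularization and the other objective variants, and is a sketch rather than the actual trainer code:

```python
# Simplified GRPO-style objective: rewards are normalized within each group of
# rollouts for the same prompt, then a clipped surrogate is applied per token.
import torch

def group_normalized_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_groups, group_size), one scalar reward per rollout
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def grpo_loss(
    logprobs: torch.Tensor,      # (batch, seq) log-probs under the current policy
    old_logprobs: torch.Tensor,  # (batch, seq) log-probs at rollout time
    advantages: torch.Tensor,    # (batch,) one advantage per rollout
    mask: torch.Tensor,          # (batch, seq) 1 for completion tokens, 0 otherwise
    clip_eps: float = 0.2,
) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)  # broadcast the rollout advantage over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)
    # Average the loss over completion tokens only.
    return (per_token * mask).sum() / mask.sum()
```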
Inference
The inference service in its simplest form is a standard OpenAI-compatible server with a vLLM backend. The API specification is extended with two custom endpoints to enable updating the server with the latest policy: `update_weights` is used to reload model weights from an HF-compatible checkpoint on disk, and `reload_weights` is used to reset the weights to the base model in between experiments. Otherwise, we rely on vLLM’s optimized kernels, parallelism strategies, and scheduling for fast rollout generation. Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines with a shared request pool, allowing operation across multiple clusters and straightforward integration of alternative inference engines (e.g. SGLang, Tokasaurus). We also heavily rely on native data parallelism in vLLM (also available in SGLang) for orchestrating the fleet of nodes dedicated to inference.
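As a rough sketch of how a client could drive these endpoints, the snippet below posts to them over HTTP; the exact routes, payload fields, and server address are assumptions for illustration, so consult the server implementation for the actual schema:

```python
# Illustrative sketch of calling the two custom weight endpoints over HTTP.
# Routes, payload fields, and the server address are assumptions for this example.
import requests

INFERENCE_URL = "http://localhost:8000"  # placeholder address

def update_weights(checkpoint_dir: str) -> None:
    # Ask the server to reload model weights from an HF-compatible checkpoint
    # on disk (the payload field name is hypothetical).
    resp = requests.post(f"{INFERENCE_URL}/update_weights", json={"path": checkpoint_dir})
    resp.raise_for_status()

def reload_weights() -> None:
    # Reset the server back to the base model between experiments.
    resp = requests.post(f"{INFERENCE_URL}/reload_weights")
    resp.raise_for_status()

update_weights("/checkpoints/step_100")  # hypothetical checkpoint path
```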
RL
To run RL training, all components need to be started. One can do this manually, or use the `rl` entrypoint to start all components at once.
To check all available configuration options, run `uv run rl --help`.
SFT
We provide a fairly straightforward SFT trainer which is capable of fine-tuning any conversational model on multi-turn conversations with tool calling. It shares many components with the RL trainer, such as the modeling code, parallelism techniques, checkpoint format, logger, etc., which ensures a seamless post-training workflow. To start an SFT training, you need to prepare a dataset in prompt-completion format (we do not support any other format). Single-turn fine-tuning should be compatible with the chat templates of most models. However, to properly handle loss masking, we require that the tokenizer’s chat template satisfies a prefix property: the tokenization of any conversation prefix must be a prefix of the tokenization of the full conversation. For instance, tokenizing message 1 should yield a token sequence that forms a prefix of tokenizing messages 1 and 2, which in turn should be a prefix of tokenizing messages 1, 2, 3, and so forth. An example of a chat template that does not satisfy this property is Qwen3’s chat template, as it strips away past think sections.
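To make the prefix property concrete, the following sketch checks it for a given tokenizer and conversation; the model name and messages are placeholders:

```python
# Sketch of a check for the prefix property: tokenizing the first k messages
# must yield a prefix of tokenizing the first k+1 messages.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model

conversation = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "And times 3?"},
    {"role": "assistant", "content": "12"},
]

previous_ids: list[int] = []
for k in range(1, len(conversation) + 1):
    ids = tokenizer.apply_chat_template(conversation[:k], tokenize=True)
    # Every earlier tokenization must be a prefix of the current one.
    assert ids[: len(previous_ids)] == previous_ids, f"prefix property violated at message {k}"
    previous_ids = ids

print("Chat template satisfies the prefix property for this conversation.")
```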
On a single GPU, start the training with the `sft` entrypoint. On multiple GPUs, use `torchrun` with `--nproc-per-node` to start the training.
To check all available configuration options, run `uv run sft --help`.
Evals
You can eval any verifiers environment against API models, local models, and checkpoints from an SFT or RL training using the `eval` entrypoint.
We recommend using the `vf-eval` entrypoint for evaluating a single environment against API models or local models. This is often useful when building an environment. However, if you want to evaluate multiple environments in parallel and/or evaluate a training checkpoint, the PRIME-RL `eval` entrypoint is likely more convenient. We demonstrate evals by evaluating two common benchmarks: gpqa and math500.
To check all available configuration options, run `uv run eval --help`.
Local Models
To evaluate any HF model, start an inference server with the desired model before running the `eval` command. For example, to evaluate against the math500 and aime2025 environments, run the following commands:
Checkpoints
To evaluate an SFT or RL checkpoint, start an inference server with the model being the base model that you started training from, and specify the directory containing the weight checkpoints with `--weights-dir`.
To skip evaluating the base model, set `--no-eval-base`, and to evaluate only specific steps, set `--steps` as a comma-separated list of integers representing the steps to evaluate. For example, `--steps 50,100` evaluates only the checkpoints from steps 50 and 100.