RL
The main use case of PRIME-RL is RL training, which is facilitated by three main abstractions: the orchestrator, the trainer, and the inference service.
Orchestrator
The orchestrator is a lightweight CPU process that handles the core data and scheduling logic, serving as an intermediary between the trainer and inference service with bidirectional relays. In one direction, it collects rollouts from the inference server, assembles them into packed batches, and dispatches them to the trainer; in the other direction, it relays updated model weights from the trainer to the inference service. The orchestrator utilizes verifiers environments to abstract multi-turn rollout generation and scoring, leveraging async OpenAI-compatible inference clients.
Trainer
The trainer is responsible for producing an updated policy model given rollouts and advantages. We use FSDP2 as the backend with compatibility for any HuggingFace (HF) model. For some models we also provide custom implementations, mostly for performance reasons. FSDP shards model parameters, gradients, and optimizer states, allowing training large models with data parallelism and minimal GPU memory footprint. We support a variety of popular training objectives, such as GRPO, GSPO, OPO, RLOO, and CISPO. The trainer is inspired by torchtitan and relies on native PyTorch features to implement advanced parallelism techniques, such as tensor, context, or expert parallelism.
Inference
The inference service in its simplest form is a standard OpenAI-compatible server with a vLLM backend. The API specification is extended with two custom endpoints to enable updating the server with the latest policy: update_weights is used to reload model weights from an HF-compatible checkpoint on disk, and reload_weights is used to reset the weights to the base model in between experiments. Otherwise, we rely on vLLM's optimized kernels, parallelism strategies, and scheduling for fast rollout generation. Given the disaggregated nature of the service architecture, it can be directly extended to include multiple engines with a shared request pool, allowing operation across multiple clusters and straightforward integration of alternative inference engines (e.g. SGLang, Tokasaurus). We also heavily rely on native data parallelism in vLLM (also available in SGLang) for orchestrating the fleet of nodes dedicated to inference.
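As a rough illustration (the port, request paths, and request bodies below are assumptions, not taken from the source), the two custom endpoints could be called along these lines:

```bash
# Hypothetical: point the server at a new HF-compatible checkpoint on disk;
# the payload schema is an assumption.
curl -X POST http://localhost:8000/update_weights \
  -H "Content-Type: application/json" \
  -d '{"path": "/path/to/checkpoint"}'

# Hypothetical: reset the weights back to the base model between experiments
curl -X POST http://localhost:8000/reload_weights
```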
RL
For RL training, all components need to be started. One can do this manually by launching the inference server, orchestrator, and trainer separately, or use the rl entrypoint to start all components.
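A hypothetical invocation is sketched below; the config file paths and exact flag names are assumptions and should be checked against uv run rl --help and the repository's example configs.

```bash
# Hypothetical: start trainer, orchestrator, and inference together via the
# rl entrypoint, pointing each component at a TOML config file.
# Flag names and config paths are illustrative assumptions.
uv run rl \
  --trainer @ path/to/train.toml \
  --orchestrator @ path/to/orch.toml \
  --inference @ path/to/infer.toml
```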
To check all available configuration options, run uv run rl --help.
SFT
We provide a fairly straightforward SFT trainer which is capable of fine-tuning any conversational model on multi-turn conversations with tool calling. It shares many components with the RL trainer, such as the modeling code, parallelism techniques, checkpoint format, and logger, which ensures a seamless post-training workflow.

To start an SFT training, you need to prepare a dataset in prompt-completion format (we do not support any other format). Single-turn fine-tuning should be compatible with the chat templates of most models. However, to properly handle loss masking, we require that the tokenizer's chat template satisfies a prefix property: the tokenization of any conversation prefix must be a prefix of the tokenization of the full conversation. For instance, tokenizing message 1 should yield a token sequence that forms a prefix of tokenizing messages 1 and 2, which in turn should be a prefix of tokenizing messages 1, 2, and 3, and so forth. An example of a chat template that does not satisfy this property is Qwen3's chat template, as it strips away past think sections.

On a single GPU, start the training with the sft entrypoint.
On multiple GPUs, use torchrun with --nproc-per-node to start the training.
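For example (only the entrypoint names and --nproc-per-node come from the text above; the config paths, script location, and @-style config syntax are assumptions):

```bash
# Single GPU: hypothetical sft invocation with a TOML config (path is illustrative)
uv run sft @ path/to/sft.toml

# Multiple GPUs: launch the SFT trainer with torchrun; the script path and
# config syntax are assumptions.
uv run torchrun --nproc-per-node 8 src/prime_rl/trainer/sft/train.py @ path/to/sft.toml
```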
To check all available configuration options, run uv run sft --help.
Evals
You can evaluate any verifiers environment against API models, local models, and checkpoints from an SFT or RL training using the eval entrypoint.
We recommend using the vf-eval entrypoint for debugging and developing an environment, and the PRIME-RL eval entrypoint for production-grade workloads, as it has more advanced features, such as multi-environment evaluation and saving results to the Environments Hub, and scales better to custom multi-node inference deployments.
By default, the entrypoint uses the OpenAI API with gpt-4.1-mini to provide a smooth out-of-the-box experience without requiring a vLLM server.
To check all available configuration options, run uv run eval --help.
API Models
By default, the entrypoint uses the OpenAI API. First, set the API key as an environment variable, as shown in the example below.
Single Environment
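A minimal sketch for evaluating a single environment; the eval flag names are assumptions, so check uv run eval --help for the exact options.

```bash
# Required for the default OpenAI client (gpt-4.1-mini)
export OPENAI_API_KEY=<your-api-key>

# Hypothetical single-environment evaluation; flag names are assumptions
uv run eval --model.name gpt-4.1-mini --environment-ids gsm8k
```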
Multiple Environments
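A hypothetical multi-environment invocation; the list flag and its syntax are assumptions, not confirmed by the source.

```bash
# Hypothetical: evaluate several environments in one run (flag name assumed)
uv run eval --model.name gpt-4.1-mini --environment-ids gsm8k,hendrycks-math
```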
Local Models
To use a different API, or a custom inference deployment, override the client and model configuration. For example, for a bare vLLM server, run:
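A rough sketch, assuming a vLLM server on localhost:8000 serving an example open-weights model; the client and model override flags are assumptions.

```bash
# Start a bare vLLM OpenAI-compatible server (model name is just an example)
vllm serve Qwen/Qwen3-4B --port 8000

# Hypothetical eval against the local server; the override flag names are assumptions
uv run eval \
  --client.base-url http://localhost:8000/v1 \
  --model.name Qwen/Qwen3-4B \
  --environment-ids gsm8k
```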
Checkpoints
To evaluate training checkpoints, start an inference server with the base model that you started training from, and specify the directory containing the weight checkpoints with --weights-dir.
To skip evaluating the base model, set --no-eval-base, and to evaluate only specific steps, set --steps as a comma-separated list of integers representing the steps to evaluate. For example:
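A sketch assuming the vLLM server from the previous example is still running; only --weights-dir, --no-eval-base, and --steps come from the text above, while the remaining flags and the checkpoint directory are assumptions.

```bash
# Hypothetical checkpoint evaluation: point --weights-dir at the directory of
# weight checkpoints, skip the base model, and only evaluate steps 100 and 200.
uv run eval \
  --client.base-url http://localhost:8000/v1 \
  --model.name Qwen/Qwen3-4B \
  --environment-ids gsm8k \
  --weights-dir outputs/weights \
  --no-eval-base \
  --steps 100,200
```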
Synthetic Data
You can generate synthetic data in any verifiers environment using the synthesize entrypoint. The entrypoint shares configuration with evals (ModelConfig, ClientConfig, SamplingConfig, and EnvConfig), making it easy to generate data using the same models and sampling settings as your evaluations.
The synthesize entrypoint supports single-turn, multi-turn, and tool-calling environments, and can generate data across multiple environments in parallel. It automatically parses reasoning content from model responses (configurable via --reasoning-field, defaults to reasoning_content), and saves data in append mode using the same schema as verifiers, making it straightforward to convert to SFT datasets. The entrypoint is robust to errors during generation, scoring, or saving—failed groups are simply dropped. Environments specified in the format {env_org}/{env_id} are automatically installed, similar to the RL entrypoint.
By default, the entrypoint uses the OpenAI API with gpt-4.1-mini (also the default for evals) to provide a smooth out-of-the-box experience without requiring a vLLM server. Make sure to set your API key as an environment variable:
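For example:

```bash
export OPENAI_API_KEY=<your-api-key>
```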
To check all available configuration options, run uv run synthesize --help.
Single-Turn Environments
Generate synthetic data in a single-turn environment (e.g., gsm8k):
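A hypothetical invocation; the flag names are assumptions, so check uv run synthesize --help for the exact options.

```bash
# Hypothetical: generate synthetic data for gsm8k with the default OpenAI client
uv run synthesize --environment-ids gsm8k --model.name gpt-4.1-mini
```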
By default, the entrypoint parses the reasoning_content field from raw responses, as returned by vLLM when a reasoning parser is set. This behavior can be configured to handle different APIs:
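For instance, if your API returns reasoning under a different key, the field can be overridden; --reasoning-field is from the text above, while the rest of the command and the field value are assumptions.

```bash
# Parse reasoning from a "reasoning" field instead of the default "reasoning_content"
uv run synthesize --environment-ids gsm8k --reasoning-field reasoning
```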
Multiple Environments
Generate synthetic data across multiple environments (e.g., gsm8k and hendrycks-math):
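A hypothetical multi-environment invocation; the list flag and its syntax are assumptions.

```bash
# Hypothetical: generate data across two environments in parallel (flag name assumed)
uv run synthesize --environment-ids gsm8k,hendrycks-math
```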
Multi-Turn Environments
Generate synthetic data in a multi-turn environment (e.g., alphabet-sort):
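A hypothetical invocation; multi-turn rollouts are driven by the environment itself, so the command shape is the same as in the single-turn case (flag names assumed).

```bash
# Hypothetical: multi-turn environment; rollout turns are handled by the environment
uv run synthesize --environment-ids alphabet-sort
```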
Multi-Turn Environments with Tool Calls
Generate synthetic data in a multi-turn environment with tool calls (e.g., wiki-search):
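A hypothetical invocation; tool calls are executed by the verifiers environment during rollouts, so again only the environment id changes (flag names assumed).

```bash
# Hypothetical: tool-calling environment; tools are handled by the environment
uv run synthesize --environment-ids wiki-search
```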