This page covers the inference configuration and the supported features and deployment shapes. It explains how to scale the inference server from a single GPU to thousands of GPUs running agentic workloads.

Overview

prime-rl uses vLLM as its inference engine. We aim to stay up to date with vLLM, remaining at most one version behind the latest stable release. This lets us adopt new vLLM features, such as router replay and CPU KV cache offload, soon after they are released. We support 3 distinct deployment shapes:
  • Single-Node - Runs the inference server on a single node. Useful for debugging, small scale experiments or smaller models. The default deployment shape.
  • Multi-Node - Runs the inference server on multiple nodes. Useful for large scale experiments or larger models, where latency is not a concern - i.e. single turn inference, long context inference, etc.
  • Disaggregated - Runs the inference server on multiple nodes, but disaggregates the prefill and decode stages. Useful for large scale experiments or larger models, where latency is a concern and multi-node deployment creates very high E2E rollout latency, such as agentic workflows.
Most features are supported for all deployment shapes, with a few exceptions; unsupported combinations are rejected during config validation. You select the deployment shape with InferenceDeploymentConfig in your config file. This config field sets the deployment shape along with deployment-specific knobs such as num_nodes, num_replicas, router_port, and backend_port.
[inference.deployment]
type = "single_node" # or "multi_node" or "disaggregated"
To configure the inference server itself, use the InferenceConfig field, which holds the inference-server-specific knobs. Most of these are supported for all deployment shapes, with a few exceptions; unsupported combinations are rejected during config validation.
[inference]
model = "PrimeIntellect/INTELLECT-3"
...
We will now walk through the supported features and deployment shapes in detail, starting with the single-node deployment.

Single-Node

The single-node deployment is the default deployment shape. It runs the inference server on a single node and is useful for debugging, small-scale experiments, or smaller models. You can configure it with the SingleNodeInferenceDeploymentConfig config field.
[inference.deployment]
type = "single_node"
This deployment shape runs the inference server on a single node. If the node has NVLink enabled, you have more freedom in choosing a parallelism configuration.
[inference]
enable_expert_parallel = true # defaults to False

[inference.parallel]
tp = 2
dp = 4

[inference.deployment]
type = "single_node"
We recommend choosing your parallelism based on your expected throughput and latency requirements. A high dp gives you the highest throughput but can increase latency. This is a tradeoff you need to make based on your use case and the required orchestrator.max_inflight_requests. A higher tp usually gives you lower latency, but the inference server saturates at a lower number of concurrent requests.

Memory usage is another thing to consider. The model must fit into the available GPU memory; we will not go into the details of how to check this in this document. A related consideration is the space left for the KV cache, which heavily affects the number of requests your inference server can handle. Shard your model using either inference.enable_expert_parallel or inference.parallel.tp to maximize the GPU memory available for the KV cache. You can also increase the available KV cache memory by enabling inference.kv_cache_offload. See the Advanced Configuration section for more details.

Multi-Node

This deployment shape branches into 2 sub-shapes:
  • Multi-replica - Runs the inference server on multiple nodes, but each node runs an independent vLLM replica. You can think of this as a for-loop over single-node deployments.
  • Wide-EP - This option is gated behind inference.enable_expert_parallel = true. It allows you to run the inference server on multiple nodes, allowing you to use multi-node expert parallelism. This is a more advanced feature that is suitable for high-throughput, high-concurrency workloads.

Multi-replica

This deployment shape runs the inference server on multiple nodes, but each node runs an independent vLLM replica. Parallelism configuration is the same as the single-node deployment. The shape is defined by setting inference.deployment.type = "multi_node" and inference.deployment.num_nodes to the number of nodes you want to run the inference server on.
[inference.deployment]
type = "multi_node"
num_nodes = 2

[inference]
model = "PrimeIntellect/INTELLECT-3"

[inference.parallel]
tp = 2
dp = 4
This configuration will run 2 independent vLLM replicas, each with tp=2 and dp=4. Routing is handled by the vllm-router instance running on the same node as the first replica. We aim to support more advanced routing options, such as llm-d or dynamo, in the future. You can read more about the supported routing options in the Router section.

Wide-EP

For huge, 200B+ scale models, you might want to use multi-node expert parallelism to maximize the KV-cache space. This deployment shape is defined by setting inference.deployment.type = "multi_node" and inference.enable_expert_parallel = true.
[inference.deployment]
type = "multi_node"
num_nodes = 2

[inference]
model = "PrimeIntellect/INTELLECT-3"
enable_expert_parallel = true

[inference.parallel]
tp = 2
dp = 8
This configuration will run 2 vLLM processes, each with data_parallel_size_local = 4 and tp = 2 and expert parallelism spanning 2 nodes. The requests are again routed to these processes via the vllm-router.
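The relationship between the global dp value and the per-node layout can be sketched as follows. This is a minimal illustration, not prime-rl code: the helper name wide_ep_layout is hypothetical, and it assumes that data-parallel ranks are split evenly across nodes and that each rank occupies tp GPUs.

```python
def wide_ep_layout(tp: int, dp: int, num_nodes: int) -> dict[str, int]:
    """Split a global data-parallel size evenly across nodes (hypothetical helper)."""
    if dp % num_nodes != 0:
        raise ValueError("dp must be divisible by num_nodes for an even split")
    dp_local = dp // num_nodes  # data-parallel ranks hosted on each node
    return {
        "data_parallel_size_local": dp_local,
        "gpus_per_node": tp * dp_local,  # assumes each DP rank uses tp GPUs
        "total_gpus": tp * dp,
    }


# Values taken from the Wide-EP example above: tp = 2, dp = 8, num_nodes = 2.
layout = wide_ep_layout(tp=2, dp=8, num_nodes=2)
print(layout)
```

Under these assumptions, the example config needs 8 GPUs on each of the 2 nodes.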

P/D Disaggregation

This is the most advanced deployment shape. It disaggregates the prefill and decode stages, with the KV cache flowing between them. This is useful for large-scale deployments with strict latency requirements, such as agentic workflows spanning 100s of turns. This deployment shape is defined by setting inference.deployment.type = "disaggregated" and setting inference.deployment.num_prefill_nodes and inference.deployment.num_decode_nodes to the number of nodes you want to run the prefill and decode stages on.
[inference.deployment]
type = "disaggregated"
num_prefill_nodes = 2
num_decode_nodes = 2
Sometimes, you may want to run multiple independent vLLM instances within the prefill and decode node groups. You can do this by setting inference.deployment.num_prefill_replicas and inference.deployment.num_decode_replicas to the number of replicas you want to run.
[inference.deployment]
type = "disaggregated"
num_prefill_nodes = 2
num_decode_nodes = 2

num_prefill_replicas = 2
num_decode_replicas = 1
Now the total deployment spans 6 nodes: 2 prefill replicas on 2 nodes each (4 nodes for prefill) plus 1 decode replica on 2 nodes. You can also configure the total number of inference replicas. This is useful if you’d like to multiply the above configuration by a factor of k, with each copy running behind a separate vllm-router instance.
[deployment] # this is a top-level RL deployment, not inference.deployment!!
type = "multi_node"
num_train_nodes = 4
num_infer_nodes = 6 # this is per-replica

num_infer_replicas = 3
This will run 3 inference replicas, each on 6 nodes (2x2 nodes for prefill and 1x2 nodes for decode), so the inference side of the deployment spans 18 nodes. It will also spin up 3 separate vllm-router instances.
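The node arithmetic above can be checked with a small sketch. The class and function names here are hypothetical and only mirror the config field names used in this section; this is not the prime-rl validation logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisaggregatedShape:
    """Hypothetical mirror of the [inference.deployment] fields shown above."""

    num_prefill_nodes: int     # nodes per prefill replica
    num_decode_nodes: int      # nodes per decode replica
    num_prefill_replicas: int
    num_decode_replicas: int

    def nodes_per_inference_replica(self) -> int:
        prefill = self.num_prefill_nodes * self.num_prefill_replicas
        decode = self.num_decode_nodes * self.num_decode_replicas
        return prefill + decode


def total_inference_nodes(shape: DisaggregatedShape, num_infer_replicas: int) -> int:
    """Total inference nodes when the whole shape is repeated num_infer_replicas times."""
    return shape.nodes_per_inference_replica() * num_infer_replicas


shape = DisaggregatedShape(
    num_prefill_nodes=2,
    num_decode_nodes=2,
    num_prefill_replicas=2,
    num_decode_replicas=1,
)
print(shape.nodes_per_inference_replica())                 # value for num_infer_nodes
print(total_inference_nodes(shape, num_infer_replicas=3))  # inference nodes overall
```

The first value is what num_infer_nodes should be set to, since that field is per replica.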

Router

We use our own fork of vllm-router as the request handler. We plan to support more advanced proxy options in the future. Right now, the router handles the two most important things:
  • Request routing - KV cache re-use and balanced routing
  • P/D disaggregation - handling the prefill and decode stages separately

Routing policies

The 2 policies you might want to configure are:
  • consistent_hash - the default policy, which optimizes for KV cache re-use across turns. It hashes a request header to decide where to route the request. You can configure what gets hashed by setting orchestrator.student.client.extra_headers_from_state to the header the router expects.
We set it to a sensible default that works with all verifiers environments.
[orchestrator.student.client.extra_headers_from_state]
X-Session-ID = "trajectory_id" # this is the default - each rollout has a unique trajectory_id and router expects X-Session-ID
  • round_robin - this policy cycles requests across the available replicas. This is useful if you want to balance the load between the replicas. It might give you better results if you don’t have enough rollouts for consistent_hash to keep every replica busy.
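To illustrate why consistent_hash helps KV cache re-use, here is a minimal hash-ring sketch. It is not the vllm-router implementation, whose internals are not described here: the ConsistentHashRing class, the replica names, and the use of SHA-256 with virtual nodes are all assumptions made for illustration. The point it shows is that every request carrying the same X-Session-ID value lands on the same replica.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Illustrative hash ring: maps a session id to one backend (hypothetical)."""

    def __init__(self, backends: list[str], vnodes: int = 64) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        # Each backend is placed at several points on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def route(self, session_id: str) -> str:
        # Pick the first ring point clockwise from the session's hash.
        idx = bisect.bisect_right(self._points, self._hash(session_id))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["replica-0", "replica-1", "replica-2"])
# Every turn of the same trajectory is routed to the same replica.
print(ring.route("trajectory-42") == ring.route("trajectory-42"))
```

With only a handful of distinct session ids, some replicas may receive no traffic at all under this scheme, which is the situation where round_robin can be the better choice.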

Advanced Configuration

KV Cache Offload

Maximizing KV-Cache space is crucial to support high-concurrency workloads. You can offload the KV cache to CPU memory (and, behind it, disk) by setting inference.kv_cache_offload. It is a discriminated config with two composable tiers, cpu and disk: a cpu tier is always required, and an optional disk tier is layered behind it (GPU → DRAM → disk). Disk-only is not supported. The type field selects the backend:
  • native — vLLM’s built-in offloading. CPU-only uses OffloadingConnector; CPU+disk uses TieringOffloadingSpec (a CPU primary tier with a filesystem secondary tier). Fully self-contained — no extra processes.
  • mooncake — a Mooncake shared distributed store (SLURM only). One mooncake_master + metadata server runs on the head inference node; every inference node runs a mooncake_client that contributes its DRAM (and, with disk, SSD) segment to that single pool. Because blocks are keyed by model + parallel rank + content hash (no instance id), a prefix cached by one node/replica is reusable by all of them over RDMA — pooling every node’s CPU RAM into one KV cache. Use native for local/single-process runs.
# Native CPU offload (reserves 128GB of CPU KV cache for this instance)
[inference.kv_cache_offload]
type = "native"
[inference.kv_cache_offload.cpu]
num_bytes = 128_000_000_000 # 128GB

# Native CPU + disk tiering (self-contained)
[inference.kv_cache_offload]
type = "native"
[inference.kv_cache_offload.cpu]
num_bytes = 128_000_000_000
[inference.kv_cache_offload.disk]
path = "/scratch/kv"        # disk capacity is bounded by the filesystem

# Mooncake CPU + disk (per-node distributed store, RDMA)
[inference.kv_cache_offload]
type = "mooncake"
[inference.kv_cache_offload.cpu]
num_bytes = 128_000_000_000
[inference.kv_cache_offload.disk]
path = "/scratch/kv"
For native, cpu.num_bytes is the aggregate CPU KV pool for the instance (vLLM shards it across workers). For mooncake, cpu.num_bytes is the DRAM each node contributes to the shared pool (so the total pool ≈ num_bytes × #inference-nodes); the store uses RDMA, so it requires an RDMA-capable fabric. Enabling offload automatically enables prefix caching.
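The difference in how cpu.num_bytes is interpreted can be sketched as a small sizing helper. The function name cpu_kv_pool_bytes is hypothetical and the result is only the approximate figure described above; it does not account for any overhead of the store itself.

```python
def cpu_kv_pool_bytes(backend: str, cpu_num_bytes: int, num_inference_nodes: int) -> int:
    """Approximate CPU KV pool reachable by one instance (hypothetical helper)."""
    if backend == "native":
        # num_bytes is already the aggregate pool for the instance; it is not shared.
        return cpu_num_bytes
    if backend == "mooncake":
        # Each node contributes num_bytes of DRAM to one shared pool.
        return cpu_num_bytes * num_inference_nodes
    raise ValueError(f"unknown kv_cache_offload type: {backend!r}")


GB = 1_000_000_000
native_pool = cpu_kv_pool_bytes("native", 128 * GB, num_inference_nodes=4)
mooncake_pool = cpu_kv_pool_bytes("mooncake", 128 * GB, num_inference_nodes=4)
print(native_pool // GB, mooncake_pool // GB)
```

For the same num_bytes setting, the shared mooncake pool grows with the number of inference nodes, while the native pool stays fixed per instance.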

Optimized P/D disaggregation deployment

For an optimal P/D disaggregation deployment, we automatically set the decode all2all_backend to deepep_low_latency and the prefill all2all_backend to deepep_high_throughput. Customizing the all2all backends for P/D disaggregation is currently not supported out of the box; the only way to do it is to override the SLURM template. For KV cache transfer, we use the NIXL connector, which is the default and currently the only supported connector. We aim to support more advanced options, such as D->P transfer or the Mooncake connector, in the future. You can also configure environment variables separately for the prefill and decode stages, which is useful when the two stages need different settings.
[inference.deployment]
type = "disaggregated"

prefill_env_overrides = {"VLLM_ENABLE_MOE_DP_CHUNK"="0", "VLLM_DEEP_GEMM_WARMUP"="skip"}
decode_env_overrides = {"VLLM_DEEP_GEMM_WARMUP"="skip"}

Other vLLM features

We support various other vLLM features. Some of them, such as enable_dbo and enable_eplb, are exposed as top-level config fields. For those that are not, you can configure them by setting inference.vllm_extra to the desired value.
[inference.vllm_extra]
headless = true

Router Replay

Router replay works by capturing the expert routing decisions into a buffer. This buffer is then sent to the trainer, which can use it instead of re-computing the routing. This lowers the trainer↔inference mismatch by an order of magnitude, resulting in more stable training. To enable router replay, set trainer.enable_router_replay = true, which also auto-sets inference.enable_return_routed_experts = true.
[trainer]
enable_router_replay = true # this will also auto-set the inference.enable_return_routed_experts = true

[inference]
enable_return_routed_experts = true
This is not free, however: it adds significant overhead to the HTTP requests, as the payload can grow quite large. We recommend increasing orchestrator.*.env.num_workers to allow more parallelization on the verifiers side. This feature is also currently not supported together with CPU KV cache offload, which can have a negative impact on inference throughput.
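A back-of-the-envelope estimate shows how quickly the routing payload grows. This sketch assumes one expert id is recorded per token, per MoE layer, per selected expert; the actual wire format is not described here, and the helper name, the 2-byte id width, and the model dimensions below are hypothetical values chosen only for illustration.

```python
def routed_experts_payload_bytes(
    num_tokens: int,
    num_moe_layers: int,
    top_k: int,
    bytes_per_id: int = 2,  # assumption: each expert id fits in a 16-bit integer
) -> int:
    """Rough raw size of the routing decisions for one rollout (hypothetical)."""
    return num_tokens * num_moe_layers * top_k * bytes_per_id


# Hypothetical rollout: 32k tokens, 48 MoE layers, 8 experts selected per token.
payload = routed_experts_payload_bytes(num_tokens=32_000, num_moe_layers=48, top_k=8)
print(f"{payload / 1e6:.1f} MB before any encoding overhead")
```

Because the size scales linearly with sequence length, long multi-turn rollouts are where this overhead matters most.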