Overview
prime-rl uses vLLM as its inference engine. We aim to stay at most 1 version behind the latest stable vLLM release. This lets us adopt new vLLM features, such as router replay and CPU KV cache offload, soon after they are released.
We support 3 distinct deployment shapes:
- Single-Node - Runs the inference server on a single node. Useful for debugging, small scale experiments or smaller models. The default deployment shape.
- Multi-Node - Runs the inference server on multiple nodes. Useful for large scale experiments or larger models, where latency is not a concern, e.g. single-turn inference or long-context inference.
- Disaggregated - Runs the inference server on multiple nodes, but disaggregates the prefill and decode stages. Useful for large scale experiments or larger models, where latency is a concern and multi-node deployment creates very high E2E rollout latency, such as agentic workflows.
The deployment is configured through InferenceDeploymentConfig in your config file. This config field sets the deployment shape and deployment-specific knobs such as num_nodes, num_replicas, router_port and backend_port.
Inference server-specific knobs are set through the InferenceConfig field. Most of them are supported for all deployment shapes; the few exceptions are rejected at validation.
Single-Node
The single-node deployment is the default deployment shape. It runs the inference server on a single node. It is useful for debugging, small scale experiments or smaller models. You can configure the single-node deployment with the SingleNodeInferenceDeploymentConfig config field.
Setting dp to a higher value might increase latency, but it will also give you the highest throughput. This is a tradeoff you need to make based on your use case and the required orchestrator.max_inflight_requests. Setting tp to a higher value will usually give you lower latency, but the inference server will also saturate at a lower number of concurrent requests.
Another thing to consider is memory usage. You need to make sure that the model fits into the available GPU memory; we do not go into the details of how to do this in this document. A related consideration is the space left for the KV cache, which heavily affects the number of requests your inference server can handle. You want to shard your model, using either inference.enable_expert_parallel or inference.parallel.tp, to maximize the available GPU memory.
You can also increase the available KV cache memory by enabling inference.kv_cache_offload. More details in the Advanced Configuration section.
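To make the tp/dp tradeoff concrete, here is a minimal Python sketch that enumerates the ways to split one node's GPUs between tensor and data parallelism. It assumes every GPU is used and that tp × dp equals the GPU count (pipeline parallelism is ignored). The helper name parallel_layouts and the 8-GPU node are illustrative assumptions, not part of prime-rl.

```python
def parallel_layouts(num_gpus: int) -> list[tuple[int, int]]:
    """Return every (tp, dp) pair with tp * dp == num_gpus.

    Assumption: all GPUs on the node are used and no pipeline parallelism
    is involved, so the two degrees multiply to the GPU count.
    """
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]


if __name__ == "__main__":
    # Hypothetical 8-GPU node.
    for tp, dp in parallel_layouts(8):
        print(f"tp={tp} dp={dp}")
```

In this enumeration, the low-tp, high-dp layouts sit at the throughput end of the tradeoff, and the high-tp, low-dp layouts sit at the latency end.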
Multi-Node
This deployment shape branches into 2 sub-shapes:
- Multi-replica - Runs the inference server on multiple nodes, but each node runs an independent vLLM replica. You can think of this as a for-loop over single-node deployments.
- Wide-EP - This option is gated behind inference.enable_expert_parallel = true. It runs the inference server across multiple nodes using multi-node expert parallelism. This is a more advanced feature that is suitable for high-throughput, high-concurrency workloads.
Multi-replica
This deployment shape runs the inference server on multiple nodes, but each node runs an independent vLLM replica. Parallelism configuration is the same as for the single-node deployment. The shape is defined by setting inference.deployment.type = "multi_node" and inference.deployment.num_nodes to the number of nodes you want to run the inference server on.
For example, each node can run a replica with tp=2 and dp=4. Routing is handled by the vllm-router instance running on the same node as the first replica. We aim to support more advanced routing options, such as llm-d or dynamo, in the future. You can read more about the supported routing options in the Router section.
Wide-EP
For huge models at the 200B+ scale, you might want to use multi-node expert parallelism to maximize the KV cache space. This deployment shape is defined by setting inference.deployment.type = "multi_node" and inference.enable_expert_parallel = true.
For example, a deployment can use data_parallel_size_local = 4 and tp = 2, with expert parallelism spanning 2 nodes. Requests are again routed to these processes via the vllm-router.
P/D Disaggregation
This is the most advanced deployment shape. It disaggregates the prefill and decode stages, with the KV cache flowing between them. This is useful for large scale deployments with strict latency requirements, such as agentic workflows spanning 100s of turns. This deployment shape is defined by setting inference.deployment.type = "disaggregated", and inference.deployment.num_prefill_nodes and inference.deployment.num_decode_nodes to the number of nodes you want to run the prefill and decode stages on.
You can run multiple replicas by setting inference.deployment.num_prefill_replicas and inference.deployment.num_decode_replicas to the number of replicas you want to run.
k, each running behind a separate vllm-router instance.
vllm-router instances.
Router
We use our own fork of vllm-router as the request handler. We plan to support more advanced proxy options in the future. Right now, the router handles the 2 most important things:
- Request routing - KV cache re-use and balanced routing
- P/D disaggregation - handling the prefill and decode stages separately
Routing policies
The 2 policies you might want to configure are:
- consistent_hash - this is the default policy, which optimizes for KV cache re-use across turns. It works by hashing a request header to determine where to route the request. You can configure what to hash by setting orchestrator.student.client.extra_headers_from_state to the header the router expects to be set.
- round_robin - this policy cycles requests across the available replicas. This is useful if you want to balance the load between the replicas. It might give you better results if you don't have enough rollouts to keep every replica saturated under consistent_hash.
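The difference between the two policies can be illustrated with a small Python sketch. This is a generic hash ring, not the implementation used by the vllm-router fork; the class names, replica names and virtual-node count are hypothetical. The key stands in for whatever header value the router hashes.

```python
import bisect
import hashlib
import itertools


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class ConsistentHashRouter:
    """Maps a sticky key (e.g. a per-rollout header value) onto a hash ring."""

    def __init__(self, replicas: list[str], vnodes: int = 64) -> None:
        # Each replica owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, key: str) -> str:
        # Pick the first ring point clockwise from the key's hash, wrapping around.
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


class RoundRobinRouter:
    """Cycles through replicas regardless of the request's content."""

    def __init__(self, replicas: list[str]) -> None:
        self._cycle = itertools.cycle(replicas)

    def route(self, key: str) -> str:
        return next(self._cycle)


if __name__ == "__main__":
    replicas = ["replica-0", "replica-1", "replica-2"]  # hypothetical names
    sticky = ConsistentHashRouter(replicas)
    spread = RoundRobinRouter(replicas)
    for turn in range(3):
        # Same rollout key: always the same replica under consistent hashing.
        print(turn, sticky.route("rollout-42"), spread.route("rollout-42"))
```

With the hash ring, every turn of the same rollout lands on the same replica, so its cached prefix can be reused. Round-robin spreads turns evenly but gives up that stickiness.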
Advanced Configuration
KV Cache Offload
Maximizing KV cache space is crucial to support high-concurrency workloads. You can offload the KV cache to CPU memory (and, behind it, disk) by setting inference.kv_cache_offload. It is a discriminated config with two composable tiers, cpu and disk: a cpu tier is always required, and an optional disk tier is layered behind it (GPU → DRAM → disk). Disk-only is not supported.
The type field selects the backend:
- native - vLLM’s built-in offloading. CPU-only uses OffloadingConnector; CPU+disk uses TieringOffloadingSpec (a CPU primary tier with a filesystem secondary tier). Fully self-contained, with no extra processes.
- mooncake - a Mooncake shared distributed store (SLURM only). One mooncake_master + metadata server runs on the head inference node; every inference node runs a mooncake_client that contributes its DRAM (and, with disk, SSD) segment to that single pool. Because blocks are keyed by model + parallel rank + content hash (no instance id), a prefix cached by one node/replica is reusable by all of them over RDMA, pooling every node’s CPU RAM into one KV cache. Use native for local/single-process runs.
For native, cpu.num_bytes is the aggregate CPU KV pool for the instance (vLLM shards it across workers). For mooncake, cpu.num_bytes is the DRAM each node contributes to the shared pool (so the total pool ≈ num_bytes × #inference-nodes); the store uses RDMA, so it requires an RDMA-capable fabric. Enabling offload automatically enables prefix caching.
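The two meanings of cpu.num_bytes can be summarized with a short Python sketch. The helper name cpu_kv_pool_bytes and the 64 GiB / 4-node figures are illustrative assumptions, and the mooncake result is the approximate total from the description above, not an exact accounting.

```python
GIB = 1024**3


def cpu_kv_pool_bytes(backend: str, num_bytes: int, num_inference_nodes: int) -> int:
    """Approximate CPU KV pool that a cached prefix can live in.

    native:   num_bytes is already the aggregate pool of one instance.
    mooncake: every inference node contributes num_bytes to one shared pool.
    """
    if backend == "native":
        return num_bytes
    if backend == "mooncake":
        return num_bytes * num_inference_nodes
    raise ValueError(f"unknown backend: {backend!r}")


if __name__ == "__main__":
    # Hypothetical setup: 64 GiB per node, 4 inference nodes.
    for backend in ("native", "mooncake"):
        pool = cpu_kv_pool_bytes(backend, 64 * GIB, 4)
        print(f"{backend}: {pool // GIB} GiB")
```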
Optimized P/D disaggregation deployment
For optimal P/D disaggregation deployment, we automatically set the decode all2all_backend to deepep_low_latency and the prefill all2all_backend to deepep_high_throughput. We currently don’t support customizing all2all backends for P/D disaggregation out of the box. The only way to do this is to override the SLURM template.
For KV cache transfer, we utilize the NIXL connector. This is the default and only currently supported connector. We aim to support more advanced options, such as D->P transfer, or Mooncake Connector in the future.
You can configure environment variables for the prefill and decode stages separately. This is useful when the two stages need different settings.
Other vLLM features
We support various other vLLM features. Some of them, such as enable_dbo and enable_eplb, are exposed as top-level config fields. For those that are not, you can configure them by setting inference.vllm_extra to the desired value.
Router Replay
Router replay works by capturing the expert routing decisions into a buffer. This buffer then gets sent to the trainer, which can use it instead of re-computing the routing. This lowers the trainer↔inference mismatch by an order of magnitude, resulting in more stable training. To enable router replay, set inference.enable_return_routed_experts = true.
When using router replay, consider increasing orchestrator.*.env.num_workers to allow for more parallelization on the verifiers side.
Router replay is currently not supported together with CPU KV cache offload, which can negatively impact inference throughput.
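The idea behind router replay can be shown with a toy Python sketch. It is not how vLLM or the prime-rl trainer implement it: the function names and logit values are made up, and a real MoE router works on tensors per token and per layer. The point is only that tiny numerical differences in router logits can flip the top-k choice, and that replaying the recorded choice avoids this.

```python
from typing import Optional


def top_k_experts(logits: list[float], k: int) -> list[int]:
    """Indices of the k largest router logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def route(logits: list[float], k: int, replay: Optional[list[int]] = None) -> list[int]:
    """Use recorded expert indices when given, otherwise recompute them."""
    return list(replay) if replay is not None else top_k_experts(logits, k)


if __name__ == "__main__":
    # Hypothetical logits for one token: the trainer's numerics differ slightly.
    inference_logits = [0.50, 0.49, 0.10]
    trainer_logits = [0.49, 0.50, 0.10]

    recorded = route(inference_logits, k=1)                # captured at inference
    recomputed = route(trainer_logits, k=1)                # without router replay
    replayed = route(trainer_logits, k=1, replay=recorded) # with router replay

    print(recorded, recomputed, replayed)
```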