Table of Contents
- Custom Modeling
- Multimodal Training
- LoRA Training
- Multi-Tenant Training
- Disaggregated Prefill/Decode Inference
Custom Modeling
prime-rl ships custom optimized model implementations for several MoE families. With model.impl = "auto" (default) the trainer picks the custom path when the HF config type is registered, falling back to plain HF otherwise. To force one:
| Family | HF config types | EP | CP |
|---|---|---|---|
| GLM-5 (glm_moe_dsa) | zai-org/GLM-5, zai-org/GLM-5-FP8 | ✅ | ✅ |
| Qwen3 MoE | Qwen/Qwen3-30B-A3B, … | ✅ | ✅ |
| Qwen3.5 MoE | Qwen/Qwen3.5-35B-A3B, … | ✅ | ✅ |
| Qwen3 / Qwen3.5 VLMs | see Multimodal training | MoE only | ✅ |
| Laguna | poolside/Laguna-XS.2 | ✅ | ✅ |
| MiniMax M2 | MiniMax/MiniMax-M2 | ✅ | ✅ |
| Nemotron H | nvidia/Nemotron-3-Nano-30B-A3B, … | ✅ | ❌ |
| Trinity (AFMoE) | arcee-ai/Trinity-Mini, … | ✅ | ✅ |
| GLM-4 / GLM-4.5 / INTELLECT-3 | THUDM/GLM-4-9B-0414, zai-org/GLM-4.5, PrimeIntellect/INTELLECT-3, … | ✅ | ✅ |
| GPT-OSS (HF MoE) | openai/gpt-oss-20b, openai/gpt-oss-120b | ❌ | ✅ |
model.fp8 = true, requires SM90+), and faster MoE kernels (moe_use_grouped_mm = true, default). Forcing impl = "hf" is mostly useful when debugging — it’s slower and disables most MoE-specific knobs.
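The "auto" selection rule from the start of this section can be sketched as a registry lookup with a fallback. This is a minimal illustration under stated assumptions, not prime-rl's actual code: CUSTOM_IMPLS, resolve_impl, and the "custom" return label are hypothetical names, and only the glm_moe_dsa config type is taken from the table above.

```python
# Hypothetical set of HF config types that have a custom implementation.
# Only "glm_moe_dsa" comes from the table above; the real registry may differ.
CUSTOM_IMPLS = {"glm_moe_dsa"}


def resolve_impl(requested: str, config_type: str) -> str:
    """Return the implementation to use for a given HF config type."""
    if requested == "auto":
        # Prefer the custom path when the config type is registered,
        # otherwise fall back to plain HF.
        return "custom" if config_type in CUSTOM_IMPLS else "hf"
    # Any explicit value is taken as-is (for example "hf" when debugging).
    return requested


print(resolve_impl("auto", "glm_moe_dsa"))
print(resolve_impl("auto", "some_unregistered_type"))
print(resolve_impl("hf", "glm_moe_dsa"))
```

The point of the sketch is that "auto" never fails on an unregistered config type; it only changes which code path is used.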
Expert Parallelism Backends
model.ep_comm_backend picks the all-to-all kernel used for EP dispatch/combine:
- torch (default): TorchTitan’s all-to-all collective. Works everywhere, no extra install.
- deepep: Uses DeepEP’s custom all-to-all collectives, which perform better when the EP dimension spans multiple nodes. We provide pre-built binaries for H100/H200 with CUDA runtime 12.9 installed; install them by running uv sync --all-extras. DeepEP requires careful tuning to reach optimal performance; the tuning parameters are deepep_num_sms and deepep_token_chunk_size.
optim.max_norm is set to None automatically.)
Multimodal Training
Supported Families
The built-in VLM registry covers:
| Family | model_type | Vision attr | LM attr |
|---|---|---|---|
| Qwen3-VL | qwen3_vl | model.visual | model.language_model |
| Qwen3-VL MoE | qwen3_vl_moe | model.visual | model.language_model |
| Qwen3.5 | qwen3_5 | model.visual | model.language_model |
| Qwen3.5-MoE | qwen3_5_moe | model.visual | model.language_model |
model.named_children() and set them under [model.vlm] directly.
Enabling VLM Mode
Add [model.vlm] and bfloat16 dtypes:
{language_model_attr}.layers. automatically.
To add a new model family permanently, append an entry to VLM_REGISTRY in src/prime_rl/utils/vlm.py.
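To make the role of the vision and LM attribute paths concrete, here is a minimal sketch of a registry keyed by model_type and a helper that follows a dotted path on a model object. The entry shape (VLMEntry), the REGISTRY dict, and resolve are hypothetical; the real VLM_REGISTRY in src/prime_rl/utils/vlm.py may be structured differently. The attribute values are copied from the supported-families table.

```python
from dataclasses import dataclass
from functools import reduce
from types import SimpleNamespace


@dataclass(frozen=True)
class VLMEntry:
    # Hypothetical entry shape, used only for this illustration.
    vision_attr: str
    language_model_attr: str


# Values copied from the supported-families table.
REGISTRY = {
    "qwen3_vl": VLMEntry("model.visual", "model.language_model"),
    "qwen3_vl_moe": VLMEntry("model.visual", "model.language_model"),
    "qwen3_5": VLMEntry("model.visual", "model.language_model"),
    "qwen3_5_moe": VLMEntry("model.visual", "model.language_model"),
}


def resolve(root, dotted: str):
    """Follow a dotted attribute path such as "model.visual" from root."""
    return reduce(getattr, dotted.split("."), root)


# Stand-in object shaped like a VLM, used only to exercise the lookup.
fake_vlm = SimpleNamespace(
    model=SimpleNamespace(visual="vision-encoder", language_model="language-model")
)
entry = REGISTRY["qwen3_vl"]
print(resolve(fake_vlm, entry.vision_attr))
print(resolve(fake_vlm, entry.language_model_attr))
```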
Limitations
- Vision encoder frozen by default. Set freeze_vision_encoder = false to fine-tune it; in that case it’s FSDP-sharded per block. The combination freeze_vision_encoder = false + LoRA is rejected by a config validator: LoRA freezes everything non-adapter, so unfreezing the encoder under LoRA would be a silent no-op.
- No multimodal-safe truncation. Token sequences are truncated to seq_len, but pixel_values and image_grid_thw pass through unchanged. If a sample’s tokens overflow, image tokens may get dropped while image tensors still describe the full image set. Set seq_len to cover your longest sample.
- bfloat16 mandatory. The trainer config validator refuses any other optimization_dtype / reduce_dtype for VLMs: vLLM serves VLMs in bfloat16 and a mismatch breaks the importance ratio.
- Higher KL mismatch with multi-image inputs. Expect noisier mismatch_kl than text-only; this is from minor numerical differences between the trainer’s and vLLM’s image processing.
- Images aren’t logged to monitors. Sample logging captures the prompt text but not the actual images.
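Because truncation is not multimodal-safe, it can help to check token lengths against seq_len before launching a run. The sketch below is a standalone helper, not part of prime-rl: samples_exceeding and min_safe_seq_len are hypothetical names, and token_lengths is assumed to hold each sample's tokenized length including its image placeholder tokens.

```python
def samples_exceeding(token_lengths: list[int], seq_len: int) -> list[int]:
    """Return indices of samples whose token count is larger than seq_len."""
    return [i for i, n in enumerate(token_lengths) if n > seq_len]


def min_safe_seq_len(token_lengths: list[int]) -> int:
    """Smallest seq_len that truncates none of the given samples."""
    return max(token_lengths)


# Illustrative lengths only; real values come from tokenizing your dataset.
lengths = [1800, 4096, 5120]
print(samples_exceeding(lengths, 4096))
print(min_safe_seq_len(lengths))
```

Any index returned by samples_exceeding marks a sample whose image tokens could be cut while its image tensors stay intact.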
LoRA Training
LoRA is enabled by adding [model.lora]:
target_modules defaults to a reasonable cross-family set (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, experts, plus a few latent-projection names for Nemotron). Unknown names are silently ignored, so the defaults work across architectures. Add architecture-specific names to extend coverage (e.g. in_proj / out_proj for Mamba).
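The effect of "unknown names are silently ignored" can be illustrated with a small name-matching sketch. The matching rule here is an assumption (a module is targeted when the last component of its dotted name is in target_modules); prime-rl's actual matching logic may differ. select_targets and the module names are hypothetical, and DEFAULT_TARGETS lists only the names spelled out above, without the Nemotron latent-projection names.

```python
# Subset of the default target names listed above.
DEFAULT_TARGETS = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj", "experts",
]


def select_targets(module_names: list[str], target_modules: list[str]) -> list[str]:
    """Keep modules whose final name component is a target (assumed rule)."""
    wanted = set(target_modules)
    return [name for name in module_names if name.rsplit(".", 1)[-1] in wanted]


# Hypothetical module names for a model with attention, MLP, and Mamba blocks.
names = [
    "layers.0.self_attn.q_proj",
    "layers.0.mlp.down_proj",
    "layers.1.mixer.in_proj",
    "layers.1.input_layernorm",
]

# Targets with no matching module (for example "experts") simply match nothing.
print(select_targets(names, DEFAULT_TARGETS))
# Extending the list picks up the Mamba projection as well.
print(select_targets(names, DEFAULT_TARGETS + ["in_proj", "out_proj"]))
```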
LoRA is supported across SFT and RL. For RL, weight_broadcast.type = "nccl" is not supported with LoRA — use the default filesystem transport. To save the raw adapter alongside the merged HF weights:
Multi-Tenant Training
Multi-tenant training lets a single trainer + inference deployment serve many concurrent LoRA “tenants”. Each tenant is a fully isolated run with its own orchestrator, LoRA adapter, optimizer, scheduler, checkpoints, and progress tracking; all tenants share the same backbone weights and the same vLLM server. This is the topology behind hosted training on the Prime Intellect platform (Lab). The trainer-side implementation is the MultiRunManager singleton, enabled by setting trainer.max_concurrent_runs > 1. For the full API surface, see src/prime_rl/trainer/runs.py.
Disaggregated Prefill/Decode Inference
For large MoE serving, splitting prefill and decode onto separate vLLM groups can substantially improve throughput. Pick the prefill:decode ratio based on workload shape:
| Workload | P:D ratio | Why |
|---|---|---|
| Agentic (SWE, Lean) | 3:1 | Long growing contexts → prefill-heavy |
| Non-agentic (math, chat) | 1:2 | Short prompts, long generations → decode-heavy |
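As a worked example of applying a P:D ratio from the table, the sketch below splits a fixed number of inference replicas into prefill and decode groups. It is plain arithmetic, not a prime-rl or vLLM API: split_prefill_decode is a hypothetical helper, and the rounding and "at least one of each" rules are assumptions.

```python
def split_prefill_decode(total: int, prefill_parts: int, decode_parts: int) -> tuple[int, int]:
    """Split `total` replicas by a P:D ratio, keeping at least one of each kind."""
    if total < 2:
        raise ValueError("need at least two replicas to disaggregate")
    prefill = round(total * prefill_parts / (prefill_parts + decode_parts))
    # Clamp so that both the prefill and the decode group are non-empty.
    prefill = min(max(prefill, 1), total - 1)
    return prefill, total - prefill


# Agentic workload, 3:1 ratio over 8 replicas.
print(split_prefill_decode(8, 3, 1))
# Non-agentic workload, 1:2 ratio over 6 replicas.
print(split_prefill_decode(6, 1, 2))
```

Treat the result as a starting point; the right split depends on the observed workload rather than on the ratio alone.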
examples/glm5_pd_disag/rl.toml — full RL run on GLM-5 with P/D disaggregation behind a vllm-router, FP8 inference, and NCCL weight broadcast (see the README for the launch story).
Monitor live queue depths to detect imbalance:
third_party/ucx/; the bundled sbatch templates prepend it to LD_LIBRARY_PATH so it overrides the system version.