Prime-RL supports training vision-language models (VLMs) like Qwen3-VL.
## VLM Configuration

### Supported Models
The built-in registry supports these model families out of the box:

| Model Family | `model_type` | Vision Encoder | Language Model |
|---|---|---|---|
| Qwen3-VL | `qwen3_vl` | `model.visual` | `model.language_model` |
| Qwen3.5 | `qwen3_5` | `model.visual` | `model.language_model` |
| Qwen3.5-MoE | `qwen3_5_moe` | `model.visual` | `model.language_model` |
For model families not in the built-in registry, set the attribute paths explicitly in the `[model.vlm]` section. Both fields are required — they tell prime-rl where the vision encoder and language model live on the model object.
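A minimal sketch of such a config, assuming the two fields are named `vision_encoder_attr` and `language_model_attr` (the latter appears in the broadcast-prefix rule below; the former is an assumption, so check `src/prime_rl/utils/vlm.py` for the exact names):

```toml
# Sketch: explicit attribute paths for a model family not in the registry.
# Field names are assumptions; values mirror the Qwen3-VL entry above.
[model.vlm]
vision_encoder_attr = "model.visual"
language_model_attr = "model.language_model"
```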
You can find the right attribute paths by inspecting `model.named_children()`. Both fields are dotted attribute paths resolved on the loaded model. A bad path raises a `ValueError` immediately — there are no silent fallbacks.
The weight key prefix for NCCL broadcasting is derived automatically as `{language_model_attr}.layers.` (for Qwen3-VL, `model.language_model.layers.`).
To add permanent support for a new model family, add an entry to `VLM_REGISTRY` in `src/prime_rl/utils/vlm.py`.
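The exact shape of the registry is defined in that file; as a rough sketch, assuming it maps `model_type` strings to the two attribute paths:

```python
# Hypothetical sketch of VLM_REGISTRY entries; the real structure in
# src/prime_rl/utils/vlm.py may use a dataclass or different keys.
VLM_REGISTRY = {
    "qwen3_vl": {
        "vision_encoder_attr": "model.visual",
        "language_model_attr": "model.language_model",
    },
    # A new family would follow the same pattern (illustrative values):
    "my_new_vlm": {
        "vision_encoder_attr": "model.vision_tower",
        "language_model_attr": "model.language_model",
    },
}
```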
## Current Limitations

- **Vision encoder is frozen by default**: The vision encoder is frozen during training by default. Set `freeze_vision_encoder = false` in `[model.vlm]` to make it trainable (see the sketch after this list). When unfrozen, the vision encoder is FSDP-sharded per block for proper gradient flow. Note: this has no effect when using LoRA.
- **No multimodal-safe truncation**: Token sequences are truncated to `seq_len`, but `pixel_values` and `image_grid_thw` are passed through unchanged. If a multimodal sample exceeds `seq_len`, image tokens can be dropped while the image tensors still describe the full set of images. Ensure `seq_len` covers your longest VLM samples.
- **Optimization dtype must be bfloat16**: Set `optimization_dtype = "bfloat16"` and `reduce_dtype = "bfloat16"` in your trainer config (see the sketch after this list).
- **Higher KL mismatch with multi-image inputs**: VLM training exhibits higher KL mismatch than text-only training, especially with multiple images.
- **Images are not logged**: The images the VLM sees during training are not logged to monitors.
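Taken together, the two config-level items above might look like this in practice (a sketch only: `[model.vlm]` comes from this page, but the section holding the trainer dtype options is an assumption):

```toml
# Sketch: config-level settings from the limitations above.
[model.vlm]
freeze_vision_encoder = false    # optional: unfreeze the vision encoder (no effect with LoRA)

# Section name assumed; place these wherever your trainer config expects them.
[trainer]
optimization_dtype = "bfloat16"  # required for VLM training
reduce_dtype = "bfloat16"
```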
## How Multi-Turn VLM RL Training Works
VLM training uses the same `interleave_rollout` path as text-only models. Multi-turn trajectory steps are merged into a single training sample wherever the extension property holds, i.e. wherever each step's token sequence is a prefix of the next step's.
Images are handled via a `VLMImageCache` built once per batch:

- **Extract**: Base64 images are decoded from trajectory step prompts into PIL images.
- **Preprocess**: Images are processed through the HuggingFace image processor, producing `pixel_values` and `image_grid_thw`.
- **Attach**: Each training sample receives the cumulative `pixel_values` up to its last merged step.
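The extract and preprocess steps map onto standard library calls. A self-contained sketch of that data flow (the `VLMImageCache` internals are not shown here, and the model id is illustrative):

```python
# Sketch of the extract + preprocess steps above. The real VLMImageCache
# lives in prime-rl; this only illustrates the underlying data flow.
import base64
import io

from PIL import Image
from transformers import AutoImageProcessor

def decode_b64_image(data: str) -> Image.Image:
    """Extract: decode one base64-encoded prompt image into a PIL image."""
    return Image.open(io.BytesIO(base64.b64decode(data))).convert("RGB")

# Stand-in for a base64 image found in a trajectory step prompt.
buf = io.BytesIO()
Image.new("RGB", (224, 224)).save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

# Illustrative model id; any Qwen-VL-family image processor behaves similarly.
processor = AutoImageProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

batch = processor(images=[decode_b64_image(b64)], return_tensors="pt")
pixel_values = batch["pixel_values"]      # flattened patch features for all images
image_grid_thw = batch["image_grid_thw"]  # (temporal, height, width) grid per image
```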
## vLLM Configuration
`VLLM_WORKER_MULTIPROC_METHOD=spawn` is required for VLM inference. This is set automatically when using `uv run rl @ ...`, but if you start the vLLM server yourself, make sure this environment variable is set.
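If you launch vLLM from Python rather than via the server CLI (an assumption about your setup), set the variable before vLLM initializes its workers:

```python
# Sketch: force the "spawn" worker multiprocessing method before vLLM
# creates any workers. For a shell launch, export the variable instead.
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM  # import after setting the env var

llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct")  # illustrative model id
```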