Development

This page covers workflows for developing on prime-rl itself — running the test suite, contributing changes, and adding new model architectures with the small-scale tooling we use to iterate on MoE families without booting up a 100B+ run.

Test Suite
Pre-Commit Hooks
Adding a New Model
Adding a Custom VLM Implementation

Test Suite

The test suite is split into three tiers, each with its own CI workflow.

Layout

tests/unit/ — fast-running, hermetic tests for isolated logic: config parsing and validation, advantage / loss / scheduler / packer math, individual dataset paths, model-conversion roundtrips, etc. Tests that need a GPU are tagged with the gpu marker.
tests/integration/ — full-stack RL/SFT runs on a tiny model end-to-end through inference + orchestrator + trainer.
tests/nightly/ — runs the configs in examples/ every night to catch regressions in the shipped examples.

Running Tests Locally

uv run pytest -v                                           # everything
uv run pytest tests/unit -v                                # unit only
uv run pytest tests/integration -v                         # integration only
uv run pytest -v -m "not gpu"                              # CPU-only subset (mirrors CPU CI)
uv run pytest -v -m gpu                                    # GPU-only subset
uv run pytest tests/integration/test_reverse_text.py -vvs  # one specific scenario

CI Workflows

Workflow	Trigger	What runs	Where
`cpu_tests.yaml`	every PR + push to `main`	`pytest tests/unit -m "not gpu"`, plus a slim-wheel install check that `prime-rl-configs` imports cleanly without heavy deps (no torch / vllm / transformers / wandb / verifiers / datasets / liger / loguru in `sys.modules`)	`ubuntu-latest`
`gpu_tests.yaml`	every non-draft PR + push to `main`	`pytest tests/unit -m gpu`, plus a matrix of named integration scenarios (`reverse_text`, `reverse_text_sft`, `reverse_text_lora`, `reverse_text_moe`, `reverse_text_multi_run`, `reverse_text_rl_opd`, `reverse_text_rl_sft`, `reverse_text_sft_lora`, `alphabet_sort`, `benchmark_regression`)	self-hosted GPU runners (`vm`, `4xa6000`)
`nightly_tests.yaml`	03:00 PST daily + manual `workflow_dispatch` (single-file filter optional)	every file in `tests/nightly/`, one matrix job per file	`research-cluster`

The GPU + Nightly workflows skip drafts — open the PR as Draft until you’re ready to consume CI compute, then mark it ready for review to trigger the GPU matrix.

Markers

Two pytest markers are declared in pyproject.toml (addopts = "--strict-markers"):

gpu — gate a test that needs CUDA. CPU CI uses -m "not gpu"; the GPU unit job uses -m gpu.
slow — gate a test that’s expensive enough you’d usually skip it locally. Deselect with -m "not slow".

Pre-Commit Hooks

Install the pre-commit hooks before your first commit so ruff check + format run on staged Python files automatically:

uv run pre-commit install

Adding a New Model

Bringing up a new model family is three steps: implement the modeling code, register a mini preset, and run the smoke test. The preset and smoke test let you iterate on the modeling code at ~0.5B scale on 1–2 GPUs instead of paying the cost of the full-size model — useful for catching bugs in modeling code, state-dict conversions, and pipeline integration before scaling.

Implement the Modeling Code

Drop the modeling code under src/prime_rl/trainer/models/<arch>/ (HF-compatible config, modeling, and weight conversion). Mirror the layout of an existing family — glm4_moe/ or qwen3_moe/ are good starting points.

Register a Mini Preset

Add an entry to scripts/mini_moe.py so the smoke-test workflow can build a ~0.5B test model in your architecture. The preset names the config class, picks small dimensions, and wires up the HF + prime-rl model classes plus a tokenizer source:

ARCH_PRESETS = {
    "glm4_moe": {
        "config_class": Glm4MoeConfig,
        "config_kwargs": dict(hidden_size=1024, num_hidden_layers=24, n_routed_experts=8, ...),
        "hf_model_class": HFGlm4MoeForCausalLM,
        "prime_model_class": PrimeRLGlm4MoeForCausalLM,
        "tokenizer_source": "THUDM/GLM-4-9B-0414",
    },
    # add your arch here
}

Run the Smoke Test

Build the mini model. This creates a ~543M-parameter GLM-4 MoE (1024 hidden, 24 layers, 8 experts) with random weights, copies the tokenizer from the original GLM-4 model, and verifies the HF↔prime-rl roundtrip is lossless:

uv run python scripts/mini_moe.py --arch glm4_moe --output-dir ./mini-glm-moe

To re-verify the roundtrip after a modeling-code change without re-creating the model:

uv run python scripts/mini_moe.py --arch glm4_moe --output-dir ./mini-glm-moe --verify-only

Warm up the random-weight mini model with SFT on reverse-text so KL divergence becomes meaningful in the RL phase. Loss drops from ~12 to ~2.5 — the output won’t be coherent, but the distribution is non-trivial. A pre-built SFT’d checkpoint lives at samsja/mini-glm-moe if you want to skip this step:

uv run sft \
  --model.name ./mini-glm-moe \
  --data.name PrimeIntellect/Reverse-Text-SFT \
  --max_steps 200 \
  --optim.lr 1e-4 \
  --ckpt.weights

Then run the full RL stack on reverse-text:

uv run rl @ configs/ci/integration/reverse-text-moe/start.toml \
  --model.name samsja/mini-glm-moe \
  --trainer.model.impl custom \
  --inference.gpu-memory-utilization 0.7 \
  --inference.model.max-model-len 2048

What to look for:

No crashes. Validates the full inference + orchestrator + trainer pipeline end-to-end.
Finite, non-zero KL. Confirms the reference distribution is meaningful.
Loss reasonable. Not NaN, not stuck.

Don’t expect reward to climb meaningfully in 20 steps on a random model.

Requirements for merging a new model

Before merging a new model, you need to ensure the following:

The model is correctly registered and defines and all the required methods - such as convert_hf_layer_to_tt and convert_tt_layer_to_hf.
The small smoke test passes.

In the PR that adds the new model, you also need to provide a table covering the KL mismatch across 20 steps on math environment with batch_size=64. All the entries in the table must lower than 0.015. If this is not met, the PR will not be merged (unless reasonable justification is provided). This is to ensure all our models are consistent and their implementations match the implementations in the inference framework.

Adding a Custom VLM Implementation

VLM training (any run with [model.vlm] set, SFT or RL) is custom-implementation-only: get_model rejects models without a custom PrimeRL VLM class at load time. To make a new VLM family trainable, extend a custom text model with a composite VLM body — the Qwen3.5 dense (models/qwen3_5/) and MoE (models/qwen3_5_moe/) implementations are the reference. The pieces, in dependency order:

Custom text model first. The VLM body wraps a custom *ForCausalLM (see Adding a New Model), so the text side — including its state-dict conversion and KL-mismatch table — comes first.
Composite VLM body. A *VLMModel that holds the HF vision encoder and the custom text model, with a prepare_inputs_embeds_and_position_ids step: embed tokens, run the vision encoder, scatter image embeddings over placeholder tokens, and build MRoPE 3D positions from mm_token_type_ids (the renderer owns the token→modality mapping). The unified *ForCausalLM dispatches on the config: composite config → VLM path, text config → text path.
Always run the vision encoder. Text-only micro-batches must feed the encoder dummy pixels and graft the result into the graph with zero contribution (inputs_embeds + image_embeds.sum() * 0.0) so FSDP/EP collectives stay symmetric across ranks when the encoder is trainable.
Packed-boundary consumption. Samples pack into shared rows with per-document boundaries in seq_lens; every custom model’s forward() declares the typed seq_lens/seq_lens_are_pre_shard parameters (the trainer passes them unconditionally) and must honor the boundaries — varlen flash cu_seqlens, linear-attention state resets per document, and a loud rejection on attention paths that can’t (see the packed-batch guard in any modeling file). Set supports_packed_multimodal_training on the VLM model once packed rows are handled — RL fails loudly at startup for VLM models without it.
Registration. Register the composite model_type in _CUSTOM_VLM_MAPPING (models/__init__.py) so get_model dispatches to the custom class, and describe the family in VLM_REGISTRY (utils/vlm.py).
Context parallelism (optional). CP-capable VLMs implement set_context_parallel_attributes and shard embeds/positions inside the model after the vision merge; the trainers defer sharding to the model for MRoPE batches under ulysses.
Validation. Same bar as text models: the KL-mismatch table for the text path, plus an SFT run and an RL run on a real multimodal dataset (the color-codeword environment is the reference task).

Getting Started

Lab

Libraries

Compute

Table of Contents

Test Suite

Layout

Running Tests Locally

CI Workflows

Markers

Pre-Commit Hooks

Adding a New Model

Implement the Modeling Code

Register a Mini Preset

Run the Smoke Test

Requirements for merging a new model

Adding a Custom VLM Implementation

​Table of Contents

​Test Suite

​Layout

​Running Tests Locally

​CI Workflows

​Markers

​Pre-Commit Hooks

​Adding a New Model

​Implement the Modeling Code

​Register a Mini Preset

​Run the Smoke Test

​Requirements for merging a new model

​Adding a Custom VLM Implementation

Table of Contents

Test Suite

Layout

Running Tests Locally

CI Workflows

Markers

Pre-Commit Hooks

Adding a New Model

Implement the Modeling Code

Register a Mini Preset

Run the Smoke Test

Requirements for merging a new model

Adding a Custom VLM Implementation