prime-rl itself — running the test suite, contributing changes, and adding new model architectures with the small-scale tooling we use to iterate on MoE families without booting up a 100B+ run.
Test Suite
The test suite is split into three tiers, each with its own CI workflow.
Layout
- tests/unit/: fast-running, hermetic tests for isolated logic (config parsing and validation, advantage / loss / scheduler / packer math, individual dataset paths, model-conversion roundtrips, etc.). Tests that need a GPU are tagged with the gpu marker.
- tests/integration/: full-stack RL/SFT runs that take a tiny model end-to-end through inference + orchestrator + trainer.
- tests/nightly/: runs the configs in examples/ every night to catch regressions in the shipped examples.
Running Tests Locally
CI Workflows
| Workflow | Trigger | What runs | Where |
|---|---|---|---|
| cpu_tests.yaml | every PR + push to main | pytest tests/unit -m "not gpu", plus a slim-wheel install check that prime-rl-configs imports cleanly without heavy deps (no torch / vllm / transformers / wandb / verifiers / datasets / liger / loguru in sys.modules) | ubuntu-latest |
| gpu_tests.yaml | every non-draft PR + push to main | pytest tests/unit -m gpu, plus a matrix of named integration scenarios (reverse_text, reverse_text_sft, reverse_text_lora, reverse_text_moe, reverse_text_multi_run, reverse_text_rl_opd, reverse_text_rl_sft, reverse_text_sft_lora, alphabet_sort, benchmark_regression) | self-hosted GPU runners (vm, 4xa6000) |
| nightly_tests.yaml | 03:00 PST daily + manual workflow_dispatch (single-file filter optional) | every file in tests/nightly/, one matrix job per file | research-cluster |
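The slim-wheel check in the cpu_tests.yaml row can be pictured with a short sketch. This is not the workflow's actual script: the helper names (heavy_roots, leaked_after_import) are hypothetical, the underscore-suffix matching is an assumption, and because the import name of prime-rl-configs is not given here, the usage line imports a standard-library module as a stand-in.

```python
import importlib
import sys

# Heavy dependencies named in the cpu_tests.yaml row above.
HEAVY_DEPS = ("torch", "vllm", "transformers", "wandb",
              "verifiers", "datasets", "liger", "loguru")


def heavy_roots(module_names, heavy=HEAVY_DEPS):
    """Return the sorted heavy top-level packages found among module_names."""
    found = set()
    for name in module_names:
        root = name.split(".")[0]
        for dep in heavy:
            # Match the package itself or an underscore-suffixed import name
            # (assumption: some packages import under a "<dep>_..." name).
            if root == dep or root.startswith(dep + "_"):
                found.add(dep)
    return sorted(found)


def leaked_after_import(module_name):
    """Import module_name, then report heavy packages present in sys.modules."""
    importlib.import_module(module_name)
    return heavy_roots(list(sys.modules))


if __name__ == "__main__":
    # "json" is a stand-in; substitute the real import name of the slim wheel.
    print("leaked heavy deps:", leaked_after_import("json"))
```

A check like this is only meaningful in a fresh interpreter, since sys.modules also contains anything imported earlier in the same process.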
Markers
Two pytest markers are declared in pyproject.toml (addopts = "--strict-markers"):
- gpu: gate a test that needs CUDA. CPU CI uses -m "not gpu"; the GPU unit job uses -m gpu.
- slow: gate a test that’s expensive enough you’d usually skip it locally. Deselect with -m "not slow".
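As an illustration of how the two markers are applied, here is a minimal sketch of a test module. The test names and bodies are placeholders invented for this example, not tests from the repository; a real gpu test would exercise CUDA code.

```python
import pytest


@pytest.mark.gpu
def test_needs_cuda_placeholder():
    # Hypothetical GPU-only test: selected by -m gpu, skipped by -m "not gpu".
    assert sum(range(4)) == 6


@pytest.mark.slow
def test_expensive_placeholder():
    # Hypothetical slow test: deselected by -m "not slow".
    assert sorted([3, 1, 2]) == [1, 2, 3]


def test_plain_placeholder():
    # Unmarked: selected by both -m "not gpu" and -m "not slow".
    assert "prime-rl".upper() == "PRIME-RL"
```

Because --strict-markers is set, a misspelled or unregistered marker name fails at collection time instead of being silently ignored.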
Pre-Commit Hooks
Install the pre-commit hooks before your first commit so that ruff check and format run on staged Python files automatically.
Adding a New Model
Bringing up a new model family is three steps: implement the modeling code, register a mini preset, and run the smoke test. The preset and smoke test let you iterate on the modeling code at ~0.5B scale on 1–2 GPUs instead of paying the cost of the full-size model, which is useful for catching bugs in modeling code, state-dict conversions, and pipeline integration before scaling.
Implement the Modeling Code
Drop the modeling code under src/prime_rl/trainer/models/<arch>/ (HF-compatible config, modeling, and weight conversion). Mirror the layout of an existing family; glm4_moe/ and qwen3_moe/ are good starting points.
Register a Mini Preset
Add an entry to scripts/mini_moe.py so the smoke-test workflow can build a ~0.5B test model in your architecture. The preset names the config class, picks small dimensions, and wires up the HF + PrimeRL model classes plus a tokenizer source.
Run the Smoke Test
Build the mini model. This creates a ~543M-parameter GLM-4 MoE (1024 hidden, 24 layers, 8 experts) with random weights, copies the tokenizer from the original GLM-4 model, and verifies that the HF↔PrimeRL roundtrip is lossless.
In the smoke run, check for the following:
- No crashes. Validates the full inference + orchestrator + trainer pipeline end-to-end.
- Finite, non-zero KL. Confirms the reference distribution is meaningful.
- Loss reasonable. Not NaN, not stuck.
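The last two checks can be made mechanical once per-step loss and KL values have been collected from a run. The sketch below is one possible reading of "finite, non-zero KL" and "not NaN, not stuck"; the function name, the min_loss_change tolerance, and the idea of passing plain lists of floats are assumptions for illustration, not part of prime-rl.

```python
import math


def check_smoke_metrics(losses, kls, min_loss_change=1e-8):
    """Return a list of problems found in per-step loss and KL values."""
    if not losses or not kls:
        return ["no metrics recorded"]
    problems = []
    if any(not math.isfinite(x) for x in losses):
        problems.append("loss is NaN or infinite")
    elif len(losses) > 1 and max(losses) - min(losses) <= min_loss_change:
        # Assumed definition of "stuck": the loss never moves across steps.
        problems.append("loss is stuck")
    if any(not math.isfinite(k) for k in kls):
        problems.append("KL is NaN or infinite")
    elif all(k == 0.0 for k in kls):
        problems.append("KL is exactly zero at every step")
    return problems


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(check_smoke_metrics([2.1, 1.9, 1.8], [0.01, 0.02, 0.015]))
```

An empty list means none of the listed failure modes were detected; it does not replace looking at the curves.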
Requirements for merging a new model
Before merging a new model, ensure the following:
- The model is correctly registered and defines all the required methods, such as convert_hf_layer_to_tt and convert_tt_layer_to_hf.
- The small smoke test passes.
math environment with batch_size=64. All the entries in the table must be lower than 0.015. If this is not met, the PR will not be merged unless a reasonable justification is provided. This ensures that all our models are consistent and that their implementations match those in the inference framework.
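The 0.015 threshold is easy to check mechanically before opening the PR. The sketch below assumes the table has been copied into a dict that maps a label to a measured value; the table, its labels, and the metric behind each entry are not shown in this section, so the names and values here are placeholders.

```python
THRESHOLD = 0.015  # from the merge requirement above


def entries_over_threshold(table, threshold=THRESHOLD):
    """Return {label: value} for entries that are not strictly below threshold."""
    # "not value < threshold" also flags NaN, which compares false to everything.
    return {label: value for label, value in table.items()
            if not value < threshold}


if __name__ == "__main__":
    # Hypothetical labels and values, for illustration only.
    example = {"row_a": 0.004, "row_b": 0.0149, "row_c": 0.021}
    print(entries_over_threshold(example))
```

An empty result means every entry meets the requirement; anything returned needs either a fix or a written justification.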