Math Reasoning
Train models to solve mathematical problems step-by-step, using symbolic verification to reward correct answers. Environment type: `SingleTurnEnv` with `MathRubric`
Why RL works here: Models learn to produce correct final answers through trial and error. The reward signal is binary and cheap to compute — symbolic math verification checks whether the model’s `\boxed{}` answer matches the ground truth, without needing an LLM judge.
Example environment:
- Start with GSM8K for validation — baseline models typically score 40–70%, leaving room for improvement.
- For harder tasks (AIME, competition math), use a larger model like `Qwen/Qwen3-235B-A22B-Thinking-2507` and increase `max_tokens`.
- Enable difficulty filtering to focus training on problems the model can solve partially but not consistently.
Code Generation with Sandboxes
Train models to write correct code by executing their solutions in sandboxed environments and verifying outputs against test cases. Environment type: `PythonEnv` or `SandboxEnv`
Why RL works here: The model gets a concrete pass/fail signal from running code. Unlike static checking, execution-based verification catches subtle bugs and rewards solutions that actually work. Multi-turn interaction lets the model iteratively debug when tests fail.
Example environment:
- Use `PythonEnv` for Python-specific tasks — it provides a persistent REPL that the model can use across turns.
- Use `SandboxEnv` for multi-language tasks or when you need shell access.
- Set `max_turns` to 3–5 to let the model iterate on failing test cases.
- Consider a partial reward for passing some but not all tests, rather than all-or-nothing scoring.
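The partial-reward idea in the last bullet can be sketched like this. Note this is a minimal illustration, not a sandbox: it runs code in a plain subprocess, which is no substitute for the isolation that `SandboxEnv` provides.

```python
import os
import subprocess
import sys
import tempfile

def run_tests(solution: str, tests: list[tuple[str, str]],
              timeout: float = 5.0) -> float:
    """Run `solution` once per (stdin, expected_stdout) case and return the
    fraction of cases passed, rather than an all-or-nothing 0/1 score."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution)
        path = f.name
    passed = 0
    try:
        for stdin, expected in tests:
            try:
                proc = subprocess.run(
                    [sys.executable, path], input=stdin,
                    capture_output=True, text=True, timeout=timeout,
                )
                if proc.stdout.strip() == expected.strip():
                    passed += 1
            except subprocess.TimeoutExpired:
                pass  # a hung case simply scores zero
    finally:
        os.unlink(path)
    return passed / len(tests) if tests else 0.0
```

Fractional scores give the optimizer a gradient even when the model fixes only some of the failing cases between turns.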
Multi-Turn Games and Puzzles
Train models on interactive tasks where they must take actions over multiple turns, receiving feedback after each move. Environment type: custom `MultiTurnEnv` subclass
Why RL works here: Games provide dense, structured reward signals. The model learns strategies through repeated play — each rollout is a complete game, and the final score becomes the reward. Multi-turn structure naturally teaches planning and sequential decision-making.
Example environment (word guessing game):
- Games are excellent for validating your setup since they tend to show clear reward improvements within a small number of steps.
- Shape rewards to be gradient-rich — instead of just 0/1 for win/loss, give partial credit (e.g., reward based on number of turns taken to win).
- The built-in `alphabet-sort` environment is a great starting point — install it with `prime env install primeintellect/alphabet-sort`.
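One way to implement the turn-based shaping suggested above (the specific constants — a 0.5 floor for any win, scaled up for faster wins — are arbitrary illustration choices, not values from any built-in environment):

```python
def shaped_game_reward(won: bool, turns_taken: int, max_turns: int) -> float:
    """Gradient-rich game reward instead of a flat 0/1 win/loss signal.

    A win always scores at least 0.5; winning in fewer turns scores up to
    1.0, so the model is rewarded for faster solutions, not just wins.
    """
    if not won:
        return 0.0
    # linearly interpolate: win on turn 1 -> 1.0, win on the last turn -> 0.5
    speed_bonus = (max_turns - turns_taken) / max(max_turns - 1, 1)
    return 0.5 + 0.5 * speed_bonus
```

Shaping like this gives early training steps something to climb even before the model wins reliably, which is part of why games show reward movement so quickly.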
Tool Use and Agentic Tasks
Train models to use tools effectively — calling the right tool with the right arguments to accomplish a goal. Environment type: `ToolEnv` or `MCPEnv`
Why RL works here: Tool use requires the model to reason about which tool to call, compose correct arguments, interpret results, and decide on next steps. RL training lets the model learn this decision-making loop through practice, improving both tool selection and argument construction.
Example environment (research assistant with search):
- Use `JudgeRubric` with an LLM judge for open-ended tasks where exact matching isn’t feasible.
- Store API keys for external services (judge models, search APIs) in a `secrets.env` file and reference it with `env_file`.
- Monitor tool call counts via the automatic metrics — if the model isn’t calling tools, the task may need a clearer prompt.
- `MCPEnv` is useful when your tools are already implemented as MCP servers.
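In the verifiers style, a tool is typically a plain Python function whose type hints and docstring describe it to the model. This stubbed search tool (the canned results and function name are hypothetical; a real one would call an external search API using a key from `secrets.env`) illustrates the shape:

```python
def search(query: str) -> str:
    """Search the web and return a short snippet of results.

    Stub for illustration: returns canned text instead of calling a real
    search API, so the example stays self-contained and key-free.
    """
    canned = {
        "capital of France": "Paris is the capital of France.",
    }
    return canned.get(query, "No results found.")
```

Keeping the signature typed and the docstring descriptive matters even in a stub: that metadata is what the model sees when deciding which tool to call and with what arguments.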
Multi-Environment Training
Train a single model on multiple tasks simultaneously to improve generalization. Why RL works here: Training on diverse tasks prevents the model from overfitting to a single task’s reward surface. The model learns transferable skills (reasoning, tool use, instruction following) that improve performance across all tasks. Training config:
- Use `env_ratios` in the `[buffer]` section to control how much data comes from each environment.
- Enable `online_difficulty_filtering` to automatically focus on examples at the right difficulty level for each environment.
- Run baseline evaluations on each environment before training to understand starting performance.
- Use W&B logging to compare per-environment reward curves during training.
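Putting the config bullets above into a sketch (the key names come from the bullets; the ratio values, environment count, and the exact section where `online_difficulty_filtering` lives are assumptions):

```toml
# Illustrative fragment only; values are placeholders.
[buffer]
# e.g. 50% math, 30% code, 20% games across three registered environments
env_ratios = [0.5, 0.3, 0.2]
online_difficulty_filtering = true
```

The ratios control sampling, not scheduling: every batch can still mix rollouts from all environments.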
Workflow Summary
Regardless of the use case, the typical workflow for hosted RL is:

1. Build your environment: create an environment with a dataset, harness, and rubric using the verifiers library.
2. Monitor and iterate: watch reward curves on the dashboard, adjust your environment or config, and re-run.
3. Deploy: download the trained LoRA adapter or deploy it for inference.