Each example here covers a common RL use case: what kind of environment to build, a minimal working implementation, a training config, and practical tips. Use these as starting points or drop-in templates for your own runs on Lab. If you haven’t launched a training run yet, start with the Getting Started guide first.

Math Reasoning

Train models to solve mathematical problems step-by-step, using symbolic verification to reward correct answers.

Environment type: SingleTurnEnv with MathRubric

Why RL works here: Models learn to produce correct final answers through trial and error. The reward signal is binary and cheap to compute: symbolic math verification checks whether the model's \boxed{} answer matches the ground truth, without needing an LLM judge.

Example environment:
import verifiers as vf
from datasets import load_dataset

def load_environment(split: str = "train", num_examples: int = -1) -> vf.Environment:
    ds = load_dataset("openai/gsm8k", split=split)
    dataset = vf.Dataset.from_hf(ds, question_col="question", answer_col="answer")
    if num_examples > 0:
        dataset = dataset.select(range(num_examples))

    rubric = vf.MathRubric()
    return vf.SingleTurnEnv(
        dataset=dataset,
        rubric=rubric,
        system_prompt="Solve the problem step by step. Put your final answer in \\boxed{}.",
    )
Training config:
model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 1024

[[env]]
id = "your-username/gsm8k"
Tips:
  • Start with GSM8K for validation — baseline models typically score 40–70%, leaving room for improvement.
  • For harder tasks (AIME, competition math), use a larger model like Qwen/Qwen3-235B-A22B-Thinking-2507 and increase max_tokens.
  • Enable difficulty filtering to focus training on problems the model can partially but not consistently solve.
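The difficulty-filtering tip can be illustrated with a standalone sketch. The function and thresholds below are illustrative, not part of the verifiers API: the idea is to keep only examples whose baseline pass rate sits strictly between "never solved" and "always solved", since those provide the strongest learning signal.

```python
def filter_by_difficulty(examples, pass_rates, low=0.1, high=0.9):
    """Keep examples the model solves sometimes but not consistently.

    pass_rates holds the fraction of baseline rollouts that solved each
    example; thresholds are hypothetical defaults for illustration.
    """
    return [ex for ex, rate in zip(examples, pass_rates) if low < rate < high]

examples = ["p1", "p2", "p3", "p4"]
pass_rates = [0.0, 0.5, 1.0, 0.25]  # per-example baseline solve rates
print(filter_by_difficulty(examples, pass_rates))  # ['p2', 'p4']
```

Examples with a 0% rate give no successful rollouts to learn from, and examples with a 100% rate give no gradient, so both are dropped.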

Code Generation with Sandboxes

Train models to write correct code by executing their solutions in sandboxed environments and verifying outputs against test cases.

Environment type: PythonEnv or SandboxEnv

Why RL works here: The model gets a concrete pass/fail signal from running code. Unlike static checking, execution-based verification catches subtle bugs and rewards solutions that actually work. Multi-turn interaction lets the model iteratively debug when tests fail.

Example environment:
import verifiers as vf
from datasets import Dataset

def load_environment() -> vf.Environment:
    dataset = Dataset.from_list([
        {
            "question": "Write a function `fibonacci(n)` that returns the nth Fibonacci number.",
            "info": '{"test_code": "assert fibonacci(0) == 0\\nassert fibonacci(1) == 1\\nassert fibonacci(10) == 55"}'
        },
        # ... more examples
    ])

    async def tests_pass(completion, info, state) -> float:
        # The harness executes the model's code against info["test_code"]
        # and records the outcome in state.
        exec_result = state.get("exec_result", "")
        return 1.0 if "PASSED" in exec_result else 0.0

    rubric = vf.Rubric(funcs=[tests_pass])
    return vf.PythonEnv(
        dataset=dataset,
        rubric=rubric,
        max_turns=5,
    )
Training config:
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 300
batch_size = 256
rollouts_per_example = 16

[sampling]
max_tokens = 2048

[[env]]
id = "your-username/code-gen"
Tips:
  • Use PythonEnv for Python-specific tasks — it provides a persistent REPL that the model can use across turns.
  • Use SandboxEnv for multi-language tasks or when you need shell access.
  • Set max_turns to 3–5 to let the model iterate on failing test cases.
  • Consider a partial reward for passing some but not all tests, rather than all-or-nothing scoring.
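The partial-reward tip can be sketched as a simple reward function. The function name and signature here are hypothetical; the point is that scoring the fraction of tests passed gives a denser signal than all-or-nothing:

```python
def partial_test_reward(num_passed: int, num_total: int) -> float:
    """Reward the fraction of test cases passed instead of all-or-nothing.

    A solution passing 3 of 4 tests earns 0.75 rather than 0.0, giving
    the model credit for near-misses it can improve on.
    """
    if num_total == 0:
        return 0.0
    return num_passed / num_total

print(partial_test_reward(3, 4))  # 0.75
print(partial_test_reward(0, 4))  # 0.0
```

You could plug logic like this into the rubric in place of the binary `tests_pass` check, parsing per-test results out of the execution output.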

Multi-Turn Games and Puzzles

Train models on interactive tasks where they must take actions over multiple turns, receiving feedback after each move.

Environment type: Custom MultiTurnEnv subclass

Why RL works here: Games provide dense, structured reward signals. The model learns strategies through repeated play — each rollout is a complete game, and the final score becomes the reward. Multi-turn structure naturally teaches planning and sequential decision-making.

Example environment (word guessing game):
import verifiers as vf
from datasets import Dataset
import random

class WordGameEnv(vf.MultiTurnEnv):
    async def setup_state(self, state, **kwargs):
        state["target"] = state["info"]["target_word"]
        state["guesses"] = []
        return await super().setup_state(state, **kwargs)

    async def env_response(self, messages, state):
        guess = messages[-1]["content"].strip().lower()
        target = state["target"]
        state["guesses"].append(guess)

        if guess == target:
            state["won"] = True
            return [{"role": "user", "content": "Correct! You found the word."}]

        # Give hints: which letters are in the right position
        hints = []
        for i, (g, t) in enumerate(zip(guess, target)):
            if g == t:
                hints.append(f"Position {i+1}: correct")
            elif g in target:
                hints.append(f"Position {i+1}: wrong position, letter is in the word")
            else:
                hints.append(f"Position {i+1}: letter not in word")

        return [{"role": "user", "content": "\n".join(hints) + "\nGuess again."}]

    @vf.stop
    async def game_won(self, state):
        return state.get("won", False)


def load_environment() -> vf.Environment:
    words = ["apple", "brain", "cloud", "dance", "eagle"]
    dataset = Dataset.from_list([
        {"question": "Guess the 5-letter word. I'll give you hints after each guess.",
         "info": f'{{"target_word": "{w}"}}'} for w in words
    ])

    async def win_reward(state) -> float:
        if state.get("won"):
            return max(0.2, 1.0 - 0.15 * len(state["guesses"]))
        return 0.0

    rubric = vf.Rubric(funcs=[win_reward])
    return WordGameEnv(dataset=dataset, rubric=rubric, max_turns=8)
Training config:
model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 100
batch_size = 128
rollouts_per_example = 8

[sampling]
max_tokens = 256

[[env]]
id = "your-username/word-game"
Tips:
  • Games are excellent for validating your setup since they tend to show clear reward improvements within a small number of steps.
  • Shape rewards to be gradient-rich — instead of just 0/1 for win/loss, give partial credit (e.g., reward based on number of turns taken to win).
  • The built-in alphabet-sort environment is a great starting point — install it with prime env install primeintellect/alphabet-sort.
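As a quick sanity check on the shaped reward used in the example above, the formula `max(0.2, 1.0 - 0.15 * num_guesses)` decays smoothly with guess count and then hits a floor, so even a slow win beats a loss:

```python
def win_reward_value(num_guesses: int) -> float:
    """Reward for a won game: decays with each guess, floored at 0.2."""
    return max(0.2, 1.0 - 0.15 * num_guesses)

for n in range(1, 9):
    print(n, win_reward_value(n))  # 1 guess ~0.85, floor of 0.2 from 6 guesses on
```

The floor keeps the win/loss distinction (0.2 vs 0.0) intact while the decay rewards efficiency, which is exactly the gradient-rich shaping the tip above recommends.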

Tool Use and Agentic Tasks

Train models to use tools effectively — calling the right tool with the right arguments to accomplish a goal.

Environment type: ToolEnv or MCPEnv

Why RL works here: Tool use requires the model to reason about which tool to call, compose correct arguments, interpret results, and decide on next steps. RL training lets the model learn this decision-making loop through practice, improving both tool selection and argument construction.

Example environment (research assistant with search):
import verifiers as vf
from datasets import Dataset

async def web_search(query: str) -> str:
    """Search the web for information.

    Args:
        query: The search query to look up.

    Returns:
        Search results as text.
    """
    # your search implementation
    return await do_search(query)

async def calculate(expression: str) -> str:
    """Evaluate a mathematical expression.

    Args:
        expression: A math expression to evaluate (e.g. "2 + 2 * 3").

    Returns:
        The result of the evaluation.
    """
    try:
        # Note: eval is convenient for a demo but unsafe on untrusted
        # input; use a proper expression parser in production.
        return str(eval(expression))
    except Exception as e:
        return f"Error: {e}"

def load_environment() -> vf.Environment:
    dataset = Dataset.from_list([
        {
            "question": "What is the population of France divided by the population of Switzerland?",
            "answer": "approximately 8.3"
        },
        # ... more examples requiring tool use
    ])

    async def answer_quality(completion, answer, judge) -> float:
        verdict = await judge(completion, answer)
        return 1.0 if "correct" in verdict.lower() else 0.0

    rubric = vf.JudgeRubric(judge_model="gpt-4.1-mini")
    rubric.add_reward_func(answer_quality)

    return vf.ToolEnv(
        dataset=dataset,
        tools=[web_search, calculate],
        rubric=rubric,
        max_turns=10,
    )
Training config:
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 16

[sampling]
max_tokens = 1024

[[env]]
id = "your-username/research-assistant"

env_file = ["secrets.env"]
Tips:
  • Use JudgeRubric with an LLM judge for open-ended tasks where exact matching isn’t feasible.
  • Store API keys for external services (judge models, search APIs) in a secrets.env file and reference it with env_file.
  • Monitor tool call counts via the automatic metrics — if the model isn’t calling tools, the task may need a clearer prompt.
  • MCPEnv is useful when your tools are already implemented as MCP servers.
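The monitoring tip can be illustrated with a standalone sketch of counting tool calls in a chat-format rollout. The message structure and field names below are assumptions for illustration, not a verifiers API:

```python
def count_tool_calls(messages) -> int:
    """Count tool calls across the assistant turns of one rollout.

    Assumes OpenAI-style chat messages where assistant turns carry an
    optional "tool_calls" list (a hypothetical shape for this sketch).
    """
    return sum(
        len(m.get("tool_calls", []))
        for m in messages
        if m.get("role") == "assistant"
    )

rollout = [
    {"role": "user", "content": "Population of France / Switzerland?"},
    {"role": "assistant", "tool_calls": [{"name": "web_search"}]},
    {"role": "tool", "content": "France: ~68M; Switzerland: ~8.8M"},
    {"role": "assistant", "tool_calls": [{"name": "calculate"}]},
    {"role": "tool", "content": "7.7"},
    {"role": "assistant", "content": "Approximately 7.7."},
]
print(count_tool_calls(rollout))  # 2
```

If a metric like this stays near zero across rollouts, the model is answering from memory instead of using its tools, which usually means the prompt or task needs to make tool use more clearly necessary.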

Multi-Environment Training

Train a single model on multiple tasks simultaneously to improve generalization.

Why RL works here: Training on diverse tasks prevents the model from overfitting to a single task’s reward surface. The model learns transferable skills (reasoning, tool use, instruction following) that improve performance across all tasks.

Training config:
model = "Qwen/Qwen3-235B-A22B-Instruct-2507"
max_steps = 500
batch_size = 512
rollouts_per_example = 16

[sampling]
max_tokens = 2048

[[env]]
id = "primeintellect/gsm8k"
args = { split = "train" }

[[env]]
id = "your-username/code-gen"

[[env]]
id = "primeintellect/alphabet-sort"
args = { min_turns = 3, max_turns = 5 }

[buffer]
env_ratios = [0.4, 0.4, 0.2]
online_difficulty_filtering = true

[wandb]
project = "multi-env-training"
name = "235b-multi-task"

[eval]
interval = 100
eval_base_model = true
Tips:
  • Use env_ratios in the [buffer] section to control how much data comes from each environment.
  • Enable online_difficulty_filtering to automatically focus on examples at the right difficulty level for each environment.
  • Run baseline evaluations on each environment before training to understand starting performance.
  • Use W&B logging to compare per-environment reward curves during training.
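To build intuition for what `env_ratios = [0.4, 0.4, 0.2]` means, here is a standalone sketch of weighted environment sampling for one batch. This is an illustration of the ratio semantics, not the trainer's actual buffer logic:

```python
import random

def sample_envs(env_ids, ratios, batch_size, seed=0):
    """Draw a batch of environment assignments weighted by env_ratios."""
    rng = random.Random(seed)
    return rng.choices(env_ids, weights=ratios, k=batch_size)

batch = sample_envs(
    ["gsm8k", "code-gen", "alphabet-sort"], [0.4, 0.4, 0.2], batch_size=512
)
# Roughly 40% / 40% / 20% of the 512 rollout groups per environment:
print({env: batch.count(env) for env in set(batch)})
```

In other words, with a batch size of 512 you would expect on the order of ~205 examples each from gsm8k and code-gen and ~102 from alphabet-sort, with the exact split varying batch to batch.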

Workflow Summary

Regardless of the use case, the typical workflow for hosted RL is:

1. Build your environment: Create an environment with a dataset, harness, and rubric using the verifiers library.
2. Evaluate baseline: Run prime eval run against your environment to measure where the model starts.
3. Configure and launch training: Write a .toml config and launch with prime rl run.
4. Monitor and iterate: Watch reward curves on the dashboard, adjust your environment or config, and re-run.
5. Deploy: Download the trained LoRA adapter or deploy it for inference.