
# Training Recipes

> Practical RL training recipes for math reasoning, code generation, tool use, and more on Lab.

Each example here covers a common RL use case: what kind of environment to build, a minimal working implementation, a training config, and practical tips. Use these as starting points or drop-in templates for your own runs on [Lab](/hosted-training/what-is-lab).

If you haven't launched a training run yet, start with the [Getting Started](/hosted-training/getting-started) guide.

***

## Math Reasoning

Train models to solve mathematical problems step-by-step, using symbolic verification to reward correct answers.

**Environment type:** `SingleTurnEnv` with `MathRubric`

**Why RL works here:** Models learn to produce correct final answers through trial and error. The reward signal is binary and cheap to compute — symbolic math verification checks whether the model's `\boxed{}` answer matches the ground truth, without needing an LLM judge.

**Example environment:**

```python
import verifiers as vf
from datasets import load_dataset

def load_environment(split: str = "train", num_examples: int = -1) -> vf.Environment:
    ds = load_dataset("openai/gsm8k", split=split)
    dataset = vf.Dataset.from_hf(ds, question_col="question", answer_col="answer")
    if num_examples > 0:
        dataset = dataset.select(range(num_examples))

    rubric = vf.MathRubric()
    return vf.SingleTurnEnv(
        dataset=dataset,
        rubric=rubric,
        system_prompt="Solve the problem step by step. Put your final answer in \\boxed{}.",
    )
```
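
To make the reward signal described above concrete, here is a minimal sketch of boxed-answer checking. It is not how `MathRubric` works internally (the text above describes that as symbolic verification); it uses plain string comparison, and the helper names `extract_boxed`, `gsm8k_final_answer`, and `boxed_reward` are hypothetical. The one dataset fact it relies on is that GSM8K reference solutions end with `#### <answer>`.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    collected = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(collected).strip()
        collected.append(ch)
    return None  # braces never balanced


def gsm8k_final_answer(solution: str) -> str:
    """Take the text after the final '####' and drop thousands separators."""
    return solution.split("####")[-1].strip().replace(",", "")


def boxed_reward(completion_text: str, solution: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed(completion_text)
    if predicted is None:
        return 0.0
    return 1.0 if predicted.replace(",", "") == gsm8k_final_answer(solution) else 0.0
```

The raw GSM8K `answer` column holds the full worked solution, so a ground truth generally needs this kind of extraction before any exact comparison.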

**Training config:**

```toml
model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 1024

[[env]]
id = "your-username/gsm8k"
```

**Tips:**

* Start with GSM8K for validation — baseline models typically score 40–70%, leaving room for improvement.
* For harder tasks (AIME, competition math), use a larger model like `Qwen/Qwen3-235B-A22B-Thinking-2507` and increase `max_tokens`.

***

## Code Generation with Sandboxes

Train models to write correct code by executing their solutions in sandboxed environments and verifying outputs against test cases.

**Environment type:** `PythonEnv` or `SandboxEnv`

**Why RL works here:** The model gets a concrete pass/fail signal from running code. Unlike static checking, execution-based verification catches subtle bugs and rewards solutions that actually work. Multi-turn interaction lets the model iteratively debug when tests fail.

**Example environment:**

```python
import verifiers as vf
from datasets import Dataset

def load_environment() -> vf.Environment:
    dataset = Dataset.from_list([
        {
            "question": "Write a function `fibonacci(n)` that returns the nth Fibonacci number.",
            "info": '{"test_code": "assert fibonacci(0) == 0\\nassert fibonacci(1) == 1\\nassert fibonacci(10) == 55"}'
        },
        # ... more examples
    ])

    async def tests_pass(completion, info, state) -> float:
        # Assumes the environment has stored the test run's output in
        # state["exec_result"]; nothing in this example populates it.
        exec_result = state.get("exec_result", "")
        return 1.0 if "PASSED" in exec_result else 0.0

    rubric = vf.Rubric(funcs=[tests_pass])
    return vf.PythonEnv(
        dataset=dataset,
        rubric=rubric,
        max_turns=5,
    )
```

**Training config:**

```toml
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 300
batch_size = 256
rollouts_per_example = 16

[sampling]
max_tokens = 2048

[[env]]
id = "your-username/code-gen"
```

**Tips:**

* Use `PythonEnv` for Python-specific tasks — it provides a persistent REPL that the model can use across turns.
* Use `SandboxEnv` for multi-language tasks or when you need shell access.
* Set `max_turns` to 3–5 to let the model iterate on failing test cases.
* Consider a partial reward for passing some but not all tests, rather than all-or-nothing scoring.
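
The partial-credit idea in the last tip can be sketched as follows. This is an illustration under assumptions, not part of the verifiers API: `fraction_of_tests_passed` is a hypothetical helper, it treats each non-empty line of `test_code` as one independent assertion, and it calls `exec` in-process purely for demonstration. Model-generated code should only ever run inside the sandbox, with a timeout.

```python
def fraction_of_tests_passed(solution_code: str, test_code: str) -> float:
    """Run each assert line separately and return the share that pass.

    Illustration only: in a real environment this belongs inside the sandbox,
    because solution_code is untrusted model output.
    """
    tests = [line for line in test_code.splitlines() if line.strip()]
    if not tests:
        return 0.0

    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # define the candidate function(s)
    except Exception:
        return 0.0  # code that fails to load earns nothing

    passed = 0
    for test in tests:
        try:
            exec(test, dict(namespace))  # fresh copy so tests stay independent
            passed += 1
        except Exception:
            pass  # a failed assertion or runtime error counts as a miss
    return passed / len(tests)
```

With the Fibonacci tests from the example dataset, a solution that handles the base cases but gets `fibonacci(10)` wrong would score 2/3 instead of 0, which gives the policy a signal to climb.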

***

## Multi-Turn Games and Puzzles

Train models on interactive tasks where they must take actions over multiple turns, receiving feedback after each move.

**Environment type:** Custom `MultiTurnEnv` subclass

**Why RL works here:** Games provide dense, structured reward signals. The model learns strategies through repeated play — each rollout is a complete game, and the final score becomes the reward. Multi-turn structure naturally teaches planning and sequential decision-making.

**Example environment (word guessing game):**

```python
import verifiers as vf
from datasets import Dataset

class WordGameEnv(vf.MultiTurnEnv):
    async def setup_state(self, state, **kwargs):
        state["target"] = state["info"]["target_word"]
        state["guesses"] = []
        return await super().setup_state(state, **kwargs)

    async def env_response(self, messages, state):
        guess = messages[-1]["content"].strip().lower()
        target = state["target"]
        state["guesses"].append(guess)

        if guess == target:
            state["won"] = True
            return [{"role": "user", "content": "Correct! You found the word."}]

        # Give hints: which letters are in the right position
        hints = []
        for i, (g, t) in enumerate(zip(guess, target)):
            if g == t:
                hints.append(f"Position {i+1}: correct")
            elif g in target:
                hints.append(f"Position {i+1}: wrong position, letter is in the word")
            else:
                hints.append(f"Position {i+1}: letter not in word")

        return [{"role": "user", "content": "\n".join(hints) + "\nGuess again."}]

    @vf.stop
    async def game_won(self, state):
        return state.get("won", False)


def load_environment() -> vf.Environment:
    words = ["apple", "brain", "cloud", "dance", "eagle"]
    dataset = Dataset.from_list([
        {"question": "Guess the 5-letter word. I'll give you hints after each guess.",
         "info": f'{{"target_word": "{w}"}}'} for w in words
    ])

    async def win_reward(state) -> float:
        if state.get("won"):
            return max(0.2, 1.0 - 0.15 * len(state["guesses"]))
        return 0.0

    rubric = vf.Rubric(funcs=[win_reward])
    return WordGameEnv(dataset=dataset, rubric=rubric, max_turns=8)
```
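
One edge case in the hint logic above: `zip(guess, target)` silently truncates when the guess is not the same length as the target, so a malformed guess gets partial or empty feedback. A minimal standalone sketch of the same hint rules with a length check might look like this. The function name `letter_hints` and the wording of the length message are assumptions, and like the original it does not track repeated letters the way Wordle does.

```python
def letter_hints(guess: str, target: str) -> list[str]:
    """Per-position feedback for a guess, rejecting guesses of the wrong length."""
    guess = guess.strip().lower()
    if len(guess) != len(target):
        return [f"Your guess must have exactly {len(target)} letters."]

    hints = []
    for position, (g, t) in enumerate(zip(guess, target), start=1):
        if g == t:
            hints.append(f"Position {position}: correct")
        elif g in target:
            hints.append(f"Position {position}: wrong position, letter is in the word")
        else:
            hints.append(f"Position {position}: letter not in word")
    return hints
```

Keeping this logic in a plain function also makes it easy to unit test without spinning up the environment.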

**Training config:**

```toml
model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 100
batch_size = 128
rollouts_per_example = 8

[sampling]
max_tokens = 256

[[env]]
id = "your-username/word-game"
```

**Tips:**

* Games are excellent for validating your setup since they tend to show clear reward improvements within a small number of steps.
* Shape rewards to be gradient-rich — instead of just 0/1 for win/loss, give partial credit (e.g., reward based on number of turns taken to win).
* The built-in `alphabet-sort` environment is a great starting point — install it with `prime env install primeintellect/alphabet-sort`.
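
To see what the shaped reward in the word game actually pays out, it helps to tabulate it. The sketch below restates the `win_reward` formula from the example as a plain function (the two-argument signature is an assumption made so it runs without the environment) and prints the reward for each possible number of guesses.

```python
def win_reward(num_guesses: int, won: bool) -> float:
    """Same formula as the example: lose 0.15 per guess, floored at 0.2 for a win."""
    if not won:
        return 0.0
    return max(0.2, 1.0 - 0.15 * num_guesses)


# Print the schedule for the 8 turns allowed by max_turns=8.
for n in range(1, 9):
    print(f"{n} guesses -> {round(win_reward(n, True), 2)}")
```

Under this formula a first-guess win earns 0.85 rather than 1.0, and the 0.2 floor applies from the sixth guess onward, so every win stays clearly separated from a loss.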

***

## Tool Use and Agentic Tasks

Train models to use tools effectively — calling the right tool with the right arguments to accomplish a goal.

**Environment type:** `ToolEnv` or `MCPEnv`

**Why RL works here:** Tool use requires the model to reason about which tool to call, compose correct arguments, interpret results, and decide on next steps. RL training lets the model learn this decision-making loop through practice, improving both tool selection and argument construction.

**Example environment (research assistant with search):**

```python
import verifiers as vf
from datasets import Dataset

async def web_search(query: str) -> str:
    """Search the web for information.

    Args:
        query: The search query to look up.

    Returns:
        Search results as text.
    """
    # Placeholder: `do_search` is not defined here; swap in your own search client.
    return await do_search(query)

async def calculate(expression: str) -> str:
    """Evaluate a mathematical expression.

    Args:
        expression: A math expression to evaluate (e.g. "2 + 2 * 3").

    Returns:
        The result of the evaluation.
    """
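    # Caution: eval() runs arbitrary Python. Model-generated input should go
    # through a restricted arithmetic parser or a sandbox instead.
    try:
        return str(eval(expression))
    except Exception as e:
        return f"Error: {e}"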

def load_environment() -> vf.Environment:
    dataset = Dataset.from_list([
        {
            "question": "What is the population of France divided by the population of Switzerland?",
            "answer": "approximately 8.3"
        },
        # ... more examples requiring tool use
    ])

    async def answer_quality(completion, answer, judge) -> float:
        verdict = await judge(completion, answer)
        return 1.0 if "correct" in verdict.lower() else 0.0

    rubric = vf.JudgeRubric(judge_model="gpt-4.1-mini")
    rubric.add_reward_func(answer_quality)

    return vf.ToolEnv(
        dataset=dataset,
        tools=[web_search, calculate],
        rubric=rubric,
        max_turns=10,
    )
```
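
Because `calculate` receives model-generated strings, passing them to `eval` lets a rollout execute arbitrary Python. One alternative is a restricted evaluator built on the standard `ast` module. The sketch below is an assumption about what a safer tool could look like, not something taken from verifiers; `safe_calculate` is a hypothetical name, and it deliberately supports only numbers and basic arithmetic (exponentiation is left out to avoid very expensive expressions).

```python
import ast
import operator

# Whitelisted operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node: ast.AST):
    """Recursively evaluate a parsed expression made of numbers and basic operators."""
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
        raise ValueError("only numeric literals are allowed")
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def safe_calculate(expression: str) -> str:
    """Evaluate a math expression without executing arbitrary code.

    Args:
        expression: A math expression to evaluate (e.g. "2 + 2 * 3").

    Returns:
        The result of the evaluation, or an error message.
    """
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_evaluate(tree.body))
    except Exception as e:
        return f"Error: {e}"
```

Function calls, attribute access, and names all fall through to the `unsupported expression` branch, so an input like `__import__('os').system('ls')` comes back as an error string the model can read instead of being executed.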

**Training config:**

```toml
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 16

[sampling]
max_tokens = 1024

[[env]]
id = "your-username/research-assistant"

env_file = ["secrets.env"]
```

**Tips:**

* Use `JudgeRubric` with an LLM judge for open-ended tasks where exact matching isn't feasible.
* Store API keys for external services (judge models, search APIs) in a `secrets.env` file and reference it with `env_file`.
* Monitor tool call counts via the automatic metrics — if the model isn't calling tools, the task may need a clearer prompt.
* `MCPEnv` is useful when your tools are already implemented as MCP servers.

***

## Multi-Environment Training

Train a single model on multiple tasks simultaneously to improve generalization.

**Why RL works here:** Training on diverse tasks prevents the model from overfitting to a single task's reward surface. The model learns transferable skills (reasoning, tool use, instruction following) that improve performance across all tasks.

**Training config:**

```toml
model = "Qwen/Qwen3-235B-A22B-Instruct-2507"
max_steps = 500
batch_size = 512
rollouts_per_example = 16

[sampling]
max_tokens = 2048

[[env]]
id = "primeintellect/gsm8k"
args = { split = "train" }

[[env]]
id = "your-username/code-gen"

[[env]]
id = "primeintellect/alphabet-sort"
args = { min_turns = 3, max_turns = 5 }

[wandb]
project = "multi-env-training"
name = "235b-multi-task"

[eval]
interval = 100
```

**Tips:**

* Run baseline evaluations on each environment before training to understand starting performance.
* Use W\&B logging to compare per-environment reward curves during training.

***

## Workflow Summary

Regardless of the use case, the typical Hosted Training workflow is:

<Steps>
  <Step title="Build your environment">
    Create an environment with a dataset, harness, and rubric using the [verifiers](/verifiers/overview) library.
  </Step>

  <Step title="Evaluate baseline">
    Run `prime eval run` against your environment to measure where the model starts.
  </Step>

  <Step title="Configure and launch training">
    Write a `.toml` config and launch with `prime train run`.
  </Step>

  <Step title="Monitor and iterate">
    Watch reward curves on the dashboard, adjust your environment or config, and re-run.
  </Step>

  <Step title="Deploy">
    Download the trained LoRA adapter or [deploy it for inference](/inference/adapter-deployments).
  </Step>
</Steps>

<CardGroup cols={2}>
  <Card title="Getting Started" icon="rocket" href="/hosted-training/getting-started">
    Launch your first Hosted Training run in minutes.
  </Card>

  <Card title="End-to-End Run" icon="arrow-trend-up" href="/hosted-training/end-to-end-run">
    Detailed walkthrough of a complete training run.
  </Card>

  <Card title="Environments" icon="cube" href="/verifiers/environments">
    Learn how to build custom environments with verifiers.
  </Card>

  <Card title="Advanced Configs" icon="gear" href="/hosted-training/advanced-configs">
    Multi-environment training, evals, and more.
  </Card>
</CardGroup>