In this guide, we’ll walk you through setting up your Lab workspace, creating your first agent environment, using it to evaluate baseline performance, launching an RL run with Hosted Training, and deploying your model for inference. These instructions assume a Mac or Linux development machine; no local GPU is required. No previous RL experience is required. In fact, no coding experience is required either: we’ll use agents for everything. We’ll just assume you have Claude Code, Codex, Cursor, OpenCode, Amp, or a similar coding agent installed on your computer.

Setting Up Your Lab Workspace

Ensure you have uv installed for managing Python packages:
curl -LsSf https://astral.sh/uv/install.sh | sh
Install the prime CLI:
uv tool install prime
Choose a folder on your machine as your Lab workspace (e.g. ~/dev/my-lab) and run:
prime lab setup
This command prepares your workspace with:

Python project bootstrap

Creates a Python project and installs verifiers for environment development.

Coding-agent setup

Configures your workspace for coding-agent workflows.

Instruction files

Downloads agent instruction files like AGENTS.md and Agent Skills.

Starter configs

Downloads example training and evaluation configs.
~/dev/demo $ prime lab setup
Supported coding agents: codex, claude, cursor, opencode, amp
Primary coding agent [codex]:
Using multiple coding agents? [y/N]:
No pyproject.toml found, initializing uv project...
Running: uv init
Initialized project `demo`
Running: uv add verifiers
...
... # install + download outputs omitted
...
[................................................................................] 1371 / 1371
Downloaded configs/rl/wordle.toml from https://github.com/primeintellect-ai/verifiers

+------------------------------------------ get started -------------------------------------------+
|                                                                                                  |
|  idea -> environment -> eval -> training                                                         |
|                                                                                                  |
|  +-------------------------------- ask codex ---------------------------------+                  |
|  |                                                                            |                  |
|  | I want to train a model for <my task domain>. Propose an initial           |                  |
|  | environment scaffold including relevant tools, generate a small            |                  |
|  | synthetic dataset, run a quick eval baseline, inspect the results,         |                  |
|  | and decide how to iterate on refining the implementation.                  |                  |
|  |                                                                            |                  |
|  +----------------------------------------------------------------------------+                  |
|                                                                                                  |
|  +-------------------- quick commands --------------------+                                      |
|  |                                                        |                                      |
|  | $ prime env init my-env                                |                                      |
|  | $ prime eval run my-env -m gpt-5-nano -n 5             |                                      |
|  | $ prime eval tui                                       |                                      |
|  | $ prime rl run configs/rl/wiki-search.toml             |                                      |
|  | $ prime gepa run my-env -m gpt-5-nano                  |                                      |
|  |                                                        |                                      |
|  +--------------------------------------------------------+                                      |
|                                                                                                  |
+--------------------------------------------------------------------------------------------------+
Use one Lab workspace per research project and version it with Git. A workspace can contain multiple environments, configs, scripts, data, and eval outputs.

Prompting Your Coding Agent

For many low-to-medium-complexity environments, we find that the latest coding agents are often capable of “one-shotting” them when equipped with the context provided by prime lab setup and given a sufficiently detailed prompt. Providing the prompt below to a frontier coding agent (OpenCode + Codex 5.3) resulted in a fully functional environment for a calendar scheduling agent:
Save this as prompt.md and pass it directly to your coding agent as your initial task prompt.
prompt.md
Make an environment for a calendar scheduling agent.
In each task, there should be a set of people with busy calendars, and individual + global constraints for scheduling the meeting.
Some constraints can be "hard" (not allowed to violate), others can be "soft", where violating a constraint incurs some utility cost for certain attendees.
Each attendee has a utility for the proposed meeting time between 0 and 1, and the task score will be the weighted average of attendee scores if an acceptable meeting time is found, and 0 otherwise.
Attendee importance weights should be normalized to 1 for each task.

We should be able to programmatically generate task problems, and deterministically validate that satisfying solutions exist (and what their best possible score would be).
We should have fine-grained controls for key degrees of freedom in task generation, with higher-level parameters ("easy" / "medium" / "hard") for the full task set, which then map into setting ranges for the more fine-grained controls.
Be creative, and use your judgment to design clean composition rules for converting meeting choices and conflicts into scores. Avoid complex branching/conditional logic where possible.
Think carefully about designing your system in a way which discourages "backdoor" strategies or reward hacks.
The best approach for an agent should be to make a good-faith effort to satisfy constraints as best as possible.
Experiment with sampling strategies to ensure that tasks are solvable most of the time (so that we can pre-filter any unsolvable tasks cheaply), and that they aren't too easy -- there shouldn't be an abundance of valid solutions, random proposal times should be a bad strategy.

Types of constraints we want to potentially account for:

- Conflicting schedules
- Time zones + early/late/day preferences
- Meeting length
- Room availability
- Back-to-back meeting preferences
- Desired-but-optional attendees
- Other related constraints which reflect real-world calendar challenges

Degrees of freedom:

- Number of attendees
- Window of consideration
- Types of constraints
- Tightness of constraints

Use the StatefulToolEnv pattern, and in-memory data structures for the calendar + attendee information. The agent should have tools for things like:

- Checking attendee calendars
- Viewing attendee constraints
- Checking score of a proposed window
- Submitting a window

The environment should have a max_turns parameter, and tool results should show the remaining turns to the agent.
Default limit should be enough to allow reasonable exploration, but not so high that the agent can brute-force search all times.

We should also have a nice standalone script in the environment which creates a TUI to visualize a "calendar problem" similar to typical meeting apps, including attendees, timeblocks, and constraints, but fully in the terminal, using Rich styling, similar design language to the `prime eval tui` viewer implemented within the `verifiers` library (inspect verifiers source for reference).

Create a detailed design doc and plan for testing (PLAN.md), implement in full, revise PLAN.md after major milestones to reflect accomplishments and updated TODOs, and run basic small evals throughout as needed.
You are welcome to use the PRIME_API_KEY set in my environment for inference tests (see configs/endpoints.toml for models).
Let me know when you're happy with your implementation.
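
As a concrete illustration of the scoring rule the prompt asks for (weighted average of attendee utilities when a valid window is found, 0 otherwise, with importance weights normalized to sum to 1), here is a minimal sketch. The function name and data shapes are illustrative only, not the generated environment's actual API:

```python
def task_score(utilities, weights, valid):
    """Score a proposed meeting window.

    utilities: per-attendee utility in [0, 1] for the proposed time
    weights:   per-attendee importance weights, normalized to sum to 1
    valid:     True iff the proposal satisfies every hard constraint
    """
    if not valid:
        return 0.0  # hard-constraint violations zero out the task
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must be normalized"
    # Soft-constraint violations show up as reduced utilities instead
    return sum(w * u for w, u in zip(weights, utilities))

print(task_score([1.0, 0.5], [0.5, 0.5], valid=True))   # 0.75
print(task_score([1.0, 0.5], [0.5, 0.5], valid=False))  # 0.0
```

Keeping the composition rule this simple (one weighted sum, a single hard-constraint gate) is exactly what the prompt's "avoid complex branching" guidance is after.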
You can view the created environment in the Environments Hub:

prime/calendar-scheduling

We can use the visualizer script we asked our agent to build to inspect the environment's task structure directly:
~/dev/demo (main*) $ uv run --project environments/calendar_scheduling calendar-scheduling-tui --show-oracle --difficulty medium --seed 5

Calendar Scheduling TUI

╭──── Calendar Scheduling Problem ────╮
│ Task ID               medium-209463 │
│ Difficulty            medium        │
│ Seed                  209463        │
│ Window                5 days        │
│ Candidate UTC hours   09:00 - 19:00 │
│ Meeting duration      90 minutes    │
│ Score-check budget    6             │
│ Total candidates      90            │
│ Valid candidates      8             │
│ Oracle best score     0.7970        │
│ Random baseline       0.0630        │
╰─────────────────────────────────────╯
                                               Attendees
┏━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ ID         ┃ Name   ┃ Type     ┃ TZ    ┃ Weight ┃ Preferred Day      ┃ Preferred Local ┃ Hard Local ┃
┡━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ attendee_1 │ Dakota │ optional │ UTC-2 │ 0.073  │ Day 1 (Tue Jan 06) │ 11.7-17.5       │ none       │
│ attendee_2 │ Morgan │ optional │ UTC-7 │ 0.155  │ Day 1 (Tue Jan 06) │ 8.8-14.4        │ none       │
│ attendee_3 │ Elliot │ required │ UTC+5 │ 0.072  │ Day 4 (Fri Jan 09) │ 10.9-16.6       │ none       │
│ attendee_4 │ Avery  │ required │ UTC+0 │ 0.146  │ Day 0 (Mon Jan 05) │ 12.5-17.0       │ none       │
│ attendee_5 │ Parker │ required │ UTC-7 │ 0.180  │ Day 0 (Mon Jan 05) │ 10.7-16.4       │ 8.3-18.8   │
│ attendee_6 │ Robin  │ required │ UTC-6 │ 0.185  │ Day 3 (Thu Jan 08) │ 10.2-14.7       │ none       │
│ attendee_7 │ Reese  │ optional │ UTC-1 │ 0.189  │ Day 4 (Fri Jan 09) │ 9.3-14.2        │ none       │
└────────────┴────────┴──────────┴───────┴────────┴────────────────────┴─────────────────┴────────────┘
╭──────────────────────────────────────────── Day Labels ────────────────────────────────────────────╮
│ Day 0 (Mon Jan 05) | Day 1 (Tue Jan 06) | Day 2 (Wed Jan 07) | Day 3 (Thu Jan 08) | Day 4 (Fri Jan 09) │
╰────────────────────────────────────────────────────────────────────────────────────────────────────╯
                              Availability Timeline (X busy, . free, =/# best window overlay)
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Lane          ┃ Timeline                                                                                             ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ attendee_1    │ ........X...........|X..XXXXX..XXX.===...|....................|..XXX...............|.........XXXXX...... │
│ attendee_2    │ ............XXXXX...|..............===...|....................|..XXX...............|XXXXX............... │
│ attendee_3    │ ....................|.X............===...|.........XXXX.......|......X.............|XXX..X.............. │
│ attendee_4    │ ....................|..............===...|.....X.........X....|...............XXXXX|.................... │
│ attendee_5    │ ................XX..|..............===...|.XXXX...XXXX........|....................|.....X.............. │
│ attendee_6    │ .XXX............XXXX|..............===...|....................|.........XXXXX......|.....XX............X │
│ attendee_7    │ .XX.................|..XX.......XXX###...|...............XXXX.|....................|............XX...... │
│ room_1 (room) │ ....................|..........XXX.===X..|....................|....................|.................... │
└───────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘
╭────────────── Constraint Model ──────────────╮
│ Hard constraints                             │
│ - Required attendees must be able to attend  │
│ - Hard local-time bounds cannot be violated  │
│ - Chosen room must be available              │
│ - Duration must match task duration exactly  │
│                                              │
│ Soft utility penalties                       │
│ - Early/late local-time penalties            │
│ - Day-preference distance penalties          │
│ - Back-to-back penalty near busy blocks      │
│ - Optional attendee absence penalty          │
╰──────────────────────────────────────────────╯
                  Oracle Best Windows
┏━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Rank ┃ Window                                       ┃
┡━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│    1 │ Day 1 (Tue Jan 06) 16:00-17:30 UTC in room_1 │
└──────┴──────────────────────────────────────────────┘
╭──── Oracle Summary ────╮
│ Best score: 0.7970     │
│ Valid windows: 8 / 90  │
╰────────────────────────╯

Legend: X busy, . free, = highlighted best free slot, # highlighted busy slot
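
As a sanity check on the oracle panel above (assuming a uniformly random proposer scores 0 on invalid windows and the average valid-window score otherwise), we can invert the reported random baseline:

```python
total_candidates = 90
valid_candidates = 8
random_baseline = 0.0630

# A uniformly random proposal lands on a valid window ~8.9% of the time,
p_valid = valid_candidates / total_candidates
# so the implied average score across the 8 valid windows is ~0.71,
# comfortably below the oracle best of 0.7970.
mean_valid_score = random_baseline / p_valid
print(round(p_valid, 3), round(mean_valid_score, 3))
```

This matches the prompt's requirement that random proposal times should be a bad strategy: a random agent forfeits most of the reward just by landing on invalid windows.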

RL with Hosted Training

After the environment is created, we prompt our agent to test performance more exhaustively, then to start an RL training run using Qwen/Qwen3-30B-A3B-Instruct-2507, which is available for LoRA finetuning via Hosted Training.
Test with GPT-4.1 models; make sure we're seeing proper rollouts that succeed with non-zero scores. Those models should be able to solve the tasks; if they can't, we have a bug somewhere or need clearer instructions in the env. Once you feel good about the env, make an RL config and start a training run with bs 128, 8 rollouts per group, for the 30B Instruct Qwen model.
Available models can be viewed with:
prime rl models
Example training configs (in configs/rl after running prime lab setup) look like:
# calendar-scheduling.toml
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 100
batch_size = 128
rollouts_per_example = 8
env_file = ["../../secrets.env"]

[sampling]
max_tokens = 768

[[env]]
id = "prime/calendar-scheduling"
args = { difficulty = "medium", num_train = 512, num_eval = 128, max_turns = 18 }

[wandb]
project = "calendar-scheduling"
name = "qwen3-30b-i-calendar-scheduling-bs128-r8"
entity = "primeintellect"

The inclusion of secrets.env is optional — here we use it to set our W&B key for logging:
# secrets.env
OPENAI_API_KEY="${OPENAI_API_KEY}"
WANDB_API_KEY="${WANDB_API_KEY}"
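The ${VAR} references mean the values are read from your shell environment when the file is loaded, so secrets.env itself never needs to contain literal keys. We haven't verified the exact loader prime uses, but the behavior is analogous to shell-style expansion, e.g. Python's os.path.expandvars:

```python
import os

# Stand-in value purely for illustration
os.environ["WANDB_API_KEY"] = "wb-123"

line = 'WANDB_API_KEY="${WANDB_API_KEY}"'
print(os.path.expandvars(line))  # WANDB_API_KEY="wb-123"
```

If your setup doesn't interpolate, you can instead write literal values into secrets.env and keep the file out of Git.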
Runs can be started by running:
prime rl run configs/rl/calendar-scheduling.toml
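
This guide doesn't specify Hosted Training's exact algorithm, but rollouts_per_example = 8 is characteristic of group-relative policy optimization (GRPO-style RL), where each task's group of rollouts is scored by the environment and each rollout's advantage is measured against its own group. A hedged sketch of that advantage computation, not Hosted Training's actual implementation:

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same task."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical scores for 8 rollouts of one calendar task:
# better-than-average rollouts get positive advantage, worse get negative.
rewards = [0.0, 0.0, 0.31, 0.42, 0.42, 0.55, 0.68, 0.80]
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])
```

This is also why reward spread within a group matters: if all 8 rollouts score identically, the group provides no learning signal for that task.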

From the platform, you can view training curves, rollouts, configs, logs, and checkpoints. Training runs are shareable; you can view this example run here:

calendar-sch--qwen3-30b-a3b-instru--wxhl0r


Deploying Your Model

Under Deployments, you can deploy LoRA adapters for inference with a single click. You can then share the Model Identifier with your coding agent and ask it to run more evals, or incorporate the model into your application directly. And with that, you have successfully deployed your first RL-trained model!