This guide walks you through a complete hosted training run — from setting up your workspace and choosing an environment to launching a run, monitoring progress, and reviewing results.

Prerequisites

Make sure you’ve completed the initial setup:
# Install and authenticate the CLI
uv tool install prime
prime login

# Set up a workspace
mkdir ~/dev/my-lab && cd ~/dev/my-lab
prime lab setup
See Getting Started if you need help with any of these steps.

Step 1: Choose an Environment

You can use an existing environment from the Environments Hub or create your own. For this walkthrough, we’ll use the alphabet-sort environment — a multi-turn game where the model must sort letters into alphabetical order. Install it:
prime env install primeintellect/alphabet-sort

Step 2: Run a Baseline Evaluation

Before training, evaluate the base model to establish a baseline. This confirms the environment works and shows where the model starts. The flags keep the run small: 20 examples (-n) with a single rollout each (-r):
prime eval run primeintellect/alphabet-sort \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  -n 20 -r 1
A good training environment should have a baseline reward roughly between 10% and 80%. If the model scores 0% after many attempts, the task is too hard; if it already scores 80% or higher, consider harder examples or a different environment.
View the results:
prime eval tui

Step 3: Choose a Model

Check which models are available for Hosted Training:
prime rl models
For a first run, we recommend starting with a smaller model to validate your setup quickly:
Use Case            Recommended Model
Quick validation    Qwen/Qwen3-4B-Instruct-2507
Experimentation     Qwen/Qwen3-30B-A3B-Instruct-2507
Production scale    Qwen/Qwen3-235B-A22B-Instruct-2507 or PrimeIntellect/INTELLECT-3
See Models & Pricing for the full list.

Step 4: Create a Training Config

Training runs are configured via a .toml file. Create one in your configs/lab/ directory:
# configs/lab/alphabet-sort.toml
model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 50
batch_size = 128
rollouts_per_example = 8

[sampling]
max_tokens = 512

[[env]]
id = "primeintellect/alphabet-sort"
This is a minimal config suitable for a validation run; a quick sizing check follows the field list below. The key fields are:
  • model — The Hugging Face model ID (must be a supported model)
  • max_steps — Total number of training steps
  • batch_size — Number of rollouts per training batch
  • rollouts_per_example — How many rollouts to generate per dataset example
  • [sampling].max_tokens — Maximum tokens the model can generate per response
  • [[env]].id — The environment ID on the Environments Hub
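These fields also determine the size of the run. A quick sanity check, assuming batch_size counts rollouts as described above:
# 128 rollouts per batch / 8 rollouts per example = 16 unique examples per batch
echo $((128 / 8))
# 50 steps x 128 rollouts per batch = 6400 rollouts generated over the whole run
echo $((50 * 128))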

Step 5: Launch the Training Run

Start the run:
prime rl run configs/lab/alphabet-sort.toml
You’ll see output confirming the configuration and a link to the dashboard:
Loading config from configs/lab/alphabet-sort.toml

Creating RL training run...

Configuration:
  Model: Qwen/Qwen3-4B-Instruct-2507
  Environments: primeintellect/alphabet-sort
  Max Steps: 50
  Batch Size: 128
  Rollouts per Example: 8
  Max Tokens: 512

✓ Run created successfully!

Monitor run at:
  https://app.primeintellect.ai/dashboard/training/<run-id>

Step 6: Monitor the Run

You can monitor your run in two ways: in the terminal or on the dashboard. To stream logs in real time from the terminal:
prime rl logs <run-id> -f
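To focus on a particular signal, you can filter the stream with standard shell tools. A sketch; the exact log format depends on what your run prints, so adjust the pattern as needed:
# Follow the run's logs, keeping only lines that mention reward
prime rl logs <run-id> -f | grep -i reward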
To monitor from the dashboard, open the URL printed when the run started. The dashboard shows reward curves, rubric scores, reward distributions, and individual rollouts. Key metrics to watch:
  • Reward — The overall reward curve should trend upward over time
  • Rubric — Individual rubric component scores
  • Reward Distribution — Should shift from lower to higher values as training progresses

Step 7: Review Results

Once the run completes, review the trained model's performance by running an evaluation with the trained adapter. LoRA adapters can be downloaded from the dashboard, and you can also deploy them for live inference; see Deploying LoRA Adapters for Inference for a step-by-step guide. To compare against the baseline, re-run the same evaluation you ran in Step 2 and compare scores.
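A sketch of that comparison, reusing the Step 2 settings; the model ID for the trained adapter is a placeholder here, since how you address it depends on how you deploy it (see the guide above):
# Evaluate the trained adapter with the same settings as the Step 2 baseline
prime eval run primeintellect/alphabet-sort \
  -m <your-deployed-adapter-id> \
  -n 20 -r 1

# Inspect both runs side by side
prime eval tui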

Putting It All Together

Here’s the complete workflow as a single script:
# Setup
uv tool install prime
prime login
mkdir ~/dev/my-lab && cd ~/dev/my-lab
prime lab setup

# Install and evaluate baseline
prime env install primeintellect/alphabet-sort
prime eval run primeintellect/alphabet-sort \
  -m Qwen/Qwen3-4B-Instruct-2507 -n 20 -r 1

# Launch training (using the config created in Step 4)
prime rl run configs/lab/alphabet-sort.toml

# Monitor
prime rl logs <run-id> -f

Run Size Guidelines

Depending on your goals, here are some recommended configurations:

Small Run (Validation)

Use this to verify your environment and config work correctly before committing to a longer run.
model = "Qwen/Qwen3-4B-Instruct-2507"
max_steps = 50
batch_size = 128
rollouts_per_example = 8

[sampling]
max_tokens = 512

Medium Run (Experimentation)

Good for iterating on environment design and hyperparameters. This configuration also logs metrics to Weights & Biases and schedules periodic evaluations.
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 200
batch_size = 256
rollouts_per_example = 16

[sampling]
max_tokens = 512

[wandb]
project = "my-experiment"
name = "alphabet-sort-30b"

[eval]
interval = 50

Large Run (Production)

For serious training with full monitoring and evaluation. In addition to the medium-run settings, this configuration adds a periodic validation pass ([val]) and enables online difficulty filtering ([buffer]).
model = "Qwen/Qwen3-235B-A22B-Instruct-2507"
max_steps = 500
batch_size = 512
rollouts_per_example = 16

[sampling]
max_tokens = 1024

[wandb]
project = "production"
name = "alphabet-sort-235b"

[eval]
interval = 100
num_examples = -1
rollouts_per_example = 1
eval_base_model = true

[val]
num_examples = 64
rollouts_per_example = 1
interval = 5

[buffer]
online_difficulty_filtering = true
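
Whichever size you choose, launching works the same way as in Step 5 (the filename below is illustrative):
# Launch with the config that matches your goal
prime rl run configs/lab/<your-config>.toml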