Full configuration reference for Hosted Training runs
Hosted Training runs are configured via a .toml file. This page covers all available configuration fields, from basic setup to advanced features like multi-environment training, online evaluation, and W&B integration.
Below is a complete annotated config showing all available fields. Required fields are uncommented; optional fields are shown as comments with their defaults.
    # ============================================================
    # Core Configuration (required)
    # ============================================================
    model = "Qwen/Qwen3-30B-A3B-Instruct-2507"  # HuggingFace model ID
    max_steps = 100  # Total training steps
    batch_size = 256  # Rollouts per training batch
    rollouts_per_example = 8  # Rollouts generated per dataset example

    # ============================================================
    # Training Hyperparameters (optional)
    # ============================================================
    # learning_rate = 1e-4  # Learning rate for LoRA
    # lora_alpha = 16  # LoRA alpha scaling factor
    # oversampling_factor = 2.0  # Oversample factor for rollout generation
    # trajectory_strategy = "interleaved"  # "interleaved" or "branching"

    # ============================================================
    # Secrets (optional)
    # ============================================================
    # env_file = ["secrets.env"]  # File(s) containing environment secrets

    # ============================================================
    # Sampling Configuration (required)
    # ============================================================
    [sampling]
    max_tokens = 512  # Max tokens per model response
    # enable_thinking = false  # Toggle thinking mode (Qwen3.5, Nemotron)
    # reasoning_effort = "high"  # Reasoning effort: "low" | "medium" | "high" (GPT-OSS)

    # ============================================================
    # Environment(s) (at least one required)
    # ============================================================
    [[env]]
    id = "primeintellect/alphabet-sort"  # Environments Hub ID (owner/name)
    # args = { min_turns = 3, max_turns = 5 }  # Arguments passed to load_environment()

    # Add multiple [[env]] sections for multi-environment training:
    # [[env]]
    # id = "primeintellect/another-env"
    # args = { split = "train", max_examples = 1000 }

    # ============================================================
    # Weights & Biases Logging (optional)
    # ============================================================
    # [wandb]
    # project = "my-project"  # W&B project name
    # name = "my-run-name"  # W&B run name
    # entity = "my-team"  # W&B team/entity

    # ============================================================
    # Online Evaluation (optional)
    # ============================================================
    # [eval]
    # interval = 100  # Run eval every N training steps
    # num_examples = -1  # Number of eval examples (-1 = all)
    # rollouts_per_example = 1  # Rollouts per eval example
    # skip_first_step = false  # Skip the pre-training eval of the base model
    #
    # [eval.sampling]  # Eval-time sampling overrides
    # max_tokens = 2048  # Max tokens per eval response
    # temperature = 0.0  # Eval sampling temperature
    # enable_thinking = false  # Toggle thinking mode at eval time
    # reasoning_effort = "high"  # Reasoning effort at eval time
    #
    # [[eval.env]]  # Environment-specific eval overrides
    # id = "primeintellect/eval-env"
    # args = { split = "test" }
    # num_examples = 30
    # rollouts_per_example = 4

    # ============================================================
    # Validation During Training (optional)
    # ============================================================
    # [val]
    # num_examples = 64  # Validation examples per check
    # rollouts_per_example = 1  # Rollouts per validation example
    # interval = 5  # Validate every N steps

    # ============================================================
    # Rollout Filters (optional)
    # ============================================================
    # [[pre_batch_filters]]  # Applied before rollouts fill a batch slot
    # type = "zero_advantage"  # "gibberish" | "repetition" | "zero_advantage"
    # enforce = true  # Drop flagged rollouts (false = metrics only)
    #
    # [[post_batch_filters]]  # Applied after a batch is assembled
    # type = "repetition"
    # enforce = false

    # ============================================================
    # Warm-Start from Checkpoint (optional)
    # ============================================================
    # checkpoint_id = "..."  # Resume training from an existing checkpoint

    # ============================================================
    # Checkpoints (optional)
    # ============================================================
    # [checkpoints]
    # interval = 100  # Save checkpoint every N steps
    # keep_cloud = 5  # Keep N checkpoints in cloud (-1 = keep all)

    # ============================================================
    # Adapters (optional)
    # ============================================================
    # [adapters]
    # interval = 0  # Upload adapter every N steps (0 = only at run end)
    # keep_last = 3  # Keep N adapters in cloud (-1 = keep all)

    # ============================================================
    # Infrastructure (optional)
    # ============================================================
    # [infrastructure]
    # compute_size = "M"  # CPU allocation: S, M (default), or L
| Field | Type | Required | Description |
|---|---|---|---|
| model | string | ✓ | HuggingFace model ID. Must be a supported model. Run prime train models to see available options. |
| max_steps | integer | ✓ | Total number of training steps. |
| batch_size | integer | ✓ | Number of rollouts consumed per training batch. Larger values improve stability. |
| rollouts_per_example | integer | ✓ | Number of rollouts generated per dataset example. Higher values give more reward signal diversity. |
| checkpoint_id | string | — | Checkpoint ID to warm-start from. The checkpoint must be in READY status, accessible to you, and from a run using the same model. See Warm-Starting from a Checkpoint. |
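As a minimal sketch, the required fields alone can be written as the short config below. All values are copied from the annotated config earlier on this page. The per-batch arithmetic in the comments is an assumption inferred from the field descriptions, not a documented guarantee.

```toml
# Minimal sketch: only the required fields, values taken from the annotated example.
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 100
batch_size = 256            # rollouts consumed per training batch
rollouts_per_example = 8    # assumption: 256 / 8 = 32 dataset examples per batch

[sampling]
max_tokens = 512            # max tokens per model response

[[env]]
id = "primeintellect/alphabet-sort"
```

If that reading of the two fields holds, choosing batch_size as a multiple of rollouts_per_example keeps each batch made up of whole groups of rollouts.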
| Field | Type | Default | Description |
|---|---|---|---|
| lora_alpha | integer | 16 | LoRA alpha scaling factor. Controls the magnitude of LoRA updates. |
| oversampling_factor | float | 2.0 | Generate this many times the rollouts needed per batch to ensure sufficient data. |
| trajectory_strategy | string | "interleaved" | How multi-turn trajectories are generated. "interleaved" runs turns across examples concurrently. "branching" generates full trajectories per example before moving on. |
| env_file | array of strings | [] | Path(s) to .env files containing secrets (e.g., API keys). See Secrets Management. |
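The optional top-level fields can be sketched as the fragment below, assuming the defaults listed in the annotated config. The non-default value for trajectory_strategy is illustrative only, and the fragment omits the other required fields. In TOML, a bare key belongs to the most recent table header, so top-level keys such as env_file and checkpoint_id need to appear before the first [section] header.

```toml
# Optional top-level overrides (fragment, not a complete config).
# These keys must come before any [table] header; otherwise TOML assigns
# them to that table instead of the document root.
learning_rate = 1e-4                 # default per the annotated config
lora_alpha = 16                      # default
oversampling_factor = 2.0            # default
trajectory_strategy = "branching"    # illustrative; the default is "interleaved"
env_file = ["secrets.env"]           # default is an empty list

[sampling]
max_tokens = 512
```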
Overrides the inference server’s default sampling for eval-time rollouts only. All fields are optional; when the whole [eval.sampling] block is omitted, eval uses the server defaults.
| Field | Type | Required | Description |
|---|---|---|---|
| [eval.sampling].max_tokens | integer | — | Maximum tokens generated per eval response turn. |
| [eval.sampling].temperature | float | — | Eval sampling temperature. 0.0 for deterministic eval scoring. |
| [eval.sampling].extra_body | table | — | Free-form extra parameters forwarded with each eval request to the inference server. |
| [eval.sampling].enable_thinking | boolean | — | Toggle thinking mode at eval time for supported models. Mutually exclusive with reasoning_effort. |
| [eval.sampling].reasoning_effort | string | — | Reasoning effort at eval time for supported models. One of "low", "medium", "high". Mutually exclusive with enable_thinking. |
Enable periodic evaluation during training to track progress without interrupting the run:
    [eval]
    interval = 100  # Evaluate every 100 steps
    num_examples = -1  # Use all eval examples
    rollouts_per_example = 1
    skip_first_step = false  # Evaluate the base model before training starts

    [[eval.env]]
    id = "primeintellect/alphabet-sort"
    args = { split = "test" }
    num_examples = 50
    rollouts_per_example = 4
The [eval] section sets global defaults, and [[eval.env]] sections can override settings per environment.
Eval rollouts use the inference server’s default sampling unless overridden via [eval.sampling]. The fields mirror [sampling] and [teacher.sampling], so the same knobs work everywhere. Most commonly, you would turn thinking off at eval time to get faster, more deterministic scoring on a model that uses chain-of-thought during training.
enable_thinking and reasoning_effort are mutually exclusive — set at most one. Both ride on extra_body.chat_template_kwargs under the hood; you can also set extra_body directly if you need other chat-template controls.
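A minimal sketch of such an eval-time override, assuming a model that supports thinking mode. The field names come from the [eval.sampling] table on this page; the chosen values are illustrative.

```toml
[eval]
interval = 100               # evaluate every 100 training steps

[eval.sampling]
max_tokens = 2048
temperature = 0.0            # deterministic eval scoring
enable_thinking = false      # thinking off for eval rollouts only
# reasoning_effort = "high"  # alternative knob; set at most one of the two
```

Training rollouts are unaffected by this block, since [eval.sampling] applies to eval-time rollouts only.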
prime-rl filters rollouts at two points in the training pipeline. [[pre_batch_filters]] run before a rollout enters the training batch, so flagged rollouts never consume a batch slot; [[post_batch_filters]] run after a batch is assembled, and flagged rollouts are recorded but not shipped to the trainer. Three filter types are available (gibberish, repetition, and zero_advantage), and each either records detection metrics only (enforce = false) or drops flagged rollouts (enforce = true).
By default, all three filters run in monitor mode pre-batch and zero_advantage is enforced post-batch. Setting either section replaces the default filter list for that slot.
To focus training compute on examples with useful reward signal, enforce the zero-advantage filter pre-batch. This is the successor to the removed difficulty buffer’s online_difficulty_filtering.
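A sketch of that pre-batch setup, using the filter types and the enforce flag described above. Because setting a section replaces the default list for that slot, the gibberish and repetition filters are re-declared in monitor mode here; keeping them is a judgment call for this example, not a stated requirement.

```toml
# Drop zero-advantage rollouts before they take a batch slot.
[[pre_batch_filters]]
type = "zero_advantage"
enforce = true

# Re-declare the remaining filters in monitor mode (metrics only), since
# defining [[pre_batch_filters]] replaces the default pre-batch list.
[[pre_batch_filters]]
type = "gibberish"
enforce = false

[[pre_batch_filters]]
type = "repetition"
enforce = false
```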
Control how often checkpoints are saved and how many are retained in cloud storage:
    [checkpoints]
    interval = 100  # Save checkpoint every 100 steps
    keep_cloud = 5  # Keep last 5 checkpoints in cloud
| Field | Type | Default | Description |
|---|---|---|---|
| interval | integer | cluster default | Save a checkpoint every N training steps. |
| keep_cloud | integer | 5 | Number of checkpoints to retain in cloud storage. Set to -1 to keep all checkpoints. |
Checkpoints enable resuming training from a specific step if a run is interrupted. They’re automatically uploaded to cloud storage and can be used to create new runs from a saved state.
Start a new run from an existing checkpoint by setting checkpoint_id at the top level of your config. The checkpoint must be READY, use the same model, and you need access to the original run.
checkpoint_id = "cp_abc123"
List available checkpoints with prime train checkpoints <run-id>.
Configure periodic adapter uploads during training. Adapters are LoRA weights that can be deployed for inference.
    [adapters]
    interval = 100  # Upload adapter every 100 steps
    keep_last = 3  # Keep last 3 adapters in cloud
| Field | Type | Default | Description |
|---|---|---|---|
| interval | integer | 0 | Upload adapter every N training steps. Set to 0 to only upload the final adapter at run end. |
| keep_last | integer | 3 | Number of adapters to retain in cloud storage. Set to -1 to keep all adapters. |
Deployed adapters are protected from automatic cleanup. If you deploy an adapter for inference, it will not be deleted even if it exceeds the keep_last limit.
Control the CPU and memory resources allocated to your environment containers. This only affects the environments you provide — trainer and inference infrastructure is fully managed by us.
    [infrastructure]
    compute_size = "L"
| Size | Description |
|---|---|
| S | Lower CPU allocation. Suitable for lightweight environments. |
| M | Default. Balanced allocation for most workloads. |
| L | High CPU allocation. Use for environments that compile code or for vision-language models with heavy image processing. |
If not specified, runs default to M. Most users won’t need to change this — use L if you notice slow CPU-bound operations during training.
Tailscale networking is an enterprise-only feature. Contact your account team to enable it on your organization.
When enabled, every env-server (training and eval) for the run joins your Tailscale tailnet via a sidecar. From inside your environment code you can then reach private services — internal APIs, MCP servers, datasets behind a VPN — by their Tailscale IP, MagicDNS hostname, or by native LAN IP if a subnet router advertises it.
Tailscale pre-authenticated key (must start with tskey-auth-). OAuth client secrets are not supported. Prefer the TAILSCALE_AUTH_KEY environment variable so the secret is not committed to rl.toml.
| Field | Type | Default | Description |
|---|---|---|---|
| [tailscale].hostname_prefix | string | "prime-hosted-training" | Prefix for the Tailscale node name. The full name is derived as {prefix}-env-{idx}-{run_id}. 1–30 lowercase alphanumeric chars or hyphens, must start with a letter. |
Use a tagged, ephemeral, reusable auth key. Tagged keys let you scope the env-servers in your tailnet ACL without granting them the same access as a user-owned device.
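A hedged sketch of the corresponding config section, assuming the [tailscale] table is what the run reads and that the auth key is supplied through the TAILSCALE_AUTH_KEY environment variable instead of the file. The prefix value is hypothetical.

```toml
[tailscale]
# Hypothetical prefix: 1-30 lowercase alphanumerics or hyphens, starting with a letter.
# Node names would then follow the pattern acme-rl-env-{idx}-{run_id}.
hostname_prefix = "acme-rl"
# No auth key is set here on purpose: it is expected to come from the
# TAILSCALE_AUTH_KEY environment variable so the secret stays out of rl.toml.
```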
The recommended way to supply secrets to Hosted Training is via environment secrets. Secrets linked or added to your environment are automatically injected at runtime — no config changes needed.
If you prefer to supply secrets via a file, you can use env_file in your training config instead:
env_file = ["secrets.env"]
The secrets.env file should contain key-value pairs:
    OPENAI_API_KEY=sk-...
    CUSTOM_API_KEY=...
You can also manage secrets via the CLI:
    prime secret list  # list global secrets
    prime env secret list my-env  # list secrets for an environment
In your environment code, validate required keys early using vf.ensure_keys():