Hosted Training runs are configured via a .toml file. This page covers all available configuration fields, from basic setup to advanced features like multi-environment training, online evaluation, and W&B integration.

Full Config Reference

Below is a complete annotated config showing all available fields. Required fields are uncommented; optional fields are shown as comments with their defaults.
# ============================================================
# Core Configuration (required)
# ============================================================
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"   # HuggingFace model ID
max_steps = 100                                # Total training steps
batch_size = 256                               # Rollouts per training batch
rollouts_per_example = 8                       # Rollouts generated per dataset example

# ============================================================
# Training Hyperparameters (optional)
# ============================================================
# learning_rate = 1e-4                         # Learning rate for LoRA
# lora_alpha = 16                              # LoRA alpha scaling factor
# oversampling_factor = 2.0                    # Oversample factor for rollout generation
# trajectory_strategy = "interleaved"          # "interleaved" or "branching"

# ============================================================
# Secrets (optional)
# ============================================================
# env_file = ["secrets.env"]                   # File(s) containing environment secrets

# ============================================================
# Sampling Configuration (required)
# ============================================================
[sampling]
max_tokens = 512                               # Max tokens per model response
# enable_thinking = false                      # Toggle thinking mode (Qwen3.5, Nemotron)
# reasoning_effort = "high"                    # Reasoning effort: "low" | "medium" | "high" (GPT-OSS)

# ============================================================
# Environment(s) (at least one required)
# ============================================================
[[env]]
id = "primeintellect/alphabet-sort"            # Environments Hub ID (owner/name)
# args = { min_turns = 3, max_turns = 5 }      # Arguments passed to load_environment()

# Add multiple [[env]] sections for multi-environment training:
# [[env]]
# id = "primeintellect/another-env"
# args = { split = "train", max_examples = 1000 }

# ============================================================
# Weights & Biases Logging (optional)
# ============================================================
# [wandb]
# project = "my-project"                       # W&B project name
# name = "my-run-name"                         # W&B run name
# entity = "my-team"                           # W&B team/entity

# ============================================================
# Online Evaluation (optional)
# ============================================================
# [eval]
# interval = 100                               # Run eval every N training steps
# num_examples = -1                            # Number of eval examples (-1 = all)
# rollouts_per_example = 1                     # Rollouts per eval example
# skip_first_step = false                      # Skip the pre-training eval of the base model
#
# [eval.sampling]                              # Eval-time sampling overrides
# max_tokens = 2048                            # Max tokens per eval response
# temperature = 0.0                            # Eval sampling temperature
# enable_thinking = false                      # Toggle thinking mode at eval time
# reasoning_effort = "high"                    # Reasoning effort at eval time
#
# [[eval.env]]                                 # Environment-specific eval overrides
# id = "primeintellect/eval-env"
# args = { split = "test" }
# num_examples = 30
# rollouts_per_example = 4

# ============================================================
# Validation During Training (optional)
# ============================================================
# [val]
# num_examples = 64                            # Validation examples per check
# rollouts_per_example = 1                     # Rollouts per validation example
# interval = 5                                 # Validate every N steps

# ============================================================
# Rollout Filters (optional)
# ============================================================
# [[pre_batch_filters]]                        # Applied before rollouts fill a batch slot
# type = "zero_advantage"                      # "gibberish" | "repetition" | "zero_advantage"
# enforce = true                               # Drop flagged rollouts (false = metrics only)
#
# [[post_batch_filters]]                       # Applied after a batch is assembled
# type = "repetition"
# enforce = false

# ============================================================
# Warm-Start from Checkpoint (optional)
# ============================================================
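# NOTE: checkpoint_id is a top-level key. If you uncomment it, move it above
# [sampling]; left here, TOML would parse it as part of the preceding table.
# checkpoint_id = "..."                        # Resume training from an existing checkpoint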

# ============================================================
# Checkpoints (optional)
# ============================================================
# [checkpoints]
# interval = 100                               # Save checkpoint every N steps
# keep_cloud = 5                               # Keep N checkpoints in cloud (-1 = keep all)

# ============================================================
# Adapters (optional)
# ============================================================
# [adapters]
# interval = 0                                 # Upload adapter every N steps (0 = only at run end)
# keep_last = 3                                # Keep N adapters in cloud (-1 = keep all)

# ============================================================
# Infrastructure (optional)
# ============================================================
# [infrastructure]
# compute_size = "M"                           # CPU allocation: S, M (default), or L

Field Reference

Core Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | HuggingFace model ID. Must be a supported model. Run prime train models to see available options. |
| max_steps | integer | Yes | Total number of training steps. |
| batch_size | integer | Yes | Number of rollouts consumed per training batch. Larger values improve stability. |
| rollouts_per_example | integer | Yes | Number of rollouts generated per dataset example. Higher values give more reward signal diversity. |
| checkpoint_id | string | No | Checkpoint ID to warm-start from. The checkpoint must be in READY status, accessible to you, and from a run using the same model. See Warm-Starting from a Checkpoint. |
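
The relationship between batch_size and rollouts_per_example can be made concrete with a little arithmetic. This sketch assumes each batch is filled with complete groups of rollouts_per_example rollouts, which the field descriptions suggest but do not state outright; examples_per_batch is a hypothetical helper, and the divisibility check is part of that assumption.

```python
def examples_per_batch(batch_size: int, rollouts_per_example: int) -> int:
    """Distinct dataset examples per batch, assuming whole rollout groups fill the batch."""
    if batch_size % rollouts_per_example != 0:
        raise ValueError("batch_size is not a multiple of rollouts_per_example")
    return batch_size // rollouts_per_example


# Values from the annotated config: 256 rollouts per batch, 8 rollouts per example.
print(examples_per_batch(256, 8))  # 32
```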

Training Hyperparameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| learning_rate | float | 1e-4 | Learning rate for the LoRA adapter. |
| lora_alpha | integer | 16 | LoRA alpha scaling factor. Controls the magnitude of LoRA updates. |
| oversampling_factor | float | 2.0 | Factor by which rollout generation is oversampled relative to what a batch needs, to ensure sufficient data. |
| trajectory_strategy | string | "interleaved" | How multi-turn trajectories are generated. "interleaved" runs turns across examples concurrently. "branching" generates full trajectories per example before moving on. |
| env_file | array of strings | [] | Path(s) to .env files containing secrets (e.g., API keys). See Secrets Management. |

Sampling

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| [sampling].max_tokens | integer | Yes | Maximum number of tokens the model can generate per response turn. |
| [sampling].enable_thinking | boolean | No | Toggle thinking mode for supported models. Mutually exclusive with reasoning_effort. |
| [sampling].reasoning_effort | string | No | Reasoning effort for supported models. One of "low", "medium", "high". Mutually exclusive with enable_thinking. |
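
The mutual-exclusivity rule is easy to check on a parsed [sampling] table before submitting a run. The following is a minimal sketch, not the platform's own validation; check_reasoning_options is a hypothetical name, and the same check would apply to [eval.sampling].

```python
def check_reasoning_options(sampling: dict) -> None:
    """Raise if both mutually exclusive reasoning controls are set."""
    if "enable_thinking" in sampling and "reasoning_effort" in sampling:
        raise ValueError("set at most one of enable_thinking and reasoning_effort")
    effort = sampling.get("reasoning_effort")
    if effort is not None and effort not in ("low", "medium", "high"):
        raise ValueError(f"reasoning_effort must be low, medium, or high, got {effort!r}")


check_reasoning_options({"max_tokens": 512, "enable_thinking": False})  # passes silently
```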

Eval Sampling

Overrides the inference server’s default sampling for eval-time rollouts only. All fields are optional; when the whole [eval.sampling] block is omitted, eval uses the server defaults.
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| [eval.sampling].max_tokens | integer | No | Maximum tokens generated per eval response turn. |
| [eval.sampling].temperature | float | No | Eval sampling temperature. 0.0 for deterministic eval scoring. |
| [eval.sampling].extra_body | table | No | Free-form extra parameters forwarded with each eval request to the inference server. |
| [eval.sampling].enable_thinking | boolean | No | Toggle thinking mode at eval time for supported models. Mutually exclusive with reasoning_effort. |
| [eval.sampling].reasoning_effort | string | No | Reasoning effort at eval time for supported models. One of "low", "medium", "high". Mutually exclusive with enable_thinking. |

Environment

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| [[env]].id | string | Yes | Environment ID on the Environments Hub, in owner/name format. |
| [[env]].args | table | No | Arguments passed to the environment’s load_environment() function. |

Multi-Environment Training

You can train on multiple environments simultaneously by adding multiple [[env]] sections:
[[env]]
id = "primeintellect/alphabet-sort"
args = { min_turns = 3, max_turns = 5 }

[[env]]
id = "primeintellect/gsm8k"
args = { split = "train" }

Online Evaluation

Enable periodic evaluation during training to track progress without interrupting the run:
[eval]
interval = 100                    # Evaluate every 100 steps
num_examples = -1                 # Use all eval examples
rollouts_per_example = 1
skip_first_step = false           # Evaluate the base model before training starts

[[eval.env]]
id = "primeintellect/alphabet-sort"
args = { split = "test" }
num_examples = 50
rollouts_per_example = 4
The [eval] section sets global defaults, and [[eval.env]] sections can override settings per environment.
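
One way to picture the override behavior is as a dictionary merge in which per-environment values win. The sketch below assumes that only num_examples and rollouts_per_example are inherited from [eval]; resolve_eval_settings is a hypothetical helper, not the platform's implementation.

```python
EVAL_DEFAULT_KEYS = ("num_examples", "rollouts_per_example")


def resolve_eval_settings(eval_cfg: dict) -> list[dict]:
    """Merge [eval] defaults into each [[eval.env]] entry; per-environment values win."""
    defaults = {k: eval_cfg[k] for k in EVAL_DEFAULT_KEYS if k in eval_cfg}
    return [{**defaults, **env} for env in eval_cfg.get("env", [])]


# The [eval] table from the example above, as a TOML parser would return it.
eval_cfg = {
    "interval": 100,
    "num_examples": -1,
    "rollouts_per_example": 1,
    "skip_first_step": False,
    "env": [
        {
            "id": "primeintellect/alphabet-sort",
            "args": {"split": "test"},
            "num_examples": 50,
            "rollouts_per_example": 4,
        }
    ],
}

resolved = resolve_eval_settings(eval_cfg)
print(resolved[0]["num_examples"], resolved[0]["rollouts_per_example"])  # 50 4
```

An [[eval.env]] entry that sets neither key would fall back to the [eval] values (-1 and 1 here) under this assumption.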

Eval Sampling

Eval rollouts use the inference server’s default sampling unless overridden via [eval.sampling]. The fields mirror [sampling] and [teacher.sampling], so the same knobs work everywhere. Most commonly, you’d turn thinking off at eval time for faster scoring on a model that uses chain-of-thought during training, and set temperature = 0.0 for deterministic scoring:
[eval.sampling]
max_tokens = 2048
temperature = 0.0
enable_thinking = false           # Disable thinking at eval time
# reasoning_effort = "high"       # Or constrain reasoning effort
enable_thinking and reasoning_effort are mutually exclusive: set at most one. Both are passed through extra_body.chat_template_kwargs under the hood; you can also set extra_body directly if you need other chat-template controls.

Validation

Validation is a lightweight check that runs more frequently than full evaluation:
[val]
num_examples = 64
rollouts_per_example = 1
interval = 5                      # Validate every 5 steps
This uses the training environment’s validation split (if available) and reports metrics to W&B and the dashboard.
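
To get a feel for how much extra generation a [val] block adds, the settings can be turned into rough counts. This is a back-of-the-envelope sketch: validation_cost is a hypothetical helper, and it assumes a check runs on every step that is a positive multiple of interval, which this page does not spell out.

```python
def validation_cost(
    num_examples: int, rollouts_per_example: int, interval: int, max_steps: int
) -> tuple[int, int]:
    """Return (rollouts per validation check, number of checks over the run)."""
    rollouts_per_check = num_examples * rollouts_per_example
    # Assumption: a check fires on every step that is a positive multiple of `interval`.
    num_checks = max_steps // interval
    return rollouts_per_check, num_checks


# [val] values from the example above, with max_steps = 100 from the annotated config.
print(validation_cost(num_examples=64, rollouts_per_example=1, interval=5, max_steps=100))  # (64, 20)
```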

Rollout Filters

prime-rl filters rollouts at two points in the training pipeline. [[pre_batch_filters]] run before a rollout enters the training batch, so flagged rollouts never consume a batch slot. [[post_batch_filters]] run after a batch is assembled, and flagged rollouts are recorded but not shipped to the trainer.
Three filter types are available: gibberish, repetition, and zero_advantage. Each either records detection metrics only (enforce = false) or drops flagged rollouts (enforce = true). By default, all three filters run in monitor mode pre-batch and zero_advantage is enforced post-batch. Setting either section replaces the default filter list for that slot.
The zero-advantage filter is the successor to the removed difficulty buffer’s online_difficulty_filtering. To focus training compute on examples with useful reward signal, enforce it pre-batch:
[[pre_batch_filters]]
type = "zero_advantage"
enforce = true
Type-specific tuning knobs (such as repetition’s window and prob_threshold) pass through to the trainer as written.
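
This page does not define how advantages are computed. Under the common group-relative convention, where a rollout's advantage is its reward minus the mean reward of all rollouts for the same example, a group in which every rollout earns the same reward carries no learning signal. The sketch below illustrates that idea only; it is not prime-rl's implementation, and both function names are hypothetical.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout relative to its group's mean reward (assumed convention)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def has_zero_advantage(rewards: list[float], tol: float = 1e-9) -> bool:
    """True when every rollout in the group has (numerically) zero advantage."""
    return all(abs(a) <= tol for a in group_advantages(rewards))


print(has_zero_advantage([1.0, 1.0, 1.0, 1.0]))  # True: identical rewards, no signal
print(has_zero_advantage([0.0, 1.0, 0.0, 1.0]))  # False: mixed outcomes carry signal
```

Under this reading, examples the model always solves or always fails are the ones a zero_advantage filter would flag.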

Checkpoints

Control how often checkpoints are saved and how many are retained in cloud storage:
[checkpoints]
interval = 100    # Save checkpoint every 100 steps
keep_cloud = 5    # Keep last 5 checkpoints in cloud
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| interval | integer | cluster default | Save a checkpoint every N training steps. |
| keep_cloud | integer | 5 | Number of checkpoints to retain in cloud storage. Set to -1 to keep all checkpoints. |
Checkpoints enable resuming training from a specific step if a run is interrupted. They’re automatically uploaded to cloud storage and can be used to create new runs from a saved state.
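
The interaction between interval and keep_cloud can be sketched as a small function. It assumes checkpoints are saved at every multiple of interval and that the oldest ones are removed first; neither detail is stated on this page, and retained_checkpoints is a hypothetical helper.

```python
def retained_checkpoints(current_step: int, interval: int, keep_cloud: int) -> list[int]:
    """Steps whose checkpoints would remain in cloud storage under the assumed policy."""
    saved = list(range(interval, current_step + 1, interval))
    if keep_cloud == -1:  # -1 keeps every checkpoint
        return saved
    return saved[-keep_cloud:] if keep_cloud > 0 else []


print(retained_checkpoints(current_step=800, interval=100, keep_cloud=5))
# [400, 500, 600, 700, 800]
```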

Warm-Starting from a Checkpoint

Start a new run from an existing checkpoint by setting checkpoint_id at the top level of your config. The checkpoint must be in READY status and come from a run that used the same model, and you must have access to that run.
checkpoint_id = "cp_abc123"
List available checkpoints with prime train checkpoints <run-id>.

Adapters

Configure periodic adapter uploads during training. Adapters are LoRA weights that can be deployed for inference.
[adapters]
interval = 100    # Upload adapter every 100 steps
keep_last = 3     # Keep last 3 adapters in cloud
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| interval | integer | 0 | Upload adapter every N training steps. Set to 0 to only upload the final adapter at run end. |
| keep_last | integer | 3 | Number of adapters to retain in cloud storage. Set to -1 to keep all adapters. |
Deployed adapters are protected from automatic cleanup. If you deploy an adapter for inference, it will not be deleted even if it exceeds the keep_last limit.
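
The cleanup rule, including the protection for deployed adapters, can be pictured as follows. This is an illustrative sketch rather than the platform's logic: adapters_to_delete is a hypothetical helper, and it assumes a deployed adapter does not count against the keep_last quota.

```python
def adapters_to_delete(uploaded_steps: list[int], keep_last: int, deployed: set[int]) -> list[int]:
    """Adapters eligible for cleanup: older than the newest `keep_last` and not deployed."""
    if keep_last == -1:  # -1 keeps every adapter
        return []
    ordered = sorted(uploaded_steps)
    candidates = ordered[:-keep_last] if keep_last > 0 else ordered
    return [step for step in candidates if step not in deployed]


# Five uploads with keep_last = 3; the adapter from step 100 is deployed, so it survives.
print(adapters_to_delete([100, 200, 300, 400, 500], keep_last=3, deployed={100}))  # [200]
```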

Infrastructure

Control the CPU and memory resources allocated to your environment containers. This only affects the environments you provide — trainer and inference infrastructure is fully managed by us.
[infrastructure]
compute_size = "L"
| Size | Description |
| --- | --- |
| S | Lower CPU allocation. Suitable for lightweight environments. |
| M | Default. Balanced allocation for most workloads. |
| L | High CPU allocation. Use for environments that compile code or for vision-language models with heavy image processing. |
If not specified, runs default to M. Most users won’t need to change this — use L if you notice slow CPU-bound operations during training.

Tailscale Networking

Tailscale networking is an enterprise-only feature. Contact your account team to enable it on your organization.
When enabled, every env-server (training and eval) for the run joins your Tailscale tailnet via a sidecar. From inside your environment code you can then reach private services — internal APIs, MCP servers, datasets behind a VPN — by their Tailscale IP, MagicDNS hostname, or by native LAN IP if a subnet router advertises it.
[tailscale]
enabled = true
# auth_key = "tskey-auth-..."        # preferably via TAILSCALE_AUTH_KEY env var
# hostname_prefix = "prime-hosted-training"
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| [tailscale].enabled | boolean | false | Toggle the per-run sidecar. |
| [tailscale].auth_key | string | | Tailscale pre-authenticated key (must start with tskey-auth-). OAuth client secrets are not supported. Prefer the TAILSCALE_AUTH_KEY environment variable so the secret is not committed to rl.toml. |
| [tailscale].hostname_prefix | string | "prime-hosted-training" | Prefix for the Tailscale node name. The full name is derived as {prefix}-env-{idx}-{run_id}. 1–30 lowercase alphanumeric chars or hyphens, must start with a letter. |
Use a tagged, ephemeral, reusable auth key. Tagged keys let you scope the env-servers in your tailnet ACL without granting them the same access as a user-owned device.
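
The constraints on hostname_prefix and auth_key are simple enough to check locally before a run is submitted. The helpers below are hypothetical and encode only the rules stated in the field descriptions; the run ID in the usage example is made up for illustration.

```python
import re

# 1-30 chars, lowercase alphanumerics or hyphens, starting with a letter.
HOSTNAME_PREFIX_RE = re.compile(r"[a-z][a-z0-9-]{0,29}")


def is_valid_prefix(prefix: str) -> bool:
    return HOSTNAME_PREFIX_RE.fullmatch(prefix) is not None


def is_valid_auth_key(key: str) -> bool:
    # Only the documented prefix is checked; the rest of the key format is not specified here.
    return key.startswith("tskey-auth-")


def node_name(prefix: str, idx: int, run_id: str) -> str:
    """Full node name following the documented {prefix}-env-{idx}-{run_id} pattern."""
    return f"{prefix}-env-{idx}-{run_id}"


print(is_valid_prefix("prime-hosted-training"))  # True
print(is_valid_prefix("9-bad-prefix"))           # False
print(node_name("prime-hosted-training", 0, "abc123"))  # prime-hosted-training-env-0-abc123
```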

Weights & Biases Integration

Log training metrics, reward curves, and rollout samples to W&B:
[wandb]
project = "my-rl-experiments"
name = "qwen3-30b-alphabet-sort"
entity = "my-team"
When W&B is configured, all training metrics, evaluation results, and sample rollouts are logged automatically.

Secrets Management

The recommended way to supply secrets to Hosted Training is via environment secrets. Secrets linked or added to your environment are automatically injected at runtime — no config changes needed.
If you prefer to supply secrets via a file, you can use env_file in your training config instead:
env_file = ["secrets.env"]
The secrets.env file should contain key-value pairs:
OPENAI_API_KEY=sk-...
CUSTOM_API_KEY=...
You can also manage secrets via the CLI:
prime secret list              # list global secrets
prime env secret list my-env   # list secrets for an environment
In your environment code, validate required keys early using vf.ensure_keys():
import verifiers as vf

def load_environment(api_key_var: str = "OPENAI_API_KEY") -> vf.Environment:
    vf.ensure_keys([api_key_var])  # validate the required key before building the environment
    # ...
