Full configuration reference for Hosted Training runs
Hosted Training runs are configured via a .toml file. This page covers all available configuration fields, from basic setup to advanced features like multi-environment training, online evaluation, difficulty filtering, and W&B integration.
Below is a complete annotated config showing all available fields. Required fields are uncommented; optional fields are shown as comments with their defaults.
```toml
# ============================================================
# Core Configuration (required)
# ============================================================
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"  # HuggingFace model ID
max_steps = 100                             # Total training steps
batch_size = 256                            # Rollouts per training batch
rollouts_per_example = 8                    # Rollouts generated per dataset example

# ============================================================
# Training Hyperparameters (optional)
# ============================================================
# learning_rate = 1e-4                 # Learning rate for LoRA
# lora_alpha = 16                      # LoRA alpha scaling factor
# oversampling_factor = 2.0            # Oversample factor for rollout generation
# max_async_level = 2                  # Maximum async generation level
# trajectory_strategy = "interleaved"  # "interleaved" or "branching"

# ============================================================
# Secrets (optional)
# ============================================================
# env_file = ["secrets.env"]  # File(s) containing environment secrets

# ============================================================
# Sampling Configuration (required)
# ============================================================
[sampling]
max_tokens = 512  # Max tokens per model response

# ============================================================
# Environment(s) (at least one required)
# ============================================================
[[env]]
id = "primeintellect/alphabet-sort"  # Environments Hub ID (owner/name)
# args = { min_turns = 3, max_turns = 5 }  # Arguments passed to load_environment()

# Add multiple [[env]] sections for multi-environment training:
# [[env]]
# id = "primeintellect/another-env"
# args = { split = "train", max_examples = 1000 }

# ============================================================
# Weights & Biases Logging (optional)
# ============================================================
# [wandb]
# project = "my-project"  # W&B project name
# name = "my-run-name"    # W&B run name
# entity = "my-team"      # W&B team/entity

# ============================================================
# Online Evaluation (optional)
# ============================================================
# [eval]
# interval = 100            # Run eval every N training steps
# num_examples = -1         # Number of eval examples (-1 = all)
# rollouts_per_example = 1  # Rollouts per eval example
# eval_base_model = true    # Also evaluate the base (untrained) model
#
# [[eval.env]]              # Environment-specific eval overrides
# id = "primeintellect/eval-env"
# args = { split = "test" }
# num_examples = 30
# rollouts_per_example = 4

# ============================================================
# Validation During Training (optional)
# ============================================================
# [val]
# num_examples = 64         # Validation examples per check
# rollouts_per_example = 1  # Rollouts per validation example
# interval = 5              # Validate every N steps

# ============================================================
# Buffer / Difficulty Filtering (optional)
# ============================================================
# [buffer]
# online_difficulty_filtering = false  # Enable difficulty-based sampling
# easy_threshold = 0.8                 # Reward above this = "easy"
# hard_threshold = 0.2                 # Reward below this = "hard"
# easy_fraction = 0.0                  # Fraction of easy examples to include
# hard_fraction = 0.0                  # Fraction of hard examples to include
# env_ratios = [0.5, 0.5]              # Ratio between envs (multi-env only)
# seed = 42                            # Random seed

# ============================================================
# Warm-Start from Checkpoint (optional)
# ============================================================
# checkpoint_id = "..."  # Resume training from an existing checkpoint

# ============================================================
# Checkpoints (optional)
# ============================================================
# [checkpoints]
# interval = 100  # Save checkpoint every N steps
# keep_cloud = 5  # Keep N checkpoints in cloud (-1 = keep all)

# ============================================================
# Adapters (optional)
# ============================================================
# [adapters]
# interval = 0   # Upload adapter every N steps (0 = only at run end)
# keep_last = 3  # Keep N adapters in cloud (-1 = keep all)

# ============================================================
# Infrastructure (optional)
# ============================================================
# [infrastructure]
# compute_size = "M"  # CPU allocation: S, M (default), or L
```
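For a first run, a minimal config containing only the required fields is enough (values taken from the annotated example above):

```toml
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 100
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 512

[[env]]
id = "primeintellect/alphabet-sort"
```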
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | ✓ | HuggingFace model ID. Must be a supported model. Run `prime rl models` to see available options. |
| max_steps | integer | ✓ | Total number of training steps. |
| batch_size | integer | ✓ | Number of rollouts consumed per training batch. Larger values improve stability. |
| rollouts_per_example | integer | ✓ | Number of rollouts generated per dataset example. Higher values give more reward signal diversity. |
| checkpoint_id | string | — | Checkpoint ID to warm-start from. The checkpoint must be in READY status, accessible to you, and from a run using the same model. See Warm-Starting from a Checkpoint. |
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| lora_alpha | integer | 16 | LoRA alpha scaling factor. Controls the magnitude of LoRA updates. |
| oversampling_factor | float | 2.0 | Generate this many more rollouts than needed per batch to ensure sufficient data. |
| max_async_level | integer | 2 | Maximum level of asynchronous generation. Higher values increase throughput but use more memory. |
| trajectory_strategy | string | "interleaved" | How multi-turn trajectories are generated. "interleaved" runs turns across examples concurrently; "branching" generates full trajectories per example before moving on. |
| env_file | array of strings | [] | Path(s) to .env files containing secrets (e.g., API keys). See Secrets Management. |
The difficulty buffer helps focus training on examples at the right difficulty level for the current model:
```toml
[buffer]
online_difficulty_filtering = true
easy_threshold = 0.8  # Examples scored above 0.8 are "easy"
hard_threshold = 0.2  # Examples scored below 0.2 are "hard"
easy_fraction = 0.0   # Exclude easy examples (0.0 = drop all easy)
hard_fraction = 0.0   # Exclude hard examples
```
This is especially useful for large datasets with a wide difficulty range. By filtering out examples that are too easy (model already solves them) or too hard (model gets no reward signal), you focus compute on examples where the model can meaningfully improve.
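The filtering logic described above can be sketched in a few lines of Python. This is an illustration of the policy, not the actual trainer code; the function name and reward dictionary are hypothetical:

```python
import random

def filter_by_difficulty(example_rewards, easy_threshold=0.8, hard_threshold=0.2,
                         easy_fraction=0.0, hard_fraction=0.0, seed=42):
    """Sketch of difficulty filtering: example_rewards maps an example id
    to its mean reward across rollouts. Easy and hard examples are kept
    only at the configured fractions; medium examples are always kept."""
    rng = random.Random(seed)
    kept = []
    for example, reward in example_rewards.items():
        if reward > easy_threshold:
            if rng.random() < easy_fraction:  # keep only a fraction of easy examples
                kept.append(example)
        elif reward < hard_threshold:
            if rng.random() < hard_fraction:  # keep only a fraction of hard examples
                kept.append(example)
        else:
            kept.append(example)  # in the useful difficulty band: always kept
    return kept

rewards = {"a": 0.95, "b": 0.5, "c": 0.1, "d": 0.6}
print(filter_by_difficulty(rewards))  # → ['b', 'd'] (easy "a" and hard "c" dropped)
```

With the defaults of `easy_fraction = 0.0` and `hard_fraction = 0.0`, all easy and hard examples are dropped; raising either fraction mixes some of them back in.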
Control how often checkpoints are saved and how many are retained in cloud storage:
```toml
[checkpoints]
interval = 100  # Save checkpoint every 100 steps
keep_cloud = 5  # Keep last 5 checkpoints in cloud
```
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| interval | integer | cluster default | Save a checkpoint every N training steps. |
| keep_cloud | integer | 5 | Number of checkpoints to retain in cloud storage. Set to -1 to keep all checkpoints. |
Checkpoints enable resuming training from a specific step if a run is interrupted. They’re automatically uploaded to cloud storage and can be used to create new runs from a saved state.
Start a new run from an existing checkpoint by setting checkpoint_id at the top level of your config. The checkpoint must be READY, use the same model, and you need access to the original run.
```toml
checkpoint_id = "cp_abc123"
```
List available checkpoints with `prime rl checkpoints <run-id>`.
Configure periodic adapter uploads during training. Adapters are LoRA weights that can be deployed for inference.
```toml
[adapters]
interval = 100  # Upload adapter every 100 steps
keep_last = 3   # Keep last 3 adapters in cloud
```
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| interval | integer | 0 | Upload adapter every N training steps. Set to 0 to upload only the final adapter at run end. |
| keep_last | integer | 3 | Number of adapters to retain in cloud storage. Set to -1 to keep all adapters. |
Deployed adapters are protected from automatic cleanup. If you deploy an adapter for inference, it will not be deleted even if it exceeds the keep_last limit.
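The retention policy above can be sketched as a small pruning function. This is an illustrative model of the behavior described, not the platform's implementation; the function and adapter names are hypothetical:

```python
def prune_adapters(adapters, keep_last, deployed_ids):
    """Sketch of adapter retention: keep the `keep_last` most recent
    adapters, never delete deployed ones, and keep everything when
    keep_last is -1. `adapters` is ordered oldest to newest."""
    if keep_last == -1:
        return list(adapters)  # -1 means retain all adapters
    recent = set(adapters[-keep_last:]) if keep_last > 0 else set()
    # Deployed adapters survive pruning even beyond the keep_last limit.
    return [a for a in adapters if a in recent or a in deployed_ids]

adapters = ["step_100", "step_200", "step_300", "step_400"]
print(prune_adapters(adapters, keep_last=2, deployed_ids={"step_100"}))
# → ['step_100', 'step_300', 'step_400']
```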
Control the CPU and memory resources allocated to your environment containers. This only affects the environments you provide — trainer and inference infrastructure is fully managed by us.
```toml
[infrastructure]
compute_size = "L"
```
| Size | Description |
| --- | --- |
| S | Lower CPU allocation. Suitable for lightweight environments. |
| M | Default. Balanced allocation for most workloads. |
| L | High CPU allocation. Use for environments that compile code or for vision-language models with heavy image processing. |
If not specified, runs default to M. Most users won’t need to change this — use L if you notice slow CPU-bound operations during training.
The recommended way to supply secrets to hosted training is via environment secrets. Secrets linked or added to your environment are automatically injected at runtime — no config changes needed.
If you prefer to supply secrets via a file, you can use env_file in your training config instead:
```toml
env_file = ["secrets.env"]
```
The secrets.env file should contain key-value pairs:
```
OPENAI_API_KEY=sk-...
CUSTOM_API_KEY=...
```
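As a rough illustration of the file format (one KEY=VALUE pair per line), here is a hypothetical minimal parser; the platform's own loader may differ, e.g. in how it treats comments or quoting:

```python
def parse_env_file(text):
    """Parse dotenv-style text into a dict: one KEY=VALUE per line,
    skipping blank lines and lines starting with '#'."""
    secrets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        secrets[key.strip()] = value.strip()
    return secrets

text = "OPENAI_API_KEY=sk-...\nCUSTOM_API_KEY=abc123\n"
print(parse_env_file(text))
# → {'OPENAI_API_KEY': 'sk-...', 'CUSTOM_API_KEY': 'abc123'}
```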
You can also manage secrets via the CLI:
```bash
prime secret list              # list global secrets
prime env secret list my-env   # list secrets for an environment
```
In your environment code, validate required keys early using `vf.ensure_keys()`:
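To illustrate the fail-fast pattern, here is a minimal stand-in for that check (a hypothetical reimplementation; the real `vf.ensure_keys()` signature may differ):

```python
import os

def ensure_keys(keys):
    """Stand-in sketch for vf.ensure_keys: raise at startup if any
    required secret is missing from the process environment, so a
    misconfigured run fails immediately rather than mid-training."""
    missing = [k for k in keys if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing required secrets: {', '.join(missing)}")

os.environ["CUSTOM_API_KEY"] = "demo"  # simulate an injected secret
ensure_keys(["CUSTOM_API_KEY"])        # passes: key is present
print("secrets validated")
```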