Hosted Training runs are configured via a .toml file. This page covers all available configuration fields, from basic setup to advanced features like multi-environment training, online evaluation, difficulty filtering, and W&B integration.

Full Config Reference

Below is a complete annotated config showing all available fields. Required fields are uncommented; optional fields are shown as comments with their defaults.
# ============================================================
# Core Configuration (required)
# ============================================================
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"   # HuggingFace model ID
max_steps = 100                                # Total training steps
batch_size = 256                               # Rollouts per training batch
rollouts_per_example = 8                       # Rollouts generated per dataset example

# ============================================================
# Training Hyperparameters (optional)
# ============================================================
# learning_rate = 1e-4                         # Learning rate for LoRA
# lora_alpha = 16                              # LoRA alpha scaling factor
# oversampling_factor = 2.0                    # Oversample factor for rollout generation
# max_async_level = 2                          # Maximum async generation level
# trajectory_strategy = "interleaved"          # "interleaved" or "branching"

# ============================================================
# Secrets (optional)
# ============================================================
# env_file = ["secrets.env"]                   # File(s) containing environment secrets

# ============================================================
# Sampling Configuration (required)
# ============================================================
[sampling]
max_tokens = 512                               # Max tokens per model response

# ============================================================
# Environment(s) (at least one required)
# ============================================================
[[env]]
id = "primeintellect/alphabet-sort"            # Environments Hub ID (owner/name)
# args = { min_turns = 3, max_turns = 5 }      # Arguments passed to load_environment()

# Add multiple [[env]] sections for multi-environment training:
# [[env]]
# id = "primeintellect/another-env"
# args = { split = "train", max_examples = 1000 }

# ============================================================
# Weights & Biases Logging (optional)
# ============================================================
# [wandb]
# project = "my-project"                       # W&B project name
# name = "my-run-name"                         # W&B run name
# entity = "my-team"                           # W&B team/entity

# ============================================================
# Online Evaluation (optional)
# ============================================================
# [eval]
# interval = 100                               # Run eval every N training steps
# num_examples = -1                            # Number of eval examples (-1 = all)
# rollouts_per_example = 1                     # Rollouts per eval example
# eval_base_model = true                       # Also evaluate the base (untrained) model
#
# [[eval.env]]                                 # Environment-specific eval overrides
# id = "primeintellect/eval-env"
# args = { split = "test" }
# num_examples = 30
# rollouts_per_example = 4

# ============================================================
# Validation During Training (optional)
# ============================================================
# [val]
# num_examples = 64                            # Validation examples per check
# rollouts_per_example = 1                     # Rollouts per validation example
# interval = 5                                 # Validate every N steps

# ============================================================
# Buffer / Difficulty Filtering (optional)
# ============================================================
# [buffer]
# online_difficulty_filtering = false          # Enable difficulty-based sampling
# easy_threshold = 0.8                         # Reward above this = "easy"
# hard_threshold = 0.2                         # Reward below this = "hard"
# easy_fraction = 0.0                          # Fraction of easy examples to include
# hard_fraction = 0.0                          # Fraction of hard examples to include
# env_ratios = [0.5, 0.5]                      # Ratio between envs (multi-env only)
# seed = 42                                    # Random seed

Field Reference

Core Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | HuggingFace model ID. Must be a supported model. Run prime rl models to see available options. |
| max_steps | integer | Yes | Total number of training steps. |
| batch_size | integer | Yes | Number of rollouts consumed per training batch. Larger values improve stability. |
| rollouts_per_example | integer | Yes | Number of rollouts generated per dataset example. Higher values give more reward signal diversity. |
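For orientation, here is a minimal sketch containing only the required fields, using values taken from the full example above:
model = "Qwen/Qwen3-30B-A3B-Instruct-2507"
max_steps = 100
batch_size = 256
rollouts_per_example = 8

[sampling]
max_tokens = 512

[[env]]
id = "primeintellect/alphabet-sort"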

Training Hyperparameters

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| learning_rate | float | 1e-4 | Learning rate for the LoRA adapter. |
| lora_alpha | integer | 16 | LoRA alpha scaling factor. Controls the magnitude of LoRA updates. |
| oversampling_factor | float | 2.0 | Generate this factor times as many rollouts as needed per batch to ensure sufficient data. |
| max_async_level | integer | 2 | Maximum level of asynchronous generation. Higher values increase throughput but use more memory. |
| trajectory_strategy | string | "interleaved" | How multi-turn trajectories are generated. "interleaved" runs turns across examples concurrently; "branching" generates full trajectories per example before moving on. |
| env_file | array of strings | [] | Path(s) to .env files containing secrets (e.g., API keys). See Secrets Management. |
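These fields can be overridden at the top level of the config. As a sketch, the snippet below lowers the learning rate and switches to branching trajectories; the specific values are illustrative, not recommendations:
learning_rate = 5e-5
lora_alpha = 32
trajectory_strategy = "branching"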

Sampling

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| [sampling].max_tokens | integer | Yes | Maximum number of tokens the model can generate per response turn. |

Environment

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| [[env]].id | string | Yes | Environment ID on the Environments Hub, in owner/name format. |
| [[env]].args | table | No | Arguments passed to the environment's load_environment() function. |
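As a sketch, each key in args is forwarded to load_environment() when the environment is loaded; the argument names below come from the alphabet-sort example used elsewhere on this page:
[[env]]
id = "primeintellect/alphabet-sort"
# These keys are passed through as load_environment(min_turns=3, max_turns=5)
args = { min_turns = 3, max_turns = 5 }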

Multi-Environment Training

You can train on multiple environments simultaneously by adding multiple [[env]] sections:
[[env]]
id = "primeintellect/alphabet-sort"
args = { min_turns = 3, max_turns = 5 }

[[env]]
id = "primeintellect/gsm8k"
args = { split = "train" }
Control the ratio of examples from each environment using the [buffer] section:
[buffer]
env_ratios = [0.6, 0.4]   # 60% alphabet-sort, 40% gsm8k

Online Evaluation

Enable periodic evaluation during training to track progress without interrupting the run:
[eval]
interval = 100                    # Evaluate every 100 steps
num_examples = -1                 # Use all eval examples
rollouts_per_example = 1
eval_base_model = true            # Include base model comparison

[[eval.env]]
id = "primeintellect/alphabet-sort"
args = { split = "test" }
num_examples = 50
rollouts_per_example = 4
The [eval] section sets global defaults, and [[eval.env]] sections can override settings per environment.

Validation

Validation is a lightweight check that runs more frequently than full evaluation:
[val]
num_examples = 64
rollouts_per_example = 1
interval = 5                      # Validate every 5 steps
This uses the training environment’s validation split (if available) and reports metrics to W&B and the dashboard.

Difficulty Filtering

The difficulty buffer helps focus training on examples at the right difficulty level for the current model:
[buffer]
online_difficulty_filtering = true
easy_threshold = 0.8              # Examples scored above 0.8 are "easy"
hard_threshold = 0.2              # Examples scored below 0.2 are "hard"
easy_fraction = 0.0               # Exclude easy examples (0.0 = drop all easy)
hard_fraction = 0.0               # Exclude hard examples
This is especially useful for large datasets with a wide difficulty range. By filtering out examples that are too easy (model already solves them) or too hard (model gets no reward signal), you focus compute on examples where the model can meaningfully improve.
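If you would rather keep a small sample of filtered examples instead of dropping them entirely, set the fractions to nonzero values. A sketch (the fractions here are illustrative):
[buffer]
online_difficulty_filtering = true
easy_threshold = 0.8
hard_threshold = 0.2
easy_fraction = 0.1               # keep 10% of examples scored as easy
hard_fraction = 0.05              # keep 5% of examples scored as hard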

Weights & Biases Integration

Log training metrics, reward curves, and rollout samples to W&B:
[wandb]
project = "my-rl-experiments"
name = "qwen3-30b-alphabet-sort"
entity = "my-team"
When W&B is configured, all training metrics, evaluation results, and sample rollouts are logged automatically.

Secrets Management

If your environment requires API keys or other secrets (e.g., for LLM judge calls or external tool access), you can provide them via an environment file:
env_file = ["secrets.env"]
The secrets.env file should contain key-value pairs:
OPENAI_API_KEY=sk-...
CUSTOM_API_KEY=...
You can also manage secrets via the CLI:
prime secrets             # Manage global secrets
prime env secrets         # Manage per-environment secrets
In your environment code, validate required keys early using vf.ensure_keys():
import verifiers as vf

def load_environment(api_key_var: str = "OPENAI_API_KEY") -> vf.Environment:
    vf.ensure_keys([api_key_var])  # fail fast if the required key is not set
    # ...
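Putting these pieces together, a config that points such an environment at a custom key might look like the following sketch (the environment ID is hypothetical; the argument name matches the signature above):
env_file = ["secrets.env"]

[[env]]
id = "owner/my-env"                          # hypothetical environment
args = { api_key_var = "CUSTOM_API_KEY" }    # key must be defined in secrets.env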