Hosted Evaluations run your environment on Prime-managed infrastructure and store the run in Prime Evals. You can launch them either from the Environments Hub UI or directly from the CLI with prime eval run --hosted.

What hosted evaluations are for

Use hosted evaluations when you want Prime to handle the execution environment for you:
  • Run a published environment without setting up local Python dependencies
  • Evaluate large jobs against a Hub environment slug
  • Monitor logs remotely and share runs through the platform
  • Grant temporary sandbox or instance permissions for tool-using environments
Hosted evaluations require an environment that is already published to the Environments Hub. If you only have a local environment, push it first with prime env push.

Prerequisites

Before running a hosted evaluation, make sure you have:
  1. A published environment on the Environments Hub
    prime env push
    
  2. Write access to that environment
  3. Prime CLI installed and authenticated if you plan to use the CLI flow
  4. Sufficient account balance to cover inference for the model you choose

Quick start with the CLI

The new hosted eval flow is built into prime eval run.
prime eval run primeintellect/gsm8k --hosted
This creates a hosted run on the platform instead of executing the evaluation locally.

Follow logs until completion

prime eval run primeintellect/gsm8k --hosted --follow
With --follow, the CLI keeps polling the run, streams hosted logs, and exits when the evaluation reaches a terminal state.
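If you invoke this from scripts or CI, it can help to assemble the command programmatically. A minimal sketch in Python; the helper function is illustrative and not part of the Prime CLI, and it only uses the flags documented on this page:

```python
def build_hosted_eval_cmd(env_slug, model=None, follow=True, timeout_minutes=None):
    """Assemble a `prime eval run --hosted` invocation (hypothetical helper)."""
    cmd = ["prime", "eval", "run", env_slug, "--hosted"]
    if model:
        cmd += ["-m", model]
    if follow:
        # With --follow, the CLI blocks until the run reaches a terminal state.
        cmd.append("--follow")
    if timeout_minutes is not None:
        cmd += ["--timeout-minutes", str(timeout_minutes)]
    return cmd

# Pass the result to subprocess.run(cmd, check=True) to launch the run.
print(build_hosted_eval_cmd("primeintellect/gsm8k"))
```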

Run from a TOML config

Hosted evals also support TOML configs.
model = "openai/gpt-4.1-mini"
num_examples = 20
rollouts_per_example = 2

[[eval]]
env_id = "primeintellect/gsm8k"
env_args = { split = "test" }

[[eval]]
env_id = "primeintellect/alphabet-sort"
Run it with:
prime eval run configs/eval/benchmark-hosted.toml --hosted

Hosted-only CLI options

These flags only apply when you pass --hosted:
  • --follow: Stream hosted logs and wait for completion
  • --poll-interval: Polling interval for hosted status/log streaming
  • --timeout-minutes: Optional timeout in minutes for the hosted run (default: 120, max: 1440, i.e. 24 hours)
  • --allow-sandbox-access: Allow sandbox read/write access
  • --allow-instances-access: Allow instance creation and management
  • --custom-secrets: JSON object of secrets injected for the hosted run
  • --eval-name: Custom display name for the hosted evaluation

Example: environment args and custom secrets

prime eval run my-team/browser-agent \
  --hosted \
  -m anthropic/claude-sonnet-4.5 \
  -a '{"task":"checkout"}' \
  --custom-secrets '{"SHOP_API_KEY":"..."}' \
  --allow-sandbox-access \
  --timeout-minutes 45
Use --custom-secrets for run-specific values. Secrets already configured on the environment continue to work as usual.
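Hand-writing the JSON for `--custom-secrets` on the command line is an easy place to introduce quoting mistakes. One way to avoid that is to render it with `json.dumps` and shell-quote the result; the helper below is illustrative, not part of the Prime CLI:

```python
import json
import shlex

def custom_secrets_flag(secrets: dict) -> str:
    """Render a --custom-secrets argument from a dict (illustrative helper)."""
    return f"--custom-secrets {shlex.quote(json.dumps(secrets))}"

# The values here are placeholders, not real credentials.
flag = custom_secrets_flag({"SHOP_API_KEY": "sk-test", "REGION": "us-east-1"})
print(flag)
```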
Hosted eval env_args are passed to load_environment(), similar to training [[env]].args. Use them for custom environment settings such as split, difficulty, tool configuration, or other environment-specific overrides supported by your environment.

Monitoring and managing hosted runs

After starting a hosted evaluation, the CLI prints the evaluation id and platform URL.
--follow only streams logs and waits for completion. Hosted evaluations do not yet have a training-style checkpoint or restart workflow.

List evaluations

prime eval list
The list output includes a Type column so you can distinguish HOSTED and LOCAL evaluations.

Inspect one run

prime eval get <eval-id>
prime eval samples <eval-id>

Stream logs for an existing hosted run

prime eval logs <eval-id> -f

Stop a running hosted evaluation

prime eval stop <eval-id>

Running a hosted evaluation from the dashboard

You can still launch the same workflow from the Environments Hub UI.

Step 1: Open the environment

  1. Go to the Environments Hub
  2. Open your environment
  3. Go to the Evaluations tab
  4. Click Run Hosted Evaluation

Step 2: Choose a model

Select an inference model for the run.

Step 3: Configure the run

Set the number of examples, rollouts per example, and any environment arguments.
Environment secrets linked in the Hub are exposed automatically during hosted evaluation runs. You only need --custom-secrets when launching a CLI run with additional per-run secrets.

Step 4: Monitor progress

You will be redirected to the evaluations list where you can watch the run status.

Step 5: Review results

Completed runs show aggregate metrics and per-sample outputs in Prime Evals.

Failure modes

When a hosted evaluation fails, the platform surfaces the error message and logs. Common causes:
  1. Environment code errors — import failures, dependency issues, invalid verifier logic
  2. Missing permissions — the run needs sandbox or instance access but those flags were not enabled
  3. Missing secrets — environment-linked or custom secrets were not available
  4. Timeouts — the run exceeded the configured or platform timeout
  5. Inference issues — temporary provider or model errors
Start with a small hosted run first, such as -n 5 -r 1, then scale up once logs and scores look correct.

Pricing

Hosted evaluations use Prime Inference under the hood. Cost depends on:
  • The selected model
  • Prompt and completion token usage
  • num_examples × rollouts_per_example
  • Any extra tool usage triggered by the environment
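A back-of-the-envelope estimate multiplies those factors. In the sketch below, the token counts and per-million-token prices are placeholder assumptions for illustration, not actual Prime Inference rates:

```python
def estimate_cost(num_examples, rollouts_per_example,
                  avg_prompt_tokens, avg_completion_tokens,
                  price_per_1m_prompt, price_per_1m_completion):
    """Rough hosted-eval cost estimate (illustrative only; excludes tool usage)."""
    rollouts = num_examples * rollouts_per_example
    prompt_cost = rollouts * avg_prompt_tokens * price_per_1m_prompt / 1_000_000
    completion_cost = rollouts * avg_completion_tokens * price_per_1m_completion / 1_000_000
    return prompt_cost + completion_cost

# 20 examples x 2 rollouts, with made-up token counts and prices:
cost = estimate_cost(20, 2, avg_prompt_tokens=500, avg_completion_tokens=300,
                     price_per_1m_prompt=0.40, price_per_1m_completion=1.60)
print(f"${cost:.2f}")  # $0.03
```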

When to use dashboard vs CLI

  • Use the dashboard when you want the simplest point-and-click flow
  • Use the CLI when you want reproducible commands, TOML configs, log following, or automation in scripts/CI