Hosted Evaluations allow you to run environment evaluations directly in the Environment Hub interface, providing a convenient alternative to CLI-based evaluations. The platform handles all the infrastructure: it automatically provisions compute resources and stores results in your environment’s evaluation history.

Prerequisites

Before running hosted evaluations, ensure you have:
  1. Published Environment: Your environment must be pushed to the Environment Hub
    prime env push
    
  2. Environment Access: You must be the owner or have write permissions on the environment
  3. Account Balance: Sufficient credits to cover inference costs (costs vary by model and usage)

Running a Hosted Evaluation

Step 1: Navigate to Your Environment

  1. Go to the Environment Hub
  2. Find your environment (either in “My Environments” or via search)
  3. Click on the environment to open its detail page
  4. Navigate to the “Evaluations” tab

Step 2: Start New Evaluation

Click the “Run Hosted Evaluation” button to begin the evaluation wizard.
If you don’t see this button, verify that you have write permissions on the environment.

Step 3: Select Model

The model selection page displays all available inference models.
[Image: Model selection interface showing various inference models with pricing]

Step 4: Configure Evaluation

Configure how your evaluation will run:
  • Number of Examples (integer, required): Number of test cases to evaluate from your environment’s dataset.
  • Rollouts per Example (integer, required): Number of times to run inference on each example for statistical aggregation.
  • Environment Arguments (object, optional): Key-value pairs passed to your environment during evaluation (see the example below).
Environment Secrets: Any secrets configured in your environment settings will be automatically exposed during evaluation. You don’t need to pass them as arguments.
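For example, Environment Arguments can be used to point the environment at a particular dataset split or cap episode length. A minimal sketch of what such arguments might look like, written as a Python dict for illustration; the keys are hypothetical, so pass only the arguments your own environment is written to accept:

  # Hypothetical environment arguments (illustrative keys only).
  # Pass only keys that your environment actually accepts.
  env_args = {
      "dataset_split": "test",  # which split of the environment's dataset to evaluate
      "max_turns": 5,           # cap on interaction turns per example
  }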
Click “Run Evaluation” to submit your evaluation job.
[Image: Configuration page with evaluation parameters and summary panel]

Step 5: Monitor Progress

After submitting, you’ll be redirected to the Evaluations list, where you can monitor progress.
[Image: Evaluations list showing multiple runs with different statuses]

Step 6: View Results

Click on a completed evaluation to view detailed results.
Metrics Tab:
  • Average reward/score across all examples
  • Total samples evaluated
  • Statistical aggregations (if multiple rollouts)
  • Model information and parameters used
Examples Tab:
  • Individual test case results
  • Model inputs and outputs
  • Per-example scores
  • Grouped view for multiple rollouts (shows variance)
[Image: Results page displaying metrics and detailed evaluation statistics]
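To make the rollout-level aggregation concrete, the short sketch below computes the per-example mean and standard deviation from raw rollout rewards. The rewards and example IDs are made-up values for illustration, not output read from the platform:

  # Illustrative aggregation across rollouts (made-up rewards).
  from statistics import mean, stdev

  # rewards[example_id] -> one reward per rollout
  rewards = {
      "example_0": [1.0, 1.0, 0.0],
      "example_1": [0.5, 0.5, 0.5],
  }

  for example_id, rollout_rewards in rewards.items():
      avg = mean(rollout_rewards)
      spread = stdev(rollout_rewards) if len(rollout_rewards) > 1 else 0.0
      print(f"{example_id}: mean={avg:.2f} std={spread:.2f}")

  # Overall average reward across all rollouts of all examples
  all_rewards = [r for rs in rewards.values() for r in rs]
  print(f"overall mean reward: {mean(all_rewards):.2f}")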

Failed Evaluations

When an evaluation fails, detailed debugging information is provided.
Error Information:
  • Error message describing what went wrong
  • Full evaluation logs in a scrollable terminal view
Common Failure Reasons:
  1. Environment Errors:
    • Missing secrets
    • Bug in environment code
    • Missing dependencies
    • Incorrect verifier implementation
  2. Timeout:
    • The evaluation exceeded the maximum allowed time (60 minutes)
    • Consider reducing the number of examples or optimizing your environment
  3. Insufficient Balance:
    • Not enough credits to complete evaluation
    • Add funds and re-run
  4. Model API Errors:
    • Temporary issues with inference service
    • Rate limiting
    • Try again or contact support
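Environment-side failures such as an incorrect verifier can often be caught before spending credits by sanity-checking your reward logic locally. The sketch below is generic and hypothetical; score_response stands in for whatever verifier function your environment actually implements:

  # Hypothetical verifier sanity check: score cases with known expected rewards
  # before launching a hosted evaluation. score_response is a placeholder for
  # your environment's real verifier logic.
  def score_response(response: str, answer: str) -> float:
      return 1.0 if answer.strip() in response else 0.0

  # A known-good and a known-bad completion should score 1.0 and 0.0.
  assert score_response("The answer is 42.", "42") == 1.0
  assert score_response("I don't know.", "42") == 0.0
  print("verifier sanity checks passed")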

Pricing

Hosted evaluations use the Prime Inference API under the hood. Costs are calculated based on:
  • Model Pricing: Each model has different rates (shown on model selection page)
  • Token Usage: Both prompt and completion tokens are counted
  • Number of Examples × Rollouts: Each example is run once per rollout, so the total number of inference calls is examples × rollouts per example
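As a rough illustration of how these factors combine, the sketch below estimates the cost of a small run. The per-token prices and token counts are placeholders, not actual Prime Inference rates; check the model selection page for real pricing:

  # Back-of-the-envelope cost estimate with placeholder numbers.
  num_examples = 10
  rollouts_per_example = 3
  total_calls = num_examples * rollouts_per_example   # 30 inference calls

  avg_prompt_tokens = 1_500        # placeholder estimate per call
  avg_completion_tokens = 500      # placeholder estimate per call
  price_per_1m_prompt = 0.50       # placeholder $ per 1M prompt tokens
  price_per_1m_completion = 1.50   # placeholder $ per 1M completion tokens

  cost = total_calls * (
      avg_prompt_tokens / 1e6 * price_per_1m_prompt
      + avg_completion_tokens / 1e6 * price_per_1m_completion
  )
  print(f"estimated cost: ${cost:.2f}")  # about $0.05 with these numbers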
Start with a small number of examples (5-10) to test your environment before running large-scale evaluations.