Prerequisites
Before running hosted evaluations, ensure you have:
- Published Environment: Your environment must be pushed to the Environments Hub
- Environment Access: You must be the owner or have write permissions on the environment
- Account Balance: Sufficient credits to cover inference costs (costs vary by model and usage)
Running a Hosted Evaluation
Step 1: Navigate to Your Environment
- Go to the Environments Hub
- Find your environment (either in “My Environments” or via search)
- Click on the environment to open its detail page
- Navigate to the “Evaluations” tab
Step 2: Start New Evaluation
Click the “Run Hosted Evaluation” button to begin the evaluation wizard. If you don’t see this button, verify that you have write permissions on the environment.
Step 3: Select Model
The model selection page displays all available inference models.
Step 4: Configure Evaluation
Configure how your evaluation will run:
- Number of Examples: How many test cases to evaluate from your environment’s dataset.
- Rollouts per Example: How many times to run inference on each example for statistical aggregation.
- Environment Arguments: Optional key-value pairs passed to your environment during evaluation (see the sketch below).
Environment Secrets: Any secrets configured in your environment settings will be automatically exposed during evaluation. You don’t need to pass them as arguments.
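To make these options concrete, here is a hypothetical illustration of the values the wizard collects; the field names and numbers are made up for this example and do not correspond to an actual API.

```python
# Hypothetical illustration of what the evaluation wizard configures.
# Field names and values are illustrative only, not an actual API.
eval_config = {
    "model": "<model chosen in Step 3>",
    "num_examples": 50,          # test cases drawn from the environment's dataset
    "rollouts_per_example": 3,   # repeated inference per example for statistical aggregation
    "env_args": {                # optional key-value pairs passed to the environment
        "difficulty": "hard",
    },
    # Note: environment secrets are not listed here; any secrets configured in the
    # environment settings are exposed automatically during evaluation.
}
```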

Step 5: Monitor Progress
After submitting, you’ll be redirected to the Evaluations list where you can monitor progress.
Step 6: View Results
Click on a completed evaluation to view detailed results.
Metrics Tab:
- Average reward/score across all examples
- Total samples evaluated
- Statistical aggregations (if multiple rollouts)
- Model information and parameters used
Per-example results:
- Individual test case results
- Model inputs and outputs
- Per-example scores
- Grouped view for multiple rollouts (shows variance; see the sketch below)
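As a rough illustration of what the grouped rollout view reports, the sketch below aggregates made-up per-rollout rewards into a per-example mean and variance plus an overall average; all numbers are placeholders.

```python
from statistics import mean, pvariance

# Made-up rewards: each example was run for 3 rollouts.
rollout_rewards = {
    "example_0": [1.0, 1.0, 0.0],
    "example_1": [0.5, 0.5, 0.5],
}

# Per-example aggregation, as in the grouped rollout view.
for example_id, rewards in rollout_rewards.items():
    print(example_id, "mean:", mean(rewards), "variance:", pvariance(rewards))

# Overall average reward across all samples.
all_rewards = [r for rewards in rollout_rewards.values() for r in rewards]
print("average reward:", round(mean(all_rewards), 3))
```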

Failed Evaluations
When an evaluation fails, detailed debugging information is provided.
Error Information:
- Error message describing what went wrong
- Full evaluation logs in a scrollable terminal view

Common failure reasons include:
Environment Errors:
- Missing secrets
- Bug in environment code
- Missing dependencies
- Incorrect verifier implementation (see the sketch at the end of this section)
Timeout:
- Evaluation took longer than the maximum allowed time (60 minutes)
- Consider reducing the number of examples or optimizing the environment code
Insufficient Balance:
- Not enough credits to complete evaluation
- Add funds and re-run
Model API Errors:
- Temporary issues with inference service
- Rate limiting
- Try again or contact support
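If you suspect a verifier bug, check that your scoring function returns a numeric reward for every example, even when the model output is malformed. The sketch below is a generic illustration of that shape; the function name and signature are illustrative, not the exact API your environment framework uses.

```python
# Illustrative only: a verifier/reward function should return a float for every
# rollout, including ones where the model output is malformed.
def exact_answer_reward(completion: str, answer: str, **kwargs) -> float:
    try:
        # Score 1.0 if the reference answer appears in the completion, else 0.0.
        return 1.0 if answer.strip() and answer.strip() in completion else 0.0
    except Exception:
        # Returning a score (rather than raising or returning None) keeps the
        # evaluation from failing on a single bad example.
        return 0.0
```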
Pricing
Hosted evaluations use the Prime Inference API under the hood. Costs are calculated based on:
- Model Pricing: Each model has different rates (shown on the model selection page)
- Token Usage: Both prompt and completion tokens are counted
- Number of Examples × Rollouts: Multiplies the total inference calls (see the worked sketch below)
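For a rough sense of how these factors combine, the sketch below multiplies examples by rollouts and applies per-token rates. All rates and token counts are placeholders; substitute the figures shown on the model selection page.

```python
# Back-of-the-envelope cost estimate. All numbers are placeholders.
num_examples = 50
rollouts = 3
total_calls = num_examples * rollouts            # examples × rollouts

avg_prompt_tokens = 800                          # per call
avg_completion_tokens = 400                      # per call
prompt_rate_per_million = 1.00                   # $ per 1M prompt tokens (placeholder)
completion_rate_per_million = 3.00               # $ per 1M completion tokens (placeholder)

estimated_cost = total_calls * (
    avg_prompt_tokens * prompt_rate_per_million / 1_000_000
    + avg_completion_tokens * completion_rate_per_million / 1_000_000
)
print(f"{total_calls} calls, estimated cost: ${estimated_cost:.2f}")  # -> 150 calls, $0.30
```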