The `prime eval` command provides powerful evaluation capabilities for testing environments against various language models via Prime Inference. You can run evaluations locally or launch them as hosted evaluations on the platform with `--hosted`. This guide covers both workflows, model selection, and best practices.
Quick Start: Running Your First Evaluation
Prerequisites
- Python 3.10–3.13 — Required for the Prime CLI and verifiers
- Install Prime CLI — Follow the installation guide
- Set up API keys — Configure your Prime Inference API key via `prime login`
- Install an environment — Use `prime env install owner/environment`
Basic Evaluation
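A minimal local run might look like the following. The environment slug is illustrative, and `-n` sets the number of examples, as described in the parameter reference below:

```bash
# Evaluate an installed environment with the default model (openai/gpt-4.1-mini)
prime eval run primeintellect/gsm8k -n 5
```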
Hosted Evaluation Quick Start
If your environment is already published to the Environments Hub, you can run it remotely on Prime-managed infrastructure. Add `--follow` to stream hosted logs until completion:
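A hosted run that streams logs might look like this (environment slug illustrative):

```bash
# Launch on Prime-managed infrastructure and stream logs until completion
prime eval run primeintellect/gsm8k --hosted --follow
```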
Using the prime eval Command
Basic Syntax
You can invoke an evaluation as `prime eval ENVIRONMENT` or `prime eval run ENVIRONMENT`. Both forms work identically.
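Assuming the short form simply drops `run`, an installed environment named `gsm8k` could be evaluated either way:

```bash
prime eval gsm8k        # short form
prime eval run gsm8k    # explicit form; behaves identically
```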
Available Models
To see all available models for evaluation, run `prime inference models`. Commonly used models include:
| Model | Notes |
|---|---|
| openai/gpt-4.1-mini | Fast, cost-effective (default) |
| openai/gpt-4.1 | Higher quality |
| anthropic/claude-sonnet-4.5 | Strong reasoning |
| meta-llama/llama-3.3-70b-instruct | Open-weight, balanced |
| deepseek/deepseek-r1-0528 | Advanced reasoning |
| qwen/qwen3-235b-a22b-instruct-2507 | Large MoE model |
| google/gemini-2.5-flash | Fast multimodal |
Model availability and pricing may change. Always run `prime inference models` to get the current list with pricing.
Core Parameters
Environment to evaluate. Supported forms:
- Full slug (e.g., `primeintellect/gsm8k`) — recommended for hosted runs and Hub environments
- Short name (e.g., `gsm8k`) — local-first resolution for installed environments
- TOML config path (e.g., `configs/eval/gsm8k.toml`) — for config-driven runs
Model to use for evaluation. Default: `openai/gpt-4.1-mini`. See `prime inference models` for all available models.
Number of examples to evaluate. Default: 5
Number of rollouts per example for statistical significance. Default: 3
Advanced Options
Maximum concurrent requests to the inference API. Default: 32
Maximum tokens to generate per request. If unset, uses model default.
Sampling temperature (0.0–2.0). Higher values = more randomness.
JSON string with additional sampling arguments. Example: `'{"enable_thinking": false, "max_tokens": 256}'`
Environment-specific arguments as JSON. Example: `'{"difficulty": "hard"}'`
Output and Storage Options
Enable verbose output for detailed logging.
Save evaluation results to disk. Default: true
Save dataset every N rollouts. Useful for checkpointing. Default: 1
Save results to Hugging Face Hub.
Specify Hugging Face Hub dataset name.
Skip uploading results for local evaluations (local runs upload by default).
Hosted Evaluation Options
Run the evaluation on the platform instead of locally. Requires a published environment.
Follow hosted evaluation status and stream logs until completion. Only valid with `--hosted`.
Polling interval in seconds for hosted status and log streaming. Only valid with `--hosted`.
Optional timeout in minutes for a hosted evaluation. Default: 120. Max: 1440 (24 hours).
Allow sandbox read/write access for hosted evaluations.
Allow hosted evaluations to create and manage instances.
JSON object of additional secrets to inject into a hosted run.
Custom display name for a hosted evaluation.
End-to-End Example
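One possible end-to-end session, using the environment slug from earlier examples; the `list` subcommand spelling is an assumption:

```bash
# 1. Install the environment from the Hub
prime env install primeintellect/gsm8k

# 2. Run a small local evaluation (defaults: 5 examples, 3 rollouts each)
prime eval run primeintellect/gsm8k -n 5

# 3. Results are saved to disk (and uploaded) by default; list past runs
prime eval list
```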
A complete workflow runs from installation to viewing results: install the environment, run the evaluation, then inspect the saved results.
Using TOML Configs for Multi-Environment Evals
For reproducible evals, you can pass a TOML config file instead of individual CLI flags. Config files can contain multiple `[[eval]]` entries, which makes them useful for benchmark suites and multi-environment comparisons.
Settings inside each `[[eval]]` entry override global defaults at the top of the file.
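A hypothetical config with global defaults and two `[[eval]]` entries. The field names and slugs are illustrative, not the verified schema; see Verifiers Evaluation for the real one:

```toml
# Global defaults (illustrative field names)
model = "openai/gpt-4.1-mini"
num_examples = 20

# Each [[eval]] entry can override the defaults above
[[eval]]
env = "primeintellect/gsm8k"

[[eval]]
env = "primeintellect/another-env"
model = "openai/gpt-4.1"
```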
For the full config schema, precedence rules, and advanced options like ablations and endpoint registries, see Verifiers Evaluation.
Hosted Evaluations from the CLI
Hosted eval runs are useful when you want the platform to execute the environment remotely and keep logs on the platform.
Managing Evaluation Results
List Evaluations
Get Evaluation Details
View Samples
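Hypothetical invocations for the three tasks above; the subcommand spellings are assumptions, so check `prime eval --help` for the exact names:

```bash
prime eval list                 # list recent evaluations
prime eval get <eval-id>        # details for one evaluation
prime eval samples <eval-id>    # inspect individual samples/rollouts
```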
Push Local Results
If you ran evaluations offline or with `--skip-upload`, you can push results later:
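A hedged sketch; the `push` subcommand name and the results path are assumptions:

```bash
# Upload previously saved local results to the platform
prime eval push ./outputs/evals
```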
Model Selection Guide
When choosing models for evaluation, consider:
- Task complexity — Harder tasks benefit from larger, reasoning-capable models
- Cost — Smaller models are significantly cheaper for large-scale evals
- Throughput — Some models handle high concurrency better than others
Self-hosting with vLLM
For most users, we recommend Prime Inference for easier setup. Consider self-hosting only for specialized requirements or very large-scale evaluations, such as when you:
- Need specific model variants or custom fine-tuned models
- Require maximum cost efficiency for very large evaluations (1M+ examples)
- Are testing smaller models not available via API
Configuring for Self-Hosted Models
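One way this could look, assuming `prime eval` can target any OpenAI-compatible endpoint; the flag names (`--model`, `--api-base-url`) are assumptions, so verify them against `prime eval run --help`:

```bash
# Point the evaluation at a locally served OpenAI-compatible endpoint
prime eval run gsm8k \
  --model my-org/my-finetuned-model \
  --api-base-url http://localhost:8000/v1
```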
Recommended Self-Hosted Models
High Performance (MoE with small active parameters):
vLLM Configuration Example
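A minimal sketch using vLLM's OpenAI-compatible server; the model ID is illustrative, and the hardware flags depend on your GPUs:

```bash
# Serve a model with an OpenAI-compatible API on the default port 8000
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```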
Environment-Specific Dependencies
Some environments may require additional dependencies beyond the base installation. Check the environment’s documentation or use `prime env info` to see requirements:
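For example, with the slug used earlier in this guide:

```bash
prime env info primeintellect/gsm8k
```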
Troubleshooting
Common Issues
Rate Limiting
Error: “Rate limit exceeded” or 429 responses
Solution: Reduce concurrency
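For instance, lowering the maximum concurrent requests from the default of 32; the flag spelling is an assumption, so check `prime eval run --help`:

```bash
prime eval run gsm8k --max-concurrent 8
```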
Insufficient Balance
Error: “Insufficient balance” or payment required
Solution: Add funds to your Prime Intellect account or use cheaper models
Environment Not Found
Error: Environment not found or installation failures
Solution: Verify environment exists and reinstall
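Using the install command from the prerequisites (slug illustrative):

```bash
prime env install primeintellect/gsm8k
```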
Python Version Issues
Error: Installation failures or import errors
Solution: Ensure you’re using Python 3.10–3.13
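A quick check of the active interpreter (plain shell, no Prime CLI needed):

```shell
# Show the interpreter version; the Prime CLI supports Python 3.10–3.13
python3 --version
python3 -c 'import sys; v = sys.version_info[:2]; print("supported" if (3, 10) <= v <= (3, 13) else "unsupported: %d.%d" % v)'
```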
Performance Tips
- Start Small: Begin with `-n 5` to test your setup
- Monitor Costs: Check token usage before large evaluations
- Use Appropriate Models: Match model capability to task complexity
- Optimize Concurrency: Balance speed vs. rate limits (default 32 is usually good)
- Save Results: Results auto-save and upload by default
Integration with Other APIs
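As a hedged sketch, any OpenAI-compatible provider can in principle be targeted by overriding the endpoint and key; the flag names and provider values here are assumptions:

```bash
prime eval run gsm8k \
  --model some-provider/some-model \
  --api-base-url https://api.example.com/v1 \
  --api-key-var MY_PROVIDER_API_KEY
```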
You can use other OpenAI-compatible API providers by pointing the evaluation at their endpoint.
Next Steps
Environments Hub
Learn more about creating and managing environments
Inference API Reference
Detailed API documentation for inference endpoints
Prime CLI Reference
Complete CLI command reference
Creating Environments
Build your own evaluation environments