The `prime eval` command provides powerful evaluation capabilities for testing environments against various language models via Prime Inference. This guide covers the evaluation workflow, model selection, and best practices.
Quick Start: Running Your First Evaluation
Prerequisites
- Python 3.10–3.13 — Required for the Prime CLI and verifiers
- Install Prime CLI — Follow the installation guide
- Set up API keys — Configure your Prime Inference API key via `prime login`
- Install an environment — Use `prime env install owner/environment`
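Concretely, the setup steps map to two commands (the environment slug below is a placeholder):

```bash
# Authenticate and store your Prime Inference API key
prime login

# Install an environment to evaluate
prime env install owner/environment
```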
Basic Evaluation
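With the defaults described later in this guide (5 examples, 3 rollouts per example, `openai/gpt-4.1-mini`), a first run can be as simple as:

```bash
prime eval run primeintellect/gsm8k
```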
Using the prime eval Command
Basic Syntax
The basic syntax is `prime eval run ENVIRONMENT`. The environment can be given as a full slug or as a short name (see Core Parameters below); both forms work identically once the environment is installed.
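Both ways of naming the environment are shown below, using `gsm8k` as the example:

```bash
prime eval run primeintellect/gsm8k   # full slug; auto-installs if needed
prime eval run gsm8k                  # short name; must already be installed
```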
Available Models
To see all available models for evaluation, run `prime inference models`. Commonly used models include:

| Model | Notes |
|---|---|
| `openai/gpt-4.1-mini` | Fast, cost-effective (default) |
| `openai/gpt-4.1` | Higher quality |
| `anthropic/claude-sonnet-4.5` | Strong reasoning |
| `meta-llama/llama-3.3-70b-instruct` | Open-weight, balanced |
| `deepseek/deepseek-r1-0528` | Advanced reasoning |
| `qwen/qwen3-235b-a22b-instruct-2507` | Large MoE model |
| `google/gemini-2.5-flash` | Fast multimodal |
Model availability and pricing may change. Always run `prime inference models` to get the current list with pricing.
Core Parameters

- Environment (`ENVIRONMENT`) — The environment to evaluate. Two formats are supported:
  - Full slug (e.g., `primeintellect/gsm8k`) — Auto-installs if not already installed
  - Short name (e.g., `gsm8k`) — Must already be installed via `prime env install`
- Model — Model to use for evaluation. Default: `openai/gpt-4.1-mini`. See `prime inference models` for all available models.
- Number of examples (`-n`) — Number of examples to evaluate. Default: 5
- Rollouts per example — Number of rollouts per example for statistical significance. Default: 3
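For instance, a slightly larger smoke test (only the `-n` flag is confirmed elsewhere in this guide; see `prime eval run --help` for the remaining options):

```bash
# 10 examples with the default model and the default 3 rollouts per example
prime eval run gsm8k -n 10
```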
Advanced Options
- Max concurrent — Maximum concurrent requests to the inference API. Default: 32
- Max tokens — Maximum tokens to generate per request. If unset, uses the model default.
- Temperature — Sampling temperature (0.0–2.0). Higher values = more randomness.
- Sampling args — JSON string with additional sampling arguments. Example: `'{"enable_thinking": false, "max_tokens": 256}'`
- Environment args — Environment-specific arguments as JSON. Example: `'{"difficulty": "hard"}'`
Output and Storage Options

- Verbose — Enable verbose output for detailed logging.
- Save results — Save evaluation results to disk. Default: true
- Save frequency — Save the dataset every N rollouts. Useful for checkpointing. Default: 1
- Push to Hub — Save results to the Hugging Face Hub.
- Hub dataset name — Specify the Hugging Face Hub dataset name.
- Skip upload (`--skip-upload`) — Skip uploading results to the platform (results are uploaded by default).
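For example, to keep results on disk without uploading them to the platform (only `--skip-upload` is named elsewhere in this guide):

```bash
# Results are still saved locally; the platform upload is skipped
prime eval run gsm8k -n 5 --skip-upload
```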
End-to-End Example
Here’s a complete workflow from installation to viewing results:
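A sketch of that workflow assembled from commands covered in this guide, with `gsm8k` as the example environment:

```bash
# 1. Install the environment
prime env install primeintellect/gsm8k

# 2. Run a small evaluation against the default model
prime eval run gsm8k -n 5

# 3. Review the results (see Managing Evaluation Results below for the commands)
```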
Managing Evaluation Results

List Evaluations
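A likely invocation (the subcommand name is an assumption; run `prime eval --help` to confirm):

```bash
# List your recent evaluations
prime eval list
```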
Get Evaluation Details
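Similarly, assuming a `get` subcommand that takes an evaluation ID:

```bash
# Show details for a single evaluation (placeholder ID)
prime eval get <eval-id>
```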
View Samples
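With the same caveat about the subcommand name:

```bash
# Inspect individual samples from an evaluation (placeholder ID)
prime eval samples <eval-id>
```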
Push Local Results
If you ran evaluations offline or with `--skip-upload`, you can push results later:
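A hypothetical push invocation (the subcommand name and argument are assumptions; confirm with `prime eval --help`):

```bash
# Upload previously saved local results (path is a placeholder)
prime eval push ./path/to/results
```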
Model Selection Guide
When choosing models for evaluation, consider the following (a model-switching sketch follows the list):

- Task complexity — Harder tasks benefit from larger, reasoning-capable models
- Cost — Smaller models are significantly cheaper for large-scale evals
- Throughput — Some models handle high concurrency better than others
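As a sketch, pairing a stronger model with a harder task and the cheap default with a larger sweep might look like this; the `--model` flag name is an assumption, so confirm it with `prime eval run --help`:

```bash
# Stronger reasoning model for a harder task (assumed --model flag)
prime eval run gsm8k -n 5 --model anthropic/claude-sonnet-4.5

# Cheap default model for a larger sweep
prime eval run gsm8k -n 100
```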
Self-hosting with vLLM
For most users, we recommend Prime Inference for easier setup. Consider self-hosting only for specialized requirements or very large-scale evaluations.
Consider self-hosting if you:

- Need specific model variants or custom fine-tuned models
- Require maximum cost efficiency for very large evaluations (1M+ examples)
- Are testing smaller models not available via API
Configuring for Self-Hosted Models
Recommended Self-Hosted Models
High Performance (MoE with small active parameters):

vLLM Configuration Example
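A minimal sketch of serving an OpenAI-compatible endpoint with vLLM, using the large MoE model from the table above as an illustrative choice; the parallelism, context length, and port values are placeholders to adapt to your hardware:

```bash
# Serve an OpenAI-compatible endpoint with vLLM
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --port 8000
```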
Environment-Specific Dependencies
Some environments may require additional dependencies beyond the base installation. Check the environment’s documentation or use `prime env info` to see requirements:
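For example (the slug is a placeholder):

```bash
# Show an environment's metadata and requirements
prime env info owner/environment
```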
Troubleshooting
Common Issues
Rate Limiting
Error: “Rate limit exceeded” or 429 responses

Solution: Reduce concurrency
Insufficient Balance
Error: “Insufficient balance” or payment required

Solution: Add funds to your Prime Intellect account or use cheaper models
Environment Not Found
Error: Environment not found or installation failures

Solution: Verify environment exists and reinstall
Python Version Issues
Error: Installation failures or import errors

Solution: Ensure you’re using Python 3.10–3.13
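A quick sanity check:

```bash
# Confirm the active Python version is within the supported 3.10-3.13 range
python --version
```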
Performance Tips
- Start Small: Begin with `-n 5` to test your setup
- Monitor Costs: Check token usage before large evaluations
- Use Appropriate Models: Match model capability to task complexity
- Optimize Concurrency: Balance speed vs. rate limits (default 32 is usually good)
- Save Results: Results auto-save and upload by default