The `prime env eval` command integrates with Prime Inference to test environments against various language models. This guide covers both the evaluation process and model selection for optimal results.
Quick Start: Running Your First Evaluation
Prerequisites
- Install Prime CLI - Follow the installation guide
- Set up API keys - Configure your Prime Inference API key
- Install an environment - Use `prime env install owner/environment`
Basic Evaluation
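A minimal first run needs only the environment name; everything else falls back to the defaults described below. The sketch assumes the environment name is passed as the first argument and uses `wordle`, one of the example environments in this guide:

```bash
# Evaluate the installed 'wordle' environment with the default model
# (meta-llama/llama-3.1-70b-instruct) and the default 5 examples
prime env eval wordle
```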
Using the eval Command
Basic Syntax
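The general shape of the command is the environment name followed by optional flags. This is a sketch of the form rather than the exact usage string; run `prime env eval --help` for the authoritative syntax:

```bash
# Environment name first, then any of the options described below
prime env eval <environment-name> [OPTIONS]
```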
Core Parameters
- Environment - Name of the installed verifier environment (e.g., `wordle`, `gsm8k`, `math_python`)
- Model - Model to use for evaluation. Defaults to `meta-llama/llama-3.1-70b-instruct`. Examples: `anthropic/claude-3-5-sonnet-20241022`, `openai/gpt-4o`
- Number of examples - Number of examples to evaluate. Default: 5
- Rollouts per example - Number of rollouts per example for statistical significance. Default: 3
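As an illustration of these parameters together, the run below pins a model and scales up the sample size. Only `--num-examples` is confirmed elsewhere in this guide; `--model` and `--rollouts-per-example` are assumed flag spellings, so check the command's help output before copying:

```bash
# Evaluate 'gsm8k' with GPT-4o on 20 examples and 3 rollouts per example
# (--model and --rollouts-per-example are assumed flag names)
prime env eval gsm8k \
  --model openai/gpt-4o \
  --num-examples 20 \
  --rollouts-per-example 3
```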
Advanced Options
- Max concurrent requests - Maximum concurrent requests to the inference API. Default: 32
- Max tokens - Maximum tokens to generate per request
- Temperature - Sampling temperature (0.0-2.0); higher values produce more randomness
- Sampling args - JSON string with additional sampling arguments
- Environment args - Environment-specific arguments as JSON
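A run exercising these options might look like the following; all flag names here are assumed spellings, and the JSON payloads (`top_p`, `difficulty`) are purely illustrative:

```bash
# Tighter generation settings plus extra sampling and environment arguments
# (flag names and JSON keys are illustrative, not confirmed)
prime env eval math_python \
  --max-tokens 1024 \
  --temperature 0.2 \
  --sampling-args '{"top_p": 0.95}' \
  --env-args '{"difficulty": "hard"}'
```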
Output and Storage Options
- Verbose - Enable verbose output for detailed logging
- Save dataset - Save the evaluation dataset to disk
- Save to Hugging Face Hub - Save results to the Hugging Face Hub
- Hub dataset name - Specify the Hugging Face Hub dataset name
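For a persistent record of a run, the storage options can be combined as sketched below. `--save-to-hf-hub` appears later in this guide; the other flag names and the dataset name `my-org/wordle-eval` are assumptions:

```bash
# Verbose run that saves the dataset locally and pushes results to the Hub
prime env eval wordle \
  --verbose \
  --save-dataset \
  --save-to-hf-hub \
  --hf-hub-dataset-name my-org/wordle-eval
```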
Model Selection Guide
When choosing models for evaluation, consider these key factors:
- Task complexity - How hard your task is, and whether reasoning is required or helpful
- Deployment preference - Whether you want to self-host vs. use an API
- Throughput requirements - How high-throughput you need for inference
- Cost considerations - Balance between model performance and evaluation costs
Self-hosting with vLLM
For most users, we recommend starting with Prime Inference models for easier setup and management. Consider self-hosting only for specialized requirements or very large-scale evaluations. Self-hosting is a good fit if you:
- Are testing with smaller models not easily available via API
- Require specific model variants or custom fine-tuned models
- Need maximum cost efficiency for very large evaluations (1M+ examples)
- Are less sensitive to tok/s on a per-request basis
Configuring Prime CLI for Self-Hosted Models
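The usual pattern is to expose the model through an OpenAI-compatible server and point the CLI at that endpoint. The sketch below serves one of the 30B-A3B models mentioned in the next section with vLLM, then assumes a hypothetical `--api-base-url` option for overriding the inference endpoint; consult the CLI's help output or configuration docs for the actual mechanism:

```bash
# Start an OpenAI-compatible server for a local model with vLLM
vllm serve Qwen/Qwen3-30B-A3B --port 8000

# Point the evaluation at the local endpoint
# (--api-base-url is a hypothetical option; substitute the CLI's real
#  endpoint-override mechanism)
prime env eval gsm8k \
  --model Qwen/Qwen3-30B-A3B \
  --api-base-url http://localhost:8000/v1
```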
Recommended Self-Hosted Models
High Performance (30B-A3B series)
Self-Hosting Best Practices
Hardware Optimization (an example vLLM launch follows this list):
- Use FP8 precision on Ada/Hopper/Blackwell GPUs for improved speed
- Allocate sufficient GPU memory headroom for parallel requests
- Consider multi-GPU deployments for larger models
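As a concrete illustration of the points above, a vLLM launch for a 30B-A3B model might look like this. The flags match recent vLLM releases, but verify them against your installed version and hardware:

```bash
# FP8 weights on Ada/Hopper/Blackwell GPUs, two-way tensor parallelism,
# and some GPU memory headroom left for concurrent requests
vllm serve Qwen/Qwen3-30B-A3B \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85
```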
Cost Considerations:
- 235B models are often more economical via API (around $0.60 per million tokens) than self-hosting
- Self-hosting becomes cost-effective for very large evaluations (1M+ examples)
- Factor in GPU rental costs vs. API pricing
Troubleshooting and Best Practices
Common Issues
Rate Limiting
Error: “Rate limit exceeded” or 429 responses
Solution: Reduce concurrency or add delays between requests
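Lowering concurrency below the default of 32 is usually enough to clear transient 429s. The flag name below is an assumption, so confirm it against the command's help output:

```bash
# Retry with fewer concurrent requests (default is 32)
# (--max-concurrent is an assumed flag name)
prime env eval wordle --max-concurrent 8
```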
Insufficient Balance
Error: “Insufficient balance” or payment required
Solution: Add funds to your Prime Intellect account or use cheaper models
Environment Installation Issues
Error: Environment not found or installation failures
Solution: Verify the environment exists and that you have access
Performance Tips
- Start Small: Begin with `--num-examples 5` to test your setup
- Monitor Costs: Check token usage and costs before large evaluations
- Use Appropriate Models: Match model capability to task complexity
- Optimize Concurrency: Balance speed vs. rate limits
- Save Results: Always use `--save-to-hf-hub` for reproducibility
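Putting several of these tips together, a small, reproducible smoke test before a large run could look like this (both flags are the ones referenced in the tips above):

```bash
# Five examples, results pushed to the Hub for later comparison
prime env eval gsm8k \
  --num-examples 5 \
  --save-to-hf-hub
```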