The Prime CLI provides powerful evaluation capabilities through the prime env eval command, which integrates with Prime Inference to test environments against various language models. This guide covers both the evaluation process and model selection for optimal results.

Quick Start: Running Your First Evaluation

Prerequisites

  1. Install Prime CLI - Follow the installation guide
  2. Set up API keys - Configure your Prime Inference API key
  3. Install an environment - Use prime env install owner/environment
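
The eval command resolves API keys from environment variables (the --api-key-var examples later in this guide show the same pattern for other providers). As a minimal sketch, assuming your Prime Inference key lives in a variable named PRIME_API_KEY (an assumption, not a documented default; consult the installation guide for the exact setup your CLI version expects):

# Assumption: PRIME_API_KEY is the variable your CLI configuration reads
export PRIME_API_KEY=<your-prime-inference-key>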

Basic Evaluation

# List available environments
prime env list --owner primeintellect

# Install an environment
prime env install primeintellect/gsm8k@latest

# Run a basic evaluation
prime env eval gsm8k -m meta-llama/llama-3.1-70b-instruct -n 10 -r 1

Using the eval Command

Basic Syntax

prime env eval ENVIRONMENT [OPTIONS]

Core Parameters

environment
string
required
Name of the installed verifier environment (e.g., ‘wordle’, ‘gsm8k’, ‘math_python’)
--model, -m
string
Model to use for evaluation. Defaults to meta-llama/llama-3.1-70b-instruct. Examples: anthropic/claude-3-5-sonnet-20241022, openai/gpt-4o
--num-examples, -n
integer
Number of examples to evaluate. Default: 5
--rollouts-per-example, -r
integer
Number of rollouts per example for statistical significance. Default: 3
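
Putting the core parameters together (the model here is one of the documented examples; any model available through your inference endpoint works):

# 20 examples, 3 rollouts each, with an explicitly chosen model
prime env eval gsm8k \
  -m openai/gpt-4o \
  -n 20 -r 3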

Advanced Options

--max-concurrent, -c
integer
Maximum concurrent requests to the inference API. Default: 32
--max-tokens, -t
integer
Maximum tokens to generate per request
--temperature, -T
number
Sampling temperature (0.0-2.0). Higher values = more randomness
--sampling-args, -S
string
JSON string with additional sampling arguments
--env-args, -a
string
Environment-specific arguments as JSON
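
A sketch combining the advanced options (the sampling key top_p is a common OpenAI-compatible argument, and the --env-args payload is hypothetical; valid keys depend on the specific environment):

# Cap generation length, raise temperature, and pass extra JSON arguments
prime env eval math_python \
  -m openai/gpt-4o \
  -n 50 -c 16 -t 2048 -T 0.7 \
  -S '{"top_p": 0.95}' \
  -a '{"difficulty": "hard"}'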

Output and Storage Options

--verbose, -v
boolean
Enable verbose output for detailed logging
--save-dataset, -s
boolean
Save evaluation dataset to disk
--save-to-hf-hub, -H
boolean
Save results to Hugging Face Hub
--hf-hub-dataset-name, -D
string
Specify Hugging Face Hub dataset name
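
A combined example (the Hub dataset name is a placeholder; substitute your own namespace):

# Verbose run that saves the dataset locally and pushes results to the Hub
prime env eval gsm8k \
  -n 100 \
  -v -s -H \
  -D your-username/gsm8k-eval-results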

Model Selection Guide

When choosing models for evaluation, consider these key factors:
  • Task complexity - How hard your task is, and whether reasoning is required/helpful
  • Deployment preference - Whether you want to self-host vs. use an API
  • Throughput requirements - How high-throughput you need for inference
  • Cost considerations - Balance between model performance and evaluation costs

Self-hosting with vLLM

For most users, we recommend starting with Prime Inference models for easier setup and management. Consider self-hosting only for specialized requirements or very large-scale evaluations.
Self-hosting makes sense when you:
  • Are testing with smaller models not easily available via API
  • Require specific model variants or custom fine-tuned models
  • Need maximum cost efficiency for very large evaluations (1M+ examples)
  • Are less sensitive to tok/s on a per-request basis

Configuring Prime CLI for Self-Hosted Models

# Set custom API base URL for your vLLM instance
prime env eval gsm8k \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  --api-key-var CUSTOM_API_KEY \
  -n 1000 -c 64
For evals, you’ll want much more headroom than the bare minimum. For example, you could self-host a 32B dense model on a 48GB GPU in FP8, but total throughput will likely be poor under many parallel requests. We recommend the Qwen3 series via vLLM for most self-hosted testing; Qwen publishes excellent deployment documentation.

High Performance (30B-A3B series)
# Powerful models with small footprints (3B active parameters)
prime env eval gsm8k \
  -m Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 2000 -c 96

# With thinking capabilities
prime env eval math \
  -m Qwen/Qwen3-30B-A3B-Thinking-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 500 -c 32
Balanced Performance (4B series)
# Good balance of performance and resource usage
prime env eval wordle \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 5000 -c 128

# With thinking
prime env eval gsm8k \
  -m Qwen/Qwen3-4B-Thinking-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 1000 -c 64
Ultra High Throughput (Small models for RL training)
# For massive-scale evaluations and RL experiments
prime env eval simple_task \
  -m Qwen/Qwen3-1.7B \
  --api-base-url http://localhost:8000/v1 \
  -n 50000 -c 256

Self-Hosting Best Practices

Hardware Optimization:
  • Use FP8 precision on Ada/Hopper/Blackwell GPUs for improved speed
  • Allocate sufficient GPU memory headroom for parallel requests
  • Consider multi-GPU deployments for larger models
vLLM Configuration:
# Example vLLM startup for high-throughput evaluation
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --tensor-parallel-size 2 \
  --quantization fp8
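
With the server running, point prime env eval at it via --api-base-url http://localhost:8000/v1 as in the examples above. Recent vLLM releases also ship a shorter entrypoint for the same OpenAI-compatible server; assuming a current version, the equivalent invocation is:

# Same server and settings via the vllm serve entrypoint
vllm serve Qwen/Qwen3-4B-Instruct-2507 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --tensor-parallel-size 2 \
  --quantization fp8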
Cost Considerations:
  • 235B models are often more economical via API ($0.13/$0.60 per M tokens) than self-hosting
  • Self-hosting becomes cost-effective for very large evaluations (1M+ examples)
  • Factor in GPU rental costs vs. API pricing
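
As a rough back-of-envelope with illustrative token counts: an evaluation that consumes 100M input and 20M output tokens at the quoted rates costs about 100 × $0.13 + 20 × $0.60 = $25, often less than renting a multi-GPU node capable of serving a 235B model for the same wall-clock time.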

Troubleshooting and Best Practices

Common Issues

Error: “Rate limit exceeded” or 429 responses
Solution: Reduce concurrency or add delays between requests
# Reduce concurrent requests
prime env eval gsm8k -n 100 -c 8

# Use batch processing with delays
for i in {1..10}; do
    prime env eval gsm8k -n 10 -r 1
    sleep 60
done
Error: “Insufficient balance” or payment required
Solution: Add funds to your Prime Intellect account or use cheaper models
# Use cost-effective models for large evaluations
prime env eval gsm8k -m meta-llama/llama-3.1-8b-instruct -n 1000
Error: Environment not found or installation failures
Solution: Verify the environment exists and you have access
# List available environments
prime env list --owner primeintellect

# Check environment details
prime env info primeintellect/gsm8k

# Reinstall environment
prime env uninstall gsm8k
prime env install primeintellect/gsm8k@latest

Performance Tips

  1. Start Small: Begin with --num-examples 5 to test your setup
  2. Monitor Costs: Check token usage and costs before large evaluations
  3. Use Appropriate Models: Match model capability to task complexity
  4. Optimize Concurrency: Balance speed vs. rate limits
  5. Save Results: Always use --save-to-hf-hub for reproducibility

Integration with Other APIs

While Prime Inference is recommended for simplicity, you can also use other API providers:
# DeepSeek API
prime env eval math \
  -m deepseek-reasoner \
  --api-base-url https://api.deepseek.com/v1 \
  --api-key-var DEEPSEEK_API_KEY

# OpenRouter API
prime env eval gsm8k \
  -m meta-llama/llama-3.1-405b-instruct \
  --api-base-url https://openrouter.ai/api/v1 \
  --api-key-var OPENROUTER_API_KEY
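
Note that --api-key-var takes the name of an environment variable rather than the key itself, so export the key first (values below are placeholders):

export DEEPSEEK_API_KEY=<your-deepseek-key>
export OPENROUTER_API_KEY=<your-openrouter-key>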

Next Steps
