The prime eval command evaluates environments against language models served through Prime Inference. This guide covers the evaluation workflow, model selection, and best practices.

Quick Start: Running Your First Evaluation

Prerequisites

  1. Python 3.10–3.13 — Required for the Prime CLI and verifiers
  2. Install Prime CLI — Follow the installation guide
  3. Set up API keys — Configure your Prime Inference API key via prime login
  4. Install an environment — Use prime env install owner/environment

Basic Evaluation

# List available environments
prime env list --owner primeintellect

# Install an environment
prime env install primeintellect/gsm8k@latest

# Run a basic evaluation
prime eval gsm8k -m openai/gpt-4.1-mini -n 10 -r 1
Results are automatically uploaded to the platform after each run. Use --skip-upload to disable this.

Using the prime eval Command

Basic Syntax

prime eval ENVIRONMENT [OPTIONS]
This is a shorthand for prime eval run ENVIRONMENT. Both forms work identically.
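For example, these two invocations are equivalent:
prime eval gsm8k -n 10
prime eval run gsm8k -n 10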

Available Models

To see all available models for evaluation:
prime inference models
Example models:
Model                                 Notes
openai/gpt-4.1-mini                   Fast, cost-effective (default)
openai/gpt-4.1                        Higher quality
anthropic/claude-sonnet-4.5           Strong reasoning
meta-llama/llama-3.3-70b-instruct     Open-weight, balanced
deepseek/deepseek-r1-0528             Advanced reasoning
qwen/qwen3-235b-a22b-instruct-2507    Large MoE model
google/gemini-2.5-flash               Fast multimodal
Model availability and pricing may change. Always run prime inference models to get the current list with pricing.

Core Parameters

environment
string
required
Environment to evaluate. Two formats are supported:
  • Full slug (e.g., primeintellect/gsm8k) — Auto-installs if not already installed
  • Short name (e.g., gsm8k) — Must already be installed via prime env install
--model, -m
string
Model to use for evaluation. Default: openai/gpt-4.1-mini. See prime inference models for all available models.
--num-examples, -n
integer
Number of examples to evaluate. Default: 5
--rollouts-per-example, -r
integer
Number of rollouts per example for statistical significance. Default: 3
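A quick illustration that combines the core parameters (slug and flags as documented above):
# Full slug auto-installs if needed, then evaluates 50 examples with 3 rollouts each
prime eval primeintellect/gsm8k -m openai/gpt-4.1 -n 50 -r 3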

Advanced Options

--max-concurrent, -c
integer
Maximum concurrent requests to the inference API. Default: 32
--max-tokens, -t
integer
Maximum tokens to generate per request. If unset, uses model default.
--temperature, -T
number
Sampling temperature (0.0–2.0). Higher values = more randomness.
--sampling-args, -S
string
JSON string with additional sampling arguments. Example: '{"enable_thinking": false, "max_tokens": 256}'
--env-args, -a
string
Environment-specific arguments as JSON. Example: '{"difficulty": "hard"}'
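Putting the advanced flags together (values are illustrative, and the difficulty argument applies only to environments that define it):
prime eval gsm8k \
  -m openai/gpt-4.1-mini \
  -n 20 -r 2 \
  -c 16 \
  -t 1024 \
  -T 0.7 \
  -S '{"enable_thinking": false}' \
  -a '{"difficulty": "hard"}'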

Output and Storage Options

--verbose, -v
boolean
Enable verbose output for detailed logging.
--save-results, -s
boolean
Save evaluation results to disk. Default: true
--save-every, -f
integer
Save dataset every N rollouts. Useful for checkpointing. Default: 1
--save-to-hf-hub, -H
boolean
Save results to Hugging Face Hub.
--hf-hub-dataset-name, -D
string
Specify Hugging Face Hub dataset name.
--skip-upload
boolean
Skip uploading results to the platform (results are uploaded by default).
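For example, to keep results local and mirror them to the Hugging Face Hub (the dataset name here is a placeholder for your own):
prime eval gsm8k -n 10 -r 1 \
  --skip-upload \
  --save-every 5 \
  --save-to-hf-hub \
  --hf-hub-dataset-name your-username/gsm8k-eval-results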

End-to-End Example

Here’s a complete workflow from installation to viewing results:
# 1. Install an environment
prime env install primeintellect/gsm8k@latest

# 2. Run evaluation with a fast model (5 examples, 1 rollout each)
prime eval gsm8k -m openai/gpt-4.1-mini -n 5 -r 1

# 3. List your evaluation runs
prime eval list

# 4. Get details of a specific evaluation
prime eval get <eval-id>

# 5. View samples from an evaluation
prime eval samples <eval-id>
Example output:
Running evaluation: gsm8k
Model: openai/gpt-4.1-mini
Examples: 5 | Rollouts: 1 | Concurrency: 32

Progress: 100%|████████████████████████████| 5/5 [00:12<00:00, 2.45s/it]

Results:
  Average Score: 0.80
  Total Samples: 5
  Successful: 4
  Failed: 1

Results saved to: outputs/evals/gsm8k--openai-gpt-4.1-mini/...
Uploaded to platform: https://app.primeintellect.ai/...

Managing Evaluation Results

List Evaluations

# List all your evaluations
prime eval list

# Filter by environment
prime eval list --env gsm8k

# Output as JSON
prime eval list --output json

# Paginate results
prime eval list --limit 20 --skip 10

Get Evaluation Details

# Get full details of an evaluation
prime eval get <eval-id>

# Pretty-print output
prime eval get <eval-id> --output pretty

View Samples

# Get samples from an evaluation
prime eval samples <eval-id>

# Paginate samples
prime eval samples <eval-id> --page 2 --limit 50

Push Local Results

If you ran evaluations offline or with --skip-upload, you can push results later:
# Auto-discover and push from outputs/evals/
prime eval push

# Push a specific directory
prime eval push outputs/evals/gsm8k--gpt-4/abc123

# Push with environment context
prime eval push --env gsm8k

Model Selection Guide

When choosing models for evaluation, consider:
  • Task complexity — Harder tasks benefit from larger, reasoning-capable models
  • Cost — Smaller models are significantly cheaper for large-scale evals
  • Throughput — Some models handle high concurrency better than others
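One practical approach is to run the same environment against two candidate models and compare the uploaded runs, for example:
# Compare a fast model against a reasoning-focused one on the same task
prime eval gsm8k -m openai/gpt-4.1-mini -n 50 -r 3
prime eval gsm8k -m anthropic/claude-sonnet-4.5 -n 50 -r 3

# Then compare scores across the resulting runs
prime eval list --env gsm8k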

Self-hosting with vLLM

For most users, we recommend Prime Inference for easier setup. Consider self-hosting only for specialized requirements or very large-scale evaluations.
Self-hosting makes sense when you:
  • Need specific model variants or custom fine-tuned models
  • Require maximum cost efficiency for very large evaluations (1M+ examples)
  • Are testing smaller models not available via API

Configuring for Self-Hosted Models

# Point to your vLLM instance
prime eval gsm8k \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  --api-key-var CUSTOM_API_KEY \
  -n 1000 -c 64
High Performance (MoE with small active parameters):
prime eval gsm8k \
  -m Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 2000 -c 96
Balanced Performance:
prime eval wordle \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 5000 -c 128

vLLM Configuration Example

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --tensor-parallel-size 2 \
  --quantization fp8

Environment-Specific Dependencies

Some environments may require additional dependencies beyond the base installation. Check the environment’s documentation or use prime env info to see requirements:
prime env info primeintellect/math_python
If an environment needs specific packages (e.g., sympy for math verification), install them before running evaluations.
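For example, if prime env info lists sympy as a requirement (package name illustrative), install it into the same Python environment you run evaluations from:
pip install sympy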

Troubleshooting

Common Issues

Error: “Rate limit exceeded” or 429 responses
Solution: Reduce concurrency
prime eval gsm8k -n 100 -c 8
Error: “Insufficient balance” or payment required
Solution: Add funds to your Prime Intellect account or use cheaper models
prime eval gsm8k -m openai/gpt-4.1-mini -n 1000
Error: Environment not found or installation failures
Solution: Verify the environment exists and reinstall
prime env list --owner primeintellect
prime env uninstall gsm8k
prime env install primeintellect/gsm8k@latest
Error: Installation failures or import errors
Solution: Ensure you’re using Python 3.10–3.13
python --version  # Should be 3.10, 3.11, 3.12, or 3.13

Performance Tips

  1. Start Small: Begin with -n 5 to test your setup
  2. Monitor Costs: Check token usage before large evaluations
  3. Use Appropriate Models: Match model capability to task complexity
  4. Optimize Concurrency: Balance speed vs. rate limits (default 32 is usually good)
  5. Save Results: Results auto-save and upload by default
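A typical pattern that follows these tips: run a small smoke test first, then scale up with a concurrency you have verified against your rate limits:
# Smoke test: 5 examples, 1 rollout
prime eval gsm8k -m openai/gpt-4.1-mini -n 5 -r 1

# Full run, with concurrency reduced if you hit 429s
prime eval gsm8k -m openai/gpt-4.1-mini -n 500 -r 3 -c 16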

Integration with Other APIs

You can use other OpenAI-compatible API providers:
# DeepSeek API
prime eval math \
  -m deepseek-reasoner \
  --api-base-url https://api.deepseek.com/v1 \
  --api-key-var DEEPSEEK_API_KEY

# OpenRouter API
prime eval gsm8k \
  -m meta-llama/llama-3.1-405b-instruct \
  --api-base-url https://openrouter.ai/api/v1 \
  --api-key-var OPENROUTER_API_KEY

Next Steps