Model Selection for Environment Testing

The main criteria for deciding which models to test in your environment are:
  • Whether you want to self-host or use an API
  • How hard your task is, and whether reasoning is required or helpful
  • How much inference throughput you need

Self-hosting with vLLM

Self-hosting makes the most sense when you can host the model across one or more GPUs (depending on model size), when you want high total throughput in terms of parallel requests but are less sensitive to per-request tok/s, or when you're testing a small model that isn't easily available via a high-throughput API. This makes self-hosting a great fit for large evals on small models.

For evals, you'll want much more headroom than the bare minimum. For example, you could self-host a 32B dense model on a 48GB GPU in FP8, but you'll likely find total throughput quite slow when serving many parallel requests.

We recommend using the Qwen3 series via vLLM for most self-hosted testing. Qwen has some excellent docs for getting started with self-hosting here. The 30B-A3B series of models also offers a great sweet spot: powerful models with small footprints and very fast inference (due to having only 3B active parameters).
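As a concrete starting point, a single `vllm serve` command exposes an OpenAI-compatible endpoint. The flags below are a sketch and should be tuned to your hardware; the FP8 checkpoint name follows Qwen's Hugging Face naming.

```shell
# Serve Qwen3-30B-A3B (FP8) across 2 GPUs, with an OpenAI-compatible
# API on localhost:8000 by default.
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9
```

Lowering `--max-model-len` frees KV-cache memory for more concurrent requests, which is usually the right trade-off for evals.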

Models to consider

Non-thinking:
  • Qwen3-30B-A3B-Instruct-2507
  • Qwen3-Coder-30B-A3B-Instruct
  • Qwen3-4B-Instruct-2507
Thinking:
  • Qwen3-30B-A3B-Thinking-2507
  • Qwen3-4B-Thinking-2507
Earlier series (mostly useful for smallest models + RL training experiments):
  • Qwen3-8B (thinking)
  • Qwen3-1.7B (thinking)
  • Qwen3-0.6B (thinking)

Notes on Qwen3 series

  • All Qwen3 models offer FP8 versions. If you’re using an Ada/Hopper/Blackwell GPU this is likely what you’ll want to use for improved speed (and negligible performance differences).
  • The 235B model series is quite strong, but is also widely available from API providers at low cost ($0.13/$0.60 per M tokens in/out), and is generally not economical to self-host unless you're doing high-throughput synthetic data generation with a highly optimized hosting configuration.
  • For RL training, you’ll want to use a modified chat template which avoids stripping <think> sections when applied to messages; a collection with pre-modified templates is available here.

OpenAI-compatible API models

DeepSeek API

  • Non-thinking: deepseek-chat (V3.1 non-thinking)
  • Thinking: deepseek-reasoner (V3.1 thinking)
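Since DeepSeek's API is OpenAI-compatible, the same client code covers both variants; only the model name changes. A minimal sketch (the `deepseek_request` helper is illustrative, not part of any library; the base URL is DeepSeek's documented API base):

```python
def deepseek_request(prompt: str, thinking: bool = False) -> dict:
    """Build chat.completions.create kwargs for DeepSeek's two variants."""
    return {
        # deepseek-reasoner is V3.1 thinking, deepseek-chat is non-thinking
        "model": "deepseek-reasoner" if thinking else "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
    }

# Usage (assumes the `openai` package and DEEPSEEK_API_KEY in the env):
#   client = OpenAI(base_url="https://api.deepseek.com",
#                   api_key=os.environ["DEEPSEEK_API_KEY"])
#   client.chat.completions.create(**deepseek_request("hi", thinking=True))
```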

OpenAI API

  • Non-thinking: gpt-4.1 (-mini, -nano)
  • Thinking: o4-mini, gpt-5 (-mini, -nano)
    • Note: thinking summaries are not exposed via the Chat Completions API

Anthropic API

  • Non-thinking: Haiku 3.5, Sonnet 4
  • Thinking: Sonnet 4 (with extra_body={"thinking": { "type": "enabled", "budget_tokens": 2000 }} in sampling_args)

Gemini API

  • Thinking: gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
    • With reasoning_effort = "low" / "medium" / "high" in sampling_args
  • Non-thinking: gemini-2.5-flash, gemini-2.5-flash-lite
    • With reasoning_effort = "none" in sampling_args
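Putting the two Gemini modes together, a small illustrative helper (the `gemini_sampling_args` name is made up for this sketch; `reasoning_effort` is the parameter named above):

```python
# reasoning_effort controls Gemini 2.5 thinking through an
# OpenAI-compatible client; "none" disables thinking where supported.
def gemini_sampling_args(effort: str) -> dict:
    assert effort in {"none", "low", "medium", "high"}
    return {"reasoning_effort": effort}

thinking_args = gemini_sampling_args("medium")    # pro / flash / flash-lite
non_thinking_args = gemini_sampling_args("none")  # flash / flash-lite only
```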

OpenRouter API

Offers most popular open-weights models at low prices from many providers (along with entrypoints for all major proprietary models); pick your favorites.