# Evaluating Environments

> Guide to running evaluations with Prime CLI using Prime Inference or custom model endpoints

The `prime eval` command evaluates environments against language models served through Prime Inference or any other OpenAI-compatible provider. You can run evaluations locally or launch them as hosted evaluations on the platform with `--hosted`. This guide covers both workflows, model selection, and best practices.

## Quick Start: Running Your First Evaluation

### Prerequisites

1. **Python 3.10–3.13** — Required for the Prime CLI and verifiers
2. **Install Prime CLI** — Follow the [installation guide](/cli-reference/introduction)
3. **Set up API keys** — Configure your Prime API key via `prime login`; if you plan to use another provider, also export the provider key you will reference with `--api-key-var`
4. **Install an environment** — Use `prime env install owner/environment`

### Basic Evaluation

```bash theme={null}
# List available environments
prime env list --owner primeintellect

# Install an environment
prime env install primeintellect/gsm8k@latest

# Run a basic evaluation
prime eval gsm8k
```

<Tip>
  Local `prime eval` runs are automatically uploaded to the platform after each run. Use `--skip-upload` to disable this.
</Tip>

### Hosted Evaluation Quick Start

If your environment is already published to the Environments Hub, you can run it remotely on Prime-managed infrastructure:

```bash theme={null}
prime eval run primeintellect/gsm8k --hosted
```

Use `--follow` to stream hosted logs until completion:

```bash theme={null}
prime eval run primeintellect/gsm8k --hosted --follow
```

See [Hosted Evaluations](/tutorials-environments/hosted-evaluations) for the full dashboard and CLI workflow.

## Using the prime eval Command

### Basic Syntax

```bash theme={null}
prime eval ENVIRONMENT [OPTIONS]
```

`prime eval ENVIRONMENT` is shorthand for `prime eval run ENVIRONMENT`; both forms behave identically.

### Available Models

To see all available models for evaluation:

```bash theme={null}
prime inference models
```

**Example models:**

| Model                                | Notes                          |
| ------------------------------------ | ------------------------------ |
| `openai/gpt-4.1-mini`                | Fast, cost-effective (default) |
| `openai/gpt-4.1`                     | Higher quality                 |
| `anthropic/claude-sonnet-4.5`        | Strong reasoning               |
| `meta-llama/llama-3.3-70b-instruct`  | Open-weight, balanced          |
| `deepseek/deepseek-r1-0528`          | Advanced reasoning             |
| `qwen/qwen3-235b-a22b-instruct-2507` | Large MoE model                |
| `google/gemini-2.5-flash`            | Fast multimodal                |

<Note>
  Model availability and pricing may change. Always run `prime inference models` to get the current list with pricing.
</Note>

### Core Parameters

<ParamField path="environment" type="string" required>
  Environment to evaluate. Supported forms:

  * **Full slug** (e.g., `primeintellect/gsm8k`) — recommended for hosted runs and Hub environments
  * **Short name** (e.g., `gsm8k`) — local-first resolution for installed environments
  * **TOML config path** (e.g., `configs/eval/gsm8k.toml`) — for config-driven runs
</ParamField>

<ParamField query="--model, -m" type="string">
  Model to use for evaluation. Default: `openai/gpt-4.1-mini`

  See `prime inference models` for all available models.
</ParamField>

<ParamField query="--num-examples, -n" type="integer">
  Number of examples to evaluate. Default: 5
</ParamField>

<ParamField query="--rollouts-per-example, -r" type="integer">
  Number of rollouts per example. Running several rollouts per example reduces the effect of sampling noise on the reported score. Default: 3
</ParamField>
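
The total number of rollouts in a run is the product of the two settings above, which is worth checking before scaling up. The sketch below is only illustrative: the variable names are not part of the CLI, and the `command -v` guard simply keeps the snippet harmless on machines where the Prime CLI is not installed.

```bash
# Illustrative variables; only -n and -r are CLI flags from this guide.
NUM_EXAMPLES=20
ROLLOUTS_PER_EXAMPLE=3

# Every example is rolled out ROLLOUTS_PER_EXAMPLE times.
TOTAL_ROLLOUTS=$((NUM_EXAMPLES * ROLLOUTS_PER_EXAMPLE))
echo "This run will generate ${TOTAL_ROLLOUTS} rollouts"

# Launch only if the Prime CLI is available.
if command -v prime >/dev/null 2>&1; then
  prime eval gsm8k -n "$NUM_EXAMPLES" -r "$ROLLOUTS_PER_EXAMPLE"
fi
```

With the values shown, the run produces 60 rollouts, so cost and runtime scale with both flags, not just `-n`.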

### Advanced Options

<ParamField query="--max-concurrent, -c" type="integer">
  Maximum concurrent requests to the inference API. Default: 32
</ParamField>

<ParamField query="--max-retries" type="integer">
  Maximum number of automatic retries with exponential backoff when rollouts fail due to transient infrastructure errors (e.g., sandbox timeouts, API failures).
</ParamField>

<ParamField query="--max-tokens, -t" type="integer">
  Maximum tokens to generate per request. If unset, uses model default.
</ParamField>

<ParamField query="--temperature, -T" type="number">
  Sampling temperature (0.0–2.0). Higher values produce more random output.
</ParamField>

<ParamField query="--sampling-args, -S" type="string">
  JSON string with additional sampling arguments.

  Example: `'{"enable_thinking": false, "max_tokens": 256}'`
</ParamField>

<ParamField query="--env-args, -a" type="string">
  Environment-specific arguments as JSON.

  Example: `'{"difficulty": "hard"}'`
</ParamField>
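
Because `--sampling-args` and `--env-args` take JSON strings, shell quoting is the usual source of mistakes. A minimal sketch, assuming `python3` is on your `PATH` for the optional validity check; the `difficulty` key is the illustrative one from above and may not exist in a given environment:

```bash
# Single quotes stop the shell from interpreting the double quotes inside the JSON.
SAMPLING_ARGS='{"enable_thinking": false, "max_tokens": 256}'
# "difficulty" is the illustrative key from above; real keys depend on the environment.
ENV_ARGS='{"difficulty": "hard"}'

# Optional sanity check (assumes python3 is on PATH): report malformed JSON early.
if command -v python3 >/dev/null 2>&1; then
  for payload in "$SAMPLING_ARGS" "$ENV_ARGS"; do
    printf '%s' "$payload" | python3 -m json.tool >/dev/null 2>&1 \
      || echo "Invalid JSON: $payload" >&2
  done
fi

# Pass the variables quoted so each JSON string stays a single argument.
if command -v prime >/dev/null 2>&1; then
  prime eval gsm8k -S "$SAMPLING_ARGS"
  # For an environment that defines arguments, add: -a "$ENV_ARGS"
fi
```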

### Output and Storage Options

<ParamField query="--verbose, -v" type="boolean">
  Enable verbose output for detailed logging.
</ParamField>

<ParamField query="--save-results, -s" type="boolean">
  Save evaluation results to disk. Default: true
</ParamField>

<ParamField query="--save-every, -f" type="integer">
  Save dataset every N rollouts. Useful for checkpointing. Default: 1
</ParamField>

<ParamField query="--save-to-hf-hub, -H" type="boolean">
  Save results to Hugging Face Hub.
</ParamField>

<ParamField query="--hf-hub-dataset-name, -D" type="string">
  Specify Hugging Face Hub dataset name.
</ParamField>

<ParamField query="--skip-upload" type="boolean">
  Skip uploading results for local evaluations (local runs upload by default).
</ParamField>

### Hosted Evaluation Options

<ParamField query="--hosted" type="boolean">
  Run the evaluation on the platform instead of locally. Requires a published environment.
</ParamField>

<ParamField query="--follow" type="boolean">
  Follow hosted evaluation status and stream logs until completion. Only valid with `--hosted`.
</ParamField>

<ParamField query="--poll-interval" type="number">
  Polling interval in seconds for hosted status and log streaming. Only valid with `--hosted`.
</ParamField>

<ParamField query="--timeout-minutes" type="integer">
  Optional timeout in minutes for a hosted evaluation. Default: 1440 (24 hours). Min: 120. Max: 1440.
</ParamField>

<ParamField query="--allow-sandbox-access" type="boolean">
  Allow sandbox read/write access for hosted evaluations.
</ParamField>

<ParamField query="--allow-instances-access" type="boolean">
  Allow hosted evaluations to create and manage instances.
</ParamField>

<ParamField query="--allow-tunnel-access" type="boolean">
  Allow hosted evaluations to create and manage tunnels from inside the
  sandbox. This adds tunnel scopes to the temporary `PRIME_API_KEY`.
</ParamField>

<ParamField query="--custom-secrets" type="string">
  JSON object of additional secrets to inject into a hosted run.
</ParamField>

<ParamField query="--eval-name" type="string">
  Custom display name for a hosted evaluation.
</ParamField>

`--api-base-url` and `--api-key-var` from the model configuration section also work with `--hosted`. When you use a custom endpoint for a hosted run, provide the referenced API key inside the remote sandbox via an environment secret or `--custom-secrets`.
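
Several of the hosted options are typically combined in one launch. The following sketch checks the timeout against the documented bounds before submitting; the helper `valid_timeout`, the timeout value, and the display name `gsm8k-nightly` are hypothetical and only serve the illustration.

```bash
# Bounds taken from the --timeout-minutes description above.
MIN_TIMEOUT=120
MAX_TIMEOUT=1440

# Succeeds only when the value lies within the accepted range.
valid_timeout() {
  [ "$1" -ge "$MIN_TIMEOUT" ] && [ "$1" -le "$MAX_TIMEOUT" ]
}

TIMEOUT_MINUTES=180          # illustrative value
EVAL_NAME="gsm8k-nightly"    # hypothetical display name

if ! valid_timeout "$TIMEOUT_MINUTES"; then
  echo "--timeout-minutes must be between $MIN_TIMEOUT and $MAX_TIMEOUT" >&2
elif command -v prime >/dev/null 2>&1; then
  # Launch remotely, stream logs, and label the run on the platform.
  prime eval run primeintellect/gsm8k --hosted --follow \
    --timeout-minutes "$TIMEOUT_MINUTES" \
    --eval-name "$EVAL_NAME"
fi
```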

## End-to-End Example

Here's a complete workflow from installation to viewing results:

```bash theme={null}
# 1. Install an environment
prime env install primeintellect/gsm8k@latest

# 2. Run evaluation with a fast model (5 examples, 1 rollout each)
prime eval gsm8k -m openai/gpt-4.1-mini -n 5 -r 1

# 3. List your evaluation runs
prime eval list

# 4. Get details of a specific evaluation
prime eval get <eval-id>

# 5. View samples from an evaluation
prime eval samples <eval-id>
```

**Example output:**

```
Running evaluation: gsm8k
Model: openai/gpt-4.1-mini
Examples: 5 | Rollouts: 1 | Concurrency: 32

Progress: 100%|████████████████████████████| 5/5 [00:12<00:00, 2.45s/it]

Results:
  Average Score: 0.80
  Total Samples: 5
  Successful: 4
  Failed: 1

Results saved to: outputs/evals/gsm8k--openai-gpt-4.1-mini/...
Uploaded to platform: https://app.primeintellect.ai/...
```

## Using TOML Configs for Multi-Environment Evals

For reproducible evals, you can pass a TOML config file instead of individual CLI flags:

```bash theme={null}
prime eval run configs/eval/my-benchmark.toml
```

Local TOML configs can define one or more `[[eval]]` entries, which makes them useful for benchmark suites and multi-environment comparisons:

```toml theme={null}
model = "openai/gpt-4.1-mini"
num_examples = 20

[[eval]]
env_id = "primeintellect/gsm8k"
rollouts_per_example = 2

[[eval]]
env_id = "primeintellect/alphabet-sort"
rollouts_per_example = 2
```

When you use a TOML config, per-eval settings in `[[eval]]` override global defaults at the top of the file.
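
To make the override rule concrete, the sketch below writes a config in which one `[[eval]]` entry inherits the global `rollouts_per_example` and the other overrides it. The file name `override-demo.toml` is arbitrary; the keys are the ones used elsewhere in this guide.

```bash
mkdir -p configs/eval

# Write an illustrative config; the quoted 'EOF' prevents shell expansion inside it.
cat > configs/eval/override-demo.toml <<'EOF'
model = "openai/gpt-4.1-mini"
num_examples = 20
rollouts_per_example = 2   # global default

[[eval]]
env_id = "primeintellect/gsm8k"   # inherits rollouts_per_example = 2

[[eval]]
env_id = "primeintellect/alphabet-sort"
rollouts_per_example = 4   # overrides the global default for this entry only
EOF

# Run both entries if the Prime CLI is available.
if command -v prime >/dev/null 2>&1; then
  prime eval run configs/eval/override-demo.toml
fi
```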

For the full config schema, precedence rules, and advanced options like ablations and endpoint registries, see [Verifiers Evaluation](/verifiers/evaluation).

## Hosted Evaluations from the CLI

Hosted eval runs are useful when you want the platform to execute the environment remotely and keep logs on the platform.

```bash theme={null}
prime eval run primeintellect/gsm8k \
  --hosted
```

Hosted runs can also target a custom OpenAI-compatible endpoint:

```bash theme={null}
prime eval run primeintellect/gsm8k \
  --hosted \
  -m openai/gpt-4.1-mini \
  --api-base-url https://api.openai.com/v1 \
  --api-key-var OPENAI_API_KEY \
  --custom-secrets '{"OPENAI_API_KEY":"..."}'
```

When `--api-base-url` is set on a hosted run, Prime still hosts the evaluation sandbox, but model billing comes from your external provider instead of Prime Inference. See [Hosted Evaluations](/tutorials-environments/hosted-evaluations) for the full hosted billing details.

You can also launch a hosted run from a TOML file:

```toml theme={null}
model = "openai/gpt-4.1-mini"
num_examples = 20
rollouts_per_example = 2

[[eval]]
env_id = "primeintellect/gsm8k"
env_args = { split = "test" }
```

```bash theme={null}
prime eval run configs/eval/gsm8k-hosted.toml --hosted
```

Hosted run management commands:

```bash theme={null}
prime eval logs <eval-id> -f
prime eval stop <eval-id>
```

## Managing Evaluation Results

### List Evaluations

```bash theme={null}
# List all your evaluations
prime eval list

# Filter by environment
prime eval list --env gsm8k

# Output as JSON
prime eval list --output json

# Paginate results
prime eval list --num 20 --page 2
```

### Get Evaluation Details

```bash theme={null}
# Get full details of an evaluation
prime eval get <eval-id>

# Pretty-print output
prime eval get <eval-id> --output pretty
```

### View Samples

```bash theme={null}
# Get samples from an evaluation
prime eval samples <eval-id>

# Paginate samples
prime eval samples <eval-id> --page 2 --num 50
```

### Push Local Results

If you ran evaluations offline or with `--skip-upload`, you can push results later:

```bash theme={null}
# Auto-discover and push from outputs/evals/
prime eval push

# Push a specific directory
prime eval push outputs/evals/gsm8k--gpt-4/abc123

# Push with environment context
prime eval push --env gsm8k
```

## Model Selection Guide

When choosing models for evaluation, consider:

* **Task complexity** — Harder tasks benefit from larger, reasoning-capable models
* **Cost** — Smaller models are significantly cheaper for large-scale evals
* **Throughput** — Some models handle high concurrency better than others

### Self-hosting with vLLM

<Note>
  For most users, we recommend Prime Inference for easier setup. Consider self-hosting only for specialized requirements or very large-scale evaluations.
</Note>

Self-hosting makes sense when you:

* Need specific model variants or custom fine-tuned models
* Require maximum cost efficiency for very large evaluations (1M+ examples)
* Are testing smaller models not available via API

#### Configuring for Self-Hosted Models

```bash theme={null}
# Point to your vLLM instance
prime eval gsm8k \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  --api-key-var CUSTOM_API_KEY
```
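
Note that `--api-key-var` names an environment variable rather than carrying the key itself, so the variable has to exist in the shell that launches the run. A minimal sketch, assuming the vLLM server was started without its own `--api-key` option, in which case a placeholder value is typically enough; the value `EMPTY` is an arbitrary placeholder:

```bash
# --api-key-var takes the NAME of an environment variable, not the key itself.
# Assumption: the vLLM server does not enforce a key, so a placeholder is used
# unless CUSTOM_API_KEY is already set in the environment.
export CUSTOM_API_KEY="${CUSTOM_API_KEY:-EMPTY}"

# Small smoke test against the local server if the Prime CLI is available.
if command -v prime >/dev/null 2>&1; then
  prime eval gsm8k \
    -m Qwen/Qwen3-4B-Instruct-2507 \
    --api-base-url http://localhost:8000/v1 \
    --api-key-var CUSTOM_API_KEY \
    -n 5
fi
```

If your server does require a key, export that key under the same variable name instead of the placeholder.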

#### Recommended Self-Hosted Models

**High Performance (MoE with small active parameters):**

```bash theme={null}
prime eval gsm8k \
  -m Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 2000 -c 96
```

**Balanced Performance:**

```bash theme={null}
prime eval wordle \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 5000 -c 128
```

#### vLLM Configuration Example

```bash theme={null}
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --tensor-parallel-size 2 \
  --quantization fp8
```
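
A large model can take a while to load, so it may help to wait until the server answers before starting an evaluation. The helper below is a hypothetical sketch (the name `wait_for_endpoint` and its defaults are not part of the Prime CLI or vLLM); it polls the OpenAI-compatible `/models` route with `curl`, which is assumed to be installed.

```bash
# Poll the OpenAI-compatible /models route until the server answers or attempts run out.
# Arguments: base URL (including /v1), max attempts (default 30), delay in seconds (default 2).
wait_for_endpoint() {
  base_url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s silences progress output, -f turns HTTP errors into a non-zero exit status.
    if curl -sf "$base_url/models" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (not executed here):
#   wait_for_endpoint http://localhost:8000/v1 60 5 && prime eval gsm8k \
#     -m Qwen/Qwen3-4B-Instruct-2507 --api-base-url http://localhost:8000/v1
```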

## Environment-Specific Dependencies

Some environments may require additional dependencies beyond the base installation. Check the environment's documentation or use `prime env info` to see requirements:

```bash theme={null}
prime env info primeintellect/math_python
```

If an environment needs specific packages (e.g., sympy for math verification), install them before running evaluations.

## Troubleshooting

### Common Issues

<AccordionGroup>
  <Accordion title="Rate Limiting">
    **Error:** "Rate limit exceeded" or 429 responses

    **Solution:** Reduce concurrency

    ```bash theme={null}
    prime eval gsm8k -n 100 -c 8
    ```
  </Accordion>

  <Accordion title="Insufficient Balance">
    **Error:** "Insufficient balance" or payment required

    **Solution:** Add funds to your Prime Intellect account or use cheaper models

    ```bash theme={null}
    prime eval gsm8k -m openai/gpt-4.1-mini -n 1000
    ```
  </Accordion>

  <Accordion title="Environment Not Found">
    **Error:** Environment not found or installation failures

    **Solution:** Verify environment exists and reinstall

    ```bash theme={null}
    prime env list --owner primeintellect
    prime env uninstall gsm8k
    prime env install primeintellect/gsm8k@latest
    ```
  </Accordion>

  <Accordion title="Python Version Issues">
    **Error:** Installation failures or import errors

    **Solution:** Ensure you're using Python 3.10–3.13

    ```bash theme={null}
    python --version  # Should be 3.10, 3.11, 3.12, or 3.13
    ```
  </Accordion>
</AccordionGroup>

### Performance Tips

1. **Start Small:** Begin with `-n 5` to test your setup
2. **Monitor Costs:** Check token usage before large evaluations
3. **Use Appropriate Models:** Match model capability to task complexity
4. **Optimize Concurrency:** Balance speed vs. rate limits (default 32 is usually good)
5. **Save Results:** Results auto-save and upload by default

### Integration with Other APIs

You can use other OpenAI-compatible API providers:

```bash theme={null}
# DeepSeek API
prime eval math \
  -m deepseek-reasoner \
  --api-base-url https://api.deepseek.com/v1 \
  --api-key-var DEEPSEEK_API_KEY

# OpenRouter API
prime eval gsm8k \
  -m meta-llama/llama-3.1-405b-instruct \
  --api-base-url https://openrouter.ai/api/v1 \
  --api-key-var OPENROUTER_API_KEY
```

## Next Steps

<CardGroup cols={2}>
  <Card title="Environments Hub" icon="cube" href="/tutorials-environments/environments">
    Learn more about creating and managing environments
  </Card>

  <Card title="Inference API Reference" icon="code" href="/api-reference/inference-models">
    Detailed API documentation for inference endpoints
  </Card>

  <Card title="Prime CLI Reference" icon="terminal" href="/cli-reference/introduction">
    Complete CLI command reference
  </Card>

  <Card title="Creating Environments" icon="plus" href="/tutorials-environments/create">
    Build your own evaluation environments
  </Card>
</CardGroup>
