# Evaluating Environments

> Guide to running evaluations with Prime CLI using Prime Inference or custom model endpoints

The `prime eval` command provides powerful evaluation capabilities for testing environments against various language models through Prime Inference or other OpenAI-compatible providers. You can run evaluations locally or launch them as hosted evaluations on the platform with `--hosted`. This guide covers both workflows, model selection, and best practices.

## Quick Start: Running Your First Evaluation

### Prerequisites

1. **Python 3.10–3.13**: Required for the Prime CLI and verifiers
2. **Install Prime CLI**: Follow the [installation guide](/cli-reference/introduction)
3. **Set up API keys**: Configure your Prime API key via `prime login`; if you plan to use another provider, also export the provider key you will reference with `--api-key-var`
4. **Install an environment**: Use `prime env install owner/environment`

### Basic Evaluation

```bash
# List available environments
prime env list --owner primeintellect

# Install an environment
prime env install primeintellect/gsm8k@latest

# Run a basic evaluation
prime eval gsm8k
```

Local `prime eval` runs are automatically uploaded to the platform after each run. Use `--skip-upload` to disable this.

### Hosted Evaluation Quick Start

If your environment is already published to the Environments Hub, you can run it remotely on Prime-managed infrastructure:

```bash
prime eval run primeintellect/gsm8k --hosted
```

Use `--follow` to stream hosted logs until completion:

```bash
prime eval run primeintellect/gsm8k --hosted --follow
```

See [Hosted Evaluations](/tutorials-environments/hosted-evaluations) for the full dashboard and CLI workflow.
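Before going further, it can help to confirm the Python prerequisite listed above. The following is a minimal sketch, not part of the Prime CLI: `python_version_supported` is a hypothetical helper name, and the script assumes a `python3` executable on your `PATH`.

```bash
# Hypothetical helper (not a Prime CLI command): succeeds when the given
# "major.minor[.patch]" version string falls inside the 3.10-3.13 range.
python_version_supported() {
  pv_major="${1%%.*}"     # text before the first dot
  pv_minor="${1#*.}"      # text after the first dot
  pv_minor="${pv_minor%%.*}"  # drop any patch component
  [ "$pv_major" -eq 3 ] && [ "$pv_minor" -ge 10 ] && [ "$pv_minor" -le 13 ]
}

# Assumption: python3 on PATH is the interpreter the Prime CLI will use.
detected="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])' 2>/dev/null || true)"

if [ -n "$detected" ] && python_version_supported "$detected"; then
  echo "Python $detected is within the supported range"
else
  echo "Python ${detected:-(not found)} is outside the supported 3.10-3.13 range"
fi
```

The check only looks at the major and minor components, so a patch release such as 3.12.4 is treated the same as 3.12.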
## Using the prime eval Command

### Basic Syntax

```bash
prime eval ENVIRONMENT [OPTIONS]
```

This is a shorthand for `prime eval run ENVIRONMENT`. Both forms work identically.

### Available Models

To see all available models for evaluation:

```bash
prime inference models
```

**Example models:**

| Model                                | Notes                          |
| ------------------------------------ | ------------------------------ |
| `openai/gpt-4.1-mini`                | Fast, cost-effective (default) |
| `openai/gpt-4.1`                     | Higher quality                 |
| `anthropic/claude-sonnet-4.5`        | Strong reasoning               |
| `meta-llama/llama-3.3-70b-instruct`  | Open-weight, balanced          |
| `deepseek/deepseek-r1-0528`          | Advanced reasoning             |
| `qwen/qwen3-235b-a22b-instruct-2507` | Large MoE model                |
| `google/gemini-2.5-flash`            | Fast multimodal                |

Model availability and pricing may change. Always run `prime inference models` to get the current list with pricing.

### Core Parameters

* **Environment** (`ENVIRONMENT`): Environment to evaluate. Supported forms:
  * **Full slug** (e.g., `primeintellect/gsm8k`): recommended for hosted runs and Hub environments
  * **Short name** (e.g., `gsm8k`): local-first resolution for installed environments
  * **TOML config path** (e.g., `configs/eval/gsm8k.toml`): for config-driven runs
* **Model** (`-m`): Model to use for evaluation. Default: `openai/gpt-4.1-mini`. See `prime inference models` for all available models.
* **Number of examples** (`-n`): Number of examples to evaluate. Default: 5
* **Rollouts per example** (`-r`): Number of rollouts per example for statistical significance. Default: 3

### Advanced Options

* **Max concurrency** (`-c`): Maximum concurrent requests to the inference API. Default: 32
* **Max retries**: Maximum number of automatic retries with exponential backoff when rollouts fail due to transient infrastructure errors (e.g., sandbox timeouts, API failures).
* **Max tokens**: Maximum tokens to generate per request. If unset, uses model default.
* **Temperature**: Sampling temperature (0.0–2.0). Higher values = more randomness.
* **Sampling arguments**: JSON string with additional sampling arguments. Example: `'{"enable_thinking": false, "max_tokens": 256}'`
* **Environment arguments**: Environment-specific arguments as JSON.
  Example: `'{"difficulty": "hard"}'`

### Output and Storage Options

* **Verbose**: Enable verbose output for detailed logging.
* **Save results**: Save evaluation results to disk. Default: true
* **Save every**: Save dataset every N rollouts. Useful for checkpointing. Default: 1
* **Save to Hugging Face Hub**: Save results to Hugging Face Hub.
* **Hugging Face Hub dataset name**: Specify Hugging Face Hub dataset name.
* **Skip upload** (`--skip-upload`): Skip uploading results for local evaluations (local runs upload by default).

### Hosted Evaluation Options

* **Hosted** (`--hosted`): Run the evaluation on the platform instead of locally. Requires a published environment.
* **Follow** (`--follow`): Follow hosted evaluation status and stream logs until completion. Only valid with `--hosted`.
* **Poll interval**: Polling interval in seconds for hosted status and log streaming. Only valid with `--hosted`.
* **Timeout**: Optional timeout in minutes for a hosted evaluation. Default: 1440 (24 hours). Min: 120. Max: 1440.
* **Sandbox access**: Allow sandbox read/write access for hosted evaluations.
* **Instance management**: Allow hosted evaluations to create and manage instances.
* **Tunnel management**: Allow hosted evaluations to create and manage tunnels from inside the sandbox. This adds tunnel scopes to the temporary `PRIME_API_KEY`.
* **Custom secrets** (`--custom-secrets`): JSON object of additional secrets to inject into a hosted run.
* **Display name**: Custom display name for a hosted evaluation.

`--api-base-url` and `--api-key-var` from the model configuration section also work with `--hosted`. When you use a custom endpoint for a hosted run, provide the referenced API key inside the remote sandbox via an environment secret or `--custom-secrets`.

## End-to-End Example

Here's a complete workflow from installation to viewing results:

```bash
# 1. Install an environment
prime env install primeintellect/gsm8k@latest

# 2. Run evaluation with a fast model (5 examples, 1 rollout each)
prime eval gsm8k -m openai/gpt-4.1-mini -n 5 -r 1

# 3. List your evaluation runs
prime eval list

# 4. Get details of a specific evaluation
prime eval get

# 5. View samples from an evaluation
prime eval samples
```

**Example output:**

```
Running evaluation: gsm8k
Model: openai/gpt-4.1-mini
Examples: 5 | Rollouts: 1 | Concurrency: 32

Progress: 100%|████████████████████████████| 5/5 [00:12<00:00, 2.45s/it]

Results:
  Average Score: 0.80
  Total Samples: 5
  Successful: 4
  Failed: 1

Results saved to: outputs/evals/gsm8k--openai-gpt-4.1-mini/...
Uploaded to platform: https://app.primeintellect.ai/...
```

## Using TOML Configs for Multi-Environment Evals

For reproducible evals, you can pass a TOML config file instead of individual CLI flags:

```bash
prime eval run configs/eval/my-benchmark.toml
```

Local TOML configs can define one or more `[[eval]]` entries, which makes them useful for benchmark suites and multi-environment comparisons:

```toml
model = "openai/gpt-4.1-mini"
num_examples = 20

[[eval]]
env_id = "primeintellect/gsm8k"
rollouts_per_example = 2

[[eval]]
env_id = "primeintellect/alphabet-sort"
rollouts_per_example = 2
```

When you use a TOML config, per-eval settings in `[[eval]]` override global defaults at the top of the file. For the full config schema, precedence rules, and advanced options like ablations and endpoint registries, see [Verifiers Evaluation](/verifiers/evaluation).

## Hosted Evaluations from the CLI

Hosted eval runs are useful when you want the platform to execute the environment remotely and keep logs on the platform.

```bash
prime eval run primeintellect/gsm8k \
  --hosted
```

Hosted runs can also target a custom OpenAI-compatible endpoint:

```bash
prime eval run primeintellect/gsm8k \
  --hosted \
  -m openai/gpt-4.1-mini \
  --api-base-url https://api.openai.com/v1 \
  --api-key-var OPENAI_API_KEY \
  --custom-secrets '{"OPENAI_API_KEY":"..."}'
```

When `--api-base-url` is set on a hosted run, Prime still hosts the evaluation sandbox, but model billing comes from your external provider instead of Prime Inference.
See [Hosted Evaluations](/tutorials-environments/hosted-evaluations) for the full hosted billing details.

You can also launch a hosted run from a TOML file:

```toml
model = "openai/gpt-4.1-mini"
num_examples = 20
rollouts_per_example = 2

[[eval]]
env_id = "primeintellect/gsm8k"
env_args = { split = "test" }
```

```bash
prime eval run configs/eval/gsm8k-hosted.toml --hosted
```

Hosted run management commands:

```bash
prime eval logs -f
prime eval stop
```

## Managing Evaluation Results

### List Evaluations

```bash
# List all your evaluations
prime eval list

# Filter by environment
prime eval list --env gsm8k

# Output as JSON
prime eval list --output json

# Paginate results
prime eval list --num 20 --page 2
```

### Get Evaluation Details

```bash
# Get full details of an evaluation
prime eval get

# Pretty-print output
prime eval get --output pretty
```

### View Samples

```bash
# Get samples from an evaluation
prime eval samples

# Paginate samples
prime eval samples --page 2 --num 50
```

### Push Local Results

If you ran evaluations offline or with `--skip-upload`, you can push results later:

```bash
# Auto-discover and push from outputs/evals/
prime eval push

# Push a specific directory
prime eval push outputs/evals/gsm8k--gpt-4/abc123

# Push with environment context
prime eval push --env gsm8k
```

## Model Selection Guide

When choosing models for evaluation, consider:

* **Task complexity**: Harder tasks benefit from larger, reasoning-capable models
* **Cost**: Smaller models are significantly cheaper for large-scale evals
* **Throughput**: Some models handle high concurrency better than others

### Self-hosting with vLLM

For most users, we recommend Prime Inference for easier setup. Consider self-hosting only for specialized requirements or very large-scale evaluations.
Self-hosting makes sense when you:

* Need specific model variants or custom fine-tuned models
* Require maximum cost efficiency for very large evaluations (1M+ examples)
* Are testing smaller models not available via API

#### Configuring for Self-Hosted Models

```bash
# Point to your vLLM instance
prime eval gsm8k \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  --api-key-var CUSTOM_API_KEY
```

#### Recommended Self-Hosted Models

**High Performance (MoE with small active parameters):**

```bash
prime eval gsm8k \
  -m Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 2000 -c 96
```

**Balanced Performance:**

```bash
prime eval wordle \
  -m Qwen/Qwen3-4B-Instruct-2507 \
  --api-base-url http://localhost:8000/v1 \
  -n 5000 -c 128
```

#### vLLM Configuration Example

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --tensor-parallel-size 2 \
  --quantization fp8
```

## Environment-Specific Dependencies

Some environments may require additional dependencies beyond the base installation. Check the environment's documentation or use `prime env info` to see requirements:

```bash
prime env info primeintellect/math_python
```

If an environment needs specific packages (e.g., sympy for math verification), install them before running evaluations.
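One way to catch a missing dependency before launching a long evaluation is to test the import first. This is a sketch under assumptions: `modules_available` is a hypothetical helper and not a Prime CLI command, it assumes `python3` on your `PATH` is the interpreter the evaluation will use, and `sympy` is only the example package mentioned above.

```bash
# Hypothetical helper (not part of the Prime CLI): succeeds only if every
# named Python module can be imported by the python3 found on PATH.
modules_available() {
  for mod in "$@"; do
    python3 -c "import $mod" 2>/dev/null || return 1
  done
}

# Example: check for sympy before starting a math-oriented evaluation.
if modules_available sympy; then
  echo "sympy is available"
else
  echo "sympy is missing; install it before running the evaluation"
fi
```

The helper stops at the first module that fails to import, so pass the modules you care about most first.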
## Troubleshooting

### Common Issues

**Error:** "Rate limit exceeded" or 429 responses

**Solution:** Reduce concurrency

```bash
prime eval gsm8k -n 100 -c 8
```

**Error:** "Insufficient balance" or payment required

**Solution:** Add funds to your Prime Intellect account or use cheaper models

```bash
prime eval gsm8k -m openai/gpt-4.1-mini -n 1000
```

**Error:** Environment not found or installation failures

**Solution:** Verify environment exists and reinstall

```bash
prime env list --owner primeintellect
prime env uninstall gsm8k
prime env install primeintellect/gsm8k@latest
```

**Error:** Installation failures or import errors

**Solution:** Ensure you're using Python 3.10–3.13

```bash
python --version  # Should be 3.10, 3.11, 3.12, or 3.13
```

### Performance Tips

1. **Start Small:** Begin with `-n 5` to test your setup
2. **Monitor Costs:** Check token usage before large evaluations
3. **Use Appropriate Models:** Match model capability to task complexity
4. **Optimize Concurrency:** Balance speed vs. rate limits (default 32 is usually good)
5. **Save Results:** Results auto-save and upload by default

### Integration with Other APIs

You can use other OpenAI-compatible API providers:

```bash
# DeepSeek API
prime eval math \
  -m deepseek-reasoner \
  --api-base-url https://api.deepseek.com/v1 \
  --api-key-var DEEPSEEK_API_KEY

# OpenRouter API
prime eval gsm8k \
  -m meta-llama/llama-3.1-405b-instruct \
  --api-base-url https://openrouter.ai/api/v1 \
  --api-key-var OPENROUTER_API_KEY
```

## Next Steps

* Learn more about creating and managing environments
* Detailed API documentation for inference endpoints
* Complete CLI command reference
* Build your own evaluation environments