Model Selection for Environment Testing

The main criteria for deciding which models to test in your environment are:
  • Whether you want to self-host or use an API
  • How hard your task is, and whether reasoning is required or helpful
  • How much inference throughput you need

Self-hosting with vLLM

Self-hosting makes the most sense when you can host the model across one or more GPUs (depending on model size), when you want high total throughput in terms of parallel requests but are less sensitive to per-request tok/s, or when you're testing a small model that isn't easily available via a high-throughput API. This makes self-hosting a great fit for large evals on small models.

For evals, you'll want much more headroom than the bare minimum. For example, you could self-host a 32B dense model on a 48GB GPU in FP8, but you'll likely find total throughput quite slow when serving many parallel requests.

We recommend using the Qwen3 series via vLLM for most self-hosted testing. Qwen has some excellent docs for getting started with self-hosting here. The 30B-A3B series of models also offers a great sweet spot: powerful models with small footprints and very fast inference (due to having only 3B active parameters).
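As a concrete starting point, a single `vllm serve` command exposes an OpenAI-compatible endpoint. The flags below are a sketch and should be tuned to your hardware; the FP8 checkpoint name follows Qwen's Hugging Face naming.

```shell
# Serve Qwen3-30B-A3B (FP8) across 2 GPUs, with an OpenAI-compatible
# API on localhost:8000 by default.
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9
```

Lowering `--max-model-len` frees KV-cache memory for more concurrent requests, which is usually the right trade-off for evals.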

Models to consider

Non-thinking:
  • Qwen3-30B-A3B-Instruct-2507
  • Qwen3-Coder-30B-A3B-Instruct
  • Qwen3-4B-Instruct-2507
Thinking:
  • Qwen3-30B-A3B-Thinking-2507
  • Qwen3-4B-Thinking-2507
Earlier series (mostly useful for smallest models + RL training experiments):
  • Qwen3-8B (thinking)
  • Qwen3-1.7B (thinking)
  • Qwen3-0.6B (thinking)

Notes on Qwen3 series

  • All Qwen3 models offer FP8 versions. If you’re using an Ada/Hopper/Blackwell GPU this is likely what you’ll want to use for improved speed (and negligible performance differences).
  • The 235B model series is quite strong, but is also widely available from API providers at low cost ($0.13/$0.60 per M tokens in/out), and is generally not economical to self-host unless you're doing high-throughput synthetic data generation with a highly optimized hosting configuration.
  • For RL training, you’ll want to use a modified chat template which avoids stripping <think> sections when applied to messages; a collection with pre-modified templates is available here.

OpenAI-compatible API models

DeepSeek API

  • Non-thinking: deepseek-chat (V3.1 non-thinking)
  • Thinking: deepseek-reasoner (V3.1 thinking)
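Since DeepSeek's API is OpenAI-compatible, the same client code covers both variants; only the model name changes. A minimal sketch (the `deepseek_request` helper is illustrative, not part of any library; the base URL is DeepSeek's documented API base):

```python
def deepseek_request(prompt: str, thinking: bool = False) -> dict:
    """Build chat.completions.create kwargs for DeepSeek's two variants."""
    return {
        # deepseek-reasoner is V3.1 thinking, deepseek-chat is non-thinking
        "model": "deepseek-reasoner" if thinking else "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
    }

# Usage (assumes the `openai` package and DEEPSEEK_API_KEY in the env):
#   client = OpenAI(base_url="https://api.deepseek.com",
#                   api_key=os.environ["DEEPSEEK_API_KEY"])
#   client.chat.completions.create(**deepseek_request("hi", thinking=True))
```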

OpenAI API

  • Non-thinking: gpt-4.1 (-mini, -nano)
  • Thinking: o4-mini, gpt-5 (-mini, -nano)
    • Note: thinking summaries are not exposed via the Chat Completions API

Anthropic API

  • Non-thinking: Haiku 3.5, Sonnet 4
  • Thinking: Sonnet 4 (with extra_body={"thinking": { "type": "enabled", "budget_tokens": 2000 }} in sampling_args)

Gemini API

  • Thinking: gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
    • With reasoning_effort = "low" / "medium" / "high" in sampling_args
  • Non-thinking: gemini-2.5-flash, gemini-2.5-flash-lite
    • With reasoning_effort = "none" in sampling_args
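Putting the two Gemini modes together, a small illustrative helper (the `gemini_sampling_args` name is made up for this sketch; `reasoning_effort` is the parameter named above):

```python
# reasoning_effort controls Gemini 2.5 thinking through an
# OpenAI-compatible client; "none" disables thinking where supported.
def gemini_sampling_args(effort: str) -> dict:
    assert effort in {"none", "low", "medium", "high"}
    return {"reasoning_effort": effort}

thinking_args = gemini_sampling_args("medium")    # pro / flash / flash-lite
non_thinking_args = gemini_sampling_args("none")  # flash / flash-lite only
```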

OpenRouter API

Offers most popular open-weights models at low prices from many providers (along with entrypoints for all major proprietary models); pick your favorites.