Model Selection for Environment Testing
The main criteria for deciding which models to test in your environment are:
- Whether you want to self-host or use an API
- How hard your task is, and whether reasoning is required or helpful
- How much inference throughput you need
Self-hosting with vLLM
Self-hosting makes the most sense when you can host a model across one or more GPUs (depending on model size) and want high total throughput in terms of parallel requests, but are less sensitive to tok/s on a per-request basis; or when you're testing a small model which isn't easily available via a high-throughput API. This makes self-hosting perfect for large evals on small models. For evals, you'll want much more headroom than the bare minimum: you could self-host a 32B dense model on a 48GB GPU in FP8, for example, but will likely find total throughput across many parallel requests to be quite slow. We recommend the Qwen3 series via vLLM for most self-hosted testing. Qwen has some excellent docs for getting started with self-hosting here. The 30B-A3B series of models also offers a great sweet spot: powerful models with small footprints and very fast inference, thanks to having only 3B active parameters.
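As a concrete starting point, here is a minimal sketch of querying a self-hosted Qwen3 model through vLLM's OpenAI-compatible server. It assumes you've already started the server (e.g. with `vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507`) on the default port 8000; adjust the model name and port to your setup:

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; no real key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",  # must match the served model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```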
Models to consider
Non-thinking:
- Qwen3-30B-A3B-Instruct-2507
- Qwen3-Coder-30B-A3B-Instruct
- Qwen3-4B-Instruct-2507
Thinking:
- Qwen3-30B-A3B-Thinking-2507
- Qwen3-4B-Thinking-2507
- Qwen3-8B (thinking)
- Qwen3-1.7B (thinking)
- Qwen3-0.6B (thinking)
Notes on Qwen3 series
- All Qwen3 models offer FP8 versions. If you’re using an Ada/Hopper/Blackwell GPU this is likely what you’ll want to use for improved speed (and negligible performance differences).
- The 235B model series is quite strong, but is also widely available from API providers at low costs ($0.13/$0.6 per M toks in/out), and is generally not economical for self-hosting unless you’re doing high-throughput synthetic data generation with a highly optimized hosting configuration.
- For RL training, you'll want to use a modified chat template which avoids stripping `<think>` sections when applied to messages; a collection of pre-modified templates is available here (see the sketch after this list for wiring one in).
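For illustration, here is a minimal sketch of overriding a tokenizer's default chat template with a pre-modified one. The `qwen3_nonstrip.jinja` filename is hypothetical; substitute whichever template from the collection matches your model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Replace the default template, which strips <think> sections from prior
# assistant turns, with a modified one that preserves them.
with open("qwen3_nonstrip.jinja") as f:  # hypothetical local path
    tokenizer.chat_template = f.read()

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>Simple arithmetic.</think>\n4"},
]
# The <think> section should survive in the rendered prompt.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```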
OpenAI-compatible API models
DeepSeek API
- Non-thinking: deepseek-chat (V3.1 non-thinking)
- Thinking: deepseek-reasoner (V3.1 thinking)
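The DeepSeek API is OpenAI-compatible, so a minimal sketch looks like the following (assumes a `DEEPSEEK_API_KEY` environment variable; `https://api.deepseek.com` is DeepSeek's documented base URL):

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

# Use "deepseek-reasoner" instead for the thinking variant.
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Briefly explain RLHF."}],
)
print(response.choices[0].message.content)
```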
OpenAI API
- Non-thinking: gpt-4.1 (-mini, -nano)
- Thinking: o4-mini, gpt-5 (-mini, -nano)
- Note: thinking summaries are not exposed via the Chat Completions API
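As a sketch, calling an OpenAI reasoning model through Chat Completions works like any other model; per the note above, you get the final answer but no thinking summary (assumes `OPENAI_API_KEY` is set):

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "How many primes are below 30?"}],
)
# Only the final message is returned; the reasoning itself stays hidden.
print(response.choices[0].message.content)
```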
Anthropic API
- Non-thinking: Haiku 3.5, Sonnet 4
- Thinking: Sonnet 4 (with `extra_body={"thinking": {"type": "enabled", "budget_tokens": 2000}}` in `sampling_args`)
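For example, here is a minimal sketch of enabling thinking through Anthropic's OpenAI-compatibility endpoint, passing the same `extra_body` shown above. The base URL and the `claude-sonnet-4-20250514` model ID reflect Anthropic's compatibility layer as of this writing; confirm both against current Anthropic docs:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.anthropic.com/v1/",  # Anthropic's OpenAI-compat layer
    api_key=os.environ["ANTHROPIC_API_KEY"],
)

response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",  # check current model IDs in Anthropic docs
    max_tokens=4096,  # Anthropic requires this; it must exceed the thinking budget
    messages=[{"role": "user", "content": "What is the derivative of x^3?"}],
    extra_body={"thinking": {"type": "enabled", "budget_tokens": 2000}},
)
print(response.choices[0].message.content)
```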
Gemini API
- Thinking: gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
  - With `reasoning_effort = "low" / "medium" / "high"` in `sampling_args`
- Non-thinking: gemini-2.5-flash, gemini-2.5-flash-lite
  - With `reasoning_effort = "none"` in `sampling_args`
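A minimal sketch via Gemini's OpenAI-compatibility endpoint, using `reasoning_effort` to switch thinking off. The base URL is Google's documented OpenAI-compatibility endpoint; this assumes a `GEMINI_API_KEY` environment variable and a recent `openai` package that supports the `reasoning_effort` parameter:

```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key=os.environ["GEMINI_API_KEY"],
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    reasoning_effort="none",  # "low" / "medium" / "high" to enable thinking
    messages=[{"role": "user", "content": "Name three prime numbers."}],
)
print(response.choices[0].message.content)
```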