This guide is for more technically proficient users who are comfortable working directly with command line tools and have some experience training or fine-tuning models.
If you want to join our private beta for early access to our RFT platform, with deep integration with our Environments Hub and hands-on support from our research team, send us your details via this form.

Overview

Reinforcement Fine-Tuning (RFT) optimizes a language model with feedback-driven rewards from RL environments. On Prime Intellect, you can run scalable RFT jobs using our compute, our prime-rl trainer, and hundreds of pre-built environments from our Environments Hub, all of which are built with verifiers.
  • prime-rl (GitHub) - Trainer for large-scale FSDP training
  • verifiers (GitHub) - Library of modular components for building RL environments
  • Environments Hub - Community hub hosting hundreds of RL environments that can be used, evaluated and forked for your RL run
This guide walks you through end-to-end:
  1. Deploy compute using our RFT image with pre-installed dependencies
  2. Validate machine setup
  3. Configure RFT run by setting up RL environment(s) and config files
  4. Launch RFT run

1) Deploy compute

Go to the Deploy GPU page on Prime Intellect, and select the “Prime RL (RFT)” image. This image is preloaded with CUDA, Python/UV, prime-rl, verifiers, and the Prime CLI to minimize setup.
Deploy Prime RL Image
prime-rl requires at least 2 GPUs for training.
Select your desired GPU configuration (at least 2 GPUs) and click deploy. Once the instance is running, use the SSH command shown on the instance page to connect. Once connected, you will find prime-rl pre-installed in /workspace/prime-rl:
cd /workspace/prime-rl
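End to end, connecting and landing in the repo looks roughly like this; the user, host, and port below are hypothetical placeholders, so copy the exact SSH command from your instance page:
ssh -p <port> root@<instance-ip>  # hypothetical placeholders; use the command shown for your instance
cd /workspace/prime-rl            # the pre-installed prime-rl checkout
ls                                # should list configs/ and pyproject.toml, among other files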

2) Validate machine setup

  1. Check that the environment uses Python 3.12
uv run python -V
  2. Check that flash-attn is installed
uv run python -c "import flash_attn"
  3. Check that you can run the SFT trainer in debug mode (this requires 1 GPU)
uv run sft @ configs/debug/sft.toml
  4. Check that you can run the RL trainer in debug mode (this requires 1 GPU)
uv run trainer @ configs/debug/train.toml
  5. Check that you can run the orchestrator against an inference server (this requires 1 GPU). Start the inference server first, then run the orchestrator against it from a separate terminal (see the note after this list)
uv run inference @ configs/debug/infer.toml
uv run orchestrator @ configs/debug/orch.toml
  6. Check that you can run a simple SFT warmup (this requires 1 GPU)
uv run sft @ configs/reverse_text/sft.toml
  7. Check that you can run a toy RL run (this requires 2 GPUs)
uv run rl \
  --trainer @ configs/reverse_text/train.toml \
  --orchestrator @ configs/reverse_text/orch.toml \
  --inference @ configs/reverse_text/infer.toml
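For step 5, the inference server must be up before the orchestrator starts. As a minimal sketch, you can run both from a single shell by backgrounding the server (in practice, separate terminals or the tmux script described later are more convenient); the sleep duration is an arbitrary placeholder to give the server time to load the model:
uv run inference @ configs/debug/infer.toml &   # start the inference server in the background
sleep 60                                        # placeholder wait for the server to come up; adjust as needed
uv run orchestrator @ configs/debug/orch.toml   # then run the orchestrator against it
kill %1                                         # stop the background inference server when done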

3) Configure RFT run

We support RL on any environment built with verifiers (GitHub). You can find hundreds of community-contributed environments on the Environments Hub.
Not all environments on the Hub are vetted or tested by Prime Intellect. For a set of verified, training-ready RL environments, visit the prime-environments repository.
To install an environment temporarily within prime-rl (without touching pyproject.toml), do:
# optional: for pushing environments to the hub or accessing private environments
uv run prime login

# install the environment
uv run prime env install custom-environment
To persist your environment installation in prime-rl's top-level pyproject.toml, do:
uv add custom-environment --index https://hub.primeintellect.ai/username/custom-environment
For quick API-based evaluation post-installation, do:
uv run vf-eval custom-environment # -h for config options; defaults to gpt-4.1-mini, 5 prompts, 3 rollouts each
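If you want to override the defaults, vf-eval exposes CLI flags for the model and the number of prompts and rollouts; the flag names below are assumptions based on common verifiers usage, so confirm them with vf-eval -h:
uv run vf-eval custom-environment -m gpt-4.1-mini -n 10 -r 3  # assumed flags: -m model, -n prompts, -r rollouts per prompt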
For training:
  1. Create trainer/inference/orchestrator config files following the aforementioned examples
  2. Then set id = "custom-environment" in the [environment] section of your orchestrator config (along with any desired Environment-level args in [environment.args]).
There is an example of this in the configs/wordle directory, and a minimal sketch follows below.
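As a minimal sketch, assuming you have already created configs/custom-environment/orch.toml (for example by copying one of the example configs) and that your environment accepts a hypothetical difficulty argument, the environment section can be appended like this:
# append the [environment] section to your orchestrator config (hypothetical path and args)
cat >> configs/custom-environment/orch.toml <<'EOF'
[environment]
id = "custom-environment"

[environment.args]
difficulty = "easy"  # hypothetical Environment-level arg; use your environment's actual args
EOF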

Additional setup

  1. If you want to log your runs to W&B (wandb), log in
uv run wandb login
# Or set `export WANDB_API_KEY=...`
  2. If you require gated/private models or datasets from HuggingFace, log in
uv run huggingface-cli login
# Or set `export HF_TOKEN=...`
  3. We provide a convenient tmux layout script to start a run and view the logs for the trainer, orchestrator, and inference. To start the session, simply run
bash scripts/tmux.sh

4) Launch RFT run

The prime-rl stack has three cooperating components: inference, orchestrator, and trainer. For convenience, you can launch all via a single entrypoint. Once you have your config files set up, you can launch the run with:
uv run rl \
  --trainer @ configs/<your-environment>/train.toml \
  --orchestrator @ configs/<your-environment>/orch.toml \
  --inference @ configs/<your-environment>/infer.toml \
  --wandb.project <your-project> \
  --wandb.name <your-run-name>
This will launch the run and log to W&B. You can find more details and training examples in the prime-rl README (GitHub), including:
  • Multi-Node training
  • Detailed W&B logging
  • Detailed checkpointing
  • Benchmarking

Checkpointing and Resuming

Enable periodic checkpoints:
uv run trainer @ /workspace/configs/<your-environment>/train.toml --ckpt.interval 10
Resume from step 10:
uv run rl \
  --trainer @ /workspace/configs/<your-environment>/train.toml \
  --orchestrator @ /workspace/configs/<your-environment>/orch.toml \
  --inference @ /workspace/configs/<your-environment>/infer.toml \
  --ckpt.resume-step 10 \
  --trainer.monitor.wandb.id <your-wandb-run-id> \
  --orchestrator.monitor.wandb.id <your-wandb-run-id>
See the prime-rl README (GitHub) for details on checkpoint layout and async semantics.

References

  • prime-rl: Trainer for large-scale FSDP training — GitHub
  • verifiers: Library of modular components for building RL environments — GitHub
  • Environments Hub: Community hub hosting hundreds of RL environments that can be used, evaluated and forked for your RL run