SFT
Single-GPU
For training on a single GPU, no communication orchestration is required and you can choose whether to start your trainer using our trainer entrypoint or using torchrun.
To start training with our sft entrypoint:
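A minimal sketch, assuming the sft entrypoint is exposed as a console script run via uv (the invocation and any configuration flags depend on your setup):

```bash
# Hypothetical invocation: start single-GPU SFT via the sft entrypoint,
# passing your experiment configuration as supported by the entrypoint.
uv run sft
```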
Alternatively, start the trainer directly with torchrun:
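For example (src/sft/train.py is a placeholder for the actual trainer script):

```bash
# Launch the SFT trainer directly with torchrun on a single GPU.
torchrun --nproc-per-node 1 src/sft/train.py
```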
Multi-GPU
For training on multiple GPUs, use torchrun with the --nproc-per-node flag.
The --local-rank-filter flag is used to only show logs from the master rank, as detailed in the logs section.
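As an illustration, using the same placeholder trainer script as above:

```bash
# Launch the SFT trainer on all 8 GPUs of a single node.
# Add the --local-rank-filter flag described above to only log from the master rank.
torchrun --nproc-per-node 8 src/sft/train.py
```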
Multi-Node
For training on multiple nodes, use torchrun with the --nnodes, --node-rank, and --rdzv-endpoint flags.
First, decide which node will be your head node and find a private IP address for it that the other nodes can reach. If your nodes are not colocated, you will likely need to set up a VPN (e.g. Tailscale) for the nodes to reach each other.
(Skip this step if the default network interface is sufficient.) Make sure to set the network interface for GLOO and NCCL to one that allows all nodes to reach each other.
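For example, if eno1 is the interface over which the nodes can reach each other (adjust the interface name to your setup):

```bash
# Route GLOO and NCCL traffic over the shared private interface.
export GLOO_SOCKET_IFNAME=eno1
export NCCL_SOCKET_IFNAME=eno1
```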
MASTER_ADDR is the private IP address of the head node and MASTER_PORT is a free port on the head node, typically port 29500 for torchrun.
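Putting this together, a two-node launch could look like the following sketch (8 GPUs per node assumed; src/sft/train.py and the IP address are placeholders):

```bash
# Set these on all nodes; 10.0.0.1 stands for the head node's private IP.
export MASTER_ADDR=10.0.0.1
export MASTER_PORT=29500

# On the head node (node rank 0):
torchrun --nproc-per-node 8 --nnodes 2 --node-rank 0 \
    --rdzv-endpoint $MASTER_ADDR:$MASTER_PORT src/sft/train.py

# On the second node (node rank 1):
torchrun --nproc-per-node 8 --nnodes 2 --node-rank 1 \
    --rdzv-endpoint $MASTER_ADDR:$MASTER_PORT src/sft/train.py
```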
SLURM
TBD.
Inference
We rely on vLLM's multi-node deployment primitives and load balancing for multi-node deployments. Currently, vLLM supports multi-node data parallel deployment (docs).
First, decide which node will be your head node and find a private IP address for it that the other nodes can reach. If your nodes are not colocated, you will likely need to set up a VPN (e.g. Tailscale) for the nodes to reach each other.
(Skip this step if the default network interface is sufficient.) Make sure to set the network interface for GLOO and NCCL to one that allows all nodes to reach each other.
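As a sketch based on vLLM's data parallel deployment documentation (exact flags may differ between vLLM versions, and your inference entrypoint may wrap them), a two-node deployment with the API server on the head node could look like:

```bash
# <model> and 10.0.0.1 are placeholders for the model name/path and the
# head node's private IP address.

# On the head node: runs the API server plus one local engine.
vllm serve <model> \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-address 10.0.0.1 \
    --data-parallel-rpc-port 13345

# On the second node: headless, hosts the remaining engine.
vllm serve <model> \
    --headless \
    --data-parallel-size 2 \
    --data-parallel-size-local 1 \
    --data-parallel-start-rank 1 \
    --data-parallel-address 10.0.0.1 \
    --data-parallel-rpc-port 13345
```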
RL
Single-GPU Training
If you only have access to a single GPU, you may still be able to run small RL experiments. To do so, configure your inference server to use only a fraction of the available GPU memory to leave space for the trainer. For example, to run an RL training on a single GPU while using 50% of the available memory for the inference server, set --gpu-memory-utilization to 0.5 (see the sketch below). Tune the --gpu-memory-utilization value such that you have enough GPU memory left for the RL trainer.
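A hypothetical sketch of this; both the entrypoint invocation and the way the flag is forwarded to the inference server are assumptions about your setup:

```bash
# Hypothetical: start all RL components on a single GPU, reserving ~50%
# of the GPU memory for the inference server so the trainer fits in the rest.
# The flag spelling is an assumption; check how your configuration forwards
# options to the inference server.
uv run rl --inference.gpu-memory-utilization 0.5
```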
You can also set this up by starting each submodule manually.
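For instance, a manual single-GPU setup might look like the following sketch. The inference, orchestrator, and trainer entrypoint names are assumptions; --gpu-memory-utilization is the vLLM flag controlling the fraction of GPU memory given to the inference server.

```bash
# Terminal 1: start the inference server with ~50% of the GPU memory,
# leaving the rest for the trainer on the same GPU.
uv run inference --gpu-memory-utilization 0.5

# Terminal 2: start the orchestrator (entrypoint name is an assumption).
uv run orchestrator

# Terminal 3: start the RL trainer on the same GPU.
uv run trainer
```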
Multi-GPU Training
For single-node training, we recommend using the rl entrypoint to conveniently start all components, i.e. the inference server, the orchestrator, and the trainer.
By default, the inference server starts on GPU ID 0 and the trainer on GPU ID 1.
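A minimal sketch, assuming the rl entrypoint is exposed as a console script run via uv (experiment configuration is passed however your setup supports it):

```bash
# Hypothetical invocation: starts the inference server (defaults to GPU 0),
# the orchestrator, and the trainer (defaults to GPU 1) on a single node.
uv run rl
```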
Parallel Experiments
For quick ablations, it can be more efficient to parallelize experiments within a node (e.g. split your GPUs to run two experiments in parallel). For example, if you have access to 4 GPUs and your experiment fits on 2 GPUs, you can parallelize two experiments as follows:
Start the first experiment in a tmux session exp1 with outputs directory outputs1. Specify it both in the tmux script and in the start command (this will use the first 2 GPUs).
Then, start the second experiment in a tmux session exp2 with outputs directory outputs2. In addition, specify a new server port for the inference engine and orchestrator (this will use the last 2 GPUs). A sketch of both launches is shown below.
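A sketch of this layout, assuming the rl entrypoint and using CUDA_VISIBLE_DEVICES to split the GPUs; the output-directory and port flags are placeholders for whatever options your configuration exposes:

```bash
# Flag names below are placeholders for the project's actual options.
# Experiment 1 in tmux session exp1 on GPUs 0 and 1, writing to outputs1.
tmux new-session -d -s exp1 \
  "CUDA_VISIBLE_DEVICES=0,1 uv run rl --output-dir outputs1"

# Experiment 2 in tmux session exp2 on GPUs 2 and 3, writing to outputs2,
# with a different inference server port so the two runs do not collide.
tmux new-session -d -s exp2 \
  "CUDA_VISIBLE_DEVICES=2,3 uv run rl --output-dir outputs2 --server-port 8001"
```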
Multi-Node Training
We currently require a shared file system for multi-node RL training. To facilitate multi-node RL training, ensure that all nodes have access to a shared file system and that the node that will run the inference server is reachable from the orchestrator via a private or public IP address. Then, set the following environment variables on all nodes:
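The concrete variables depend on your deployment; as a hypothetical illustration (variable names and addresses below are placeholders):

```bash
# Hypothetical variable names; adapt them to your deployment.
# Outputs directory on the shared file system, visible from all nodes.
export OUTPUT_DIR=/shared/outputs/my_run
# Address at which the orchestrator can reach the inference server.
export INFERENCE_SERVER_ADDR=10.0.0.1:8000
```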