Run Torch FSDP
How to use Torch FSDP to train models distributed across multiple nodes.
In this tutorial, we will see how to train a model across multiple nodes using PyTorch FSDP.
To follow this tutorial you need to have access to multiple nodes. On each node, you need to install `torch`, `transformers`, and `datasets`.
You can access all of the Python code used in this tutorial here.
1 - Single GPU training
Let’s start with a simple script that trains a 150M-parameter language model on the C4 dataset on a single GPU.
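Below is a minimal sketch of what such a script can look like. The file name, the LLaMA-style config sizes, the `gpt2` tokenizer, and the hyperparameters are illustrative assumptions, not necessarily the exact choices from the tutorial's repository.

```python
# train_single_gpu.py - a minimal sketch (hypothetical file name); config sizes,
# the gpt2 tokenizer, and hyperparameters are assumptions for illustration.
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, LlamaConfig, LlamaForCausalLM


def main():
    device = torch.device("cuda")

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    # Roughly 150M-parameter LLaMA-style config (illustrative sizes)
    config = LlamaConfig(
        vocab_size=len(tokenizer),
        hidden_size=1024,
        intermediate_size=2816,
        num_hidden_layers=12,
        num_attention_heads=16,
    )
    model = LlamaForCausalLM(config).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

    # Stream C4 so we don't have to download the whole dataset
    ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

    def tokenize(batch):
        return tokenizer(
            batch["text"], truncation=True, max_length=512, padding="max_length"
        )

    ds = ds.map(tokenize, batched=True, remove_columns=["text", "timestamp", "url"])
    loader = DataLoader(ds.with_format("torch"), batch_size=8)

    model.train()
    for step, batch in enumerate(loader):
        input_ids = batch["input_ids"].to(device)
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % 10 == 0:
            print(f"step {step}: loss {loss.item():.4f}")


if __name__ == "__main__":
    main()
```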
You can run this on one GPU by doing:
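Assuming the sketch above is saved as `train_single_gpu.py` (a hypothetical name):

```bash
python train_single_gpu.py
```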
2 - Torch Distributed
Let’s now turn this script into a multi-GPU (and later multi-node) script using FSDP and torchrun.
Distributed training in PyTorch follows the SPMD (Single Program, Multiple Data) paradigm: the same training code runs on every GPU, each in its own process, and the processes communicate with each other using `torch.distributed`.
`torchrun` is a PyTorch utility that spawns one process per GPU and makes sure the processes can communicate with each other.
To use `torch.distributed`, we need to initialize the torch process group before training and destroy it afterward.
We also need to ensure that the default device is set to `LOCAL_RANK`, i.e., the GPU index within the current node.
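A sketch of the required setup and teardown is shown below, assuming the script is launched with torchrun (which sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for every process):

```python
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets this environment variable for every process it spawns
    local_rank = int(os.environ["LOCAL_RANK"])

    # Initialize the default process group (NCCL is the standard backend for GPUs)
    dist.init_process_group(backend="nccl")

    # Make sure every CUDA call from this process targets its own GPU
    torch.cuda.set_device(local_rank)

    # ... training loop as before, with model and batches moved to "cuda" ...

    # Clean up the process group once training is finished
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```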
We can now run the code using `torchrun` this time:
3 - Fully Sharded Data Parallel (FSDP)
You can already run the code above on multiple GPUs, but each GPU will just train its own copy of the same model. Without proper communication between the processes, this is not an effective way to train.
To train your model across multiple nodes, you can use either FSDP or DDP from PyTorch. Both are forms of data parallelism (each GPU processes different data), but FSDP is more flexible and allows training larger models by reducing memory usage.
To use torch FSDP, we need to wrap our model in an FSDP unit:
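Below is a hedged sketch of what the resulting script can look like, with the modifications described after it (see the list below). The file name `train_fsdp.py`, the bf16 mixed-precision settings, the gradient-accumulation factor, and the model/tokenizer choices are assumptions carried over from the earlier sketch.

```python
# train_fsdp.py - a hedged sketch of the FSDP version; sizes and names are illustrative.
import contextlib
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.utils.data import DataLoader
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from transformers import AutoTokenizer, LlamaConfig, LlamaForCausalLM

GRAD_ACC_STEPS = 4  # assumed gradient-accumulation factor


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed tokenizer
    tokenizer.pad_token = tokenizer.eos_token
    config = LlamaConfig(vocab_size=len(tokenizer), hidden_size=1024,
                         intermediate_size=2816, num_hidden_layers=12,
                         num_attention_heads=16)
    model = LlamaForCausalLM(config)

    # Wrap the model into an FSDP unit with bf16 mixed precision
    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # the default, shown explicitly
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                       reduce_dtype=torch.float32),
        device_id=local_rank,
        use_orig_params=True,
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

    # Give each rank its own shard of the (streaming) dataset
    ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
    ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512,
                         padding="max_length")

    ds = ds.map(tokenize, batched=True,
                remove_columns=["text", "timestamp", "url"]).with_format("torch")
    loader = DataLoader(ds, batch_size=8)

    model.train()
    for step, batch in enumerate(loader):
        input_ids = batch["input_ids"].to("cuda")
        # Skip gradient synchronization on accumulation steps
        is_accum_step = (step + 1) % GRAD_ACC_STEPS != 0
        ctx = model.no_sync() if is_accum_step else contextlib.nullcontext()
        with ctx:
            loss = model(input_ids=input_ids, labels=input_ids).loss
            (loss / GRAD_ACC_STEPS).backward()
        if not is_accum_step:
            optimizer.step()
            optimizer.zero_grad()
            if rank == 0:
                print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```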
What these modifications do:
- Wrap the model into an FSDP unit (with mixed precision)
- Use the `split_dataset_by_node` utility to split the dataset into a subset for each node/rank
- Use the `no_sync` context manager to avoid doing any communication during the gradient accumulation phase
You can now run this code on multiple GPUs using `torchrun`:
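Again assuming 8 GPUs per node and the hypothetical file name `train_fsdp.py` from the sketch above:

```bash
torchrun --nproc_per_node=8 train_fsdp.py
```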
4 - Run on Multiple Nodes
The code is already ready to run on multiple nodes. You need a copy of the script on each node and the ability to ssh into each of them.
First, you need to decide which node will be the master node and find its private IP. The private IP usually starts with `10.` or `192.`.
You can find the IP using one of the commands below:
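For example, on a typical Linux node, either of the following prints the machine's addresses (standard commands, not necessarily the exact ones from the original tutorial):

```bash
ip addr show
# or
hostname -I
```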
Then, on each node, run:
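Assuming the endpoint is exported as `RDZV_ENDPOINT` (the variable referenced later in the troubleshooting section):

```bash
export RDZV_ENDPOINT=10.15.42.1:1234
```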
Replace `10.15.42.1` with the private IP address of your master node and `1234` with any open port on the master node.
You then need to assign a rank to each node. The master node must have rank 0.
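For example (the variable name `MY_RANK` matches the one referenced in the troubleshooting section below):

```bash
export MY_RANK=0   # 0 on the master node, 1 on the next node, and so on
```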
The variable is named `MY_RANK` rather than `RANK`, as `RANK` would conflict with the environment variable used by `torch.distributed`. Finally, to start the training, run the following command on each node:
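One plausible invocation for two nodes with 8 GPUs each is shown below; the exact flags used in the original tutorial may differ, but all of these are standard torchrun options:

```bash
torchrun --nproc_per_node=8 \
         --nnodes=2 \
         --node_rank=$MY_RANK \
         --rdzv_endpoint=$RDZV_ENDPOINT \
         train_fsdp.py
```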
`--nnodes` should be adjusted to the number of nodes you have. Each node waits for the others to be ready before starting the training.
Troubleshooting
If `torchrun` hangs forever or crashes, check that:
- Rank 0 is the node that exposed its private IP
- There are no duplicate ranks (each node should have a unique `MY_RANK`)
- All ranks are assigned (based on the number of nodes you specified)
- The other machines can reach the master's private IP: use `ping <MASTER_NODE_IP>` to verify, or start a Python server with `python -m http.server 1234` on the master node and check that the other nodes can reach it with `curl http://$RDZV_ENDPOINT`
5 - Adapting the FSDP ShardingStrategy
FSDP stands for Fully Sharded Data Parallel, a data parallelism technique that allows training bigger models by sharding the model weights, gradients, and optimizer state across all of the ranks.
The FSDP class in PyTorch goes beyond plain fully sharded data parallelism: it exposes several sharding strategies.
There is no free lunch when it comes to choosing one of these FSDP strategies: the best choice depends on your hardware, your number of nodes, and the size of the model you are training. A good way to decide is to benchmark the Model FLOPs Utilization (MFU) of different strategies on your hardware and model size. See this paper for more information.
To change the FSDP strategy you can do:
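A minimal sketch, reusing the `model`, `local_rank`, and mixed-precision settings from the earlier FSDP script:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,  # or NO_SHARD, FULL_SHARD, ...
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                   reduce_dtype=torch.float32),
    device_id=local_rank,
)
```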
NO_SHARD
The no-shard strategy is equivalent to DDP (Distributed Data Parallel): the model, the optimizer state, and the gradients are replicated on each rank. The gradients are all-reduced once per step and the model update is done identically on each rank.
This strategy is the least communication-intensive but requires the most memory. It is useful for training small models (below 1B parameters).
SHARD_GRAD_OP
This strategy shards the gradient and the optimizer state across all the ranks, but the model weights are replicated on each rank.
It is more communication-intensive than `NO_SHARD` because you need to communicate each time you modify the gradients or use the optimizer state, but it is less memory-intensive because the gradients and optimizer state are never duplicated.
Avoid this strategy if your intra-node interconnect is not NVLink/NVSwitch or PCIe 5.0.
FULL_SHARD
This strategy is the default FSDP strategy. Nothing is replicated: the model weights, the optimizer state, and the gradients are all sharded across all the ranks. This is the least memory-intensive strategy but requires a lot of communication: at each layer's forward pass, the full weights are materialized on all ranks via an all-gather operation.
Avoid this strategy if your intra-node interconnect is not NVSwitch (i.e. not an SXM machine).
Hybrid techniques
You might have a very fast intra-node interconnect (between GPUs on the same node) but a relatively slow inter-node interconnect (e.g. 100 Gb/s Ethernet between nodes).
In this case, you might want to use the least memory-intensive strategy (`FULL_SHARD`) within a node but the least communication-intensive strategy (`NO_SHARD`) across nodes.
This can be done using a hybrid strategy.
You first need to create a `device_mesh` that represents your topology.
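A hedged sketch using `init_device_mesh` (available in recent PyTorch releases); the 2-D mesh shape and variable names are assumptions, and `model` is the one built in the earlier script:

```python
import os

import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

gpus_per_node = torch.cuda.device_count()
nnodes = int(os.environ["WORLD_SIZE"]) // gpus_per_node

# First mesh dimension: replication across nodes; second: sharding within a node
device_mesh = init_device_mesh("cuda", (nnodes, gpus_per_node))

model = FSDP(
    model,
    device_mesh=device_mesh,
    # FULL_SHARD inside a node, NO_SHARD-style gradient all-reduce across nodes
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    device_id=int(os.environ["LOCAL_RANK"]),
)
```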
If you want to use the `SHARD_GRAD_OP` strategy within a node and `NO_SHARD` between nodes, you can use the `_HYBRID_SHARD_ZERO2` strategy.
Use a hybrid strategy if you don't have InfiniBand or if you are training a relatively small model (< 7B parameters).
What if you have an even slower interconnect bandwidth (less than 100 Gb/s) between nodes? Check out OpenDiloco, our framework for low-bandwidth distributed training.