How to use Torch FSDP to train models distributed across multiple nodes.
For this tutorial you will need the Python packages `torch`, `transformers`, and `datasets`.
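They can be installed with pip, for example:

```bash
pip install torch transformers datasets
```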
You can access all of the python code used in this tutorial here.
`torchrun` is a PyTorch utility that spawns a process on each GPU and makes sure the processes can communicate with each other.
To use `torch.distributed`, we need to initialize the torch process group before training and delete it afterward.
We also need to ensure that the default device is set to the `LOCAL_RANK`, i.e., the GPU number within the current node.
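A minimal sketch of this setup and teardown, assuming the model construction and training loop live where the placeholder comment is (the full linked code may differ):

```python
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets LOCAL_RANK for every process it spawns on this node
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # initialize the process group before any distributed call
    dist.init_process_group(backend="nccl")

    # ... build the model, wrap it in FSDP, and run the training loop here ...

    # clean up the process group once training is finished
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```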
We launch the training script with `torchrun` this time:
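(A sketch of the command; the script name `train_fsdp.py` and the 8 GPUs per node are placeholders.)

```bash
# spawn one process per GPU on this node
torchrun --nproc_per_node=8 train_fsdp.py
```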
We also make two changes to the training script (sketched below):

- use the `split_dataset_by_node` utility to split the dataset into a subset for each node/rank
- use the `no_sync` context manager to avoid doing any communication during the gradient accumulation phase
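A minimal sketch of both changes, assuming the FSDP-wrapped `model`, `optimizer`, `gradient_accumulation_steps`, and a `dataloader` built from the sharded dataset already exist:

```python
from contextlib import nullcontext

import torch.distributed as dist
from datasets.distributed import split_dataset_by_node

# give each rank its own shard of the dataset
dataset = split_dataset_by_node(
    dataset, rank=dist.get_rank(), world_size=dist.get_world_size()
)

for step, batch in enumerate(dataloader):
    # only sync gradients on the last micro-batch of each accumulation window
    is_sync_step = (step + 1) % gradient_accumulation_steps == 0
    maybe_no_sync = nullcontext() if is_sync_step else model.no_sync()
    with maybe_no_sync:
        loss = model(**batch).loss / gradient_accumulation_steps
        loss.backward()
    if is_sync_step:
        optimizer.step()
        optimizer.zero_grad()
```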
We then launch the training with `torchrun`:
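(A sketch of the launch command, run on every node; the script name, the 8 GPUs per node, and the endpoint values are placeholders explained below.)

```bash
# point the rendezvous at the master node (placeholder values)
export RDZV_ENDPOINT=10.15.42.1:1234

torchrun \
  --nproc_per_node=8 \
  --nnodes=2 \
  --node_rank=$MY_RANK \
  --rdzv_endpoint=$RDZV_ENDPOINT \
  train_fsdp.py
```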
The rendezvous endpoint must use the private IP address of the master node, which usually starts with `10.` or `192.168.`
You can find the IP using one of the commands below:
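(Either of these common Linux commands should show the node's private address.)

```bash
hostname -I     # prints the node's IP addresses
ip addr show    # more detailed view of the network interfaces
```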
Replace `10.15.42.1` with the private IP address of your master node and `1234` with any open port on the master node.
You then need to assign a rank to each node. The master node must have rank 0.
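For example, with two nodes:

```bash
# on the master node
export MY_RANK=0
# on the second node
export MY_RANK=1
```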
Don't name this environment variable `RANK`, as `RANK` will conflict with the one used by `torch.distributed`. `--nnodes` should be adjusted to the number of nodes you have.
If `torchrun` is hanging forever or crashes, check that:

- each node has been assigned a unique rank (`MY_RANK`), with rank 0 on the master node
- the nodes can reach the master node (use `ping <MASTER_NODE_IP>` to verify). Or start a Python server with `python -m http.server 1234` on the master node and check that the other node can reach it with `curl http://$RDZV_ENDPOINT`
Sharding the model (`FULL_SHARD`) is more communication-intensive than `NO_SHARD` because you need to communicate each time you want to modify the gradients or use the optimizer state, but it is less memory-intensive because the gradient and optimizer state are never duplicated.
Avoid using this strategy if your intra-node interconnect is not NVLink / NVSwitch (i.e., an SXM machine) or at least PCIe 5.0.
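For reference, the strategy is chosen when wrapping the model with FSDP; a minimal sketch, assuming `model` already exists:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # or ShardingStrategy.NO_SHARD
)
```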
The idea is to use the memory-efficient strategy (`FULL_SHARD`) within a node but still use the less communication-intensive strategy (`NO_SHARD`) across nodes. This can be done using a hybrid strategy.
You first need to create a `device_mesh`, which represents your topology.
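A minimal sketch of a hybrid setup, assuming 2 nodes with 8 GPUs each and an already instantiated `model` (a recent PyTorch release is needed for the `device_mesh` argument):

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# (number of nodes, GPUs per node): FULL_SHARD inside a node,
# replication (NO_SHARD-like) across nodes
device_mesh = init_device_mesh("cuda", (2, 8))

model = FSDP(
    model,
    device_mesh=device_mesh,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```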
If you want to use the `SHARD_GRAD_OP` strategy within a node and normal `NO_SHARD` between nodes, you can use the `_HYBRID_SHARD_ZERO2` strategy.
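The only change from the sketch above is the strategy enum:

```python
model = FSDP(
    model,
    device_mesh=device_mesh,
    sharding_strategy=ShardingStrategy._HYBRID_SHARD_ZERO2,  # SHARD_GRAD_OP within a node
)
```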
A hybrid strategy should be used if you don't have InfiniBand between your nodes or if you are training a relatively small model (< 7B parameters). What if you have an even slower interconnect bandwidth (less than 100 Gb/s) between nodes? Check out OpenDiloco, our framework for low-bandwidth distributed training.