Llama 405B Inference in BF16
Run Llama 405B inference in BF16 across multiple nodes.
Introduction
The Llama 3.1 405B model is a 405-billion-parameter model released by Meta and is currently one of the most powerful open models available.
The model can be served on a single node of 8 x H100 with FP8 quantization using the image on the Prime Intellect platform. However, FP8 quantization introduces some degradation in output quality. If you wish to avoid this degradation, you can serve the model in BF16 across 16 x H100 by following this tutorial. (In BF16 the weights alone take roughly 405B parameters × 2 bytes ≈ 810 GB, which exceeds the 640 GB of combined GPU memory on a single 8 x H100 node, hence the second node.)
You can also serve other 405B models such as NousResearch/Hermes-3-Llama-3.1-405B with this tutorial. Just replace any references to Meta-Llama-3.1-405B-Instruct with Hermes-3-Llama-3.1-405B (or any other model path) in the code snippets.
Preliminaries
Get access to the Llama 3.1 models on Hugging Face
To download the model weights from Hugging Face, you need to accept the Llama 3.1 License and be approved for access to the repository.
You can accept the Llama 3.1 License on the Meta Llama website: https://llama.meta.com/llama-downloads/
Once you have accepted the license on the Meta Llama website, you can request access on the Hugging Face repo: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct. You will need to log in, review the license, and request access.
The approval can take a few hours. Once you have been approved, you will see that you have been granted access to the model.
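If you would like to confirm programmatically that access has been granted, one option is to query the Hugging Face Hub model API with your token; for a gated repository you cannot access, it returns an error instead of the model metadata. This is an illustrative check only, and assumes you have already created a Hugging Face read token (covered in the Download the model section) and exported it as HF_TOKEN:
# Returns JSON metadata for the model if your token has access; an error otherwise
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/api/models/meta-llama/Meta-Llama-3.1-405B-Instruct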
Getting the machines
Now that we can access the model weights, let’s provision the machines.
Navigate to the Megacluster tab in the sidebar and deploy a cluster containing 16 x H100.
Wait for the nodes to provision and the connection strings to be shown. This can take a while, so feel free to take a little coffee break.
Serve vLLM with a Ray cluster
Network setup
For the nodes to collaborate effectively in serving the model, they need to be able to communicate with each other.
Once the nodes are up, connect to them on separate terminals using their respective connection strings.
Once connected, you can get the private IP address and interface of the nodes using the command below:
ip a | grep '10\.' | awk '{ print $2, $NF }'; ip a | grep '192\.168' | awk '{ print $2, $NF }'
# e.g. 10.0.0.123/24 enp8s0
On both nodes, set the environment variable GLOO_SOCKET_IFNAME to the name of the network interface:
# e.g. export GLOO_SOCKET_IFNAME=enp8s0
export GLOO_SOCKET_IFNAME=<INTERFACE_NAME>
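If you prefer not to copy the interface name by hand, a rough one-liner like the sketch below picks the first interface carrying a 10.x or 192.168.x address (an assumption based on the command above); verify the result with echo before relying on it:
# Auto-detect the private interface; double-check the result before using it
export GLOO_SOCKET_IFNAME=$(ip -o -4 addr show | awk '/ 10\.| 192\.168\./ { print $2; exit }')
echo $GLOO_SOCKET_IFNAME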
Pick one of the nodes to be the head node (we recommend picking the first one in the list to make it easy to remember). Set the HEAD_NODE_ADDRESS environment variable to the IP of the head node by replacing <HEAD_NODE_IP> in the command below with the IP address obtained from the previous command:
# e.g. export HEAD_NODE_ADDRESS=10.0.0.123
export HEAD_NODE_ADDRESS=<HEAD_NODE_IP>
Make sure all machines can reach the head node and that the head node can reach the other nodes: every node should be able to ping $HEAD_NODE_ADDRESS, and the head node should be able to ping each worker's IP address.
ping $HEAD_NODE_ADDRESS
ping <WORKER_NODE_ADDRESS>
(Optional) Infiniband setup
If your cluster has an Infiniband interconnect, you can use it to improve your throughput.
In order to check if your nodes have Infiniband, run:
nvidia-smi topo -m
If your output contains NICs (Network Interface Cards) with the mlx prefix, similar to the example below, your node has Infiniband interconnect. Otherwise, it does not, and you should skip this section.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX PIX PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS 0-12 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS 26-38 2 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS 39-51 3 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS 13-25 1 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS SYS PIX PIX PIX SYS SYS SYS 52-64 4 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX SYS SYS 78-90 6 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX SYS 91-103 7 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX 65-77 5 N/A
NIC0 PIX SYS SYS SYS SYS SYS SYS SYS X PIX PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PIX SYS SYS SYS SYS SYS SYS SYS PIX X PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC2 PIX SYS SYS SYS SYS SYS SYS SYS PIX PIX X SYS SYS SYS SYS SYS SYS SYS SYS SYS
NIC3 SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS
NIC5 SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS
NIC6 SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX PIX SYS SYS SYS
NIC7 SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X PIX SYS SYS SYS
NIC8 SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX PIX X SYS SYS SYS
NIC9 SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS
NIC10 SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X SYS
NIC11 SYS SYS SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
NIC10: mlx5_10
NIC11: mlx5_11
If your nodes have Infiniband interconnects, you can enable them for your deployment by setting the NETWORK_ENVS variable:
export NETWORK_ENVS='-e NCCL_IB_HCA=mlx5'
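You may also want to confirm that the Infiniband links are actually up before relying on them. One way to check (assuming the infiniband-diags package, which provides ibstat, is available or installable via apt) is:
sudo apt install -y infiniband-diags
# Active links report "State: Active" together with a non-zero rate
ibstat | grep -E "CA '|State|Rate"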
Install docker
The Ray cluster that will be used to deploy the vLLM server requires a homogeneous environment for the participating processes. To keep the environments consistent, we will start the Ray nodes with docker. Install docker using the command below:
curl https://get.docker.com | sh
sudo usermod -aG docker "$USER"
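You can confirm the daemon is running before moving on. Note that the group change only takes effect after you log in again (as described later in this section), so use sudo for now:
sudo docker ps   # should print an (empty) container list rather than an error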
Now install the NVIDIA container toolkit so that docker can run containers with GPU access:
#======================================================================================
# Reference Documentation:
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
#--------------------------------------------------------------------------------------
# Add Repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Reload Sources
sudo apt-get update
# Install package
sudo apt-get install -y nvidia-docker2
# Restart docker daemon
sudo systemctl restart docker
# Test install
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
If the installation worked, you should see the nvidia-smi output containing information about the GPUs on the node:
ubuntu@g384:~$ sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.6.2-base-ubuntu20.04' locally
11.6.2-base-ubuntu20.04: Pulling from nvidia/cuda
96d54c3075c9: Pull complete
a3d20efe6db8: Pull complete
bfdf8ce43b67: Pull complete
ad14f66bfcf9: Pull complete
1056ff735c59: Pull complete
Digest: sha256:a0dd581afdbf82ea9887dd077aebf9723aba58b51ae89acb4c58b8705b74179b
Status: Downloaded newer image for nvidia/cuda:11.6.2-base-ubuntu20.04
Sat Aug 17 14:14:16 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 Off | 00000000:19:00.0 Off | 0 |
| N/A 36C P0 120W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 Off | 00000000:3B:00.0 Off | 0 |
| N/A 32C P0 119W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 Off | 00000000:4C:00.0 Off | 0 |
| N/A 29C P0 118W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 Off | 00000000:5D:00.0 Off | 0 |
| N/A 35C P0 120W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 Off | 00000000:9B:00.0 Off | 0 |
| N/A 36C P0 119W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 Off | 00000000:BB:00.0 Off | 0 |
| N/A 30C P0 114W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 Off | 00000000:CB:00.0 Off | 0 |
| N/A 34C P0 115W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 Off | 00000000:DB:00.0 Off | 0 |
| N/A 30C P0 120W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Restart your SSH session for the docker group change to take effect.
exit
<SSH Connection string from Prime Intellect Console>
Download the model
The download can take quite a while. You may want to run this section in a tmux session.
sudo apt install tmux -y; tmux
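If your SSH connection drops while the download is running, the tmux session keeps it alive; after reconnecting, reattach with:
tmux attach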
Now that we have established connectivity between the nodes and installed docker, let’s download the model on each of the nodes. For this you will need a Hugging Face API token for the account that accepted the Meta Llama 3.1 License and was granted access to the model repository.
If you don’t have a token, or wish to create a new one, you can do that at https://huggingface.co/settings/tokens. A read token should be sufficient.
First, install git and git-lfs:
sudo apt install git git-lfs -y
git lfs install
Then clone the repository and pull the model weights (you will be asked for your username and password; enter your Hugging Face API token as the password):
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct
cd Meta-Llama-3.1-405B-Instruct
export MODEL_PATH=$(pwd)
git lfs pull -I '*.safetensors'
If you have done this correctly, you should see a printout showing the download progress.
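After the pull completes, a rough sanity check is to look at the size on disk: in BF16 the weights alone are about 405B parameters × 2 bytes ≈ 810 GB, so a much smaller directory suggests an incomplete download. For example:
du -sh "$MODEL_PATH"                    # expect on the order of ~800 GB
ls "$MODEL_PATH"/*.safetensors | wc -l  # number of weight shards that were pulled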
Start Ray cluster
Now that we have the model weights, it’s time to set up the Ray cluster, which vLLM will use to coordinate work between the nodes.
Before running the commands to start the cluster, let’s make sure we have all the environment variables set correctly. If you are missing something, refer to the sections above to find where it was set.
echo $HEAD_NODE_ADDRESS # IP address of the head node
echo $GLOO_SOCKET_IFNAME # Interface name of the private network between nodes
echo $MODEL_PATH # Path to where the model was cloned and pulled
echo $NETWORK_ENVS # Infiniband env var if node has infiniband interconnect, otherwise not set
# e.g.
# 10.0.0.123
# enp8s0
# /data/Meta-Llama-3.1-405B-Instruct
# -e NCCL_IB_HCA=mlx5
If you have not restarted your SSH session since installing docker, the group change will not yet apply and you will need to prefix the docker commands below with sudo (e.g. sudo docker ...).
On the head node run:
docker run -d --rm \
--name vllm-server \
--privileged \
--network host \
--ipc host \
--entrypoint /bin/bash \
--gpus all \
-e GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME} \
${NETWORK_ENVS} \
-v "${MODEL_PATH}:/Meta-Llama-3.1-405B-Instruct" \
primeintellect/vllm-cuda12-1 -c "ray start --block --head --port 6379"
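Before starting the worker, you can check the head container's logs to confirm that the Ray head came up; you should see Ray report that the runtime started:
docker logs vllm-server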
On the worker node run:
docker run -d --rm \
--name vllm-server \
--privileged \
--network host \
--ipc host \
--entrypoint /bin/bash \
--gpus all \
-e GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME} \
${NETWORK_ENVS} \
-v "${MODEL_PATH}:/Meta-Llama-3.1-405B-Instruct" \
primeintellect/vllm-cuda12-1 -c "ray start --block --address=${HEAD_NODE_ADDRESS}:6379"
Then, on any node, enter the container and check the status of the Ray cluster:
docker exec -it vllm-server /bin/bash
ray status
You should see the right number of nodes and GPUs (two nodes with 16 GPUs in total).
======== Autoscaler status: 2024-08-17 14:16:06.197429 ========
Node status
---------------------------------------------------------------
Active:
1 node_ea778bb322b397a006009ebf6823e92c6263cfa7b91b36252022c7ea
1 node_b90c7b7af9d6193473b45dbc65dc1ff47a4e26e7d6602d2bf9c22805
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/312.0 CPU
0.0/16.0 GPU
0B/1.59TiB memory
0B/372.53GiB object_store_memory
Demands:
(no resource demands)
Start vLLM API server
With the Ray cluster up and running, we can now serve with vLLM as usual, as if all the GPUs were on one node.
Pick any one of the nodes and, inside its container, start the vLLM server by running:
# API server
vllm serve /Meta-Llama-3.1-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 > /root/vllm.log 2>&1
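Loading the weights across both nodes can take a while. Once the server is up, you can sanity-check it from another shell inside the container. This is a minimal sketch assuming vLLM's defaults: the OpenAI-compatible API on port 8000 and a served model name equal to the model path:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/Meta-Llama-3.1-405B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'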
Start chat server
Optionally, you can start a chat server on the head node by executing this command inside the container.
First, install gradio:
pip install gradio
Then, launch the chat server as a background process:
# Chat server
python3 /chat_server.py --model /Meta-Llama-3.1-405B-Instruct --host 0.0.0.0 --port 8001 > /root/gradio.log 2>&1 &
You should now be able to chat with the model using the public IP address of the head node (the IP address in the connection string shown in the Megacluster console). Go to http://<HEAD_NODE_PUBLIC_IP>:8001 and you should see the chat interface.
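If the page does not load, a quick check from your local machine (replace the placeholder with the head node's public IP) tells you whether the Gradio server is reachable:
# An HTTP 200 response code means the chat UI is up and reachable
curl -s -o /dev/null -w "%{http_code}\n" http://<HEAD_NODE_PUBLIC_IP>:8001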