Introduction

Llama 3.1 405B is a 405-billion-parameter model released by Meta and is currently the most powerful open-source model available.

The model can be served on a single node of 8 x H100 with FP8 quantization using the image on the Prime Intellect platform. However, FP8 quantization results in some quality degradation. To avoid it, you can serve the model in BF16 on 16 x H100 by following this tutorial.
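As a rough sanity check of why two nodes are needed for BF16, consider the weight memory alone (ignoring the KV cache and activations):

# Back-of-envelope weight memory:
#   BF16: 405e9 params x 2 bytes ~= 810 GB  >  8 x 80 GB = 640 GB  -> needs 2 nodes
#   FP8:  405e9 params x 1 byte  ~= 405 GB  <  640 GB              -> fits on 1 node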

You can also serve other 405B models such as NousResearch/Hermes-3-Llama-3.1-405B with this tutorial. Just replace any references to Meta-Llama-3.1-405B-Instruct with Hermes-3-Llama-3.1-405B (or any other model path) in the code snippets.
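For example, to serve the Hermes model instead, the clone step in the download section below would become:

# Example substitution for the download step; the docker volume mount and the
# vllm serve path later in the tutorial would change accordingly.
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-405B
cd Hermes-3-Llama-3.1-405B
export MODEL_PATH=$(pwd)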

Preliminaries

Get access to the Llama 3.1 models on huggingface

In order to download the model weights from huggingface, you need to accept the Llama 3.1 License and be approved to access the repository.

You can accept the Llama 3.1 License on the Meta Llama website: https://llama.meta.com/llama-downloads/

Once you have accepted the license on the Meta Llama website, you can request access on the huggingface repo: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct. You will need to log in, review the license, and request access.

The approval can take a few hours. Once you have been approved, you will see that you have been granted access to the model.
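Optionally, you can check programmatically whether access has been granted using the huggingface Hub API (this assumes you already have a huggingface API token; creating one is covered in the download section below):

# Should print 200 once access has been granted (401/403 while it is still pending).
# Replace <YOUR_HF_TOKEN> with your huggingface API token.
curl -s -o /dev/null -w "%{http_code}\n" \
    -H "Authorization: Bearer <YOUR_HF_TOKEN>" \
    https://huggingface.co/api/models/meta-llama/Meta-Llama-3.1-405B-Instruct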

Getting the machines

Now that we can access the model weights, let’s provision the machines.

Navigate to the Megacluster tab in the sidebar and deploy a cluster containing 16 x H100.

Make sure the machines have at least 2TB of disk space each.

Wait for the nodes to provision and the connection string to be shown. This can take a while, so feel free to take a little coffee break.
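Once the nodes are up and you can connect to them (see the next section), you can double-check the available disk space on each node; the relevant mount point may vary depending on the image:

df -h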

Serve vLLM with a Ray cluster

Network setup

For the nodes to collaborate effectively in serving the model, they need to be able to communicate with each other.

Once the nodes are up, connect to them on separate terminals using their respective connection strings.

Once connected, you can get the private IP address and interface of the nodes using the command below:

ip a | grep '10\.' | awk '{ print $2, $NF }'; ip a | grep '192\.168' | awk '{ print $2, $NF }'
# e.g. 10.0.0.123/24 enp8s0

On both nodes, set the GLOO_SOCKET_IFNAME environment variable to the name of the network interface:

# e.g. export GLOO_SOCKET_IFNAME=enp8s0
export GLOO_SOCKET_IFNAME=<INTERFACE_NAME>

Pick one of the nodes to be the head node. (We recommend picking the first one in the list to make it easy to remember.) Set the HEAD_NODE_ADDRESS environment variable to the IP of the head node by replacing <HEAD_NODE_IP> in the command below with the IP address obtained from the previous command:

# e.g. export HEAD_NODE_ADDRESS=10.0.0.123
export HEAD_NODE_ADDRESS=<HEAD_NODE_IP>

Make sure all the machines can reach the head node and that the head node can reach the other nodes. You can do this by checking that all nodes can ping $HEAD_NODE_ADDRESS and that the head node can ping the other nodes’ IP addresses.

ping $HEAD_NODE_ADDRESS
ping <WORKER_NODE_ADDRESS>

(Optional) Infiniband setup

If your cluster has an Infiniband interconnect, you can utilise it to improve your throughput.

In order to check if your nodes have Infiniband, run:

nvidia-smi topo -m

If your output contains NICs (Network Interface Cards) with the mlx prefix, similar to the example below, your node has an Infiniband interconnect. Otherwise, it does not, and you should skip this section.

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-12    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     26-38   2               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     39-51   3               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     13-25   1               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX     PIX     SYS     SYS     SYS     52-64   4               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     78-90   6               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     91-103  7               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     65-77   5               N/A
NIC0    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     PIX     SYS     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      PIX     SYS     SYS     SYS
NIC8    SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     PIX      X      SYS     SYS     SYS
NIC9    SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
NIC10   SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC11   SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
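If the infiniband-diags package is available on the host (you may need to install it first), you can also confirm that the links are up:

# Optional: list the IB devices and their link state (active ports show "State: Active")
sudo apt install -y infiniband-diags
ibstat | grep -E "CA '|State|Rate"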

If your nodes have Infiniband interconnects, you can enable them for your deployment by setting the NETWORK_ENVS variable:

export NETWORK_ENVS='-e NCCL_IB_HCA=mlx5'

Install docker

The Ray cluster that will be used to deploy the vLLM server requires a homogeneous environment for the participating processes. To make the environments consistent, we will start the Ray nodes with docker. Install docker using the command below:

curl https://get.docker.com | sh
sudo usermod -aG docker "$USER"
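You can quickly confirm that the docker client is installed:

# Client-side check only; running containers as your user requires a re-login (covered below)
docker --version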

Now install the nvidia-container-toolkit so that containers can access the GPUs:

#======================================================================================
# Reference Documentation:
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
#--------------------------------------------------------------------------------------
# Add Repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Reload Sources
sudo apt-get update

# Install package
sudo apt-get install -y nvidia-docker2

# Restart docker daemon
sudo systemctl restart docker

# Test install
sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

If the installation worked, you should see the nvidia-smi output containing information about the GPUs on the node:

ubuntu@g384:~$ sudo docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.6.2-base-ubuntu20.04' locally
11.6.2-base-ubuntu20.04: Pulling from nvidia/cuda
96d54c3075c9: Pull complete 
a3d20efe6db8: Pull complete 
bfdf8ce43b67: Pull complete 
ad14f66bfcf9: Pull complete 
1056ff735c59: Pull complete 
Digest: sha256:a0dd581afdbf82ea9887dd077aebf9723aba58b51ae89acb4c58b8705b74179b
Status: Downloaded newer image for nvidia/cuda:11.6.2-base-ubuntu20.04
Sat Aug 17 14:14:16 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:19:00.0 Off |                    0 |
| N/A   36C    P0            120W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0            119W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:4C:00.0 Off |                    0 |
| N/A   29C    P0            118W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5D:00.0 Off |                    0 |
| N/A   35C    P0            120W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9B:00.0 Off |                    0 |
| N/A   36C    P0            119W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:BB:00.0 Off |                    0 |
| N/A   30C    P0            114W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:CB:00.0 Off |                    0 |
| N/A   34C    P0            115W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:DB:00.0 Off |                    0 |
| N/A   30C    P0            120W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Restart your SSH session for the docker group membership change to take effect.

exit
<SSH Connection string from Prime Intellect Console>
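After reconnecting, confirm that docker now works without sudo:

docker ps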

Download the model

The download can take quite a while. You may want to run this section in a tmux session.

sudo apt install tmux -y; tmux
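If your SSH connection drops, reconnect to the node and reattach to the session:

tmux attach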

Now that we have established connectivity between the nodes and installed docker, let’s download the model on each of the nodes. For this, you will need a huggingface API token for the account that accepted the Meta Llama 3.1 License and was granted access to the model repository.

If you don’t have a token, or wish to create a new one, you can do that in your huggingface account settings at https://huggingface.co/settings/tokens. A read token should be sufficient.

First, install git and git-lfs:

sudo apt install git git-lfs -y
git lfs install

Then clone the repository and pull the model weights (you will be asked for your username and password; enter your huggingface API token as the password):

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct
cd Meta-Llama-3.1-405B-Instruct
export MODEL_PATH=$(pwd)
git lfs pull -I '*.safetensors'

If you have done this correctly, you should see a printout showing the download progress.

The total file size should be ~812GB. However, approximately double that amount of disk space will be used, as git lfs keeps a copy of the object files in the .git folder.
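Once the pull completes, you can roughly verify the download (paths assume the clone above):

# Total size of the checkout, including the copies under .git
du -sh "$MODEL_PATH"
# Combined size of the weight shards alone (should add up to roughly 812GB)
du -ch "$MODEL_PATH"/*.safetensors | tail -n 1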

Start the Ray cluster

Now that we have the model weights, it’s time to set up the Ray cluster, which vLLM will use to coordinate work between the nodes.

Before running the commands to start the cluster, let’s make sure we have all the environment variables set correctly. If you are missing something, refer to the sections above to find where it was set.

echo $HEAD_NODE_ADDRESS     # IP address of the head node
echo $GLOO_SOCKET_IFNAME    # Interface name of the private network between nodes
echo $MODEL_PATH            # Path to where the model was cloned and pulled
echo $NETWORK_ENVS          # Infiniband env var if node has infiniband interconnect, otherwise not set
# e.g.
# 10.0.0.123
# enp8s0
# /data/Meta-Llama-3.1-405B-Instruct
# -e NCCL_IB_HCA=mlx5

If you get a permission error when running the docker commands below, prepend sudo (e.g. sudo docker ...).

On the head node run:

docker run -d --rm \
    --name vllm-server \
    --privileged \
    --network host \
    --ipc host \
    --entrypoint /bin/bash \
    --gpus all \
    -e GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME} \
    ${NETWORK_ENVS} \
    -v "${MODEL_PATH}:/Meta-Llama-3.1-405B-Instruct" \
    primeintellect/vllm-cuda12-1 -c "ray start --block --head --port 6379"

On the worker node run:

docker run -d --rm \
    --name vllm-server \
    --privileged \
    --network host \
    --ipc host \
    --entrypoint /bin/bash \
    --gpus all \
    -e GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME} \
    ${NETWORK_ENVS} \
    -v "${MODEL_PATH}:/Meta-Llama-3.1-405B-Instruct" \
    primeintellect/vllm-cuda12-1 -c "ray start --block --address=${HEAD_NODE_ADDRESS}:6379"
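On each node, you can confirm that the container came up and that ray started by tailing the container logs:

docker logs -f vllm-server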

Then, on any node, use docker exec -it vllm-server /bin/bash to enter the container and execute ray status to check the status of the Ray cluster. You should see the expected number of nodes (2) and GPUs (16).

======== Autoscaler status: 2024-08-17 14:16:06.197429 ========
Node status
---------------------------------------------------------------
Active:
 1 node_ea778bb322b397a006009ebf6823e92c6263cfa7b91b36252022c7ea
 1 node_b90c7b7af9d6193473b45dbc65dc1ff47a4e26e7d6602d2bf9c22805
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/312.0 CPU
 0.0/16.0 GPU
 0B/1.59TiB memory
 0B/372.53GiB object_store_memory

Demands:
 (no resource demands)

Start the vLLM API server

With the Ray cluster up and running, we can now serve the model with vLLM as usual, as if all the GPUs were on one node.

Pick any one of the nodes and, inside the container, start the vLLM server by running:

# API server: tensor parallel across the 8 GPUs within each node,
# pipeline parallel across the 2 nodes, running in the background with logs in /root/vllm.log
vllm serve /Meta-Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 > /root/vllm.log 2>&1 &
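Loading the weights across both nodes can take a while; you can follow the progress in /root/vllm.log. Once the server is up, you should be able to query the OpenAI-compatible API from the node where you launched it, for example with a minimal request like the one below (port 8000 is vLLM's default; adjust it if you override the defaults):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/Meta-Llama-3.1-405B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
    }'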

Start chat server

Optionally, you can start a chat server on the head node by executing the commands below inside the container.

First, install gradio:

pip install gradio

Then, launch the chat server as a background process:

# Chat server
python3 /chat_server.py --model /Meta-Llama-3.1-405B-Instruct --host 0.0.0.0 --port 8001 > /root/gradio.log 2>&1 &

You should now be able to chat with the model via the public IP address of the head node, which is the IP address provided in the connection string in the Megacluster console. Go to http://<HEAD_NODE_PUBLIC_IP>:8001 and you should see the chat interface.
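If port 8001 is not reachable from the outside (for example, because of a firewall), you can tunnel it over SSH instead and open http://localhost:8001 locally; substitute the user and host from your connection string:

# Forward local port 8001 to the chat server running on the head node
ssh -L 8001:localhost:8001 <USER>@<HEAD_NODE_PUBLIC_IP>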