Llama 405B Inference in BF16
Run Llama 405B inference in BF16 across multiple nodes.
Introduction
The Llama 3.1 405B model is a 405 billion parameter model released by Meta and is currently the most powerful open source model available.
The model can be served on a single node of 8 x H100 GPUs with FP8 quantization using the image on the Prime Intellect platform. However, FP8 quantization results in some quality degradation. If you wish to avoid this degradation, you can serve the model in BF16 on 16 x H100 GPUs by following this tutorial.
You can also serve other 405B models, such as NousResearch/Hermes-3-Llama-3.1-405B, with this tutorial. Just replace any references to Meta-Llama-3.1-405B-Instruct with Hermes-3-Llama-3.1-405B (or any other model path) in the code snippets.
Preliminaries
Get access to the Llama 3.1 models on huggingface
In order to download the model weights from huggingface, you need to accept the Llama 3.1 License and be approved to access the repository.
You can accept the Llama 3.1 License on the Meta Llama website: https://llama.meta.com/llama-downloads/
Once you have accepted the license on the Meta Llama website, you can request access on the huggingface repo: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct. You will need to log in, review the license, and request access.
Approval can take a few hours. Once approved, the repository page will show that you have been granted access to the model.
Getting the machines
Now that we can access the model weights, let’s provision the machines.
Navigate to the Megacluster tab in the sidebar and deploy a cluster containing 16 x H100 GPUs.
Wait for the nodes to provision and the connection string to be shown. This can take a while, so feel free to take a little coffee break.
Serve vLLM with a Ray cluster
Network setup
For the nodes to collaborate effectively in serving the model, they need to be able to communicate with each other.
Once the nodes are up, connect to them on separate terminals using their respective connection strings.
Once connected, you can get the private IP address and interface of the nodes using the command below:
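For example, using ip (the interface names and private addresses will differ on your nodes):

```bash
# List the IPv4 addresses and their interfaces, ignoring loopback.
ip -o -4 addr show | grep -v "127.0.0.1"
```

Note the private IP address (typically in a 10.x.x.x or 192.168.x.x range) and the name of the interface it is attached to.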
On both nodes, set the environment variable GLOO_SOCKET_IFNAME to the name of the network interface:
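For example, if the interface found in the previous step is named enp1s0 (substitute your own interface name):

```bash
# Replace enp1s0 with the interface name from the previous step.
export GLOO_SOCKET_IFNAME=enp1s0
```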
Pick one of the nodes to be the head node.
(We recommend picking the first one in the list to make it easy to remember.)
Set the HEAD_NODE_ADDRESS environment variable to the IP of the head node by replacing <HEAD_NODE_IP> in the command below with the IP address obtained from the previous command:
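```bash
# Replace <HEAD_NODE_IP> with the head node's private IP address.
export HEAD_NODE_ADDRESS=<HEAD_NODE_IP>
```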
Make sure all the machines can reach the head node and that the head node can reach the other nodes.
You can do this by making sure that all nodes can ping $HEAD_NODE_ADDRESS and that the head node can ping the other nodes’ IP addresses.
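For example:

```bash
# On each worker node, check that the head node is reachable:
ping -c 3 $HEAD_NODE_ADDRESS

# On the head node, check each worker (replace <WORKER_NODE_IP> with the worker's private IP):
ping -c 3 <WORKER_NODE_IP>
```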
(Optional) Infiniband setup
If your cluster has an Infiniband interconnect, you can use it to improve throughput.
To check whether your nodes have Infiniband, run:
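One simple check (assuming the standard Linux sysfs layout) is to list the RDMA devices:

```bash
# Lists RDMA devices; this directory is missing or empty on nodes without Infiniband.
ls /sys/class/infiniband
```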
If the output lists NICs (Network Interface Cards) with the mlx prefix (for example mlx5_0), your node has an Infiniband interconnect.
Otherwise, it does not and you should skip this section.
If your nodes have Infiniband interconnects, you can enable them for your deployment by setting the NETWORK_ENVS environment variable:
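How exactly these settings should be passed depends on your setup; the sketch below assumes NETWORK_ENVS holds extra -e flags that get expanded into the docker run commands later in this tutorial, and the NCCL values may need tuning for your fabric:

```bash
# Assumption: NETWORK_ENVS is expanded into the `docker run` commands below.
# NCCL_IB_DISABLE=0 keeps Infiniband enabled; NCCL_IB_HCA selects the Mellanox HCAs.
export NETWORK_ENVS="-e NCCL_IB_DISABLE=0 -e NCCL_IB_HCA=mlx5"
```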
Install docker
The Ray cluster that will be used to deploy the vLLM server requires a homogeneous environment for the participating processes. To keep the environments consistent, we will start the Ray nodes with docker. Install docker using the command below:
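One common way to do this is Docker's convenience script (your platform image may already ship docker, in which case you can skip this):

```bash
# Install docker via Docker's convenience script.
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Allow running docker without sudo (takes effect after you log in again).
sudo usermod -aG docker $USER
```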
Now install the nvidia-container-toolkit so that containers can access the GPUs:
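The commands below follow NVIDIA's apt installation instructions; adjust them if your nodes use a different distribution:

```bash
# Add NVIDIA's container toolkit repository and install the toolkit.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure docker to use the NVIDIA runtime and restart it.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```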
To verify that the installation worked, run nvidia-smi inside a container; you should see output containing information about the GPUs on the node:
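For example (the CUDA base image tag below is just an example; any recent CUDA base image works):

```bash
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```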
Restart your ssh session for the docker permission change to take effect.
Download the model
The download can take quite a while. You may want to run this section in a tmux session.
Now that we have established connectivity between the nodes and installed docker, let's download the model on each of the nodes. For this, you will need a huggingface API token for the account that accepted the Llama 3.1 License and was granted access to the model repository.
If you don't have a token, or wish to create a new one, you can do so in your huggingface account settings at https://huggingface.co/settings/tokens. A read token should be sufficient.
First, install git and git-lfs:
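On Ubuntu- or Debian-based nodes:

```bash
sudo apt-get update
sudo apt-get install -y git git-lfs
git lfs install
```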
Then clone the repository and pull the model weights. You will be asked for your username and password; enter your huggingface API token as the password.
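One way to do this, assuming you clone into your home directory (skipping the LFS smudge on clone and then pulling the weights gives a cleaner progress printout):

```bash
# Clone the repo metadata first, then download the model weights (several hundred GB) with git-lfs.
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct
cd Meta-Llama-3.1-405B-Instruct
git lfs pull
```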
If you have done this correctly, you should see a printout of the download progress.
Start the Ray cluster
Now that we have the model weights, it's time to set up the Ray cluster. The Ray cluster will be used by vLLM to coordinate work between the nodes.
Before running the commands to start the cluster, let’s make sure we have all the environment variables set correctly. If you are missing something, refer to the sections above to find where it was set.
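A quick sanity check (NETWORK_ENVS will only be set if you configured Infiniband above):

```bash
echo "GLOO_SOCKET_IFNAME = $GLOO_SOCKET_IFNAME"
echo "HEAD_NODE_ADDRESS  = $HEAD_NODE_ADDRESS"
echo "NETWORK_ENVS       = $NETWORK_ENVS"
```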
On the head node run:
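The following is a sketch of the command, assuming the vllm/vllm-openai image, the model cloned into your home directory, and Ray's default port 6379; adjust the image, mount path, and shared-memory size to your setup:

```bash
sudo docker run -d --name vllm-server \
    --network host \
    --gpus all \
    --shm-size 10g \
    --entrypoint /bin/bash \
    -e GLOO_SOCKET_IFNAME=$GLOO_SOCKET_IFNAME \
    $NETWORK_ENVS \
    -v ~/Meta-Llama-3.1-405B-Instruct:/models/Meta-Llama-3.1-405B-Instruct \
    vllm/vllm-openai:latest \
    -c "ray start --block --head --port=6379"
```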
On the worker node run:
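The worker command mirrors the head node one under the same assumptions, but joins the existing cluster at HEAD_NODE_ADDRESS instead of starting a new head:

```bash
sudo docker run -d --name vllm-server \
    --network host \
    --gpus all \
    --shm-size 10g \
    --entrypoint /bin/bash \
    -e GLOO_SOCKET_IFNAME=$GLOO_SOCKET_IFNAME \
    $NETWORK_ENVS \
    -v ~/Meta-Llama-3.1-405B-Instruct:/models/Meta-Llama-3.1-405B-Instruct \
    vllm/vllm-openai:latest \
    -c "ray start --block --address=$HEAD_NODE_ADDRESS:6379"
```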
Then, on any node, use docker exec -it vllm-server /bin/bash to enter the container. Execute ray status to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
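For example:

```bash
docker exec -it vllm-server /bin/bash

# Inside the container: with a healthy 16 x H100 deployment, `ray status`
# should list 2 nodes and 16 GPUs.
ray status
```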
Start vLLM API server
With the Ray cluster up and running, we can now serve with vLLM as usual, as if all 16 GPUs were on a single node.
Pick any one of the nodes to start the vLLM server by running:
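The following is a sketch of the serve command, assuming the model is mounted at /models/Meta-Llama-3.1-405B-Instruct inside the container and an 8-way tensor-parallel by 2-way pipeline-parallel split across the 16 GPUs (16-way tensor parallelism is also possible); flags can vary slightly between vLLM versions:

```bash
vllm serve /models/Meta-Llama-3.1-405B-Instruct \
    --dtype bfloat16 \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray \
    --host 0.0.0.0 \
    --port 8000
```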
Start chat server
Optionally, you can start a chat server on the head node by running the following commands inside the container.
First, install gradio:
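The openai client is installed alongside gradio because the example chat app sketched below uses it to talk to the vLLM endpoint:

```bash
pip install gradio openai
```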
Then, launch the chat server as a background process:
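The following is a minimal, self-contained sketch: it writes a small gradio chat app that forwards messages to the local OpenAI-compatible vLLM endpoint (assumed to be on port 8000) and serves the UI on port 8001. The script path and model path are assumptions matching the rest of this tutorial.

```bash
# Write a minimal gradio chat app that proxies to the local vLLM server.
cat > /root/chat.py <<'EOF'
import gradio as gr
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the api_key value is ignored.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "/models/Meta-Llama-3.1-405B-Instruct"  # assumed container mount path

def chat(message, history):
    # history arrives as a list of {"role": ..., "content": ...} dicts.
    messages = [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})
    response = client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content

gr.ChatInterface(chat, type="messages").launch(server_name="0.0.0.0", server_port=8001)
EOF

# Launch the chat app as a background process and log its output.
nohup python3 /root/chat.py > /root/chat.log 2>&1 &
```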
You should now be able to chat with the model via the public IP address of the head node (the IP provided in the connection string in the Megacluster console). Go to http://<HEAD_NODE_PUBLIC_IP>:8001 to see the chat interface.