Run Llama 405B inference in BF16 across multiple nodes.
The snippets below use `Meta-Llama-3.1-405B-Instruct`. If you do not have access to it, you can replace `Meta-Llama-3.1-405B-Instruct` with `Hermes-3-Llama-3.1-405B` (or any other model path) in the code snippets.
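As a sketch of what that substitution looks like, assuming the snippets launch the model with `vllm serve` and a Hugging Face model ID (the org prefixes below are the actual Hugging Face repositories):

```bash
# Gated model; requires approved access on Hugging Face:
vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct

# Open-weights alternative with the same architecture:
vllm serve NousResearch/Hermes-3-Llama-3.1-405B
```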
Set `GLOO_SOCKET_IFNAME` to the name of the network interface your nodes use to reach each other.
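For example, a minimal sketch (`eth0` is a placeholder; use the interface name that `ip addr` reports on your machines):

```bash
# Replace eth0 with the actual interface your nodes use to reach each other.
export GLOO_SOCKET_IFNAME=eth0
```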
Set the `HEAD_NODE_ADDRESS` environment variable to the IP of the head node by replacing `<HEAD_NODE_IP>` in the command below with the IP address obtained from the previous command:
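A minimal sketch of that command, keeping the placeholder:

```bash
# Substitute <HEAD_NODE_IP> with the head node's IP address.
export HEAD_NODE_ADDRESS=<HEAD_NODE_IP>
```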
Verify that the other nodes can ping `$HEAD_NODE_ADDRESS` and that the head node can ping the other nodes' IP addresses.
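One way to check both directions (`<WORKER_NODE_IP>` is a hypothetical placeholder for a worker node's address):

```bash
# From each worker node: confirm the head node is reachable.
ping -c 3 $HEAD_NODE_ADDRESS

# From the head node: confirm each worker node is reachable.
ping -c 3 <WORKER_NODE_IP>
```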
If you see network interfaces with the `mlx` prefix, similar to the example below, your node has an InfiniBand interconnect. Otherwise, it does not, and you should skip this section.
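One way to check, with illustrative output (device names vary from machine to machine):

```bash
# List RDMA devices; names with the mlx prefix indicate Mellanox InfiniBand hardware.
ls /sys/class/infiniband
# mlx5_0  mlx5_1  mlx5_2  mlx5_3
```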
Add the InfiniBand-related environment variables to `NETWORK_ENVS`:
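A sketch of what `NETWORK_ENVS` might hold, assuming it is a shell variable of `-e` flags passed to `docker run`; the interface name and NCCL settings are assumptions that depend on your cluster:

```bash
# eth0 and the mlx5 HCA prefix are illustrative; adjust to your hardware.
NETWORK_ENVS="-e GLOO_SOCKET_IFNAME=eth0 \
  -e NCCL_SOCKET_IFNAME=eth0 \
  -e NCCL_IB_HCA=mlx5"
```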
```bash
sudo docker ...
```
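The full command is elided above; a hedged sketch of what it could look like on the head node (image tag, mounts, and Ray port are assumptions):

```bash
# Head node: start a Ray head process inside the vLLM container.
sudo docker run -d --name vllm-server \
  --gpus all --network host --ipc host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  $NETWORK_ENVS \
  --entrypoint /bin/bash \
  vllm/vllm-openai:latest \
  -c "ray start --head --port=6379 --block"

# Worker nodes join the cluster instead:
#   ... -c "ray start --address=$HEAD_NODE_ADDRESS:6379 --block"
```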
Run `docker exec -it vllm-server /bin/bash` to enter the container.
Execute `ray status` to check the status of the Ray cluster. You should see the expected number of nodes and GPUs.
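For example, with two 8-GPU nodes you would expect `ray status` to report something along these lines (abbreviated and illustrative):

```bash
ray status
# ======== Autoscaler status ========
# Active:
#  2 node_...
# Resources:
#  0.0/16.0 GPU
```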