Deploy and manage multi-node clusters with Slurm workload orchestration on Prime Intellect Platform.
1. Boot a cluster with shared storage and the Slurm orchestrator
2. Access the controller node
3. Verify cluster status
When requesting GPUs, use `--gres=gpu:N` instead of `--gpus-per-node=N` to avoid `InvalidAccount` errors. Create a job script (for example, `job.sh`) with `#SBATCH` directives, then submit it with `sbatch`.
Output files are written to the directory where `sbatch` was executed, which may be on the compute node's local filesystem if you're not in shared storage. Always submit jobs from the shared storage directory to ensure output accessibility.

If you see `InvalidAccount` errors when submitting batch jobs:
- Use `--gres=gpu:N` instead of `--gpus-per-node=N`
- Remove `#SBATCH --account=` directives from your scripts
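A minimal sketch of applying both fixes to an existing script with `sed`; the script contents below are made up to reproduce the error, and in practice you would edit a copy of your own script:

```shell
# Start from a script that triggers InvalidAccount (illustrative content):
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --account=myproj
#SBATCH --gpus-per-node=8
srun hostname
EOF

# Fix 1: request GPUs with --gres=gpu:N instead of --gpus-per-node=N
sed -i 's/--gpus-per-node=/--gres=gpu:/g' job.sh
# Fix 2: remove the --account directive entirely
sed -i '/^#SBATCH --account=/d' job.sh
```

After the rewrite, the script requests GPUs via `--gres` and carries no account directive, so `sbatch job.sh` should no longer hit `InvalidAccount`.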
If output files are missing, check the directory where `sbatch` was run; Slurm writes output there by default. Add `--output` and `--error` directives to your script to write logs to explicit paths in shared storage, then resubmit with `sbatch`.
If jobs stay pending, confirm the nodes are in the `idle` state with `sinfo`, and test an individual node with `srun --nodelist=<node> hostname`.
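The idle check can also be scripted by filtering `sinfo` output. The sketch below runs against captured sample output rather than a live cluster; the node names and states are made up:

```shell
# Sample output of: sinfo -N -h -o "%N %t"   (node name, compact state)
cat > sinfo_sample.txt <<'EOF'
node-01 idle
node-02 alloc
node-03 idle
node-04 down
EOF

# Nodes ready to accept jobs are those in the "idle" state:
awk '$2 == "idle" { print $1 }' sinfo_sample.txt
# prints node-01 and node-03
```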
Use `sinfo` and `squeue` to monitor node and queue state and optimize job scheduling.
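Similarly, `squeue` output can be filtered to spot scheduling bottlenecks. This sketch also uses captured sample output; the job IDs, states, and pending reasons are made up:

```shell
# Sample output of: squeue -h -o "%i %t %r"   (job id, compact state, pending reason)
cat > squeue_sample.txt <<'EOF'
101 R None
102 PD Resources
103 PD Priority
EOF

# Jobs pending for lack of resources suggest the cluster is saturated:
awk '$2 == "PD" && $3 == "Resources" { print $1 }' squeue_sample.txt
# prints 102
```

A steady backlog of `Resources`-pending jobs is a signal to add nodes or shrink per-job resource requests.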