Slurm makes a bunch of separate machines look and act much like a single cluster.
Naming Convention of Nodes
A typical cluster comprises management nodes and compute nodes. This article takes our cluster as an example to demonstrate the steps to install and configure Slurm. In our case, the management node is called
clab-mgt01, while the compute nodes are named
clab01 through clab20 in order.
Execute the following command to install the dependencies on all machines (
clab-all refers to all machines, including management and compute nodes).
clab-all$ sudo apt install slurm-wlm slurm-client munge
Tips: Several tools can help manage multiple nodes at once:
- iTerm2 (on Mac) / Terminator (on Linux)
- csshX (on Mac) / cssh (on Linux)
- Parallel SSH (at cluster side)
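For instance, Parallel SSH can push the install command to every node in one shot. A minimal sketch, assuming passwordless SSH from the node you run it on and a hosts file listing all node names (the file path and the three names below are placeholders):

```shell
# Hypothetical hosts file; on the real cluster, list every node, one per line
cat > /tmp/clab-hosts <<'EOF'
clab-mgt01
clab01
clab02
EOF

# Run the install on every listed node at once; guarded so the sketch
# is harmless on a machine without parallel-ssh installed
if command -v parallel-ssh >/dev/null; then
  parallel-ssh -h /tmp/clab-hosts -i "sudo apt install -y slurm-wlm slurm-client munge"
fi
```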
Generate Slurm Configuration
There is an official online configuration generator, and we should carefully check the fields below.
- SlurmctldHost: the hostname of the management node,
clab-mgt01 in our case.
- NodeName: the compute node names,
clab[01-20] in our case.
- CPUs: It is recommended to leave it blank.
- Sockets: For the dual-socket servers we commonly see, it should be 2.
- CoresPerSocket: Number of physical cores per socket.
- ThreadsPerCore: For a regular x86 server, if hyperthreading is enabled, it should be 2; otherwise 1.
- RealMemory: Optional; the usable physical memory of each node, in megabytes.
Click Submit, then copy the generated content to
/etc/slurm-llnl/slurm.conf on all machines.
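With the example answers above (dual-socket, hyperthreading on), the relevant part of the generated file would look roughly like this. The core count of 16 and the partition name are hypothetical; always take the actual values from the generator rather than copying these:

```
SlurmctldHost=clab-mgt01
NodeName=clab[01-20] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
PartitionName=main Nodes=clab[01-20] Default=YES MaxTime=INFINITE State=UP
```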
Tips: Don't forget the shared storage (e.g. NFS storage) on the cluster. We could utilize it to distribute files.
Distribute Munge Key
Once Munge is installed successfully, the key
/etc/munge/munge.key is generated automatically. All machines are required to hold the same key, so we should distribute the key on the management node to the remaining nodes, including the compute nodes and any backup management nodes, if present.
Tips: Again, we could utilize the shared storage to distribute the key.
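Without shared storage, scp from the management node works too. The loop below only prints the commands to run (a sketch; it assumes passwordless root SSH and the clab01..clab20 naming used in this article):

```shell
# Print one scp command per compute node; run them (or pipe to `bash`)
# on clab-mgt01 once passwordless root SSH is set up
for i in $(seq -w 1 20); do
  echo scp /etc/munge/munge.key "root@clab$i:/etc/munge/munge.key"
done
```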
Then make sure the permission and the ownership are correctly set.
clab-all$ sudo chown munge:munge /etc/munge/munge.key
clab-all$ sudo chmod 400 /etc/munge/munge.key
Patch Slurm Cgroup Integration
By default, Slurm does not work well with cgroups. If we start the Slurm service right now, we may receive the error shown below.
error: cgroup namespace 'freezer' not mounted. aborting
Therefore, this issue can be fixed by pasting the following content into
/etc/slurm/cgroup.conf on the compute nodes:
CgroupMountpoint=/sys/fs/cgroup
or by using this command:
echo CgroupMountpoint=/sys/fs/cgroup | sudo tee -a /etc/slurm/cgroup.conf
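To confirm the line actually landed in the file, grep for it afterwards. The sketch below rehearses the append on a scratch file, since writing the real /etc/slurm/cgroup.conf needs root; run the same grep against the real file on a compute node:

```shell
# Scratch stand-in for /etc/slurm/cgroup.conf
CONF=$(mktemp)
echo CgroupMountpoint=/sys/fs/cgroup >> "$CONF"
# Verify the setting is present
grep '^CgroupMountpoint=' "$CONF"
# → CgroupMountpoint=/sys/fs/cgroup
```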
Fix Directory Permission
For unknown reasons, the permission of the relevant directory is not set properly, which may lead to this error.
slurmctld: fatal: mkdir(/var/spool/slurmctld): Permission denied
The solution is to execute the commands below on the management nodes.
clab-mgt$ sudo mkdir -p /var/spool/slurmctld
clab-mgt$ sudo chown slurm:slurm /var/spool/slurmctld
Start Slurm Service
So far, we have finished the basic configuration. Let us launch Slurm now.
# On management nodes
clab-mgt$ sudo systemctl start slurmctld
# On compute nodes
clab$ sudo systemctl start slurmd
Then execute
sinfo, and we should see that all the compute nodes are ready.
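If everything started cleanly, the sinfo output should look roughly like this (the partition name and column values are illustrative; what matters is the compute nodes reporting an idle state):

```
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
main*        up   infinite     20   idle clab[01-20]
```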
If your Slurm is not working correctly, you could try these commands to debug; they run the daemons in the foreground and log to the terminal.
clab-mgt$ sudo slurmctld -D
clab$ sudo slurmd -D