Slurm makes a group of separate machines behave much like a single cluster, right?

## Naming Convention of Nodes

A typical cluster comprises management nodes and compute nodes. This article uses our cluster as an example to demonstrate the steps to install and configure Slurm. In our case, the management node is called clab-mgt01, and the compute nodes are named clab01 through clab20.

## Install Dependencies

Execute the following command to install the dependencies on all machines (clab-all refers to all machines, including the management and compute nodes).
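As a rough sketch, assuming Debian or Ubuntu machines (which the /etc/slurm-llnl path used later suggests), the dependencies could be installed with apt. The package names `munge` and `slurm-wlm` and the helper names `install_slurm_deps` and `missing_tools` are assumptions for illustration, not necessarily the exact command used on our cluster.

```shell
# Assumption: Debian/Ubuntu with apt; slurm-wlm is the Slurm metapackage there.
install_slurm_deps() {
    sudo apt-get update &&
    sudo apt-get install -y munge slurm-wlm
}

# Print every given command name that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Possible use on each node:
#   install_slurm_deps && missing_tools munge sinfo scontrol
```

If `missing_tools` prints nothing, the client tools it was asked about are available on that node.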

Tips: There are several tools that may help to manage multiple nodes easily:

• iTerm2 (on Mac) / Terminator (on Linux)
• csshX (on Mac) / cssh (on Linux)
• Parallel SSH (at cluster side)
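Parallel SSH reads its targets from a host file. A minimal sketch that generates one from the naming convention above; the function name `clab_hosts` and the file name `hosts.txt` are hypothetical, and the Parallel SSH binary may be called `pssh` instead of `parallel-ssh` depending on the distribution.

```shell
# Print the management node followed by clab01 .. clab20, one per line.
clab_hosts() {
    echo "clab-mgt01"
    i=1
    while [ "$i" -le 20 ]; do
        printf 'clab%02d\n' "$i"
        i=$((i + 1))
    done
}

# Possible use:
#   clab_hosts > hosts.txt
#   parallel-ssh -h hosts.txt -i hostname
```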

## Generate Slurm Configuration

Slurm provides an official online configuration generator. We should carefully check the fields below.

• SlurmctldHost: clab-mgt01 in our case.
• NodeName: clab[01-20] in our case.
• CPUs: It is recommended to leave it blank.
• Sockets: For a typical dual-socket server, it should be 2.
• CoresPerSocket: Number of physical cores per socket.
• ThreadsPerCore: For a regular x86 server, if hyperthreading is enabled, it should be 2, otherwise 1.
• RealMemory: Optional.
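To illustrate how these fields end up in slurm.conf, the sketch below prints the two relevant lines. The helper name `node_line` and the value of 16 cores per socket are hypothetical; substitute the values of the actual hardware.

```shell
# Format the NodeName line from sockets, cores per socket, and threads per core.
node_line() {
    printf 'NodeName=clab[01-20] Sockets=%d CoresPerSocket=%d ThreadsPerCore=%d State=UNKNOWN\n' \
        "$1" "$2" "$3"
}

echo "SlurmctldHost=clab-mgt01"
node_line 2 16 2   # dual socket, 16 cores per socket (assumed), hyperthreading on
# prints:
#   SlurmctldHost=clab-mgt01
#   NodeName=clab[01-20] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 State=UNKNOWN
```

Once Slurm is installed, running `slurmd -C` on a compute node prints the hardware values that Slurm itself detects, which is a useful cross-check.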

Click Submit, then copy the generated file content to /etc/slurm-llnl/slurm.conf on all machines.

Tips: Don't forget the shared storage (e.g. NFS storage) on the cluster. We could utilize it to distribute files.

## Distribute Munge Key

Once Munge is installed successfully, the key /etc/munge/munge.key is generated automatically. All machines are required to hold the same key. Therefore, we could distribute the key on the management node to the remaining nodes, including the compute nodes and any backup management nodes.

Tips: Again, we could use the shared storage to distribute the key.

Then make sure the permissions and the ownership are set correctly.
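A minimal sketch of that step, assuming the usual Munge convention that the key is readable only by its owner (mode 0400) and owned by the `munge` user. The helper name `secure_munge_key` is hypothetical.

```shell
# Restrict a Munge key file to owner-read-only and, when possible,
# hand it to the munge user. Defaults to the standard key path.
secure_munge_key() {
    key="${1:-/etc/munge/munge.key}"
    chmod 400 "$key" || return 1
    # Changing ownership needs root and an existing munge account,
    # so it is only attempted when both conditions hold.
    if [ "$(id -u)" -eq 0 ] && id munge >/dev/null 2>&1; then
        chown munge:munge "$key"
    fi
}

# Possible use on each node, as root:
#   secure_munge_key
```

After replacing the key on a node, the Munge service likely needs a restart (for example `sudo systemctl restart munge`) before it picks up the new key.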

## Patch Slurm Cgroup Integration

By default, Slurm does not work well with cgroups. If we start the Slurm service right now, we may receive the error shown below.

Pasting the following content into /etc/slurm/cgroup.conf on the compute nodes fixes this issue.

Alternatively, use this command:

## Fix Directory Permission

For unknown reasons, the permissions of the relevant directory are not set properly, which may lead to this error.

The solution is to execute the commands below on the management nodes.

## Start Slurm Service

So far, we have finished the basic configuration. Let us launch Slurm now.
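A sketch of the launch step, assuming systemd and the standard unit names `munge`, `slurmctld` (management node), and `slurmd` (compute nodes). The helper names and the role labels `mgt` and `compute` are hypothetical.

```shell
# Print the services that belong on a node of the given role.
services_for_role() {
    case "$1" in
        mgt)     echo "munge slurmctld" ;;
        compute) echo "munge slurmd" ;;
        *)       echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Enable the services at boot and start them immediately.
start_slurm_services() {
    services=$(services_for_role "$1") || return 1
    # $services is deliberately unquoted so each unit becomes its own argument.
    sudo systemctl enable --now $services
}

# Possible use:
#   start_slurm_services mgt       # on clab-mgt01
#   start_slurm_services compute   # on clab01 .. clab20
```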

Run sinfo, and we should see that all the compute nodes are ready.
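For a quick scripted check, the sketch below counts idle nodes from `sinfo` output, where `%T` is the node state and `%D` the node count. The helper name `count_idle_nodes` is hypothetical, and nodes that belong to several partitions may be counted more than once.

```shell
# Sum the node counts of all input lines whose state is exactly "idle".
# Expected input: one "<state> <count>" pair per line.
count_idle_nodes() {
    awk '$1 == "idle" { total += $2 } END { print total + 0 }'
}

# Possible use:
#   sinfo -h -o '%T %D' | count_idle_nodes
```

On our cluster the result should be 20 when every compute node is ready. States with a suffix, such as `idle*` for a node that is not responding, are intentionally not counted.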

## Debugging Tips

If Slurm is not working correctly, you could try these commands to debug it.
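One common approach is to run the daemon in the foreground with verbose logging, which is a sketch of what such debugging might look like rather than a definitive list. `-D` keeps the daemon in the foreground and each `-v` raises verbosity; the helper name `debug_daemon` and the role labels are hypothetical, and the matching systemd service should be stopped first.

```shell
# Run the Slurm daemon for the given role in the foreground with verbose logs.
debug_daemon() {
    case "$1" in
        mgt)     sudo slurmctld -D -vvv ;;
        compute) sudo slurmd -D -vvv ;;
        *)       echo "usage: debug_daemon mgt|compute" >&2; return 2 ;;
    esac
}

# Possible use on a compute node:
#   sudo systemctl stop slurmd
#   debug_daemon compute
```

Other commands that may help include `scontrol show node clab01` to inspect the state and reason recorded for a node, and `journalctl -u slurmd` to read the service log.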