Slurm job scheduler – SDU eScience

At ABACUS 2.0 we use Slurm for scheduling jobs.

In general, ABACUS 2.0 is intended to be used for batch jobs, i.e. jobs which run without any user intervention. It is also possible to run interactive jobs; see the Interactive jobs section below.

A typical usage scenario is the following:

1. The user logs in to fe.deic.sdu.dk.
2. A previous job script is edited with new parameters.
3. The job script is submitted to the job queue.
4. The user logs out.
5. Later, after the job has completed, the user logs in again to retrieve the result.

Job scripts as mentioned in Steps 2 and 3 contain both details of which computer resources are needed (number and types of nodes, etc.) and details on which application should be run and how (name and version of the application, input and output, etc.).

General commands

You can use man to get further documentation on the commands mentioned later:

testuser@fe1:~$ man COMMAND

Try the following commands:

testuser@fe1:~$ man sbatch
testuser@fe1:~$ man squeue
testuser@fe1:~$ man scancel

Accounts

To see which accounts are available to you, including how many node hours are available, use the command abc-quota:

testuser@fe1:~$ abc-quota

Available node hours per account/user
=====================================

Account/user |   Quota   Avail | UsedPeriod   % of Qt | UsedMonth
------------ + ------- ------- + ---------- --------- + ---------

test00_gpu   |   2,000   1,220 |        780    39.4 % |       650
 otheruser   |                 |         80     4.4 % |        50
 testuser *  |                 |        700    35.0 % |       600

...

In this case, testuser can use the account test00_gpu. Within this accounting period, the user testuser has used 700 node hours, and the test00_gpu account has used 780 node hours in total. 1,220 node hours are still available. As the UsedMonth column shows, most of these node hours were used during the current month.
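
The relationship between the columns can be checked with a little arithmetic: the Avail value is the Quota minus the account's UsedPeriod total. The following is a minimal sketch, assuming the column layout of the sample output above; the function name avail_from_line is hypothetical and not part of abc-quota.

```shell
#!/bin/bash
# Account line copied from the sample abc-quota output above.
line='test00_gpu   |   2,000   1,220 |        780    39.4 % |       650'

# avail_from_line LINE
# Removes thousands separators, then computes Quota - UsedPeriod.
# After splitting on whitespace: $3 = Quota, $4 = Avail, $6 = UsedPeriod.
avail_from_line() {
  printf '%s\n' "$1" | tr -d ',' | awk '{ print $3 - $6 }'
}

avail_from_line "$line"   # prints 1220, matching the Avail column
```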

Submitting jobs

The following is a minimal job script – it generates a lot of random numbers and then sorts them. See later for a more realistic job script. For any job script, you should specify the account to use, the number of nodes you want (default 1), and the maximum wall time (at most 24 hours):

#! /bin/bash
#
#SBATCH --account test00_gpu      # account
#SBATCH --nodes 1                 # number of nodes
#SBATCH --time 2:00:00            # max time (HH:MM:SS)

for i in {1..10000}; do
  echo $RANDOM >> random.txt
done

sort random.txt

Note that Slurm parameters must be specified at the top of the file before any real commands. Further, #SBATCH must appear at the start of the line written exactly as #SBATCH.
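
Before submitting, it can help to scan a job script for #SBATCH lines that are indented or that follow the first real command. The sketch below is a hypothetical helper (check_sbatch_lines is not a Slurm command) and covers only these two cases.

```shell
#!/bin/bash
# check_sbatch_lines FILE
# Prints one warning per suspicious #SBATCH line; prints nothing if none are found.
check_sbatch_lines() {
  awk '
    /^#SBATCH/       { if (seen_cmd) print "line " NR ": #SBATCH after first command"; next }
    /^[ \t]+#SBATCH/ { print "line " NR ": #SBATCH not at start of line"; next }
    /^[ \t]*#/       { next }   # ordinary comment, including the #! line
    /^[ \t]*$/       { next }   # blank line
                     { seen_cmd = 1 }   # anything else counts as a real command
  ' "$1"
}

# Usage on the frontend node (assuming the file exists):
#   check_sbatch_lines myscript.sh
```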

To submit the job, write the above contents to a file, e.g. myscript.sh, and run the command:

testuser@fe1:~$ sbatch myscript.sh

You can also add extra options for sbatch overriding the values in the script itself, e.g.,

testuser@fe1:~$ sbatch --time 4:00:00 myscript.sh
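
When sbatch accepts a job it normally prints a confirmation line of the form "Submitted batch job <jobid>". If you want to reuse the job ID in later commands, a small helper can extract it. This is a minimal sketch; the function name job_id_from_output is hypothetical.

```shell
#!/bin/bash
# job_id_from_output TEXT
# Prints the job ID found in sbatch's confirmation line, or nothing if absent.
job_id_from_output() {
  printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# On the frontend node this could be used as:
#   out=$(sbatch myscript.sh)
#   jobid=$(job_id_from_output "$out")
job_id_from_output "Submitted batch job 123456"   # prints 123456
```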

Information on jobs

List all current jobs, only the running jobs, or only the pending jobs for the user testuser:

testuser@fe1:~$ squeue -u testuser
testuser@fe1:~$ squeue -u testuser -t RUNNING
testuser@fe1:~$ squeue -u testuser -t PENDING

List detailed information for a job (sometimes useful for troubleshooting):

testuser@fe1:~$ scontrol show jobid -dd <jobid>

To cancel a single job, all jobs or all pending jobs for the user testuser:

testuser@fe1:~$ scancel <jobid>
testuser@fe1:~$ scancel -u testuser
testuser@fe1:~$ scancel -u testuser -t PENDING
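
If you have many jobs, a per-state summary is often easier to read than the full list. The sketch below counts job states read from standard input; the helper name count_states is hypothetical, and the input is assumed to hold one state per line, as produced by squeue with the -h (no header) and -o %T (state only) options.

```shell
#!/bin/bash
# count_states
# Reads one job state per line on stdin and prints STATE=COUNT lines.
count_states() {
  sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# On the frontend node:
#   squeue -u testuser -h -o %T | count_states
# Demonstration with canned input:
printf '%s\n' RUNNING PENDING RUNNING | count_states
# prints:
#   PENDING=1
#   RUNNING=2
```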

Interactive jobs

It is also possible to run interactive jobs on ABACUS 2.0, i.e. jobs where you use one or more of our compute nodes, through a GUI or the command line, as if you were sitting at your own computer.

How to do this is described in the interactive jobs documentation.

Jobscript tips

Walltime, --time: Setting the maximum wall time as low as possible enables Slurm to pack your job onto idle nodes that are currently waiting for a large job to start.
Nodes, --nodes: If your job can be flexible, use a range for the number of nodes needed to run the job, e.g. --nodes=4-6. In this case your job starts running when at least 4 nodes are available. If 5 or 6 nodes are available at that time, your job gets all of them.
Tasks per node, --ntasks-per-node: Use this to select how many MPI ranks you want per node, e.g., 24 if you want one rank per CPU core or 2 if you want one MPI rank per GPU card.
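
Since the maximum wall time is 24 hours, it can be convenient to check a --time value before submitting. The following is a minimal sketch that handles only the HH:MM:SS form used in this document (Slurm accepts other forms as well); the function names are hypothetical.

```shell
#!/bin/bash
# walltime_seconds HH:MM:SS
# Converts a wall time in HH:MM:SS form to seconds.
walltime_seconds() {
  printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# within_limit HH:MM:SS
# Succeeds if the wall time does not exceed the 24 hour maximum.
within_limit() {
  [ "$(walltime_seconds "$1")" -le $(( 24 * 3600 )) ]
}

walltime_seconds 2:00:00   # prints 7200
if within_limit 2:00:00; then
  echo "2:00:00 is within the 24 hour limit"
fi
```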

Note that you do not need to specify the following in your job scripts:

Partition: The partition is automatically derived from the account you use, e.g., test00_gpu implies the partition gpu.
Memory use, e.g., --mem or --mem-per-cpu: By default you get all the RAM on the nodes you are running on.
GPU cards, e.g., --gres=gpu:2: If you are running on a GPU node, you automatically get access to both GPU cards.
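
To illustrate the account naming convention, the sketch below takes the text after the last underscore in the account name. This rule is only inferred from the single example test00_gpu above; the scheduler's own mapping is authoritative, and the function name partition_for_account is hypothetical.

```shell
#!/bin/bash
# partition_for_account ACCOUNT
# Prints the suffix after the last underscore, e.g. "gpu" for test00_gpu.
# Assumption: account names follow the <project>_<partition> pattern.
partition_for_account() {
  printf '%s\n' "${1##*_}"
}

partition_for_account test00_gpu   # prints gpu
```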

MPI jobs

For MPI jobs, you should use a combination of --nodes and --ntasks-per-node to get the number of nodes and MPI ranks per node you want. Both have a default value of 1.

For all MPI implementations available as a module at ABACUS 2.0, the recommended way to start MPI applications is using srun, i.e., not mpirun or similar.

#! /bin/bash
#
#SBATCH --account test00_gpu      # account
#SBATCH --nodes 4                 # number of nodes
#SBATCH --ntasks-per-node 24      # number of MPI tasks per node
#SBATCH --time 2:00:00            # max time (HH:MM:SS)

echo Running on "$(hostname)"
echo Available nodes: "$SLURM_NODELIST"
echo Slurm_submit_dir: "$SLURM_SUBMIT_DIR"
echo Start time: "$(date)"

# Load the modules previously used when compiling the application
module purge
module add gcc/4.8-c7 openmpi/1.8.4

# Start in total 4*24 MPI ranks on all available CPU cores
srun my-mpi-application -i input.txt -o output.txt

echo Done.
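
The total number of MPI ranks started by srun is the number of nodes multiplied by the number of tasks per node, here 4 × 24 = 96. As a rough sketch, this can be printed from inside the job script using the environment variables SLURM_NNODES and SLURM_NTASKS_PER_NODE that Slurm sets for the job; the fallback values and the function name total_ranks are assumptions added so the snippet also runs outside a job.

```shell
#!/bin/bash
# total_ranks NODES TASKS_PER_NODE
# Prints the total number of MPI ranks.
total_ranks() {
  echo $(( $1 * $2 ))
}

# Inside a job Slurm provides these variables; outside a job the
# defaults mirror the example script above (4 nodes, 24 tasks per node).
nodes=${SLURM_NNODES:-4}
per_node=${SLURM_NTASKS_PER_NODE:-24}

echo "Expecting $(total_ranks "$nodes" "$per_node") MPI ranks"
```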

Further jobscript examples

Purely sequential job

#!/bin/bash

#SBATCH --account test00_gpu      # account
#SBATCH --nodes 1                 # number of nodes
#SBATCH --time 2:00:00            # max time (HH:MM:SS)

./serial.exe

Amber, Gaussian, Gromacs, Namd, etc

You can find sample sbatch job scripts in the folder /opt/sys/documentation/sbatch-scripts/ on the ABACUS 2.0 frontend nodes. For the software packages installed on ABACUS 2.0, you can also look at our software page for further information.

Using as few switches as possible

The InfiniBand switches in ABACUS 2.0 are connected using a 3D torus. By default, Slurm always starts your job as soon as possible. When enough nodes are available for the job, i.e. the job is ready to start, Slurm packs the job onto the available nodes as well as possible.

If you have a very network-intensive job, you may want to ensure that your job is packed as well as possible, even at the cost of the job maybe starting later than would otherwise be possible.

For all the possible sbatch --switches options below, there is a time limit of one hour, i.e., after one hour, the --switches option is ignored.

`sbatch --switches 1`

Run everything using nodes from one switch (at most 16 slim/fat nodes or 18 gpu nodes).

`sbatch --switches 2`

Run everything using nodes from at most two neighbour switches (at most 32 slim/fat nodes or 34 gpu nodes).

`sbatch --switches 3`

Run everything using nodes from at most 2×2 neighbour switches (at most 64 slim/fat nodes or 72 gpu nodes). For both fat and gpu nodes, there is no need to specify this, as there are only 64 and 72 such nodes available, respectively.
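
The limits above can be turned into a small lookup that suggests the smallest --switches value for a given node count. This is only a sketch: the node limits are copied from the descriptions above, and the function name min_switches is hypothetical.

```shell
#!/bin/bash
# min_switches NODE_TYPE NODE_COUNT
# NODE_TYPE is "gpu"; any other value is treated as slim/fat.
# Prints the smallest --switches value whose limit covers NODE_COUNT,
# or fails if none of the three options covers it.
min_switches() {
  local type=$1 count=$2 one two three
  if [ "$type" = gpu ]; then
    one=18 two=34 three=72
  else
    one=16 two=32 three=64
  fi
  if   [ "$count" -le "$one" ];   then echo 1
  elif [ "$count" -le "$two" ];   then echo 2
  elif [ "$count" -le "$three" ]; then echo 3
  else
    echo "no --switches value listed above covers $count $type nodes" >&2
    return 1
  fi
}

# Example: 20 slim nodes need two neighbour switches.
echo "sbatch --switches $(min_switches slim 20) myscript.sh"
# prints: sbatch --switches 2 myscript.sh
```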