At ABACUS 2.0 we use Slurm for scheduling jobs.
In general, ABACUS 2.0 is intended to be used for batch jobs, i.e. jobs which run without any user intervention. It is also possible to run interactive jobs; this is described here.
A typical usage scenario is the following:
- The user logs in to fe.deic.sdu.dk.
- A previous job script is edited with new parameters.
- The job script is submitted to the job queue.
- The user logs out.
- Later, after the job has completed, the user logs in again to retrieve the result.
Job scripts, as mentioned in steps 2 and 3, contain both details of which computing resources are needed (number and types of nodes, etc.) and details of which application should be run and how (name and version of the application, input and output, etc.).
General commands
You can use man to get further documentation on the commands mentioned later:
testuser@fe1:~$ man COMMAND
Try the following commands:
testuser@fe1:~$ man sbatch
testuser@fe1:~$ man squeue
testuser@fe1:~$ man scancel
Accounts
To see which accounts are available to you, including how many node hours are available, use the command abc-quota:
testuser@fe1:~$ abc-quota
Available node hours per account/user
=====================================
Account/user |   Quota   Avail | UsedPeriod   % of Qt | UsedMonth
------------ + ------- ------- + ---------- --------- + ---------
test00_gpu   |   2,000   1,220 |        780    39.4 % |       650
otheruser    |                 |         80     4.4 % |        50
testuser *   |                 |        700    35.0 % |       600
...
In this case, testuser can use the account test00_gpu. Within this accounting period, the user testuser has used 700 node hours, and the test00_gpu account has used 780 node hours in total. 1,220 node hours are still available. As shown in the column UsedMonth, most node hours have been used during this month.
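As a rough illustration of the arithmetic behind this table, the sketch below recomputes the available node hours for test00_gpu and estimates the worst-case cost of a job. It assumes that a job is charged its number of nodes times its wall time in hours, which the text above does not state explicitly. The function names remaining_hours and job_cost are made up for this example and are not part of abc-quota or Slurm.

```shell
#!/bin/bash
# Hypothetical helpers; not part of abc-quota or Slurm.

# Remaining node hours = quota - node hours used in the current period.
remaining_hours() {
    echo $(( $1 - $2 ))
}

# Worst-case cost of a job, ASSUMING cost = nodes * wall time in hours.
job_cost() {
    echo $(( $1 * $2 ))
}

avail=$(remaining_hours 2000 780)   # numbers from the test00_gpu row
cost=$(job_cost 4 2)                # e.g. 4 nodes for at most 2 hours
echo "available: $avail node hours, job needs at most: $cost"
if [ "$cost" -le "$avail" ]; then
    echo "the job fits within the quota"
fi
```

With the numbers from the table, the remaining quota comes out as 1,220 node hours, matching the Avail column.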
Submitting jobs
The following is a minimal job script – it generates a lot of random numbers and then sorts them. See later for a more realistic job script. For any job script, you should specify the account to use, the number of nodes you want (default 1), and the maximum wall time (at most 24 hours):
#! /bin/bash
#
#SBATCH --account test00_gpu # account
#SBATCH --nodes 1 # number of nodes
#SBATCH --time 2:00:00 # max time (HH:MM:SS)
for i in {1..10000}; do
    echo $RANDOM >> random.txt
done
sort random.txt
Note that Slurm parameters must be specified at the top of the file before any real commands. Further, #SBATCH must appear at the start of the line, written exactly as #SBATCH.
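The placement rule for #SBATCH lines is easy to break when editing an old script, so a small check can help. The following is only a sketch: check_sbatch_placement is a hypothetical helper, not a Slurm command, and it only looks for #SBATCH lines that appear after the first line that is neither blank nor a comment.

```shell
#!/bin/bash
# Hypothetical helper: report #SBATCH lines that come after the first command.
check_sbatch_placement() {
    awk '
        /^#SBATCH/ {
            if (seen_cmd) { print "line " NR ": #SBATCH after first command"; bad = 1 }
            next
        }
        /^[[:space:]]*#/ { next }   # shebang and ordinary comments
        /^[[:space:]]*$/ { next }   # blank lines
        { seen_cmd = 1 }            # anything else counts as a command
        END { exit bad + 0 }
    ' "$1"
}

# Try it on two throwaway scripts.
good=$(mktemp)
bad=$(mktemp)
printf '#!/bin/bash\n#SBATCH --nodes 1\nsort random.txt\n' > "$good"
printf '#!/bin/bash\nsort random.txt\n#SBATCH --nodes 1\n' > "$bad"
check_sbatch_placement "$good" && echo "first script: OK"
check_sbatch_placement "$bad" || echo "second script: fix the lines listed above"
rm -f "$good" "$bad"
```

The function returns a non-zero exit status when it finds a misplaced line, so it can be chained in front of sbatch with `&&`.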
To submit the job, write the above contents to a file, e.g. myscript.sh, and run the command:
testuser@fe1:~$ sbatch myscript.sh
You can also add extra options for sbatch, overriding the values in the script itself, e.g.:
testuser@fe1:~$ sbatch --time 4:00:00 myscript.sh
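On success, sbatch prints a line of the form "Submitted batch job <jobid>". If you want to reuse the job ID later, e.g. with scontrol or scancel, you can pick it out of that line. The sketch below shows one way to do so; extract_jobid is a hypothetical helper, and the sample ID 12345 is made up. When sbatch is not installed, the example only demonstrates the parsing on a sample line.

```shell
#!/bin/bash
# Hypothetical helper: extract the numeric job ID from the line that
# sbatch prints on success, e.g. "Submitted batch job 12345".
extract_jobid() {
    printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*/\1/p'
}

if command -v sbatch >/dev/null 2>&1 && [ -f myscript.sh ]; then
    jobid=$(extract_jobid "$(sbatch myscript.sh)")
    echo "submitted job $jobid"
else
    # No Slurm here: show the parsing on a sample line instead.
    extract_jobid "Submitted batch job 12345"
fi
```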
Information on jobs
List all current, running or pending jobs for the user testuser:
testuser@fe1:~$ squeue -u testuser
testuser@fe1:~$ squeue -u testuser -t RUNNING
testuser@fe1:~$ squeue -u testuser -t PENDING
List detailed information for a job (sometimes useful for troubleshooting):
testuser@fe1:~$ scontrol show jobid -dd <jobid>
To cancel a single job, all jobs or all pending jobs for the user testuser:
testuser@fe1:~$ scancel <jobid>
testuser@fe1:~$ scancel -u testuser
testuser@fe1:~$ scancel -u testuser -t PENDING
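If you have many jobs in the queue, a per-state summary can be easier to read than the full squeue listing. The sketch below counts jobs per state; count_states is a hypothetical helper, and using "$USER" as the user name is an assumption. It relies on the squeue options -h (no header) and -o "%T" (print only the job state). When squeue is not installed, it runs on sample input instead.

```shell
#!/bin/bash
# Hypothetical helper: summarise job states read from stdin,
# one state name per line.
count_states() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

if command -v squeue >/dev/null 2>&1; then
    # -h drops the header line, -o "%T" prints only the job state.
    squeue -u "$USER" -h -o "%T" | count_states
else
    # No Slurm here: sample input standing in for squeue output.
    printf 'RUNNING\nPENDING\nRUNNING\n' | count_states
fi
```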
Interactive jobs
It is also possible to run interactive jobs on ABACUS 2.0, i.e. jobs where you use one or more of our compute nodes, through a GUI or the command line, as if you were sitting at your own computer.
How to do this is described here.
Jobscript tips
- Walltime, --time: Setting the maximum wall time as low as possible enables Slurm to pack your job on idle nodes that are currently waiting for a large job to start.
- Nodes, --nodes: If your job can be flexible, use a range for the number of nodes needed to run the job, e.g. --nodes=4-6. In this case your job starts running when at least 4 nodes are available. If at that time 5 or 6 nodes are available, your job gets all of them.
- Tasks per node, --ntasks-per-node: Use this to select how many MPI ranks you want per node, e.g. 24 if you want one rank per CPU core or 2 if you want one MPI rank per GPU card.
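The three tips above can be combined in one job script header. The following is a sketch rather than a recommended configuration: the account name is the sample account used elsewhere on this page, and the fallback of 4 nodes is an assumption that only makes the script runnable outside of Slurm. Inside a job, Slurm sets SLURM_JOB_NUM_NODES to the number of nodes actually allocated.

```shell
#!/bin/bash
#
#SBATCH --account test00_gpu      # account
#SBATCH --nodes 4-6               # accept anything from 4 to 6 nodes
#SBATCH --ntasks-per-node 24      # one MPI rank per CPU core
#SBATCH --time 1:00:00            # keep the wall time limit tight

# Slurm sets SLURM_JOB_NUM_NODES inside a job; fall back to 4 elsewhere.
nodes=${SLURM_JOB_NUM_NODES:-4}
ranks_per_node=24
total_ranks=$(( nodes * ranks_per_node ))
echo "Got $nodes nodes, i.e. $total_ranks MPI ranks in total"
```

Because the number of nodes is only known when the job starts, anything that depends on it is computed at run time rather than hard-coded.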
Note that you do not need to specify the following in your job scripts:
- Partition: The partition is automatically derived from the account you use, e.g. test00_gpu implies the partition gpu.
- Memory use, e.g. --mem or --mem-per-cpu: By default you get all the RAM on the nodes you are running on.
- GPU cards, i.e. --gres=gpu:2: If you are running on a GPU node, you automatically get access to both GPU cards.
MPI jobs
For MPI jobs, you should use a combination of --nodes and --ntasks-per-node to get the number of nodes and MPI ranks per node you want. Both have a default value of 1.
For all MPI implementations available as a module at ABACUS 2.0, the recommended way to start MPI applications is using srun, i.e. not mpirun or similar.
#! /bin/bash
#
#SBATCH --account test00_gpu # account
#SBATCH --nodes 4 # number of nodes
#SBATCH --ntasks-per-node 24 # number of MPI tasks per node
#SBATCH --time 2:00:00 # max time (HH:MM:SS)
echo Running on "$(hostname)"
echo Available nodes: "$SLURM_NODELIST"
echo Slurm_submit_dir: "$SLURM_SUBMIT_DIR"
echo Start time: "$(date)"
# Load the modules previously used when compiling the application
module purge
module add gcc/4.8-c7 openmpi/1.8.4
# Start in total 4*24 MPI ranks on all available CPU cores
srun my-mpi-application -i input.txt -o output.txt
echo Done.
Further jobscript examples
Purely sequential job
#!/bin/bash
#SBATCH --account test00_gpu # account
#SBATCH --nodes 1 # number of nodes
#SBATCH --time 2:00:00 # max time (HH:MM:SS)
./serial.exe
Amber, Gaussian, Gromacs, Namd, etc
You can find sample sbatch job scripts in the folder /opt/sys/documentation/sbatch-scripts/ on the ABACUS 2.0 frontend nodes. For the software packages installed on ABACUS 2.0, you can also look at our software page for further information.
Using as few switches as possible
The InfiniBand switches in ABACUS 2.0 are connected using a 3D torus. By default, Slurm always starts your job as soon as possible. When enough nodes are available for the job, i.e. the job is ready to start, Slurm packs the job on the available nodes as well as possible.
If you have a very network-intensive job, you may want to ensure that your job is packed as tightly as possible, even at the cost of the job maybe starting later than would otherwise be possible.
For all the possible sbatch --switches options below, there is a time limit of one hour, i.e. after one hour, the --switches option is ignored.
sbatch --switches 1
Run everything using nodes from one switch (at most 16 slim/fat nodes or 18 gpu nodes).
sbatch --switches 2
Run everything using nodes from at most two neighbour switches (at most 32 slim/fat nodes or 34 gpu nodes).
sbatch --switches 3
Run everything using nodes from at most 2×2 neighbour switches (at most 64/72 nodes). For both fat and gpu nodes, there is no need to specify this, as there are only 64 and 72 nodes available, respectively.
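To pick a value for --switches, you can compare the number of nodes your job needs with the limits listed above. The sketch below does this lookup. min_switches is a hypothetical helper, and the limits are copied from the list above, reading "64/72" as 64 slim/fat nodes and 72 gpu nodes, which is an interpretation of the text.

```shell
#!/bin/bash
# Hypothetical helper: smallest --switches value whose node limit, as quoted
# in the list above, can hold the job. Prints nothing if the job is too big.
min_switches() {
    nodes=$1
    kind=$2
    case $kind in
        gpu)      limits="18 34 72" ;;
        slim|fat) limits="16 32 64" ;;
        *)        return 1 ;;
    esac
    n=1
    for limit in $limits; do
        if [ "$nodes" -le "$limit" ]; then
            echo "$n"
            return 0
        fi
        n=$(( n + 1 ))
    done
    return 1
}

s=$(min_switches 20 slim)
echo "sbatch --switches $s --nodes 20 myscript.sh"
```

For a 20-node job on slim nodes this prints a command line using --switches 2, since 20 nodes do not fit on a single switch with at most 16 nodes.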