SLURM Workload Manager¶
SLURM is the workload manager and job scheduler used for Stallo.
There are two ways of starting jobs with SLURM: either interactively with `srun` or as a script with `sbatch`.
Interactive jobs are a good way to test your setup before you put it into a script or to work with interactive applications like MATLAB or Python. You immediately see the results and can check that all parts behave as you expected. See Interactive jobs for more details.
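As a quick illustration, an interactive session can be started like this (a minimal sketch; the account name is a placeholder for your own project account):

```
# Request one core for one hour and open an interactive shell on a compute node
$ srun --account=nnXXXXk --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
```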
SLURM Parameters¶
SLURM supports a multitude of different parameters. This enables you to effectively tailor your script to your needs when using Stallo, but also means that it is easy to get lost and waste your time and quota.
The following parameters can be used as command line parameters with `sbatch` and `srun`, or in a job script, see Job script examples.
To use them in a job script, start a new line with `#SBATCH` followed by the parameter.
Replace <….> with the value you want, e.g. `--job-name=test-job`.
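For example, a job script header using the basic settings below could look like this (a minimal sketch; the job name and email address are placeholders):

```
#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --output=slurm-%j.out      # %j expands to the job ID
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my.name@example.com
```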
Basic settings:¶
Parameter | Function |
---|---|
--job-name=<name> | Job name to be displayed by, for example, squeue |
--output=<path> | Path to the file where the job (error) output is written to |
--mail-type=<type> | Turn on mail notification; type can be one of BEGIN, END, FAIL, REQUEUE or ALL |
--mail-user=<email_address> | Email address to send notifications to |
Requesting Resources¶
Parameter | Function |
---|---|
--time=<d-hh:mm:ss> | Time limit for the job. The job will be killed by SLURM after the time has run out. Format: days-hours:minutes:seconds |
--nodes=<num_nodes> | Number of nodes. Multiple nodes are only useful for jobs with distributed memory (e.g. MPI). |
--mem=<MB> | Memory (RAM) per node. Number followed by unit prefix, e.g. 16G |
--mem-per-cpu=<MB> | Memory (RAM) per requested CPU core |
--ntasks-per-node=<num_procs> | Number of (MPI) processes per node. More than one is only useful for MPI jobs. The maximum number depends on the node type (number of cores) |
--cpus-per-task=<num_threads> | CPU cores per task. For MPI, use one. For parallelized applications, this is the number of threads. |
--exclusive | The job will not share nodes with other running jobs. You will be charged for the complete nodes even if you asked for less. |
Accounting¶
See also Partitions (queues) and services.
Parameter | Function |
---|---|
--account=<name> | Project (not user) account the job should be charged to. |
--partition=<name> | Partition/queue in which to run the job. |
--qos=devel | On Stallo the devel QOS (quality of service) can be used to submit short jobs for testing and debugging. |
Advanced Job Control¶
Parameter | Function |
---|---|
--array=<indexes> | Submit a collection of similar jobs, e.g. --array=1-10 (sbatch command only). See the official SLURM documentation |
--dependency=<state:jobid> | Wait with the start of the job until the specified dependencies have been satisfied, e.g. --dependency=afterok:123456 |
--ntasks-per-core=2 | Enables hyperthreading. Only useful in special circumstances. |
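As a sketch of how these combine on the command line (the job ID and script names below are illustrative):

```
# Submit ten similar jobs; each instance can read its index from $SLURM_ARRAY_TASK_ID
$ sbatch --array=1-10 array_job.sh

# Start postprocess.sh only after job 123456 has finished successfully
$ sbatch --dependency=afterok:123456 postprocess.sh
```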
Differences between CPUs and tasks¶
As a new user writing your first SLURM job script, the difference between `--ntasks` and `--cpus-per-task` is typically quite confusing.
Assuming you want to run your program on a single node with 16 cores, which SLURM parameters should you specify?
The answer is: it depends on whether your application supports MPI. MPI (Message Passing Interface) is a communication interface used for developing parallel computing programs on distributed memory systems. It is necessary for applications running on multiple computers (nodes) to be able to share (intermediate) results.
To decide which set of parameters you should use, check if your application utilizes MPI and therefore would benefit from running on multiple nodes simultaneously. If, on the other hand, you have a non-MPI-enabled application or made a mistake in your setup, it doesn't make sense to request more than one node.
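For the 16-core example above, the two cases would look roughly like this (a sketch based on the tables below):

```
# MPI application: 16 independent processes (tasks) on one node
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1

# Threaded (e.g. OpenMP) application: one process with 16 threads
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
```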
Settings for OpenMP and MPI jobs¶
Single node jobs¶
This applies to applications that are not optimized for HPC (high performance computing) systems, such as simple Python or R scripts and much of the software that is optimized for desktop PCs.
Simple applications and scripts¶
Many simple tools and scripts are not parallelized at all and therefore won't profit from more than one CPU core.
Parameter | Function |
---|---|
--nodes=1 | Start an unparallelized job on only one node |
--ntasks-per-node=1 | Only one task is necessary |
--cpus-per-task=1 | Just one CPU core will be used. |
--mem=<MB> | Memory (RAM) for the job. Number followed by unit prefix, e.g. 16G |
If you are unsure whether your application can benefit from more cores, try a higher number and observe the load of your job. If it stays at approximately one, there is no need to ask for more than one core.
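A complete job script for such a job could look like this (a minimal sketch; the account name and script name are placeholders):

```
#!/bin/bash
#SBATCH --job-name=simple-job
#SBATCH --account=nnXXXXk        # placeholder project account
#SBATCH --time=0-01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# my_script.py stands in for your own unparallelized application
python my_script.py
```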
OpenMP applications¶
OpenMP (Open Multi-Processing) is a multiprocessing library often used for programs on shared memory systems. Shared memory describes systems which share the memory between all processing units (CPU cores), so that each process can access all data on that system.
Parameter | Function |
---|---|
--nodes=1 | Start a parallel job for a shared memory system on only one node |
--ntasks-per-node=1 | For OpenMP, only one task is necessary |
--cpus-per-task=<num_threads> | Number of threads (CPU cores) to use |
--mem=<MB> | Memory (RAM) for the job. Number followed by unit prefix, e.g. 16G |
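A corresponding job script might look like this (a sketch; the binary name is a placeholder, and we assume the program reads OMP_NUM_THREADS as is usual for OpenMP applications):

```
#!/bin/bash
#SBATCH --job-name=openmp-job
#SBATCH --account=nnXXXXk        # placeholder project account
#SBATCH --time=0-04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=16G

# Tell the OpenMP runtime to use as many threads as SLURM allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./my_openmp_program
```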
Multiple node jobs (MPI)¶
For MPI applications.
Depending on the frequency and bandwidth demands of your setup, you can either just start a number of MPI tasks or request whole nodes. While using whole nodes guarantees low latency and high bandwidth, it usually results in a longer queuing time compared to a cluster-wide job. With the latter, the SLURM manager can distribute your tasks across all nodes of Stallo and utilize otherwise unused cores on nodes which, for example, run a 16-core job on a 20-core node. This usually results in shorter queuing times but slower inter-process connection speeds.
We strongly advise all users to ask for a given set of cores when submitting multi-core jobs. To make sure that you utilize full nodes, you should ask for core counts that are multiples of both 16 and 20 (80, 160, etc.) due to the hardware specifics of Stallo, i.e. submit the job with `--ntasks=80` if your application scales to this number of tasks.
This will make the best use of the resources and give the most predictable execution times. If your job requires more than the default available memory per core (32 GB/node gives 2 GB/core for 16-core nodes and 1.6 GB/core for 20-core nodes) you should adjust this need with the following parameter: `#SBATCH --mem-per-cpu=4GB`. When doing this, the batch system will automatically allocate 8 cores or fewer per node.
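Putting this together, the relevant header lines could look like this (a sketch; adjust the numbers to your application):

```
#SBATCH --ntasks=80          # multiple of both 16 and 20, so whole nodes can be filled
#SBATCH --mem-per-cpu=4GB    # only needed if the ~2 GB/core default is too little
```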
To use whole nodes¶
Parameter | Function |
---|---|
--nodes=<num_nodes> | Start a parallel job for a distributed memory system on several nodes |
--ntasks-per-node=<num_procs> | Number of (MPI) processes per node. The maximum number depends on the node type (16 or 20 on Stallo) |
--cpus-per-task=1 | Use one CPU core per task. |
--exclusive | The job will not share nodes with other running jobs. You don't need to specify memory as you will get all available memory on the node. |
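A whole-node MPI job script could then look like this (a sketch; it assumes your MPI installation integrates with SLURM and the binary name is a placeholder):

```
#!/bin/bash
#SBATCH --job-name=mpi-job
#SBATCH --account=nnXXXXk    # placeholder project account
#SBATCH --time=1-00:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20
#SBATCH --cpus-per-task=1
#SBATCH --exclusive

# mpirun picks up the task layout from the SLURM allocation
mpirun ./my_mpi_program
```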
To distribute your job¶
Parameter | Function |
---|---|
--ntasks=<num_procs> | Number of (MPI) processes in total. Equal to the number of cores |
--mem-per-cpu=<MB> | Memory (RAM) per requested CPU core. Number followed by unit prefix, e.g. 2G |
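The distributed variant of the same job only fixes the total task count and the per-core memory (a sketch, with the same placeholders as above):

```
#!/bin/bash
#SBATCH --job-name=mpi-job
#SBATCH --account=nnXXXXk    # placeholder project account
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=80
#SBATCH --mem-per-cpu=2G

mpirun ./my_mpi_program
```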
Scalability¶
You should run a few tests to see what is the best fit between minimizing runtime and maximizing your allocated cpu-quota. That is, you should not ask for more CPUs for a job than you can really utilize efficiently. Try to run your job on 1, 2, 4, 8, 16, etc. cores to see when the runtime for your job starts tailing off. When you start to see less than 30% improvement in runtime when doubling the CPU count, you should probably not go any further. Recommendations for a few of the most used applications can be found in Application guides.
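One simple way to run such a scaling test is a small submission loop (a sketch; scaling_job.sh stands in for your own job script):

```
# Submit the same job at increasing core counts and compare the runtimes afterwards
for n in 1 2 4 8 16 32; do
    sbatch --ntasks=$n --job-name=scaling-$n scaling_job.sh
done
```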
Partitions (queues) and services¶
SLURM differs slightly from the previous Torque system with respect to definitions of various parameters, and what was known as queues in Torque may be covered by both `--partition=...` and `--qos=...`.
We have the following partitions:
- normal: The default partition. Up to 48 hours of walltime.
- singlenode: If you ask for fewer resources than are available on one single node, this will be the partition your job will be put in. We may remove the single-user policy on this partition in the future. This partition is also for single-node jobs that run for longer than 48 hours.
- multinode: Request this partition if you ask for more resources than you will find on one node and request walltime longer than 48 hours.
- highmem: Use this partition to use the high memory nodes with 128 GB. You will have to apply for access to this partition by sending us an email explaining why you need these high memory nodes.
To figure out the walltime limits for the various partitions, type:
$ sinfo --format="%P %l" # small L
As a service to users who need to submit short jobs for testing and debugging, we have a service called devel. These jobs have higher priority, with a maximum of 4 hours of walltime and no option for prolonging runtime.
Jobs using the devel service will get higher priority than any other jobs in the system and will thus have a shorter queue delay than regular jobs. To prevent misuse, the devel service has the following limitations:
- Only one running job per user.
- Maximum 4 hours walltime.
- Only one job queued at any time; note that this applies to the whole queue.
You submit to the devel service by adding
#SBATCH --qos=devel
to your job script.
General job limitations¶
The following limits are the default per user in the batch system. Users can ask for increased limits by sending a request to support@metacenter.no.
Limit | Value |
---|---|
Maximum number of running jobs | 1024 |
Maximum CPUs per job | 2048 |
Maximum walltime | 28 days |
Maximum memory per job | No limit [1] |
[1] There is a practical limit of 128GB per compute node used.
Remark: even though we impose a 28-day runtime limit on Stallo, we only give one week's warning of system maintenance. Jobs with more than 7 days of walltime will be terminated and restarted if possible.
See the About Stallo chapter of the documentation if you need more information on the system architecture.