Tips

  1. Job arrays
  2. Job dependencies

Job arrays

Job arrays provide a means of submitting multiple similar jobs. The only difference being an array index, that is incremented for each job, that can be used in your batch script, or codes, to provide a different parameter, load data files, or change the flow of your code.

To submit an array job use the sarray with the #SARRAY –range=i-n directive. For example:

#!/bin/sh

#SBATCH --mail-type=ALL
#SBATCH --mail-user=uniqname@umich.edu
#SBATCH --time=1-0
#SBATCH --job-name=array_job_test

#SARRAY --range=1-10

srun Rscript ./script.R

This batch script will submit job 10 jobs with names array_job_test[1] … array_job_test[10] incrementing the $SLURM_ARRAYID with each job submission. This environment variable can be access directly in your code, for example in R like this

# grab the array id value from the environment variable passed from sbatch
slurm_arrayid <- Sys.getenv('SLURM_ARRAYID')

# coerce the value to an integer
n <- as.numeric(slurm_arrayid)

You can also use the variable as a command line parameter to your job like this example also in R

#!/bin/sh

#SBATCH --mail-type=ALL
#SBATCH --mail-user=uniqname@umich.edu
#SBATCH --time=1-0
#SBATCH --job-name=array_job_test

#SARRAY --range=1-10

srun Rscript ./script.R --array_id=${SLURM_ARRAYID}

Each job could output its results to seperate files based on this index, or possibly the job id using the $SLURM_JOBID environment variables using the same means as the array index. Once all jobs are complete use a separate job to combine these results into your final results.

When you start submitting hundreds of jobs with sarray it would be a good idea to turn off the email notifications by dropping the –mail-type and –mail-user options to avoid spamming yourself.

Job dependencies

It is possible to delay the start of a job until a specified dependency has been satisfied. For example suppose you have multiple jobs running concurrently that each generate seperate datasets that need to be aggregated as the final step. You could just wait for all the jobs to complete and download the results and combine on your local system. Or you could create another job to aggregate your results and set it dependent on all your jobs.

With dependencies you have many options. Here is the relevant porition of the man page for sbatch:

-d, --dependency=<dependency_list>
       Defer  the  start  of this job until the specified dependencies have been satisfied com‐
       pleted.  <dependency_list> is of the form <type:job_id[:job_id][,type:job_id[:job_id]]>.
       Many  jobs  can  share  the  same dependency and these jobs may even belong to different
       users. The  value may be changed after job submission using the scontrol command.

       after:job_id[:jobid...]
              This job can begin execution after the specified jobs have begun execution.

       afterany:job_id[:jobid...]
              This job can begin execution after the specified jobs have terminated.

       afternotok:job_id[:jobid...]
              This job can begin execution after the specified jobs  have  terminated  in  some
              failed state (non-zero exit code, node failure, timed out, etc).

       afterok:job_id[:jobid...]
              This  job can begin execution after the specified jobs have successfully executed
              (ran to completion with an exit code of zero).

       expand:job_id
              Resources allocated to this job should be used to expand the specified job.   The
              job  to  expand must share the same QOS (Quality of Service) and partition.  Gang
              scheduling of resources in the partition is also not supported.

       singleton
              This job can begin execution after any previously launched jobs sharing the  same
              job name and user have terminated.

This gives you many options for job dependencies. To continue with the example above, all your jobs that run concurrently should have the same –job-name, as well as the aggregation job, allowing you to use the singleton option for your aggregation job. This option will cause this job to wait for all jobs with the same name to complete before it will start. There are many more options for dependencies such as allowing you to handle job failure.