R

  1. Basic R Job
  2. Array job with R
  3. R array job with getenv()
  4. R array job with getopt()
  5. R with snow(fall)
  6. Snow/Snowfall node type SOCK
  7. Snow/Snowfall node type MPI
  8. R Slurm Package

Basic R Job

For the majority of your R work a simple batch script like below will suffice:

#!/bin/sh

#SBATCH --mail-type=ALL
#SBATCH --mail-user=uniqname@umich.edu
#SBATCH --time=1-0
#SBATCH --job-name=basic_r_job

srun R CMD BATCH ./script.R

Array job with R

sarray provides the iteration of the array to the batch script via the $SLURM_ARRAYID environment variable. This provides a means of changing the flow of your code, executing a different R script, or setting variables in your code. A basic batch script that uses sarray to run multiple R jobs:

#!/bin/sh

#SBATCH --mail-type=ALL
#SBATCH --mail-user=uniqname@umich.edu
#SBATCH --time=1-0
#SBATCH --job-name=basic_r_job

#SARRAY --range=1-10

srun R CMD BATCH ./script.R

R array job with getenv()

To access the variable from within R simply use the Sys.getenv() function.

#!/usr/bin/env Rscript

# grab the array id value from the environment variable passed from sbatch
slurm_arrayid <- Sys.getenv('SLURM_ARRAYID')

# coerce the value to an integer
n <- as.numeric(slurm_arrayid)

R array job with getopt()

To access the variable as a command line argugement

#!/usr/bin/env Rscript                                                                      

# spec contains at least 4 columns, as many as 5 columns.                                   
#                                                                                           
# column 1: long name of flag                                                               
#                                                                                           
# column 2: short name of flag                                                              
#                                                                                           
# column 3: argument flag.  0=no argument, 1=required argument, 2=optional argument         
#                                                                                           
# column 4: mode of argument.  one of "logical", "integer", "double", "complex", "character"
#                                                                                           
# column 5 (optional): description of option.                                               
#
# Examples:
#   $ ./example.R --array_id 42
#   $ ./example.R --array_id $SLURM_ARRAYID

library(getopt);                                                                            

opt = getopt(matrix(c(                                                                      
  'help',     'h', 0, "logical", "Usage information",                                       
  'array_id', 'a', 1, "integer", "Value of the $SLURM_ARRAYID env var"                      
),ncol=5,byrow=TRUE));                                                                      

if (!is.null(opt$help)) {                                                                   
  writeLines("Usage: test.R [options]");                                                    
  writeLines("\t--help,h\tPrint usage options");                                            
  writeLines("\t--array_id,a\tValue of $SLURM_ARRAY_ID env var");                           

  q(status=1);                                                                              
}                                                                                           

print(opt$array_id);                                                                        

n <- opt$array_id;

R with snow(fall)

The snow and snowfall packages can be used in R to simplify the process of writing parallel R scripts. These packages work by creating separate R processes (slaves/nodes) and then sending jobs to those nodes in parellel. You have a couple different options for communication between these nodes. The options are communicationare SOCKETS, MPI, and PVM. The biostatistics cluster does not implement PVM leaving you with a choice between MPI and SOCKETS. The basic difference between the two is that SOCKETS only allow communication within a single host and MPI communicates across the network. When you submit a R job to the cluster using snow/snowfall you should consider how many R cluster nodes you wish to start as this will help you decide which type you need to use. You can think of the snow/snowfall nodes as 1:1 with cores and nodes, each node you start will use a single core. If you need to start more slaves then a single cluster node has cores then you will need to use MPI so those R slaves can communicate across the network.

Here are some examples of sbatch batch file for each type:

Snow/Snowfall node type SOCK

This is a sample of initializing your cluster in snow and snowfall.

# snow cluster init
library(snow)
cl <- makeCluster(12, type="SOCK")

# snowfall cluster init
library(snowfall)
cl <- sfInit(parallel=T,cpus=12,type="SOCK")

This batch script requests a whole node be allocated to this job giving your code up to 12 cores. This means you can start 12 snow/snowfall nodes within your code. The key is the –nodes=1–1 option. This tells the scheduler that you need exactly one whole node for your job with access to all the cores in that node and that you want to use 12 cores in that node.

#!/bin/sh

#SBATCH --mail-type=ALL
#SBATCH --mail-user=uniqname@umich.edu
#SBATCH --time=1-0
#SBATCH --job-name=test_r_socket_job
#SBATCH --nodes=1-1
#SBATCH --cpus-per-task=12

srun R CMD BATCH ./script.R

Snow/Snowfall node type MPI

This is a sample of initializing your cluster in snow and snowfall.

# snow cluster init
library(snow)
cl <- makeCluster(100, type="MPI")

# snowfall cluster init
library(snowfall)
cl <- sfInit(parallel=T,cpus=100,type="MPI")

This batch script tells the scheduler that you would like to allocate 100 cpus(cores) to this single task, giving you 100 snow/snowfall nodes, that it will run for one day and requires 2GB of memory. Two key things about this patch script are the –ntasks=100 which is how you request those 100 cores and that there is no srun on the R CMD BATCH line. Do NOT include srun when using snow/snowfall and MPI. These libraries rely on Rmpi which will handle starting the mpi session based on the environment that sbatch creates for your job.

#!/bin/sh

#SBATCH --mail-type=ALL
#SBATCH --mail-user=uniqname@umich.edu
#SBATCH --ntasks=100
#SBATCH --time=1-0
#SBATCH --job-name=test_r_socket_job
#SBATCH --mem-per-cpu=2000

R CMD BATCH ./script.R

R Slurm Package

We have a locally developed R package for accessing the Slurm environment during job time. The package is called Slurm and the source is in github. This would be primarily useful within array jobs, either sarray or sbatch. Instead of digging into the environment with Sys.getenv() you could use this pacakge instead. Here are a couple examples

To get the array id from SLURM_ARRAYID when using sarray:

library('Slurm')
arrayid <- slurm.arrayid()

To get the array task id from SLURM_ARRAY_TASK_ID when using array job in sbatch:

library('Slurm')
taskid <- slurm.array_task_id()

More variables are available and documented in the rdoc.