Chapter 7 High Performance Computing at UQ
7.1 UQ’s Research Computing Centre
The Research Computing Centre (RCC) provides coordinated management and support of The University of Queensland’s sustained and substantial investment in eResearch. The RCC helps UQ researchers across disciplines make the most of the University’s eResearch technologies, such as High Performance Computing (HPC), data storage, data management, visualisation, workflow and videoconferencing.
Specifically for the MME lab, the RCC coordinates access to and support for three HPC clusters and the Research Data Manager (RDM).
7.1.1 HPC access
Go to https://www.qriscloud.org.au/ and register for an account. Log in with your UQ staff account, which may require logging out of any other UQ websites first. This step is automated, and you will be granted access immediately.
You must apply separately for each HPC cluster you want to access, providing a “technical justification”. Applications are also made at https://www.qriscloud.org.au/.
Tinaroo does not appear on the QRIScompute request page; go here instead.
The three HPC clusters under the RCC banner are:
FlashLite
The technical justification for FlashLite is:
- You need to read and write vast amounts of data, such as CMIP6 data.
- You keep running out of memory on Tinaroo or Awoonga
- David Abramson has asked us to demonstrate the BeeGFS filesystem caching enhancements
An example submission, made by Phil Dyer (access was approved very quickly):
MathMarEcol Bioregionalisation on FlashLite
My lab is working on large-scale marine ecology and climate models. The input data is around 30-50TB. David Abramson has asked us to use FlashLite to demonstrate the BeeGFS filesystem caching enhancements, which are only available on FlashLite.
Tinaroo
The technical justification for Tinaroo is:
- Your code runs faster when you can use more nodes (MPI)
- Need a GUI, such as with MATLAB
Using more nodes means your code runs across more than one computer at once. If you are only using multiple cores within a single computer, then Awoonga is a better choice.
Awoonga
The technical justification for Awoonga is:
- Your code is written to use multiple cores within your computer
- You are doing embarrassingly parallel work or parameter sweeps, where each run is independent
The zooplankton size spectrum modelling is embarrassingly parallel, because each grid cell can be computed while ignoring the surrounding grid cells. Computing GCMs would not be embarrassingly parallel, because at every time step each grid cell needs to know the state of its neighbours to compute its next state.
An example submission made by Phil Dyer, approved within 5 minutes.
MathMarEcol bioregionalisation
My code currently works in a shared memory mode within a single node. Most of the computations are embarrassingly parallel. I run parameter sweeps across different datasets as well, which can be run independently on separate nodes. I can make use of 20-40+ cores, and typically need about 5 to 10 GB per core within a run.
7.1.1.1 Network Access
You must be on the UQ intranet to access the servers. If you are connected to eduroam, UQ wired ethernet, or UQ wifi, then you are already on the intranet.
Everywhere else you will need to install the UQ VPN software and log in; follow the UQ ITS instructions for setting up the VPN.
7.1.1.2 The terminal
To access any of the HPCs, you will need a terminal program on your computer.
On macOS, this is Terminal.
On Windows, this is Command Prompt or Windows PowerShell (not Windows PowerShell ISE). If you don’t have an up-to-date Windows 10 machine, install PuTTY.
Some people like the Ubuntu app on Windows 10; just remember the /mnt/c syntax to get to the files on your computer.
7.1.2 Logging in
Once you have your terminal open, connect to a cluster with ssh using the login alias from the table below (see the example sketch after the table). Your username and password are the same as your UQ login.
Cluster | Login Nodes | Cluster Login Alias |
---|---|---|
Awoonga | awoonga1.rcc.uq.edu.au | awoonga.rcc.uq.edu.au |
FlashLite | flashlite1.rcc.uq.edu.au, flashlite2.rcc.uq.edu.au | flashlite.rcc.uq.edu.au |
Tinaroo | tinaroo1.rcc.uq.edu.au, tinaroo2.rcc.uq.edu.au | tinaroo.rcc.uq.edu.au |
Wiener | wiener.hpc.dc.uq.edu.au | wiener.hpc.dc.uq.edu.au |
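As a minimal sketch (the `<username>` placeholder is your UQ username), logging in to Awoonga from a terminal looks like this; swap in the alias of whichever cluster you have access to:

```bash
ssh <username>@awoonga.rcc.uq.edu.au
```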
7.1.3 Submitting a Job
TODO: Add a section on submitting a job via PBS to Awoonga/FlashLite.
For the moment, you can get more information here. NOTE: You need to be on the VPN if you are not on campus.
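Until that section is written, here is a hedged sketch of a PBSpro submission script, assuming a file called `myjob.pbs` (the resource values and file name are placeholders, not a tested Awoonga/FlashLite configuration):

```bash
#!/bin/bash
#PBS -N myjob                      # job name
#PBS -l select=1:ncpus=1:mem=4GB   # one chunk with 1 CPU and 4GB of memory
#PBS -l walltime=01:00:00          # one hour of run time
cd $PBS_O_WORKDIR                  # start in the directory the job was submitted from
echo "hello world"
```

Submit it with `qsub myjob.pbs` and check its status with `qstat -u <username>`.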
7.2 SMP High Performance Computing
http://research.smp.uq.edu.au/computing/getafix.html
The School of Mathematics and Physics has its own HPC called Getafix. Getafix is a cluster, and jobs are assigned to nodes by the SLURM scheduler.
7.2.1 Getafix Structure
Each computer in the cluster is called a node. Each node has its own CPUs, memory and graphics card. Nodes all share a common network drive, so all your data is available on all the nodes. Most nodes in Getafix have about 24 cores; a few have more and a few have less. Each node also typically has about 250GB of memory.
One node is special: the login node. The login node works like any remote computer and you can run jobs on it, but out of respect for other users, you should only do setup operations and short tests on the login node.
7.2.2 SLURM
SLURM is the program that balances all the jobs from all users across the cluster. You specify what you want the cluster to do, and how much time, how many nodes and CPUs, and how much memory you think you will need. SLURM then waits until the cluster has enough resources available and starts your job. If you try to use more resources than you asked for, your job will be killed. If you ask for lots of resources, it might take a while for enough nodes to become available, since smaller jobs will usually fill up the free resources as other jobs end.
7.2.3 Other Resources
The cluster has access to shared resources, particularly shared storage and networking, so your data is available on all the nodes at the same location.
7.2.4 Getting into Getafix
7.2.4.1 Request Access
Email ITS at help@its.uq.edu.au and request access to the SMP cluster, which is called Getafix.
This can take a few days. Make sure you explain your connection to the School of Mathematics and Physics. HDR students often have both a staff and a student account; let ITS know which one you want to log in with.
7.2.4.2 Logging into Getafix
Once you have your terminal open and you are logged into the UQ VPN (if off campus), connect with ssh, where `<username>` is your staff or student login.
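A minimal sketch of the connection command (the hostname matches the one in the error message below):

```bash
ssh <username>@getafix.smp.uq.edu.au
```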
If a prompt appears saying “The authenticity of host … can’t be established”, answer yes.
Then enter your password.
If you see:
> ssh: Could not resolve hostname getafix.smp.uq.edu.au: This is usually
> a temporary error during hostname resolution and means that the local
> server did not receive a response from an authoritative server.
Then you either typed the hostname incorrectly or you are not on the UQ internal network; see Network Access.
If you really want R X11 plots to appear on your desktop, add `-X` to the login command. This works on Mac and Linux. Windows may need an X11 server, which you can most easily get from the Ubuntu app. I strongly recommend saving plots into files and copying them to your computer; X11 can be very slow to render.
7.2.5 Putting your data on Getafix
You need to have your data on Getafix. Where to put it?
7.2.5.1 Home
Your home directory (usually /home/<username>) is often shortened to `~`.
You have a 40GB quota here. The home directory is typically used for programs and configuration files. R libraries will install here by default, which is fine. Don’t use the home directory for data sets, results or plots.
7.2.5.2 Data
You have a 400GB quota here. Put your data sets and scripts in the `/data/<username>/` directory.
7.2.5.3 Data1, Data2 … Data6
Most of the time you use `/data/`, but there are other data folders that you can access on request. `data1` and `data2` are older system drives that we are not using. `data3` to `data6` are high-speed drives for reading and writing very large files. These could be useful for work on GCMs.
7.2.6 Sending your data to Getafix
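A hedged sketch of the usual approach: copy files from your own computer with `scp` or `rsync` (the file and directory names below are placeholders):

```bash
# Copy a single file from your computer to your Getafix data directory
scp mydata.csv <username>@getafix.smp.uq.edu.au:/data/<username>/

# Copy (and later re-sync) a whole directory; rsync only transfers what has changed
rsync -av myproject/ <username>@getafix.smp.uq.edu.au:/data/<username>/myproject/
```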
7.2.7 Running jobs on Getafix
7.2.7.1 Modules
To allow different versions of software to be used by different users (like python 3.4 versus python 3.6), Getafix uses Linux modules. On Getafix, R is in a module, and won’t work until you load the module. You can use the default version of a module, or specify a particular version of a module if your job depends on it.
To see what modules are available, type `module avail`. When you see a module you want to use, load it with `module load <module name>`. For the R work in this chapter, the modules I load include `R` and `pandoc` (see the sketch below).
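A short sketch of the module commands (the version number shown is illustrative only; check what is actually installed on Getafix):

```bash
module avail          # list everything available
module load R         # load the default R version
module load R/3.6.1   # or pin a specific version (example version only)
module list           # show what is currently loaded
```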
7.2.7.3 Installing packages
Install packages to a local library in your home directory.
Accept the default library location unless you have a reason not to.
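A minimal sketch (the package name is just an example): start R on the login node after loading the R module, and let R create the personal library when it asks.

```r
# In an R session on the login node (after `module load R`)
install.packages("dplyr", repos = "https://cloud.r-project.org")
# The first time, R offers to create a personal library under your home directory;
# accept the default location.
```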
If your package fails to install, check if there is a system package missing.
For example, to install `sf`, you need to load more modules first (a hedged sketch is below). With those loaded, `sf` should install.
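`sf` links against the GDAL, GEOS, PROJ and udunits system libraries, so the corresponding modules need to be loaded first. The module names below are hypothetical; check `module avail` for the exact names on Getafix:

```bash
# Hypothetical module names -- confirm with `module avail` on Getafix
module load gdal
module load geos
module load proj
module load udunits
```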
If it still does not work, send an email to help@its.uq.edu.au, they have been helpful setting me up when something was missing or broken.
7.2.7.4 Parameters for Rmd files
You can pass parameters into Rmd files from the command line. You need to set up the `params` in the YAML header, see https://bookdown.org/yihui/rmarkdown/parameterized-reports.html
Using Rscript, you can render a report using a specified set of parameters like this:
paramlist='list(nCores =24, gRange = 1:300 )'
Rscript -e 'knitr::opts_knit$set(progress = TRUE, verbose = TRUE); rmarkdown::render("/data/uqpdyer/demo.Rmd", params = '"${paramlist}"', output_format = "html_notebook", output_file = "outfile.html", output_dir = "/data/uqpdyer", envir = new.env())'
This is not as convenient as setting parameters in RStudio or within an R session. You have to be very careful about quotes, and the command line is hard to edit.
Sending parameters to an R script in bulk is much easier, but lacks the reproducibility of Rmd.
7.2.7.5 SLURM
So far, R has been running on the login node. Some things are easiest to do on the login node, like installing packages and transferring files. Computationally heavy tasks should be done on other nodes. To use the other nodes, you submit jobs with SLURM.
SLURM is a job scheduler. People request jobs by specifying the resources they need and the scripts to run, and SLURM queues the jobs until some nodes are available to run the job. You can also request an interactive session, but it is more efficient to set up jobs and come back to get the results.
7.2.7.5.1 sbatch
The command `sbatch` is used to submit jobs, which run when the resources become available.
You use SLURM by calling `sbatch script.sh`, where `script.sh` looks like:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-00:10
echo "hello world"
The `#SBATCH` lines at the top are parameters for the `sbatch` command. You can also pass them as flags in the command call itself, in which case `script.sh` simplifies down to just the commands to run; a sketch of both is below.
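A minimal sketch of passing the same options on the command line instead of in the script:

```bash
sbatch --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1G --time=0-00:10 script.sh
```

with `script.sh` reduced to:

```bash
#!/bin/bash
echo "hello world"
```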
When working with R, we run scripts or arbitrary code snippets with `Rscript`, but the `sbatch` command still expects the `Rscript` call to be wrapped in a shell script, here `script_for_r.sh`:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=2G
#SBATCH --time=0-01:00
module load R
Rscript myRscript.R
So running `sbatch` looks the same: `sbatch script_for_r.sh`.
Here is a meaningful R script, `myRscript.R`, that we can test out. Note that I have specified 20 CPUs in both `script_for_r.sh` and `myRscript.R`:
library(foreach)
library(doParallel)
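# Start a small cluster of 2 workers and register it as the foreach backend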
cl <- makeCluster(2)
registerDoParallel(cl)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
ptime
stime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %do% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
stime
stopCluster(cl)
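# Scale up to 20 workers, matching --cpus-per-task=20 in script_for_r.sh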
cl <- makeCluster(20)
registerDoParallel(cl)
ptime <- system.time({
r <- foreach(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
ptime
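# Shut down the workers when finished
stopCluster(cl)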
7.2.7.5.2 salloc and srun
If you really need to run something interactively, such as a MATLAB GUI, make sure you log in with `-X` or `-Y`, and then request an interactive allocation:
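A hedged sketch of an interactive session (the resource values are placeholders, and the exact X11 options depend on how SLURM is configured on Getafix):

```bash
# Ask for an interactive allocation: 1 task, 1 CPU, 4G of memory, 1 hour
salloc --ntasks=1 --cpus-per-task=1 --mem-per-cpu=4G --time=01:00:00
# Then run a shell (or your GUI program) on the allocated node
srun --pty bash
```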
7.2.7.5.3 Many nodes
So far, we have used only one node. The memory and CPUs are all in the same node, and R behaves in exactly the same way as on your machine, only with more cores. You are clearly limited to a speed-up of around 20 times with a single node.
To go beyond the single node limit, you have a few options.
- Divide your job up
- If you can break your job up into many independent runs, then you can submit a job for each run, and collect all the different outputs after all the jobs have run.
- I have done this to test out the effects of changing combinations of parameters.
- Rslurm
- The Rslurm package will generate a set of scripts that submit a collection of jobs. Rslurm does the work of dividing up your job for you.
- Take care with this package: the default settings will allocate far too many jobs and overwhelm Getafix.
- MPI
- Message Passing Interface (MPI) is a library that shares information across nodes.
- MPI requires extra setup in your code, and you need to use `mpirun` or `srun` inside the sbatch script:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-00:10
mpirun echo "running with MPI"
srun echo "run with SRUN"
doMPI and Rmpi are packages for working with MPI in R.
TODO: I have not figured out MPI on the cluster yet; it didn’t “just work”.
- future.batchtools
- The future.batchtools package lets you submit SLURM jobs from R. The benefit of using future.batchtools is that you write R code using future syntax, then specify a SLURM plan, and all futures at a level specified by the plan are sent off as jobs. The script waits until the job finishes.
- I recommend running a single node task with 1 cpu and a modest amount of ram, but a long running time, then all the heavy lifting is done by new jobs submitted by your script.
- Take care with this package: I haven’t checked the defaults, but it might request too many resources, like Rslurm.
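A hedged sketch of the future.batchtools idea, assuming a batchtools SLURM template is already set up on Getafix (the resource names must match whatever that template expects, so treat them as placeholders):

```r
library(future.batchtools)

# Each future is submitted as its own SLURM job under this plan
plan(batchtools_slurm,
     resources = list(ncpus = 1, memory = "2G", walltime = 3600))

# %<-% creates a future: this block is queued and run as a SLURM job
fit %<-% {
  glm(Sepal.Length ~ Sepal.Width, data = iris)
}

fit  # reading the value waits for the job to finish
```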
7.2.7.5.4 R with parameters
Being able to pass in parameters is useful for dividing your job up.
Some examples of sbatch scripts that use R parameters:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=7G
#SBATCH --time=1-22:00
module load R
Rscript script.R $1 $2
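The script above would then be submitted twice with different arguments, for example:

```bash
sbatch script.sh input1 input2
sbatch script.sh input1 input3
```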
`$1` becomes the first argument, `input1`, both times, and `$2` becomes the second argument: `input2` in the first run and `input3` in the second.
Rendering an Rmd file with parameters:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=7G
#SBATCH --time=1-22:00
module load R
module load pandoc #essential for knitr and rmarkdown
paramlist='list(nCores =24, gRange = 1:300 )'
Rscript -e 'knitr::opts_knit$set(progress = TRUE, verbose = TRUE); rmarkdown::render("/data/uqpdyer/demo.Rmd", params = '"${paramlist}"', output_format = "html_notebook", output_file = "outfile.html", output_dir = "/data/uqpdyer", envir = new.env())'
If you want to generate a collection of rendered Rmd files, you have to write up a script for each set of parameters.
The other way to generate a set of Rmd files with different parameters, in parallel, is to copy the file and change the parameters in each file. Then you can use job arrays by naming the files from 1 to n.
The simplest approach I can think of to run many different Rmd files is to have a shell script that calls an R script that renders an Rmd. Parameters passed to the shell script go all the way down to the Rmd file.
args = commandArgs(trailingOnly = TRUE)
knitr::opts_knit$set(progress = TRUE, verbose = TRUE)
paramlist=list(nCores =as.numeric(args[1]), gRange = as.numeric(args[2]):as.numeric(args[3]))
rmarkdown::render("/data/uqpdyer/demo.Rmd", params = paramlist, output_format = "html_notebook",
output_file = "outfile.html", output_dir = "/data/uqpdyer", envir = new.env())
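A sketch of the matching pieces, with hypothetical file names (`render_report.R` for the R script above, `render_report.sh` for the wrapper):

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=7G
#SBATCH --time=1-22:00
module load R
module load pandoc
# Pass the three command-line arguments straight through to the R script,
# which forwards them as params to the Rmd file
Rscript render_report.R $1 $2 $3
```

which would be submitted as, for example, `sbatch render_report.sh 24 1 300`.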
7.2.8 Debugging Rmd files
Debugging regular R code isn’t great, but debugging Rmd files non-interactively is even worse.
Here are some tricks:
7.2.8.2 Load in cache dir
The qwraps2 package on github has a function, `lazyload_cache`, which will pull in the saved cache so you can interact with it in RStudio.
https://github.com/dewittpe/qwraps2/blob/master/R/lazyload_cache.R
Once you call `lazyload_cache()` on the cache generated by knitr, your environment has all the variables seen up until a failure point.
7.3 General HPC Guidance
7.3.1 Getting Results back
The batch jobs are not interactive. You set them up to run and come back for the results. Usually you save a file to disk.
All printed output goes to a file called `slurm-<job number>.out`. In SLURM you can change this name with the `--output` option in your sbatch script.
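For example, a one-line sketch (the file name prefix is a placeholder):

```bash
#SBATCH --output=myjob-%j.out   # %j is replaced with the job number
```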
7.3.2 Data.frames
If you have R objects you want to see, save them to disk.
- `saveRDS`: R binary format, single object; you choose the object name when you `readRDS` it back in
- `fwrite`: the data.table csv writer; fast, and gives you a csv
- `feather`: a newer binary format backed by Hadley Wickham; fast, and compatible with Python
- `fst`: another newer binary format; extremely fast, and you can read in subsets of the data.frame
- `archivist`: puts everything into a database and gives a unique hash to every object. Good for reproducibility; you can’t mix up objects from different runs later on because the hash always changes
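A minimal sketch of the first option (the paths and object names are placeholders):

```r
# At the end of the batch job: save the object to disk
saveRDS(results, "/data/<username>/results_run1.rds")

# Later, in an interactive session: read it back under whatever name you like
results <- readRDS("/data/<username>/results_run1.rds")
```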
7.3.3 Plots
Always save plots to disk, unless you are using Rmd files, which will put the plot into the output html anyway.
Use `ggsave()` for ggplots; for other kinds of R plots, open a device with `png()`, run your plotting code, then call `dev.off()`.
It is a good idea to set the width, height and dpi for all your plots.
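A short sketch of both (file names and sizes are placeholders):

```r
library(ggplot2)

p <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point()
ggsave("sepal_plot.png", p, width = 18, height = 12, units = "cm", dpi = 300)

# Base R equivalent: open the device, plot, then close the device
png("sepal_base.png", width = 1800, height = 1200, res = 300)
plot(iris$Sepal.Length, iris$Sepal.Width)
dev.off()
```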
7.3.4 Playing nice
The cluster is a shared resource.
Try to be as precise as possible when allocating resources, so others can use what you don’t need.
Be careful with any package that submits jobs for you (`Rslurm`, `future.batchtools`). Check that the jobs are requesting a reasonable set of resources.
7.3.5 Job Submission Tips
If you have trouble getting access to the resources you want, consider whether you can split your job into smaller parts. The queue systems (PBSpro and SLURM) will normally favour smaller, shorter jobs. If you request a single job that needs, for example, 24 cores, you may face a long (days) delay before it starts, because the scheduler needs to find 24 free cores on the same node, which may require existing jobs to finish. While your job is queued, other smaller jobs will jump in front of you and get time. If the cores don’t need to talk to each other, consider submitting the work as an array job, with 24 individual runs of 1 core each (see the sketch below).
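A hedged sketch of a SLURM array job: each of the 24 array tasks gets 1 core, and `$SLURM_ARRAY_TASK_ID` tells the script which run it is (the script name is a placeholder):

```bash
#!/bin/bash
#SBATCH --array=1-24
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=0-01:00
module load R
# Each array task runs the same script with a different task ID
Rscript run_one_task.R $SLURM_ARRAY_TASK_ID
```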
7.3.6 What HPC should I use?
The MME lab has users across all HPCs (Awoonga: Phil, FlashLite: Isaac, Tinaroo: Patrick, Getafix: Patrick). Our staff and students will continue to use the system that works best for their individual needs.
Our experience is that getting time on Getafix is the easiest, apart from the end of semester when a lot of undergraduates are completing assignments. However, Getafix is more of a stand-alone cluster: it is run by SMP and does not have fast write access to the UQ Research Data Manager (RDM). The RCC HPCs all have fast write access to the RDM.
As it has a smaller community of users, the staff in charge of Getafix seem to be more flexible about installing new modules on request. Because the RCC is responsible for HPC support for the rest of UQ, they need more rigorous protocols to install new modules, but they will do so if asked.
7.3.7 Parallel coding in R
7.3.7.1 Add a section on parallelising in SLURM/PBSpro
TODO: Include something on epilogue and prologue scripts, and the pros and cons of parallelising inside versus outside of R.
7.3.7.2 parLapply, par*apply etc.
The parallel package gives you par* versions of the apply family of functions.
They are easy to drop in if you usually use the apply family from base R: you replace `apply`, `lapply`, `sapply` with `parApply`, `parLapply`, `parSapply`.
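A minimal sketch (the worker count is a placeholder; match it to the cores you requested from SLURM):

```r
library(parallel)

cl <- makeCluster(4)  # start 4 worker processes
squares <- parLapply(cl, 1:100, function(i) i^2)
stopCluster(cl)       # always shut the workers down again
```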
7.3.7.3 foreach and %dopar%
The `foreach` package has a few uses.
First, it changes how you use `for` loops in R. Base R `for` loops usually work by changing a variable defined outside the loop. `foreach` is more like an `apply`, where a result is returned, so `foreach` works without needing to use side effects.
Second, because `foreach` can work without side effects, it can run the loop in parallel easily. There are a number of backends you can use, including `doParallel`, `doMPI` and `doFuture`. You can quickly change how parallel works by replacing `%do%` with `%dopar%` and changing the registerDo* function at the start of your script.
If you already like using `foreach` over the apply family, moving to parallel is easy. The main downside is that `foreach` is not particularly efficient with resources. If you are mostly using the apply family, switching to `foreach` is much harder.
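A small sketch of the swap (the worker count is a placeholder); a fuller worked example is in the SLURM section above:

```r
library(foreach)
library(doParallel)

registerDoParallel(cores = 4)  # register a parallel backend with 4 workers

# Changing %do% to %dopar% is all that is needed to run the loop in parallel
res <- foreach(i = 1:8, .combine = c) %dopar% {
  sqrt(i)
}
```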
7.3.7.4 future
The `future` package is built around the idea of a promise. The promise is “eventually the variable will have a result assigned to it, but not straight away”. You replace `<-` with `%<-%` and the line of code becomes a promise, rather than something to do right away.
Generally, you don’t need to change your code much, just add `%` around your `<-` at strategic points, but you can gain a lot of flexibility, and there are many backends.
`future.apply` gives you parallel versions of the apply family. `doFuture` mixes `future` and `foreach`.
`future.batchtools` spawns jobs on the cluster that return their results into your session. `future.callr` does local parallel calculations in separate R processes. `future` on its own will do local multiprocess calculations and can use multiple nodes.
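A minimal sketch of a local future (the worker count is a placeholder):

```r
library(future)
plan(multisession, workers = 4)  # run futures in 4 background R sessions

# %<-% makes this a promise: it starts running in the background immediately
x %<-% {
  Sys.sleep(5)
  sum(rnorm(1e6))
}

# Other work can happen here; reading x blocks until the result is ready
x
```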
7.3.7.5 General Coding habits for going parallel
- Put things into lists
- It is easy to loop over lists, and you can quickly parallelise the processing of a list.
- Don’t do graphics in parallel
- Generally speaking, you can specify ggplot objects in parallel, but if you try to print or save them in parallel, R crashes.
- You probably have to work very carefully if you open multiple `png()` devices for plotting base R plots.
- Make functions small
- If you repeat something frequently, put it into a function. You can then put functions into packages. Try to keep functions small, then they are easier to use in parallel. Generally don’t try to work in parallel within a function.
- If you want a function to use parallel processing, I strongly recommend you use either `future` or `foreach` with `%dopar%`. By default, they both drop back to sequential processing, but the user can set up parallel processing before calling the function to get speed improvements.
- Look for loops, but don’t use base R `for` loops
- Base R `for` loops cannot be used in parallel, because they rely on side effects (see the sketch below).
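A short sketch of the difference:

```r
# A base R for loop that relies on a side effect (filling `out` from outside the loop):
out <- numeric(10)
for (i in 1:10) {
  out[i] <- sqrt(i)
}

# The same computation without side effects, so it can be parallelised
# by swapping %do% for %dopar% once a backend is registered:
library(foreach)
out <- foreach(i = 1:10, .combine = c) %do% {
  sqrt(i)
}
```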