R


R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.

Website: http://www.r-project.org/


Documentation/Tutorials

R Versions via environment modules

The core R installation is upgraded regularly. If this causes problems with your packages, we keep older versions that can be used. If you need a different version, please ask.

The module command is used to load and unload versions of R

List available versions of R

Type module avail

The available versions will be listed as R/?.?.?, where ?.?.? is the version number, e.g. 3.4.1 or 3.5.2

module avail
------------------------------------------------------------------------------------------------------ /usr/local/modulefiles -------------------------------------------------------------------------------------------------------
dot  module-git  module-info  modules  null  R/3.4.1  R/3.4.4  R/3.5.2  use.own

Load a specific version of R

To load version 3.4.1 you would type

module load R/3.4.1

or for version 3.5.2

module load R/3.5.2

Loading in your qsub script

#!/bin/bash
#$ -N R_JOB
#$ -M me@lshtm.ac.uk -m be
#$ -q short.q
#$ -l mem_free=1G,h_vmem=1.2G
#$ -V -cwd 

module load R/3.4.1

R CMD BATCH myrscript myrscript.out

Main R Project Wiki

http://wiki.r-project.org/rwiki/doku.php

Moving from Stata to R

http://wiki.r-project.org/rwiki/doku.php

Installing R libraries for your account on the HPC

Important notes

  • Installing packages - There are limits on hpclogin to stop users consuming too many resources (CPU/memory). You can install small packages interactively in R, but for larger, more complex packages you will need to install them via qsub as a job on the HPC.
  • You should not use install.packages() in job scripts where calculations are being run; use library() to load packages instead. Either install interactively or run a job just for installing the required packages (see the sketch after this list).
    • You install them from CRAN with install.packages("x").
    • You use them in R with library("x").
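
If you re-run an install job, a simple guard avoids reinstalling packages that are already present. A minimal sketch of such an install script (the package list and file name are only examples):

# build-packages.R - install each package only if it is not already in your personal library
pkgs <- c("Rcpp", "ggplot2")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p, repos = "https://cloud.r-project.org/", dependencies = TRUE)
  }
}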

First-time install of packages - once per major.minor version (e.g. 3.4.? or 3.5.?) - create a personal library folder

The first time you try to install a package, you should run R interactively (log in to hpclogin, run module load R/?.?.?, then type R), as it will prompt you to create a personal library. You need to answer yes twice.

Run a simple install for a small package

install.packages('Rcpp', repos="http://cran.ma.imperial.ac.uk/", dependencies = TRUE)

Warning in install.packages("Rcpp", repos = "http://cran.ma.imperial.ac.uk/",  :
 'lib = "/usr/local/packages/apps/R/3.5.2/lib64/R/library"' is not writable

Would you like to use a personal library instead? (yes/No/cancel) yes

Would you like to create a personal library
‘/home/aitsswhi/R/x86_64-library/3.5’
to install packages into? (yes/No/cancel) yes

Install and use packages

Run job to install packages

Notes

  • You must specify repos = "https://cloud.r-project.org/" (you can change the mirror URL) when installing packages via a qsub job, as R will not be able to prompt you to choose a mirror.

Qsub Script

  • use short.q unless a large number of packages is being installed, in which case use long.q
  • build.out will contain a log of the install to check it completed OK or locate any errors
#!/bin/bash
#$ -N R_BUILD_JOB
#$ -M your.email@lshtm.ac.uk -m be
#$ -q short.q
#$ -l mem_free=4G,h_vmem=4.2G
#$ -V -cwd

module load R/3.5.3 

which R

R CMD BATCH build.R build.out

build.R

install.packages("ggplot2", repos = "https://cloud.r-project.org/", dependencies = TRUE)
install.packages("arm", repos = "https://cloud.r-project.org/", dependencies = TRUE)
install.packages("zoo", repos = "https://cloud.r-project.org/", dependencies = TRUE)
install.packages("coda", repos = "https://cloud.r-project.org/", dependencies = TRUE)
install.packages("stats", repos = "https://cloud.r-project.org/", dependencies = TRUE)
install.packages("sna", repos = "https://cloud.r-project.org/", dependencies = TRUE)

Using installed library/package in your code

Add this to the top of your R script file.

library("ggplot2")
library("arm")
library("zoo")
library("coda")
library("stats")
library("sna")

R third party libraries and packages via Conda

See wiki page - Conda (R/Python package management)


https://docs.anaconda.com/anaconda/user-guide/tasks/using-r-language/

Package install locations (R_LIBS_USER)

Your packages need to be installed to a folder you have write permission for.

By default R will add packages to the R folder in your home directory ~/R (which is shorthand for /home/username/R). This folder must exist.

R will then create a set of subfolders, one for each version of R (e.g. ~/R/x86_64-library/3.4 or ~/R/x86_64-library/3.5)

Note: the library folder is shared across the same major.minor version (e.g. 3.5); the patch number (e.g. 3.5.1) does not affect the installed libraries.
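
You can check which library paths a given R version will use from an interactive session. A minimal sketch using base R functions (assuming you have already installed Rcpp as in the example above):

# Show the library search path; the personal library should appear before the
# central install under /usr/local
.libPaths()

# Confirm that a package installed to the personal library can be found
find.package("Rcpp")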

Override R_LIBS_USER

You can set R_LIBS_USER to any path you like within your home space, or to a shared location in another user's home space.

Create/edit ~/.Renviron

default

R_LIBS_USER=~/R/x86_64-library/%v

You can also set a custom R_LIBS_USER in your job script (for this to work you may need to leave R_LIBS_USER out of .Renviron)

#!/bin/bash
#$ -N R_JOB
#$ -M me@lshtm.ac.uk -m be
#$ -q short.q
#$ -l mem_free=1G,h_vmem=1.2G
#$ -V -cwd 

module load R/3.4.1

export R_LIBS_USER=~/R-custom-packages/%v

R CMD BATCH myrscript myrscript.out

Compiler overrides for R Packages

Create the file ~/.R/Makevars

CXXFLAGS=-O2 -march=native -mtune=native -fPIC
CC=gcc
CXX=g++

Using a different version of GCC

Several major versions of GCC are available via environment modules

To see all available modules including GCC

module avail

-------------------------------------------------------- /usr/local/modulefiles ---------------------------------------------------------
dot  gcc/6.5.0  gcc/7.4.0  module-git  module-info  modules  null  R/3.4.1  R/3.4.4  R/3.5.2  R/3.5.3  use.own

To load the version you want

module load gcc/6.5.0

IMPORTANT: depending on the package and how it compiles, you may need to load the same module(s) before you use the package. You can load modules automatically by adding the module load commands to your qsub script, or even to the bottom of ~/.bashrc.

Running R program from file

R CMD BATCH rscriptfile outputfile

Example

R CMD BATCH myrscript myrscript.out

Your results will be saved to a file called myrscript.out

If you run multiple jobs at the same time, they can conflict when they all try to save the workspace. To avoid this, use

R CMD BATCH --no-save --no-restore myrscript myrscript.out
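
With --no-save the workspace is not written when R exits, so anything you want to keep should be written out explicitly in your script. A minimal sketch (the object and file names are only examples):

# Save a result object explicitly instead of relying on the saved workspace
results <- summary(rnorm(1000))
saveRDS(results, file = "results.rds")

# Read it back later with: results <- readRDS("results.rds")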

Running R program on HPC

Example job script saved as myrjob

#!/bin/bash
#$ -N R_JOB
#$ -M me@lshtm.ac.uk -m be
#$ -q short.q
#$ -l mem_free=1G,h_vmem=1.2G
#$ -V -cwd
module load R/3.4.3
R CMD BATCH myrscript myrscript.out

Submitting job

qsub myrjob

Parallel - OpenMPI, multithreaded or multi-CPU jobs

If you are using a package/library that supports multi-threaded (multi-CPU) processing or OpenMPI, you need to submit your job to the parallel.q and specify the exact number of CPUs you need.

You cannot use the short.q or long.q

Some libraries/packages have parameters/options that let you restrict the number of CPUs/threads they will attempt to use. This should be set to the number you request in the qsub script.

Ideally you should aim for between 4 and 8 CPUs/threads. If you cannot restrict the library, specify 8. The nodes in the HPC have 8 or 12 CPUs; if you specify more, your job will never run. The exception is OpenMPI, which can run across several machines; however, it is recommended that you do not request more than 20 slots for MPI, or your job may take a long time to find free slots.
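
Inside your R script you can read the number of slots granted by the scheduler and pass it to the package you are using, rather than hard-coding a thread count. A minimal sketch using the base parallel package and the NSLOTS environment variable that SGE typically sets for parallel environment jobs (adapt the option name to whatever your package expects):

library(parallel)

# Number of slots requested with -pe smp N; fall back to 1 if unset
n_cores <- as.integer(Sys.getenv("NSLOTS", unset = "1"))

# Example: run a function over 100 inputs using the granted cores
cl <- makeCluster(n_cores)
res <- parLapply(cl, 1:100, function(i) sqrt(i))
stopCluster(cl)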

Multi-CPU/multithreaded/OpenMP job

This example requests the use of four CPUs on the same node. Edit -pe smp 4 and set it to the number of CPUs required, up to a maximum of 8 or 12 (no node has more than 12 CPUs, and more nodes have 8 than 12).

Memory requirements set in the script are multiplied by the number of CPUs requested, in this case 1G x 4 = 4GB total required on the node.

#!/bin/bash
#$ -cwd -V
#$ -l mem_free=1G,h_vmem=1G
#$ -q parallel.q
#$ -pe smp 4
#$ -R y
export OMP_NUM_THREADS=4
module load R/3.4.3
R CMD BATCH myrscript myrscript.out

OpenMPI job

This example requests 16 OpenMPI slots and will run across multiple nodes in the parallel.q

To change the number of slots, edit "-pe openmpi 16" and "mpirun -np 16"

Memory requirements set in the script are multiplied by the number of OpenMPI slots requested, in this case 1G x 16 = 16GB total required.

#!/bin/bash
#$ -cwd -V 
#$ -l mem_free=1G,h_vmem=1G
#$ -q parallel.q
#$ -pe openmpi 16
#$ -R y
mpirun -np 16 R CMD BATCH test.R test.txt

Array Job

Array jobs allow you to submit one job with multiple tasks; you can then access the task id in your scripts. For example, you might use 10 tasks to process 10 different data files, or to run your simulation with 10 different parameter sets.

qsub -t 1:10

See array jobs qsub

Qsub Job Script

#!/bin/bash
#$ -N ARRAY_TEST_JOB
#$ -q short.q
#$ -cwd -V
#$ -l mem_free=1G,h_vmem=1.2G
#$ -t 1-10

R CMD BATCH myrscript myrscript${SGE_TASK_ID}.out

-t 1-10 This specifies the range of sequential tasks 1,2,3,4,5,6,7,8,9,10. This can be any range, e.g. 1-5 or 1-1286.

This will submit your job with 10 tasks (1-10) and will create 10 R output files, e.g.:

myrscript1.out
....
myrscript10.out

R Script

To use the task id in your R script you need to use Sys.getenv("SGE_TASK_ID")

# Read the SGE task id from the environment (it arrives as a string)
taskIdChar <- Sys.getenv("SGE_TASK_ID")
taskIdInteger <- as.integer(taskIdChar)

dataFilename <- paste0("coredate-", taskIdChar, ".dta")

taskIdChar is the string/character value of the current task id

taskIdInteger is the current task id converted to an integer

dataFilename is a string combining the task id to specify a file to read from or write to
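
A common pattern is to use the task id to pick one row from a table of simulation settings, so each array task runs one setting. A minimal sketch (the parameter grid and runSimulation are purely illustrative placeholders):

# Build a grid of settings; task id 1..nrow(params) selects one row per task
params <- expand.grid(sampleSize = c(100, 500, 1000), effect = c(0.1, 0.5))
taskId <- as.integer(Sys.getenv("SGE_TASK_ID"))
setting <- params[taskId, ]

# Use the chosen setting in your simulation, e.g.:
# result <- runSimulation(n = setting$sampleSize, beta = setting$effect)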

Example R Program

Save the following to a file called myrscript to use with the example instructions for running R jobs above.

library("sna")
library("network")

# These are the names of the actors
labels <- c("Allison", "Drew", "Ross", "Sarah", "Eliot", "Keith")

net <- network.initialize(6)

# Label the vertices
net %v% "vertex.names" <- labels

# Data on page 123.
add.edges(net, c(1,1,2,2,5,6,3,4), c(2,3,4,5,2,3,4,2))

degree(net, cmode="outdegree")
degree(net, cmode="indegree")

# Note that the variance of indegree and outdegree, on page 128, can be calculated with:
var(degree(net, cmode="outdegree"))
var(degree(net, cmode="indegree"))

Guides

Build RStan
