This guide is currently very much a work in progress, published during the Pilot phase of the Sigma GPU nodes, which runs from July to September 2020.
The focus of this guide is the particulars of using the GPU nodes of Sigma conveniently, presented in a hands-on, example-driven manner. It is not intended to cover GPU computing in general, but references to more exhaustive background sources are included for convenience.
The guide assumes familiarity with the “Sigma - Getting Started Guide”; if you have not already done so, please acquaint yourself with it. It is also assumed that you have a valid login account on Sigma and a project with access to the GPU nodes of the cluster; the process is described in “Becoming a member of a project” and “Getting a login account”.
The examples in the guide also assume that you are using the ThinLinc VNC solution to access Sigma, as this provides the best user experience when running the examples, or indeed anything else requiring X-windows graphics. Furthermore, ThinLinc provides session management that lets you suspend and resume your session on the Sigma login node, much like terminal multiplexers (or their front-ends) such as screen, byobu or tmux do for an SSH login. That said, it is perfectly fine to use SSH logins and work that way if that is your preference; most parts of this user guide should work very well in that context too.
From your desktop to an interactive prompt on a Sigma GPU node in three simple steps:
interactive -n 1 -c 9 --gpus-per-task=v100:1 -t 60 -A <your_account_string> --reservation=gpu
This allocates one task comprising 9 CPU cores and 1 V100 GPU for 60 minutes using your project account (fill in something like LiU-gpu-XXXX-YYYYY), i.e. a quarter of a node. It also specifies the reservation gpu, which is the name of the reservation containing the GPU nodes. Only specific users or project accounts are allowed to submit jobs (like this one) to a reservation. You should now be presented with an interactive prompt on a GPU node, like
[username@sigma ~]$ interactive -n 1 -c 9 --gpus-per-task=v100:1 -t 60 -A <your_account_string> --reservation=gpu
salloc: Granted job allocation 958371
srun: Step created for job 958371
[username@n2017 ~]$
It is sometimes prudent to check that you have access to the GPU devices you have allocated, for instance
[username@n2017 ~]$ echo $CUDA_VISIBLE_DEVICES
0
The above means you have access to GPU device number 0 out of devices 0–3, not that you have 0 devices. You can also check using the nvidia-smi tool (2 GPUs allocated in this example).
[username@sigma ~]$ interactive -n 1 -c 9 -A <your_account_string> -t 60 --reservation=gpu --gpus-per-task=v100:2
salloc: Granted job allocation 958413
srun: Step created for job 958413
[username@n2017 ~]$ nvidia-smi
Thu Jun 25 17:36:34 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   39C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   43C    P0    55W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
An important point to be aware of, especially on Sigma, is that data access between centre storage under /home and /proj and the Sigma GPU (and CPU) compute nodes is not suitable for I/O intensive loads of a small random-read/write character, common in for instance machine learning. Fortunately, the GPU nodes are equipped with large and fast NVMe SSD disks (14 TB in total) suitable for this type of I/O load, available to each job under /scratch/local. Note though, that this scratch space is volatile, accessible on a per-job basis, and will be cleared after each SLURM job.
Transferring data sets for training to the node local scratch space should preferably be done in large contiguous chunks to maximize transfer speed. A good way to achieve this is to store your data set on centre storage (/proj is suggested) in an uncompressed tar archive and transfer it with direct unpacking to scratch in your SLURM job on the allocated GPU node, like
[username@n2017 ~]$ tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local
Avoid compressing the tar archive; it adds a timing overhead to the data set unpacking on the order of 4x for gunzip (.tar.gz archives) and 20x for bunzip2 (.tar.bz2 archives).
Access to the GPU nodes is provided to members of projects with a time allocation on the GPU nodes. Applying for a project with a time allocation on the GPU nodes is described at “Applying for a new project” under the “LiU Local projects” heading, and becoming a member of a project with a time allocation on the GPU nodes is described at “Becoming a member of a project”. Members also need a login account on Sigma, a process described at “Getting a login account”.
Since the GPUs of Sigma are currently a scarce resource, project proposals need to motivate the use of the GPUs from a technical perspective, i.e. describe how the project will use the GPUs and why GPUs, rather than other compute resources, are required. New projects also need to renew the project application (a simple procedure in SUPR) after a few months, after which project renewal follows the regular Sigma schedule of once per year. Reasonable motivations for a project’s use of the GPU nodes range from the (in principle) lighter-weight “We want to test software X’s GPU capabilities” to the more compelling “Software Y and Z perform 10x as well on a GPU as on the CPUs, and cannot competitively be run in other ways”.
For login access to the Sigma GPU nodes NSC recommends using the “ThinLinc” VNC solution, as this will provide the most convenient and reliable access to graphical applications on Sigma, but SSH can be used as well. Data transfers to Sigma are described at “Getting data to and from the cluster”.
Project storage is hosted on NSC Centre Storage under the /proj and /home directories, where /home is a smaller, backed-up area intended for precious data, and /proj is non-backed-up volume storage, which is typically where projects should store volume data not requiring backup to tape. See NSC’s Centre Storage for more detailed information.
While NSC Centre Storage is generally high-performance, some data access patterns are much less well suited for it than others. In particular, input/output (I/O) loads characterized by predominantly small random read/write operations will not perform well on Centre Storage. For these loads, node local scratch disks are available. General information about scratch space is available at Node local scratch storage. With respect to scratch space, the GPU nodes differ from the CPU-only nodes only in size and performance: the GPU nodes have 14 TB of scratch storage space and performance around 6.3 GB/s and 500k IOPS random-read.
The GPU nodes are intended for using the GPUs they are equipped with. That is, jobs not requiring a GPU should not be run here, but should be run on the regular CPU nodes of Sigma. This is quite natural, but is stated to make clear that any systematic CPU-only use of the nodes constitutes a misuse of the resource.
Allocating GPU resources is done via the SLURM resource manager like on the rest of Sigma, see “Sigma Batch Jobs” and “Tetralith Batch Jobs”, but there are a few extras to bring up here. In addition to other SLURM allocation switches, three more pieces of information are required to allocate a GPU resource:
- The --gpus-per-task=<num_gpus> switch, e.g. --gpus-per-task=2
- The -A <project account> switch, where the project account must have an allocation on the GPUs.
- The --reservation=gpu switch.
The --gpus-per-task switch can be further specified with the GPU type and the amount separated by a “:”. At present, the only GPUs available in Sigma are Tesla V100, but if other GPUs are added in the future, say a Tesla A100, specifying which one to allocate becomes important. The V100 GPUs in Sigma are specified with the v100 label to --gpus-per-task, and allocating for instance three V100 per task can be done with --gpus-per-task=v100:3.
For example, allocating a single V100 GPU for interactive work via a terminal on Sigma can be done as follows
[username@sigma ~] $ interactive -n 1 -c 9 -t 1:00:00 --gpus-per-task=v100:1 -A <account_string> --reservation=gpu
A batch job script analogous to the interactive command above would look like
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 9
#SBATCH -t 1:00:00
#SBATCH --gpus-per-task=v100:1
#SBATCH -A <account_string>
#SBATCH --reservation=gpu
# Job script commands follow
Recent SLURM releases have added many more features for controlling allocations that contain GPUs. This has brought added complexity, and there are now very many ways to give conflicting allocation directives. Often this is not followed by a clear error message from SLURM, or indeed any message at all, and after an unfortunate allocation there may be an infinite wait for a non-existent combination of resources to become available.
The NSC recommended way to allocate resources is to specify how many tasks to allocate and, for these tasks, what resources they should have, as shown in the examples. This also extends to the switches --mem, --mem-per-cpu and --mem-per-gpu, which may be needed but can cause allocation conflicts and are therefore mutually exclusive. NSC recommends avoiding the --mem switch, since it acts on a per-node basis, and instead using the --mem-per-gpu or --mem-per-cpu options. Following the example allocation with -n 1 -c 9 --gpus-per-task=1, to allocate one quarter of the available memory per node (~90 GB) you would add the option --mem-per-gpu=90G or --mem-per-cpu=10G. The default is to allocate tot_mem_avail / num_cpus (i.e. 360 / 36) to each allocated CPU core, so there should most often not be a need to specify this switch.
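As a concrete sketch of the above (the account string is a placeholder, and the memory values simply mirror the quarter-node example), an interactive allocation that explicitly requests memory per GPU could look like
[username@sigma ~]$ interactive -n 1 -c 9 -t 60 --gpus-per-task=v100:1 --mem-per-gpu=90G -A <your_account_string> --reservation=gpu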
Working interactively at the CLI prompt with SLURM allocated resources is most conveniently done using the NSC command interactive, which is a wrapper around the SLURM salloc and srun commands and as such accepts the switches supported by these. In short, the interactive command launches a shell process on the first allocated node of a job, drops you into that shell and lets you interact with the allocated resources from there, much like any other CLI prompt. More information about interactive can be found at Running applications under the “Interactive jobs” heading. More information about batch jobs can be found at “Batch jobs and scheduling”.
Running GPU jobs (i.e. with no MPI communication involved), be they single or multi GPU, is typically done exactly as you would run them on your laptop or workstation, after having allocated resources via SLURM. It becomes a bit more involved when using MPI. In principle, when running applications built at NSC with the NSC provided build environments, the general process is described at “Running parallel jobs with mpprun”, but there are many application specific caveats to take into account, and the application documentation should be consulted when running GPU+MPI applications. Performance is currently not expected to be very high when running GPU+MPI jobs on more than one node, due to limitations in the interconnect driver and the data transport (PSM2) libraries in use. It should be functional though; only the GPUdirect feature over the OmniPath interconnect is unsupported.
The SLURM job queue to the GPU nodes can be checked with squeue -R gpu. Checking various aspects of your job’s status at NSC as it is running is normally done using the jobsh script to access the nodes of the job. However, this is not possible at present when it comes to monitoring GPU jobs, due to limitations in the resource manager. Instead, it is suggested that, at least for interactive jobs, you start a terminal multiplexer such as tmux or screen on the GPU node before starting your job. These multiplexers allow you to open a second shell prompt within the job (check the tmux or screen man pages for how) to check GPU usage with nvidia-smi or other tools as you go.
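A minimal sketch of this workflow, assuming you already sit at the prompt of an interactive job on the node (train.py is a hypothetical workload; replace with your own):
[username@n2017 ~]$ tmux                      # start the multiplexer inside the job
[username@n2017 ~]$ python train.py           # run your workload in the first tmux window
# Press Ctrl-b c to open a second window within the same job, then:
[username@n2017 ~]$ nvidia-smi -l 5           # refresh the GPU status every 5 seconds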
Transferring data sets to the node local scratch space should preferably be done in large contiguous chunks to maximize transfer speed. A good way to achieve this is to store your data set on centre storage (/proj is suggested) in an uncompressed tar archive and transfer it with direct unpacking to scratch in your SLURM job on the allocated GPU node, like
[username@n2017 ~]$ tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local
Carrying out your data transfers this way, you can expect a data transfer speed of about 1 GB/s, e.g. a 60 GB tar archive should be unpacked to local disk in around a minute on the GPU nodes. On regular Sigma nodes, the performance is bottle-necked by the slower local disk of those nodes to something like 200 MB/s.
Avoid compressing the tar archive; it adds a timing overhead to the data set unpacking on the order of 4x for gunzip (.tar.gz archives) and 20x for bunzip2 (.tar.bz2 archives). If you absolutely must use a compressed tar archive, you can perform the decompression using a parallelized implementation of your compression program, for instance pigz when using gzip-compressed archives
[username@n2017 ~]$ tar --use-compress-program=pigz -xf /proj/some/path/to/DATA_SET.tar.gz -C /scratch/local
The parallelization of the different parallel compression programs is implemented with varying quality, so your mileage may vary with respect to unpacking times.
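Putting the allocation and data staging together, a complete batch job could stage the data set to node local scratch and then start the training run. The following is only a sketch under assumptions: the account string, the tar archive path and the train.py script are placeholders for your own.
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 9
#SBATCH -t 1:00:00
#SBATCH --gpus-per-task=v100:1
#SBATCH -A <your_account_string>
#SBATCH --reservation=gpu
# Stage the uncompressed tar archive from centre storage to node local scratch
tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local
# Run the training, reading the data from the fast local disk
# (train.py is a hypothetical script; replace with your own workload)
python train.py --data-dir /scratch/local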
The user environment at NSC is handled with the module system, as described on the Module system page. All CUDA enabled modules suitable for use on the GPU nodes can be listed with module avail cuda, where cuda is a case insensitive search string. Alternatively, search for your software title using module avail <your_sw_title> and see if it has a string in it indicating GPU capability, like “cuda”, “opencl” or “openacc” for instance.
An important module is the buildenv-gcccuda/<CUDA_VERSION>-<BASE_BUILDENV_VERSION> module, which sets up a CUDA toolkit build environment along with a GCC compiler, MPI and math libraries custom built against this CUDA version. There is a hidden module available called buildenv-gcccuda/.10.2-7.3.0-bare, which due to its preliminary (hidden) status is not listed when doing module avail cuda. To use it, issue module load buildenv-gcccuda/.10.2-7.3.0-bare at the CLI prompt. To list all available gcccuda modules, hidden as well, issue module --show-hidden avail gcccuda. Be aware though, that hidden modules may be removed or changed without notice (they are hidden for a reason).
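As a quick sketch, loading the build environment and checking that the CUDA compiler is on your path (assuming the module provides nvcc, as a CUDA build environment normally does) could look like
[username@sigma ~]$ module --show-hidden avail gcccuda
[username@sigma ~]$ module load buildenv-gcccuda/.10.2-7.3.0-bare
[username@sigma ~]$ nvcc --version     # should report a CUDA 10.2 release if the load succeeded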
You can of course also install whatever CUDA version you need to your own directories and use that (with a suitable compiler). Note though, that NSC will be less able to support you if you use your own CUDA installations. If you have no requirement on MPI working with your own CUDA installation, you should be good to go as soon as you set the following CUDA-specific environment variables: CUDA_HOME, CUDA_ROOT and CUDA_PATH, all pointing to the root of your CUDA installation, e.g. (assuming you have downloaded the CUDA toolkit .run file)
[username@sigma ~] $ cuda_installpath="/proj/<project_name>/some/path/CUDA/10.2.89_440.33.01"
[username@sigma ~] $ sh ./cuda_10.2.89_440.33.01_linux.run --toolkit --silent --installpath=${cuda_installpath}
[username@sigma ~] $ export CUDA_HOME=${cuda_installpath} CUDA_ROOT=${cuda_installpath} CUDA_PATH=${cuda_installpath}
There are also other ways you need to modify your environment; consult the CUDA toolkit installation manual. The reason for the many different variables pointing at the CUDA installation root is historical; in practice you may only need to set one of them for your purposes.
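The most common such additional modifications are adding the toolkit's bin and lib64 directories to your executable and library search paths. A minimal sketch, assuming the install path from the example above:
[username@sigma ~] $ export PATH=${cuda_installpath}/bin:${PATH}
[username@sigma ~] $ export LD_LIBRARY_PATH=${cuda_installpath}/lib64:${LD_LIBRARY_PATH}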
A highly recommended development environment on the GPU nodes is the Singularity container solution. All compute nodes of the Sigma cluster have the container solution Singularity installed, see our Singularity page for more details on how it is set up at NSC and https://sylabs.io for the canonical information. The NSC documentation has some notes on Singularity security and trusted sources of containers. In addition to the trusted sources mentioned there, you may trust the NVIDIA NGC container registry used in the examples below.
Being a container solution, using a suitable Singularity image as your user environment brings a lot of advantages, such as: a familiar operating system and CUDA environment of choice, convenient portability between systems, and reproducibility of results. Additionally, Singularity can import well optimized Docker containers directly from the NVIDIA NGC registry, and also offers the possibility of modifying these to fit your needs. Examples of how to do this are provided in the Development Tools section.
If you are using Python as your development platform, another approach to managing your user environment is the Conda package management system. NSC recommends using Conda over vanilla Python virtual environments, as it seems to be the more favoured solution when it comes to Python codes on the GPU, and it has several functional advantages as well. When installing Python modules requiring compilation into your Conda environment, be sure to have loaded a suitable build environment module, e.g. buildenv-gcccuda/.10.2-7.3.0-bare. This guide will not cover Conda further; instead, consult the official documentation at https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html for its use. Check which Python Conda modules are available with module avail conda, or install some other version to your own directories. Since Conda environments can be rather voluminous, a tip is to make your /home/<username>/.conda directory a symbolic link to some place in your project directory (in principle /proj/some/path/to/your/conda/dir).
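A minimal sketch of setting up such a symbolic link (the project path is a placeholder, and the mv/rmdir steps only apply if a ~/.conda directory already exists):
[username@sigma ~]$ mkdir -p /proj/some/path/to/your/conda/dir
[username@sigma ~]$ mv ~/.conda/* /proj/some/path/to/your/conda/dir/   # only if ~/.conda already has content
[username@sigma ~]$ rmdir ~/.conda
[username@sigma ~]$ ln -s /proj/some/path/to/your/conda/dir ~/.conda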
The current version of Singularity installed on the GPU nodes is from the 3.5 series. It is important to understand that you will need to build your Singularity image on a computer where you have administrator (i.e. super user or root) privileges. Typically you would have this on your own laptop or workstation, but not anywhere on Sigma (sudo does not work here), i.e. any sudo invocation in the following examples assumes you are doing it on your local computer. Installing a recent enough version of Singularity on your computer will not be covered here; check out the official documentation at https://sylabs.io.
Here follow a few examples of what you can do with Singularity. The examples barely scratch the surface of what can be done; consult the official documentation for a comprehensive guide.
This pulls a docker image from the NVIDIA NGC registry and makes a Singularity image (.sif) of it
sudo singularity pull tensorflow-20.03-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
[sudo] password for user:
WARNING: Authentication token file not found : Only pulls of public images will succeed
INFO: Starting build...
Getting image source signatures
Copying blob sha256:423ae2b273f4c17ceee9e8482fa8d071d90c7d052ae208e1fe4963fceb3d6954
25.46 MiB / 25.46 MiB [====================================================] 5s
Copying blob sha256:de83a2304fa1f7c4a13708a0d15b9704f5945c2be5cbb2b3ed9b2ccb718d0b3d
34.54 KiB / 34.54 KiB [====================================================] 0s
---8<------ SNIP ------->8----
Copying config sha256:fabc6c87fbf06db2bbd63455d5e2e95ce5d7cfcc732dc3af632f89cb25627d7f
37.30 KiB / 37.30 KiB [====================================================] 0s
Writing manifest to image destination
Storing signatures
INFO: Creating SIF file...
INFO: Build complete: tensorflow-20.03-tf2-py3.sif
Check the NVIDIA NGC registry at https://ngc.nvidia.com/catalog/all for other images of interest. For example a PyTorch image using Python3 can be downloaded with
sudo singularity pull pytorch_20.03-py3.sif docker://nvcr.io/nvidia/pytorch:20.03-py3
As a side note, pulling from the official PyTorch Docker repository, the same can be accomplished with
sudo singularity pull pytorch_latest.sif docker://pytorch/pytorch:latest
The image pulled from NGC or elsewhere may need to be adapted to suit your requirements better. In that case you will want to create a writeable sandbox directory from your .sif image, enter and modify it, and finally create an updated version of your image. For instance, adding a package (here vim-gtk) can be done with
sudo singularity build --sandbox pytorch_latest pytorch_latest.sif
sudo singularity shell --writable pytorch_latest/
Singularity pytorch_latest/:~> apt-get update
Singularity pytorch_latest/:~> apt-get install vim-gtk
Singularity pytorch_latest/:~> exit
sudo singularity build pytorch_latest.v2.sif pytorch_latest
You can make any installations into the image, including with pip or using the container image's OS native build tools (gcc etc.) to manually build and install whatever you may need. If you are doing very elaborate installs, you may want to use persistent overlays instead, see https://sylabs.io/guides/3.5/user-guide/persistent_overlays.html.
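For example, installing an extra Python package into the same sandbox could hypothetically look like the following (scikit-learn is just an arbitrary example package, and pytorch_latest.v3.sif an arbitrary output name):
sudo singularity shell --writable pytorch_latest/
Singularity pytorch_latest/:~> pip install scikit-learn
Singularity pytorch_latest/:~> exit
sudo singularity build pytorch_latest.v3.sif pytorch_latest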
If you need more control over what goes into your image, you will need to use your own Singularity definition files (image recipes if you will). Using the following Singularity definition file we can for instance build a Singularity image containing PyTorch, CUDA 10.2 and lots of added extras
Bootstrap: library
From: ubuntu:18.04
%post
# Get the packages from a nearby location (well, if Sweden is close to you at least)
cat << EOF > /etc/apt/sources.list
deb http://se.archive.ubuntu.com/ubuntu/ bionic main restricted
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates main restricted
deb http://se.archive.ubuntu.com/ubuntu/ bionic universe
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates universe
deb http://se.archive.ubuntu.com/ubuntu/ bionic multiverse
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates multiverse
deb http://se.archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu bionic-security main restricted
deb http://security.ubuntu.com/ubuntu bionic-security universe
deb http://security.ubuntu.com/ubuntu bionic-security multiverse
EOF
# Downloads the latest package lists
apt-get update -y
# Install required and reasonable extra distro packages
DEBIAN_FRONTEND=noninteractive apt-get -y --no-install-recommends install \
build-essential \
wget \
git \
software-properties-common \
python3 \
python3-tk \
python3-pip \
gdb \
freeglut3-dev \
dirmngr \
gpg-agent \
python3-setuptools \
python-dev \
python3-dev \
python3-wheel \
vim-gtk \
nano \
openmpi-bin \
libopenmpi-dev \
openssh-client
# Install extras, the atom editor for instance like here
add-apt-repository ppa:webupd8team/atom
apt-get update -y
apt-get -y install atom
# Get the NVIDIA repos
apt-key adv --fetch-keys \
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
wget -O /etc/apt/preferences.d/cuda-repository-pin-600 \
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
add-apt-repository \
"deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
# Install CUDA (everything and the kitchen sink style)
apt-get update -y
# Make sure to match the CUDA version to whatever other packages are being
# installed
DEBIAN_FRONTEND=noninteractive apt-get -y install \
cuda-10-2 \
cuda-toolkit-10-2 \
cuda-samples-10-2 \
cuda-documentation-10-2
# Reduce the size of the image by deleting the package lists we downloaded,
# which are no longer needed.
rm -rf /var/lib/apt/lists/*
# Install Python modules. Make sure the CUDA-utilising python packages are
# compatible with whatever CUDA version was installed above.
pip3 install \
numpy \
matplotlib \
jupyter \
torch \
torchvision \
tqdm
%environment
export LC_ALL=C
export PATH=/usr/local/cuda/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bin
export CPATH=/usr/local/cuda/include:$CPATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Build your image from your definition file (here my_image.def) like
sudo singularity build my_image.sif my_image.def
NB: Singularity images can only be launched from under the /proj file system, i.e. the image must be placed there when launched.
Using a toy PyTorch learning example train_xor.py, we can execute it by way of the PyTorch containing image (in effect executing it in that operating system environment) like this
[username@n2017 ~] $ singularity shell --nv pytorch_20.03-py3.sif
Singularity> python train_xor.py
CUDA is available -- using GPU
iteration #1
loss: 0.24764549732208252
accuracy: 50.00%
iteration #2
loss: 0.2470039427280426
accuracy: 50.00%
---8<---- SNIP ---->8---
iteration #100
loss: 0.006982346531003714
accuracy: 100.00%
Singularity> exit
exit
[username@n2017 ~] $
The --nv switch to singularity above is very important, as it imports the relevant host OS NVIDIA runtime libraries and devices (i.e. from CentOS 7 of the GPU node) into the image to permit execution on the GPUs from within the container.
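The same container can of course also be used non-interactively in a batch job. A minimal sketch, assuming the image has been placed under /proj (the paths, account string and script name are placeholders):
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 9
#SBATCH -t 1:00:00
#SBATCH --gpus-per-task=v100:1
#SBATCH -A <account_string>
#SBATCH --reservation=gpu
# Run the training script inside the container; --nv makes the host GPU
# libraries and devices available within it
singularity exec --nv /proj/some/path/pytorch_20.03-py3.sif python train_xor.py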