Bi has 16 cores per compute node, just like Krypton. If you have a working job configuration for Krypton, you should be able to run exactly the same job on Bi -- it will just run faster (a typical improvement is around 50%).
Bi has hyper-threading available, making each physical core appear as two virtual cores. If you enable it using --ntasks-per-core=2, mpprun automatically starts 32 MPI ranks per compute node. If you do not, you get 16 MPI ranks per compute node (as long as you do not change that using other parameters). Hyper-threading makes some applications, like Arome, run faster (about 10%). See below for more information about hyper-threading and Slurm. Note: during the pilot phase until 2015-02-25, hyper-threading was on by default.
Bi has Intel Xeon E5v3 processors of the "Haswell" generation. Haswell CPUs have improved vectorization with AVX2 instructions. In theory, up to 8 floating-point instructions can be handled per clock cycle (up from 4 using AVX). To benefit from this, you need to recompile your software with high optimization (e.g. -O2 -xCORE-AVX2) or at least link with an external library that has AVX2 support (like Intel's MKL).
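For example, recompiling a Fortran program with the Intel compiler could look like the following minimal sketch (the source file name is a placeholder, and a compiler module such as buildenv/2015-1 is assumed to be loaded):
ifort -O2 -xCORE-AVX2 -o mysolver mysolver.f90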
Bi has 64 GB of memory in the thin compute nodes, twice the amount of Krypton. The memory speed has also improved: Bi has 1866 MHz DDR4 memory. In low-level memory benchmarks like STREAM, we see up to 30% improvement. For certain applications, this can lead to a substantial speed-up, even without recompiling them.
Bi has Intel Truescale Infiniband (previously known as QLogic Truescale) -- earlier clusters at NSC have had Infiniband from Mellanox. As a user, you will probably not notice this, but if you are using your own MPI library, you may have to supply special flags or recompile it with "PSM" or "TMI" support to get the best performance. In low-level benchmarks, we have seen that Truescale Infiniband is especially strong at small messages (high "packet rate").
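As an illustration only (an assumption about Intel MPI, not a setting required on Bi), the TMI fabric can typically be selected by exporting the following in the job script:
export I_MPI_FABRICS=shm:tmi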
Bi has the Slurm job scheduling system, like earlier clusters at NSC. Below, we present some examples of how to launch parallel jobs with different kinds of parallelization.
This job script will allocate e.g. 8 nodes with 16 cores per node. Run like this if you want everything to be as similar as possible to Krypton:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
...
mpprun binary.x
To use hyper-threading, add --ntasks-per-core=2 and mpprun will launch 32 MPI ranks per node:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
#SBATCH --ntasks-per-core=2
...
mpprun binary.x
In this case, each MPI rank will spawn a number of OpenMP threads. You can have up to 2 OpenMP threads per core. There are many possible combinations. We expect that the following combinations are likely to run well:
16 MPI ranks x 2 OpenMP threads = 1 MPI rank per physical core, with 2 OpenMP threads per rank (one per virtual core). Job script:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
#SBATCH --ntasks-per-node=16
...
export OMP_NUM_THREADS=2
mpprun binary.x
2 MPI ranks x 16 OpenMP threads = 1 MPI rank per socket and 16 OpenMP threads on each socket. Job script:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
#SBATCH --ntasks-per-node=2
...
export OMP_NUM_THREADS=16
mpprun binary.x
Instead of giving the flag --ntasks-per-node, you can also affect the number of tasks per node indirectly by giving e.g. --ntasks-per-core=1. This effectively disables hyper-threading and starts 16 MPI ranks per node. Update: --ntasks-per-core=1 is now the default; use --ntasks-per-core=2 to enable hyper-threading.
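The flag can also be given on the sbatch command line instead of in the job script; command-line options override the corresponding #SBATCH directives. For example (jobscript.sh is a placeholder name):
sbatch --ntasks-per-core=2 jobscript.sh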
Intel MPI jobs are started using mpiexec.hydra. The startup time can be improved by setting export I_MPI_HYDRA_PMI_CONNECT=alltoall in the job script. Please note that the IntelMPI module and the mpprun program do this automatically for you.
For OpenMP applications, you can set e.g. export KMP_AFFINITY=scatter to change the thread affinity.
The /software/apps directory with precompiled software is a work in progress and may not be available from day one of pilot testing.
When compiling your own software, use the build environment module buildenv/2015-1.
To get consistent, reproducible results from MKL, you can set export MKL_CBWR=AVX2 or export MKL_CBWR=AVX.
These are some specific tips for Nemo, supplied by Torgny and from the vendor's own testing. Suitable compiler options are:
%FC ifort -c -cpp -Nmpi
%FCFLAGS -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive
%FFLAGS -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive
%LD ifort -O3 -fp-model precise -assume byterecl -convert big_endian -Nmpi
Example batch script for a 16-node Nemo run. Here, we are not using hyper-threading, as NEMO does not benefit from that. There is also no OpenMP usage.
#!/bin/sh
#SBATCH -N 16
#SBATCH -t 01:00:00
.......................................
time mpprun -np 255 ./nemo.exe
...........................................................
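For reference: 16 nodes provide 16 x 16 = 256 physical cores, so -np 255 starts one MPI rank fewer than the number of available cores.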
Some early experiences from the Arome benchmarking.
See the example Arome "makeup" file below.
Suppose we want to run a 48-node Arome job using Intel MPI, with one extra node for the I/O server (49 nodes in total), 16 MPI ranks per node and 2 OpenMP threads per rank.
The script would look like:
#!/bin/sh
#SBATCH -J Forecast
#SBATCH -N 49
#SBATCH --ntasks-per-node=16
#SBATCH -t 01:00:00
.................
export NPROCX=16
export NPROCY=48
export NPROC_IO=16
export NPROC=$(( $NPROCX * $NPROCY ))
export TOTPROC=$(( $NPROCX * $NPROCY + $NPROC_IO ))
export NSTRIN=$NPROC
export NSTROUT=$NPROC
export OMP_NUM_THREADS=2
export KMP_STACKSIZE=128m
........................................................................NAMELIST etc....................
time mpprun LINK_TO_MASTERODB -maladin -vmeteo -eHARM -c001 -t$TSTEP -fh$FCLEN -asli || exit
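For reference, these settings give NPROC = 16 x 48 = 768 and TOTPROC = 768 + 16 = 784 MPI ranks in total, which matches 49 nodes x 16 ranks per node.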
NPROMA=-32 seems to work fine.
Speed up the launching of MPI jobs:
export I_MPI_HYDRA_PMI_CONNECT=alltoall
Improve MPI performance by selecting alternative algorithms for some of the MPI collective routines:
export I_MPI_ADJUST_ALLREDUCE=6
export I_MPI_ADJUST_BARRIER=1
export I_MPI_ADJUST_ALLTOALLV=2
Improve dynamic memory allocation:
export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=-1
Improve performance for larger values of OMP_NUM_THREADS (4 or more):
export KMP_AFFINITY=compact
export I_MPI_PIN_DOMAIN=omp:platform
Sometimes it can be beneficial to reduce the number of ranks; for example, running 15 ranks on each node, each with 2 OpenMP threads, seems to reduce the variability of the runtime. See the 96-node example in the table below.
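A minimal sketch of how this could be requested (only the relevant lines are shown; the rest of the job script is assumed to be unchanged):
#SBATCH --ntasks-per-node=15
...
export OMP_NUM_THREADS=2
mpprun binary.x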
To enable reproducible output, independent of MPI-rank distribution and number of OpenMP-threads:
export MKL_CBWR=SSE4_1
For a very large number of MPI ranks (ca 2500 and more), there is an additional overhead for each I/O step; it is not yet clear why.
With I/O-server:
Total number of nodes | 49 | 65 | 97 | 97 | 145 | 194 |
I/O nodes | 1 | 1 | 1 | 1 | 1 | 2 |
Compute nodes | 48 | 64 | 96 | 96 | 144 | 192 |
NPROCX | 16 | 16 | 30 | 32 | 48 | 32 |
NPROCY | 48 | 64 | 48 | 48 | 48 | 48 |
OMP_NUM_THREADS | 2 | 2 | 2 | 2 | 2 | 4 |
NPROC_IO | 16 | 16 | 16 | 16 | 16 | 32 |
NSTRIN | NPROC | NPROC | NPROC | NPROC | NPROC/2 | NPROC/4 |
NSTROUT | NPROC | NPROC | NPROC | NPROC | NPROC | NPROC |
MPI-ranks/node | 16 | 16 | 15 | 16 | 16 | 8 |
Hyperthreading | Yes | Yes | Yes | Yes | Yes | Yes |
No I/O-server:
Total number of nodes | 48 | 64 | 96 | 96 | 144 | 192 |
Compute nodes | 48 | 64 | 96 | 96 | 144 | 192 |
NPROCX | 16 | 16 | 30 | 32 | 48 | 32 |
NPROCY | 48 | 64 | 48 | 48 | 48 | 48 |
OMP_NUM_THREADS | 2 | 2 | 2 | 2 | 2 | 4 |
NPROC_IO | 0 | 0 | 0 | 0 | 0 | 32 |
NSTRIN | NPROC | NPROC | NPROC | NPROC | NPROC/2 | NPROC/4 |
NSTROUT | NPROC | NPROC | NPROC | NPROC | NPROC | NPROC |
MPI-ranks/node | 16 | 16 | 15 | 16 | 16 | 8 |
Hyperthreading | Yes | Yes | Yes | Yes | Yes | Yes |
MOD=mod
FOPT=-noauto -convert big_endian -assume byterecl -openmp -openmp-threadprivate=compat -O3 -fpe0 -fp-model precise -fp-speculation=safe -ftz
COPT=-O2 -fp-model precise -openmp -fp-speculation=safe -openmp-threadprivate=compat
DEFS=-DLINUX -DLITTLE -DLITTLE_ENDIAN -DHIGHRES -DADDRESS64 -DPOINTER_64 -D_ABI64 -DBLAS \
-DSTATIC_LINKING -DINTEL -D_RTTOV_DO_DISTRIBCOEF -DINTEGER_IS_INT \
-DREAL_8 -DREAL_BIGGER_THAN_INTEGER -DUSE_SAMIO -D_RTTOV_DO_DISTRIBCOEF -DNO_CURSES \
-DFA=fa -DLFI=lfi -DARO=aro -DOL=ol -DASC=asc -DTXT=txt
CC=icc -g -traceback -Nmpi
CCFLAGS=$(COPT) $(DEFS) -Dlinux -DFOPEN64
FC=ifort -Nmpi -g -traceback
FCFLAGS=$(FOPT) $(DEFS)
FCFREE=-free
FCFIXED=-nofree
AUTODBL=-r8
LD=ifort -Nmpi -O3 -g -traceback -fp-model precise -fpe0 -ftz
LDFLAGS=-pc 64 -openmp
MKLROOT=/software/apps/intel/composer_xe_2015.1.133/mkl
# System-dependent libraries - ALWAYS LOADED - (absolute filename or short name) :
LD_SYS01 = -lpthread -lm
# INTEL Math Kernel Library
LD_LANG01 = $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a
LD_LANG02 = $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a
LD_LANG03 = -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64
LD_LANG04 = -lmkl_core
LD_LANG05 = -lmkl_intel_thread
# MPI:
LD_MPI01 = -L$(I_MPI_ROOT)/intel64/lib -ldl -lrt -lpthread
SYSLIBS= $(LD_SYS01) \
$(LD_LANG01) $(LD_LANG02) $(LD_LANG03) $(LD_LANG04) $(LD_LANG05) $(LD_MPI01) \
$(GRIB_API_LIB)
#INCLDIRS=$(GRIB_API_INCLUDE) -I$(NETCDFINCLUDE)
INCLDIRS=$(GRIB_API_INCLUDE)
RANLIB=ls -l
PRESEARCH=-Wl,--start-group
POSTSEARCH=-Wl,--end-group
MPIDIR=/software/apps/intel/impi/5.0.2.044/intel64/lib
MPIDIR_INCL=/software/apps/intel/impi/5.0.2.044/intel64/include
YACCLEX_LIBS=-lm
LDCC=icc -Nmpi -O3 -DLINUX -w -lifcore $(LD_MPI01)
NPES=1
AUXSOURCES=sources.linux
# comma-separated list of external module references
EXTMODS=hdf5