Bi has 16 cores per compute node, just like Krypton. If you have a working job configuration for Krypton, you should be able to run exactly the same job on Bi -- it will just run faster (a typical improvement is around 50%).
Bi has hyper-threading available, making each physical core appear as two virtual cores. If you enable it using --ntasks-per-core=2, mpprun automatically starts 32 MPI ranks per compute node. If you do not, you get 16 MPI ranks per compute node (as long as you do not change that using other parameters). Hyper-threading makes some applications, like Arome, run faster (about 10%). See below for more information about hyper-threading and Slurm. Note: during the pilot phase until 2015-02-25, hyper-threading was on by default.
Bi has Intel Xeon E5v3 processors of the "Haswell" generation. Haswell CPUs have improved vectorization with AVX2 instructions. In theory, up to 8 floating-point instructions can be handled per clock cycle (up from 4 using AVX). To benefit from this, you need to recompile your software with high optimization (e.g. -O2 -xCORE-AVX2) or at least link with an external library that has AVX2 support (like Intel's MKL).
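For example, recompiling a Fortran program with the Intel compiler could look like the following minimal sketch (the source file name is a placeholder, and a compiler module such as buildenv/2015-1 is assumed to be loaded):
ifort -O2 -xCORE-AVX2 -o mysolver mysolver.f90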
Bi has 64 GB of memory in the thin compute nodes, twice the amount of Krypton. The memory speed has also improved: Bi has 1866 MHz DDR4 memory. In low-level memory benchmarks like STREAM, we see up to 30% improvement. For certain applications, this can lead to a substantial speed-up, even without recompiling them.
Bi has Intel Truescale Infiniband (previously known as QLogic Truescale) -- earlier clusters at NSC have had Infiniband from Mellanox. As a user, you will probably not notice this, but if you are using your own MPI library, you may have to supply special flags or recompile it with "PSM" or "TMI" support to get the best performance. In low-level benchmarks, we have seen that Truescale Infiniband is especially strong at small messages (high "packet rate").
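As an illustration only (an assumption about Intel MPI, not a setting required on Bi), the TMI fabric can typically be selected by exporting the following in the job script:
export I_MPI_FABRICS=shm:tmi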
Bi has the Slurm job scheduling system, like earlier clusters at NSC. Below, we present some examples of how to launch parallel jobs with different kinds of parallelization.
This job script will allocate e.g. 8 nodes with 16 cores per node. Run like this if you want everything to be as similar as possible to Krypton:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
...
mpprun binary.x
To use hyper-threading, add --ntasks-per-core=2 and mpprun will launch 32 MPI ranks per node:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
#SBATCH --ntasks-per-core=2
...
mpprun binary.x
In this case, each MPI rank will spawn a number of OpenMP threads. You can have up to 2 OpenMP threads per core. There are many possible combinations. We expect that the following combinations are likely to run well:
16 MPI ranks x 2 OpenMP threads = 1 MPI rank per physical core, with 2 OpenMP threads per rank (one per virtual core). Job script:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
#SBATCH --ntasks-per-node=16
...
export OMP_NUM_THREADS=2
mpprun binary.x
2 MPI ranks x 16 OpenMP threads = 1 MPI rank per socket and 16 OpenMP threads on each socket. Job script:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
#SBATCH --ntasks-per-node=2
...
export OMP_NUM_THREADS=16
mpprun binary.x
Instead of giving the flag --ntasks-per-node, you can also affect the number of tasks per node indirectly by giving e.g. --ntasks-per-core=1. This effectively disables hyper-threading and starts 16 MPI ranks per node. Update: --ntasks-per-core=1 is now the default; use --ntasks-per-core=2 to enable hyper-threading.
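The flag can also be given on the sbatch command line instead of in the job script; command-line options override the corresponding #SBATCH directives. For example (jobscript.sh is a placeholder name):
sbatch --ntasks-per-core=2 jobscript.sh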
Intel MPI jobs are started using mpiexec.hydra. The startup time can be improved by setting export I_MPI_HYDRA_PMI_CONNECT=alltoall in the job script. Please note that the IntelMPI module and the mpprun program do this automatically for you.
For OpenMP applications, you can set e.g. export KMP_AFFINITY=scatter to change the thread affinity.
The /software/apps directory with precompiled software is a work in progress and may not be available from day one of pilot testing.
When compiling your own software, use the build environment module buildenv/2015-1.
To get consistent, reproducible results from MKL, you can set export MKL_CBWR=AVX2 or export MKL_CBWR=AVX.
These are some specific tips for Nemo, supplied by Torgny and from the vendor's own testing. Suitable compiler options are:
%FC ifort -c -cpp -Nmpi
%FCFLAGS -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive
%FFLAGS -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive
%LD ifort -O3 -fp-model precise -assume byterecl -convert big_endian -Nmpi
Example batch script for a 16-node Nemo run. Here, we are not using hyper-threading, as NEMO does not benefit from that. There is also no OpenMP usage.
#!/bin/sh
#SBATCH -N 16
#SBATCH -t 01:00:00
.......................................
time mpprun -np 255 ./nemo.exe
...........................................................
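For reference: 16 nodes provide 16 x 16 = 256 physical cores, so -np 255 starts one MPI rank fewer than the number of available cores.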
Some early experiences from the Arome benchmarking.
See the example Arome "makeup" file below.
Suppose we want to run a 48-node Arome job using Intel MPI, with one extra node for the I/O server (49 nodes in total), 16 MPI ranks per node and 2 OpenMP threads per rank.
The script would look like:
#!/bin/sh
#SBATCH -J Forecast
#SBATCH -N 49
#SBATCH --ntasks-per-node=16
#SBATCH -t 01:00:00
.................
export NPROCX=16
export NPROCY=48
export NPROC_IO=16
export NPROC=$(( $NPROCX * $NPROCY ))
export TOTPROC=$(( $NPROCX * $NPROCY + $NPROC_IO ))
export NSTRIN=$NPROC
export NSTROUT=$NPROC
export OMP_NUM_THREADS=2
export KMP_STACKSIZE=128m
........................................................................NAMELIST etc....................
time mpprun LINK_TO_MASTERODB -maladin -vmeteo -eHARM -c001 -t$TSTEP -fh$FCLEN -asli || exit
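For reference, these settings give NPROC = 16 x 48 = 768 and TOTPROC = 768 + 16 = 784 MPI ranks in total, which matches 49 nodes x 16 ranks per node.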
NPROMA=-32 seems to work fine.
Speed up the launching of MPI jobs:
export I_MPI_HYDRA_PMI_CONNECT=alltoall
Improve MPI performance by selecting alternative algorithms for some of the MPI collective routines:
export I_MPI_ADJUST_ALLREDUCE=6
export I_MPI_ADJUST_BARRIER=1
export I_MPI_ADJUST_ALLTOALLV=2
Improve dynamic memory allocation:
export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=-1
Improve performance for larger values of OMP_NUM_THREADS (4 or more):
export KMP_AFFINITY=compact
export I_MPI_PIN_DOMAIN=omp:platform
Sometimes it can be beneficial to reduce the number of ranks; for example, running 15 ranks on each node, each with 2 OpenMP threads, seems to reduce the variability of the runtime. See the 96-node example in the table below.
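A minimal sketch of how this could be requested (only the relevant lines are shown; the rest of the job script is assumed to be unchanged):
#SBATCH --ntasks-per-node=15
...
export OMP_NUM_THREADS=2
mpprun binary.x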
To enable reproducible output, independent of MPI-rank distribution and number of OpenMP-threads:
export MKL_CBWR=SSE4_1
For a very large number of MPI ranks (ca 2500 and more), there is an additional overhead for each I/O step; it is not yet clear why.
With I/O-server:
Total number of nodes | 49 | 65 | 97 | 97 | 145 | 194 |
I/O nodes | 1 | 1 | 1 | 1 | 1 | 2 |
Compute nodes | 48 | 64 | 96 | 96 | 144 | 192 |
NPROCX | 16 | 16 | 30 | 32 | 48 | 32 |
NPROCY | 48 | 64 | 48 | 48 | 48 | 48 |
OMP_NUM_THREADS | 2 | 2 | 2 | 2 | 2 | 4 |
NPROC_IO | 16 | 16 | 16 | 16 | 16 | 32 |
NSTRIN | NPROC | NPROC | NPROC | NPROC | NPROC/2 | NPROC/4 |
NSTROUT | NPROC | NPROC | NPROC | NPROC | NPROC | NPROC |
MPI-ranks/node | 16 | 16 | 15 | 16 | 16 | 8 |
Hyperthreading | Yes | Yes | Yes | Yes | Yes | Yes |
No I/O-server:
Total number of nodes | 48 | 64 | 96 | 96 | 144 | 192 |
Compute nodes | 48 | 64 | 96 | 96 | 144 | 192 |
NPROCX | 16 | 16 | 30 | 32 | 48 | 32 |
NPROCY | 48 | 64 | 48 | 48 | 48 | 48 |
OMP_NUM_THREADS | 2 | 2 | 2 | 2 | 2 | 4 |
NPROC_IO | 0 | 0 | 0 | 0 | 0 | 32 |
NSTRIN | NPROC | NPROC | NPROC | NPROC | NPROC/2 | NPROC/4 |
NSTROUT | NPROC | NPROC | NPROC | NPROC | NPROC | NPROC |
MPI-ranks/node | 16 | 16 | 15 | 16 | 16 | 8 |
Hyperthreading | Yes | Yes | Yes | Yes | Yes | Yes |
MOD=mod
FOPT=-noauto -convert big_endian -assume byterecl -openmp -openmp-threadprivate=compat -O3 -fpe0 -fp-model precise -fp-speculation=safe -ftz
COPT=-O2 -fp-model precise -openmp -fp-speculation=safe -openmp-threadprivate=compat
DEFS=-DLINUX -DLITTLE -DLITTLE_ENDIAN -DHIGHRES -DADDRESS64 -DPOINTER_64 -D_ABI64 -DBLAS \
-DSTATIC_LINKING -DINTEL -D_RTTOV_DO_DISTRIBCOEF -DINTEGER_IS_INT \
-DREAL_8 -DREAL_BIGGER_THAN_INTEGER -DUSE_SAMIO -D_RTTOV_DO_DISTRIBCOEF -DNO_CURSES \
-DFA=fa -DLFI=lfi -DARO=aro -DOL=ol -DASC=asc -DTXT=txt
CC=icc -g -traceback -Nmpi
CCFLAGS=$(COPT) $(DEFS) -Dlinux -DFOPEN64
FC=ifort -Nmpi -g -traceback
FCFLAGS=$(FOPT) $(DEFS)
FCFREE=-free
FCFIXED=-nofree
AUTODBL=-r8
LD=ifort -Nmpi -O3 -g -traceback -fp-model precise -fpe0 -ftz
LDFLAGS=-pc 64 -openmp
MKLROOT=/software/apps/intel/composer_xe_2015.1.133/mkl
# System-dependent libraries - ALWAYS LOADED - (absolute filename or short name) :
LD_SYS01 = -lpthread -lm
# INTEL Math Kernel Library
LD_LANG01 = $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a
LD_LANG02 = $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a
LD_LANG03 = -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64
LD_LANG04 = -lmkl_core
LD_LANG05 = -lmkl_intel_thread
# MPI:
LD_MPI01 = -L$(I_MPI_ROOT)/intel64/lib -ldl -lrt -lpthread
SYSLIBS= $(LD_SYS01) \
$(LD_LANG01) $(LD_LANG02) $(LD_LANG03) $(LD_LANG04) $(LD_LANG05) $(LD_MPI01) \
$(GRIB_API_LIB)
#INCLDIRS=$(GRIB_API_INCLUDE) -I$(NETCDFINCLUDE)
INCLDIRS=$(GRIB_API_INCLUDE)
RANLIB=ls -l
PRESEARCH=-Wl,--start-group
POSTSEARCH=-Wl,--end-group
MPIDIR=/software/apps/intel/impi/5.0.2.044/intel64/lib
MPIDIR_INCL=/software/apps/intel/impi/5.0.2.044/intel64/include
YACCLEX_LIBS=-lm
LDCC=icc -Nmpi -O3 -DLINUX -w -lifcore $(LD_MPI01)
NPES=1
AUXSOURCES=sources.linux
# comma-separated list of external module references
EXTMODS=hdf5