"Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
"LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c

:link(lws,http://lammps.sandia.gov)
:link(ld,Manual.html)
:link(lc,Section_commands.html#comm)

:line

"Return to Section accelerate overview"_Section_accelerate.html

5.3.3 KOKKOS package :h5

Kokkos is a templated C++ library that provides abstractions to allow
a single implementation of an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware, such as GPUs, Intel Xeon Phis, or many-core
CPUs. Kokkos maps the C++ kernel onto different backend languages such as CUDA, OpenMP, or Pthreads.
The Kokkos library also provides data abstractions to adjust (at
compile time) the memory layout of data structures like 2d and
3d arrays to optimize performance on different hardware. For more information on Kokkos, see
"Github"_https://github.com/kokkos/kokkos. Kokkos is part of
"Trilinos"_http://trilinos.sandia.gov/packages/kokkos. The Kokkos library was written primarily by Carter Edwards,
Christian Trott, and Dan Sunderland (all Sandia).

The LAMMPS KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and macros provided by the Kokkos library,
which is included with LAMMPS in /lib/kokkos. The KOKKOS package was developed primarily by Christian Trott (Sandia)
and Stan Moore (Sandia) with contributions of various styles by others, including Sikandar
Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez (Sandia). For more
information on developing with Kokkos abstractions, see the Kokkos
programmers' guide at /lib/kokkos/doc/Kokkos_PG.pdf.

Kokkos currently provides support for 3 modes of execution (per MPI
task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP (threading
for many-core CPUs and Intel Phi), and CUDA (for NVIDIA GPUs). You choose the mode at build time to
produce an executable compatible with specific hardware.

[Building LAMMPS with the KOKKOS package:]

NOTE: Kokkos support within LAMMPS must be built with a C++11 compatible
compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or
Clang 3.5.2 or later is required.

The recommended method of building the KOKKOS package is to start with the provided Kokkos
Makefiles in /src/MAKE/OPTIONS/. You may need to modify the KOKKOS_ARCH variable in the Makefile
to match your specific hardware. For example:

for Sandy Bridge CPUs, set KOKKOS_ARCH=SNB
for Broadwell CPUs, set KOKKOS_ARCH=BDW
for K80 GPUs, set KOKKOS_ARCH=Kepler37
for P100 GPUs and Power8 CPUs, set KOKKOS_ARCH=Pascal60,Power8 :ul

See the [Advanced Kokkos Options] section below for a listing of all KOKKOS_ARCH options.
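
For example, after editing, the Kokkos settings near the top of
/src/MAKE/OPTIONS/Makefile.kokkos_mpi_only for a Sandy Bridge machine
would contain lines like these (an illustrative sketch; the rest of the
Makefile is left unchanged):

KOKKOS_DEVICES = Serial
KOKKOS_ARCH = SNB :pre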

[Compile for CPU-only (MPI only, no threading):]

Use a C++11 compatible compiler and set the KOKKOS_ARCH variable in
/src/MAKE/OPTIONS/Makefile.kokkos_mpi_only as described above. Then do the
following:

cd lammps/src
make yes-kokkos
make kokkos_mpi_only :pre

[Compile for CPU-only (MPI plus OpenMP threading):]

NOTE: To build with Kokkos support for OpenMP threading, your compiler must support the
OpenMP interface. You should have one or more multi-core CPUs so that
multiple threads can be launched by each MPI task running on a CPU.

Use a C++11 compatible compiler and set the KOKKOS_ARCH variable in
/src/MAKE/OPTIONS/Makefile.kokkos_omp as described above.  Then do the
following:

cd lammps/src
make yes-kokkos
make kokkos_omp :pre

[Compile for Intel KNL Xeon Phi (Intel Compiler, OpenMPI):]

Use a C++11 compatible compiler and do the following:

cd lammps/src
make yes-kokkos
make kokkos_phi :pre

[Compile for CPUs and GPUs (with OpenMPI or MPICH):]

NOTE: To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA software
version 7.5 or later must be installed on your system. See the
discussion for the "GPU"_accelerate_gpu.html package for details of
how to check and do this.

Use a C++11 compatible compiler and set the KOKKOS_ARCH variable in
/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi for both GPU and CPU as described
above.  Then do the following:

cd lammps/src
make yes-kokkos
make kokkos_cuda_mpi :pre

[Alternative Methods of Compiling:]

Alternatively, the KOKKOS package can be built by specifying Kokkos variables
on the make command line. For example:

make mpi KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=SNB     # set the KOKKOS_DEVICES and KOKKOS_ARCH variables explicitly
make kokkos_cuda_mpi KOKKOS_ARCH=Pascal60,Power8   # set the KOKKOS_ARCH variable explicitly :pre

Setting the KOKKOS_DEVICES and KOKKOS_ARCH variables on the
make command line requires a GNU-compatible make command. Try
"gmake" if your system's standard make complains.

NOTE: If you build using make command-line variables and re-build LAMMPS
twice with different KOKKOS options and the *same* target, then you *must*
perform a "make clean-all" or "make clean-machine" before each build. This
forces all the KOKKOS-dependent files to be re-compiled with the new
options.
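
For example, if the "mpi" target was previously built with
KOKKOS_DEVICES=OpenMP and must be re-built with different Kokkos options
(an illustrative sequence):

make clean-all
make mpi KOKKOS_DEVICES=Serial KOKKOS_ARCH=SNB :pre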

[Running LAMMPS with the KOKKOS package:]

All Kokkos operations occur within the
context of an individual MPI task running on a single node of the
machine. The total number of MPI tasks used by LAMMPS (one or
multiple per compute node) is set in the usual manner via the mpirun
or mpiexec commands, and is independent of Kokkos. E.g. the mpirun
command in OpenMPI does this via its
-np and -npernode switches. Ditto for MPICH via -np and -ppn.

[Running on a multi-core CPU:]

Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.

mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj        # 1 node, 16 MPI tasks/node, no multi-threading
mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj  # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj          # 1 node,  2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj  # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre

To run using the KOKKOS package, use the "-k on", "-sf kk" and "-pk kokkos" "command-line switches"_Section_start.html#start_7 in your mpirun command.
You must use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package. It
takes additional arguments for hardware settings appropriate to your
system. Those arguments are "documented
here"_Section_start.html#start_7. For OpenMP use:

-k on t Nt :pre

The "t Nt" option specifies how many OpenMP threads per MPI
task to use with a node. The default is Nt = 1, which is MPI-only mode.
Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer. If hyperthreading is enabled, then
the product of MPI tasks * OpenMP threads/task should not exceed the
physical number of cores * hardware threads.
The "-k on" switch also issues a "package kokkos" command (with no
additional arguments) which sets various KOKKOS options to default
values, as discussed on the "package"_package.html command doc page.
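
For example, on a single 16-core node (without hyperthreading), any of
the following combinations uses all 16 physical cores without
oversubscribing them:

mpirun -np 16 lmp_kokkos_omp -k on t 1 -sf kk -in in.lj  # 16 MPI tasks * 1 thread/task = 16
mpirun -np 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj   # 4 MPI tasks * 4 threads/task = 16
mpirun -np 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj  # 1 MPI task * 16 threads/task = 16 :pre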

The "-sf kk" "command-line switch"_Section_start.html#start_7
will automatically append the "/kk" suffix to styles that support it.
In this manner no modification to the input script is needed. Alternatively,
one can run with the KOKKOS package by editing the input script as described below.

NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. However, when running on CPUs, it
will typically be faster to use "half" neighbor lists and set the
Newton flag to "on", just as is the case for non-accelerated pair
styles. It can also be faster to use non-threaded communication.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:

mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj       # Newton on, Half neighbor list, non-threaded comm :pre

If the "newton"_newton.html command is used in the input
script, it can also override the Newton flag defaults.
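
For example, adding this line near the top of the input script turns the
Newton flag on for both pairwise and bonded interactions, overriding the
KOKKOS package default:

newton on :pre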

[Core and Thread Affinity:]

When using multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ... :pre

For binding threads with KOKKOS OpenMP, use thread affinity
environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
later, Intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient. In general, for best performance
with OpenMP 4.0 or later, set OMP_PROC_BIND=spread and OMP_PLACES=threads.
For binding threads with the KOKKOS pthreads option, compile LAMMPS with
the KOKKOS_USE_TPLS=hwloc option as described below.
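
For example, with a bash-compatible shell, thread binding for a run with
8 OpenMP threads per MPI task can be forced as follows (an illustrative
sketch):

export OMP_PROC_BIND=spread
export OMP_PLACES=threads
mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj :pre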

[Running on Knights Landing (KNL) Intel Xeon Phi:]

Here is a quick overview of how to use the KOKKOS package
for the Intel Knights Landing (KNL) Xeon Phi:

KNL Intel Phi chips have 68 physical cores. Typically 1 to 4 cores
are reserved for the OS, and only 64 or 66 cores are used. Each core
has 4 hyperthreads, so there are effectively N = 256 (4*64) or
N = 264 (4*66) cores to run on. The product of MPI tasks * OpenMP threads/task should not exceed this limit,
otherwise performance will suffer. Note that with the KOKKOS package you do not need to
specify how many KNLs there are per node; each
KNL is simply treated as running some number of MPI tasks.

Examples of mpirun commands that follow these rules are shown below.

Intel KNL node with 68 cores (272 threads/node via 4x hardware threading):
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj      # 1 node, 64 MPI tasks/node, 4 threads/task
mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj      # 1 node, 66 MPI tasks/node, 4 threads/task
mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj      # 1 node, 32 MPI tasks/node, 8 threads/task
mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj  # 8 nodes, 64 MPI tasks/node, 4 threads/task :pre

The -np setting of the mpirun command sets the total number of MPI
tasks (with -ppn setting the number of tasks/node for multi-node runs).
The "-k on t Nt" command-line switch sets the number of threads/task as
Nt. The product of MPI tasks/node and threads/task should be N, i.e.
256 or 264.

NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions. When running on KNL, this
will typically be best for pair-wise potentials. For manybody potentials,
using "half" neighbor lists and setting the
Newton flag to "on" may be faster. It can also be faster to use non-threaded communication.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:

mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj      #  Newton off, full neighbor list, non-threaded comm
mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax      # Newton on, half neighbor list, non-threaded comm :pre

NOTE: MPI tasks and threads should be bound to cores as described above for CPUs.

NOTE: To build with Kokkos support for Intel Xeon Phi coprocessors such as Knights Corner (KNC), your
system must be configured to use them in "native" mode, not "offload"
mode like the USER-INTEL package supports.

[Running on GPUs:]

Use the "-k" "command-line switch"_Section_commands.html#start_7 to
specify the number of GPUs per node. Typically the -np setting
of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.
You can assign multiple MPI tasks to the same GPU with the
KOKKOS package, but this is usually only faster if significant portions
of the input script have not been ported to use Kokkos. Using CUDA MPS
is recommended in this scenario. As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node should not exceed N.

-k on g Ng :pre

Here are examples of how to use the KOKKOS package for GPUs,
assuming one or more nodes, each with two GPUs:

mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj          # 1 node,   2 MPI tasks/node, 2 GPUs/node
mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj  # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre

NOTE: The default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions, along with threaded communication.
When running on Maxwell or Kepler GPUs, this will typically be best. For Pascal GPUs,
using "half" neighbor lists and setting the
Newton flag to "on" may be faster. For many pair styles, setting the neighbor binsize
equal to the ghost atom cutoff will give speedup.
Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
change the default "package kokkos"_package.html
options. See its doc page for details and default settings. Experimenting with
its options can provide a speed-up for specific calculations. For example:

mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos binsize 2.8 -in in.lj      # Set binsize = neighbor ghost cutoff
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj      # Newton on, half neighbor list, set binsize = neighbor ghost cutoff :pre

NOTE: For good performance of the KOKKOS package on GPUs, you must
have Kepler generation GPUs (or later). The Kokkos library exploits
texture cache options not supported by Tesla generation GPUs (or
older).

NOTE: When using a GPU, you will achieve the best performance if your
input script does not use fix or compute styles which are not yet
Kokkos-enabled. This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU. Invoking a
non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU incurring a performance penalty.

NOTE: To get an accurate timing breakdown between time spent in pair,
kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
However, this will reduce performance and is not recommended for production runs.
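
For example, with a bash-compatible shell (an illustrative sketch, for
timing analysis only):

export CUDA_LAUNCH_BLOCKING=1
mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj :pre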

[Run with the KOKKOS package by editing an input script:]

Alternatively, the effect of the "-sf" or "-pk" switches can be
duplicated by adding the "package kokkos"_package.html or "suffix
kk"_suffix.html commands to your input script.

The discussion above about building LAMMPS with the KOKKOS package,
using the mpirun/mpiexec command, and setting an appropriate number of
threads applies here as well.

You must still use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.

You can use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.

pair_style lj/cut/kk 2.5 :pre

You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults, as set by the "-k on"
"command-line switch"_Section_start.html#start_7.

[Using OpenMP threading and CUDA together (experimental):]

With the KOKKOS package, both OpenMP multi-threading and GPUs can be used
together in a few special cases. In the Makefile, the KOKKOS_DEVICES variable must
include both "Cuda" and "OpenMP", as is the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi:

KOKKOS_DEVICES=Cuda,OpenMP :pre

The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
using the "-sf kk" in the command line gives the default CUDA version everywhere.
However, if the "/kk/host" suffix is added to a specific style in the input
script, the Kokkos OpenMP (CPU) version of that specific style will be used instead.
Set the number of OpenMP threads as "t Nt" and the number of GPUs as "g Ng"

-k on t Nt g Ng :pre

For example, the command to run with 1 GPU and 8 OpenMP threads is then:

mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk :pre

Conversely, if "-sf kk/host" is used on the command line and then the
"/kk" or "/kk/device" suffix is added to a specific style in your input script,
then only that specific style will run on the GPU while everything else will
run on the CPU in OpenMP mode. Note that the execution of the CPU and GPU
styles will NOT overlap, except for a special case:

A kspace style and/or molecular topology (bonds, angles, etc.) running on
the host CPU can overlap with a pair style running on the GPU. First compile
with "--default-stream per-thread" added to CCFLAGS in the Kokkos CUDA Makefile.
Then explicitly use the "/kk/host" suffix for kspace and bonds, angles, etc.
in the input file and the "kk" suffix (equal to "kk/device") on the command line.
Also make sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
so CPU/GPU overlap can occur.
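
For example, following the recipe above with "-sf kk" on the command line,
input script lines like these keep long-range electrostatics on the host
CPU while the pair style runs on the GPU (an illustrative sketch, assuming
the Kokkos versions of both styles are installed):

kspace_style pppm/kk/host 1.0e-4
pair_style lj/cut/coul/long 10.0 :pre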

[Speed-ups to expect:]

The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
size.

Generally speaking, the following rules of thumb apply:

When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package. However, the difference between all 3 is small (less
than 20%). :ulb,l

When running on CPUs only, with multiple threads per MPI task,
performance of a KOKKOS style is a bit slower than the USER-OMP
package. :l

When running a large number of atoms per GPU, KOKKOS is typically faster
than the GPU package. :l

When running on Intel hardware, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l
:ule

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.

[Advanced Kokkos options:]

There are other allowed options when building with the KOKKOS package.
As above, they can be set either as variables on the make command line
or in Makefile.machine. This is the full list of options, including
those discussed above. Each takes a value shown below. The
default value is listed, which is set in the
/lib/kokkos/Makefile.kokkos file.

KOKKOS_DEVICES, values = {Serial}, {OpenMP}, {Pthreads}, {Cuda}, default = {OpenMP}
KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {Pascal60}, {Pascal61}, {ARMv80}, {ARMv81}, {ARMv8-ThunderX}, {BGQ}, {Power7}, {Power8}, {Power9}, {KNL}, {BDW}, {SKX}, default = {none}
KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
KOKKOS_USE_TPLS, values = {hwloc}, {librt}, {experimental_memkind}, default = {none}
KOKKOS_CXX_STANDARD, values = {c++11}, {c++1z}, default = {c++11}
KOKKOS_OPTIONS, values = {aggressive_vectorization}, {disable_profiling}, default = {none}
KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc}, {enable_lambda}, default = {enable_lambda} :ul
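
For example, several of these options can be combined as variables on the
make command line (an illustrative command for a pthreads build with hwloc
thread binding):

make mpi KOKKOS_DEVICES=Pthreads KOKKOS_ARCH=SNB KOKKOS_USE_TPLS=hwloc :pre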

KOKKOS_DEVICES sets the parallelization method used for Kokkos code
(within LAMMPS). KOKKOS_DEVICES=Serial means that no threading will be used.
KOKKOS_DEVICES=OpenMP means that OpenMP threading will be
used. KOKKOS_DEVICES=Pthreads means that pthreads will be used.
KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.

KOKKOS_ARCH enables compiler switches needed when compiling for
specific hardware:

ARMv80 = ARMv8.0 Compatible CPU
ARMv81 = ARMv8.1 Compatible CPU
ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
SNB = Intel Sandy/Ivy Bridge CPUs
HSW = Intel Haswell CPUs
BDW = Intel Broadwell Xeon E-class CPUs
SKX = Intel Skylake Xeon E-class HPC CPUs (AVX512)
KNC = Intel Knights Corner Xeon Phi
KNL = Intel Knights Landing Xeon Phi
Kepler30 = NVIDIA Kepler generation CC 3.0
Kepler32 = NVIDIA Kepler generation CC 3.2
Kepler35 = NVIDIA Kepler generation CC 3.5
Kepler37 = NVIDIA Kepler generation CC 3.7
Maxwell50 = NVIDIA Maxwell generation CC 5.0
Maxwell52 = NVIDIA Maxwell generation CC 5.2
Maxwell53 = NVIDIA Maxwell generation CC 5.3
Pascal60 = NVIDIA Pascal generation CC 6.0
Pascal61 = NVIDIA Pascal generation CC 6.1
BGQ = IBM Blue Gene/Q CPUs
Power8 = IBM POWER8 CPUs
Power9 = IBM POWER9 CPUs :ul

KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
used if running with KOKKOS_DEVICES=Pthreads. It is not
necessary with KOKKOS_DEVICES=OpenMP, because OpenMP
provides alternative methods via environment variables for binding
threads to hardware cores. More info on binding threads to cores is
given in "Section 5.3"_Section_accelerate.html#acc_3.

KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
on most Unix platforms. This library is not available on all
platforms.

KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
debugging information that can be useful. It also enables runtime
bounds checking on Kokkos data structures.

KOKKOS_CXX_STANDARD and KOKKOS_OPTIONS are typically not changed when building LAMMPS.

KOKKOS_CUDA_OPTIONS are additional options for CUDA. The LAMMPS KOKKOS package must be compiled
with the {enable_lambda} option when using GPUs.

[Restrictions:]

Currently, there are no precision options with the KOKKOS
package. All compilation and computation is performed in double
precision.