diff --git a/doc/src/accelerate_kokkos.txt b/doc/src/accelerate_kokkos.txt index 2b07ed035f8b6cc5b1f96805d9f4ba4b6d1dce70..b2279c5c719b46562c0ecf5593a9600e453f3b74 100644 --- a/doc/src/accelerate_kokkos.txt +++ b/doc/src/accelerate_kokkos.txt @@ -11,336 +11,346 @@ 5.3.3 KOKKOS package :h5 -The KOKKOS package was developed primarily by Christian Trott (Sandia) -with contributions of various styles by others, including Sikandar -Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia). The -underlying Kokkos library was written primarily by Carter Edwards, +Kokkos is a templated C++ library that provides abstractions to allow +a single implementation of an application kernel (e.g. a pair style) to run efficiently on +different kinds of hardware, such as GPUs, Intel Xeon Phis, or many-core +CPUs. Kokkos maps the C++ kernel onto different backend languages such as CUDA, OpenMP, or Pthreads. +The Kokkos library also provides data abstractions to adjust (at +compile time) the memory layout of data structures like 2d and +3d arrays to optimize performance on different hardware. For more information on Kokkos, see +"Github"_https://github.com/kokkos/kokkos. Kokkos is part of +"Trilinos"_http://trilinos.sandia.gov/packages/kokkos. The Kokkos library was written primarily by Carter Edwards, Christian Trott, and Dan Sunderland (all Sandia). -The KOKKOS package contains versions of pair, fix, and atom styles +The LAMMPS KOKKOS package contains versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library, -which is included with LAMMPS in lib/kokkos. - -The Kokkos library is part of -"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and can also be -downloaded from "Github"_https://github.com/kokkos/kokkos. Kokkos is a -templated C++ library that provides two key abstractions for an -application like LAMMPS. First, it allows a single implementation of -an application kernel (e.g. 
a pair style) to run efficiently on -different kinds of hardware, such as a GPU, Intel Phi, or many-core -CPU. - -The Kokkos library also provides data abstractions to adjust (at -compile time) the memory layout of basic data structures like 2d and -3d arrays and allow the transparent utilization of special hardware -load and store operations. Such data structures are used in LAMMPS to -store atom coordinates or forces or neighbor lists. The layout is -chosen to optimize performance on different platforms. Again this -functionality is hidden from the developer, and does not affect how -the kernel is coded. - -These abstractions are set at build time, when LAMMPS is compiled with -the KOKKOS package installed. All Kokkos operations occur within the -context of an individual MPI task running on a single node of the -machine. The total number of MPI tasks used by LAMMPS (one or -multiple per compute node) is set in the usual manner via the mpirun -or mpiexec commands, and is independent of Kokkos. +which is included with LAMMPS in /lib/kokkos. The KOKKOS package was developed primarily by Christian Trott (Sandia) +and Stan Moore (Sandia) with contributions of various styles by others, including Sikandar +Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez (Sandia). For more information on developing using Kokkos abstractions +see the Kokkos programmers' guide at /lib/kokkos/doc/Kokkos_PG.pdf. Kokkos currently provides support for 3 modes of execution (per MPI -task). These are OpenMP (for many-core CPUs), Cuda (for NVIDIA GPUs), -and OpenMP (for Intel Phi). Note that the KOKKOS package supports -running on the Phi in native mode, not offload mode like the -USER-INTEL package supports. You choose the mode at build time to +task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP (threading +for many-core CPUs and Intel Phi), and CUDA (for NVIDIA GPUs). You choose the mode at build time to produce an executable compatible with specific hardware. 
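As a rough illustration of the three execution modes just described, the backend selection can be sketched as a simple lookup. This is a sketch only: the mode labels "serial", "openmp", and "cuda" are hypothetical names for this example, not LAMMPS options, and the actual selection is made via the KOKKOS_DEVICES Makefile variable covered in the build instructions that follow.

```shell
# Illustrative only: which KOKKOS_DEVICES value selects each of the
# three per-MPI-task execution modes described above.
kokkos_devices_for() {
  case "$1" in
    serial) echo "Serial" ;;   # MPI-only on CPUs and Intel Phi
    openmp) echo "OpenMP" ;;   # threading on many-core CPUs and Phi
    cuda)   echo "Cuda" ;;     # NVIDIA GPUs
    *)      echo "unknown"; return 1 ;;
  esac
}
kokkos_devices_for openmp
```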
-Here is a quick overview of how to use the KOKKOS package -for CPU acceleration, assuming one or more 16-core nodes. -More details follow. +[Building LAMMPS with the KOKKOS package:] -use a C++11 compatible compiler -make yes-kokkos -make mpi KOKKOS_DEVICES=OpenMP # build with the KOKKOS package -make kokkos_omp # or Makefile.kokkos_omp already has variable set :pre +NOTE: Kokkos support within LAMMPS must be built with a C++11 compatible +compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or +Clang 3.5.2 or later is required. + +The recommended method of building the KOKKOS package is to start with the provided Kokkos +Makefiles in /src/MAKE/OPTIONS/. You may need to modify the KOKKOS_ARCH variable in the Makefile +to match your specific hardware. For example: -mpirun -np 16 lmp_mpi -k on -sf kk -in in.lj # 1 node, 16 MPI tasks/node, no threads -mpirun -np 2 -ppn 1 lmp_mpi -k on t 16 -sf kk -in in.lj # 2 nodes, 1 MPI task/node, 16 threads/task -mpirun -np 2 lmp_mpi -k on t 8 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 8 threads/task -mpirun -np 32 -ppn 4 lmp_mpi -k on t 4 -sf kk -in in.lj # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre +for Sandy Bridge CPUs, set KOKKOS_ARCH=SNB +for Broadwell CPUs, set KOKKOS_ARCH=BWD +for K80 GPUs, set KOKKOS_ARCH=Kepler37 +for P100 GPUs and Power8 CPUs, set KOKKOS_ARCH=Pascal60,Power8 :ul -specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support -include the KOKKOS package and build LAMMPS -enable the KOKKOS package and its hardware options via the "-k on" command-line switch use KOKKOS styles in your input script :ul +See the [Advanced Kokkos Options] section below for a listing of all KOKKOS_ARCH options. -Here is a quick overview of how to use the KOKKOS package for GPUs, -assuming one or more nodes, each with 16 cores and a GPU. More -details follow. 
+[Compile for CPU-only (MPI only, no threading):] -discuss use of NVCC, which Makefiles to examine +use a C++11 compatible compiler and set KOKKOS_ARCH variable in +/src/MAKE/OPTIONS/Makefile.kokkos_mpi_only as described above. Then do the +following: -use a C++11 compatible compiler -KOKKOS_DEVICES = Cuda, OpenMP -KOKKOS_ARCH = Kepler35 +cd lammps/src make yes-kokkos -make machine :pre +make kokkos_mpi_only :pre -mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj # one MPI task, 6 threads on CPU -mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj # ditto on 4 nodes :pre +[Compile for CPU-only (MPI plus OpenMP threading):] -mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # two MPI tasks, 8 threads per CPU -mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj # ditto on 16 nodes :pre +NOTE: To build with Kokkos support for OpenMP threading, your compiler must support the +OpenMP interface. You should have one or more multi-core CPUs so that +multiple threads can be launched by each MPI task running on a CPU. -Here is a quick overview of how to use the KOKKOS package -for the Intel Phi: +use a C++11 compatible compiler and set KOKKOS_ARCH variable in +/src/MAKE/OPTIONS/Makefile.kokkos_omp as described above. 
Then do the +following: -use a C++11 compatible compiler -KOKKOS_DEVICES = OpenMP -KOKKOS_ARCH = KNC +cd lammps/src make yes-kokkos -make machine :pre +make kokkos_omp :pre -host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading): -mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj # 1 MPI task on 1 Phi, 1*240 = 240 -mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj # 30 MPI tasks on 1 Phi, 30*8 = 240 -mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj # 12 MPI tasks on 1 Phi, 12*20 = 240 -mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj # ditto on 8 Phis :pre +[Compile for Intel KNL Xeon Phi (Intel Compiler, OpenMPI):] -[Required hardware/software:] +use a C++11 compatible compiler and do the following: -Kokkos support within LAMMPS must be built with a C++11 compatible -compiler. If using gcc, version 4.7.2 or later is required. +cd lammps/src +make yes-kokkos +make kokkos_phi :pre -To build with Kokkos support for CPUs, your compiler must support the -OpenMP interface. You should have one or more multi-core CPUs so that -multiple threads can be launched by each MPI task running on a CPU. +[Compile for CPUs and GPUs (with OpenMPI or MPICH):] -To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA software -version 7.5 or later must be installed on your system. See the +NOTE: To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA software +version 7.5 or later must be installed on your system. See the discussion for the "GPU"_accelerate_gpu.html package for details of how to check and do this. -NOTE: For good performance of the KOKKOS package on GPUs, you must -have Kepler generation GPUs (or later). The Kokkos library exploits -texture cache options not supported by Telsa generation GPUs (or -older). - -To build with Kokkos support for Intel Xeon Phi coprocessors, your -sysmte must be configured to use them in "native" mode, not "offload" -mode like the USER-INTEL package supports. 
+use a C++11 compatible compiler and set KOKKOS_ARCH variable in +/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi for both GPU and CPU as described +above. Then do the following: -[Building LAMMPS with the KOKKOS package:] - -You must choose at build time whether to build for CPUs (OpenMP), -GPUs, or Phi. +cd lammps/src +make yes-kokkos +make kokkos_cuda_mpi :pre -You can do any of these in one line, using the suitable make command -line flags as described in "Section 4"_Section_packages.html of the -manual. If run from the src directory, these -commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda_mpi, and -lmp_kokkos_phi. Note that the OMP and PHI options use -src/MAKE/Makefile.mpi as the starting Makefile.machine. The CUDA -option uses src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi. +[Alternative Methods of Compiling:] -The latter two steps can be done using the "-k on", "-pk kokkos" and -"-sf kk" "command-line switches"_Section_start.html#start_6 -respectively. Or the effect of the "-pk" or "-sf" switches can be -duplicated by adding the "package kokkos"_package.html or "suffix -kk"_suffix.html commands respectively to your input script. +Alternatively, the KOKKOS package can be built by specifying Kokkos variables +on the make command line. For example: +make mpi KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=SNB # set the KOKKOS_DEVICES and KOKKOS_ARCH variable explicitly +make kokkos_cuda_mpi KOKKOS_ARCH=Pascal60,Power8 # set the KOKKOS_ARCH variable explicitly :pre -Or you can follow these steps: +Setting the KOKKOS_DEVICES and KOKKOS_ARCH variables on the +make command line requires a GNU-compatible make command. Try +"gmake" if your system's standard make complains. -CPU-only (run all-MPI or with OpenMP threading): +NOTE: If you build using make line variables and re-build LAMMPS twice +with different KOKKOS options and the *same* target, then you *must* perform a "make clean-all" +or "make clean-machine" before each build. 
This is to force all the +KOKKOS-dependent files to be re-compiled with the new options. -cd lammps/src -make yes-kokkos -make kokkos_omp :pre +[Running LAMMPS with the KOKKOS package:] -CPU-only (only MPI, no threading): +All Kokkos operations occur within the +context of an individual MPI task running on a single node of the +machine. The total number of MPI tasks used by LAMMPS (one or +multiple per compute node) is set in the usual manner via the mpirun +or mpiexec commands, and is independent of Kokkos. E.g. the mpirun +command in OpenMPI does this via its +-np and -npernode switches. Ditto for MPICH via -np and -ppn. -cd lammps/src -make yes-kokkos -make kokkos_mpi_only :pre +[Running on a multi-core CPU:] -Intel Xeon Phi (Intel Compiler, Intel MPI): +Here is a quick overview of how to use the KOKKOS package +for CPU acceleration, assuming one or more 16-core nodes. -cd lammps/src -make yes-kokkos -make kokkos_phi :pre +mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj # 1 node, 16 MPI tasks/node, no multi-threading +mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj # 2 nodes, 1 MPI task/node, 16 threads/task +mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 8 threads/task +mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre -CPUs and GPUs (with MPICH or OpenMPI): +To run using the KOKKOS package, use the "-k on", "-sf kk" and "-pk kokkos" "command-line switches"_Section_start.html#start_7 in your mpirun command. +You must use the "-k on" "command-line +switch"_Section_start.html#start_7 to enable the KOKKOS package. It +takes additional arguments for hardware settings appropriate to your +system. Those arguments are "documented +here"_Section_start.html#start_7. 
For OpenMP use:

-cd lammps/src
-make yes-kokkos
-make kokkos_cuda_mpi :pre

+-k on t Nt :pre

-These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
-make command line which requires a GNU-compatible make command.  Try
-"gmake" if your system's standard make complains.

+The "t Nt" option specifies how many OpenMP threads per MPI
+task to use within a node. The default is Nt = 1, which is MPI-only mode.
+Note that the product of MPI tasks * OpenMP
+threads/task should not exceed the physical number of cores (on a
+node), otherwise performance will suffer. If hyperthreading is enabled, then
+the product of MPI tasks * OpenMP threads/task should not exceed the
+physical number of cores * hardware threads per core.
+The "-k on" switch also issues a "package kokkos" command (with no
+additional arguments) which sets various KOKKOS options to default
+values, as discussed on the "package"_package.html command doc page.

-NOTE: If you build using make line variables and re-build LAMMPS twice
-with different KOKKOS options and the *same* target, e.g. g++ in the
-first two examples above, then you *must* perform a "make clean-all"
-or "make clean-machine" before each build.  This is to force all the
-KOKKOS-dependent files to be re-compiled with the new options.

+The "-sf kk" "command-line switch"_Section_start.html#start_7
+will automatically append the "/kk" suffix to styles that support it.
+In this manner no modification to the input script is needed. Alternatively,
+one can run with the KOKKOS package by editing the input script as described below.

-NOTE: Currently, there are no precision options with the KOKKOS
-package.  All compilation and computation is performed in double
-precision.

+NOTE: The default for the "package kokkos"_package.html command is
+to use "full" neighbor lists and set the Newton flag to "off" for both
+pairwise and bonded interactions.
However, when running on CPUs, it +will typically be faster to use "half" neighbor lists and set the +Newton flag to "on", just as is the case for non-accelerated pair +styles. It can also be faster to use non-threaded communication. +Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to +change the default "package kokkos"_package.html +options. See its doc page for details and default settings. Experimenting with +its options can provide a speed-up for specific calculations. For example: -There are other allowed options when building with the KOKKOS package. -As above, they can be set either as variables on the make command line -or in Makefile.machine. This is the full list of options, including -those discussed above, Each takes a value shown below. The -default value is listed, which is set in the -lib/kokkos/Makefile.kokkos file. +mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj # Newton on, Half neighbor list, non-threaded comm :pre -#Default settings specific options -#Options: force_uvm,use_ldg,rdc +If the "newton"_newton.html command is used in the input +script, it can also override the Newton flag defaults. -KOKKOS_DEVICES, values = {OpenMP}, {Serial}, {Pthreads}, {Cuda}, default = {OpenMP} -KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {ARMv8}, {BGQ}, {Power7}, {Power8}, default = {none} -KOKKOS_DEBUG, values = {yes}, {no}, default = {no} -KOKKOS_USE_TPLS, values = {hwloc}, {librt}, default = {none} -KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc} :ul +[Core and Thread Affinity:] -KOKKOS_DEVICE sets the parallelization method used for Kokkos code -(within LAMMPS). KOKKOS_DEVICES=OpenMP means that OpenMP will be -used. KOKKOS_DEVICES=Pthreads means that pthreads will be used. -KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used. 
+When using multi-threading, it is important for +performance to bind both MPI tasks to physical cores, and threads to +physical cores, so they do not migrate during a simulation. -If KOKKOS_DEVICES=Cuda, then the lo-level Makefile in the src/MAKE -directory must use "nvcc" as its compiler, via its CC setting. For -best performance its CCFLAGS setting should use -O3 and have a -KOKKOS_ARCH setting that matches the compute capability of your NVIDIA -hardware and software installation, e.g. KOKKOS_ARCH=Kepler30. Note -the minimal required compute capability is 2.0, but this will give -significantly reduced performance compared to Kepler generation GPUs -with compute capability 3.x. For the LINK setting, "nvcc" should not -be used; instead use g++ or another compiler suitable for linking C++ -applications. Often you will want to use your MPI compiler wrapper -for this setting (i.e. mpicxx). Finally, the lo-level Makefile must -also have a "Compilation rule" for creating *.o files from *.cu files. -See src/Makefile.cuda for an example of a lo-level Makefile with all -of these settings. +If you are not certain MPI tasks are being bound (check the defaults +for your MPI installation), binding can be forced with these flags: -KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not -migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be -used if running with KOKKOS_DEVICES=Pthreads for pthreads. It is not -necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP -provides alternative methods via environment variables for binding -threads to hardware cores. More info on binding threads to cores is -given in "Section 5.3"_Section_accelerate.html#acc_3. +OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ... +Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ... :pre -KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an -Intel Phi processor. 
+For binding threads with KOKKOS OpenMP, use thread affinity
+environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
+later, Intel 12 or later) setting the environment variable
+OMP_PROC_BIND=true should be sufficient. In general, for best performance
+with OpenMP 4.0 or later, set OMP_PROC_BIND=spread and OMP_PLACES=threads.
+For binding threads with the
+KOKKOS pthreads option, compile LAMMPS with the KOKKOS HWLOC=yes option
+as described below.

-KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
-on most Unix platforms.  This library is not available on all
-platforms.

+[Running on Knights Landing (KNL) Intel Xeon Phi:]

-KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
-within LAMMPS.  KOKKOS_DEBUG=yes enables printing of run-time
-debugging information that can be useful.  It also enables runtime
-bounds checking on Kokkos data structures.

+Here is a quick overview of how to use the KOKKOS package
+for the Intel Knights Landing (KNL) Xeon Phi:

-KOKKOS_CUDA_OPTIONS are additional options for CUDA.

+KNL Intel Phi chips have 68 physical cores. Typically 1 to 4 cores
+are reserved for the OS, and only 64 or 66 cores are used. Each core
+has 4 hyperthreads, so there are effectively N = 256 (4*64) or
+N = 264 (4*66) cores to run on. The product of MPI tasks * OpenMP threads/task should not exceed this limit,
+otherwise performance will suffer. Note that with the KOKKOS package you do not need to
+specify how many KNLs there are per node; each
+KNL is simply treated as running some number of MPI tasks.

-For more information on Kokkos see the Kokkos programmers' guide here:
-/lib/kokkos/doc/Kokkos_PG.pdf.

+Examples of mpirun commands that follow these rules are shown below.
-[Run with the KOKKOS package from the command line:] +Intel KNL node with 68 cores (272 threads/node via 4x hardware threading): +mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 64 MPI tasks/node, 4 threads/task +mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 1 node, 66 MPI tasks/node, 4 threads/task +mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj # 1 node, 32 MPI tasks/node, 8 threads/task +mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj # 8 nodes, 64 MPI tasks/node, 4 threads/task :pre -The mpirun or mpiexec command sets the total number of MPI tasks used -by LAMMPS (one or multiple per compute node) and the number of MPI -tasks used per node. E.g. the mpirun command in MPICH does this via -its -np and -ppn switches. Ditto for OpenMPI via -np and -npernode. +The -np setting of the mpirun command sets the number of MPI +tasks/node. The "-k on t Nt" command-line switch sets the number of +threads/task as Nt. The product of these two values should be N, i.e. +256 or 264. -When using KOKKOS built with host=OMP, you need to choose how many -OpenMP threads per MPI task will be used (via the "-k" command-line -switch discussed below). Note that the product of MPI tasks * OpenMP -threads/task should not exceed the physical number of cores (on a -node), otherwise performance will suffer. - -When using the KOKKOS package built with device=CUDA, you must use -exactly one MPI task per physical GPU. - -When using the KOKKOS package built with host=MIC for Intel Xeon Phi -coprocessor support you need to insure there are one or more MPI tasks -per coprocessor, and choose the number of coprocessor threads to use -per MPI task (via the "-k" command-line switch discussed below). The -product of MPI tasks * coprocessor threads/task should not exceed the -maximum number of threads the coprocessor is designed to run, -otherwise performance will suffer. 
This value is 240 for current -generation Xeon Phi(TM) chips, which is 60 physical cores * 4 -threads/core. Note that with the KOKKOS package you do not need to -specify how many Phi coprocessors there are per node; each -coprocessors is simply treated as running some number of MPI tasks. +NOTE: The default for the "package kokkos"_package.html command is +to use "full" neighbor lists and set the Newton flag to "off" for both +pairwise and bonded interactions. When running on KNL, this +will typically be best for pair-wise potentials. For manybody potentials, +using "half" neighbor lists and setting the +Newton flag to "on" may be faster. It can also be faster to use non-threaded communication. +Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to +change the default "package kokkos"_package.html +options. See its doc page for details and default settings. Experimenting with +its options can provide a speed-up for specific calculations. For example: + +mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj # Newton off, full neighbor list, non-threaded comm +mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax # Newton on, half neighbor list, non-threaded comm :pre + +NOTE: MPI tasks and threads should be bound to cores as described above for CPUs. + +NOTE: To build with Kokkos support for Intel Xeon Phi coprocessors such as Knight's Corner (KNC), your +system must be configured to use them in "native" mode, not "offload" +mode like the USER-INTEL package supports. -You must use the "-k on" "command-line -switch"_Section_start.html#start_6 to enable the KOKKOS package. It -takes additional arguments for hardware settings appropriate to your -system. Those arguments are "documented -here"_Section_start.html#start_6. 
The two most commonly used -options are: +[Running on GPUs:] --k on t Nt g Ng :pre +Use the "-k" "command-line switch"_Section_commands.html#start_7 to +specify the number of GPUs per node. Typically the -np setting +of the mpirun command should set the number of MPI +tasks/node to be equal to the # of physical GPUs on the node. +You can assign multiple MPI tasks to the same GPU with the +KOKKOS package, but this is usually only faster if significant portions +of the input script have not been ported to use Kokkos. Using CUDA MPS +is recommended in this scenario. As above for multi-core CPUs (and no GPU), if N is the number +of physical cores/node, then the number of MPI tasks/node should not exceed N. -The "t Nt" option applies to host=OMP (even if device=CUDA) and -host=MIC. For host=OMP, it specifies how many OpenMP threads per MPI -task to use with a node. For host=MIC, it specifies how many Xeon Phi -threads per MPI task to use within a node. The default is Nt = 1. -Note that for host=OMP this is effectively MPI-only mode which may be -fine. But for host=MIC you will typically end up using far less than -all the 240 available threads, which could give very poor performance. +-k on g Ng :pre -The "g Ng" option applies to device=CUDA. It specifies how many GPUs -per compute node to use. The default is 1, so this only needs to be -specified is you have 2 or more GPUs per compute node. +Here are examples of how to use the KOKKOS package for GPUs, +assuming one or more nodes, each with two GPUs: -The "-k on" switch also issues a "package kokkos" command (with no -additional arguments) which sets various KOKKOS options to default -values, as discussed on the "package"_package.html command doc page. 
+mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 1 node, 2 MPI tasks/node, 2 GPUs/node +mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre -Use the "-sf kk" "command-line switch"_Section_start.html#start_6, -which will automatically append "kk" to styles that support it. Use -the "-pk kokkos" "command-line switch"_Section_start.html#start_6 if -you wish to change any of the default "package kokkos"_package.html -optionns set by the "-k on" "command-line -switch"_Section_start.html#start_6. +NOTE: The default for the "package kokkos"_package.html command is +to use "full" neighbor lists and set the Newton flag to "off" for both +pairwise and bonded interactions, along with threaded communication. +When running on Maxwell or Kepler GPUs, this will typically be best. For Pascal GPUs, +using "half" neighbor lists and setting the +Newton flag to "on" may be faster. For many pair styles, setting the neighbor binsize +equal to the ghost atom cutoff will give speedup. +Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to +change the default "package kokkos"_package.html +options. See its doc page for details and default settings. Experimenting with +its options can provide a speed-up for specific calculations. For example: + +mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos binsize 2.8 -in in.lj # Set binsize = neighbor ghost cutoff +mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj # Newton on, half neighborlist, set binsize = neighbor ghost cutoff :pre +NOTE: For good performance of the KOKKOS package on GPUs, you must +have Kepler generation GPUs (or later). The Kokkos library exploits +texture cache options not supported by Telsa generation GPUs (or +older). 
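The "binsize 2.8" used in the GPU examples above is the neighbor ghost cutoff, i.e. the pair cutoff plus the neighbor skin. For the in.lj benchmark the pair cutoff is 2.5 sigma; 0.3 is assumed here to be the neighbor skin (check the neighbor command in your own input), so:

```shell
# Sketch: binsize = pair cutoff + neighbor skin (values for in.lj,
# skin of 0.3 is an assumption about the input's neighbor settings).
binsize=$(awk -v cut=2.5 -v skin=0.3 'BEGIN { printf "%.1f", cut + skin }')
echo "$binsize"
```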
+NOTE: When using a GPU, you will achieve the best performance if your
+input script does not use fix or compute styles which are not yet
+Kokkos-enabled. This allows data to stay on the GPU for multiple
+timesteps, without being copied back to the host CPU. Invoking a
+non-Kokkos fix or compute, or performing I/O for
+"thermo"_thermo_style.html or "dump"_dump.html output will cause data
+to be copied back to the CPU, incurring a performance penalty.

-Note that the default for the "package kokkos"_package.html command is
-to use "full" neighbor lists and set the Newton flag to "off" for both
-pairwise and bonded interactions.  This typically gives fastest
-performance.  If the "newton"_newton.html command is used in the input
-script, it can override the Newton flag defaults.

+NOTE: To get an accurate timing breakdown between time spent in pair,
+kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
+However, this will reduce performance and is not recommended for production runs.

-However, when running in MPI-only mode with 1 thread per MPI task, it
-will typically be faster to use "half" neighbor lists and set the
-Newton flag to "on", just as is the case for non-accelerated pair
-styles.  You can do this with the "-pk" "command-line
-switch"_Section_start.html#start_6.

+[Run with the KOKKOS package by editing an input script:]

-[Or run with the KOKKOS package by editing an input script:]

+Alternatively, the effect of the "-sf" or "-pk" switches can be
+duplicated by adding the "package kokkos"_package.html or "suffix
+kk"_suffix.html commands to your input script.

-The discussion above for the mpirun/mpiexec command and setting
-appropriate thread and GPU values for host=OMP or host=MIC or
-device=CUDA are the same.

+The discussion above for building LAMMPS with the KOKKOS package, the mpirun/mpiexec command, and setting
+appropriate thread counts is the same.
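As a concrete sketch of the input-script route, here is a hypothetical fragment that enables KOKKOS via script commands rather than the command-line switches. The file name is made up; the package and suffix commands follow the usage described in this section.

```shell
# Hypothetical input-script fragment: "suffix kk" makes lj/cut resolve
# to lj/cut/kk, and "package kokkos" overrides option defaults.
cat > /tmp/in.lj.kk <<'EOF'
package kokkos neigh half newton on
suffix kk
pair_style lj/cut 2.5
EOF
wc -l < /tmp/in.lj.kk
```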
You must still use the "-k on" "command-line
-switch"_Section_start.html#start_6 to enable the KOKKOS package, and
+switch"_Section_start.html#start_7 to enable the KOKKOS package, and
 specify its additional arguments for hardware options appropriate to
 your system, as documented above.

-Use the "suffix kk"_suffix.html command, or you can explicitly add a
+You can use the "suffix kk"_suffix.html command, or you can explicitly add a
 "kk" suffix to individual styles in your input script, e.g.

 pair_style lj/cut/kk 2.5 :pre

 You only need to use the "package kokkos"_package.html command if you
 wish to change any of its option defaults, as set by the "-k on"
-"command-line switch"_Section_start.html#start_6.
+"command-line switch"_Section_start.html#start_7.
+
+[Using OpenMP threading and CUDA together (experimental):]
+
+With the KOKKOS package, both OpenMP multi-threading and GPUs can be used
+together in a few special cases. In the Makefile, the KOKKOS_DEVICES variable must
+include both "Cuda" and "OpenMP", as is the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi:
+
+KOKKOS_DEVICES=Cuda,OpenMP :pre
+
+The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
+using "-sf kk" on the command line gives the default CUDA version everywhere.
+However, if the "/kk/host" suffix is added to a specific style in the input
+script, the Kokkos OpenMP (CPU) version of that specific style will be used instead.
+Set the number of OpenMP threads as "t Nt" and the number of GPUs as "g Ng":
+
+-k on t Nt g Ng :pre
+
+For example, the command to run with 1 GPU and 8 OpenMP threads is then:
+
+mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk :pre
+
+Conversely, if "-sf kk/host" is used on the command line and the
+"/kk" or "/kk/device" suffix is added to a specific style in your input script,
+then only that specific style will run on the GPU while everything else will
+run on the CPU in OpenMP mode.
Note that the execution of the CPU and GPU +styles will NOT overlap, except for a special case: + +A kspace style and/or molecular topology (bonds, angles, etc.) running on +the host CPU can overlap with a pair style running on the GPU. First compile +with “--default-stream per-thread” added to CCFLAGS in the Kokkos CUDA Makefile. +Then explicitly use the “/kk/host” suffix for kspace and bonds, angles, etc. +in the input file and the "kk" suffix (equal to "kk/device") on the command line. +Also make sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1" +so CPU/GPU overlap can occur. [Speed-ups to expect:] @@ -353,7 +363,7 @@ Generally speaking, the following rules of thumb apply: When running on CPUs only, with a single thread per MPI task, performance of a KOKKOS style is somewhere between the standard (un-accelerated) styles (MPI-only mode), and those provided by the -USER-OMP package. However the difference between all 3 is small (less +USER-OMP package. However the difference between all 3 is small (less than 20%). :ulb,l When running on CPUs only, with multiple threads per MPI task, @@ -363,7 +373,7 @@ package. :l When running large number of atoms per GPU, KOKKOS is typically faster than the GPU package. :l -When running on Intel Xeon Phi, KOKKOS is not as fast as +When running on Intel hardware, KOKKOS is not as fast as the USER-INTEL package, which is optimized for that hardware. :l :ule @@ -371,123 +381,78 @@ See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the LAMMPS web site for performance of the KOKKOS package on different hardware. -[Guidelines for best performance:] - -Here are guidline for using the KOKKOS package on the different -hardware configurations listed above. - -Many of the guidelines use the "package kokkos"_package.html command -See its doc page for details and default settings. Experimenting with -its options can provide a speed-up for specific calculations. 
- -[Running on a multi-core CPU:] - -If N is the number of physical cores/node, then the number of MPI -tasks/node * number of threads/task should not exceed N, and should -typically equal N. Note that the default threads/task is 1, as set by -the "t" keyword of the "-k" "command-line -switch"_Section_start.html#start_6. If you do not change this, no -additional parallelism (beyond MPI) will be invoked on the host -CPU(s). +[Advanced Kokkos options:] -You can compare the performance running in different modes: - -run with 1 MPI task/node and N threads/task -run with N MPI tasks/node and 1 thread/task -run with settings in between these extremes :ul - -Examples of mpirun commands in these modes are shown above. - -When using KOKKOS to perform multi-threading, it is important for -performance to bind both MPI tasks to physical cores, and threads to -physical cores, so they do not migrate during a simulation. - -If you are not certain MPI tasks are being bound (check the defaults -for your MPI installation), binding can be forced with these flags: - -OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ... -Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre - -For binding threads with the KOKKOS OMP option, use thread affinity -environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or -later, intel 12 or later) setting the environment variable -OMP_PROC_BIND=true should be sufficient. For binding threads with the -KOKKOS pthreads option, compile LAMMPS the KOKKOS HWLOC=yes option -(see "this section"_Section_packages.html#KOKKOS of the manual for -details). - -[Running on GPUs:] - -Insure the -arch setting in the machine makefile you are using, -e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software. -(see "this section"_Section_packages.html#KOKKOS of the manual for -details). 
-
-The -np setting of the mpirun command should set the number of MPI
-tasks/node to be equal to the # of physical GPUs on the node.
-
-Use the "-k" "command-line switch"_Section_commands.html#start_6 to
-specify the number of GPUs per node, and the number of threads per MPI
-task. As above for multi-core CPUs (and no GPU), if N is the number
-of physical cores/node, then the number of MPI tasks/node * number of
-threads/task should not exceed N. With one GPU (and one MPI task) it
-may be faster to use less than all the available cores, by setting
-threads/task to a smaller value. This is because using all the cores
-on a dual-socket node will incur extra cost to copy memory from the
-2nd socket to the GPU.
-
-Examples of mpirun commands that follow these rules are shown above.
-
-NOTE: When using a GPU, you will achieve the best performance if your
-input script does not use any fix or compute styles which are not yet
-Kokkos-enabled. This allows data to stay on the GPU for multiple
-timesteps, without being copied back to the host CPU. Invoking a
-non-Kokkos fix or compute, or performing I/O for
-"thermo"_thermo_style.html or "dump"_dump.html output will cause data
-to be copied back to the CPU.
-
-You cannot yet assign multiple MPI tasks to the same GPU with the
-KOKKOS package. We plan to support this in the future, similar to the
-GPU package in LAMMPS.
+There are other allowed options when building with the KOKKOS package.
+As above, they can be set either as variables on the make command line
+or in Makefile.machine. This is the full list of options, including
+those discussed above. Each option takes one of the values shown
+below; the listed default is set in the /lib/kokkos/Makefile.kokkos
+file.

-You cannot yet use both the host (multi-threaded) and device (GPU)
-together to compute pairwise interactions with the KOKKOS package. We
-hope to support this in the future, similar to the GPU package in
-LAMMPS.
+KOKKOS_DEVICES, values = {Serial}, {OpenMP}, {Pthreads}, {Cuda}, default = {OpenMP}
+KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {Pascal60}, {Pascal61}, {ARMv80}, {ARMv81}, {ARMv8-ThunderX}, {BGQ}, {Power7}, {Power8}, {Power9}, {KNL}, {BDW}, {SKX}, default = {none}
+KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
+KOKKOS_USE_TPLS, values = {hwloc}, {librt}, {experimental_memkind}, default = {none}
+KOKKOS_CXX_STANDARD, values = {c++11}, {c++1z}, default = {c++11}
+KOKKOS_OPTIONS, values = {aggressive_vectorization}, {disable_profiling}, default = {none}
+KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc}, {enable_lambda}, default = {enable_lambda} :ul
+
+KOKKOS_DEVICES sets the parallelization method used for Kokkos code
+(within LAMMPS). KOKKOS_DEVICES=Serial means that no threading will be used.
+KOKKOS_DEVICES=OpenMP means that OpenMP threading will be
+used. KOKKOS_DEVICES=Pthreads means that pthreads will be used.
+KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.
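As a sketch (the file name and variable placement here are assumed to follow the src/MAKE convention described above), a machine Makefile might select the backend like this:

```make
# excerpt from a hypothetical Makefile.machine
KOKKOS_DEVICES = OpenMP         # MPI + OpenMP threading on host CPUs
# KOKKOS_DEVICES = Cuda,OpenMP  # alternative: NVIDIA GPUs plus host threads
```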
-
-[Running on an Intel Phi:]
+KOKKOS_ARCH enables compiler switches needed when compiling for
+specific hardware:
+
+ARMv80 = ARMv8.0 Compatible CPU
+ARMv81 = ARMv8.1 Compatible CPU
+ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
+SNB = Intel Sandy/Ivy Bridge CPUs
+HSW = Intel Haswell CPUs
+BDW = Intel Broadwell Xeon E-class CPUs
+SKX = Intel Sky Lake Xeon E-class HPC CPUs (AVX512)
+KNC = Intel Knights Corner Xeon Phi
+KNL = Intel Knights Landing Xeon Phi
+Kepler30 = NVIDIA Kepler generation CC 3.0
+Kepler32 = NVIDIA Kepler generation CC 3.2
+Kepler35 = NVIDIA Kepler generation CC 3.5
+Kepler37 = NVIDIA Kepler generation CC 3.7
+Maxwell50 = NVIDIA Maxwell generation CC 5.0
+Maxwell52 = NVIDIA Maxwell generation CC 5.2
+Maxwell53 = NVIDIA Maxwell generation CC 5.3
+Pascal60 = NVIDIA Pascal generation CC 6.0
+Pascal61 = NVIDIA Pascal generation CC 6.1
+BGQ = IBM Blue Gene/Q CPUs
+Power8 = IBM POWER8 CPUs
+Power9 = IBM POWER9 CPUs :ul

-Kokkos only uses Intel Phi processors in their "native" mode, i.e.
-not hosted by a CPU.
+KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
+migrate during a simulation. It should always be used when running
+with KOKKOS_DEVICES=Pthreads. It is not necessary with
+KOKKOS_DEVICES=OpenMP, because OpenMP provides alternative methods,
+via environment variables, for binding threads to hardware cores.
+More info on binding threads to cores is given in
+"Section 5.3"_Section_accelerate.html#acc_3.

-As illustrated above, build LAMMPS with OMP=yes (the default) and
-MIC=yes. The latter insures code is correctly compiled for the Intel
-Phi. The OMP setting means OpenMP will be used for parallelization on
-the Phi, which is currently the best option within Kokkos. In the
-future, other options may be added.
+KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
+on most Unix platforms. This library is not available on all
+platforms.
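Combining the variables above on the make command line, a pthreads build for a Broadwell node pinned with hwloc could look like the following (an untested sketch; the machine target "mpi" is an assumption and may differ on your system):

```shell
# hypothetical build: pthreads backend, Broadwell arch flags, hwloc pinning
make mpi KOKKOS_DEVICES=Pthreads KOKKOS_ARCH=BDW KOKKOS_USE_TPLS=hwloc
```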
-
-Current-generation Intel Phi chips have either 61 or 57 cores. One
-core should be excluded for running the OS, leaving 60 or 56 cores.
-Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
-N = 224 (4*56) cores to run on.
+KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
+within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
+debugging information and runtime bounds checking on Kokkos data
+structures.

-The -np setting of the mpirun command sets the number of MPI
-tasks/node. The "-k on t Nt" command-line switch sets the number of
-threads/task as Nt. The product of these 2 values should be N, i.e.
-240 or 224. Also, the number of threads/task should be a multiple of
-4 so that logical threads from more than one MPI task do not run on
-the same physical core.
+KOKKOS_CXX_STANDARD and KOKKOS_OPTIONS are typically not changed when
+building LAMMPS.

-Examples of mpirun commands that follow these rules are shown above.
+KOKKOS_CUDA_OPTIONS are additional options for CUDA. The LAMMPS KOKKOS
+package must be compiled with the {enable_lambda} option when using
+GPUs.

[Restrictions:]

-As noted above, if using GPUs, the number of MPI tasks per compute
-node should equal to the number of GPUs per compute node. In the
-future Kokkos will support assigning multiple MPI tasks to a single
-GPU.
-
-Currently Kokkos does not support AMD GPUs due to limits in the
-available backend programming models. Specifically, Kokkos requires
-extensive C++ support from the Kernel language. This is expected to
-change in the future.
+Currently, there are no precision options with the KOKKOS
+package. All compilation and computation are performed in double
+precision.