Comparison of various accelerator packages :h3

NOTE: this section still needs to be re-worked with additional KOKKOS
and USER-INTEL information.
The next section compares and contrasts the various accelerator
options, since there are multiple ways to perform OpenMP threading,
run on GPUs, optimize for vector units on CPUs, and run on Intel
Xeon Phi (co-)processors.

All of these packages can accelerate a LAMMPS calculation by taking
advantage of hardware features, but they do it in different ways and
acceleration is not always guaranteed.

As a consequence, for a particular simulation on specific hardware,
one package may be faster than another. We give some guidelines
below, but the best way to determine which package is fastest for
your input script is to try several of them on your machine and to
experiment with the available performance tuning settings. See the
benchmarking section below for examples where this has been done.

[Guidelines for using each package optimally:]

Both the GPU and the KOKKOS package allow you to assign multiple
MPI ranks (= CPU cores) to the same GPU. For the GPU package, this
can lead to a speedup through better utilization of the GPU (by
overlapping computation and data transfer) and through more efficient
computation of the non-GPU accelerated parts of LAMMPS via MPI
parallelization, since all system data is maintained and updated on
the host (see the command line sketch after this list). For KOKKOS,
there is little to no benefit from this, due to its different memory
management model, which tries to retain data on the GPU. :ulb,l

The GPU package moves per-atom data (coordinates, forces, and
(optionally) neighbor list data, if not computed on the GPU) between
the CPU and GPU at every timestep. The KOKKOS/CUDA package only does
this on timesteps when a CPU calculation is required (e.g. to invoke
a fix or compute that is non-GPU-ized). Hence, if you can formulate
your input script to only use GPU-ized fixes and computes, and avoid
doing I/O too often (thermo output, dump file snapshots, restart
files), then the data transfer cost of the KOKKOS/CUDA package can be
very low, causing it to run faster than the GPU package. :l

The GPU package is often faster than the KOKKOS/CUDA package, when
the number of atoms per GPU is on the smaller side. The crossover
point, in terms of atoms/GPU, at which the KOKKOS/CUDA package
becomes faster depends strongly on the pair style. For example, for
a simple Lennard-Jones system the crossover (in single precision) is
often at about 50K-100K atoms per GPU. When performing double
precision calculations, the crossover point can be significantly
smaller. :l

Both the KOKKOS and the GPU package compute bonded interactions
(bonds, angles, etc) on the CPU. If the GPU package is running with
several MPI processes assigned to one GPU, the cost of computing the
bonded interactions is spread across more CPUs and hence the GPU
package can run faster in these cases. :l

When using LAMMPS with multiple MPI ranks assigned to the same GPU,
its performance depends to some extent on the available bandwidth
between the CPUs and the GPU. This can differ significantly based on
the bus technology, the capabilities of the host CPU and mainboard,
the wiring of the buses, whether switches are used to increase the
number of available bus slots, and whether GPUs are housed in an
external enclosure. The details can become quite complex. :l

To achieve significant acceleration through GPUs, both the KOKKOS
and the GPU package require capable GPUs with fast on-device memory
and efficient host-device data transfer rates. This calls for upper
mid-level to high-end (desktop) GPUs; using lower performance GPUs
(e.g. on laptops) may result in a slowdown instead. :l

For the GPU package, specifically when running in parallel with MPI,
it is often more efficient to exclude the PPPM kspace style from GPU
acceleration and instead run it on the CPU, concurrently with a GPU
accelerated pair style. This can often be easily achieved by placing
a {suffix off} command before and a {suffix on} command after the
{kspace_style pppm} command, as sketched in the input script example
after this list. :l

The KOKKOS/OpenMP and USER-OMP packages have different thread
management strategies, which should result in USER-OMP being more
efficient for a small number of threads, with increasing overhead as
the number of threads per MPI rank grows. The KOKKOS/OpenMP kernels
have less overhead in that case, but lower performance with few
threads (see the launch examples after this list). :l

The USER-INTEL package contains many options and settings for
achieving additional performance on Intel hardware (CPUs and
accelerator cards), but to unlock this potential, an Intel compiler
is required. The package code will compile with GNU gcc, but it will
not be as efficient. :l

:ule

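As an example of assigning multiple MPI ranks to the same GPU with
the GPU package, consider the following command line. This is a
minimal sketch only; the executable name {lmp_machine}, the MPI
launcher, the rank and GPU counts, and the input file name are
placeholders that depend on your build and machine:

mpirun -np 12 lmp_machine -sf gpu -pk gpu 2 -in in.script :pre

This would launch 12 MPI ranks on a node sharing 2 GPUs, i.e. 6
ranks per GPU.
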
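The PPPM exclusion mentioned above can be sketched in an input
script as follows. The accuracy value for {kspace_style pppm} is a
placeholder and all other script commands are omitted:

suffix off
kspace_style pppm 1.0e-4
suffix on :pre

With this, a GPU-ized pair style is still substituted via the {gpu}
suffix, while PPPM runs on the CPU, concurrently with the pair
computation on the GPU.
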
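For the threading comparison, both packages are enabled via command
line switches. A minimal sketch, assuming 4 OpenMP threads per MPI
rank and a LAMMPS binary with the respective packages installed;
executable and input file names are again placeholders:

mpirun -np 4 lmp_machine -sf omp -pk omp 4 -in in.script
mpirun -np 4 lmp_machine -k on t 4 -sf kk -in in.script :pre
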
[Differences between the GPU and KOKKOS packages:]

The GPU package accelerates only pair force, neighbor list, and
(parts of) PPPM calculations. The KOKKOS package attempts to run most
of the calculation on the GPU, but can transparently support
non-accelerated code (with a performance penalty due to the required
data transfers between host and GPU). :ulb,l

The GPU package requires neighbor lists to be built on the CPU when
using exclusion lists, hybrid pair styles, or a triclinic simulation
box. :l

The GPU package can be compiled for CUDA or OpenCL and thus supports
both Nvidia and AMD GPUs well. On Nvidia hardware, using CUDA
typically results in equal or better performance compared to
OpenCL. :l

OpenCL in the GPU package theoretically also supports Intel CPUs and
the Intel Xeon Phi, but the native support for those in KOKKOS (or
USER-INTEL) is superior. :l

:ule

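As an illustration of choosing between the two GPU package backends,
with the traditional make-based build the GPU library in lib/gpu is
compiled against one backend before including the package. This is a
sketch only; the Makefile names vary by platform and usually need to
be edited to match the local CUDA or OpenCL installation, and {make
machine} stands for your usual build target:

cd lib/gpu; make -f Makefile.linux; cd ../../src          # CUDA backend
cd lib/gpu; make -f Makefile.linux_opencl; cd ../../src   # OpenCL backend
make yes-gpu; make machine :pre
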