Commit f8c9ab4a authored 6 years ago by Axel Kohlmeyer
some rewrite/update of the accelerator comparison page removing outdated info
parent a8c687ae
Showing 1 changed file: doc/src/Speed_compare.txt (84 additions, 41 deletions)
@@ -9,65 +9,108 @@ Documentation"_ld - "LAMMPS Commands"_lc :c
Comparison of various accelerator packages :h3
NOTE: this section still needs to be re-worked with additional KOKKOS
and USER-INTEL information.
The next section compares and contrasts the various accelerator
options, since there are multiple ways to perform OpenMP threading,
run on GPUs, optimize for vector units on CPUs, and run on Intel
Xeon Phi (co-)processors.
All of these packages can accelerate a LAMMPS calculation by taking
advantage of hardware features, but they do it in different ways,
and acceleration is not always guaranteed. As a consequence, for a
particular simulation on specific hardware, one package may be faster
than the others. We give some guidelines below, but the best way to
determine which package is faster for your input script is to try
several of them on your machine and experiment with the available
performance tuning settings. See the benchmarking section below for
examples where this has been done.
[Guidelines for using each package optimally:]
Both the GPU and the KOKKOS package allow you to assign multiple
MPI ranks (= CPU cores) to the same GPU. For the GPU package, this
can lead to a speedup through better utilization of the GPU (by
overlapping computation and data transfer) and through more efficient
computation of the non-GPU accelerated parts of LAMMPS via MPI
parallelization, since all system data is maintained and updated on
the host. For KOKKOS, there is little to no benefit from this, due
to its different memory management model, which tries to retain
data on the GPU. :ulb,l
The GPU package moves per-atom data (coordinates, forces, and
(optionally) neighbor list data, if not computed on the GPU) between
the CPU and GPU at every timestep. The KOKKOS/CUDA package only does
this on timesteps when a CPU calculation is required (e.g. to invoke
a fix or compute that is non-GPU-ized). Hence, if you can formulate
your input script to only use GPU-ized fixes and computes, and avoid
doing I/O too often (thermo output, dump file snapshots, restart files),
then the data transfer cost of the KOKKOS/CUDA package can be very low,
causing it to run faster than the GPU package. :l
The GPU package is often faster than the KOKKOS/CUDA package when the
number of atoms per GPU is on the smaller side. The crossover point
(in terms of atoms per GPU) at which the KOKKOS/CUDA package becomes
faster depends strongly on the pair style. For example, for a simple
Lennard-Jones system the crossover (in single precision) is often about
50K-100K atoms per GPU. When performing double precision calculations
the crossover point can be significantly smaller. :l
Both the KOKKOS and the GPU package compute bonded interactions (bonds,
angles, etc.) on the CPU. If the GPU package is running with several
MPI processes assigned to one GPU, the cost of computing the bonded
interactions is spread across more CPUs and hence the GPU package can
run faster in these cases. :l
When using LAMMPS with multiple MPI ranks assigned to the same GPU,
its performance depends to some extent on the available bandwidth
between the CPUs and the GPU. This can differ significantly based on
the bus technology, the capability of the host CPU and mainboard, the
wiring of the buses, whether switches are used to increase the number
of available bus slots, and whether the GPUs are housed in an external
enclosure. This can become quite complex. :l
To achieve significant acceleration through GPUs, both the KOKKOS and
the GPU package require capable GPUs with fast on-device memory and
efficient data transfer rates, i.e. upper mid-level to high-end
(desktop) GPUs. Using lower performance GPUs (e.g. on laptops) may
result in a slowdown instead. :l
For the GPU package, specifically when running in parallel with MPI,
it is often more efficient to exclude the PPPM kspace style from GPU
acceleration and instead run it - concurrently with a GPU accelerated
pair style - on the CPU. This can often be achieved simply by placing
a {suffix off} command before and a {suffix on} command after the
{kspace_style pppm} command; see the input script sketch after this
list. :l
The KOKKOS/OpenMP and USER-OMP packages have different thread management
strategies. USER-OMP is usually more efficient for a small number of
threads, but its overhead increases as the number of threads per MPI
rank grows. The KOKKOS/OpenMP kernels have less overhead in that case,
but lower performance with few threads. :l
The USER-INTEL package contains many options and settings for achieving
additional performance on Intel hardware (CPU and accelerator cards), but
to unlock this potential, an Intel compiler is required. The package code
will compile with GNU gcc, but it will not be as efficient. :l
:ule
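
As a minimal sketch of the {suffix off}/{suffix on} approach mentioned
above: the pair style, cutoff, and PPPM accuracy below are placeholders,
and the run is assumed to be launched with the {-sf gpu} command-line
switch so that the pair style picks up the GPU suffix.

pair_style lj/cut/coul/long 10.0   # gets the gpu suffix from -sf gpu
suffix off
kspace_style pppm 1.0e-4           # stays on the CPU, runs concurrently with the GPU pair style
suffix on :pre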
[Differences between the GPU and KOKKOS packages:]
The GPU package accelerates only pair force, neighbor list, and (parts
of) PPPM calculations. The KOKKOS package attempts to run most of the
calculation on the GPU, but can transparently support non-accelerated
code (with a performance penalty due to the required data transfers
between host and GPU). See the command-line sketches after this list
for how either package is typically enabled. :ulb,l
The GPU package requires neighbor lists to be built on the CPU when using
exclusion lists, hybrid pair styles, or a triclinic simulation box. :l
The GPU package can be compiled for CUDA or OpenCL and thus supports
both Nvidia and AMD GPUs well. On Nvidia hardware, using CUDA typically
results in equal or better performance compared to OpenCL. :l
OpenCL in the GPU package does in principle also support Intel CPUs or
Intel Xeon Phi, but the native support for those in KOKKOS (or USER-INTEL)
is superior. :l
:ule
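
For orientation, command lines along the following lines are typically
used to enable either package; the executable name ({lmp}), the MPI
launcher, the input file name, and the rank/GPU counts are illustrative
assumptions and not part of this file.

mpirun -np 8 lmp -sf gpu -pk gpu 2 -in in.melt   # GPU package: 8 MPI ranks sharing 2 GPUs
mpirun -np 2 lmp -k on g 2 -sf kk -in in.melt    # KOKKOS/CUDA: 2 MPI ranks, one GPU each :pre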