diff --git a/doc/src/accelerate_kokkos.txt b/doc/src/accelerate_kokkos.txt
index 2b07ed035f8b6cc5b1f96805d9f4ba4b6d1dce70..b2279c5c719b46562c0ecf5593a9600e453f3b74 100644
--- a/doc/src/accelerate_kokkos.txt
+++ b/doc/src/accelerate_kokkos.txt
@@ -11,336 +11,346 @@
 
 5.3.3 KOKKOS package :h5
 
-The KOKKOS package was developed primarily by Christian Trott (Sandia)
-with contributions of various styles by others, including Sikandar
-Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia).  The
-underlying Kokkos library was written primarily by Carter Edwards,
+Kokkos is a templated C++ library that provides abstractions to allow
+a single implementation of an application kernel (e.g. a pair style) to run efficiently on
+different kinds of hardware, such as GPUs, Intel Xeon Phis, or many-core
+CPUs. Kokkos maps the C++ kernel onto different backend languages such as CUDA, OpenMP, or Pthreads.
+The Kokkos library also provides data abstractions to adjust (at
+compile time) the memory layout of data structures like 2d and
+3d arrays to optimize performance on different hardware. For more information on Kokkos, see
+"Github"_https://github.com/kokkos/kokkos. Kokkos is part of
+"Trilinos"_http://trilinos.sandia.gov/packages/kokkos. The Kokkos library was written primarily by Carter Edwards,
 Christian Trott, and Dan Sunderland (all Sandia).
 
-The KOKKOS package contains versions of pair, fix, and atom styles
+The LAMMPS KOKKOS package contains versions of pair, fix, and atom styles
 that use data structures and macros provided by the Kokkos library,
-which is included with LAMMPS in lib/kokkos.
-
-The Kokkos library is part of
-"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and can also be
-downloaded from "Github"_https://github.com/kokkos/kokkos. Kokkos is a
-templated C++ library that provides two key abstractions for an
-application like LAMMPS.  First, it allows a single implementation of
-an application kernel (e.g. a pair style) to run efficiently on
-different kinds of hardware, such as a GPU, Intel Phi, or many-core
-CPU.
-
-The Kokkos library also provides data abstractions to adjust (at
-compile time) the memory layout of basic data structures like 2d and
-3d arrays and allow the transparent utilization of special hardware
-load and store operations.  Such data structures are used in LAMMPS to
-store atom coordinates or forces or neighbor lists.  The layout is
-chosen to optimize performance on different platforms.  Again this
-functionality is hidden from the developer, and does not affect how
-the kernel is coded.
-
-These abstractions are set at build time, when LAMMPS is compiled with
-the KOKKOS package installed.  All Kokkos operations occur within the
-context of an individual MPI task running on a single node of the
-machine.  The total number of MPI tasks used by LAMMPS (one or
-multiple per compute node) is set in the usual manner via the mpirun
-or mpiexec commands, and is independent of Kokkos.
+which is included with LAMMPS in /lib/kokkos. The KOKKOS package was developed primarily by Christian Trott (Sandia)
+and Stan Moore (Sandia) with contributions of various styles by others, including Sikandar
+Mashayak (UIUC), Ray Shan (Sandia), and Dan Ibanez (Sandia). For more information on developing with Kokkos abstractions,
+see the Kokkos programmers' guide at /lib/kokkos/doc/Kokkos_PG.pdf.
 
 Kokkos currently provides support for 3 modes of execution (per MPI
-task).  These are OpenMP (for many-core CPUs), Cuda (for NVIDIA GPUs),
-and OpenMP (for Intel Phi).  Note that the KOKKOS package supports
-running on the Phi in native mode, not offload mode like the
-USER-INTEL package supports.  You choose the mode at build time to
+task). These are Serial (MPI-only for CPUs and Intel Phi), OpenMP (threading
+for many-core CPUs and Intel Phi), and CUDA (for NVIDIA GPUs). You choose the mode at build time to
 produce an executable compatible with specific hardware.
 
-Here is a quick overview of how to use the KOKKOS package
-for CPU acceleration, assuming one or more 16-core nodes.
-More details follow.
+[Building LAMMPS with the KOKKOS package:]
 
-use a C++11 compatible compiler
-make yes-kokkos
-make mpi KOKKOS_DEVICES=OpenMP                 # build with the KOKKOS package
-make kokkos_omp                                # or Makefile.kokkos_omp already has variable set :pre
+NOTE: Kokkos support within LAMMPS must be built with a C++11 compatible
+compiler. This means GCC version 4.7.2 or later, Intel 14.0.4 or later, or
+Clang 3.5.2 or later is required.
+
+The recommended method of building the KOKKOS package is to start with the provided Kokkos
+Makefiles in /src/MAKE/OPTIONS/. You may need to modify the KOKKOS_ARCH variable in the Makefile
+to match your specific hardware. For example:
 
-mpirun -np 16 lmp_mpi -k on -sf kk -in in.lj              # 1 node, 16 MPI tasks/node, no threads
-mpirun -np 2 -ppn 1 lmp_mpi -k on t 16 -sf kk -in in.lj   # 2 nodes, 1 MPI task/node, 16 threads/task
-mpirun -np 2 lmp_mpi -k on t 8 -sf kk -in in.lj           # 1 node, 2 MPI tasks/node, 8 threads/task
-mpirun -np 32 -ppn 4 lmp_mpi -k on t 4 -sf kk -in in.lj   # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre
+for Sandy Bridge CPUs, set KOKKOS_ARCH=SNB
+for Broadwell CPUs, set KOKKOS_ARCH=BDW
+for K80 GPUs, set KOKKOS_ARCH=Kepler37
+for P100 GPUs and Power8 CPUs, set KOKKOS_ARCH=Pascal60,Power8 :ul
 
-specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
-include the KOKKOS package and build LAMMPS
-enable the KOKKOS package and its hardware options via the "-k on" command-line switch use KOKKOS styles in your input script :ul
+See the [Advanced Kokkos Options] section below for a listing of all KOKKOS_ARCH options.
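+
+As a concrete sketch, applying one of these settings means editing the
+KOKKOS_ARCH line near the top of the chosen Makefile (the exact default
+line in your copy may differ), e.g. in
+/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi:
+
+KOKKOS_ARCH = Pascal60,Power8 :pre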
 
-Here is a quick overview of how to use the KOKKOS package for GPUs,
-assuming one or more nodes, each with 16 cores and a GPU.  More
-details follow.
+[Compile for CPU-only (MPI only, no threading):]
 
-discuss use of NVCC, which Makefiles to examine
+Use a C++11 compatible compiler and set the KOKKOS_ARCH variable in
+/src/MAKE/OPTIONS/Makefile.kokkos_mpi_only as described above. Then do the
+following:
 
-use a C++11 compatible compiler
-KOKKOS_DEVICES = Cuda, OpenMP
-KOKKOS_ARCH = Kepler35
+cd lammps/src
 make yes-kokkos
-make machine :pre
+make kokkos_mpi_only :pre
 
-mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj          # one MPI task, 6 threads on CPU
-mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj   # ditto on 4 nodes :pre
+[Compile for CPU-only (MPI plus OpenMP threading):]
 
-mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj           # two MPI tasks, 8 threads per CPU
-mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # ditto on 16 nodes :pre
+NOTE: To build with Kokkos support for OpenMP threading, your compiler must support the
+OpenMP interface. You should have one or more multi-core CPUs so that
+multiple threads can be launched by each MPI task running on a CPU.
 
-Here is a quick overview of how to use the KOKKOS package
-for the Intel Phi:
+Use a C++11 compatible compiler and set the KOKKOS_ARCH variable in
+/src/MAKE/OPTIONS/Makefile.kokkos_omp as described above. Then do the
+following:
 
-use a C++11 compatible compiler
-KOKKOS_DEVICES = OpenMP
-KOKKOS_ARCH = KNC
+cd lammps/src
 make yes-kokkos
-make machine :pre
+make kokkos_omp :pre
 
-host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
-mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
-mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
-mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
-mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis :pre
+[Compile for Intel KNL Xeon Phi (Intel Compiler, OpenMPI):]
 
-[Required hardware/software:]
+Use a C++11 compatible compiler and do the following:
 
-Kokkos support within LAMMPS must be built with a C++11 compatible
-compiler.  If using gcc, version 4.7.2 or later is required.
+cd lammps/src
+make yes-kokkos
+make kokkos_phi :pre
 
-To build with Kokkos support for CPUs, your compiler must support the
-OpenMP interface.  You should have one or more multi-core CPUs so that
-multiple threads can be launched by each MPI task running on a CPU.
+[Compile for CPUs and GPUs (with OpenMPI or MPICH):]
 
-To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA software
-version 7.5 or later must be installed on your system.  See the
+NOTE: To build with Kokkos support for NVIDIA GPUs, NVIDIA CUDA software
+version 7.5 or later must be installed on your system. See the
 discussion for the "GPU"_accelerate_gpu.html package for details of
 how to check and do this.
 
-NOTE: For good performance of the KOKKOS package on GPUs, you must
-have Kepler generation GPUs (or later).  The Kokkos library exploits
-texture cache options not supported by Telsa generation GPUs (or
-older).
-
-To build with Kokkos support for Intel Xeon Phi coprocessors, your
-sysmte must be configured to use them in "native" mode, not "offload"
-mode like the USER-INTEL package supports.
+Use a C++11 compatible compiler and set the KOKKOS_ARCH variable in
+/src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi for both the GPU and CPU as described
+above. Then do the following:
 
-[Building LAMMPS with the KOKKOS package:]
-
-You must choose at build time whether to build for CPUs (OpenMP),
-GPUs, or Phi.
+cd lammps/src
+make yes-kokkos
+make kokkos_cuda_mpi :pre
 
-You can do any of these in one line, using the suitable make command
-line flags as described in "Section 4"_Section_packages.html of the
-manual. If run from the src directory, these
-commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda_mpi, and
-lmp_kokkos_phi.  Note that the OMP and PHI options use
-src/MAKE/Makefile.mpi as the starting Makefile.machine.  The CUDA
-option uses src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi.
+[Alternative Methods of Compiling:]
 
-The latter two steps can be done using the "-k on", "-pk kokkos" and
-"-sf kk" "command-line switches"_Section_start.html#start_6
-respectively.  Or the effect of the "-pk" or "-sf" switches can be
-duplicated by adding the "package kokkos"_package.html or "suffix
-kk"_suffix.html commands respectively to your input script.
+Alternatively, the KOKKOS package can be built by specifying Kokkos variables
+on the make command line. For example:
 
+make mpi KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=SNB     # set the KOKKOS_DEVICES and KOKKOS_ARCH variables explicitly
+make kokkos_cuda_mpi KOKKOS_ARCH=Pascal60,Power8   # set the KOKKOS_ARCH variable explicitly :pre
 
-Or you can follow these steps:
+Setting the KOKKOS_DEVICES and KOKKOS_ARCH variables on the
+make command line requires a GNU-compatible make command. Try
+"gmake" if your system's standard make complains.
 
-CPU-only (run all-MPI or with OpenMP threading):
+NOTE: If you build using make command-line variables and re-build LAMMPS twice
+with different KOKKOS options and the *same* target, then you *must* perform a "make clean-all"
+or "make clean-machine" before each build. This is to force all the
+KOKKOS-dependent files to be re-compiled with the new options.
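+
+For example, a minimal sketch of re-building the same "mpi" target with
+a different KOKKOS_ARCH value (architecture values are illustrative):
+
+make clean-all                                   # force re-compile of KOKKOS-dependent files
+make mpi KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=HSW :pre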
 
-cd lammps/src
-make yes-kokkos
-make kokkos_omp :pre
+[Running LAMMPS with the KOKKOS package:]
 
-CPU-only (only MPI, no threading):
+All Kokkos operations occur within the
+context of an individual MPI task running on a single node of the
+machine. The total number of MPI tasks used by LAMMPS (one or
+multiple per compute node) is set in the usual manner via the mpirun
+or mpiexec commands, and is independent of Kokkos. E.g. the mpirun
+command in OpenMPI does this via its
+-np and -npernode switches. Ditto for MPICH via -np and -ppn.
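+
+For example, a sketch of launching the same job with either MPI flavor
+(executable name and node layout are illustrative):
+
+mpirun -np 8 -npernode 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj   # OpenMPI: 2 nodes, 4 MPI tasks/node
+mpiexec -np 8 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj       # MPICH: same layout :pre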
 
-cd lammps/src
-make yes-kokkos
-make kokkos_mpi_only :pre
+[Running on a multi-core CPU:]
 
-Intel Xeon Phi (Intel Compiler, Intel MPI):
+Here is a quick overview of how to use the KOKKOS package
+for CPU acceleration, assuming one or more 16-core nodes.
 
-cd lammps/src
-make yes-kokkos
-make kokkos_phi :pre
+mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -in in.lj        # 1 node, 16 MPI tasks/node, no multi-threading
+mpirun -np 2 -ppn 1 lmp_kokkos_omp -k on t 16 -sf kk -in in.lj  # 2 nodes, 1 MPI task/node, 16 threads/task
+mpirun -np 2 lmp_kokkos_omp -k on t 8 -sf kk -in in.lj          # 1 node,  2 MPI tasks/node, 8 threads/task
+mpirun -np 32 -ppn 4 lmp_kokkos_omp -k on t 4 -sf kk -in in.lj  # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre
 
-CPUs and GPUs (with MPICH or OpenMPI):
+To run using the KOKKOS package, use the "-k on", "-sf kk" and "-pk kokkos" "command-line switches"_Section_start.html#start_7 in your mpirun command.
+You must use the "-k on" "command-line
+switch"_Section_start.html#start_7 to enable the KOKKOS package. It
+takes additional arguments for hardware settings appropriate to your
+system. Those arguments are "documented
+here"_Section_start.html#start_7. For OpenMP use:
 
-cd lammps/src
-make yes-kokkos
-make kokkos_cuda_mpi :pre
+-k on t Nt :pre
 
-These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
-make command line which requires a GNU-compatible make command.  Try
-"gmake" if your system's standard make complains.
+The "t Nt" option specifies how many OpenMP threads per MPI
+task to use with a node. The default is Nt = 1, which is MPI-only mode.
+Note that the product of MPI tasks * OpenMP
+threads/task should not exceed the physical number of cores (on a
+node), otherwise performance will suffer. If hyperthreading is enabled, then
+the product of MPI tasks * OpenMP threads/task should not exceed the
+physical number of cores * hardware threads.
+The "-k on" switch also issues a "package kokkos" command (with no
+additional arguments) which sets various KOKKOS options to default
+values, as discussed on the "package"_package.html command doc page.
 
-NOTE: If you build using make line variables and re-build LAMMPS twice
-with different KOKKOS options and the *same* target, e.g. g++ in the
-first two examples above, then you *must* perform a "make clean-all"
-or "make clean-machine" before each build.  This is to force all the
-KOKKOS-dependent files to be re-compiled with the new options.
+The "-sf kk" "command-line switch"_Section_start.html#start_7
+will automatically append the "/kk" suffix to styles that support it.
+In this manner no modification to the input script is needed. Alternatively,
+one can run with the KOKKOS package by editing the input script as described below.
 
-NOTE: Currently, there are no precision options with the KOKKOS
-package.  All compilation and computation is performed in double
-precision.
+NOTE: The default for the "package kokkos"_package.html command is
+to use "full" neighbor lists and set the Newton flag to "off" for both
+pairwise and bonded interactions. However, when running on CPUs, it
+will typically be faster to use "half" neighbor lists and set the
+Newton flag to "on", just as is the case for non-accelerated pair
+styles. It can also be faster to use non-threaded communication.
+Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
+change the default "package kokkos"_package.html
+options. See its doc page for details and default settings. Experimenting with
+its options can provide a speed-up for specific calculations. For example:
 
-There are other allowed options when building with the KOKKOS package.
-As above, they can be set either as variables on the make command line
-or in Makefile.machine.  This is the full list of options, including
-those discussed above, Each takes a value shown below.  The
-default value is listed, which is set in the
-lib/kokkos/Makefile.kokkos file.
+mpirun -np 16 lmp_kokkos_mpi_only -k on -sf kk -pk kokkos newton on neigh half comm no -in in.lj       # Newton on, Half neighbor list, non-threaded comm :pre
 
-#Default settings specific options
-#Options: force_uvm,use_ldg,rdc
+If the "newton"_newton.html command is used in the input
+script, it can also override the Newton flag defaults.
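+
+For example, adding this line to the input script forces Newton "on"
+regardless of the Kokkos default:
+
+newton on :pre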
 
-KOKKOS_DEVICES, values = {OpenMP}, {Serial}, {Pthreads}, {Cuda}, default = {OpenMP}
-KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {ARMv8}, {BGQ}, {Power7}, {Power8}, default = {none}
-KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
-KOKKOS_USE_TPLS, values = {hwloc}, {librt}, default = {none}
-KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc} :ul
+[Core and Thread Affinity:]
 
-KOKKOS_DEVICE sets the parallelization method used for Kokkos code
-(within LAMMPS).  KOKKOS_DEVICES=OpenMP means that OpenMP will be
-used.  KOKKOS_DEVICES=Pthreads means that pthreads will be used.
-KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.
+When using multi-threading, it is important for performance to bind
+both MPI tasks and threads to physical cores, so they do not migrate
+during a simulation.
 
-If KOKKOS_DEVICES=Cuda, then the lo-level Makefile in the src/MAKE
-directory must use "nvcc" as its compiler, via its CC setting.  For
-best performance its CCFLAGS setting should use -O3 and have a
-KOKKOS_ARCH setting that matches the compute capability of your NVIDIA
-hardware and software installation, e.g. KOKKOS_ARCH=Kepler30.  Note
-the minimal required compute capability is 2.0, but this will give
-significantly reduced performance compared to Kepler generation GPUs
-with compute capability 3.x.  For the LINK setting, "nvcc" should not
-be used; instead use g++ or another compiler suitable for linking C++
-applications.  Often you will want to use your MPI compiler wrapper
-for this setting (i.e. mpicxx).  Finally, the lo-level Makefile must
-also have a "Compilation rule" for creating *.o files from *.cu files.
-See src/Makefile.cuda for an example of a lo-level Makefile with all
-of these settings.
+If you are not certain MPI tasks are being bound (check the defaults
+for your MPI installation), binding can be forced with these flags:
 
-KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
-migrate during a simulation.  KOKKOS_USE_TPLS=hwloc should always be
-used if running with KOKKOS_DEVICES=Pthreads for pthreads.  It is not
-necessary for KOKKOS_DEVICES=OpenMP for OpenMP, because OpenMP
-provides alternative methods via environment variables for binding
-threads to hardware cores.  More info on binding threads to cores is
-given in "Section 5.3"_Section_accelerate.html#acc_3.
+OpenMPI 1.8: mpirun -np 2 --bind-to socket --map-by socket ./lmp_openmpi ...
+Mvapich2 2.0: mpiexec -np 2 --bind-to socket --map-by socket ./lmp_mvapich ... :pre
 
-KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an
-Intel Phi processor.
+For binding threads with KOKKOS OpenMP, use thread affinity
+environment variables to force binding. With OpenMP 3.1 (gcc 4.7 or
+later, Intel 12 or later) setting the environment variable
+OMP_PROC_BIND=true should be sufficient. In general, for best performance
+with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads.
+For binding threads with the
+KOKKOS Pthreads option, compile LAMMPS with the
+KOKKOS_USE_TPLS=hwloc option as described below.
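+
+A minimal sketch using the OpenMP 4.0 affinity variables (bash syntax
+and the lmp_kokkos_omp executable are assumed):
+
+export OMP_PROC_BIND=spread
+export OMP_PLACES=threads
+mpirun -np 2 --bind-to socket --map-by socket lmp_kokkos_omp -k on t 8 -sf kk -in in.lj :pre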
 
-KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
-on most Unix platforms.  This library is not available on all
-platforms.
+[Running on Knights Landing (KNL) Intel Xeon Phi:]
 
-KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
-within LAMMPS.  KOKKOS_DEBUG=yes enables printing of run-time
-debugging information that can be useful.  It also enables runtime
-bounds checking on Kokkos data structures.
+Here is a quick overview of how to use the KOKKOS package
+for the Intel Knights Landing (KNL) Xeon Phi:
 
-KOKKOS_CUDA_OPTIONS are additional options for CUDA.
+KNL Intel Phi chips have 68 physical cores. Typically 1 to 4 cores
+are reserved for the OS, and only 64 or 66 cores are used. Each core
+has 4 hyperthreads, so there are effectively N = 256 (4*64) or
+N = 264 (4*66) cores to run on. The product of MPI tasks * OpenMP threads/task should not exceed this limit,
+otherwise performance will suffer. Note that with the KOKKOS package you do not need to
+specify how many KNLs there are per node; each
+KNL is simply treated as running some number of MPI tasks.
 
-For more information on Kokkos see the Kokkos programmers' guide here:
-/lib/kokkos/doc/Kokkos_PG.pdf.
+Examples of mpirun commands that follow these rules are shown below.
 
-[Run with the KOKKOS package from the command line:]
+Intel KNL node with 68 cores (272 threads/node via 4x hardware threading):
+mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj      # 1 node, 64 MPI tasks/node, 4 threads/task
+mpirun -np 66 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj      # 1 node, 66 MPI tasks/node, 4 threads/task
+mpirun -np 32 lmp_kokkos_phi -k on t 8 -sf kk -in in.lj      # 1 node, 32 MPI tasks/node, 8 threads/task
+mpirun -np 512 -ppn 64 lmp_kokkos_phi -k on t 4 -sf kk -in in.lj  # 8 nodes, 64 MPI tasks/node, 4 threads/task :pre
 
-The mpirun or mpiexec command sets the total number of MPI tasks used
-by LAMMPS (one or multiple per compute node) and the number of MPI
-tasks used per node.  E.g. the mpirun command in MPICH does this via
-its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.
+The -np setting of the mpirun command sets the number of MPI
+tasks/node. The "-k on t Nt" command-line switch sets the number of
+threads/task as Nt. The product of these two values should be N, i.e.
+256 or 264.
 
-When using KOKKOS built with host=OMP, you need to choose how many
-OpenMP threads per MPI task will be used (via the "-k" command-line
-switch discussed below).  Note that the product of MPI tasks * OpenMP
-threads/task should not exceed the physical number of cores (on a
-node), otherwise performance will suffer.
-
-When using the KOKKOS package built with device=CUDA, you must use
-exactly one MPI task per physical GPU.
-
-When using the KOKKOS package built with host=MIC for Intel Xeon Phi
-coprocessor support you need to insure there are one or more MPI tasks
-per coprocessor, and choose the number of coprocessor threads to use
-per MPI task (via the "-k" command-line switch discussed below).  The
-product of MPI tasks * coprocessor threads/task should not exceed the
-maximum number of threads the coprocessor is designed to run,
-otherwise performance will suffer.  This value is 240 for current
-generation Xeon Phi(TM) chips, which is 60 physical cores * 4
-threads/core.  Note that with the KOKKOS package you do not need to
-specify how many Phi coprocessors there are per node; each
-coprocessors is simply treated as running some number of MPI tasks.
+NOTE: The default for the "package kokkos"_package.html command is
+to use "full" neighbor lists and set the Newton flag to "off" for both
+pairwise and bonded interactions. When running on KNL, this
+will typically be best for pair-wise potentials. For manybody potentials,
+using "half" neighbor lists and setting the
+Newton flag to "on" may be faster. It can also be faster to use non-threaded communication.
+Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
+change the default "package kokkos"_package.html
+options. See its doc page for details and default settings. Experimenting with
+its options can provide a speed-up for specific calculations. For example:
+
+mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos comm no -in in.lj      #  Newton off, full neighbor list, non-threaded comm
+mpirun -np 64 lmp_kokkos_phi -k on t 4 -sf kk -pk kokkos newton on neigh half comm no -in in.reax      # Newton on, half neighbor list, non-threaded comm :pre
+
+NOTE: MPI tasks and threads should be bound to cores as described above for CPUs.
+
+NOTE: To build with Kokkos support for Intel Xeon Phi coprocessors such as Knights Corner (KNC), your
+system must be configured to use them in "native" mode, not "offload"
+mode like the USER-INTEL package supports.
 
-You must use the "-k on" "command-line
-switch"_Section_start.html#start_6 to enable the KOKKOS package.  It
-takes additional arguments for hardware settings appropriate to your
-system.  Those arguments are "documented
-here"_Section_start.html#start_6.  The two most commonly used
-options are:
+[Running on GPUs:]
 
--k on t Nt g Ng :pre
+Use the "-k" "command-line switch"_Section_commands.html#start_7 to
+specify the number of GPUs per node. Typically the -np setting
+of the mpirun command should set the number of MPI
+tasks/node to be equal to the # of physical GPUs on the node.
+You can assign multiple MPI tasks to the same GPU with the
+KOKKOS package, but this is usually only faster if significant portions
+of the input script have not been ported to use Kokkos. Using CUDA MPS
+is recommended in this scenario. As above for multi-core CPUs (and no GPU), if N is the number
+of physical cores/node, then the number of MPI tasks/node should not exceed N.
 
-The "t Nt" option applies to host=OMP (even if device=CUDA) and
-host=MIC.  For host=OMP, it specifies how many OpenMP threads per MPI
-task to use with a node.  For host=MIC, it specifies how many Xeon Phi
-threads per MPI task to use within a node.  The default is Nt = 1.
-Note that for host=OMP this is effectively MPI-only mode which may be
-fine.  But for host=MIC you will typically end up using far less than
-all the 240 available threads, which could give very poor performance.
+-k on g Ng :pre
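+
+If you do share one GPU among several MPI tasks, the CUDA
+Multi-Process Service (MPS) daemon should be running first; a sketch
+(task and GPU counts are illustrative):
+
+nvidia-cuda-mps-control -d                                        # start the MPS daemon
+mpirun -np 4 lmp_kokkos_cuda_openmpi -k on g 1 -sf kk -in in.lj   # 4 MPI tasks sharing 1 GPU
+echo quit | nvidia-cuda-mps-control                               # stop the daemon :pre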
 
-The "g Ng" option applies to device=CUDA.  It specifies how many GPUs
-per compute node to use.  The default is 1, so this only needs to be
-specified is you have 2 or more GPUs per compute node.
+Here are examples of how to use the KOKKOS package for GPUs,
+assuming one or more nodes, each with two GPUs:
 
-The "-k on" switch also issues a "package kokkos" command (with no
-additional arguments) which sets various KOKKOS options to default
-values, as discussed on the "package"_package.html command doc page.
+mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj          # 1 node,   2 MPI tasks/node, 2 GPUs/node
+mpirun -np 32 -ppn 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj  # 16 nodes, 2 MPI tasks/node, 2 GPUs/node (32 GPUs total) :pre
 
-Use the "-sf kk" "command-line switch"_Section_start.html#start_6,
-which will automatically append "kk" to styles that support it.  Use
-the "-pk kokkos" "command-line switch"_Section_start.html#start_6 if
-you wish to change any of the default "package kokkos"_package.html
-optionns set by the "-k on" "command-line
-switch"_Section_start.html#start_6.
+NOTE: The default for the "package kokkos"_package.html command is
+to use "full" neighbor lists and set the Newton flag to "off" for both
+pairwise and bonded interactions, along with threaded communication.
+When running on Maxwell or Kepler GPUs, this will typically be best. For Pascal GPUs,
+using "half" neighbor lists and setting the
+Newton flag to "on" may be faster. For many pair styles, setting the neighbor binsize
+equal to the ghost atom cutoff will give a speedup.
+Use the "-pk kokkos" "command-line switch"_Section_start.html#start_7 to
+change the default "package kokkos"_package.html
+options. See its doc page for details and default settings. Experimenting with
+its options can provide a speed-up for specific calculations. For example:
+
+mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos binsize 2.8 -in in.lj      # Set binsize = neighbor ghost cutoff
+mpirun -np 2 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -pk kokkos newton on neigh half binsize 2.8 -in in.lj      # Newton on, half neighborlist, set binsize = neighbor ghost cutoff :pre
 
+NOTE: For good performance of the KOKKOS package on GPUs, you must
+have Kepler generation GPUs (or later). The Kokkos library exploits
+texture cache options not supported by Tesla generation GPUs (or
+older).
 
+NOTE: When using a GPU, you will achieve the best performance if your
+input script does not use fix or compute styles which are not yet
+Kokkos-enabled. This allows data to stay on the GPU for multiple
+timesteps, without being copied back to the host CPU. Invoking a
+non-Kokkos fix or compute, or performing I/O for
+"thermo"_thermo_style.html or "dump"_dump.html output will cause data
+to be copied back to the CPU, incurring a performance penalty.
 
-Note that the default for the "package kokkos"_package.html command is
-to use "full" neighbor lists and set the Newton flag to "off" for both
-pairwise and bonded interactions.  This typically gives fastest
-performance.  If the "newton"_newton.html command is used in the input
-script, it can override the Newton flag defaults.
+NOTE: To get an accurate breakdown of the time spent in pair,
+kspace, etc., you must set the environment variable CUDA_LAUNCH_BLOCKING=1.
+However, this will reduce performance and is not recommended for production runs.
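+
+For example (OpenMPI's -x switch is assumed here to export the
+environment variable to all MPI tasks):
+
+mpirun -np 2 -x CUDA_LAUNCH_BLOCKING=1 lmp_kokkos_cuda_openmpi -k on g 2 -sf kk -in in.lj :pre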
 
-However, when running in MPI-only mode with 1 thread per MPI task, it
-will typically be faster to use "half" neighbor lists and set the
-Newton flag to "on", just as is the case for non-accelerated pair
-styles.  You can do this with the "-pk" "command-line
-switch"_Section_start.html#start_6.
+[Run with the KOKKOS package by editing an input script:]
 
-[Or run with the KOKKOS package by editing an input script:]
+Alternatively the effect of the "-sf" or "-pk" switches can be
+duplicated by adding the "package kokkos"_package.html or "suffix
+kk"_suffix.html commands to your input script.
 
-The discussion above for the mpirun/mpiexec command and setting
-appropriate thread and GPU values for host=OMP or host=MIC or
-device=CUDA are the same.
+The discussion above for building LAMMPS with the KOKKOS package, the
+mpirun/mpiexec command, and setting appropriate thread counts is the same.
 
 You must still use the "-k on" "command-line
-switch"_Section_start.html#start_6 to enable the KOKKOS package, and
+switch"_Section_start.html#start_7 to enable the KOKKOS package, and
 specify its additional arguments for hardware options appropriate to
 your system, as documented above.
 
-Use the "suffix kk"_suffix.html command, or you can explicitly add a
+You can use the "suffix kk"_suffix.html command, or you can explicitly add a
 "kk" suffix to individual styles in your input script, e.g.
 
 pair_style lj/cut/kk 2.5 :pre
 
 You only need to use the "package kokkos"_package.html command if you
 wish to change any of its option defaults, as set by the "-k on"
-"command-line switch"_Section_start.html#start_6.
+"command-line switch"_Section_start.html#start_7.
+
+[Using OpenMP threading and CUDA together (experimental):]
+
+With the KOKKOS package, both OpenMP multi-threading and GPUs can be used
+together in a few special cases. In the Makefile, the KOKKOS_DEVICES variable must
+include both "Cuda" and "OpenMP", as is the case for /src/MAKE/OPTIONS/Makefile.kokkos_cuda_mpi
+
+KOKKOS_DEVICES=Cuda,OpenMP :pre
+
+The suffix "/kk" is equivalent to "/kk/device", and for Kokkos CUDA,
+using "-sf kk" on the command line gives the default CUDA version everywhere.
+However, if the "/kk/host" suffix is added to a specific style in the input
+script, the Kokkos OpenMP (CPU) version of that specific style will be used instead.
+Set the number of OpenMP threads as "t Nt" and the number of GPUs as "g Ng":
+
+-k on t Nt g Ng :pre
+
+For example, the command to run with 1 GPU and 8 OpenMP threads is then:
+
+mpiexec -np 1 lmp_kokkos_cuda_openmpi -in in.lj -k on g 1 t 8 -sf kk :pre
+
+Conversely, if "-sf kk/host" is used on the command line and the
+"/kk" or "/kk/device" suffix is added to a specific style in your input script,
+then only that specific style will run on the GPU while everything else will
+run on the CPU in OpenMP mode. Note that the execution of the CPU and GPU
+styles will NOT overlap, except for a special case:
+
+A kspace style and/or molecular topology (bonds, angles, etc.) running on
+the host CPU can overlap with a pair style running on the GPU. First compile
+with "--default-stream per-thread" added to CCFLAGS in the Kokkos CUDA Makefile.
+Then explicitly use the "/kk/host" suffix for kspace and bonds, angles, etc.
+in the input file and the "kk" suffix (equal to "kk/device") on the command line.
+Also make sure the environment variable CUDA_LAUNCH_BLOCKING is not set to "1"
+so CPU/GPU overlap can occur.
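+
+A sketch of such an input script fragment, assuming the named styles
+have KOKKOS versions available in your build:
+
+kspace_style pppm/kk/host 1.0e-4   # kspace runs on the host CPU
+bond_style harmonic/kk/host        # bonds run on the host CPU
+pair_style lj/cut/coul/long 10.0   # pair style gets /kk/device via -sf kk on the command line :pre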
 
 [Speed-ups to expect:]
 
@@ -353,7 +363,7 @@ Generally speaking, the following rules of thumb apply:
 When running on CPUs only, with a single thread per MPI task,
 performance of a KOKKOS style is somewhere between the standard
 (un-accelerated) styles (MPI-only mode), and those provided by the
-USER-OMP package.  However the difference between all 3 is small (less
+USER-OMP package. However the difference between all 3 is small (less
 than 20%). :ulb,l
 
 When running on CPUs only, with multiple threads per MPI task,
@@ -363,7 +373,7 @@ package. :l
 When running large number of atoms per GPU, KOKKOS is typically faster
 than the GPU package. :l
 
-When running on Intel Xeon Phi, KOKKOS is not as fast as
+When running on Intel hardware, KOKKOS is not as fast as
 the USER-INTEL package, which is optimized for that hardware. :l
 :ule
 
@@ -371,123 +381,78 @@ See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
 LAMMPS web site for performance of the KOKKOS package on different
 hardware.
 
-[Guidelines for best performance:]
-
-Here are guidline for using the KOKKOS package on the different
-hardware configurations listed above.
-
-Many of the guidelines use the "package kokkos"_package.html command
-See its doc page for details and default settings.  Experimenting with
-its options can provide a speed-up for specific calculations.
-
-[Running on a multi-core CPU:]
-
-If N is the number of physical cores/node, then the number of MPI
-tasks/node * number of threads/task should not exceed N, and should
-typically equal N.  Note that the default threads/task is 1, as set by
-the "t" keyword of the "-k" "command-line
-switch"_Section_start.html#start_6.  If you do not change this, no
-additional parallelism (beyond MPI) will be invoked on the host
-CPU(s).
+[Advanced Kokkos Options:]
 
-You can compare the performance running in different modes:
-
-run with 1 MPI task/node and N threads/task
-run with N MPI tasks/node and 1 thread/task
-run with settings in between these extremes :ul
-
-Examples of mpirun commands in these modes are shown above.
-
-When using KOKKOS to perform multi-threading, it is important for
-performance to bind both MPI tasks to physical cores, and threads to
-physical cores, so they do not migrate during a simulation.
-
-If you are not certain MPI tasks are being bound (check the defaults
-for your MPI installation), binding can be forced with these flags:
-
-OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
-Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre
-
-For binding threads with the KOKKOS OMP option, use thread affinity
-environment variables to force binding.  With OpenMP 3.1 (gcc 4.7 or
-later, intel 12 or later) setting the environment variable
-OMP_PROC_BIND=true should be sufficient.  For binding threads with the
-KOKKOS pthreads option, compile LAMMPS the KOKKOS HWLOC=yes option
-(see "this section"_Section_packages.html#KOKKOS of the manual for
-details).
-
-[Running on GPUs:]
-
-Insure the -arch setting in the machine makefile you are using,
-e.g. src/MAKE/Makefile.cuda, is correct for your GPU hardware/software.
-(see "this section"_Section_packages.html#KOKKOS of the manual for
-details).
-
-The -np setting of the mpirun command should set the number of MPI
-tasks/node to be equal to the # of physical GPUs on the node.
-
-Use the "-k" "command-line switch"_Section_commands.html#start_6 to
-specify the number of GPUs per node, and the number of threads per MPI
-task.  As above for multi-core CPUs (and no GPU), if N is the number
-of physical cores/node, then the number of MPI tasks/node * number of
-threads/task should not exceed N.  With one GPU (and one MPI task) it
-may be faster to use less than all the available cores, by setting
-threads/task to a smaller value.  This is because using all the cores
-on a dual-socket node will incur extra cost to copy memory from the
-2nd socket to the GPU.
-
-Examples of mpirun commands that follow these rules are shown above.
-
-NOTE: When using a GPU, you will achieve the best performance if your
-input script does not use any fix or compute styles which are not yet
-Kokkos-enabled.  This allows data to stay on the GPU for multiple
-timesteps, without being copied back to the host CPU.  Invoking a
-non-Kokkos fix or compute, or performing I/O for
-"thermo"_thermo_style.html or "dump"_dump.html output will cause data
-to be copied back to the CPU.
-
-You cannot yet assign multiple MPI tasks to the same GPU with the
-KOKKOS package.  We plan to support this in the future, similar to the
-GPU package in LAMMPS.
+There are other allowed options when building with the KOKKOS package.
+As above, they can be set either as variables on the make command line
+or in Makefile.machine. This is the full list of options, including
+those discussed above. Each takes a value shown below. The
+default value is listed, which is set in the
+/lib/kokkos/Makefile.kokkos file.
 
-You cannot yet use both the host (multi-threaded) and device (GPU)
-together to compute pairwise interactions with the KOKKOS package.  We
-hope to support this in the future, similar to the GPU package in
-LAMMPS.
+KOKKOS_DEVICES, values = {Serial}, {OpenMP}, {Pthreads}, {Cuda}, default = {OpenMP}
+KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {Pascal60}, {Pascal61}, {ARMv80}, {ARMv81}, {ARMv8-ThunderX}, {BGQ}, {Power7}, {Power8}, {Power9}, {KNL}, {BDW}, {SKX}, default = {none}
+KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
+KOKKOS_USE_TPLS, values = {hwloc}, {librt}, {experimental_memkind}, default = {none}
+KOKKOS_CXX_STANDARD, values = {c++11}, {c++1z}, default = {c++11}
+KOKKOS_OPTIONS, values = {aggressive_vectorization}, {disable_profiling}, default = {none}
+KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc}, {enable_lambda}, default = {enable_lambda} :ul
+
+KOKKOS_DEVICES sets the parallelization method used for Kokkos code
+(within LAMMPS). KOKKOS_DEVICES=Serial means that no threading will be used.
+KOKKOS_DEVICES=OpenMP means that OpenMP threading will be
+used. KOKKOS_DEVICES=Pthreads means that pthreads will be used.
+KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.
 
-[Running on an Intel Phi:]
+KOKKOS_ARCH enables compiler switches needed when compiling for
+specific hardware:
+
+ARMv80 = ARMv8.0 Compatible CPU
+ARMv81 = ARMv8.1 Compatible CPU
+ARMv8-ThunderX = ARMv8 Cavium ThunderX CPU
+SNB = Intel Sandy/Ivy Bridge CPUs
+HSW = Intel Haswell CPUs
+BDW = Intel Broadwell Xeon E-class CPUs
+SKX = Intel Skylake Xeon E-class HPC CPUs (AVX512)
+KNC = Intel Knights Corner Xeon Phi
+KNL = Intel Knights Landing Xeon Phi
+Kepler30 = NVIDIA Kepler generation CC 3.0
+Kepler32 = NVIDIA Kepler generation CC 3.2
+Kepler35 = NVIDIA Kepler generation CC 3.5
+Kepler37 = NVIDIA Kepler generation CC 3.7
+Maxwell50 = NVIDIA Maxwell generation CC 5.0
+Maxwell52 = NVIDIA Maxwell generation CC 5.2
+Maxwell53 = NVIDIA Maxwell generation CC 5.3
+Pascal60 = NVIDIA Pascal generation CC 6.0
+Pascal61 = NVIDIA Pascal generation CC 6.1
+BGQ = IBM Blue Gene/Q CPUs
+Power8 = IBM POWER8 CPUs
+Power9 = IBM POWER9 CPUs :ul
 
-Kokkos only uses Intel Phi processors in their "native" mode, i.e.
-not hosted by a CPU.
+KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
+migrate during a simulation. KOKKOS_USE_TPLS=hwloc should always be
+used if running with KOKKOS_DEVICES=Pthreads. It is not
+necessary for KOKKOS_DEVICES=OpenMP, because OpenMP
+provides alternative methods via environment variables for binding
+threads to hardware cores. More info on binding threads to cores is
+given in "Section 5.3"_Section_accelerate.html#acc_3.
 
-As illustrated above, build LAMMPS with OMP=yes (the default) and
-MIC=yes.  The latter insures code is correctly compiled for the Intel
-Phi.  The OMP setting means OpenMP will be used for parallelization on
-the Phi, which is currently the best option within Kokkos.  In the
-future, other options may be added.
+KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
+on most Unix platforms. This library is not available on all
+platforms.
 
-Current-generation Intel Phi chips have either 61 or 57 cores.  One
-core should be excluded for running the OS, leaving 60 or 56 cores.
-Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
-N = 224 (4*56) cores to run on.
+KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
+within LAMMPS. KOKKOS_DEBUG=yes enables printing of run-time
+debugging information that can be useful. It also enables runtime
+bounds checking on Kokkos data structures.
 
-The -np setting of the mpirun command sets the number of MPI
-tasks/node.  The "-k on t Nt" command-line switch sets the number of
-threads/task as Nt.  The product of these 2 values should be N, i.e.
-240 or 224.  Also, the number of threads/task should be a multiple of
-4 so that logical threads from more than one MPI task do not run on
-the same physical core.
+KOKKOS_CXX_STANDARD and KOKKOS_OPTIONS are typically not changed when building LAMMPS.
 
-Examples of mpirun commands that follow these rules are shown above.
+KOKKOS_CUDA_OPTIONS are additional options for CUDA. The LAMMPS KOKKOS package must be compiled
+with the {enable_lambda} option when using GPUs.
 
 [Restrictions:]
 
-As noted above, if using GPUs, the number of MPI tasks per compute
-node should equal to the number of GPUs per compute node.  In the
-future Kokkos will support assigning multiple MPI tasks to a single
-GPU.
-
-Currently Kokkos does not support AMD GPUs due to limits in the
-available backend programming models.  Specifically, Kokkos requires
-extensive C++ support from the Kernel language.  This is expected to
-change in the future.
+Currently, there are no precision options with the KOKKOS
+package. All compilation and computation is performed in double
+precision.