NVIDIA ALCHEMI Toolkit-Ops
May 29, 2026
High-performance NVIDIA Warp primitives for computational chemistry
NVIDIA ALCHEMI Toolkit-Ops is a collection of GPU-optimized, batched
primitives for accelerating atomistic simulations. The high-performance compute
kernels are written in NVIDIA Warp (warp-lang).
Key Features
- Molecular Dynamics kernels: Velocity Verlet (NVE), Langevin (NVT), Nosé-Hoover Chain (NVT), NPT/NPH ensembles, velocity rescaling
- Geometry optimization: FIRE and FIRE2 with optional unit cell optimization
- Neighbor lists: naive and cell list algorithms
- Dispersion corrections via Becke-Johnson damped DFT-D3
- Electrostatic interactions: Ewald, particle mesh Ewald (PME), and damped shifted force (DSF) algorithms
- Differentiable physics: analytical stress tensor (virial) support for Ewald and PME, enabling stress-based MLIP training
- NVIDIA Warp core with optional, JIT-compatible PyTorch and JAX bindings, including autograd support
Kernels are designed to scale to large systems (>100,000 atoms) and are optimized for high-throughput operation on GPUs (on the order of several microseconds per atom), with batching support.
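For readers less familiar with the integrators listed above, the following is a minimal NumPy sketch of a single velocity Verlet (NVE) step on a toy harmonic system. It is a reference illustration only and does not show the toolkit's Warp kernels or API; `velocity_verlet_step`, `harmonic_forces`, and `total_energy` are hypothetical names introduced here.

```python
import numpy as np


def velocity_verlet_step(positions, velocities, forces, masses, dt, force_fn):
    """Advance one velocity Verlet step; returns new positions, velocities, forces."""
    # half-step velocity update using the forces at the current positions
    velocities_half = velocities + 0.5 * dt * forces / masses[:, None]
    # full-step position update using the half-step velocities
    new_positions = positions + dt * velocities_half
    # recompute forces at the updated positions
    new_forces = force_fn(new_positions)
    # second half-step velocity update using the new forces
    new_velocities = velocities_half + 0.5 * dt * new_forces / masses[:, None]
    return new_positions, new_velocities, new_forces


def harmonic_forces(positions, k=1.0):
    """Toy force field (assumption): independent springs anchored at the origin."""
    return -k * positions


def total_energy(positions, velocities, masses, k=1.0):
    """Kinetic plus harmonic potential energy of the toy system."""
    kinetic = 0.5 * np.sum(masses[:, None] * velocities**2)
    potential = 0.5 * k * np.sum(positions**2)
    return kinetic + potential


rng = np.random.default_rng(0)
positions = rng.normal(size=(8, 3))
velocities = np.zeros((8, 3))
masses = np.ones(8)
forces = harmonic_forces(positions)

initial_energy = total_energy(positions, velocities, masses)
for _ in range(1000):
    positions, velocities, forces = velocity_verlet_step(
        positions, velocities, forces, masses, 0.01, harmonic_forces
    )
final_energy = total_energy(positions, velocities, masses)

# relative energy drift; stays small for a symplectic integrator at this step size
energy_drift = abs(final_energy - initial_energy) / initial_energy
```

The bounded energy drift is the property that makes this scheme a common default for NVE dynamics; the thermostatted and barostatted variants listed above build on the same update structure.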
Use Cases
We currently see three primary use cases for nvalchemi-toolkit-ops in the
ecosystem:
- Library maintainers and developers are encouraged to benchmark and explore integrating functionality like neighbor list computation to accelerate existing workflows;
- Researchers and model developers ideally should be able to rely on this package (and not implement their own!) for neighbor list computation, interatomic interactions, and so on during method development;
- Engineers looking to build applications that involve molecular dynamics,
interatomic potentials, and the like can take advantage of optimized and
maintained low-level kernels.
warp-lang kernels should be sufficiently modular to allow for a high degree of flexibility and reusability.
The combination of being GPU-first and batched should enable the kernels contained
in nvalchemi-toolkit-ops to be ready for a wide range of research and production
applications.
Example Snippets
We encourage interested readers to browse our hosted documentation. Below are some short snippets that highlight our straightforward API and use cases for PyTorch; see the hosted documentation for JAX details.
Neighbor list in a 2D unit cell with 50,000 atoms
This example uses PyTorch:
import torch
from nvalchemiops.torch.neighbors import neighbor_list
torch.set_default_dtype(torch.float32)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_default_device(device)
NUM_ATOMS = 50_000
# arbitrarily scale positions
positions = torch.randn((NUM_ATOMS, 3)) * 10.0
cell = torch.eye(3, dtype=torch.float32).unsqueeze(0)
pbc = torch.tensor([True, True, False], dtype=torch.bool)
cutoff = 6.0
# use padded matrix representation for neighbors, optimal for
# compiled applications that need constant shapes
neighbor_matrix, num_neighbors, shift_matrix = neighbor_list(
    positions,
    cutoff,
    cell=cell,
    pbc=pbc,
    method="cell_list"
)
# ...or pass `return_neighbor_list=True` for the familiar COO
# `edge_index` format. When `method` is omitted, the neighbor
# algorithm is chosen automatically based on system size
edge_index, neighbor_ptr, shifts = neighbor_list(
    positions,
    cutoff,
    cell=cell,
    pbc=pbc,
    return_neighbor_list=True
)
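To illustrate how the padded matrix and COO representations relate, here is a small NumPy sketch that converts one into the other. It assumes valid entries are packed at the start of each row and uses `num_neighbors` to mask padding, so it does not depend on the fill value the library actually uses (the `-1` below is only a placeholder). `neighbor_matrix_to_edge_index` is a hypothetical helper, not part of the toolkit; in practice `return_neighbor_list=True` already provides the COO form.

```python
import numpy as np


def neighbor_matrix_to_edge_index(neighbor_matrix, num_neighbors):
    """Convert a padded [num_atoms, max_neighbors] matrix to a [2, num_edges] edge list."""
    num_atoms, max_neighbors = neighbor_matrix.shape
    # slot j of row i is a real neighbor only when j < num_neighbors[i]
    valid = np.arange(max_neighbors)[None, :] < num_neighbors[:, None]
    # source atom index, repeated along each row
    sources = np.broadcast_to(np.arange(num_atoms)[:, None], neighbor_matrix.shape)
    return np.stack([sources[valid], neighbor_matrix[valid]])


# toy input: 3 atoms, up to 2 neighbor slots, padding marked with -1 (placeholder)
neighbor_matrix = np.array([[1, 2], [0, -1], [0, -1]])
num_neighbors = np.array([2, 1, 1])

edge_index = neighbor_matrix_to_edge_index(neighbor_matrix, num_neighbors)
```

The padded form keeps tensor shapes constant across steps, which suits compiled pipelines, while the COO form stores only real pairs and matches what many graph-based models expect.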
DFT-D3(BJ) corrections on a batch of molecules
This example assumes you have already concatenated a set of molecules
into combined tensors and computed their neighborhoods using the
neighbor_list API. Here, we demonstrate the matrix representation:
import torch
from nvalchemiops.torch.interactions.dispersion import dftd3
from nvalchemiops.torch.neighbors import neighbor_list
# the following parameters need to be constructed ahead of time
positions = ... # [num_atoms, 3]
atomic_numbers = ... # [num_atoms]
cell = ... # [num_systems, 3, 3]
pbc = ... # [num_systems, 3]
batch_idx = ... # [num_atoms]
batch_ptr = ... # [num_systems + 1]
# construct neighbor matrix
neighbor_matrix, num_neighbors, shift_matrix = neighbor_list(
    positions,
    cutoff=..., # on the order of ~20 Angstroms
    cell=cell,
    pbc=pbc,
    batch_idx=batch_idx,
    batch_ptr=batch_ptr
)
# DFT-D3 parameters, which include the reference C6 coefficients, must be provided.
# Refer to the user documentation for the expected structure and data source.
d3_params = ...
# pass everything to the functional interface
d3_energies, d3_forces, coord_nums, d3_virials = dftd3(
    positions=positions,
    numbers=atomic_numbers,
    neighbor_matrix=neighbor_matrix,
    neighbor_matrix_shifts=shift_matrix,
    batch_idx=batch_idx,
    # functional specific DFT-D3 parameters (PBE shown)
    a1=0.4289, a2=4.4407, s8=0.7875,
    d3_params=d3_params,
    compute_virial=True
)
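The `batch_idx` and `batch_ptr` placeholders above can be pictured with a small NumPy sketch. It assumes, based only on the shape comments in the example, that `batch_idx` maps each atom to its system and `batch_ptr` holds cumulative atom offsets with one extra trailing entry; the integer dtype the library expects is not shown here, so check the documentation before passing real tensors.

```python
import numpy as np

# number of atoms in each molecule of the batch (toy values)
atoms_per_system = np.array([3, 5, 2])

# offsets into the concatenated atom arrays: [num_systems + 1]
batch_ptr = np.concatenate([[0], np.cumsum(atoms_per_system)])

# system index for every atom: [num_atoms]
batch_idx = np.repeat(np.arange(len(atoms_per_system)), atoms_per_system)

# atoms of system 1 occupy the half-open slice batch_ptr[1]:batch_ptr[2]
system_one_atoms = np.arange(batch_ptr[1], batch_ptr[2])
```

The two arrays carry the same information in different forms: the per-atom index is convenient for scatter-style reductions, while the offsets make it cheap to slice out a single system.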
Electrostatics via particle mesh Ewald
This example shows how to compute the per-atom and system energies as well as the forces using the particle mesh Ewald interface.
import torch
from nvalchemiops.torch.interactions.electrostatics import particle_mesh_ewald
from nvalchemiops.torch.neighbors import neighbor_list
# the following parameters need to be constructed ahead of time
positions = ... # [num_atoms, 3]
atomic_numbers = ... # [num_atoms]
cell = ... # [num_systems, 3, 3]
pbc = ... # [num_systems, 3]
atomic_charges = ... # [num_atoms]
# construct neighbor matrix
neighbor_matrix, num_neighbors, shift_matrix = neighbor_list(
    positions,
    cutoff=..., # on the order of ~20 Angstroms
    cell=cell,
    pbc=pbc,
)
# call PME, using automatic parameter tuning
atom_energies, atom_forces = particle_mesh_ewald(
    positions=positions,
    charges=atomic_charges,
    cell=cell,
    neighbor_matrix=neighbor_matrix,
    neighbor_matrix_shifts=shift_matrix,
    accuracy=1e-6
)
system_energy = atom_energies.sum()
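A plain `.sum()` gives the total energy when the tensors describe one system. If several systems are concatenated and a per-atom system index is available (like the `batch_idx` in the DFT-D3 example), per-system energies could be reduced as in the following NumPy sketch. `per_system_energy` is a hypothetical helper, and whether a given toolkit function returns batched energies directly is not assumed here.

```python
import numpy as np


def per_system_energy(atom_energies, batch_idx, num_systems):
    """Sum per-atom energies into one total per system."""
    # bincount with weights accumulates each atom's energy into its system's bin
    return np.bincount(batch_idx, weights=atom_energies, minlength=num_systems)


# toy values: 5 atoms split across 2 systems
atom_energies = np.array([-1.0, -2.0, 0.5, 1.5, -0.25])
batch_idx = np.array([0, 0, 1, 1, 1])

system_energies = per_system_energy(atom_energies, batch_idx, num_systems=2)
```

With PyTorch tensors, the same reduction can be written with `torch.zeros(num_systems).index_add_(0, batch_idx, atom_energies)`, which stays on the GPU and remains differentiable with respect to `atom_energies`.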
CUDA 13 Support
CUDA 13 is required for Blackwell GPUs. torch>=2.11.0 and jax[cuda13]
publish CUDA 13 wheels on the default PyPI index for Linux x86_64 and
aarch64 platforms.
The torch and jax extras use CUDA 13 by default and are equivalent to the
explicit torch-cu13 and jax-cu13 extras. Use torch-cu12 and jax-cu12
when CUDA 12.6 PyTorch wheels or CUDA 12 JAX plugins are required. The PyTorch
indexes used for explicit CUDA selection are cu130 for CUDA 13 and cu126
for the CUDA 12 fallback.
# Standalone install
uv venv --seed --python 3.12
uv pip install nvalchemi-toolkit-ops torch==2.11.0
# Explicit CUDA 13 PyTorch wheel index
uv pip install nvalchemi-toolkit-ops \
    torch==2.11.0+cu130 \
    --extra-index-url https://download.pytorch.org/whl/cu130
See the installation guide for details.
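After installing, it can be useful to confirm which CUDA version the installed PyTorch wheel was built against. The following is a small sketch using standard PyTorch attributes; `describe_torch_cuda` is a hypothetical helper and is not part of the toolkit.

```python
import importlib.util


def describe_torch_cuda():
    """Return a one-line summary of the installed PyTorch build's CUDA support."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    # torch.version.cuda is None when the build was not compiled against CUDA
    cuda_version = torch.version.cuda
    if cuda_version is None:
        return f"torch {torch.__version__} (no CUDA support in this build)"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} built for CUDA {cuda_version}, no GPU visible"
    major, minor = torch.cuda.get_device_capability(0)
    return (
        f"torch {torch.__version__} built for CUDA {cuda_version}, "
        f"device compute capability {major}.{minor}"
    )


summary = describe_torch_cuda()
print(summary)
```

A reported CUDA version starting with `13` indicates a cu130 wheel, while `12.6` indicates the CUDA 12 fallback described above.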
Roadmap
Features planned for upcoming releases:
- Performance improvements for neighbor lists, DFT-D3, and electrostatics
- Explicit 2nd-derivative electrostatics kernels for more efficient MLIP training
- Multipole Ewald summation
- Batched Nudged Elastic Band (NEB)
- Support for custom pair potentials in neighbor list functions
- Slab corrections for pseudo-2D periodic systems
- Ewald dispersion
- Improved pair potential coverage (e.g. ZBL, OQDO, Born-Mayer)
- Basis functions and descriptors for MLIPs (e.g. spherical harmonics, radial basis, Wigner D3 matrix)
Contributions & Disclaimers
Feature requests, discussions, and general feedback are welcome and encouraged via GitHub Issues. Before submitting a pull request, we strongly encourage you to open an issue to discuss it with the developers, so we can understand your use case and collaborate on features and bug fixes. Contributors must read CONTRIBUTING.md to understand and follow the development workflow and logistics.
NVIDIA ALCHEMI Toolkit-Ops is under active development; while we strive to keep public-facing APIs from breaking, they are subject to change as we continue to improve performance and user experience.