The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect. It models NVVM intrinsics and public ISA functionality and introduces NVIDIA extensions to the MLIR/LLVM type system and address spaces (e.g., global, shared, and cluster memory), enabling faithful lowering of GPU kernels to the NVPTX toolchain. While an NVVM op usually maps to a single LLVM IR intrinsic, the NVVM dialect uses type polymorphism and other attributes so that a single NVVM op can map to different LLVM intrinsics.
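For instance (a minimal sketch), `nvvm.shfl.sync` is one such polymorphic op: its result type selects which LLVM intrinsic it lowers to.

```mlir
llvm.func @shfl_poly(%mask: i32, %vi: i32, %vf: f32, %offset: i32, %clamp: i32) {
  // Lowers to llvm.nvvm.shfl.sync.bfly.i32.
  %ri = nvvm.shfl.sync bfly %mask, %vi, %offset, %clamp : i32 -> i32
  // Same op, different type: lowers to llvm.nvvm.shfl.sync.bfly.f32.
  %rf = nvvm.shfl.sync bfly %mask, %vf, %offset, %clamp : f32 -> f32
  llvm.return
}
```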
The dialect covers core GPU features such as thread/block builtins, barriers and atomics, warp-level collectives (e.g., shuffle/vote), matrix/tensor core operations (e.g., mma.sync, wgmma), tensor memory accelerator (TMA) operations, asynchronous copies (cp.async, bulk/tensor variants) with memory barriers, cache and prefetch controls, and NVVM-specific attributes and enums (e.g., FP rounding modes, memory scopes, and MMA types/layouts).
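As a small illustration, the sketch below reads a thread builtin and issues a CTA-wide barrier, two of the simpler ops from these groups:

```mlir
llvm.func @builtin_and_barrier() {
  // Lane index of the current thread within its warp.
  %lane = nvvm.read.ptx.sreg.laneid : i32
  // Synchronize all threads in the CTA.
  nvvm.barrier
  llvm.return
}
```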
NVVM sits below target-agnostic dialects like gpu and NVIDIA's nvgpu. Typical pipelines convert gpu/nvgpu ops into NVVM using -convert-gpu-to-nvvm and -convert-nvgpu-to-nvvm, then translate to LLVM IR for final code generation via the NVPTX backend.
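For instance, -convert-gpu-to-nvvm rewrites `gpu.thread_id` into the corresponding NVVM special-register read (roughly; the actual pattern also materializes casts to the index bit width):

```mlir
// Before the conversion (gpu dialect):
%tid = gpu.thread_id x
// After the conversion (sketch, index casts omitted):
%tid_i32 = nvvm.read.ptx.sreg.tid.x : i32
```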
NVVM provides a #nvvm.target attribute to describe the GPU target (SM, features, and flags). In conjunction with gpu serialization (e.g., gpu-module-to-binary), this enables producing architecture-specific GPU binaries (such as CUBIN) from nested GPU modules.
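For example, a GPU module can carry an NVVM target attribute (a sketch; `O`, `chip`, and `features` are the commonly used parameters):

```mlir
gpu.module @kernels [#nvvm.target<O = 3, chip = "sm_90a", features = "+ptx80">] {
  gpu.func @noop() kernel {
    gpu.return
  }
}
```

Running gpu-module-to-binary over such a module then serializes each nested GPU module for its listed targets.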
When an intrinsic is unavailable or a performance-critical sequence must be expressed directly, NVVM provides an nvvm.inline_ptx op to embed PTX inline as a last-resort escape hatch, with explicit operands and results.
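A sketch of what this can look like (illustrative only; the exact assembly format and operand constraints are defined in the op reference included below):

```mlir
// Hypothetical use: wrap a single PTX instruction with one input and one result.
%y = nvvm.inline_ptx "ex2.approx.ftz.f32 $0, $1;" (%x) : f32 -> f32
```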
The NVVM dialect introduces the following memory spaces, each with distinct scopes and lifetimes:
| Memory Space   | Address Space | Scope                |
|----------------|---------------|----------------------|
| generic        | 0             | All threads          |
| global         | 1             | All threads (device) |
| shared         | 3             | Thread block (CTA)   |
| constant       | 4             | All threads          |
| local          | 5             | Single thread        |
| tensor         | 6             | Thread block (CTA)   |
| shared_cluster | 7             | Thread block cluster |
- `generic` (0): can address any of the other memory spaces. A pointer to this memory space is represented by the `LLVM_PointerGeneric` type in the NVVM Ops.
- `global` (1): device memory accessible to all threads. A pointer to this memory space is represented by the `LLVM_PointerGlobal` type in the NVVM Ops.
- `shared` (3): memory shared by all threads of a CTA, known as `shared_cta` in the NVVM Ops and as `shared::cta` in the PTX ISA. A pointer to this memory space is represented by the `LLVM_PointerShared` type in the NVVM Ops.
- `constant` (4): read-only memory visible to all threads. A pointer to this memory space is represented by the `LLVM_PointerConst` type in the NVVM Ops.
- `local` (5): memory private to a single thread. A pointer to this memory space is represented by the `LLVM_PointerLocal` type in the NVVM Ops.
- `tensor` (6): tensor memory used by `tcgen05` instructions on SM 100+ for tensor input/output operations. A pointer to this memory space is represented by the `LLVM_PointerTensor` type in the NVVM Ops.
- `shared_cluster` (7): shared memory addressable by all threads of a thread block cluster, known as `shared_cluster` in the NVVM Ops and as `shared::cluster` in the PTX ISA. A pointer to this memory space is represented by the `LLVM_PointerSharedCluster` type in the NVVM Ops.

An mbarrier is a barrier created in shared memory that supports synchronizing any subset of threads within a CTA. An mbarrier object is an opaque object in shared memory with `.b64` type and an alignment of 8 bytes. Unlike the `nvvm.barrier` op, which can access only a limited number of barriers per CTA, mbarrier objects are user-defined and limited only by the total shared memory available. The operations supported on an mbarrier object are exposed through the `nvvm.mbarrier.*` family of NVVM Ops.
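As a small example tying these together (a sketch assuming the `nvvm.mbarrier.init.shared` form; see the op reference below for the full family), an mbarrier living in CTA shared memory (`!llvm.ptr<3>`) is initialized with the number of participating threads:

```mlir
llvm.func @init_mbarrier(%bar: !llvm.ptr<3>, %count: i32) {
  // Initialize the mbarrier object at %bar for %count arriving threads.
  nvvm.mbarrier.init.shared %bar, %count : !llvm.ptr<3>, i32
  llvm.return
}
```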
NVVM is not a place for convenience or “wrapper” ops. It is not intended to introduce high-level ops that expand into multiple unrelated NVVM intrinsics or that lower to no intrinsic at all. Such abstractions belong in higher-level dialects (e.g., nvgpu, gpu, or project-specific dialects). The design intent is a thin, predictable, low-level surface with near-mechanical lowering to NVVM/LLVM IR.
All operations in NVIDIA's instruction set have a custom form in MLIR. The mnemonic of an operation is the one used in LLVM IR, prefixed with `nvvm.`.
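For example (assuming `nvvm.rcp.approx.ftz.f32`, which mirrors the LLVM intrinsic `llvm.nvvm.rcp.approx.ftz.f32`):

```mlir
llvm.func @reciprocal(%x: f32) -> f32 {
  // Custom MLIR form of the llvm.nvvm.rcp.approx.ftz.f32 intrinsic.
  %r = nvvm.rcp.approx.ftz.f32 %x : f32
  llvm.return %r : f32
}
```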
[include "Dialects/NVVMOps.md"]