TCGen05 Copy Instruction
This tutorial will cover the tcgen05_copy instruction: how to use it and its
applications.
The tcgen05_copy instruction is an asynchronous tensor core operation that copies
data from shared memory to tensor memory. Its completion is tracked with
tcgen05_commit on an mbarrier, just like tcgen05_mma, and a single
tcgen05_commit can track the completion of one or more tcgen05_copy operations:
tcgen05_copy(lhs_smem, lhs_tmem)
tcgen05_copy(acc_smem, acc_tmem)
tcgen05_commit(bar)
mbarrier.wait(bar, phase=phase)
acc = acc_tmem.load()
lhs = lhs_tmem.load()
tcgen05_copy can be used to copy data into tensor memory that is then fed into a
tcgen05_mma instruction. Even though tcgen05_copy is asynchronous, it is
implicitly pipelined with tcgen05_mma, so the MMA is guaranteed to start only
after the copy is complete:
tcgen05_copy(smem, lhs_tmem)
tcgen05_mma(lhs_tmem, rhs_smem, acc_tmem)
tcgen05_commit(bar)
mbarrier.wait(bar, phase=phase)
The implicit pipelining exists because the PTX-level tcgen05.copy and tcgen05.mma
instructions are executed by the tensor core pipe on the SM, which you can think
of as a single thread running tensor core specific instructions on the SM,
asynchronously from the rest of the SM. In other words, every tcgen05_* instruction
enqueues a tensor core operation on the tensor pipe, and those operations are
executed in order.
The following is also valid:
tcgen05_copy(lhs_smem0, lhs_tmem)
tcgen05_mma(lhs_tmem, rhs_smem, acc_tmem)
tcgen05_commit(bar)
tcgen05_copy(lhs_smem1, lhs_tmem)
tcgen05_mma(lhs_tmem, rhs_smem, acc_tmem)
This works because the second tcgen05_copy only executes after the preceding
tcgen05_mma is complete. In other words, tcgen05_copy, tcgen05_mma, and
tcgen05_commit are all implicitly pipelined and executed in order.
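This in-order execution can be modeled as a simple FIFO. The sketch below is purely illustrative: the TensorPipeModel class and its method names are invented for this tutorial and are not part of any real API.

```python
from collections import deque

# Toy model of the tensor core pipe: every tcgen05_* instruction enqueues
# an operation, and the pipe drains the queue strictly in program order.
class TensorPipeModel:
    def __init__(self):
        self._queue = deque()
        self.completed = []

    def enqueue(self, op):
        self._queue.append(op)

    def drain(self):
        # Operations complete in exactly the order they were enqueued.
        while self._queue:
            self.completed.append(self._queue.popleft())

pipe = TensorPipeModel()
for op in ["tcgen05_copy", "tcgen05_mma", "tcgen05_commit"]:
    pipe.enqueue(op)
pipe.drain()
```

Because completion order matches issue order, a tcgen05_mma enqueued after a tcgen05_copy can never observe a partially copied operand.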
tcgen05_copy accesses shared memory via the async proxy, just like tcgen05_mma.
Make sure to insert fences as appropriate:
lhs_smem.store(value0)
fence_async_shared()
tcgen05_copy(lhs_smem, lhs_tmem)
tcgen05_commit(bar)
mbarrier.wait(bar, phase=phase)
lhs_smem.store(value1)
Note that a fence is not needed between tcgen05_copy and the second write to
lhs_smem because waiting on the completion of the tcgen05_copy operation
via the mbarrier implicitly fences the generic and async proxies.
What makes using tcgen05_copy particularly tricky is selecting the right
shared memory and tensor memory layouts, as tcgen05_copy only supports a
limited set of instruction shapes for copying data from shared to tensor memory.
import itertools
import importlib
import pytest
import triton
import torch
import triton.experimental.gluon as gluon
import triton.experimental.gluon.language as gl
from triton.experimental.gluon.nvidia.hopper import TensorDescriptor
from triton.experimental.gluon.language.nvidia.blackwell import (
TensorMemoryLayout,
tensor_memory_descriptor,
allocate_tensor_memory,
fence_async_shared,
tcgen05_copy,
tcgen05_commit,
tcgen05_mma,
mbarrier,
tma,
)
def is_blackwell():
target = triton.runtime.driver.active.get_current_target()
return target.backend == "cuda" and torch.cuda.get_device_capability()[0] == 10
if __name__ == "__main__" and not is_blackwell():
raise RuntimeError("This tutorial requires a Blackwell NVIDIA GPU")
# Re-use utilities from the previous tutorials.
t7 = importlib.import_module("07-persistence")
t8 = importlib.import_module("08-warp-specialization")
Let’s write an example kernel that uses tcgen05_copy and shows what the
requirements are for the shared and tensor memory layouts.
@gluon.jit
def tcgen05_copy_kernel(in_ptr, in_stride0, in_stride1, out_ptr, out_stride0, out_stride1, M: gl.constexpr,
N: gl.constexpr, smem_layout: gl.constexpr, tmem_layout: gl.constexpr):
coalesced_2d_layout: gl.constexpr = gl.BlockedLayout([1, 1], [1, 32], [1, gl.num_warps()], [1, 0])
offs_m = gl.arange(0, M, gl.SliceLayout(1, coalesced_2d_layout))
offs_n = gl.arange(0, N, gl.SliceLayout(0, coalesced_2d_layout))
input = gl.load(in_ptr + offs_m[:, None] * in_stride0 + offs_n[None, :] * in_stride1)
# Allocate shared memory and tensor memory with the tile shape [M, N].
smem = gl.allocate_shared_memory(input.dtype, (M, N), smem_layout)
tmem = allocate_tensor_memory(input.dtype, (M, N), tmem_layout)
bar = gl.allocate_shared_memory(gl.int64, [1], gl.constexpr(mbarrier.MBarrierLayout()))
mbarrier.init(bar, count=1)
# Copy data from shared memory to tensor memory.
smem.store(input)
# Fence generic and async proxies
fence_async_shared()
# Issue the async copy
tcgen05_copy(smem, tmem)
# Track completion of the async copy
tcgen05_commit(bar)
# Wait for the async copy to complete
mbarrier.wait(bar, phase=0)
mbarrier.invalidate(bar)
# Read the data from tensor memory.
output = tmem.load()
# Write using a coalesced layout.
output = gl.convert_layout(output, coalesced_2d_layout)
gl.store(out_ptr + offs_m[:, None] * out_stride0 + offs_n[None, :] * out_stride1, output)
def tcgen05_copy_example(M, N, smem_layout, tmem_layout, dtype):
input = torch.randn(M, N, dtype=dtype, device="cuda")
output = torch.empty_like(input)
tcgen05_copy_kernel[(1, )](input, *input.stride(), output, *output.stride(), M, N, smem_layout, tmem_layout)
# Just check that the input and output are equal.
torch.testing.assert_close(input, output, atol=0, rtol=0)
Let’s first explore the valid shared memory layouts for the source of
tcgen05_copy when the destination tensor memory layout is a
TensorMemoryLayout, which is common when using TMAs and tensor core
instructions.
Recall that TensorMemoryLayout only supports 2D memory descriptors. When the
destination tensor memory layout is a TensorMemoryLayout, the source shared
memory layout is typically an NVMMASharedLayout. Other exotic layouts are
supported, such as some SharedLinearLayout, but we won’t cover them in this
tutorial.
Additionally, the following restrictions currently apply to the NVMMASharedLayout:
- The layout must be swizzled (swizzle_byte_width > 0).
- The dtype must be 32-bit (e.g. gl.float32).
- The TensorMemoryLayout block M must be 128.
- The layout cannot be transposed.
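These rules can be summarized as a small predicate. The helper below is hypothetical (it is not part of the Gluon API) and simply mirrors the restrictions listed above:

```python
def is_valid_nvmma_to_tmem_copy(swizzle_byte_width, element_bitwidth,
                                tmem_block_m, transposed):
    # Hypothetical checker mirroring the restrictions above; not a Gluon API.
    if swizzle_byte_width <= 0:   # layout must be swizzled
        return False
    if element_bitwidth != 32:    # dtype must be 32-bit
        return False
    if tmem_block_m != 128:       # TensorMemoryLayout block M must be 128
        return False
    if transposed:                # layout cannot be transposed
        return False
    return True
```

For example, a 128-byte-swizzled float32 layout with block M of 128 passes, while an unswizzled or 16-bit layout does not.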
configs = []
TMEM_BLOCK_M = 128
for TMEM_BLOCK_N in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
for M, N in itertools.product([128, 256], [16, 32, 64, 128, 256]):
if M < TMEM_BLOCK_M or N < TMEM_BLOCK_N or M * N * 4 > 228 * 1024:
continue
configs.append((M, N, TMEM_BLOCK_N))
@pytest.mark.parametrize("M, N, TMEM_BLOCK_N", configs)
@pytest.mark.parametrize("dtype", [torch.float32])
@pytest.mark.parametrize("swizzle", [32, 64, 128])
@pytest.mark.skipif(not is_blackwell(), reason="Requires Blackwell")
def test_tcgen05_copy_nvmma_shared(M, N, TMEM_BLOCK_N, dtype, swizzle):
bitwidth = dtype.itemsize * 8
# There are still some shared memory layouts for which an implementation does not exist.
if M == 256 and swizzle // TMEM_BLOCK_N >= 8:
pytest.skip("no tcgen05.copy atom exists for codegen")
# NVMMASharedLayout swizzle block shape has a minimum size.
if N < swizzle // dtype.itemsize:
pytest.skip("block shape along contiguous dimension is too small for the swizzle byte width")
smem_layout = gl.NVMMASharedLayout(swizzle_byte_width=swizzle, element_bitwidth=bitwidth, rank=2)
tmem_layout = TensorMemoryLayout(block=(TMEM_BLOCK_M, TMEM_BLOCK_N), col_stride=32 // bitwidth)
tcgen05_copy_example(M, N, smem_layout, tmem_layout, dtype)
Although tcgen05_copy into TensorMemoryLayout only supports 32-bit dtypes,
this is useful for writing matmul accumulate kernels: D = A @ B + C.
Specifically, we can use TMA to load C, asynchronously copy it into tensor
memory with tcgen05_copy, and then issue tcgen05_mma to perform the matmul
while accumulating into tensor memory.
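The semantics of the kernel we are about to write can be sketched in plain NumPy (a reference model only; block_k is an arbitrary illustrative choice): the C tile seeds the accumulator, and each K-step MMA accumulates on top of it.

```python
import numpy as np

def matmul_accumulate_reference(A, B, C, block_k=64):
    # tcgen05_copy: seed the accumulator (tensor memory) with the C tile.
    acc = C.astype(np.float32).copy()
    # Inner loop of tcgen05_mma calls with use_acc=True.
    for k in range(0, A.shape[1], block_k):
        acc += A[:, k:k + block_k].astype(np.float32) @ B[k:k + block_k, :].astype(np.float32)
    return acc
```

The blocked loop produces the same result as D = A @ B + C up to floating-point reassociation, which is why the GPU kernel below is checked with a loose tolerance.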
We will use gl.store to write the output tiles to save shared memory, since
C will require a large float32 buffer. We will use warp specialization to
efficiently overlap the epilogue store with the rest of the kernel. Avoiding
TMA for the epilogue store also reduces contention for the TMA pipe.
@gluon.aggregate
class PartitionArgs:
a_desc: tma.tensor_descriptor
b_desc: tma.tensor_descriptor
c_desc: tma.tensor_descriptor
d_ptr: gl.tensor
d_stride_m: gl.tensor
d_stride_n: gl.tensor
a_bufs: gl.shared_memory_descriptor
b_bufs: gl.shared_memory_descriptor
load_empty_bars: gl.shared_memory_descriptor
load_ready_bars: gl.shared_memory_descriptor
c_buf: gl.shared_memory_descriptor
c_empty_bar: gl.shared_memory_descriptor
c_ready_bar: gl.shared_memory_descriptor
acc_bufs: tensor_memory_descriptor
acc_empty_bars: gl.shared_memory_descriptor
acc_ready_bars: gl.shared_memory_descriptor
SchedulerImpl: gl.constexpr
@gluon.jit
def matmul_accumulate_load_partition(p):
BLOCK_M: gl.constexpr = p.c_desc.block_type.shape[0]
BLOCK_N: gl.constexpr = p.c_desc.block_type.shape[1]
BLOCK_K: gl.constexpr = p.a_desc.block_type.shape[1]
K = p.a_desc.shape[1]
c_phase = 1
state = t8.Counter.create(1, p.load_empty_bars.shape[0])
scheduler = p.SchedulerImpl.initialize(p.c_desc.shape[0], p.c_desc.shape[1], BLOCK_M, BLOCK_N)
for idx in range(scheduler.get_num_tiles()):
pid_m, pid_n = scheduler.get_tile(idx)
off_m = pid_m * BLOCK_M
off_n = pid_n * BLOCK_N
# Issue the async TMA load for the C tile.
mbarrier.wait(p.c_empty_bar, c_phase)
mbarrier.expect(p.c_ready_bar, p.c_desc.block_type.nbytes)
tma.async_load(p.c_desc, [off_m, off_n], p.c_ready_bar, p.c_buf)
c_phase ^= 1
# Inner loop loads.
for k in range(0, K, BLOCK_K):
bar = p.load_ready_bars.index(state.index)
mbarrier.wait(p.load_empty_bars.index(state.index), state.phase)
mbarrier.expect(bar, p.a_desc.block_type.nbytes + p.b_desc.block_type.nbytes)
tma.async_load(p.a_desc, [off_m, k], bar, p.a_bufs.index(state.index))
tma.async_load(p.b_desc, [k, off_n], bar, p.b_bufs.index(state.index))
state = state.next()
@gluon.jit
def matmul_accmulate_mma_partition(p):
BLOCK_M: gl.constexpr = p.c_desc.block_type.shape[0]
BLOCK_N: gl.constexpr = p.c_desc.block_type.shape[1]
BLOCK_K: gl.constexpr = p.a_desc.block_type.shape[1]
K = p.a_desc.shape[1]
c_phase = 0
load_state = t8.Counter.create(0, p.load_empty_bars.shape[0])
acc_state = t8.Counter.create(1, p.acc_empty_bars.shape[0])
scheduler = p.SchedulerImpl.initialize(p.c_desc.shape[0], p.c_desc.shape[1], BLOCK_M, BLOCK_N)
for _ in range(scheduler.get_num_tiles()):
# We expect the load of C to take longer than the previous epilogue to
# release the accumulator, so acquire c_buf first.
mbarrier.wait(p.c_ready_bar, c_phase)
mbarrier.wait(p.acc_empty_bars.index(acc_state.index), acc_state.phase)
acc_buf = p.acc_bufs.index(acc_state.index)
tcgen05_copy(p.c_buf, acc_buf)
# Release c_buf when the copy is complete. We don't need to wait for the
# copy to complete because it will be implicitly pipelined with the first MMA.
tcgen05_commit(p.c_empty_bar)
c_phase ^= 1
for k in range(0, K, BLOCK_K):
# Wait for the operands to be ready.
mbarrier.wait(p.load_ready_bars.index(load_state.index), load_state.phase)
# Issue the MMA and release the load buffers when it completes.
tcgen05_mma(p.a_bufs.index(load_state.index), p.b_bufs.index(load_state.index), acc_buf, use_acc=True)
tcgen05_commit(p.load_empty_bars.index(load_state.index))
load_state = load_state.next()
# Release the accumulator when the last MMA is complete.
tcgen05_commit(p.acc_ready_bars.index(acc_state.index))
acc_state = acc_state.next()
@gluon.jit
def matmul_accumulate_epilogue_partition(p):
BLOCK_M: gl.constexpr = p.c_desc.block_type.shape[0]
BLOCK_N: gl.constexpr = p.c_desc.block_type.shape[1]
coalesced_2d_layout: gl.constexpr = gl.BlockedLayout([1, 1], [1, 32], [1, gl.num_warps()], [1, 0])
range_m = gl.arange(0, BLOCK_M, gl.SliceLayout(1, coalesced_2d_layout))
range_n = gl.arange(0, BLOCK_N, gl.SliceLayout(0, coalesced_2d_layout))
acc_state = t8.Counter.create(0, p.acc_empty_bars.shape[0])
scheduler = p.SchedulerImpl.initialize(p.c_desc.shape[0], p.c_desc.shape[1], BLOCK_M, BLOCK_N)
for idx in range(scheduler.get_num_tiles()):
pid_m, pid_n = scheduler.get_tile(idx)
off_m = pid_m * BLOCK_M
off_n = pid_n * BLOCK_N
# Wait for the accumulator.
mbarrier.wait(p.acc_ready_bars.index(acc_state.index), acc_state.phase)
acc = p.acc_bufs.index(acc_state.index).load()
mbarrier.arrive(p.acc_empty_bars.index(acc_state.index), count=1)
acc_state = acc_state.next()
offs_m = (off_m + range_m)
offs_n = (off_n + range_n)
# This `convert_layout` is fairly expensive and it uses a lot of shared
# memory, because the default TMEM register layout assigns contiguous
# columns to the same thread, but the coalesced layout assigns
# contiguous columns to different threads for efficient global writes.
# We could subtile the store to reduce the shared memory usage.
acc = gl.convert_layout(acc, coalesced_2d_layout)
gl.store(p.d_ptr + offs_m[:, None] * p.d_stride_m + offs_n[None, :] * p.d_stride_n, acc)
@gluon.jit(do_not_specialize=["d_stride_m", "d_stride_n"])
def matmul_accumulate_kernel(a_desc, b_desc, c_desc, d_ptr, d_stride_m, d_stride_n, SchedulerImpl: gl.constexpr,
num_buffers: gl.constexpr):
BLOCK_M: gl.constexpr = c_desc.block_type.shape[0]
BLOCK_N: gl.constexpr = c_desc.block_type.shape[1]
dtype: gl.constexpr = a_desc.dtype
a_bufs = gl.allocate_shared_memory(dtype, [num_buffers] + a_desc.block_type.shape, a_desc.layout)
b_bufs = gl.allocate_shared_memory(dtype, [num_buffers] + b_desc.block_type.shape, b_desc.layout)
load_empty_bars = gl.allocate_shared_memory(gl.int64, [num_buffers, 1], mbarrier.MBarrierLayout())
load_ready_bars = gl.allocate_shared_memory(gl.int64, [num_buffers, 1], mbarrier.MBarrierLayout())
for i in gl.static_range(num_buffers):
mbarrier.init(load_empty_bars.index(i), count=1)
mbarrier.init(load_ready_bars.index(i), count=1)
c_buf = gl.allocate_shared_memory(c_desc.dtype, c_desc.block_type.shape, c_desc.layout)
c_empty_bar = gl.allocate_shared_memory(gl.int64, [1], mbarrier.MBarrierLayout())
c_ready_bar = gl.allocate_shared_memory(gl.int64, [1], mbarrier.MBarrierLayout())
mbarrier.init(c_empty_bar, count=1)
mbarrier.init(c_ready_bar, count=1)
tmem_layout: gl.constexpr = TensorMemoryLayout([BLOCK_M, BLOCK_N], col_stride=1)
acc_bufs = allocate_tensor_memory(gl.float32, [2, BLOCK_M, BLOCK_N], tmem_layout)
acc_empty_bars = gl.allocate_shared_memory(gl.int64, [2, 1], mbarrier.MBarrierLayout())
acc_ready_bars = gl.allocate_shared_memory(gl.int64, [2, 1], mbarrier.MBarrierLayout())
for i in gl.static_range(2):
mbarrier.init(acc_empty_bars.index(i), count=1)
mbarrier.init(acc_ready_bars.index(i), count=1)
p = PartitionArgs(a_desc, b_desc, c_desc, d_ptr, d_stride_m, d_stride_n, a_bufs, b_bufs, load_empty_bars,
load_ready_bars, c_buf, c_empty_bar, c_ready_bar, acc_bufs, acc_empty_bars, acc_ready_bars,
SchedulerImpl)
gl.warp_specialize([
(matmul_accumulate_epilogue_partition, (p, )),
(matmul_accmulate_mma_partition, (p, )),
(matmul_accumulate_load_partition, (p, )),
], [1, 1], [24, 24])
def matmul_accumulate(A, B, C, BLOCK_M=128, BLOCK_N=128, BLOCK_K=64, GROUP_SIZE_M=8, num_buffers=3):
SchedulerImpl = t7.GroupedPersistentTileScheduler(GROUP_SIZE_M)
M, N = C.shape
dtype = getattr(gl, str(A.dtype).split('.')[1])
acc_dtype = getattr(gl, str(C.dtype).split('.')[1])
a_layout = gl.NVMMASharedLayout.get_default_for([BLOCK_M, BLOCK_K], dtype)
b_layout = gl.NVMMASharedLayout.get_default_for([BLOCK_K, BLOCK_N], dtype)
c_layout = gl.NVMMASharedLayout.get_default_for([BLOCK_M, BLOCK_N], acc_dtype)
a_desc = TensorDescriptor.from_tensor(A, [BLOCK_M, BLOCK_K], a_layout)
b_desc = TensorDescriptor.from_tensor(B, [BLOCK_K, BLOCK_N], b_layout)
c_desc = TensorDescriptor.from_tensor(C, [BLOCK_M, BLOCK_N], c_layout)
D = torch.empty((M, N), dtype=C.dtype, device="cuda")
num_sms = torch.cuda.get_device_properties("cuda").multi_processor_count
num_pid = triton.cdiv(M, BLOCK_M) * triton.cdiv(N, BLOCK_N)
grid = (min(num_sms, num_pid), )
matmul_accumulate_kernel[grid](a_desc, b_desc, c_desc, D, *D.stride(), SchedulerImpl, num_buffers)
return D
@pytest.mark.parametrize("M, N, K", [(1024, 1024, 2048), (4096, 4096, 4096)])
@pytest.mark.parametrize("BLOCK_M, BLOCK_N", [(128, 128), (128, 64)])
@pytest.mark.parametrize("BLOCK_K, num_buffers", [(64, 3)])
@pytest.mark.parametrize("dtype", [torch.float16])
@pytest.mark.skipif(not is_blackwell(), reason="Requires Blackwell")
def test_matmul_accumulate(M, N, K, BLOCK_M, BLOCK_N, BLOCK_K, num_buffers, dtype):
torch.manual_seed(0)
A = torch.randn(M, K, dtype=dtype, device="cuda")
B = torch.randn(K, N, dtype=dtype, device="cuda")
C = torch.randn(M, N, dtype=torch.float32, device="cuda")
D = matmul_accumulate(A, B, C, BLOCK_M, BLOCK_N, BLOCK_K, num_buffers=num_buffers)
torch.testing.assert_close(A @ B + C, D, atol=5e-3, rtol=1e-2)
Another important use case for tcgen05_copy is to asynchronously copy tensor
scales from shared memory to tensor memory for use by tcgen05_mma_scaled.
In the next tutorial, we will cover tcgen05_mma_scaled in more detail, but
for now just know that the tensor scales must be supplied to tcgen05_mma_scaled
via tensor memory, and the layout of the scales tensor memory must be
TensorMemoryScalesLayout. If we load the scales via TMAs into shared memory,
we can efficiently copy the scales into tensor memory with tcgen05_copy
which can be implicitly pipelined with the tcgen05_mma_scaled instruction:
tma.async_load(a_scale_desc, ..., bar, a_scale_buf)
tma.async_load(b_scale_desc, ..., bar, b_scale_buf)
mbarrier.wait(bar, phase)
tcgen05_copy(a_scale_buf, a_scale_tmem)
tcgen05_copy(b_scale_buf, b_scale_tmem)
tcgen05_mma_scaled(a_buf, b_buf, acc_tmem, a_scale_tmem, b_scale_tmem, ...)
tcgen05_commit(mma_bar)
The main takeaway from this tutorial is understanding how to use tcgen05_copy
to asynchronously copy data from shared memory to tensor memory. tcgen05_copy
doesn’t support all layouts, but should support typical NVMMASharedLayouts.
The instruction is useful in specific cases for copying data from shared to
tensor memory without round-tripping it through registers, which would increase
register pressure and be slow. It is also asynchronous and can be implicitly
pipelined with other tcgen05 instructions.