triton.experimental.gluon.language.nvidia.blackwell.tma.async_gather

triton.experimental.gluon.language.nvidia.blackwell.tma.async_gather(tensor_desc, x_offsets, y_offset, barrier, result, pred=True, multicast=False, _semantic=None)

Asynchronously gather elements from global memory to shared memory using TMA.

Parameters:

tensor_desc (tensor_descriptor) – The tensor descriptor.
x_offsets (tensor) – 1D tensor of X offsets.
y_offset (int) – Scalar Y offset.
barrier (shared_memory_descriptor) – Barrier that will be signaled when the operation is complete.
result (shared_memory_descriptor) – Destination shared memory for the gathered data, must have NVMMASharedLayout.
pred (bool) – Scalar predicate. Operation is skipped if predicate is False. Defaults to True.
multicast (bool) – Enable multicast. Defaults to False.
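
Example:

The parameter list does not say how the offsets are interpreted, so the following host-side sketch illustrates one plausible reading rather than the documented behavior. It assumes each entry of x_offsets selects a row of a 2D tensor and y_offset is the first column of the tile copied from each selected row. The helper gather_reference and its block_y argument (standing in for the tile width, which the real call would take from the tensor descriptor) are hypothetical names, not part of Gluon. The sketch is plain Python, so it models neither the asynchronous copy nor the barrier signal.

```python
def gather_reference(matrix, x_offsets, y_offset, block_y, pred=True):
    """Hypothetical host-side model of the gather; not part of Gluon.

    Assumption: x_offsets are row indices and y_offset is the starting
    column. block_y stands in for the tile width that the real operation
    would read from the tensor descriptor. Out-of-bounds handling is not
    modeled here.
    """
    if not pred:
        # Models "Operation is skipped if predicate is False".
        return None
    # One output row per entry in x_offsets, in the order given.
    return [list(matrix[row][y_offset:y_offset + block_y]) for row in x_offsets]


# A 6x8 matrix whose element at (r, c) is 10 * r + c, so values are easy to read.
matrix = [[10 * r + c for c in range(8)] for r in range(6)]

# Gather rows 4, 0 and 2 (in that order), columns 3 through 6.
tile = gather_reference(matrix, x_offsets=[4, 0, 2], y_offset=3, block_y=4)
print(tile)  # [[43, 44, 45, 46], [3, 4, 5, 6], [23, 24, 25, 26]]
```

Under this reading, the output rows follow the order of x_offsets rather than their order in memory, which is what distinguishes a gather from a plain tiled load.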