triton.experimental.gluon.language.nvidia.hopper.tma.async_load

triton.experimental.gluon.language.nvidia.hopper.tma.async_load(tensor_desc, coord, barrier, result, pred=True, multicast=False, _semantic=None)

Asynchronously load a tile of data from global memory into shared memory using the Tensor Memory Accelerator (TMA). Completion is signaled through barrier.

Parameters:

tensor_desc – Tensor descriptor (tiled)
coord – Coordinates in the source tensor
barrier – Barrier for synchronization. In a two-CTA kernel, use a two-CTA barrier when this TMA load feeds a tcgen05 op; otherwise use a barrier allocated with two_ctas=False.
result – Destination memory descriptor
pred – Predicate for conditional execution. Defaults to True.
multicast – Enable multicast. Defaults to False.
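
The parameters above can be illustrated with a plain-Python reference model. This is a sketch of the data movement only, not Gluon code: the helper model_async_load is hypothetical, it runs synchronously and has no barrier, and it assumes that coord gives the element offsets of the tile origin and that out-of-bounds elements are zero-filled.

```python
def model_async_load(src, block_shape, coord, result, pred=True):
    """Copy a block_shape tile of `src` starting at `coord` into `result`.

    Hypothetical reference model; not part of Triton or Gluon.
    """
    if not pred:
        # Predicated off: the destination buffer is left untouched.
        return result
    rows, cols = block_shape
    r0, c0 = coord
    for i in range(rows):
        for j in range(cols):
            r, c = r0 + i, c0 + j
            in_bounds = 0 <= r < len(src) and 0 <= c < len(src[0])
            # Assumption: elements outside the source tensor read as zero.
            result[i][j] = src[r][c] if in_bounds else 0
    return result


# 4x6 source "tensor"; element (r, c) holds 10 * r + c.
src = [[10 * r + c for c in range(6)] for r in range(4)]

# 2x3 destination buffer standing in for the shared-memory tile.
tile = [[-1, -1, -1], [-1, -1, -1]]
model_async_load(src, (2, 3), (1, 2), tile)
print(tile)
```

In the real API the copy is asynchronous, so result must not be read until the wait on barrier has completed.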