triton.experimental.gluon.language.nvidia.hopper.tma.async_load_im2col

triton.experimental.gluon.language.nvidia.hopper.tma.async_load_im2col(tensor_desc, coord, offsets, barrier, result, pred=True, multicast=False, _semantic=None)

Load data from global memory to shared memory using TMA in im2col mode.

Parameters:
  • tensor_desc – Tensor descriptor (im2col)

  • coord – Coordinates in the source tensor

  • offsets – Im2col offsets (must be i16 values):
      - For 3D tensors: 1 offset
      - For 4D tensors: 2 offsets
      - For 5D tensors: 3 offsets

  • barrier – Barrier for synchronization. In a two-CTA kernel, use a two-CTA barrier when this TMA load feeds a tcgen05 op; otherwise use a barrier allocated with two_ctas=False.

  • result – Destination memory descriptor

  • pred – Predicate for conditional execution; the load is issued only when it is true. Defaults to True.

  • multicast – Enable multicast for the load. Defaults to False.
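
The offset-count rule for the offsets parameter (1, 2, or 3 offsets for 3D, 4D, or 5D tensors) amounts to "tensor rank minus two". The following is a minimal host-side sketch of that rule in plain Python. The helper names expected_im2col_offset_count and check_im2col_offsets are hypothetical and are not part of the Gluon API; the sketch only checks the number of offsets and that they are integers, and does not verify the i16 typing that the real call requires.

```python
from typing import List, Sequence


def expected_im2col_offset_count(tensor_rank: int) -> int:
    """Return how many im2col offsets a tensor of the given rank needs.

    Hypothetical helper: encodes the documented rule that 3D, 4D and 5D
    tensors take 1, 2 and 3 offsets respectively (rank - 2).
    """
    if tensor_rank not in (3, 4, 5):
        raise ValueError(f"im2col offsets are documented for 3D-5D tensors, got rank {tensor_rank}")
    return tensor_rank - 2


def check_im2col_offsets(tensor_rank: int, offsets: Sequence[int]) -> List[int]:
    """Validate the offset count for a given tensor rank and return the offsets as a list.

    Hypothetical helper: it does not check that the values are i16.
    """
    expected = expected_im2col_offset_count(tensor_rank)
    if len(offsets) != expected:
        raise ValueError(f"rank-{tensor_rank} tensor needs {expected} offset(s), got {len(offsets)}")
    for off in offsets:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(off, bool) or not isinstance(off, int):
            raise TypeError(f"offsets must be integers, got {off!r}")
    return list(offsets)
```

A check like this could run on the host before a kernel launch, for example check_im2col_offsets(4, [0, 1]) for a 4D tensor, so that a mismatch between tensor rank and offset count is caught before the TMA load is issued.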