triton.experimental.gluon.language.nvidia.blackwell.tma.async_gather

triton.experimental.gluon.language.nvidia.blackwell.tma.async_gather(tensor_desc, x_offsets, y_offset, barrier, result, pred=True, multicast=False, _semantic=None)

Asynchronously gather elements from global memory to shared memory using TMA.

Parameters:
  • tensor_desc (tensor_descriptor) – The tensor descriptor.

  • x_offsets (tensor) – 1D tensor of X offsets.

  • y_offset (int) – Scalar Y offset.

  • barrier (shared_memory_descriptor) – Barrier that will be signaled when the operation is complete.

  • result (shared_memory_descriptor) – Destination shared memory for the gathered data. Must have an NVMMASharedLayout.

  • pred (bool) – Scalar predicate. Operation is skipped if predicate is False. Defaults to True.

  • multicast (bool) – Enable multicast. Defaults to False.
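
The parameter list above does not spell out how x_offsets and y_offset index the source tensor. The following is a minimal host-side sketch in plain Python, not Gluon code. It assumes that each entry of x_offsets selects one row of a 2D source and that y_offset is the first column copied from that row. The helper name gather_reference and its block_y argument are hypothetical and exist only for illustration. The sketch models the pred behaviour but not the asynchronous copy or the barrier signal.

```python
def gather_reference(src, x_offsets, y_offset, block_y, result, pred=True):
    """Host-side model of the gather semantics. This is NOT the Gluon API.

    Assumption: each entry of x_offsets picks a row of the 2D source, and
    y_offset is the first column copied from that row.
    """
    if not pred:
        # Predicate is False: the operation is skipped, result is untouched.
        return result
    for i, row in enumerate(x_offsets):
        # Copy block_y consecutive elements from the selected row.
        result[i] = list(src[row][y_offset:y_offset + block_y])
    return result


# A 6x8 source whose element at (r, c) is 10*r + c, so values are easy to read.
src = [[10 * r + c for c in range(8)] for r in range(6)]

# Gather rows 4, 0 and 2, taking four columns starting at column 2.
out = [[0] * 4 for _ in range(3)]
gather_reference(src, [4, 0, 2], 2, 4, out)
print(out)  # [[42, 43, 44, 45], [2, 3, 4, 5], [22, 23, 24, 25]]

# With pred=False the destination keeps its previous contents.
skipped = [[-1] * 4 for _ in range(3)]
gather_reference(src, [4, 0, 2], 2, 4, skipped, pred=False)
print(skipped)  # [[-1, -1, -1, -1], [-1, -1, -1, -1], [-1, -1, -1, -1]]
```

In a real kernel the copy is issued asynchronously, so the contents of result should only be read after waiting on barrier.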