triton.experimental.gluon.language.nvidia.blackwell.tma

Functions

`async_gather`	Asynchronously gather elements from global memory to shared memory using TMA.
`async_scatter`	Asynchronously scatter elements from shared memory to global memory using TMA.
`async_atomic_add`	Atomically add data from shared memory into global memory using TMA.
`async_atomic_and`	Atomically bitwise-and data from shared memory into global memory using TMA.
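`async_atomic_max`	Atomically compute the element-wise maximum of shared memory data and global memory, writing the result to global memory, using TMA.
`async_atomic_min`	Atomically compute the element-wise minimum of shared memory data and global memory, writing the result to global memory, using TMA.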
`async_atomic_or`	Atomically bitwise-or data from shared memory into global memory using TMA.
`async_atomic_xor`	Atomically bitwise-xor data from shared memory into global memory using TMA.
`async_copy_global_to_shared`	Load data from global memory to shared memory using TMA.
`async_copy_shared_to_global`	Store data from shared memory to global memory using TMA.
`async_load`	Load data from global memory to shared memory using TMA.
`async_load_im2col`	Load data from global memory to shared memory using TMA in im2col mode.
`async_store`	Store data from shared memory to global memory using TMA.
`store_wait`	Wait for pending TMA stores.
`make_tensor_descriptor`
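
The `async_atomic_*` entries above all describe the same pattern: a tile held in shared memory is combined element-wise into a region of global memory, and the result is left in global memory. The sketch below is a plain-Python model of that semantics only. It is not Gluon code and does not call any Triton API; `REDUCE_OPS` and `reduce_tile_into` are hypothetical names introduced for illustration, the buffers are ordinary 1-D lists, and bounds clipping, asynchrony, and atomicity are deliberately left out.

```python
import operator

# Hypothetical reference model (not part of Triton): maps the suffix of each
# async_atomic_* name to the binary operation its summary describes.
REDUCE_OPS = {
    "add": operator.add,
    "and": operator.and_,
    "max": max,
    "min": min,
    "or": operator.or_,
    "xor": operator.xor,
}


def reduce_tile_into(global_mem, shared_tile, offset, op):
    """Combine shared_tile into global_mem[offset:offset + len(shared_tile)] in place.

    Assumes the tile lies fully inside global_mem; a real TMA transfer
    handles out-of-bounds regions differently, which is not modeled here.
    """
    combine = REDUCE_OPS[op]
    for i, value in enumerate(shared_tile):
        global_mem[offset + i] = combine(global_mem[offset + i], value)
    return global_mem


# Usage: fold a 3-element tile into global memory starting at index 2.
global_mem = [1, 2, 3, 4, 5, 6]
shared_tile = [10, 0, 7]
reduce_tile_into(global_mem, shared_tile, offset=2, op="max")
print(global_mem)  # [1, 2, 10, 4, 7, 6]
```

Only the elements covered by the tile change, and the shared-memory side is read but never written. That is the property distinguishing these reductions from `async_store`, which overwrites the destination instead of combining with it.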

Classes

`tensor_descriptor`
`tensor_descriptor_type`	Type for tiled tensor descriptors.