TritonNvidiaGPUOps

triton_nvidia_gpu.async_tma_copy_global_to_local (triton::nvidia_gpu::AsyncTMACopyGlobalToLocalOp)

Copy data based on descriptor from global memory to local memory asynchronously

Syntax:

operation ::= `triton_nvidia_gpu.async_tma_copy_global_to_local` $desc_ptr `[` $coord `]` $result `,` $barrier `,` $pred
              oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
              attr-dict `:` type($desc_ptr) `,` type($barrier) `->` type($result)

This operation copies data from global memory to local memory asynchronously. It is analogous to tt.load, except that the data are copied to the local memory pointed to by the memory descriptor instead of to a distributed tensor. The data copied depend on the global memory descriptor pointed to by desc_ptr.
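The descriptor-driven copy can be pictured with a small host-side model. This is only a conceptual sketch, not Triton or TMA code: `TileDescriptor` and `model_tma_copy_global_to_local` are hypothetical names, the coordinates are assumed to be element offsets of the tile's top-left corner, and a false `pred` is assumed to skip the copy entirely. The real operation is asynchronous and signals completion through `$barrier`, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TileDescriptor:
    """Hypothetical stand-in for a global memory descriptor:
    a 2-D global tensor plus the fixed shape of the tile to copy."""
    data: List[List[int]]
    tile_rows: int
    tile_cols: int


def model_tma_copy_global_to_local(desc: TileDescriptor,
                                   coord: Tuple[int, int],
                                   result: List[List[int]],
                                   pred: bool) -> int:
    """Copy the tile starting at `coord` into `result`.

    Returns the number of elements copied (0 when `pred` is false).
    Assumes the tile lies fully inside `desc.data`.
    """
    if not pred:
        return 0
    row0, col0 = coord
    for r in range(desc.tile_rows):
        for c in range(desc.tile_cols):
            result[r][c] = desc.data[row0 + r][col0 + c]
    return desc.tile_rows * desc.tile_cols
```

In this model the descriptor, not the caller, decides how much data moves: the caller only supplies coordinates and a destination buffer.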

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `cache` | `::mlir::triton::CacheModifierAttr` | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7. Enum cases: none (`NONE`), ca (`CA`), cg (`CG`), wb (`WB`), cs (`CS`), wt (`WT`), cv (`CV`) |
| `evict` | `::mlir::triton::EvictionPolicyAttr` | allowed 32-bit signless integer cases: 1, 2, 3. Enum cases: evict_normal (`NORMAL`), evict_first (`EVICT_FIRST`), evict_last (`EVICT_LAST`) |
| `isVolatile` | `::mlir::BoolAttr` | bool attribute |

Operands:

| Operand | Description |
| --- | --- |
| `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system |
| `coord` | variadic of 32-bit signless integer |
| `barrier` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |
| `result` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |
| `pred` | 1-bit signless integer |

triton_nvidia_gpu.async_tma_copy_local_to_global (triton::nvidia_gpu::AsyncTMACopyLocalToGlobalOp)

Copy data based on descriptor from local memory to global memory asynchronously

Syntax:

operation ::= `triton_nvidia_gpu.async_tma_copy_local_to_global` $desc_ptr `[` $coord `]` $src
              attr-dict `:` type($desc_ptr) `,` type($src)

This operation copies data from local memory to global memory asynchronously. It is analogous to tt.store, except that the data are copied from the local memory pointed to by the memory descriptor instead of from a distributed tensor. The data copied depend on the global memory descriptor pointed to by desc_ptr.

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface

Operands:

| Operand | Description |
| --- | --- |
| `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system |
| `coord` | variadic of 32-bit signless integer |
| `src` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |

triton_nvidia_gpu.barrier_expect (triton::nvidia_gpu::BarrierExpectOp)

Signal to a barrier the number of bytes expected to be copied.

Syntax:

operation ::= `triton_nvidia_gpu.barrier_expect` $alloc `,` $size attr-dict `,` $pred `:` type($alloc)

This signals to the barrier that size bytes are expected to be copied. The associated barrier wait blocks until the expected number of bytes has been copied.
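The byte-counting behaviour can be illustrated with a simplified model. This sketch is an assumption-laden illustration, not the hardware mbarrier: `ByteCountingBarrier` and its methods are hypothetical names, arrival counts and phases are ignored, and a false `pred` is assumed to turn the expectation into a no-op.

```python
class ByteCountingBarrier:
    """Simplified model: tracks only the bytes still expected to arrive."""

    def __init__(self) -> None:
        self.pending_bytes = 0

    def expect(self, size: int, pred: bool = True) -> None:
        """Model of barrier_expect: announce that `size` more bytes will be copied."""
        if pred:
            self.pending_bytes += size

    def complete(self, nbytes: int) -> None:
        """Model of an asynchronous copy reporting `nbytes` as delivered."""
        self.pending_bytes -= nbytes

    def is_ready(self) -> bool:
        """A wait on this barrier would stop blocking once this returns True."""
        return self.pending_bytes == 0
```

For example, after `expect(256)` the barrier stays not ready until copies totalling 256 bytes have called `complete`.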

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `size` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| --- | --- |
| `alloc` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |
| `pred` | 1-bit signless integer |

triton_nvidia_gpu.cluster_arrive (triton::nvidia_gpu::ClusterArriveOp)

Syntax:

operation ::= `triton_nvidia_gpu.cluster_arrive` attr-dict

Traits: VerifyTensorLayoutsTrait

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `relaxed` | `::mlir::IntegerAttr` | 1-bit signless integer attribute |

triton_nvidia_gpu.cluster_wait (triton::nvidia_gpu::ClusterWaitOp)

Syntax:

operation ::= `triton_nvidia_gpu.cluster_wait` attr-dict

Traits: VerifyTensorLayoutsTrait

triton_nvidia_gpu.fence_async_shared (triton::nvidia_gpu::FenceAsyncSharedOp)

Fence proxy async

Syntax:

operation ::= `triton_nvidia_gpu.fence_async_shared` attr-dict

Traits: VerifyTensorLayoutsTrait

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `bCluster` | `::mlir::BoolAttr` | bool attribute |

triton_nvidia_gpu.init_barrier (triton::nvidia_gpu::InitBarrierOp)

Initialize a barrier in the given shared memory allocation.

Syntax:

operation ::= `triton_nvidia_gpu.init_barrier` $alloc `,` $count attr-dict `:` type($alloc)

Initializes a shared memory allocation with mbarrier information. alloc is a descriptor to the shared memory allocation. count is the number of arrives expected by the barrier.

This lowers to PTX mbarrier.init.shared::cta.b64.

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `count` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| --- | --- |
| `alloc` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |

triton_nvidia_gpu.inval_barrier (triton::nvidia_gpu::InvalBarrierOp)

Invalidate a barrier allocation.

Syntax:

operation ::= `triton_nvidia_gpu.inval_barrier` $alloc attr-dict `:` type($alloc)

Invalidate a barrier allocation so that it can be reused. According to the PTX spec, this must be done before any reuse of the memory used by the mbarrier.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-inval

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface

Operands:

| Operand | Description |
| --- | --- |
| `alloc` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |

triton_nvidia_gpu.async_tma_store_wait (triton::nvidia_gpu::TMAStoreWait)

Wait until all the inputs are read.

Syntax:

operation ::= `triton_nvidia_gpu.async_tma_store_wait` attr-dict

Wait until the associated store operations have finished reading from shared memory. This is required before that shared memory can be written to again.

Traits: VerifyTensorLayoutsTrait

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `pendings` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |

triton_nvidia_gpu.wait_barrier (triton::nvidia_gpu::WaitBarrierOp)

Wait until the mbarrier phase completes.

Syntax:

operation ::= `triton_nvidia_gpu.wait_barrier` $alloc `,` $phase attr-dict `:` type($alloc)

Blocks program progress until the mbarrier object in alloc completes its current phase.

This lowers to a wait loop using the PTX instruction mbarrier.try_wait.parity.shared.b64.

The barrier behavior is described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy-completion-mechanisms
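The phase mechanism can be sketched with a single-threaded model. This is a rough illustration under stated assumptions, not the PTX semantics in full: `PhaseBarrier` is a hypothetical class, the barrier is assumed to start in phase 0, each phase is assumed to complete after `count` arrivals (the value given to init_barrier), and a wait on a given parity is assumed to succeed once the phase with that parity is no longer the one in progress. Byte counts from barrier_expect are left out.

```python
class PhaseBarrier:
    """Toy model of an mbarrier that tracks arrivals and a phase parity bit."""

    def __init__(self, count: int) -> None:
        self.count = count      # arrivals needed to complete one phase
        self.pending = count    # arrivals still missing in the current phase
        self.phase = 0          # parity of the phase currently in progress

    def arrive(self) -> None:
        """Record one arrival; the last arrival completes the phase."""
        self.pending -= 1
        if self.pending == 0:
            self.phase ^= 1             # move on to the next phase
            self.pending = self.count   # re-arm for that phase

    def try_wait(self, parity: int) -> bool:
        """True once the phase with the given parity has completed."""
        return self.phase != parity
```

A wait loop in this model would poll `try_wait(parity)` until it returns True, then flip `parity` before waiting on the same barrier again.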

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface

Operands:

| Operand | Description |
| --- | --- |
| `alloc` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |
| `phase` | 32-bit signless integer |

triton_nvidia_gpu.warp_group_dot (triton::nvidia_gpu::WarpGroupDotOp)

Warp group dot

Syntax:

operation ::= `triton_nvidia_gpu.warp_group_dot` $a`,` $b`,` $c (`,` $useC^)? attr-dict `:` type($a) `*` type($b) `->` type($d)

$d = matrix_multiply($a, $b) + $c. For docs on InputPrecisionAttr, see TT_DotOp
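As a reference for the arithmetic only, the formula can be written out on nested lists. This is a plain host-side sketch: `warp_group_dot_reference` is a hypothetical name, nothing here reflects how the operation is scheduled on hardware, and treating a false `use_c` as "ignore the accumulator" is an assumption about the optional `$useC` operand rather than something stated above.

```python
from typing import List

Matrix = List[List[float]]


def warp_group_dot_reference(a: Matrix, b: Matrix, c: Matrix,
                             use_c: bool = True) -> Matrix:
    """Compute d = a @ b + c for an M x K `a`, K x N `b`, and M x N `c`.

    When `use_c` is False the accumulator is treated as zero (assumption).
    """
    m, k, n = len(a), len(b), len(b[0])
    # Start from the accumulator, or from zeros when it is not used.
    d = [[c[i][j] if use_c else 0.0 for j in range(n)] for i in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                d[i][j] += a[i][p] * b[p][j]
    return d
```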

Traits: DotLike, VerifyTensorLayoutsTrait

Interfaces: InferTypeOpInterface, MemoryEffectOpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `inputPrecision` | `::mlir::triton::InputPrecisionAttr` | allowed 32-bit signless integer cases: 0, 1, 2. Enum cases: tf32 (`TF32`), tf32x3 (`TF32x3`), ieee (`IEEE`) |
| `maxNumImpreciseAcc` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |
| `isAsync` | `::mlir::BoolAttr` | bool attribute |

Operands:

| Operand | Description |
| --- | --- |
| `a` | TensorOrMemDesc instance |
| `b` | TensorOrMemDesc instance |
| `c` | ranked tensor of floating-point or integer values |
| `useC` | 1-bit signless integer |

Results:

| Result | Description |
| --- | --- |
| `d` | ranked tensor of floating-point or integer values |

triton_nvidia_gpu.warp_group_dot_wait (triton::nvidia_gpu::WarpGroupDotWaitOp)

Warp group dot wait

Syntax:

operation ::= `triton_nvidia_gpu.warp_group_dot_wait` $inputs attr-dict `:` type($inputs)

Waits until there are $pendings or fewer outstanding async dot operations.

$inputs must be the tensors corresponding to the async dot ops that we’re waiting on. For example, if there are N pending async dot ops and we call warp_group_dot_wait 1, then $inputs must be the result of the first dot op.
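The "at most $pendings outstanding" rule can be modelled as a queue of in-flight operations. This is a conceptual sketch only: `AsyncDotQueue` is a hypothetical class, and it assumes that async dot operations complete in the order they were issued.

```python
from collections import deque
from typing import Deque, List


class AsyncDotQueue:
    """Toy model of outstanding async dot operations, oldest first."""

    def __init__(self) -> None:
        self.in_flight: Deque[str] = deque()

    def issue(self, name: str) -> None:
        """Record a newly issued async dot operation."""
        self.in_flight.append(name)

    def wait(self, pendings: int) -> List[str]:
        """Retire the oldest operations until at most `pendings` remain.

        Returns the names of the operations that this wait completed.
        """
        completed = []
        while len(self.in_flight) > pendings:
            completed.append(self.in_flight.popleft())
        return completed
```

In this model, with three operations issued, `wait(1)` retires the two oldest and leaves the most recent one in flight, and a later `wait(0)` retires that one too.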

Traits: VerifyTensorLayoutsTrait

Interfaces: InferTypeOpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `pendings` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| --- | --- |
| `inputs` | variadic of TensorOrMemDesc instance |

Results:

| Result | Description |
| --- | --- |
| `outputs` | variadic of TensorOrMemDesc instance |