TritonNvidiaGPUOps

triton_nvidia_gpu.async_tma_copy_global_to_local (triton::nvidia_gpu::AsyncTMACopyGlobalToLocalOp)

Copy data based on descriptor from global memory to local memory asynchronously

Syntax:

operation ::= `triton_nvidia_gpu.async_tma_copy_global_to_local` $desc_ptr `[` $coord `]` $result `,` $barrier `,` $pred
              oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
              attr-dict `:` type($desc_ptr) `,` type($barrier) `->` type($result)

This operation copies data from global memory to local memory asynchronously. It is analogous to tt.load, except that the data are copied to the local memory pointed to by the memory descriptor instead of to a distributed tensor. The data copied depend on the global memory descriptor pointed to by desc_ptr.
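The data movement can be pictured with a small Python model. This is only a sketch under stated assumptions: `TensorDescriptor`, its `tile_rows`/`tile_cols` fields, and `copy_global_to_local` are hypothetical names, the coordinates are assumed to be element offsets of the tile origin, a false predicate is assumed to skip the copy, and the model runs synchronously and leaves out the barrier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TensorDescriptor:
    """Hypothetical stand-in for the global memory descriptor behind desc_ptr."""
    data: List[List[int]]  # 2D tensor living in "global memory"
    tile_rows: int         # tile height described by the descriptor (assumed)
    tile_cols: int         # tile width described by the descriptor (assumed)


def copy_global_to_local(desc, coord, result, pred=True):
    """Copy the tile whose origin is `coord` from desc.data into `result`."""
    if not pred:
        return  # assumed: a false predicate means no copy is issued
    row0, col0 = coord
    for r in range(desc.tile_rows):
        for c in range(desc.tile_cols):
            result[r][c] = desc.data[row0 + r][col0 + c]


# A 4x4 "global" tensor whose entries encode their own position.
global_mem = [[10 * r + c for c in range(4)] for r in range(4)]
desc = TensorDescriptor(global_mem, tile_rows=2, tile_cols=2)

local_buf = [[0, 0], [0, 0]]
copy_global_to_local(desc, (1, 2), local_buf)
print(local_buf)

skipped_buf = [[0, 0], [0, 0]]
copy_global_to_local(desc, (1, 2), skipped_buf, pred=False)
print(skipped_buf)
```

The descriptor, not the destination buffer, decides which region of global memory is read, which is the point the description above makes about desc_ptr.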

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory, MemoryEffects::Write on ::mlir::triton::gpu::SharedMemory}

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `cache` | ::mlir::triton::CacheModifierAttr | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6. Enum cases: none (`NONE`), ca (`CA`), cg (`CG`), wb (`WB`), cs (`CS`), wt (`WT`) |
| `evict` | ::mlir::triton::EvictionPolicyAttr | allowed 32-bit signless integer cases: 1, 2, 3. Enum cases: evict_normal (`NORMAL`), evict_first (`EVICT_FIRST`), evict_last (`EVICT_LAST`) |
| `isVolatile` | ::mlir::BoolAttr | bool attribute |

Operands:

| Operand | Description |
| --- | --- |
| `desc_ptr` | Pointer type (::mlir::triton::PointerType) in Triton IR type system |
| `coord` | variadic of 32-bit signless integer |
| `barrier` | memory descriptor type (::mlir::triton::MemDescType) in Triton IR type system |
| `result` | memory descriptor type (::mlir::triton::MemDescType) in Triton IR type system |
| `pred` | 1-bit signless integer |

triton_nvidia_gpu.async_tma_copy_local_to_global (triton::nvidia_gpu::AsyncTMACopyLocalToGlobalOp)

Copy data based on descriptor from local memory to global memory asynchronously

Syntax:

operation ::= `triton_nvidia_gpu.async_tma_copy_local_to_global` $desc_ptr `[` $coord `]` $src
              attr-dict `:` type($desc_ptr) `,` type($src)

This operation copies data from local memory to global memory asynchronously. It is analogous to tt.store, except that the data are copied from the local memory pointed to by the memory descriptor instead of from a distributed tensor. The data copied depend on the global memory descriptor pointed to by desc_ptr.
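The store direction can be sketched the same way in Python. The function name `copy_local_to_global`, the explicit `tile_shape` argument, and the interpretation of the coordinates as element offsets of the tile origin are assumptions made for illustration; the model is synchronous.

```python
def copy_local_to_global(global_mem, tile_shape, coord, src):
    """Write the local tile `src` into `global_mem` with its origin at `coord`.

    `tile_shape` stands in for the shape information that the real
    descriptor would carry (assumed for this sketch).
    """
    rows, cols = tile_shape
    row0, col0 = coord
    for r in range(rows):
        for c in range(cols):
            global_mem[row0 + r][col0 + c] = src[r][c]


# A zero-initialised 3x4 "global" tensor and a 2x2 tile held in "local" memory.
global_mem = [[0] * 4 for _ in range(3)]
src_tile = [[1, 2], [3, 4]]

copy_local_to_global(global_mem, (2, 2), (1, 1), src_tile)
print(global_mem)
```

Only the region selected by the coordinates is overwritten; the rest of the global tensor is left untouched.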

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::gpu::SharedMemory, MemoryEffects::Write on ::mlir::triton::GlobalMemory}

Operands:

| Operand | Description |
| --- | --- |
| `desc_ptr` | Pointer type (::mlir::triton::PointerType) in Triton IR type system |
| `coord` | variadic of 32-bit signless integer |
| `src` | memory descriptor type (::mlir::triton::MemDescType) in Triton IR type system |

triton_nvidia_gpu.cluster_arrive (triton::nvidia_gpu::ClusterArriveOp)

Syntax:

operation ::= `triton_nvidia_gpu.cluster_arrive` attr-dict

Traits: VerifyTensorLayoutsTrait

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `relaxed` | ::mlir::IntegerAttr | 1-bit signless integer attribute |

triton_nvidia_gpu.cluster_wait (triton::nvidia_gpu::ClusterWaitOp)

Syntax:

operation ::= `triton_nvidia_gpu.cluster_wait` attr-dict

Traits: VerifyTensorLayoutsTrait

triton_nvidia_gpu.dot_async (triton::nvidia_gpu::DotAsyncOp)

Dot async

Syntax:

operation ::= `triton_nvidia_gpu.dot_async` $a`,` $b`,` $c attr-dict `:` type($a) `*` type($b) `->` type($d)

Computes $d = matrix_multiply($a, $b) + $c. For documentation on InputPrecisionAttr, see TT_DotOp.
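As a reference for the arithmetic only, the formula can be written out in plain Python. The helper name `dot_reference` is hypothetical, and the sketch ignores asynchrony, layouts, and input precision.

```python
def dot_reference(a, b, c):
    """Return d with d[i][j] = sum_k a[i][k] * b[k][j] + c[i][j]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j]
         for j in range(cols)]
        for i in range(rows)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]  # accumulator that is added to the product

d = dot_reference(a, b, c)
print(d)  # [[20, 23], [44, 51]]
```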

Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `inputPrecision` | ::mlir::triton::InputPrecisionAttr | allowed 32-bit signless integer cases: 0, 1, 2. Enum cases: tf32 (`TF32`), tf32x3 (`TF32x3`), ieee (`IEEE`) |
| `maxNumImpreciseAcc` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| --- | --- |
| `a` | TensorOrMemDesc instance |
| `b` | TensorOrMemDesc instance |
| `c` | ranked tensor of floating-point or integer values |

Results:

| Result | Description |
| --- | --- |
| `d` | ranked tensor of floating-point or integer values |

triton_nvidia_gpu.dot_wait (triton::nvidia_gpu::DotWaitOp)

Dot wait

Syntax:

operation ::= `triton_nvidia_gpu.dot_wait` $inputs attr-dict `:` type($inputs)

Waits until there are $pendings or fewer outstanding async dot operations.

$inputs must be the tensors corresponding to the async dot ops that we’re waiting on. For example, if there are N pending async dot ops and we call dot_wait 1, then $inputs must be the result of the first dot op.
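The "$pendings or fewer outstanding" rule can be illustrated with a toy queue in Python. `AsyncDotQueue` is a hypothetical helper, and the sketch assumes that outstanding operations retire in issue order.

```python
from collections import deque


class AsyncDotQueue:
    """Toy model of outstanding async dot operations (hypothetical helper)."""

    def __init__(self):
        self.pending = deque()  # issued but not yet completed, oldest first
        self.completed = []     # operations known to have finished

    def issue(self, name):
        self.pending.append(name)

    def wait(self, pendings):
        # Assumed: operations retire in the order they were issued.
        while len(self.pending) > pendings:
            self.completed.append(self.pending.popleft())


queue = AsyncDotQueue()
for name in ("dot0", "dot1", "dot2"):
    queue.issue(name)

queue.wait(pendings=1)
print(queue.completed, list(queue.pending))  # ['dot0', 'dot1'] ['dot2']
```

In this model, waiting with `pendings=0` drains every outstanding operation, while a larger value lets the most recently issued ones stay in flight.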

Traits: VerifyTensorLayoutsTrait

Interfaces: InferTypeOpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `pendings` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| --- | --- |
| `inputs` | variadic of TensorOrMemDesc instance |

Results:

| Result | Description |
| --- | --- |
| `outputs` | variadic of TensorOrMemDesc instance |

triton_nvidia_gpu.fence_async_shared (triton::nvidia_gpu::FenceAsyncSharedOp)

Fence proxy async

Syntax:

operation ::= `triton_nvidia_gpu.fence_async_shared` attr-dict

Traits: VerifyTensorLayoutsTrait

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `bCluster` | ::mlir::BoolAttr | bool attribute |

triton_nvidia_gpu.init_barrier (triton::nvidia_gpu::InitBarrierOp)

Initialize a barrier in the given shared memory allocation.

Syntax:

operation ::= `triton_nvidia_gpu.init_barrier` $alloc `,` $count attr-dict `:` type($alloc)

Initializes a shared memory allocation with mbarrier information. alloc is a descriptor to the shared memory allocation. count is the number of arrivals the barrier expects before its current phase completes.

This lowers to PTX mbarrier.init.shared::cta.b64.
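The role of count can be sketched with a toy barrier in Python. `ToyBarrier` is a hypothetical, heavily simplified model loosely based on the PTX mbarrier: it tracks only the arrival count and leaves out transaction counts and memory layout.

```python
class ToyBarrier:
    """Toy arrival-counting barrier (hypothetical, simplified model)."""

    def __init__(self, count):
        self.count = count         # arrivals expected per phase
        self.remaining = count     # arrivals still missing in the current phase
        self.completed_phases = 0  # number of phases that have finished

    def arrive(self):
        self.remaining -= 1
        if self.remaining == 0:
            # The expected number of arrivals was reached: the phase completes
            # and the barrier re-arms itself for the next phase.
            self.completed_phases += 1
            self.remaining = self.count


barrier = ToyBarrier(count=2)
barrier.arrive()
after_one = barrier.completed_phases  # one arrival is not enough
barrier.arrive()
after_two = barrier.completed_phases  # the second arrival completes the phase
print(after_one, after_two)  # 0 1
```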

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::triton::gpu::SharedMemory}

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `count` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| --- | --- |
| `alloc` | memory descriptor type (::mlir::triton::MemDescType) in Triton IR type system |

triton_nvidia_gpu.wait_barrier (triton::nvidia_gpu::WaitBarrierOp)

Wait until the mbarrier phase completes.

Syntax:

operation ::= `triton_nvidia_gpu.wait_barrier` $alloc `,` $phase attr-dict `:` type($alloc)

Blocks program progress until the mbarrier object in alloc completes its current phase.

This lowers to a wait loop that uses the PTX instruction mbarrier.try_wait.parity.shared.b64.

The barrier behavior is described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy-completion-mechanisms
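A wait loop over a phase parity can be sketched in Python as follows. `ToyPhaseBarrier`, `try_wait_parity`, and `wait_barrier` are hypothetical names, and the sketch assumes that the phase operand carries the parity (0 or 1) of the phase being waited on and that a phase counts as complete once the barrier has moved on to the next one.

```python
class ToyPhaseBarrier:
    """Toy barrier that tracks which phase is in progress (hypothetical model)."""

    def __init__(self, count):
        self.count = count      # arrivals expected per phase
        self.remaining = count  # arrivals still missing in the current phase
        self.phase = 0          # index of the phase currently in progress

    def arrive(self):
        self.remaining -= 1
        if self.remaining == 0:
            self.phase += 1
            self.remaining = self.count

    def try_wait_parity(self, parity):
        # Assumed: the phase with this parity is complete once the phase in
        # progress has the opposite parity.
        return (self.phase & 1) != parity


def wait_barrier(barrier, parity, make_progress):
    """Spin until the phase with the given parity completes; return the spin count."""
    spins = 0
    while not barrier.try_wait_parity(parity):
        make_progress()  # stands in for other threads arriving at the barrier
        spins += 1
    return spins


barrier = ToyPhaseBarrier(count=1)
spins = wait_barrier(barrier, parity=0, make_progress=barrier.arrive)
print(spins, barrier.phase)  # 1 1
```

Under these assumptions a caller alternates the parity between 0 and 1 on successive waits, because each completed phase flips the parity of the phase in progress.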

Traits: VerifyTensorLayoutsTrait

Operands:

| Operand | Description |
| --- | --- |
| `alloc` | memory descriptor type (::mlir::triton::MemDescType) in Triton IR type system |
| `phase` | 32-bit signless integer |