# TritonNvidiaGPUOps

### `triton_nvidia_gpu.async_tma_copy_global_to_local` (triton::nvidia_gpu::AsyncTMACopyGlobalToLocalOp)

_Copy data based on descriptor from global memory to local memory asynchronously_

Syntax:

```
operation ::= `triton_nvidia_gpu.async_tma_copy_global_to_local` $desc_ptr `[` $coord `]` $result `,` $barrier `,` $pred
              oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
              attr-dict `:` type($desc_ptr) `,` type($barrier) `->` type($result)
```

This operation copies data from global memory to local memory asynchronously. It is analogous to `tt.load`, except that the data is copied into the local memory pointed to by the memory descriptor instead of into a distributed tensor. The data copied depends on the global memory descriptor pointed to by `desc_ptr`.

Traits: `VerifyTensorLayoutsTrait`

Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)`

Effects: `MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory, MemoryEffects::Write on ::mlir::triton::gpu::SharedMemory}`

#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `cache` | ::mlir::triton::CacheModifierAttr | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6. Enum cases: none (`NONE`), ca (`CA`), cg (`CG`), wb (`WB`), cs (`CS`), wt (`WT`) |
| `evict` | ::mlir::triton::EvictionPolicyAttr | allowed 32-bit signless integer cases: 1, 2, 3. Enum cases: evict_normal (`NORMAL`), evict_first (`EVICT_FIRST`), evict_last (`EVICT_LAST`) |
| `isVolatile` | ::mlir::BoolAttr | bool attribute |
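Reading the grammar above, a hedged sketch of the printed form might look as follows. The SSA names, tile shape, `#shared`/`#shared1` encodings, and the exact `!tt.ptr`/`!tt.memdesc` type spellings are assumptions for illustration, not copied from real kernel IR:

```
// Hypothetical IR sketch; types and encodings are illustrative only.
%true = arith.constant true
triton_nvidia_gpu.async_tma_copy_global_to_local %desc[%x, %y] %buf, %bar, %true
    : !tt.ptr<i8>, !tt.memdesc<1xi64, #shared1, mutable> -> !tt.memdesc<128x64xf16, #shared, mutable>
```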
#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system |
| `coord` | variadic of 32-bit signless integer |
| `barrier` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |
| `result` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |
| `pred` | 1-bit signless integer |

### `triton_nvidia_gpu.async_tma_copy_local_to_global` (triton::nvidia_gpu::AsyncTMACopyLocalToGlobalOp)

_Copy data based on descriptor from local memory to global memory asynchronously_

Syntax:

```
operation ::= `triton_nvidia_gpu.async_tma_copy_local_to_global` $desc_ptr `[` $coord `]` $src
              attr-dict `:` type($desc_ptr) `,` type($src)
```

This operation copies data from local memory to global memory asynchronously. It is analogous to `tt.store`, except that the data is copied from the local memory pointed to by the memory descriptor instead of from a distributed tensor. The data copied depends on the global memory descriptor pointed to by `desc_ptr`. (A hedged IR sketch of this op appears after the `triton_nvidia_gpu.cluster_arrive` entry below.)

Traits: `VerifyTensorLayoutsTrait`

Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)`

Effects: `MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::gpu::SharedMemory, MemoryEffects::Write on ::mlir::triton::GlobalMemory}`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system |
| `coord` | variadic of 32-bit signless integer |
| `src` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |

### `triton_nvidia_gpu.cluster_arrive` (triton::nvidia_gpu::ClusterArriveOp)

Syntax:

```
operation ::= `triton_nvidia_gpu.cluster_arrive` attr-dict
```

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `relaxed` | ::mlir::IntegerAttr | 1-bit signless integer attribute |
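For `triton_nvidia_gpu.async_tma_copy_local_to_global` documented above, a similar hedged sketch of the store direction; again the names, shape, `#shared` encoding, and type spellings are assumptions:

```
// Hypothetical IR sketch for async_tma_copy_local_to_global; illustrative only.
triton_nvidia_gpu.async_tma_copy_local_to_global %desc[%x, %y] %buf
    : !tt.ptr<i8>, !tt.memdesc<128x64xf16, #shared, mutable>
```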
### `triton_nvidia_gpu.cluster_wait` (triton::nvidia_gpu::ClusterWaitOp)

Syntax:

```
operation ::= `triton_nvidia_gpu.cluster_wait` attr-dict
```

Traits: `VerifyTensorLayoutsTrait`

### `triton_nvidia_gpu.dot_async` (triton::nvidia_gpu::DotAsyncOp)

_Dot async_

Syntax:

```
operation ::= `triton_nvidia_gpu.dot_async` $a`,` $b`,` $c attr-dict `:` type($a) `*` type($b) `->` type($d)
```

$d = matrix_multiply($a, $b) + $c. For docs on InputPrecisionAttr, see TT_DotOp.

Traits: `AlwaysSpeculatableImplTrait`, `VerifyTensorLayoutsTrait`

Interfaces: `ConditionallySpeculatable`, `InferTypeOpInterface`, `NoMemoryEffect (MemoryEffectOpInterface)`

Effects: `MemoryEffects::Effect{}`

#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `inputPrecision` | ::mlir::triton::InputPrecisionAttr | allowed 32-bit signless integer cases: 0, 1, 2. Enum cases: tf32 (`TF32`), tf32x3 (`TF32x3`), ieee (`IEEE`) |
| `maxNumImpreciseAcc` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
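A hedged sketch of the printed form, showing an asynchronous MMA with shared-memory operands and a register accumulator. The operand names, tile shapes, and the `#shared`/`#shared1`/`#mma` encodings are assumptions, and default attributes are assumed to be elided:

```
// Hypothetical IR sketch; encodings, shapes, and type spellings are illustrative.
%d = triton_nvidia_gpu.dot_async %a, %b, %acc
    : !tt.memdesc<128x64xf16, #shared> * !tt.memdesc<64x256xf16, #shared1> -> tensor<128x256xf32, #mma>
```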
#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `a` | TensorOrMemDesc instance |
| `b` | TensorOrMemDesc instance |
| `c` | ranked tensor of floating-point or integer values |

#### Results:

| Result | Description |
| :----: | ----------- |
| `d` | ranked tensor of floating-point or integer values |

### `triton_nvidia_gpu.dot_wait` (triton::nvidia_gpu::DotWaitOp)

_Dot wait_

Syntax:

```
operation ::= `triton_nvidia_gpu.dot_wait` $inputs attr-dict `:` type($inputs)
```

Waits until there are $pendings or fewer outstanding async dot operations.

$inputs must be the tensors corresponding to the async dot ops that we're waiting on. For example, if there are N pending async dot ops and we call `dot_wait 1`, then $inputs must be the results of the first N-1 async dot ops.

Traits: `VerifyTensorLayoutsTrait`

Interfaces: `InferTypeOpInterface`

#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `pendings` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
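Continuing the `dot_async` sketch above, waiting until no async dots remain pending could be written as follows; the SSA names and the `#mma` encoding are still assumptions:

```
// Hypothetical IR sketch; waits until 0 async dot ops are outstanding and
// returns the (now safe to read) accumulator value.
%d_ready = triton_nvidia_gpu.dot_wait %d {pendings = 0 : i32} : tensor<128x256xf32, #mma>
```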
#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `inputs` | variadic of TensorOrMemDesc instance |

#### Results:

| Result | Description |
| :----: | ----------- |
| `outputs` | variadic of TensorOrMemDesc instance |

### `triton_nvidia_gpu.fence_async_shared` (triton::nvidia_gpu::FenceAsyncSharedOp)

_Fence proxy async_

Syntax:

```
operation ::= `triton_nvidia_gpu.fence_async_shared` attr-dict
```

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `bCluster` | ::mlir::BoolAttr | bool attribute |
### `triton_nvidia_gpu.init_barrier` (triton::nvidia_gpu::InitBarrierOp)

_Initialize a barrier in the given shared memory allocation._

Syntax:

```
operation ::= `triton_nvidia_gpu.init_barrier` $alloc `,` $count attr-dict `:` type($alloc)
```

Initializes a shared memory allocation with mbarrier information. `alloc` is a descriptor to the shared memory allocation. `count` is the number of arrives expected by the barrier.

This lowers to PTX `mbarrier.init.shared::cta.b64`.

Traits: `VerifyTensorLayoutsTrait`

Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)`

Effects: `MemoryEffects::Effect{MemoryEffects::Write on ::mlir::triton::gpu::SharedMemory}`

#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `count` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
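A hedged sketch of the printed form, assuming `%bar` is a mutable 1xi64 shared-memory allocation, one expected arrival, and an illustrative `#shared1` encoding:

```
// Hypothetical IR sketch; the memdesc type spelling and encoding are illustrative.
triton_nvidia_gpu.init_barrier %bar, 1 : !tt.memdesc<1xi64, #shared1, mutable>
```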
#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `alloc` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |

### `triton_nvidia_gpu.wait_barrier` (triton::nvidia_gpu::WaitBarrierOp)

_Wait until the mbarrier phase completes._

Syntax:

```
operation ::= `triton_nvidia_gpu.wait_barrier` $alloc `,` $phase attr-dict `:` type($alloc)
```

Blocks program progress until the mbarrier object in `alloc` completes its current phase. This lowers to a wait loop using the PTX instruction `mbarrier.try_wait.parity.shared.b64`. The barrier behavior is described here:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy-completion-mechanisms

Traits: `VerifyTensorLayoutsTrait`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `alloc` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system |
| `phase` | 32-bit signless integer |
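Tying the two barrier ops together, a hedged sketch; the `#shared1` encoding and the constant phase are illustrative (in real pipelines the phase value alternates each time the barrier completes a phase):

```
// Hypothetical IR sketch; %bar is the allocation initialized by init_barrier above.
%phase = arith.constant 0 : i32
triton_nvidia_gpu.wait_barrier %bar, %phase : !tt.memdesc<1xi64, #shared1, mutable>
```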