# TritonNvidiaGPUOps
### `triton_nvidia_gpu.async_tma_copy_global_to_local` (triton::nvidia_gpu::AsyncTMACopyGlobalToLocalOp)
_Copy data based on descriptor from global memory to local memory asynchronously_
Syntax:
```
operation ::= `triton_nvidia_gpu.async_tma_copy_global_to_local` $desc_ptr `[` $coord `]` $result `,` $barrier `,` $pred
oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
attr-dict `:` type($desc_ptr) `,` type($barrier) `->` type($result)
```
This operation copies data from global memory to local memory
asynchronously. It is analogous to `tt.load`, except that the data are copied
to the local memory pointed to by the memory descriptor instead of into a
distributed tensor. The data copied depend on the global memory descriptor
pointed to by `desc_ptr`.
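
As a rough illustration of the data movement described above, here is a toy Python model. It is not Triton or TMA code: the `Descriptor` class, the function `tma_copy_global_to_local`, the 2-D tile shape, and the reading of `coord` as the element offset of the tile's top-left corner are all assumptions made for this sketch. It also assumes that a false `pred` suppresses the copy. The model runs synchronously and leaves `barrier` out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptor:
    """Hypothetical stand-in for the global memory descriptor behind desc_ptr."""
    data: List[List[int]]  # the 2-D global tensor
    tile_rows: int         # tile height described by the descriptor
    tile_cols: int         # tile width described by the descriptor


def tma_copy_global_to_local(desc, coord, result, pred):
    """Copy one tile whose top-left corner is `coord` into `result`.

    Returns True if the copy was performed, False if `pred` suppressed it.
    """
    if not pred:
        return False  # assumed behaviour: local memory is left untouched
    row0, col0 = coord
    for r in range(desc.tile_rows):
        for c in range(desc.tile_cols):
            result[r][c] = desc.data[row0 + r][col0 + c]
    return True


# A 4x4 global tensor holding 0..15, described as 2x2 tiles.
global_tensor = [[4 * r + c for c in range(4)] for r in range(4)]
desc = Descriptor(global_tensor, tile_rows=2, tile_cols=2)

# Predicated-on copy of the tile at row 2, column 1.
local = [[0, 0], [0, 0]]
issued = tma_copy_global_to_local(desc, (2, 1), local, pred=True)

# Predicated-off copy: the destination keeps its previous contents.
untouched = [[-1, -1], [-1, -1]]
skipped = tma_copy_global_to_local(desc, (0, 0), untouched, pred=False)
```

In this model the descriptor, not the caller, decides how much data moves, which mirrors the statement that the data copied depend on the descriptor.
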
Traits: `VerifyTensorLayoutsTrait`
Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)`
Effects: `MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory, MemoryEffects::Write on ::mlir::triton::gpu::SharedMemory}`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `cache` | `::mlir::triton::CacheModifierAttr` | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6. Enum cases: none (`NONE`), ca (`CA`), cg (`CG`), wb (`WB`), cs (`CS`), wt (`WT`) |
| `evict` | `::mlir::triton::EvictionPolicyAttr` | allowed 32-bit signless integer cases: 1, 2, 3. Enum cases: evict_normal (`NORMAL`), evict_first (`EVICT_FIRST`), evict_last (`EVICT_LAST`) |
| `isVolatile` | `::mlir::BoolAttr` | bool attribute |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system
| `coord` | variadic of 32-bit signless integer
| `barrier` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system
| `result` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system
| `pred` | 1-bit signless integer
### `triton_nvidia_gpu.async_tma_copy_local_to_global` (triton::nvidia_gpu::AsyncTMACopyLocalToGlobalOp)
_Copy data based on descriptor from local memory to global memory asynchronously_
Syntax:
```
operation ::= `triton_nvidia_gpu.async_tma_copy_local_to_global` $desc_ptr `[` $coord `]` $src
attr-dict `:` type($desc_ptr) `,` type($src)
```
This operation copies data from local memory to global memory
asynchronously. It is analogous to `tt.store`, except that the data are copied
from the local memory pointed to by the memory descriptor instead of from a
distributed tensor. The data copied depend on the global memory descriptor
pointed to by `desc_ptr`.
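
The store direction can be sketched with a similar toy Python model. Again, this is not Triton or TMA code: the function name `tma_copy_local_to_global`, the 2-D layout, and the reading of `coord` as the element offset of the tile's top-left corner are assumptions for illustration, and the model is synchronous.

```python
def tma_copy_local_to_global(global_tensor, coord, src):
    """Write the tile `src` into `global_tensor` with its top-left corner at `coord`."""
    row0, col0 = coord
    for r, row in enumerate(src):
        for c, value in enumerate(row):
            global_tensor[row0 + r][col0 + c] = value


# A 3x4 global tensor of zeros and a 2x2 tile held in local memory.
global_tensor = [[0] * 4 for _ in range(3)]
tile = [[1, 2], [3, 4]]

# Store the tile at row 1, column 2.
tma_copy_local_to_global(global_tensor, (1, 2), tile)
```
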
Traits: `VerifyTensorLayoutsTrait`
Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)`
Effects: `MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::gpu::SharedMemory, MemoryEffects::Write on ::mlir::triton::GlobalMemory}`
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system
| `coord` | variadic of 32-bit signless integer
| `src` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system
### `triton_nvidia_gpu.cluster_arrive` (triton::nvidia_gpu::ClusterArriveOp)
Syntax:
```
operation ::= `triton_nvidia_gpu.cluster_arrive` attr-dict
```
Traits: `VerifyTensorLayoutsTrait`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `relaxed` | `::mlir::IntegerAttr` | 1-bit signless integer attribute |
### `triton_nvidia_gpu.cluster_wait` (triton::nvidia_gpu::ClusterWaitOp)
Syntax:
```
operation ::= `triton_nvidia_gpu.cluster_wait` attr-dict
```
Traits: `VerifyTensorLayoutsTrait`
### `triton_nvidia_gpu.dot_async` (triton::nvidia_gpu::DotAsyncOp)
_Dot async_
Syntax:
```
operation ::= `triton_nvidia_gpu.dot_async` $a`,` $b`,` $c attr-dict `:` type($a) `*` type($b) `->` type($d)
```
`$d = matrix_multiply($a, $b) + $c`. For documentation on `InputPrecisionAttr`, see `TT_DotOp`.
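
The formula can be checked on a small case with plain Python lists. This is only a reference for the arithmetic; the helper name `dot` is made up for this sketch, and it says nothing about asynchrony, layouts, or input precision.

```python
def dot(a, b, c):
    """Return d where d[i][j] = sum over k of a[i][k] * b[k][j], plus c[i][j]."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j] for j in range(cols)]
        for i in range(rows)
    ]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]  # accumulator added to the product

d = dot(a, b, c)  # a @ b is [[19, 22], [43, 50]], so d is [[20, 22], [43, 51]]
```
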
Traits: `AlwaysSpeculatableImplTrait`, `VerifyTensorLayoutsTrait`
Interfaces: `ConditionallySpeculatable`, `InferTypeOpInterface`, `NoMemoryEffect (MemoryEffectOpInterface)`
Effects: `MemoryEffects::Effect{}`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `inputPrecision` | `::mlir::triton::InputPrecisionAttr` | allowed 32-bit signless integer cases: 0, 1, 2. Enum cases: tf32 (`TF32`), tf32x3 (`TF32x3`), ieee (`IEEE`) |
| `maxNumImpreciseAcc` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `a` | TensorOrMemDesc instance
| `b` | TensorOrMemDesc instance
| `c` | ranked tensor of floating-point or integer values
#### Results:
| Result | Description |
| :----: | ----------- |
| `d` | ranked tensor of floating-point or integer values
### `triton_nvidia_gpu.dot_wait` (triton::nvidia_gpu::DotWaitOp)
_Dot wait_
Syntax:
```
operation ::= `triton_nvidia_gpu.dot_wait` $inputs attr-dict `:` type($inputs)
```
Waits until there are $pendings or fewer outstanding async dot operations.
$inputs must be the tensors corresponding to the async dot ops that we're
waiting on. For example, if there are N pending async dot ops and we call
`dot_wait 1`, then $inputs must be the result of the first dot op.
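
The "at most `pendings` outstanding" rule can be pictured with a small Python bookkeeping model. The class `AsyncDotQueue` and its methods are hypothetical names, and the sketch assumes that async dot ops retire in the order they were issued; it tracks names only and performs no computation.

```python
from collections import deque


class AsyncDotQueue:
    """Toy tracker for in-flight async dot ops, oldest first."""

    def __init__(self):
        self.in_flight = deque()
        self.completed = []

    def dot_async(self, name):
        # Issuing an async dot only records it as outstanding.
        self.in_flight.append(name)

    def dot_wait(self, pendings):
        """Retire the oldest ops until at most `pendings` remain in flight."""
        while len(self.in_flight) > pendings:
            self.completed.append(self.in_flight.popleft())
        return list(self.completed)


q = AsyncDotQueue()
q.dot_async("dot0")
q.dot_async("dot1")

# Two ops are pending; waiting with pendings=1 retires only the first one.
done = q.dot_wait(1)
still_pending = list(q.in_flight)

# Waiting with pendings=0 drains everything.
all_done = q.dot_wait(0)
```

With two ops in flight, `dot_wait(1)` makes only the first result safe to use, which matches the example in the text above.
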
Traits: `VerifyTensorLayoutsTrait`
Interfaces: `InferTypeOpInterface`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `pendings` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `inputs` | variadic of TensorOrMemDesc instance
#### Results:
| Result | Description |
| :----: | ----------- |
| `outputs` | variadic of TensorOrMemDesc instance
### `triton_nvidia_gpu.fence_async_shared` (triton::nvidia_gpu::FenceAsyncSharedOp)
_Fence proxy async_
Syntax:
```
operation ::= `triton_nvidia_gpu.fence_async_shared` attr-dict
```
Traits: `VerifyTensorLayoutsTrait`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `bCluster` | `::mlir::BoolAttr` | bool attribute |
### `triton_nvidia_gpu.init_barrier` (triton::nvidia_gpu::InitBarrierOp)
_Initialize a barrier in the given shared memory allocation._
Syntax:
```
operation ::= `triton_nvidia_gpu.init_barrier` $alloc `,` $count attr-dict `:` type($alloc)
```
Initializes a shared memory allocation with mbarrier information.
`alloc` is a descriptor for the shared memory allocation. `count` is the
number of arrivals the barrier expects.
This lowers to the PTX instruction `mbarrier.init.shared::cta.b64`.
Traits: `VerifyTensorLayoutsTrait`
Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)`
Effects: `MemoryEffects::Effect{MemoryEffects::Write on ::mlir::triton::gpu::SharedMemory}`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `count` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `alloc` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system
### `triton_nvidia_gpu.wait_barrier` (triton::nvidia_gpu::WaitBarrierOp)
_Wait until the mbarrier phase completes._
Syntax:
```
operation ::= `triton_nvidia_gpu.wait_barrier` $alloc `,` $phase attr-dict `:` type($alloc)
```
Blocks program progress until the mbarrier object in `alloc` completes
its current phase.
This lowers to a wait loop using the PTX instruction
`mbarrier.try_wait.parity.shared.b64`.
The barrier behavior is described here:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy-completion-mechanisms
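
The interaction between the arrival count set at initialization and the phase parity passed to the wait can be sketched with a single-threaded Python model. The class `MBarrier` and its `arrive` and `try_wait` methods are hypothetical names for this sketch, not Triton ops. The model ignores transaction counts, and it exposes a non-blocking `try_wait` where the real lowering spins in a loop.

```python
class MBarrier:
    """Toy single-threaded model of an mbarrier with phase parity."""

    def __init__(self, count):
        self.count = count    # arrivals expected per phase (as in init_barrier)
        self.pending = count  # arrivals still missing in the current phase
        self.phase = 0        # parity of the phase currently in progress

    def arrive(self):
        """Record one arrival; the last expected arrival completes the phase."""
        self.pending -= 1
        if self.pending == 0:
            self.phase ^= 1            # move on to the next phase
            self.pending = self.count  # re-arm for that phase

    def try_wait(self, parity):
        """Return True once the phase with the given parity has completed."""
        return self.phase != parity


bar = MBarrier(count=2)

before = bar.try_wait(0)         # phase 0 has no arrivals yet
bar.arrive()
after_one = bar.try_wait(0)      # one of two arrivals: still incomplete
bar.arrive()
after_two = bar.try_wait(0)      # both arrivals seen: phase 0 is complete
next_phase = bar.try_wait(1)     # phase 1 has just started
```

A waiter therefore alternates the parity it passes on successive uses of the same barrier, under the assumptions of this model.
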
Traits: `VerifyTensorLayoutsTrait`
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `alloc` | memory descriptor type (`::mlir::triton::MemDescType`) in Triton IR type system
| `phase` | 32-bit signless integer