# TritonNvidiaGPUOps<!-- Autogenerated by mlir-tblgen; don't manually edit -->

### `ttng.arrive_barrier` (triton::nvidia_gpu::ArriveBarrierOp)

_Perform the arrive operation on an mbarrier_

Syntax:

```
operation ::= `ttng.arrive_barrier` $alloc `,` $count (`,` $pred^)? attr-dict `:` qualified(type($alloc))
```

The `ttng.arrive_barrier` operation performs the "arrive" operation on an
mbarrier object in shared memory. The operation requires a `count` attribute
of at least 1, and decreasing the pending arrival count of the mbarrier by
the specific count.

The operation accepts an optional predicate.

Example:

```mlir
ttng.arrive_barrier %barrier, 2 : !ttg.memdesc<1xi64, #shared, #smem, mutable>
ttng.arrive_barrier %barrier, 1, %pred : !ttg.memdesc<1xi64, #shared, #smem, mutable>
```

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>count</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `pred` | 1-bit signless integer |


### `ttng.async_copy_mbarrier_arrive` (triton::nvidia_gpu::AsyncCopyMbarrierArriveOp)

_Arrive on mbarrier once all previously issued copies are completed_

Syntax:

```
operation ::= `ttng.async_copy_mbarrier_arrive` $barrier attr-dict `:` qualified(type($barrier))
```

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>noIncrement</code></td><td>::mlir::UnitAttr</td><td>unit attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |


### `ttng.async_tma_copy_global_to_local` (triton::nvidia_gpu::AsyncTMACopyGlobalToLocalOp)

_Copy data based on descriptor from global memory to local memory asynchronously_

Syntax:

```
operation ::= `ttng.async_tma_copy_global_to_local` $desc `[` $coord `]` $result `,` $barrier `,` $pred
              oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
              attr-dict `:` qualified(type($desc)) `,` qualified(type($barrier)) `->` qualified(type($result))
```

This operation copies data from global memory to local memory
asynchronously.  This is analogue to tt.load except the data are copied to
local memory pointed by the memory descriptor instead of a distributed
tensor. The data copied depends on the global memory descriptor pointed to
by `desc`.

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>cache</code></td><td>::mlir::triton::CacheModifierAttr</td><td>allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7</td></tr>
<tr><td><code>evict</code></td><td>::mlir::triton::EvictionPolicyAttr</td><td>allowed 32-bit signless integer cases: 1, 2, 3</td></tr>
<tr><td><code>isVolatile</code></td><td>::mlir::BoolAttr</td><td>bool attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system |
| `coord` | variadic of 32-bit signless integer |
| `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `pred` | 1-bit signless integer |


### `ttng.async_tma_copy_local_to_global` (triton::nvidia_gpu::AsyncTMACopyLocalToGlobalOp)

_Copy data based on descriptor from local memory to global memory asynchronously_

Syntax:

```
operation ::= `ttng.async_tma_copy_local_to_global` $desc `[` $coord `]` $src
              attr-dict `:` qualified(type($desc)) `,` qualified(type($src))
```

This operation copies data from local memory to global memory
asynchronously.  This is analogue to tt.store except the data are copied from
local memory pointed by the memory descriptor instead of a distributed
tensor. The data copied depends on the global memory descriptor pointed to
by `desc`.

Traits: `VerifyTensorLayoutsTrait`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system |
| `coord` | variadic of 32-bit signless integer |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |


### `ttng.async_tma_gather` (triton::nvidia_gpu::AsyncTMAGatherOp)

_Gather data based on descriptor from global memory to local memory asynchronously_

Syntax:

```
operation ::= `ttng.async_tma_gather` $desc `[` $x_offsets `,` $y_offset `]` $result `,` $barrier `,` $pred
              attr-dict `:` type(operands)
```

This operation gathers multiple rows of data from global memory matrix to
local memory asynchronously.  This is similar to
async_tma_copy_global_to_local except that each row is indexed independently.

Traits: `VerifyTensorLayoutsTrait`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system |
| `x_offsets` | ranked tensor of 32-bit signless integer values |
| `y_offset` | 32-bit signless integer |
| `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `pred` | 1-bit signless integer |


### `ttng.async_tma_reduce` (triton::nvidia_gpu::AsyncTMAReduceOp)

_Reduce result in gmem based on a TMA descriptor_

Syntax:

```
operation ::= `ttng.async_tma_reduce` $kind `,` $desc `[` $coord `]` $src
              attr-dict `:` qualified(type($desc)) `,` qualified(type($src))
```

This operation copies data from local memory to global memory
asynchronously, and atomically performs the specified reduction kind.
Atomicity is at the granularity of individual elements, and only relaxed
semantics are implied.

Traits: `VerifyTensorLayoutsTrait`

Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)`

Effects: `MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory, MemoryEffects::Write on ::mlir::triton::GlobalMemory}`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>kind</code></td><td>::mlir::triton::DescriptorReduceKindAttr</td><td>allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7, 8</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system |
| `coord` | variadic of 32-bit signless integer |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |


### `ttng.async_tma_scatter` (triton::nvidia_gpu::AsyncTMAScatterOp)

_Scatter data from local memory into global memory based on a descriptor asynchronously_

Syntax:

```
operation ::= `ttng.async_tma_scatter` $desc `[` $x_offsets `,` $y_offset `]` $src
              attr-dict `:` type(operands)
```

The `ttng.async_tma_scatter` operation scatters multiple separately-indexed
rows of data from local memory into global memory asynchronously. The
operation scatters a 2D tensor in shared memory, laid out by core tensor
tiles nvmma_shared layout into separately indexed rows in global
memory at a given `y` offset.

Traits: `VerifyTensorLayoutsTrait`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system |
| `x_offsets` | ranked tensor of 32-bit signless integer values |
| `y_offset` | 32-bit signless integer |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |


### `ttng.barrier_expect` (triton::nvidia_gpu::BarrierExpectOp)

_Signal a barrier of an expected number of bytes to be copied._

Syntax:

```
operation ::= `ttng.barrier_expect` $alloc `,` $size attr-dict `,` $pred `:` qualified(type($alloc))
```

This signal the barrier that `size` bytes are expected to be copied. The
associated barrier wait will block until the expected number of bytes are copied.

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>size</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `pred` | 1-bit signless integer |


### `ttng.cluster_arrive` (triton::nvidia_gpu::ClusterArriveOp)

Syntax:

```
operation ::= `ttng.cluster_arrive` attr-dict
```

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>relaxed</code></td><td>::mlir::IntegerAttr</td><td>1-bit signless integer attribute</td></tr>
</table>


### `ttng.cluster_wait` (triton::nvidia_gpu::ClusterWaitOp)

Syntax:

```
operation ::= `ttng.cluster_wait` attr-dict
```

Traits: `VerifyTensorLayoutsTrait`


### `ttng.fence_async_shared` (triton::nvidia_gpu::FenceAsyncSharedOp)

_Fence proxy async_

Syntax:

```
operation ::= `ttng.fence_async_shared` attr-dict
```

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>bCluster</code></td><td>::mlir::BoolAttr</td><td>bool attribute</td></tr>
</table>


### `ttng.init_barrier` (triton::nvidia_gpu::InitBarrierOp)

_Initialize a barrier in the given shared memory allocation._

Syntax:

```
operation ::= `ttng.init_barrier` $alloc `,` $count attr-dict `:` qualified(type($alloc))
```

Initializes a shared memory allocation with mbarrier information.
`alloc` is a descriptor to the shared memory allocation. `count` is the
number of arrives expected by the barrier.

This lowers to PTX mbarrier.init.shared::cta.b64.

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>count</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |


### `ttng.inval_barrier` (triton::nvidia_gpu::InvalBarrierOp)

_Invalidate a barrier allocation._

Syntax:

```
operation ::= `ttng.inval_barrier` $alloc attr-dict `:` qualified(type($alloc))
```

Invalidate a barrier allocation so that it can be re-used. According to PTX
spec this has to be done before any reuse of the memory used by mbarrier.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-inval

Traits: `VerifyTensorLayoutsTrait`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |


### `ttng.reinterpret_tensor_descriptor` (triton::nvidia_gpu::ReinterpretTensorDescOp)

_Reinterpret a pointer as a tensor descriptor_

Syntax:

```
operation ::= `ttng.reinterpret_tensor_descriptor` $rawDesc attr-dict `:` qualified(type($rawDesc))  `to` qualified(type($result))
```

This Op exists to help the transition from untyped raw TMA objects to typed Tensor descriptor objects.
Ideally, we can remove this once the APIs are fully fleshed out.

Traits: `AlwaysSpeculatableImplTrait`, `VerifyTensorLayoutsTrait`

Interfaces: `ConditionallySpeculatable`, `NoMemoryEffect (MemoryEffectOpInterface)`

Effects: `MemoryEffects::Effect{}`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `rawDesc` | ptr |

#### Results:

| Result | Description |
| :----: | ----------- |
| `result` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system |


### `ttng.tc_gen5_commit` (triton::nvidia_gpu::TCGen5CommitOp)

_Make an mbarrier track completion of all prior async tcgen5 ops_

Syntax:

```
operation ::= `ttng.tc_gen5_commit` $barrier (`,` $pred^)? attr-dict `:` qualified(type($barrier))
```

The `ttng.tc_gen5_commit` is an asynchronous operation that makes the
mbarrier object track the completion of all prior asynchronous tcgen5
operations. Upon completion of all asynchronous operations, the mbarrier
arrive operation is performed on the mbarrier with a count of 1.

If `two_ctas` is set, then the mbarrier tracks all prior operations
initiated with `two_ctas` set as well. Otherwise, it tracks all prior
operations initiated without `two_ctas`.

Note that the completion mechanisms are guaranteed to occur sequentially in
the order the commit operations were issued. This means, for example:

```mlir
ttng.tmem_copy
ttng.tc_gen5_mma
ttng.tc_gen5_commit %barrierA
ttng.tc_gen5_commit %barrierB
```

`%barrierA` tracks the completion of the previous TMEM copy and MMA
operations, but since the commit groups are sequential, the arrive-on
operation on `%barrierA` is guaranteed to be performed before the arrive-on
operation on `%barrierB`, even though its commit group is empty.

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>two_ctas</code></td><td>::mlir::UnitAttr</td><td>unit attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `pred` | 1-bit signless integer |


### `ttng.tc_gen5_mma` (triton::nvidia_gpu::TCGen5MMAOp)

_Block level op mapping to tensorcore gen5 mma_

Syntax:

```
operation ::= `ttng.tc_gen5_mma` $a `,` $b `,` $d `` custom<Token>($acc_dep, type($token)) `,` $useD`,`
              $pred `` custom<BarriersAndPreds>($barriers, $barrier_preds)
              attr-dict `:` qualified(type($a)) `,` qualified(type($b)) `,`
              qualified(type($d)) (`,` qualified(type($barriers))^)?
```

$d += matrix_multiply($a, $b).
If no barrier is given the op is assumed to be synchronous otherwise the op will trigger a commit/arrive on the given barrier.
If there is a barrier the result will be safe to read after a barrier wait.
If $two_ctas is set the op will execute a matmul across two contiguous CTAs, it will read the data distributed across the two CTAs.
and syncronize both CTAs if the op is synchronous.

This operation takes and produces an optional token to indicate TMEM read
and write on its accumulator operand. When the tokens are present, they can
be used to check aliasing and modref on the accumulator memory.

Traits: `AttrSizedOperandSegments`, `VerifyTensorLayoutsTrait`

Interfaces: `DotOpInterface`, `MMAv5OpInterface`, `MemoryEffectOpInterface`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>two_ctas</code></td><td>::mlir::UnitAttr</td><td>unit attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `a` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `b` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `d` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `acc_dep` | async token type |
| `useD` | 1-bit signless integer |
| `pred` | 1-bit signless integer |
| `barriers` | variadic of memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `barrier_preds` | variadic of 1-bit signless integer |

#### Results:

| Result | Description |
| :----: | ----------- |
| `token` | async token type |


### `ttng.tc_gen5_mma_scaled` (triton::nvidia_gpu::TCGen5MMAScaledOp)

_Block level op mapping to tensorcore gen5 mma_

Syntax:

```
operation ::= `ttng.tc_gen5_mma_scaled` $a `,` $b `,` $d `` custom<Token>($acc_dep, type($token)) `,` $a_scale `,`
              $b_scale `,` $useD `,` $pred `lhs` `=` $a_type `rhs` `=` $b_type
              `` custom<BarriersAndPreds>($barriers, $barrier_preds)
              attr-dict `:` qualified(type($a)) `,` qualified(type($b)) `,`
              qualified(type($d)) `,` qualified(type($a_scale)) `,`
              qualified(type($b_scale)) (`,` qualified(type($barriers))^)?
```

$d += matrix_multiply(scale($lhs, $lhs_scale), scale(rlhs, $rhs_scale))
If no barrier is given the op is assumed to be synchronous otherwise the op will trigger a commit/arrive on the given barrier.
If there is a barrier the result will be safe to read after a barrier wait.

This operation takes and produces an optional token to indicate TMEM read
and write on its accumulator operand. When the tokens are present, they can
be used to check aliasing and modref on the accumulator memory.

Traits: `AttrSizedOperandSegments`, `VerifyTensorLayoutsTrait`

Interfaces: `DotOpInterface`, `MMAv5OpInterface`, `MemoryEffectOpInterface`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>a_type</code></td><td>::mlir::triton::ScaleDotElemTypeAttr</td><td>allowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6</td></tr>
<tr><td><code>b_type</code></td><td>::mlir::triton::ScaleDotElemTypeAttr</td><td>allowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `a` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `b` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `d` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `acc_dep` | async token type |
| `a_scale` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `b_scale` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `useD` | 1-bit signless integer |
| `pred` | 1-bit signless integer |
| `barriers` | variadic of memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `barrier_preds` | variadic of 1-bit signless integer |

#### Results:

| Result | Description |
| :----: | ----------- |
| `token` | async token type |


### `ttng.async_tma_store_wait` (triton::nvidia_gpu::TMAStoreWaitOp)

_Wait until all the inputs are read._

Syntax:

```
operation ::= `ttng.async_tma_store_wait` attr-dict
```

Wait until all the read operations are done from the associated store operations.
This is needed before the shared memory can be written to.

Traits: `VerifyTensorLayoutsTrait`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>pendings</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>


### `ttng.tmem_alloc` (triton::nvidia_gpu::TMEMAllocOp)

_Allocate tensor memory_

Syntax:

```
operation ::= `ttng.tmem_alloc` ($src^)? attr-dict `:` functional-type(operands, results)
```

This operation allocates buffer in tensor memory and return a descriptor
containing the address and a view of the buffer.
This is similar to ttg.local_alloc except the buffer is allocated in tensor memory.

Explicitly deallocating a buffer is optional; see local_dealloc.

Traits: `VerifyTensorLayoutsTrait`

Interfaces: `MemoryEffectOpInterface`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `src` | ranked tensor of floating-point or integer or ptr values |

#### Results:

| Result | Description |
| :----: | ----------- |
| `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `token` | async token type |


### `ttng.tmem_copy` (triton::nvidia_gpu::TMEMCopyOp)

_Initiate an asynchronous copy operation from shared memory to the Tensor Memory._

Syntax:

```
operation ::= `ttng.tmem_copy` $src `,` $dst `,` $barrier attr-dict `:` functional-type(operands, results)
```

2D blocks stored contiguously in SMEM are copied into TMEM as specified by the destination address.
The completion of the copy can be observed by waiting on the optional barrier. If this op is used
together with an MMA op, one barrier can be used to wait for both copy and MMA. We do not need to wait
for the completion of the copy before MMA, since tcgen05.cp followed by tcgen05.mma is guaranteed to
execute in that order.

This op lowers to the PTX instruction tcgen05.cp. Right now, we only support 1CTA and the warpx4.32x128b
variant of the instruction. Each 32x128b block in SMEM is duplicated over 4 warps and stored into 128 rows
and 4 columns of TMEM. The primary use case of this op is to copy blocked scales from SMEM to TMEM.

The shape of the input SMEM can be flexibily chosen depending on use cases. In the simplest case (e.g. unit test),
the source SMEM can be of shape (32 x num_blocks, 16), and the destination TMEM should be of shape (128, 16 x num_blocks),
for copying 8 bit values. For scaled GEMM, rep_m x rep_k copies of a 32x128b block need to be stored in SMEM, where
rep_m = BLOCK_M / 128, rep_k = BLOCK_K / scale_vec_size / 4, and scale_vec_size = 32 for MXFP.
Conceptually, the SMEM is organized in a high-dimensional layout, (rep_m, rep_k, 32, 4, 4B).
Some of axes can be flattened into one, to reduce the rank of the load. For example, the following patterns are supported:
 * (rep_m, rep_k * 32 x 4 x 4B), 2D scale load with cp.async
 * (rep_m, rep_k, 32, 16B), 4D scale load with TMA
 * (rep_m, rep_k, 32, 4, 4B), 5D scale load with cp.async
Since rep_m blocks are not contiguous in SMEM, this axis cannot be flattened into inner ones.

In Triton, the TMEM memdesc for blocked scales must be of the following form:
* Its shape must be (BLOCK_MN, BLOCK_K / scale_vec_size), representing the logical shape of blocked scales.
* It must be attached with `tensor_memory_scales_encoding` to indicate the chunk-based layout and its duplication over 4 warps.

In contrast, the src SMEM must be in the explicit chunk-based layout as described above. So the IR might look like this:

%0 = ttng.tmem_alloc : () -> !ttg.memdesc<128x4xi8, #tmem_scales, #ttng.tensor_memory>
ttng.tmem_copy %1, %0 : (!ttg.memdesc<1x1x32x4x4xi8, #shared1, #smem>, !ttg.memdesc<128x4xi8, #tmem_scales, #ttng.tensor_memory>) -> ()

We interpret the semantics of this copy operation as follows. The chunk-based layout in SMEM implies that
the logical shape (BLOCK_MN, BLOCK_K / scale_vec_size) in TMEM is the result of certain reshape and transpose operations.
In practice, to take an advantage of the native scale layout and the TMEM copy op,  users need to do
`scales5D.trans(0, 3, 2, 1, 4).reshape(BLOCK_M, BLOCK_K // scale_vec_size)` before feeding scales into dot_scaled.
When we use tmem_copy in the IR, such reshape and transpose operations are removed. But the change in the logical shape they have caused on
registers is now understood to be incorporated into tmem_copy itself. Ideally, we would lift reshape / transpose done on registers onto
the SMEM memdesc, making tmem_copy a straightforward 2D copy operation: (BLOCK_MN, BLOCK_K / scale_vec_size) -> (BLOCK_MN, BLOCK_K / scale_vec_size).
In the absence of such operations on memdesc, we resort to implicitly encoding the reshape/transpose semantics in tmem_copy.


Traits: `VerifyTensorLayoutsTrait`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `dst` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |


### `ttng.tmem_load` (triton::nvidia_gpu::TMEMLoadOp)

_Load a buffer from tensor memory into a distributed tensor_

Syntax:

```
operation ::= `ttng.tmem_load` $src `` custom<Token>($dep, type($token))
              attr-dict `:` qualified(type($src)) `->` type($result)
```

This is similar to ttg.local_load except the result layout is restricted to only few possibility.
Therefore we cannot combine this op with any convert layout like local_load.

This operation takes and produces an optional token to indicate TMEM read
on its source operand. When the tokens are present, they can
be used to check aliasing and modref on the TMEM buffer.

Traits: `VerifyTensorLayoutsTrait`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `dep` | async token type |

#### Results:

| Result | Description |
| :----: | ----------- |
| `result` | ranked tensor of floating-point or integer or ptr values |
| `token` | async token type |


### `ttng.tmem_store` (triton::nvidia_gpu::TMEMStoreOp)

_Store a distributed tensor into a buffer in tensor memory_

Syntax:

```
operation ::= `ttng.tmem_store` $src `,` $dst `` custom<Token>($dep, type($token)) `,` $pred
              attr-dict `:` type($src) `->` qualified(type($dst))
```

This is similar to ttg.local_store except the source layout is restricted to only few possibility.

This operation takes and produces an optional token to indicate TMEM write
on its source operand. When the tokens are present, they can
be used to check aliasing and modref on the TMEM buffer.

Traits: `VerifyTensorLayoutsTrait`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `dst` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `dep` | async token type |
| `src` | ranked tensor of floating-point or integer or ptr values |
| `pred` | 1-bit signless integer |

#### Results:

| Result | Description |
| :----: | ----------- |
| `token` | async token type |


### `ttng.tmem_subslice` (triton::nvidia_gpu::TMEMSubSliceOp)

_Take a subslice of a tensor memory allocation_

Syntax:

```
operation ::= `ttng.tmem_subslice` $src attr-dict `:` qualified(type($src)) `->` qualified(type($result))
```

This operation takes a subslice of a tensor memory allocation and returns a new descriptor
containing the address and a view of the subslice.
This is similar to ttg.memdesc_subview except the offset needs to be static and we can only
slice alog the inner dimension of a 2D memdesc as this is the only one we can do for TMem.

Traits: `AlwaysSpeculatableImplTrait`, `VerifyTensorLayoutsTrait`

Interfaces: `ConditionallySpeculatable`, `NoMemoryEffect (MemoryEffectOpInterface)`

Effects: `MemoryEffects::Effect{}`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>N</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |

#### Results:

| Result | Description |
| :----: | ----------- |
| `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |


### `ttng.tensormap_create` (triton::nvidia_gpu::TensormapCreateOp)

_Create a new TMA descriptor on device_

Syntax:

```
operation ::= `ttng.tensormap_create` $desc_ptr `,` $global_address `,`
              `[` $box_dim `]` `,`
              `[` $global_dim `]` `,`
              `[` $global_stride `]` `,`
              `[` $element_stride `]`
              attr-dict `:` functional-type(operands, results)
```

Traits: `AttrSizedOperandSegments`, `VerifyTensorLayoutsTrait`

Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)`

Effects: `MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory, MemoryEffects::Write on ::mlir::triton::GlobalMemory}`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>elem_type</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute whose value is non-negative whose maximum value is 15</td></tr>
<tr><td><code>interleave_layout</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute whose value is non-negative whose maximum value is 2</td></tr>
<tr><td><code>swizzle_mode</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute whose value is non-negative whose maximum value is 3</td></tr>
<tr><td><code>fill_mode</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute whose value is non-negative whose maximum value is 1</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system |
| `global_address` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system |
| `box_dim` | variadic of 32-bit signless integer |
| `global_dim` | variadic of 32-bit signless integer |
| `global_stride` | variadic of 64-bit signless integer |
| `element_stride` | variadic of 32-bit signless integer |


### `ttng.tensormap_fenceproxy_acquire` (triton::nvidia_gpu::TensormapFenceproxyAcquireOp)

_Acquire fence on a tensormap object_

Syntax:

```
operation ::= `ttng.tensormap_fenceproxy_acquire` $desc_ptr attr-dict `:` qualified(type($desc_ptr))
```

Traits: `VerifyTensorLayoutsTrait`

Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)`

Effects: `MemoryEffects::Effect{MemoryEffects::Write on ::mlir::triton::GlobalMemory}`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system |


### `ttng.wait_barrier` (triton::nvidia_gpu::WaitBarrierOp)

_Wait until the mbarrier phase completes._

Syntax:

```
operation ::= `ttng.wait_barrier` $alloc `,` $phase (`,` $pred^)? (`deps` $deps^)?
              attr-dict `:` qualified(type($alloc)) (`,` type($deps)^)?
```

Blocks the program progress until the mbarrier object in `alloc` completes
its current phase.

This lowers a waitloop using PTX instruction
mbarrier.try_wait.parity.shared.b64.

Accepts optional list of memory. If present, it is assumed that any of the
dependencies may be accessed until the barrier completes.

The barrier behavior is described here:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy-completion-mechanisms

Traits: `AttrSizedOperandSegments`, `VerifyTensorLayoutsTrait`

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `phase` | 32-bit signless integer |
| `pred` | 1-bit signless integer |
| `deps` | variadic of memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |


### `ttng.warp_group_dot` (triton::nvidia_gpu::WarpGroupDotOp)

_Warp group dot_

Syntax:

```
operation ::= `ttng.warp_group_dot` $a`,` $b`,` $c (`,` $useC^)? attr-dict
              `:` type($a) `*` type($b) `->` type($d)
```

$d = matrix_multiply($a, $b) + $c. For docs on InputPrecisionAttr, see TT_DotOp

Traits: `VerifyTensorLayoutsTrait`

Interfaces: `DotOpInterface`, `InferTypeOpInterface`, `MemoryEffectOpInterface`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>inputPrecision</code></td><td>::mlir::triton::InputPrecisionAttr</td><td>allowed 32-bit signless integer cases: 0, 1, 2</td></tr>
<tr><td><code>maxNumImpreciseAcc</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
<tr><td><code>isAsync</code></td><td>::mlir::BoolAttr</td><td>bool attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `a` | TensorOrMemDesc instance |
| `b` | TensorOrMemDesc instance |
| `c` | ranked tensor of floating-point or integer values |
| `useC` | 1-bit signless integer |

#### Results:

| Result | Description |
| :----: | ----------- |
| `d` | ranked tensor of floating-point or integer values |


### `ttng.warp_group_dot_wait` (triton::nvidia_gpu::WarpGroupDotWaitOp)

_Warp group dot wait_

Syntax:

```
operation ::= `ttng.warp_group_dot_wait` $inputs attr-dict `:` type($inputs)
```

Waits until there are $pendings or fewer outstanding async dot operations.

$inputs must be the tensors corresponding to the async dot ops that we're
waiting on.  For example, if there are N pending async dot ops and we call
`warp_group_dot_wait 1`, then $inputs must be the result of the first dot op.

Traits: `VerifyTensorLayoutsTrait`

Interfaces: `InferTypeOpInterface`

#### Attributes:

<table>
<tr><th>Attribute</th><th>MLIR Type</th><th>Description</th></tr>
<tr><td><code>pendings</code></td><td>::mlir::IntegerAttr</td><td>32-bit signless integer attribute</td></tr>
</table>

#### Operands:

| Operand | Description |
| :-----: | ----------- |
| `inputs` | variadic of TensorOrMemDesc instance |

#### Results:

| Result | Description |
| :----: | ----------- |
| `outputs` | variadic of TensorOrMemDesc instance |
Attribute	MLIR Type	Description
`cache`	::mlir::triton::CacheModifierAttr	allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7
`evict`	::mlir::triton::EvictionPolicyAttr	allowed 32-bit signless integer cases: 1, 2, 3
`isVolatile`	::mlir::BoolAttr	bool attribute
Attribute	MLIR Type	Description
`a_type`	::mlir::triton::ScaleDotElemTypeAttr	allowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6
`b_type`	::mlir::triton::ScaleDotElemTypeAttr	allowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6
Attribute	MLIR Type	Description
`elem_type`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 15
`interleave_layout`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 2
`swizzle_mode`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 3
`fill_mode`	::mlir::IntegerAttr	32-bit signless integer attribute whose value is non-negative whose maximum value is 1
Attribute	MLIR Type	Description
`inputPrecision`	::mlir::triton::InputPrecisionAttr	allowed 32-bit signless integer cases: 0, 1, 2
`maxNumImpreciseAcc`	::mlir::IntegerAttr	32-bit signless integer attribute
`isAsync`	::mlir::BoolAttr	bool attribute