# TritonNvidiaGPUOps ### `ttng.arrive_barrier` (triton::nvidia_gpu::ArriveBarrierOp) _Perform the arrive operation on an mbarrier_ Syntax: ``` operation ::= `ttng.arrive_barrier` $alloc `,` $count (`,` $pred^)? attr-dict `:` qualified(type($alloc)) ``` The `ttng.arrive_barrier` operation performs the "arrive" operation on an mbarrier object in shared memory. The operation requires a `count` attribute of at least 1, and decreasing the pending arrival count of the mbarrier by the specific count. The operation accepts an optional predicate. Example: ```mlir ttng.arrive_barrier %barrier, 2 : !ttg.memdesc<1xi64, #shared, #smem, mutable> ttng.arrive_barrier %barrier, 1, %pred : !ttg.memdesc<1xi64, #shared, #smem, mutable> ``` Traits: `VerifyTensorLayoutsTrait` #### Attributes:
AttributeMLIR TypeDescription
count::mlir::IntegerAttr32-bit signless integer attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `pred` | 1-bit signless integer | ### `ttng.async_copy_mbarrier_arrive` (triton::nvidia_gpu::AsyncCopyMbarrierArriveOp) _Arrive on mbarrier once all previously issued copies are completed_ Syntax: ``` operation ::= `ttng.async_copy_mbarrier_arrive` $barrier attr-dict `:` qualified(type($barrier)) ``` Traits: `VerifyTensorLayoutsTrait` #### Attributes:
AttributeMLIR TypeDescription
noIncrement::mlir::UnitAttrunit attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | ### `ttng.async_tma_copy_global_to_local` (triton::nvidia_gpu::AsyncTMACopyGlobalToLocalOp) _Copy data based on descriptor from global memory to local memory asynchronously_ Syntax: ``` operation ::= `ttng.async_tma_copy_global_to_local` $desc `[` $coord `]` $result `,` $barrier `,` $pred oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict) attr-dict `:` qualified(type($desc)) `,` qualified(type($barrier)) `->` qualified(type($result)) ``` This operation copies data from global memory to local memory asynchronously. This is analogue to tt.load except the data are copied to local memory pointed by the memory descriptor instead of a distributed tensor. The data copied depends on the global memory descriptor pointed to by `desc`. Traits: `VerifyTensorLayoutsTrait` #### Attributes:
AttributeMLIR TypeDescription
cache::mlir::triton::CacheModifierAttrallowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7
evict::mlir::triton::EvictionPolicyAttrallowed 32-bit signless integer cases: 1, 2, 3
isVolatile::mlir::BoolAttrbool attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system | | `coord` | variadic of 32-bit signless integer | | `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `pred` | 1-bit signless integer | ### `ttng.async_tma_copy_local_to_global` (triton::nvidia_gpu::AsyncTMACopyLocalToGlobalOp) _Copy data based on descriptor from local memory to global memory asynchronously_ Syntax: ``` operation ::= `ttng.async_tma_copy_local_to_global` $desc `[` $coord `]` $src attr-dict `:` qualified(type($desc)) `,` qualified(type($src)) ``` This operation copies data from local memory to global memory asynchronously. This is analogue to tt.store except the data are copied from local memory pointed by the memory descriptor instead of a distributed tensor. The data copied depends on the global memory descriptor pointed to by `desc`. Traits: `VerifyTensorLayoutsTrait` #### Operands: | Operand | Description | | :-----: | ----------- | | `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system | | `coord` | variadic of 32-bit signless integer | | `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | ### `ttng.async_tma_gather` (triton::nvidia_gpu::AsyncTMAGatherOp) _Gather data based on descriptor from global memory to local memory asynchronously_ Syntax: ``` operation ::= `ttng.async_tma_gather` $desc `[` $x_offsets `,` $y_offset `]` $result `,` $barrier `,` $pred attr-dict `:` type(operands) ``` This operation gathers multiple rows of data from global memory matrix to local memory asynchronously. This is similar to async_tma_copy_global_to_local except that each row is indexed independently. Traits: `VerifyTensorLayoutsTrait` #### Operands: | Operand | Description | | :-----: | ----------- | | `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system | | `x_offsets` | ranked tensor of 32-bit signless integer values | | `y_offset` | 32-bit signless integer | | `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `pred` | 1-bit signless integer | ### `ttng.async_tma_reduce` (triton::nvidia_gpu::AsyncTMAReduceOp) _Reduce result in gmem based on a TMA descriptor_ Syntax: ``` operation ::= `ttng.async_tma_reduce` $kind `,` $desc `[` $coord `]` $src attr-dict `:` qualified(type($desc)) `,` qualified(type($src)) ``` This operation copies data from local memory to global memory asynchronously, and atomically performs the specified reduction kind. Atomicity is at the granularity of individual elements, and only relaxed semantics are implied. Traits: `VerifyTensorLayoutsTrait` Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)` Effects: `MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory, MemoryEffects::Write on ::mlir::triton::GlobalMemory}` #### Attributes:
AttributeMLIR TypeDescription
kind::mlir::triton::DescriptorReduceKindAttrallowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7, 8
#### Operands: | Operand | Description | | :-----: | ----------- | | `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system | | `coord` | variadic of 32-bit signless integer | | `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | ### `ttng.async_tma_scatter` (triton::nvidia_gpu::AsyncTMAScatterOp) _Scatter data from local memory into global memory based on a descriptor asynchronously_ Syntax: ``` operation ::= `ttng.async_tma_scatter` $desc `[` $x_offsets `,` $y_offset `]` $src attr-dict `:` type(operands) ``` The `ttng.async_tma_scatter` operation scatters multiple separately-indexed rows of data from local memory into global memory asynchronously. The operation scatters a 2D tensor in shared memory, laid out by core tensor tiles nvmma_shared layout into separately indexed rows in global memory at a given `y` offset. Traits: `VerifyTensorLayoutsTrait` #### Operands: | Operand | Description | | :-----: | ----------- | | `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system | | `x_offsets` | ranked tensor of 32-bit signless integer values | | `y_offset` | 32-bit signless integer | | `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | ### `ttng.barrier_expect` (triton::nvidia_gpu::BarrierExpectOp) _Signal a barrier of an expected number of bytes to be copied._ Syntax: ``` operation ::= `ttng.barrier_expect` $alloc `,` $size attr-dict `,` $pred `:` qualified(type($alloc)) ``` This signal the barrier that `size` bytes are expected to be copied. The associated barrier wait will block until the expected number of bytes are copied. Traits: `VerifyTensorLayoutsTrait` #### Attributes:
AttributeMLIR TypeDescription
size::mlir::IntegerAttr32-bit signless integer attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `pred` | 1-bit signless integer | ### `ttng.cluster_arrive` (triton::nvidia_gpu::ClusterArriveOp) Syntax: ``` operation ::= `ttng.cluster_arrive` attr-dict ``` Traits: `VerifyTensorLayoutsTrait` #### Attributes:
AttributeMLIR TypeDescription
relaxed::mlir::IntegerAttr1-bit signless integer attribute
### `ttng.cluster_wait` (triton::nvidia_gpu::ClusterWaitOp) Syntax: ``` operation ::= `ttng.cluster_wait` attr-dict ``` Traits: `VerifyTensorLayoutsTrait` ### `ttng.fence_async_shared` (triton::nvidia_gpu::FenceAsyncSharedOp) _Fence proxy async_ Syntax: ``` operation ::= `ttng.fence_async_shared` attr-dict ``` Traits: `VerifyTensorLayoutsTrait` #### Attributes:
AttributeMLIR TypeDescription
bCluster::mlir::BoolAttrbool attribute
### `ttng.init_barrier` (triton::nvidia_gpu::InitBarrierOp) _Initialize a barrier in the given shared memory allocation._ Syntax: ``` operation ::= `ttng.init_barrier` $alloc `,` $count attr-dict `:` qualified(type($alloc)) ``` Initializes a shared memory allocation with mbarrier information. `alloc` is a descriptor to the shared memory allocation. `count` is the number of arrives expected by the barrier. This lowers to PTX mbarrier.init.shared::cta.b64. Traits: `VerifyTensorLayoutsTrait` #### Attributes:
AttributeMLIR TypeDescription
count::mlir::IntegerAttr32-bit signless integer attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | ### `ttng.inval_barrier` (triton::nvidia_gpu::InvalBarrierOp) _Invalidate a barrier allocation._ Syntax: ``` operation ::= `ttng.inval_barrier` $alloc attr-dict `:` qualified(type($alloc)) ``` Invalidate a barrier allocation so that it can be re-used. According to PTX spec this has to be done before any reuse of the memory used by mbarrier. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-inval Traits: `VerifyTensorLayoutsTrait` #### Operands: | Operand | Description | | :-----: | ----------- | | `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | ### `ttng.reinterpret_tensor_descriptor` (triton::nvidia_gpu::ReinterpretTensorDescOp) _Reinterpret a pointer as a tensor descriptor_ Syntax: ``` operation ::= `ttng.reinterpret_tensor_descriptor` $rawDesc attr-dict `:` qualified(type($rawDesc)) `to` qualified(type($result)) ``` This Op exists to help the transition from untyped raw TMA objects to typed Tensor descriptor objects. Ideally, we can remove this once the APIs are fully fleshed out. Traits: `AlwaysSpeculatableImplTrait`, `VerifyTensorLayoutsTrait` Interfaces: `ConditionallySpeculatable`, `NoMemoryEffect (MemoryEffectOpInterface)` Effects: `MemoryEffects::Effect{}` #### Operands: | Operand | Description | | :-----: | ----------- | | `rawDesc` | ptr | #### Results: | Result | Description | | :----: | ----------- | | `result` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system | ### `ttng.tc_gen5_commit` (triton::nvidia_gpu::TCGen5CommitOp) _Make an mbarrier track completion of all prior async tcgen5 ops_ Syntax: ``` operation ::= `ttng.tc_gen5_commit` $barrier (`,` $pred^)? attr-dict `:` qualified(type($barrier)) ``` The `ttng.tc_gen5_commit` is an asynchronous operation that makes the mbarrier object track the completion of all prior asynchronous tcgen5 operations. Upon completion of all asynchronous operations, the mbarrier arrive operation is performed on the mbarrier with a count of 1. If `two_ctas` is set, then the mbarrier tracks all prior operations initiated with `two_ctas` set as well. Otherwise, it tracks all prior operations initiated without `two_ctas`. Note that the completion mechanisms are guaranteed to occur sequentially in the order the commit operations were issued. This means, for example: ```mlir ttng.tmem_copy ttng.tc_gen5_mma ttng.tc_gen5_commit %barrierA ttng.tc_gen5_commit %barrierB ``` `%barrierA` tracks the completion of the previous TMEM copy and MMA operations, but since the commit groups are sequential, the arrive-on operation on `%barrierA` is guaranteed to be performed before the arrive-on operation on `%barrierB`, even though its commit group is empty. Traits: `VerifyTensorLayoutsTrait` #### Attributes:
AttributeMLIR TypeDescription
two_ctas::mlir::UnitAttrunit attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `pred` | 1-bit signless integer | ### `ttng.tc_gen5_mma` (triton::nvidia_gpu::TCGen5MMAOp) _Block level op mapping to tensorcore gen5 mma_ Syntax: ``` operation ::= `ttng.tc_gen5_mma` $a `,` $b `,` $d `` custom($acc_dep, type($token)) `,` $useD`,` $pred `` custom($barriers, $barrier_preds) attr-dict `:` qualified(type($a)) `,` qualified(type($b)) `,` qualified(type($d)) (`,` qualified(type($barriers))^)? ``` $d += matrix_multiply($a, $b). If no barrier is given the op is assumed to be synchronous otherwise the op will trigger a commit/arrive on the given barrier. If there is a barrier the result will be safe to read after a barrier wait. If $two_ctas is set the op will execute a matmul across two contiguous CTAs, it will read the data distributed across the two CTAs. and syncronize both CTAs if the op is synchronous. This operation takes and produces an optional token to indicate TMEM read and write on its accumulator operand. When the tokens are present, they can be used to check aliasing and modref on the accumulator memory. Traits: `AttrSizedOperandSegments`, `VerifyTensorLayoutsTrait` Interfaces: `DotOpInterface`, `MMAv5OpInterface`, `MemoryEffectOpInterface` #### Attributes:
AttributeMLIR TypeDescription
two_ctas::mlir::UnitAttrunit attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `a` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `b` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `d` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `acc_dep` | async token type | | `useD` | 1-bit signless integer | | `pred` | 1-bit signless integer | | `barriers` | variadic of memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `barrier_preds` | variadic of 1-bit signless integer | #### Results: | Result | Description | | :----: | ----------- | | `token` | async token type | ### `ttng.tc_gen5_mma_scaled` (triton::nvidia_gpu::TCGen5MMAScaledOp) _Block level op mapping to tensorcore gen5 mma_ Syntax: ``` operation ::= `ttng.tc_gen5_mma_scaled` $a `,` $b `,` $d `` custom($acc_dep, type($token)) `,` $a_scale `,` $b_scale `,` $useD `,` $pred `lhs` `=` $a_type `rhs` `=` $b_type `` custom($barriers, $barrier_preds) attr-dict `:` qualified(type($a)) `,` qualified(type($b)) `,` qualified(type($d)) `,` qualified(type($a_scale)) `,` qualified(type($b_scale)) (`,` qualified(type($barriers))^)? ``` $d += matrix_multiply(scale($lhs, $lhs_scale), scale(rlhs, $rhs_scale)) If no barrier is given the op is assumed to be synchronous otherwise the op will trigger a commit/arrive on the given barrier. If there is a barrier the result will be safe to read after a barrier wait. This operation takes and produces an optional token to indicate TMEM read and write on its accumulator operand. When the tokens are present, they can be used to check aliasing and modref on the accumulator memory. Traits: `AttrSizedOperandSegments`, `VerifyTensorLayoutsTrait` Interfaces: `DotOpInterface`, `MMAv5OpInterface`, `MemoryEffectOpInterface` #### Attributes:
AttributeMLIR TypeDescription
a_type::mlir::triton::ScaleDotElemTypeAttrallowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6
b_type::mlir::triton::ScaleDotElemTypeAttrallowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6
#### Operands: | Operand | Description | | :-----: | ----------- | | `a` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `b` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `d` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `acc_dep` | async token type | | `a_scale` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `b_scale` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `useD` | 1-bit signless integer | | `pred` | 1-bit signless integer | | `barriers` | variadic of memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `barrier_preds` | variadic of 1-bit signless integer | #### Results: | Result | Description | | :----: | ----------- | | `token` | async token type | ### `ttng.async_tma_store_wait` (triton::nvidia_gpu::TMAStoreWaitOp) _Wait until all the inputs are read._ Syntax: ``` operation ::= `ttng.async_tma_store_wait` attr-dict ``` Wait until all the read operations are done from the associated store operations. This is needed before the shared memory can be written to. Traits: `VerifyTensorLayoutsTrait` #### Attributes:
AttributeMLIR TypeDescription
pendings::mlir::IntegerAttr32-bit signless integer attribute
### `ttng.tmem_alloc` (triton::nvidia_gpu::TMEMAllocOp) _Allocate tensor memory_ Syntax: ``` operation ::= `ttng.tmem_alloc` ($src^)? attr-dict `:` functional-type(operands, results) ``` This operation allocates buffer in tensor memory and return a descriptor containing the address and a view of the buffer. This is similar to ttg.local_alloc except the buffer is allocated in tensor memory. Explicitly deallocating a buffer is optional; see local_dealloc. Traits: `VerifyTensorLayoutsTrait` Interfaces: `MemoryEffectOpInterface` #### Operands: | Operand | Description | | :-----: | ----------- | | `src` | ranked tensor of floating-point or integer or ptr values | #### Results: | Result | Description | | :----: | ----------- | | `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `token` | async token type | ### `ttng.tmem_copy` (triton::nvidia_gpu::TMEMCopyOp) _Initiate an asynchronous copy operation from shared memory to the Tensor Memory._ Syntax: ``` operation ::= `ttng.tmem_copy` $src `,` $dst `,` $barrier attr-dict `:` functional-type(operands, results) ``` 2D blocks stored contiguously in SMEM are copied into TMEM as specified by the destination address. The completion of the copy can be observed by waiting on the optional barrier. If this op is used together with an MMA op, one barrier can be used to wait for both copy and MMA. We do not need to wait for the completion of the copy before MMA, since tcgen05.cp followed by tcgen05.mma is guaranteed to execute in that order. This op lowers to the PTX instruction tcgen05.cp. Right now, we only support 1CTA and the warpx4.32x128b variant of the instruction. Each 32x128b block in SMEM is duplicated over 4 warps and stored into 128 rows and 4 columns of TMEM. The primary use case of this op is to copy blocked scales from SMEM to TMEM. The shape of the input SMEM can be flexibily chosen depending on use cases. In the simplest case (e.g. unit test), the source SMEM can be of shape (32 x num_blocks, 16), and the destination TMEM should be of shape (128, 16 x num_blocks), for copying 8 bit values. For scaled GEMM, rep_m x rep_k copies of a 32x128b block need to be stored in SMEM, where rep_m = BLOCK_M / 128, rep_k = BLOCK_K / scale_vec_size / 4, and scale_vec_size = 32 for MXFP. Conceptually, the SMEM is organized in a high-dimensional layout, (rep_m, rep_k, 32, 4, 4B). Some of axes can be flattened into one, to reduce the rank of the load. For example, the following patterns are supported: * (rep_m, rep_k * 32 x 4 x 4B), 2D scale load with cp.async * (rep_m, rep_k, 32, 16B), 4D scale load with TMA * (rep_m, rep_k, 32, 4, 4B), 5D scale load with cp.async Since rep_m blocks are not contiguous in SMEM, this axis cannot be flattened into inner ones. In Triton, the TMEM memdesc for blocked scales must be of the following form: * Its shape must be (BLOCK_MN, BLOCK_K / scale_vec_size), representing the logical shape of blocked scales. * It must be attached with `tensor_memory_scales_encoding` to indicate the chunk-based layout and its duplication over 4 warps. In contrast, the src SMEM must be in the explicit chunk-based layout as described above. So the IR might look like this: %0 = ttng.tmem_alloc : () -> !ttg.memdesc<128x4xi8, #tmem_scales, #ttng.tensor_memory> ttng.tmem_copy %1, %0 : (!ttg.memdesc<1x1x32x4x4xi8, #shared1, #smem>, !ttg.memdesc<128x4xi8, #tmem_scales, #ttng.tensor_memory>) -> () We interpret the semantics of this copy operation as follows. The chunk-based layout in SMEM implies that the logical shape (BLOCK_MN, BLOCK_K / scale_vec_size) in TMEM is the result of certain reshape and transpose operations. In practice, to take an advantage of the native scale layout and the TMEM copy op, users need to do `scales5D.trans(0, 3, 2, 1, 4).reshape(BLOCK_M, BLOCK_K // scale_vec_size)` before feeding scales into dot_scaled. When we use tmem_copy in the IR, such reshape and transpose operations are removed. But the change in the logical shape they have caused on registers is now understood to be incorporated into tmem_copy itself. Ideally, we would lift reshape / transpose done on registers onto the SMEM memdesc, making tmem_copy a straightforward 2D copy operation: (BLOCK_MN, BLOCK_K / scale_vec_size) -> (BLOCK_MN, BLOCK_K / scale_vec_size). In the absence of such operations on memdesc, we resort to implicitly encoding the reshape/transpose semantics in tmem_copy. Traits: `VerifyTensorLayoutsTrait` #### Operands: | Operand | Description | | :-----: | ----------- | | `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `dst` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `barrier` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | ### `ttng.tmem_load` (triton::nvidia_gpu::TMEMLoadOp) _Load a buffer from tensor memory into a distributed tensor_ Syntax: ``` operation ::= `ttng.tmem_load` $src `` custom($dep, type($token)) attr-dict `:` qualified(type($src)) `->` type($result) ``` This is similar to ttg.local_load except the result layout is restricted to only few possibility. Therefore we cannot combine this op with any convert layout like local_load. This operation takes and produces an optional token to indicate TMEM read on its source operand. When the tokens are present, they can be used to check aliasing and modref on the TMEM buffer. Traits: `VerifyTensorLayoutsTrait` #### Operands: | Operand | Description | | :-----: | ----------- | | `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `dep` | async token type | #### Results: | Result | Description | | :----: | ----------- | | `result` | ranked tensor of floating-point or integer or ptr values | | `token` | async token type | ### `ttng.tmem_store` (triton::nvidia_gpu::TMEMStoreOp) _Store a distributed tensor into a buffer in tensor memory_ Syntax: ``` operation ::= `ttng.tmem_store` $src `,` $dst `` custom($dep, type($token)) `,` $pred attr-dict `:` type($src) `->` qualified(type($dst)) ``` This is similar to ttg.local_store except the source layout is restricted to only few possibility. This operation takes and produces an optional token to indicate TMEM write on its source operand. When the tokens are present, they can be used to check aliasing and modref on the TMEM buffer. Traits: `VerifyTensorLayoutsTrait` #### Operands: | Operand | Description | | :-----: | ----------- | | `dst` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `dep` | async token type | | `src` | ranked tensor of floating-point or integer or ptr values | | `pred` | 1-bit signless integer | #### Results: | Result | Description | | :----: | ----------- | | `token` | async token type | ### `ttng.tmem_subslice` (triton::nvidia_gpu::TMEMSubSliceOp) _Take a subslice of a tensor memory allocation_ Syntax: ``` operation ::= `ttng.tmem_subslice` $src attr-dict `:` qualified(type($src)) `->` qualified(type($result)) ``` This operation takes a subslice of a tensor memory allocation and returns a new descriptor containing the address and a view of the subslice. This is similar to ttg.memdesc_subview except the offset needs to be static and we can only slice alog the inner dimension of a 2D memdesc as this is the only one we can do for TMem. Traits: `AlwaysSpeculatableImplTrait`, `VerifyTensorLayoutsTrait` Interfaces: `ConditionallySpeculatable`, `NoMemoryEffect (MemoryEffectOpInterface)` Effects: `MemoryEffects::Effect{}` #### Attributes:
AttributeMLIR TypeDescription
N::mlir::IntegerAttr32-bit signless integer attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | #### Results: | Result | Description | | :----: | ----------- | | `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | ### `ttng.tensormap_create` (triton::nvidia_gpu::TensormapCreateOp) _Create a new TMA descriptor on device_ Syntax: ``` operation ::= `ttng.tensormap_create` $desc_ptr `,` $global_address `,` `[` $box_dim `]` `,` `[` $global_dim `]` `,` `[` $global_stride `]` `,` `[` $element_stride `]` attr-dict `:` functional-type(operands, results) ``` Traits: `AttrSizedOperandSegments`, `VerifyTensorLayoutsTrait` Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)` Effects: `MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory, MemoryEffects::Write on ::mlir::triton::GlobalMemory}` #### Attributes:
AttributeMLIR TypeDescription
elem_type::mlir::IntegerAttr32-bit signless integer attribute whose value is non-negative whose maximum value is 15
interleave_layout::mlir::IntegerAttr32-bit signless integer attribute whose value is non-negative whose maximum value is 2
swizzle_mode::mlir::IntegerAttr32-bit signless integer attribute whose value is non-negative whose maximum value is 3
fill_mode::mlir::IntegerAttr32-bit signless integer attribute whose value is non-negative whose maximum value is 1
#### Operands: | Operand | Description | | :-----: | ----------- | | `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system | | `global_address` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system | | `box_dim` | variadic of 32-bit signless integer | | `global_dim` | variadic of 32-bit signless integer | | `global_stride` | variadic of 64-bit signless integer | | `element_stride` | variadic of 32-bit signless integer | ### `ttng.tensormap_fenceproxy_acquire` (triton::nvidia_gpu::TensormapFenceproxyAcquireOp) _Acquire fence on a tensormap object_ Syntax: ``` operation ::= `ttng.tensormap_fenceproxy_acquire` $desc_ptr attr-dict `:` qualified(type($desc_ptr)) ``` Traits: `VerifyTensorLayoutsTrait` Interfaces: `MemoryEffectOpInterface (MemoryEffectOpInterface)` Effects: `MemoryEffects::Effect{MemoryEffects::Write on ::mlir::triton::GlobalMemory}` #### Operands: | Operand | Description | | :-----: | ----------- | | `desc_ptr` | Pointer type (`::mlir::triton::PointerType`) in Triton IR type system | ### `ttng.wait_barrier` (triton::nvidia_gpu::WaitBarrierOp) _Wait until the mbarrier phase completes._ Syntax: ``` operation ::= `ttng.wait_barrier` $alloc `,` $phase (`,` $pred^)? (`deps` $deps^)? attr-dict `:` qualified(type($alloc)) (`,` type($deps)^)? ``` Blocks the program progress until the mbarrier object in `alloc` completes its current phase. This lowers a waitloop using PTX instruction mbarrier.try_wait.parity.shared.b64. Accepts optional list of memory. If present, it is assumed that any of the dependencies may be accessed until the barrier completes. The barrier behavior is described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy-completion-mechanisms Traits: `AttrSizedOperandSegments`, `VerifyTensorLayoutsTrait` #### Operands: | Operand | Description | | :-----: | ----------- | | `alloc` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | | `phase` | 32-bit signless integer | | `pred` | 1-bit signless integer | | `deps` | variadic of memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system | ### `ttng.warp_group_dot` (triton::nvidia_gpu::WarpGroupDotOp) _Warp group dot_ Syntax: ``` operation ::= `ttng.warp_group_dot` $a`,` $b`,` $c (`,` $useC^)? attr-dict `:` type($a) `*` type($b) `->` type($d) ``` $d = matrix_multiply($a, $b) + $c. For docs on InputPrecisionAttr, see TT_DotOp Traits: `VerifyTensorLayoutsTrait` Interfaces: `DotOpInterface`, `InferTypeOpInterface`, `MemoryEffectOpInterface` #### Attributes:
AttributeMLIR TypeDescription
inputPrecision::mlir::triton::InputPrecisionAttrallowed 32-bit signless integer cases: 0, 1, 2
maxNumImpreciseAcc::mlir::IntegerAttr32-bit signless integer attribute
isAsync::mlir::BoolAttrbool attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `a` | TensorOrMemDesc instance | | `b` | TensorOrMemDesc instance | | `c` | ranked tensor of floating-point or integer values | | `useC` | 1-bit signless integer | #### Results: | Result | Description | | :----: | ----------- | | `d` | ranked tensor of floating-point or integer values | ### `ttng.warp_group_dot_wait` (triton::nvidia_gpu::WarpGroupDotWaitOp) _Warp group dot wait_ Syntax: ``` operation ::= `ttng.warp_group_dot_wait` $inputs attr-dict `:` type($inputs) ``` Waits until there are $pendings or fewer outstanding async dot operations. $inputs must be the tensors corresponding to the async dot ops that we're waiting on. For example, if there are N pending async dot ops and we call `warp_group_dot_wait 1`, then $inputs must be the result of the first dot op. Traits: `VerifyTensorLayoutsTrait` Interfaces: `InferTypeOpInterface` #### Attributes:
AttributeMLIR TypeDescription
pendings::mlir::IntegerAttr32-bit signless integer attribute
#### Operands: | Operand | Description | | :-----: | ----------- | | `inputs` | variadic of TensorOrMemDesc instance | #### Results: | Result | Description | | :----: | ----------- | | `outputs` | variadic of TensorOrMemDesc instance |