TritonNvidiaGPUOps
ttng.async_tma_copy_global_to_local (triton::nvidia_gpu::AsyncTMACopyGlobalToLocalOp)
Copy data based on descriptor from global memory to local memory asynchronously
Syntax:
operation ::= `ttng.async_tma_copy_global_to_local` $desc_ptr `[` $coord `]` $result `,` $barrier `,` $pred
oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
attr-dict `:` qualified(type($desc_ptr)) `,` qualified(type($barrier)) `->` qualified(type($result))
This operation copies data from global memory to local memory asynchronously. It is analogous to tt.load, except that the data is copied to the local memory pointed to by the memory descriptor instead of to a distributed tensor. The data copied depends on the global memory descriptor pointed to by desc_ptr.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
cache | ::mlir::triton::CacheModifierAttr | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7. Enum cases: none (NONE), ca (CA), cg (CG), wb (WB), cs (CS), wt (WT), cv (CV) |
evict | ::mlir::triton::EvictionPolicyAttr | allowed 32-bit signless integer cases: 1, 2, 3. Enum cases: evict_normal (NORMAL), evict_first (EVICT_FIRST), evict_last (EVICT_LAST) |
isVolatile | ::mlir::BoolAttr | bool attribute |
Operands:
Operand | Description |
---|---|
desc_ptr | Pointer type |
coord | variadic of 32-bit signless integer |
result | memory descriptor type |
barrier | memory descriptor type |
pred | 1-bit signless integer |
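As an illustration, a TMA load of a 128x64 f16 tile might look like the following sketch; the shapes and the #shared, #shared0, and #smem encodings are placeholder assumptions, not canonical values:

```mlir
// Hypothetical IR sketch: encodings and shapes are illustrative assumptions.
%true = arith.constant true
ttng.async_tma_copy_global_to_local %desc[%x, %y] %buf, %mbar, %true
  : !tt.ptr<i8>, !ttg.memdesc<1xi64, #shared0, #smem, mutable>
  -> !ttg.memdesc<128x64xf16, #shared, #smem, mutable>
```

Completion is observed by waiting on %mbar (see ttng.wait_barrier), typically after signalling the expected byte count with ttng.barrier_expect.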
ttng.async_tma_copy_local_to_global (triton::nvidia_gpu::AsyncTMACopyLocalToGlobalOp)
Copy data based on descriptor from local memory to global memory asynchronously
Syntax:
operation ::= `ttng.async_tma_copy_local_to_global` $desc_ptr `[` $coord `]` $src
attr-dict `:` qualified(type($desc_ptr)) `,` qualified(type($src))
This operation copies data from local memory to global memory asynchronously. It is analogous to tt.store, except that the data is copied from the local memory pointed to by the memory descriptor instead of from a distributed tensor. The data copied depends on the global memory descriptor pointed to by desc_ptr.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Operands:
Operand | Description |
---|---|
desc_ptr | Pointer type |
coord | variadic of 32-bit signless integer |
src | memory descriptor type |
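A sketch of the store direction, under the same placeholder encodings as above:

```mlir
// Hypothetical IR sketch: encodings and shapes are illustrative assumptions.
ttng.async_tma_copy_local_to_global %desc[%x, %y] %src
  : !tt.ptr<i8>, !ttg.memdesc<128x64xf16, #shared, #smem>
```

Unlike the load direction, completion is not tracked with an mbarrier; use ttng.async_tma_store_wait to wait until the data has been read out of shared memory.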
ttng.async_tma_gather (triton::nvidia_gpu::AsyncTMAGatherOp)
Gather data based on descriptor from global memory to local memory asynchronously
Syntax:
operation ::= `ttng.async_tma_gather` $desc_ptr `[` $x_offsets `,` $y_offset `]` $result `,` $barrier `,` $pred
attr-dict `:` type(operands)
This operation gathers multiple rows of data from a matrix in global memory into local memory asynchronously. It is similar to async_tma_copy_global_to_local, except that each row is indexed independently.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Operands:
Operand | Description |
---|---|
desc_ptr | Pointer type |
x_offsets | ranked tensor of 32-bit signless integer values |
y_offset | 32-bit signless integer |
result | memory descriptor type |
barrier | memory descriptor type |
pred | 1-bit signless integer |
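For example, gathering 32 independently indexed rows of 128 bf16 elements might be written as in this sketch; per the type(operands) directive, the types of all operands are listed in order, and every encoding here is an assumption:

```mlir
// Hypothetical IR sketch: row indices in %rows, common column offset %col.
ttng.async_tma_gather %desc[%rows, %col] %buf, %mbar, %pred
  : !tt.ptr<i8>, tensor<32xi32, #blocked>, i32,
    !ttg.memdesc<32x128xbf16, #shared, #smem, mutable>,
    !ttg.memdesc<1xi64, #shared0, #smem, mutable>, i1
```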
ttng.async_tma_scatter (triton::nvidia_gpu::AsyncTMAScatterOp)
Scatter data from local memory into global memory based on a descriptor asynchronously
Syntax:
operation ::= `ttng.async_tma_scatter` $desc_ptr `[` $x_offsets `,` $y_offset `]` $src
attr-dict `:` type(operands)
The ttng.async_tma_scatter operation scatters multiple separately indexed rows of data from local memory into global memory asynchronously. The operation scatters a 2D tensor in shared memory, laid out in core tensor tiles with the nvmma_shared layout, into separately indexed rows in global memory at a given y offset.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Operands:
Operand | Description |
---|---|
desc_ptr | Pointer type |
x_offsets | ranked tensor of 32-bit signless integer values |
y_offset | 32-bit signless integer |
src | memory descriptor type |
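The scatter direction mirrors the gather sketch above (encodings again assumed):

```mlir
// Hypothetical IR sketch: scatter 32 SMEM rows to independently indexed
// rows of the global matrix at column offset %col.
ttng.async_tma_scatter %desc[%rows, %col] %src
  : !tt.ptr<i8>, tensor<32xi32, #blocked>, i32,
    !ttg.memdesc<32x128xbf16, #shared, #smem>
```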
ttng.barrier_expect (triton::nvidia_gpu::BarrierExpectOp)
Signal to a barrier the expected number of bytes to be copied.
Syntax:
operation ::= `ttng.barrier_expect` $alloc `,` $size attr-dict `,` $pred `:` qualified(type($alloc))
This signals to the barrier that size bytes are expected to be copied. The associated barrier wait will block until the expected number of bytes have been copied.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
size | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
alloc | memory descriptor type |
pred | 1-bit signless integer |
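For instance, pairing the expect with the TMA load sketched earlier (a 128x64 f16 tile is 128 * 64 * 2 = 16384 bytes; the encoding is an assumption):

```mlir
// Hypothetical IR sketch: announce 16384 expected bytes on the mbarrier.
ttng.barrier_expect %mbar, 16384, %pred
  : !ttg.memdesc<1xi64, #shared0, #smem, mutable>
```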
ttng.cluster_arrive (triton::nvidia_gpu::ClusterArriveOp)
Syntax:
operation ::= `ttng.cluster_arrive` attr-dict
Traits: VerifyTensorLayoutsTrait
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
relaxed | ::mlir::IntegerAttr | 1-bit signless integer attribute |
ttng.cluster_wait (triton::nvidia_gpu::ClusterWaitOp)
Syntax:
operation ::= `ttng.cluster_wait` attr-dict
Traits: VerifyTensorLayoutsTrait
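These two ops carry no further description here; they are typically emitted as an arrive/wait pair for cluster-wide synchronization, as in this minimal sketch (the relaxed attribute value is an illustrative assumption):

```mlir
// Hypothetical IR sketch: all CTAs in the cluster arrive, then wait.
ttng.cluster_arrive {relaxed = false}
ttng.cluster_wait
```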
ttng.init_barrier (triton::nvidia_gpu::InitBarrierOp)
Initialize a barrier in the given shared memory allocation.
Syntax:
operation ::= `ttng.init_barrier` $alloc `,` $count attr-dict `:` qualified(type($alloc))
Initializes a shared memory allocation with mbarrier information. alloc is a descriptor to the shared memory allocation. count is the number of arrivals expected by the barrier.
This lowers to the PTX instruction mbarrier.init.shared::cta.b64.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
count | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
alloc | memory descriptor type |
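A minimal sketch of allocating and initializing an mbarrier that expects a single arrival (the allocation shape and encodings are assumptions):

```mlir
// Hypothetical IR sketch: one i64 mbarrier word in shared memory.
%mbar = ttg.local_alloc : () -> !ttg.memdesc<1xi64, #shared0, #smem, mutable>
ttng.init_barrier %mbar, 1 : !ttg.memdesc<1xi64, #shared0, #smem, mutable>
```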
ttng.inval_barrier (triton::nvidia_gpu::InvalBarrierOp)
Invalidate a barrier allocation.
Syntax:
operation ::= `ttng.inval_barrier` $alloc attr-dict `:` qualified(type($alloc))
Invalidates a barrier allocation so that it can be reused. According to the PTX spec, this has to be done before any reuse of the memory used by the mbarrier.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-inval
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Operands:
Operand | Description |
---|---|
alloc | memory descriptor type |
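Matching the init sketch above:

```mlir
// Hypothetical IR sketch: invalidate before the shared memory is reused.
ttng.inval_barrier %mbar : !ttg.memdesc<1xi64, #shared0, #smem, mutable>
```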
ttng.tc_gen5_mma (triton::nvidia_gpu::TCGen5MMAOp)
Block-level op mapping to the tensor core gen5 MMA
Syntax:
operation ::= `ttng.tc_gen5_mma` $a`,` $b`,` $d`,` $useD`,` $pred (`,` $barrier^)? attr-dict `:` functional-type(operands, results)
$d += matrix_multiply($a, $b). If no barrier is given, the op is assumed to be synchronous; otherwise the op will trigger a commit/arrive on the given barrier, and the result will be safe to read after a barrier wait. If $two_ctas is set, the op will execute a matmul across two contiguous CTAs: it will read the data distributed across the two CTAs and synchronize both CTAs if the op is synchronous.
Traits: VerifyTensorLayoutsTrait
Interfaces: DotOpInterface, MMAv5OpInterface, MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
two_ctas | ::mlir::UnitAttr | unit attribute |
Operands:
Operand | Description |
---|---|
a | memory descriptor type |
b | memory descriptor type |
d | memory descriptor type |
useD | 1-bit signless integer |
pred | 1-bit signless integer |
barrier | memory descriptor type |
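An asynchronous use with a barrier might look like this sketch; operand types follow the functional-type directive, the accumulator lives in tensor memory, and all shapes and encodings are assumptions:

```mlir
// Hypothetical IR sketch: D += A * B, completion signalled on %mbar.
ttng.tc_gen5_mma %a, %b, %d, %useD, %pred, %mbar
  : (!ttg.memdesc<128x64xf16, #shared, #smem>,
     !ttg.memdesc<64x128xf16, #shared, #smem>,
     !ttg.memdesc<128x128xf32, #tmem, #ttng.tensor_memory, mutable>,
     i1, i1,
     !ttg.memdesc<1xi64, #shared0, #smem, mutable>) -> ()
```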
ttng.tc_gen5_mma_scaled (triton::nvidia_gpu::TCGen5MMAScaledOp)
Block-level op mapping to the tensor core gen5 MMA with scaling
Syntax:
operation ::= `ttng.tc_gen5_mma_scaled` $a `,` $b `,` $d `,` $a_scale `,` $b_scale `,` $useD`,` $pred `lhs` `=` $a_type `rhs` `=` $b_type (`,` $barrier^)? attr-dict `:` functional-type(operands, results)
$d += matrix_multiply(scale($a, $a_scale), scale($b, $b_scale)). If no barrier is given, the op is assumed to be synchronous; otherwise the op will trigger a commit/arrive on the given barrier, and the result will be safe to read after a barrier wait.
Traits: VerifyTensorLayoutsTrait
Interfaces: DotOpInterface, MMAv5OpInterface, MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
a_type | ::mlir::triton::ScaleDotElemTypeAttr | allowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6. Enum cases: e4m3 (E4M3), e5m2 (E5M2), e2m3 (E2M3), e3m2 (E3M2), e2m1 (E2M1), bf16 (BF16), fp16 (FP16) |
b_type | ::mlir::triton::ScaleDotElemTypeAttr | allowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6. Enum cases: e4m3 (E4M3), e5m2 (E5M2), e2m3 (E2M3), e3m2 (E3M2), e2m1 (E2M1), bf16 (BF16), fp16 (FP16) |
Operands:
Operand | Description |
---|---|
a | memory descriptor type |
b | memory descriptor type |
d | memory descriptor type |
a_scale | memory descriptor type |
b_scale | memory descriptor type |
useD | 1-bit signless integer |
pred | 1-bit signless integer |
barrier | memory descriptor type |
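A synchronous sketch (no barrier) with e4m3 inputs and tensor-memory scales; every shape, scale shape, and encoding here is an assumption:

```mlir
// Hypothetical IR sketch: D += scale(A, A_scale) * scale(B, B_scale).
ttng.tc_gen5_mma_scaled %a, %b, %d, %a_scale, %b_scale, %useD, %pred
  lhs = e4m3 rhs = e4m3
  : (!ttg.memdesc<128x64xf8E4M3FN, #shared, #smem>,
     !ttg.memdesc<64x128xf8E4M3FN, #shared, #smem>,
     !ttg.memdesc<128x128xf32, #tmem, #ttng.tensor_memory, mutable>,
     !ttg.memdesc<128x2xi8, #tmem_scales, #ttng.tensor_memory>,
     !ttg.memdesc<128x2xi8, #tmem_scales, #ttng.tensor_memory>,
     i1, i1) -> ()
```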
ttng.async_tma_store_wait (triton::nvidia_gpu::TMAStoreWaitOp)
Wait until all the inputs are read.
Syntax:
operation ::= `ttng.async_tma_store_wait` attr-dict
Waits until all the read operations from the associated store operations are done. This is needed before the shared memory can be written to.
Traits: VerifyTensorLayoutsTrait
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
pendings | ::mlir::IntegerAttr | 32-bit signless integer attribute |
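For example, to block until every outstanding TMA store has finished reading shared memory:

```mlir
// Wait until no TMA store is still reading its SMEM source.
ttng.async_tma_store_wait {pendings = 0 : i32}
```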
ttng.tmem_alloc (triton::nvidia_gpu::TMEMAllocOp)
Allocate tensor memory
Syntax:
operation ::= `ttng.tmem_alloc` $src attr-dict `:` functional-type(operands, results)
This operation allocates a buffer in tensor memory and returns a descriptor containing the address and a view of the buffer. This is similar to ttg.local_alloc, except the buffer is allocated in tensor memory.
Explicitly deallocating a buffer is optional; see local_dealloc.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
Operands:
Operand | Description |
---|---|
src | ranked tensor of floating-point or integer or ptr values |
Results:
Result | Description |
---|---|
result | memory descriptor type |
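A sketch of allocating an accumulator in tensor memory from a distributed tensor; the #blocked and #tmem encodings are assumptions:

```mlir
// Hypothetical IR sketch: initialize a TMEM buffer from a register tensor
// %cst defined elsewhere.
%acc = ttng.tmem_alloc %cst
  : (tensor<128x128xf32, #blocked>)
  -> !ttg.memdesc<128x128xf32, #tmem, #ttng.tensor_memory, mutable>
```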
ttng.tmem_copy (triton::nvidia_gpu::TMEMCopyOp)
Initiate an asynchronous copy operation from shared memory to the Tensor Memory.
Syntax:
operation ::= `ttng.tmem_copy` $src `,` $dst `,` $barrier attr-dict `:` functional-type(operands, results)
2D blocks stored contiguously in SMEM are copied into TMEM as specified by the destination address. The completion of the copy can be observed by waiting on the optional barrier. If this op is used together with an MMA op, one barrier can be used to wait for both copy and MMA. We do not need to wait for the completion of the copy before MMA, since tcgen05.cp followed by tcgen05.mma is guaranteed to execute in that order.
This op lowers to the PTX instruction tcgen05.cp. Right now, we only support 1CTA and the warpx4.32x128b variant of the instruction. Each 32x128b block in SMEM is duplicated over 4 warps and stored into 128 rows and 4 columns of TMEM. The primary use case of this op is to copy blocked scales from SMEM to TMEM.
The shape of the input SMEM can be flexibly chosen depending on the use case. In the simplest case (e.g. a unit test), the source SMEM can be of shape (32 x num_blocks, 16), and the destination TMEM should be of shape (128, 16 x num_blocks), for copying 8-bit values. For scaled GEMM, rep_m x rep_k copies of a 32x128b block need to be stored in SMEM, where rep_m = BLOCK_M / 128, rep_k = BLOCK_K / scale_vec_size / 4, and scale_vec_size = 32 for MXFP. Conceptually, the SMEM is organized in a high-dimensional layout, (rep_m, rep_k, 32, 4, 4B). Some of the axes can be flattened into one to reduce the rank of the load. For example, the following patterns are supported:
* (rep_m, rep_k * 32 x 4 x 4B): 2D scale load with cp.async
* (rep_m, rep_k, 32, 16B): 4D scale load with TMA
* (rep_m, rep_k, 32, 4, 4B): 5D scale load with cp.async

Since rep_m blocks are not contiguous in SMEM, this axis cannot be flattened into inner ones.
In Triton, the TMEM memdesc for blocked scales must be of the following form:
* Its shape must be (BLOCK_MN, BLOCK_K / scale_vec_size), representing the logical shape of blocked scales.
* It must be attached with tensor_memory_scales_encoding to indicate the chunk-based layout and its duplication over 4 warps.

In contrast, the src SMEM must be in the explicit chunk-based layout as described above. So the IR might look like this:
ttng.tmem_copy %1, %0 : (!ttg.memdesc<1x1x32x4x4xi8, #shared1, #smem>, !ttg.memdesc<128x4xi8, #tmem_scales, #ttng.tensor_memory>) -> ()
We interpret the semantics of this copy operation as follows. The chunk-based layout in SMEM implies that the logical shape (BLOCK_MN, BLOCK_K / scale_vec_size) in TMEM is the result of certain reshape and transpose operations. In practice, to take advantage of the native scale layout and the TMEM copy op, users need to do
scales5D.trans(0, 3, 2, 1, 4).reshape(BLOCK_M, BLOCK_K // scale_vec_size)
before feeding scales into dot_scaled. When we use tmem_copy in the IR, such reshape and transpose operations are removed, but the change they caused in the logical shape on registers is now understood to be incorporated into tmem_copy itself. Ideally, we would lift the reshape / transpose done on registers onto the SMEM memdesc, making tmem_copy a straightforward 2D copy operation: (BLOCK_MN, BLOCK_K / scale_vec_size) -> (BLOCK_MN, BLOCK_K / scale_vec_size). In the absence of such operations on memdesc, we resort to implicitly encoding the reshape/transpose semantics in tmem_copy.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
Operands:
Operand | Description |
---|---|
src | memory descriptor type |
dst | memory descriptor type |
barrier | memory descriptor type |
ttng.tmem_load (triton::nvidia_gpu::TMEMLoadOp)
Load a buffer from tensor memory into a distributed tensor
Syntax:
operation ::= `ttng.tmem_load` $src attr-dict `:` qualified(type($src)) `->` type($result)
This is similar to ttg.local_load, except the result layout is restricted to only a few possibilities. Therefore, unlike local_load, this op cannot be combined with arbitrary layout conversions.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource}
Operands:
Operand | Description |
---|---|
src | memory descriptor type |
Results:
Result | Description |
---|---|
result | ranked tensor of floating-point or integer or ptr values |
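Reading the accumulator back into a distributed tensor; the encodings are assumptions, and the result layout must be one of the allowed ones:

```mlir
// Hypothetical IR sketch: TMEM -> registers.
%res = ttng.tmem_load %acc
  : !ttg.memdesc<128x128xf32, #tmem, #ttng.tensor_memory, mutable>
  -> tensor<128x128xf32, #blocked>
```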
ttng.tmem_store (triton::nvidia_gpu::TMEMStoreOp)
Store a distributed tensor into a buffer in tensor memory
Syntax:
operation ::= `ttng.tmem_store` $src `,` $dst `,` $pred attr-dict `:` type($src) `->` qualified(type($dst))
This is similar to ttg.local_store, except the source layout is restricted to only a few possibilities.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
Operands:
Operand | Description |
---|---|
dst | memory descriptor type |
src | ranked tensor of floating-point or integer or ptr values |
pred | 1-bit signless integer |
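The store direction, guarded by a predicate (encodings assumed):

```mlir
// Hypothetical IR sketch: registers -> TMEM, executed only if %pred is true.
ttng.tmem_store %val, %acc, %pred
  : tensor<128x128xf32, #blocked>
  -> !ttg.memdesc<128x128xf32, #tmem, #ttng.tensor_memory, mutable>
```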
ttng.tensor_desc_to_tma_ptr (triton::nvidia_gpu::TensorDescToTMAPtrOp)
Convert a tensor descriptor to a pointer to a TMA descriptor
Syntax:
operation ::= `ttng.tensor_desc_to_tma_ptr` $desc attr-dict `:` qualified(type($desc)) `to` qualified(type($ptr))
Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:
Operand | Description |
---|---|
desc | Tensor descriptor type |
Results:
Result | Description |
---|---|
ptr | ptr |
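A sketch of the conversion; the tensor descriptor's element type and shape are assumptions:

```mlir
// Hypothetical IR sketch: recover the raw TMA descriptor pointer.
%ptr = ttng.tensor_desc_to_tma_ptr %desc
  : !tt.tensordesc<tensor<128x64xf16>> to !tt.ptr<i8>
```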
ttng.wait_barrier (triton::nvidia_gpu::WaitBarrierOp)
Wait until the mbarrier phase completes.
Syntax:
operation ::= `ttng.wait_barrier` $alloc `,` $phase attr-dict (`,` $pred^)? `:` qualified(type($alloc))
Blocks program progress until the mbarrier object in alloc completes its current phase.
This lowers to a wait loop using the PTX instruction mbarrier.try_wait.parity.shared.b64.
The barrier behavior is described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy-completion-mechanisms
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Operands:
Operand | Description |
---|---|
alloc | memory descriptor type |
phase | 32-bit signless integer |
pred | 1-bit signless integer |
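Completing the load example from earlier (encodings assumed); the parity alternates each time the barrier completes a phase, so a loop typically tracks it in an i32 and toggles it after every wait:

```mlir
// Hypothetical IR sketch: block until the mbarrier finishes phase %phase.
ttng.wait_barrier %mbar, %phase
  : !ttg.memdesc<1xi64, #shared0, #smem, mutable>
```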
ttng.warp_group_dot (triton::nvidia_gpu::WarpGroupDotOp)
Warp group dot
Syntax:
operation ::= `ttng.warp_group_dot` $a`,` $b`,` $c (`,` $useC^)? attr-dict `:` type($a) `*` type($b) `->` type($d)
$d = matrix_multiply($a, $b) + $c. For docs on InputPrecisionAttr, see TT_DotOp.
Traits: VerifyTensorLayoutsTrait
Interfaces: DotOpInterface, InferTypeOpInterface, MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
inputPrecision | ::mlir::triton::InputPrecisionAttr | allowed 32-bit signless integer cases: 0, 1, 2. Enum cases: tf32 (TF32), tf32x3 (TF32x3), ieee (IEEE) |
maxNumImpreciseAcc | ::mlir::IntegerAttr | 32-bit signless integer attribute |
isAsync | ::mlir::BoolAttr | bool attribute |
Operands:
Operand | Description |
---|---|
a | TensorOrMemDesc instance |
b | TensorOrMemDesc instance |
c | ranked tensor of floating-point or integer values |
useC | 1-bit signless integer |
Results:
Result | Description |
---|---|
d | ranked tensor of floating-point or integer values |
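For example, with A and B in shared memory; the shapes and the #shared / #mma encodings are assumptions:

```mlir
// Hypothetical IR sketch: D = A * B + C.
%d = ttng.warp_group_dot %a, %b, %c
  : !ttg.memdesc<128x64xf16, #shared, #smem>
  * !ttg.memdesc<64x128xf16, #shared, #smem>
  -> tensor<128x128xf32, #mma>
```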
ttng.warp_group_dot_wait (triton::nvidia_gpu::WarpGroupDotWaitOp)
Warp group dot wait
Syntax:
operation ::= `ttng.warp_group_dot_wait` $inputs attr-dict `:` type($inputs)
Waits until there are $pendings or fewer outstanding async dot operations. $inputs must be the tensors corresponding to the async dot ops that we're waiting on. For example, if there are N pending async dot ops and we call warp_group_dot_wait 1, then $inputs must be the result of the first dot op.
Traits: VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
pendings | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
inputs | variadic of TensorOrMemDesc instance |
Results:
Result | Description |
---|---|
outputs | variadic of TensorOrMemDesc instance |
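Continuing the warp_group_dot example above, waiting until the async result is ready:

```mlir
// Wait until no async dots are outstanding; the result forwards the
// awaited value.
%d_ready = ttng.warp_group_dot_wait %d {pendings = 0 : i32}
  : tensor<128x128xf32, #mma>
```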