TritonGPUOps
ttg.async_commit_group (triton::gpu::AsyncCommitGroupOp)
Commit pending async copies into an async group that can be waited on
Syntax:
operation ::= `ttg.async_commit_group` (`tokens` $inputTokens^)? attr-dict
Closes the current batch of async_copy_* operations and allows them to be waited on with ttg.async_wait.
Committing is required to ensure that async copy operations can be waited on.
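A minimal sketch (the token names are illustrative; `%copy_tok` would come from a preceding ttg.async_copy_global_to_local):

```mlir
// Close the current batch of async copies into one group; the
// returned token can later be passed to ttg.async_wait.
%commit_tok = ttg.async_commit_group tokens %copy_tok
```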
Traits: VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface
Operands:

| Operand | Description |
|---|---|
| `inputTokens` | variadic of async token type |

Results:

| Result | Description |
|---|---|
|  | async token type |
ttg.async_copy_global_to_local (triton::gpu::AsyncCopyGlobalToLocalOp)
Copy data from global memory to local memory asynchronously
Syntax:
operation ::= `ttg.async_copy_global_to_local` $src `,` $result (`mask` $mask^)? (`other` $other^)?
oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
attr-dict `:` type($src) `->` type($result)
This operation copies data from global memory to local memory asynchronously.
This is analogous to tt.load, except that the data is copied to the local memory pointed
to by the memory descriptor instead of to a distributed tensor. The remaining
operands are the same as for tt.load.
The contiguity attribute is the maximum number of elements that can be loaded in a single vector with
the given layout and mask.
It allows async_copy_global_to_local to be used even when the alignment cannot be proven from the IR.
The data only becomes available in local memory after a ttg.async_wait is issued that waits on the
completion of the copy. The async copy operations must first be committed using
ttg.async_commit_group, which closes the batch and allows it to be waited on.
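A minimal sketch of the copy/commit/wait pattern (the layouts #blocked, #shared, #smem and all shapes are illustrative assumptions):

```mlir
// Asynchronously copy a 32x32 f16 tile from global to shared memory.
%tok = ttg.async_copy_global_to_local %ptrs, %buf mask %m
    : tensor<32x32x!tt.ptr<f16>, #blocked> -> !ttg.memdesc<32x32xf16, #shared, #smem, mutable>
// Commit the batch, then wait for all outstanding groups.
%grp = ttg.async_commit_group tokens %tok
%done = ttg.async_wait %grp {num = 0 : i32}
```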
Traits: AttrSizedOperandSegments, VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `cache` | ::mlir::triton::CacheModifierAttr | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7 |
| `evict` | ::mlir::triton::EvictionPolicyAttr | allowed 32-bit signless integer cases: 1, 2, 3 |
| `isVolatile` | ::mlir::BoolAttr | bool attribute |
| `contiguity` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:

| Operand | Description |
|---|---|
| `src` | ranked tensor of ptr values |
| `result` | memory descriptor type |
| `mask` | tensor of 1-bit signless integer values |
| `other` | floating-point or ranked tensor of floating-point values or integer or ranked tensor of integer values or ptr or ranked tensor of ptr values or ptr |

Results:

| Result | Description |
|---|---|
|  | async token type |
ttg.async_wait (triton::gpu::AsyncWaitOp)
Ensure all specified async_copy_* operations are complete.
Syntax:
operation ::= `ttg.async_wait` ($asyncToken^)? attr-dict
The async_wait op waits until at most num async copy groups are outstanding, without synchronising CTA execution.
It takes zero or more async tokens plus an integer attribute num that specifies how many async copy groups may remain
outstanding after the async_wait op completes. num = 0 waits until all groups of async copies are complete.
This operation does not provide any synchronisation within the CTA; if synchronisation is needed, use ttg.local_barrier
in addition to this operation.
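A minimal sketch (the token names are illustrative):

```mlir
// Allow at most one async copy group to remain outstanding;
// all older groups are complete once this op returns.
%tok = ttg.async_wait %grp0 {num = 1 : i32}
```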
Traits: MemWaitOpTrait, VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `num` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
|---|---|
| `asyncToken` | variadic of async token type |

Results:

| Result | Description |
|---|---|
|  | async token type |
ttg.barrier (triton::gpu::BarrierOp)
Synchronizes execution and reads/writes to the selected address spaces for all threads in the CTA.
The barrier op synchronises execution and all operations on the selected address spaces for all
threads in the CTA. It is used to coordinate communication between threads in the CTA.
This operation waits until all threads in the CTA have reached the barrier (for synchronisation) and until operations
on the selected address spaces made by these threads prior to the op are visible to all threads in the CTA.
Data hazards between threads accessing the same memory can be avoided by synchronising the
specified scope in-between these accesses with a barrier.
A barrier operation only provides synchronisation and memory guarantees on the selected address spaces within the CTA.
The mandatory addrspace attribute is a bitmask describing which address spaces will be visible when the barrier completes:
- none: control-only synchronisation (no memory ordering).
- local: shared-memory operations are complete and visible CTA-wide.
- global_read: global memory reads are complete and visible CTA-wide.
- global_write: global memory writes are complete and visible CTA-wide.
- tensor_read: tensor memory read operations are complete and visible CTA-wide.
- tensor_write: tensor memory write operations are complete and visible CTA-wide.
- all: convenience alias for ["local", "global_read", "global_write", "tensor_read", "tensor_write"].
Multiple address spaces can be combined (e.g. local|tensor_write). none cannot be combined with other address spaces.
Example:

```mlir
ttg.barrier local
ttg.barrier local|global_read|global_write
```
Traits: VerifyTensorLayoutsTrait
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `addrSpace` | ::mlir::triton::gpu::AddrSpaceAttr |  |
ttg.convert_layout (triton::gpu::ConvertLayoutOp)
Convert layout
Syntax:
operation ::= `ttg.convert_layout` $src attr-dict `:` type($src) `->` type($result)
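A minimal sketch (the #blocked and #mma layouts are illustrative assumptions):

```mlir
// Redistribute tensor elements across threads into a new layout;
// the values themselves are unchanged.
%1 = ttg.convert_layout %0 : tensor<64x64xf32, #blocked> -> tensor<64x64xf32, #mma>
```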
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType, SameOperandsAndResultShape, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:

| Operand | Description |
|---|---|
| `src` | ranked tensor of floating-point or integer or ptr values |

Results:

| Result | Description |
|---|---|
| `result` | ranked tensor of floating-point or integer or ptr values |
ttg.fp4_to_fp (triton::gpu::Fp4ToFpOp)
Upcast fp4 (e2m1) to fp
Syntax:
operation ::= `ttg.fp4_to_fp` $src attr-dict `:` type($src) `->` type($result)
Upcast fp4 (e2m1) values packed as i8s to fp.
The lower 4 bits of each i8 represent the first fp4 element, and the upper 4 bits the second.
The axis attribute specifies the axis along which the fp4 elements are packed.
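A minimal sketch (layouts and shapes are illustrative; the packed axis doubles in size after upcasting):

```mlir
// Each i8 holds two e2m1 values packed along axis 1, so a
// 32x16 i8 tensor upcasts to a 32x32 floating-point tensor.
%fp = ttg.fp4_to_fp %packed {axis = 1 : i32}
    : tensor<32x16xi8, #blocked> -> tensor<32x32xbf16, #blocked1>
```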
Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `axis` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
|---|---|
| `src` | ranked tensor of 8-bit signless integer values |

Results:

| Result | Description |
|---|---|
| `result` | ranked tensor of floating-point values |
ttg.global_scratch_alloc (triton::gpu::GlobalScratchAllocOp)
Allocate a global memory buffer
Syntax:
operation ::= `ttg.global_scratch_alloc` attr-dict `:` qualified(type($result))
This operation allocates a buffer in global memory that is private to the current program.
The backend attribute specifies the backend to use for allocation.
The default backend is used by TritonGPU passes.
Downstream Triton tools and compilers can register a different backend and use a different allocation policy.
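A minimal sketch (the size and alignment values are illustrative):

```mlir
// Allocate 1024 bytes of program-private global scratch memory,
// aligned to 128 bytes.
%scratch = ttg.global_scratch_alloc {alignment = 128 : i32, nbytes = 1024 : i32} : !tt.ptr<i8>
```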
Traits: VerifyTensorLayoutsTrait
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `nbytes` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `alignment` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `backend` | ::mlir::StringAttr | string attribute |

Results:

| Result | Description |
|---|---|
| `result` | ptr |
ttg.local_alloc (triton::gpu::LocalAllocOp)
Allocate tensor
Syntax:
operation ::= `ttg.local_alloc` ($src^)? attr-dict `:` functional-type(operands, results)
This operation allocates a buffer in shared memory and returns a descriptor containing the address and a view of the buffer.
Explicitly deallocating a buffer is optional; see local_dealloc.
The src operand is an optional initializer for the allocated buffer. It
must have the same element type as the buffer. If src is not specified, the
returned buffer must be mutable.
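A minimal sketch of both forms (the layouts are illustrative assumptions):

```mlir
// Uninitialized buffer: must be mutable.
%buf = ttg.local_alloc : () -> !ttg.memdesc<32x32xf16, #shared, #smem, mutable>
// Buffer initialized from a distributed tensor.
%init = ttg.local_alloc %src : (tensor<32x32xf16, #blocked>) -> !ttg.memdesc<32x32xf16, #shared, #smem>
```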
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `alignment` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
|---|---|
| `src` | ranked tensor of floating-point or integer or ptr values |

Results:

| Result | Description |
|---|---|
|  | memory descriptor type |
ttg.local_dealloc (triton::gpu::LocalDeallocOp)
Dealloc buffer
Syntax:
operation ::= `ttg.local_dealloc` $src attr-dict `:` qualified(type($src))
This operation deallocates a buffer explicitly. Using the buffer after this operation is undefined.
This operation is optional. If you don’t explicitly dealloc a buffer, the compiler assumes it’s deallocated at the first point that post-dominates all uses of the alloc.
Because we assume a memdesc is dead at the first point that post-dominates its uses, ops that wait for an async operation on a memdesc to complete (such as ttng.warp_group_dot_wait) should also take the memdesc as an operand.
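A minimal sketch (the buffer and its type are illustrative):

```mlir
// Explicitly release the shared-memory buffer; any use of %buf
// after this point is undefined.
ttg.local_dealloc %buf : !ttg.memdesc<32x32xf16, #shared, #smem, mutable>
```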
Traits: VerifyTensorLayoutsTrait
Operands:

| Operand | Description |
|---|---|
| `src` | memory descriptor type |
ttg.local_gather (triton::gpu::LocalGatherOp)
Gather elements from shared memory along a specified axis
Syntax:
operation ::= `ttg.local_gather` $src `[` $indices `]` (`token` $token^)? attr-dict `:` qualified(type($src)) `,` type($indices) `->` type($result)
Gather elements from a shared memory descriptor using an indices tensor along a single specified axis. The output tensor has the same shape as the indices tensor.
For each output position I, the operation reads from src with the coordinate along the gather axis replaced by the index value: result[I] = src[I[0], …, indices[I], …, I[n]].
This matches the behavior of tt.gather but operates on shared memory descriptors.
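A minimal sketch (layouts and shapes are illustrative assumptions):

```mlir
// Gather rows of the 64x32 shared buffer selected by %idx along axis 0.
%out = ttg.local_gather %src[%idx] {axis = 0 : i32}
    : !ttg.memdesc<64x32xf16, #shared, #smem>, tensor<16x32xi32, #blocked> -> tensor<16x32xf16, #blocked>
```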
Traits: LocalLoadTrait, VerifyTensorLayoutsTrait
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `axis` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
|---|---|
| `src` | memory descriptor type |
| `indices` | ranked tensor of integer values |
| `token` | async token type |

Results:

| Result | Description |
|---|---|
| `result` | ranked tensor of floating-point or integer or ptr values |
ttg.local_load (triton::gpu::LocalLoadOp)
Load a buffer from local memory into a distributed tensor
Syntax:
operation ::= `ttg.local_load` $src (`token` $token^)? attr-dict `:` qualified(type($src)) `->` type($result)
Load a tensor from the local memory descriptor into a distributed tensor.
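A minimal sketch (layouts are illustrative; the optional token orders the load after an async copy):

```mlir
%t = ttg.local_load %buf token %done
    : !ttg.memdesc<32x32xf16, #shared, #smem, mutable> -> tensor<32x32xf16, #blocked>
```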
Traits: LocalLoadTrait, VerifyTensorLayoutsTrait
Operands:

| Operand | Description |
|---|---|
| `src` | memory descriptor type |
| `token` | async token type |

Results:

| Result | Description |
|---|---|
| `result` | ranked tensor of floating-point or integer or ptr values |
ttg.local_scatter (triton::gpu::LocalScatterOp)
Scatter elements to shared memory along a specified axis
Syntax:
operation ::= `ttg.local_scatter` $dst `[` $indices `]` `,` $values (`token` $token^)? attr-dict `:` qualified(type($dst)) `,` type($indices) `,` type($values)
Scatter elements to a shared memory descriptor using an indices tensor along a single specified axis. The values tensor has the same shape as the indices tensor.
For each input position I, the operation writes to dst with the coordinate along the scatter axis replaced by the index value: dst[I[0], …, indices[I], …, I[n]] = values[I].
This is the inverse of local_gather and writes to shared memory at runtime-computed indices.
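A minimal sketch (layouts and shapes are illustrative assumptions):

```mlir
// Scatter 16 rows of %vals into the 64x32 shared buffer at the
// row positions given by %idx along axis 0.
ttg.local_scatter %dst[%idx], %vals {axis = 0 : i32}
    : !ttg.memdesc<64x32xf16, #shared, #smem, mutable>, tensor<16x32xi32, #blocked>, tensor<16x32xf16, #blocked>
```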
Traits: VerifyTensorLayoutsTrait
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `axis` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
|---|---|
| `dst` | memory descriptor type |
| `values` | ranked tensor of floating-point or integer or ptr values |
| `indices` | ranked tensor of integer values |
| `token` | async token type |
ttg.local_store (triton::gpu::LocalStoreOp)
Store a distributed tensor into a buffer in local memory
Syntax:
operation ::= `ttg.local_store` $src `,` $dst attr-dict `:` type($src) `->` qualified(type($dst))
Store a distributed tensor into a buffer in local memory.
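A minimal sketch (the layouts are illustrative assumptions):

```mlir
// Write a distributed tensor into a mutable shared-memory buffer.
ttg.local_store %t, %buf : tensor<32x32xf16, #blocked> -> !ttg.memdesc<32x32xf16, #shared, #smem, mutable>
```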
Traits: VerifyTensorLayoutsTrait
Operands:

| Operand | Description |
|---|---|
| `src` | ranked tensor of floating-point or integer or ptr values |
| `dst` | memory descriptor type |
ttg.mask (triton::gpu::MaskOp)
Mask op for pipelining
Traits: SingleBlock, VerifyTensorLayoutsTrait
Operands:

| Operand | Description |
|---|---|
|  | 1-bit signless integer |

Results:

| Result | Description |
|---|---|
|  | variadic of any type |
ttg.mask.return (triton::gpu::MaskReturnOp)
Terminator for mask operator
Syntax:
operation ::= `ttg.mask.return` $result attr-dict `:` type($result)
Traits: AlwaysSpeculatableImplTrait, HasParent<MaskOp>, ReturnLike, Terminator, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
Operands:

| Operand | Description |
|---|---|
| `result` | variadic of any type |
ttg.memdesc_index (triton::gpu::MemDescIndexOp)
Take a subview of the descriptor.
Syntax:
operation ::= `ttg.memdesc_index` $src `[` $index `]` attr-dict `:` qualified(type($src)) `->` qualified(type($result))
This operation returns a new descriptor pointing to the i-th element of the
input descriptor along the 0-th dimension.
It doesn’t affect the underlying memory.
For example, suppose that

- the input shape is 2x4x16xf16,
- the output shape is 4x16xf16, and
- index = 1.

Then the output descriptor is equivalent to input[1], where input is the logical tensor.
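The example above as IR, as a minimal sketch (the layouts are illustrative assumptions):

```mlir
%c1 = arith.constant 1 : i32
// View of input[1]: the leading dimension is dropped.
%view = ttg.memdesc_index %src[%c1]
    : !ttg.memdesc<2x4x16xf16, #shared, #smem, mutable> -> !ttg.memdesc<4x16xf16, #shared, #smem, mutable>
```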
Traits: AlwaysSpeculatableImplTrait, MemDescViewTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:

| Operand | Description |
|---|---|
| `src` | memory descriptor type |
| `index` | 32-bit signless integer |

Results:

| Result | Description |
|---|---|
| `result` | memory descriptor type |
ttg.memdesc_reinterpret (triton::gpu::MemDescReinterpretOp)
Reinterpret a memory descriptor as a different type and shape
Syntax:
operation ::= `ttg.memdesc_reinterpret` $src attr-dict `:` qualified(type($src)) `->` qualified(type($result))
The ttg.memdesc_reinterpret operation reinterprets a memory descriptor
as one with a different shape and element type. Because memory descriptors
lack strides, this operation is only valid if the original memory descriptor
is contiguous.
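A minimal sketch (shapes, element types, and layouts are illustrative assumptions):

```mlir
// Reinterpret a contiguous byte buffer as a 2-D f16 buffer.
%view = ttg.memdesc_reinterpret %raw
    : !ttg.memdesc<1024xi8, #shared, #smem, mutable> -> !ttg.memdesc<32x16xf16, #shared1, #smem, mutable>
```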
Traits: AlwaysSpeculatableImplTrait, MemDescViewTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:

| Operand | Description |
|---|---|
| `src` | memory descriptor type |

Results:

| Result | Description |
|---|---|
| `result` | memory descriptor type |
ttg.memdesc_reshape (triton::gpu::MemDescReshapeOp)
Creates a descriptor for the new shape
Syntax:
operation ::= `ttg.memdesc_reshape` $src attr-dict `:` qualified(type($src)) `->` qualified(type($result))
This operation returns a new descriptor representing a reshaped view of the underlying buffer. This doesn’t affect the memory.
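A minimal sketch (the layouts are illustrative assumptions; the element count must match):

```mlir
// Same underlying buffer viewed with a different logical shape.
%r = ttg.memdesc_reshape %src
    : !ttg.memdesc<32x16xf16, #shared, #smem> -> !ttg.memdesc<16x32xf16, #shared1, #smem>
```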
Traits: AlwaysSpeculatableImplTrait, MemDescViewTrait, SameOperandsAndResultElementType, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:

| Operand | Description |
|---|---|
| `src` | memory descriptor type |

Results:

| Result | Description |
|---|---|
| `result` | memory descriptor type |
ttg.memdesc_subslice (triton::gpu::MemDescSubsliceOp)
Take a subview of the descriptor.
Syntax:
operation ::= `ttg.memdesc_subslice` $src `[` custom<Offsets>($offsets) `]` attr-dict `:` qualified(type($src))
`->` qualified(type($result))
This operation returns a new descriptor representing a subview of the logical tensor. It doesn’t affect the underlying memory.
For example, suppose that

- the input shape is 32x16xf16,
- the output shape is 8x16xf16, and
- offsets = [2, 1].

Then, in Python syntax, the subview covers input[2:8+2, 1:16+1], where input is the logical tensor.
The offsets must be larger than or equal to the tile of the tensor (or zero).
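The example above as IR, as a minimal sketch (the layouts are illustrative assumptions):

```mlir
// Subview covering input[2:10, 1:17] of the logical tensor.
%sub = ttg.memdesc_subslice %src[2, 1]
    : !ttg.memdesc<32x16xf16, #shared, #smem> -> !ttg.memdesc<8x16xf16, #shared, #smem>
```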
Traits: AlwaysSpeculatableImplTrait, MemDescViewTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `offsets` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |

Operands:

| Operand | Description |
|---|---|
| `src` | memory descriptor type |

Results:

| Result | Description |
|---|---|
| `result` | memory descriptor type |
ttg.memdesc_trans (triton::gpu::MemDescTransOp)
Transpose the descriptor
Syntax:
operation ::= `ttg.memdesc_trans` $src attr-dict `:` qualified(type($src)) `->` qualified(type($result))
This operation returns a new descriptor representing a transposed view of the buffer.
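A minimal sketch (the layouts are illustrative assumptions):

```mlir
// Transposed view of the buffer; no data is moved.
%t = ttg.memdesc_trans %src {order = array<i32: 1, 0>}
    : !ttg.memdesc<32x16xf16, #shared, #smem> -> !ttg.memdesc<16x32xf16, #shared1, #smem>
```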
Traits: AlwaysSpeculatableImplTrait, InferTypeOpAdaptor, MemDescViewTrait, SameOperandsAndResultElementType, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TransposeOpInterface
Effects: MemoryEffects::Effect{}
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `order` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |

Operands:

| Operand | Description |
|---|---|
| `src` | memory descriptor type |

Results:

| Result | Description |
|---|---|
| `result` | memory descriptor type |
ttg.predicate_stage (triton::gpu::PredicateStageOp)
Pipeliner stage predicate
Syntax:
operation ::= `ttg.predicate_stage` $iv `,` $ub `,` $step `maxStage` $maxStage `stage` $stage attr-dict `:` type($iv) `->` type($result)
Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `maxStage` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `stage` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
|---|---|
| `iv` | signless integer or index |
| `ub` | signless integer or index |
| `step` | signless integer or index |

Results:

| Result | Description |
|---|---|
| `result` | 1-bit signless integer |
ttg.warp_id (triton::gpu::WarpIdOp)
Return the GPU warp ID
Syntax:
operation ::= `ttg.warp_id` attr-dict
This operation returns the GPU warp ID. This can translate to reading a hardware register where one exists, or simply the thread ID divided by the warp size.
On the NVIDIA backend, the omitUniformHint attribute indicates whether to
omit emitting nvvm.shfl.sync idx 0 when lowering to LLVM.
Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `omitUniformHint` | ::mlir::UnitAttr | unit attribute |

Results:

| Result | Description |
|---|---|
|  | 32-bit signless integer |
ttg.warp_return (triton::gpu::WarpReturnOp)
Implicit terminator from partition regions
Syntax:
operation ::= `ttg.warp_return` attr-dict
The ttg.warp_return operation is the implicit terminator that ends the
partition regions of a ttg.warp_specialize op. It has no operands as these
regions cannot return anything.
TODO: Support returning uniform values from partition regions.
Traits: AlwaysSpeculatableImplTrait, HasParent<WarpSpecializePartitionsOp>, ReturnLike, Terminator, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
ttg.warp_specialize (triton::gpu::WarpSpecializeOp)
Asynchronously execute code on multiple warpgroups
The ttg.warp_specialize op represents executing different code
simultaneously on different warp groups. A warp group is a group of a
power-of-2 number of warps, which may differ from the number of warps in the
enclosing region.
The “default” region of the op represents the code executed by the currently executing warp group. This region is allowed to implicitly capture. The op contains a number of “partition” regions that are isolated from above. They must be isolated because these regions represent different layout domains, as the number of warps is different.
Semantically, execution of each region starts simultaneously for each warp group, and all warp groups are joined at the end of the op.
Example:

```mlir
%0 = ttg.warp_specialize(%a, %b)
default {
  %out = some_operation(%a) // implicit capture of `%a`
  ttg.warp_yield %out : i32
}
partition0(%arg0: i32, %arg1: i32) num_warps(8) {
  some_async_dispatch(%arg0, %arg1)
  ttg.warp_return
}
partition1(%arg0: i32, %arg1: i32) num_warps(1) {
  some_async_dispatch(%arg0, %arg1)
  ttg.warp_return
} : (i32, i32) -> i32
```
Traits: AsyncRegions, RecursiveMemoryEffects, RecursivelySpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, RegionBranchOpInterface
Attributes:

| Attribute | MLIR Type | Description |
|---|---|---|
| `partitionNumWarps` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
| `warpGroupStartIds` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
| `requestedRegisters` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
| `actualRegisters` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |

Results:

| Result | Description |
|---|---|
|  | variadic of any type |
ttg.warp_specialize.partitions (triton::gpu::WarpSpecializePartitionsOp)
Container op for ttg.warp_specialize
Because MLIR requires entire operations be isolated from above, this op
contains the actual isolated from above regions of ttg.warp_specialize.
Traits: HasParent<WarpSpecializeOp>, IsolatedFromAbove, RecursiveMemoryEffects, RecursivelySpeculatableImplTrait, Terminator, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, RegionBranchOpInterface
Operands:

| Operand | Description |
|---|---|
|  | variadic of any type |
ttg.warp_yield (triton::gpu::WarpYieldOp)
Yield from the default region of ttg.warp_specialize
Syntax:
operation ::= `ttg.warp_yield` ($values^)? attr-dict (`:` type($values)^)?
The ttg.warp_yield operation is the terminator for the “default” region of
a ttg.warp_specialize operation. The operands are passed transparently as
the SSA results of the ttg.warp_specialize operation.
Example:

```mlir
ttg.warp_yield %a, %b : i32, tensor<32xbf16, #blocked>
```
Traits: AlwaysSpeculatableImplTrait, HasParent<WarpSpecializeOp>, ReturnLike, Terminator, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
Operands:

| Operand | Description |
|---|---|
| `values` | variadic of any type |