TritonGPUOps
ttg.async_commit_group (triton::gpu::AsyncCommitGroupOp)
Async commit group
Syntax:
operation ::= `ttg.async_commit_group` $inputTokens attr-dict
Traits: VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface
Operands:
Operand | Description |
---|---|
inputTokens | variadic of async token type |
Results:
Result | Description |
---|---|
| | async token type |
ttg.async_copy_global_to_local (triton::gpu::AsyncCopyGlobalToLocalOp)
Copy data from global memory to local memory asynchronously
Syntax:
operation ::= `ttg.async_copy_global_to_local` $src `,` $result (`mask` $mask^)? (`other` $other^)?
oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
attr-dict `:` type($src) `->` type($result)
This operation copies data from global memory to local memory asynchronously. It is analogous to tt.load, except that the data is copied into the local memory pointed to by the memory descriptor instead of into a distributed tensor. The remaining operands are the same as for tt.load.
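The interplay between asynchronous copies, commit groups, and waits can be pictured with a small Python toy model. This is only a sketch, not Triton code: the class `AsyncPipeline` and its methods are hypothetical names, and it assumes that the `num` attribute of ttg.async_wait means "return once at most `num` committed groups are still pending", mirroring PTX `cp.async.wait_group`. The reference text above does not state that meaning.

```python
class AsyncPipeline:
    """Toy model of async copy / commit group / wait (hypothetical, not a Triton API)."""

    def __init__(self):
        self.open_group = []  # copies issued since the last commit
        self.pending = []     # committed groups, oldest first
        self.local = {}       # destination buffers whose data has landed

    def async_copy(self, dst, data):
        # The copy is only recorded here; nothing is visible in `local` yet.
        self.open_group.append((dst, data))

    def commit_group(self):
        # Bundle every copy issued so far into one group.
        self.pending.append(self.open_group)
        self.open_group = []

    def wait(self, num):
        # ASSUMPTION: complete the oldest groups until at most `num` remain pending.
        while len(self.pending) > num:
            for dst, data in self.pending.pop(0):
                self.local[dst] = data


pipe = AsyncPipeline()
pipe.async_copy("buf_a", [1, 2])
pipe.commit_group()
pipe.async_copy("buf_b", [3, 4])
pipe.commit_group()
pipe.wait(1)  # the oldest group must be complete, the newest may still be in flight
landed_after_first_wait = sorted(pipe.local)
pipe.wait(0)  # everything must be complete
landed_after_second_wait = sorted(pipe.local)
```

Under this model a buffer is only safe to read once the wait that covers its group has returned, which is why the ops thread async tokens from the copy through the commit and wait to the consumer.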
Traits: AttrSizedOperandSegments, VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
cache | ::mlir::triton::CacheModifierAttr | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7 |
evict | ::mlir::triton::EvictionPolicyAttr | allowed 32-bit signless integer cases: 1, 2, 3 |
isVolatile | ::mlir::BoolAttr | bool attribute |
Operands:
Operand | Description |
---|---|
src | ranked tensor of ptr values |
result | memory descriptor type |
mask | tensor of 1-bit signless integer values |
other | floating-point or ranked tensor of floating-point values or integer or ranked tensor of integer values or ptr or ranked tensor of ptr values or ptr |
Results:
Result | Description |
---|---|
| | async token type |
ttg.async_wait (triton::gpu::AsyncWaitOp)
Async wait
Syntax:
operation ::= `ttg.async_wait` $asyncToken attr-dict
Traits: VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
num | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
asyncToken | variadic of async token type |
Results:
Result | Description |
---|---|
| | async token type |
ttg.convert_layout (triton::gpu::ConvertLayoutOp)
Convert layout
Syntax:
operation ::= `ttg.convert_layout` $src attr-dict `:` type($src) `->` type($result)
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType, SameOperandsAndResultShape, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:
Operand | Description |
---|---|
src | ranked tensor of floating-point or integer or ptr values |
Results:
Result | Description |
---|---|
result | ranked tensor of floating-point or integer or ptr values |
ttg.fp4_to_fp (triton::gpu::Fp4ToFpOp)
Upcast fp4 (e2m1) to fp
Syntax:
operation ::= `ttg.fp4_to_fp` $src attr-dict `:` type($src) `->` type($result)
Upcast fp4 (e2m1) values, stored packed two per i8, to a floating-point type.
The lower 4 bits of each i8 hold the first fp4 element, and the upper 4 bits hold the second fp4 element.
The axis attribute specifies the axis along which the fp4 elements are packed.
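The nibble layout described above can be illustrated with a short Python sketch. It is not the Triton lowering: the helper names `decode_e2m1` and `unpack_fp4` are made up for illustration, the sketch handles only a 1-D list of bytes (so the packed axis is the only axis), and it assumes the usual e2m1 encoding of 1 sign bit, 2 exponent bits, and 1 mantissa bit.

```python
# Magnitudes of the eight non-negative e2m1 codes
# (ASSUMPTION: 1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble):
    """Decode one 4-bit e2m1 code into a Python float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def unpack_fp4(packed_bytes):
    """Expand each byte into two floats: low nibble first, then high nibble."""
    out = []
    for byte in packed_bytes:
        out.append(decode_e2m1(byte & 0x0F))         # first fp4 element
        out.append(decode_e2m1((byte >> 4) & 0x0F))  # second fp4 element
    return out


values = unpack_fp4([0x21, 0xF7])
```

Each input byte yields two outputs, so in this 1-D sketch the result is twice as long as the input.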
Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
axis | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
src | ranked tensor of 8-bit signless integer values |
Results:
Result | Description |
---|---|
result | ranked tensor of floating-point values |
ttg.global_scratch_alloc (triton::gpu::GlobalScratchAllocOp)
Allocate a global memory buffer
Syntax:
operation ::= `ttg.global_scratch_alloc` attr-dict `:` qualified(type($result))
This operation allocates a buffer in global memory that is private to the current program.
Traits: VerifyTensorLayoutsTrait
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
nbytes | ::mlir::IntegerAttr | 32-bit signless integer attribute |
alignment | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Results:
Result | Description |
---|---|
result | ptr |
ttg.local_alloc (triton::gpu::LocalAllocOp)
Allocate tensor
Syntax:
operation ::= `ttg.local_alloc` ($src^)? attr-dict `:` functional-type(operands, results)
This operation allocates a buffer in shared memory and returns a descriptor containing the address and a view of the buffer.
Explicitly deallocating a buffer is optional; see local_dealloc.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
alignment | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
src | ranked tensor of floating-point or integer or ptr values |
Results:
Result | Description |
---|---|
| | memory descriptor type |
ttg.local_dealloc (triton::gpu::LocalDeallocOp)
Dealloc buffer
Syntax:
operation ::= `ttg.local_dealloc` $src attr-dict `:` qualified(type($src))
This operation deallocates a buffer explicitly. Using the buffer after this operation is undefined.
This operation is optional. If you don’t explicitly dealloc a buffer, the compiler assumes it’s deallocated at the first point that post-dominates all uses of the alloc.
Because we assume a memdesc is dead at the first point that post-dominates its uses, ops that wait for an async operation on a memdesc to complete (such as ttng.warp_group_dot_wait) should also take the memdesc as an operand.
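The "first point that post-dominates all uses" rule can be made concrete with a block-level Python sketch. It is a simplification under stated assumptions, not the compiler's analysis: the functions `post_dominators` and `implicit_dealloc_block` and the example control-flow graph are hypothetical, the graph has a single exit block, every other block has at least one successor, and the real rule works on individual operations rather than whole blocks.

```python
def post_dominators(succ, exit_block):
    """Iteratively compute, for every block, the set of blocks that post-dominate it."""
    blocks = set(succ)
    pdom = {b: set(blocks) for b in blocks}
    pdom[exit_block] = {exit_block}
    changed = True
    while changed:
        changed = False
        for b in blocks - {exit_block}:
            # A block is post-dominated by itself and by whatever
            # post-dominates all of its successors.
            new = {b} | set.intersection(*(pdom[s] for s in succ[b]))
            if new != pdom[b]:
                pdom[b] = new
                changed = True
    return pdom


def implicit_dealloc_block(succ, exit_block, use_blocks):
    """Return the nearest block that post-dominates every block containing a use."""
    pdom = post_dominators(succ, exit_block)
    common = set.intersection(*(pdom[u] for u in use_blocks))
    # Common post-dominators form a chain; the nearest one is post-dominated
    # by all the others, so it has the largest post-dominator set.
    return max(common, key=lambda b: len(pdom[b]))


# Hypothetical CFG: A branches to B or C, both rejoin at D, then exit.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["exit"], "exit": []}
both_branches = implicit_dealloc_block(cfg, "exit", ["B", "C"])
single_branch = implicit_dealloc_block(cfg, "exit", ["B"])
```

With uses on both branches, the buffer is treated as live until the join block `D`; with a use only in `B`, it can be treated as dead at the end of `B`. An asynchronous operation that is still reading the buffer past that point is invisible to this analysis unless the waiting op also names the memdesc as an operand.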
Traits: VerifyTensorLayoutsTrait
Operands:
Operand | Description |
---|---|
src | memory descriptor type |
ttg.local_load (triton::gpu::LocalLoadOp)
Load a buffer from local memory into a distributed tensor
Syntax:
operation ::= `ttg.local_load` $src (`token` $token^)? attr-dict `:` qualified(type($src)) `->` type($result)
Load a tensor from the local memory descriptor into a distributed tensor.
Traits: VerifyTensorLayoutsTrait
Operands:
Operand | Description |
---|---|
src | memory descriptor type |
token | async token type |
Results:
Result | Description |
---|---|
result | ranked tensor of floating-point or integer or ptr values |
ttg.local_store (triton::gpu::LocalStoreOp)
Store a distributed tensor into a buffer in local memory
Syntax:
operation ::= `ttg.local_store` $src `,` $dst attr-dict `:` type($src) `->` qualified(type($dst))
Store a distributed tensor into a buffer in local memory.
Traits: VerifyTensorLayoutsTrait
Operands:
Operand | Description |
---|---|
src | ranked tensor of floating-point or integer or ptr values |
dst | memory descriptor type |
ttg.memdesc_subview (triton::gpu::MemDescSubviewOp)
Take a subview of the descriptor.
Syntax:
operation ::= `ttg.memdesc_subview` $src `[` $offsets `]` attr-dict `:` qualified(type($src)) `->` qualified(type($result))
This operation returns a new descriptor representing a subview of the buffer. It doesn’t affect the underlying memory. The subview can be rank-reduced.
For example, suppose that
the input shape is 2x4x16xf16,
the output shape is 4x4xf16, and
offsets = [1, 0, 4].
Then, in NumPy-style indexing, the subview covers input[1, 0:4, 4:8].
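The example can be checked with a small pure-Python sketch that uses nested lists in place of a buffer. The helper `subview` is hypothetical. Unlike the real op it copies elements instead of creating a view, and it assumes that the dimensions dropped by rank reduction are the leading ones and that each dropped dimension has extent 1.

```python
def subview(buf, offsets, out_shape):
    """Return the elements covered by a (possibly rank-reduced) subview of nested lists."""
    rank_in, rank_out = len(offsets), len(out_shape)
    dropped = rank_in - rank_out
    # ASSUMPTION: rank reduction drops leading dimensions, each with extent 1.
    extents = [1] * dropped + list(out_shape)

    def take(node, dim):
        if dim == rank_in:
            return node
        lo, n = offsets[dim], extents[dim]
        rows = [take(node[i], dim + 1) for i in range(lo, lo + n)]
        # Collapse the dimensions removed by the rank reduction.
        return rows[0] if dim < dropped else rows

    return take(buf, 0)


# A 2x4x16 buffer whose elements record their own coordinates.
buf = [[[(i, j, k) for k in range(16)] for j in range(4)] for i in range(2)]
view = subview(buf, offsets=[1, 0, 4], out_shape=[4, 4])
```

Here `view[0][0]` is the element at input coordinates (1, 0, 4) and `view[3][3]` is the one at (1, 3, 7), matching the 4x4 region described in the example.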
Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:
Operand | Description |
---|---|
src | memory descriptor type |
offsets | variadic of 32-bit signless integer |
Results:
Result | Description |
---|---|
result | memory descriptor type |
ttg.memdesc_trans (triton::gpu::MemDescTransOp)
Transpose the descriptor
Syntax:
operation ::= `ttg.memdesc_trans` $src attr-dict `:` qualified(type($src)) `->` qualified(type($result))
This operation returns a new descriptor representing a transposed view of the buffer.
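A transposed view changes how indices map onto the buffer without moving any data. The Python sketch below shows that index bookkeeping. The helper names are hypothetical, and it assumes that the order attribute follows the NumPy `transpose` convention (result dimension `i` is source dimension `order[i]`), which the reference text here does not spell out.

```python
def transposed_shape(shape, order):
    """Shape of the transposed view: result dim i takes source dim order[i]."""
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of the dimension indices")
    return [shape[i] for i in order]


def source_index(view_index, order):
    """Map an index into the transposed view back to an index into the source buffer."""
    src = [0] * len(order)
    for view_dim, src_dim in enumerate(order):
        src[src_dim] = view_index[view_dim]
    return src


shape_after = transposed_shape([1, 2, 4], order=[2, 0, 1])
index_in_source = source_index([3, 0, 1], order=[2, 0, 1])
```

Because only the index mapping changes, the op can be free of memory effects: reading element `[3, 0, 1]` of the view touches the same storage as element `[0, 1, 3]` of the source under this convention.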
Traits: AlwaysSpeculatableImplTrait, InferTypeOpAdaptor, SameOperandsAndResultElementType, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TransposeOpInterface
Effects: MemoryEffects::Effect{}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
order | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
Operands:
Operand | Description |
---|---|
src | memory descriptor type |
Results:
Result | Description |
---|---|
result | memory descriptor type |
ttg.warp_return (triton::gpu::WarpReturnOp)
Implicit terminator from partition regions
Syntax:
operation ::= `ttg.warp_return` attr-dict
The ttg.warp_return operation is the implicit terminator that ends the partition regions of a ttg.warp_specialize op. It has no operands because these regions cannot return anything.
TODO: Support returning uniform values from partition regions.
Traits: AlwaysSpeculatableImplTrait, HasParent<WarpSpecializePartitionsOp>, ReturnLike, Terminator, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
ttg.warp_specialize (triton::gpu::WarpSpecializeOp)
Asynchronously execute code on multiple warpgroups
The ttg.warp_specialize op represents executing different code simultaneously on different warp groups. A warp group is a group of warps whose size is a power of two; it can contain a different number of warps than the enclosing region.
The “default” region of the op represents the code executed by the currently executing warp group. This region is allowed to implicitly capture. The op contains a number of “partition” regions that are isolated from above. They must be isolated because these regions represent different layout domains, as the number of warps is different.
Semantically, execution of each region starts simultaneously for each warp group, and all warp groups are joined at the end of the op.
Example:
%0 = ttg.warp_specialize(%a, %b)
default {
  %out = some_operation(%a) // implicit capture of `%a`
  ttg.warp_yield %out : i32
}
partition0(%arg0: i32, %arg1: i32) num_warps(8) {
  some_async_dispatch(%arg0, %arg1)
  ttg.warp_return
}
partition1(%arg0: i32, %arg1: i32) num_warps(1) {
  some_async_dispatch(%arg0, %arg1)
  ttg.warp_return
} : (i32, i32) -> i32
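The fork-and-join behaviour described above (all regions start together, only the default region produces results, and every warp group is joined at the end) can be mimicked on the CPU with Python threads. This is purely an analogy under stated assumptions: the function `warp_specialize` and the helper `make_partition` are hypothetical, and threads stand in for warp groups.

```python
import threading


def warp_specialize(default_fn, partition_fns, captures):
    """Toy model: run the default region and all partitions concurrently, then join."""
    # One barrier party per partition plus one for the default region.
    start = threading.Barrier(len(partition_fns) + 1)

    def run_partition(fn):
        start.wait()   # all regions begin together
        fn(*captures)  # partitions see only the explicitly passed values

    threads = [threading.Thread(target=run_partition, args=(fn,))
               for fn in partition_fns]
    for t in threads:
        t.start()

    start.wait()
    result = default_fn()  # plays the role of the values given to ttg.warp_yield
    for t in threads:      # all warp groups are joined at the end of the op
        t.join()
    return result


log = []
lock = threading.Lock()


def make_partition(name):
    def partition(arg0, arg1):
        with lock:
            log.append((name, arg0 + arg1))
    return partition


a, b = 3, 4
out = warp_specialize(lambda: a * 2,  # the closure captures `a` implicitly
                      [make_partition("partition0"), make_partition("partition1")],
                      (a, b))
```

The default region is an ordinary closure and so can capture `a` implicitly, while the partitions receive `a` and `b` only as arguments, loosely mirroring the isolated-from-above requirement on partition regions.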
Traits: AsyncRegions, RecursiveMemoryEffects, RecursivelySpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, RegionBranchOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
partitionNumWarps | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
warpGroupStartIds | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
requestedRegisters | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
actualRegisters | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
Operands:
Operand | Description |
---|---|
| | variadic of any type |
Results:
Result | Description |
---|---|
| | variadic of any type |
ttg.warp_specialize.partitions (triton::gpu::WarpSpecializePartitionsOp)
Container op for ttg.warp_specialize
Because MLIR requires that entire operations, rather than individual regions, be isolated from above, this op contains the actual isolated-from-above regions of ttg.warp_specialize.
Traits: HasParent<WarpSpecializeOp>, IsolatedFromAbove, RecursiveMemoryEffects, RecursivelySpeculatableImplTrait, Terminator, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable
ttg.warp_yield (triton::gpu::WarpYieldOp)
Yield from the default region of ttg.warp_specialize
Syntax:
operation ::= `ttg.warp_yield` ($values^)? attr-dict (`:` type($values)^)?
The ttg.warp_yield operation is the terminator for the “default” region of a ttg.warp_specialize operation. The operands are passed transparently as the SSA results of the ttg.warp_specialize operation.
Example:
ttg.warp_yield %a, %b : i32, tensor<32xbf16, #blocked>
Traits: AlwaysSpeculatableImplTrait, HasParent<WarpSpecializeOp>, ReturnLike, Terminator, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
Operands:
Operand | Description |
---|---|
values | variadic of any type |