TritonGPUOps

triton_gpu.async_commit_group (triton::gpu::AsyncCommitGroupOp)

Async commit group

Syntax:

operation ::= `triton_gpu.async_commit_group` $inputTokens attr-dict

Traits: VerifyTensorLayoutsTrait

Interfaces: InferTypeOpInterface

Operands:

| Operand | Description |
| --- | --- |
| `inputTokens` | variadic of async token type |

Results:

| Result | Description |
| --- | --- |
| `asyncToken` | async token type |

triton_gpu.async_copy_global_to_local (triton::gpu::AsyncCopyGlobalToLocalOp)

Copy data from global memory to local memory asynchronously

Syntax:

operation ::= `triton_gpu.async_copy_global_to_local` $src `,` $result (`mask` $mask^)? (`other` $other^)?
              oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
              attr-dict `:` type($src) `->` type($result)

This operation copies data from global memory to local memory asynchronously. It is analogous to tt.load, except that the data are copied to the local memory pointed to by the memory descriptor instead of into a distributed tensor. The rest of the operands are the same as for tt.load.
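The mask and other operands inherited from tt.load can be modelled in a few lines of plain Python. This is only a toy, synchronous sketch: the function name `masked_copy_to_local` and the list-based "memory" are hypothetical, and asynchrony, tokens, layouts, and cache hints are ignored. It assumes the usual tt.load convention that an element whose mask bit is false takes the corresponding `other` value. When no `other` is given, the sketch leaves the destination element untouched, which is a modelling choice and not something the text above specifies.

```python
from typing import List, Optional


def masked_copy_to_local(src: List[float],
                         dst: List[float],
                         mask: Optional[List[bool]] = None,
                         other: Optional[List[float]] = None) -> None:
    """Toy model: copy src into dst element by element, honouring mask/other."""
    for i, value in enumerate(src):
        if mask is None or mask[i]:
            dst[i] = value        # element is copied from "global memory"
        elif other is not None:
            dst[i] = other[i]     # masked-off element takes the fallback value
        # else: masked off with no fallback; this sketch leaves dst[i] as it was


global_data = [1.0, 2.0, 3.0, 4.0]
local_buf = [0.0] * 4
masked_copy_to_local(global_data, local_buf,
                     mask=[True, False, True, False],
                     other=[-1.0] * 4)
print(local_buf)
```

In the real operation the copy completes later, so the returned token has to be committed and waited on before the buffer is read.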

Traits: AttrSizedOperandSegments, VerifyTensorLayoutsTrait

Interfaces: InferTypeOpInterface, MemoryEffectOpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `cache` | `::mlir::triton::CacheModifierAttr` | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7. Enum cases: none (`NONE`), ca (`CA`), cg (`CG`), wb (`WB`), cs (`CS`), wt (`WT`), cv (`CV`) |
| `evict` | `::mlir::triton::EvictionPolicyAttr` | allowed 32-bit signless integer cases: 1, 2, 3. Enum cases: evict_normal (`NORMAL`), evict_first (`EVICT_FIRST`), evict_last (`EVICT_LAST`) |
| `isVolatile` | `::mlir::BoolAttr` | bool attribute |

Operands:

| Operand | Description |
| --- | --- |
| `src` | ranked tensor of ptr values |
| `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `mask` | tensor of 1-bit signless integer values |
| `other` | floating-point or ranked tensor of floating-point values or integer or ranked tensor of integer values or ptr or ranked tensor of ptr values or ptr |

Results:

| Result | Description |
| --- | --- |
| `token` | async token type |

triton_gpu.async_wait (triton::gpu::AsyncWaitOp)

Async wait

Syntax:

operation ::= `triton_gpu.async_wait` $asyncToken attr-dict

Traits: VerifyTensorLayoutsTrait

Interfaces: InferTypeOpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `num` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| --- | --- |
| `asyncToken` | variadic of async token type |

Results:

| Result | Description |
| --- | --- |
| `retToken` | async token type |

triton_gpu.convert_layout (triton::gpu::ConvertLayoutOp)

Convert layout

Syntax:

operation ::= `triton_gpu.convert_layout` $src attr-dict `:` type($src) `->` type($result)

Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType, SameOperandsAndResultShape, VerifyTensorLayoutsTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| --- | --- |
| `src` | ranked tensor of floating-point or integer or ptr values |

Results:

| Result | Description |
| --- | --- |
| `result` | ranked tensor of floating-point or integer or ptr values |

triton_gpu.global_scratch_alloc (triton::gpu::GlobalScratchAllocOp)

Allocate a global memory buffer

Syntax:

operation ::= `triton_gpu.global_scratch_alloc` attr-dict `:` qualified(type($result))

This operation allocates a buffer in global memory that is private to the current program.

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::triton::GlobalMemory}

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `nbytes` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |
| `alignment` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |

Results:

| Result | Description |
| --- | --- |
| `result` | ptr |

triton_gpu.local_alloc (triton::gpu::LocalAllocOp)

Allocate tensor

Syntax:

operation ::= `triton_gpu.local_alloc` $src attr-dict `:` functional-type(operands, results)

This operation allocates a buffer in shared memory and returns a descriptor containing the address and a view of the buffer.

Explicitly deallocating a buffer is optional; see local_dealloc.

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `alignment` | `::mlir::IntegerAttr` | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| --- | --- |
| `src` | ranked tensor of floating-point or integer or ptr values |

Results:

| Result | Description |
| --- | --- |
| `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |

triton_gpu.local_dealloc (triton::gpu::LocalDeallocOp)

Dealloc buffer

Syntax:

operation ::= `triton_gpu.local_dealloc` $src attr-dict `:` qualified(type($src))

This operation deallocates a buffer explicitly. Using the buffer after this operation is undefined.

This operation is optional. If you don’t explicitly dealloc a buffer, the compiler assumes it’s deallocated at the first point that post-dominates all uses of the alloc.

Because we assume a memdesc is dead at the first point that post-dominates its uses, ops that wait for an async operation on a memdesc to complete (such as triton_nvidia_gpu.warp_group_dot_wait) should also take the memdesc as an operand.
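The effect of this rule can be illustrated with a small Python model restricted to straight-line code, where the first point that post-dominates every use is simply the point right after the last use. Everything here is hypothetical: the op names, the `(name, used values)` tuples, and the helper `implicit_dealloc_index` are illustration only and do not correspond to any compiler API; real control flow needs a proper post-dominator analysis.

```python
from typing import List, Tuple

Op = Tuple[str, List[str]]  # (op name, names of the values it uses)


def implicit_dealloc_index(ops: List[Op], buffer: str) -> int:
    """Return the index just past the last op that uses `buffer`.

    Only valid for straight-line code: there, the first point that
    post-dominates all uses is the point right after the last use.
    """
    last_use = max(i for i, (_, used) in enumerate(ops) if buffer in used)
    return last_use + 1


# The wait does not list the buffer, so the buffer looks dead at index 2,
# before the wait has run and while the async op may still be reading it.
without_operand: List[Op] = [
    ("alloc", []),            # defines %buf
    ("async_dot", ["%buf"]),  # starts an async read of %buf
    ("wait", []),
]

# The wait lists the buffer, so the buffer stays live until after the wait.
with_operand: List[Op] = [
    ("alloc", []),
    ("async_dot", ["%buf"]),
    ("wait", ["%buf"]),
]

print(implicit_dealloc_index(without_operand, "%buf"))
print(implicit_dealloc_index(with_operand, "%buf"))
```

In the first program the implicit deallocation point falls between the async op and its wait, so the allocator would be free to reuse the memory too early; passing the memdesc to the wait moves that point past it.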

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Free on ::mlir::triton::gpu::SharedMemory}

Operands:

| Operand | Description |
| --- | --- |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |

triton_gpu.local_load (triton::gpu::LocalLoadOp)

Load a buffer from local memory into a distributed tensor

Syntax:

operation ::= `triton_gpu.local_load` $src (`token` $token^)? attr-dict `:` qualified(type($src)) `->` type($result)

Load a tensor from the local memory descriptor into a distributed tensor.

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface

Operands:

| Operand | Description |
| --- | --- |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `token` | async token type |

Results:

| Result | Description |
| --- | --- |
| `result` | ranked tensor of floating-point or integer or ptr values |

triton_gpu.local_store (triton::gpu::LocalStoreOp)

Store a distributed tensor into a buffer in local memory

Syntax:

operation ::= `triton_gpu.local_store` $src `,` $dst attr-dict `:` type($src) `->` qualified(type($dst))

Store a distributed tensor into a buffer in local memory.

Traits: VerifyTensorLayoutsTrait

Interfaces: MemoryEffectOpInterface

Operands:

| Operand | Description |
| --- | --- |
| `src` | ranked tensor of floating-point or integer or ptr values |
| `dst` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |

triton_gpu.memdesc_subview (triton::gpu::MemDescSubviewOp)

Take a subview of the descriptor.

Syntax:

operation ::= `triton_gpu.memdesc_subview` $src `[` $offsets `]` attr-dict `:` qualified(type($src)) `->` qualified(type($result))

This operation returns a new descriptor representing a subview of the buffer. It doesn’t affect the underlying memory. The subview can be rank-reduced.

For example, suppose that

  • the input shape is 2x4x16xf16,

  • the output shape is 4x4xf16, and

  • offsets = [1, 0, 4].

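Then, in NumPy-style indexing, the subview covers input[1, 0:4, 4:8]: the leading dimension is pinned to index 1 and dropped, and the remaining two dimensions are sliced to 4 elements each.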
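The example above can be checked with a short Python sketch that enumerates the source coordinates a subview covers. The helper `subview_indices` is hypothetical, and the sketch assumes that the dropped dimensions are the leading ones and that every kept dimension spans `[offset, offset + size)`; the text above only shows this for one case.

```python
import itertools
from typing import List, Tuple


def subview_indices(src_shape: List[int], offsets: List[int],
                    out_shape: List[int]) -> List[Tuple[int, ...]]:
    """List the source coordinates covered by a (possibly rank-reduced) subview.

    Assumption of this sketch: the leading len(src_shape) - len(out_shape)
    dimensions are pinned to their offset and dropped; each remaining
    dimension covers [offset, offset + size).
    """
    assert len(offsets) == len(src_shape), "one offset per source dimension"
    dropped = len(src_shape) - len(out_shape)
    ranges = [range(o, o + 1) for o in offsets[:dropped]]
    ranges += [range(o, o + n) for o, n in zip(offsets[dropped:], out_shape)]
    for r, dim in zip(ranges, src_shape):
        assert r.stop <= dim, "subview must stay inside the source buffer"
    return list(itertools.product(*ranges))


coords = subview_indices([2, 4, 16], offsets=[1, 0, 4], out_shape=[4, 4])
print(len(coords), coords[0], coords[-1])
```

The 16 coordinates run from (1, 0, 4) to (1, 3, 7), which matches a 4x4 window into the second 4x16 slice of the buffer.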

Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| --- | --- |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `offsets` | variadic of 32-bit signless integer |

Results:

| Result | Description |
| --- | --- |
| `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |

triton_gpu.memdesc_trans (triton::gpu::MemDescTransOp)

Transpose the descriptor

Syntax:

operation ::= `triton_gpu.memdesc_trans` $src attr-dict `:` qualified(type($src)) `->` qualified(type($result))

This operation returns a new descriptor representing a transposed view of the buffer.
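As a rough illustration of what "a transposed view" means, the sketch below permutes a shape and a set of strides without touching any data. It is a model only: the helper `transposed_view` is hypothetical, explicit strides are a device of this sketch (memory descriptors describe layout through encodings), and it assumes the `order` attribute follows the NumPy transpose convention, where result dimension i is source dimension order[i].

```python
from typing import List, Tuple


def transposed_view(shape: List[int], strides: List[int],
                    order: List[int]) -> Tuple[List[int], List[int]]:
    """Permute shape and strides; no data moves, only the view changes."""
    assert sorted(order) == list(range(len(shape))), "order must be a permutation"
    new_shape = [shape[d] for d in order]
    new_strides = [strides[d] for d in order]
    return new_shape, new_strides


# A row-major 4x16 buffer viewed as 16x4.
shape, strides = transposed_view([4, 16], [16, 1], order=[1, 0])
print(shape, strides)
```

Element (i, j) of the new view sits at the same address as element (j, i) of the original one, which is why the operation has no memory effects.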

Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType, VerifyTensorLayoutsTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TransposeOpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `order` | `::mlir::DenseI32ArrayAttr` | i32 dense array attribute |

Operands:

| Operand | Description |
| --- | --- |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |

Results:

| Result | Description |
| --- | --- |
| `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |

triton_gpu.upcast_mxfp (triton::gpu::UpcastMXFPOp)

Convert an mxfp tensor to bf16

Syntax:

operation ::= `triton_gpu.upcast_mxfp` $src `,` $scale  `fp_type` `=` $fp_type attr-dict `:` type($src) `,` type($scale) `->` type($result)

Compute the bf16 encoded in the given mxfp number as per https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
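For orientation, the arithmetic behind an MX upcast can be sketched in Python for the e2m1 element type. In the MX specification, each block of elements shares one E8M0 scale that encodes a power of two with bias 127, and an E2M1 element has one sign bit, two exponent bits (bias 1), and one mantissa bit. The function names below are hypothetical, the result is an ordinary Python float and is not rounded to bf16, and the sketch does not model how the operation packs elements into its source tensor or which layouts it requires.

```python
import math
from typing import List


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 value: 1 sign, 2 exponent (bias 1), 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        magnitude = 0.5 * mantissa  # subnormal: 0 or 0.5
    else:
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


def decode_e8m0(scale: int) -> float:
    """Decode an 8-bit E8M0 shared scale: 2**(scale - 127), 0xFF is NaN."""
    if scale == 0xFF:
        return math.nan
    return 2.0 ** (scale - 127)


def upcast_block(codes: List[int], scale: int) -> List[float]:
    """Multiply every decoded element by the block's shared scale."""
    factor = decode_e8m0(scale)
    return [decode_e2m1(c) * factor for c in codes]


values = upcast_block([0b0001, 0b0111, 0b1010], scale=128)
print(values)
```

With a scale byte of 128 the shared factor is 2, so the three codes decode to 0.5, 6.0, and -1.0 and are then doubled.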

Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| `fp_type` | `::mlir::triton::ScaleDotElemTypeAttr` | allowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5. Enum cases: e4m3 (`E4M3`), e5m2 (`E5M2`), e2m3 (`E2M3`), e3m2 (`E3M2`), e2m1 (`E2M1`), bf16 (`BF16`) |

Operands:

| Operand | Description |
| --- | --- |
| `src` | ranked tensor of floating-point or integer or ptr values |
| `scale` | ranked tensor of floating-point or integer or ptr values |

Results:

| Result | Description |
| --- | --- |
| `result` | ranked tensor of floating-point or integer or ptr values |