TritonGPUOps

triton_gpu.alloc_tensor (::mlir::triton::gpu::AllocTensorOp)

allocate tensor

Syntax:

operation ::= `triton_gpu.alloc_tensor` attr-dict `:` type($result)

This operation defines a tensor of a particular shape. The contents of the tensor are supposed to be in shared memory.

Note: This op can be repalced to a bufferization.alloc_tensor in LLVM 16.

Traits: ResultsAreSharedEncoding

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource}

Results:

Result

Description

result

tensor of floating-point values or tensor of integer values or tensor of ptr values

triton_gpu.async_bulk_commit_group (::mlir::triton::gpu::AsyncBulkCommitGroupOp)

async bulk commit group

Syntax:

operation ::= `triton_gpu.async_bulk_commit_group` attr-dict

triton_gpu.async_bulk_wait (::mlir::triton::gpu::AsyncBulkWaitOp)

async bulk wait

Syntax:

operation ::= `triton_gpu.async_bulk_wait` attr-dict

Attributes:

Attribute

MLIR Type

Description

num

::mlir::IntegerAttr

32-bit signless integer attribute

triton_gpu.async_commit_group (::mlir::triton::gpu::AsyncCommitGroupOp)

async commit group

Syntax:

operation ::= `triton_gpu.async_commit_group` attr-dict

triton_gpu.async_wait (::mlir::triton::gpu::AsyncWaitOp)

async wait

Syntax:

operation ::= `triton_gpu.async_wait` attr-dict

Attributes:

Attribute

MLIR Type

Description

num

::mlir::IntegerAttr

32-bit signless integer attribute

triton_gpu.cmpf (::mlir::triton::gpu::CmpFOp)

floating-point comparison operation

Traits: AlwaysSpeculatableImplTrait, Elementwise, SameOperandsAndResultEncoding, SameOperandsAndResultShape

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute

MLIR Type

Description

predicate

::mlir::arith::CmpFPredicateAttr

allowed 64-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

Operands:

Operand

Description

lhs

floating-point or tensor of floating-point values

rhs

floating-point or tensor of floating-point values

Results:

Result

Description

result

1-bit signless integer or tensor of 1-bit signless integer values

triton_gpu.cmpi (::mlir::triton::gpu::CmpIOp)

integer comparison operation

Traits: AlwaysSpeculatableImplTrait, Elementwise, SameOperandsAndResultEncoding, SameOperandsAndResultShape

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute

MLIR Type

Description

predicate

::mlir::arith::CmpIPredicateAttr

allowed 64-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

Operands:

Operand

Description

lhs

integer or tensor of integer values

rhs

integer or tensor of integer values

Results:

Result

Description

result

1-bit signless integer or tensor of 1-bit signless integer values

triton_gpu.convert_layout (::mlir::triton::gpu::ConvertLayoutOp)

convert layout

Syntax:

operation ::= `triton_gpu.convert_layout` $src attr-dict `:` functional-type(operands, results)

Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType, SameOperandsAndResultShape

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

Operand

Description

src

tensor of floating-point values or tensor of integer values or tensor of ptr values

Results:

Result

Description

result

tensor of floating-point values or tensor of integer values or tensor of ptr values

triton_gpu.extract_slice (::mlir::triton::gpu::ExtractSliceOp)

extract slice operation

Syntax:

operation ::= `triton_gpu.extract_slice` $source ``
              custom<DynamicIndexList>($offsets, $static_offsets)
              custom<DynamicIndexList>($sizes, $static_sizes)
              custom<DynamicIndexList>($strides, $static_strides)
              attr-dict `:` type($source) `to` type($result)

same as tensor.extract_slice, but with int32 index. The motivations for re-implementing it are: We reimplement ExtractSliceOp with int32 index, because:

  • we want to enforce int32 indexing on GPUs since Triton tensors fit in SRAM

  • we still want to use indexWidth = 64 when lowering to LLVM because our loops can have 64-bit induction variables and scf.for uses indexType for bounds/ivs

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments, ResultsAreSharedEncoding

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OffsetSizeAndStrideOpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute

MLIR Type

Description

static_offsets

::mlir::DenseI64ArrayAttr

i64 dense array attribute

static_sizes

::mlir::DenseI64ArrayAttr

i64 dense array attribute

static_strides

::mlir::DenseI64ArrayAttr

i64 dense array attribute

Operands:

Operand

Description

source

ranked tensor of any type values

offsets

32-bit signless integer

sizes

32-bit signless integer

strides

32-bit signless integer

Results:

Result

Description

result

ranked tensor of any type values

triton_gpu.insert_slice_async (::mlir::triton::gpu::InsertSliceAsyncOp)

insert slice async

This operation inserts a tensor $src into another tensor $dst as specified by the operation’s $index argument and $axis attribute.

It returns a copy of $dst with the proper slice updated asynchronously with the value of $src. This operation is non-blocking, and $results will have the updated value after the corresponding async_wait.

When converting from tt.load to triton_gpu.insert_slice_async, the $evict, $cache, and $isVolatile fields might be ignored on certain hardware. For example, on NVIDIA GPUs, the cache policy is determined by the backend, and $evict and $isVolatile are ignored because they apply to L1 cache only.

The insert_slice_async operation supports the following arguments:

  • src: the tensor that is inserted.

  • dst: the tensor into which the $src tensor is inserted.

  • index: the index of the $src tensor at the given $axis from which the $dst tensor is inserted into

  • mask: optional tensor-rank number of boolean masks which specify which elements of the $src tensor are inserted into the $dst tensor.

  • other: optional tensor-rank number of other tensors which specify what values are inserted into the $dst tensor if the corresponding element of the $mask tensor is false.

In the future, we may decompose this operation into a sequence of:

  • async operation to specify a sequence of asynchronous operations

  • load operation to load a tensor from global memory

  • insert_slice operations to insert the $src tensor into the $dst tensor

Example:

%1 = triton_gpu.alloc_tensor : tensor<2x32xf32>
%2 = triton_gpu.insert_slice_async %0, %1, %index { axis = 0 } : tensor<32x!tt.ptr<f32>, #AL> -> tensor<2x32xf32, #A>
triiton_gpu.async_wait { num = 0 : i32 }

Traits: AttrSizedOperandSegments, ResultsAreSharedEncoding

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

Attributes:

Attribute

MLIR Type

Description

cache

::mlir::triton::CacheModifierAttr

allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6

evict

::mlir::triton::EvictionPolicyAttr

allowed 32-bit signless integer cases: 1, 2, 3

isVolatile

::mlir::BoolAttr

bool attribute

axis

::mlir::IntegerAttr

32-bit signless integer attribute

Operands:

Operand

Description

src

ptr or tensor of ptr values

dst

tensor of floating-point values or tensor of integer values or tensor of ptr values

index

32-bit signless integer

mask

tensor of 1-bit signless integer values

other

floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr

Results:

Result

Description

result

tensor of floating-point values or tensor of integer values or tensor of ptr values

triton_gpu.insert_slice (::mlir::triton::gpu::InsertSliceOp)

insert slice

This operation inserts a tensor $src into another tensor $dst as specified by the operation’s $index argument and $axis attribute.

It returns a copy of $dst with the proper slice updated with the value of $src.

When converting from tt.load to triton_gpu.insert_slice, the $evict, $cache, and $isVolatile fields might be ignored on certain hardware. For example, on NVIDIA GPUs, the cache policy is determined by the backend, and $evict and $isVolatile are ignored because they apply to L1 cache only.

The insert_slice operation supports the following arguments:

  • src: the tensor that is inserted.

  • dst: the tensor into which the $src tensor is inserted.

  • index: the index of the $src tensor at the given $axis from which the $dst tensor is inserted into

  • mask: optional tensor-rank number of boolean masks which specify which elements of the $src tensor are inserted into the $dst tensor.

  • other: optional tensor-rank number of other tensors which specify what values are inserted into the $dst tensor if the corresponding element of the $mask tensor is false.

ttgpu.load_tile_async depracate triton_gpu.insert_slice might be further lowered into triton_gpu_async for different hardware implementations

like tt.load, ttgpu.insert_slice/insert_slice_async has two modes up to the type of src mode 1: ptr/src is a tensor of pointers mode 2: ptr/src is a tensor pointer

Some typical lowering paths are: in case the load is pipelined by the pipeline pass( load is inside kBlock loop, which means “pipeline pass): Load from global + store to shared : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -(Pipeline)-> ttgpu.insert_slice(mode 1) Non-bulk cp.async : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -(Pipeline)-> ttgpu.insert_slice(mode 1) -(MaterializeLoad)> ttgpu.insert_slice_async(mode 1) + ttgpu.await-> llvm TMA load : tt.load(mode 2) -(tt->ttgpu+Coalesce)-> tt.load(mode 2) -(Pipeline)-> ttgpu.insert_slice(mode 2) -(MaterializeLoad)> ttgpu.insert_slice_async_v2(mode 2) + ttgpu.await-> llvm

otherwise: Load from global + store to shared : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) Non-bulk cp.async : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -> … -(MaterializeLoad)-> ttgpu.insert_slice_async(mode 1) + ttgpu.await -> llvm TMA load : tt.load(mode 2) -(tt->ttgpu+Coalesce)-> tt.load(mode 2) -> … -(MaterializeLoad)-> ttgpu.insert_slice_async(mode 2) + ttgpu.await -> llvm

Example:

%1 = triton_gpu.alloc_tensor : tensor<2x32xf32>
%2 = triton_gpu.insert_slice %0, %1, %index { axis = 0 } : tensor<32x!tt.ptr<f32>, #AL> -> tensor<2x32xf32, #A>

Traits: AttrSizedOperandSegments, ResultsAreSharedEncoding

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

Attributes:

Attribute

MLIR Type

Description

cache

::mlir::triton::CacheModifierAttr

allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6

evict

::mlir::triton::EvictionPolicyAttr

allowed 32-bit signless integer cases: 1, 2, 3

isVolatile

::mlir::BoolAttr

bool attribute

axis

::mlir::IntegerAttr

32-bit signless integer attribute

Operands:

Operand

Description

src

ptr or tensor of ptr values

dst

tensor of floating-point values or tensor of integer values or tensor of ptr values

index

32-bit signless integer

mask

tensor of 1-bit signless integer values

other

floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr

Results:

Result

Description

result

tensor of floating-point values or tensor of integer values or tensor of ptr values

triton_gpu.select (::mlir::triton::gpu::SelectOp)

select operation

Traits: AlwaysSpeculatableImplTrait, Elementwise, SameOperandsAndResultEncoding, SameOperandsAndResultShape

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

Operand

Description

condition

1-bit signless integer or tensor of 1-bit signless integer values

true_value

tensor of floating-point values or tensor of integer values or tensor of ptr values

false_value

tensor of floating-point values or tensor of integer values or tensor of ptr values

Results:

Result

Description

result

floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr