TritonGPUOps¶
triton_gpu.alloc_tensor
(::mlir::triton::gpu::AllocTensorOp)¶
allocate tensor
Syntax:
operation ::= `triton_gpu.alloc_tensor` attr-dict `:` type($result)
This operation defines a tensor of a particular shape. The contents of the tensor are supposed to be in shared memory.
Note: This op can be repalced to a bufferization.alloc_tensor
in LLVM 16.
Traits: ResultsAreSharedEncoding
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource}
Results:¶
Result |
Description |
---|---|
|
tensor of floating-point values or tensor of integer values or tensor of ptr values |
triton_gpu.async_bulk_commit_group
(::mlir::triton::gpu::AsyncBulkCommitGroupOp)¶
async bulk commit group
Syntax:
operation ::= `triton_gpu.async_bulk_commit_group` attr-dict
triton_gpu.async_bulk_wait
(::mlir::triton::gpu::AsyncBulkWaitOp)¶
async bulk wait
Syntax:
operation ::= `triton_gpu.async_bulk_wait` attr-dict
Attributes:¶
Attribute |
MLIR Type |
Description |
---|---|---|
|
::mlir::IntegerAttr |
32-bit signless integer attribute |
triton_gpu.async_commit_group
(::mlir::triton::gpu::AsyncCommitGroupOp)¶
async commit group
Syntax:
operation ::= `triton_gpu.async_commit_group` attr-dict
triton_gpu.async_wait
(::mlir::triton::gpu::AsyncWaitOp)¶
async wait
Syntax:
operation ::= `triton_gpu.async_wait` attr-dict
Attributes:¶
Attribute |
MLIR Type |
Description |
---|---|---|
|
::mlir::IntegerAttr |
32-bit signless integer attribute |
triton_gpu.cmpf
(::mlir::triton::gpu::CmpFOp)¶
floating-point comparison operation
Traits: AlwaysSpeculatableImplTrait, Elementwise, SameOperandsAndResultEncoding, SameOperandsAndResultShape
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:¶
Attribute |
MLIR Type |
Description |
---|---|---|
|
::mlir::arith::CmpFPredicateAttr |
allowed 64-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 |
Operands:¶
Operand |
Description |
---|---|
|
floating-point or tensor of floating-point values |
|
floating-point or tensor of floating-point values |
Results:¶
Result |
Description |
---|---|
|
1-bit signless integer or tensor of 1-bit signless integer values |
triton_gpu.cmpi
(::mlir::triton::gpu::CmpIOp)¶
integer comparison operation
Traits: AlwaysSpeculatableImplTrait, Elementwise, SameOperandsAndResultEncoding, SameOperandsAndResultShape
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:¶
Attribute |
MLIR Type |
Description |
---|---|---|
|
::mlir::arith::CmpIPredicateAttr |
allowed 64-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 |
Operands:¶
Operand |
Description |
---|---|
|
integer or tensor of integer values |
|
integer or tensor of integer values |
Results:¶
Result |
Description |
---|---|
|
1-bit signless integer or tensor of 1-bit signless integer values |
triton_gpu.convert_layout
(::mlir::triton::gpu::ConvertLayoutOp)¶
convert layout
Syntax:
operation ::= `triton_gpu.convert_layout` $src attr-dict `:` functional-type(operands, results)
Traits: AlwaysSpeculatableImplTrait, SameOperandsAndResultElementType, SameOperandsAndResultShape
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:¶
Operand |
Description |
---|---|
|
tensor of floating-point values or tensor of integer values or tensor of ptr values |
Results:¶
Result |
Description |
---|---|
|
tensor of floating-point values or tensor of integer values or tensor of ptr values |
triton_gpu.extract_slice
(::mlir::triton::gpu::ExtractSliceOp)¶
extract slice operation
Syntax:
operation ::= `triton_gpu.extract_slice` $source ``
custom<DynamicIndexList>($offsets, $static_offsets)
custom<DynamicIndexList>($sizes, $static_sizes)
custom<DynamicIndexList>($strides, $static_strides)
attr-dict `:` type($source) `to` type($result)
same as tensor.extract_slice, but with int32 index. The motivations for re-implementing it are: We reimplement ExtractSliceOp with int32 index, because:
we want to enforce int32 indexing on GPUs since Triton tensors fit in SRAM
we still want to use indexWidth = 64 when lowering to LLVM because our loops can have 64-bit induction variables and scf.for uses indexType for bounds/ivs
Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments, ResultsAreSharedEncoding
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OffsetSizeAndStrideOpInterface
Effects: MemoryEffects::Effect{}
Attributes:¶
Attribute |
MLIR Type |
Description |
---|---|---|
|
::mlir::DenseI64ArrayAttr |
i64 dense array attribute |
|
::mlir::DenseI64ArrayAttr |
i64 dense array attribute |
|
::mlir::DenseI64ArrayAttr |
i64 dense array attribute |
Operands:¶
Operand |
Description |
---|---|
|
ranked tensor of any type values |
|
32-bit signless integer |
|
32-bit signless integer |
|
32-bit signless integer |
Results:¶
Result |
Description |
---|---|
|
ranked tensor of any type values |
triton_gpu.insert_slice_async
(::mlir::triton::gpu::InsertSliceAsyncOp)¶
insert slice async
This operation inserts a tensor $src
into another tensor $dst
as specified by the operation’s
$index
argument and $axis
attribute.
It returns a copy of $dst
with the proper slice updated asynchronously with the value of $src
.
This operation is non-blocking, and $results
will have the updated value after the corresponding async_wait.
When converting from tt.load
to triton_gpu.insert_slice_async
, the $evict
, $cache
, and $isVolatile
fields
might be ignored on certain hardware. For example, on NVIDIA GPUs, the cache policy is determined by the backend,
and $evict
and $isVolatile
are ignored because they apply to L1 cache only.
The insert_slice_async operation supports the following arguments:
src: the tensor that is inserted.
dst: the tensor into which the
$src
tensor is inserted.index: the index of the
$src
tensor at the given$axis
from which the$dst
tensor is inserted intomask: optional tensor-rank number of boolean masks which specify which elements of the
$src
tensor are inserted into the$dst
tensor.other: optional tensor-rank number of other tensors which specify what values are inserted into the
$dst
tensor if the corresponding element of the$mask
tensor is false.
In the future, we may decompose this operation into a sequence of:
async
operation to specify a sequence of asynchronous operationsload
operation to load a tensor from global memoryinsert_slice
operations to insert the$src
tensor into the$dst
tensor
Example:
%1 = triton_gpu.alloc_tensor : tensor<2x32xf32>
%2 = triton_gpu.insert_slice_async %0, %1, %index { axis = 0 } : tensor<32x!tt.ptr<f32>, #AL> -> tensor<2x32xf32, #A>
triiton_gpu.async_wait { num = 0 : i32 }
Traits: AttrSizedOperandSegments, ResultsAreSharedEncoding
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
Attributes:¶
Attribute |
MLIR Type |
Description |
---|---|---|
|
::mlir::triton::CacheModifierAttr |
allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6 |
|
::mlir::triton::EvictionPolicyAttr |
allowed 32-bit signless integer cases: 1, 2, 3 |
|
::mlir::BoolAttr |
bool attribute |
|
::mlir::IntegerAttr |
32-bit signless integer attribute |
Operands:¶
Operand |
Description |
---|---|
|
ptr or tensor of ptr values |
|
tensor of floating-point values or tensor of integer values or tensor of ptr values |
|
32-bit signless integer |
|
tensor of 1-bit signless integer values |
|
floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr |
Results:¶
Result |
Description |
---|---|
|
tensor of floating-point values or tensor of integer values or tensor of ptr values |
triton_gpu.insert_slice
(::mlir::triton::gpu::InsertSliceOp)¶
insert slice
This operation inserts a tensor $src
into another tensor $dst
as specified by the operation’s
$index
argument and $axis
attribute.
It returns a copy of $dst
with the proper slice updated with the value of $src
.
When converting from tt.load
to triton_gpu.insert_slice
, the $evict
, $cache
, and $isVolatile
fields
might be ignored on certain hardware. For example, on NVIDIA GPUs, the cache policy is determined by the backend,
and $evict
and $isVolatile
are ignored because they apply to L1 cache only.
The insert_slice operation supports the following arguments:
src: the tensor that is inserted.
dst: the tensor into which the
$src
tensor is inserted.index: the index of the
$src
tensor at the given$axis
from which the$dst
tensor is inserted intomask: optional tensor-rank number of boolean masks which specify which elements of the
$src
tensor are inserted into the$dst
tensor.other: optional tensor-rank number of other tensors which specify what values are inserted into the
$dst
tensor if the corresponding element of the$mask
tensor is false.
ttgpu.load_tile_async depracate triton_gpu.insert_slice might be further lowered into triton_gpu_async for different hardware implementations
like tt.load, ttgpu.insert_slice/insert_slice_async has two modes up to the type of src mode 1: ptr/src is a tensor of pointers mode 2: ptr/src is a tensor pointer
Some typical lowering paths are: in case the load is pipelined by the pipeline pass( load is inside kBlock loop, which means “pipeline pass): Load from global + store to shared : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -(Pipeline)-> ttgpu.insert_slice(mode 1) Non-bulk cp.async : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -(Pipeline)-> ttgpu.insert_slice(mode 1) -(MaterializeLoad)> ttgpu.insert_slice_async(mode 1) + ttgpu.await-> llvm TMA load : tt.load(mode 2) -(tt->ttgpu+Coalesce)-> tt.load(mode 2) -(Pipeline)-> ttgpu.insert_slice(mode 2) -(MaterializeLoad)> ttgpu.insert_slice_async_v2(mode 2) + ttgpu.await-> llvm
otherwise: Load from global + store to shared : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) Non-bulk cp.async : tt.load(mode 1) -(tt->ttgpu+Coalesce)-> tt.load(mode 1) -> … -(MaterializeLoad)-> ttgpu.insert_slice_async(mode 1) + ttgpu.await -> llvm TMA load : tt.load(mode 2) -(tt->ttgpu+Coalesce)-> tt.load(mode 2) -> … -(MaterializeLoad)-> ttgpu.insert_slice_async(mode 2) + ttgpu.await -> llvm
Example:
%1 = triton_gpu.alloc_tensor : tensor<2x32xf32>
%2 = triton_gpu.insert_slice %0, %1, %index { axis = 0 } : tensor<32x!tt.ptr<f32>, #AL> -> tensor<2x32xf32, #A>
Traits: AttrSizedOperandSegments, ResultsAreSharedEncoding
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}
Attributes:¶
Attribute |
MLIR Type |
Description |
---|---|---|
|
::mlir::triton::CacheModifierAttr |
allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6 |
|
::mlir::triton::EvictionPolicyAttr |
allowed 32-bit signless integer cases: 1, 2, 3 |
|
::mlir::BoolAttr |
bool attribute |
|
::mlir::IntegerAttr |
32-bit signless integer attribute |
Operands:¶
Operand |
Description |
---|---|
|
ptr or tensor of ptr values |
|
tensor of floating-point values or tensor of integer values or tensor of ptr values |
|
32-bit signless integer |
|
tensor of 1-bit signless integer values |
|
floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr |
Results:¶
Result |
Description |
---|---|
|
tensor of floating-point values or tensor of integer values or tensor of ptr values |
triton_gpu.select
(::mlir::triton::gpu::SelectOp)¶
select operation
Traits: AlwaysSpeculatableImplTrait, Elementwise, SameOperandsAndResultEncoding, SameOperandsAndResultShape
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:¶
Operand |
Description |
---|---|
|
1-bit signless integer or tensor of 1-bit signless integer values |
|
tensor of floating-point values or tensor of integer values or tensor of ptr values |
|
tensor of floating-point values or tensor of integer values or tensor of ptr values |
Results:¶
Result |
Description |
---|---|
|
floating-point or tensor of floating-point values or integer or tensor of integer values or ptr or tensor of ptr values or ptr |