TritonNvidiaGPUOps
triton_nvidia_gpu.async_tma_copy_global_to_local (triton::nvidia_gpu::AsyncTMACopyGlobalToLocalOp)
Copy data based on descriptor from global memory to local memory asynchronously
Syntax:
operation ::= `triton_nvidia_gpu.async_tma_copy_global_to_local` $desc_ptr `[` $coord `]` $result `,` $barrier `,` $pred
oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
attr-dict `:` type($desc_ptr) `,` type($barrier) `->` type($result)
This operation copies data from global memory to local memory
asynchronously. It is analogous to tt.load, except that the data is copied to
local memory pointed to by the memory descriptor instead of to a distributed
tensor. The data copied depends on the global memory descriptor pointed to
by `desc_ptr`.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
cache | ::mlir::triton::CacheModifierAttr | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7. Enum cases: none (`NONE`), ca (`CA`), cg (`CG`), wb (`WB`), cs (`CS`), wt (`WT`), cv (`CV`) |
evict | ::mlir::triton::EvictionPolicyAttr | allowed 32-bit signless integer cases: 1, 2, 3. Enum cases: evict_normal (`NORMAL`), evict_first (`EVICT_FIRST`), evict_last (`EVICT_LAST`) |
isVolatile | ::mlir::BoolAttr | bool attribute |
Operands:
Operand | Description |
---|---|
desc_ptr | Pointer type |
coord | variadic of 32-bit signless integer |
barrier | memory descriptor type |
result | memory descriptor type |
pred | 1-bit signless integer |
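To make the descriptor-plus-coordinate addressing concrete, here is a toy Python model of a tiled copy. It is a sketch under stated assumptions, not the Triton or CUDA API: `TileDescriptor`, `block_shape`, and `copy_global_to_local` are hypothetical names, the copy is synchronous, the tensor is 2D and row-major, `coord` is taken to be the element offset of the tile's top-left corner, and a false `pred` is assumed to suppress the copy.

```python
from dataclasses import dataclass


@dataclass
class TileDescriptor:
    """Hypothetical stand-in for a global memory descriptor (2D, row-major)."""
    data: list          # flat global buffer
    shape: tuple        # (rows, cols) of the global tensor
    block_shape: tuple  # (rows, cols) of one tile


def copy_global_to_local(desc, coord, local, pred=True):
    """Copy the tile whose top-left element sits at `coord` into `local`.

    `coord` plays the role of $coord, `local` the role of $result, and
    `pred` the role of $pred (assumption: no copy happens when it is false).
    Returns the number of elements copied.
    """
    if not pred:
        return 0
    row0, col0 = coord
    tile_rows, tile_cols = desc.block_shape
    _, cols = desc.shape
    for r in range(tile_rows):
        for c in range(tile_cols):
            local[r * tile_cols + c] = desc.data[(row0 + r) * cols + (col0 + c)]
    return tile_rows * tile_cols


# A 4x4 global tensor holding 0..15, read in 2x2 tiles.
desc = TileDescriptor(data=list(range(16)), shape=(4, 4), block_shape=(2, 2))
local = [None] * 4
copied = copy_global_to_local(desc, (2, 1), local)
print(local)  # [9, 10, 13, 14]
```

The point of the model is that the caller never passes a shape or stride to the copy itself: everything except the coordinate comes from the descriptor.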
triton_nvidia_gpu.async_tma_copy_local_to_global (triton::nvidia_gpu::AsyncTMACopyLocalToGlobalOp)
Copy data based on descriptor from local memory to global memory asynchronously
Syntax:
operation ::= `triton_nvidia_gpu.async_tma_copy_local_to_global` $desc_ptr `[` $coord `]` $src
attr-dict `:` type($desc_ptr) `,` type($src)
This operation copies data from local memory to global memory
asynchronously. It is analogous to tt.store, except that the data is copied from
local memory pointed to by the memory descriptor instead of from a distributed
tensor. The data copied depends on the global memory descriptor pointed to
by `desc_ptr`.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Operands:
Operand | Description |
---|---|
desc_ptr | Pointer type |
coord | variadic of 32-bit signless integer |
src | memory descriptor type |
triton_nvidia_gpu.barrier_expect (triton::nvidia_gpu::BarrierExpectOp)
Signal a barrier of an expected number of bytes to be copied.
Syntax:
operation ::= `triton_nvidia_gpu.barrier_expect` $alloc `,` $size attr-dict `,` $pred `:` type($alloc)
This signals the barrier that `size` bytes are expected to be copied. The
associated barrier wait blocks until the expected number of bytes has been copied.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
size | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
alloc | memory descriptor type |
pred | 1-bit signless integer |
triton_nvidia_gpu.cluster_arrive (triton::nvidia_gpu::ClusterArriveOp)
Syntax:
operation ::= `triton_nvidia_gpu.cluster_arrive` attr-dict
Traits: VerifyTensorLayoutsTrait
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
relaxed | ::mlir::IntegerAttr | 1-bit signless integer attribute |
triton_nvidia_gpu.cluster_wait (triton::nvidia_gpu::ClusterWaitOp)
Syntax:
operation ::= `triton_nvidia_gpu.cluster_wait` attr-dict
Traits: VerifyTensorLayoutsTrait
triton_nvidia_gpu.init_barrier (triton::nvidia_gpu::InitBarrierOp)
Initialize a barrier in the given shared memory allocation.
Syntax:
operation ::= `triton_nvidia_gpu.init_barrier` $alloc `,` $count attr-dict `:` type($alloc)
Initializes a shared memory allocation with mbarrier information.
`alloc` is a descriptor to the shared memory allocation. `count` is the
number of arrives expected by the barrier.
This lowers to PTX mbarrier.init.shared::cta.b64.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
count | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
alloc | memory descriptor type |
triton_nvidia_gpu.inval_barrier (triton::nvidia_gpu::InvalBarrierOp)
Invalidate a barrier allocation.
Syntax:
operation ::= `triton_nvidia_gpu.inval_barrier` $alloc attr-dict `:` type($alloc)
Invalidates a barrier allocation so that it can be reused. According to the PTX spec, this must be done before any reuse of the memory used by the mbarrier.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-mbarrier-inval
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Operands:
Operand | Description |
---|---|
alloc | memory descriptor type |
triton_nvidia_gpu.async_tma_store_wait (triton::nvidia_gpu::TMAStoreWait)
Wait until all the inputs are read.
Syntax:
operation ::= `triton_nvidia_gpu.async_tma_store_wait` attr-dict
Waits until the associated store operations have finished reading their source data. This is required before that shared memory can be written to again.
Traits: VerifyTensorLayoutsTrait
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
pendings | ::mlir::IntegerAttr | 32-bit signless integer attribute |
triton_nvidia_gpu.tensor_desc_to_tma_ptr (triton::nvidia_gpu::TensorDescToTMAPtrOp)
Convert tensor descriptor to pointer to tma descriptor
Syntax:
operation ::= `triton_nvidia_gpu.tensor_desc_to_tma_ptr` $desc attr-dict `:` qualified(type($desc)) `to` qualified(type($ptr))
Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Operands:
Operand | Description |
---|---|
desc | Tensor descriptor type |
Results:
Result | Description |
---|---|
ptr | ptr |
triton_nvidia_gpu.wait_barrier (triton::nvidia_gpu::WaitBarrierOp)
Wait until the mbarrier phase completes.
Syntax:
operation ::= `triton_nvidia_gpu.wait_barrier` $alloc `,` $phase attr-dict `:` type($alloc)
Blocks program progress until the mbarrier object in `alloc` completes
its current phase.
This lowers to a wait loop using the PTX instruction mbarrier.try_wait.parity.shared.b64.
The barrier behavior is described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy-completion-mechanisms
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface
Operands:
Operand | Description |
---|---|
alloc | memory descriptor type |
phase | 32-bit signless integer |
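The interaction between init_barrier, barrier_expect, an asynchronous copy, and wait_barrier can be sketched with a single-threaded toy model in Python. This is an illustration only and not how Triton or the hardware implements mbarriers: the class `ToyMBarrier` and all of its method names are hypothetical, and it assumes that barrier_expect also counts as one arrival and that a phase completes once both the pending arrivals and the pending bytes reach zero.

```python
class ToyMBarrier:
    """Single-threaded toy model of an mbarrier (all names are hypothetical)."""

    def __init__(self, count):
        # init_barrier: `count` arrivals are expected in every phase.
        self.count = count
        self.pending_arrivals = count
        self.pending_bytes = 0
        self.phase = 0  # parity of the phase currently in progress

    def expect(self, size):
        # barrier_expect: announce `size` bytes; assumed to also arrive once.
        self.pending_bytes += size
        self.pending_arrivals -= 1
        self._maybe_complete_phase()

    def complete_bytes(self, size):
        # What an asynchronous copy reports once `size` bytes have landed.
        self.pending_bytes -= size
        self._maybe_complete_phase()

    def _maybe_complete_phase(self):
        if self.pending_arrivals == 0 and self.pending_bytes == 0:
            self.phase ^= 1                     # move on to the next phase
            self.pending_arrivals = self.count  # re-arm for that phase

    def try_wait(self, phase):
        # wait_barrier loops on this: true once the phase with parity
        # `phase` is no longer the one in progress.
        return self.phase != phase


bar = ToyMBarrier(count=1)
bar.expect(128)             # 128 bytes are expected in phase 0
before = bar.try_wait(0)    # copy not finished yet
bar.complete_bytes(64)
bar.complete_bytes(64)
after = bar.try_wait(0)     # all expected bytes have arrived
print(before, after)        # False True
```

Because the wait is keyed on phase parity, the caller in this model has to flip the `phase` value it passes between consecutive uses of the same barrier.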
triton_nvidia_gpu.warp_group_dot (triton::nvidia_gpu::WarpGroupDotOp)
Warp group dot
Syntax:
operation ::= `triton_nvidia_gpu.warp_group_dot` $a`,` $b`,` $c (`,` $useC^)? attr-dict `:` type($a) `*` type($b) `->` type($d)
$d = matrix_multiply($a, $b) + $c. For docs on InputPrecisionAttr, see TT_DotOp
Traits: DotLike, VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface, MemoryEffectOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
inputPrecision | ::mlir::triton::InputPrecisionAttr | allowed 32-bit signless integer cases: 0, 1, 2. Enum cases: tf32 (`TF32`), tf32x3 (`TF32x3`), ieee (`IEEE`) |
maxNumImpreciseAcc | ::mlir::IntegerAttr | 32-bit signless integer attribute |
isAsync | ::mlir::BoolAttr | bool attribute |
Operands:
Operand | Description |
---|---|
a | TensorOrMemDesc instance |
b | TensorOrMemDesc instance |
c | ranked tensor of floating-point or integer values |
useC | 1-bit signless integer |
Results:
Result | Description |
---|---|
d | ranked tensor of floating-point or integer values |
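As a reference for the arithmetic only, the formula $d = matrix_multiply($a, $b) + $c can be written out in plain Python. This is a minimal sketch, not how the op is lowered: `dot_accumulate` and `use_c` are hypothetical names, matrices are nested lists, and treating a false `use_c` as "start from a zero accumulator" is an assumption about the optional $useC operand.

```python
def dot_accumulate(a, b, c, use_c=True):
    """Return a @ b + c for matrices stored as nested lists.

    Assumption: when `use_c` is false the accumulator `c` is ignored
    and the product is accumulated from zero.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    # Seed the result with the accumulator (or zeros).
    d = [[(c[i][j] if use_c else 0) for j in range(cols)] for i in range(rows)]
    for i in range(rows):
        for k in range(inner):
            for j in range(cols):
                d[i][j] += a[i][k] * b[k][j]
    return d


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 1], [1, 1]]
d = dot_accumulate(a, b, c)
print(d)  # [[20, 23], [44, 51]]
```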
triton_nvidia_gpu.warp_group_dot_wait (triton::nvidia_gpu::WarpGroupDotWaitOp)
Warp group dot wait
Syntax:
operation ::= `triton_nvidia_gpu.warp_group_dot_wait` $inputs attr-dict `:` type($inputs)
Waits until there are $pendings or fewer outstanding async dot operations.
$inputs must be the tensors corresponding to the async dot ops that we’re
waiting on. For example, if there are N pending async dot ops and we call
`warp_group_dot_wait 1`, then $inputs must be the result of the first dot op.
Traits: VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
pendings | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
inputs | variadic of TensorOrMemDesc instance |
Results:
Result | Description |
---|---|
| | variadic of TensorOrMemDesc instance |
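The "$pendings or fewer outstanding" rule can be illustrated with a small Python queue model. It is a sketch under assumptions rather than the actual lowering: `ToyAsyncDotQueue`, `issue`, and `wait` are hypothetical names, and the model assumes async dot ops complete in the order they were issued.

```python
from collections import deque


class ToyAsyncDotQueue:
    """Toy model of outstanding async dot ops (names are hypothetical)."""

    def __init__(self):
        self.in_flight = deque()  # issued but not yet known to be complete
        self.completed = []       # results that are safe to read

    def issue(self, name):
        # Stand-in for an async warp_group_dot.
        self.in_flight.append(name)

    def wait(self, pendings):
        # Retire the oldest ops until at most `pendings` remain outstanding
        # (assumption: completion happens in issue order).
        while len(self.in_flight) > pendings:
            self.completed.append(self.in_flight.popleft())
        return list(self.completed)


q = ToyAsyncDotQueue()
q.issue("dot0")
q.issue("dot1")
done = q.wait(1)          # leave at most one dot outstanding
print(done)               # ['dot0']
print(list(q.in_flight))  # ['dot1']
```

With two dots in flight and `pendings = 1`, only the first dot's result is guaranteed ready in this model, which matches the example in the description above.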