TritonNvidiaGPUOps
triton_nvidia_gpu.async_tma_copy_global_to_local (triton::nvidia_gpu::AsyncTMACopyGlobalToLocalOp)
Copy data based on descriptor from global memory to local memory asynchronously
Syntax:
operation ::= `triton_nvidia_gpu.async_tma_copy_global_to_local` $desc_ptr `[` $coord `]` $result `,` $barrier `,` $pred
oilist(`cacheModifier` `=` $cache | `evictionPolicy` `=` $evict)
attr-dict `:` type($desc_ptr) `,` type($barrier) `->` type($result)
This operation copies data from global memory to local memory
asynchronously. It is analogous to `tt.load`, except that the data are copied
to the local memory pointed to by the memory descriptor instead of a
distributed tensor. The data copied depend on the global memory descriptor
pointed to by `desc_ptr`.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory, MemoryEffects::Write on ::mlir::triton::gpu::SharedMemory}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
cache | ::mlir::triton::CacheModifierAttr | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6. Enum cases: none (`NONE`), ca (`CA`), cg (`CG`), wb (`WB`), cs (`CS`), wt (`WT`) |
evict | ::mlir::triton::EvictionPolicyAttr | allowed 32-bit signless integer cases: 1, 2, 3. Enum cases: evict_normal (`NORMAL`), evict_first (`EVICT_FIRST`), evict_last (`EVICT_LAST`) |
isVolatile | ::mlir::BoolAttr | bool attribute |
Operands:
Operand | Description |
---|---|
`desc_ptr` | Pointer type |
`coord` | variadic of 32-bit signless integer |
`barrier` | memory descriptor type |
`result` | memory descriptor type |
`pred` | 1-bit signless integer |
triton_nvidia_gpu.async_tma_copy_local_to_global (triton::nvidia_gpu::AsyncTMACopyLocalToGlobalOp)
Copy data based on descriptor from local memory to global memory asynchronously
Syntax:
operation ::= `triton_nvidia_gpu.async_tma_copy_local_to_global` $desc_ptr `[` $coord `]` $src
attr-dict `:` type($desc_ptr) `,` type($src)
This operation copies data from local memory to global memory
asynchronously. It is analogous to `tt.store`, except that the data are copied
from the local memory pointed to by the memory descriptor instead of a
distributed tensor. The data copied depend on the global memory descriptor
pointed to by `desc_ptr`.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::gpu::SharedMemory, MemoryEffects::Write on ::mlir::triton::GlobalMemory}
Operands:
Operand | Description |
---|---|
`desc_ptr` | Pointer type |
`coord` | variadic of 32-bit signless integer |
`src` | memory descriptor type |
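As a rough mental model of the two TMA copy operations, the Python sketch below treats the descriptor behind `desc_ptr` as a hypothetical `TensorDescriptor` that holds a 2-D global array and a fixed box (tile) shape, with `coord` selecting the origin of the tile. The class, the function names, the in-bounds requirement, and the reading of `pred` as a gate on the copy are illustrative assumptions, not Triton or CUDA APIs. The sketch is also synchronous, so it ignores the barrier and the asynchronous completion entirely.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TensorDescriptor:
    """Hypothetical stand-in for the global memory descriptor behind desc_ptr."""
    data: List[List[int]]  # the "global memory" tensor, as a 2-D list
    box_rows: int          # tile height moved by one copy
    box_cols: int          # tile width moved by one copy


def copy_global_to_local(desc: TensorDescriptor, coord: Tuple[int, int],
                         local: List[List[int]], pred: bool = True) -> None:
    """Copy the tile at `coord` from global memory into `local` (assumed in bounds)."""
    if not pred:  # assumption: a false predicate skips the copy
        return
    r0, c0 = coord
    for i in range(desc.box_rows):
        for j in range(desc.box_cols):
            local[i][j] = desc.data[r0 + i][c0 + j]


def copy_local_to_global(desc: TensorDescriptor, coord: Tuple[int, int],
                         local: List[List[int]]) -> None:
    """Copy `local` into the tile at `coord` in global memory (assumed in bounds)."""
    r0, c0 = coord
    for i in range(desc.box_rows):
        for j in range(desc.box_cols):
            desc.data[r0 + i][c0 + j] = local[i][j]


# A 4x4 "global" tensor whose element at (r, c) is 10*r + c.
global_mem = [[10 * r + c for c in range(4)] for r in range(4)]
desc = TensorDescriptor(global_mem, box_rows=2, box_cols=2)

tile = [[0, 0], [0, 0]]
copy_global_to_local(desc, (1, 2), tile)   # tile is now [[12, 13], [22, 23]]
copy_local_to_global(desc, (0, 0), tile)   # writes that tile to the top-left corner
```

The point of the model is only that the descriptor, not the operation, decides which region of global memory a given coordinate refers to.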
triton_nvidia_gpu.cluster_arrive (triton::nvidia_gpu::ClusterArriveOp)
Syntax:
operation ::= `triton_nvidia_gpu.cluster_arrive` attr-dict
Traits: VerifyTensorLayoutsTrait
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
relaxed | ::mlir::IntegerAttr | 1-bit signless integer attribute |
triton_nvidia_gpu.cluster_wait (triton::nvidia_gpu::ClusterWaitOp)
Syntax:
operation ::= `triton_nvidia_gpu.cluster_wait` attr-dict
Traits: VerifyTensorLayoutsTrait
triton_nvidia_gpu.dot_async (triton::nvidia_gpu::DotAsyncOp)
Dot async
Syntax:
operation ::= `triton_nvidia_gpu.dot_async` $a`,` $b`,` $c attr-dict `:` type($a) `*` type($b) `->` type($d)
$d = matrix_multiply($a, $b) + $c. For documentation on InputPrecisionAttr, see TT_DotOp.
Traits: AlwaysSpeculatableImplTrait, VerifyTensorLayoutsTrait
Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
inputPrecision | ::mlir::triton::InputPrecisionAttr | allowed 32-bit signless integer cases: 0, 1, 2. Enum cases: tf32 (`TF32`), tf32x3 (`TF32x3`), ieee (`IEEE`) |
maxNumImpreciseAcc | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
`a` | TensorOrMemDesc instance |
`b` | TensorOrMemDesc instance |
`c` | ranked tensor of floating-point or integer values |
Results:
Result | Description |
---|---|
`d` | ranked tensor of floating-point or integer values |
triton_nvidia_gpu.dot_wait (triton::nvidia_gpu::DotWaitOp)
Dot wait
Syntax:
operation ::= `triton_nvidia_gpu.dot_wait` $inputs attr-dict `:` type($inputs)
Waits until there are $pendings or fewer outstanding async dot operations.
$inputs must be the tensors corresponding to the async dot ops that we’re
waiting on. For example, if there are N pending async dot ops and we call
`dot_wait 1`, then $inputs must be the result of the first dot op.
Traits: VerifyTensorLayoutsTrait
Interfaces: InferTypeOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
pendings | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
`inputs` | variadic of TensorOrMemDesc instance |
Results:
Result | Description |
---|---|
`outputs` | variadic of TensorOrMemDesc instance |
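The interplay between `dot_async` and `dot_wait` can be pictured with a small Python model. It is only a sketch: the `AsyncDotQueue` class is hypothetical, the product is computed eagerly rather than asynchronously, and the model assumes that outstanding dots retire oldest first. With two dots issued, a wait with `pendings = 1` leaves at most one dot outstanding, so the result of the first dot is the one that is safe to use.

```python
from collections import deque


class AsyncDotQueue:
    """Toy model of outstanding async dots (hypothetical, not a Triton API)."""

    def __init__(self):
        self.pending = deque()  # results of dots issued but not yet retired
        self.completed = []     # results of dots known to have finished

    def dot_async(self, a, b, c):
        # d = matrix_multiply(a, b) + c, computed eagerly here for simplicity.
        rows, inner, cols = len(a), len(b), len(b[0])
        d = [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner))
              for j in range(cols)]
             for i in range(rows)]
        self.pending.append(d)
        return d

    def dot_wait(self, pendings):
        # Assumption: dots retire in issue order, so waiting retires the
        # oldest ones until at most `pendings` remain outstanding.
        while len(self.pending) > pendings:
            self.completed.append(self.pending.popleft())


q = AsyncDotQueue()
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
zero = [[0, 0], [0, 0]]

d0 = q.dot_async(a, b, zero)  # [[19, 22], [43, 50]]
d1 = q.dot_async(a, b, d0)    # accumulates onto d0: [[38, 44], [86, 100]]
q.dot_wait(1)                 # d0 is retired; d1 may still be in flight
```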
triton_nvidia_gpu.init_barrier (triton::nvidia_gpu::InitBarrierOp)
Initialize a barrier in the given shared memory allocation.
Syntax:
operation ::= `triton_nvidia_gpu.init_barrier` $alloc `,` $count attr-dict `:` type($alloc)
Initializes a shared memory allocation with mbarrier information.
`alloc` is a descriptor to the shared memory allocation. `count` is the
number of arrivals expected by the barrier.
This lowers to PTX `mbarrier.init.shared::cta.b64`.
Traits: VerifyTensorLayoutsTrait
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::triton::gpu::SharedMemory}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
count | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
`alloc` | memory descriptor type |
triton_nvidia_gpu.wait_barrier (triton::nvidia_gpu::WaitBarrierOp)
Wait until the mbarrier phase completes.
Syntax:
operation ::= `triton_nvidia_gpu.wait_barrier` $alloc `,` $phase attr-dict `:` type($alloc)
Blocks program progress until the mbarrier object in `alloc` completes
its current phase.
This lowers to a wait loop using the PTX instruction `mbarrier.try_wait.parity.shared.b64`.
The barrier behavior is described here: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-asynchronous-copy-completion-mechanisms
Traits: VerifyTensorLayoutsTrait
Operands:
Operand | Description |
---|---|
`alloc` | memory descriptor type |
`phase` | 32-bit signless integer |
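To make the relationship between `init_barrier`'s `count` and `wait_barrier`'s `phase` more concrete, here is a single-threaded Python toy model of an mbarrier. It is a sketch under stated assumptions, not a description of the hardware: `ToyMBarrier`, `arrive`, and `try_wait` are hypothetical names, `arrive` stands in for whatever signals the barrier (no arrive op is documented in this section), and the parity rule assumes that a phase of a given parity has completed exactly when the barrier's current phase has the opposite parity.

```python
class ToyMBarrier:
    """Single-threaded toy model of an mbarrier (hypothetical, heavily simplified)."""

    def __init__(self, count):
        self.count = count      # arrivals expected per phase (init_barrier's count)
        self.remaining = count  # arrivals still missing in the current phase
        self.phase = 0          # index of the current, not yet completed phase

    def arrive(self):
        self.remaining -= 1
        if self.remaining == 0:  # phase completes: advance and re-arm
            self.phase += 1
            self.remaining = self.count

    def try_wait(self, parity):
        # Assumption: the phase with this parity has completed exactly when
        # the current phase has the opposite parity.
        return (self.phase & 1) != parity


bar = ToyMBarrier(count=2)
before = bar.try_wait(0)      # False: phase 0 still needs two arrivals
bar.arrive()
bar.arrive()
after = bar.try_wait(0)       # True: phase 0 has completed
next_phase = bar.try_wait(1)  # False: phase 1 is now in progress
```

In this model a wait loop simply polls `try_wait` with the expected parity until it returns `True`, and the caller flips the parity it waits on each time the barrier is reused.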