TritonAMDGPUOps
amdgpu.buffer_load (triton::amdgpu::BufferLoadOp)
Load from a scalar base pointer and a tensor offset
Syntax:
operation ::= `amdgpu.buffer_load` $ptr `[` $offsets `]` (`,` $mask^)? (`,` $other^)?
attr-dict `:` type($result)
AMD buffer load operation. A buffer load is similar to a normal load, but it accesses global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers. The other fields are similar to a normal load: mask is a boolean vector that determines whether a given element should be read from memory, and other is the element that should be returned on lane i when mask[i] == 0.
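The load semantics described above can be pictured with a small host-side reference model. This is only a sketch under stated assumptions: memory is a flat Python list, the base pointer and offsets are element indices rather than byte addresses, and the function name `buffer_load_model` is hypothetical. The source does not say what a masked-off lane returns when `other` is omitted, so the sketch returns `None` there.

```python
def buffer_load_model(memory, base, offsets, mask=None, other=None):
    """Toy model of a masked load from a scalar base plus per-lane offsets."""
    result = []
    for lane, offset in enumerate(offsets):
        # A missing mask is treated as "every lane is active" (assumption).
        active = True if mask is None else bool(mask[lane])
        if active:
            result.append(memory[base + offset])
        elif other is not None:
            # Masked-off lane: return the fallback element for this lane.
            result.append(other[lane])
        else:
            # Not specified by the source; None marks "undefined" in this sketch.
            result.append(None)
    return result


memory = [10, 20, 30, 40, 50]
loaded = buffer_load_model(memory, 1, [0, 2, 3], mask=[1, 0, 1], other=[-1, -1, -1])
```

Lane 1 is masked off, so it receives the corresponding element of `other` instead of a value read from memory.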
Traits: AttrSizedOperandSegments, SameLoadStoreOperandsAndResultEncoding
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory}
Operands:
Operand | Description |
---|---|
ptr | ptr |
offsets | tensor of 32-bit signless integer values |
mask | ranked tensor of 1-bit signless integer values |
other | ranked tensor of floating-point or integer or ptr values |
Results:
Result | Description |
---|---|
result | ranked tensor of floating-point or integer or ptr values |
amdgpu.buffer_store (triton::amdgpu::BufferStoreOp)
Store into a scalar base pointer and a tensor offset
Syntax:
operation ::= `amdgpu.buffer_store` $value `,` $ptr `[` $offsets `]` (`,` $mask^)?
attr-dict `:` type($value)
AMD buffer store operation. A buffer store is similar to a normal store, but it accesses global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers. The other fields are similar to a normal store: mask is a boolean vector that determines whether a given element should be written to memory, and value is the tensor of elements that should be written on lane i when mask[i] == 1.
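The store semantics above can be sketched the same way. The assumptions are loud ones: memory is a flat Python list, the base pointer and offsets are element indices rather than byte addresses, and `buffer_store_model` is a hypothetical name used only for illustration.

```python
def buffer_store_model(memory, value, base, offsets, mask=None):
    """Toy model of a masked store to a scalar base plus per-lane offsets."""
    for lane, offset in enumerate(offsets):
        # A missing mask is treated as "every lane is active" (assumption).
        if mask is None or mask[lane]:
            memory[base + offset] = value[lane]
    return memory


memory = [0, 0, 0, 0, 0]
buffer_store_model(memory, [7, 8, 9], 1, [0, 1, 3], mask=[1, 0, 1])
```

Only lanes 0 and 2 write in this call; the element addressed by the masked-off lane 1 keeps its previous contents.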
Traits: SameLoadStoreOperandsEncoding
Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::triton::GlobalMemory}
Operands:
Operand | Description |
---|---|
value | ranked tensor of floating-point or integer or ptr values |
ptr | ptr |
offsets | tensor of 32-bit signless integer values |
mask | ranked tensor of 1-bit signless integer values |
amdgpu.cond_barrier (triton::amdgpu::CondBarrierOp)
Conditionally set barriers to synchronize partial threads in a block
Syntax:
operation ::= `amdgpu.cond_barrier` $pred attr-dict
condBarrierOp emits a barrier instruction only when the given argument is true. This provides a way to synchronize a subset of the threads in a block, deliberately diverging their execution sequences. The user must, however, guarantee that all threads converge at the end by calling condBarrierOp(true) with the remaining threads. Conceptually, this is similar to having an execution barrier inside an if statement. The op makes it possible to avoid blocking the whole block where suitable, which helps scheduling. Note that it does not set any memory fence.
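The "barrier inside an if statement" idea can be pictured with a host-side analogy. The sketch below uses Python's `threading.Barrier` for two thread groups; it is an assumption about how to visualize the pattern, not a description of the hardware mechanism, and the names `run_ping_pong`, `group`, and `cond_barrier` are hypothetical. One group waits up front, which puts it one barrier behind the other; the remaining group calls the conditional barrier at the end so both converge.

```python
import threading


def run_ping_pong(iterations=3):
    barrier = threading.Barrier(2)  # two thread groups share one barrier
    log = []                        # list.append is atomic in CPython

    def cond_barrier(pred):
        # Barrier inside an if: only callers with a true predicate wait.
        if pred:
            barrier.wait(timeout=5)

    def group(name, waits_up_front):
        cond_barrier(waits_up_front)        # only one group waits here
        for k in range(iterations):
            log.append((name, k))
            barrier.wait(timeout=5)         # ordinary barrier per iteration
        cond_barrier(not waits_up_front)    # the remaining group converges

    threads = [
        threading.Thread(target=group, args=("A", True)),
        threading.Thread(target=group, args=("B", False)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


log = run_ping_pong(3)
```

Both groups reach the barrier the same number of times in total, so neither is left waiting. In this model, group A's step k always runs after group B's step k, which is the deliberate divergence the description refers to.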
Operands:
Operand | Description |
---|---|
pred | 1-bit signless integer |
amdgpu.extract_slice (triton::amdgpu::ExtractSliceOp)
Extract slice operation
Syntax:
operation ::= `amdgpu.extract_slice` $source $static_offsets attr-dict `:` type($source) `to` type($result)
The “extract_slice” operation enables extracting a slice of a tensor in registers.
The “extract_slice” operation supports the following arguments:
source: the base tensor on which to create a view tensor
offsets: offsets into the base tensor at which to create the view
Example 1:
#blocked = #ttg.blocked<{sizePerThread = [1, 8],
threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [0, 1]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 8],
threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [0, 1]}>
%1 = ttg.convert_layout %0 : tensor<128x128xf16, #blocked>
-> tensor<128x128xf16, #blocked1>
// create a slice of base tensor %1 with static offsets
%2 = amdgpu.extract_slice %1 [0, 0] :
tensor<128x128xf16, #blocked1> to tensor<128x32xf16, #blocked1>
Example 1 shows how the “extract_slice” operation may be used. In this example, a new 128x32 slice is created. “extract_slice” works on tensors whose layout allows the desired slice to have the same layout as the source tensor. “%0” cannot be sliced directly, because the resulting slice cannot have the same layout as “%0”; it must first be converted to a layout suitable for slicing. The “#blocked1” layout is appropriate because it keeps sizePerThread the same, and so keeps the coalescing properties the same. To utilize all threads in a warp, “threadsPerWarp” is set to [16, 4] in the new layout. Carrying out this layout conversion before “extract_slice” ensures that slicing still uses all threads efficiently. The size of the slice is determined by the result type.
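Leaving layouts aside, the value-level effect of the slice in Example 1 can be sketched on a nested Python list. This is an illustrative model only: it ignores register layouts entirely, the function name `extract_slice_model` is hypothetical, and the bounds check is an assumption about what a valid slice requires.

```python
def extract_slice_model(source, static_offsets, result_shape):
    """Toy model: take a 2D slice whose size comes from the result shape."""
    row0, col0 = static_offsets
    rows, cols = result_shape
    # Assumed validity condition: the slice lies entirely inside the source.
    assert row0 + rows <= len(source)
    assert col0 + cols <= len(source[0])
    return [row[col0:col0 + cols] for row in source[row0:row0 + rows]]


# A 128x128 source whose element at (r, c) is r * 128 + c.
source = [[r * 128 + c for c in range(128)] for r in range(128)]
tile = extract_slice_model(source, [0, 0], (128, 32))
```

As in the op itself, the offsets say where the slice starts and the result shape alone determines how large it is.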
Traits: AlwaysSpeculatableImplTrait
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)
Effects: MemoryEffects::Effect{}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
static_offsets | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
Operands:
Operand | Description |
---|---|
source | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
amdgpu.instruction_sched_hint (triton::amdgpu::InstructionSchedHint)
A placeholder op for instruction scheduling hints within a basic block
Syntax:
operation ::= `amdgpu.instruction_sched_hint` attr-dict
A placeholder op for instruction scheduling hints applied to the instructions within the basic block where the placeholder op is located. This op is primarily intended to adjust instruction scheduling inside the main loop that results from a tt.dot operation. Dot ops are easier to identify at a high level, which in turn makes it easier to mark the intended scheduling regions. The hint ops are eventually lowered into LLVM AMDGPU instruction scheduling primitives, which control how different kinds of instructions (valu/mfma, global/shared memory, etc.) should interleave for better instruction-level parallelism.
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
numDsReadsA | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
numDsReadsB | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
numDsWritesA | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
numDsWritesB | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
numGlobalLoadsA | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
numGlobalLoadsB | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
isBufferLoadsAEnabled | ::mlir::BoolAttr | bool attribute |
isBufferLoadsBEnabled | ::mlir::BoolAttr | bool attribute |
numMMAs | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |