# TritonAMDGPUOps
### `amdgpu.buffer_atomic_rmw` (triton::amdgpu::BufferAtomicRMWOp)
_Atomic RMW op which reads, modifies, and writes to a scalar base pointer and a tensor offset_
Syntax:
```
operation ::= `amdgpu.buffer_atomic_rmw` $atomic_rmw_op `,` $sem `,` $scope `,` $value `,` $ptr `[` $offsets `]` (`,` $mask^)?
(`stride` `=` $stride^)?
attr-dict `:` type($result)
```
AMD Buffer atomic RMW operation. Buffer atomics are similar to normal atomics, but access global memory via a
scalar base pointer and a tensor of offsets instead of a tensor of pointers.
Similar to other buffer ops, the `mask` is a boolean vector that determines if a given element should be processed with
the atomic RMW op. Elements with `mask[i] == 0` are dropped (i.e., the atomic is not executed).
Similar to TT_AtomicRMWOp: Buffer atomic RMW ops load data at the addresses formed from `$ptr` and `$offsets`, apply `$atomic_rmw_op` with `$value`, and store the result back with
the specified memory semantics and scope. Atomic RMW ops return the pre-op value if used, otherwise the value is implicitly dropped.
Stride is the distance between the beginning of contiguous memory chunks. When performing an RMW, the `stride` is
the address difference, in bytes, between the first elements of each row. The compiler tries to obtain the `stride`
when it converts to the buffer ops because it is important for optimizing the cache memory access.
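
The masked read-modify-write behavior described above can be sketched as a small pure-Python reference model. This is an illustration under stated assumptions, not Triton's lowering: memory is modeled as a flat list, offsets are counted in elements, the name `buffer_atomic_rmw_model` and the three RMW kinds shown are hypothetical, and masked-off lanes return `None` because the text does not say what they return.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical subset of RMW kinds, for illustration only.
RMW_OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda old, val: old + val,
    "max": lambda old, val: max(old, val),
    "xchg": lambda old, val: val,
}


def buffer_atomic_rmw_model(
    memory: List[int],
    base: int,
    offsets: List[int],
    values: List[int],
    rmw_op: str,
    mask: Optional[List[bool]] = None,
) -> List[Optional[int]]:
    """Apply one RMW per lane at memory[base + offsets[i]]; return pre-op values."""
    op = RMW_OPS[rmw_op]
    results: List[Optional[int]] = []
    for i, (off, val) in enumerate(zip(offsets, values)):
        if mask is not None and not mask[i]:
            # Masked-off lane: the atomic is not executed at all.
            results.append(None)
            continue
        addr = base + off
        old = memory[addr]          # read
        memory[addr] = op(old, val)  # modify and write
        results.append(old)         # the op returns the pre-op value
    return results
```

For example, with `memory = [10, 20, 30, 40]`, `base = 1`, offsets `[0, 1, 2]`, values `[1, 2, 3]`, and mask `[True, False, True]`, an `"add"` leaves the middle element untouched and returns the old values of the two active lanes.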
Traits: `AttrSizedOperandSegments`, `SameLoadStoreOperandsAndResultEncoding`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `atomic_rmw_op` | `::mlir::triton::RMWOpAttr` | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
| `sem` | `::mlir::triton::MemSemanticAttr` | allowed 32-bit signless integer cases: 1, 2, 3, 4 |
| `scope` | `::mlir::triton::MemSyncScopeAttr` | allowed 32-bit signless integer cases: 1, 2, 3 |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `ptr` | ptr |
| `offsets` | tensor of 32-bit signless integer values |
| `value` | ranked tensor of floating-point or integer or ptr values |
| `stride` | 32-bit signless integer |
| `mask` | ranked tensor of 1-bit signless integer values |
#### Results:
| Result | Description |
| :----: | ----------- |
| `result` | ranked tensor of floating-point or integer or ptr values |
### `amdgpu.buffer_load` (triton::amdgpu::BufferLoadOp)
_Load from a scalar base pointer and a tensor offset_
Syntax:
```
operation ::= `amdgpu.buffer_load` $ptr `[` $offsets `]` (`,` $mask^)? (`,` $other^)?
oilist(`cacheModifier` `=` $cache)
(`stride` `=` $stride^)?
attr-dict `:` type($result)
```
AMD Buffer load operation. Buffer load is similar to
a normal load but it accesses global memory via a scalar base pointer
and a tensor of offsets instead of a tensor of pointers. The other fields
are similar to a normal load, i.e., the `mask` is a boolean vector that
determines if a given element should be read from memory, and `other` is the
element that should be returned on lane `i` when `mask[i] == 0`.
Stride is the distance between the beginning of contiguous memory chunks.
When performing a load of a block, the `stride` is the address difference, in bytes, between
the first elements of each row. The compiler tries to obtain the `stride`
when it converts to the buffer ops because it is important for optimizing
the cache memory access.
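
As a rough illustration of the `mask`/`other` behavior and of how a byte stride relates to per-row offsets, the following pure-Python sketch models a buffer load. The names `block_offsets` and `buffer_load_model` are hypothetical, memory is a flat list, offsets are assumed to be counted in elements, and a masked-off lane with no `other` returns `None` because the text leaves that case unspecified.

```python
from typing import List, Optional


def block_offsets(rows: int, cols: int, stride_bytes: int, elem_bytes: int) -> List[int]:
    """Element offsets of a rows x cols block whose rows start stride_bytes apart."""
    row_stride_elems = stride_bytes // elem_bytes
    return [r * row_stride_elems + c for r in range(rows) for c in range(cols)]


def buffer_load_model(
    memory: List[int],
    base: int,
    offsets: List[int],
    mask: Optional[List[bool]] = None,
    other: Optional[List[int]] = None,
) -> List[Optional[int]]:
    """Lane i reads memory[base + offsets[i]] if enabled, else yields other[i]."""
    result: List[Optional[int]] = []
    for i, off in enumerate(offsets):
        if mask is None or mask[i]:
            result.append(memory[base + off])
        else:
            # Masked-off lane: return the fallback element if one was given.
            result.append(other[i] if other is not None else None)
    return result
```

With 4-byte elements and a 16-byte stride, `block_offsets(2, 3, 16, 4)` gives `[0, 1, 2, 4, 5, 6]`: each row starts four elements after the previous one.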
Traits: `AttrSizedOperandSegments`, `SameLoadStoreOperandsAndResultEncoding`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `cache` | `::mlir::triton::CacheModifierAttr` | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7 |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `ptr` | ptr |
| `offsets` | tensor of 32-bit signless integer values |
| `stride` | 32-bit signless integer |
| `mask` | ranked tensor of 1-bit signless integer values |
| `other` | ranked tensor of floating-point or integer or ptr values |
#### Results:
| Result | Description |
| :----: | ----------- |
| `result` | ranked tensor of floating-point or integer or ptr values |
### `amdgpu.buffer_load_to_local` (triton::amdgpu::BufferLoadToLocalOp)
_Load from a scalar base pointer and a tensor offset to shared memory_
Syntax:
```
operation ::= `amdgpu.buffer_load_to_local` $ptr `[` $offsets `]` (`mask` `=` $mask^)? (`other` `=` $other^)? (`stride` `=` $stride^)?
oilist(`cacheModifier` `=` $cache) `into` $dest
attr-dict `:` type($ptr) `[` type($offsets) `]` type($other) `->` type($dest)
```
AMD Buffer load operation. Similar to the `amdgpu.buffer_load` op, but it writes directly to shared memory instead of into registers.
Traits: `AttrSizedOperandSegments`
Interfaces: `InferTypeOpInterface`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `cache` | `::mlir::triton::CacheModifierAttr` | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7 |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `dest` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `ptr` | ptr |
| `offsets` | tensor of 32-bit signless integer values |
| `mask` | ranked tensor of 1-bit signless integer values |
| `other` | ranked tensor of floating-point or integer or ptr values |
| `stride` | 32-bit signless integer |
#### Results:
| Result | Description |
| :----: | ----------- |
| `token` | async token type |
### `amdgpu.buffer_store` (triton::amdgpu::BufferStoreOp)
_Store into scalar base pointer and a tensor offset_
Syntax:
```
operation ::= `amdgpu.buffer_store` $value `,` $ptr `[` $offsets `]` (`,` $mask^)?
oilist(`cacheModifier` `=` $cache)
(`stride` `=` $stride^)?
attr-dict `:` type($value)
```
AMD Buffer store operation. Buffer store is similar to
a normal store but it accesses global memory via a scalar base pointer
and a tensor of offsets instead of a tensor of pointers. The other fields
are similar to a normal store, i.e., the `mask` is a boolean vector that
determines if a given element should be written to memory, and `value` is the
tensor of elements to write: lane `i` writes `value[i]` when `mask[i] == 1`.
Stride is the distance between the beginning of contiguous memory chunks.
When performing a block store, the `stride` is the address difference, in bytes, between
the first elements of each row. The compiler tries to obtain the `stride`
when it converts to the buffer ops because it is important for optimizing
the cache memory access.
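
The masked store described above can be sketched in a few lines of pure Python. This is a hedged reference model, not the actual lowering: `buffer_store_model` is a hypothetical name, memory is a flat list, and offsets are assumed to be counted in elements.

```python
from typing import List, Optional


def buffer_store_model(
    memory: List[int],
    base: int,
    offsets: List[int],
    values: List[int],
    mask: Optional[List[bool]] = None,
) -> List[int]:
    """Lane i writes values[i] to memory[base + offsets[i]] when it is enabled."""
    for i, (off, val) in enumerate(zip(offsets, values)):
        if mask is None or mask[i]:
            memory[base + off] = val
        # Masked-off lanes leave memory untouched.
    return memory
```

Only lanes whose mask bit is set modify memory, so a partially masked store can safely cover a block that extends past the valid region.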
Traits: `AttrSizedOperandSegments`, `SameLoadStoreOperandsEncoding`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `cache` | `::mlir::triton::CacheModifierAttr` | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7 |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `value` | ranked tensor of floating-point or integer or ptr values |
| `ptr` | ptr |
| `offsets` | tensor of 32-bit signless integer values |
| `stride` | 32-bit signless integer |
| `mask` | ranked tensor of 1-bit signless integer values |
### `amdgpu.cond_barrier` (triton::amdgpu::CondBarrierOp)
_Conditionally set barriers to synchronize partial threads in a block_
Syntax:
```
operation ::= `amdgpu.cond_barrier` $pred attr-dict
```
condBarrierOp emits a barrier instruction only when the given argument is true.
This provides a way to synchronize a subset of the threads in a block, deliberately
diverging their execution sequences. However, the user must guarantee that all threads
converge at the end by calling condBarrierOp(true) with the remaining threads.
Conceptually, this is similar to having an execution barrier inside an if statement.
This op lets us avoid blocking the whole block where suitable, which helps scheduling.
NB: This does not set any memory fence.
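
The diverge-then-converge pattern can be mimicked on a CPU with Python threads. The sketch below is only an analogy under stated assumptions: `threading.Barrier` stands in for the hardware barrier, the split into an upper and lower half of threads is an illustrative choice, and `run_cond_barrier_demo` is a hypothetical name. The upper half executes one conditional barrier up front and therefore runs one barrier phase behind the lower half; the lower half executes the matching conditional barrier at the end so that every thread performs the same number of barriers.

```python
import threading
from typing import Dict, List, Tuple


def run_cond_barrier_demo(
    num_threads: int = 4, num_iters: int = 3, timeout: float = 5.0
) -> Dict[int, Tuple[List[Tuple[int, int]], int]]:
    """Return, per thread, [(iteration, barriers passed before it)] and the total count."""
    barrier = threading.Barrier(num_threads)
    log: Dict[int, Tuple[List[Tuple[int, int]], int]] = {}

    def cond_barrier(pred: bool) -> int:
        # Only threads whose predicate is true execute the barrier.
        if pred:
            barrier.wait(timeout=timeout)
            return 1
        return 0

    def worker(tid: int) -> None:
        is_upper = tid >= num_threads // 2
        passed = cond_barrier(is_upper)  # upper half waits first and falls behind
        entries: List[Tuple[int, int]] = []
        for it in range(num_iters):
            entries.append((it, passed))
            barrier.wait(timeout=timeout)  # ordinary barrier hit by every thread
            passed += 1
        passed += cond_barrier(not is_upper)  # remaining threads converge here
        log[tid] = (entries, passed)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In this model, iteration `i` of an upper-half thread shares a barrier phase with iteration `i + 1` of a lower-half thread, which is the kind of staggering the op is meant to enable. Omitting the final conditional barrier would leave the barrier counts unbalanced.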
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `pred` | 1-bit signless integer |
### `amdgpu.extract_slice` (triton::amdgpu::ExtractSliceOp)
_Extract slice operation_
Syntax:
```
operation ::= `amdgpu.extract_slice` $source $static_offsets attr-dict `:` type($source) `to` type($result)
```
The "extract_slice" operation enables extracting a slice of a tensor in
registers.
The "extract_slice" operation supports the following arguments:
* source: the base tensor on which to create a view tensor
* offsets: offsets into the base tensor at which to create the view
Example 1:
```mlir
#blocked = #ttg.blocked<{sizePerThread = [1, 8],
threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [0, 1]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 8],
threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [0, 1]}>
%1 = ttg.convert_layout %0 : tensor<128x128xf16, #blocked>
-> tensor<128x128xf16, #blocked1>
// create a slice of base tensor %1 with static offsets
%2 = amdgpu.extract_slice %1 [0, 0] :
tensor<128x128xf16, #blocked1> to tensor<128x32xf16, #blocked1>
```
Example 1 shows how the "extract_slice" operation may be used. In this example a
new slice of 128x32 is created. "extract_slice" only works when the desired
slice can have the same layout as the source tensor.
"%0" cannot be sliced directly as the resulting slice cannot have the same
layout as "%0". Therefore it needs to be converted to a layout suitable
for slicing. The "#blocked1" layout is appropriate for this as it keeps the
sizePerThread the same, thus keeping the coalescing properties the same.
In order to utilize all threads in a warp, "threadsPerWarp" is set to
[16,4] for this new layout. Carrying out this layout conversion before
using "extract_slice" ensures that slicing still uses all threads efficiently. The
size of the slice is determined by the result type.
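
Ignoring layouts entirely, the data movement of a 2-D "extract_slice" can be sketched as follows. This is a hypothetical pure-Python model (`extract_slice_model` is not a Triton function): the tensor is a list of rows, `static_offsets` gives the top-left corner, and the slice size comes from the requested result shape, mirroring how the op takes it from the result type.

```python
from typing import List, Sequence


def extract_slice_model(
    source: List[List[int]],
    static_offsets: Sequence[int],
    result_shape: Sequence[int],
) -> List[List[int]]:
    """Return the result_shape window of source starting at static_offsets."""
    row0, col0 = static_offsets
    rows, cols = result_shape
    if row0 + rows > len(source) or col0 + cols > len(source[0]):
        raise ValueError("slice does not fit inside the source tensor")
    return [row[col0:col0 + cols] for row in source[row0:row0 + rows]]
```

For the shapes in Example 1, a 128x128 source with offsets `[0, 0]` and a 128x32 result shape yields the first 32 columns of every row.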
Traits: `AlwaysSpeculatableImplTrait`
Interfaces: `ConditionallySpeculatable`, `NoMemoryEffect (MemoryEffectOpInterface)`
Effects: `MemoryEffects::Effect{}`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `static_offsets` | `::mlir::DenseI64ArrayAttr` | i64 dense array attribute |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `source` | ranked tensor of any type values |
#### Results:
| Result | Description |
| :----: | ----------- |
| `result` | ranked tensor of any type values |
### `amdgpu.in_thread_transpose` (triton::amdgpu::InThreadTransposeOp)
_Perform a transpose of the register values belonging to each thread_
Syntax:
```
operation ::= `amdgpu.in_thread_transpose` $src attr-dict `:` type($src) `->` type($result)
```
This operation performs a layout transpose over values in registers per thread.
Specifically, given an input with a blocked layout, it transposes the last two dimensions (rank-1 and rank-2)
along the register dimension of the underlying linear layout.
Conversion example:
* input layout: blocked layout with sizePerThread=[2, 2], order=[0, 1]. Its linear layout register bases = [[1, 0], [2, 0], [0, 1], [0, 2]]
* output layout: same thread and warp bases as in the input, register bases = [[0, 1], [0, 2], [1, 0], [2, 0]]
This operation enables efficient coalesced loading from HBM followed by vectorized writing to shared memory
in cases where the HBM and shared memory orders differ and the target AMD hardware does not natively support this transposition.
This is a specific variant of ttg.convert_layout and will be converted to ttg.convert_layout when lowering to LLVM.
We do not want this conversion to be optimized out, because we need to explicitly materialize instructions
to transpose within each thread after loading from HBM and before writing to shared memory.
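
To make the register-basis swap concrete, the sketch below applies the two sets of register bases quoted in the conversion example. It assumes the usual linear-layout convention that a register index maps to tensor coordinates by XOR-ing the bases selected by its set bits; the function names are hypothetical and thread/warp bases are ignored because they do not change.

```python
from typing import List, Tuple

# Register bases taken from the conversion example above.
IN_BASES = [[1, 0], [2, 0], [0, 1], [0, 2]]
OUT_BASES = [[0, 1], [0, 2], [1, 0], [2, 0]]


def register_coords(bases: List[List[int]], reg: int) -> Tuple[int, int]:
    """Map a register index to (dim0, dim1) by XOR-ing the bases of its set bits."""
    coord = [0, 0]
    for bit, basis in enumerate(bases):
        if (reg >> bit) & 1:
            coord[0] ^= basis[0]
            coord[1] ^= basis[1]
    return coord[0], coord[1]


def in_thread_transpose_model(in_regs: List[int]) -> List[int]:
    """Reorder one thread's registers from the input layout to the output layout."""
    n = 1 << len(IN_BASES)
    # Which value sits at each tensor coordinate under the input layout.
    by_coord = {register_coords(IN_BASES, r): in_regs[r] for r in range(n)}
    # Output register r must hold the value at its coordinate under the output layout.
    return [by_coord[register_coords(OUT_BASES, r)] for r in range(n)]
```

Under the input bases consecutive registers walk along dimension 0, while under the output bases they walk along dimension 1, so the values stay with the same thread and only their register order changes.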
Traits: `AlwaysSpeculatableImplTrait`
Interfaces: `ConditionallySpeculatable`, `NoMemoryEffect (MemoryEffectOpInterface)`
Effects: `MemoryEffects::Effect{}`
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `src` | ranked tensor of floating-point or integer or ptr values |
#### Results:
| Result | Description |
| :----: | ----------- |
| `result` | ranked tensor of floating-point or integer or ptr values |
### `amdgpu.instruction_sched_hint` (triton::amdgpu::InstructionSchedHint)
_A placeholder op for instruction scheduling hints within a basic block_
Syntax:
```
operation ::= `amdgpu.instruction_sched_hint` attr-dict
```
A placeholder op for instruction scheduling hints applied to instructions within
a basic block where the placeholder op is located. This op is primarily intended
to be used to adjust instruction scheduling inside the resulting main loop
of a `tt.dot` operation. It's easier to identify dot ops at a high level and, thus,
to mark intended scheduling regions. The hint ops are eventually lowered
into LLVM AMDGPU instruction scheduling primitives, which are meant to control
how different kinds of instructions (valu/mfma, global/shared memory, etc.) should
interleave for better instruction level parallelism.
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `variant` | `::mlir::triton::amdgpu::SchedHintAttr` | Instruction Scheduling Hints for AMD GPUs |
| `numDsReadsA` | `::mlir::triton::amdgpu::InstCounterAttr` | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numDsReadsB` | `::mlir::triton::amdgpu::InstCounterAttr` | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numDsWritesA` | `::mlir::triton::amdgpu::InstCounterAttr` | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numDsWritesB` | `::mlir::triton::amdgpu::InstCounterAttr` | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numGlobalLoadsA` | `::mlir::triton::amdgpu::InstCounterAttr` | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numGlobalLoadsB` | `::mlir::triton::amdgpu::InstCounterAttr` | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `isBufferLoadsAEnabled` | `::mlir::BoolAttr` | bool attribute |
| `isBufferLoadsBEnabled` | `::mlir::BoolAttr` | bool attribute |
| `numMMAs` | `::mlir::triton::amdgpu::InstCounterAttr` | An instruction counter attribute. The attribute holds the number of issued LLVM instructions of a specific kind as well as the data type. |
### `amdgpu.upcast_mxfp` (triton::amdgpu::UpcastMXFPOp)
_Convert an mxfp tensor to bf16/fp16_
Syntax:
```
operation ::= `amdgpu.upcast_mxfp` $src `,` $scale `fp_type` `=` $fp_type attr-dict `:` type($src) `,` type($scale) `->` type($result)
```
Compute the bf16 encoded in the given mxfp number as per
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
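
As a rough illustration of what decoding an mxfp value involves, the sketch below handles the FP4 (E2M1) element format with a shared E8M0 scale, following the OCP MX specification linked above. It is not the op's implementation: the function names are hypothetical, each element is passed as a separate 4-bit code rather than packed into bytes, results are ordinary Python floats (rounding to bf16/fp16 is not modeled), and a block size of 32 elements per scale is assumed.

```python
import math
from typing import List


def decode_e8m0(scale_byte: int) -> float:
    """E8M0 shared scale: 2**(byte - 127), with 0xFF reserved for NaN."""
    if scale_byte == 0xFF:
        return math.nan
    return 2.0 ** (scale_byte - 127)


def decode_e2m1(code: int) -> float:
    """FP4 E2M1 element: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    sign = -1.0 if code & 0x8 else 1.0
    exp = (code >> 1) & 0x3
    man = code & 0x1
    if exp == 0:
        magnitude = 0.5 * man  # zero or the single subnormal value 0.5
    else:
        magnitude = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)
    return sign * magnitude


def upcast_mxfp4_model(codes: List[int], scale_bytes: List[int], block_size: int = 32) -> List[float]:
    """Scale every element by the shared scale of the block it belongs to."""
    return [
        decode_e2m1(code) * decode_e8m0(scale_bytes[i // block_size])
        for i, code in enumerate(codes)
    ]
```

With a scale byte of 128 (a factor of 2), the codes `0b0010` and `0b0101` decode to 1.0 and 3.0 and are upcast to 2.0 and 6.0.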
Traits: `AlwaysSpeculatableImplTrait`
Interfaces: `ConditionallySpeculatable`, `NoMemoryEffect (MemoryEffectOpInterface)`
Effects: `MemoryEffects::Effect{}`
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `fp_type` | `::mlir::triton::ScaleDotElemTypeAttr` | allowed 32-bit signless integer cases: 0, 1, 2, 3, 4, 5, 6 |
| `fastMath` | `::mlir::BoolAttr` | bool attribute |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `src` | ranked tensor of floating-point or integer or ptr values |
| `scale` | ranked tensor of floating-point or integer or ptr values |
#### Results:
| Result | Description |
| :----: | ----------- |
| `result` | ranked tensor of floating-point or integer or ptr values |