TritonAMDGPUOps

amdgpu.buffer_load (triton::amdgpu::BufferLoadOp)

Load from a scalar base pointer and a tensor offset

Syntax:

operation ::= `amdgpu.buffer_load` $ptr `[` $offsets `]` (`,` $mask^)? (`,` $other^)?
              attr-dict `:` type($result)

AMD buffer load operation. A buffer load is similar to a normal load, but it accesses global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers. The other fields behave as in a normal load, i.e., the mask is a boolean vector that determines whether a given element should be read from memory, and other is the element that should be returned on lane i when mask[i] == 0.

Traits: AttrSizedOperandSegments, SameLoadStoreOperandsAndResultEncoding

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::triton::GlobalMemory}

Operands:

| Operand | Description |
| ------- | ----------- |
| `ptr` | ptr |
| `offsets` | tensor of 32-bit signless integer values |
| `mask` | ranked tensor of 1-bit signless integer values |
| `other` | ranked tensor of floating-point or integer or ptr values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of floating-point or integer or ptr values |
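As an illustrative sketch (the element count, layout `#blocked`, and SSA names are hypothetical, not taken from this document), a masked buffer load could look like:

```mlir
// %base is a scalar base pointer, %offs is a tensor of per-element 32-bit
// offsets, %mask disables lanes that must not read memory, and %other
// supplies the value returned on lanes where the mask is 0.
%vals = amdgpu.buffer_load %base[%offs], %mask, %other : tensor<256xf32, #blocked>
```

The mask and other operands are optional, so an unmasked load omits both and reads every lane unconditionally.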

amdgpu.buffer_store (triton::amdgpu::BufferStoreOp)

Store into scalar base pointer and a tensor offset

Syntax:

operation ::= `amdgpu.buffer_store` $value `,` $ptr `[` $offsets `]` (`,` $mask^)?
              attr-dict `:` type($value)

AMD buffer store operation. A buffer store is similar to a normal store, but it accesses global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers. The other fields behave as in a normal store, i.e., the mask is a boolean vector that determines whether a given element should be written to memory, and value is the tensor of elements that should be written on lane i when mask[i] == 1.

Traits: SameLoadStoreOperandsEncoding

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::triton::GlobalMemory}

Operands:

| Operand | Description |
| ------- | ----------- |
| `value` | ranked tensor of floating-point or integer or ptr values |
| `ptr` | ptr |
| `offsets` | tensor of 32-bit signless integer values |
| `mask` | ranked tensor of 1-bit signless integer values |
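Mirroring the load above, a masked buffer store could be sketched as follows (shapes, layout `#blocked`, and SSA names are hypothetical):

```mlir
// Writes each element of %vals to %base plus the corresponding 32-bit
// offset in %offs, but only on lanes where %mask is 1.
amdgpu.buffer_store %vals, %base[%offs], %mask : tensor<256xf32, #blocked>
```

Note that, unlike the load, the trailing type annotation names the type of `$value` rather than a result, since the op produces none.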

amdgpu.cond_barrier (triton::amdgpu::CondBarrierOp)

Conditionally set barriers to synchronize partial threads in a block

Syntax:

operation ::= `amdgpu.cond_barrier` $pred attr-dict

condBarrierOp executes a barrier instruction only when the given predicate is true. This provides a way to synchronize a subset of the threads in a block, deliberately diverging the execution sequences. However, the user must guarantee that all threads converge at the end by calling condBarrierOp(true) with the remaining threads. Conceptually, this is similar to having an execution barrier inside an if statement. This op allows us to avoid blocking the whole block when suitable, to help scheduling. N.B.: this does not set any memory fence.

Operands:

| Operand | Description |
| ------- | ----------- |
| `pred` | 1-bit signless integer |
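A minimal sketch of the intended pairing (the predicate SSA names are hypothetical; how `%first_half` is computed is left out):

```mlir
// %first_half is an i1 predicate that is true for, e.g., the first half of
// the warps in the block. Those threads wait at a barrier here, while the
// others continue past it.
amdgpu.cond_barrier %first_half

// The remaining threads must later reconverge by executing a matching
// conditional barrier with the complementary predicate (%second_half).
amdgpu.cond_barrier %second_half
```

Because the op sets no memory fence, any cross-thread data visibility still has to be established by explicit fences or other synchronization.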

amdgpu.extract_slice (triton::amdgpu::ExtractSliceOp)

Extract slice operation

Syntax:

operation ::= `amdgpu.extract_slice` $source $static_offsets attr-dict `:` type($source) `to` type($result)

The “extract_slice” operation enables extracting a slice of a tensor in registers.

The “extract_slice” operation supports the following arguments:

  • source: the base tensor on which to create a view tensor

  • offsets: offsets into the base tensor at which to create the view

Example 1:

#blocked = #ttg.blocked<{sizePerThread = [1, 8],
    threadsPerWarp = [4, 16], warpsPerCTA = [4, 1], order = [0, 1]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 8],
    threadsPerWarp = [16, 4], warpsPerCTA = [4, 1], order = [0, 1]}>
%1 = ttg.convert_layout %0 : tensor<128x128xf16, #blocked>
    -> tensor<128x128xf16, #blocked1>
// create a slice of base tensor %1 with static offsets
%2 = amdgpu.extract_slice %1 [0, 0] :
  tensor<128x128xf16, #blocked1> to tensor<128x32xf16, #blocked1>

Example 1 shows how the “extract_slice” operation may be used. In this example a new 128x32 slice is created. “extract_slice” works on tensors whose layout allows the desired slice to have the same layout as the source tensor. “%0” cannot be sliced directly because the resulting slice could not have the same layout as “%0”; it therefore first needs to be converted to a layout suitable for slicing. The “#blocked1” layout is appropriate for this, as it keeps sizePerThread the same and thus preserves coalescing properties. To utilize all threads in a warp, “threadsPerWarp” is set to [16, 4] in the new layout. Carrying out this layout conversion before “extract_slice” ensures slicing still uses all threads efficiently. The size of the slice is determined by the result type.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `static_offsets` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `source` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

amdgpu.instruction_sched_hint (triton::amdgpu::InstructionSchedHint)

A placeholder op for instruction scheduling hints within a basic block

Syntax:

operation ::= `amdgpu.instruction_sched_hint` attr-dict

A placeholder op for instruction scheduling hints applied to instructions within a basic block where the placeholder op is located. This op is primarily intended to be used to adjust instruction scheduling inside the resulting main loop of a tt.dot operation. It’s easier to identify dot ops at a high level and, thus, to mark intended scheduling regions. The hint ops are eventually lowered into LLVM AMDGPU instruction scheduling primitives, which are meant to control how different kinds of instructions (valu/mfma, global/shared memory, etc.) should interleave for better instruction level parallelism.

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `numDsReadsA` | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. Holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numDsReadsB` | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. Holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numDsWritesA` | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. Holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numDsWritesB` | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. Holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numGlobalLoadsA` | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. Holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `numGlobalLoadsB` | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. Holds the number of issued LLVM instructions of a specific kind as well as the data type. |
| `isBufferLoadsAEnabled` | ::mlir::BoolAttr | bool attribute |
| `isBufferLoadsBEnabled` | ::mlir::BoolAttr | bool attribute |
| `numMMAs` | ::mlir::triton::amdgpu::InstCounterAttr | An instruction counter attribute. Holds the number of issued LLVM instructions of a specific kind as well as the data type. |
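As a rough sketch of placement (the loop structure, layouts, and SSA names are hypothetical, and in practice the counter attributes are populated by the compiler rather than written by hand), the hint sits inside the main loop of a `tt.dot`:

```mlir
// The placeholder marks this basic block for the instruction-scheduling
// lowering; the surrounding memory and MMA instructions in the same block
// are what the scheduling primitives will interleave.
scf.for %i = %lb to %ub step %step iter_args(%acc = %init) -> (tensor<128x128xf32, #mma>) {
  amdgpu.instruction_sched_hint
  %d = tt.dot %a, %b, %acc : tensor<128x64xf16, #dotA> * tensor<64x128xf16, #dotB> -> tensor<128x128xf32, #mma>
  scf.yield %d : tensor<128x128xf32, #mma>
}
```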