# TritonAMDGPUOps
### `amdgpu.async_tdm_copy_global_to_local` (triton::amdgpu::AsyncTDMCopyGlobalToLocalOp)
_Copy data based on descriptor from global memory to local memory asynchronously_
Syntax:
```
operation ::= `amdgpu.async_tdm_copy_global_to_local` $desc `[` $indices `]` `into` $result `,` $pred
              attr-dict `:` qualified(type($desc)) `->` qualified(type($result))
```
This operation asynchronously copies data from global memory to local
memory. It is analogous to `tt.load`, except that the data is copied to the
local memory pointed to by `result` instead of into a distributed tensor.
The data copied depends on the global memory pointed to by `desc`. Setting
`pred` to false disables the copy. This operation does not support shared
memory swizzling.
Interfaces: `InferTypeOpInterface`
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system |
| `indices` | variadic of 32-bit signless integer |
| `result` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
| `pred` | 1-bit signless integer |
#### Results:
| Result | Description |
| :----: | ----------- |
| `token` | async token type |
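
The predicate behavior described for `amdgpu.async_tdm_copy_global_to_local` can be illustrated with a small plain-Python model. This is only a sketch: the function name is hypothetical, it assumes `indices` gives the starting coordinates of an in-bounds 2D tile (as with Triton tensor descriptor loads), and it does not model asynchrony or the returned token.

```python
def tdm_copy_global_to_local(global_mem, indices, local_tile, pred):
    """Hypothetical model: fill `local_tile` from `global_mem` when `pred` is true."""
    if not pred:
        # A false predicate disables the copy; local memory is left untouched.
        return local_tile
    # Assumption: `indices` is the (row, column) origin of the tile.
    row0, col0 = indices
    for r in range(len(local_tile)):
        for c in range(len(local_tile[0])):
            local_tile[r][c] = global_mem[row0 + r][col0 + c]
    return local_tile


# A 4x4 "global" array whose entries encode their own coordinates.
global_mem = [[10 * r + c for c in range(4)] for r in range(4)]

# A 2x2 "local" tile, initially zeroed.
tile = [[0, 0], [0, 0]]
tdm_copy_global_to_local(global_mem, (1, 2), tile, pred=True)
print(tile)
```

Calling the same function with `pred=False` returns the tile unchanged, which mirrors the disabled copy.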
### `amdgpu.async_tdm_copy_local_to_global` (triton::amdgpu::AsyncTDMCopyLocalToGlobalOp)
_Copy data based on descriptor from local memory to global memory asynchronously_
Syntax:
```
operation ::= `amdgpu.async_tdm_copy_local_to_global` $desc `[` $indices `]` `from` $src
              attr-dict `:` qualified(type($src)) `->` qualified(type($desc))
```
This operation asynchronously copies data from local memory to global
memory. It is analogous to `tt.store`, except that the data is copied from
the local memory pointed to by `src` instead of from a distributed tensor.
The copy destination depends on the global memory pointed to by `desc`.
This operation does not support shared memory padding or swizzling.
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `desc` | Tensor descriptor type (`::mlir::triton::TensorDescType`) in Triton IR type system |
| `indices` | variadic of 32-bit signless integer |
| `src` | memory descriptor type (`::mlir::triton::gpu::MemDescType`) in Triton IR type system |
### `amdgpu.async_tdm_wait` (triton::amdgpu::AsyncTDMWait)
_Wait until at most the given number of TDM operations are outstanding_
Syntax:
```
operation ::= `amdgpu.async_tdm_wait` $asyncToken attr-dict
```
This operation waits until at most the given number of TDM operations,
counting both loads and stores, are still outstanding. This is necessary to
ensure that data is available in the LDS before it is used.
Interfaces: `InferTypeOpInterface`
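
The counting behavior of the wait can be sketched in plain Python. The class and method names below are hypothetical, and the model assumes that outstanding operations retire in issue order, which the text above does not state.

```python
from collections import deque


class TdmQueue:
    """Hypothetical model of outstanding TDM operations (loads and stores)."""

    def __init__(self):
        self.outstanding = deque()
        self.completed = []

    def issue(self, name):
        # Issuing a copy adds one outstanding operation.
        self.outstanding.append(name)

    def wait(self, max_outstanding):
        # Block (here: retire oldest-first) until at most
        # `max_outstanding` operations remain in flight.
        while len(self.outstanding) > max_outstanding:
            self.completed.append(self.outstanding.popleft())


queue = TdmQueue()
for name in ("load_a", "load_b", "store_c"):
    queue.issue(name)

# Allow one operation to remain in flight.
queue.wait(1)
print(queue.completed, list(queue.outstanding))
```

Waiting with a count of zero drains every outstanding operation, which is the conservative choice before reading data from the LDS.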
#### Attributes:
| Attribute | MLIR Type | Description |
| :-------: | :-------: | ----------- |
| `atomic_rmw_op` | ::mlir::triton::RMWOpAttr | allowed 32-bit signless integer cases: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
| `sem` | ::mlir::triton::MemSemanticAttr | allowed 32-bit signless integer cases: 1, 2, 3, 4 |
| `scope` | ::mlir::triton::MemSyncScopeAttr | allowed 32-bit signless integer cases: 1, 2, 3 |
#### Operands:
| Operand | Description |
| :-----: | ----------- |
| `ptr` | ptr |
| `offsets` | tensor of 32-bit signless integer values |
| `value` | ranked tensor of floating-point or integer or ptr values |
| `stride` | 32-bit signless integer |
| `mask` | ranked tensor of 1-bit signless integer values |
#### Results:
| Result | Description |
| :----: | ----------- |
| `result` | ranked tensor of floating-point or integer or ptr values |
### `amdgpu.buffer_load` (triton::amdgpu::BufferLoadOp)
_Load from a scalar base pointer and a tensor offset_
Syntax:
```
operation ::= `amdgpu.buffer_load` $ptr `[` $offsets `]` (`,` $mask^)? (`,` $other^)?
              oilist(`cacheModifier` `=` $cache)
              (`stride` `=` $stride^)?
              attr-dict `:` type($result)
```
AMD buffer load operation. A buffer load is similar to
a normal load, but it accesses global memory via a scalar base pointer
and a tensor of offsets instead of a tensor of pointers. The other fields
are similar to those of a normal load: `mask` is a boolean vector that
determines whether a given element should be read from memory, and `other`
is the element returned on lane `i` when `mask[i] == 0`.
`stride` is the distance between the beginnings of contiguous memory chunks.
When loading a block, `stride` is the address difference, in bytes, between
the first elements of consecutive rows. The compiler tries to obtain `stride`
when it converts to buffer ops because it is important for optimizing
cache memory access.
Traits: `AttrSizedOperandSegments`, `SameLoadStoreOperandsAndResultEncoding`
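
The `mask`/`other` semantics and the byte-based `stride` described above can be modeled in a few lines of plain Python. This is a sketch, not the compiler's lowering: both function names are hypothetical, offsets are assumed to be in elements, and the value of a masked-off lane without `other` is left as `None` because the text does not specify it.

```python
def buffer_load(memory, base, offsets, mask=None, other=None):
    """Hypothetical per-lane model of a buffer load."""
    result = []
    for lane, offset in enumerate(offsets):
        active = True if mask is None else bool(mask[lane])
        if active:
            # Scalar base plus per-lane offset, instead of a per-lane pointer.
            result.append(memory[base + offset])
        else:
            # Masked-off lanes return `other[lane]` when `other` is given.
            result.append(other[lane] if other is not None else None)
    return result


def row_stride_bytes(row_pitch_elements, element_size_bytes):
    """Address difference in bytes between the first elements of consecutive rows."""
    return row_pitch_elements * element_size_bytes


memory = list(range(100, 116))
loaded = buffer_load(memory, base=4, offsets=[0, 1, 2, 3],
                     mask=[1, 0, 1, 0], other=[-1, -1, -1, -1])
print(loaded)

# Rows of 64 two-byte elements laid out back to back.
print(row_stride_bytes(64, 2))
```
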
#### Attributes: