AMD CDNA 4
AMD buffer load: reads from global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers.

AMD buffer store: writes a tensor directly to global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers.

Gets the scale layout for MFMA scaled operands.

Computes the matrix multiply-accumulate a * b + acc using AMD native matrix core units.

AMD scaled MFMA operation.
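
The semantics summarized above can be illustrated with a small host-side model. This is only a sketch in plain Python, not the GPU API: the `emulate_*` function names, the list-based "memory", the `mask`/`other` parameters, and the per-block scale layout (one scale per `block` elements along K, with `block=2` chosen purely for readability) are all assumptions made for illustration and do not come from the entries above.

```python
# Hypothetical reference model; every name here is illustrative only.

def emulate_buffer_load(memory, base, offsets, mask=None, other=0):
    """Gather memory[base + off] per offset; masked-off lanes yield `other`."""
    if mask is None:
        mask = [True] * len(offsets)
    return [memory[base + off] if m else other
            for off, m in zip(offsets, mask)]


def emulate_buffer_store(memory, base, offsets, values, mask=None):
    """Scatter values[i] to memory[base + offsets[i]] for unmasked lanes."""
    if mask is None:
        mask = [True] * len(offsets)
    for off, val, m in zip(offsets, values, mask):
        if m:
            memory[base + off] = val


def emulate_mfma(a, b, acc):
    """Return a @ b + acc for matrices stored as row-major nested lists."""
    k, n = len(b), len(b[0])
    return [[acc[i][j] + sum(a[i][p] * b[p][j] for p in range(k))
             for j in range(n)]
            for i in range(len(a))]


def emulate_mfma_scaled(a, a_scale, b, b_scale, acc, block=2):
    """Apply per-block scales along K, then multiply-accumulate.

    Assumed layout: a_scale[i][p // block] scales a[i][p] and
    b_scale[j][p // block] scales b[p][j].
    """
    k, n = len(b), len(b[0])
    a_s = [[a[i][p] * a_scale[i][p // block] for p in range(k)]
           for i in range(len(a))]
    b_s = [[b[p][j] * b_scale[j][p // block] for j in range(n)]
           for p in range(k)]
    return emulate_mfma(a_s, b_s, acc)


# One scalar base plus a tensor of offsets replaces a tensor of pointers.
memory = list(range(10, 20))
loaded = emulate_buffer_load(memory, 2, [0, 3, 5])   # [12, 15, 17]
emulate_buffer_store(memory, 4, [0, 1], [100, 200])  # writes memory[4], memory[5]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
plain = emulate_mfma(a, b, [[1, 1], [1, 1]])         # [[20, 23], [44, 51]]
```

The point of the offset form is that only one base address and a compact offset tensor need to be carried, rather than a full pointer per element; the scaled variant differs from the plain multiply-accumulate only in that each operand block is multiplied by its scale first.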