AMD CDNA 4

`async_copy`
`buffer_load`	AMD buffer load: reads a tensor from global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers.
`buffer_store`	AMD buffer store: writes a tensor directly to global memory via a scalar base pointer and a tensor of offsets instead of a tensor of pointers.
`buffer_atomic_add`
`buffer_atomic_and`
`buffer_atomic_max`
`buffer_atomic_min`
`buffer_atomic_or`
`buffer_atomic_xchg`
`buffer_atomic_xor`
`get_mfma_scale_layout`	Get the scale layout for MFMA scaled operands.
`mfma`	Computes the matrix multiplication a * b + acc using AMD native matrix core units.
`mfma_scaled`	AMD Scaled MFMA operation.
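
The addressing scheme behind `buffer_load` and `buffer_store` (one scalar base plus a tensor of offsets) and the `a * b + acc` contract of `mfma` can be illustrated with a small plain-Python model. This is only a sketch of the semantics stated in the table, not the Gluon API: the function names `buffer_load_model`, `buffer_store_model`, and `mfma_model`, their parameters, and the use of a flat Python list as "global memory" are all assumptions made for illustration.

```python
# Plain-Python model of the semantics summarized above.
# NOTE: these helpers are hypothetical and do not call or mirror the real
# Gluon signatures; "memory" is a flat list standing in for global memory.


def buffer_load_model(memory, base, offsets):
    """Gather memory[base + off] for every entry of `offsets`.

    `base` is a single scalar index (the "scalar base pointer") and
    `offsets` is a list of per-element offsets, so no per-element
    pointer list is ever built.
    """
    return [memory[base + off] for off in offsets]


def buffer_store_model(memory, base, offsets, values):
    """Scatter `values` into memory[base + off], in place."""
    for off, val in zip(offsets, values):
        memory[base + off] = val


def mfma_model(a, b, acc):
    """Return a * b + acc for matrices given as row-major nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            acc[i][j] + sum(a[i][p] * b[p][j] for p in range(inner))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


if __name__ == "__main__":
    memory = list(range(10, 20))  # [10, 11, ..., 19]

    print(buffer_load_model(memory, 2, [0, 3, 1]))  # [12, 15, 13]

    buffer_store_model(memory, 4, [0, 1], [-1, -2])
    print(memory[4], memory[5])  # -1 -2

    print(mfma_model([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]]))
    # [[20, 23], [44, 51]]
```

The model makes the design point visible: every element address is derived from the same scalar base, so only the offsets vary per element. Layouts, data types, masking, bounds handling, and the scaling applied by `mfma_scaled` are deliberately left out because the table does not specify them.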