Language API
Types
Represents an N-dimensional array of values or pointers.
Represents a handle to a shared memory allocation in Gluon IR.
Programming Model
Returns the id of the current program instance along the given
Returns the number of program instances launched along the given
Returns the number of warps that execute the current context, including in warp-specialized regions.
Returns the number of CTAs in the current kernel.
Create a warp-specialized execution region, partitioning work across warps.
Insert a barrier to synchronize threads within a CTA, or across a cluster.
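The program-instance queries above are easiest to read against a concrete launch. The following is a minimal plain-Python sketch, not Gluon code: `simulate_grid`, `cdiv`, `n_elements`, and `block_size` are hypothetical names chosen for illustration, and the one-dimensional grid with a masked tail block is an assumption about a typical kernel rather than something this page specifies.

```python
def cdiv(n, d):
    """Ceiling division for positive integers."""
    return (n + d - 1) // d


def simulate_grid(n_elements, block_size):
    """Model a 1D launch: map each program id to the offsets it would cover.

    Hypothetical helper for illustration only. A real kernel would query its
    own program id and the number of launched programs instead of looping.
    """
    num_programs = cdiv(n_elements, block_size)
    work = {}
    for pid in range(num_programs):
        start = pid * block_size
        # Offsets past the end of the data are masked out in the last block.
        work[pid] = [off for off in range(start, start + block_size)
                     if off < n_elements]
    return work


work = simulate_grid(n_elements=10, block_size=4)
print(work)
```

With 10 elements and a block size of 4, three program instances are needed, and the last one covers only two valid offsets.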
Creation Ops
Allocate shared memory for a tensor with the given element type, shape, and layout.
Generate a sequence tensor with values in [start, end) using a specified layout.
Casts a tensor to the given
Create a tensor filled with a scalar value, with specified shape, dtype, and layout.
Create a tensor with the same properties as a given tensor, filled with a specified value.
Create a tensor filled with zeros.
Create a tensor with the same properties as a given tensor, filled with zeros.
Layout Ops
Count the bank conflicts per wavefront of each instruction generated when reading/writing the distributed tensor from/to the shared memory descriptor using ld.shared/st.shared instructions.
Convert a tensor to a different distributed layout.
Set a tensor with AutoLayout to a concrete layout.
Shape Manipulation Ops
Tries to broadcast the two given blocks to a common compatible shape.
Expand the shape of a tensor, by inserting new length-1 dimensions.
Join the given tensors in a new, minor dimension.
Map a scalar function over a tensor.
Permutes the dimensions of a tensor.
Returns a contiguous flattened view of
Returns a tensor with the same number of elements as input but with the provided shape.
Split a tensor in two along its last dim, which must have size 2.
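The join and split descriptions above are inverses of each other, which a small model makes concrete. This is a hedged sketch on nested Python lists, not the Gluon operations themselves: the function names `join` and `split` are hypothetical stand-ins, and the nested-list representation of a tensor is an assumption made for illustration.

```python
def join(a, b):
    """Pair up two equal-shape nested lists in a new last dimension of size 2."""
    if isinstance(a, list):
        if len(a) != len(b):
            raise ValueError("inputs must have the same shape")
        return [join(x, y) for x, y in zip(a, b)]
    # Scalars at the leaves become a length-2 minor dimension.
    return [a, b]


def split(t):
    """Undo join: the last dimension of t must have size 2."""
    if isinstance(t[0], list):
        parts = [split(x) for x in t]
        return [p[0] for p in parts], [p[1] for p in parts]
    if len(t) != 2:
        raise ValueError("last dimension must have size 2")
    return t[0], t[1]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
joined = join(a, b)          # shape (2, 2) and (2, 2) -> (2, 2, 2)
first, second = split(joined)
print(joined)
print(first, second)
```

Joining two shape-(2, 2) inputs yields a shape-(2, 2, 2) result, and splitting that result returns the original pair.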
Memory Ops
Return a tensor of data whose values are loaded from memory at location defined by pointer:
Store a tensor of data into memory locations defined by pointer.
Atomic Ops
Performs an atomic add at the memory location specified by
Performs an atomic logical and at the memory location specified by
Performs an atomic compare-and-swap at the memory location specified by
Performs an atomic max at the memory location specified by
Performs an atomic min at the memory location specified by
Performs an atomic logical or at the memory location specified by
Performs an atomic exchange at the memory location specified by
Performs an atomic logical xor at the memory location specified by
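The read-modify-write behaviour shared by the atomic operations above can be modelled sequentially. The sketch below is plain Python, not Gluon: `atomic_add` and `atomic_cas` are hypothetical helpers operating on a Python list that stands in for memory, and returning the previous value is an assumption based on how atomics conventionally behave, not something this page states. Real atomics also guarantee indivisibility across threads, which a single-threaded model cannot show.

```python
def atomic_add(memory, index, val):
    """Add val to memory[index] and return the value that was there before."""
    old = memory[index]
    memory[index] = old + val
    return old


def atomic_cas(memory, index, cmp, val):
    """Write val only if memory[index] equals cmp; always return the old value."""
    old = memory[index]
    if old == cmp:
        memory[index] = val
    return old


# Using compare-and-swap as a simple lock: 0 means free, 1 means held.
lock = [0]
first_attempt = atomic_cas(lock, 0, cmp=0, val=1)   # succeeds, sees 0
second_attempt = atomic_cas(lock, 0, cmp=0, val=1)  # fails, sees 1

counter = [10]
previous = atomic_add(counter, 0, 5)
print(first_attempt, second_attempt, lock, previous, counter)
```

The caller learns whether a compare-and-swap took effect by comparing the returned value with `cmp`.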
Linear Algebra Ops
Indexing Ops
Gather from a tensor along a given dimension.
Returns a tensor of elements from either
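Gathering along a dimension means the index tensor replaces the coordinate on that dimension while the other coordinates are kept. The following plain-Python sketch illustrates this for 2D nested lists; the `gather` helper, its argument names, and the restriction to two dimensions are assumptions made for illustration and do not reproduce the Gluon signature.

```python
def gather(src, index, axis):
    """Gather from a 2D nested list along axis 0 or 1.

    The output has the shape of index. On the gathered axis the coordinate
    comes from index; on the other axis it is the output coordinate itself.
    """
    rows = range(len(index))
    if axis == 0:
        return [[src[index[i][j]][j] for j in range(len(index[i]))] for i in rows]
    if axis == 1:
        return [[src[i][index[i][j]] for j in range(len(index[i]))] for i in rows]
    raise ValueError("this sketch only handles axis 0 or 1")


src = [[10, 11, 12],
       [20, 21, 22]]
along_cols = gather(src, [[2, 0], [1, 1]], axis=1)
along_rows = gather(src, [[1, 0, 1]], axis=0)
print(along_cols)
print(along_rows)
```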
Math Ops
Computes the element-wise absolute value of
Computes the ceiling division of
Computes the element-wise ceil of
Clamps the input tensor
Computes the element-wise cosine of
Computes the element-wise precise division (rounding to nearest wrt the IEEE standard) of
Computes the element-wise error function of
Computes the element-wise exponential of
Computes the element-wise exponential (base 2) of
Computes the element-wise fast division of
Computes the element-wise floor of
Computes the element-wise fused multiply-add of
Upcast a tensor from fp4 (e2m1) to another floating point type.
Computes the element-wise natural logarithm of
Computes the element-wise logarithm (base 2) of
Computes the element-wise maximum of
Computes the element-wise minimum of
Computes the element-wise inverse square root of
Computes the element-wise sine of
Computes the element-wise fast square root of
Computes the element-wise precise square root (rounding to nearest wrt the IEEE standard) of
Computes the element-wise most significant N bits of the 2N-bit product of
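Two of the integer operations above have compact arithmetic definitions that are worth spelling out: ceiling division, and the high half of a double-width product. The sketch below is scalar Python rather than Gluon code; the names `cdiv` and `umulhi` and the default width of 32 bits are assumptions chosen for illustration, and the element-wise tensor behaviour is not modelled.

```python
def cdiv(x, div):
    """Ceiling division for non-negative x and positive div."""
    return (x + div - 1) // div


def umulhi(x, y, bits=32):
    """Most significant `bits` bits of the 2*bits-bit unsigned product x * y."""
    mask = (1 << bits) - 1
    # Python integers are unbounded, so the full product is exact.
    return ((x & mask) * (y & mask)) >> bits


print(cdiv(10, 4))                       # 3
print(umulhi(1 << 31, 2))                # 1
print(hex(umulhi(0xFFFFFFFF, 0xFFFFFFFF)))  # 0xfffffffe
```

For example, 2^31 times 2 is 2^32, which does not fit in 32 bits: the low half is 0 and the high half is 1.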
Reduction Ops
Applies the combine_fn to all elements in
Returns the reduce_or of all elements in the
Returns the sum of all elements in the
Returns the maximum of all elements in the
Returns the minimum of all elements in the
Returns the xor sum of all elements in the
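The named reductions above can all be viewed as the generic combine_fn reduction with a particular binary function. A minimal sketch in plain Python, assuming a 2D nested list as the input and a hypothetical `reduce_axis` helper (neither is part of the Gluon API):

```python
from functools import reduce
import operator


def reduce_axis(rows, combine_fn, axis):
    """Reduce a 2D nested list along axis 0 (columns) or axis 1 (rows)."""
    if axis == 1:
        return [reduce(combine_fn, row) for row in rows]
    if axis == 0:
        return [reduce(combine_fn, col) for col in zip(*rows)]
    raise ValueError("this sketch only handles axis 0 or 1")


x = [[1, 2, 3],
     [4, 5, 6]]
row_sums = reduce_axis(x, operator.add, axis=1)
col_max = reduce_axis(x, max, axis=0)
row_xor = reduce_axis(x, operator.xor, axis=1)
print(row_sums, col_max, row_xor)
```

The combine function should be associative, since a parallel implementation is free to group the elements in any order.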
Scan Ops
Applies the combine_fn to each element with a carry in
Compute a histogram of a 1D integer tensor.
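A scan keeps every intermediate result of a reduction, and a histogram counts how often each integer value occurs. The sketch below models both on Python lists. The helper names `scan` and `histogram` are hypothetical, the scan is assumed to be inclusive, and the histogram assumes unit-width bins starting at 0 with out-of-range values ignored; none of these details are stated on this page.

```python
from itertools import accumulate
import operator


def scan(values, combine_fn):
    """Inclusive scan: element i combines the carry from 0..i-1 with values[i]."""
    return list(accumulate(values, combine_fn))


def histogram(values, num_bins):
    """Count occurrences of each integer in [0, num_bins)."""
    counts = [0] * num_bins
    for v in values:
        if 0 <= v < num_bins:  # assumption: values outside the bins are skipped
            counts[v] += 1
    return counts


prefix_sums = scan([1, 2, 3, 4], operator.add)
running_max = scan([3, 1, 4, 1, 5], max)
counts = histogram([0, 1, 1, 3, 3, 3], num_bins=4)
print(prefix_sums, running_max, counts)
```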
Layout Classes
Represents a blocked layout, partitioning a tensor across threads, warps, and CTAs.
Represents a layout for a dot operand.
Represents a linear distributed layout with explicit bases at register, lane, warp, and block levels.
Represents a layout for NVIDIA MMA (tensor core) operations.
Represents a layout for shared memory suitable for NVIDIA MMA operations.
Represents a layout for the access to shared memory.
Represents a shared memory layout defined via an explicit LinearLayout.
Represents a layout corresponding to slicing a distributed tensor along one dimension.
Represents a generic swizzled shared memory layout.
Iterators
Iterator that counts upward forever.
Inline Assembly
Execute inline assembly over a tensor.
Compiler Hints and Debugging
Allow compiler to assume the
Let the compiler know that the value first values in
Let the compiler know that the value first values in
Let the compiler know that the values in
Assert the condition at compile time.
Print the values at compile time.
Assert the condition at runtime from the device.
Print the values at runtime from the device.