Language API

Types

tensor

Represents an N-dimensional array of values or pointers.

shared_memory_descriptor

Represents a handle to a shared memory allocation in Gluon IR.

distributed_type

Represents the type of a tensor distributed across threads, warps, and CTAs.

Programming Model

program_id

Returns the id of the current program instance along the given axis.

num_programs

Returns the number of program instances launched along the given axis.
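
As a plain-Python sketch (not Gluon code), program_id and num_programs are typically combined so each program instance owns a disjoint slice of the work; the block size and grid-stride pattern below are illustrative assumptions, not part of the API:

```python
def tiles_for_program(pid, num_programs, n_elements, block_size):
    # Each program instance covers blocks pid, pid + num_programs, ...
    # (a grid-stride pattern); return the element offsets it owns.
    offsets = []
    for block_start in range(pid * block_size, n_elements,
                             num_programs * block_size):
        offsets.extend(range(block_start, min(block_start + block_size,
                                              n_elements)))
    return offsets

# With 2 programs, 10 elements, and block size 4, the two programs
# together cover every element exactly once.
covered = sorted(tiles_for_program(0, 2, 10, 4) +
                 tiles_for_program(1, 2, 10, 4))
```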

num_warps

Returns the number of warps that execute the current context, including in warp-specialized regions.

num_ctas

Returns the number of CTAs in the current kernel.

warp_specialize

Create a warp-specialized execution region, partitioning work across warps.

barrier

Insert a barrier to synchronize threads within a CTA, or across a cluster.

Creation Ops

allocate_shared_memory

Allocate shared memory for a tensor with the given element type, shape, and layout.

arange

Generate a sequence tensor with values in [start, end) using a specified layout.

cast

Casts a tensor to the given dtype.

full

Create a tensor filled with a scalar value, with specified shape, dtype, and layout.

full_like

Create a tensor with the same properties as a given tensor, filled with a specified value.

zeros

Create a tensor filled with zeros.

zeros_like

Create a tensor with the same properties as a given tensor, filled with zeros.

to_tensor

Converts a value to a tensor.

Layout Ops

bank_conflicts

Count the bank conflicts per wavefront of each instruction generated when reading/writing the distributed tensor from/to the shared memory descriptor using ld.shared/st.shared instructions.

convert_layout

Convert a tensor to a different distributed layout.

set_auto_layout

Set a tensor with AutoLayout to a concrete layout.

to_linear_layout

Converts a layout to its equivalent linear layout representation.

Shape Manipulation Ops

broadcast

Tries to broadcast the two given blocks to a common compatible shape.
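
The shape rule can be sketched in plain Python, assuming the familiar right-aligned broadcasting convention in which length-1 dimensions stretch to match; this mirrors NumPy-style semantics and is an assumption about how shapes are resolved, not Gluon source:

```python
def broadcast_shapes(a, b):
    # Right-align the two shapes, padding the shorter one with 1s,
    # then take the larger extent of each dimension pair.
    a = (1,) * (len(b) - len(a)) + tuple(a)
    b = (1,) * (len(a) - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"incompatible dims {x} and {y}")
        out.append(max(x, y))
    return tuple(out)
```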

expand_dims

Expand the shape of a tensor, by inserting new length-1 dimensions.

join

Join the given tensors in a new, minor dimension.

map_elementwise

Map a scalar function over a tensor.

permute

Permutes the dimensions of a tensor.

ravel

Returns a contiguous flattened view of x.

reshape

Returns a tensor with the same number of elements as input but with the provided shape.

split

Split a tensor in two along its last dim, which must have size 2.
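
A plain-Python sketch (nested lists standing in for tensors, with hypothetical helper names) of how join and split invert each other along a trailing size-2 dimension:

```python
def join_pair(a, b):
    # Pair up corresponding elements in a new, minor dimension of size 2.
    return [[x, y] for x, y in zip(a, b)]

def split_pair(t):
    # Split along the last dim, which must have size 2.
    assert all(len(p) == 2 for p in t)
    return [p[0] for p in t], [p[1] for p in t]

a, b = [1, 2, 3], [4, 5, 6]
joined = join_pair(a, b)
```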

Memory Ops

load

Return a tensor of data whose values are loaded from memory at the locations defined by pointer.

store

Store a tensor of data into memory locations defined by pointer.

Atomic Ops

atomic_add

Performs an atomic add at the memory location specified by pointer.

atomic_and

Performs an atomic logical and at the memory location specified by pointer.

atomic_cas

Performs an atomic compare-and-swap at the memory location specified by pointer.
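
A single-threaded model of the compare-and-swap value contract (the hardware operation is additionally atomic across threads, which this sketch does not capture):

```python
def atomic_cas_model(memory, addr, cmp, val):
    # Store val only if the current value equals cmp;
    # always return the value observed before the operation.
    old = memory[addr]
    if old == cmp:
        memory[addr] = val
    return old

mem = {0: 7}
first = atomic_cas_model(mem, 0, 7, 9)    # succeeds: 7 == 7
second = atomic_cas_model(mem, 0, 7, 11)  # fails: value is now 9
```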

atomic_max

Performs an atomic max at the memory location specified by pointer.

atomic_min

Performs an atomic min at the memory location specified by pointer.

atomic_or

Performs an atomic logical or at the memory location specified by pointer.

atomic_xchg

Performs an atomic exchange at the memory location specified by pointer.

atomic_xor

Performs an atomic logical xor at the memory location specified by pointer.

Linear Algebra Ops

dot_fma

Computes a matrix product using fused multiply-add (FMA) instructions.

Indexing Ops

gather

Gather from a tensor along a given dimension.

where

Returns a tensor of elements from either x or y, depending on condition.
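
The selection semantics can be sketched in plain Python over flat lists (not Gluon code; real tensors broadcast first):

```python
def where(condition, x, y):
    # Elementwise select: take from x where condition holds, else from y.
    return [a if c else b for c, a, b in zip(condition, x, y)]

picked = where([True, False, True], [1, 2, 3], [9, 8, 7])
```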

Math Ops

abs

Computes the element-wise absolute value of x.

add

Computes the element-wise sum of x and y.

cdiv

Computes the ceiling division of x by div.
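
For non-negative integers, ceiling division is the standard add-then-floor trick, sketched here in plain Python:

```python
def cdiv(x, div):
    # Adding div - 1 bumps any nonzero remainder into the next quotient,
    # giving ceil(x / div) without floating point.
    return (x + div - 1) // div
```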

ceil

Computes the element-wise ceil of x.

clamp

Clamps the input tensor x within the range [min, max].
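
A plain-Python sketch of the clamping semantics (the real op operates elementwise on tensors):

```python
def clamp(x, min_val, max_val):
    # Clamp each element into the closed range [min_val, max_val].
    return [min(max(v, min_val), max_val) for v in x]

out = clamp([-1.0, 0.5, 2.0], 0.0, 1.0)
```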

cos

Computes the element-wise cosine of x.

div_rn

Computes the element-wise precise division (rounding to nearest wrt the IEEE standard) of x and y.

erf

Computes the element-wise error function of x.

exp

Computes the element-wise exponential of x.

exp2

Computes the element-wise exponential (base 2) of x.

fdiv

Computes the element-wise fast division of x and y.

floor

Computes the element-wise floor of x.

fma

Computes the element-wise fused multiply-add of x, y, and z.

fp4_to_fp

Upcast a tensor from fp4 (e2m1) to another floating point type.

log

Computes the element-wise natural logarithm of x.

log2

Computes the element-wise logarithm (base 2) of x.

maximum

Computes the element-wise maximum of x and y.

minimum

Computes the element-wise minimum of x and y.

mul

Computes the element-wise product of x and y.

rsqrt

Computes the element-wise inverse square root of x.

sin

Computes the element-wise sine of x.

sqrt

Computes the element-wise fast square root of x.

sqrt_rn

Computes the element-wise precise square root (rounding to nearest wrt the IEEE standard) of x.

sub

Computes the element-wise difference of x and y.

umulhi

Computes the element-wise most significant N bits of the 2N-bit product of x and y.
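
For scalar unsigned integers the semantics reduce to shifting the full-width product, sketched here in plain Python with the bit width as a parameter:

```python
def umulhi(x, y, bits=32):
    # Upper `bits` bits of the 2*bits-bit product of two unsigned ints.
    return ((x * y) >> bits) & ((1 << bits) - 1)
```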

Reduction Ops

reduce

Applies the combine_fn to all elements in input tensors along the provided axis.
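
A plain-Python sketch of the reduction semantics for a 2-D nested list (hypothetical helper name; the named reductions below fix combine_fn to a specific operator):

```python
from functools import reduce as fold

def reduce_axis(t, combine_fn, axis):
    # Fold combine_fn over one axis of a 2-D nested list:
    # axis 0 folds down columns, axis 1 folds across rows.
    if axis == 0:
        return [fold(combine_fn, col) for col in zip(*t)]
    return [fold(combine_fn, row) for row in t]

t = [[1, 2, 3], [4, 5, 6]]
```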

reduce_or

Returns the or-reduction of all elements in the input tensor along the provided axis.

sum

Returns the sum of all elements in the input tensor along the provided axis.

max

Returns the maximum of all elements in the input tensor along the provided axis.

min

Returns the minimum of all elements in the input tensor along the provided axis.

xor_sum

Returns the xor sum of all elements in the input tensor along the provided axis.

Scan Ops

associative_scan

Applies the combine_fn to each element of the input tensors along the provided axis, combining it with a carry and updating the carry at each step.
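
The carry-and-combine pattern amounts to an inclusive scan, sketched here in plain Python over a flat list:

```python
def associative_scan(xs, combine_fn, carry):
    # Combine each element with the running carry; the sequence of
    # updated carries is the (inclusive) scan output.
    out = []
    for x in xs:
        carry = combine_fn(carry, x)
        out.append(carry)
    return out, carry

scanned, final = associative_scan([1, 2, 3, 4], lambda a, b: a + b, 0)
```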

histogram

Compute a histogram of a 1D integer tensor.
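
A plain-Python sketch of histogram semantics over a 1-D integer list; the bin range [0, num_bins) and the dropping of out-of-range values are assumptions for illustration:

```python
def histogram(xs, num_bins):
    # Count how many inputs fall into each integer bin in [0, num_bins).
    counts = [0] * num_bins
    for x in xs:
        if 0 <= x < num_bins:
            counts[x] += 1
    return counts

h = histogram([0, 1, 1, 3, 5], 4)  # 5 falls outside the bin range
```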

Layout Classes

AutoLayout

Represents a deferred layout that must later be resolved to a concrete layout (e.g. via set_auto_layout).

BlockedLayout

Represents a blocked layout, partitioning a tensor across threads, warps, and CTAs.

CoalescedLayout

DotOperandLayout

Represents a layout for a dot operand.

DistributedLinearLayout

Represents a linear distributed layout with explicit bases at register, lane, warp, and block levels.

NVMMADistributedLayout

Represents a layout for NVIDIA MMA (tensor core) operations.

NVMMASharedLayout

Represents a layout for shared memory suitable for NVIDIA MMA operations.

PaddedSharedLayout

Represents a shared memory layout that inserts padding at fixed intervals to mitigate bank conflicts.

SharedLinearLayout

Represents a shared memory layout defined via an explicit LinearLayout.

SliceLayout

Represents a layout corresponding to slicing a distributed tensor along one dimension.

SwizzledSharedLayout

Represents a generic swizzled shared memory layout.

Iterators

static_range

Range iterator whose trip count is known at compile time, causing the loop to be fully unrolled.

Inline Assembly

inline_asm_elementwise

Execute inline assembly over a tensor.

Compiler Hints and Debugging

assume

Allow compiler to assume the cond is True.

max_constancy

Let the compiler know that the first value values in input are constant.

max_contiguous

Let the compiler know that the first value values in input are contiguous.

multiple_of

Let the compiler know that the values in input are all multiples of value.

static_assert

Assert the condition at compile time.

static_print

Print the values at compile time.

device_assert

Assert the condition at runtime from the device.

device_print

Print the values at runtime from the device.