Language API

Types

tensor

Represents an N-dimensional array of values or pointers.

shared_memory_descriptor

Represents a handle to a shared memory allocation in Gluon IR.

distributed_type

Represents the type of a tensor distributed across threads, warps, and CTAs.

Programming Model

program_id

Returns the id of the current program instance along the given axis.

num_programs

Returns the number of program instances launched along the given axis.
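
As a plain-Python sketch (not Gluon code), program_id and num_programs are typically combined so each program instance owns a disjoint slice of the work; the block size and grid-stride pattern below are illustrative assumptions, not part of the API:

```python
def tiles_for_program(pid, num_programs, n_elements, block_size):
    # Each program instance covers blocks pid, pid + num_programs, ...
    # (a grid-stride pattern); return the element offsets it owns.
    offsets = []
    for block_start in range(pid * block_size, n_elements,
                             num_programs * block_size):
        offsets.extend(range(block_start, min(block_start + block_size,
                                              n_elements)))
    return offsets

# With 2 programs, 10 elements, and block size 4, the two programs
# together cover every element exactly once.
covered = sorted(tiles_for_program(0, 2, 10, 4) +
                 tiles_for_program(1, 2, 10, 4))
```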

num_warps

Returns the number of warps that execute the current context, including in warp-specialized regions.

num_ctas

Returns the number of CTAs in the current kernel.

warp_specialize

Create a warp-specialized execution region, partitioning work across warps.

barrier

Insert a barrier to synchronize threads within a CTA, or across a cluster.

Creation Ops

allocate_shared_memory

Allocate shared memory for a tensor with the given element type, shape, and layout.

arange

Generate a sequence tensor with values in [start, end) using a specified layout.

cast

Casts a tensor to the given dtype.

full

Create a tensor filled with a scalar value, with specified shape, dtype, and layout.

full_like

Create a tensor with the same properties as a given tensor, filled with a specified value.

zeros

Create a tensor filled with zeros.

zeros_like

Create a tensor with the same properties as a given tensor, filled with zeros.

to_tensor

Converts a value to a tensor.

Layout Ops

bank_conflicts

Count the bank conflicts per wavefront of each instruction generated when reading/writing the distributed tensor from/to the shared memory descriptor using ld.shared/st.shared instructions.

convert_layout

Convert a tensor to a different distributed layout.

set_auto_layout

Set a tensor with AutoLayout to a concrete layout.

to_linear_layout

Converts a layout to its equivalent linear layout representation.

Shape Manipulation Ops

broadcast

Tries to broadcast the two given blocks to a common compatible shape.
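
The shape rule can be sketched in plain Python, assuming the familiar right-aligned broadcasting convention in which length-1 dimensions stretch to match; this mirrors NumPy-style semantics and is an assumption about how shapes are resolved, not Gluon source:

```python
def broadcast_shapes(a, b):
    # Right-align the two shapes, padding the shorter one with 1s,
    # then take the larger extent of each dimension pair.
    a = (1,) * (len(b) - len(a)) + tuple(a)
    b = (1,) * (len(a) - len(b)) + tuple(b)
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"incompatible dims {x} and {y}")
        out.append(max(x, y))
    return tuple(out)
```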

expand_dims

Expand the shape of a tensor, by inserting new length-1 dimensions.

join

Join the given tensors in a new, minor dimension.

map_elementwise

Map a scalar function over a tensor.

permute

Permutes the dimensions of a tensor.

ravel

Returns a contiguous flattened view of x.

reshape

Returns a tensor with the same number of elements as input but with the provided shape.

split

Split a tensor in two along its last dim, which must have size 2.
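
A plain-Python sketch (nested lists standing in for tensors, with hypothetical helper names) of how join and split invert each other along a trailing size-2 dimension:

```python
def join_pair(a, b):
    # Pair up corresponding elements in a new, minor dimension of size 2.
    return [[x, y] for x, y in zip(a, b)]

def split_pair(t):
    # Split along the last dim, which must have size 2.
    assert all(len(p) == 2 for p in t)
    return [p[0] for p in t], [p[1] for p in t]

a, b = [1, 2, 3], [4, 5, 6]
joined = join_pair(a, b)
```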

Memory Ops

load

Return a tensor of data whose values are loaded from memory at the locations defined by pointer.

store

Store a tensor of data into memory locations defined by pointer.

Atomic Ops

atomic_add

Performs an atomic add at the memory location specified by pointer.

atomic_and

Performs an atomic logical and at the memory location specified by pointer.

atomic_cas

Performs an atomic compare-and-swap at the memory location specified by pointer.
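
A single-threaded model of the compare-and-swap value contract (the hardware operation is additionally atomic across threads, which this sketch does not capture):

```python
def atomic_cas_model(memory, addr, cmp, val):
    # Store val only if the current value equals cmp;
    # always return the value observed before the operation.
    old = memory[addr]
    if old == cmp:
        memory[addr] = val
    return old

mem = {0: 7}
first = atomic_cas_model(mem, 0, 7, 9)    # succeeds: 7 == 7
second = atomic_cas_model(mem, 0, 7, 11)  # fails: value is now 9
```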

atomic_max

Performs an atomic max at the memory location specified by pointer.

atomic_min

Performs an atomic min at the memory location specified by pointer.

atomic_or

Performs an atomic logical or at the memory location specified by pointer.

atomic_xchg

Performs an atomic exchange at the memory location specified by pointer.

atomic_xor

Performs an atomic logical xor at the memory location specified by pointer.

Linear Algebra Ops

dot_fma

Computes a matrix product using fused multiply-add (FMA) instructions.

Indexing Ops

gather

Gather from a tensor along a given dimension.

where

Returns a tensor of elements from either x or y, depending on condition.
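
The selection semantics can be sketched in plain Python over flat lists (not Gluon code; real tensors broadcast first):

```python
def where(condition, x, y):
    # Elementwise select: take from x where condition holds, else from y.
    return [a if c else b for c, a, b in zip(condition, x, y)]

picked = where([True, False, True], [1, 2, 3], [9, 8, 7])
```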

Math Ops

abs

Computes the element-wise absolute value of x.

add

Computes the element-wise sum of x and y.

cdiv

Computes the ceiling division of x by div.
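
For non-negative integers, ceiling division is the standard add-then-floor trick, sketched here in plain Python:

```python
def cdiv(x, div):
    # Adding div - 1 bumps any nonzero remainder into the next quotient,
    # giving ceil(x / div) without floating point.
    return (x + div - 1) // div
```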

ceil

Computes the element-wise ceil of x.

clamp

Clamps the input tensor x within the range [min, max].
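
A plain-Python sketch of the clamping semantics (the real op operates elementwise on tensors):

```python
def clamp(x, min_val, max_val):
    # Clamp each element into the closed range [min_val, max_val].
    return [min(max(v, min_val), max_val) for v in x]

out = clamp([-1.0, 0.5, 2.0], 0.0, 1.0)
```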

cos

Computes the element-wise cosine of x.

div_rn

Computes the element-wise precise division (rounding to nearest wrt the IEEE standard) of x and y.

erf

Computes the element-wise error function of x.

exp

Computes the element-wise exponential of x.

exp2

Computes the element-wise exponential (base 2) of x.

fdiv

Computes the element-wise fast division of x and y.

floor

Computes the element-wise floor of x.

fma

Computes the element-wise fused multiply-add of x, y, and z.

fp4_to_fp

Upcast a tensor from fp4 (e2m1) to another floating point type.

log

Computes the element-wise natural logarithm of x.

log2

Computes the element-wise logarithm (base 2) of x.

maximum

Computes the element-wise maximum of x and y.

minimum

Computes the element-wise minimum of x and y.

mul

Computes the element-wise product of x and y.

rsqrt

Computes the element-wise inverse square root of x.

sin

Computes the element-wise sine of x.

sqrt

Computes the element-wise fast square root of x.

sqrt_rn

Computes the element-wise precise square root (rounding to nearest wrt the IEEE standard) of x.

sub

Computes the element-wise difference of x and y.

umulhi

Computes the element-wise most significant N bits of the 2N-bit product of x and y.
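
For scalar unsigned integers the semantics reduce to shifting the full-width product, sketched here in plain Python with the bit width as a parameter:

```python
def umulhi(x, y, bits=32):
    # Upper `bits` bits of the 2*bits-bit product of two unsigned ints.
    return ((x * y) >> bits) & ((1 << bits) - 1)
```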

Reduction Ops

reduce

Applies the combine_fn to all elements in input tensors along the provided axis.
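
A plain-Python sketch of the reduction semantics for a 2-D nested list (hypothetical helper name; the named reductions below fix combine_fn to a specific operator):

```python
from functools import reduce as fold

def reduce_axis(t, combine_fn, axis):
    # Fold combine_fn over one axis of a 2-D nested list:
    # axis 0 folds down columns, axis 1 folds across rows.
    if axis == 0:
        return [fold(combine_fn, col) for col in zip(*t)]
    return [fold(combine_fn, row) for row in t]

t = [[1, 2, 3], [4, 5, 6]]
```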

reduce_or

Returns the or-reduction of all elements in the input tensor along the provided axis.

sum

Returns the sum of all elements in the input tensor along the provided axis.

max

Returns the maximum of all elements in the input tensor along the provided axis.

min

Returns the minimum of all elements in the input tensor along the provided axis.

xor_sum

Returns the xor sum of all elements in the input tensor along the provided axis.

Scan Ops

associative_scan

Applies the combine_fn to each element of the input tensors along the provided axis, combining it with a carry and updating the carry at each step.
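
The carry-and-combine pattern amounts to an inclusive scan, sketched here in plain Python over a flat list:

```python
def associative_scan(xs, combine_fn, carry):
    # Combine each element with the running carry; the sequence of
    # updated carries is the (inclusive) scan output.
    out = []
    for x in xs:
        carry = combine_fn(carry, x)
        out.append(carry)
    return out, carry

scanned, final = associative_scan([1, 2, 3, 4], lambda a, b: a + b, 0)
```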

histogram

Compute a histogram of a 1D integer tensor.
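
A plain-Python sketch of histogram semantics over a 1-D integer list; the bin range [0, num_bins) and the dropping of out-of-range values are assumptions for illustration:

```python
def histogram(xs, num_bins):
    # Count how many inputs fall into each integer bin in [0, num_bins).
    counts = [0] * num_bins
    for x in xs:
        if 0 <= x < num_bins:
            counts[x] += 1
    return counts

h = histogram([0, 1, 1, 3, 5], 4)  # 5 falls outside the bin range
```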

Layout Classes

AutoLayout

Represents a deferred layout that must later be resolved to a concrete layout (e.g. via set_auto_layout).

BlockedLayout

Represents a blocked layout, partitioning a tensor across threads, warps, and CTAs.

CoalescedLayout

DotOperandLayout

Represents a layout for a dot operand.

DistributedLinearLayout

Represents a linear distributed layout with explicit bases at register, lane, warp, and block levels.

NVMMADistributedLayout

Represents a layout for NVIDIA MMA (tensor core) operations.

NVMMASharedLayout

Represents a layout for shared memory suitable for NVIDIA MMA operations.

PaddedSharedLayout

Represents a shared memory layout that inserts padding at fixed intervals to mitigate bank conflicts.

SharedLinearLayout

Represents a shared memory layout defined via an explicit LinearLayout.

SliceLayout

Represents a layout corresponding to slicing a distributed tensor along one dimension.

SwizzledSharedLayout

Represents a generic swizzled shared memory layout.

Iterators

static_range

Range iterator whose trip count is known at compile time, causing the loop to be fully unrolled.

Inline Assembly

inline_asm_elementwise

Execute inline assembly over a tensor.

Compiler Hints and Debugging

assume

Allow compiler to assume the cond is True.

max_constancy

Let the compiler know that the first value values in input are constant.

max_contiguous

Let the compiler know that the first value values in input are contiguous.

multiple_of

Let the compiler know that the values in input are all multiples of value.

static_assert

Assert the condition at compile time.

static_print

Print the values at compile time.

device_assert

Assert the condition at runtime from the device.

device_print

Print the values at runtime from the device.