Floating-Point Sanitizer (FpSan)

FpSan is a compiler instrumentation mode that rewrites selected floating-point Triton IR operations into deterministic “payload algebra” over integer bit-patterns. Its goal is not to approximate IEEE floating-point arithmetic. Instead, it preserves selected algebraic structure so that kernels that are symbolically equivalent under the sanitized semantics continue to agree, while wrong rewrites, wrong operands, wrong dataflow, or missing synchronization tend to perturb the result.

This makes FpSan primarily a kernel-checking tool. It is especially useful when you care more about whether a kernel preserves the intended symbolic computation than about its exact IEEE result on a particular input.

At a high level, FpSan:

  • maps floating-point bit-patterns into an integer payload domain

  • replaces supported floating-point ops with integer-domain rewrites chosen to preserve selected identities exactly

  • maps the resulting payload back into a floating-point bit-pattern so the rest of the pipeline can continue

Enabling FpSan

Enable FpSan before the compile or run you want to instrument.

From Python:

import triton

triton.knobs.compilation.instrumentation_mode = "fpsan"
# compile and run kernels here
triton.knobs.compilation.instrumentation_mode = ""

From the shell:

TRITON_INSTRUMENTATION_MODE=fpsan python your_script.py

Notes:

  • FpSan is a compiler feature, so it does not apply in interpreter mode.

  • On AMD, the backend currently enables FpSan only for gfx942, gfx950, and gfx1250.

How to Use It

The most effective way to use FpSan is to compare two kernels, or two versions of one kernel, under the same FpSan mode. Typical uses include:

  • comparing an optimized kernel against a simple reference kernel

  • comparing a fused kernel against an unfused composition

  • comparing two schedule variants that should be mathematically equivalent

  • checking that accumulator selection, predication, or TMEM pipelines preserve the intended payload flow

FpSan results should only be compared against other FpSan results, not against ordinary floating-point outputs.

Payload Model

For each floating-point width w, FpSan defines a bijection between floating-point bit-patterns and w-bit integer payloads; arithmetic wraps modulo 2^w.

Conceptually:

  • embed(x) maps a float bit-pattern to an integer payload

  • unembed(u) maps an integer payload back to a float bit-pattern

  • sanitized float ops are implemented as unembed(F(embed(...)))

The embedding is deliberately chosen so that a few important constants are stable:

  • embed(+0.0) = 0

  • embed(+1.0) = 1

  • embed(-1.0) = all-ones

Those fixed points are the reason identities such as x + 0 = x and x * 1 = x behave naturally under FpSan.

What FpSan Preserves

FpSan preserves exact identities in the payload algebra selected by each rewrite. The most important ones are:

  • ring identities for add, subtract, multiply, FMA, and dot-like accumulation

  • selected exponential identities for exp and exp2 (see below for details)

  • trigonometric identities for sin and cos

  • payload equality through casts, loads, stores, and copies

  • deterministic op-distinguishing tags for unary functions that do not yet have a richer algebraic model

This is what makes FpSan valuable for kernel checks: if two kernels should be the same symbolic computation under the preserved properties, they should produce the same payloads. This guarantee rests on Schanuel’s conjecture, a widely believed conjecture in transcendental number theory. One of the authors of FpSan has a [blog post](https://cp4space.hatsya.com/2026/05/03/schanuels-conjecture-and-the-semantics-of-fpsan/) explaining the theory behind FpSan from a mathematical perspective.

What FpSan Does Not Preserve

FpSan is not an IEEE simulator.

In particular, do not rely on it for:

  • real floating-point ordering, rounding, NaN propagation, infinities, subnormals, or exceptions

  • real transcendental semantics for log, sqrt, erf, floor, ceil, rsqrt, and similar tagged unary ops

  • expected floating-point bit patterns (for example, kernels that bitcast between floats and integers will observe payload bits, not IEEE bits)

When a property matters for your check, the right question is: “is this property preserved by the payload rewrite for this specific op family?”

Common Arithmetic Ops

Add, Sub, Mul

Supported operations:

  • x + y

  • x - y

  • x * y

Rewrite:

  • add, subtract, or multiply the embedded payloads, then unembed the result

Exact preserved properties:

  • x + 0 = x

  • x - 0 = x

  • x - x = 0

  • x * 1 = x

  • associativity and commutativity of add and mul

  • distributivity of mul over add

Important caveat:

  • This is ring arithmetic modulo 2^w, not IEEE arithmetic.
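The preserved identities above can be checked directly in a toy payload model. Everything below (the width, the sample payloads) is illustrative and not FpSan's actual implementation; embed/unembed are elided, and the documented fixed points (+0.0 ↦ 0, +1.0 ↦ 1, -1.0 ↦ all-ones) are used directly:

```python
# Toy payload ring for a hypothetical width w = 16; arithmetic wraps mod 2^w.
W = 1 << 16

def p_add(x, y): return (x + y) % W
def p_sub(x, y): return (x - y) % W
def p_mul(x, y): return (x * y) % W

ZERO, ONE, NEG_ONE = 0, 1, W - 1  # payloads of +0.0, +1.0, -1.0

x, y, z = 0x1234, 0xBEEF, 0x0F0F
assert p_add(x, ZERO) == x                   # x + 0 = x
assert p_sub(x, ZERO) == x                   # x - 0 = x
assert p_sub(x, x) == ZERO                   # x - x = 0
assert p_mul(x, ONE) == x                    # x * 1 = x
assert p_mul(x, NEG_ONE) == p_sub(ZERO, x)   # x * -1 = -x
assert p_add(x, y) == p_add(y, x)            # commutativity
assert p_mul(x, p_add(y, z)) == p_add(p_mul(x, y), p_mul(x, z))  # distributivity
```

Because the fixed points for 0 and 1 land on the ring's additive and multiplicative identities, the float identities x + 0 = x and x * 1 = x hold exactly, even though the payload values have nothing to do with IEEE arithmetic.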

Min and Max

Supported operations:

  • tl.minimum(x, y)

  • tl.maximum(x, y)

  • min(x, y) and max(x, y) in Triton code

Rewrite:

  • signed integer min or max on payloads

Exact preserved properties:

  • idempotence: min(x, x) = x and max(x, x) = x

  • commutativity

  • associativity

Important caveats:

  • The order is the signed integer order of payloads, not IEEE float order.

  • NaN handling and the exact signed-zero contract are not modeled.
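A small sketch of the signed-payload order, with a hypothetical width of 16 bits. Note that the fixed points order correctly under signed comparison (the all-ones payload of -1.0 reads as signed -1, which sorts below 0 and +1), which is presumably why signed rather than unsigned order is used:

```python
W_BITS = 16
W = 1 << W_BITS

def to_signed(u):
    # interpret a w-bit payload as a signed integer
    return u - W if u >= W // 2 else u

def p_min(x, y): return x if to_signed(x) <= to_signed(y) else y
def p_max(x, y): return x if to_signed(x) >= to_signed(y) else y

NEG_ONE = W - 1  # payload of -1.0 (all-ones)
assert p_min(NEG_ONE, 1) == NEG_ONE         # -1 < +1 in signed payload order
assert p_max(NEG_ONE, 0) == 0
assert p_min(0x1234, 0x1234) == 0x1234      # idempotence
assert p_min(3, 7) == p_min(7, 3)           # commutativity
```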

Division

Supported operation:

  • x / y

Rewrite:

  • x / y becomes embed(x) * inv(embed(y)), then unembed

Here inv is:

  • the true modular inverse for odd payloads

  • a parity-preserving involution for even payloads

Exact preserved properties:

  • x / 1 = x

  • 1 / (1 / x) = x

  • for odd payloads, the usual modular inverse laws hold

Important caveats:

  • x / x = 1 is not guaranteed for all payloads.

  • Division by zero does not produce IEEE infinities or traps.

  • This rewrite is chosen for algebraic checking, not numeric realism.
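For odd payloads, inv is the ordinary modular inverse, so the preserved properties are easy to verify. This sketch covers only the odd case; the parity-preserving involution used for even payloads is internal to FpSan and not modeled here:

```python
W = 1 << 16

def inv_odd(u):
    # True modular inverse, defined only for odd payloads in this sketch.
    assert u % 2 == 1
    return pow(u, -1, W)

def p_div(x, y):
    return (x * inv_odd(y)) % W

x = 0x1235  # an odd payload
assert p_div(x, 1) == x               # x / 1 = x
assert p_div(1, p_div(1, x)) == x     # 1 / (1 / x) = x
assert p_div(x, x) == 1               # holds here only because x is odd
```

The last assertion is exactly the caveat above: x / x = 1 holds on the odd payloads where the true modular inverse applies, but is not guaranteed for even payloads.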

Remainder

Supported operation:

  • x % y

Rewrite:

  • signed integer remainder on payloads after forcing the denominator odd with den | 1

Exact preserved properties:

  • same inputs produce the same sanitized remainder payload

Important caveats:

  • Real floating-point remainder semantics are not modeled.

  • Zero denominators are intentionally mapped to a safe odd payload instead of trapping.
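A sketch of the den | 1 trick. The exact signed-remainder convention (truncated, C-style) is an assumption here; the point is that the denominator is forced odd, so a zero denominator quietly becomes 1:

```python
W = 1 << 16

def to_signed(u):
    return u - W if u >= W // 2 else u

def p_rem(x, y):
    den = y | 1  # force the denominator odd; zero becomes 1 instead of trapping
    a, b = to_signed(x), to_signed(den)
    r = abs(a) % abs(b)          # truncated (C-style) remainder — an assumption
    return (-r if a < 0 else r) % W

assert p_rem(7, 3) == 1
assert p_rem(0x1234, 0) == 0     # zero denominator -> remainder by 1 -> 0
assert p_rem(7, 12) == p_rem(7, 12)  # deterministic for equal inputs
```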

FMA

Supported operation:

  • tl.fma(a, b, c)

Rewrite:

  • a * b + c in payload arithmetic

Exact preserved properties:

  • exact agreement with the sanitized expansion mul followed by add

  • fma(a, b, c) = a*b + c in the payload ring

Important caveat:

  • There is no special fused-rounding behavior.
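In the payload ring there is nothing to fuse, so FMA coincides exactly with its mul-then-add expansion (illustrative width):

```python
W = 1 << 16

def p_add(x, y): return (x + y) % W
def p_mul(x, y): return (x * y) % W
def p_fma(a, b, c): return (a * b + c) % W

a, b, c = 0x1111, 0x2222, 0x3333
assert p_fma(a, b, c) == p_add(p_mul(a, b), c)  # no separate fused rounding
```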

Unary Math Ops

exp2

Supported operation:

  • tl.exp2(x)

Rewrite:

  • modular exponentiation by a fixed odd generator in payload space

Exact preserved properties:

  • exp2(x + y) = exp2(x) * exp2(y)

  • exp2(0) = 1

  • exp2(-x) = 1.0 / exp2(x)

exp

Supported operation:

  • tl.exp(x)

Rewrite:

  • exp(x) is implemented as exp2(x * rcp_log2) in payload space

Exact preserved properties:

  • exp uses the same payload-space construction as exp2 after scaling the input by a fixed internal payload constant
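The exp2 identities can be checked with a toy generator. The generator and the rcp_log2 constant below are hypothetical stand-ins; the real constants are internal to FpSan:

```python
W = 1 << 16
G = 3          # hypothetical fixed odd generator
RCP_LOG2 = 23  # hypothetical odd payload constant standing in for 1/log(2)

def p_exp2(x):
    # Modular exponentiation by a fixed odd generator.  The multiplicative
    # order of an odd element mod 2^w divides 2^(w-2), so reducing the
    # exponent mod 2^w does not change the result.
    return pow(G, x % W, W)

def p_exp(x):
    # exp(x) = exp2(x * rcp_log2) in payload space
    return p_exp2((x * RCP_LOG2) % W)

x, y = 1234, 56789
assert p_exp2((x + y) % W) == (p_exp2(x) * p_exp2(y)) % W  # exp2(x+y) = exp2(x)*exp2(y)
assert p_exp2(0) == 1                                      # exp2(0) = 1
assert (p_exp2(x) * p_exp2((-x) % W)) % W == 1             # exp2(-x) = 1 / exp2(x)
assert p_exp((x + y) % W) == (p_exp(x) * p_exp(y)) % W     # exp inherits the identity
```

Note that exp2(-x) lands on an odd payload, so the division rewrite's true modular inverse applies and the reciprocal identity is exact.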

sin and cos

Supported operations:

  • tl.sin(x)

  • tl.cos(x)

Rewrite:

  • a deterministic payload-space rewrite chosen to preserve the identities below

Exact preserved properties:

  • sin(x + y) = sin(x) * cos(y) + cos(x) * sin(y)

  • sin(x - y) = sin(x) * cos(y) - cos(x) * sin(y)

  • cos(x + y) = cos(x) * cos(y) - sin(x) * sin(y)

  • cos(x - y) = cos(x) * cos(y) + sin(x) * sin(y)

  • cos(x)^2 + sin(x)^2 = 1

Important caveat:

  • These are not IEEE trig values; they are payload functions chosen to preserve the angle identities above.
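FpSan's actual sin/cos construction is internal. One way the angle identities can hold exactly over a modular ring, shown here purely as illustration at a tiny width, is to read (cos x, sin x) off powers of a Gaussian-integer "rotation" a + bi whose norm a^2 + b^2 is 1 mod 2^w:

```python
W = 1 << 8    # tiny width to keep the example readable
A, B = 31, 8  # 31^2 + 8^2 = 1025 ≡ 1 (mod 256): a point on the "unit circle"

def cos_sin(x):
    # (cos x, sin x) := real/imag parts of (A + B*i)^x over Z[i] mod 2^w
    c, s = 1, 0
    for _ in range(x):
        c, s = (c * A - s * B) % W, (c * B + s * A) % W
    return c, s

x, y = 5, 9
cx, sx = cos_sin(x)
cy, sy = cos_sin(y)
cxy, sxy = cos_sin(x + y)
assert sxy == (sx * cy + cx * sy) % W   # sin(x+y) = sin x cos y + cos x sin y
assert cxy == (cx * cy - sx * sy) % W   # cos(x+y) = cos x cos y - sin x sin y
assert (cx * cx + sx * sx) % W == 1     # cos^2 + sin^2 = 1
```

The angle-addition identities follow from (A+Bi)^(x+y) = (A+Bi)^x (A+Bi)^y, and the Pythagorean identity is the multiplicativity of the norm.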

Tagged Unary Ops

Supported operations:

  • tl.log(x)

  • tl.log2(x)

  • tl.sqrt(x)

  • tl.rsqrt(x)

  • tl.erf(x)

  • tl.floor(x)

  • tl.ceil(x)

  • precise square root variants

Rewrite:

  • an invertible payload tag: multiply by an odd constant, xor with an op-specific hash, then multiply again

Exact preserved properties:

  • payload equality is preserved for the same op: if x == y in payload space, then op(x) == op(y)

  • different supported unary ops get different tags

Important caveat:

  • These rewrites intentionally do not preserve real mathematical identities such as sqrt(x)^2 = x or log(x*y) = log(x) + log(y).
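The tag construction is easy to sketch. The multipliers and per-op hashes below are hypothetical; the real constants are internal to FpSan:

```python
W = 1 << 16
C1, C2 = 0x9E37, 0x85EB  # hypothetical odd multipliers (odd => invertible mod 2^w)
OP_HASH = {"log": 0x1A2B, "sqrt": 0x3C4D}  # hypothetical op-specific hashes

def tag(op, u):
    # multiply by an odd constant, xor with an op-specific hash, multiply again
    return (C2 * ((C1 * u % W) ^ OP_HASH[op])) % W

x = 0x1234
assert tag("log", x) == tag("log", x)   # same op, same payload -> same result
assert tag("log", x) != tag("sqrt", x)  # different ops get different tags
```

Each step (odd multiply, xor with a constant) is a bijection on payloads, so the whole tag is invertible, which is what makes payload equality through a tagged op an exact property rather than a probabilistic one.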

Casts and Format Conversions

Float-to-Float Conversions

Supported operations:

  • converting a tensor between floating-point types with x.to(dtype)

  • implicit float widening and narrowing conversions

Rewrite:

  • signed integer extension or truncation in payload space, followed by unembed

Exact preserved properties:

  • 0, +1, and -1 remain stable across the conversion

  • sign-extension behavior in the payload domain

  • truncation drops high payload bits

  • an upcast followed by a downcast is the identity

Important caveats:

  • This preserves payload structure, not IEEE conversion semantics.

  • Conversions between floating-point types of the same width do not model any loss of precision or range; for example, under FpSan, fn(a.to(tl.float16)).to(tl.bfloat16) == fn(a) for any bfloat16 a.
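A sketch of the payload-space conversion, with hypothetical 16-bit and 32-bit widths standing in for an fp16-to-fp32 upcast and back:

```python
W16, W32 = 16, 32

def sext(u, src, dst):
    # sign-extend a src-bit payload to dst bits
    s = u - (1 << src) if u >= (1 << (src - 1)) else u
    return s % (1 << dst)

def trunc(u, dst):
    # truncation drops the high payload bits
    return u % (1 << dst)

NEG_ONE_16 = (1 << W16) - 1
assert sext(0, W16, W32) == 0                        # 0 stays 0
assert sext(1, W16, W32) == 1                        # +1 stays 1
assert sext(NEG_ONE_16, W16, W32) == (1 << W32) - 1  # -1 stays all-ones
x = 0xABCD
assert trunc(sext(x, W16, W32), W16) == x            # upcast then downcast = identity
```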

Packed fp4 conversion

Rewrite:

  • unpack low and high nibbles from the source byte tensor

  • reshape and reorder them

  • interpret each unpacked nibble directly as a payload in the destination float width

Exact preserved properties:

  • deterministic unpacking of packed fp4 storage

  • exact preservation of the unpacked nibble payloads

Important caveats:

  • This is not real fp4 numeric decoding.

  • The same raw-payload interpretation is reused by scaled-dot paths for e2m1.
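A minimal sketch of the nibble unpacking; the reshape/reorder step and the destination float width are elided:

```python
def unpack_fp4_byte(b):
    # low nibble and high nibble become two raw payloads
    return b & 0xF, b >> 4

lo, hi = unpack_fp4_byte(0xA7)
assert (lo, hi) == (0x7, 0xA)
assert (hi << 4) | lo == 0xA7  # unpacking is deterministic and lossless
```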

Pure Extern Elementwise Ops

Supported operation:

  • tl.extern_elementwise when all of the following hold:

    • the op is pure

    • the result type is float-like

    • there is at least one operand

    • every operand is numeric

Rewrite:

  • rotate each operand payload by its argument index

  • sum the rotated payloads

  • xor the result with a stable hash of the symbol name

  • unembed

Exact preserved properties:

  • deterministic dependence on all operands and on operand order

  • deterministic distinction between different external symbols

  • mixed float and integer operands are supported; float operands are embedded, integer operands are used directly after signed casting to the result width

Important caveat:

  • This is a structural tag, not a numeric model of the external function.
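The structural tag can be sketched as follows. The per-symbol hashes below are hypothetical stand-ins for the real internal hash of the symbol name, and the libdevice-style symbol names are only examples:

```python
W_BITS = 16
W = 1 << W_BITS

def rotl(u, k):
    # rotate a w-bit payload left by k bits
    k %= W_BITS
    return ((u << k) | (u >> (W_BITS - k))) & (W - 1)

# Hypothetical per-symbol hashes; the real hash is internal to FpSan.
SYM_HASH = {"__nv_powf": 0x51AB, "__nv_atan2f": 0x0C3D}

def extern_tag(symbol, payloads):
    # rotate each operand by its argument index, sum, xor the symbol hash
    acc = sum(rotl(u, i) for i, u in enumerate(payloads)) % W
    return acc ^ SYM_HASH[symbol]

a, b = 0x0F0F, 0x1234
assert extern_tag("__nv_powf", [a, b]) != extern_tag("__nv_powf", [b, a])    # order matters
assert extern_tag("__nv_powf", [a, b]) != extern_tag("__nv_atan2f", [a, b])  # symbols differ
```

Rotating by the argument index is what makes the tag sensitive to operand order: swapping two operands changes which rotation each payload receives.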

Gluon MMA and Tensor Memory

Supported Gluon operations include:

  • mma_v2

  • warpgroup_mma and warpgroup_mma_wait

  • tcgen05_mma and tcgen05_mma_scaled

  • tcgen05_copy and tcgen05_commit

  • allocate_tensor_memory

  • tensor-memory descriptor methods such as load, load_min, load_max, store, slice, index, and _reinterpret

  • AMD mfma, mfma_scaled, wmma, wmma_scaled, and scaled_upcast

Rewrite:

  • perform multiply-add accumulation in payload space

  • preserve payload bits across tensor-memory loads, stores, copies, and views

  • keep accumulator-selection and predication behavior structurally visible

Exact preserved properties:

  • exact matrix-multiply algebra over the payload ring

  • exact agreement with sanitized scalar multiply-add expansion

  • accumulation with the provided accumulator is preserved as payload addition

  • tensor-memory operations preserve payload flow across the pipeline

Important caveats:

  • Scaled MMA preserves the sanitizer’s payload treatment of low-precision operands and scales, not exact hardware-format numeric decoding.

  • Tensor-memory operations preserve payload dataflow; they do not make FpSan a substitute for race or synchronization checking.

  • FpSan is currently supported on all NVIDIA hardware, and on AMD gfx942, gfx950, and gfx1250.
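The "exact agreement with sanitized scalar multiply-add expansion" property is straightforward to check in a toy payload-ring matmul (illustrative width, tiny matrices):

```python
W = 1 << 16

def p_dot(A, B, acc):
    # payload-ring matmul-accumulate: acc + A @ B, everything mod 2^w
    m, k, n = len(A), len(B), len(B[0])
    return [[(acc[i][j] + sum(A[i][t] * B[t][j] for t in range(k))) % W
             for j in range(n)] for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C0 = [[9, 0], [0, 9]]
out = p_dot(A, B, C0)

# scalar multiply-add expansion produces the same payloads
for i in range(2):
    for j in range(2):
        s = C0[i][j]
        for t in range(2):
            s = (s + A[i][t] * B[t][j]) % W
        assert out[i][j] == s
```

Because the payload ring is associative and commutative, any accumulation order, including hardware MMA tiling, yields the same result, which is what lets an MMA kernel be compared against a scalar reference.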

Practical Guidance for Checks

FpSan is a good fit when you want to check:

  • that two kernels implement the same preserved algebra

  • that a fused kernel keeps the intended dataflow

  • that predication or accumulator-selection logic is wired correctly

  • that a tensor-memory or warp-specialized pipeline preserves payload flow

FpSan is a poor fit when you want to check:

  • IEEE edge cases

  • real transcendental accuracy

  • NaN or infinity behavior

  • hardware-format decode semantics for low-precision formats

In short, rely on FpSan for structure-preserving kernel validation, and rely on ordinary numerical tests for IEEE behavior.