Floating-Point Sanitizer (FpSan)

FpSan is a compiler instrumentation mode that rewrites selected floating-point Triton IR operations into deterministic “payload algebra” over integer bit-patterns. Its goal is not to approximate IEEE floating-point arithmetic. Instead, it preserves selected algebraic structure so that kernels that are symbolically equivalent under the sanitized semantics continue to agree, while wrong rewrites, wrong operands, wrong dataflow, or missing synchronization tend to perturb the result.

This makes FpSan primarily a kernel-checking tool. It is especially useful when you care more about whether a kernel preserves the intended symbolic computation than about its exact IEEE result on a particular input.

At a high level, FpSan:

  • maps floating-point bit-patterns into an integer payload domain

  • replaces supported floating-point ops with integer-domain rewrites chosen to preserve selected identities exactly

  • maps the resulting payload back into a floating-point bit-pattern so the rest of the pipeline can continue

Enabling FpSan

Enable FpSan before the compile or run you want to instrument.

From Python:

import triton

triton.knobs.compilation.instrumentation_mode = "fpsan"
# compile and run kernels here
triton.knobs.compilation.instrumentation_mode = ""

From the shell:

TRITON_INSTRUMENTATION_MODE=fpsan python your_script.py

Notes:

  • FpSan is a compiler feature, so it does not apply in interpreter mode.

  • On AMD, the backend currently enables FpSan only for gfx942, gfx950, and gfx1250.

How to Use It

The most effective way to use FpSan is to compare two kernels, or two versions of one kernel, under the same FpSan mode. Typical uses include:

  • comparing an optimized kernel against a simple reference kernel

  • comparing a fused kernel against an unfused composition

  • comparing two schedule variants that should be mathematically equivalent

  • checking that accumulator selection, predication, or TMEM pipelines preserve the intended payload flow

FpSan results should only be compared against other FpSan results, not against ordinary floating-point outputs.

Payload Model

For each floating-point width w, FpSan defines a bijection between floating-point bit-patterns and w-bit integer payloads; arithmetic wraps modulo 2^w.

Conceptually:

  • embed(x) maps a float bit-pattern to an integer payload

  • unembed(u) maps an integer payload back to a float bit-pattern

  • sanitized float ops are implemented as unembed(F(embed(...)))

The embedding is deliberately chosen so that a few important constants are stable:

  • embed(+0.0) = 0

  • embed(+1.0) = 1

  • embed(-1.0) = all-ones

Those fixed points are the reason identities such as x + 0 = x and x * 1 = x behave naturally under FpSan.

What FpSan Preserves

FpSan preserves exact identities in the payload algebra selected by each rewrite. The most important ones are:

  • ring identities for add, subtract, multiply, FMA, and dot-like accumulation

  • selected exponential identities for exp and exp2 (see below for details)

  • trigonometric identities for sin and cos

  • payload equality through casts, loads, stores, and copies

  • deterministic op-distinguishing tags for unary functions that do not yet have a richer algebraic model

This is what makes FpSan valuable for kernel checks: if two kernels should be the same symbolic computation under the preserved properties, they should produce the same payloads. This guarantee rests on Schanuel’s conjecture, a widely believed conjecture in transcendental number theory. One of the authors of FpSan has a [blog post](https://cp4space.hatsya.com/2026/05/03/schanuels-conjecture-and-the-semantics-of-fpsan/) explaining the theory behind FpSan from a mathematical perspective.

What FpSan Does Not Preserve

FpSan is not an IEEE simulator.

In particular, do not rely on it for:

  • real floating-point ordering, rounding, NaN propagation, infinities, subnormals, or exceptions

  • real transcendental semantics for log, sqrt, erf, floor, ceil, rsqrt, and similar tagged unary ops

  • expected floating-point bit patterns (for example, kernels that bitcast between floats and integers will observe payload bits, not IEEE bits)

When a property matters for your check, the right question is: “is this property preserved by the payload rewrite for this specific op family?”

Common Arithmetic Ops

Add, Sub, Mul

Supported operations:

  • x + y

  • x - y

  • x * y

Rewrite:

  • add, subtract, or multiply the embedded payloads, then unembed the result

Exact preserved properties:

  • x + 0 = x

  • x - 0 = x

  • x - x = 0

  • x * 1 = x

  • associativity and commutativity of add and mul

  • distributivity of mul over add

Important caveat:

  • This is ring arithmetic modulo 2^w, not IEEE arithmetic.
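The preserved identities above can be checked directly in a toy payload model. Everything below (the width, the sample payloads) is illustrative and not FpSan's actual implementation; embed/unembed are elided, and the documented fixed points (+0.0 ↦ 0, +1.0 ↦ 1, -1.0 ↦ all-ones) are used directly:

```python
# Toy payload ring for a hypothetical width w = 16; arithmetic wraps mod 2^w.
W = 1 << 16

def p_add(x, y): return (x + y) % W
def p_sub(x, y): return (x - y) % W
def p_mul(x, y): return (x * y) % W

ZERO, ONE, NEG_ONE = 0, 1, W - 1  # payloads of +0.0, +1.0, -1.0

x, y, z = 0x1234, 0xBEEF, 0x0F0F
assert p_add(x, ZERO) == x                   # x + 0 = x
assert p_sub(x, ZERO) == x                   # x - 0 = x
assert p_sub(x, x) == ZERO                   # x - x = 0
assert p_mul(x, ONE) == x                    # x * 1 = x
assert p_mul(x, NEG_ONE) == p_sub(ZERO, x)   # x * -1 = -x
assert p_add(x, y) == p_add(y, x)            # commutativity
assert p_mul(x, p_add(y, z)) == p_add(p_mul(x, y), p_mul(x, z))  # distributivity
```

Because the fixed points for 0 and 1 land on the ring's additive and multiplicative identities, the float identities x + 0 = x and x * 1 = x hold exactly, even though the payload values have nothing to do with IEEE arithmetic.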

Min and Max

Supported operations:

  • tl.minimum(x, y)

  • tl.maximum(x, y)

  • min(x, y) and max(x, y) in Triton code

Rewrite:

  • signed integer min or max on payloads

Exact preserved properties:

  • idempotence: min(x, x) = x and max(x, x) = x

  • commutativity

  • associativity

Important caveats:

  • The order is the signed integer order of payloads, not IEEE float order.

  • NaN handling and the exact signed-zero contract are not modeled.
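A small sketch of the signed-payload order, with a hypothetical width of 16 bits. Note that the fixed points order correctly under signed comparison (the all-ones payload of -1.0 reads as signed -1, which sorts below 0 and +1), which is presumably why signed rather than unsigned order is used:

```python
W_BITS = 16
W = 1 << W_BITS

def to_signed(u):
    # interpret a w-bit payload as a signed integer
    return u - W if u >= W // 2 else u

def p_min(x, y): return x if to_signed(x) <= to_signed(y) else y
def p_max(x, y): return x if to_signed(x) >= to_signed(y) else y

NEG_ONE = W - 1  # payload of -1.0 (all-ones)
assert p_min(NEG_ONE, 1) == NEG_ONE         # -1 < +1 in signed payload order
assert p_max(NEG_ONE, 0) == 0
assert p_min(0x1234, 0x1234) == 0x1234      # idempotence
assert p_min(3, 7) == p_min(7, 3)           # commutativity
```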

Division

Supported operation:

  • x / y

Rewrite:

  • x / y becomes embed(x) * inv(embed(y)), then unembed

Here inv is:

  • the true modular inverse for odd payloads

  • a parity-preserving involution for even payloads

Exact preserved properties:

  • x / 1 = x

  • 1 / (1 / x) = x

  • for odd payloads, the usual modular inverse laws hold

Important caveats:

  • x / x = 1 is not guaranteed for all payloads.

  • Division by zero does not produce IEEE infinities or traps.

  • This rewrite is chosen for algebraic checking, not numeric realism.
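For odd payloads, inv is the ordinary modular inverse, so the preserved properties are easy to verify. This sketch covers only the odd case; the parity-preserving involution used for even payloads is internal to FpSan and not modeled here:

```python
W = 1 << 16

def inv_odd(u):
    # True modular inverse, defined only for odd payloads in this sketch.
    assert u % 2 == 1
    return pow(u, -1, W)

def p_div(x, y):
    return (x * inv_odd(y)) % W

x = 0x1235  # an odd payload
assert p_div(x, 1) == x               # x / 1 = x
assert p_div(1, p_div(1, x)) == x     # 1 / (1 / x) = x
assert p_div(x, x) == 1               # holds here only because x is odd
```

The last assertion is exactly the caveat above: x / x = 1 holds on the odd payloads where the true modular inverse applies, but is not guaranteed for even payloads.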

Remainder

Supported operation:

  • x % y

Rewrite:

  • signed integer remainder on payloads after forcing the denominator odd with den | 1

Exact preserved properties:

  • same inputs produce the same sanitized remainder payload

Important caveats:

  • Real floating-point remainder semantics are not modeled.

  • Zero denominators are intentionally mapped to a safe odd payload instead of trapping.
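A sketch of the den | 1 trick. The exact signed-remainder convention (truncated, C-style) is an assumption here; the point is that the denominator is forced odd, so a zero denominator quietly becomes 1:

```python
W = 1 << 16

def to_signed(u):
    return u - W if u >= W // 2 else u

def p_rem(x, y):
    den = y | 1  # force the denominator odd; zero becomes 1 instead of trapping
    a, b = to_signed(x), to_signed(den)
    r = abs(a) % abs(b)          # truncated (C-style) remainder — an assumption
    return (-r if a < 0 else r) % W

assert p_rem(7, 3) == 1
assert p_rem(0x1234, 0) == 0     # zero denominator -> remainder by 1 -> 0
assert p_rem(7, 12) == p_rem(7, 12)  # deterministic for equal inputs
```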

FMA

Supported operation:

  • tl.fma(a, b, c)

Rewrite:

  • a * b + c in payload arithmetic

Exact preserved properties:

  • exact agreement with the sanitized expansion mul followed by add

  • fma(a, b, c) = a*b + c in the payload ring

Important caveat:

  • There is no special fused-rounding behavior.
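In the payload ring there is nothing to fuse, so FMA coincides exactly with its mul-then-add expansion (illustrative width):

```python
W = 1 << 16

def p_add(x, y): return (x + y) % W
def p_mul(x, y): return (x * y) % W
def p_fma(a, b, c): return (a * b + c) % W

a, b, c = 0x1111, 0x2222, 0x3333
assert p_fma(a, b, c) == p_add(p_mul(a, b), c)  # no separate fused rounding
```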

Unary Math Ops

exp2

Supported operation:

  • tl.exp2(x)

Rewrite:

  • modular exponentiation by a fixed odd generator in payload space

Exact preserved properties:

  • exp2(x + y) = exp2(x) * exp2(y)

  • exp2(0) = 1

  • exp2(-x) = 1.0 / exp2(x)

exp

Supported operation:

  • tl.exp(x)

Rewrite:

  • exp(x) is implemented as exp2(x * rcp_log2) in payload space

Exact preserved properties:

  • exp uses the same payload-space construction as exp2 after scaling the input by a fixed internal payload constant
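The exp2 identities can be checked with a toy generator. The generator and the rcp_log2 constant below are hypothetical stand-ins; the real constants are internal to FpSan:

```python
W = 1 << 16
G = 3          # hypothetical fixed odd generator
RCP_LOG2 = 23  # hypothetical odd payload constant standing in for 1/log(2)

def p_exp2(x):
    # Modular exponentiation by a fixed odd generator.  The multiplicative
    # order of an odd element mod 2^w divides 2^(w-2), so reducing the
    # exponent mod 2^w does not change the result.
    return pow(G, x % W, W)

def p_exp(x):
    # exp(x) = exp2(x * rcp_log2) in payload space
    return p_exp2((x * RCP_LOG2) % W)

x, y = 1234, 56789
assert p_exp2((x + y) % W) == (p_exp2(x) * p_exp2(y)) % W  # exp2(x+y) = exp2(x)*exp2(y)
assert p_exp2(0) == 1                                      # exp2(0) = 1
assert (p_exp2(x) * p_exp2((-x) % W)) % W == 1             # exp2(-x) = 1 / exp2(x)
assert p_exp((x + y) % W) == (p_exp(x) * p_exp(y)) % W     # exp inherits the identity
```

Note that exp2(-x) lands on an odd payload, so the division rewrite's true modular inverse applies and the reciprocal identity is exact.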

sin and cos

Supported operations:

  • tl.sin(x)

  • tl.cos(x)

Rewrite:

  • a deterministic payload-space rewrite chosen to preserve the identities below

Exact preserved properties:

  • sin(x + y) = sin(x) * cos(y) + cos(x) * sin(y)

  • sin(x - y) = sin(x) * cos(y) - cos(x) * sin(y)

  • cos(x + y) = cos(x) * cos(y) - sin(x) * sin(y)

  • cos(x - y) = cos(x) * cos(y) + sin(x) * sin(y)

  • cos(x)^2 + sin(x)^2 = 1

Important caveat:

  • These are not IEEE trig values; they are payload functions chosen to preserve the angle identities above.
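FpSan's actual sin/cos construction is internal. One way the angle identities can hold exactly over a modular ring, shown here purely as illustration at a tiny width, is to read (cos x, sin x) off powers of a Gaussian-integer "rotation" a + bi whose norm a^2 + b^2 is 1 mod 2^w:

```python
W = 1 << 8    # tiny width to keep the example readable
A, B = 31, 8  # 31^2 + 8^2 = 1025 ≡ 1 (mod 256): a point on the "unit circle"

def cos_sin(x):
    # (cos x, sin x) := real/imag parts of (A + B*i)^x over Z[i] mod 2^w
    c, s = 1, 0
    for _ in range(x):
        c, s = (c * A - s * B) % W, (c * B + s * A) % W
    return c, s

x, y = 5, 9
cx, sx = cos_sin(x)
cy, sy = cos_sin(y)
cxy, sxy = cos_sin(x + y)
assert sxy == (sx * cy + cx * sy) % W   # sin(x+y) = sin x cos y + cos x sin y
assert cxy == (cx * cy - sx * sy) % W   # cos(x+y) = cos x cos y - sin x sin y
assert (cx * cx + sx * sx) % W == 1     # cos^2 + sin^2 = 1
```

The angle-addition identities follow from (A+Bi)^(x+y) = (A+Bi)^x (A+Bi)^y, and the Pythagorean identity is the multiplicativity of the norm.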

Tagged Unary Ops

Supported operations:

  • tl.log(x)

  • tl.log2(x)

  • tl.sqrt(x)

  • tl.rsqrt(x)

  • tl.erf(x)

  • tl.floor(x)

  • tl.ceil(x)

  • precise square root variants

Rewrite:

  • an invertible payload tag: multiply by an odd constant, xor with an op-specific hash, then multiply again

Exact preserved properties:

  • payload equality is preserved for the same op: if x == y in payload space, then op(x) == op(y)

  • different supported unary ops get different tags

Important caveat:

  • These rewrites intentionally do not preserve real mathematical identities such as sqrt(x)^2 = x or log(x*y) = log(x) + log(y).
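The tag construction is easy to sketch. The multipliers and per-op hashes below are hypothetical; the real constants are internal to FpSan:

```python
W = 1 << 16
C1, C2 = 0x9E37, 0x85EB  # hypothetical odd multipliers (odd => invertible mod 2^w)
OP_HASH = {"log": 0x1A2B, "sqrt": 0x3C4D}  # hypothetical op-specific hashes

def tag(op, u):
    # multiply by an odd constant, xor with an op-specific hash, multiply again
    return (C2 * ((C1 * u % W) ^ OP_HASH[op])) % W

x = 0x1234
assert tag("log", x) == tag("log", x)   # same op, same payload -> same result
assert tag("log", x) != tag("sqrt", x)  # different ops get different tags
```

Each step (odd multiply, xor with a constant) is a bijection on payloads, so the whole tag is invertible, which is what makes payload equality through a tagged op an exact property rather than a probabilistic one.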

Casts and Format Conversions

Float-to-Float Conversions

Supported operations:

  • converting a tensor between floating-point types with x.to(dtype)

  • implicit float widening and narrowing conversions

Rewrite:

  • signed integer extension or truncation in payload space, followed by unembed

Exact preserved properties:

  • 0, +1, and -1 remain stable across the conversion

  • sign-extension behavior in the payload domain

  • truncation drops high payload bits

  • an upcast followed by a downcast is the identity

Important caveats:

  • This preserves payload structure, not IEEE conversion semantics.

  • Conversions between floating-point types of the same width do not model any loss of precision or range; for example, under FpSan, fn(a.to(tl.float16)).to(tl.bfloat16) == fn(a) for any bfloat16 a.
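A sketch of the payload-space conversion, with hypothetical 16-bit and 32-bit widths standing in for an fp16-to-fp32 upcast and back:

```python
W16, W32 = 16, 32

def sext(u, src, dst):
    # sign-extend a src-bit payload to dst bits
    s = u - (1 << src) if u >= (1 << (src - 1)) else u
    return s % (1 << dst)

def trunc(u, dst):
    # truncation drops the high payload bits
    return u % (1 << dst)

NEG_ONE_16 = (1 << W16) - 1
assert sext(0, W16, W32) == 0                        # 0 stays 0
assert sext(1, W16, W32) == 1                        # +1 stays 1
assert sext(NEG_ONE_16, W16, W32) == (1 << W32) - 1  # -1 stays all-ones
x = 0xABCD
assert trunc(sext(x, W16, W32), W16) == x            # upcast then downcast = identity
```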

Packed fp4 conversion

Rewrite:

  • unpack low and high nibbles from the source byte tensor

  • reshape and reorder them

  • interpret each unpacked nibble directly as a payload in the destination float width

Exact preserved properties:

  • deterministic unpacking of packed fp4 storage

  • exact preservation of the unpacked nibble payloads

Important caveats:

  • This is not real fp4 numeric decoding.

  • The same raw-payload interpretation is reused by scaled-dot paths for e2m1.
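A minimal sketch of the nibble unpacking; the reshape/reorder step and the destination float width are elided:

```python
def unpack_fp4_byte(b):
    # low nibble and high nibble become two raw payloads
    return b & 0xF, b >> 4

lo, hi = unpack_fp4_byte(0xA7)
assert (lo, hi) == (0x7, 0xA)
assert (hi << 4) | lo == 0xA7  # unpacking is deterministic and lossless
```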

Pure Extern Elementwise Ops

Supported operation:

  • tl.extern_elementwise when all of the following hold:

    • the op is pure

    • the result type is float-like

    • there is at least one operand

    • every operand is numeric

Rewrite:

  • rotate each operand payload by its argument index

  • sum the rotated payloads

  • xor the result with a stable hash of the symbol name

  • unembed

Exact preserved properties:

  • deterministic dependence on all operands and on operand order

  • deterministic distinction between different external symbols

  • mixed float and integer operands are supported; float operands are embedded, integer operands are used directly after signed casting to the result width

Important caveat:

  • This is a structural tag, not a numeric model of the external function.
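The structural tag can be sketched as follows. The per-symbol hashes below are hypothetical stand-ins for the real internal hash of the symbol name, and the libdevice-style symbol names are only examples:

```python
W_BITS = 16
W = 1 << W_BITS

def rotl(u, k):
    # rotate a w-bit payload left by k bits
    k %= W_BITS
    return ((u << k) | (u >> (W_BITS - k))) & (W - 1)

# Hypothetical per-symbol hashes; the real hash is internal to FpSan.
SYM_HASH = {"__nv_powf": 0x51AB, "__nv_atan2f": 0x0C3D}

def extern_tag(symbol, payloads):
    # rotate each operand by its argument index, sum, xor the symbol hash
    acc = sum(rotl(u, i) for i, u in enumerate(payloads)) % W
    return acc ^ SYM_HASH[symbol]

a, b = 0x0F0F, 0x1234
assert extern_tag("__nv_powf", [a, b]) != extern_tag("__nv_powf", [b, a])    # order matters
assert extern_tag("__nv_powf", [a, b]) != extern_tag("__nv_atan2f", [a, b])  # symbols differ
```

Rotating by the argument index is what makes the tag sensitive to operand order: swapping two operands changes which rotation each payload receives.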

Gluon MMA and Tensor Memory

Supported Gluon operations include:

  • mma_v2

  • warpgroup_mma and warpgroup_mma_wait

  • tcgen05_mma and tcgen05_mma_scaled

  • tcgen05_copy and tcgen05_commit

  • allocate_tensor_memory

  • tensor-memory descriptor methods such as load, load_min, load_max, store, slice, index, and _reinterpret

  • AMD mfma, mfma_scaled, wmma, wmma_scaled, and scaled_upcast

Rewrite:

  • perform multiply-add accumulation in payload space

  • preserve payload bits across tensor-memory loads, stores, copies, and views

  • keep accumulator-selection and predication behavior structurally visible

Exact preserved properties:

  • exact matrix-multiply algebra over the payload ring

  • exact agreement with sanitized scalar multiply-add expansion

  • accumulation with the provided accumulator is preserved as payload addition

  • tensor-memory operations preserve payload flow across the pipeline

Important caveats:

  • Scaled MMA preserves the sanitizer’s payload treatment of low-precision operands and scales, not exact hardware-format numeric decoding.

  • Tensor-memory operations preserve payload dataflow; they do not make FpSan a substitute for race or synchronization checking.

  • FpSan is currently supported on all NVIDIA hardware, and on AMD gfx942, gfx950, and gfx1250.
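The "exact agreement with sanitized scalar multiply-add expansion" property is straightforward to check in a toy payload-ring matmul (illustrative width, tiny matrices):

```python
W = 1 << 16

def p_dot(A, B, acc):
    # payload-ring matmul-accumulate: acc + A @ B, everything mod 2^w
    m, k, n = len(A), len(B), len(B[0])
    return [[(acc[i][j] + sum(A[i][t] * B[t][j] for t in range(k))) % W
             for j in range(n)] for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C0 = [[9, 0], [0, 9]]
out = p_dot(A, B, C0)

# scalar multiply-add expansion produces the same payloads
for i in range(2):
    for j in range(2):
        s = C0[i][j]
        for t in range(2):
            s = (s + A[i][t] * B[t][j]) % W
        assert out[i][j] == s
```

Because the payload ring is associative and commutative, any accumulation order, including hardware MMA tiling, yields the same result, which is what lets an MMA kernel be compared against a scalar reference.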

Practical Guidance for Checks

FpSan is a good fit when you want to check:

  • that two kernels implement the same preserved algebra

  • that a fused kernel keeps the intended dataflow

  • that predication or accumulator-selection logic is wired correctly

  • that a tensor-memory or warp-specialized pipeline preserves payload flow

FpSan is a poor fit when you want to check:

  • IEEE edge cases

  • real transcendental accuracy

  • NaN or infinity behavior

  • hardware-format decode semantics for low-precision formats

In short, rely on FpSan for structure-preserving kernel validation, and rely on ordinary numerical tests for IEEE behavior.