Floating-Point Sanitizer (FpSan)
FpSan is a compiler instrumentation mode that rewrites selected floating-point Triton IR operations into deterministic “payload algebra” over integer bit-patterns. Its goal is not to approximate IEEE floating-point arithmetic. Instead, it preserves selected algebraic structure so that kernels that are symbolically equivalent under the sanitized semantics continue to agree, while wrong rewrites, wrong operands, wrong dataflow, or missing synchronization tend to perturb the result.
This makes FpSan primarily a kernel-checking tool. It is especially useful when you care more about whether a kernel preserves the intended symbolic computation than about its exact IEEE result on a particular input.
At a high level, FpSan:
maps floating-point bit-patterns into an integer payload domain
replaces supported floating-point ops with integer-domain rewrites chosen to preserve selected identities exactly
maps the resulting payload back into a floating-point bit-pattern so the rest of the pipeline can continue
Enabling FpSan
Enable FpSan before the compile or run you want to instrument.
From Python:
import triton
triton.knobs.compilation.instrumentation_mode = "fpsan"
# compile and run kernels here
triton.knobs.compilation.instrumentation_mode = ""
From the shell:
TRITON_INSTRUMENTATION_MODE=fpsan python your_script.py
Notes:
FpSan is a compiler feature, so it does not apply in interpreter mode.
On AMD, the backend currently enables FpSan only for
gfx942,gfx950, andgfx1250.
How to Use It
The most effective way to use FpSan is to compare two kernels, or two versions of one kernel, under the same FpSan mode. Typical uses include:
comparing an optimized kernel against a simple reference kernel
comparing a fused kernel against an unfused composition
comparing two schedule variants that should be mathematically equivalent
checking that accumulator selection, predication, or TMEM pipelines preserve the intended payload flow
FpSan results should only be compared against other FpSan results, not against ordinary floating-point outputs.
Payload Model
For each floating-point width w, FpSan defines a bijection between
floating-point bit-patterns and a w-bit integer payload; arithmetic wraps modulo 2^w.
Conceptually:
embed(x)maps a float bit-pattern to an integer payloadunembed(u)maps an integer payload back to a float bit-patternsanitized float ops are implemented as
unembed(F(embed(...)))
The embedding is deliberately chosen so that a few important constants are stable:
embed(+0.0) = 0embed(+1.0) = 1embed(-1.0) = all-ones
Those fixed points are the reason identities such as x + 0 = x and
x * 1 = x behave naturally under FpSan.
What FpSan Preserves
FpSan preserves exact identities in the payload algebra selected by each rewrite. The most important ones are:
ring identities for add, subtract, multiply, FMA, and dot-like accumulation
selected exponential identities for
expandexp2(see below for details)trigonometric identities for
sinandcospayload equality through casts, loads, stores, and copies
deterministic op-distinguishing tags for unary functions that do not yet have a richer algebraic model
This is what makes FpSan valuable for kernel checks: if two kernels should be the same symbolic computation under the preserved properties, they should produce the same payloads. This holds assuming a (generally believed) conjecture in transcendental number theory, Schanuel’s conjecture. One of the authors of FpSan has a [blog post](https://cp4space.hatsya.com/2026/05/03/schanuels-conjecture-and-the-semantics-of-fpsan/) explaining the theory behind FpSan from a mathematical perspective.
What FpSan Does Not Preserve
FpSan is not an IEEE simulator.
In particular, do not rely on it for:
real floating-point ordering, rounding, NaN propagation, infinities, subnormals, or exceptions
real transcendental semantics for
log,sqrt,erf,floor,ceil,rsqrt, and similar tagged unary opsexpected floating-point bit patterns (i.e. for kernels that bitcast between floats and integers)
When a property matters for your check, the right question is: “is this property preserved by the payload rewrite for this specific op family?”
Common Arithmetic Ops
Add, Sub, Mul
Supported operations:
x + yx - yx * y
Rewrite:
add, subtract, or multiply the embedded payloads, then unembed the result
Exact preserved properties:
x + 0 = xx - 0 = xx - x = 0x * 1 = xassociativity and commutativity of add and mul
distributivity of mul over add
Important caveat:
This is ring arithmetic modulo
2^w, not IEEE arithmetic.
Min and Max
Supported operations:
tl.minimum(x, y)tl.maximum(x, y)min(x, y)andmax(x, y)in Triton code
Rewrite:
signed integer
minormaxon payloads
Exact preserved properties:
idempotence:
min(x, x) = xandmax(x, x) = xcommutativity
associativity
Important caveats:
The order is the signed integer order of payloads, not IEEE float order.
NaN handling, and the exact signed-zero contract, are not modeled.
Division
Supported operation:
x / y
Rewrite:
x / ybecomesembed(x) * inv(embed(y)), thenunembed
Here inv is:
the true modular inverse for odd payloads
a parity-preserving involution for even payloads
Exact preserved properties:
x / 1 = x1 / (1 / x) = xfor odd payloads, the usual modular inverse laws hold
Important caveats:
x / x = 1is not guaranteed for all payloads.Division by zero does not produce IEEE infinities or traps.
This rewrite is chosen for algebraic checking, not numeric realism.
Remainder
Supported operation:
x % y
Rewrite:
signed integer remainder on payloads after forcing the denominator odd with
den | 1
Exact preserved properties:
same inputs produce the same sanitized remainder payload
Important caveats:
Real floating-point remainder semantics are not modeled.
Zero denominators are intentionally mapped to a safe odd payload instead of trapping.
FMA
Supported operation:
tl.fma(a, b, c)
Rewrite:
a * b + cin payload arithmetic
Exact preserved properties:
exact agreement with the sanitized expansion
mulfollowed byaddfma(a, b, c) = a*b + cin the payload ring
Important caveat:
There is no special fused-rounding behavior.
Unary Math Ops
exp2
Supported operation:
tl.exp2(x)
Rewrite:
modular exponentiation by a fixed odd generator in payload space
Exact preserved properties:
exp2(x + y) = exp2(x) * exp2(y)exp2(0) = 1exp2(-x) = 1.0 / exp2(x)
exp
Supported operation:
tl.exp(x)
Rewrite:
exp(x)is implemented asexp2(x * rcp_log2)in payload space
Exact preserved properties:
expuses the same payload-space construction asexp2after scaling the input by a fixed internal payload constant
sin and cos
Supported operations:
tl.sin(x)tl.cos(x)
Rewrite:
a deterministic payload-space rewrite chosen to preserve the identities below
Exact preserved properties:
sin(x + y) = sin(x) * cos(y) + cos(x) * sin(y)sin(x - y) = sin(x) * cos(y) - cos(x) * sin(y)cos(x + y) = cos(x) * cos(y) - sin(x) * sin(y)cos(x - y) = cos(x) * cos(y) + sin(x) * sin(y)cos(x)^2 + sin(x)^2 = 1
Important caveat:
These are not IEEE trig values; they are payload functions chosen to preserve the angle identities above.
Tagged Unary Ops
Supported operations:
tl.log(x)tl.log2(x)tl.sqrt(x)tl.rsqrt(x)tl.erf(x)tl.floor(x)tl.ceil(x)precise square root variants
Rewrite:
an invertible payload tag: multiply by an odd constant, xor with an op-specific hash, then multiply again
Exact preserved properties:
payload equality is preserved for the same op: if
x == yin payload space, thenop(x) == op(y)different supported unary ops get different tags
Important caveats:
These rewrites intentionally do not preserve real mathematical identities such as
sqrt(x)^2 = xorlog(x*y) = log(x) + log(y).
Casts and Format Conversions
Float-to-Float Conversions
Supported operations:
converting a tensor between floating-point types with
x.to(dtype)implicit float widening and narrowing conversions
Rewrite:
signed integer extension or truncation in payload space, followed by
unembed
Exact preserved properties:
0,+1, and-1remain stable across the conversionsign-extension behavior in the payload domain
truncation drops high payload bits
an upcast followed by a downcast is the identity
Important caveat:
This preserves payload structure, not IEEE conversion semantics.
Conversions between fp types of the same width do not model any loss of precision or range, so for example under fpsan
fn(a.to(tl.float16)).to(tl.bfloat16) == fn(a)(for any bfloat16a).
Packed fp4 conversion
Rewrite:
unpack low and high nibbles from the source byte tensor
reshape and reorder them
interpret each unpacked nibble directly as a payload in the destination float width
Exact preserved properties:
deterministic unpacking of packed fp4 storage
exact preservation of the unpacked nibble payloads
Important caveat:
This is not real fp4 numeric decoding.
The same raw-payload interpretation is reused by scaled-dot paths for
e2m1.
Pure Extern Elementwise Ops
Supported operation:
tl.extern_elementwisewhen all of the following hold:the op is
purethe result type is float-like
there is at least one operand
every operand is numeric
Rewrite:
rotate each operand payload by its argument index
sum the rotated payloads
xor the result with a stable hash of the symbol name
unembed
Exact preserved properties:
deterministic dependence on all operands and on operand order
deterministic distinction between different external symbols
mixed float and integer operands are supported; float operands are embedded, integer operands are used directly after signed casting to the result width
Important caveat:
This is a structural tag, not a numeric model of the external function.
Gluon MMA and Tensor Memory
Supported Gluon operations include:
mma_v2warpgroup_mmaandwarpgroup_mma_waittcgen05_mmaandtcgen05_mma_scaledtcgen05_copyandtcgen05_commitallocate_tensor_memorytensor-memory descriptor methods such as
load,load_min,load_max,store,slice,index, and_reinterpretAMD
mfma,mfma_scaled,wmma,wmma_scaled, andscaled_upcast
Rewrite:
perform multiply-add accumulation in payload space
preserve payload bits across tensor-memory loads, stores, copies, and views
keep accumulator-selection and predication behavior structurally visible
Exact preserved properties:
exact matrix-multiply algebra over the payload ring
exact agreement with sanitized scalar multiply-add expansion
accumulation with the provided accumulator is preserved as payload addition
tensor-memory operations preserve payload flow across the pipeline
Important caveats:
Scaled MMA preserves the sanitizer’s payload treatment of low-precision operands and scales, not exact hardware-format numeric decoding.
Tensor-memory operations preserve payload dataflow; they do not make FpSan a substitute for race or synchronization checking.
Currently fpsan is supported on all NVIDIA hardware, as well as AMD
gfx942,gfx950, andgfx1250.
Practical Guidance for Checks
FpSan is a good fit when you want to check:
that two kernels implement the same preserved algebra
that a fused kernel keeps the intended dataflow
that predication or accumulator-selection logic is wired correctly
that a tensor-memory or warp-specialized pipeline preserves payload flow
FpSan is a poor fit when you want to check:
IEEE edge cases
real transcendental accuracy
NaN or infinity behavior
hardware-format decode semantics for low-precision formats
In short, rely on FpSan for structure-preserving kernel validation, and rely on ordinary numerical tests for IEEE behavior.