=========================================
Floating-Point Sanitizer (FpSan)
=========================================

FpSan is a compiler instrumentation mode that rewrites selected floating-point
Triton IR operations into deterministic "payload algebra" over integer
bit-patterns. Its goal is not to approximate IEEE floating-point arithmetic.
Instead, it preserves selected algebraic structure so that kernels that are
symbolically equivalent under the sanitized semantics continue to agree, while
wrong rewrites, wrong operands, wrong dataflow, or missing synchronization tend
to perturb the result.

This makes FpSan primarily a kernel-checking tool. It is especially useful when
you care more about whether a kernel preserves the intended symbolic
computation than about its exact IEEE result on a particular input.

At a high level, FpSan:

- maps floating-point bit-patterns into an integer payload domain
- replaces supported floating-point ops with integer-domain rewrites chosen to
  preserve selected identities exactly
- maps the resulting payload back into a floating-point bit-pattern so the rest
  of the pipeline can continue

------------
Enabling FpSan
------------

Enable FpSan before the compile or run you want to instrument.

From Python:

.. code-block:: python

   import triton

   triton.knobs.compilation.instrumentation_mode = "fpsan"
   # compile and run kernels here
   triton.knobs.compilation.instrumentation_mode = ""

From the shell:

.. code-block:: bash

   TRITON_INSTRUMENTATION_MODE=fpsan python your_script.py

Notes:

- FpSan is a compiler feature, so it does not apply in interpreter mode.
- On AMD, the backend currently enables FpSan only for ``gfx942``, ``gfx950``,
  and ``gfx1250``.

----------------
How to Use It
----------------

The most effective way to use FpSan is to compare two kernels, or two versions
of one kernel, under the same FpSan mode. Typical uses include:

- comparing an optimized kernel against a simple reference kernel
- comparing a fused kernel against an unfused composition
- comparing two schedule variants that should be mathematically equivalent
- checking that accumulator selection, predication, or TMEM pipelines preserve
  the intended payload flow

FpSan results should only be compared against other FpSan results, not
against ordinary floating-point outputs.

------------------
Payload Model
------------------

For each floating-point width ``w``, FpSan defines a bijection between
floating-point bit-patterns and a w-bit integer payload; arithmetic wraps modulo 2^w.

Conceptually:

- ``embed(x)`` maps a float bit-pattern to an integer payload
- ``unembed(u)`` maps an integer payload back to a float bit-pattern
- sanitized float ops are implemented as ``unembed(F(embed(...)))``

The embedding is deliberately chosen so that a few important constants are
stable:

- ``embed(+0.0) = 0``
- ``embed(+1.0) = 1``
- ``embed(-1.0) = all-ones``

Those fixed points are the reason identities such as ``x + 0 = x`` and
``x * 1 = x`` behave naturally under FpSan.

------------------------------
What FpSan Preserves
------------------------------

FpSan preserves exact identities in the payload algebra selected by each
rewrite. The most important ones are:

- ring identities for add, subtract, multiply, FMA, and dot-like accumulation
- selected exponential identities for ``exp`` and ``exp2`` (see below for details)
- trigonometric identities for ``sin`` and ``cos``
- payload equality through casts, loads, stores, and copies
- deterministic op-distinguishing tags for unary functions that do not yet have
  a richer algebraic model

This is what makes FpSan valuable for kernel checks: if two kernels should be
the same symbolic computation under the preserved properties, they should produce
the same payloads. This holds assuming a (generally believed) conjecture in
transcendental number theory, Schanuel's conjecture. One of the authors of FpSan
has a [blog post](https://cp4space.hatsya.com/2026/05/03/schanuels-conjecture-and-the-semantics-of-fpsan/)
explaining the theory behind FpSan from a mathematical perspective.

----------------------------------
What FpSan Does Not Preserve
----------------------------------

FpSan is not an IEEE simulator.

In particular, do not rely on it for:

- real floating-point ordering, rounding, NaN propagation, infinities,
  subnormals, or exceptions
- real transcendental semantics for ``log``, ``sqrt``, ``erf``, ``floor``,
  ``ceil``, ``rsqrt``, and similar tagged unary ops
- expected floating-point bit patterns (i.e. for kernels that bitcast
  between floats and integers)

When a property matters for your check, the right question is:
"is this property preserved by the payload rewrite for this specific op
family?"

----------------------
Common Arithmetic Ops
----------------------

Add, Sub, Mul
=============

Supported operations:

- ``x + y``
- ``x - y``
- ``x * y``

Rewrite:

- add, subtract, or multiply the embedded payloads, then unembed the result

Exact preserved properties:

- ``x + 0 = x``
- ``x - 0 = x``
- ``x - x = 0``
- ``x * 1 = x``
- associativity and commutativity of add and mul
- distributivity of mul over add

Important caveat:

- This is ring arithmetic modulo ``2^w``, not IEEE arithmetic.

Min and Max
===========

Supported operations:

- ``tl.minimum(x, y)``
- ``tl.maximum(x, y)``
- ``min(x, y)`` and ``max(x, y)`` in Triton code

Rewrite:

- signed integer ``min`` or ``max`` on payloads

Exact preserved properties:

- idempotence: ``min(x, x) = x`` and ``max(x, x) = x``
- commutativity
- associativity

Important caveats:

- The order is the signed integer order of payloads, not IEEE float order.
- NaN handling, and the exact signed-zero contract, are not modeled.

Division
========

Supported operation:

- ``x / y``

Rewrite:

- ``x / y`` becomes ``embed(x) * inv(embed(y))``, then ``unembed``

Here ``inv`` is:

- the true modular inverse for odd payloads
- a parity-preserving involution for even payloads

Exact preserved properties:

- ``x / 1 = x``
- ``1 / (1 / x) = x``
- for odd payloads, the usual modular inverse laws hold

Important caveats:

- ``x / x = 1`` is not guaranteed for all payloads.
- Division by zero does not produce IEEE infinities or traps.
- This rewrite is chosen for algebraic checking, not numeric realism.

Remainder
=========

Supported operation:

- ``x % y``

Rewrite:

- signed integer remainder on payloads after forcing the denominator odd with
  ``den | 1``

Exact preserved properties:

- same inputs produce the same sanitized remainder payload

Important caveats:

- Real floating-point remainder semantics are not modeled.
- Zero denominators are intentionally mapped to a safe odd payload instead of
  trapping.

FMA
===

Supported operation:

- ``tl.fma(a, b, c)``

Rewrite:

- ``a * b + c`` in payload arithmetic

Exact preserved properties:

- exact agreement with the sanitized expansion ``mul`` followed by ``add``
- ``fma(a, b, c) = a*b + c`` in the payload ring

Important caveat:

- There is no special fused-rounding behavior.

----------------
Unary Math Ops
----------------

``exp2``
========

Supported operation:

- ``tl.exp2(x)``

Rewrite:

- modular exponentiation by a fixed odd generator in payload space

Exact preserved properties:

- ``exp2(x + y) = exp2(x) * exp2(y)``
- ``exp2(0) = 1``
- ``exp2(-x) = 1.0 / exp2(x)``

``exp``
=======

Supported operation:

- ``tl.exp(x)``

Rewrite:

- ``exp(x)`` is implemented as ``exp2(x * rcp_log2)`` in payload space

Exact preserved properties:

- ``exp`` uses the same payload-space construction as ``exp2`` after scaling
  the input by a fixed internal payload constant

``sin`` and ``cos``
===================

Supported operations:

- ``tl.sin(x)``
- ``tl.cos(x)``

Rewrite:

- a deterministic payload-space rewrite chosen to preserve the identities below

Exact preserved properties:

- ``sin(x + y) = sin(x) * cos(y) + cos(x) * sin(y)``
- ``sin(x - y) = sin(x) * cos(y) - cos(x) * sin(y)``
- ``cos(x + y) = cos(x) * cos(y) - sin(x) * sin(y)``
- ``cos(x - y) = cos(x) * cos(y) + sin(x) * sin(y)``
- ``cos(x)^2 + sin(x)^2 = 1``

Important caveat:

- These are not IEEE trig values; they are payload functions chosen to preserve
  the angle identities above.

Tagged Unary Ops
================

Supported operations:

- ``tl.log(x)``
- ``tl.log2(x)``
- ``tl.sqrt(x)``
- ``tl.rsqrt(x)``
- ``tl.erf(x)``
- ``tl.floor(x)``
- ``tl.ceil(x)``
- precise square root variants

Rewrite:

- an invertible payload tag: multiply by an odd constant, xor with an
  op-specific hash, then multiply again

Exact preserved properties:

- payload equality is preserved for the same op: if ``x == y`` in payload
  space, then ``op(x) == op(y)``
- different supported unary ops get different tags

Important caveats:

- These rewrites intentionally do not preserve real mathematical identities
  such as ``sqrt(x)^2 = x`` or ``log(x*y) = log(x) + log(y)``.

--------------------------
Casts and Format Conversions
--------------------------

Float-to-Float Conversions
============================

Supported operations:

- converting a tensor between floating-point types with ``x.to(dtype)``
- implicit float widening and narrowing conversions

Rewrite:

- signed integer extension or truncation in payload space, followed by
  ``unembed``

Exact preserved properties:

- ``0``, ``+1``, and ``-1`` remain stable across the conversion
- sign-extension behavior in the payload domain
- truncation drops high payload bits
- an upcast followed by a downcast is the identity

Important caveat:

- This preserves payload structure, not IEEE conversion semantics.
- Conversions between fp types of the same width do not model any loss of
  precision or range, so for example under fpsan
  ``fn(a.to(tl.float16)).to(tl.bfloat16) == fn(a)`` (for any bfloat16 ``a``).

Packed fp4 conversion
=====================

Rewrite:

- unpack low and high nibbles from the source byte tensor
- reshape and reorder them
- interpret each unpacked nibble directly as a payload in the destination float
  width

Exact preserved properties:

- deterministic unpacking of packed fp4 storage
- exact preservation of the unpacked nibble payloads

Important caveat:

- This is not real fp4 numeric decoding.
- The same raw-payload interpretation is reused by scaled-dot paths for
  ``e2m1``.

----------------------------
Pure Extern Elementwise Ops
----------------------------

Supported operation:

- ``tl.extern_elementwise`` when all of the following hold:

  - the op is ``pure``
  - the result type is float-like
  - there is at least one operand
  - every operand is numeric

Rewrite:

- rotate each operand payload by its argument index
- sum the rotated payloads
- xor the result with a stable hash of the symbol name
- unembed

Exact preserved properties:

- deterministic dependence on all operands and on operand order
- deterministic distinction between different external symbols
- mixed float and integer operands are supported; float operands are embedded,
  integer operands are used directly after signed casting to the result width

Important caveat:

- This is a structural tag, not a numeric model of the external function.

---------------------------
Gluon MMA and Tensor Memory
---------------------------

Supported Gluon operations include:

- ``mma_v2``
- ``warpgroup_mma`` and ``warpgroup_mma_wait``
- ``tcgen05_mma`` and ``tcgen05_mma_scaled``
- ``tcgen05_copy`` and ``tcgen05_commit``
- ``allocate_tensor_memory``
- tensor-memory descriptor methods such as ``load``, ``load_min``,
  ``load_max``, ``store``, ``slice``, ``index``, and ``_reinterpret``
- AMD ``mfma``, ``mfma_scaled``, ``wmma``, ``wmma_scaled``, and
  ``scaled_upcast``

Rewrite:

- perform multiply-add accumulation in payload space
- preserve payload bits across tensor-memory loads, stores, copies, and views
- keep accumulator-selection and predication behavior structurally visible

Exact preserved properties:

- exact matrix-multiply algebra over the payload ring
- exact agreement with sanitized scalar multiply-add expansion
- accumulation with the provided accumulator is preserved as payload addition
- tensor-memory operations preserve payload flow across the pipeline

Important caveats:

- Scaled MMA preserves the sanitizer's payload treatment of low-precision
  operands and scales, not exact hardware-format numeric decoding.
- Tensor-memory operations preserve payload dataflow; they do not make FpSan a
  substitute for race or synchronization checking.
- Currently fpsan is supported on all NVIDIA hardware, as well as AMD ``gfx942``, ``gfx950``,
  and ``gfx1250``.

------------------------------
Practical Guidance for Checks
------------------------------

FpSan is a good fit when you want to check:

- that two kernels implement the same preserved algebra
- that a fused kernel keeps the intended dataflow
- that predication or accumulator-selection logic is wired correctly
- that a tensor-memory or warp-specialized pipeline preserves payload flow

FpSan is a poor fit when you want to check:

- IEEE edge cases
- real transcendental accuracy
- NaN or infinity behavior
- hardware-format decode semantics for low-precision formats

In short, rely on FpSan for structure-preserving kernel validation, and rely on
ordinary numerical tests for IEEE behavior.