Couette Torque Benchmarks: Mass Conservation & TF32 Precision (RESOLVED)
Investigation: 2026-05-20 → 2026-05-21
Benchmarks: MIME-VER-008, MIME-VER-009,
TestBouzidiCI::test_torque_accuracy_under_5_percent,
test_bouzidi_convergence_order
Test file: tests/verification/test_ladd_cylinder.py
Status: RESOLVED. Two genuine float32 defects in the D3Q19 LBM were
found and fixed. All four Couette benchmarks are un-xfailed and pass; the
three pre-existing test_d3q19 / test_d2q9 momentum failures are fixed.
Summary
The rotating-cylinder Couette torque benchmarks failed: the torque error was
~27 % at 128³ and grew with resolution, and the flow never reached a steady
state. The cause was two distinct float32 defects in the LBM core. Neither is a
benchmark-design problem and neither is a v0.2 regression (both have been
present since the v0.1.0 tag).
Defect 1: BGK collision mass non-conservation (fix: commit 84016bf)
The BGK collision f_out = f − (f − f_eq)/τ conserves mass exactly in exact
arithmetic (the D3Q19 moment identities make the u² terms cancel). In float32
that cancellation is not exact — the equilibrium polynomial does not sum to
exactly the node density, leaving a systematic O(u²) residual. Any non-uniform
flow then loses mass ~1e-6/step; with a continuously-driven rotating wall this
never decays and integrates into an unbounded ∝Ω² drift, so the flow never
reaches a steady state.
Fix — in collide_bgk (d3q19.py), route the per-node residual into the
rest population (e₀ = (0,0,0), so it is momentum-neutral):
f_out = f_out.at[..., 0].add(jnp.sum(f, axis=-1) - jnp.sum(f_out, axis=-1))
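A minimal JAX sketch of the correction in context may help. Only the `.at[..., 0].add(...)` line is taken from the fix; the lattice ordering, the `moments` and `equilibrium` helpers, the `conserve_mass` switch and the test field are assumptions made for illustration and are not the contents of d3q19.py. The moments are written as elementwise sums here so that the example does not depend on matmul precision.

```python
import jax.numpy as jnp
import numpy as np

# Assumed D3Q19 ordering: index 0 is the rest population e0 = (0, 0, 0).
E = np.array(
    [[0, 0, 0],
     [1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1],
     [1, 1, 0], [-1, -1, 0], [1, -1, 0], [-1, 1, 0],
     [1, 0, 1], [-1, 0, -1], [1, 0, -1], [-1, 0, 1],
     [0, 1, 1], [0, -1, -1], [0, 1, -1], [0, -1, 1]],
    dtype=np.float32,
)
W = np.array([1 / 3] + [1 / 18] * 6 + [1 / 36] * 12, dtype=np.float32)


def moments(f):
    """Density and momentum per node, as elementwise sums (no matmul)."""
    rho = jnp.sum(f, axis=-1)
    momentum = jnp.sum(f[..., None] * E, axis=-2)
    return rho, momentum


def equilibrium(rho, u):
    """Second-order equilibrium polynomial, shape (..., 19)."""
    eu = jnp.sum(u[..., None, :] * E, axis=-1)  # e_i . u
    uu = jnp.sum(u * u, axis=-1, keepdims=True)
    return W * rho[..., None] * (1.0 + 3.0 * eu + 4.5 * eu**2 - 1.5 * uu)


def collide_bgk(f, tau, conserve_mass=True):
    """BGK collision; optionally routes the float32 mass residual into f[..., 0]."""
    rho, momentum = moments(f)
    f_eq = equilibrium(rho, momentum / rho[..., None])
    f_out = f - (f - f_eq) / tau
    if conserve_mass:
        # e0 = (0, 0, 0), so this changes the node mass but not its momentum.
        residual = jnp.sum(f, axis=-1) - jnp.sum(f_out, axis=-1)
        f_out = f_out.at[..., 0].add(residual)
    return f_out


# Illustrative non-uniform field: a sinusoidal shear wave in u_y along x.
x = jnp.arange(8, dtype=jnp.float32)
u = jnp.zeros((8, 8, 8, 3), jnp.float32)
u = u.at[..., 1].set(0.05 * jnp.sin(2 * jnp.pi * x / 8)[:, None, None])
f = equilibrium(jnp.ones((8, 8, 8), jnp.float32), u)
f_fixed = collide_bgk(f, tau=0.8)
```

Because the residual is added to the population whose lattice velocity is zero, the momentum moment of `f_fixed` is the same as that of the uncorrected output; only the zeroth moment is adjusted.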
Verification — relative Σf drift per step (float64-measured, tau = 0.8):
| case | before | after |
|---|---|---|
| Couette simple BB, Ω = 0.005 | −1.79e-6 | +1.2e-10 |
| Couette simple BB, Ω = 0.010 | −7.98e-6 | +7.8e-11 |
| Couette Bouzidi, Ω = 0.005 | −2.02e-6 | −2.3e-7 |
The ∝Ω² leak is gone: with simple bounce-back the remaining drift is at round-off level (~1e-10/step). Bouzidi keeps a small ∝Ω residual (~2e-7/step), the known mass-conservation error of interpolated bounce-back, which comes from the interpolation formula itself; it is ~10× smaller than before the fix and acceptable.
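Drifts of order 1e-10 per step are below what a float32 reduction over a full lattice can resolve, which is why the figures are float64-measured. A small sketch of how such a measurement might be taken on the host follows; the helper name `relative_mass_drift` is hypothetical and not part of the code base.

```python
import numpy as np


def relative_mass_drift(f_before, f_after):
    """Relative change in total mass between two population arrays.

    The float32 populations are promoted to float64 before summation so that
    the reduction itself does not add round-off at the level being measured.
    """
    m0 = np.asarray(f_before, dtype=np.float64).sum()
    m1 = np.asarray(f_after, dtype=np.float64).sum()
    return (m1 - m0) / m0


# Toy check: add 1e-3 of mass to one population of a uniform 4^3 lattice.
f0 = np.full((4, 4, 4, 19), 1.0 / 19.0, dtype=np.float32)
f1 = f0.copy()
f1[0, 0, 0, 0] += np.float32(1.0e-3)
drift = relative_mass_drift(f0, f1)  # about 1e-3 / 64
```

`np.asarray` also accepts JAX arrays, so the same helper can be applied to the state before and after one LBM step.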
Defect 2: GPU TF32 matmul precision (fix: commit ccc0d53)
The LBM moments are matmuls (momentum = f @ E in compute_macroscopic,
e·u = velocity @ Eᵀ in equilibrium). On GPU, JAX’s default float32 matmul
precision is TF32 (~10-bit mantissa). A moment is a tiny residual of a
near-cancellation of the ~0.05-magnitude populations, far below TF32
granularity (a 10-bit mantissa resolves steps of only ~3e-5 at that magnitude).
Decisive test: a batched f @ E for a near-rest field returns exactly
0.0 at default precision vs the correct 1.0e-5 at
precision="highest". TF32 corrupted the velocity → the equilibrium → the
collision relaxed toward a wrong-momentum target → a spurious
velocity-proportional drag that drooped the Couette profile
(u/u_analytical 0.99 at the inner wall → 0.67 at the outer, 128³), worsening
with resolution. At very low speed the corruption is total — a forced
Poiseuille flow (~1e-5 velocity) froze completely (the pre-existing
test_d3q19 / test_d2q9 failure).
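The effect can be illustrated without a GPU by emulating the reduced input mantissa. The sketch below is an emulation under stated assumptions, not the tensor-core code path: it truncates each float32 to 10 explicit mantissa bits (real hardware may round rather than truncate), and the lattice ordering, field size and the `truncate_to_tf32` helper are hypothetical.

```python
import jax
import jax.numpy as jnp
import numpy as np

HIGHEST = jax.lax.Precision.HIGHEST

# Assumed D3Q19 velocity set and weights (ordering is illustrative).
E = np.array(
    [[0, 0, 0],
     [1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1],
     [1, 1, 0], [-1, -1, 0], [1, -1, 0], [-1, 1, 0],
     [1, 0, 1], [-1, 0, -1], [1, 0, -1], [-1, 0, 1],
     [0, 1, 1], [0, -1, -1], [0, 1, -1], [0, -1, 1]],
    dtype=np.float32,
)
W = np.array([1 / 3] + [1 / 18] * 6 + [1 / 36] * 12, dtype=np.float32)


def truncate_to_tf32(x):
    """Zero the low 13 mantissa bits of each float32 (10 explicit bits remain)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)


# Near-rest field: rho = 1, u = (1e-5, 0, 0), linearised equilibrium.
u = jnp.zeros((4, 4, 4, 3), jnp.float32).at[..., 0].set(1.0e-5)
f = W * (1.0 + 3.0 * jnp.matmul(u, E.T, precision=HIGHEST))

momentum_fp32 = jnp.matmul(f, E, precision=HIGHEST)
momentum_tf32_like = jnp.matmul(truncate_to_tf32(f), E, precision=HIGHEST)

print(float(momentum_fp32[0, 0, 0, 0]))       # close to 1e-5
print(float(momentum_tf32_like[0, 0, 0, 0]))  # 0.0
```

After truncation the +x and −x populations of each weight class hold identical bit patterns, so their difference, which is the whole momentum signal, cancels to exactly zero.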
It was masked because test_fvm_ibm.py / test_kinematics.py used to enable
jax_enable_x64 at module scope (x64 matmuls don’t use TF32), which leaked
x64 across the whole session, so the LBM tests passed in the full slow lane
and failed only in genuine float32. (v0.2 removed that leak: x64 is now opt-in
per test via @pytest.mark.x64 + an autouse conftest.py fixture, so this
class of masking can no longer happen — see the
v0.2 release notes.)
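For readers unfamiliar with the pattern, an opt-in x64 marker with an autouse fixture could look roughly like the sketch below. This is not the project's actual conftest.py: the `x64_mode` helper and the fixture name are hypothetical, and only the `x64` marker name comes from the text above.

```python
import contextlib

import jax
import pytest


@contextlib.contextmanager
def x64_mode(enabled):
    """Set jax_enable_x64 for the duration of the block, then restore it."""
    previous = jax.config.jax_enable_x64
    jax.config.update("jax_enable_x64", enabled)
    try:
        yield
    finally:
        jax.config.update("jax_enable_x64", previous)


def pytest_configure(config):
    # Register the marker so that `@pytest.mark.x64` does not trigger a warning.
    config.addinivalue_line("markers", "x64: run this test with jax_enable_x64=True")


@pytest.fixture(autouse=True)
def _x64_opt_in(request):
    """Every test runs in float32 unless it carries the x64 marker."""
    wants_x64 = request.node.get_closest_marker("x64") is not None
    with x64_mode(wants_x64):
        yield
```

Because the fixture is autouse and restores the previous value on teardown, a test that needs float64 cannot change the precision seen by the tests that run after it.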
Fix — precision="highest" on every LBM moment matmul:
compute_macroscopic, equilibrium, guo_forcing (d3q19 + d2q9), and the
momentum-exchange / stress-torque sums in bounce_back.py.
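As an illustration of the pattern (not the actual source of the listed functions), the moment matmuls can share a single helper that pins the precision. The signatures, the `moment_matmul` name and the D2Q9 ordering below are assumptions.

```python
from functools import partial

import jax
import jax.numpy as jnp
import numpy as np

# One helper so that every moment matmul requests full float32 precision.
moment_matmul = partial(jnp.matmul, precision=jax.lax.Precision.HIGHEST)


def compute_macroscopic(f, E):
    """Density and velocity from populations f (..., Q) and velocities E (Q, D)."""
    rho = jnp.sum(f, axis=-1)
    velocity = moment_matmul(f, E) / rho[..., None]
    return rho, velocity


def equilibrium(rho, velocity, E, W):
    """Second-order equilibrium; e.u is itself a matmul and gets the same treatment."""
    eu = moment_matmul(velocity, E.T)
    uu = jnp.sum(velocity * velocity, axis=-1, keepdims=True)
    return W * rho[..., None] * (1.0 + 3.0 * eu + 4.5 * eu**2 - 1.5 * uu)


# Assumed D2Q9 ordering, used only to exercise the two functions.
E_D2Q9 = np.array(
    [[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1], [1, 1], [-1, 1], [-1, -1], [1, -1]],
    dtype=np.float32,
)
W_D2Q9 = np.array([4 / 9] + [1 / 9] * 4 + [1 / 36] * 4, dtype=np.float32)

rho0 = jnp.ones((16, 16), jnp.float32)
u0 = jnp.zeros((16, 16, 2), jnp.float32).at[..., 0].set(1.0e-5)
rho, u = compute_macroscopic(equilibrium(rho0, u0, E_D2Q9, W_D2Q9), E_D2Q9)
```

With full float32 precision the 1e-5 velocity survives the round trip through the equilibrium and back; on CPU the `precision` argument has no effect, so the difference is only visible on a TF32-capable GPU.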
Result — velocity-profile-fit Couette torque error vs analytical:
| n (grid n³) | before fix | after fix |
|---|---|---|
| 64 | 0.9 % | 1.8 % |
| 96 | 4.0 % | 1.1 % |
| 128 | 7.0 % | 1.3 % |
The resolution-growth is eliminated; the residual ~1–2 % is genuine wall-position / compressibility accuracy. MEM-torque error at the benchmark resolutions: 1.6 % / 0.1 % / 0.4 %.
Outcome
- MIME-VER-008, MIME-VER-009, TestBouzidiCI::test_torque_accuracy_under_5_percent, test_bouzidi_convergence_order: un-xfailed and passing.
- test_d3q19 / test_d2q9 forced-Poiseuille tests, previously frozen, now develop correctly.
- TestBouzidiRegression IBLBM baseline re-validated: 17.4442 → 17.1138 (the 17.4442 value was itself TF32-corrupted).
- LBM suite (test_ladd_cylinder.py + tests/nodes/lbm/): 194 passed.
Production impact: IBLBMFluidNode
The production rotating-UMR node uses the same collide_bgk (via
lbm_step_split) and the same moment matmuls, so it inherits both fixes
directly. Measured with the fixes, an N=24 rotating UMR has Σf drift
+5.3e-8/step (simple BB) / −7.5e-8/step (Bouzidi) — down from the ~1e-6/step
∝Ω² collision leak. The small remaining residual is not the collision
(static-mask Couette conserves to ~1e-10): it is the rotating-helix mask
change (“fresh-node / refilling”), a separate, pre-existing concern beyond
this fix — flagged for follow-up.
Systemic warning: TF32 beyond the LBM
TF32 silently degrades any low-magnitude float32 matmul on GPU. It degraded
LBM results by 100 % or more in the worst case (for example, the low-speed
forced Poiseuille flow froze completely) and went unnoticed for months. Other
precision-sensitive float32 matmul paths — the FVM dense-DCT pressure solver,
the Stokeslet BEM, and the Pallas/Triton LBM kernel (pallas_lbm.py, left out
of scope here) — should be audited. Prefer per-physics-path
precision="highest" over a global flag (the GNN/MLP surrogates are
TF32-tolerant; a global flag would needlessly slow them).
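One way to keep the precision choice local to a physics path is sketched below, using either the per-call `precision` argument or JAX's `jax.default_matmul_precision` context manager around that path only. The function names and array shapes are hypothetical; whether a scoped context or per-call arguments fit a given module is a judgment call for the audit.

```python
import jax
import jax.numpy as jnp


def lbm_momentum(f, E):
    # Hypothetical physics-path function that uses a plain matmul internally.
    return f @ E


def surrogate_layer(x, w):
    # Hypothetical TF32-tolerant path, left at the backend default precision.
    return jnp.tanh(x @ w)


f = jnp.full((8, 8, 19), 1.0 / 19.0, jnp.float32)
E = jnp.ones((19, 3), jnp.float32)

# Scoped: matmuls issued inside this block request full float32 precision.
with jax.default_matmul_precision("float32"):
    momentum_scoped = lbm_momentum(f, E)

# Per-call form, as used in the LBM fix.
momentum_per_call = jnp.matmul(f, E, precision="highest")

# The surrogate is called outside the scope and keeps the faster default.
hidden = surrogate_layer(jnp.ones((4, 16), jnp.float32), jnp.ones((16, 16), jnp.float32))
```

On CPU both forms give the same numbers; the distinction only matters on backends where the default float32 matmul precision is reduced.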
Superseded diagnoses (historical record)
The route to the root cause took two wrong turns, recorded here for honesty:
“Ghost-node bounce-back duplication” (2026-05-20). The original benchmark-redesign investigation correctly established that the mass leak is real, monotonic, ∝Ω² and resolution-independent, but mis-attributed it to the bounce-back. Decisive falsification: a fully periodic box with no walls and no bounce-back, carrying a non-uniform flow, leaks at the same ∝u² rate — the cause is the collision, not the bounce-back.
“A separate torque-overshoot bug.” An intermediate framing held the mass leak and the torque overshoot to be independent defects. The overshoot was the TF32 corruption; both symptoms are covered by the two fixes above.
This report supersedes the open questions in bouzidi_ibb_diagnostics.md
(2026-03-22), which chased the “30–80 % torque overshoot” through q-values and
the Ladd correction but never checked mass conservation or matmul precision.
Reproduction
source /home/nick/MSF/msf/.venv/bin/activate
cd /home/nick/MSF/msf/MIME
python -m pytest -q -m 'slow or not slow' tests/verification/test_ladd_cylinder.py
TF32 check: a batched f @ E (LBM-field-sized, near-rest distribution) returns
0.0 at default GPU precision and 1.0e-5 at precision="highest".