What’s new in v0.2

Added in version v0.2.

v0.2 is the cycle that turns MADDENING from a single-GPU prototype into a deployable multi-physics framework: pencil-decomposed sharding for big simulations, a clean cloud-resume path that survives spot preemption, edge validation that catches wiring errors at compile time, and the static-data channel that finally gives nodes a place to keep large arrays that don’t evolve in time.

Of nine roadmap items, eight shipped; item #2 is partial, pending the MIME decoder pull-over. Full per-feature status lives in V0.2_PROGRESS.md in the repo root.

Highlights

Halo / pencil decomposition (#1)

3-D pencil decomposition for stencil nodes lands as M1-M8. The contract change: every SimulationNode now declares its halo via halo_width() returning a dict[axis, width] instead of the old requires_halo boolean. Pointwise nodes return {}; 1-D heat with a 2nd-order stencil returns {0: 1}; D3Q19 LBM returns {0: 1, 1: 1, 2: 1}.
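
A minimal sketch of the three declarations above. It uses a stand-in base class rather than the real SimulationNode, whose constructor and other methods are not covered in these notes. The names NodeStandIn, PointwiseGain, Heat1D, LBM3D, and padded_shape are hypothetical, and the empty default as well as the "ghost cells on both sides" reading of a width are assumptions.

```python
from typing import Dict, Tuple


class NodeStandIn:
    """Hypothetical stand-in for SimulationNode; only halo_width() is modeled."""

    def halo_width(self) -> Dict[int, int]:
        # Assumed default: a pointwise update needs no halo cells.
        return {}


class PointwiseGain(NodeStandIn):
    """Pointwise node: inherits the empty halo declaration."""


class Heat1D(NodeStandIn):
    def halo_width(self) -> Dict[int, int]:
        # A 2nd-order stencil along axis 0 reads one neighbor per side.
        return {0: 1}


class LBM3D(NodeStandIn):
    def halo_width(self) -> Dict[int, int]:
        # D3Q19 streaming reaches one cell along every axis.
        return {0: 1, 1: 1, 2: 1}


def padded_shape(shape: Tuple[int, ...], halo: Dict[int, int]) -> Tuple[int, ...]:
    """Local block shape with `width` ghost cells added on both sides of each axis."""
    return tuple(n + 2 * halo.get(axis, 0) for axis, n in enumerate(shape))
```

Under these assumptions, a 32³ local block for the LBM node would be stored as 34³ once ghost cells are included, while the pointwise node keeps its shape unchanged.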

Note

requires_halo stays around as a default-implemented compatibility shim until v0.3. Subclasses that override it instead of halo_width emit a DeprecationWarning pointing at the new API.
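
One way a shim like this could work is to inspect each subclass when it is defined. The sketch below is an assumption about the mechanism, not the library's code: NodeStandIn is a hypothetical base class, and modeling requires_halo as a property derived from halo_width() is a guess.

```python
import warnings


class NodeStandIn:
    """Hypothetical stand-in for SimulationNode with a v0.1-style shim."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        overrides_old = "requires_halo" in cls.__dict__
        overrides_new = "halo_width" in cls.__dict__
        if overrides_old and not overrides_new:
            # Legacy subclass: still works, but points at the new API.
            warnings.warn(
                f"{cls.__name__} overrides requires_halo; "
                "override halo_width() instead",
                DeprecationWarning,
                stacklevel=2,
            )

    @property
    def requires_halo(self) -> bool:
        # Default shim: derived from the new declaration.
        return bool(self.halo_width())

    def halo_width(self) -> dict:
        return {}
```

With this layout, a v0.1 subclass that sets `requires_halo = True` warns once at class-definition time, and a v0.2 subclass that overrides only halo_width() gets a consistent requires_halo for free.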

Static-data channel on SimulationNode (#3)

Optional static_data property lets a node carry non-state arrays (meshes, wall masks, basis functions, lookup tables) outside the state pytree. JAX bakes them into the JIT’d HLO as constants instead of carrying them through every fori_loop iteration.

A drift check at every step entry-point re-hashes the (key, shape, dtype) tuples and triggers a recompile if any node’s static_data shape changed — typical case: replace_node swapping in a different mesh size. See DESIGN.md §2 “Static-data channel” for the full contract.
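
The drift check can be pictured as a metadata-only hash. The following is a sketch under assumptions, not the shipped implementation: static_signature and needs_recompile are hypothetical names, SHA-256 is an arbitrary choice of hash, and SimpleNamespace objects stand in for real arrays so the example has no dependencies.

```python
import hashlib
from types import SimpleNamespace


def static_signature(static_data):
    """Hash the (key, shape, dtype) tuples of a node's static arrays.

    Array contents are ignored on purpose: only metadata that would
    invalidate a compiled program takes part in the hash.
    """
    entries = sorted(
        (key, tuple(arr.shape), str(arr.dtype))
        for key, arr in static_data.items()
    )
    return hashlib.sha256(repr(entries).encode("utf-8")).hexdigest()


def needs_recompile(cached_signature, static_data):
    """Return True when the current signature differs from the cached one."""
    return static_signature(static_data) != cached_signature


# Stand-ins for arrays: anything exposing .shape and .dtype works here.
coarse = {"mesh": SimpleNamespace(shape=(64, 3), dtype="float32")}
fine = {"mesh": SimpleNamespace(shape=(256, 3), dtype="float32")}

cached = static_signature(coarse)
print(needs_recompile(cached, coarse))  # same metadata
print(needs_recompile(cached, fine))    # mesh size changed
```

Because only shapes and dtypes are hashed, the check stays cheap enough to run at every step entry-point regardless of how large the arrays are.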

HeatNode is the first in-tree consumer: _grid_x migrated from “re-build a JAX array on every property access” to “build once, expose via static_data”.
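
The "build once" pattern can be sketched with a cached property. HeatNodeSketch is a hypothetical stand-in, not the real HeatNode: the constructor arguments and the cell-centered grid are assumptions, and a plain tuple replaces the JAX array to keep the example dependency-free.

```python
from functools import cached_property


class HeatNodeSketch:
    """Hypothetical stand-in for HeatNode; only the grid handling is modeled."""

    def __init__(self, n_cells: int, length: float = 1.0):
        self.n_cells = n_cells
        self.length = length

    @cached_property
    def static_data(self) -> dict:
        # Built on first access, then reused on every later access.
        dx = self.length / self.n_cells
        grid_x = tuple((i + 0.5) * dx for i in range(self.n_cells))
        return {"grid_x": grid_x}
```

Repeated reads of `node.static_data` return the same object, so nothing is rebuilt per access.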

Compile-time edge validation (#4)

GraphManager.compile() now walks every edge and surfaces shape/dtype/unit mismatches as four warning classes:

  • ShapeMismatchWarning

  • DtypeMismatchWarning

  • UnitMismatchWarning

  • EdgeValidationWarning (parent class of the three above)

A transform on the edge suppresses the check, because the transform may reshape on the fly. Warnings are aggregated across the whole graph instead of stopping at the first problem: a 20-edge graph with three different problems fires three warnings, not one.
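
The behavior described above can be sketched with the standard warnings module. The warning classes here are local stand-ins that reuse the names from the list; deriving them from UserWarning, and the Port, Edge, and validate_edges helpers, are assumptions made for illustration.

```python
import warnings
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


class EdgeValidationWarning(UserWarning):
    """Parent class, so one filter can catch every edge problem."""


class ShapeMismatchWarning(EdgeValidationWarning):
    pass


class DtypeMismatchWarning(EdgeValidationWarning):
    pass


class UnitMismatchWarning(EdgeValidationWarning):
    pass


@dataclass
class Port:
    shape: Tuple[int, ...]
    dtype: str
    unit: str


@dataclass
class Edge:
    source: Port
    target: Port
    transform: Optional[Callable] = None


def validate_edges(edges):
    """Walk every edge and warn once per problem found; never stop early."""
    for i, edge in enumerate(edges):
        if edge.transform is not None:
            continue  # a transform may reshape or cast, so skip the check
        src, dst = edge.source, edge.target
        if src.shape != dst.shape:
            warnings.warn(f"edge {i}: shape {src.shape} -> {dst.shape}",
                          ShapeMismatchWarning)
        if src.dtype != dst.dtype:
            warnings.warn(f"edge {i}: dtype {src.dtype} -> {dst.dtype}",
                          DtypeMismatchWarning)
        if src.unit != dst.unit:
            warnings.warn(f"edge {i}: unit {src.unit} -> {dst.unit}",
                          UnitMismatchWarning)
```

Sharing one parent class means `warnings.simplefilter("error", EdgeValidationWarning)` would promote all three kinds of mismatch to exceptions in a test suite.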

These checks shipped as warnings in v0.2. v0.2.1 flipped them to hard EdgeValidationError subclasses raised inside an ExceptionGroup (unit mismatches stay as warnings). See Edge validation: migration guide (v0.2 → v0.3.0) for the migration playbook and What’s new in v0.2.1 for the patch release notes.

Field subscriptions + zstd compression (#5, #6)

BinaryStateEncoder learned to pack a subset of fields and optionally compress the payload:

enc = BinaryStateEncoder(
    state,
    fields={"lbm": ["velocity"]},   # drop the 19 f-distributions
    compression="zstd",             # or "zstd+xor"
)

The compression mode is part of the schema, so the /ws/state/binary subscribe message accepts it too:

{"type": "subscribe",
 "fields": {"lbm": ["velocity"]},
 "compression": "zstd"}

On a 32³ LBM-like payload, subscribing to velocity only and enabling zstd cuts the bytes on the wire by 99% for slowly-varying flows. The ZMQ NetworkRelay got the same fields= parameter for static server-side filtering.
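
A back-of-the-envelope check shows how much of that saving field selection alone can account for. The numbers below rest on assumptions not stated in these notes: float32 values, a 3-component velocity, and a payload that holds only the 19 f-distributions plus velocity.

```python
import json

CELLS = 32 ** 3            # lattice sites in a 32^3 domain
BYTES_PER_VALUE = 4        # assumption: float32 fields

# Assumption: the payload holds 19 f-distributions plus a 3-component velocity.
components = {"f": 19, "velocity": 3}

full_bytes = CELLS * BYTES_PER_VALUE * sum(components.values())
velocity_bytes = CELLS * BYTES_PER_VALUE * components["velocity"]
fraction_kept = velocity_bytes / full_bytes

# The subscribe message from above, built programmatically.
subscribe_msg = json.dumps({
    "type": "subscribe",
    "fields": {"lbm": ["velocity"]},
    "compression": "zstd",
})

print(f"full payload:  {full_bytes} bytes")
print(f"velocity only: {velocity_bytes} bytes ({fraction_kept:.1%} of full)")
```

Under these assumptions the field subset keeps 3/22 of the raw bytes (about 14%), so the remainder of the reduction would have to come from zstd on the velocity field.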

A runnable demo is at src/maddening/examples/cloud/streaming/08_subscribe_lbm_velocity.py.

Cloud provider expansion (#7)

AWSProvider and GCPProvider join RunPodProvider and LambdaLabsProvider (the latter promoted out of “stub” status). All four share the CloudProvider ABC and pass the same credential-lifecycle test suite (51 cases covering profile merging, env vars, chmod 0600, deletion semantics).

Examples for each: 02_runpod_launch.py, 03_lambda_launch.py, 04_aws_launch.py, 05_gcp_launch.py.

Note

End-to-end launches against real cloud accounts are out of CI scope — they need real credentials and trigger spend. The credential layer is fully covered offline.

Preempt → snapshot, resume from URL (#8)

CloudSession(on_preempted=hook) now drives a snapshot of the GraphManager state when the spot VM is reclaimed. The new VM reads RESUME_FROM_URL at startup and pulls the state back. Every snapshot ships with a sidecar manifest containing schema_version, SHA-256, size, and a caller-supplied extra dict. Tampering or version drift raises CheckpointIntegrityError.
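
The manifest check can be sketched with hashlib and json. This is an illustration under assumptions, not the library's on-disk format: the key names other than schema_version, the SCHEMA_VERSION value, and the helpers build_manifest and verify_snapshot are hypothetical, and CheckpointIntegrityError is redefined locally as a stand-in.

```python
import hashlib
import json


class CheckpointIntegrityError(Exception):
    """Local stand-in for the library's exception of the same name."""


SCHEMA_VERSION = 1  # hypothetical value


def build_manifest(payload: bytes, extra=None) -> str:
    """Describe a snapshot: schema version, digest, size, caller extras."""
    return json.dumps({
        "schema_version": SCHEMA_VERSION,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size": len(payload),
        "extra": extra or {},
    }, sort_keys=True)


def verify_snapshot(payload: bytes, manifest_json: str) -> dict:
    """Raise on version drift or tampering; return the caller's extra dict."""
    manifest = json.loads(manifest_json)
    if manifest["schema_version"] != SCHEMA_VERSION:
        raise CheckpointIntegrityError("schema version drift")
    if manifest["size"] != len(payload):
        raise CheckpointIntegrityError("size mismatch")
    if manifest["sha256"] != hashlib.sha256(payload).hexdigest():
        raise CheckpointIntegrityError("SHA-256 mismatch")
    return manifest["extra"]
```

Checking the size before the digest lets a truncated download fail fast without hashing the whole payload.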

See Surviving spot preemption for the full contract.

Profiler + Perfetto export (#9)

POST /sim/profile?n_steps=N returns a Perfetto-loadable Trace Event JSON. Drag-and-drop it into https://ui.perfetto.dev for an interactive flame-graph view of per-node timings + coupling overhead.
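
For orientation, this is roughly what a Trace Event JSON document looks like. The sketch builds one from per-node timings using "complete" events (`"ph": "X"`) with microsecond timestamps; the function name, the input tuple layout, and the pid/tid/cat values are assumptions, and the endpoint's actual output may carry more fields.

```python
import json


def to_trace_events(step_timings):
    """Convert [(node_name, start_s, duration_s), ...] into Trace Event JSON."""
    events = [
        {
            "name": name,
            "ph": "X",                   # "complete" event: start + duration
            "ts": round(start_s * 1e6),  # microseconds
            "dur": round(dur_s * 1e6),   # microseconds
            "pid": 1,
            "tid": 1,
            "cat": "node",
        }
        for name, start_s, dur_s in step_timings
    ]
    return json.dumps({"traceEvents": events})


trace_json = to_trace_events([("lbm", 0.0, 0.002), ("heat", 0.002, 0.0005)])
```

Writing `trace_json` to a file yields something the Perfetto UI can open directly.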

POST /sim/profile/jax/start and /stop wrap jax.profiler.start_trace() for an XLA-level capture, and /cloud/teardown snapshots the last trace dir as a base64’d tar.gz in its response so the trace survives VM destruction.
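
Packing a trace directory into a base64 tar.gz needs only the standard library. The helpers below are hypothetical and show the general technique, not the server's code; the name of the response field that carries the blob is not specified here.

```python
import base64
import io
import tarfile
from pathlib import Path


def pack_dir_b64(trace_dir) -> str:
    """Tar and gzip a directory in memory, then base64-encode the bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(str(trace_dir), arcname=Path(trace_dir).name)
    return base64.b64encode(buf.getvalue()).decode("ascii")


def unpack_b64(blob: str) -> dict:
    """Decode the blob and return {member_path: file_bytes} for regular files."""
    raw = base64.b64decode(blob)
    files = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                files[member.name] = tar.extractfile(member).read()
    return files
```

On the client side, `unpack_b64` recovers the trace files from the teardown response after the VM is gone.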

A runnable demo: src/maddening/examples/advanced/profile_lbm_step.py.

Surrogates subpackage scaffolding (#2, partial)

New subpackages — surrogates/primitives/, surrogates/weights/, surrogates/training/, surrogates/replace/ — re-export their contents from the v0.1 leaf-module locations. The decoder zoo extraction from MIME and the SurrogateArchitecture ABC decoupling are queued for v0.2.x.

Breaking changes

None in v0.2.0. All changes are additive or live behind the warnings introduced by #4. v0.2.1 subsequently flipped those warnings to hard errors (see What’s new in v0.2.1 and the semver carve-out in Edge validation: migration guide (v0.2 → v0.3.0)).

Deprecations

| Symbol | Replacement | Removed in |
|---|---|---|
| SimulationNode.requires_halo | halo_width() | v0.3 |
| ShardedNode | ShardedPointwiseNode (deprecated alias) | v0.3 |

New optional dependencies

| Extra | Pulls in | For |
|---|---|---|
| compression | zstandard>=0.22 | binary-encoder compression (#6) |

compression is also rolled into [server], [ci], and [all].

Suite size

MADDENING tests passing: 1358 in v0.1, 1613 in v0.2.

MIME tests passing: 625.

The MADDENING suite added ~250 new tests across the v0.2 work (static_data, edge validation, encoder subscription + compression, profiler Perfetto export, AWS/GCP providers, preempt + manifest).

Migration playbook

If you have v0.1 code:

  1. requires_halo → halo_width. Pointwise nodes are fine as-is. Stencil subclasses should override halo_width() and return a dict[axis, width]. The compat shim keeps v0.1 subclasses working, with a DeprecationWarning.

  2. Edges that previously failed at first step() now warn at compile(). Either fix the mismatch or add a transform=. See Edge validation: migration guide (v0.2 → v0.3.0).

  3. Moving large per-node arrays to static_data is a clean refactor but not required. Nodes whose migration follows the same pattern as requires_halo are documented in DESIGN.md §2.

For cloud users:

  • Spot resilience: wire make_preempt_snapshot_hook into your CloudSession(on_preempted=) callback.

  • Bandwidth: pass fields={...} to BinaryStateEncoder / NetworkRelay and add compression="zstd" to subscribe messages.

  • Profiling: hit POST /sim/profile and drag the JSON into ui.perfetto.dev.

What’s still in flight

See V0.2_PROGRESS.md in the repo root for the per-item open checkboxes. The big remainders:

  • Multi-GPU smoke test on a real RunPod cluster (#1 M9).

  • MIME-decoder pull-over into surrogates/primitives/ (#2).

  • SurrogateTrainer decoupling from the SurrogateArchitecture ABC (#2).

  • s3:// / gs:// / azure:// URL schemes in download_and_load_state (#8).

  • The v0.2.1 flip-to-errors cut for #4.