Preempt / resume for experiments#

Long cloud runs get interrupted — a spot instance is reclaimed, a job is cancelled. MIME experiments are made resilient to that with the MADDENING v0.2 checkpoint machinery (versioned, hash-checked CHECKPOINT_SCHEMA_VERSION manifests). There are two patterns, depending on the experiment shape.

Parameter sweeps — ResumableSweep#

A sweep runs many independent work items (parameter combos). Wrap them in mime.data.sweep_resume.ResumableSweep: each finished item is written to a JSON progress checkpoint — carrying a MADDENING integrity manifest — so a relaunch skips the finished items and runs only what is left.

import os
from mime.data.sweep_resume import ResumableSweep

sweep = ResumableSweep(
    items=my_param_combos,
    key=lambda combo: combo["label"],          # unique id per item
    checkpoint_path="data/my_experiment.h5",   # progress sits beside it
    resume_from=os.environ.get("SWEEP_RESUME_FROM"),
    snapshot_dir=os.environ.get("SWEEP_SNAPSHOT_DIR"),
)

for result in sweep.completed:     # items finished in a prior run —
    write_output(result)           #   replay them into the output

for combo in sweep.pending:        # only the unfinished items
    result = run_one(combo)
    write_output(result)
    sweep.record(combo, result)    # checkpointed immediately
  • resume_from — a path / file:// / http(s):// URL of a prior checkpoint. A missing source is treated as a fresh start, so a job spec can set it unconditionally. With it unset, a relaunch on the same machine still resumes from the local checkpoint beside checkpoint_path.

  • snapshot_dir — durable storage (a mounted volume/bucket) the checkpoint is mirrored to after every item, so it survives an instance teardown.

Worked example: scripts/run_confinement_sweep.py. Job-spec wiring: jobs/production_h100.yaml.

Single-graph experiments — MADDENING checkpoint API#

An experiment that runs one long-lived graph (rather than a sweep) uses MADDENING’s graph-state checkpoint API directly:

from maddening.core.simulation.checkpoint import (
    save_state_with_manifest, download_and_load_state,
)

# snapshot the running graph
save_state_with_manifest(graph, "snapshot.npz", extra={"step": step})

# resume it on a fresh instance
download_and_load_state(graph, "file:///mnt/durable/snapshot.npz")

For a server-style run, maddening.cloud.entrypoint.make_preempt_snapshot_hook builds an on-preemption snapshot callback and resume_from_url is the resume entry point. Both patterns write the same versioned, hash-checked manifest, so a checkpoint from either is integrity-verified on load.