Launching cloud jobs

MIME runs its long sweeps (the confinement sweep, GNN data generation) on cloud GPUs. A run is described by a job spec — a YAML file in jobs/ — and launched with scripts/launch_job.py.

Job specs

A job spec is a MADDENING JobConfig YAML: a provider, a gpu_type, cost guards, and setup / run shell blocks. jobs/ ships:

| Spec | Provider |
| --- | --- |
| production_h100.yaml | RunPod (H100-SXM, on-demand) |
| production_h100_aws.yaml | AWS (template) |
| production_h100_gcp.yaml | GCP (template) |
| rehearsal_a100.yaml | RunPod (A100, rehearsal) |

The AWS and GCP specs are templates: gpu_type, region and the cost guards must be checked against the provider before a real launch — sky show-gpus --cloud aws|gcp for accelerator names, and the provider’s current on-demand / spot pricing for the cost guards.

Launching

# dry run — resolve provider, instance and cost without provisioning
python scripts/launch_job.py --job jobs/production_h100_aws.yaml --dry-run

# real launch
python scripts/launch_job.py --job jobs/production_h100_aws.yaml

launch_job.py is provider-agnostic — it reads the spec’s provider: field and dispatches through MADDENING’s CloudLauncher, which supports runpod, aws, gcp and lambda_labs.
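Because dispatch hinges on the provider: field, it can be worth checking a parsed spec before launching. The sketch below is an illustration only, not launch_job.py or CloudLauncher code: check_spec is a hypothetical helper, the spec is assumed to have been loaded into a plain dict already, and the required-field list is just the set of fields named on this page.

```python
# Hypothetical pre-launch check; NOT part of launch_job.py or MADDENING.
# Provider names and field names are taken from this page; everything
# else (function name, message wording) is an assumption.
SUPPORTED_PROVIDERS = {"runpod", "aws", "gcp", "lambda_labs"}
REQUIRED_FIELDS = ("provider", "gpu_type", "setup", "run")


def check_spec(spec: dict) -> list:
    """Return a list of problems found in a parsed job spec (empty if none)."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in spec
    ]
    provider = spec.get("provider")
    if provider is not None and provider not in SUPPORTED_PROVIDERS:
        problems.append(f"unsupported provider: {provider}")
    return problems
```

An empty list means the spec names a supported provider and carries the basic fields. It says nothing about whether gpu_type, region or the cost guards are valid for that provider, which still has to be checked as described under Job specs.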

Credentials

CloudLauncher reads credentials from ~/.maddening/cloud_credentials.yaml (one block per provider). Each provider also has a native credential file it expects — ~/.runpod/config.toml, ~/.aws/credentials, ~/.config/gcloud/application_default_credentials.json — see MADDENING’s cloud/providers.py for the exact formats.
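A quick way to catch a missing credential file before a launch is to test for the paths listed above. This is a minimal sketch under stated assumptions: missing_credentials is a hypothetical helper, it only checks that the files exist (not that their contents are valid), and lambda_labs has no native entry because this page does not name a file for it.

```python
from pathlib import Path

# Paths relative to the home directory, as listed on this page.
LAUNCHER_CREDENTIALS = ".maddening/cloud_credentials.yaml"
NATIVE_CREDENTIALS = {
    "runpod": ".runpod/config.toml",
    "aws": ".aws/credentials",
    "gcp": ".config/gcloud/application_default_credentials.json",
}


def missing_credentials(provider, home=None):
    """Return the expected credential files that do not exist for a provider.

    Hypothetical helper: checks file presence only, never file contents.
    """
    home = Path(home) if home is not None else Path.home()
    expected = [LAUNCHER_CREDENTIALS]
    if provider in NATIVE_CREDENTIALS:
        expected.append(NATIVE_CREDENTIALS[provider])
    return [home / rel for rel in expected if not (home / rel).is_file()]
```

Calling missing_credentials("aws") returns an empty list when both ~/.maddening/cloud_credentials.yaml and ~/.aws/credentials are present.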

Spot instances and resume

The AWS/GCP templates set use_spot: true. Spot is safe for the confinement sweep because it is resumable: ResumableSweep (see preempt/resume) checkpoints after every combo. Point SWEEP_SNAPSHOT_DIR at durable storage (a mounted S3/GCS bucket) so the checkpoint survives a preemption; a relaunch then resumes instead of restarting from zero. Setting spot_fallback: true additionally falls back to on-demand if spot capacity is unavailable.
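The checkpoint-after-every-combo pattern can be sketched in a few lines. This is not the ResumableSweep implementation. It is a simplified stand-in that assumes combos are JSON-serialisable and results fit in a single JSON file; run_sweep and the file name sweep_checkpoint.json are hypothetical.

```python
import json
import os
from pathlib import Path


def run_sweep(combos, evaluate, snapshot_dir):
    """Evaluate each combo once, saving a checkpoint after every one.

    Simplified illustration of the resume pattern; NOT ResumableSweep.
    """
    directory = Path(snapshot_dir)
    directory.mkdir(parents=True, exist_ok=True)
    snapshot = directory / "sweep_checkpoint.json"  # hypothetical file name

    # Load results from a previous (possibly preempted) run, if any.
    done = json.loads(snapshot.read_text()) if snapshot.exists() else {}

    for combo in combos:
        key = json.dumps(combo)
        if key in done:
            continue  # finished before the interruption; skip it
        done[key] = evaluate(combo)
        # Write to a temporary file, then rename over the checkpoint.
        # The rename is atomic on a local POSIX filesystem; a mounted
        # bucket may not give the same guarantee.
        tmp = snapshot.with_suffix(".tmp")
        tmp.write_text(json.dumps(done))
        os.replace(tmp, snapshot)
    return done
```

A job would call it as run_sweep(combos, evaluate, os.environ["SWEEP_SNAPSHOT_DIR"]). After a preemption, the same call on the relaunched instance skips every combo already recorded in the checkpoint.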