Launching cloud jobs
MIME runs its long sweeps (the confinement sweep, GNN data generation) on
cloud GPUs. A run is described by a job spec — a YAML file in jobs/ —
and launched with scripts/launch_job.py.
Job specs
A job spec is a MADDENING JobConfig YAML: a provider, a gpu_type, cost
guards, and setup / run shell blocks. jobs/ ships:
| Spec | Provider |
|---|---|
| | RunPod (H100-SXM, on-demand) |
| | AWS (template) |
| | GCP (template) |
| | RunPod (A100, rehearsal) |
The AWS and GCP specs are templates: gpu_type, region and the cost
guards must be checked against the provider before a real launch. Run
sky show-gpus --cloud aws (or --cloud gcp) to confirm accelerator names, and
compare the cost guards against the provider’s current on-demand and spot
pricing.
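Part of that check can be automated before any provisioning happens. The sketch below is a hypothetical helper, not part of MADDENING or launch_job.py. It assumes the spec has already been parsed into a dict and that the keys are spelled provider, gpu_type, region, setup and run. It does not look at the cost guards, because their field names are not given here.

```python
# Hypothetical pre-launch check; not part of MADDENING or launch_job.py.
# Key names are assumptions based on the fields named in this page.
SUPPORTED_PROVIDERS = {"runpod", "aws", "gcp", "lambda_labs"}
REQUIRED_KEYS = ("provider", "gpu_type", "setup", "run")
REGION_PROVIDERS = {"aws", "gcp"}  # the templates that carry a region


def check_spec(spec: dict) -> list:
    """Return a list of problems; an empty list means the spec passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not spec.get(key)]

    provider = spec.get("provider")
    if provider and provider not in SUPPORTED_PROVIDERS:
        problems.append(f"unsupported provider: {provider}")

    # Only the AWS and GCP templates are described as needing a region.
    if provider in REGION_PROVIDERS and not spec.get("region"):
        problems.append("missing key: region")

    return problems
```

An empty result only says the spec is structurally complete. Accelerator names and prices still have to be compared against the provider by hand.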
Launching
# dry run — resolve provider, instance and cost without provisioning
python scripts/launch_job.py --job jobs/production_h100_aws.yaml --dry-run
# real launch
python scripts/launch_job.py --job jobs/production_h100_aws.yaml
launch_job.py is provider-agnostic — it reads the spec’s provider: field
and dispatches through MADDENING’s CloudLauncher, which supports runpod,
aws, gcp and lambda_labs.
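When launches are scripted, it can help to make the dry run a mandatory first step. The following is a minimal sketch of such a wrapper. It uses only the --job and --dry-run flags shown above, and it assumes that launch_job.py exits with a non-zero status when the dry run fails, which this page does not confirm. The function names are hypothetical.

```python
import subprocess
import sys


def launch_command(job: str, dry_run: bool = False) -> list:
    """Build the argument list for scripts/launch_job.py."""
    cmd = [sys.executable, "scripts/launch_job.py", "--job", job]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def launch_with_dry_run(job: str) -> None:
    """Run the dry run first and launch for real only if it succeeds.

    Assumption: a failed dry run returns a non-zero exit code, so
    check=True raises CalledProcessError before anything is provisioned.
    """
    subprocess.run(launch_command(job, dry_run=True), check=True)
    subprocess.run(launch_command(job), check=True)
```

Calling launch_with_dry_run("jobs/production_h100_aws.yaml") from the repository root would then run both commands in order.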
Credentials
CloudLauncher reads credentials from ~/.maddening/cloud_credentials.yaml
(one block per provider). Each provider also has a native credential file it
expects — ~/.runpod/config.toml, ~/.aws/credentials,
~/.config/gcloud/application_default_credentials.json — see MADDENING’s
cloud/providers.py for the exact formats.
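A quick way to catch a missing credential file before launching is to test for the paths listed above. This is a hypothetical helper, not part of MADDENING. It only checks that the files exist, not that their contents are valid, and it has no native path for lambda_labs because none is given here.

```python
from pathlib import Path
from typing import Optional

# Paths relative to the home directory, taken from the list above.
SHARED_CREDENTIALS = ".maddening/cloud_credentials.yaml"
NATIVE_CREDENTIALS = {
    "runpod": ".runpod/config.toml",
    "aws": ".aws/credentials",
    "gcp": ".config/gcloud/application_default_credentials.json",
}


def missing_credential_files(provider: str, home: Optional[Path] = None) -> list:
    """Return the expected credential files that do not exist for a provider."""
    home = Path.home() if home is None else home
    expected = [home / SHARED_CREDENTIALS]

    native = NATIVE_CREDENTIALS.get(provider)
    if native is not None:
        expected.append(home / native)

    return [path for path in expected if not path.is_file()]
```

For example, missing_credential_files("aws") returns an empty list when both ~/.maddening/cloud_credentials.yaml and ~/.aws/credentials are present.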
Spot instances and resume
The AWS/GCP templates set use_spot: true. Spot is safe for the confinement
sweep because the sweep is resumable: ResumableSweep (see preempt/resume)
checkpoints after every combo. Point
SWEEP_SNAPSHOT_DIR at durable storage (a mounted S3/GCS bucket) so the
checkpoint survives a preemption; a relaunch then resumes instead of
restarting from zero. spot_fallback: true additionally falls back to
on-demand if spot capacity is unavailable.
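The checkpoint-after-every-combo pattern can be illustrated with a small stand-in. This sketch is not ResumableSweep and does not use its API. The function run_sweep, the checkpoint file name and the JSON format are all hypothetical, and the sketch assumes combos and results are JSON-serializable. It reads SWEEP_SNAPSHOT_DIR from the environment unless a directory is passed explicitly.

```python
import json
import os
from pathlib import Path


def run_sweep(combos, evaluate, snapshot_dir=None):
    """Evaluate every combo, saving a checkpoint after each one.

    Illustrative stand-in only; ResumableSweep's real interface may differ.
    """
    snapshot_dir = Path(snapshot_dir or os.environ["SWEEP_SNAPSHOT_DIR"])
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    checkpoint = snapshot_dir / "sweep_checkpoint.json"  # hypothetical name

    # On a relaunch, start from whatever the previous instance finished.
    results = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}

    for combo in combos:
        key = json.dumps(combo, sort_keys=True)
        if key in results:
            continue  # already done before the preemption
        results[key] = evaluate(combo)

        # Write to a temporary file, then swap it into place, so a
        # preemption mid-write does not leave a truncated checkpoint.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(results))
        os.replace(tmp, checkpoint)

    return results
```

The rename step is atomic on a local POSIX filesystem. A bucket mounted through FUSE may not give the same guarantee, so that behaviour is worth confirming for the storage actually used.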