# Launching cloud jobs

MIME runs its long sweeps (the confinement sweep, GNN data generation) on cloud GPUs. A run is described by a **job spec**, a YAML file in `jobs/`, and launched with `scripts/launch_job.py`.

## Job specs

A job spec is a MADDENING `JobConfig` YAML: a `provider`, a `gpu_type`, cost guards, and `setup` / `run` shell blocks. `jobs/` ships:

| Spec | Provider |
|---|---|
| `production_h100.yaml` | RunPod (H100-SXM, on-demand) |
| `production_h100_aws.yaml` | AWS (template) |
| `production_h100_gcp.yaml` | GCP (template) |
| `rehearsal_a100.yaml` | RunPod (A100, rehearsal) |

The AWS and GCP specs are **templates**: `gpu_type`, `region` and the cost guards must be checked against the provider before a real launch. Use `sky show-gpus --cloud aws|gcp` for accelerator names, and the provider's current on-demand / spot pricing for the cost guards.

## Launching

```bash
# dry run: resolve provider, instance and cost without provisioning
python scripts/launch_job.py --job jobs/production_h100_aws.yaml --dry-run

# real launch
python scripts/launch_job.py --job jobs/production_h100_aws.yaml
```

`launch_job.py` is provider-agnostic: it reads the spec's `provider:` field and dispatches through MADDENING's `CloudLauncher`, which supports `runpod`, `aws`, `gcp` and `lambda_labs`.

## Credentials

`CloudLauncher` reads credentials from `~/.maddening/cloud_credentials.yaml` (one block per provider). Each provider also has a native credential file it expects (`~/.runpod/config.toml`, `~/.aws/credentials`, `~/.config/gcloud/application_default_credentials.json`); see MADDENING's `cloud/providers.py` for the exact formats.

## Spot instances and resume

The AWS/GCP templates set `use_spot: true`. Spot is safe for the confinement sweep because it is **resumable**: `ResumableSweep` (see [preempt/resume](../preempt_resume.md)) checkpoints after every combo.
Point `SWEEP_SNAPSHOT_DIR` at durable storage (a mounted S3/GCS bucket) so the checkpoint survives a preemption; a relaunch then resumes instead of restarting from zero. `spot_fallback: true` additionally falls back to on-demand if spot capacity is unavailable.
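
To make the checkpoint-and-resume idea concrete, here is a minimal sketch of the pattern. It is not the `ResumableSweep` implementation: the function `run_resumable`, the file name `sweep_checkpoint.json` and the JSON layout are all hypothetical, chosen only for illustration. The one name taken from this page is the `SWEEP_SNAPSHOT_DIR` environment variable.

```python
import json
import os
import tempfile
from pathlib import Path


def run_resumable(combos, run_one, snapshot_dir):
    """Run run_one(combo) for each combo, skipping combos already checkpointed.

    Illustrative only: the real ResumableSweep may store its state differently.
    """
    snapshot_dir = Path(snapshot_dir)
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    ckpt = snapshot_dir / "sweep_checkpoint.json"  # hypothetical file name

    # Load earlier results if a previous (preempted) run left a checkpoint.
    results = json.loads(ckpt.read_text()) if ckpt.exists() else {}

    for combo in combos:
        key = json.dumps(combo, sort_keys=True)  # stable key for this combo
        if key in results:
            continue  # already finished before the preemption
        results[key] = run_one(combo)
        # Write to a temp file and rename it over the checkpoint. The rename is
        # atomic on a local POSIX filesystem; a mounted bucket may not give the
        # same guarantee.
        tmp = ckpt.with_suffix(".tmp")
        tmp.write_text(json.dumps(results))
        os.replace(tmp, ckpt)
    return results


if __name__ == "__main__":
    # Fall back to a throwaway directory when SWEEP_SNAPSHOT_DIR is unset.
    snapshot_dir = os.environ.get("SWEEP_SNAPSHOT_DIR") or tempfile.mkdtemp()
    combos = [{"radius": r} for r in (1, 2, 3)]
    print(run_resumable(combos, lambda c: c["radius"] ** 2, snapshot_dir))
```

Running the script twice against the same directory does no work the second time, because every combo is already in the checkpoint. If the directory lives on ephemeral instance disk instead, the checkpoint is lost with the instance and the relaunch starts from zero.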