Docker Image Update Plan#
Current state#
ghcr.io/microrobotics-simulation-framework/maddening-cloud:latest uses:
Base image:
nvidia/cuda:12.2.2-runtime-ubuntu22.04JAX install:
jax[cuda12]>=0.4,<0.6(resolves to JAX 0.5.3 with cuDNN for CUDA 12.2)Python: 3.10 (system default on Ubuntu 22.04)
Problem: RunPod A100 SXM hosts now run NVIDIA driver 570 (CUDA 12.8). The
cuDNN bundled with JAX 0.5.3’s CUDA 12.2 build is incompatible with this driver,
producing CUDNN_STATUS_INTERNAL_ERROR on any GPU kernel launch. Discovered
during the T2.6 cloud rehearsal (2026-03-23).
Current workaround: The job config setup script includes
pip3 install --upgrade 'jax[cuda12]' as the first line, which upgrades JAX
to 0.6.2 with CUDA 12.9 cuDNN at runtime. This adds ~30 seconds to setup
but works reliably.
Proposed change#
Update the Docker image to use a CUDA 12.8+ base and a compatible JAX version:
Base image:
nvidia/cuda:12.8.0-runtime-ubuntu22.04(or latest available 12.x)JAX install:
jax[cuda12]>=0.5.0(confirmed working at JAX 0.6.2 in rehearsal)Tag:
latest(overwrite — single user, no dependents)
Steps to implement (deferred)#
Update
MADDENING/docker/Dockerfile.cloud:Change
FROM nvidia/cuda:12.2.2-runtime-ubuntu22.04toFROM nvidia/cuda:12.8.0-runtime-ubuntu22.04Change
jax[cuda12]>=0.4,<0.6tojax[cuda12]>=0.5.0
Build locally:
cd /path/to/MADDENING docker build -f docker/Dockerfile.cloud -t ghcr.io/microrobotics-simulation-framework/maddening-cloud:latest .
Test locally:
docker run --gpus all ghcr.io/microrobotics-simulation-framework/maddening-cloud:latest \ python3 -c "import jax; print(f'JAX {jax.__version__}: {jax.default_backend()}')"
Push:
docker push ghcr.io/microrobotics-simulation-framework/maddening-cloud:latest
Remove
pip3 install --upgrade 'jax[cuda12]'from all job config setup scriptsTest with a short rehearsal run to confirm end-to-end
When to implement#
After the T2.6 production sweep completes and results are reviewed. The setup script workaround is sufficient for the production run. Updating the image during active development risks breaking the production config if the build fails.
Risk assessment#
Single user, latest tag, no dependents — no risk of breaking other users
Build failure: leaves existing image untouched until push succeeds
Dockerfile is single-stage: base image update is a one-line change
MADDENING version pin:
jax>=0.5.0is compatible with MADDENING’sjax>=0.4,<0.6constraint. JAX 0.6.2 (installed in rehearsal) technically violates MADDENING’s<0.6upper bound — this constraint should be relaxed to<0.7in MADDENING’s pyproject.toml as part of this update.