BeClaude

remote-gpu-trainer

New
67GitHub TrendingGeneralby Hanyuyuan6

Use whenever the user deploys, trains, monitors, or troubleshoots a long-running GPU job on a RENTED or remote instance they do not own — training, eval, ablation sweeps, batch inference, or large data processing — on AutoDL, RunPod, vast.ai, Lambda, Paperspace, Chinese platforms (恒源云/矩池云/Featurize/揽睿星舟), a bare SSH box, Slurm, or Kubernetes; single OR multi-instance. Triggers (multilingual): "远程 GPU 训练", "GPU 租赁", "GPU rental", "租卡", "spot 抢占", "spot preemption", "断点续训", "resumable training", "tmux 训练守护", "防 SSH 断线", "scp/rsync 上传", "多实例 ablation", "远程 GPU 监控", "省钱关机/销毁实例", "stop vs terminate billing", "checkpoint 磁盘满", "CUDA OOM/显存不足", "loss NaN/loss spike", "loss 不下降/不收敛", "overfit 单 batch", "FSDP/DeepSpeed 配置", "多卡训练 hang", "dataloader worker/数据增广 bug". NOT for purely local single-GPU training, in-instance multi-GPU DDP (use torchrun/accelerate), managed multi-cloud price-shopping (use SkyPilot's skill), or zero-ops serverless (use Modal).

First seen 6/19/2026

Summary

This skill orchestrates long-running GPU jobs on rented remote instances across platforms like AutoDL, RunPod, and bare SSH boxes.

  • It handles the full lifecycle—deployment, monitoring, checkpointing, and safe teardown—to prevent data loss and cost overruns on metered hardware.

Overview

remote-gpu-trainer — Remote GPU Job Orchestration

Overview

Deploy and babysit long-running GPU jobs on rented boxes you don't own, across any platform, and get the result off the box before the meter or a preemption kills it. The core insight: you are a short-term tenant on someone else's machine — so the job is to detach the work, make the result outlive the instance, and stop the meter safely, not to provision a cluster.

This skill is platform-agnostic at the core, platform-specific at the edges: a fixed set of operating principles + a 6-phase lifecycle that hold everywhere, plus one profile per platform (profiles/<platform>.md) that owns every concrete path, proxy, billing verb, and spot semantic. Its defensible value is the union the big orchestrators skip: Chinese cgroup-isolated rentals + bare-SSH cheap boxes + the disk-budget / monitoring / teardown reality that is the job on metered hardware.

When NOT to use — and what to use instead

SituationUse instead
Local single-GPU, or multi-GPU DDP inside one boxtorchrun / accelerate directly
Managed multi-cloud price-shopping + auto spot-recovery across Western cloudsSkyPilot (has its own Agent Skill) — then come back here to make your code resume-correct so its recovery actually works
Open BYOC dev environmentsdstack
Zero-ops serverless inferenceModal
"Is this metric / ablation delta real?"REQUIRED: verifying-dl-experiments (this skill owns running the job; that one owns whether the number is true)

This skill is for the blind spot those tools leave: AutoDL + Chinese platforms, bare SSH/Slurm/K8s rentals, and the operational gotchas (inode caps, mirror stalls, cgroup OOM, silent sync, spot grace windows, irreversible teardown) that survive whichever provisioner you use.

Operating principles (the WHY — 10 invariants)

These hold on every metered, isolated, rented GPU; only the paths/CLI change. One line each; the deep form with cross-platform nuance is in `references/principles.md` (read it before Phase 0).

  1. Minimize paid wall-clock. The meter runs the whole time — smoke locally on CPU before renting, launch detached, release the instant verification passes.
  2. Cheap checks before expensive compute. A 1–2 batch CPU smoke (logger off) kills import/config/shape/scale bugs for ~free. (Smoke contentverifying-dl-experiments.)
  3. Trust artifacts you loaded, not log lines that claim success. "synced/saved/done" lies under a silently-failed write; a watcher's own state is also a claim — reconcile it against the real process/artifact.
  4. Know what survives stop vs destroy. Per platform, identify exactly which mount survives a stop and which survives a terminate — the data you need often lives on the volatile one. (The single biggest portability trap.)
  5. Storage fails on the dimension you're not watching. Disk dies on inodes before bytes; the real hog hides in a symlinked cache; clean by value (keep tiny evidence, drop big scratch); monitor df -i, not just df -h.
  6. Never mutate inputs under a live run. A running job holds its scripts in memory by byte-offset; overwriting one mid-run re-executes blocks. Version filenames.
  7. Design for retry — failure is probabilistic, transfers are flaky, mirrors are route-specific. Make wrappers idempotent + resumable; retry the identical config; wrap bulk transfers in timeout+resume loops; a mirror/proxy speeds ONE route — validate on the same route the real transfer uses.
  8. Checkpoint-to-durable + idempotent resume is the universal spine. File checkpoint to the platform's durable location + unconditional load-latest-on-startup is the one mechanism that survives an SSH drop, a Slurm walltime kill, a K8s reschedule, a spot preemption, and a Colab disconnect. The detach primitive (tmux/sbatch/Job/commit) is the swappable plug; this is the invariant.
  9. Cost and destructive actions are the user's call. Never auto-release/terminate, never delete durable files without confirmation; if cleanup can't free space, ask to expand the disk rather than silently shrink the experiment.
  10. Teach the user the platform, don't just drive it. Most users don't know a platform's non-obvious conveniences (one-click SSH-key registration, GPU-availability notifications, built-in panels) or its danger clocks (auto-release/auto-delete timers on a stopped box — AutoDL releases a 关机 instance after 15 days → data disk gone; a stop that keeps billing; low-balance purge). Surface them on first contact — #9 stops the agent doing the dangerous thing, #10 warns the human before the clock fires. Per-platform list → each profile's Surface to the user block.

Monitoring physics (substrate for #3): foreground Bash hard-caps at 600 s; run_in_background has no cap and notifies on exit; a never-exiting watcher never notifies; an unquoted | in a poll regex reads stdin and hangs forever. The four-layer monitoring architecture is built on these facts → references/monitoring_patterns.md.

Code discipline (the wrapper & training scripts you write)

Two rules govern the launch/wrapper/training code this skill has you write — corollaries of #1 and #8, not new invariants:

  1. Reuse before writing. Take the lowest rung that already works before adding code: the base image's pre-installed stack + platform features → a framework/library utility (torchrun / accelerate / HF) → your existing scripts/ templates → minimal new code. On a metered box a needless pip install also burns paid wall-clock and can break the image's ABI — Phase 1's rule (the prebuilt image **is** the env; don't `conda create` on a rental) is exactly this principle applied to dependencies.
  2. Floor — `minimum` bounds scope, not correctness. Shrinking code must never drop what makes an expensive run survivable: checkpoint-to-durable + idempotent resume (#8), atomic writes, the error handling that prevents losing a long run, or seed/determinism logging. Keep one minimal self-check for non-trivial logic.

Pick your platform profile FIRST

Read the matching profile before Phase 0 — it owns every path, proxy, credential location, billing verb, and spot rule the phases below delegate to. Each follows the same 8-field schema (profiles/_schema.md).

New here? The path is: (1) find your platform in the table below → (2) read that profile's LAUNCH

section (it walks rent → register SSH key → reach the box) → (3) come back and run the 6 phases from Phase 0.

Already have a box you can ssh into? Skip straight to Phase 0.

You're on…ProfileKindDetach primitiveMeter-stop verb
AutoDL (deepest, battle-tested)profiles/autodl.mdssh-rentaltmux关机 (stops meter, keeps disk — the AutoDL exception)
RunPodprofiles/runpod.mdssh-rentaltmuxterminate (stop still bills 2×; destroys volume disk)
vast.aiprofiles/vastai.mdssh-rental (spot)tmuxdestroy (stop bills disk forever)
Lambdaprofiles/lambda.mdcloud-apitmuxterminate (no stop state)
Paperspaceprofiles/paperspace.mdcloud-apitmuxdestroy + release IP + delete storage (shut-down stops compute only)
恒源云 / 矩池云 / Featurize / 揽睿星舟profiles/china.mdssh-rentaltmuxper-platform (data disk often bills while stopped)
Bare SSH box / Slurm / K8s / Colab-Kaggleprofiles/generic-ssh.mdssh / slurm / k8stmux / sbatch / Job / commitmanual (a forgotten box bills 24/7)

Profile confidence: AutoDL is battle-tested from the author's daily use; the other six profiles are

built from each platform's official docs + community reports (cited inline, verified <month>) and not

yet independently live-tested — lean on the Phase-0 live measurements and **re-verify any teardown/

billing fact against current docs before betting money or data** (references/self-improvement.md §5).

Mental verb model (one API across all platforms; the profile binds each verb to real commands): up (rent+reach) → push (code/data on) → run (detached + checkpointing) → watch (durable monitor) → pull (results off + verify) → down (stop the meter).

Default workflow (6 phases)

Skip phases already done. Each phase delegates substrate to the profile and ends in a runnable check.

Phase 0 — Environment audit. Read the profile's STORAGE survival-matrix + region/DC-lock. Measure live: df -h && df -i <data-mount>, cgroup memory.max, nvidia-smi. Pre-compute the checkpoint disk budget (ckpt_size × N + scratch). → verify: nvidia-smi shows the expected GPU and df -i is not near 100%.

Phase 1 — SSH + credentials. Set the alias/env per the profile (the prebuilt image/base IS the env — do not conda create on a rental). Never rented before? the profile's LAUNCH section walks rent → register SSH key → connect. Push secrets via stdin, never onto a shared/durable FS (references/ssh_transport.md). → verify: ssh <alias> 'python -c "import torch;print(torch.cuda.is_available())"'.

Phase 2 — Wrapper + CPU-smoke gate. Build an idempotent run_one/run_queue from scripts/ (parameterized from the profile's OVERRIDES; size batch/workers to the box for a standalone run, but PIN them across cells for a fair comparisonreferences/training/throughput-profiling.md). Run the cheap CPU smoke locally BEFORE renting — it kills the dumb, expensive failures (e.g. python -m <your.train.module> --limit-batches 2 --epochs 1 — substitute your own entrypoint; this gate needs your training code plugged in). → verify: that smoke exits 0 on 2 batches with the logger disabled.

Phase 3 — Detached launch. Launch via the profile's detach primitive; probe briefly (log head + alive + no traceback), then hand back — never a blocking foreground sleep. → verify: within 60 s, the detach session is alive and the first log line shows the expected step/epoch.

Phase 4 — Durable monitoring. For anything over ~1–2 h, deploy the four-layer architecture (references/monitoring_patterns.md): on-box self-completion chain + session patrol loop + event sentinels + recovery handbook. On Claude Code, fire the L2 patrol via `/loop 30m` (or `ScheduleWakeup`) running `scripts/health_patrol.sh.template`; a host with no local recurring runner wires the on-box self-push instead (references/monitoring_patterns.md §7). A session-bound watcher alone dies with the session. Classify each outcome → fixed remediation; never blind-retry. → verify: the patrol reports even when nothing changed.

Phase 5 — Aggregate + verify + teardown. Checked-sync to durable storage (gate the success line on the copy result — principle #3), then load-and-verify each artifact (scripts/verify_local.py), THEN the profile's meter-stopping action. → verify: verify_local.py reports 100% OK before any teardown.

Iron Law — teardown gate: NO release / terminate / destroy / file-delete until checkpoints are

pulled to local AND verified by load, and the user has explicitly approved the cost-affecting action.

"It looked done in the log" is not evidence (principle #3). On most platforms the meter-stopping action is

irreversible (deletes the disk) — confirmation matters more, not less.

Parallel ablation fan-out

For N ablation cells: one job per cell, an isolated write path per job (no shared mutable output), launched across instances/queues. REQUIRED: superpowers:dispatching-parallel-agents supplies the independence predicate (don't fan out onto shared state) and the mandatory post-fan-out reconciliation. FS-shared deployment pattern → references/parallel_ablation.md.

Quick reference — the four facts that bite per platform

Full detail in each profile; this table is the at-a-glance.

PlatformSurvives stopSurvives destroySpot graceChina mirror needed
AutoDL/root + data + FSFS onlyn/ayes (/etc/network_turbo, hf-mirror)
RunPodvolume disk (bills 2×)Network Volume only~5 s SIGTERM→KILLno (hf_transfer)
vast.aidisk (bills forever)nothing~0 s (abrupt)no
Lambdan/a (no stop)nothingn/a (on-demand)no
China (恒源云/矩池云/…)varies; data disk billsper-platform persistent voln/ayes
generic-SSH/Slurm/K8syou own ityou own itSlurm SIGTERM→KillWait (def 30 s)only if in China

Common gotchas (top 8 inline — full catalog in references/)

The universal ones that cost the most GPU-hours. Symptom → fix; root cause + the rest in `references/gotchas_universal.md` (run grep -i '<keyword>' references/gotchas_universal.md to jump).

  1. SSH drops on `pkill -9` (exit 255 + "Connection reset") — normal; re-ssh to verify, don't panic.
  2. tmux holds the script in memory — editing it mid-run re-executes blocks; version the filename.
  3. Disk-full crashes `torch.save` (iostream error) — pre-budget; auto-prune latest.pth, keep best.
  4. cgroup OOM with no traceback (bare Killed / exit 137) — num_workers × big-tensor; size workers vs memory.max, not CPU count.
  5. Silent sync failurecp … 2>/dev/null; echo synced lies on a full/inode-exhausted FS; gate the success line on the actual copy result.
  6. Spot preemption grace is tiny (~5 s → ~0 s on the platforms profiled here; AWS-style 2-min grace only on clouds not profiled) — a SIGTERM-flush handler is NOT a safety net; checkpoint on a timer to durable storage, load-latest unconditionally (references/spot-resilience.md).
  7. "Stop" rarely stops the meter — only terminate/destroy does, and it's irreversible (deletes the disk). Know the verb from the profile before you click, and on RunPod a stopped Pod can even restart with zero GPUs.
  8. CRLF breaks `.sh` on Linux — author on Windows → .gitattributes *.sh text eol=lf; on-box unblock sed -i 's/\r$//'.

When training itself breaks (the model, not the platform)

Platform ops is only half the job — once the box is running, training breaks in its own ways. The references/training/ layer is the debug knowledge for the run itself. Boundary: this layer owns "make it run, fast, and not crash"; `verifying-dl-experiments` owns "is the *number* real" — cross-link it for collapse / leakage / metric-validity. Every entry is symptom → root cause → fix with cited current docs.

  • references/training/oom-memory.md — CUDA/VRAM + host-RAM OOM and the fit-it ladder (grad-accum → bf16 → activation-checkpointing → expandable_segments → FSDP/ZeRO → CPU/NVMe offload → LoRA/QLoRA); OOM-at-a-specific-step (first backward / val / longest batch); the memory snapshot + visualizer.
  • references/training/distributed-launch.mdtorchrun/accelerate/deepspeed launch + env contract, DDP/FSDP/ZeRO config, and the multi-GPU HANGS toolkit (one-rank-diverged, rank-conditional collective, dataloader-length mismatch). Multi-node wire → references/multinode.md.
  • references/training/precision-stability.md — fp16/bf16/tf32 + AMP/GradScaler, NaN/Inf hunting (detect_anomaly), LLM loss spikes + divergence (warmup, clip, init, z-loss).
  • references/training/throughput-profiling.md — GPU-bound vs data-bound vs comms-bound; dataloader knobs; torch.compile traps; flash-attention; torch.profiler / Nsight.
  • references/training/checkpoint-resume.md — full-state save/resume mechanics, sharded (FSDP/DeepSpeed) checkpoints, and the resume bugs (epoch restart, data reshuffle, scaler/EMA dropped). Spot cadence → references/spot-resilience.md.
  • references/training/by-domain.md — per-domain gotchas: LLM/transformer, vision (det/seg), diffusion, RL, multimodal/VLM.
  • references/training/convergence-debugging.md — the "runs but won't learn / learns badly" layer: the overfit-one-batch smoke, params-not-updating, optimizer/LR/weight-decay/schedule config, loss-function footguns (double-softmax, BCEWithLogits, CE-target form), fine-tuning/freezing (frozen-BN drift, discriminative LR, LoRA wiring), and the training-dynamics dashboard (update:weight ratio, dead-ReLU, GradScaler-scale).
  • references/training/data-pipeline.md — dataloader/dataset correctness (not speed): the worker-RNG augmentation-duplication bug, IterableDataset worker/rank sharding, collate/__len__/pin_memory/spawn contracts, and preprocessing/label/shuffle traps (RGB-vs-BGR, ToTensor ÷255, set_epoch).

Companion skills (separate installs; REQUIRED reading where present)

These are separate Agent Skills, not bundled here — install them for the full experience. On an agent where a companion isn't installed, treat its pointer below as an optional cross-reference; this skill still works standalone.

  • `verifying-dl-experiments` — owns is-the-number-real: smoke content, retry-vs-safeguard, keepable-checkpoint, eval sizing, tracker forensics, GPU-0%-util diagnosis. This skill owns where/when/how-much-$.
  • `huggingface-skills:hf-cli` — the transport verbs (hf download --resume, hf upload-large-folder, hf cache verify); this skill owns the China-mirror swap + stall-retry (references/china-network.md).
  • `huggingface-skills:huggingface-trackio` — hosted tracker so metrics survive teardown (gotcha U20); poll trackio alerts as a structured monitor instead of brittle ssh-tail.
  • `superpowers:verification-before-completion` — the Iron Law's general form; gates every "training done / synced / teardown complete" claim.
  • `superpowers:dispatching-parallel-agents` — independence predicate + reconciliation for ablation fan-out.

Getting better over time (capture new gotchas + personalize)

This skill is static, but every run can teach it something — without corrupting it. Protocol → `references/self-improvement.md`. In short: when a run surfaces a gotcha the catalog lacks, only sediment a root-caused, reproduced, generalizable one (a one-off flake is a hypothesis, not a gotcha — principle #3); route it — user/project-specific → the host's memory system, generalizable → propose adding to references/gotchas_universal.md / the profile §7 / references/training/ (and offer an upstream PR); never silently rewrite a skill file — draft the `symptom → root cause → fix` and let the user approve. On first use, capture the user's platforms + paths + tracker entity into memory so later runs are pre-parameterized. Platform facts carry a verified <month> stamp — re-verify any teardown/billing fact against current docs before betting money or data.

Bundled resources

Load only what the current phase needs.

  • references/principles.md — the 10 invariants expanded, with the cross-platform nuance behind each.
  • references/lifecycle_checklist.md — the 6-phase runbook as a per-platform checklist.
  • references/gotchas_universal.md — universal + mixed gotchas (TOC + grep index at top).
  • references/monitoring_patterns.md — the four-layer durable-monitoring architecture + robust ssh-poll template.
  • references/ssh_transport.md — ssh config, rsync/scp resumable patterns, secrets-via-stdin, CRLF, two-SSH-flavor caveat.
  • references/china-network.md — mirrors table + HF_ENDPOINT + resumable-download ladder + the no_proxy trap (all CN platforms).
  • references/spot-resilience.md — preemption signals, Young/Daly checkpoint cadence, atomic-write resume.
  • references/parallel_ablation.md — FS-shared fan-out + the independence predicate + reconciliation.
  • references/multinode.md — (advanced) NCCL / fabric-manager / elastic-training gotchas; single-box users skip.
  • references/training/ — the DL-training debug layer (8 files: oom-memory, distributed-launch, precision-stability, throughput-profiling, checkpoint-resume, by-domain, convergence-debugging, data-pipeline) — see "When training breaks" above.
  • references/self-improvement.md — the feedback loop: capture a new gotcha (at a bar) into memory or the catalog, personalize on first run, keep platform facts fresh.
  • scripts/ — wrapper templates (run_one/run_queue), monitors (mem_monitor, gpu_health, reap_vram_zombies), the read-only patrol (health_patrol.sh.template), transfer/aggregation (download_loop, aggregate_to_fs, setup-china-mirrors), the load-and-verify checker (verify_local.py), and the verified-stamp freshness linter (check_staleness.py).
  • profiles/<platform>.md — the per-platform substrate (one per platform; _schema.md defines the 8 fields).
  • examples/autodl_sweep/ — one complete, runnable worked case end to end.

Install & Usage

1
Create the skills directory
mkdir -p .claude/skills
2
Download the skill file
mkdir -p .claude/skills && curl -o .claude/skills/remote-gpu-trainer.md https://raw.githubusercontent.com/Hanyuyuan6/remote-gpu-trainer/main/SKILL.md
3
Invoke in Claude Code
/remote-gpu-trainer

Use Cases

Deploy and monitor a multi-day training job on a rented GPU instance with automatic checkpointing and resume on preemption.
Run a batch inference job on a spot instance, ensuring results are saved before the instance is terminated.
Troubleshoot CUDA out-of-memory errors or NaN losses on a remote GPU box without losing progress.
Set up a tmux-based training daemon with SSH keep-alive to prevent disconnection during long runs.
Safely stop or terminate a rented instance to stop billing after a job completes, with data transfer off the box.
Run ablation sweeps across multiple rented instances, aggregating results and cleaning up resources.

Usage Examples

1

/remote-gpu-trainer deploy my_training_script.py --platform autodl --gpu A100 --checkpoint-every 1h

2

Monitor the GPU job on instance 12345 and alert if loss spikes or disk usage exceeds 90%.

3

Teardown the remote instance after training, rsync checkpoints to S3, and stop billing.

View source on GitHub
deployment

Security Audits

LicenseUnknownSourceWarnRepositoryPass

Frequently Asked Questions

What is remote-gpu-trainer?

This skill orchestrates long-running GPU jobs on rented remote instances across platforms like AutoDL, RunPod, and bare SSH boxes. It handles the full lifecycle—deployment, monitoring, checkpointing, and safe teardown—to prevent data loss and cost overruns on metered hardware.

How to install remote-gpu-trainer?

To install remote-gpu-trainer: create the skills directory (mkdir -p .claude/skills), then run: mkdir -p .claude/skills && curl -o .claude/skills/remote-gpu-trainer.md https://raw.githubusercontent.com/Hanyuyuan6/remote-gpu-trainer/main/SKILL.md. Finally, /remote-gpu-trainer in Claude Code.

What is remote-gpu-trainer best for?

remote-gpu-trainer is a skill categorized under General. It is designed for: deployment. Created by Hanyuyuan6.

What can I use remote-gpu-trainer for?

remote-gpu-trainer is useful for: Deploy and monitor a multi-day training job on a rented GPU instance with automatic checkpointing and resume on preemption.; Run a batch inference job on a spot instance, ensuring results are saved before the instance is terminated.; Troubleshoot CUDA out-of-memory errors or NaN losses on a remote GPU box without losing progress.; Set up a tmux-based training daemon with SSH keep-alive to prevent disconnection during long runs.; Safely stop or terminate a rented instance to stop billing after a job completes, with data transfer off the box.; Run ablation sweeps across multiple rented instances, aggregating results and cleaning up resources..