
Let Your Laptop Sleep: Automate GPU Training with Job

VESSL AI
7 min read

Still running GPU training in a notebook?

Jupyter notebooks are great for prototyping. Run a cell, see the output, iterate fast.

But if you've done this for long enough, you've probably run into these problems:

  • You closed the browser tab and your training stopped
  • The kernel crashed overnight and you lost hours of progress
  • You ran hyperparameter combinations one at a time, manually
  • You kept a GPU allocated for hours while running CPU-only preprocessing

Batch jobs fix this. You submit a script, VESSL Cloud allocates the GPU, runs the training, and releases the resources when it's done. No notebook tab required.

Workspace vs Job — when to use which

|  | Workspace | Job |
| --- | --- | --- |
| Purpose | Interactive development & debugging | Automated training & batch processing |
| Access | SSH, Jupyter, VS Code | Submit script, check logs |
| Lifecycle | Manual start/stop/delete | Auto-terminates on completion |
| Billing | Entire running duration | Actual compute time only |

In short: use a Workspace to write and debug code, use a Job to run validated code at scale.

30-second setup

Before trying the scenarios, install vesslctl and log in.

1. Install vesslctl

curl -fsSL https://api.cloud.vessl.ai/cli/install.sh | bash

2. Log in

vesslctl auth login

This opens a browser OAuth flow. Once authenticated, you're ready to submit any of the scenarios below.

Check your credits
After logging in, run vesslctl billing show to check your organization's credit balance. If the balance is zero, workspace create and job create are rejected before anything runs, so top up in VESSL Cloud first.
Find your resource spec and volume slugs
The scenarios below use <your-resource-spec-slug> and <your-volume-id> as placeholders. Look them up in your own org with vesslctl resource-spec list and vesslctl volume ls.

Five scenarios where batch jobs shine

1. Simple GPU script execution

Your training code is ready. You just need one GPU to run it.

vesslctl job create \
  --resource-spec <your-resource-spec-slug> \
  --image "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime" \
  --cmd "python train.py --epochs 50 --batch-size 128"

A100 SXM costs $1.55/hr. Resources are released automatically when the job finishes.

2. Hyperparameter sweep — 9 combinations

3 learning rates x 3 batch sizes = 9 combinations, all submitted at once.

for lr in 1e-3 3e-4 1e-4; do
  for bs in 32 64 128; do
    vesslctl job create \
      --resource-spec <your-resource-spec-slug> \
      --image "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime" \
      --cmd "python train.py --lr $lr --batch-size $bs" \
      --name "sweep-lr${lr}-bs${bs}"
  done
done

All 9 jobs run concurrently. Each gets its own GPU allocation and releases it independently when done.

Nine is just the number this example happens to use; the same loops scale to any grid (5×5×5 gives a 125-run sweep with the identical pattern). There's no hard cap on how many jobs you can submit at once; your GPU quota and cluster availability are what actually bound parallelism.
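Before firing off a large grid, it can help to dry-run the expansion. A minimal sketch that prints each would-be submission instead of executing it (swap `echo` for the real `vesslctl job create` once the grid looks right):

```shell
# Dry run: expand the sweep grid and print each submission instead of executing it.
count=0
for lr in 1e-3 3e-4 1e-4; do
  for bs in 32 64 128; do
    # Replace `echo` with the real vesslctl invocation to actually submit.
    echo "vesslctl job create --cmd 'python train.py --lr $lr --batch-size $bs' --name sweep-lr${lr}-bs${bs}"
    count=$((count + 1))
  done
done
echo "would submit $count jobs"
```

The same pattern works for any grid: add a loop, and the printed count tells you how many GPUs you're about to request.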

3. Overnight training with checkpoints on Object Storage

Long runs on big models are where upper-tier GPUs earn their price. H100 costs more per hour than A100, but FP8 support and higher memory bandwidth usually cut wall-clock time enough that total cost comes out lower — especially on overnight jobs where every slow hour adds up.

And the longer the run, the more you want a safety net. Save intermediate results to Object Storage so you can resume if anything goes wrong.

vesslctl job create \
  --resource-spec <your-resource-spec-slug> \
  --image "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime" \
  --object-volume "my-checkpoints:/mnt/checkpoints" \
  --cmd "python train.py \
    --epochs 200 \
    --checkpoint-dir /mnt/checkpoints \
    --save-every 10"

H100 SXM costs $2.39/hr. If the job fails, you can restart from the last saved checkpoint.
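A resume wrapper is the natural companion to periodic checkpointing. Here's a hypothetical sketch for the job's --cmd, assuming train.py writes checkpoints named epoch-<N>.pt and accepts a --resume flag (both assumptions, not shown in the example above); the chosen command is printed rather than executed so the sketch is safe to paste:

```shell
# Hypothetical resume logic. Assumes train.py writes checkpoints as
# epoch-<N>.pt and accepts a --resume flag (both are assumptions).
latest_checkpoint() {
  # Version sort so epoch-100.pt ranks above epoch-20.pt.
  ls "$1"/epoch-*.pt 2>/dev/null | sort -V | tail -n 1
}

ckpt=$(latest_checkpoint /mnt/checkpoints)
if [ -n "$ckpt" ]; then
  echo "python train.py --epochs 200 --checkpoint-dir /mnt/checkpoints --resume $ckpt"
else
  echo "python train.py --epochs 200 --checkpoint-dir /mnt/checkpoints"
fi
```

Wrap this in the job's --cmd and a resubmitted job picks up where the failed one stopped, instead of burning GPU hours repeating finished epochs.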

4. CPU preprocessing to GPU training pipeline

Data preprocessing doesn't need a GPU. Run it on a cheaper CPU instance, then pass the results to a GPU training job ($1.55/hr).

# Step 1: CPU preprocessing
vesslctl job create \
  --resource-spec <your-resource-spec-slug> \
  --image "python:3.11" \
  --object-volume "my-data:/mnt/data" \
  --cmd "python preprocess.py --input /mnt/data/raw --output /mnt/data/processed" \
  --name "preprocess"

# Step 2: GPU training (run after preprocessing completes)
vesslctl job create \
  --resource-spec <your-resource-spec-slug> \
  --image "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime" \
  --object-volume "my-data:/mnt/data" \
  --cmd "python train.py --data /mnt/data/processed" \
  --name "train"

With a 3-hour preprocessing step and 2 hours of training, running everything on GPU would have cost 5 hours x $1.55 = $7.75. Moving preprocessing to a cheaper CPU instance leaves only the 2 training hours on GPU ($3.10), cutting the GPU bill by more than half.
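"Run after preprocessing completes" can itself be automated with a small polling gate. A sketch that takes the status command as a parameter, since the exact text vesslctl job show prints is an assumption here (adjust the patterns to match your output):

```shell
# Block until a status-printing command reports "completed"; bail on "failed".
# Caller supplies the command, e.g.: wait_for_job "vesslctl job show preprocess"
# The "completed"/"failed" wording is an assumption about vesslctl's output.
wait_for_job() {
  while true; do
    out=$($1)
    case "$out" in
      *completed*) return 0 ;;
      *failed*)    return 1 ;;
    esac
    sleep 30   # poll every 30 seconds
  done
}
```

Then step 2 becomes a one-liner: wait_for_job "vesslctl job show preprocess" && vesslctl job create ... --name "train".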

5. Real walkthrough — fine-tuning Gemma 4 end-to-end

The Jobs tab's first screen — where every submitted job shows up.

The four scenarios above are the common shapes. Here's one we actually shipped: fine-tuning Gemma 4 E4B across five job submissions to compare a base model, a generic-trained model, and a VESSL-domain-trained model on the same infrastructure.

One shared Object Storage volume holds the script and dataset. Every job mounts it at /shared and switches the dataset via a single environment variable. The full finetune_gemma4.py script and submit.sh wrapper are in the Gemma 4 fine-tuning cookbook — clone it and you're ready to run.

# Upload once — script + dataset
vesslctl volume upload <your-volume-id> finetune_gemma4.py --remote-prefix scripts/
vesslctl volume upload <your-volume-id> vessl-cloud-qa-dataset.json --remote-prefix datasets/

# Generic-data run
vesslctl job create \
  --name gemma4-generic \
  --resource-spec <your-resource-spec-slug> \
  --image pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel \
  --object-volume <your-volume-id>:/shared \
  --env DATASET_MODE=generic \
  --cmd "pip install unsloth trl transformers datasets && python -u /shared/scripts/finetune_gemma4.py"

# VESSL-domain run (swap DATASET_MODE only)
vesslctl job create \
  --name gemma4-vessl \
  --resource-spec <your-resource-spec-slug> \
  --image pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel \
  --object-volume <your-volume-id>:/shared \
  --env DATASET_MODE=vessl \
  --cmd "pip install unsloth trl transformers datasets && python -u /shared/scripts/finetune_gemma4.py"

# Watch the loss curve live, confirm final state, see running cost
vesslctl job logs <slug> --follow
vesslctl job show <slug>
vesslctl billing show

Five runs in total. Comparing the two strongest results:

| Run | Dataset | Training time | Final loss | Cost |
| --- | --- | --- | --- | --- |
| Generic | FineTome-100k (3,000 rows) | 15m 44s | 4.06 | $0.41 |
| VESSL-domain | VESSL QA (36 rows, 20 epochs, r=32) | 22m 12s | 0.61 | $0.57 |
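The per-run costs line up with straight hourly billing for actual wall-clock time. A quick sanity check, assuming these runs used the $1.55/hr A100 SXM rate from scenario 1 (an assumption — the table doesn't name the GPU):

```shell
# cost = hourly rate × wall-clock hours. Assumes the $1.55/hr A100 SXM rate
# from scenario 1; the table above doesn't say which GPU these runs used.
cost() { awk -v m="$1" -v s="$2" 'BEGIN { printf "%.2f", 1.55 * (m + s / 60) / 60 }'; }
echo "generic run:      \$$(cost 15 44)"
echo "vessl-domain run: \$$(cost 22 12)"
```

Both come out to the table's figures, which is the point of Jobs billing: you pay for the minutes the job ran, nothing more.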

On the prompt "How do I pause a VESSL Cloud workspace to save cost?", the base model and the generic-trained model both refused: "I don't have specific documentation...". The VESSL-domain-trained model answered: "Use the Pause function on VESSL Cloud. CPU and memory usage stop immediately..." Thirty-six samples of the right domain data shifted an answer pattern that 3,000 samples of generic conversation data could not.

Because each job auto-terminates on completion or failure, stopping early doesn't rack up cost. Five runs total: $1.72. Compare that to running five hyperparameter experiments on an AWS 8-GPU bundle at $21.96/hr, where every idle half hour burns about $11.

How to Fine-Tune Gemma 4 in 15 Minutes
Walks through the same experiment on a JupyterLab Workspace — same infrastructure, different interface.

FAQ

What happens when a job fails?

The job status changes to failed. Run vesslctl job logs to check the error output. Fix the issue and resubmit. If you saved checkpoints to Object Storage, you can resume from the last one.

Can I submit a job from inside a Workspace?

Yes. Run vesslctl job create from the Workspace terminal. A common workflow is: develop and debug in a Workspace, then submit as a Job once the code is validated.

How do I pass data between jobs?

Use Object Storage. Mount the same Object Storage volume in multiple jobs — one job writes output, the next reads it as input. See the CPU-to-GPU pipeline example above.

Keep reading

vesslctl: Manage VESSL Cloud from Your Terminal
Let AI coding tools drive VESSL Cloud for you. One-line MCP install so Claude, Codex, and Gemini can provision GPUs on your behalf via vesslctl.
How to Fine-Tune Gemma 4 in 15 Minutes
Fine-tune Google Gemma 4 E4B on a single A100 with Unsloth in 15 minutes — from Object Storage setup to evaluation, end-to-end.