Let Your Laptop Sleep: Automate GPU Training with VESSL Batch Jobs


Still running GPU training in a notebook?
Jupyter notebooks are great for prototyping. Run a cell, see the output, iterate fast.
But if you've done this for long enough, you've probably run into these problems:
- You closed the browser tab and your training stopped
- The kernel crashed overnight and you lost hours of progress
- You ran hyperparameter combinations one at a time, manually
- You kept a GPU allocated for hours while running CPU-only preprocessing
Batch jobs fix this. You submit a script, VESSL Cloud allocates the GPU, runs the training, and releases the resources when it's done. No notebook tab required.
Workspace vs Job — when to use which
| | Workspace | Job |
|---|---|---|
| Purpose | Interactive development & debugging | Automated training & batch processing |
| Access | SSH, Jupyter, VS Code | Submit script, check logs |
| Lifecycle | Manual start/stop/delete | Auto-terminates on completion |
| Billing | Entire running duration | Actual compute time only |
In short: use a Workspace to write and debug code, use a Job to run validated code at scale.
30-second setup
Before trying the scenarios, install vesslctl and log in.
1. Install vesslctl
```shell
curl -fsSL https://api.cloud.vessl.ai/cli/install.sh | bash
```

2. Log in

```shell
vesslctl auth login
```

A browser OAuth flow opens. After that, you're ready to submit any of the scenarios below.
Check your credits
After logging in, run `vesslctl billing show` to check your organization's credit balance. If it's at zero, `workspace create` and `job create` are blocked before they run. You can top up from VESSL Cloud first.
Find your resource spec and volume slugs
The scenarios below use `<your-resource-spec-slug>` and `<your-volume-id>` as placeholders. Look them up in your own org with `vesslctl resource-spec list` and `vesslctl volume ls`.
Five scenarios where batch jobs shine
1. Simple GPU script execution
Your training code is ready. You just need one GPU to run it.
```shell
vesslctl job create \
  --resource-spec <your-resource-spec-slug> \
  --image "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime" \
  --cmd "python train.py --epochs 50 --batch-size 128"
```

A100 SXM costs $1.55/hr. Resources are released automatically when the job finishes.
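The script name and flags above are whatever your own code exposes. As a minimal sketch, a `train.py` that accepts the flags used in that command might parse them like this (the flag names mirror the job command; the defaults are illustrative):

```python
import argparse

def parse_args(argv=None):
    # Sketch of the CLI surface a train.py like the one above might expose.
    # Flag names match the job's --cmd string; defaults are made up for illustration.
    p = argparse.ArgumentParser(description="training entry point")
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--batch-size", type=int, default=64)
    p.add_argument("--lr", type=float, default=3e-4)
    return p.parse_args(argv)

# Same invocation the job's --cmd string would produce:
args = parse_args(["--epochs", "50", "--batch-size", "128"])
```

Anything the job's `--cmd` string passes has to line up with this parser; a typo in either place fails fast at startup rather than hours in.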
2. Hyperparameter sweep — 9 combinations
3 learning rates x 3 batch sizes = 9 combinations, all submitted at once.
```shell
for lr in 1e-3 3e-4 1e-4; do
  for bs in 32 64 128; do
    vesslctl job create \
      --resource-spec <your-resource-spec-slug> \
      --image "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime" \
      --cmd "python train.py --lr $lr --batch-size $bs" \
      --name "sweep-lr${lr}-bs${bs}"
  done
done
```

All 9 jobs run concurrently. Each gets its own GPU allocation and releases it independently when done.
Nine is just the number this example happens to use; the loops scale to whatever range you want (5×5×5 = 125-run sweep, same pattern). There's no hard cap on how many jobs you can submit at once; your GPU quota and cluster availability are what actually bound parallelism.
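If you'd rather drive the sweep from Python than nested shell loops, the same grid falls out of `itertools.product`. A sketch (the command string and naming scheme copy the shell version above):

```python
import itertools

def sweep_commands(lrs, batch_sizes):
    # One vesslctl invocation per (lr, batch size) pair -- the same
    # Cartesian grid the nested shell loops produce.
    cmds = []
    for lr, bs in itertools.product(lrs, batch_sizes):
        cmds.append(
            f"vesslctl job create "
            f"--cmd 'python train.py --lr {lr} --batch-size {bs}' "
            f"--name sweep-lr{lr}-bs{bs}"
        )
    return cmds

commands = sweep_commands(["1e-3", "3e-4", "1e-4"], [32, 64, 128])
print(len(commands))  # 3 x 3 = 9 jobs
```

Widening any list widens the grid automatically, which is handy once the sweep grows past what's comfortable to type in a shell one-liner.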
3. Overnight training with checkpoints on Object Storage
Long runs on big models are where upper-tier GPUs earn their price. H100 costs more per hour than A100, but FP8 support and higher memory bandwidth usually cut wall-clock time enough that total cost comes out lower — especially on overnight jobs where every slow hour adds up.
And the longer the run, the more you want a safety net. Save intermediate results to Object Storage so you can resume if anything goes wrong.
```shell
vesslctl job create \
  --resource-spec <your-resource-spec-slug> \
  --image "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime" \
  --object-volume "my-checkpoints:/mnt/checkpoints" \
  --cmd "python train.py \
    --epochs 200 \
    --checkpoint-dir /mnt/checkpoints \
    --save-every 10"
```

H100 SXM costs $2.39/hr. If the job fails, you can restart from the last saved checkpoint.
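The resume logic inside the training script is what makes the mounted volume pay off. A minimal stdlib-only sketch of the save/resume pattern (in a real run the state would be model and optimizer tensors saved with something like `torch.save`; the file naming here is illustrative):

```python
import json
import os
import re

def save_checkpoint(ckpt_dir, epoch, state):
    # Write one file per checkpoint to the mounted volume, e.g. /mnt/checkpoints.
    # Zero-padded epoch numbers keep lexical sort == numeric sort.
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"ckpt-{epoch:04d}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    return path

def latest_checkpoint(ckpt_dir):
    # On restart, find the highest-numbered checkpoint and resume from it;
    # return None on a fresh run so training starts from epoch 0.
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(n for n in os.listdir(ckpt_dir)
                   if re.fullmatch(r"ckpt-\d{4}\.json", n))
    if not names:
        return None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        return json.load(f)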
4. CPU preprocessing to GPU training pipeline
Data preprocessing doesn't need a GPU. Run it on a cheaper CPU instance, then pass the results to a GPU training job ($1.55/hr).
```shell
# Step 1: CPU preprocessing (use a CPU resource spec here)
vesslctl job create \
  --resource-spec <your-resource-spec-slug> \
  --image "python:3.11" \
  --object-volume "my-data:/mnt/data" \
  --cmd "python preprocess.py --input /mnt/data/raw --output /mnt/data/processed" \
  --name "preprocess"

# Step 2: GPU training (run after preprocessing completes)
vesslctl job create \
  --resource-spec <your-resource-spec-slug> \
  --image "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime" \
  --object-volume "my-data:/mnt/data" \
  --cmd "python train.py --data /mnt/data/processed" \
  --name "train"
```

Running everything on GPU would have cost 5 hours x $1.55 = $7.75. Moving preprocessing to a cheaper CPU instance means the GPU only carries the 2 training hours ($3.10), cutting total cost by more than half.
5. Real walkthrough — fine-tuning Gemma 4 end-to-end

The four above are the shapes. Here's one we actually shipped: fine-tuning Gemma 4 E4B across five job submissions to compare a base model, a generic-trained model, and a VESSL-domain-trained model on the same infrastructure.
One shared Object Storage volume holds the script and dataset. Every job mounts it at `/shared` and switches the dataset via a single environment variable. The full `finetune_gemma4.py` script and `submit.sh` wrapper are in the Gemma 4 fine-tuning cookbook — clone it and you're ready to run.
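The env-var switch inside the training script can be as small as a lookup table. A sketch of how `DATASET_MODE` might select the dataset (the `generic` path and the `DATASETS` table are assumptions for illustration; the real paths live in the cookbook's `finetune_gemma4.py`):

```python
import os

# Hypothetical dataset table; only vessl-cloud-qa-dataset.json appears in the
# upload commands below, the generic path is a placeholder.
DATASETS = {
    "generic": "/shared/datasets/finetome-subset.json",
    "vessl": "/shared/datasets/vessl-cloud-qa-dataset.json",
}

def pick_dataset(env=os.environ):
    # One env var selects the dataset, so every job submission reuses the
    # exact same script and only --env DATASET_MODE=... changes.
    mode = env.get("DATASET_MODE", "generic")
    if mode not in DATASETS:
        raise ValueError(f"unknown DATASET_MODE: {mode!r}")
    return DATASETS[mode]
```

Failing loudly on an unknown mode matters in batch jobs: there's no notebook to notice a silent fallback, only logs after the fact.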
```shell
# Upload once — script + dataset
vesslctl volume upload <your-volume-id> finetune_gemma4.py --remote-prefix scripts/
vesslctl volume upload <your-volume-id> vessl-cloud-qa-dataset.json --remote-prefix datasets/

# Generic-data run
vesslctl job create \
  --name gemma4-generic \
  --resource-spec <your-resource-spec-slug> \
  --image pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel \
  --object-volume <your-volume-id>:/shared \
  --env DATASET_MODE=generic \
  --cmd "pip install unsloth trl transformers datasets && python -u /shared/scripts/finetune_gemma4.py"

# VESSL-domain run (swap DATASET_MODE only)
vesslctl job create \
  --name gemma4-vessl \
  --resource-spec <your-resource-spec-slug> \
  --image pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel \
  --object-volume <your-volume-id>:/shared \
  --env DATASET_MODE=vessl \
  --cmd "pip install unsloth trl transformers datasets && python -u /shared/scripts/finetune_gemma4.py"

# Watch the loss curve live, confirm final state, see running cost
vesslctl job logs <slug> --follow
vesslctl job show <slug>
vesslctl billing show
```

Five runs in total. Comparing the two strongest results:
| Run | Dataset | Training time | Final loss | Cost |
|---|---|---|---|---|
| Generic | FineTome-100k (3,000 rows) | 15m 44s | 4.06 | $0.41 |
| VESSL-domain | VESSL QA (36 rows, 20 epochs, r=32) | 22m 12s | 0.61 | $0.57 |
On the prompt "How do I pause a VESSL Cloud workspace to save cost?", the base model and the generic-trained model both refused: "I don't have specific documentation...". The VESSL-domain-trained model answered: "Use the Pause function on VESSL Cloud. CPU and memory usage stop immediately..." Thirty-six samples of the right domain data shifted an answer pattern that 3,000 samples of generic conversation data could not.
Because each job auto-terminates on completion or failure, stopping early doesn't rack up cost. Five runs total: $1.72. Compare that to running five hyperparameter experiments on an AWS 8-GPU bundle at $21.96/hr, where every idle half hour burns roughly $11.
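The per-run costs in the table are just runtime times the hourly rate. A quick sanity check, assuming both runs billed at the A100 SXM rate of $1.55/hr quoted earlier:

```python
def run_cost(duration_s, rate_per_hour):
    # Cost of a job that auto-terminates: billed for actual runtime only,
    # rounded to whole cents.
    return round(duration_s / 3600 * rate_per_hour, 2)

generic = run_cost(15 * 60 + 44, 1.55)  # 15m 44s -> 0.41
vessl = run_cost(22 * 60 + 12, 1.55)    # 22m 12s -> 0.57
print(generic, vessl)
```

Both match the table, which is the point of per-second billing: a 22-minute experiment costs 22 minutes, not an hour-rounded block.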

FAQ
What happens when a job fails?
The job status changes to `failed`. Run `vesslctl job logs` to check the error output. Fix the issue and resubmit. If you saved checkpoints to Object Storage, you can resume from the last one.
Can I submit a job from inside a Workspace?
Yes. Run `vesslctl job create` from the Workspace terminal. A common workflow is: develop and debug in a Workspace, then submit as a Job once the code is validated.
How do I pass data between jobs?
Use Object Storage. Mount the same Object Storage volume in multiple jobs — one job writes output, the next reads it as input. See the CPU-to-GPU pipeline example above.
References
- VESSL Cloud Job Documentation
- vesslctl CLI Installation Guide
- GPU Pricing — A100 SXM $1.55/hr, H100 SXM $2.39/hr, L40S $1.80/hr