The Easiest Way to Fine‑Tune OpenAI GPT‑OSS with LoRA on VESSL

This guide walks you through fine‑tuning OpenAI GPT‑OSS models (20B, 120B) with Low-Rank Adaptation of Large Language Models (LoRA) on the VESSL platform and then taking the result all the way through model registry upload to serving.
What is GPT‑OSS?
GPT‑OSS is an open-weight model family released by OpenAI on August 5, 2025. It adopts the Harmony response format and MXFP4 (4-bit) quantization, allowing even large models to run on modest hardware—a single H100 80 GB for 120B and ~16 GB for 20B. Unlike proprietary models, GPT-OSS is freely downloadable, usable, and modifiable. It is licensed under Apache‑2.0.
1) Mixture‑of‑Experts (MoE) architecture

- 120B: 4 of 128 experts are active per token
- 20B: 4 of 32 experts are active per token (no shared expert).
- Although MoE weights account for about 90% of total parameters, the sparse structure keeps inference efficient.
- This sparsity is a key reason GPT‑OSS achieves fast inference relative to size.
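To make the sparsity concrete, here is a minimal sketch of top-k expert routing in plain Python. It assumes one common formulation (pick the k highest router scores, then softmax over only those); the function name and the logits are illustrative and are not taken from the GPT-OSS code.

```python
import math


def route_top_k(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exp_scores = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exp_scores)
    return [(i, s / total) for i, s in zip(top, exp_scores)]


# 32 experts, as in the 20B model; the logits are made-up numbers.
logits = [math.sin(i) for i in range(32)]
selected = route_top_k(logits, k=4)

# Only 4 of 32 experts run for this token.
active_fraction = len(selected) / len(logits)
```

Each token's output would then be the weighted sum of the selected experts' outputs, so the remaining experts cost no compute for that token.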
2) Native Microscaling FP4 (MXFP4)

- Based on the Open Compute Project (OCP) Microscaling Formats v1.0 FP4 spec with block‑wise scaling (size 32).
- FP4 is paired with stochastic rounding and the Random Hadamard Transform (RHT), allowing 4-bit precision to be applied to large MoE weights while preserving accuracy. It can be used not only for inference but also directly during training. (Open Compute Project, Hugging Face)
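The block-wise scaling idea can be sketched as follows. This is a simplified illustration, assuming FP4 E2M1 element values and one shared power-of-two scale per block of 32; it uses round-to-nearest rather than stochastic rounding, skips the Random Hadamard Transform, and does not pack bits. The function names are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # block size stated above
E2M1_EMAX = 2     # largest E2M1 exponent: 6.0 = 1.5 * 2**2

# 32 four-bit elements plus one 8-bit shared scale per block.
BITS_PER_ELEMENT = (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE  # 4.25


def quantize_block(block):
    """Return (shared_exponent, codes) for one block of floats."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_GRID[-1])  # clamp to the largest FP4 value
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(nearest, x))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Scale the FP4 codes back to real values."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


block = [i / 8 for i in range(-16, 16)]  # 32 made-up weights in [-2.0, 1.875]
shared_exp, codes = quantize_block(block)
restored = dequantize_block(shared_exp, codes)
```

Because the scale is shared, one outlier in a block coarsens the grid for the other 31 values, which is why small blocks matter for accuracy.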
3) Harmony format (chat/reasoning/tool use)

- GPT‑OSS is post‑trained with the Harmony format, and using it with the o200k_harmony tokenizer is recommended.
- Multi‑channel output: supports CoT (Chain‑of‑Thought), tool calls, and standard responses, with a clear instruction hierarchy and namespace for tools.
- Roles: system / developer / user / assistant / tool
- Channels: analysis / final
- Designed to produce efficient reasoning outputs and structured function calls (tool calls).
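As a rough illustration of how roles and channels appear in the raw text, the sketch below renders a short conversation with the Harmony special tokens. It is deliberately simplified: `render_message` is a hypothetical helper, tool-call recipients and end-of-generation tokens are omitted, and in practice the model's chat template or a dedicated renderer should produce this text.

```python
def render_message(role, content, channel=None):
    """Render one message with the Harmony special tokens."""
    # Assistant messages carry a channel; other roles usually do not.
    header = role if channel is None else f"{role}<|channel|>{channel}"
    return f"<|start|>{header}<|message|>{content}<|end|>"


conversation = "".join([
    render_message("system", "Reasoning: low"),
    render_message("user", "What is 2 + 2?"),
    render_message("assistant", "Simple addition.", channel="analysis"),
    render_message("assistant", "2 + 2 = 4.", channel="final"),
])
```

The analysis channel holds the chain of thought, while the final channel holds the answer meant for the end user.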
Why fine‑tune GPT‑OSS on VESSL?
- Ready-to-use training environment: VESSL provides container images with Torch, CUDA, and Triton kernels tailored for GPT-OSS training, so you can start a run immediately.
- Optimized hardware: H100 80 GB supports both 20B and 120B. While 120B fits on a single GPU, Tensor Parallel (TP) is recommended for throughput.
- Integrated ML/LLMOps: End‑to‑end workflow with real‑time training metrics, automatic checkpointing/model saving, a Model Registry, and one‑click deployment.
Step‑by‑Step Guide
1. Create a VESSL account & project
Start at vessl.ai, create an account, and create a new project from the dashboard. In this guide, we’ll use gpt-oss-finetuning.

2. Set up the VESSL CLI
Install and configure the VESSL CLI.
# Install VESSL CLI (skip if already installed)
pip install vessl
# Configure VESSL
vessl configure --organization YOUR_ORG_NAME --project gpt-oss-finetuning
3. Clone the example repository
VESSL’s examples repo includes code and recipes to fine‑tune GPT‑OSS. Clone it and move into the fine‑tuning directory.
git clone https://github.com/vessl-ai/examples.git
cd examples/runs/finetune-llms
4. Launch fine‑tuning
Inside finetune-llms you’ll find:
- Training scripts: main.py, model.py, dataset.py, and so on, optimized for efficient fine‑tuning.
- VESSL Run template: run_yamls/run_lora_gpt_oss.yaml, a ready‑to‑run configuration.
Open run_lora_gpt_oss.yaml and review the key settings.
- Model & dataset:
env:
  MODEL_NAME: openai/gpt-oss-20b # or openai/gpt-oss-120b
  DATASET_NAME: HuggingFaceH4/Multilingual-Thinking
  REPOSITORY_NAME: gpt-oss-20b-multilingual-reasoner
  - Uses the HuggingFaceH4/Multilingual‑Thinking dataset by default; feel free to swap in another dataset or your own.
  - Trained artifacts are saved as a VESSL Model named gpt-oss-20b-multilingual-reasoner in the Model Registry.
- Resources:
resources:
  cluster: vessl-eu-h100-80g-sxm
  preset: gpu-h100-80g-small
image: quay.io/vessl-ai/torch:2.8.0-cuda12.8
  - gpu-h100-80g-small uses 1× H100 80 GB. For large sequences or higher throughput on gpt‑oss‑120b, use multi‑GPU/TP.
  - Container includes Torch 2.8.0 + CUDA 12.8 with GPT‑OSS support.
- Training hyperparameters
  - lora_r: 32 (LoRA rank for parameter efficiency)
  - lora_alpha: 64 (LoRA scaling factor)
  - lora_target_modules: all-linear (include all linear layers, MoE experts included)
- Optimization
  - lr_scheduler_type: cosine_with_min_lr (cosine schedule with a floor)
  - warmup_ratio: 0.03 (3% warmup)
- Memory optimization
  - load_in_4bit: True (memory‑efficient 4‑bit loading)
  - gradient_checkpointing: True (trade compute for memory)
  - per_device_train_batch_size: 4
  - gradient_accumulation_steps: 4 (effective batch size 16)
  - bf16: True (uses bfloat16, required by GPT‑OSS)
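The arithmetic behind these settings can be sketched in a few lines. The batch size, accumulation steps, LoRA values, and warmup ratio come from the list above; `peak_lr`, `min_lr_rate`, and the `lr_at` helper are assumptions made for illustration and may differ from what the training script actually uses.

```python
import math

per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 1  # assumption: the single-GPU preset used in this guide
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)  # 16

lora_r, lora_alpha = 32, 64
lora_scaling = lora_alpha / lora_r  # 2.0, the multiplier on the low-rank update


def lr_at(step, total_steps, peak_lr=2e-4, min_lr_rate=0.1, warmup_ratio=0.03):
    """Linear warmup, then cosine decay down to min_lr_rate * peak_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # The floor keeps the learning rate from decaying all the way to zero.
    return peak_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


schedule = [lr_at(step, total_steps=1000) for step in range(1001)]
```

With these assumed values the rate climbs from zero during the first 3% of steps, peaks, and then settles at one tenth of the peak by the final step.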
Create a VESSL Run with the config:
vessl run create -f run_yamls/run_lora_gpt_oss.yaml
5. Monitor training

Once the Run is created, the console log will print a link to the dashboard where you can inspect details, logs, and metrics in real time.
Image pulls and model downloads can delay the start. Seeing Pulling image "..." in the log is expected.


OOM (Out‑of‑Memory) Troubleshooting
Try the following, one at a time:
1. Decrease per_device_train_batch_size to 2 or 1
2. Increase gradient_accumulation_steps accordingly
3. Reduce lora_r from 32 to 16
4. Lower max_length from 2048 to 1024
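One way to apply these steps in order is sketched below. The `oom_fallbacks` helper is hypothetical and not part of the example repository; it folds steps 1 and 2 together so the effective batch size stays constant, and it leaves lora_alpha untouched when lowering lora_r.

```python
def oom_fallbacks(config):
    """Yield progressively lighter configs, changing one knob at a time."""
    current = dict(config)
    effective = (current["per_device_train_batch_size"]
                 * current["gradient_accumulation_steps"])
    # Steps 1 and 2: shrink the per-device batch, grow accumulation to match.
    for batch in (2, 1):
        if batch < current["per_device_train_batch_size"]:
            current = {**current,
                       "per_device_train_batch_size": batch,
                       "gradient_accumulation_steps": effective // batch}
            yield current
    # Step 3: fewer trainable adapter parameters.
    if current["lora_r"] > 16:
        current = {**current, "lora_r": 16}
        yield current
    # Step 4: shorter sequences reduce activation memory.
    if current["max_length"] > 1024:
        current = {**current, "max_length": 1024}
        yield current


baseline = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "lora_r": 32,
    "max_length": 2048,
}
candidates = list(oom_fallbacks(baseline))
```

Trying the candidates in order and stopping at the first one that trains without OOM keeps the change to the original recipe as small as possible.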
6. Verify the upload
When training completes, the LoRA adapter is automatically uploaded to your VESSL Model.

You can inspect each version to see the actual adapter files.

Each model version includes:
- LoRA adapter weights (adapter_model.safetensors)
- Config (adapter_config.json)
- README.md
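If you download a model version locally, a quick check like the one below can confirm the expected files are present. This is a minimal sketch: the helper name and the directory you pass in are hypothetical, and only the three file names come from the list above.

```python
from pathlib import Path

EXPECTED_FILES = ("adapter_model.safetensors", "adapter_config.json", "README.md")


def missing_adapter_files(directory):
    """Return the expected adapter files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_adapter_files` on a complete download returns an empty list; any names it returns point to files that did not upload or download correctly.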
7. Serve your fine‑tuned adapter: Direct adapter serving vs. merge
As of August 2025, most inference frameworks (for example, vLLM) do not serve GPT‑OSS LoRA adapters directly. For inference, merge the adapter into the base model and serve the merged weights.
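Merging means folding the low-rank update into each adapted weight matrix, following the LoRA formulation W' = W + (alpha / r) · B · A. The toy sketch below shows this for a single 2×2 layer in plain Python; the matrices are made-up numbers and the helper names are illustrative, not what the merge Run executes.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight for one linear layer."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]    # shape (r, in_features)
B = [[1.0], [2.0]]   # shape (out_features, r)
merged = merge_lora(W, A, B, alpha=2, r=1)  # [[2.0, -1.0], [2.0, -1.0]]
```

After merging, the layer is an ordinary dense weight again, so an inference server can load it without any adapter support.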
Launch the merge and an inference server for the merged model:
# Modify {YOUR_ORGANIZATION} in the YAML to your actual organization name
vessl run create -f run_yamls/run_lora_gpt_oss_merge.yaml
When the server is up, open the Connect dropdown and click API to access the endpoint.
Use the Python snippet below to test streaming with your fine‑tuned API:
#!/usr/bin/env python3
"""
Simple streaming test script for GPT-OSS API
"""
import openai
from datetime import datetime

# Configure client for your GPT-OSS server
client = openai.OpenAI(
    base_url="https://{YOUR_API_ENDPOINT}/v1",
    api_key="dummy"  # Not needed for our server
)

# OpenAI Harmony format system prompt
current_date = datetime.now().strftime("%Y-%m-%d")
system_prompt = f"""
<|start|>system<|message|>You are VESSL-GPT, a large language model fine-tuned on VESSL.
Knowledge cutoff: 2024-06
Current date: {current_date}
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
"""


def test_streaming():
    print("🚀 Testing GPT-OSS Streaming...")
    print("=" * 50)
    try:
        stream = client.chat.completions.create(
            model="gpt-oss-20b",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Write a haiku about artificial intelligence"}
            ],
            max_tokens=1024,
            temperature=0.7,
            stream=True
        )
        print("🤖 GPT-OSS: ", end="", flush=True)
        for chunk in stream:
            # Skip chunks that carry no choices or an empty delta
            if chunk.choices and chunk.choices[0].delta.content:
                content = chunk.choices[0].delta.content
                print(content, end="", flush=True)
        print("\n" + "=" * 50)
        print("✅ Streaming test completed!")
    except Exception as e:
        print(f"❌ Error: {e}")


if __name__ == "__main__":
    test_streaming()
References
- OpenAI
- OCP MX v1.0
- Hugging Face
- vLLM
- Miscellaneous references
  - https://www.codecademy.com/article/gpt-oss-run-locally
  - https://huggingface.co/blog/RakshitAralimatti/learn-ai-with-me
  - https://arxiv.org/abs/2106.09685
VESSL is an integrated ML/LLMOps platform for operating GPT‑OSS workloads in enterprise environments. With the Model Registry, you can systematically manage fine‑tuned artifacts and deploy services quickly.
VESSL AI