Cloud GPU Training

January 9, 2026 · View on GitHub

This guide covers using cloud GPUs for training OpenAdapt-ML models. Two providers are supported:

Provider	Setup Time	Cost	Best For
Lambda Labs	~1 minute	$0.60-2.00/hr	Quick experiments, no approval needed
Azure	Hours-days	$0.90-3.00/hr	Free credits ($200), enterprise use

Lambda Labs (Recommended)

Lambda Labs provides affordable GPU instances with no quota approval process.

Setup

Create account at cloud.lambdalabs.com
Get API key at cloud.lambdalabs.com/api-keys
Set environment variable:
```
export LAMBDA_API_KEY=your_key_here
```

Usage

# List available GPU instances and prices
uv run python -m openadapt_ml.cloud.lambda_labs list

# Launch an A100 instance (~\$1.10/hr)
uv run python -m openadapt_ml.cloud.lambda_labs launch --type gpu_1x_a100

# Check running instances
uv run python -m openadapt_ml.cloud.lambda_labs status

# Get SSH command for your instance
uv run python -m openadapt_ml.cloud.lambda_labs ssh

# IMPORTANT: Terminate when done (you're billed by the hour!)
uv run python -m openadapt_ml.cloud.lambda_labs terminate <instance_id>

Instance Types

Type	GPU	VRAM	Price/hr	Use Case
`gpu_1x_a10`	1x A10	24GB	~$0.60	2B-8B models, fine-tuning
`gpu_1x_a100`	1x A100	40GB	~$1.10	Best value, larger models
`gpu_1x_h100`	1x H100	80GB	~$2.00	Fastest, very large models

Training on Lambda Labs

Once your instance is running:

# SSH into the instance
ssh ubuntu@<instance_ip>

# Clone and set up the repo
git clone https://github.com/OpenAdaptAI/openadapt-ml.git
cd openadapt-ml
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

# Run training
uv run python -m openadapt_ml.scripts.train \
  --config configs/qwen3vl_capture.yaml \
  --capture /path/to/capture

Direct TRL Trainer Usage

For more control, use the TRL trainer directly:

CLI:

# Train from Parquet (recommended for cloud)
python -m openadapt_ml.training.trl_trainer \
  --parquet /path/to/episodes.parquet \
  --output checkpoints/my_model \
  --model unsloth/Qwen2.5-VL-7B-Instruct \
  --epochs 3

# With Set-of-Marks DSL
python -m openadapt_ml.training.trl_trainer \
  --parquet /path/to/episodes.parquet \
  --output checkpoints/my_model \
  --use-som

Python API:

from openadapt_ml.training.trl_trainer import (
    train_with_trl,
    train_from_parquet,
    TRLTrainingConfig,
)

# Configure training
config = TRLTrainingConfig(
    model_name="unsloth/Qwen2.5-VL-7B-Instruct",
    output_dir="checkpoints/my_model",
    num_epochs=3,
    batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    # LoRA settings
    lora_r=16,
    lora_alpha=32,
)

# Train from Parquet (simpler for cloud workflows)
checkpoint = train_from_parquet(
    parquet_path="/path/to/episodes.parquet",
    config=config,
    use_som=False,
)

# Or train from Episode objects
from openadapt_ml.ingest import load_episodes
episodes = load_episodes("/path/to/workflow_exports/")
checkpoint = train_with_trl(episodes=episodes, config=config)

Key benefits of TRL + Unsloth:

2x training speed, 50% less VRAM
SFTTrainer for production-grade training
Automatic fallback to standard transformers if Unsloth unavailable

Azure

Azure offers $200 free credits for new accounts, but requires GPU quota approval.

Setup

# Full automated setup (includes GPU quota request)
python scripts/setup_azure.py

# Check GPU quota status (may take hours to approve)
az vm list-usage --location eastus -o table | grep -i nc

The setup script:

Creates resource group and ML workspace
Sets up service principal
Creates container registry
Automatically requests GPU quota (NCv3/V100)

GPU Quota

Azure requires quota approval for GPU VMs. Small requests (6-8 vCPUs) sometimes auto-approve within minutes. If not approved automatically:

Go to Azure Quota Portal
Select your subscription
Search for "Standard NCSv3 Family"
Request increase to 6+ vCPUs

Cost Comparison

VM Size	GPU	Price/hr	Notes
NC6s_v3	1x V100	~$0.90	Good for training
NC12s_v3	2x V100	~$1.80	Faster training
ND96asr_v4	8x A100	~$27.00	Large scale only

Recommendations

For quick experiments: Use Lambda Labs - no approval wait, simple pricing
For sustained training: Azure with free credits if you can wait for quota
For production: Either works, Lambda is simpler, Azure has more enterprise features

Troubleshooting

Lambda Labs: "No regions available"

GPU capacity fluctuates. Try:

Different instance type
Wait and retry
Check status page

Azure: Quota not approved

Small requests (6 vCPUs) usually auto-approve
Larger requests need business justification
Try a different region (westus2, eastus2)