Cloud GPU Training

January 9, 2026 ยท View on GitHub

This guide covers using cloud GPUs for training OpenAdapt-ML models. Two providers are supported:

ProviderSetup TimeCostBest For
Lambda Labs~1 minute$0.60-2.00/hrQuick experiments, no approval needed
AzureHours-days$0.90-3.00/hrFree credits ($200), enterprise use

Lambda Labs provides affordable GPU instances with no quota approval process.

Setup

  1. Create account at cloud.lambdalabs.com
  2. Get API key at cloud.lambdalabs.com/api-keys
  3. Set environment variable:
    export LAMBDA_API_KEY=your_key_here
    

Usage

# List available GPU instances and prices
uv run python -m openadapt_ml.cloud.lambda_labs list

# Launch an A100 instance (~\$1.10/hr)
uv run python -m openadapt_ml.cloud.lambda_labs launch --type gpu_1x_a100

# Check running instances
uv run python -m openadapt_ml.cloud.lambda_labs status

# Get SSH command for your instance
uv run python -m openadapt_ml.cloud.lambda_labs ssh

# IMPORTANT: Terminate when done (you're billed by the hour!)
uv run python -m openadapt_ml.cloud.lambda_labs terminate <instance_id>

Instance Types

TypeGPUVRAMPrice/hrUse Case
gpu_1x_a101x A1024GB~$0.602B-8B models, fine-tuning
gpu_1x_a1001x A10040GB~$1.10Best value, larger models
gpu_1x_h1001x H10080GB~$2.00Fastest, very large models

Training on Lambda Labs

Once your instance is running:

# SSH into the instance
ssh ubuntu@<instance_ip>

# Clone and set up the repo
git clone https://github.com/OpenAdaptAI/openadapt-ml.git
cd openadapt-ml
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync

# Run training
uv run python -m openadapt_ml.scripts.train \
  --config configs/qwen3vl_capture.yaml \
  --capture /path/to/capture

Direct TRL Trainer Usage

For more control, use the TRL trainer directly:

CLI:

# Train from Parquet (recommended for cloud)
python -m openadapt_ml.training.trl_trainer \
  --parquet /path/to/episodes.parquet \
  --output checkpoints/my_model \
  --model unsloth/Qwen2.5-VL-7B-Instruct \
  --epochs 3

# With Set-of-Marks DSL
python -m openadapt_ml.training.trl_trainer \
  --parquet /path/to/episodes.parquet \
  --output checkpoints/my_model \
  --use-som

Python API:

from openadapt_ml.training.trl_trainer import (
    train_with_trl,
    train_from_parquet,
    TRLTrainingConfig,
)

# Configure training
config = TRLTrainingConfig(
    model_name="unsloth/Qwen2.5-VL-7B-Instruct",
    output_dir="checkpoints/my_model",
    num_epochs=3,
    batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    # LoRA settings
    lora_r=16,
    lora_alpha=32,
)

# Train from Parquet (simpler for cloud workflows)
checkpoint = train_from_parquet(
    parquet_path="/path/to/episodes.parquet",
    config=config,
    use_som=False,
)

# Or train from Episode objects
from openadapt_ml.ingest import load_episodes
episodes = load_episodes("/path/to/workflow_exports/")
checkpoint = train_with_trl(episodes=episodes, config=config)

Key benefits of TRL + Unsloth:

  • 2x training speed, 50% less VRAM
  • SFTTrainer for production-grade training
  • Automatic fallback to standard transformers if Unsloth unavailable

Azure

Azure offers $200 free credits for new accounts, but requires GPU quota approval.

Setup

# Full automated setup (includes GPU quota request)
python scripts/setup_azure.py

# Check GPU quota status (may take hours to approve)
az vm list-usage --location eastus -o table | grep -i nc

The setup script:

  1. Creates resource group and ML workspace
  2. Sets up service principal
  3. Creates container registry
  4. Automatically requests GPU quota (NCv3/V100)

GPU Quota

Azure requires quota approval for GPU VMs. Small requests (6-8 vCPUs) sometimes auto-approve within minutes. If not approved automatically:

  1. Go to Azure Quota Portal
  2. Select your subscription
  3. Search for "Standard NCSv3 Family"
  4. Request increase to 6+ vCPUs

Cost Comparison

VM SizeGPUPrice/hrNotes
NC6s_v31x V100~$0.90Good for training
NC12s_v32x V100~$1.80Faster training
ND96asr_v48x A100~$27.00Large scale only

Recommendations

  1. For quick experiments: Use Lambda Labs - no approval wait, simple pricing
  2. For sustained training: Azure with free credits if you can wait for quota
  3. For production: Either works, Lambda is simpler, Azure has more enterprise features

Troubleshooting

Lambda Labs: "No regions available"

GPU capacity fluctuates. Try:

  • Different instance type
  • Wait and retry
  • Check status page

Azure: Quota not approved

  • Small requests (6 vCPUs) usually auto-approve
  • Larger requests need business justification
  • Try a different region (westus2, eastus2)