Cloud GPU Setup Guide

November 2, 2025

This guide shows how to set up GPU testing using cloud services.

Quick Start

Option 1: AWS EC2 with GPU

Launch GPU Instance:

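# Using AWS CLI (AMI IDs are region-specific; use an Ubuntu AMI from your region,
# since the SSH step below logs in as "ubuntu")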
aws ec2 run-instances \
  --image-id ami-0c02fb55956c7d316 \
  --instance-type g4dn.xlarge \
  --key-name your-key \
  --security-group-ids sg-xxxxxxxxx \
  --subnet-id subnet-xxxxxxxxx

Connect and Setup:

ssh -i your-key.pem ubuntu@your-instance-ip
curl -sSL https://raw.githubusercontent.com/treadiehq/gpu-kill/main/scripts/setup-gpu-runner.sh | bash

Option 2: Google Cloud with GPU

Create GPU Instance:

gcloud compute instances create gpu-test-runner \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud \
  --maintenance-policy=TERMINATE \
  --restart-on-failure

Setup:

gcloud compute ssh gpu-test-runner --zone=us-central1-a
curl -sSL https://raw.githubusercontent.com/treadiehq/gpu-kill/main/scripts/setup-gpu-runner.sh | bash

Option 3: Azure with GPU

Create VM:

az vm create \
  --resource-group myResourceGroup \
  --name gpu-test-vm \
  --image Ubuntu2204 \
  --size Standard_NC6s_v3 \
  --admin-username azureuser \
  --generate-ssh-keys

Setup:

ssh azureuser@your-vm-ip
curl -sSL https://raw.githubusercontent.com/treadiehq/gpu-kill/main/scripts/setup-gpu-runner.sh | bash

Cost-Effective Options

Spot Instances

AWS Spot: Up to 90% savings
GCP Preemptible: Up to 80% savings
Azure Spot: Up to 90% savings
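
To make the percentages concrete, the following sketch computes a discounted hourly rate from an on-demand rate and a discount. The function name spot_hourly_cost and any rates passed to it are hypothetical; check your provider's current pricing for real numbers, and note that "up to" discounts are a ceiling rather than a guarantee.

```shell
#!/usr/bin/env bash
# spot_hourly_cost ON_DEMAND_RATE DISCOUNT_PERCENT
# Prints the discounted hourly rate with four decimal places.
# Both arguments are illustrative inputs, not real provider prices.
spot_hourly_cost() {
  awk -v rate="$1" -v discount="$2" \
    'BEGIN { printf "%.4f\n", rate * (100 - discount) / 100 }'
}
```

For example, spot_hourly_cost 1.00 90 prints 0.1000: an instance that costs a hypothetical $1.00 per hour on demand would cost $0.10 per hour at the full 90% discount.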

Example Spot Instance Setup (AWS):

aws ec2 request-spot-instances \
  --spot-price "0.50" \
  --instance-count 1 \
  --type "one-time" \
  --launch-specification '{
    "ImageId": "ami-0c02fb55956c7d316",
    "InstanceType": "g4dn.xlarge",
    "KeyName": "your-key",
    "SecurityGroupIds": ["sg-xxxxxxxxx"]
  }'

Docker-Based Testing

NVIDIA Docker Setup

# Install NVIDIA Docker (nvidia-docker2 is the legacy package; newer setups use nvidia-container-toolkit)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Test GPU access (use a CUDA image tag that is currently published on Docker Hub)
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

GPU Kill Docker Testing

# Build GPU Kill with GPU support
docker build -t gpukill:gpu .

# Run tests with GPU access
docker run --rm --gpus all gpukill:gpu cargo test --test gpu_hardware_tests

GitHub Actions Integration

Enable GPU Tests

Once you have a self-hosted runner set up:

Remove the if: false condition in .github/workflows/ci.yml:

gpu-hardware-tests:
  name: GPU Hardware Tests
  runs-on: [self-hosted, gpu]
  # if: false  # Remove this line

Add runner labels when setting up:

./config.sh --url https://github.com/treadiehq/gpu-kill --token YOUR_REGISTRATION_TOKEN \
  --labels "gpu,nvidia,linux" --name "nvidia-gpu-runner"

Conditional GPU Testing

The CI will automatically:

✅ Run GPU tests when GPU hardware is available
✅ Skip gracefully when no GPU hardware is found
✅ Work on any runner (hosted or self-hosted)
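
One way a CI step could decide between running and skipping is sketched below. This is an assumption about how such a check might look, not necessarily how the project's workflow detects hardware. The function name detect_gpu_vendor is hypothetical; it relies only on the vendor tools listed under Troubleshooting (nvidia-smi and rocm-smi).

```shell
#!/usr/bin/env bash
# detect_gpu_vendor: print "nvidia", "amd", or "none".
# A vendor counts as present only if its CLI tool is on PATH
# and exits successfully when asked to list devices.
detect_gpu_vendor() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
    echo "nvidia"
  elif command -v rocm-smi >/dev/null 2>&1 && rocm-smi --showid >/dev/null 2>&1; then
    echo "amd"
  else
    echo "none"
  fi
}
```

A test step could call detect_gpu_vendor, exit 0 with a "skipping" message when the result is none, and otherwise run cargo test --test gpu_hardware_tests. Exiting 0 keeps the job green on hosted runners that have no GPU.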

Cost Optimization

Scheduled Testing

Schedule the GPU workflow to run only on weekdays (GitHub Actions evaluates cron expressions in UTC):

on:
  schedule:
    - cron: '0 9 * * 1-5'  # 09:00 UTC, Monday-Friday
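
The cron expression only controls when the workflow starts. If the runner host itself should also refuse work outside a time window, a guard like the following could be called from a wrapper or startup script. The function name within_business_hours and the 17:00 cutoff are assumptions for illustration; the sketch uses UTC to match how GitHub Actions interprets cron schedules.

```shell
#!/usr/bin/env bash
# within_business_hours DAY_OF_WEEK HOUR
# DAY_OF_WEEK is 1 (Monday) to 7 (Sunday), as printed by `date -u +%u`.
# HOUR is 00-23, as printed by `date -u +%H`.
# Succeeds for Monday-Friday from 09:00 up to, but not including, 17:00.
within_business_hours() {
  local dow="$1"
  local hour=$((10#$2))   # force base 10 so "08" and "09" are not read as octal
  (( dow >= 1 && dow <= 5 && hour >= 9 && hour < 17 ))
}
```

A wrapper could run within_business_hours "$(date -u +%u)" "$(date -u +%H)" and start the runner service only when the call succeeds.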

Auto-shutdown

Add auto-shutdown to cloud instances:

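# AWS: tag the instance so that a separate scheduled job can find and stop it
# (the tag alone does not shut anything down)
aws ec2 create-tags --resources i-1234567890abcdef0 --tags Key=shutdown,Value=yes

# GCP: power off 60 minutes after each boot. A startup-script runs at boot;
# a shutdown-script would only run once shutdown has already begun.
gcloud compute instances add-metadata gpu-test-runner \
  --metadata startup-script='shutdown -h +60'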
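
Neither command above checks whether the instance is actually idle. A minimal sketch of an idle check that a cron job on the instance could run is shown here. The function name should_shutdown is hypothetical, and how the "last active" timestamp gets recorded (for example, by a hook that runs after each CI job) is left as an assumption.

```shell
#!/usr/bin/env bash
# should_shutdown LAST_ACTIVE_EPOCH NOW_EPOCH MAX_IDLE_SECONDS
# Succeeds (exit status 0) when the instance has been idle
# for at least MAX_IDLE_SECONDS. All arguments are whole seconds.
should_shutdown() {
  local last_active="$1" now="$2" max_idle="$3"
  (( now - last_active >= max_idle ))
}
```

A cron entry could read the recorded timestamp, call should_shutdown with "$(date +%s)" and a limit such as 3600, and run sudo shutdown -h now only when the call succeeds.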

Monitoring and Alerts

Set up monitoring for:

GPU utilization during tests
Test success/failure rates
Runner availability
Cost tracking

Example monitoring script:

#!/bin/bash
# Monitor GPU test results
curl -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/treadiehq/gpu-kill/actions/runs" | \
  jq '.workflow_runs[] | select(.name=="GPU Hardware Tests") | {status, conclusion, created_at}'
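
For the GPU utilization item, one possible approach is to sample nvidia-smi during a test run and average the readings afterwards. The helper below is a sketch under that assumption; the name average_utilization is hypothetical, and it expects one integer percentage per line, which is the format produced by nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits.

```shell
#!/usr/bin/env bash
# average_utilization: read one utilization percentage per line on stdin
# and print the mean with one decimal place.
# Non-numeric lines are ignored; fails if no numeric samples were read.
average_utilization() {
  awk '$1 ~ /^[0-9]+$/ { sum += $1; n++ }
       END { if (n == 0) exit 1; printf "%.1f\n", sum / n }'
}
```

For example, printf '10\n30\n' | average_utilization prints 20.0. In practice the samples could be collected into a file by running the nvidia-smi query with -l 5 (one reading every five seconds) in the background while the tests execute.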

Troubleshooting

Common Issues:

GPU not detected:

# Check NVIDIA
nvidia-smi

# Check AMD
rocm-smi --showid

# Check Intel
intel_gpu_top

Permission issues:

# Add user to docker group (log out and back in for the change to take effect)
sudo usermod -aG docker "$USER"

# Check GPU permissions
ls -la /dev/nvidia*

Driver issues:

# Update NVIDIA drivers
sudo apt-get install nvidia-driver-470

# Update AMD drivers
sudo apt-get install rocm-dkms

Next Steps

Choose your cloud provider (AWS, GCP, Azure)
Set up a GPU instance using the commands above
Configure the GitHub Actions runner with GPU labels
Enable GPU tests in the CI workflow
Monitor and optimize costs and performance

The GPU tests will now run automatically whenever GPU hardware is available! 🚀