Cloud GPU Setup Guide
November 2, 2025
This guide shows how to set up GPU testing using cloud services.
Quick Start
Option 1: AWS EC2 with GPU
- Launch GPU Instance:
    # Using AWS CLI. AMI IDs are region-specific; substitute an Ubuntu AMI from your region.
    aws ec2 run-instances \
      --image-id ami-0c02fb55956c7d316 \
      --instance-type g4dn.xlarge \
      --key-name your-key \
      --security-group-ids sg-xxxxxxxxx \
      --subnet-id subnet-xxxxxxxxx
- Connect and Setup:
    # Connect to the instance
    ssh -i your-key.pem ubuntu@your-instance-ip
    # Then, on the instance, run the setup script
    curl -sSL https://raw.githubusercontent.com/treadiehq/gpu-kill/main/scripts/setup-gpu-runner.sh | bash
Option 2: Google Cloud with GPU
- Create GPU Instance:
    gcloud compute instances create gpu-test-runner \
      --zone=us-central1-a \
      --machine-type=n1-standard-4 \
      --accelerator=type=nvidia-tesla-t4,count=1 \
      --image-family=ubuntu-2004-lts \
      --image-project=ubuntu-os-cloud \
      --maintenance-policy=TERMINATE \
      --restart-on-failure
- Setup:
    # Connect to the instance
    gcloud compute ssh gpu-test-runner --zone=us-central1-a
    # Then, on the instance, run the setup script
    curl -sSL https://raw.githubusercontent.com/treadiehq/gpu-kill/main/scripts/setup-gpu-runner.sh | bash
Option 3: Azure with GPU
- Create VM:
    # If your Azure CLI version rejects the UbuntuLTS alias, use a current alias such as Ubuntu2204
    az vm create \
      --resource-group myResourceGroup \
      --name gpu-test-vm \
      --image UbuntuLTS \
      --size Standard_NC6s_v3 \
      --admin-username azureuser \
      --generate-ssh-keys
- Setup:
    # Connect to the VM
    ssh azureuser@your-vm-ip
    # Then, on the VM, run the setup script
    curl -sSL https://raw.githubusercontent.com/treadiehq/gpu-kill/main/scripts/setup-gpu-runner.sh | bash
Cost-Effective Options
Spot Instances
- AWS Spot: Up to 90% savings
- GCP Preemptible: Up to 80% savings
- Azure Spot: Up to 90% savings
Example Spot Instance Setup (AWS):
aws ec2 request-spot-instances \
  --spot-price "0.50" \
  --instance-count 1 \
  --type "one-time" \
  --launch-specification '{
    "ImageId": "ami-0c02fb55956c7d316",
    "InstanceType": "g4dn.xlarge",
    "KeyName": "your-key",
    "SecurityGroupIds": ["sg-xxxxxxxxx"]
  }'
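To see what the savings percentages above mean for a budget, the arithmetic can be sketched as follows. The hourly rate, discount, and hours are hypothetical inputs rather than quoted prices, and spot_cost is an illustrative helper, so check your provider's current pricing before relying on the result.

```shell
#!/bin/bash
# Hypothetical helper: estimate spot cost in USD.
#   $1 = on-demand price per hour (assumed value, not a real quote)
#   $2 = spot discount in percent (e.g. 70)
#   $3 = number of hours the instance runs
spot_cost() {
  awk -v p="$1" -v d="$2" -v h="$3" \
    'BEGIN { printf "%.2f\n", p * (1 - d / 100) * h }'
}

# Example with made-up numbers: $1.00/hour, 70% discount, 40 hours
spot_cost 1.00 70 40   # prints 12.00
```

The "up to" figures are upper bounds, so it is worth running the estimate with a more conservative discount as well.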
Docker-Based Testing
NVIDIA Docker Setup
# Install NVIDIA Docker
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
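# Test GPU access
# If the 11.0-base tag cannot be pulled, try a tag with an explicit OS suffix,
# for example nvidia/cuda:11.0.3-base-ubuntu20.04
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi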
GPU Kill Docker Testing
# Build GPU Kill with GPU support
docker build -t gpukill:gpu .
# Run tests with GPU access
docker run --rm --gpus all gpukill:gpu cargo test --test gpu_hardware_tests
GitHub Actions Integration
Enable GPU Tests
Once you have a self-hosted runner set up:
- Remove the if: false condition in .github/workflows/ci.yml:
    gpu-hardware-tests:
      name: GPU Hardware Tests
      runs-on: [self-hosted, gpu]
      # if: false  # Remove this line
- Add runner labels when setting up:
    ./config.sh --labels "gpu,nvidia,linux" --name "nvidia-gpu-runner"
Conditional GPU Testing
The CI will automatically:
- Run GPU tests when GPU hardware is available
- Skip gracefully when no GPU hardware is found
- Work on any runner (hosted or self-hosted)
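The skip-when-absent behavior described above could be approximated in a runner script like the one below. This is a sketch, not the project's actual CI logic: detect_gpu_vendor is a hypothetical helper, and it only checks whether a vendor tool is on PATH (and, for NVIDIA, whether nvidia-smi can list a device).

```shell
#!/bin/bash
# Hypothetical helper: report which GPU vendor tooling is usable on this host.
# Prints one of: nvidia, amd, intel, none.
detect_gpu_vendor() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
    echo "nvidia"
  elif command -v rocm-smi >/dev/null 2>&1; then
    echo "amd"
  elif command -v intel_gpu_top >/dev/null 2>&1; then
    echo "intel"
  else
    echo "none"
  fi
}

vendor="$(detect_gpu_vendor)"
if [ "$vendor" = "none" ]; then
  # Exit status stays 0 so a hosted runner without a GPU does not fail the job
  echo "No GPU hardware found; skipping GPU tests."
else
  echo "Detected $vendor GPU tooling; GPU tests can run."
  # e.g. cargo test --test gpu_hardware_tests
fi
```

Finding a tool on PATH does not prove that a working device is attached, so a stricter check may be needed for AMD and Intel hosts.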
Cost Optimization
Scheduled Testing
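Trigger the GPU workflow on a weekday schedule so runners are only needed during business hours (GitHub Actions evaluates cron schedules in UTC):
on:
  schedule:
    - cron: '0 9 * * 1-5' # 09:00 UTC, Monday-Friday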
Auto-shutdown
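Add auto-shutdown to cloud instances:
# AWS: tag the instance so a separate cleanup job (not shown) can find and stop it.
# The tag alone does not shut anything down.
aws ec2 create-tags --resources i-1234567890abcdef0 --tags Key=shutdown,Value=yes
# GCP: power off 60 minutes after each boot. A startup script is used here because
# a shutdown-script only runs once the instance is already stopping.
gcloud compute instances add-metadata gpu-test-runner \
  --metadata startup-script='sudo shutdown -h +60'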
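A fixed 60-minute timer powers the instance off whether or not tests are still running. One alternative is to shut down only when the GPU is idle. The sketch below assumes an NVIDIA GPU and a threshold you choose yourself; all_idle and shutdown_if_idle are hypothetical helpers, and the shutdown call is defined but deliberately not invoked.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if every utilization value on stdin
# (one integer percentage per line) is below the threshold in $1.
# No readings at all counts as "not idle" so a failed query never triggers shutdown.
all_idle() {
  awk -v t="$1" '
    $1 + 0 >= t { busy = 1 }
    END { if (busy || NR == 0) exit 1; exit 0 }
  '
}

# Hypothetical wrapper: power off when all GPUs are below the threshold.
shutdown_if_idle() {
  if nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits \
      | all_idle "$1"; then
    echo "GPU idle; shutting down."
    sudo shutdown -h now
  fi
}

# Not called here. Run it from cron, for example every 10 minutes:
#   shutdown_if_idle 5
```

A single low reading can occur between test steps, so in practice you may want several consecutive idle checks before powering off.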
Monitoring and Alerts
Set up monitoring for:
- GPU utilization during tests
- Test success/failure rates
- Runner availability
- Cost tracking
Example monitoring script:
#!/bin/bash
# Monitor GPU test results
curl -H "Authorization: token $GITHUB_TOKEN" \
"https://api.github.com/repos/treadiehq/gpu-kill/actions/runs" | \
jq '.workflow_runs[] | select(.name=="GPU Hardware Tests") | {status, conclusion, created_at}'
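To turn that output into the success rate listed above, the run conclusions can be tallied. This is a minimal sketch that assumes one conclusion string per line on stdin, for example by ending the jq filter with `| .conclusion` and passing `-r` to jq. success_rate is a hypothetical helper and the sample data is made up.

```shell
#!/bin/bash
# Hypothetical helper: read one run conclusion per line on stdin and print
# the percentage of finished runs that succeeded.
# Runs that have not finished yet (jq prints "null") are ignored.
success_rate() {
  awk '
    $0 == "" || $0 == "null" { next }
    { total++ }
    $0 == "success" { ok++ }
    END {
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * ok / total
    }
  '
}

# Example with made-up data: 3 of 4 finished runs succeeded
printf 'success\nfailure\nsuccess\nnull\nsuccess\n' | success_rate   # prints 75.0
```

The API returns results in pages, so this reflects only the runs on the page you fetched.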
Troubleshooting
Common Issues:
- GPU not detected:
    # Check NVIDIA
    nvidia-smi
    # Check AMD
    rocm-smi --showid
    # Check Intel
    intel_gpu_top
- Permission issues:
    # Add user to docker group
    sudo usermod -aG docker $USER
    # Check GPU permissions
    ls -la /dev/nvidia*
- Driver issues:
    # Update NVIDIA drivers
    sudo apt-get install nvidia-driver-470
    # Update AMD drivers
    sudo apt-get install rocm-dkms
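A group change made with usermod only applies to new login sessions, which is a common reason the permission fix above appears not to work. The sketch below checks membership from the group database, so it can confirm the change before you log in again. in_group is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: succeed if user $1 is a member of group $2
# according to the group database.
in_group() {
  for g in $(id -nG "$1" 2>/dev/null); do
    [ "$g" = "$2" ] && return 0
  done
  return 1
}

user="$(id -un)"
if in_group "$user" docker; then
  echo "$user is in the docker group (log out and back in if docker still fails)"
else
  echo "$user is not in the docker group yet"
fi
```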
Next Steps
- Choose your cloud provider (AWS, GCP, Azure)
- Set up a GPU instance using the scripts above
- Configure the GitHub Actions runner with GPU labels
- Enable GPU tests in the CI workflow
- Monitor and optimize costs and performance
Once these steps are complete, the GPU tests run automatically whenever GPU hardware is available.