KAI Scheduler Scale Tests
May 5, 2026
Overview
Scale tests validate KAI scheduler performance and correctness at large cluster sizes (hundreds to thousands of nodes). These tests simulate realistic workloads to ensure the scheduler maintains acceptable performance and correctness under scale.
What We Test
Scale tests verify:
- Scheduling performance: Time to schedule large numbers of pods across many nodes
- Topology-aware scheduling: Time to allocate for distributed jobs with topology constraints
- Resource allocation: Proper GPU allocation and queue quota enforcement at scale
- Reclaim behavior: Preemption and resource reclamation with background workloads
- Distributed job scheduling: Multi-pod job allocation across nodes
- System stability: Scheduler behavior under concurrent job creation and high load
Test Structure
Test Framework
Tests use Ginkgo for test organization and execution. The test suite (scale_suite_test.go) defines test contexts and scenarios.
Node Simulation
Tests use KWOK (Kubernetes WithOut Kubelet) to simulate large clusters without requiring real nodes:
- KWOK nodes: Virtual nodes created via the kwok-operator NodePool CRD. Each NodePool defines the desired node count and a node template (labels, capacity, allocatable resources). The operator reconciles the pool by creating or deleting KWOK-backed virtual nodes to match the spec. See test/e2e/scale/base_kwok_managed_nodepool.yaml for the base pool definition.
- Default scale: 500 nodes (configurable via the NODE_COUNT environment variable)
- GPU simulation: Fake GPU operator provides GPU resource reporting
- Pod lifecycle: KWOK stages simulate pod completion and status transitions
Test Organization
Tests are organized into contexts:
- Topology tests: Validate topology-aware scheduling with hierarchical constraints
- Big cluster tests: Performance tests with large node counts
  - Cluster fill scenarios (scheduler enabled/disabled during job creation)
  - Whole GPU allocation tests
  - Distributed job scheduling
  - Reclaim scenarios
Environment Setup
Run from the repo root on a cluster with KAI scheduler already installed:
./hack/setup-scale-test-env.sh
This installs:
- KWOK + KWOK operator for simulated nodes
- Fake GPU operator for GPU resource reporting on KWOK nodes
- Prometheus + Grafana + Pyroscope for metrics and profiling
- ServiceMonitors for scheduler and binder metrics
- Tuned scheduler/binder config for scale (consolidation disabled, high binder concurrency)
Running Tests
ginkgo -v ./test/e2e/scale/
The node count defaults to 500; override it with the NODE_COUNT environment variable.
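As a rough illustration of the override, a runner script could build the Ginkgo invocation and its environment as follows. This is a minimal sketch, not part of the repository: the helper names `scale_test_command` and `effective_node_count` are hypothetical, and the fallback to 500 simply mirrors the default stated above.

```python
import os

DEFAULT_NODE_COUNT = 500  # default stated in this README


def scale_test_command(node_count=None, base_env=None):
    """Return (argv, env) for running the scale suite (hypothetical helper)."""
    # Copy the environment so the caller's mapping is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if node_count is not None:
        if node_count <= 0:
            raise ValueError("node_count must be positive")
        env["NODE_COUNT"] = str(node_count)
    argv = ["ginkgo", "-v", "./test/e2e/scale/"]
    return argv, env


def effective_node_count(env):
    """Node count the suite would see, assuming it falls back to 500 when unset."""
    return int(env.get("NODE_COUNT", DEFAULT_NODE_COUNT))


argv, env = scale_test_command(1000, base_env={})
print(" ".join(argv), "with NODE_COUNT =", effective_node_count(env))
```

Passing `argv` and `env` to `subprocess.run(argv, env=env, check=True)` from the repo root would then start the suite with the chosen node count.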
Recommended Architecture
Scale tests should run from a runner pod inside the target cluster, not from an external machine. This minimizes API server latency during test execution and metric collection.
The target cluster should be a real cluster with real GPU nodes: KWOK simulates node presence, but the scheduler, binder, and control plane run on actual hardware. Because these tests are designed to measure KAI scheduler performance in realistic scenarios rather than to test logic, they must run on actual hardware.
Minimal cluster requirements:
- Dedicated control plane nodes (not shared with test workloads)
- KAI scheduler installed via Helm
- kubectl access from the runner pod (via ServiceAccount or kubeconfig)
Test Execution
- Tests run on dedicated infrastructure every 24 hours
- Test results are stored in S3 and displayed on a public dashboard
- Dashboard URL: KAI Scheduler Scale Tests
Results Dashboard
The scale tests dashboard displays historical test results fetched from S3. The dashboard shows:
- Test execution times and performance metrics
- Pass/fail status for each test
- Detailed failure messages and logs
- Historical trends (30 days)
- Search and filter capabilities
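To make the pass/fail and timing views concrete, the sketch below summarizes one Ginkgo JSON report. It is illustrative only and is not the dashboard's actual code. It assumes the Ginkgo v2 report layout (a list of suites, each with `SpecReports` entries carrying `State`, `RunTime` in nanoseconds, `LeafNodeText`, and `ContainerHierarchyTexts`); the function name and the sample spec names are hypothetical.

```python
import json


def summarize_report(report_text):
    """Count spec states and collect failures from a Ginkgo-style JSON report."""
    suites = json.loads(report_text)
    summary = {"passed": 0, "failed": 0, "other": 0,
               "failures": [], "runtime_seconds": 0.0}
    for suite in suites:
        for spec in suite.get("SpecReports") or []:
            # RunTime is assumed to be a duration in nanoseconds.
            summary["runtime_seconds"] += spec.get("RunTime", 0) / 1e9
            state = spec.get("State", "")
            if state == "passed":
                summary["passed"] += 1
            elif state == "failed":
                summary["failed"] += 1
                parts = list(spec.get("ContainerHierarchyTexts") or [])
                parts.append(spec.get("LeafNodeText", ""))
                summary["failures"].append(" ".join(parts).strip())
            else:
                # Skipped, pending, and similar states are grouped together.
                summary["other"] += 1
    return summary


# Hypothetical report content, for illustration only.
sample = json.dumps([{
    "SuiteDescription": "Scale Suite",
    "SpecReports": [
        {"ContainerHierarchyTexts": ["Big cluster"], "LeafNodeText": "fills the cluster",
         "State": "passed", "RunTime": 2_000_000_000},
        {"ContainerHierarchyTexts": ["Topology"], "LeafNodeText": "schedules a distributed job",
         "State": "failed", "RunTime": 500_000_000},
        {"ContainerHierarchyTexts": [], "LeafNodeText": "reclaim",
         "State": "skipped", "RunTime": 0},
    ],
}])
print(summarize_report(sample))
```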
S3 Bucket Structure
Test results are stored in an S3 bucket (configured via repository secret) with the following structure:
Public/
  manifest.json        # Index of all test runs
  <run-id>/
    report.json        # Ginkgo JSON report for that run
The manifest.json file lists all available test runs:
{
  "runs": [
    {
      "timestamp": "2024-01-15T10:00:00Z",
      "path": "Public/<run-id>/report.json"
    }
  ]
}
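A consumer of this index might select recent runs along the following lines. This is a sketch under stated assumptions rather than the dashboard's implementation: the function names and run IDs are hypothetical, and the 30-day window is assumed to be applied to the manifest's `timestamp` field.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 UTC timestamp such as 2024-01-15T10:00:00Z."""
    # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def recent_report_paths(manifest_text, now, days=30):
    """Return report paths for runs within the last `days`, newest first."""
    manifest = json.loads(manifest_text)
    cutoff = now - timedelta(days=days)
    runs = [
        (parse_timestamp(run["timestamp"]), run["path"])
        for run in manifest.get("runs", [])
    ]
    recent = [run for run in runs if run[0] >= cutoff]
    recent.sort(key=lambda run: run[0], reverse=True)
    return [path for _, path in recent]


# Hypothetical manifest with made-up run IDs, for illustration only.
example = json.dumps({
    "runs": [
        {"timestamp": "2024-01-15T10:00:00Z", "path": "Public/run-a/report.json"},
        {"timestamp": "2023-11-01T10:00:00Z", "path": "Public/run-b/report.json"},
    ]
})
now = datetime(2024, 1, 20, tzinfo=timezone.utc)
print(recent_report_paths(example, now))
```

Each returned path is relative to the bucket root, so a client would join it with the configured bucket URL before fetching the report.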
Dashboard Deployment
The dashboard is automatically deployed to GitHub Pages when changes are pushed to the docs/scale-tests/ directory. The S3 bucket URL is configured via the SCALE_TESTS_S3_URL repository variable.