scx_cake: CAKE DRR++ Adapted for CPU Scheduling

February 11, 2026

License: GPL-2.0 · Kernel: 6.12+ · Status: Experimental

ABSTRACT: scx_cake is an experimental BPF CPU scheduler that adapts the network CAKE algorithm's DRR++ (Deficit Round Robin++) for CPU scheduling. Designed for gaming workloads on modern AMD and Intel hardware.

  • 4-Tier Classification — Tasks sorted by EWMA avg_runtime into Critical / Interactive / Frame / Bulk
  • Zero Global Atomics — Per-CPU BSS arrays with MESI-guarded writes eliminate bus locking
  • Kernel-Delegated Idle Selection — scx_bpf_select_cpu_dfl() for authoritative, zero-staleness CPU selection
  • Per-LLC DSQ Sharding — Eliminates cross-CCD lock contention on multi-chiplet CPUs
  • DRR++ Deficit Tracking — Network CAKE's flow fairness algorithm adapted for CPU task scheduling

Warning

EXPERIMENTAL SOFTWARE This scheduler is experimental and intended for use with sched_ext on Linux Kernel 6.12+. Performance may vary depending on hardware and user configuration.

Note

AI TRANSPARENCY Large Language Models were used for optimization pattern matching and design exploration. All implementation details have been human-verified, benchmarked on real gaming workloads, and validated for correctness.



1. Quick Start

# Prerequisites: Linux Kernel 6.12+ with sched_ext, Rust toolchain

# Clone and build
git clone https://github.com/sched-ext/scx.git
cd scx && cargo build --release -p scx_cake

# Run (requires root)
sudo ./target/release/scx_cake

# Run with live stats TUI
sudo scx_cake -v

2. Philosophy

Traditional schedulers (CFS, EEVDF) optimize for fairness — if a game and a compiler both run, each gets 50% CPU time. For gaming, this creates two problems:

  1. Latency inversion: A 50µs input handler waits behind a 50ms compile job
  2. Frame jitter: Game render threads get preempted mid-frame by background work

scx_cake's answer: Classify tasks by behavior (how long they run), not by type or nice value. Short-burst tasks (input, audio) get instant priority. Long-running tasks (compilers) get larger time slices but lower priority. The system self-tunes — no manual configuration needed.

This is the same insight behind network CAKE: short flows (DNS, gaming packets) should not be delayed by bulk flows (downloads). scx_cake applies this to CPU time.


3. 4-Tier System

scx_cake classifies every task into one of four tiers based on its EWMA (Exponential Weighted Moving Average) runtime. Classification is automatic and continuous — tasks move between tiers as their behavior changes.

Tier Gates

| Tier | Name | avg_runtime | Typical Workload | Quantum | Starvation |
|------|------|-------------|------------------|---------|------------|
| T0 | Critical | < 100µs | IRQ handlers, input drivers, audio (PipeWire), network | 0.5ms | 3ms |
| T1 | Interactive | < 2ms | Compositors, game physics, game AI, short workers | 2.0ms | 8ms |
| T2 | Frame | < 8ms | Game render threads, video encoding | 4.0ms | 40ms |
| T3 | Bulk | ≥ 8ms | Compilation, background indexing, batch jobs | 8.0ms | 100ms |

Tip

No game task should be in T3. Game render threads run 2-8ms per frame → T2. Physics/AI run 0.5-2ms → T1. Input handlers run < 100µs → T0. Only tasks doing 8ms+ uninterrupted CPU work (shader compilation, loading screens) land in T3.

How Classification Works

  1. Initial placement: Based on nice value — nice < 0 → T0, nice 0-10 → T1, nice > 10 → T3
  2. Runtime authority: After ~3 stops, the EWMA avg_runtime becomes authoritative. A nice -5 task that runs 50ms bursts will reclassify to T3 regardless of nice value.
  3. Hysteresis: 10% deadband prevents oscillation at tier boundaries. Promotion requires avg_runtime clearly below the gate; demotion is immediate.
  4. Graduated backoff: Once a tier is stable for 3+ consecutive stops, reclassification frequency drops per-tier: T0 rechecks every 1024th stop, T3 every 16th. Instability resets to full-frequency checking.

DRR++ Deficit Tracking

Adapted from network CAKE's flow fairness:

  • Each task starts with a deficit (quantum + new-flow bonus ≈ 10ms credit)
  • Each execution bout consumes deficit proportional to runtime
  • When deficit exhausts → new-flow bonus removed → task competes normally
  • This gives newly spawned threads (game launching a worker) instant responsiveness that naturally decays

DVFS (CPU Frequency Scaling)

Each tier maps to a CPU performance target via RODATA lookup table:

| Tier | Target | Rationale |
|------|--------|-----------|
| T0-T2 | 100% (max frequency) | Gaming workloads need full performance |
| T3 | 75% | Background work can run slightly slower to save power |

On Intel hybrid CPUs (has_hybrid = true), targets are scaled by each core's cpuperf_cap to prevent over-requesting frequency on E-cores.


4. Architecture

Overview

flowchart TD
    subgraph HOT["BPF Hot Path"]
        SELECT["cake_select_cpu"] --> |SYNC| DIRECT["Direct Dispatch<br/>to waker CPU"]
        SELECT --> |IDLE| KERNEL["scx_bpf_select_cpu_dfl<br/>kernel idle cascade"]
        SELECT --> |ALL BUSY| ENQ["cake_enqueue"]

        KERNEL --> LOCALON["SCX_DSQ_LOCAL_ON<br/>direct to CPU"]

        ENQ --> |"vtime = tier≪56 ∣ timestamp"| LLCDSQ["Per-LLC DSQ"]
        LLCDSQ --> DISPATCH["cake_dispatch"]
        DISPATCH --> |"1. Local LLC"| LOCAL["scx_bpf_dsq_move_to_local"]
        DISPATCH --> |"2. Cross-LLC steal"| STEAL["Round-robin other LLCs"]
    end

    subgraph PERIODIC["Periodic Callbacks"]
        TICK["cake_tick 1ms"] --> STARVE["Starvation check"]
        TICK --> DVFS["DVFS frequency steering"]
        TICK --> MBOX["Mailbox tier update"]
        RUNNING["cake_running"] --> STAMP["Stamp last_run_at"]
    end

    subgraph CLASSIFY["Classification Engine"]
        STOPPING["cake_stopping"] --> RECLASS["reclassify_task_cold"]
        RECLASS --> EWMA["EWMA runtime update"]
        RECLASS --> DEFICIT["DRR++ deficit tracking"]
        RECLASS --> TIERMAP["Hysteresis tier mapping"]
    end

Source Files

| File | Lines | Purpose |
|------|-------|---------|
| cake.bpf.c | 758 | All BPF ops + classification engine |
| intf.h | 200 | Shared structs, constants, fused config macros |
| bpf_compat.h | 118 | Relaxed atomics, De Bruijn CTZ, DSQ peek compat |
| main.rs | 442 | Rust loader, CLI, profiles, topology detection, TUI |

Ops Callbacks (7 total)

| Callback | Role | Hot/Cold |
|----------|------|----------|
| cake_select_cpu | SYNC dispatch + kernel idle selection + kfunc tunneling | Hot |
| cake_enqueue | Tier-encoded vtime insert into per-LLC DSQ | Hot |
| cake_dispatch | Local LLC → cross-LLC steal | Hot |
| cake_tick | Starvation check, DVFS, mailbox update | Hot (1ms period) |
| cake_running | Timestamp last_run_at | Hot (minimal) |
| cake_stopping | Calls reclassify_task_cold | Warm |
| cake_init / cake_exit | DSQ creation, UEI | Cold (once) |

Data Structures

Per-task context (cake_task_ctx, 64 bytes, cache-line aligned):

Bytes 0-7:   next_slice (u64)           — Pre-computed tier-adjusted quantum
Bytes 8-11:  deficit_avg_fused (u32)    — [deficit_us:16][avg_runtime_us:16] fused
Bytes 12-15: packed_info (u32)          — [stable:2][tier:2][flags:4][rsvd:8][wait:8][error:8]
Bytes 16-19: last_run_at (u32)          — Timestamp (wraps at 4.2s)
Bytes 20-21: reclass_counter (u16)      — Graduated backoff counter
Bytes 22-63: padding

Per-CPU mega-mailbox (mega_mailbox_entry, 64 bytes, cache-line isolated):

Byte 0: flags          — [1:0]=tier, written by cake_tick
Byte 1: dsq_hint       — DVFS perf target cache
Byte 2: tick_counter    — Starvation graduated confidence
Bytes 3-63: reserved

Per-CPU scratch (cake_scratch, 128 bytes):

  • Tunnels select_cpu → enqueue: cached_llc, cached_now (saves 2 kfunc calls)
  • DSQ iterator storage for legacy peek fallback

DSQ Architecture

flowchart LR
    subgraph SINGLE["Single-CCD 9800X3D"]
        DSQ0["LLC_DSQ_BASE + 0<br/>vtime ordered<br/>nr_llcs = 1<br/>stealing skipped"]
    end

    subgraph MULTI["Multi-CCD 9950X"]
        DSQ1["LLC_DSQ_BASE + 0<br/>CCD 0 cores"] <-->|"cross-LLC steal<br/>when local empty"| DSQ2["LLC_DSQ_BASE + 1<br/>CCD 1 cores"]
    end
  • Vtime encoding: (tier << 56) | (timestamp & 0x00FFFFFFFFFFFFFF) — lower tiers drain first within each LLC DSQ
  • RODATA gate: if (nr_llcs <= 1) return; skips all cross-LLC stealing on single-CCD systems

Zero Global State

| Anti-pattern | scx_cake |
|--------------|----------|
| Global atomics | 0 |
| Volatile variables | 0 |
| Division in hot path | 0 (shift-based µs conversion: >> 10) |
| Global vtime writes | 0 (per-task only) |
| RCU lock/unlock in hot path | 0 |

Kfunc Tunneling

select_cpu caches cpu_llc_id and scx_bpf_now() in per-CPU scratch. enqueue reuses these values, saving ~40-60ns (2 kfunc trampoline entries) on the all-busy path.

Graduated Confidence

Two independent confidence systems reduce overhead when scheduling is stable:

  1. Tick starvation check: tick_counter tracks consecutive ticks without contention. Skip masks grow from 0 → 1 → 3 → 7, checking every 1st → 2nd → 4th → 8th tick. Any contention resets to full alertness.
  2. Reclassification backoff: When a task's tier is stable for 3+ consecutive stops, reclassification drops to per-tier intervals (T0: every 1024th, T3: every 16th). A spot-check detects tier-change triggers without full reclassification cost.

5. Configuration

Profiles (--profile, -p)

| Profile | Quantum | Starvation | Use Case |
|---------|---------|------------|----------|
| gaming | 2ms | 100ms | (Default) Balanced for most games |
| esports | 1ms | 50ms | Competitive FPS, ultra-low latency |
| legacy | 4ms | 200ms | Older CPUs, battery saving |
| default | 2ms | 100ms | Alias for gaming |

CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| --profile, -p <PROFILE> | gaming | Select preset profile |
| --quantum <µs> | profile | Base time slice in microseconds |
| --new-flow-bonus <µs> | profile | Extra deficit for newly woken tasks |
| --starvation <µs> | profile | Max run time before forced preemption |
| --verbose, -v | false | Enable live TUI stats display |
| --interval <secs> | 1 | TUI refresh interval |

Per-Tier Tuning (Gaming Profile)

| Tier | Quantum Multiplier | Effective Slice | Starvation Limit |
|------|--------------------|-----------------|------------------|
| T0 Critical | 0.75x | 1.5ms | 3ms |
| T1 Interactive | 1.0x | 2.0ms | 8ms |
| T2 Frame | 1.2x | 2.4ms | 40ms |
| T3 Bulk | 1.4x | 2.8ms | 100ms |

Note

Higher-priority tiers get smaller slices — T0 tasks (input, audio) run < 100µs and release cores fast. T3 tasks (compilers) get larger slices for cache efficiency. This is the opposite of traditional priority systems, where high priority means more CPU time.

Examples

# Default gaming profile
sudo scx_cake

# Competitive gaming
sudo scx_cake -p esports

# Gaming with custom quantum and live stats
sudo scx_cake --quantum 1500 -v

# Battery-friendly for laptop gaming
sudo scx_cake -p legacy

6. Performance

Target Hardware

| Component | Specification |
|-----------|---------------|
| CPU | AMD Ryzen 7 9800X3D (1 CCD, 8C/16T) |
| Kernel | Linux 6.12+ with sched_ext |

Design Targets

  • Sub-microsecond scheduling decisions — Select CPU + enqueue in ~17ns (SYNC) to ~36ns (all-busy)
  • Zero bus lock contention — No global atomics means no scaling regression under load
  • Consistent 1% lows — Tier-encoded vtime prevents frame time spikes from background work
  • Input priority — Mouse/keyboard handlers reach T0 within 3 stops, get dispatched before all other work

Benchmarks

  • schbench — Scheduler latency microbenchmark
  • Arc Raiders — AAA game stress testing (frame rates, 1% lows)
  • Splitgate 2 — Competitive FPS latency testing

Note

Throughput workloads (compilers, render farms) will perform worse than CFS/EEVDF. This scheduler explicitly trades throughput for latency — the same tradeoff network CAKE makes for packets.


7. Vocabulary

Core Concepts

| Term | Definition |
|------|------------|
| CAKE | Common Applications Kept Enhanced. Network AQM algorithm this scheduler adapts. |
| DRR++ | Deficit Round Robin++. Network algorithm balancing fair queuing with strict priority. |
| EWMA | Exponential Weighted Moving Average. Tracks runtime with 7/8 decay (~8 bouts to converge). |
| Tier | Classification level (0-3) by avg_runtime. Controls slice, starvation, vtime priority, and DVFS. |
| Deficit | Per-task credit from DRR++. New tasks get bonus credit; exhaustion removes the bonus. |
| Quantum | Base time slice a task is allotted before a scheduling decision. Multiplied by tier multiplier. |
| Niceness | Linux nice value (-20 to 19). scx_cake uses it only for initial tier placement; runtime behavior overrides. |
| DVFS | Dynamic Voltage and Frequency Scaling. Per-tier CPU frequency steering via scx_bpf_cpuperf_set. |
| Jitter | Variance in scheduling latency between consecutive events. Low jitter = consistent frame delivery. |

Architecture

| Term | Definition |
|------|------------|
| Fused Config | 4 parameters packed into one 64-bit word: [mult:12][quantum:16][budget:16][starve:20]. |
| Mega-Mailbox | 64B per-CPU cache-line-isolated state. Zero false sharing between cores. |
| Kfunc Tunneling | Caching kfunc return values in per-CPU scratch to avoid redundant trampoline calls. |
| MESI Guard | Read-before-write pattern: skip store if value unchanged, preventing cache invalidation. |
| Graduated Backoff | Confidence system that reduces check frequency when scheduling is stable. |
| RODATA Gate | Compile-time constant that eliminates entire code paths (e.g., single-CCD skips stealing). |
| Adaptive Sampling | Dynamically adjusting check frequency based on confidence in task behavior stability. |
| Bitfield Coalescing | Packing fields written together into adjacent bits for fused clear/set ops (Rule 37). |
| State Fusion | Combining related fields into a union for atomic single-load access (e.g., deficit_avg_fused). |
| Trampolining | Excessive jumping between BPF and kernel contexts. Grouped calls minimize entry/exit costs. |

Hardware

| Term | Definition |
|------|------------|
| CCD | Core Complex Die. Physical chiplet containing cores (9800X3D: 1 CCD, 9950X: 2 CCDs). |
| LLC | Last Level Cache (L3). Cores in same LLC communicate ~3-5x faster than cross-LLC. |
| SMT | Simultaneous Multi-Threading. Two logical CPUs per physical core. |
| P/E Cores | Intel hybrid architecture: Performance cores (fast) and Efficiency cores (power-saving). |
| ETD | Empirical Topology Discovery. Measures inter-core CAS latency at startup for display. |
| Cache Line | 64-byte block of memory. Smallest unit the CPU loads from RAM. Foundation of data layout. |
| Line Fill Buffer | Hardware buffer for outstanding cache line fetches. Saturating it maximizes memory bandwidth. |
| MESI Protocol | Cache coherency protocol (Modified/Exclusive/Shared/Invalid). Unnecessary writes trigger invalidations. |
| Snoop Signaling | Cache coherency mechanism where cores monitor the bus for writes to shared cache lines. |

Performance

| Term | Definition |
|------|------------|
| Hot Path | Code on every scheduling decision: select_cpu → enqueue → dispatch. |
| Cold Path | Infrequent code: task init, tier reclassification (noinline). |
| ILP | Instruction Level Parallelism. CPU executing multiple independent instructions per cycle. |
| MLP | Memory Level Parallelism. Issuing multiple independent loads to hide memory latency. |
| Direct Dispatch | SCX_DSQ_LOCAL_ON — task goes directly to a CPU's local queue, bypassing the DSQ path. |
| 1% Lows | Average framerate of the slowest 1% of frames. Key metric for stutter. |
| Branchless | Code avoiding if/else to prevent CPU pipeline stalls from branch misprediction. |
| Register Pressure | Competition for CPU registers. Excess pressure causes spills to stack (slower memory). |
| Speculative Prefetch | Hinting the CPU to load data before it's needed, hiding memory latency. |
| AoS / SoA / AoSoA | Array-of-Structs / Struct-of-Arrays / hybrid. Data layout affects cache efficiency and vectorization. |

Scaling Laws

| Term | Definition |
|------|------------|
| Amdahl's Law | Speedup limited by serial fraction. 5% serial code caps speedup at 20× regardless of core count. |
| Little's Law | Concurrency = throughput × latency. Guides queue depth and parallelism decisions. |
| Rent's Rule | Interconnect complexity grows sublinearly with component count. Guides topology-aware design. |

Anti-Patterns

| Term | Definition |
|------|------------|
| False Sharing | Performance penalty when CPUs write to different data on the same 64-byte cache line. |
| Cache Invalidation | Forcing other cores to discard cached data via unnecessary writes. Causes bus locking. |
| Micro-slicing | Preempting tasks too frequently. Queued interruptions degrade throughput and increase jitter. |
| Volatile | Compiler hint preventing optimization. Clogs LSU, breaks ILP/MLP parallelism. Avoid in BPF. |
| Spills | Register values evicted to stack memory when register pressure is too high. ~5ns penalty each. |
| Barriers | Memory ordering fences. Force pipeline flushes and inhibit out-of-order execution. |
| Read-Once | Compiler pattern forcing re-reads from memory. Defeats register caching and ILP. |

License: GPL-2.0 · Maintainer: RitzDaCat