Prisoner's Dilemma

March 16, 2026

Evolving strategies for the Iterated Prisoner's Dilemma using autoevolve. Starting from a trivial seed (Always Cooperate), the agent evolved 9 strategies and discovered a champion that combines proportional punishment with opponent classification — nearly tying the classic Tit-for-Tat while dominating everything else.

[Figure: Evolution progress]

The game

Each round, two players simultaneously cooperate or defect:

              Cooperate   Defect
  Cooperate   3, 3        0, 5
  Defect      5, 0        1, 1

200 rounds per game, 100 games per match, 5% noise rate.
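
To make the rules concrete, here is a minimal sketch of a single noisy game. It is not the code in arena.py: the function names, the history-based strategy signature, and the noise model (each intended move is flipped independently with the given probability) are assumptions made for illustration, and always_defect is a hypothetical opponent that does not appear in the leaderboard.

```python
import random

# Payoffs from the table above: (move_a, move_b) -> (points_a, points_b)
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def flip(move):
    """Return the opposite move."""
    return "D" if move == "C" else "C"


def play_game(strategy_a, strategy_b, rounds=200, noise=0.05, rng=None):
    """Play one game; a strategy maps (own_history, opp_history) to "C" or "D"."""
    if rng is None:
        rng = random.Random()
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        # Assumed noise model: each intended move is flipped independently.
        if rng.random() < noise:
            move_a = flip(move_a)
        if rng.random() < noise:
            move_b = flip(move_b)
        points_a, points_b = PAYOFFS[(move_a, move_b)]
        score_a += points_a
        score_b += points_b
        # Histories record the moves actually played, noise included.
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


def always_cooperate(own_history, opp_history):
    return "C"


def always_defect(own_history, opp_history):  # hypothetical opponent
    return "D"
```

With noise switched off, two cooperators earn 3 points per round each, and a defector facing a cooperator earns 5 per round while the cooperator earns nothing.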

How to run

# Run a match and auto-record
uv run examples/prisoners_dilemma/arena.py v9 v2 --games 100 --record

# Diagnose with move-by-move trace
uv run examples/prisoners_dilemma/arena.py v9 v2 --trace --seed 42

# Check standings
uv run tracker.py leaderboard --db examples/prisoners_dilemma/matches.json

# Get suggested next opponent
uv run tracker.py suggest v9 --db examples/prisoners_dilemma/matches.json

Evolution journey

v1 — Always Cooperate (seed)

Cooperates unconditionally. Gets exploited by everything. Elo: -177.

v2 — Tit-for-Tat

Mirrors the opponent's last move. The classic baseline. Elo: 2099 (#3).
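
As a sketch, Tit-for-Tat fits in a few lines. The (own_history, opp_history) signature is an assumption chosen for illustration, not necessarily the interface the arena uses.

```python
def tit_for_tat(own_history, opp_history):
    """Cooperate on the first round, then copy the opponent's previous move."""
    if not opp_history:
        return "C"
    return opp_history[-1]
```

Because it copies whatever it saw last, a single noise-induced defection is echoed back on the next round.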

v3 — Pavlov

Win-stay, lose-shift. Edges TFT (60-40) via DD-escape but gets crushed by Gradual. Elo: 1591 (#6).
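
A minimal sketch of the win-stay, lose-shift rule, assuming a strategy receives both move histories (the signature is illustrative). A round counts as a "win" when the opponent cooperated, which yields a payoff of 3 or 5.

```python
def pavlov(own_history, opp_history):
    """Win-stay, lose-shift: keep the last move after a win, otherwise switch."""
    if not own_history:
        return "C"
    last_own, last_opp = own_history[-1], opp_history[-1]
    if last_opp == "C":
        # Payoff was 3 (CC) or 5 (DC): stay with the same move.
        return last_own
    # Payoff was 0 (CD) or 1 (DD): shift to the other move.
    return "D" if last_own == "C" else "C"
```

After mutual defection this rule switches back to cooperation, which is presumably the "DD-escape" mentioned above.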

v4 — Gradual

Proportional punishment (Beaufils 1996). Crushes Pavlov 100-0 but loses to TFT 9-91. Using --trace revealed the exact mechanism: each noise event adds to the defection counter permanently, causing late-game punishments of 5+ rounds. TFT mirrors these, creating extended mutual defection. Elo: 1884 (#4).
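
The growing punishment can be sketched as a small state machine. This is a simplified take on Gradual, not the v4 source. It assumes the n-th counted defection triggers n defections followed by 2 cooperative "peace" rounds, and it only counts defections seen outside punishment and peace phases. The class and attribute names are hypothetical.

```python
class Gradual:
    """Simplified Gradual: the n-th counted defection earns n defections in reply."""

    def __init__(self):
        self.defections_seen = 0  # never decreases, so punishments keep growing
        self.punish_left = 0      # defections still owed in the current phase
        self.peace_left = 0       # cooperative rounds still owed after punishing

    def move(self, opp_last):
        """opp_last is the opponent's previous move, or None on the first round."""
        if self.punish_left > 0:
            self.punish_left -= 1
            if self.punish_left == 0:
                self.peace_left = 2
            return "D"
        if self.peace_left > 0:
            self.peace_left -= 1
            return "C"
        if opp_last == "D":
            self.defections_seen += 1
            # This round's defection is the first of the punishment phase.
            self.punish_left = self.defections_seen - 1
            if self.punish_left == 0:
                self.peace_left = 2
            return "D"
        return "C"
```

Since defections_seen only ever grows, a 200-round game at 5% noise accumulates enough noise events to produce the long late-game punishments described above.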

v5 — Gradual with decaying counter (failed)

Attempted fix: decay the counter after each cycle. This made punishment too weak: v5 beats only v1 and loses to every other opponent it played. Elo: 621 (#7).

v6 — Probe + Classify (failed)

Probed on rounds 5-6 to detect TFT. Classification worked, but noise corrupted the probe ~19% of the time. In misclassified games it played plain Gradual against TFT and lost badly. Elo: 1672 (#5).
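
One way to arrive at a figure close to 19% is to assume the probe depends on four moves (two rounds, two players), each flipped independently at the 5% noise rate. The source does not say how the number was obtained, so treat this as a plausibility check rather than the actual derivation.

```python
noise = 0.05
moves_in_probe = 4  # assumption: 2 probe rounds x 2 players

# Probability that at least one of the moves is flipped by noise.
p_corrupted = 1 - (1 - noise) ** moves_in_probe
print(f"{p_corrupted:.1%}")  # prints 18.5%
```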

v7 — Gradual with auto-classification (not kept)

Used Gradual's own punishment phases as natural probes. Classified correctly but switched to always-cooperate, which still loses to TFT (free DC=5 on every noise event).

v8 — Gradual with TFT fallback

Key fix from v7: after detecting a mirror, switch to TFT mode (not always-cooperate). This cancels noise symmetrically. Reduced TFT loss from 9-87 to 25-52. Elo: 2114 (#2).

v9 — Faster classification (champion)

Lowered the detection threshold from 5 to 3 observations and the mirroring threshold from 70% to 60%. Classifies TFT after ~2 punishment cycles instead of ~4, reducing early-game damage. Nearly ties TFT at 40-47. Elo: 2196 (#1).
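
A sketch of how the two thresholds might interact. What counts as an "observation" is an assumption here (any round that follows one of our own defections), and the function and parameter names are hypothetical. Only the default values 3 and 60% come from the description above.

```python
def looks_like_mirror(own_history, opp_history, min_observations=3, mirror_rate=0.60):
    """Return True if the opponent's replies to our defections look like mirroring."""
    observations = 0
    mirrored = 0
    for t in range(1, len(opp_history)):
        # Assumed observation: we defected at t-1, so a mirror would defect at t.
        if own_history[t - 1] == "D":
            observations += 1
            if opp_history[t] == "D":
                mirrored += 1
    if observations < min_observations:
        return False  # not enough evidence yet
    return mirrored / observations >= mirror_rate
```

Passing min_observations=5 and mirror_rate=0.70 gives the slower v8-style setting. Once the check returns True, the strategy would switch from Gradual to TFT mode as described for v8.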

Key insight: diagnosis via --trace

Gradual (v4) crushes Pavlov but loses to TFT. Why? The --trace tool showed the problem in 30 seconds:

 Rnd      v4      v2    Pay       Total  Note
  10       C       D  0,5     19-34    v2 noise    ← noise triggers punishment
  11       D       C  5,0     24-34                 ← punishment round 1
  12       D       D  1,1     25-35                 ← TFT mirrors punishment
  13       D       D  1,1     26-36                 ← escalating...
  14       C       D  0,5     26-41                 ← peace phase, TFT still defecting

Seeing "round 14: peace phase but TFT still defecting" made the solution obvious: detect the mirror, stop punishing it. Three iterations later, v9 nearly ties TFT while dominating everything else.

Final leaderboard

    Version         Elo    WR%   Margin  Games   Opp
——————————————————————————————————————————————————————————
  1 v9             2196  87.3%    +0.4    659    7 *
  2 v8             2114  81.7%    +0.4    651    7
  3 v2             2099  80.0%    +0.1    646    7
  4 v4             1884  60.6%    +0.3    680    7
  5 v6             1672  34.6%    -0.0    584    6
  6 v3             1591  38.7%    +0.2    688    7
  7 v5              621  16.5%    -0.5    600    6
  8 v1             -177   0.1%    -0.8    700    7

Head-to-head matrix

              v1      v2      v3      v4      v5      v6      v8      v9
      v1       —      0%      0%      0%      1%      0%      0%      0%
      v2    100%       —     40%     91%    100%     98%     68%     54%
      v3    100%     60%       —      0%    100%     12%      1%      0%
      v4    100%      9%    100%       —    100%     96%      9%      5%
      v5     99%      0%      0%      0%       —              0%      0%
      v6    100%      2%     88%      4%               —      5%      3%
      v8    100%     32%     99%     91%    100%     95%       —     35%
      v9    100%     46%    100%     95%    100%     97%     65%       —

Dense coverage thanks to suggest — every version tested against 6-7 opponents.
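
The head-to-head percentages appear consistent with win share over decisive games, with draws excluded. That is an inference from the win-loss records quoted in the evolution journey, not something the tracker is documented to do. A quick check, using a hypothetical helper:

```python
def head_to_head_pct(wins, losses):
    """Win share among decisive games, rounded to a whole percent."""
    return round(100 * wins / (wins + losses))


print(head_to_head_pct(25, 52))  # v8 vs v2 -> 32
print(head_to_head_pct(40, 47))  # v9 vs v2 -> 46
```

Both results match the v8 and v9 rows of the matrix in the v2 column.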