Hypothesis-Council
April 19, 2026 · View on GitHub
A variant of karpathy/llm-council adapted for hypothesis testing.
You submit a prediction or claim. A small council of deliberately-diverse LLMs — each grounded with fresh web context — independently scores the likelihood and hunts for counterarguments. A synthesiser merges the verdicts into a Typst report.
The council
Picked for lineage diversity, not size:
| Slot | Model | Why |
|---|---|---|
| 1 | anthropic/claude-sonnet-4.5 | Strong reasoning, measured |
| 2 | x-ai/grok-4 | Different training culture, tends to push back |
| 3 | minimax/minimax-m2 | Non-Western lineage, different priors |
All routed via OpenRouter.
Grounding
Light context injection via Tavily search (not deep research) — the council needs enough to know what the hypothesis refers to, not a full literature review. Each member sees the same grounding pack to keep scoring comparable.
Council bias
Members are prompted as skeptical falsifiers, not cheerleaders. Their job is to:
- Score the hypothesis likelihood (0–100)
- State calibrated confidence
- Enumerate the strongest counterarguments and failure modes
- Flag what evidence would change their mind
Output
A Typst report in reports/ with:
- The hypothesis and grounding summary
- Per-member score + rationale + top counterarguments
- Consensus band + divergence analysis (where they disagreed and why)
- Ranked counterargument list
- Evidence appendix with citations
Usage
export OPENROUTER_API_KEY=...
export TAVILY_API_KEY=...
python -m council "Iran will resume overt uranium enrichment above 60% within 6 months"
Grading axes
Each member independently scores on three independent 0.0–1.0 axes:
- Conspiratorial — how much hidden coordination / actors-against-stated-interests the theory requires. High ≠ bad; some true theories are conspiratorial.
- Credible — internal coherence and factual consistency, independent of probability.
- Likely — the member's actual probability estimate.
Plus per member: supporting evidence, refuting evidence, load-bearing assumptions, what-would-shift-me, and — critically — an alternative read: their own best explanation of what's actually happening, so the user isn't just told "unlikely" but shown a competing narrative.
Demo run
See examples/demo-run/ for a full council evaluation of a theory about the current Israel-Iran war dynamics (input).
Council consensus: high-conspiratorial (0.85 mean), low-credible (0.32), low-likely (0.13). All three members converged independently on the same alternative explanation — pragmatic multi-party de-escalation driven by Gulf-state pressure and Iranian assurances, rather than a coordinated US-Iran ruse.
The interesting part isn't that the theory was rated unlikely — it's that three models from three different lineages (Anthropic, xAI, MiniMax) produced structurally similar counter-narratives, which is some evidence the alternative read isn't just one model's bias.
Reports
| Run | Report |
|---|---|
| 2026-04-18 21:43:36 | reports/2026-04-18_214336/report.pdf |
| 2026-04-18 21:46:45 | reports/2026-04-18_214645/report.pdf |
| demo-run | examples/demo-run/report.pdf |