How graphify works
May 18, 2026 · View on GitHub
The three passes
graphify processes your files in three passes:
Pass 1 — Code structure (free, no API calls) Tree-sitter parses your code files and extracts classes, functions, imports, call graphs, and inline comments. This runs locally with no LLM involved. 25 languages supported. SQL files get special treatment: tables, views, foreign keys, and JOIN relationships are extracted deterministically.
Code files are not sent to the LLM semantic extractor in the normal pipeline. If a corpus contains only code files, Pass 3 is skipped entirely; semantic extraction is reserved for docs, papers, images, and transcripts.
Pass 2 — Video and audio (local, no API calls) Video and audio files are transcribed with faster-whisper. To focus the transcript on your domain, the transcription prompt is seeded with your top god nodes (the most-connected concepts in your code graph so far). Transcripts are cached — re-runs skip already-processed files.
Pass 3 — Docs, papers, images (Claude subagents, costs tokens) Claude runs in parallel over markdown, PDFs, images, and transcripts. Each subagent reads a batch of files and outputs a JSON fragment: nodes, edges, and any group relationships. The fragments are merged into a single graph.
Before Pass 3, optional converters turn supported pointer/binary formats into
Markdown sidecars under graphify-out/converted/. Office files (.docx,
.xlsx) use the [office] extra. Google Workspace shortcuts (.gdoc,
.gsheet, .gslides) are opt-in with --google-workspace or
GRAPHIFY_GOOGLE_WORKSPACE=1 and require an authenticated gws CLI.
How community detection works
Communities are found using the Leiden algorithm — a graph-clustering method that groups nodes by edge density. Nodes with many connections between them end up in the same community.
No embeddings needed. The semantic similarity edges that Claude extracts (semantically_similar_to) are already in the graph, so they influence community shape directly. The graph structure is the similarity signal — there's no separate embedding step or vector database.
Confidence tagging
Every relationship is tagged with one of three labels:
| Tag | Meaning |
|---|---|
EXTRACTED | Found directly in the source (e.g. a function call, an import) |
INFERRED | A reasonable inference Claude made, with a confidence_score (0.0–1.0) |
AMBIGUOUS | Uncertain — flagged in the report for manual review |
EXTRACTED edges always have confidence 1.0. INFERRED edges use a discrete rubric:
- 0.95 — near-certain (explicit cross-file reference, one plausible target)
- 0.85 — strong evidence (naming + context align)
- 0.75 — reasonable (contextual but not explicit)
- 0.65 — weak (naming similarity only)
- 0.55 — speculative
Token benchmark
The first run extracts and builds the graph — this costs tokens. Every subsequent query reads the compact graph instead of raw files. That's where the savings compound.
On a mixed corpus (Karpathy repos + 5 papers + 4 images, 52 files): 71.5x fewer tokens per query vs reading the raw files directly.
| Corpus | Files | Reduction |
|---|---|---|
| Karpathy repos + papers + images | 52 | 71.5x |
| graphify source + Transformer paper | 4 | 5.4x |
| httpx (synthetic Python library) | 6 | ~1x |
Token reduction scales with corpus size. Six files already fits in a context window — the graph value there is structural clarity, not compression. At 52 files the savings compound quickly.
Each worked/ folder in the repo has the raw input files and actual output (GRAPH_REPORT.md, graph.json) so you can run it yourself and verify.
Parallel extraction
Code files are extracted in parallel using ProcessPoolExecutor — bypasses Python's GIL for genuine multiprocessing. Doc/paper/image batches are dispatched as parallel Claude subagents. On a corpus of 84 code files, parallel AST extraction runs in about 1.66x less time than sequential.
SHA256 cache
Every extracted file is fingerprinted by content hash. Re-runs skip unchanged files entirely — only new or modified files go through extraction again. The cache lives in graphify-out/cache/.
The graph format
The output graph.json uses NetworkX's node-link format. Each node has:
id— stable identifierlabel— human-readable namefile_type—code,document,paper,image,rationalesource_file— where it came from
Each edge has:
source,target— node IDsrelation— verb phrase (e.g.calls,imports,implements,semantically_similar_to)confidence—EXTRACTED,INFERRED, orAMBIGUOUSconfidence_score— float (INFERRED only)source_file— where the relationship was found
Hyperedges (group relationships connecting 3+ nodes) live in G.graph["hyperedges"].