How graphify works

May 18, 2026

The three passes

graphify processes your files in three passes:

Pass 1: Code structure (free, no API calls)

Tree-sitter parses your code files and extracts classes, functions, imports, call graphs, and inline comments. This runs locally with no LLM involved. 25 languages supported. SQL files get special treatment: tables, views, foreign keys, and JOIN relationships are extracted deterministically.

Code files are not sent to the LLM semantic extractor in the normal pipeline. If a corpus contains only code files, Pass 3 is skipped entirely; semantic extraction is reserved for docs, papers, images, and transcripts.
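
To give a feel for what deterministic structure extraction looks like, here is a minimal sketch. It uses Python's built-in ast module as a stand-in for Tree-sitter, so it only handles Python source; the extract_structure function and its output keys are hypothetical and are not graphify's actual API.

```python
import ast


def extract_structure(source: str) -> dict:
    """Collect class names, function names, imports, and simple call targets."""
    tree = ast.parse(source)
    found = {"classes": [], "functions": [], "imports": [], "calls": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            found["classes"].append(node.name)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            found["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            found["imports"].append(node.module or "")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only plain-name calls such as foo(); attribute calls are skipped here.
            found["calls"].append(node.func.id)
    return found


sample = '''
import json
from pathlib import Path

class Loader:
    def load(self, path):
        return json.loads(Path(path).read_text())

def main():
    print(Loader().load("graph.json"))
'''

structure = extract_structure(sample)
```

Everything in the result comes from the parse tree alone, which is why a pass like this needs no API calls.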

Pass 2: Video and audio (local, no API calls)

Video and audio files are transcribed with faster-whisper. To focus the transcript on your domain, the transcription prompt is seeded with your top god nodes (the most-connected concepts in your code graph so far). Transcripts are cached, so re-runs skip already-processed files.
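
The idea of seeding a prompt with god nodes can be sketched as a degree ranking over the edges found so far. The helper name, the sample edges, and the prompt wording below are all hypothetical; how graphify actually builds the prompt and passes it to faster-whisper is not shown in this document.

```python
from collections import Counter


def top_god_nodes(edges, k=3):
    """Return the k node names that appear in the most edges."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return [name for name, _ in degree.most_common(k)]


# Hypothetical edges from an earlier code pass.
edges = [
    ("Trainer", "Model"),
    ("Trainer", "Optimizer"),
    ("Trainer", "DataLoader"),
    ("Model", "Attention"),
    ("Model", "Optimizer"),
]

god_nodes = top_god_nodes(edges, k=2)
# Hypothetical prompt text that could be handed to the transcriber.
prompt = "Key terms: " + ", ".join(god_nodes)
```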

Pass 3: Docs, papers, images (Claude subagents, costs tokens)

Claude runs in parallel over markdown, PDFs, images, and transcripts. Each subagent reads a batch of files and outputs a JSON fragment: nodes, edges, and any group relationships. The fragments are merged into a single graph.
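
A merge step of this kind might look like the sketch below. The fragment keys (nodes, edges, hyperedges) and the "first definition wins" rule are assumptions made for illustration, not graphify's documented behavior.

```python
def merge_fragments(fragments):
    """Merge per-batch fragments, deduplicating nodes by id and edges by triple."""
    nodes = {}
    edges = {}
    hyperedges = []
    for fragment in fragments:
        for node in fragment.get("nodes", []):
            nodes.setdefault(node["id"], node)  # assumption: first definition wins
        for edge in fragment.get("edges", []):
            key = (edge["source"], edge["target"], edge["relation"])
            edges.setdefault(key, edge)
        hyperedges.extend(fragment.get("hyperedges", []))
    return {
        "nodes": list(nodes.values()),
        "edges": list(edges.values()),
        "hyperedges": hyperedges,
    }


# Two hypothetical subagent outputs that both mention the same concept.
fragment_a = {
    "nodes": [{"id": "attention", "label": "Attention"}],
    "edges": [],
}
fragment_b = {
    "nodes": [
        {"id": "attention", "label": "Attention"},
        {"id": "softmax", "label": "Softmax"},
    ],
    "edges": [{"source": "attention", "target": "softmax", "relation": "uses"}],
}

merged = merge_fragments([fragment_a, fragment_b])
```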

Before Pass 3, optional converters turn supported pointer/binary formats into Markdown sidecars under graphify-out/converted/. Office files (.docx, .xlsx) use the [office] extra. Google Workspace shortcuts (.gdoc, .gsheet, .gslides) are opt-in with --google-workspace or GRAPHIFY_GOOGLE_WORKSPACE=1 and require an authenticated gws CLI.


How community detection works

Communities are found using the Leiden algorithm — a graph-clustering method that groups nodes by edge density. Nodes with many connections between them end up in the same community.

No embeddings needed. The semantic similarity edges that Claude extracts (semantically_similar_to) are already in the graph, so they influence community shape directly. The graph structure is the similarity signal — there's no separate embedding step or vector database.
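
To make "groups nodes by edge density" concrete, the sketch below scores two candidate partitions of a tiny graph with modularity, one of the quality functions Leiden can optimize. It is not an implementation of Leiden and says nothing about which quality function graphify configures; the graph and partitions are made up.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition for an unweighted, undirected graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Two triangles joined by a single bridge edge (c, d).
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]

by_density = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
mixed = {"a": 0, "d": 0, "b": 1, "e": 1, "c": 2, "f": 2}

q_dense = modularity(edges, by_density)
q_mixed = modularity(edges, mixed)
```

The partition that keeps each triangle together scores higher than the one that splits them, which is the sense in which densely connected nodes end up in the same community.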


Confidence tagging

Every relationship is tagged with one of three labels:

Tag        Meaning
EXTRACTED  Found directly in the source (e.g. a function call, an import)
INFERRED   A reasonable inference Claude made, with a confidence_score (0.0–1.0)
AMBIGUOUS  Uncertain; flagged in the report for manual review

EXTRACTED edges always have confidence 1.0. INFERRED edges use a discrete rubric:

  • 0.95 — near-certain (explicit cross-file reference, one plausible target)
  • 0.85 — strong evidence (naming + context align)
  • 0.75 — reasonable (contextual but not explicit)
  • 0.65 — weak (naming similarity only)
  • 0.55 — speculative
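
The tagging rules above can be expressed as a small validator. This is a sketch only: tag_edge is a hypothetical helper, and leaving the score unset for AMBIGUOUS edges is an assumption, since the rubric above does not assign them a score.

```python
RUBRIC = (0.95, 0.85, 0.75, 0.65, 0.55)  # discrete scores allowed for INFERRED edges


def tag_edge(source, target, relation, confidence, score=None):
    """Build an edge dict, enforcing the tag and score rules described above."""
    if confidence == "EXTRACTED":
        score = 1.0  # EXTRACTED edges always have confidence 1.0
    elif confidence == "INFERRED":
        if score not in RUBRIC:
            raise ValueError(f"INFERRED score must be one of {RUBRIC}")
    elif confidence == "AMBIGUOUS":
        score = None  # assumption: no score is recorded for ambiguous edges
    else:
        raise ValueError(f"unknown confidence tag: {confidence}")
    return {
        "source": source,
        "target": target,
        "relation": relation,
        "confidence": confidence,
        "confidence_score": score,
    }


extracted = tag_edge("train", "load_data", "calls", "EXTRACTED")
inferred = tag_edge("Trainer", "AdamW", "implements", "INFERRED", 0.75)
```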

Token benchmark

The first run extracts and builds the graph — this costs tokens. Every subsequent query reads the compact graph instead of raw files. That's where the savings compound.

On a mixed corpus (Karpathy repos + 5 papers + 4 images, 52 files): 71.5x fewer tokens per query vs reading the raw files directly.

Corpus                               Files  Reduction
Karpathy repos + papers + images     52     71.5x
graphify source + Transformer paper  4      5.4x
httpx (synthetic Python library)     6      ~1x

Token reduction scales with corpus size. Six files already fit in a context window, so the graph's value there is structural clarity, not compression. At 52 files the savings compound quickly.
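
The arithmetic behind a reduction factor, and the point at which the one-time build cost pays for itself, can be sketched as follows. All token counts here are hypothetical and chosen only so the ratio comes out to 71.5; they are not measurements from the benchmark.

```python
import math


def reduction_factor(raw_tokens, graph_tokens):
    """How many times fewer tokens a graph query uses than reading raw files."""
    if graph_tokens <= 0:
        raise ValueError("graph_tokens must be positive")
    return raw_tokens / graph_tokens


def breakeven_queries(build_tokens, raw_tokens, graph_tokens):
    """Smallest query count at which build + graph queries cost no more than raw reads."""
    saved_per_query = raw_tokens - graph_tokens
    if saved_per_query <= 0:
        return None  # the graph never pays off on token count alone
    return math.ceil(build_tokens / saved_per_query)


# Hypothetical numbers for illustration only.
raw_per_query = 143_000
graph_per_query = 2_000
one_time_build = 500_000

factor = reduction_factor(raw_per_query, graph_per_query)
queries_needed = breakeven_queries(one_time_build, raw_per_query, graph_per_query)
```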

Each worked/ folder in the repo has the raw input files and actual output (GRAPH_REPORT.md, graph.json) so you can run it yourself and verify.


Parallel extraction

Code files are extracted in parallel using ProcessPoolExecutor, which runs workers in separate processes and so sidesteps Python's GIL. Doc/paper/image batches are dispatched as parallel Claude subagents. On a corpus of 84 code files, parallel AST extraction is about 1.66x faster than sequential extraction.
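
A minimal sketch of the process-pool pattern is shown below. The worker simply counts definitions with Python's ast module as a stand-in for real extraction; both function names are hypothetical.

```python
import ast
from concurrent.futures import ProcessPoolExecutor


def count_definitions(source: str) -> int:
    """Stand-in worker: count function and class definitions in one file's source."""
    tree = ast.parse(source)
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        for node in ast.walk(tree)
    )


def extract_parallel(sources, max_workers=None):
    """Run the worker over every source string, one task per file, in worker processes."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with sources.
        return list(pool.map(count_definitions, sources))
```

The worker must be a module-level function so it can be pickled and sent to the worker processes. On platforms that start workers by spawning a fresh interpreter, call extract_parallel from under an `if __name__ == "__main__":` guard.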


SHA256 cache

Every extracted file is fingerprinted by content hash. Re-runs skip unchanged files entirely — only new or modified files go through extraction again. The cache lives in graphify-out/cache/.
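
Content-hash caching can be sketched with hashlib. The layout used here (one JSON file per digest inside the cache directory) is an assumption made for illustration; the actual structure of graphify-out/cache/ is not described in this document.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path) -> str:
    """SHA256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def is_cached(path: Path, cache_dir: Path) -> bool:
    """True if an entry exists for the file's current content."""
    return (cache_dir / f"{content_hash(path)}.json").exists()


def store(path: Path, cache_dir: Path, payload: str) -> None:
    """Save an extraction result under the file's content hash."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / f"{content_hash(path)}.json").write_text(payload)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    doc = root / "notes.md"
    cache = root / "graphify-out" / "cache"

    doc.write_text("first draft")
    before = is_cached(doc, cache)       # nothing stored yet
    store(doc, cache, "{}")
    after_store = is_cached(doc, cache)  # unchanged file is now a cache hit
    doc.write_text("second draft")
    after_edit = is_cached(doc, cache)   # modified content hashes differently
```

Because the key is the content rather than the path or modification time, renaming or touching a file does not invalidate its entry, while any edit does.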


The graph format

The output graph.json uses NetworkX's node-link format. Each node has:

  • id: stable identifier
  • label: human-readable name
  • file_type: code, document, paper, image, rationale
  • source_file: where it came from

Each edge has:

  • source, target: node IDs
  • relation: verb phrase (e.g. calls, imports, implements, semantically_similar_to)
  • confidence: EXTRACTED, INFERRED, or AMBIGUOUS
  • confidence_score: float (INFERRED only)
  • source_file: where the relationship was found

Hyperedges (group relationships connecting 3+ nodes) live in G.graph["hyperedges"].
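
As a rough illustration of reading this format without NetworkX, the sketch below parses a hand-written node-link document with the json module. The sample data is invented, and the shape of a hyperedge entry is an assumption. Depending on the NetworkX version and options used when writing, the edge list may be stored under "links" or "edges", so the sketch checks both.

```python
import json

# Hand-written sample in node-link style; real graph.json files are produced by graphify.
raw = json.dumps({
    "directed": False,
    "multigraph": False,
    "graph": {
        # Assumed shape for a hyperedge entry.
        "hyperedges": [{"label": "training loop", "nodes": ["trainer", "model", "optimizer"]}],
    },
    "nodes": [
        {"id": "trainer", "label": "Trainer", "file_type": "code", "source_file": "train.py"},
        {"id": "model", "label": "Model", "file_type": "code", "source_file": "model.py"},
        {"id": "optimizer", "label": "AdamW", "file_type": "paper", "source_file": "adamw.pdf"},
    ],
    "links": [
        {"source": "trainer", "target": "model", "relation": "calls",
         "confidence": "EXTRACTED", "source_file": "train.py"},
        {"source": "trainer", "target": "optimizer", "relation": "implements",
         "confidence": "INFERRED", "confidence_score": 0.85, "source_file": "train.py"},
    ],
})

data = json.loads(raw)
edges = data.get("links") or data.get("edges") or []

# Keep INFERRED edges at or above a chosen score threshold.
strong_inferred = [
    edge for edge in edges
    if edge["confidence"] == "INFERRED" and edge.get("confidence_score", 0.0) >= 0.75
]
hyperedges = data.get("graph", {}).get("hyperedges", [])
```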