README.md
June 25, 2026 · View on GitHub

The AI-native wire format for structured data.
Important
GCF: A Token-Optimized Wire Format for Structured LLM Interactions Dayna Blackwell, 2026. DOI: 10.5281/zenodo.20579817
Structural Ambiguity in BPE Tokenization: From Vocabulary Merges to Attention Collapse Dayna Blackwell, 2026. DOI: 10.5281/zenodo.20789619
100% comprehension on every frontier model. 50-92% fewer tokens than JSON. 91.2% on structurally complex code graphs (vs TOON 68.2%, JSON 53.4%). Proven lossless: decode(encode(value)) == value for every structured value, verified across 43,000,000,000+ round-trips in 5 formats and 17 serialization formats. Zero training required. JSON's grammar symbols are hardcoded as merged vocabulary entries in BPE tokenizers, creating irrecoverable structural boundaries. GCF's pipe delimiter has 0% merge rate with field names.
Encode any structured data as GCF before sending it to an LLM. JSON, YAML, TOML, CSV, MessagePack: GCF encodes them all. The model reads it natively with zero format instructions. decode() converts back to any format when a human needs to see it. Your existing schemas and validators work on the decoded output unchanged.
pip install gcf-python # Python
npm install @blackwell-systems/gcf # TypeScript
go get github.com/blackwell-systems/gcf-go # Go
cargo add gcf # Rust
Or wrap any existing MCP server with zero code changes:
pip install gcf-proxy
Benchmarks
2,400+ LLM evaluations across 11 models, 3 providers, and 50+ independent test runs.
| Generic Profile (500 orders) | Graph Profile (500 symbols) | |
|---|---|---|
| GCF | 100% on every frontier model | 91.2% (10 models) |
| TOON | weakest format consistently | 68.2% |
| JSON | GCF avg >= JSON on every model | 53.4% |
| GCF | TOON | JSON | |
|---|---|---|---|
| Token efficiency (16 datasets) | wins 15/16 | wins 1/16 | wins 0/16 |
| Generation (28 runs, 11 models) | 5/5 | 1.0/5 | 5.0/5 |
| Token savings | 50-92% vs JSON | 30-60% vs JSON | baseline |
| 43,000,000,000+ round-trips | 0 failures |
Full results: gcformat.com/guide/benchmarks
Encode any structured data (generic profile)
from gcf import encode_generic
output = encode_generic({
"employees": [
{"id": 1, "name": "Alice", "department": "Engineering", "salary": 95000},
{"id": 2, "name": "Bob", "department": "Sales", "salary": 72000},
{"id": 3, "name": "Carol", "department": "Marketing", "salary": 85000},
],
})
GCF profile=generic
## employees [3]{id,name,department,salary}
1|Alice|Engineering|95000
2|Bob|Sales|72000
3|Carol|Marketing|85000
One header declares field names. Rows are positional values only. No field names repeated per record. Lossless: decode(encode(value)) == value for every structured value, proven across 43,000,000,000+ random round-trips in 5 formats and 6 languages.
Graph profile (code intelligence, knowledge graphs, MCP tools)
For data with nodes, edges, and distance groups:
from gcf import encode, Payload, Symbol, Edge
output = encode(Payload(
tool="context_for_task", token_budget=5000, tokens_used=1847,
symbols=[
Symbol(qualified_name="pkg.Auth", kind="function", score=0.78, provenance="lsp", distance=0),
Symbol(qualified_name="pkg.Server", kind="function", score=0.54, provenance="lsp", distance=1),
],
edges=[Edge(source="pkg.Server", target="pkg.Auth", edge_type="calls")],
))
GCF profile=graph tool=context_for_task budget=5000 tokens=1847 symbols=2 edges=1
## targets
@0 fn pkg.Auth 0.78 lsp
## related
@1 fn pkg.Server 0.54 lsp
## edges [1]
@0<@1 calls
Local IDs (@0, @1) replace full names in edges. 233 tokens instead of 965 for JSON.
Try it live in the playground with real-time multi-format comparison. Paste JSON, YAML, or TOML. Encode from and decode to JSON, YAML, TOML, CSV, and MessagePack.
How it works
Generic profile
Lossless structured data encoding. Arrays, nested objects, mixed types, primitives, root scalars. Works on any data that deserializes to objects and arrays, regardless of source format.
- Arrays of objects.
## name [count]{field1,field2}declares field names once. Rows are pipe-separated values. Absent fields use~, null uses-. - Nested objects. Fixed-shape nested objects are flattened into
>path columns:"customer>name"becomes a column, values go directly in the row. 20-48% fewer tokens on deeply nested API data. Variable-length arrays and irregular shapes use^attachment fallback. - Primitive arrays. Inlined:
tags[2]: admin,user. Strings containing commas are quoted. - Scalars.
key=valueat the top level. Strings that collide with typed literals ("true","123","-") are quoted automatically. - Root values. Objects, arrays, and scalars at the document root. Every JSON value has a GCF representation.
Graph profile
- Positional fields. One header declares field names. Rows are values only.
- Local IDs.
@0,@1. Edges reference by ID, not by repeating full identifiers. - Hierarchical grouping. Section headers (
## targets,## related) replace per-record metadata.
Both profiles share the same grammar (common scalar grammar, key grammar, header format). The savings are structural and grow with payload size.
It gets cheaper over time
Session deduplication: Symbols sent in prior responses become bare references (@7 = 2 tokens vs 19 for full declaration). Over a 5-call session at production scale (500 symbols): 84.3% cumulative savings vs JSON (148K JSON tokens vs 23K GCF). By call 5: 91.9% savings.
Delta encoding: When the context changes slightly between queries, send only the diff. 81.2% additional savings on re-queries.
No other format has these. They compound across multi-turn agent interactions.
Implementations
| Language | Package | Repository |
|---|---|---|
| Go | go get github.com/blackwell-systems/gcf-go | gcf-go |
| TypeScript | npm install @blackwell-systems/gcf | gcf-typescript |
| Python | pip install gcf-python | gcf-python |
| Rust | cargo add gcf | gcf-rust |
| Swift | Swift Package Manager | gcf-swift |
| Kotlin | JitPack | gcf-kotlin |
| MCP Proxy | pip install gcf-proxy | gcf-proxy (bidirectional, session dedup, HTTP frontend) |
| Claude Code Plugin | /plugin install | gcf-claude-plugin (one-command install, session stats hook) |
| Codex Plugin | codex plugin add | gcf-codex-plugin (one-command install, session stats hook) |
| VS Code | ext install blackwell-systems.gcf-vscode | gcf-vscode (syntax highlighting) |
| n8n | npm install n8n-nodes-gcf | gcf-n8n-nodes (workflow encode/decode) |
| JetBrains | Search "GCF" in Plugins | gcf-jetbrains (IntelliJ, PyCharm, WebStorm, GoLand) |
| Zed | Search "GCF" in Extensions | gcf-zed (tree-sitter syntax highlighting) |
| Tree-sitter | npm install tree-sitter-gcf | tree-sitter-gcf |
Zero runtime dependencies. Permanently. All six implementations depend only on their language's standard library. No transitive dependencies. No supply chain risk. This is a permanent commitment: GCF will never take on external runtime dependencies. MIT licensed. All implementations support both generic profile (encodeGeneric) and graph profile (encode). CLI included in all 6 languages. Syntax highlighting via tree-sitter (Neovim, Helix, Zed).
Specification: SPEC v3.2 Stable with 173 conformance fixtures, 43,000,000,000+ lossless round-trips verified across 5 formats and 6 languages. All implementations at v2.2.0+ (Go v1.3.0). Cross-language 6x6 matrix verified.
Documentation
- Getting Started
- Benchmarks
- Benchmarks (Full Data)
- GCF vs TOON
- Schema Validation
- FAQ
- Tokenizer Analysis (why JSON's grammar breaks at the BPE level)
- Independent AI Reviews
- Playground
- Specification
Adopted by
Speakeasy (API tooling, customers include Google, Verizon, Mistral AI, DocuSign, Vercel) · OmniRoute (6.1K stars) · NetClaw (556 stars) · ctx (510 stars) · NeuroNest · Open Data Products SDK (Linux Foundation) · Raycast · and more
Use cases
- MCP tool responses. Any MCP server returning structured data. 50-92% fewer tokens with 100% comprehension accuracy.
- Agent-to-agent communication. 63% fewer tokens per handoff. 5/5 generation validity on every frontier model.
- LLM structured output. LLMs produce valid GCF with a 3-line primer. No training required.
- Code intelligence. Graph profile with local IDs, edges, and distance grouping.
- Multi-format interop. Validated lossless across 17 serialization formats (JSON, XML, MessagePack, YAML, BSON, TOML, CBOR, Protobuf, CSV, JSON5, Avro, Arrow, Parquet, Pickle, INI, NDJSON, Plist).
License
MIT - Dayna Blackwell
