Graph Schema

May 28, 2026

When you ingest documents into GraphRAG SDK, the system builds a property graph in FalkorDB. This document explains exactly what that graph looks like — what node types exist, what edges connect them, what properties they carry, and what indexes make it searchable.

Understanding the graph structure helps you write custom Cypher queries, debug ingestion quality, and tune retrieval.

Looking to change an existing ontology — add/drop attributes, rename labels, retype fields? See Ontology Evolution for the atomic-evolve API and the alignment invariant it enforces.


The Big Picture

┌──────────┐                   ┌──────────┐
│ Document │──PART_OF─────────>│  Chunk   │
│          │                   │ (embed.) │
└──────────┘                   └─────┬────┘
                                     │
                                     │ NEXT_CHUNK
                                     │
                               ┌─────v────┐
                               │  Chunk   │
                               │ (embed.) │
                               └──────────┘
                                     ^
                                     │ MENTIONED_IN
                                     │
┌──────────┐                   ┌─────┴────┐
│  Person  │──RELATES─────────>│  Org     │
│__Entity__│                   │__Entity__│
│ (embed.) │<──RELATES─────────│ (embed.) │
└──────────┘                   └──────────┘

The graph has two layers:

  1. Lexical layer — Document and Chunk nodes with provenance edges (the text you ingested)
  2. Knowledge layer — Entity nodes with relationship edges (the structured knowledge extracted from that text)

The layers are connected by MENTIONED_IN edges, which link entities to the chunks where they were found.


Node Labels

Document

Purpose: Represents a source document that was ingested.

| Property | Type | Description |
| --- | --- | --- |
| id | string | Unique ID (auto-generated UUID) |
| path | string | Original file path or source identifier |
| Any metadata | varies | Custom metadata passed via DocumentInfo |

Created in: Step 3 (Lexical Graph) of the ingestion pipeline.

Chunk

Purpose: A text fragment from a document. Chunks are the atomic unit of retrieval — when the system finds relevant information, it returns chunks.

| Property | Type | Description |
| --- | --- | --- |
| id | string | Unique ID (auto-generated UUID) |
| text | string | The chunk's text content |
| index | integer | Position within the document (0-based) |
| embedding | vecf32 | Vector embedding for semantic search (added in Step 9) |
| start_char | integer | Start character offset in the source document (if FixedSizeChunking) |
| end_char | integer | End character offset |
| chunk_size | integer | Configured chunk size |
| chunk_overlap | integer | Configured overlap |

Created in: Step 3 (Lexical Graph). Embedding added in Step 9.
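The offset properties are easier to read next to a worked example. The sketch below is not the SDK's FixedSizeChunking implementation; it is a minimal character-based splitter (the function name fixed_size_chunks is hypothetical, and an exclusive end_char is an assumption) that shows how index, start_char, and end_char could relate to chunk_size and chunk_overlap.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))  # assumed exclusive offset
        chunks.append({
            "index": len(chunks),          # 0-based position in the document
            "text": text[start:end],
            "start_char": start,
            "end_char": end,
            "chunk_size": chunk_size,
            "chunk_overlap": chunk_overlap,
        })
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


chunks = fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
```

With these inputs each window starts three characters after the previous one, so consecutive chunks share exactly one character.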

Entity Nodes (Person, Organization, etc.)

Purpose: Extracted knowledge entities. Each entity node has two labels: its domain type (e.g., Person) and __Entity__ (a secondary label shared by all extracted entities).

| Property | Type | Description |
| --- | --- | --- |
| id | string | Deterministic: "name__type" (e.g., "alice__person") |
| name | string | Entity name as extracted |
| description | string | Rich description from LLM extraction |
| source_chunk_ids | list[string] | Chunks where this entity was found |
| spans | string (JSON) | Character offsets: {chunk_id: [{start, end}]} |
| embedding | vecf32 | Vector embedding of the entity name (added during finalize()) |

Created in: Step 7 (Write). __Entity__ label is automatically added. Embedding is backfilled during finalize().

Default entity types (11): Person, Organization, Technology, Product, Location, Date, Event, Concept, Law, Dataset, Method

Entities below the NER confidence threshold get the special label Unknown.
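A deterministic ID means the same entity extracted from two different chunks resolves to the same node. The sketch below reproduces only the documented "name__type" shape; the function name entity_id_sketch and the exact normalization (lowercasing, trimming, collapsing whitespace) are assumptions, and the SDK's compute_entity_id() may normalize differently.

```python
def entity_id_sketch(name: str, entity_type: str) -> str:
    """Build a "name__type" ID (assumed normalization, not the SDK's code)."""
    # Collapse runs of whitespace and lowercase, so "Alice" and " ALICE "
    # produce the same ID and therefore the same node.
    normalized_name = " ".join(name.split()).lower()
    normalized_type = entity_type.strip().lower()
    return f"{normalized_name}__{normalized_type}"


first = entity_id_sketch("Alice", "Person")
second = entity_id_sketch("  ALICE ", "Person")
```

Because both calls return the same string, a write keyed on id would update one node rather than create a duplicate.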


Edge Types

PART_OF (Document -> Chunk)

Purpose: Provenance — tracks which document a chunk came from.

| Property | Type | Description |
| --- | --- | --- |
| index | integer | Chunk position within the document |

NEXT_CHUNK (Chunk -> Chunk)

Purpose: Sequential ordering — preserves the reading order of chunks within a document. Used to fetch neighboring chunks for context expansion.

No additional properties.
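Context expansion can be written as a custom Cypher query over NEXT_CHUNK. The helper below is a sketch, not part of the SDK: neighbor_chunks_query is a hypothetical name, and it inlines the chunk ID as a literal (after checking that it parses as a UUID) because how query_raw accepts parameters is not covered here.

```python
import uuid


def neighbor_chunks_query(chunk_id: str) -> str:
    """Build a Cypher query returning a chunk with its previous and next chunk."""
    # Chunk IDs are documented as UUIDs; parsing rejects anything else,
    # which also keeps quote characters out of the inlined literal.
    uuid.UUID(chunk_id)
    return (
        f"MATCH (c:Chunk {{id: '{chunk_id}'}}) "
        "OPTIONAL MATCH (prev:Chunk)-[:NEXT_CHUNK]->(c) "
        "OPTIONAL MATCH (c)-[:NEXT_CHUNK]->(next:Chunk) "
        "RETURN prev.text AS before, c.text AS hit, next.text AS after"
    )


query = neighbor_chunks_query("123e4567-e89b-12d3-a456-426614174000")
```

OPTIONAL MATCH keeps the row when the chunk is the first or last in its document, in which case before or after comes back as null.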

MENTIONED_IN (Entity -> Chunk)

Purpose: Co-occurrence — links entities to the chunks where they were extracted. This is a critical edge for retrieval: when you find an entity, you can traverse to its source chunks.

No additional properties. Deduplicated by (entity_id, chunk_id).
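Deduplication by (entity_id, chunk_id) means an entity mentioned three times in one chunk still yields a single MENTIONED_IN edge. How the SDK enforces this internally is not shown here; the following is a minimal in-memory sketch with a hypothetical dedupe_mentions helper.

```python
def dedupe_mentions(mentions: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first occurrence of each (entity_id, chunk_id) pair, in order."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for entity_id, chunk_id in mentions:
        key = (entity_id, chunk_id)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


edges = dedupe_mentions([
    ("alice__person", "chunk-1"),
    ("alice__person", "chunk-1"),  # repeated mention in the same chunk
    ("alice__person", "chunk-2"),
])
```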

RELATES (Entity -> Entity)

Purpose: All extracted relationships between entities. This is the only relationship type used for knowledge edges.

| Property | Type | Description |
| --- | --- | --- |
| rel_type | string | Original relationship type (e.g., "WORKS_AT", "LOCATED_IN") |
| fact | string | Human-readable fact: "(Alice, WORKS_AT, Acme Corp): Alice is a senior engineer" |
| description | string | Relationship description from LLM |
| keywords | string | Comma-separated terms for fulltext search |
| weight | float | Confidence: 1.0 = explicit, 0.5 = implied, 0.2 = weak inference |
| src_name | string | Source entity name |
| tgt_name | string | Target entity name |
| source_chunk_ids | list[string] | Chunks containing evidence for this relationship |
| spans | string (JSON) | Character offsets of the evidence sentence |
| embedding | vecf32 | Vector embedding of the fact text (added during finalize()) |
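Two of these properties are structured strings, so they usually need a small amount of handling in client code. The sketch below shows one way to compose a fact in the documented shape and to decode spans back into text. Both helper names are hypothetical, and it assumes that RELATES spans use the same {chunk_id: [{start, end}]} layout as entity spans, with offsets relative to the chunk text and an exclusive end.

```python
import json


def format_fact(src_name: str, rel_type: str, tgt_name: str, description: str) -> str:
    """Compose a fact string in the "(src, REL, tgt): description" shape."""
    return f"({src_name}, {rel_type}, {tgt_name}): {description}"


def evidence_snippets(spans_json: str, chunk_texts: dict[str, str]) -> list[str]:
    """Decode the JSON spans property and slice the matching chunk text."""
    spans = json.loads(spans_json)
    snippets = []
    for chunk_id, offsets in spans.items():
        text = chunk_texts.get(chunk_id, "")  # unknown chunks yield empty slices
        for span in offsets:
            snippets.append(text[span["start"]:span["end"]])
    return snippets


fact = format_fact("Alice", "WORKS_AT", "Acme Corp", "Alice is a senior engineer")
snippets = evidence_snippets(
    '{"chunk-1": [{"start": 0, "end": 5}]}',
    {"chunk-1": "Alice works at Acme Corp."},
)
```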

Why a Single RELATES Edge Type?

You might expect separate edge types like WORKS_AT, LOCATED_IN, and MARRIED_TO. Instead, all relationships use the single RELATES type with the original type stored in the rel_type property. Here's why:

  1. Index efficiency. Each edge type in FalkorDB needs its own vector index. With potentially hundreds of LLM-generated relationship types, you'd need hundreds of indexes. One RELATES type means one vector index that covers all relationships.

  2. Consistent retrieval. The retrieval system searches all relationships at once via the RELATES edge vector index. Having a single type means one query covers everything.

  3. No information loss. The original type is preserved in rel_type and appears in the fact string, so you can still filter by type in custom Cypher queries:

    MATCH (a:Person)-[r:RELATES {rel_type: "WORKS_AT"}]->(b:Organization)
    RETURN a.name, b.name
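The same filter combines naturally with the weight property when only well-supported relationships are wanted. The builder below is an illustrative sketch rather than an SDK function: relates_query is a hypothetical name, and it inlines validated values because parameter passing for query_raw is not described here.

```python
import re

# Relationship types are treated as plain identifiers such as WORKS_AT.
_REL_TYPE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def relates_query(rel_type: str, min_weight: float = 0.0, limit: int = 25) -> str:
    """Build a Cypher query for RELATES edges of one rel_type above a weight."""
    if not _REL_TYPE.fullmatch(rel_type):
        raise ValueError(f"unexpected relationship type: {rel_type!r}")
    return (
        f"MATCH (a:__Entity__)-[r:RELATES {{rel_type: '{rel_type}'}}]->(b:__Entity__) "
        f"WHERE r.weight >= {float(min_weight)} "
        "RETURN a.name, r.fact, b.name, r.weight "
        f"ORDER BY r.weight DESC LIMIT {int(limit)}"
    )


query = relates_query("WORKS_AT", min_weight=0.5)
```

A threshold of 0.5 keeps explicit and implied relationships and drops weak inferences, following the weight scale in the table above.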
    

Indexes

The SDK creates 5 standard indexes during finalize() (or ensure_indices()). All are idempotent — safe to create repeatedly.

Vector Indexes (3)

| Target | Property | Purpose |
| --- | --- | --- |
| Chunk nodes | embedding | Semantic search over text passages |
| __Entity__ nodes | embedding | Semantic search over entity names |
| RELATES edges | embedding | Semantic search over relationship facts |

Syntax:

CREATE VECTOR INDEX FOR (n:Chunk) ON (n.embedding)
OPTIONS {dimension:256, similarityFunction:'cosine'}
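For reference, the three vector indexes could be expressed as the statements below. This is a sketch of what equivalent FalkorDB statements look like, not a copy of what finalize() sends: the helper name is hypothetical, the edge form uses FalkorDB's relationship-pattern syntax, and the dimension must match the output size of your embedding model (256 is just the value from the syntax example).

```python
def vector_index_statements(dimension: int = 256, similarity: str = "cosine") -> list[str]:
    """Return CREATE VECTOR INDEX statements for the three documented targets."""
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    if similarity not in ("cosine", "euclidean"):
        raise ValueError("similarity must be 'cosine' or 'euclidean'")
    options = f"OPTIONS {{dimension:{dimension}, similarityFunction:'{similarity}'}}"
    return [
        f"CREATE VECTOR INDEX FOR (n:Chunk) ON (n.embedding) {options}",
        f"CREATE VECTOR INDEX FOR (n:__Entity__) ON (n.embedding) {options}",
        # Edge indexes are declared on a relationship pattern instead of a node.
        f"CREATE VECTOR INDEX FOR ()-[r:RELATES]->() ON (r.embedding) {options}",
    ]


statements = vector_index_statements()
```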

Fulltext Indexes (2)

| Target | Properties | Purpose |
| --- | --- | --- |
| Chunk nodes | text | Keyword search over text passages |
| __Entity__ nodes | name, description | Keyword search over entity names and descriptions |

Syntax:

CALL db.idx.fulltext.createNodeIndex('Chunk', 'text')
CALL db.idx.fulltext.createNodeIndex('__Entity__', 'name', 'description')
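Once the fulltext indexes exist, they can be queried directly with FalkorDB's db.idx.fulltext.queryNodes procedure. The helper below is a sketch for custom queries (fulltext_chunk_query is a hypothetical name, and the SDK's own retrieval code may query differently); it escapes quotes in the search terms before inlining them.

```python
def fulltext_chunk_query(terms: str, limit: int = 10) -> str:
    """Build a keyword search over Chunk.text using the fulltext index."""
    # Escape backslashes first, then single quotes, so the terms stay
    # inside the Cypher string literal.
    escaped = terms.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"CALL db.idx.fulltext.queryNodes('Chunk', '{escaped}') "
        "YIELD node, score "
        "RETURN node.id, node.text, score "
        f"ORDER BY score DESC LIMIT {int(limit)}"
    )


query = fulltext_chunk_query("senior engineer", limit=5)
```

Swapping 'Chunk' for '__Entity__' would target the entity index over name and description instead.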

The Provenance Chain

The graph structure ensures complete traceability from any answer back to its source:

Answer
  ↑ (generated by LLM from context)
Chunk text passages
  ↑ MENTIONED_IN (entity → chunk where it was found)
Entity relationships (RELATES)
  ↑ extracted from chunks
Chunk nodes
  ↑ PART_OF (document → chunk)
Document node
  ↑ (original source file)
Your Document

This is the Zero-Loss Data principle: every piece of source material is traceable in the graph. When the retrieval system provides context to the LLM, it can always point back to which document and which chunk the information came from.
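In Cypher, the chain from an entity back to its source file is a single two-hop pattern. The query below is a sketch that uses only the labels, edges, and properties described in this document; the entity ID 'alice__person' is the example value from the entity table, and the string can be passed to query_raw like the other raw queries.

```python
# Walk entity -> chunk (MENTIONED_IN), then chunk <- document (PART_OF).
PROVENANCE_QUERY = (
    "MATCH (e:__Entity__ {id: 'alice__person'})-[:MENTIONED_IN]->(c:Chunk)"
    "<-[:PART_OF]-(d:Document) "
    "RETURN d.path AS document, c.index AS chunk_index, c.text AS passage "
    "ORDER BY document, chunk_index"
)
```

Each returned row names the document path, the chunk position within it, and the passage where the entity was found.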


Defining Your Own Schema

A GraphSchema tells the extraction pipeline which entity and relationship types to look for, and the pruning step uses it to filter non-conforming data.

Basic Schema

from graphrag_sdk import GraphSchema, EntityType, RelationType

schema = GraphSchema(
    entities=[
        EntityType(label="Person", description="A human being"),
        EntityType(label="Organization", description="A company or institution"),
    ],
    relations=[
        RelationType(label="WORKS_AT", description="Employment relationship"),
    ],
)

Schema with Patterns

Patterns define which source-target pairs are valid for each relationship type. They are specified directly on RelationType:

schema = GraphSchema(
    entities=[
        EntityType(label="Person"),
        EntityType(label="Organization"),
        EntityType(label="Location"),
    ],
    relations=[
        RelationType(label="WORKS_AT", patterns=[("Person", "Organization")]),
        RelationType(label="LOCATED_IN", patterns=[("Organization", "Location")]),
    ],
)

A relationship with an empty patterns list is allowed between any entity types.
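The pattern rule can be pictured as a simple membership check. The sketch below uses plain dictionaries instead of the SDK classes so it runs on its own; is_allowed and the RELATED_TO type are hypothetical, and treating patterns as directional and rejecting undeclared relationship types are assumptions about the pruning step, not confirmed behavior.

```python
PATTERNS = {
    "WORKS_AT": [("Person", "Organization")],
    "LOCATED_IN": [("Organization", "Location")],
    "RELATED_TO": [],  # empty list: any source/target pair is accepted
}


def is_allowed(rel_type: str, src_label: str, tgt_label: str,
               patterns: dict[str, list[tuple[str, str]]] = PATTERNS) -> bool:
    """Check an extracted (source, relation, target) triple against patterns."""
    if rel_type not in patterns:
        return False  # assumption: undeclared relationship types are dropped
    allowed = patterns[rel_type]
    return not allowed or (src_label, tgt_label) in allowed


ok = is_allowed("WORKS_AT", "Person", "Organization")
reversed_pair = is_allowed("WORKS_AT", "Organization", "Person")
```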

Open Schema Mode

If you create an empty schema (GraphSchema()), the pipeline operates in open schema mode:

  • The LLM extracts any entities and relationships it finds
  • The pruning step is skipped entirely
  • The 11 default entity types are used for NER

This is good for exploration. For production, a defined schema produces cleaner, more consistent graphs.
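The difference between the two modes comes down to whether the pruning filter runs at all. A minimal sketch of that branch, with a hypothetical prune_entities helper and entities represented as plain dictionaries (the SDK's real pruning logic lives in the ingestion pipeline and may do more):

```python
def prune_entities(entities: list[dict], schema_labels: set[str]) -> list[dict]:
    """Drop entities whose label is not in the schema; skip when schema is empty."""
    if not schema_labels:
        return list(entities)  # open schema mode: nothing is filtered
    return [e for e in entities if e["label"] in schema_labels]


extracted = [
    {"name": "Alice", "label": "Person"},
    {"name": "Acme Corp", "label": "Organization"},
    {"name": "Python", "label": "Technology"},
]
open_mode = prune_entities(extracted, set())
strict_mode = prune_entities(extracted, {"Person", "Organization"})
```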


Inspecting the Graph

Statistics

# Run inside an async function; `rag` is your already-initialized SDK instance.
stats = await rag.graph_store.get_statistics()
# Returns: node_count, edge_count, entity_types, relationship_types,
#          graph_density, embedded_relationship_count, mention_edge_count,
#          relates_edge_count

Raw Cypher Queries

# Find all Person entities
result = await rag.graph_store.query_raw(
    "MATCH (p:Person) RETURN p.name, p.description LIMIT 10"
)

# Find relationships between two entities
result = await rag.graph_store.query_raw(
    "MATCH (a:Person {name: 'Alice'})-[r:RELATES]->(b) "
    "RETURN a.name, r.rel_type, b.name, r.fact"
)

# Count mentions per entity
result = await rag.graph_store.query_raw(
    "MATCH (e:__Entity__)-[m:MENTIONED_IN]->(c:Chunk) "
    "RETURN e.name, count(m) AS mentions "
    "ORDER BY mentions DESC LIMIT 20"
)

File Reference

| File | What it contains |
| --- | --- |
| core/models.py | GraphSchema, EntityType, RelationType, PropertyType |
| storage/graph_store.py | Node/relationship upserts, label hints, statistics |
| storage/vector_store.py | Index creation, vector search, fulltext search |
| ingestion/pipeline.py | Lexical graph construction, pruning logic |
| ingestion/extraction_strategies/entity_extractors.py | Default entity types, compute_entity_id() |