Entity Extraction (GLiNER)

January 12, 2026 · View on GitHub

Zero-shot named entity recognition that automatically extracts persons, organizations, locations, events, dates, laws, and cryptonyms from documents.

What It Does

IntellyWeave uses GLiNER (Generalist and Lightweight Named Entity Recognition) to automatically extract structured entities from uploaded documents. Unlike traditional NER systems that require training, GLiNER performs zero-shot extraction - it recognizes entities without specific training data.

flowchart LR
    A[Document Upload] --> B[Text Extraction]
    B --> C[GLiNER Processing]
    C --> D[Entity Recognition]
    D --> E[Weaviate Storage]
    E --> F[Queryable Metadata]

    style C fill:#10b981,color:#fff
    style D fill:#10b981,color:#fff

Entity Types

IntellyWeave extracts 7 entity types optimized for OSINT and intelligence analysis:

Entity Type	Description	Examples
person	Individual names	Klaus Barbie, Adolf Eichmann
organization	Groups, agencies, institutions	CIA, Vatican, Mossad
location	Geographic places	Buenos Aires, Vatican City
date	Temporal references	January 1945, 1960s
event	Historical happenings	World War II, Operation Paperclip
law	Legal frameworks	Geneva Convention, Executive Order 12333
cryptonym	Codenames and aliases	PBSUCCESS, MKULTRA

Use When

Uploading documents for the first time
You need structured metadata for search and filtering
Building intelligence profiles from document collections
Visualizing entity relationships on maps or network graphs
Cross-referencing entities across multiple documents

Prerequisites

IntellyWeave backend running
GLiNER optional dependency installed
At least one document to process

Installation

GLiNER is an optional dependency that adds entity extraction capabilities:

cd backend
source .venv/bin/activate

# Install CPU-only PyTorch first (smaller download, sufficient for NER)
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Install GLiNER dependency
pip install -e ".[ner]"

First-run note: The GLiNER model (urchade/gliner_multi-v2.1, ~500MB) downloads from HuggingFace on first document upload.

How It Works

Processing Pipeline

sequenceDiagram
    participant User
    participant API as Upload API
    participant NER as NER Service
    participant GLiNER as GLiNER Model
    participant Weaviate

    User->>API: Upload document
    API->>API: Parse document text
    API->>NER: Extract entities
    NER->>GLiNER: Process text segments
    GLiNER-->>NER: Raw entity predictions
    NER->>NER: Deduplicate & group by type
    NER-->>API: Structured entities
    API->>Weaviate: Store chunks with entity metadata
    Weaviate-->>User: Document ready

Text Segmentation

GLiNER has a token limit (~384 tokens). For longer documents, IntellyWeave automatically:

Splits text at sentence boundaries
Processes each segment independently
Merges and deduplicates results

# Automatic segmentation for long documents
if len(text) > max_tokens_per_segment * 4:
    segments = self._split_text_into_segments(text, max_chars)
    for segment in segments:
        entities.extend(model.predict_entities(segment, labels))

Entity Storage

Extracted entities are stored as metadata arrays on document chunks in Weaviate:

{
  "chunk_text": "Klaus Barbie was smuggled through the Vatican rat line...",
  "person": ["Klaus Barbie"],
  "organization": ["Vatican", "CIC"],
  "location": ["Milan", "Buenos Aires"],
  "date": ["1951"],
  "event": [],
  "law": [],
  "cryptonym": []
}

Visual Presentation

Entity Extractor Agent Output

Entity Extractor

The Entity Extractor agent displays extracted entities with confidence scores, descriptions, and source references.

Entity Output Structure

Each extracted entity includes:

{
  name: "Klaus Barbie",
  type: "person",
  description: "Nazi war criminal known as 'Butcher of Lyon'...",
  assessment: "Key figure in escape network operations",
  confidence: 0.95,
  reasoning: "Source explicitly names him as smuggled via Milan's rat line",
  source_refs: ["chunk_uuid_1", "chunk_uuid_2"]
}

Configuration

Default Entity Labels

The NER service uses these labels by default:

DEFAULT_LABELS = ["location", "person", "organization", "date", "law", "event"]

Adding Custom Labels

You can customize labels by modifying the extraction call:

ner_service = NamedEntityRecognitionService()
entities = ner_service.extract_entities(
    text=document_text,
    labels=["person", "organization", "location", "cryptonym", "weapon"]
)

Confidence Threshold

GLiNER uses a confidence threshold of 0.5 by default:

entities = model.predict_entities(text, labels, threshold=0.5)

Architecture

Backend Structure

backend/elysia/api/
├── services/
│   └── ner_service.py          # NamedEntityRecognitionService
├── utils/
│   └── ner.py                  # Utility functions
└── routes/
    └── documents.py            # Upload endpoint with NER integration

Key Components

Component	File	Purpose
NamedEntityRecognitionService	`services/ner_service.py`	Main NER processing
GLiNER Model	HuggingFace	`urchade/gliner_multi-v2.1`
Document Upload	`routes/documents.py`	Triggers entity extraction

Model Details

Property	Value
Model	`urchade/gliner_multi-v2.1`
Size	~500MB
Token limit	~384 tokens per segment
Languages	Multilingual (EN, DE, ES, FR, etc.)
Loading	Lazy-loaded on first use

Troubleshooting

GLiNER Not Installed

Error: WARNING: GLiNER model not available - skipping entity extraction

Solution:

cd backend
source .venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -e ".[ner]"

Model Download Hangs

Cause: First-time download from HuggingFace (~500MB).

Solution: Wait for download to complete. Check network connection. Set HF_TOKEN if accessing gated models.

Empty Entity Results

Cause: Document has no recognizable entities or text is too short.

Solution:

Verify document contains relevant content
Check logs for extraction errors
Ensure text extraction succeeded

High Memory Usage

Cause: GLiNER model loaded multiple times.

Solution: The service uses singleton pattern - model loads once per process. Restart backend if memory issues persist.

Truncated Entities in Long Documents

Cause: Default segmentation may miss context.

Solution: The service automatically segments. Check logs for segment count:

INFO: NER splitting long text into 5 segments for processing

Performance

Operation	Typical Time
Model loading (first time)	5-15 seconds
Model loading (cached)	1-3 seconds
Entity extraction (short doc)	0.5-2 seconds
Entity extraction (long doc)	2-10 seconds