CX_DB8

April 9, 2026

License: MIT · Python 3.10+ · uv

Unsupervised, contextual, extractive summarizer built for competitive debate evidence — and useful for any document.

CX_DB8 uses modern sentence embeddings to find the most relevant words, sentences, or paragraphs in a document relative to a query. It highlights and underlines text by semantic similarity, producing beautiful terminal output, Word documents, HTML, and SVG exports.

(Demo GIF: CX_DB8 in action)

Features

  • Four granularity levels — phrase, word, sentence, or paragraph extraction
  • Any sentence-transformer model — swap models with a single flag
  • Beautiful Rich TUI — styled terminal output with panels, tables, and color-coded highlights
  • Multiple exports — Word (.docx), HTML, and SVG output formats
  • Interactive mode — process multiple cards in sequence, save all to one document
  • 3D visualization — explore the embedding space with interactive matplotlib + UMAP plots
  • Fast — default model runs on CPU in seconds, no GPU required

Quick Start

uv tool install git+https://github.com/Hellisotherpeople/CX_DB8.git

Or clone and install locally:

git clone https://github.com/Hellisotherpeople/CX_DB8.git
cd CX_DB8
uv sync

Install with pip

pip install git+https://github.com/Hellisotherpeople/CX_DB8.git

Run the demo

cx-db8 demo

Usage

Basic summarization

# From a file
cx-db8 run --file evidence.txt --query "nuclear war causes extinction"

# Pipe text in
cat evidence.txt | cx-db8 run --query "economic collapse"

# Interactive prompt (paste text, Ctrl-D to finish)
cx-db8 run

Granularity levels

# Sentence level (default) — best for most use cases
cx-db8 run -f card.txt -q "hegemony decline" -g sentence

# Phrase level — word-level scoring with grammatical bridging
cx-db8 run -f card.txt -q "hegemony decline" -g phrase

# Word level — raw token-level extraction with context windows
cx-db8 run -f card.txt -q "hegemony decline" -g word

# Paragraph level — coarse-grained extraction
cx-db8 run -f card.txt -q "hegemony decline" -g paragraph

Phrase mode is the sweet spot between word and sentence: it scores each word individually (with contextual n-gram windows), then bridges small gaps between important words so that the underlined/highlighted portions read as grammatical phrases instead of isolated tokens. Use --bridge-gap N to control how many filler words get absorbed (default 3).
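A minimal Python sketch of the bridging idea (illustrative only, not CX_DB8's actual implementation; the `bridge_gaps` function and its boolean-mask input are assumptions):

```python
def bridge_gaps(keep, max_gap=3):
    """Absorb short runs of dropped words between kept words.

    keep: one boolean per word (True = scored above threshold).
    max_gap: longest run of low-scoring words to promote, mirroring
    the --bridge-gap flag (default 3).
    """
    out = list(keep)
    n = len(out)
    i = 0
    while i < n:
        if not out[i]:
            # Find the end of this run of dropped words.
            j = i
            while j < n and not out[j]:
                j += 1
            # Bridge only interior gaps flanked by kept words on both sides.
            if 0 < i and j < n and (j - i) <= max_gap:
                for k in range(i, j):
                    out[k] = True
            i = j
        else:
            i += 1
    return out
```

For example, `bridge_gaps([True, False, False, True])` promotes the two middle words so the phrase reads contiguously, while a leading or trailing gap is left alone.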

Control thresholds

# Underline top 30%, highlight top 15%
cx-db8 run -f card.txt -q "warming" -u 70 -H 85

# Aggressive: underline top 10%, highlight top 5%
cx-db8 run -f card.txt -q "warming" -u 90 -H 95
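The -u/-H values are percentiles over the span scores. A toy sketch of one plausible cutoff convention (nearest-rank; CX_DB8's exact interpolation may differ, and `percentile_cutoff` is an illustrative name):

```python
import math

def percentile_cutoff(scores, pct):
    """Nearest-rank percentile: the smallest score such that at least
    pct% of all scores fall at or below it."""
    ranked = sorted(scores)
    idx = max(0, math.ceil(pct * len(ranked) / 100) - 1)
    return ranked[idx]

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
cut = percentile_cutoff(scores, 70)
kept = [s for s in scores if s >= cut]  # roughly the top 30% of spans
```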

Export formats

# Word document
cx-db8 run -f card.txt -q "deterrence" --docx summary.docx

# HTML
cx-db8 run -f card.txt -q "deterrence" --html summary.html

# SVG screenshot
cx-db8 run -f card.txt -q "deterrence" --svg summary.svg

# All at once
cx-db8 run -f card.txt -q "deterrence" --docx out.docx --html out.html --svg out.svg

Choose a model

# List recommended models
cx-db8 models

# Use a specific model
cx-db8 run -f card.txt -q "query" --model all-mpnet-base-v2

Interactive mode

Process multiple cards in a session and save all summaries to a Word document:

cx-db8 run --interactive

3D Visualization

# Install visualization dependencies
uv pip install "cx-db8[viz]"

# Run with visualization
cx-db8 run -f card.txt -q "query" --viz


How It Works

CX_DB8 is an unsupervised extractive summarizer that works by computing semantic similarity between a query and each unit of text:

  1. Encode the query into a dense vector using a sentence-transformer model
  2. Segment the text into spans (words with context windows, sentences, or paragraphs)
  3. Encode each span into the same embedding space
  4. Score each span by cosine similarity to the query vector
  5. Threshold using percentile-based cutoffs to determine what gets highlighted, underlined, or removed
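The scoring step can be sketched with toy vectors standing in for real sentence-transformer embeddings (illustrative only; `cosine` and `score_spans` are not CX_DB8's actual function names):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def score_spans(query_vec, span_vecs):
    """Step 4: score each span by cosine similarity to the query."""
    return [cosine(query_vec, v) for v in span_vecs]

# Toy 2-D "embeddings" in place of sentence-transformer output.
query = [1.0, 0.0]
spans = [[1.0, 0.1], [0.0, 1.0], [0.7, 0.7]]
scores = score_spans(query, spans)
# The first span, nearly parallel to the query vector, scores highest.
```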

For word and phrase-level summarization, each word is embedded along with its surrounding context window (default ±10 words), preserving contextual meaning rather than treating each word in isolation. Phrase mode additionally bridges small gaps (default ≤3 words) between kept words, promoting function words like articles and prepositions so the underlined text reads grammatically.
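The context-window idea above can be sketched as follows (a simplified illustration; `word_spans` is a hypothetical helper, not CX_DB8's API):

```python
def word_spans(words, window=10):
    """Build the text embedded for each word: the word plus up to
    `window` neighbors on each side, mirroring --word-window."""
    spans = []
    for i in range(len(words)):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        spans.append(" ".join(words[lo:hi]))
    return spans

# With window=1, each entry is a word with one neighbor either side.
word_spans("nuclear war risks escalation".split(), window=1)
```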

Sentence-Level Summary

(screenshot: sentence-level summary output)

Phrase-Level Summary

(screenshot: phrase-level summary output)

Configuration

All settings are available as CLI flags. Run cx-db8 run --help for full documentation:

| Flag | Default | Description |
|------|---------|-------------|
| -f, --file | stdin | Input text file |
| -q, --query | interactive | Card tag / query |
| -g, --granularity | sentence | phrase, word, sentence, or paragraph |
| -u, --underline | 70 | Underline percentile (1-99) |
| -H, --highlight | 85 | Highlight percentile (1-99) |
| -m, --model | all-MiniLM-L6-v2 | Sentence-transformer model |
| -w, --word-window | 10 | Context window for word/phrase level |
| -b, --bridge-gap | 3 | Max gap to bridge in phrase mode |
| --docx | (none) | Export as Word document |
| --html | (none) | Export as HTML |
| --svg | (none) | Export as SVG screenshot |
| --viz | false | Show 3D embedding plot |
| -i, --interactive | false | Interactive loop mode |

Development

git clone https://github.com/Hellisotherpeople/CX_DB8.git
cd CX_DB8
uv sync --extra dev
uv run pytest

Record demo GIFs

Requires VHS:

vhs demo.tape
vhs demo_help.tape

Background

In American competitive cross-examination debate (Policy Debate), debaters summarize evidence by underlining and highlighting the most important parts of source documents. This manual process is what CX_DB8 automates.

The original version (2018-2019) used TensorFlow Hub's Universal Sentence Encoder and Flair embeddings. This v2.0 rewrite modernizes the stack with sentence-transformers, Rich TUI, and UV packaging while preserving the core algorithm.

A webapp version implementing similar functionality is available at Hugging Face Spaces.

License

MIT