CX_DB8

April 9, 2026

License: MIT · Python 3.10+ · uv

Unsupervised, contextual, extractive summarizer built for competitive debate evidence — and useful for any document.

CX_DB8 uses modern sentence embeddings to find the most relevant words, sentences, or paragraphs in a document relative to a query. It highlights and underlines text by semantic similarity, producing beautiful terminal output, Word documents, HTML, and SVG exports.

(Demo GIF: CX_DB8 in action)

Features

  • Four granularity levels — phrase, word, sentence, or paragraph extraction
  • Any sentence-transformer model — swap models with a single flag
  • Beautiful Rich TUI — styled terminal output with panels, tables, and color-coded highlights
  • Multiple exports — Word (.docx), HTML, and SVG output formats
  • Interactive mode — process multiple cards in sequence, save all to one document
  • 3D visualization — explore the embedding space with interactive matplotlib + UMAP plots
  • Fast — default model runs on CPU in seconds, no GPU required

Quick Start

uv tool install git+https://github.com/Hellisotherpeople/CX_DB8.git

Or clone and install locally:

git clone https://github.com/Hellisotherpeople/CX_DB8.git
cd CX_DB8
uv sync

Install with pip

pip install git+https://github.com/Hellisotherpeople/CX_DB8.git

Run the demo

cx-db8 demo

Usage

Basic summarization

# From a file
cx-db8 run --file evidence.txt --query "nuclear war causes extinction"

# Pipe text in
cat evidence.txt | cx-db8 run --query "economic collapse"

# Interactive prompt (paste text, Ctrl-D to finish)
cx-db8 run

Granularity levels

# Sentence level (default) — best for most use cases
cx-db8 run -f card.txt -q "hegemony decline" -g sentence

# Phrase level — word-level scoring with grammatical bridging
cx-db8 run -f card.txt -q "hegemony decline" -g phrase

# Word level — raw token-level extraction with context windows
cx-db8 run -f card.txt -q "hegemony decline" -g word

# Paragraph level — coarse-grained extraction
cx-db8 run -f card.txt -q "hegemony decline" -g paragraph

Phrase mode is the sweet spot between word and sentence: it scores each word individually (with contextual n-gram windows), then bridges small gaps between important words so that the underlined/highlighted portions read as grammatical phrases instead of isolated tokens. Use --bridge-gap N to control how many filler words get absorbed (default 3).
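A minimal Python sketch of the bridging idea (illustrative only, not CX_DB8's actual implementation; the `bridge_gaps` function and its boolean-mask input are assumptions):

```python
def bridge_gaps(keep, max_gap=3):
    """Absorb short runs of dropped words between kept words.

    keep: one boolean per word (True = scored above threshold).
    max_gap: longest run of low-scoring words to promote, mirroring
    the --bridge-gap flag (default 3).
    """
    out = list(keep)
    n = len(out)
    i = 0
    while i < n:
        if not out[i]:
            # Find the end of this run of dropped words.
            j = i
            while j < n and not out[j]:
                j += 1
            # Bridge only interior gaps flanked by kept words on both sides.
            if 0 < i and j < n and (j - i) <= max_gap:
                for k in range(i, j):
                    out[k] = True
            i = j
        else:
            i += 1
    return out
```

For example, `bridge_gaps([True, False, False, True])` promotes the two middle words so the phrase reads contiguously, while a leading or trailing gap is left alone.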

Control thresholds

# Underline top 30%, highlight top 15%
cx-db8 run -f card.txt -q "warming" -u 70 -H 85

# Aggressive: underline top 10%, highlight top 5%
cx-db8 run -f card.txt -q "warming" -u 90 -H 95
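The -u/-H values are percentiles over the span scores. A toy sketch of one plausible cutoff convention (nearest-rank; CX_DB8's exact interpolation may differ, and `percentile_cutoff` is an illustrative name):

```python
import math

def percentile_cutoff(scores, pct):
    """Nearest-rank percentile: the smallest score such that at least
    pct% of all scores fall at or below it."""
    ranked = sorted(scores)
    idx = max(0, math.ceil(pct * len(ranked) / 100) - 1)
    return ranked[idx]

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
cut = percentile_cutoff(scores, 70)
kept = [s for s in scores if s >= cut]  # roughly the top 30% of spans
```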

Export formats

# Word document
cx-db8 run -f card.txt -q "deterrence" --docx summary.docx

# HTML
cx-db8 run -f card.txt -q "deterrence" --html summary.html

# SVG screenshot
cx-db8 run -f card.txt -q "deterrence" --svg summary.svg

# All at once
cx-db8 run -f card.txt -q "deterrence" --docx out.docx --html out.html --svg out.svg

Choose a model

# List recommended models
cx-db8 models

# Use a specific model
cx-db8 run -f card.txt -q "query" --model all-mpnet-base-v2

Interactive mode

Process multiple cards in a session and save all summaries to a Word document:

cx-db8 run --interactive

3D Visualization

# Install visualization dependencies
uv pip install "cx-db8[viz]"

# Run with visualization
cx-db8 run -f card.txt -q "query" --viz


How It Works

CX_DB8 is an unsupervised extractive summarizer that works by computing semantic similarity between a query and each unit of text:

  1. Encode the query into a dense vector using a sentence-transformer model
  2. Segment the text into spans (words with context windows, sentences, or paragraphs)
  3. Encode each span into the same embedding space
  4. Score each span by cosine similarity to the query vector
  5. Threshold using percentile-based cutoffs to determine what gets highlighted, underlined, or removed
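The scoring step can be sketched with toy vectors standing in for real sentence-transformer embeddings (illustrative only; `cosine` and `score_spans` are not CX_DB8's actual function names):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def score_spans(query_vec, span_vecs):
    """Step 4: score each span by cosine similarity to the query."""
    return [cosine(query_vec, v) for v in span_vecs]

# Toy 2-D "embeddings" in place of sentence-transformer output.
query = [1.0, 0.0]
spans = [[1.0, 0.1], [0.0, 1.0], [0.7, 0.7]]
scores = score_spans(query, spans)
# The first span, nearly parallel to the query vector, scores highest.
```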

For word and phrase-level summarization, each word is embedded along with its surrounding context window (default ±10 words), preserving contextual meaning rather than treating each word in isolation. Phrase mode additionally bridges small gaps (default ≤3 words) between kept words, promoting function words like articles and prepositions so the underlined text reads grammatically.
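The context-window idea above can be sketched as follows (a simplified illustration; `word_spans` is a hypothetical helper, not CX_DB8's API):

```python
def word_spans(words, window=10):
    """Build the text embedded for each word: the word plus up to
    `window` neighbors on each side, mirroring --word-window."""
    spans = []
    for i in range(len(words)):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        spans.append(" ".join(words[lo:hi]))
    return spans

# With window=1, each entry is a word with one neighbor either side.
word_spans("nuclear war risks escalation".split(), window=1)
```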

Sentence-Level Summary

(screenshot: sentence-level summary output)

Phrase-Level Summary

(screenshot: phrase-level summary output)

Configuration

All settings are available as CLI flags. Run cx-db8 run --help for full documentation:

| Flag | Default | Description |
|------|---------|-------------|
| -f, --file | stdin | Input text file |
| -q, --query | interactive | Card tag / query |
| -g, --granularity | sentence | phrase, word, sentence, or paragraph |
| -u, --underline | 70 | Underline percentile (1-99) |
| -H, --highlight | 85 | Highlight percentile (1-99) |
| -m, --model | all-MiniLM-L6-v2 | Sentence-transformer model |
| -w, --word-window | 10 | Context window for word/phrase level |
| -b, --bridge-gap | 3 | Max gap to bridge in phrase mode |
| --docx | (none) | Export as Word document |
| --html | (none) | Export as HTML |
| --svg | (none) | Export as SVG screenshot |
| --viz | false | Show 3D embedding plot |
| -i, --interactive | false | Interactive loop mode |

Development

git clone https://github.com/Hellisotherpeople/CX_DB8.git
cd CX_DB8
uv sync --extra dev
uv run pytest

Record demo GIFs

Requires VHS:

vhs demo.tape
vhs demo_help.tape

Background

In American competitive cross-examination debate (Policy Debate), debaters summarize evidence by underlining and highlighting the most important parts of source documents. This manual process is what CX_DB8 automates.

The original version (2018-2019) used TensorFlow Hub's Universal Sentence Encoder and Flair embeddings. This v2.0 rewrite modernizes the stack with sentence-transformers, Rich TUI, and UV packaging while preserving the core algorithm.

A webapp version implementing similar functionality is available at Hugging Face Spaces.

License

MIT