Text-Corpus-Analysis
April 23, 2026 · View on GitHub
Claude Code plugin: reusable task definitions for text corpus and topic analysis. Covers categorization, taxonomy development, topic modeling, NER, trend/correlation analysis, and synonym clustering across corpora ranging from a handful of notes to tens of thousands.
Design principles
- Three execution lanes per task: classical NLP (cheap, local, deterministic), local LLM (free at inference time, slower, bounded quality), cloud LLM via OpenRouter (best quality, metered cost).
- Cost-awareness is first-class. For any skill that can blow up on a 10k-document corpus, the skill must estimate token/$ cost before running and offer a cheaper fallback.
- Sampling and stratification are preferred over whole-corpus passes when deriving categories/taxonomies — you rarely need to read every document to find the themes.
- Hybrid pipelines beat pure-LLM ones. Use classical NLP to narrow the space (candidate extraction, deduping, clustering), then use an LLM only where judgment is required (labeling, disambiguation).
Skills
| Skill | Purpose |
|---|---|
choose-approach | Decide NLP vs local-LLM vs cloud-LLM for a given task + corpus size. Estimates cost. |
topic-analysis | Topic clusters and their evolution over time. |
ner-extraction | Named entity recognition — people, places, orgs. |
trend-analysis | Temporal trends across topics, entities, or keywords. |
categorize-corpus | Assign each document to one of N user-defined categories. |
suggest-categories | Derive N categories from a corpus's dominant themes. |
define-taxonomy | Build a multi-level category → tag → sub-category taxonomy. |
word-frequency | Word/token counts, stopword-filtered. |
synonym-cluster | Find variant spellings / transcription variants of the same concept. |
parametric-analysis | Summary statistics (avg word length, sentences/doc, etc.). |
correlation-analysis | Correlate metadata (timestamps, tags) with content features. |
setup-local-llm | Audit/install a local LLM suitable for corpus work (Ollama). |
setup-openrouter | Configure OpenRouter access for cloud LLM calls. |
recommend-tools | Catalog of external libraries/plugins for text corpus work. |
Typical workflow
choose-approach— pick the execution lane and estimate cost.word-frequency/ner-extraction— cheap classical pass to surface candidates.suggest-categoriesordefine-taxonomy— derive structure from a stratified sample.categorize-corpus— apply the structure to the whole corpus.trend-analysis/correlation-analysis— analyze the categorized corpus over time.