Text-Corpus-Analysis

April 23, 2026 · View on GitHub

Claude Code plugin: reusable task definitions for text corpus and topic analysis. Covers categorization, taxonomy development, topic modeling, NER, trend/correlation analysis, and synonym clustering across corpora ranging from a handful of notes to tens of thousands.

Design principles

Three execution lanes per task: classical NLP (cheap, local, deterministic), local LLM (free at inference time, slower, bounded quality), cloud LLM via OpenRouter (best quality, metered cost).
Cost-awareness is first-class. For any skill that can blow up on a 10k-document corpus, the skill must estimate token/$ cost before running and offer a cheaper fallback.
Sampling and stratification are preferred over whole-corpus passes when deriving categories/taxonomies — you rarely need to read every document to find the themes.
Hybrid pipelines beat pure-LLM ones. Use classical NLP to narrow the space (candidate extraction, deduping, clustering), then use an LLM only where judgment is required (labeling, disambiguation).

Skills

Skill	Purpose
`choose-approach`	Decide NLP vs local-LLM vs cloud-LLM for a given task + corpus size. Estimates cost.
`topic-analysis`	Topic clusters and their evolution over time.
`ner-extraction`	Named entity recognition — people, places, orgs.
`trend-analysis`	Temporal trends across topics, entities, or keywords.
`categorize-corpus`	Assign each document to one of N user-defined categories.
`suggest-categories`	Derive N categories from a corpus's dominant themes.
`define-taxonomy`	Build a multi-level category → tag → sub-category taxonomy.
`word-frequency`	Word/token counts, stopword-filtered.
`synonym-cluster`	Find variant spellings / transcription variants of the same concept.
`parametric-analysis`	Summary statistics (avg word length, sentences/doc, etc.).
`correlation-analysis`	Correlate metadata (timestamps, tags) with content features.
`setup-local-llm`	Audit/install a local LLM suitable for corpus work (Ollama).
`setup-openrouter`	Configure OpenRouter access for cloud LLM calls.
`recommend-tools`	Catalog of external libraries/plugins for text corpus work.

Typical workflow

choose-approach — pick the execution lane and estimate cost.
word-frequency / ner-extraction — cheap classical pass to surface candidates.
suggest-categories or define-taxonomy — derive structure from a stratified sample.
categorize-corpus — apply the structure to the whole corpus.
trend-analysis / correlation-analysis — analyze the categorized corpus over time.