Text-Corpus-Analysis

April 23, 2026

Claude Code plugin: reusable task definitions for text corpus and topic analysis. Covers categorization, taxonomy development, topic modeling, NER, trend/correlation analysis, and synonym clustering across corpora ranging from a handful of notes to tens of thousands of documents.

Design principles

  • Three execution lanes per task: classical NLP (cheap, local, deterministic), local LLM (free at inference time, slower, bounded quality), cloud LLM via OpenRouter (best quality, metered cost).
  • Cost-awareness is first-class. For any skill that can blow up on a 10k-document corpus, the skill must estimate token/$ cost before running and offer a cheaper fallback.
  • Sampling and stratification are preferred over whole-corpus passes when deriving categories/taxonomies — you rarely need to read every document to find the themes.
  • Hybrid pipelines beat pure-LLM ones. Use classical NLP to narrow the space (candidate extraction, deduping, clustering), then use an LLM only where judgment is required (labeling, disambiguation).
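As a rough illustration of the cost-awareness principle, the sketch below estimates the input-token cost of a whole-corpus cloud pass and falls back to a cheaper lane when the estimate exceeds a budget. The function name `estimate_cost`, the lane labels, the 4-characters-per-token heuristic, and the price used in the example are all assumptions made for illustration, not part of the plugin or of OpenRouter's pricing.

```python
from dataclasses import dataclass

# Assumption: ~4 characters per token is a coarse heuristic for English text,
# not an actual tokenizer count.
CHARS_PER_TOKEN = 4


@dataclass
class CostEstimate:
    input_tokens: int
    usd: float
    lane: str


def estimate_cost(docs, usd_per_million_tokens, budget_usd):
    """Estimate input-token cost of a whole-corpus cloud pass and pick a lane."""
    total_chars = sum(len(d) for d in docs)
    input_tokens = total_chars // CHARS_PER_TOKEN
    usd = input_tokens / 1_000_000 * usd_per_million_tokens
    # Hypothetical lane labels: fall back to the cheap lane when over budget.
    lane = "cloud-llm" if usd <= budget_usd else "classical-nlp"
    return CostEstimate(input_tokens, usd, lane)


if __name__ == "__main__":
    # Hypothetical 10k-document corpus of 4,000 characters each.
    corpus = ["x" * 4000] * 10_000
    est = estimate_cost(corpus, usd_per_million_tokens=3.0, budget_usd=5.0)
    print(f"{est.input_tokens:,} tokens, ~${est.usd:.2f}, lane: {est.lane}")
```

This counts input tokens only; output tokens and per-request overhead would raise the real figure, so treat the result as a lower bound.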

Skills

| Skill | Purpose |
| --- | --- |
| choose-approach | Decide NLP vs local-LLM vs cloud-LLM for a given task + corpus size. Estimates cost. |
| topic-analysis | Topic clusters and their evolution over time. |
| ner-extraction | Named entity recognition: people, places, orgs. |
| trend-analysis | Temporal trends across topics, entities, or keywords. |
| categorize-corpus | Assign each document to one of N user-defined categories. |
| suggest-categories | Derive N categories from a corpus's dominant themes. |
| define-taxonomy | Build a multi-level category → tag → sub-category taxonomy. |
| word-frequency | Word/token counts, stopword-filtered. |
| synonym-cluster | Find variant spellings / transcription variants of the same concept. |
| parametric-analysis | Summary statistics (avg word length, sentences/doc, etc.). |
| correlation-analysis | Correlate metadata (timestamps, tags) with content features. |
| setup-local-llm | Audit/install a local LLM suitable for corpus work (Ollama). |
| setup-openrouter | Configure OpenRouter access for cloud LLM calls. |
| recommend-tools | Catalog of external libraries/plugins for text corpus work. |
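
To make the classical lane concrete, here is a minimal sketch of what a stopword-filtered count like the one word-frequency describes might look like, using only the Python standard library. The function name `word_frequency`, the regex tokenizer, and the tiny stopword set are assumptions for illustration; the skill itself may work differently.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def word_frequency(docs, top_n=10):
    """Count lowercase word tokens across docs, skipping stopwords."""
    counts = Counter()
    for doc in docs:
        # Simple tokenizer: runs of letters, digits, and apostrophes.
        tokens = re.findall(r"[a-z0-9']+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    notes = ["The cat and the hat", "A cat in the rain"]
    print(word_frequency(notes, top_n=3))
```

A pass like this runs locally in one sweep over the corpus, which is why it suits surfacing candidates before any LLM call.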

Typical workflow

  1. choose-approach — pick the execution lane and estimate cost.
  2. word-frequency / ner-extraction — cheap classical pass to surface candidates.
  3. suggest-categories or define-taxonomy — derive structure from a stratified sample.
  4. categorize-corpus — apply the structure to the whole corpus.
  5. trend-analysis / correlation-analysis — analyze the categorized corpus over time.
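
Step 3 relies on a stratified sample rather than a whole-corpus pass. The sketch below shows one way such a sample could be drawn: group documents by a metadata key, then take up to a fixed number from each group so that small strata are not drowned out by large ones. The function name `stratified_sample`, the `year` field, and the per-stratum size are hypothetical choices, not the plugin's actual behavior.

```python
import random
from collections import defaultdict


def stratified_sample(docs, key, per_stratum, seed=0):
    """Draw up to per_stratum documents from each group defined by key(doc)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    strata = defaultdict(list)
    for doc in docs:
        strata[key(doc)].append(doc)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        # Small strata contribute everything they have.
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample


if __name__ == "__main__":
    # Hypothetical corpus: 50 notes from 2023 and 3 from 2024.
    corpus = [{"year": 2023, "text": f"note {i}"} for i in range(50)]
    corpus += [{"year": 2024, "text": f"note {i}"} for i in range(3)]
    picked = stratified_sample(corpus, key=lambda d: d["year"], per_stratum=5)
    print(len(picked), "documents sampled")
```

Only the sampled documents would then be sent to an LLM for category or taxonomy suggestions, which keeps the metered cost roughly proportional to the number of strata rather than to corpus size.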