Corpus test
February 10, 2026 ยท View on GitHub
This directory contains tests that verify the integrity, format, parseability, and catalog functionality of corpus in PyThaiNLP.
Purpose
These tests are separate from regular unit tests because:
- They test actual file loading and parsing (not mocked)
- Downloadable corpus tests require network access and can be slow
- They verify corpus format and structure
- They test corpus catalog download and query functionality
- They should only run when corpus files or corpus code changes
Test categories
Corpus catalog tests (test_catalog.py)
Tests corpus catalog functionality:
- Catalog download from remote server
- Catalog URL and path validation
- Catalog JSON structure verification
- Querying specific corpus details
- Version information validation
Built-in corpus tests (test_builtin_corpus.py)
Tests corpus files that are included in the package:
- Text word lists (negations, stopwords, syllables, words, etc.)
- CSV files (provinces)
- Frequency data (TNC, TTC)
- Name lists (family names, person names)
Downloadable corpus tests (test_downloadable_corpus.py)
Tests corpus files that need to be downloaded:
- OSCAR word frequencies (96MB)
- TNC bigram/trigram frequencies (41MB + 145MB)
This test will take longer time than others due to size of the downloads.
Running tests
Run all corpus tests:
python -m unittest discover -s tests/corpus -v
Run only catalog tests:
python -m unittest tests.corpus.test_catalog -v
Run only built-in corpus tests:
python -m unittest tests.corpus.test_builtin_corpus -v
Run only downloadable corpus tests:
python -m unittest tests.corpus.test_downloadable_corpus -v
CI integration
The corpus test runs automatically via
GitHub Actions workflow (.github/workflows/corpus.yml) when:
- Changes are made to
pythainlp/corpus/** - Changes are made to
tests/corpus/** - The workflow file itself is modified
What is tested
Each test verifies:
- Loadability: File can be loaded without errors
- Type correctness: Returns expected data type (frozenset, list, dict)
- Non-empty: Contains actual data
- Format validity: Data structure matches expected format
- Content validity: Contains expected content (e.g., Thai characters)
- Catalog functionality: Catalog can be downloaded and queried correctly
Adding new tests
When adding a new corpus file or function to pythainlp.corpus:
- Add a test to
test_builtin_corpus.pyif it's included in the package - Add a test to
test_downloadable_corpus.pyif it requires download - Add a test to
test_catalog.pyif it involves catalog operations - Verify the test catches format errors by temporarily breaking the corpus
Relationship to unit tests
- Unit tests (
tests/core/test_corpus.py): Use mocks for speed, test code logic - Corpus test (this directory): Use real data, test file integrity and catalog
Both test suites are important and complementary.