Dictionary Customization Guide
August 11, 2025
This guide explains how to customize dictionaries, add new words, and create domain-specific vocabularies for SymSpell PHP.
Table of Contents
- Adding Words to an Existing Dictionary
- Creating a Custom Dictionary from Scratch
- Combining Multiple Dictionaries
- Building Domain-Specific Dictionaries
- Extracting Dictionary from Existing Text
- Updating Word Frequencies
- Dictionary Format Guidelines
- Performance Considerations
- Tips for Optimal Dictionary Creation
- Real-World Examples
- Troubleshooting
Adding Words to an Existing Dictionary
You can dynamically add words to a loaded dictionary:
$symSpell = new SymSpell();
// Load base dictionary
$symSpell->loadDictionary('frequency_dictionary_en_82_765.txt', 0, 1);
// Add new words with frequencies
$symSpell->createDictionaryEntry('covid', 1000000); // New word
$symSpell->createDictionaryEntry('blockchain', 500000); // Technical term
$symSpell->createDictionaryEntry('cryptocurrency', 300000); // Domain-specific
// Update frequency of existing word (increases count)
$symSpell->createDictionaryEntry('zoom', 5000000); // Boost existing word
// Now these words will be suggested
$suggestions = $symSpell->lookup('covd', Verbosity::Top, 2);
echo $suggestions[0]->term; // "covid"
Creating a Custom Dictionary from Scratch
Build a dictionary from your own text corpus:
$symSpell = new SymSpell();
// Method 1: Create from a text corpus file
$symSpell->createDictionary('path/to/corpus.txt');
// Method 2: Add words programmatically
$words = [
'machine' => 50000,
'learning' => 45000,
'artificial' => 30000,
'intelligence' => 35000,
'neural' => 20000,
'network' => 25000
];
foreach ($words as $word => $frequency) {
$symSpell->createDictionaryEntry($word, $frequency);
}
// Save custom dictionary for reuse
$customDict = [];
foreach ($words as $word => $freq) {
$customDict[] = "$word $freq";
}
file_put_contents('custom_dictionary.txt', implode("\n", $customDict));
Combining Multiple Dictionaries
Merge general and domain-specific dictionaries:
$symSpell = new SymSpell();
// Load general English dictionary
$symSpell->loadDictionary('frequency_dictionary_en_82_765.txt', 0, 1);
// Add medical terminology
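$medicalTerms = file('medical_terms.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($medicalTerms as $line) {
    // Split on any whitespace so tabs or repeated spaces do not break parsing
    $parts = preg_split('/\s+/', trim($line));
    if (count($parts) < 2) {
        continue; // Skip malformed lines
    }
    [$term, $frequency] = $parts;
    $symSpell->createDictionaryEntry($term, (int)$frequency);
}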
// Add company-specific terms
$symSpell->createDictionaryEntry('acme', 100000); // Company name
$symSpell->createDictionaryEntry('productname', 50000); // Product names
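Because adding an existing word increases its count, a term that appears in several sources ends up with the sum of its frequencies. To preview the merged result before loading anything into SymSpell, the sources can be merged as plain arrays first. This is a minimal sketch; `mergeFrequencies` is a hypothetical helper, not part of SymSpell PHP.

```php
<?php
// Hypothetical helper (not part of SymSpell PHP): merge several
// word => frequency arrays, summing counts for words that repeat.
function mergeFrequencies(array ...$dictionaries): array
{
    $merged = [];
    foreach ($dictionaries as $dictionary) {
        foreach ($dictionary as $word => $frequency) {
            // PHP turns numeric-string keys into ints, so cast back to string.
            $key = strtolower((string) $word);
            $merged[$key] = ($merged[$key] ?? 0) + (int) $frequency;
        }
    }
    arsort($merged); // most frequent first
    return $merged;
}

$general = ['patient' => 50000, 'acute' => 20000];
$medical = ['acute' => 80000, 'stent' => 30000];
$merged = mergeFrequencies($general, $medical);
// 'acute' now carries 100000 and sorts ahead of 'patient' and 'stent'.
```

Inspecting the merged array makes it easier to spot domain terms that are drowned out by general vocabulary before the dictionary is built.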
Building Domain-Specific Dictionaries
Create specialized dictionaries for your use case:
// Medical dictionary example
$medicalSymSpell = new SymSpell(
initialCapacity: 50000,
maxDictionaryEditDistance: 2,
prefixLength: 7,
countThreshold: 1
);
// Add medical terms with appropriate frequencies
$medicalTerms = [
'diabetes' => 100000,
'hypertension' => 95000,
'pneumonia' => 80000,
'antibiotics' => 75000,
'diagnosis' => 90000,
'symptoms' => 85000,
'treatment' => 95000,
'prescription' => 70000
];
foreach ($medicalTerms as $term => $frequency) {
$medicalSymSpell->createDictionaryEntry($term, $frequency);
}
Extracting Dictionary from Existing Text
Generate a frequency dictionary from your documents:
function createDictionaryFromDocuments(array $documents): array {
$wordFrequencies = [];
foreach ($documents as $document) {
// Tokenize and count words
$words = preg_split('/\s+/', strtolower($document));
foreach ($words as $word) {
// Clean word (remove punctuation except apostrophes)
$word = preg_replace("/[^a-z0-9']/", '', $word);
if (strlen($word) > 1) { // Skip single characters
$wordFrequencies[$word] = ($wordFrequencies[$word] ?? 0) + 1;
}
}
}
// Sort by frequency
arsort($wordFrequencies);
return $wordFrequencies;
}
// Use it
$documents = [
file_get_contents('doc1.txt'),
file_get_contents('doc2.txt'),
// ...
];
$frequencies = createDictionaryFromDocuments($documents);
// Create SymSpell instance with custom dictionary
$symSpell = new SymSpell();
foreach ($frequencies as $word => $count) {
$symSpell->createDictionaryEntry($word, $count);
}
Updating Word Frequencies
Adjust frequencies based on user behavior:
class AdaptiveSpellChecker {
private SymSpell $symSpell;
private array $userCorrections = [];
public function __construct() {
$this->symSpell = new SymSpell();
$this->symSpell->loadDictionary('base_dictionary.txt', 0, 1);
}
public function recordUserSelection(string $misspelled, string $selected): void {
// Track what users actually select
$this->userCorrections[$misspelled] = $selected;
// Boost frequency of selected word
$this->symSpell->createDictionaryEntry($selected, 10000);
}
public function getSuggestions(string $word): array {
// Check user's previous corrections first
if (isset($this->userCorrections[$word])) {
return [new SuggestItem(
$this->userCorrections[$word],
0,
PHP_INT_MAX
)];
}
return $this->symSpell->lookup($word, Verbosity::Top, 2);
}
}
Dictionary Format Guidelines
When creating custom dictionaries:
- Format: word frequency (space-separated)
- Encoding: UTF-8 for international support
- Case: Lowercase recommended
- Frequencies: Higher = more likely to be suggested
  - Common words: 1,000,000+
  - Regular words: 10,000 - 999,999
  - Rare words: 100 - 9,999
  - Very rare: 1 - 99
Example dictionary file:
the 23135851162
quick 5428674
brown 2357452
fox 834729
jumps 425867
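The format rules above can be checked line by line before a file is handed to loadDictionary. The following is a sketch under stated assumptions: `parseDictionaryLine` is a hypothetical helper, and it assumes exactly two whitespace-separated fields with a positive integer count.

```php
<?php
// Hypothetical helper: parse one "word frequency" line.
// Returns [word, frequency], or null when the line is malformed.
function parseDictionaryLine(string $line): ?array
{
    $parts = preg_split('/\s+/', trim($line));
    if ($parts === false || count($parts) !== 2) {
        return null; // wrong number of fields
    }
    [$word, $frequency] = $parts;
    // The count must be a positive integer written in plain digits.
    if (preg_match('/^[0-9]+$/', $frequency) !== 1 || (int) $frequency < 1) {
        return null;
    }
    // Prefer the multibyte-aware lowercase when the mbstring extension is loaded.
    $word = function_exists('mb_strtolower')
        ? mb_strtolower($word, 'UTF-8')
        : strtolower($word);
    return [$word, (int) $frequency];
}

var_dump(parseDictionaryLine('quick 5428674') !== null); // well-formed line
var_dump(parseDictionaryLine('quick') === null);         // missing count
```

Lines that return null can be logged and skipped rather than loaded with a count of zero.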
Performance Considerations
- Dictionary size: Each word uses ~20-30 bytes of memory
- Loading time: ~0.6ms per 1000 words
- Lookup speed: Unaffected by dictionary size (O(1) average)
- Memory usage:
  - 10K words: ~300KB
  - 100K words: ~3MB
  - 1M words: ~30MB
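These figures are approximate and will vary with PHP version, hardware, and settings such as prefixLength, so it is worth measuring on your own setup. Below is a small sketch of a timing helper; `measure` is a hypothetical function, and the closure is a stand-in workload so the example runs without SymSpell installed.

```php
<?php
// Hypothetical helper: run $task once and report elapsed time and the
// change in PHP's memory usage while it ran.
function measure(callable $task): array
{
    $memoryBefore = memory_get_usage();
    $start = hrtime(true); // nanoseconds from a monotonic clock

    $result = $task();

    $elapsedMs = (hrtime(true) - $start) / 1e6;
    $memoryDelta = memory_get_usage() - $memoryBefore;

    return ['result' => $result, 'ms' => $elapsedMs, 'bytes' => $memoryDelta];
}

// Stand-in workload: build a 10,000-entry word => frequency array.
$stats = measure(function (): array {
    $words = [];
    for ($i = 0; $i < 10000; $i++) {
        $words["word$i"] = $i + 1;
    }
    return $words;
});
printf("%.2f ms, %d KB\n", $stats['ms'], intdiv($stats['bytes'], 1024));
```

To measure a real dictionary, replace the closure body with code that constructs a SymSpell instance, calls loadDictionary, and returns the instance so its memory stays allocated while the delta is taken.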
Tips for Optimal Dictionary Creation
1. Frequency Distribution
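Use a logarithmic or Zipfian distribution for realistic frequencies. The function below expects $words ordered from most to least common:
function generateFrequencies(array $words): array {
    $frequencies = [];
    $maxFreq = 1000000;
    // array_values() guarantees ranks 0, 1, 2, ... even if keys are not sequential
    foreach (array_values($words) as $index => $word) {
        // Zipf-like power law: frequency falls off as 1 / rank^1.5
        // max(1, ...) keeps low-ranked words from rounding down to 0
        $frequencies[$word] = max(1, intval($maxFreq / pow($index + 1, 1.5)));
    }
    return $frequencies;
}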
2. Handling Multi-Word Terms
For phrases and compound terms:
// Add as single entry with underscores or hyphens
$symSpell->createDictionaryEntry('new_york', 500000);
$symSpell->createDictionaryEntry('san-francisco', 400000);
// Or use the bigram dictionary for better context
$symSpell->loadBigramDictionary('bigram_dictionary.txt', 0, 2);
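If phrases are stored as single entries with underscores, the input text has to be joined the same way before lookup, otherwise "new york" is checked as two separate words. Here is a minimal sketch of that preprocessing step; `joinKnownPhrases` is a hypothetical helper, and it only matches phrases that are typed correctly, so misspelled phrases still need the bigram dictionary or lookupCompound.

```php
<?php
// Hypothetical helper: rewrite known phrases in the input so they match
// single dictionary entries such as "new_york".
function joinKnownPhrases(string $text, array $phrases, string $joiner = '_'): string
{
    $text = strtolower($text);
    // Longest phrases first so "new york city" wins over "new york".
    usort($phrases, fn(string $a, string $b): int => strlen($b) <=> strlen($a));
    foreach ($phrases as $phrase) {
        $phrase = strtolower($phrase);
        $pattern = '/\b' . preg_quote($phrase, '/') . '\b/';
        $text = preg_replace($pattern, str_replace(' ', $joiner, $phrase), $text);
    }
    return $text;
}

echo joinKnownPhrases('Flights from New York to Boston', ['new york']), "\n";
// prints "flights from new_york to boston"
```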
3. Filtering Invalid Words
Clean your dictionary before loading:
function validateWord(string $word): bool {
// Skip if too short or too long
$length = strlen($word);
if ($length < 2 || $length > 30) return false;
// Skip if contains invalid characters
if (!preg_match('/^[a-z0-9\'-]+$/i', $word)) return false;
// Skip if all numbers
if (is_numeric($word)) return false;
return true;
}
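// Filter dictionary ($entries is a word => frequency array you have already built)
$validEntries = [];
foreach ($entries as $word => $freq) {
    // Cast because PHP converts numeric-string keys such as "2024" to ints
    if (validateWord((string)$word)) {
        $validEntries[$word] = $freq;
    }
}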
4. Saving and Loading Custom Dictionaries
Efficiently persist your custom dictionaries:
class DictionaryManager {
public static function saveDictionary(SymSpell $symSpell, string $filepath): void {
$entries = [];
// Note: SymSpell doesn't expose internal dictionary directly
// You'll need to maintain your own word list when adding
// This is a conceptual example
file_put_contents($filepath, implode("\n", $entries));
}
public static function loadDictionary(string $filepath): SymSpell {
$symSpell = new SymSpell();
$symSpell->loadDictionary($filepath, 0, 1);
return $symSpell;
}
public static function mergeDictionaries(array $filepaths): SymSpell {
$symSpell = new SymSpell();
foreach ($filepaths as $filepath) {
$symSpell->loadDictionary($filepath, 0, 1);
}
return $symSpell;
}
}
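Since saveDictionary above cannot read entries back out of SymSpell, one option is to record every entry in your own list as you add it and save that list instead. This is a sketch under that assumption; `TrackedWordList` and its methods are hypothetical names, and in real use `add()` would also forward each entry to createDictionaryEntry.

```php
<?php
// Hypothetical wrapper: remembers every added entry so it can be saved later.
class TrackedWordList
{
    /** @var array<string, int> word => accumulated frequency */
    private array $entries = [];

    public function add(string $word, int $frequency): void
    {
        // Repeated words accumulate, mirroring how adding an existing
        // word increases its count.
        $this->entries[$word] = ($this->entries[$word] ?? 0) + $frequency;
        // In real use, also call $symSpell->createDictionaryEntry($word, $frequency).
    }

    public function save(string $filepath): void
    {
        $lines = [];
        foreach ($this->entries as $word => $frequency) {
            $lines[] = "$word $frequency"; // "word frequency" format
        }
        file_put_contents($filepath, implode("\n", $lines) . "\n");
    }
}

$list = new TrackedWordList();
$list->add('blockchain', 500000);
$list->add('zoom', 1000);
$list->add('zoom', 4000); // accumulates to 5000
$path = sys_get_temp_dir() . '/custom_dictionary.txt';
$list->save($path);
```

The saved file uses the same space-separated layout as the other dictionaries in this guide, so it can be reloaded with loadDictionary($path, 0, 1).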
Real-World Examples
E-commerce Product Search
// Create product-specific dictionary
$productDict = new SymSpell();
// Add brand names
$brands = ['nike' => 1000000, 'adidas' => 900000, 'puma' => 700000];
foreach ($brands as $brand => $freq) {
$productDict->createDictionaryEntry($brand, $freq);
}
// Add product categories
$categories = ['sneakers' => 500000, 'running' => 400000, 'basketball' => 300000];
foreach ($categories as $category => $freq) {
$productDict->createDictionaryEntry($category, $freq);
}
// Handle user search with typos
$search = 'addidas sneekers';
$suggestions = $productDict->lookupCompound($search, 2);
echo $suggestions[0]->term; // "adidas sneakers"
Medical Records System
// Build medical terminology dictionary
$medicalDict = new SymSpell();
// Load standard medical terms
$medicalDict->loadDictionary('medical_terms_standard.txt', 0, 1);
// Add institution-specific terms
$hospitalTerms = [
'emr' => 100000, // Electronic Medical Records
'icu' => 90000, // Intensive Care Unit
'er' => 85000, // Emergency Room
'mri' => 80000, // Magnetic Resonance Imaging
'ct' => 75000, // Computed Tomography
];
foreach ($hospitalTerms as $term => $freq) {
$medicalDict->createDictionaryEntry($term, $freq);
}
Legal Document Processing
// Legal terminology dictionary
$legalDict = new SymSpell();
// Common legal terms
$legalTerms = [
'plaintiff' => 100000,
'defendant' => 100000,
'litigation' => 80000,
'jurisdiction' => 70000,
'precedent' => 60000,
'testimony' => 50000,
'affidavit' => 40000,
];
foreach ($legalTerms as $term => $freq) {
$legalDict->createDictionaryEntry($term, $freq);
}
// Handle legal document OCR errors
$ocrText = 'plantiff vs defendnt';
$corrected = $legalDict->lookupCompound($ocrText, 2);
echo $corrected[0]->term; // "plaintiff vs defendant"
Troubleshooting
Common Issues
- Words not being suggested
  - Check if frequency is above countThreshold
  - Verify word is within maxEditDistance
  - Ensure dictionary loaded successfully
- Wrong suggestions prioritized
  - Adjust word frequencies
  - Use bigram dictionary for context
  - Consider using Verbosity::All to see all options
- Memory usage too high
  - Reduce prefixLength (5-7 recommended)
  - Use countThreshold to filter rare words
  - Load only necessary dictionaries
- Slow dictionary loading
  - Pre-process and clean dictionary files
  - Use binary/serialized format for faster loading
  - Load dictionaries once and reuse instance
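The first checklist (words not being suggested) can be turned into a quick diagnostic run against your own word => frequency array. The following is a rough sketch; `diagnoseMissingSuggestion` is a hypothetical helper. It assumes entries at or above countThreshold are kept, and it uses PHP's built-in levenshtein(), which works on bytes and counts a transposition as two edits, so its result may be higher than the distance SymSpell itself computes.

```php
<?php
// Hypothetical diagnostic: explain why $expected might not be suggested
// for $input, given a word => frequency array of what was loaded.
function diagnoseMissingSuggestion(
    array $frequencies,
    string $input,
    string $expected,
    int $maxEditDistance = 2,
    int $countThreshold = 1
): string {
    if (!isset($frequencies[$expected])) {
        return 'not in dictionary';
    }
    // Assumption: words with a count at or above the threshold are kept.
    if ($frequencies[$expected] < $countThreshold) {
        return 'frequency below countThreshold';
    }
    // Plain Levenshtein is an upper bound on a Damerau-style distance,
    // so treat this result as a hint rather than a verdict.
    if (levenshtein($input, $expected) > $maxEditDistance) {
        return 'edit distance may exceed maxEditDistance';
    }
    return 'ok';
}

$loaded = ['covid' => 1000000, 'cryptocurrency' => 300000];
echo diagnoseMissingSuggestion($loaded, 'covd', 'covid'), "\n"; // prints "ok"
```

A result of "ok" suggests the word should be reachable, which points toward ranking (frequencies or verbosity) rather than a missing entry.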