Dictionary Customization Guide

August 11, 2025

This guide explains how to customize dictionaries, add new words, and create domain-specific vocabularies for SymSpell PHP.

Adding Words to an Existing Dictionary

You can dynamically add words to a loaded dictionary:

$symSpell = new SymSpell();

// Load base dictionary
$symSpell->loadDictionary('frequency_dictionary_en_82_765.txt', 0, 1);

// Add new words with frequencies
$symSpell->createDictionaryEntry('covid', 1000000);      // New word
$symSpell->createDictionaryEntry('blockchain', 500000);   // Technical term
$symSpell->createDictionaryEntry('cryptocurrency', 300000); // Domain-specific

// Update frequency of existing word (increases count)
$symSpell->createDictionaryEntry('zoom', 5000000); // Boost existing word

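// The new words are now eligible as suggestions
$suggestions = $symSpell->lookup('covd', Verbosity::Top, 2);
// Verbosity::Top returns the closest match and prefers higher frequency on ties
echo $suggestions[0]->term; // "covid", unless a more frequent word is equally close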

Creating a Custom Dictionary from Scratch

Build a dictionary from your own text corpus:

$symSpell = new SymSpell();

// Method 1: Create from a text corpus file
$symSpell->createDictionary('path/to/corpus.txt');

// Method 2: Add words programmatically
$words = [
    'machine' => 50000,
    'learning' => 45000,
    'artificial' => 30000,
    'intelligence' => 35000,
    'neural' => 20000,
    'network' => 25000
];

foreach ($words as $word => $frequency) {
    $symSpell->createDictionaryEntry($word, $frequency);
}

// Save custom dictionary for reuse
$customDict = [];
foreach ($words as $word => $freq) {
    $customDict[] = "$word $freq";
}
file_put_contents('custom_dictionary.txt', implode("\n", $customDict));

Combining Multiple Dictionaries

Merge general and domain-specific dictionaries:

$symSpell = new SymSpell();

// Load general English dictionary
$symSpell->loadDictionary('frequency_dictionary_en_82_765.txt', 0, 1);

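// Add medical terminology (one "term frequency" pair per line)
$medicalTerms = file('medical_terms.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($medicalTerms as $line) {
    // Split on any whitespace and skip malformed lines
    $parts = preg_split('/\s+/', trim($line));
    if (count($parts) !== 2) {
        continue;
    }
    [$term, $frequency] = $parts;
    $symSpell->createDictionaryEntry($term, (int) $frequency);
}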

// Add company-specific terms
$symSpell->createDictionaryEntry('acme', 100000);       // Company name
$symSpell->createDictionaryEntry('productname', 50000); // Product names
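
When merging term files from several sources, it can help to validate each line before passing it on. The following is a minimal sketch of such a check; `parseDictionaryLine` is a hypothetical helper, not part of SymSpell PHP, and it assumes the "term count" format used throughout this guide.

```php
<?php
/**
 * Parse one "term count" line.
 * Returns [term, count], or null when the line is malformed.
 * Hypothetical helper for illustration; not a SymSpell PHP API.
 */
function parseDictionaryLine(string $line): ?array
{
    $parts = preg_split('/\s+/', trim($line));
    if ($parts === false || count($parts) !== 2) {
        return null; // blank line, missing count, or extra columns
    }

    [$term, $count] = $parts;
    if ($term === '' || !preg_match('/^\d+$/', $count)) {
        return null; // the count must be a non-negative integer
    }

    return [$term, (int) $count];
}

// Usage with the $symSpell instance from the example above:
// foreach (file('medical_terms.txt', FILE_IGNORE_NEW_LINES) as $line) {
//     $entry = parseDictionaryLine($line);
//     if ($entry !== null) {
//         $symSpell->createDictionaryEntry($entry[0], $entry[1]);
//     }
// }
```

Returning null instead of throwing lets the caller decide whether to skip, log, or abort on a bad line.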

Building Domain-Specific Dictionaries

Create specialized dictionaries for your use case:

// Medical dictionary example
$medicalSymSpell = new SymSpell(
    initialCapacity: 50000,
    maxDictionaryEditDistance: 2,
    prefixLength: 7,
    countThreshold: 1
);

// Add medical terms with appropriate frequencies
$medicalTerms = [
    'diabetes' => 100000,
    'hypertension' => 95000,
    'pneumonia' => 80000,
    'antibiotics' => 75000,
    'diagnosis' => 90000,
    'symptoms' => 85000,
    'treatment' => 95000,
    'prescription' => 70000
];

foreach ($medicalTerms as $term => $frequency) {
    $medicalSymSpell->createDictionaryEntry($term, $frequency);
}

Extracting Dictionary from Existing Text

Generate a frequency dictionary from your documents:

function createDictionaryFromDocuments(array $documents): array {
    $wordFrequencies = [];
    
    foreach ($documents as $document) {
        // Tokenize and count words
        $words = preg_split('/\s+/', strtolower($document));
        
        foreach ($words as $word) {
            // Clean word (remove punctuation except apostrophes)
            $word = preg_replace("/[^a-z0-9']/", '', $word);
            
            if (strlen($word) > 1) { // Skip single characters
                $wordFrequencies[$word] = ($wordFrequencies[$word] ?? 0) + 1;
            }
        }
    }
    
    // Sort by frequency
    arsort($wordFrequencies);
    
    return $wordFrequencies;
}

// Use it
$documents = [
    file_get_contents('doc1.txt'),
    file_get_contents('doc2.txt'),
    // ...
];

$frequencies = createDictionaryFromDocuments($documents);

// Create SymSpell instance with custom dictionary
$symSpell = new SymSpell();
foreach ($frequencies as $word => $count) {
    // PHP turns numeric-looking keys such as "2024" into ints, so cast back
    $symSpell->createDictionaryEntry((string) $word, $count);
}
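
The function above lowercases with strtolower and drops every character outside a-z, 0-9 and the apostrophe, so accented words lose letters. For UTF-8 text, a Unicode-aware variant might look like the sketch below. `countWordsUnicode` is a hypothetical name, and the fallback to strtolower when the mbstring extension is missing is an assumption about your environment.

```php
<?php
/**
 * Count word frequencies in UTF-8 text while keeping accented letters.
 * Hypothetical helper for illustration; not part of SymSpell PHP.
 */
function countWordsUnicode(string $text): array
{
    // Prefer multibyte lowercasing; fall back if mbstring is not installed
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // \p{L} = any letter, \p{N} = any number; /u enables UTF-8 mode.
    // {2,} skips single characters, as the original function does.
    preg_match_all('/[\p{L}\p{N}\']{2,}/u', $lower, $matches);

    $counts = [];
    foreach ($matches[0] ?? [] as $word) {
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts); // most frequent first
    return $counts;
}
```

The result has the same word => count shape as createDictionaryFromDocuments, so it can feed the same loop.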

Updating Word Frequencies

Adjust frequencies based on user behavior:

class AdaptiveSpellChecker {
    private SymSpell $symSpell;
    private array $userCorrections = [];
    
    public function __construct() {
        $this->symSpell = new SymSpell();
        $this->symSpell->loadDictionary('base_dictionary.txt', 0, 1);
    }
    
    public function recordUserSelection(string $misspelled, string $selected): void {
        // Track what users actually select
        $this->userCorrections[$misspelled] = $selected;
        
        // Boost frequency of selected word
        $this->symSpell->createDictionaryEntry($selected, 10000);
    }
    
    public function getSuggestions(string $word): array {
        // Check user's previous corrections first
        if (isset($this->userCorrections[$word])) {
            return [new SuggestItem(
                $this->userCorrections[$word], 
                0, 
                PHP_INT_MAX
            )];
        }
        
        return $this->symSpell->lookup($word, Verbosity::Top, 2);
    }
}

Dictionary Format Guidelines

When creating custom dictionaries:

  1. Format: word frequency (space-separated)
  2. Encoding: UTF-8 for international support
  3. Case: Lowercase recommended
  4. Frequencies: Higher = more likely to be suggested
    • Common words: 1,000,000+
    • Regular words: 10,000 - 999,999
    • Rare words: 100 - 9,999
    • Very rare: 1 - 99

Example dictionary file:

the 23135851162
quick 5428674
brown 2357452
fox 834729
jumps 425867

Performance Considerations

  • Dictionary size: Each word uses ~20-30 bytes of memory
  • Loading time: ~0.6ms per 1000 words
  • Lookup speed: Largely independent of dictionary size (hash lookups are O(1) on average); cost grows with the maximum edit distance and word length
  • Memory usage:
    • 10K words: ~300KB
    • 100K words: ~3MB
    • 1M words: ~30MB
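
These figures are rough and will vary with PHP version, settings, and dictionary content, so it is worth measuring on your own data. A minimal sketch of a measurement wrapper follows; `measureLoad` is a hypothetical helper built only on PHP's hrtime and memory_get_usage.

```php
<?php
/**
 * Run $load and report elapsed time and memory growth.
 * Hypothetical helper for illustration; not part of SymSpell PHP.
 */
function measureLoad(callable $load): array
{
    $memBefore = memory_get_usage();
    $start = hrtime(true); // monotonic clock, in nanoseconds

    // Keep the result so its memory is still allocated when we measure
    $result = $load();

    $elapsedMs = (hrtime(true) - $start) / 1e6;
    $memDelta = memory_get_usage() - $memBefore;

    return ['result' => $result, 'ms' => $elapsedMs, 'bytes' => $memDelta];
}

// Usage with the loading call shown earlier in this guide:
// $stats = measureLoad(function () {
//     $symSpell = new SymSpell();
//     $symSpell->loadDictionary('frequency_dictionary_en_82_765.txt', 0, 1);
//     return $symSpell;
// });
// printf("%.1f ms, %.1f MB\n", $stats['ms'], $stats['bytes'] / 1048576);
```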

Tips for Optimal Dictionary Creation

1. Frequency Distribution

Use logarithmic or Zipfian distribution for realistic frequencies:

function generateFrequencies(array $words): array {
    $frequencies = [];
    $maxFreq = 1000000;
    
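    // $words must be a list ordered from most to least frequent
    foreach ($words as $index => $word) {
        // Zipf-like power law: frequency falls with rank. The exponent 1.5
        // is steeper than classic Zipf (exponent 1). Clamp so no word gets 0.
        $frequencies[$word] = max(1, intval($maxFreq / pow($index + 1, 1.5)));
    }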
    
    return $frequencies;
}
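
Written out, generateFrequencies assigns a frequency by rank, where the rank r is the list index plus one. The values below follow from the constants in the function; classic Zipf's law corresponds to s = 1, so s = 1.5 is a steeper curve.

```latex
f(r) = \left\lfloor \frac{f_{\max}}{r^{s}} \right\rfloor,
\qquad f_{\max} = 10^{6},\quad s = 1.5

f(1) = 1\,000\,000, \quad
f(2) = \left\lfloor \frac{10^{6}}{2^{1.5}} \right\rfloor = 353\,553, \quad
f(4) = \frac{10^{6}}{8} = 125\,000, \quad
f(100) = \frac{10^{6}}{1000} = 1\,000
```

Because (10^4)^{1.5} = 10^6, the quotient drops below 1 for every rank above 10,000, so longer word lists need a floor of 1 or a smaller exponent.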

2. Handling Multi-Word Terms

For phrases and compound terms:

// Add as single entry with underscores or hyphens
$symSpell->createDictionaryEntry('new_york', 500000);
$symSpell->createDictionaryEntry('san-francisco', 400000);

// Or use the bigram dictionary for better context
$symSpell->loadBigramDictionary('bigram_dictionary.txt', 0, 2);

3. Filtering Invalid Words

Clean your dictionary before loading:

function validateWord(string $word): bool {
    // Skip if too short or too long
    $length = strlen($word);
    if ($length < 2 || $length > 30) return false;
    
    // Skip if contains invalid characters
    if (!preg_match('/^[a-z0-9\'-]+$/i', $word)) return false;
    
    // Skip if all numbers
    if (is_numeric($word)) return false;
    
    return true;
}

// Filter dictionary ($entries is your word => frequency map, loaded elsewhere)
$validEntries = [];
foreach ($entries as $word => $freq) {
    if (validateWord($word)) {
        $validEntries[$word] = $freq;
    }
}

4. Saving and Loading Custom Dictionaries

Efficiently persist your custom dictionaries:

class DictionaryManager {
    public static function saveDictionary(SymSpell $symSpell, string $filepath): void {
        $entries = [];
        
        // Note: SymSpell doesn't expose internal dictionary directly
        // You'll need to maintain your own word list when adding
        // This is a conceptual example
        
        file_put_contents($filepath, implode("\n", $entries));
    }
    
    public static function loadDictionary(string $filepath): SymSpell {
        $symSpell = new SymSpell();
        $symSpell->loadDictionary($filepath, 0, 1);
        return $symSpell;
    }
    
    public static function mergeDictionaries(array $filepaths): SymSpell {
        $symSpell = new SymSpell();
        
        foreach ($filepaths as $filepath) {
            $symSpell->loadDictionary($filepath, 0, 1);
        }
        
        return $symSpell;
    }
}
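
Since saveDictionary above cannot read entries back out of SymSpell, one option is to record entries yourself as you add them. The sketch below assumes a hypothetical `DictionaryEntryLog` class that you call alongside createDictionaryEntry. It sums repeated counts, following this guide's note that re-adding a word increases its count.

```php
<?php
/**
 * Records every entry you add so the dictionary can be written to disk.
 * Hypothetical class for illustration; not part of SymSpell PHP.
 */
final class DictionaryEntryLog
{
    /** @var array<string,int> word => accumulated count */
    private array $counts = [];

    public function add(string $word, int $count): void
    {
        // Adding the same word again accumulates its count
        $this->counts[$word] = ($this->counts[$word] ?? 0) + $count;
    }

    public function save(string $filepath): void
    {
        $lines = [];
        foreach ($this->counts as $word => $count) {
            $lines[] = $word . ' ' . $count; // "word frequency" format
        }
        file_put_contents($filepath, implode("\n", $lines));
    }
}

// Usage next to an existing SymSpell instance:
// $log = new DictionaryEntryLog();
// $symSpell->createDictionaryEntry('acme', 100000);
// $log->add('acme', 100000);
// $log->save('custom_dictionary.txt');
```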

Real-World Examples

E-commerce Search

// Create product-specific dictionary
$productDict = new SymSpell();

// Add brand names
$brands = ['nike' => 1000000, 'adidas' => 900000, 'puma' => 700000];
foreach ($brands as $brand => $freq) {
    $productDict->createDictionaryEntry($brand, $freq);
}

// Add product categories
$categories = ['sneakers' => 500000, 'running' => 400000, 'basketball' => 300000];
foreach ($categories as $category => $freq) {
    $productDict->createDictionaryEntry($category, $freq);
}

// Handle user search with typos
$search = 'addidas sneekers';
$suggestions = $productDict->lookupCompound($search, 2);
echo $suggestions[0]->term; // "adidas sneakers"

Medical Records System

// Build medical terminology dictionary
$medicalDict = new SymSpell();

// Load standard medical terms
$medicalDict->loadDictionary('medical_terms_standard.txt', 0, 1);

// Add institution-specific terms
$hospitalTerms = [
    'emr' => 100000,        // Electronic Medical Records
    'icu' => 90000,         // Intensive Care Unit
    'er' => 85000,          // Emergency Room
    'mri' => 80000,         // Magnetic Resonance Imaging
    'ct' => 75000,          // Computed Tomography
];

foreach ($hospitalTerms as $term => $freq) {
    $medicalDict->createDictionaryEntry($term, $freq);
}

Legal Document Processing

// Legal terminology dictionary
$legalDict = new SymSpell();

// Common legal terms
$legalTerms = [
    'plaintiff' => 100000,
    'defendant' => 100000,
    'litigation' => 80000,
    'jurisdiction' => 70000,
    'precedent' => 60000,
    'testimony' => 50000,
    'affidavit' => 40000,
];

foreach ($legalTerms as $term => $freq) {
    $legalDict->createDictionaryEntry($term, $freq);
}

// Handle legal document OCR errors
$ocrText = 'plantiff vs defendnt';
$corrected = $legalDict->lookupCompound($ocrText, 2);
echo $corrected[0]->term; // "plaintiff vs defendant"

Troubleshooting

Common Issues

  1. Words not being suggested

    • Check that the word's frequency meets countThreshold
    • Verify the word is within the edit distance passed to lookup() and within maxDictionaryEditDistance
    • Ensure dictionary loaded successfully
  2. Wrong suggestions prioritized

    • Adjust word frequencies
    • Use bigram dictionary for context
    • Consider using Verbosity::All to see all options
  3. Memory usage too high

    • Reduce prefixLength (5-7 recommended)
    • Use countThreshold to filter rare words
    • Load only necessary dictionaries
  4. Slow dictionary loading

    • Pre-process and clean dictionary files
    • Use binary/serialized format for faster loading
    • Load dictionaries once and reuse instance
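
For the serialized-format tip, one possible approach is to cache the parsed and cleaned entries with PHP's serialize, as sketched below. `loadEntriesCached` is a hypothetical helper. It only skips re-parsing the text file: the entries still have to be added with createDictionaryEntry, and whether a SymSpell instance itself can be serialized is not something this guide confirms.

```php
<?php
/**
 * Return parsed entries from a cache file when it is at least as new as the
 * source; otherwise parse the text dictionary and refresh the cache.
 * Hypothetical helper for illustration; not a SymSpell PHP feature.
 */
function loadEntriesCached(string $sourcePath, string $cachePath): array
{
    clearstatcache(); // avoid stale file metadata within one request

    if (is_file($cachePath) && filemtime($cachePath) >= filemtime($sourcePath)) {
        // allowed_classes => false: never instantiate objects from the cache
        $cached = unserialize(file_get_contents($cachePath), ['allowed_classes' => false]);
        if (is_array($cached)) {
            return $cached;
        }
    }

    $entries = [];
    foreach (file($sourcePath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $parts = preg_split('/\s+/', trim($line));
        // Keep only well-formed "word frequency" lines
        if (count($parts) === 2 && preg_match('/^\d+$/', $parts[1])) {
            $entries[$parts[0]] = (int) $parts[1];
        }
    }

    file_put_contents($cachePath, serialize($entries));
    return $entries;
}
```

Measure before adopting this: the gain depends on how much cleaning and validation the text path performs.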

Further Resources