Length-Constrained Segmentation
January 21, 2026
This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit (NB: auto-generated).
Overview
A text segmenter like SaT gives us a probability score for every character, indicating how likely that position is to be a segment boundary (e.g., end of a sentence).
The Basic Approach
Simply split whenever probability > threshold.
Problem: No control over segment lengths!
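As a minimal sketch of the thresholding rule, assuming the model output is available as a plain list of per-character probabilities (the helper name split_by_threshold and the toy values are illustrative and not part of wtpsplit):

```python
def split_by_threshold(text, probs, threshold=0.5):
    """Cut after every character whose boundary probability exceeds the threshold."""
    segments, start = [], 0
    for i, p in enumerate(probs):
        if p > threshold:
            segments.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        segments.append(text[start:])  # trailing text without a boundary
    return segments


text = "Hi there. Bye."
probs = [0.0] * len(text)   # toy values, not real model output
probs[8] = 0.9              # the period after "there"
probs[13] = 0.9             # the final period
segments = split_by_threshold(text, probs)
print(segments)  # ['Hi there.', ' Bye.']
```

Nothing in this rule bounds the segment lengths: a long stretch without a high-probability position stays in one piece.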
The Controllable Approach
Note: When using length-constrained segmentation (max_length is set), the threshold parameter is ignored. The algorithms use raw model probabilities directly to find optimal split points.
Define a prior probability distribution over chunk lengths, then solve an optimization problem that balances:
- The model's boundary predictions
- Your length preferences
The Maths
We view segmentation as selecting a subset of positions C = {C₁, C₂, ..., Cₖ} where each Cᵢ marks the end of a chunk.
Optimization Problem:
argmax_C ∏ᵢ Prior(Cᵢ - Cᵢ₋₁) × P(Cᵢ)
Where:
- Cᵢ = position of the i-th split point
- Prior(length) = how much we prefer chunks of that length
- P(Cᵢ) = model's probability at position Cᵢ
In Bayesian terms:
- Prior(Cᵢ - Cᵢ₋₁) is the prior
- P(Cᵢ) is the evidence
- Their product is the (unnormalized) posterior
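To make the objective concrete, here is a small sketch that scores a candidate segmentation in log space, where the product becomes a sum. The function name log_score, the convention C₀ = 0, the exclusive end offsets, and the toy probabilities are assumptions made for illustration only:

```python
import math


def log_score(split_points, probs, prior):
    """Sum of log(Prior(C_i - C_{i-1})) + log(P(C_i)), assuming C_0 = 0.

    split_points are chunk end offsets (exclusive), so the boundary
    probability of a chunk ending at offset c is probs[c - 1].
    """
    total, prev = 0.0, 0
    for c in split_points:
        total += math.log(prior(c - prev)) + math.log(probs[c - 1])
        prev = c
    return total


def toy_prior(length, target_length=5, spread=2):
    # Gaussian-shaped length preference, used only as an example prior.
    return math.exp(-0.5 * ((length - target_length) / spread) ** 2)


probs = [0.01] * 10   # toy values, not real model output
probs[4] = 0.9        # strong boundary after 5 characters
probs[9] = 0.9        # strong boundary at the end

balanced = log_score([5, 10], probs, toy_prior)
lopsided = log_score([2, 10], probs, toy_prior)
print(balanced > lopsided)  # True
```

The segmentation that cuts at the high-probability position and keeps both chunks near the preferred length scores higher than the one that cuts early.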
Prior Functions
Which Prior Should I Use?
| Prior | Recommendation | Best For |
|---|---|---|
| Uniform | Default | Just enforce max_length, let model decide split points |
| Gaussian ⭐ | Recommended | Prefer segments around a target length (intuitive, easy to tune) |
| Log-Normal | Advanced | Right-skewed preference (more tolerant of longer segments) |
| Clipped Polynomial | Special cases | When segments MUST be very close to target length |
TL;DR: Use uniform (default) if you just need max_length. Use gaussian if you want to prefer a specific segment size.
Prior Visualization
The figure below shows how each prior behaves with target_length=50, spread=15, and max_length=100:

Key differences:
- Uniform: Flat until max_length, then drops to zero
- Gaussian: Symmetric bell curve centered at target_length
- Clipped Polynomial: Parabola that clips to exactly zero at ±spread from target
- Log-Normal: Asymmetric (right-skewed) — more tolerant of longer segments, stricter on shorter ones
1. Uniform Prior (Default)
Prior(length) = 1.0 if length ≤ max_length else 0.0
- All lengths equally good up to max_length
- Hard cutoff at maximum
- Use case: Simple length limiting — let the model decide optimal splits
2. Gaussian Prior ⭐ Recommended
Prior(length) = exp(-0.5 × ((length - target_length) / spread)²)
- Peaks at target_length (preferred length)
- Falls off smoothly based on spread (standard deviation)
- Symmetric around target
- Use case: Prefer specific chunk sizes (e.g., ~512 chars for embedding models)
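The Gaussian formula can be evaluated directly. The sketch below is a standalone transcription of it (the function name gaussian_prior is illustrative, not the library's internal name), using the target_length=50, spread=15 setting from the visualization section:

```python
import math


def gaussian_prior(length, target_length, spread):
    """Unnormalized Gaussian preference: 1.0 at the target, smaller further away."""
    return math.exp(-0.5 * ((length - target_length) / spread) ** 2)


for length in (20, 35, 50, 65, 80):
    print(length, round(gaussian_prior(length, 50, 15), 3))
# 20 0.135
# 35 0.607
# 50 1.0
# 65 0.607
# 80 0.135
```

One spread away from the target the weight is about 0.61, two spreads away about 0.14, and it never reaches exactly zero.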
3. Clipped Polynomial Prior
Prior(length) = max(1 - ((length - target_length) / spread)², 0)
- Peaks at target_length
- Clips to exactly zero at ±spread characters from target
- More aggressive than Gaussian: creates a "hard window" around target
- Use case: Strong enforcement when segments must be close to target
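A standalone sketch of the clipped polynomial, again with target_length=50 and spread=15 (the function name is illustrative), shows the hard window: any length 15 or more characters away from the target gets weight zero.

```python
def clipped_polynomial_prior(length, target_length, spread):
    """Inverted parabola around the target, clipped at zero outside +/- spread."""
    return max(1.0 - ((length - target_length) / spread) ** 2, 0.0)


for length in (30, 35, 40, 50, 60, 65, 70):
    print(length, round(clipped_polynomial_prior(length, 50, 15), 3))
# 30 0.0
# 35 0.0
# 40 0.556
# 50 1.0
# 60 0.556
# 65 0.0
# 70 0.0
```

Because the weight is exactly zero outside the window, a split that would create such a chunk contributes a zero factor to the product, whereas the Gaussian only penalizes it.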
4. Log-Normal Prior
Prior(length) = exp(-0.5 × ((log(length) - μ) / σ)²) / length
- Right-skewed distribution (asymmetric around target)
- More tolerant of segments longer than target (long right tail)
- Drops off faster for segments shorter than target
- target_length sets the peak (mode)
- spread is given in characters (same scale as other priors for consistency)
- Use case: When you want to be more lenient with longer segments while preferring target length
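The text does not spell out how μ and σ are obtained from target_length and spread, so the following is only one possible mapping, not necessarily the one wtpsplit uses. It assumes σ = spread / target_length and relies on the fact that this density peaks at exp(μ - σ²), so choosing μ = log(target_length) + σ² places the mode at target_length:

```python
import math


def lognormal_prior(length, target_length, spread):
    # ASSUMPTION: convert the character-scale spread into a log-scale sigma.
    sigma = spread / target_length
    # The mode of this density is exp(mu - sigma^2), so this mu peaks at target_length.
    mu = math.log(target_length) + sigma ** 2
    return math.exp(-0.5 * ((math.log(length) - mu) / sigma) ** 2) / length


# The highest-scoring integer length is the target itself.
peak = max(range(1, 201), key=lambda n: lognormal_prior(n, 50, 15))
print(peak)  # 50

# Right skew: 20 characters too long is penalized less than 20 characters too short.
print(lognormal_prior(70, 50, 15) > lognormal_prior(30, 50, 15))  # True
```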
Algorithms
Greedy Search
At each step, pick the locally best split point based on Prior(length) × P(position).
Pros:
- Fast (O(n × max_length))
- Simple to understand
Cons:
- Not globally optimal
- May miss better overall segmentations
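A minimal sketch of such a greedy pass, assuming per-character probabilities in a list and a prior given as a function of chunk length. The name greedy_split, the exclusive end offsets, and the tie-breaking are illustrative choices, not the wtpsplit implementation:

```python
def greedy_split(probs, prior, max_length, min_length=1):
    """Choose chunk end offsets left to right, one local decision at a time."""
    n, start, ends = len(probs), 0, []
    while start < n:
        lo = min(start + min_length, n)
        hi = min(start + max_length, n)
        # Score every admissible end offset for the chunk starting at `start`.
        end = max(range(lo, hi + 1),
                  key=lambda e: prior(e - start) * probs[e - 1])
        ends.append(end)
        start = end
    return ends


probs = [0.01] * 10   # toy values, not real model output
probs[4] = 0.9        # boundary after 5 characters
probs[9] = 0.9        # boundary at the end
ends = greedy_split(probs, prior=lambda length: 1.0, max_length=6)
print(ends)  # [5, 10]
```

Each choice only looks at the current window of at most max_length positions, which is why an early decision can rule out a better split later on.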
Viterbi Algorithm (Recommended)
Dynamic programming to find the globally optimal sequence of splits.
Algorithm:
dp[i] = best score achievable for text[0:i]
dp[i] = max over j of: dp[j] + log(Prior(i-j)) + log(P[i-1])
Where j ranges from max(0, i-max_length) to i-min_length
Pros:
- Globally optimal solution
- Respects sentence boundaries when possible
Cons:
- Slightly slower in practice, although the asymptotic cost is the same (O(n × max_length))
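The recurrence can be written out as a short dynamic program with back-pointers. This is a sketch under the same conventions as the formula (dp indexed by prefix length, boundary probability read from probs[i - 1]); the function name, the handling of zero probabilities, and the toy input are assumptions and not the library's code:

```python
import math


def viterbi_split(probs, prior, max_length, min_length=1):
    """Return the chunk end offsets that maximize the summed log scores."""
    n = len(probs)
    dp = [-math.inf] * (n + 1)   # dp[i]: best log score for the first i characters
    back = [0] * (n + 1)         # back[i]: start offset of the last chunk in that solution
    dp[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_length), i - min_length + 1):
            p, w = probs[i - 1], prior(i - j)
            if dp[j] == -math.inf or p <= 0.0 or w <= 0.0:
                continue  # unreachable prefix or zero-weight chunk
            score = dp[j] + math.log(w) + math.log(p)
            if score > dp[i]:
                dp[i], back[i] = score, j
    if dp[n] == -math.inf:
        raise ValueError("no segmentation satisfies the length constraints")
    # Walk the back-pointers from the end of the text to recover the chunk ends.
    ends, i = [], n
    while i > 0:
        ends.append(i)
        i = back[i]
    return ends[::-1]


probs = [0.01] * 10   # toy values, not real model output
probs[1] = 0.9        # tempting early boundary at offset 2
probs[4] = 0.8        # boundary at offset 5
probs[9] = 0.9        # boundary at the end
ends = viterbi_split(probs, prior=lambda length: 1.0, max_length=5)
print(ends)  # [5, 10]
```

In this toy input, a pass that always takes the single best boundary in the current window would cut at offset 2 first and then still need offsets 5 and 10, for a product of 0.9 × 0.8 × 0.9 = 0.648. The dynamic program skips the early boundary and reaches 0.8 × 0.9 = 0.72.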
Key Guarantees
- max_length is STRICT: No segment will ever exceed max_length characters
- min_length is BEST EFFORT: Segments may be shorter if merging would violate max_length
- Text preservation: use the correct join method:
# With constraints (max_length or min_length set):
original_text = "".join(segments) # ← newlines may be inside segments
# Without constraints (SaT default with split_on_input_newlines=True):
original_text = "\n".join(segments)
# Without constraints (WtP or split_on_input_newlines=False):
original_text = "".join(segments)
Why the difference? When constraints force mid-line splits (line > max_length), we can't split exactly at newlines. Newlines stay embedded in segments to preserve text. The algorithm boosts probabilities at newline positions to prefer splitting there when possible.
Usage Examples
Basic Length Limiting
from wtpsplit import SaT
sat = SaT("sat-3l-sm")
text = "This is a test This is another test."  # any input string
# Limit segments to 100 characters
segments = sat.split(text, max_length=100)
Both Min and Max
# Segments between 20-100 characters
segments = sat.split(text, min_length=20, max_length=100)
Using Gaussian Prior
# Prefer ~50 character segments
segments = sat.split(
    text,
    max_length=100,
    prior_type="gaussian",
    prior_kwargs={"target_length": 50, "spread": 15},
)
Algorithm Selection
# Use greedy for speed (slightly suboptimal)
segments = sat.split(text, max_length=100, algorithm="greedy")
# Use viterbi for optimal results (default)
segments = sat.split(text, max_length=100, algorithm="viterbi")
How It Respects Sentence Boundaries
The Viterbi algorithm naturally prefers splitting at high-probability positions (sentence boundaries) because:
- P(position) is high at sentence boundaries
- The product Prior(length) × P(position) is maximized when both factors are high
- The algorithm finds the global optimum, so it won't make a locally good choice that leads to bad splits later
Example:
Text: "The quick brown fox jumps. Pack my box with jugs."
With max_length=100:
- Position 25 (0-indexed, the period after "jumps") has P ≈ 0.95
- Position 48 (the final period, after "jugs") has P ≈ 0.98
- The algorithm will split at these natural boundaries
With max_length=30:
- Must split somewhere before position 30
- Will choose position 25 (sentence boundary) over position 28 (mid-word)
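The max_length=30 case can be reproduced with made-up numbers. The probabilities below are hypothetical stand-ins for model output, positions are 0-indexed character indices, and a uniform prior is assumed so that only the model probability matters:

```python
text = "The quick brown fox jumps. Pack my box with jugs."
max_length = 30

# Hypothetical per-character boundary probabilities (not real model output).
probs = [0.01] * len(text)
probs[text.index(".")] = 0.95    # period after "jumps"
probs[text.rindex(".")] = 0.98   # final period

# A first segment ending at index i has i + 1 characters, so i must be below max_length.
candidates = range(max_length)
best = max(candidates, key=lambda i: probs[i])
print(best, repr(text[best]), repr(text[28]))  # 25 '.' 'a'
```

Index 28 falls inside "Pack", so it only becomes a split point when no better candidate exists in the window.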
When Word Cuts Happen
Word cuts only occur when there is no sentence boundary within max_length characters. This happens with:
- Very long sentences without punctuation
- Very restrictive max_length values
- Text without natural break points (e.g., code, URLs)
Implementation Details
Post-Processing (_enforce_segment_constraints)
After the algorithm finds split points, post-processing ensures:
- Segments don't exceed max_length (force-splits if needed)
- Short segments are merged when possible
- All whitespace is preserved
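The three guarantees can be illustrated with a simplified two-pass routine. This is a hypothetical stand-in written for this document, not the actual _enforce_segment_constraints code, and it merges only into the left neighbour:

```python
def enforce_constraints(segments, min_length, max_length):
    """Force-split oversized segments, then merge short ones when the result still fits."""
    # Pass 1: hard-split anything longer than max_length.
    pieces = []
    for seg in segments:
        for k in range(0, len(seg), max_length):
            pieces.append(seg[k:k + max_length])
    # Pass 2: merge neighbours when one of them is too short and the sum stays legal.
    merged = []
    for piece in pieces:
        too_short = merged and (len(piece) < min_length or len(merged[-1]) < min_length)
        if too_short and len(merged[-1]) + len(piece) <= max_length:
            merged[-1] += piece
        else:
            merged.append(piece)
    return merged


segments = ["abcdefghijkl", "mn", "opqrstuv"]
result = enforce_constraints(segments, min_length=3, max_length=5)
print(result)  # ['abcde', 'fghij', 'klmn', 'opqrs', 'tuv']
```

Only slicing and concatenation are used, so joining the output with the empty string gives back the input text, whitespace included. max_length always wins over min_length: a short piece stays short if merging would exceed the limit.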
Parameters Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| min_length | int | 1 | Minimum segment length (best effort) |
| max_length | int | None | Maximum segment length (strict) |
| algorithm | str | "viterbi" | "viterbi" (optimal) or "greedy" (faster) |
| prior_type | str | "uniform" | "uniform", "gaussian", "clipped_polynomial", or "lognormal" |
| prior_kwargs | dict | None | Parameters for prior function |
Prior Parameters
All priors support:
- lang_code: Optional language code (e.g., "en", "zh", "de") for language-aware defaults
Gaussian:
- target_length: Preferred length in characters (language-aware default if lang_code provided)
- spread: Standard deviation in characters (language-aware default if lang_code provided)
Clipped Polynomial:
- target_length: Peak position in characters (language-aware default if lang_code provided)
- spread: Tolerance in characters; clips to zero at ±spread from target (language-aware default if lang_code provided)
Log-Normal:
- target_length: Peak/mode in characters (language-aware default if lang_code provided)
- spread: Tolerance in characters, on the same scale as gaussian/clipped_polynomial (language-aware default if lang_code provided)
Language-Aware Defaults
When lang_code is provided in prior_kwargs but target_length is not specified,
the prior uses empirically-derived defaults for that language:
# Uses Chinese defaults: target_length=45, spread=15
sat.split(text, max_length=100, prior_type="gaussian",
          prior_kwargs={"lang_code": "zh"})
# Uses German defaults: target_length=90, spread=35
sat.split(text, max_length=150, prior_type="gaussian",
          prior_kwargs={"lang_code": "de"})
Supported languages include: zh, ja, ko (East Asian), de, en, fr, es, it, pt, ru, ar, and many more.
See LANG_SENTENCE_STATS in wtpsplit/utils/priors.py for the full list.
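Conceptually, resolving these defaults is a dictionary lookup with explicit values taking precedence. The sketch below uses a hypothetical two-entry table that copies only the zh and de values quoted above. The real LANG_SENTENCE_STATS structure may differ, and filling a missing spread when target_length is given explicitly is an assumption of this sketch:

```python
# Hypothetical table; the two entries copy the defaults quoted in this section.
EXAMPLE_DEFAULTS = {
    "zh": {"target_length": 45, "spread": 15},
    "de": {"target_length": 90, "spread": 35},
}


def resolve_prior_kwargs(prior_kwargs):
    """Fill target_length/spread from the language table unless given explicitly."""
    kwargs = dict(prior_kwargs or {})
    lang_code = kwargs.pop("lang_code", None)
    defaults = EXAMPLE_DEFAULTS.get(lang_code, {})
    return {**defaults, **kwargs}   # explicit values override the defaults


print(resolve_prior_kwargs({"lang_code": "zh"}))
# {'target_length': 45, 'spread': 15}
print(resolve_prior_kwargs({"lang_code": "de", "target_length": 120}))
# {'target_length': 120, 'spread': 35}
```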
See Also
- Interactive Demo - Run examples and experiment
- Test Suite - Comprehensive tests
- README - Quick start guide