Length-Constrained Segmentation

January 21, 2026 · View on GitHub

This supplementary document explains the theory and implementation of length-constrained segmentation in wtpsplit (NB: auto-generated).

Overview

A text segmenter like SaT gives us a probability score for every character, indicating how likely that position is to be a segment boundary (e.g., end of a sentence).

The Basic Approach

Simply split whenever probability > threshold.

Problem: No control over segment lengths!

The Controllable Approach

Note: When using length-constrained segmentation (max_length is set), the threshold parameter is ignored. The algorithms use raw model probabilities directly to find optimal split points.

Define a prior probability distribution over chunk lengths, then solve an optimization problem that balances:

The model's boundary predictions
Your length preferences

The Maths

We view segmentation as selecting a subset of positions C = {C₁, C₂, ..., Cₖ} where each Cᵢ marks the end of a chunk.

Optimization Problem:

argmax_C  ∏ᵢ Prior(Cᵢ - Cᵢ₋₁) × P(Cᵢ)

Where:

Cᵢ = position of the i-th split point
Prior(length) = how much we prefer chunks of that length
P(Cᵢ) = model's probability at position Cᵢ

In Bayesian terms:

Prior(Cᵢ - Cᵢ₋₁) is the prior
P(Cᵢ) is the evidence
Their product is the posterior

Prior Functions

Which Prior Should I Use?

Prior	Recommendation	Best For
Uniform	Default	Just enforce max_length, let model decide split points
Gaussian ⭐	Recommended	Prefer segments around a target length (intuitive, easy to tune)
Log-Normal	Advanced	Right-skewed preference (more tolerant of longer segments)
Clipped Polynomial	Special cases	When segments MUST be very close to target length

TL;DR: Use uniform (default) if you just need max_length. Use gaussian if you want to prefer a specific segment size.

Prior Visualization

The figure below shows how each prior behaves with target_length=50, spread=15, and max_length=100:

Prior Functions Comparison

Key differences:

Uniform: Flat until max_length, then drops to zero
Gaussian: Symmetric bell curve centered at target_length
Clipped Polynomial: Parabola that clips to exactly zero at ±spread from target
Log-Normal: Asymmetric (right-skewed) — more tolerant of longer segments, stricter on shorter ones

1. Uniform Prior (Default)

Prior(length) = 1.0 if length ≤ max_length else 0.0

All lengths equally good up to max_length
Hard cutoff at maximum
Use case: Simple length limiting — let the model decide optimal splits

2. Gaussian Prior ⭐ Recommended

Prior(length) = exp(-0.5 × ((length - target_length) / spread)²)

Peaks at target_length (preferred length)
Falls off smoothly based on spread (standard deviation)
Symmetric around target
Use case: Prefer specific chunk sizes (e.g., ~512 chars for embedding models)

3. Clipped Polynomial Prior

$\text{python} \text{Prior}(\text{length}) = \text{max}(1 - (1/\text{spread}²) \times (\text{length} - \text{target\_length})², 0) $

Peaks at target_length
Clips to exactly zero at ±spread characters from target
More aggressive than Gaussian — creates a "hard window" around target
Use case: Strong enforcement when segments must be close to target

4. Log-Normal Prior

Prior(length) = exp(-0.5 × ((log(length) - μ) / σ)²) / length

Right-skewed distribution (asymmetric around target)
More tolerant of segments longer than target (long right tail)
Drops off faster for segments shorter than target
target_length sets the peak (mode)
spread in characters (same scale as other priors for consistency)
Use case: When you want to be more lenient with longer segments while preferring target length

Algorithms

Greedy Search

At each step, pick the locally best split point based on Prior(length) × P(position).

Pros:

Fast (O(n × max_length))
Simple to understand

Cons:

Not globally optimal
May miss better overall segmentations

Viterbi Algorithm (Recommended)

Dynamic programming to find the globally optimal sequence of splits.

Algorithm:

dp[i] = best score achievable for text[0:i]
dp[i] = max over j of: dp[j] + log(Prior(i-j)) + log(P[i-1])

Where j ranges from max(0, i-max_length) to i-min_length

Pros:

Globally optimal solution
Respects sentence boundaries when possible

Cons:

Slightly slower (O(n × max_length))

Key Guarantees

max_length is STRICT: No segment will ever exceed max_length characters
min_length is BEST EFFORT: Segments may be shorter if merging would violate max_length
Text preservation — use the correct join method:

# With constraints (max_length or min_length set):
original_text = "".join(segments)  # ← newlines may be inside segments

# Without constraints (SaT default with split_on_input_newlines=True):
original_text = "\n".join(segments)

# Without constraints (WtP or split_on_input_newlines=False):
original_text = "".join(segments)

Why the difference? When constraints force mid-line splits (line > max_length), we can't split exactly at newlines. Newlines stay embedded in segments to preserve text. The algorithm boosts probabilities at newline positions to prefer splitting there when possible.

Usage Examples

Basic Length Limiting

from wtpsplit import SaT

sat = SaT("sat-3l-sm")

# Limit segments to 100 characters
segments = sat.split(text, max_length=100)

Both Min and Max

# Segments between 20-100 characters
segments = sat.split(text, min_length=20, max_length=100)

Using Gaussian Prior

# Prefer ~50 character segments
segments = sat.split(
    text,
    max_length=100,
    prior_type="gaussian",
    prior_kwargs={"target_length": 50, "spread": 15}
)

Algorithm Selection

# Use greedy for speed (slightly suboptimal)
segments = sat.split(text, max_length=100, algorithm="greedy")

# Use viterbi for optimal results (default)
segments = sat.split(text, max_length=100, algorithm="viterbi")

How It Respects Sentence Boundaries

The Viterbi algorithm naturally prefers splitting at high-probability positions (sentence boundaries) because:

P(position) is high at sentence boundaries
The product Prior(length) × P(position) is maximized when both factors are high
The algorithm finds the global optimum, so it won't make a locally good choice that leads to bad splits later

Example:

Text: "The quick brown fox jumps. Pack my box with jugs."

With max_length=100:

Position 25 (after "jumps.") has P ≈ 0.95
Position 50 (after "jugs.") has P ≈ 0.98
The algorithm will split at these natural boundaries

With max_length=30:

Must split somewhere before position 30
Will choose position 25 (sentence boundary) over position 28 (mid-word)

When Word Cuts Happen

Word cuts only occur when there is no sentence boundary within max_length characters. This happens with:

Very long sentences without punctuation
Very restrictive max_length values
Text without natural break points (e.g., code, URLs)

Implementation Details

Post-Processing (`_enforce_segment_constraints`)

After the algorithm finds split points, post-processing ensures:

Segments don't exceed max_length (force-splits if needed)
Short segments are merged when possible
All whitespace is preserved

Parameters Reference

Parameter	Type	Default	Description
`min_length`	int	1	Minimum segment length (best effort)
`max_length`	int	None	Maximum segment length (strict)
`algorithm`	str	"viterbi"	"viterbi" (optimal) or "greedy" (faster)
`prior_type`	str	"uniform"	"uniform", "gaussian", "clipped_polynomial", or "lognormal"
`prior_kwargs`	dict	None	Parameters for prior function

Prior Parameters

All priors support:

lang_code: Optional language code (e.g., "en", "zh", "de") for language-aware defaults

Gaussian:

target_length: Preferred length in characters (language-aware default if lang_code provided)
spread: Standard deviation in characters (language-aware default if lang_code provided)

Clipped Polynomial:

target_length: Peak position in characters (language-aware default if lang_code provided)
spread: Tolerance in characters — clips to zero at ±spread from target (language-aware default if lang_code provided)

Log-Normal:

target_length: Peak/mode in characters (language-aware default if lang_code provided)
spread: Tolerance in characters (language-aware default if lang_code provided) — same scale as gaussian/clipped_polynomial

Language-Aware Defaults

When lang_code is provided in prior_kwargs but target_length is not specified, the prior uses empirically-derived defaults for that language:

# Uses Chinese defaults: target_length=45, spread=15
sat.split(text, max_length=100, prior_type="gaussian", 
          prior_kwargs={"lang_code": "zh"})

# Uses German defaults: target_length=90, spread=35
sat.split(text, max_length=150, prior_type="gaussian", 
          prior_kwargs={"lang_code": "de"})

Supported languages include: zh, ja, ko (East Asian), de, en, fr, es, it, pt, ru, ar, and many more. See LANG_SENTENCE_STATS in wtpsplit/utils/priors.py for the full list.