Supported Input Types
May 11, 2026
psql_bm25s supports five indexed source-column types:
- int4[]
- text[]
- varchar[]
- text
- varchar
They all feed the same BM25 index core, but they reach it through two different input models:
- pretokenized inputs owned by the application:
  - int4[]
  - text[]
  - varchar[]
- scalar text inputs tokenized at the index boundary:
  - text
  - varchar
Overview
| Source type | Input model | Best fit | Main trade-off |
|---|---|---|---|
| int4[] | pre-encoded token IDs | highest-throughput exact retrieval | requires an external vocabulary or token-ID pipeline |
| text[] | pretokenized text tokens | explicit token control with strong exact performance | application must materialize tokens |
| varchar[] | pretokenized text tokens | same use case as text[] for schemas that already use varchar[] | application must materialize tokens |
| text | raw scalar text | easiest onboarding from ordinary PostgreSQL schemas | indexing and some verification paths pay tokenization cost inside the extension |
| varchar | raw scalar text | same as text when the schema already uses varchar | indexing and some verification paths pay tokenization cost inside the extension |
How psql_bm25s Supports Them
For int4[], text[], and varchar[], the index receives the caller's
token stream directly:
- int4[] passes pre-encoded token IDs
- text[] passes pretokenized text
- varchar[] follows the same text-token path after adapting each array element to the text-like token interface used by the index
For scalar text and varchar, the extension tokenizes at the index
boundary and then lowers the result to the same internal token-stream
model. That keeps one BM25 scoring and postings core while still
letting ordinary SQL schemas start from raw text columns.
The scalar text pipeline reuses the text-processing path behind the project's SQL helpers:
- Unicode-aware tokenization
- NFC normalization
- ICU word-break segmentation
- Unicode case folding
- optional stopword filtering
- optional English Porter stemming
- optional Latin-diacritic folding
For the index-level scalar text parameters, see Index Parameters.
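To make the steps listed above concrete, here is a minimal Python sketch of a comparable pipeline. It is not the extension's implementation. The function names, the stopword list, and the order of the steps are assumptions made for illustration. A `\w+` regular expression stands in for ICU word-break segmentation, the diacritic step strips every combining mark rather than only Latin ones, and Porter stemming is left out.

```python
import re
import unicodedata
from typing import FrozenSet, List

# Illustrative stopword list; the extension's actual list is not shown here.
STOPWORDS: FrozenSet[str] = frozenset({"the", "a", "an", "of", "and"})

# Rough stand-in for ICU word-break segmentation (an assumption, not ICU).
_WORD_RE = re.compile(r"\w+")


def strip_diacritics(token: str) -> str:
    """Decompose the token, drop combining marks, then recompose it."""
    decomposed = unicodedata.normalize("NFD", token)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def tokenize(
    text: str,
    use_stopwords: bool = False,
    fold_diacritics: bool = False,
) -> List[str]:
    """NFC-normalize, split into words, case-fold, then apply optional filters."""
    normalized = unicodedata.normalize("NFC", text)
    tokens = [m.group(0).casefold() for m in _WORD_RE.finditer(normalized)]
    if fold_diacritics:
        tokens = [strip_diacritics(t) for t in tokens]
    if use_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


basic = tokenize("Hello, World!")
folded = tokenize("The Café of Zürich", use_stopwords=True, fold_diacritics=True)
```

The NFC step matters even in this toy version: a decomposed "e" followed by a combining acute accent would otherwise be split at the accent by the word pattern.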
Performance Characteristics
int4[]
This is the fastest and most stable exact-retrieval path when the
application can own vocabulary management and token-ID assignment. It is
the basis of the published psql_bm25s ids benchmark line.
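As an illustration of what owning vocabulary management can look like on the application side, the following Python sketch assigns stable integer IDs to tokens. The `Vocabulary` class and its `grow` flag are hypothetical and are not part of psql_bm25s. The only constraint taken from this page is that the IDs must fit an int4[] column.

```python
from typing import Dict, List

INT4_MAX = 2**31 - 1  # largest value a PostgreSQL int4 can hold


class Vocabulary:
    """Hypothetical application-side vocabulary mapping tokens to int IDs."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def encode(self, tokens: List[str], grow: bool = True) -> List[int]:
        """Return token IDs; unseen tokens get a new ID only when grow=True."""
        out: List[int] = []
        for tok in tokens:
            if tok not in self._ids:
                if not grow:
                    continue  # query time: a never-indexed token cannot match
                next_id = len(self._ids) + 1  # IDs start at 1 (arbitrary choice)
                if next_id > INT4_MAX:
                    raise OverflowError("vocabulary no longer fits in int4")
                self._ids[tok] = next_id
            out.append(self._ids[tok])
        return out


vocab = Vocabulary()
doc_ids = vocab.encode(["postgres", "full", "text", "search", "postgres"])
query_ids = vocab.encode(["text", "search", "unknown"], grow=False)
```

Documents are encoded with `grow=True` so new tokens extend the vocabulary, while queries use `grow=False` so that unknown terms are dropped instead of polluting the mapping.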
text[]
This is the main pretokenized text path. It keeps the token stream
explicit, avoids scalar retokenization, and is the basis of the
published psql_bm25s text[] benchmark line.
Use it when:
- the application already tokenizes documents
- token boundaries must stay explicit
- phrase and verification-heavy paths should avoid retokenizing raw text
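The last point can be illustrated with a small Python sketch. When token boundaries are already stored, a phrase check reduces to comparing array slices, with no tokenizer in the loop. The `contains_phrase` helper is hypothetical and does not describe how the extension verifies phrases internally.

```python
from typing import Sequence


def contains_phrase(doc_tokens: Sequence[str], phrase: Sequence[str]) -> bool:
    """Return True if `phrase` occurs as a contiguous run in `doc_tokens`."""
    n, m = len(doc_tokens), len(phrase)
    if m == 0:
        return True  # an empty phrase matches trivially
    target = list(phrase)
    # Slide a window of length m over the document tokens.
    return any(list(doc_tokens[i:i + m]) == target for i in range(n - m + 1))


doc = ["new", "york", "city", "guide"]
hit = contains_phrase(doc, ["york", "city"])
miss = contains_phrase(doc, ["city", "york"])
```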
varchar[]
This follows the same pretokenized path as text[]. It mainly exists so
existing schemas do not need to rewrite arrays just to use the index.
Its behavior and expected performance profile should track text[]
closely.
text and varchar
These are the easiest way to get started because the schema can index an ordinary text column directly. The trade-off is that the extension must tokenize during indexing, refresh, and some exact verification paths.
That makes scalar text columns the best choice when:
- ease of adoption matters more than explicit token materialization
- the dataset is small enough that the extra tokenization CPU cost is acceptable
- the application wants a direct SQL-column search surface
Pretokenized arrays remain the more performance-oriented choice when the application already owns tokenization or wants the most stable high-throughput benchmark path.
Public Benchmark Scope
The main published PG18 cross-engine benchmark matrix in Performance and Benchmarks is currently based on the pretokenized input paths:
- psql_bm25s ids uses int4[]
- psql_bm25s text[] uses text[]
Scalar text and varchar are supported for ordinary application
schemas, but they are not the basis of the current public cross-engine
matrix.
Multicolumn Fusion
Multicolumn fusion indexes currently support:
- text[]
- varchar[]
- text
- varchar
Scalar multicolumn fusion tokenizes each indexed scalar column with the index text options before fusing the resulting token stream. See Multi-Column Fusion Indexes for the current rules and recommended usage.
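As a rough mental model only, the Python sketch below tokenizes each scalar column with one shared set of options and then combines the results. It assumes that fusion amounts to concatenating the per-column token streams in column order and that a NULL column contributes no tokens. Neither assumption is confirmed on this page, and `tokenize` and `fuse_columns` are hypothetical names.

```python
import re
import unicodedata
from typing import Dict, List, Optional, Sequence

_WORD_RE = re.compile(r"\w+")  # simple stand-in for the real tokenizer


def tokenize(text: str) -> List[str]:
    """Shared text options for every column: NFC, word split, case folding."""
    normalized = unicodedata.normalize("NFC", text)
    return [m.group(0).casefold() for m in _WORD_RE.finditer(normalized)]


def fuse_columns(row: Dict[str, Optional[str]], columns: Sequence[str]) -> List[str]:
    """Tokenize each column with the same options, then concatenate (assumed)."""
    fused: List[str] = []
    for col in columns:
        value = row.get(col)
        if value is None:
            continue  # assumption: a NULL column adds no tokens
        fused.extend(tokenize(value))
    return fused


row = {"title": "Hello World", "summary": None, "body": "BM25 Ranking"}
fused = fuse_columns(row, ["title", "summary", "body"])
```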
Practical Guidance
Use:
- text or varchar when you want the easiest schema-level start
- text[] or varchar[] when you already own tokenization and want the clearest text-token contract
- int4[] when you need the highest-throughput exact path and can own token IDs upstream
Related docs: