plserieshash
November 11, 2025 · View on GitHub
pl_series_hash is a polars plugin to compute lightning fast hashes per series in polars
This will be used by buckaroo to enable summary stats caching.
using pl_series_hash
>>> import polars as pl
>>> from pl_series_hash import hash_xx
>>> df = pl.DataFrame({"u64": pl.Series([5, 3, 20], dtype=pl.UInt64)})
>>> df.select(hash_col=hash_xx("u64"))
shape: (1, 1)
┌─────────────────────┐
│ hash_col │
│ --- │
│ u64 │
╞═════════════════════╡
│ 6142793559755377588 │
└─────────────────────┘
You can hash every column in a dataframe with the namespaced function.
import pl_series_hash
df.select(pl.all().pl_series_hash.hash_xx())
Installing pl_series_hash
pip install pl_series_hash
properties of pl_series_hash
The same values in a different dtype will result in different different hash values. The name of a column or struct part doesn't effect the hash values The presence and position of nulls do affect the hash value
Supported column types
The following polars Rust datatypes are supported
- Boolean
- UInt8
- UInt16
- UInt32
- UInt64
- Int8
- Int16
- Int32
- Int64
- Float32
- Float64
- String
- Date
- Datetime
- Duration
- Time
- Array
- Null
- Categorical
- Enum
- Struct
Unsupported datatypes
Planned
Binary BinaryOffset Int128 - planned, it's a compile/config option Decimal - planned it's a compile/config option
Not planned
Object - Summary stats on objects are useless and these columns rarely show up. I will probably skip List Complex nested type implementation, rarely used DataType::Unknown # have no idea what could be done with this in use Null - Currently implemented but I don't know the use case for this
Basic implementation
This uses twox-hash a very performant hashing library.
For each series I first write out a type identifier.
For each element in a series I add the bytes, for strings I also write a STRING_SEPERATOR of 128u16 which isn't a valid UTF8 symbol and shouldn't ever appear.
For NANs/Nulls I write out NAN_SEPERATOR - 129u16 also an invalid unicode character.
Next I write out the array position in bytes (u64)
All of this is then hashed.
Structs and arrays are hashed recursively - a vector of each constituent sub-series is hashed, then that vector is hashed.
Dev, build, and packaging instrucitons
This is based very directly on Marco Gorelli's Polars Plugin Tutorial
to release a new version, first manually bump the version number in cargo.toml. Then make a PR with that commit, and merge it to main.
Finally draft a new release on github, using the same tag and name as the cargo.toml version.
The pypi package is automatically versioned the same as the tag.
write tags without v or other artifacts. so 0.2.1 not v0.2.1.
For now, that flow works. Obviously better workflows can be added to this section.
Further research
Articles pulled from the polars codebase https://www.cockroachlabs.com/blog/vectorized-hash-joiner/ http://myeyesareblind.com/2017/02/06/Combine-hash-values/
If you want elementwise hashing take a look at polars-hash It is a much more mature plugin that allows you to choose different hashing algorithms.