pl_series_hash

November 11, 2025

pl_series_hash is a polars plugin that computes lightning-fast hashes, one hash value per series.

This will be used by buckaroo to enable summary stats caching.

Using pl_series_hash

>>> import polars as pl
>>> from pl_series_hash import hash_xx
>>> df = pl.DataFrame({"u64": pl.Series([5, 3, 20], dtype=pl.UInt64)})
>>> df.select(hash_col=hash_xx("u64"))
shape: (1, 1)
┌─────────────────────┐
│ hash_col            │
│ ---                 │
│ u64                 │
╞═════════════════════╡
│ 6142793559755377588 │
└─────────────────────┘

You can hash every column in a dataframe with the namespaced function.

import pl_series_hash
df.select(pl.all().pl_series_hash.hash_xx())

Installing pl_series_hash

pip install pl_series_hash

The same values in a different dtype will result in different hash values.
The name of a column or struct field doesn't affect the hash value.
The presence and position of nulls do affect the hash value.

Supported column types

The following polars Rust datatypes are supported:

Boolean
UInt8
UInt16
UInt32
UInt64
Int8
Int16
Int32
Int64
Float32
Float64
String
Date
Datetime
Duration
Time
Array
Null
Categorical
Enum
Struct

Unsupported datatypes

Planned

Binary
BinaryOffset
Int128 - planned, it's a compile/config option
Decimal - planned, it's a compile/config option

Not planned

Object - Summary stats on objects are useless and these columns rarely show up. I will probably skip this.
List - Complex nested type implementation, rarely used.
DataType::Unknown - I have no idea what could be done with this in use.
Null - Currently implemented, but I don't know the use case for this.

Basic implementation

This uses twox-hash, a very performant hashing library.

For each series I first write out a type identifier.

For each element in a series I add its bytes. For strings I also write a STRING_SEPERATOR of 128u16, which is not a valid UTF-8 sequence on its own and is intended never to collide with string contents. For NaNs/nulls I write out NAN_SEPERATOR, 129u16, which is likewise not valid UTF-8 on its own.

Next I write out the array position as bytes (u64).

All of this is then hashed.

Structs and arrays are hashed recursively - a vector of each constituent sub-series is hashed, then that vector is hashed.
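The steps above can be sketched in pure Python. This is an illustration only, not the plugin's Rust code: `hashlib.blake2b` stands in for twox-hash, the `TYPE_IDS` values and little-endian byte layout are invented for the example, and the sketch assumes the position is written after every element (the plugin may write it only for nulls). Hash values will therefore not match the plugin's output.

```python
import hashlib
import struct

# Separator values taken from the text, packed as little-endian u16 (layout assumed).
STRING_SEPERATOR = struct.pack("<H", 128)
NAN_SEPERATOR = struct.pack("<H", 129)

# Hypothetical type identifiers; the plugin's real identifiers are not shown here.
TYPE_IDS = {"u64": 1, "i64": 2, "str": 3, "struct": 4}


def hash_series(dtype, values):
    """Hash one series: type identifier, then element bytes plus position."""
    h = hashlib.blake2b(digest_size=8)  # 64-bit digest, stand-in for xxHash
    h.update(struct.pack("<Q", TYPE_IDS[dtype]))
    for pos, value in enumerate(values):
        if value is None:
            h.update(NAN_SEPERATOR)
        elif dtype == "str":
            h.update(value.encode("utf-8"))
            h.update(STRING_SEPERATOR)
        elif dtype == "u64":
            h.update(struct.pack("<Q", value))
        else:  # "i64"
            h.update(struct.pack("<q", value))
        # Assumption: the position follows every element, nulls included.
        h.update(struct.pack("<Q", pos))
    return int.from_bytes(h.digest(), "little")


def hash_struct(fields):
    """Hash a struct: hash each sub-series, then hash the list of those hashes.

    `fields` is a list of (dtype, values) pairs; field names are deliberately
    left out, so renaming a field does not change the result.
    """
    h = hashlib.blake2b(digest_size=8)
    h.update(struct.pack("<Q", TYPE_IDS["struct"]))
    for dtype, values in fields:
        h.update(struct.pack("<Q", hash_series(dtype, values)))
    return int.from_bytes(h.digest(), "little")


print(hash_series("u64", [5, 3, 20]) != hash_series("i64", [5, 3, 20]))
```

Writing the type identifier first is what makes the same values hash differently under different dtypes, and the null marker combined with the position is what makes the placement of nulls matter.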

Dev, build, and packaging instructions

This is based very directly on Marco Gorelli's Polars Plugin Tutorial.

To release a new version, first manually bump the version number in Cargo.toml. Then make a PR with that commit, and merge it to main.

Finally, draft a new release on GitHub, using the same tag and name as the Cargo.toml version.

The PyPI package is automatically versioned the same as the tag. Write tags without a v prefix or other artifacts: 0.2.1, not v0.2.1.

For now, that flow works. Obviously better workflows can be added to this section.

Further research

Articles pulled from the polars codebase:
https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
http://myeyesareblind.com/2017/02/06/Combine-hash-values/

If you want elementwise hashing, take a look at polars-hash. It is a much more mature plugin that allows you to choose different hashing algorithms.