Dataframe Frameworks Showdown

April 10, 2023 · View on GitHub

Blog post at https://www.karnwong.me/posts/2023/04/duckdb-vs-polars-vs-spark/

Frameworks Used

~~pandas~~ (not present in the experiment, because it's very unrealistic to expect pandas to be able to open 15GB data on a 16GB RAM machine)
polars
duckdb
spark (single node)

Experiments

Input data (of varying row count, to measure performance against same partition size but different length) is equally partitioned into 8 chunks.

Create a timestamp diff column trip_length_minute, converted to minute
Create percentile on trip_length_minute as trip_length_minute_percentile
Filter only trip_length_minute_percentile between (0.2, 0.8)
Group by on VendorID, payment_type
Aggregate min, max, avg on passenger_count, trip_distance, total_amount

framework	mode	remarks
polars	lazy	by default, polars does not operate in lazy mode
duckdb	default	-
spark	default	-

Compute specs

CPU: M1 MacBook Air (M1, 2020)
RAM: 16GB

Data source

See here.

total size: 15 GB,
total records: 1,195,313,202 - around 1200 million rows
partitions: year 2012 to year 2022 (older partitions have different schema)
dirty data: some columns have mismatched data types across partitions

These will be partitioned and used for experiments. See prep stage in Makefile.

Usage

# download data
make download-data

# sanitize data
make prep-base

# run experiment 1: window on single partition
# contains prep, run and visualize
make run-expt1-window-on-single-partition

# run experiment 2: window on multiple partitions
make run-expt2-window-on-multiple-partition

Results

duckdb crash/unrespond at 30M rows (task terminated by user at 14 minute).

result expt1

result expt2