polars-bio-bench
April 25, 2026
Benchmarks of various genomic ranges operations
Pre-requisites
- pyenv
$ pyenv --version
pyenv 2.5.0
- poetry
$ poetry --version
Poetry (version 2.0.0)
Setup
pyenv install 3.12.8
pyenv local 3.12.8
poetry env use 3.12
poetry update
Hardware requirements
You need at least 64 GB of RAM to run the full benchmarks. For the default configuration (conf/benchmark_small.yaml), 16-32 GB should be enough.
Run
All the benchmarking scenarios are defined in the conf/benchmark_*.yaml files. By default, the conf/benchmark_small.yaml file is used.
If you would like to run the benchmarks with a different configuration file, you can specify it using the --bench-config option.
export BENCH_DATA_ROOT=/tmp/polars-bio-bench/
poetry run python src/run-benchmarks.py --help
INFO:polars_bio:Creating BioSessionContext
Usage: run-benchmarks.py [OPTIONS]
Options:
--bench-config TEXT Benchmark config file (default:
conf/benchmark_small.yaml)
--help Show this message and exit.
For the e2e test suite (benchmark-e2e-overlap), additionally set:
export POLARS_MAX_THREADS=1
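If the benchmark is driven from a Python entry point rather than a shell, the same setting can be applied in-process. This is a minimal sketch, not code from this repository: the helper name `configured_threads` is hypothetical, and the key point is that the variable has to be set before Polars is first imported.

```python
import os

# Set the variable before `import polars` or `import polars_bio` runs anywhere
# in the process; Polars sizes its thread pool when it is first loaded.
os.environ["POLARS_MAX_THREADS"] = "1"


def configured_threads() -> int:
    """Return the thread limit currently requested via the environment."""
    return int(os.environ["POLARS_MAX_THREADS"])
```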
To override the polars_bio output type for file-based benchmarks, set
polars_bio_output_type in the benchmark YAML. You can define a config-wide
default in common and override it per benchmark when needed:
common:
baseline: polars_bio
polars_bio_output_type: pandas.DataFrame
benchmarks:
- name: overlap-single
operation: overlap
dataset: databio
tools:
- polars_bio
parallel: false
input_dataframes: false
polars_bio_output_type: polars.DataFrame
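The precedence described above (a per-benchmark value wins over the common default) can be sketched as follows. This is an illustration only: `resolve_output_type` is a hypothetical helper, the config is written as a plain dict rather than loaded from YAML, and the `overlap-default` entry is made up to show the fallback case.

```python
from typing import Any, Optional

KEY = "polars_bio_output_type"


def resolve_output_type(common: dict[str, Any], benchmark: dict[str, Any]) -> Optional[str]:
    """Prefer the benchmark-level value, otherwise fall back to the common one."""
    if KEY in benchmark:
        return benchmark[KEY]
    return common.get(KEY)


# Dict equivalent of the YAML above, plus one hypothetical benchmark
# that does not set its own output type.
config = {
    "common": {"baseline": "polars_bio", KEY: "pandas.DataFrame"},
    "benchmarks": [
        {"name": "overlap-single", "operation": "overlap", KEY: "polars.DataFrame"},
        {"name": "overlap-default", "operation": "overlap"},  # hypothetical entry
    ],
}

resolved = {
    bench["name"]: resolve_output_type(config["common"], bench)
    for bench in config["benchmarks"]
}
```

Here `overlap-single` resolves to its own `polars.DataFrame`, while `overlap-default` inherits `pandas.DataFrame` from `common`.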
For input_dataframes: true benchmarks, dataframes_io still controls the
input/output DataFrame pair.
For pandas-backed polars_bio inputs, dataframes_io also supports
pandas.pyarrow.DataFrame. That reads parquet through pandas with
engine="pyarrow" and dtype_backend="pyarrow" so the loaded columns stay
Arrow-backed:
dataframes_io:
- "pandas.DataFrame:pandas.DataFrame"
- "pandas.pyarrow.DataFrame:pandas.DataFrame"
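A rough sketch of how such an entry could be interpreted, assuming each string has the form `input_type:output_type`. The names `FramePair`, `parse_dataframes_io`, and `read_input` are hypothetical and are not taken from the benchmark code; only the `pandas.read_parquet` arguments come from the description above.

```python
from typing import NamedTuple


class FramePair(NamedTuple):
    input_type: str
    output_type: str


def parse_dataframes_io(entry: str) -> FramePair:
    """Split an 'input:output' entry into its two DataFrame type names."""
    input_type, sep, output_type = entry.partition(":")
    if not sep or not input_type or not output_type:
        raise ValueError(f"expected 'input:output', got {entry!r}")
    return FramePair(input_type, output_type)


def read_input(path: str, input_type: str):
    """Load a parquet file as the requested pandas flavour."""
    import pandas as pd  # imported lazily so the parser works without pandas

    if input_type == "pandas.pyarrow.DataFrame":
        # Arrow-backed columns: pyarrow engine plus the pyarrow dtype backend.
        return pd.read_parquet(path, engine="pyarrow", dtype_backend="pyarrow")
    if input_type == "pandas.DataFrame":
        return pd.read_parquet(path)
    raise ValueError(f"unsupported input type: {input_type}")


pair = parse_dataframes_io("pandas.pyarrow.DataFrame:pandas.DataFrame")
```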
Datasets
Sample benchmark scenarios
- conf/benchmark_small.yaml - small dataset, small number of operations for nearest and overlap, native DataFusion input
- conf/benchmark_dataframes.yaml - as above but with DataFrames (Polars/Pandas) as input
- conf/benchmark_large.yaml - large dataset, large number of operations for nearest and overlap, native DataFusion input
- conf/benchmark_parallel.yaml - comparison of parallel operations for pyranges0 and polars_bio with bioframe as a baseline
- conf/benchmark_count_overlaps.yaml - comparison of count overlaps operation for pyranges{0,1} and polars_bio with bioframe as a baseline
- conf/benchmark_merge.yaml - comparison of merge operation for pyranges{0,1} and polars_bio with bioframe as a baseline
- conf/benchmark_coverage.yaml - comparison of coverage operation for pyranges{0,1} and polars_bio with bioframe as a baseline
Paper benchmarks
- conf/paper/benchmark-e2e-overlap.yaml - end-to-end benchmark for overlap operation with writing results to a CSV file (1-2 and 8-7 datasets)
- conf/paper/benchmark-4ops-1-2.yaml - overlap, nearest, count_overlaps and coverage operations for 1-2 datasets
- conf/paper/benchmark-4ops-8-7.yaml - as above but for 8-7 datasets
- conf/paper/benchmark-4ops-8-7-polars-bio-parallel.yaml - as above but polars_bio only and with parallel operations using 1, 2, 4, 6 and 8 threads
- conf/paper/benchmark-read_vcf.yaml - read VCF file with polars_bio using 1, 2, 4, 6 and 8 threads
Paper memory benchmarks
Example of running the memory profiler for polars_bio with the 1-2 dataset:
PRFOF_FILE="polars_bio_1-2.dat"
mprof run --output $PRFOF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-overlap.yaml --tool polars_bio --test-case 1-2 --operation overlap
mprof plot $PRFOF_FILE
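Besides plotting, the peak memory can be read straight from the `.dat` file. The sketch below assumes the usual mprof text layout, where each sample is a line of the form `MEM <MiB> <timestamp>`; the helper name `peak_memory_mib` is hypothetical and the sample values are made up for illustration, not measured results.

```python
def peak_memory_mib(dat_text: str) -> float:
    """Return the largest MEM sample (in MiB) found in mprof output text."""
    samples = []
    for line in dat_text.splitlines():
        parts = line.split()
        # Only 'MEM <value> <timestamp>' lines carry memory samples;
        # header lines such as CMDLINE are skipped.
        if len(parts) == 3 and parts[0] == "MEM":
            samples.append(float(parts[1]))
    if not samples:
        raise ValueError("no MEM samples found")
    return max(samples)


# Illustrative content only, not a real measurement.
example = """CMDLINE python src/run-memory-profiler.py
MEM 120.500000 1700000000.1000
MEM 512.250000 1700000000.2000
MEM 300.000000 1700000000.3000
"""
peak = peak_memory_mib(example)
```

For a real run, the text would come from something like `open("polars_bio_1-2.dat").read()`.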
Synthetic
BENCHMARK_TYPE="synthetic"
for operation in "overlap" "nearest" "coverage" "count-overlaps"; do
for tool in "polars_bio" "polars_bio_streaming" "bioframe" "pyranges0" "pyranges1"; do
for test_case in "100" "10000000"; do
PRFOF_FILE="${tool}_${operation}_${test_case}.dat"
mprof run --output $PRFOF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-${BENCHMARK_TYPE}.yaml --tool $tool --test-case $test_case --operation $operation
done
done
done
Real
BENCHMARK_TYPE="real"
for operation in "overlap" "nearest" "coverage" "count-overlaps"; do
for tool in "polars_bio" "polars_bio_streaming" "bioframe" "pyranges0" "pyranges1"; do
for test_case in "1-2" "8-7"; do
PRFOF_FILE="${tool}_${operation}_${test_case}.dat"
mprof run --output $PRFOF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-${BENCHMARK_TYPE}.yaml --tool $tool --test-case $test_case --operation $operation
done
done
done
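The synthetic and real loops above cover the same operation and tool matrix and differ only in the test cases. A minimal Python sketch that builds the equivalent command lines is shown below; the function name `profiler_commands` is hypothetical, while the flags and file naming are copied from the shell loops.

```python
from itertools import product

OPERATIONS = ["overlap", "nearest", "coverage", "count-overlaps"]
TOOLS = ["polars_bio", "polars_bio_streaming", "bioframe", "pyranges0", "pyranges1"]
TEST_CASES = {"synthetic": ["100", "10000000"], "real": ["1-2", "8-7"]}


def profiler_commands(benchmark_type: str) -> list[list[str]]:
    """Build one mprof command per (operation, tool, test case) combination."""
    commands = []
    # product() iterates in the same nesting order as the shell loops.
    for operation, tool, test_case in product(OPERATIONS, TOOLS, TEST_CASES[benchmark_type]):
        output = f"{tool}_{operation}_{test_case}.dat"
        commands.append([
            "mprof", "run", "--output", output,
            "python", "src/run-memory-profiler.py",
            "--bench-config", f"conf/paper/benchmark-e2e-{benchmark_type}.yaml",
            "--tool", tool,
            "--test-case", test_case,
            "--operation", operation,
        ])
    return commands


real_commands = profiler_commands("real")
```

Each entry can be passed to `subprocess.run(cmd, check=True)`; each benchmark type yields 4 x 5 x 2 = 40 profiler runs.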
New benchmark Sep 2025
BENCHMARK_TYPE="real"
for operation in "overlap" ; do
for tool in "genomicranges" ; do
for test_case in "8-7"; do
PRFOF_FILE="${tool}_${operation}_${test_case}.dat"
mprof run --output $PRFOF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-${BENCHMARK_TYPE}.yaml --tool $tool --test-case $test_case --operation $operation
done
done
done
Generating New Datasets
This repository includes a unified script for generating random genomic interval datasets and uploading them to cloud storage. The script creates datasets with unique timestamps and uploads them with proper directory structure.
Prerequisites for Dataset Generation
- rclone - Required for uploading datasets to Google Drive
# Install rclone
curl https://rclone.org/install.sh | sudo bash
# Configure rclone with your Google Drive (follow interactive setup)
rclone config
- Python dependencies - The script requires pandas, numpy, and yaml:
# These are already included in the poetry environment
poetry install
Usage
To generate a new dataset:
# From the polars-bio-bench root directory
poetry run python src/generate_dataset.py
The script will:
- Clean up old files - Remove previous datasets and ZIP archives
- Generate test data - Create parquet files with different sizes (100, 1K, 10K, 100K, 1M records)
- Create ZIP archive - Package the datasets into a single ZIP file
- Upload to Google Drive - Upload via rclone and generate public download link
- Generate configuration files - Create YAML configs for benchmarking
Output Structure
The script generates files in the following structure:
polars-bio-bench/
├── tmp/
│ ├── data/ # Generated parquet files
│ │ ├── df1-100.parquet
│ │ ├── df2-100.parquet
│ │ ├── df1-1000.parquet
│ │ ├── df2-1000.parquet
│ │ ├── ... (up to 1M records)
│ └── conf/ # Configuration files
│ ├── common.yaml # Dataset metadata and test cases
│ └── random.yaml # Benchmark definitions
└── random_intervals_YYYYMMDD_HHMMSS.zip # ZIP archive for upload
Generated Dataset Properties
- Dataset ID: random_intervals_YYYYMMDD_HHMMSS (unique timestamp)
- Test cases: 5 different sizes (100, 1K, 10K, 100K, 1M records)
- File format: Parquet files with genomic intervals (chrom, start, end)
- Chromosome range: chr1 only for simplicity
- Coordinate range: Random intervals up to dataset size
- Archive size: ~17-18 MB (compressed)
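The properties above can be illustrated with a small standard-library sketch. It is not the code in src/generate_dataset.py (which uses pandas and numpy): the function names, the `max_len` parameter, and the choice to draw start positions below the dataset size are assumptions made for illustration.

```python
import random
from datetime import datetime

SIZES = [100, 1_000, 10_000, 100_000, 1_000_000]


def dataset_id(now: datetime) -> str:
    """Build an ID of the form random_intervals_YYYYMMDD_HHMMSS."""
    return now.strftime("random_intervals_%Y%m%d_%H%M%S")


def random_intervals(n: int, seed: int = 0, max_len: int = 1_000) -> list[tuple[str, int, int]]:
    """Generate n (chrom, start, end) rows on chr1 with start < end."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    rows = []
    for _ in range(n):
        start = rng.randrange(n)               # assumed: starts bounded by dataset size
        end = start + rng.randint(1, max_len)  # assumed: interval length of 1..max_len
        rows.append(("chr1", start, end))
    return rows


rows = random_intervals(SIZES[0])
example_id = dataset_id(datetime(2025, 5, 30, 23, 13, 51))
```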
Configuration Files
common.yaml - Contains dataset metadata:
datasets:
- name: random_intervals_20250530_231351
source: tgambin
unzip: true
format: zip
url: https://drive.google.com/open?id=...
# ... additional metadata
test-cases:
- name: '100'
df_path_1: df1-100.parquet
df_path_2: df2-100.parquet
# ... more test cases
random.yaml - Contains benchmark definitions for overlap and nearest operations with various tools and parallelization options.
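The test-cases entries in common.yaml follow the parquet file naming shown in the output structure. A minimal sketch of how they could be derived from the dataset sizes, using a hypothetical helper `build_test_cases` that is not part of the repository:

```python
def build_test_cases(sizes: list[int]) -> list[dict[str, str]]:
    """Create one test-case mapping per size, pairing the df1/df2 parquet files."""
    return [
        {
            "name": str(n),
            "df_path_1": f"df1-{n}.parquet",
            "df_path_2": f"df2-{n}.parquet",
        }
        for n in sizes
    ]


entries = build_test_cases([100, 1_000, 10_000, 100_000, 1_000_000])
```

Dumping `entries` with a YAML library would give a list shaped like the test-cases section above.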
Cloud Storage
Datasets are automatically uploaded to:
- Remote path: tgambin:polars-bio-datasets/{dataset_id}/
- Public URL: Generated automatically via rclone link
- Access: Public download links for easy integration
Integration with Benchmarks
Once generated, the new dataset can be used in benchmarks by:
- Copying the configuration files to the main conf/ directory
- Updating benchmark YAML files to reference the new dataset ID
- Running benchmarks with the new configuration
Example:
# Copy generated configs (optional)
cp tmp/conf/common.yaml conf/
cp tmp/conf/random.yaml conf/benchmark_random_new.yaml
# Run benchmarks with new dataset
poetry run python src/run-benchmarks.py --bench-config conf/benchmark_random_new.yaml
Known Issues
On macOS with Apple silicon (M-series) chips, you may encounter the following error when installing polars-bio from source with poetry:
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
To fix this, you can set the following environment variable when installing or updating polars-bio:
RUSTFLAGS="-Clink-arg=-undefined -Clink-arg=dynamic_lookup -Ctarget-cpu=native" poetry update