polars-bio-bench

April 25, 2026

Benchmarks of various genomic ranges operations

Pre-requisites

pyenv

pyenv --version
pyenv 2.5.0

poetry

poetry --version
Poetry (version 2.0.0)

Setup

pyenv install 3.12.8
pyenv local 3.12.8
poetry env use 3.12
poetry update

Hardware requirements

You need at least 64 GB of RAM to run the full benchmarks. For the default configuration, 16-32 GB should be enough.

All the benchmarking scenarios are defined in the conf/benchmark_*.yaml files. By default, the conf/benchmark_small.yaml file is used. If you would like to run the benchmarks with a different configuration file, you can specify it using the --bench-config option.

export BENCH_DATA_ROOT=/tmp/polars-bio-bench/

poetry run python src/run-benchmarks.py --help
INFO:polars_bio:Creating BioSessionContext
Usage: run-benchmarks.py [OPTIONS]

Options:
  --bench-config TEXT  Benchmark config file (default:
                       conf/benchmark_small.yaml)
  --help               Show this message and exit.

For the e2e test suite (benchmark-e2e-overlap), additionally set:

export POLARS_MAX_THREADS=1
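
The same settings can be applied from a small Python wrapper. This is only a sketch: the helper names bench_env and bench_command are hypothetical and not part of the repository; only the two environment variables and the --bench-config option come from the text above. Passing POLARS_MAX_THREADS in the child process environment means it is in place before Polars is imported there.

```python
import os


def bench_env(data_root, single_thread=False, base=None):
    """Return a copy of the environment with the benchmark variables set."""
    env = dict(os.environ if base is None else base)
    env["BENCH_DATA_ROOT"] = data_root
    if single_thread:
        # Used for the e2e suite; has to be set before Polars is imported.
        env["POLARS_MAX_THREADS"] = "1"
    return env


def bench_command(config):
    """Build the argv for one benchmark run with the given config file."""
    return ["poetry", "run", "python", "src/run-benchmarks.py",
            "--bench-config", config]
```

A run would then look like subprocess.run(bench_command("conf/paper/benchmark-e2e-overlap.yaml"), env=bench_env("/tmp/polars-bio-bench/", single_thread=True), check=True).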

To override the polars_bio output type for file-based benchmarks, set polars_bio_output_type in the benchmark YAML. You can define a config-wide default in common and override it per benchmark when needed:

common:
  baseline: polars_bio
  polars_bio_output_type: pandas.DataFrame

benchmarks:
  - name: overlap-single
    operation: overlap
    dataset: databio
    tools:
      - polars_bio
    parallel: false
    input_dataframes: false
    polars_bio_output_type: polars.DataFrame
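
The default-plus-override rule can be illustrated in a few lines of Python. This is a sketch of the described behaviour, not the runner's actual code: resolve_output_type is a hypothetical helper, the dictionaries stand in for the parsed YAML, and the overlap-default entry is invented to show the fallback.

```python
def resolve_output_type(common, benchmark, key="polars_bio_output_type"):
    """Per-benchmark value wins; otherwise fall back to the `common` default."""
    if key in benchmark:
        return benchmark[key]
    return common.get(key)


# Dictionaries mirroring the YAML above.
common = {"baseline": "polars_bio", "polars_bio_output_type": "pandas.DataFrame"}
benchmarks = [
    {"name": "overlap-single", "polars_bio_output_type": "polars.DataFrame"},
    {"name": "overlap-default"},  # hypothetical entry with no override
]

resolved = {b["name"]: resolve_output_type(common, b) for b in benchmarks}
print(resolved)
```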

For input_dataframes: true benchmarks, dataframes_io still controls the input/output DataFrame pair.

For pandas-backed polars_bio inputs, dataframes_io also supports pandas.pyarrow.DataFrame. That reads parquet through pandas with engine="pyarrow" and dtype_backend="pyarrow" so the loaded columns stay Arrow-backed:

dataframes_io:
  - "pandas.DataFrame:pandas.DataFrame"
  - "pandas.pyarrow.DataFrame:pandas.DataFrame"
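
Each dataframes_io entry is an "input:output" pair. The sketch below shows one way to split such an entry and to load the pandas inputs as described; parse_io_pair and read_input are hypothetical helpers (the repository's own loader may differ), and dtype_backend requires pandas 2.0 or newer.

```python
def parse_io_pair(spec):
    """Split an "input:output" entry from dataframes_io into its two type names."""
    input_type, sep, output_type = spec.partition(":")
    if not sep or not input_type or not output_type:
        raise ValueError(f"expected 'input:output', got {spec!r}")
    return input_type, output_type


def read_input(path, input_type):
    """Load a parquet file as the requested pandas flavour."""
    import pandas as pd  # imported lazily so parsing works without pandas

    if input_type == "pandas.pyarrow.DataFrame":
        # Arrow-backed columns, as described above.
        return pd.read_parquet(path, engine="pyarrow", dtype_backend="pyarrow")
    if input_type == "pandas.DataFrame":
        return pd.read_parquet(path)
    raise ValueError(f"unsupported input type: {input_type}")
```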

Datasets

Datasets overview

Sample benchmark scenarios

conf/benchmark_small.yaml - small dataset, small number of operations for nearest and overlap, native DataFusion input
conf/benchmark_dataframes.yaml - as above but with DataFrames (Polars/Pandas) as input
conf/benchmark_large.yaml - large dataset, large number of operations for nearest and overlap, native DataFusion input
conf/benchmark_parallel.yaml - comparison of parallel operations for pyranges0 and polars_bio with bioframe as a baseline
conf/benchmark_count_overlaps.yaml - comparison of count overlaps operation for pyranges{0,1} and polars_bio with bioframe as a baseline
conf/benchmark_merge.yaml - comparison of merge operation for pyranges{0,1} and polars_bio with bioframe as a baseline
conf/benchmark_coverage.yaml - comparison of coverage operation for pyranges{0,1} and polars_bio with bioframe as a baseline

Paper benchmarks

conf/paper/benchmark-e2e-overlap.yaml - end-to-end benchmark for overlap operation with writing results to a CSV file (1-2 and 8-7 datasets)
conf/paper/benchmark-4ops-1-2.yaml - overlap, nearest, count_overlaps and coverage operations for 1-2 datasets
conf/paper/benchmark-4ops-8-7.yaml - as above but for 8-7 datasets
conf/paper/benchmark-4ops-8-7-polars-bio-parallel.yaml - as above but polars_bio only, with parallel operations on 1, 2, 4, 6 and 8 threads
conf/paper/benchmark-read_vcf.yaml - read a VCF file with polars_bio on 1, 2, 4, 6 and 8 threads

Paper memory benchmarks

Example of running the memory profiler for polars_bio on the 1-2 dataset:

PRFOF_FILE="polars_bio_1-2.dat"
mprof run --output $PRFOF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-overlap.yaml --tool polars_bio --test-case 1-2 --operation overlap
mprof plot $PRFOF_FILE
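
To compare runs numerically rather than by plot, the peak can be read from the .dat file. The sketch below assumes that mprof writes one sample per line in the form "MEM <MiB> <timestamp>"; check this against a file produced by your memory_profiler version. The helper name peak_memory_mib and the sample lines are illustrative only.

```python
def peak_memory_mib(lines):
    """Return the largest MEM sample (in MiB) from mprof output lines."""
    peak = None
    for line in lines:
        fields = line.split()
        # Assumed sample format: "MEM <MiB> <unix timestamp>".
        if len(fields) >= 2 and fields[0] == "MEM":
            value = float(fields[1])
            peak = value if peak is None else max(peak, value)
    return peak


# Made-up sample lines, not real measurements.
sample = [
    "CMDLINE python src/run-memory-profiler.py",
    "MEM 120.5 1700000000.10",
    "MEM 842.25 1700000000.20",
    "MEM 310.0 1700000000.30",
]
print(peak_memory_mib(sample))  # 842.25
```

For a real file, pass an open file object: peak_memory_mib(open("polars_bio_1-2.dat")).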

Synthetic

BENCHMARK_TYPE="synthetic"
for operation in "overlap" "nearest" "coverage" "count-overlaps"; do
   for tool in "polars_bio" "polars_bio_streaming" "bioframe" "pyranges0" "pyranges1"; do
       for test_case in "100" "10000000"; do
           PRFOF_FILE="${tool}_${operation}_${test_case}.dat"
           mprof run --output $PRFOF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-${BENCHMARK_TYPE}.yaml --tool $tool --test-case $test_case --operation $operation
       done
   done
done

Real

BENCHMARK_TYPE="real"
for operation in "overlap" "nearest" "coverage" "count-overlaps"; do
   for tool in "polars_bio" "polars_bio_streaming" "bioframe" "pyranges0" "pyranges1"; do
       for test_case in "1-2" "8-7"; do
           PRFOF_FILE="${tool}_${operation}_${test_case}.dat"
           mprof run --output $PRFOF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-${BENCHMARK_TYPE}.yaml --tool $tool --test-case $test_case --operation $operation
       done
   done
done
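
The synthetic and real loops differ only in the benchmark type and the test cases, so the full matrix can also be generated programmatically. The following Python sketch builds the same mprof invocations as the shell loops without running them; profiler_commands is a hypothetical helper, and the lists are copied from the loops above.

```python
from itertools import product

OPERATIONS = ["overlap", "nearest", "coverage", "count-overlaps"]
TOOLS = ["polars_bio", "polars_bio_streaming", "bioframe", "pyranges0", "pyranges1"]
TEST_CASES = {"synthetic": ["100", "10000000"], "real": ["1-2", "8-7"]}


def profiler_commands(benchmark_type):
    """Yield (output file, argv) for every operation/tool/test-case combination."""
    config = f"conf/paper/benchmark-e2e-{benchmark_type}.yaml"
    cases = TEST_CASES[benchmark_type]
    for operation, tool, test_case in product(OPERATIONS, TOOLS, cases):
        output = f"{tool}_{operation}_{test_case}.dat"
        argv = ["mprof", "run", "--output", output,
                "python", "src/run-memory-profiler.py",
                "--bench-config", config,
                "--tool", tool, "--test-case", test_case,
                "--operation", operation]
        yield output, argv


commands = list(profiler_commands("real"))
print(len(commands))    # 40 = 4 operations x 5 tools x 2 test cases
print(commands[0][0])   # polars_bio_overlap_1-2.dat
```

Each argv can be passed to subprocess.run(argv, check=True) to execute the run.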

New benchmark Sep 2025

BENCHMARK_TYPE="real"
for operation in "overlap" ; do
   for tool in "genomicranges" ; do
       for test_case in "8-7"; do
           PRFOF_FILE="${tool}_${operation}_${test_case}.dat"
           mprof run --output $PRFOF_FILE python src/run-memory-profiler.py --bench-config conf/paper/benchmark-e2e-${BENCHMARK_TYPE}.yaml --tool $tool --test-case $test_case --operation $operation
       done
   done
done

Generating New Datasets

This repository includes a unified script for generating random genomic interval datasets and uploading them to cloud storage. The script creates datasets with unique timestamps and uploads them with proper directory structure.

Prerequisites for Dataset Generation

rclone - Required for uploading datasets to Google Drive

# Install rclone
curl https://rclone.org/install.sh | sudo bash

# Configure rclone with your Google Drive (follow interactive setup)
rclone config

Python dependencies - The script requires pandas, numpy, and PyYAML (imported as yaml):

# These are already included in the poetry environment
poetry install

Usage

To generate a new dataset:

# From the polars-bio-bench root directory
poetry run python src/generate_dataset.py

The script will:

Clean up old files - Remove previous datasets and ZIP archives
Generate test data - Create parquet files with different sizes (100, 1K, 10K, 100K, 1M records)
Create ZIP archive - Package the datasets into a single ZIP file
Upload to Google Drive - Upload via rclone and generate a public download link
Generate configuration files - Create YAML configs for benchmarking

Output Structure

The script generates files in the following structure:

polars-bio-bench/
├── tmp/
│   ├── data/                          # Generated parquet files
│   │   ├── df1-100.parquet
│   │   ├── df2-100.parquet
│   │   ├── df1-1000.parquet
│   │   ├── df2-1000.parquet
│   │   ├── ... (up to 1M records)
│   └── conf/                          # Configuration files
│       ├── common.yaml                # Dataset metadata and test cases
│       └── random.yaml                # Benchmark definitions
└── random_intervals_YYYYMMDD_HHMMSS.zip  # ZIP archive for upload

Generated Dataset Properties

Dataset ID: random_intervals_YYYYMMDD_HHMMSS (unique timestamp)
Test cases: 5 different sizes (100, 1K, 10K, 100K, 1M records)
File format: Parquet files with genomic intervals (chrom, start, end)
Chromosome range: chr1 only for simplicity
Coordinate range: Random intervals up to dataset size
Archive size: ~17-18 MB (compressed)
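
The properties above can be illustrated with a standard-library sketch. It is not the repository's generate_dataset.py (which uses pandas and numpy): the function names are hypothetical, and bounding coordinates by the number of records is one reading of "up to dataset size".

```python
import random
from datetime import datetime


def dataset_id(now=None):
    """Build an ID of the form random_intervals_YYYYMMDD_HHMMSS."""
    now = now or datetime.now()
    return f"random_intervals_{now:%Y%m%d_%H%M%S}"


def random_intervals(n, seed=None):
    """Return n (chrom, start, end) rows on chr1 with 0 <= start < end <= n."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        start = rng.randrange(0, n)        # 0 .. n-1
        end = rng.randint(start + 1, n)    # start+1 .. n (assumed upper bound)
        rows.append(("chr1", start, end))
    return rows


print(dataset_id(datetime(2025, 5, 30, 23, 13, 51)))  # random_intervals_20250530_231351
print(random_intervals(3, seed=0))
```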

Configuration Files

common.yaml - Contains dataset metadata:

datasets:
- name: random_intervals_20250530_231351
  source: tgambin
  unzip: true
  format: zip
  url: https://drive.google.com/open?id=...
  # ... additional metadata
test-cases:
- name: '100'
  df_path_1: df1-100.parquet
  df_path_2: df2-100.parquet
# ... more test cases
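
The test-cases entries follow a regular naming pattern, which a short Python sketch can reproduce. build_test_cases is a hypothetical helper; it covers only the three keys shown above, and the sizes passed in are the two whose file names appear in the output structure.

```python
def build_test_cases(sizes):
    """Build one test-case mapping per dataset size, mirroring common.yaml."""
    return [
        {
            "name": str(n),
            "df_path_1": f"df1-{n}.parquet",
            "df_path_2": f"df2-{n}.parquet",
        }
        for n in sizes
    ]


entries = build_test_cases([100, 1000])
for entry in entries:
    print(entry)
```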

random.yaml - Contains benchmark definitions for overlap and nearest operations with various tools and parallelization options.

Cloud Storage

Datasets are automatically uploaded to:

Remote path: tgambin:polars-bio-datasets/{dataset_id}/
Public URL: Generated automatically via rclone link
Access: Public download links for easy integration
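
For orientation, an upload followed by link creation could be expressed with the rclone copy and rclone link subcommands. The sketch below only builds the command lines; whether the script uses exactly these invocations is an assumption, and upload_commands is a hypothetical helper.

```python
import os

REMOTE = "tgambin:polars-bio-datasets"


def remote_dir(dataset_id):
    """Remote directory for a dataset, following the path pattern above."""
    return f"{REMOTE}/{dataset_id}/"


def upload_commands(archive, dataset_id):
    """Return argv lists that copy the archive and then request a public link."""
    target = remote_dir(dataset_id)
    return [
        ["rclone", "copy", archive, target],
        ["rclone", "link", target + os.path.basename(archive)],
    ]


for argv in upload_commands("random_intervals_20250530_231351.zip",
                            "random_intervals_20250530_231351"):
    print(" ".join(argv))
```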

Integration with Benchmarks

Once generated, the new dataset can be used in benchmarks by:

Copying the configuration files to the main conf/ directory
Updating benchmark YAML files to reference the new dataset ID
Running benchmarks with the new configuration

Example:

# Copy generated configs (optional)
cp tmp/conf/common.yaml conf/
cp tmp/conf/random.yaml conf/benchmark_random_new.yaml

# Run benchmarks with new dataset
poetry run python src/run-benchmarks.py --bench-config conf/benchmark_random_new.yaml

Known Issues

On macOS with Apple silicon (M-series) chips, you may encounter the following linker error when installing polars-bio from source with Poetry:

ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

To fix this, you can set the following environment variable when installing or updating polars-bio:

RUSTFLAGS="-Clink-arg=-undefined -Clink-arg=dynamic_lookup -Ctarget-cpu=native" poetry update