polars-bio Performance Benchmarking
October 25, 2025 · View on GitHub
This directory contains tools and documentation for performance benchmarking of polars-bio using the polars-bio-bench framework.
Overview
The benchmarking system compares polars-bio performance against competing libraries (pyranges1, genomicranges, bioframe) across key genomic operations:
- overlap: Find overlapping genomic intervals
- nearest: Find nearest genomic intervals
- count_overlaps: Count overlaps between interval sets
GitHub Actions Workflow
Triggering Benchmarks
Benchmarks run via manual workflow dispatch only. They are not triggered automatically on PRs or commits to conserve resources and give developers control.
Via GitHub UI
- Go to the Actions tab in the repository
- Select Performance Benchmarks workflow
- Click Run workflow
- Configure parameters (optional):
- Alert threshold: Performance degradation threshold (default: 150%)
- Baseline tag: Git tag to compare against (default: latest tag)
- Target branch: Branch to benchmark (default: current branch)
- Click Run workflow
Via GitHub CLI
# Run with default parameters (150% threshold, latest tag as baseline)
gh workflow run benchmark.yml
# Run with custom threshold
gh workflow run benchmark.yml -f alert_threshold=120
# Run with specific baseline tag
gh workflow run benchmark.yml -f baseline_tag=0.17.0
# Run on specific branch
gh workflow run benchmark.yml -f target_branch=feature/my-optimization
# Combine multiple parameters
gh workflow run benchmark.yml \
-f alert_threshold=150 \
-f baseline_tag=0.18.0 \
-f target_branch=main
Workflow Parameters
| Parameter | Description | Default | Valid Range |
|---|---|---|---|
alert_threshold | Performance degradation threshold (%) | 150 | 100-1000 |
baseline_tag | Git tag to use as baseline | Latest tag | Any valid git tag |
target_branch | Branch/commit to benchmark | Current branch | Any valid git ref |
How It Works
-
Baseline Execution
- Identifies latest git tag (or uses specified tag)
- Checks out baseline code at that tag
- Installs baseline version of polars-bio
- Runs
benchmark_single_thread-4tools-pull-request.yamlconfiguration - Saves baseline results
-
Target Execution
- Checks out target branch/commit
- Installs target version of polars-bio
- Runs same benchmark configuration
- Saves target results
-
Comparison & Analysis
- Parses both CSV result files
- Calculates per-operation averages across test cases
- Compares averages against threshold
- Generates comparison report
- Stores results in gh-pages branch at
/dev/bench/
-
Reporting
- Posts comparison table as workflow summary
- Posts PR comment (if triggered from PR)
- Uploads results as workflow artifacts
- Updates benchmark visualization on GitHub Pages
Per-Operation Averaging
The benchmark system uses per-operation averaging to provide stable, meaningful comparisons:
- Each operation (overlap, nearest, count_overlaps) runs multiple test cases
- Individual test case times can have high variance
- We calculate the average execution time per operation across all test cases
- Threshold comparison uses these operation averages, not individual test cases
Example:
overlap operation with 6 test cases:
test_case_1: 10ms
test_case_2: 12ms
test_case_3: 11ms
test_case_4: 13ms
test_case_5: 9ms
test_case_6: 11ms
Average: 11ms
If baseline average was 7ms:
Ratio = 11 / 7 = 1.57 (157%)
If threshold is 150%:
157% > 150% → REGRESSION ALERT
Baseline Comparison
The system compares against tagged releases rather than commit-to-commit:
Why tagged releases?
- Tagged releases represent stable, validated baselines
- More meaningful than comparing against volatile main branch
- Aligns with how users experience performance (via releases)
- Reduces false positives from intermediate development work
Baseline selection:
- Default: Uses latest git tag (e.g.,
0.18.0) - Override: Specify
baseline_tagparameter for custom baseline - Fallback: Errors if no tags exist (requires at least one release tag)
Alert Threshold
The default threshold is 150% (1.5x degradation triggers alert).
Why 150%?
- Catches moderate to severe regressions
- Accounts for expected variance on GitHub Actions runners
- Per-operation averaging reduces noise vs individual test cases
- Consistent Linux environment (ubuntu-latest) reduces variance
- Can be adjusted per run based on change type
Adjusting the threshold:
# More sensitive (120% = 1.2x degradation)
gh workflow run benchmark.yml -f alert_threshold=120
# Less sensitive (200% = 2x degradation)
gh workflow run benchmark.yml -f alert_threshold=200
When to adjust:
- Performance-critical changes: Use lower threshold (120-130%)
- Experimental features: Use higher threshold (180-200%)
- Major refactoring: Consider higher threshold initially
Runner Environment
Benchmarks run exclusively on ubuntu-latest Linux runners:
Why Linux only?
- Consistent environment reduces performance variance
- Most polars-bio users deploy on Linux servers
- Simplifies infrastructure (no cross-platform complexity)
- Predictable CPU, memory, and I/O characteristics
Runner specifications:
- OS: Ubuntu (latest LTS)
- CPU: 2-core x86_64
- RAM: 7 GB
- Disk: SSD
Expected variance: ±5-10% due to shared runner infrastructure
Benchmark Configuration
The workflow uses conf/benchmark_single_thread-4tools-pull-request.yaml from polars-bio-bench:
- Operations: overlap, nearest, count_overlaps
- Tools: polars_bio, pyranges1, genomicranges, bioframe
- Dataset: databio (representative genomic intervals)
- Repetitions: 3x per test case
- Parallelism: Disabled (single-thread for reproducibility)
Estimated runtime: 20-25 minutes (including baseline)
Interpreting Results
Comparison Report
The workflow generates a markdown comparison table:
## Benchmark Comparison: feature-branch vs 0.18.0
**Summary:** 1 regressions, 2 improvements, 0 stable
## ⚠️ Performance Regressions Detected
| Operation | Baseline (ms) | PR (ms) | Change | Status |
|-----------|---------------|---------|--------|--------|
| overlap | 45.20 | 75.30 | +66.6% | ❌ Exceeds 150% threshold |
## All Operations
| Operation | Baseline (ms) | PR (ms) | Change | Status |
|-----------------|---------------|---------|---------|--------------|
| overlap | 45.20 | 75.30 | +66.6% | ❌ regression |
| nearest | 32.10 | 28.50 | -11.2% | ✅ improvement |
| count_overlaps | 18.40 | 15.20 | -17.4% | ✅ improvement |
Status Icons
- ❌ regression: Exceeds alert threshold
- ✅ improvement: >5% faster than baseline
- ✓ stable: Within acceptable range
- 🆕 new: Operation not in baseline
- 🗑️ removed: Operation not in target
Viewing Historical Results
Benchmark results are stored on GitHub Pages at:
https://biodatageeks.org/polars-bio/dev/bench/
This page shows:
- Time-series charts for each operation
- Performance trends over time
- Comparison across different tools
- Interactive filtering and zoom
Local Testing
You can run the benchmark parser locally to test changes:
# Parse and compare results
python benchmarks/parse_benchmark_results.py \
baseline_results.csv \
pr_results.csv \
--threshold 150 \
--baseline-tag v0.18.0 \
--pr-ref my-feature
# Output options
python benchmarks/parse_benchmark_results.py \
baseline_results.csv \
pr_results.csv \
--output-json results.json \
--output-comparison comparison.json \
--output-report report.md \
--fail-on-regression # Exit 1 if regressions found
Troubleshooting
No benchmarks running
Symptom: Workflow completes but no benchmark results
Solutions:
- Check that polars-bio-bench repository was cloned successfully
- Verify benchmark configuration file exists
- Check workflow logs for errors in benchmark execution
- Ensure baseline tag can be built and installed
High variance between runs
Symptom: Same code shows different performance on different runs
Solutions:
- GitHub Actions runners have shared infrastructure, expect ±5-10% variance
- Consider increasing alert threshold (e.g., 180-200%)
- Per-operation averaging already reduces variance; if still high, check for I/O-bound operations
- Run multiple benchmark executions and compare trends
Baseline tag not found
Symptom: Error: "Tag 'X' does not exist"
Solutions:
- Ensure you've specified a valid git tag name
- Use
git tag --sort=-creatordateto list available tags - If no tags exist, create one:
git tag v0.1.0 && git push origin v0.1.0
Parser errors
Symptom: Error parsing CSV benchmark output
Solutions:
- Check that polars-bio-bench produces valid CSV output
- Verify CSV has required columns:
operation,tool,test_case, timing column - Check for empty or malformed CSV files
- Update parser if polars-bio-bench CSV format has changed
Workflow fails on baseline installation
Symptom: Cannot install baseline version from tag
Solutions:
- Ensure tagged version can be built (old tags may have dependency issues)
- Consider using a more recent tag as baseline
- Check that baseline tag has all required files
- Verify Rust compilation succeeds for that version
GitHub Pages Structure
The gh-pages branch has the following structure:
gh-pages/
├── index.html # Documentation site (managed by mkdocs)
├── api/ # API documentation
├── blog/ # Blog posts
├── dev/
│ └── bench/ # Benchmark results (managed by workflow)
│ ├── data.js # Benchmark history
│ ├── index.html # Benchmark visualization
│ └── ... # Other benchmark files
└── ... # Other documentation files
Important: The benchmark workflow only modifies /dev/bench/ directory. Documentation remains unchanged.
FAQ
Q: Why manual trigger only?
A: Benchmarks take 20-25 minutes to run. Manual triggering:
- Gives developers control over when to benchmark
- Reduces GitHub Actions runner usage
- Prevents blocking PRs on benchmark completion
- Allows selective benchmarking of performance-sensitive changes
Q: Why compare against tags instead of main?
A: Tags represent stable release baselines:
- Main branch changes frequently, causing noisy comparisons
- Users care about performance relative to releases, not intermediate commits
- Tagged baselines provide stable reference points
- Reduces false positives from development churn
Q: Why 150% threshold?
A:
- Catches moderate regressions (1.5x slowdown)
- Accounts for runner variance (±5-10%)
- Per-operation averaging smooths outliers
- Lower than previous 200% to catch more issues
- Can be adjusted per run based on needs
Q: Can I run benchmarks on multiple platforms?
A: No, the workflow runs Linux only by design:
- Reduces complexity and cost
- Most deployments are Linux
- Cross-platform benchmarking adds variance without proportional value
- Use local benchmarking for platform-specific testing
Q: How do I benchmark unreleased code?
A: Use the target_branch parameter:
# Benchmark feature branch against latest release
gh workflow run benchmark.yml -f target_branch=feature/my-optimization
# Benchmark specific commit
gh workflow run benchmark.yml -f target_branch=abc1234
Q: What if there are no git tags?
A: The workflow requires at least one tag. Create one:
git tag v0.1.0
git push origin v0.1.0
Q: Can I cache baseline results?
A: Currently no, but this is planned. Each run executes both baseline and target benchmarks to ensure consistent environment.
Contributing
To improve the benchmarking system:
- Parser improvements: Edit
benchmarks/parse_benchmark_results.py - Workflow changes: Edit
.github/workflows/benchmark.yml - Documentation updates: Edit this file
- Test changes locally before committing
- Run manual benchmark to verify changes work end-to-end