February 14, 2026
Koala Diff
Blazingly Fast Data Comparison for the Modern Stack.
Koala Diff is the "git diff" for your data lake. It compares massive datasets (CSV, Parquet, JSON) instantly to find added, removed, and modified rows.
Built in Rust for speed, wrapped in Python for ease of use. It streams data to compare datasets larger than RAM and generates beautiful HTML reports.
Why Koala Diff?
- Zero-Copy Streaming: Compare 100GB files on a laptop without running out of RAM.
- Rust-Powered Analytics: Go beyond row counts. Track Value Variance, Null Drift, and Match Integrity per column.
- Professional Dashboards: Auto-generates premium, stakeholder-ready HTML reports with status badges and join attribution.
- Deep-Dive API: Extract mismatched records as Polars DataFrames for instant remediation.
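The README does not define the per-column metrics it names. As one plausible reading, "Null Drift" could be the change in a column's null rate between source and target. The sketch below illustrates that reading only; the `null_rate` and `null_drift` helpers are hypothetical and are not part of the Koala Diff API.

```python
from typing import Dict, Mapping, Optional, Sequence

Row = Mapping[str, Optional[object]]


def null_rate(rows: Sequence[Row], column: str) -> float:
    """Fraction of rows whose value for `column` is None (a missing key counts as None)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def null_drift(source: Sequence[Row], target: Sequence[Row],
               columns: Sequence[str]) -> Dict[str, float]:
    """Assumed definition: target null rate minus source null rate, per column."""
    return {c: null_rate(target, c) - null_rate(source, c) for c in columns}


# Tiny illustrative datasets (made up for this example).
source = [
    {"email": "a@x.io", "plan": "free"},
    {"email": "b@x.io", "plan": "pro"},
    {"email": None, "plan": "pro"},
    {"email": "d@x.io", "plan": "free"},
]
target = [
    {"email": "a@x.io", "plan": "free"},
    {"email": None, "plan": "pro"},
    {"email": None, "plan": "pro"},
    {"email": "d@x.io", "plan": "free"},
]

drift = null_drift(source, target, ["email", "plan"])
print(drift)
```

A positive value means the column gained nulls in the target, which is usually the direction worth alerting on.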
The "Magic" Benchmark
"Process 100M rows on a laptop in seconds, not minutes."
Performance at a Glance
- Time: 1x (Koala Diff) vs 3x (Polars) vs 30x+ (Pandas)
- RAM: 0.4GB (Koala Diff) vs 12GB+ (Polars)
- Edge: Native Rust `XXHash64` handles massive joins locally without cluster overhead.
Why not just use Polars/Spark?
While Polars and Spark are incredible for general data processing, Koala Diff is a specialized tool for Data Quality & Regression:
| Feature | Polars / Spark | Koala Diff |
|---|---|---|
| Specialization | General Purpose ETL | Data Quality & Diffing |
| Memory | High (Join-heavy) | Ultra-Low (Streaming) |
| Output | Raw DataFrames | Pro Dashboards + Metrics |
| Logic | Manual Join/Filter code | Out-of-the-box Analytics |
| Stakeholders | Engineer-facing | Business-Ready Reports |
Koala Diff doesn't replace your processing engine; it verifies that its output is correct.
> Benchmarks run on MacBook Pro M3 Max.
Common Use Cases
- ETL Regression Testing: Automatically verify that your daily pipeline didn't accidentally mutate 1 million rows after a code change.
- Data Migration Validation: Ensure 100% parity when moving data between systems (e.g., Hive to Snowflake or S3 to BigQuery).
- Environment Drift Detection: Compare Production vs. Staging datasets to find out why your model is behaving differently.
- Compliance Auditing: Generate unalterable HTML snapshots of data changes for regulatory or financial reviews.
- CI/CD for Data: Run `koala-diff` in your CI pipeline to block PRs that introduce unexpected data quality regressions.
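For the CI/CD use case, a gate can be as simple as turning diff counts into an exit code. The sketch below assumes you already have the added, removed, and modified row counts as a plain dictionary. The README does not document the shape of the comparison result, so `gate` and its `counts` argument are hypothetical.

```python
import sys
from typing import Mapping


def gate(counts: Mapping[str, int], max_changed: int = 0) -> int:
    """Return 0 if the number of changed rows is within budget, 1 otherwise."""
    changed = (
        counts.get("added", 0)
        + counts.get("removed", 0)
        + counts.get("modified", 0)
    )
    if changed > max_changed:
        print(
            f"Data diff gate failed: {changed} changed rows (budget {max_changed})",
            file=sys.stderr,
        )
        return 1
    return 0


# Hypothetical counts; in practice, derive them from your comparison output.
clean_run = gate({"added": 0, "removed": 0, "modified": 0})
regressed_run = gate({"added": 2, "removed": 0, "modified": 5})
tolerated_run = gate({"added": 2, "removed": 0, "modified": 5}, max_changed=10)
```

In a CI job, passing the return value to `sys.exit()` makes the step fail whenever the budget is exceeded.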
Installation
```shell
pip install koala-diff
```
Quick Start
1. Generate a "Pro" Report
```python
from koala_diff import DataDiff, HtmlReporter

# Initialize with primary keys
differ = DataDiff(key_columns=["user_id"])

# Run comparison
result = differ.compare("source.parquet", "target.parquet")

# Generate a professional dashboard
reporter = HtmlReporter("data_quality_report.html")
reporter.generate(result)
```
2. Mismatch Deep-Dive
Need to fix the data? Pull the exact differences directly into Python:
```python
# Get a Polars DataFrame of ONLY mismatched rows
# (`differ` is the DataDiff instance from step 1)
mismatch_df = differ.get_mismatch_df()

# Analyze variance or push to a remediation pipeline
print(mismatch_df.head())
```
3. CLI Usage (Coming Soon)
```shell
koala-diff production.csv staging.csv --key user_id --output report.html
```
Architecture
Koala Diff uses a streaming hash-join algorithm implemented in Rust:
- Reader: Lazy Polars scan of both datasets.
- Hasher: XXHash64 computation of row values (SIMD optimized).
- Differ: Fast set operations to classify rows as `Added`, `Removed`, or `Modified`.
- Reporter: Jinja2 rendering of results.
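The Hasher and Differ stages can be sketched in pure Python. This is a minimal illustration under stated assumptions, not Koala Diff's actual Rust implementation: `hashlib.blake2b` with an 8-byte digest stands in for XXHash64, the rows are small in-memory dictionaries rather than a lazy Polars scan, and the helper names (`row_hash`, `index_rows`, `classify`) are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Mapping, Sequence, Tuple

Key = Tuple[object, ...]


def row_hash(row: Mapping[str, object], columns: Sequence[str]) -> bytes:
    """Return an 8-byte digest of the row's values in a fixed column order."""
    h = hashlib.blake2b(digest_size=8)  # stand-in for XXHash64 (assumption)
    for col in columns:
        h.update(repr(row.get(col)).encode("utf-8"))
        h.update(b"\x1f")  # separator so ("ab", "c") and ("a", "bc") differ
    return h.digest()


def index_rows(rows: Iterable[Mapping[str, object]],
               key_columns: Sequence[str],
               value_columns: Sequence[str]) -> Dict[Key, bytes]:
    """Map each primary-key tuple to the hash of its non-key values."""
    return {
        tuple(row[k] for k in key_columns): row_hash(row, value_columns)
        for row in rows
    }


def classify(source: Dict[Key, bytes], target: Dict[Key, bytes]) -> Dict[str, List[Key]]:
    """Classify keys as added, removed, or modified using set operations."""
    src_keys, tgt_keys = source.keys(), target.keys()
    return {
        "added": sorted(tgt_keys - src_keys),
        "removed": sorted(src_keys - tgt_keys),
        "modified": sorted(k for k in src_keys & tgt_keys if source[k] != target[k]),
    }


# Tiny illustrative datasets (made up for this example).
source_rows = [
    {"user_id": 1, "plan": "free"},
    {"user_id": 2, "plan": "pro"},
    {"user_id": 3, "plan": "pro"},
]
target_rows = [
    {"user_id": 2, "plan": "team"},
    {"user_id": 3, "plan": "pro"},
    {"user_id": 4, "plan": "free"},
]

result = classify(
    index_rows(source_rows, ["user_id"], ["plan"]),
    index_rows(target_rows, ["user_id"], ["plan"]),
)
print(result)
```

Only one fixed-size digest per key is kept rather than the full row, which is the property that lets a hash-based diff stay small in memory. A streaming implementation would build these indexes incrementally instead of from in-memory lists.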
Contributing
We welcome contributions, whether it's a new file format reader, a performance optimization, or a documentation fix.
- Check the Issues.
- Read our Contribution Guide.
License
MIT © 2026 godalida - KoalaDataLab