pfc-migrate-parquet
May 20, 2026
Convert Apache Parquet files to PFC-JSONL cold storage — streaming, in-region, no egress costs.
Why
Parquet is a great analytics format. But long-term cold storage of Parquet archives is expensive — files don't compress further, cold queries are slow, and moving terabytes between regions costs real money.
PFC-JSONL achieves 25% smaller files than gzip and 37% smaller than zstd on real-world log data, with a built-in block index (BIDX) for fast time-range queries without full decompression.
pfc-migrate-parquet converts your existing Parquet archives to PFC — directly on your cloud server, in the same region as your object storage, without touching your local disk.
Best for: Time-series log data, event archives, append-only workloads where you query by time range.
Not for: Analytics workloads that need column pruning or aggregations across all rows (e.g. SELECT AVG(cpu) GROUP BY host). For those, keep Parquet. PFC wins on cold storage and time-range retrieval.
Hive partitions: By default, files are written flat into --output-dir. Use --preserve-structure to mirror the source directory tree (e.g. year=2024/month=01/data.pfc).
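The difference between the flat default and --preserve-structure can be illustrated with a small path-mapping sketch. The helper `output_path` is hypothetical and is not part of the tool; it only mirrors the behaviour described above, under the assumption that the output name is the source name with a .pfc extension.

```python
from pathlib import PurePosixPath


def output_path(src_file: str, src_root: str, out_root: str,
                preserve_structure: bool = False) -> PurePosixPath:
    """Map one source .parquet path to its .pfc destination (illustrative only)."""
    src = PurePosixPath(src_file)
    if preserve_structure:
        # Keep the path relative to the source root, e.g. year=2024/month=01/
        relative = src.relative_to(src_root)
    else:
        # Flat mode: only the file name survives
        relative = PurePosixPath(src.name)
    return PurePosixPath(out_root) / relative.with_suffix(".pfc")


src = "/data/parquet/year=2024/month=01/data.parquet"
flat = output_path(src, "/data/parquet", "/data/pfc")
mirrored = output_path(src, "/data/parquet", "/data/pfc", preserve_structure=True)
print(flat)      # /data/pfc/data.pfc
print(mirrored)  # /data/pfc/year=2024/month=01/data.pfc
```

In this sketch, two partitions that both contain data.parquet map to the same flat output name. How the tool itself resolves such a collision is not stated here, so --preserve-structure is the unambiguous choice for Hive-style layouts.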
How It Works
your-data.parquet
│
│ pyarrow iter_batches() ← streaming, no full-file RAM load
▼
JSONL temp file ← columnar → row reconstruction
│
│ pfc_jsonl compress
▼
your-data.pfc ← BWT → MTF → RLE → rANS O2 + BIDX
Parquet's internal compression (Snappy, gzip, zstd, LZ4) is handled transparently by pyarrow — no extra steps needed.
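The first stage of the pipeline (streaming read, then columnar to row reconstruction) might look roughly like the sketch below. It is not the tool's actual source: the function names are hypothetical, `default=str` is only a placeholder for values JSON cannot encode natively, and the `pfc_jsonl compress` step is left out because its arguments are not documented in this README.

```python
import json
from typing import Any, Dict, Iterator, List


def rows_from_columns(columns: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Turn one column-oriented batch into row dicts (columnar -> row)."""
    names = list(columns)
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


def parquet_to_jsonl(src: str, dst: str, batch_size: int = 10_000) -> int:
    """Stream a Parquet file into a JSONL file batch by batch; returns the row count."""
    # Imported lazily so rows_from_columns() stays usable without pyarrow installed.
    import pyarrow.parquet as pq

    rows = 0
    with open(dst, "w", encoding="utf-8") as out:
        # iter_batches() yields RecordBatch objects, so only one batch is in RAM at a time.
        for batch in pq.ParquetFile(src).iter_batches(batch_size=batch_size):
            for row in rows_from_columns(batch.to_pydict()):
                # default=str is a placeholder encoder, not the tool's real type mapping.
                out.write(json.dumps(row, default=str) + "\n")
                rows += 1
    return rows


batch = {"host": ["web-1", "web-2"], "cpu": [0.42, None]}
print(list(rows_from_columns(batch)))
```

Peak memory in this sketch scales with `batch_size` times the row width rather than with file size, which is the reason for reading through `iter_batches()` instead of loading the whole table.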
Installation
pip install pfc-migrate-parquet
# Optional: S3/MinIO support
pip install "pfc-migrate-parquet[s3]"
Requires the pfc_jsonl binary (v5.0+):
# Linux x64
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
-o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl
# macOS Apple Silicon
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
-o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl
# macOS Intel
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-x64 \
-o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl
# Windows (PowerShell)
Invoke-WebRequest https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-windows-x64.exe `
-OutFile "$env:LOCALAPPDATA\Microsoft\WindowsApps\pfc_jsonl.exe"
Quick Start
# Single file — auto-detects timestamp column, builds BIDX
pfc-migrate-parquet convert events.parquet events.pfc
# Explicit timestamp column
pfc-migrate-parquet convert events.parquet events.pfc --ts-col event_time
# No timestamp (sequential block index)
pfc-migrate-parquet convert metrics.parquet metrics.pfc --no-timestamp
# Whole directory
pfc-migrate-parquet convert --dir /data/parquet/ --output-dir /data/pfc/
# Recursive + preserve Hive partition structure
pfc-migrate-parquet convert --dir /data/parquet/ --output-dir /data/pfc/ \
--recursive --preserve-structure --verbose
Timestamp & BIDX
PFC-JSONL uses a Block Index (BIDX) for fast time-range queries — no full decompression needed. pfc-migrate-parquet maps a Parquet timestamp column to the ts field that feeds the BIDX.
| Scenario | Behaviour |
|---|---|
| One TIMESTAMP column | Auto-detected, no flags needed |
| Multiple timestamp columns | First one wins, or use --ts-col name |
| No timestamp column | Use --no-timestamp — sequential blocks, no time queries |
| Custom column name | --ts-col my_event_ts |
Supported Parquet timestamp types: TIMESTAMP_MICROS, TIMESTAMP_MILLIS, TIMESTAMP_NANOS, with or without timezone. All are normalised to ISO 8601 UTC strings.
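The selection rules in the table and the UTC normalisation can be sketched as follows. Both helpers are hypothetical illustrations, not the tool's API. The schema is modelled as plain (name, is_timestamp) pairs; with pyarrow, that flag could come from `pyarrow.types.is_timestamp(field.type)`. The trailing `Z` follows the format used in the query examples in this README, and treating naive values as UTC follows the Supported Parquet Features table.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


def pick_ts_column(fields: List[Tuple[str, bool]],
                   ts_col: Optional[str] = None) -> Optional[str]:
    """fields holds (column_name, is_timestamp) pairs in schema order."""
    if ts_col is not None:
        return ts_col  # an explicit --ts-col always wins
    for name, is_timestamp in fields:
        if is_timestamp:
            return name  # first timestamp column wins
    return None  # no candidate: corresponds to the --no-timestamp case


def to_iso_utc(value: datetime) -> str:
    """Normalise a datetime to an ISO 8601 UTC string."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # naive values are assumed UTC
    else:
        value = value.astimezone(timezone.utc)
    return value.isoformat().replace("+00:00", "Z")


schema = [("host", False), ("event_time", True), ("ingested_at", True)]
print(pick_ts_column(schema))                 # event_time
print(pick_ts_column(schema, "ingested_at"))  # ingested_at

berlin_summer = timezone(timedelta(hours=2))
print(to_iso_utc(datetime(2024, 1, 1, 12, 0, tzinfo=berlin_summer)))  # 2024-01-01T10:00:00Z
```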
In-Region Migration (No Egress Costs)
The intended deployment pattern:
┌─────────────────────────────────────────┐
│ Cloud Region (e.g. eu-west-1) │
│ │
│ ┌──────────────┐ ┌────────────────┐ │
│ │ S3 / MinIO │ │ Cloud Server │ │
│ │ *.parquet │──▶│ pfc-migrate- │ │
│ │ │◀──│ parquet │ │
│ │ *.pfc │ │ │ │
│ └──────────────┘ └────────────────┘ │
└─────────────────────────────────────────┘
↑ no data leaves the region
Mount your object storage bucket via s3fs or gcsfuse, then run the migration on a small VM in the same region: zero egress fees, no local disk needed.
# Mount S3 bucket in-region
s3fs my-bucket /mnt/s3 -o iam_role=auto,endpoint=eu-west-1
# Migrate in-place
pfc-migrate-parquet convert \
--dir /mnt/s3/parquet/2024/ \
--output-dir /mnt/s3/pfc/2024/ \
--recursive --verbose
Options Reference
pfc-migrate-parquet convert [INPUT OUTPUT] [OPTIONS]
Positional:
INPUT.parquet Input Parquet file
OUTPUT.pfc Output PFC file (default: same dir, .pfc extension)
File selection:
--dir DIR Process all .parquet files in DIR
--output-dir DIR Output directory for batch mode
--recursive Recurse into subdirectories (batch mode only)
--preserve-structure Mirror source dir tree in output (Hive partitions)
Timestamp:
--ts-col COLUMN Parquet column to use as 'ts' for BIDX
--no-timestamp Skip timestamp mapping (sequential block index)
Performance:
--batch-size N Rows per read batch, default 10000
Reduce for low-memory servers, increase for speed
Binary:
--pfc-binary PATH Path to pfc_jsonl binary
(overrides $PFC_JSONL_BINARY env var)
Global:
--verbose, -v Print per-file stats
--version Show version
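The precedence between --pfc-binary and $PFC_JSONL_BINARY can be pictured with the sketch below. `resolve_binary` is a hypothetical helper, and the final fallback to a `PATH` lookup is an assumption suggested by the install location under Installation, not documented behaviour.

```python
import os
import shutil
from typing import Mapping, Optional


def resolve_binary(cli_path: Optional[str] = None,
                   env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the pfc_jsonl path: CLI flag first, then env var, then PATH."""
    env = os.environ if env is None else env
    if cli_path:
        return cli_path  # --pfc-binary overrides everything else
    if env.get("PFC_JSONL_BINARY"):
        return env["PFC_JSONL_BINARY"]
    # Assumption: fall back to whatever pfc_jsonl is on PATH (None if absent).
    return shutil.which("pfc_jsonl")


env = {"PFC_JSONL_BINARY": "/usr/local/bin/pfc_jsonl"}
print(resolve_binary("/opt/pfc/pfc_jsonl", env))  # /opt/pfc/pfc_jsonl
print(resolve_binary(None, env))                  # /usr/local/bin/pfc_jsonl
```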
Supported Parquet Features
| Feature | Support |
|---|---|
| Snappy / gzip / zstd / LZ4 / uncompressed | ✅ |
| Multiple row groups | ✅ |
| Dictionary-encoded columns (categoricals) | ✅ |
| Nested structs | ✅ |
| List / array columns | ✅ |
| Decimal128 | ✅ |
| NULL values | ✅ → null in JSON |
| Timezone-aware timestamps | ✅ → ISO 8601 UTC |
| Naive timestamps | ✅ → assumed UTC |
| Binary columns | ✅ → base64 via JSON |
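To make the mappings in the table concrete, here is a minimal sketch of a `json.dumps` fallback encoder. It is an illustration rather than the tool's code: the name `json_default` is hypothetical, and emitting decimals as strings is an assumption, since the table only states that Decimal128 is supported.

```python
import base64
import json
from datetime import datetime, timezone
from decimal import Decimal


def json_default(value):
    """Fallback encoder for values the json module cannot serialise itself."""
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(value).decode("ascii")  # binary -> base64
    if isinstance(value, datetime):
        if value.tzinfo is None:
            value = value.replace(tzinfo=timezone.utc)  # naive -> assumed UTC
        return value.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")
    if isinstance(value, Decimal):
        return str(value)  # assumption: keep exact digits as a string
    raise TypeError(f"cannot serialise {type(value).__name__}")


row = {
    "host": "web-1",
    "payload": b"\x00\x01",
    "price": Decimal("9.90"),
    "ts": datetime(2024, 1, 1),
    "note": None,  # NULL becomes JSON null without any help
}
line = json.dumps(row, default=json_default)
print(line)
```

Nested structs and list columns need no special handling in this sketch, because they arrive as Python dicts and lists and `json.dumps` calls the fallback for any unsupported value inside them.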
Python API
from pfc_migrate_parquet import convert_file, batch_convert
# Single file
result = convert_file(
"events.parquet",
"events.pfc",
pfc_binary="/usr/local/bin/pfc_jsonl",
ts_col="event_time", # None = auto-detect
batch_size=10_000,
verbose=True,
)
print(f"{result['rows']:,} rows → {result['ratio_pct']:.1f}% ratio")
# Batch
batch_convert(
src_dir="/data/parquet/",
output_dir="/data/pfc/",
pfc_binary="/usr/local/bin/pfc_jsonl",
recursive=True,
)
What to Do After Migration
# Inspect BIDX (block index)
pfc_jsonl info archive.pfc
# Query a time range — no full decompression
pfc_jsonl query archive.pfc --from "2024-01-01T00:00:00Z" --to "2024-01-02T00:00:00Z"
# Full decompression to JSONL
pfc_jsonl decompress archive.pfc output.jsonl
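For scripted retrieval, the query command above could be wrapped from Python along these lines. The wrapper names are hypothetical, and the sketch assumes that `pfc_jsonl query` prints matching rows as JSONL on stdout, which this README does not confirm.

```python
import json
import subprocess
from typing import Iterator, List


def build_query_cmd(archive: str, start: str, end: str,
                    binary: str = "pfc_jsonl") -> List[str]:
    """Build the argument list for the time-range query shown above."""
    return [binary, "query", archive, "--from", start, "--to", end]


def query_rows(archive: str, start: str, end: str,
               binary: str = "pfc_jsonl") -> Iterator[dict]:
    """Run the query and parse each output line (assumed to be JSONL on stdout)."""
    proc = subprocess.run(build_query_cmd(archive, start, end, binary),
                          check=True, capture_output=True, text=True)
    for line in proc.stdout.splitlines():
        if line.strip():
            yield json.loads(line)


cmd = build_query_cmd("archive.pfc", "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z")
print(" ".join(cmd))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with timestamps and file names.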
Related Tools
| Tool | Purpose |
|---|---|
| pfc-jsonl | Core PFC compressor / CLI |
| pfc-migrate | Migrate gzip/zstd/bz2/lz4 JSONL → PFC |
| pfc-duckdb | Query PFC archives directly from DuckDB |
| pfc-export-cratedb | Export CrateDB tables → PFC |
| pfc-export-questdb | Export QuestDB tables → PFC |
→ View all PFC tools & integrations
Disclaimer
pfc-migrate-parquet is an independent open-source project and is not affiliated with, endorsed by, or associated with the Apache Software Foundation or the Apache Parquet project. Apache and Apache Parquet are trademarks of the Apache Software Foundation.
License
pfc-migrate-parquet (this repository) is released under the MIT License — see LICENSE.
The PFC-JSONL binary (pfc_jsonl) is proprietary software — free for personal and open-source use. Commercial use requires a license: info@impossibleforge.com