pfc-migrate-parquet

May 20, 2026

Convert Apache Parquet files to PFC-JSONL cold storage — streaming, in-region, no egress costs.

Why

Parquet is a great analytics format, but keeping Parquet archives in long-term cold storage is expensive: the files are already compressed internally and gain little from a second compression pass, cold queries are slow, and moving terabytes between regions costs real money.

PFC-JSONL achieves 25% smaller files than gzip and 37% smaller than zstd on real-world log data, with a built-in block index (BIDX) for fast time-range queries without full decompression.

pfc-migrate-parquet converts your existing Parquet archives to PFC — directly on your cloud server, in the same region as your object storage, without touching your local disk.

Best for: Time-series log data, event archives, append-only workloads where you query by time range.

Not for: Analytics workloads that need column pruning or aggregations across all rows (e.g. SELECT AVG(cpu) GROUP BY host). For those, keep Parquet. PFC wins on cold storage and time-range retrieval.

Hive partitions: By default, files are written flat into --output-dir. Use --preserve-structure to mirror the source directory tree (e.g. year=2024/month=01/data.pfc).
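The difference between the flat default and --preserve-structure can be sketched as a path mapping. This is a minimal illustration, not the tool's actual code: the function name `output_path` and its parameters are hypothetical, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath

def output_path(src_file, src_root, output_dir, preserve_structure=False):
    """Map a source .parquet path to a .pfc destination (illustrative only)."""
    src_file = PurePosixPath(src_file)
    src_root = PurePosixPath(src_root)
    output_dir = PurePosixPath(output_dir)
    if preserve_structure:
        # Mirror the path relative to the source root, e.g. year=2024/month=01/
        relative = src_file.relative_to(src_root)
        return (output_dir / relative).with_suffix(".pfc")
    # Flat mode: only the file name survives, the partition directories do not
    return output_dir / (src_file.stem + ".pfc")

src = "/data/parquet/year=2024/month=01/data.parquet"
print(output_path(src, "/data/parquet", "/data/pfc"))
print(output_path(src, "/data/parquet", "/data/pfc", preserve_structure=True))
```

In this sketch, flat mode maps same-named files from different partitions to the same output path, which is one reason to mirror the tree for Hive-partitioned data. How the tool itself treats such name clashes is not covered here.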

How It Works

your-data.parquet
        │
        │  pyarrow iter_batches()   ← streaming, no full-file RAM load
        ▼
  JSONL temp file                  ← columnar → row reconstruction
        │
        │  pfc_jsonl compress
        ▼
your-data.pfc                      ← BWT → MTF → RLE → rANS O2 + BIDX

Parquet's internal compression (Snappy, gzip, zstd, LZ4) is handled transparently by pyarrow — no extra steps needed.
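The "columnar → row reconstruction" step in the diagram can be pictured with a small standard-library sketch. It assumes each batch arrives as a dict of column name to list of values, which is the shape pyarrow's `RecordBatch.to_pydict()` returns. The helper name `batch_to_jsonl` is hypothetical, and the tool's internal implementation may differ.

```python
import json

def batch_to_jsonl(columns):
    """Turn one columnar batch ({column name: list of values}) into JSONL lines."""
    names = list(columns)
    # zip(*) walks all columns in lockstep, producing one tuple of values per row
    for values in zip(*(columns[name] for name in names)):
        yield json.dumps(dict(zip(names, values)))

batch = {
    "ts": ["2024-01-01T00:00:00Z", "2024-01-01T00:00:05Z"],
    "host": ["web-1", "web-2"],
    "cpu": [0.42, None],  # None becomes null in JSON
}
for line in batch_to_jsonl(batch):
    print(line)
```

Because only one batch is held in memory at a time, peak memory depends on the batch size rather than the file size.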

Installation

pip install pfc-migrate-parquet

# Optional: S3/MinIO support
pip install "pfc-migrate-parquet[s3]"

Requires the pfc_jsonl binary (v5.0+):

# Linux x64
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

# macOS Apple Silicon
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

# macOS Intel
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-x64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

# Windows (PowerShell)
Invoke-WebRequest https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-windows-x64.exe `
  -OutFile "$env:LOCALAPPDATA\Microsoft\WindowsApps\pfc_jsonl.exe"

Quick Start

# Single file — auto-detects timestamp column, builds BIDX
pfc-migrate-parquet convert events.parquet events.pfc

# Explicit timestamp column
pfc-migrate-parquet convert events.parquet events.pfc --ts-col event_time

# No timestamp (sequential block index)
pfc-migrate-parquet convert metrics.parquet metrics.pfc --no-timestamp

# Whole directory
pfc-migrate-parquet convert --dir /data/parquet/ --output-dir /data/pfc/

# Recursive + preserve Hive partition structure
pfc-migrate-parquet convert --dir /data/parquet/ --output-dir /data/pfc/ \
  --recursive --preserve-structure --verbose

Timestamp & BIDX

PFC-JSONL uses a Block Index (BIDX) for fast time-range queries — no full decompression needed. pfc-migrate-parquet maps a Parquet timestamp column to the ts field that feeds the BIDX.
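To see why a block index avoids full decompression, consider the following conceptual sketch. The actual BIDX layout is not described in this README, so `BlockEntry`, its fields, and `blocks_for_range` are purely hypothetical. The idea is only that each block records the time span it covers, so a query needs to touch just the overlapping blocks.

```python
from typing import NamedTuple

class BlockEntry(NamedTuple):
    block_id: int
    min_ts: str  # ISO 8601 UTC; strings in one fixed format sort chronologically
    max_ts: str

def blocks_for_range(index, start, end):
    """Return ids of blocks whose [min_ts, max_ts] overlaps the query [start, end)."""
    return [e.block_id for e in index if e.min_ts < end and e.max_ts >= start]

# Hypothetical index: four blocks of eight hours each
index = [
    BlockEntry(0, "2024-01-01T00:00:00Z", "2024-01-01T07:59:59Z"),
    BlockEntry(1, "2024-01-01T08:00:00Z", "2024-01-01T15:59:59Z"),
    BlockEntry(2, "2024-01-01T16:00:00Z", "2024-01-01T23:59:59Z"),
    BlockEntry(3, "2024-01-02T00:00:00Z", "2024-01-02T07:59:59Z"),
]
print(blocks_for_range(index, "2024-01-01T10:00:00Z", "2024-01-01T18:00:00Z"))  # [1, 2]
```

Only blocks 1 and 2 would need to be decompressed for this query; blocks 0 and 3 are skipped entirely.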

Scenario	Behaviour
One `TIMESTAMP` column	Auto-detected, no flags needed
Multiple timestamp columns	First one wins, or use `--ts-col name`
No timestamp column	Use `--no-timestamp` — sequential blocks, no time queries
Custom column name	`--ts-col my_event_ts`

Supported Parquet timestamp types: TIMESTAMP_MICROS, TIMESTAMP_MILLIS, TIMESTAMP_NANOS, with or without timezone. All are normalised to ISO 8601 UTC strings.
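The normalisation rule (timezone-aware values converted to UTC, naive values assumed to be UTC) can be sketched with the standard library. The function name `to_iso_utc` is hypothetical, and the trailing `Z` is an assumption about the exact string form the tool emits. Python's `datetime` also stops at microsecond precision, so this sketch does not cover nanosecond values.

```python
from datetime import datetime, timedelta, timezone

def to_iso_utc(value):
    """Normalise a datetime to an ISO 8601 UTC string (illustrative only)."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # naive: assumed to be UTC
    else:
        value = value.astimezone(timezone.utc)      # aware: converted to UTC
    # Assumption: render the UTC offset as a trailing "Z"
    return value.isoformat().replace("+00:00", "Z")

naive = datetime(2024, 1, 1, 12, 0, 0)
aware = datetime(2024, 1, 1, 13, 0, 0, tzinfo=timezone(timedelta(hours=1)))
print(to_iso_utc(naive))
print(to_iso_utc(aware))
```

Both values describe the same instant, so both print the same string.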

In-Region Migration (No Egress Costs)

The intended deployment pattern:

┌─────────────────────────────────────────┐
│  Cloud Region (e.g. eu-west-1)          │
│                                         │
│  ┌──────────────┐   ┌────────────────┐  │
│  │  S3 / MinIO  │   │  Cloud Server  │  │
│  │  *.parquet   │──▶│  pfc-migrate-  │  │
│  │              │◀──│  parquet       │  │
│  │  *.pfc       │   │                │  │
│  └──────────────┘   └────────────────┘  │
└─────────────────────────────────────────┘
         ↑ no data leaves the region

Mount your object storage bucket via s3fs or gcsfuse and run the migration on a small VM in the same region. Data never leaves the region, so there are no egress fees, and nothing is copied to your local machine.

# Mount S3 bucket in-region
s3fs my-bucket /mnt/s3 -o iam_role=auto,endpoint=eu-west-1

# Migrate within the mounted bucket (reads parquet/2024/, writes pfc/2024/)
pfc-migrate-parquet convert \
  --dir /mnt/s3/parquet/2024/ \
  --output-dir /mnt/s3/pfc/2024/ \
  --recursive --verbose

Options Reference

pfc-migrate-parquet convert [INPUT [OUTPUT]] [OPTIONS]

Positional:
  INPUT.parquet          Input Parquet file
  OUTPUT.pfc             Output PFC file (default: same dir, .pfc extension)

File selection:
  --dir DIR              Process all .parquet files in DIR
  --output-dir DIR       Output directory for batch mode
  --recursive            Recurse into subdirectories (batch mode only)
  --preserve-structure   Mirror source dir tree in output (Hive partitions)

Timestamp:
  --ts-col COLUMN        Parquet column to use as 'ts' for BIDX
  --no-timestamp         Skip timestamp mapping (sequential block index)

Performance:
  --batch-size N         Rows per read batch, default 10000
                         Reduce for low-memory servers, increase for speed

Binary:
  --pfc-binary PATH      Path to pfc_jsonl binary
                         (overrides $PFC_JSONL_BINARY env var)

Global:
  --verbose, -v          Print per-file stats
  --version              Show version
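The precedence between --pfc-binary and $PFC_JSONL_BINARY can be sketched as follows. The function name `resolve_pfc_binary` is hypothetical, and the final fallback to a `PATH` lookup is an assumption rather than documented behaviour.

```python
import os
import shutil

def resolve_pfc_binary(cli_path=None, env=None):
    """Pick the pfc_jsonl binary: CLI flag first, then env var, then PATH."""
    env = os.environ if env is None else env
    candidate = (
        cli_path                           # --pfc-binary overrides everything
        or env.get("PFC_JSONL_BINARY")     # documented environment variable
        or shutil.which("pfc_jsonl")       # assumption: fall back to PATH
    )
    if candidate is None:
        raise FileNotFoundError("pfc_jsonl binary not found")
    return candidate

fake_env = {"PFC_JSONL_BINARY": "/usr/local/bin/pfc_jsonl"}
print(resolve_pfc_binary("/opt/pfc/pfc_jsonl", fake_env))
print(resolve_pfc_binary(None, fake_env))
```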

Supported Parquet Features

Feature	Support
Snappy / gzip / zstd / LZ4 / uncompressed	✅
Multiple row groups	✅
Dictionary-encoded columns (categoricals)	✅
Nested structs	✅
List / array columns	✅
Decimal128	✅
NULL values	✅ → `null` in JSON
Timezone-aware timestamps	✅ → ISO 8601 UTC
Naive timestamps	✅ → assumed UTC
Binary columns	✅ → base64-encoded string in JSON
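Several of these types have no native JSON form. A minimal sketch of how such values could be serialised uses the `default` hook of `json.dumps`. The hook name `json_default` is hypothetical, and writing decimals as strings is an assumption, since the table only states the encoding for binary columns. Nested structs and lists need no special handling because they already map to JSON objects and arrays.

```python
import base64
import json
from decimal import Decimal

def json_default(value):
    """Fallback encoder for values json.dumps cannot serialise natively."""
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")  # binary becomes base64 text
    if isinstance(value, Decimal):
        return str(value)  # assumption: a string keeps the full decimal precision
    raise TypeError(f"Cannot serialise {type(value).__name__}")

row = {
    "id": 7,
    "payload": b"\x00\xff",
    "price": Decimal("19.99"),
    "tags": ["a", "b"],         # list column
    "meta": {"region": None},   # nested struct with a NULL field
}
encoded = json.dumps(row, default=json_default)
print(encoded)
```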

Python API

from pfc_migrate_parquet import convert_file, batch_convert

# Single file
result = convert_file(
    "events.parquet",
    "events.pfc",
    pfc_binary="/usr/local/bin/pfc_jsonl",
    ts_col="event_time",   # None = auto-detect
    batch_size=10_000,
    verbose=True,
)
print(f"{result['rows']:,} rows  →  {result['ratio_pct']:.1f}% ratio")

# Batch
batch_convert(
    src_dir="/data/parquet/",
    output_dir="/data/pfc/",
    pfc_binary="/usr/local/bin/pfc_jsonl",
    recursive=True,
)

What to Do After Migration

# Inspect BIDX (block index)
pfc_jsonl info archive.pfc

# Query a time range — no full decompression
pfc_jsonl query archive.pfc --from "2024-01-01T00:00:00Z" --to "2024-01-02T00:00:00Z"

# Full decompression to JSONL
pfc_jsonl decompress archive.pfc output.jsonl
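The time-range query above can also be driven from Python. This is a hedged sketch: it uses only the `query`, `--from`, and `--to` arguments shown above, and it assumes that matching rows are written to stdout as JSONL, which is not confirmed here. The helper names `build_query_cmd` and `query_rows` are hypothetical.

```python
import json
import subprocess

def build_query_cmd(archive, start, end, binary="pfc_jsonl"):
    """Build the argument list for a time-range query."""
    return [binary, "query", archive, "--from", start, "--to", end]

def query_rows(archive, start, end, binary="pfc_jsonl"):
    """Yield parsed rows; assumes the query prints one JSON object per line."""
    proc = subprocess.run(
        build_query_cmd(archive, start, end, binary),
        capture_output=True, text=True, check=True,
    )
    for line in proc.stdout.splitlines():
        if line.strip():
            yield json.loads(line)

print(build_query_cmd("archive.pfc", "2024-01-01T00:00:00Z", "2024-01-02T00:00:00Z"))
```

Passing the arguments as a list avoids shell quoting problems with timestamps and file names.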

Related Tools

Tool	Purpose
pfc-jsonl	Core PFC compressor / CLI
pfc-migrate	Migrate gzip/zstd/bz2/lz4 JSONL → PFC
pfc-duckdb	Query PFC archives directly from DuckDB
pfc-export-cratedb	Export CrateDB tables → PFC
pfc-export-questdb	Export QuestDB tables → PFC


Disclaimer

pfc-migrate-parquet is an independent open-source project and is not affiliated with, endorsed by, or associated with the Apache Software Foundation or the Apache Parquet project. Apache and Apache Parquet are trademarks of the Apache Software Foundation.

License

pfc-migrate-parquet (this repository) is released under the MIT License — see LICENSE.

The PFC-JSONL binary (pfc_jsonl) is proprietary software — free for personal and open-source use. Commercial use requires a license: info@impossibleforge.com