pfc-archiver-elasticsearch

May 21, 2026

A standalone daemon that runs alongside your Elasticsearch cluster, watches for indices older than a configurable retention window, compresses them to PFC format, and writes them to local storage or S3 — automatically.

No schema changes. No plugins. No Elasticsearch modifications.

Supports self-hosted Elasticsearch and Elastic Cloud (Cloud ID).


How it works

Every interval_seconds (default: 3600), pfc-archiver-elasticsearch runs one archive cycle:

SCAN  →  EXPORT  →  COMPRESS  →  UPLOAD  →  VERIFY  →  (optional DELETE)  →  LOG
  1. SCAN — list all indices matching index_pattern, detect their date from the index name, select those older than retention_days
  2. EXPORT — stream all documents via search_after + Point-in-Time API → temp JSONL
  3. COMPRESS — pipe through pfc_jsonl compress → .pfc + .pfc.bidx
  4. UPLOAD — write archive to output_dir (local path or s3://bucket/prefix/)
  5. VERIFY — decompress and count rows; must match exported count exactly
  6. DELETE (optional) — delete the source index from Elasticsearch (only if delete_after_archive = true)
  7. LOG — write a JSON run log to log_dir
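
The EXPORT step can be sketched roughly as follows. This is a minimal illustration, not the archiver's actual code: the function names `iter_index_docs` and `export_to_jsonl` are hypothetical, and the calls assume the keyword-argument style of elasticsearch-py 8.x (`open_point_in_time`, `search`, `close_point_in_time`); the 7.x client passes some of these in a request body instead.

```python
import json
from typing import Any, Iterator


def iter_index_docs(es: Any, index: str, page_size: int = 1000,
                    keep_alive: str = "2m") -> Iterator[dict]:
    """Yield the _source of every document in `index` via PIT + search_after."""
    pit_id = es.open_point_in_time(index=index, keep_alive=keep_alive)["id"]
    search_after = None
    try:
        while True:
            kwargs = {
                "size": page_size,
                "pit": {"id": pit_id, "keep_alive": keep_alive},
                # _shard_doc is a tiebreaker sort that is valid inside a PIT
                "sort": [{"_shard_doc": "asc"}],
            }
            if search_after is not None:
                kwargs["search_after"] = search_after
            resp = es.search(**kwargs)
            hits = resp["hits"]["hits"]
            if not hits:
                break
            for hit in hits:
                yield hit["_source"]
            # Resume after the last hit; the server may return a refreshed PIT id
            search_after = hits[-1]["sort"]
            pit_id = resp.get("pit_id", pit_id)
    finally:
        # Always release the PIT, even if the export is interrupted
        es.close_point_in_time(id=pit_id)


def export_to_jsonl(es: Any, index: str, path: str) -> int:
    """Write one JSON document per line and return the number of rows written."""
    rows = 0
    with open(path, "w", encoding="utf-8") as fh:
        for doc in iter_index_docs(es, index):
            fh.write(json.dumps(doc, ensure_ascii=False) + "\n")
            rows += 1
    return rows
```

The returned row count is the number that a later VERIFY step would compare against the decompressed archive.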

Date detection

The archiver detects each index's date from its name — no data scan needed. All common Elasticsearch naming conventions are supported:

| Index name | Detected date |
|---|---|
| logs-2024.01.15 | 2024-01-15 |
| filebeat-8.19.13-2024-01-15-000001 | 2024-01-15 |
| events-2024-01-15 | 2024-01-15 |
| metrics-20240115 | 2024-01-15 |
| logs-2024.01 | 2024-01-01 (first of month) |
| events-2024-06 | 2024-06-01 (first of month) |

Indices with no detectable date are skipped automatically. Elasticsearch internal indices (.kibana, .fleet, etc.) and closed indices are always skipped.
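
One way to implement this kind of name-based detection is with two regular expressions, one for full dates and one for year-month names. The sketch below is an illustration only and is not taken from the archiver's source: `detect_index_date` and `select_expired` are hypothetical names, treating a leading dot as "internal index" is an assumption, and so is the choice that an index dated exactly on the cutoff is not yet expired.

```python
import re
from datetime import date, timedelta
from typing import Iterable, List, Optional

# Full date: YYYY.MM.DD, YYYY-MM-DD or YYYYMMDD (same separator used twice)
_DAY = re.compile(
    r"(?<!\d)(?P<y>\d{4})(?P<sep>[.-]?)(?P<m>\d{2})(?P=sep)(?P<d>\d{2})(?!\d)"
)
# Year and month only: YYYY.MM or YYYY-MM
_MONTH = re.compile(r"(?<!\d)(?P<y>\d{4})[.-](?P<m>\d{2})(?!\d)")


def detect_index_date(name: str) -> Optional[date]:
    """Return the date encoded in an index name, or None if there is none."""
    if name.startswith("."):
        return None  # assumption: dot-prefixed names are internal indices
    for pattern in (_DAY, _MONTH):  # prefer a full date over a month
        for match in pattern.finditer(name):
            parts = match.groupdict()
            try:
                # Month-only names resolve to the first of the month
                return date(int(parts["y"]), int(parts["m"]),
                            int(parts.get("d") or 1))
            except ValueError:
                continue  # e.g. month 13; try the next candidate
    return None


def select_expired(names: Iterable[str], retention_days: int,
                   today: date) -> List[str]:
    """Keep the names whose detected date is strictly before the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in names:
        found = detect_index_date(name)
        if found is not None and found < cutoff:
            expired.append(name)
    return expired
```

The lookarounds `(?<!\d)` and `(?!\d)` keep version numbers and rollover suffixes such as `8.19.13` or `000001` from being read as dates.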


Install

pip install pfc-archiver-elasticsearch

Or from source:

git clone https://github.com/ImpossibleForge/pfc-archiver-elasticsearch
cd pfc-archiver-elasticsearch
pip install -r requirements.txt

The pfc_jsonl binary must be installed:

# Linux x64:
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-linux-x64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

# macOS Apple Silicon (M1–M4):
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-arm64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

# macOS Intel (x64):
curl -L https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-macos-x64 \
     -o /usr/local/bin/pfc_jsonl && chmod +x /usr/local/bin/pfc_jsonl

# Windows (x64) — PowerShell:
Invoke-WebRequest -Uri https://github.com/ImpossibleForge/pfc-jsonl/releases/latest/download/pfc_jsonl-windows-x64.exe `
    -OutFile "$env:LOCALAPPDATA\Microsoft\WindowsApps\pfc_jsonl.exe"

Requires Elasticsearch 7.12+ and elasticsearch-py 7.12–8.x.


Configuration

Copy config/elasticsearch.toml to your working directory and adjust to your setup:

[elasticsearch]
url     = "http://localhost:9200"
api_key = "your-api-key-here"

# Elastic Cloud alternative:
# cloud_id = "my-deployment:dXMtZWFzdDQ..."
# api_key  = "your-api-key-here"

[archive]
index_pattern        = "logs-*"
retention_days       = 90
output_dir           = "s3://my-bucket/elasticsearch-cold/"
verify               = true
delete_after_archive = false   # Opt in explicitly when ready

[daemon]
interval_seconds = 3600

See config/elasticsearch.toml for the full reference with all options documented.
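
For scripts that need to read the same file, a minimal loader might look like the sketch below. It is an assumption-laden illustration, not the archiver's own parser: `load_config` is a hypothetical name, the set of keys treated as required is a guess, and only the two defaults stated in this README (`delete_after_archive = false`, `interval_seconds = 3600`) are filled in.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # older interpreters: pip install tomli
    import tomli as tomllib

# Assumption: these [archive] keys have no usable default
REQUIRED_ARCHIVE_KEYS = ("index_pattern", "retention_days", "output_dir")


def load_config(text: str) -> dict:
    """Parse TOML text, check the basics, and apply the documented defaults."""
    cfg = tomllib.loads(text)
    es = cfg.setdefault("elasticsearch", {})
    archive = cfg.setdefault("archive", {})
    daemon = cfg.setdefault("daemon", {})

    if not (es.get("url") or es.get("cloud_id")):
        raise ValueError("[elasticsearch] needs either url or cloud_id")
    missing = [key for key in REQUIRED_ARCHIVE_KEYS if key not in archive]
    if missing:
        raise ValueError("[archive] is missing: " + ", ".join(missing))
    if not isinstance(archive["retention_days"], int) or archive["retention_days"] < 0:
        raise ValueError("retention_days must be a non-negative integer")

    # Defaults documented in this README; deletion stays opt-in
    archive.setdefault("delete_after_archive", False)
    daemon.setdefault("interval_seconds", 3600)
    return cfg
```

Typical use would be `load_config(open("elasticsearch.toml", encoding="utf-8").read())`.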


Usage

# Start the daemon
python pfc_archiver_elasticsearch.py --config config/elasticsearch.toml

# Dry run — scan and report, no data moved or deleted
python pfc_archiver_elasticsearch.py --config config/elasticsearch.toml --dry-run

# Single cycle then exit (for cron jobs)
python pfc_archiver_elasticsearch.py --config config/elasticsearch.toml --once

Example output

2026-05-21T14:00:00  INFO    pfc-archiver-elasticsearch v0.1.0 starting
2026-05-21T14:00:00  INFO    ES: http://localhost:9200  pattern: logs-*  retention: 90d
2026-05-21T14:00:00  INFO    Connected to Elasticsearch 8.17.0
2026-05-21T14:00:01  INFO    Found 3 index(es) to archive (cutoff: 2026-02-20)

2026-05-21T14:00:01  INFO    ── Index: logs-2024.01.15  (date: 2024-01-15, docs: 1,234,567) ──
2026-05-21T14:00:01  INFO      Exporting 'logs-2024.01.15' ...
2026-05-21T14:00:28  INFO      Exported 1,234,567 docs  (210.3 MiB JSONL) — compressing ...
2026-05-21T14:00:31  INFO      ✓ 1,234,567 docs  |  JSONL 210.3 MiB → PFC 19.1 MiB  (9.1%)  →  logs-2024.01.15.pfc
2026-05-21T14:00:31  INFO      Uploading s3://my-bucket/elasticsearch-cold/logs-2024.01.15.pfc ...
2026-05-21T14:00:33  INFO      ✓ S3 upload complete
2026-05-21T14:00:33  INFO      Verifying logs-2024.01.15.pfc (expected 1,234,567 rows) ...
2026-05-21T14:00:35  INFO      ✓ Verified: 1,234,567 rows match

2026-05-21T14:00:35  INFO    Cycle complete.

Authentication

| Method | Config keys |
|---|---|
| API key (recommended) | api_key = "KEY" |
| Basic auth | user = "elastic" + password = "changeme" |
| Elastic Cloud | cloud_id = "dep:dXMt..." + api_key = "KEY" |
| Custom TLS | ca_certs = "/path/to/ca.crt" |
| Dev/test | no_verify_certs = true |
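
As a rough illustration of how these keys could map onto client options, the sketch below translates an `[elasticsearch]` config table into keyword arguments for the elasticsearch-py 8.x `Elasticsearch` constructor. `build_client_kwargs` is a hypothetical helper, the precedence (Cloud ID over URL, API key over basic auth) is an assumption, and the 7.x client uses `http_auth` where 8.x uses `basic_auth`.

```python
from typing import Any, Dict


def build_client_kwargs(es_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Translate the [elasticsearch] config table into client keyword arguments."""
    kwargs: Dict[str, Any] = {}

    # Endpoint: Elastic Cloud deployments use cloud_id, self-hosted ones a URL
    if es_cfg.get("cloud_id"):
        kwargs["cloud_id"] = es_cfg["cloud_id"]
    else:
        kwargs["hosts"] = [es_cfg["url"]]

    # Credentials: assumption that an API key wins when both are configured
    if es_cfg.get("api_key"):
        kwargs["api_key"] = es_cfg["api_key"]
    elif es_cfg.get("user"):
        kwargs["basic_auth"] = (es_cfg["user"], es_cfg.get("password", ""))

    # TLS: custom CA bundle, or disabled verification for dev/test only
    if es_cfg.get("ca_certs"):
        kwargs["ca_certs"] = es_cfg["ca_certs"]
    if es_cfg.get("no_verify_certs"):
        kwargs["verify_certs"] = False
    return kwargs


# Usage (requires the elasticsearch package):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(**build_client_kwargs(cfg["elasticsearch"]))
```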

Deleting archived indices

delete_after_archive = false by default — pfc-archiver-elasticsearch never modifies your cluster without explicit opt-in.

After confirming your archives are accessible (via DuckDB, pfc-gateway, or pfc_jsonl query), set delete_after_archive = true and restart. Only indices that pass the row-count verify step will be deleted.

How deletion works: the archiver calls DELETE /<index_name> through the Elasticsearch API, which removes the index from the cluster entirely. Make sure your archives are safely stored and verified before enabling this.
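
The safety rule described above can be expressed as a small guard. This is a sketch under assumptions rather than the archiver's real logic: `delete_if_verified` is a hypothetical name, `verified_rows` is assumed to be `None` when no verification ran, and `es.indices.delete(index=...)` is the elasticsearch-py call that issues the DELETE request.

```python
from typing import Any, Optional


def delete_if_verified(es: Any, index: str, *, delete_after_archive: bool,
                       exported_rows: int,
                       verified_rows: Optional[int]) -> bool:
    """Delete `index` only when deletion is enabled and the row counts match."""
    if not delete_after_archive:
        return False  # default: never touch the cluster
    if verified_rows is None or verified_rows != exported_rows:
        return False  # unverified or mismatching archive: keep the source
    es.indices.delete(index=index)
    return True
```

The return value corresponds to what a run log could record in its `deleted` field.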


Query the archives

# Time-range query (no Elasticsearch needed)
pfc_jsonl query logs-2024.01.15.pfc --from "2024-01-15T00:00:00" --to "2024-01-16T00:00:00"

# Via DuckDB
duckdb -c "
  INSTALL pfc FROM community; LOAD pfc;
  SELECT level, count(*) FROM pfc_read('logs-2024.01.15.pfc')
  WHERE \"@timestamp\" >= '2024-01-15 08:00:00'
  GROUP BY level;
"

Run as a systemd service

[Unit]
Description=pfc-archiver-elasticsearch — PFC archive daemon for Elasticsearch
After=network.target

[Service]
Type=simple
User=pfc
WorkingDirectory=/opt/pfc-archiver-elasticsearch
ExecStart=/usr/bin/python3 /opt/pfc-archiver-elasticsearch/pfc_archiver_elasticsearch.py \
          --config /etc/pfc-archiver-elasticsearch/elasticsearch.toml
Restart=on-failure
RestartSec=60

[Install]
WantedBy=multi-user.target

Enable and start the service, then follow its logs:

sudo systemctl enable pfc-archiver-elasticsearch
sudo systemctl start  pfc-archiver-elasticsearch
sudo journalctl -u pfc-archiver-elasticsearch -f

Run log

Each archived index produces one JSON entry in archive_logs/archive_runs.jsonl:

{
  "ts":         "2026-05-21T14:00:35+00:00",
  "status":     "ok",
  "index":      "logs-2024.01.15",
  "index_date": "2024-01-15T00:00:00+00:00",
  "rows":       1234567,
  "jsonl_mb":   210.3,
  "output_mb":  19.1,
  "ratio_pct":  9.1,
  "deleted":    false
}
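
Because the run log is plain JSONL, it is easy to aggregate. The sketch below totals the fields shown in the entry above; `summarize_runs` is a hypothetical helper, and treating every `status` other than `"ok"` as a failure is an assumption, since only the `"ok"` value is documented here.

```python
import json
from typing import Any, Dict, Iterable


def summarize_runs(lines: Iterable[str]) -> Dict[str, Any]:
    """Aggregate run-log entries into totals and an overall compression ratio."""
    summary: Dict[str, Any] = {
        "indices": 0, "rows": 0, "jsonl_mb": 0.0, "output_mb": 0.0,
        "deleted": 0, "failed": [],
    }
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        if entry.get("status") != "ok":
            summary["failed"].append(entry.get("index"))
            continue
        summary["indices"] += 1
        summary["rows"] += entry.get("rows", 0)
        summary["jsonl_mb"] += entry.get("jsonl_mb", 0.0)
        summary["output_mb"] += entry.get("output_mb", 0.0)
        summary["deleted"] += 1 if entry.get("deleted") else 0
    # Overall ratio in percent, computed from the summed sizes
    summary["ratio_pct"] = (
        round(100 * summary["output_mb"] / summary["jsonl_mb"], 1)
        if summary["jsonl_mb"] else None
    )
    return summary
```

A typical call would pass an open file handle, for example `summarize_runs(open("archive_logs/archive_runs.jsonl", encoding="utf-8"))`.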

Running tests

# Unit tests (no Elasticsearch needed)
pip install pytest "elasticsearch>=7.12.0,<9.0"
python -m pytest tests/test_archiver.py -v

# Integration tests (requires Docker)
docker run -d --name es-test \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -p 9200:9200 \
  docker.elastic.co/elasticsearch/elasticsearch:8.17.0

# Wait ~30s, then:
python -m pytest tests/test_integration_elasticsearch.py -v

Part of the PFC Ecosystem

→ View all PFC tools & integrations

| Direct integration | Why |
|---|---|
| pfc-export-elasticsearch | Same DB, one-shot mode — export a specific index or time range on demand |
| pfc-archiver-timescaledb | Same concept for TimescaleDB |
| pfc-archiver-influxdb | Same concept for InfluxDB |
| pfc-gateway | Query exported archives via HTTP REST |
| pfc-duckdb | Query .pfc files directly from DuckDB |

Disclaimer

pfc-archiver-elasticsearch is an independent open-source project and is not affiliated with, endorsed by, or associated with Elasticsearch B.V. or the Elastic project. Elasticsearch and Elastic Cloud are trademarks of Elasticsearch B.V.


License

pfc-archiver-elasticsearch (this repository) is released under the MIT License — see LICENSE.

The PFC-JSONL binary (pfc_jsonl) is proprietary software — free for personal and open-source use. Commercial use requires a license: info@impossibleforge.com