Provero

May 30, 2026


provero (Esperanto): to test, to put to proof.

A vendor-neutral, declarative data quality engine.


Quick Start

pip install provero
provero init

Edit provero.yaml:

source:
  type: duckdb
  table: orders

checks:
  - not_null: [order_id, customer_id, amount]
  - unique: order_id
  - accepted_values:
      column: status
      values: [pending, shipped, delivered, cancelled]
  - range:
      column: amount
      min: 0
      max: 100000
  - row_count:
      min: 1

Run:

provero run
┌─────────────────┬──────────────┬──────────┬──────────────────┬──────────────────┐
│ Check           │ Column       │ Status   │ Observed         │ Expected         │
├─────────────────┼──────────────┼──────────┼──────────────────┼──────────────────┤
│ not_null        │ order_id     │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ not_null        │ customer_id  │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ not_null        │ amount       │ ✓ PASS   │ 0 nulls          │ 0 nulls          │
│ unique          │ order_id     │ ✓ PASS   │ 0 duplicates     │ 0 duplicates     │
│ accepted_values │ status       │ ✓ PASS   │ 0 invalid values │ only [pending..] │
│ range           │ amount       │ ✓ PASS   │ min=45, max=999  │ min=0, max=100k  │
│ row_count       │ -            │ ✓ PASS   │ 5                │ >= 1             │
└─────────────────┴──────────────┴──────────┴──────────────────┴──────────────────┘

Score: 100/100 | 7 passed, 0 failed | 22ms

Features

  • 20 check types: not_null, unique, unique_combination, completeness, accepted_values, range, regex, email_validation, type, freshness, latency, row_count, row_count_change, anomaly, custom_sql, referential_integrity, distribution, cardinality, drift, cross_table_count
  • 3 stable connectors: DuckDB (files + in-memory), PostgreSQL, Pandas/Polars DataFrame; beta connectors for Snowflake, BigQuery, MySQL, and Redshift
  • SQL batch optimizer: compiles N checks into 1 query
  • Data contracts: schema validation, SLA enforcement, contract diff
  • Anomaly detection: Z-Score, MAD, IQR (stdlib only, no scipy needed)
  • HTML reports: provero run --report html
  • Webhook alerts: notify Slack, PagerDuty, or any HTTP endpoint on failure
  • Result store: SQLite with time-series metrics and provero history
  • Data profiling: provero profile --suggest auto-generates checks
  • Configurable severity: info, warning, critical, blocker per check
  • JSON Schema validation for provero.yaml
  • Statistical checks: distribution (mean/stddev bounds), cardinality (distinct count/ratio), drift (PSI vs a discrete baseline), cross_table_count (row-count parity/ratio between two tables)
  • Connection pooling and retry: per-source pool sizing, connect timeouts, and bounded retry-with-backoff for SQLAlchemy connectors
  • Observability: structured JSON audit log, OpenTelemetry spans, and Prometheus metrics on provero run, with secret redaction in audit output
  • Server mode: provero serve exposes a FastAPI REST API (health, suites, runs, /metrics), a stdlib interval scheduler, and X-API-Key authentication
  • CI output formats: provero run --format sarif and --format junit for code-scanning and test-report integrations
  • Contract versioning: version-aware provero contract diff flags breaking changes and missing version bumps with severity policies
  • Airflow provider: ProveroCheckOperator + @provero_check decorator
  • SodaCL migration: provero import soda converts configs in one command
  • dbt interop: provero export dbt generates schema.yml test definitions
  • Continuous monitoring: provero watch re-runs checks on an interval
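To make the "compiles N checks into 1 query" idea from the feature list concrete, here is a minimal sketch of how several checks can share one table scan. It is not Provero's actual optimizer: the function name compile_batch and the metric aliases are hypothetical, and it uses the standard-library sqlite3 module instead of DuckDB so that it runs without extra dependencies.

```python
import sqlite3

def compile_batch(table, not_null_cols, unique_cols):
    """Build one SELECT that computes every metric in a single table scan.

    Hypothetical helper: identifiers are interpolated directly, so it is only
    suitable for trusted configuration values.
    """
    parts = ["COUNT(*) AS row_count"]
    for col in not_null_cols:
        # Number of rows where the column is NULL
        parts.append(f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS nulls_{col}")
    for col in unique_cols:
        # COUNT(col) skips NULLs, so the difference counts surplus duplicates
        parts.append(f"COUNT({col}) - COUNT(DISTINCT {col}) AS dupes_{col}")
    return f"SELECT {', '.join(parts)} FROM {table}"

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 45.0), (2, 11, 999.0), (2, None, 120.0)],
)

# Three logical checks (row_count, not_null x2, unique) become one query
sql = compile_batch("orders", ["order_id", "customer_id"], ["order_id"])
metrics = dict(conn.execute(sql).fetchone())
print(metrics)
```

Each check then reads its own metric from the shared result row instead of issuing a separate query.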

Check Types

| Check | Description | Example |
| --- | --- | --- |
| not_null | Column has no null values | not_null: order_id |
| unique | Column has no duplicate values | unique: order_id |
| unique_combination | Composite uniqueness across columns | unique_combination: [date, store_id] |
| completeness | Minimum percentage of non-null values | completeness: { column: email, min: 95% } |
| accepted_values | Column values are within the allowed set | accepted_values: { column: status, values: [a, b] } |
| range | Numeric values within min/max bounds | range: { column: amount, min: 0, max: 100000 } |
| regex | Values match a regular expression | regex: { column: email, pattern: ".+@.+" } |
| email_validation | Values are valid email addresses | email_validation: { column: email } |
| type | Column data type matches expected | type: { column: amount, expected: numeric } |
| freshness | Most recent timestamp within threshold | freshness: { column: updated_at, max_age: 24h } |
| latency | Time between two timestamp columns | latency: { source_column: created_at, target_column: processed_at, max_latency: 1h } |
| row_count | Table row count within bounds | row_count: { min: 1, max: 1000000 } |
| row_count_change | Row count change vs previous run | row_count_change: { max_decrease: 10% } |
| anomaly | Statistical anomaly detection | anomaly: { column: amount, method: zscore } |
| custom_sql | Custom SQL query returns truthy value | custom_sql: "SELECT COUNT(*) > 0 FROM orders" |
| referential_integrity | FK values exist in reference table | referential_integrity: { column: customer_id, reference_table: customers, reference_column: id } |
| distribution | Column mean/stddev within bounds | distribution: { column: amount, mean: 100, mean_tolerance: 5, stddev_max: 50 } |
| cardinality | Distinct count or ratio within bounds | cardinality: { column: country_code, min: 2, max: 250 } |
| drift | PSI of a column vs a discrete baseline | drift: { column: segment, baseline: { A: 0.5, B: 0.3, C: 0.2 }, threshold: 0.25 } |
| cross_table_count | Row-count parity/ratio between two tables | cross_table_count: { other_table: staging.orders, tolerance: 0 } |

Configuration

A provero.yaml file defines your data source, checks, alerts, and contracts:

# Source configuration
source:
  type: duckdb                    # duckdb, postgres, dataframe
  table: orders                   # table name or file expression
  # connection: postgres://...    # connection string for databases

# Quality checks
checks:
  - not_null: [order_id, customer_id]
  - unique: order_id
  - range:
      column: amount
      min: 0
      max: 100000
  - freshness:
      column: updated_at
      max_age: 24h
  - anomaly:
      column: amount
      method: zscore               # zscore, mad, iqr
      threshold: 3.0
      window: 30                   # lookback window in days
  - referential_integrity:
      column: customer_id
      reference_table: customers
      reference_column: id

# Severity levels: info, warning, critical, blocker
# Blocker checks cause a non-zero exit code

# Alert notifications
alerts:
  - type: webhook
    url: https://hooks.slack.com/services/YOUR/WEBHOOK
    trigger: on_failure            # on_failure, on_success, always

# Data contracts (optional)
contracts:
  - name: orders_contract
    owner: data-team
    table: orders
    schema:
      columns:
        - name: order_id
          type: integer
          checks: [not_null, unique]
    sla:
      freshness: 24h

Anomaly Detection

Provero includes built-in statistical anomaly detection that works without external dependencies (no scipy needed).

Supported methods:

| Method | Description | Best for |
| --- | --- | --- |
| zscore | Standard Z-Score | Normally distributed metrics |
| mad | Median Absolute Deviation | Robust to outliers |
| iqr | Interquartile Range | Skewed distributions |

checks:
  - anomaly:
      column: daily_revenue
      method: mad
      threshold: 3.5
      window: 30

Anomaly detection uses the result store to compare current values against historical data. Run provero run regularly to build up the baseline.
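The three methods can be sketched with the standard library alone. The block below is an illustration of the textbook definitions, not Provero's implementation: the function names, the 0.6745 MAD scaling constant, and the IQR multiplier k=1.5 are conventional choices assumed here, and history stands in for values that would come from the result store.

```python
import statistics

def zscore_anomaly(history, value, threshold=3.0):
    """Flag value if it lies more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

def mad_anomaly(history, value, threshold=3.5):
    """Flag value using the median absolute deviation (robust to outliers)."""
    median = statistics.median(history)
    mad = statistics.median([abs(x - median) for x in history])
    if mad == 0:
        return value != median
    # 0.6745 rescales MAD so the score is comparable to a z-score for normal data
    return 0.6745 * abs(value - median) / mad > threshold

def iqr_anomaly(history, value, k=1.5):
    """Flag value if it falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(history, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

history = [100, 102, 98, 101, 99, 103, 97, 100]  # e.g. past daily_revenue values
for detect in (zscore_anomaly, mad_anomaly, iqr_anomaly):
    print(detect.__name__, detect(history, 150), detect(history, 101))
```

With only a handful of historical points all three estimates are noisy, which is why a baseline built from regular runs matters.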

CLI Commands

| Command | Description |
| --- | --- |
| provero init | Create a new provero.yaml template |
| provero run | Execute quality checks |
| provero validate | Validate config syntax without running |
| provero profile | Profile a data source |
| provero history | Show historical check results |
| provero contract validate | Validate data contracts against live data |
| provero contract diff | Compare two contract versions |
| provero watch | Continuously run checks on an interval |
| provero import soda | Convert SodaCL config to Provero format |
| provero export dbt | Generate dbt schema.yml from checks |
| provero serve | Run the REST API + scheduler server |
| provero version | Show version |

provero run accepts --format table|json|csv|sarif|junit. The sarif and junit formats emit a single whole-run document for CI code-scanning and test-report integrations.

Alerts

Send webhook notifications when checks fail:

source:
  type: duckdb
  table: orders

checks:
  - not_null: order_id
  - row_count:
      min: 1

alerts:
  - type: webhook
    url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
    trigger: on_failure
  - type: webhook
    url: ${PAGERDUTY_WEBHOOK}
    headers:
      Authorization: "Bearer ${PD_TOKEN}"

Triggers: on_failure (default), on_success, always.
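The alert configuration above relies on two small mechanisms: ${VAR} placeholders and a trigger rule. The sketch below shows one way each could work. Both helper names are hypothetical and this is not Provero's code; it assumes placeholders are resolved from an environment mapping and that a missing variable should be an error.

```python
from string import Template

def expand(value, env):
    """Replace ${NAME} placeholders with values from env.

    Raises KeyError when a referenced variable is missing. Note that
    string.Template treats a bare "$" as special, so a literal dollar
    sign would have to be written as "$$".
    """
    return Template(value).substitute(env)

def should_fire(trigger, all_passed):
    """Decide whether an alert with the given trigger fires for this run."""
    if trigger == "always":
        return True
    if trigger == "on_success":
        return all_passed
    if trigger == "on_failure":
        return not all_passed
    raise ValueError(f"unknown trigger: {trigger}")

# Illustrative values only; in practice env would be os.environ
env = {"PD_TOKEN": "abc123", "PAGERDUTY_WEBHOOK": "https://example.invalid/hook"}
print(expand("Bearer ${PD_TOKEN}", env))
print(should_fire("on_failure", all_passed=False))
```

Keeping secrets in environment variables means the committed provero.yaml never contains the token itself.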

Observability

provero run can emit governance and telemetry signals through optional observer flags. All three are off by default. The audit log needs only the standard library; the OpenTelemetry and Prometheus observers degrade gracefully when their optional dependency is absent.

# Append a structured JSON audit record of the run (pure stdlib, always available)
provero run --audit-log audit.jsonl

# Emit OpenTelemetry spans for the suite and each check
provero run --otel

# Write a Prometheus text exposition of run metrics to a file
provero run --metrics-file metrics.prom

OpenTelemetry and Prometheus require the observability extra:

pip install 'provero[observability]'

Exposed Prometheus metrics: provero_checks_total (by status), provero_check_duration_seconds, and provero_suite_score. The audit log records run id, suite name, a config hash, and per-check outcomes; connection strings and secrets are redacted before they are written.
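As an illustration of what redacting a connection string can look like, here is a minimal sketch that masks the password of a URL-style connection string before it is logged. The function name redact_connection and the "***" mask are assumptions made for this example; Provero's own redaction rules may differ and may cover more cases.

```python
from urllib.parse import urlsplit, urlunsplit

def redact_connection(url):
    """Mask the password in scheme://user:password@host/... style URLs."""
    parts = urlsplit(url)
    # Split "user:password@host:port" into credentials and host part
    userinfo, _, hostport = parts.netloc.rpartition("@")
    user, sep, _password = userinfo.partition(":")
    if not sep:
        return url  # no password present, nothing to redact
    netloc = f"{user}:***@{hostport}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(redact_connection("postgres://app:s3cret@db.internal:5432/sales"))
print(redact_connection("postgres://app@db.internal/sales"))
```

Redacting at write time, rather than when the log is read, keeps the secret out of the file entirely.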

Server Mode

provero serve starts a FastAPI application that exposes the engine over HTTP and can run suites on a schedule. It requires the server extra (FastAPI + Uvicorn):

pip install 'provero[server]'

provero serve                                   # 127.0.0.1:8000, auth disabled
provero serve -c production.yaml --host 0.0.0.0 --port 9000
provero serve --api-key secret1 --api-key secret2

Endpoints:

| Method & path | Auth | Description |
| --- | --- | --- |
| GET /health | no | Liveness probe |
| GET /ready | no | Readiness probe |
| GET /suites | yes | List configured suites |
| POST /suites/{name}/run | yes | Run a suite on demand |
| GET /runs | yes | List historical runs |
| GET /runs/{run_id} | yes | Run detail |
| GET /metrics | no | Prometheus exposition |

Authentication is via the X-API-Key header. Allowed keys come from --api-key (repeatable) or the PROVERO_API_KEYS environment variable; if neither is set, auth is disabled. The bundled scheduler (SuiteScheduler) runs a suite on a fixed interval using the standard library only (no extra dependency) and persists every result to the store.
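A client call against the authenticated endpoints might look like the sketch below. It only builds the request, using the path and the X-API-Key header documented above. The helper name build_run_request is hypothetical, the suite name and key are placeholders, and the response format is not assumed, so sending and parsing are left as a comment.

```python
import urllib.parse
import urllib.request

def build_run_request(base_url, suite, api_key=None):
    """Build a POST /suites/{name}/run request, adding X-API-Key when given."""
    name = urllib.parse.quote(suite, safe="")
    url = f"{base_url.rstrip('/')}/suites/{name}/run"
    headers = {"X-API-Key": api_key} if api_key else {}
    return urllib.request.Request(url, method="POST", headers=headers)

req = build_run_request("http://127.0.0.1:8000", "orders_daily", api_key="secret1")
print(req.get_method(), req.full_url)

# To actually send it against a running `provero serve` instance:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

When auth is disabled, the same request can be built without api_key and no header is attached.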

Statistical Checks

Four statistical checks extend the engine for distributional and cross-table validation:

checks:
  # Mean within tolerance and an upper bound on stddev (population statistics)
  - distribution:
      column: amount
      mean: 100.0
      mean_tolerance: 5.0
      stddev_max: 50.0

  # Distinct-value count and/or ratio bounds (ratio = distinct / non_null)
  - cardinality:
      column: country_code
      min: 2
      max: 250
      min_ratio: 0.0

  # Population Stability Index against a discrete baseline distribution
  - drift:
      column: segment
      baseline: { A: 0.5, B: 0.3, C: 0.2 }
      threshold: 0.25         # PSI above this fails
      warn_threshold: 0.1     # PSI above this warns

  # Row-count parity (or ratio) between two tables on the same source
  - cross_table_count:
      other_table: staging.orders
      tolerance: 0

drift is advisory by default (its default severity is warning): a PSI above threshold fails, a PSI above warn_threshold warns, and anything lower passes. cross_table_count also supports a ratio mode with min_ratio/max_ratio bounds.
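For readers unfamiliar with the Population Stability Index, the following sketch computes PSI for a categorical column against a discrete baseline and applies the two thresholds described above. It uses the standard formula, the sum over categories of (actual - expected) * ln(actual / expected). The function names and the epsilon floor for empty categories are assumptions of this example, not Provero's implementation.

```python
import math
from collections import Counter

def psi(values, baseline, epsilon=1e-6):
    """Population Stability Index of observed values vs baseline proportions."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    score = 0.0
    for category in set(baseline) | set(counts):
        # Floor both proportions so unseen categories do not produce log(0)
        expected = max(baseline.get(category, 0.0), epsilon)
        actual = max(counts.get(category, 0) / total, epsilon)
        score += (actual - expected) * math.log(actual / expected)
    return score

def classify(score, threshold=0.25, warn_threshold=0.1):
    """Map a PSI score to fail / warn / pass using the two thresholds."""
    if score > threshold:
        return "fail"
    if score > warn_threshold:
        return "warn"
    return "pass"

baseline = {"A": 0.5, "B": 0.3, "C": 0.2}
stable = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
shifted = ["A"] * 1 + ["B"] * 1 + ["C"] * 8
print(classify(psi(stable, baseline)), classify(psi(shifted, baseline)))
```

A sample that matches the baseline exactly scores 0, and the score grows as mass moves between categories.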

Data Contracts

Define and enforce schema contracts:

contracts:
  - name: orders_contract
    owner: data-team
    table: orders
    on_violation: warn
    schema:
      columns:
        - name: order_id
          type: integer
          checks: [not_null, unique]
        - name: status
          type: varchar
    sla:
      freshness: 24h
      completeness: "95%"

Contracts carry a version field (default 1.0). provero contract diff is version-aware: it classifies each change as breaking or non-breaking, and warns when a breaking change ships without a major version bump.

provero contract validate
provero contract diff old.yaml new.yaml
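The version-aware diff described above can be pictured with a small sketch. The classification rules here are assumptions chosen for illustration (removing a column or changing its type is breaking, adding a column is not), and the function names are hypothetical; Provero's actual policy may classify changes differently.

```python
def column_types(contract):
    """Map column name -> declared type for one contract dict."""
    return {col["name"]: col["type"] for col in contract["schema"]["columns"]}

def diff_contracts(old, new):
    """List schema changes, each tagged as breaking or non-breaking."""
    before, after = column_types(old), column_types(new)
    changes = []
    for name in sorted(before.keys() - after.keys()):
        changes.append({"change": f"removed column {name}", "breaking": True})
    for name in sorted(after.keys() - before.keys()):
        changes.append({"change": f"added column {name}", "breaking": False})
    for name in sorted(before.keys() & after.keys()):
        if before[name] != after[name]:
            detail = f"{name}: {before[name]} -> {after[name]}"
            changes.append({"change": detail, "breaking": True})
    return changes

def major(contract):
    """Major component of the version field (defaults to 1.0 as in the text)."""
    return int(str(contract.get("version", "1.0")).split(".")[0])

def missing_major_bump(old, new):
    """True when a breaking change ships without a higher major version."""
    breaking = any(c["breaking"] for c in diff_contracts(old, new))
    return breaking and major(new) <= major(old)

old = {"version": "1.0", "schema": {"columns": [
    {"name": "order_id", "type": "integer"},
    {"name": "status", "type": "varchar"},
]}}
new = {"version": "1.1", "schema": {"columns": [
    {"name": "order_id", "type": "varchar"},
    {"name": "status", "type": "varchar"},
    {"name": "amount", "type": "decimal"},
]}}
print(diff_contracts(old, new))
print(missing_major_bump(old, new))
```

Here the type change on order_id is breaking while the version only moved from 1.0 to 1.1, so the sketch reports a missing major bump.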

Connectors

| Connector | Status | Install |
| --- | --- | --- |
| DuckDB | Stable | included |
| PostgreSQL | Stable | pip install provero[postgres] |
| DataFrame | Stable | pip install provero[dataframe] |
| Snowflake | Beta | pip install provero[snowflake] |
| BigQuery | Beta | pip install provero[bigquery] |
| MySQL | Beta | pip install provero[mysql] |
| Redshift | Beta | pip install provero[redshift] |

DuckDB supports file expressions: read_csv('data.csv'), read_parquet('*.parquet').

Pooling and retry

SQLAlchemy-backed connectors (PostgreSQL and the beta connectors) accept optional connection-pool and retry tuning per source. Every key is optional; when a key is omitted, the connector keeps its default behavior.

source:
  type: postgres
  table: orders
  connection: ${POSTGRES_URL}
  # Connection pool (forwarded to SQLAlchemy create_engine)
  pool_size: 5
  max_overflow: 10
  pool_pre_ping: true
  pool_recycle: 1800
  pool_timeout: 30
  connect_timeout: 10
  # Bounded retry-with-backoff on transient connection errors
  retry_attempts: 3
  retry_base_delay: 0.1
  retry_max_delay: 5.0
  retry_jitter: true

Only transient failures (dropped connections, backend restarts, deadlocks) are retried. Programming errors such as a missing table or bad SQL fail immediately. Backoff is exponential with full jitter.
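Exponential backoff with full jitter can be sketched as follows. This is a generic illustration of the technique rather than Provero's retry code: TransientError, backoff_delays, and call_with_retry are hypothetical names, and the sketch assumes retry_attempts counts total attempts (the first call included), which the text does not specify.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for dropped connections, backend restarts, deadlocks."""

def backoff_delays(attempts, base_delay=0.1, max_delay=5.0, jitter=True, rng=random.random):
    """Yield the sleep before each retry: attempts - 1 delays in total."""
    for retry in range(attempts - 1):
        # Exponential growth, capped at max_delay
        cap = min(max_delay, base_delay * 2 ** retry)
        # Full jitter: pick uniformly from [0, cap) instead of sleeping the full cap
        yield rng() * cap if jitter else cap

def call_with_retry(func, attempts=3, base_delay=0.1, max_delay=5.0,
                    jitter=True, sleep=time.sleep):
    """Call func, retrying only on TransientError; other errors propagate at once."""
    delays = backoff_delays(attempts, base_delay, max_delay, jitter)
    while True:
        try:
            return func()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted, surface the last error
            sleep(delay)

print(list(backoff_delays(4, jitter=False)))
```

Full jitter spreads retries from many clients across the whole delay window, which avoids synchronized reconnect storms after a backend restart.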

API

Python API

from provero.core.engine import Engine

engine = Engine("provero.yaml")
results = engine.run()

for result in results:
    print(f"{result.check_name}: {result.status}")

Programmatic Configuration

from provero.core.engine import Engine

engine = Engine.from_dict({
    "source": {"type": "duckdb", "table": "orders"},
    "checks": [
        {"not_null": "order_id"},
        {"row_count": {"min": 1}},
    ],
})
results = engine.run()

Airflow Integration

pip install provero-airflow

from provero.airflow.operators import ProveroCheckOperator

check_orders = ProveroCheckOperator(
    task_id="check_orders",
    config_path="dags/provero.yaml",
    suite="orders_daily",
)

Documentation

Full documentation is available on GitHub Pages.

License

Apache License 2.0. See LICENSE.