June 11, 2026

MaskOps

High-speed PII masking for Polars — powered by Rust. GDPR-compliant asterisk masking and FF3-1 format-preserving encryption for EU and Latin American PII.

MaskOps extends Polars with zero-overhead PII detection and masking expressions. No NLP models. No intermediate files. Just regex + Rust running directly on Arrow buffers.

Documentation

Full docs, regulatory coverage, and pricing plans: fcarvajalbrown.github.io/MaskOps

MaskOps is also listed in awesome-polars under Security / Privacy.

Install

pip install maskops

v1.0.0+ API stability guarantee — no breaking changes to mask_pii, contains_pii, or mask_pii_fpe signatures without a major version bump.

Usage

import polars as pl
import maskops

df = pl.read_csv("payments.csv")

# Mask all PII in a column (with_columns returns a new DataFrame)
masked = df.with_columns(maskops.mask_pii("notes"))

# Keep only the rows whose free_text column contains PII
flagged = df.filter(maskops.contains_pii("free_text"))

Supported patterns

| Pattern | Example input | Masked output |
|---|---|---|
| IBAN | `DE89370400440532013000` | `DE89******************` |
| EU VAT | `DE123456789` | `DE*********` |
| Email | `john.doe@example.com` | `********@example.com` |
| Phone | `+14155552671` | `+1**********` |
| IP Address | `192.168.1.100` | `192.168.*.*` |
| RUT (Chile) | `76.354.771-K` | `**********-K` |
| CPF (Brazil) | `529.982.247-25` | `*********-25` |
| CURP (Mexico) | `BADD110313HCMLNS09` | `******************` |
| DNI (Spain) | `12345678Z` | `********Z` |
| NIE (Spain) | `X1234567L` | `********L` |
| NIN (UK) | `AB 12 34 56 C` | `*********** C` |
| Personalausweis (Germany) | `T220001293` | `**********` |
| Credit Card (Visa/MC/Amex/Discover/Maestro) | `4111111111111111` | `411111******1111` |
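
The masked shapes in the table can be reproduced in a few lines of plain Python. The sketch below covers only three rows (email, IPv4, credit card). The function names are hypothetical, the email rule assumes one asterisk per masked character (which matches the example row), and none of this is MaskOps's actual Rust code.

```python
import re


def mask_email(value: str) -> str:
    """Mask the local part, keep '@' and the domain."""
    local, sep, domain = value.partition("@")
    if not sep:
        return value  # not an email-shaped string, leave untouched
    return "*" * len(local) + sep + domain


def mask_ipv4(value: str) -> str:
    """Keep the first two octets, replace the last two with '*'."""
    octets = value.split(".")
    if len(octets) != 4:
        return value
    return ".".join(octets[:2] + ["*", "*"])


def mask_card(value: str) -> str:
    """Keep the first 6 and last 4 digits, mask everything in between."""
    digits = re.sub(r"\D", "", value)
    if not 13 <= len(digits) <= 19:
        return value
    return digits[:6] + "*" * (len(digits) - 10) + digits[-4:]


print(mask_email("john.doe@example.com"))
print(mask_ipv4("192.168.1.100"))
print(mask_card("4111111111111111"))
```

Each call prints the corresponding "Masked output" cell from the table.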

Tested against 8 EU locales: DE, FR, ES, IT, NL, PL, PT, SE. Email and phone follow RFC 5322 and E.164 respectively. RUT and CPF include Módulo 11 check digit validation. DNI and NIE include modulo 23 check letter validation. Credit cards include Luhn validation, so format-only matches are rejected. Personalausweis uses a weighted-sum check digit (weights 7, 3, 1 applied cyclically, mod 10). NIN validation rejects the HMRC-excluded prefixes BG, GB, KN, NK, NT, TN, and ZZ.
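
To make the European ID rules concrete, here is a minimal Python sketch of the modulo 23 check letter, the 7-3-1 weighted check digit, and the NIN prefix filter. The function names are hypothetical and the format checks are simplified; the real validation lives in the Rust modules and may differ in detail.

```python
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"  # check letter table, indexed by number mod 23
NIN_EXCLUDED_PREFIXES = {"BG", "GB", "KN", "NK", "NT", "TN", "ZZ"}


def dni_letter_ok(dni: str) -> bool:
    """Spanish DNI: 8 digits followed by the letter at position (number mod 23)."""
    if len(dni) != 9 or not dni.isascii() or not dni[:8].isdigit():
        return False
    return DNI_LETTERS[int(dni[:8]) % 23] == dni[8].upper()


def nie_letter_ok(nie: str) -> bool:
    """Spanish NIE: leading X/Y/Z maps to 0/1/2, then the DNI rule applies."""
    nie = nie.upper()
    if len(nie) != 9 or nie[0] not in "XYZ":
        return False
    return dni_letter_ok(str("XYZ".index(nie[0])) + nie[1:])


def personalausweis_ok(number: str) -> bool:
    """German ID: 9 alphanumerics plus a check digit (weights 7, 3, 1, mod 10)."""
    number = number.upper()
    if len(number) != 10 or not number.isascii() or not number.isalnum():
        return False
    if not number[9].isdigit():
        return False
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    total = sum(int(ch, 36) * (7, 3, 1)[i % 3] for i, ch in enumerate(number[:9]))
    return total % 10 == int(number[9])


def nin_prefix_ok(nin: str) -> bool:
    """UK NIN: reject the excluded two-letter prefixes."""
    return nin.replace(" ", "").upper()[:2] not in NIN_EXCLUDED_PREFIXES
```

The example values from the table (`12345678Z`, `X1234567L`, `T220001293`, `AB 12 34 56 C`) all pass these checks.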

How It Works

flowchart TD
    IN["Python · Polars DataFrame"]

    subgraph API ["  Polars Expression API  "]
        direction LR
        MP["mask_pii"]
        CP["contains_pii"]
        FP["mask_pii_fpe"]
    end

    RUST["Rust Core — Arrow buffers · zero-copy"]

    subgraph MODES ["  Masking modes  "]
        direction LR
        AST["Asterisk\nirreversible"]
        ENC["FF3-1 FPE\npseudonymization"]
    end

    OUT["Masked Polars Series"]

    IN --> API --> RUST --> MODES --> OUT

    style IN   fill:#0d1117,stroke:#3DDB81,color:#e6edf3
    style RUST fill:#0d1117,stroke:#CE422B,color:#e6edf3
    style OUT  fill:#0d1117,stroke:#3DDB81,color:#e6edf3
    style AST  fill:#0d1117,stroke:#4e8adb,color:#e6edf3
    style ENC  fill:#0d1117,stroke:#9f6fcf,color:#e6edf3
    style MP   fill:#0d1117,stroke:#3c4450,color:#8b949e
    style CP   fill:#0d1117,stroke:#3c4450,color:#8b949e
    style FP   fill:#0d1117,stroke:#3c4450,color:#8b949e

No Python objects created per row. No NLP model loaded. No intermediate files.

  • Presidio is heavy — it spins up NLP models for structured CSV data that doesn't need them.
  • Pure Python regex on large DataFrames is slow.
  • MaskOps compiles to a native .so that Polars calls directly — same speed as built-in expressions.

Architecture

maskops/
├── Cargo.toml               # Rust dependencies
├── pyproject.toml           # maturin build backend + PyPI metadata
├── src/
│   ├── lib.rs               # Polars expression registration (mask_pii, contains_pii, mask_pii_fpe)
│   └── patterns/
│       ├── mod.rs           # mask_all(), mask_all_fpe(), contains_any_pii() aggregators
│       ├── eu/
│       │   ├── iban.rs      # IBAN regex + masking
│       │   ├── vat.rs       # EU VAT regex + masking
│       │   └── european_id.rs # DNI/NIE (Spain), NIN (UK), Personalausweis (Germany)
│       ├── latam/
│       │   └── latam_id.rs  # RUT (Chile), CPF (Brazil), CURP (Mexico) + FPE
│       ├── contact/
│       │   ├── email.rs     # Email regex + masking (local part)
│       │   ├── phone.rs     # E.164 phone regex + masking + FPE
│       │   └── ip.rs        # IPv4/IPv6 regex + masking
│       ├── financial/
│       │   └── credit_card.rs # Visa, Mastercard, Amex, Discover, Maestro + Luhn + FPE
│       ├── fpe.rs           # FF3-1 AES-256 format-preserving encryption (NIST SP 800-38G Rev.1)
│       └── country_codes.rs # Country prefix lookup table
├── maskops/
│   └── __init__.py          # Python API (mask_pii, contains_pii, mask_pii_fpe)
├── benchmarks/
│   └── benchmark.py         # Per-family throughput benchmarks (1M rows)
└── tests/
    ├── test_masking.py      # pytest suite (246 tests)
    ├── generate_fixtures.py # Faker-based test data generator (5 fixture files)
    └── fixtures/            # Generated CSVs (gitignored)

The Rust layer operates directly on Arrow buffers — zero Python object overhead per row. Each PII type is its own module: adding a new pattern = new file + one line in mod.rs.

When to use MaskOps

| Situation | Recommended tool |
|---|---|
| Structured data with schema-defined PII columns (CSV, Parquet, database exports) | MaskOps |
| Unstructured free text that needs NER for names, places, organisations | Presidio |
| Both structured columns + free-text fields in the same pipeline | MaskOps + Presidio |
| Reversible pseudonymization required (GDPR Art. 4(5)) | MaskOps (mask_pii_fpe) |
| Air-gapped or offline environment | MaskOps (no network calls, ever) |

contains_pii is useful as a pre-filter: scan cheaply first, then mask only flagged rows.

Benchmarks

Tested on 1,000,000 rows, Intel i-series CPU, Python 3.14, Windows.

Median of 3 runs per benchmark. Each family is compared like-for-like: maskops runs only that family's patterns (mask_pii(..., patterns=[...])), matching the exact coverage of the Python re baseline. The three data profiles are clean (no PII), dense (every row has PII), and mixed (50/50 — the realistic production case).
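
The median-of-3 measurement can be sketched with the standard library alone. This is an illustration of the method described above, not the contents of benchmarks/benchmark.py; `median_runtime` and `rows_per_second` are hypothetical helper names.

```python
import statistics
import time


def median_runtime(fn, runs: int = 3) -> float:
    """Call fn `runs` times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def rows_per_second(n_rows: int, seconds: float) -> float:
    """Convert a runtime into throughput."""
    return n_rows / seconds


# Stand-in workload; a real run would time the masking expression instead.
elapsed = median_runtime(lambda: sum(range(100_000)))
print(f"{elapsed:.4f}s median")
```

Taking the median rather than the mean keeps a single slow run (for example a cold cache) from skewing the reported time.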

Why the clean profile is so fast: every supported pattern requires at least one digit or an @, so maskops short-circuits any row without those bytes before running a single regex. Real-world text is mostly PII-free, so this dominates throughput.
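
The idea behind the short-circuit can be shown in a few lines of Python. This is only a sketch of the technique under the rule stated above (every pattern needs a digit or an `@`); the names are hypothetical and the real check runs in Rust on Arrow buffers.

```python
# Bytes that any supported pattern must contain at least one of.
TRIGGER_BYTES = frozenset(b"0123456789@")


def may_contain_pii(row: bytes) -> bool:
    """Cheap pre-check: True only if the row has a digit or an '@'."""
    return any(byte in TRIGGER_BYTES for byte in row)


def scan_rows(rows, full_scan):
    """Call the expensive `full_scan` only for rows that pass the pre-check."""
    return [may_contain_pii(row) and full_scan(row) for row in rows]
```

A row such as `b"thanks for your help"` never reaches `full_scan`, which is why PII-free text is processed so much faster than dense data.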

EU patterns (IBAN, VAT, Email, Phone)

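| Profile | Expression | Time | Rows/s | Python re | Speedup |
|---|---|---|---|---|---|
| clean | mask_pii | 0.102s | 9,791,624 | 3.224s | 31.6× |
| clean | contains_pii | 0.028s | 35,608,478 | | |
| dense | mask_pii | 1.430s | 699,175 | 1.832s | 1.3× |
| dense | contains_pii | 0.143s | 7,002,522 | | |
| mixed | mask_pii | 1.159s | 862,821 | 2.085s | 1.8× |
| mixed | contains_pii | 0.119s | 8,419,309 | | |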

LatAm patterns (RUT, CPF, CURP)

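| Profile | Expression | Time | Rows/s | Python re | Speedup |
|---|---|---|---|---|---|
| clean | mask_pii | 0.083s | 11,995,024 | 2.364s | 28.4× |
| clean | contains_pii | 0.022s | 44,911,928 | | |
| dense | mask_pii | 1.301s | 768,811 | 2.220s | 1.7× |
| dense | contains_pii | 0.351s | 2,851,885 | | |
| mixed | mask_pii | 1.147s | 871,941 | 2.392s | 2.1× |
| mixed | contains_pii | 0.329s | 3,039,982 | | |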

RUT and CPF include Módulo 11 check digit validation per row — this is the cost of zero false positives.
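
For readers unfamiliar with Módulo 11, the following Python sketch shows the standard check-digit arithmetic for RUT and CPF. The function names are hypothetical and the input cleaning is simplified; MaskOps's own implementation is in Rust and may handle edge cases differently.

```python
def rut_check_char(body: str) -> str:
    """Chilean RUT: weights 2..7 cycling from the rightmost digit, mod 11."""
    total = sum(int(d) * (2 + i % 6) for i, d in enumerate(reversed(body)))
    result = 11 - total % 11
    return {11: "0", 10: "K"}.get(result, str(result))


def rut_ok(rut: str) -> bool:
    """Strip dots and hyphen, then compare the computed check character."""
    cleaned = rut.replace(".", "").replace("-", "").upper()
    body, check = cleaned[:-1], cleaned[-1:]
    return body.isascii() and body.isdigit() and rut_check_char(body) == check


def cpf_ok(cpf: str) -> bool:
    """Brazilian CPF: two check digits, each from a descending-weight sum mod 11."""
    digits = [int(c) for c in cpf if c in "0123456789"]
    if len(digits) != 11:
        return False
    for n in (9, 10):
        # Weights run from n + 1 down to 2 over the first n digits.
        total = sum(d * w for d, w in zip(digits[:n], range(n + 1, 1, -1)))
        if digits[n] != (total * 10) % 11 % 10:
            return False
    return True
```

Both example values from the supported patterns table, `76.354.771-K` and `529.982.247-25`, pass; changing either check character makes them fail.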

Network patterns (IP)

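| Profile | Expression | Time | Rows/s | Python re | Speedup |
|---|---|---|---|---|---|
| clean | mask_pii | 0.101s | 9,895,621 | 2.777s | 27.5× |
| clean | contains_pii | 0.029s | 34,285,322 | | |
| dense | mask_pii | 0.891s | 1,122,653 | 1.902s | 2.1× |
| dense | contains_pii | 0.292s | 3,424,025 | | |
| mixed | mask_pii | 0.699s | 1,430,777 | 2.211s | 3.2× |
| mixed | contains_pii | 0.228s | 4,384,771 | | |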

Credit card patterns (Visa, Mastercard, Amex, Discover, Maestro)

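| Profile | Expression | Time | Rows/s | Python re | Speedup |
|---|---|---|---|---|---|
| clean | mask_pii | 0.107s | 9,331,431 | 1.233s | 11.5× |
| clean | contains_pii | 0.028s | 35,165,948 | | |
| dense | mask_pii | 1.061s | 942,647 | 1.337s | 1.3× |
| dense | contains_pii | 0.345s | 2,902,612 | | |
| mixed | mask_pii | 0.819s | 1,220,878 | 1.371s | 1.7× |
| mixed | contains_pii | 0.290s | 3,447,494 | | |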

Luhn validation runs per candidate match — this eliminates false positives.
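
A minimal Python version of the Luhn check is shown below for reference. It illustrates the algorithm only; `luhn_ok` is a hypothetical name and the Rust implementation is not reproduced here.

```python
def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0
```

A 16-digit string that merely looks like a card number, such as `4111111111111112`, fails this check and would not be masked, while the table's example `4111111111111111` passes.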

European ID patterns (DNI/NIE, NIN, Personalausweis)

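| Profile | Expression | Time | Rows/s | Python re | Speedup |
|---|---|---|---|---|---|
| clean | mask_pii | 0.100s | 9,996,062 | 1.748s | 17.5× |
| clean | contains_pii | 0.026s | 37,979,203 | | |
| dense | mask_pii | 1.541s | 649,021 | 1.392s | 0.9× |
| dense | contains_pii | 0.370s | 2,703,177 | | |
| mixed | mask_pii | 1.248s | 800,992 | 1.472s | 1.2× |
| mixed | contains_pii | 0.308s | 3,249,503 | | |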

The four EU-ID formats run as four separate regex passes; on 100%-dense data this is the one profile where a single combined Python regex edges ahead. Clean and mixed (realistic) still favour maskops.
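
The difference between the two strategies can be illustrated in Python. The patterns below are simplified, hypothetical stand-ins (the real ones live in european_id.rs), and for brevity the sketch masks the whole match rather than preserving the check letter.

```python
import re

# Simplified, hypothetical patterns for the four EU-ID formats.
PATTERNS = {
    "dni": r"\b\d{8}[A-Z]\b",
    "nie": r"\b[XYZ]\d{7}[A-Z]\b",
    "nin": r"\b[A-Z]{2} ?\d{2} ?\d{2} ?\d{2} ?[A-D]\b",
    "personalausweis": r"\b[A-Z][A-Z0-9]{8}\d\b",
}


def _stars(match: re.Match) -> str:
    """Replace a match with asterisks of the same length."""
    return "*" * len(match.group(0))


def mask_separate(text: str) -> str:
    """One pass over the text per pattern (four passes in total)."""
    for pattern in PATTERNS.values():
        text = re.sub(pattern, _stars, text)
    return text


# A single alternation lets the regex engine scan the text once.
COMBINED = re.compile("|".join(f"(?:{p})" for p in PATTERNS.values()))


def mask_combined(text: str) -> str:
    """One pass over the text with all four patterns combined."""
    return COMBINED.sub(_stars, text)


sample = "DNI 12345678Z, NIE X1234567L, NIN AB 12 34 56 C, ID T220001293"
print(mask_separate(sample) == mask_combined(sample))
```

Both functions produce the same output on this sample. The separate-pass version walks the text four times, which is the cost that shows up on fully dense data.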

US patterns (SSN, Passport)

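| Profile | Expression | Time | Rows/s | Python re | Speedup |
|---|---|---|---|---|---|
| clean | mask_pii | 0.099s | 10,066,631 | 1.680s | 16.9× |
| clean | contains_pii | 0.028s | 35,939,291 | | |
| dense | mask_pii | 0.972s | 1,028,341 | 1.570s | 1.6× |
| dense | contains_pii | 0.447s | 2,238,417 | | |
| mixed | mask_pii | 0.739s | 1,352,974 | 1.628s | 2.2× |
| mixed | contains_pii | 0.343s | 2,912,967 | | |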

All 15 benchmarked families active

The realistic production workload — all 15 families the Python baseline implements, running together. maskops supports many more families; they are excluded here only to keep coverage equal on both sides. contains_pii reaches ~1M rows/s on dense data — use it to pre-filter before masking in hot pipelines.

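| Profile | Expression | maskops | Python re | Speedup |
|---|---|---|---|---|
| clean | mask_pii | 0.114s | 18.681s | 163.4× |
| clean | contains_pii | 0.028s | | |
| dense | mask_pii | 4.693s | 12.659s | 2.7× |
| dense | contains_pii | 0.959s | | |
| mixed | mask_pii | 5.239s | 10.620s | 2.0× |
| mixed | contains_pii | 1.037s | | |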

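On clean data, maskops runtime stays roughly flat as the pattern count grows (0.114s with all 15 families versus about 0.1s for a single family), while the Python regex baseline slows down with each additional pattern. That is why the all-families gap on clean data (163×) dwarfs any single family. On dense and mixed data maskops also slows as families are added, and the speedup stays in the 2–3× range.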

vs Microsoft Presidio (measured)

Benchmarked on 10,000 rows of mixed real-world text (email, phone, IBAN, credit cards, IP), Python 3.11, Ubuntu, en_core_web_lg model. Extrapolated to 1M rows.

| Tool | Profile | Time (10K rows) | Rows/s | Speedup |
|---|---|---|---|---|
| maskops | clean | 0.021s | 479,441 | |
| Presidio (en_core_web_lg) | clean | 101.131s | 99 | 4,849× slower |
| maskops | dense | 0.028s | 351,645 | |
| Presidio (en_core_web_lg) | dense | 115.599s | 87 | 4,065× slower |
| maskops | mixed | 0.028s | 358,118 | |
| Presidio (en_core_web_lg) | mixed | 118.125s | 85 | 4,230× slower |

At Presidio's measured throughput of ~85–99 rows/s, processing 1M rows would take 2.8–3.3 hours. Extrapolating from maskops's measured 10K-row throughput, the same 1M rows would take under 3 seconds.
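
The extrapolation is plain arithmetic on the measured rows/s figures from the table, as this short Python check shows. The helper name is hypothetical, and the calculation assumes throughput stays constant from 10K to 1M rows.

```python
def hours_for(rows: int, rows_per_second: float) -> float:
    """Time in hours to process `rows` at a constant throughput."""
    return rows / rows_per_second / 3600


# Presidio: fastest (99 rows/s) and slowest (85 rows/s) measured profiles.
print(round(hours_for(1_000_000, 99), 1))  # 2.8
print(round(hours_for(1_000_000, 85), 1))  # 3.3

# maskops: slowest measured profile (dense, 351,645 rows/s), in seconds.
print(round(1_000_000 / 351_645, 1))  # 2.8
```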

Entity coverage

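| Pattern | maskops | Presidio |
|---|---|---|
| IBAN | | |
| EU VAT | | |
| Email | | |
| Phone (E.164) | | |
| IP Address | | |
| Credit Card | | |
| RUT (Chile) | | |
| CPF (Brazil) | | |
| CURP (Mexico) | | |
| DNI/NIE (Spain) | | |
| NIN (UK) | | |
| Personalausweis (Germany) | | |
| Person names (NER) | | |
| Locations (NER) | | |
| Organisations (NER) | | |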

Presidio's strength is unstructured text with named entities (names, locations, organisations); use it when NER is required. maskops is purpose-built for structured data pipelines where schema-defined PII fields don't need NLP and Presidio's NLP overhead is unnecessary. For mixed workloads, both tools can be combined: maskops for bulk structured columns, Presidio for free-text fields.

Build from source

Windows (PowerShell)

python -m venv .venv
.venv\Scripts\activate
pip install maturin faker polars pytest
maturin develop --release
python tests/generate_fixtures.py
pytest tests/ -v

Linux / macOS

python -m venv .venv
source .venv/bin/activate
pip install maturin faker polars pytest
maturin develop --release
python tests/generate_fixtures.py
pytest tests/ -v

Key dependency versions

| Package | Version |
|---|---|
| pyo3 | 0.25 |
| pyo3-polars | 0.23 |
| polars | 0.46 |
| maturin | >=1.7,<2.0 |

Note: pyo3 must be 0.25 to match pyo3-polars 0.23. Do not bump pyo3 independently.

Roadmap

  • Email, phone patterns
  • IP address patterns
  • Latin American IDs (RUT, CPF, CURP)
  • European IDs (DNI/NIE Spain, NIN UK, Personalausweis Germany)
  • Credit cards (Visa, Mastercard, Amex, Discover, Maestro) with Luhn validation
  • PyPI publish via GitHub Actions
  • Check digit validation for Personalausweis (Germany) and NIN (UK)
  • Format-Preserving Encryption (FPE/FF3-1) for reversible masking
  • Benchmark vs Presidio
  • Parquet streaming support
  • extract_pii expression — returns a 31-field Struct column with the first match per PII family, enabling routing, reporting, and selective masking without re-scanning
  • mask_pii_audit expression — masks and reports per-family match counts in a single pass, returning a nested Struct (masked value + counts) for compliance auditing
  • Brazilian CNPJ (legal-entity) — two-check-digit validated detection and masking, in asterisk, FPE, and consistent modes
  • masking_manifest / write_manifest — per-column PII inventory with match counts, built-in family→regulation mapping, and mask mode, exported as a JSON RAT / data-processing register (Ley 21.719 Art. 30 evidence)
  • FF1 mode (NIST SP 800-38G) alongside FF3-1 — mask_pii_fpe(..., mode="ff1"), reversible and length-preserving
  • FPE key management — derive_key / derive_tweak (HKDF/HMAC, offline) and validate_key / validate_tweak weak-key guards
  • rekey_pii_fpe — FPE key rotation on a token column without exposing plaintext
  • MEA identifiers — South African ID (Luhn + DOB + citizenship, POPIA) and Israeli ID / Teudat Zehut (weighted checksum, PPL)
  • Unified patterns= selection across extract_pii and mask_pii_audit (v2.0 enterprise release) + migration guide
  • Performance sweep: byte pre-check short-circuit (skips all regex on PII-free rows), like-for-like benchmark methodology, and a full benchmark refresh. Speedups vs the Python baseline are now 11–32× per family and 163× with all families active (clean), and 1.2–3.2× per family (mixed)

License

Mozilla Public License 2.0. Commercial use requires a separate license — see CLA.md or contact fcarvajalbrown@gmail.com.