README.md

June 11, 2026

High-speed PII masking for Polars — powered by Rust. GDPR-compliant asterisk masking and FF3-1 format-preserving encryption for EU and Latin American PII.

MaskOps extends Polars with zero-overhead PII detection and masking expressions. No NLP models. No intermediate files. Just regex + Rust running directly on Arrow buffers.

Documentation
Install
Usage
Supported patterns
How It Works
Architecture
When to use MaskOps
Benchmarks
Build from source
Key dependency versions
Roadmap
License

Documentation

Full docs, regulatory coverage, and pricing plans: fcarvajalbrown.github.io/MaskOps

MaskOps is also listed in awesome-polars under Security / Privacy.

Install

pip install maskops

v1.0.0+ API stability guarantee — no breaking changes to mask_pii, contains_pii, or mask_pii_fpe signatures without a major version bump.

Usage

import polars as pl
import maskops

df = pl.read_csv("payments.csv")

# Mask all PII in a column (returns a new DataFrame; df itself is unchanged)
masked = df.with_columns(maskops.mask_pii("notes"))

# Keep only the rows whose free_text column contains PII
flagged = df.filter(maskops.contains_pii("free_text"))

Supported patterns

Pattern	Example input	Masked output
IBAN	`DE89370400440532013000`	`DE89******************`
EU VAT	`DE123456789`	`DE*********`
Email	`john.doe@example.com`	`********@example.com`
Phone	`+14155552671`	`+1**********`
IP Address	`192.168.1.100`	`192.168.*.*`
RUT (Chile)	`76.354.771-K`	`**********-K`
CPF (Brazil)	`529.982.247-25`	`*********-25`
CURP (Mexico)	`BADD110313HCMLNS09`	`******************`
DNI (Spain)	`12345678Z`	`********Z`
NIE (Spain)	`X1234567L`	`********L`
NIN (UK)	`AB 12 34 56 C`	`*********** C`
Personalausweis (Germany)	`T220001293`	`**********`
Credit Card (Visa/MC/Amex/Discover/Maestro)	`4111111111111111`	`411111******1111`
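
The masked outputs in the table follow a "keep a few characters, star the rest" rule. The sketch below reproduces three of the rows in plain Python to make that rule concrete. The helper names (`mask_iban`, `mask_email`, `mask_card`) are hypothetical and exist only for illustration; this is not the Rust implementation MaskOps ships.

```python
def mask_iban(iban: str) -> str:
    """Keep the country code and check digits (first 4 chars), star the rest."""
    return iban[:4] + "*" * (len(iban) - 4)


def mask_email(email: str) -> str:
    """Star the local part, keep the '@' and the domain."""
    local, _, domain = email.partition("@")
    return "*" * len(local) + "@" + domain


def mask_card(number: str) -> str:
    """Keep the first 6 and last 4 digits, star everything in between."""
    return number[:6] + "*" * (len(number) - 10) + number[-4:]


print(mask_email("john.doe@example.com"))  # ********@example.com
print(mask_card("4111111111111111"))       # 411111******1111
```

Each helper preserves the input length, which is what lets a masked column keep its original width.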

Tested against 8 EU locales: DE, FR, ES, IT, NL, PL, PT, SE. Email and phone follow RFC 5322 and E.164 respectively. RUT and CPF include Módulo 11 check digit validation. DNI and NIE include modulo 23 check letter validation. Credit cards include Luhn validation, so format-only matches are rejected. Personalausweis numbers are validated with a weighted-sum check digit (weights [7,3,1] cyclic, mod 10). NIN validation rejects the HMRC-excluded prefixes BG, GB, KN, NK, NT, TN, and ZZ.
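
To show what the European ID checks involve, here is a minimal Python sketch of the modulo 23 letter check, the 7-3-1 weighted sum, and the NIN prefix filter. The function names are hypothetical, and the letter-to-value mapping for Personalausweis (A=10 … Z=35) is the usual convention for this scheme, assumed here rather than taken from the MaskOps source.

```python
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"  # Spanish check-letter table, indexed by number % 23
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}  # a NIE's leading letter maps to a digit
NIN_EXCLUDED_PREFIXES = {"BG", "GB", "KN", "NK", "NT", "TN", "ZZ"}


def dni_nie_is_valid(value: str) -> bool:
    """Check the modulo 23 letter of a DNI (8 digits + letter) or NIE (X/Y/Z + 7 digits + letter)."""
    value = value.upper()
    if len(value) != 9:
        return False
    body, letter = value[:-1], value[-1]
    if body[0] in NIE_PREFIX:
        body = NIE_PREFIX[body[0]] + body[1:]
    if not (body.isascii() and body.isdigit()):
        return False
    return DNI_LETTERS[int(body) % 23] == letter


def personalausweis_is_valid(value: str) -> bool:
    """Check the 10th character against the 7-3-1 weighted sum of the first nine, mod 10."""
    value = value.upper()
    if len(value) != 10 or not (value.isascii() and value.isalnum()):
        return False
    if not value[-1].isdigit():
        return False
    weights = (7, 3, 1)
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35 (assumed letter values).
    total = sum(int(ch, 36) * weights[i % 3] for i, ch in enumerate(value[:9]))
    return total % 10 == int(value[-1])


def nin_prefix_is_allowed(nin: str) -> bool:
    """Reject the excluded two-letter prefixes listed above."""
    return nin.replace(" ", "")[:2].upper() not in NIN_EXCLUDED_PREFIXES


print(dni_nie_is_valid("12345678Z"))           # True
print(personalausweis_is_valid("T220001293"))  # True
print(nin_prefix_is_allowed("GB 12 34 56 C"))  # False
```

The three example values are the ones used in the Supported patterns table, so each passes its own check.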

How It Works

flowchart TD
    IN["Python · Polars DataFrame"]

    subgraph API ["  Polars Expression API  "]
        direction LR
        MP["mask_pii"]
        CP["contains_pii"]
        FP["mask_pii_fpe"]
    end

    RUST["Rust Core — Arrow buffers · zero-copy"]

    subgraph MODES ["  Masking modes  "]
        direction LR
        AST["Asterisk\nirreversible"]
        ENC["FF3-1 FPE\npseudonymization"]
    end

    OUT["Masked Polars Series"]

    IN --> API --> RUST --> MODES --> OUT

    style IN   fill:#0d1117,stroke:#3DDB81,color:#e6edf3
    style RUST fill:#0d1117,stroke:#CE422B,color:#e6edf3
    style OUT  fill:#0d1117,stroke:#3DDB81,color:#e6edf3
    style AST  fill:#0d1117,stroke:#4e8adb,color:#e6edf3
    style ENC  fill:#0d1117,stroke:#9f6fcf,color:#e6edf3
    style MP   fill:#0d1117,stroke:#3c4450,color:#8b949e
    style CP   fill:#0d1117,stroke:#3c4450,color:#8b949e
    style FP   fill:#0d1117,stroke:#3c4450,color:#8b949e

No Python objects created per row. No NLP model loaded. No intermediate files.

Presidio is heavy: it loads NLP models even for structured CSV data that doesn't need them.
Pure Python regex on large DataFrames is slow.
MaskOps compiles to a native extension module (.so on Linux and macOS, .pyd on Windows) that Polars calls directly, so its expressions run as compiled code like the built-in ones.

Architecture

maskops/
├── Cargo.toml               # Rust dependencies
├── pyproject.toml           # maturin build backend + PyPI metadata
├── src/
│   ├── lib.rs               # Polars expression registration (mask_pii, contains_pii, mask_pii_fpe)
│   └── patterns/
│       ├── mod.rs           # mask_all(), mask_all_fpe(), contains_any_pii() aggregators
│       ├── eu/
│       │   ├── iban.rs      # IBAN regex + masking
│       │   ├── vat.rs       # EU VAT regex + masking
│       │   └── european_id.rs # DNI/NIE (Spain), NIN (UK), Personalausweis (Germany)
│       ├── latam/
│       │   └── latam_id.rs  # RUT (Chile), CPF (Brazil), CURP (Mexico) + FPE
│       ├── contact/
│       │   ├── email.rs     # Email regex + masking (local part)
│       │   ├── phone.rs     # E.164 phone regex + masking + FPE
│       │   └── ip.rs        # IPv4/IPv6 regex + masking
│       ├── financial/
│       │   └── credit_card.rs # Visa, Mastercard, Amex, Discover, Maestro + Luhn + FPE
│       ├── fpe.rs           # FF3-1 AES-256 format-preserving encryption (NIST SP 800-38G Rev.1)
│       └── country_codes.rs # Country prefix lookup table
├── maskops/
│   └── __init__.py          # Python API (mask_pii, contains_pii, mask_pii_fpe)
├── benchmarks/
│   └── benchmark.py         # Per-family throughput benchmarks (1M rows)
└── tests/
    ├── test_masking.py      # pytest suite (246 tests)
    ├── generate_fixtures.py # Faker-based test data generator (5 fixture files)
    └── fixtures/            # Generated CSVs (gitignored)

The Rust layer operates directly on Arrow buffers, with zero Python object overhead per row. Each PII type is its own module, so adding a new pattern takes one new file plus one line in mod.rs.

When to use MaskOps

Situation	Recommended tool
Structured data with schema-defined PII columns (CSV, Parquet, database exports)	MaskOps
Unstructured free text — need NER for names, places, organisations	Presidio
Both structured columns + free-text fields in the same pipeline	MaskOps + Presidio
Reversible pseudonymization required (GDPR Art. 4(5))	MaskOps (`mask_pii_fpe`)
Air-gapped or offline environment	MaskOps — no network calls, ever

contains_pii is useful as a pre-filter: scan cheaply first, then mask only flagged rows.

Benchmarks

Tested on 1,000,000 rows, Intel i-series CPU, Python 3.14, Windows.

Median of 3 runs per benchmark. Each family is compared like-for-like: maskops runs only that family's patterns (mask_pii(..., patterns=[...])), matching the exact coverage of the Python re baseline. The three data profiles are clean (no PII), dense (every row has PII), and mixed (50/50 — the realistic production case).
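
For readers who want to reproduce the method rather than the numbers, the following is a rough sketch of a median-of-3 timing loop around a pure-Python `re` baseline. It is not `benchmarks/benchmark.py`: the helper `median_seconds`, the sample rows, and the simplified e-mail regex are all assumptions made for illustration.

```python
import re
import statistics
import time


def median_seconds(fn, runs: int = 3) -> float:
    """Call fn `runs` times and return the median wall-clock duration in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Hypothetical "mixed" profile: half of the rows carry an e-mail address.
rows = ["contact john.doe@example.com", "nothing to see here"] * 5_000
email_re = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")  # simplified


def baseline():
    """Python `re` baseline: one substitution per row."""
    return [email_re.sub("***", row) for row in rows]


elapsed = median_seconds(baseline)
rows_per_second = len(rows) / elapsed
print(f"{elapsed:.3f}s, {rows_per_second:,.0f} rows/s")
```

Taking the median instead of the mean keeps a single slow run (for example, a cold cache) from skewing the reported figure.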

Why the clean profile is so fast: every supported pattern requires at least one digit or an @, so maskops short-circuits any row without those bytes before running a single regex. Real-world text is mostly PII-free, so this dominates throughput.
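
The short-circuit idea can be sketched in a few lines of Python. This is only an illustration of the technique under the stated assumption (every pattern needs a digit or an `@`); the names `may_contain_pii` and `mask_emails` and the simplified e-mail regex are hypothetical, and the real check runs in Rust on Arrow buffers.

```python
import re

GATE_CHARS = frozenset("0123456789@")

# Simplified stand-in for one pattern: group 1 is the local part, group 2 the domain.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)(@[A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def may_contain_pii(text: str) -> bool:
    """Cheap gate: a row with no digit and no '@' cannot match any pattern."""
    return not GATE_CHARS.isdisjoint(text)


def mask_emails(text: str) -> str:
    """Skip the regex entirely for rows that fail the gate."""
    if not may_contain_pii(text):
        return text  # fast path: no regex is run
    return EMAIL_RE.sub(lambda m: "*" * len(m.group(1)) + m.group(2), text)


print(mask_emails("call me later"))                  # call me later
print(mask_emails("mail john.doe@example.com now"))  # mail ********@example.com now
```

The gate can produce false positives (a row with a digit but no PII still reaches the regex) but never false negatives, so it changes speed and not results.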

EU patterns (IBAN, VAT, Email, Phone)

Profile	Expression	Time	Rows/s	Python re	Speedup
clean	`mask_pii`	0.102s	9,791,624	3.224s	31.6×
clean	`contains_pii`	0.028s	35,608,478	—	—
dense	`mask_pii`	1.430s	699,175	1.832s	1.3×
dense	`contains_pii`	0.143s	7,002,522	—	—
mixed	`mask_pii`	1.159s	862,821	2.085s	1.8×
mixed	`contains_pii`	0.119s	8,419,309	—	—

LatAm patterns (RUT, CPF, CURP)

Profile	Expression	Time	Rows/s	Python re	Speedup
clean	`mask_pii`	0.083s	11,995,024	2.364s	28.4×
clean	`contains_pii`	0.022s	44,911,928	—	—
dense	`mask_pii`	1.301s	768,811	2.220s	1.7×
dense	`contains_pii`	0.351s	2,851,885	—	—
mixed	`mask_pii`	1.147s	871,941	2.392s	2.1×
mixed	`contains_pii`	0.329s	3,039,982	—	—

RUT and CPF include Módulo 11 check digit validation per row — this is the cost of zero false positives.
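
As an illustration of what that per-row cost consists of, here is a small Python sketch of the two Módulo 11 checks. The function names are hypothetical; rejecting CPFs whose digits are all identical is a common validator convention assumed here, not something the MaskOps source is known to do.

```python
def rut_check_char(body: int) -> str:
    """Chilean RUT: weight the digits right-to-left by 2..7 cyclically, then take 11 - (sum mod 11)."""
    total, weight = 0, 2
    while body:
        body, digit = divmod(body, 10)
        total += digit * weight
        weight = 2 if weight == 7 else weight + 1
    result = 11 - total % 11
    return {11: "0", 10: "K"}.get(result, str(result))


def rut_is_valid(rut: str) -> bool:
    """Accept forms like '76.354.771-K'; dots and the dash are ignored."""
    cleaned = rut.replace(".", "").replace("-", "").upper()
    body, check = cleaned[:-1], cleaned[-1:]
    if not (body.isascii() and body.isdigit()):
        return False
    return rut_check_char(int(body)) == check


def cpf_is_valid(cpf: str) -> bool:
    """Brazilian CPF: two check digits, each a mod 11 weighted sum of the digits before it."""
    digits = [int(ch) for ch in cpf if ch in "0123456789"]
    if len(digits) != 11 or len(set(digits)) == 1:  # all-equal digits: assumed rejection rule
        return False
    for n in (9, 10):
        # Weights run from n + 1 down to 2 over the first n digits.
        total = sum(d * w for d, w in zip(digits[:n], range(n + 1, 1, -1)))
        if (total * 10) % 11 % 10 != digits[n]:
            return False
    return True


print(rut_is_valid("76.354.771-K"))    # True
print(cpf_is_valid("529.982.247-25"))  # True
```

Both sample values come from the Supported patterns table.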

Network patterns (IP)

Profile	Expression	Time	Rows/s	Python re	Speedup
clean	`mask_pii`	0.101s	9,895,621	2.777s	27.5×
clean	`contains_pii`	0.029s	34,285,322	—	—
dense	`mask_pii`	0.891s	1,122,653	1.902s	2.1×
dense	`contains_pii`	0.292s	3,424,025	—	—
mixed	`mask_pii`	0.699s	1,430,777	2.211s	3.2×
mixed	`contains_pii`	0.228s	4,384,771	—	—

Credit card patterns (Visa, Mastercard, Amex, Discover, Maestro)

Profile	Expression	Time	Rows/s	Python re	Speedup
clean	`mask_pii`	0.107s	9,331,431	1.233s	11.5×
clean	`contains_pii`	0.028s	35,165,948	—	—
dense	`mask_pii`	1.061s	942,647	1.337s	1.3×
dense	`contains_pii`	0.345s	2,902,612	—	—
mixed	`mask_pii`	0.819s	1,220,878	1.371s	1.7×
mixed	`contains_pii`	0.290s	3,447,494	—	—

Luhn validation runs per candidate match — this eliminates false positives.
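
For reference, the Luhn check itself is short. The Python sketch below shows the standard algorithm; the function name `luhn_is_valid` is hypothetical and the code is illustrative, not the Rust routine in credit_card.rs.

```python
def luhn_is_valid(number: str) -> bool:
    """Standard Luhn check; characters other than ASCII digits (spaces, dashes) are skipped."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_is_valid("4111111111111111"))  # True
print(luhn_is_valid("4111111111111112"))  # False
```

A 16-digit string that merely looks like a card number fails this check about nine times out of ten, which is how format-only matches get filtered out.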

European ID patterns (DNI/NIE, NIN, Personalausweis)

Profile	Expression	Time	Rows/s	Python re	Speedup
clean	`mask_pii`	0.100s	9,996,062	1.748s	17.5×
clean	`contains_pii`	0.026s	37,979,203	—	—
dense	`mask_pii`	1.541s	649,021	1.392s	0.9×
dense	`contains_pii`	0.370s	2,703,177	—	—
mixed	`mask_pii`	1.248s	800,992	1.472s	1.2×
mixed	`contains_pii`	0.308s	3,249,503	—	—

The four EU-ID formats run as four separate regex passes; on 100%-dense data this is the one profile where a single combined Python regex edges ahead. Clean and mixed (realistic) still favour maskops.

US patterns (SSN, Passport)

Profile	Expression	Time	Rows/s	Python re	Speedup
clean	`mask_pii`	0.099s	10,066,631	1.680s	16.9×
clean	`contains_pii`	0.028s	35,939,291	—	—
dense	`mask_pii`	0.972s	1,028,341	1.570s	1.6×
dense	`contains_pii`	0.447s	2,238,417	—	—
mixed	`mask_pii`	0.739s	1,352,974	1.628s	2.2×
mixed	`contains_pii`	0.343s	2,912,967	—	—

All 15 benchmarked families active

The realistic production workload — all 15 families the Python baseline implements, running together. maskops supports many more families; they are excluded here only to keep coverage equal on both sides. contains_pii reaches ~1M rows/s on dense data — use it to pre-filter before masking in hot pipelines.

Profile	Expression	maskops	Python `re`	Speedup
clean	`mask_pii`	0.114s	18.681s	163.4×
clean	`contains_pii`	0.028s	—	—
dense	`mask_pii`	4.693s	12.659s	2.7×
dense	`contains_pii`	0.959s	—	—
mixed	`mask_pii`	5.239s	10.620s	2.0×
mixed	`contains_pii`	1.037s	—	—

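On clean data, maskops runtime stays roughly flat as the pattern count grows (about 0.1s for a single family, 0.114s for all 15), while the Python regex baseline slows with each additional pattern. That is why the all-families gap on clean data (163×) dwarfs that of any single family. On dense data maskops also slows as families are added (4.693s versus roughly 0.9–1.5s per family) but still finishes 2.7× ahead.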

vs Microsoft Presidio (measured)

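Benchmarked on 10,000 rows of mixed real-world text (email, phone, IBAN, credit cards, IP) with Python 3.11 on Ubuntu, using the en_core_web_lg spaCy model for Presidio. The table reports the measured 10,000-row timings; the 1M-row figures that follow it are extrapolated from these throughputs.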

Tool	Profile	Time (10K rows)	Rows/s	Relative to maskops
maskops	clean	0.021s	479,441	—
Presidio (en_core_web_lg)	clean	101.131s	99	4,849× slower
maskops	dense	0.028s	351,645	—
Presidio (en_core_web_lg)	dense	115.599s	87	4,065× slower
maskops	mixed	0.028s	358,118	—
Presidio (en_core_web_lg)	mixed	118.125s	85	4,230× slower

At Presidio's measured throughput of ~85–99 rows/s, processing 1M rows would take 2.8–3.3 hours. maskops processes the same 1M rows in under 3 seconds.
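
The extrapolation is plain arithmetic on the measured rows/s figures from the table, and can be checked in a few lines of Python. The helper name `hours_for` is hypothetical; the calculation assumes throughput stays constant as the row count grows.

```python
def hours_for(rows: int, rows_per_second: float) -> float:
    """Time to process `rows` at a constant throughput, in hours."""
    return rows / rows_per_second / 3600


ROWS = 1_000_000

presidio_best = hours_for(ROWS, 99)   # clean profile, fastest Presidio run
presidio_worst = hours_for(ROWS, 85)  # mixed profile, slowest Presidio run
maskops_seconds = ROWS / 351_645      # dense profile, slowest maskops run

print(f"Presidio: {presidio_best:.1f} to {presidio_worst:.1f} hours")  # Presidio: 2.8 to 3.3 hours
print(f"maskops:  {maskops_seconds:.2f} seconds")                      # maskops:  2.84 seconds
```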

Entity coverage

Pattern	maskops	Presidio
IBAN	✓	✓
EU VAT	✓	✗
Email	✓	✓
Phone (E.164)	✓	✓
IP Address	✓	✓
Credit Card	✓	✓
RUT (Chile)	✓	✗
CPF (Brazil)	✓	✗
CURP (Mexico)	✓	✗
DNI/NIE (Spain)	✓	✗
NIN (UK)	✓	✗
Personalausweis (Germany)	✓	✗
Person names (NER)	✗	✓
Locations (NER)	✗	✓
Organisations (NER)	✗	✓

Presidio's strength is unstructured text with named entities (names, locations, organisations), so use it when NER is required. maskops is purpose-built for structured data pipelines where schema-defined PII fields don't need NLP and Presidio's model overhead is unnecessary. For mixed workloads, the two tools can be combined: maskops for bulk structured columns, Presidio for free-text fields.

Build from source

Windows (PowerShell)

python -m venv .venv
.venv\Scripts\activate
pip install maturin faker polars pytest
maturin develop --release
python tests/generate_fixtures.py
pytest tests/ -v

Linux / macOS

python -m venv .venv
source .venv/bin/activate
pip install maturin faker polars pytest
maturin develop --release
python tests/generate_fixtures.py
pytest tests/ -v

Key dependency versions

Package	Version
pyo3	0.25
pyo3-polars	0.23
polars	0.46
maturin	>=1.7,<2.0

Note: pyo3 must be 0.25 to match pyo3-polars 0.23. Do not bump pyo3 independently.

Roadmap

License

Mozilla Public License 2.0. Commercial use requires a separate license — see CLA.md or contact fcarvajalbrown@gmail.com.