JustHTML

May 9, 2026

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

📖 Full documentation | 🛝 Try it in the Playground

Why use JustHTML?

Just... Correct ✅

Spec-perfect HTML5 parsing with browser-grade error recovery. Passes the official 9k+ html5lib-tests suite, with 100% line+branch coverage. (Correctness)

JustHTML("<p><b>Hi<i>there</b>!", fragment=True).to_html(pretty=False)
# => <p><b>Hi<i>there</i></b><i>!</i></p>

Note: fragment=True parses snippets (no <html>/<body> needed).

Just... Secure 🔒

Safe-by-default sanitization at construction time. Built-in Bleach-style allowlist sanitization on JustHTML(...) (disable with sanitize=False). Can sanitize inline CSS rules. (Sanitization & Security)

JustHTML(
    "<p>Hello<script>alert(1)</script> "
    "<a href=\"javascript:alert(1)\">bad</a> "
    "<a href=\"https://example.com/?a=1&b=2\">ok</a></p>",
    fragment=True,
).to_html()
# => <p>Hello <a>bad</a> <a href="https://example.com/?a=1&amp;b=2">ok</a></p>

Just... Query 🔍

CSS selectors out of the box. Two methods (query(), query_one()), familiar syntax (combinators, groups, pseudo-classes), and plain Python nodes as results. (CSS Selectors)

JustHTML(
    "<div><p class=\"x\">Hi</p><p>Bye</p></div>",
    fragment=True,
).query_one("div p.x").to_html(pretty=False)
# => <p class="x">Hi</p>

⚠️ Note: Sanitization happens before querying, so anything the sanitizer removes cannot be selected. Remember to disable it (sanitize=False) if you are working with trusted HTML.

Just... Transform 🏗️

Built-in DOM transforms for dropping and unwrapping nodes, rewriting attributes, linkifying text, and composing safe pipelines. (Transforms)

from justhtml import JustHTML, Linkify, SetAttrs, Unwrap

doc = JustHTML(
    "<p>Hello <span class=\"x\">world</span> example.com</p>",
    transforms=[
        Unwrap("span.x"),
        Linkify(),
        SetAttrs("a", rel="nofollow"),
    ],
    fragment=True,
    sanitize=False,
)
print(doc.to_html(pretty=False))
# => <p>Hello world <a href="http://example.com" rel="nofollow">example.com</a></p>

Just... Build 🧱

Build node trees directly when Python is driving the HTML, then normalize them through the same HTML5 parser. (Building HTML)

from justhtml import JustHTML
from justhtml.builder import element

doc = JustHTML(
    element(
        "article",
        {"class": "post"},
        element("h2", "JustHTML"),
        element("p", "Build nodes directly."),
        element("a", {"href": "/docs"}, "Read docs"),
    ),
    fragment=True,
    sanitize=False,
)
print(doc.to_html(pretty=False))
# => <article class="post"><h2>JustHTML</h2><p>Build nodes directly.</p><a href="/docs">Read docs</a></article>

Just... Python 🐍

Pure Python, zero dependencies. No C extensions or system libraries, easy to debug, and works anywhere Python runs, including PyPy and Pyodide. (Run in the browser)

python -m pip show justhtml | grep -E '^Requires:'
# Requires: [intentionally left blank]

Just... Fast Enough ⚡

Fast for the common case, and the fastest pure-Python HTML5 parser available. For terabytes, use a C/Rust parser like html5ever. (Benchmarks)

curl -Ls https://en.wikipedia.org/wiki/HTML -o /tmp/justhtml-bench.html
/usr/bin/time -f '%e s' python -m justhtml /tmp/justhtml-bench.html > /dev/null
# 0.22 s

Comparison

| Tool | HTML5 parsing [1][2] | Speed | Query | Build | Sanitize | Notes |
|---|---|---|---|---|---|---|
| JustHTML (Pure Python) | ✅ 100% | ⚡ Fast | ✅ CSS selectors | ✅ element() | ✅ Built-in | Correct, secure, easy to install, and fast enough. |
| selectolax (Python wrapper of C-based Lexbor) | ✅ 99.9% | 🚀 Very Fast | ✅ CSS selectors | ✅ create_node() | ❌ Needs sanitization | Very fast and spec-compliant. |
| Chromium (browser engine) | ✅ 99.6% | 🚀 Very Fast | n/a | n/a | n/a | n/a |
| WebKit (browser engine) | ✅ 98% | 🚀 Very Fast | n/a | n/a | n/a | n/a |
| Firefox (browser engine) | ✅ 98% | 🚀 Very Fast | n/a | n/a | n/a | n/a |
| markupever (Python wrapper of Rust-based html5ever) | 🟡 89% | 🚀 Very Fast | ✅ CSS selectors | ✅ TreeDom .create_*() | ❌ Needs sanitization | Fast and mostly correct, but missing benchmarked capabilities count against compliance. |
| html5lib (Pure Python) | 🟡 86% | 🐢 Slow | 🟡 XPath (lxml) | 🟡 Tree API | 🔴 Deprecated | Unmaintained reference implementation; incomplete coverage of the tree-construction fixtures. |
| html5_parser (Python wrapper of C-based Gumbo) | 🔴 49% | 🚀 Very Fast | 🟡 XPath (lxml) | 🟡 etree (lxml) | ❌ Needs sanitization | Fast, but its public tree API loses information needed by many fixtures. |
| BeautifulSoup (Pure Python) | 🔴 <1% (default) | 🐢 Slow | 🟡 Custom API | ✅ new_tag() API | ❌ Needs sanitization | Wraps html.parser (default). Can use lxml or html5lib. |
| html.parser (Python stdlib) | 🔴 <1% | ⚡ Fast | ❌ None | ❌ None | ❌ Needs sanitization | Standard library. Chokes on malformed HTML. |
| lxml (Python wrapper of C-based libxml2) | 🔴 <1% | 🚀 Very Fast | 🟡 XPath | ✅ etree / E-factory | ❌ Needs sanitization | Fast but not HTML5 compliant. Context-fragment cases are skipped; supported cases still perform poorly. Don't use the old lxml.html.clean module! |
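One way to read the html.parser row: the standard-library parser is an event tokenizer and does not apply the HTML5 tree-construction rules. The sketch below (standard library only; EventRecorder is a hypothetical helper name, not part of any library) feeds it the same misnested input used in the "Just... Correct" example and records the raw events. No end tags are synthesized for the unclosed i and p elements, so any error recovery is left to the caller.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: collect the raw events html.parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<p><b>Hi<i>there</b>!")
recorder.close()

for event in recorder.events:
    print(event)
# ('start', 'p')
# ('start', 'b')
# ('data', 'Hi')
# ('start', 'i')
# ('data', 'there')
# ('end', 'b')
# ('data', '!')
```

The only end tag reported is the literal </b> from the input, which is why a spec-compliant tree builder is needed on top to reach the repaired output shown earlier.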

[1]: Parser compliance scores are from a strict run of the html5lib-tests tree-construction fixtures (1,743 non-script tests). The score is pass / (pass + fail + error); unsupported public API capabilities count as failures rather than being faked. The benchmark may compose multiple public APIs from the same parser, but does not use testcase-specific shims or synthetic adapters when an API surface is missing. The selectolax score is 1742/1743 (99.94%) using its dev html5test output and fragment-context APIs. See docs/correctness.md for details.

[2]: Browser numbers are from a local rerun of justhtml-html5lib-tests-bench against this repo's tests/html5lib-tests-tree/*.dat corpus: Chromium 1763/1770, WebKit 1741/1770, Firefox 1727/1770, with 8 skipped cases per engine.
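To make the scoring rule in the footnotes concrete, the sketch below applies pass / (pass + fail + error) to the counts quoted there. The function name compliance_score is hypothetical. The footnotes give only passed and total counts, so the code assumes every non-passing case counts as a failure and that skipped cases are already excluded from the totals.

```python
def compliance_score(passed: int, failed: int = 0, errored: int = 0) -> float:
    """Return pass / (pass + fail + error) as a percentage."""
    total = passed + failed + errored
    if total == 0:
        raise ValueError("no test results to score")
    return 100.0 * passed / total


# (passed, total) pairs copied from footnotes [1] and [2].
results = {
    "selectolax": (1742, 1743),
    "Chromium": (1763, 1770),
    "WebKit": (1741, 1770),
    "Firefox": (1727, 1770),
}

scores = {}
for name, (passed, total) in results.items():
    # Assumption: everything that did not pass is treated as a failure.
    scores[name] = compliance_score(passed, failed=total - passed)
    print(f"{name}: {scores[name]:.2f}%")
# selectolax: 99.94%
# Chromium: 99.60%
# WebKit: 98.36%
# Firefox: 97.57%
```

These values round to the 99.9%, 99.6%, 98%, and 98% shown in the comparison table.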

Installation

pip install justhtml

Next: Quickstart Guide, CSS Selectors, Sanitization & Security, or try the Playground.

Requires Python 3.10 or later.

Quick Example

from justhtml import JustHTML

doc = JustHTML("<html><body><p class='intro'>Hello!</p></body></html>")

# Query with CSS selectors
for p in doc.query("p.intro"):
    print(p.name)        # "p"
    print(p.attrs)       # {"class": "intro"}
    print(p.to_html())   # <p class="intro">Hello!</p>

See the Quickstart Guide for more examples including tree traversal, streaming, and strict mode.

Command Line

If you installed JustHTML (for example with pip install justhtml or pip install -e .), you can use the justhtml command. If you don't have it available, use the equivalent python -m justhtml ... form instead.

# Pretty-print an HTML file
justhtml index.html

# Parse from stdin
curl -s https://example.com | justhtml -

# Select nodes and output text
justhtml index.html --selector "main p" --format text

# Select nodes and output Markdown (subset of GFM)
justhtml index.html --selector "article" --format markdown

# Select nodes and output HTML
justhtml index.html --selector "a" --format html

# Example: extract Markdown from GitHub README HTML
curl -s https://github.com/EmilStenstrom/justhtml/ | justhtml - --selector '.markdown-body' --format markdown --unsafe | head -n 8

Output:

# JustHTML

[](#justhtml)

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

[📖 Full documentation](https://emilstenstrom.github.io/justhtml/) | [🛝 Try it in the Playground](https://emilstenstrom.github.io/justhtml/playground/)

Security

For security policy and vulnerability reporting, please see SECURITY.md.

Contributing

See CONTRIBUTING.md for development setup and guidelines.

Acknowledgments

JustHTML started as a Python port of html5ever, the HTML5 parser from Mozilla's Servo browser engine. While the codebase has since evolved significantly, html5ever's clean architecture and spec-compliant approach were invaluable as a starting point. Thank you to the Servo team for their excellent work.

Correctness and conformance work is heavily guided by the html5lib ecosystem and especially the official html5lib-tests fixtures used across implementations.

The sanitization API and threat-model expectations are informed by established Python sanitizers like Bleach and nh3.

The CSS selector query API is inspired by the ergonomics of lxml.cssselect.

License

MIT. Free for both commercial and non-commercial use.