API Reference

May 24, 2026

Complete documentation for the JustHTML public API.

JustHTML

The main parser class.

from justhtml import JustHTML

Constructor

JustHTML(
    html,
    *,
    sanitize=True,
    policy=None,
    collect_errors=False,
    track_node_locations=False,
    debug=False,
    encoding=None,
    fragment=False,
    fragment_context=None,
    iframe_srcdoc=False,
    scripting_enabled=True,
    strict=False,
    transforms=None,
)

Parameter             Type                                                Default   Description
html                  str | bytes | bytearray | memoryview | Node | Text  required  HTML input to parse, or a built node to normalize by serializing and reparsing. Bytes are decoded using HTML encoding sniffing.
sanitize              bool                                                True      Sanitize untrusted HTML during construction
policy                SanitizationPolicy | None                           None      Override the default sanitization policy
collect_errors        bool                                                False     Collect all parse errors (enables errors property)
track_node_locations  bool                                                False     Track line/column positions for nodes (slower)
debug                 bool                                                False     Enable debug mode (internal)
encoding              str | None                                          None      Transport-supplied encoding label used as an override for byte input. See Encoding & Byte Input.
fragment              bool                                                False     Parse as a fragment in a default <div> context (convenience).
fragment_context      FragmentContext                                     None      Parse as fragment inside this context element
scripting_enabled     bool                                                True      This library does not execute JavaScript inside <script> tags; this flag only controls how the HTML5 algorithm parses <noscript> tags. Do NOT set this flag to False while sanitizing untrusted input; disabling scripting increases mXSS risk.
strict                bool                                                False     Raise StrictModeError on the earliest parse error by source position
transforms            list[Transform] | None                              None      Optional DOM transforms applied after parsing. See Transforms.
iframe_srcdoc         bool                                                False     Parse whole document as if it's inside an iframe srcdoc (HTML parsing quirk)

Properties

Property  Type                         Description
root      Document | DocumentFragment  The document root
errors    list[ParseError]             Parse errors, ordered by source position (only if collect_errors=True)

Methods

to_html(pretty=True, indent_size=2, context=None, quote='"')

Serialize the document to HTML.

from justhtml import HTMLContext, JustHTML

doc = JustHTML("<p>Hello</p>")
doc.to_html()  # Pretty-printed HTML
doc.to_html(pretty=False)  # Compact HTML
doc.to_html(context=HTMLContext.JS_STRING)  # HTML -> JS string literal

Parameters:

  • pretty (default: True): pretty-print with newlines/indent
  • indent_size (default: 2): indent size for pretty output
  • context (default: None/HTMLContext.HTML): output encoding context
  • quote (default: "): quote used for JS string escaping

escape_js_string(value, quote='"')

Escape a value for safe inclusion in a JavaScript string literal.

from justhtml import JustHTML

JustHTML.escape_js_string('He said "hi"')
# => He said \"hi\"

escape_attr_value(value, quote='"')

Escape a value for safe inclusion in a quoted HTML attribute value.

from justhtml import JustHTML

JustHTML.escape_attr_value('" onerror="alert(1)')
# => &quot; onerror=&quot;alert(1)

escape_url_value(value)

Percent-encode a URL value.

from justhtml import JustHTML

JustHTML.escape_url_value('/path with space?x=1&y=2')
# => /path%20with%20space?x=1&y=2

escape_url_in_js_string(value, quote='"')

Convenience helper: URL-encode, then JS-string escape.

from justhtml import JustHTML

JustHTML.escape_url_in_js_string('/path with space?x=1&y=2')
# => /path%20with%20space?x=1&y=2

clean_url_value(value, url_rule)

Validate and rewrite a URL value using an explicit UrlRule. Returns None if the URL is disallowed.

from justhtml import JustHTML, UrlRule

url_rule = UrlRule(allowed_schemes={"https"})
JustHTML.clean_url_value(value="https://example.com/", url_rule=url_rule)
# => https://example.com/

clean_url_in_js_string(value, url_rule, quote='"')

Convenience helper: clean a URL, then percent-encode it and JS-string escape it. Returns None if the URL is disallowed.

from justhtml import JustHTML, UrlRule

url_rule = UrlRule(allowed_schemes={"https"})
JustHTML.clean_url_in_js_string(value="https://example.com/a b", url_rule=url_rule)
# => https://example.com/a%20b
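
Both clean_url_* helpers return None for a disallowed URL, so callers need a branch for that case. The sketch below shows one way to handle it. js_redirect and its clean parameter are hypothetical names, not part of JustHTML; the cleaner is passed in as a callable so the helper runs without the package installed.

```python
from typing import Callable, Optional


def js_redirect(url: str, clean: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return a `location.href = "...";` statement, or None if `clean` rejects the URL.

    `clean` must return text that is already safe inside a double-quoted
    JS string literal, or None for a disallowed URL.
    """
    cleaned = clean(url)
    if cleaned is None:
        return None  # disallowed URL: emit nothing rather than the raw value
    return f'location.href = "{cleaned}";'
```

With the real helper, clean could be lambda u: JustHTML.clean_url_in_js_string(value=u, url_rule=url_rule), reusing the url_rule from the example above.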

to_text(separator=" ", strip=True, separator_blocks_only=False)

Return the document's concatenated text.

doc = JustHTML("<p>Hello <b>world</b></p>")
doc.to_text()  # => Hello world

Parameters:

  • separator (default: " "): join string between text nodes
  • strip (default: True): strip each text node and drop empties
  • separator_blocks_only (default: False): only apply separator between block-level elements (avoid separators inside inline tags)
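
As a rough mental model of how separator and strip combine, the sketch below joins a list of text-node strings. It illustrates the documented behavior and is not the library's implementation: join_text is a hypothetical name, separator_blocks_only is not modeled, and the plain join used for strip=False is an assumption.

```python
def join_text(chunks, separator=" ", strip=True):
    """Join text-node strings following the parameter descriptions above."""
    if strip:
        # Strip each chunk, then drop the ones that became empty.
        chunks = [c.strip() for c in chunks]
        chunks = [c for c in chunks if c]
    # Assumption: with strip=False the chunks are joined unchanged.
    return separator.join(chunks)


# The text nodes of "<p>Hello <b>world</b></p>" are "Hello " and "world".
print(join_text(["Hello ", "world"]))                             # Hello world
print(join_text(["Hello ", "world"], separator=""))               # Helloworld
print(join_text(["Hello ", "world"], separator="", strip=False))  # Hello world
```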

Sanitization happens at construction time. Use JustHTML(..., sanitize=False) for trusted input or JustHTML(..., policy=...) to customize the policy. If you use explicit transform pipelines, note that Sanitize(...) only guarantees safety at its position in the pipeline; later transforms can reintroduce unsafe content. See Transforms.

Built node inputs are normalized through the same parser path as string inputs. This means JustHTML(...) serializes the built node tree to HTML and reparses it using the normal HTML5 parser.

to_markdown(html_passthrough=False)

Return a pragmatic subset of GitHub Flavored Markdown (GFM).

Tables (<table>) and images (<img>) are preserved as raw HTML. Raw HTML tags like <script>, <style>, and <textarea> are dropped by default; pass html_passthrough=True to preserve them (including their contents).

doc = JustHTML("<h1>Title</h1><p>Hello <b>world</b></p>")
doc.to_markdown()  # => # Title
# =>
# => Hello **world**
doc.to_markdown(html_passthrough=True)

Sanitization happens at construction time. JustHTML(..., sanitize=True) (the default) makes the DOM safe before Markdown serialization runs, unless you use an explicit transform pipeline where later transforms run after Sanitize(...). The returned Markdown is generated from that DOM, but it is still Markdown source, not escaped HTML. Render it with a compliant Markdown renderer before embedding it into a page, or escape it first if you need to display the raw Markdown source inside HTML. Markdown output may still contain sanitized raw HTML for elements such as tables and images, so use to_text() instead if you need plain text with no HTML output at all. Use JustHTML(..., sanitize=False) only for trusted input, or JustHTML(..., policy=...) to customize the policy.
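
For the "escape it first" case, the standard library's html.escape is sufficient. A minimal sketch, assuming the Markdown string comes from doc.to_markdown(); markdown_source_as_html is a hypothetical helper name.

```python
from html import escape


def markdown_source_as_html(markdown: str) -> str:
    """Wrap raw Markdown source in <pre> so it displays literally in a page."""
    # escape() replaces &, <, > and both quote characters with entities.
    return f"<pre>{escape(markdown)}</pre>"


print(markdown_source_as_html("# Title\n\n<img src=x>"))
```

The raw HTML that Markdown output may carry (tables, images) is shown as text here instead of being interpreted by the browser.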

query(selector)

Find all nodes matching a CSS selector. Type hint: list[QueryMatch], where QueryMatch is Element | Comment.

doc.query("div.container > p")  # Returns list of matching nodes
doc.query(":comment")           # Returns comment nodes

query_one(selector)

Return the first matching descendant for a CSS selector, or None. Type hint: QueryMatch | None.

node = doc.query_one("div.container > p")
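
Because query_one returns None when nothing matches, callers usually want a guard before touching the result. A small sketch of that pattern; first_text is a hypothetical helper, and root may be a JustHTML document or any node that offers query_one.

```python
def first_text(root, selector, default=""):
    """Return the text of the first match under `root`, or `default` if none."""
    match = root.query_one(selector)
    if match is None:
        return default  # no match: avoid an AttributeError on None
    return match.to_text()
```

For example, first_text(doc, "div.container > p", default="(empty)") would return the paragraph text or the fallback string.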

Node

Base type for all DOM nodes.

Public node type aliases:

  • NodeType: any DOM node accepted by JustHTML public helpers (Node | Text).
  • QueryMatch: nodes returned by selector queries (Element | Comment).

Node types:

  • Document: the root for full-document parses
  • DocumentFragment: the root for fragment parses
  • Element: normal HTML/SVG/MathML elements
  • Text: text nodes (#text)
  • Comment: comment nodes (#comment)
  • Template: <template> elements with template_content

Template nodes expose a template_content document fragment (HTML namespace only), which holds the template’s children.


Builder

The optional builder API lives in a separate submodule so programmatic HTML construction is explicit at the import site.

For a tutorial-style guide, see Building HTML.

from justhtml.dom.builder import comment, doctype, element, text

The builder constructs nodes directly. To normalize built nodes using HTML5 parsing rules, pass them to JustHTML(...).

from justhtml import JustHTML
from justhtml.dom.builder import element

doc = JustHTML(element("p", "Hello"), fragment=True)

element(name, attrs=None, *children, namespace="html")

Create an element node.

  • name: tag name, for example "div" or "a"
  • attrs: optional attribute dictionary
  • children: zero or more child values
  • namespace: optional namespace, default "html"; allowed values are "html", "svg", and "mathml" ("math" is also accepted as the internal alias)

attrs is optional. If the second positional argument is not a mapping, it is treated as the first child.

Examples:

element("p", "Hello")
element("a", {"href": "/docs"}, "Docs")
element("input[type=email][required]")

The name parameter supports a restricted attribute shorthand:

  • tag[attr]
  • tag[attr=value]
  • tag[attr="value"]
  • tag[attr='value']

This shorthand is optional convenience. The explicit attrs dict remains the canonical form.
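
The calling conventions above compose naturally when building nested structures. The sketch below builds a link list; link_list is a hypothetical helper, nesting built nodes as children is assumed to work the same way as string children, and the element factory is passed in as a parameter only so the helper can run without the package installed.

```python
def link_list(element, links):
    """Build <ul><li><a href=...>label</a></li>...</ul> from (label, href) pairs.

    `element` is the builder factory (justhtml.dom.builder.element in real use).
    """
    items = [
        # Explicit attrs dict as the second argument, then the text child.
        element("li", element("a", {"href": href}, label))
        for label, href in links
    ]
    # No attrs dict here: the first non-mapping argument is treated as a child.
    return element("ul", *items)
```

In real use this might look like JustHTML(link_list(element, [("Docs", "/docs")]), fragment=True), which normalizes the built tree as described above.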

text(value)

Create a text node.

comment(value)

Create a comment node.

doctype(name="html", public_id=None, system_id=None, *, force_quirks=False)

Create a doctype node.

JustHTML(...) preserves the doctype name and identifiers when it normalizes a built document tree.

Direct DOM edits are supported, but transforms are the preferred way to make changes because they preserve ordering semantics and make sanitization explicit. See Transforms for the recommended workflow. If you mutate the DOM after construction, sanitization has already happened; re-sanitize by using sanitize_dom(...) or rebuild the document with a Sanitize(...) transform in the construction pipeline.

Properties

Property   Type         Description
name       str          Tag name (e.g., "div") or "#text", "#comment", "#document"
attrs      dict | None  Attribute dictionary (None for comments/doctypes)
children   list | None  Child nodes (None for comments/doctypes)
parent     Node         Parent node (or None for root)
text       str          Node-local text value. For text nodes this is the node data, otherwise "". Use to_text() for textContent semantics.
namespace  str | None   Namespace for the node ("html" by default for elements).

Methods

to_html(indent=0, indent_size=2, pretty=True, context=None, quote='"')

Serialize the node to an HTML string.

from justhtml import HTMLContext

node.to_html()                      # Pretty-printed HTML
node.to_html(pretty=False)          # Compact HTML
node.to_html(indent_size=4)         # 4-space indent
node.to_html(indent=2, indent_size=4)  # Start with 2 indents
node.to_html(context=HTMLContext.JS_STRING)  # HTML -> JS string literal

# Context options:
# - HTMLContext.HTML (default): no extra escaping
# - HTMLContext.JS_STRING: JS-string escape (serialized HTML markup)
# - HTMLContext.HTML_ATTR_VALUE: escape the serialized HTML for a quoted HTML attribute value
#
# If you need to put plain text into `innerHTML` via a JS string, use:
# - JustHTML.escape_html_text_in_js_string(...)
#
# For escaping plain strings (no DOM required), use:
# - JustHTML.escape_js_string(...)
# - JustHTML.escape_attr_value(...)
# - JustHTML.escape_url_value(...)
# - JustHTML.escape_url_in_js_string(...)
# - JustHTML.clean_url_value(...)
# - JustHTML.clean_url_in_js_string(...)

# Safety happens at construction time:
# - default: JustHTML(..., sanitize=True)
# - raw/trusted: JustHTML(..., sanitize=False)
# - custom policy: JustHTML(..., policy=policy)

query(selector)

Find descendants matching a CSS selector. Type hint: list[QueryMatch].

div.query("p.intro")  # Search within this node

query_one(selector)

Return the first matching descendant for a CSS selector, or None. Type hint: QueryMatch | None.

p = div.query_one("p.intro")

to_text(separator=" ", strip=True, separator_blocks_only=False)

Return the node's concatenated text.

node.to_text()

Parameters:

  • separator (default: " "): join string between text nodes
  • strip (default: True): strip each text node and drop empties
  • separator_blocks_only (default: False): only apply separator between block-level elements (avoid separators inside inline tags)

Text extraction is safe-by-default when you build documents with JustHTML(..., sanitize=True) (the default), unless later transforms run after an explicit Sanitize(...). Use sanitize=False at construction for trusted input.

to_markdown(html_passthrough=False)

Return a pragmatic subset of GitHub Flavored Markdown (GFM) for this subtree.

node.to_markdown()
node.to_markdown(html_passthrough=True)

When you build documents with JustHTML(..., sanitize=True) (the default), this Markdown is generated from the sanitized DOM, unless later transforms run after an explicit Sanitize(...). The safety guarantee applies to the rendered Markdown output, assuming you render it with a compliant Markdown renderer. The returned Markdown string is not escaped HTML and should not be injected directly into a page without rendering or escaping first. It may still include sanitized raw HTML for elements such as tables and images. Use to_text() if you need plain text output with no HTML.

append_child(node)

Append a child node to this node.

insert_before(node, reference_node)

Insert node before reference_node (or append if reference_node is None).

remove_child(node)

Remove a direct child node.

replace_child(new_node, old_node)

Replace a direct child node with a new node.

clone_node(deep=False, override_attrs=None)

Clone this node. If deep=True, children are cloned recursively.

has_child_nodes()

Return True if this node has children.
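
The mutation methods above can be combined into small tree edits. As a sketch, the hypothetical unwrap helper below replaces a node with its children using only the documented members (parent, children, remove_child, insert_before). Each child is detached before it is reinserted, since whether insert_before moves an already-attached node is not stated here.

```python
def unwrap(node):
    """Replace `node` with its children, keeping their order."""
    parent = node.parent
    if parent is None:
        raise ValueError("cannot unwrap a root node")
    # Copy the list first: node.children is mutated inside the loop.
    for child in list(node.children or []):
        node.remove_child(child)           # detach from the wrapper
        parent.insert_before(child, node)  # reattach just before the wrapper
    parent.remove_child(node)
```

Since this is a direct DOM edit, the earlier note applies: sanitization has already happened, so re-sanitize with sanitize_dom(...) if later edits could introduce untrusted content.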


Sanitization

JustHTML includes a built-in, policy-driven HTML sanitizer.

from justhtml import DEFAULT_POLICY, SanitizationPolicy, UrlPolicy, UrlProxy, UrlRule, sanitize_dom
from justhtml.selector import SelectorLimits

Sanitizing output vs sanitizing the DOM

  • Construction sanitization is the default: JustHTML(..., sanitize=True) sanitizes once during construction. If your transform list does not already include Sanitize(), JustHTML appends it at the end; otherwise your explicit Sanitize() determines where sanitization happens.
  • If you want to sanitize after other transforms or direct DOM edits, add Sanitize(...) to your transform pipeline.
    • If you care about explicit transform passes, group transforms using Stage([...]).
    • For details on how Sanitize(...) works (and why it’s reviewable), see Transforms.

from justhtml import JustHTML, Sanitize

doc = JustHTML(user_html, fragment=True, transforms=[Sanitize()])
clean_root = doc.root

sanitize_dom(node, *, policy=None, errors=None)

Re-sanitize a DOM tree after direct edits. For document roots (#document or #document-fragment), this mutates the tree in place. For other nodes, the node is sanitized as if it were the only child of a document fragment; the returned node may need to be reattached by the caller.

from justhtml import sanitize_dom

sanitize_dom(doc.root)  # In-place for document roots

DEFAULT_POLICY

Conservative built-in policy used for safe-by-default sanitization.

DEFAULT_DOCUMENT_POLICY

Conservative built-in policy used when sanitizing full documents (preserves <html>, <head>, and <body> wrappers).

SanitizationPolicy

Defines allowlists for tags and attributes, URL validation rules, and optional inline-style allowlisting.

The sanitizer is HTML-only for output safety: SVG and MathML are parsed and represented in the DOM when sanitization is disabled, but sanitizer output always drops foreign-namespace content.

Notable options:

  • unsafe_handling: "strip" (default), "raise", or "collect"
  • disallowed_tag_handling: "unwrap" (default), "escape", or "drop"
  • strip_invisible_unicode: True by default; strips invisible Unicode commonly abused for obfuscation, including variation selectors, zero-width/bidi controls, and private-use characters
  • url_policy: controls URL validation and URL handling ("allow", "strip", or "proxy")
  • selector_limits: resource limits used when parsing and matching selectors in sanitization transform pipelines

selector_limits is an advanced escape hatch for trusted real-world pipelines that hit the conservative selector hardening defaults. If a transform pipeline includes Sanitize(policy=policy), the pipeline uses policy.selector_limits for transform selector parsing and matching. If there are multiple enabled Sanitize(...) transforms, the last one in the pipeline controls the limits. Without an enabled Sanitize(...), default selector limits apply.

from justhtml import JustHTML, SanitizationPolicy, Sanitize, SetAttrs
from justhtml.selector import SelectorLimits

policy = SanitizationPolicy(
    allowed_tags={"div"},
    allowed_attributes={"*": {"class", "id"}},
    selector_limits=SelectorLimits(max_length=20_000, max_match_bytes=200_000_000),
)

doc = JustHTML(
    html,
    fragment=True,
    transforms=[
        SetAttrs(".long-generated-class-name", id="matched"),
        Sanitize(policy=policy),
    ],
)

Selector limits are not a substitute for input size controls. Prefer raising only the specific limit your trusted workload needs.

UrlPolicy

Wraps URL rules and controls what happens to URL-valued attributes.

UrlPolicy(
    default_handling="allow",  # or "strip" / "proxy"
    default_allow_relative=True,
    allow_rules={},
    url_filter=None,
    proxy=None,
)

UrlProxy

Proxy rewrite configuration used when effective URL handling is "proxy".

UrlProxy(
    url="/proxy",
    param="url",
)

UrlRule

Controls how URL-valued attributes like a[href] and img[src] are validated.

UrlRule(
    allow_fragment=True,
    resolve_protocol_relative="https",
    allowed_schemes=set(),
    allowed_hosts=None,
    handling=None,
    proxy=None,
)

Parameter                  Type                                Default  Description
allow_fragment             bool                                True     Allow fragment-only URLs (e.g. #anchor)
resolve_protocol_relative  str | None                          "https"  Scheme to resolve protocol-relative URLs (//...) to before checking. If None, they are dropped.
allowed_schemes            set[str]                            set()    Allowed schemes for absolute URLs (e.g. {"https", "mailto"})
allowed_hosts              set[str] | None                     None     If set, only allow these hosts (e.g. {"example.com"})
handling                   "allow" | "strip" | "proxy" | None  None     Per-rule override. If None, UrlPolicy.default_handling is used.
proxy                      UrlProxy | None                     None     Per-rule proxy override used when effective handling is "proxy"
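
To make the interaction of these fields concrete, here is a rough standard-library model of the checks the table describes. It is an illustration only and not JustHTML's implementation, so do not use it in place of the library. check_url is a hypothetical name, allow_relative stands in for UrlPolicy.default_allow_relative, and returning the resolved form of a protocol-relative URL is an assumption.

```python
from urllib.parse import urlsplit


def check_url(
    url,
    *,
    allowed_schemes=frozenset(),
    allowed_hosts=None,
    allow_fragment=True,
    resolve_protocol_relative="https",
    allow_relative=True,  # stand-in for UrlPolicy.default_allow_relative
):
    """Return the URL if this simplified rule accepts it, else None."""
    url = url.strip()
    if url.startswith("#"):
        return url if allow_fragment else None
    if url.startswith("//"):
        if resolve_protocol_relative is None:
            return None  # protocol-relative URLs are dropped
        # Assumption: the sketch returns the resolved form.
        url = f"{resolve_protocol_relative}:{url}"
    try:
        parts = urlsplit(url)
    except ValueError:
        return None  # malformed input, e.g. an unbalanced IPv6 bracket
    if not parts.scheme:
        return url if allow_relative else None
    if parts.scheme not in allowed_schemes:
        return None
    if allowed_hosts is not None and parts.hostname not in allowed_hosts:
        return None
    return url


print(check_url("https://example.com/", allowed_schemes={"https"}))
print(check_url("javascript:alert(1)", allowed_schemes={"https"}))  # None
```

Note that the default allowed_schemes is empty, so in this model every absolute URL is rejected until a scheme is explicitly allowed.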

stream

Memory-efficient streaming parser.

from justhtml import stream

for event, data in stream(html):
    ...

stream() accepts the same input types as JustHTML. If you pass bytes, it will decode using HTML encoding sniffing. To override the encoding for byte input, pass encoding=....

Events

Event      Data                    Description
"start"    (tag_name, attrs_dict)  Opening tag
"end"      tag_name                Closing tag
"text"     text_content            Text content
"comment"  comment_text            HTML comment
"doctype"  doctype_name            DOCTYPE declaration

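The event shapes in the table are enough to write a consumer without building a DOM. The sketch below counts opening tags and collects text from any iterable of (event, data) pairs; summarize_events is a hypothetical helper, and it is written against the documented shapes rather than against the library itself.

```python
from collections import Counter


def summarize_events(events):
    """Return (tag_counts, text) from an iterable of (event, data) pairs."""
    tag_counts = Counter()
    text_parts = []
    for event, data in events:
        if event == "start":
            tag_name, _attrs = data  # "start" carries (tag_name, attrs_dict)
            tag_counts[tag_name] += 1
        elif event == "text":
            text_parts.append(data)
        # "end", "comment", and "doctype" are ignored in this sketch.
    return tag_counts, "".join(text_parts)
```

In real use the input would be stream(html), for example counts, text = summarize_events(stream(html)).
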
FragmentContext

Specifies the context element for fragment parsing. See Fragment Parsing for detailed usage.

from justhtml.parser.context import FragmentContext

Constructor

FragmentContext(tag_name, namespace=None)

Parameter  Type        Default   Description
tag_name   str         required  Context element tag name (e.g., "div", "tbody")
namespace  str | None  None      None for HTML, "svg" for SVG, "math" for MathML

Example

from justhtml import JustHTML
from justhtml.parser.context import FragmentContext

# Parse table rows in correct context
ctx = FragmentContext("tbody")
doc = JustHTML("<tr><td>cell</td></tr>", fragment_context=ctx)

ParseError

Represents a parse error with location information.

from justhtml import ParseError

Properties

Property  Type  Description
code      str   Error code (e.g., "eof-in-tag")
line      int   Line number (1-indexed)
column    int   Column number (1-indexed)
message   str   Human-readable error message
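
These four properties are enough to produce compact diagnostics. A minimal sketch, assuming every error has integer line and column values; format_errors is a hypothetical helper that accepts any objects exposing the properties above, such as the list from doc.errors when collect_errors=True.

```python
def format_errors(errors):
    """Format ParseError-like objects as 'line:column code: message' strings."""
    # Sort by source position so the output is stable regardless of input order.
    ordered = sorted(errors, key=lambda e: (e.line, e.column))
    return [f"{e.line}:{e.column} {e.code}: {e.message}" for e in ordered]
```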

Methods

as_exception()

Convert to a SyntaxError with source highlighting (Python 3.11+).

error.as_exception()  # Returns SyntaxError

StrictModeError

Exception raised when parsing with strict=True.

from justhtml import StrictModeError

Inherits from SyntaxError, so it displays source location in tracebacks.


Standalone Functions

query(node, selector)

Query a node without using the method syntax. Type hint: (node: NodeType, selector: str) -> list[QueryMatch].

from justhtml import query
results = query(doc.root, "div.main")

matches(node, selector)

Check if a node matches a selector. Type hint: (node: NodeType, selector: str) -> bool.

from justhtml import matches
if matches(node, "div.active"):
    ...

to_html(node, indent=0, indent_size=2, pretty=True, context=None, quote='"')

Serialize a node to HTML.

from justhtml import HTMLContext, to_html
html_string = to_html(node)
escaped = to_html(node, context=HTMLContext.JS_STRING)

# Context options:
# - HTMLContext.HTML (default)
# - HTMLContext.JS_STRING
# - HTMLContext.HTML_ATTR_VALUE
#
# For escaping plain strings (no DOM required), use:
# - JustHTML.escape_js_string(...)
# - JustHTML.escape_attr_value(...)
# - JustHTML.escape_url_value(...)
# - JustHTML.escape_url_in_js_string(...)
# - JustHTML.clean_url_value(...)
# - JustHTML.clean_url_in_js_string(...)

SelectorError

Exception raised for invalid CSS selectors.

from justhtml import SelectorError

try:
    doc.query("div[invalid")
except SelectorError as e:
    print(e)