Extracting Text
April 1, 2026 · View on GitHub
Extracting Text
JustHTML gives you a few ways to get text out of a parsed document, depending on whether you want a fast concatenation, or something structured.
1) to_text() (concatenated text)
Use to_text() when you want the concatenated text from a whole subtree:
- Traverses descendants.
- Joins text nodes using
separator(default: a single space). - Strips each text node by default (
strip=True) and drops empty segments. - Includes
<template>contents (viatemplate_content). - Sanitizes untrusted HTML by default (safe-by-default at construction).
from justhtml import JustHTML
doc = JustHTML("<div><h1>Title</h1><p>Hello <b>world</b></p></div>", fragment=True)
print(doc.to_text())
# => Title Hello world
from justhtml import JustHTML
untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True)
print(untrusted.to_text())
# => Hello World
from justhtml import JustHTML
untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True, sanitize=False)
print(untrusted.to_text())
# => Hello alert(1) World
from justhtml import JustHTML
doc = JustHTML("<p>Hello <b>world</b></p>", fragment=True)
print(doc.to_text(separator="", strip=False))
# => Hello world
The default separator=" " avoids accidentally smashing words together when the HTML splits text across nodes:
from justhtml import JustHTML
doc = JustHTML("<p>Hello<b>world</b></p>")
print(doc.to_text())
print(doc.to_text(separator="", strip=True))
# => Hello world
# => Helloworld
Block-only separators
If you use a separator like "\n" to get one “line” per block element, inline elements can split text into multiple nodes and produce extra separators:
from justhtml import JustHTML
doc = JustHTML("<p>hi</p><p>Hello <b>world</b></p>")
print(doc.to_text(separator="\n"))
# => hi
# => Hello
# => world
Use separator_blocks_only=True to apply separator only between block-level elements:
from justhtml import JustHTML
doc = JustHTML("<p>hi</p><p>Hello <b>world</b></p>")
print(doc.to_text(separator="\n", separator_blocks_only=True))
# => hi
# => Hello world
2) to_markdown() (GitHub Flavored Markdown)
to_markdown() outputs a pragmatic subset of GitHub Flavored Markdown (GFM) that aims to be readable and stable for common HTML.
- Converts common elements like headings, paragraphs, lists, emphasis, links, and code.
- Keeps tables (
<table>) and images (<img>) as raw HTML. - Drops
<script>,<style>, and<textarea>by default; passhtml_passthrough=Trueto include them and their contents. - When the document was built with
JustHTML(..., sanitize=True)(the default), the Markdown is generated from the sanitized DOM. - The safety guarantee applies to the HTML produced by rendering that Markdown with a compliant Markdown renderer.
- The returned Markdown string is Markdown source, not escaped HTML. If you want to embed the raw Markdown source into an HTML page, escape it first.
- It may still include sanitized raw HTML for tables and images.
from justhtml import JustHTML
doc = JustHTML("<h1>Title</h1><p>Hello <b>world</b></p>")
print(doc.to_markdown())
# => # Title
# =>
# => Hello **world**
Example:
from justhtml import JustHTML
html = """
<div>
<h1>Title</h1>
<p>Hello <b>world</b> and <a href="https://example.com">links</a>.</p>
<ul>
<li>First item</li>
<li>Second item</li>
</ul>
<pre>code block</pre>
</div>
"""
doc = JustHTML(html)
print(doc.to_markdown())
Output:
# Title
Hello **world** and [links](https://example.com).
- First item
- Second item
```
code block
```
Which should I use?
- Use
to_text()for the raw concatenated text of a subtree (textContent semantics). - Use
to_markdown()when you want readable, structured Markdown from the sanitized DOM. - Use
to_text()when you need plain text with no HTML in the output.