Extracting Text

April 1, 2026 · View on GitHub

Extracting Text

JustHTML gives you a few ways to get text out of a parsed document, depending on whether you want a fast concatenation, or something structured.

1) `to_text()` (concatenated text)

Use to_text() when you want the concatenated text from a whole subtree:

Traverses descendants.
Joins text nodes using separator (default: a single space).
Strips each text node by default (strip=True) and drops empty segments.
Includes <template> contents (via template_content).
Sanitizes untrusted HTML by default (safe-by-default at construction).

from justhtml import JustHTML

doc = JustHTML("<div><h1>Title</h1><p>Hello <b>world</b></p></div>", fragment=True)
print(doc.to_text())
# => Title Hello world

from justhtml import JustHTML

untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True)
print(untrusted.to_text())
# => Hello World

from justhtml import JustHTML

untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True, sanitize=False)
print(untrusted.to_text())
# => Hello alert(1) World

from justhtml import JustHTML

doc = JustHTML("<p>Hello <b>world</b></p>", fragment=True)
print(doc.to_text(separator="", strip=False))
# => Hello world

The default separator=" " avoids accidentally smashing words together when the HTML splits text across nodes:

from justhtml import JustHTML

doc = JustHTML("<p>Hello<b>world</b></p>")

print(doc.to_text())
print(doc.to_text(separator="", strip=True))
# => Hello world
# => Helloworld

Block-only separators

If you use a separator like "\n" to get one “line” per block element, inline elements can split text into multiple nodes and produce extra separators:

from justhtml import JustHTML

doc = JustHTML("<p>hi</p><p>Hello <b>world</b></p>")

print(doc.to_text(separator="\n"))
# => hi
# => Hello
# => world

Use separator_blocks_only=True to apply separator only between block-level elements:

from justhtml import JustHTML

doc = JustHTML("<p>hi</p><p>Hello <b>world</b></p>")

print(doc.to_text(separator="\n", separator_blocks_only=True))
# => hi
# => Hello world

2) `to_markdown()` (GitHub Flavored Markdown)

to_markdown() outputs a pragmatic subset of GitHub Flavored Markdown (GFM) that aims to be readable and stable for common HTML.

Converts common elements like headings, paragraphs, lists, emphasis, links, and code.
Keeps tables (<table>) and images (<img>) as raw HTML.
Drops <script>, <style>, and <textarea> by default; pass html_passthrough=True to include them and their contents.
When the document was built with JustHTML(..., sanitize=True) (the default), the Markdown is generated from the sanitized DOM.
The safety guarantee applies to the HTML produced by rendering that Markdown with a compliant Markdown renderer.
The returned Markdown string is Markdown source, not escaped HTML. If you want to embed the raw Markdown source into an HTML page, escape it first.
It may still include sanitized raw HTML for tables and images.

from justhtml import JustHTML

doc = JustHTML("<h1>Title</h1><p>Hello <b>world</b></p>")
print(doc.to_markdown())
# => # Title
# =>
# => Hello **world**

Example:

from justhtml import JustHTML

html = """
<div>
  <h1>Title</h1>
  <p>Hello <b>world</b> and <a href="https://example.com">links</a>.</p>
  <ul>
    <li>First item</li>
    <li>Second item</li>
  </ul>
  <pre>code block</pre>
</div>
"""

doc = JustHTML(html)
print(doc.to_markdown())

Output:

# Title

Hello **world** and [links](https://example.com).

- First item
- Second item

```
code block
```

Which should I use?

Use to_text() for the raw concatenated text of a subtree (textContent semantics).
Use to_markdown() when you want readable, structured Markdown from the sanitized DOM.
Use to_text() when you need plain text with no HTML in the output.

Extracting Text

1) to_text() (concatenated text)

Block-only separators

2) to_markdown() (GitHub Flavored Markdown)

Which should I use?

1) `to_text()` (concatenated text)

2) `to_markdown()` (GitHub Flavored Markdown)