Extracting Text

April 1, 2026

JustHTML gives you two ways to get text out of a parsed document, depending on whether you want a fast plain-text concatenation or structured Markdown.

1) to_text() (concatenated text)

Use to_text() when you want the concatenated text from a whole subtree:

  • Traverses descendants.
  • Joins text nodes using separator (default: a single space).
  • Strips each text node by default (strip=True) and drops empty segments.
  • Includes <template> contents (via template_content).
  • Sanitizes untrusted HTML by default (safe-by-default at construction).

Basic usage:

from justhtml import JustHTML

doc = JustHTML("<div><h1>Title</h1><p>Hello <b>world</b></p></div>", fragment=True)
print(doc.to_text())
# => Title Hello world

Sanitization is on by default, so <script> content is dropped:

from justhtml import JustHTML

untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True)
print(untrusted.to_text())
# => Hello World

With sanitize=False, the script text is included:

from justhtml import JustHTML

untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True, sanitize=False)
print(untrusted.to_text())
# => Hello alert(1) World

With separator="" and strip=False, text nodes are joined verbatim:

from justhtml import JustHTML

doc = JustHTML("<p>Hello <b>world</b></p>", fragment=True)
print(doc.to_text(separator="", strip=False))
# => Hello world

The default separator=" " avoids accidentally smashing words together when the HTML splits text across nodes:

from justhtml import JustHTML

doc = JustHTML("<p>Hello<b>world</b></p>")

print(doc.to_text())
print(doc.to_text(separator="", strip=True))
# => Hello world
# => Helloworld
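
The strip and separator rules above can be made concrete with a stdlib-only sketch. This is not JustHTML's implementation: TextNodeCollector and join_text are hypothetical names, the parser is Python's html.parser, and no sanitization or <template> handling is attempted. It only mirrors the "strip each node, drop empties, join with separator" behavior described for to_text().

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the raw text nodes of an HTML snippet in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def join_text(html, separator=" ", strip=True):
    """Hypothetical helper mirroring the rules listed for to_text()."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    nodes = collector.nodes
    if strip:
        # Strip each text node, then drop segments that became empty.
        nodes = [node.strip() for node in nodes]
        nodes = [node for node in nodes if node]
    return separator.join(nodes)


# Two adjacent text nodes: "Hello" and "world".
print(join_text("<p>Hello<b>world</b></p>"))                # Hello world
print(join_text("<p>Hello<b>world</b></p>", separator=""))  # Helloworld
# The space lives inside the first text node, so nothing is lost here.
print(join_text("<p>Hello <b>world</b></p>", separator="", strip=False))  # Hello world
```

The second call shows why an empty separator is only safe when the source HTML already carries its own whitespace inside the text nodes.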

Block-only separators

If you use a separator like "\n" to get one “line” per block element, inline elements can split text into multiple nodes and produce extra separators:

from justhtml import JustHTML

doc = JustHTML("<p>hi</p><p>Hello <b>world</b></p>")

print(doc.to_text(separator="\n"))
# => hi
# => Hello
# => world

Use separator_blocks_only=True to apply separator only between block-level elements:

from justhtml import JustHTML

doc = JustHTML("<p>hi</p><p>Hello <b>world</b></p>")

print(doc.to_text(separator="\n", separator_blocks_only=True))
# => hi
# => Hello world
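
One way to picture the block-only behavior is to group text into chunks that only break at block-level tags. The sketch below uses Python's html.parser and is an illustration under assumptions, not JustHTML's code: BLOCK_TAGS is a small hand-picked set, BlockTextCollector and block_text are hypothetical names, and whitespace handling inside a block may differ from the library's.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative set; JustHTML's real block list may differ.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "ul", "ol", "pre", "blockquote"}


class BlockTextCollector(HTMLParser):
    """Groups text into one chunk per run between block-level boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = [""]

    def _boundary(self, tag):
        # Start a new chunk when a block element opens or closes,
        # unless the current chunk is still empty.
        if tag in BLOCK_TAGS and self.chunks[-1]:
            self.chunks.append("")

    def handle_starttag(self, tag, attrs):
        self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        # Inline tags such as <b> do not split the current chunk.
        self.chunks[-1] += data


def block_text(html, separator="\n"):
    collector = BlockTextCollector()
    collector.feed(html)
    collector.close()
    # Strip each chunk and drop the ones that end up empty.
    chunks = [chunk.strip() for chunk in collector.chunks]
    return separator.join(chunk for chunk in chunks if chunk)


print(block_text("<p>hi</p><p>Hello <b>world</b></p>"))
# hi
# Hello world
```

Because <b> is not in BLOCK_TAGS, "Hello " and "world" stay in the same chunk and the separator appears only between the two paragraphs.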

2) to_markdown() (GitHub Flavored Markdown)

to_markdown() outputs a pragmatic subset of GitHub Flavored Markdown (GFM) that aims to be readable and stable for common HTML.

  • Converts common elements like headings, paragraphs, lists, emphasis, links, and code.
  • Keeps tables (<table>) and images (<img>) as raw HTML.
  • Drops <script>, <style>, and <textarea> by default; pass html_passthrough=True to include them and their contents.
  • When the document was built with JustHTML(..., sanitize=True) (the default), the Markdown is generated from the sanitized DOM.
  • The safety guarantee applies to the HTML produced by rendering that Markdown with a compliant Markdown renderer.
  • The returned Markdown string is Markdown source, not escaped HTML. If you want to embed the raw Markdown source into an HTML page, escape it first.
  • The Markdown may still include sanitized raw HTML for tables and images.

from justhtml import JustHTML

doc = JustHTML("<h1>Title</h1><p>Hello <b>world</b></p>")
print(doc.to_markdown())
# => # Title
# =>
# => Hello **world**

Example:

from justhtml import JustHTML

html = """
<div>
  <h1>Title</h1>
  <p>Hello <b>world</b> and <a href="https://example.com">links</a>.</p>
  <ul>
    <li>First item</li>
    <li>Second item</li>
  </ul>
  <pre>code block</pre>
</div>
"""

doc = JustHTML(html)
print(doc.to_markdown())

Output:

# Title

Hello **world** and [links](https://example.com).

- First item
- Second item

```
code block
```
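
The advice above about escaping Markdown source before embedding it in an HTML page can be sketched with the standard library's html.escape. The markdown_source string here is a hard-coded stand-in for a to_markdown() result so that the snippet runs without JustHTML installed; the <pre> wrapper is just one possible way to display it.

```python
import html

# Stand-in for a string returned by doc.to_markdown(); it contains raw HTML
# for an image, as the Markdown output may.
markdown_source = 'Hello **world**\n\n<img src="cat.png" alt="A cat">'

# Escape before placing the Markdown *source* inside HTML, e.g. in a <pre> block.
escaped = html.escape(markdown_source)
page = f"<pre>{escaped}</pre>"
print(page)
```

After escaping, the browser shows the Markdown source as literal text instead of interpreting the embedded <img> tag.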

Which should I use?

  • Use to_text() when you need plain text with no HTML in the output. With separator="" and strip=False it returns the raw concatenated text of a subtree, close to textContent semantics.
  • Use to_markdown() when you want readable, structured Markdown from the sanitized DOM.