Article extraction benchmark: open-source libraries and commercial services

February 5, 2026 · View on GitHub

We evaluate the quality of article body extraction for commercial services Zyte Automatic Extraction (ours) <https://www.zyte.com/data-types/news-scraping-api/>, Diffbot <https://www.diffbot.com/> and open-source libraries newspaper4k <https://github.com/AndyTheFactory/newspaper4k>, readability-lxml <https://github.com/buriy/python-readability>, dragnet <https://github.com/dragnet-org/dragnet>, boilerpipe <https://github.com/misja/python-boilerpipe>, html-text <https://github.com/TeamHG-Memex/html-text>, trafilatura <https://github.com/adbar/trafilatura>, go-trafilatura <https://github.com/markusmobius/go-trafilatura>, go-readability <https://github.com/go-shiori/go-readability>, readeck/go-readability <https://codeberg.org/readeck/go-readability>, Readability.js <https://github.com/mozilla/readability>, Go-DomDistiller <https://github.com/markusmobius/go-domdistiller>. news-please <https://github.com/fhamborg/news-please>. Goose3 <https://github.com/goose3/goose3>, inscriptis <https://github.com/weblyzard/inscriptis>, html2text <https://github.com/Alir3z4/html2text>, jusText <https://github.com/miso-belica/jusText>, BeautifulSoup <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>. Rust crates: august <https://crates.io/crates/august>, boilerpipe <https://crates.io/crates/boilerpipe>, dom_smoothie <https://crates.io/crates/dom_smoothie>, fast_html2md <https://crates.io/crates/fast_html2md>, htmd <https://crates.io/crates/htmd>, html2md-rs <https://crates.io/crates/html2md-rs>, html2text <https://crates.io/crates/html2text>, llm_readability <https://crates.io/crates/llm_readability>, mdka <https://crates.io/crates/mdka>, nanohtml2text <https://crates.io/crates/nanohtml2text>, readability <https://crates.io/crates/readability>, readable-readability <https://crates.io/crates/readable-readability>, readabilityrs <https://crates.io/crates/readabilityrs>, rs_trafilatura <https://github.com/Murrough-Foley/rs-trafilatura>_. We release evaluation datasets and scripts, and provide more details in a whitepaper.

Article extraction is a task of extracting certain fields of an article (e.g. news or blog post), such as headline, article body, publication date, authors, etc. Article extraction systems must work on any web-site. Here we evaluate only the article body field, as this is one of the most important fields and one of the hardest to get right.

.. contents::

Results

Results of the latest evaluation::

                    version      F1             precision      recall         accuracy     
august               2.4.0        0.471 ± 0.018  0.312 ± 0.015  0.955 ± 0.005  0.000 ± 0.000
beautifulsoup        4.13.5       0.665 ± 0.015  0.499 ± 0.017  0.994 ± 0.001  0.000 ± 0.000
boilerpipe_rs        0.6.0        0.739 ± 0.022  0.761 ± 0.022  0.717 ± 0.027  0.000 ± 0.000
dom_smoothie         0.14.0       0.865 ± 0.008  0.785 ± 0.012  0.963 ± 0.005  0.055 ± 0.018
fast_html2md         0.0.58       0.515 ± 0.018  0.351 ± 0.016  0.967 ± 0.003  0.000 ± 0.000
go_domdistiller      25b8d04      0.927 ± 0.007  0.901 ± 0.010  0.956 ± 0.010  0.061 ± 0.018
go_readability       9f5bf5c      0.934 ± 0.009  0.900 ± 0.011  0.971 ± 0.009  0.188 ± 0.029
go_readability_fork  fb0fbc5      0.947 ± 0.005  0.914 ± 0.008  0.982 ± 0.004  0.166 ± 0.027
go_trafilatura       ae7ea06      0.960 ± 0.007  0.940 ± 0.009  0.980 ± 0.006  0.287 ± 0.033
goose3               3.1.20       0.896 ± 0.015  0.940 ± 0.013  0.856 ± 0.020  0.232 ± 0.032
htmd                 0.5.0        0.184 ± 0.014  0.102 ± 0.008  0.970 ± 0.003  0.000 ± 0.000
html-text            0.7.0        0.665 ± 0.015  0.500 ± 0.017  0.994 ± 0.001  0.000 ± 0.000
html2md_rs           0.10.2       0.150 ± 0.021  0.142 ± 0.031  0.160 ± 0.026  0.000 ± 0.000
html2text            2025.4.15    0.662 ± 0.016  0.499 ± 0.018  0.983 ± 0.002  0.000 ± 0.000
html2text_rs         0.16.7       0.438 ± 0.018  0.283 ± 0.015  0.965 ± 0.004  0.000 ± 0.000
inscriptis           2.6.0        0.679 ± 0.015  0.517 ± 0.018  0.992 ± 0.001  0.000 ± 0.000
justext              3.0.2        0.804 ± 0.018  0.858 ± 0.016  0.756 ± 0.028  0.088 ± 0.021
llm_readability      0.0.13       0.829 ± 0.020  0.851 ± 0.019  0.809 ± 0.026  0.077 ± 0.020
mdka                 1.6.5        0.435 ± 0.018  0.285 ± 0.015  0.914 ± 0.014  0.000 ± 0.000
nanohtml2text        0.2.1        0.469 ± 0.018  0.309 ± 0.015  0.973 ± 0.002  0.000 ± 0.000
readability          0.8.4.1      0.922 ± 0.013  0.913 ± 0.014  0.931 ± 0.015  0.315 ± 0.034
readability_js       0.6.0        0.947 ± 0.005  0.914 ± 0.008  0.982 ± 0.003  0.166 ± 0.028
readability_rs       0.3.0        0.873 ± 0.017  0.906 ± 0.015  0.843 ± 0.023  0.227 ± 0.031
readabilityrs        0.1.2        0.832 ± 0.012  0.745 ± 0.015  0.943 ± 0.011  0.028 ± 0.012
readable_readability 0.4.0        0.884 ± 0.015  0.886 ± 0.015  0.881 ± 0.020  0.177 ± 0.028
rs_trafilatura       9261e08      0.970 ± 0.004  0.951 ± 0.006  0.990 ± 0.003  0.287 ± 0.032
trafilatura          2.0.0        0.958 ± 0.006  0.938 ± 0.009  0.978 ± 0.006  0.293 ± 0.034
xpath-text           5.4.0        0.394 ± 0.019  0.246 ± 0.015  0.992 ± 0.001  0.000 ± 0.000

Results of the previous evaluation that did not re-run::

                     version    F1             precision      recall         accuracy
newspaper4k          0.9.3.1    0.949 ± 0.008  0.964 ± 0.008  0.934 ± 0.011  0.326 ± 0.033
news_please          1.6.16     0.948 ± 0.008  0.964 ± 0.008  0.933 ± 0.011  0.326 ± 0.034
readability-lxml     0.8.4.1    0.922 ± 0.013  0.913 ± 0.014  0.931 ± 0.015  0.315 ± 0.034

Results of the initial evaluation, done in November 2019::

                  version   F1             precision      recall         accuracy
AutoExtract       Nov 2019  0.970 ± 0.005  0.984 ± 0.002  0.956 ± 0.010  0.470 ± 0.037
Diffbot           Nov 2019  0.951 ± 0.010  0.958 ± 0.009  0.944 ± 0.013  0.348 ± 0.038
boilerpipe        ab3694d   0.860 ± 0.016  0.850 ± 0.016  0.870 ± 0.020  0.006 ± 0.006
dragnet           1b65e7b   0.907 ± 0.014  0.925 ± 0.013  0.889 ± 0.019  0.221 ± 0.030
html-text         0.5.1     0.665 ± 0.015  0.500 ± 0.017  0.994 ± 0.001  0.000 ± 0.000
newspaper3k       0.2.8     0.912 ± 0.014  0.917 ± 0.014  0.906 ± 0.018  0.260 ± 0.032
readability-lxml  0.7.1     0.922 ± 0.014  0.913 ± 0.014  0.931 ± 0.016  0.315 ± 0.035
xpath-text        4.4.2     0.394 ± 0.020  0.246 ± 0.016  0.992 ± 0.001  0.000 ± 0.000

Earlier results from April 2021::

                  version   F1             precision      recall         accuracy
trafilatura       0.5.1     0.945 ± 0.009  0.925 ± 0.011  0.966 ± 0.009  0.221 ± 0.031
go_readability    bdc8717   0.943 ± 0.007  0.912 ± 0.009  0.975 ± 0.007  0.210 ± 0.030
readability_js    Feb 2021  0.887 ± 0.012  0.853 ± 0.013  0.924 ± 0.012  0.149 ± 0.026
go_domdistiller   1c90a88   0.927 ± 0.007  0.901 ± 0.010  0.956 ± 0.010  0.066 ± 0.018
news_please       1.5.17    0.911 ± 0.014  0.917 ± 0.013  0.906 ± 0.018  0.249 ± 0.032
goose3            3.1.8     0.887 ± 0.016  0.930 ± 0.015  0.847 ± 0.021  0.227 ± 0.032
inscriptis        1.1.2     0.679 ± 0.015  0.517 ± 0.017  0.993 ± 0.001  0.000 ± 0.000
html2text         2020.1.16 0.662 ± 0.015  0.499 ± 0.017  0.983 ± 0.002  0.000 ± 0.000
justext           2.2.0     0.802 ± 0.018  0.858 ± 0.017  0.754 ± 0.028  0.088 ± 0.021
beautifulsoup     4.9.3     0.665 ± 0.015  0.499 ± 0.017  0.994 ± 0.001  0.000 ± 0.000

Below you can find more details about the packages and result reproduction.

More details

More details are available:

In the whitepaper at https://www.zyte.com/whitepaper-ebook/in-depth-analysis-and-evaluation-on-the-quality-of-article-body-extraction/
In a technical report attached to the v1.0.0 release at https://github.com/scrapinghub/article-extraction-benchmark/releases/tag/v1.0.0

Installation

Clone this repo, and use Python 3.6+.

Evaluation does not require any dependencies. Dependencies listed in requirements.txt are only for re-generating output files for open-source article extraction libraries. See below for their installation details.

Data

JSON data format: a dictionary which maps item ids to dictionaries, with the following fields:

articleBody: text of the article
url: page url (optional)

All files should have the same keys. Ground truth is in ground-truth.json, predictions from different systems is in output/*.json files.

Prediction files in output/*.json may optionally be wrapped to include the system/library version used to generate them::

{
    "version": "2.0.0",
    "output": { "<item_id>": { "articleBody": "..." }, ... }
}

HTML files are in html folder. They were fetched with Splash headless browser with JS disabled by default. They are gzip-compressed and utf-8 encoded.

Screenshots of all pages are not in the repo, they are available on github in the "Releases" section: https://github.com/scrapinghub/article-extraction-benchmark/releases

Open-source libraries

In addition to benchmarking AutoExtract and Diffbot services, we also benchmark several open-source libraries that work directly on HTML files without a need for rendering or external resources:

newspaper4k: https://github.com/AndyTheFactory/newspaper4k
readability-lxml: https://github.com/buriy/python-readability
dragnet: https://github.com/dragnet-org/dragnet
boilerpipe: https://github.com/misja/python-boilerpipe
html-text: https://github.com/TeamHG-Memex/html-text - this is a baseline which extracts the full text of HTML page
trafilatura: https://github.com/adbar/trafilatura contributed by the author at https://github.com/scrapinghub/article-extraction-benchmark/pull/4
go-trafilatura: https://github.com/markusmobius/go-trafilatura
go-readability: https://github.com/go-shiori/go-readability
readeck/go-readability: https://codeberg.org/readeck/go-readability/src/branch/main/FORK.md
Readability.js: https://github.com/mozilla/readability
Go-DomDistiller: https://github.com/markusmobius/go-domdistiller
news-please: https://github.com/fhamborg/news-please
Goose3: https://github.com/goose3/goose3
inscriptis: https://github.com/weblyzard/inscriptis - converts HTML to text with a particular emphasis on nested tables
html2text: https://github.com/Alir3z4/html2text - converts HTML pages to Markup language
jusText: https://github.com/miso-belica/jusText - Heuristic based boilerplate removal tool
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ - Python library for pulling data out of HTML and XML files.

Rust crates


- august (MIT): https://crates.io/crates/august
- boilerpipe (MIT): https://crates.io/crates/boilerpipe
- dom_smoothie (MIT): https://crates.io/crates/dom_smoothie
- fast_html2md (MIT): https://crates.io/crates/fast_html2md
- htmd (Apache-2.0): https://crates.io/crates/htmd
- html2md-rs (MIT): https://crates.io/crates/html2md-rs
- html2text (MIT): https://crates.io/crates/html2text
- llm_readability (MIT): https://crates.io/crates/llm_readability
- mdka (Apache-2.0): https://crates.io/crates/mdka
- nanohtml2text (MIT): https://crates.io/crates/nanohtml2text
- readability (MIT): https://crates.io/crates/readability
- readable-readability (MIT): https://crates.io/crates/readable-readability
- readabilityrs (Apache-2.0): https://crates.io/crates/readabilityrs
- rs_trafilatura (MIT OR Apache-2.0): https://github.com/Murrough-Foley/rs-trafilatura

Output from these libraries is already present in the repo in ``output/*.json`` files.
They were generated with ``extractors/run_*.py`` files.

You can re-generate output JSON files with:

    python3 -m venv ./venv
    source ./venv/bin/activate
    make run-all

This will install Python dependencies from ``requirements.txt`` into a
`virtual environment <https://docs.python.org/3/library/venv.html>`_

Evaluation
----------

For evaluation, run::

    python3 evaluate.py

We report precision, recall, F1, accuracy and their standard deviation estimated with bootstrap.
Please refer to the technical report for more details.

License
-------

License is MIT.