web-content-extracting.md

July 15, 2021

Bookmarks tagged [web-content-extracting]

www.codever.land/bookmarks/t/web-content-extracting

html2text

https://github.com/Alir3z4/html2text

Convert HTML to Markdown-formatted text.


lassie

https://github.com/michaelhelmick/lassie

Web Content Retrieval for Humans.


micawber

https://github.com/coleifer/micawber

A small library for extracting rich content from URLs.


newspaper

https://github.com/codelucas/newspaper

News extraction, article extraction and content curation in Python.


python-readability

https://github.com/buriy/python-readability

Fast Python port of arc90's readability tool.


requests-html

https://github.com/kennethreitz/requests-html

Pythonic HTML Parsing for Humans.


sumy

https://github.com/miso-belica/sumy

A module for automatic summarization of text documents and HTML pages.


textract

https://github.com/deanmalmgren/textract

Extract text from any document: Word, PowerPoint, PDFs, etc.


toapi

https://github.com/gaojiuli/toapi

Every web site provides APIs.