web-scraping.md

July 15, 2021

Bookmarks tagged [web-scraping]

www.codever.land/bookmarks/t/web-scraping

GitHub - jsdom/jsdom

https://github.com/jsdom/jsdom

jsdom is a pure-JavaScript implementation of many web standards, notably the WHATWG DOM and HTML Standards, for use with Node...


Advanced web spidering with Puppeteer

https://blog.kowalczyk.info/article/ea07db1b9bff415ab180b0525f3898f6/advanced-web-spidering-with-pup...

Puppeteer is a Node.js library that makes it easy to do advanced web scraping and spidering. Older generations of web scraping and spidering tools would grab and analyze HTML pages as returned by a web...


cola

https://github.com/chineking/cola

A distributed crawling framework.


feedparser

https://pythonhosted.org/feedparser/

Universal feed parser.


grab

https://github.com/lorien/grab

Site scraping framework.


MechanicalSoup

https://github.com/MechanicalSoup/MechanicalSoup

A Python library for automating interaction with websites.


portia

https://github.com/scrapinghub/portia

Visual scraping for Scrapy.


pyspider

https://github.com/binux/pyspider

A powerful spider system.


robobrowser

https://github.com/jmcarp/robobrowser

A simple, Pythonic library for browsing the web without a standalone web browser.


scrapy

https://scrapy.org/

A fast high-level screen scraping and web crawling framework.