web-crawling.md

July 15, 2021

Bookmarks tagged [web-crawling]

www.codever.land/bookmarks/t/web-crawling

anemone

https://github.com/chriskite/anemone

Ruby library and CLI for crawling websites.


LinkThumbnailer

https://github.com/gottfrois/link_thumbnailer

Ruby gem that generates thumbnail images and videos from a given URL, much like the link previews on popular social websites.


Mechanize

https://github.com/sparklemotion/mechanize

Mechanize is a Ruby library that makes automated web interaction easy.


MetaInspector

https://github.com/jaimeiniesta/metainspector

Ruby gem for web scraping purposes.


Upton

https://github.com/propublica/upton

A batteries-included framework for easy web-scraping.


Wombat

https://github.com/felipecsl/wombat

Web scraper with an elegant DSL that parses structured data from web pages.


cola

https://github.com/chineking/cola

A distributed crawling framework.


feedparser

https://pythonhosted.org/feedparser/

Universal feed parser.


grab

https://github.com/lorien/grab

Site scraping framework.


MechanicalSoup

https://github.com/MechanicalSoup/MechanicalSoup

A Python library for automating interaction with websites.


portia

https://github.com/scrapinghub/portia

Visual scraping for Scrapy.


pyspider

https://github.com/binux/pyspider

A powerful spider system.


robobrowser

https://github.com/jmcarp/robobrowser

A simple, Pythonic library for browsing the web without a standalone web browser.


scrapy

https://scrapy.org/

A fast, high-level screen scraping and web crawling framework.


Apache Nutch

https://nutch.apache.org

Highly extensible, highly scalable web crawler for production environments.


Crawler4j

https://github.com/yasserg/crawler4j

Simple and lightweight web crawler.


jsoup

https://jsoup.org

Scrapes, parses, manipulates and cleans HTML.


StormCrawler

http://stormcrawler.net

SDK for building low-latency and scalable web crawlers.


webmagic

https://github.com/code4craft/webmagic

Scalable crawler with downloading, URL management, content extraction, and persistence.