web-crawling.md

July 15, 2021

Ruby library and CLI for crawling websites.

tags: ruby, web-crawling

LinkThumbnailer

^{https://github.com/gottfrois/link_thumbnailer}

Ruby gem that generates thumbnail images and videos from a given URL, much like the link previews on popular social websites.

tags: ruby, web-crawling

Mechanize

^{https://github.com/sparklemotion/mechanize}

Mechanize is a Ruby library that makes automated web interaction easy.

tags: ruby, web-crawling

MetaInspector

^{https://github.com/jaimeiniesta/metainspector}

Ruby gem for web scraping purposes.

tags: ruby, web-crawling

Upton

^{https://github.com/propublica/upton}

A batteries-included framework for easy web-scraping.

tags: ruby, web-crawling

Wombat

^{https://github.com/felipecsl/wombat}

Web scraper with an elegant DSL that parses structured data from web pages.

tags: ruby, web-crawling

cola

^{https://github.com/chineking/cola}

A distributed crawling framework.

tags: python, web-crawling, web-scraping

feedparser

^{https://pythonhosted.org/feedparser/}

Universal feed parser.

tags: python, web-crawling, web-scraping

grab

^{https://github.com/lorien/grab}

Site scraping framework.

tags: python, web-crawling, web-scraping

MechanicalSoup

^{https://github.com/MechanicalSoup/MechanicalSoup}

A Python library for automating interaction with websites.

tags: python, web-crawling, web-scraping

portia

^{https://github.com/scrapinghub/portia}

Visual scraping for Scrapy.

tags: python, web-crawling, web-scraping

pyspider

^{https://github.com/binux/pyspider}

A powerful spider system.

tags: python, web-crawling, web-scraping

robobrowser

^{https://github.com/jmcarp/robobrowser}

A simple, Pythonic library for browsing the web without a standalone web browser.

tags: python, web-crawling, web-scraping

scrapy

^{https://scrapy.org/}

A fast, high-level screen scraping and web crawling framework.

tags: python, web-crawling, web-scraping

Apache Nutch

^{https://nutch.apache.org}

Highly extensible, highly scalable web crawler for production environments.

tags: java, web-crawling

Crawler4j

^{https://github.com/yasserg/crawler4j}

Simple and lightweight web crawler.

tags: java, web-crawling

jsoup

^{https://jsoup.org}

Scrapes, parses, manipulates and cleans HTML.

tags: java, web-crawling

StormCrawler

^{http://stormcrawler.net}

SDK for building low-latency and scalable web crawlers.

tags: java, web-crawling

webmagic

^{https://github.com/code4craft/webmagic}

Scalable crawler with downloading, URL management, content extraction, and persistence.

tags: java, web-crawling
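
Several entries above describe a crawler in terms of the same four stages: downloading, URL management, content extraction, and persistence. As a rough illustration of how those stages fit together, here is a minimal sketch using only the Python standard library. It does not reflect the API of any library listed here. The names `PageParser`, `crawl`, and `FAKE_SITE` are hypothetical, and the in-memory dictionary stands in for real HTTP responses so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class PageParser(HTMLParser):
    """Content extraction: collect the <title> text and all <a href> targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch(url)` returns HTML text, or None on failure."""
    frontier = deque([start_url])  # URL management: pages still to visit
    seen = {start_url}             # URL management: never queue a URL twice
    store = {}                     # persistence (in memory): url -> page title

    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)          # downloading
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        store[url] = parser.title.strip()
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return store


# Hypothetical in-memory "site" used instead of real network requests.
FAKE_SITE = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b#top">B</a>',
    "http://example.test/a": '<title>Page A</title><a href="/">home</a>',
    "http://example.test/b": '<title>Page B</title><a href="/a">A</a>',
}

result = crawl("http://example.test/", FAKE_SITE.get)
```

Passing the downloader in as `fetch` keeps the sketch testable. A real crawler would swap in an HTTP client and would also need politeness controls such as robots.txt handling and rate limiting, which are left out here.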