Scraping

January 9, 2020 · View on GitHub

Scraping Limitations / Road Blockers:

How many requests can their server take
- How long does it take for the server to handle a request
- Request/second
How many requests can you parallelise:
- From a single process
- From multiple process
How do you track what needs to be scraped
Authentication
- Watch out for
  - Password change reminders
  - Account being locked out
If the page uses JavaScript

Goals:

Maximize number of requests/sec
Less compute resources used

Tools:

Database
Queue

Database:

Avoid nosql
Use a SQL database from the start, since you’ll most likely be exporting/querying it
- Easier to change field names
- Run SQL queries to fix
One table per “type” of page
- One table for the pagination results
- Another table for page results
Another table for consolidated results
- This can be the source of truth
- Hard part may be figuring out what should exist in the consolidated table, but doesn’t

Log to the console:

page/id scraped
time scraped
Time it took to scrape page