Persistent Data
May 4, 2017
Locations where we store data.
Database
The Machine database is a simple PostgreSQL instance storing metadata about source runs over time, such as timing, status, connection to batches, and links to results files on S3.
Database tables:
- Processing results of single sources, including sample data and output CSVs, are added to the runs table.
- Groups of runs resulting from GitHub events sent to Webhook are added to the jobs table.
- Groups of runs periodically enqueued as a batch are added to the sets table.
Other information:
- Complete schema can be found in openaddr/ci/schema.pgsql and in openaddr/ci/coverage/schema.pgsql.
- Public URL at machine-db.openaddresses.io.
- Lives on an RDS db.t2.micro instance.
- Two weeks of nightly backups are kept.
Queue
The queue is used to schedule runs for Worker instances, and its size is used to grow and shrink the Worker pool. The queue is generally empty and is used only to store temporary data for scheduling runs. We use PQ to implement the queue in Python. Queue data is stored in the same PostgreSQL database as the Machine metadata but is treated as separate.
There are four queues:
- tasks queue contains new runs to be handled.
- done queue contains complete runs to be recognized.
- due queue contains delayed runs that may have gone overtime.
- heartbeat queue contains pings from active workers.
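As a rough sketch of how the four queues might be opened with PQ: the function below takes an already-constructed PQ instance (for example, one built from a psycopg2 connection) and looks up each queue by name. The helper name open_queues and the QUEUE_NAMES tuple are illustrative assumptions, not taken from the Machine code.

```python
# Names of the four queues described above.
QUEUE_NAMES = ('tasks', 'done', 'due', 'heartbeat')


def open_queues(pq):
    """Return a dict mapping each queue name to its queue object.

    `pq` is expected to behave like a pq.PQ instance, where indexing
    by name (pq['tasks']) yields a queue object. This helper is a
    hypothetical convenience, not part of the PQ library.
    """
    return {name: pq[name] for name in QUEUE_NAMES}
```

With a real PQ instance, the tasks entry of the returned dict could then be used to put a new run on the queue, and a Worker could call get on it to receive the next job.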
Other information:
- Database details are re-used, with identical machine-db.openaddresses.io public URL.
- Queue metrics in Cloudwatch are kept up-to-date by dequeuer.
- Queue length Cloudwatch alarms determine size of Worker pool.
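To illustrate how a queue length could be reported as a Cloudwatch metric, here is a minimal sketch that builds the keyword arguments for a put_metric_data call. The namespace, the metric name format, and the function name are all placeholders chosen for this example; the values the dequeuer actually uses are not stated here.

```python
def queue_length_metric(queue_name, length, namespace='example.queues'):
    """Build keyword arguments for a CloudWatch put_metric_data call.

    The namespace default and the metric name format are hypothetical
    placeholders, not the values used by the Machine dequeuer.
    """
    return {
        'Namespace': namespace,
        'MetricData': [{
            'MetricName': '{} queue size'.format(queue_name),
            'Value': length,
            'Unit': 'Count',
        }],
    }
```

Assuming a boto3 CloudWatch client, the result could be passed as client.put_metric_data(**queue_length_metric('tasks', 3)). An alarm on such a metric is what would then trigger growing or shrinking the Worker pool.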
S3
We use the S3 bucket data.openaddresses.io to store new and historical data.
- S3 access is handled via the Boto library.
- Boto expects current AWS credentials in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
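Because S3 access depends on those two environment variables, a quick check before starting work can save a confusing failure later. The sketch below is one possible way to do that; the function name and the idea of checking up front are assumptions for illustration, not part of the Machine code.

```python
import os

# Environment variables that Boto reads AWS credentials from.
REQUIRED_VARS = ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')


def missing_aws_credentials(environ=None):
    """Return the names of required variables that are unset or empty.

    `environ` defaults to os.environ; passing a plain dict makes the
    check easy to exercise without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

An empty list means both variables are present; otherwise the returned names can be reported to the operator before any S3 call is attempted.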
Mapbox
We use the Mapbox API account open-addresses to store a tiled dot map with worldwide locations of address points.
- Uploads are handled via the Boto3 library, using credentials granted by the Mapbox API.