Persistent Data

May 4, 2017

Locations where we store data.

Database

The Machine database is a simple PostgreSQL instance storing metadata about source runs over time, such as timing, status, connection to batches, and links to results files on S3.

Database tables:

Processing results of single sources, including sample data and output CSVs, are added to the runs table.
Groups of runs resulting from GitHub events sent to Webhook are added to the jobs table.
Groups of runs periodically enqueued as a batch are added to the sets table.

Other information:

Complete schema can be found in openaddr/ci/schema.pgsql and in openaddr/ci/coverage/schema.pgsql.
Public URL at machine-db.openaddresses.io.
Lives on an RDS db.t2.micro instance.
Two weeks of nightly backups are kept.

Queue

The queue is used to schedule runs for Worker instances, and its size is used to grow and shrink the Worker pool. The queue is generally empty, and is used only to store temporary data for scheduling runs. We use PQ to implement the queue in Python. Queue data is stored in the same PostgreSQL database as the tables above, but is treated as separate.

There are four queues:

tasks queue contains new runs to be handled.
done queue contains complete runs to be recognized.
due queue contains delayed runs that may have gone overtime.
heartbeat queue contains pings from active workers.

Other information:

Database connection details are reused, including the same machine-db.openaddresses.io public URL.
Queue metrics in CloudWatch are kept up to date by the dequeuer.
CloudWatch alarms on queue length determine the size of the Worker pool.
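
As a rough illustration of how the four queues and their lengths might be read, here is a minimal sketch. The queue names come from the list above; the helper `queue_lengths` is hypothetical and not part of the Machine codebase. It assumes an object that behaves like a PQ instance, where `pq[name]` returns a queue and `len(queue)` reports its size.

```python
# Queue names taken from the list above.
QUEUE_NAMES = ("tasks", "done", "due", "heartbeat")


def queue_lengths(pq):
    """Return a dict mapping each queue name to its current length.

    `pq` is assumed to behave like a PQ instance: indexing by name
    yields a queue object that supports len(). This helper is
    hypothetical and only illustrates the idea.
    """
    return {name: len(pq[name]) for name in QUEUE_NAMES}


# With a real database, the PQ object would typically be built as:
#
#     import psycopg2
#     from pq import PQ
#     pq = PQ(psycopg2.connect(dsn))   # dsn is an assumed connection string
#
# Any mapping of names to sized objects works for a dry run:
if __name__ == "__main__":
    fake = {"tasks": [1, 2], "done": [], "due": [3], "heartbeat": []}
    print(queue_lengths(fake))
```

Numbers like these are the kind of value a length-based alarm could watch, though the actual metric names and thresholds are not specified here.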

S3

We use the S3 bucket data.openaddresses.io to store new and historical data.

S3 access is handled via the Boto library.
Boto expects current AWS credentials in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
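
Because Boto reads credentials from the environment, it can help to check for them before attempting any S3 access. The following is a minimal sketch; the variable names come from the text above, while the helper `missing_aws_credentials` is a hypothetical name introduced for illustration.

```python
import os

# Environment variables named in the text above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(environ=None):
    """Return the names of required AWS variables that are unset or empty.

    Hypothetical helper: pass a dict to check something other than
    the real process environment.
    """
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        print("Missing AWS credentials: " + ", ".join(missing))
    else:
        # With both variables set, a Boto call such as boto.connect_s3()
        # should be able to pick them up from the environment.
        print("AWS credentials found in environment.")
```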

Mapbox

We use the Mapbox API account open-addresses to store a tiled dot map with worldwide locations of address points.

Uploads are handled via the Boto3 library, using credentials granted by the Mapbox API.
