OpenEDGAR by LexPredict
July 11, 2018
Setup and Installation Guide
OpenEDGAR is designed to run on Amazon Web Services, which provides high-quality, reliable Internet access and intra-datacenter access to Amazon S3 for storage. Users can run OpenEDGAR from outside of AWS, but an AWS account is still required for S3 usage, and performance will be substantially reduced.
Server Setup
1. Launch an EC2 instance
2. Update all packages
   a. $ sudo apt update
   b. $ sudo apt upgrade
3. Reboot
4. Format and mount disks (optional)
   a. $ sudo mkfs.ext4 /dev/nvme1n1
   b. Add the new filesystem to /etc/fstab
   c. Reboot to test the mount
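As an illustration of step 4b, a typical /etc/fstab entry for the data volume might look like the line below. The device name and the /data mount point are assumptions (the /data path matches the optional Postgres and RabbitMQ moves later in this guide); adjust both to your instance. The nofail option keeps the instance bootable if the volume is missing.

```
# <device>      <mount point>  <type>  <options>        <dump> <pass>
/dev/nvme1n1    /data          ext4    defaults,nofail  0      2
```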
Required Software Setup
1. Install Python:
   $ sudo apt install build-essential python3-dev python3-pip virtualenv
2. Install Postgres:
   $ sudo apt install postgresql-9.5 postgresql-client-common libpq-dev
3. Install Oracle Java
   a. $ sudo add-apt-repository ppa:webupd8team/java
   b. $ sudo apt-get update
   c. $ sudo apt-get install oracle-java8-installer oracle-java8-set-default oracle-java8-unlimited-jce-policy
   d. Verify the installation:
      $ java -version
OpenEDGAR Setup
1. Clone the repository (you may need to ensure you have permission to create a directory under /opt)
   a. $ cd /opt
   b. $ git clone https://github.com/LexPredict/openedgar.git
2. Set up the virtual environment
   a. $ cd /opt/openedgar
   b. $ virtualenv -p /usr/bin/python3 env
   c. $ ./env/bin/pip install -r lexpredict_openedgar/requirements/full.txt
3. Set up the database. Note that the password chosen for the openedgar user must later be set as DJANGO_PASSWORD in the .env file.
   a. $ sudo -u postgres createuser -l -P -s openedgar
   b. $ sudo -u postgres createdb -O openedgar openedgar
   c. Move the PG data folder (optional):
      $ sudo systemctl stop postgresql
      $ sudo systemctl status postgresql
      $ sudo mv /var/lib/postgresql /data
      $ sudo ln -s /data/postgresql /var/lib/postgresql
      $ sudo chown -R postgres:postgres /var/lib/postgresql
      $ sudo systemctl start postgresql
      $ sudo systemctl status postgresql
      $ sudo -u postgres psql
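For reference, with the user and database created above, the DATABASE_URL value set later in .env would take the form below. This assumes OpenEDGAR accepts a standard dj-database-url-style Postgres connection string and that the database runs on the same host with the default port; the password placeholder is whatever you chose for the openedgar user.

```
DATABASE_URL=postgres://openedgar:<your DJANGO_PASSWORD>@localhost:5432/openedgar
```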
4. Install and configure RabbitMQ
   a. $ wget https://packages.erlang-solutions.com/erlang-solutions_1.0_all.deb
   b. $ sudo dpkg -i erlang-solutions_1.0_all.deb
   c. $ sudo apt update
   d. $ sudo apt install rabbitmq-server
   e. $ sudo rabbitmqctl add_user openedgar openedgar
   f. $ sudo rabbitmqctl add_vhost openedgar
   g. $ sudo rabbitmqctl set_permissions -p openedgar openedgar ".*" ".*" ".*"
   h. Move the rabbitmq data folder (optional):
      $ sudo systemctl stop rabbitmq-server.service
      $ sudo mv /var/lib/rabbitmq /data/
      $ sudo ln -s /data/rabbitmq /var/lib/rabbitmq
      $ sudo chown -R rabbitmq:rabbitmq /var/lib/rabbitmq
      $ sudo systemctl start rabbitmq-server.service
      $ sudo systemctl status rabbitmq-server.service
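With the user, password, and vhost created above (all "openedgar"), the CELERY_BROKER_URL value set later in .env would take the standard AMQP form below. localhost and the default port 5672 are assumptions for a single-server install; change them if RabbitMQ runs elsewhere.

```
CELERY_BROKER_URL=amqp://openedgar:openedgar@localhost:5672/openedgar
```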
5. Update the .env file. For local testing (downloading files locally instead of to S3), set CLIENT_TYPE to LOCAL and DOWNLOAD_PATH to a local path.
   a. $ cp lexpredict_openedgar/sample.env lexpredict_openedgar/.env
   b. Update DATABASE_URL
   c. Update CELERY_BROKER_URL
   d. Set up an AWS S3 bucket
   e. Set up an IAM policy:
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "[REPLACE:unique ID]",
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": ["arn:aws:s3:::[REPLACE:your bucket]"]
          },
          {
            "Sid": "[REPLACE:unique ID]",
            "Effect": "Allow",
            "Action": ["s3:*"],
            "Resource": ["arn:aws:s3:::[REPLACE:your bucket]/*"]
          }
        ]
      }
   f. Update S3_ACCESS_KEY, S3_SECRET_KEY, and S3_BUCKET
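A sketch of how the S3-related lines in .env might look after step f. All values are placeholders, and the exact value formats are assumptions; use credentials for an IAM user restricted by the policy above. The local-mode alternative uses the CLIENT_TYPE and DOWNLOAD_PATH settings mentioned at the start of this step, with an assumed example path.

```
S3_ACCESS_KEY=AKIA...                 # access key ID for the restricted IAM user
S3_SECRET_KEY=...                     # matching secret access key
S3_BUCKET=your-openedgar-bucket       # bucket name from step d

# Alternatively, for local testing without S3:
# CLIENT_TYPE=LOCAL
# DOWNLOAD_PATH=/data/edgar           # example path; any writable directory works
```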
6. Initial database migration
   a. $ cd /opt/openedgar/lexpredict_openedgar
   b. $ source ../env/bin/activate
   c. $ source .env
   d. $ python manage.py migrate
7. Set up and run Apache Tika
   a. $ cd /opt/openedgar/tika
   b. $ bash download_tika.sh
   c. $ bash run_tika.sh (run with &, nohup, or as a service)
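One way to satisfy the "run as a service" option in step c is a minimal systemd unit. The sketch below is an assumption, not part of the OpenEDGAR repository; the paths follow the layout used in this guide, and the User line assumes the default ubuntu account. The same pattern works for the Celery worker in the next step.

```
# /etc/systemd/system/tika.service
[Unit]
Description=Apache Tika server for OpenEDGAR
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/opt/openedgar/tika
ExecStart=/bin/bash run_tika.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Enable and start it with `sudo systemctl enable --now tika.service`.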
8. Set up Celery
   a. $ cd /opt/openedgar/lexpredict_openedgar
   b. $ source ../env/bin/activate
   c. $ source .env
   d. $ bash scripts/run_celery.sh (run with &, nohup, or as a service)
Sample Database Construction
1. Build a database of 10-Ks from 2018 from the latest SEC EDGAR data
   a. $ cd /opt/openedgar/lexpredict_openedgar
   b. $ source ../env/bin/activate
   c. $ source .env
   d. $ python manage.py shell_plus
   e. Retrieve all 10-Ks from 2018:
      >>> from openedgar.processes.edgar import download_filing_index_data, process_all_filing_index
      >>> download_filing_index_data(year=2018)
      >>> process_all_filing_index(year=2018, form_type_list=["10-K"])
   f. Sample timing on m5.large (2 cores, 8 GB RAM): approximately 24 hours to retrieve and parse all 2018 10-Ks
   g. Sample statistics for 2018 10-Ks as of May:
      # Data on S3
      Size of edgar/ on S3:           Objects: 1645    Size: 2.4 GB
      Size of documents/raw/ on S3:   Objects: 135497  Size: 2.1 GB
      Size of documents/text/ on S3:  Objects: 130469  Size: 1000.4 MB

      # Data in Postgres
      In [7]: Filing.objects.count()
      Out[7]: 1521
      In [8]: FilingDocument.objects.count()
      Out[8]: 147598
      In [9]: Company.objects.count()
      Out[9]: 1451
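The download_filing_index_data call above works against EDGAR's quarterly full-index files. As a standalone illustration of what a year's worth of index retrieval covers (this is not OpenEDGAR code; the URL layout is an assumption based on EDGAR's public full-index structure):

```python
# Build the quarterly form-index URLs that cover one filing year.
# EDGAR publishes one index per quarter under full-index/<year>/QTR<n>/.
def edgar_form_index_urls(year):
    base = "https://www.sec.gov/Archives/edgar/full-index"
    return [f"{base}/{year}/QTR{q}/form.idx" for q in range(1, 5)]

print(edgar_form_index_urls(2018)[0])
# → https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/form.idx
```

Each form.idx file lists form type, company, CIK, date, and filing path, which is the information process_all_filing_index filters on via its form_type_list parameter.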