README.md
March 5, 2020
This code represents the build process for wordfreq, among other things. I've made it public because it's good to know where the data in wordfreq comes from. However, I make no promises that you'll be able to run it if you don't work at Luminoso.
## Dependencies
Exquisite Corpus makes use of various libraries and command-line tools to process data correctly and efficiently. As something that is run on a development machine, it uses the best, fastest libraries it can, though this leads to somewhat complex system requirements.
You will need these programming environments installed:
- Python 3.4 or later
- Haskell, installed with `haskell-stack`, used to compile and run `wikiparsec`
You also need certain tools to be available:
- The C library for `mecab` (`apt install libmecab-dev`)
- The ICU Unicode libraries (`apt install libicu-dev`)
- The JSON processor `jq` (`apt install jq`)
- The XML processor `xml2` (`apt install xml2`)
- The HTTP downloader `curl` (`apt install curl`)
- `wikiparsec` (https://github.com/LuminosoInsight/wikiparsec)
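Before installing anything, it can help to see which of the command-line tools are already on your `PATH`. A minimal sketch (the tool names come from the list above; `mecab` and ICU are C libraries, so they are checked at compile time rather than on the `PATH`):

```shell
# Report which of the required command-line tools are already installed.
for tool in jq xml2 curl stack; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "MISSING: $tool"
  fi
done
```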
## Installation
Some steps here probably need to be filled in better.
- Install system-level dependencies:

  ```
  apt install python3-dev haskell-stack libmecab-dev libicu-dev jq xml2 curl
  ```

- Clone, build, and install `wikiparsec`:

  ```
  git clone https://github.com/LuminosoInsight/wikiparsec
  cd wikiparsec
  stack install
  ```
- If building alignment files to get alignments for a parallel corpus:
  - Compile `fast_align` by following the instructions at https://github.com/clab/fast_align
  - Create a symbolic link to the `fast_align` executable inside this directory (the executable is found in the directory where `fast_align` was compiled)
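In practice, the `fast_align` steps might look like the following sketch. The paths under `~/src/` are hypothetical placeholders for wherever you keep your checkouts; `fast_align`'s own README describes its CMake build in detail.

```shell
# Hypothetical layout: both repos cloned under ~/src/; adjust paths to taste.
git clone https://github.com/clab/fast_align ~/src/fast_align
cd ~/src/fast_align
mkdir build && cd build
cmake .. && make          # produces the fast_align executable in build/

# Link the executable into the exquisite-corpus checkout
# (replace ~/src/exquisite-corpus with your actual checkout path):
cd ~/src/exquisite-corpus
ln -s ~/src/fast_align/build/fast_align fast_align
```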
- Finally, return to this directory and install `exquisite-corpus` itself, along with the Python dependencies it manages:

  ```
  pip install -e .
  ```
## Getting data
Most of the data in Exquisite Corpus will be downloaded from places where it can be found on the Web. However, one input must be provided separately: Twitter data cannot be distributed due to the Twitter API's terms of use.
If you have a collection of tweets, put their text in `data/raw/twitter-2015.txt`, one tweet per line. Or just put an empty file there.
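If you have no tweets at all, an empty placeholder at the path named above keeps the build from failing on the missing input:

```shell
# Create an empty stand-in for the Twitter input (run from the repo root).
mkdir -p data/raw
touch data/raw/twitter-2015.txt
```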
## Building
Make sure you have lots of disk space available in the `data` directory, which may have to be a symbolic link to an external hard disk.
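For example, to point the `data` directory at a larger external volume (the mount point `/mnt/bigdisk` is a hypothetical placeholder for your own disk):

```shell
# Put the real directory on the big disk and symlink it as ./data
# (run from the repo root, before the first build).
mkdir -p /mnt/bigdisk/exquisite-corpus-data
ln -s /mnt/bigdisk/exquisite-corpus-data data
```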
Run:

```
snakemake -j 8
```
...and wait a day or two for results, or a crash that may tell you what you need to fix.
To build the parallel corpus, run `./build.sh parallel`. If you want alignment files for an already-built parallel corpus, or want to build the parallel corpus and the alignments together, run `./build.sh alignment`.