Filtering of CC-En

October 3, 2024 · View on GitHub

Initial Filtering

Step 1: Download CC-En files from Matrix

Run:

python data_processing/basic_data/cc_en_math/initial_filter/download.py

This script generates download the files under data/cc-en.

Step 2: Generate training and testing files for fastText training

Run:

python data_processing/basic_data/cc_en_math/initial_filter/get_train_test_files.py

This script generates training file at data/fasttext_filter-cc-en/math-vs-other_train.txt and test file at data/fasttext_filter-cc-en/math-vs-other_test.txt.

Step 3: Train the fastText model

Run:

bash data_processing/basic_data/cc_en_math/initial_filter/train.sh

for training. This script outputs a fastText model at fasttext_models/model_math-vs-other.bin. For testing, run:

bash data_processing/basic_data/cc_en_math/initial_filter/test.sh

Step 4: filter with the fastText model generated in Step 3

Run:

bash data_processing/basic_data/cc_en_math/initial_filter/filter.sh

This scripts generates filtered files under the directory data/cc-en-math_filtered_orig

Step 5: simplify the files in Step 4

The files in step 4 only set the irrelevant documents to an empty string. Run:

python data_processing/basic_data/cc_en_math/initial_filter/get_filtered_jsonl.py

to remove these empty entries to get the final filtered files under data/cc-en-math_filtered.

Finer Filtering

The filtering in Initial Filtering is coarse, so we conduct a second round of Finer Filtering

Step 2: start Server

Get the Docker from text-generation-inference, and run deploy.sh in the Docker environment to start the server hosting Mixtral-8x7B-Instruct.

On the same machine as the server, run process.sh. Replace START_IDX and INTERVAL with the index and the number of process you are running. For example, if you started 4 processes on four different machines, then the INTERVAL is 4, and START_IDX is 0, 1, 2, 3 respectively.

bash data_processing/basic_data/cc_en_math/finer_filter/process.sh START_IDX INTERVAL

Running this script results in annotation jsonl files saved under data/cc-en-math_filtered_annotated. You can stop the process when part of the data is annotated.

Step 4: get train and test files for fastText

Run:

bash data_processing/basic_data/cc_en_math/finer_filter/filter.sh

This script generates filtered files under data/cc-en-math_filtered-finer_orig.

Step 5: remove the unrelated documents from the filtered files

Run:

python data_processing/basic_data/cc_en_math/finer_filter/get_filtered_jsonl.py

This generates files under data/cc-en-math_filtered-finer.