FLD Task
December 5, 2024
This repository provides utility modules for FLD, such as a corpus loader, a corpus serializer, and metrics calculators.
See the entry-point repository for an overview of the whole FLD project.
Release Branches (READ CAREFULLY to determine which branch suits you)
There are currently three release branches:
NeurIPS_2024 branch (2024-12)
NLP_2024_KOBE_BEEF branch (2024-01-24)
ICML_2023 branch (2023-08-22)
Please read the instructions in the other FLD repositories CAREFULLY to determine which branch you need.
Installation
pip install -e .
python -c "import nltk; nltk.download('punkt')"
Making Prompt-Output Pairs from FLD Corpora
Once the raw FLD corpora are created by FLD-generator, we have to prepare prompt-output pairs for LLM training as follows:
python ./scripts/serialize.py \
--train {train_jsonl_path} \
--valid {valid_jsonl_path} \
--test {test_jsonl_path} \
--output-dir {output_dir}
This command will output examples with added prompt_serial and proof_serial fields, corresponding to the prompt and output of the LLMs, respectively.
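To check the result, the serialized examples can be inspected with a few lines of Python. The following is a minimal sketch, not part of this repository: the helper name read_serialized_pairs is hypothetical, and it assumes the serialized output is a JSONL file (one JSON object per line) in which every example carries the prompt_serial and proof_serial fields described above.

```python
import json
from pathlib import Path


def read_serialized_pairs(jsonl_path, limit=3):
    """Return up to `limit` (prompt, output) pairs from a serialized JSONL file.

    Hypothetical helper for inspection only. Assumes one JSON object per line,
    each with `prompt_serial` and `proof_serial` fields.
    """
    pairs = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            example = json.loads(line)
            # prompt_serial is the LLM prompt, proof_serial the expected output
            pairs.append((example["prompt_serial"], example["proof_serial"]))
            if len(pairs) >= limit:
                break
    return pairs
```

Pass it the path of one of the files written to {output_dir} and print the returned pairs to confirm that the prompts and outputs look as expected before training.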
(Optional) Pushing to the Hugging Face Hub
python ./scripts/push_to_hub.py \
--train {serialized_train_jsonl_path} \
--valid {serialized_valid_jsonl_path} \
--test {serialized_test_jsonl_path} \
--repo-id {your_name/dataset_name} \
--config-name default