FLD Task

December 5, 2024

This repository provides FLD utility modules, such as a corpus loader, a corpus serializer, and metric calculators.

See the entry-point repository for an overview of the whole FLD project.

Release Branches (READ CAREFULLY to determine which branch suits you)

We currently have three branches:

  • NeurIPS_2024 branch (2024-12)
  • NLP_2024_KOBE_BEEF branch (2024-01-24)
  • ICML_2023 branch (2023-08-22)

Please read the instructions in the other FLD repositories CAREFULLY to determine which branch you need.

Installation

pip install -e .
python -c "import nltk; nltk.download('punkt')"

Making Prompt-Output Pairs from FLD Corpora

Once the raw FLD corpora have been created by FLD-generator, prepare prompt-output pairs for LLM training as follows:

python ./scripts/serialize.py  \
    --train {train_jsonl_path}  \
    --valid {valid_jsonl_path}  \
    --test {test_jsonl_path}  \
    --output-dir {output_dir}

This command will output examples with added prompt_serial and proof_serial fields, corresponding to the prompt and output of the LLMs, respectively.
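
To sanity-check the serialized output before training, it can help to read a few prompt-output pairs back. The following is a minimal sketch, not part of this repository: it assumes the serialized files are JSON Lines (one JSON object per line, like the input corpora), and the helper name iter_prompt_output_pairs is hypothetical. Only the prompt_serial and proof_serial field names come from the description above.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_prompt_output_pairs(jsonl_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (prompt_serial, proof_serial) pairs from a serialized file.

    Hypothetical helper: assumes one JSON object per line, each carrying
    the prompt_serial and proof_serial fields added by serialize.py.
    """
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            example = json.loads(line)
            yield example["prompt_serial"], example["proof_serial"]
```

Calling list(iter_prompt_output_pairs(path)) on one of the files written to {output_dir} should give the pairs in file order. The exact file names inside {output_dir} are not specified here, so check the directory contents first.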

(Optional) Pushing to the Hugging Face Hub

python ./scripts/push_to_hub.py  \
    --train {serialized_train_jsonl_path}  \
    --valid {serialized_valid_jsonl_path}  \
    --test {serialized_test_jsonl_path}  \
    --repo-id {your_name/dataset_name}  \
    --config-name default