Signals
February 21, 2025 ยท View on GitHub
The signals folder contains scripts and resources for implementing rule-based and model-based filters for dataset curation, as described in the Big-Math paper. These filtering techniques are critical for ensuring the quality, diversity, and relevance of the dataset for reinforcement learning.
๐ Contents
Rule-based filters
add_empty_boxed_signal.py(only useful for extracting answers from full solutions (e.g. from NuminaMath))add_hyperlink_signal.pyadd_language_signal.pyadd_multipartq_signal.pyadd_multiple_choice_signal.pyadd_proof_signal.pyadd_true_false_signal.pyadd_yes_no_signal.py
Deduplication
add_semdedup_signal.py
Model-based signals
model_based_signals.py
Solve rate
rollouts_based_signals/example_solve_rate_script.sh
๐ Getting Started
Prerequisites
- Python3.10+
- Install dependencies:
pip install -r requirements.txt
๐ Usage
Rule-based signals
Rule-based signals are written in separate files, but share the same use pattern. Each signal requires only a single input variable, dataset_path. The script will automatically update the source dataset with a new column.
This is a simple example that will detect whether the problem or solution columns have a hyperlink, adding 2 new columns, has_hyperlink_problem and has_hyperlink_solution:
DATASET_PATH=<your HF dataset path here>
python3 add_hyperlink_signal.py --dataset_path ${DATASET_PATH}
Deduplication
We use SemDeDup to mark duplicated problems. Just like the rule-based signals, this requires only a single input variable, dataset_path. The script, by default, uses a similarity threshold of 0.5, but that can easily be adjusted by adding/removing values from the EPSILONS variable.
DATASET_PATH=<your HF dataset path here>
python3 add_semdedup_signal.py --dataset_path ${DATASET_PATH}
Model-based signals
The model-based signals are all included in a single file for efficiency when running multiple filters. Our scripts make use of SGLang for fast inference.
Model-based signals require a bit more specification than rule-based signals. Here, we need to specify:
model_name: The name of a local or HF modeloutput_model_name: The name of model you want to use in the signal's column name. For example, the below code uses--output_model_name "llama3-70b"and will add the multiple choice signal in a new column: "multiple_choice_llama3-70b"dataset_name: A path to HF or local datasetdataset_name_outputs: Optional - The path to save the updated dataset to. Will save todataset_namewhen not specifiedsave_folder: Local directory to save intermediate outputs tosave_name: The name of the file to save intermediate outputs totp: tensor parallelism- Signal options:
--multiple_choice,--proof,--yes_no,--true_false,--multiple_part
The below example runs all model-based signals using Llama-3.1-70B:
OUTPUT_DATASET_NAME=<your dataset name here>
python3 model_based_filters.py \
--model_name "meta-llama/Meta-Llama-3.1-70B-Instruct"\
--output_model_name "llama3-70b"\
--dataset_name "SynthLabsAI/Big-Math-RL-Verified"\
--dataset_name_outputs "${OUTPUT_DATASET_NAME}"\
--save_folder "signal_outputs"\
--save_name "model_based_signals"\
--tp 8\
--multiple_choice\
--proof\
--yes_no\
--true_false\
--multiple_part\
> model_based_signal_outputs.txt 2>&1
Solve rate
See rollouts_based_signals/example_solve_rate_script.sh for example usage.