Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models
June 18, 2024
Environment Setup
conda create --name mathador python=3.11 -y
conda activate mathador
pip install -r requirements.txt
- Get your personal API key for any of the following providers: OpenAI, TogetherAI, Anthropic.
- Open eval.yaml and configure which models to evaluate. We provide examples for all three model providers.
Usage
For convenience, we include the mathador-10000.jsonl dataset that we used for some runs.
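Before running an evaluation, it can help to check what the attached dataset contains. The sketch below is not part of the repository: iter_jsonl and summarize are hypothetical helper names, and since this README does not describe the record schema, the code only counts records and lists whatever top-level keys it finds. It assumes each non-empty line of the file is one JSON object.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(records: list[dict]) -> dict:
    """Return the record count and the sorted union of top-level keys."""
    keys: set[str] = set()
    for record in records:
        keys.update(record)  # adds the dict's keys; no schema is assumed
    return {"count": len(records), "keys": sorted(keys)}
```

To use it on the attached file, open mathador-10000.jsonl in text mode, pass the file object to iter_jsonl, and hand the resulting list to summarize.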
If you would like to generate a new instance of the dataset, please configure generate_dataset.yaml and run:
python generate_dataset.py generate_dataset.yaml
To run the Mathador-LM benchmark, specify your desired parameters in eval.yaml and run:
TOGETHER_API_KEY=<your_key> python eval.py eval.yaml
To override arguments from eval.yaml directly on the command line, use:
TOGETHER_API_KEY=<your_key> python eval.py eval.yaml shots=20
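For readers curious how key=value overrides such as shots=20 can be layered over a loaded YAML config, here is a minimal sketch. It is an illustration only and not necessarily how eval.py implements it: parse_override and apply_overrides are hypothetical names, and the value conversion through ast.literal_eval is an assumption.

```python
import ast
from typing import Any


def parse_override(arg: str) -> tuple[str, Any]:
    """Split 'key=value' and convert the value to a Python literal if possible."""
    key, sep, raw = arg.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {arg!r}")
    try:
        value = ast.literal_eval(raw)  # "20" becomes the integer 20
    except (ValueError, SyntaxError):
        value = raw  # not a literal: keep it as a plain string
    return key, value


def apply_overrides(config: dict, args: list[str]) -> dict:
    """Return a copy of config with command-line overrides applied on top."""
    merged = dict(config)  # shallow copy, so the loaded config is untouched
    for arg in args:
        key, value = parse_override(arg)
        merged[key] = value
    return merged
```

With this sketch, apply_overrides(config, ["shots=20"]) returns a copy of config in which shots is the integer 20, leaving every other key unchanged.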
The evaluation results are saved to results.csv.
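To inspect the output programmatically, the standard csv module is enough. The sketch below is hedged: read_results is a hypothetical helper, and because this README does not list the columns of results.csv, the code reports whatever header the file has instead of assuming column names.

```python
import csv
from typing import Iterable


def read_results(lines: Iterable[str]) -> tuple[list[str], list[dict]]:
    """Return (column names, rows) from CSV text lines; columns are not assumed."""
    reader = csv.DictReader(lines)
    rows = list(reader)  # each row maps column name -> string value
    return list(reader.fieldnames or []), rows
```

When reading the real file, open it with open("results.csv", newline="") as the csv module documentation recommends, and pass the file object to read_results.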