AudioBench Architecture
May 29, 2026
A short map of how the codebase fits together, so you can find your way around and extend it without reading every file.
Big picture
One evaluation run scores a single (dataset, model, metric) triple. eval.sh is the
thin shell wrapper; it calls the real entry point src/main_evaluate.py:
    bash eval.sh $DATASET $MODEL_NAME $GPU $BATCH_SIZE $OVERWRITE $METRICS $NUMBER_OF_SAMPLES
      │
      └── python src/main_evaluate.py --dataset_name ... --model_name ... --metrics ...
The pipeline inside main_evaluate.py:
    Dataset(dataset_name)                       model = Model(model_name)
      ├─ load_dataset() → raw HF data             └─ load_model() → weights / API client
      └─ data_format()  → dataset_processor
              │                                          │
              │   input_data = processor.prepare_model_input()
              │                                          │
              └────────────► model.generate(input) ──────┘
                                   │
              predictions cached to log_for_all_models/<model>/<dataset>.json
                                   │
              processor.compute_score(records, metrics) → <dataset>_<metric>_score.json
Two short-circuits keep reruns cheap:
- if the _score.json already exists and overwrite=False, the run is skipped entirely;
- if the prediction .json exists, inference is skipped and only scoring re-runs.
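The two short-circuits can be sketched roughly as below. This is a minimal illustration, not the code in src/main_evaluate.py: the function name run_once, its parameters, and the infer / score callables are hypothetical, and only the file-naming pattern is taken from the diagram above. Whether overwrite also discards cached predictions is not stated here; the sketch assumes it does not.

```python
import json
import os
import tempfile


def run_once(log_dir, dataset, metric, overwrite, infer, score):
    """Hypothetical sketch of the two short-circuits (not the real entry point)."""
    pred_path = os.path.join(log_dir, f"{dataset}.json")
    score_path = os.path.join(log_dir, f"{dataset}_{metric}_score.json")

    # Short-circuit 1: the score file exists and overwrite is off.
    if os.path.exists(score_path) and not overwrite:
        return "skipped"

    # Short-circuit 2: reuse cached predictions instead of running inference.
    if os.path.exists(pred_path):
        with open(pred_path) as f:
            records = json.load(f)
        status = "rescored"
    else:
        records = infer()
        os.makedirs(log_dir, exist_ok=True)
        with open(pred_path, "w") as f:
            json.dump(records, f)
        status = "full"

    with open(score_path, "w") as f:
        json.dump(score(records), f)
    return status


# Usage with stand-in inference and scoring functions.
demo_dir = tempfile.mkdtemp()
fake_infer = lambda: [{"prediction": "hello"}]
fake_score = lambda records: {"accuracy": 1.0}
first = run_once(demo_dir, "ds", "acc", False, fake_infer, fake_score)
second = run_once(demo_dir, "ds", "acc", False, fake_infer, fake_score)
print(first, second)  # full skipped
```

Deleting only the score file and rerunning would take the second path: predictions are read from disk and only scoring runs again.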
Directory layout
| Path | Role |
|---|---|
| src/main_evaluate.py | Entry point / orchestrator (the only entry point; invoked by eval.sh). |
| src/model.py | Model class. Dispatches a model name to its implementation via MODEL_REGISTRY. |
| src/model_src/<module>.py | One file per model. Exposes <module>_model_loader / <module>_model_generation. |
| src/dataset.py | Dataset class. Loads raw data and selects the matching processor. |
| src/dataset_src/<name>.py | One file per dataset. Prepares inputs, formats predictions, computes the score. |
| src/dataset_src/eval_methods/ | Scoring backends: string match, and LLM-as-judge (Llama-3, GPT-4o, Prometheus2). |
| src/dataset_src/prompts/, text_normalizer/, math_utils/ | Shared helpers used by processors. |
| examples/ | supported_datasets.md, supported_models.md, and per-model launch scripts. |
| leaderboard/ | Standalone Streamlit app that renders the public leaderboard (not part of eval). |
| vllm_model_judge_llama_3_70b.sh | Serves the Llama-3-70B judge used by *_judge* metrics. |
The two contracts
Model (model_src/<module>.py) exposes two functions:
- <module>_model_loader(self): set up self.model / processor / API client on the Model instance.
- <module>_model_generation(self, input): return the prediction(s) for one input.
model.py resolves these by name through MODEL_REGISTRY, which maps the public model
name to its module. Modules are imported lazily, so a model's (often heavy) dependencies
are only loaded when that model is actually used.
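A minimal sketch of name-based, lazy dispatch follows. It is not the contents of src/model.py: the registry entry, the toy module, and the choice to call the loader from the constructor are all assumptions made for illustration. Only the <module>_model_loader / <module>_model_generation naming convention comes from the text above.

```python
import importlib
import sys
import types

# Stand-in for a file such as src/model_src/toy.py, registered in sys.modules
# so the sketch runs without the repository. Entirely hypothetical.
toy = types.ModuleType("toy")
toy.toy_model_loader = lambda self: setattr(self, "model", "toy-weights")
toy.toy_model_generation = lambda self, input: f"echo:{input}"
sys.modules["toy"] = toy

# Hypothetical registry contents: public model name -> module name.
MODEL_REGISTRY = {"toy-v1": "toy"}


class Model:
    def __init__(self, model_name):
        if model_name not in MODEL_REGISTRY:
            raise ValueError(f"Unknown model: {model_name!r}")
        module_name = MODEL_REGISTRY[model_name]
        # The import happens here rather than at file import time, so models
        # that are never requested never pull in their dependencies.
        module = importlib.import_module(module_name)
        self._loader = getattr(module, f"{module_name}_model_loader")
        self._generation = getattr(module, f"{module_name}_model_generation")
        self._loader(self)

    def generate(self, input):
        return self._generation(self, input)


model = Model("toy-v1")
print(model.generate("hi"))  # echo:hi
```

Keeping the registry as plain name-to-module strings is what makes the laziness possible: building the dictionary costs nothing, and the import is deferred until a specific name is looked up.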
Dataset (dataset_src/<name>.py) is a processor class with three methods:
- prepare_model_input() → list of model inputs (audio + instruction).
- format_model_predictions(inputs, predictions) → records with predictions attached.
- compute_score(records, metrics) → {metric: value, ...} (and optionally details).
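The three-method contract might look like the skeleton below. The class name ToyProcessor, the record fields (audio, instruction, answer, prediction), and the treatment of metrics as a single metric-name string are assumptions for illustration; real processors in dataset_src/ may differ on all of these.

```python
class ToyProcessor:
    """Hypothetical processor illustrating the three-method contract."""

    def __init__(self, raw_data):
        self.raw_data = raw_data

    def prepare_model_input(self):
        # One model input per sample: audio plus a text instruction.
        return [
            {"audio": s["audio"], "instruction": s["instruction"], "answer": s["answer"]}
            for s in self.raw_data
        ]

    def format_model_predictions(self, inputs, predictions):
        # Attach each prediction to the input it was generated from.
        return [dict(inp, prediction=pred) for inp, pred in zip(inputs, predictions)]

    def compute_score(self, records, metrics):
        if metrics != "exact_match":
            raise ValueError(f"Unsupported metric: {metrics}")
        hits = sum(
            r["prediction"].strip().lower() == r["answer"].strip().lower()
            for r in records
        )
        return {metrics: hits / len(records) if records else 0.0}


raw = [
    {"audio": b"", "instruction": "Transcribe.", "answer": "hello"},
    {"audio": b"", "instruction": "Transcribe.", "answer": "world"},
]
processor = ToyProcessor(raw)
inputs = processor.prepare_model_input()
records = processor.format_model_predictions(inputs, ["Hello", "word"])
print(processor.compute_score(records, "exact_match"))  # {'exact_match': 0.5}
```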
Metrics fall into three families: exact/string match, reference metrics (WER, BLEU,
METEOR), and model-as-judge (*_judge*), which calls an LLM in eval_methods/ —
the judge model must be served separately (see vllm_model_judge_llama_3_70b.sh).
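As an illustration of the reference-metric family, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The standalone function below is only a sketch of that definition; the project may well compute WER through a library and apply text normalization first, and the function name here is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 edits over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333
```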
Adding things
- New model: add model_src/<module>.py with the two functions, then add one line to MODEL_REGISTRY in src/model.py. Document it in examples/supported_models.md.
- New dataset: add its source under Dataset.load_dataset(), add a processor under dataset_src/, wire it into Dataset.data_format(), and list it in examples/supported_datasets.md.