Evaluation

April 6, 2026 ยท View on GitHub

OSCaR benchmark evaluation uses text-generation metrics over predicted JSON files and human-verified ground-truth captions.

Text-Generation Evaluation

python scripts/eval/eval_text_gen.py \
  --predicted-folder /path/to/predicted_json_folder \
  --groundtruth-ek /path/to/epic_groundtruth_json \
  --groundtruth-ego4d /path/to/ego4d_groundtruth_json

The script reports:

  • average BLEU
  • average ROUGE-1
  • average ROUGE-2
  • average ROUGE-L

Output Conversion

If your inference outputs are still in per-frame .txt form, convert them to the JSON format expected by the evaluator:

python scripts/data/convert_output_to_json.py \
  --root-folder /path/to/output_root

Open-World Data Preparation

To assemble state-change image strips and a CSV mapping from open-world JSON:

python scripts/data/concat_state_frame.py \
  --json-file /path/to/openworld.json \
  --output-folder /path/to/openworld-output

Optional QA-Based Utilities

The repository also contains optional QA-generation helpers in scripts/data/question_answerings/. These are not required for training or core benchmark inference, but they are included because they were part of the broader local workflow around OSCaR data generation and analysis.