MicroWorldBench Evaluation Code

April 1, 2026

Evaluation pipeline for MicroWorldBench (MWBench) — a benchmark for assessing AI-generated microscopic world videos across three dimensions: Scientific Accuracy, Visual Quality, and Instruction Following.

Repository Structure

.
├── video_evaluator.py       # Step 1 — extract frames & call the judge model
├── calculate_scores.py      # Step 2 — aggregate scores & generate report
├── MWBenchRubrics.json      # Per-task rubrics used by video_evaluator.py
├── MWBenchRubrics_norm.json # Weighted rubrics used by calculate_scores.py
├── finaltasks.json          # Task metadata (category labels, prompts)
├── eval_result/             # Output directory
│   ├── <model>_result.json  # Raw evaluation output (auto-generated)
│   └── evaluation_report.md # Summary report (auto-generated)
└── .env.example             # Environment variable template

Requirements

pip install requests opencv-python numpy tqdm

Setup

Copy the environment variable template and fill in your API credentials:

cp .env.example .env

Edit .env:

# API key for the evaluation model (GPT-4o or any OpenAI-compatible endpoint)
EVAL_API_KEY=your_key_here

# Optional: override the API base URL (defaults to https://api.openai.com)
EVAL_BASE_URL=https://api.openai.com

Export the variables before running:

# Skip comment lines; otherwise export fails on the "#" lines in .env
export $(grep -v '^#' .env | xargs)
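As a rough illustration of how a script might consume these variables, here is a minimal sketch. It is not the code in video_evaluator.py: the helper name `load_api_settings` is hypothetical, and only the two variable names and the default URL are taken from the template above.

```python
import os


def load_api_settings(env=None):
    """Return (api_key, base_url) read from environment variables.

    Hypothetical helper: only the variable names and the default URL
    come from the .env template.
    """
    env = os.environ if env is None else env
    api_key = env.get("EVAL_API_KEY")
    if not api_key:
        raise RuntimeError("EVAL_API_KEY is not set; export it before running")
    # EVAL_BASE_URL is optional and falls back to the documented default.
    base_url = env.get("EVAL_BASE_URL") or "https://api.openai.com"
    return api_key, base_url.rstrip("/")
```

Failing early on a missing key avoids launching many worker threads that would all be rejected by the API.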

Usage

Step 1 — Evaluate Videos

Place your generated videos under a subdirectory named after the model:

./
└── <ModelName>/
    ├── 1.mp4
    ├── 2.mp4
    └── ...

Video filenames must be the task index (e.g. 42.mp4).
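To check a model directory against this naming rule before spending API calls, something like the following could be used. It is a sketch under assumptions: `find_videos` is a hypothetical helper, not a function from this repository, and it simply ignores files whose name is not a plain integer.

```python
from pathlib import Path


def find_videos(model_dir):
    """Map task index -> video path for files named <index>.mp4.

    Hypothetical helper; files whose stem is not an integer are skipped.
    """
    videos = {}
    for path in Path(model_dir).glob("*.mp4"):
        if path.stem.isdecimal():
            videos[int(path.stem)] = path
    # Return the entries ordered by task index.
    return dict(sorted(videos.items()))
```

Calling `find_videos("./ModelName")` and comparing the keys with the task indices you expect is a quick way to spot missing or misnamed files.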

Edit the model_names list in video_evaluator.py (or pass it via EvalConfig), then run:

python video_evaluator.py

Results are saved incrementally to eval_result/<ModelName>_result.json.
Re-running skips already-completed tasks, so the script is safe to interrupt and resume.
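The save-and-resume behaviour can be pictured with the sketch below. The function names and the file handling are assumptions for illustration; the actual script may store and check progress differently.

```python
import json
from pathlib import Path


def completed_indices(result_path):
    """Return the set of task indices already present in a result file."""
    path = Path(result_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        records = json.load(f)
    return {record["index"] for record in records}


def append_result(result_path, record):
    """Add one record and rewrite the file so progress survives an interrupt."""
    path = Path(result_path)
    records = []
    if path.exists():
        with path.open(encoding="utf-8") as f:
            records = json.load(f)
    records.append(record)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then swap it in, so an interrupt
    # never leaves a half-written JSON file behind.
    tmp = path.with_suffix(".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    tmp.replace(path)
```

On a re-run, only indices that are not in `completed_indices(...)` would need to be sent to the judge model.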

Key configuration options (`EvalConfig` in `video_evaluator.py`)

| Field | Default | Description |
| --- | --- | --- |
| `model` | `"gpt-4o"` | Judge model name |
| `max_workers` | `100` | Parallel threads |
| `num_frames` | `8` | Frames extracted per video |
| `retry_times` | `3` | API retry attempts per request |
| `test_mode` | `False` | Set `True` to process only `test_limit` videos |
| `test_limit` | `4` | Videos to process in test mode |
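For orientation, the options above can be pictured as a dataclass. `EvalConfigSketch` is an illustrative stand-in, not the real `EvalConfig`: the field names and defaults are copied from the table, while the class layout and any further fields in the actual code are unknown here.

```python
from dataclasses import dataclass


@dataclass
class EvalConfigSketch:
    """Illustrative stand-in for EvalConfig; defaults copied from the table."""

    model: str = "gpt-4o"    # judge model name
    max_workers: int = 100   # parallel threads
    num_frames: int = 8      # frames extracted per video
    retry_times: int = 3     # API retry attempts per request
    test_mode: bool = False  # process only test_limit videos when True
    test_limit: int = 4      # videos to process in test mode


# A cautious first run: two videos and a handful of threads.
smoke_test = EvalConfigSketch(test_mode=True, test_limit=2, max_workers=4)
```

Starting with `test_mode=True` and a low `max_workers` is a cheap way to confirm that the API key and video paths work before running the full benchmark.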

Step 2 — Calculate Scores

Once evaluation results exist in eval_result/, run:

python calculate_scores.py

This reads all *_result.json files, computes normalised scores per dimension and category, and writes eval_result/evaluation_report.md.
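The file discovery step might look like the sketch below. `discover_results` is a hypothetical helper that only relies on the `<model>_result.json` naming convention described above; how calculate_scores.py actually locates its inputs is not shown in this README.

```python
from pathlib import Path


def discover_results(result_dir="eval_result"):
    """Map model name -> path for every <model>_result.json in result_dir."""
    suffix = "_result.json"
    found = {}
    for path in sorted(Path(result_dir).glob("*" + suffix)):
        # Strip the suffix to recover the model name.
        found[path.name[: -len(suffix)]] = path
    return found
```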

Scoring Formula

For each task and dimension:

$ S = \sum_i \text{score}_i \times \text{weight}_i \times \text{sign}_i $

$ S_{\text{norm}} = \max\left(0, \frac{S}{\sum_i w_i^{+}}\right) \times 100 $

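where score_i ∈ {0, 1}, sign_i ∈ {+1, −1} (+1 for a positive criterion, −1 for a penalty criterion), and Σ w_i+ sums weight_i over the positive criteria only. Because triggered penalties can push S below zero, the max(0, ·) clamp keeps S_norm between 0 and 100.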
The overall task score is the normalised combined score across all three dimensions.
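The per-dimension formula can be written out as a short function. This is a sketch that follows the formula as stated, not the code in calculate_scores.py: the function name, the plain-list inputs, and the handling of a rubric with no positive criteria are assumptions.

```python
def normalised_score(scores, weights, signs):
    """Return S_norm in [0, 100] for one task and one dimension.

    scores  -- 0/1 verdict per criterion
    weights -- non-negative weight per criterion
    signs   -- +1 for a positive criterion, -1 for a penalty criterion
    """
    if not (len(scores) == len(weights) == len(signs)):
        raise ValueError("scores, weights and signs must have equal length")
    s = sum(sc * w * sg for sc, w, sg in zip(scores, weights, signs))
    positive_total = sum(w for w, sg in zip(weights, signs) if sg > 0)
    if positive_total == 0:
        return 0.0  # assumption: no positive criteria means no score
    return max(0.0, s / positive_total) * 100


# Two positive criteria (weights 3 and 2) and one penalty (weight 1).
# The first positive criterion is met and the penalty is triggered:
# S = 3 - 1 = 2, positive weight total = 5, so S_norm = 40.
example = normalised_score([1, 0, 1], [3, 2, 1], [+1, +1, -1])
```

If only the penalty were triggered, S would be negative and the clamp would return 0.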

Evaluation Dimensions

| Dimension | Description |
| --- | --- |
| Scientific | Accuracy of the depicted biological/physical processes |
| Visual | Realism, clarity, and quality of the rendered visuals |
| Instruction | Adherence to the task prompt and detailed requirements |

Task Categories

| Category | Description |
| --- | --- |
| Organ-level | Organ-scale structures and physiological processes |
| Cellular-level | Cell-scale structures and interactions |
| Subcellular-level | Subcellular organelles and molecular-level processes |

Output Format

eval_result/<ModelName>_result.json — array of objects, one per video:

{
  "index": 42,
  "video_path": "./ModelName/42.mp4",
  "scientific_eval": {
    "raw_response": "...",
    "parsed_result": { "scores": [1, 0, 1], "reasoning": "..." }
  },
  "visual_eval": { ... },
  "instruction_eval": { ... }
}
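A record of this shape could be read as in the sketch below. It assumes that `visual_eval` and `instruction_eval` follow the same layout as `scientific_eval`, which the example above elides; `scores_by_dimension` is a hypothetical helper, not part of the repository.

```python
import json

DIMENSIONS = ("scientific_eval", "visual_eval", "instruction_eval")


def scores_by_dimension(record):
    """Return {dimension key: list of 0/1 scores} for one result record.

    A dimension that is missing, or has no parsed_result, maps to None.
    """
    out = {}
    for key in DIMENSIONS:
        parsed = (record.get(key) or {}).get("parsed_result") or {}
        out[key] = parsed.get("scores")
    return out


# A trimmed record containing only the scientific dimension.
record = json.loads("""
{
  "index": 42,
  "video_path": "./ModelName/42.mp4",
  "scientific_eval": {
    "raw_response": "...",
    "parsed_result": {"scores": [1, 0, 1], "reasoning": "..."}
  }
}
""")
print(scores_by_dimension(record))
```

Returning None for absent dimensions makes it easy to spot records where the judge response could not be parsed.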