scripts/paper/

May 13, 2026

Everything needed to schedule, deduplicate, run, resume and plot the paper experiments. paper_runner.py is the CLI entrypoint; the other modules are its dependencies.

Files

scripts/paper/
├── paper_runner.py       # CLI: run / status / resume subcommands
├── experiments.py        # E1 / E2_offline / E2_online / E2b / E4 atom enumerators
├── hyperparams.py        # hp_hash, override resolution, dataset N validation
├── registry.py           # SQLite dedup + audit log (outputs/paper_runs/registry.db)
├── runner_backend.py     # (env, cond) → main.py + cwd + absolute --config-path
├── log_parser.py         # Parse Hydra stdout/stderr → structured metrics
├── cost.py               # Per-model price table + estimate_cost_usd
├── build_configs.py      # Pipeline that generated configs/*.yaml from upstream templates (templates not shipped — see configs/README.md)
├── regenerate_datasets.py # Slice _pure_mixed.pkl → _paper_N{1..12}.pkl mirrored to curtis_baseline/
├── configs/              # Self-contained Hydra YAMLs (6 conditions × 7 envs = 42 files)
├── tests/                # Unit suites (no API key, < 30s total)
├── plot_pretty.py        # E1 main bars (Fig. 2) + E2b stochastic robustness
├── plot_e2_full_sweep.py # E2 offline sweep (Fig. 4)
├── plot_progression.py   # E2 online learning curves
└── plot_e4_3llms.py      # E4 LLM ablation (Tab. 1)

CLI reference

# Launch all atoms for an experiment (skips already-done atoms).
python scripts/paper/paper_runner.py run <exp> [--envs env,env] [--conditions cond,cond] [--seeds 0,1,…] [--dry-run] [--max-hours H] [--workers W]

# Count atoms by status across the registry.
python scripts/paper/paper_runner.py status [--exp <exp>]

# Relaunch only pending / failed / rate_limited atoms.
python scripts/paper/paper_runner.py resume [<exp>] [same flags as run]

<exp> is one of E1, E2_offline, E2_online, E2b, E4. See the top-level README.md §5 for the per-experiment budgets.
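
As a rough illustration of the command-line shape above, the three subcommands could be wired up with argparse as follows. This is only a sketch and not the actual contents of paper_runner.py: the helper names (comma_list, build_parser) and the choice to leave every default as None are assumptions.

```python
import argparse

# Experiment identifiers listed in this README.
EXPERIMENTS = ["E1", "E2_offline", "E2_online", "E2b", "E4"]


def comma_list(value):
    """Split 'a,b,c' into ['a', 'b', 'c'] (hypothetical helper)."""
    return [item for item in value.split(",") if item]


def build_parser():
    parser = argparse.ArgumentParser(prog="paper_runner.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # Flags shared by `run` and `resume`; defaults are left unset on purpose.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("--envs", type=comma_list, default=None)
    shared.add_argument("--conditions", type=comma_list, default=None)
    shared.add_argument("--seeds", default=None,
                        type=lambda s: [int(x) for x in comma_list(s)])
    shared.add_argument("--dry-run", action="store_true")
    shared.add_argument("--max-hours", type=float, default=None)
    shared.add_argument("--workers", type=int, default=None)

    run = sub.add_parser("run", parents=[shared])
    run.add_argument("exp", choices=EXPERIMENTS)

    status = sub.add_parser("status")
    status.add_argument("--exp", choices=EXPERIMENTS, default=None)

    # For `resume`, the experiment is optional.
    resume = sub.add_parser("resume", parents=[shared])
    resume.add_argument("exp", nargs="?", choices=EXPERIMENTS, default=None)
    return parser
```

For example, `build_parser().parse_args(["run", "E1", "--seeds", "0,1", "--dry-run"])` yields a namespace with `command="run"`, `exp="E1"` and `seeds=[0, 1]`.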

How atoms are built

experiments.py exposes one function per experiment, each returning a list of Atom tuples:

Atom(exp_id, env, condition, seed, episode_idx, llm_model, extra)

paper_runner.py then groups atoms by (env, condition, seed); each group maps to a single Hydra subprocess that runs num_episodes episodes. It computes the deterministic hp_hash of the resolved overrides and queries the SQLite registry for prior completions. Only fresh (env, condition, seed, episode_idx, hp_hash) tuples spawn a subprocess.
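
The grouping and deduplication steps described here can be sketched in a few lines. The Atom fields come from the signature above; everything else is an assumption made for illustration: the hashing recipe (SHA-256 over canonical JSON, truncated to 12 hex characters) and the helpers group_atoms and fresh_atoms may differ from what hyperparams.py and paper_runner.py actually do.

```python
import hashlib
import json
from collections import defaultdict
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    exp_id: str
    env: str
    condition: str
    seed: int
    episode_idx: int
    llm_model: str
    extra: Optional[dict] = None


def hp_hash(overrides: dict) -> str:
    # Assumed recipe: key order must not matter, so serialize canonically.
    canonical = json.dumps(overrides, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def group_atoms(atoms):
    """Bucket atoms by (env, condition, seed): one bucket per subprocess."""
    groups = defaultdict(list)
    for atom in atoms:
        groups[(atom.env, atom.condition, atom.seed)].append(atom)
    return dict(groups)


def fresh_atoms(atoms, overrides, done_keys):
    """Drop atoms whose (env, condition, seed, episode_idx, hp_hash) is done."""
    h = hp_hash(overrides)
    return [
        a for a in atoms
        if (a.env, a.condition, a.seed, a.episode_idx, h) not in done_keys
    ]
```

Because the hash is derived from the resolved overrides, changing any hyperparameter produces a new hp_hash, so earlier completions under the old settings no longer count as duplicates.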

SQLite schema

outputs/paper_runs/registry.db has two tables:

| Table | Purpose |
| --- | --- |
| `runs` | One row per episode. Unique on `(exp_id, env, condition, seed, episode_idx, hp_hash)`. Stores reward, wall time, tokens, cost, status, hp_hash, llm_model. |
| `events` | Append-only audit trail of status transitions (`pending → running → done / failed / rate_limited / skipped_dedup / superseded`). Useful for forensic debugging. |
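
To make the two-table layout concrete, here is a minimal sketch of a schema with the same uniqueness constraint, plus a dedup-aware insert. The column names and types, the layout of the events table, and the open_registry and register helpers are all assumptions; the real definitions live in registry.py.

```python
import sqlite3

# Hypothetical schema: only the UNIQUE key and the stored quantities
# are taken from the table above; names and types are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    exp_id       TEXT    NOT NULL,
    env          TEXT    NOT NULL,
    condition    TEXT    NOT NULL,
    seed         INTEGER NOT NULL,
    episode_idx  INTEGER NOT NULL,
    hp_hash      TEXT    NOT NULL,
    llm_model    TEXT,
    status       TEXT    NOT NULL DEFAULT 'pending',
    reward       REAL,
    wall_seconds REAL,
    tokens       INTEGER,
    cost_usd     REAL,
    UNIQUE (exp_id, env, condition, seed, episode_idx, hp_hash)
);
CREATE TABLE IF NOT EXISTS events (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    run_key    TEXT NOT NULL,
    new_status TEXT NOT NULL,
    ts         TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_registry(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def register(conn, key):
    """Insert a pending run; return False if the key already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO runs "
        "(exp_id, env, condition, seed, episode_idx, hp_hash) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        key,
    )
    is_new = cur.rowcount == 1  # 0 when the UNIQUE constraint ignored it
    conn.execute(
        "INSERT INTO events (run_key, new_status) VALUES (?, ?)",
        ("|".join(map(str, key)), "pending" if is_new else "skipped_dedup"),
    )
    conn.commit()
    return is_new
```

Letting the UNIQUE constraint reject duplicates keeps the dedup check and the insert in a single statement, while the events table still records that a duplicate was attempted.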

Outputs

Each subprocess writes to outputs/paper_runs/runs/<exp_id>__<env>__<condition>__s<seed>__<hp_hash>/:

result.json     # rewards, terminations, wall_seconds, token_usage, …
stdout.log      # captured Hydra stdout (includes [REWARDS] / [TOKENS] markers)
stderr.log      # captured stderr
metadata.json   # the resolved Hydra config that was actually launched
overrides.json  # the CLI overrides passed to main.py
tensorboard/    # per-episode TB events (optional, agent-side)
episode_*_iter_*_step_*_..._llm_input.txt   # full LLM transcripts (one per call)
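
The directory naming scheme above can be built and parsed mechanically. The sketch below assumes that none of the components contains a double underscore; run_dir and parse_run_dir are hypothetical helper names, not functions from this package.

```python
from pathlib import Path


def run_dir(root, exp_id, env, condition, seed, hp_hash):
    """Return <root>/runs/<exp_id>__<env>__<condition>__s<seed>__<hp_hash>."""
    name = f"{exp_id}__{env}__{condition}__s{seed}__{hp_hash}"
    return Path(root) / "runs" / name


def parse_run_dir(name):
    """Invert run_dir's naming; assumes no component contains '__'."""
    parts = name.split("__")
    if len(parts) != 5 or not parts[3].startswith("s"):
        raise ValueError(f"unexpected run directory name: {name!r}")
    exp_id, env, condition, seed_part, hp_hash = parts
    return exp_id, env, condition, int(seed_part[1:]), hp_hash
```

Single underscores, as in E2_offline, are safe because only the double underscore acts as a separator.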

The plotting scripts read from the SQLite registry (fast path) and fall back to the per-atom result.json files when the registry has been wiped.
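
A loader with this fast-path and fallback behaviour might look like the sketch below. It is not the code used by the plotting scripts: the function name load_rewards, the registry column names (reward, status) and the returned shapes are assumptions, and only the rewards key of result.json is taken from the listing above.

```python
import json
import sqlite3
from contextlib import closing
from pathlib import Path


def load_rewards(root):
    """Return ("registry", rows) if the registry is usable, else ("files", dict)."""
    root = Path(root)
    db_path = root / "registry.db"

    # Fast path: query the registry if it exists and has finished runs.
    if db_path.exists():
        try:
            with closing(sqlite3.connect(db_path)) as conn:
                rows = conn.execute(
                    "SELECT exp_id, env, condition, seed, episode_idx, reward "
                    "FROM runs WHERE status = 'done'"
                ).fetchall()
            if rows:
                return "registry", rows
        except sqlite3.Error:
            pass  # missing table or corrupt file: fall through to the files

    # Fallback: scan every run directory for its result.json.
    rewards = {}
    for path in sorted(root.glob("runs/*/result.json")):
        data = json.loads(path.read_text())
        rewards[path.parent.name] = data.get("rewards", [])
    return "files", rewards
```

Returning the source label alongside the data lets a caller log which path was taken, which is handy when a plot looks incomplete.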