Architecture and paper-to-code mapping
May 13, 2026
This document is the bridge between the paper and the code. Read
it after the main README.md quickstart and before opening any 4000-line
module.
1. End-to-end data flow
┌──────────────────────────────────────────────┐
│ Offline replay buffer │
│ (uncertain_worms/.../trajectory_data/*.pkl) │
│ observations a_t, o_{t+1}, r_{t+1}, d_{t+1} │
└────────────────────┬─────────────────────────┘
│ N demonstrations
▼
┌──────────────────────────────────────────────────────────────────┐
│ REx loop (paper Algorithm 1) │
│ │
│ ┌──────────────────────┐ ┌──────────────────────────────┐ │
│ │ UCB1 parent select │ │ LLM proposes candidate m_jk │ │
│ │ rex_helpers.py │───▶│ base_policy.requery_joint │ │
│ └──────────────────────┘ └──────────────┬───────────────┘ │
│ ▲ │ │
│ │ score S_jk, diagnostics D_jk │ │
│ │ ▼ │
│ ┌────────┴──────────────┐ ┌──────────────────────────────┐ │
│ │ Near-best selector │ │ Particle filter score │ │
│ │ Eq. 11–12 │◀──│ LikelihoodEvaluator.evaluate_likelihood │ │
│ └───────────────────────┘ │ (Eq. 7 / Eq. 8) │ │
│ │ └──────────────┬───────────────┘ │
│ │ │ │
│ │ QBC vote entropy Eq. 9 ▼ │
│ │ ┌──────────────────────────────────┐ │
│ │ │ DisagreementDetector │ │
│ │ │ model_disagreement.py │ │
│ │ └──────────────────────────────────┘ │
│ ▼ │
└─────── m* (final model) ──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Online deployment loop │
│ │
│ belief b_t (K particles) │
│ │ │
│ │ PO_DAStar.plan(b_t) │
│ ▼ │
│ action a_t ─▶ env.step(a_t) ─▶ o_{t+1}, r_{t+1}, d_{t+1}│
│ │ │
│ │ agent.update_belief(o_{t+1}) (PF + rejuvenation) │
│ ▼ │
│ b_{t+1} │
└──────────────────────────────────────────────────────────────────┘
After every episode the new trajectory is appended to the replay buffer and a fresh REx round (top of the diagram) is triggered. This is the "online" half of paper §4.5.
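The lower box of the diagram can be sketched as a plain plan/act/filter loop. This is a minimal illustration under stated assumptions, not the repository's code: `run_episode`, `make_corridor`, `plan`, and `update_belief` are hypothetical stand-ins for `PO_DAStar.plan`, `env.step`, and `agent.update_belief`, and the toy corridor is fully observed, so the "rejuvenation" branch is only a placeholder for the real procedure.

```python
from typing import Callable, List, Tuple

Belief = List[int]                      # toy particle belief: one int position per particle
Step = Tuple[int, int, float, bool]     # (a_t, o_{t+1}, r_{t+1}, d_{t+1})


def run_episode(plan: Callable, step: Callable, update_belief: Callable,
                belief: Belief, horizon: int = 40):
    """Roll out one episode: plan on the belief, act, then filter."""
    trajectory: List[Step] = []
    for _ in range(horizon):
        action = plan(belief)                        # stand-in for PO_DAStar.plan(b_t)
        obs, reward, done = step(action)             # stand-in for env.step(a_t)
        trajectory.append((action, obs, reward, done))
        belief = update_belief(belief, action, obs)  # stand-in for agent.update_belief
        if done:
            break
    return trajectory, belief


def make_corridor(goal: int = 3) -> Callable:
    """Hypothetical 1-D environment: the observation is the position itself."""
    state = {"pos": 0}

    def step(action: int):
        state["pos"] += action
        done = state["pos"] >= goal
        return state["pos"], (1.0 if done else 0.0), done

    return step


def plan(belief: Belief) -> int:
    return 1  # toy planner: always step right


def update_belief(belief: Belief, action: int, obs: int) -> Belief:
    moved = [p + action for p in belief]       # propagate every particle
    kept = [p for p in moved if p == obs]      # keep particles consistent with o_{t+1}
    # Placeholder for rejuvenation: refill the belief if every particle was rejected.
    return kept if kept else [obs] * len(belief)


replay_buffer = []
trajectory, belief = run_episode(plan, make_corridor(goal=3), update_belief, belief=[0] * 10)
replay_buffer.append(trajectory)  # the appended trajectory is what the next REx round scores
```

The default `horizon=40` mirrors the episode horizon H listed in the hyperparameter index; everything else about the toy environment is invented for the example.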
2. Paper-to-code reference table
| Paper element | Code location |
|---|---|
| Algorithm 1 (Belief-based Pinductor refinement) | uncertain_worms/policies/partially_obs_planning_agent.py::LLMPartiallyObsPlanningAgent.joint_update_models_rex |
| Distance kernel d(ô, o) (§3, §5) | uncertain_worms/structs.py::MinigridObservation.distance_soft |
| Eq. 7 Distance-kernel log-likelihood per step | particle_filtering/get_score_metrics.py::LikelihoodEvaluator._step_score |
| Eq. 8 Aggregated kernel pseudo-likelihood | particle_filtering/get_score_metrics.py::LikelihoodEvaluator.evaluate_likelihood (and evaluate_score for the public wrapper) |
| Eq. 9 QBC vote entropy across the committee | particle_filtering/model_disagreement.py::committee_prediction_entropy (+ DisagreementDetector for per-context aggregation) |
| Eq. 10 UCB1 parent selection | uncertain_worms/policies/rex_helpers.py::ucb1_select |
| Eq. 11 / 12 Near-best set + softmax final selector | Inline in partially_obs_planning_agent.py::joint_update_models_rex (search for softmax_T, near_best); driven by likelihood_softmax_temperature |
| App. B.1 PO_DAStar belief-space planner | uncertain_worms/planners/PO_DAStar.py |
| App. B.2 Particle filter + rejuvenation | partially_obs_planning_agent.py::LLMPartiallyObsPlanningAgent.update_belief + LikelihoodEvaluator._rejuvenate / _rejuvenate_step |
| App. B.3 UCB1 tree expansion | rex_helpers.py::ucb1_select + agent's _select_node_to_refine |
| App. D Hyperparameter table | Frozen per-condition in scripts/paper/configs/<cond>/<env>.yaml |
| App. E Demonstration buffers | uncertain_worms/environments/minigrid/trajectory_data/*_paper_N*.pkl |
| App. F.1 POMDP Coder baseline | curtis_baseline/uncertain_worms/policies/partially_obs_planning_agent.py |
| App. F.2 Tabular baseline | curtis_baseline/uncertain_worms/policies/tabular_learners.py |
| App. F.3 Random baseline | curtis_baseline/uncertain_worms/policies/random_policy.py |
| App. F.4 Prompt-information sweep | env_descriptions.txt (L3) + uncertain_worms/policies/prompts/po_inserts.json |
| Fig. 2 (E1 main reward) | Generated by scripts/paper/plot_pretty.py from outputs/paper_runs/registry.db |
| Fig. 4 (E2 offline sweep) | Generated by scripts/paper/plot_e2_full_sweep.py |
| Fig. 5 (E2 online learning curves) | Generated by scripts/paper/plot_progression.py |
| Fig. 6 (E2b stochastic) | Generated by scripts/paper/plot_pretty.py (same script, separate panel) |
| Tab. 1 (E4 LLM ablation) | Generated by scripts/paper/plot_e4_3llms.py |
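
As orientation for the Eq. 10 row, here is a sketch of textbook UCB1 selection over refinement-tree nodes. It assumes the common form mean + c·sqrt(ln N / n_i), with the usual constant folded into c; the paper's Eq. 10 and the real `rex_helpers.py::ucb1_select` may differ in signature and detail. The name `ucb1_select_sketch` is hypothetical.

```python
import math
from typing import Sequence


def ucb1_select_sketch(mean_scores: Sequence[float],
                       visit_counts: Sequence[int],
                       c: float = 1.0) -> int:
    """Return the index maximising mean_i + c * sqrt(ln(N) / n_i)."""
    if not mean_scores or len(mean_scores) != len(visit_counts):
        raise ValueError("need one visit count per score, and at least one node")

    # Unvisited nodes have an unbounded bonus, so try them first.
    for i, n in enumerate(visit_counts):
        if n == 0:
            return i

    total = sum(visit_counts)

    def ucb(i: int) -> float:
        bonus = c * math.sqrt(math.log(total) / visit_counts[i])
        return mean_scores[i] + bonus

    return max(range(len(mean_scores)), key=ucb)


# A rarely visited node can win despite a lower mean score.
parent = ucb1_select_sketch([0.5, 0.4], [10, 1], c=1.0)
```

With c = 0 the rule collapses to greedy selection on the mean score, which is one way to check how much the exploration coefficient matters for a given tree.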
3. Hyperparameter index
The paper reports the following hyperparameters (App. D / Table 1). Each
appears in every scripts/paper/configs/ours/*.yaml:
| Symbol | YAML key | Default | Role |
|---|---|---|---|
| κ | agent.kernel_bandwidth | 0.2 | Distance-kernel sharpness in Eq. 7 |
| K | agent.num_particles | 10 | Particle belief size |
| M | agent.num_model_attempts | 5 | Candidates per REx round |
| T | agent.likelihood_softmax_temperature | 0.1 | Final-selection softmax temperature (Eq. 12) |
| c | agent.ucb1_c | 1.0 | UCB1 exploration coefficient (Eq. 10) |
| α | agent.entropy_coeff | 1.0 | Planner entropy bonus |
| λ | agent.lambda_coeff | 0.1 | Planner-side cost coefficient |
| N_D | runtime override (E2_offline sweep) | 10 | Number of offline demos |
| H | max_steps | 40 | Episode horizon |
All of these are visible in plain text in the YAMLs — there is no hidden default scattered across the codebase.
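
To make the role of T concrete, the sketch below builds a near-best set and softmax-samples from it, following the glossary definition (candidates within one standard deviation of the top score). It is an illustration only: the function name is hypothetical, the use of the population standard deviation is an assumption, and the inline logic in `joint_update_models_rex` may differ.

```python
import math
import random
import statistics
from typing import List, Optional, Sequence, Tuple


def near_best_softmax(scores: Sequence[float],
                      temperature: float = 0.1,
                      rng: Optional[random.Random] = None
                      ) -> Tuple[int, List[int], List[float]]:
    """Pick a candidate index by softmax over the near-best set."""
    rng = rng or random.Random()
    best = max(scores)
    spread = statistics.pstdev(scores)  # assumption: population std over all candidates

    # Near-best set: every candidate within one std of the top score.
    near_best = [i for i, s in enumerate(scores) if s >= best - spread]

    # Softmax with temperature T; subtracting the max keeps exp() stable.
    weights = [math.exp((scores[i] - best) / temperature) for i in near_best]
    total = sum(weights)
    probs = [w / total for w in weights]

    chosen = rng.choices(near_best, weights=probs, k=1)[0]
    return chosen, near_best, probs


scores = [-1.0, -1.05, -3.0, -5.0]  # made-up log-likelihood scores
chosen, near_best, probs = near_best_softmax(scores, temperature=0.1, rng=random.Random(0))
```

A small T such as the default 0.1 concentrates probability on the top candidate, while a larger T spreads it more evenly across the near-best set.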
4. Adding pieces — quick pointers
- New MiniGrid environment → uncertain_worms/environments/minigrid/README.md
- New condition (policy variant) → uncertain_worms/policies/README.md (last section)
- New planner → uncertain_worms/planners/README.md (last section)
- New prompt template → uncertain_worms/policies/prompts/README.md
- New LLM provider → patch uncertain_worms/utils.py::query_llm and expose the provider via the PAPER_LLM_MODEL env var (see scripts/paper/experiments.py::E4_llm_variation for an example of how the runner threads the model id through Hydra).
- New experiment → add an enumerator to scripts/paper/experiments.py::all_experiments and reference it from paper_runner.py run <name>.
5. Glossary
| Term | Meaning |
|---|---|
| TROI | Transition / Reward / Observation / Initial — the four POMDP components proposed jointly (Pinductor) or one-by-one (POMDP Coder baseline). |
| REx (Refinement EXploration) | The iterative LLM-proposal + scoring + diagnostic loop (Algorithm 1). |
| hp_hash | Deterministic SHA-256 digest over the resolved Hydra overrides. Used by the runner to deduplicate atoms across experiments. |
| Atom | A (exp_id, env, condition, seed, episode_idx, llm_model, extra) tuple. The unit of work for paper_runner.py. |
| Group | A bundle of atoms sharing (env, condition, seed); one Hydra subprocess handles them together. |
| QBC | Query-By-Committee — using the disagreement across the candidate-model committee to identify uncertain transition contexts (Eq. 9). |
| Rejuvenation | Replenishing the particle population by sampling fresh particles from ρ_0^m and replaying the action history (App. B.2). |
| Near-best set | Set of candidate models within one standard deviation of the top score, from which the final model is softmax-sampled (Eq. 11–12). |
| CWD-on-path | Python's behaviour of putting the launch directory first on sys.path: the working directory under python -m or python -c, the script's own directory under python script.py. We exploit this to load two uncertain_worms packages without pip install collisions. |
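
The hp_hash entry can be illustrated with a short sketch of a deterministic digest over a set of overrides. The canonical JSON serialisation and the name `hp_hash_sketch` are assumptions made for this example; the runner's actual serialisation may differ.

```python
import hashlib
import json
from typing import Any, Dict


def hp_hash_sketch(overrides: Dict[str, Any]) -> str:
    """SHA-256 hex digest of the overrides, independent of key order."""
    # Sorting keys and fixing separators gives one canonical string per config.
    canonical = json.dumps(overrides, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = hp_hash_sketch({"agent.num_particles": 10, "agent.ucb1_c": 1.0})
b = hp_hash_sketch({"agent.ucb1_c": 1.0, "agent.num_particles": 10})
c = hp_hash_sketch({"agent.num_particles": 20, "agent.ucb1_c": 1.0})
```

Because `a` and `b` are equal, two experiments that resolve to the same overrides map to the same key and can share results, while `c` differs and is scheduled as separate work.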
6. Where to read the actual algorithm
- uncertain_worms/policies/partially_obs_planning_agent.py: start at the module docstring (Algorithm 1 pseudocode).
- particle_filtering/get_score_metrics.py: LikelihoodEvaluator.evaluate_likelihood is the scoring loop.
- particle_filtering/model_disagreement.py: DisagreementDetector (per-context committee aggregation) plus the global committee_prediction_entropy helper used as the QBC vote-entropy signal in Eq. 9.
- uncertain_worms/policies/rex_helpers.py: UCB1 selector and tree node bookkeeping.
If you have one hour, read in this order. The rest of the code is glue.
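
Before opening model_disagreement.py, it may help to have the textbook Query-By-Committee vote entropy in mind. The sketch below computes it for one context from a list of committee predictions. It is the standard definition, not a copy of `committee_prediction_entropy`, whose inputs and normalisation may differ; `vote_entropy` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def vote_entropy(predictions: Sequence[Hashable]) -> float:
    """Entropy (in nats) of the committee's vote distribution for one context."""
    if not predictions:
        raise ValueError("committee must contain at least one prediction")
    n = len(predictions)
    counts = Counter(predictions)  # votes per distinct predicted outcome
    return -sum((v / n) * math.log(v / n) for v in counts.values())


agree = vote_entropy(["door_open"] * 4)                # unanimous committee
split = vote_entropy(["door_open", "door_open",
                      "door_closed", "door_closed"])   # evenly split committee
```

A unanimous committee gives zero entropy, and an even two-way split gives ln 2, so contexts with high values are the ones where the candidate models disagree most.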