Project tasks

June 5, 2026

Live tracking of where the project stands. Updated as work progresses.

See design.md for the full design and timeline, and PROJECT_CONTEXT.md for the day-by-day narrative.

Status

| Phase | Status | Notes |
|---|---|---|
| Day 1 — Bootstrap | ✅ Done | Repo, env, approvals, GitHub all complete |
| Week 1 — Data and infrastructure | ✅ Essentially done | (2026-05-21) SSH/Tailscale to GPU server ✅, all 23 countries' DHS data ✅, bulk extraction running. Remaining: dataset.py, splits.py, PC env. |
| Week 2 — Replication and main result | 🟡 Starting | Extraction → tiles in Drive → rclone to PC → train first model |
| Week 3 — Fairness audit + extensions + write-up | ⏳ Pending | + OOD-countries test + temporal (stretches) |

Day 1 — bootstrap (✅ done)

  • Create poverty-cnn/ repo skeleton
  • Write environment.yml, README.md, .gitignore, pyproject.toml, LICENSE
  • Copy spec to docs/design.md
  • Initialize git, first commit
  • Create GitHub repo (public), push — github.com/OnurHaniffa/poverty-cnn
  • Register for DHS data — approved, all 23 countries (Survey + GPS)
  • Register for Google Earth Engine — approved, project storied-chimera-491721-i4
  • Create local conda environment, run smoke test
  • Configure VS Code Remote-SSH to school server — still pending, machine was powered off

Week 1 — data and infrastructure (May 9–15)

| Day | Date | Task | Status |
|---|---|---|---|
| 1 | May 9 (Sat) | Repo skeleton, registrations, env setup | |
| 2 | May 10 (Sun) | Reading list (africa_poverty, predicting-poverty-replication, Aiken 2023). Launch background EE download. | 🟡 reading done; bulk EE extraction not yet launched |
| 3 | May 11 (Mon) | Pull DHS for all 23 countries; pooled PCA wealth index; verify against WILDS | 🟡 wealth-index code done; only Kenya downloaded so far |
| 4 | May 12 (Tue) | DHS + EE approvals come through; env fixes (numpy<2, pip retries) | |
| 5 | May 13 (Wed) | Data pipeline modules written + Kenya end-to-end tested (1,594 clusters, 0.99σ urban/rural gap, 2 tiles downloaded) | |
| 6 | May 14 (Thu) | PROJECT_CONTEXT.md added for session continuity | |
| 7 | May 15 (Fri) | Hard checkpoint day. Outstanding: SSH server access, bulk EE extraction, dataset.py, splits.py, WILDS parity check | 🟡 in progress |
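
The pooled PCA wealth index from Day 3 can be sketched as follows. This is a minimal illustration, not the code in dhs.py: the function name `pooled_wealth_index`, the toy asset matrix, and the sign convention are assumptions. The point it shows is that one PCA is fitted on households pooled across all countries, so a single set of loadings applies everywhere and scores are comparable across borders.

```python
import numpy as np


def pooled_wealth_index(assets: np.ndarray) -> np.ndarray:
    """First principal component of standardized asset columns, z-scored.

    `assets` is a (households x asset variables) array pooled over all
    countries, so one set of loadings applies to every household.
    """
    # Standardize each asset column over the pooled sample.
    std = assets.std(axis=0)
    std[std == 0] = 1.0  # constant columns carry no signal; avoid 0/0
    z = (assets - assets.mean(axis=0)) / std

    # The first right-singular vector of z is the first principal axis.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[0]

    # The sign of a principal axis is arbitrary. Orient it so that owning
    # more assets scores higher (assumes larger column values = wealthier).
    if loadings.sum() < 0:
        loadings = -loadings

    scores = z @ loadings
    return (scores - scores.mean()) / scores.std()


# Toy pooled sample: 6 households x 3 binary assets (made-up data).
toy = np.array([
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
], dtype=float)
index = pooled_wealth_index(toy)
```

Because the output is z-scored over the pooled sample, a gap such as the 0.99σ urban/rural difference reported for Kenya is expressed in pooled standard deviations.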

Week 2 — replication and main result (May 16–22)

See docs/design.md §11 for the full Week 2 day-by-day. Headline tasks: hyperparameter search → 5-fold cross-country training → MC-dropout inference → ablations → Jean 2016 baseline → temporal training kickoff.

Week 3 — fairness audit + extensions + write-up (May 23–30)

See docs/design.md §11 for the full Week 3 day-by-day. Headline tasks: fairness audit → uncertainty-aware extension → temporal drift analysis → visualizations → final report + slide deck.

Decisions log

| Date | Decision | Rationale |
|---|---|---|
| 2026-05-09 | Picked Yeh 2020 (Nature Comms) replication over PlantVillage / HAM10000 / PCam | Top venue, real-world impact, fits aspiring-data-engineer career path |
| 2026-05-09 | 5-fold cross-country protocol matching Yeh's Supp Table S2 | Apples-to-apples comparison; standard CV; not 7/8/exhaustive |
| 2026-05-09 | Added uncertainty-aware fairness extension as novel contribution | Real research novelty; cheap (no extra training) |
| 2026-05-09 | Added temporal fairness drift extension as novel contribution | Real research novelty; ~5–10 GPU-h additional |
| 2026-05-09 | Marked bias-amplification simulation as Week 3 stretch | Highest novelty but requires simulation infra; only attempt if Day 28 finishes ahead |
| 2026-05-13 | Use Person Recode (PR) deduped to households when DHS bundles PR not HR | Asset variables are household-level; dedup by (hv001, hv002) recovers HR rows. Avoids re-download. |
| 2026-05-13 | Collapse DHS categorical asset variables (water/toilet/floor/wall/roof) to single binary "improved/finished" flags | Sidesteps country-specific ordinal-coding inconsistencies across 23 countries. Slight signal loss vs Yeh's 1–5 ordinal scoring. |
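
The two 2026-05-13 decisions can be illustrated with a small sketch. It assumes records are plain dictionaries keyed by DHS variable names. `hv001` and `hv002` come from the log above; the `water_source` key, the `IMPROVED_WATER` label set, and both function names are hypothetical stand-ins, not the project's actual coding.

```python
from typing import Iterable

# Hypothetical label set: the real mapping is country-specific and would
# live in the project's DHS loader.
IMPROVED_WATER = {"piped into dwelling", "public tap", "protected well"}


def dedupe_to_households(person_rows: Iterable[dict]) -> list:
    """Keep the first Person Recode row per (hv001, hv002) household."""
    seen = set()
    households = []
    for row in person_rows:
        key = (row["hv001"], row["hv002"])  # (cluster, household number)
        if key not in seen:
            seen.add(key)
            households.append(row)
    return households


def improved_flag(label: str, improved: set) -> int:
    """Collapse a categorical asset label to a single 0/1 flag."""
    return int(label.strip().lower() in improved)


# Toy Person Recode extract: three people in two households (made-up values).
people = [
    {"hv001": 1, "hv002": 7, "water_source": "Public tap"},
    {"hv001": 1, "hv002": 7, "water_source": "Public tap"},
    {"hv001": 1, "hv002": 8, "water_source": "River"},
]
households = dedupe_to_households(people)
flags = [improved_flag(h["water_source"], IMPROVED_WATER) for h in households]
```

Keeping the first row per key is safe only because the asset variables are household-level and therefore identical across members. With a pandas DataFrame, the same dedup is `df.drop_duplicates(subset=["hv001", "hv002"])`.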

Risks currently in play

  • School GPU server access: RustDesk access exists, but the machine was powered off. A friend has been asked to power it on. Blocking: bulk EE extraction and model training. Highest-priority risk.
  • Bulk EE extraction not yet launched — design called for this Day 2 as a 24–48h background job. Every additional day of delay compresses Week 2.
  • 22 of 23 countries' DHS data not downloaded — purely web work, can be done from the laptop, just hasn't been started.
  • Zero tests in tests/: dhs.py and earth_engine.py are validated only by one manual Kenya run.
  • README.md quickstart references scripts that don't exist — fixed; will need updating again as scripts land.

What's actually built (snapshot)

  • src/poverty_cnn/data/dhs.py — DHS loader + pooled PCA wealth index
  • src/poverty_cnn/data/earth_engine.py — 8-channel Landsat+nightlights extraction
  • src/poverty_cnn/data/dataset.py — PyTorch Dataset (next)
  • src/poverty_cnn/data/splits.py — 5-fold cross-country split (next)
  • src/poverty_cnn/models/ — empty (Week 2)
  • src/poverty_cnn/training/ — empty (Week 2)
  • src/poverty_cnn/eval/ — empty (Week 3)
  • src/poverty_cnn/viz/ — empty (Week 3)
  • scripts/, tests/, notebooks/ — directories exist, all empty
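
As a reference point for the planned splits.py, a cross-country split keeps every cluster from a given country in the same fold, so each test fold contains only countries unseen during training. The sketch below is one way to do that under stated assumptions: the round-robin assignment, the seed, the function names, and the placeholder country codes are all illustrative. The real fold membership should follow Yeh's Supp Table S2, as recorded in the decisions log.

```python
import random


def assign_country_folds(countries, n_folds=5, seed=0):
    """Map each country to a fold index so folds never share a country."""
    ordered = sorted(set(countries))
    random.Random(seed).shuffle(ordered)
    # Round-robin keeps fold sizes within one country of each other.
    return {country: i % n_folds for i, country in enumerate(ordered)}


def split_indices(sample_countries, fold_of, test_fold):
    """Return (train, test) sample indices for one held-out fold."""
    train, test = [], []
    for idx, country in enumerate(sample_countries):
        (test if fold_of[country] == test_fold else train).append(idx)
    return train, test


# 23 placeholder country codes, two clusters each (hypothetical data).
codes = [f"C{i:02d}" for i in range(23)]
samples = [c for c in codes for _ in range(2)]
fold_of = assign_country_folds(codes)
train_idx, test_idx = split_indices(samples, fold_of, test_fold=0)
```

Splitting by country rather than by cluster is what makes the evaluation a test of generalization to new countries; a random cluster-level split would leak each country's imagery into both sides.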

Future work (parking lot — not for this 3-week window)

  • OOD generalization test on excluded countries. Once the 23-country model is trained, request DHS access for additional sub-Saharan countries Yeh excluded (Botswana, Namibia, Madagascar, Liberia, etc.) and run inference-only — no retraining. Report per-country r² on these as a "does the model travel?" test. South Africa is especially interesting because it's upper-middle-income and would test whether the model overfits to low-income visual features.
  • Self-supervised satellite-image pretraining (replace ImageNet pretraining with SimCLR / MAE on unlabeled Landsat).
  • Building-footprints integration (Microsoft Open Buildings or Google Open Buildings) as additional input.
  • Vision Transformer architecture replacing ResNet-18.
  • Multi-modal model fusing satellite + Open Street Map road network data.
  • Pixel-level wealth maps (the Yeh 2020 24-h imagery pipeline).
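
The per-country r² report mentioned in the OOD item could look like the sketch below. It assumes r² means the squared Pearson correlation between predicted and survey-derived wealth, which should be checked against design.md; the function names and the toy numbers are made up, and the two-letter codes are used only as example keys.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def per_country_r2(countries, y_true, y_pred):
    """Squared Pearson correlation of predictions vs. labels, per country."""
    grouped = defaultdict(lambda: ([], []))
    for country, t, p in zip(countries, y_true, y_pred):
        grouped[country][0].append(t)
        grouped[country][1].append(p)
    return {c: pearson_r(t, p) ** 2 for c, (t, p) in grouped.items()}


# Toy inference-only output for two held-out countries (made-up values).
countries = ["BW", "BW", "BW", "NA", "NA", "NA"]
y_true = [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0]
y_pred = [-0.5, 0.0, 0.5, 0.0, 1.0, 0.0]
r2 = per_country_r2(countries, y_true, y_pred)
```

In the toy data the first country's predictions are a scaled copy of the labels and the second's are uncorrelated with them, so the two values sit at the extremes of the range. Squared correlation ignores scale and offset, so a country can score well here while its predictions are still biased; a per-country mean error alongside it would catch that.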