🌍 EarthShift Testbed

May 29, 2026


EarthShift is the first public testbed for benchmarking the robustness of geospatial foundation models (GFMs) across multiple realistic distribution shifts encountered in remote sensing.

Current Earth observation benchmarks focus on performance across diverse tasks and applications, and they typically measure in-distribution generalization. Once deployed, however, models must generalize to many out-of-distribution scenarios: new time periods, geographies, scales, and sensors. EarthShift is designed to close this gap.


🔍 What is EarthShift?

EarthShift lets users measure distributional robustness by comparing in-distribution and out-of-distribution model performance on paired datasets that differ in:

  • 🗺️ Geographic locations: does the model generalize to unseen regions?
  • 📅 Temporal windows: does performance hold across different time periods?
  • 🛰️ Sensors: how robust is the model when the input sensor changes?
  • 📡 Data sources: can the model handle shifts between different data providers?
  • 🔭 Spatial scales: does the model transfer across different spatial resolutions?

Our experiments across 8 geospatial foundation models and 11 tasks covering all 5 shift types reveal that GFMs consistently perform around 20% worse out-of-distribution, regardless of model architecture, size, pre-training strategy, or fine-tuning approach. Strikingly, GFM robustness is similar to that of generic vision foundation models, and even fully-supervised models, highlighting that distributional robustness remains an open and critical challenge for the field.
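To make the in- versus out-of-distribution comparison concrete, here is a minimal Python sketch of one way to express a robustness gap for a higher-is-better metric such as accuracy or mIoU. The function name `robustness_gap` and the scores are illustrative assumptions, not EarthShift code or results, and the README does not state whether the reported drop is absolute or relative, so the sketch returns both.

```python
def robustness_gap(id_score: float, ood_score: float) -> dict:
    """Return the absolute and relative drop from in-distribution (ID)
    to out-of-distribution (OOD) performance.

    Assumes a higher-is-better metric; this definition is an assumption
    for illustration, not necessarily the one used by EarthShift.
    """
    if id_score <= 0:
        raise ValueError("id_score must be positive to compute a relative drop")
    absolute_drop = id_score - ood_score
    relative_drop = absolute_drop / id_score
    return {"absolute_drop": absolute_drop, "relative_drop": relative_drop}


# Made-up scores, chosen only to illustrate the arithmetic.
gap = robustness_gap(id_score=0.90, ood_score=0.72)
print(f"absolute drop: {gap['absolute_drop']:.2f}, "
      f"relative drop: {gap['relative_drop']:.0%}")
# prints: absolute drop: 0.18, relative drop: 20%
```

Reporting both numbers avoids ambiguity: the same pair of scores can be an 18-point absolute drop and a 20% relative drop.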


📦 Shift Types

EarthShift measures distribution shifts across five categories:

| Shift Type | Description |
|---|---|
| 🔭 Spatial Scale | Shifts in ground sampling distance / spatial resolution |
| 📅 Temporal | Shifts across different time periods or seasons |
| 🗺️ Geographic | Shifts across different geographic regions |
| 🛰️ Sensor | Shifts between different remote sensing sensors |
| 📡 Data Source | Shifts between different data collection sources or modalities |

🚀 Running the EarthShift Pipeline

The EarthShift pipeline can be run from the command line via:

python run_pipeline.py

⚙️ Configurable Parameters

| Parameter | Description |
|---|---|
| `--root_dir` | Root directory of datasets for fine-tuning and inference |
| `--task` | Model task: `class` (classification), `semseg` (semantic segmentation), or `od` (object detection) |
| `--model` | GFM or baseline model to evaluate |
| `--shift` | Shift experiment type: one of `data`, `sensor`, `location`, `temporal`, `scale` |
| `--dataset_pair` | Dataset pair for testing distribution shift (use `--help` for full list) |
| `--finetune_type` | Fine-tuning strategy: `head` (frozen backbone) or `full` (full fine-tuning) |
| `--save_dir` | Output directory for results |
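As a rough sketch of how these flags fit together, the Python helper below assembles a `run_pipeline.py` command line and checks the three flags whose allowed values are listed above. The helper `build_command`, the directory paths, and the `<model-name>` and `<dataset-pair>` placeholders are illustrative assumptions rather than part of EarthShift; the valid model and dataset-pair names come from `--help`.

```python
import shlex

# Allowed values as listed in the parameter descriptions above.
TASKS = ("class", "semseg", "od")
SHIFTS = ("data", "sensor", "location", "temporal", "scale")
FINETUNE_TYPES = ("head", "full")


def build_command(root_dir, task, model, shift, dataset_pair,
                  finetune_type, save_dir):
    """Build the argument list for one pipeline run (hypothetical helper)."""
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    if shift not in SHIFTS:
        raise ValueError(f"shift must be one of {SHIFTS}, got {shift!r}")
    if finetune_type not in FINETUNE_TYPES:
        raise ValueError(
            f"finetune_type must be one of {FINETUNE_TYPES}, got {finetune_type!r}"
        )
    return [
        "python", "run_pipeline.py",
        "--root_dir", root_dir,
        "--task", task,
        "--model", model,
        "--shift", shift,
        "--dataset_pair", dataset_pair,
        "--finetune_type", finetune_type,
        "--save_dir", save_dir,
    ]


cmd = build_command(
    root_dir="/path/to/datasets",   # placeholder path
    task="class",
    model="<model-name>",           # placeholder: see --help for valid names
    shift="sensor",
    dataset_pair="<dataset-pair>",  # placeholder: see --help for valid pairs
    finetune_type="head",
    save_dir="/path/to/results",    # placeholder path
)

# Print a shell-quoted command for inspection; nothing is executed here.
# To launch it, pass `cmd` to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps paths with spaces intact when the list is later handed to `subprocess.run`.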

💡 Why EarthShift?

EarthShift is motivated by a critical gap in how we evaluate remote sensing models: high benchmark performance does not imply robust real-world deployment. By providing a standardized testbed with paired in- and out-of-distribution datasets, EarthShift enables the community to:

  • 📊 Quantify the robustness gap for any model
  • 🔬 Diagnose which shift types are hardest for a given architecture
  • 🏆 Drive progress toward models that are not just accurate, but reliable
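One way to diagnose which shift types are hardest is to average the relative in- to out-of-distribution drop per shift type and sort the result. The sketch below assumes results are available as `(shift, id_score, ood_score)` tuples for a higher-is-better metric; that record layout, the `rank_shifts` helper, and every score shown are hypothetical and are not EarthShift outputs.

```python
from collections import defaultdict
from statistics import mean


def rank_shifts(results):
    """Rank shift types from largest to smallest mean relative drop.

    `results` is an iterable of (shift, id_score, ood_score) tuples;
    this layout is an assumption made for the example.
    """
    drops = defaultdict(list)
    for shift, id_score, ood_score in results:
        if id_score <= 0:
            raise ValueError("id_score must be positive")
        drops[shift].append((id_score - ood_score) / id_score)
    averaged = [(shift, mean(values)) for shift, values in drops.items()]
    return sorted(averaged, key=lambda item: item[1], reverse=True)


# Made-up scores for illustration only.
example_results = [
    ("sensor", 0.80, 0.60),
    ("sensor", 0.90, 0.72),
    ("temporal", 0.85, 0.80),
    ("scale", 0.75, 0.60),
]

ranking = rank_shifts(example_results)
for shift, drop in ranking:
    print(f"{shift}: mean relative drop {drop:.1%}")
```

With these placeholder numbers the sensor shift ranks hardest, followed by scale and then temporal.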

We release our code and datasets as a testbed to guide future work toward foundation models that are robust and reliable in real-world remote sensing applications.

Dataset License Details

| Dataset | License |
|---|---|
| RESISC45 | CC BY-NC-SA 4.0 |
| UCMerced | Public Domain |
| DeepGlobe | Non-commercial research and educational use only |
| DFC2022 | IGN's "licence ouverte" |
| FTW | CC BY-SA 4.0 |
| Sen1Floods11 | Open Access |
| BigEarthNet v2 | Community Data License Agreement -- Permissive -- Version 1.0 |
| m-EuroSat | MIT License |