EarthShift Testbed
May 29, 2026
| Project Website | arXiv |
|---|---|
EarthShift is the first public testbed for benchmarking the robustness of geospatial foundation models (GFMs) across multiple realistic distribution shifts encountered in remote sensing.
Current Earth observation benchmarks focus on measuring performance on diverse tasks and applications, typically measuring generalization in-distribution. But when models are deployed, they must generalize to myriad out-of-distribution scenarios: new time periods, geographies, scales, and sensors. EarthShift is designed to close this gap.
What is EarthShift?
EarthShift enables users to measure distributional robustness by comparing model performance in- and out-of-distribution, using paired datasets that differ in:
- Geographic locations: does the model generalize to unseen regions?
- Temporal windows: does performance hold across different time periods?
- Sensors: how robust is the model when the input sensor changes?
- Data sources: can the model handle shifts between different data providers?
- Spatial scales: does the model transfer across different spatial resolutions?
Our experiments across 8 geospatial foundation models and 11 tasks covering all 5 shift types reveal that GFMs consistently perform around 20% worse out-of-distribution, regardless of model architecture, size, pre-training strategy, or fine-tuning approach. Strikingly, GFM robustness is similar to that of generic vision foundation models, and even of fully-supervised models, highlighting that distributional robustness remains an open and critical challenge for the field.
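The in- versus out-of-distribution comparison described above can be sketched in a few lines. This is a minimal illustration, not the project's own code: the `RobustnessGap` class is a hypothetical name, it assumes a higher-is-better metric (for example accuracy, mIoU, or mAP), and it reports both an absolute and a relative drop because the text above does not say which of the two the "around 20%" figure refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RobustnessGap:
    """Hypothetical helper: compares one metric in- and out-of-distribution.

    Assumes a higher-is-better metric on a shared scale (e.g. 0..1).
    """

    in_dist: float   # metric on the in-distribution test set
    out_dist: float  # metric on the paired out-of-distribution test set

    @property
    def absolute_drop(self) -> float:
        # Positive when the model is worse out-of-distribution.
        return self.in_dist - self.out_dist

    @property
    def relative_drop(self) -> float:
        # Drop expressed as a fraction of in-distribution performance.
        if self.in_dist == 0:
            raise ValueError("in-distribution metric must be non-zero")
        return self.absolute_drop / self.in_dist


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from EarthShift.
    gap = RobustnessGap(in_dist=0.80, out_dist=0.64)
    print(f"absolute drop: {gap.absolute_drop:.2f}")
    print(f"relative drop: {gap.relative_drop:.0%}")
```

Reporting both quantities side by side avoids ambiguity when comparing models whose in-distribution scores differ widely.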
Shift Types
EarthShift measures distribution shifts across five categories:
| Shift Type | Description |
|---|---|
| Spatial Scale | Shifts in ground sampling distance / spatial resolution |
| Temporal | Shifts across different time periods or seasons |
| Geographic | Shifts across different geographic regions |
| Sensor | Shifts between different remote sensing sensors |
| Data Source | Shifts between different data collection sources or modalities |
Running the EarthShift Pipeline
The EarthShift pipeline can be run from the command line via:
python run_pipeline.py
Configurable Parameters
| Parameter | Description |
|---|---|
| `--root_dir` | Root directory of datasets for fine-tuning and inference |
| `--task` | Model task: `class` (classification), `semseg` (semantic segmentation), or `od` (object detection) |
| `--model` | GFM or baseline model to evaluate |
| `--shift` | Shift experiment type: one of `data`, `sensor`, `location`, `temporal`, `scale` |
| `--dataset_pair` | Dataset pair for testing distribution shift (use `--help` for full list) |
| `--finetune_type` | Fine-tuning strategy: `head` (frozen backbone) or `full` (full fine-tuning) |
| `--save_dir` | Output directory for results |
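As a rough illustration of how the flags in the table combine, the sketch below assembles a `run_pipeline.py` command line from Python and checks the three parameters whose allowed values are listed above. The helper `build_command` is hypothetical and not part of EarthShift. Valid `--model` and `--dataset_pair` values are not listed here, so the usage example leaves them as placeholders to be filled in from `--help`.

```python
import sys

# Allowed values as listed in the parameter table above.
TASKS = ("class", "semseg", "od")
SHIFTS = ("data", "sensor", "location", "temporal", "scale")
FINETUNE_TYPES = ("head", "full")


def build_command(root_dir, task, model, shift, dataset_pair,
                  finetune_type, save_dir, python=sys.executable):
    """Hypothetical helper: return an argv list for run_pipeline.py."""
    if task not in TASKS:
        raise ValueError(f"--task must be one of {TASKS}, got {task!r}")
    if shift not in SHIFTS:
        raise ValueError(f"--shift must be one of {SHIFTS}, got {shift!r}")
    if finetune_type not in FINETUNE_TYPES:
        raise ValueError(
            f"--finetune_type must be one of {FINETUNE_TYPES}, "
            f"got {finetune_type!r}"
        )
    return [
        python, "run_pipeline.py",
        "--root_dir", root_dir,
        "--task", task,
        "--model", model,
        "--shift", shift,
        "--dataset_pair", dataset_pair,
        "--finetune_type", finetune_type,
        "--save_dir", save_dir,
    ]


if __name__ == "__main__":
    # "<model>" and "<dataset_pair>" are placeholders, not real option values.
    cmd = build_command(
        root_dir="data/",
        task="class",
        model="<model>",
        shift="sensor",
        dataset_pair="<dataset_pair>",
        finetune_type="head",
        save_dir="results/",
    )
    print(" ".join(cmd))
```

Once the placeholders are replaced, the returned list can be passed to `subprocess.run(cmd, check=True)`, which makes it straightforward to loop over shift types or fine-tuning strategies.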
Why EarthShift?
EarthShift is motivated by a critical gap in how we evaluate remote sensing models: high benchmark performance does not imply robust real-world deployment. By providing a standardized testbed with paired in- and out-of-distribution datasets, EarthShift enables the community to:
- Quantify the robustness gap for any model
- Diagnose which shift types are hardest for a given architecture
- Drive progress toward models that are not just accurate, but reliable
We release our code and datasets as a testbed to guide future work toward foundation models that are robust and reliable in real-world remote sensing applications.
Dataset License Details
| Dataset | License |
|---|---|
| RESISC45 | CC BY-NC-SA 4.0 |
| UCMerced | Public Domain |
| DeepGlobe | Non-commercial research and educational use only |
| DFC2022 | IGN's "licence ouverte" |
| FTW | CC BY-SA 4.0 |
| Sen1Floods11 | Open Access |
| BigEarthNet v2 | Community Data License Agreement -- Permissive -- Version 1.0 |
| m-EuroSat | MIT License |