Transfer Learning for Cross-Regional Soybean Yield Prediction
May 17, 2026
Code and data accompanying:
Unraveling Domain Shifts in Deep Transfer Learning for Cross-regional Soybean Yield Estimation
Chishan Zhang, Chunyuan Diao, Yiqun Xie
Requirements
Environment Setup
A GPU is recommended for training. MAML meta-learning in particular requires substantial
training time (days on a single GPU for the full 12-state cross-validation).
The evaluation script (evaluate.py) runs on CPU without issue.
Dependencies
- Python >= 3.8
- PyTorch >= 1.12
- learn2learn
- scikit-learn, numpy, pandas, scipy
- matplotlib, seaborn (for simulation plots)
Repository Structure
code/
├── models.py Model architectures (LSTM, LSTMDANN)
├── data_utils.py Data loading, normalization, state splitting
├── train_lstm.py LSTM baseline training
├── train_maml.py MAML meta-learning training
├── train_dann.py DANN adversarial training
├── train_ftl.py FTL fine-tuning demonstration
├── evaluate.py Evaluate all methods on demo data
├── domain_shift.py KL divergence domain shift quantification
├── demo_data/ Preprocessed subsets for IL and ND (inference only)
├── raw_data/ Public yield records and county boundaries
│ ├── US_yields.csv
│ ├── ARG_yields.csv
│ └── US_counties/ Corn Belt county shapefile
├── simulation/ APSIM crop model simulation analysis
│ ├── README.md
│ ├── plot_apsim.py Reproduce Figures S2, S3, S4
│ └── simulation_results.db
└── weights/ Pre-trained model weights
    ├── lstm/ 12 state-split LSTM weights
    ├── maml/ 12 state-split MAML weights
    └── dann/ 12 state-split DANN weights
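For readers unfamiliar with KL divergence as a measure of domain shift (the role of domain_shift.py), the sketch below shows one simple way such a score could be computed for a single feature. It is an illustration only, not the repository's implementation: it assumes each sample can be approximated by a univariate Gaussian, and the function name `gaussian_kl` and the synthetic data are hypothetical.

```python
import numpy as np

def gaussian_kl(source, target, eps=1e-12):
    """KL(source || target) for two 1-D samples, each approximated by a Gaussian."""
    mu_s, mu_t = np.mean(source), np.mean(target)
    var_s, var_t = np.var(source) + eps, np.var(target) + eps
    # Closed-form KL divergence between two univariate normal distributions.
    return 0.5 * (np.log(var_t / var_s) + (var_s + (mu_s - mu_t) ** 2) / var_t - 1.0)

# Synthetic stand-ins for one feature in a source region and two target regions.
rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, scale=1.0, size=5000)
near_target = rng.normal(loc=0.1, scale=1.0, size=5000)   # mild shift
far_target = rng.normal(loc=2.0, scale=1.5, size=5000)    # strong shift

kl_near = gaussian_kl(source, near_target)
kl_far = gaussian_kl(source, far_target)
print(f"KL to near target: {kl_near:.3f}, KL to far target: {kl_far:.3f}")
```

A larger value indicates that the target distribution sits further from the source distribution. KL divergence is asymmetric, so the direction of comparison matters.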
Quick Start
Evaluate pre-trained models on demo data (Illinois and North Dakota):
python evaluate.py
This loads pre-trained weights and runs inference + few-shot adaptation for LSTM, FTL, MAML, and DANN on two representative states: IL (low domain shift) and ND (high domain shift).
Models
| Method | Description | Target Data |
|---|---|---|
| LSTM | Baseline, trained on source domain | None |
| FTL | Fine-tune pre-trained LSTM on target samples | 5 labeled |
| MAML | Meta-learned initialization, 2-step adaptation | 5 labeled |
| DANN | Adversarial feature alignment | Unlabeled |
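FTL and MAML both adapt a trained model to the target region with a handful of labeled samples. A minimal sketch of that target-side step is shown below, assuming plain SGD with an MSE loss. The function name `adapt_few_shot`, the learning rate, and the stand-in linear model are all hypothetical, and the training scripts in this repository may differ.

```python
import copy
import torch
from torch import nn

def adapt_few_shot(model, x_support, y_support, steps=2, lr=0.01):
    """Return a copy of `model` updated by `steps` SGD steps on the support set."""
    adapted = copy.deepcopy(model)  # leave the pre-trained weights untouched
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    adapted.train()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(adapted(x_support), y_support)
        loss.backward()
        optimizer.step()
    return adapted

torch.manual_seed(0)
# Stand-in regressor over flattened (6 x 13) inputs; the real backbone is an LSTM.
model = nn.Sequential(nn.Flatten(), nn.Linear(6 * 13, 1))
x_support = torch.randn(5, 6, 13)  # 5 labeled target samples (synthetic here)
y_support = torch.randn(5, 1)
adapted = adapt_few_shot(model, x_support, y_support, steps=2)
```

In this sketch the two methods differ only in where `model` comes from: a conventionally pre-trained LSTM for FTL, or a meta-learned initialization for MAML.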
All models share the same LSTM backbone:
- Input: 13 features x 6 monthly time steps
- LSTM: 1 layer, 100 hidden units
- FC layers: 100 -> 100 -> 1 (regression output)
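The backbone described above could be written in PyTorch roughly as follows. This is a sketch built from the listed dimensions only: the class name `LSTMRegressor`, the ReLU activation, and the use of the last time step as the sequence summary are assumptions, so see models.py for the actual architecture.

```python
import torch
from torch import nn

class LSTMRegressor(nn.Module):
    """1-layer LSTM (100 hidden units) followed by FC layers 100 -> 100 -> 1."""

    def __init__(self, n_features=13, hidden_size=100):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers=1, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 100),
            nn.ReLU(),          # assumption: the activation is not stated in this README
            nn.Linear(100, 1),  # single regression output (yield)
        )

    def forward(self, x):
        out, _ = self.lstm(x)            # out: (batch, 6, hidden_size)
        return self.head(out[:, -1, :])  # assumption: summarize with the last time step

model = LSTMRegressor()
x = torch.randn(4, 6, 13)  # batch of 4 samples, 6 monthly steps, 13 features
y_hat = model(x)
print(y_hat.shape)
```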
Demo Data
The demo_data/ folder contains preprocessed subsets for two U.S. states:
- IL (Illinois): Low domain shift case -- 300 test + 50 adaptation samples
- ND (North Dakota): High covariate/prior shift case -- 300 test + 50 adaptation samples
Each NPZ file contains:
- X: Input features (N, 6, 13) -- 6 monthly aggregates of 13 variables
- Y: Soybean yield in kg/ha (N, 1)
- ID: County FIPS code and year (N, 2)
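As a quick check of the NPZ layout, the sketch below writes a small synthetic file with the documented keys and shapes and reads it back. The file name `example_state.npz` and all values are hypothetical; to inspect the real data, point `np.load` at a file in demo_data/ instead.

```python
import os
import tempfile
import numpy as np

# Build a small synthetic file with the documented layout (hypothetical file name).
rng = np.random.default_rng(0)
n = 8
path = os.path.join(tempfile.mkdtemp(), "example_state.npz")
np.savez(
    path,
    X=rng.normal(size=(n, 6, 13)).astype(np.float32),            # features
    Y=rng.uniform(1500, 4500, size=(n, 1)).astype(np.float32),   # yield, kg/ha
    ID=np.column_stack([np.full(n, 17001), np.arange(2008, 2008 + n)]),  # FIPS, year
)

# Read the arrays back and split the ID columns.
with np.load(path) as data:
    X, Y, ID = data["X"], data["Y"], data["ID"]

fips, year = ID[:, 0], ID[:, 1]
print(X.shape, Y.shape, ID.shape)
```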
The 13 input features (after removing lat/lon) include EVI2, temperature, precipitation, VPD, land surface temperature, evapotranspiration, and soil moisture variables.
Raw Data
raw_data/ contains publicly available records:
- US_yields.csv: County-level soybean yield (bu/acre) from USDA NASS, 2008-2021
- ARG_yields.csv: Department-level soybean yield (kg/ha) from Argentina Ministry of Agriculture, 2008-2021
- US_counties/: Corn Belt county boundary shapefile (12 states)
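The U.S. records are in bu/acre while the Argentine records and the demo targets are in kg/ha, so the units need to be harmonized before the two are compared. A minimal conversion sketch follows, assuming the standard 60 lb soybean bushel. Whether the authors applied exactly this conversion is not stated here, and the function name is hypothetical.

```python
LB_PER_BUSHEL_SOYBEAN = 60.0    # standard soybean bushel weight (assumed)
KG_PER_LB = 0.45359237          # exact by definition
HA_PER_ACRE = 0.40468564224     # international acre in hectares

def bu_per_acre_to_kg_per_ha(yield_bu_acre):
    """Convert a soybean yield from bushels per acre to kilograms per hectare."""
    return yield_bu_acre * LB_PER_BUSHEL_SOYBEAN * KG_PER_LB / HA_PER_ACRE

factor = bu_per_acre_to_kg_per_ha(1.0)
print(f"1 bu/acre is about {factor:.2f} kg/ha")
```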
Simulations
The simulation/ directory contains APSIM crop model simulation results used to characterize the mechanistic drivers of domain shifts (Sections 2.4.1 and 3.1 of the manuscript).
Three sets of factorial experiments were conducted:
- Part I: Environmental variation across 6 US Corn Belt locations (IA, MN, ND, KS, MO, NE)
- Part II: Environment x Genotype factorial (6 climates x 8 maturity groups MG0-MG7)
- Part III: Environment x Genotype x Management factorial (adding planting date variation)
To reproduce the simulation figures from the supplementary materials:
python simulation/plot_apsim.py
APSIM is freely available at https://www.apsim.info/. The .apsimx configuration files
and .met weather forcing data used in this study are available from the corresponding
author upon reasonable request. See simulation/README.md for detailed parameters.
Data Sources
To reproduce the full dataset from scratch, the following data sources are needed:
| Data | Source | Resolution |
|---|---|---|
| Soybean yield (US) | USDA NASS QuickStats | County-level |
| Soybean yield (Argentina) | Argentina Ministry of Agriculture | Department-level |
| Crop mask (US) | USDA Cropland Data Layer | 30 m |
| Crop mask (Argentina) | Song et al. (2021), Nature Sustainability | 30 m |
| Vegetation index (EVI2) | MODIS MCD43A4 | 500 m |
| Temperature, precipitation, VPD | ERA5-Land | 0.1 deg |
| Land surface temperature | MODIS MYD11A2 | 1 km |
| Evapotranspiration, soil moisture | GLDAS Noah | 0.25 deg |
All time-series variables were aggregated to monthly scale during the growing season (May-October for the U.S.; November-May for Argentina) and spatially averaged to the county/department level.
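The aggregation described above can be illustrated with pandas. This is a toy sketch rather than the authors' preprocessing pipeline: the column names, county codes, and values are made up, and it averages all pixel-days within a county-month in a single step.

```python
import pandas as pd

# Hypothetical daily, per-pixel records (column names are illustrative).
records = pd.DataFrame({
    "county": ["17001"] * 4 + ["38017"] * 4,
    "date": pd.to_datetime(
        ["2020-05-01", "2020-05-02", "2020-06-01", "2020-06-02"] * 2
    ),
    "evi2": [0.20, 0.30, 0.50, 0.70, 0.10, 0.20, 0.30, 0.50],
})
records["year"] = records["date"].dt.year
records["month"] = records["date"].dt.month

# Keep growing-season months (May-October for the U.S.).
season = records[records["month"].between(5, 10)]

# Average within each county-month, then pivot months into columns.
monthly = (
    season.groupby(["county", "year", "month"])["evi2"].mean().unstack("month")
)
print(monthly)
```

Each row of `monthly` is one county-year and each column one month, the same layout as one feature's time series in the model input.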
Pre-trained Weights
| Folder | Files | Description |
|---|---|---|
| weights/lstm/ | lstm_state{0..11}.pth | LSTM trained on 11 US states (leave-one-out) |
| weights/maml/ | maml_state{0..11}.pth | MAML meta-trained on 11 US states |
| weights/dann/ | dann_state{0..11}.pth | DANN trained on 11 US states (leave-one-out) |
State flag mapping: 0=IL, 1=IN, 2=IA, 3=KS, 4=MI, 5=MN, 6=MO, 7=NE, 8=ND, 9=OH, 10=SD, 11=WI.
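The flag mapping and file naming pattern can be combined into a small lookup helper. The sketch below assumes that the flag in each file name identifies the held-out state of that leave-one-out split. This README does not state that explicitly, and the helper names are hypothetical.

```python
# State flag mapping from this README: list index = flag.
STATES = ["IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI"]

def weight_path(method, held_out_state):
    """Build the weight file path for a method ('lstm', 'maml', 'dann') and state."""
    flag = STATES.index(held_out_state)  # assumption: flag = held-out state
    return f"weights/{method}/{method}_state{flag}.pth"

def source_states(held_out_state):
    """The 11 remaining states used for training in a leave-one-out split."""
    return [s for s in STATES if s != held_out_state]

print(weight_path("lstm", "ND"))
print(source_states("ND"))
```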
Citation
If you use this code or data, please cite the accompanying paper.