Transfer Learning for Cross-Regional Soybean Yield Prediction

May 17, 2026

Code and data accompanying:

Unraveling Domain Shifts in Deep Transfer Learning for Cross-regional Soybean Yield Estimation
Chishan Zhang, Chunyuan Diao, Yiqun Xie

A GPU is recommended for training. MAML meta-learning is the most expensive method: the full 12-state cross-validation takes days on a single GPU. The evaluation script (evaluate.py) runs on CPU without issue.

Dependencies

Python >= 3.8
PyTorch >= 1.12
learn2learn
scikit-learn, numpy, pandas, scipy
matplotlib, seaborn (for simulation plots)

Repository Structure

code/
├── models.py           Model architectures (LSTM, LSTMDANN)
├── data_utils.py       Data loading, normalization, state splitting
├── train_lstm.py       LSTM baseline training
├── train_maml.py       MAML meta-learning training
├── train_dann.py       DANN adversarial training
├── train_ftl.py        FTL fine-tuning demonstration
├── evaluate.py         Evaluate all methods on demo data
├── domain_shift.py     KL divergence domain shift quantification
├── demo_data/          Preprocessed subsets for IL and ND (inference only)
├── raw_data/           Public yield records and county boundaries
│   ├── US_yields.csv
│   ├── ARG_yields.csv
│   └── US_counties/    Corn Belt county shapefile
├── simulation/         APSIM crop model simulation analysis
│   ├── README.md
│   ├── plot_apsim.py   Reproduce Figures S2, S3, S4
│   └── simulation_results.db
└── weights/            Pre-trained model weights
    ├── lstm/           12 state-split LSTM weights
    ├── maml/           12 state-split MAML weights
    └── dann/           12 state-split DANN weights

Quick Start

Evaluate pre-trained models on demo data (Illinois and North Dakota):

python evaluate.py

This loads the pre-trained weights and runs inference and few-shot adaptation for LSTM, FTL, MAML, and DANN on two representative states: IL (low domain shift) and ND (high domain shift).

Models

Method	Description	Target Data
LSTM	Baseline, trained on source domain	None
FTL	Fine-tune pre-trained LSTM on target samples	5 labeled
MAML	Meta-learned initialization, 2-step adaptation	5 labeled
DANN	Adversarial feature alignment	Unlabeled
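The FTL and MAML rows both adapt a source-trained model using a handful of labeled target samples. The loop below is a minimal sketch of that idea, with a linear regressor standing in for the LSTM so that it runs with NumPy alone. The learning rate, the synthetic data, and the names `adapt` and `mse` are assumptions for illustration; they are not taken from the repository.

```python
import numpy as np

def adapt(w, b, X, y, steps=2, lr=0.01):
    """Take a few gradient steps on mean squared error, starting from (w, b)."""
    w, b = w.copy(), float(b)
    for _ in range(steps):
        resid = X @ w + b - y                  # error on the labeled target samples
        w -= lr * 2.0 * (X.T @ resid) / len(y)
        b -= lr * 2.0 * resid.mean()
    return w, b

def mse(w, b, X, y):
    return float(np.mean((X @ w + b - y) ** 2))

rng = np.random.default_rng(0)
w_src, b_src = rng.normal(size=13), 0.0        # hypothetical "pre-trained" parameters
X_tgt = rng.normal(size=(5, 13))               # 5 labeled target samples, 13 features
y_tgt = X_tgt @ (w_src + 0.5) + 1.0            # target relation differs from the source

w_new, b_new = adapt(w_src, b_src, X_tgt, y_tgt, steps=2)
print(mse(w_src, b_src, X_tgt, y_tgt), mse(w_new, b_new, X_tgt, y_tgt))
```

In this sketch the two methods would differ only in where the starting point (w, b) comes from: ordinary source training for FTL, a meta-learned initialization for MAML.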

All models share the same LSTM backbone:

Input: 13 features x 6 monthly time steps
LSTM: 1 layer, 100 hidden units
FC layers: 100 -> 100 -> 1 (regression output)
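For orientation, the dimensions above can be written as a small PyTorch module. This is a sketch under stated assumptions, not the class defined in models.py: the name `BackboneSketch`, the ReLU between the two fully connected layers, and the use of the last time step as the regression input are all guesses that the text does not confirm.

```python
import torch
from torch import nn

class BackboneSketch(nn.Module):
    """Hypothetical restatement of the shared backbone; not the repository's class."""

    def __init__(self, n_features=13, hidden=100):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=1, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 100),
            nn.ReLU(),                  # assumed activation; the text does not name one
            nn.Linear(100, 1),
        )

    def forward(self, x):               # x: (batch, 6, 13)
        out, _ = self.lstm(x)           # out: (batch, 6, hidden)
        return self.head(out[:, -1])    # assumed: regress from the last time step

model = BackboneSketch()
y_hat = model(torch.zeros(4, 6, 13))
print(y_hat.shape)
```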

Demo Data

The demo_data/ folder contains preprocessed subsets for two U.S. states:

IL (Illinois): Low domain shift case -- 300 test + 50 adaptation samples
ND (North Dakota): High covariate/prior shift case -- 300 test + 50 adaptation samples
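The repository quantifies domain shift with KL divergence (domain_shift.py). As a rough illustration of what "low" and "high" shift mean, the sketch below fits a univariate Gaussian to one feature in each region and evaluates the closed-form KL divergence between the two fits. The Gaussian assumption, the direction of the divergence, the function name `gaussian_kl`, and the sample values are all hypothetical; the actual script may estimate the divergence differently.

```python
import math
import statistics

def gaussian_kl(p_samples, q_samples):
    """KL(P || Q) after fitting a univariate Gaussian to each sample set."""
    mu_p, mu_q = statistics.fmean(p_samples), statistics.fmean(q_samples)
    sd_p, sd_q = statistics.stdev(p_samples), statistics.stdev(q_samples)
    return (math.log(sd_q / sd_p)
            + (sd_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sd_q ** 2)
            - 0.5)

# Made-up values of a single feature in a source region and two target regions.
source = [20.1, 21.4, 19.8, 22.0, 20.7, 21.1]
near_target = [20.3, 21.0, 20.0, 21.8, 20.9, 21.3]
far_target = [15.2, 16.9, 14.8, 17.5, 15.9, 16.3]

print(gaussian_kl(near_target, source), gaussian_kl(far_target, source))
```

A target whose feature distribution sits close to the source gives a divergence near zero, while a displaced distribution gives a much larger value.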

Each NPZ file contains:

X: Input features (N, 6, 13) -- 6 monthly aggregates of 13 variables
Y: Soybean yield in kg/ha (N, 1)
ID: County FIPS code and year (N, 2)

The 13 input features (latitude and longitude are excluded) cover EVI2, temperature, precipitation, VPD, land surface temperature, evapotranspiration, and soil moisture variables.

Raw Data

raw_data/ contains publicly available records:

US_yields.csv: County-level soybean yield (bu/acre) from USDA NASS, 2008-2021
ARG_yields.csv: Department-level soybean yield (kg/ha) from Argentina Ministry of Agriculture, 2008-2021
US_counties/: Corn Belt county boundary shapefile (12 states)
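Note that US_yields.csv is in bu/acre, while the demo arrays and ARG_yields.csv are in kg/ha. The conversion below uses the standard 60 lb soybean bushel; whether the repository's preprocessing applies exactly these constants is an assumption, and the function name is hypothetical.

```python
LB_PER_BUSHEL_SOYBEAN = 60.0      # standard U.S. soybean bushel weight
KG_PER_LB = 0.45359237
HA_PER_ACRE = 0.40468564224

def bu_per_acre_to_kg_per_ha(yield_bu_acre):
    """Convert a soybean yield from bu/acre to kg/ha."""
    return yield_bu_acre * LB_PER_BUSHEL_SOYBEAN * KG_PER_LB / HA_PER_ACRE

print(bu_per_acre_to_kg_per_ha(50.0))   # a 50 bu/acre yield expressed in kg/ha
```

One bu/acre of soybean corresponds to roughly 67.25 kg/ha.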

Simulations

The simulation/ directory contains APSIM crop model simulation results used to characterize the mechanistic drivers of domain shifts (Sections 2.4.1 and 3.1 of the manuscript).

Three sets of factorial experiments were conducted:

Part I: Environmental variation across 6 US Corn Belt locations (IA, MN, ND, KS, MO, NE)
Part II: Environment x Genotype factorial (6 climates x 8 maturity groups MG0-MG7)
Part III: Environment x Genotype x Management factorial (adding planting date variation)
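The size of the Part II design follows directly from the factorial structure. The sketch below enumerates it, assuming the six climates are the six Part I locations (the text does not say so explicitly). Part III is left out because the number of planting dates is not given here.

```python
from itertools import product

locations = ["IA", "MN", "ND", "KS", "MO", "NE"]     # Part I sites
maturity_groups = [f"MG{i}" for i in range(8)]        # MG0 through MG7

# Part II: every climate paired with every maturity group.
part2 = list(product(locations, maturity_groups))
print(len(part2), part2[0], part2[-1])
```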

To reproduce the simulation figures from the supplementary materials:

python simulation/plot_apsim.py

APSIM is freely available at https://www.apsim.info/. The .apsimx configuration files and .met weather forcing data used in this study are available from the corresponding author upon reasonable request. See simulation/README.md for detailed parameters.

Data Sources

To reproduce the full dataset from scratch, the following data sources are needed:

Data	Source	Resolution
Soybean yield (US)	USDA NASS QuickStats	County-level
Soybean yield (Argentina)	Argentina Ministry of Agriculture	Department-level
Crop mask (US)	USDA Cropland Data Layer	30 m
Crop mask (Argentina)	Song et al. (2021), Nature Sustainability	30 m
Vegetation index (EVI2)	MODIS MCD43A4	500 m
Temperature, precipitation, VPD	ERA5-Land	0.1 deg
Land surface temperature	MODIS MYD11A2	1 km
Evapotranspiration, soil moisture	GLDAS Noah	0.25 deg

All time-series variables were aggregated to monthly scale during the growing season (May-October for the U.S.; November-May for Argentina) and spatially averaged to the county/department level.
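The temporal half of this aggregation can be sketched with the standard library. The example averages a daily series into six monthly values for the U.S. season. Using the mean is an assumption (a sum may be more appropriate for some variables, and the text does not say which was used), the function name is hypothetical, and the Argentine season, which spans two calendar years, is not handled.

```python
from collections import defaultdict
from datetime import date, timedelta

US_SEASON = [5, 6, 7, 8, 9, 10]   # May through October

def monthly_means(daily, months=US_SEASON):
    """Average (date, value) pairs by calendar month, ordered as in `months`."""
    buckets = defaultdict(list)
    for day, value in daily:
        if day.month in months:
            buckets[day.month].append(value)
    return [sum(buckets[m]) / len(buckets[m]) for m in months]

# Made-up daily series for one county-year: each value equals its month number.
start = date(2020, 1, 1)
days = [start + timedelta(days=i) for i in range(366)]
daily = [(d, float(d.month)) for d in days]

print(monthly_means(daily))
```

Applying this per variable yields the 6 x 13 layout of one sample in X.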

Pre-trained Weights

Folder	Files	Description
weights/lstm/	lstm_state{0..11}.pth	LSTM trained on 11 US states (leave-one-out)
weights/maml/	maml_state{0..11}.pth	MAML meta-trained on 11 US states
weights/dann/	dann_state{0..11}.pth	DANN trained on 11 US states (leave-one-out)

State flag mapping: 0=IL, 1=IN, 2=IA, 3=KS, 4=MI, 5=MN, 6=MO, 7=NE, 8=ND, 9=OH, 10=SD, 11=WI.
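The mapping and the file-name pattern above can be combined into a small lookup. The sketch assumes that the flag in each file name identifies the held-out state of that split; the helper names are hypothetical.

```python
STATES = ["IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI"]
STATE_FLAG = {abbr: flag for flag, abbr in enumerate(STATES)}

def weight_path(method, state):
    """Relative path of the checkpoint for the split that holds out `state`."""
    return f"weights/{method}/{method}_state{STATE_FLAG[state]}.pth"

def source_states(state):
    """The 11 states left for training when `state` is held out."""
    return [s for s in STATES if s != state]

print(weight_path("maml", "ND"))
print(source_states("ND"))
```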

Citation

If you use this code or data, please cite the accompanying paper.