Transfer Learning for Cross-Regional Soybean Yield Prediction

May 17, 2026

Code and data accompanying:

Unraveling Domain Shifts in Deep Transfer Learning for Cross-regional Soybean Yield Estimation

Chishan Zhang, Chunyuan Diao, Yiqun Xie

Requirements

Environment Setup

A GPU is recommended for training. MAML meta-learning in particular requires substantial training time (days on a single GPU for the full 12-state cross-validation). The evaluation script (evaluate.py) runs on CPU without issue.

Dependencies

  • Python >= 3.8
  • PyTorch >= 1.12
  • learn2learn
  • scikit-learn, numpy, pandas, scipy
  • matplotlib, seaborn (for simulation plots)

Repository Structure

code/
├── models.py           Model architectures (LSTM, LSTMDANN)
├── data_utils.py       Data loading, normalization, state splitting
├── train_lstm.py       LSTM baseline training
├── train_maml.py       MAML meta-learning training
├── train_dann.py       DANN adversarial training
├── train_ftl.py        FTL fine-tuning demonstration
├── evaluate.py         Evaluate all methods on demo data
├── domain_shift.py     KL divergence domain shift quantification
├── demo_data/          Preprocessed subsets for IL and ND (inference only)
├── raw_data/           Public yield records and county boundaries
│   ├── US_yields.csv
│   ├── ARG_yields.csv
│   └── US_counties/    Corn Belt county shapefile
├── simulation/         APSIM crop model simulation analysis
│   ├── README.md
│   ├── plot_apsim.py   Reproduce Figures S2, S3, S4
│   └── simulation_results.db
└── weights/            Pre-trained model weights
    ├── lstm/           12 state-split LSTM weights
    ├── maml/           12 state-split MAML weights
    └── dann/           12 state-split DANN weights

Quick Start

Evaluate pre-trained models on demo data (Illinois and North Dakota):

python evaluate.py

This loads the pre-trained weights and runs inference and few-shot adaptation for LSTM, FTL, MAML, and DANN on two representative states: IL (low domain shift) and ND (high domain shift).

Models

| Method | Description | Target Data |
| --- | --- | --- |
| LSTM | Baseline, trained on source domain | None |
| FTL | Fine-tune pre-trained LSTM on target samples | 5 labeled |
| MAML | Meta-learned initialization, 2-step adaptation | 5 labeled |
| DANN | Adversarial feature alignment | Unlabeled |

All models share the same LSTM backbone:

  • Input: 13 features x 6 monthly time steps
  • LSTM: 1 layer, 100 hidden units
  • FC layers: 100 -> 100 -> 1 (regression output)
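As a rough illustration of the backbone dimensions listed above, the following PyTorch sketch wires up a network of that shape. The class name `LSTMBackbone`, the ReLU between the two fully connected layers, and the choice to read out the last time step are assumptions made for illustration; the actual definitions live in models.py and may differ.

```python
import torch
import torch.nn as nn


class LSTMBackbone(nn.Module):
    """Hypothetical sketch of the shared backbone; not the code in models.py."""

    def __init__(self, n_features=13, hidden_size=100):
        super().__init__()
        # 1-layer LSTM over the 6 monthly steps, 100 hidden units
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            num_layers=1, batch_first=True)
        # FC layers: 100 -> 100 -> 1
        self.fc1 = nn.Linear(hidden_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x has shape (batch, 6, 13)
        out, _ = self.lstm(x)
        last = out[:, -1, :]  # assumption: use the final time step's output
        return self.fc2(torch.relu(self.fc1(last)))  # assumption: ReLU activation


model = LSTMBackbone()
dummy = torch.randn(4, 6, 13)  # 4 synthetic samples, 6 months, 13 features
pred = model(dummy)            # shape (4, 1): one yield estimate per sample
n_params = sum(p.numel() for p in model.parameters())
```

Under these assumptions the network is small, which is consistent with evaluate.py running comfortably on CPU.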

Demo Data

The demo_data/ folder contains preprocessed subsets for two U.S. states:

  • IL (Illinois): Low domain shift case -- 300 test + 50 adaptation samples
  • ND (North Dakota): High covariate/prior shift case -- 300 test + 50 adaptation samples

Each NPZ file contains:

  • X: Input features (N, 6, 13) -- 6 monthly aggregates of 13 variables
  • Y: Soybean yield in kg/ha (N, 1)
  • ID: County FIPS code and year (N, 2)
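A minimal sketch of how the three arrays described above could be loaded and sanity-checked with NumPy. The helper name `check_demo_arrays` is hypothetical, and the example builds a small synthetic archive in memory rather than guessing at the actual file names inside demo_data/.

```python
import io

import numpy as np


def check_demo_arrays(npz):
    """Validate the documented shapes and return the sample count (hypothetical helper)."""
    X, Y, ID = npz["X"], npz["Y"], npz["ID"]
    n = X.shape[0]
    if X.shape[1:] != (6, 13):
        raise ValueError(f"expected X of shape (N, 6, 13), got {X.shape}")
    if Y.shape != (n, 1):
        raise ValueError(f"expected Y of shape (N, 1), got {Y.shape}")
    if ID.shape != (n, 2):
        raise ValueError(f"expected ID of shape (N, 2), got {ID.shape}")
    return n


# Synthetic stand-in for a demo file: 5 samples with the documented layout.
buf = io.BytesIO()
np.savez(buf,
         X=np.zeros((5, 6, 13)),
         Y=np.zeros((5, 1)),
         ID=np.zeros((5, 2), dtype=int))
buf.seek(0)

n_samples = check_demo_arrays(np.load(buf))
```

For a real file, `np.load` would be called on its path inside demo_data/ instead of the in-memory buffer.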

The 13 input features (after removing lat/lon) include EVI2, temperature, precipitation, VPD, land surface temperature, evapotranspiration, and soil moisture variables.

Raw Data

raw_data/ contains publicly available records:

  • US_yields.csv: County-level soybean yield (bu/acre) from USDA NASS, 2008-2021
  • ARG_yields.csv: Department-level soybean yield (kg/ha) from Argentina Ministry of Agriculture, 2008-2021
  • US_counties/: Corn Belt county boundary shapefile (12 states)
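Note that the raw U.S. yields are in bu/acre, while the demo NPZ targets and the Argentine records are in kg/ha. The sketch below shows one way to put them on a common scale, assuming the standard 60 lb soybean bushel; the function name is hypothetical, and whether the preprocessing used exactly these constants is an assumption.

```python
LB_PER_BUSHEL_SOYBEAN = 60.0   # standard U.S. soybean bushel weight
KG_PER_LB = 0.45359237         # exact by definition
HA_PER_ACRE = 0.40468564224    # international acre in hectares


def bu_per_acre_to_kg_per_ha(yield_bu_acre):
    """Convert a soybean yield from bu/acre to kg/ha (hypothetical helper)."""
    return yield_bu_acre * LB_PER_BUSHEL_SOYBEAN * KG_PER_LB / HA_PER_ACRE


factor = bu_per_acre_to_kg_per_ha(1.0)      # roughly 67.25 kg/ha per bu/acre
example = bu_per_acre_to_kg_per_ha(50.0)    # a 50 bu/acre county-year
```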

Simulations

The simulation/ directory contains APSIM crop model simulation results used to characterize the mechanistic drivers of domain shifts (Sections 2.4.1 and 3.1 of the manuscript).

Three sets of factorial experiments were conducted:

  • Part I: Environmental variation across 6 US Corn Belt locations (IA, MN, ND, KS, MO, NE)
  • Part II: Environment x Genotype factorial (6 climates x 8 maturity groups MG0-MG7)
  • Part III: Environment x Genotype x Management factorial (adding planting date variation)

To reproduce the simulation figures from the supplementary materials:

python simulation/plot_apsim.py

APSIM is freely available at https://www.apsim.info/. The .apsimx configuration files and .met weather forcing data used in this study are available from the corresponding author upon reasonable request. See simulation/README.md for detailed parameters.

Data Sources

To reproduce the full dataset from scratch, the following data sources are needed:

| Data | Source | Resolution |
| --- | --- | --- |
| Soybean yield (US) | USDA NASS QuickStats | County-level |
| Soybean yield (Argentina) | Argentina Ministry of Agriculture | Department-level |
| Crop mask (US) | USDA Cropland Data Layer | 30 m |
| Crop mask (Argentina) | Song et al. (2021), Nature Sustainability | 30 m |
| Vegetation index (EVI2) | MODIS MCD43A4 | 500 m |
| Temperature, precipitation, VPD | ERA5-Land | 0.1 deg |
| Land surface temperature | MODIS MYD11A2 | 1 km |
| Evapotranspiration, soil moisture | GLDAS Noah | 0.25 deg |

All time-series variables were aggregated to monthly scale during the growing season (May-October for the U.S.; November-May for Argentina) and spatially averaged to the county/department level.
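The temporal aggregation step can be sketched as follows for a single variable over the U.S. season. This is an illustration only: the function name is hypothetical, a plain monthly mean is assumed (some variables, such as precipitation, may have been summed instead), and the input is a synthetic daily series rather than any of the products listed above.

```python
from collections import defaultdict
from datetime import date, timedelta

US_SEASON_MONTHS = (5, 6, 7, 8, 9, 10)  # May-October, giving 6 monthly steps


def monthly_means(daily, months=US_SEASON_MONTHS):
    """Average (date, value) pairs per month, in season order (hypothetical helper).

    Raises ZeroDivisionError if a season month has no observations.
    """
    buckets = defaultdict(list)
    for day, value in daily:
        if day.month in months:
            buckets[day.month].append(value)
    return [sum(buckets[m]) / len(buckets[m]) for m in months]


# Synthetic daily series for 2020 in which each value equals its month number.
start = date(2020, 1, 1)
days = [start + timedelta(days=i) for i in range(366)]
daily = [(d, float(d.month)) for d in days]

features = monthly_means(daily)  # one value per growing-season month
```

The Argentine season crosses a calendar-year boundary, so a real implementation would need to key the buckets on (year, month) rather than month alone.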

Pre-trained Weights

| Folder | Files | Description |
| --- | --- | --- |
| weights/lstm/ | lstm_state{0..11}.pth | LSTM trained on 11 US states (leave-one-out) |
| weights/maml/ | maml_state{0..11}.pth | MAML meta-trained on 11 US states |
| weights/dann/ | dann_state{0..11}.pth | DANN trained on 11 US states (leave-one-out) |

State flag mapping: 0=IL, 1=IN, 2=IA, 3=KS, 4=MI, 5=MN, 6=MO, 7=NE, 8=ND, 9=OH, 10=SD, 11=WI.
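The mapping and file naming above can be captured in a few lines of Python. The helper names are hypothetical, and the sketch assumes that the index in each file name identifies the held-out (target) state of that leave-one-out split; check evaluate.py to confirm the convention.

```python
from pathlib import PurePosixPath

STATE_FLAGS = {
    0: "IL", 1: "IN", 2: "IA", 3: "KS", 4: "MI", 5: "MN",
    6: "MO", 7: "NE", 8: "ND", 9: "OH", 10: "SD", 11: "WI",
}
FLAG_OF_STATE = {abbr: flag for flag, abbr in STATE_FLAGS.items()}


def weight_path(method, state, root="weights"):
    """Build the weight file path for a method and held-out state (hypothetical helper)."""
    flag = FLAG_OF_STATE[state]
    return PurePosixPath(root) / method / f"{method}_state{flag}.pth"


def source_states(held_out):
    """Return the 11 states that remain when one state is held out."""
    return [abbr for abbr in STATE_FLAGS.values() if abbr != held_out]


nd_maml = weight_path("maml", "ND")   # weights/maml/maml_state8.pth
il_sources = source_states("IL")      # every state except Illinois
```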

Citation

If you use this code or data, please cite the accompanying paper.