🌾 UniCrop: A Universal Data Pipeline for Crop Yield Modelling
December 29, 2025
UniCrop is a configuration-driven, universal data pipeline designed to automate the construction of analysis-ready environmental datasets for crop yield modelling.
Given field locations, dates, and a declarative feature specification, UniCrop automatically retrieves, harmonises, engineers, and selects predictors from multi-source satellite, climate, soil, and topographic data.
UniCrop focuses on data engineering and reproducibility, rather than proposing new machine-learning algorithms, enabling scalable and transparent crop yield modelling across regions and crops.
Key Features
- Universal & reusable pipeline configurable for different crops, regions, and time windows
- Multi-source data integration:
  - Sentinel-2 (optical remote sensing)
  - Sentinel-1 (SAR backscatter)
  - MODIS vegetation products
  - ERA5-Land climate reanalysis
  - NASA POWER agro-climatology
  - SoilGrids soil properties
  - SRTM topography
- Automated data harmonisation:
  - Temporal alignment
  - Spatial aggregation
  - Provenance tracking
- Agronomic feature engineering:
  - Growing Degree Days (GDD)
  - Vegetation dynamics
  - SAR texture metrics
  - Soil–climate interaction features
- Statistical feature selection:
  - Near-zero variance filtering
  - High-correlation pruning
  - Minimum Redundancy Maximum Relevance (mRMR)
- Baseline modelling & interpretability:
  - LightGBM, Random Forest, SVR, ElasticNet
  - Constrained ensemble modelling
  - SHAP-based interpretability
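To make the Growing Degree Days feature listed above concrete, here is a minimal sketch of one common formulation: the averaging method with a lower and an upper temperature cutoff. The 10 °C base and 30 °C cap are assumptions often used for maize, not values taken from UniCrop, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def daily_gdd(t_min: float, t_max: float,
              t_base: float = 10.0, t_cap: float = 30.0) -> float:
    """Growing degree days for one day (averaging method with cutoffs).

    t_base and t_cap are assumed defaults, not UniCrop settings.
    """
    # Clamp both temperatures into [t_base, t_cap] before averaging,
    # so days colder than the base contribute zero.
    t_min_c = min(max(t_min, t_base), t_cap)
    t_max_c = min(max(t_max, t_base), t_cap)
    return (t_min_c + t_max_c) / 2.0 - t_base


def cumulative_gdd(daily_temps: Iterable[Tuple[float, float]],
                   t_base: float = 10.0, t_cap: float = 30.0) -> List[float]:
    """Running GDD total over a sequence of (t_min, t_max) pairs in °C."""
    total = 0.0
    running = []
    for t_min, t_max in daily_temps:
        total += daily_gdd(t_min, t_max, t_base, t_cap)
        running.append(total)
    return running
```

The running total is what usually enters a yield model, for example the accumulated GDD between sowing and a given growth stage.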
Design Philosophy
UniCrop separates data specification from data implementation.
All required environmental variables are defined in a human-readable feature mapping file, allowing users to adapt the pipeline to new crops or regions without modifying code. This design promotes portability, reproducibility, and scalability.
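The separation described above can be sketched as follows: a declarative mapping is read as data and grouped by source, so that adding a variable means adding a row rather than editing code. The column names (`variable`, `source`, `parameter`) and the example rows are assumptions for illustration; the actual schema of the UniCrop mapping file may differ.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical mapping content; not copied from the repository.
EXAMPLE_MAPPING = """variable,source,parameter
t2m_max,NASA_POWER,T2M_MAX
precip,NASA_POWER,PRECTOTCORR
ndvi,SENTINEL2,NDVI
elevation,SRTM,elevation
"""


def build_fetch_plan(mapping_csv: str) -> Dict[str, List[Tuple[str, str]]]:
    """Group requested variables by data source.

    Returns {source: [(variable_name, source_parameter), ...]}.
    """
    plan = defaultdict(list)
    for row in csv.DictReader(io.StringIO(mapping_csv)):
        source = row["source"].strip()
        plan[source].append((row["variable"].strip(), row["parameter"].strip()))
    return dict(plan)


if __name__ == "__main__":
    for source, variables in build_fetch_plan(EXAMPLE_MAPPING).items():
        print(source, variables)
```

Grouping by source lets each provider be queried once per field and date range, whatever the number of variables requested from it.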
Repository Structure
unicrop/
│
├── unicrop_main.py                    # Main pipeline execution script
├── requirements.txt                   # Python package details
├── requirements_optional.txt          # Optional package imports
├── README - FOR NEW DATA USAGE.md
│
├── source_codes/
│   ├── pipeline.py                    # Data acquisition and harmonisation
│   ├── modeller.py                    # Feature engineering, selection, modelling
│   ├── config.py                      # Pipeline and model configuration
│   ├── paths.py                       # Folder details for data, sources, etc.
│   └── sources.py                     # Additional source codes
│
├── data/
│   └── sample_data.csv
│
├── source_files/
│   ├── cleaned_feature_mapping.csv    # Declarative feature specification
│   ├── cleaned_input_table.csv
│   ├── unicrop_feature_mapping.csv
│   └── fetch_plan.csv
│
├── sample_data_output/
│   ├── unicrop_master_timeseries.csv
│   ├── unicrop_columns_manifest.csv
│   ├── unicrop_model_artifacts1.pkl
│   ├── unicrop_final_report.md
│   └── unicrop_figures/               # Folder storing figures saved from sample_data.csv modelling
│       └── ...
│
└── README.md
Quick Start
1️⃣ Prerequisites
- Python ≥ 3.9
- Google Earth Engine account (for satellite data access)
Install dependencies:
pip install -r requirements.txt
Authenticate Google Earth Engine (once):
earthengine authenticate
2️⃣ Configure Features
Edit unicrop_feature_mapping.csv to define:
- variable names
- data sources
- API parameters
- optional derivation rules
Each row corresponds to one environmental variable.
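Because each row describes one variable, a mapping file can be sanity-checked before any download starts. The sketch below is one possible check, not part of UniCrop: the required column names are hypothetical, and the rules (non-empty, unique variable names and a non-empty source) are assumptions about what a usable mapping needs.

```python
import csv
import io
from typing import List

# Hypothetical required columns; adjust to the real mapping header.
REQUIRED_COLUMNS = ("variable", "source", "parameter")


def validate_mapping(mapping_csv: str) -> List[str]:
    """Return human-readable problems; an empty list means no issue found."""
    reader = csv.DictReader(io.StringIO(mapping_csv))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        name = (row["variable"] or "").strip()
        if not name:
            problems.append(f"line {line_no}: empty variable name")
        elif name in seen:
            problems.append(f"line {line_no}: duplicate variable '{name}'")
        seen.add(name)
        if not (row["source"] or "").strip():
            problems.append(f"line {line_no}: no data source for '{name}'")
    return problems
```

Running a check like this first surfaces a typo in the mapping immediately, rather than partway through a long download.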
3️⃣ Run the Pipeline
python unicrop_main.py
This runs two stages:
- Downloading Stage (runs only ONCE for a new dataset)
  - Clean and validate field-level input data
  - Generate an automated fetch plan
  - Download and harmonise multi-source environmental data
  - Engineer agronomic features
- Modelling Stage
  - Perform statistical screening and mRMR feature selection
  - Train the baseline models and the ensemble
  - Export modelling artefacts and visualisation data
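The statistical screening step in the Modelling Stage can be illustrated with a small, dependency-free sketch. It shows only a variance threshold followed by greedy high-correlation pruning; mRMR is not implemented here. The thresholds, the function names, and the keep-first tie-breaking rule are assumptions and are not taken from `modeller.py`.

```python
from math import sqrt
from statistics import fmean, pvariance
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def screen_features(columns: Dict[str, Sequence[float]],
                    var_tol: float = 1e-8,
                    corr_max: float = 0.95) -> List[str]:
    """Drop near-constant columns, then prune highly correlated ones."""
    # Step 1: keep only columns whose variance exceeds the tolerance.
    varying = [name for name, values in columns.items()
               if pvariance(values) > var_tol]

    # Step 2: walk the columns in order and keep one only if it is not
    # highly correlated with any column that has already been kept.
    selected: List[str] = []
    for name in varying:
        if all(abs(pearson(columns[name], columns[kept])) <= corr_max
               for kept in selected):
            selected.append(name)
    return selected
```

With this greedy rule the column order decides which member of a correlated pair survives, so a real pipeline would typically order candidates by relevance to yield first.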
The repository already includes the downloaded data for sample_data.csv. Running the script as-is therefore skips the Downloading Stage and runs only the Modelling Stage, producing the performance and prediction outputs.
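One simple way to implement this kind of skip is to test whether the harmonised dataset already exists on disk. The sketch below is an assumption about how such a check could look; the function name and the rule (a non-empty master CSV means the download can be skipped) are hypothetical and may not match what `unicrop_main.py` actually does.

```python
from pathlib import Path
from typing import List


def stages_to_run(master_csv: Path) -> List[str]:
    """Decide which stages are needed, given the cached master dataset path."""
    # A non-empty master time series is treated as a completed download.
    if master_csv.is_file() and master_csv.stat().st_size > 0:
        return ["modelling"]
    return ["downloading", "modelling"]


if __name__ == "__main__":
    cached = Path("sample_data_output") / "unicrop_master_timeseries.csv"
    print(stages_to_run(cached))
```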
Outputs
Key outputs include:
- unicrop_master_timeseries.csv --> Harmonised multi-source dataset before feature selection
- unicrop_model_artifacts1.pkl --> Trained models, selected features, feature families, ensemble weights
- unicrop_final_report.md --> Summary of modelling results
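To check what the artefact file contains without assuming its internal layout, a small helper can list the top-level entries and their types. This is a sketch: the helper name is hypothetical, and whether the pickle holds a dictionary is an assumption, so the code handles both cases. Unpickling runs arbitrary code, so only load files you trust, and note that loading trained models requires the libraries they were built with to be installed.

```python
import pickle
from pathlib import Path
from typing import Dict, Union


def summarise_artifacts(pkl_path: Union[str, Path]) -> Dict[str, str]:
    """Map each top-level entry of a pickled object to its type name."""
    with open(pkl_path, "rb") as handle:
        artifacts = pickle.load(handle)  # only for trusted files
    if isinstance(artifacts, dict):
        return {str(key): type(value).__name__
                for key, value in artifacts.items()}
    # Not a dict: report the type of the root object instead.
    return {"<root>": type(artifacts).__name__}


if __name__ == "__main__":
    path = Path("sample_data_output") / "unicrop_model_artifacts1.pkl"
    if path.is_file():
        for name, kind in summarise_artifacts(path).items():
            print(f"{name}: {kind}")
```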
Case Study
Public Crop Yield Case Study (Spain – Maize)
For the open-source release on GitHub, UniCrop is demonstrated using a publicly available maize yield dataset from Spain, sourced from the Wageningen University & Research (WUR) AI sample data repository:
https://github.com/WUR-AI/sample_data/tree/main
The dataset contains annual maize yield observations aggregated at the regional level, along with location identifiers that can be linked to geographic coordinates. To align with UniCrop's temporal modelling assumptions and satellite data availability, we restrict the dataset to harvest years from 2010 onwards. The processed data used in this repository are provided in the data/ directory.
Purpose of the Case Study
This case study demonstrates that:
- UniCrop can be executed entirely using public, non-proprietary agricultural datasets
- Annual (year-level) harvest information can be integrated using UniCrop's date-anchoring strategy
- Automated data pipelines produce consistent and interpretable environmental predictors from NASA POWER, Sentinel-2, MODIS, and SRTM
- The resulting features support robust baseline yield modelling without manual data engineering
Scope and Limitations
The Spain maize example is intended as a methodological demonstration, not as a claim of state-of-the-art crop yield prediction performance. Model accuracy depends on data availability, spatial resolution, and management information, which may be limited in public datasets.
Nevertheless, the case study highlights UniCrop's key strengths:
- Reproducible data acquisition
- Transparent feature construction
- Modular modelling and benchmarking
- Suitability for comparative and exploratory crop-yield analysis
Related Publication and Citation
If you use UniCrop in your research, please cite:
UniCrop: A Universal, Multi-Source Data Engineering Pipeline for Scalable Crop Yield Prediction
E. Khidirova and O. Karakus, arXiv preprint, 2025.
BibTeX
@article{karakus2025unicrop,
  title   = {UniCrop: A Universal, Multi-Source Data Engineering Pipeline for Scalable Crop Yield Prediction},
  author  = {Khidirova, Emiliya and Karakus, Oktay},
  journal = {arXiv preprint arXiv:250X.XXXXX},
  year    = {2025}
}
⚠️ Scope and Limitations
- UniCrop does not propose new machine-learning algorithms
- Model performance depends on input data quality
- Satellite data availability may vary by region and season
- UniCrop is intended as a data engineering foundation for downstream modelling and analysis.
Contributing
Contributions are welcome, particularly:
- additional feature mappings
- support for new data sources
- enhancements to feature engineering modules
Please open an issue or submit a pull request.
Contact
Oktay Karakus
Cardiff University
karakuso@cardiff.ac.uk
👩‍💻 Development and Contributions
This codebase was developed by Emiliya Khidirova as part of her MSc dissertation at Cardiff University (2025).
- All core coding, implementation, and pipeline development were carried out by Emiliya Khidirova.
- The study was supervised by Dr. Oktay Karakus, who provided research guidance, conceptual oversight, and feedback.
- Dr. Karakus also contributed minor cosmetic refinements to the final published data products and code structure in preparation for public release.
This repository reflects the original MSc research work, released in the interest of transparency, reproducibility, and community reuse.
License
This project is released under the MIT License.