VCNet - a weakly supervised end-to-end deep-learning network for large-scale HR land-cover mapping
October 30, 2025
High-resolution (HR) land-cover mapping is an important task for surveying the Earth's surface and supporting decision-making in sectors such as agriculture, forestry, and smart cities. However, it is impeded by the scarcity of high-quality HR labels, complex ground details, and high computational cost. To address these challenges, we propose VCNet, a weakly supervised end-to-end deep-learning network for large-scale HR land-cover mapping. VCNet automatically generates HR maps using low-resolution (LR) historical land-cover products as guidance, eliminating the need for manual annotation and human intervention.
In this study, we use the framework to produce a 1-m HR land-cover map of Shanghai, the economic epicenter of China, with an accuracy of 72.26%. The complete 1-m resolution land-cover mapping results for Shanghai are shown below.
Environment Requirements
Hardware Requirements
- NVIDIA GPU (recommended video memory ≥ 8GB)
- System memory ≥ 16GB
- Storage: Approximately 20GB for datasets and models
Software Dependencies
torch==1.4.0
torchvision==0.5.0
numpy>=1.19.5
tqdm>=4.62.3
tensorboard>=2.7.0
tensorboardX>=2.5.1
ml-collections>=0.1.0
medpy>=0.4.0
SimpleITK>=2.1.1
scipy>=1.7.3
h5py>=3.6.0
rasterio==1.2.10
easydict>=1.9
Dataset Preparation
The Chesapeake Bay Dataset
The Chesapeake Bay Dataset contains 1-meter resolution images and a 30-meter resolution land-cover product as training data pairs, as well as a 1-meter resolution ground reference for assessment. Download the dataset at https://lila.science/datasets/chesapeakelandcover and place it under ./dataset/Chesapeake_NewYork_dataset.
- The HR aerial images with 1-meter resolution were captured by the U.S. Department of Agriculture’s National Agriculture Imagery Program (NAIP).
- The LR labels with 30-meter resolution, derived from the USGS's National Land Cover Database (NLCD), consist of 16 land-cover classes.
- The HR (1 m) ground truths used for accuracy assessment were obtained from the Chesapeake Bay Conservancy Land Cover (CCLC) project.
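In this training pairing, each 30-m NLCD label pixel covers a 30×30 block of 1-m NAIP pixels. A minimal numpy sketch of projecting the LR labels onto the HR image grid (the function name and class ids here are illustrative, not taken from the codebase):

```python
import numpy as np

def lr_label_to_hr_grid(lr_label: np.ndarray, scale: int = 30) -> np.ndarray:
    """Nearest-neighbor expand an LR label raster onto the HR grid, so each
    LR label pixel covers a scale x scale block of HR pixels."""
    return np.repeat(np.repeat(lr_label, scale, axis=0), scale, axis=1)

# Toy 2x2 NLCD-style label patch (class ids are arbitrary here)
lr = np.array([[11, 21],
               [41, 41]], dtype=np.uint8)
hr_guidance = lr_label_to_hr_grid(lr, scale=30)
print(hr_guidance.shape)  # (60, 60)
```

This block-wise correspondence is what makes the LR product usable as weak supervision for a 1-m prediction.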
The Tokyo Dataset
The Tokyo dataset includes 0.5-m resolution images, two kinds of 10-m resolution LCPs, and two kinds of 30-m resolution LCPs to construct the training data pairs with different combinations.
- The HR aerial images (0.5 m/pixel), with red (R), green (G), and blue (B) bands, were collected from the OpenEarthMap dataset, whose image sources come from the Geospatial Information Authority of Japan.
- The LR labels with 10-m resolution were collected from (1) the ESA GLC10, provided by the European Space Agency (ESA), and (2) the Esri global LCP, abbreviated as ESRI GLC10, provided by Esri Inc. and IO Inc. The 30-m resolution labels were collected from (1) the global LCP GLC_FCS30, provided by the Chinese Academy of Sciences, and (2) GlobeLand30, provided by the National Geomatics Center of China.
- The HR (0.5 m) ground truths were obtained from the OpenEarthMap dataset and contain eight land-cover types.
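Because the four LR products use different class legends, their class ids must be remapped to one unified scheme before they can be mixed as training labels. A sketch of such a remapping (the mapping table below is hypothetical; the real correspondences come from each product's legend):

```python
import numpy as np

# Hypothetical mapping: raw product class ids -> unified training ids.
# The actual table must be built from each LCP's published legend.
ESA_TO_UNIFIED = {10: 0, 20: 1, 30: 1, 50: 2, 80: 3}

def remap_labels(label: np.ndarray, table: dict, fill: int = 255) -> np.ndarray:
    """Remap raw class ids; unknown ids become the ignore index (255)."""
    out = np.full(label.shape, fill, dtype=np.uint8)
    for src, dst in table.items():
        out[label == src] = dst
    return out

lr = np.array([[10, 20], [50, 99]])
print(remap_labels(lr, ESA_TO_UNIFIED))  # 99 maps to the ignore index 255
```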
Data Structure
```
dataset/
├── CSV_list/
│   ├── Chesapeake_NewYork.csv   # Training data list
│   └── ...
└── imagery/
    ├── HR image/                # Training image folder
    ├── LR label/                # Training label folder
    └── CCLC/                    # Validation data folder
```
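The CSV list pairs each training image with its LR label. A minimal sketch of reading such a list (the column names "image" and "label" are assumptions for illustration; the real layout is defined by the project's CSV_list files):

```python
import csv
import io

# Stand-in for e.g. dataset/CSV_list/Chesapeake_NewYork.csv
sample = io.StringIO(
    "image,label\n"
    "imagery/HR image/tile_0001.tif,imagery/LR label/tile_0001.tif\n"
    "imagery/HR image/tile_0002.tif,imagery/LR label/tile_0002.tif\n"
)

# One (HR image path, LR label path) pair per training tile
pairs = [(row["image"], row["label"]) for row in csv.DictReader(sample)]
print(len(pairs))  # 2
```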
Configuration Instructions
Modify the following parameters in train.py:
- `--dataset`: Specify the dataset name (e.g., Chesapeake)
- `--list_dir`: Point to the CSV list file path
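These parameters are ordinary command-line flags; a minimal argparse sketch of how train.py might declare them (the defaults shown are illustrative assumptions, not the script's actual defaults):

```python
import argparse

# Sketch of the two flags described above
parser = argparse.ArgumentParser(description="VCNet training (sketch)")
parser.add_argument("--dataset", type=str, default="Chesapeake",
                    help="Dataset name, e.g. Chesapeake")
parser.add_argument("--list_dir", type=str, default="./dataset/CSV_list",
                    help="Path to the CSV training-list directory")

args = parser.parse_args(["--dataset", "Chesapeake"])
print(args.dataset, args.list_dir)
```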
Code Structure
```
VCNet/
├── train.py                        # Main training script
├── test.py                         # Testing and inference script
├── auto_test.sh                    # Batch testing script
├── networks/                       # Network model definitions
│   ├── vit_seg_modeling.py         # ViT_FuseX model implementation
│   ├── vit_seg_modeling_L2HNet.py  # L2HNet model implementation
│   └── encoder_decoder.py          # Dual-model concatenation architecture
├── trainer.py                      # Trainer implementation
├── mIoU.py                         # Accuracy validation
├── requirements.txt                # Dependency list
└── README.md                       # Project documentation
```
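mIoU.py handles accuracy validation. The standard mean intersection-over-union metric can be sketched as follows (a simplified version; the repository's implementation may differ, e.g. in ignore-index handling):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Per-class IoU averaged over classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
print(round(mean_iou(pred, gt, num_classes=2), 3))  # 0.583
```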
Training and Testing
Start Training
```shell
python train.py \
    --dataset Chesapeake \
    --batch_size 2 \
    --max_epochs 100 \
    --savepath ./checkpoints \
    --gpu 0
```
Key Parameter Explanations
- `--batch_size`: Adjust based on GPU video memory (recommended 2-8)
- `--base_lr`: Base learning rate (default 0.01)
- `--CNN_width`: L2HNet width (64 for lightweight mode, 128 for standard mode)
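Segmentation trainers commonly decay the base learning rate with a polynomial ("poly") schedule over training iterations; whether trainer.py uses exactly this schedule should be checked against the code, but a sketch of the idea looks like:

```python
def poly_lr(base_lr: float, iter_num: int, max_iter: int, power: float = 0.9) -> float:
    """Polynomial learning-rate decay, common in semantic segmentation."""
    return base_lr * (1.0 - iter_num / max_iter) ** power

print(poly_lr(0.01, 0, 100))          # 0.01 at the start of training
print(poly_lr(0.01, 50, 100) < 0.01)  # True: decayed halfway through
```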
Model Testing
```shell
python test.py \
    --checkpoint ./checkpoints/model.pth \
    --dataset Chesapeake \
    --gpu 0
```
Model Architecture
Core Components
- ViT_FuseX Encoder: Captures global semantic information from multi-scale features of remote sensing images
- L2HNet Encoder: Extracts high-resolution spatial features from HR images
- Feature Fusion Module: Fuses features from dual branches to enhance segmentation accuracy
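The fusion idea, concatenating the two branches' feature maps along the channel axis and mixing them with a 1x1 convolution, can be sketched in numpy (shapes, names, and the per-pixel matmul are illustrative; the actual learned module lives in the networks/ code):

```python
import numpy as np

def fuse_features(feat_a: np.ndarray, feat_b: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Concatenate two (C, H, W) feature maps along channels, then apply a
    1x1 convolution (here a per-pixel matmul) to mix the branches."""
    x = np.concatenate([feat_a, feat_b], axis=0)  # (Ca+Cb, H, W)
    return np.einsum('oc,chw->ohw', weight, x)    # (Cout, H, W)

rng = np.random.default_rng(0)
vit_feat = rng.standard_normal((8, 4, 4))  # global-branch features
cnn_feat = rng.standard_normal((8, 4, 4))  # HR spatial-branch features
w = rng.standard_normal((16, 16))          # Cout=16, Cin=8+8
print(fuse_features(vit_feat, cnn_feat, w).shape)  # (16, 4, 4)
```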
Concatenation Method
The dual-model concatenation is implemented via the EncoderDecoder class. Code example:
```python
net = EncoderDecoder(
    backbone=backbone_config,
    decode_head=decode_head_config,
    auxiliary_head=auxiliary_head_config,
    cnn_encoder=L2HNet(width=args.CNN_width),
    pretrained="/path/to/pretrained_model.pth"
).cuda()
```