VCNet - a weakly supervised end-to-end deep-learning network for large-scale HR land-cover mapping
October 30, 2025
High-resolution (HR) land-cover mapping is an important task for surveying the Earth's surface and supporting decision-making in sectors such as agriculture, forestry, and smart cities. However, it is impeded by the scarcity of high-quality HR labels, complex ground details, and high computational cost. To address these challenges, we propose VCNet, a weakly supervised end-to-end deep-learning network for large-scale HR land-cover mapping. VCNet automatically generates HR maps using low-resolution (LR) historical land-cover products as guidance, eliminating the need for manual annotation and human intervention.
In this study, we use the framework to produce a 1-m HR land-cover map of Shanghai, the economic epicenter of China, with an accuracy of 72.26%. The complete 1-m resolution land-cover mapping results for Shanghai are shown below.
Environment Requirements
Hardware Requirements
- NVIDIA GPU (recommended video memory ≥ 8GB)
- System memory ≥ 16GB
- Storage: Approximately 20GB for datasets and models
Software Dependencies
torch==1.4.0
torchvision==0.5.0
numpy>=1.19.5
tqdm>=4.62.3
tensorboard>=2.7.0
tensorboardX>=2.5.1
ml-collections>=0.1.0
medpy>=0.4.0
SimpleITK>=2.1.1
scipy>=1.7.3
h5py>=3.6.0
rasterio==1.2.10
easydict>=1.9
Dataset Preparation
The Chesapeake Bay Dataset
The Chesapeake Bay Dataset contains 1-meter resolution images and a 30-meter resolution land-cover product as training data pairs, as well as a 1-meter resolution ground reference for assessment. Download the dataset at https://lila.science/datasets/chesapeakelandcover and place it under ./dataset/Chesapeake_NewYork_dataset.
- The HR aerial images with 1-meter resolution were captured by the U.S. Department of Agriculture’s National Agriculture Imagery Program (NAIP).
- The LR labels with 30-meter resolution, derived from the USGS's National Land Cover Database (NLCD), consist of 16 land-cover classes.
- The HR (1 m) ground truths used for accuracy assessment were obtained from the Chesapeake Bay Conservancy Land Cover (CCLC) project.
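In this training pairing, each 30-m NLCD label pixel covers a 30×30 block of 1-m NAIP pixels. A minimal numpy sketch of projecting the LR labels onto the HR image grid (the function name and class ids here are illustrative, not taken from the codebase):

```python
import numpy as np

def lr_label_to_hr_grid(lr_label: np.ndarray, scale: int = 30) -> np.ndarray:
    """Nearest-neighbor expand an LR label raster onto the HR grid, so each
    LR label pixel covers a scale x scale block of HR pixels."""
    return np.repeat(np.repeat(lr_label, scale, axis=0), scale, axis=1)

# Toy 2x2 NLCD-style label patch (class ids are arbitrary here)
lr = np.array([[11, 21],
               [41, 41]], dtype=np.uint8)
hr_guidance = lr_label_to_hr_grid(lr, scale=30)
print(hr_guidance.shape)  # (60, 60)
```

This block-wise correspondence is what makes the LR product usable as weak supervision for a 1-m prediction.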
The Tokyo Dataset
The Tokyo dataset includes 0.5-m resolution images, two kinds of 10-m resolution LCPs, and two kinds of 30-m resolution LCPs to construct the training data pairs with different combinations.
- The HR aerial images (0.5 m/pixel), with red (R), green (G), and blue (B) bands, were collected from the OpenEarthMap dataset, whose image sources come from the Geospatial Information Authority of Japan.
- The LR labels with 10-m resolution were collected from (1) the ESA GLC10, provided by the European Space Agency (ESA), and (2) the Esri global LCP, abbreviated as ESRI GLC10, provided by Esri Inc. and IO Inc. The 30-m resolution labels were collected from (1) the global LCP GLC_FCS30, provided by the Chinese Academy of Sciences, and (2) GlobeLand30, provided by the National Geomatics Center of China.
- The HR (0.5 m) ground truths were obtained from the OpenEarthMap dataset and contain eight land-cover types.
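Because the four LR products use different class legends, their class ids must be remapped to one unified scheme before they can be mixed as training labels. A sketch of such a remapping (the mapping table below is hypothetical; the real correspondences come from each product's legend):

```python
import numpy as np

# Hypothetical mapping: raw product class ids -> unified training ids.
# The actual table must be built from each LCP's published legend.
ESA_TO_UNIFIED = {10: 0, 20: 1, 30: 1, 50: 2, 80: 3}

def remap_labels(label: np.ndarray, table: dict, fill: int = 255) -> np.ndarray:
    """Remap raw class ids; unknown ids become the ignore index (255)."""
    out = np.full(label.shape, fill, dtype=np.uint8)
    for src, dst in table.items():
        out[label == src] = dst
    return out

lr = np.array([[10, 20], [50, 99]])
print(remap_labels(lr, ESA_TO_UNIFIED))  # 99 maps to the ignore index 255
```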
Data Structure
```
dataset/
├── CSV_list/
│   ├── Chesapeake_NewYork.csv   # Training data list
│   └── ...
└── imagery/
    ├── HR image/                # Training image folder
    ├── LR label/                # Training label folder
    └── CCLC/                    # Validation data folder
```
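The CSV list pairs each training image with its LR label. A minimal sketch of reading such a list (the column names "image" and "label" are assumptions for illustration; the real layout is defined by the project's CSV_list files):

```python
import csv
import io

# Stand-in for e.g. dataset/CSV_list/Chesapeake_NewYork.csv
sample = io.StringIO(
    "image,label\n"
    "imagery/HR image/tile_0001.tif,imagery/LR label/tile_0001.tif\n"
    "imagery/HR image/tile_0002.tif,imagery/LR label/tile_0002.tif\n"
)

# One (HR image path, LR label path) pair per training tile
pairs = [(row["image"], row["label"]) for row in csv.DictReader(sample)]
print(len(pairs))  # 2
```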
Configuration Instructions
Modify the following parameters in train.py:
- `--dataset`: Specify the dataset name (e.g., Chesapeake)
- `--list_dir`: Point to the CSV list file path
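These parameters are ordinary command-line flags; a minimal argparse sketch of how train.py might declare them (the defaults shown are illustrative assumptions, not the script's actual defaults):

```python
import argparse

# Sketch of the two flags described above
parser = argparse.ArgumentParser(description="VCNet training (sketch)")
parser.add_argument("--dataset", type=str, default="Chesapeake",
                    help="Dataset name, e.g. Chesapeake")
parser.add_argument("--list_dir", type=str, default="./dataset/CSV_list",
                    help="Path to the CSV training-list directory")

args = parser.parse_args(["--dataset", "Chesapeake"])
print(args.dataset, args.list_dir)
```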
Code Structure
```
VCNet/
├── train.py                        # Main training script
├── test.py                         # Testing and inference script
├── auto_test.sh                    # Batch testing script
├── networks/                       # Network model definitions
│   ├── vit_seg_modeling.py         # ViT_FuseX model implementation
│   ├── vit_seg_modeling_L2HNet.py  # L2HNet model implementation
│   └── encoder_decoder.py          # Dual-model concatenation architecture
├── trainer.py                      # Trainer implementation
├── mIoU.py                         # Accuracy validation
├── requirements.txt                # Dependency list
└── README.md                       # Project documentation
```
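mIoU.py handles accuracy validation. The standard mean intersection-over-union metric can be sketched as follows (a simplified version; the repository's implementation may differ, e.g. in ignore-index handling):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Per-class IoU averaged over classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
print(round(mean_iou(pred, gt, num_classes=2), 3))  # 0.583
```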
Training and Testing
Start Training
```shell
python train.py \
    --dataset Chesapeake \
    --batch_size 2 \
    --max_epochs 100 \
    --savepath ./checkpoints \
    --gpu 0
```
Key Parameter Explanations
- `--batch_size`: Adjust based on GPU video memory (recommended 2-8)
- `--base_lr`: Base learning rate (default 0.01)
- `--CNN_width`: L2HNet width (64 for lightweight mode, 128 for standard mode)
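Segmentation trainers commonly decay the base learning rate with a polynomial ("poly") schedule over training iterations; whether trainer.py uses exactly this schedule should be checked against the code, but a sketch of the idea looks like:

```python
def poly_lr(base_lr: float, iter_num: int, max_iter: int, power: float = 0.9) -> float:
    """Polynomial learning-rate decay, common in semantic segmentation."""
    return base_lr * (1.0 - iter_num / max_iter) ** power

print(poly_lr(0.01, 0, 100))          # 0.01 at the start of training
print(poly_lr(0.01, 50, 100) < 0.01)  # True: decayed halfway through
```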
Model Testing
```shell
python test.py \
    --checkpoint ./checkpoints/model.pth \
    --dataset Chesapeake \
    --gpu 0
```
Model Architecture
Core Components
- ViT_FuseX Encoder: Captures global semantic information from multi-scale features of remote sensing images
- L2HNet Encoder: Extracts high-resolution spatial features from HR images
- Feature Fusion Module: Fuses features from dual branches to enhance segmentation accuracy
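The fusion idea, concatenating the two branches' feature maps along the channel axis and mixing them with a 1x1 convolution, can be sketched in numpy (shapes, names, and the per-pixel matmul are illustrative; the actual learned module lives in the networks/ code):

```python
import numpy as np

def fuse_features(feat_a: np.ndarray, feat_b: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Concatenate two (C, H, W) feature maps along channels, then apply a
    1x1 convolution (here a per-pixel matmul) to mix the branches."""
    x = np.concatenate([feat_a, feat_b], axis=0)  # (Ca+Cb, H, W)
    return np.einsum('oc,chw->ohw', weight, x)    # (Cout, H, W)

rng = np.random.default_rng(0)
vit_feat = rng.standard_normal((8, 4, 4))  # global-branch features
cnn_feat = rng.standard_normal((8, 4, 4))  # HR spatial-branch features
w = rng.standard_normal((16, 16))          # Cout=16, Cin=8+8
print(fuse_features(vit_feat, cnn_feat, w).shape)  # (16, 4, 4)
```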
Concatenation Method
The dual-model concatenation is implemented via the EncoderDecoder class. Code example:
```python
net = EncoderDecoder(
    backbone=backbone_config,
    decode_head=decode_head_config,
    auxiliary_head=auxiliary_head_config,
    cnn_encoder=L2HNet(width=args.CNN_width),
    pretrained="/path/to/pretrained_model.pth"
).cuda()
```