MBCTD
May 25, 2026 Β· View on GitHub
A deep learning model for per-pixel, multi-label detection of building changes in bi-temporal satellite and aerial imagery. MBCTD classifies each pixel independently into three change categories, allowing overlapping labels (e.g., a replacement site marked as both demolished and new simultaneously).
Overview
Given a before image and an after image of the same geographic area, MBCTD produces per-pixel predictions for three classes:
| Class | Color | Meaning |
|---|---|---|
| Unchanged | π΅ | Building present in both images |
| Demolished | π΄ | Building present in before, absent in after |
| New | π’ | Building absent in before, present in after |
| Replacement (demolished + new) | π‘ | Both demolished and new labels active simultaneously |
Because the model is multi-label, a single pixel can belong to more than one class. This makes it possible to represent complex urban transitions that single-label models cannot express.

Inference samples
LEVIR-CD+
![]() | ![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() | ![]() |
FOTBCD
![]() | ![]() | ![]() | ![]() | ![]() |
![]() | ![]() | ![]() | ![]() | ![]() |
![]() | ![]() |
Architecture
MBCTD uses a Siamese ConvNeXt-Base encoder paired with a full-resolution U-Net decoder:
Before image βββ
βββΊ Shared ConvNeXt-Base encoder βββΊ Change fusion at each scale
After image βββ β
βΌ
Full-resolution U-Net decoder
(PixelShuffle upsampling)
β
βΌ
3 independent sigmoid heads per pixel
Key design decisions:
- Shared encoder weights β the same ConvNeXt trunk processes both images, making the feature space comparable by construction.
- Change fusion β at each encoder scale, before/after features are combined as
[before, after, beforeβafter, |beforeβafter|]and projected through 1Γ1β3Γ3 convolutions. - High-resolution skip connections β in addition to encoder skips (1/32 β 1/4), raw input images are injected at 1/2 and 1/1 resolution to preserve fine-grained boundary information.
- PixelShuffle upsampling β learned upsampling at every decoder stage avoids checkerboard artifacts.
- Pre-trained backbone β ConvNeXt-Base initialised with DINOv3 LVD1689M weights.
Project Structure
MBCTD/
βββ model.py # Model definition (encoder, fusion, decoder)
βββ config.py # MBCTDConfig dataclass
βββ inference.py # load_model, predict_patch, visualisation helpers
βββ demo.py # Interactive Gradio web demo
βββ environment.yml # Conda environment spec
Installation
1. Clone the repository
git clone git@github.com:abdelpy/MBCTD
cd MBCTD
2. Create the Conda environment
conda env create -f environment.yml
conda activate mbctd
3. Install PyTorch
Follow the official instructions to install PyTorch matching your CUDA version.
4. Download model weights
Pre-trained weights are available on Google Drive.
Usage
Interactive demo
The fastest way to try the model is the Gradio web interface:
python demo.py path/to/model.pth
Open the URL printed in your terminal. The UI lets you:
- Upload a before and after image pair
- Adjust the confidence threshold (0.1 β 0.9, default 0.7)
- Choose an inference mode:
patchβ tile the image into 256 px patches and stitch predictions back (handles large images)fullβ run at the image's original resolution
- Inspect the overlay on the after image, the colour mask, and per-class pixel coverage statistics
Programmatic inference
from PIL import Image
import numpy as np
import torch
from inference import load_model, predict_patch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = load_model("model.pth", device)
# Load images as uint8 RGB numpy arrays
before = np.array(Image.open("before.png").convert("RGB"))
after = np.array(Image.open("after.png").convert("RGB"))
result = predict_patch(before, after, model, threshold=0.7)
predict_patch returns a dictionary:
| Key | Shape | dtype | Description |
|---|---|---|---|
binary | (3, H, W) | uint8 | Per-class binary masks (unchanged / demolished / new) |
class_map | (H, W) | uint8 | Collapsed single-label class ID (0 = background, 1β4 = see colour table above) |
overlay | (H, W, 3) | uint8 | Semi-transparent colour overlay drawn on the after image |
mask_rgb | (H, W, 3) | uint8 | Solid-colour mask visualisation |
Training
MBCTD was trained exclusively on FOTBCD β a large-scale, multi-label building change dataset β using 256 Γ 256 px patches drawn from over 220,000 before/after aerial image pairs across France.
FOTBCD β the dataset behind the model
FOTBCD is the first dataset with multi-label building change annotations as vector polygons covering demolished, new, and unchanged structures simultaneously. At 220k+ georeferenced pairs spanning diverse urban environments, it is an order of magnitude larger than existing change detection benchmarks β and the richness of its labels is what makes a model like MBCTD possible.
The dataset is available for licensing.
Whether you are building an urban monitoring platform, a real-estate analytics product, or a geospatial AI pipeline, FOTBCD gives you the ground truth that generic benchmarks cannot provide. Get in touch to discuss licensing terms.
Results
Binary change detection metrics (demolished OR new β "changed") are reported for both benchmarks to enable comparison; LEVIR-CD+ provides only binary ground truth so no per-class breakdown is available for it. For FOTBCD, which supports multi-label annotations, per-class IoU is also reported.
Inference threshold selected by best F1 on each benchmark.
LEVIR-CD+ (full-resolution inference, threshold = 0.75)
| Metric | Value |
|---|---|
| Precision | 0.7694 |
| Recall | 0.8137 |
| F1 | 0.7909 |
| IoU change | 0.6541 |
| mIoU | 0.8180 |
| OA | 0.9825 |
FOTBCD (full-resolution inference, threshold = 0.70)
Binary change detection
| Metric | Value |
|---|---|
| Precision | 0.8948 |
| Recall | 0.9201 |
| F1 | 0.9073 |
| IoU change | 0.8303 |
| mIoU | 0.9094 |
| OA | 0.9891 |
Per-class IoU
| Class | IoU |
|---|---|
| Unchanged | 0.7774 |
| Demolished | 0.8166 |
| New | 0.8198 |
License
This project is licensed under CC BY-NC 4.0





















