COSA: Contrastive OSM-Satellite Alignment

February 13, 2026

COSA is a contrastive learning model that aligns OpenStreetMap (OSM) tag text with satellite image features. It learns a shared embedding space in which OSM tags and the satellite image regions they describe lie close together.

Model Overview

COSA uses a dual-encoder architecture:

  • Image Encoder: Satlas-pretrained Swin-B backbone with a feature pyramid network (FPN)
  • Text Encoder: CLIP ViT-L/14 text encoder

The model is trained with an InfoNCE contrastive loss that aligns each polygon's image features with the embedding of its corresponding OSM text.
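To make the training objective concrete, here is a minimal sketch of an InfoNCE loss over paired features. It is not taken from the COSA code. The function name `info_nce`, the symmetric (image-to-text plus text-to-image) form, the temperature of 0.07, and the assumption that each polygon has already been pooled to a single feature vector are all illustrative choices.

```python
import torch
import torch.nn.functional as F


def info_nce(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over N matched (image, text) pairs.

    image_feats, text_feats: tensors of shape (N, D); row i of each is a
    matched pair. The temperature value is an assumption, not COSA's setting.
    """
    # Cosine similarity: L2-normalize, then take dot products
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.T / temperature  # (N, N)

    # The matching pair for row i sits on the diagonal (column i)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_i2t + loss_t2i)


# Toy check with random stand-in features (no real COSA features involved)
torch.manual_seed(0)
text = torch.randn(8, 16)
aligned = info_nce(text + 0.01 * torch.randn(8, 16), text)  # well-matched pairs
shuffled = info_nce(text.roll(1, dims=0), text)             # mismatched pairs
```

In this toy setup the loss for well-matched pairs comes out lower than for the shuffled pairs, which is the behavior the objective rewards during training.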

Files

cosa/
├── cosa.ckpt              # Trained model checkpoint
├── model.py               # OSMBind model class
├── text_encoder.py        # Text encoder implementations
├── compute_embeddings.py  # Script to generate tag embeddings
└── README.md              # This file

Usage

Generate Tag Embeddings

To create COSA embeddings for your tag vocabulary:

python compute_embeddings.py \
    --taglist_path /path/to/taglist_vocab.pt \
    --tag_vocab_path /path/to/tag_vocab.pt \
    --output_path cosa-embeddings.pt \
    --encoder_type osm-clip \
    --checkpoint_path cosa.ckpt

Encode Custom Text

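import torch
from model import OSMBind

# Use the GPU when available; otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the checkpoint. On PyTorch >= 2.6, torch.load defaults to weights_only=True;
# pass weights_only=False only if this trusted checkpoint fails to load without it.
checkpoint = torch.load("cosa.ckpt", map_location=device)
model = OSMBind(
    taglist_path="/path/to/taglist_vocab.pt",
    tagvocab_path="/path/to/tag_vocab.pt",
    text_backbone="clip",
)
# strict=False skips checkpoint keys that do not match the constructed model
model.load_state_dict(checkpoint["state_dict"], strict=False)
model.eval().to(device)

# Disable gradient tracking for inference
with torch.no_grad():
    # Encode a single tag string
    embedding = model.encode_text("building residential")

    # Encode a batch of tag strings
    embeddings = model.text_encoder.encode_batch(["building", "highway residential"])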
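Because the embedding space is trained so that related items lie close together, one natural use of these text embeddings is ranking by cosine similarity. The sketch below uses small stand-in tensors rather than real model outputs. It assumes that `encode_text` returns a 1-D vector of shape (D,) and `encode_batch` returns a tensor of shape (N, D), which the README does not state; `rank_by_cosine` is a hypothetical helper, not part of the repository.

```python
import torch
import torch.nn.functional as F


def rank_by_cosine(query, candidates):
    """Rank candidate embeddings by cosine similarity to a query embedding.

    query: shape (D,); candidates: shape (N, D).
    Returns (indices sorted best-first, similarity scores in the same order).
    """
    query = F.normalize(query, dim=-1)
    candidates = F.normalize(candidates, dim=-1)
    scores = candidates @ query  # (N,)
    order = torch.argsort(scores, descending=True)
    return order, scores[order]


# Stand-in vectors; in practice these would come from the text encoder
query = torch.tensor([1.0, 0.0, 0.0])
candidates = torch.tensor([
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.1, 0.0],   # nearly parallel to the query
    [-1.0, 0.0, 0.0],  # opposite to the query
])
order, scores = rank_by_cosine(query, candidates)
```

Normalizing both sides first makes the ranking depend on direction only, so differences in embedding magnitude do not affect the result.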

Create Hint Tensor for VectorSynth

import torch

# Load pre-computed embeddings
embeddings = torch.load("cosa-embeddings.pt")

# Load pixel tensor (from dataset)
pixel_tensor = torch.load("pixel_tensors/bbox_123.pt")

# Create hint: look up the embedding for each pixel's tag index.
# pixel_tensor must hold integer indices into the rows of embeddings.
hint = embeddings[pixel_tensor]
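The lookup above relies on PyTorch integer-tensor indexing, which the following toy example illustrates with made-up data. The shapes are assumptions: embeddings of shape (num_tags, D) and a pixel tensor of shape (H, W) holding tag indices. The final permute to channel-first layout is a common convention for image-conditioned models, not a documented VectorSynth requirement.

```python
import torch

# Toy stand-ins: 5 tags with 4-dimensional embeddings (shapes are assumptions)
num_tags, dim = 5, 4
embeddings = torch.arange(num_tags * dim, dtype=torch.float32).reshape(num_tags, dim)

# A 2x3 "image" in which each pixel stores the index of its tag (dtype int64)
pixel_tensor = torch.tensor([
    [0, 0, 3],
    [1, 4, 3],
])

# Indexing with an integer tensor gathers one embedding row per pixel
hint = embeddings[pixel_tensor]                 # shape (H, W, D) = (2, 3, 4)

# Optional: move the embedding axis first if a (D, H, W) layout is needed
hint_chw = hint.permute(2, 0, 1).contiguous()   # shape (4, 2, 3)
```

If the pixel tensor is stored in a floating-point dtype, converting it with `pixel_tensor.long()` before indexing avoids an indexing error.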