Self-supervised Learning of Contextualized Local Visual Embeddings

December 26, 2023

Visit the project webpage!

By Thalles Silva, Helio Pedrini, Adín Ramírez Rivera.

This repo is the official implementation of Self-supervised Learning of Contextualized Local Visual Embeddings (CLoVE), presented at the 4th Visual Inductive Priors for Data-Efficient Deep Learning Workshop (ICCV 2023).

The code base is written in PyTorch.

Abstract

We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised convolution-based method that learns representations suited for dense prediction tasks. CLoVE deviates from current methods and optimizes a single loss function that operates at the level of contextualized local embeddings learned from the output feature maps of convolutional neural network (CNN) encoders. To learn contextualized embeddings, CLoVE proposes a normalized multi-head self-attention layer that combines local features from different parts of an image based on similarity. We extensively benchmark CLoVE's pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures in four downstream dense prediction tasks: object detection, instance segmentation, keypoint detection, and dense pose estimation.
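
To make the idea of contextualized local embeddings more concrete, here is a minimal PyTorch sketch of a multi-head self-attention layer applied to a CNN feature map. It is not taken from this repository. The class name `CosineSelfAttention`, the `temperature` parameter, and the reading of "normalized" as L2-normalizing queries and keys (so that attention weights depend on cosine similarity) are all assumptions made for illustration; the layer used in CLoVE may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CosineSelfAttention(nn.Module):
    """Illustrative multi-head self-attention over a CNN feature map.

    Assumption: "normalized" is read here as L2-normalizing queries and
    keys, so attention weights depend on cosine similarity. This is a
    hypothetical layer, not the one from the CLoVE code base.
    """

    def __init__(self, dim: int, num_heads: int = 4, temperature: float = 0.1):
        super().__init__()
        if dim % num_heads != 0:
            raise ValueError("dim must be divisible by num_heads")
        self.num_heads = num_heads
        self.temperature = temperature  # hypothetical softmax temperature
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feature_map.shape
        n = h * w
        # (B, C, H, W) -> (B, N, C): one local embedding per spatial location.
        tokens = feature_map.flatten(2).transpose(1, 2)
        qkv = self.qkv(tokens).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        # Each of q, k, v has shape (B, heads, N, head_dim).
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        # Unit-norm queries and keys make the dot product a cosine similarity.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / self.temperature  # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        # Each location becomes a similarity-weighted mix of all locations.
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        out = self.proj(out)
        # Back to the feature-map layout (B, C, H, W).
        return out.transpose(1, 2).reshape(b, c, h, w)
```

Under these assumptions, each output location is a weighted combination of the local features at every location of the same image, with the weights given by similarity, which matches the description in the abstract at a high level.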

...

Citation

@inproceedings{silva2023self,
    title={Self-supervised Learning of Contextualized Local Visual Embeddings},
    author={Silva, Thalles and Pedrini, Helio and Ram{\'\i}rez, Ad{\'\i}n},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
    pages={177--186},
    year={2023}
}

Main Results

Pre-trained models

| Method | Epochs | Multicrop | URL |
| --- | --- | --- | --- |
| CLoVE | 50 | 2x224 + 6x96 | Checkpoints |
| CLoVE | 200 | 2x224 + 6x96 | Checkpoints |
| CLoVE | 400 | 2x224 + 6x96 | Checkpoints |
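
Self-supervised checkpoints often store the backbone weights under a wrapper prefix, so the keys need to be renamed before they fit a plain ResNet-50. The sketch below shows that renaming step in pure Python. The helper `strip_prefix`, the prefix `module.encoder.`, and the toy key names are hypothetical; inspect the keys of the downloaded checkpoint to find the actual layout.

```python
from typing import Any, Dict


def strip_prefix(state_dict: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Keep only keys that start with `prefix` and drop the prefix from them."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Toy stand-in for a loaded checkpoint; the key names are hypothetical.
checkpoint = {
    "module.encoder.conv1.weight": "w0",
    "module.encoder.layer1.0.conv1.weight": "w1",
    "module.head.weight": "w2",
}

# Only the encoder entries survive, with the wrapper prefix removed.
backbone = strip_prefix(checkpoint, "module.encoder.")
print(sorted(backbone))
```

In practice, the dictionary would typically come from `torch.load(path, map_location="cpu")`, and the renamed weights could be passed to `load_state_dict(..., strict=False)` on a `torchvision.models.resnet50()` instance, assuming the checkpoint follows the torchvision naming scheme.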

Object detection and instance segmentation on COCO (R50-C4)

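| Method | Epochs | $\text{AP}^{\text{bb}}$ | $\text{AP}^{\text{bb}}_{50}$ | $\text{AP}^{\text{bb}}_{75}$ | $\text{AP}^{\text{mb}}$ | $\text{AP}^{\text{mb}}_{50}$ | $\text{AP}^{\text{mb}}_{75}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Supervised | 100 | 38.2 | 58.2 | 41.2 | 33.3 | 54.7 | 35.2 |
| Rand init | - | 26.4 | 44 | 27.8 | 29.3 | 46.9 | 30.8 |
| ReSim | 200 | 39.7 | 59 | 43 | 34.6 | 55.9 | 37.1 |
| InsCon | 200 | 40.3 | 60.0 | 43.5 | 35.1 | 56.7 | 37.6 |
| PixPro | 400 | 40.5 | 59.8 | 44 | 35.4 | 56.9 | 37.7 |
| DetCo | 200 | 39.8 | 59.7 | 43 | 34.7 | 56.3 | 36.7 |
| SlotCon | 200 | 39.9 | 59.8 | 43.0 | 34.9 | 56.5 | 37.3 |
| CLoVE | 200 | 40.6 | 60.0 | 44.1 | 35.4 | 56.8 | 37.8 |
| CLoVE | 400 | 41.0 | 60.3 | 44.2 | 35.5 | 57.2 | 38.1 |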

Object detection and instance segmentation on LVIS (R50-FPN)

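| Method | Epochs | $\text{AP}^{\text{bb}}$ | $\text{AP}^{\text{bb}}_{50}$ | $\text{AP}^{\text{bb}}_{75}$ | $\text{AP}^{\text{mb}}$ | $\text{AP}^{\text{mb}}_{50}$ | $\text{AP}^{\text{mb}}_{75}$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Supervised | 100 | 20.2 | 33.4 | 21.4 | 19.6 | 31.2 | 20.8 |
| Rand init | - | 12.4 | 21.8 | 12.5 | 12.1 | 20.2 | 12.5 |
| DenseCL | 200 | 20.4 | 33.5 | 21.4 | 19.9 | 31.5 | 20.9 |
| PixPro | 400 | 23.8 | 38.2 | 25.2 | 23.3 | 36.1 | 24.7 |
| SlotCon | 200 | 23.2 | 37.6 | 24.3 | 22.9 | 35.6 | 24.3 |
| VICRegL | 200 | 7 | 13.4 | 6.4 | 7.4 | 12.7 | 7.3 |
| CLoVE | 200 | 23.6 | 37.7 | 25.2 | 23.3 | 35.9 | 24.8 |
| CLoVE | 400 | 24.3 | 38.8 | 25.8 | 23.9 | 36.7 | 25.3 |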

Instance segmentation on Cityscapes (R50-FPN)

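| Method | Epochs | $\text{AP}$ | $\text{AP}_{50}$ |
| --- | --- | --- | --- |
| Supervised | 100 | 26.5 | 52.9 |
| Rand init | - | 19.9 | 40.7 |
| DenseCL | 200 | 33.1 | 61.7 |
| PixPro | 400 | 35.8 | 63.7 |
| VICRegL | 300 | 29.8 | 58.5 |
| SlotCon | 200 | 35.2 | 63.8 |
| CLoVE | 200 | 35.7 | 64.1 |
| CLoVE | 400 | 37.2 | 65.3 |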

Acknowledgement

This repository was built on top of several publicly available codebases. Specifically, we have modified and integrated the following code into this project:

Contributing to the project

We welcome pull requests and issues from the community.