Self-supervised Learning of Contextualized Local Visual Embeddings

December 26, 2023

Visit the project webpage!

By Thalles Silva, Helio Pedrini, Adín Ramírez Rivera.

This repo is the official implementation of Self-supervised Learning of Contextualized Local Visual Embeddings (CLoVE), presented at the 4th Visual Inductive Priors for Data-Efficient Deep Learning Workshop (ICCV 2023).

The codebase is written in PyTorch.

Abstract

We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised convolution-based method that learns representations suited for dense prediction tasks. CLoVE deviates from current methods and optimizes a single loss function that operates at the level of contextualized local embeddings learned from the output feature maps of convolutional neural network (CNN) encoders. To learn contextualized embeddings, CLoVE proposes a normalized multi-head self-attention layer that combines local features from different parts of an image based on similarity. We extensively benchmark CLoVE’s pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures in four dense prediction downstream tasks: object detection, instance segmentation, keypoint detection, and dense pose estimation.
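The idea of combining local features by similarity can be illustrated with a small, dependency-free sketch. This is not the repository's layer: it assumes that "normalized" means L2-normalizing each head's features so that attention weights follow cosine similarity, it omits the learned projections a real attention layer would have, and the names `contextualize`, `num_heads`, and `temperature` are hypothetical. It uses plain Python rather than PyTorch only to stay self-contained.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(features, num_heads=2, temperature=0.1):
    """Mix N local feature vectors (each of length C) by cosine similarity.

    Assumption: channels are split evenly across heads, and each head
    attends over all N locations independently. No learned weights.
    """
    n = len(features)
    channels = len(features[0])
    if channels % num_heads != 0:
        raise ValueError("channel count must be divisible by num_heads")
    head_dim = channels // num_heads

    out = [[] for _ in range(n)]
    for h in range(num_heads):
        # Slice out this head's channels for every location.
        chunk = [f[h * head_dim:(h + 1) * head_dim] for f in features]
        unit = [l2_normalize(v) for v in chunk]
        for i in range(n):
            # Cosine similarity of location i to every location, sharpened
            # by the (assumed) temperature.
            sims = [
                sum(a * b for a, b in zip(unit[i], unit[j])) / temperature
                for j in range(n)
            ]
            weights = softmax(sims)
            # Contextualized embedding: similarity-weighted average.
            mixed = [
                sum(weights[j] * chunk[j][k] for j in range(n))
                for k in range(head_dim)
            ]
            out[i].extend(mixed)
    return out
```

In this sketch each output vector is a convex combination of the input features, so locations with similar content pull toward one another while dissimilar locations contribute little.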

Citation

@inproceedings{silva2023self,
    title={Self-supervised Learning of Contextualized Local Visual Embeddings},
    author={Silva, Thalles and Pedrini, Helio and Ram{\'\i}rez, Ad{\'\i}n},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
    pages={177--186},
    year={2023}
}

Main Results

Pre-trained models

Method	Epochs	Multicrop	URL
CLoVE	50	2x224 + 6x96	Checkpoints
CLoVE	200	2x224 + 6x96	Checkpoints
CLoVE	400	2x224 + 6x96	Checkpoints
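The Multicrop column appears to follow the usual count x resolution convention of multi-crop augmentation, so `2x224 + 6x96` would mean two 224x224 views plus six 96x96 views per image. As a minimal sketch of that reading, the helper below expands such a string into a list of crop sizes. `parse_multicrop` is a hypothetical name and is not part of this repository.

```python
def parse_multicrop(spec):
    """Expand a spec such as '2x224 + 6x96' into one crop size per view."""
    sizes = []
    for group in spec.split("+"):
        # Each group is '<count>x<size>', e.g. '6x96'.
        count, size = group.strip().lower().split("x")
        sizes.extend([int(size)] * int(count))
    return sizes


views = parse_multicrop("2x224 + 6x96")
print(len(views), views)  # 8 [224, 224, 96, 96, 96, 96, 96, 96]
```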

Object detection and instance segmentation on COCO (R50-C4)

Method	Epochs	$\text{AP}^{\text{bb}}$	$\text{AP}^{\text{bb}}_{50}$	$\text{AP}^{\text{bb}}_{75}$	$\text{AP}^{\text{mb}}$	$\text{AP}^{\text{mb}}_{50}$	$\text{AP}^{\text{mb}}_{75}$
Supervised	100	38.2	58.2	41.2	33.3	54.7	35.2
Rand init	-	26.4	44.0	27.8	29.3	46.9	30.8
ReSim	200	39.7	59.0	43.0	34.6	55.9	37.1
InsCon	200	40.3	60.0	43.5	35.1	56.7	37.6
PixPro	400	40.5	59.8	44.0	35.4	56.9	37.7
DetCo	200	39.8	59.7	43.0	34.7	56.3	36.7
SlotCon	200	39.9	59.8	43.0	34.9	56.5	37.3
CLoVE	200	40.6	60.0	44.1	35.4	56.8	37.8
CLoVE	400	41.0	60.3	44.2	35.5	57.2	38.1

Object detection and instance segmentation on LVIS (R50-FPN)

Method	Epochs	$\textup{AP}^{\textup{bb}}$	$\textup{AP}^{\textup{bb}}_{50}$	$\textup{AP}^{\textup{bb}}_{75}$	$\textup{AP}^{\textup{mb}}$	$\textup{AP}^{\textup{mb}}_{50}$	$\textup{AP}^{\textup{mb}}_{75}$
Supervised	100	20.2	33.4	21.4	19.6	31.2	20.8
Rand init	-	12.4	21.8	12.5	12.1	20.2	12.5
DenseCL	200	20.4	33.5	21.4	19.9	31.5	20.9
PixPro	400	23.8	38.2	25.2	23.3	36.1	24.7
SlotCon	200	23.2	37.6	24.3	22.9	35.6	24.3
VICRegL	200	7.0	13.4	6.4	7.4	12.7	7.3
CLoVE	200	23.6	37.7	25.2	23.3	35.9	24.8
CLoVE	400	24.3	38.8	25.8	23.9	36.7	25.3

Instance segmentation on Cityscapes (R50-FPN)

Method	Epochs	AP	$\textup{AP}_{50}$
Supervised	100	26.5	52.9
Rand init	-	19.9	40.7
DenseCL	200	33.1	61.7
PixPro	400	35.8	63.7
VICRegL	300	29.8	58.5
SlotCon	200	35.2	63.8
CLoVE	200	35.7	64.1
CLoVE	400	37.2	65.3

Acknowledgement

This repository builds on existing publicly available code. Specifically, we modified and integrated code from the following project:

https://github.com/zdaxie/PixPro

Contributing to the project

We welcome pull requests and issues from the community.