Near, far: Patch-ordering enhances vision foundation models' scene understanding
April 20, 2025
Valentinos Pariza*, Mohammadreza Salehi*, Gertjan J. Burghouts, Francesco Locatello, Yuki M. Asano
ICLR 2025
Project Page / GitHub Repository / Read the Paper on arXiv
Table of Contents
- News
- Introduction
- GPU Requirements
- Environment Setup
- Loading pretrained models
- Training
- Evaluation
- Dataset Preparation
- Visualizations
- Citation
- License
News
Thank you for using our code. Here we include news about changes in the repository.
- The repository has changed substantially: we upgraded to more recent libraries that speed up execution and reduce memory usage, especially for the v2 ViT architecture used by Dinov2. The boost for that architecture comes from using xformers, just as in Dinov2 training.
- We updated the table below with new model entries and added post-training config files for dinov2 and dinov2r, with and without registers.
- We have clarified how each model is trained by explicitly providing a config file next to each model we post-trained in the table below.
- We have added code for linear segmentation for the Cityscapes Dataset.
- We cleaned up the code further and made training more configurable through the config files. Specifically, we added the following parameters:
  - eval_attn_maps (True/False): whether to evaluate the attention maps during training.
  - num_register_tokens (int, default 0): whether to use registers and how many. Only works with architecture v2.
If you are interested in the legacy code, please see our GitHub branch neco-1_x.
Introduction
NeCo introduces a new self-supervised learning technique for enhancing spatial representations in vision transformers. By leveraging Patch Neighbor Consistency, NeCo captures fine-grained details and structural information that are crucial for various downstream tasks, such as semantic segmentation.
Key features of NeCo include:
- Patch-based neighborhood consistency
- Improved dense prediction capabilities
- Efficient training requiring only 19 GPU hours
- Compatibility with existing vision transformer backbones
Below is a table with some of our results on Pascal VOC 2012, based on DINOv2 and other backbones.
| backbone | arch | params | Overclustering k=500 | Dense NN Retrieval | linear | download (student) | download (teacher) | config |
|---|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-S/14 | 21M | 57.7 | 78.6 | 81.4 | student | teacher | config |
| DINOv2R-XR | ViT-S/14 | 21M | 72.6 | 80.2 | 81.3 | student | teacher | config |
| DINOv2R | ViT-S/14 | 21M | 68.9 | 80.7 | 81.5 | student | teacher | config |
| DINOv2 | ViT-B/14 | 85M | 71.1 | 82.8 | 84.5 | student | teacher | config |
| DINOv2R-XR | ViT-B/14 | 85M | 71.8 | 83.5 | 83.3 | student | teacher | config |
| DINOv2R | ViT-B/14 | 85M | 71.9 | 82.9 | 84.4 | student | teacher | config |
| DINO | ViT-S/16 | 22M | 47.9 | 61.3 | 65.8 | student | teacher | config |
| TimeT | ViT-S/16 | 22M | 53.1 | 66.5 | 68.5 | student | teacher | config |
| Leopart | ViT-S/16 | 22M | 55.3 | 66.2 | 68.3 | student | teacher | config |
In the following sections, we will delve into the training process, evaluation metrics, and provide instructions for using NeCo in your own projects.
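The arch column in the table above encodes the patch size: ViT-S/14 uses 14x14 patches and ViT-S/16 uses 16x16 patches. For dense prediction it helps to know how many patch tokens an input image yields. The following is a minimal sketch of that arithmetic. `patch_grid` is a hypothetical helper, not part of this repository, and the 224x224 input size is an assumption taken from the name of the `neco_224x224.yml` config.

```python
def patch_grid(image_size, patch_size):
    """Return (rows, cols, num_patches) for a ViT with square patches.

    Hypothetical helper for illustration; not part of the NeCo code base.
    """
    height, width = image_size
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"image size {image_size} is not a multiple of patch size {patch_size}"
        )
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


# Assumed 224x224 inputs, with the two patch sizes from the table.
print(patch_grid((224, 224), 14))  # (16, 16, 256)
print(patch_grid((224, 224), 16))  # (14, 14, 196)
```

A segmentation head therefore sees a 16x16 grid of features for the /14 models and a 14x14 grid for the /16 models at this input size.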
GPU Requirements
Post-training with NeCo does not require a large GPU budget. We run our training on a single NVIDIA A100 GPU.
Environment Setup
We use conda for dependency management.
Please use environment.yml to install the environment necessary to run everything from our work.
You can install it by running the following command:
conda env create -f environment.yaml
Alternatively, you can follow the step-by-step process in the Installation Guide.
Pythonpath
Export the module to PYTHONPATH within the repository's parent directory.
export PYTHONPATH="${PYTHONPATH}:$PATH_TO_REPO"
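In a notebook or a one-off script where exporting the variable is inconvenient, a similar effect can be obtained from inside Python. This is a minimal sketch: `add_repo_to_path` is a hypothetical helper, not part of the repository, and it only affects the current process.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_root):
    """Put the repository root on sys.path for the current process only.

    Hypothetical helper; the shell export above remains the documented route.
    """
    root = Path(repo_root).expanduser().resolve()
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")
    root_str = str(root)
    # Avoid adding the same entry twice when the cell is re-run.
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


# Example: assume the script is started from the repository root.
print(add_repo_to_path("."))
```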
Neptune
We use neptune for logging experiments. Get your API token for neptune and insert it in the corresponding run files. Also make sure to adapt the project name when setting up the logger.
Loading pretrained models
To use NeCo models on downstream dense prediction tasks, you only need to install timm and torch. The models can be downloaded from our NeCo Hugging Face repo. Depending on which checkpoint you use, you can load it as follows:
Models after post-training dinov2 (following dinov2 architecture)
- Models that can be loaded with this approach can be found at the hugging face page:
NeCo on Dinov2
import torch
# change to dinov2_vitb14 for base as described in:
# https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
NeCo on Dinov2 with Registers
import torch
# change to dinov2_vitb14_reg for base as described in:
# https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
Models after post-training dino or similar (following dino architecture)
- Models that can be loaded with this approach can be found at the hugging face page:
timm vit-small and vit-base architectures
import torch
from timm.models.vision_transformer import vit_small_patch16_224, vit_base_patch16_224
# Change to vit_base_patch16_224() if you want to use our larger model
model = vit_small_patch16_224()
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
Note: To load the model weights directly from a Hugging Face URL, run:
import torch
state_dict = torch.hub.load_state_dict_from_url("<url to the hugging face checkpoint>")
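The loading snippets in this section all pass `strict=False`, which means keys that do not match between the checkpoint and the model are skipped silently. It can be worth checking how much actually matched. The sketch below compares two sets of parameter names. `compare_keys` is a hypothetical helper, not part of the repository, and the key names in the example are made up for illustration.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Summarize the overlap between model and checkpoint parameter names.

    Hypothetical helper for illustration; works on any iterables of strings.
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "matched": sorted(model_keys & checkpoint_keys),
        "missing_in_checkpoint": sorted(model_keys - checkpoint_keys),
        "unexpected_in_checkpoint": sorted(checkpoint_keys - model_keys),
    }


# Toy key names, standing in for model.state_dict().keys() and state_dict.keys().
model_keys = ["cls_token", "blocks.0.attn.qkv.weight", "norm.weight"]
checkpoint_keys = ["blocks.0.attn.qkv.weight", "norm.weight", "head.weight"]

report = compare_keys(model_keys, checkpoint_keys)
for name, keys in report.items():
    print(f"{name}: {keys}")
```

With a real model, `compare_keys(model.state_dict().keys(), state_dict.keys())` gives the same report. PyTorch also returns this information directly: `model.load_state_dict(state_dict, strict=False)` returns an object with `missing_keys` and `unexpected_keys` fields. If almost nothing matches, the checkpoint keys may carry a prefix that has to be stripped first.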
Training Setup
Repository Structure
- src/: Model, method, and transform definitions
- experiments/: Scripts for setting up and running experiments
- data/: Data modules for ImageNet, COCO, Pascal VOC, and ADE20k
Training with NeCo
- Use configs in experiments/configs/ to reproduce our experiments
- Modify paths in config files to match your dataset and checkpoint directories
- For new datasets:
  - Change the data path in the config
  - Add a new data module
  - Initialize the new data module in experiments/train_with_neco.py
For instance, to start training on COCO:
python experiments/train_with_neco.py --config_path experiments/configs/neco_224x224.yml
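Since the configs contain dataset and checkpoint paths that must be adapted to your machine, a quick check before launching can save a failed run. The sketch below walks a nested config and reports string values that look like paths but do not exist. `find_missing_paths` is a hypothetical helper, not part of the repository, and the config keys in the example are invented. The "contains a path separator" test is a heuristic and will also flag URLs.

```python
import os


def find_missing_paths(config, prefix=""):
    """Collect (dotted_key, value) pairs for path-like strings that do not exist.

    Hypothetical helper; treats any string containing a path separator as a path.
    """
    if isinstance(config, dict):
        items = config.items()
    elif isinstance(config, (list, tuple)):
        items = enumerate(config)
    else:
        return []
    missing = []
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list, tuple)):
            missing.extend(find_missing_paths(value, name))
        elif isinstance(value, str) and ("/" in value or os.sep in value):
            if not os.path.exists(value):
                missing.append((name, value))
    return missing


# Invented keys and paths; the real config layout may differ.
config = {
    "data": {"data_dir": "/path/that/does/not/exist/coco"},
    "train": {"checkpoint_dir": ".", "max_epochs": 10},
}
for name, value in find_missing_paths(config):
    print(f"{name}: {value} does not exist")
```

With PyYAML installed, a config file could be read with `yaml.safe_load` and the resulting dictionary passed to `find_missing_paths` before starting training.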
Evaluation
We provide several evaluation scripts for different tasks. For detailed instructions and examples, please refer to the Evaluation README. Here's a summary of the evaluation methods:
- Linear Segmentation:
  - Use linear_finetune.py for fine-tuning.
  - Use eval_linear.py for evaluating on the validation dataset.
- Overclustering:
  - Use eval_overcluster.py to evaluate overclustering performance.
- Cluster Based Foreground Extraction + Community Detection (CBFE+CD):
  - Requires downloading noisy attention train and val masks.
  - Provides examples for both ViT-Small and ViT-Base models.
Each evaluation method has specific configuration files and command-line arguments. The Evaluation README provides detailed examples and instructions for running these evaluations on different datasets and model architectures.
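Segmentation results such as those in the table above are commonly reported as mean intersection-over-union (mIoU). This README does not name the metric, so treating the scores as mIoU is an assumption here. The following minimal sketch shows how mIoU is computed from flat lists of predicted and ground-truth labels. It is not the repository's evaluation code, and `confusion_matrix` and `mean_iou` are hypothetical helpers. The default `ignore_index=255` follows the Pascal VOC convention for void pixels.

```python
def confusion_matrix(preds, targets, num_classes, ignore_index=255):
    """Build a num_classes x num_classes matrix indexed as [target][prediction]."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for pred, target in zip(preds, targets):
        if target == ignore_index:
            continue  # void pixels do not count toward the score
        matrix[target][pred] += 1
    return matrix


def mean_iou(matrix):
    """Average IoU over the classes that occur in predictions or targets."""
    ious = []
    for c in range(len(matrix)):
        true_pos = matrix[c][c]
        false_neg = sum(matrix[c]) - true_pos
        false_pos = sum(row[c] for row in matrix) - true_pos
        denom = true_pos + false_pos + false_neg
        if denom:
            ious.append(true_pos / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Six toy pixels, three classes; the last pixel is void and is skipped.
preds = [0, 0, 1, 1, 2, 2]
targets = [0, 1, 1, 1, 2, 255]
matrix = confusion_matrix(preds, targets, num_classes=3)
print(round(mean_iou(matrix), 4))  # 0.7222
```

In the toy example the per-class IoUs are 1/2, 2/3 and 1, so the mean is 13/18.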
Datasets
We use PyTorch Lightning data modules for our datasets. Supported datasets include ImageNet100k, COCO, Pascal VOC, ADE20k, and Cityscapes. Each dataset requires a specific folder structure for proper functioning.
Data modules are located in the data/ directory and handle loading, preprocessing, and augmentation. When using these datasets, ensure you update the paths in your configuration files to match your local setup.
For detailed information on dataset preparation, download instructions, and specific folder structures, please refer to the Dataset README.
Visualizations
We provide visualizations to help understand the performance of our method. Below is an example of Cluster-Based Foreground Extraction (CBFE) results on the Pascal VOC dataset:

This visualization shows NeCo's ability to separate objects without relying on any supervision. Different objects are represented by distinct colors, and the method captures tight and precise object boundaries.
Citations
If you find this repository useful, please consider giving it a star and citing our paper:
@inproceedings{
pariza2025near,
title={Near, far: Patch-ordering enhances vision foundation models' scene understanding},
author={Valentinos Pariza and Mohammadreza Salehi and Gertjan J. Burghouts and Francesco Locatello and Yuki M Asano},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=Qro97zWC29}
}
License
All our code is released under the MIT license, with the exception of DINOv2-related code, which follows the Apache 2 license of DINOv2.