🔍 (ACM MM 25) Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation

April 16, 2026

Official code repository for our ACM MM 2025 paper:

"Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation"
Xiangyu Zheng, Songcheng He, Wanyu Li, Xiaoqiang Li, Wei Zhang 🔗 [Paper Link]

📖 Introduction

This repository provides the official implementation of our ACM MM 2025 paper, "Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation" 🔗 [Paper Link].

In this work, we propose HMHI-Net, a novel method for Unsupervised Video Object Segmentation (UVOS) that stores shallow features in memory alongside high-level ones. The method features:

🔧 A novel Hierarchical Memory Architecture that simultaneously incorporates shallow- and high-level features for memory, facilitating UVOS with both pixel-level details and semantic richness stored in memory banks.
🔁 A Heterogeneous Mutual Refinement Mechanism that performs interaction across the two memory banks through the pixel-guided local alignment module (PLAM) and the semantic-guided global integration module (SGIM).
⚡ HMHI-Net achieves state-of-the-art results on common UVOS and VSOD benchmarks, with 89.8% J&F on DAVIS-16, 86.9% J on FBMS and 76.2% J on YouTube-Objects.

(a) Overall pipeline of HMHI-Net. (b) Memory readout mechanism to refine the current frame. (c) Pixel-guided local alignment module. (d) Semantic-guided global integration module. (e) Memory update mechanism with the reference encoder.

🎞️ Video Demo

Demo1	Demo2

Car-roundabout_Davis16	Dog_Davis16

Demo3	Demo4

Drift-straight_Davis16	Parkour_Davis16

🚀 Getting Started

1. Environment Setup

pip install -r requirements.txt

Thanks to 🔗 [Calledit] for providing a more detailed environment installation script!

#!/bin/bash

conda create -n env_name python=3.10
conda activate env_name

pip install torch numpy opencv-python timm mmcv bytecode IPython tensorboard scikit-image 

git clone https://github.com/luo3300612/Visualizer

cd Visualizer/
python setup.py install
cd ..


mkdir -p checkpoint/pretrained/mit/
# Use -O (output document); lowercase -o would write wget's log to this path instead of the weights.
wget -O checkpoint/pretrained/mit/mit_b1.pth https://download.openmmlab.com/mmsegmentation/v0.5/segformer/segformer_mit-b1_512x512_160k_ade20k/segformer_mit-b1_512x512_160k_ade20k_20220620_112037-c3f39e00.pth

pip install gdown

# Recent gdown releases take the file ID as a positional argument (the --id flag is deprecated).
gdown 1OG_Dla9f-sBuoi3Q6mF55Au3rU-Fc9Sg -O checkpoint/infermodel.pth

mkdir -p Your_eval_data_path/FBMS2SEG_byvideo/frame/val
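
Once the script has finished, a quick import check can confirm that the environment is usable. The sketch below is not part of the repository. The module list is an assumption derived from the pip line above (opencv-python is imported as cv2, scikit-image as skimage), and `find_missing` is a hypothetical helper name.

```python
import importlib.util

# Import names assumed from the pip packages listed above
# (opencv-python -> cv2, scikit-image -> skimage).
REQUIRED_MODULES = [
    "torch", "numpy", "cv2", "timm", "mmcv",
    "bytecode", "IPython", "tensorboard", "skimage",
]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only locates the modules without importing them, so it runs quickly and does not initialize any GPU runtime.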

2. Data Preparation

▶️ Dataset Download

Dataset	Download Link
YouTube-VOS	🔗 Download
DAVIS-16	🔗 Download
FBMS	🔗 Download
Youtube-Objects	🔗 Download
DAVSOD	🔗 Download
ViSal	🔗 Download

▶️ Optical Flow Preparation

Following previous UVOS works, optical flow maps for both training and inference data are generated through 🔗 [RAFT].

▶️ Folder Structure

Please organize the data files as follows:

data/
  └── DAVIS-16/
        ├── Images/
        |   ├── train/
        |   |   ├── video_name1/
        |   |   ├── video_name2/
        |   |    ...
        |   └── val/
        |       ├── video_name1/
        |       ├── video_name2/
        |       ...
        ├── Annotations/
        |   ├── train/
        |   |   ├── video_name1/
        |   |   ├── video_name2/
        |   |    ...
        |   └── val/
        |       ├── video_name1/
        |       ├── video_name2/
        |       ...
        └── Flows/
            ├── train/
            |   ├── video_name1/
            |   ├── video_name2/
            |    ...
            └── val/
                ├── video_name1/
                ├── video_name2/
                 ...

  └── Youtube-VOS/
        ├── Images/
            ...
        ├── Annotations/
            ...
        └── Flows/
            ...
...
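
Before training, it can help to verify that a dataset folder matches the layout above. The following is a minimal sketch rather than repository code. `check_dataset_layout` is a hypothetical helper, and it assumes that every video listed under Images/ should also have a folder under Annotations/ and Flows/, which may not hold for every dataset split.

```python
from pathlib import Path

SUBDIRS = ("Images", "Annotations", "Flows")
SPLITS = ("train", "val")


def check_dataset_layout(dataset_root):
    """Return a list of human-readable problems found under `dataset_root`."""
    root = Path(dataset_root)
    problems = []
    for split in SPLITS:
        videos = {}
        for sub in SUBDIRS:
            split_dir = root / sub / split
            if not split_dir.is_dir():
                problems.append(f"missing directory: {sub}/{split}")
                continue
            videos[sub] = {p.name for p in split_dir.iterdir() if p.is_dir()}
        # Assumption: each video in Images/ needs a matching folder elsewhere.
        reference = videos.get("Images", set())
        for sub in ("Annotations", "Flows"):
            if sub not in videos:
                continue
            for name in sorted(reference - videos[sub]):
                problems.append(f"{sub}/{split}/{name} is missing")
    return problems


if __name__ == "__main__":
    for problem in check_dataset_layout("data/DAVIS-16"):
        print(problem)
```

An empty result means that all three trees exist for both splits and contain the same video folders as Images/.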

3. Checkpoint Preparation

▶️ Download Pretrained Model

Download the pretrained models and save them in './checkpoint/pretrained/' for model training.

We adopt Segformer models pretrained on ImageNet-1k.

Pretrained Model	Model Link
🔗 Segformer (NeurIPS 21)	🔗 Mit_b0 - Mit_b5 or 🔗 GoogleDrive
🔗 Swin-Transformer (ICCV 21)	🔗 Swin-T - Swin-B
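
A failed or interrupted download can leave a small text or HTML file at the target path, which only surfaces later as a confusing loading error. The sketch below is a heuristic check, not repository code. `looks_like_checkpoint` and the 1 MB size threshold are assumptions chosen for illustration, and the check inspects only the first bytes without loading the weights.

```python
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"  # zip-based format written by recent torch.save
PICKLE_MAGIC = b"\x80"     # binary pickle header used by the legacy format


def looks_like_checkpoint(path, min_bytes=1_000_000):
    """Heuristically check that `path` is a binary checkpoint, not a text log."""
    file = Path(path)
    if not file.is_file() or file.stat().st_size < min_bytes:
        return False
    with file.open("rb") as handle:
        head = handle.read(4)
    return head.startswith(ZIP_MAGIC) or head.startswith(PICKLE_MAGIC)


if __name__ == "__main__":
    target = "checkpoint/pretrained/mit/mit_b1.pth"
    print(target, "looks valid:", looks_like_checkpoint(target))
```

A `False` result suggests re-downloading the file before starting training.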

▶️ Download HMHI-Net Checkpoints

Task	Download Link
UVOS Checkpoints	🔗 DAVIS-16
UVOS Checkpoints	🔗 FBMS
UVOS Checkpoints	🔗 Youtube-Objects
VSOD Checkpoints	🔗 DAVIS-16
VSOD Checkpoints	🔗 DAVSOD
VSOD Checkpoints	🔗 FBMS
VSOD Checkpoints	🔗 ViSal

4. Training

# Certain config values in the file may require modification to suit your local setup.
bash scripts/train.sh

5. Fine-Tuning

Load the checkpoint that performed best on the corresponding dataset during the Training stage, then start fine-tuning.

# Certain config values in the file may require modification to suit your local setup.
bash scripts/finetune.sh

6. Inference

# Certain config values in the file may require modification to suit your local setup.
bash scripts/infer.sh

7. Evaluation

# Certain config values in the file may require modification to suit your local setup.

# For UVOS tasks
python utils/val_zvos.py

# For VSOD tasks
python utils/val_vsod.py
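
For readers unfamiliar with the UVOS metrics quoted in the introduction: J is the region similarity (the intersection-over-union between a predicted and a ground-truth mask), F measures boundary accuracy, and J&F is the mean of the two. The sketch below illustrates only J on plain nested lists. It is not the code in utils/val_zvos.py, and `jaccard` and `mean_j` are hypothetical names.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_j(pairs):
    """Average J over an iterable of (prediction, ground-truth) mask pairs."""
    scores = [jaccard(pred, gt) for pred, gt in pairs]
    return sum(scores) / len(scores)


pred = [[0, 1, 1],
        [0, 1, 0]]
gt = [[0, 1, 0],
      [0, 1, 1]]
print(jaccard(pred, gt))  # 2 overlapping pixels / 4 in the union = 0.5
```

Benchmark numbers are typically obtained by averaging such per-frame scores within each video and then across videos. The exact protocol is defined by the evaluation scripts, not by this sketch.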

Acknowledgement

This repository is built upon [🔗 Isomer] and [🔗 SAM], originally proposed in:

"Isomer: Isomerous Transformer for Zero-Shot Video Object Segmentation", Yichen Yuan, Yifan Wang, Lijun Wang, Xiaoqi Zhao, Huchuan Lu, Yu Wang, Weibo Su, Lei Zhang, ICCV, 2023. [🔗 Paper]
"Segment Anything", Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick, arXiv, 2023. [🔗 Paper]

We reuse parts of their codebase, including:

The data loading pipeline
Model initialization logic
Training routines
Module formulation

License

The model is licensed under the Apache 2.0 license.

Citing HMHI-Net

@inproceedings{Zheng2025mm,
  title     = {Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation},
  author    = {Xiangyu Zheng and Songcheng He and Wanyu Li and Xiaoqiang Li and Wei Zhang},
  booktitle = {Proceedings of the ACM International Conference on Multimedia (ACM MM)},
  year      = {2025}
}