(ACM MM 25) Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation

April 16, 2026

ACM MM 2025 License: Apache 2.0

Official code repository for our ACM MM 2025 paper:

"Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation"
Xiangyu Zheng, Songcheng He, Wanyu Li, Xiaoqiang Li, Wei Zhang [Paper Link]


Introduction

This repository provides the official implementation of our ACM MM 2025 paper, "Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation" [Paper Link].

In this work, we propose HMHI-Net, a novel method for Unsupervised Video Object Segmentation (UVOS) that brings shallow features into memory. The method features:

  • A novel Hierarchical Memory Architecture that incorporates both shallow- and high-level features into memory, so that the memory banks supply UVOS with pixel-level detail as well as semantic richness.
  • A Heterogeneous Mutual Refinement Mechanism that performs interaction across the two memory banks through the pixel-guided local alignment module (PLAM) and the semantic-guided global integration module (SGIM), respectively.
  • State-of-the-art results on common UVOS and VSOD benchmarks, with 89.8% J&F on DAVIS-16, 86.9% J on FBMS and 76.2% J on YouTube-Objects.

Pipeline Overview
(a) Overall pipeline of HMHI-Net. (b) Memory readout mechanism to refine the current frame. (c) Pixel-guided local alignment module. (d) Semantic-guided global integration module. (e) Memory update mechanism with the reference encoder.

Video Demo

| Demo | Clip |
| --- | --- |
| Demo 1 | Car-roundabout_Davis16 |
| Demo 2 | Dog_Davis16 |
| Demo 3 | Drift-straight_Davis16 |
| Demo 4 | Parkour_Davis16 |


Getting Started

1. Environment Setup

pip install -r requirements.txt

Thanks to [Calledit] for providing a more detailed environment installation script!

#!/bin/bash

conda create -n env_name python=3.10
conda activate env_name

pip install torch numpy opencv-python timm mmcv bytecode IPython tensorboard scikit-image 

git clone https://github.com/luo3300612/Visualizer

cd Visualizer/
python setup.py install
cd ..


mkdir -p checkpoint/pretrained/mit/
# -O (uppercase) sets the output file; lowercase -o would write wget's log to that path instead.
wget -O checkpoint/pretrained/mit/mit_b1.pth https://download.openmmlab.com/mmsegmentation/v0.5/segformer/segformer_mit-b1_512x512_160k_ade20k/segformer_mit-b1_512x512_160k_ade20k_20220620_112037-c3f39e00.pth

pip install gdown

gdown --id 1OG_Dla9f-sBuoi3Q6mF55Au3rU-Fc9Sg -O checkpoint/infermodel.pth

mkdir -p Your_eval_data_path/FBMS2SEG_byvideo/frame/val
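
After running the script, it can be worth confirming that the two downloads actually landed as weight files rather than as empty or log files. The following is a minimal sketch, not part of the repository: the two paths come from the script above, while the function name `check_downloads` and the 1 MB size threshold are illustrative assumptions.

```python
from pathlib import Path

# Paths taken from the setup script above.
EXPECTED_FILES = [
    Path("checkpoint/pretrained/mit/mit_b1.pth"),
    Path("checkpoint/infermodel.pth"),
]
# Assumed threshold: real weight files are far larger than 1 MB.
MIN_BYTES = 1_000_000


def check_downloads(root=".", files=EXPECTED_FILES, min_bytes=MIN_BYTES):
    """Return a list of problems; an empty list means every file looks fine."""
    problems = []
    for rel in files:
        path = Path(root) / rel
        if not path.is_file():
            problems.append(f"missing: {rel}")
        elif path.stat().st_size < min_bytes:
            size = path.stat().st_size
            problems.append(f"suspiciously small ({size} bytes): {rel}")
    return problems
```

Calling `check_downloads()` from the repository root returns an empty list when both files are present and larger than the threshold; any other result lists the files to re-download.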

2. Data Preparation

Dataset Download

| Dataset | Download Link |
| --- | --- |
| YouTube-VOS | [Download] |
| DAVIS-16 | [Download] |
| FBMS | [Download] |
| YouTube-Objects | [Download] |
| DAVSOD | [Download] |
| ViSal | [Download] |

Optical Flow Preparation

Following previous UVOS works, optical flow maps for both training and inference data are generated with [RAFT].
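
Before running a flow estimator over a video, the frames have to be grouped into pairs. The sketch below shows one common pairing convention and is an assumption, not a description of how this repository pairs frames: each frame is paired with its successor, and the last frame is paired with its predecessor so that every frame receives a flow map. The helper name `flow_pairs` is hypothetical.

```python
def flow_pairs(frames):
    """Pair each frame with a neighbour so every frame gets one flow map.

    Assumed convention: frame t is paired with frame t+1; the final frame
    has no successor, so it is paired with frame t-1 instead.
    """
    frames = sorted(frames)
    if len(frames) < 2:
        return []  # a single frame has no neighbour to compute flow against
    pairs = [(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    pairs.append((frames[-1], frames[-2]))
    return pairs
```

Each returned tuple is a (source, target) pair that could be fed to the flow estimator, with the output saved under the source frame's name in `Flows/`.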

Folder Structure

Please organize the data files as follows:

data/
├── DAVIS-16/
│   ├── Images/
│   │   ├── train/
│   │   │   ├── video_name1/
│   │   │   ├── video_name2/
│   │   │   └── ...
│   │   └── val/
│   │       ├── video_name1/
│   │       ├── video_name2/
│   │       └── ...
│   ├── Annotations/
│   │   ├── train/
│   │   │   ├── video_name1/
│   │   │   ├── video_name2/
│   │   │   └── ...
│   │   └── val/
│   │       ├── video_name1/
│   │       ├── video_name2/
│   │       └── ...
│   └── Flows/
│       ├── train/
│       │   ├── video_name1/
│       │   ├── video_name2/
│       │   └── ...
│       └── val/
│           ├── video_name1/
│           ├── video_name2/
│           └── ...
├── Youtube-VOS/
│   ├── Images/
│   │   └── ...
│   ├── Annotations/
│   │   └── ...
│   └── Flows/
│       └── ...
└── ...
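
A quick consistency check can catch missing annotation or flow folders before training starts. This is a minimal sketch that assumes only the layout shown above (`Images/`, `Annotations/`, `Flows/`, each with `train/` and `val/`); the function name `check_dataset` is hypothetical and not part of the repository.

```python
from pathlib import Path

SUBDIRS = ("Images", "Annotations", "Flows")  # from the tree above
SPLITS = ("train", "val")


def _video_names(directory):
    """Names of the per-video folders inside a directory (empty if absent)."""
    if not directory.is_dir():
        return set()
    return {p.name for p in directory.iterdir() if p.is_dir()}


def check_dataset(dataset_root):
    """Report videos present under Images/ but missing elsewhere.

    Returns a dict mapping "<subdir>/<split>" to a sorted list of video
    names that exist in Images/<split> but not in <subdir>/<split>.
    """
    root = Path(dataset_root)
    report = {}
    for split in SPLITS:
        videos = _video_names(root / "Images" / split)
        for sub in SUBDIRS[1:]:
            missing = sorted(videos - _video_names(root / sub / split))
            if missing:
                report[f"{sub}/{split}"] = missing
    return report
```

For example, `check_dataset("data/DAVIS-16")` returns an empty dict when every video under `Images/` has matching `Annotations/` and `Flows/` folders.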

3. Checkpoint Preparation

Download Pretrained Model

Download the pretrained models and save them in './checkpoint/pretrained/' for model training.

We adopt the Segformer models pretrained on ImageNet-1k.

| Pretrained Model | Model Link |
| --- | --- |
| [Segformer] (NeurIPS 21) | [Mit_b0 - Mit_b5] or [GoogleDrive] |
| [Swin-Transformer] (ICCV 21) | [Swin-T - Swin-B] |

Download HMHI-Net Checkpoints

| Task | Download Link |
| --- | --- |
| UVOS Checkpoints | [DAVIS-16], [FBMS], [YouTube-Objects] |
| VSOD Checkpoints | [DAVIS-16], [DAVSOD], [FBMS], [ViSal] |

4. Training

# Certain config values in the file may require modification to suit your local setup.
bash scripts/train.sh

5. Fine-Tuning

Load the checkpoint that performed best on the corresponding dataset during the training stage, then start fine-tuning.

# Certain config values in the file may require modification to suit your local setup.
bash scripts/finetune.sh

6. Inference

# Certain config values in the file may require modification to suit your local setup.
bash scripts/infer.sh

7. Evaluation

# Certain config values in the file may require modification to suit your local setup.

# For UVOS tasks
python utils/val_zvos.py

# For VSOD tasks
python utils/val_vsod.py
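
For readers unfamiliar with the J score reported above, it is the region similarity, i.e. the Jaccard index (intersection over union) between a predicted mask and its ground truth. The sketch below is a dependency-free illustration of that definition only; it is not the code in `utils/val_zvos.py`, which may aggregate scores differently, and it does not cover the boundary measure F. The function names are hypothetical.

```python
def region_similarity(pred, gt):
    """Jaccard index J = |pred AND gt| / |pred OR gt| for two binary masks.

    Masks are equal-sized 2D sequences (rows of 0/1 or booleans). When both
    masks are empty the score is defined as 1.0 here, a common convention.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += bool(p) and bool(g)
            union += bool(p) or bool(g)
    return 1.0 if union == 0 else inter / union


def mean_j(pairs):
    """Average J over a non-empty list of (pred, gt) mask pairs."""
    scores = [region_similarity(p, g) for p, g in pairs]
    return sum(scores) / len(scores)
```

In practice the masks would be loaded as arrays from the prediction and annotation folders, and the per-frame scores averaged per video before averaging over the dataset.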

Acknowledgement

This repository is built upon [Isomer] and [SAM], originally proposed in:

  1. "Isomer: Isomerous Transformer for Zero-Shot Video Object Segmentation", Yichen Yuan, Yifan Wang, Lijun Wang, Xiaoqi Zhao, Huchuan Lu, Yu Wang, Weibo Su, Lei Zhang, ICCV, 2023. [Paper]

  2. "Segment Anything", Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollár, Ross Girshick, arXiv, 2023. [Paper]

We reuse parts of their codebase, including:

  • The data loading pipeline

  • Model initialization logic

  • Training routines

  • Module formulation

License

The model is licensed under the Apache 2.0 license.

Citing HMHI-Net

@inproceedings{Zheng2025mm,
  title     = {Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation},
  author    = {Xiangyu Zheng and Songcheng He and Wanyu Li and Xiaoqiang Li and Wei Zhang},
  booktitle = {Proceedings of the ACM International Conference on Multimedia (ACM MM)},
  year      = {2025}
}