SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation [CVPR 2025]
⭐ CVPR 2025 Highlight ⭐
Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta
Welcome to the official repository for "SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation".
In this work, we build upon Segment Anything 2 (SAM2) and make it wiser by infusing natural language understanding and explicit temporal modeling.
✅ No fine-tuning of SAM2 weights.
🧠 No reliance on external VLMs for multi-modal interaction.
🏆 State-of-the-art performance across multiple benchmarks.
💡 Minimal overhead: just 4.9 M additional parameters!
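To sanity-check the parameter budget on any checkpoint you load, here is a minimal PyTorch sketch (generic code, not part of the repo; the `nn.Linear` below is just a stand-in for the model you instantiate):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> tuple[float, float]:
    """Return (total, trainable) parameter counts in millions."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total / 1e6, trainable / 1e6

# Stand-in module for illustration; substitute your loaded SAMWISE model.
total_m, trainable_m = count_params(nn.Linear(1024, 1024))
print(f"total: {total_m:.1f} M | trainable: {trainable_m:.1f} M")
```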
📄 Read our paper on arXiv
🌐 Demo & Project Page
📢 [May 2025] Check out SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation, a unified framework powered by SAM2, supporting points, boxes, scribbles, and masks. No external models, no prompt-specific tweaks. 👉 Check out SANSA
📢 [June 2025] Try SAMWISE on your own data: we've added a simple script to run SAMWISE on videos or images using textual prompts. 👉 Try SAMWISE on Your Own Data.
🎬 SAMWISE in Action
SAMWISE (our model, not the hobbit) segments objects from The Lord of the Rings zero-shot: no extra training, just living up to its namesake! 🧙‍♂️✨
https://github.com/user-attachments/assets/b582557e-a41a-4eb1-9069-a88dafd3e546
📦 Data Preparation
Before running SAMWISE, set up your datasets: refer to data.md for detailed preparation instructions.
Once organized, the directory structure should look like this:
SAMWISE/
├── data/
│   ├── ref-youtube-vos/
│   ├── ref-davis/
│   └── MeViS/
├── datasets/
├── models/
│   ├── sam2/
│   ├── samwise.py
│   └── ...
...
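If you want a quick check that the datasets landed in the right place, here is a minimal sketch (folder names as in the tree above; run it from the repo root):

```python
from pathlib import Path

# Expected dataset folders, relative to the SAMWISE/ repo root.
expected = ["data/ref-youtube-vos", "data/ref-davis", "data/MeViS"]

for rel in expected:
    status = "ok" if Path(rel).is_dir() else "MISSING"
    print(f"{rel}: {status}")
```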
⚙️ Environment Setup
The code has been tested with Python 3.10 and PyTorch 2.3.1 (with CUDA 11.8). To set up the environment using Conda, run:
conda create --name samwise python=3.10 -y
conda activate samwise
pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
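Once installed, you can verify that the intended PyTorch build and CUDA runtime are visible (a generic check, not a repo script):

```python
import torch

# Expect 2.3.1+cu118, True, and 11.8 on a correctly configured machine.
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)  # CUDA version this PyTorch build targets
```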
🎥 Referring Video Object Segmentation (RVOS)
Reproducing Our Results: below we provide the model weights needed to replicate the numbers reported in our paper.
| Dataset | Total Params | Trainable Params | J&F | Model | Zip |
|---|---|---|---|---|---|
| MeViS | 210 M | 4.9 M | 49.5 | Weights | Zip |
| MeViS - valid_u | 210 M | 4.9 M | 57.1 | Weights | - |
| Ref-Youtube-VOS | 210 M | 4.9 M | 69.2 | Weights | Zip |
| Ref-Davis | 210 M | 4.9 M | 70.6 | Weights | - |
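For reference, J&F averages region similarity J (mask IoU) with contour accuracy F (a boundary F-measure). The official benchmarks ship their own evaluation code; the sketch below is only illustrative, and its boundary extraction is a crude 4-neighbour approximation with no matching tolerance:

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: IoU of two binary masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def boundary(mask: np.ndarray) -> np.ndarray:
    """Foreground pixels with at least one background 4-neighbour."""
    p = np.pad(mask, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return mask & ~interior

def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Contour accuracy F: F-measure between exact boundary pixel sets."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    prec = (bp & bg).sum() / max(bp.sum(), 1)
    rec = (bp & bg).sum() / max(bg.sum(), 1)
    return 2 * prec * rec / max(prec + rec, 1e-8)

pred = np.zeros((64, 64), bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), bool); gt[12:42, 12:42] = True
j, f = jaccard(pred, gt), boundary_f(pred, gt)
print(f"J={j:.3f}  F={f:.3f}  J&F={(j + f) / 2:.3f}")
```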
To evaluate the model on MeViS - valid_u, run the following command:
python3 inference_mevis.py --split valid_u --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head
For Ref-Davis, run the following command:
python3 inference_davis.py --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head
For MeViS and Ref-Youtube-VOS, evaluation is performed on the official servers: zip the predicted masks and upload the archive to the corresponding online evaluation server (see the Training and Inference Guide below for submission details).
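A minimal sketch of building such an archive with the standard library (the prediction directory below is a placeholder for whatever your inference run produced; check each server's expected internal layout before uploading):

```python
import shutil

# Placeholder path: point this at the folder of predicted masks
# written by your inference run.
pred_dir = "output/my_exp/valid"

# Creates submission.zip containing the contents of pred_dir.
shutil.make_archive("submission", "zip", root_dir=pred_dir)
```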
🖼️ Referring Image Segmentation (RIS)
We also test SAMWISE on the Referring Image Segmentation (RIS) benchmark.
| RefCOCO | RefCOCO+ | RefCOCOg | Model |
|---|---|---|---|
| 75.6 | 65.8 | 66.8 | Weights |
Run the following to evaluate on RIS:
python3 main_pretrain.py --eval --resume=[/path/to/model_weight] --name_exp [name_exp] --disable_pred_obj_score
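For context, RIS results are typically IoU-based; the two common variants are mean IoU (averaged per image) and overall IoU (intersections and unions pooled over the whole dataset). An illustrative sketch of the difference (not the repo's evaluation code):

```python
import numpy as np

def ris_metrics(preds, gts):
    """Return (mIoU, oIoU) over lists of binary masks.
    mIoU averages per-image IoU; oIoU pools intersections/unions first."""
    inter_sum, union_sum, ious = 0, 0, []
    for p, g in zip(preds, gts):
        i = np.logical_and(p, g).sum()
        u = np.logical_or(p, g).sum()
        inter_sum, union_sum = inter_sum + i, union_sum + u
        ious.append(i / u if u > 0 else 1.0)
    return float(np.mean(ious)), inter_sum / union_sum

# Toy example with two images: mIoU and oIoU weight them differently.
a = np.zeros((8, 8), bool); a[:4] = True
b = np.zeros((8, 8), bool); b[:, :4] = True
miou, oiou = ris_metrics([a, b], [a, a])
print(f"mIoU={miou:.3f}  oIoU={oiou:.3f}")
```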
🚀 Training and Inference
For step-by-step instructions on training and inference, please refer to the Training and Inference Guide.
This document includes all necessary details on:
- Training SAMWISE on different datasets
- Running inference and evaluating performance
- Submitting results to online benchmarks
▶️ Try SAMWISE on Your Own Data
We provide a simple script to run SAMWISE on your own inputs using natural language prompts.
Supported input types:
- A single image (.jpg, .png, .jpeg)
- A video (.mp4)
- A folder of consecutive video frames (e.g., frame_00001.png, frame_00002.png, ...)
Run the script:
python inference_demo.py --input_path <your_input> --text_prompts <text_prompt 1> <text_prompt 2>
Examples:
# On a single image
python inference_demo.py --input_path assets/example_image.jpg --text_prompts "the dog who is jumping" "the dog on the left" "the person with a yellow jacket"
# On a video
python inference_demo.py --input_path assets/example_video.mp4 --text_prompts "the horse jumping" "the person riding the horse"
# On a folder of consecutive frames
python inference_demo.py --input_path demo_sequence --text_prompts "the horse jumping" "the person riding the horse"
Output:
- Image input:
- demo_output/<text_prompt>/example_image.png
- Video or sequence of frames:
- Segmented frames: demo_output/<text_prompt>/frame_*.png
- Segmented video: demo_output/<text_prompt>.mp4
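If your data is an .mp4 but you prefer the frame-folder input, here is a minimal OpenCV sketch to dump consecutive frames in the naming pattern the script expects (paths reuse the demo examples above):

```python
import cv2
from pathlib import Path

video, out_dir = "assets/example_video.mp4", Path("demo_sequence")
out_dir.mkdir(exist_ok=True)

cap = cv2.VideoCapture(video)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    idx += 1
    # Zero-padded names keep the frames ordered: frame_00001.png, ...
    cv2.imwrite(str(out_dir / f"frame_{idx:05d}.png"), frame)
cap.release()
print(f"wrote {idx} frames to {out_dir}/")
```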
🙏 Acknowledgements
We build upon the amazing work from SAM2.
Citation
@InProceedings{cuttano2025samwise,
author = {Cuttano, Claudia and Trivigno, Gabriele and Rosi, Gabriele and Masone, Carlo and Averta, Giuseppe},
title = {SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {3395-3405}
}