SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation [CVPR 2025]
⭐ CVPR 2025 Highlight ⭐
Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta
Welcome to the official repository for "SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation".
In this work, we build upon Segment Anything 2 (SAM2) and make it wiser by infusing natural language understanding and explicit temporal modeling.
✅ No fine-tuning of SAM2 weights.
🧠 No reliance on external VLMs for multi-modal interaction.
🏆 State-of-the-art performance across multiple benchmarks.
💡 Minimal overhead: just 4.9 M additional parameters!
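To sanity-check the parameter budget on any checkpoint you load, here is a minimal PyTorch sketch (generic code, not part of the repo; the `nn.Linear` below is just a stand-in for the model you instantiate):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> tuple[float, float]:
    """Return (total, trainable) parameter counts in millions."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total / 1e6, trainable / 1e6

# Stand-in module for illustration; substitute your loaded SAMWISE model.
total_m, trainable_m = count_params(nn.Linear(1024, 1024))
print(f"total: {total_m:.1f} M | trainable: {trainable_m:.1f} M")
```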
📄 Read our paper on arXiv
🌐 Demo & Project Page
📢 [May 2025] Check out SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation, a unified framework powered by SAM2, supporting points, boxes, scribbles, and masks. No external models, no prompt-specific tweaks. 👉 Check out SANSA
📢 [June 2025] Try SAMWISE on your own data: we've added a simple script to run SAMWISE on videos or images using textual prompts. 👉 Try SAMWISE on Your Own Data.
🎬 SAMWISE in Action
SAMWISE (our model, not the hobbit) segments objects from The Lord of the Rings zero-shot: no extra training, just living up to its namesake! 🧙‍♂️✨
https://github.com/user-attachments/assets/b582557e-a41a-4eb1-9069-a88dafd3e546
📦 Data Preparation
Before running SAMWISE, set up your datasets: refer to data.md for detailed preparation instructions.
Once organized, the directory structure should look like this:
SAMWISE/
├── data/
│   ├── ref-youtube-vos/
│   ├── ref-davis/
│   └── MeViS/
├── datasets/
├── models/
│   ├── sam2/
│   ├── samwise.py
│   └── ...
...
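If you want a quick check that the datasets landed in the right place, here is a minimal sketch (folder names as in the tree above; run it from the repo root):

```python
from pathlib import Path

# Expected dataset folders, relative to the SAMWISE/ repo root.
expected = ["data/ref-youtube-vos", "data/ref-davis", "data/MeViS"]

for rel in expected:
    status = "ok" if Path(rel).is_dir() else "MISSING"
    print(f"{rel}: {status}")
```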
⚙️ Environment Setup
The code has been tested with Python 3.10 and PyTorch 2.3.1 (with CUDA 11.8). To set up the environment using Conda, run:
conda create --name samwise python=3.10 -y
conda activate samwise
pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
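Once installed, you can verify that the intended PyTorch build and CUDA runtime are visible (a generic check, not a repo script):

```python
import torch

# Expect 2.3.1+cu118, True, and 11.8 on a correctly configured machine.
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)  # CUDA version this PyTorch build targets
```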
🎥 Referring Video Object Segmentation (RVOS)
Reproducing Our Results: below we provide the model weights needed to replicate the numbers reported in our paper.
| Dataset | Total Params | Trainable Params | J&F | Model | Zip |
|---|---|---|---|---|---|
| MeViS | 210 M | 4.9 M | 49.5 | Weights | Zip |
| MeViS - valid_u | 210 M | 4.9 M | 57.1 | Weights | - |
| Ref-Youtube-VOS | 210 M | 4.9 M | 69.2 | Weights | Zip |
| Ref-Davis | 210 M | 4.9 M | 70.6 | Weights | - |
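For reference, J&F averages region similarity J (mask IoU) with contour accuracy F (a boundary F-measure). The official benchmarks ship their own evaluation code; the sketch below is only illustrative, and its boundary extraction is a crude 4-neighbour approximation with no matching tolerance:

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: IoU of two binary masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

def boundary(mask: np.ndarray) -> np.ndarray:
    """Foreground pixels with at least one background 4-neighbour."""
    p = np.pad(mask, 1, constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return mask & ~interior

def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Contour accuracy F: F-measure between exact boundary pixel sets."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    prec = (bp & bg).sum() / max(bp.sum(), 1)
    rec = (bp & bg).sum() / max(bg.sum(), 1)
    return 2 * prec * rec / max(prec + rec, 1e-8)

pred = np.zeros((64, 64), bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), bool); gt[12:42, 12:42] = True
j, f = jaccard(pred, gt), boundary_f(pred, gt)
print(f"J={j:.3f}  F={f:.3f}  J&F={(j + f) / 2:.3f}")
```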
To evaluate the model on MeViS - valid_u, run the following command:
python3 inference_mevis.py --split valid_u --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head
For Ref-Davis, run the following command:
python3 inference_davis.py --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head
For MeViS and Ref-Youtube-VOS, evaluation is performed on the official servers: zip the predicted masks and upload the archive to the corresponding online evaluation server (see the Training and Inference Guide below for submission details).
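A minimal sketch of building such an archive with the standard library (the prediction directory below is a placeholder for whatever your inference run produced; check each server's expected internal layout before uploading):

```python
import shutil

# Placeholder path: point this at the folder of predicted masks
# written by your inference run.
pred_dir = "output/my_exp/valid"

# Creates submission.zip containing the contents of pred_dir.
shutil.make_archive("submission", "zip", root_dir=pred_dir)
```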
🖼️ Referring Image Segmentation (RIS)
We also test SAMWISE on the Referring Image Segmentation (RIS) benchmark.
| RefCOCO | RefCOCO+ | RefCOCOg | Model |
|---|---|---|---|
| 75.6 | 65.8 | 66.8 | Weights |
Run the following to evaluate on RIS:
python3 main_pretrain.py --eval --resume=[/path/to/model_weight] --name_exp [name_exp] --disable_pred_obj_score
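For context, RIS results are typically IoU-based; the two common variants are mean IoU (averaged per image) and overall IoU (intersections and unions pooled over the whole dataset). An illustrative sketch of the difference (not the repo's evaluation code):

```python
import numpy as np

def ris_metrics(preds, gts):
    """Return (mIoU, oIoU) over lists of binary masks.
    mIoU averages per-image IoU; oIoU pools intersections/unions first."""
    inter_sum, union_sum, ious = 0, 0, []
    for p, g in zip(preds, gts):
        i = np.logical_and(p, g).sum()
        u = np.logical_or(p, g).sum()
        inter_sum, union_sum = inter_sum + i, union_sum + u
        ious.append(i / u if u > 0 else 1.0)
    return float(np.mean(ious)), inter_sum / union_sum

# Toy example with two images: mIoU and oIoU weight them differently.
a = np.zeros((8, 8), bool); a[:4] = True
b = np.zeros((8, 8), bool); b[:, :4] = True
miou, oiou = ris_metrics([a, b], [a, a])
print(f"mIoU={miou:.3f}  oIoU={oiou:.3f}")
```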
🚀 Training and Inference
For step-by-step instructions on training and inference, please refer to the Training and Inference Guide.
This document includes all necessary details on:
- Training SAMWISE on different datasets
- Running inference and evaluating performance
- Submitting results to online benchmarks
▶️ Try SAMWISE on Your Own Data
We provide a simple script to run SAMWISE on your own inputs using natural language prompts.
Supported input types:
- A single image (.jpg, .png, .jpeg)
- A video (.mp4)
- A folder of consecutive video frames (e.g., frame_00001.png, frame_00002.png, ...)
Run the script:
python inference_demo.py --input_path <your_input> --text_prompts <text_prompt 1> <text_prompt 2>
Examples:
# On a single image
python inference_demo.py --input_path assets/example_image.jpg --text_prompts "the dog who is jumping" "the dog on the left" "the person with a yellow jacket"
# On a video
python inference_demo.py --input_path assets/example_video.mp4 --text_prompts "the horse jumping" "the person riding the horse"
# On a folder of consecutive frames
python inference_demo.py --input_path demo_sequence --text_prompts "the horse jumping" "the person riding the horse"
Output:
- Image input:
- demo_output/<text_prompt>/example_image.png
- Video or sequence of frames:
- Segmented frames: demo_output/<text_prompt>/frame_*.png
- Segmented video: demo_output/<text_prompt>.mp4
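If your data is an .mp4 but you prefer the frame-folder input, here is a minimal OpenCV sketch to dump consecutive frames in the naming pattern the script expects (paths reuse the demo examples above):

```python
import cv2
from pathlib import Path

video, out_dir = "assets/example_video.mp4", Path("demo_sequence")
out_dir.mkdir(exist_ok=True)

cap = cv2.VideoCapture(video)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    idx += 1
    # Zero-padded names keep the frames ordered: frame_00001.png, ...
    cv2.imwrite(str(out_dir / f"frame_{idx:05d}.png"), frame)
cap.release()
print(f"wrote {idx} frames to {out_dir}/")
```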
🙏 Acknowledgements
We build upon the amazing work from SAM2.
Citation
@InProceedings{cuttano2025samwise,
author = {Cuttano, Claudia and Trivigno, Gabriele and Rosi, Gabriele and Masone, Carlo and Averta, Giuseppe},
title = {SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {3395-3405}
}