SAM4MLLM

March 20, 2025 · View on GitHub

This is the implementation of our ECCV'24 "SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation"

Dataset Preparation

Download each dataset from website:

You are responsible for checking if the dataset license is fit for the intended purpose.

Put all of them under data directory so you should get:

    SAM4MLLM/
    ├──dataset/
    |  ├──ADE20K/
    |  ├──PACO-LVIS/
    |  ├──Part-ImageNet/
    |  ├──RefCOCO/
    |  ├──GRES/

Installation

pytorch==2.1.2
transformers==4.42.4
peft==0.11.1
lightning==2.3.3
FlashAttention2(optional)
LLaVA-NeXT: Follow instruction in https://github.com/LLaVA-VL/LLaVA-NeXT
EfficientVIT-SAM: Follow instruction in https://github.com/mit-han-lab/efficientvit

Checkpoint

Download each checkpoint:

Put all of them under checkpoint directory so you should get:

    SAM4MLLM/
    ├──checkpoint/
    |  ├──llama3-llava-next-8b/
    |  ├──sam4mllm/
    |  ├──sam4mllm_plus/
    |  ├──xl1.pt/
    |  ├──effvit_xl1_decoder_coco_ft.pt

Data pre-process

Rearrange data

In data, Run each jupyter notebook to generate dataset for training.

Convert the data into dialouge format:

python to_chat_format.ipynb

Traning

python sam4mllm_train.py

Inference

Run simple_infer.ipynb

Licenses

This work is made available under the NVIDIA Source Code License-NC. Click here to view a copy of this license.