EfficientViT-GazeSAM
October 25, 2024 · View on GitHub
EfficientViT-GazeSAM is a gaze-prompted image segmentation model capable of running in real time with TensorRT on an NVIDIA RTX 4070. GazeSAM comprises a face detection component (ProxylessGaze), a gaze estimation component (L2CS-Net), an object detection component (YOLO-NAS), a depth estimation component (Depth-Anything), and an image segmentation component (EfficientViT).
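The per-frame flow of these components can be sketched as follows. This is a minimal illustration of the stage ordering only: all function names, signatures, return values, and the target-selection logic are assumptions for illustration, not the repo's actual API.

```python
import numpy as np

# Illustrative sketch of GazeSAM's per-frame data flow.
# Every function below is a stub standing in for the real model component.

def detect_face(frame):              # ProxylessGaze: face bounding box
    h, w = frame.shape[:2]
    return (w // 4, h // 4, w // 2, h // 2)          # (x, y, w, h)

def estimate_gaze(frame, face_box):  # L2CS-Net: gaze angles (e.g. yaw, pitch)
    return 0.1, -0.05

def detect_objects(frame):           # YOLO-NAS: candidate object boxes
    return [(50, 50, 100, 100)]

def estimate_depth(frame):           # Depth-Anything: per-pixel depth map
    return np.ones(frame.shape[:2], dtype=np.float32)

def pick_gaze_target(face_box, gaze, objects, depth):
    # The real pipeline uses the gaze ray and depth to choose the gazed-at
    # object; here we simply return the first candidate's center as a point.
    x, y, w, h = objects[0]
    return (x + w // 2, y + h // 2)

def segment(frame, point):           # EfficientViT-SAM with a point prompt
    mask = np.zeros(frame.shape[:2], dtype=bool)
    mask[point[1], point[0]] = True  # stub "mask": just the prompt pixel
    return mask

def run_frame(frame):
    face = detect_face(frame)
    gaze = estimate_gaze(frame, face)
    objects = detect_objects(frame)
    depth = estimate_depth(frame)
    point = pick_gaze_target(face, gaze, objects, depth)
    return segment(frame, point)

frame = np.zeros((480, 640, 3), dtype=np.uint8)
mask = run_frame(frame)
```

In the real demo each stub is replaced by a TensorRT engine, ONNX session, or PyTorch model depending on the selected runtime.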

Installation and Setup
Prior to following the runtime-specific instructions below, please make sure to follow the conda environment creation and package installation instructions for this repo, then install the extra packages:

```
pip install -r extra_requirements.txt
```
TensorRT (recommended mode for real-time performance on RTX 4070)
1. Ensure the following packages are installed:
   a. TensorRT
   b. torch2trt
   c. cuda-python (`python -m pip install cuda-python`)
2. Follow the engine creation instructions within the `models` directory here. You can choose between the default version (FP32 + FP16 engines) and the optimized version (FP32, FP16, and INT8 engines). The optimized version is approximately 5 ms faster per frame (on an RTX 4070), but both will run in real time.
ONNX
1. Install ONNX Runtime GPU: `python -m pip install onnxruntime-gpu`. Note: if you run into ONNX Runtime issues, you can try uninstalling both `onnxruntime` and `onnxruntime-gpu`, then reinstalling `onnxruntime-gpu`.
2. Download the ONNX model components here and save them to the `models/onnx` directory (make sure to create the `onnx` subfolder).
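Once `onnxruntime-gpu` is installed, sessions are typically created with an explicit provider list so inference prefers CUDA but can fall back to CPU. A minimal sketch (the model path is a placeholder, and the helper below is illustrative, not part of this repo):

```python
# Sketch: choosing an ONNX Runtime provider order, GPU-first with CPU fallback.
def provider_order(available):
    """Keep only the providers that are actually available, preferring CUDA."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

# With onnxruntime-gpu installed, a session would then be created like:
#   import onnxruntime as ort
#   sess = ort.InferenceSession(
#       "models/onnx/model.onnx",   # placeholder path
#       providers=provider_order(ort.get_available_providers()),
#   )
```

If `CUDAExecutionProvider` is missing from `ort.get_available_providers()`, that usually indicates the `onnxruntime` / `onnxruntime-gpu` conflict described above.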
PyTorch
1. Set up the EfficientViT-SAM model [guide].
2. Set up the depth estimation model:
   a. Download the Depth-Anything repo and save it as a subfolder within this current directory.
   b. Run `cp models/create_pytorch/dpt_replacement.py Depth-Anything/depth_anything/dpt.py`. This prepends the torchhub local download path with "Depth-Anything".
   c. Download the Depth-Anything-Base checkpoint here and save it within the `models/pytorch` directory (make sure to create the `pytorch` subfolder).
3. Set up the gaze estimation model:
   a. Download the L2CS-Net pickle file here and save it within the `models/pytorch` directory (make sure to create the `pytorch` subfolder).
4. Download the ONNX model components here and save the files within the `models/onnx` directory (make sure to create the `onnx` subfolder).
Usage
GazeSAM can process webcam and video file inputs. To run with a webcam, run `python gazesam_demo.py --webcam`. To run with an input video, run `python gazesam_demo.py --video <path>`.
By default, the demo runs with TensorRT (use the `runtime` flag to change this, but note that only TensorRT mode will produce results in real time). Results are saved by default to the `output_videos` directory (modifiable via the `output-dir` flag).
If you generated engines using the optimized script, set `--precision-mode optimized`. The modes are described here. You can download the example video here.
Input video + default engines example: `python gazesam_demo.py --video input_videos/example.mp4 --precision-mode default`
Webcam + optimized engines example: `python gazesam_demo.py --webcam --precision-mode optimized`
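The command-line surface described above can be sketched with `argparse`. The defaults and choice lists below are assumptions inferred from the flags named in this section, not the repo's exact parser:

```python
import argparse

# Illustrative sketch of the demo's CLI; defaults/choices are assumptions.
def build_parser():
    p = argparse.ArgumentParser(description="GazeSAM demo (sketch)")
    src = p.add_mutually_exclusive_group(required=True)
    src.add_argument("--webcam", action="store_true",
                     help="use webcam input")
    src.add_argument("--video", metavar="PATH",
                     help="path to an input video file")
    p.add_argument("--runtime", default="tensorrt",
                   choices=["tensorrt", "onnx", "pytorch"],
                   help="inference backend (only TensorRT is real-time)")
    p.add_argument("--precision-mode", default="default",
                   choices=["default", "optimized"],
                   help="which set of TensorRT engines to use")
    p.add_argument("--output-dir", default="output_videos",
                   help="directory where result videos are written")
    return p

args = build_parser().parse_args(["--webcam", "--precision-mode", "optimized"])
```

Making `--webcam` and `--video` mutually exclusive mirrors the fact that the demo takes exactly one input source per run.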
Contributor
Reference
If EfficientViT is useful or relevant to your research, please recognize our contributions by citing our paper:
@inproceedings{cai2023efficientvit,
title={Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction},
author={Cai, Han and Li, Junyan and Hu, Muyan and Gan, Chuang and Han, Song},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={17302--17313},
year={2023}
}