EfficientViT-GazeSAM
October 25, 2024 · View on GitHub
EfficientViT-GazeSAM is a gaze-prompted image segmentation model capable of running in real time with TensorRT on an NVIDIA RTX 4070. GazeSAM comprises a face detection component (ProxylessGaze), a gaze estimation component (L2CS-Net), an object detection component (YOLO-NAS), a depth estimation component (Depth-Anything), and an image segmentation component (EfficientViT).
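The per-frame flow of these components can be sketched as follows. This is a minimal illustration of the stage ordering only: all function names, signatures, return values, and the target-selection logic are assumptions for illustration, not the repo's actual API.

```python
import numpy as np

# Illustrative sketch of GazeSAM's per-frame data flow.
# Every function below is a stub standing in for the real model component.

def detect_face(frame):              # ProxylessGaze: face bounding box
    h, w = frame.shape[:2]
    return (w // 4, h // 4, w // 2, h // 2)          # (x, y, w, h)

def estimate_gaze(frame, face_box):  # L2CS-Net: gaze angles (e.g. yaw, pitch)
    return 0.1, -0.05

def detect_objects(frame):           # YOLO-NAS: candidate object boxes
    return [(50, 50, 100, 100)]

def estimate_depth(frame):           # Depth-Anything: per-pixel depth map
    return np.ones(frame.shape[:2], dtype=np.float32)

def pick_gaze_target(face_box, gaze, objects, depth):
    # The real pipeline uses the gaze ray and depth to choose the gazed-at
    # object; here we simply return the first candidate's center as a point.
    x, y, w, h = objects[0]
    return (x + w // 2, y + h // 2)

def segment(frame, point):           # EfficientViT-SAM with a point prompt
    mask = np.zeros(frame.shape[:2], dtype=bool)
    mask[point[1], point[0]] = True  # stub "mask": just the prompt pixel
    return mask

def run_frame(frame):
    face = detect_face(frame)
    gaze = estimate_gaze(frame, face)
    objects = detect_objects(frame)
    depth = estimate_depth(frame)
    point = pick_gaze_target(face, gaze, objects, depth)
    return segment(frame, point)

frame = np.zeros((480, 640, 3), dtype=np.uint8)
mask = run_frame(frame)
```

In the real demo each stub is replaced by a TensorRT engine, ONNX session, or PyTorch model depending on the selected runtime.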

Installation and Setup
Prior to following the runtime-specific instructions below, please make sure to follow the conda environment creation and package installation instructions for this repo, then install the extra packages:

```
pip install -r extra_requirements.txt
```
TensorRT (recommended mode for real-time performance on RTX 4070)
1. Ensure the following packages are installed:
   a. TensorRT
   b. torch2trt
   c. cuda-python (`python -m pip install cuda-python`)
2. Follow the engine creation instructions within the `models` directory here. You can choose between the default version (FP32 + FP16 engines) and the optimized version (FP32, FP16, and INT8 engines). The optimized version is approximately 5 ms faster per frame (on an RTX 4070), but both will run in real time.
ONNX
1. Install ONNX Runtime GPU: `python -m pip install onnxruntime-gpu`. Note: if you run into ONNX Runtime issues, you can try uninstalling both `onnxruntime` and `onnxruntime-gpu`, then reinstalling `onnxruntime-gpu`.
2. Download the ONNX model components here and save them to the `models/onnx` directory (make sure to create the `onnx` subfolder).
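Once `onnxruntime-gpu` is installed, sessions are typically created with an explicit provider list so inference prefers CUDA but can fall back to CPU. A minimal sketch (the model path is a placeholder, and the helper below is illustrative, not part of this repo):

```python
# Sketch: choosing an ONNX Runtime provider order, GPU-first with CPU fallback.
def provider_order(available):
    """Keep only the providers that are actually available, preferring CUDA."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]

# With onnxruntime-gpu installed, a session would then be created like:
#   import onnxruntime as ort
#   sess = ort.InferenceSession(
#       "models/onnx/model.onnx",   # placeholder path
#       providers=provider_order(ort.get_available_providers()),
#   )
```

If `CUDAExecutionProvider` is missing from `ort.get_available_providers()`, that usually indicates the `onnxruntime` / `onnxruntime-gpu` conflict described above.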
PyTorch
1. Set up the EfficientViT-SAM model [guide].
2. Set up the depth estimation model:
   a. Download the Depth-Anything repo and save it as a subfolder within this current directory.
   b. Run `cp models/create_pytorch/dpt_replacement.py Depth-Anything/depth_anything/dpt.py`. This prepends the torchhub local download path with "Depth-Anything".
   c. Download the Depth-Anything-Base checkpoint here and save it within the `models/pytorch` directory (make sure to create the `pytorch` subfolder).
3. Set up the gaze estimation model:
   a. Download the L2CS-Net pickle file here and save it within the `models/pytorch` directory (make sure to create the `pytorch` subfolder).
4. Download the ONNX model components here and save the files within the `models/onnx` directory (make sure to create the `onnx` subfolder).
Usage
GazeSAM can process webcam and video file inputs. To run with a webcam, run `python gazesam_demo.py --webcam`. To run with an input video, run `python gazesam_demo.py --video <path>`.
By default, the demo runs with TensorRT (use the `runtime` flag to change this, but note that only TensorRT mode will produce results in real time). Results are saved by default to the `output_videos` directory (modifiable via the `output-dir` flag).
If you generated engines using the optimized script, set `--precision-mode optimized`. The modes are described here. You can download the example video here.
Input video + default engines example: `python gazesam_demo.py --video input_videos/example.mp4 --precision-mode default`
Webcam + optimized engines example: `python gazesam_demo.py --webcam --precision-mode optimized`
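The command-line surface described above can be sketched with `argparse`. The defaults and choice lists below are assumptions inferred from the flags named in this section, not the repo's exact parser:

```python
import argparse

# Illustrative sketch of the demo's CLI; defaults/choices are assumptions.
def build_parser():
    p = argparse.ArgumentParser(description="GazeSAM demo (sketch)")
    src = p.add_mutually_exclusive_group(required=True)
    src.add_argument("--webcam", action="store_true",
                     help="use webcam input")
    src.add_argument("--video", metavar="PATH",
                     help="path to an input video file")
    p.add_argument("--runtime", default="tensorrt",
                   choices=["tensorrt", "onnx", "pytorch"],
                   help="inference backend (only TensorRT is real-time)")
    p.add_argument("--precision-mode", default="default",
                   choices=["default", "optimized"],
                   help="which set of TensorRT engines to use")
    p.add_argument("--output-dir", default="output_videos",
                   help="directory where result videos are written")
    return p

args = build_parser().parse_args(["--webcam", "--precision-mode", "optimized"])
```

Making `--webcam` and `--video` mutually exclusive mirrors the fact that the demo takes exactly one input source per run.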
Contributor
Reference
If EfficientViT is useful or relevant to your research, please recognize our contributions by citing our paper:
@inproceedings{cai2023efficientvit,
title={Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction},
author={Cai, Han and Li, Junyan and Hu, Muyan and Gan, Chuang and Han, Song},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={17302--17313},
year={2023}
}