Results
July 21, 2022 · View on GitHub
Preamble
Our code, by default, uses automatic mixed precision (AMP). Its effect on the output is negligible.
All speeds reported in the paper are recorded with AMP turned off (--benchmark).
Due to refactoring, there might be slight differences between the outputs produced by this code base with the precomputed results/results reported in the paper. This difference rarely leads to a change of the least significant figure (i.e., 0.1).
For most complete results, please see the paper (and the appendix)!
All available precomputed results can be found [here].
Pretrained models
We provide four pretrained models for download:
- XMem.pth (Default)
- XMem-s012.pth (Trained with BL30K)
- XMem-s2.pth (No pretraining on static images)
- XMem-no-sensory (No sensory memory)
The model without pretraining is for reference. The model without sensory memory might be more suitable for tasks without spatial continuity, like mask tracking in a multi-camera 3D reconstruction setting, though I would encourage you to try the base model as well.
Download them from [GitHub] or [Google Drive].
Long-Time Video
Long-Time Video (1X)
| Model | J&F | J | F |
|---|---|---|---|
| XMem | 89.8±0.2 | 88.0±0.2 | 91.6±0.2 |
Long-Time Video (3X)
| Model | J&F | J | F |
|---|---|---|---|
| XMem | 90.0±0.4 | 88.2±0.3 | 91.8±0.4 |
DAVIS
DAVIS 2016
| Model | J&F | J | F | FPS | FPS (AMP) |
|---|---|---|---|---|---|
| XMem | 91.5 | 90.4 | 92.7 | 29.6 | 40.3 |
| XMem-s012 | 92.0 | 90.7 | 93.2 | 29.6 | 40.3 |
| XMem-s2 | 90.8 | 89.6 | 91.9 | 29.6 | 40.3 |
DAVIS 2017 validation
| Model | J&F | J | F | FPS | FPS (AMP) |
|---|---|---|---|---|---|
| XMem | 86.2 | 82.9 | 89.5 | 22.6 | 33.9 |
| XMem-s012 | 87.7 | 84.0 | 91.4 | 22.6 | 33.9 |
| XMem-s2 | 84.5 | 81.4 | 87.6 | 22.6 | 33.9 |
| XMem-no-sensory | 85.1 | - | - | 23.1 | - |
DAVIS 2017 test-dev
| Model | J&F | J | F |
|---|---|---|---|
| XMem | 81.0 | 77.4 | 84.5 |
| XMem-s012 | 81.2 | 77.6 | 84.7 |
| XMem-s2 | 79.8 | 61.4 | 68.1 |
| XMem-s012 (600p) | 82.5 | 79.1 | 85.8 |
YouTubeVOS
We use all available frames in YouTubeVOS by default. See INFERENCE.md if you want to evaluate with sparse frames for some reason.
[Precomputed Results (sparse)]
YouTubeVOS 2018 validation
| Model | G | J-Seen | F-Seen | J-Unseen | F-Unseen | FPS | FPS (AMP) |
|---|---|---|---|---|---|---|---|
| XMem | 85.7 | 84.6 | 89.3 | 80.2 | 88.7 | 22.6 | 31.7 |
| XMem-s012 | 86.1 | 85.1 | 89.8 | 80.3 | 89.2 | 22.6 | 31.7 |
| XMem-s2 | 84.3 | 83.9 | 88.8 | 77.7 | 86.7 | 22.6 | 31.7 |
| XMem-no-sensory | 84.4 | - | - | - | - | 23.1 | - |
YouTubeVOS 2019 validation
| Model | G | J-Seen | F-Seen | J-Unseen | F-Unseen |
|---|---|---|---|---|---|
| XMem | 85.5 | 84.3 | 88.6 | 80.3 | 88.6 |
| XMem-s012 | 85.8 | 84.8 | 89.2 | 80.3 | 88.8 |
| XMem-s2 | 84.2 | 83.8 | 88.3 | 78.1 | 86.7 |
Multi-scale evaluation
Please see the appendix for quantitative results.