Results

July 21, 2022

Preamble

Our code uses automatic mixed precision (AMP) by default. Its effect on the output is negligible. All speeds reported in the paper were recorded with AMP turned off (--benchmark). Due to refactoring, the outputs produced by this code base may differ slightly from the precomputed results or the results reported in the paper. This difference rarely changes the least significant figure (i.e., 0.1).
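When comparing your own reproduced numbers against the tables on this page, a small tolerance check can help decide whether a gap is just refactoring noise. The sketch below is illustrative only: the helper name `matches_reported` and the choice of 0.1 as the tolerance are assumptions based on the remark above, not part of this code base, and it assumes both scores are already rounded to one decimal place.

```python
def matches_reported(reproduced: float, reported: float, tol: float = 0.1) -> bool:
    """Return True if two one-decimal scores agree to within `tol`.

    `tol` defaults to 0.1, i.e. one unit of the least significant
    figure used in the tables (an assumed threshold, not an official one).
    """
    # Round the gap to one decimal so floating-point noise
    # (e.g. 0.09999999999999432) does not affect the comparison.
    gap = round(abs(reproduced - reported), 1)
    return gap <= tol


if __name__ == "__main__":
    # Hypothetical reproduced scores compared against a reported 86.2.
    print(matches_reported(86.3, 86.2))  # gap of 0.1
    print(matches_reported(86.5, 86.2))  # gap of 0.3
```

A gap larger than the tolerance more likely points at a different setup (resolution, frame sampling, checkpoint) than at AMP or refactoring.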

For the most complete results, please see the paper (and its appendix)!

All available precomputed results can be found [here].

Pretrained models

We provide four pretrained models for download:

  1. XMem.pth (Default)
  2. XMem-s012.pth (Trained with BL30K)
  3. XMem-s2.pth (No pretraining on static images)
  4. XMem-no-sensory (No sensory memory)

The model without pretraining is for reference. The model without sensory memory might be more suitable for tasks without spatial continuity, like mask tracking in a multi-camera 3D reconstruction setting, though I would encourage you to try the base model as well.

Download them from [GitHub] or [Google Drive].

Long-Time Video

[Precomputed Results]

Long-Time Video (1X)

| Model | J&F | J | F |
| --- | --- | --- | --- |
| XMem | 89.8±0.2 | 88.0±0.2 | 91.6±0.2 |

Long-Time Video (3X)

| Model | J&F | J | F |
| --- | --- | --- | --- |
| XMem | 90.0±0.4 | 88.2±0.3 | 91.8±0.4 |

DAVIS

[Precomputed Results]

DAVIS 2016

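| Model | J&F | J | F | FPS | FPS (AMP) |
| --- | --- | --- | --- | --- | --- |
| XMem | 91.5 | 90.4 | 92.7 | 29.6 | 40.3 |
| XMem-s012 | 92.0 | 90.7 | 93.2 | 29.6 | 40.3 |
| XMem-s2 | 90.8 | 89.6 | 91.9 | 29.6 | 40.3 |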
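To put the two speed columns above in perspective, the relative speedup from AMP and the per-frame latency can be derived directly from the FPS values. This is a small arithmetic sketch; the function names are made up for illustration, and the only inputs are the DAVIS 2016 numbers from the table.

```python
def amp_speedup(fps_fp32: float, fps_amp: float) -> float:
    """Ratio of AMP throughput to full-precision throughput."""
    return fps_amp / fps_fp32


def ms_per_frame(fps: float) -> float:
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


if __name__ == "__main__":
    fps_fp32, fps_amp = 29.6, 40.3  # DAVIS 2016: FPS and FPS (AMP)
    print(f"speedup: {amp_speedup(fps_fp32, fps_amp):.2f}x")
    print(f"latency without AMP: {ms_per_frame(fps_fp32):.1f} ms/frame")
    print(f"latency with AMP:    {ms_per_frame(fps_amp):.1f} ms/frame")
```

On these numbers AMP gives roughly a 1.36x throughput gain; actual speeds depend on hardware and will differ on other GPUs.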

DAVIS 2017 validation

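| Model | J&F | J | F | FPS | FPS (AMP) |
| --- | --- | --- | --- | --- | --- |
| XMem | 86.2 | 82.9 | 89.5 | 22.6 | 33.9 |
| XMem-s012 | 87.7 | 84.0 | 91.4 | 22.6 | 33.9 |
| XMem-s2 | 84.5 | 81.4 | 87.6 | 22.6 | 33.9 |
| XMem-no-sensory | 85.1 | - | - | 23.1 | - |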
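In the DAVIS tables, J&F is the average of the region similarity J and the contour accuracy F. The sketch below recomputes it from the rounded J and F values of the DAVIS 2017 validation table; the function name is illustrative. Because the published J&F is presumably computed from unrounded scores, a recomputation from one-decimal inputs can be off by up to about 0.05 before rounding.

```python
def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2.0


if __name__ == "__main__":
    # (J, F, reported J&F) taken from the DAVIS 2017 validation table.
    rows = {
        "XMem": (82.9, 89.5, 86.2),
        "XMem-s012": (84.0, 91.4, 87.7),
        "XMem-s2": (81.4, 87.6, 84.5),
    }
    for name, (j, f, reported) in rows.items():
        recomputed = j_and_f(j, f)
        print(f"{name}: recomputed {recomputed:.2f}, reported {reported}")
```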

DAVIS 2017 test-dev

| Model | J&F | J | F |
| --- | --- | --- | --- |
| XMem | 81.0 | 77.4 | 84.5 |
| XMem-s012 | 81.2 | 77.6 | 84.7 |
| XMem-s2 | 79.8 | 61.4 | 68.1 |
| XMem-s012 (600p) | 82.5 | 79.1 | 85.8 |

YouTubeVOS

We use all available frames in YouTubeVOS by default. See INFERENCE.md if you want to evaluate with sparse frames for some reason.

[Precomputed Results]

[Precomputed Results (sparse)]

YouTubeVOS 2018 validation

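| Model | G | J-Seen | F-Seen | J-Unseen | F-Unseen | FPS | FPS (AMP) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| XMem | 85.7 | 84.6 | 89.3 | 80.2 | 88.7 | 22.6 | 31.7 |
| XMem-s012 | 86.1 | 85.1 | 89.8 | 80.3 | 89.2 | 22.6 | 31.7 |
| XMem-s2 | 84.3 | 83.9 | 88.8 | 77.7 | 86.7 | 22.6 | 31.7 |
| XMem-no-sensory | 84.4 | - | - | - | - | 23.1 | - |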

YouTubeVOS 2019 validation

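| Model | G | J-Seen | F-Seen | J-Unseen | F-Unseen |
| --- | --- | --- | --- | --- | --- |
| XMem | 85.5 | 84.3 | 88.6 | 80.3 | 88.6 |
| XMem-s012 | 85.8 | 84.8 | 89.2 | 80.3 | 88.8 |
| XMem-s2 | 84.2 | 83.8 | 88.3 | 78.1 | 86.7 |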
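For YouTubeVOS, the overall score G is the average of the four per-category metrics (J and F on seen and unseen classes). The following sketch recomputes G from the rounded 2019 validation values; the function name is illustrative, and small deviations from the reported G are expected because the inputs here are already rounded to one decimal.

```python
def overall_g(j_seen: float, f_seen: float, j_unseen: float, f_unseen: float) -> float:
    """Average of the four YouTubeVOS metrics."""
    return (j_seen + f_seen + j_unseen + f_unseen) / 4.0


if __name__ == "__main__":
    # (J-Seen, F-Seen, J-Unseen, F-Unseen, reported G), YouTubeVOS 2019 validation.
    rows = {
        "XMem": (84.3, 88.6, 80.3, 88.6, 85.5),
        "XMem-s012": (84.8, 89.2, 80.3, 88.8, 85.8),
        "XMem-s2": (83.8, 88.3, 78.1, 86.7, 84.2),
    }
    for name, (js, fs, ju, fu, reported) in rows.items():
        print(f"{name}: recomputed {overall_g(js, fs, ju, fu):.3f}, reported {reported}")
```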

Multi-scale evaluation

Please see the appendix for quantitative results.

[DAVIS-MS Precomputed Results]

[YouTubeVOS-MS Precomputed Results]