Exposing the Limits of Video-Text Models through Contrast Sets

February 28, 2023

Repository for Exposing the Limits of Video-Text Models through Contrast Sets (NAACL Short 2022).

Updates

  • 2/28/2023: Results for 7k/9k Train splits in MSRVTT.
  • 6/24/2022: Released contrast sets for MSRVTT.

To Do:

  • Release contrast sets for LSMDC
  • Release code to generate contrast sets automatically

Data

We share the verb-phrase-based contrast sets for MSRVTT in this repository; the sets for LSMDC-ID are not yet released. Code to generate the contrast sets will be released soon.

| Dataset | Link | Contrast Set Type |
| --- | --- | --- |
| MSRVTT | Data | Verb |
| LSMDC | TBD | TBD |

MSRVTT

Download

The verb-based contrast sets generated by a language model (Verb LM MC) and by humans (Verb Human MC) can be found in msrvtt/. You can find the video and annotation data by following this link. The sets are generated for the full MSRVTT test set (test-full), which contains 2,990 videos.

You can also refer to the download script in CLIPBert to get the original MSRVTT split data. To get only the train and val data, run the download script:

bash msrvtt/download_msrvtt_train_val.sh

Example data:

| clip | ending0 | ending1 | ending2 | ending3 | ending4 | label |
| --- | --- | --- | --- | --- | --- | --- |
| video9770 | the boy is trying to fix the problem | the boy is trying to exacerbate the problem | two men on wave runner in ocean rescuing a surfer | asian man discusses technology in the younger generations | a group is dancing | 0 |
| video9771 | a woman is putting items into a miniature toy oven | a child is running around on a mat | a woman pushing a stroller | a child is rolling around on a mat | a game show host hosting a game | 1 |
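
The example rows above can be read into memory with a few lines of Python. This is a minimal sketch that assumes the contrast set is stored as a CSV file with exactly the columns shown (`clip`, `ending0` to `ending4`, `label`); the actual file format in msrvtt/ may differ. `MCItem` and `load_mc_items` are hypothetical names, not part of this repository.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass
class MCItem:
    """One multiple-choice item: a clip id, candidate captions, and the index of the correct one."""
    clip: str
    endings: List[str]
    label: int


def load_mc_items(handle: TextIO, num_endings: int = 5) -> List[MCItem]:
    """Parse rows with columns clip, ending0..ending{n-1}, label (assumed CSV layout)."""
    items = []
    for row in csv.DictReader(handle):
        endings = [row[f"ending{i}"] for i in range(num_endings)]
        label = int(row["label"])
        if not 0 <= label < num_endings:
            raise ValueError(f"label {label} out of range for clip {row['clip']}")
        items.append(MCItem(clip=row["clip"], endings=endings, label=label))
    return items


# The two example rows from this README, written as CSV purely for illustration.
EXAMPLE_CSV = """clip,ending0,ending1,ending2,ending3,ending4,label
video9770,the boy is trying to fix the problem,the boy is trying to exacerbate the problem,two men on wave runner in ocean rescuing a surfer,asian man discusses technology in the younger generations,a group is dancing,0
video9771,a woman is putting items into a miniature toy oven,a child is running around on a mat,a woman pushing a stroller,a child is rolling around on a mat,a game show host hosting a game,1
"""

items = load_mc_items(io.StringIO(EXAMPLE_CSV))
for item in items:
    # Print each clip id with the caption marked as correct by its label.
    print(item.clip, "->", item.endings[item.label])
```

In the second row, the correct caption (`ending1`) and the distractor in `ending3` differ only in the verb ("running" vs. "rolling"), which is the kind of hard negative a verb contrast set is meant to provide.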

Results

train-9k is the 9k training split and test-1k-A is the 1k test split proposed by JSFusion [Yu et al.].

train-7k is the 7k training split and test-full is the full set of test videos in the original MSRVTT. We use the 7k training videos from the CLIP4CLIP repo.

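| MSRVTT-train-7K | V to T (R@1), test-1k-A | T to V (R@1), test-1k-A | Random MC, test-full | Gender MC, test-full | Verb LM MC, test-full | Verb Human MC, test-full |
| --- | --- | --- | --- | --- | --- | --- |
| CLIP-Straight | 27.2 | 31.2 | 94.1 | 69.6 | 65.4 | 65.1 |
| MMT | 24.8 | 25.5 | 92.4 | 75.5 | 72.8 | 71.3 |
| MMT with CLIP features | 30.5 | 30.3 | 95.0 | 80.1 | 73.8 | 71.4 |
| CLIP4CLIP (meanP) | 43.0 | 42.1 | 96.2 | 76.7 | 76.2 | 73.7 |

| MSRVTT-train-9K | V to T (R@1), test-1k-A | T to V (R@1), test-1k-A | Random MC, test-1k-A | Gender MC, test-1k-A | Verb LM MC, test-1k-A | Verb Human MC, test-1k-A |
| --- | --- | --- | --- | --- | --- | --- |
| CLIP-Straight | 27.2 | 31.2 | 91.2 | 71.4 | 64.9 | 63.5 |
| MMT | 27.0 | 26.6 | 93.5 | 75.9 | 75.2 | 72.9 |
| MMT with CLIP features | 33.9 | 34.0 | 95.6 | 80.9 | 77.7 | 73.3 |
| CLIP4CLIP (meanP) | 43.1 | 43.1 | 96.3 | 79.1 | 76.8 | 75.4 |
| CLIP2Video | 43.5 | 45.6 | 97.0 | 76.0 | 76.8 | 74.3 |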
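
For reference, the two kinds of metric reported above can be sketched as follows. This is an illustrative reading, not the evaluation code behind the reported numbers. It assumes a model yields a similarity score for each (video, caption) pair, that the multiple-choice (MC) prediction is the highest-scoring ending, and that R@1 is the percentage of queries whose top-ranked candidate is the ground-truth match. For simplicity it also assumes one caption per video, so the correct match lies on the diagonal of a square similarity matrix. The function names and toy scores are hypothetical.

```python
from typing import List, Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def mc_accuracy(scores: List[Sequence[float]], labels: List[int]) -> float:
    """Percentage of clips whose highest-scoring ending is the labeled one."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = sum(argmax(s) == y for s, y in zip(scores, labels))
    return 100.0 * correct / len(labels)


def recall_at_1(sim: List[Sequence[float]]) -> float:
    """R@1 for a square matrix where sim[i][j] scores query i against
    candidate j, assuming candidate i is the correct match for query i."""
    if not sim:
        raise ValueError("similarity matrix must be non-empty")
    hits = sum(argmax(row) == i for i, row in enumerate(sim))
    return 100.0 * hits / len(sim)


def transpose(sim: List[Sequence[float]]) -> List[List[float]]:
    """Swap queries and candidates (e.g. text-to-video becomes video-to-text)."""
    return [list(col) for col in zip(*sim)]


# Toy MC scores: one row per clip, one score per ending (made-up values).
mc_scores = [
    [0.31, 0.27, 0.05, 0.02, 0.10],  # top score is ending 0
    [0.12, 0.30, 0.08, 0.33, 0.01],  # top score is ending 3, a wrong choice
]
mc_labels = [0, 1]
print("MC accuracy:", mc_accuracy(mc_scores, mc_labels))

# Toy similarity matrix: rows are text queries, columns are videos.
t2v = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
print("T to V R@1:", recall_at_1(t2v))
print("V to T R@1:", recall_at_1(transpose(t2v)))
```

Under this reading, the random-guess baseline for a five-way MC set is 20%, which gives a rough floor against which to read the MC columns.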

LSMDC

TBD

Citation

@inproceedings{park-etal-2022-exposing,
    title = "Exposing the Limits of Video-Text Models through Contrast Sets",
    author = "Park, Jae Sung  and
      Shen, Sheng  and
      Farhadi, Ali  and
      Darrell, Trevor  and
      Choi, Yejin  and
      Rohrbach, Anna",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.261",
    pages = "3574--3586",
}

Please email jspark96@cs.washington.edu for more information about the dataset.