VideoMind on Public Benchmarks

June 25, 2025

🔮 Table of Contents

Click on the datasets to jump to the tables.

CG-Bench (mini):

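| Method | Size | long-acc. | mIoU | rec.@IoU | acc.@IoU |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | – | 45.2 | 5.62 | 8.30 | 4.38 |
| GPT-4o-mini | – | 33.4 | 3.75 | 5.18 | 2.21 |
| Gemini-1.5-Pro | – | 37.2 | 3.95 | 5.81 | 2.53 |
| Gemini-1.5-Flash | – | 32.3 | 3.67 | 5.44 | 2.45 |
| Claude-3.5-Sonnet | – | 40.5 | 3.99 | 5.67 | 2.79 |
| Video-LLaVA | 7B | 16.2 | 1.13 | 1.96 | 0.59 |
| VideoLLaMA | 7B | 18.4 | 1.21 | 1.87 | 0.84 |
| VideoChat2 | 7B | 19.3 | 1.28 | 1.98 | 0.94 |
| Qwen-VL-Chat | 7B | 21.6 | 0.89 | 1.19 | 0.42 |
| ST-LLM | 7B | 23.8 | 2.23 | 2.86 | 1.13 |
| ShareGPT4Video | 16B | 26.7 | 1.85 | 2.65 | 1.01 |
| Chat-UniVi-v1.5 | 13B | 25.9 | 2.07 | 2.53 | 1.21 |
| VILA | 8B | 28.7 | 1.56 | 2.89 | 1.35 |
| MiniCPM-v2.6 | 8B | 30.1 | 2.35 | 2.61 | 1.04 |
| LongVA | 7B | 28.7 | 2.94 | 3.86 | 1.78 |
| LLaVA-OV | 7B | 31.1 | 1.63 | 1.78 | 1.08 |
| Video-CCAM | 14B | 29.7 | 2.63 | 2.48 | 1.83 |
| Kangaroo | 8B | 30.2 | 2.56 | 2.81 | 1.94 |
| VITA | 8×7B | 33.3 | 3.06 | 3.53 | 2.06 |
| Qwen2-VL | 72B | 41.3 | 3.58 | 5.32 | 3.31 |
| InternVL2 | 78B | 42.2 | 3.91 | 5.05 | 2.64 |
| VideoMind (Ours) | 2B | 31.0 | 5.94 | 8.50 | 4.02 |
| VideoMind (Ours) | 7B | 38.4 | 7.10 | 9.93 | 4.67 |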

ReXTime (val):

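| Method | Size | FT | R@0.3 | R@0.5 | mIoU | Acc | Acc@IoU |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VTimeLLM | 7B | × | 28.84 | 17.41 | 20.14 | 36.16 | – |
| TimeChat | 7B | × | 14.42 | 7.61 | 11.65 | 40.04 | – |
| LITA | 13B | × | 29.49 | 16.29 | 21.49 | 34.44 | – |
| VTimeLLM | 7B | ✓ | 43.69 | 26.13 | 29.92 | 57.58 | 17.13 |
| TimeChat | 7B | ✓ | 40.13 | 21.42 | 26.29 | 49.46 | 10.92 |
| VideoMind (Ours) | 2B | × | 34.31 | 22.69 | 24.83 | 69.06 | 17.26 |
| VideoMind (Ours) | 7B | × | 38.22 | 25.52 | 27.61 | 74.59 | 20.20 |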

Note: Acc@IoU means both QA (Acc) and Grounding (IoU >= 0.5) are correct
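
As an illustration of the joint metric described in the note, here is a minimal Python sketch. It is not the official ReXTime evaluation code: the function names, the sample dictionary keys (`pred_answer`, `gt_answer`, `pred_span`, `gt_span`), and the assumption of a single ground-truth span per question are all hypothetical choices made for this example.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(samples, threshold=0.5):
    """Fraction of samples whose answer is correct AND whose IoU >= threshold."""
    hits = sum(
        1
        for s in samples
        if s["pred_answer"] == s["gt_answer"]
        and temporal_iou(s["pred_span"], s["gt_span"]) >= threshold
    )
    return hits / len(samples) if samples else 0.0


# Hypothetical toy data (not taken from any benchmark).
samples = [
    # Correct answer, perfect grounding: counts.
    {"pred_answer": "A", "gt_answer": "A", "pred_span": (0.0, 10.0), "gt_span": (0.0, 10.0)},
    # Correct answer, IoU = 5 / 15: does not count.
    {"pred_answer": "A", "gt_answer": "A", "pred_span": (0.0, 10.0), "gt_span": (5.0, 15.0)},
    # Wrong answer, perfect grounding: does not count.
    {"pred_answer": "B", "gt_answer": "A", "pred_span": (0.0, 10.0), "gt_span": (0.0, 10.0)},
    # Correct answer, IoU = 6 / 10: counts.
    {"pred_answer": "C", "gt_answer": "C", "pred_span": (2.0, 8.0), "gt_span": (0.0, 10.0)},
]
print(acc_at_iou(samples))  # 0.5
```

Because a sample must pass both checks, Acc@IoU can never exceed either Acc or R@0.5 on the same split.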

NExT-GQA (test):

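| Model | Size | R@IoU-0.3 | R@IoU-0.5 | mIoU | R@IoP-0.3 | R@IoP-0.5 | mIoP | Acc@GQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FrozenBiLM NG+ | 890M | 13.5 | 6.1 | 9.6 | 28.5 | 23.7 | 24.2 | 17.5 |
| VIOLETv2 | – | 4.3 | 1.3 | 3.1 | 25.1 | 23.3 | 23.6 | 12.8 |
| SeViLA | 4B | 29.2 | 13.8 | 21.7 | 34.7 | 22.9 | 29.5 | 16.6 |
| LangRepo | 8×7B | – | 12.2 | 18.5 | – | 28.7 | 31.3 | 17.1 |
| VideoStreaming | 8.3B | – | 13.3 | 19.3 | 31.0 | 32.2 | 17.8 | – |
| LLoVi | 1.8T | – | 15.3 | 20.0 | – | 36.9 | 37.3 | 24.3 |
| HawkEye | 7B | 37.0 | 19.5 | 25.7 | – | – | – | – |
| VideoChat-TPO | 7B | 41.2 | 23.4 | 27.7 | 47.5 | 32.8 | 35.6 | 25.5 |
| VideoMind (Ours) | 2B | 45.2 | 23.2 | 28.6 | 51.3 | 32.6 | 36.4 | 25.2 |
| VideoMind (Ours) | 7B | 50.2 | 25.8 | 31.4 | 56.0 | 35.3 | 39.0 | 28.2 |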

Note: Acc@GQA means both QA (Acc) and Grounding (IoP >= 0.5) are correct
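
To make the difference between IoP and IoU concrete, the sketch below computes intersection over prediction (IoP) and the joint Acc@GQA score. It is an illustrative approximation, not the official NExT-GQA evaluation script: the function names and dictionary keys are hypothetical, and it assumes one ground-truth span per question, which the real benchmark may not guarantee.

```python
def temporal_iop(pred, gt):
    """Intersection over prediction: overlap divided by the predicted span length."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    pred_len = pred[1] - pred[0]
    return inter / pred_len if pred_len > 0 else 0.0


def acc_at_gqa(samples, threshold=0.5):
    """Fraction of samples whose answer is correct AND whose IoP >= threshold."""
    hits = sum(
        1
        for s in samples
        if s["pred_answer"] == s["gt_answer"]
        and temporal_iop(s["pred_span"], s["gt_span"]) >= threshold
    )
    return hits / len(samples) if samples else 0.0


# Hypothetical toy data (not taken from any benchmark).
samples = [
    # Short prediction fully inside the ground truth: IoP = 1.0, counts.
    {"pred_answer": "A", "gt_answer": "A", "pred_span": (2.0, 8.0), "gt_span": (0.0, 10.0)},
    # Long prediction covering a short ground truth: IoP = 5 / 20, does not count.
    {"pred_answer": "A", "gt_answer": "A", "pred_span": (0.0, 20.0), "gt_span": (0.0, 5.0)},
]
print(acc_at_gqa(samples))  # 0.5
```

IoP normalizes by the predicted span only, so a short prediction that lies entirely inside the ground truth scores 1.0 regardless of how much of the ground truth it covers. IoU penalizes that same prediction for the uncovered part.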

DeVE-QA (val):

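| Model | Size | R@IoU-0.5 | mIoU | R@IoP-0.5 | mIoP | Acc@QA | Acc@GQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VideoMind (Ours) | 2B | 21.7 | 26.3 | 50.7 | 49.9 | 76.5 | 41.2 |
| VideoMind (Ours) | 7B | 26.5 | 30.1 | 52.3 | 51.9 | 81.0 | 44.2 |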

Note: Acc@GQA means both QA (Acc) and Grounding (IoP >= 0.5) are correct

Charades-STA (test):

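| Method | Size | FT | R@0.3 | R@0.5 | R@0.7 | mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| Moment-DETR | – | ✓ | 65.8 | 52.1 | 30.6 | 45.5 |
| UMT | – | ✓ | – | 48.3 | 29.3 | – |
| UniVTG | – | ✓ | 70.8 | 58.1 | 35.6 | 50.1 |
| R2-Tuning | – | ✓ | 70.9 | 59.8 | 37.0 | 50.9 |
| VTimeLLM | 13B | × | 55.3 | 34.3 | 14.7 | 34.6 |
| TimeChat | 7B | × | 51.5 | 32.2 | 13.4 | – |
| Momentor | 7B | × | 42.6 | 26.6 | 11.6 | 28.5 |
| HawkEye | 7B | × | 50.6 | 31.4 | 14.5 | 33.7 |
| ChatVTG | 7B | × | 52.7 | 33.0 | 15.9 | 34.9 |
| VideoChat-TPO | 7B | × | 58.3 | 40.2 | 18.4 | 38.1 |
| E.T. Chat | 4B | × | 65.7 | 45.9 | 20.0 | 42.3 |
| VideoMind (Ours) | 2B | × | 67.6 | 51.1 | 26.0 | 45.2 |
| VideoMind (Ours) | 7B | × | 73.5 | 59.1 | 31.2 | 50.2 |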

ActivityNet-Captions (val_2):

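| Method | Size | FT | R@0.3 | R@0.5 | R@0.7 | mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| 2D-TAN | – | ✓ | 60.4 | 43.4 | 25.0 | 42.5 |
| MMN | – | ✓ | 64.5 | 48.2 | 29.4 | 46.6 |
| VDI | – | ✓ | 48.1 | 28.8 | – | – |
| VideoChat | 7B | × | 8.8 | 3.7 | 1.5 | 7.2 |
| Video-LLaMA | 7B | × | 6.9 | 2.1 | 0.8 | 6.5 |
| Video-ChatGPT | 7B | × | 26.4 | 13.6 | 6.1 | 18.9 |
| Valley | 7B | × | 30.6 | 13.7 | 8.1 | 21.9 |
| ChatVTG | 7B | × | 40.7 | 22.5 | 9.4 | 27.2 |
| Momentor | 7B | × | 42.9 | 23.0 | 12.4 | 29.3 |
| E.T. Chat | 4B | × | 24.1 | 12.8 | 6.1 | 18.9 |
| VideoMind (Ours) | 2B | × | 44.0 | 26.5 | 12.6 | 30.1 |
| VideoMind (Ours) | 7B | × | 48.4 | 30.3 | 15.7 | 33.3 |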

QVHighlights (test):

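| Method | Size | FT | R1@0.5 | R1@0.7 | mAP@0.5 | mAP@0.75 | mAP Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| XML | – | ✓ | 41.83 | 30.35 | 44.63 | 31.73 | 32.14 |
| XML+ | – | ✓ | 46.69 | 33.46 | 47.89 | 34.67 | 34.90 |
| Moment-DETR | – | ✓ | 59.78 | 40.33 | 60.51 | 35.36 | 36.14 |
| UMT | – | ✓ | 60.83 | 43.26 | 57.33 | 39.12 | 38.08 |
| MomentDiff | – | ✓ | 58.21 | 41.48 | 54.57 | 37.21 | 36.84 |
| QD-DETR | – | ✓ | 62.40 | 44.98 | 62.52 | 39.88 | 39.86 |
| UniVTG | – | ✓ | 65.43 | 50.06 | 64.06 | 45.02 | 43.63 |
| R2-Tuning | – | ✓ | 68.03 | 49.35 | 69.04 | 47.56 | 46.17 |
| VideoMind (Ours) | 2B | ✓ | 75.42 | 59.35 | 74.11 | 55.15 | 51.60 |
| VideoMind (Ours) | 7B | ✓ | 78.53 | 61.09 | 76.07 | 58.17 | 54.19 |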

TACoS (test):

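| Method | Size | FT | R@0.3 | R@0.5 | R@0.7 | mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| 2D-TAN | – | ✓ | 40.0 | 28.0 | 12.9 | 27.2 |
| VSLNet | – | ✓ | 35.5 | 23.5 | 13.1 | 25.0 |
| Moment-DETR | – | ✓ | 38.0 | 24.7 | 12.0 | 25.5 |
| UniVTG | – | ✓ | 51.4 | 35.0 | 17.4 | 33.6 |
| R2-Tuning | – | ✓ | 49.7 | 38.7 | 25.1 | 35.9 |
| VideoMind (Ours) | 2B | ✓ | 38.6 | 26.9 | 15.5 | 27.4 |
| VideoMind (Ours) | 7B | ✓ | 49.5 | 36.2 | 21.4 | 34.4 |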

Ego4D-NLQ (val):

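| Method | Size | FT | R@0.3 | R@0.5 | R@0.7 | mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| 2D-TAN | – | ✓ | 4.3 | 1.8 | 0.6 | 3.4 |
| VSLNet | – | ✓ | 4.5 | 2.4 | 1.0 | 3.5 |
| Moment-DETR | – | ✓ | 4.3 | 1.8 | 0.7 | 3.5 |
| UniVTG | – | ✓ | 7.3 | 4.0 | 1.3 | 4.9 |
| R2-Tuning | – | ✓ | 7.2 | 4.5 | 2.1 | 4.9 |
| UniVTG | – | × | 6.5 | 3.5 | 1.2 | 4.6 |
| VideoMind (Ours) | 2B | × | 5.9 | 2.9 | 1.2 | 4.7 |
| VideoMind (Ours) | 7B | × | 7.2 | 3.7 | 1.7 | 5.4 |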

ActivityNet-RTL (val):

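| Method | Size | FT | P@0.5 | mIoU |
| --- | --- | --- | --- | --- |
| LITA | 7B | ✓ | 21.2 | 24.1 |
| LITA | 13B | ✓ | 25.9 | 28.6 |
| VideoMind (Ours) | 2B | × | 20.1 | 22.7 |
| VideoMind (Ours) | 7B | × | 28.0 | 31.3 |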

Video-MME (w/o subs):

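| Model | Size | All | Long |
| --- | --- | --- | --- |
| Gemini-1.5-Pro | – | 75.0 | 67.4 |
| GPT-4o | – | 71.9 | 65.3 |
| Video-LLaVA | 7B | 41.1 | 37.8 |
| TimeChat | 7B | 34.3 | 32.1 |
| MovieChat | 7B | 38.2 | 33.4 |
| PLLaVA | 34B | 40.0 | 34.7 |
| VideoChat-TPO | 7B | 48.8 | 41.0 |
| LongVA | 7B | 52.6 | 46.2 |
| VideoMind (Ours) | 2B | 53.6 | 45.4 |
| VideoMind (Ours) | 7B | 58.2 | 49.2 |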

MLVU:

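| Model | Size | M-Avg |
| --- | --- | --- |
| GPT-4o | – | 54.5 |
| Video-LLaVA | 7B | 29.3 |
| TimeChat | 7B | 30.9 |
| MovieChat | 7B | 25.8 |
| PLLaVA | 34B | 53.6 |
| VideoChat-TPO | 7B | 54.7 |
| LongVA | 7B | 56.3 |
| VideoMind (Ours) | 2B | 58.7 |
| VideoMind (Ours) | 7B | 64.4 |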

LVBench:

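| Model | Size | Overall |
| --- | --- | --- |
| Gemini-1.5-Pro | – | 33.1 |
| GPT-4o | – | 30.8 |
| Video-LLaVA | 7B | – |
| TimeChat | 7B | 22.3 |
| MovieChat | 7B | 22.5 |
| PLLaVA | 34B | 26.1 |
| VideoChat-TPO | 7B | – |
| LongVA | 7B | – |
| VideoMind (Ours) | 2B | 35.4 |
| VideoMind (Ours) | 7B | 40.8 |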

MVBench:

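| Model | Size | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4V | – | 55.5 | 63.5 | 72.0 | 46.5 | 73.5 | 18.5 | 59.0 | 29.5 | 12.0 | 40.5 | 83.5 | 39.0 | 12.0 | 22.5 | 45.0 | 47.5 | 52.0 | 31.0 | 59.0 | 11.0 | 43.5 |
| Video-ChatGPT | 7B | 23.5 | 26.0 | 62.0 | 22.5 | 26.5 | 54.0 | 28.0 | 40.0 | 23.0 | 20.0 | 31.0 | 30.5 | 25.5 | 39.5 | 48.5 | 29.0 | 33.0 | 29.5 | 26.0 | 35.5 | 32.7 |
| Video-LLaMA | 7B | 27.5 | 25.5 | 51.0 | 29.0 | 39.0 | 48.0 | 40.5 | 38.0 | 22.5 | 22.5 | 43.0 | 34.0 | 22.5 | 32.5 | 45.5 | 32.5 | 40.0 | 30.0 | 21.0 | 37.0 | 34.1 |
| VideoChat | 7B | 33.5 | 26.5 | 56.0 | 33.5 | 40.5 | 53.0 | 40.5 | 30.0 | 25.5 | 27.0 | 48.5 | 35.0 | 20.5 | 42.5 | 46.0 | 26.5 | 41.0 | 23.5 | 23.5 | 36.0 | 35.5 |
| Video-LLaVA | 7B | 46.0 | 42.5 | 56.5 | 39.0 | 53.5 | 53.0 | 48.0 | 41.0 | 29.0 | 31.5 | 82.5 | 45.0 | 26.0 | 53.0 | 41.5 | 33.5 | 41.5 | 27.5 | 38.5 | 31.5 | 43.0 |
| TimeChat | 7B | 40.5 | 36.0 | 61.0 | 32.5 | 53.0 | 53.5 | 41.5 | 29.0 | 19.5 | 26.5 | 66.5 | 34.0 | 20.0 | 43.5 | 42.0 | 36.5 | 36.0 | 29.0 | 35.0 | 35.0 | 38.5 |
| PLLaVA | 7B | 58.0 | 49.0 | 55.5 | 41.0 | 61.0 | 56.0 | 61.0 | 36.0 | 23.5 | 26.0 | 82.0 | 39.5 | 42.0 | 52.0 | 45.0 | 42.0 | 53.5 | 30.5 | 48.0 | 31.0 | 46.6 |
| ShareGPT4Video | 7B | 49.5 | 39.5 | 79.5 | 40.0 | 54.5 | 82.5 | 54.5 | 32.5 | 50.5 | 41.5 | 84.5 | 35.5 | 62.5 | 75.0 | 51.0 | 25.5 | 46.5 | 28.5 | 39.0 | 51.5 | 51.2 |
| ST-LLM | 7B | 66.0 | 53.5 | 84.0 | 44.0 | 58.5 | 80.5 | 73.5 | 38.5 | 42.5 | 31.0 | 86.5 | 36.5 | 56.5 | 78.5 | 43.0 | 44.5 | 46.5 | 34.5 | 41.5 | 58.5 | 54.9 |
| VideoGPT+ | 3.8B | 69.0 | 60.0 | 83.0 | 48.5 | 66.5 | 85.5 | 75.5 | 36.0 | 44.0 | 34.0 | 89.5 | 39.5 | 71.0 | 90.5 | 45.0 | 53.0 | 50.0 | 29.5 | 44.0 | 60.0 | 58.7 |
| VideoChat2 | 7B | 75.5 | 58.0 | 83.5 | 50.5 | 60.5 | 87.5 | 74.5 | 45.0 | 47.5 | 44.0 | 82.5 | 37.0 | 64.5 | 87.5 | 51.0 | 66.5 | 47.0 | 35.0 | 37.0 | 72.5 | 60.4 |
| VideoMind (Ours) | 2B | 77.0 | 78.0 | 77.0 | 46.5 | 70.5 | 87.0 | 71.5 | 33.0 | 48.0 | 39.5 | 91.0 | 53.0 | 78.0 | 89.0 | 43.5 | 53.5 | 61.5 | 37.5 | 49.5 | 53.0 | 61.9 |
| VideoMind (Ours) | 7B | 74.0 | 71.5 | 81.0 | 50.0 | 77.0 | 93.0 | 75.0 | 38.0 | 48.5 | 46.0 | 91.0 | 39.0 | 80.0 | 94.5 | 49.5 | 55.5 | 70.0 | 40.5 | 57.0 | 61.0 | 64.6 |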

LongVideoBench:

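| Method | Size | Acc | (8, 15] | (15, 60] | (180, 600] | (900, 3600] |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | – | 66.7 | 71.4 | 76.7 | 69.1 | 60.9 |
| GPT-4 Turbo | – | 59.0 | 65.2 | 68.2 | 62.4 | 50.5 |
| Gemini-1.5-Pro | – | 64.0 | 67.4 | 75.1 | 65.3 | 58.6 |
| Gemini-1.5-Flash | – | 61.6 | 68.3 | 76.2 | 62.6 | 54.0 |
| Idefics2 | 8B | 49.7 | 59.8 | 65.7 | 47.8 | 42.7 |
| Phi-3-Vision | 4B | 49.6 | 59.3 | 61.6 | 46.8 | 44.7 |
| Mantis-Idefics2 | 8B | 47.0 | 56.6 | 55.8 | 45.6 | 42.2 |
| Mantis-BakLLaVA | 7B | 43.7 | 53.4 | 57.6 | 40.3 | 38.7 |
| VideoMind (Ours) | 2B | 48.8 | 59.3 | 59.3 | 49.3 | 41.7 |
| VideoMind (Ours) | 7B | 56.3 | 67.4 | 67.4 | 56.8 | 48.6 |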