Time Blindness: Why Video-Language Models Can't See What Humans Can?

June 2, 2025

Ujjwal Upadhyay*, Mukul Ranjan*, Zhiqiang Shen, Mohamed Elhoseiny



*Equal Contribution




📖 Overview

Current 🤖 Video-Vision Language Models (Video-VLMs) excel at spatial understanding but suffer from ⏰ "time blindness": a critical inability to process purely temporal patterns. While humans effortlessly recognize information encoded in temporal sequences with 98% accuracy, state-of-the-art models including GPT-4o, Gemini 2.0, and Qwen-VL achieve 0% accuracy on the same tasks.

We introduce 👻 SpookyBench, the first benchmark designed to isolate and evaluate pure temporal understanding by encoding information exclusively through temporal sequences of noise-like frames. This exposes a fundamental limitation in current video understanding architectures that over-rely on frame-level spatial features.


🌟 Key Highlights

✅ First benchmark to isolate purely temporal reasoning without spatial shortcuts
✅ 451 carefully crafted videos across 4 distinct categories (Words, Shapes, Objects, Dynamic Scenes)
✅ Striking performance gap: Humans 98% vs. All AI models 0%
✅ Comprehensive evaluation: 15+ state-of-the-art models tested including GPT-4o, Gemini, Qwen-VL
✅ Novel temporal encoding framework using opposing motion patterns
✅ Cross-architecture failure: Limitation persists across model scales and designs


🚀 SpookyBench reveals that current Video-VLMs are fundamentally "time-blind" despite impressive performance on standard benchmarks! ⏰👁️


📊 SpookyBench Dataset

Our benchmark contains 451 videos distributed across four temporal pattern categories:

| Category | Total Videos | Description |
|---|---|---|
| Text | 210 (46.6%) | English words encoded through temporal noise patterns |
| Object Images | 156 (34.6%) | Single objects encoded using temporal animation |
| Dynamic Scenes | 57 (12.6%) | Video depth maps with temporal motion patterns |
| Shapes | 28 (6.2%) | Geometric patterns encoded through temporal sequences |
| Total | 451 | Comprehensive temporal understanding evaluation |

📌 Each video appears as random noise in individual frames, but reveals meaningful content when viewed as a temporal sequence.


🎯 Benchmark Results

The Time Blindness Gap

Our evaluation reveals a shocking performance disparity:

| Model Type | Human Performance | AI Model Performance | Gap |
|---|---|---|---|
| 👥 Humans | 98.0% ± 0.6% | N/A | N/A |
| 🤖 All Video-VLMs | N/A | 0.0% | 98 percentage points |

Tested Models (All Scored 0%)

| Model Family | Models Tested | Performance |
|---|---|---|
| Closed-Source | GPT-4o, GPT-4V, Gemini 2.0 Flash, Gemini 1.5 Pro | 0% across all |
| Open-Source Large | Qwen2.5-VL-72B, InternVL2.5-78B, InternVL2-40B | 0% across all |
| Open-Source Mid | Video-LLaVA, LLaVA-NeXT-Video, TimeChat | 0% across all |
| Specialized | TimeChat, VideoGPT+, VILA | 0% across all |

Model Performance

📌 Key Finding: The limitation is architectural, not a matter of scale, training, or prompting strategy.


📸 Task Examples

Examples of temporal patterns in SpookyBench: individual frames appear as noise, but temporal sequences reveal words, shapes, and objects that humans can easily recognize. For more examples, visit our project webpage.


🔬 Temporal Encoding Framework

Our unique encoding method creates temporal patterns through opposing motion:

Core Principle: Motion-Based Content Emergence

  • Foreground pixels: Move in one direction (e.g., up/left)
  • Background pixels: Move in opposite direction (e.g., down/right)
  • Human perception: Groups pixels by motion direction, revealing content
  • AI models: Fail to leverage temporal motion cues

Technical Implementation

# Simplified temporal encoding algorithm
for each pixel (x, y):
    if content_mask(x, y):
        frame[x, y] = foreground_noise(x, y + velocity*time)
    else:
        frame[x, y] = background_noise(x, y - velocity*time)
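
The pseudocode above can be turned into a small runnable sketch. The version below is only an illustration under stated assumptions, not the authors' generation code: it assumes binary noise textures, an integer `velocity` in pixels per frame, wrap-around scrolling, and that the scroll direction maps to array axis 0. The function name `encode_frames` and its parameters are hypothetical.

```python
import numpy as np

def encode_frames(content_mask, num_frames=8, velocity=1, seed=0):
    """Render noise frames whose foreground and background scroll in opposite directions.

    content_mask: 2D boolean array, True where the hidden content (foreground) is.
    velocity: integer pixels per frame (illustrative parameter, not from the source).
    """
    rng = np.random.default_rng(seed)
    # Two independent, fixed binary noise textures (an assumed choice of noise).
    foreground_noise = rng.integers(0, 2, size=content_mask.shape, dtype=np.uint8)
    background_noise = rng.integers(0, 2, size=content_mask.shape, dtype=np.uint8)

    frames = []
    for t in range(num_frames):
        shift = velocity * t
        # np.roll with a negative shift samples the texture at (row + shift), wrapping around.
        fg = np.roll(foreground_noise, -shift, axis=0)
        # A positive shift samples the texture at (row - shift): the opposite direction.
        bg = np.roll(background_noise, shift, axis=0)
        frames.append(np.where(content_mask, fg, bg))
    return np.stack(frames)

# Usage: hide a square in a 32x32 noise video.
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
frames = encode_frames(mask, num_frames=6)
print(frames.shape)  # (6, 32, 32)
```

Each frame on its own is a mix of two random textures, so the square is only defined by which pixels move together over time.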

Signal Analysis Metrics

| Metric | Purpose |
|---|---|
| Basic SNR | Measures signal-to-noise ratio in temporal patterns |
| Perceptual SNR | Incorporates human visual sensitivity weighting |
| Temporal Coherence | Quantifies motion consistency over time |
| Motion Contrast | Measures foreground-background motion differentiation |
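
This README does not give formulas for these metrics. As a point of reference only, the sketch below computes a signal-to-noise ratio using the textbook definition, 10·log10 of mean signal power over mean noise power. How SpookyBench separates "signal" from "noise" in a video is not specified here, so the function name and its inputs are assumptions.

```python
import numpy as np

def basic_snr_db(signal, noise):
    """Textbook SNR in decibels: 10 * log10(mean signal power / mean noise power).

    This is a generic definition, not necessarily the benchmark's own metric.
    """
    signal_power = np.mean(np.square(np.asarray(signal, dtype=np.float64)))
    noise_power = np.mean(np.square(np.asarray(noise, dtype=np.float64)))
    return 10.0 * np.log10(signal_power / noise_power)

# Usage: a signal with 100x the noise power has an SNR of 20 dB.
snr = basic_snr_db(np.full(8, 10.0), np.full(8, 1.0))
print(round(snr, 2))  # 20.0
```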

🤗 Download the data

You can download the dataset from Hugging Face using wget, then unzip the archive.

wget https://huggingface.co/datasets/timeblindness/spooky-bench/resolve/main/spooky_bench.zip
unzip spooky_bench.zip

⚙️ Installation & Usage

1️⃣ Clone the Repository

git clone https://github.com/TimeBlindness/time-blindness.git
cd time-blindness

2️⃣ Set Up Environment

# For closed-source models (GPT-4o, Gemini)
cd eval/closed_models
pip install -r requirements.txt

# Set up API keys in .env
OPENAI_API_KEY=your_openai_api_key_here
GOOGLE_API_KEY=your_gemini_api_key_here
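
Before launching a long evaluation run, it can help to confirm that both keys are visible to Python. The helper below is a hypothetical convenience, not part of the repository; it assumes the variables from `.env` have already been exported or loaded into the process environment.

```python
import os

# Key names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GOOGLE_API_KEY")

def missing_api_keys(env=None):
    """Return the required key names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]

# Usage: check an explicit mapping (pass nothing to check os.environ instead).
print(missing_api_keys({"OPENAI_API_KEY": "sk-example"}))  # ['GOOGLE_API_KEY']
```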

3️⃣ Evaluate Models

GPT-4o Evaluation

cd eval/closed_models
python eval_gpt4o.py \
  --dataset /path/to/spooky_bench/SpookyBenchDatasets \
  --csv /path/to/metadata.csv \
  --output ./results \
  --categories words shapes \
  --use_cot \
  --sample_size 10

Gemini Evaluation

cd eval/closed_models
python eval_gemini.py \
  --dataset /path/to/spooky_bench/SpookyBenchDatasets \
  --csv /path/to/metadata.csv \
  --output ./results \
  --categories words \
  --use_cot \
  --sample_size 10

Qwen Evaluation

See instructions in eval/qwen/README.md

InternVL Evaluation

See instructions in eval/internvl/README.md

MovieChat Evaluation

See instructions in eval/MovieChatVideo/README.md

TimeChat Evaluation

See instructions in the authors' original repo TimeChat. More detailed instructions will be added later.

VideoLLaMA3 Evaluation

See instructions in the authors' original repo VideoLLaMA3. More detailed instructions will be added later.

MiniGPT4-Video Evaluation

See instructions in the authors' original repo MiniGPT4-video. More detailed instructions will be added later.

Video-ChatGPT Evaluation

See instructions in the authors' original repo Video-ChatGPT. More detailed instructions will be added later.

VideoGPT-plus Evaluation

See instructions in the authors' original repo VideoGPT-plus. More detailed instructions will be added later.

VILA Evaluation

See instructions in the authors' original repo VILA. More detailed instructions will be added later.

ShareGPT4Video Evaluation

See instructions in the authors' original repo ShareGPT4Video. More detailed instructions will be added later.

VideoLLaMA2 Evaluation

See instructions in the authors' original repo VideoLLaMA2. More detailed instructions will be added later.

Video-LLaVA Evaluation

See instructions in the authors' original repo Video-LLaVA. More detailed instructions will be added later.

LLaVA-NeXT-Video Evaluation

See instructions in the authors' original repo LLaVA-NeXT-Video. More detailed instructions will be added later.

Command-line Arguments

  • --dataset: Path to the SpookyBench dataset directory
  • --csv: Path to the metadata CSV file
  • --output: Directory to save evaluation results
  • --categories: Categories to evaluate (words, shapes, images, videos)
  • --use_cot: Use chain-of-thought prompting for more detailed reasoning
  • --sample_size: Number of videos to sample per category
  • --model: Model name/version to use (specific to each evaluator)
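
To make the flag semantics concrete, here is a hypothetical `argparse` parser that mirrors the arguments listed above. It is a sketch, not the code in `eval_gpt4o.py` or `eval_gemini.py`: which flags are required, the defaults, and the restriction of `--categories` to four choices are all assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented flags; defaults are assumptions."""
    parser = argparse.ArgumentParser(description="SpookyBench evaluator (illustrative sketch)")
    parser.add_argument("--dataset", required=True, help="Path to the SpookyBench dataset directory")
    parser.add_argument("--csv", required=True, help="Path to the metadata CSV file")
    parser.add_argument("--output", default="./results", help="Directory to save evaluation results")
    parser.add_argument("--categories", nargs="+",
                        choices=["words", "shapes", "images", "videos"],
                        default=["words", "shapes", "images", "videos"],
                        help="Categories to evaluate")
    parser.add_argument("--use_cot", action="store_true", help="Use chain-of-thought prompting")
    parser.add_argument("--sample_size", type=int, default=None,
                        help="Number of videos to sample per category")
    parser.add_argument("--model", default=None, help="Model name/version (evaluator-specific)")
    return parser

# Usage: parse the same flags as the GPT-4o example command.
args = build_parser().parse_args([
    "--dataset", "/path/to/spooky_bench/SpookyBenchDatasets",
    "--csv", "/path/to/metadata.csv",
    "--categories", "words", "shapes",
    "--use_cot",
    "--sample_size", "10",
])
print(args.categories, args.use_cot, args.sample_size)  # ['words', 'shapes'] True 10
```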

4️⃣ Human Evaluation Interface

Expected Data Folder Structure

data_path/
  ├── images/
  │   ├── video1.mp4
  │   ├── video2.mp4
  │   └── ...
  ├── shapes/
  │   ├── video1.mp4
  │   ├── video2.mp4
  │   └── ...
  ├── videos/
  │   ├── video1.mp4
  │   ├── video2.mp4
  │   └── ...
  └── words/
      ├── video1.mp4
      ├── video2.mp4
      └── ...
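
A quick way to check that a local copy matches this layout is to list the `.mp4` files in each category folder. The helper below is a hypothetical sketch, not part of `human_eval_interface.py`; the folder names come from the tree above.

```python
from pathlib import Path

# Category folder names taken from the expected layout above.
CATEGORIES = ("images", "shapes", "videos", "words")

def list_videos(data_path):
    """Map each category folder to its sorted .mp4 files; missing folders map to []."""
    root = Path(data_path)
    found = {}
    for category in CATEGORIES:
        folder = root / category
        found[category] = sorted(folder.glob("*.mp4")) if folder.is_dir() else []
    return found

# Usage: report how many videos each category contains.
for category, videos in list_videos("./data").items():
    print(f"{category}: {len(videos)} videos")
```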

Example Usage:

# Run human evaluation interface
python human_eval_interface.py --data_path /path/to/spooky_bench_data
python human_eval_interface.py --data_path ./data --output_dir ./annotations --port 7861

Command-line Arguments

  • --dataset: Path to the SpookyBench dataset directory
  • --csv: Path to the metadata CSV file
  • --output: Directory to save evaluation results
  • --categories: Categories to evaluate (words, shapes, objects, videos)
  • --use_cot: Use chain-of-thought prompting for more detailed reasoning
  • --sample_size: Number of videos to sample per category
  • --model: Model name/version to use (specific to each evaluator)

Available Models

  • Closed-Source: GPT-4o, GPT-4V, Gemini 2.0 Flash, Gemini 1.5 Pro
  • Open-Source: Qwen2-VL, Qwen2.5-VL, InternVL2, InternVL2.5, Video-LLaVA, TimeChat, LLaVA-NeXT-Video
  • Specialized: InternVideo2.5, LongVLM, Momentor, Grounded-VideoLLM

📊 Results Analysis

Human Performance Breakdown

| Category | Accuracy | Perceptibility Rating |
|---|---|---|
| Text | 98.9% ± 0.7% | 4.8 ± 0.0 |
| Shapes | 98.2% ± 2.5% | 4.8 ± 0.1 |
| Object Images | 98.2% ± 1.1% | 4.6 ± 0.1 |
| Dynamic Scenes | 94.3% ± 3.1% | 4.3 ± 0.1 |
Fine Tuning Qwen Models

See instructions in the repo Qwen2-VL-Finetune. More detailed instructions will be added later. We provide the JSON file in the finetune directory.


🏗️ Project Structure

SpookyBench/
├── eval/
│   ├── closed_models/
│   │   ├── eval_gpt4o.py
│   │   ├── eval_gemini.py
│   │   └── requirements.txt
│   ├── qwen/
│   │   ├── run_qwen.py
│   │   └── requirements.txt
│   ├── internvl/
│   │   └── run_internvl.py
│   └── video_llava/
│       └── run_video_llava.py
├── dataset/
│   ├── SpookyBenchDatasets/
│   │   ├── words/
│   │   ├── shapes/
│   │   ├── objects/
│   │   └── videos/
│   └── metadata.csv
├── human_eval/
│   └── human_eval_interface.py
├── static/
│   └── images/
│       ├── timeblind_logo.svg
│       └── spooky_examples.png
└── README.md

🔍 Key Insights

Why Current Models Fail

  1. Over-reliance on spatial features: Models process individual frames first, then attempt temporal integration
  2. Lack of motion-based segregation: Cannot perform figure-ground separation based on motion patterns
  3. Insufficient temporal integration: Current architectures treat temporal information as secondary
  4. Missing biological inspiration: Human visual system uses distributed temporal processing mechanisms

Architectural Implications

  • Need for temporal-first processing: Future models should treat temporal understanding as primary
  • Motion contrast analysis required: Models need sophisticated motion segregation capabilities
  • Longer temporal integration windows: Extended temporal attention mechanisms necessary
  • Distributed temporal representations: Following biological principles of temporal processing
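
To illustrate what "motion-based segregation" means in practice, the sketch below recovers a hidden mask purely from frame-to-frame correlations. It is a toy demonstration under the simplified encoding from the Technical Implementation section (binary noise, known integer velocity, wrap-around scrolling), not a proposed model architecture and not the authors' analysis code. The function names are hypothetical.

```python
import numpy as np

def make_frames(mask, num_frames, velocity, rng):
    """Opposing-motion noise frames, following the simplified encoding pseudocode."""
    fg = rng.integers(0, 2, size=mask.shape)
    bg = rng.integers(0, 2, size=mask.shape)
    return np.stack([
        np.where(mask,
                 np.roll(fg, -velocity * t, axis=0),  # foreground scrolls one way
                 np.roll(bg, velocity * t, axis=0))   # background scrolls the other way
        for t in range(num_frames)
    ])

def estimate_mask(frames, velocity):
    """Label a pixel as foreground when the next frame agrees more often with the
    current frame sampled `velocity` rows ahead than with it sampled `velocity` rows behind."""
    prev, nxt = frames[:-1], frames[1:]
    # axis=1 is the row axis of the (time, row, column) stack.
    ahead_agreement = (nxt == np.roll(prev, -velocity, axis=1)).mean(axis=0)
    behind_agreement = (nxt == np.roll(prev, velocity, axis=1)).mean(axis=0)
    return ahead_agreement > behind_agreement

# Usage: hide a square, then recover it from temporal statistics alone.
rng = np.random.default_rng(0)
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
frames = make_frames(mask, num_frames=16, velocity=1, rng=rng)
recovered = estimate_mask(frames, velocity=1)
accuracy = (recovered == mask).mean()
print(f"pixel agreement with the true mask: {accuracy:.3f}")
```

Errors concentrate at the boundary between the two motion fields, where neither shift direction gives a consistent match. The rest of the mask follows directly from which direction each pixel's noise is moving, information that no single frame contains.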

TODO List

  • Add support in VLMEvalKit
  • Add support in lmms-eval
  • Add python code for generating animations in batch

📧 Contact

For questions or collaborations, please contact:

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Exposing the temporal reasoning gap between humans and machines 🧠⚡🤖