Time Blindness: Why Video-Language Models Can't See What Humans Can?

June 2, 2025 · View on GitHub

Ujjwal Upadhyay^* Mukul Ranjan^* Zhiqiang Shen Mohamed Elhoseiny

^*Equal Contribution

📌 Table of Contents

📖 Overview
🌟 Key Highlights
📊 SpookyBench Dataset
🎯 Benchmark Results
📸 Task Examples
🔬 Temporal Encoding Framework
⚙️ Installation & Usage
📜 Citation

Current 🤖 Video-Vision Language Models (Video-VLMs) excel at spatial understanding but suffer from ⏰ "time blindness" - a critical inability to process purely temporal patterns. While humans effortlessly recognize information encoded in temporal sequences with 98% accuracy, state-of-the-art models including GPT-4o, Gemini 2.0, and Qwen-VL achieve 0% accuracy on the same tasks.

We introduce 👻 SpookyBench, the first benchmark designed to isolate and evaluate pure temporal understanding by encoding information exclusively through temporal sequences of noise-like frames. This exposes a fundamental limitation in current video understanding architectures that over-rely on frame-level spatial features.

🌟 Key Highlights

✅ First benchmark to isolate purely temporal reasoning without spatial shortcuts
✅ 451 carefully crafted videos across 4 distinct categories (Words, Shapes, Objects, Dynamic Scenes)
✅ Striking performance gap: Humans 98% vs. All AI models 0%
✅ Comprehensive evaluation: 15+ state-of-the-art models tested including GPT-4o, Gemini, Qwen-VL
✅ Novel temporal encoding framework using opposing motion patterns
✅ Cross-architecture failure: Limitation persists across model scales and designs

🚀 SpookyBench reveals that current Video-VLMs are fundamentally "time-blind" despite impressive performance on standard benchmarks! ⏰👁️

📊 SpookyBench Dataset

Our benchmark contains 451 videos distributed across four temporal pattern categories:

Category	Total Videos	Description
Text	210 (46.6%)	English words encoded through temporal noise patterns
Object Images	156 (34.6%)	Single objects encoded using temporal animation
Dynamic Scenes	57 (12.6%)	Video depth maps with temporal motion patterns
Shapes	28 (6.2%)	Geometric patterns encoded through temporal sequences
Total	451	Comprehensive temporal understanding evaluation

📌 Each video appears as random noise in individual frames, but reveals meaningful content when viewed as a temporal sequence.

🎯 Benchmark Results

The Time Blindness Gap

Our evaluation reveals a shocking performance disparity:

Model Type	Human Performance	AI Model Performance	Gap
👥 Humans	98.0% ± 0.6%	N/A	N/A
🤖 All Video-VLMs	N/A	0.0%	98 percentage points

Tested Models (All Scored 0%)

Model Family	Models Tested	Performance
Closed-Source	GPT-4o, GPT-4V, Gemini 2.0 Flash, Gemini 1.5 Pro	0% across all
Open-Source Large	Qwen2.5-VL-72B, InternVL2.5-78B, InternVL2-40B	0% across all
Open-Source Mid	Video-LLaVA, LLaVA-NeXT-Video, TimeChat	0% across all
Specialized	TimeChat, VideoGPT+, VILA	0% across all

Model Performacne

📌 Key Finding: The limitation is architectural, not a matter of scale, training, or prompting strategy.

📸 Task Examples

Examples of temporal patterns in SpookyBench: Individual frames appear as noise, but temporal sequences reveal words, shapes, and objects that humans can easily recognize. For more examples visit our project webpage

🔬 Temporal Encoding Framework

Our unique encoding method creates temporal patterns through opposing motion:

Core Principle: Motion-Based Content Emergence

Foreground pixels: Move in one direction (e.g., up/left)
Background pixels: Move in opposite direction (e.g., down/right)
Human perception: Groups pixels by motion direction, revealing content
AI models: Fail to leverage temporal motion cues

Technical Implementation

# Simplified temporal encoding algorithm
for each pixel (x, y):
    if content_mask(x, y):
        frame[x, y] = foreground_noise(x, y + velocity*time)
    else:
        frame[x, y] = background_noise(x, y - velocity*time)

Signal Analysis Metrics

Metric	Purpose
Basic SNR	Measures signal-to-noise ratio in temporal patterns
Perceptual SNR	Incorporates human visual sensitivity weighting
Temporal Coherence	Quantifies motion consistency over time
Motion Contrast	Measures foreground-background motion differentiation

🤗 Download the data

You can download the dataset from hugging face using wget and then unzip the file.

wget https://huggingface.co/datasets/timeblindness/spooky-bench/resolve/main/spooky_bench.zip
unzip spooky_bench.zip

⚙️ Installation & Usage

1️⃣ Clone the Repository

git clone https://github.com/TimeBlindness/time-blindness.git
cd time-blindness

2️⃣ Setup Environment

# For closed-source models (GPT-4o, Gemini)
cd eval/closed_models
pip install -r requirements.txt

# Set up API keys in .env
OPENAI_API_KEY=your_openai_api_key_here
GOOGLE_API_KEY=your_gemini_api_key_here

3️⃣ Evaluate Models

GPT-4o Evaluation

cd eval/closed_models
python eval_gpt4o.py \
  --dataset /path/to/spooky_bench/SpookyBenchDatasets \
  --csv /path/to/metadata.csv \
  --output ./results \
  --categories words shapes \
  --use_cot \
  --sample_size 10

Gemini Evaluation

cd eval/closed_models
python eval_gemini.py \
  --dataset /path/to/spooky_bench/SpookyBenchDatasets \
  --csv /path/to/metadata.csv \
  --output ./results \
  --categories words \
  --use_cot \
  --sample_size 10

--dataset: Path to the SpookyBench dataset directory
--csv: Path to the metadata CSV file
--output: Directory to save evaluation results
--categories: Categories to evaluate (words, shapes, images, videos)
--use_cot: Use chain-of-thought prompting for more detailed reasoning
--sample_size: Number of videos to sample per category
--model: Model name/version to use (specific to each evaluator)

4️⃣ Human Evaluation Interface

Expected Data Folder Structure

data_path/
  ├── images/
  │   ├── video1.mp4
  │   ├── video2.mp4
  │   └── ...
  ├── shapes/
  │   ├── video1.mp4
  │   ├── video2.mp4
  │   └── ...
  ├── videos/
  │   ├── video1.mp4
  │   ├── video2.mp4
  │   └── ...
  └── words/
      ├── video1.mp4
      ├── video2.mp4
      └── ...

Example Usage:

# Run human evaluation interface
python human_eval_interface.py --data_path /path/to/spooky_bench_data
python human_eval_interface.py --data_path ./data --output_dir ./annotations --port 7861

Command-line Arguments

--dataset: Path to the SpookyBench dataset directory
--csv: Path to the metadata CSV file
--output: Directory to save evaluation results
--categories: Categories to evaluate (words, shapes, objects, videos)
--use_cot: Use chain-of-thought prompting for more detailed reasoning
--sample_size: Number of videos to sample per category
--model: Model name/version to use (specific to each evaluator)

Available Models

Closed-Source: GPT-4o, GPT-4V, Gemini 2.0 Flash, Gemini 1.5 Pro
Open-Source: Qwen2-VL, Qwen2.5-VL, InternVL2, InternVL2.5, Video-LLaVA, TimeChat, LLaVA-NeXT-Video
Specialized: InternVideo2.5, LongVLM, Momentor, Grounded-VideoLLM

📊 Results Analysis

Human Performance Breakdown

Category	Accuracy	Perceptibility Rating
Text	98.9% ± 0.7%	4.8 ± 0.0
Shapes	98.2% ± 2.5%	4.8 ± 0.1
Object Images	98.2% ± 1.1%	4.6 ± 0.1
Dynamic Scenes	94.3% ± 3.1%	4.3 ± 0.1

Fine Tuning Qwen Models

See instructions in the repo Qwen2-VL-Finetune. More detailed instructions will be updated later. We have provided the json file in finetune directory.

🏗️ Project Structure

SpookyBench/
├── eval/
│   ├── closed_models/
│   │   ├── eval_gpt4o.py
│   │   ├── eval_gemini.py
│   │   └── requirements.txt
│   ├── qwen/
│   │   ├── run_qwen.py
│   │   └── requirements.txt
│   ├── internvl/
│   │   └── run_internvl.py
│   └── video_llava/
│       └── run_video_llava.py
├── dataset/
│   ├── SpookyBenchDatasets/
│   │   ├── words/
│   │   ├── shapes/
│   │   ├── objects/
│   │   └── videos/
│   └── metadata.csv
├── human_eval/
│   └── human_eval_interface.py
├── static/
│   └── images/
│       ├── timeblind_logo.svg
│       └── spooky_examples.png
└── README.md

🔍 Key Insights

Why Current Models Fail

Over-reliance on spatial features: Models process individual frames first, then attempt temporal integration
Lack of motion-based segregation: Cannot perform figure-ground separation based on motion patterns
Insufficient temporal integration: Current architectures treat temporal information as secondary
Missing biological inspiration: Human visual system uses distributed temporal processing mechanisms

Architectural Implications

Need for temporal-first processing: Future models should treat temporal understanding as primary
Motion contrast analysis required: Models need sophisticated motion segregation capabilities
Longer temporal integration windows: Extended temporal attention mechanisms necessary
Distributed temporal representations: Following biological principles of temporal processing

TODO List

Add support in VLMEvalKit
Add support in lmms-eval
Add python code for generating animations in batch

📧 Contact

For questions or collaborations, please contact:

Ujjwal Upadhyay: ujjwalupadhyay8@gmail.com
Mukul Ranjan: mukul.ranjan@mbzuai.ac.ae

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Exposing the temporal reasoning gap between humans and machines 🧠⚡🤖