Time Blindness: Why Video-Language Models Can't See What Humans Can?
June 2, 2025 ยท View on GitHub

Ujjwal Upadhyay * ย Mukul Ranjan * ย Zhiqiang Shen ย Mohamed Elhoseiny
*Equal Contribution
๐ Table of Contents
- ๐ Overview
- ๐ Key Highlights
- ๐ SpookyBench Dataset
- ๐ฏ Benchmark Results
- ๐ธ Task Examples
- ๐ฌ Temporal Encoding Framework
- โ๏ธ Installation & Usage
- ๐ Citation
๐ Overview
Current ๐ค Video-Vision Language Models (Video-VLMs) excel at spatial understanding but suffer from โฐ "time blindness" - a critical inability to process purely temporal patterns. While humans effortlessly recognize information encoded in temporal sequences with 98% accuracy, state-of-the-art models including GPT-4o, Gemini 2.0, and Qwen-VL achieve 0% accuracy on the same tasks.
We introduce ๐ป SpookyBench, the first benchmark designed to isolate and evaluate pure temporal understanding by encoding information exclusively through temporal sequences of noise-like frames. This exposes a fundamental limitation in current video understanding architectures that over-rely on frame-level spatial features.
๐ Key Highlights
โ
First benchmark to isolate purely temporal reasoning without spatial shortcuts
โ
451 carefully crafted videos across 4 distinct categories (Words, Shapes, Objects, Dynamic Scenes)
โ
Striking performance gap: Humans 98% vs. All AI models 0%
โ
Comprehensive evaluation: 15+ state-of-the-art models tested including GPT-4o, Gemini, Qwen-VL
โ
Novel temporal encoding framework using opposing motion patterns
โ
Cross-architecture failure: Limitation persists across model scales and designs
๐ SpookyBench reveals that current Video-VLMs are fundamentally "time-blind" despite impressive performance on standard benchmarks! โฐ๐๏ธ
๐ SpookyBench Dataset
Our benchmark contains 451 videos distributed across four temporal pattern categories:
| Category | Total Videos | Description |
|---|---|---|
| Text | 210 (46.6%) | English words encoded through temporal noise patterns |
| Object Images | 156 (34.6%) | Single objects encoded using temporal animation |
| Dynamic Scenes | 57 (12.6%) | Video depth maps with temporal motion patterns |
| Shapes | 28 (6.2%) | Geometric patterns encoded through temporal sequences |
| Total | 451 | Comprehensive temporal understanding evaluation |
๐ Each video appears as random noise in individual frames, but reveals meaningful content when viewed as a temporal sequence.
๐ฏ Benchmark Results
The Time Blindness Gap
Our evaluation reveals a shocking performance disparity:
| Model Type | Human Performance | AI Model Performance | Gap |
|---|---|---|---|
| ๐ฅ Humans | 98.0% ยฑ 0.6% | N/A | N/A |
| ๐ค All Video-VLMs | N/A | 0.0% | 98 percentage points |
Tested Models (All Scored 0%)
| Model Family | Models Tested | Performance |
|---|---|---|
| Closed-Source | GPT-4o, GPT-4V, Gemini 2.0 Flash, Gemini 1.5 Pro | 0% across all |
| Open-Source Large | Qwen2.5-VL-72B, InternVL2.5-78B, InternVL2-40B | 0% across all |
| Open-Source Mid | Video-LLaVA, LLaVA-NeXT-Video, TimeChat | 0% across all |
| Specialized | TimeChat, VideoGPT+, VILA | 0% across all |
๐ Key Finding: The limitation is architectural, not a matter of scale, training, or prompting strategy.
๐ธ Task Examples
Examples of temporal patterns in SpookyBench: Individual frames appear as noise, but temporal sequences reveal words, shapes, and objects that humans can easily recognize. For more examples visit our project webpage
๐ฌ Temporal Encoding Framework
Our unique encoding method creates temporal patterns through opposing motion:
Core Principle: Motion-Based Content Emergence
- Foreground pixels: Move in one direction (e.g., up/left)
- Background pixels: Move in opposite direction (e.g., down/right)
- Human perception: Groups pixels by motion direction, revealing content
- AI models: Fail to leverage temporal motion cues
Technical Implementation
# Simplified temporal encoding algorithm
for each pixel (x, y):
if content_mask(x, y):
frame[x, y] = foreground_noise(x, y + velocity*time)
else:
frame[x, y] = background_noise(x, y - velocity*time)
Signal Analysis Metrics
| Metric | Purpose |
|---|---|
| Basic SNR | Measures signal-to-noise ratio in temporal patterns |
| Perceptual SNR | Incorporates human visual sensitivity weighting |
| Temporal Coherence | Quantifies motion consistency over time |
| Motion Contrast | Measures foreground-background motion differentiation |
๐ค Download the data
You can download the dataset from hugging face using wget and then unzip the file.
wget https://huggingface.co/datasets/timeblindness/spooky-bench/resolve/main/spooky_bench.zip
unzip spooky_bench.zip
โ๏ธ Installation & Usage
1๏ธโฃ Clone the Repository
git clone https://github.com/TimeBlindness/time-blindness.git
cd time-blindness
2๏ธโฃ Setup Environment
# For closed-source models (GPT-4o, Gemini)
cd eval/closed_models
pip install -r requirements.txt
# Set up API keys in .env
OPENAI_API_KEY=your_openai_api_key_here
GOOGLE_API_KEY=your_gemini_api_key_here
3๏ธโฃ Evaluate Models
GPT-4o Evaluation
cd eval/closed_models
python eval_gpt4o.py \
--dataset /path/to/spooky_bench/SpookyBenchDatasets \
--csv /path/to/metadata.csv \
--output ./results \
--categories words shapes \
--use_cot \
--sample_size 10
Gemini Evaluation
cd eval/closed_models
python eval_gemini.py \
--dataset /path/to/spooky_bench/SpookyBenchDatasets \
--csv /path/to/metadata.csv \
--output ./results \
--categories words \
--use_cot \
--sample_size 10
Qwen Evaluation
See instructions in eval/qwen/README.md
InternVL Evaluation
See instructions in eval/internvl/README.md
MovieChat Evaluation
See instructions in eval/MovieChatVideo/README.md
TimeChat Evaluation
See instructions in the author's original repo TimeChat. More detailed instructions will be updated later.
VideoLLaMA3 Evaluation
See instructions in the author's original repo VideoLLaMA3. More detailed instructions will be updated later.
MiniGPT4-Video Evaluation
See instructions in the author's original repo MiniGPT4-video. More detailed instructions will be updated later.
Video-ChatGPT Evaluation
See instructions in the author's original repo Video-ChatGPT. More detailed instructions will be updated later.
VideoGPT-plus Evaluation
See instructions in the author's original repo VideoGPT-plus. More detailed instructions will be updated later.
VILA Evaluation
See instructions in the author's original repo VILA. More detailed instructions will be updated later.
ShareGPT4Video Evaluation
See instructions in the author's original repo ShareGPT4Video. More detailed instructions will be updated later.
VideoLLaMA2 Evaluation
See instructions in the author's original repo VideoLLaMA2. More detailed instructions will be updated later.
Video-LLaVA Evaluation
See instructions in the author's original repo Video-LLaVA. More detailed instructions will be updated later.
LLaVA-NeXT-Video Evaluation
See instructions in the author's original repo LLaVA-NeXT-Video. More detailed instructions will be updated later.
Command-line Arguments
--dataset: Path to the SpookyBench dataset directory--csv: Path to the metadata CSV file--output: Directory to save evaluation results--categories: Categories to evaluate (words, shapes, images, videos)--use_cot: Use chain-of-thought prompting for more detailed reasoning--sample_size: Number of videos to sample per category--model: Model name/version to use (specific to each evaluator)
4๏ธโฃ Human Evaluation Interface
Expected Data Folder Structure
data_path/
โโโ images/
โ โโโ video1.mp4
โ โโโ video2.mp4
โ โโโ ...
โโโ shapes/
โ โโโ video1.mp4
โ โโโ video2.mp4
โ โโโ ...
โโโ videos/
โ โโโ video1.mp4
โ โโโ video2.mp4
โ โโโ ...
โโโ words/
โโโ video1.mp4
โโโ video2.mp4
โโโ ...
Example Usage:
# Run human evaluation interface
python human_eval_interface.py --data_path /path/to/spooky_bench_data
python human_eval_interface.py --data_path ./data --output_dir ./annotations --port 7861
Command-line Arguments
--dataset: Path to the SpookyBench dataset directory--csv: Path to the metadata CSV file--output: Directory to save evaluation results--categories: Categories to evaluate (words, shapes, objects, videos)--use_cot: Use chain-of-thought prompting for more detailed reasoning--sample_size: Number of videos to sample per category--model: Model name/version to use (specific to each evaluator)
Available Models
- Closed-Source: GPT-4o, GPT-4V, Gemini 2.0 Flash, Gemini 1.5 Pro
- Open-Source: Qwen2-VL, Qwen2.5-VL, InternVL2, InternVL2.5, Video-LLaVA, TimeChat, LLaVA-NeXT-Video
- Specialized: InternVideo2.5, LongVLM, Momentor, Grounded-VideoLLM
๐ Results Analysis
Human Performance Breakdown
| Category | Accuracy | Perceptibility Rating |
|---|---|---|
| Text | 98.9% ยฑ 0.7% | 4.8 ยฑ 0.0 |
| Shapes | 98.2% ยฑ 2.5% | 4.8 ยฑ 0.1 |
| Object Images | 98.2% ยฑ 1.1% | 4.6 ยฑ 0.1 |
| Dynamic Scenes | 94.3% ยฑ 3.1% | 4.3 ยฑ 0.1 |
Fine Tuning Qwen Models
See instructions in the repo Qwen2-VL-Finetune. More detailed instructions will be updated later. We have provided the json file in finetune directory.
๐๏ธ Project Structure
SpookyBench/
โโโ eval/
โ โโโ closed_models/
โ โ โโโ eval_gpt4o.py
โ โ โโโ eval_gemini.py
โ โ โโโ requirements.txt
โ โโโ qwen/
โ โ โโโ run_qwen.py
โ โ โโโ requirements.txt
โ โโโ internvl/
โ โ โโโ run_internvl.py
โ โโโ video_llava/
โ โโโ run_video_llava.py
โโโ dataset/
โ โโโ SpookyBenchDatasets/
โ โ โโโ words/
โ โ โโโ shapes/
โ โ โโโ objects/
โ โ โโโ videos/
โ โโโ metadata.csv
โโโ human_eval/
โ โโโ human_eval_interface.py
โโโ static/
โ โโโ images/
โ โโโ timeblind_logo.svg
โ โโโ spooky_examples.png
โโโ README.md
๐ Key Insights
Why Current Models Fail
- Over-reliance on spatial features: Models process individual frames first, then attempt temporal integration
- Lack of motion-based segregation: Cannot perform figure-ground separation based on motion patterns
- Insufficient temporal integration: Current architectures treat temporal information as secondary
- Missing biological inspiration: Human visual system uses distributed temporal processing mechanisms
Architectural Implications
- Need for temporal-first processing: Future models should treat temporal understanding as primary
- Motion contrast analysis required: Models need sophisticated motion segregation capabilities
- Longer temporal integration windows: Extended temporal attention mechanisms necessary
- Distributed temporal representations: Following biological principles of temporal processing
TODO List
- Add support in VLMEvalKit
- Add support in lmms-eval
- Add python code for generating animations in batch
๐ง Contact
For questions or collaborations, please contact:
- Ujjwal Upadhyay: ujjwalupadhyay8@gmail.com
- Mukul Ranjan: mukul.ranjan@mbzuai.ac.ae
๐ License
This project is licensed under the MIT License - see the LICENSE file for details.
Exposing the temporal reasoning gap between humans and machines ๐ง โก๐ค