SVBench: Evaluation of Video Generation Models on Social Reasoning
December 24, 2025
Introduction
Recent text-to-video generation models have achieved remarkable progress in visual realism, motion fidelity, and text–video alignment. However, these models remain fundamentally limited in their ability to generate socially coherent behavior. Humans effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, whereas current video generation models tend to render literal scenes without capturing the underlying causal or psychological logic.
Modern diffusion- and transformer-based architectures can synthesize dynamic scenes with striking realism and reproduce complex motion patterns and multi-agent interactions, yet they struggle to represent why people act the way they do: the latent beliefs, intentions, emotions, and norms that structure real human interactions.
SVBench addresses this critical gap by introducing the first benchmark specifically designed to evaluate social reasoning capabilities in video generation models. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions:
- Mental State Inference - Understanding beliefs and theory of mind
- Goal-Directed Action - Recognizing intentional behaviors and goals
- Joint Attention & Perspective - Following gaze and understanding perspectives
- Social Coordination - Turn-taking and collaborative behavior
- Emotion & Prosocial Behavior - Emotion recognition and helping behaviors
- Social Norms & Spacing - Personal space and social conventions
- Multi-Agent Social Strategy - Complex multi-agent interactions
Our benchmark reveals substantial performance gaps in current video generation systems: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
Demos
The following videos are successful examples generated by the evaluated models, demonstrating correct social reasoning behaviors that align with the expected ground truth.
| Prompt | Ground Truth | Question |
|---|---|---|
| A girl is crying on a park bench. A dropped ice cream cone is on the ground nearby. A woman is sitting on the other end of the bench. | The woman will notice the crying girl and offer comfort. | Does the woman notice the crying girl and offer comfort? |
| On the floor of a room, there are red and blue rings. A boy is standing before them, appearing to hesitate about which one to choose. At this moment, a nearby woman first looks at the boy, then clearly points her finger at the red ring on the floor. | The boy will walk towards the red ring based on the woman's cue. | Does the boy walk towards the red ring based on the woman's cue? |
| On a table are two open boxes, one red and one blue. A woman is extending her finger to point at the blue box. A child is looking at the woman's hand. | Child will look in the blue box. | Does the child look in the blue box? |
| At the edge of a mud puddle, so expansive that one must step into it to retrieve anything from the middle, a child is unable to retrieve his toy. The boy is wearing clean cloth shoes, but the man is wearing clean but waterproof shoes. The boy looks at a man in clean shoes, also standing at the edge, and points at the toy. | The man will step into the mud to retrieve the toy, getting his shoes dirty. | Does the man step into the mud to retrieve the toy? |
| A woman and a boy are sitting on the floor with a small toy between them. | The woman and the boy will maintain coordinated joint attention, repeatedly shifting their gaze between the toy and each other. | Do the woman and boy maintain coordinated joint attention, shifting their gaze between the toy and each other? |
| An opaque screen stands in the middle of a table. A child on one side is reaching and fumbling, trying to find a toy on the other side. An adult, who can see the toy, is opposite them. | The adult will hand the toy to the child, knowing the child cannot see it. | Does the adult hand the toy to the child? |
| A man and a woman are seated opposite one another at a table. The woman speaks a short sentence, pauses, and then turns her head to look directly at the man. | Man speaks next. | Does the man speak next? |
| A woman directs her gaze toward a red toy on the floor. A boy is beside her, looking at the woman. | Boy will look at the red toy on the floor. | Does the boy look at the red toy on the floor? |
| A man and a woman are positioned next to each other. The woman emits a loud laugh, and the man, his attention drawn by the sound, turns his head in her direction. | The man and boy adopt happy expressions (smile/laugh). | Do the man and the boy adopt happy expressions, such as smiling or laughing? |
| A woman repeatedly reaches for a pen under a table, but her hand does not reach it. A boy is standing nearby, watching. | The boy will get the pen for the woman. | Does the boy get the pen for the woman? |
| A marble is at the bottom of a tall, narrow tube on a table. Two tools are available nearby: a long, thin stick and a short, thick stick. A child first holds their hand up against the side of the tube to gauge its height, and then extends their hand towards the area where the two sticks are placed. | Child will pick the long thin stick. | Does the child pick the long thin stick? |
| A woman holds a colorful toy next to her face. A child is nearby. | The woman and the child will shift their gaze back and forth between the toy and each other's faces, establishing coordinated joint attention. | Do the woman and the child shift their gaze back and forth between the toy and each other's faces? |
| A woman is sitting on an empty park bench. A man sits down right next to her. She gets a tense, uncomfortable feeling. | The woman will shift away from the man to increase her personal space. | Does the woman shift away from the man to increase her personal space? |
| A woman is next in line at a coffee counter. A man walks up and stands in front of her, and she stares at the back of his head with a frown. | The woman will verbally correct the man, telling him to go to the back of the line. | Does the woman show displeasure or confront the man about cutting the line? |
| A woman and a boy are standing face-to-face, and the woman holds a toy car between them. | The woman and the boy will engage in reciprocal, back-and-forth gazes and interactions between the toy and each other's faces. | Do the woman and the boy engage in reciprocal, back-and-forth gazes between the toy and each other's faces? |
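Each demo above pairs a prompt with a ground truth and a yes/no question. As a minimal sketch of how such a triple might be represented in code, the snippet below uses a hypothetical `DemoItem` record and a hypothetical `is_success` helper; neither name comes from the SVBench codebase. It assumes every question is phrased so that "yes" corresponds to the ground truth, which holds for the examples listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoItem:
    """One benchmark example (hypothetical structure, for illustration only)."""
    prompt: str        # scene description given to the video generation model
    ground_truth: str  # socially expected outcome, not shown to the model
    question: str      # yes/no question asked about the generated video


def is_success(judge_answer: str) -> bool:
    """Assumption: an answer beginning with 'yes' means the ground truth was met."""
    return judge_answer.strip().lower().startswith("yes")


item = DemoItem(
    prompt=(
        "A woman directs her gaze toward a red toy on the floor. "
        "A boy is beside her, looking at the woman."
    ),
    ground_truth="Boy will look at the red toy on the floor.",
    question="Does the boy look at the red toy on the floor?",
)

print(item.question)
print(is_success("Yes, the boy turns toward the toy."))
```

Keeping the ground truth in a separate field from the prompt mirrors the separation between action descriptions and expected outcomes that the benchmark describes.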
Overview
SVBench framework overview
Pipeline Framework
SVBench employs a fully training-free, agent-based pipeline to construct and evaluate social reasoning tasks for video generation. The framework consists of two main stages:
Agent-Based Generation Stage
This stage transforms abstract psychological paradigms into concrete, video-ready prompts through three specialized agents:
- Experiment Understanding Agent
  - Processes psychological experiment descriptions
  - Distills underlying social reasoning mechanisms
  - Creates structured conceptual blueprints with key concepts, test points, and ground truths
  - Ensures downstream generation is grounded in cognitive constructs rather than superficial details
- Prompt Synthesis Agent
  - Translates abstract cognitive paradigms into visually grounded scenarios
  - Generates action-oriented descriptions using only observable behaviors
  - Ensures temporal feasibility (5–10 second clips)
  - Creates concrete instantiations with specific entities and contexts
  - Maintains separation between action descriptions and expected outcomes
- Critic Agent
  - Enforces conceptual neutrality by removing interpretive language
  - Detects and eliminates ground-truth leakage
  - Generates difficulty-controlled variants (easy, medium, hard) by manipulating social cues
  - Provides structured feedback for iterative refinement
  - Ensures validated, difficulty-controlled prompts for each experiment
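The data passed between the three agents can be sketched as follows. This is an illustration under stated assumptions, not the SVBench implementation: the class names, field names, and the `INTERPRETIVE_WORDS` list are hypothetical, and the critic here is a simple rule-based stand-in for what the pipeline describes as an agent.

```python
import re
from dataclasses import dataclass


@dataclass
class Blueprint:
    """Hypothetical output of the Experiment Understanding Agent."""
    paradigm: str
    key_concepts: list[str]
    test_points: list[str]
    ground_truth: str


@dataclass
class PromptCandidate:
    """Hypothetical output of the Prompt Synthesis Agent.

    The observable action description and the expected outcome are kept apart.
    """
    action_description: str
    expected_outcome: str
    difficulty: str = "medium"  # one of: easy, medium, hard


# Assumption: a small word list standing in for "interpretive language".
INTERPRETIVE_WORDS = {"wants", "intends", "believes", "realizes", "feels", "knows"}


def critic_feedback(candidate: PromptCandidate) -> list[str]:
    """Rule-based stand-in for the Critic Agent: returns a list of issues."""
    issues = []
    description = candidate.action_description.lower()
    words = set(re.findall(r"[a-z']+", description))
    flagged = sorted(words & INTERPRETIVE_WORDS)
    if flagged:
        issues.append("interpretive language: " + ", ".join(flagged))
    # Leakage check: the expected outcome must not appear verbatim in the prompt.
    if candidate.expected_outcome.lower().rstrip(".") in description:
        issues.append("ground-truth leakage: outcome stated in the prompt")
    return issues


clean = PromptCandidate(
    action_description=(
        "A woman repeatedly reaches for a pen under a table, but her hand "
        "does not reach it. A boy is standing nearby, watching."
    ),
    expected_outcome="The boy will get the pen for the woman.",
)
leaky = PromptCandidate(
    action_description=(
        "A boy knows the woman needs the pen. "
        "The boy will get the pen for the woman."
    ),
    expected_outcome="The boy will get the pen for the woman.",
)

print(critic_feedback(clean))
print(critic_feedback(leaky))
```

In an iterative setup, a non-empty issue list would be handed back to the prompt synthesis step as the structured feedback for the next revision.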
Agent-Based Evaluation Stage
Generated videos are assessed using a Vision-Language Model (VLM) judge across five discrete, interpretable dimensions:
- D1: Core Paradigm Replication - Correctness of psychological phenomenon instantiation
- D2: Prompt Faithfulness - Adherence to specified agents, objects, and scenes
- D3: Social Coherence - Causal and social plausibility of agent behaviors
- D4: Social Cue Effectiveness - Quality of perceptual cues (gaze, gestures, etc.)
- D5: Video Plausibility - Visual stability baseline (isolates generation vs. reasoning failures)
Each dimension receives a binary score $D_k \in \{0, 1\}$, and the overall score is the average over the five dimensions: $S_{\text{overall}} = \frac{1}{5} \sum_{k=1}^{5} D_k$
This discrete evaluation approach enhances robustness by framing assessment as unambiguous factual questions, aligning more closely with human categorical judgments and substantially reducing inter-trial variance.
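The per-video score can be computed as in the short sketch below. The function name `overall_score` and the dictionary keys are assumptions chosen for illustration; only the averaging rule over five binary dimensions comes from the description above.

```python
DIMENSIONS = ("D1", "D2", "D3", "D4", "D5")


def overall_score(scores: dict[str, int]) -> float:
    """Average of the five binary dimension scores for one video."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if scores[d] not in (0, 1):
            raise ValueError(f"{d} must be 0 or 1, got {scores[d]!r}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Example: the video is correct on everything except social coherence (D3).
video = {"D1": 1, "D2": 1, "D3": 0, "D4": 1, "D5": 1}
print(overall_score(video))  # 0.8
```

Because D5 is scored separately from D1 to D4, a low overall score can be traced to either a visual generation failure or a reasoning failure by inspecting the individual entries.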
Key Features
✅ First benchmark for social reasoning in video generation
✅ Grounded in developmental and social psychology
✅ 30 classic experiments spanning 7 core social cognition dimensions
✅ Fully training-free, scalable pipeline
✅ Difficulty-controlled scenario generation
✅ Interpretable 5-dimensional evaluation framework
✅ Comprehensive evaluation of 8 state-of-the-art video generation models
Benchmark Results
Our extensive evaluation across eight state-of-the-art video generation systems reveals:
- Top performers (Sora2-Pro: 79.6%, Veo-3.1: 72.4%) show strong implicit priors for human motion causality and intention-driven interactions
- Mid-tier models (Hailuo02-S: 56.4%, Kling2.5-Turbo: 52.2%, Wan2.2: 48.3%) struggle with coordinated multi-agent behavior and abstract social inference
- Open-source models (HunyuanVideo: 30.8%, LongCat-Video: 39.2%, LTX-1.0: 27.6%) operate at substantially lower performance levels
- Critical gap: Models excel at physical reasoning but lack explicit mechanisms for social reasoning
Experimental Results of Selected 15 Tasks Across 8 Models (Performance in %)
| Task Dimension | Sub-Task | Hailuo02-S | Kling2.5-Turbo | Sora2-Pro | Veo-3.1 | HunyuanVideo | LongCat-Video | LTX-1.0 | Wan2.2 |
|---|---|---|---|---|---|---|---|---|---|
| Goal-Directed Action | Detour Reaching | 51.4 | 48.6 | 68.6 | 62.9 | 31.4 | 28.6 | 17.1 | 42.2 |
| | Tool Selection | 55.0 | 45.0 | 85.0 | 85.0 | 17.5 | 28.6 | 30.0 | 57.8 |
| Joint Attention & Perspective | Gaze Following | 44.4 | 33.3 | 62.2 | 68.9 | 35.6 | 44.4 | 28.9 | 28.9 |
| | Pointing Comprehension | 75.0 | 67.5 | 87.5 | 82.5 | 30.0 | 57.1 | 22.9 | 71.1 |
| | Joint Engagement | 45.0 | 45.0 | 82.5 | 82.5 | 50.0 | 31.4 | 30.0 | 44.4 |
| Social Coordination | Turn Taking | 45.7 | 62.9 | 94.3 | 85.7 | 37.1 | 57.1 | 40.0 | 62.2 |
| | Leader Follower Coord | 40.0 | 44.4 | 77.8 | 64.4 | 17.8 | 35.6 | 17.8 | 48.9 |
| Emotion & Prosocial Behavior | Emotion Contagion | 66.7 | 75.6 | 82.2 | 88.9 | 46.7 | 65.0 | 57.8 | 88.9 |
| | Instrumental Helping | 68.9 | 55.6 | 62.2 | 48.9 | 31.1 | 33.3 | 24.4 | 35.6 |
| | Empathic Concern | 80.0 | 76.0 | 100.0 | 100.0 | 24.0 | 60.0 | 20.0 | 66.7 |
| | Costly Helping | 57.5 | 37.5 | 95.0 | 75.0 | 22.5 | 31.4 | 15.0 | 37.8 |
| Social Norms & Spacing | Proxemics Personal Space | 47.5 | 47.5 | 65.0 | 45.0 | 25.0 | 25.0 | 35.0 | 22.2 |
| | Queue Behavior | 55.6 | 40.0 | 82.5 | 75.6 | 33.3 | 20.0 | 20.0 | 20.0 |
| | Dominance Display | 80.0 | 75.0 | 75.0 | 82.5 | 35.0 | 30.0 | 37.5 | 64.4 |
| Multi-Agent Social Strategy | Helping Based on Visual Perspective | 42.2 | 42.2 | 84.4 | 53.3 | 24.4 | 40.0 | 17.8 | 33.3 |
| Overall | - | 56.4 | 52.2 | 79.6 | 72.4 | 30.8 | 39.2 | 27.6 | 48.3 |
These findings highlight the fundamental distinction between perceptual realism and socially coherent behavior, revealing where current generation systems succeed or fundamentally fail in generating socially intelligent video content.
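For readers who want to work with the numbers directly, the sketch below ranks the eight models by the Overall row of the results table and reports the spread between the strongest and weakest model. The values are copied from that row; the variable names are illustrative only.

```python
# Overall scores (%) copied from the "Overall" row of the results table.
OVERALL = {
    "Hailuo02-S": 56.4,
    "Kling2.5-Turbo": 52.2,
    "Sora2-Pro": 79.6,
    "Veo-3.1": 72.4,
    "HunyuanVideo": 30.8,
    "LongCat-Video": 39.2,
    "LTX-1.0": 27.6,
    "Wan2.2": 48.3,
}

# Sort models from highest to lowest overall score.
ranking = sorted(OVERALL.items(), key=lambda kv: kv[1], reverse=True)

for rank, (model, score) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {score:.1f}%")

# Spread between the best and worst model, rounded to one decimal place.
gap = round(ranking[0][1] - ranking[-1][1], 1)
print(f"Best-to-worst gap: {gap} percentage points")
```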