4DVLT

June 22, 2026

Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking

Language should recover an entity through space and time—not merely point to a box in one frame.

Chaoyue Li¹,†, Boxue Yang²,†, Shengyao Zhou³,†,
Haoyang Wu¹, Rui Qian², Linfeng Zhang²,*

¹Huazhong University of Science and Technology · ²Shanghai Jiao Tong University · ³Zhejiang University
†Equal contribution · *Corresponding author

Overview · Instruct-4D · 4DTrack · Video Demos · Results · Release

Figure: 4DVLT task overview

Overview

Dynamic scene understanding is not only about finding an object once. A capable model must keep the referred physical entity stable as appearance, visibility, camera view, and motion evolve—and express that entity consistently in metric 3D space and synchronized 2D views.

We introduce 4D Vision-Language Tracking (4DVLT), a worldline-centered task for instruction-conditioned understanding of fully observed multi-view video. Its central representation is a worldline: a persistent, object-centric structure that binds semantic identity, metric 3D motion, and synchronized multi-view 2D projections across time.
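To make the worldline idea concrete, here is a minimal sketch of how such a structure could be laid out in Python. The class names, field names, and box convention are illustrative assumptions; the project has not published its actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]       # metric position, in metres
Box2D = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), in pixels


@dataclass
class WorldlineState:
    """One timestamp of a worldline (hypothetical layout)."""
    timestamp: float
    position_3d: Point3D
    # Camera id -> 2D box; a camera is absent when the entity is not visible in it.
    boxes_2d: Dict[str, Box2D] = field(default_factory=dict)


@dataclass
class Worldline:
    """A persistent entity: one identity bound to time-ordered states."""
    entity_id: str
    description: str
    states: List[WorldlineState] = field(default_factory=list)

    def add_state(self, state: WorldlineState) -> None:
        # Enforce strictly increasing timestamps so the path stays well ordered.
        if self.states and state.timestamp <= self.states[-1].timestamp:
            raise ValueError("states must be added in increasing time order")
        self.states.append(state)

    def trajectory_3d(self) -> List[Point3D]:
        return [s.position_3d for s in self.states]

    def boxes_for_view(self, camera_id: str) -> List[Tuple[float, Box2D]]:
        # Only timestamps where the entity is visible in this camera.
        return [(s.timestamp, s.boxes_2d[camera_id])
                for s in self.states if camera_id in s.boxes_2d]


# Hypothetical example values, not taken from the benchmark.
wl = Worldline(entity_id="ped_07", description="pedestrian in a red coat")
wl.add_state(WorldlineState(0.0, (1.0, 2.0, 0.0), {"cam0": (10, 20, 50, 120)}))
wl.add_state(WorldlineState(0.5, (1.5, 2.0, 0.0), {"cam0": (14, 20, 54, 120),
                                                   "cam1": (300, 40, 340, 150)}))
print(wl.trajectory_3d())
print(wl.boxes_for_view("cam1"))
```

Keeping the per-view boxes inside each timestamped state is one way to guarantee that the 2D projections stay synchronized with the 3D path; a real implementation may organize this differently.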

The project contributes two complementary pieces:

Instruct-4D, a benchmark that turns spatial, temporal, geometric, and motion clues into language-conditioned worldline queries.
4DTrack, a framework that organizes observations into a 4D state graph, contracts ambiguity through metric-guided routing, and decodes a physically coherent target worldline.

Instruct-4D

129.4K instructions · 64.7K target worldlines · 851 scenes · 9 query types

Instruct-4D combines two complementary settings: EgoWL, built from dynamic egocentric driving scenes, and AlloWL, built from calibrated multi-camera pedestrian scenes. Its nine query types span target grounding and metric localization as well as temporal and worldline understanding, including disambiguation, reverse reasoning, trajectory shape, kinematic shift, and motion residual.

Figure: Instruct-4D benchmark construction, query types, statistics, and metrics

The benchmark separates two questions that are often conflated: did the model identify the referred entity, and did it recover a faithful worldline for that entity? Accordingly, TGA_Top1 and TGA measure first-timestamp and sequence-level grounding, respectively, while WQS evaluates worldline quality unconditionally and CTQ evaluates it only for correctly grounded targets.
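The difference between an unconditional and a grounding-conditioned quality score can be illustrated with a small sketch. The exact definitions of TGA, WQS, and CTQ are not given here, so the function below is an assumption: it takes a hypothetical per-sample quality score in [0, 1] and a grounding flag, and reports generic averages rather than the benchmark's official metrics.

```python
from typing import Dict, List, Tuple


def quality_summary(samples: List[Tuple[bool, float]]) -> Dict[str, float]:
    """Summarize (grounded_correctly, quality) pairs as percentages.

    Assumed, illustrative definitions:
      grounding_rate        - share of samples with the right entity
      unconditional_quality - mean quality over all samples
      conditional_quality   - mean quality over correctly grounded samples only
    """
    if not samples:
        raise ValueError("samples must not be empty")
    all_q = [q for _, q in samples]
    hit_q = [q for ok, q in samples if ok]
    return {
        "grounding_rate": 100.0 * len(hit_q) / len(samples),
        "unconditional_quality": 100.0 * sum(all_q) / len(all_q),
        # With no correctly grounded sample the conditional mean is undefined;
        # this sketch reports 0.0 in that case.
        "conditional_quality": 100.0 * sum(hit_q) / len(hit_q) if hit_q else 0.0,
    }


# Hypothetical scores: two correctly grounded samples, two misses.
summary = quality_summary([(True, 1.0), (True, 0.5), (False, 0.25), (False, 0.25)])
print(summary)
```

Under these assumptions the conditional score isolates how good a worldline is once the right entity has been found, while the unconditional score also absorbs the cost of grounding failures.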

4DTrack

Figure: 4DTrack framework

4DTrack casts 4DVLT as query-conditioned worldline inference:

An object-centric 4D state graph links candidate physical states across time and views.
Metric-guided routing uses language, geometry, and reachability to retain a query-relevant subgraph.
Bidirectional worldline decoding exploits the fully observed clip to resolve non-local temporal dependencies.
Kinematic-prior joint decoding calibrates the recovered path toward physically plausible motion.
A view-aware alignment stage produces a unified 3D trajectory and synchronized multi-view 2D boxes.
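The first two steps above can be pictured with a toy example. This is not the 4DTrack implementation: the `Candidate` fields, the `query_score` threshold, and the speed-based reachability test are all hypothetical stand-ins chosen to show how a graph over candidate states can be contracted to a query-relevant subgraph.

```python
from dataclasses import dataclass
from math import dist
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    node_id: str
    frame: int
    position: Tuple[float, float, float]  # metres
    query_score: float                    # hypothetical language-match score in [0, 1]


def build_edges(cands: List[Candidate], max_speed: float, dt: float) -> Dict[str, List[str]]:
    """Link candidates in consecutive frames when the move is physically reachable."""
    by_frame: Dict[int, List[Candidate]] = {}
    for c in cands:
        by_frame.setdefault(c.frame, []).append(c)
    edges: Dict[str, List[str]] = {c.node_id: [] for c in cands}
    for frame, nodes in by_frame.items():
        for a in nodes:
            for b in by_frame.get(frame + 1, []):
                if dist(a.position, b.position) <= max_speed * dt:
                    edges[a.node_id].append(b.node_id)
    return edges


def route(cands: List[Candidate], edges: Dict[str, List[str]], min_score: float) -> Set[str]:
    """Keep well-scored nodes that lie on a path spanning the first to the last frame."""
    keep = {c.node_id for c in cands if c.query_score >= min_score}
    first = min(c.frame for c in cands)
    last = max(c.frame for c in cands)
    # Forward pass: nodes reachable from a kept node in the first frame.
    fwd = {c.node_id for c in cands if c.frame == first and c.node_id in keep}
    for c in sorted(cands, key=lambda c: c.frame):
        if c.node_id in fwd:
            fwd.update(n for n in edges[c.node_id] if n in keep)
    # Backward pass: nodes from which a kept node in the last frame is reachable.
    bwd = {c.node_id for c in cands if c.frame == last and c.node_id in keep}
    for c in sorted(cands, key=lambda c: -c.frame):
        if c.node_id in keep and any(n in bwd for n in edges[c.node_id]):
            bwd.add(c.node_id)
    return fwd & bwd


# Hypothetical candidates: track "a" matches the query throughout, track "b" does not.
cands = [
    Candidate("a0", 0, (0.0, 0.0, 0.0), 0.9), Candidate("b0", 0, (10.0, 0.0, 0.0), 0.8),
    Candidate("a1", 1, (1.0, 0.0, 0.0), 0.9), Candidate("b1", 1, (11.0, 0.0, 0.0), 0.2),
    Candidate("a2", 2, (2.0, 0.0, 0.0), 0.9), Candidate("b2", 2, (12.0, 0.0, 0.0), 0.8),
]
edges = build_edges(cands, max_speed=2.0, dt=1.0)
subgraph = route(cands, edges, min_score=0.5)
print(sorted(subgraph))
```

In this toy setup, `b0` and `b2` pass the score threshold individually but are dropped because no retained path connects them, which is the kind of ambiguity a purely per-frame filter would miss.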

The ablations reveal a metric-specific division of labor. Routing accounts for nearly all of the first-timestamp grounding gain (TGA_Top1), while graph structure, bidirectional decoding, and kinematic calibration contribute mainly to sequence-level grounding (TGA) and the worldline-quality metrics (WQS, CTQ). The modules therefore form a coupled inference chain with distinct responsibilities rather than interchangeable sources of improvement.

Video Demos

The following videos show complete predicted worldlines across time, metric 3D space, and synchronized camera views. Click any entry to open the corresponding MP4 directly on GitHub.

Capability	Demo
3D Volume Geometry	▶ Watch video
Disambiguation	▶ Watch video
Kinematic Shift	▶ Watch video
Reverse Reasoning	▶ Watch video
Spatiotemporal Anchor	▶ Watch video
Motion Residual	▶ Watch video
Trajectory Shape	▶ Watch video

Results

4DTrack consistently improves matched multimodal backbones under the shared Instruct-4D evaluation interface. With Qwen3.5-9B, the full framework reaches 62.68 TGA_Top1, 51.93 TGA, 55.18 WQS, and 85.57 CTQ, while reducing 3D trajectory error from 13.71 m to 3.67 m ADE_3D.

Model	TGA_Top1 ↑	TGA ↑	WQS ↑	CTQ ↑	ADE_3D (m) ↓	SR_3D@1m ↑
Qwen3.5-9B	14.12	10.13	13.99	55.90	13.71	11.38
4DTrack-Qwen3.5-9B	62.68	51.93	55.18	85.57	3.67	58.27
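For readers unfamiliar with the trajectory columns, the sketch below shows one common way to compute an average displacement error and a thresholded success rate. It assumes ADE_3D is the mean Euclidean distance between predicted and ground-truth 3D positions, and that SR_3D@1m is the share of timestamps within 1 m; the benchmark's official definitions may differ (for example, success could be scored per trajectory).

```python
from math import dist
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def ade_3d(pred: List[Point3D], gt: List[Point3D]) -> float:
    """Mean Euclidean distance (metres) between aligned 3D positions."""
    if not pred or len(pred) != len(gt):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def success_rate_3d(pred: List[Point3D], gt: List[Point3D], threshold_m: float = 1.0) -> float:
    """Percentage of timestamps whose 3D error is within threshold_m (assumed definition)."""
    if not pred or len(pred) != len(gt):
        raise ValueError("trajectories must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(pred, gt) if dist(p, g) <= threshold_m)
    return 100.0 * hits / len(pred)


# Hypothetical trajectories: the middle prediction is 2 m off, the others are exact.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0), (2.0, 0.0, 0.0)]
print(ade_3d(pred, gt), success_rate_3d(pred, gt))
```

The two numbers are complementary: a single large miss inflates the average error, while the success rate only records how often the prediction stays inside the tolerance.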

Figure: Per-query grounding and trajectory analysis

The query-level analysis shows where worldline-centered structure matters most. Performance is strongest when an instruction can be expressed directly through metric position or motion—such as absolute 3D position, trajectory shape, and motion residual. Dense same-category disambiguation remains the most difficult regime, especially when nearby candidates share appearance and motion patterns.

Release Status

Artifact	Status
Source code	Coming soon
Instruct-4D benchmark	Coming soon
Model checkpoints	Coming soon
arXiv preprint	Coming soon

Citation metadata and download instructions will be added with the public release.

License

This project is released under the MIT License.