June 24, 2026

Awesome-Streaming-Video-Understanding

🤖 Building the Eyes & Mind of J.A.R.V.I.S. — One Frame at a Time

🔥 The most comprehensive list of papers, code, and datasets for
real-time, always-on, interactive video AI.

🔥 News

[2026.06] 🔥🔥🔥 We release our survey, Towards Online Interactors: A Comprehensive Survey on Streaming Video Understanding! Check it out and give us a ⭐ if you find it helpful!

This repository provides a curated collection of research papers, models, and datasets focused on Streaming (Online) Video Understanding. The field aims to develop AI assistants capable of J.A.R.V.I.S.-like continuous multimodal perception and interaction. Unlike traditional offline video understanding, where models have access to the complete video beforehand, streaming models must operate under real-time, causal constraints: frames arrive sequentially, and decisions at any moment can only rely on past and present information, without the ability to rewind or preview future content.

This paradigm introduces two fundamental challenges:

Proactive Decision-Making (When to Act): Determining the optimal moment to generate a response, ask for clarification, or remain silent.
Efficient Resource Management (How to Sustain): Managing ever-growing context (memory/KV cache) and computational load for perpetual, real-time processing.
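To make the "ever-growing context" concern concrete, the sketch below estimates KV cache size for a standard decoder-only transformer. The formula (one key and one value tensor per layer) is the usual one. Every number in the example (32 layers, 8 KV heads, head dimension 128, fp16 values, 64 visual tokens per frame, 1 FPS) is a hypothetical configuration chosen for illustration and is not taken from any model in this list.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

# Hypothetical configuration (an assumption, not a specific model).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
TOKENS_PER_FRAME, FPS = 64, 1

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
one_hour_tokens = 3600 * FPS * TOKENS_PER_FRAME
one_hour_gib = kv_cache_bytes(one_hour_tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30

print(per_token)     # 131072 bytes (128 KiB) per cached token
print(one_hour_gib)  # 28.125 GiB after one hour of video
```

Under these assumed settings the cache grows linearly with stream length, which is why the methods below bound, compress, or offload it.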

The repository is organized to reflect these core challenges and the supporting ecosystem:

🔔 Proactive Streaming Models: Approaches for deciding when to interact, including token-driven triggering (EOS), dedicated classifiers, perplexity validation, and visual-based detection.
📺 Reactive Streaming Models: Techniques for efficient long-context processing, covering KV cache management, hierarchical memory, retrieval-augmentation, and computational optimizations.
📊 Benchmarks & Datasets: Key datasets for evaluating capabilities in multi-turn dialogue, real-time captioning, and proactive timing.

This list serves as a reference for researchers and practitioners exploring the frontier of always-on, interactive video AI systems. Love this awesome list? Help others discover it by starring the repository! ⭐

🔔 Proactive Streaming Models

Token-Driven Triggering via EOS / Action Token

Models that decide actions (Speak, Wait, or others) by generating specific tokens or action probabilities within the sequence. Typically, they learn through autoregressive prediction, where an EOS token represents silence and regular language tokens represent a response. Because the trigger shares the language-modeling objective, this approach may affect the model's general-purpose capabilities.
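A minimal sketch of the EOS-as-silence idea, assuming the model exposes next-token logits at each frame. The function names, the toy 4-token vocabulary, the choice of index 0 as EOS, and the 0.5 threshold are all hypothetical and do not come from any listed paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide_action(next_token_logits, eos_id, silence_threshold=0.5):
    """Return "wait" if the EOS probability reaches the threshold, else "speak"."""
    p_eos = softmax(next_token_logits)[eos_id]
    return "wait" if p_eos >= silence_threshold else "speak"

# Toy logits for two frames; index 0 plays the role of EOS (assumption).
frame_logits = [
    [4.0, 0.1, 0.2, 0.0],  # EOS dominates, so the model stays silent
    [0.0, 3.0, 0.5, 0.2],  # a language token dominates, so the model responds
]
actions = [decide_action(logits, eos_id=0) for logits in frame_logits]
print(actions)  # ['wait', 'speak']
```

Thresholding the EOS probability, rather than taking the argmax, is one way to trade off missed responses against spurious ones.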

Paper	Model	Date	Link	Venue	Method / Key Contribution
Thinking in Streaming Video	ThinkStream	2026/03	Link	ECCV 2026	Watch–Think–Speak Streaming Reasoning: Introduces a streaming video reasoning framework that decides when to respond, and uses reasoning-compressed streaming memory (RCSM) to compress reasoning history and replace outdated visual tokens for low-latency, memory-efficient streaming.
StreamingClaw Technical Report	StreamingClaw	2026/03	Link	arXiv	Dedicated Trigger Tokens: Customizes dedicated trigger tokens for different scenarios and proposes a unified agent framework that tackles the fundamental limitations in the real-time perception-action closed loop of existing embodied architectures.
Streaming Video Instruction Tuning	Streamo	2025/12	Link	arXiv	State-Token Unified Triggering: Introduces explicit response state tokens (Silence / Standby / Response) and integrates when to respond and what to say into a single autoregressive sequence; applies focal-weighted loss to mitigate extreme state imbalance.
MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning	MMDuet2	2025/12	Link	ICLR 2026	RL-based Reply/Silence Decision: Formulates proactive interaction as a per-turn text decision where the model outputs either a response or "NO REPLY". Trained via multi-turn RL with a PAUC-inspired reward that encourages early and correct responses without reply-time annotations.
Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video	VideoLLM-EyeWO	2025/10	Link	NeurIPS 2025	Active Perception & Action: Predicts 3 actions (Silence, Respond, Ask-High-Res); proactively requests high-res frames when uncertain to ensure just-in-time accuracy.
Proactive Assistant Dialogue Generation from Streaming Egocentric Videos	ProAssist	2025/06	Link	EMNLP 2025	EOS-Based Trigger: Predicts [EOS] token to remain silent or generates text to respond at each frame. Uses Negative Frame Sub-sampling to handle class imbalance between silence and speaking.
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	LiveCC	2025/04	Link	CVPR 2025	EOS-Based: Trains on large-scale streaming ASR data. At inference, the model predicts [EOS] to stay silent or generates commentary tokens frame-by-frame, enabling real-time play-by-play narration.
AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis	AssistPDA	2025/03	N/A	arXiv	EOS-Based: Predicts [EOS] probability to decide whether to output an anomaly alert/prediction. Features a STRD module to distill offline temporal reasoning into online inference.
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant	LION-FS	2025/03	Link	CVPR 2025	EOS-Based + Fast-Slow Architecture: Uses a Fast Path to efficiently determine when to respond (via token prediction) and a Slow Path with multi-granularity keyframe augmentation to generate detailed responses only when needed.
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation	VideoLLM-MoD	2024/08	N/A	NeurIPS 2024	EOS-Based + MoD Efficiency: Inherits [EOS] token prediction for proactive triggering. Key contribution is Mixture-of-Depths, dynamically skipping redundant vision token computation to enable efficient streaming.
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction	STREAM-VLM	2024/07	Link	NeurIPS 2024	Special Action Tokens Triggering: Uses two special action tokens, <next> (lets the model opt to say nothing and request the next video frame; 3D CNN) and <feedback> (generates a response; LLM), to enable proactive feedback.
VideoLLM-online: Online Video Large Language Model for Streaming Video	VideoLLM-online	2024/06	Link	CVPR 2024	Streaming EOS: Pioneered the Streaming EOS training objective. The model predicts an [EOS] token at each frame to decide whether to stay silent or generate a response, enabling real-time, proactive interaction.

Dedicated Classification Heads / Detectors

Models that use a lightweight detector, router head, or auxiliary module to trigger responses. A binary classification module determines whether to remain silent or to respond.
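The binary decision described above can be sketched as a logistic head over a per-frame feature vector. This is an illustration under stated assumptions: the two-dimensional features, the weights, the bias, and the 0.5 threshold are hypothetical, and real systems train the head on top of VLM hidden states.

```python
import math

def speak_probability(features, weights, bias):
    """Logistic head: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def should_respond(features, weights, bias, threshold=0.5):
    """Respond only when the head's probability exceeds the threshold."""
    return speak_probability(features, weights, bias) > threshold

# Hypothetical trained parameters and two toy frame features.
weights, bias = [2.0, -1.0], -0.5
quiet_frame = [0.1, 0.9]  # z = 0.2 - 0.9 - 0.5 = -1.2
event_frame = [1.5, 0.2]  # z = 3.0 - 0.2 - 0.5 = 2.3

print(should_respond(quiet_frame, weights, bias))  # False
print(should_respond(event_frame, weights, bias))  # True
```

Keeping the head separate from the language model means response timing can be tuned (for example by moving the threshold) without touching generation.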

Paper	Model	Date	Link	Venue	Method / Key Contribution
STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding	STRIDE	2026/03	Link	arXiv	Sequence-Denoising Activation: Reformulates proactive triggering as span-level activation sequence modeling instead of point-wise binary decisions. Uses a lightweight masked diffusion activation module to refine when-to-speak signals over sliding windows and produce more temporally coherent triggers.
Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding	Em-Garde	2026/03	Link	arXiv	Propose-Match Triggering: Decouples semantic understanding from streaming perception by parsing user queries into visual proposals at query time, then using a lightweight embedding-based Proposal Matching Module to detect similarity surges and trigger responses.
StreamReady: Learning What to Answer and When in Long Streaming Videos	StreamReady	2026/03	N/A	CVPR 2026	Readiness-Head Trigger: Introduces a learnable readiness token monitored by a lightweight Readiness Head (MLP) that outputs a score ∈ [0, 1]. It triggers a response only when the score exceeds a threshold. Trained via contrastive loss between pseudo-positive/negative temporal regions.
Proact-VL: A Proactive VideoLLM for Real-Time AI Companions	Proact-VL	2026/03	Link	ICML 2026	FLAG-Token Response Head: Introduces a chunk-wise streaming framework for real-time AI companions, using a special <|FLAG|> token and a lightweight gated response head to decide when to speak at each second, enabling timely live commentary and guidance.
ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding	ROMA	2026/01	Link	arXiv	Speak Head Trigger: Unifies proactive and reactive streaming audio-video interaction with synchronized multimodal units and chunked TMRoPE. Introduces a lightweight speak head parallel to the LM head to explicitly predict when to respond, decoupling response timing from content generation for event alerts, real-time narration, and reactive QA.
Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding	ToM	2025/12	N/A	arXiv	Trigger-centric Responding: Introduces TV-Online and an agent-like paradigm that continuously processes streaming inputs and decides whether to respond or remain silent, trained with progressive training and reinforcement objectives.
Open-ended Hierarchical Streaming Video Understanding with Vision Language Models	OpenHOUSE	2025/09	N/A	ICCV 2025	Detector-Triggered Hierarchical Captioning: Uses a lightweight Streaming Module (RNN) to detect action boundaries (hybrid actionness/progress). Triggers the frozen VLM only at detected boundaries to generate hierarchical (substep/step) descriptions.
StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding	StreamAgent	2025/08	N/A	arXiv	Agent-as-Detector: Uses a separate, lightweight Anticipatory Agent (Small VLM) to act as a decision module. It plans and predicts future events to trigger the main responder only when necessary, decoupling decision from generation.
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant	StreamBridge	2025/05	Link	NeurIPS 2025	Decoupled Activation Model: Uses a separate, lightweight Activation Model (e.g., 0.5B LLaVA) to detect "when to speak" (triggering), allowing the main offline Video-LLM to be plug-and-play for proactive streaming. Also uses Round-Decayed Compression for memory.
ViSpeak: Visual Instruction Feedback in Streaming Videos	ViSpeak	2025/03	Link	ICCV 2025	Classification Head Trigger: Defines "Visual Instruction Feedback" tasks (e.g., visual wake-up, interruption). Uses a trained binary classification head (Informative Head) on top of the VLM to predict "when to speak" based on visual cues.
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition	StreamMind	2025/03	Link	ICCV 2025	Cognition Gate: Introduces an Event-Gated mechanism. A lightweight Cognition Gate (initialized from LLM shallow layers) continuously monitors the stream and only triggers/invokes the heavy LLM when relevant events occur, enabling 100 FPS processing.
EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild	EgoSpeak	2025/02	Link	NAACL 2025	Classification Head Trigger: EgoSpeak outputs a continuous speak probability that a conversational agent can use in real time (e.g., by triggering speech once the probability surpasses a threshold).
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction	Dispider	2025/01	Link	CVPR 2025	Disentangled Decision Module: Decouples Perception (streaming), Decision (when to speak), and Reaction (generation) into asynchronous modules. Uses a lightweight decision model to trigger the heavy reaction model only when needed.
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format	MMDuet	2024/11	Link	EMNLP 2025	Dual-Head Trigger: Trains two binary classification heads (Informative Head & Relevance Head) to decide when to interrupt the video stream and generate a response. Enables "Duet" interaction format.
Streamlined Dense Video Captioning	SDVC	2019/04	Link	CVPR 2019	Event Sequence Generation: Uses an Event Sequence Generation Network (Pointer Net) to adaptively select a sequence of event proposals, which then triggers the captioning network. (Note: Offline method).

Uncertainty & Perplexity Validation

Models that monitor perplexity (PPL) spikes or uncertainty scores to initiate interaction. Previously spoken content is re-scored against each new frame: low perplexity indicates that the content still holds, so no repeated decoding is needed (silent); high perplexity indicates that the frame contains new content, triggering a response.
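The check can be sketched as follows, assuming the model can return per-token log-probabilities of the previous response conditioned on the newest frame. The helper names, the log-probability values, and the threshold of 5.0 are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def needs_new_response(token_logprobs, ppl_threshold=5.0):
    """High perplexity means the old response no longer fits the new frame."""
    return perplexity(token_logprobs) > ppl_threshold

# Log-probs of the previous caption's tokens, re-scored on a new frame (toy values).
still_valid = [-0.1, -0.2, -0.1]  # caption still likely, so stay silent
stale = [-2.0, -3.0, -2.5]        # caption now unlikely, so respond

print(needs_new_response(still_valid))  # False
print(needs_new_response(stale))        # True
```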

Paper	Model	Date	Link	Venue	Method / Key Contribution
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding	LiveStar	2025/11	Link	NeurIPS 2025	PPL-Based Verification (SVeD): Uses Streaming Verification Decoding (SVeD) which calculates the perplexity (PPL) of the generated caption to verify its validity. If PPL indicates high confidence/necessity, it triggers a response; otherwise, it stays silent.

Visual Change / Event-based Trigger

Models that trigger responses based on significant changes in the visual stream or detected events. A frame with substantial visual change is treated as a cue for a new response, while a frame with minimal change is assumed to show content that has already been covered.
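As a deliberately simple stand-in for this family, the sketch below triggers on the mean absolute pixel difference against the last triggering frame. The listed methods work on richer signals (token drop ratios, grayscale affinity, query relevance); the raw-pixel metric, the function names, and the 0.2 threshold here are assumptions for illustration only.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two equally sized flat pixel lists."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def change_triggers(frames, threshold=0.2):
    """Return indices of frames that differ enough from the last triggering frame."""
    triggers = []
    reference = frames[0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], reference) > threshold:
            triggers.append(i)
            reference = frames[i]  # later frames are compared to the new scene
    return triggers

# Four tiny "frames" with pixel values in [0, 1]; the scene changes at index 2.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.05, 0.0, 0.0, 0.05],
    [0.9, 0.8, 0.9, 1.0],
    [0.9, 0.8, 0.85, 1.0],
]
print(change_triggers(frames))  # [2]
```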

Paper	Model	Date	Link	Venue	Method / Key Contribution
Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing	ColorTrigger	2026/03	Link	CVPR 2026	Grayscale-Guided Color Trigger: Proposes a grayscale-always, color-on-demand paradigm for streaming video sensing. Uses causal windowed grayscale affinity analysis with a lightweight training-free QP trigger and credit-budgeted controller to selectively activate RGB capture, combined with dynamic token routing to reduce sensing and inference costs.
QueryStream	QueryStream	2026/01	Link	ICLR 2026	Training-free Framework: Uses Query-Aware Differential Pruning (QDP) to filter tokens by jointly evaluating semantic relevance and temporal novelty. Designs Relevance-Triggered Active Response (RTAR) policy to dynamically trigger responses.
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos	TimeChat-Online	2025/04	Link	ACM MM 2025	Visual Change Trigger: Uses Differential Token Drop (DTD) to prune redundant tokens. Monitors the token drop ratio; sudden drops indicate scene transitions, which serve as natural triggers for proactive responding.

📺 Reactive Streaming Models

KV Cache Management & Eviction

Methods focusing on optimizing the KV cache by evicting less important tokens (e.g., Heavy Hitter, Sliding Window).
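One common eviction pattern keeps a few initial "sink" entries plus a sliding window of recent ones. The sketch below shows only the bookkeeping, with integers standing in for per-token key/value tensors; the class name and the sizes are hypothetical, and no listed method is reproduced exactly.

```python
from collections import deque

class SinkWindowCache:
    """Keeps the first num_sinks entries forever and the last `window` entries."""

    def __init__(self, num_sinks=2, window=4):
        self.num_sinks = num_sinks
        self.sinks = []
        self.recent = deque(maxlen=window)

    def append(self, entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        else:
            # A full deque with maxlen drops its oldest entry automatically.
            self.recent.append(entry)

    def contents(self):
        return self.sinks + list(self.recent)

cache = SinkWindowCache(num_sinks=2, window=4)
for token_id in range(10):
    cache.append(token_id)

print(cache.contents())  # [0, 1, 6, 7, 8, 9]
```

The cache size is bounded by `num_sinks + window` regardless of stream length; importance-based schemes such as Heavy Hitter replace the "oldest first" rule with an attention-score criterion.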

Paper	Model	Date	Link	Venue	Method / Key Contribution
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously	VST	2026/03	Link	ECCV 2026	Video Streaming Thinking: Enables synchronized video watching and reasoning during playback, using VST-SFT for causal streaming adaptation, VST-RL for multi-turn self-exploration, and KG-grounded streaming CoT data synthesis for multi-evidence reasoning.
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models	TaYS	2026/03	Link	CVPR 2026	Think-as-You-See Streaming Reasoning: Uses streaming attention, decoupled positional encoding, and parallel dual KV-cache to enable causal video reasoning while simultaneously ingesting frames and generating reasoning tokens.
StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding	StreamingAssistant	2025/12	N/A	arXiv	Two-Dimensional Token Pruning: Introduces a novel redundancy metric, MSSAVT; video tokens are processed successively by the temporal pruning module and then the spatial pruning module.
StreamingVLM: Real-Time Understanding for Infinite Video Streams	StreamingVLM	2025/10	Link	arXiv	Streaming-Aware KV Cache: Uses Attention Sinks + Sliding Window (Long Text + Short Vision) with Contiguous RoPE to enable infinite streaming without memory explosion or positional drift. Trains with overlapped-chunk full attention.
StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding	StreamMem	2025/08	Link	arXiv	Query-Agnostic Compression: Uses standard chat template tokens as Proxy Queries to calculate attention scores for Pruning and Merging KV cache, maintaining a fixed memory budget without needing the actual user query.
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling	StreamVLN	2025/07	Link	arXiv	SlowFast Context (Pruning): Combines a Sliding Window (Fast Path) for recent dialogue with a 3D-Aware Token Pruning (Slow Path) to compress historical visual states into a compact memory, enabling long-horizon navigation.
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding	InfiniPot-V	2025/06	Link	NeurIPS 2025	Continual KV Compression: Maintains a fixed memory budget by periodically compressing the KV cache using Temporal-axis Redundancy (TaR) (evicting repetitive frames) and Value-Norm (VaN) (keeping semantically important tokens).
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding	StreamingChat	2025/02	Link	ICLR 2025	Segment-Based KV Cache Bypass: Introduces a training and inference paradigm that splits long videos into sequential segments and conducts multi-turn dialogues per segment, avoiding unbounded KV cache growth.

Hierarchical Memory & Summarization

Methods that compress history into events, super-tokens, or hierarchical structures.
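A minimal sketch of budgeted consolidation: when the memory exceeds its budget, the two most similar temporally adjacent entries are averaged into one. Squared Euclidean distance, plain averaging, and the function names are assumptions; the listed methods use clustering, learned merging, or penalty functions instead.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def consolidate(memory, budget):
    """Merge the closest adjacent pair (by averaging) until len(memory) <= budget."""
    memory = [list(v) for v in memory]  # work on a copy
    while len(memory) > budget:
        i = min(range(len(memory) - 1),
                key=lambda k: squared_distance(memory[k], memory[k + 1]))
        merged = [(x + y) / 2 for x, y in zip(memory[i], memory[i + 1])]
        memory[i:i + 2] = [merged]  # replace the pair with its average
    return memory

# Four toy frame features forming two visually distinct events.
frame_features = [[0.0, 0.0], [0.5, 0.0], [4.0, 4.0], [4.0, 4.25]]
events = consolidate(frame_features, budget=2)
print(events)  # [[0.25, 0.0], [4.0, 4.125]]
```

Restricting merges to adjacent entries preserves temporal order, which matters when the memory is later read as a sequence.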

Paper	Model	Date	Link	Venue	Method / Key Contribution
Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models	TWW	2026/03	Link	ECCV 2026	Memory-Anchored Streaming Reasoning: Builds continuous segment-level memory with three-stage multi-round CoT training and streaming causal modeling; overlaps watching and thinking at inference, improving StreamingBench/OVO-Bench by 2.6%/3.79% and reducing multi-round output tokens by 56%.
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding	FluxMem	2026/03	Link	CVPR 2026	Training-Free Adaptive Hierarchical Memory: Uses short/mid/long-term memory with Temporal Adjacency Selection (TAS) and Spatial Domain Consolidation (SDC), plus scene-adaptive compression to balance accuracy, latency, and GPU memory.
EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use	EventMemAgent	2026/02	Link	arXiv	Agent with Hierarchical Event-Centric Memory: Employs a dual-layer memory in which short-term memory detects event boundaries and performs reservoir sampling, while long-term memory archives past events in structured form. Equips the agent with a multi-granular perception toolkit, optimized by agentic RL for active online video reasoning.
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding	HERMES	2026/01	Link	ACL 2026	Hierarchical KV Cache Memory: Conceptualizes the KV cache as a hierarchical memory framework encapsulating video information across multiple granularities. Reuses a compact KV cache for efficient streaming under resource constraints, achieving 10× faster TTFT.
VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs	VideoScaffold	2025/12	Link	arXiv	Elastic-Scale Event Hierarchy: Introduces Elastic-Scale Event Segmentation (EES) with prediction-guided boundary refinement to dynamically adjust event granularity under causal streaming constraints, and Hierarchical Event Consolidation (HEC) to aggregate multi-level event representations from fine-grained frames to abstract events, preserving temporal continuity and semantic coherence.
video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory	video-SALMONN S	2025/10	N/A	arXiv	TTT Memory: Uses Test-Time Training (TTT) layers to compress video history into model weights (hidden state), plus prompt-dependent memory reading to extract relevant information from the fixed-size memory. First to process >3h video at 1 FPS.
StreamForest: Efficient Online Video Understanding with Persistent Event Memory	StreamForest	2025/09	Link	NeurIPS 2025	Tree-Structured Event Memory: Organizes video frames into a Persistent Event Memory Forest (tree structure). Adaptively merges event nodes based on penalty functions (time, similarity, merge count) to maintain long-term history within a fixed token budget.
OVG-HQ: Online Video Grounding with Hybrid-modal Queries	OVG-HQ-Unify	2025/08	Link	ICCV 2025	Parametric Memory (TTT): Uses a Parametric Memory Block (PMB) instantiated with a Test-Time Training (TTT) layer to compress historical video context into network parameters for online grounding. Supports hybrid-modal queries (text/image/video).
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams	Flash-VStream	2025/06	Link	ICCV 2025	Flash Memory: Two-process framework with 1. Context Synopsis Memory (CSM): Compresses history via K-means clustering (summarization). 2. Detail Augmentation Memory (DAM): Retrieves high-res spatial details for key frames based on CSM distribution.
Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding	ProVideLLM	2025/04	Link	ICCV 2025	Verbalized Memory: Maintains a multimodal cache by verbalizing long-term video history into text steps (summarization) while keeping short-term history as visual tokens (extracted by DETR-QFormer), enabling extremely efficient streaming.
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers	VideoScan	2025/03	Link	arXiv	Semantic Carrier Token: Compresses each video frame into a single Semantic Carrier Token via average pooling to serve as a compact memory. Uses a feature duplication-based eviction strategy to maintain a fixed memory bank size.
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	StreamChat (Mem)	2025/01	Link	ICLR 2025	Hierarchical Memory Tree: Builds a long-term memory tree by clustering and captioning video chunks. Uses a parallel scheduling system to update memory and retrieve relevant context for multi-turn dialogue.
Online Video Understanding: OVBench and VideoChat-Online	VideoChat-Online	2025/01	Link	CVPR 2025	Pyramid Memory Bank: Uses a hierarchical memory ($m_t, m_{main}, m_s$) with progressive abstraction (pooling resolution/rate) to balance spatial and temporal details. Employs Frame Eviction & Down Writing to compress older frames into lower-resolution layers.
VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges	VideoLLaMB	2024/09	Link	ICCV 2025	Recurrent Memory Bridge: Uses SceneTiling to segment video into semantic clips. Compresses clips into Memory Tokens via recurrent bridge layers, which are periodically updated via retrieval, enabling long-context understanding with linear memory scaling.
Streaming Long Video Understanding with Large Language Models	VideoStreaming	2024/05	N/A	NeurIPS 2024	Memory-Propagated Encoding: Segments video into clips and encodes them into condensed memories using a small LLM, with memory propagated recursively. Uses Adaptive Memory Selection to retrieve relevant clips for QA.
Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline	VideoNarrator	2024/05	Link	ACL 2024	Memory Consolidation: Defines "Synchronized Video Storytelling". Uses Memory Consolidation to merge past visual tokens into fixed-length memory, and generates narrations guided by a structured storyline.
Streaming Dense Video Captioning	StreamingDVC	2024/04	Link	CVPR 2024	Clustering-Based Memory: Compresses incoming visual tokens into a fixed-size memory using K-means clustering. Uses a streaming decoding algorithm to output captions before the entire video is processed.

Retrieval-Augmented Mechanisms

Methods employing external memory banks and retrieval systems.
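The core retrieval step can be sketched as top-k cosine similarity between a query embedding and stored segment embeddings. The segment ids, the 3-dimensional embeddings, and the function names are hypothetical; the listed systems retrieve KV blocks, captions, or keyframes with their own scoring rules.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(memory_bank, query, k=2):
    """Return the ids of the k entries most similar to the query, best first."""
    ranked = sorted(memory_bank, key=lambda item: cosine(item[1], query), reverse=True)
    return [segment_id for segment_id, _ in ranked[:k]]

# Toy memory bank of (segment id, embedding) pairs.
memory_bank = [
    ("seg0", [1.0, 0.0, 0.0]),
    ("seg1", [0.0, 1.0, 0.0]),
    ("seg2", [0.7, 0.7, 0.0]),
    ("seg3", [0.0, 0.0, 1.0]),
]
top = retrieve(memory_bank, query=[1.0, 0.2, 0.0], k=2)
print(top)  # ['seg0', 'seg2']
```

Only the retrieved segments need to be loaded back onto the GPU, so storage can grow on CPU or disk while per-query compute stays bounded by `k`.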

Paper	Model	Date	Link	Venue	Method / Key Contribution
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs	WeaveTime	2026/02	Link	CVPR 2026	Temporal Reconstruction + Past-Current Dynamic Focus: Introduces Temporal Reconstruction (Streaming Order Perception) to instill order-aware representations. At inference, uses PCDF Cache for uncertainty-triggered, coarse-to-fine retrieval.
V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval	V-Rex	2025/12	N/A	HPCA 2026	Software-Hardware Co-Designed Accelerator: Combines a training-free dynamic KV cache retrieval algorithm (ReSV) with a dynamic KV cache retrieval engine (DRE).
Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding	Venus	2025/12	N/A	IEEE INFOCOM 2026	Edge–Cloud Disaggregated Architecture: Sinks memory construction and keyframe retrieval from cloud to edge, operating in two stages: ingestion (builds a hierarchical memory) and querying (employs threshold-based progressive sampling).
CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding	CacheFlow	2025/11	N/A	arXiv	Consensus-First Retrieval: Offloads KV cache to CPU. Compresses old blocks using a GRU-based memory. Retrieves top-K blocks based on a consensus score from shallow and deep layers, rehydrating them to GPU for inference.
StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression	StreamKV	2025/11	Link	AAAI 2026	Segment-based Retrieval: Partitions video into semantic segments and uses a Guidance Prompt to compress KV cache. Stores compressed KVs in a bank and retrieves relevant segments based on user query for QA.
StreamingTOM: Streaming Token Compression for Efficient Video Understanding	StreamingTOM	2025/10	Link	arXiv	Two-stage Framework: 1. CTR (Pre-LLM): Prunes input tokens based on temporal redundancy to speed up prefill. 2. OQM (Post-LLM): Stores 4-bit quantized KV groups and retrieves Top-K relevant groups on-demand for decoding.
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs	rLiVS	2025/10	N/A	arXiv	Caption-Based Retrieval: 1. Token Selection: Uses LLM attention scores to select top ~5% visual tokens and passes them recurrently. 2. Retrieval: Generates captions for clips and retrieves top-K text captions to answer user queries, avoiding heavy KV storage.
CogStream: Context-guided Streaming Video Question Answering	CogReasoner	2025/06	Link	arXiv	Dialogue Retrieval & Visual Compression: 1. Visual Stream Compression: Clusters frames into events and compresses based on question relevance. 2. Historic Dialogue Retrieval: Uses LLM to retrieve relevant past QA pairs to support current reasoning.
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval	LiveVLM	2025/05	N/A	arXiv	Streaming-Oriented KV Cache & Retrieval: 1. Compresses video KV pairs via attention-based pruning and frame-wise merging. 2. Retrieves relevant long-term KV chunks based on query attention scores to answer questions efficiently.
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval	ReKV	2025/03	Link	ICLR 2025	KV-Cache Retrieval: Offloads video KV caches to CPU/Disk. Upon receiving a query, it retrieves and reloads only the relevant KV caches to GPU for efficient answer generation, decoupling encoding from QA.

Computational Efficiency & Sparse Computing

Methods reducing FLOPs via dynamic compute, sparse attention, or efficient backbone designs.
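One form of dynamic compute is to skip the vision encoder for frames that are nearly identical to the last encoded one and reuse the cached feature. The sketch below illustrates that idea only; the pixel-difference test, the 0.05 threshold, and the toy averaging "encoder" are assumptions, not any listed method's implementation.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two equally sized flat pixel lists."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def encode_stream(frames, encoder, similarity_threshold=0.05):
    """Encode frames, reusing the cached feature for near-duplicate frames."""
    features, encoder_calls = [], 0
    cached_frame, cached_feature = None, None
    for frame in frames:
        is_duplicate = (cached_frame is not None and
                        mean_abs_diff(frame, cached_frame) <= similarity_threshold)
        if not is_duplicate:
            cached_feature = encoder(frame)  # the expensive call
            cached_frame = frame
            encoder_calls += 1
        features.append(cached_feature)
    return features, encoder_calls

def toy_encoder(frame):
    """Stand-in for a vision encoder: averages the pixel values."""
    return sum(frame) / len(frame)

frames = [[0.0, 0.0], [0.02, 0.0], [1.0, 1.0], [1.0, 0.98]]
features, calls = encode_stream(frames, toy_encoder)
print(calls)  # 2
```

Here four frames cost two encoder calls; the saving depends entirely on how static the stream is and on the chosen threshold.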

Paper	Model	Date	Link	Venue	Method / Key Contribution
OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams	OmniStream	2026/03	Link	arXiv	Unified Streaming Visual Backbone: Extends a pre-trained image encoder with causal spatiotemporal attention and 3D-RoPE, enabling strictly causal, efficient online inference.
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing	AutoGaze	2026/03	Link	CVPR 2026	Lightweight Model: Attends to informative patches and autoregressively selects a minimal set of multi-scale patches before a ViT.
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression	STC	2025/12	Link	CVPR 2026	Hierarchical Token Compression: STC-Cacher caches/reuses features of temporally similar frames to reduce ViT encoding, and STC-Pruner compresses visual tokens before LLM prefill by retaining salient tokens based on spatial-temporal relevance (novelty).
Learning Streaming Video Representation via Multitask Training	StreamFormer	2025/04	Link	ICCV 2025	Efficient Streaming Backbone: Introduces Causal Temporal Attention into Vision Transformers to enable efficient frame-by-frame processing. Trained via Multitask Learning (classification, detection, segmentation) to learn robust spatiotemporal representations.
Learning from Streaming Video with Orthogonal Gradients	N/A	2025/04	N/A	CVPR 2025	Orthogonal Optimizer: Employs orthogonal gradients to reduce correlations between consecutive gradients, thereby enhancing the model's learning performance on continuous video streams.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction	VITA-1.5	2025/01	Link	NeurIPS 2025	Multi-Stage Training Methodology: Enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed.
StreamChat: Chatting with Streaming Video	StreamChat	2024/12	Link	arXiv	Cross-Attention Streaming Architecture: Dynamically updates visual context during decoding via lightweight cross-attention, enhanced with V-FFN refinement and parallel 3D-RoPE for stable temporal alignment, enabling real-time streaming interaction without trigger modules.
Streaming Detection of Queried Event Start	SDQES	2024/12	N/A	NeurIPS 2024	Adapter-Based Approach: Proposes a novel task—Streaming Detection of Queried Event Start, as well as new task-specific metrics.

📊 Benchmarks & Datasets

Multi-Turn Dialogue & QA

Paper	Dataset	Date	Link	Venue	Tasks
RIVER: A Real-Time Interaction Benchmark for Video LLMs	RIVER Bench	2026/03	Link	arXiv	Real-time Interactive Video QA, Retrospective Memory, Live-Perception, Proactive Anticipation, Multi-turn Dialogue
StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios	StreamEQA	2025/12	N/A	arXiv	Embodied (perception, interaction, and planning) and Streaming (backward, realtime, and forward reasoning)
StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA	StreamingCoT	2025/10	Link	ACM MM 2025	Streaming VideoQA, CoT Reasoning
StreamForest: Efficient Online Video Understanding with Persistent Event Memory	ODV-Bench	2025/09	Link	NeurIPS 2025	Streaming VideoQA (Autonomous Driving), Real-time Perception, Future Prediction (Risk/Trajectory), Past Memory
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding	OST-Bench	2025/07	Link	NeurIPS 2025	Online Spatio-Temporal QA, Agent State Estimation, 3D Spatial Reasoning, Memory Retrieval
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video	RTV-Bench	2025/05	Link	NeurIPS 2025	Real-Time Video Reasoning, Sport / Driving / Ego Scenario, Hierarchical Evaluation
EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild	YT-Conversation	2025/02	Link	NAACL 2025	A dataset derived from diverse YouTube content including interviews, podcasts, and casual dialogues
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding	SVBench	2025/02	Link	ICLR 2025	Streaming VideoQA, Temporal Multi-Turn Dialogue, Long-Context Reasoning
Online Video Understanding: OVBench and VideoChat-Online	OVBench	2025/01	Link	CVPR 2025	Online VideoQA, Past Memory, Future Prediction, Spatial Perception
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	StreamBench	2025/01	Link	ICLR 2025	Streaming VideoQA, Multi-turn Dialogue, Long/Short-term Memory, Object Search
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding	StreamingBench	2024/11	Link	arXiv	Real-time Visual QA, Omni-source QA, Contextual QA
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models	TemporalBench	2024/10	Link	arXiv	Fine-grained Video Descriptions, Video QA, Video Captioning, Long Video Understanding

Real-time Captioning & Narration

Paper	Dataset	Date	Link	Venue	Tasks
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding	OmniStar-RNG	2025/11	Link	NeurIPS 2025	Real-time Narration, Streaming Dense Captioning, Streaming Video-Text Alignment
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	Live-CC-5M	2025/04	Link	CVPR 2025	Large-scale Pre-training, Streaming Captioning (ASR-based), Video-Text Alignment
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	Live-WhisperX-526K	2025/04	Link	CVPR 2025	Real-time Video Commentary, Instruction Tuning, Dense Streaming Captioning
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction	QEVD-FIT-COACH	2024/07	Link	NeurIPS 2024	Fitness Activity Recognition and Guidance

Proactive Response & Timing Evaluation

Paper	Dataset	Date	Link	Venue	Tasks
StreamReady: Learning What to Answer and When in Long Streaming Videos	ProReady-QA	2026/03	N/A	CVPR 2026	Answer Readiness Score, Proactive Multi-turn Questions
Proact-VL: A Proactive VideoLLM for Real-Time AI Companions	Live Gaming Benchmark	2026/03	Link	ICML 2026	Real-time Game Commentary, Co-commentary, User Guidance, Proactive Response Timing, Long-horizon Streaming Evaluation
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos	StreamGaze	2025/12	N/A	arXiv	Gaze-Triggered Alert, Object Transition Prediction, Gaze Sequence Matching
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?	QICD	2025/11	Link	NeurIPS 2025	Streaming Dialogue, Proactive Response Generation, Response Timing (When to speak)
Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video	ESTP-Bench	2025/10	Link	NeurIPS 2025	Proactive QA, Just-in-Time Response, Egocentric Reasoning
ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models	ProactiveVideoQA	2025/07	Link	arXiv	Proactive VideoQA (Web/Ego/TV), Response Timing Evaluation, Anomaly Detection
Proactive Assistant Dialogue Generation from Streaming Egocentric Videos	PROASSIST	2025/06	Link	EMNLP 2025	Proactive Task Guidance, Streaming Dialogue, Response Timing (When to speak)
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts	OmniMMI	2025/03	Link	CVPR 2025	Streaming Video Understanding (State Grounding, Action Planning), Proactive Reasoning (Alerting, Turn-Taking)
AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis	VAPDA-127K	2025/03	N/A	arXiv	Proactive Anomaly Prediction, Online Anomaly Detection, Interactive Anomaly Analysis
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?	OVO-Bench	2025/01	Link	CVPR 2025	Forward Active Responding (When to Answer), Backward Tracing, Real-time Perception
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format	MMDuetIT	2024/11	Link	EMNLP 2025	Multi-Answer Grounded QA, Proactive Response Generation

🏆 Competitions

Name	Venue
AI Coach	CVPR 2026

Complete Model List by Release Date

Models

Paper	Model	Date	Link	Venue
STRIDE: When to Speak Meets Sequence Denoising for Streaming Video Understanding	STRIDE	2026/03	Link	arXiv
Color When It Counts: Grayscale-Guided Online Triggering for Always-On Streaming Video Sensing	ColorTrigger	2026/03	Link	CVPR 2026
StreamingClaw Technical Report	StreamingClaw	2026/03	Link	arXiv
Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding	Em-Garde	2026/03	Link	arXiv
Thinking in Streaming Video	ThinkStream	2026/03	Link	ECCV 2026
OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams	OmniStream	2026/03	Link	arXiv
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing	AutoGaze	2026/03	Link	CVPR 2026
Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously	VST	2026/03	Link	ECCV 2026
Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models	TWW	2026/03	Link	ECCV 2026
StreamReady: Learning What to Answer and When in Long Streaming Videos	StreamReady	2026/03	N/A	CVPR 2026
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models	TaYS	2026/03	Link	CVPR 2026
FluxMem: Adaptive Hierarchical Memory for Streaming Video Understanding	FluxMem	2026/03	Link	CVPR 2026
Proact-VL: A Proactive VideoLLM for Real-Time AI Companions	Proact-VL	2026/03	Link	ICML 2026
WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs	WeaveTime	2026/02	Link	CVPR 2026
EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use	EventMemAgent	2026/02	Link	arXiv
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding	HERMES	2026/01	Link	ACL 2026
ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding	ROMA	2026/01	Link	arXiv
QueryStream	QueryStream	2026/01	Link	ICLR 2026
VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs	VideoScaffold	2025/12	Link	arXiv
Streaming Video Instruction Tuning	Streamo	2025/12	Link	arXiv
StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding	StreamingAssistant	2025/12	N/A	arXiv
V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval	V-Rex	2025/12	N/A	HPCA 2026
Venus: An Efficient Edge Memory-and-Retrieval System for VLM-based Online Video Understanding	Venus	2025/12	N/A	IEEE INFOCOM 2026
Learning to Respond: A Large-Scale Benchmark and Progressive Learning Framework for Trigger-Centric Online Video Understanding	ToM	2025/12	N/A	arXiv
MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning	MMDuet2	2025/12	Link	ICLR 2026
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression	STC	2025/12	Link	CVPR 2026
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding	LiveStar	2025/11	Link	NeurIPS 2025
CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding	CacheFlow	2025/11	N/A	arXiv
StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression	StreamKV	2025/11	Link	AAAI 2026
Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video	VideoLLM-EyeWO	2025/10	Link	NeurIPS 2025
StreamingVLM: Real-Time Understanding for Infinite Video Streams	StreamingVLM	2025/10	Link	arXiv
StreamingTOM: Streaming Token Compression for Efficient Video Understanding	StreamingTOM	2025/10	Link	arXiv
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs	rLiVS	2025/10	N/A	arXiv
video-SALMONN S: Streaming Audio-Visual LLMs Beyond Length Limits via Memory	video-SALMONN S	2025/10	N/A	arXiv
StreamForest: Efficient Online Video Understanding with Persistent Event Memory	StreamForest	2025/09	Link	NeurIPS 2025
Open-ended Hierarchical Streaming Video Understanding with Vision Language Models	OpenHOUSE	2025/09	N/A	ICCV 2025
StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding	StreamMem	2025/08	Link	arXiv
StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding	StreamAgent	2025/08	N/A	arXiv
OVG-HQ: Online Video Grounding with Hybrid-modal Queries	OVG-HQ-Unify	2025/08	Link	ICCV 2025
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling	StreamVLN	2025/07	Link	arXiv
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding	InfiniPot-V	2025/06	Link	NeurIPS 2025
CogStream: Context-guided Streaming Video Question Answering	CogReasoner	2025/06	Link	arXiv
Proactive Assistant Dialogue Generation from Streaming Egocentric Videos	ProAssist	2025/06	Link	EMNLP 2025
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams	Flash-VStream	2025/06	Link	ICCV 2025
StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant	StreamBridge	2025/05	Link	NeurIPS 2025
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval	LiveVLM	2025/05	N/A	arXiv
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos	TimeChat-Online	2025/04	Link	ACM MM 2025
Learning Streaming Video Representation via Multitask Training	StreamFormer	2025/04	Link	ICCV 2025
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	LiveCC	2025/04	Link	CVPR 2025
Memory-efficient Streaming VideoLLMs for Real-time Procedural Video Understanding	ProVideLLM	2025/04	Link	ICCV 2025
Learning from Streaming Video with Orthogonal Gradients	N/A	2025/04	N/A	CVPR 2025
ViSpeak: Visual Instruction Feedback in Streaming Videos	ViSpeak	2025/03	Link	ICCV 2025
AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis	AssistPDA	2025/03	N/A	arXiv
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers	VideoScan	2025/03	Link	arXiv
LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant	LION-FS	2025/03	Link	CVPR 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition	StreamMind	2025/03	Link	ICCV 2025
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval	ReKV	2025/03	Link	ICLR 2025
EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild	EgoSpeak	2025/02	Link	NAACL 2025
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding	StreamingChat	2025/02	Link	ICLR 2025
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	StreamChat (Mem)	2025/01	Link	ICLR 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction	Dispider	2025/01	Link	CVPR 2025
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction	VITA-1.5	2025/01	Link	NeurIPS 2025
Online Video Understanding: OVBench and VideoChat-Online	VideoChat-Online	2025/01	Link	CVPR 2025
StreamChat: Chatting with Streaming Video	StreamChat	2024/12	Link	arXiv
Streaming Detection of Queried Event Start	SDQES	2024/12	N/A	NeurIPS 2024
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format	MMDuet	2024/11	Link	EMNLP 2025
VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges	VideoLLaMB	2024/09	Link	ICCV 2025
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation	VideoLLM-MoD	2024/08	N/A	NeurIPS 2024
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction	STREAM-VLM	2024/07	Link	NeurIPS 2024
VideoLLM-online: Online Video Large Language Model for Streaming Video	VideoLLM-online	2024/06	Link	CVPR 2024
Streaming Long Video Understanding with Large Language Models	VideoStreaming	2024/05	N/A	NeurIPS 2024
Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline	VideoNarrator	2024/05	Link	ACL 2024
Streaming Dense Video Captioning	StreamingDVC	2024/04	Link	CVPR 2024
Streamlined Dense Video Captioning	SDVC	2019/04	Link	CVPR 2019

Complete Dataset List by Release Date

Benchmarks & Datasets

Paper	Dataset	Date	Link	Venue
RIVER: A Real-Time Interaction Benchmark for Video LLMs	RIVER Bench	2026/03	Link	arXiv
StreamReady: Learning What to Answer and When in Long Streaming Videos	ProReady-QA	2026/03	N/A	CVPR 2026
Proact-VL: A Proactive VideoLLM for Real-Time AI Companions	Live Gaming Benchmark	2026/03	Link	ICML 2026
StreamEQA: Towards Streaming Video Understanding for Embodied Scenarios	StreamEQA	2025/12	N/A	arXiv
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos	StreamGaze	2025/12	N/A	arXiv
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding	OmniStar-RNG	2025/11	Link	NeurIPS 2025
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?	QICD	2025/11	Link	NeurIPS 2025
StreamingCoT: A Dataset for Temporal Dynamics and Multimodal Chain-of-Thought Reasoning in Streaming VideoQA	StreamingCoT	2025/10	Link	ACM MM 2025
Eyes Wide Open: Ego Proactive Video-LLM for Streaming Video	ESTP-Bench	2025/10	Link	NeurIPS 2025
StreamForest: Efficient Online Video Understanding with Persistent Event Memory	ODV-Bench	2025/09	Link	NeurIPS 2025
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding	OST-Bench	2025/07	Link	NeurIPS 2025
ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models	ProactiveVideoQA	2025/07	Link	arXiv
Proactive Assistant Dialogue Generation from Streaming Egocentric Videos	PROASSIST	2025/06	Link	EMNLP 2025
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video	RTV-Bench	2025/05	Link	NeurIPS 2025
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	Live-WhisperX-526K	2025/04	Link	CVPR 2025
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	Live-CC-5M	2025/04	Link	CVPR 2025
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts	OmniMMI	2025/03	Link	CVPR 2025
AssistPDA: An Online Video Surveillance Assistant for Video Anomaly Prediction, Detection, and Analysis	VAPDA-127K	2025/03	N/A	arXiv
EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild	YT-Conversation	2025/02	Link	NAACL 2025
SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding	SVBench	2025/02	Link	ICLR 2025
Online Video Understanding: OVBench and VideoChat-Online	OVBench	2025/01	Link	CVPR 2025
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?	OVO-Bench	2025/01	Link	CVPR 2025
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge	StreamBench	2025/01	Link	ICLR 2025
StreamingBench: Assessing the Gap for MLLMs to Achieve Streaming Video Understanding	StreamingBench	2024/11	Link	arXiv
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format	MMDuetIT	2024/11	Link	EMNLP 2025
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models	TemporalBench	2024/10	Link	arXiv
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction	QEVD-FIT-COACH	2024/07	Link	NeurIPS 2024

📚 Citation

If you find our survey and this repository helpful, please consider citing our work:

@article{202606.1674,
	doi = {10.20944/preprints202606.1674.v1},
	url = {https://doi.org/10.20944/preprints202606.1674.v1},
	year = 2026,
	month = {June},
	publisher = {Preprints},
	author = {Zhenyu Yang and Kairui Zhang and Qi Liu and Tiancheng Liu and Long Ying and Dizhan Xue and Qibin Hou and Shengsheng Qian and Changsheng Xu},
	title = {Towards Online Interactors: A Comprehensive Survey on Streaming Video Understanding},
	journal = {Preprints}
}

🚀 Contributing

We welcome contributions! To add a resource, you can:

Open a pull request with a clear title and brief description of your changes.
Open an issue with a clear title and short explanation.

If you notice any errors, please open an issue. We apologize in advance for any inconvenience.

❤️ Contact

If you have suggestions or find this project useful, we’d love to hear from you.
Email: yangzhenyu2022@ia.ac.cn and zhangkr2025@shanghaitech.edu.cn