April 19, 2026

From Static Inference to Dynamic Interaction:
A Survey of Streaming Large Language Models

Junlong Tong¹,², Zilong Wang², Yujie Ren², Peiran Yin², Hao Wu², Wei Zhang², Xiaoyu Shen²

¹Shanghai Jiao Tong University, ²Institute of Digital Twin, Eastern Institute of Technology, Ningbo

Contact: jl-tong@sjtu.edu.cn, xyshen@eitech.edu.cn

📢 News

[2026.04] Our survey has been accepted by ACL 2026 Findings.
[2026.03] We released the first comprehensive survey on Streaming LLMs/MLLMs!

💡 1. Overview

This repository provides a comprehensive landscape of current streaming LLMs/MLLMs, covering multi-modal streaming applications across text, audio, and video.

We cut through the confusing terminology of "streaming generation", "streaming input" and "interactive streaming" by introducing a unified, formal definition for Streaming LLMs. Based on Data Flow and Interaction Concurrency, we categorize Streaming LLMs into three progressive paradigms.

👉 Category I: Output-Streaming LLMs
(Left) Performs streaming generation after static reading.
👉 Category II: Sequential-Streaming LLMs
(Middle) Performs streaming generation after streaming reading.
👉 Category III: Concurrent-Streaming LLMs
(Right) Performs streaming generation while streaming reading.

1.1 Formal Definition

We formulate the modeling process as a conditional probability distribution $P(Y|X)$, where $X = (x_1, \dots, x_M)$ denotes the bounded input stream and $Y = (y_1, \dots, y_N)$ denotes the output stream. This distribution can be factorized as:

$P(Y|X) = \prod_{t=1}^{N} P\big(y_t | y_{\lt t}, h_{1:\phi(t)}(X);\theta\big),$

where $\theta$ denotes the LLM parameters, $h_{\phi(t)}(X)=llm(x_{\phi(t)})$ denotes the hidden states corresponding to the input $x_{\phi(t)}$, and $\phi(t)$ is an interaction decision function that determines how much of the input stream is visible at generation step $t$.

Then:

Output-Streaming LLMs: $\phi(t)=M$ for all $t\in \{1,2,\dots,N\}$, $h_{1:\phi(t)}(X) = h_{1:M}(X) = llm(X_{1:M}).$
Sequential-Streaming LLMs: $\phi(t)=M$ for all $t\in \{1,2,\dots,N\}$, $h_{1:M}(X) = \{ llm(x_1), \dots, llm(x_M) \}.$
Concurrent-Streaming LLMs: $1\le \dots \le \phi(t)\le \phi(t+1)\le \dots \le M$, $h_{\phi(t)}(X) = llm(X_{\phi(t)},y_{<t}).$

Concurrent-Streaming is built upon the foundation of the previous two paradigms, representing the evolution from static inference, to continuous streaming perception, to full-duplex dynamic interaction.
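As a rough illustration of the definition in Section 1.1, the sketch below treats $\phi$ as a plain Python function from the generation step $t$ to the number of visible input units. The function names and the wait-k style schedule `min(k + t - 1, M)` are assumptions chosen for illustration and are not part of the survey's formalism. Output-Streaming and Sequential-Streaming share the same schedule $\phi(t)=M$ and differ only in how the hidden states are computed, so a single helper covers both.

```python
from typing import Callable, List

Phi = Callable[[int], int]  # maps generation step t (1-indexed) to visible input length


def full_input_phi(M: int) -> Phi:
    """phi(t) = M for every t: generation starts only after the whole input is read.

    This schedule applies to both Output-Streaming and Sequential-Streaming LLMs.
    """
    return lambda t: M


def wait_k_phi(M: int, k: int) -> Phi:
    """Hypothetical concurrent schedule: read k units first, then one more per output step."""
    if k < 1:
        raise ValueError("k must be at least 1 so that phi(1) >= 1")
    return lambda t: min(k + t - 1, M)


def visible_lengths(phi: Phi, N: int) -> List[int]:
    """Evaluate phi(1), ..., phi(N)."""
    return [phi(t) for t in range(1, N + 1)]


def is_valid_schedule(phi: Phi, M: int, N: int) -> bool:
    """Check the constraint 1 <= phi(t) <= phi(t+1) <= M for all generation steps."""
    lengths = visible_lengths(phi, N)
    in_range = all(1 <= length <= M for length in lengths)
    monotone = all(a <= b for a, b in zip(lengths, lengths[1:]))
    return in_range and monotone


if __name__ == "__main__":
    M, N = 5, 6
    print(visible_lengths(full_input_phi(M), N))   # whole input visible at every step
    print(visible_lengths(wait_k_phi(M, k=2), N))  # visibility grows while generating
```

Any non-decreasing schedule bounded by $M$ satisfies the Concurrent-Streaming constraint; the wait-k form is only one simple choice.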

1.2 Key Challenges & Core Goal

Output-Streaming LLMs

Streaming Generation & Generation Efficiency

Sequential-Streaming LLMs

Continuous Perception & Streaming Context Management

Concurrent-Streaming LLMs

Architecture Adaptation & Proactive Interaction Decision

Note

Since Concurrent-Streaming LLMs build on top of the previous two paradigms, we emphasize the entirely new challenges introduced by concurrency; shared issues such as streaming context management are not repeated.

✨ 3. Streaming Taxonomy

Main Category	Second-Level Category	Third-Level Category	Explanation
Output-Streaming LLMs	Streaming Generation	Token-wise	The model reads the full input first and then streams outputs one unit at a time; this is the standard autoregressive streaming setting.
		Block-wise	The model still finishes reading before writing, but generates blocks, chunks, or sentences to reduce serial latency.
		Refinement-based	The model reveals outputs progressively by iterative refinement or denoising, rather than only extending left to right.
	Streaming Efficiency	Decode Acceleration	Methods that keep output-streaming generation but improve speed through speculative decoding, multi-token prediction, or shorter execution paths.
		Memory Efficiency	Methods that reduce KV-cache or long-context cost during progressive output generation.
Sequential-Streaming LLMs	Incremental Encoding	Atomic Encoding	The input arrives as stable units, such as tokens or fixed discrete chunks, and the model processes them incrementally before generation starts.
		Fragmented Encoding	Continuous signals are partitioned into streaming units by fixed or semantic boundaries, then processed incrementally before generation.
	Streaming Context Management	Memory Retention	Methods that decide what historical streaming content should be kept, merged, or evicted over time.
		KV Cache Management	Methods that compress, retrieve, or reorganize internal KV states under long streaming inputs.
		Attention Optimization	Methods that redesign attention access patterns so the model can process long input streams efficiently.
Concurrent-Streaming LLMs	Architecture Adaptation	Re-encoded Streaming	New inputs trigger re-encoding of the history so the model can preserve batch-like dependencies while reading and writing concurrently.
		Concatenated Streaming	New inputs and generated outputs are concatenated into a single sequence so their order is unified during concurrent interaction.
		Interleaved Streaming	Input and output events are interleaved on a shared timeline to support continuous read-write overlap.
		Grouped Streaming	Input and output are placed in separate groups, and cross-group interaction is designed explicitly to avoid structural conflicts.
	Interaction Policy	Rule-based Policy	Read-write timing is controlled by fixed schedules or threshold-based triggers.
		SFT-based Policy	Read-write timing is learned from supervised fine-tuning signals.
		RL-based Policy	Read-write timing is learned as a sequential decision process that optimizes latency-quality trade-offs.
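To make the Memory Retention row of the taxonomy more concrete, here is a minimal sketch of one possible fixed retention rule: keep the first few streaming units permanently and keep only a sliding window of the most recent ones, evicting everything in between. The class name `SinkWindowBuffer` and its parameters `num_sink` and `window` are hypothetical, and this is a toy buffer over arbitrary items rather than an implementation of any specific method listed in this repository.

```python
from collections import deque
from typing import Deque, Generic, List, TypeVar

T = TypeVar("T")


class SinkWindowBuffer(Generic[T]):
    """Toy retention policy: keep the first `num_sink` units and the last `window` units."""

    def __init__(self, num_sink: int, window: int) -> None:
        if num_sink < 0 or window <= 0:
            raise ValueError("num_sink must be >= 0 and window must be > 0")
        self.num_sink = num_sink
        self.sink: List[T] = []                        # earliest units, never evicted
        self.recent: Deque[T] = deque(maxlen=window)   # most recent units

    def append(self, unit: T) -> None:
        """Add one streaming unit; the oldest non-sink unit is evicted when the window is full."""
        if len(self.sink) < self.num_sink:
            self.sink.append(unit)
        else:
            self.recent.append(unit)  # a full deque with maxlen drops its oldest element

    def contents(self) -> List[T]:
        """Return the retained units in their original arrival order."""
        return self.sink + list(self.recent)


if __name__ == "__main__":
    buf: SinkWindowBuffer[int] = SinkWindowBuffer(num_sink=2, window=3)
    for unit in range(10):
        buf.append(unit)
    print(buf.contents())
```

The retained context stays bounded at `num_sink + window` units regardless of stream length, which is the property that retention methods trade against the information lost through eviction.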

Paper	Source	Modality
GPT-4 Technical Report. [paper]	arXiv 2023	Text-out
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. [paper] [code]	EMNLP 2023 Findings	Speech-out
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. [paper] [code]	arXiv 2024	Image-out
VideoPoet: A large language model for zero-shot video generation. [paper] [project]	ICML 2024	Video-out

Welcome Contributions

We actively maintain this repository and welcome community contributions. Please submit a pull request or contact the authors if you would like to:

Add newly released Streaming LLMs/MLLMs papers

Propose refinements to our taxonomy

Correct or update existing entries

Discuss classification or methodology

Citation

If you find our paper or this resource helpful, please consider citing:

@article{Tong2026Streaming,
      title={From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models}, 
      author={Junlong Tong and Zilong Wang and YuJie Ren and Peiran Yin and Hao Wu and Wei Zhang and Xiaoyu Shen},
      year={2026},
      eprint={2603.04592},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.04592}, 
}
