
April 19, 2026

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models


Junlong Tong1,2, Zilong Wang2, Yujie Ren2, Peiran Yin2, Hao Wu2, Wei Zhang2, Xiaoyu Shen2

1Shanghai Jiao Tong University, 2Institute of Digital Twin, Eastern Institute of Technology, Ningbo

Contact: jl-tong@sjtu.edu.cn, xyshen@eitech.edu.cn

📢 News

  • [2026.04] Our survey has been accepted by ACL 2026 Findings.
  • [2026.03] We released the first comprehensive survey on Streaming LLMs/MLLMs!

💡 1. Overview

This repository provides a comprehensive landscape of current streaming LLMs/MLLMs, covering multi-modal streaming applications across text, audio, and video.

We cut through the confusing terminology of "streaming generation", "streaming input" and "interactive streaming" by introducing a unified, formal definition for Streaming LLMs. Based on Data Flow and Interaction Concurrency, we categorize Streaming LLMs into three progressive paradigms.

👉 Category I: Output-Streaming LLMs
(Left) Performs streaming generation after static reading.
👉 Category II: Sequential-Streaming LLMs
(Middle) Performs streaming generation after streaming reading.
👉 Category III: Concurrent-Streaming LLMs
(Right) Performs streaming generation while streaming reading.

1.1 Formal Definition

We formulate the modeling process as a conditional probability distribution $P(Y \mid X)$, where $X = (x_1, \dots, x_M)$ denotes the bounded input stream and $Y = (y_1, \dots, y_N)$ denotes the output stream. This distribution can be factorized as:

$$P(Y \mid X) = \prod_{t=1}^{N} P\big(y_t \mid y_{<t}, h_{1:\phi(t)}(X);\theta\big),$$

where $\theta$ denotes the LLM parameters, $h_{\phi(t)}(X) = llm(x_{\phi(t)})$ is the hidden state corresponding to the input $x_{\phi(t)}$, and $\phi(t)$ is an interaction decision function that determines how much of the input stream is visible at generation step $t$.

Then:

  • Output-Streaming LLMs: $\phi(t)=M$ for all $t\in \{1,2,\dots,N\}$, with $h_{1:\phi(t)}(X) = h_{1:M}(X) = llm(X_{1:M})$.
  • Sequential-Streaming LLMs: $\phi(t)=M$ for all $t\in \{1,2,\dots,N\}$, with $h_{1:M}(X) = \{ llm(x_1), \dots, llm(x_M) \}$.
  • Concurrent-Streaming LLMs: $1\le \dots \le \phi(t)\le \phi(t+1)\le \dots \le M$, with $h_{\phi(t)}(X) = llm(X_{\phi(t)},y_{<t})$.
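
As a small worked instance of the factorization, consider $M=3$ input units and $N=2$ output units. These sizes, and the concurrent schedule $\phi(1)=2$, $\phi(2)=3$, are illustrative assumptions rather than values from the survey. With $\phi(t)=M$ (Output-Streaming and Sequential-Streaming), every step conditions on the full input; in the concurrent case, the first output is produced before $x_3$ has arrived:

$$
\begin{aligned}
\phi(t)=3:\quad & P(Y \mid X) = P\big(y_1 \mid h_{1:3}(X);\theta\big)\, P\big(y_2 \mid y_1, h_{1:3}(X);\theta\big), \\
\phi(1)=2,\ \phi(2)=3:\quad & P(Y \mid X) = P\big(y_1 \mid h_{1:2}(X);\theta\big)\, P\big(y_2 \mid y_1, h_{1:3}(X);\theta\big).
\end{aligned}
$$

The only difference between the two lines is the conditioning set of the first factor, which is what the decision function $\phi$ controls.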

Concurrent-Streaming is built upon the foundation of the previous two paradigms, representing the evolution from static inference, to continuous streaming perception, to full-duplex dynamic interaction.
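
The role of the decision function can also be sketched in code. The snippet below is a minimal illustration, not an implementation from the survey: `classify_schedule` and `visible_prefix` are hypothetical helper names, and a schedule is assumed to be given as a plain list holding $\phi(1), \dots, \phi(N)$. It checks the constraints stated above (values in $[1, M]$, non-decreasing) and reports whether the schedule always exposes the full input or reveals it progressively.

```python
from typing import List, Sequence


def classify_schedule(phi: Sequence[int], M: int) -> str:
    """Classify a visibility schedule phi(1..N) over an input of length M.

    phi[t - 1] is the number of input units visible at generation step t.
    """
    if not phi:
        raise ValueError("schedule must cover at least one generation step")
    if any(p < 1 or p > M for p in phi):
        raise ValueError("each phi(t) must lie in [1, M]")
    if any(a > b for a, b in zip(phi, phi[1:])):
        raise ValueError("phi must be non-decreasing: visible input never shrinks")
    if all(p == M for p in phi):
        # Categories I and II share phi(t) = M; they differ only in how
        # h_{1:M}(X) is computed (one pass vs. unit by unit).
        return "full-input"
    return "concurrent"


def visible_prefix(X: Sequence[str], phi_t: int) -> List[str]:
    """Return the input units x_1..x_{phi(t)} visible at one generation step."""
    return list(X[:phi_t])


if __name__ == "__main__":
    X = ["x1", "x2", "x3"]
    print(classify_schedule([3, 3, 3, 3], M=len(X)))  # full-input
    print(classify_schedule([1, 2, 3, 3], M=len(X)))  # concurrent
    print(visible_prefix(X, 2))                       # ['x1', 'x2']
```

Note that the schedule alone cannot separate Output-Streaming from Sequential-Streaming: both use $\phi(t)=M$, and the distinction lies in whether the input is encoded in one pass or incrementally.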

1.2 Key Challenges & Core Goal

Output-Streaming LLMs

  • Streaming Generation & Generation Efficiency

Sequential-Streaming LLMs

  • Continuous Perception & Streaming Context Management

Concurrent-Streaming LLMs

  • Architecture Adaptation & Proactive Interaction Decision

Note: Since Concurrent-Streaming LLMs build on the previous two paradigms, we emphasize the entirely new challenges introduced by concurrency; shared issues such as streaming context management are not repeated.


📚 2. Contents

✨ 3. Streaming Taxonomy

| Main Category | Second-Level Category | Third-Level Category | Explanation |
| --- | --- | --- | --- |
| Output-Streaming LLMs | Streaming Generation | Token-wise | The model reads the full input first and then streams outputs one unit at a time; this is the standard autoregressive streaming setting. |
| Output-Streaming LLMs | Streaming Generation | Block-wise | The model still finishes reading before writing, but generates blocks, chunks, or sentences to reduce serial latency. |
| Output-Streaming LLMs | Streaming Generation | Refinement-based | The model reveals outputs progressively by iterative refinement or denoising, rather than only extending left to right. |
| Output-Streaming LLMs | Streaming Efficiency | Decode Acceleration | Methods that keep output-streaming generation but improve speed through speculative decoding, multi-token prediction, or shorter execution paths. |
| Output-Streaming LLMs | Streaming Efficiency | Memory Efficiency | Methods that reduce KV-cache or long-context cost during progressive output generation. |
| Sequential-Streaming LLMs | Incremental Encoding | Atomic Encoding | The input arrives as stable units, such as tokens or fixed discrete chunks, and the model processes them incrementally before generation starts. |
| Sequential-Streaming LLMs | Incremental Encoding | Fragmented Encoding | Continuous signals are partitioned into streaming units by fixed or semantic boundaries, then processed incrementally before generation. |
| Sequential-Streaming LLMs | Streaming Context Management | Memory Retention | Methods that decide what historical streaming content should be kept, merged, or evicted over time. |
| Sequential-Streaming LLMs | Streaming Context Management | KV Cache Management | Methods that compress, retrieve, or reorganize internal KV states under long streaming inputs. |
| Sequential-Streaming LLMs | Streaming Context Management | Attention Optimization | Methods that redesign attention access patterns so the model can process long input streams efficiently. |
| Concurrent-Streaming LLMs | Architecture Adaptation | Re-encoded Streaming | New inputs trigger re-encoding of the history so the model can preserve batch-like dependencies while reading and writing concurrently. |
| Concurrent-Streaming LLMs | Architecture Adaptation | Concatenated Streaming | New inputs and generated outputs are concatenated into a single sequence so their order is unified during concurrent interaction. |
| Concurrent-Streaming LLMs | Architecture Adaptation | Interleaved Streaming | Input and output events are interleaved on a shared timeline to support continuous read-write overlap. |
| Concurrent-Streaming LLMs | Architecture Adaptation | Grouped Streaming | Input and output are placed in separate groups, and cross-group interaction is designed explicitly to avoid structural conflicts. |
| Concurrent-Streaming LLMs | Interaction Policy | Rule-based Policy | Read-write timing is controlled by fixed schedules or threshold-based triggers. |
| Concurrent-Streaming LLMs | Interaction Policy | SFT-based Policy | Read-write timing is learned from supervised fine-tuning signals. |
| Concurrent-Streaming LLMs | Interaction Policy | RL-based Policy | Read-write timing is learned as a sequential decision process that optimizes latency-quality trade-offs. |
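
To make the rule-based end of the interaction-policy spectrum concrete, here is a minimal sketch of a fixed-schedule read/write policy. It follows the spirit of fixed-lag ("wait-k" style) schedules known from simultaneous translation, which is an assumption chosen for illustration and not a method attributed to the survey; the function name `fixed_lag_actions` and its parameters are hypothetical. The policy reads `k` input units first, then alternates writing and reading, and keeps writing once the input is exhausted.

```python
from typing import List, Tuple


def fixed_lag_actions(M: int, N: int, k: int) -> Tuple[List[str], List[int]]:
    """Simulate a fixed-schedule read/write policy.

    M: number of input units, N: number of output units, k: initial lag.
    Returns the action trace and, for each WRITE, the number of input
    units that had been read when that output was produced.
    """
    if M < 1 or N < 1 or k < 1:
        raise ValueError("M, N and k must all be positive")
    actions: List[str] = []
    visible_at_write: List[int] = []
    read = written = 0
    while written < N:
        # READ while input remains and the reader is less than k units ahead.
        if read < M and read - written < k:
            read += 1
            actions.append("READ")
        else:
            written += 1
            actions.append("WRITE")
            visible_at_write.append(read)
    return actions, visible_at_write


if __name__ == "__main__":
    actions, visible = fixed_lag_actions(M=3, N=4, k=2)
    print(actions)  # ['READ', 'READ', 'WRITE', 'READ', 'WRITE', 'WRITE', 'WRITE']
    print(visible)  # [2, 3, 3, 3]
```

The second return value is the induced visibility at each write step, so a rule-based policy can be read as one particular way of fixing the decision function in advance; SFT-based and RL-based policies would instead learn when to emit each `READ` or `WRITE`.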

📋 4. Output-Streaming LLMs

4.1 Streaming Generation

4.1.1 Token-wise

| Paper | Source | Modality |
| --- | --- | --- |
| GPT-4 Technical Report. [paper] | arXiv 2023 | Text-out |
| SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. [paper] [code] | EMNLP 2023 Findings | Speech-out |
| Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. [paper] [code] | arXiv 2024 | Image-out |
| VideoPoet: A large language model for zero-shot video generation. [paper] [project] | ICML 2024 | Video-out |

4.1.2 Block-wise

4.1.3 Refinement-based

4.2 Streaming Efficiency

4.2.1 Decode Acceleration

4.2.2 Memory Efficiency

📋 5. Sequential-Streaming LLMs

5.1 Incremental Encoding

5.1.1 Atomic Encoding

5.1.2 Fragmented Encoding

5.2 Streaming Context Management

5.2.1 Memory Retention

5.2.2 KV Cache Management

5.2.3 Attention Optimization

📋 6. Concurrent-Streaming LLMs

6.1 Architecture Adaptation

6.1.1 Re-encoded Streaming

6.1.2 Concatenated Streaming

6.1.3 Interleaved Streaming

6.1.4 Grouped Streaming

6.2 Interaction Policy

6.2.1 Rule-based Policy

6.2.2 SFT-based Policy

6.2.3 RL-based Policy

🔧 7. Streaming Applications and Tasks

[To Do]

📋 8. Streaming Benchmark

[To Do]

Welcome Contributions

We actively maintain this repository and welcome community contributions. Please submit a pull request or contact the authors if you would like to:

  • Add newly released Streaming LLMs/MLLMs papers
  • Propose refinements to our taxonomy
  • Correct or update existing entries
  • Discuss classification or methodology

Citation

If you find our paper or this resource helpful, please consider citing:

@article{Tong2026Streaming,
      title={From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models}, 
      author={Junlong Tong and Zilong Wang and YuJie Ren and Peiran Yin and Hao Wu and Wei Zhang and Xiaoyu Shen},
      year={2026},
      eprint={2603.04592},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.04592}, 
}