
January 17, 2025

Next Token Prediction Towards Multimodal Intelligence


Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning tasks across various modalities, achieving considerable success in both understanding and generation tasks. This repo features a comprehensive collection of papers and repos for the survey: "Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey".
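
For readers new to the objective the survey is organized around: NTP trains a model to assign high probability to the token that actually comes next, by minimizing the average cross-entropy over a sequence. A minimal framework-free sketch with toy logits and a 3-token vocabulary (illustrative only, not taken from any surveyed model):

```python
import math

def next_token_loss(logits, targets):
    """Average cross-entropy of next-token prediction.

    logits[t]  -- unnormalized scores over the vocabulary at position t
    targets[t] -- the token id that actually follows position t
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # -log softmax(scores)[target] = log(sum exp(scores)) - scores[target]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)

# Two positions over a 3-token vocabulary; the model favors the correct token.
logits = [[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]]
targets = [0, 1]
loss = next_token_loss(logits, targets)  # small positive loss
```

The same loss applies unchanged once images, video, or audio are mapped to token sequences by the tokenizers collected below.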

👉 Full paper: https://arxiv.org/abs/2412.18619

Authors: Liang Chen<sup>1</sup>, Zekun Wang<sup>2</sup>, Shuhuai Ren<sup>1</sup>, Lei Li<sup>3</sup>, Haozhe Zhao<sup>1</sup>, Yunshui Li<sup>4</sup>, Zefan Cai<sup>1</sup>, Hongcheng Guo<sup>2</sup>, Lei Zhang<sup>4</sup>, Yizhe Xiong<sup>5</sup>, Yichi Zhang<sup>1</sup>, Ruoyu Wu<sup>1</sup>, Qingxiu Dong<sup>1</sup>, Ge Zhang<sup>6</sup>, Jian Yang<sup>8</sup>, Lingwei Meng<sup>7</sup>, Shujie Hu<sup>7</sup>, Yulong Chen<sup>9</sup>, Junyang Lin<sup>8</sup>, Shuai Bai<sup>8</sup>, Andreas Vlachos<sup>9</sup>, Xu Tan<sup>10</sup>, Minjia Zhang<sup>11</sup>, Wen Xiao<sup>10</sup>, Aaron Yee<sup>12,13</sup>, Tianyu Liu<sup>8</sup>, Baobao Chang<sup>1</sup>

<sup>1</sup>Peking University <sup>2</sup>Beihang University <sup>3</sup>University of Hong Kong <sup>4</sup>Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences <sup>5</sup>Tsinghua University <sup>6</sup>M-A-P <sup>7</sup>The Chinese University of Hong Kong <sup>8</sup>Alibaba Group <sup>9</sup>University of Cambridge <sup>10</sup>Microsoft Research <sup>11</sup>UIUC <sup>12</sup>Humanify Inc. <sup>13</sup>Zhejiang University


🔥🔥 News

  • 2024.12.30: We released the survey on arXiv and this repo on GitHub! Feel free to open pull requests to add the latest work to the seasonal updates of the survey ~

📑 Table of Contents

  1. Awesome Multimodal Tokenizers
  2. Awesome MMNTP Models
  3. Awesome Multimodal Prompt Engineering
  4. Citation

Awesome Multimodal Tokenizers

Vision Tokenizer

| Paper | Time | Modality | Tokenization Type | GitHub |
| --- | --- | --- | --- | --- |
| OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation | 2024 | Image, Video | Discrete & Continuous | Star |
| TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | 2024 | Image | Discrete | Star |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Qwen2VL-ViT) | 2024 | Image, Video | Continuous | Star |
| Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction | 2024 | Image | Discrete | Star |
| SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs | 2023 | Image | Discrete | - |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | 2023 | Image | Discrete | Star |
| Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation | 2023 | Image, Video | Discrete | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | 2023 | Image | Continuous | Star |
| Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution | 2023 | Image | Continuous | - |
| Planting a SEED of Vision in Large Language Model | 2023 | Image | Discrete | Star |
| SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding | 2023 | Image | Continuous | - |
| EVA-CLIP: Improved Training Techniques for CLIP at Scale | 2023 | Image | Continuous | GitHub |
| Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | 2023 | Image | Continuous | GitHub |
| A Unified View of Masked Image Modeling | 2023 | Image | Continuous | GitHub |
| BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers | 2022 | Image | Continuous | GitHub |
| MAGVIT: Masked Generative Video Transformer | 2022 | Video | Discrete | Star |
| Phenaki: Variable Length Video Generation From Open Domain Textual Description | 2022 | Video | Discrete | - |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | 2022 | Image | Continuous | - |
| Autoregressive Image Generation using Residual Quantization | 2022 | Image | Discrete | - |
| ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning | 2022 | Image | Continuous | Star |
| FlexiViT: One Model for All Patch Sizes | 2022 | Image | Continuous | Star |
| Vector-quantized Image Modeling with Improved VQGAN | 2021 | Image | Discrete | - |
| ViViT: A Video Vision Transformer | 2021 | Video | Continuous | GitHub |
| BEiT: BERT Pre-Training of Image Transformers | 2021 | Image | Continuous | GitHub |
| High-Performance Large-Scale Image Recognition Without Normalization | 2021 | Image | Continuous | GitHub |
| Learning Transferable Visual Models From Natural Language Supervision (CLIP) | 2021 | Image | Continuous | Star |
| Taming Transformers for High-Resolution Image Synthesis | 2020 | Image | Discrete | Star |
| Generating Diverse High-Fidelity Images with VQ-VAE-2 | 2019 | Image | Discrete | Star |
| Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification | 2017 | Video | Continuous | Star |
| Neural Discrete Representation Learning (VQVAE) | 2017 | Image, Video, Audio | Discrete | - |
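
The "Discrete" entries above (VQVAE, VQGAN, and their descendants) share one core idea: replace each continuous feature vector with the index of its nearest codebook entry, so an image becomes a sequence of integer tokens an NTP model can predict. A minimal sketch of that lookup with a toy 2-D codebook (the codebook values and patches here are made up for illustration):

```python
def tokenize(vectors, codebook):
    """Map each continuous feature vector to its nearest codebook index."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]

# Toy learned codebook with three 2-D entries, and three "patch features".
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
patches = [(0.1, -0.2), (0.9, 1.1), (0.2, 0.8)]
tokens = tokenize(patches, codebook)  # one discrete token id per patch
```

Continuous tokenizers (e.g. CLIP- or ViT-style encoders) skip this lookup and feed the feature vectors to the model directly.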

Audio Tokenizer

| Paper | Time | Modality | Tokenization Type | GitHub |
| --- | --- | --- | --- | --- |
| Moshi: a speech-text foundation model for real-time dialogue (Mimi) | 2024 | Audio | Discrete | Star |
| WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling | 2024 | Audio | Discrete | Star |
| SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound | 2024 | Audio | Discrete | Star |
| NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models (FACodec) | 2024 | Audio | Discrete | - |
| SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models | 2023 | Audio | Discrete | Star |
| HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec | 2023 | Audio | Discrete | Star |
| LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models | 2023 | Audio | Discrete | - |
| High-Fidelity Audio Compression with Improved RVQGAN (DAC) | 2023 | Audio | Discrete | Star |
| Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages | 2023 | Audio | Continuous | - |
| High Fidelity Neural Audio Compression (Encodec) | 2022 | Audio | Discrete | Star |
| CLAP: Learning Audio Concepts From Natural Language Supervision | 2022 | Audio | Continuous | Star |
| Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) | 2022 | Audio | Continuous | Star |
| data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | 2022 | Audio | Continuous | Star |
| WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | 2021 | Audio | Continuous | Star |
| HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | 2021 | Audio | Continuous | Star |
| SoundStream: An End-to-End Neural Audio Codec | 2021 | Audio | Discrete | - |
| wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | 2020 | Audio | Continuous | Star |
| vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | 2019 | Audio | Discrete | Star |
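
Many of the discrete codecs above (SoundStream, Encodec, HiFi-Codec, DAC) use residual vector quantization (RVQ): quantize a frame with a coarse codebook, subtract the chosen entry, then quantize the residual with the next codebook, yielding one token per codebook per frame. A toy 1-D sketch of that cascade (the two codebooks here are invented for illustration):

```python
def rvq_encode(vector, codebooks):
    """Residual VQ: quantize, subtract, then quantize the residual again."""
    ids = []
    residual = list(vector)
    for cb in codebooks:
        # Nearest entry in this stage's codebook (squared Euclidean distance).
        k = min(range(len(cb)),
                key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, cb[i])))
        ids.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return ids, residual

coarse = [(0.0,), (1.0,)]          # stage 1: rough location
fine = [(-0.25,), (0.0,), (0.25,)]  # stage 2: refine the leftover error
ids, residual = rvq_encode((1.2,), [coarse, fine])
```

Each added stage shrinks the reconstruction error, which is why these codecs can trade bitrate for fidelity simply by keeping more or fewer stages.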

Awesome MMNTP Models

Vision Model

| Paper | Time | Modality | Model Type | Task | GitHub |
| --- | --- | --- | --- | --- | --- |
| MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | 2024 | Image | Unified | Image2Text, Text2Image | - |
| Liquid: Language Models are Scalable Multi-modal Generators | 2024 | Image | Unified | Image2Text, Text2Image | Star |
| Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis | 2024 | Image | Unified | Text2Image | Star |
| Multimodal Latent Language Modeling with Next-Token Diffusion | 2024 | Image | Unified | Image2Text, Text2Image | Star |
| Randomized Autoregressive Visual Generation (RAR) | 2024 | Image | Unified | Text2Image | Star |
| Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training (Mono-InternVL) | 2024 | Image | Unified | Image2Text | - |
| A Single Transformer for Scalable Vision-Language Modeling (SOLO) | 2024 | Image | Unified | Image2Text | - |
| Unveiling Encoder-Free Vision-Language Models (EVE) | 2024 | Image | Unified | Image2Text | Star |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Qwen2VL) | 2024 | Image | Compositional | Image2Text | Star |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (Janus) | 2024 | Image | Compositional | Image2Text, Text2Image | Star |
| Emu3: Next-Token Prediction is All You Need (Emu3) | 2024 | Image, Video | Unified | Image2Text, Text2Image, Text2Video | Star |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation (Show-o) | 2024 | Image, Video | Unified | Image2Text, Text2Image, Text2Video | Star |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (VILA-U) | 2024 | Image, Video | Unified | Image2Text, Text2Image, Text2Video | Star |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (Transfusion) | 2024 | Image | Unified | Image2Text | - |
| Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (Fluid) | 2024 | Image | Unified | Text2Image | - |
| Autoregressive Image Generation without Vector Quantization (MAR) | 2024 | Image | Unified | Text2Image | Star |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models (Chameleon) | 2024 | Image | Unified | Image2Text, Text2Image | Star |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (Mini-Gemini) | 2024 | Image | Compositional | Image2Text, Text2Image | Star |
| A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation (DnD-Transformer) | 2024 | Image | Unified | Text2Image | Star |
| Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR) | 2024 | Image | Unified | Text2Image | Star |
| Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation (LlamaGen) | 2024 | Image | Unified | Text2Image | Star |
| MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens (MiniGPT-5) | 2023 | Image | Compositional | Image2Text, Text2Image | Star |
| BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing (BLIP-Diffusion) | 2023 | Image | Compositional | Text2Image | Star |
| Kosmos-G: Generating Images in Context with Multimodal Large Language Models (Kosmos-G) | 2023 | Image | Compositional | Text2Image | Star |
| Kosmos-2: Grounding Multimodal Large Language Models to the World | 2023 | Image | Compositional | Image2Text | Star |
| Kosmos-2.5: A Multimodal Literate Model | 2023 | Image | Compositional | Image2Text | Star |
| Kosmos-E: Learning to Follow Instruction for Robotic Grasping | 2023 | Image | Compositional | Image2Text | Star |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (LaVIT) | 2023 | Image | Compositional | Image2Text, Text2Image | Star |
| Generative Multimodal Models are In-Context Learners (Emu2) | 2023 | Image | Compositional | Image2Text, Text2Image | Star |
| Generative Pretraining in Multimodality (Emu1) | 2023 | Image | Compositional | Image2Text, Text2Image | Star |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action (Unified-IO 2) | 2023 | Image, Video, Audio | Compositional | Image2Text, Text2Image, Audio2Text, Text2Audio, Text2Video | Star |
| Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) | 2023 | Image | Compositional | Image2Text | Star |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (InternVL) | 2023 | Image | Compositional | Image2Text | Star |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (QwenVL) | 2023 | Image | Compositional | Image2Text | Star |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models (Molmo) | 2024 | Image | Compositional | Image2Text | - |
| Fuyu-8B: A Multimodal Architecture for AI Agents (Fuyu) | 2023 | Image | Unified | Image2Text | - |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (BLIP-2) | 2023 | Image | Compositional | Image2Text | Star |
| Visual Instruction Tuning (LLaVA) | 2023 | Image | Compositional | Image2Text | Star |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (MiniGPT-4) | 2023 | Image | Compositional | Image2Text | - |
| Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (Unified-IO) | 2022 | Image | Compositional | Image2Text, Text2Image | - |
| Zero-Shot Text-to-Image Generation (DALL-E) | 2021 | Image | Unified | Text2Image | - |
| Language Models are General-Purpose Interfaces | 2022 | Image | Compositional | Image2Text | Star |
| Flamingo: a Visual Language Model for Few-Shot Learning (Flamingo) | 2022 | Image | Compositional | Image2Text | - |

Audio Model

| Paper | Time | Modality | Model Type | Task | GitHub |
| --- | --- | --- | --- | --- | --- |
| VoxtLM: Unified Decoder-Only Models for Consolidating Speech Recognition, Synthesis and Speech, Text Continuation Tasks (VoxtLM) | 2024 | Audio | Unified | A2T, T2A, A2A, T2T | - |
| Moshi: a speech-text foundation model for real-time dialogue (Moshi) | 2024 | Audio | Unified | A2A | Star |
| Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming (Mini-Omni) | 2024 | Audio | Compositional | A2A | Star |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models (LLaMA-Omni) | 2024 | Audio | Compositional | A2A | Star |
| SpeechVerse: A Large-scale Generalizable Audio Language Model (SpeechVerse) | 2024 | Audio | Compositional | A2T | - |
| Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities (Audio Flamingo) | 2024 | Audio | Compositional | A2T | Star |
| WavLLM: Towards Robust and Adaptive Speech Large Language Model (WavLLM) | 2024 | Audio | Compositional | A2T | Star |
| MELLE: Autoregressive Speech Synthesis without Vector Quantization | 2024 | Audio | Unified | T2A | - |
| Seed-TTS: A Family of High-Quality Versatile Speech Generation Models (Seed-TTS) | 2024 | Audio | Compositional | T2A | - |
| FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications (FireRedTTS) | 2024 | Audio | Compositional | T2A | Star |
| CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens (CosyVoice) | 2024 | Audio | Compositional | T2A | Star |
| UniAudio: An Audio Foundation Model Toward Universal Audio Generation (UniAudio) | 2024 | Audio | Unified | T2A, A2A | Star |
| BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data (BASE TTS) | 2024 | Audio | Unified | T2A | - |
| VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (VoiceCraft) | 2024 | Audio | Unified | T2A | Star |
| SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities (SpeechGPT) | 2023 | Audio | Unified | A2T, T2A, A2A, T2T | Star |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT (LauraGPT) | 2023 | Audio | Unified | A2T, T2A, A2A, T2T | - |
| VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation (VioLA) | 2023 | Audio | Compositional | A2T, T2A, A2A, T2T | - |
| AudioPaLM: A Large Language Model That Can Speak and Listen (AudioPaLM) | 2023 | Audio | Compositional | A2T, T2A, A2A | - |
| Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models (Qwen-Audio) | 2023 | Audio | Compositional | A2T | Star |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models (SALMONN) | 2023 | Audio | Compositional | A2T | Star |
| On decoder-only architecture for speech-to-text and large language model integration (SpeechLLaMA) | 2023 | Audio | Compositional | A2T | - |
| Listen, Think, and Understand (LTU) | 2023 | Audio | Compositional | A2T | Star |
| Pengi: An Audio Language Model for Audio Tasks (Pengi) | 2023 | Audio | Compositional | A2T | Star |
| Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning (MU-LLaMA) | 2023 | Audio | Compositional | A2T | - |
| SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts (SpeechGen) | 2023 | Audio | Unified | T2A | Star |
| Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) | 2023 | Audio | Compositional | T2A | Star |
| Simple and Controllable Music Generation (MusicGen) | 2023 | Audio | Unified | T2A | Star |
| Make-A-Voice: Unified Voice Synthesis With Discrete Representation (Make-A-Voice) | 2023 | Audio | Compositional | T2A | - |
| Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision (SPEAR-TTS) | 2023 | Audio | Compositional | T2A | - |
| AudioGen: Textually Guided Audio Generation (AudioGen) | 2022 | Audio | Unified | T2A | - |
| AudioLM: a Language Modeling Approach to Audio Generation (AudioLM) | 2022 | Audio | Compositional | A2A | - |
| Generative Spoken Language Modeling from Raw Audio (GSLM) | 2021 | Audio | Unified | A2A | - |

Awesome Multimodal Prompt Engineering

Multimodal ICL

| Paper | Time | Modality | GitHub |
| --- | --- | --- | --- |
| Multimodal Few-Shot Learning with Frozen Language Models (Frozen) | 2021 | Image | - |
| Flamingo: a Visual Language Model for Few-Shot Learning (Flamingo) | 2022 | Image | - |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning (MMICL) | 2023 | Image | Star |
| Efficient In-Context Learning in Vision-Language Models for Egocentric Videos (EILeV) | 2023 | Image | Star |
| OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models (OpenFlamingo) | 2023 | Image | Star |
| Link-Context Learning for Multimodal LLMs (LCL) | 2023 | Image | Star |
| Med-Flamingo: a Multimodal Medical Few-shot Learner (Med-Flamingo) | 2023 | Image | Star |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning (MIMIC-IT) | 2023 | Image | Star |
| Sequential Modeling Enables Scalable Learning for Large Vision Models (LVM) | 2023 | Image | Star |
| World Model on Million-Length Video And Language With Blockwise RingAttention (LWM) | 2023 | Image, Video | Star |
| Exploring Diverse In-Context Configurations for Image Captioning (Yang et al.) | 2024 | Image | Star |
| Visual In-Context Learning for Large Vision-Language Models (VisualICL) | 2024 | Image | - |
| Many-Shot In-Context Learning in Multimodal Foundation Models (Many-Shot ICL) | 2024 | Image | Star |
| Can MLLMs Perform Text-to-Image In-Context Learning? (CoBSAT) | 2024 | Image | Star |
| Video In-context Learning (Video ICL) | 2024 | Video | Star |
| Generative Pretraining in Multimodality (Emu) | 2024 | Image, Video | Star |
| Generative Multimodal Models are In-Context Learners (Emu2) | 2024 | Image, Video | Star |
| Towards More Unified In-context Visual Understanding (Sheng et al.) | 2024 | Image | - |
| Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) | 2023 | Audio | Star |
| MELLE: Autoregressive Speech Synthesis without Vector Quantization (MELLE) | 2024 | Audio | - |
| Seed-TTS: A Family of High-Quality Versatile Speech Generation Models (Seed-TTS) | 2024 | Audio | - |
| Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities (Audio Flamingo) | 2024 | Audio | Star |
| Moshi: a speech-text foundation model for real-time dialogue (Moshi) | 2024 | Audio | Star |

Multimodal CoT

| Paper | Time | Modality | GitHub |
| --- | --- | --- | --- |
| WavLLM: Towards Robust and Adaptive Speech Large Language Model (WavLLM) | 2024 | Audio | Star |
| SpeechVerse: A Large-scale Generalizable Audio Language Model (SpeechVerse) | 2024 | Audio | - |
| CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought | 2024 | Audio | Star |
| Chain-of-Thought Prompting for Speech Translation | 2024 | Audio | - |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition | 2024 | Video | Star |
| VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool | 2024 | Video | - |
| Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning | 2024 | Image | Star |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | 2024 | Image | Star |
| Compositional Chain-of-Thought Prompting for Large Multimodal Models | 2023 | Image | Star |
| V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs | 2023 | Image | Star |
| DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | 2023 | Image | Star |
| Visual Chain-of-Thought Diffusion Models | 2023 | Image | - |
| Multimodal Chain-of-Thought Reasoning in Language Models | 2023 | Image | Star |

Citation

If you find our work helpful, please kindly cite the paper :)

@misc{chen2024tokenpredictionmultimodalintelligence,
      title={Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey}, 
      author={Liang Chen and Zekun Wang and Shuhuai Ren and Lei Li and Haozhe Zhao and Yunshui Li and Zefan Cai and Hongcheng Guo and Lei Zhang and Yizhe Xiong and Yichi Zhang and Ruoyu Wu and Qingxiu Dong and Ge Zhang and Jian Yang and Lingwei Meng and Shujie Hu and Yulong Chen and Junyang Lin and Shuai Bai and Andreas Vlachos and Xu Tan and Minjia Zhang and Wen Xiao and Aaron Yee and Tianyu Liu and Baobao Chang},
      year={2024},
      eprint={2412.18619},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.18619}, 
}