OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis

May 9, 2026

[📖 arXiv Paper] [📊 Datasets] [🏆 Models]
OpenOmni is a pioneering, fully open-source, end-to-end method that successfully incorporates image, speech, and text into a single omnimodal large language model. Its speech generation design, based on language bridging and text-guided speech, can be trained quickly even when omnimodal data and VRAM are scarce. OpenOmni supports omnimodal understanding as well as two real-time emotional speech generation modes, CTC mode and AR mode, so users can choose between them to balance generation speed and quality. The flexible framework design allows OpenOmni to be applied easily and quickly to a variety of downstream tasks, such as speech-based embodied navigation and multi-role-playing speech dialogue. Everyone is welcome to try it now!

🔥 Update

  • [2025/09/22] 🔥 After a year of community evaluation, our work has been accepted by NeurIPS 2025. Congratulations!
  • [2025/05/26] 🔥 Our [OmniCharacter], built on the MMEvol and OpenOmni series, has been accepted to the main track of ACL 2025. You're all welcome to give it a try!
  • [2025/05/15] 🔥 Two papers based on our findings (LLaMA-Omni2 and OmniCharacter) have been accepted to the ACL 2025 main track. We warmly welcome everyone to use our work.
  • [2025/05/05] 🔥 Our gate fusion technology for more accurate speech content generation has been adopted by LLaMA-Omni2.
  • [2025/02/12] 🔥 Added some missing files and fixed bugs.
  • [2025/01/13] 🔥 OpenOmni is coming! We release the code, model, and data.
  • [2025/01/09] 🔥 After two months of company audit, we release the paper.
  • [2024/11/14] 🔥 We submit the paper for peer review on OpenReview.
  • [2024/09/15] 🔥 We write the first line of the OpenOmni project, a pioneering, fully open-source, end-to-end OmniLLM.

👀 Contents

  • Setup
  • Model
  • Preparation
  • Train
  • Evaluation
  • Example
  • Citation

📷 Setup

Please follow the instructions below to install the required packages.

  1. Clone this repository
git clone https://github.com/RainBowLuoCS/OpenOmni.git
cd OpenOmni
  2. Install the package
conda create -n openomni python=3.10 -y
conda activate openomni
pip install --upgrade pip  # enable PEP 660 support
pip install -e ".[train]"
pip install -r requirements.txt
  3. Install additional packages for training
pip install flash-attn --no-build-isolation
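
After the three steps above, a quick sanity check of the environment can save a failed training launch. The following is a minimal sketch, not part of the repository: the package list is an assumption (`torch` is assumed to be pulled in by the requirements, and `flash_attn` is the import name of the `flash-attn` package installed in the last step).

```python
import importlib.util
import sys

# Assumed package list: "torch" is presumed to come from requirements.txt,
# "flash_attn" is the import name of the flash-attn package.
EXPECTED_PACKAGES = ["torch", "flash_attn"]


def missing_packages(names):
    """Return the top-level packages in `names` that are not importable."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(required=(3, 10)):
    """Check that the interpreter matches the version used in `conda create`."""
    return sys.version_info[:2] == required


if __name__ == "__main__":
    print("Python 3.10:", python_version_ok())
    print("Missing packages:", missing_packages(EXPECTED_PACKAGES) or "none")
```

Run it inside the activated `openomni` environment; an empty "missing" list suggests the installs succeeded, although it does not verify GPU or CUDA compatibility.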

🔥 Fast Usage

After downloading the weights and configuring the paths properly, two open-source speech tokenizers are needed for speech discretization and reconstruction, each with a different vocabulary size: CosyVoice for the 6K CTC mode and GLM4Voice for the 16K AR mode.
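
The pairing between generation mode and speech tokenizer can be written down as a small lookup. This is only an illustrative sketch: the `SpeechMode` class, the dictionary keys, and `select_mode` are hypothetical names that do not come from the OpenOmni code, and the vocabulary sizes are kept as the nominal labels given above rather than exact counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechMode:
    tokenizer: str    # speech tokenizer named in the text above
    vocab_label: str  # nominal vocabulary size as stated, not an exact count


# Hypothetical lookup table; keys are illustrative, values follow the text.
SPEECH_MODES = {
    "ctc": SpeechMode(tokenizer="CosyVoice", vocab_label="6K"),
    "ar": SpeechMode(tokenizer="GLM4Voice", vocab_label="16K"),
}


def select_mode(name: str) -> SpeechMode:
    """Look up a generation mode case-insensitively, failing loudly on typos."""
    key = name.strip().lower()
    if key not in SPEECH_MODES:
        raise ValueError(f"unknown mode {name!r}; expected one of {sorted(SPEECH_MODES)}")
    return SPEECH_MODES[key]


if __name__ == "__main__":
    mode = select_mode("CTC")
    print(mode.tokenizer, mode.vocab_label)
```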

Fast inference for omnimodal input (speech, text, image, and video)

python inference.py

Fast interaction for omnimodal input (speech, text, image, and video)

python demo.py

Model

Here are the pretrained weights and instruction tuning weights

| Stage | Model | Speech Projector | Image Projector | IT Data | Download |
|---|---|---|---|---|---|
| 1-1 | OpenOMNI-Qwen2-7B-Stage1-1 | ckpt | ckpt | openomni_stage1-1.json | ckpt |
| 2-1 | OpenOMNI-Qwen2-7B-Stage2-1 | ckpt | ckpt | openomni_stage2-1.json | ckpt |
| 2-2 | OpenOMNI-Qwen2-7B-Stage2-2 | ckpt | ckpt | openomni_stage2-2.json | ckpt |
| 3-1 | OpenOMNI-Qwen2-7B-Stage3-1 | ckpt | ckpt | openomni_stage3-1.json | ckpt |
| 3-2 | OpenOMNI-Qwen2-7B-Stage3-2 | ckpt | ckpt | openomni_stage3-2.json | ckpt |
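
The checkpoint and data file names in the table follow a regular pattern keyed by the stage number. A small helper along these lines can keep scripts consistent; `stage_artifacts` is a hypothetical function written for illustration, and it only builds names, it does not download anything.

```python
# Stage identifiers exactly as they appear in the table above.
STAGES = ["1-1", "2-1", "2-2", "3-1", "3-2"]


def stage_artifacts(stage: str) -> dict:
    """Return the model name and instruction-tuning data file for a stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    return {
        "model": f"OpenOMNI-Qwen2-7B-Stage{stage}",
        "it_data": f"openomni_stage{stage}.json",
    }


if __name__ == "__main__":
    for stage in STAGES:
        print(stage, stage_artifacts(stage))
```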

Preparation

Dataset

Please follow MMEvol to prepare the corresponding image-text datasets. Here we only provide the details of the speech-text datasets.

The following is the data directory tree of OpenOmni

data structure

datasets
├── json # data recipe
│   ├── openomni_stage1-1.json # speech2text pretraining
│   ├── openomni_stage2-1.json # image2text pretraining
│   ├── openomni_stage2-2.json # image2text instruction tuning
│   ├── openomni_stage3-1.json # text2speech pretraining
│   ├── openomni_stage3-2.json # text2speech emotional injection
├── asr # classic bilingual speech corpus
│   ├── AISHELL-4
│   ├── LibriSPeech
│   ├── WeNetSpeech
├── audio_en # synthetic English speech corpus for question
├── audio_llava # synthetic bilingual speech corpus for answer
├── audio_zh # synthetic Chinese speech corpus for question
├── audio_unit # synthetic bilingual speech corpus for answer
├── audio_prefer # synthetic emotional bilingual speech corpus for answer
├── audio_reject # synthetic emotional bilingual speech corpus for answer
├── audio_ultrachat # synthetic bilingual speech corpus for answer
├── ai2d
│   ├── abc_images
│   ├── annotations
│   ├── images
│   ├── questions
│   └── categories.json
......


  • All files and paths starting with "audio" are self-synthesized.
  • The DPO data contains approximately 9k entries for "prefer" and "reject", covering 9 types of emotions.

More details about data curation can be found in our paper.
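
Before launching a long training run, it can help to confirm that the directory tree above is in place. The sketch below checks only the speech-related entries listed in the tree; `missing_entries` and the `datasets` root path are illustrative assumptions, and the image-text folders prepared via MMEvol are not covered.

```python
from pathlib import Path

# Relative paths copied from the directory tree above (speech-text part only).
EXPECTED_ENTRIES = [
    "json/openomni_stage1-1.json",
    "json/openomni_stage2-1.json",
    "json/openomni_stage2-2.json",
    "json/openomni_stage3-1.json",
    "json/openomni_stage3-2.json",
    "asr/AISHELL-4",
    "asr/LibriSPeech",
    "asr/WeNetSpeech",
    "audio_en",
    "audio_llava",
    "audio_zh",
    "audio_unit",
    "audio_prefer",
    "audio_reject",
    "audio_ultrachat",
]


def missing_entries(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]


if __name__ == "__main__":
    # "datasets" is an assumed location; point this at your own data root.
    for rel in missing_entries("datasets"):
        print("missing:", rel)
```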

Train

Speech2Text Pretrain

Please download MMEvol, AIShell-4, LibriSpeech, WeNetSpeech, and the OpenOmni data, and organize them as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyperparameters).

bash scripts/train/llama3/speech2text_pretrain.sh
bash scripts/train/qwen2/speech2text_pretrain.sh

Image2Text Pretrain

Please make sure you have downloaded and organized the data as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyperparameters).

bash scripts/train/llama3/image2text_pretrain.sh
bash scripts/train/qwen2/image2text_pretrain.sh

Image2Text Instruction Tuning

Please make sure you have downloaded and organized the data as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyperparameters).

bash scripts/train/llama3/image2text_finetune.sh
bash scripts/train/qwen2/image2text_finetune.sh

Text2Speech Pretrain

Please make sure you have downloaded and organized the data as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyperparameters).

bash scripts/train/llama3/text2speech_pretrain.sh
bash scripts/train/qwen2/text2speech_pretrain.sh

Text2Speech Emotional DPO Tuning

Please make sure you have downloaded and organized the data as described in Preparation before training. Make sure the corresponding training script is set up with the correct settings (data path, weight path, and hyperparameters).

bash scripts/train/llama3/text2speech_dpo.sh
bash scripts/train/qwen2/text2speech_dpo.sh
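
The five training stages in this section can be chained with a small driver. The following is a sketch under stated assumptions: it assumes the stages are meant to run in the order they are listed here, that the script file names follow the llama3 listing for both backbones, and that it is run from the repository root. `build_commands` and `run_all` are hypothetical helpers, not part of the repository.

```python
import subprocess

# Stage order and script base names as listed in this section (assumed to be
# the intended execution order).
TRAIN_STAGES = [
    "speech2text_pretrain",
    "image2text_pretrain",
    "image2text_finetune",
    "text2speech_pretrain",
    "text2speech_dpo",
]


def build_commands(backbone: str) -> list[list[str]]:
    """Build one `bash <script>` command per stage for the given backbone."""
    if backbone not in ("llama3", "qwen2"):
        raise ValueError(f"unknown backbone {backbone!r}")
    return [["bash", f"scripts/train/{backbone}/{stage}.sh"] for stage in TRAIN_STAGES]


def run_all(backbone: str, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in build_commands(backbone):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing stage


if __name__ == "__main__":
    run_all("qwen2", dry_run=True)
```

The default `dry_run=True` only prints the commands, which makes it easy to review paths before committing GPU time.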

Evaluation

Dataset

Ensure that your api_base, key and dataset are correctly configured before evaluation.

data structure

datasets
├── json # data recipe
│   ├── aishell2_eval.jsonl # AIShell-2 evaluation
│   ├── librispeech_eval.jsonl # LibriSpeech evaluation
│   ├── wenetspeech_eval.json # WeNetSpeech evaluation
│   ├── openomni_emotion_val.json
├── OmniBench # OmniBench
│   ├── mmdata
│   ├── dataset
│   │   ├── eval.json
├── Ov-Odyssey # Ov-Odyssey Bench
│   ├── av_odyssey_part1.parquet
│   ├── av_odyssey_part2.parquet
│   ├── av_odyssey_part3.parquet
│   ├── av_odyssey_part4.parquet
│   ├── av_odyssey_part5.parquet


Speech-Text Evaluation

Make sure the corresponding evaluation script is set up with the correct settings (data path, weight path, and hyperparameters).

python openomni/eval/llama3/asr_eavl.py
python openomni/eval/qwen2/asr_eavl.py
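
| Model | LibriSpeech-test-clean | LibriSpeech-test-other | AIShell2-dev | AIShell2-test | WeNetSpeech-testnet | WeNetSpeech-testmeeting |
|---|---|---|---|---|---|---|
| VITA | 8.1 | 18.4 | - | - | 12.2 | 16.5 |
| EMOVA | 4.0 | 8.6 | 10.6 | 10.3 | - | - |
| MINI-OMNI | 4.5 | 9.7 | - | - | - | - |
| Freeze-Omni | 3.29 | 7.4 | - | - | 8.57 | 10.09 |
| ours | 2.57 | 5.6 | 6.81 | 6.87 | 7.63 | - |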
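
For readers unfamiliar with how ASR scores of this kind are usually computed, here is a minimal sketch of an edit-distance error rate. It assumes the table reports word error rate for English and character error rate for Chinese, which is the common convention but is not stated in this README; the functions below are illustrative and are not the repository's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Return the error rate in percent; unit='word' gives WER, 'char' gives CER."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(error_rate("a b c d", "a x c d"))  # 25.0
```

Lower is better for both metrics, and real evaluations typically normalize text (casing, punctuation) before scoring, which this sketch omits.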

Image-Text Evaluation

Refer to MMEvol for detailed OpenCompass vision-language evaluation.

# run on all 12 datasets
./script/run_inference.sh OpenOmni-Qwen "MME MMMU_DEV_VAL MathVista_MINI LLaVABench RealWorldQA MMStar MMVet AI2D_TEST OCRBench HallusionBench POPE BLINK" all

# The following are instructions for running on a single dataset
# MME
./script/run_inference.sh OpenOmni-Qwen MME all
# MMMU_DEV_VAL
./script/run_inference.sh OpenOmni-Qwen MMMU_DEV_VAL all
# MathVista_MINI
./script/run_inference.sh OpenOmni-Qwen MathVista_MINI all
.....
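
The per-dataset commands elided above all follow the same shape, so they can be generated rather than typed out. This is a small sketch that only builds and prints command lines from the dataset names used in the combined command; `per_dataset_commands` is a hypothetical helper, and nothing is executed.

```python
import shlex

# Dataset names copied from the combined command above.
DATASETS = [
    "MME", "MMMU_DEV_VAL", "MathVista_MINI", "LLaVABench", "RealWorldQA",
    "MMStar", "MMVet", "AI2D_TEST", "OCRBench", "HallusionBench", "POPE", "BLINK",
]


def per_dataset_commands(model: str = "OpenOmni-Qwen") -> list[list[str]]:
    """Build one run_inference.sh invocation per dataset."""
    return [["./script/run_inference.sh", model, name, "all"] for name in DATASETS]


if __name__ == "__main__":
    for cmd in per_dataset_commands():
        print(shlex.join(cmd))
```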

Speech-Text-Image Evaluation

Please download OmniBench and run the following command

python openomni/eval/llama3/omni_eavl.py
python openomni/eval/qwen2/omni_eavl.py

Speech-Text-Image-Video Evaluation

Please download Ov-Odyssey and run the following command

python openomni/eval/llama3/ov_odyssey_eavl.py
python openomni/eval/qwen2/ov_odyssey_eavl.py

Text-Speech Evaluation

python openomni/eval/llama3/t2s_eavl.py
python openomni/eval/qwen2/t2s_eavl.py

Emotional Text-Speech Evaluation

python openomni/eval/llama3/et2s_eavl.py
python openomni/eval/qwen2/et2s_eavl.py

📌 Cases of text-to-speech

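四是四，十是十，十四是十四，四十是四十。 (Four is four, ten is ten, fourteen is fourteen, forty is forty.)

黑化肥发灰，灰化肥发黑，黑化肥发灰会挥发，灰化肥挥发会发黑。 (Black fertilizer turns gray, gray fertilizer turns black; black fertilizer that turns gray evaporates, gray fertilizer that evaporates turns black.)

吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮。 (Eat grapes without spitting out the skins; don't eat grapes, yet spit out the skins.)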

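八百标兵奔北坡，炮兵并排北边跑，炮兵怕把标兵碰，标兵怕碰炮兵炮。 (Eight hundred pacesetters rush to the north slope, the artillerymen run side by side to the north; the artillerymen fear bumping into the pacesetters, and the pacesetters fear bumping into the artillerymen's cannons.)

红凤凰，黄凤凰，粉红凤凰，花凤凰。 (Red phoenix, yellow phoenix, pink phoenix, flowery phoenix.)

牛郎年年恋刘娘，刘娘念念恋牛郎。 (The cowherd longs for Lady Liu year after year; Lady Liu thinks of the cowherd constantly.)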

She sells seashells by the seashore.

Peter Piper picked a peck of pickled peppers.

Six slippery snails slid slowly seaward.

en_0.webm

en_1.webm

en_2.webm

Six sleek swans swam swiftly southwards.

I saw Susie sitting in a shoeshine shop.

Can you can a can as a canner can can a can?

en_3.webm

en_4.webm

en_5.webm

📌 Cases of text-to-emotional-speech

I am so sad.

why are you doing this to me.

what a nice day.

i am very scared.

en_sad.webm

en_angry.webm

en_happy.webm

en_fearful.webm

我真的很难过. (I am really sad.)

你为什么要这样，我真的很生气. (Why are you doing this? I am really angry.)

今天天气真好. (The weather is really nice today.)

我真有点害怕. (I am really a bit scared.)

zh_sad.webm

zh_angry.webm

zh_happy.webm

zh_fearful.webm

📚 Video Demo

https://github.com/user-attachments/assets/cd679b7c-9f9d-4631-a1f5-96b1428a8ad4

📚 Citation

If you find this repo useful for your research, please consider citing the paper

@article{luo2025openomni,
  title={Openomni: Advancing open-source omnimodal large language models with progressive multimodal alignment and real-time self-aware emotional speech synthesis},
  author={Luo, Run and Lin, Ting-En and Zhang, Haonan and Wu, Yuchuan and Liu, Xiong and Yang, Min and Li, Yongbin and Chen, Longze and Li, Jiaming and Zhang, Lei and others},
  journal={arXiv preprint arXiv:2501.04561},
  year={2025}
}
@article{luo2024mmevol,
  title={Mmevol: Empowering multimodal large language models with evol-instruct},
  author={Luo, Run and Zhang, Haonan and Chen, Longze and Lin, Ting-En and Liu, Xiong and Wu, Yuchuan and Yang, Min and Wang, Minzheng and Zeng, Pengpeng and Gao, Lianli and others},
  journal={ACL 2025},
  year={2024}
}
@article{zhang2025omnicharacter,
  title={OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction},
  author={Zhang, Haonan and Luo, Run and Liu, Xiong and Wu, Yuchuan and Lin, Ting-En and Zeng, Pengpeng and Qu, Qiang and Fang, Feiteng and Yang, Min and Gao, Lianli and others},
  journal={ACL 2025},
  year={2025}
}
@article{zhang2025omnicharacter++,
  title   = {OmniCharacter++: Towards Comprehensive Benchmark for Realistic Role-Playing Agents},
  author  = {Haonan Zhang},
  journal = {TPAMI 2026},
  year = {2026}
}

📧 Contact

If you have any questions, please consider the following contacts for help.

Acknowledgement

- LLaVA and LLaVA-Omni: the codebases we built upon. Thanks for their brilliant contributions to the community! We just can't wait to use OpenOmni.

- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!

- CosyVoice: the amazing open-source speech tokenizer for speech discretization and reconstruction with a 6K vocabulary!

- GLM4Voice: the amazing open-source speech tokenizer for speech discretization and reconstruction with a 16K vocabulary!