HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding

March 10, 2025

News

  1. Building on HumanOmni, we are the first to combine RLVR (Reinforcement Learning with Verifiable Rewards) with an Omni model, introducing R1-Omni, a reasoning-based large multimodal model.

📖 Introduction

HumanOmni is the industry’s first human-centric Omni-multimodal large language model for comprehensive understanding in human-centric scenes.

  1. 2.4M human-centric video clips with over 14M double-checked instructions: We have constructed a dataset of over 2.4M human-centric video clips that provides rich, detailed information about individuals, along with over 14M instruction samples for visual pretraining.
  2. 50K video clips with more than 100K manually annotated instructions: We have manually annotated 50K video clips with more than 100K instructions covering emotion recognition, facial description, and speaker-specific speech recognition, used for visual fine-tuning and cross-modal interaction integration.
  3. Three human-specific branches: HumanOmni uses three branches to handle face-related, body-related, and interaction-related scenes separately. It dynamically adjusts the fusion weights of these branches based on the input instruction, ensuring accurate responses across various scenes.
  4. Audio-visual synergy: HumanOmni can simultaneously understand vision and speech, allowing for a more comprehensive understanding of complex scenes.
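
The instruction-driven fusion of the three branches can be pictured as a gated weighted sum. The snippet below is only a minimal sketch under stated assumptions: the README does not specify the gating mechanism, so the softmax gate, the function name `fuse_branches`, the toy feature vectors, and the gate scores are all hypothetical and are not taken from the HumanOmni code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branch_features, gate_logits):
    """Return (fused_vector, weights) as a weighted sum of branch features.

    branch_features: one equal-length feature vector per branch.
    gate_logits: one score per branch. In this sketch they stand in for
    scores derived from the instruction (an assumption, not the real model).
    """
    weights = softmax(gate_logits)
    fused = [0.0] * len(branch_features[0])
    for weight, features in zip(weights, branch_features):
        for i, value in enumerate(features):
            fused[i] += weight * value
    return fused, weights


# Hypothetical toy features for the face, body, and interaction branches.
face, body, interaction = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

# Hypothetical gate scores for a face-oriented instruction:
# the face branch receives the largest weight.
fused, weights = fuse_branches([face, body, interaction], [2.0, 0.0, 0.0])
print(weights, fused)
```

With equal gate scores the result reduces to a plain average of the three branches, which is a handy sanity check for any gating scheme of this kind.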

📦 Model Download

| Model | Stage | #Params | HuggingFace | ModelScope |
|---|---|---|---|---|
| HumanOmni-Video | Visual Capability Construction | 7B | HF | MS |
| HumanOmni-Audio | Auditory Capability Development | 7B | HF | MS |
| HumanOmni-Omni | Cross-Modal Interaction Integration | 7B | HF | MS |

Our training pipeline consists of three progressive stages to establish multimodal understanding capabilities:

📹 Visual Capability Construction

  • Model: HumanOmni-Video
  • Objective: Learn spatio-temporal feature representations to analyze human actions and scene dynamics in videos.

🎧 Auditory Capability Development

  • Model: HumanOmni-Audio
  • Objective: Develop robust speech comprehension and audio interpretation through large-scale acoustic modeling.

🌐 Cross-Modal Interaction Integration

  • Model: HumanOmni-Omni (also referred to as HumanOmni)
  • Objective: Enable synergistic vision-audio reasoning by fine-tuning parameters from both HumanOmni-Video and HumanOmni-Audio.

🏆 Performance

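  • Emotion Understanding:

| Method | Modalities | DFEW (UAR) | DFEW (WAR) | MAFW (UAR) | MAFW (WAR) |
|---|---|---|---|---|---|
| Specialized models for emotion-related tasks | | | | | |
| Wav2Vec2.0 | A | 36.15 | 43.05 | 21.59 | 29.69 |
| HuBERT | A | 35.98 | 43.24 | 25.00 | 32.60 |
| DFER-CLIP | V | 59.61 | 71.25 | 38.89 | 52.55 |
| MAE-DFER | V | 63.41 | 74.43 | 41.62 | 54.31 |
| HiCMAE | AV | 63.76 | 75.01 | 42.65 | 56.17 |
| Emotion-LLaMA | AV | 64.21 | 77.06 | - | - |
| MMA-DFER | AV | 66.85 | 77.43 | 44.25 | 58.45 |
| Other models | | | | | |
| Qwen2-VL-7B | V | 43.08 | 52.83 | 31.67 | 45.89 |
| Qwen2-VL-72B | V | 39.24 | 45.12 | 42.61 | 46.07 |
| VITA | AV | 21.36 | 32.07 | 14.05 | 33.38 |
| InternLM-XComposer-2.5-OL | AV | 44.23 | 51.29 | 33.78 | 46.81 |
| GPT4-O | AV | 50.57 | 57.19 | 38.29 | 48.82 |
| HumanOmni | AV | 74.86 | 82.46 | 52.94 | 68.40 |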
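
For readers unfamiliar with the UAR and WAR columns: UAR (unweighted average recall) is the mean of the per-class recalls, while WAR (weighted average recall) is the overall fraction of correctly classified samples. The following is a small illustrative sketch with made-up labels, not the evaluation script used for the table; the function name `uar_war` is hypothetical. The tables report these values as percentages.

```python
from collections import defaultdict


def uar_war(y_true, y_pred):
    """Compute (UAR, WAR) as fractions in [0, 1].

    UAR: mean of per-class recall, so every class counts equally.
    WAR: correct predictions divided by all samples, so frequent
    classes dominate.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    uar = sum(recalls) / len(recalls)
    war = sum(correct.values()) / len(y_true)
    return uar, war


# Made-up, imbalanced example: the rare "sad" clip is misclassified.
y_true = ["happy", "happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "happy", "happy"]
print(uar_war(y_true, y_pred))  # (0.5, 0.8)
```

The gap between the two numbers in this toy case shows why both are reported: WAR can look strong on imbalanced data even when a minority emotion is never recognized.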
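
| Method | Correctness | Detail | Context | Temporal | CIDEr | Rouge-L | AutoDQ |
|---|---|---|---|---|---|---|---|
| Vision large language model | | | | | | | |
| VideoLLaMA | 3.60 | 3.67 | 3.84 | 3.50 | 0.189 | 0.196 | 0.303 |
| VideoChat | 3.47 | 3.52 | 3.92 | 3.38 | 0.251 | 0.192 | 0.344 |
| VideoChat2 | 3.70 | 3.56 | 4.16 | 3.52 | 0.202 | 0.229 | 0.311 |
| Chat-UniVI | 3.64 | 3.63 | 4.21 | 3.61 | 0.189 | 0.231 | 0.396 |
| LLaVA-Next-Video | 4.19 | 4.07 | 4.39 | 4.04 | 0.250 | 0.249 | 0.395 |
| ShareGPT4Video | 4.24 | 4.13 | 4.35 | 4.09 | 0.192 | 0.205 | 0.394 |
| LLaMA-VID | 3.95 | 4.01 | 4.22 | 3.71 | 0.195 | 0.231 | 0.339 |
| VideoLLaMA2 | 4.17 | 4.02 | 4.47 | 3.93 | 0.253 | 0.266 | 0.344 |
| PLLaVA | 4.21 | 4.15 | 4.37 | 4.08 | 0.268 | 0.250 | 0.393 |
| ST-LLM | 4.00 | 3.98 | 4.31 | 3.94 | 0.213 | 0.238 | 0.321 |
| Tarsier | 3.59 | 3.50 | 4.07 | 3.41 | 0.143 | 0.185 | 0.415 |
| LLaVA-OneVision | 3.68 | 3.47 | 4.10 | 3.42 | 0.115 | 0.165 | 0.379 |
| FaceTrack-MM | 4.42 | 4.30 | 4.60 | 4.26 | 0.418 | 0.473 | 0.483 |
| Qwen2-VL-72B | 4.28 | 4.14 | 4.55 | 4.08 | 0.241 | 0.314 | 0.449 |
| Qwen2-VL-7B | 4.23 | 4.16 | 4.52 | 4.02 | 0.204 | 0.233 | 0.422 |
| Qwen2-VL-2B | 4.01 | 3.98 | 4.37 | 3.88 | 0.202 | 0.221 | 0.406 |
| Claude3.5-Sonnet | 4.13 | 4.01 | 4.49 | 4.05 | 0.243 | 0.228 | 0.442 |
| Omni-modality large language model | | | | | | | |
| GPT4-O | 4.22 | 3.97 | 4.48 | 3.90 | 0.264 | 0.213 | 0.432 |
| VITA | 3.98 | 3.74 | 4.11 | 3.59 | 0.191 | 0.224 | 0.366 |
| InternLM-XComposer-2.5-OL | 3.91 | 3.70 | 4.12 | 3.54 | 0.113 | 0.164 | 0.382 |
| HumanOmni | 4.58 | 4.41 | 4.70 | 4.41 | 0.412 | 0.468 | 0.523 |
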
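  • Action and Pose Understanding:

| Method | Action Sequence | Unexpected Action | Action Antonym | Object Interaction | Action Count | Fine-grained Action | Avg |
|---|---|---|---|---|---|---|---|
| Vision large language model | | | | | | | |
| Otter-V | 23.0 | 29.5 | 27.5 | 28.0 | 26.0 | 27.0 | 26.8 |
| mPLUG-Owl-V | 22.0 | 29.0 | 34.0 | 27.0 | 31.5 | 29.0 | 28.8 |
| Video-LLaMA | 27.5 | 39.0 | 51.0 | 40.5 | 34.0 | 29.0 | 36.8 |
| LLaMA-Adapter | 23.0 | 33.0 | 51.0 | 32.5 | 29.0 | 30.0 | 33.1 |
| Video-ChatGPT | 23.5 | 26.5 | 62.0 | 28.0 | 30.5 | 22.5 | 32.2 |
| VideoChat | 33.5 | 40.5 | 56.0 | 40.5 | 35.0 | 33.5 | 39.8 |
| VideoChat2 | 75.5 | 60.5 | 83.5 | 74.5 | 37.0 | 50.5 | 63.6 |
| ST-LLM | 66.0 | 58.5 | 84.0 | 73.5 | 36.5 | 44.0 | 60.4 |
| PLLaVA | 58.0 | 61.0 | 55.5 | 61.0 | 39.5 | 41.0 | 52.6 |
| VideoLLaMB | 54.5 | 52.0 | 86.5 | 58.5 | 40.5 | 44.5 | 56.1 |
| Qwen2-VL-72B* | 51.5 | 82.0 | 93.5 | 81.5 | 48.5 | 49.0 | 67.7 |
| Qwen2-VL-7B* | 73.5 | 80.0 | 79.0 | 78.5 | 46.0 | 49.0 | 67.7 |
| Qwen2-VL-2B* | 77.5 | 76.5 | 76.5 | 77.5 | 50.0 | 47.5 | 67.6 |
| GPT-4V | 55.5 | 63.5 | 72.0 | 59.0 | 39.0 | 47.5 | 56.1 |
| Omni-modality large language model | | | | | | | |
| VITA | 58.0 | 81.5 | 73.5 | 61.5 | 45.5 | 42.0 | 60.3 |
| InternLM-XComposer-2.5-OL | 84.5 | 81.0 | 75.0 | 79.5 | 60.5 | 46.0 | 71.1 |
| HumanOmni | 70.0 | 78.0 | 92.5 | 80.5 | 65.5 | 49.0 | 72.6 |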

📺 Demo

https://github.com/user-attachments/assets/ecf07e98-f448-4cb6-a124-6e2a2885e581

User:What's the major emotion?
HumanOmni:angry.

User:What's the major facial expression?
HumanOmni:frown.

User:Describe the woman.
HumanOmni:The woman has long brown hair and is wearing a pink top with a dark cardigan.

User:Please describe the major action in detail.
HumanOmni:The woman is frowning and shaking her head, indicating that she is angry. She is also gesturing with her hands as if she is explaining something or emphasizing a point.

User:What did the woman say?
HumanOmni:The woman said "Of course we object! I object, Steve!"

https://github.com/user-attachments/assets/0f0d9c5c-6e1d-473f-8c6b-819ebdc8aae8

User:What's the major emotion?
HumanOmni:happy.

User:What's the major facial expression?
HumanOmni:smile.

User:Describe the woman in detail.
HumanOmni:The woman in the video has blonde hair, is wearing a black top, and is smiling.

User:What did the woman say?
HumanOmni:The woman said "Dogs are sitting by the door."

https://humanmllm.github.io/HumanOmni/chinese_video.mp4

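(This exchange was conducted in Chinese; it is shown here in English translation.)

User:Describe the appearance and clothing of the man in the video.
HumanOmni:The man in the video has short black hair and wears round-framed glasses. He is wearing a brown wool coat over a white inner layer.

User:What did the man in the video say?
HumanOmni:The man talked about what color clothes he is wearing, what environment he is in, and what the weather is like today.

User:Does the man in the video show any emotion?
HumanOmni:No emotion.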

🛠️ Environment Setup

To set up the recommended environment for HumanOmni, follow these instructions:

  • Python: >=3.10
  • CUDA: >=12.1
  • PyTorch: >=2.2 (with CUDA support)
  • Transformers: >=4.45
  • Accelerate: >=0.30.1

Or you can quickly set up the environment as follows:

git clone https://github.com/HumanMLLM/HumanOmni
cd HumanOmni
conda create -n humanOmni python=3.10 -y
conda activate humanOmni
pip install --upgrade pip
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
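
After installation, it can be useful to confirm that the installed packages meet the minimum versions listed above. The script below is a minimal sketch rather than part of the repository: the helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, the version parsing is deliberately simple, and the CUDA requirement is not checked.

```python
import re
import sys
from importlib import metadata

# Minimum versions copied from the requirements list above.
MINIMUMS = {"torch": "2.2", "transformers": "4.45", "accelerate": "0.30.1"}


def parse_version(text):
    """Turn a string such as '2.2.0+cu121' into (2, 2, 0).

    Local build tags after '+' are dropped, and parsing stops at the
    first component that does not start with a digit.
    """
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Return {name: bool} for Python and each required package."""
    report = {"python": sys.version_info >= (3, 10)}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```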

🧠 Training on Custom Dataset

Data Preparation

An example json file of the training data:

[
    {
        "video": "human/DFEW/videos/1.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\n<audio>\nAs an emotional recognition expert; throughout the video, which emotion conveyed by the characters is the most obvious to you?\nfear ,angry ,surprise ,happy ,neutral ,sad ,disgust"
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
    {
        "video": "human/DFEW/videos/1.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\n<audio>\nAs an emotional recognition expert, in the video, when the characters display their emotions, which predominant feeling is most clearly expressed?\nfear ,disgust ,happy ,sad ,surprise"
            },
            {
                "from": "gpt",
                "value": "sad"
            }
        ]
    },
  ...
]
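
Before launching training, a quick structural check of the annotation file can catch malformed entries early. The validator below is a sketch based only on the example above: the rules it enforces (a string `video` path, turns alternating between `human` and `gpt`, and a `<video>` tag in the first human turn) are inferred from that example and are assumptions, not a documented schema. The function names are hypothetical.

```python
import json


def validate_samples(samples):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for i, sample in enumerate(samples):
        if not isinstance(sample.get("video"), str):
            problems.append(f"sample {i}: missing 'video' path")
        turns = sample.get("conversations") or []
        if not turns:
            problems.append(f"sample {i}: empty 'conversations'")
            continue
        for j, turn in enumerate(turns):
            # Assumed convention: human speaks first, then gpt, alternating.
            expected = "human" if j % 2 == 0 else "gpt"
            if turn.get("from") != expected:
                problems.append(f"sample {i}, turn {j}: expected 'from' == {expected!r}")
            if not isinstance(turn.get("value"), str):
                problems.append(f"sample {i}, turn {j}: 'value' must be a string")
        if "<video>" not in str(turns[0].get("value", "")):
            problems.append(f"sample {i}: first turn lacks the <video> tag")
    return problems


def validate_file(path):
    """Load a JSON annotation file and validate its samples."""
    with open(path, encoding="utf-8") as handle:
        return validate_samples(json.load(handle))


example = [{
    "video": "human/DFEW/videos/1.mp4",
    "conversations": [
        {"from": "human", "value": "<video>\n<audio>\nWhich emotion is most obvious?"},
        {"from": "gpt", "value": "sad"},
    ],
}]
print(validate_samples(example))  # []
```

Note that Python's `json.load` rejects trailing commas, so a file that loads successfully is at least syntactically valid JSON.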

Multi-Modal SFT

  • Download the required weights: (1) HumanOmni-7B-Video (2) HumanOmni-7B-Audio
  • Edit scripts/train/finetune_humanomni.sh so that it loads the downloaded weights and the prepared dataset.
  • Start training: bash scripts/train/finetune_humanomni.sh

🔍 Inference

We provide inference.py for single-video inference.

  • video + audio
python inference.py --modal video_audio \
  --model_path ./HumanOmni_7B \
  --video_path video.mp4 \
  --instruct "Describe this video."
  • only video
python inference.py --modal video \
  --model_path ./HumanOmni_7B \
  --video_path video.mp4 \
  --instruct "Describe this video."
  • only audio
python inference.py --modal audio \
  --model_path ./HumanOmni_7B \
  --video_path video.mp4 \
  --instruct "Describe this video."
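
To run the same command over several videos, the CLI can be wrapped from Python. This is a minimal sketch, not part of the repository: it uses only the flags shown above, the helper names `build_command` and `run_inference` are hypothetical, and it assumes that inference.py sits in the current directory and prints its answer to standard output.

```python
import subprocess
import sys

# The three modes shown in the commands above.
MODALS = {"video_audio", "video", "audio"}


def build_command(video_path, instruct, modal="video_audio", model_path="./HumanOmni_7B"):
    """Build the argument list for one inference.py call."""
    if modal not in MODALS:
        raise ValueError(f"modal must be one of {sorted(MODALS)}")
    return [
        sys.executable, "inference.py",
        "--modal", modal,
        "--model_path", model_path,
        "--video_path", video_path,
        "--instruct", instruct,
    ]


def run_inference(video_path, instruct, **options):
    """Run inference.py once and return whatever it wrote to stdout.

    Assumes the script prints the model's answer; raises
    subprocess.CalledProcessError if the script exits with an error.
    """
    result = subprocess.run(
        build_command(video_path, instruct, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Preview the commands without running the model.
    for path in ["video.mp4"]:
        print(" ".join(build_command(path, "Describe this video.")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the instruction contains spaces or quotes.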

📚 Citation

If you find our work helpful, please consider citing it.

@article{zhao2025humanomni,
  title={HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding},
  author={Zhao, Jiaxing and Yang, Qize and Peng, Yixing and Bai, Detao and Yao, Shimin and Sun, Boyuan and Chen, Xiang and Fu, Shenghao and Wei, Xihan and Bo, Liefeng and others},
  journal={arXiv preprint arXiv:2501.15111},
  year={2025}
}