DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression
March 15, 2026
Bingzhou Li, Tao Huang, "DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression"
News
- 2026-03-15: This repo is released.

Abstract: Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, which overlook the piecewise semantic structure of audio-visual signals and become fragile under aggressive token reduction. We propose Dynamic Audio-driven Semantic cHunking (DASH), a training-free framework that aligns token compression with semantic structure. DASH treats audio embeddings as a semantic anchor and detects boundary candidates via cosine-similarity discontinuities, inducing dynamic, variable-length segments that approximate the underlying piecewise-coherent organization of the sequence. These boundaries are projected onto video tokens to establish explicit cross-modal segmentation. Within each segment, token retention is determined by a tri-signal importance estimator that fuses structural boundary cues, representational distinctiveness, and attention-based salience, mitigating the sparsity bias of attention-only selection. This structure-aware allocation preserves transition-critical tokens while reducing redundant regions. Extensive experiments on AVUT, VideoMME, and WorldSense demonstrate that DASH maintains superior accuracy while achieving higher compression ratios compared to prior methods.
⚒️ TODO
- Release core code
- Release paper
- Release all evaluation scripts
Install
1. Clone this repository:
git clone https://github.com/laychou666/DASH.git
cd DASH
2. Install dependencies:
conda create -n dash python=3.10 -y
conda activate dash
pip install --upgrade pip
bash setup.sh
cd lmms-eval
pip install -e .
pip install flash-attn --no-build-isolation
Quick Start
Run a quick demo of DASH with Qwen2.5-Omni:
python demo.py --dash
Set compression parameters:
python demo.py --dash --rho_audio 0.45 --rho_video 0.76
Evaluation
VideoMME (via lmms-eval)
We evaluate with the lmms-eval toolkit. Set the DASH parameters in eval.sh, then run:
bash eval.sh
Method Overview
DASH operates in four stages:
- Audio Boundary Detection: Detects semantic breakpoints in audio tokens via training-free cosine similarity analysis.
- Boundary-to-Video Mapping: Projects audio boundaries onto video tokens through linear temporal scaling.
- Audio Compression: Selects informative audio tokens using three-signal fusion (boundary probability + Gaussian uniqueness + attention scores).
- Video Compression: Applies per-segment interleaved spatio-temporal merging (ISTM) with audio-guided compression rates.
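The first three stages can be illustrated with a minimal, dependency-free Python sketch. This is not the code shipped in this repository: the function names (`detect_boundaries`, `map_to_video`, `segments`, `fuse_and_select`), the fixed similarity `threshold`, and the equal fusion `weights` are all hypothetical choices made for illustration, and the three importance signals are taken as precomputed inputs rather than derived here. Stage four (ISTM) is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def detect_boundaries(audio_emb, threshold=0.5):
    """Return indices i at which a new segment starts in the audio sequence.

    A boundary is placed wherever the similarity between consecutive
    embeddings falls below `threshold` (an assumed fixed cutoff).
    """
    return [
        i for i in range(1, len(audio_emb))
        if cosine(audio_emb[i - 1], audio_emb[i]) < threshold
    ]


def map_to_video(boundaries, n_audio, n_video):
    """Project audio boundary indices onto the video token axis by linear
    temporal scaling, dropping duplicates and degenerate positions."""
    scale = n_video / n_audio
    mapped = {int(b * scale) for b in boundaries}
    return sorted(m for m in mapped if 0 < m < n_video)


def segments(boundaries, n):
    """Turn sorted boundary indices into half-open (start, end) segments."""
    edges = [0] + list(boundaries) + [n]
    return list(zip(edges[:-1], edges[1:]))


def fuse_and_select(boundary_score, uniqueness, attention, keep,
                    weights=(1 / 3, 1 / 3, 1 / 3)):
    """Rank the tokens of one segment by a weighted sum of three signals
    and return the indices of the top `keep` tokens in temporal order."""
    wb, wu, wa = weights
    fused = [wb * b + wu * u + wa * a
             for b, u, a in zip(boundary_score, uniqueness, attention)]
    top = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:keep]
    return sorted(top)


# Toy example: six 2-D audio embeddings with two abrupt changes.
audio = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 0.05], [1, 0]]
audio_bounds = detect_boundaries(audio)                    # [2, 4]
video_bounds = map_to_video(audio_bounds, len(audio), 12)  # [4, 8]
print(segments(audio_bounds, len(audio)))  # [(0, 2), (2, 4), (4, 6)]
print(segments(video_bounds, 12))          # [(0, 4), (4, 8), (8, 12)]
```

In this toy run the two similarity drops produce three variable-length audio segments, and the same boundaries scaled by 12/6 partition the 12 video tokens. `fuse_and_select` would then be applied per segment with whatever token budget the chosen compression setting implies.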
Acknowledgement
This project builds upon OmniZip, lmms-eval, and Qwen2.5-Omni. Thanks for their awesome work.
Citation
If you find this work useful for your research, please consider citing our paper:
@article{dash2026,
title={DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression},
author={Li, Bingzhou and Huang, Tao},
journal={},
year={2026}
}