VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
August 14, 2025
This repository contains the code and data for the paper "VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models".
The code will be released soon; please stay tuned.
The MCD dataset, developed for our research, is now available on 🤗 Hugging Face: Multimodal Coding Dataset (MCD).
Overview
VisCodex is a unified multimodal framework that merges vision-language models with code-specialized LLMs using a task vector-based model merging strategy.
It brings state-of-the-art multimodal code generation capabilities, enabling models to understand complex visual contexts and produce syntactically correct, functionally accurate code.
Figure 1: Illustration of the VisCodex pipeline. (a) Model merging strategy for unifying vision-language and coding LLMs; (b) Data distribution and representative cases of MCD; (c) Category breakdown and representative cases of InfiBench-V.
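To make the task vector-based merging mentioned above more concrete, here is a minimal sketch of generic task arithmetic. It is not the released VisCodex implementation; the exact formulation is given in the paper. The sketch assumes that the vision-language model's language backbone and the coding LLM were both fine-tuned from the same base checkpoint, and that weights can be represented as plain name-to-values mappings. The function names `task_vector` and `merge_models` and the interpolation weight `lam` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Return the element-wise difference (finetuned - base) per parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_models(base: Weights, vlm: Weights, coder: Weights, lam: float = 0.5) -> Weights:
    """Add a weighted mix of two task vectors back onto the shared base.

    merged = base + lam * (vlm - base) + (1 - lam) * (coder - base)

    `lam` is an assumed interpolation weight, not a value from the paper.
    """
    tau_vlm = task_vector(vlm, base)
    tau_code = task_vector(coder, base)
    return {
        name: [
            b + lam * tv + (1.0 - lam) * tc
            for b, tv, tc in zip(base[name], tau_vlm[name], tau_code[name])
        ]
        for name in base
    }


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    vlm = {"w": [2.0, 2.0]}    # shifted only the first value
    coder = {"w": [1.0, 4.0]}  # shifted only the second value
    print(merge_models(base, vlm, coder, lam=0.5))
```

In a real setting the same arithmetic would be applied to the tensors of the language-model backbone only, with the vision encoder and projector left as they are in the vision-language model; that detail is also an assumption of this sketch.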
Main Results
Case Study
Contact
For any questions, please contact:
- Lingjie Jiang: lingjiejiang@stu.pku.edu.cn
Citation
If you use our dataset, benchmark, or method in your research, please cite:
@article{jiang2025viscodex,
title={VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models},
author={Lingjie Jiang and Shaohan Huang and Xun Wu and Yixia Li and Dongdong Zhang and Furu Wei},
journal={arXiv preprint arXiv:2508.09945},
year={2025}
}