VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

August 14, 2025


This repository contains the code and data for the paper "VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models".
The code will be released soon; please stay tuned.
The MCD dataset, developed for our research, is now available on 🤗 Multimodal Coding Dataset (MCD).


📌 Overview

VisCodex is a unified multimodal framework that merges vision-language models with code-specialized LLMs using a task vector-based model merging strategy.
It brings state-of-the-art multimodal code generation capabilities, enabling models to understand complex visual contexts and produce syntactically correct, functionally accurate code.
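Since the official code is not yet released, the following is only a minimal sketch of what task-vector merging generally looks like, not the VisCodex implementation. It assumes that the vision-language model's language backbone and the coding LLM were both fine-tuned from one shared base model, that weights are stored as flat lists of floats, and that the two task vectors are combined with a single interpolation weight `lam`. The function names, parameter names, and the value of `lam` are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
StateDict = Dict[str, List[float]]


def task_vector(finetuned: StateDict, base: StateDict) -> StateDict:
    """Task vector = fine-tuned weights minus base weights, per shared parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
        if name in finetuned
    }


def merge_with_task_vectors(
    base: StateDict, vlm: StateDict, coder: StateDict, lam: float = 0.7
) -> StateDict:
    """Merge a VLM and a coding LLM that share a base model (assumed setup).

    For parameters present in all three models:
        merged = base + lam * tau_vlm + (1 - lam) * tau_code
    Parameters that exist only in the VLM (for example a vision encoder)
    are copied over unchanged.
    """
    tau_vlm = task_vector(vlm, base)
    tau_code = task_vector(coder, base)

    merged: StateDict = {}
    for name, weights in vlm.items():
        if name in tau_vlm and name in tau_code:
            merged[name] = [
                b + lam * tv + (1.0 - lam) * tc
                for b, tv, tc in zip(base[name], tau_vlm[name], tau_code[name])
            ]
        else:
            # No counterpart in the base/coding model: keep the VLM weights.
            merged[name] = list(weights)
    return merged


# Toy usage with made-up numbers.
base = {"lm.w": [1.0, 2.0]}
vlm = {"lm.w": [2.0, 2.0], "vision.w": [9.0]}
coder = {"lm.w": [1.0, 4.0]}

print(merge_with_task_vectors(base, vlm, coder, lam=0.5))
# {'lm.w': [1.5, 3.0], 'vision.w': [9.0]}
```

In this sketch, `lam` trades off how much of the vision-language adaptation versus the coding adaptation is kept in the shared language-model weights; the paper should be consulted for the actual merging rule and coefficients.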

VisCodex Pipeline Overview

Figure 1: Illustration of the VisCodex pipeline. (a) Model merging strategy for unifying vision-language and coding LLMs; (b) Data distribution and representative cases of MCD; (c) Category breakdown and representative cases of InfiBench-V.


📊 Main Results

VisCodex Main Results


💡 Case Study

VisCodex Case Study

Example qualitative comparisons on multimodal coding tasks.

📬 Contact

For any questions, please contact:

📜 Citation

If you use our dataset, benchmark, or method in your research, please cite:

@article{jiang2025viscodex,
  title={VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models},
  author={Lingjie Jiang and Shaohan Huang and Xun Wu and Yixia Li and Dongdong Zhang and Furu Wei},
  journal={arXiv preprint arXiv:2508.09945},
  year={2025}
}