VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

August 14, 2025

This repository contains the code and data for the paper "VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models".
The code will be released soon; please stay tuned.
The MCD dataset, developed for our research, is now available on 🤗 Multimodal Coding Dataset (MCD).

📌 Overview

VisCodex is a unified multimodal framework that merges vision-language models with code-specialized LLMs using a task vector-based model merging strategy.
It brings state-of-the-art multimodal code generation capabilities, enabling models to understand complex visual contexts and produce syntactically correct, functionally accurate code.
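
To make the idea of task vector-based merging concrete, here is a minimal sketch, not the released VisCodex implementation. It assumes the vision-language model's language backbone and the coding LLM were both fine-tuned from the same base LLM and share parameter names and shapes. The function names, the plain-list weight format, and the interpolation weight `lam` are hypothetical choices made for illustration.

```python
from typing import Dict, List

# Toy weight format (an assumption): parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a fine-tuned model and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_task_vectors(
    base: Weights, vlm: Weights, coder: Weights, lam: float = 0.5
) -> Weights:
    """Add a weighted combination of two task vectors back onto the base.

    `lam` weights the vision-language task vector and (1 - lam) weights the
    coding task vector. This weighting scheme is an illustrative assumption.
    """
    tau_vlm = task_vector(vlm, base)
    tau_code = task_vector(coder, base)
    return {
        name: [
            b + lam * v + (1.0 - lam) * c
            for b, v, c in zip(base[name], tau_vlm[name], tau_code[name])
        ]
        for name in base
    }


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    vlm = {"w": [2.0, 2.0]}    # stands in for the VLM's language backbone
    coder = {"w": [1.0, 4.0]}  # stands in for the code-specialized LLM
    print(merge_task_vectors(base, vlm, coder, lam=0.5))
```

In a real setting the same arithmetic would run over full model state dictionaries. In this sketch only language-model parameters are merged; how vision-specific components are handled is not specified here.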

Figure 1: Illustration of the VisCodex pipeline. (a) Model merging strategy for unifying vision-language and coding LLMs; (b) Data distribution and representative cases of MCD; (c) Category breakdown and representative cases of InfiBench-V.

📬 Contact

Lingjie Jiang: lingjiejiang@stu.pku.edu.cn

📜 Citation

If you use our dataset, benchmark, or method in your research, please cite:

@article{jiang2025viscodex,
  title={VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models},
  author={Lingjie Jiang and Shaohan Huang and Xun Wu and Yixia Li and Dongdong Zhang and Furu Wei},
  journal={arXiv preprint arXiv:2508.09945},
  year={2025}
}
