CodeVisionary
November 2, 2025
An agent-based evaluation framework for complex code generation.
Framework Demonstration
The video below demonstrates the prototype's key features and the overall workflow of the framework. Click the thumbnail to watch it.
https://github.com/user-attachments/assets/78948dad-8367-4df5-9eb8-9b64ca639f6f
Features
Two-Stage Framework
Requirement-guided multi-dimensional context distillation
- Collecting contextual information based on the stepwise evaluation plan.
Fine-grained scoring and summarization
- Generating evaluation scores and reports through negotiation between multiple judges.
Detailed Evaluation Report
- Provides structured evaluation reports (Markdown & PDF format) with evaluation scores, environment configuration, task requirements, stepwise evaluation results, and overall evaluation results.
Multi-Tool Integration
- Integrates external tools for code evaluation, including dynamic execution, static linting, unit testing, screenshot/interaction, and web browsing.
Prerequisites
- Python 3.x
- Docker
- Git
Installation
- Clone the repository:
git clone https://github.com/Eshe0922/CodeVisionary.git
- Build the evaluation image and pull the lint image:
cd docker
docker build -t codevisionary.evaluate .
docker pull eshe1836316339/codevisionary:lint
docker tag eshe1836316339/codevisionary:lint codevisionary.lint
- Install the required dependencies:
pip install -r requirements.txt
npm install --save-dev prettier
apt-get install pandoc
apt-get install texlive-xetex
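After installation, a quick check that the command-line tools are reachable can save a failed run later. The snippet below is a minimal sketch and not part of the repository: the helper name `missing_tools` and the tool list are assumptions based on the steps above (`xelatex` is the binary provided by the texlive-xetex package). Prettier is not checked because `npm install --save-dev` installs it locally rather than on the PATH.

```python
import shutil

# Executables the prerequisite and installation steps are expected to provide.
# This list is an assumption drawn from the steps above, not from the repository.
REQUIRED_TOOLS = ["git", "docker", "pandoc", "xelatex"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if which(name) is None]
```

Calling `missing_tools()` returns an empty list when every tool is found; otherwise it lists the names that still need to be installed.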
Usage
Run the run.sh script, which invokes main.py with the following arguments:
SCRIPT_DIR=$(cd "$(dirname "$0")"; pwd)
python3 main.py \
--evaluation_path "${SCRIPT_DIR}/dataset/benchmark_test.jsonl" \
--write_path "${SCRIPT_DIR}/experiments/test" \
--pdf
Where:
- --evaluation_path: Path to the evaluation dataset in JSONL format. This file contains the questions and responses to be evaluated.
- --write_path: Directory where the evaluation results and outputs will be saved.
- --pdf: (Optional) If specified, the evaluation results will also be exported as a PDF report.
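The same invocation can be driven from Python, for example when scripting several runs. This is a hedged sketch rather than code from the repository: `build_command` and `run_evaluation` are hypothetical helpers, and the only details taken from the project are the three flags and the paths used in run.sh. Using `sys.executable` instead of `python3` is also an assumption.

```python
import subprocess
import sys
from pathlib import Path


def build_command(evaluation_path, write_path, pdf=False):
    """Assemble the main.py invocation from the three documented flags."""
    command = [
        sys.executable, "main.py",
        "--evaluation_path", str(evaluation_path),
        "--write_path", str(write_path),
    ]
    if pdf:
        # --pdf is optional and takes no value.
        command.append("--pdf")
    return command


def run_evaluation(repo_dir, pdf=False):
    """Run main.py from the repository root with the paths used in run.sh."""
    repo_dir = Path(repo_dir).resolve()
    command = build_command(
        repo_dir / "dataset" / "benchmark_test.jsonl",
        repo_dir / "experiments" / "test",
        pdf=pdf,
    )
    # check=True raises CalledProcessError if main.py exits with a non-zero status.
    return subprocess.run(command, cwd=repo_dir, check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the argument list before anything is launched.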
The evaluation dataset should be a JSON Lines file, where each line is a JSON object representing a single evaluation sample. Each object should have the following fields:
- id: (int) Unique identifier for the sample.
- question: (str) The coding or evaluation question.
- response: (str) The code or answer generated by the model.
- model: (str) The name or identifier of the model that generated the response.
Example:
{"id": 4, "question": "Find the maximum element in a list.", "response": "def find_max(lst):\n return max(lst)", "model": "gpt-4"}
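Before starting an evaluation, it can help to confirm that every line of the dataset matches this schema. The following is a minimal sketch under stated assumptions: `validate_sample` and `validate_jsonl` are hypothetical helpers that are not part of the repository, blank lines are skipped, and duplicate ids are reported because the id is described as unique. Whether main.py enforces any of this itself is not stated in this README.

```python
import json

# Field names and types as listed in the dataset description.
REQUIRED_FIELDS = {"id": int, "question": str, "response": str, "model": str}


def validate_sample(sample):
    """Return a list of problems found in one decoded sample (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in sample:
            problems.append(f"missing field: {field}")
            continue
        value = sample[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


def validate_jsonl(lines):
    """Check each non-empty line; return {line_number: [problems]} for bad lines."""
    errors = {}
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        try:
            sample = json.loads(line)
        except json.JSONDecodeError as exc:
            errors[number] = [f"invalid JSON: {exc.msg}"]
            continue
        if not isinstance(sample, dict):
            errors[number] = ["not a JSON object"]
            continue
        problems = validate_sample(sample)
        sample_id = sample.get("id")
        if isinstance(sample_id, int) and not isinstance(sample_id, bool):
            if sample_id in seen_ids:
                problems.append(f"duplicate id: {sample_id}")
            seen_ids.add(sample_id)
        if problems:
            errors[number] = problems
    return errors
```

A typical use is `with open(path, encoding="utf-8") as f: errors = validate_jsonl(f)`, which yields an empty dictionary when every sample is well formed.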
Project Structure
- agents/ - Agent implementations for code evaluation
- dataset/ - Datasets used for code evaluation
- docker/ - Docker-related configurations
- experiments/ - Experiment results
- tools/ - External tools designed for code evaluation
- utils/ - Utility functions and helper classes
- main.py - Main entry point
- run.sh - Shell script for executing main.py
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Citation
@misc{wang2025codevisionaryagentbasedframeworkevaluating,
title={CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation},
author={Xinchen Wang and Pengfei Gao and Chao Peng and Ruida Hu and Cuiyun Gao},
year={2025},
eprint={2504.13472},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2504.13472},
}
License
MIT