CodeVisionary

November 2, 2025

An agent-based evaluation framework for complex code generation.

Framework Demonstration

The video below demonstrates the prototype's main functionality, including its key features and the overall evaluation workflow.

https://github.com/user-attachments/assets/78948dad-8367-4df5-9eb8-9b64ca639f6f

Features

Two-Stage Framework

Requirement-guided multi-dimensional context distillation

Collecting contextual information based on the stepwise evaluation plan.

Fine-grained scoring and summarization

Generating evaluation scores and reports through negotiation between multiple judges.

Detailed Evaluation Report

Provides structured evaluation reports in Markdown and PDF formats, covering evaluation scores, environment configuration, task requirements, stepwise evaluation results, and overall evaluation results.

Multi-Tool Integration

Integrates external tools for code evaluation, including dynamic execution, static linting, unit testing, screenshot capture and interaction, and web browsing, among others.

Prerequisites

Python 3.x
Docker
Git

Installation

Clone the repository:

git clone https://github.com/Eshe0922/CodeVisionary.git

Build the Docker image:

cd docker
docker build -t codevisionary.evaluate .
docker pull eshe1836316339/codevisionary:lint
docker tag eshe1836316339/codevisionary:lint codevisionary.lint

Install the required dependencies:

pip install -r requirements.txt
npm install --save-dev prettier
apt-get install pandoc
apt-get install texlive-xetex
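
Before running an evaluation, it can help to confirm that the command-line tools installed above are reachable. The following is a minimal sketch, not part of the repository: the helper name `missing_tools` is hypothetical, and the executable names (in particular `xelatex` for the texlive-xetex package) are assumptions that may differ on your system.

```python
import shutil

# Assumed executable names for the tools listed above; adjust for your system.
REQUIRED_TOOLS = ["docker", "git", "pandoc", "xelatex"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only checks that the executables exist on PATH; it does not verify versions or that the Docker daemon is running.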

Usage

Run the run.sh script, which invokes main.py with the following arguments:

SCRIPT_DIR=$(cd "$(dirname "$0")"; pwd)
python3 main.py \
  --evaluation_path "${SCRIPT_DIR}/dataset/benchmark_test.jsonl" \
  --write_path "${SCRIPT_DIR}/experiments/test" \
  --pdf

Where:

--evaluation_path: Path to the evaluation dataset in JSONL format. This file contains the questions and responses to be evaluated.
--write_path: Directory where the evaluation results and outputs will be saved.
--pdf: (Optional) If specified, the evaluation results will also be exported as a PDF report.
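
For readers who want to see how the three flags relate, here is a hypothetical sketch of a parser that accepts them with Python's standard argparse module. It is not the project's actual main.py; treating the two paths as required and `--pdf` as a boolean switch is an assumption based on the descriptions above.

```python
import argparse


def build_parser():
    """Build a parser for the flags described above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical sketch of the documented command-line flags."
    )
    # Assumption: both paths are mandatory.
    parser.add_argument("--evaluation_path", required=True,
                        help="Path to the evaluation dataset in JSONL format.")
    parser.add_argument("--write_path", required=True,
                        help="Directory for evaluation results and outputs.")
    # Assumption: --pdf takes no value and defaults to off.
    parser.add_argument("--pdf", action="store_true",
                        help="Also export the results as a PDF report.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--evaluation_path", "dataset/benchmark_test.jsonl",
         "--write_path", "experiments/test", "--pdf"]
    )
    print(args.evaluation_path, args.write_path, args.pdf)
```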

The evaluation dataset should be a JSON Lines file, where each line is a JSON object representing a single evaluation sample. Each object should have the following fields:

id: (int) Unique identifier for the sample.
question: (str) The coding or evaluation question.
response: (str) The code or answer generated by the model.
model: (str) The name or identifier of the model that generated the response.

Example:

{"id": 4, "question": "Find the maximum element in a list.", "response": "def find_max(lst):\n    return max(lst)", "model": "gpt-4"}
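
When preparing your own dataset, a small script can check each sample against the fields listed above before writing the file. This is a minimal sketch under the stated schema; the helper names `validate_sample`, `write_jsonl`, and `read_jsonl` are illustrative and not part of the repository.

```python
import json

# Field names and types as documented above.
REQUIRED_FIELDS = {"id": int, "question": str, "response": str, "model": str}


def validate_sample(sample):
    """Return a list of problems found in one sample (empty if it is valid)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in sample:
            problems.append(f"missing field: {name}")
        elif type(sample[name]) is not expected_type:
            problems.append(f"field {name} should be {expected_type.__name__}")
    return problems


def write_jsonl(samples, path):
    """Write one JSON object per line, as the JSON Lines format expects."""
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    sample = {
        "id": 4,
        "question": "Find the maximum element in a list.",
        "response": "def find_max(lst):\n    return max(lst)",
        "model": "gpt-4",
    }
    print(validate_sample(sample))
```

Using `json.dumps` rather than hand-written strings ensures that newlines inside the response field are escaped as `\n`, so every sample stays on a single line.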

Project Structure

agents/ - Agent implementations for code evaluation
dataset/ - Datasets used for code evaluation
docker/ - Docker-related configurations
experiments/ - Experiment results
tools/ - External tools designed for code evaluation
utils/ - Utility functions and helper classes
main.py - Main entry point
run.sh - Shell script that runs main.py

Contributing

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Citation

@misc{wang2025codevisionaryagentbasedframeworkevaluating,
      title={CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation}, 
      author={Xinchen Wang and Pengfei Gao and Chao Peng and Ruida Hu and Cuiyun Gao},
      year={2025},
      eprint={2504.13472},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2504.13472}, 
}

License

MIT

Acknowledgement

https://github.com/Aider-AI/aider