LiveCodeBench Pro - LLM Benchmarking Toolkit
October 19, 2025
This repository contains a benchmarking toolkit for evaluating Large Language Models (LLMs) on competitive programming tasks. The toolkit provides a standardized way to test your LLM's code generation capabilities across a diverse set of problems.
Overview
LiveCodeBench Pro evaluates LLMs on their ability to generate solutions for programming problems. The benchmark includes problems of varying difficulty levels from different competitive programming platforms.
Getting Started
Prerequisites
- Ubuntu 20.04 or later (or another distribution with kernel version >= 3.10 and cgroup support; refer to go-judge for details)
- Python 3.12 or higher
- pip package manager
- Docker (for running the judge server); ensure your user has permission to run Docker commands
Installation
- Install the required dependencies:

      pip install -r requirements.txt

  Or install directly using uv:

      uv sync

- Ensure Docker is installed and running:

      docker --version

  Make sure your user has permission to run Docker commands. On Linux, you may need to add your user to the docker group:

      sudo usermod -aG docker $USER

  Then log out and back in for the changes to take effect.
How to Use
Step 1: Implement Your LLM Interface
Create your own LLM class by extending the abstract LLMInterface class in api_interface.py. Your implementation needs to override the call_llm method.
Example:
from api_interface import LLMInterface


class YourLLM(LLMInterface):
    def __init__(self):
        super().__init__()
        # Initialize your LLM client or resources here

    def call_llm(self, user_prompt: str):
        # Implement your logic to call your LLM with user_prompt
        # Return a tuple containing (response_text, metadata)
        # Example (your_llm_client is a placeholder for your own client object):
        response = your_llm_client.generate(user_prompt)
        return response.text, response.metadata
You can use the ExampleLLM class as a reference, which shows how to integrate with OpenAI's API.
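For a dry run of the pipeline without calling a real model, a canned-response stub can be handy. The sketch below is only an illustration and is not part of the repository: the local `LLMInterface` stand-in merely mirrors the `call_llm` contract described above so that the snippet runs on its own, and `CannedLLM` and its metadata keys are hypothetical names. In the real project you would import `LLMInterface` from api_interface.py instead.

```python
from abc import ABC, abstractmethod

FENCE = "`" * 3  # Markdown code fence, built at runtime to keep this snippet readable


class LLMInterface(ABC):
    """Local stand-in for the abstract class in api_interface.py (assumption)."""

    @abstractmethod
    def call_llm(self, user_prompt: str):
        """Return a (response_text, metadata) tuple."""


class CannedLLM(LLMInterface):
    """Hypothetical stub that always answers with the same C++ program."""

    def __init__(self):
        super().__init__()
        self.calls = 0  # number of prompts seen so far

    def call_llm(self, user_prompt: str):
        self.calls += 1
        response_text = (
            "Here is my solution:\n"
            f"{FENCE}cpp\n"
            "int main() { return 0; }\n"
            f"{FENCE}\n"
        )
        # Metadata keys are illustrative; use whatever your client reports.
        metadata = {"prompt_chars": len(user_prompt), "call_index": self.calls}
        return response_text, metadata


text, meta = CannedLLM().call_llm("Solve problem A.")
print(meta)
```

Swapping such a stub in for `YourLLM` lets you check that code extraction and the judge are wired up before running a full evaluation.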
Step 2: Configure the Benchmark
Edit the benchmark.py file to use your LLM implementation:
from your_module import YourLLM
# Replace this line:
llm_instance = YourLLM() # Update with your LLM class
Also set the number of judge workers in the same file. A value no greater than the number of physical CPU cores is recommended.
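One way to pick a worker count is to estimate the physical core count from the standard library. This is a rough sketch under stated assumptions: `os.cpu_count()` reports logical CPUs, so the division by `threads_per_core` assumes two hardware threads per core, and `suggest_judge_workers` is a hypothetical helper, not something the toolkit provides.

```python
import os


def suggest_judge_workers(logical_cpus=None, threads_per_core=2):
    """Estimate a worker count that stays within the physical core count.

    The result is a heuristic: it assumes every physical core exposes
    `threads_per_core` logical CPUs (2 is typical with SMT enabled).
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, logical_cpus // max(1, threads_per_core))


print(suggest_judge_workers())
```

If SMT is disabled on your machine, pass `threads_per_core=1` so the estimate equals the logical CPU count.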
Step 3: Run the Benchmark
Execute the benchmark script:
python benchmark.py
The script will:
- Load the LiveCodeBench-Pro dataset from Hugging Face
- Process each problem with your LLM
- Extract C++ code from LLM responses automatically
- Submit solutions to the integrated judge system for evaluation
- Collect judge results and generate comprehensive statistics
- Save the results to benchmark_result.json
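To make the statistics step concrete, here is a small sketch of how judge verdicts might be aggregated into acceptance rates. The record layout is an assumption made for illustration: the keys `difficulty` and `verdict`, the value `"Accepted"`, and the helper name `summarize` are hypothetical, and the actual structure of benchmark_result.json may differ.

```python
from collections import Counter


def summarize(records):
    """Return overall and per-difficulty acceptance rates.

    Each record is assumed to look like
    {"difficulty": "easy", "verdict": "Accepted"} (hypothetical layout).
    """
    totals = Counter()
    accepted = Counter()
    for rec in records:
        level = rec["difficulty"]
        totals[level] += 1
        if rec["verdict"] == "Accepted":
            accepted[level] += 1
    per_level = {level: accepted[level] / totals[level] for level in totals}
    count = sum(totals.values())
    overall = sum(accepted.values()) / count if count else 0.0
    return {"overall": overall, "per_difficulty": per_level}


sample = [
    {"difficulty": "easy", "verdict": "Accepted"},
    {"difficulty": "easy", "verdict": "Wrong Answer"},
    {"difficulty": "hard", "verdict": "Wrong Answer"},
]
print(summarize(sample))
```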
(Optional) Step 4: Submit Your Results
Email your benchmark_result.json file to zz4242@nyu.edu to have it displayed on the leaderboard.
Please include the following information in your submission:
- LLM name and version
- Any specific details
- Contact information
Understanding the Codebase
api_interface.py
This file defines the abstract interface for LLM integration:
- LLMInterface: Abstract base class with methods for LLM interaction
- ExampleLLM: Example implementation with OpenAI's GPT-4o
benchmark.py
The main benchmarking script that:
- Loads the dataset
- Processes each problem through your LLM
- Extracts C++ code from responses
- Submits solutions to the judge system
- Collects results and generates statistics
- Saves comprehensive results with judge verdicts
judge.py
Contains the judge system integration:
- Judge: Abstract base class for judge implementations
- LightCPVerifierJudge: LightCPVerifier integration for local solution evaluation
- Automatic problem data downloading from Hugging Face
util.py
Utility functions for code processing:
- extract_longest_cpp_code(): Intelligent C++ code extraction from LLM responses
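To illustrate the general idea of pulling C++ out of a model response, the sketch below selects the longest fenced block tagged `cpp` or `c++`. It is a simplified stand-in written for this README, not the code in util.py: the name `longest_cpp_block` and the regular expression are assumptions, and the repository's own heuristics may handle more cases (for example, untagged blocks).

```python
import re

FENCE = "`" * 3  # Markdown code fence, built at runtime to keep this snippet readable

# Matches a fenced block tagged cpp / c++ (case-insensitive) and captures its body.
CPP_BLOCK = re.compile(
    FENCE + r"(?:cpp|c\+\+)[^\n]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def longest_cpp_block(response_text):
    """Return the longest fenced C++ block in response_text, or None."""
    blocks = [body.strip() for body in CPP_BLOCK.findall(response_text)]
    return max(blocks, key=len) if blocks else None


response = (
    "A first draft:\n"
    f"{FENCE}cpp\nint x;\n{FENCE}\n"
    "The final program:\n"
    f"{FENCE}cpp\nint main() {{ return 0; }}\n{FENCE}\n"
)
print(longest_cpp_block(response))
```

Choosing the longest block is a simple way to prefer a full program over short illustrative fragments that a model may include in its explanation.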
Dataset
The benchmark uses the QAQAQAQAQ/LiveCodeBench-Pro and QAQAQAQAQ/LiveCodeBench-Pro-Testcase datasets from Hugging Face, which contain competitive programming problems of varying difficulty and their test cases.
Contact
For questions or support, please contact us at zz4242@nyu.edu.