
December 10, 2024

👨‍💻 Awesome Code LLM


🔆 How to Contribute

Contributions are welcome! If you have any resources, tools, papers, or insights related to Code LLMs, feel free to submit a pull request. Let's work together to make this project better!

News

🧵 Table of Contents

🚀 Top Code LLMs

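Sorted by HumanEval Pass@1

| Rank | Model | Params | HumanEval | MBPP | Source |
|------|-------|--------|-----------|------|--------|
| 1 | o1-mini-2024-09-12 | - | 97.6 | 93.9 | paper |
| 2 | o1-preview-2024-09-12 | - | 95.1 | 93.4 | paper |
| 3 | Qwen2.5-Coder-32B-Instruct | 32B | 92.7 | 90.2 | github |
| 4 | Claude-3.5-Sonnet-20241022 | - | 92.1 | 91.0 | paper |
| 5 | GPT-4o-2024-08-06 | - | 92.1 | 86.8 | paper |
| 6 | Qwen2.5-Coder-14B-Instruct | 14B | 89.6 | 86.2 | github |
| 7 | Claude-3.5-Sonnet-20240620 | - | 89.0 | 87.6 | paper |
| 8 | Qwen2.5-Coder-7B-Instruct | 7B | 88.4 | 83.5 | github |
| 9 | GPT-4o-mini-2024-07-18 | - | 87.8 | 86.0 | paper |
| 10 | DS-Coder-V2-Instruct | 21/236B | 85.4 | 89.4 | github |
| 11 | Qwen2.5-Coder-3B-Instruct | 3B | 84.1 | 73.6 | github |
| 12 | CodeQwen1.5-7B-Chat | 7B | 83.5 | 70.6 | github |
| 13 | DS-Coder-V2-Lite-Instruct | 2.4/16B | 81.1 | 82.8 | github |
| 14 | DeepSeek-Coder-33B-Instruct | 33B | 79.3 | 70.0 | github |
| 15 | DeepSeek-Coder-6.7B-Instruct | 6.7B | 78.6 | 65.4 | github |
| 16 | GPT-3.5-Turbo | - | 76.2 | 70.8 | github |
| 17 | CodeLlama-70B-Instruct | 70B | 72.0 | 77.8 | paper |
| 18 | Qwen2.5-Coder-1.5B-Instruct | 1.5B | 70.7 | 69.2 | github |
| 19 | StarCoder2-15B-Instruct-v0.1 | 15B | 67.7 | 78.0 | paper |
| 20 | Qwen2.5-Coder-0.5B-Instruct | 0.5B | 61.6 | 52.4 | github |
| 21 | PanGu-Coder2 | 15B | 61.6 | - | paper |
| 22 | WizardCoder-15B | 15B | 57.3 | 51.8 | paper |
| 23 | CodeQwen1.5-7B | 7B | 51.8 | 61.8 | github |
| 24 | CodeLlama-34B-Instruct | 34B | 48.2 | 61.1 | paper |
| 25 | Code-Davinci-002 | - | 47.0 | - | paper |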
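
The HumanEval and MBPP columns above report pass@1. As a rough illustration of how such a score can be computed from sampled generations, the sketch below implements the unbiased pass@k estimator described in "Evaluating Large Language Models Trained on Code" (listed under the pre-training papers). The function names `pass_at_k` and `mean_pass_at_k` and the sample counts are hypothetical, and the leaderboard entries may have been produced under different sampling or decoding settings.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all unit tests
    k: evaluation budget (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples fail) = 1 - C(n - c, k) / C(n, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three problems, 10 samples each.
    demo = [(10, 3), (10, 10), (10, 0)]
    print(f"pass@1 = {100 * mean_pass_at_k(demo, 1):.1f}")
```

With a single greedy sample per problem (n = 1, k = 1), the estimate reduces to the fraction of problems solved, which is how a pass@1 percentage is usually read.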

💡 Evaluation Toolkit:

  • bigcode-evaluation-harness: A framework for the evaluation of autoregressive code generation language models.
  • code-eval: A framework for the evaluation of autoregressive code generation language models on HumanEval.
  • SandboxFusion: A secure sandbox for running and judging code generated by LLMs.

🚀 Awesome Code LLMs Leaderboard

| Leaderboard | Description |
|-------------|-------------|
| Evalperf Leaderboard | Evaluating LLMs for Efficient Code Generation. |
| Aider Code Editing Leaderboard | Measuring the LLM's coding ability, and whether it can write new code that integrates into existing code. |
| BigCodeBench Leaderboard | BigCodeBench evaluates LLMs with practical and challenging programming tasks. |
| LiveCodeBench Leaderboard | Holistic and Contamination Free Evaluation of Large Language Models for Code. |
| Big Code Models Leaderboard | Compare performance of base multilingual code generation models on HumanEval benchmark and MultiPL-E. |
| BIRD Leaderboard | BIRD contains over 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education. |
| CanAiCode Leaderboard | CanAiCode Leaderboard |
| Coding LLMs Leaderboard | Coding LLMs Leaderboard |
| CRUXEval Leaderboard | CRUXEval is a benchmark complementary to HumanEval and MBPP measuring code reasoning, understanding, and execution capabilities. |
| EvalPlus Leaderboard | EvalPlus evaluates AI Coders with rigorous tests. |
| InfiBench Leaderboard | InfiBench is a comprehensive benchmark for code large language models evaluating model ability on answering freeform real-world questions in the code domain. |
| InterCode Leaderboard | InterCode is a benchmark for evaluating language models on the interactive coding task. Given a natural language request, an agent is asked to interact with a software system (e.g., database, terminal) with code to resolve the issue. |
| Program Synthesis Models Leaderboard | Helps researchers identify the best open-source model with an intuitive leadership quadrant graph. It evaluates open-source code models and ranks them by capability and market adoption. |
| Spider Leaderboard | Spider is a large-scale, complex, and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. |

📚 Awesome Code LLMs Papers

🌊 Awesome Code Pre-Training Papers

| Title | Venue | Date | Code | Resources |
|-------|-------|------|------|-----------|
| OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models | Preprint | 2024.11 | GitHub | HF |
| Qwen2.5-Coder Technical Report | Preprint | 2024.09 | GitHub | HF |
| DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence | Preprint | 2024.06 | GitHub | HF |
| StarCoder 2 and The Stack v2: The Next Generation | Preprint | 2024.02 | GitHub | HF |
| DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence | Preprint | 2024.01 | GitHub | HF |
| Code Llama: Open Foundation Models for Code | Preprint | 2023.08 | GitHub | HF |
| Textbooks Are All You Need | Preprint | 2023.06 | - | HF |
| CodeT5+: Open Code Large Language Models for Code Understanding and Generation | Preprint | 2023.05 | GitHub | HF |
| StarCoder: may the source be with you! | Preprint | 2023.05 | GitHub | HF |
| CodeGen2: Lessons for Training LLMs on Programming and Natural Languages | ICLR'23 | 2023.05 | GitHub | HF |
| CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X | Preprint | 2023.03 | GitHub | HF |
| SantaCoder: don't reach for the stars! | Preprint | 2023.01 | - | HF |
| CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis | ICLR'23 | 2022.03 | GitHub | HF |
| Evaluating Large Language Models Trained on Code | Preprint | 2021.07 | GitHub | - |

🐳 Awesome Code Instruction-Tuning Papers

| Title | Venue | Date | Code | Resources |
|-------|-------|------|------|-----------|
| Magicoder: Source Code Is All You Need | ICML'24 | 2023.12 | GitHub | HF |
| OctoPack: Instruction Tuning Code Large Language Models | ICLR'24 | 2023.08 | GitHub | HF |
| WizardCoder: Empowering Code Large Language Models with Evol-Instruct | Preprint | 2023.07 | GitHub | HF |
| Code Alpaca: An Instruction-following LLaMA Model trained on code generation instructions | Preprint | 2023.xx | GitHub | HF |

🐬 Awesome Code Alignment Papers

| Title | Venue | Date | Code | Resources |
|-------|-------|------|------|-----------|
| ProSec: Fortifying Code LLMs with Proactive Security Alignment | Preprint | 2024.11 | - | - |
| PLUM: Preference Learning Plus Test Cases Yields Better Code Language Models | Preprint | 2024.06 | - | - |
| PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback | Preprint | 2023.07 | - | - |
| RLTF: Reinforcement Learning from Unit Test Feedback | Preprint | 2023.07 | GitHub | - |
| Execution-based Code Generation using Deep Reinforcement Learning | TMLR'23 | 2023.01 | GitHub | - |
| CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning | NeurIPS'22 | 2022.07 | GitHub | - |

🐋 Awesome Code Prompting Papers

| Title | Venue | Date | Code | Resources |
|-------|-------|------|------|-----------|
| From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging | Preprint | 2024.10 | GitHub | - |
| Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs | AAAI'25 | 2024.06 | GitHub | - |
| Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step | ACL'24 | 2024.02 | GitHub | - |
| SelfEvolve: A Code Evolution Framework via Large Language Models | Preprint | 2023.06 | - | - |
| Demystifying GPT Self-Repair for Code Generation | ICLR'24 | 2023.06 | GitHub | - |
| Teaching Large Language Models to Self-Debug | ICLR'24 | 2023.06 | - | - |
| LEVER: Learning to Verify Language-to-Code Generation with Execution | ICML'23 | 2023.02 | GitHub | - |
| Coder Reviewer Reranking for Code Generation | ICML'23 | 2022.11 | GitHub | - |
| CodeT: Code Generation with Generated Tests | ICLR'23 | 2022.07 | GitHub | - |

🐙 Awesome Code Benchmark & Evaluation Papers

| Dataset | Title | Venue | Date | Code | Resources |
|---------|-------|-------|------|------|-----------|
| CodeArena | Evaluating and Aligning CodeLLMs on Human Preference | Preprint | 2024.12 | GitHub | HF |
| FullStack Bench | FullStack Bench: Evaluating LLMs as Full Stack Coders | Preprint | 2024.12 | GitHub | HF, GitHub |
| GitChameleon | GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models | Preprint | 2024.11 | GitHub | - |
| Evalperf | Evaluating Language Models for Efficient Code Generation | COLM'24 | 2024.08 | GitHub | HF |
| LiveCodeBench | LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code | Preprint | 2024.03 | GitHub | HF |
| DevBench | DevBench: A Comprehensive Benchmark for Software Development | Preprint | 2024.03 | GitHub | - |
| SWE-bench | SWE-bench: Can Language Models Resolve Real-World GitHub Issues? | ICLR'24 | 2024.03 | GitHub | HF |
| CrossCodeEval | CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion | NeurIPS'23 | 2023.11 | GitHub | - |
| RepoCoder | Repository-Level Code Completion Through Iterative Retrieval and Generation | EMNLP'23 | 2023.10 | GitHub | - |
| LongCoder | LongCoder: A Long-Range Pre-trained Language Model for Code Completion | ICML'23 | 2023.10 | GitHub | - |
| - | Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation | Preprint | 2023.08 | - | - |
| BioCoder | BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models | ISMB'24 | 2023.08 | GitHub | - |
| RepoBench | RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems | ICLR'24 | 2023.06 | GitHub | HF |
| EvalPlus | Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation | NeurIPS'23 | 2023.05 | GitHub | HF |
| Coeditor | Coeditor: Leveraging Contextual Changes for Multi-round Code Auto-editing | ICLR'24 | 2023.05 | GitHub | - |
| DS-1000 | DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation | ICML'23 | 2022.11 | GitHub | HF |
| MultiPL-E | MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation | Preprint | 2022.08 | GitHub | HF |
| MBPP | Program Synthesis with Large Language Models | Preprint | 2021.08 | GitHub | HF |
| APPS | Measuring Coding Challenge Competence With APPS | NeurIPS'21 | 2021.05 | GitHub | HF |

🙌 Contributors

This is an active repository and your contributions are always welcome! If you have any questions about this opinionated list, do not hesitate to contact me at huybery@gmail.com.


Cite as

@software{awesome-code-llm,
  author = {Binyuan Hui and Lei Zhang},
  title = {An awesome and curated list of best code-LLM for research},
  howpublished = {\url{https://github.com/huybery/Awesome-Code-LLM}},
  year = 2023,
}


Acknowledgement

This project is inspired by Awesome-LLM.


⬆ Back to ToC