CoREB: Code Retrieval and Reranking Benchmark
June 2, 2026
CoREB is a graded-relevance benchmark for evaluating code retrieval and reranking models across three tasks:
| Task | Query | Target | Example |
|---|---|---|---|
| Text-to-Code (T2C) | Natural language description | Code solution | "Find the longest substring without repeating characters" → Python solution |
| Code-to-Code (C2C) | Code in language A | Equivalent code in language B | Python solution → Java translation |
| Code-to-Text (C2T) | Code snippet | Problem description | Python solution → problem statement |
MTEB Integration
CoREB is available as a domain-specific benchmark on the MTEB Leaderboard. You can also load and run it directly via the mteb Python package:
import mteb
benchmark = mteb.get_benchmark("CoREB")
Key Features
- Graded relevance: 3-level qrel scheme (rel=2: positive, rel=1: hard negative, rel=0: irrelevant) — hard negatives are same-problem distractors that penalize nDCG when retrieved above true positives
- 5 programming languages: Python, C++, Java, Go, Ruby
- Problem-disjoint train/test splits: v202602 (training) and v202603 (testing) cover non-overlapping contest windows
- Two-stage evaluation: benchmarks both retrieval (embedding models) and reranking (cross-encoders)
- Drop-in evaluation: compatible with standard IR evaluation (pytrec_eval) with relevance_level=2
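To make the relevance threshold concrete, here is a minimal pure-Python sketch of Recall@k where only grades at or above `relevance_level` count as relevant. The nested dictionary layout (query id → document id → grade) mirrors the shape pytrec_eval accepts. The function name `recall_at_k` and the toy ids are illustrative and not part of the coreb package, and skipping queries without positives is a simplification.

```python
def recall_at_k(qrels, run, k, relevance_level=2):
    """Mean Recall@k where only grades >= relevance_level count as relevant."""
    per_query = []
    for qid, judged in qrels.items():
        relevant = {doc for doc, grade in judged.items() if grade >= relevance_level}
        if not relevant:
            continue  # simplification: skip queries that have no positives
        # Rank retrieved documents by descending score and keep the top k.
        ranked = sorted(run.get(qid, {}).items(), key=lambda kv: kv[1], reverse=True)
        top_k = {doc for doc, _ in ranked[:k]}
        per_query.append(len(relevant & top_k) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy example: one query, two positives (rel=2), one hard negative (rel=1).
qrels = {"q1": {"pos_a": 2, "pos_b": 2, "hard_neg": 1}}
run = {"q1": {"hard_neg": 0.9, "pos_a": 0.8, "other": 0.5, "pos_b": 0.1}}

strict = recall_at_k(qrels, run, k=2)                      # hard negative ignored
lenient = recall_at_k(qrels, run, k=2, relevance_level=1)  # hard negative counted
```

With the threshold at 2, the hard negative at rank 1 earns no credit. Lowering the threshold to 1 would count it as a hit and raise the score.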
Installation
pip install coreb
Optional extras add model backends:
pip install coreb[hf] # transformers backend
pip install coreb[gemini] # Google Gemini API
pip install coreb[all] # everything
Quick Start
Load the Dataset
from datasets import load_dataset
# Load v202603 release (latest)
code_corpus = load_dataset("hq-bench/coreb", "code_corpus", split="release_v2603")
text_corpus = load_dataset("hq-bench/coreb", "text_corpus", split="release_v2603")
# Load task-specific queries and qrels
t2c_queries = load_dataset("hq-bench/coreb", "text2code_queries", split="release_v2603")
t2c_qrels = load_dataset("hq-bench/coreb", "text2code_qrels", split="release_v2603")
print(f"Code corpus: {len(code_corpus)} documents")
print(f"T2C queries: {len(t2c_queries)} queries, {len(t2c_qrels)} qrels")
Run Evaluation
from coreb_runner.benchmark import (
    load_jsonl,
    convert_corpus_to_coir_format,
    convert_queries_to_coir_format,
    convert_qrels_to_coir_format,
    EvaluateRetrieval,
    DenseRetrievalExactSearch,
    create_model_wrapper,
)
# Load data (from local JSONL files or convert from HF datasets)
corpus = convert_corpus_to_coir_format(load_jsonl("code_corpus.jsonl"))
queries = convert_queries_to_coir_format(load_jsonl("text2code_queries.jsonl"))
qrels = convert_qrels_to_coir_format(load_jsonl("text2code_qrels.jsonl"))
# Create model wrapper
model = create_model_wrapper("jinaai/jina-embeddings-v3", model_type="huggingface")
# Run retrieval + evaluation
retriever = DenseRetrievalExactSearch(model, batch_size=64)
evaluator = EvaluateRetrieval(retriever, k_values=[1, 3, 5, 10])
results = evaluator.retrieve(corpus, queries)
ndcg, _map, recall, precision = evaluator.evaluate(qrels, results, evaluator.k_values)
print(f"nDCG@10: {ndcg['NDCG@10']:.4f}")
print(f"Recall@10: {recall['Recall@10']:.4f}")
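The snippet above reads local JSONL files. One possible way to produce them from the HuggingFace datasets loaded earlier is sketched below. `write_jsonl` is a hypothetical helper, not part of coreb, and it assumes that `load_jsonl` accepts rows with the same field names as the HuggingFace columns, which the README does not confirm.

```python
import json


def write_jsonl(rows, path):
    """Write an iterable of dict-like rows as JSON Lines; return the row count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # One JSON object per line; keep non-ASCII characters readable.
            f.write(json.dumps(dict(row), ensure_ascii=False) + "\n")
            count += 1
    return count


# Assumed usage with the datasets from "Load the Dataset":
# write_jsonl(code_corpus, "code_corpus.jsonl")
# write_jsonl(t2c_queries, "text2code_queries.jsonl")
# write_jsonl(t2c_qrels, "text2code_qrels.jsonl")
```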
Evaluation with Graded Relevance
CoREB uses relevance_level=2 — only rel>=2 items count as relevant for binary metrics (Recall, MAP, Precision). Hard negatives (rel=1) penalize nDCG by occupying top ranks with zero gain but do not inflate Recall/MRR.
# The EvaluateRetrieval class handles this automatically:
# - rel=1 (hard negatives) are zeroed out for nDCG computation
# - relevance_level=2 is set for pytrec_eval binary metrics
print(f"Relevance threshold: {EvaluateRetrieval.RELEVANCE_LEVEL}") # 2
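The effect of zeroing hard negatives can be illustrated with a small standalone nDCG@k function. This is a sketch under stated assumptions (linear gain, log2 discount, grades below the threshold mapped to zero gain), not the code inside `EvaluateRetrieval`. The name `ndcg_at_k` and the toy document ids are illustrative.

```python
import math


def ndcg_at_k(ranked_doc_ids, grades, k, relevance_level=2):
    """nDCG@k where grades below relevance_level contribute zero gain."""

    def gain(doc_id):
        grade = grades.get(doc_id, 0)
        return grade if grade >= relevance_level else 0

    # Discounted gain of the actual ranking (rank 1 has discount log2(2) = 1).
    dcg = sum(gain(d) / math.log2(i + 2) for i, d in enumerate(ranked_doc_ids[:k]))
    # Ideal ranking: all true positives first, in descending grade order.
    ideal = sorted((g for g in grades.values() if g >= relevance_level), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


grades = {"pos": 2, "hard_neg": 1}
clean = ndcg_at_k(["pos", "hard_neg", "other"], grades, k=10)
pushed_down = ndcg_at_k(["hard_neg", "pos", "other"], grades, k=10)
```

When the hard negative outranks the positive, it occupies rank 1 with zero gain and the positive is discounted at rank 2, so `pushed_down` is lower than `clean`.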
Dataset Structure
Available on HuggingFace: hq-bench/coreb
8 configs x 2 splits (release_v2602, release_v2603):
| Config | v2603 Rows | Description |
|---|---|---|
| code_corpus | 1,744 | Code solutions (5 languages, 2 generator models) |
| text_corpus | 875 | Problem descriptions (175 original + 700 LLM noise) |
| text2code_queries | 1,123 | T2C queries (canonical, full, search subtasks) |
| text2code_qrels | 5,950 | T2C relevance judgments (2,814 pos + 3,136 hard neg) |
| code2code_queries | 278 | C2C queries (cross-language) |
| code2code_qrels | 1,457 | C2C relevance judgments (623 pos + 834 hard neg) |
| code2text_queries | 1,200 | C2T queries (canonical, full, match subtasks) |
| code2text_qrels | 4,610 | C2T relevance judgments (820 pos + 2,650 hard neg) |
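The positive and hard-negative counts for a qrels config can be recomputed from its rows. The helper below is a minimal sketch. The column name `score` is a hypothetical placeholder, since the actual qrels schema is not shown here, so pass the real column name via `score_key`.

```python
from collections import Counter


def count_by_relevance(qrel_rows, score_key="score"):
    """Count positives (rel=2) and hard negatives (rel=1) in an iterable of qrel rows."""
    counts = Counter(int(row[score_key]) for row in qrel_rows)
    return {"positive": counts.get(2, 0), "hard_negative": counts.get(1, 0)}


# Toy rows standing in for a loaded qrels split.
toy_qrels = [
    {"query_id": "q1", "doc_id": "d1", "score": 2},
    {"query_id": "q1", "doc_id": "d2", "score": 1},
    {"query_id": "q2", "doc_id": "d3", "score": 1},
]
summary = count_by_relevance(toy_qrels)
```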
Benchmark Results (v202603, nDCG@10)
| Rank | Model | Avg | T2C | C2C | C2T |
|---|---|---|---|---|---|
| 1 | GemEmb-2 | 0.639 | 0.434 | 0.698 | 0.784 |
| 2 | C2LLM-7B | 0.623 | 0.443 | 0.659 | 0.766 |
| 3 | jina-code-1.5b | 0.607 | 0.414 | 0.671 | 0.735 |
| 4 | C2LLM-0.5B | 0.604 | 0.430 | 0.657 | 0.725 |
| 5 | jina-code-0.5b | 0.596 | 0.386 | 0.677 | 0.725 |
| 6 | F2LLM-4B | 0.547 | 0.407 | 0.500 | 0.735 |
| 7 | Qwen3-Emb-4B | 0.495 | 0.390 | 0.392 | 0.704 |
| 8 | F2LLM-1.7B | 0.485 | 0.383 | 0.383 | 0.690 |
| 9 | Qwen3-Emb-0.6B | 0.443 | 0.349 | 0.384 | 0.597 |
| 10 | F2LLM-0.6B | 0.439 | 0.344 | 0.334 | 0.641 |
| 11 | Qwen3-Emb-8B | 0.428 | 0.328 | 0.320 | 0.635 |
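The Avg column appears consistent, within rounding, with the unweighted mean of the three per-task scores. This is an inference from the numbers rather than something the table states. The check below uses the first three rows, and `mean_matches` is an illustrative helper.

```python
# (Avg, T2C, C2C, C2T) copied from the table above.
rows = {
    "GemEmb-2": (0.639, 0.434, 0.698, 0.784),
    "C2LLM-7B": (0.623, 0.443, 0.659, 0.766),
    "jina-code-1.5b": (0.607, 0.414, 0.671, 0.735),
}


def mean_matches(avg, *task_scores, tol=1e-3):
    """True if avg is within tol of the unweighted mean of task_scores."""
    return abs(sum(task_scores) / len(task_scores) - avg) <= tol


checks = {name: mean_matches(*values) for name, values in rows.items()}
```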
CoREB-Reranker
hq-bench/coreb-code-reranker is a code reranker fine-tuned from Qwen3-Reranker-4B via LoRA. It is the first reranker to achieve consistent gains across all three code search tasks.
Reranking delta (nDCG@10 %):
| Reranker | T2C | C2C | C2T |
|---|---|---|---|
| CoREB-Reranker | +1.1 | +5.1 | +0.8 |
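The deltas above come from a two-stage setup in which a reranker re-scores the first-stage candidates. The sketch below shows the general shape of that step, assuming first-stage results are a mapping of query id → {document id: score}. `rerank_top_k` and the token-overlap scorer are hypothetical stand-ins. In practice `score_pair` would wrap a cross-encoder such as CoREB-Reranker.

```python
def rerank_top_k(results, score_pair, queries, corpus, top_k=100):
    """Re-score each query's top_k first-stage hits with score_pair(query, doc)."""
    reranked = {}
    for qid, doc_scores in results.items():
        # Keep only the top_k candidates by first-stage score.
        candidates = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_k]
        reranked[qid] = {
            doc_id: float(score_pair(queries[qid], corpus[doc_id]))
            for doc_id in candidates
        }
    return reranked


def token_overlap(query_text, doc_text):
    """Toy scorer: number of shared whitespace-separated tokens."""
    return len(set(query_text.lower().split()) & set(doc_text.lower().split()))


queries = {"q1": "reverse a list"}
corpus = {"d1": "sort a list of numbers", "d2": "reverse a list in place"}
first_stage = {"q1": {"d1": 0.9, "d2": 0.8}}

second_stage = rerank_top_k(first_stage, token_overlap, queries, corpus, top_k=2)
best = max(second_stage["q1"], key=second_stage["q1"].get)
```

The reranked dictionary has the same shape as the first-stage results, so it can be passed to the same evaluation step.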
Training and test data: hq-bench/coreb-code-reranker-train-test-dataset
| Split | Records | T2C | C2T | C2C | Source |
|---|---|---|---|---|---|
| train | 4,173 | 2,742 | 1,064 | 367 | v202602 |
| test | 3,882 | 2,249 | 1,010 | 623 | v202603 |
Each record contains a query, one positive, up to 16 hard negatives (rel=1), and easy negatives sampled from the corpus. Train/test splits are problem-disjoint.
from datasets import load_dataset
train = load_dataset("hq-bench/coreb-code-reranker-train-test-dataset", split="train")
test = load_dataset("hq-bench/coreb-code-reranker-train-test-dataset", split="test")
# Filter by task
t2c_train = train.filter(lambda x: x["task"] == "text2code")
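For pointwise reranker training, each record can be flattened into labeled (query, document) pairs. The sketch below assumes hypothetical field names (`query`, `positive`, `hard_negatives`, `easy_negatives`), so check the dataset card for the actual schema. The binary labels are one possible design choice, not necessarily the scheme used to train CoREB-Reranker.

```python
def record_to_pairs(record):
    """Flatten one record into (query, document, label) triples."""
    query = record["query"]
    pairs = [(query, record["positive"], 1)]
    # Hard and easy negatives both receive label 0 in this simple scheme.
    for key in ("hard_negatives", "easy_negatives"):
        pairs.extend((query, doc, 0) for doc in record.get(key, []))
    return pairs


toy_record = {
    "task": "text2code",
    "query": "reverse a list",
    "positive": "def rev(xs): return xs[::-1]",
    "hard_negatives": ["def rev(xs): return sorted(xs)"],
    "easy_negatives": ["print('hello')"],
}
pairs = record_to_pairs(toy_record)
```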
Tutorials
Interactive Colab notebooks to get started:
| Notebook | Description |
|---|---|
| 01 — Download & Analyze Data | Load the dataset from HuggingFace, explore corpus/queries/qrels, and analyze statistics |
| 02 — Run Evaluation | Two-stage evaluation: dense retrieval + reranking with CoREB-Reranker |
Citation
@article{xue2025coreb,
  title = {Beyond Retrieval: A Multitask Benchmark and Model for Code Search},
  author = {Xue, Siqiao and Liao, Zihan and Qin, Jin and Zhang, Ziyin and Mu, Yixiang and Zhou, Fan and Yu, Hang},
  journal = {arXiv preprint arXiv:2605.04615},
  year = {2025},
  url = {https://arxiv.org/abs/2605.04615}
}
License
This project is licensed under the Apache License 2.0 — see LICENSE for details.