arabic-agent-eval
May 30, 2026
An open, installable, dialect-split Arabic function-calling benchmark
Because Arabic agents deserve evaluation too.
Why
Run a Gulf Arabic instruction through a frontier model and watch tool-call arguments come back transliterated — الرياض becomes Riyadh, أبي أحجز collapses to MSA. Dialect intent gets ignored. Public measurement for Arabic tool use is still thin.
This benchmark focuses on native Arabic functions, dialect splits, published data, and a reproducible grader. Claims about cross-language accuracy deltas belong in the matrix you generate with your own keys (see scripts/build_result_table.py) — not in this README.
51 evaluation items. 6 categories. 5 dialect variants. 22 Arabic-context functions. One score that tells you how well a model actually handles Arabic function calling.
Install
pip install arabic-agent-eval
Quick Start
# Set up API keys
aae config
# Quick single-provider benchmark (12 items)
aae quick openai
# Full benchmark across all configured providers
aae run
# Compare two providers
aae compare openai anthropic
Evaluation Categories
| Category | Arabic | What it tests |
|---|---|---|
| Simple Function Calling | استدعاء بسيط | Pick the right function, extract correct parameters |
| Parameter Extraction | استخراج المعاملات | Extract Arabic parameters from natural text |
| Multi-Step Reasoning | تفكير متعدد الخطوات | Chain multiple function calls in sequence |
| Dialect Handling | معالجة اللهجات | Understand Gulf, Egyptian, Levantine, and Maghrebi dialects |
| Tool Selection | اختيار الأداة | Pick the right tool from 10 options |
| Error Recovery | معالجة الأخطاء | Handle Arabic error responses correctly |
Dialect Coverage
Every category includes dialect variants:
| Dialect | Example |
|---|---|
| MSA (فصحى) | أريد حجز فندق في دبي غداً |
| Gulf (خليجي) | ابي أحجز فندق في دبي بكرة |
| Egyptian (مصري) | عايز أحجز فندق في دبي بكره |
| Levantine (شامي) | بدي احجز فندق بدبي بكرا |
| Maghrebi (مغاربي) | بغيت نحجز فندق في دبي غدا |
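The table above can be read as five surface forms of one request that should all resolve to the same tool call. The following is a minimal sketch of that idea, not the benchmark's actual schema: the `DialectVariant` class, the `book_hotel` function name, and its `city` argument are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DialectVariant:
    dialect: str
    instruction: str


# Hypothetical target call; "book_hotel" and "city" are illustrative names only.
EXPECTED_CALL = {"name": "book_hotel", "arguments": {"city": "دبي"}}

# Instructions copied from the dialect table above.
VARIANTS = [
    DialectVariant("msa", "أريد حجز فندق في دبي غداً"),
    DialectVariant("gulf", "ابي أحجز فندق في دبي بكرة"),
    DialectVariant("egyptian", "عايز أحجز فندق في دبي بكره"),
    DialectVariant("levantine", "بدي احجز فندق بدبي بكرا"),
    DialectVariant("maghrebi", "بغيت نحجز فندق في دبي غدا"),
]

city = EXPECTED_CALL["arguments"]["city"]
for variant in VARIANTS:
    # The city appears in every variant, sometimes fused with a preposition (بدبي).
    print(variant.dialect, city in variant.instruction)
```

The wording around the city changes from dialect to dialect, but the value a model should place in the argument stays the same Arabic string.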
Functions
22 Arabic-context functions including:
| Function | Arabic | Context |
|---|---|---|
| search_flights | البحث عن رحلات | Regional airlines |
| get_prayer_times | مواقيت الصلاة | Islamic calendar |
| calculate_zakat | حساب الزكاة | Islamic finance |
| find_quran_verse | البحث في القرآن | Quran search |
| check_visa_status | حالة التأشيرة | GCC visa systems |
| get_stock_price | سعر السهم | Tadawul, ADX, DFM |
| convert_currency | تحويل العملات | SAR, AED, EGP, MAD |
| book_car | حجز سيارة | Regional ride-hailing |
| order_food | طلب طعام | Local restaurants |
| get_traffic | حالة المرور | City traffic |
Scoring
Each item is scored on 4 dimensions:
| Dimension | What it measures |
|---|---|
| Function Selection | Did the model pick the right function? (0 or 1) |
| Argument Accuracy | Are the extracted arguments correct? (0-1 scale) |
| Arabic Preservation | Are Arabic values preserved, not transliterated? (0 or 1) |
| Dialect Understanding | Did the model understand the dialect? (dialect category only) |
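One way to picture the Arabic Preservation dimension is as a script check on string arguments. This is a minimal sketch, assuming "preserved" means that an expected value containing Arabic characters comes back containing Arabic characters and no Latin letters. The `is_arabic_preserved` helper is hypothetical and is not the benchmark's grader.

```python
import re

# ASSUMPTION: the Arabic and Arabic Supplement blocks are what counts as "Arabic script".
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")
LATIN_LETTER = re.compile(r"[A-Za-z]")


def is_arabic_preserved(expected: str, actual: str) -> int:
    """Return 1 if an Arabic expected value came back in Arabic script, else 0."""
    if not ARABIC_CHAR.search(expected):
        # Nothing Arabic to preserve, e.g. a currency code such as "SAR".
        return 1
    has_arabic = bool(ARABIC_CHAR.search(actual))
    has_latin = bool(LATIN_LETTER.search(actual))
    return 1 if has_arabic and not has_latin else 0


print(is_arabic_preserved("الرياض", "الرياض"))  # 1
print(is_arabic_preserved("الرياض", "Riyadh"))  # 0
```

A real grader would likely also need a policy for mixed-script values such as brand names, which this sketch simply scores as 0.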
Overall score = weighted average across all 6 categories.
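The weighted average can be sketched as follows. The weights and per-category scores here are hypothetical placeholders chosen for illustration; the actual per-category weighting is described in the grading documentation, and the category keys are assumed names.

```python
# HYPOTHETICAL weights and keys; the benchmark's real values may differ.
CATEGORY_WEIGHTS = {
    "simple_function_calling": 1.0,
    "parameter_extraction": 1.0,
    "multi_step_reasoning": 2.0,
    "dialect_handling": 2.0,
    "tool_selection": 1.0,
    "error_recovery": 1.0,
}


def overall_score(category_scores: dict, weights: dict) -> float:
    """Weighted average of per-category scores, each in the range 0 to 1."""
    total_weight = sum(weights[name] for name in category_scores)
    if total_weight == 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(score * weights[name] for name, score in category_scores.items())
    return weighted_sum / total_weight


# Made-up scores, used only to show the arithmetic.
example_scores = {
    "simple_function_calling": 0.9,
    "parameter_extraction": 0.8,
    "multi_step_reasoning": 0.5,
    "dialect_handling": 0.6,
    "tool_selection": 0.7,
    "error_recovery": 1.0,
}

print(f"{overall_score(example_scores, CATEGORY_WEIGHTS):.1%}")  # 70.0%
```

With these placeholder numbers the weighted sum is 5.6 over a total weight of 8, so doubling the weight of the two harder categories pulls the result below the unweighted mean of 75%.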
Supported Providers
| Provider | Default Model |
|---|---|
| OpenAI | gpt-4o |
| Anthropic | claude-sonnet-4-6 |
| Google | gemini-2.0-flash |
| DeepSeek | deepseek-chat |
| Groq | llama-3.3-70b-versatile |
| Mistral | mistral-large-latest |
| Qwen | qwen-plus |
| xAI | grok-2 |
| Cohere | command-r-plus |
| Together | Qwen2.5-72B |
| Fireworks | Qwen2.5-72B |
| OpenRouter | nousresearch/hermes-4-70b |
| Hermes (direct) | NousResearch/Hermes-4-70B |
As a Library
from arabic_agent_eval import Dataset, Evaluator
dataset = Dataset()
def my_call_fn(instruction, tools, functions):
    # Call your model here
    return {"calls": [...], "raw": "..."}
evaluator = Evaluator(call_fn=my_call_fn, provider="my-model", model="v1")
result = evaluator.evaluate(dataset)
print(f"Score: {result.overall_score:.1%} ({result.overall_grade})")
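Before wiring in a real model, a canned `call_fn` can help confirm that the plumbing works. The sketch below keeps the `(instruction, tools, functions)` signature and the `calls` / `raw` keys from the snippet above. Everything else is an assumption: the README does not specify the shape of each entry in `calls`, so the `name` / `arguments` layout and the `city` parameter are guesses to check against the schema documentation.

```python
import json


def canned_call_fn(instruction, tools, functions):
    """Return the same tool call for every instruction; only useful for smoke tests."""
    # ASSUMPTION: each call is a dict with "name" and "arguments" keys.
    # "city" is a hypothetical parameter name for get_prayer_times.
    call = {"name": "get_prayer_times", "arguments": {"city": "الرياض"}}
    # ensure_ascii=False keeps Arabic text readable instead of \uXXXX escapes.
    return {"calls": [call], "raw": json.dumps(call, ensure_ascii=False)}


response = canned_call_fn("متى صلاة المغرب في الرياض؟", tools=[], functions={})
print(response["raw"])
```

Passing `ensure_ascii=False` matters when logging raw output for an Arabic benchmark, since the default escaping makes transliteration problems harder to spot by eye.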
CI / Automation
# JSON output for pipelines
aae run --json-output
# Fail if score drops below threshold
aae run --provider openai --min-score 0.7
Dataset Stats
aae dataset
- 51 evaluation items
- 6 categories (weighted)
- 5 Arabic dialects
- 22 function definitions
- 3 difficulty levels
Security
API keys are stored in ~/.aae/config.json with 0600 permissions. Environment variables are the recommended way to provide keys in CI.
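For readers unfamiliar with 0600 permissions, the sketch below shows one common way to create an owner-only file on a POSIX system. It illustrates the general technique and is not taken from the tool's source. The `write_private_json` helper and the placeholder key are hypothetical, and the demo writes to a temporary directory rather than to ~/.aae/config.json.

```python
import json
import os
import stat
import tempfile


def write_private_json(path: str, payload: dict) -> None:
    """Write JSON so that only the owning user can read or write the file."""
    # Passing 0o600 at creation time avoids a window with looser permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    # Tighten the mode in case the file already existed with wider permissions.
    os.chmod(path, 0o600)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "config.json")
    write_private_json(demo_path, {"openai": "sk-placeholder"})
    mode = stat.S_IMODE(os.stat(demo_path).st_mode)
    print(oct(mode))  # 0o600 on POSIX systems
```

On Windows the permission bits behave differently, so this check is only meaningful on Linux and macOS.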
Documentation
- Schema — EvalItem, ExpectedCall, function registry
- Grading — canonical structured-call comparison, Arabic normalization, per-category weighting
- Related work — positioning vs BFCL, MADAR, Habibi, Hermes-FC, OALL
- Baselines — template + published runs
- Dataset card — HF-style, YAML frontmatter, splits
Related projects
- mtg — Morphological Type Guards — a JSON Schema extension for multilingual tool-call arguments. Uses this benchmark as a diagnostic substrate.
- ToolProof — tool-call verification + signed receipts. Consumes MTG violations.
- artok — Arabic token cost calculator across 18 tokenizers. Provides the token_cost_delta signal.
Export JSONL
# Regenerate data/*.jsonl from the Python source-of-truth
aae export
# Or custom path
aae export --out /path/to/out
Community
Built with input from the Saudi AI Community.
License
Code: Apache-2.0 · Data: CC-BY-4.0 · See LICENSES.md.