arabic-agent-eval

May 30, 2026

An open, installable, dialect-split Arabic function-calling benchmark

Because Arabic agents deserve evaluation too.

Why

Run a Gulf Arabic instruction through a frontier model and watch the tool-call arguments come back altered: الرياض is transliterated to Riyadh, and the Gulf phrasing أبي أحجز collapses to MSA. Dialect intent gets ignored, and public measurement of Arabic tool use is still thin.

This benchmark focuses on native Arabic functions, dialect splits, published data, and a reproducible grader. Claims about cross-language accuracy deltas belong in the matrix you generate with your own keys (see scripts/build_result_table.py) — not in this README.

51 evaluation items. 6 categories. 5 dialect variants. 22 Arabic-context functions. One score that tells you how well a model actually handles Arabic function calling.

Install

pip install arabic-agent-eval

Quick Start

# Set up API keys
aae config

# Quick single-provider benchmark (12 items)
aae quick openai

# Full benchmark across all configured providers
aae run

# Compare two providers
aae compare openai anthropic

Evaluation Categories

Category	Arabic	What it tests
Simple Function Calling	استدعاء بسيط	Pick the right function, extract correct parameters
Parameter Extraction	استخراج المعاملات	Extract Arabic parameters from natural text
Multi-Step Reasoning	تفكير متعدد الخطوات	Chain multiple function calls in sequence
Dialect Handling	معالجة اللهجات	Understand Gulf, Egyptian, Levantine, Maghrebi dialects
Tool Selection	اختيار الأداة	Pick the right tool from 10 options
Error Recovery	معالجة الأخطاء	Handle Arabic error responses correctly

Dialect Coverage

Every category includes dialect variants:

Dialect	Example
MSA (فصحى)	أريد حجز فندق في دبي غداً
Gulf (خليجي)	ابي أحجز فندق في دبي بكرة
Egyptian (مصري)	عايز أحجز فندق في دبي بكره
Levantine (شامي)	بدي احجز فندق بدبي بكرا
Maghrebi (مغاربي)	بغيت نحجز فندق في دبي غدا

Functions

22 Arabic-context functions including:

Function	Arabic	Context
search_flights	البحث عن رحلات	Regional airlines
get_prayer_times	مواقيت الصلاة	Islamic calendar
calculate_zakat	حساب الزكاة	Islamic finance
find_quran_verse	البحث في القرآن	Quran search
check_visa_status	حالة التأشيرة	GCC visa systems
get_stock_price	سعر السهم	Tadawul, ADX, DFM
convert_currency	تحويل العملات	SAR, AED, EGP, MAD
book_car	حجز سيارة	Regional ride-hailing
order_food	طلب طعام	Local restaurants
get_traffic	حالة المرور	City traffic

Scoring

Each item is scored on 4 dimensions:

Dimension	What it measures
Function Selection	Did the model pick the right function? (0 or 1)
Argument Accuracy	Are the extracted arguments correct? (0-1 scale)
Arabic Preservation	Are Arabic values preserved, not transliterated? (0 or 1)
Dialect Understanding	Did the model understand the dialect? (dialect category only)
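
To make the first three dimensions concrete, here is a minimal sketch of how a single call could be scored. It is not the project's grader: the `name`/`arguments` call shape, the `city` parameter, the `score_call` and `normalize_arabic` helpers, and the specific normalization rules (strip vowel marks and tatweel, fold alef variants) are all assumptions made for illustration.

```python
import re
import unicodedata

# Letters of the basic Arabic block (hamza through yeh).
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
# Short-vowel marks, superscript alef, and tatweel: removed before comparing.
MARKS = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Alef with hamza above, hamza below, or madda: folded to bare alef.
ALEF_VARIANTS = re.compile(r"[\u0623\u0625\u0622]")


def normalize_arabic(text: str) -> str:
    """Light normalization (assumed rules, not the project's actual ones)."""
    text = unicodedata.normalize("NFKC", text)
    text = MARKS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    return text.strip()


def score_call(expected: dict, actual: dict) -> dict:
    """Score one call on function selection, argument accuracy, and preservation."""
    exp_args = expected.get("arguments", {})
    act_args = actual.get("arguments", {})
    matched = 0
    preserved = True
    for key, exp_val in exp_args.items():
        act_val = act_args.get(key)
        if isinstance(exp_val, str) and isinstance(act_val, str):
            same = normalize_arabic(exp_val) == normalize_arabic(act_val)
        else:
            same = key in act_args and exp_val == act_val
        matched += int(same)
        # An expected Arabic value that comes back missing or without any
        # Arabic letters is treated here as "not preserved".
        if isinstance(exp_val, str) and ARABIC_LETTER.search(exp_val):
            if not (isinstance(act_val, str) and ARABIC_LETTER.search(act_val)):
                preserved = False
    return {
        "function_selection": 1.0 if actual.get("name") == expected.get("name") else 0.0,
        "argument_accuracy": matched / len(exp_args) if exp_args else 1.0,
        "arabic_preservation": 1.0 if preserved else 0.0,
    }


if __name__ == "__main__":
    expected = {"name": "get_prayer_times", "arguments": {"city": "الرياض"}}
    transliterated = {"name": "get_prayer_times", "arguments": {"city": "Riyadh"}}
    print(score_call(expected, transliterated))
```

In this sketch a transliterated value fails both argument accuracy and Arabic preservation while the function selection still counts, which is why the dimensions are reported separately.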

Overall score = weighted average across all 6 categories.
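
The weighted average can be sketched as follows. The category keys and the weight values are hypothetical placeholders, since this README does not list the real weights, and the scores in the usage example are made-up inputs, not benchmark results.

```python
# Hypothetical weights for the six categories (assumed values for illustration).
HYPOTHETICAL_WEIGHTS = {
    "simple_function_calling": 1.0,
    "parameter_extraction": 1.0,
    "multi_step_reasoning": 1.5,
    "dialect_handling": 2.0,
    "tool_selection": 1.0,
    "error_recovery": 1.0,
}


def overall_score(category_scores: dict, weights: dict) -> float:
    """Weighted average of per-category scores, each in the range 0 to 1."""
    total_weight = sum(weights[name] for name in category_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(score * weights[name] for name, score in category_scores.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Made-up per-category scores, only to show the call shape.
    example_scores = {name: 0.5 for name in HYPOTHETICAL_WEIGHTS}
    print(f"{overall_score(example_scores, HYPOTHETICAL_WEIGHTS):.1%}")
```

Dividing by the sum of the weights actually used keeps the result in the 0 to 1 range even if a run covers only some categories.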

Supported Providers

Provider	Default Model
OpenAI	gpt-4o
Anthropic	claude-sonnet-4-6
Google	gemini-2.0-flash
DeepSeek	deepseek-chat
Groq	llama-3.3-70b-versatile
Mistral	mistral-large-latest
Qwen	qwen-plus
xAI	grok-2
Cohere	command-r-plus
Together	Qwen2.5-72B
Fireworks	Qwen2.5-72B
OpenRouter	nousresearch/hermes-4-70b
Hermes (direct)	NousResearch/Hermes-4-70B

As a Library

from arabic_agent_eval import Dataset, Evaluator

dataset = Dataset()

def my_call_fn(instruction, tools, functions):
    # Call your model here
    return {"calls": [...], "raw": "..."}

evaluator = Evaluator(call_fn=my_call_fn, provider="my-model", model="v1")
result = evaluator.evaluate(dataset)
print(f"Score: {result.overall_score:.1%} ({result.overall_grade})")
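
When wiring in a real model, a provider error in the middle of a run is a common failure. One possible pattern is to wrap the call function so that an exception becomes an empty result. This is a sketch under assumptions: `tolerate_errors` is a hypothetical helper, not part of the package, and it assumes the evaluator scores an empty `calls` list as a miss rather than rejecting it.

```python
import functools


def tolerate_errors(call_fn):
    """Wrap a call function so an exception yields an empty result instead."""

    @functools.wraps(call_fn)
    def wrapped(instruction, tools, functions):
        try:
            return call_fn(instruction, tools, functions)
        except Exception as exc:
            # Same keys as the call_fn return value shown above;
            # the error text goes in "raw" so it can be inspected later.
            return {"calls": [], "raw": f"call_fn error: {exc!r}"}

    return wrapped


if __name__ == "__main__":
    def flaky_call_fn(instruction, tools, functions):
        raise TimeoutError("provider did not respond")

    safe_call_fn = tolerate_errors(flaky_call_fn)
    print(safe_call_fn("ابي أحجز فندق في دبي بكرة", [], []))
```

The wrapped function keeps the `(instruction, tools, functions)` signature, so it could be passed as `call_fn` in place of the original.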

CI / Automation

# JSON output for pipelines
aae run --json-output

# Fail if score drops below threshold
aae run --provider openai --min-score 0.7

Dataset Stats

aae dataset

51 evaluation items
6 categories (weighted)
5 Arabic dialects
22 function definitions
3 difficulty levels

Security

API keys are stored in ~/.aae/config.json with 0600 permissions. Environment variables are the recommended way to provide keys in CI.
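
For readers who want the same behavior in their own tooling, the sketch below shows one way to write a key file with 0600 permissions and to prefer an environment variable when reading. The `save_config` and `load_key` helpers, the `<PROVIDER>_API_KEY` variable naming, and the flat provider-to-key JSON layout are assumptions for illustration; they are not taken from the `aae` source. It targets POSIX systems.

```python
import json
import os
from pathlib import Path


def save_config(config: dict, path: Path) -> None:
    """Write the config as JSON, readable and writable by the owner only."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with mode 0600 from the start rather than chmod-ing later.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(config, fh)
    # If the file already existed with looser permissions, tighten them.
    os.chmod(path, 0o600)


def load_key(provider: str, path: Path, env=os.environ):
    """Return the key from the environment if set, else from the config file."""
    env_name = f"{provider.upper()}_API_KEY"  # hypothetical naming convention
    if env.get(env_name):
        return env[env_name]
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8")).get(provider)
    return None
```

Checking the environment first means a CI job can inject a key without ever writing it to disk.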

Documentation

Schema — EvalItem, ExpectedCall, function registry
Grading — canonical structured-call comparison, Arabic normalization, per-category weighting
Related work — positioning vs BFCL, MADAR, Habibi, Hermes-FC, OALL
Baselines — template + published runs
Dataset card — HF-style, YAML frontmatter, splits

Related Projects

mtg — Morphological Type Guards — a JSON Schema extension for multilingual tool-call arguments. Uses this benchmark as a diagnostic substrate.
ToolProof — tool-call verification + signed receipts. Consumes MTG violations.
artok — Arabic token cost calculator across 18 tokenizers. Provides the token_cost_delta signal.

Export JSONL

# Regenerate data/*.jsonl from the Python source-of-truth
aae export

# Or custom path
aae export --out /path/to/out

Community

Built with input from the Saudi AI Community.

License

Code: Apache-2.0 · Data: CC-BY-4.0 · See LICENSES.md.