arabic-agent-eval

May 30, 2026

An open, installable, dialect-split Arabic function-calling benchmark

License: Apache 2.0 · Data: CC-BY-4.0 · Python 3.9+

Because Arabic agents deserve evaluation too.

Why

Run a Gulf Arabic instruction through a frontier model and watch tool-call arguments come back transliterated — الرياض becomes Riyadh, أبي أحجز collapses to MSA. Dialect intent gets ignored. Public measurement for Arabic tool use is still thin.

This benchmark focuses on native Arabic functions, dialect splits, published data, and a reproducible grader. Claims about cross-language accuracy deltas belong in the matrix you generate with your own keys (see scripts/build_result_table.py) — not in this README.

51 evaluation items. 6 categories. 5 dialect variants. 22 Arabic-context functions. One score that tells you how well a model actually handles Arabic function calling.

Install

pip install arabic-agent-eval

Quick Start

# Set up API keys
aae config

# Quick single-provider benchmark (12 items)
aae quick openai

# Full benchmark across all configured providers
aae run

# Compare two providers
aae compare openai anthropic

Evaluation Categories

| Category | Arabic | What it tests |
| --- | --- | --- |
| Simple Function Calling | استدعاء بسيط | Pick the right function, extract correct parameters |
| Parameter Extraction | استخراج المعاملات | Extract Arabic parameters from natural text |
| Multi-Step Reasoning | تفكير متعدد الخطوات | Chain multiple function calls in sequence |
| Dialect Handling | معالجة اللهجات | Understand Gulf, Egyptian, Levantine, Maghrebi dialects |
| Tool Selection | اختيار الأداة | Pick the right tool from 10 options |
| Error Recovery | معالجة الأخطاء | Handle Arabic error responses correctly |

Dialect Coverage

Every category includes dialect variants:

| Dialect | Example |
| --- | --- |
| MSA (فصحى) | أريد حجز فندق في دبي غداً |
| Gulf (خليجي) | ابي أحجز فندق في دبي بكرة |
| Egyptian (مصري) | عايز أحجز فندق في دبي بكره |
| Levantine (شامي) | بدي احجز فندق بدبي بكرا |
| Maghrebi (مغاربي) | بغيت نحجز فندق في دبي غدا |

Functions

22 Arabic-context functions including:

| Function | Arabic | Context |
| --- | --- | --- |
| search_flights | البحث عن رحلات | Regional airlines |
| get_prayer_times | مواقيت الصلاة | Islamic calendar |
| calculate_zakat | حساب الزكاة | Islamic finance |
| find_quran_verse | البحث في القرآن | Quran search |
| check_visa_status | حالة التأشيرة | GCC visa systems |
| get_stock_price | سعر السهم | Tadawul, ADX, DFM |
| convert_currency | تحويل العملات | SAR, AED, EGP, MAD |
| book_car | حجز سيارة | Regional ride-hailing |
| order_food | طلب طعام | Local restaurants |
| get_traffic | حالة المرور | City traffic |

Scoring

Each item is scored on 4 dimensions:

| Dimension | What it measures |
| --- | --- |
| Function Selection | Did the model pick the right function? (0 or 1) |
| Argument Accuracy | Are the extracted arguments correct? (0-1 scale) |
| Arabic Preservation | Are Arabic values preserved, not transliterated? (0 or 1) |
| Dialect Understanding | Did the model understand the dialect? (dialect category only) |
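
The Arabic Preservation dimension can be pictured with a small check like the one below. This is only a sketch, not the benchmark's grader: the helper names `has_arabic` and `preserves_arabic` are hypothetical, and the rule used here (an expected Arabic value must still contain Arabic-script characters in the model's output) is an assumption for illustration.

```python
def has_arabic(text: str) -> bool:
    """Return True if any character falls in the main Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in text)


def preserves_arabic(expected: str, actual: str) -> int:
    """Hypothetical 0/1 check for one argument value.

    If the expected value is written in Arabic script, the model's value
    must also contain Arabic script (i.e. it was not transliterated).
    """
    if not has_arabic(expected):
        return 1  # nothing to preserve
    return 1 if has_arabic(actual) else 0


same_script = preserves_arabic("الرياض", "الرياض")    # Arabic kept
transliterated = preserves_arabic("الرياض", "Riyadh")  # Arabic lost
```

A check this coarse only detects script loss; it does not compare the values themselves, which is what Argument Accuracy covers.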

Overall score = weighted average across all 6 categories.
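
As a rough illustration of the weighted average, the sketch below combines per-category scores into one number. The helper `overall_score`, the category keys, the scores, and the weights are all made up for the example; the actual weights are not listed in this README.

```python
def overall_score(category_scores, weights):
    """Weighted mean of per-category scores (each in the range 0-1)."""
    total_weight = sum(weights[c] for c in category_scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_sum = sum(category_scores[c] * weights[c] for c in category_scores)
    return weighted_sum / total_weight


# Illustrative numbers only (two categories instead of six).
scores = {"simple": 0.9, "dialect": 0.5}
weights = {"simple": 1.0, "dialect": 3.0}
result = overall_score(scores, weights)  # (0.9 * 1 + 0.5 * 3) / 4
```

With a heavier weight on the dialect category, a weak dialect score pulls the overall result down more than a simple unweighted mean would.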

Supported Providers

| Provider | Default Model |
| --- | --- |
| OpenAI | gpt-4o |
| Anthropic | claude-sonnet-4-6 |
| Google | gemini-2.0-flash |
| DeepSeek | deepseek-chat |
| Groq | llama-3.3-70b-versatile |
| Mistral | mistral-large-latest |
| Qwen | qwen-plus |
| xAI | grok-2 |
| Cohere | command-r-plus |
| Together | Qwen2.5-72B |
| Fireworks | Qwen2.5-72B |
| OpenRouter | nousresearch/hermes-4-70b |
| Hermes (direct) | NousResearch/Hermes-4-70B |

As a Library

from arabic_agent_eval import Dataset, Evaluator

dataset = Dataset()

def my_call_fn(instruction, tools, functions):
    # Call your model here
    return {"calls": [...], "raw": "..."}

evaluator = Evaluator(call_fn=my_call_fn, provider="my-model", model="v1")
result = evaluator.evaluate(dataset)
print(f"Score: {result.overall_score:.1%} ({result.overall_grade})")
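
For a dry run without API keys, a stub `call_fn` can return a fixed call. The sketch below is an assumption-heavy illustration: it guesses that each entry in `calls` is a dict with `name` and `arguments` keys, and the `city` argument name is hypothetical. Check both against the schema documentation before relying on them.

```python
import json


def stub_call_fn(instruction, tools, functions):
    """Hypothetical stand-in for a model: always 'calls' get_prayer_times.

    The per-call dict shape (name/arguments) and the 'city' argument are
    assumptions, not the package's documented schema.
    """
    call = {"name": "get_prayer_times", "arguments": {"city": "الرياض"}}
    # ensure_ascii=False keeps Arabic text readable instead of \u escapes
    return {"calls": [call], "raw": json.dumps(call, ensure_ascii=False)}


out = stub_call_fn("متى صلاة المغرب في الرياض؟", tools=[], functions=[])
```

A stub like this would be passed as `call_fn=stub_call_fn` to `Evaluator`, which makes it possible to exercise the harness end to end before wiring in a real provider.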

CI / Automation

# JSON output for pipelines
aae run --json-output

# Fail if score drops below threshold
aae run --provider openai --min-score 0.7

Dataset Stats

aae dataset
  • 51 evaluation items
  • 6 categories (weighted)
  • 5 Arabic dialects
  • 22 function definitions
  • 3 difficulty levels

Security

API keys are stored in ~/.aae/config.json with 0600 permissions. Environment variables are the recommended way to provide keys in CI.
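
For reference, a file can be created with 0600 permissions in Python roughly as follows. This is a generic POSIX sketch, not the code that `aae config` runs; the helper name `write_private_json` and the contents of the demo file are assumptions.

```python
import json
import os
import stat
import tempfile


def write_private_json(path, data):
    """Write JSON to `path`, readable and writable by the owner only."""
    # Create the file with mode 0600 from the start, so it is never
    # briefly world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    os.chmod(path, 0o600)  # enforce the mode even if the file already existed


# Demo in a temporary directory rather than the real ~/.aae/config.json
demo_path = os.path.join(tempfile.mkdtemp(), "config.json")
write_private_json(demo_path, {"provider": "placeholder-key"})
mode = stat.S_IMODE(os.stat(demo_path).st_mode)
```

On a POSIX system, `mode` should equal `0o600` after this runs; Windows handles file permissions differently, so the check does not carry over there.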

Documentation

  • Schema: EvalItem, ExpectedCall, function registry
  • Grading: canonical structured-call comparison, Arabic normalization, per-category weighting
  • Related work: positioning vs BFCL, MADAR, Habibi, Hermes-FC, OALL
  • Baselines: template + published runs
  • Dataset card: HF-style, YAML frontmatter, splits
  • mtg (Morphological Type Guards): a JSON Schema extension for multilingual tool-call arguments. Uses this benchmark as a diagnostic substrate.
  • ToolProof: tool-call verification + signed receipts. Consumes MTG violations.
  • artok: Arabic token cost calculator across 18 tokenizers. Provides the token_cost_delta signal.

Export JSONL

# Regenerate data/*.jsonl from the Python source-of-truth
aae export

# Or custom path
aae export --out /path/to/out
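
The exported files are JSON Lines, so each non-empty line parses as one JSON object. The sketch below reads such data and tallies records by a `category` field. The field names and the sample records are assumptions made for the example; the actual item schema is described in the Schema documentation.

```python
import io
import json
from collections import Counter


def count_by_field(lines, field="category"):
    """Tally JSONL records by one field, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line).get(field, "unknown")] += 1
    return counts


# Made-up sample standing in for an exported data/*.jsonl file
sample = io.StringIO(
    '{"id": "a1", "category": "dialect"}\n'
    '{"id": "a2", "category": "dialect"}\n'
    '\n'
    '{"id": "a3", "category": "simple"}\n'
)
counts = count_by_field(sample)
```

With a real export, the same function accepts an open file handle, for example `with open(path, encoding="utf-8") as fh: count_by_field(fh)`.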

Community

Built with input from the Saudi AI Community.

License

Code: Apache-2.0 · Data: CC-BY-4.0 · See LICENSES.md.