artok عرتوك

March 19, 2026

Arabic Token Tax Calculator

See how much more Arabic costs across 18 LLM tokenizers.

MIT License · Python 3.9+


Why This Exists

Arabic text takes roughly 2-5x as many tokens as the equivalent English, depending on the tokenizer. Same meaning, wildly different cost:

| Text | Claude Tokens | GPT-4.1 Tokens |
|---|---|---|
| الذكاء الاصطناعي يغير العالم | 25 | 10 |
| AI is changing the world | 5 | 5 |
| Ratio | 5.0x | 2.0x |

This is the Arabic Token Tax. If you're building Arabic AI products, you're paying 2-5x as much for the same capabilities. artok makes this visible, measurable, and actionable.
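The ratio row in the table above is simply the Arabic token count divided by the English token count. A minimal Python sketch using the counts from that table; `token_tax` is a hypothetical helper name for illustration, not part of artok's API:

```python
def token_tax(arabic_tokens: int, english_tokens: int) -> float:
    """Return how many Arabic tokens are spent per English token."""
    if english_tokens <= 0:
        raise ValueError("english_tokens must be positive")
    return arabic_tokens / english_tokens


# Token counts copied from the table above.
counts = {
    "Claude": {"arabic": 25, "english": 5},
    "GPT-4.1": {"arabic": 10, "english": 5},
}

for name, c in counts.items():
    # Prints "Claude: 5.0x" and "GPT-4.1: 2.0x"
    print(f"{name}: {token_tax(c['arabic'], c['english']):.1f}x")
```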

Install

```bash
git clone https://github.com/Moshe-ship/artok.git
cd artok
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[all]"
```

Quick Start

```bash
# See the tax across all 18 tokenizers
artok "الذكاء الاصطناعي يغير العالم"

# Compare Arabic vs English side by side
artok "الذكاء الاصطناعي" -e "Artificial intelligence"

# Run the Arabic friendliness benchmark (no input needed)
artok --benchmark

# See how diacritics inflate tokens
artok "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ" --tashkeel

# Compare dialects (MSA vs Egyptian vs Gulf vs Levantine vs Moroccan)
artok --dialects

# Heatmap: color each word by token cost
artok "الذكاء الاصطناعي يغير حياتنا" --heatmap

# Rank tokenizers by composite score
artok "نص عربي" --leaderboard

# Estimate monthly costs at 50M tokens
artok "نص عربي" -e "Arabic text" --cost 50

# See savings from switching away from Claude
artok "الذكاء الاصطناعي يغير العالم" --switch-from claude-sonnet

# Compare Arabic against other languages
artok "الذكاء الاصطناعي" --compare-langs 'en:AI|fr:IA|zh:人工智能'

# Analyze Arabic text from a URL
artok --url https://ar.wikipedia.org/wiki/ذكاء_اصطناعي -t gpt4.1,claude-sonnet

# Export as SVG for sharing on X/Twitter
artok "نص عربي" -e "Arabic text" --export results.svg

# Live mode: type and see counts in real time
artok --watch

# Keep pricing up to date
artok --update
```

Supported Tokenizers (18)

18 tokenizers across 10 providers. Pricing auto-updates from GitHub (artok --update).

| Tokenizer | Provider | Input $/1M | Output $/1M |
|---|---|---|---|
| GPT-4.1 | OpenAI | $2.00 | $8.00 |
| GPT-4.1 mini | OpenAI | $0.40 | $1.60 |
| GPT-4.1 nano | OpenAI | $0.10 | $0.40 |
| GPT-4o | OpenAI | $2.50 | $10.00 |
| GPT-4o mini | OpenAI | $0.15 | $0.60 |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 |
| Llama 4 | Meta | $0.18 | $0.18 |
| Qwen 3.5 | Alibaba | $0.10 | $0.40 |
| Mistral Large 3 | Mistral | $0.50 | $1.50 |
| Mistral Small | Mistral | $0.10 | $0.30 |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 |
| Gemini 3 Flash | Google | $0.50 | $3.00 |
| DeepSeek V3.2 | DeepSeek | $0.27 | $1.10 |
| Grok 2 | xAI | $2.00 | $10.00 |
| Command R+ | Cohere | $2.50 | $10.00 |
| Jamba 1.5 | AI21 | $0.20 | $0.40 |
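
To turn the prices above into a monthly bill, multiply the token volume (in millions) by the per-million price. The sketch below is an illustration, not artok's own cost logic: `monthly_cost` is a hypothetical helper, and it assumes the Arabic volume is the English volume scaled by a measured token ratio.

```python
def monthly_cost(tokens_millions: float, price_per_million: float) -> float:
    """Cost in USD for a given volume, expressed in millions of tokens."""
    return tokens_millions * price_per_million


# Input prices ($ per 1M tokens) copied from the table above.
INPUT_PRICE = {"GPT-4.1": 2.00, "Claude Sonnet 4.6": 3.00}

english_volume = 50  # millions of input tokens per month (assumed workload)
ratio = 2.0          # Arabic/English token ratio (assumed; measure your own text)

english_cost = monthly_cost(english_volume, INPUT_PRICE["GPT-4.1"])
arabic_cost = monthly_cost(english_volume * ratio, INPUT_PRICE["GPT-4.1"])

# Prints "English: $100.00  Arabic: $200.00"
print(f"English: ${english_cost:.2f}  Arabic: ${arabic_cost:.2f}")
```

The same content costs twice as much here purely because of the token ratio; the per-token price is unchanged.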

All Features

| Flag | What it does |
|---|---|
| (text) | Token count across all 18 tokenizers |
| -e | Arabic vs English comparison |
| -t | Filter to specific tokenizers |
| -c | Cost estimate at N million tokens |
| -w | Cost estimate at N million words |
| -f | Read text from file |
| --json | JSON output for scripting |
| --chart | Visual bar chart |
| --viz | Colored token split visualization |
| --batch | Process JSONL/CSV files |
| --recommend | Best tokenizer for a budget |
| --switch-from | Savings from switching providers |
| --compare-langs | Arabic vs other languages |
| --url | Analyze Arabic text from a URL |
| --tashkeel | Diacritics inflation analysis |
| --heatmap | Color words by token cost |
| --benchmark | Arabic friendliness score 0-100 |
| --dialects | MSA vs Egyptian vs Gulf vs Levantine vs Moroccan |
| --leaderboard | Composite score ranking |
| --watch | Live interactive mode |
| --export | Export to SVG |
| --update | Fetch latest pricing from GitHub |
| --list | Show all tokenizers with source info |

The Arabic Token Tax

Most LLM tokenizers are trained primarily on English/Latin text. Arabic characters get split into individual bytes or small fragments instead of whole words. The result: same meaning, 2-5x more tokens, 2-5x higher cost.
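
One way to see why byte-level fallback is costly: every letter in the basic Arabic block takes two bytes in UTF-8, while ASCII letters take one. A tokenizer that falls back to raw bytes can therefore spend up to two tokens per Arabic letter before any learned merges apply. The sketch below uses only the Python standard library and does not call any real tokenizer; `utf8_bytes_per_char` is a hypothetical helper name.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return sum(len(ch.encode("utf-8")) for ch in chars) / len(chars)


arabic = "الذكاء الاصطناعي"
english = "Artificial intelligence"

print(utf8_bytes_per_char(arabic))   # 2.0
print(utf8_bytes_per_char(english))  # 1.0
```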

Benchmark results (artok --benchmark):

| Rank | Tokenizer | Arabic Friendliness Score |
|---|---|---|
| 1 | Mistral Large 3 | 92.1/100 |
| 2 | Qwen 3.5 | 91.7/100 |
| 3 | GPT-4.1 | 91.1/100 |
| 4 | Gemini 2.5 Pro | 90.2/100 |
| 5 | Llama 4 | 83.6/100 |
| ... | Grok 2 | 50.3/100 |
| ... | Claude Sonnet 4.6 | 25.6/100 |

Key findings:

  • Mistral / Qwen 3.5: Best for Arabic — dedicated multilingual tokenizers
  • GPT-4.1 / Gemini / DeepSeek / Llama 4: Good at ~1.5-2.5x vs English
  • Grok 2: Moderate at ~2.5-3.5x vs English
  • Claude: Worst for Arabic at ~3.5-5x vs English — byte-level encoding for Arabic

Auto-Update Pricing

Tokenizer pricing changes frequently. artok handles this automatically:

  1. On startup, checks ~/.cache/artok/tokenizers.json (24h cache)
  2. If stale, fetches latest from tokenizers.json on GitHub
  3. Falls back to hardcoded defaults if offline

```bash
artok --update    # Force refresh right now
artok --list      # Shows pricing source (cached vs built-in)
```

To update pricing for all users: edit tokenizers.json in this repo. Users pick up the new prices within 24 hours, once their local cache expires.
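
The cache, fetch, fallback sequence described above can be sketched as follows. This is an illustration under assumptions, not the code in core.py: `load_pricing` and its parameters are hypothetical names, and the remote fetch is passed in as a callable so that no URL or HTTP client is assumed.

```python
import json
import time
from pathlib import Path
from typing import Callable

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24h, matching the cache window described above


def load_pricing(
    cache_path: Path,
    fetch_remote: Callable[[], dict],
    defaults: dict,
    ttl: float = CACHE_TTL_SECONDS,
) -> tuple[dict, str]:
    """Return (pricing, source); source is 'cached', 'remote' or 'built-in'.

    Hypothetical helper for illustration; artok's real logic may differ.
    """
    # 1. Use the local cache if it exists and is younger than the TTL.
    if cache_path.exists() and time.time() - cache_path.stat().st_mtime < ttl:
        try:
            return json.loads(cache_path.read_text(encoding="utf-8")), "cached"
        except (OSError, json.JSONDecodeError):
            pass  # unreadable cache: treat it as stale

    # 2. Otherwise fetch a fresh copy and rewrite the cache.
    try:
        pricing = fetch_remote()
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_text(json.dumps(pricing), encoding="utf-8")
        return pricing, "remote"
    except Exception:
        # 3. Offline or bad response: fall back to the built-in defaults.
        return defaults, "built-in"
```

Returning the source alongside the data is one way a command like `artok --list` could report whether prices came from the cache or from built-in defaults.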

Architecture

```text
artok/
├── cli.py          # CLI entry point, argument parsing
├── core.py         # Tokenizer loading, counting, remote config
├── display.py      # Rich terminal output (tables, charts, heatmaps)
├── __init__.py     # Version
└── __main__.py     # python -m artok support
tokenizers.json     # Remote-updatable pricing config
docs/index.html     # GitHub Pages landing page
```

  • Tokenizer backends: tiktoken (OpenAI), transformers (HuggingFace), tokenizer_fast (direct tokenizer.json loading for DeepSeek)
  • Output: Rich tables, colored text, bar charts, SVG export
  • Config: Hardcoded defaults + GitHub-hosted JSON with 24h local cache

Contributing

  1. Fork the repo
  2. Add a tokenizer to tokenizers.json (and to the TOKENIZERS list in core.py, which serves as the hardcoded fallback)
  3. Test with artok --list and artok "مرحبا" -t your-new-tokenizer
  4. Open a PR

To update pricing only: edit tokenizers.json — no code changes needed.

License

MIT — Musa the Carpenter