Memory Benchmarks

May 13, 2026 · View on GitHub

An open-source evaluation suite for benchmarking memory-augmented LLM systems. It currently supports Mem0 Cloud and Mem0 OSS, measuring memory recall, extraction quality, and retrieval accuracy.

Benchmarks

| Benchmark   | Dataset                                            | Questions | What it tests                                                                      |
| ----------- | -------------------------------------------------- | --------- | ---------------------------------------------------------------------------------- |
| LOCOMO      | 10 multi-session dialogues                         | ~300      | Factual recall, temporal reasoning, multi-hop inference                             |
| LongMemEval | 500 diverse questions, 6 types                     | 500       | Long-term memory across information extraction, temporal, and multi-session reasoning |
| BEAM        | 100 conversations per size bucket (100K–10M tokens) | 2,000+    | Real-world memory retrieval across 10 memory ability types                          |

Quick Start

git clone https://github.com/mem0ai/memory-benchmarks.git
cd memory-benchmarks
pip install -r requirements.txt

Option A: Mem0 Cloud

No Docker required. You need a Mem0 API key and an OpenAI API key (for the answerer/judge LLM).

# Set your keys
export MEM0_API_KEY=m0-your-key
export OPENAI_API_KEY=sk-your-key

# Run a benchmark
python -m benchmarks.locomo.run \
  --project-name my-first-test \
  --backend cloud \
  --mem0-api-key $MEM0_API_KEY

# LongMemEval (500 questions)
python -m benchmarks.longmemeval.run \
  --project-name my-first-test \
  --backend cloud \
  --mem0-api-key $MEM0_API_KEY \
  --all-questions

# BEAM (configurable size)
python -m benchmarks.beam.run \
  --project-name my-first-test \
  --backend cloud \
  --mem0-api-key $MEM0_API_KEY \
  --chat-sizes 100K --conversations 0-9

Option B: Mem0 OSS (Self-Hosted)

Requires Docker and Docker Compose. This starts a local Mem0 server backed by Qdrant.

cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

docker compose up -d
# Mem0 server: http://localhost:8888
# Qdrant:      http://localhost:6333
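Once the containers are up, a quick port check can confirm both services are reachable. This is a minimal sketch using only the Python standard library, assuming the default ports above:

```python
# Readiness check after `docker compose up -d`: try a TCP connection to
# the Mem0 and Qdrant ports (defaults from docker-compose.yml).
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in [("Mem0", 8888), ("Qdrant", 6333)]:
        status = "up" if port_open("localhost", port) else "not reachable"
        print(f"{name} (:{port}): {status}")
```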

Then run benchmarks against your local server:

# LOCOMO (fastest — ~300 questions, 10 conversations)
python -m benchmarks.locomo.run --project-name my-first-test

# LongMemEval (500 questions)
python -m benchmarks.longmemeval.run --project-name my-first-test --all-questions

# BEAM (configurable size)
python -m benchmarks.beam.run --project-name my-first-test --chat-sizes 100K --conversations 0-9

By default, the OSS server uses OpenAI for fact extraction (gpt-4o-mini) and embeddings (text-embedding-3-small). See Custom Models for using Azure, Ollama, or other providers.

View results in the UI

npm install
npm run dev -- -p 3001
# Open http://localhost:3001

The web UI lets you browse results, inspect per-question evaluations with retrieval details, view logs, and compare runs.

How It Works

Each benchmark script runs a three-stage pipeline:

Ingest → Search → Evaluate
  1. Ingest: Conversations are chunked and added to Mem0. The system extracts facts, embeds them, and builds entity links.
  2. Search: For each question, the system queries Mem0. Results are scored using semantic similarity + BM25 + entity boost.
  3. Evaluate: An LLM generates an answer from retrieved memories, then a judge LLM scores correctness against ground truth.
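The three stages can be sketched end to end. This is an illustrative toy, not the real implementation: the actual ranker combines semantic similarity, BM25, and an entity boost, and stage 3 uses answerer and judge LLMs rather than a substring check.

```python
# Toy sketch of the Ingest -> Search -> Evaluate pipeline. All helper
# names here are hypothetical; the real scripts live under benchmarks/.
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    score: float = 0.0

def ingest(conversations: list[str], store: list[Memory]) -> None:
    # Stage 1: chunk conversations and add them to the memory store.
    # (The real system extracts facts, embeds them, and links entities.)
    for turn in conversations:
        store.append(Memory(text=turn))

def search(question: str, store: list[Memory], top_k: int = 200) -> list[Memory]:
    # Stage 2: rank memories against the question. Naive word overlap
    # stands in for the real semantic + BM25 + entity-boost scoring.
    q = set(question.lower().split())
    for m in store:
        m.score = len(q & set(m.text.lower().split()))
    return sorted(store, key=lambda m: m.score, reverse=True)[:top_k]

def evaluate(question: str, retrieved: list[Memory], ground_truth: str) -> bool:
    # Stage 3: in the real pipeline an answerer LLM writes an answer from
    # the retrieved memories and a judge LLM scores it; here we just check
    # whether the ground truth appears in the retrieved context.
    context = " ".join(m.text for m in retrieved)
    return ground_truth.lower() in context.lower()

store: list[Memory] = []
ingest(["Alice moved to Berlin in 2023.", "Bob likes tea."], store)
hits = search("Where does Alice live?", store, top_k=1)
print(evaluate("Where does Alice live?", hits, "Berlin"))
```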

Configuration

Benchmark options

All benchmarks accept these common flags:

--project-name NAME        Run identifier (required)
--answerer-model MODEL     LLM for answer generation (default: gpt-4o)
--judge-model MODEL        LLM for judging (default: gpt-4o)
--provider PROVIDER        LLM provider: openai, anthropic, azure (default: openai)
--top-k N                  Retrieved memories count (default: 200)
--top-k-cutoffs LIST       Evaluate at multiple cutoffs (default: 10,20,50,200)
--predict-only             Stop after search, skip answer+judge
--evaluate-only            Skip ingest+search, evaluate existing results
--resume                   Resume from checkpoint
--backend oss|cloud        Mem0 backend (default: oss)
--mem0-host URL            Mem0 server URL (default: http://localhost:8888)
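For reference, the common flags above map onto a standard argparse setup roughly like this (an illustrative sketch; the real scripts define their own parsers):

```python
# Sketch of a parser for the common benchmark flags listed above.
import argparse

parser = argparse.ArgumentParser(description="Memory benchmark runner")
parser.add_argument("--project-name", required=True, help="Run identifier")
parser.add_argument("--answerer-model", default="gpt-4o")
parser.add_argument("--judge-model", default="gpt-4o")
parser.add_argument("--provider", default="openai",
                    choices=["openai", "anthropic", "azure"])
parser.add_argument("--top-k", type=int, default=200)
parser.add_argument("--top-k-cutoffs", default="10,20,50,200",
                    help="Comma-separated cutoffs to evaluate at")
parser.add_argument("--predict-only", action="store_true")
parser.add_argument("--evaluate-only", action="store_true")
parser.add_argument("--resume", action="store_true")
parser.add_argument("--backend", default="oss", choices=["oss", "cloud"])
parser.add_argument("--mem0-host", default="http://localhost:8888")

# Parse a sample command line and expand the cutoff list.
args = parser.parse_args(["--project-name", "my-first-test", "--top-k", "50"])
cutoffs = [int(c) for c in args.top_k_cutoffs.split(",")]
print(args.project_name, args.top_k, cutoffs)
```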

Custom Models

By default, the Mem0 server uses OpenAI for fact extraction (gpt-4o-mini) and embeddings (text-embedding-3-small). You can change this by mounting a custom config file.

Step 1: Copy an example config:

cp configs/azure-openai.yaml mem0-config.yaml
# or: cp configs/ollama.yaml mem0-config.yaml

Step 2: Edit mem0-config.yaml with your model details.

Step 3: Uncomment the volume mount in docker-compose.yml:

volumes:
  - mem0_history:/app/history
  - ./mem0-config.yaml:/app/config.yaml:ro   # <-- uncomment this line

Step 4: Restart:

docker compose down && docker compose up -d

See configs/ for examples:

  • configs/openai.yaml — OpenAI (default)
  • configs/azure-openai.yaml — Azure OpenAI
  • configs/ollama.yaml — Fully local with Ollama (no API keys)

Results

Mem0 Platform

Results using the Mem0 managed platform with the v3 memory pipeline.

LongMemEval

| Metric  | Top 200         | Top 50          |
| ------- | --------------- | --------------- |
| Overall | 94.4% (472/500) | 94.8% (474/500) |

LongMemEval breakdown by question type:

| Question Type             | Top 200         | Top 50          |
| ------------------------- | --------------- | --------------- |
| knowledge-update          | 93.6% (73/78)   | 93.6% (73/78)   |
| multi-session             | 88.0% (117/133) | 93.2% (124/133) |
| single-session-assistant  | 98.2% (55/56)   | 98.2% (55/56)   |
| single-session-preference | 96.7% (29/30)   | 93.3% (28/30)   |
| single-session-user       | 98.6% (69/70)   | 98.6% (69/70)   |
| temporal-reasoning        | 97.0% (129/133) | 94.0% (125/133) |

LoCoMo

| Metric  | Top 200           | Top 50            |
| ------- | ----------------- | ----------------- |
| Overall | 92.5% (1425/1540) | 91.8% (1414/1540) |

LoCoMo breakdown by question type (avg across top_10/20/50/200):

| Question Type | Avg (Top 10–200) |
| ------------- | ---------------- |
| single-hop    | 91.2%            |
| multi-hop     | 91.3%            |
| open-domain   | 72.7%            |
| temporal      | 92.0%            |

BEAM

| Dataset                  | Pass Rate (Top 200) | Avg Score (Top 200) | Pass Rate (Top 50) | Avg Score (Top 50) |
| ------------------------ | ------------------- | ------------------- | ------------------ | ------------------ |
| BEAM 1M (700 questions)  | 70.1% (491/700)     | 0.641               | 67.1% (470/700)    | 0.604              |
| BEAM 10M (200 questions) | 50.5% (101/200)     | 0.486               | 45.5% (91/200)     | 0.413              |
BEAM breakdown by memory ability type

BEAM 1M (Top 200)

| Ability                  | Avg Score | Pass Rate |
| ------------------------ | --------- | --------- |
| preference_following     | 0.883     | 68/70     |
| instruction_following    | 0.852     | 62/70     |
| information_extraction   | 0.700     | 53/70     |
| multi_session_reasoning  | 0.652     | 52/70     |
| knowledge_update         | 0.650     | 46/70     |
| summarization            | 0.635     | 48/70     |
| temporal_reasoning       | 0.618     | 47/70     |
| event_ordering           | 0.536     | 42/70     |
| abstention               | 0.525     | 39/70     |
| contradiction_resolution | 0.357     | 34/70     |

BEAM 10M (Top 200)

| Ability                  | Avg Score | Pass Rate |
| ------------------------ | --------- | --------- |
| preference_following     | 0.904     | 19/20     |
| instruction_following    | 0.825     | 18/20     |
| knowledge_update         | 0.750     | 16/20     |
| information_extraction   | 0.562     | 11/20     |
| summarization            | 0.469     | 11/20     |
| abstention               | 0.400     | 8/20      |
| contradiction_resolution | 0.325     | 5/20      |
| multi_session_reasoning  | 0.261     | 6/20      |
| event_ordering           | 0.202     | 3/20      |
| temporal_reasoning       | 0.163     | 4/20      |

OSS with Different Extraction Models

LongMemEval results using the self-hosted Mem0 OSS pipeline with different LLMs for memory extraction. All runs use the same embedder (Qwen 600M via SageMaker), the same Qdrant vector store, and GPT-5 as the answerer and judge.

| Extraction Model | Overall | SS-User | SS-Asst | SS-Pref | Knowledge Update | Temporal Reasoning | Multi-Session |
| ---------------- | ------- | ------- | ------- | ------- | ---------------- | ------------------ | ------------- |
| GPT-5            | 91.0%   | 95.7%   | 92.9%   | 93.3%   | 91.0%            | 94.7%              | 83.5%         |
| GPT-OSS-120B     | 89.8%   | 95.7%   | 96.4%   | 93.3%   | 89.5%            | 80.5%              | 79.7%         |
| Llama 4 Maverick | 88.6%   | 97.1%   | 75.0%   | 93.3%   | 93.6%            | 90.2%              | 84.2%         |
| Gemma 4 31B      | 88.6%   | 95.7%   | 83.9%   | 93.3%   | 94.9%            | 91.7%              | 78.9%         |

Full per-question evaluation results are available in results/platform/ and results/oss/.

A Note on Benchmark Scores

Benchmark scores are not absolute numbers. They depend heavily on:

  • Embedding model quality — A larger, more capable embedding model will produce better retrieval, directly improving scores. The default text-embedding-3-small (1536 dims) is cost-efficient but not state-of-the-art.
  • LLM capability — Both the fact extraction model (used during ingestion) and the judge model (used during evaluation) affect results. A stronger extraction model captures more nuanced facts; a stronger judge is more accurate in its verdicts.
  • Retrieval depth — Higher top-k values give the system more chances to find relevant memories, but may also introduce noise.

When comparing configurations, keep all other variables constant and change only what you're testing. The default OpenAI setup provides a reproducible baseline — your scores will likely improve with stronger models.
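To make the retrieval-depth point concrete, here is a minimal sketch of computing recall at several cutoffs over one ranked result list, analogous to what --top-k-cutoffs does across a whole run. The memory IDs are hypothetical, and note the real pipeline judges generated answers rather than scoring raw retrieval:

```python
# recall@k over a ranked list of retrieved memory IDs (toy data).
def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant memories found in the top k results."""
    hits = sum(1 for mid in ranked_ids[:k] if mid in relevant)
    return hits / len(relevant) if relevant else 0.0

ranked = ["m7", "m2", "m9", "m4", "m1", "m5"]  # best-first retrieval order
relevant = {"m2", "m1"}                        # ground-truth relevant memories

for k in [1, 2, 4, 6]:  # analogous to --top-k-cutoffs 10,20,50,200
    print(f"recall@{k} = {recall_at_k(ranked, relevant, k):.2f}")
```

A deeper cutoff can only raise recall (here it climbs from 0.0 at k=1 to 1.0 at k=6), but every extra retrieved memory is also extra context the answerer must sift through, which is why scores do not always improve monotonically with top-k.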

Project Structure

memory-benchmarks/
├── benchmarks/              Python evaluation scripts
│   ├── common/              Shared: Mem0 client, LLM client, metrics, utils
│   ├── locomo/              LOCOMO benchmark
│   ├── longmemeval/         LongMemEval benchmark
│   └── beam/                BEAM benchmark
├── configs/                 Example Mem0 server configs
├── docker/mem0/             Mem0 server (Dockerfile + FastAPI app)
├── docker-compose.yml       One-command setup: Mem0 + Qdrant
├── src/                     Next.js frontend
│   ├── app/                 Pages + API routes
│   ├── components/          UI components
│   └── lib/                 Database, adapters, executor
├── results/                 Benchmark output (gitignored)
└── datasets/                Auto-downloaded datasets (gitignored)

License

Apache 2.0