Multi-Person Group Chat Evaluation Framework

February 12, 2026

A comprehensive evaluation framework for multi-person group chat datasets, supporting Memory Systems (Memos, Mem0, Memobase, EverMemOS, Zep) and LLM Long-Context Evaluation.

📄 Paper: EverMemBench: A Comprehensive Benchmark for Long-Term Memory in Conversational AI

🤗 Dataset: EverMind-AI/EverMemBench-Dynamic

Features

  • Multi-person group chat support: Handles datasets with multiple speakers across multiple groups and days
  • 5 Memory Systems: Memos, Mem0, Memobase, EverMemOS, Zep
  • LLM Long-Context Evaluation: Direct LLM evaluation using full dialogue as context
  • Full Evaluation Pipeline: Add → Search → Answer → Evaluate
  • Two Question Types: Multiple choice (direct comparison) and open-ended (LLM judge)
  • Unified message format: All messages include group/speaker attribution
  • LLM Integration: Uses OpenRouter for answer generation and evaluation
  • Batch processing: Efficient API calls with configurable batch sizes and rate limiting
  • Smoke test mode: Quick validation with limited data

Pipeline Stages

┌─────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│   Add   │ -> │  Search  │ -> │  Answer  │ -> │ Evaluate  │
└─────────┘    └──────────┘    └──────────┘    └───────────┘
     │              │               │               │
     v              v               v               v
  Ingest        Retrieve        Generate         Assess
 memories       memories     answers (LLM)      accuracy

| Stage | Description | Output |
| --- | --- | --- |
| Add | Ingest conversation data into memory system | - |
| Search | Retrieve relevant memories for QA questions | search_results_{user_id}.json |
| Answer | Generate answers using LLM with retrieved context | answer_results_{user_id}.json |
| Evaluate | Assess answer quality (MC: direct, OE: LLM judge) | evaluation_results_{user_id}.json |
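
For multiple-choice questions, the Evaluate stage compares the answer with the gold option directly instead of calling the LLM judge. The sketch below shows one way such a comparison could work, assuming options are labelled with single letters. The function name and the normalization rule are illustrative and are not taken from evaluator.py.

```python
import re


def is_correct_choice(predicted: str, gold: str) -> bool:
    """Compare a multiple-choice answer with the gold option letter.

    Hypothetical helper: assumes single-letter option labels and that the
    model may wrap its choice in punctuation, e.g. "(b)" or "B.".
    """
    def first_letter(text: str) -> str:
        # Naive rule: take the first standalone letter, ignoring case.
        match = re.search(r"\b([A-Za-z])\b", text)
        return match.group(1).upper() if match else ""

    choice = first_letter(predicted)
    return choice != "" and choice == first_letter(gold)
```

Open-ended questions have no fixed label to compare against, which is why they go to the LLM judge instead.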

Supported Systems

Memory Systems

| System | Timestamp Support | Message Format | Environment Variables |
| --- | --- | --- | --- |
| Memos | Native chat_time | [Group: X][Speaker: Y]content | MEMOS_API_KEY, MEMOS_BASE_URL |
| Mem0 | Native timestamp (Unix, per-batch) | run_id="${user_id}_${groupId}", name=<Speaker> | MEM0_API_KEY |
| Memobase | Native created_at | [Group: X][Speaker: Y]content, alias=<Speaker> | MEMOBASE_BASE_URL, MEMOBASE_API_TOKEN |
| EverMemOS | Native create_time | sender=<Speaker>, group_id=${user_id}_${groupId} | EVERMEMOS_BASE_URL, EVERMEMOS_API_KEY |
| Zep | Native created_at | [Group: X][Speaker: Y]content | ZEP_API_KEY |
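
The message formats in the table can be illustrated with two small helpers. This is a sketch only: the function names are hypothetical, and the adapters may build these strings differently.

```python
def format_tagged_message(group: str, speaker: str, content: str) -> str:
    """Prefix content with group/speaker tags (Memos, Memobase, and Zep rows)."""
    return f"[Group: {group}][Speaker: {speaker}]{content}"


def scoped_id(user_id: str, group_id: str) -> str:
    """Build the "${user_id}_${groupId}" value (Mem0 run_id, EverMemOS group_id)."""
    return f"{user_id}_{group_id}"


print(format_tagged_message("1", "Alice", "Lunch at noon?"))
# [Group: 1][Speaker: Alice]Lunch at noon?
print(scoped_id("004", "2"))
# 004_2
```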

LLM System

| System | Context | Use Case | Environment Variables |
| --- | --- | --- | --- |
| LLM | Full dialogue (no retrieval) | Test LLM long-context comprehension | LLM_BASE_URL, LLM_API_KEY |

Key Differences: Memory Systems vs LLM System

| Aspect | Memory Systems | LLM System |
| --- | --- | --- |
| Context | Retrieved memories (top-k) | Full dialogue |
| Add Stage | Ingest into memory system | No-op (stores dialogue) |
| Search Stage | Query memory system | Returns full dialogue |
| Answer Stage | Answer with retrieved context | Answer with full dialogue |
| Use Case | Test memory retrieval | Test LLM long-context |
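
The difference can be pictured as two implementations of one adapter interface. The sketch below is an assumption about the shape of that interface: the class and method names are hypothetical, and the real definitions live in eval/src/adapters/.

```python
from abc import ABC, abstractmethod


class BaseAdapter(ABC):
    """Hypothetical interface shared by all systems."""

    @abstractmethod
    def add(self, messages: list[str]) -> None: ...

    @abstractmethod
    def search(self, query: str, top_k: int) -> list[str]: ...


class FullDialogueAdapter(BaseAdapter):
    """LLM-system behaviour: add stores the dialogue, search returns all of it."""

    def __init__(self) -> None:
        self._dialogue: list[str] = []

    def add(self, messages: list[str]) -> None:
        # Nothing is sent to an external memory system; the dialogue is kept as-is.
        self._dialogue.extend(messages)

    def search(self, query: str, top_k: int) -> list[str]:
        # query and top_k are ignored: the full dialogue is the context.
        return list(self._dialogue)
```

A memory-system adapter would instead forward add and search to the remote service and return only the top-k hits.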

Directory Structure

eval/
├── cli.py                    # CLI entry point
├── config/
│   ├── pipeline.yaml        # Pipeline settings (answer/evaluate/search/retry/debug)
│   ├── prompts.yaml         # LLM prompts for answer/evaluate
│   ├── memos.yaml           # Memos configuration (connection + add + search)
│   ├── mem0.yaml            # Mem0 configuration (connection + add + search)
│   ├── memobase.yaml        # Memobase configuration (connection + add + search)
│   ├── evermemos.yaml       # EverMemOS configuration (connection + add + search)
│   └── zep.yaml             # Zep configuration (connection + add + search)
├── src/
│   ├── core/
│   │   ├── data_models.py   # Data classes (QAItem, SearchResult, etc.)
│   │   ├── loaders.py       # Dataset loading utilities
│   │   ├── qa_loader.py     # QA data loader
│   │   ├── pipeline.py      # Evaluation pipeline orchestrator
│   │   ├── answerer.py      # Answer generation with LLM
│   │   └── evaluator.py     # Evaluation with LLM judge
│   ├── adapters/
│   │   ├── base.py          # Base adapter abstract class
│   │   ├── memos_adapter.py # Memos implementation
│   │   ├── mem0_adapter.py  # Mem0 implementation
│   │   ├── memobase_adapter.py   # Memobase implementation
│   │   ├── evermemos_adapter.py  # EverMemOS implementation
│   │   ├── zep_adapter.py   # Zep Graph API implementation
│   │   └── llm_adapter.py   # LLM system adapter (full dialogue as context)
│   └── utils/
│       ├── config.py        # YAML config loader with env var support
│       └── logger.py        # Rich console logging
└── results/{system}/        # Output: eval/results/{system}/*.json
                             #   LLM: eval/results/llm/{model}/*.json
tools/
└── analyze_results.py       # Analyze evaluation results by category

Installation

Requires Python >= 3.11.

pip install -r requirements.txt

Configuration

Environment Variables

Copy the template and fill in your API keys:

cp env.template .env

The LLM variables (OpenRouter) are required for answer generation and evaluation across all systems. Memory system variables only need to be configured for the systems you intend to use. See env.template for details.
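
Before a run it can help to check that the variables for the chosen system are set. The sketch below uses the variable names from the Supported Systems tables. The helper itself is hypothetical, and it assumes the values from .env have already been loaded into the process environment.

```python
import os

# Variable names are taken from the Supported Systems tables.
LLM_VARS = ["LLM_BASE_URL", "LLM_API_KEY"]
SYSTEM_VARS = {
    "memos": ["MEMOS_API_KEY", "MEMOS_BASE_URL"],
    "mem0": ["MEM0_API_KEY"],
    "memobase": ["MEMOBASE_BASE_URL", "MEMOBASE_API_TOKEN"],
    # For a local EverMemOS deployment the API key is not required.
    "evermemos": ["EVERMEMOS_BASE_URL", "EVERMEMOS_API_KEY"],
    "zep": ["ZEP_API_KEY"],
    "llm": [],
}


def missing_env_vars(system: str, env=None) -> list[str]:
    """Return the variables that are unset or empty for the given system."""
    env = os.environ if env is None else env
    required = LLM_VARS + SYSTEM_VARS[system]
    return [name for name in required if not env.get(name)]
```

Calling missing_env_vars("mem0") returns an empty list when everything needed for a Mem0 run is present.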

Pipeline Configuration

Pipeline settings are in eval/config/pipeline.yaml.

# eval/config/pipeline.yaml

# Answer generation (answerer.py)
answer:
  model: "openai/gpt-4.1-mini"
  provider:
    order: ["openai"]
    allow_fallbacks: false
  temperature: 0
  max_tokens: 1000
  timeout: 300
  concurrency: 1

# LLM judge evaluation (evaluator.py)
evaluate:
  model: "google/gemini-3-flash-preview"
  provider:
    order: ["google-ai-studio"]
    allow_fallbacks: false
  concurrency: 20

# Search stage (pipeline.py)
search:
  concurrency: 3
  timeout: 120

# Retry (shared)
retry:
  max_retries: 20
  retry_delay: 1.0
  max_delay: 300

# Debug
debug:
  show_usage: true

# Cache warmup (LLM system only)
warmup:
  enabled: true
  delay_seconds: 15
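
The retry section gives a base delay, a cap, and a retry count, but not how the delay grows between attempts. The sketch below assumes exponential backoff, which is a common choice but is not confirmed by the configuration file. The function name is hypothetical.

```python
def backoff_delay(attempt: int, retry_delay: float = 1.0, max_delay: float = 300.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    Assumption: the delay doubles each time and is capped at max_delay.
    """
    return min(retry_delay * (2 ** attempt), max_delay)


# Schedule for the defaults shown in pipeline.yaml (max_retries: 20).
schedule = [backoff_delay(n) for n in range(20)]
print(schedule[:10])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0]
```

Under this assumption the cap is reached on the tenth retry, and every later retry waits 300 seconds.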

System Search Configuration

Each memory system has its own config file (eval/config/{system}.yaml) with a search: section for system-specific search parameters. When the CLI flag --top-k is provided, it overrides the top_k value from the config file.

# eval/config/memos.yaml
search:
  top_k: 10                        # Number of memories to retrieve
  preference_limit_number: 6        # Number of preference memories

# eval/config/mem0.yaml
search:
  top_k: 10
  group_ids: ["1", "2", "3"]       # Group IDs to search across

# eval/config/memobase.yaml
search:
  max_token_size: 3000              # Max token size for search results
  event_similarity_threshold: 0.2   # Similarity threshold for event matching

# eval/config/evermemos.yaml
search:
  top_k: 10
  retrieve_method: "hybrid"         # Retrieval method: hybrid/semantic/keyword

# eval/config/zep.yaml
search:
  top_k: 10
  reranker_edges: "cross_encoder"   # Edge reranking strategy
  reranker_nodes: "rrf"             # Node reranking strategy
  max_query_length: 400             # Max query length for search

Prompt Templates

# eval/config/prompts.yaml
llm_answer:
  multiple_choice: |
    ...
  open_ended: |
    ...
llm_judge:
  system_prompt: |
    ...
  user_prompt: |
    ...

Usage

Memory Systems Evaluation

Memory systems follow a two-phase workflow: Add (ingest data), then Search → Answer → Evaluate (run evaluation).

Memos

# Add
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --system memos \
    --user-id 004 \
    --stages add

# Search -> Answer -> Evaluate
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --qa dataset/004/qa_004.json \
    --system memos \
    --user-id 004 \
    --stages search answer evaluate \
    --top-k 10

Mem0

# Add
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --system mem0 \
    --user-id 004 \
    --stages add

# Search -> Answer -> Evaluate
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --qa dataset/004/qa_004.json \
    --system mem0 \
    --user-id 004 \
    --stages search answer evaluate \
    --top-k 10

Memobase

# Add
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --system memobase \
    --user-id 004 \
    --stages add

# Search -> Answer -> Evaluate
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --qa dataset/004/qa_004.json \
    --system memobase \
    --user-id 004 \
    --stages search answer evaluate

EverMemOS

EverMemOS requires data to be isolated per batch (user ID):

  • Cloud service: Create a new memspace for each batch via the EverMemOS dashboard, then use the corresponding --base-url.
  • Local deployment: Start a separate service instance per batch, each on its own port (e.g., port 19004 for user 004, port 19005 for user 005). API key is not required for local deployment.

# Add (local deployment, port per batch)
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --system evermemos \
    --user-id 004 \
    --stages add \
    --base-url http://0.0.0.0:19004

# Search -> Answer -> Evaluate
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --qa dataset/004/qa_004.json \
    --system evermemos \
    --user-id 004 \
    --stages search answer evaluate \
    --top-k 10 \
    --base-url http://0.0.0.0:19004
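
When several local instances run side by side, the --base-url for each batch can be derived from the batch ID. The sketch below assumes ports follow the pattern 19000 + batch number, which matches the two examples in the text but is otherwise a convention of this sketch. Any unique port per instance would work.

```python
def local_base_url(user_id: str, port_offset: int = 19000, host: str = "0.0.0.0") -> str:
    """Map a batch ID such as "004" to http://0.0.0.0:19004.

    Assumption: port = port_offset + numeric batch ID.
    """
    return f"http://{host}:{port_offset + int(user_id)}"


print(local_base_url("004"))  # http://0.0.0.0:19004
print(local_base_url("016"))  # http://0.0.0.0:19016
```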

Zep

# Add
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --system zep \
    --user-id 004 \
    --stages add

# Search -> Answer -> Evaluate
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --qa dataset/004/qa_004.json \
    --system zep \
    --user-id 004 \
    --stages search answer evaluate \
    --top-k 10

LLM Long-Context Evaluation

The LLM system uses the full dialogue as context (no memory retrieval). The add and search stages are injected automatically, so only answer and evaluate need to be listed in --stages.

python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --qa dataset/004/qa_004.json \
    --system llm \
    --user-id 004 \
    --stages answer evaluate

Evaluate Only (re-evaluate existing answer results)

python -m eval.cli \
    --qa dataset/004/qa_004.json \
    --system mem0 \
    --user-id 004 \
    --stages evaluate

Smoke Test

# Smoke test add stage
python -m eval.cli --dataset dataset/004/dialogue.json --system memos --smoke

# Smoke test with specific date
python -m eval.cli --dataset dataset/004/dialogue.json --system memos --smoke --smoke-date 2025-01-16

# LLM smoke test with limited questions
python -m eval.cli \
    --dataset dataset/004/dialogue.json \
    --qa dataset/004/qa_004.json \
    --system llm \
    --user-id 004 \
    --stages answer evaluate \
    --qa-limit 3

CLI Options

| Option | Description | Default |
| --- | --- | --- |
| --dataset | Path to dataset JSON file (required for add stage) | - |
| --system | System (memos/mem0/memobase/evermemos/zep/llm) | Required |
| --stages | Stages to run: add, search, answer, evaluate | ["add"] |
| --qa | Path to QA JSON file (required for search/answer/evaluate) | - |
| --user-id | User ID for memory system | Auto-generated |
| --top-k | Number of memories to retrieve | From system config |
| --output-dir | Results base directory (output goes to {output-dir}/{system}/) | eval/results |
| --base-url | Override base URL for memory system | - |
| --start-date | Resume add from this date (YYYY-MM-DD) | - |
| --smoke | Enable smoke test mode | False |
| --smoke-days | Days to process in smoke test | 1 |
| --smoke-date | Specific date for smoke test (YYYY-MM-DD) | - |
| --qa-limit | Limit number of QA questions | - |

Output Structure

Results are organized by system under eval/results/:

eval/results/
├── memos/
│   ├── search_results_004.json
│   ├── answer_results_004.json
│   └── evaluation_results_004.json
├── mem0/
│   └── ...
├── memobase/
│   └── ...
├── evermemos/
│   └── ...
├── zep/
│   └── ...
└── llm/
    └── openai/
        └── gpt-4.1-mini/          # LLM results include model name in path
            ├── answer_results_004.json
            └── evaluation_results_004.json

Analysis Tools

tools/analyze_results.py analyzes evaluation results by question_id category (major/minor/hierarchical). It supports both single-file analysis and aggregation across multiple batches.

# Single file analysis
python tools/analyze_results.py eval/results/evermemos/evaluation_results_004.json

# Aggregate all batches for a system
python tools/analyze_results.py --system mem0

# Specify results directory directly
python tools/analyze_results.py --results-dir eval/results/memos/

# Save JSON report
python tools/analyze_results.py --system evermemos -o report.json

# Quiet mode (JSON output only)
python tools/analyze_results.py --system zep -o report.json -q

Dataset Batches

Supported user IDs: 004, 005, 010, 011, 016

Each batch has:

  • dataset/{batch_id}/dialogue.json - Conversation data
  • dataset/{batch_id}/qa_{batch_id}.json - QA questions for evaluation
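
A quick way to confirm that all batches are in place before a long run is to check for the two files per batch. The sketch below follows the naming pattern listed above. The helper names are hypothetical.

```python
from pathlib import Path

# Batch IDs listed under "Supported user IDs".
BATCH_IDS = ["004", "005", "010", "011", "016"]


def batch_files(batch_id: str, root: str = "dataset") -> tuple[Path, Path]:
    """Return (dialogue file, QA file) for one batch."""
    folder = Path(root) / batch_id
    return folder / "dialogue.json", folder / f"qa_{batch_id}.json"


def missing_batch_files(root: str = "dataset") -> list[Path]:
    """List expected files that are not present under `root`."""
    return [p for b in BATCH_IDS for p in batch_files(b, root) if not p.is_file()]
```

An empty list from missing_batch_files() means every supported batch has both its dialogue and QA file.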