Multi-Person Group Chat Evaluation Framework
February 12, 2026
A comprehensive evaluation framework for multi-person group chat datasets, supporting Memory Systems (Memos, Mem0, Memobase, EverMemOS, Zep) and LLM Long-Context Evaluation.
Paper: EverMemBench: A Comprehensive Benchmark for Long-Term Memory in Conversational AI
Dataset: EverMind-AI/EverMemBench-Dynamic
Features
- Multi-person group chat support: Handles datasets with multiple speakers across multiple groups and days
- 5 Memory Systems: Memos, Mem0, Memobase, EverMemOS, Zep
- LLM Long-Context Evaluation: Direct LLM evaluation using full dialogue as context
- Full Evaluation Pipeline: Add → Search → Answer → Evaluate
- Two Question Types: Multiple choice (direct comparison) and open-ended (LLM judge)
- Unified message format: All messages include group/speaker attribution
- LLM Integration: Uses OpenRouter for answer generation and evaluation
- Batch processing: Efficient API calls with configurable batch sizes and rate limiting
- Smoke test mode: Quick validation with limited data
Pipeline Stages
┌─────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│   Add   │ -> │  Search  │ -> │  Answer  │ -> │ Evaluate  │
└─────────┘    └──────────┘    └──────────┘    └───────────┘
     │              │               │                │
     v              v               v                v
   Ingest       Retrieve      LLM Generate        Assess
  memories      memories         answers         accuracy
| Stage | Description | Output |
|---|---|---|
| Add | Ingest conversation data into memory system | - |
| Search | Retrieve relevant memories for QA questions | search_results_{user_id}.json |
| Answer | Generate answers using LLM with retrieved context | answer_results_{user_id}.json |
| Evaluate | Assess answer quality (MC: direct, OE: LLM judge) | evaluation_results_{user_id}.json |
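The "MC: direct" path in the Evaluate stage can be pictured as a plain string comparison. The following is a minimal sketch, assuming answers are option labels such as "A" or "B"; the function names and the normalization rule are illustrative and are not taken from evaluator.py.

```python
def score_multiple_choice(predicted: str, gold: str) -> bool:
    """Direct comparison, ignoring case and surrounding whitespace (assumed rule)."""
    return predicted.strip().upper() == gold.strip().upper()


def accuracy(results: list[bool]) -> float:
    """Fraction of correct answers; 0.0 when there are no results."""
    return sum(results) / len(results) if results else 0.0


if __name__ == "__main__":
    # Hypothetical (predicted, gold) pairs for four multiple-choice questions.
    pairs = [("a", "A"), ("B", "C"), (" d ", "D"), ("C", "C")]
    scores = [score_multiple_choice(p, g) for p, g in pairs]
    print(f"accuracy = {accuracy(scores):.2f}")
```

Open-ended questions cannot be scored this way, which is why they go through an LLM judge instead.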
Supported Systems
Memory Systems
| System | Timestamp Support | Message Format | Environment Variables |
|---|---|---|---|
| Memos | Native chat_time | [Group: X][Speaker: Y]content | MEMOS_API_KEY, MEMOS_BASE_URL |
| Mem0 | Native timestamp (Unix, per-batch) | run_id="${user_id}_${groupId}", name=<Speaker> | MEM0_API_KEY |
| Memobase | Native created_at | [Group: X][Speaker: Y]content, alias=<Speaker> | MEMOBASE_BASE_URL, MEMOBASE_API_TOKEN |
| EverMemOS | Native create_time | sender=<Speaker>, group_id=${user_id}_${groupId} | EVERMEMOS_BASE_URL, EVERMEMOS_API_KEY |
| Zep | Native created_at | [Group: X][Speaker: Y]content | ZEP_API_KEY |
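The two attribution styles in the table above can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the actual adapters may build these strings differently.

```python
def format_prefixed(group: str, speaker: str, content: str) -> str:
    """Inline attribution, as listed for Memos, Memobase, and Zep."""
    return f"[Group: {group}][Speaker: {speaker}]{content}"


def scoped_id(user_id: str, group_id: str) -> str:
    """Composite ID, as listed for Mem0 (run_id) and EverMemOS (group_id)."""
    return f"{user_id}_{group_id}"


if __name__ == "__main__":
    print(format_prefixed("1", "Alice", "Hello everyone"))
    print(scoped_id("004", "1"))
```

Prefixing keeps group and speaker visible to systems that store only free text, while the composite ID relies on the system's own scoping fields.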
LLM System
| System | Context | Use Case | Environment Variables |
|---|---|---|---|
| LLM | Full dialogue (no retrieval) | Test LLM long-context comprehension | LLM_BASE_URL, LLM_API_KEY |
Key Differences: Memory Systems vs LLM System
| Aspect | Memory Systems | LLM System |
|---|---|---|
| Context | Retrieved memories (top-k) | Full dialogue |
| Add Stage | Ingest into memory system | No-op (stores dialogue) |
| Search Stage | Query memory system | Returns full dialogue |
| Answer Stage | Answer with retrieved context | Answer with full dialogue |
| Use Case | Test memory retrieval | Test LLM long-context |
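The contrast in the table can be sketched with two toy adapters that share one interface. Everything here is hypothetical: the class names, method signatures, and the word-overlap ranking are stand-ins, not the contents of base.py or llm_adapter.py.

```python
from abc import ABC, abstractmethod


class BaseAdapter(ABC):
    """Hypothetical minimal adapter interface."""

    @abstractmethod
    def add(self, messages: list[str]) -> None: ...

    @abstractmethod
    def search(self, query: str, top_k: int) -> list[str]: ...


class ToyMemoryAdapter(BaseAdapter):
    """Stand-in for a memory system: returns the top-k messages by word overlap."""

    def __init__(self) -> None:
        self._memories: list[str] = []

    def add(self, messages: list[str]) -> None:
        self._memories.extend(messages)

    def search(self, query: str, top_k: int) -> list[str]:
        query_words = set(query.lower().split())
        # Rank memories by how many query words each one contains.
        ranked = sorted(
            self._memories,
            key=lambda m: len(query_words & set(m.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


class FullDialogueAdapter(BaseAdapter):
    """Stand-in for the LLM system: add only stores, search returns everything."""

    def __init__(self) -> None:
        self._dialogue: list[str] = []

    def add(self, messages: list[str]) -> None:
        self._dialogue = list(messages)

    def search(self, query: str, top_k: int = 0) -> list[str]:
        return list(self._dialogue)  # top_k is ignored: no retrieval happens
```

With the same three messages, `ToyMemoryAdapter().search(query, top_k=2)` yields two messages while `FullDialogueAdapter().search(query)` yields all three, which is the difference the two evaluation modes are meant to measure.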
Directory Structure
eval/
├── cli.py                        # CLI entry point
├── config/
│   ├── pipeline.yaml             # Pipeline settings (answer/evaluate/search/retry/debug)
│   ├── prompts.yaml              # LLM prompts for answer/evaluate
│   ├── memos.yaml                # Memos configuration (connection + add + search)
│   ├── mem0.yaml                 # Mem0 configuration (connection + add + search)
│   ├── memobase.yaml             # Memobase configuration (connection + add + search)
│   ├── evermemos.yaml            # EverMemOS configuration (connection + add + search)
│   └── zep.yaml                  # Zep configuration (connection + add + search)
├── src/
│   ├── core/
│   │   ├── data_models.py        # Data classes (QAItem, SearchResult, etc.)
│   │   ├── loaders.py            # Dataset loading utilities
│   │   ├── qa_loader.py          # QA data loader
│   │   ├── pipeline.py           # Evaluation pipeline orchestrator
│   │   ├── answerer.py           # Answer generation with LLM
│   │   └── evaluator.py          # Evaluation with LLM judge
│   ├── adapters/
│   │   ├── base.py               # Base adapter abstract class
│   │   ├── memos_adapter.py      # Memos implementation
│   │   ├── mem0_adapter.py       # Mem0 implementation
│   │   ├── memobase_adapter.py   # Memobase implementation
│   │   ├── evermemos_adapter.py  # EverMemOS implementation
│   │   ├── zep_adapter.py        # Zep Graph API implementation
│   │   └── llm_adapter.py        # LLM system adapter (full dialogue as context)
│   └── utils/
│       ├── config.py             # YAML config loader with env var support
│       └── logger.py             # Rich console logging
└── results/{system}/             # Output: eval/results/{system}/*.json
                                  # LLM: eval/results/llm/{model}/*.json
tools/
└── analyze_results.py            # Analyze evaluation results by category
Installation
Requires Python >= 3.11.
pip install -r requirements.txt
Configuration
Environment Variables
Copy the template and fill in your API keys:
cp env.template .env
The LLM variables (OpenRouter) are required for answer generation and evaluation across all systems. Memory system variables only need to be configured for the systems you intend to use. See env.template for details.
Pipeline Configuration
Pipeline settings are in eval/config/pipeline.yaml.
# eval/config/pipeline.yaml
# Answer generation (answerer.py)
answer:
  model: "openai/gpt-4.1-mini"
  provider:
    order: ["openai"]
    allow_fallbacks: false
  temperature: 0
  max_tokens: 1000
  timeout: 300
  concurrency: 1
# LLM judge evaluation (evaluator.py)
evaluate:
  model: "google/gemini-3-flash-preview"
  provider:
    order: ["google-ai-studio"]
    allow_fallbacks: false
  concurrency: 20
# Search stage (pipeline.py)
search:
  concurrency: 3
  timeout: 120
# Retry (shared)
retry:
  max_retries: 20
  retry_delay: 1.0
  max_delay: 300
# Debug
debug:
  show_usage: true
# Cache warmup (LLM system only)
warmup:
  enabled: true
  delay_seconds: 15
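The configuration does not say how retry_delay and max_delay interact. One common reading is exponential backoff capped at max_delay; the sketch below assumes that reading, and the doubling factor in particular is a guess rather than documented behavior.

```python
def backoff_delays(
    max_retries: int = 20,
    retry_delay: float = 1.0,
    max_delay: float = 300.0,
) -> list[float]:
    """Delay in seconds before each retry, doubling until capped (assumed policy)."""
    return [min(retry_delay * 2**attempt, max_delay) for attempt in range(max_retries)]


if __name__ == "__main__":
    delays = backoff_delays()
    print(delays[:10])
    print(f"worst-case total wait: {sum(delays):.0f} s")
```

Under this assumption the cap is reached on the tenth retry, so most of a long outage would be spent waiting the full max_delay between attempts.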
System Search Configuration
Each memory system has its own config file (eval/config/{system}.yaml) with a search: section for system-specific search parameters. CLI --top-k overrides the config top_k when provided.
# eval/config/memos.yaml
search:
  top_k: 10                          # Number of memories to retrieve
  preference_limit_number: 6         # Number of preference memories
# eval/config/mem0.yaml
search:
  top_k: 10
  group_ids: ["1", "2", "3"]         # Group IDs to search across
# eval/config/memobase.yaml
search:
  max_token_size: 3000               # Max token size for search results
  event_similarity_threshold: 0.2    # Similarity threshold for event matching
# eval/config/evermemos.yaml
search:
  top_k: 10
  retrieve_method: "hybrid"          # Retrieval method: hybrid/semantic/keyword
# eval/config/zep.yaml
search:
  top_k: 10
  reranker_edges: "cross_encoder"    # Edge reranking strategy
  reranker_nodes: "rrf"              # Node reranking strategy
  max_query_length: 400              # Max query length for search
Prompt Templates
# eval/config/prompts.yaml
llm_answer:
  multiple_choice: |
    ...
  open_ended: |
    ...
llm_judge:
  system_prompt: |
    ...
  user_prompt: |
    ...
Usage
Memory Systems Evaluation
Memory systems follow a two-phase workflow: Add (ingest data), then Search → Answer → Evaluate (run evaluation).
Memos
# Add
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--system memos \
--user-id 004 \
--stages add
# Search -> Answer -> Evaluate
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--qa dataset/004/qa_004.json \
--system memos \
--user-id 004 \
--stages search answer evaluate \
--top-k 10
Mem0
# Add
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--system mem0 \
--user-id 004 \
--stages add
# Search -> Answer -> Evaluate
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--qa dataset/004/qa_004.json \
--system mem0 \
--user-id 004 \
--stages search answer evaluate \
--top-k 10
Memobase
# Add
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--system memobase \
--user-id 004 \
--stages add
# Search -> Answer -> Evaluate
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--qa dataset/004/qa_004.json \
--system memobase \
--user-id 004 \
--stages search answer evaluate
EverMemOS
EverMemOS requires separate data isolation per batch (user ID):
- Cloud service: Create a new memspace for each batch via the EverMemOS dashboard, then use the corresponding --base-url.
- Local deployment: Start a separate service instance per batch, each on its own port (e.g., port 19004 for user 004, port 19005 for user 005). An API key is not required for local deployment.
# Add (local deployment, port per batch)
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--system evermemos \
--user-id 004 \
--stages add \
--base-url http://0.0.0.0:19004
# Search -> Answer -> Evaluate
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--qa dataset/004/qa_004.json \
--system evermemos \
--user-id 004 \
--stages search answer evaluate \
--top-k 10 \
--base-url http://0.0.0.0:19004
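When running several local instances, the port-per-batch convention from the example (19004 for user 004, 19005 for user 005) can be derived from the user ID. The helper below is a hypothetical convenience: the 19000 offset is inferred from those two examples only, and any port works as long as it matches the running instance.

```python
def local_base_url(user_id: str, port_base: int = 19000, host: str = "0.0.0.0") -> str:
    """Build the --base-url value for a local instance (assumed port convention)."""
    return f"http://{host}:{port_base + int(user_id)}"


if __name__ == "__main__":
    for user_id in ["004", "005", "010", "011", "016"]:
        print(user_id, local_base_url(user_id))
```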
Zep
# Add
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--system zep \
--user-id 004 \
--stages add
# Search -> Answer -> Evaluate
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--qa dataset/004/qa_004.json \
--system zep \
--user-id 004 \
--stages search answer evaluate \
--top-k 10
LLM Long-Context Evaluation
The LLM system uses the full dialogue as context (no memory retrieval). Add/search stages are auto-injected.
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--qa dataset/004/qa_004.json \
--system llm \
--user-id 004 \
--stages answer evaluate
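Auto-injection for the LLM system could work roughly as sketched here. The function is hypothetical, and the rule that add and search are injected only when answer is requested is an assumption; the actual logic in cli.py may differ.

```python
STAGE_ORDER = ["add", "search", "answer", "evaluate"]


def resolve_stages(system: str, requested: list[str]) -> list[str]:
    """Return the stages to run, in pipeline order."""
    stages = set(requested)
    if system == "llm" and "answer" in stages:
        # The LLM system needs the stored dialogue before it can answer.
        stages |= {"add", "search"}
    return [stage for stage in STAGE_ORDER if stage in stages]


if __name__ == "__main__":
    print(resolve_stages("llm", ["answer", "evaluate"]))
    print(resolve_stages("mem0", ["evaluate"]))
```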
Evaluate Only (re-evaluate existing answer results)
python -m eval.cli \
--qa dataset/004/qa_004.json \
--system mem0 \
--user-id 004 \
--stages evaluate
Smoke Test
# Smoke test add stage
python -m eval.cli --dataset dataset/004/dialogue.json --system memos --smoke
# Smoke test with specific date
python -m eval.cli --dataset dataset/004/dialogue.json --system memos --smoke --smoke-date 2025-01-16
# LLM smoke test with limited questions
python -m eval.cli \
--dataset dataset/004/dialogue.json \
--qa dataset/004/qa_004.json \
--system llm \
--user-id 004 \
--stages answer evaluate \
--qa-limit 3
CLI Options
| Option | Description | Default |
|---|---|---|
| --dataset | Path to dataset JSON file (required for add stage) | - |
| --system | System (memos/mem0/memobase/evermemos/zep/llm) | Required |
| --stages | Stages to run: add, search, answer, evaluate | ["add"] |
| --qa | Path to QA JSON file (required for search/answer/evaluate) | - |
| --user-id | User ID for memory system | Auto-generated |
| --top-k | Number of memories to retrieve | From system config |
| --output-dir | Results base directory (output goes to {output-dir}/{system}/) | eval/results |
| --base-url | Override base URL for memory system | - |
| --start-date | Resume add from this date (YYYY-MM-DD) | - |
| --smoke | Enable smoke test mode | False |
| --smoke-days | Days to process in smoke test | 1 |
| --smoke-date | Specific date for smoke test (YYYY-MM-DD) | - |
| --qa-limit | Limit number of QA questions | - |
Output Structure
Results are organized by system under eval/results/:
eval/results/
├── memos/
│   ├── search_results_004.json
│   ├── answer_results_004.json
│   └── evaluation_results_004.json
├── mem0/
│   └── ...
├── memobase/
│   └── ...
├── evermemos/
│   └── ...
├── zep/
│   └── ...
└── llm/
    └── openai/
        └── gpt-4.1-mini/         # LLM results include model name in path
            ├── answer_results_004.json
            └── evaluation_results_004.json
Analysis Tools
tools/analyze_results.py analyzes evaluation results by question_id categories (major/minor/hierarchical). Supports single-file analysis and multi-batch aggregation.
# Single file analysis
python tools/analyze_results.py eval/results/evermemos/evaluation_results_004.json
# Aggregate all batches for a system
python tools/analyze_results.py --system mem0
# Specify results directory directly
python tools/analyze_results.py --results-dir eval/results/memos/
# Save JSON report
python tools/analyze_results.py --system evermemos -o report.json
# Quiet mode (JSON output only)
python tools/analyze_results.py --system zep -o report.json -q
Dataset Batches
Supported user IDs: 004, 005, 010, 011, 016
Each batch has:
dataset/{batch_id}/dialogue.json- Conversation datadataset/{batch_id}/qa_{batch_id}.json- QA questions for evaluation