NEXO Compare Scorecard
April 10, 2026
NEXO is the local cognitive runtime that makes the model around your model smarter.
Generated: 2026-04-10T14:06:55.452342+00:00
What this scorecard is
- A public proof surface for the claims NEXO makes most often.
- A mix of benchmark data, internal ablations, runtime telemetry, and parity guardrails.
- A map of inspectable artifacts, not a substitute for reading the underlying files.
Claims you can inspect today
- NEXO publishes a measured long-conversation memory result on LoCoMo instead of relying only on architecture claims.
  - Evidence: benchmarks/locomo/results/locomo_nexo_summary.json, features/benchmark/index.html
  - Scope: Benchmark result is memory-specific; it is not a universal score for every runtime capability.
- NEXO shows external baselines and internal ablations side by side so the score is easier to interpret.
  - Evidence: benchmarks/runtime_ablations/results/ablation_summary.json, compare/scorecard.json
  - Scope: External baselines come from the public LoCoMo discussion; internal ablations come from checked-in NEXO artifacts.
- Client parity across Claude Code and Codex is audited by code and docs, not left as a vague promise.
  - Evidence: scripts/verify_client_parity.py, docs/client-parity-checklist.md
  - Scope: Parity claims refer to the audited runtime surfaces listed in the checklist and script output.
- NEXO publishes a reproducible operator benchmark pack against realistic local baselines, not just a memory-only paper benchmark.
  - Evidence: benchmarks/runtime_pack/results/latest_summary.json, benchmarks/runtime_pack/README.md
  - Scope: The runtime pack is a small manual-rubric benchmark for operator workflows; it complements LoCoMo instead of replacing it.
- NEXO publishes longitudinal local runtime telemetry separately from benchmark scores.
  - Evidence: compare/scorecard.json, compare/README.md
  - Scope: Runtime windows are local operational telemetry; they are not folded into LoCoMo F1.
What this scorecard does not claim
- It is not a universal winner-takes-all benchmark for every agent workload.
- LoCoMo measures long-conversation memory, not the full product surface.
- Longitudinal runtime windows come from local operator telemetry and should be read as operational evidence, not as a public SaaS benchmark.
Measured benchmark
- LoCoMo overall F1: 0.5875
- LoCoMo overall recall: 0.7487
- Open-domain F1: 0.6366
- Multi-hop F1: 0.3329
- Temporal F1: 0.3258
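The scorecard does not spell out how F1 and recall are computed. A minimal sketch, assuming the token-overlap F1 commonly used for question-answering benchmarks, shows why recall (0.7487) can sit well above F1 (0.5875): an answer that contains the reference but adds extra words keeps full recall while losing precision. The function name and the example strings are illustrative, not taken from the NEXO repository.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from bag-of-token overlap.

    Assumption: lowercase whitespace tokenization; the real LoCoMo
    scorer may normalize text differently.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0, 0.0, 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A verbose but correct answer: recall is perfect, precision is not.
precision, recall, f1 = token_f1("she moved to Lisbon in May 2023", "May 2023")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because benchmark-level numbers are typically averaged per question, the overall precision cannot be recovered from the published F1 and recall alone.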
Ablation / baseline suite
- Combined external + NEXO ablation baselines (2026-04-06)
- Raw model baseline (GPT-4 128K full context): F1 0.379
- Gemini Pro 1.0 baseline: F1 0.313
- Retrieval baseline (GPT-3.5 + Contriever RAG): F1 0.283
- NEXO memory-only mode (LoCoMo RAG): F1 0.5875
- NEXO cognitive-cycle mode: F1 0.2931
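To read the suite above as gaps rather than raw scores, the arithmetic can be sketched as follows. The F1 values are copied from the list; the helper name and the rounding choices are illustrative.

```python
# F1 values copied from the ablation / baseline list above.
BASELINES = {
    "GPT-4 128K full context": 0.379,
    "Gemini Pro 1.0": 0.313,
    "GPT-3.5 + Contriever RAG": 0.283,
    "NEXO cognitive-cycle mode": 0.2931,
}
NEXO_MEMORY_ONLY_F1 = 0.5875


def gap(nexo_f1: float, baseline_f1: float) -> tuple[float, float]:
    """Return (absolute F1 difference, relative difference in percent)."""
    delta = nexo_f1 - baseline_f1
    return round(delta, 4), round(100 * delta / baseline_f1, 1)


gaps = {name: gap(NEXO_MEMORY_ONLY_F1, f1) for name, f1 in BASELINES.items()}
for name, (absolute, relative) in gaps.items():
    print(f"{name}: +{absolute:.4f} F1 ({relative:+.1f}%)")
```

The same comparison also makes the internal ablation visible: on this benchmark the memory-only mode scores roughly twice the cognitive-cycle mode.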
Operator runtime pack
- Operator Runtime Matrix v5 Foundations (2026-04-10)
- Scope: Small reproducible operator benchmark focused on runtime-backed recall and continuity, not a universal agent benchmark.
- NEXO full stack: score 96.2% | pass 12 | partial 1 | fail 0
- Static CLAUDE.md: score 42.3% | pass 0 | partial 11 | fail 2
- No memory: score 0.0% | pass 0 | partial 0 | fail 13
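The published percentages are consistent with a rubric that gives full credit for a pass, half credit for a partial, and none for a fail, over 13 scenarios. That weighting is inferred from the numbers above, not confirmed against benchmarks/runtime_pack/README.md, so treat this as a sketch under that assumption.

```python
def rubric_score(passed: int, partial: int, failed: int,
                 partial_credit: float = 0.5) -> float:
    """Percentage score for a pass/partial/fail rubric.

    Assumption: partial_credit=0.5 is inferred from the published
    scores, not taken from the runtime pack's own documentation.
    """
    total = passed + partial + failed
    return round(100 * (passed + partial_credit * partial) / total, 1)


rows = {
    "NEXO full stack": (12, 1, 0),
    "Static CLAUDE.md": (0, 11, 2),
    "No memory": (0, 0, 13),
}
scores = {name: rubric_score(*counts) for name, counts in rows.items()}
print(scores)
```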
Longitudinal local runtime metrics
- 30d: success 17.3% | avg close 8.3 min | recovery 64.2% | open protocol debt 16 | unnecessary tool 3.4% | cost/solved 16.267091 USD
- 60d: success 17.3% | avg close 8.3 min | recovery 64.2% | open protocol debt 16 | unnecessary tool 3.4% | cost/solved 16.267091 USD
- 90d: success 17.3% | avg close 8.3 min | recovery 64.2% | open protocol debt 16 | unnecessary tool 3.4% | cost/solved 16.267091 USD
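If the rendered rows above need to be compared programmatically, for example to check whether the windows actually differ, a small parser is enough. This is a sketch against the text format shown here; the authoritative data lives in compare/scorecard.json, whose schema is not described in this scorecard.

```python
import re

# One rendered row, copied from the list above.
LINE = ("30d: success 17.3% | avg close 8.3 min | recovery 64.2% | "
        "open protocol debt 16 | unnecessary tool 3.4% | "
        "cost/solved 16.267091 USD")

# "<metric name> <number><optional unit>", e.g. "avg close 8.3 min".
FIELD = re.compile(r"(.+?) (-?\d+(?:\.\d+)?)\s*(%|min|USD)?")


def parse_window(line: str) -> tuple[str, dict[str, tuple[float, str]]]:
    """Split a rendered row into (window, {metric: (value, unit)})."""
    window, _, rest = line.partition(": ")
    metrics = {}
    for field in rest.split(" | "):
        match = FIELD.fullmatch(field.strip())
        if match is None:
            raise ValueError(f"unrecognized field: {field!r}")
        name, value, unit = match.groups()
        # Fields without a unit (such as open protocol debt) are counts.
        metrics[name] = (float(value), unit or "count")
    return window, metrics


window, metrics = parse_window(LINE)
print(window, metrics)
```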
System On Top Of Model

Public API surface
- MCP wrappers: nexo_remember, nexo_memory_recall, nexo_consolidate, nexo_run_workflow
- Python SDK: src/nexo_sdk.py
- Quickstart: docs/quickstart-5-minutes.md
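Since the wrappers are exposed over MCP, a client reaches them with a standard JSON-RPC 2.0 `tools/call` request. The sketch below only builds such a request. The tool names come from the list above, but the argument names (`content`, `query`) are hypothetical placeholders; the real input schemas should be read from the server's `tools/list` response or from docs/quickstart-5-minutes.md.

```python
import json


def build_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical arguments: the actual parameter names are not given
# in this scorecard.
remember = build_tool_call(
    1, "nexo_remember", {"content": "Deploys are frozen on Fridays."}
)
recall = build_tool_call(
    2, "nexo_memory_recall", {"query": "When are deploys frozen?"}
)

print(json.dumps(remember, indent=2))
print(json.dumps(recall, indent=2))
```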
Client parity guardrails
- scripts/verify_client_parity.py
- docs/client-parity-checklist.md
- runtime doctor parity audits
Artifact map
- locomo_summary: benchmarks/locomo/results/locomo_nexo_summary.json
- ablation_summary: benchmarks/runtime_ablations/results/ablation_summary.json
- runtime_pack_summary: benchmarks/runtime_pack/results/latest_summary.json
- compare_readme: compare/README.md
- compare_scorecard: compare/scorecard.json
- benchmark_page: features/benchmark/index.html
- parity_audit: scripts/verify_client_parity.py
- parity_checklist: docs/client-parity-checklist.md
- quickstart: docs/quickstart-5-minutes.md
- python_sdk: src/nexo_sdk.py
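Because the scorecard is meant to be inspected, it can help to confirm that every mapped artifact is present in a local checkout before reading further. A minimal sketch, assuming the paths above are relative to the repository root; the function name is illustrative.

```python
from pathlib import Path

# Keys and paths copied from the artifact map above.
ARTIFACTS = {
    "locomo_summary": "benchmarks/locomo/results/locomo_nexo_summary.json",
    "ablation_summary": "benchmarks/runtime_ablations/results/ablation_summary.json",
    "runtime_pack_summary": "benchmarks/runtime_pack/results/latest_summary.json",
    "compare_readme": "compare/README.md",
    "compare_scorecard": "compare/scorecard.json",
    "benchmark_page": "features/benchmark/index.html",
    "parity_audit": "scripts/verify_client_parity.py",
    "parity_checklist": "docs/client-parity-checklist.md",
    "quickstart": "docs/quickstart-5-minutes.md",
    "python_sdk": "src/nexo_sdk.py",
}


def missing_artifacts(repo_root: str) -> list[str]:
    """Return the artifact keys whose files are absent under repo_root."""
    root = Path(repo_root)
    return sorted(
        key for key, rel_path in ARTIFACTS.items()
        if not (root / rel_path).is_file()
    )


if __name__ == "__main__":
    # Assumption: run from the root of a local checkout.
    missing = missing_artifacts(".")
    print("all artifacts present" if not missing else f"missing: {missing}")
```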