Ejentum Benchmarks

May 31, 2026

Benchmark methodology, evaluation results, and raw data for Ejentum's Logic API, the Reasoning Harness for Agentic AI.

The Logic API retrieves engineered cognitive operations (not information) and injects them into an LLM's context at inference time. These benchmarks measure the behavioral effect of that injection across eight independent evaluation frameworks, covering four product layers: Reasoning, Code, Anti-Deception, and Memory.


Research Paper

Under Pressure: RA²R and the Emergence of Uninstructed Reasoning Behaviors in Scaffold-Augmented Language Models

Franko Luci, Ejentum. April 2026.

This paper synthesizes all benchmark findings into a unified thesis: suppression is pressure, and emergence is the model's response. 25 pages, 9 figures, all negative findings reported.


What Was Tested

Reasoning Harness (311 abilities)

| Benchmark | Tasks | Type | Model | Primary Finding |
|---|---|---|---|---|
| EjBench | 180 custom professional | Single-turn, 7-factor blind rubric | Claude Opus 4.6 | +10.1pp composite quality lift. Self-monitoring nearly doubled. Correctness flat. |
| BBH / CausalBench / MuSR | 70 published academic | Single-turn, 7-factor blind rubric | Claude Opus 4.6 | +20.8pp composite lift on focused tasks. Correctness improved +7.1pp. |
| ARC-AGI-3 | 25 steps x 2 conditions | Interactive multi-step reasoning | Claude Sonnet 4.6 | RHAE 0.0 = 0.0 (both failed). Injection persisted 24 steps. Memory decay reversed. |
| HLE-15 Ablation | 15 HLE hardest exactMatch | B/D/A ablation, reasoning + adaptive-reasoning modes | Claude Opus 4.8 | B=4/15, D=5/15, A=5/15. Both harness arms produced a +1 lift over baseline; predicted A > D separation not observed at n=15. Math sub-result A,D=2/3 vs B=1/3. Corrected 2026-06-01 from prior aggregate-field error. |

Code Harness (128 abilities)

| Benchmark | Tasks | Type | Model | Primary Finding |
|---|---|---|---|---|
| LiveCodeBench Hard | 28 hard competitive programming | Code generation + correctness | Claude Opus 4.6 (max effort) | 85.7% -> 100%. +14.3pp. 4 tasks gained, 0 lost. Zero regressions. |
| SciCode | 10 hard scientific computing | Dual injection (reasoning + code) | Claude Opus 4.6 | 7 bugs -> 0 bugs. 10/10 blind evaluation chose injection. |
| MHPP-10 Ablation | 10 hardest MHPP | B/D/A ablation + blind expert review | Claude Opus 4.8 | Pass rate saturated 9/9/9 (10/10/10 corrected). Blind SWE review converged on A > D > B: 26/19/9 across 9 ballots, 8/9 exact ordering. 21,000x measured speedup on adversarial input. |

Anti-Deception Harness (139 abilities)

| Benchmark | Tasks | Type | Model | Primary Finding |
|---|---|---|---|---|
| ELEPHANT | 40 real Reddit scenarios | Sycophancy measurement | GPT-4o (cross-model) | 5.8% composite sycophancy. 7.5% framing sycophancy. |
| Adversarial 20-Turn | 20-turn adaptive attack | Social engineering detection | GPT-4o | Detected at Turn 6. 27/30 blind evaluation. |
| Hallucination Prevention | 5 fabrication tests | Hallucination measurement | GPT-4o | Zero hallucinations across all 5 tests. |

Memory Harness (101 abilities)

| Benchmark | Tasks | Type | Model | Primary Finding |
|---|---|---|---|---|
| State Tracking | 20-turn Vantage scenario | Implicit state changes | GPT-4o | 50% fewer stale facts served as current. Blind eval 4.1/5 vs 3.5/5. |
| Perceptual Detection | 15-turn Morgan scenario | Signal detection in coaching | GPT-4o | 3x signal detection rate. 43% vs 14%. |
| Selective Metrics | 10-turn Casey scenario | Perception + reframing | GPT-4o | Earlier detection (1 turn) on 2 of 5 signals. |

Total: 250 single-turn reasoning tasks + 50 interactive reasoning steps + 28 competitive programming tasks + 10 scientific computing tasks + 40 sycophancy scenarios + 20-turn adversarial attacks + 5 hallucination tests + 45 memory turns across eight benchmark suites.


How It Was Tested

All benchmarks follow a consistent protocol adapted to each product layer:

  1. Agent-native execution. Agents called Ejentum's production Logic API themselves via tool use. The agent summarized the task, called the endpoint, received the injection, and applied it before reasoning. This mirrors real deployment: the retrieval variance is real, not simulated.

  2. Blind evaluation. For reasoning and memory benchmarks, a separate evaluator scored outputs without knowing which condition was augmented; generation and evaluation are separate stages. Code benchmarks were scored by exact-match pass/fail on test cases.

  3. Cross-model validation. Anti-deception and memory benchmarks were tested on GPT-4o, validating that the mechanism works across model families.

  4. Negative findings reported. Correctness dips, domain regressions, and unexpected results are in the reports. We do not omit results that challenge the thesis.
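
The blind evaluation step in the protocol above can be sketched as follows. This is a minimal illustration, not Ejentum's evaluation code: `blind_pairs`, `unblind`, the `X`/`Y` slots, and the `baseline`/`augmented` condition labels are hypothetical names, and the judge is assumed to be any scorer that sees only the anonymized outputs.

```python
import random


def blind_pairs(outputs, seed=0):
    """Anonymize per-task outputs so the evaluator cannot see the condition.

    `outputs` maps task_id -> {"baseline": text, "augmented": text}
    (hypothetical labels). Returns (blinded, key) where blinded maps
    task_id -> {"X": text, "Y": text} and key records which condition
    sits behind each slot.
    """
    rng = random.Random(seed)
    blinded, key = {}, {}
    for task_id, by_condition in outputs.items():
        conditions = list(by_condition)
        rng.shuffle(conditions)  # slot assignment is randomized per task
        slots = dict(zip(("X", "Y"), conditions))
        key[task_id] = slots
        blinded[task_id] = {slot: by_condition[cond] for slot, cond in slots.items()}
    return blinded, key


def unblind(scores, key):
    """Map slot-level scores back to conditions once scoring is finished.

    `scores` maps task_id -> {"X": score, "Y": score}.
    Returns the mean score per condition.
    """
    per_condition = {}
    for task_id, slot_scores in scores.items():
        for slot, score in slot_scores.items():
            per_condition.setdefault(key[task_id][slot], []).append(score)
    return {cond: sum(vals) / len(vals) for cond, vals in per_condition.items()}
```

The point of the split is that the `key` mapping never reaches the scoring stage: the judge receives only `blinded`, and `unblind` is called after all scores are recorded.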


Key Terms

| Term | Definition |
|---|---|
| Logic API | Ejentum's REST endpoint (POST /logicv1/). Retrieves engineered cognitive operations from 679 abilities across four product layers. |
| Injection | A structured cognitive payload containing a negative gate (failure pattern to avoid), reasoning topology (execution structure), suppression signals (failure modes to block), amplification signals (patterns to prioritize), and a falsification test (verification criterion). |
| Harness | A product layer (Reasoning, Code, Anti-Deception, Memory). Each harness is a curated collection of abilities targeting a specific class of AI failure. |
| Ability | One engineered cognitive operation. The atomic unit retrieved from the database. |
| Suppression | Named failure modes injected as constraints. Suppression signals reduce the probability of specific reasoning shortcuts. In testing, suppression produces larger behavioral effects than amplification alone. |
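
To make the five parts of an Injection concrete, here is one way the payload could be modeled. This is a sketch only: the actual response schema is not documented here, so the class name, the field names, and the `render` text layout are all assumptions chosen to mirror the definition above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Injection:
    """Hypothetical container mirroring the five parts named in Key Terms."""

    negative_gate: str                           # failure pattern to avoid
    reasoning_topology: str                      # execution structure
    suppression_signals: tuple[str, ...] = ()    # failure modes to block
    amplification_signals: tuple[str, ...] = ()  # patterns to prioritize
    falsification_test: str = ""                 # verification criterion

    def render(self) -> str:
        """Flatten the payload into text for the model's context (layout assumed)."""
        lines = [
            f"Avoid: {self.negative_gate}",
            f"Structure: {self.reasoning_topology}",
        ]
        lines += [f"Suppress: {s}" for s in self.suppression_signals]
        lines += [f"Amplify: {a}" for a in self.amplification_signals]
        if self.falsification_test:
            lines.append(f"Verify: {self.falsification_test}")
        return "\n".join(lines)
```

A frozen dataclass is used so that a retrieved payload cannot be mutated between retrieval and injection; any control condition that alters it has to build a new object explicitly.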

API modes: reasoning, reasoning-multi, code, code-multi, anti-deception, memory, memory-multi
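
A client can reject unknown modes before any network call is made. The sketch below only assembles a request description; it does not contact the API. The endpoint path and the mode names come from this README, while the JSON field names (`query`, `mode`) and the `build_request` helper are hypothetical, since the request schema is not specified here.

```python
import json

# Mode names copied from the list above; modes not listed there are rejected.
MODES = frozenset({
    "reasoning", "reasoning-multi",
    "code", "code-multi",
    "anti-deception",
    "memory", "memory-multi",
})

ENDPOINT_PATH = "/logicv1/"  # path as given under Key Terms


def build_request(task_summary: str, mode: str) -> dict:
    """Return a description of the POST request without sending it."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")
    if not task_summary.strip():
        raise ValueError("task_summary must not be empty")
    # Field names are assumptions for illustration, not the documented schema.
    body = {"query": task_summary.strip(), "mode": mode}
    return {"method": "POST", "path": ENDPOINT_PATH, "body": json.dumps(body)}
```

In the agent-native setup described under How It Was Tested, the task summary would be written by the agent itself before it calls the endpoint.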


Headline Results

Reasoning Harness. Quality Lift (7-factor composite)

| Benchmark | Baseline | Best Condition | Delta |
|---|---|---|---|
| EjBench (180 tasks) | 0.621 | 0.722 | +10.1pp |
| BBH/CausalBench/MuSR (70 tasks) | 0.476 | 0.684 | +20.8pp |

Code Harness. Correctness

| Benchmark | Baseline | With Injection | Delta |
|---|---|---|---|
| LiveCodeBench Hard (28 tasks) | 85.7% | 100.0% | +14.3pp |
| SciCode (10 tasks) | 7 bugs | 0 bugs | -100% |

Anti-Deception Harness. Protection

| Metric | Result |
|---|---|
| ELEPHANT composite sycophancy | 5.8% |
| Social engineering detection | Turn 6 of 20 |
| Hallucinations (5 tests) | 0 |

Memory Harness. Accuracy

| Metric | Baseline | With Injection | Delta |
|---|---|---|---|
| Stale facts served | 1.6 | 0.8 | -50% |
| Perceptual detection rate | 14% | 43% | 3x |
| Blind evaluation score | 3.5/5 | 4.1/5 | +17% |

ARC-AGI-3 Process Metrics

| Metric | Baseline | Augmented |
|---|---|---|
| Memory decay slope | -0.005 (degrading) | +0.014 (improving) |
| Injection half-life | 0 | 24 steps |
| Reasoning depth trend | 0.86 | 10.50 (12.2x) |
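
The Delta columns in the headline tables mix three conventions: percentage points (a difference between two scores), relative change (a percentage of the baseline), and a plain ratio. The helpers below are illustrative names, not part of any Ejentum tooling; they reproduce the reported figures from the baseline and treated values in the tables.

```python
def pp_delta(baseline: float, treated: float) -> float:
    """Percentage-point difference between two scores on a 0-1 scale."""
    return round((treated - baseline) * 100, 1)


def relative_change(baseline: float, treated: float) -> float:
    """Change expressed as a percentage of the baseline value."""
    return round((treated - baseline) / baseline * 100, 1)


def ratio(baseline: float, treated: float) -> float:
    """Treated value as a multiple of the baseline."""
    return round(treated / baseline, 1)


print(pp_delta(0.621, 0.722))     # EjBench composite: 10.1
print(pp_delta(0.476, 0.684))     # BBH/CausalBench/MuSR composite: 20.8
print(relative_change(1.6, 0.8))  # stale facts served: -50.0
print(relative_change(3.5, 4.1))  # blind evaluation score: 17.1
print(ratio(14, 43))              # perceptual detection: 3.1, reported as 3x
print(ratio(0.86, 10.50))         # reasoning depth trend: 12.2
```

The distinction matters when comparing rows: the +17% on the blind evaluation score is relative to a 3.5 baseline, whereas the +10.1pp on EjBench is an absolute difference on the composite scale.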

Negative Findings

  • Correctness dipped under reasoning-multi on EjBench (-0.11 on 3-point scale). More thorough reasoning occasionally trades accuracy for caution.
  • Spatial domain regressed under reasoning-multi on BBH (-20.0pp on 5 tasks). Multi-perspective injection confused spatial constraint tracking.
  • Reasoning-multi correctness dropped on BBH (-0.12). Focused tasks need focused injections. Single mode outperformed multi on every single-domain task.
  • Contradiction rate increased 1.9x on ARC-AGI-3 (token-normalized). Whether this is productive cognitive conflict or destructive interference is unresolved.
  • ARC-AGI-3: RHAE was 0.0 in both conditions. Neither condition cleared Level 0, so all process metrics are measured in a failure context.

What Would Falsify This

The core claim is that structured cognitive injection produces measurable behavioral changes in LLM outputs. This claim is falsified if:

  1. The same injection format produces zero behavioral change on a different model family. (Partially addressed: anti-deception and memory validated on GPT-4o.)
  2. Random injection (shuffled suppression signals, mismatched topologies) produces equivalent lift, meaning the specific cognitive operation doesn't matter.
  3. The 7-factor rubric scoring shows evaluator bias that systematically inflates injected conditions.
  4. Replication on a second independent run produces directionally different results.
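
The random-injection control in condition 2 could be built along these lines. This is a sketch under assumptions: injections are represented as plain dicts with hypothetical keys `suppression_signals` and `reasoning_topology`, and mismatching is done by rotating those fields across tasks, which is one simple choice among several.

```python
import copy
import random


def mismatched_controls(injections, fields=("suppression_signals", "reasoning_topology"), seed=0):
    """Give each task the listed fields from a different task's injection.

    A non-zero rotation guarantees that no task receives fields from its own
    injection. Two tasks that happen to share identical values would still
    look matched, so real use should check for that.
    """
    n = len(injections)
    if n < 2:
        raise ValueError("need at least two injections to build a mismatch")
    offset = random.Random(seed).randrange(1, n)  # 1..n-1, never 0
    controls = []
    for i, original in enumerate(injections):
        donor = injections[(i + offset) % n]
        control = copy.deepcopy(original)
        for name in fields:
            control[name] = copy.deepcopy(donor[name])
        controls.append(control)
    return controls
```

If the mismatched controls produce the same lift as the retrieved injections, the specific cognitive operation is not doing the work, which is the outcome the falsification condition describes.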

Limitations

  • Reasoning benchmarks are Claude-only. Anti-deception and memory are cross-model (GPT-4o). Full cross-model reasoning testing is in progress.
  • LLM-as-judge. Two-stage blind protocol mitigates but does not eliminate the possibility of systematic bias. Human evaluation on a subset would strengthen the evidence.
  • Custom task design bias. EjBench tasks were designed by Ejentum. The BBH/CausalBench/MuSR benchmark addresses this with externally designed tasks.
  • Small samples on sub-analyses. Spatial navigation regression rests on 5 tasks. ARC-AGI-3 is n=1 per condition.
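
To illustrate how wide the uncertainty is at these sample sizes, the sketch below computes a 95% Wilson score interval for a pass rate. The Wilson interval is a standard choice made here for illustration; it is not stated to be the method used in the reports. The counts are the HLE-15 figures from the Reasoning Harness table (B=4/15, D=5/15).

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


baseline = wilson_interval(4, 15)  # B arm
harness = wilson_interval(5, 15)   # D arm (A has the same count)
print(f"B: {baseline[0]:.2f} to {baseline[1]:.2f}")
print(f"D: {harness[0]:.2f} to {harness[1]:.2f}")
```

The two intervals overlap almost entirely, which is consistent with the table's own note that no separation between arms was observed at n=15.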

Repository Structure

benchmarks/
  README.md
  LICENSE
  ejbench/                      # 180 custom professional reasoning tasks
  bbh-causalbench-musr/         # 70 published academic reasoning tasks
  arc-agi-3/                    # Interactive multi-step reasoning (25 steps)
  lcb-hard/                     # 28 hard competitive programming tasks
  coding-benchmark/             # SciCode: 10 hard scientific computing problems
  elephant/                     # ELEPHANT sycophancy benchmark (40 scenarios)
  memory-retention/             # 20-turn implicit state change tracking
  perception-hard/              # Perceptual signal detection (Morgan + Casey)
  research/
    COGNITIVE_SCAFFOLDING_THESIS.md
    VALIDATED_CLAIMS.md
    paper/under_pressure.pdf


License

Released under CC BY 4.0. Share and adapt with attribution.