f3dx-bench
April 30, 2026
The first prod-traffic latency dashboard for LLMs. Artificial Analysis runs synthetic probes from a fixed test bench. lm-arena measures judge-quality. Nothing measures real prod latency on real traffic across providers, hours, regions. f3dx-bench does, by sitting downstream of f3dx and llmkit users who opt in.
The differentiator is structural: Artificial Analysis cannot collect this data without becoming a runtime, which it is not. llmkit already sits in the request path of every llmkit user, so the data plane lights up on day 1 instead of needing a flywheel from zero.
# in your f3dx app:
F3DX_BENCH_OPTIN=1 python my_agent.py
# or in your llmkit dashboard, toggle "Contribute to f3dx-bench"
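As a rough illustration of the env-var opt-in, a client could gate beacon emission on a check like the one below. This is a sketch, not the f3dx implementation: the function name `bench_opted_in` is hypothetical, and treating only the exact value `1` as consent is an assumption.

```python
import os


def bench_opted_in(environ=None) -> bool:
    """Return True only when the user explicitly set F3DX_BENCH_OPTIN=1.

    Hypothetical helper: the strict "1" comparison is an assumption, chosen
    so that unset, empty, or any other value means no telemetry is sent.
    """
    env = os.environ if environ is None else environ
    return env.get("F3DX_BENCH_OPTIN", "") == "1"


if __name__ == "__main__":
    print("beacons enabled" if bench_opted_in() else "beacons disabled")
```

Defaulting to "disabled" on anything other than an explicit `1` keeps the check consistent with an opt-in-only policy.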
Three live charts at https://bench.f3d1.dev (placeholder URL until deploy):
- latency.md - p99 / p95 / p50 latency by model + region + hour, last 7 days
- errors.md - error-rate by provider + hour, last 7 days
- cost.md - cost-per-1k tokens by token bucket, by model
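To make the latency chart concrete, here is a minimal sketch of how p50 / p95 / p99 could be computed per (model, region, hour) group. The field names `model`, `region`, `hour`, and `latency_ms` are assumptions for illustration (the real beacon schema lives in schemas/), and the nearest-rank percentile definition is one common choice, not necessarily the one the dashboard queries use.

```python
from collections import defaultdict


def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, for integer q in 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    n = len(sorted_values)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, (q * n + 99) // 100)
    return sorted_values[rank - 1]


def latency_summary(beacons):
    """Group beacons by (model, region, hour) and compute p50/p95/p99 latency."""
    groups = defaultdict(list)
    for b in beacons:
        groups[(b["model"], b["region"], b["hour"])].append(b["latency_ms"])
    summary = {}
    for key, values in groups.items():
        values.sort()
        summary[key] = {f"p{q}": percentile(values, q) for q in (50, 95, 99)}
    return summary
```

Tail percentiles are only meaningful with enough samples per group, which is one more reason small buckets are suppressed before anything is published.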
Architecture
Three components, one repo:
- schemas/ - canonical TraceBeacon row (12 fields, ~200 bytes on the wire). NO prompt content, NO response content, NO API keys. install_id is a per-install UUID; install_hmac is HMAC-SHA256 over the canonical beacon JSON. k-anonymity bucket suppression (threshold N >= 50; smaller buckets are dropped) on every aggregation that hits the public dashboard.
- worker/ - Cloudflare Worker (Hono) at the ingest edge. POST /v1/beacon with a single beacon or an NDJSON batch. Validates the schema, verifies the HMAC against the KV-stored install secret, rate-limits per install (60 req/min), and appends validated rows to R2 as date-partitioned NDJSON. A daily scheduled job converts NDJSON to Parquet for the dashboard's duckdb-wasm reader.
- dashboard/ - single static HTML file. Loads duckdb-wasm + Plotly from a CDN, queries the public R2 parquet directly in the browser, and renders summary cards + latency / errors / cost charts + a recent-beacons table. Zero build step. Deploy via wrangler pages deploy ./dashboard. (evidence-app/ was the original Svelte/Evidence.dev approach; abandoned due to peer-dep conflicts on Windows + Svelte 5 incompatibility in Evidence v31. Single-file HTML is simpler, faster to ship, and architecturally cleaner.)
- tools/compact_ndjson_to_parquet.py - Python + pyarrow tool that reads the per-hour NDJSON in R2 and emits per-day partitioned parquet under parquet/yyyy=*/mm=*/dd=*/all.parquet AND a consolidated parquet/latest.parquet that the dashboard reads. Run ad hoc for now; v0.0.2 wires it as a CF scheduled cron.
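The per-install attestation described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the scheme documented in docs/hmac-attestation.md: "canonical JSON" is assumed here to mean sorted keys with compact separators and the install_hmac field excluded, and the helper names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_json(beacon: dict) -> bytes:
    """Serialize a beacon deterministically (assumed canonical form)."""
    body = {k: v for k, v in beacon.items() if k != "install_hmac"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_beacon(beacon: dict, install_secret: bytes) -> dict:
    """Client side: return a copy of the beacon with install_hmac attached."""
    signed = dict(beacon)
    signed["install_hmac"] = hmac.new(
        install_secret, canonical_json(beacon), hashlib.sha256
    ).hexdigest()
    return signed


def verify_beacon(beacon: dict, install_secret: bytes) -> bool:
    """Ingest side: recompute the HMAC and compare in constant time."""
    expected = hmac.new(
        install_secret, canonical_json(beacon), hashlib.sha256
    ).hexdigest()
    supplied = str(beacon.get("install_hmac", ""))
    return hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8"))
```

Both sides must agree on the exact byte serialization, which is why the canonical form has to be pinned down; any change to a signed field invalidates the HMAC.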
Privacy contract
Read docs/privacy.md for the full policy. Short version:
- Opt-in only (env var, dashboard toggle, or first-run prompt). No silent telemetry.
- 12-field beacon shape only. No prompt content, no response content, no API keys, no headers.
- Install-keyed HMAC blocks synthetic-traffic flooding without requiring registration.
- Aggregations that go to the public dashboard suppress any (model, provider, region, hour) bucket where N < 50.
- Forget request: POST /v1/forget {install_id, hmac} removes all rows for that install_id within 24h.
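The bucket-suppression rule can be illustrated with a short sketch. The row field names (`model`, `provider`, `region`, `hour`) mirror the bucket key named in the list above, but their exact spelling in the schema is an assumption, and `suppress_small_buckets` is a hypothetical helper rather than code from this repo.

```python
from collections import Counter

K_MIN = 50  # buckets with fewer rows than this never reach the public dashboard


def bucket_key(row):
    """The (model, provider, region, hour) tuple that defines a bucket."""
    return (row["model"], row["provider"], row["region"], row["hour"])


def suppress_small_buckets(rows, k=K_MIN):
    """Keep only rows whose bucket contains at least k rows."""
    counts = Counter(bucket_key(r) for r in rows)
    return [r for r in rows if counts[bucket_key(r)] >= k]
```

Filtering before aggregation means a sparsely used model or region simply does not appear on the dashboard, rather than exposing statistics derived from a handful of installs.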
Layout
f3dx-bench/
  schemas/
    trace_beacon.schema.json     JSON Schema for the wire format
    trace_beacon.parquet.md      Parquet column layout for R2
  worker/
    src/worker.ts                CF Worker (Hono): /v1/beacon + /v1/health + /v1/forget
    wrangler.toml                CF deploy config (placeholders for R2 bucket + KV namespace)
    package.json                 deps: hono only; dev: wrangler + biome
    tsconfig.json
  evidence-app/
    pages/{index,latency,errors,cost}.md
    sources/bench/{connection.yaml,beacons.sql}
    package.json
    evidence.settings.json
  docs/
    architecture.md              system diagram + retention policy
    privacy.md                   opt-in mechanics, k-anonymity, forget flow
    hmac-attestation.md          per-install HMAC scheme + reasoning
  .github/
    workflows/{ci,scorecard,security}.yml
    dependabot.yml
  README.md / LICENSE / CODEOWNERS / SECURITY.md / .gitignore
What's missing
- Schema versioning (v1 only; v2 will add streaming-specific timing)
- Beacon batching client SDK (Python + TypeScript, lands in f3dx[bench] + @f3d1/llmkit-sdk)
- Public R2 deploy (Federico holds the wrangler creds)
- Evidence.dev hosted at bench.f3d1.dev (CF Pages step after R2)
Sibling projects
The f3d1 ecosystem:
- f3dx - Rust runtime your Python imports. Drop-in for openai + anthropic SDKs with native SSE streaming, agent loop with concurrent tool dispatch, OTel emission. Includes f3dx.cache (content-addressable LLM response cache + tool-result memoization), f3dx.router (sequential / hedged-parallel LLM router), and f3dx.fast (CanonicalPrompt for prefix-cache hits, SpecToolDispatcher for streaming tool dispatch, budget_max_tokens hinting). pip install f3dx.
- tracewright - Trace-replay adapter for pydantic-evals. Read an f3dx or pydantic-ai logfire JSONL trace, get a pydantic_evals.Dataset. pip install tracewright. The [cache] extra wraps candidate fns with f3dx.cache.cached_call.
- pydantic-cal - Calibration metrics for pydantic-evals: ECE, smECE, MCE, ACE, Brier, reliability diagrams, Fisher-Rao geometry kernel. pip install pydantic-cal.
- llmkit - Hosted API gateway with budget enforcement, session tracking, cost dashboards, MCP server. llmkit.sh.
- keyguard - Security linter for open source projects.
f3dx-cache and f3dx-router used to be separate PyPI packages; folded into f3dx on 2026-04-30 as workspace members. The old packages stay live as deprecation shims for a few months.
License
MIT.