Running tx-benchmark

June 1, 2026 · View on GitHub

tx-benchmark from HealthSamurai drives a suite of k6 load tests against a running FHIR terminology server. Hades' fork — wardle/tx-benchmark — pins a small patch series until it lands upstream.

When asked to "run tx-benchmark", pick a flavor:

Flavor	Time	What it runs	Use when
`preflight`	~1 min	Correctness check across every op (no perf numbers)	After a code change, before quoting any numbers
`quick`	~5 min	Preflight + every passing test at 1 VU / 10 s each	Broad regression sweep at low load; before/after
`full`	~30+ min	Preflight + warmup + bench at VUs 1 / 10 / 50 across all tests	Cross-server comparison or release-note numbers

For ad-hoc spot-checking of a single test, run k6 directly (see Spot-check one test below) — that's one command, no flavor needed.

Setup

The recipes below run from two checkouts and refer to them through shell variables. Set these once per shell, pointing at wherever you cloned each repo:

export HADES=/path/to/hades            # this repository
export TXBENCH=/path/to/tx-benchmark   # the benchmark fork, cloned below
git clone https://github.com/wardle/tx-benchmark "$TXBENCH"

k6 must be on your PATH; the full sweep additionally needs bun and docker.

The recipes assume you have already built the benchmark's fixture databases — the same set Development describes — and that they live under data/ in the hades checkout. Adjust the paths if yours live elsewhere.

Fixture	What it is
`snomed-uk-monolith.db`	SNOMED CT UK monolith edition (Hermes)
`loinc-2.82.db`	LOINC release 2.82 (native provider)
`fhir-tx.db`	every FHIR package — R4 core, HL7 terminology, US Core, IPS, VSAC, PHIN-VADS — in one FTRM (SQLite) container

A few things to keep consistent, so numbers stay comparable from one run to the next:

Serve fhir-tx.db, not the unpacked package cache. That one container holds the whole FHIR corpus, VSAC included. It is the FTRM (SQLite) provider that CI preflight and the criterium bench (clj -M:bench) also exercise, so $expand latency is comparable across all three. The in-memory provider over an unpacked fhir-cache/ directory serves the same resources at different latency, so its numbers aren't comparable — the parity tests use it, the benchmark does not.
Keep hades on port 8080. The recipes pass the URL to k6 explicitly, so the default port just works.
Wait for startup by polling /fhir/metadata — the recipes do this with a Bash until loop. Startup is one-shot.
Use only the recipes below. Anything untracked in the tx-benchmark checkout is local scaffolding, not part of the benchmark.

Run a flavor

Each block is self-contained: it boots hades against the canonical fixture set, waits for readiness, runs the chosen flavor, and shuts hades down. Run the entire block as one shell script (or pipe through bash -e); each step depends on the one before it.

Each recipe computes a unique RUN_ID so every run lands under its own directory and history is preserved. The format is <utc-date>T<hhmm>-<hades-version>-<sha>[-dirty], e.g. 2026-05-07T2015-2.0.189-e2587dc-dirty. The -dirty suffix appears whenever src/, test/, deps.edn, or build.clj carry uncommitted changes — without it, comparison would silently lie about what code produced the numbers.

For full runs, build an uberjar first and substitute java -Xmx6g -jar target/hades.jar serve … for the clj -M:run serve … line — better startup, lower JVM noise. (clj -T:build uber.)

`preflight`

set -e
cd "$HADES"
RUN_ID="$(date -u +%Y-%m-%dT%H%M)-2.0.$(git rev-list --count HEAD)-$(git rev-parse --short HEAD)$(git diff-index --quiet HEAD -- src test deps.edn build.clj || echo -dirty)"
clj -M:run serve \
  data/snomed-uk-monolith.db \
  data/loinc-2.82.db \
  data/fhir-tx.db \
  --port 8080 > /tmp/hades.log 2>&1 &
HADES_PID=$!
trap 'kill $HADES_PID 2>/dev/null' EXIT

until curl -fsS http://localhost:8080/fhir/metadata >/dev/null 2>&1; do sleep 1; done

cd "$TXBENCH"
mkdir -p "results/$RUN_ID/hades" && \
k6 run \
  --env BASE_URL=http://localhost:8080/fhir \
  --env SERVER_NAME=hades \
  --env RUN_ID="$RUN_ID" \
  preflight/run.js

jq '.tests | with_entries(select(.value.status != "pass"))' \
   "results/$RUN_ID/hades/preflight.json"

pass / skip are both fine. skip means the operation is not claimed by hades and the benchmark phase skips it. fail is a bug.

`quick`

Preflight + every passing test at 1 VU / 10 s. Same shape as full, but at the lowest VU level only and a short duration — broad regression sweep across the entire benchmark surface.

set -e
cd "$HADES"
RUN_ID="$(date -u +%Y-%m-%dT%H%M)-2.0.$(git rev-list --count HEAD)-$(git rev-parse --short HEAD)$(git diff-index --quiet HEAD -- src test deps.edn build.clj || echo -dirty)"
clj -M:run serve \
  data/snomed-uk-monolith.db \
  data/loinc-2.82.db \
  data/fhir-tx.db \
  --port 8080 > /tmp/hades.log 2>&1 &
HADES_PID=$!
trap 'kill $HADES_PID 2>/dev/null' EXIT

until curl -fsS http://localhost:8080/fhir/metadata >/dev/null 2>&1; do sleep 1; done

cd "$TXBENCH"
mkdir -p "results/$RUN_ID/hades/benchmark"

# 1. Preflight (records which tests pass, gates the bench loop below)
k6 run \
  --env BASE_URL=http://localhost:8080/fhir \
  --env SERVER_NAME=hades \
  --env RUN_ID="$RUN_ID" \
  preflight/run.js

# 2. Per-test bench at 1 VU / 10 s for every test that passed preflight
for test in $(jq -r '.tests | to_entries[] | select(.value.status=="pass") | .key' \
                  "results/$RUN_ID/hades/preflight.json"); do
  cat="${test:0:2}"   # FS / LK / VC / EX / SS / CM
  echo "─── $test ───"
  k6 run --vus 1 --duration 10s \
    --env BASE_URL=http://localhost:8080/fhir \
    --env SERVER_NAME=hades \
    --env RUN_ID="$RUN_ID" \
    --env TEST_ID="$test" \
    --env VUS=1 \
    "k6/${cat}/${test}.js"
done

Per-test summaries land under results/$RUN_ID/hades/benchmark/. For a cross-test comparison table, see full (which produces the same layout at three VU levels).

Spot-check one test

For iterating on a single hot path, no flavor needed — just run k6 directly against an already-running hades. Test ids: FS01, LK01–LK05, VC01–VC03, EX01–EX08, SS01, CM01–CM02.

Spot-checks are the one place a scratch run-id is appropriate: you're iterating, the numbers aren't comparable across iterations anyway, and preserving each one would just litter results/. Pick a memorable tag (probe, lk02-async, your branch name) so you can tell it apart from the dated runs.

cd "$TXBENCH"
RUN_ID=probe                         # or any short scratch label
mkdir -p "results/$RUN_ID/hades/benchmark"
k6 run --vus 10 --duration 10s \
  --env BASE_URL=http://localhost:8080/fhir \
  --env SERVER_NAME=hades \
  --env RUN_ID="$RUN_ID" \
  --env TEST_ID=EX01 \
  --env VUS=10 \
  k6/EX/EX01.js

`full`

The full sweep with metrics pushed to a local Prometheus / Pushgateway and visible in Grafana. Prerequisites: bun, docker, k6.

set -e
cd "$HADES"
RUN_ID="$(date -u +%Y-%m-%dT%H%M)-2.0.$(git rev-list --count HEAD)-$(git rev-parse --short HEAD)$(git diff-index --quiet HEAD -- src test deps.edn build.clj || echo -dirty)"
clj -T:build uber
java -Xmx6g -jar target/hades.jar serve \
  data/snomed-uk-monolith.db \
  data/loinc-2.82.db \
  data/fhir-tx.db \
  --port 8080 > /tmp/hades.log 2>&1 &
HADES_PID=$!
trap 'kill $HADES_PID 2>/dev/null' EXIT

until curl -fsS http://localhost:8080/fhir/metadata >/dev/null 2>&1; do sleep 1; done

cd "$TXBENCH"
( cd observability && docker compose up -d )
bun scripts/run.ts hades http://localhost:8080/fhir "$RUN_ID"

Results land in results/$RUN_ID/hades/ (preflight.json + a benchmark/ tree, one file per test × VU level). Grafana on http://localhost:3000.

scripts/run.ts hard-fails if Prometheus is not reachable at localhost:9090 — the observability compose stack must be up first. If Docker is unavailable, use the no-Docker sweep below instead; it produces the same per-test benchmark/ JSON, just without Grafana.

No-Docker comparable sweep (VUs 1 / 10 / 50)

When you need round-0-comparable numbers but can't run Docker, drive k6 directly. Same VU levels and 30s duration as round-0, same output layout as full; the only thing missing is the Prometheus/Grafana stream.

set -e
cd "$HADES"
RUN_ID="$(date -u +%Y-%m-%dT%H%M)-2.0.$(git rev-list --count HEAD)-$(git rev-parse --short HEAD)$(git diff-index --quiet HEAD -- src test deps.edn build.clj || echo -dirty)"
clj -M:run serve \
  data/snomed-uk-monolith.db \
  data/loinc-2.82.db \
  data/fhir-tx.db \
  --port 8080 > /tmp/hades.log 2>&1 &
HADES_PID=$!
trap 'kill $HADES_PID 2>/dev/null' EXIT

until curl -fsS http://localhost:8080/fhir/metadata >/dev/null 2>&1; do sleep 1; done

cd "$TXBENCH"
mkdir -p "results/$RUN_ID/hades/benchmark"
k6 run --env BASE_URL=http://localhost:8080/fhir --env SERVER_NAME=hades \
  --env RUN_ID="$RUN_ID" preflight/run.js

for test in $(jq -r '.tests | to_entries[] | select(.value.status=="pass") | .key' \
                  "results/$RUN_ID/hades/preflight.json"); do
  cat="${test:0:2}"
  for vus in 1 10 50; do
    echo "─── $test / ${vus}vu ───"
    k6 run --vus "$vus" --duration 30s \
      --env BASE_URL=http://localhost:8080/fhir --env SERVER_NAME=hades \
      --env RUN_ID="$RUN_ID" --env TEST_ID="$test" --env VUS="$vus" \
      "k6/${cat}/${test}.js"
  done
done

Compare against published round-0

The published results are the canonical comparison baseline. Read them from the live site: healthsamurai.github.io/tx-benchmark/results/round-0/tests/<TEST>/ (per-test pages, all five servers). Round 0 is the only published round (runs.json) — there is no newer one to accidentally skip.

That site is built from a machine-readable mirror vendored at $TXBENCH/site/src/data/round-0.json, which carries the same numbers at full precision. Prefer the local JSON for precise values, but confirm its date still matches the site before trusting it (last matched 2026-05-19). Per-server schema:

.servers[] | select(.id=="hades") | .benchmark.<TEST>.<"1"|"10"|"50">
  → { rps, p50, p95, p99, avg, min, max, errorRate }   # latency in ms
.config → { vus:[1,10,50], testDuration:"30s", tests:[…], bias:{…} }

Map local k6 output onto that schema:

local `benchmark/<TEST>_vus<N>.json`	round-0 `benchmark.<TEST>.<N>`
`throughput`	`rps`
`duration.p50` / `p95` / `p99` / `avg` / `min` / `max`	same keys (no `duration.` prefix)
`error_rate`	`errorRate`

Caveats when reading the comparison:

Ignore the version field in round-0.json (v1.4.1540) — it is a stale mislabel. The binary that produced round-0's hades numbers is whatever servers/hades/Dockerfile pinned via its ADD …/releases/… URL at the time (a recent 2.0 release — v2.0.206 as of this writing). Round-0 hades is therefore a valid near-current like-for-like, only a handful of commits behind HEAD. Check the Dockerfile pin for the exact version.
Latency is host-bound. Round 0 ran on Apple M3 8-core / 24 GB / Docker Desktop (20 GB). Cross-machine latency deltas are not meaningful; compare ratios between servers and your own run-to-run trend, not absolute milliseconds against round-0.

Edits to tx-benchmark

CI and local dev both track the head of wardle/tx-benchmark (our fork) — there is no pinned commit. To pick up upstream test or pool-data changes, just git -C "$TXBENCH" pull and re-run preflight; fix any newly-failing ops in hades.

If you need to change tx-benchmark itself (a new hades-specific test, a pool fix, an overly-strict check): make the change on a branch of the fork, push it, open a PR to HealthSamurai/tx-benchmark, and merge it into the fork's default branch so CI and local dev pick it up. Keep the fork's default branch in a state CI can run — preflight gates on no failures.