Rust Usage

May 7, 2026

After building with:

RUSTFLAGS="-C target-cpu=native" cargo build --release

three binaries are available under ./target/release/.

`bench_tac` — Token-Aware Clustering

Trains TAC on a multivector dataset and saves centroids and assignments to disk. Use this when you want to inspect or reuse the clustering step independently of the full index build.

./target/release/bench_tac \
  -i <vectors.npy> \
  --token-ids-file <token_ids.npy> \
  --total-centroids <K> \
  -o <output_dir>

Parameters

Flag	Default	Description
`-i`, `--vectors-file`	(required)	`[N, dim]` f16 token vectors
`--token-ids-file`	(required)	`[N]` token-type IDs (i64 or u32)
`--total-centroids`	(required)	Total centroid budget distributed across all token types
`--tac-n-iter`	`10`	K-means iterations per token group
`-o`, `--output-dir`	(required)	Directory where outputs are written
`--verbose`	off	Print per-token allocation details and spread scores
`--test-mode`	off	Load only the first 100 000 vectors (quick sanity check)

Output

Three files are written to <output_dir>:

File	Description
`centroids_it{n}_k{K}_n{N}.npy`	`[K, dim]` f32 coarse centroids
`assignments_it{n}_k{K}_n{N}.npy`	`[N]` u32 centroid id per token
`timing.txt`	TAC training time in seconds

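The filename stem encodes the run parameters for traceability: `n` is the iteration count (`--tac-n-iter`), `K` the centroid count, and `N` the number of token vectors.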
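
When a downstream script needs to locate these files, the names can be rebuilt from the run parameters. The helper below is a hypothetical sketch, not part of the repository. It assumes that `{n}` is the `--tac-n-iter` value, `{K}` the `--total-centroids` value, and `{N}` the number of vectors loaded.

```python
def tac_output_names(n_iter, total_centroids, n_vectors):
    """Rebuild the expected bench_tac output filenames (assumed naming scheme)."""
    stem = f"it{n_iter}_k{total_centroids}_n{n_vectors}"
    return [
        f"centroids_{stem}.npy",    # [K, dim] f32 coarse centroids
        f"assignments_{stem}.npy",  # [N] u32 centroid id per token
        "timing.txt",               # TAC training time in seconds
    ]


if __name__ == "__main__":
    # Illustrative values only: 10 iterations, 2M centroids, 100 000 vectors.
    for name in tac_output_names(10, 2097152, 100000):
        print(name)
```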

Example

./target/release/bench_tac \
  -i /data/vectors.npy \
  --token-ids-file /data/token_ids.npy \
  --total-centroids 2097152 \
  --tac-n-iter 10 \
  --verbose \
  -o /data/tac_output/

`tachiom_build` — Index Construction

Runs the full pipeline (TAC → PQ training → encoding → HNSW) and saves the index to disk.

./target/release/tachiom_build \
  -i <vectors.npy> \
  --token-ids-file <token_ids.npy> \
  --doclens-file <doclens.npy> \
  -o <index_file>

Parameters

Flag	Default	Description
`-i`, `--vectors-file`	(required)	`[N, dim]` f16 token vectors
`--token-ids-file`	(required)	`[N]` token-type IDs (i64 or u32)
`--doclens-file`	(required)	`[n_docs]` document lengths (i32 or i64)
`-o`, `--output-file`	(required)	Output path for the serialized index
`--total-centroids`	`4194304`	TAC coarse-centroid budget
`--tac-n-iter`	`10`	K-means iterations per token group in TAC
`--pq-sample-size`	`10000000`	Tokens sampled for PQ training
`--pq-n-iter`	`10`	K-means iterations for PQ subspace training
`--normalize`	off	L2-normalise residuals before PQ encoding
`--pq-seed`	`42`	RNG seed for reproducible PQ training
`--hnsw-m`	`32`	HNSW neighbours per node in the centroid graph
`--ef-construction`	`1500`	HNSW build-time beam width
`--pq-subspaces`	`32`	PQ subspace count (only 32 is currently supported)
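
Because a build over a large collection is long-running, it can be worth checking the input files against the dtypes and shapes listed above before launching it. The following is a minimal sketch that reads only the `.npy` headers with the Python standard library. The function names are hypothetical, and it assumes little-endian files (`<f2`, `<i8`, `<u4`, `<i4`), as NumPy writes on most machines.

```python
import ast
import struct


def read_npy_header(path):
    """Return (descr, shape) from a .npy header without loading the data."""
    with open(path, "rb") as f:
        if f.read(6) != b"\x93NUMPY":
            raise ValueError(f"{path}: not a .npy file")
        major, _minor = f.read(2)
        # Format 1.0 stores the header length in 2 bytes, 2.0 and 3.0 in 4.
        size_fmt = "<H" if major == 1 else "<I"
        (header_len,) = struct.unpack(size_fmt, f.read(struct.calcsize(size_fmt)))
        header = ast.literal_eval(f.read(header_len).decode("latin1").strip())
    return header["descr"], tuple(header["shape"])


def check_build_inputs(vectors, token_ids, doclens):
    """Return a list of mismatches against the documented dtypes and shapes."""
    v_descr, v_shape = read_npy_header(vectors)
    t_descr, t_shape = read_npy_header(token_ids)
    d_descr, d_shape = read_npy_header(doclens)
    problems = []
    if v_descr != "<f2" or len(v_shape) != 2:
        problems.append(f"vectors: expected [N, dim] f16, got {v_descr} {v_shape}")
    if t_descr not in ("<i8", "<u4") or len(t_shape) != 1:
        problems.append(f"token_ids: expected [N] i64/u32, got {t_descr} {t_shape}")
    elif v_shape and t_shape[0] != v_shape[0]:
        problems.append("token_ids: length differs from the number of vectors")
    if d_descr not in ("<i4", "<i8") or len(d_shape) != 1:
        problems.append(f"doclens: expected [n_docs] i32/i64, got {d_descr} {d_shape}")
    return problems
```

An empty list means the headers agree with the table. The check does not read the array data, so it cannot confirm that the document lengths are consistent with the number of token vectors.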

Example

./target/release/tachiom_build \
  -i /data/vectors.npy \
  --token-ids-file /data/token_ids.npy \
  --doclens-file /data/doclens.npy \
  -o /indexes/tachiom_index \
  --total-centroids 2097152 \
  --normalize

`tachiom_search` — Querying

Loads a built index and runs a batch of multivector queries, printing average latency and optionally writing ranked results to disk.

./target/release/tachiom_search \
  -i <index_file> \
  -q <queries.npy> \
  -o <results.tsv>

Parameters

Flag	Default	Description
`-i`, `--index-file`	(required)	Path to the serialized Tachiom index
`-q`, `--query-file`	(required)	`[n_queries, n_tokens, dim]` f32 query vectors
`-o`, `--output-path`	(optional)	Output TSV file (omit to benchmark without saving)
`--k`	`10`	Results returned per query
`--k-centroids`	`4`	Coarse centroids probed per query token (Gather phase)
`--k-docs-to-score`	`1000`	Candidates forwarded to PQ reranking (Refine phase)
`--ef-search`	`64`	HNSW beam width during coarse retrieval
`--alpha`	(off)	Alpha-pruning threshold (fraction of k-th coarse score); omit to disable
`--beta`	(off)	Early-exit staleness counter for PQ reranking; omit to disable
`--lambda`	(off)	Distance-adaptive HNSW termination factor; omit to disable
`--num-runs`	`1`	Number of timed runs; results from the last run are saved

Output format

The TSV file has one line per result with columns: query_id, doc_id, rank, score.

Example

./target/release/tachiom_search \
  -i /indexes/tachiom_index \
  -q /data/queries.npy \
  -o results.tsv \
  --k 10 \
  --k-centroids 25 \
  --k-docs-to-score 2000 \
  --ef-search 40 \
  --alpha 0.4

scripts/run_experiments.py drives tachiom_build and tachiom_search from a TOML configuration file, logging machine info, git state, build output, search results, and evaluation metrics into a timestamped experiment folder.

python scripts/run_experiments.py --exp experiments/sigir2026/<config>.toml

TOML structure

name          = "my_experiment"
build-command = "./target/release/tachiom_build"
query-command = "./target/release/tachiom_search"

[settings]
k        = 10       # results per query
num-runs = 1        # timing repetitions
build    = true     # set to false to skip build and use a prebuilt index
metric   = "RR@10"  # any ir_measures metric string (e.g. "Success@5")
NUMA     = "numactl --physcpubind='0' --localalloc"  # optional NUMA pinning

[folder]
data       = "/path/to/dataset"
index      = "/path/to/index/dir"
experiment = "."           # experiment logs are written here
qrels_path = "/path/to/qrels.tsv"

[filename]
vectors   = "vectors.npy"      # [N, dim] f16
token_ids = "token_ids.npy"    # [N] i64 or u32
doclens   = "doclens.npy"      # [n_docs] i32 or i64
queries   = "queries.npy"      # [n_queries, n_tokens, dim] f32
index     = "my_index"         # index filename (no extension)
doc_ids   = "doc_ids.npy"      # optional: integer → string doc ID mapping
query_ids = "query_ids.npy"    # optional: integer → string query ID mapping

[build_params]
total-centroids = 4194304
tac-n-iter      = 10
pq-sample-size  = 10000000
pq-n-iter       = 10
normalize       = true
pq-seed         = 42
hnsw-m          = 32
ef-construction = 1500
pq-subspaces    = 32

# One subsection per search configuration
[query.my_config]
ef-search       = 30
k-centroids     = 20
k-docs-to-score = 4000
alpha           = 0.4

The runner produces a report.tsv inside the experiment folder with one row per [query.*] subsection, reporting query latency (μs), the chosen metric, memory usage, and build time.
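
The keys under `[build_params]` and `[query.*]` have the same names as the command-line flags of `tachiom_build` and `tachiom_search`. One way such a table could be turned into an argument list is sketched below. This illustrates the correspondence and is not the runner's actual code. The function name is hypothetical, and the input is the plain dict a TOML parser would return for one table.

```python
def to_cli_args(params):
    """Turn a parsed TOML table into a flat list of command-line arguments."""
    args = []
    for key, value in params.items():
        flag = f"--{key}"
        if isinstance(value, bool):
            # Switches such as --normalize take no value; omit them when false.
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


if __name__ == "__main__":
    query_cfg = {"ef-search": 30, "k-centroids": 20,
                 "k-docs-to-score": 4000, "alpha": 0.4}
    print(" ".join(["./target/release/tachiom_search"] + to_cli_args(query_cfg)))
```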

Reproducing SIGIR 2026 results

The TOML configs used in the paper are in experiments/sigir2026/.

Pre-processed datasets and pre-built indexes are available on Hugging Face; see the Datasets section in the README for download instructions. Update the [folder] paths in the TOML configs to point to your local copies before running.

Prerequisites: run on a single core pinned via numactl, with the CPU governor set to performance.

# Check governor (should print the number of available CPUs)
cpufreq-info | grep "performance" | grep -v "available" | wc -l
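
If `cpufreq-info` is not installed, the same check can be made by reading the Linux cpufreq sysfs files directly. This is a sketch that assumes the standard `/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` layout; the function names are hypothetical.

```python
import glob

GOVERNOR_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"


def read_governors(pattern=GOVERNOR_GLOB):
    """Return the active governor of every CPU that exposes one."""
    governors = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            governors.append(f.read().strip())
    return governors


def all_performance(governors):
    """True only if at least one CPU was found and all use 'performance'."""
    return bool(governors) and all(g == "performance" for g in governors)


if __name__ == "__main__":
    found = read_governors()
    print(f"{found.count('performance')} of {len(found)} CPUs use 'performance'")
```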

MS MARCO-v1

python scripts/run_experiments.py \
  --exp experiments/sigir2026/ms-marco.toml

Key parameters:

Parameter	Value
`total-centroids`	4 194 304 (4M)
`normalize`	true
`ef-search`	30
`k-centroids`	20
`k-docs-to-score`	4000
`alpha`	0.4
Metric	MRR@10 (`RR@10` in ir_measures)

LoTTE (pooled)

python scripts/run_experiments.py \
  --exp experiments/sigir2026/lotte.toml

Key parameters:

Parameter	Value
`total-centroids`	2 097 152 (2M)
`normalize`	true
`ef-search`	40
`k-centroids`	25
`k-docs-to-score`	2000
`alpha`	0.4
Metric	Success@5
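
For reference, the two metrics named above have simple per-query definitions, and the reported figure is the mean over all queries. The sketch below illustrates those definitions only. The experiment runner takes an ir_measures metric string, so these hypothetical functions are not what produces the numbers in `report.tsv`.

```python
def reciprocal_rank(ranked_doc_ids, relevant, cutoff=10):
    """1/rank of the first relevant document within the cutoff, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:cutoff], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def success(ranked_doc_ids, relevant, cutoff=5):
    """1 if any relevant document appears within the cutoff, else 0."""
    return 1.0 if any(d in relevant for d in ranked_doc_ids[:cutoff]) else 0.0


def mean(values):
    """Average the per-query scores; an empty input gives 0."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Toy data: the first relevant document sits at rank 2.
    ranking, relevant = ["d3", "d1", "d7"], {"d1"}
    print(reciprocal_rank(ranking, relevant), success(ranking, relevant))
```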