Metagenome Graph Project

May 17, 2026

MetaGraph is a tool for scalable construction of annotated genome graphs and sequence-to-graph alignment.

The default index representations in MetaGraph are extremely scalable and support building graphs with trillions of nodes and millions of annotation labels. At the same time, the provided workflows and their careful implementation, combined with low-level optimizations of the core data structures, enable exceptional query and alignment performance.

Main features:

Large-scale indexing of sequences
Python API for querying in the server mode
Encoding k-mer counts (e.g., expression values) and k-mer coordinates in source sequences (e.g., for lossless encoding of genomes)
Sequence alignment against very large annotated graphs (sub-k seeding allows using arbitrarily short seeds)
Scalable cleaning of very large de Bruijn graphs (to remove sequencing errors)
Support for custom alphabets (e.g., {A,C,G,T,N} or amino acids)
Algorithms for differential assembly

Design choices in MetaGraph:

Use of succinct data structures and efficient representation schemes for extremely high scalability
Algorithmic choices that work efficiently with succinct data structures (e.g., always prefer batched operations)
Modular support of different graph and annotation representations
Use of generic and extensible interfaces to support adding custom index representations / algorithms with little code overhead.

Documentation

Online documentation is available at https://metagraph.ethz.ch/static/docs/index.html. Offline sources are here.

Citation

If you are using MetaGraph or the index resources for your work, please cite:

Karasikov M, Mustafa H, Danciu D, Kulkov O, Zimmermann M, Barber C, Rätsch G, Kahles A. Efficient and accurate search in petabase-scale sequence repositories. Nature. 2025;647: 1036–1044.
https://www.nature.com/articles/s41586-025-09603-w

BibTeX

@article{karasikov2025metagraph,
  title={Efficient and accurate search in petabase-scale sequence repositories},
  author={Karasikov, Mikhail and Mustafa, Harun and Danciu, Daniel and Kulkov, Oleksandr and Zimmermann, Marc and Barber, Christopher and R{\"a}tsch, Gunnar and Kahles, Andr{\'e}},
  journal={Nature},
  volume={647},
  number={8091},
  pages={1036--1044},
  year={2025},
  publisher={Nature Publishing Group},
  doi={10.1038/s41586-025-09603-w}
}

Install

Conda

Install the latest release on Linux or macOS with Anaconda:

conda install -c bioconda -c conda-forge metagraph

Docker

If Docker is available on the system, you can get started immediately with:

docker pull ghcr.io/ratschlab/metagraph:master
docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
    metagraph build -v -k 10 -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa

Replace ${HOME} with the directory on the host system that should be mounted under /mnt in the container.
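To avoid retyping the mount option and image name, the invocation above could be wrapped in a small shell function. This is only a convenience sketch: the function name metagraph_docker and the /data/project directory are hypothetical, while the image name and the /mnt mount point are taken from the command above.

```shell
# Hypothetical helper (not part of MetaGraph): run `metagraph` inside the
# container with a chosen host directory mounted under /mnt.
metagraph_docker() {
    local host_dir="$1"
    shift
    docker run -v "${host_dir}:/mnt" ghcr.io/ratschlab/metagraph:master \
        metagraph "$@"
}

# Usage (assumes /data/project exists on the host and contains the input file):
#   metagraph_docker /data/project build -v -k 10 \
#       -o /mnt/transcripts_1000 /mnt/transcripts_1000.fa
```

Paths passed to metagraph must be given as seen from inside the container, i.e. under /mnt.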

By default, it executes the binary compiled for the DNA alphabet {A,C,G,T}. To run the binary compiled for the DNA5 or Protein alphabet, just replace metagraph with metagraph_DNA5 or metagraph_Protein, respectively, e.g.:

docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master \
    metagraph_Protein build -v -k 10 -o /mnt/graph /mnt/protein.fa

To check which directory is mounted in the container, list its contents with the following command (or similar):

docker run -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master ls /mnt

For more complex workflows, consider running docker in the interactive mode:

$ docker run -it --entrypoint /bin/bash -v ${HOME}:/mnt ghcr.io/ratschlab/metagraph:master

root@5c42291cc9cf:/# ls /mnt/
root@5c42291cc9cf:/# metagraph --version

All different versions of the container image are listed here.

Install From Sources

To compile from source (e.g., for builds with a custom alphabet or other configurations), see the online documentation.

Quick start: build an index in one command

For most users, the easiest entry point is the Snakemake wrapper, which runs the full indexing pipeline — graph construction, annotation, and all row-diff / BRWT transforms — as a single command.

The wrapper ships as a separate Python package; the metagraph conda recipe only installs the C++ binary, so the workflow CLI needs an extra pip install step:

conda install -c bioconda -c conda-forge metagraph     # the metagraph binary
pip install -U "git+https://github.com/ratschlab/metagraph.git#subdirectory=metagraph/workflows"

Then run the full pipeline as:

metagraph-workflows build samples.txt -o out/ --primary

samples.txt is a text file listing your input sample paths (one per line); a directory of sample files works just as well. out/ will contain graph.dbg, graph_small.dbg, and the requested annotation artifacts. You can also feed a list inline with process substitution:

metagraph-workflows build <(ls /data/samples/*.fa) -o out/ --primary
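If the sample list should be kept on disk rather than passed inline, it can be generated once and reused. The sketch below is one way to do that; the function name make_sample_list and the choice to match only *.fa files are assumptions made for illustration.

```shell
# Hypothetical helper: write the paths of all *.fa files found directly in a
# directory to a list file, one path per line, sorted for reproducibility.
make_sample_list() {
    local sample_dir="$1" out_file="$2"
    find "$sample_dir" -maxdepth 1 -type f -name '*.fa' | sort > "$out_file"
}

# Usage (the directory is illustrative):
#   make_sample_list /data/samples samples.txt
#   metagraph-workflows build samples.txt -o out/ --primary
```

Unlike ls with a glob, find does not fail with an "argument list too long" error when the directory holds a very large number of files.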

Tell the workflow how much hardware to use; everything else (memory caps per stage, per-column buffer sizing, BRWT clustering parameters) is derived automatically:

metagraph-workflows build samples.txt -o out/ --primary \
  -p 34 --mem-gb 70 --disk-swap-dir /scratch/swap

Useful switches:

-p N — maximum CPU cores to use (defaults to all cores)
--mem-gb GB — approximate RAM budget per rule (default 16)
--disk-swap-dir DIR — directory for on-disk spill buffers
--primary — build a primary graph (recommended for most workloads)
--anno-type FMT — request a specific annotation format (repeat for multiple outputs); the default is relax.row_diff_brwt
--with-counts / --with-coords — count- or coordinate-aware annotation (mutually exclusive)
--graph EXISTING.dbg — reuse an already-built graph and only run the annotation + transforms

See metagraph/workflows/README.rst for setup and the full option list (metagraph-workflows build --help).

Typical workflow

Build a de Bruijn graph from FASTA files, FASTQ files, or KMC k-mer counters:
./metagraph build
Annotate the graph using the column-compressed annotation:
./metagraph annotate
Transform the built annotation to a different annotation scheme:
./metagraph transform_anno
Query the annotated graph:
./metagraph query

Example

DATA="../tests/data/transcripts_1000.fa"

./metagraph build -k 12 -o transcripts_1000 $DATA

./metagraph annotate -i transcripts_1000.dbg --anno-filename -o transcripts_1000 $DATA

./metagraph query -i transcripts_1000.dbg -a transcripts_1000.column.annodbg $DATA

./metagraph stats -a transcripts_1000.column.annodbg transcripts_1000.dbg

Print usage

./metagraph

Build graph

Simple build

./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 \
                        -o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
2>&1 | tee <LOG_DIR>/log.txt

Build with disk swap (use to limit the RAM usage)

./metagraph build -v --parallel 30 -k 20 --mem-cap-gb 10 --disk-swap <GRAPH_DIR> \
                        -o <GRAPH_DIR>/graph <DATA_DIR>/*.fasta.gz \
2>&1 | tee <LOG_DIR>/log.txt

Build from k-mers filtered with KMC

K=20
./KMC/kmc -ci5 -t4 -k$K -m5 -fm <FILE>.fasta.gz <FILE>.cutoff_5 ./KMC
./metagraph build -v -p 4 -k $K --mem-cap-gb 10 -o graph <FILE>.cutoff_5.kmc_pre

Annotate graph

./metagraph annotate -v --anno-type row --fasta-anno \
                           -i primates.dbg \
                           -o primates \
                           ~/fasta_zurich/refs_chimpanzee_primates.fa

Convert annotation to Multi-BRWT

Cluster columns

./metagraph transform_anno -v --linkage --greedy \
                           -o linkage.txt \
                           --subsample R \
                           -p NCORES \
                           primates.column.annodbg

Requires N*R/8 + 6*N^2 bytes of RAM, where N is the number of columns and R is the number of rows subsampled.
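As a worked example of this estimate, the formula can be evaluated with shell arithmetic before choosing R. The function name and the sample values (1,000 columns, 1,000,000 subsampled rows) are illustrative only.

```shell
# Estimated RAM in bytes for column clustering: N*R/8 + 6*N^2,
# where N = number of columns and R = number of subsampled rows.
linkage_ram_bytes() {
    local n_columns="$1" n_rows_subsampled="$2"
    echo $(( n_columns * n_rows_subsampled / 8 + 6 * n_columns * n_columns ))
}

# 1000 * 1000000 / 8 = 125000000 bytes for the subsampled columns,
# plus 6 * 1000^2 = 6000000 bytes for the quadratic term.
linkage_ram_bytes 1000 1000000   # prints 131000000
```

For these values the subsampled columns dominate; the quadratic term becomes the larger one once N exceeds R/48.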

Construct Multi-BRWT

./metagraph transform_anno -v -p NCORES --anno-type brwt \
                           --linkage-file linkage.txt \
                           -o primates \
                           --parallel-nodes V \
                           primates.column.annodbg

Requires M*V/8 + Size(BRWT) bytes of RAM, where M is the number of rows in the annotation and V is the number of nodes merged concurrently.
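The same formula can be inverted to pick V for a given memory budget: V <= (budget - Size(BRWT)) * 8 / M. The sketch below assumes a rough estimate of the final BRWT size is available, which is not known exactly before construction; the function name and all sample values are hypothetical.

```shell
# Largest V such that M*V/8 + Size(BRWT) fits in the given budget (bytes).
# Arguments: RAM budget, estimated BRWT size, number of annotation rows M.
max_parallel_nodes() {
    local budget_bytes="$1" brwt_bytes="$2" n_rows="$3"
    if [ "$budget_bytes" -le "$brwt_bytes" ]; then
        echo 0   # the budget does not even cover the BRWT itself
        return
    fi
    echo $(( (budget_bytes - brwt_bytes) * 8 / n_rows ))
}

# 16 GB budget, 2 GB estimated BRWT, 10^9 rows:
# (16000000000 - 2000000000) * 8 / 1000000000 = 112
max_parallel_nodes 16000000000 2000000000 1000000000   # prints 112
```

The result is rounded down, so the returned V never exceeds the budget under the stated formula.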

Query graph

./metagraph query -v -i <GRAPH_DIR>/graph.dbg \
                        -a <GRAPH_DIR>/annotation.column.annodbg \
                        --min-kmers-fraction-label 0.8 --labels-delimiter ", " \
                        query_seq.fa

Align to graph

./metagraph align -v -i <GRAPH_DIR>/graph.dbg query_seq.fa

Assemble sequences

./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
                        -o assembled.fa \
                        --unitigs

Assemble differential sequences

./metagraph assemble -v <GRAPH_DIR>/graph.dbg \
                        --unitigs \
                        -a <GRAPH_DIR>/annotation.column.annodbg \
                        --diff-assembly-rules diff_assembly_rules.json \
                        -o diff_assembled.fa

See metagraph/tests/data/example.diff.json and metagraph/tests/data/example_simple.diff.json for sample files.

Get stats

Stats for graph

./metagraph stats graph.dbg

Stats for annotation

./metagraph stats -a annotation.column.annodbg

Stats for both

./metagraph stats -a annotation.column.annodbg graph.dbg

The make targets (such as build-metagraph) accept the following arguments:

env: environment in which to compile/run ("": on the host, docker: in a docker container)
alphabet: compile metagraph for a certain alphabet (e.g. DNA or Protein, default DNA)
additional_cmake_args: additional arguments to pass to cmake.

Examples:

# compiles metagraph in a docker container for the `DNA` alphabet
make build-metagraph env=docker alphabet=DNA

Update and create a new release

Creating a new version release is done in three steps:

Update package.json and set the version
Add a tag with that new version
Make a new release on GitHub
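The tagging step can be tied to the version set in package.json so the two cannot drift apart. The following is a minimal sketch, not the project's release script: the helper name, the "v" tag prefix, and the remote name origin are assumptions.

```shell
# Hypothetical helper: print the first "version" value found in a
# package.json-style file.
read_package_version() {
    sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Usage after updating package.json (tag prefix and remote are assumptions):
#   version="$(read_package_version package.json)"
#   git tag -a "v${version}" -m "Release v${version}"
#   git push origin "v${version}"
# The release itself is then created on GitHub from the pushed tag.
```

A JSON-aware tool would be more robust than sed if package.json contains nested "version" keys before the top-level one.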

License

MetaGraph is distributed under the GPLv3 License (see LICENSE). Further information is available in the AUTHORS and COPYRIGHTS files.