Hifi assembly

May 9, 2026

MetaMDBG is a fast and memory-efficient assembler designed for long, accurate metagenomic reads (e.g. PacBio HiFi, Nanopore R10). It is optimized for metagenomes but also works well on bacterial isolate samples. Up-to-date benchmarks are provided in the Results section below.

nanoMDBG, the method for assembling Nanopore R10 simplex reads, is integrated into metaMDBG.

Developer: Gaëtan Benoit
Contact: gaetanbenoitdev at gmail dot com

News

May 2026: MetaMDBG v1.4 is out!

  • Major performance update (both runtime and memory)
  • Better polishing

Check out the new results below.

Installation

Conda

conda install -c conda-forge -c bioconda metamdbg

Building from source (using conda)

Choose an installation directory, then copy-paste the following commands.

# Download metaMDBG repository  
git clone https://github.com/GaetanBenoitDev/metaMDBG.git

# Create metaMDBG conda environment
cd metaMDBG
conda env create -f conda_env.yml
conda activate metamdbg1.4
conda env config vars set CPATH=${CONDA_PREFIX}/include:${CPATH}
conda deactivate

# Activate metaMDBG environment
conda activate metamdbg1.4

# Compile the software
mkdir build
cd build
cmake ..
make -j 3

After successful installation, an executable named metaMDBG will appear in ./build/bin.

Building from source

Prerequisites

  • gcc 9.4+
  • cmake 3.10+
  • zlib
  • openmp

git clone https://github.com/GaetanBenoitDev/metaMDBG.git
cd metaMDBG
mkdir build
cd build
cmake ..
make -j 3

After successful installation, an executable named metaMDBG will appear in ./build/bin.

Usage

Usage:  metaMDBG asm {OPTIONS}

 Basic options:
   --out-dir               Output dir for contigs and temporary files
   --in-hifi               PacBio HiFi read filename(s) (separated by space)
   --in-ont                Nanopore R10.4+ read filename(s) (separated by space)
   --threads               Number of cores [1]

# Nanopore assembly
metaMDBG asm --out-dir ./outputDir/ --in-ont reads.fastq.gz --threads 4
# Hifi assembly
metaMDBG asm --out-dir ./outputDir/ --in-hifi reads.fastq.gz --threads 4
# Multiple sample co-assembly
metaMDBG asm --out-dir ./outputDir/ --in-ont reads_A.fastq.gz reads_B.fastq.gz reads_C.fastq.gz --threads 4

MetaMDBG will generate polished contigs in outputDir ("contigs.fasta.gz").
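When runs are launched from a script, the commands above can be assembled programmatically. The following is a minimal sketch that only uses the basic options listed in the usage text; the helper name `build_asm_command` and its parameters are hypothetical and not part of metaMDBG.

```python
from typing import List, Sequence


def build_asm_command(
    out_dir: str,
    reads: Sequence[str],
    technology: str = "ont",
    threads: int = 1,
) -> List[str]:
    """Build an argument list for `metaMDBG asm` (hypothetical helper)."""
    # Map the sequencing technology to the documented input flag.
    flags = {"ont": "--in-ont", "hifi": "--in-hifi"}
    if technology not in flags:
        raise ValueError(f"technology must be one of {sorted(flags)}")
    if not reads:
        raise ValueError("at least one read file is required")
    # Several read files after one input flag request a co-assembly.
    return [
        "metaMDBG", "asm",
        "--out-dir", out_dir,
        flags[technology], *reads,
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    cmd = build_asm_command(
        "./outputDir/", ["reads_A.fastq.gz", "reads_B.fastq.gz"], threads=4
    )
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the metaMDBG executable is on the PATH.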

Contig information

Contig information is stored in the contig headers of the resulting fasta assembly file. Example:

>ctg112 length=7013 coverage=6 circular=yes
ACGTAGCTTATAGCGAGTATCG...
>ctg37 length=1988 coverage=3 circular=no
ATTATTGATTAGGGCTATGCAT...
>ctg82 length=3824 coverage=13 circular=no
AATTCCGGCGGCGTATTATTAC...

Headers are composed of several fields separated by spaces.

  • ctgID: the name of the contig
  • length: the length of the contig in bps
  • coverage: the estimated read coverage of the contig
  • circular: whether the contig is circular (yes or no)
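The header fields described above can be read with a few lines of Python. This is a minimal sketch based only on the example headers shown here; the function names are hypothetical, and coverage is parsed as a float on the assumption that it is not always an integer.

```python
import gzip
from typing import Dict, Iterator, Union

ContigInfo = Dict[str, Union[str, int, float, bool]]


def parse_contig_header(header: str) -> ContigInfo:
    """Parse a header such as '>ctg112 length=7013 coverage=6 circular=yes'."""
    fields = header.lstrip(">").split()
    if not fields:
        raise ValueError("empty header")
    # The first field is the contig name; the others are key=value pairs.
    raw = dict(field.split("=", 1) for field in fields[1:])
    return {
        "name": fields[0],
        "length": int(raw["length"]),
        "coverage": float(raw["coverage"]),
        "circular": raw["circular"] == "yes",
    }


def iter_contig_headers(path: str) -> Iterator[ContigInfo]:
    """Yield parsed headers from a gzipped fasta file such as contigs.fasta.gz."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                yield parse_contig_header(line)
```

For example, `[c["name"] for c in iter_contig_headers("outputDir/contigs.fasta.gz") if c["circular"]]` would list the circular contigs of a run.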

Resume an existing run (checkpoint system)

If an assembly run stops for any reason, simply resubmit the same command. MetaMDBG will automatically skip completed steps and resume from the last checkpoint.

Advanced usage

# Filter out reads with low average per-base quality (using phred score)
metaMDBG asm --out-dir ./outputDir/ --in-ont reads.fastq.gz --min-read-quality 10

# Filter output contigs (by length and by coverage)
metaMDBG asm --out-dir ./outputDir/ --in-ont reads.fastq.gz --min-contig-length 500 --min-contig-coverage 2

# Skip correction step (useful if using corrected reads)
metaMDBG asm --out-dir ./outputDir/ --in-ont reads.fastq.gz --skip-correction

# Filter out unique k-min-mers to improve performance.
# Useful for scaling to very large datasets, but may reduce assembly quality and completeness.
# By default, metaMDBG attempts to rescue low-abundance genomic k-min-mers.
metaMDBG asm --out-dir ./outputDir/ --in-ont reads.fastq.gz --min-abundance 2

# Stop assembly after reaching k-th iteration.
metaMDBG asm --out-dir ./outputDir/ --in-ont reads.fastq.gz --max-k 11
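To get a feel for what a read-quality threshold such as 10 means, the sketch below computes a mean Phred score from a FASTQ quality line. It assumes the standard Phred+33 encoding and averages error probabilities rather than raw scores; how metaMDBG itself averages qualities is not stated here, so treat this as an illustration and not as the tool's implementation. The function names are hypothetical.

```python
import math


def mean_read_quality(quality_string: str, offset: int = 33) -> float:
    """Return the mean Phred quality of a read from its FASTQ quality line."""
    if not quality_string:
        raise ValueError("empty quality string")
    # Convert each ASCII character to an error probability: p = 10^(-Q/10).
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    # Convert the mean error probability back to a Phred score.
    return -10 * math.log10(mean_error)


def passes_quality_filter(quality_string: str, min_quality: float = 10.0) -> bool:
    """Return True if the read's mean quality reaches the threshold."""
    return mean_read_quality(quality_string) >= min_quality
```

Averaging error probabilities penalizes low-quality bases more than averaging the scores directly: a read with one Q40 base and one Q10 base scores about 13 here, not 25.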

Generating an assembly graph

After a successful metaMDBG run, an assembly graph (.gfa) can be generated with the following command.

metaMDBG gfa --assembly-dir ./assemblyDir/ --k 21 --contigpath --readpath --threads 4

The assembly dir must be a metaMDBG output dir (the one containing the contig file "contigs.fasta.gz"). The --k parameter corresponds to the resolution level of the graph: lower k values produce a graph with high connectivity but shorter unitigs, while higher k values produce a more fragmented graph with longer unitigs. The two optional parameters --contigpath and --readpath generate the paths of contigs and reads in the graph, respectively.

First, display the available k values and their corresponding sequence lengths in bps (these lengths are equivalent to the k-mer size that would be used in a traditional de Bruijn graph).

metaMDBG gfa --assembly-dir ./assemblyDir/ --k 0

Then, choose a k value and produce the graph (optionally add the parameters --contigpath and/or --readpath).

metaMDBG gfa --assembly-dir ./assemblyDir/ --k 21

MetaMDBG will generate the assembly graph in the GFA format in assemblyDir (e.g. "assemblyGraph_k21_4013bps.gfa").
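When several graphs are generated at different k values, the k value and the corresponding length in bps can be recovered from the file name. The pattern below is inferred from the single example name above and is therefore an assumption; the helper name is hypothetical.

```python
import re
from typing import Tuple

# Pattern inferred from the example "assemblyGraph_k21_4013bps.gfa" (assumption).
_GRAPH_NAME = re.compile(r"assemblyGraph_k(\d+)_(\d+)bps\.gfa$")


def parse_graph_filename(filename: str) -> Tuple[int, int]:
    """Return (k, length in bps) encoded in an assembly graph file name."""
    match = _GRAPH_NAME.search(filename)
    if match is None:
        raise ValueError(f"unexpected graph file name: {filename}")
    return int(match.group(1)), int(match.group(2))
```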

Note 1) Unitig sequences in the gfa file are not polished; they have the same error rate as the original reads.

Note 2) Generating the unitig sequences requires a pass over the original reads that produced the assembly. If you have moved the original readsets, edit the file ./assemblyDir/tmp/input.txt with the new paths.

Note 3) In nanopore mode, the read paths are not very accurate because of the high error rate. We recommend using a dedicated aligner instead, such as GraphAligner.
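To compare graphs produced at different k values (connectivity versus unitig length), a small summary of the GFA file can help. The sketch below assumes the file uses standard GFA 1 segment (S) and link (L) records; which optional tags metaMDBG writes is not described here, so only the mandatory columns and the standard LN tag are read. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_gfa(lines: Iterable[str]) -> Tuple[Dict[str, int], Counter]:
    """Return unitig lengths and the number of link ends touching each unitig."""
    lengths: Dict[str, int] = {}
    degree: Counter = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if cols[0] == "S" and len(cols) >= 3:
            # Segment record: S <name> <sequence> [tags]
            if cols[2] != "*":
                lengths[cols[1]] = len(cols[2])
            else:
                # Sequence omitted: fall back to the LN:i:<length> tag if present.
                ln = [int(tag[5:]) for tag in cols[3:] if tag.startswith("LN:i:")]
                lengths[cols[1]] = ln[0] if ln else 0
        elif cols[0] == "L" and len(cols) >= 5:
            # Link record: L <from> <orient> <to> <orient> <overlap>
            degree[cols[1]] += 1
            degree[cols[3]] += 1
    return lengths, degree
```

Passing an open file handle, for instance `summarize_gfa(open("assemblyGraph_k21_4013bps.gfa"))`, yields the per-unitig lengths and link counts for one graph.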

Results

Source data: mags.tsv errors.tsv perf.tsv

Alignment and binning were performed with minimap2 and SemiBin2. Completeness and contamination were measured with CheckM2 (near-complete: ≥90% completeness and ≤5% contamination; medium: ≥50% completeness and ≤5% contamination). Clipping events and zero-coverage regions were identified using the anvi-script-find-misassembly program from the anvi’o platform. All assemblers were run with 32 cores.
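The quality thresholds above translate directly into a small classifier. This is a sketch of the stated criteria only; the function name and the "low" label for bins that meet neither criterion are hypothetical.

```python
def classify_mag(completeness: float, contamination: float) -> str:
    """Classify a bin from its completeness and contamination percentages."""
    if contamination <= 5.0:
        if completeness >= 90.0:
            return "near-complete"
        if completeness >= 50.0:
            return "medium"
    # Label chosen for this sketch; the text does not name this category.
    return "low"
```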

Sample | Accession | # bases (Gb) | N50 read length (kb) | Average quality score
Human Gut 1 (ONT)ERR15285694507.823.2
Human Gut 2 (ONT)SRR299809727727.217.3
Oral (ONT)DRR582205241521.7
Soil Microflora (ONT)ERR115236651035.417.1
Human Gut 1 (HiFi)ERR15289675508.934
Human Gut 2 (HiFi)SRR1527521318.511.445
Anaerobic Digester (HiFi)ERR109057436710.240.6
Sea Water (HiFi)ERR9769281228.235

Results on bacterial isolates

Sample | Average quality score | # reference genomes | # contigs | Genome fraction (%) | Duplication ratio | # mismatches per 100 kbp | # indels per 100 kbp | # misassemblies | # local misassemblies
Escherichia_coli_1 | 20.97 | 6 | 6 | 99.999 | 1.000 | 0.00 | 0.14 | 0 | 0
Mycobacterium_avium_1 | 28.08 | 1 | 1 | 100.000 | 1.000 | 0.00 | 0.05 | 0 | 0
Phascolarctobacterium_faecium_1 | 15.45 | 1 | 1 | 100.000 | 1.000 | 8.60 | 12.90 | 0 | 0
Escherichia_coli_2 | 16.23 | 5 | 4 | 99.905 | 1.000 | 1.09 | 1.47 | 0 | 1
Saccharolobus_islandicus_1 | 27.54 | 1 | 1 | 99.838 | 1.000 | 0.00 | 0.00 | 0 | 0
Agarivorans_1 | 20.73 | 1 | 1 | 100.000 | 1.000 | 0.28 | 0.12 | 0 | 0

MetaMDBG was applied to bacterial isolate samples (each containing a single bacterial genome and, in some cases, multiple plasmids). The linked studies provide the original sequencing reads and the reference genomes assembled using Autocycler. Assembly quality was evaluated using QUAST.
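The mismatch and indel rates reported by QUAST can be turned into an approximate consensus quality value (QV) for easier comparison with read quality scores. The conversion below is a rough sketch that treats every mismatch and indel as one error per aligned base; it is not a metric reported by the benchmark itself, and the function name is hypothetical.

```python
import math


def consensus_qv(mismatches_per_100kbp: float, indels_per_100kbp: float) -> float:
    """Approximate Phred-scaled consensus quality from QUAST error rates."""
    errors = mismatches_per_100kbp + indels_per_100kbp
    if errors == 0:
        # No observed error: the QV is unbounded at this resolution.
        return math.inf
    # Error rate per base, then the Phred transform QV = -10 * log10(rate).
    return -10 * math.log10(errors / 100_000)
```

Reading the Phascolarctobacterium_faecium_1 row as 8.60 mismatches and 12.90 indels per 100 kbp, this gives a QV of roughly 36.7.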

License

metaMDBG is freely available under the MIT License.

Citation