Automated strain separation of low-complexity metagenomes
November 10, 2021

Strainberry is a method that performs strain separation in low-complexity metagenomes using error-prone long-read technologies. It exploits state-of-the-art tools for variant calling, haplotype phasing, and genome assembly, in order to achieve single-sample assembly of strains with higher quality than other state-of-the-art long-read assemblers.
System requirements
Strainberry has been developed and tested on Linux. Installing and running it requires the following tools:
- GNU bash (version 4 or later recommended)
- miniconda3
Installation
The simplest (and recommended) way to install Strainberry dependencies is through conda in an isolated environment (e.g., named sberry):
git clone https://github.com/rvicedomini/strainberry.git
cd strainberry
conda env create -n sberry --file environment.yml
The whole installation process should take about 5-10 minutes.
To make the strainberry command available, it is advised to include the absolute path of Strainberry's directory in your PATH environment variable by adding the following line to your ~/.bashrc file:
export PATH=/absolute/path/to/strainberry:${PATH}
Updating to the latest version
Assuming the Strainberry conda environment is named sberry, the following commands update Strainberry to the latest (development) version:
cd strainberry
git pull
conda env create -n sberry --file environment.yml --force
Usage
Activate the Strainberry conda environment:
conda activate sberry
Running Strainberry:
strainberry [options] -r FASTA -b BAM -o OUTPUT_DIR
where FASTA is a strain-oblivious metagenome assembly (e.g., generated with metaFlye) and BAM is a coordinate-sorted long-read alignment in BAM format.
Both FASTA and BAM files are expected to be indexed with the samtools faidx and samtools index commands, respectively.
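The indexing requirement above can be checked before launching a run. The following is a minimal sketch, assuming the default index names that samtools writes next to its input (FASTA.fai and BAM.bai). The function name check_sberry_inputs is hypothetical and not part of Strainberry.

```shell
# Hypothetical helper (not part of Strainberry): verify that the inputs and
# their samtools indexes exist and are non-empty before launching a run.
# Assumption: default index names, i.e. <fasta>.fai and <bam>.bai.
check_sberry_inputs() {
    local fasta="$1" bam="$2" missing=0
    local f
    for f in "$fasta" "$fasta.fai" "$bam" "$bam.bai"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    # Exit status 0 only if all four files were found.
    return "$missing"
}
```

For example, check_sberry_inputs assembly.fasta alignment.sorted.bam && strainberry -r assembly.fasta -b alignment.sorted.bam -o OUTPUT_DIR starts Strainberry only when both index files are present.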
Strainberry's output is stored in OUTPUT_DIR.
After Strainberry finishes, the conda environment can be deactivated with the command:
conda deactivate
Output files
The output directory of Strainberry has the following structure:
OUTPUT_DIR/
├── strainberry_n2/
├── strainberry_n3/
├── ...
├── strainberry_nK/
├── assembly.scaffolds.bam
├── assembly.scaffolds.bam.bai
├── assembly.scaffolds.fa
└── assembly.scaffolds.fa.fai
The Strainberry output assembly is stored in the assembly.scaffolds.fa file.
A minimap2-based alignment of input reads on the output assembly is also available in the assembly.scaffolds.bam file.
All sub-directories named strainberry_n{k} contain log files and intermediate results of Strainberry iterations (e.g., the VCF of variants found by mapping reads to contigs).
After a successful run of Strainberry, these sub-directories can be deleted.
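One way to remove the intermediate sub-directories is sketched below. It assumes the output layout shown above. The function name clean_sberry_intermediates is hypothetical, and the check for a non-empty assembly.scaffolds.fa is only a rough proxy for a successful run.

```shell
# Hypothetical helper (not part of Strainberry): delete the intermediate
# strainberry_n{k} sub-directories and keep the final assembly files.
clean_sberry_intermediates() {
    local outdir="$1"
    # Refuse to remove anything if the final assembly is absent or empty,
    # since the logs may then be needed to diagnose a failed run.
    if [ ! -s "$outdir/assembly.scaffolds.fa" ]; then
        echo "no final assembly in $outdir; nothing removed" >&2
        return 1
    fi
    # Only direct children of the output directory are considered.
    find "$outdir" -mindepth 1 -maxdepth 1 -type d -name 'strainberry_n*' \
        -exec rm -r -- {} +
}
```

Calling clean_sberry_intermediates OUTPUT_DIR leaves only the assembly.scaffolds.* files in place.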
Example
In order to verify that Strainberry has been correctly installed, it is possible to test it on a small dataset in the example sub-directory.
Generating the input from the reads (optional)
To generate a strain-oblivious assembly and a read alignment, we recommend using metaFlye, minimap2, and samtools. Assuming these tools are available, the following commands can be run, here using 12 threads:
cd example
flye --meta --pacbio-raw reads.fq.gz --out-dir flye_out --genome-size 300k --threads 12
minimap2 -ax map-pb -t 12 ./flye_out/assembly.fasta reads.fq.gz | samtools sort >./flye_out/alignment.sorted.bam
samtools faidx ./flye_out/assembly.fasta
samtools index ./flye_out/alignment.sorted.bam
where --genome-size provides metaFlye with an estimate of the metagenome size (not required from version 2.8 onward).
The assembly and read alignment are then available in the flye_out directory as assembly.fasta and alignment.sorted.bam, respectively.
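The commands above are written for PacBio reads. For Oxford Nanopore reads the read-type options differ, and a small lookup such as the following sketch keeps them in one place. The function name read_presets is hypothetical. The Nanopore values (--nano-raw for metaFlye, map-ont for minimap2) are not taken from this README and should be checked against the installed versions of both tools.

```shell
# Hypothetical helper: map a read technology to the metaFlye input flag and
# the minimap2 preset. Prints "<flye-flag> <minimap2-preset>".
# Assumption: --nano-raw and map-ont are the raw-Nanopore counterparts of
# --pacbio-raw and map-pb in the installed tool versions.
read_presets() {
    case "$1" in
        pacbio)   echo "--pacbio-raw map-pb" ;;
        nanopore) echo "--nano-raw map-ont" ;;
        *)        echo "unknown technology: $1" >&2; return 1 ;;
    esac
}
```

For example, read -r flye_flag mm2_preset <<< "$(read_presets nanopore)" stores the two values, which can then replace --pacbio-raw and map-pb in the commands above. Nanopore runs of Strainberry additionally need the --nanopore option.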
Running Strainberry
Given a strain-oblivious assembly (file assembly.fasta) and a long-read mapping (file alignment.sorted.bam), it is possible to run Strainberry using 4 threads as follows:
cd example
strainberry -r assembly.fasta -b alignment.sorted.bam -o sberry_out -c 4
Strainberry should take around 5 minutes to finish. The file assembly.fasta contains a single sequence which is a consensus of a small region of E. coli strains K12 and W.
After a successful run of Strainberry, in the sberry_out directory, the file assembly.scaffolds.fa should contain two scaffolds (one closer to strain K12, the other closer to strain W).
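A quick way to check the number of scaffolds is to count the FASTA header lines. This is a minimal sketch, and the function name count_fasta_records is hypothetical.

```shell
# Count FASTA records by counting header lines (those starting with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}
```

Running count_fasta_records sberry_out/assembly.scaffolds.fa on the example is expected to print 2 if both strains were separated.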
Command line options
usage: strainberry -r PATH -b PATH -o PATH [--nanopore] [-n int] [-s float]
                   [-c int] [-h] [-V] [-v]

Automated strain separation of low-complexity metagenomes

Required arguments:
  -r PATH, --reference PATH
                        Strain-oblivious assembly in FASTA format
  -b PATH, --bam PATH   Read alignment in BAM format
  -o PATH, --out-dir PATH
                        Output directory of Strainberry assemblies

Optional arguments:
  --nanopore            Input consists of Oxford Nanopore reads
  -n int, --max-strains int
                        Attempt strain-separation at most for the provided
                        strain multiplicity [5]
  -s float, --snv-density float
                        Minimum SNV percentage to consider haplotype blocks
                        [0.1]
  -c int, --cpus int    Maximum number of CPUs to be used [1]

Other arguments:
  -h, --help            Show this help message and exit
  -V, --version         Show version number and exit
  -v, --verbose         Verbose output
Reference
If you use Strainberry in your work, please cite:
R. Vicedomini, C. Quince, A. E. Darling, R. Chikhi, Strainberry: automated strain separation in low-complexity metagenomes using long reads, Nature Communications (2021). doi: 10.1038/s41467-021-24515-9