Barbell - eval code

April 10, 2026 · View on GitHub

Code to reproduce resuls in the paper:

Beeloo R, Groot Koerkamp R, Jia X, Broekhuizen-Stins MJ, van IJken L, Broens EM, Zomer A, Dutilh BE. Barbell: Resolving Demultiplexing and Trimming Issues in Nanopore Data. bioRxiv. [2025].

Getting the reads

Retrieve the reads from SRA, and get the file path which you change in config.yml at FASTQ_FILE=.

Setting up the rest on the config

After the reads there are a few other variables ot set in config.yml. Such as root, where to output all the files, and THREADS depending on how many threads you allow each of the used tools to use. After this we are good to go to set up the environment!

Installing the tools (using Pixi)

All tools are in pixi.toml and can be directly installed using pixi. We assume you have Rust as well, if not install it

First install pixi (see homepage)
Make sure you are in the environment folder (i.e. where pixi.toml is)
Run pixi install, installing all python and conda tools
Run pixi run setup-environment, installs custom builds (e.g. barbell, checkm2, etc) and creates symlinks
Activate the shell pixi shell
Change variables in config.yml if you want to use different fastq file, output files, etc.
Now we run eval "$(pixi run load-vars)" to load all varibles to the ENV (note eval is important). This also creates config_frozen.yml
Check if it worked using echo $BASE_THREADS, which should print the specified threads in config.yml

Note all tools are in ./tools but as soon as you run pixi shell our TOML should add all to your PATH (as long as you are in the env)

Running everything step by step

We go over each of the seperate commands to run all tools as automatically running is more complicated with dependencies between results

Demuxing

Go to demux and run:

just barbell
just dorado
just flexiplex Each are independent so you can use multiple machines (just don't forget to also activate the pixi shell in that case). Then to get the final demux stats (total, assigned, etc):
just stats

Pattern analysis

This is for the first part of the paper, create the pattern table, the trimming figure, and the sequence logos. This also includes some taxonomic analysis for double barcodes but these can only be executed after we run taxonomy below

Go to demux and run:

just parse, converts tool outputs to "read_id -> barcode".
just table, creates tables with all tool outputs combined (number of reads are in total_reads.txt)
just overview, table 1 of the paper
just trim, create data and figure1 for the paper in figures
just fusion, looks for "fusion" reads this uses mmseqs2 for querying, the output goes to ligations as we have many files including downloaded reads from SRR29097612 and SRR18117839. Stats specifically are written to fusion_stats.tsv
just fusion_overall, this performs mmseq search of fusion patterns against all untrimmed reads which can take a while. The stats are appended to fusion_stats.tsv with the "ALL:" prefix along with a new table ALL_fusion_patterns.tsv

Assemblies

Go to assemblies.

First we filter reads based on read length (>=1000) and quality (>=Q15). This filters each of the barcode files which can take some time:

just filtlong_dorado
just filtlong_barbell
just asm_dorado
just asm_barbell To run Medaka we have to switch to a GPU environment:
Go to Medaka-gpu, run pixi install, then pixi shell. In case pixi shell fails, where the shell cannot be initialized use eval "$(pixi shell-hook)". Installing pytorch might take a bit
Verify if we indeed have GPU-torch: python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())". If successful, move back one folder (cd ..) so we are in "assemblies" again.
just medaka_dorado
just medaka_barbell
Switch back to previous environment, first go to environment again and pixi shell
just checkm2_download, around 2.9GB
just checkm2_barbell
just checkm2_dorado
just checkm2_compare, to compare checkm2 values among Dorado and Barbell
just contam, looking for contamination in assemblies
just contam_map, mapping reads back to contaminated regions in this case for barcode49, contig_67 first 92 bases (paper example)

Taxonomy

Go to taxonomy

First we have to download the Centriguer databases:

Refseq (41GB): https://zenodo.org/records/10023239, and place in taxonomy/data/refseq. Such that the refseq folder contains the cfr_hpv+gbsarscov2.*.cfr files.
Fungi (166GB): https://www.dropbox.com/scl/fo/tkoge61zh199i9n2l7891/AL_8z4zzTm0tHvHJh8UrIdg?rlkey=3abt78pbbtn9v5w2ulz0nfjw2&st=m1ir66th&dl=0, which has GTDB + Fungi, and place that in taxonomy/data/fungi, such that it contains the cfr_gtdb_r226+refseq_hvfc.*.cfr files. If you already downloaded these files elsewhere, copying them can take a while so instead you can also symlink them to match the above paths.

Then we are ready to annotate the reads and assemblies.

just read_taxo_barbell_refseq, annotates barbell trimmed reads using RefSeq
just read_taxo_dorado_refseq, annotates dorado trimmed reads using Dorado
just read_taxo_barbell_gtdb, same but using gtdb + fungi
just read_taxo_dorado_gtdb, same but using gtdb + fungi
just asm_taxo_barbell_refseq, annotates barbell assemblies
just asm_taxo_dorado_refseq, annotates dorado assemblies
just asm_taxo_barbell_gtdb, same using gtdb
just asm_taxo_dorado_gtdb, same using gtdb
just taxon_tables
just taxon_read_compare
just taxon_asm_compare
just phage_check, this involves many downloads using APIs which is quite slow

After having the taxonomy annotation we execute one more script in patterns:

Go back to the patterns folder
just fusion_taxo

Scoring

To compare the cut-offs we use we first have to establish some sort of "ground truth" for each of the reads. Since we have to run this on raw reads, to prevent demuxing from biasing the selection already we:

Get read reads > 1500bp
trim 200bp of each side
Use centrifuge to annotate the reads For this we first have to run "taxonomy" stuff, so go to taxonomy:
just read_taxo_raw
just read_taxo_map Now we have a table with taxo annotation for each read which we use to benchmark the scoring against. However, the taxonomy is not enough as we just might not be able to annotate reads, hence we also run sassy with a low edit threshold (4) against the reads to supplement the ground truth. We go back to scoring:
just grid
just compare-grid
just summarize-grid
just sassy
just summarize-sassy Then we can greate a heatmap plot with the --min-score and --min-score-diff on the axis vs correct/total:
just plot

Then to compare just the scoring schemes themselves:

just compare

Public

To scan the public database we need to core nt database which is ~810GB and can be downloaded from here, then create a single FASTA and we store it in data and call it core_nt_full.fasta.

just download, to download adjusted version of sassy that skips N chars
just search_ncbi to search the rapid/native contamination
just parse_ncbi to parse the search results, including additional barcode edit check