Barbell - eval code

April 10, 2026 ยท View on GitHub

Code to reproduce resuls in the paper:

Beeloo R, Groot Koerkamp R, Jia X, Broekhuizen-Stins MJ, van IJken L, Broens EM, Zomer A, Dutilh BE. Barbell: Resolving Demultiplexing and Trimming Issues in Nanopore Data. bioRxiv. [2025].

Getting the reads

Retrieve the reads from SRA, and get the file path which you change in config.yml at FASTQ_FILE=.

Setting up the rest on the config

After the reads there are a few other variables ot set in config.yml. Such as root, where to output all the files, and THREADS depending on how many threads you allow each of the used tools to use. After this we are good to go to set up the environment!

Installing the tools (using Pixi)

All tools are in pixi.toml and can be directly installed using pixi. We assume you have Rust as well, if not install it

  1. First install pixi (see homepage)
  2. Make sure you are in the environment folder (i.e. where pixi.toml is)
  3. Run pixi install, installing all python and conda tools
  4. Run pixi run setup-environment, installs custom builds (e.g. barbell, checkm2, etc) and creates symlinks
  5. Activate the shell pixi shell
  6. Change variables in config.yml if you want to use different fastq file, output files, etc.
  7. Now we run eval "$(pixi run load-vars)" to load all varibles to the ENV (note eval is important). This also creates config_frozen.yml
  8. Check if it worked using echo $BASE_THREADS, which should print the specified threads in config.yml

Note all tools are in ./tools but as soon as you run pixi shell our TOML should add all to your PATH (as long as you are in the env)

Running everything step by step

We go over each of the seperate commands to run all tools as automatically running is more complicated with dependencies between results

Demuxing

Go to demux and run:

  • just barbell
  • just dorado
  • just flexiplex Each are independent so you can use multiple machines (just don't forget to also activate the pixi shell in that case). Then to get the final demux stats (total, assigned, etc):
  • just stats

Pattern analysis

This is for the first part of the paper, create the pattern table, the trimming figure, and the sequence logos. This also includes some taxonomic analysis for double barcodes but these can only be executed after we run taxonomy below

Go to demux and run:

  • just parse, converts tool outputs to "read_id -> barcode".
  • just table, creates tables with all tool outputs combined (number of reads are in total_reads.txt)
  • just overview, table 1 of the paper
  • just trim, create data and figure1 for the paper in figures
  • just fusion, looks for "fusion" reads this uses mmseqs2 for querying, the output goes to ligations as we have many files including downloaded reads from SRR29097612 and SRR18117839. Stats specifically are written to fusion_stats.tsv
  • just fusion_overall, this performs mmseq search of fusion patterns against all untrimmed reads which can take a while. The stats are appended to fusion_stats.tsv with the "ALL:" prefix along with a new table ALL_fusion_patterns.tsv

Assemblies

Go to assemblies.

First we filter reads based on read length (>=1000) and quality (>=Q15). This filters each of the barcode files which can take some time:

  • just filtlong_dorado
  • just filtlong_barbell
  • just asm_dorado
  • just asm_barbell To run Medaka we have to switch to a GPU environment:
  • Go to Medaka-gpu, run pixi install, then pixi shell. In case pixi shell fails, where the shell cannot be initialized use eval "$(pixi shell-hook)". Installing pytorch might take a bit
  • Verify if we indeed have GPU-torch: python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())". If successful, move back one folder (cd ..) so we are in "assemblies" again.
  • just medaka_dorado
  • just medaka_barbell
  • Switch back to previous environment, first go to environment again and pixi shell
  • just checkm2_download, around 2.9GB
  • just checkm2_barbell
  • just checkm2_dorado
  • just checkm2_compare, to compare checkm2 values among Dorado and Barbell
  • just contam, looking for contamination in assemblies
  • just contam_map, mapping reads back to contaminated regions in this case for barcode49, contig_67 first 92 bases (paper example)

Taxonomy

Go to taxonomy

First we have to download the Centriguer databases:

Then we are ready to annotate the reads and assemblies.

  • just read_taxo_barbell_refseq, annotates barbell trimmed reads using RefSeq
  • just read_taxo_dorado_refseq, annotates dorado trimmed reads using Dorado
  • just read_taxo_barbell_gtdb, same but using gtdb + fungi
  • just read_taxo_dorado_gtdb, same but using gtdb + fungi
  • just asm_taxo_barbell_refseq, annotates barbell assemblies
  • just asm_taxo_dorado_refseq, annotates dorado assemblies
  • just asm_taxo_barbell_gtdb, same using gtdb
  • just asm_taxo_dorado_gtdb, same using gtdb
  • just taxon_tables
  • just taxon_read_compare
  • just taxon_asm_compare
  • just phage_check, this involves many downloads using APIs which is quite slow

After having the taxonomy annotation we execute one more script in patterns:

  • Go back to the patterns folder
  • just fusion_taxo

Scoring

To compare the cut-offs we use we first have to establish some sort of "ground truth" for each of the reads. Since we have to run this on raw reads, to prevent demuxing from biasing the selection already we:

  • Get read reads > 1500bp
  • trim 200bp of each side
  • Use centrifuge to annotate the reads For this we first have to run "taxonomy" stuff, so go to taxonomy:
  • just read_taxo_raw
  • just read_taxo_map Now we have a table with taxo annotation for each read which we use to benchmark the scoring against. However, the taxonomy is not enough as we just might not be able to annotate reads, hence we also run sassy with a low edit threshold (4) against the reads to supplement the ground truth. We go back to scoring:
  • just grid
  • just compare-grid
  • just summarize-grid
  • just sassy
  • just summarize-sassy Then we can greate a heatmap plot with the --min-score and --min-score-diff on the axis vs correct/total:
  • just plot

Then to compare just the scoring schemes themselves:

  • just compare

Public

To scan the public database we need to core nt database which is ~810GB and can be downloaded from here, then create a single FASTA and we store it in data and call it core_nt_full.fasta.

  • just download, to download adjusted version of sassy that skips N chars
  • just search_ncbi to search the rapid/native contamination
  • just parse_ncbi to parse the search results, including additional barcode edit check