Barbell - eval code
April 10, 2026 ยท View on GitHub
Code to reproduce resuls in the paper:
Beeloo R, Groot Koerkamp R, Jia X, Broekhuizen-Stins MJ, van IJken L, Broens EM, Zomer A, Dutilh BE. Barbell: Resolving Demultiplexing and Trimming Issues in Nanopore Data. bioRxiv. [2025].
Getting the reads
Retrieve the reads from SRA, and get the file path which you change in config.yml at FASTQ_FILE=.
Setting up the rest on the config
After the reads there are a few other variables ot set in config.yml.
Such as root, where to output all the files, and THREADS depending on how many threads you
allow each of the used tools to use.
After this we are good to go to set up the environment!
Installing the tools (using Pixi)
All tools are in pixi.toml and can be directly installed using pixi.
We assume you have Rust as well, if not install it
- First install pixi (see homepage)
- Make sure you are in the environment folder (i.e. where
pixi.tomlis) - Run
pixi install, installing all python and conda tools - Run
pixi run setup-environment, installs custom builds (e.g. barbell, checkm2, etc) and creates symlinks - Activate the shell
pixi shell - Change variables in
config.ymlif you want to use different fastq file, output files, etc. - Now we run
eval "$(pixi run load-vars)"to load all varibles to the ENV (note eval is important). This also creates config_frozen.yml - Check if it worked using
echo $BASE_THREADS, which should print the specified threads in config.yml
Note all tools are in ./tools but as soon as you run pixi shell our TOML should add all
to your PATH (as long as you are in the env)
Running everything step by step
We go over each of the seperate commands to run all tools as automatically running is more complicated with dependencies between results
Demuxing
Go to demux and run:
just barbelljust doradojust flexiplexEach are independent so you can use multiple machines (just don't forget to also activate the pixi shell in that case). Then to get the final demux stats (total, assigned, etc):just stats
Pattern analysis
This is for the first part of the paper, create the pattern table, the trimming figure, and the sequence logos. This also includes some taxonomic analysis for double barcodes but these can only be executed after we run taxonomy below
Go to demux and run:
just parse, converts tool outputs to "read_id -> barcode".just table, creates tables with all tool outputs combined (number of reads are intotal_reads.txt)just overview, table 1 of the paperjust trim, create data and figure1 for the paper in figuresjust fusion, looks for "fusion" reads this uses mmseqs2 for querying, the output goes to ligations as we have many files including downloaded reads from SRR29097612 and SRR18117839. Stats specifically are written to fusion_stats.tsvjust fusion_overall, this performs mmseq search of fusion patterns against all untrimmed reads which can take a while. The stats are appended to fusion_stats.tsv with the "ALL:" prefix along with a new table ALL_fusion_patterns.tsv
Assemblies
Go to assemblies.
First we filter reads based on read length (>=1000) and quality (>=Q15). This filters each of the barcode files which can take some time:
just filtlong_doradojust filtlong_barbelljust asm_doradojust asm_barbellTo run Medaka we have to switch to a GPU environment:- Go to Medaka-gpu, run
pixi install, thenpixi shell. In casepixi shellfails, where the shell cannot be initialized useeval "$(pixi shell-hook)". Installing pytorch might take a bit - Verify if we indeed have GPU-torch:
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())". If successful, move back one folder (cd ..) so we are in "assemblies" again. just medaka_doradojust medaka_barbell- Switch back to previous environment, first go to environment again and
pixi shell just checkm2_download, around2.9GBjust checkm2_barbelljust checkm2_doradojust checkm2_compare, to compare checkm2 values among Dorado and Barbelljust contam, looking for contamination in assembliesjust contam_map, mapping reads back to contaminated regions in this case forbarcode49,contig_67first 92 bases (paper example)
Taxonomy
Go to taxonomy
First we have to download the Centriguer databases:
- Refseq (41GB): https://zenodo.org/records/10023239, and place in taxonomy/data/refseq. Such that the refseq folder contains the
cfr_hpv+gbsarscov2.*.cfrfiles. - Fungi (166GB): https://www.dropbox.com/scl/fo/tkoge61zh199i9n2l7891/AL_8z4zzTm0tHvHJh8UrIdg?rlkey=3abt78pbbtn9v5w2ulz0nfjw2&st=m1ir66th&dl=0, which has GTDB + Fungi, and place that in taxonomy/data/fungi, such that it contains the
cfr_gtdb_r226+refseq_hvfc.*.cfrfiles. If you already downloaded these files elsewhere, copying them can take a while so instead you can also symlink them to match the above paths.
Then we are ready to annotate the reads and assemblies.
just read_taxo_barbell_refseq, annotates barbell trimmed reads using RefSeqjust read_taxo_dorado_refseq, annotates dorado trimmed reads using Doradojust read_taxo_barbell_gtdb, same but using gtdb + fungijust read_taxo_dorado_gtdb, same but using gtdb + fungijust asm_taxo_barbell_refseq, annotates barbell assembliesjust asm_taxo_dorado_refseq, annotates dorado assembliesjust asm_taxo_barbell_gtdb, same using gtdbjust asm_taxo_dorado_gtdb, same using gtdbjust taxon_tablesjust taxon_read_comparejust taxon_asm_comparejust phage_check, this involves many downloads using APIs which is quite slow
After having the taxonomy annotation we execute one more script in patterns:
- Go back to the patterns folder
just fusion_taxo
Scoring
To compare the cut-offs we use we first have to establish some sort of "ground truth" for each of the reads. Since we have to run this on raw reads, to prevent demuxing from biasing the selection already we:
- Get read reads > 1500bp
- trim 200bp of each side
- Use centrifuge to annotate the reads For this we first have to run "taxonomy" stuff, so go to taxonomy:
just read_taxo_rawjust read_taxo_mapNow we have a table with taxo annotation for each read which we use to benchmark the scoring against. However, the taxonomy is not enough as we just might not be able to annotate reads, hence we also run sassy with a low edit threshold (4) against the reads to supplement the ground truth. We go back to scoring:just gridjust compare-gridjust summarize-gridjust sassyjust summarize-sassyThen we can greate a heatmap plot with the--min-scoreand--min-score-diffon the axis vs correct/total:just plot
Then to compare just the scoring schemes themselves:
just compare
Public
To scan the public database we need to core nt database which is ~810GB and can be downloaded from
here, then create a single FASTA and we store it in
data and call it core_nt_full.fasta.
just download, to download adjusted version ofsassythat skipsNcharsjust search_ncbito search the rapid/native contaminationjust parse_ncbito parse the search results, including additional barcode edit check