scNoiseMeter Documentation

June 24, 2026

Version 0.6.1


1. Overview and Purpose

scNoiseMeter quantifies technical noise in single-cell RNA-seq BAM files. It classifies every primary alignment into one of 16 mutually exclusive output categories (plus three pre-classification filter states) and reports per-sample and per-cell noise fractions, strand concordance, chimeric read rates, and artifact flag counts.

The tool is platform-agnostic: it processes ONT, PacBio/Kinnex, and short-read (Illumina, ElemBio) BAMs from 10x Genomics or BD Rhapsody kits, as well as Smart-seq / FLASH-seq plates (96- and 384-well). The same classification logic is used throughout, with platform-specific adjustments where the underlying biology differs.

Four subcommands are provided:

  • run — classify reads in a single BAM and produce QC metrics
  • run-plate — classify reads across a plate of Smart-seq / FLASH-seq wells and produce an aggregated plate report
  • compare — run on two BAMs (e.g. pre- and post-filter) and produce a side-by-side comparison
  • discover — scan a directory for BAM files, infer their parameters, and run scnoisemeter run on selected files

The tool requires a coordinate-sorted, indexed BAM file aligned to human GRCh38/hg38, and a GENCODE GTF annotation. Both the GTF and a PolyASite 3.0 atlas are downloaded automatically on first use if not supplied explicitly.

What's new in 0.6

  • User-configurable TSO (--tso, --tso-min-match, --no-polyg-tso, and per-side --tso-a/--tso-b in compare), added in 0.5: supply the template-switch oligo your library actually used instead of the built-in 10x/PacBio defaults.
  • Reverse-complement TSO matching (default-on). TSO invasion now matches both the forward TSO and its reverse complement, so reads whose TSO end maps antisense are no longer missed. Changes existing n_tso_invasion counts (they can only increase).
  • Strand-aware internal-priming (default-on). n_polya_priming and the intergenic-locus polyA-context check now look upstream for a T-run on minus-strand reads/loci, not only downstream for an A-run on the plus strand. Previously minus-strand internal priming was under-counted.
  • TSO concatemer metric (n_tso_concatemer): reads containing more than one occurrence of the TSO or its reverse complement, per Chou et al. (bioRxiv 2025.10.06.680646). Surfaced in read_metrics.tsv, the MultiQC JSON (tso_concatemer_frac), and the report's artifact panel.

Because the reverse-complement and strand-aware changes shift metric values for existing 10x/PacBio data, this is a minor version bump (0.6.0), not a patch.

What's new in 0.4

  • Vendor-neutral platform naming in all user-facing prose: short-read BAMs come from Illumina or ElemBio (AVITI) sequencers; 10x Genomics and BD Rhapsody are kit chemistries on top of those sequencers. Platform-enum values are unchanged.
  • 384-well plate support is explicit in docs and covered by unit tests for the sample-sheet parser (96- and 384-well geometries, both Sample_Name and PlateID/WellID input formats).
  • Simulation-based benchmark (tests/benchmark/) validates the classifier and per-cell noise metric against ground-truth labels. Exp 1 reaches 99.2% mean per-category accuracy across 8 read categories; Exp 2 reports per-cell noise fractions within ±0.01 of truth at every mixture level from 0 to 55% noise. Runs in under a minute from a GENCODE GTF and GRCh38 FASTA.
  • Every external URL in annotation_fetcher.py (GENCODE, PolyASite 3.0, PolyA_DB v4, FANTOM5, 10x whitelists) now supports a SCNM_*_URL environment-variable override. Use these when an upstream host or URL scheme changes, or behind a corporate mirror. PolyASite error messages name the exact probe template and point to the override.

What's new in 0.3

  • run-plate subcommand for Smart-seq / FLASH-seq plate data (96-well and 384-well).
  • Smart-seq platform support with unstranded noise definition (exonic_antisense excluded from noise) and platform-aware report messaging.
  • --parallel-wells for concurrent per-well processing; --plate-id to restrict processing to a subset of plates.
  • polyA and TSS site dictionaries cached to ~/.cache/scnoisemeter/ as compressed pickles, keyed on file metadata plus a hash of the first 64 KB. First load is ~35 s; subsequent loads are under 1 s.
  • Intergenic reclassification: reads at promoted intergenic_hotspot / intergenic_novel / intergenic_repeat loci now contribute to the correct category when computing per-cell and sample-level metrics. In 0.2.0 they were retained as intergenic_sparse, overstating the noise fraction by the share of reads in promoted intergenic_novel loci. Fixed in 0.3.1.
  • Sample-sheet handling: headerless CSVs are auto-detected using whole-token matching; BAM / sheet mismatches emit warnings instead of aborting.
  • Process-pool robustness: BrokenProcessPool (usually an OOM kill) is caught and reported with actionable guidance instead of causing a silent exit 0.

2. Installation

scNoiseMeter is installed from GitHub (it is not on PyPI):

pip install git+https://github.com/FullLengthFanatic/scnoisemeter.git

On a shared server without admin rights, add --user. Alternatively, install from a clone with pip install . (or pip install -e ".[dev]" for an editable development install with test tooling).

Dependencies (installed automatically):

  • pysam >= 0.22
  • pyranges >= 0.0.129
  • pandas >= 2.0
  • numpy >= 1.24
  • click >= 8.1
  • plotly >= 5.18
  • scipy >= 1.11
  • tqdm >= 4.66

The BAM must be coordinate-sorted and indexed before use:

samtools sort -o sorted.bam input.bam
samtools index sorted.bam

3. Read Categories

Every primary alignment receives exactly one category. The classification hierarchy is applied in the order listed; a read is assigned the first matching category.

| Category | String value | Definition |
|---|---|---|
| UNMAPPED | unmapped | Read did not align. Excluded from all fractions. |
| SECONDARY | secondary | SAM flag 0x100. Record is skipped entirely. |
| SUPPLEMENTARY | supplementary | SAM flag 0x800. Record is skipped; the SA tag on the primary alignment is what the chimeric detector uses. |
| MULTIMAPPER | multimapper | Primary alignment with NH tag > 1. Highest priority in the classification hierarchy, so any NH > 1 read is assigned this category regardless of its genomic context. |
| MITOCHONDRIAL | mitochondrial | Maps to the mitochondrial contig (chrM, MT, chrMT, or mitochondrion). |
| CHIMERIC | chimeric | SA tag present AND the split is inter-chromosomal, strand-discordant, or the same-strand intra-chromosomal distance exceeds the chimeric distance threshold (default 10,000 bp). For Illumina paired-end BAMs, also triggered when the absolute template length exceeds 1,000,000 bp. |
| EXONIC_SENSE | exonic_sense | Overlaps at least one annotated exon base on the correct strand. |
| EXONIC_ANTISENSE | exonic_antisense | Overlaps at least one annotated exon base on the wrong strand. |
| INTRONIC_JXNSPAN | intronic_jxnspan | Maps within an intron but the CIGAR contains an N operation near a splice site (candidate intron-retention or non-consensus transcript). |
| INTRONIC_PURE | intronic_pure | Maps entirely within an intron body with no junction signal. |
| INTRONIC_BOUNDARY | intronic_boundary | Spans an exon–intron boundary without a splice operation in the CIGAR (candidate incomplete reverse transcription). |
| INTERGENIC_REPEAT | intergenic_repeat | Intergenic read overlapping a RepeatMasker interval (requires --repeats). |
| INTERGENIC_HOTSPOT | intergenic_hotspot | Intergenic monoexonic locus above threshold with a genomic A-run (>= 6 As within 20 bp downstream of the modal read 3' end) and more than 50 bp from any annotated polyA site. Likely internal priming. |
| INTERGENIC_NOVEL | intergenic_novel | Intergenic locus above threshold with >= 80% strand consistency and >= 3 distinct barcodes, showing splice evidence (CIGAR N) and/or modal 3' end within 50 bp of an annotated polyA site. Candidate unannotated gene or extended 3' UTR. |
| INTERGENIC_SPARSE | intergenic_sparse | Intergenic read at a locus below the adaptive barcode threshold. Likely noise. |
| AMBIGUOUS | ambiguous | Overlaps a region shared by two or more genes where the gene types are not clearly distinguished by the sub-categories below. |
| AMBIGUOUS_COD_NCOD | ambiguous_cod_ncod | Overlaps a shared region between a protein-coding gene and a non-coding gene (lncRNA, pseudogene, etc.). |
| AMBIGUOUS_COD_COD | ambiguous_cod_cod | Overlaps a shared region between two protein-coding genes. |
| UNASSIGNED | unassigned | CB tag absent or not on the barcode whitelist. These reads are counted in the denominator but not attributed to a cell. |

Three categories — UNMAPPED, SECONDARY, SUPPLEMENTARY — are excluded from the CATEGORY_ORDER used for output columns and fraction computation. All other 16 categories appear in per-cell and per-sample output columns.
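
As an illustration of one rule from the table, the CHIMERIC test on the SA tag can be sketched as below. This is not scNoiseMeter's code: is_chimeric and its parameters are hypothetical names, the SA layout (rname,pos,strand,CIGAR,mapQ,NM; per entry) is the one defined by the SAM specification, and measuring distance between alignment start positions is an assumption. The paired-end template-length rule is left out.

```python
def is_chimeric(chrom: str, pos: int, is_reverse: bool, sa_tag: str,
                max_distance: int = 10_000) -> bool:
    """Return True if any supplementary segment makes the read chimeric.

    chrom / pos / is_reverse describe the primary alignment (pos is 1-based,
    like the position stored in the SA tag). Hypothetical helper.
    """
    for entry in sa_tag.split(";"):
        if not entry:  # the SA tag ends with a trailing ';'
            continue
        sa_chrom, sa_pos, sa_strand = entry.split(",")[:3]
        if sa_chrom != chrom:  # inter-chromosomal split
            return True
        if (sa_strand == "-") != is_reverse:  # strand-discordant split
            return True
        # Same strand, same chromosome: compare start-to-start distance
        # (an assumption; the tool may measure the gap differently).
        if abs(int(sa_pos) - pos) > max_distance:
            return True
    return False
```

With the default threshold, a same-strand split 4,000 bp away is not chimeric, while one 19,000 bp away is.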

Noise definitions

Two noise levels are reported:

Conservative noise (noise_read_frac, noise_base_frac) — includes reads that may represent genuine pre-mRNA capture. This is an upper bound on true noise:

EXONIC_ANTISENSE + INTRONIC_PURE + INTRONIC_BOUNDARY +
INTERGENIC_SPARSE + INTERGENIC_REPEAT + INTERGENIC_HOTSPOT + CHIMERIC

Strict noise (noise_read_frac_strict, noise_base_frac_strict) — only unambiguous RT/PCR/sequencing artifacts. Excludes INTRONIC_PURE and INTRONIC_BOUNDARY. This is a lower bound:

EXONIC_ANTISENSE + INTERGENIC_SPARSE + INTERGENIC_REPEAT +
INTERGENIC_HOTSPOT + CHIMERIC

The categories INTRONIC_JXNSPAN, INTERGENIC_NOVEL, AMBIGUOUS, AMBIGUOUS_COD_NCOD, and AMBIGUOUS_COD_COD are in neither noise set; their interpretation is ambiguous.

Unstranded mode. When --platform smartseq is set (Smart-seq / FLASH-seq / Smart-seq3), EXONIC_ANTISENSE is dropped from both the conservative and strict noise sets. Unstranded libraries produce sense and antisense reads in roughly equal proportion by design, so counting antisense reads as noise would inflate the metric.
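
The two noise sets and the unstranded adjustment can be written out as a short sketch. The function name noise_fractions and the dictionary-of-counts input are illustrative assumptions; the category sets and the n_reads_classified denominator come from the definitions above.

```python
from typing import Mapping, Tuple

# Category sets as listed in "Noise definitions".
CONSERVATIVE = {
    "exonic_antisense", "intronic_pure", "intronic_boundary",
    "intergenic_sparse", "intergenic_repeat", "intergenic_hotspot",
    "chimeric",
}
STRICT = CONSERVATIVE - {"intronic_pure", "intronic_boundary"}


def noise_fractions(counts: Mapping[str, int],
                    unstranded: bool = False) -> Tuple[float, float]:
    """Return (conservative, strict) noise read fractions.

    counts maps category string values to classified read counts and must
    include the non-noise categories so the denominator is complete.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    # Unstranded libraries (--platform smartseq) drop exonic_antisense.
    drop = {"exonic_antisense"} if unstranded else set()
    conservative = sum(counts.get(c, 0) for c in CONSERVATIVE - drop) / total
    strict = sum(counts.get(c, 0) for c in STRICT - drop) / total
    return conservative, strict
```

For 1,000 classified reads with 50 antisense, 100 intronic_pure, 20 intronic_boundary, 30 intergenic_sparse, and 10 chimeric reads, this gives 0.21 conservative and 0.09 strict; in unstranded mode the same counts give 0.16 and 0.04.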

Adaptive intergenic threshold

Intergenic reads are initially classified as INTERGENIC_SPARSE by the first pass of the classifier. A second pass clusters them into loci and evaluates each locus with a Poisson significance test against the expected read rate across all intergenic bases. Loci that pass the threshold are promoted to INTERGENIC_HOTSPOT, INTERGENIC_NOVEL, or INTERGENIC_REPEAT, and their reads are moved out of INTERGENIC_SPARSE in each affected cell's counts before per-cell and sample-level metrics are computed. This ordering matters: INTERGENIC_NOVEL is counted as ambiguous (neither noise nor signal), while INTERGENIC_SPARSE is counted as noise, so promotion to novel reduces the reported noise fraction.

Parameters:

  • Minimum distinct barcodes: max(3, 0.01% of total detected barcodes)
  • Minimum reads per locus: 5
  • Bonferroni-corrected p-value threshold: 0.01
  • Aggregation window: 500 bp
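
A minimal sketch of how these parameters could combine into a per-locus decision is shown below. It assumes a one-sided Poisson upper-tail test with the expected count taken as the genome-wide intergenic read rate times the window size; the exact form of the test in scNoiseMeter may differ, and poisson_sf, locus_passes, and their arguments are hypothetical names.

```python
import math


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # P(X = k), computed in log space to avoid overflow in the factorial.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > 0.0 and term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i  # P(X = i) from P(X = i - 1)
    return min(1.0, total)


def locus_passes(n_reads: int, n_barcodes: int, total_barcodes: int,
                 intergenic_reads: int, intergenic_bp: int,
                 n_loci_tested: int, window_bp: int = 500,
                 min_reads: int = 5, alpha: float = 0.01) -> bool:
    """Apply the read, barcode, and Bonferroni-corrected Poisson thresholds."""
    min_barcodes = max(3, 0.0001 * total_barcodes)  # 0.01% of barcodes
    if n_reads < min_reads or n_barcodes < min_barcodes:
        return False
    expected = intergenic_reads / intergenic_bp * window_bp
    p_value = poisson_sf(n_reads, expected)
    return p_value * n_loci_tested <= alpha  # Bonferroni correction
```

With 10,000 intergenic reads spread over 1 Gb, a 500 bp window expects 0.005 reads, so a locus with 5 reads from 3 barcodes passes easily; at a thousand-fold higher background the same locus would not.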

The per-read side-table used for profiling is reservoir-sampled at 500,000 records per sample. For samples under that threshold the reclassification is exact; above it, reclassification is applied proportionally per barcode using the sampled records as an unbiased estimator of the true distribution.
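
For readers unfamiliar with reservoir sampling, the sketch below shows the classic single-pass variant (Algorithm R), which keeps a uniform random sample of fixed size from a stream of unknown length. Whether scNoiseMeter uses this exact variant is an assumption; reservoir_sample is a hypothetical name.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], capacity: int,
                     seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of at most `capacity` records."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for n_seen, record in enumerate(records, start=1):
        if len(reservoir) < capacity:
            # Below capacity every record is kept, so the sample is exact.
            reservoir.append(record)
        else:
            # Replace a random slot with probability capacity / n_seen.
            j = rng.randrange(n_seen)
            if j < capacity:
                reservoir[j] = record
    return reservoir
```

A stream shorter than the capacity is returned unchanged, which corresponds to the exact-reclassification case described above.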


4. Artifact Flags

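Four artifact flags are computed per read: TSO invasion, internal polyA priming, TSO concatemer, and non-canonical junction. They are counted at the sample level, and the per-cell table (Section 5) carries per-cell counts for TSO invasion, polyA priming, and non-canonical junctions. Flags are not part of the classification hierarchy: a read carries the flag in addition to its category.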

TSO invasion (n_tso_invasion)

Detects reads whose soft-clipped bases (at either the 5′ or 3′ end) match a template-switching oligonucleotide (TSO) sequence. Detection requires at least 12 bp of match by default (see --tso-min-match) and checks both orientations: the forward TSO and its reverse complement. This catches reads whose TSO end maps antisense, because their soft-clip carries the reverse-complemented TSO.

TSO sequences checked (defaults):

  • 10x Genomics v3/v4: AAGCAGTGGTATCAACGCAGAGTACATGGG
  • PacBio Kinnex / IsoSeq: AAGCAGTGGTATCAACGCAGAGT

A poly-G tail of ≥ 6 bp in a soft-clip is also flagged as TSO-proximal (see --no-polyg-tso to disable for non-G-tailed chemistries).
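
The check on a single soft-clip can be sketched as follows. This is an illustration under stated assumptions: exact substring matching (the tool may tolerate mismatches), the reverse complement taken of the matched prefix, and a run of six G's anywhere in the clip standing in for the poly-G tail. clip_has_tso and revcomp are hypothetical helper names; extracting the soft-clipped bases from the BAM record is not shown.

```python
from typing import Sequence

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

# Built-in defaults listed above (10x Genomics v3/v4, PacBio Kinnex / IsoSeq).
DEFAULT_TSOS = (
    "AAGCAGTGGTATCAACGCAGAGTACATGGG",
    "AAGCAGTGGTATCAACGCAGAGT",
)


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T/N sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def clip_has_tso(clip: str, tsos: Sequence[str] = DEFAULT_TSOS,
                 min_match: int = 12, polyg: bool = True,
                 polyg_len: int = 6) -> bool:
    """Return True if a soft-clip looks like TSO invasion."""
    clip = clip.upper()
    for tso in tsos:
        # First min_match bases; a shorter TSO is used in full.
        prefix = tso.upper()[:min_match]
        if prefix in clip or revcomp(prefix) in clip:
            return True
    # Poly-G heuristic, independent of the TSO sequences.
    return polyg and "G" * polyg_len in clip
```

Passing polyg=False mirrors the effect described for --no-polyg-tso: only sequence matches count.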

Custom TSO sequences

TSO is a property of the library protocol, so you can tell scNoiseMeter exactly which oligo your sample used:

  • --tso SEQ supplies a custom TSO. Repeat the flag for more than one sequence. When given, it replaces the built-in 10x/PacBio defaults (it does not add to them), so detection reflects your protocol rather than mixing in oligos you did not use. Available in run, discover, and run-plate.
  • --tso-min-match N controls how many bases of the TSO must match. Detection looks for the first N bases of each TSO in the soft-clip, so a larger N is stricter and a smaller N is more permissive (default 12). If a TSO is shorter than N, the full sequence is used as the match requirement (a warning notes this). Sequences are validated to contain only A, C, G, T, N.
  • --no-polyg-tso disables the poly-G heuristic, leaving only TSO-sequence matches. The poly-G check (≥ 6 G's in a soft-clip) is independent of --tso and is on by default. Disable it when G-rich genomic regions or sequencing artifacts inflate the poly-G signal, or to count true TSO-sequence invasion in isolation. On adapter-trimmed pipelines (e.g. PacBio Kinnex after lima/skera) the TSO sequence has already been removed from soft-clips, so the metric is poly-G-only; run on pre-trim BAMs to capture the sequence signal.
  • In compare, --tso-a and --tso-b override the TSO per side, so two methods that used two different TSOs each get the correct sequence. A shared --tso acts as the default for both sides.

Both the forward TSO and its reverse complement are matched automatically, so you do not need to pass the reverse complement as a second --tso. This matters for chemistries whose TSO is not self-complementary (e.g. the poly-T-tailed TSO used by UltraMarathonRT, where reads can map in either orientation).

The TSO sequence(s) and the min-match used are recorded in the HTML report metadata table.

Example:

scnoisemeter run --bam sample.bam --gtf gencode.gtf \
  --tso AAGCAGTGGTATCAACGCAGAGTACATGGG --tso-min-match 12 \
  --output-dir out/

scnoisemeter compare --bam-a methodA.bam --bam-b methodB.bam --gtf gencode.gtf \
  --tso-a AAGCAGTGGTATCAACGCAGAGTACATGGG \
  --tso-b GCAGTGGTATCAACGCAGAGTACTCTGCGTTGATACCAC \
  --output-dir cmp/

Internal polyA priming (n_polya_priming)

Detects reads whose transcript 3′ end abuts an A-run in the reference genome, indicating the read likely originated from internal priming on an A-rich region rather than the true polyA tail. The check is strand-aware:

  • Forward reads: look 20 bp downstream (higher coordinate) of reference_end for an A-run on the + strand.
  • Reverse reads: look 20 bp upstream (lower coordinate) of reference_start for a T-run on the + strand (which is an A-run in the transcript's orientation).
  • Minimum run length: 6 consecutive bases within the window.

(Earlier versions checked only the + strand downstream, which under-counted minus-strand internal priming. The same strand-aware logic applies to the intergenic-locus polyA-context check that feeds intergenic_hotspot / intergenic_novel.)

Requires a reference FASTA (--reference); without one this flag is not computed.
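
The strand-aware window logic above can be sketched as below. A plain Python string stands in for the reference FASTA lookup, and coordinates are assumed to be 0-based and end-exclusive (as in pysam); is_internal_priming and has_run are hypothetical names.

```python
def has_run(window: str, base: str, min_run: int = 6) -> bool:
    """True if `window` contains at least `min_run` consecutive `base`."""
    return base * min_run in window.upper()


def is_internal_priming(chrom_seq: str, ref_start: int, ref_end: int,
                        is_reverse: bool, window: int = 20,
                        min_run: int = 6) -> bool:
    """Check the genomic context next to the transcript 3' end.

    chrom_seq is the + strand sequence of the contig the read maps to.
    """
    if is_reverse:
        # Minus-strand read: the transcript 3' end is at reference_start,
        # so look upstream for a T-run on the + strand.
        lo = max(0, ref_start - window)
        return has_run(chrom_seq[lo:ref_start], "T", min_run)
    # Plus-strand read: look downstream of reference_end for an A-run.
    return has_run(chrom_seq[ref_end:ref_end + window], "A", min_run)
```

Note that an A-run upstream of a reverse read does not trigger the flag; only a T-run does, because that is what an A-run looks like on the + strand for a minus-strand transcript.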

TSO concatemer (n_tso_concatemer)

Counts reads containing more than one occurrence of a TSO sequence or its reverse complement, anywhere in the read (soft-clipped or aligned bases). These arise when template switching runs from one TSO onto another, producing TSO-TSO concatemers that should not carry biological signal. The definition follows Chou et al. (bioRxiv 2025.10.06.680646): concatemer reads ÷ total reads.

  • Uses the full configured --tso sequence(s) and their reverse complements (not the 12 bp prefix used for invasion).
  • Counting is non-overlapping and longest-first, so a shorter TSO that is a substring of a longer one (e.g. the PacBio TSO within the 10x TSO) is not double-counted at a single locus.
  • Like TSO invasion, this requires the TSO to still be present in the read, so run on untrimmed BAMs.
  • Surfaced as n_tso_concatemer in read_metrics.tsv, as tso_concatemer_frac in the MultiQC JSON, and as a bar in the report's artifact-flag panel.
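
One way to get non-overlapping, longest-first counting is a regular-expression alternation ordered by decreasing pattern length, as in the sketch below. This is an illustration only: exact matching is assumed, and count_tso_hits, is_tso_concatemer, and revcomp are hypothetical names.

```python
import re
from typing import Sequence

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T/N sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_tso_hits(read_seq: str, tsos: Sequence[str]) -> int:
    """Count non-overlapping full-length TSO hits in either orientation."""
    patterns = {t.upper() for t in tsos if t}
    patterns |= {revcomp(p) for p in patterns}
    if not patterns:
        return 0
    # Longest pattern first, so a TSO that contains a shorter one is matched
    # as a single hit rather than as two overlapping hits.
    ordered = sorted(patterns, key=len, reverse=True)
    regex = re.compile("|".join(re.escape(p) for p in ordered))
    # finditer yields non-overlapping matches from left to right.
    return sum(1 for _ in regex.finditer(read_seq.upper()))


def is_tso_concatemer(read_seq: str, tsos: Sequence[str]) -> bool:
    """A read is a concatemer when it holds more than one TSO occurrence."""
    return count_tso_hits(read_seq, tsos) > 1
```

With both default TSOs configured, a read carrying one copy of the 10x TSO counts one hit (the embedded PacBio TSO is not counted again), while a read carrying the 10x TSO plus its reverse complement counts two and is flagged.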

Non-canonical junction (n_noncanon_junction)

Detects reads with a CIGAR N operation (intron) whose donor–acceptor dinucleotide pair is not one of the three canonical splice site motifs:

  • GT–AG
  • GC–AG
  • AT–AC

Requires a reference FASTA (--reference) for the dinucleotide lookup. Without a reference, this flag is not computed.
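
The lookup can be sketched as below, with a plain string standing in for the reference FASTA. The CIGAR operation codes follow the SAM specification (3 is N; 0, 2, 3, 7, 8 consume reference bases). Treating the reverse-complemented motifs as canonical for minus-strand transcripts is an assumption about how the tool handles strand, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
REF_CONSUMING = {0, 2, 3, 7, 8}  # M, D, N, =, X
CIGAR_N = 3


def revcomp(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def introns_from_cigar(ref_start: int,
                       cigar: Sequence[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive intron intervals from (op, length) pairs."""
    pos = ref_start
    introns = []
    for op, length in cigar:
        if op == CIGAR_N:
            introns.append((pos, pos + length))
        if op in REF_CONSUMING:
            pos += length
    return introns


def junction_is_canonical(chrom_seq: str, start: int, end: int) -> bool:
    """Check the first and last two intron bases against the canonical pairs."""
    left = chrom_seq[start:start + 2].upper()
    right = chrom_seq[end - 2:end].upper()
    if (left, right) in CANONICAL:
        return True
    # Minus-strand transcript: donor and acceptor swap and are complemented.
    return (revcomp(right), revcomp(left)) in CANONICAL


def has_noncanonical_junction(chrom_seq: str, ref_start: int,
                              cigar: Sequence[Tuple[int, int]]) -> bool:
    return any(not junction_is_canonical(chrom_seq, s, e)
               for s, e in introns_from_cigar(ref_start, cigar))
```

A read with no N operation has no introns and therefore never carries the flag.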


5. Reported Metrics

Sample-wide scalar metrics (in <sample>.read_metrics.tsv)

| Metric | Type | Definition |
|---|---|---|
| n_reads_total | int | Total reads in the BAM (mapped + unmapped, from index counters). |
| n_reads_classified | int | Reads that received a classification category. Excludes UNMAPPED, SECONDARY, SUPPLEMENTARY. |
| n_reads_unassigned | int | Reads classified as UNASSIGNED (CB absent or not on whitelist). |
| n_cells | int | Number of distinct cell barcodes with ≥ 10 reads. Set to 1 in barcode-agnostic mode. |
| noise_read_frac | float | Conservative noise fraction (reads). Denominator: n_reads_classified. |
| noise_base_frac | float | Conservative noise fraction (aligned bases). |
| noise_read_frac_strict | float | Strict noise fraction (reads). |
| noise_base_frac_strict | float | Strict noise fraction (aligned bases). |
| strand_concordance | float | exonic_sense / (exonic_sense + exonic_antisense). Values < 0.95 suggest strand-switching or a non-stranded library. |
| chimeric_read_frac | float | Fraction of classified reads in the CHIMERIC category. |
| multimapper_read_frac | float | Fraction of classified reads in the MULTIMAPPER category. |
| per_cell_noise_median | float | Median per-cell conservative noise fraction across cells with ≥ 10 reads. |
| per_cell_noise_iqr | float | Interquartile range of per-cell conservative noise fraction. |
| n_tso_invasion | int | Number of reads with a TSO invasion flag. |
| n_polya_priming | int | Number of reads with an internal polyA priming flag. |
| n_noncanon_junction | int | Number of reads with a non-canonical splice junction flag. |
| read_frac_<category> | float | Fraction of classified reads in each category (one row per category in CATEGORY_ORDER). |
| base_frac_<category> | float | Fraction of classified aligned bases in each category. |
| full_length_read_frac | float | Fraction of EXONIC_SENSE reads considered full-length. Present only when a polyA site database or --polya-sites is provided. Without a database, computed from a length threshold (ONT: 500 bp, PacBio: 1000 bp). |
| tss_anchored_frac | float | Fraction of reads with 5′ end within 100 bp of an annotated TSS. Present only when --tss-sites is supplied. |
| numt_read_frac | float | Fraction of MITOCHONDRIAL reads overlapping a NUMT interval. Present only when --numt-bed is supplied. |

Per-cell metrics (columns of <sample>.cell_metrics.tsv)

| Column | Definition |
|---|---|
| cell_barcode | Cell barcode string (DataFrame index). |
| n_reads | Total classified reads for this cell. |
| n_bases | Total aligned bases for this cell. |
| read_frac_<category> | Per-cell read fraction for each category in CATEGORY_ORDER. |
| base_frac_<category> | Per-cell base fraction for each category. |
| umi_complexity_<category> | Unique UMIs / total reads for this cell and category. Only present when UMI tracking is active (default; disabled by --no-umi-dedup). |
| noise_read_frac | Per-cell conservative noise read fraction. |
| noise_base_frac | Per-cell conservative noise base fraction. |
| n_tso | TSO invasion flag count for this cell. |
| n_polya | Internal polyA priming flag count for this cell. |
| n_noncanon | Non-canonical junction flag count for this cell. |

Cells with fewer than 10 reads are excluded from the per-cell table.
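
The 10-read cutoff and the per_cell_noise_median / per_cell_noise_iqr summaries can be sketched as below. The input layout (barcode mapped to read count and noise read count) and the function name are illustrative assumptions, as is the choice of the "inclusive" quartile method from Python's statistics module; the tool's quantile interpolation may differ.

```python
import statistics
from typing import Mapping, Optional, Tuple


def per_cell_noise_summary(
    cells: Mapping[str, Tuple[int, int]],
    min_reads: int = 10,
) -> Tuple[int, Optional[float], Optional[float]]:
    """Return (n_cells, median, IQR) of per-cell noise fractions.

    cells maps barcode -> (n_reads, n_noise_reads); cells below min_reads
    are excluded, as in the per-cell table.
    """
    fracs = [noise / n for n, noise in cells.values() if n >= min_reads]
    if not fracs:
        return 0, None, None
    median = statistics.median(fracs)
    if len(fracs) < 2:
        return len(fracs), median, 0.0  # a single cell has no spread
    q1, _, q3 = statistics.quantiles(fracs, n=4, method="inclusive")
    return len(fracs), median, q3 - q1
```

Five cells with noise fractions 0.1 through 0.5 plus one 5-read cell give n_cells = 5, a median of 0.3, and an IQR of 0.2 under this quartile method.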


6. The run Subcommand

Classifies reads in a single BAM and produces all output files.

Synopsis

scnoisemeter run [OPTIONS]

Flags

Required:

| Flag | Type | Description |
|---|---|---|
| --bam PATH | path | Input BAM file. Must be coordinate-sorted and have a .bai index in the same directory. |
| --output-dir PATH | path | Directory for output files. Created if it does not exist. |

Optional — input:

| Flag | Default | Description |
|---|---|---|
| --sample-name TEXT | BAM filename stem | Label used in output filenames and HTML report. |
| --gtf PATH | auto-downloaded | GENCODE GTF annotation (plain or .gz). Takes precedence over --gtf-version. If omitted, the latest GENCODE human GTF is downloaded to ~/.cache/scnoisemeter/. |
| --gtf-version INT | none | GENCODE release to auto-download (e.g. 42). Ignored when --gtf is set. Use 42 to match the PolyASite 3.0 atlas exactly. |
| --barcode-whitelist PATH | none | File of valid corrected barcodes, one per line (plain text or .gz). Reads whose CB tag is not in this list are classified as UNASSIGNED. Distinct from --cell-barcodes. |
| --cell-barcodes PATH | none | Called-cell barcode file (plain text or .gz, one per line). Reads whose CB tag is not in this list are skipped entirely and contribute to no metric. Trailing -1 suffixes are stripped from both the file entries and the CB tags in the BAM. Compatible with Cell Ranger filtered_feature_bc_matrix/barcodes.tsv.gz. |
| --barcode-tag TEXT | CB | BAM tag for the corrected cell barcode. |
| --umi-tag TEXT | UB | BAM tag for the corrected UMI. |
| --chemistry [10x_v3\|10x_v4\|bd_rhapsody_wta\|custom] | 10x_v3 | Library chemistry. Sets the expected barcode length. |
| --platform [ont\|pacbio\|illumina\|illumina_10x\|illumina_bd\|smartseq\|unknown] | auto | Sequencing platform. auto detects from BAM header @PG records. smartseq is not auto-detected and must be set explicitly. |
| --pipeline-stage [raw\|pre_filter\|post_filter\|custom] | auto | Processing stage. auto detects from header. |
| --repeats PATH | none | RepeatMasker BED file (hg38). Required to classify reads as INTERGENIC_REPEAT. |
| --reference PATH | none | Reference FASTA (.fa or .fa.gz, with .fai index). Required for polyA context checks and non-canonical junction detection. |
| --polya-sites PATH | auto-downloaded | PolyA site BED file(s). Repeatable. Takes precedence over --polya-db. If omitted, a database is auto-downloaded according to --polya-db. |
| --polya-db [polyasite3\|polyadb4\|both] | polyasite3 | PolyA site database to auto-download when --polya-sites is not set. polyasite3: PolyASite 3.0 atlas (GENCODE v42, ~569k sites). polyadb4: PolyA_DB v4 (not tied to a GENCODE version, works with any GTF release). both: load both databases simultaneously. |
| --tss-sites PATH | auto-downloaded | CAGE peak / TSS BED file(s). Repeatable. Takes precedence over --tss-db. If omitted, a database is auto-downloaded according to --tss-db. |
| --tss-db [fantom5\|none] | fantom5 | TSS database to auto-download when --tss-sites is not set. fantom5: FANTOM5 robust CAGE peaks (hg38, ~184k peaks). none: skip TSS anchoring entirely. |
| --numt-bed PATH | none | NUMT BED file (nuclear mitochondrial DNA segments, hg38 coordinates). |
| --obs-metadata PATH | none | Per-cell metadata TSV with cell_barcode and cluster columns. Enables per-cluster noise profiles. |
| --exclude-biotypes TEXT | none | Gene biotypes to exclude from annotation. Repeatable. |

Optional — behaviour:

| Flag | Default | Description |
|---|---|---|
| --chimeric-distance INT | 10000 | Same-strand intra-chromosomal SA distance (bp) above which a split alignment is called chimeric. Splits at or below this distance are not called chimeric. |
| --threads INT | 4 | Parallel worker processes (one per chromosome). |
| --no-umi-dedup | off | Skip UMI set tracking. Reduces memory for very large datasets; disables umi_complexity_* columns. |
| --no-cache | off | Do not read or write the annotation index cache. |
| --offline | off | Use only cached annotation files; never make network calls. Raises an error if the cache is empty. Ignored when --gtf and --polya-sites are supplied explicitly. |
| --verbose / -v | off | Enable debug logging to stderr. |

Validation performed before processing

  • BAM index (.bai) must exist.
  • BAM must be coordinate-sorted (@HD SO:coordinate).
  • BAM must contain at least one aligned read.
  • Chromosome naming style (UCSC chr1 vs Ensembl 1) must match between BAM and GTF. Mismatch is fatal.
  • Chromosome lengths are compared against GRCh38 expected values. Mismatch is a warning.
  • If --cell-barcodes is supplied but the BAM has no CB tags (barcode-agnostic mode), the filter is ignored with a warning.
  • If --cell-barcodes filters out all reads, an error is raised.

Examples

# Minimal run (GTF and polyA atlas auto-downloaded)
scnoisemeter run \
  --bam sample.bam \
  --output-dir results/

# Full run with explicit annotation
scnoisemeter run \
  --bam sample.bam \
  --gtf gencode.v45.annotation.gtf.gz \
  --barcode-whitelist 3M-february-2018.txt \
  --cell-barcodes filtered_feature_bc_matrix/barcodes.tsv.gz \
  --platform ont \
  --pipeline-stage post_filter \
  --repeats rmsk.hg38.bed.gz \
  --reference GRCh38.fa \
  --polya-sites atlas.clusters.3.0.GRCh38.GENCODE_42.bed.gz \
  --tss-sites hg38.cage_peak_phase1and2combined_ann.bed.gz \
  --threads 16 \
  --output-dir results/ \
  --sample-name my_sample

# Restrict analysis to called cells only
scnoisemeter run \
  --bam cellranger_output/possorted_genome_bam.bam \
  --cell-barcodes cellranger_output/filtered_feature_bc_matrix/barcodes.tsv.gz \
  --output-dir results/

7. The compare Subcommand

Runs the full classification pipeline on two BAMs, computes chi-squared proportion tests for each read category, and produces a comparison report.

Synopsis

scnoisemeter compare [OPTIONS]

Flags

Required:

| Flag | Type | Description |
|---|---|---|
| --bam-a PATH | path | BAM A (e.g. pre-filter / raw). |
| --bam-b PATH | path | BAM B (e.g. post-filter). |
| --output-dir PATH | path | Directory for output files. |

Optional — compare-specific:

| Flag | Default | Description |
|---|---|---|
| --label-a TEXT | sample_A | Label for BAM A in reports and output filenames. |
| --label-b TEXT | sample_B | Label for BAM B in reports and output filenames. |

Shared flags (same as run, excluding --bam, --sample-name, --cell-barcodes):

--gtf, --gtf-version, --barcode-whitelist, --barcode-tag, --umi-tag, --chemistry, --platform, --pipeline-stage, --chimeric-distance, --repeats, --reference, --threads, --no-umi-dedup, --no-cache, --exclude-biotypes, --obs-metadata, --polya-sites, --polya-db, --tss-sites, --tss-db, --numt-bed, --offline, --verbose

A single annotation index is built once and shared between both BAM runs.

Statistical test

For each read category, a chi-squared test is applied to the contingency table of read counts (category reads vs all other classified reads) for BAM A and BAM B. P-values are Bonferroni-corrected for the number of categories tested.
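
The per-category test can be sketched without external dependencies, because a chi-squared statistic with one degree of freedom has the closed-form tail probability erfc(sqrt(x / 2)). The sketch assumes a Pearson test without continuity correction, which the text above does not specify; chi2_2x2_pvalue and compare_categories are hypothetical names.

```python
import math
from typing import Dict, Mapping


def chi2_2x2_pvalue(a_in: int, a_total: int, b_in: int, b_total: int) -> float:
    """Pearson chi-squared p-value for category vs. all other reads, A vs. B."""
    a_out, b_out = a_total - a_in, b_total - b_in
    row_in, row_out = a_in + b_in, a_out + b_out
    if 0 in (a_total, b_total, row_in, row_out):
        return 1.0  # degenerate table: nothing to compare
    n = a_total + b_total
    chi2 = n * (a_in * b_out - a_out * b_in) ** 2 / (
        a_total * b_total * row_in * row_out
    )
    # Survival function of chi-squared with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def compare_categories(counts_a: Mapping[str, int],
                       counts_b: Mapping[str, int]) -> Dict[str, float]:
    """Bonferroni-corrected p-value per category."""
    categories = sorted(set(counts_a) | set(counts_b))
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    return {
        cat: min(1.0, len(categories) * chi2_2x2_pvalue(
            counts_a.get(cat, 0), total_a, counts_b.get(cat, 0), total_b))
        for cat in categories
    }
```

With very large read counts even tiny differences in proportion become significant, so the corrected p-values are best read alongside the category fractions themselves.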

Examples

# Pre-filter vs post-filter comparison
scnoisemeter compare \
  --bam-a raw.bam \
  --bam-b filtered.bam \
  --gtf gencode.v45.annotation.gtf.gz \
  --label-a pre_filter \
  --label-b post_filter \
  --threads 8 \
  --output-dir compare_results/

# With explicit labels and whitelist
scnoisemeter compare \
  --bam-a sample1.bam \
  --bam-b sample2.bam \
  --label-a sample1 \
  --label-b sample2 \
  --barcode-whitelist 3M-february-2018.txt \
  --platform illumina_10x \
  --output-dir compare_results/

7a. The run-plate Subcommand

Classifies reads across a plate of Smart-seq / FLASH-seq wells and produces an aggregated plate-level report. Each well is processed independently and the per-well classified counts are merged into one SampleResult that feeds the plate-level metrics, intergenic profiler, and HTML report.

Synopsis

scnoisemeter run-plate [OPTIONS]

Flags

Required:

| Flag | Type | Description |
|---|---|---|
| --plate-dir PATH | directory | Directory containing one subdirectory per well. |
| --output-dir PATH | path | Root output directory. Each plate gets its own subdirectory. |

Plate-specific:

| Flag | Default | Description |
|---|---|---|
| --sample-sheet PATH | none | CSV mapping wells to metadata. Headerless sheets are auto-detected; a warning is issued suggesting a header row. BAM ↔ sheet mismatches emit warnings but do not abort the run. |
| --sequencer [illumina\|ont\|pacbio] | illumina | Informs whether the i7 sequence is stored in reverse-complement. |
| --plate-id TEXT | none | Repeatable. Restrict the run to the listed plate IDs. Useful for re-running a single plate without reprocessing the rest. |
| --parallel-wells INT | 1 | Number of wells to process concurrently. Each worker is initialised once (annotation index + polyA/TSS dictionaries loaded per worker) and reused across all wells assigned to it. When --parallel-wells N > 1, the per-BAM thread pool is implicitly max(1, --threads // N). |
| --platform smartseq | | Must be set explicitly for Smart-seq / FLASH-seq data. Selects unstranded noise and suppresses the expected "missing CB tag" warning. |

Plus the shared flags from run (--gtf, --gtf-version, --repeats, --reference, --polya-sites, --polya-db, --tss-sites, --tss-db, --chimeric-distance, --threads, --no-umi-dedup, --offline, --no-cache, --verbose, etc.).

Folder layout

--plate-dir must contain one subdirectory per well, named <PlateID>_<WellID>. Supported well IDs:

  • 96-well: rows A–H, columns 1–12 (e.g. 881_A1, 881_H12).
  • 384-well: rows A–P, columns 1–24 (e.g. 882_A1, 882_P24).

Each subdirectory must hold one BAM (.bam) and its index (.bam.bai or .bai). Wells whose BAM lacks an index are skipped with a stderr warning and counted in the per-plate failure tally. If no well in a plate has an index, the plate is skipped entirely with a clear error message (instead of producing a silent empty result).
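
The folder-naming rule can be sketched as a small parser. The regular expression below is an assumption written from the description above, not the pattern scNoiseMeter uses; in particular, whether zero-padded columns such as A01 are accepted by the tool is not stated. Well, WELL_DIR, and parse_well_dir are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# <PlateID>_<WellID>, with rows A-P and a one- or two-digit column.
WELL_DIR = re.compile(r"^(?P<plate>.+)_(?P<row>[A-P])(?P<col>\d{1,2})$")


class Well(NamedTuple):
    plate_id: str
    well_id: str
    fits_96: bool  # True if the well also exists on a 96-well plate


def parse_well_dir(name: str) -> Optional[Well]:
    """Parse a well folder name; return None if it is not a valid well."""
    match = WELL_DIR.match(name)
    if match is None:
        return None
    row, col = match["row"], int(match["col"])
    if not 1 <= col <= 24:  # 384-well plates have columns 1-24
        return None
    fits_96 = row <= "H" and col <= 12  # 96-well: rows A-H, columns 1-12
    return Well(match["plate"], f"{row}{col}", fits_96)
```

Grouping the parsed wells by plate_id then gives the per-plate well sets described in the next section.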

Behaviour

  1. Wells are discovered from folder names matching the <PlateID>_<WellID> regex and grouped by plate.
  2. If --plate-id is supplied, the plate set is filtered; unknown plate IDs emit a warning.
  3. For each plate, wells without a BAI index are filtered out; the remaining wells are dispatched.
  4. If --parallel-wells > 1, a ProcessPoolExecutor runs wells concurrently. Each worker initialises once: annotation index, polyA sites, TSS sites, and BAM chromosome style are all prepared up-front and shared across every well assigned to that worker.
  5. If a worker is killed by the OS (usually OOM) the pool is marked broken. scNoiseMeter catches BrokenProcessPool and raises a ClickException that reports how many wells completed and suggests reducing --parallel-wells. Prior to 0.3.1 this failure was silently swallowed and the run exited 0 with no outputs.
  6. Wells merge into a per-plate SampleResult. The intergenic profiler runs at the plate level and the resulting per-record reclassification is applied before plate metrics are computed.
  7. Per-plate outputs are written to <output-dir>/<PlateID>/: <PlateID>.read_metrics.tsv, <PlateID>.per_well_metrics.tsv, <PlateID>.report.html, <PlateID>.intergenic_loci.tsv (when applicable), and the usual .multiqc.json / .length_distributions/ outputs.

Examples

# Plate with 8 wells processed concurrently
scnoisemeter run-plate \
  --plate-dir /data/smartseq/plate_881/ \
  --sample-sheet plate_881.csv \
  --platform smartseq \
  --parallel-wells 8 \
  --threads 16 \
  --reference GRCh38.fa \
  --output-dir results/plate_881/

# Re-run a single plate from a directory that holds many
scnoisemeter run-plate \
  --plate-dir /data/smartseq/batch3/ \
  --plate-id 885 \
  --platform smartseq \
  --output-dir results/batch3/

Per-well output: <PlateID>.per_well_metrics.tsv

One row per well processed. Columns: well_id, plate_id, n_reads_total, n_reads_classified, noise_read_frac, strand_concordance, chimeric_read_frac, plus any metadata columns resolved from the sample sheet.


8. The discover Subcommand

Scans a directory for .bam files, inspects each to infer platform and pipeline stage, presents a summary table, and runs scnoisemeter run on selected files.

Synopsis

scnoisemeter discover [OPTIONS]

Flags

Required:

| Flag | Type | Description |
|---|---|---|
| --bam-dir PATH | directory | Directory to scan for .bam files. |
| --reference PATH | path | Reference FASTA (.fa or .fa.gz with .fai index). |
| --output-dir PATH | path | Root output directory. Each BAM gets its own subdirectory named after the BAM stem. |

Optional:

| Flag | Default | Description |
|---|---|---|
| --gtf PATH | auto-downloaded | GENCODE GTF. Auto-downloaded if omitted. Takes precedence over --gtf-version. |
| --gtf-version INT | none | GENCODE release to auto-download (e.g. 42). Ignored when --gtf is set. |
| --polya-sites PATH | auto-downloaded | PolyA site BED file(s). Repeatable. Takes precedence over --polya-db. |
| --polya-db [polyasite3\|polyadb4\|both] | polyasite3 | PolyA site database to auto-download when --polya-sites is not set. |
| --tss-sites PATH | auto-downloaded | TSS / CAGE peak BED file(s). Repeatable. Takes precedence over --tss-db. |
| --tss-db [fantom5\|none] | fantom5 | TSS database to auto-download when --tss-sites is not set. |
| --threads INT | 4 | Parallel worker processes per BAM. |
| --run-all | off | Non-interactive mode. All BAMs with fully inferable parameters are run automatically. BAMs with blocking issues (no index, wrong sort order, unknown platform) are skipped with a warning. |
| --offline | off | Use only cached annotation files. |
| --verbose / -v | off | Enable debug logging. |

Behaviour

  1. All .bam files in --bam-dir are inspected to infer platform, pipeline stage, and check for blocking issues (missing index, wrong sort order).
  2. A summary table is printed showing each BAM, its inferred platform, pipeline stage, and any issues.
  3. In interactive mode (default), the user selects which BAMs to process. BAMs with unknown platform trigger an interactive prompt.
  4. In --run-all mode, BAMs with blocking issues are skipped with a warning.
  5. The annotation index is built once and shared across all selected BAMs.
  6. Each BAM is run inline (no subprocess). Output goes to <output-dir>/<bam-stem>/.
  7. A summary table of successes and failures is printed at the end.
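The scan and per-BAM output layout described in the steps above can be sketched as follows. This is a minimal illustration using only the standard library, not the tool's implementation: the function name scan_bam_dir, the returned dictionary keys, and the accepted index names (sample.bam.bai or sample.bai) are assumptions, and platform and sort-order inference are omitted.

```python
import tempfile
from pathlib import Path


def scan_bam_dir(bam_dir, output_dir):
    """List .bam files with a simple index check and a per-BAM output directory."""
    rows = []
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        # Assumed convention: the index is either sample.bam.bai or sample.bai.
        has_index = Path(str(bam) + ".bai").exists() or bam.with_suffix(".bai").exists()
        rows.append({
            "bam": bam.name,
            "issues": [] if has_index else ["no index"],
            # Output goes to <output-dir>/<bam-stem>/ as described above.
            "out_dir": Path(output_dir) / bam.stem,
        })
    return rows


# Demo on empty placeholder files (not real BAMs).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.bam", "a.bam.bai", "b.bam"):
        (Path(tmp) / name).touch()
    rows = scan_bam_dir(tmp, "discover_results")

for row in rows:
    print(row["bam"], row["issues"] or "ok", row["out_dir"])
```

In this sketch a BAM with a non-empty issues list would be the one skipped in --run-all mode.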

Examples

# Interactive discovery
scnoisemeter discover \
  --bam-dir /data/bams/ \
  --reference GRCh38.fa \
  --output-dir discover_results/

# Non-interactive batch run
scnoisemeter discover \
  --bam-dir /data/bams/ \
  --reference GRCh38.fa \
  --gtf gencode.v45.annotation.gtf.gz \
  --output-dir discover_results/ \
  --threads 8 \
  --run-all

9. Output Files

run outputs

All files are written to --output-dir. <sample> is the value of --sample-name.


<sample>.read_metrics.tsv

Two-column tab-separated file: metric and value. One row per metric.

Rows written (in order):

| Row key | Description |
|---|---|
| n_reads_total | Total reads in BAM (from index counters). |
| n_reads_classified | Reads with a category (excludes UNMAPPED, SECONDARY, SUPPLEMENTARY). |
| n_reads_unassigned | Reads classified as UNASSIGNED. |
| n_cells | Distinct barcodes with ≥ 10 reads. |
| noise_read_frac | Conservative noise fraction (reads). |
| noise_base_frac | Conservative noise fraction (bases). |
| strand_concordance | exonic_sense / (exonic_sense + exonic_antisense). |
| chimeric_read_frac | CHIMERIC read fraction. |
| multimapper_read_frac | MULTIMAPPER read fraction. |
| per_cell_noise_median | Median per-cell noise fraction. |
| per_cell_noise_iqr | IQR of per-cell noise fraction. |
| n_tso_invasion | TSO invasion flag count. |
| n_polya_priming | Internal polyA priming flag count. |
| n_noncanon_junction | Non-canonical junction flag count. |
| read_frac_<category> | Read fraction per category (16 rows, CATEGORY_ORDER). |
| base_frac_<category> | Base fraction per category (16 rows). |
| full_length_read_frac | Full-length read fraction (present only when polyA sites are available). |

Values are plain numbers: integers for counts and floats with 6 decimal places for fractions.
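Because the file is a plain two-column TSV, it can be loaded with the standard library alone. The sketch below is illustrative: the helper name load_read_metrics is hypothetical, the presence of a metric/value header row is an assumption, and the example values are made up.

```python
import csv
import io


def load_read_metrics(handle):
    """Parse a two-column metric/value TSV into a dict of ints and floats."""
    metrics = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip malformed lines and the header row (assumed to read metric/value).
        if len(row) != 2 or row[0] == "metric":
            continue
        key, raw = row
        try:
            metrics[key] = int(raw)      # counts
        except ValueError:
            metrics[key] = float(raw)    # fractions
    return metrics


# Illustrative content only; real files contain many more rows.
example = "metric\tvalue\nn_reads_total\t1000\nnoise_read_frac\t0.123456\n"
metrics = load_read_metrics(io.StringIO(example))
print(metrics)
```

For a real file, pass an open file handle instead of the StringIO object.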


<sample>.cell_metrics.tsv

Tab-separated file. Index column is cell_barcode. One row per cell with ≥ 10 reads.

Columns:

| Column | Description |
|---|---|
| cell_barcode | Corrected cell barcode string (index). |
| n_reads | Total classified reads for this cell. |
| n_bases | Total aligned bases for this cell. |
| read_frac_<category> | Fraction of this cell's reads in each category (16 columns, one per category in CATEGORY_ORDER, e.g. read_frac_exonic_sense). |
| base_frac_<category> | Base fraction per category (16 columns). |
| umi_complexity_<category> | unique UMIs / total reads for this cell and category (16 columns; absent when --no-umi-dedup is set). |
| noise_read_frac | Per-cell conservative noise read fraction. |
| noise_base_frac | Per-cell conservative noise base fraction. |
| n_tso | TSO invasion flag count for this cell. |
| n_polya | Internal polyA priming flag count for this cell. |
| n_noncanon | Non-canonical junction flag count for this cell. |

This file is not written when the sample is in barcode-agnostic mode and the DataFrame is empty.
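The per-cell columns feed the sample-level per_cell_noise_median and per_cell_noise_iqr rows. A minimal sketch of that summary is shown below, assuming the noise fractions have already been read from the noise_read_frac column. The function name and the choice of the "inclusive" quantile method are assumptions; the tool's exact quantile convention is not stated here.

```python
import statistics

MIN_READS_PER_CELL = 10  # cells below this read count are excluded (threshold from the text)


def per_cell_noise_summary(cells):
    """cells: iterable of (n_reads, noise_read_frac) pairs. Returns (median, IQR)."""
    kept = [frac for n_reads, frac in cells if n_reads >= MIN_READS_PER_CELL]
    # quantiles() with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(kept, n=4, method="inclusive")
    return median, q3 - q1


# Made-up values; the cell with 9 reads is dropped before summarising.
cells = [(50, 0.1), (12, 0.2), (9, 0.9), (30, 0.3), (10, 0.4), (75, 0.5)]
median, iqr = per_cell_noise_summary(cells)
print(median, iqr)
```

At least two retained cells are needed for statistics.quantiles to return a result.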


<sample>.multiqc.json

MultiQC custom-content JSON. Contains a subset of scalar metrics formatted for ingestion by MultiQC's custom_content module.


<sample>.length_distributions/

Directory containing one TSV per read category that has at least one read with a recorded length. Filenames: <category_value>.lengths.tsv.

Each file has a single column:

| Column | Description |
|---|---|
| read_length | Length of one read (one row per sampled read). |

Read lengths are collected by reservoir sampling; the sample size is bounded in memory. Not all reads are represented.
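For readers unfamiliar with the technique, the following is a generic sketch of reservoir sampling (Algorithm R, which the caveats section names). It shows why memory stays bounded at k items regardless of stream length. The function name, the value of k, and the seeded generator are illustrative assumptions, not the tool's actual parameters.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=1000, rng=random.Random(0))
print(len(sample))
```

A stream shorter than k is returned whole, which is why categories with few reads are fully represented.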


<sample>_length_stratified.tsv

Tab-separated. Rows represent the cross of length bin × read category.

| Column | Description |
|---|---|
| length_bin | Length bin label (e.g. <150, 150–500, 500–1000, 1000–2000, 2000–5000, >5000). |
| category | Read category string value. |
| count | Number of reads in this length bin × category combination. |
| fraction_of_bin | Fraction of reads in this length bin that belong to this category. |

For Illumina short-read data, a comment line is prepended noting that all reads fall in the <150 bp bin.

Bin breaks: 150, 500, 1000, 2000, 5000 bp. When median read length is ≥ 300 bp (long-read data), all six bins are reported. When median read length is < 300 bp (short-read data), the <150 and 150–500 bins are merged into <500.
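The six-bin cross-tabulation can be sketched as below. This is an illustration under stated assumptions: a length exactly equal to a break is placed in the higher bin (the text does not specify boundary handling), the merged <500 bin for short-read data is not modelled, and the function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

BREAKS = [150, 500, 1000, 2000, 5000]
LABELS = ["<150", "150–500", "500–1000", "1000–2000", "2000–5000", ">5000"]


def length_bin(read_length):
    """Map a read length to its bin label (lengths equal to a break go to the higher bin)."""
    return LABELS[bisect_right(BREAKS, read_length)]


def stratify(reads):
    """reads: iterable of (read_length, category).

    Returns {(length_bin, category): (count, fraction_of_bin)}.
    """
    counts = Counter((length_bin(length), category) for length, category in reads)
    bin_totals = Counter()
    for (bin_label, _category), n in counts.items():
        bin_totals[bin_label] += n
    return {key: (n, n / bin_totals[key[0]]) for key, n in counts.items()}


reads = [(100, "exonic_sense"), (120, "exonic_antisense"),
         (140, "exonic_sense"), (800, "exonic_sense")]
table = stratify(reads)
for (bin_label, category), (count, frac) in sorted(table.items()):
    print(bin_label, category, count, round(frac, 6))
```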


<sample>.intergenic_loci.tsv

Tab-separated. One row per intergenic locus characterised by the intergenic profiler. Only written if intergenic reads are present.

| Column | Description |
|---|---|
| contig | Chromosome / contig name. |
| start | Locus start coordinate (0-based). |
| end | Locus end coordinate. |
| strand | + or -. |
| n_reads | Number of intergenic reads at this locus. |
| n_barcodes | Number of distinct barcodes contributing reads. |
| has_splice_evidence | Boolean. True if any read at this locus has a junction (N in CIGAR). |
| is_monoexonic | Boolean. True if no read has a junction. |
| polya_run_downstream | Boolean. True if an A-run ≥ 6 bp was found downstream of the locus. |
| near_polya_site | Boolean. True if the locus is within 50 bp of an annotated polyA site. |
| poisson_pvalue_adj | Bonferroni-corrected Poisson p-value for read enrichment vs background intergenic rate. |
| category | Category assigned to this locus: intergenic_hotspot, intergenic_novel, intergenic_repeat, or intergenic_sparse. |
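To make the poisson_pvalue_adj column concrete, here is a minimal sketch of a one-sided Poisson enrichment test followed by Bonferroni correction. It is not the tool's code: how the background rate is estimated and how the expected count per locus is derived are not described here, so expected_reads is a hypothetical input, and the example numbers are made up.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 minus the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def bonferroni(p_value, n_tests):
    """Multiply by the number of tests and clip at 1.0."""
    return min(1.0, p_value * n_tests)


# Hypothetical locus: 12 observed reads where 2.0 are expected, 500 loci tested.
expected_reads = 2.0
p_raw = poisson_sf(12, expected_reads)
p_adj = bonferroni(p_raw, n_tests=500)
print(p_raw, p_adj)
```

Computing the upper tail as a complement loses precision for extremely small p-values; a production implementation would more likely call a dedicated survival function.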

<sample>.cluster_metrics.tsv

Only written when --obs-metadata is supplied. Contains per-cluster aggregated noise metrics. Cluster labels are taken from the cluster column of the obs metadata TSV.


<sample>.report.html

Self-contained interactive HTML report using Plotly (pinned to version 2.35.2 from CDN, or embedded when --offline is set). Contains:

  • Sample metadata table (platform, pipeline stage, aligner, annotation versions)
  • Read category composition bar chart
  • Noise fraction summary
  • Per-cell noise violin (suppressed for barcode-agnostic samples)
  • Read-length distributions by category (long-read platforms only)
  • Noise by read length stratification (long-read platforms only)
  • Insert size distribution (Illumina only, when paired reads are present)
  • Per-cluster noise comparison (when --obs-metadata is supplied)
  • Intergenic loci table (when intergenic reads are present)
  • Warnings panel

compare outputs

All written to --output-dir.


comparison.metrics.tsv

Tab-separated. Side-by-side scalar metrics for both samples.

| Column | Description |
|---|---|
| metric | Metric name (same keys as <sample>.read_metrics.tsv). |
| <label_a> | Value for BAM A. |
| <label_b> | Value for BAM B. |
| delta | <label_b> − <label_a>. |

Only numeric metrics (int and float fields) from SampleMetrics are included.


comparison.stats.tsv

Tab-separated. One row per read category.

| Column | Description |
|---|---|
| category | Read category string value. |
| frac_<label_a> | Read fraction for BAM A. |
| frac_<label_b> | Read fraction for BAM B. |
| delta | frac_<label_b> − frac_<label_a>. |
| chi2 | Chi-squared statistic from the contingency table test. |
| p_value | Uncorrected p-value. |
| p_adjusted | Bonferroni-corrected p-value (clipped at 1.0). |
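The chi2, p_value, and p_adjusted columns can be illustrated with the sketch below. The layout of the contingency table is not specified in this document, so the sketch assumes one 2×2 table per category (reads in the category versus all other reads, for BAM A versus BAM B), a Pearson statistic without continuity correction, and Bonferroni correction across the tested categories. All names and counts are hypothetical.

```python
import math


def chi2_2x2(a_in, a_out, b_in, b_out):
    """Pearson chi-squared test for a 2x2 table; returns (chi2, p_value)."""
    n = a_in + a_out + b_in + b_out
    row_a, row_b = a_in + a_out, b_in + b_out
    col_in, col_out = a_in + b_in, a_out + b_out
    if 0 in (row_a, row_b, col_in, col_out):
        return 0.0, 1.0  # degenerate table: no evidence of a difference
    chi2 = n * (a_in * b_out - a_out * b_in) ** 2 / (row_a * row_b * col_in * col_out)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def compare_categories(counts_a, counts_b):
    """counts_*: {category: read count}. Returns {category: (chi2, p, p_adjusted)}."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    categories = sorted(set(counts_a) | set(counts_b))
    results = {}
    for cat in categories:
        a_in, b_in = counts_a.get(cat, 0), counts_b.get(cat, 0)
        chi2, p = chi2_2x2(a_in, total_a - a_in, b_in, total_b - b_in)
        results[cat] = (chi2, p, min(1.0, p * len(categories)))
    return results


# Made-up counts for two categories.
stats = compare_categories({"exonic_sense": 10, "exonic_antisense": 20},
                           {"exonic_sense": 20, "exonic_antisense": 10})
print(stats["exonic_sense"])
```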

comparison.report.html

Interactive HTML comparison report. Contains:

  • Side-by-side read category composition bar charts
  • Noise fraction comparison bars
  • Per-cell noise violin (suppressed when either sample is barcode-agnostic, i.e. n_cells == 1)
  • Length distribution overlays
  • Statistical test results table
  • Warnings panel

10. Annotation Caching and Auto-Download

Cache location

All automatically downloaded annotation files and parsed site dictionaries are stored in ~/.cache/scnoisemeter/. Subsequent runs reuse cached files without any network call. Pass --offline to enforce cache-only mode; the tool raises an error if a required file is absent from the cache.

Three independent caches live in that directory:

  1. Annotation index: parsed GTF stored as a compressed pickle next to the source GTF file.
  2. polyA site dict: .scnoisemeter_polya_<hash>.pkl.gz, one per unique set of polyA BED inputs.
  3. TSS / CAGE site dict: .scnoisemeter_tss_<hash>.pkl.gz, same scheme.

The polyA and TSS caches are keyed on the source file path, mtime, size, a hash of the first 64 KB of each file, and the BAM chromosome-naming style (UCSC vs Ensembl). The head-bytes hash guards against in-place edits that preserve mtime. If any component of the key changes, a new cache entry is written. First load takes ~35 s for the 569k-site PolyASite 3.0 atlas; subsequent loads are under 1 s. In run-plate with --parallel-wells > 1, each worker loads the cache once in its initialiser and reuses it across every well assigned to that worker.
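The keying scheme can be sketched as follows. This is an illustration of the idea rather than the tool's code: the hash function (SHA-256), the field ordering, the truncation to 16 hex characters, and the function name cache_key are all assumptions. The demo rewrites a file in place with the same size and restores its mtime, which is the case the head-bytes hash is meant to catch.

```python
import hashlib
import os
import tempfile

HEAD_BYTES = 64 * 1024  # first 64 KB of each file, as described above


def cache_key(paths, chrom_style):
    """Derive a key from path, mtime, size, head-bytes hash, and chromosome naming style."""
    h = hashlib.sha256()
    for path in sorted(paths):
        st = os.stat(path)
        with open(path, "rb") as fh:
            head = fh.read(HEAD_BYTES)
        h.update(os.path.abspath(path).encode())
        h.update(f"|{st.st_mtime_ns}|{st.st_size}|".encode())
        h.update(hashlib.sha256(head).digest())
    h.update(chrom_style.encode())
    return h.hexdigest()[:16]


with tempfile.TemporaryDirectory() as tmp:
    bed = os.path.join(tmp, "sites.bed")
    with open(bed, "w") as fh:
        fh.write("chr1\t100\t101\n")
    st = os.stat(bed)
    key_before = cache_key([bed], "ucsc")

    # In-place edit of identical size, then restore the original mtime.
    with open(bed, "w") as fh:
        fh.write("chr1\t200\t201\n")
    os.utime(bed, ns=(st.st_atime_ns, st.st_mtime_ns))
    key_after = cache_key([bed], "ucsc")
    key_ensembl = cache_key([bed], "ensembl")

print(key_before, key_after, key_ensembl)
```

The three keys differ, so each situation would produce a separate cache entry.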

GTF

On first run with no --gtf or --gtf-version, the latest GENCODE human GTF is downloaded from the GENCODE FTP. The parsed annotation index (pyranges intervals, intron complement, intergenic regions) is cached alongside the GTF as a compressed pickle; rebuilding it from a large GTF takes roughly 60 seconds, so repeated runs benefit significantly from this cache. Pass --no-cache to force a rebuild.

To pin a specific GENCODE release:

scnoisemeter run --bam sample.bam --gtf-version 42 --output-dir results/

To supply a local file (disables all auto-download for the GTF):

scnoisemeter run --bam sample.bam --gtf gencode.v45.annotation.gtf.gz --output-dir results/

PolyA site databases

Two databases are supported, selected with --polya-db:

| Database | Flag value | Source | Genome | Notes |
|---|---|---|---|---|
| PolyASite 3.0 | polyasite3 (default) | polyasite.unibas.ch | hg38 / GENCODE v42 | ~569k sites; tied to GENCODE v42 |
| PolyA_DB v4 | polyadb4 | exon.njms.rutgers.edu | hg38 | Not tied to a GENCODE version; works with any GTF release |

Use both to load both databases simultaneously. When --polya-sites is provided explicitly, --polya-db is ignored.

The PolyASite 3.0 atlas is distributed as a BED6 file. PolyA_DB v4 is distributed as a ZIP archive; scNoiseMeter downloads and converts it to BED3 format on first use and caches the result.

Version mismatch. The current PolyASite 3.0 atlas is built on GENCODE v42. Auto-downloading the latest GTF (currently v49) produces a seven-version gap; the tool warns when the difference exceeds five major releases. Two ways to resolve this:

  • Pass --gtf-version 42 to auto-download GENCODE v42, matching the PolyASite 3.0 atlas exactly.
  • Pass --polya-db polyadb4 to switch to PolyA_DB v4, which is not tied to a GENCODE version and works with any GTF release.
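The warning rule amounts to a simple comparison, sketched below. The function name and message wording are hypothetical; only the atlas version (42) and the five-release threshold come from the text.

```python
ATLAS_GENCODE_VERSION = 42  # GENCODE release the PolyASite 3.0 atlas is built on
MAX_VERSION_GAP = 5         # a warning is issued when the gap exceeds this


def version_gap_warning(gtf_version, atlas_version=ATLAS_GENCODE_VERSION):
    """Return a warning string when the GTF and atlas releases are too far apart, else None."""
    gap = abs(gtf_version - atlas_version)
    if gap > MAX_VERSION_GAP:
        return (f"GTF v{gtf_version} and polyA atlas v{atlas_version} "
                f"differ by {gap} releases")
    return None


print(version_gap_warning(49))  # gap of 7: warning
print(version_gap_warning(45))  # gap of 3: None
```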

TSS / CAGE peak databases

Two options are supported, selected with --tss-db:

| Database | Flag value | Source | Notes |
|---|---|---|---|
| FANTOM5 | fantom5 (default) | fantom.gsc.riken.jp | hg38 robust CAGE peaks, ~184k peaks, BED6 format |
| None | none | | Disables TSS anchoring; tss_anchored_frac will not be reported |

When --tss-sites is provided explicitly, --tss-db is ignored.


11. Platform-Specific Notes

ONT

  • Platform auto-detected from minimap2 @PG record in the BAM header.
  • Chimeric detection uses the SA tag with the default intra-chromosomal distance threshold (10,000 bp). Split alignments within this distance on the same strand are treated as legitimate splices.
  • Full-length read fraction fallback threshold: 500 bp (when no polyA site database is provided).
  • Read-length distribution and noise-by-length charts are included in the HTML report.
  • No insert size chart.
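The SA-tag rule for long reads can be sketched as below. The SA field layout (rname,pos,strand,CIGAR,mapQ,NM; repeated) follows the SAM specification, but everything else is a simplification: distance is measured between alignment start positions, a distance exactly equal to the threshold is treated as a legitimate splice, and the function names are hypothetical.

```python
def parse_sa_tag(sa):
    """Split an SA tag value into (chrom, 1-based pos, strand) tuples."""
    segments = []
    for entry in sa.rstrip(";").split(";"):
        rname, pos, strand, _cigar, _mapq, _nm = entry.split(",")
        segments.append((rname, int(pos), strand))
    return segments


def is_chimeric(chrom, pos, strand, sa, max_distance=10_000):
    """Primary alignment at (chrom, pos, strand) with SA tag value sa."""
    for sa_chrom, sa_pos, sa_strand in parse_sa_tag(sa):
        # A different chromosome or strand is never a legitimate splice.
        if sa_chrom != chrom or sa_strand != strand:
            return True
        # Same chromosome and strand, but too far apart.
        if abs(sa_pos - pos) > max_distance:
            return True
    return False


print(is_chimeric("chr1", 1000, "+", "chr1,5000,+,50S50M,60,0;"))   # False
print(is_chimeric("chr1", 1000, "+", "chr2,5000,+,50S50M,60,0;"))   # True
```

Raising max_distance in this sketch corresponds to increasing --chimeric-distance.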

PacBio / Kinnex

  • Platform auto-detected from pbmm2 @PG record.
  • Chimeric detection uses the SA tag, same logic as ONT.
  • The PacBio TSO (AAGCAGTGGTATCAACGCAGAGT) is used for TSO invasion detection.
  • Full-length read fraction fallback threshold: 1000 bp.
  • Read-length distribution and noise-by-length charts are included in the HTML report.
  • No insert size chart.

10x Genomics and BD Rhapsody (short-read: Illumina, ElemBio)

  • Platform auto-detected from STAR, STARsolo, or cellranger @PG records. Short-read BAMs from Illumina or ElemBio (AVITI) sequencers are handled identically; the distinction is kit chemistry, not sequencer vendor.
  • Chimeric detection uses paired-end mode: a read pair is chimeric if it is inter-chromosomal, strand-discordant, or has abs(template_length) ≥ 1,000,000 bp.
  • Read-length distribution and noise-by-length charts are suppressed (all reads are the same short length).
  • Insert size distribution chart is shown when properly paired reads are present (collected by reservoir sampling from read1 of each proper pair, with 0 < abs(template_length) < 2000).
  • The _length_stratified.tsv file contains a note that all reads fall in the <150 bp bin.
  • illumina_10x and illumina_bd are treated identically to illumina in all classification logic; the distinction affects only BAM header auto-detection.

Smart-seq / FLASH-seq / Smart-seq3

  • Not auto-detected. Must be set with --platform smartseq (or via run-plate, which requires it explicitly for plate workflows).
  • Paired-end chimeric detection (same logic as Illumina) is used when template_length is populated.
  • Unstranded noise definition. exonic_antisense is excluded from both conservative and strict noise sets because unstranded libraries produce sense and antisense reads in roughly equal proportion by design. Including antisense would inflate the noise fraction by a constant that depends only on library construction, not on actual noise.
  • Missing CB tag is expected. One BAM = one cell; the CB tag is typically absent. The usual "low CB tag fraction" warning is suppressed for Smart-seq.
  • Per-cell metrics and n_cells report N/A in single-well HTML reports; the plate-level report aggregates across wells.
  • Strand concordance is expected to be ~50 % in unstranded data. The HTML report annotates this rather than flagging it as poor quality.
  • TSS / polyA anchoring still computes, but the interpretation changes: unstranded reads near an annotated polyA site are informative of transcript ends on either strand. The report surfaces this caveat.
  • Read-length distribution and noise-by-length charts are shown when reads are long enough (Smart-seq3 produces mixed lengths; traditional Smart-seq2 is short-read).

Barcode-agnostic mode

Activated when fewer than 50% of sampled reads (10,000 reads sampled from the BAM) carry the corrected barcode tag (CB by default). In this mode:

  • All reads are aggregated under the sentinel barcode NO_BARCODE.
  • n_cells is set to 1.
  • Per-cell metrics are not meaningful; the cell_metrics TSV will contain one row.
  • A warning is emitted to stderr.
  • If --cell-barcodes is supplied in barcode-agnostic mode, the filter is ignored with a warning.
  • In compare reports, the per-cell violin is suppressed when either sample has n_cells == 1, and a warning is appended to the report.
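The activation rule can be sketched as a fraction check over the sampled reads. The representation of each read as a dictionary of tags and the function name are assumptions for illustration; how the tool samples the 10,000 reads is not shown here.

```python
def is_barcode_agnostic(sampled_tags, barcode_tag="CB", min_fraction=0.5):
    """sampled_tags: one dict of SAM tags per sampled read.

    Returns True when fewer than min_fraction of the reads carry barcode_tag.
    """
    if not sampled_tags:
        return True  # nothing sampled: no barcode evidence at all
    tagged = sum(1 for tags in sampled_tags if barcode_tag in tags)
    return tagged / len(sampled_tags) < min_fraction


# 10,000 sampled reads, 4,999 of which carry a CB tag (placeholder barcode value).
below = [{"CB": "ACGT"}] * 4_999 + [{}] * 5_001
at_threshold = [{"CB": "ACGT"}] * 5_000 + [{}] * 5_000
print(is_barcode_agnostic(below), is_barcode_agnostic(at_threshold))
```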

12. Known Caveats and Limitations

Genome / annotation:

  • Only human GRCh38/hg38 is supported. The chromosome length validation uses hardcoded GRCh38 expected lengths. Other species will produce length mismatch warnings.
  • The GENCODE GTF and PolyASite 3.0 atlas must use the same chromosome naming convention (UCSC or Ensembl). Mismatches between BAM and GTF chromosome names cause a fatal error.
  • If the GTF and polyA atlas differ by more than 5 GENCODE major versions, a warning is issued. Genes with 3′ UTRs annotated between the two versions may have reduced polyA anchoring scores. To resolve: pass --gtf-version 42 to match the PolyASite 3.0 atlas exactly, or pass --polya-db polyadb4 to switch to PolyA_DB v4, which is version-agnostic.

Read classification:

  • Only primary alignments are classified. Secondary (flag 0x100) and supplementary (flag 0x800) alignments are skipped; the SA tag on the primary alignment is what the chimeric detector parses.
  • MULTIMAPPER is defined as NH tag > 1 on the primary alignment record. It is the highest priority in the classification hierarchy; reads without an NH tag are not flagged as multimappers.
  • INTRONIC_PURE and INTRONIC_BOUNDARY cannot be distinguished from genuine pre-mRNA capture at the read level. They are included in conservative noise but excluded from strict noise. Their presence does not necessarily indicate an artifact.
  • INTERGENIC_NOVEL requires ≥ 80% strand consistency, ≥ 3 distinct barcodes, and either splice evidence (CIGAR N) or proximity (≤ 50 bp) to an annotated polyA site. It flags candidate loci for review; it does not confirm the existence of an unannotated gene.
  • INTERGENIC_REPEAT classification requires a RepeatMasker BED file (--repeats). Without it, repeat-overlapping intergenic reads fall into INTERGENIC_HOTSPOT or INTERGENIC_SPARSE.
  • Non-canonical junction detection requires a reference FASTA (--reference). Without it, n_noncanon_junction is 0 regardless of the actual data.

Barcode handling:

  • The tool reads corrected barcodes from the CB tag (or the tag specified by --barcode-tag). Raw uncorrected barcodes (CR tag) are not used for classification.
  • Trailing -1 suffixes are stripped from both the --cell-barcodes file and the CB tags in the BAM, so Cell Ranger output is normalised automatically. Other suffix conventions are not handled.
  • In barcode-agnostic mode, per-cell noise values are not meaningful.
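The suffix normalisation described above can be sketched as follows. The function names are hypothetical and the barcodes are placeholders; only the trailing -1 suffix is stripped, matching the stated behaviour that other suffix conventions are not handled.

```python
def normalise_barcode(barcode):
    """Strip a trailing -1 suffix; leave any other suffix untouched."""
    return barcode[:-2] if barcode.endswith("-1") else barcode


def load_whitelist(lines):
    """Build a normalised barcode set from the lines of a --cell-barcodes file."""
    return {normalise_barcode(line.strip()) for line in lines if line.strip()}


whitelist = load_whitelist(["ACGTACGT-1\n", "TTTTGGGG\n", "\n"])
# A CB tag with or without the suffix matches the same whitelist entry.
print(normalise_barcode("ACGTACGT-1") in whitelist)
print(normalise_barcode("ACGTACGT-2") in whitelist)
```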

Chimeric detection:

  • The default chimeric distance threshold of 10,000 bp may flag very long transcripts (> 10 kb) that have legitimate split alignments as chimeric. For datasets with very long transcripts (e.g. PacBio full-length mRNA), consider increasing --chimeric-distance.
  • For Illumina paired-end data, the 1,000,000 bp insert size threshold for chimeric calling is fixed and cannot be adjusted from the command line.

Performance:

  • The annotation index is built on the first run with a given GTF. For large GTF files, this can take several minutes. The parsed index is then cached so that subsequent runs with the same GTF skip the rebuild; --no-cache disables the cache and forces a rebuild on every run.
  • Parallelism is at the chromosome level (--threads). Chromosomes are processed independently. Small contigs (alt, patch) are each dispatched as separate workers, which may cause imbalance for heavily fragmented reference assemblies.
  • UMI tracking (umi_complexity_* columns) stores a set of UMI strings per cell per category. For very large datasets, this can consume significant memory. Use --no-umi-dedup to disable it.
  • Read-length and insert-size sampling use reservoir sampling (Algorithm R). Not all reads are represented in the length distribution TSVs or insert size charts.

Statistical:

  • The chi-squared test in compare is applied to the contingency table of read counts. It is not a paired test; it does not account for the fact that BAM B may be a strict subset of BAM A (e.g. post-filter ⊆ pre-filter). Interpret p-values accordingly.
  • The adaptive intergenic threshold uses Bonferroni correction across all intergenic loci. In samples with many sparse intergenic reads, this correction is conservative and may suppress detection of low-coverage novel loci.
  • Per-cell noise statistics (median, IQR) are computed only over cells with ≥ 10 reads. The threshold is fixed.