Smash++

June 3, 2026

Smash++ is a fast utility for identifying and visualizing rearrangements in DNA sequences.

Installation

Smash++ requires CMake 4.0.0 or newer and a compiler with C++20 support.

Conda

conda install -y bioconda::smashpp

Docker

docker pull smortezah/smashpp
docker run -it smortezah/smashpp

Build From Source

git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh

By default, install.sh builds in ./build and installs smashpp, smashpp-inv-rep, and exclude_N into ./dist/bin.

You can customize the build with environment variables:

PREFIX=/your/path BUILD_TYPE=Debug PARALLEL=16 bash install.sh

Ubuntu

apt update && apt install -y git g++ python3-pip
pip3 install --user "cmake~=4.0.0"

git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh

macOS

brew install git python
pip3 install --user "cmake~=4.0.0"

git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh

Windows

Install Visual Studio 2022 Build Tools with the Desktop C++ workload, plus Python 3.

py -m pip install --user "cmake~=4.0.0"
git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
powershell -ExecutionPolicy Bypass -File .\install.ps1

The PowerShell installer supports the same knobs as the shell script, for example:

powershell -ExecutionPolicy Bypass -File .\install.ps1 -BuildType Debug -Prefix .\dist

Usage

If you used the default source install, run the binaries from ./dist/bin.

./dist/bin/smashpp [OPTIONS] -r <REF_FILE> -t <TAR_FILE>
./dist/bin/smashpp viz [OPTIONS] -o <SVG_FILE> <POS_FILE>

For best results, keep the reference and target filenames short.

Smash++ Options

Use smashpp --help to print the full CLI help.

| Option | Value | Description | Default |
| --- | --- | --- | --- |
| -r, --reference | <FILE> | Reference file in seq, FASTA, or FASTQ format. | Required |
| -t, --target | <FILE> | Target file in seq, FASTA, or FASTQ format. | Required |
| -l, --level | <INT> | Compression level from 0 to 6. | 3 |
| -m, --min-segment-size | <INT> | Minimum segment size. | 50 |
| -fmt, --format | <STRING> | Output format: pos or json. | pos |
| -e, --entropy-N | <FLOAT> | Entropy assigned to N bases. | 2.0 |
| -n, --num-threads | <INT> | Number of worker threads. | 4 |
| -mem, --max-memory | <SIZE> | Maximum estimated memory use. Supports B, K, M, G, and T suffixes; 0 disables the check. | Auto |
| -f, --filter-size | <INT> | Filter window size. | 100 |
| -ft, --filter-type | <INT/STRING> | Window function: 0/rectangular, 1/hamming, 2/hann, 3/blackman, 4/triangular, 5/welch, 6/sine, 7/nuttall. | hann |
| -fs, --filter-scale | <STRING> | Filter scale: S/small, M/medium, or L/large. | Auto |
| -d, --sampling-step | <INT> | Sampling step. | Auto |
| --approx-sampled-models | - | Use faster approximate updates between sampled positions in multi-model runs. | Disabled |
| -th, --threshold | <FLOAT> | Segmentation threshold. | 1.5 |
| -rb, --reference-begin-guard | <INT> | Reference begin guard. | 0 |
| -re, --reference-end-guard | <INT> | Reference end guard. | 0 |
| -tb, --target-begin-guard | <INT> | Target begin guard. | 0 |
| -te, --target-end-guard | <INT> | Target end guard. | 0 |
| -ar, --asymmetric-regions | - | Consider asymmetric regions. | Disabled |
| -nr, --no-self-complexity | - | Skip self-complexity computation. | Disabled |
| -sb, --save-sequence | - | Keep temporary .seq files generated from FASTA/FASTQ input. | Disabled |
| -sp, --save-profile | - | Save profile output. | Disabled |
| -sf, --save-filtered | - | Save filtered output. | Disabled |
| -ss, --save-segmented | - | Save extracted segment files. | Disabled |
| -sa, --save-profile-filtered-segmented | - | Save profile, filtered, and segmented outputs. | Disabled |
| -rm, --reference-model | <STRING> | Custom reference model chain. | Auto from --level |
| -tm, --target-model | <STRING> | Custom target model chain. | Auto from --level |
| -ll, --list-levels | - | Print the built-in compression levels. | - |
| -h, --help | - | Show the help message. | - |
| -v, --verbose | - | Print detailed progress information. | Disabled |
| -V, --version | - | Show the program version. | - |

Model Parameter Fields

Custom model strings use the form k,[w,d,]ir,a,g/t,ir,a,g:....

| Field | Meaning |
| --- | --- |
| k | Context size. |
| w | Sketch width given in log2 form, for example 10 means $2^{10} = 1024$. |
| d | Sketch depth. |
| ir | Inverted-repeat mode: 0 regular, 1 inverted only, 2 regular plus inverted. |
| a | Estimator. |
| g | Forgetting factor in the range 0.0 to 1.0. |
| t | Threshold for the number of substitutions in a tolerant model. |
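
To make the field order concrete, here is a minimal sketch that splits a model chain into its named fields. describe_model and describe_chain are hypothetical helpers written for this README, not part of Smash++. The sketch assumes a model has either 4 fields (k,ir,a,g) or 6 fields (k,w,d,ir,a,g), optionally followed by /t,ir,a,g for a tolerant model, with models joined by a colon. It does not validate value ranges.

```shell
# describe_model MODEL
# Print the named fields of ONE model (hypothetical helper, not a Smash++ command).
describe_model() {
  model=$1
  main=${model%%/*}              # part before the optional tolerant model
  tolerant=
  case $model in
    */*) tolerant=${model#*/} ;; # part after the first '/'
  esac

  # Split the main part on commas into positional parameters.
  old_ifs=$IFS; IFS=,
  set -- $main
  IFS=$old_ifs
  case $# in
    4) echo "k=$1 ir=$2 a=$3 g=$4" ;;
    6) echo "k=$1 w=$2 (2^$2 = $((1 << $2)) buckets) d=$3 ir=$4 a=$5 g=$6" ;;
    *) echo "unexpected field count: $#" >&2; return 1 ;;
  esac

  # The tolerant model, when present, uses the fields t,ir,a,g.
  if [ -n "$tolerant" ]; then
    old_ifs=$IFS; IFS=,
    set -- $tolerant
    IFS=$old_ifs
    echo "tolerant: t=$1 ir=$2 a=$3 g=$4"
  fi
}

# describe_chain CHAIN
# Walk a colon-separated chain and describe each model in turn.
describe_chain() {
  chain=$1
  while [ -n "$chain" ]; do
    describe_model "${chain%%:*}" || return 1
    case $chain in
      *:*) chain=${chain#*:} ;;
      *)   chain= ;;
    esac
  done
}

# Example:
#   describe_chain "20,10,5,0,0.002,0.95"
#   prints: k=20 w=10 (2^10 = 1024 buckets) d=5 ir=0 a=0.002 g=0.95
```

The helper only labels fields. Whether a given combination is accepted is decided by smashpp itself, so smashpp --help and -ll remain the authoritative reference.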

Output Compatibility

Smash++ output is deterministic for the same executable, options, input files, and platform. Profile files saved with -sp or -sa still serialize entropy values using the profile precision shown by the program, but filtering and segmentation use full-precision entropy internally.

Because of that, .fil, .pos, and .json output may differ slightly from older Smash++ releases in the final decimal places or in threshold-adjacent segment boundaries. These differences are deterministic and come from avoiding an older round-to-text-and-parse-back step in the compression hot path.

--approx-sampled-models is opt-in. It speeds up sampled multi-model runs by updating only contexts between sampled positions, so its .prf, .fil, .pos, and .json output should be treated as an approximate mode rather than byte-compatible output with the default model update path.
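
One way to check the determinism claim on your own machine is to run the same command twice and compare the results byte for byte. A minimal sketch, assuming the two runs were made in separate directories named run1 and run2 (hypothetical names); outputs_match is a helper defined here, not a Smash++ command:

```shell
# outputs_match FILE_A FILE_B
# Succeeds (exit status 0) only when both files are byte-identical.
# cmp -s prints nothing and reports the result through its exit status.
outputs_match() {
  cmp -s "$1" "$2"
}

# Example use after two identical runs (paths are hypothetical):
#   outputs_match run1/ref.tar.pos run2/ref.tar.pos && echo "identical"
```

The same check is not expected to pass between the default update path and --approx-sampled-models, since that mode is described above as approximate.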

Troubleshooting zero segments

If Smash++ finishes with 0 segments in both regular and inverted modes, it still writes an empty output file. For chromosome-scale or more divergent genome comparisons, the first tuning knobs to try are:

  • increase -th / --threshold
  • reduce -m / --min-segment-size
  • use -fs L / --filter-scale L for broader smoothing
  • lower -d / --sampling-step for finer resolution

See the Large and eukaryotic genomes section below for additional guidance on multi-gigabyte inputs.

Large and eukaryotic genomes

Smash++ was originally benchmarked on viral and bacterial genomes (kilobytes to low megabytes). When comparing large eukaryotic assemblies — for example human vs. chimpanzee — the automatic sampling step grows proportionally to file size in bytes (ceil(min(ref_bytes, tar_bytes) / 5000)), which can reduce resolution to the point where no segments survive filtering and thresholding.
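
To see how coarse the automatic step becomes, the formula above can be evaluated directly. This is a minimal sketch of that arithmetic only; auto_sampling_step is a hypothetical helper, and the result is what the stated formula yields, not a value read back from Smash++.

```shell
# auto_sampling_step REF_BYTES TAR_BYTES
# Evaluate ceil(min(ref_bytes, tar_bytes) / 5000) with integer arithmetic.
auto_sampling_step() {
  ref_bytes=$(( $1 ))   # arithmetic expansion also strips padding from wc output
  tar_bytes=$(( $2 ))
  if [ "$ref_bytes" -lt "$tar_bytes" ]; then
    smaller=$ref_bytes
  else
    smaller=$tar_bytes
  fi
  # Adding 4999 before dividing rounds the quotient up.
  echo $(( (smaller + 4999) / 5000 ))
}

# Example with file sizes taken from disk (file names are placeholders):
#   auto_sampling_step "$(wc -c < ref_chr.fa)" "$(wc -c < tar_chr.fa)"
```

For two files of 250,000,000 and 230,000,000 bytes the formula gives a step of 46000, which is why an explicit, much smaller -d value is suggested in the workflow below.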

Recommended workflow:

  1. Compare individual chromosomes rather than whole-genome FASTA files. Concatenated multi-chromosome files add cross-chromosome noise and inflate the auto-sampling step:

    # Extract chr1 from each assembly, then compare
    smashpp -r human_chr1.fa -t chimp_chr1.fa
    smashpp viz -o chr1_map.svg human_chr1.fa.chimp_chr1.fa.pos
    
  2. Lower the sampling step for multi-megabyte or gigabyte inputs so that the profile retains enough resolution:

    smashpp -r ref_chr.fa -t tar_chr.fa -d 50
    
  3. Raise the segmentation threshold — eukaryotic genomes contain more repetitive and divergent background, so a threshold of 1.5 (the default) may be too strict:

    smashpp -r ref_chr.fa -t tar_chr.fa -th 2.5
    
  4. Use sketch-based models with explicit width for memory-efficient processing of large chromosomes. The 6-field model format k,w,d,ir,a,g lets you control the sketch size:

    smashpp -r ref_chr.fa -t tar_chr.fa \
        -rm "20,10,5,0,0.002,0.95" \
        -tm "20,10,5,0,0.002,0.95"
    

    Here w=10 means a sketch width of $2^{10} = 1024$ buckets with depth d=5.

  5. A practical starting point for chromosome-to-chromosome eukaryotic comparison:

    smashpp -r human_chr1.fa -t chimp_chr1.fa \
        -l 0 -m 500 -th 2.5 -fs L -d 50 -n 8
    

    Adjust -th and -m based on the expected divergence between the species.
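
Step 1 assumes each chromosome is already in its own file. A minimal sketch of one way to pull a single record out of a multi-FASTA file with awk follows. extract_fasta_record is a hypothetical helper, not part of Smash++, and it assumes the sequence name is the first word of the header line, as in ">chr1 some description".

```shell
# extract_fasta_record NAME FASTA_FILE
# Print the record whose header starts with ">NAME" (exact match on the
# first word), including its header line, to standard output.
extract_fasta_record() {
  awk -v name="$1" '
    /^>/ { keep = (substr($1, 2) == name) }  # decide at every header line
    keep                                     # print lines of the chosen record
  ' "$2"
}

# Example (file names are placeholders):
#   extract_fasta_record chr1 human.fa > human_chr1.fa
#   extract_fasta_record chr1 chimp.fa > chimp_chr1.fa
```

Because the match is exact, chr1 does not also select chr10 or chr11. Assemblies that name their sequences differently (for example by accession) need the matching name passed instead.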

Visualizer Options

Use smashpp viz --help to print the full CLI help.

| Option | Value | Description | Default |
| --- | --- | --- | --- |
| <POS_FILE> | File | Position file generated by Smash++ in *.pos or *.json format. | Required |
| -o, --output | <SVG_FILE> | Output SVG path. | map.svg |
| -rn, --reference-name | <STRING> | Override the displayed reference label. | Header value |
| -tn, --target-name | <STRING> | Override the displayed target label. | Header value |
| -l, --link | <INT> | Link style between the two maps. | 1 |
| -c, --color | <INT> | Color mode: 0 or 1. | 0 |
| -p, --opacity | <FLOAT> | Connector opacity. | 0.9 |
| -w, --width | <INT> | Sequence bar width. | 10 |
| -s, --space | <INT> | Space between sequences. | 40 |
| -tc, --total-colors | <INT> | Total number of colors to use. | Auto |
| -rt, --reference-tick | <INT> | Reference tick spacing. | Auto |
| -tt, --target-tick | <INT> | Target tick spacing. | Auto |
| -th, --tick-human-readable | <INT> | Human-readable tick labels: 0 false, 1 true. | 1 |
| -m, --min-block-size | <INT> | Minimum block size to display. | 1 |
| -vv, --vertical-view | - | Render a vertical layout. | Disabled |
| -nrr, --no-relative-redundancy | - | Hide relative redundancy coloring. | Disabled |
| -nr, --no-redundancy | - | Hide redundancy coloring. | Disabled |
| -ni, --no-inverted | - | Hide inverted matches. | Disabled |
| -ng, --no-regular | - | Hide regular matches. | Disabled |
| -n, --show-N | - | Highlight N bases. | Disabled |
| -stat, --statistics | - | Save statistics to CSV. | stat.csv |
| -h, --help | - | Show the help message. | - |
| -v, --verbose | - | Print detailed plotting information. | Disabled |
| -V, --version | - | Show the program version. | - |

Example

After running the default installer, the example workflow looks like this:

cd example
../dist/bin/smashpp -r ref -t tar
../dist/bin/smashpp viz -o example.svg ref.tar.pos

JSON output is available too:

cd example
../dist/bin/smashpp --reference ref --target tar --format json
../dist/bin/smashpp viz --output example.svg ref.tar.json

If smashpp is already on your PATH, you can drop the ../dist/bin/ prefix.
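
When scripting the two steps, it helps to compute the position file name instead of typing it. The sketch below is inferred from the examples in this README (ref.tar.pos, ref.tar.json, human_chr1.fa.chimp_chr1.fa.pos) and is an assumption rather than a documented rule. pos_file_name is a hypothetical helper, and it assumes both inputs are given as plain file names in the current directory.

```shell
# pos_file_name REF TAR [FORMAT]
# Print the expected output name "<REF>.<TAR>.<FORMAT>".
# FORMAT defaults to pos, matching the default of --format.
pos_file_name() {
  echo "$1.$2.${3:-pos}"
}

# Example (assumes smashpp is on PATH; run from the example directory):
#   smashpp -r ref -t tar
#   smashpp viz -o example.svg "$(pos_file_name ref tar)"
```

If the computed name does not exist after a run, list the directory to see what was written; the naming above has not been checked against inputs passed with directory components.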

Testing and Benchmarks

After configuring and building from source, run the regression suite with:

ctest --test-dir build --output-on-failure

To make warnings fail the build in local development or CI, configure with:

cmake -S . -B build -DSMASHPP_STRICT_WARNINGS=ON

The repository also includes CMake presets for common maintainer workflows:

cmake --preset strict
cmake --build --preset strict
ctest --preset strict

Focused test labels are available for narrower checks, for example:

ctest --preset strict -L compatibility
ctest --preset strict -L packaging
ctest --preset benchmark-smoke

For local performance checks, run the benchmark target:

cmake --build build --target smashpp-benchmark

To compare against another executable, configure with:

cmake -S . -B build -DSMASHPP_BENCHMARK_BASELINE=/path/to/other/smashpp
cmake --build build --target smashpp-benchmark

The benchmark generates deterministic small and large inputs and writes timing rows to build/benchmarks/summary.csv. When a baseline executable is configured, it also writes build/benchmarks/comparison.csv with median timings and speedups for each scenario. The default large benchmark input is 256 MiB per file. Override the generated input sizes with byte counts when you need a shorter smoke run or a larger production check:

cmake -S . -B build \
  -DSMASHPP_BENCHMARK_SMALL_BYTES=131072 \
  -DSMASHPP_BENCHMARK_LARGE_BYTES=268435456

Use the same compiler, build type, input sizes, and machine when comparing results.
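
The byte counts passed to the size overrides are easier to read when derived from KiB and MiB. A small sketch of that arithmetic; kib and mib are hypothetical helpers defined here, not part of the build system:

```shell
# kib N / mib N
# Convert a count of KiB or MiB into bytes using shell integer arithmetic.
kib() { echo $(( $1 * 1024 )); }
mib() { echo $(( $1 * 1024 * 1024 )); }

# Example: the same values as the override shown above
# (128 KiB = 131072 bytes, 256 MiB = 268435456 bytes):
#   cmake -S . -B build \
#     -DSMASHPP_BENCHMARK_SMALL_BYTES="$(kib 128)" \
#     -DSMASHPP_BENCHMARK_LARGE_BYTES="$(mib 256)"
```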

To create portable release archives from the install rules, run:

cmake --build build --target package

The archives are written to build/packages/.

Cite

If you find Smash++ useful in your research, please acknowledge our work by citing:

  • M. Hosseini, D. Pratas, B. Morgenstern, A.J. Pinho, "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements," GigaScience, vol. 9, no. 5, 2020. DOI: 10.1093/gigascience/giaa048

Issues

If you encounter an issue, please let us know.

Contributing

Development workflow, testing, benchmarking, and pull request guidance are in CONTRIBUTING.md.

License

Smash++ is distributed under the GNU GPL v3 license.