Smash++
June 3, 2026
Smash++ is a fast utility for identifying and visualizing rearrangements in DNA sequences.
Installation
Smash++ requires CMake 4.0.0 or newer and a compiler with C++20 support.
Conda
conda install -y bioconda::smashpp
Docker
docker pull smortezah/smashpp
docker run -it smortezah/smashpp
Build From Source
git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh
By default, install.sh builds in ./build and installs smashpp, smashpp-inv-rep, and exclude_N into ./dist/bin.
You can customize the build with environment variables:
PREFIX=/your/path BUILD_TYPE=Debug PARALLEL=16 bash install.sh
Ubuntu
apt update && apt install -y git g++ python3-pip
pip3 install --user "cmake~=4.0.0"
git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh
macOS
brew install git python
pip3 install --user "cmake~=4.0.0"
git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
bash install.sh
Windows
Install Visual Studio 2022 Build Tools with the Desktop C++ workload, plus Python 3.
py -m pip install --user "cmake~=4.0.0"
git clone --depth 1 https://github.com/smortezah/smashpp.git
cd smashpp
powershell -ExecutionPolicy Bypass -File .\install.ps1
The PowerShell installer supports the same knobs as the shell script, for example:
powershell -ExecutionPolicy Bypass -File .\install.ps1 -BuildType Debug -Prefix .\dist
Usage
If you used the default source install, run the binaries from ./dist/bin.
./dist/bin/smashpp [OPTIONS] -r <REF_FILE> -t <TAR_FILE>
./dist/bin/smashpp viz [OPTIONS] -o <SVG_FILE> <POS_FILE>
For best results, keep the reference and target filenames short.
Smash++ Options
Use smashpp --help to print the full CLI help.
| Option | Value | Description | Default |
|---|---|---|---|
| -r, --reference | <FILE> | Reference file in seq, FASTA, or FASTQ format. | Required |
| -t, --target | <FILE> | Target file in seq, FASTA, or FASTQ format. | Required |
| -l, --level | <INT> | Compression level from 0 to 6. | 3 |
| -m, --min-segment-size | <INT> | Minimum segment size. | 50 |
| -fmt, --format | <STRING> | Output format: pos or json. | pos |
| -e, --entropy-N | <FLOAT> | Entropy assigned to N bases. | 2.0 |
| -n, --num-threads | <INT> | Number of worker threads. | 4 |
| -mem, --max-memory | <SIZE> | Maximum estimated memory use. Supports B, K, M, G, and T suffixes; 0 disables the check. | Auto |
| -f, --filter-size | <INT> | Filter window size. | 100 |
| -ft, --filter-type | <INT/STRING> | Window function: 0/rectangular, 1/hamming, 2/hann, 3/blackman, 4/triangular, 5/welch, 6/sine, 7/nuttall. | hann |
| -fs, --filter-scale | <STRING> | Filter scale: S/small, M/medium, or L/large. | Auto |
| -d, --sampling-step | <INT> | Sampling step. | Auto |
| --approx-sampled-models | - | Use faster approximate updates between sampled positions in multi-model runs. | Disabled |
| -th, --threshold | <FLOAT> | Segmentation threshold. | 1.5 |
| -rb, --reference-begin-guard | <INT> | Reference begin guard. | 0 |
| -re, --reference-end-guard | <INT> | Reference end guard. | 0 |
| -tb, --target-begin-guard | <INT> | Target begin guard. | 0 |
| -te, --target-end-guard | <INT> | Target end guard. | 0 |
| -ar, --asymmetric-regions | - | Consider asymmetric regions. | Disabled |
| -nr, --no-self-complexity | - | Skip self-complexity computation. | Disabled |
| -sb, --save-sequence | - | Keep temporary .seq files generated from FASTA/FASTQ input. | Disabled |
| -sp, --save-profile | - | Save profile output. | Disabled |
| -sf, --save-filtered | - | Save filtered output. | Disabled |
| -ss, --save-segmented | - | Save extracted segment files. | Disabled |
| -sa, --save-profile-filtered-segmented | - | Save profile, filtered, and segmented outputs. | Disabled |
| -rm, --reference-model | <STRING> | Custom reference model chain. | Auto from --level |
| -tm, --target-model | <STRING> | Custom target model chain. | Auto from --level |
| -ll, --list-levels | - | Print the built-in compression levels. | - |
| -h, --help | - | Show the help message. | - |
| -v, --verbose | - | Print detailed progress information. | Disabled |
| -V, --version | - | Show the program version. | - |
Model Parameter Fields
Custom model strings use the form k,[w,d,]ir,a,g/t,ir,a,g:....
| Field | Meaning |
|---|---|
| k | Context size. |
| w | Sketch width given in log2 form, for example 10 means $2^{10} = 1024$. |
| d | Sketch depth. |
| ir | Inverted-repeat mode: 0 regular, 1 inverted only, 2 regular plus inverted. |
| a | Estimator. |
| g | Forgetting factor in the range 0.0 to 1.0. |
| t | Threshold for the number of substitutions in a tolerant model. |
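To make the field layout concrete, here is a minimal Python sketch that splits a model chain into its fields. It is not the parser Smash++ uses. The class and function names are hypothetical, and it assumes that `:` separates models, `/` introduces the optional tolerant part, and a main part has either 4 fields (`k,ir,a,g`) or 6 fields (`k,w,d,ir,a,g`), as the form above suggests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TolerantModel:          # hypothetical container for the t,ir,a,g part
    t: int                    # substitution threshold
    ir: int                   # inverted-repeat mode
    a: float                  # estimator
    g: float                  # forgetting factor


@dataclass
class Model:                  # hypothetical container for one model in a chain
    k: int                    # context size
    ir: int
    a: float
    g: float
    w: Optional[int] = None   # log2 sketch width (6-field form only)
    d: Optional[int] = None   # sketch depth (6-field form only)
    tolerant: Optional[TolerantModel] = None


def parse_model_chain(chain: str) -> List[Model]:
    """Split a chain such as 'k,w,d,ir,a,g:k,ir,a,g/t,ir,a,g' into models."""
    models = []
    for part in chain.split(":"):
        main, _, tol = part.partition("/")
        fields = main.split(",")
        if len(fields) == 4:
            k, ir, a, g = fields
            w = d = None
        elif len(fields) == 6:
            k, w, d, ir, a, g = fields
        else:
            raise ValueError(f"expected 4 or 6 fields, got {len(fields)}: {main!r}")
        model = Model(int(k), int(ir), float(a), float(g),
                      None if w is None else int(w),
                      None if d is None else int(d))
        if not 0.0 <= model.g <= 1.0:
            raise ValueError(f"forgetting factor out of range: {model.g}")
        if tol:
            tf = tol.split(",")
            if len(tf) != 4:
                raise ValueError(f"expected 4 tolerant fields, got {len(tf)}: {tol!r}")
            model.tolerant = TolerantModel(int(tf[0]), int(tf[1]),
                                           float(tf[2]), float(tf[3]))
        models.append(model)
    return models


# The 6-field string used later in this README: w=10 means 2**10 buckets.
model = parse_model_chain("20,10,5,0,0.002,0.95")[0]
print(model.k, 2 ** model.w, model.d)
```

The sketch checks only the field count and the range of `g`; Smash++ itself may accept or reject other combinations.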
Output Compatibility
Smash++ output is deterministic for the same executable, options, input files, and platform. Profile files saved with -sp or -sa still serialize entropy values using the profile precision shown by the program, but filtering and segmentation use full-precision entropy internally.
Because of that, .fil, .pos, and .json output may differ slightly from older Smash++ releases in the final decimal places or in threshold-adjacent segment boundaries. These differences are deterministic and come from avoiding an older round-to-text-and-parse-back step in the compression hot path.
--approx-sampled-models is opt-in. It speeds up sampled multi-model runs by updating only contexts between sampled positions, so its .prf, .fil, .pos, and .json output should be treated as an approximate mode rather than byte-compatible output with the default model update path.
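The effect of a round-to-text-and-parse-back step on threshold-adjacent values can be illustrated with a small Python sketch. The entropy value and the number of decimals are hypothetical, not the precision Smash++ uses. The example shows only that a comparison against the threshold can flip, not how Smash++ turns such comparisons into segments.

```python
def round_trip(value: float, decimals: int) -> float:
    """Serialize a value to fixed-precision text and parse it back."""
    return float(f"{value:.{decimals}f}")


threshold = 1.5      # default -th value from the options table
entropy = 1.4996     # hypothetical full-precision profile value
decimals = 3         # hypothetical profile precision

rounded = round_trip(entropy, decimals)

print(entropy < threshold)   # True: full-precision comparison
print(rounded < threshold)   # False: the text round trip lifted the value to 1.5
```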
Troubleshooting zero segments
If Smash++ finishes with 0 segments in both regular and inverted modes, it still writes an empty
output file. For chromosome-scale or more divergent genome comparisons, the first tuning knobs to try
are:
- increase -th/--threshold
- reduce -m/--min-segment-size
- use -fs L/--filter-scale L for broader smoothing
- lower -d/--sampling-step for finer resolution
See the Large and eukaryotic genomes section below for additional guidance on multi-gigabyte inputs.
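When several of the knobs listed above are tried in turn, it can help to build the command line programmatically. The following Python sketch is one way to do that. The function name and the example values are hypothetical; the flags come from the options table.

```python
from typing import List, Optional


def smashpp_args(ref: str, tar: str, *,
                 threshold: Optional[float] = None,
                 min_segment: Optional[int] = None,
                 filter_scale: Optional[str] = None,
                 sampling_step: Optional[int] = None) -> List[str]:
    """Build an argument list for smashpp; unset options keep the CLI defaults."""
    args = ["smashpp", "-r", ref, "-t", tar]
    if threshold is not None:
        args += ["-th", str(threshold)]
    if min_segment is not None:
        args += ["-m", str(min_segment)]
    if filter_scale is not None:
        args += ["-fs", filter_scale]
    if sampling_step is not None:
        args += ["-d", str(sampling_step)]
    return args


# Hypothetical relaxed settings for a run that produced 0 segments.
args = smashpp_args("ref.fa", "tar.fa", threshold=2.0, min_segment=25,
                    filter_scale="L", sampling_step=10)
print(" ".join(args))
# To execute it: subprocess.run(args, check=True)
```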
Large and eukaryotic genomes
Smash++ was originally benchmarked on viral and bacterial genomes (kilobytes to low megabytes). For large eukaryotic assemblies, for example human vs. chimpanzee, the automatic sampling step grows in proportion to the size in bytes of the smaller input file (ceil(min(ref_bytes, tar_bytes) / 5000)). This can reduce resolution to the point where no segments survive filtering and thresholding.
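As a quick way to see how the automatic step scales, the formula quoted above can be evaluated directly. This Python sketch only restates that formula. The function name and the file sizes are hypothetical, and any clamping Smash++ applies is not modeled.

```python
def auto_sampling_step(ref_bytes: int, tar_bytes: int) -> int:
    """ceil(min(ref_bytes, tar_bytes) / 5000), using integer arithmetic."""
    smaller = min(ref_bytes, tar_bytes)
    return -(-smaller // 5000)   # ceiling division


print(auto_sampling_step(50_000, 60_000))            # 10
print(auto_sampling_step(250_000_000, 240_000_000))  # 48000
```

With inputs of a few hundred megabytes the step reaches tens of thousands of positions, which is why an explicit `-d` value is suggested for such files.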
Recommended workflow:
- Compare individual chromosomes rather than whole-genome FASTA files. Concatenated multi-chromosome files add cross-chromosome noise and inflate the auto-sampling step:

      # Extract chr1 from each assembly, then compare
      smashpp -r human_chr1.fa -t chimp_chr1.fa
      smashpp viz -o chr1_map.svg human_chr1.fa.chimp_chr1.fa.pos

- Lower the sampling step for multi-megabyte or gigabyte inputs so that the profile retains enough resolution:

      smashpp -r ref_chr.fa -t tar_chr.fa -d 50

- Raise the segmentation threshold. Eukaryotic genomes contain more repetitive and divergent background, so a threshold of 1.5 (the default) may be too strict:

      smashpp -r ref_chr.fa -t tar_chr.fa -th 2.5

- Use sketch-based models with explicit width for memory-efficient processing of large chromosomes. The 6-field model format k,w,d,ir,a,g lets you control the sketch size:

      smashpp -r ref_chr.fa -t tar_chr.fa \
        -rm "20,10,5,0,0.002,0.95" \
        -tm "20,10,5,0,0.002,0.95"

  Here w=10 means a sketch width of $2^{10} = 1024$ buckets with depth d=5.

- A practical starting point for chromosome-to-chromosome eukaryotic comparison:

      smashpp -r human_chr1.fa -t chimp_chr1.fa \
        -l 0 -m 500 -th 2.5 -fs L -d 50 -n 8

  Adjust -th and -m based on the expected divergence between the species.
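The first step of the workflow above assumes each chromosome is available as its own FASTA file. If only a multi-record FASTA is at hand, a minimal Python sketch such as the following can pull out one record. The function name, the record ID `chr1`, and the file names in the comment are hypothetical. It matches on the first whitespace-delimited token of the header line.

```python
import io
from typing import Iterable, Iterator


def extract_record(lines: Iterable[str], name: str) -> Iterator[str]:
    """Yield the header and sequence lines of the first record whose ID is `name`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            if keep:
                return                     # reached the next record; stop
            header = line[1:].split()
            keep = bool(header) and header[0] == name
        if keep:
            yield line


# Small in-memory demonstration.
fasta = io.StringIO(">chr1 example\nACGT\nTTGA\n>chr2\nGGCC\n")
print("".join(extract_record(fasta, "chr1")), end="")

# With real files (hypothetical names):
# with open("human.fa") as src, open("human_chr1.fa", "w") as dst:
#     dst.writelines(extract_record(src, "chr1"))
```

Streaming line by line avoids loading a multi-gigabyte assembly into memory.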
Visualizer Options
Use smashpp viz --help to print the full CLI help.
| Option | Value | Description | Default |
|---|---|---|---|
| <POS_FILE> | File | Position file generated by Smash++ in *.pos or *.json format. | Required |
| -o, --output | <SVG_FILE> | Output SVG path. | map.svg |
| -rn, --reference-name | <STRING> | Override the displayed reference label. | Header value |
| -tn, --target-name | <STRING> | Override the displayed target label. | Header value |
| -l, --link | <INT> | Link style between the two maps. | 1 |
| -c, --color | <INT> | Color mode: 0 or 1. | 0 |
| -p, --opacity | <FLOAT> | Connector opacity. | 0.9 |
| -w, --width | <INT> | Sequence bar width. | 10 |
| -s, --space | <INT> | Space between sequences. | 40 |
| -tc, --total-colors | <INT> | Total number of colors to use. | Auto |
| -rt, --reference-tick | <INT> | Reference tick spacing. | Auto |
| -tt, --target-tick | <INT> | Target tick spacing. | Auto |
| -th, --tick-human-readable | <INT> | Human-readable tick labels: 0 false, 1 true. | 1 |
| -m, --min-block-size | <INT> | Minimum block size to display. | 1 |
| -vv, --vertical-view | - | Render a vertical layout. | Disabled |
| -nrr, --no-relative-redundancy | - | Hide relative redundancy coloring. | Disabled |
| -nr, --no-redundancy | - | Hide redundancy coloring. | Disabled |
| -ni, --no-inverted | - | Hide inverted matches. | Disabled |
| -ng, --no-regular | - | Hide regular matches. | Disabled |
| -n, --show-N | - | Highlight N bases. | Disabled |
| -stat, --statistics | - | Save statistics to CSV. | stat.csv |
| -h, --help | - | Show the help message. | - |
| -v, --verbose | - | Print detailed plotting information. | Disabled |
| -V, --version | - | Show the program version. | - |
Example
After running the default installer, the example workflow looks like this:
cd example
../dist/bin/smashpp -r ref -t tar
../dist/bin/smashpp viz -o example.svg ref.tar.pos
JSON output is available too:
cd example
../dist/bin/smashpp --reference ref --target tar --format json
../dist/bin/smashpp viz --output example.svg ref.tar.json
If smashpp is already on your PATH, you can drop the ../dist/bin/ prefix.
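For scripted runs, the two example commands can be assembled in Python. This is a sketch under stated assumptions. The function name is hypothetical, and the position-file name `<ref>.<tar>.<format>` is inferred from the examples in this README (ref.tar.pos, ref.tar.json) for plain filenames in the working directory. Behavior for paths that contain directories is not covered.

```python
from typing import List, Tuple


def example_commands(ref: str, tar: str, svg: str,
                     fmt: str = "pos") -> Tuple[List[str], List[str]]:
    """Return (compare_cmd, viz_cmd) for plain filenames in the working directory."""
    if fmt not in ("pos", "json"):
        raise ValueError("fmt must be 'pos' or 'json'")
    # Assumption: output naming inferred from ref.tar.pos / ref.tar.json above.
    positions = f"{ref}.{tar}.{fmt}"
    compare = ["smashpp", "-r", ref, "-t", tar, "-fmt", fmt]
    viz = ["smashpp", "viz", "-o", svg, positions]
    return compare, viz


compare_cmd, viz_cmd = example_commands("ref", "tar", "example.svg")
print(" ".join(compare_cmd))
print(" ".join(viz_cmd))
# To execute them in order: subprocess.run(cmd, check=True) for each command.
```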
Testing and Benchmarks
After configuring and building from source, run the regression suite with:
ctest --test-dir build --output-on-failure
To make warnings fail the build in local development or CI, configure with:
cmake -S . -B build -DSMASHPP_STRICT_WARNINGS=ON
The repository also includes CMake presets for common maintainer workflows:
cmake --preset strict
cmake --build --preset strict
ctest --preset strict
Focused test labels are available for narrower checks, for example:
ctest --preset strict -L compatibility
ctest --preset strict -L packaging
ctest --preset benchmark-smoke
For local performance checks, run the benchmark target:
cmake --build build --target smashpp-benchmark
To compare against another executable, configure with:
cmake -S . -B build -DSMASHPP_BENCHMARK_BASELINE=/path/to/other/smashpp
cmake --build build --target smashpp-benchmark
The benchmark generates deterministic small and large inputs and writes timing rows to build/benchmarks/summary.csv. When a baseline executable is configured, it also writes build/benchmarks/comparison.csv with median timings and speedups for each scenario. The default large benchmark input is 256 MiB per file. Override the generated input sizes with byte counts when you need a shorter smoke run or a larger production check:
cmake -S . -B build \
-DSMASHPP_BENCHMARK_SMALL_BYTES=131072 \
-DSMASHPP_BENCHMARK_LARGE_BYTES=268435456
Use the same compiler, build type, input sizes, and machine when comparing results.
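The comparison file reports median timings and speedups per scenario. As an illustration of that arithmetic only, the Python sketch below computes a speedup as baseline median divided by candidate median. That definition, the function name, the scenario name, and the timings are assumptions. The sketch does not read summary.csv or comparison.csv, because their column layout is not described here.

```python
from statistics import median
from typing import Dict, List


def speedups(baseline: Dict[str, List[float]],
             candidate: Dict[str, List[float]]) -> Dict[str, float]:
    """Return baseline_median / candidate_median for scenarios present in both."""
    return {name: median(baseline[name]) / median(candidate[name])
            for name in baseline if name in candidate}


# Hypothetical timings in seconds for one scenario.
baseline = {"small": [4.0, 2.0, 3.0]}    # median 3.0
candidate = {"small": [1.0, 2.0, 1.5]}   # median 1.5
print(speedups(baseline, candidate))     # {'small': 2.0}
```

Medians are less sensitive than means to a single slow run, which matters when timings are taken on a shared machine.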
To create portable release archives from the install rules, run:
cmake --build build --target package
The archives are written to build/packages/.
Cite
If you find Smash++ useful in your research, please acknowledge our work by citing:
- M. Hosseini, D. Pratas, B. Morgenstern, A.J. Pinho, "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements," GigaScience, vol. 9, no. 5, 2020. DOI: 10.1093/gigascience/giaa048
Issues
If you encounter an issue, please let us know.
Contributing
Development workflow, testing, benchmarking, and pull request guidance are in CONTRIBUTING.md.
License
Smash++ is distributed under the GNU GPL v3 license.