SVDB

May 27, 2026

SVDB is a toolkit for constructing and querying structural variant databases. The databases are constructed using the output vcf files from structural variant callers such as TIDDIT, Manta, Fermikit or Delly. SVDB may also be used to merge SV vcf files from multiple callers or individuals.

Supported public databases

SVDB query supports public databases such as the 1000 Genomes SV map and gnomAD-SV, as well as most multisample SV vcf files.

The 1000 Genomes SV database: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/

The SweGen SVDB: https://swefreq.nbis.se/

The gnomAD SV database: https://storage.googleapis.com/gnomad-public/papers/2019-sv/gnomad_v2_sv.sites.vcf.gz

External databases are queried like this:

svdb --query \
     --query_vcf /home/jesper/vcf/6_pair_limit/P2109_120.clean.dedup.recal_FindSV.vcf \
     --out_occ GNOMAD_AC \
     --out_frq GNOMAD_AF \
     --in_occ AN \
     --in_frq AF \
     --db /home/jesper/Downloads/gnomad_sv/gnomad_v2_sv.sites.vcf

Here AF and AN are INFO tags of the database vcf: AF (a float) is passed as the frequency tag via --in_frq, and AN (an integer) is passed as the count tag via --in_occ. Their values are added to the annotated output vcf as GNOMAD_AF and GNOMAD_AC, respectively.
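As an illustration of the tag renaming described above, here is a minimal Python sketch. It assumes a matching database record has already been found; the function names parse_info and annotate are hypothetical and are not part of SVDB.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string such as 'AF=0.01;AN=21694' into a dict."""
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, _, value = entry.partition("=")
        fields[key] = value  # flag-style entries without '=' map to ''
    return fields


def annotate(query_info: str, db_info: str,
             in_occ: str = "AN", in_frq: str = "AF",
             out_occ: str = "GNOMAD_AC", out_frq: str = "GNOMAD_AF") -> str:
    """Append the database record's count and frequency to the query INFO.

    Hypothetical helper: it only shows how --in_* tags are read from the
    database record and written back under the --out_* names.
    """
    db = parse_info(db_info)
    occ = int(db[in_occ])    # the count tag is an integer
    frq = float(db[in_frq])  # the frequency tag is a float
    return f"{query_info};{out_occ}={occ};{out_frq}={frq}"


print(annotate("SVTYPE=DEL;END=2000", "AF=0.25;AN=100"))
```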

Install

Dependencies: SVDB requires Python 3.9+ and numpy.

Install from PyPI:

pip install svdb

Install from source:

git clone https://github.com/J35P312/SVDB.git
cd SVDB
pip install .

SVDB is available as a container on BioContainers.

Global options

These flags apply to all subcommands and must be placed before the subcommand name:

    --debug     enable debug logging to stderr
                e.g. svdb --debug --build --files sample.vcf

Modules

SVDB consists of modules that are used to build, query, export, and analyse structural variant databases. These are the modules:

Build

This module is used to construct structural variant databases from vcf files. The database may then be queried to compute the frequency of structural variants, or exported into a vcf file.

Sample names are taken from the sample columns of the VCF header (the named columns that follow FORMAT). For VCFs with no sample columns (INFO-only format), the filename stem is used instead.

These are the commands used to construct a structural variation database:

    svdb --build --help
    svdb --build --files sample1.vcf sample2.vcf sample3.vcf
    svdb --build --folder SV_analysis_folder/
    svdb --build --upgrade --files sample1.vcf sample2.vcf --prefix existing_db   # add INS table + backfill insertion sequences from original VCFs

  input (one of required):
    --files [FILES ...]             vcf files to build the db from (cannot be used with --folder)
    --folder FOLDER                 use all vcf files in the given folder

  output:
    --prefix PREFIX                 prefix for the output file (default: SVDB)
    --pass_only                     only include variants with PASS or . in the FILTER field
                                    (--passonly is a deprecated alias; emits a warning)

  upgrade existing db:
    --upgrade                       create the INS table and backfill insertion sequences from the
                                    provided VCFs; schema-only, no existing SVDB rows are changed.
                                    Exits with INFO if the INS table already exists.
                                    Warns (WARNING) for DB samples with no matching VCF provided;
                                    logs (INFO) VCF samples that have no entries in the database.
                                    Requires --files or --folder.

  storage:
    --max_ins_seq_len N             cap on stored insertion sequence length (bp); sequences
                                    longer than N are stored with NULL sequence but retain
                                    SVLEN for length-ratio matching (default: no limit)
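The sample-naming rule and the --pass_only filter described in this section could be sketched as below. This is an illustrative reading of the text, not SVDB's own code; sample_names and passes_filter are hypothetical helper names.

```python
from pathlib import Path


def sample_names(vcf_path: str, header_line: str) -> list:
    """Return sample names from a '#CHROM ...' header line.

    Columns 1-8 are the fixed VCF columns and column 9 is FORMAT, so any
    sample names start at index 9. With no sample columns, fall back to
    the filename stem (assumption: Path.stem, so 'a.vcf.gz' gives 'a.vcf').
    """
    columns = header_line.rstrip("\n").split("\t")
    samples = columns[9:]
    return samples if samples else [Path(vcf_path).stem]


def passes_filter(filter_field: str) -> bool:
    """Mirror the --pass_only description: keep PASS or '.' only."""
    return filter_field in ("PASS", ".")


fixed = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
print(sample_names("calls/sample1.vcf", fixed))
print(sample_names("calls/sample1.vcf", fixed + "\tFORMAT\tNA12878"))
```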

Export

This module is used to export the variants of the SVDB sqlite database.

Export uses a two-pass clustering approach: DBSCAN-inspired spatial grouping (first pass, controlled by --epsilon/--min_pts) followed by overlap/SVLEN/sequence refinement and representative selection (--cluster_method). --coarse skips the second pass. See docs/algorithms.md for a detailed description of all three algorithms.
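To make the first pass more concrete, here is a simplified, one-dimensional DBSCAN-style grouping in Python. It is only a sketch under stated assumptions: real variants have two breakpoints and a chromosome, and the function spatial_groups is hypothetical. See docs/algorithms.md for the actual algorithms.

```python
def spatial_groups(positions, epsilon=500, min_pts=2):
    """Group 1-D positions DBSCAN-style; returns lists of indices."""
    n = len(positions)
    # Each point counts itself as a neighbour.
    neighbours = [
        [j for j in range(n) if abs(positions[i] - positions[j]) <= epsilon]
        for i in range(n)
    ]
    label = [None] * n
    groups = []
    for i in range(n):
        if label[i] is not None or len(neighbours[i]) < min_pts:
            continue
        # i is an unassigned core point: grow a new group from it.
        gid = len(groups)
        groups.append([])
        stack = [i]
        while stack:
            p = stack.pop()
            if label[p] is not None:
                continue
            label[p] = gid
            groups[gid].append(p)
            if len(neighbours[p]) >= min_pts:  # only core points expand
                stack.extend(q for q in neighbours[p] if label[q] is None)
    # Points never reached from a core point become singletons.
    for i in range(n):
        if label[i] is None:
            groups.append([i])
    return [sorted(g) for g in groups]


print(spatial_groups([100, 300, 700, 5000]))
```

With the default min_pts of 2, any pair within epsilon seeds a group, so the first three positions chain together while the distant fourth stays a singleton.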

When the database was built with insertion sequence data (i.e. the INS table is present), insertions are exported with the actual insertion sequence in the ALT column instead of the symbolic <INS> allele. For clusters containing multiple samples, the most common sequence across the cluster is used as the representative ALT allele.

If any cluster member lacks a sequence, either because its insertion was longer than --max_ins_seq_len at build time or because it was called as a symbolic <INS> allele, the entire cluster is exported as <INS> with SVLEN taken from the most common stored length across members; a mixed cluster cannot be faithfully represented by a single sequence.

If the INS table is absent (older database), a warning is emitted and insertions are exported as <INS>; run svdb --build --upgrade --files <original_vcfs> --prefix <existing_db> to create the INS table and backfill insertion data from the original VCFs.
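The representative-allele rule for insertion clusters described above could be sketched as follows. The data layout (a list of (sequence, svlen) pairs, with None for a missing sequence) and the function name representative_alt are assumptions made for illustration only.

```python
from collections import Counter


def representative_alt(members):
    """Pick the exported ALT for one insertion cluster.

    members: list of (sequence_or_None, svlen) tuples, one per member.
    Returns (alt, svlen). svlen is None when every member has a sequence,
    because this sketch does not decide how SVLEN is derived in that case.
    """
    sequences = [seq for seq, _ in members]
    if all(seq is not None for seq in sequences):
        # Every member has a sequence: export the most common one.
        alt = Counter(sequences).most_common(1)[0][0]
        return alt, None
    # Mixed cluster: fall back to the symbolic allele and the most
    # common stored length across members.
    svlen = Counter(length for _, length in members).most_common(1)[0][0]
    return "<INS>", svlen


print(representative_alt([("ACGT", 4), ("ACGT", 4), ("ACGA", 4)]))
print(representative_alt([("ACGT", 4), (None, 300), (None, 300)]))
```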

    svdb --export --help
    svdb --export --db database.db

  input (required):
    --db DB                     the SQLite database to export

  output:
    --prefix PREFIX             prefix for the output file (default: same as input)
    --no_merge                  skip merging; print all variants as-is
    --strip_chr                 strip the 'chr' prefix from chromosome names in the output
                                (e.g. 'chr1' -> '1'); names are stored as-is in the db
    --samples {on,off}          include per-sample genotype columns (default: on);
                                use 'off' for sites-only output (FORMAT/GT omitted, OCC/FRQ kept)

  algorithm — SV matching:
    --bnd_distance BND_DISTANCE maximum distance between two similar precise breakpoints (default: 2500)
    --overlap OVERLAP           minimum reciprocal overlap to merge two events;
                                must be in [0.0, 1.0] (0 = anything touching; 1 = identical) (default: 0.8)

  algorithm — insertion matching (requires INS table):
    --data_profile {sample,cohort,position_only}
                                preset for all insertion parameters; individual --ins_* flags override:
                                  sample:        strict  (dist=25, ratio=0.90, sim=0.85) - same individual / technology
                                  cohort:        permissive (dist=50, ratio=0.80, sim=0.75) - cross-individual or cross-caller
                                  position_only: no sequence gate (dist=50, ratio=0.90)
    --ins_distance INS_DISTANCE maximum distance to cluster two insertions
                                (default: 25; profile cohort/position_only: 50)
    --ins_svlen_ratio RATIO     minimum SVLEN ratio (min/max) for insertion clustering;
                                must be in [0.0, 1.0] (default: 0.90; profile cohort: 0.80)
    --ins_seq_similarity THRESHOLD
                                minimum Levenshtein sequence similarity; must be in [0.0, 1.0];
                                explicit value overrides --data_profile (effective default: 0.75)

  algorithm — clustering:
    --coarse                    skip second-pass refinement; use centroid from DBSCAN spatial
                                clusters directly. Produces fewer, coarser clusters.
                                (--DBSCAN is a deprecated alias; emits a warning)
    --epsilon EPSILON           DBSCAN-style spatial grouping radius in bp (default: 500)
    --min_pts MIN_PTS           DBSCAN-style min_pts: minimum variants within --epsilon to seed
                                a cluster; isolated variants become singletons; must be a whole
                                number ≥ 1 (default: 2, meaning any pair within --epsilon forms a cluster)
    --cluster_method {star,union_find}
                                second-pass clustering algorithm (default: star):
                                  star        greedy; highest-degree variant claims neighbours.
                                              No transitivity. A variant overlapping multiple
                                              representatives appears in each cluster — OCC
                                              reflects all groups it genuinely belongs to.
                                  union_find  transitive closure; A-B + B-C → {A,B,C}.
                                              Exclusive membership; fewer, larger clusters.

  performance:
    --max_ins_seq_len N         sequences longer than N bp fall back to position+SVLEN
                                (default: 1000); use 0 for no cap (all sequences compared)
    --memory                    load the db into memory: higher memory use, lower export time
    --workers N                 parallel worker processes (default: 0 = all logical CPUs; 1 = serial)
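The difference between the two --cluster_method choices can be illustrated on a small graph in which an edge means "these two variants pass the matching criteria". The Python sketch below is one possible reading of the option descriptions above, not SVDB's implementation; star_clusters and union_find_clusters are hypothetical names.

```python
def star_clusters(n, edges):
    """Greedy star clustering: highest-degree variant claims its neighbours."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    covered = set()
    clusters = []
    # Visit variants by descending degree; ties broken by index.
    for node in sorted(adj, key=lambda i: (-len(adj[i]), i)):
        if node in covered:
            continue
        # A new representative takes all direct neighbours, including
        # ones already claimed by another representative.
        cluster = {node} | adj[node]
        clusters.append(sorted(cluster))
        covered |= cluster
    return clusters


def union_find_clusters(n, edges):
    """Transitive closure: connected components via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


chain = [(0, 1), (1, 2), (2, 3)]  # A-B, B-C, C-D
print(star_clusters(4, chain))
print(union_find_clusters(4, chain))
```

On the chain A-B, B-C, C-D, the star sketch yields two clusters that share variant C, whereas union-find collapses everything into a single cluster.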

Query

The query module is used to query one or more structural variant databases. Typically a database is constructed using the build module. However, since this module uses the genotype field of the structural variant database vcf to compute the frequency of structural variants, a wide range of files can be used as a database. The query module requires a query vcf, as well as a database file (either a multisample vcf or an SVDB sqlite database):

    svdb --query --help
    svdb --query --query_vcf patient1.vcf --db control_db.vcf
    svdb --query --query_vcf patient1.vcf --db control_db1.vcf,control_db2.vcf \
         --prefix test --in_occ default,Obs --in_frq FRQ,default \
         --out_frq db1_AF,db2_Frq --out_occ db1_AC,db2_Obs
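As a rough illustration of how a frequency can be derived from the genotype columns of a multisample vcf, the Python sketch below counts carriers per sample. This is an assumption about the counting rule (carriers rather than alleles), and carrier_frequency is a hypothetical name; SVDB's own definition may differ.

```python
def carrier_frequency(genotypes):
    """Return (occurrences, frequency) for a list of GT strings.

    A sample is counted as a carrier if it has at least one allele that
    is neither reference ('0') nor missing ('.').
    """
    def is_carrier(gt):
        alleles = gt.replace("|", "/").split("/")
        return any(allele not in ("0", ".") for allele in alleles)

    occ = sum(is_carrier(gt) for gt in genotypes)
    frq = occ / len(genotypes) if genotypes else 0.0
    return occ, frq


print(carrier_frequency(["0/1", "0/0", "1/1", "./."]))
```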

  input:
    --query_vcf VCF             (required) query vcf file
    --db DB                     db vcf, or a comma-separated list (no effect on --bedpedb)
    --sqdb SQDB                 SVDB sqlite db, or a comma-separated list
    --bedpedb BEDPEDB           SV db in chrA-posA-chrB-posB-type-count-frequency format,
                                or a comma-separated list
                                (at least one of --db / --sqdb / --bedpedb is required)

  output:
    --prefix PREFIX             prefix for the output file (default: print to stdout);
                                required when querying multiple databases
    --out_occ OUT_OCC           output tag for allele count (default: OCC)
    --out_frq OUT_FRQ           output tag for allele frequency (default: FRQ)
    --in_occ IN_OCC             allele count tag in the db INFO column (usually OCC or AN);
                                required when querying multiple databases
    --in_frq IN_FRQ             allele frequency tag in the db INFO column (usually FRQ or AF);
                                required when querying multiple databases

  algorithm — SV matching:
    --bnd_distance BND_DISTANCE maximum distance between two similar breakpoints (default: 10000)
    --overlap OVERLAP           minimum reciprocal overlap to match two events;
                                must be in [0.0, 1.0] (0 = anything touching; 1 = identical) (default: 0.6)
    --max_frq MAX_FRQ           only report variants with frequency at or below this value
                                (default: 1, i.e. all variants)
    --no_var                    count overlapping variants of different type as hits in the db

  algorithm — insertion matching (--db and --sqdb with INS table; no effect with --bedpedb):
    --data_profile {sample,cohort,position_only}
                                preset for all insertion parameters; individual --ins_* flags override:
                                  sample:        strict  (dist=25, ratio=0.90, sim=0.85) - same individual / technology
                                  cohort:        permissive (dist=50, ratio=0.80, sim=0.75) - cross-individual or cross-caller
                                  position_only: no sequence gate (dist=50, ratio=0.90)
    --ins_distance INS_DISTANCE maximum distance to match two insertions
                                (default: 25; profile cohort/position_only: 50)
    --ins_svlen_ratio RATIO     minimum SVLEN ratio (min/max) for insertions with known length;
                                must be in [0.0, 1.0] (default: 0.90; profile cohort: 0.80)
    --ins_seq_similarity THRESHOLD
                                minimum Levenshtein sequence similarity; must be in [0.0, 1.0];
                                explicit value overrides --data_profile (effective default: 0.75)

  performance:
    --max_ins_seq_len N         sequences longer than N bp fall back to position+SVLEN
                                (default: 1000); use 0 for no cap (all sequences compared)
    --memory                    load the db into memory: higher memory use, lower query time
                                (sqdb only)
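For readers unfamiliar with the --overlap threshold, the following Python sketch shows the textbook definition of reciprocal overlap on half-open intervals. It is an assumption that this matches the option's meaning closely enough for intuition; SVDB's internal formula may differ, and both function names are hypothetical.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Smallest fraction of either interval covered by the shared region."""
    shared = min(end_a, end_b) - max(start_a, start_b)
    if shared <= 0:
        return 0.0  # the intervals do not intersect
    return min(shared / (end_a - start_a), shared / (end_b - start_b))


def events_match(a, b, overlap=0.6):
    """a and b are (start, end) tuples; 0.6 is the query default above."""
    return reciprocal_overlap(*a, *b) >= overlap


print(reciprocal_overlap(100, 200, 150, 250))
print(events_match((100, 200), (110, 205)))
```

Note that a large event fully containing a small one still scores low, because the shared region covers only a small fraction of the larger event.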

Merge

The merge module merges variants within one or more vcf files. It can be used either to merge the output of multiple callers, or to merge variants that are called multiple times due to noise or some other error:

    svdb --merge --help
    svdb --merge --vcf patient1_lumpy.vcf patient1_cnvnator.vcf patient1_TIDDIT.vcf
    svdb --merge --vcf patient1_lumpy.vcf:one patient1_cnvnator.vcf:2 patient1_TIDDIT.vcf:tiddit \
         --priority tiddit,2,one

Variants are merged and output in the order of the input files (first file takes precedence).
Use --priority to override the order explicitly.
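The file:tag convention and the --priority list can be pictured with the short Python sketch below. It assumes every input file carries a tag whenever --priority is given, and that file paths themselves contain no colon; order_by_priority is a hypothetical helper, not part of SVDB.

```python
def order_by_priority(vcf_args, priority=None):
    """Return input paths in precedence order.

    vcf_args: values as passed to --vcf, optionally suffixed with ':tag'.
    priority: comma-separated tags as passed to --priority, or None.
    """
    tagged = []
    for arg in vcf_args:
        path, sep, tag = arg.rpartition(":")
        if not sep:  # no ':tag' suffix present
            path, tag = arg, None
        tagged.append((path, tag))
    if priority is None:
        # Without --priority, the input order decides precedence.
        return [path for path, _ in tagged]
    rank = {tag: i for i, tag in enumerate(priority.split(","))}
    return [path for path, tag in sorted(tagged, key=lambda item: rank[item[1]])]


files = ["patient1_lumpy.vcf:one", "patient1_cnvnator.vcf:2", "patient1_TIDDIT.vcf:tiddit"]
print(order_by_priority(files, "tiddit,2,one"))
```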

  input (required):
    --vcf VCF [VCF ...]         input vcf files to merge

  input control:
    --priority ORDER            prioritise input files; format: --vcf a.vcf:1 b.vcf:2 --priority 2,1
    --pass_only                 merge only variants labeled PASS
    --no_intra                  skip merging of variants within the same vcf
    --same_order                assume sample columns are in the same order across all input files
    --no_tag                    do not add VARID and set entries to the INFO field
                                (--notag is a deprecated alias; emits a warning)

  algorithm — SV matching:
    --bnd_distance BND_DISTANCE maximum distance between two similar precise breakpoints (default: 2000)
    --overlap OVERLAP           minimum reciprocal overlap to merge two events;
                                must be in [0.0, 1.0] (0 = anything touching; 1 = identical) (default: 0.95)
    --no_var                    variants of different type will be merged

  algorithm — insertion matching:
    --data_profile {sample,cohort,position_only}
                                preset for all insertion parameters; individual --ins_* flags override:
                                  sample:        strict  (dist=25, ratio=0.90, sim=0.85) - same individual / technology
                                  cohort:        permissive (dist=50, ratio=0.80, sim=0.75) - cross-individual or cross-caller
                                  position_only: no sequence gate (dist=50, ratio=0.90)
    --ins_distance INS_DISTANCE maximum distance to merge two insertions
                                (default: 25; profile cohort/position_only: 50)
    --ins_svlen_ratio RATIO     minimum SVLEN ratio (min/max) for insertions with known length;
                                must be in [0.0, 1.0] (default: 0.90; profile cohort: 0.80)
    --ins_seq_similarity THRESHOLD
                                minimum Levenshtein sequence similarity; must be in [0.0, 1.0];
                                explicit value overrides --data_profile (effective default: 0.75)

  performance:
    --max_ins_seq_len N         sequences longer than N bp fall back to position+SVLEN
                                (default: 1000); use 0 for no cap (all sequences compared)
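The three insertion gates (distance, SVLEN ratio, sequence similarity) and the --data_profile presets listed above can be combined as in this Python sketch. Several details are assumptions made for illustration: SVLEN is taken as the sequence length, sequences are non-empty, similarity is one minus the Levenshtein distance divided by the longer length, and all function names are hypothetical. SVDB may normalise or order the checks differently.

```python
# Preset values copied from the --data_profile description above.
PROFILES = {
    "sample": {"dist": 25, "ratio": 0.90, "sim": 0.85},
    "cohort": {"dist": 50, "ratio": 0.80, "sim": 0.75},
    "position_only": {"dist": 50, "ratio": 0.90, "sim": None},
}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def insertions_match(pos_a, seq_a, pos_b, seq_b, profile="cohort"):
    """Apply the distance, length-ratio and similarity gates in turn."""
    p = PROFILES[profile]
    if abs(pos_a - pos_b) > p["dist"]:
        return False
    len_a, len_b = len(seq_a), len(seq_b)  # assumes non-empty sequences
    if min(len_a, len_b) / max(len_a, len_b) < p["ratio"]:
        return False
    if p["sim"] is None:  # position_only: no sequence gate
        return True
    return similarity(seq_a, seq_b) >= p["sim"]


print(insertions_match(1000, "ACGTACGTAC", 1010, "AAGTTCGTAC", "cohort"))
print(insertions_match(1000, "ACGTACGTAC", 1010, "AAGTTCGTAC", "sample"))
```

The two sequences differ at two of ten bases, so their similarity under this normalisation is 0.8: enough for the cohort preset but not for the stricter sample preset.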

For developers

Runtime dependencies are pinned via pip-tools. Edit requirements.in, then regenerate:

pip-compile requirements.in --output-file requirements.txt --strip-extras

Dev tools (requirements-dev.txt) are intentionally unpinned — they span Python 3.9–3.14 where transitive pins would differ per version.

Install development dependencies:

pip install -r requirements-dev.txt

Run tests (includes ruff linting and mypy type checking):

pytest

Run ruff or mypy standalone:

ruff check svdb/
mypy svdb/ --ignore-missing-imports

Configuration lives in pyproject.toml (build system, ruff, pytest settings). The legacy setup.py is retained only for optional Cython compilation of merge_vcf_module_cython.

Run a dependency security audit with GuardDog (not part of CI; run it locally, for example before bumping dependencies):

pip install guarddog
grep -hv '^[[:space:]]*#\|^[[:space:]]*$' requirements.txt requirements-dev.txt \
  | sed 's/[=><].*//' | tr -d ' ' \
  | while read pkg; do guarddog pypi scan "$pkg"; done

See docs/architecture.md for a module overview and data flow diagrams.

Profiling

A cProfile-based profiling harness lives in scripts/profile_svdb.py. It runs a standard battery of commands (merge, build, export, query) on real VCF data and prints per-function timing.

Set up a local config file (gitignored):

cp scripts/profile_config.toml.example scripts/profile_config.toml
# fill in your VCF paths and caller names

Then run:

python scripts/profile_svdb.py               # default: top 15 functions, sorted by cumulative time
python scripts/profile_svdb.py --top 20 --sort tottime
python scripts/profile_svdb.py --config /path/to/my_config.toml

The script always profiles the local checkout (not any installed package), so it is safe to use during optimisation work.
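For orientation, the general cProfile pattern that such a harness builds on looks roughly like the sketch below. It is not the content of scripts/profile_svdb.py; profile_call is a hypothetical helper that uses only the standard-library cProfile and pstats modules, with defaults mirroring the top 15 functions sorted by cumulative time.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, sort="cumulative"):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(sort).print_stats(top)  # keep only the top N rows
    return result, buffer.getvalue()


def workload(n):
    """Stand-in for an svdb command; purely illustrative."""
    return sum(i * i for i in range(n))


result, report = profile_call(workload, 10000, top=5, sort="tottime")
print(report)
```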