SynthDNM

February 7, 2026

Classifier for filtering de novo mutations (DNMs) from whole-genome sequencing trios.

The classifier trains on two classes: synthetic DNMs (inherited variants that appear de novo after pedigree swapping), which serve as proxies for real de novos, and putative DNMs (child het, parents hom ref), which are mostly artifacts. It learns to distinguish high-quality real DNMs from noise.
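The putative-DNM genotype pattern described above (child het, both parents hom ref) can be sketched as a simple check. This is a minimal illustration, not the project's code: the function names and the tuple encoding of genotypes (allele indices, 0 = reference, None = missing) are assumptions.

```python
from typing import Optional

# Hypothetical encoding: a genotype is a pair of allele indices
# (0 = reference, 1+ = alternate), with None marking a missing allele.
Genotype = tuple[Optional[int], Optional[int]]


def is_hom_ref(gt: Genotype) -> bool:
    """True if both alleles are called and both are the reference."""
    return gt == (0, 0)


def is_het(gt: Genotype) -> bool:
    """True if both alleles are called and exactly one is the reference."""
    a, b = gt
    if a is None or b is None:
        return False
    return (a == 0) != (b == 0)


def is_putative_dnm(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """Child het with both parents hom ref: the candidate de novo pattern."""
    return is_het(child) and is_hom_ref(father) and is_hom_ref(mother)
```

Under a swapped pedigree the same check would flag inherited variants, since the stand-in parents usually do not carry the child's alleles.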

Pipeline

1. swap_pedigree.py        — Generate swapped pedigree
2. generate_regions.py     — Shard genome into regions for parallel extraction
3. extract_dnm_features.py — Extract genotype features (FORMAT + INFO fields)
4. preprocess_features.py  — Derive features, correct haploid_flag, output Parquet
5. train.py                — Train XGBoost classifier (3 feature sets)
6. classify.py             — Score candidate DNMs
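
Step 1 is what turns inherited variants into synthetic DNMs. The source does not spell out the swapping scheme, so the following is only one plausible sketch: each child is reassigned the parents of a different family, so that variants the child actually inherited look de novo against the unrelated stand-in parents. The `Trio` record and `swap_parents` function are hypothetical and do not reflect the actual interface of swap_pedigree.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trio:
    family_id: str
    child: str
    father: str
    mother: str


def swap_parents(trios: list[Trio]) -> list[Trio]:
    """Give each child the parents of the next family (cyclic rotation).

    Assumes one trio per family. With at least two families, a rotation
    guarantees that no child keeps its own parents.
    """
    if len(trios) < 2:
        raise ValueError("need at least two families to swap parents")
    n = len(trios)
    swapped = []
    for i, trio in enumerate(trios):
        donor = trios[(i + 1) % n]  # family supplying the unrelated parents
        swapped.append(Trio(trio.family_id, trio.child, donor.father, donor.mother))
    return swapped
```

A rotation is used here only because it is the simplest assignment that never maps a child to its own parents; a random derangement would serve the same purpose.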

Feature Sets

Feature sets are defined in scripts/features.toml, each one building on the previous through an extends chain:

Feature Set   Features                     Use Case
universal     21 (FORMAT-only)             Any caller (DeepVariant, DRAGEN, etc.)
gatk          28 (+INFO quality metrics)   GATK HaplotypeCaller
ssc           29 (+VQSLOD)                 SSC-specific GATK with VQSR

Label Semantics

  • synthdnm_prob = P(truth=1) = probability variant is a real DNM
  • High prob (close to 1) = looks like a real de novo = keep
  • Low prob (close to 0) = looks like artifact = discard

Training labels:

  • truth=1: Synthetic DNMs — inherited variants used as proxies for real DNMs (high-quality)
  • truth=0: Putative DNMs — child het, parents hom ref (vast majority are artifacts)
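
The label convention above can be sketched in a few lines. This is an illustration only: the function names and the dict-per-row representation are hypothetical, and treating a score exactly at the threshold as "keep" is an assumption.

```python
def label_training_rows(synthetic: list[dict], putative: list[dict]) -> list[dict]:
    """Attach truth=1 to synthetic DNMs and truth=0 to putative DNMs."""
    labeled = [{**row, "truth": 1} for row in synthetic]
    labeled += [{**row, "truth": 0} for row in putative]
    return labeled


def decide(synthdnm_prob: float, threshold: float = 0.5) -> str:
    """Map P(truth=1) to a call: high probability means keep."""
    if not 0.0 <= synthdnm_prob <= 1.0:
        raise ValueError("synthdnm_prob must be a probability in [0, 1]")
    return "keep" if synthdnm_prob >= threshold else "discard"
```

Because truth=1 marks the synthetic (real-looking) class, a higher score always means "more like a real de novo"; no inversion of the model output is needed before filtering.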

Validation

Evaluated against denovo-db v1.6.1 SSC validated DNMs. See results/validation_summary.json for full metrics.

Model       Validated DNMs (n=3,006)   + AC<=1 filter (n=68,687)
universal   96.8% recalled             98.7% real
gatk        95.5% recalled             97.9% real
ssc         96.5% recalled             98.6% real

Note: Additional downstream filters (gnomAD AF, segmental duplications, simple repeats) have not yet been applied.
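
The "recalled" figures in the table above read as the share of validated DNMs that the model keeps. A minimal sketch of that calculation follows, assuming recall is measured at a fixed probability threshold; the source does not state the threshold used, and the function name is hypothetical.

```python
def recall_at_threshold(probs: list[float], threshold: float = 0.5) -> float:
    """Fraction of known-real DNMs scored at or above the threshold."""
    if not probs:
        raise ValueError("no validated DNMs to evaluate")
    recalled = sum(p >= threshold for p in probs)
    return recalled / len(probs)
```

For example, `recall_at_threshold([0.9, 0.8, 0.2, 0.6])` counts three of four scores at or above 0.5. The same fraction computed over the AC<=1 candidates would give the "real" column, if that column reports the share of candidates classified as real; that interpretation is an assumption, and results/validation_summary.json is the authoritative source for how the metrics are defined.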

Usage

Extract features (SLURM array jobs)

sbatch run_extract_real.sh
sbatch run_extract_synthetic.sh

Preprocess + train

python scripts/preprocess_features.py \
    --feature_dir output/features/ \
    --output output/training_data.parquet \
    --gatk --sample_n 50000 \
    --psam resources/SSC.psam \
    --par resources/hg38_par.bed

python scripts/train.py \
    --input output/training_data.parquet \
    --output_dir output/models/gatk/ \
    --feature_set gatk \
    --config scripts/features.toml \
    --cv_folds 5
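
The preprocessing command above takes a sample table (--psam) and a pseudoautosomal-region BED file (--par), and the pipeline listing says this step corrects haploid_flag. One way such a correction might work is sketched below; the function name and logic are assumptions, not the project's implementation. The sketch treats a genotype as expected-haploid when the sample is male and the site lies on chrX or chrY outside the PARs.

```python
def is_haploid(
    chrom: str,
    pos: int,
    sex: str,
    par_intervals: list[tuple[str, int, int]],
) -> bool:
    """Expected haploid: male sample on chrX/chrY outside the PARs.

    `pos` is 1-based (VCF convention); intervals are 0-based, half-open
    (BED convention). `sex` follows the PLINK sample-file coding, where
    "1" is male and "2" is female.
    """
    if sex != "1" or chrom not in ("chrX", "chrY"):
        return False
    for par_chrom, start, end in par_intervals:
        if par_chrom == chrom and start < pos <= end:
            return False  # inside a PAR, so diploid even in males
    return True
```

The comparison `start < pos <= end` converts between the two coordinate conventions: a 1-based position p falls in a 0-based half-open interval [start, end) exactly when start <= p - 1 < end.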

Classify candidates

python scripts/classify.py \
    --input candidates.parquet \
    --model output/models/gatk/model.json \
    --output scored.tsv \
    --threshold 0.5

Requirements

  • Python 3.12+
  • polars, xgboost, scikit-learn, numpy, cyvcf2