mafsmith

May 20, 2026

rewrites.bio - Follows best practice principles for rewriting bioinformatics tools with AI

A complete, fast rewrite of vcf2maf in Rust, designed as a drop-in replacement.

mafsmith reimplements the full allele-normalisation and field-mapping logic of vcf2maf.pl and uses fastVEP instead of VEP for annotation. All four conversion subcommands (vcf2maf, maf2vcf, vcf2vcf, maf2maf) have been validated to 0 conversion-field differences against their reference Perl counterparts across 15+ caller types and 6 representative datasets.

Acknowledgements

mafsmith is a Rust-based adaptation of the design, field conventions, and edge-case handling from vcf2maf by @ckandoth (ORCID: 0000-0002-1345-3573).

Performance

Benchmarked on 7 GIAB NIST v4.2.1 GRCh38 samples (HG001–HG007, ~3.9–4.0M variants each), conversion step only (--skip-annotation / --inhibit-vep), on an AWS c6a.4xlarge (AMD EPYC 7R13, 16 vCPU, 30 GiB RAM):

| Tool | Mean time | Variants/s | Speedup |
|---|---|---|---|
| mafsmith (16-core) | 7.050 ± 0.371 s | ~559,000 | 79.4× |
| mafsmith (1-core) | 11.755 ± 0.462 s | ~335,000 | 47.6× |
| vcf2maf.pl --inhibit-vep | 559.073 ± 17.016 s | ~7,000 | (baseline) |
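
As a rough cross-check of the figures above, the speedup and throughput columns can be approximated from the mean times alone. This is only a sketch: the per-sample variant count of 3,950,000 is an assumed midpoint of the stated ~3.9–4.0M range, and the ratio of mean times may differ slightly from the reported speedup (for example, if the reported value averages per-sample ratios, which is a guess).

```python
# Mean wall-clock times in seconds, copied from the benchmark table.
mean_seconds = {
    "mafsmith (16-core)": 7.050,
    "mafsmith (1-core)": 11.755,
    "vcf2maf.pl --inhibit-vep": 559.073,
}
baseline = mean_seconds["vcf2maf.pl --inhibit-vep"]

# Assumption: a round midpoint of the ~3.9-4.0M variants per sample.
variants_per_sample = 3_950_000


def speedup(tool: str) -> float:
    """Ratio of the baseline mean time to this tool's mean time."""
    return baseline / mean_seconds[tool]


def throughput(tool: str) -> float:
    """Approximate variants converted per second."""
    return variants_per_sample / mean_seconds[tool]


for tool in mean_seconds:
    print(f"{tool}: {speedup(tool):.1f}x, ~{throughput(tool):,.0f} variants/s")
```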

Full end-to-end pipeline (mafsmith + fastVEP vs. vcf2maf.pl + VEP): 25–83× faster depending on VEP fork count.

See results/conversion_benchmark_giab_grch38.md for full per-sample data, cost savings, and carbon estimates.

Installation

Prerequisites

  • Rust toolchain (cargo)
  • Python ≥ 3.8 (for mafsmith fetch)

Build

cargo build --release
# binary at target/release/mafsmith

Download reference data

# GRCh38 (default)
mafsmith fetch

# GRCh37
mafsmith fetch --genome grch37

# Mouse (GRCm39)
mafsmith fetch --genome grcm39

# Multiple genomes
mafsmith fetch --genome grch38,grch37

# Use existing files instead of downloading
mafsmith fetch --gff3 /path/to/genes.gff3 --ref-fasta /path/to/ref.fa

Reference data and the fastVEP binary are stored in ~/.mafsmith/ by default.


Usage

vcf2maf — Convert VCF to MAF

# Single-sample VCF
mafsmith vcf2maf \
  -i tumor.vcf.gz \
  -o output.maf \
  --tumor-id SAMPLE_ID

# Paired tumor/normal VCF
mafsmith vcf2maf \
  -i paired.vcf.gz \
  -o output.maf \
  --vcf-tumor-id TUMOR \
  --tumor-id TUMOR \
  --vcf-normal-id NORMAL \
  --normal-id NORMAL

# Skip annotation (VCF already has CSQ fields from a prior fastVEP run)
mafsmith vcf2maf \
  -i annotated.vcf \
  -o output.maf \
  --skip-annotation

# GRCh37
mafsmith vcf2maf -i input.vcf -o output.maf --genome grch37

Key options

| Flag | Description |
|---|---|
| --genome | Reference assembly: grch38 (default), grch37, grcm39 (mouse, unvalidated) |
| --tumor-id | MAF Tumor_Sample_Barcode (defaults to VCF sample column name) |
| --normal-id | MAF Matched_Norm_Sample_Barcode |
| --vcf-tumor-id | Sample column name in VCF for tumor |
| --vcf-normal-id | Sample column name in VCF for normal |
| --custom-enst | File of preferred Ensembl transcript IDs (one per line) |
| --retain-ann | Comma-separated CSQ field names to pass through to MAF |
| --skip-annotation | Use existing CSQ annotations in VCF (skips fastVEP) |
| --strict | Match vcf2maf.pl exactly: output . for depth fields when AD arrays are truncated |
| --min-hom-vaf | VAF threshold for inferring homozygous-alt genotype (default: 0.7) |
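
The --min-hom-vaf rule can be pictured as follows. This is a minimal sketch, not mafsmith's actual code: the function name, the definition of VAF as alt depth over total depth, the inclusive comparison, and the handling of zero depth are all assumptions made for illustration.

```python
def infer_genotype(alt_depth: int, total_depth: int, min_hom_vaf: float = 0.7) -> str:
    """Guess a genotype from read depths when the VCF does not supply one.

    Hypothetical helper: returns "1/1" (homozygous-alt) when the variant
    allele fraction reaches the threshold, otherwise "0/1".
    """
    if total_depth <= 0:
        return "./."  # no reads, so nothing can be inferred
    vaf = alt_depth / total_depth
    # Assumption: the threshold is inclusive (vaf >= min_hom_vaf).
    return "1/1" if vaf >= min_hom_vaf else "0/1"


print(infer_genotype(alt_depth=18, total_depth=20))  # VAF 0.9
print(infer_genotype(alt_depth=6, total_depth=20))   # VAF 0.3
```

Raising the threshold makes the homozygous call more conservative, which may matter for low-purity tumor samples.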

maf2vcf — Convert MAF to VCF

Reconstructs a standards-compliant VCF from a MAF file. Recovers allele representations, multi-allelic sites, genotype strings (GT), and allele depth fields (GT:AD:DP when depth columns are present) from MAF columns. Anchor bases for indels are looked up from the reference FASTA.

mafsmith maf2vcf \
  -i input.maf \
  -o output.vcf \
  --genome grch38

| Flag | Description |
|---|---|
| --genome | Reference assembly (default: grch38) |
| --ref-fasta | Override reference FASTA path |
| --per-tn-vcfs | Write one VCF per tumor/normal pair to a directory |
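
The anchor-base lookup mentioned above follows from how the two formats represent indels: MAF writes the missing side as "-", while VCF requires a shared leading reference base. The sketch below illustrates that conversion under the usual MAF conventions (for insertions Start_Position is the base before the inserted sequence, for deletions it is the first deleted base). The function name is hypothetical and an in-memory string stands in for the reference FASTA; it is not mafsmith's implementation.

```python
def maf_to_vcf_alleles(ref_seq: str, start: int, ref_allele: str, alt_allele: str):
    """Return (POS, REF, ALT) for one MAF variant.

    ref_seq is the chromosome sequence; start is the 1-based MAF Start_Position.
    """
    if ref_allele == "-":
        # Insertion: the base at Start_Position is the anchor.
        anchor = ref_seq[start - 1]
        return start, anchor, anchor + alt_allele
    if alt_allele == "-":
        # Deletion: the anchor is the base just before the first deleted base.
        pos = start - 1
        anchor = ref_seq[pos - 1]
        return pos, anchor + ref_allele, anchor
    # SNVs and other substitutions need no anchor base.
    return start, ref_allele, alt_allele


chrom = "ACGTACGT"  # toy reference, positions 1..8
print(maf_to_vcf_alleles(chrom, 3, "-", "TT"))  # insertion after position 3
print(maf_to_vcf_alleles(chrom, 4, "TA", "-"))  # deletion of positions 4-5
```

A deletion starting at position 1 has no preceding base, so a real converter needs a separate rule for that case; the sketch does not handle it.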

vcf2vcf — Normalize a VCF

Passes a VCF through with FORMAT field standardization and optional sample column selection. Non-PASS variants and multi-allelic ALTs are preserved. Only ref-only records (ALT=.) are dropped.

mafsmith vcf2vcf \
  -i input.vcf.gz \
  -o normalized.vcf \
  --vcf-tumor-id TUMOR \
  --vcf-normal-id NORMAL

| Flag | Description |
|---|---|
| --vcf-tumor-id | Tumor sample column to select |
| --vcf-normal-id | Normal sample column to select |
| --genome | Reference assembly (default: grch38) |
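
The record-retention rule described for vcf2vcf (keep non-PASS and multi-allelic records, drop only ref-only records) amounts to a single check on the ALT column. A minimal sketch, with a hypothetical function name and no claim to mirror mafsmith's internals:

```python
def keep_record(line: str) -> bool:
    """Decide whether a VCF line survives the pass-through.

    Header lines are always kept. Data records are kept unless ALT is ".",
    regardless of FILTER status or the number of ALT alleles.
    """
    if line.startswith("#"):
        return True
    fields = line.rstrip("\n").split("\t")
    alt = fields[4]  # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
    return alt != "."


records = [
    "##fileformat=VCFv4.2\n",
    "chr1\t100\t.\tA\t.\t50\tPASS\t.\n",       # ref-only: dropped
    "chr1\t200\t.\tC\tT,G\t12\tLowQual\t.\n",  # non-PASS, multi-allelic: kept
]
kept = [r for r in records if keep_record(r)]
print(len(kept))
```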

maf2maf — Reannotate a MAF

Re-annotates an existing MAF by internally converting it to VCF, running fastVEP, and converting back to MAF. Equivalent to maf2maf.pl with VEP replaced by fastVEP.

mafsmith maf2maf \
  -i input.maf \
  -o reannotated.maf \
  --genome grch38

| Flag | Description |
|---|---|
| --genome | Reference assembly (default: grch38) |
| --custom-enst | File of preferred Ensembl transcript IDs |
| --fastvep-path | Override fastVEP binary path |

Validation

vcf2maf

Validated to 0 conversion-field mismatches in --strict mode against vcf2maf.pl --inhibit-vep across the following caller types:

| Caller | VCF type | Source |
|---|---|---|
| DeepVariant 1.2.0 | Single-sample gVCF (GT=0/0/./.), with and without VAF field | syn31624545; syn4988483 |
| GATK MuTect2 | Single-sample GRCh38 | syn64156972; syn31624525 |
| GATK MuTect2 (paired T/N) | Paired tumor/normal (GT:AD:AF:DP:F1R2:F2R1:SB) | GIAB HG008; SEQC2 HCC1395 |
| FreeBayes | Single-sample | syn31624535 |
| Strelka2 germline | variants.vcf and genome.vcf formats | syn31624939; syn31624637 |
| Strelka2 somatic SNVs | Paired T/N (AU/CU/GU/TU depth fields) | GIAB HG008; SEQC2 HCC1395 |
| Strelka2 somatic indels | Paired T/N (TAR/TIR depth fields) | syn68172710; GIAB HG008 |
| SV callers (Manta/DELLY) | SV-only (BND, DEL, DUP, INV symbolic ALTs) | syn21296193 |
| VarScan2 somatic | Paired T/N (RD+AD FORMAT) | syn6840402 |
| VarDict | Paired T/N (RD strand-bias field alongside AD) | syn6039268 |
| SomaticSniper | Paired T/N (DP4+BCOUNT FORMAT, no AD) | SEQC2 HCC1395 |
| DKFZ SNV caller | Paired T/N (GT:DP:DP4, GRCh37, no chr prefix) | ICGC PCAWG |
| GIAB germline benchmarks (HG001–HG007) | Multi-caller consensus (ADALL field) | NIST v4.2.1 |
| ICGC PCAWG consensus SNV/MNV | Consensus VCF (depth in INFO, no FORMAT), GRCh37 | 1,902 samples |
| DepMap CCLE WGS | GATK MuTect2 single-sample, hg38 | 802 samples |
| COSMIC v103 (GenomeScreensMutant) | Aggregate somatic mutations from genome-wide cancer screens (no sample columns) | COSMIC |
| COSMIC v103 (NonCodingVariants) | Aggregate non-coding somatic variants (no sample columns) | COSMIC |

When run with the same Ensembl VEP 112 annotation cache, mafsmith produces 0 conversion differences versus vcf2maf.pl across 23 representative datasets (GRCh38 and GRCh37).
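
A "0 conversion differences" check of this kind can be approximated by comparing selected columns of two MAF files row by row. The sketch below is an illustration only: the column list is a hypothetical subset (the README does not enumerate which fields count as conversion fields), and it assumes both files list variants in the same order.

```python
import csv
import io

# Hypothetical subset of MAF columns treated as "conversion fields".
CONVERSION_FIELDS = [
    "Chromosome", "Start_Position", "End_Position",
    "Reference_Allele", "Tumor_Seq_Allele2",
    "t_depth", "t_ref_count", "t_alt_count",
]


def read_maf(handle):
    """Parse a MAF stream into dict rows, skipping '#' comment lines."""
    rows = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE))


def count_diffs(expected, observed, fields=CONVERSION_FIELDS):
    """Count mismatching cells; a missing or extra row counts as one diff."""
    diffs = abs(len(expected) - len(observed))
    for exp_row, obs_row in zip(expected, observed):
        diffs += sum(exp_row.get(f) != obs_row.get(f) for f in fields)
    return diffs


reference = "#version 2.4\nChromosome\tStart_Position\tt_depth\nchr1\t100\t30\n"
candidate = "Chromosome\tStart_Position\tt_depth\nchr1\t100\t.\n"
print(count_diffs(read_maf(io.StringIO(reference)), read_maf(io.StringIO(candidate))))
```

Annotation columns are deliberately left out of the comparison here, since they depend on the annotator and cache rather than on the conversion logic.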

maf2vcf, vcf2vcf, maf2maf

Validated to 0 conversion-field differences against maf2vcf.pl, vcf2vcf.pl, and maf2maf.pl across 6 datasets (2,000 variants each):

DatasetGenomemaf2vcfvcf2vcfmaf2maf
SEQC2 HCC1395, GATK MuTect2GRCh380 diffs0 diffs0 diffs
SEQC2 HCC1395, Strelka2 somaticGRCh380 diffs0 diffs
SEQC2 HCC1395, SomaticSniperGRCh380 diffs0 diffs
GIAB HG008, GATK MuTect2GRCh380 diffs0 diffs0 diffs
GIAB HG001 germline benchmarkGRCh380 diffs0 diffs
PCAWG consensus (0009b464)GRCh370 diffs0 diffs0 diffs

Known intentional differences from vcf2maf.pl

  • Default vs. --strict mode: when a caller emits a truncated AD array (fewer values than REF + all ALTs), mafsmith extracts available depth counts by default. Use --strict to output . for those fields instead, exactly matching vcf2maf.pl.
  • SV secondary rows: mafsmith emits secondary breakpoint rows with the actual partner chromosome and position. vcf2maf.pl leaves these fields blank (a known bug in the reference implementation).
  • Multi-allelic tie-breaking: when two ALTs have identical depth, tie-breaking may differ from vcf2maf.pl for a small number of variants (~4 per 50k-variant file).
  • Transcript selection at gene boundaries: for variants near 5′/3′ UTR–flank and Intron/RNA boundaries, mafsmith and vcf2maf.pl may select different canonical transcripts, affecting Variant_Classification for ~2–5 variants per dataset. This reflects different gene-model versions rather than a conversion bug.
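
The first difference above (default vs. --strict handling of truncated AD arrays) can be illustrated with a small sketch. The function and its parameters are hypothetical, and which counts the default mode actually recovers is an assumption; only the strict behaviour of emitting "." is taken from the description.

```python
def depth_fields(ad, n_alt, alt_index=1, strict=False):
    """Return (ref_count, alt_count) as MAF strings.

    ad        -- list of per-allele depths from the FORMAT AD field
    n_alt     -- number of ALT alleles at the site
    alt_index -- 1-based index of the ALT allele being reported
    """
    truncated = len(ad) < 1 + n_alt  # AD should hold REF plus every ALT
    if truncated and strict:
        return ".", "."  # strict mode: no depth values for truncated arrays
    # Assumed default behaviour: use whatever values are present.
    ref_count = str(ad[0]) if len(ad) > 0 else "."
    alt_count = str(ad[alt_index]) if len(ad) > alt_index else "."
    return ref_count, alt_count


# Two ALT alleles but only two AD values: the array is truncated.
print(depth_fields([10, 5], n_alt=2))
print(depth_fields([10, 5], n_alt=2, strict=True))
```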

Supported genomes

| Assembly | Species | Validation status |
|---|---|---|
| GRCh38 | Human (hg38) | Validated (0 diffs across 21 caller types) |
| GRCh37 | Human (hg19/b37) | Validated (0 diffs across 3 caller types) |
| GRCm39 | Mouse | Available; not yet validated against vcf2maf.pl |

License

Apache 2.0