somatic_haplotag.md

November 24, 2025

Somatic haplotagging command

SNV and INDEL somatic haplotagging

This command estimates tumor purity and calls somatic variants from a tumor-normal pair of BAM files, then tags (assigns) each read in the tumor BAM to a haplotype based on the phased normal SNP VCF and the tumor SNV VCF. See Input Preparation for details on how to prepare the required input files.

In addition, the haplotype block of each read is stored in the PS tag (only for reads with phased SNPs). The phased VCF can be generated by other programs as long as the PS or HP tags are encoded. Specify --log to additionally write a plain-text file containing the haplotype tag of each read, so the tags can be inspected without parsing the BAM.

longphase-s somatic_haplotag \
-s phased_normal_snp.vcf \
-b normal.bam \
--tumor-snv-file tumor_snv.vcf \
--tumor-bam-file tumor.bam \
-r reference.fasta \
-t 8 \
-o tagged_tumor_bam_prefix \
--tagSupplementary \
-q 20
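When the command is launched from a pipeline script, the same invocation can be assembled as an argument list. The following is a minimal Python sketch, not part of LongPhase-S: the helper name `build_somatic_haplotag_cmd` is hypothetical, it uses only the flags shown in the command above, and it assumes the `longphase-s` executable is on `PATH`. The sketch builds and prints the command but does not run it.

```python
import shlex


def build_somatic_haplotag_cmd(normal_vcf, normal_bam, tumor_vcf, tumor_bam,
                               reference, out_prefix, threads=1,
                               min_mapq=None, tag_supplementary=False):
    """Assemble the argument list for somatic_haplotag (hypothetical helper)."""
    cmd = [
        "longphase-s", "somatic_haplotag",
        "-s", normal_vcf,                # phased normal SNP VCF
        "-b", normal_bam,                # normal BAM
        "--tumor-snv-file", tumor_vcf,   # tumor SNV VCF
        "--tumor-bam-file", tumor_bam,   # tumor BAM to be tagged
        "-r", reference,                 # reference FASTA
        "-t", str(threads),
        "-o", out_prefix,
    ]
    if tag_supplementary:
        cmd.append("--tagSupplementary")
    if min_mapq is not None:
        cmd += ["-q", str(min_mapq)]
    return cmd


cmd = build_somatic_haplotag_cmd(
    "phased_normal_snp.vcf", "normal.bam", "tumor_snv.vcf", "tumor.bam",
    "reference.fasta", "tagged_tumor_bam_prefix",
    threads=8, min_mapq=20, tag_supplementary=True,
)
print(shlex.join(cmd))
# To execute it, one could call subprocess.run(cmd, check=True).
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.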

Somatic haplotagging benchmark:

The benchmark evaluates somatic haplotagging at the read level as follows:

  1. Prerequisite:

    • Assumes accurate germline SNP phasing (HP1, HP2 tags)
  2. Read-level comparison:

    • Truth somatic reads: All reads containing true somatic variants from truth VCF
    • Program-tagged somatic reads: Reads classified as somatic by the program (with tags HP1-1, HP2-1, or HP3)
    • Performance assessment: Evaluation of the program's accuracy in identifying true somatic reads
  3. Evaluation regions:

    • If --truth-bed is provided, only variants within these confident regions are used
    • This ensures evaluation is performed only in high-confidence regions
  4. Performance metrics:

    • Precision, Recall, and F1-Score for each haplotype (HP1-1, HP2-1, HP3)
    • Overall performance metrics
    • Detailed read-level information including assigned haplotypes

The benchmark results are written to a metrics file when a truth VCF (--truth-vcf) and, optionally, a truth BED (--truth-bed) are provided.
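For reference, the reported metrics can be recomputed from the read counts with the standard definitions of precision, recall, and F1-score. The Python sketch below assumes the benchmark uses these standard formulas; the function name `precision_recall_f1` is illustrative. The counts are taken from the HP1-1 row of the example metrics file in the Output files section, and the result rounds to the values shown there (0.9840, 0.6343, 0.7714).

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); a metric is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# TP, FP, FN of the HP1-1 row in the example benchmark metrics file.
p, r, f1 = precision_recall_f1(tp=44269, fp=721, fn=25524)
print(f"Precision={p:.4f} Recall={r:.4f} F1={f1:.4f}")
```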

The complete list of somatic haplotagging parameters


Usage:  somatic_haplotag [OPTION] ... READSFILE
      --help                          display this help and exit.

required arguments:
      -s, --snp-file=NAME             input normal sample SNP VCF file.
      -b, --bam-file=NAME             input normal sample BAM file.
      --tumor-snv-file=NAME           input tumor sample SNV VCF file.
      --tumor-bam-file=NAME           input tumor sample BAM file for somatic haplotag.
      -r, --reference=NAME            reference FASTA.

optional arguments:
      --tagSupplementary              tag supplementary alignment. default:false
      -q, --qualityThreshold=Num      not tag alignment if the mapping quality less than threshold. default:1
      -p, --percentageThreshold=Num   the alignment will be tagged according to the haplotype corresponding to most alleles.
                                      if the alignment has no obvious corresponding haplotype, it will not be tagged. default:0.6
      -t, --threads=Num               number of thread. default:1
      -o, --out-prefix=NAME           prefix of phasing result. default:result
      --cram                          the output file will be in the cram format. default:bam
      --region=REGION                 tagging include only reads/variants overlapping those regions. default:""(all regions)
                                      input format:chrom (consider entire chromosome)
                                                   chrom:start (consider region from this start to end of chromosome)
                                                   chrom:start-end
      --log                           an additional log file records the result of each read. default:false

somatic variant calling arguments:
      --tumor-purity=Num              tumor purity (0.1~1.0) for adjusting somatic variant filtering.
                                      lower values apply more conservative filters. default: automatic estimation.
      --disableFilter                 disable somatic variant filtering and accept all tumor VCF variants. default: false.
      --output-somatic-vcf            output filtered somatic variants VCF file based on tumor VCF. default: false.
                                      variants passing filters will be marked as PASS, others as LowQual.
      --somatic-calling-log           output somatic calling log file. default: false.

somatic haplotagging benchmark arguments:
      --truth-vcf=NAME                truth somatic variants VCF file for evaluating somatic haplotag performance.
                                      comparing reads containing true somatic variants against reads tagged as somatic reads.
      --truth-bed=NAME                confident regions BED file for evaluating somatic haplotag performance.
                                      only using variants within these regions for tagging and evaluation.
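The help text for -p describes a majority vote over the alleles a read carries. The Python sketch below illustrates one plausible reading of that rule; it is not the LongPhase-S implementation. The function name `vote_haplotype`, the per-read allele counts as inputs, and the use of `>=` at the threshold are all assumptions made for illustration.

```python
def vote_haplotype(hp1_alleles, hp2_alleles, percentage_threshold=0.6):
    """Return 1 or 2 when one haplotype holds a large enough share of the
    read's phased alleles, otherwise None (the read stays untagged)."""
    total = hp1_alleles + hp2_alleles
    if total == 0 or hp1_alleles == hp2_alleles:
        return None  # no phased alleles, or a tie: no obvious haplotype
    if hp1_alleles > hp2_alleles:
        winner, count = 1, hp1_alleles
    else:
        winner, count = 2, hp2_alleles
    # Assumption: the winning share must reach the threshold (default 0.6).
    return winner if count / total >= percentage_threshold else None


print(vote_haplotype(9, 1))   # clear majority for haplotype 1
print(vote_haplotype(11, 9))  # 55% share is below 0.6, so untagged
```

Under this reading, raising -p leaves more ambiguous reads untagged, while lowering it tags more reads at the cost of less certain assignments.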

Output files


Tagged tumor BAM file

The reads will be tagged as:

  • HP:Z:1 or HP:Z:2 for reads with germline SNPs
  • HP:Z:1-1 or HP:Z:2-1 for reads with somatic SNPs derived from germline haplotype 1 or 2
  • HP:Z:3 for reads with somatic SNPs that cannot be derived from germline haplotypes
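Downstream scripts often need to group reads by these tag values. The following Python sketch works on SAM text records (for example, the output of `samtools view`) and uses only the standard library. The function names and the category labels are hypothetical, and the example record is made up for illustration.

```python
import re

# Matches the HP optional field; the type letter is accepted in either case.
HP_FIELD = re.compile(r"^HP:[Zz]:(.+)$")


def hp_from_sam_line(line):
    """Return the HP value of a tab-separated SAM record, or None if absent."""
    fields = line.rstrip("\n").split("\t")
    for field in fields[11:]:  # optional fields start at column 12
        match = HP_FIELD.match(field)
        if match:
            return match.group(1)
    return None


def classify_hp(value):
    """Map an HP value to a category label (labels are illustrative)."""
    if value in ("1", "2"):
        return "germline"
    if value in ("1-1", "2-1"):
        return "somatic_derived"    # somatic, derived from haplotype 1 or 2
    if value == "3":
        return "somatic_underived"  # somatic, not derived from a germline haplotype
    return "untagged_or_unknown"


# Made-up record for illustration only.
record = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:Z:1-1\tPS:i:1000"
hp = hp_from_sam_line(record)
print(hp, classify_hp(hp))
```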

Tumor purity estimation result file

The tumor purity estimation result file contains the estimated tumor purity value and related statistics. The file includes:

  • Tumor purity: The estimated tumor purity value (0.0-1.0)
  • Filter parameters: Parameters used for filtering the estimation data
  • Data size: Number of variants used for estimation

An example of a tumor purity estimation file

#==========Estimation result===========
Tumor purity: 0.169599
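To use the estimate in a script, the value can be read from the `Tumor purity:` line. This is a minimal Python sketch that assumes the line format shown in the example above; the function name `read_tumor_purity` is hypothetical, and the file name of the result file is not assumed here, so the sketch parses text directly.

```python
def read_tumor_purity(text):
    """Return the purity as a float, or None if no 'Tumor purity:' line is found."""
    for line in text.splitlines():
        if line.startswith("Tumor purity:"):
            return float(line.split(":", 1)[1])
    return None


example = """#==========Estimation result===========
Tumor purity: 0.169599
"""
purity = read_tumor_purity(example)
print(purity)
```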

Somatic calling result VCF file

The somatic calling result VCF file is based on the input tumor VCF with the following modifications:

  1. Variant filtering: Only variants that are used in the somatic haplotagging process are included in the output VCF.

  2. Additional header information:

    • LongPhase-S version
    • Command line
  3. Variant filter status:

    • PASS: Variants passing quality filters
    • LowQual: Variants failing quality filters

All other information in the VCF file remains unchanged from the input tumor VCF.

##longphase_s_version=1.0.0
##commandline=longphase_s somatic_haplotag -s phased_normal.vcf -b normal.bam --tumor-snv-file tumor.vcf --tumor-bam-file tumor.bam -r GRCh38_no_alt_analysis_set.fasta -t 64 -o tagged_tumor_bam --tagSupplementary -q 20 --output-somatic-vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE
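Because the filter status is stored in the standard FILTER column (column 7), the number of PASS and LowQual variants can be tallied with a few lines of Python. The sketch below reads uncompressed VCF text; the function name `count_filters` is hypothetical and the variant records are made up for illustration.

```python
from collections import Counter


def count_filters(vcf_text):
    """Count FILTER values over all variant records in VCF text."""
    counts = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) >= 7:
            counts[fields[6]] += 1  # FILTER is the 7th column
    return counts


# Made-up records for illustration only.
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t1000\t.\tA\tG\t30\tPASS\t.\tGT\t0/1",
    "chr1\t2000\t.\tC\tT\t12\tLowQual\t.\tGT\t0/1",
    "chr2\t3000\t.\tG\tA\t40\tPASS\t.\tGT\t0/1",
])
counts = count_filters(example_vcf)
print(dict(counts))
```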

Benchmark metrics file

The benchmark metrics file contains performance evaluation results of somatic haplotagging. The file includes:

  1. Basic information:

    • Truth VCF and BED files used
    • Mapping quality threshold
    • Total read counts and truth somatic read counts
  2. Performance metrics for each haplotype:

    • Precision, Recall, and F1-Score
    • True Positives (TP), False Positives (FP), False Negatives (FN)
    • Overall performance metrics
  3. Detailed read information:

    • Read ID and chromosome
    • SNP counts in germline and tumor
    • Assigned haplotype
    • Positions of truth variants and their assigned haplotypes

An example of a benchmark metrics file

##Truth VCF File: truth.vcf
##Truth BED File: truth.bed
##MappingQualityThreshold: 20
##Total reads: 8804601
##Total truth somatic reads: 149518
##Total truth HP1-1: 69793
##Total truth HP2-1: 68475
##Total truth HP3: 11250
## Haplotype   Precision      Recall         F1-Score       TP             FP             FN             
##-----------------------------------------------------------------------------------------------
## HP1-1       0.9840         0.6343         0.7714         44269          721            25524          
## HP2-1       0.9819         0.6488         0.7813         44426          819            24049          
## HP3         0.9857         0.3493         0.5158         3930           57             7320           
##-----------------------------------------------------------------------------------------------
## Overall     0.9831         0.6195         0.7600         92625          1597           56893          
##
#CHROM  READID  GERMLINE_VAR_SIMILARITY DERIVE_BY_HP_SIMILARITY GERMLINE_SNP_COUNT      TUMOR_SNP_COUNT HAPLOTYPE       TRUTH_VARIANT_POS,HP
chr1    SRR25005626.11942917    0.9722  0.0000  36      1       H1-1    1212740,3       
chr1    SRR25005626.2423351     1.0000  0.0000  23      1       H1-1    1212740,3       
chr1    SRR25005626.4080685     0.9500  0.0000  20      1       H1-1    1212740,3       
chr1    SRR25005626.14512092    0.9688  0.0000  128     1       H1-1    1212740,3
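The metrics file mixes three kinds of lines: `##` lines for file-level information and the per-haplotype table, a single `#` line naming the read-level columns, and the read-level rows. The Python sketch below separates the table from the read rows. It assumes the layout of the example above and that every column value is a single whitespace-free token; the function name `parse_benchmark_metrics` is hypothetical.

```python
def parse_benchmark_metrics(text):
    """Split a metrics file into the per-haplotype table and the read-level rows."""
    summary = {}    # haplotype label -> metrics and counts
    reads = []      # one dict per read-level row
    columns = None  # column names from the single-'#' header line
    for line in text.splitlines():
        if line.startswith("##"):
            parts = line[2:].split()
            if len(parts) != 7:
                continue  # file-level information or a separator line
            try:
                precision, recall, f1 = (float(x) for x in parts[1:4])
                tp, fp, fn = (int(x) for x in parts[4:7])
            except ValueError:
                continue  # the table's own header row
            summary[parts[0]] = {
                "Precision": precision, "Recall": recall, "F1-Score": f1,
                "TP": tp, "FP": fp, "FN": fn,
            }
        elif line.startswith("#"):
            columns = line[1:].split()
        elif line.strip() and columns is not None:
            reads.append(dict(zip(columns, line.split())))
    return summary, reads


# A few lines copied from the example metrics file above.
example_metrics = """##Truth VCF File: truth.vcf
##Total reads: 8804601
## Haplotype   Precision      Recall         F1-Score       TP             FP             FN
##-----------------------------------------------------------------------------------------------
## HP1-1       0.9840         0.6343         0.7714         44269          721            25524
## HP3         0.9857         0.3493         0.5158         3930           57             7320
#CHROM  READID  GERMLINE_VAR_SIMILARITY DERIVE_BY_HP_SIMILARITY GERMLINE_SNP_COUNT      TUMOR_SNP_COUNT HAPLOTYPE       TRUTH_VARIANT_POS,HP
chr1    SRR25005626.11942917    0.9722  0.0000  36      1       H1-1    1212740,3
"""
summary, reads = parse_benchmark_metrics(example_metrics)
print(summary["HP3"]["TP"], reads[0]["READID"], reads[0]["HAPLOTYPE"])
```

All read-level values are kept as strings so that fields such as `TRUTH_VARIANT_POS,HP` can be split further by the caller.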