somatic_haplotag.md
November 24, 2025
Somatic haplotagging command
- SNV and INDEL somatic haplotagging
- Somatic haplotagging benchmark
- The complete list of somatic haplotagging parameters
- Output files
SNV and INDEL somatic haplotagging
This command performs tumor purity estimation and somatic variant calling on a tumor-normal pair of BAM files, then tags (assigns) each read in the tumor BAM to a haplotype based on the phased normal SNP VCF and the tumor SNV VCF. See Input Preparation for details on how to prepare the required input files:
- Reference genome: Generate reference index
- BAM files: Generate alignment and index files
- Phased normal SNP VCF: Generate germline SNP file and then perform phasing
- Tumor SNV VCF: Generate somatic SNV file
In addition, the haplotype block of each read is stored in the PS tag (only for reads with phased SNPs). The phased VCF can be generated by other programs as long as the PS or HP tags are encoded. Specify --log to additionally output a plain-text file containing the haplotype tag of each read, so the tags can be inspected without parsing the BAM.
longphase-s somatic_haplotag \
-s phased_normal_snp.vcf \
-b normal.bam \
--tumor-snv-file tumor_snv.vcf \
--tumor-bam-file tumor.bam \
-r reference.fasta \
-t 8 \
-o tagged_tumor_bam_prefix \
--tagSupplementary \
-q 20
Somatic haplotagging benchmark
The benchmark evaluates the performance of somatic haplotagging at the read level by:
- Prerequisite:
  - Assumes accurate germline SNP phasing (HP1, HP2 tags)
- Read-level comparison:
  - Truth somatic reads: All reads containing true somatic variants from truth VCF
  - Program-tagged somatic reads: Reads classified as somatic by the program (with tags HP1-1, HP2-1, or HP3)
  - Performance assessment: Evaluation of the program's accuracy in identifying true somatic reads
- Evaluation regions:
  - If --truth-bed is provided, only variants within these confident regions are used
  - This ensures evaluation is performed only in high-confidence regions
- Performance metrics:
  - Precision, Recall, and F1-Score for each haplotype (HP1-1, HP2-1, HP3)
  - Overall performance metrics
  - Detailed read-level information including assigned haplotypes
The benchmark results are output to a metrics file when truth files (--truth-vcf and optionally --truth-bed) are provided.
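The read-level comparison described above can be pictured as a set comparison between read IDs. The following is a minimal sketch, not the LongPhase-S implementation: `read_level_counts` and `SOMATIC_TAGS` are hypothetical names, the read IDs are made up, and the sketch only separates somatic from non-somatic reads (how the program scores a read tagged with a different somatic haplotype than the truth is not stated here).

```python
from typing import Dict, Set

# Tag values this document lists as somatic (HP1-1, HP2-1, HP3).
SOMATIC_TAGS = {"1-1", "2-1", "3"}


def read_level_counts(truth_reads: Set[str], tagged: Dict[str, str]) -> Dict[str, int]:
    """Compare truth somatic read IDs with reads tagged as somatic.

    truth_reads: IDs of reads that contain a true somatic variant.
    tagged:      read ID -> haplotype tag value assigned by the program.
    """
    predicted = {read_id for read_id, tag in tagged.items() if tag in SOMATIC_TAGS}
    return {
        "TP": len(truth_reads & predicted),  # truth somatic and tagged somatic
        "FP": len(predicted - truth_reads),  # tagged somatic but not in truth
        "FN": len(truth_reads - predicted),  # truth somatic but not tagged somatic
    }


# Made-up example: r1 is found, r4 is a false call, r2 and r3 are missed.
truth = {"r1", "r2", "r3"}
tags = {"r1": "1-1", "r2": "2", "r4": "3", "r5": "1"}
print(read_level_counts(truth, tags))
```

Precision, Recall, and F1-Score then follow from these three counts in the usual way.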
The complete list of somatic haplotagging parameters
Usage: somatic_haplotag [OPTION] ... READSFILE
      --help                          display this help and exit.

required arguments:
      -s, --snp-file=NAME             input normal sample SNP VCF file.
      -b, --bam-file=NAME             input normal sample BAM file.
      --tumor-snv-file=NAME           input tumor sample SNV VCF file.
      --tumor-bam-file=NAME           input tumor sample BAM file for somatic haplotag.
      -r, --reference=NAME            reference FASTA.

optional arguments:
      --tagSupplementary              tag supplementary alignment. default:false
      -q, --qualityThreshold=Num      not tag alignment if the mapping quality less than threshold. default:1
      -p, --percentageThreshold=Num   the alignment will be tagged according to the haplotype corresponding to most alleles.
                                      if the alignment has no obvious corresponding haplotype, it will not be tagged. default:0.6
      -t, --threads=Num               number of thread. default:1
      -o, --out-prefix=NAME           prefix of phasing result. default:result
      --cram                          the output file will be in the cram format. default:bam
      --region=REGION                 tagging include only reads/variants overlapping those regions. default:""(all regions)
                                      input format:chrom (consider entire chromosome)
                                                   chrom:start (consider region from this start to end of chromosome)
                                                   chrom:start-end
      --log                           an additional log file records the result of each read. default:false

somatic variant calling arguments:
      --tumor-purity=Num              tumor purity (0.1~1.0) for adjusting somatic variant filtering.
                                      lower values apply more conservative filters. default: automatic estimation.
      --disableFilter                 disable somatic variant filtering and accept all tumor VCF variants. default: false.
      --output-somatic-vcf            output filtered somatic variants VCF file based on tumor VCF. default: false.
                                      variants passing filters will be marked as PASS, others as LowQual.
      --somatic-calling-log           output somatic calling log file. default: false.

somatic haplotagging benchmark arguments:
      --truth-vcf=NAME                truth somatic variants VCF file for evaluating somatic haplotag performance.
                                      comparing reads containing true somatic variants against reads tagged as somatic reads.
      --truth-bed=NAME                confident regions BED file for evaluating somatic haplotag performance.
                                      only using variants within these regions for tagging and evaluation.
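The three `--region` forms listed in the help text (`chrom`, `chrom:start`, `chrom:start-end`) can be told apart with a few lines of string handling. The sketch below is illustrative only and is not taken from LongPhase-S: `Region` and `parse_region` are hypothetical names, contig names are assumed to contain no colon, and the coordinate convention (for example 1-based or 0-based) is not specified by the help text, so the numbers are kept as written.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: Optional[int]  # None means "from the beginning of the chromosome"
    end: Optional[int]    # None means "to the end of the chromosome"


def parse_region(text: str) -> Optional[Region]:
    """Parse 'chrom', 'chrom:start' or 'chrom:start-end'; '' means all regions."""
    if text == "":
        return None  # the documented default: no restriction
    chrom, colon, span = text.partition(":")
    if not colon:
        return Region(chrom, None, None)  # entire chromosome
    start_text, dash, end_text = span.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None  # open-ended when no '-end' is given
    return Region(chrom, start, end)


for example in ("", "chr1", "chr1:1000", "chr1:1000-2000"):
    print(repr(example), "->", parse_region(example))
```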
Output files
Tagged tumor BAM file
The reads will be tagged as:
- HP:z:1 or HP:z:2 for reads with germline SNPs
- HP:z:1-1 or HP:z:2-1 for reads with somatic SNPs derived from germline haplotype 1 or 2
- HP:z:3 for reads with somatic SNPs that cannot be derived from germline haplotypes
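When post-processing the tagged BAM as SAM text, the tag values above can be mapped back to their meaning. The following is a small sketch under stated assumptions: `extract_hp`, `is_somatic`, and `HP_MEANING` are hypothetical helper names, the SAM record is made up, and the tag's type letter is ignored so that the spelling used in this document is accepted as written.

```python
from typing import Optional

# Meaning of each HP value, as listed in this document.
HP_MEANING = {
    "1": "germline haplotype 1",
    "2": "germline haplotype 2",
    "1-1": "somatic, derived from germline haplotype 1",
    "2-1": "somatic, derived from germline haplotype 2",
    "3": "somatic, not derivable from a germline haplotype",
}
SOMATIC_VALUES = {"1-1", "2-1", "3"}


def extract_hp(sam_line: str) -> Optional[str]:
    """Return the HP tag value of one SAM record line, or None if the read is untagged."""
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:        # optional fields start at column 12
        parts = field.split(":", 2)  # TAG:TYPE:VALUE
        if len(parts) == 3 and parts[0] == "HP":
            return parts[2]
    return None


def is_somatic(hp_value: Optional[str]) -> bool:
    """True for the tag values this document associates with somatic reads."""
    return hp_value in SOMATIC_VALUES


# Made-up SAM record with an HP tag appended after the 11 mandatory columns.
record = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tHP:Z:1-1"
value = extract_hp(record)
print(value, "->", HP_MEANING.get(value, "unrecognized"), "| somatic:", is_somatic(value))
```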
Tumor purity estimation result file
The tumor purity estimation result file contains the estimated tumor purity value and related statistics. The file includes:
- Tumor purity: The estimated tumor purity value (0.0-1.0)
- Filter parameters: Parameters used for filtering the estimation data
- Data size: Number of variants used for estimation
An example of a tumor purity estimation file
#==========Estimation result===========
Tumor purity: 0.169599
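A downstream script may want to read the estimated value back from this file. The sketch below assumes only what the example shows, namely comment lines starting with `#` and a line of the form `Tumor purity: <value>`; `parse_tumor_purity` is a hypothetical helper, and the layout of the other sections (filter parameters, data size) is not assumed.

```python
def parse_tumor_purity(text: str) -> float:
    """Return the value from the 'Tumor purity:' line of the estimation file text."""
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # section banners such as '#==========Estimation result==========='
        key, colon, value = line.partition(":")
        if colon and key.strip().lower() == "tumor purity":
            return float(value)
    raise ValueError("no 'Tumor purity:' line found")


example = "#==========Estimation result===========\nTumor purity: 0.169599\n"
print(parse_tumor_purity(example))
```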
Somatic calling result VCF file
The somatic calling result VCF file is based on the input tumor VCF with the following modifications:
- Variant filtering: Only variants that are used in the somatic haplotagging process are included in the output VCF.
- Additional header information:
  - LongPhase-S version
  - Command line
- Variant filter status:
  - PASS: Variants passing quality filters
  - LowQual: Variants failing quality filters
All other information in the VCF file remains unchanged from the input tumor VCF.
##longphase_s_version=1.0.0
##commandline=longphase_s somatic_haplotag -s phased_normal.vcf -b normal.bam --tumor-snv-file tumor.vcf --tumor-bam-file tumor.bam -r GRCh38_no_alt_analysis_set.fasta -t 64 -o tagged_tumor_bam --tagSupplementary -q 20 --output-somatic-vcf
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
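To get a quick summary of how many variants were marked PASS or LowQual, the FILTER column (the seventh tab-separated column of a VCF record) can be tallied. This is a minimal sketch: `count_filters` is a hypothetical helper, and the two variant records in the example are invented for illustration, not program output.

```python
from collections import Counter


def count_filters(vcf_text: str) -> Counter:
    """Count FILTER values (column 7) over the non-header lines of a VCF."""
    counts: Counter = Counter()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip '##' meta lines and the '#CHROM' header line
        fields = line.split("\t")
        if len(fields) >= 7:
            counts[fields[6]] += 1
    return counts


# Invented records for illustration only.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
    "chr1\t1000\t.\tA\tG\t30\tPASS\t.\tGT\t0/1\n"
    "chr1\t2000\t.\tC\tT\t12\tLowQual\t.\tGT\t0/1\n"
)
print(count_filters(example_vcf))
```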
Benchmark metrics file
The benchmark metrics file contains performance evaluation results of somatic haplotagging. The file includes:
- Basic information:
  - Truth VCF and BED files used
  - Mapping quality threshold
  - Total read counts and truth somatic read counts
- Performance metrics for each haplotype:
  - Precision, Recall, and F1-Score
  - True Positives (TP), False Positives (FP), False Negatives (FN)
  - Overall performance metrics
- Detailed read information:
  - Read ID and chromosome
  - SNP counts in germline and tumor
  - Assigned haplotype
  - Positions of truth variants and their assigned haplotypes
An example of a benchmark metrics file
##Truth VCF File: truth.vcf
##Truth BED File: truth.bed
##MappingQualityThreshold: 20
##Total reads: 8804601
##Total truth somatic reads: 149518
##Total truth HP1-1: 69793
##Total truth HP2-1: 68475
##Total truth HP3: 11250
##  Haplotype   Precision   Recall  F1-Score      TP     FP      FN
##-----------------------------------------------------------------------------------------------
##  HP1-1          0.9840   0.6343    0.7714   44269    721   25524
##  HP2-1          0.9819   0.6488    0.7813   44426    819   24049
##  HP3            0.9857   0.3493    0.5158    3930     57    7320
##-----------------------------------------------------------------------------------------------
##  Overall        0.9831   0.6195    0.7600   92625   1597   56893
##
#CHROM READID GERMLINE_VAR_SIMILARITY DERIVE_BY_HP_SIMILARITY GERMLINE_SNP_COUNT TUMOR_SNP_COUNT HAPLOTYPE TRUTH_VARIANT_POS,HP
chr1 SRR25005626.11942917 0.9722 0.0000 36 1 H1-1 1212740,3
chr1 SRR25005626.2423351 1.0000 0.0000 23 1 H1-1 1212740,3
chr1 SRR25005626.4080685 0.9500 0.0000 20 1 H1-1 1212740,3
chr1 SRR25005626.14512092 0.9688 0.0000 128 1 H1-1 1212740,3
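The summary rows in the example above are consistent with the standard definitions Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2·TP / (2·TP + FP + FN); in each row TP + FN also equals the corresponding "Total truth" count. The sketch below recomputes the metrics from the TP/FP/FN counts shown in the example. It illustrates those definitions and is not the program's own code; `precision_recall_f1` and `ROWS` are hypothetical names.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# (TP, FP, FN) copied from the example metrics file above.
ROWS = {
    "HP1-1": (44269, 721, 25524),
    "HP2-1": (44426, 819, 24049),
    "HP3": (3930, 57, 7320),
    "Overall": (92625, 1597, 56893),
}

for name, (tp, fp, fn) in ROWS.items():
    p, r, f = precision_recall_f1(tp, fp, fn)
    print(f"{name:<8} {p:.4f} {r:.4f} {f:.4f}")
```

Rounded to four decimals, the printed values match the Precision, Recall, and F1-Score columns of the example.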