phase.md
November 24, 2025 ยท View on GitHub
Phase command
- SNP-only phasing
- SNP and indel co-phasing
- SNP and SV co-phasing
- SNP and modification co-phasing
- DeepSomatic output support
- The complete list of phase parameters
- Output of SNP and indel phasing
- Output files of SNP and SV co-phasing
SNP-only phasing
For SNP-only phasing, the input of LongPhase consists of SNPs in VCF (e.g., SNP.vcf), an indexed reference in Fasta (e.g., reference.fasta, reference.fasta.fai), and one (or multiple) indexed read-to-reference alignment in BAM (e.g., alignment1.bam, alignment1.bai, alignment2.bam, ...) (see Input Preparation). The users should specify the sequencing platform (--ont for Nanopore and --pb for PacBio). An example of SNP phasing usage is shown below.
longphase-s phase \
-s SNP.vcf \
-b alignment1.bam \
-b alignment2.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--ont # or --pb for PacBio Hifi
SNP and indel co-phasing
When the SNP.vcf file contains both SNP and small insertion/deletion variations (indels), use the --indels parameter for co-phasing SNPs and indels.
longphase-s phase \
-s SNP.vcf \
-b alignment.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--indels \
--ont # or --pb for PacBio Hifi
SNP and SV co-phasing
When co-phasing SNPs and SVs, except for the same input (i.e., SNPs, reference, and alignments), LongPhase takes an extra input of called SVs in VCF (e.g., SV_sniffles.vcf), which should be generated by Sniffles (with --num_reads_report in sniffles1 and --output-rnames in sniffles2) or CuteSV (with --report_readid--genotype)(see Input Preparation).
longphase-s phase \
-s SNP.vcf \
--sv-file SV.vcf \
-b alignment.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--ont # or --pb for PacBio Hifi
SNP and modification co-phasing
When co-phasing SNPs and modifications (5mC supported at this moment), it is necessary to first call the modified loci (in VCF) using modcall and then activate SNP-modification co-phasing via specifying the --mod-file with the modcall-generated VCF. The phased SNPs and modifications will be written into new VCFs accordingly.
longphase-s phase \
-s SNP.vcf \
--mod-file modcall.vcf \
-b alignment.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--ont # or --pb for PacBio Hifi
DeepSomatic output support
If your SNV VCF is produced by DeepSomatic, you can enable a built-in pre-processing step that
keeps only GERMLINE variants and normalizes the GT field according to VAF before phasing. This is
triggered by --deepsomatic_output.
The program will extract variants with GERMLINE in the FILTER field from the DeepSomatic output VCF. An example of DeepSomatic output format:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT cell-lineTumor
chr1 10007 . T G 0 RefCall . GT:GQ:DP:AD:VAF:PL 0/0:29:83:66,15:0.180723:0,29,60
chr1 17385 . G A 4.7 GERMLINE . GT:GQ:DP:AD:VAF:PL 0/0:5:115:64,51:0.443478:2,0,27
chr1 68119 . G C 32.9 PASS . GT:GQ:DP:AD:VAF:PL 1/1:30:43:0,42:0.976744:32,32,0
chr1 119653 . CA C 4.6 GERMLINE . GT:GQ:DP:AD:VAF:PL 0/0:5:55:27,28:0.509091:2,0,30
Example command:
longphase-s phase \
-s deepsomatic_output.vcf \
-b alignment1.bam \
-r reference.fasta \
-t 8 \
-o phased_prefix \
--indels \
--ont \
--deepsomatic_output
The complete list of phase parameters
Usage: phase [OPTION] ... READSFILE
--help display this help and exit.
require arguments:
-s, --snp-file=NAME input SNP vcf file.
-b, --bam-file=NAME input bam file.
-r, --reference=NAME reference fasta.
--ont, --pb ont: Oxford Nanopore genomic reads.
pb: PacBio HiFi/CCS genomic reads.
optional arguments:
--sv-file=NAME input SV vcf file.
--mod-file=NAME input modified vcf file.(produce by longphase modcall)
-t, --threads=Num number of thread. default:1
-o, --out-prefix=NAME prefix of phasing result. default: result
--indels phase small indel. default: False
--dot each contig/chromosome will generate dot file.
parse alignment arguments:
-q, --mappingQuality=Num filter alignment if mapping quality is lower than threshold. default:1
phasing graph arguments:
-p, --baseQuality=[0~90] change edge's weight to --edgeWeight if base quality is lower than the threshold. default:12
-e, --edgeWeight=[0~1] if one of the bases connected by the edge has a quality lower than --baseQuality
its weight is reduced from the normal 1. default:0.1
-a, --connectAdjacent=Num connect adjacent N SNPs. default:20
-d, --distance=Num phasing two variant if distance less than threshold. default:300000
-1, --edgeThreshold=[0~1] give up SNP-SNP phasing pair if the number of reads of the
two combinations are similar. default:0.7
haplotag read correction arguments:
-m, --readConfidence=[0.5~1] The confidence of a read being assigned to any haplotype. default:0.65
-n, --snpConfidence=[0.5~1] The confidence of assigning two alleles of a SNP to different haplotypes. default:0.75
Output of SNP and indel phasing
When phasing SNPs alone, longphase outputs the results into a VCF file. The alleles of the two haplotypes are stored in the GT field (e.g., 1|0), whereas the left and right alleles of the vertical bar represent the paternal or maternal haplotypes. The last PS field (e.g., 16809) represents the identifier of the block. For instance, the following example illustrates two haplotypes of five-phased SNPs, CCCCC and GATGT, in the same block 16809. The output of phased indels is similar and in the same VCF.
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT default
1 16809 . C G 8.4 PASS . GT:GQ:DP:AD:VAF:PL:PS 1|0:8:79:51,28:0.35443:7,0,16:16809
1 16949 . A C 8.4 PASS . GT:GQ:DP:AD:VAF:PL:PS 0|1:8:67:43,21:0.313433:7,0,25:16809
1 21580 . C T 13.9 PASS . GT:GQ:DP:AD:VAF:PL:PS 1|0:14:75:50,24:0.32:13,0,30:16809
1 23359 . C G 13.2 PASS . GT:GQ:DP:AD:VAF:PL:PS 1|0:13:52:24,18:0.346154:13,0,42:16809
1 24132 . C T 11.1 PASS . GT:GQ:DP:AD:VAF:PL:PS 1|0:11:63:41,17:0.269841:10,0,29:16809
Output files of SNP and SV co-phasing
When co-phasing SNPs and SVs, two VCFs (one for SNPs and one for SVs) are outputted. Similarly, the phased SVs are stored in the GT field and the block ID is in the PS field. For instance, the following example illustrates two haplotypes of five SNPs and two SVs, A<INS>G<noSV>TCC and G<noSV>A<INS>ATT, which are co-phased in the same block 382189.
An example of SNP VCF file
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT default
1 465289 . A G 34.2 PASS . GT:GQ:DP:AD:VAF:PL:PS 1|0:4:16:2,11:0.6875:31,0,1:382189
1 544890 . G A 3.1 PASS . GT:GQ:DP:AD:VAF:PL:PS 1|0:3:31:30,0:0:0,0,31:382189
1 545612 . T A 6.8 PASS . GT:GQ:DP:AD:VAF:PL:PS 1|0:7:37:28,8:0.216216:5,0,32:382189
1 545653 . C T 14 PASS . GT:GQ:DP:AD:VAF:PL:PS 1|0:14:44:30,14:0.318182:13,0,26:382189
1 561458 . C T 5.1 PASS . GT:GQ:DP:AD:VAF:PL:PS 1|0:5:17:14,0:0:3,0,17:382189
An example of SV VCF file
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT default
1 534057 7 N AAATTCGCGCATATCACGGGTGCCGCCTCTGTGCAGCTCACGAAACGCCATACTACGGTGCTGCTCAGCAGCTACGGAATCGCTATACCTACGCGAGCTGCCTCAGCAGCCAC . PASS IMPRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=534128;STD_quant_start=25.369041;STD_quant_stop=29.191054;Kurtosis_quant_start=-0.239012;Kurtosis_quant_stop=-1.333733;SVTYPE=INS;RNAMES=0120d560-50f0-4298-8b03-7bd30f3cf139,030ac5d4-e616-4ce9-8ad3-243835335085,0cf1b0d9-2b4d-463d-a658-01b4b040dc63,1c9e982a-8af7-4ba0-8cc2-154a679c72e2,22e11f79-0067-4735-8b69-97d951ca702f,2ca8a6f4-be9d-4df5-80d2-dc1743f97a84,35dff960-22b6-4216-af69-8878b8860362,390d2fb4-9224-41a1-a9fe-6cb3bbe4273a,3e333422-12ca-4f16-afb8-ed7611dcbc2c,3e8ed78a-b857-4941-bbc1-52ca51e26c08,4191371c-49ea-466d-aadc-06f27cdf1050,4aaae789-54fe-4fa5-84b3-5524dc2b3796,581e0cfb-2491-44d7-a2e1-ba1516ba0f2f,59749531-9abf-4ff4-a4a1-31484ba3d32d,5c97b0a9-925e-4153-952d-0f437171d3dc,6067590e-956c-442f-bbb7-cae597d616ad,623804bb-e2fe-415d-96ae-3d06aec63e5d,672244ce-2d5d-45cf-beb2-ddeddae917e8,6b79aa23-7c9c-49dc-9b88-8419c88c7a36,6e60d235-6654-4ef8-9feb-70f12a397721,6fbb55c5-57fc-43bc-8a24-b0058778054c,8e10bf13-9674-489c-924e-182a42e08a34,aa6ba092-4221-4d54-8819-811448c34983,af2169b3-b308-4db5-9675-15ff5f68d8dd,b214fcbd-77de-4dd5-84db-6d2b7e1f3158,c140eaba-e0e7-44e7-9f16-c8c67fd4a2f2,c7835cf7-44c0-44da-b10e-b2468fc8caab,ca4aa84d-34d1-4639-8634-b6a5540129ca,caba4bde-cdc5-4344-9803-a3c158525b0c,e0747feb-60bb-40db-a144-a9b43dd13256,e6992c7d-c00e-40e7-b80b-562094a9b60f,e8bb376c-20e0-4bed-a61f-b82b5c37ef6f,f3242a61-deec-49e7-b99f-335a1ba13791,f87dfdf7-7b68-421b-b395-3769a5fa3ac1,f91a7627-7fdb-4f03-8f33-0ed1649d96fe;SUPTYPE=AL;SVLEN=43;STRANDS=+-;RE=35;REF_strand=44;AF=0.795455 GT:DR:DV:PS 0|1:9:35:382189
1 545892 8 N ACACGCGGGCCGTGGCCAGCAGGCGGCGCTGCAGGAGAGGAGATGCCCAGGCCTGGCGGCC . PASS IMPRECISE;SVMETHOD=Snifflesv1.0.11;CHR2=1;END=545893;STD_quant_start=28.919840;STD_quant_stop=28.543200;Kurtosis_quant_start=-0.382251;Kurtosis_quant_stop=-0.130808;SVTYPE=INS;RNAMES=0120d560-50f0-4298-8b03-7bd30f3cf139,030ac5d4-e616-4ce9-8ad3-243835335085,0cf1b0d9-2b4d-463d-a658-01b4b040dc63,22e11f79-0067-4735-8b69-97d951ca702f,2ca8a6f4-be9d-4df5-80d2-dc1743f97a84,3977c988-9901-4e5b-9f9c-b8ebfcce8e93,3e333422-12ca-4f16-afb8-ed7611dcbc2c,4191371c-49ea-466d-aadc-06f27cdf1050,4aaae789-54fe-4fa5-84b3-5524dc2b3796,5933e1b7-1aeb-4437-a875-3befbf703420,623804bb-e2fe-415d-96ae-3d06aec63e5d,672244ce-2d5d-45cf-beb2-ddeddae917e8,6b79aa23-7c9c-49dc-9b88-8419c88c7a36,7842d9f1-9a77-4c9a-ab5b-5a644ed2d355,7ba26d64-d9b0-475f-8d5f-1fa73fc42d93,8e10bf13-9674-489c-924e-182a42e08a34,a2b1b2ef-1e28-465e-8b3f-c44e15990d8b,a45514f1-4aae-40eb-94eb-2969722a7b05,b8181546-6839-49cd-b64f-b65c96369a2b,c140eaba-e0e7-44e7-9f16-c8c67fd4a2f2,c7835cf7-44c0-44da-b10e-b2468fc8caab,ca4aa84d-34d1-4639-8634-b6a5540129ca,d56f0abe-4389-4197-a151-0eb567fb99f0,e6992c7d-c00e-40e7-b80b-562094a9b60f,e8bb376c-20e0-4bed-a61f-b82b5c37ef6f,ec325153-0c55-4ece-8f3c-c432701e6750,f3242a61-deec-49e7-b99f-335a1ba13791,f91a7627-7fdb-4f03-8f33-0ed1649d96fe;SUPTYPE=AL;SVLEN=62;STRANDS=+-;RE=28;REF_strand=51;AF=0.54902 GT:DR:DV:PS 1|0:23:28:382189