HowTo UNMASC
March 8, 2025 ยท View on GitHub
Outline
- Shell Directories and Variables
- Sourcing Workflow Scripts
- Custom Local Installations
- Get COSMIC vcf
- TO_workflow
- Constructing UNMASC's
vcf - Execution
- Outputs
flowchart LR
tbam{{tumor.bam}} --> nbam1{{normal_1.bam}} & nbam2{{normal_2.bam}} --> caller{{Variant Caller}}
tbam{{tumor.bam}} --> dots{{...}} & nbamZ{{normal_Z.bam}} --> caller
fasta{{reference.fasta}} --> caller
caller --> GATK{{GATK MergeVcfs}} --> vcf1{{vc_1.vcf}} & vcf2{{vc_2.vcf}} & dotsv{{... .vcf}} & vcfZ{{vc_Z.vcf}}
cosmic{{COSMIC.vcf}} & vcf1 & vcf2 & dotsv & vcfZ & fasta --> VEP --> anno{{anno.vcf}}
vcf1 & vcf2 & dotsv & vcfZ & anno & targ{{target.bed}} --> prepU{{"prep_UNMASC_VCF()"}}
centro{{centromere.bed}} & dictchrom{{dict_chrom.txt}} & prepU --> UNMASC{{"run_UNMASC()"}}
%% class definitions
classDef myred fill:#f44336,stroke:#f3f6f4,stroke-width:2px
classDef myblue fill:#19daf8,stroke:#f3f6f4,stroke-width:2px
classDef mygreen fill:#79d50d,stroke:#f3f6f4,stroke-width:2px
classDef mymagenta fill:#fc9ffc,stroke:#f3f6f4,stroke-width:2px
classDef myyellow fill:#f6fa13,stroke:#f3f6f4,stroke-width:2px
classDef myorange fill:#f89d3e,stroke:#f3f6f4,stroke-width:2px
%% assign classes to nodes
class tbam myred
class nbam1,nbam2,dots,nbamZ myblue
class fasta mygreen
class vcf1,vcf2,dotsv,vcfZ,anno mymagenta
class cosmic,targ,centro,dictchrom myyellow
class caller,GATK,VEP,prepU,UNMASC myorange
%% clickable nodes
click GATK "https://github.com/broadinstitute/gatk"
click cosmic "https://github.com/pllittle/UNMASC/blob/main/workflow/inputs.md#get-cosmic-vcf"
click VEP "https://uswest.ensembl.org/info/docs/tools/vep/index.html"
Setting Directories and Variables
Click to expand!
Below are fixed variables to specify.
gatk_dir=; [ -z "$gatk_dir" ] \
&& echo "Set gatk_dir, GATK directory" >&2 \
&& return 1
git_dir=; [ -z "$git_dir" ] \
&& echo "Set git_dir, location to store GitHub repos" >&2 \
&& return 1
[ ! -d $git_dir ] && mkdir $git_dir
stk2_dir=; [ -z "$stk2_dir" ] \
&& echo "Set stk2_dir, location of Strelka2 dir!" >&2 \
&& return 1
vep_dir=; [ -z "$vep_dir" ] \
&& echo "Set vep_dir, VEP directory" >&2 \
&& return 1
vep_rel=; [ -z "$vep_rel" ] \
&& echo "Set vep_rel, VEP release number, preferably 105 with GRCh37 and GRCh38 supported" >&2 \
&& return 1
vep_cache=; [ -z "$vep_cache" ] \
&& echo "Set vep_cache, VEP homo_sapiens cache, should be vep, refseq, or merged" >&2 \
&& return 1
hts_dir=; [ -z "$hts_dir" ] \
&& echo "Set hts_dir, the HTS directory" >&2 \
&& return 1
cosm_dir=; [ -z "$cosm_dir" ] \
&& echo "Set cosm_dir, directory containing COSMIC vcf" >&2 \
&& return 1
cosm_ver=; [ -z "$cosm_ver" ] \
&& echo "Set cosm_ver, COSMIC version number, e.g. 95, 101" >&2 \
&& return 1
cosm_muts=; [ -z "$cosm_muts" ] \
&& echo "Set cosm_muts, e.g. coding, noncoding, all" >&2 \
&& return 1
fasta_fn=; [ -z "$fasta_fn" ] \
&& echo "Set fasta_fn, the reference FASTA" >&2 \
&& return 1
nthreads=; [ -z "$nthreads" ] \
&& echo "Set nthreads, number of threads or cores" >&2 \
&& return 1
genome=; [ -z "$genome" ] \
&& echo "Set genome, like GRCh37 or GRCh38" >&2 \
&& return 1
Sample-specific Variables
out_dir=; [ -z "$out_dir" ] \
&& echo "Set out_dir, output directory" >&2 \
&& return 1
nbams=; [ -z "$nbams" ] \
&& echo "Set nbams, file listing all normal bam full paths" >&2 \
&& return 1
echo -e "Detected $(cat $nbams | wc -l) normal controls." >&2
tbam=; [ -z "$tbam" ] \
&& echo "Set tbam, tumor bam full path" >&2 \
&& return 1
Source scripts
Click to expand!
You are welcome to install your own programs and dependencies. I have provided below the steps and scripts I use to automate the programs related to UNMASC.
# Load dependencies
odir=$(pwd)
for repo in baSHic UNMASC somdna; do
repo_dir="$git_dir/$repo"
if [ ! -d "$repo_dir" ]; then
cd "$git_dir"
git clone https://github.com/pllittle/$repo.git >&2
[ $? -eq 0 ] && continue
else
cd "$repo_dir"
git pull >&2
[ $? -eq 0 ] && continue
fi
echo -e "Error cloning/pulling $repo" >&2
return 1
done
# Source functions
for fn in genomic callingAllVariants; do
. "$git_dir/somdna/scripts/$fn.sh"
[ $? -eq 0 ] && continue
echo -e "Some error in sourcing somdna's $fn.sh" >&2
return 1
done
. "$git_dir/UNMASC/workflow/tumor_only.sh"
[ ! $? -eq 0 ] && echo "Some error in sourcing tumor_only.sh" >&2 && return 1
cd "$odir"
unset odir repo repo_dir
Custom local installations
Click to expand!
- gcc, libtool, perl, bzip2, xz, zlib, curl, expat, db,
- HTSlib, VEP, Strelka2
One can control where these programs are installed by adding -a $apps_dir
to the code below. apps_dir is just wherever you would like to install these programs.
install_gcc
install_libtool
install_perl
install_bzip2
install_xz
install_zlib
install_curl
install_expat
install_db
install_htslib
install_VEP -r $vep_rel
install_strelka2
If you rely on these functions for installations, these will determine
hts_dir, stk2_dir, and vep_dir definitions.
Get COSMIC VCF
Click to expand!
Run the following code to obtain the appropriate COSMIC VCF.
The function below will prompt the user to input COSMIC website's login
and password. The generated file is needed for TO_workflow() below.
get_COSMIC_canonical -g $genome \
-v $cosm_ver -h $hts_dir \
-c $cosm_dir
The function definition is located here.
TO_workflow()
Click to expand!
Currently, TO_workflow is designed for Strelka2 and VEP. If others have
alternate workflows, you are welcome to inspect this function's
definitions
to extract specific steps to execute :smile:.
TO_workflow -c $nthreads -f $fasta_fn -g $genome \
-d $cosm_dir -e $cosm_ver -h $hts_dir -k $gatk_dir \
-n $nbams -o $out_dir -s $stk2_dir -t $tbam \
-v $vep_dir -r $vep_rel -a $vep_cache \
--cosm_muts <coding,noncoding,all>
Preparing UNMASC's vcf
Click to expand!
Assuming UNMASC is successfully installed and the above TO_workflow() was
run successfully, the instructions below describes the inputs to construct
UNMASC's main input vcf. Otherwise, the user needs to construct vcf ensuring
all required columns are available and formatted correctly (refer to
?UNMASC::run_UNMASC).
outdirString instructing where UNMASC outputs should be stored. Notice this is different from the Shell variableout_dirfrom above.DATA R data.frame containingFILENAMEfor full path to each control VCF andSTUDYNUMBERto identify each normal control.FILTERR list used to specify some liberal pre-filtering of loci based on total read depth and quality score.target_fnString containing the full path to a target BED file with tab-delimited columns with headersChr(e.g. chr1),Start(start position), andEnd(end position).anno_fnString containing the full path to the annotated vcf file. IfTO_workflow()was used, the R stringanno_fnis$out_dir/allvar_ann.vcf.gz.nlinesPositive integer specifying how many lines into theanno_fnto initially read in to determine if its formatting matches whatprep_UNMASC_VCF()is expecting.ncoresPositive integer specifying how many threads/cores are available. This is used here to reduce the computational time to import each control VCF. This argument may be more handy if samples underwent WGS or WES sequencing.
vcf = UNMASC::prep_UNMASC_VCF(
outdir = outdir,
DAT = DAT,
FILTER = NULL,
target_fn = target_fn,
anno_fn = anno_fn,
nlines = 100,
ncores = 1)
Execution
Click to expand!
Arguments for run_UNMASC() with template inputs.
tumorID = "tumor01", tumor sample IDoutdir = file.path(".",tumorID), output location for the tumor samplevcf = vcf, the data.frame generated byprep_UNMASC_VCF()tBAM_fn = "path/to/tumor/bam"bed_centromere_fn = "path/to/centromere/start/end/bed/file", tab-delimited file without headers with three columns containing contig, start position, and end position.dict_chrom_fn = "path/to/chromosome/length/file", stored output fromsamtools view -H tumor.bamqscore_thres = 30, Qscore thresholdexac_thres = 5e-3, gnomAD/ExAC population allele frequency threshold for germline filteringad_thres = 5, alternate depth thresholdrd_thres = 10, total depth thresholdcut_BAF = 5e-2, cutoff for variants to exclude before running segmentationminBQ = 13, minimum base qualityminMQ = 40, minimum mapping qualityeps_thres = 0.5, noise mixture proportion threshold for determining a H2M segmentpsi_thres = 0.02, over-dispersion of beta-binomial threshold for determining a H2M segmenthg = "19", labeling for output figuresbinom = TRUE, set toTRUEto model read counts with binomial distribution. Set toFALSEto explore over both binomial and beta-binomial distributionsgender = NA, set toNAif gender is unknown. Otherwise set to"MALE"or"FEMALE"ncores = 1, number of threads/cores available, aids with computational runtime in extracting strand-specific read counts per locus.
# Package main function
UNMASC::run_UNMASC(
tumorID = tumorID,
outdir = outdir,
vcf = vcf,
tBAM_fn = tBAM_fn,
bed_centromere_fn = bed_centromere_fn,
dict_chrom_fn = dict_chrom_fn,
qscore_thres = qscore_thres,
exac_thres = exac_thres,
ad_thres = ad_thres,
rd_thres = rd_thres,
cut_BAF = cut_BAF,
minBQ = minBQ,
minMQ = minMQ,
eps_thres = eps_thres,
psi_thres = psi_thres,
hg = hg,
binom = binom,
gender = gender,
ncores = ncores)
Outputs
Click to expand!
Below are the main outputs from UNMASC to determine if
all steps ran smoothly and to assess if the package or
input files requires debugging. If any of these directories
or files are missing, check the corresponding Rout file
for any error or warning messages or flags for low quality
sample information.
image.rds: Stores the comprehensive inputs forUNMASC. Once this file is created,run_UNMASC()skips past the time consuming pre-processing of input files. This image can be useful for re-producing results and inspecting errors. To have a clean restart or reset, first remove this file.nCLUST: Figures of normal VAF clustering. If three clusters of normal VAF are not present, the loci supplied toUNMASCmay have undergone pre-filtering of the normal VAF.nSEG: Figures of normal VAF segmentation. This is useful for visualizing hard-to-map (H2M) regions and whether or not read counts should be modeled with a binomial or beta-binomial distribution.tSEG: Figures of tumor VAF segmentation. This is useful for assessing the degree of sparsity when characterizing the local germline cluster behavior relative to each potential somatic locus. Also these figures may prove useful for inspecting copy number aberrations due to allelic imbalance (B allele frequencies deviating from 0.5).tumor_genotype.tsv: Experimental output. Attempting to reverse-genotype an individual based on their tumor genomics. Loci are annotated with inferred genotypes, H2M status, strand bias, etc. to aid in isolating higher quality genotype calls. May be useful for performing genotype PCA or mapping NGS-based sample data to microarray sample data.tumorOnly_VCs.tsv:UNMASC's tumor-only variant calls with comprehensive annotation contained in theLABELcolumn for prioritizing variants. Additional columns contain metrics used to construct theLABELcolumns annotations such as strand bias, oxoG artifact, paraffin artifact, H2M status, germline-like loci, etc.