mutspec_stat.md

March 13, 2017

MutSpec-Stat

Calculate statistics on mutation features for each individual sample as well as for the sample pool.

The statistics calculated include:

Counts and distributions of overall mutation types (SBS and indels)
Distribution of SBS by functional region (based on RefSeq annotations)
Distribution of SBS by chromosome
Counts and distributions of SBS in their trinucleotide sequence context (96 mutation types), calculated on the genome sequence or only on transcribed sequences (stranded analysis).
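As an illustration of where the number 96 comes from, the sketch below enumerates the six substitution classes in each of the 16 possible 5'/3' flanking-base combinations. The label format (for example A[C>T]G) is a common convention assumed here, not necessarily the one MutSpec-Stat writes to its output.

```python
from itertools import product

BASES = "ACGT"
# Six substitution classes, written from the pyrimidine (C or T) of the base pair.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def mutation_types():
    """Return the 96 labels as 5'base[ref>alt]3'base (assumed label format)."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]


types = mutation_types()
print(len(types))  # 96 = 6 substitution classes x 4 bases in 5' x 4 bases in 3'
print(types[0])    # A[C>A]A
```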

Input format

A folder of tab-delimited text files (one file per sample) generated by MutSpec-Annot, or tab-delimited text files containing the following columns (header names are case-sensitive):

Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	Strand	context
chr1	100535282	100535282	G	T	intronic	MFSD14A	NA	NA	NA	+	CAGTGACCCTGGCTGGCCATC
chr1	120306983	120306983	A	C	exonic	HMGCS2	NA	nonsynonymous SNV	HMGCS2:NM_001166107:exon2:c.T371G:p.V124G,HMGCS2:NM_005518:exon2:c.T371G:p.V124G	-	CAGCCTGCCCACAGAGTCCCA
chr1	151341524	151341524	C	A	exonic	SELENBP1	NA	nonsynonymous SNV	SELENBP1:NM_001258289:exon4:c.G442T:p.V148L,SELENBP1:NM_003944:exon4:c.G316T:p.V106L	-	ACGTCCACCACATAGATGCGA
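A minimal sketch of how such a file could be checked and read before analysis. The helper names are hypothetical and not part of MutSpec. It also assumes, based only on the example rows above, that the context column has an odd length with the reference base in the middle.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "Chr", "Start", "End", "Ref", "Alt", "Func.refGene", "Gene.refGene",
    "GeneDetail.refGene", "ExonicFunc.refGene", "AAChange.refGene",
    "Strand", "context",
]


def read_variants(handle):
    """Yield one dict per variant row after checking the (case-sensitive) header."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    yield from reader


def trinucleotide(row):
    """Return the three bases centred on the mutated position.

    Assumption: 'context' has odd length with the reference base in the middle.
    """
    context = row["context"].upper()
    mid = len(context) // 2
    if context[mid] != row["Ref"].upper():
        raise ValueError("central base of context does not match Ref")
    return context[mid - 1 : mid + 2]


# First data row of the example table above.
example = (
    "\t".join(REQUIRED_COLUMNS) + "\n"
    "chr1\t100535282\t100535282\tG\tT\tintronic\tMFSD14A\tNA\tNA\tNA\t+\t"
    "CAGTGACCCTGGCTGGCCATC\n"
)
rows = list(read_variants(io.StringIO(example)))
print(trinucleotide(rows[0]))  # TGG
```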

Output

Excel file that includes all computed statistics, shown in tabular and graphical formats, for each sample (one per datasheet) and for the pooled samples (optional).
A folder containing all the figures and tables by sample.
Input matrix for the tool MutSpec-NMF.

The following statistics are generated and summarised in the Excel file:

Graph 1. SBS distribution: Proportion (percent of all SBS) of each type of single base substitution (SBS). All SBS are considered, including those without strand orientation annotation.

Table 1. Frequency and counts of all SBS: Values corresponding to graph 1.
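The counts and percentages behind such a table could be computed along these lines. This is a sketch only: it assumes that substitutions are collapsed onto the six pyrimidine-based types (so that G>T is counted as C>A), and the function names are illustrative.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SBS_TYPES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs_type(ref, alt):
    """Collapse a substitution onto its pyrimidine form (e.g. G>T becomes C>A)."""
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def sbs_distribution(variants):
    """Return {type: (count, percent)}; indels and non-ACGT alleles are skipped."""
    counts = Counter(
        sbs_type(ref, alt)
        for ref, alt in variants
        if ref in COMPLEMENT and alt in COMPLEMENT and ref != alt
    )
    total = sum(counts.values())
    return {
        t: (counts[t], 100.0 * counts[t] / total if total else 0.0)
        for t in SBS_TYPES
    }


# Ref/Alt pairs from the three example rows of the input table.
dist = sbs_distribution([("G", "T"), ("A", "C"), ("C", "A")])
for name, (count, percent) in dist.items():
    print(f"{name}\t{count}\t{percent:.1f}")
```

With the three example rows, two substitutions fall into C>A and one into T>G.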

Graph 2. Impact on protein sequence: Impact of all mutations (SBS and indels) on the protein sequence, based on the ExonicFunc.refGene annotation. For more details about the annotation, please visit the Annovar web page.

Table 2. Frequency and counts of functional impacts: Values corresponding to graph 2.

Graph 3. Stranded distribution of SBS: Proportion (percent of all SBS with strand annotation) of the six substitution types on the transcribed and non-transcribed strands. Only regions with strand annotation are considered.

Table 3. Significance of the strand biases: The strand bias for each SBS type is calculated as the ratio of SBS on the non-transcribed (coding) strand to SBS on the transcribed (non-coding) strand. The statistical significance of the difference between the mutation frequencies on the non-transcribed and the transcribed strands (each expected to be 0.5 by chance) is assessed using a chi-squared test followed by the Benjamini-Hochberg procedure for multiple-testing correction (only samples with at least one mutation on the non-transcribed or the transcribed strand are considered). Two tables are shown in the Excel report to display the six SBS types in both orientations.
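The calculation described for Table 3 can be sketched as follows. This is not the MutSpec R code: it assumes a one-degree-of-freedom goodness-of-fit test against a 50/50 split without continuity correction, and the counts are hypothetical.

```python
import math


def chi2_strand_test(non_transcribed, transcribed):
    """Chi-squared goodness-of-fit test (1 df) against a 50/50 split.

    Returns (ratio, chi2, p_value); ratio is non-transcribed / transcribed.
    """
    total = non_transcribed + transcribed
    if total == 0:
        raise ValueError("no stranded mutations")
    expected = total / 2.0
    chi2 = ((non_transcribed - expected) ** 2
            + (transcribed - expected) ** 2) / expected
    # Survival function of the chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    ratio = non_transcribed / transcribed if transcribed else math.inf
    return ratio, chi2, p_value


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts (non-transcribed, transcribed) for three SBS types.
counts = {"C>A": (60, 40), "C>T": (50, 50), "T>C": (30, 70)}
results = {name: chi2_strand_test(*pair) for name, pair in counts.items()}
fdr = benjamini_hochberg([r[2] for r in results.values()])
for (name, (ratio, chi2, p)), q in zip(results.items(), fdr):
    print(f"{name}\tratio={ratio:.2f}\tchi2={chi2:.2f}\tp={p:.4g}\tFDR={q:.4g}")
```

A 60/40 split over 100 mutations gives a chi-squared statistic of 4.0, while a 50/50 split gives 0 and a p-value of 1.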

Table 4. SBS distribution by functional region: Counts and percentages of SBS in genomic regions based on the Func.refGene annotation.

Table 5. Strand bias by functional region: Strand bias counts for the six SBS types in the different functional regions.

Table 6. SBS distribution per chromosome: Counts of SBS per chromosome for the six SBS types. The correlation between SBS counts and chromosome size is calculated using a Pearson correlation test.
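For reference, the Pearson correlation coefficient between chromosome sizes and SBS counts can be computed as in the sketch below. Only the coefficient is shown, not the associated significance test, and the input values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: chromosome sizes (Mb) and SBS counts per chromosome.
sizes = [249, 243, 198, 191, 181]
sbs_counts = [120, 118, 95, 90, 88]
print(round(pearson_r(sizes, sbs_counts), 3))
```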

Panel 1. Trinucleotide sequence context of SBS on the genomic sequence: The trinucleotide sequence context takes into account the bases flanking the SBS on its 5' and 3' sides. SBS counts and frequency data are shown as tables, heatmaps or bar graphs. The heatmap colors are scaled to the maximum value of the corresponding table. The bar graph is scaled to the maximum frequency value (the total number of mutations per SBS type is shown in parentheses).

Panel 2. Stranded analysis of trinucleotide sequence context of SBS: SBS within their trinucleotide sequence context are counted on the non-transcribed and transcribed strands of the gene region they are located in. Counts and frequencies are shown as tables or bar graphs. Only SBS with strand orientation annotation are considered in this analysis (strand annotation retrieved from the RefSeq database).
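One way to assign an SBS to the transcribed or non-transcribed strand is sketched below. It assumes the widely used convention that each substitution is reported from its pyrimidine base and that Ref and Alt are given on the reference (+) genome strand. MutSpec-Stat may implement this differently, and the function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def stranded_sbs(ref, alt, gene_strand):
    """Return (sbs_type, strand_label) for one substitution.

    ref and alt are given on the reference (+) genome strand; gene_strand is
    '+' or '-'. The label says which strand of the gene carries the pyrimidine.
    """
    if gene_strand not in ("+", "-"):
        raise ValueError("no usable strand annotation")
    if ref not in COMPLEMENT or alt not in COMPLEMENT:
        raise ValueError("not a single base substitution")
    if ref in ("C", "T"):
        pyrimidine_strand = "+"
    else:
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        pyrimidine_strand = "-"
    # The non-transcribed (coding) strand has the same orientation as the gene.
    if pyrimidine_strand == gene_strand:
        label = "non-transcribed"
    else:
        label = "transcribed"
    return f"{ref}>{alt}", label


# Ref, Alt and Strand values from the three example rows of the input table.
for ref, alt, strand in [("G", "T", "+"), ("A", "C", "-"), ("C", "A", "-")]:
    print(stranded_sbs(ref, alt, strand))
```

Under these assumptions, the first example row (G>T in a gene on the + strand) is counted as C>A on the transcribed strand.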

Usage

perl mutspecStat.pl --refGenome hg19 --pathRscript mutspec/ --pathSeqRefGenome dir/hg19db/ input_folder

--refGenome, --pathRscript, --pathSeqRefGenome and input_folder are compulsory.

List of parameters with default values:

Parameter	Default value	Description
--refGenome		Name of the reference genome
--pathRscript		Path to mutspec R scripts
--pathSeqRefGenome		Path to reference genome FASTA files (one per chromosome)
--outfile	input directory	Output directory
--temp	directory in which the script is run	Path for saving temporary files
--poolData	false	Calculate statistics on the sample pool
--reportSample	false	Generate an Excel report for each sample
--help		Print help message
input_folder		Input folder

By default, one output Excel file is generated, with the statistics for each sample shown in a separate datasheet.
Adding the option --reportSample generates one Excel file per sample instead. This option is recommended when analysing more than 250 samples, as the single Excel output file may be too large to open easily on a computer with limited RAM.
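When the tool is called from a script, the recommendation above can be applied automatically. The sketch below assembles the command line shown in the Usage section. It assumes that --poolData and --reportSample are bare switches taking no value, and the helper name and its parameters are illustrative only.

```python
def build_command(input_folder, ref_genome, path_rscript, path_seq_ref_genome,
                  n_samples, pool_data=False, report_threshold=250):
    """Assemble the argument list for mutspecStat.pl.

    Assumption: --poolData and --reportSample are bare switches.
    """
    cmd = [
        "perl", "mutspecStat.pl",
        "--refGenome", ref_genome,
        "--pathRscript", path_rscript,
        "--pathSeqRefGenome", path_seq_ref_genome,
    ]
    if pool_data:
        cmd.append("--poolData")
    # One Excel report per sample once the sample count exceeds the threshold.
    if n_samples > report_threshold:
        cmd.append("--reportSample")
    cmd.append(input_folder)
    return cmd


cmd = build_command("input_folder", "hg19", "mutspec/", "dir/hg19db/",
                    n_samples=300)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the analysis.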