March 13, 2017

MutSpec-Stat

Calculate statistics on mutation features for each individual sample as well as for the sample pool.

The statistics calculated include:

  • Counts and distributions of overall mutation types (SBS and indels)
  • Distribution of SBS by functional regions (based on RefSeq annotations)
  • Distribution of SBS by chromosomes
  • Counts and distributions of SBS in their trinucleotide sequence context (96 mutation types), calculated on the genome sequence or only on transcribed sequences (stranded analysis).

Input format

A folder of tab-delimited text files (one file per sample), either generated by MutSpec-Annot or containing the following columns (header names are case sensitive):

| Chr | Start | End | Ref | Alt | Func.refGene | Gene.refGene | GeneDetail.refGene | ExonicFunc.refGene | AAChange.refGene | Strand | context |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chr1 | 100535282 | 100535282 | G | T | intronic | MFSD14A | NA | NA | NA | + | CAGTGACCCTGGCTGGCCATC |
| chr1 | 120306983 | 120306983 | A | C | exonic | HMGCS2 | NA | nonsynonymous SNV | HMGCS2:NM_001166107:exon2:c.T371G:p.V124G,HMGCS2:NM_005518:exon2:c.T371G:p.V124G | - | CAGCCTGCCCACAGAGTCCCA |
| chr1 | 151341524 | 151341524 | C | A | exonic | SELENBP1 | NA | nonsynonymous SNV | SELENBP1:NM_001258289:exon4:c.G442T:p.V148L,SELENBP1:NM_003944:exon4:c.G316T:p.V106L | - | ACGTCCACCACATAGATGCGA |
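Because the header names are case sensitive, it can help to check a file before running the tool. The following is a minimal sketch of such a check, not part of MutSpec itself; the function name `missing_columns` and the in-memory example file are hypothetical, and the column list is copied from the table above.

```python
import csv
import io

# Column names listed in the input format section (case sensitive).
REQUIRED_COLUMNS = [
    "Chr", "Start", "End", "Ref", "Alt",
    "Func.refGene", "Gene.refGene", "GeneDetail.refGene",
    "ExonicFunc.refGene", "AAChange.refGene", "Strand", "context",
]


def missing_columns(handle):
    """Return the required columns absent from a tab-delimited file header."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Illustrative in-memory file; a real check would open one file per sample.
example = io.StringIO(
    "\t".join(REQUIRED_COLUMNS) + "\n"
    + "chr1\t100535282\t100535282\tG\tT\tintronic\tMFSD14A\tNA\tNA\tNA\t+\t"
    + "CAGTGACCCTGGCTGGCCATC\n"
)
print(missing_columns(example))
```

An empty list means every required column is present; a header written as `chr` instead of `Chr` would be reported as missing.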

Output

  • Excel file that includes all computed statistics shown in tabular and graphical formats, for each sample (one per datasheet) and for the pooled samples (optional).
  • A folder containing all the figures and tables by sample.
  • Input matrix for the tool MutSpec-NMF.

The following statistics are generated and summarised in the Excel file:

Graph 1. SBS distribution: Proportion (percent of all SBS) of each type of single base substitution (SBS). All SBS are considered, including the ones without strand orientation annotation.

Table 1. Frequency and counts of all SBS: Values corresponding to graph 1.
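The counts and percentages behind Graph 1 and Table 1 can be sketched as follows. This is an illustration only, assuming the common convention of collapsing each substitution onto its pyrimidine reference base (so that G>T is counted as C>A); the function names are hypothetical and the tool's own implementation may differ.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SBS_TYPES = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def is_sbs(ref, alt):
    """True for a single base substitution (indels and other variants are skipped)."""
    return (len(ref) == 1 and len(alt) == 1
            and ref in "ACGT" and alt in "ACGT" and ref != alt)


def sbs_type(ref, alt):
    """Collapse a substitution onto its pyrimidine reference base (assumed convention)."""
    if ref in "AG":
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def sbs_distribution(variants):
    """Map each of the six SBS types to (count, percent of all SBS)."""
    counts = Counter(sbs_type(r, a) for r, a in variants if is_sbs(r, a))
    total = sum(counts.values())
    return {t: (counts[t], 100.0 * counts[t] / total if total else 0.0)
            for t in SBS_TYPES}


# Ref/Alt pairs taken from the three example rows of the input format section.
for sbs, (count, percent) in sbs_distribution([("G", "T"), ("A", "C"), ("C", "A")]).items():
    print(f"{sbs}\t{count}\t{percent:.1f}")
```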

Graph 2. Impact on protein sequence: Impact of all mutations (SBS and indels) on the protein sequence based on the ExonicFunc.refGene annotation. For more details about the annotation, please visit the Annovar web page.

Table 2. Frequency and counts of functional impacts: Values corresponding to graph 2.

Graph 3. Stranded distribution of SBS: Proportion (percent of all SBS with strand annotation) of the six substitution types on the transcribed and non-transcribed strands. Only regions with strand annotation are considered.

Table 3. Significance of the strand biases: The strand bias for each SBS type is calculated as the ratio of the SBS count on the non-transcribed (coding) strand to the SBS count on the transcribed (non-coding) strand. The statistical significance of the difference between the mutation frequencies on the two strands (each expected to be 0.5 by chance) is assessed using a chi-squared test followed by the Benjamini-Hochberg procedure for multiple testing correction (only samples with at least 1 mutation on the non-transcribed or on the transcribed strand are considered). Two tables are shown in the Excel report to display the 6 SBS types in both orientations.
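The test described for Table 3 can be sketched in plain Python. This is a minimal illustration, assuming a chi-squared goodness-of-fit test with one degree of freedom and no continuity correction; the tool's R scripts may compute it differently, and the function names and counts below are hypothetical.

```python
import math


def chi2_strand_pvalue(non_transcribed, transcribed):
    """Chi-squared goodness-of-fit test (1 d.f.) against an expected 50/50 split."""
    total = non_transcribed + transcribed
    if total == 0:
        return 1.0
    expected = total / 2.0
    chi2 = ((non_transcribed - expected) ** 2
            + (transcribed - expected) ** 2) / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical (non-transcribed, transcribed) counts for the six SBS types.
counts = [(60, 40), (50, 50), (30, 45), (12, 10), (80, 55), (20, 22)]
pvalues = [chi2_strand_pvalue(nt, t) for nt, t in counts]
for (nt, t), p, q in zip(counts, pvalues, benjamini_hochberg(pvalues)):
    print(f"ratio={nt / t:.2f}\tp={p:.4f}\tadjusted={q:.4f}")
```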

Table 4. SBS distribution by functional region: Counts and percentages of SBS in genomic regions based on the Func.refGene annotation.

Table 5. Strand bias by functional region: Counts of the 6 SBS types on the non-transcribed and transcribed strands in the different functional regions.

Table 6. SBS distribution per chromosome: Counts of SBS per chromosome for the six SBS types. The correlation between SBS counts and chromosome size is calculated using a Pearson correlation test.
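The correlation reported with Table 6 is based on the Pearson coefficient, which can be sketched as below. The sketch computes only the coefficient (not the p-value of the test), the function name is hypothetical, and the sizes and counts are made-up illustrative values rather than real genome data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: chromosome sizes (Mb) and SBS counts per chromosome.
sizes_mb = [250, 240, 200, 190, 180]
sbs_counts = [120, 118, 90, 95, 80]
print(round(pearson_r(sizes_mb, sbs_counts), 3))
```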

Panel 1. Trinucleotide sequence context of SBS on the genomic sequence: The trinucleotide sequence context takes into account the bases immediately flanking the SBS on its 5' and 3' sides. SBS counts and frequencies are shown as tables, heatmaps or bar graphs. The heatmap colors are scaled to the maximum value of the corresponding table. The bar graph is scaled to the maximum frequency value (the total number of mutations per SBS type is shown in parentheses).
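One way to derive the trinucleotide mutation type from an input row is sketched below. It assumes that the `context` column is an odd-length sequence centred on the mutated base (as in the example rows of the input format section, where the 11th of 21 bases matches `Ref`) and that types are expressed relative to the pyrimidine reference base, which gives 6 substitutions x 16 flanking pairs = 96 types. The function names and the `A[C>A]G` label format are illustrative choices, not necessarily those used by the tool.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def trinucleotide_type(context, ref, alt):
    """Return a label such as 'A[C>A]G' for an SBS centred in `context`."""
    mid = len(context) // 2
    if context[mid] != ref:
        raise ValueError("the centre of the context does not match Ref")
    tri = context[mid - 1:mid + 2]
    if ref in "AG":
        # Purine reference: read the mutation on the opposite strand.
        tri = reverse_complement(tri)
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{tri[0]}[{ref}>{alt}]{tri[2]}"


# Context, Ref and Alt from the three example rows of the input format section.
rows = [
    ("CAGTGACCCTGGCTGGCCATC", "G", "T"),
    ("CAGCCTGCCCACAGAGTCCCA", "A", "C"),
    ("ACGTCCACCACATAGATGCGA", "C", "A"),
]
for context, ref, alt in rows:
    print(trinucleotide_type(context, ref, alt))
```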

Panel 2. Stranded analysis of trinucleotide sequence context of SBS: SBS within their trinucleotide sequence context are counted on the non-transcribed and transcribed strands of the gene region they are located in. Counts and frequencies are shown as tables or bar graphs. Only SBS with strand orientation annotation are considered in this analysis (strand annotation retrieved from RefSeq database).
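The strand assignment can be sketched as follows, assuming a commonly used convention: the substitution is expressed with a pyrimidine reference base, and it is counted on the non-transcribed (coding) strand when that pyrimidine lies on the same strand as the gene. The function name is hypothetical and the tool's own rule may differ in detail.

```python
def mutation_strand(ref, gene_strand):
    """Return 'non-transcribed', 'transcribed', or None without strand annotation.

    `ref` is the reference base as reported on the + strand of the genome and
    `gene_strand` is the Strand column ('+' or '-').
    """
    if gene_strand not in ("+", "-") or ref not in ("A", "C", "G", "T"):
        return None
    # The pyrimidine of the mutated base pair lies on the + strand
    # only when Ref itself is a pyrimidine.
    pyrimidine_on_plus = ref in ("C", "T")
    gene_on_plus = gene_strand == "+"
    # The coding (non-transcribed) strand is the strand the gene is annotated on.
    if pyrimidine_on_plus == gene_on_plus:
        return "non-transcribed"
    return "transcribed"


# Ref and Strand from the three example rows of the input format section.
for ref, strand in [("G", "+"), ("A", "-"), ("C", "-")]:
    print(ref, strand, mutation_strand(ref, strand))
```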

Usage

perl mutspecStat.pl --refGenome hg19 --pathRscript mutspec/ --pathSeqRefGenome dir/hg19db/ input_folder

--refGenome, --pathRscript, --pathSeqRefGenome and input_folder are compulsory.

List of parameters with default values:

| Parameter | Default value | Description |
|---|---|---|
| --refGenome | | Name of the reference genome |
| --pathRscript | | Path to mutspec R scripts |
| --pathSeqRefGenome | | Path to reference genome fasta files (one per chromosome) |
| --outfile | input directory | Output directory |
| --temp | directory in which the script is run | Path for saving temporary files |
| --poolData | false | Calculate statistics on the sample pool |
| --reportSample | false | Generate an Excel report for each sample |
| --help | | Print help message |
| input_folder | | Input folder |

By default, one output Excel file will be generated with statistics of each sample shown in different datasheets.
Adding the option --reportSample will generate one Excel file for each sample instead. It is recommended to use this option if you are analysing more than 250 samples as the Excel output file may be too heavy to open easily on a computer with limited RAM.
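The recommendation above can be folded into a small wrapper that assembles the command line. This is a sketch only: the helper `build_command` is hypothetical, it assumes that `--poolData` and `--reportSample` are used as bare switches, and it only builds the argument list without running the script.

```python
def build_command(ref_genome, path_rscript, path_seq_ref_genome,
                  input_folder, n_samples, pool_data=False):
    """Assemble the mutspecStat.pl argument list from the documented options."""
    cmd = [
        "perl", "mutspecStat.pl",
        "--refGenome", ref_genome,
        "--pathRscript", path_rscript,
        "--pathSeqRefGenome", path_seq_ref_genome,
    ]
    if pool_data:
        cmd.append("--poolData")
    # One Excel file per sample is recommended above 250 samples.
    if n_samples > 250:
        cmd.append("--reportSample")
    cmd.append(input_folder)
    return cmd


# Paths reused from the usage example; the sample count is illustrative.
print(" ".join(build_command("hg19", "mutspec/", "dir/hg19db/", "input_folder", 300)))
```

The resulting list could be passed to `subprocess.run` once the paths point to a real installation.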