mutspec_stat.md
March 13, 2017
MutSpec-Stat
Calculate statistics on mutation features for each individual sample as well as for the sample pool.
The statistics calculated include:
- Counts and distributions of overall mutation types (SBS and indels)
- Distribution of SBS by functional regions (based on RefSeq annotations)
- Distribution of SBS by chromosomes
- Counts and distributions of SBS in their trinucleotide sequence context (96 mutation types), calculated on the genome sequence or only on transcribed sequences (stranded analysis).
Input format
A folder with tab delimited text files (one file per sample) generated by MutSpec-Annot or tab delimited text files containing the following columns (header names are case sensitive):
| Chr | Start | End | Ref | Alt | Func.refGene | Gene.refGene | GeneDetail.refGene | ExonicFunc.refGene | AAChange.refGene | Strand | context |
|---|---|---|---|---|---|---|---|---|---|---|---|
| chr1 | 100535282 | 100535282 | G | T | intronic | MFSD14A | NA | NA | NA | + | CAGTGACCCTGGCTGGCCATC |
| chr1 | 120306983 | 120306983 | A | C | exonic | HMGCS2 | NA | nonsynonymous SNV | HMGCS2:NM_001166107:exon2:c.T371G:p.V124G,HMGCS2:NM_005518:exon2:c.T371G:p.V124G | - | CAGCCTGCCCACAGAGTCCCA |
| chr1 | 151341524 | 151341524 | C | A | exonic | SELENBP1 | NA | nonsynonymous SNV | SELENBP1:NM_001258289:exon4:c.G442T:p.V148L,SELENBP1:NM_003944:exon4:c.G316T:p.V106L | - | ACGTCCACCACATAGATGCGA |
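Because the header names are case sensitive, it can help to check a file before running the tool. The following is a minimal sketch of such a check, not part of MutSpec itself: the function name `missing_columns` and the in-memory sample are hypothetical, and the column list is simply copied from the table above.

```python
import csv
import io

# Column names copied from the input-format table; matching is case sensitive.
REQUIRED_COLUMNS = [
    "Chr", "Start", "End", "Ref", "Alt",
    "Func.refGene", "Gene.refGene", "GeneDetail.refGene",
    "ExonicFunc.refGene", "AAChange.refGene", "Strand", "context",
]


def missing_columns(handle):
    """Return the required column names absent from a tab-delimited header."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    present = set(header)
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Hypothetical in-memory file standing in for one sample file.
sample = io.StringIO(
    "Chr\tStart\tEnd\tRef\tAlt\tStrand\tcontext\n"
    "chr1\t100535282\t100535282\tG\tT\t+\tCAGTGACCCTGGCTGGCCATC\n"
)
print(missing_columns(sample))
```

For a real file, the same function could be called on `open(path, newline="")`; an empty list means every expected column is present.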
Output
- Excel file that includes all computed statistics shown in tabular and graphical formats, for each sample (one per datasheet) and for the pooled samples (optional).
- A folder containing all the figures and tables by sample.
- Input matrix for the tool MutSpec-NMF.
The following statistics are generated and summarised in the Excel file:
Graph 1. SBS distribution Proportion (percent of all SBS) of each type of single base substitution (SBS). All SBS are considered, including the ones without strand orientation annotation.
Table 1. Frequency and counts of all SBS Values corresponding to graph 1.
Graph 2. Impact on protein sequence Impact of all mutations (SBS and Indel) on the protein sequence based on the ExonicFunc.refGene annotation. For more details about the annotation, please visit the Annovar web page.
Table 2. Frequency and counts of functional impacts Values corresponding to graph 2.
Graph 3. Stranded distribution of SBS Proportion (percent of all SBS with strand annotation) of the six substitution types on the transcribed and non-transcribed strand. Only regions with strand annotation are considered.
Table 3. Significance of the strand biases The strand bias for each SBS type is calculated as the ratio of SBS on the non-transcribed (coding) versus the transcribed (non-coding) strand. The statistical significance of the difference between the mutation frequencies on the non-transcribed and the transcribed strand (each expected to be 0.5 by chance) is assessed using a chi-squared test followed by the Benjamini-Hochberg procedure for multiple testing correction (only samples with at least 1 mutation on the non-transcribed or on the transcribed strand are considered). Two tables are shown in the Excel report to display the 6 SBS types in both orientations.
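The test described for Table 3 can be sketched as follows. This is an illustration under stated assumptions, not the code MutSpec runs: the function names and the counts are hypothetical, and the p-value uses the closed form of the chi-squared survival function for one degree of freedom instead of a statistics library.

```python
import math


def chi2_strand_bias_pvalue(non_transcribed, transcribed):
    """Chi-squared goodness-of-fit test (1 df) against an expected 50/50 split."""
    total = non_transcribed + transcribed
    if total == 0:
        raise ValueError("no mutations on either strand")
    expected = total / 2
    chi2 = ((non_transcribed - expected) ** 2
            + (transcribed - expected) ** 2) / expected
    # For 1 degree of freedom the survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: (non-transcribed, transcribed) per SBS type.
counts = {"C>A": (120, 80), "C>G": (40, 42), "C>T": (200, 150),
          "T>A": (30, 28), "T>C": (90, 95), "T>G": (25, 10)}
raw = [chi2_strand_bias_pvalue(nt, t) for nt, t in counts.values()]
for (sbs, (nt, t)), p, q in zip(counts.items(), raw, benjamini_hochberg(raw)):
    ratio = nt / t if t else float("inf")
    print(f"{sbs}\tratio={ratio:.2f}\tp={p:.4g}\tadjusted={q:.4g}")
```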
Table 4. SBS distribution by functional region Counts and percentages of SBS in genomic regions based on the Func.refGene annotation.
Table 5. Strand bias by functional region Counts of the strand bias for the 6 SBS types in different functional regions.
Table 6. SBS distribution per chromosome Counts of SBS per chromosome for the six SBS types. The correlation between SBS counts and chromosome size is calculated using a Pearson correlation test.
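As an illustration of the correlation reported in Table 6, the sketch below computes the Pearson coefficient between per-chromosome SBS counts and chromosome sizes. It covers only the coefficient, not the associated significance test, and every number in it is a made-up placeholder rather than real genome data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values: chromosome sizes (in Mb) and SBS counts for one sample.
chromosome_sizes_mb = [250, 240, 200, 190, 180]
sbs_counts = [510, 470, 395, 400, 350]
print(f"r = {pearson_r(chromosome_sizes_mb, sbs_counts):.3f}")
```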
Panel 1. Trinucleotide sequence context of SBS on the genomic sequence The trinucleotide sequence context takes into consideration the flanking base in 5' and in 3' of the SBS. SBS counts and frequency data are shown as tables, heatmaps or bar graphs. The heatmap colors are scaled to the maximum value of the corresponding table. The bar graph is scaled to the maximum frequency value (the total number of mutations by SBS type is shown in parentheses).
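A minimal sketch of how one of the 96 mutation types can be derived from the Ref, Alt and context columns is given below. It assumes that the mutated base sits at the centre of the context string (as in the three example rows of the input table) and that purine references are reported on the opposite strand so that every type starts from C or T, which is the usual convention for the 96-type classification. The function name and the label format `5'[ref>alt]3'` are choices made for this example.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sbs96_label(ref, alt, context):
    """Return a label such as 'C[C>A]A' for one single base substitution."""
    ref, alt, context = ref.upper(), alt.upper(), context.upper()
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("not a single base substitution")
    if len(context) < 3 or len(context) % 2 == 0:
        raise ValueError("context must have odd length >= 3")
    centre = len(context) // 2
    if context[centre] != ref:
        raise ValueError("reference base is not at the centre of the context")
    five, three = context[centre - 1], context[centre + 1]
    if ref in ("A", "G"):
        # Purine reference: switch strands, so bases are complemented
        # and the 5' and 3' flanks swap places.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        five, three = COMPLEMENT[three], COMPLEMENT[five]
    return f"{five}[{ref}>{alt}]{three}"


# The three example rows from the input-format table: (Ref, Alt, context).
rows = [
    ("G", "T", "CAGTGACCCTGGCTGGCCATC"),
    ("A", "C", "CAGCCTGCCCACAGAGTCCCA"),
    ("C", "A", "ACGTCCACCACATAGATGCGA"),
]
counts = Counter(sbs96_label(*row) for row in rows)
print(counts)
```

Dividing each count by the total number of SBS would give the frequencies shown in the tables and bar graphs.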
Panel 2. Stranded analysis of trinucleotide sequence context of SBS SBS within their trinucleotide sequence context are counted on the non-transcribed and transcribed strands of the gene region they are located in. Counts and frequencies are shown as tables or bar graphs. Only SBS with strand orientation annotation are considered in this analysis (strand annotation retrieved from RefSeq database).
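The strand assignment can be sketched as follows. This is an assumption about the convention, not a description of MutSpec internals: it supposes that Ref and Alt are reported on the reference (+) strand, that SBS are expressed from the pyrimidine base, and that the non-transcribed strand of a gene is the strand named in the Strand column. The function name `strand_category` is hypothetical.

```python
PYRIMIDINES = {"C", "T"}
PURINES = {"A", "G"}


def strand_category(ref, gene_strand):
    """Classify an SBS as 'transcribed' or 'non-transcribed'.

    Returns None when no strand annotation is available, since such SBS
    are left out of the stranded analysis.
    """
    ref = ref.upper()
    if ref not in PYRIMIDINES | PURINES:
        raise ValueError(f"unexpected reference base: {ref!r}")
    if gene_strand not in ("+", "-"):
        return None
    # The pyrimidine of the mutated base pair lies on the + strand when the
    # reference base is C or T, and on the - strand otherwise.
    pyrimidine_strand = "+" if ref in PYRIMIDINES else "-"
    # The gene's own strand is its coding (non-transcribed) strand.
    if pyrimidine_strand == gene_strand:
        return "non-transcribed"
    return "transcribed"


# (Ref, Strand) pairs from the example rows of the input table.
for ref, strand in [("G", "+"), ("A", "-"), ("C", "-")]:
    print(ref, strand, strand_category(ref, strand))
```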
Usage
perl mutspecStat.pl --refGenome hg19 --pathRscript mutspec/ --pathSeqRefGenome dir/hg19db/ input_folder
--refGenome, --pathRscript, --pathSeqRefGenome and input_folder are compulsory.
List of parameters with default values:
| Parameter | Default value | Description |
|---|---|---|
| --refGenome | | Name of the reference genome |
| --pathRscript | | Path to mutspec R scripts |
| --pathSeqRefGenome | | Path to reference genome fasta files (one per chromosome) |
| --outfile | input directory | Output directory |
| --temp | directory in which the script is run | Path for saving temporary files |
| --poolData | false | Calculate statistics on the sample pool |
| --reportSample | false | Generate an Excel report for each sample |
| --help | | Print help message |
| input_folder | | Input folder |
By default, one output Excel file will be generated with statistics of each sample shown in different datasheets.
Adding the option --reportSample will generate one Excel file for each sample instead. It is recommended to use this option if you are analysing more than 250 samples as the Excel output file may be too heavy to open easily on a computer with limited RAM.
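For scripted runs, the command line can be assembled programmatically. The helper below is a hypothetical wrapper, not something shipped with MutSpec; it only builds the argument list from the parameters listed above, and it assumes that --poolData and --reportSample are bare switches that take no value.

```python
def build_command(ref_genome, path_rscript, path_seq_ref_genome, input_folder,
                  outfile=None, temp=None, pool_data=False, report_sample=False):
    """Build the argument list for a mutspecStat.pl run."""
    cmd = [
        "perl", "mutspecStat.pl",
        "--refGenome", ref_genome,
        "--pathRscript", path_rscript,
        "--pathSeqRefGenome", path_seq_ref_genome,
    ]
    if outfile:
        cmd += ["--outfile", outfile]
    if temp:
        cmd += ["--temp", temp]
    if pool_data:
        cmd.append("--poolData")  # assumed to be a switch without a value
    if report_sample:
        cmd.append("--reportSample")  # assumed to be a switch without a value
    cmd.append(input_folder)
    return cmd


# Hypothetical sample count; one report per sample is requested above 250.
n_samples = 300
command = build_command("hg19", "mutspec/", "dir/hg19db/", "input_folder",
                        report_sample=n_samples > 250)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where the script and its dependencies are installed.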