README.md

April 17, 2018 · View on GitHub

Synopsis

Get the readcounts at a locus by piping samtools mpileup output. The mpileup can contain one or several samples. This program has been tested on samtools v1.3.1

Install samtools

Compile mpileup2readcounts :

g++ -std=c++11 -O3 mpileup2readcounts.cc -o mpileup2readcounts

Usage

samtools mpileup -f ref.fa -l regions.bed BAM/*.bam | sed 's/		/	* 	*/g' | ./mpileup2readcounts 0 -5 false 3 0

Samtools arguments :

FASTA file
bed file
BAM files : several samples can be parsed

Four options for mpileup2readcounts :

0 to parse all sample otherwise specify the number of the sample (for example 1 for the first sample)
BQcut : base quality score cutoff for each mapped/unmapped base, only those larger than cutoff will be output in the result, to use no filter set BQcut to -5
true to ignore indels
min_ao : minimum number of non-ref reads in at least one sample to consider a site
min_af : minimum allelic fraction in at least one sample to consider a site

Example output

chr	loc	ref	depth	A	T	C	a	c	Insertion	Deletion	depth	C	c	Insertion	Deletion
17	7572814	C	28	0	3	23	0	2	NA	NA	8	8	0	NA	NA
17	7572817	C	32	2	0	26	2	2	NA	NA	8	8	0	NA	NA
17	7579643	C	48	0	0	9	0	39	NA	4:ccccagccctccaggt\|2:CCCCAGCCCTCCAGGT	9	6	3	NA	NA

Line content

Common information for all samples:

chromosome
position on the chromosome
reference base

For each sample :

depth
ATCG/atcg count
insertions
deletions : in the example, for the first sample 6 deletions starting from position 7579643 + 1 are found

Using GNU parallel tool

We first remove overlap in the bed using bedtools. Note that the header is removed here:

grep -v '^track' regions.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk '{print \$1"\t"\$2"\t"\$3}' | sed 's/[[:space:]]/:/' | sed 's/[[:space:]]/-/' | parallel --keep-order "samtools mpileup -f ref.fa --region {} test.bam | sed 's/		/	*	*/g' | mpileup2readcounts 0 -5 false 0 | tail -n +2"