README.md

April 17, 2018 ยท View on GitHub

Build Status

Synopsis

Get the readcounts at a locus by piping samtools mpileup output. The mpileup can contain one or several samples. This program has been tested on samtools v1.3.1

Install samtools

Compile mpileup2readcounts :

g++ -std=c++11 -O3 mpileup2readcounts.cc -o mpileup2readcounts

Usage

samtools mpileup -f ref.fa -l regions.bed BAM/*.bam | sed 's/		/	* 	*/g' | ./mpileup2readcounts 0 -5 false 3 0

Samtools arguments :

  • FASTA file
  • bed file
  • BAM files : several samples can be parsed

Four options for mpileup2readcounts :

  • 0 to parse all sample otherwise specify the number of the sample (for example 1 for the first sample)
  • BQcut : base quality score cutoff for each mapped/unmapped base, only those larger than cutoff will be output in the result, to use no filter set BQcut to -5
  • true to ignore indels
  • min_ao : minimum number of non-ref reads in at least one sample to consider a site
  • min_af : minimum allelic fraction in at least one sample to consider a site

Example output

chrlocrefdepthATCGatcgInsertionDeletiondepthATCGatcgInsertionDeletion
177572814C28032300020NANA800800000NANA
177572817C32202602020NANA800800000NANA
177579643C48009000390NA4:ccccagccctccaggt|2:CCCCAGCCCTCCAGGT900600030NANA

Line content

Common information for all samples:

  • chromosome
  • position on the chromosome
  • reference base

For each sample :

  • depth
  • ATCG/atcg count
  • insertions
  • deletions : in the example, for the first sample 6 deletions starting from position 7579643 + 1 are found

Using GNU parallel tool

We first remove overlap in the bed using bedtools. Note that the header is removed here:

grep -v '^track' regions.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk '{print \$1"\t"\$2"\t"\$3}' | sed 's/[[:space:]]/:/' | sed 's/[[:space:]]/-/' | parallel --keep-order "samtools mpileup -f ref.fa --region {} test.bam | sed 's/		/	*	*/g' | mpileup2readcounts 0 -5 false 0 | tail -n +2"