README

March 3, 2022

==== MUTATRIX ====

mutatrix is a population genome simulator: it generates a set of simulated genomes by mutating a reference sequence.

It reads a reference FASTA file and outputs a VCF description of the variants on stdout, and writes each simulated, mutated copy of the reference to the current directory or a user-defined path (--file-prefix).

Example usage:

% ./mutatrix -S sample -P test/ -p 2 -n 10 reference.fasta

This command writes VCF to stdout and writes mutated references to test/, with this format:

# <prefix>/<sample id>:<fasta sequence name>:<copy number>.fa

% ls test
sample10:seq_1:0.fa  sample1:seq_1:0.fa  sample2:seq_1:0.fa  ...
sample10:seq_1:1.fa  sample1:seq_1:1.fa  sample2:seq_1:1.fa  ...
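
When post-processing the output directory, the naming scheme above can be split back into its three fields. The following is a minimal Python sketch, not part of mutatrix; the function name parse_mutated_fasta_name is hypothetical, and it assumes that the sample id contains no colon and that the copy number is the last colon-separated field.

```python
def parse_mutated_fasta_name(filename):
    """Split '<sample id>:<fasta sequence name>:<copy number>.fa' into fields.

    Hypothetical helper, not part of mutatrix. Assumes the sample id has no
    colon; the sequence name may contain colons.
    """
    if not filename.endswith(".fa"):
        raise ValueError("expected a .fa file name: %r" % filename)
    stem = filename[: -len(".fa")]
    sample_id, rest = stem.split(":", 1)      # first field is the sample id
    seq_name, copy = rest.rsplit(":", 1)      # last field is the copy number
    return sample_id, seq_name, int(copy)


if __name__ == "__main__":
    for name in ["sample10:seq_1:0.fa", "sample1:seq_1:1.fa"]:
        print(name, "->", parse_mutated_fasta_name(name))
```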

mutatrix is suitable for use in testing pooled variant detectors, as it distributes alleles throughout the population according to a zeta distribution, which is roughly consistent with the power-law allele frequency spectrum observed by large population sequencing projects like the 1000 Genomes Project.

Alternate allele generation:

mutatrix generates alleles using the following model:

At each position in the reference, we draw a pseudorandom number on [0,1). If this number, scaled by the number of copies of the genome in the population, is below --rate (default 0.001), then we generate an alternate minor allele.
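
The per-site test can be sketched in Python as follows. This is an illustration, not the mutatrix source. It assumes that "scaled by the number of copies" means the draw is divided by the copy count, so that larger populations carry more variant sites; the text above does not say which way the scaling goes. The function names are hypothetical.

```python
import random


def site_is_mutated(rng, rate, copies):
    """Decide whether one reference position receives an alternate allele.

    Assumption: the uniform draw on [0,1) is divided by the number of genome
    copies before being compared with the rate.
    """
    return rng.random() / copies < rate


def mutated_positions(ref_length, rate=0.001, copies=20, seed=1):
    """Return the 0-based positions that pass the per-site test."""
    rng = random.Random(seed)
    return [pos for pos in range(ref_length)
            if site_is_mutated(rng, rate, copies)]


if __name__ == "__main__":
    sites = mutated_positions(100000)
    # Under the division assumption, about rate * copies of all sites mutate.
    print(len(sites), "variant sites out of 100000")
```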

We then sample a second number, and if it is below --indel-snp-ratio, we generate an indel. Otherwise, we generate a SNP or MNP. MNP lengths follow a geometric distribution governed by --mnp-ratio. For example, at a ratio of 0.01, a 2bp MNP occurs at 0.01 times the rate of SNPs, a 3bp MNP occurs at 0.01 times the rate of 2bp MNPs, etc.
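
A minimal Python sketch of this choice, for illustration only. The function name and the ratio values used in the example are hypothetical, not mutatrix defaults. The geometric length is drawn by extending the allele one base at a time with probability mnp_ratio, which gives each length k+1 a rate of mnp_ratio times that of length k.

```python
import random


def draw_variant_type(rng, indel_snp_ratio, mnp_ratio):
    """Return ('indel', None), ('snp', 1) or ('mnp', length).

    Hypothetical helper; requires 0 <= mnp_ratio < 1 so the loop terminates.
    """
    if rng.random() < indel_snp_ratio:
        return ("indel", None)
    length = 1
    # Geometric length: P(length = k) is proportional to mnp_ratio ** (k - 1).
    while rng.random() < mnp_ratio:
        length += 1
    return ("snp" if length == 1 else "mnp", length)


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative ratios only.
    for _ in range(5):
        print(draw_variant_type(rng, indel_snp_ratio=0.2, mnp_ratio=0.01))
```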

Indels are generated by drawing a length from a zeta distribution with alpha --indel-alpha. (An alpha of 1.7 is used, per observations in [1].) If the drawn length exceeds --indel-max, we skip the indel and continue. The sequence of a novel insertion is generated at random.
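
The indel step can be sketched in Python as below. This is not the mutatrix implementation: the zeta draw uses Devroye's rejection sampler, which may differ from what mutatrix does internally, and the 50/50 split between insertions and deletions and the indel_max value are assumptions made only for the example.

```python
import random


def sample_zeta(rng, alpha):
    """Draw k >= 1 with P(k) proportional to k ** -alpha (alpha > 1).

    Devroye's rejection algorithm; a stand-in for mutatrix's own sampler.
    """
    b = 2.0 ** (alpha - 1.0)
    while True:
        u = 1.0 - rng.random()          # uniform on (0, 1], avoids u == 0
        v = rng.random()
        x = int(u ** (-1.0 / (alpha - 1.0)))
        t = (1.0 + 1.0 / x) ** (alpha - 1.0)
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:
            return x


def draw_indel(rng, alpha=1.7, indel_max=1000):
    """Return ('ins', sequence), ('del', length), or None if too long."""
    length = sample_zeta(rng, alpha)
    if length > indel_max:
        return None                      # skip: longer than the allowed maximum
    if rng.random() < 0.5:               # assumption: insertions and deletions equally likely
        return ("ins", "".join(rng.choice("ACGT") for _ in range(length)))
    return ("del", length)


if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(5):
        print(draw_indel(rng, indel_max=50))
```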

Allele frequency spectrum simulation:

Once generated, the alternate allele is distributed across the population of simulated individuals by sampling an allele frequency from a zeta distribution (also with alpha 1.7). The alternate alleles are randomly distributed across the population.
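
As an illustration, the distribution step might look like the Python sketch below. How mutatrix turns the zeta draw into an allele count is not described here, so the sketch makes an assumption: it draws the number of alternate copies directly from a zeta-shaped distribution truncated to 1..(samples x ploidy), then picks the carrying genome copies uniformly at random. The function name is hypothetical.

```python
import random


def distribute_allele(rng, n_samples, ploidy, alpha=1.7):
    """Return per-sample genotypes as lists of 0 (ref) / 1 (alt) per copy.

    Assumption: the alternate allele count follows weights k ** -alpha over
    1..total copies (a truncated zeta), not necessarily mutatrix's own rule.
    """
    total = n_samples * ploidy
    counts = range(1, total + 1)
    weights = [k ** -alpha for k in counts]
    allele_count = rng.choices(counts, weights=weights, k=1)[0]
    # Choose which genome copies carry the alternate allele, without replacement.
    carriers = rng.sample(range(total), allele_count)
    genotypes = [[0] * ploidy for _ in range(n_samples)]
    for c in carriers:
        genotypes[c // ploidy][c % ploidy] = 1
    return genotypes


if __name__ == "__main__":
    rng = random.Random(3)
    for sample, gt in enumerate(distribute_allele(rng, n_samples=10, ploidy=2), 1):
        print("sample%d" % sample, gt)
```

Because most of the probability mass sits on small counts, most simulated variants end up rare in the population, with occasional common ones.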

There is no concept of haplotype block or linkage in mutatrix. Each allele and site is effectively independent from other sites.

Dependency on vcflib:

You'll need to be able to build vcflib. This might involve installing libtabixpp-dev.

author: Erik Garrison <erik.garrison@bc.edu>
license: MIT (free)

references:

[1] Problems and Solutions for Estimating Indel Rates and Length Distributions. Reed A. Cartwright. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2734402/