compare_genes.md
July 25, 2017
Gene Content Dynamics
These scripts will allow you to compare the gene content of a species between all pairs of metagenomic samples.
Before running these scripts, you'll need to have run merge_midas.py genes.
Command usage:
compare_genes.py --indir <PATH> --out <PATH> [options]
--indir PATH Path to output from 'merge_midas.py genes' for one species
(directory should be named according to a species_id and contain files 'genes_*.txt')
--out PATH Path to output file
Options:
--max_genes INT Maximum number of genes to use. Useful for quick tests (use all)
--max_samples INT Maximum number of samples to use. Useful for quick tests (use all)
--distance {jaccard,euclidean,manhattan}
Metric to use for computing distances (jaccard)
--dtype {presabs,copynum}
Data type to use for comparing genes (presabs)
--cutoff FLOAT Cutoff to use for determining presence absence (0.35)
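To make the --dtype, --cutoff, and --distance options more concrete, here is a minimal Python sketch of what they plausibly control. It is not taken from compare_genes.py: the function names are hypothetical, the rule that a gene is present when its copy number is at or above the cutoff is an assumption, and the Jaccard distance is the textbook definition (1 minus intersection over union).

```python
import math


def to_presence_absence(copy_numbers, cutoff=0.35):
    """Map gene copy numbers to 1 (present) or 0 (absent)."""
    # Assumption: a gene counts as present when copy number >= cutoff.
    return [1 if value >= cutoff else 0 for value in copy_numbers]


def gene_distance(x, y, metric="jaccard"):
    """Dissimilarity between two equal-length gene vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if metric == "jaccard":
        # Any non-zero entry is treated as a present gene.
        both = sum(1 for a, b in zip(x, y) if a and b)
        either = sum(1 for a, b in zip(x, y) if a or b)
        return 1.0 - both / either if either else 0.0
    if metric == "euclidean":
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    if metric == "manhattan":
        return sum(abs(a - b) for a, b in zip(x, y))
    raise ValueError(f"unknown metric: {metric}")


# Hypothetical copy numbers for five genes in two samples.
sample1 = [1.2, 0.0, 0.4, 0.9, 0.1]
sample2 = [0.8, 0.5, 0.2, 1.1, 0.0]

presabs1 = to_presence_absence(sample1)  # [1, 0, 1, 1, 0]
presabs2 = to_presence_absence(sample2)  # [1, 1, 0, 1, 0]

print(gene_distance(presabs1, presabs2, "jaccard"))
print(gene_distance(sample1, sample2, "manhattan"))
```

Under these assumptions, lowering the cutoff marks more genes as present and raising it marks fewer, which changes the presence/absence vectors that the distance is computed from.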
Examples:
- Run with defaults:
  compare_genes.py --indir /path/to/species --out distances.txt
- Run a quick test:
  compare_genes.py --indir /path/to/species --out distances.txt --max_genes 1000 --max_samples 10
- Use a different distance metric:
  compare_genes.py --indir /path/to/species --out distances.txt --distance manhattan
- Use a lenient cutoff for determining gene presence absence:
  compare_genes.py --indir /path/to/species --out distances.txt --cutoff 0.10
- Use a strict cutoff for determining gene presence absence:
  compare_genes.py --indir /path/to/species --out distances.txt --cutoff 0.75
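If you want to compare lenient, default, and strict cutoffs side by side, one option is to generate the command lines from a short Python script. The sketch below only assembles the commands using the flags documented above; the helper name build_command and the output file names are hypothetical.

```python
import shlex


def build_command(indir, out, **options):
    """Return an argv list for compare_genes.py from documented flags."""
    argv = ["compare_genes.py", "--indir", indir, "--out", out]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


# One command per cutoff; each writes to its own (hypothetical) output file.
commands = [
    build_command("/path/to/species", f"distances_{cutoff:.2f}.txt", cutoff=cutoff)
    for cutoff in (0.10, 0.35, 0.75)
]

for argv in commands:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Each argv list could also be passed to subprocess.run(argv, check=True) to execute the runs directly, assuming compare_genes.py is on your PATH.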
Output format:
sample1: first sample identifier
sample2: second sample identifier
count1: number of present genes in sample1
count2: number of present genes in sample2
count_either: number of genes present in sample1 or sample2 (union)
count_both: number of genes present in both sample1 and sample2 (intersection)
distance: dissimilarity between gene sets
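As a rough illustration of how the output columns relate to each other, the following Python sketch builds one row per sample pair from sets of present genes. The sample names and gene IDs are made up, the distance shown is the standard Jaccard distance (assumed to correspond to the default metric), and emitting each unordered pair once is an assumption about the output rather than something stated above.

```python
from itertools import combinations


def compare_pair(name1, genes1, name2, genes2):
    """Build one output-style row from two sets of present genes."""
    either = genes1 | genes2  # genes present in sample1 or sample2
    both = genes1 & genes2    # genes present in both samples
    # Jaccard distance; defined as 0.0 here when both sets are empty.
    distance = 1.0 - len(both) / len(either) if either else 0.0
    return {
        "sample1": name1,
        "sample2": name2,
        "count1": len(genes1),
        "count2": len(genes2),
        "count_either": len(either),
        "count_both": len(both),
        "distance": distance,
    }


# Hypothetical present-gene sets for three samples.
present_genes = {
    "sampleA": {"g1", "g2", "g3"},
    "sampleB": {"g2", "g3", "g4"},
    "sampleC": {"g5"},
}

rows = [
    compare_pair(s1, present_genes[s1], s2, present_genes[s2])
    for s1, s2 in combinations(sorted(present_genes), 2)
]

for row in rows:
    print("\t".join(str(value) for value in row.values()))
```

For sampleA and sampleB, two of the four genes seen in either sample are shared, so the Jaccard distance is 0.5; sampleC shares no genes with the others, giving a distance of 1.0.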