last-genome-alignments

June 2, 2023

Here are some pair-wise genome alignments made with LAST.

2023 alignments

The 2023 directory has alignments of these genomes:

Code	Phylum	Animal	Scientific name	Genome
helRob	annelid	jawless leech	Helobdella robusta	GCF_000326865.1
hirMed	annelid	medicinal leech	Hirudo medicinalis	GCA_011800805.1
lamSat	annelid	Satsuma tubeworm	Lamellibrachia satsuma	GCA_022478865.1
oweFus	annelid	shingle tubeworm	Owenia fusiformis	GCA_903813345.2
armNas	arthropod	woodlouse	Armadillidium nasatum	GCA_009176605.1
cenScu	arthropod	scorpion	Centruroides sculpturatus	GCF_000671375.1
droMel	arthropod	fruit fly	Drosophila melanogaster	GCF_000001215.4
gloMae	arthropod	millipede	Glomeris maerens	GCA_023279145.1
homAme	arthropod	lobster	Homarus americanus	GCF_018991925.1
limPol	arthropod	horseshoe crab	Limulus polyphemus	GCF_000517525.1
macAtr	arthropod	robber fly	Machimus atricapillus	GCA_933228815.1
strMar	arthropod	centipede	Strigamia maritima	GCA_000239455.1
linAna	brachiopod	shamisen shell	Lingula anatina	GCF_001039355.2
asyLuc	chordate	Bahama lancelet	Asymmetron lucayanum	GCA_001663935.1
braFlo	chordate	lancelet	Branchiostoma floridae	GCF_000003815.2
calMil	chordate	chimaera	Callorhinchus milii	GCF_018977255.1
eptBur	chordate	hagfish	Eptatretus burgeri	GCA_024346535.1
homSap	chordate	human	Homo sapiens	hg38_no_alt_analysis_set
petMar	chordate	lamprey	Petromyzon marinus	GCF_010993605.1
acrMil	cnidaria	stony coral	Acropora millepora	GCF_013753865.1
actTen	cnidaria	waratah anemone	Actinia tenebrosa	GCF_009602425.1
epiPla	cnidaria	zoanthid	Epizoanthus planus	GCA_025388665.1
nemVec	cnidaria	starlet sea anemone	Nematostella vectensis	GCF_932526225.1
horCal	ctenophore	sea gooseberry	Hormiphora californensis	GCA_020137815.1
mneLei	ctenophore	sea walnut	Mnemiopsis leidyi	GCA_000226015.1
apoJap	echinoderm	sea cucumber	Apostichopus japonicus	GCA_002754855.1
strPur	echinoderm	sea urchin	Strongylocentrotus purpuratus	GCF_000002235.5
ptyFla	hemichordate	Hawaiian acorn worm	Ptychodera flava	GCA_001465055.1
sacKow	hemichordate	acorn worm	Saccoglossus kowalevskii	GCF_000003605.2
aplCal	mollusc	sea hare	Aplysia californica	GCF_000002075.1
craGig	mollusc	oyster	Crassostrea gigas	GCF_902806645.1
halRuf	mollusc	abalone	Haliotis rufescens	GCF_023055435.1
limBul	mollusc	sea butterfly	Limacina bulimoides	GCA_009866985.1
mizYes	mollusc	scallop	Mizuhopecten yessoensis	GCF_002113885.1
octBim	mollusc	octopus	Octopus bimaculoides	GCF_001194135.2
phoLin	mollusc	top snail	Phorcus lineatus	GCA_921293015.1
watSci	mollusc	firefly squid	Watasenia scintillans	GCA_015471945.1
phoOva	phoronid	horseshoe worm	Phoronis ovalis	GCA_028565635.1

The alignments were made with LAST version 1453, like this:

lastdb -P8 -uMAM8 -c myDB genome1.fa

last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train

lastal -P8 -D1e9 -m100 --split-f=MAF+ -p my.train myDB genome2.fa > many-to-one.maf

last-split -r many-to-one.maf > one-to-one.maf

This is currently the recommended way to compare distantly-related genomes, where most of the DNA lacks similarity.

-P8 makes it faster by using 8 threads: adjust as suitable for your computer. This has no effect on the results.
-uMAM8 and -m100 strive for high sensitivity, but use a lot of memory and run time. To go much faster, omit -m100. To halve the memory use and run time, change MAM8 to MAM4.
--sample-number=5000 makes last-train use more samples of genome2, for fear that most of genome2 lacks similarity to genome1. For the same reason, -D1e9 is used with last-train, to avoid weak chance similarities more strictly.
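To see how the tunable options fit into the recipe above, here is a dry-run sketch that only prints the four commands for a chosen thread count and seed scheme, without running LAST. The helper name `print_recipe` is hypothetical (it is not part of LAST); the commands themselves are copied from the recipe above.

```shell
#!/bin/sh
# Dry run: print the recipe's commands, substituting the thread count (-P)
# and the lastdb seed scheme (-u). Nothing is executed.
# print_recipe is a hypothetical helper name used for illustration.
print_recipe() {
    threads=$1   # number of threads: affects speed, not results
    seed=$2      # MAM8 as above, or MAM4 for about half the memory and run time
    genome1=$3
    genome2=$4
    echo "lastdb -P$threads -u$seed -c myDB $genome1"
    echo "last-train -P$threads --revsym -D1e9 --sample-number=5000 myDB $genome2 > my.train"
    echo "lastal -P$threads -D1e9 -m100 --split-f=MAF+ -p my.train myDB $genome2 > many-to-one.maf"
    echo "last-split -r many-to-one.maf > one-to-one.maf"
}

# Example: 4 threads and the lighter MAM4 seed scheme
print_recipe 4 MAM4 genome1.fa genome2.fa
```

To go much faster, the `-m100` in the printed lastal line can be dropped, as noted above.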

2022 alignments

The 2022 directory has various alignments of these genomes:

Genome name	Animal	Source	Assembly name (if different)
allMis28112v4	alligator	NCBI	ASM28112v4
cerSim1	rhinoceros	UCSC
chrPic3.0.3	turtle	NCBI	Chrysemys_picta_bellii-3.0.3
equCab3	horse	UCSC
hg38	human	UCSC	hg38.analysisSet
mOrnAna1.pri.v4	platypus	NCBI

They were made with LAST version 1411, using the recipe below under "2021 alignments".

2021 alignments

The 2021 directory has various alignments of these genomes:

Genome name	Animal	Source	Assembly name (if different)
allMis28112v4	alligator	NCBI	ASM28112v4
Bfl_VNyyK	lancelet	NCBI
calMil1	chimaera	UCSC
chrPic3.0.3	turtle	NCBI	Chrysemys_picta_bellii-3.0.3
hg38	human	UCSC	hg38.analysisSet
kPetMar1	lamprey	NCBI	kPetMar1.pri
latCha1	coelacanth	UCSC
lepOcu1	gar	NCBI	LepOcu1
xenTro10	frog	NCBI	UCB_Xtro_10.0

They can be replicated by running LAST version >= 1387 like this:

lastdb -P8 -uMAM8 myDB genome1.fa

last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train

lastal -P8 -D1e9 -m100 --split-f=MAF+ -p my.train myDB genome2.fa > many-to-one.maf

last-split -r many-to-one.maf | last-postmask > out.maf

The -P8 option makes it faster by using 8 threads: adjust as appropriate for your computer. This has no effect on the results.
The -uMAM8 and -m100 options strive for high sensitivity, but make the lastal command use much time and memory, e.g. several days and hundreds of gigabytes.
You can trade off multi-threading and memory use (with no effect on results), see here.

2017 alignments

Warning: these recipes were for an older version of LAST.

Since LAST version 1205, -R01 has no effect and can be omitted (because it's the default).
For LAST version >= 1180, it's best to add option -fMAF+ to the first (many-to-one) last-split. (In older versions, -fMAF+ was the default.)
Since LAST version 983, last-split option -m1 has no effect and can be omitted (because it's the default).

2017 human-ape alignments

The human genome (hg38) was aligned to chimp (panTro5) and gorilla (gorGor5), as follows. This alignment recipe is very accurate-but-slow. A faster recipe would mask repeats during alignment, and/or omit -m50.

First, an "index" of the human genome was prepared, suitable for comparing it to highly-similar sequences:

lastdb -P0 -uNEAR -R01 hg38-NEAR hg38_no_alt_analysis_set.fa

Then, substitution and gap frequencies were determined:

last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-NEAR panTro5.fa > hg38-panTro5.mat

Human-chimp parameters: hg38-panTro5.mat
Human-gorilla parameters: hg38-gorGor5.mat

Next, many-to-one ape-to-human alignments were made:

lastal -m50 -E0.05 -C2 -p hg38-panTro5.mat hg38-NEAR panTro5.fa | last-split -m1 > hg38-panTro5-1.maf

The above command was the slowest step (3 CPU-weeks). You can "easily" parallelize it, by processing each sequence within panTro5.fa separately (in parallel). But each process uses quite a lot of memory, so take care that multiple parallel runs don't exceed your memory.
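One way to prepare for the per-sequence parallelization described above is to split the multi-sequence FASTA file into one file per sequence. This is a minimal sketch using awk, not the method used for these alignments: `demo.fa` and the `parts` directory are hypothetical names, the sequences are made up, and it assumes sequence names contain no `/` characters.

```shell
#!/bin/sh
# Tiny stand-in for panTro5.fa (made-up sequences, for illustration only)
printf '>chr1\nACGT\nACGT\n>chr2\nGGCC\n' > demo.fa

# Write each sequence to its own file, named after the first word of its header.
# On each header line, close the previous output file and open the next one.
mkdir -p parts
awk '/^>/ { if (out) close(out); out = "parts/" substr($1, 2) ".fa" }
     { print > out }' demo.fa

ls parts
```

Each resulting file could then be given to the lastal and last-split pipeline in place of panTro5.fa, running only as many processes at once as your memory allows.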

Human-chimp many-to-one alignments: hg38-panTro5-1.maf.gz
Human-gorilla many-to-one alignments: hg38-gorGor5-1.maf.gz

Next, one-to-one ape-to-human alignments were made:

maf-swap hg38-panTro5-1.maf |
awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1' |
last-split -m1 |
maf-swap > hg38-panTro5-2.maf

The awk command prepends the assembly name to each chromosome name (e.g. chr7 -> hg38.chr7).
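To illustrate what the awk step does, the sketch below applies it to a single made-up alignment in the layout that maf-swap would produce (ape sequence first, human second). The file names `demo.maf` and `renamed.maf`, and all coordinates and scores in them, are invented for illustration.

```shell
#!/bin/sh
# One made-up MAF alignment block: chimp on the first "s" line, human on the second
printf '%s\n' \
    'a score=100 mismap=1e-10' \
    's chr2A 100 8 + 1000 ACGTACGT' \
    's chr2 200 8 + 2000 ACGTACGT' \
    '' > demo.maf

# Odd-numbered "s" lines get the prefix "panTro5.", even-numbered ones "hg38.".
# All other lines are printed unchanged by the trailing "1".
awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1' demo.maf > renamed.maf

cat renamed.maf
```

Because awk rebuilds a line when a field is assigned, the columns of the modified `s` lines come out separated by single spaces.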

Human-chimp one-to-one alignments: hg38-panTro5-2.maf.gz
Human-gorilla one-to-one alignments: hg38-gorGor5-2.maf.gz

Finally, simple-sequence alignments were discarded, the alignments were converted to tabular format, and alignments with error probability > $10^{-5}$ were discarded:

last-postmask hg38-panTro5-2.maf |
maf-convert -n tab |
awk -F'=' '$2 <= 1e-5' > hg38-panTro5.tab

Human-chimp tabular alignments: hg38-panTro5.tab.gz (dotplot)
Human-gorilla tabular alignments: hg38-gorGor5.tab.gz (dotplot)
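The final awk filter in the recipe above can be tried on two made-up tabular lines. This sketch assumes each alignment line ends with a `mismap=` field and contains no other `=` character; the file names `demo.tab` and `kept.tab` and all values in them are invented for illustration.

```shell
#!/bin/sh
# Two made-up tab-format alignment lines, differing only in score and mismap
printf '100\thg38.chr2\t200\t8\t+\t2000\tpanTro5.chr2A\t100\t8\t+\t1000\t8\tmismap=1e-10\n' > demo.tab
printf '50\thg38.chr2\t300\t8\t+\t2000\tpanTro5.chr2A\t500\t8\t+\t1000\t8\tmismap=0.01\n' >> demo.tab

# Split each line at "=", so that field 2 is the error probability,
# and keep only lines where it is at most 1e-5
awk -F'=' '$2 <= 1e-5' demo.tab > kept.tab

cat kept.tab
```

Only the line with mismap=1e-10 passes the filter; the line with mismap=0.01 is dropped.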

2017 human-mouse alignments

The human genome (hg38) was aligned to mouse (mm10). This alignment recipe is even slower and more sensitive.

First, an "index" of the human genome was prepared, suitable for comparing it to less-similar sequences:

lastdb -P0 -uMAM4 -R01 hg38-MAM4 hg38_no_alt_analysis_set.fa

Then, substitution and gap frequencies were determined:

last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-MAM4 mm10.fa > hg38-mm10.mat

Human-mouse parameters: hg38-mm10.mat
Human-dog parameters: hg38-canFam3.mat

Next, many-to-one mouse-to-human alignments were made:

lastal -m100 -E0.05 -C2 -p hg38-mm10.mat hg38-MAM4 mm10.fa | last-split -m1 > hg38-mm10-1.maf

Human-mouse many-to-one alignments: hg38-mm10-1.maf.gz

Finally, one-to-one MAF alignments, and high-confidence tabular alignments, were made in the same way as above.

Human-mouse one-to-one alignments: hg38-mm10-2.maf.gz
Human-mouse tabular alignments: hg38-mm10.tab.gz (dotplot)
