last-genome-alignments

June 2, 2023

Here are some pair-wise genome alignments made with LAST.

2023 alignments

The 2023 directory has alignments of these genomes:

| Code | Phylum | Animal | Scientific name | Genome |
|---|---|---|---|---|
| helRob | annelid | jawless leech | Helobdella robusta | GCF_000326865.1 |
| hirMed | annelid | medicinal leech | Hirudo medicinalis | GCA_011800805.1 |
| lamSat | annelid | Satsuma tubeworm | Lamellibrachia satsuma | GCA_022478865.1 |
| oweFus | annelid | shingle tubeworm | Owenia fusiformis | GCA_903813345.2 |
| armNas | arthropod | woodlouse | Armadillidium nasatum | GCA_009176605.1 |
| cenScu | arthropod | scorpion | Centruroides sculpturatus | GCF_000671375.1 |
| droMel | arthropod | fruit fly | Drosophila melanogaster | GCF_000001215.4 |
| gloMae | arthropod | millipede | Glomeris maerens | GCA_023279145.1 |
| homAme | arthropod | lobster | Homarus americanus | GCF_018991925.1 |
| limPol | arthropod | horseshoe crab | Limulus polyphemus | GCF_000517525.1 |
| macAtr | arthropod | robber fly | Machimus atricapillus | GCA_933228815.1 |
| strMar | arthropod | centipede | Strigamia maritima | GCA_000239455.1 |
| linAna | brachiopod | shamisen shell | Lingula anatina | GCF_001039355.2 |
| asyLuc | chordate | Bahama lancelet | Asymmetron lucayanum | GCA_001663935.1 |
| braFlo | chordate | lancelet | Branchiostoma floridae | GCF_000003815.2 |
| calMil | chordate | chimaera | Callorhinchus milii | GCF_018977255.1 |
| eptBur | chordate | hagfish | Eptatretus burgeri | GCA_024346535.1 |
| homSap | chordate | human | Homo sapiens | hg38_no_alt_analysis_set |
| petMar | chordate | lamprey | Petromyzon marinus | GCF_010993605.1 |
| acrMil | cnidaria | stony coral | Acropora millepora | GCF_013753865.1 |
| actTen | cnidaria | waratah anemone | Actinia tenebrosa | GCF_009602425.1 |
| epiPla | cnidaria | zoanthid | Epizoanthus planus | GCA_025388665.1 |
| nemVec | cnidaria | starlet sea anemone | Nematostella vectensis | GCF_932526225.1 |
| horCal | ctenophore | sea gooseberry | Hormiphora californensis | GCA_020137815.1 |
| mneLei | ctenophore | sea walnut | Mnemiopsis leidyi | GCA_000226015.1 |
| apoJap | echinoderm | sea cucumber | Apostichopus japonicus | GCA_002754855.1 |
| strPur | echinoderm | sea urchin | Strongylocentrotus purpuratus | GCF_000002235.5 |
| ptyFla | hemichordate | Hawaiian acorn worm | Ptychodera flava | GCA_001465055.1 |
| sacKow | hemichordate | acorn worm | Saccoglossus kowalevskii | GCF_000003605.2 |
| aplCal | mollusc | sea hare | Aplysia californica | GCF_000002075.1 |
| craGig | mollusc | oyster | Crassostrea gigas | GCF_902806645.1 |
| halRuf | mollusc | abalone | Haliotis rufescens | GCF_023055435.1 |
| limBul | mollusc | sea butterfly | Limacina bulimoides | GCA_009866985.1 |
| mizYes | mollusc | scallop | Mizuhopecten yessoensis | GCF_002113885.1 |
| octBim | mollusc | octopus | Octopus bimaculoides | GCF_001194135.2 |
| phoLin | mollusc | top snail | Phorcus lineatus | GCA_921293015.1 |
| watSci | mollusc | firefly squid | Watasenia scintillans | GCA_015471945.1 |
| phoOva | phoronid | horseshoe worm | Phoronis ovalis | GCA_028565635.1 |

The alignments were made with LAST version 1453, like this:

lastdb -P8 -uMAM8 -c myDB genome1.fa

last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train

lastal -P8 -D1e9 -m100 --split-f=MAF+ -p my.train myDB genome2.fa > many-to-one.maf

last-split -r many-to-one.maf > one-to-one.maf

This is currently the recommended way to compare distantly-related genomes, where most of the DNA lacks similarity.

  • -P8 makes it faster by using 8 threads: adjust as suitable for your computer. This has no effect on the results.

  • -uMAM8 and -m100 strive for high sensitivity, but use a lot of memory and run time. To go much faster, omit -m100. To halve the memory use and run time, change MAM8 to MAM4.

  • --sample-number=5000 makes last-train use more samples of genome2, for fear that most of genome2 lacks similarity to genome1. For the same reason, -D1e9 is used with last-train, so that it more strictly excludes weak similarities that arise by chance.
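
The bullets above suggest adjusting the thread count and the seeding scheme to fit your computer. As a minimal sketch, the hypothetical helper below (print_recipe is not part of LAST) only prints the four commands of the recipe with those two values substituted, so they can be reviewed before being run. It assumes file names without spaces, and it does not invoke LAST itself.

```shell
# Hypothetical helper (not part of LAST): print the 2023 recipe for one genome pair.
# Arguments: reference FASTA, query FASTA, thread count (default 8),
# seeding scheme (default MAM8; MAM4 roughly halves memory use and run time).
print_recipe() {
    ref=$1 qry=$2 threads=${3:-8} seed=${4:-MAM8}
    printf 'lastdb -P%s -u%s -c myDB %s\n' "$threads" "$seed" "$ref"
    printf 'last-train -P%s --revsym -D1e9 --sample-number=5000 myDB %s > my.train\n' "$threads" "$qry"
    printf 'lastal -P%s -D1e9 -m100 --split-f=MAF+ -p my.train myDB %s > many-to-one.maf\n' "$threads" "$qry"
    printf 'last-split -r many-to-one.maf > one-to-one.maf\n'
}

# Example: 4 threads and the lighter MAM4 seeding scheme.
print_recipe genome1.fa genome2.fa 4 MAM4
```

Once the printed commands look right, they could be saved to a file and run with sh.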

2022 alignments

The 2022 directory has various alignments of these genomes:

| Genome name | Animal | Source | Assembly name (if different) |
|---|---|---|---|
| allMis28112v4 | alligator | NCBI | ASM28112v4 |
| cerSim1 | rhinoceros | UCSC | |
| chrPic3.0.3 | turtle | NCBI | Chrysemys_picta_bellii-3.0.3 |
| equCab3 | horse | UCSC | |
| hg38 | human | UCSC | hg38.analysisSet |
| mOrnAna1.pri.v4 | platypus | NCBI | |

They were made with LAST version 1411, using the recipe below under "2021 alignments".

2021 alignments

The 2021 directory has various alignments of these genomes:

| Genome name | Animal | Source | Assembly name (if different) |
|---|---|---|---|
| allMis28112v4 | alligator | NCBI | ASM28112v4 |
| Bfl_VNyyK | lancelet | NCBI | |
| calMil1 | chimaera | UCSC | |
| chrPic3.0.3 | turtle | NCBI | Chrysemys_picta_bellii-3.0.3 |
| hg38 | human | UCSC | hg38.analysisSet |
| kPetMar1 | lamprey | NCBI | kPetMar1.pri |
| latCha1 | coelacanth | UCSC | |
| lepOcu1 | gar | NCBI | LepOcu1 |
| xenTro10 | frog | NCBI | UCB_Xtro_10.0 |

They can be replicated by running LAST version >= 1387 like this:

lastdb -P8 -uMAM8 myDB genome1.fa

last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train

lastal -P8 -D1e9 -m100 --split-f=MAF+ -p my.train myDB genome2.fa > many-to-one.maf

last-split -r many-to-one.maf | last-postmask > out.maf

  • The -P8 option makes it faster by using 8 threads: adjust as appropriate for your computer. This has no effect on the results.

  • The -uMAM8 and -m100 options strive for high sensitivity, but make the lastal command use much time and memory, e.g. several days and hundreds of gigabytes.

  • You can trade off multi-threading and memory use (with no effect on results), see here.

2017 alignments

Warning: these recipes were for an older version of LAST.

  • Since LAST version 1205, -R01 has no effect and can be omitted (because it's the default).

  • For LAST version >= 1180, it's best to add option -fMAF+ to the first (many-to-one) last-split. (In older versions, -fMAF+ was the default.)

  • Since LAST version 983, last-split option -m1 has no effect and can be omitted (because it's the default).

2017 human-ape alignments

The human genome (hg38) was aligned to chimp (panTro5) and gorilla (gorGor5), as follows. This alignment recipe is very accurate-but-slow. A faster recipe would mask repeats during alignment, and/or omit -m50.

First, an "index" of the human genome was prepared, suitable for comparing it to highly-similar sequences:

lastdb -P0 -uNEAR -R01 hg38-NEAR hg38_no_alt_analysis_set.fa

Then, substitution and gap frequencies were determined:

last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-NEAR panTro5.fa > hg38-panTro5.mat

Next, many-to-one ape-to-human alignments were made:

lastal -m50 -E0.05 -C2 -p hg38-panTro5.mat hg38-NEAR panTro5.fa | last-split -m1 > hg38-panTro5-1.maf

The above command was the slowest step (3 CPU-weeks). You can "easily" parallelize it, by processing each sequence within panTro5.fa separately (in parallel). But each process uses quite a lot of memory, so take care that multiple parallel runs don't exceed your memory.
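
One way to prepare such a per-sequence run is to split the query FASTA into one file per sequence. The sketch below does this with awk on a toy two-sequence file; the file names and the temporary directory are made up for illustration. The lastal step is shown only as a comment, since it needs LAST and the real data.

```shell
# Toy query FASTA with two sequences (hypothetical data).
workdir=$(mktemp -d)
printf '>chr1\nACGT\nACGT\n>chr2\nGGCC\n' > "$workdir/toy.fa"

# Write each sequence to its own file, named after the first word of its header line.
mkdir "$workdir/pieces"
awk -v dir="$workdir/pieces" '
    /^>/ { if (out) close(out); out = dir "/" substr($1, 2) ".fa" }
    { print > out }
' "$workdir/toy.fa"

pieces=$(ls "$workdir/pieces")
printf '%s\n' "$pieces"

# Each piece could then be aligned independently, for example (not run here):
#   lastal -m50 -E0.05 -C2 -p hg38-panTro5.mat hg38-NEAR chr1.fa | last-split -m1 > chr1.maf
# Limit how many such jobs run at once so that their combined memory fits your machine.
```

The per-sequence MAF files can be concatenated afterwards. The temporary directory in $workdir should be removed when it is no longer needed.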

Next, one-to-one ape-to-human alignments were made:

maf-swap hg38-panTro5-1.maf |
awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1' |
last-split -m1 |
maf-swap > hg38-panTro5-2.maf

The awk command prepends the assembly name to each chromosome name (e.g. chr7 -> hg38.chr7).
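
To see what the awk step does, here is a small sketch that runs it on a toy MAF block with invented coordinates. It assumes the input is ordered as it would be after maf-swap, with the ape sequence first and the human sequence second in each block.

```shell
# Toy MAF block (hypothetical coordinates), ordered as after maf-swap:
# first "s" line = ape sequence, second "s" line = human sequence.
toy_maf='a score=100
s chr7 100 8 + 1000 ACGTACGT
s chr7 200 8 + 2000 ACGTACGT'

# Odd-numbered "s" lines get the panTro5 prefix, even-numbered ones get hg38.
prefixed=$(printf '%s\n' "$toy_maf" |
    awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1')
printf '%s\n' "$prefixed"
```

Assigning to $2 makes awk rebuild each "s" line with single spaces between fields, so any column padding in the original MAF is lost; the field content is otherwise unchanged.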

Finally, simple-sequence alignments were discarded, the alignments were converted to tabular format, and alignments with error probability > $10^{-5}$ were discarded:

last-postmask hg38-panTro5-2.maf |
maf-convert -n tab |
awk -F'=' '$2 <= 1e-5' > hg38-panTro5.tab
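
The final awk filter can be tried on toy input. The sketch below assumes each tabular line ends with a single mismap=... field and contains no other "=" sign, so that splitting on "=" leaves the error probability in the second field. The two lines are invented data, not real alignments.

```shell
# Two toy lines in LAST tabular format (hypothetical values), each ending in mismap=...
toy_tab=$(
    printf '1000\thg38.chr7\t100\t8\t+\t2000\tpanTro5.chr7\t200\t8\t+\t1000\t8\tmismap=1e-10\n'
    printf '50\thg38.chr1\t5\t8\t+\t2000\tpanTro5.chr2\t9\t8\t+\t1000\t8\tmismap=0.01\n'
)

# Split on "=": field 2 is the error probability; keep lines where it is at most 1e-5.
kept=$(printf '%s\n' "$toy_tab" | awk -F'=' '$2 <= 1e-5')
printf '%s\n' "$kept"
```

Only the first toy line survives, because 0.01 exceeds the threshold.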

2017 human-mouse alignments

The human genome (hg38) was aligned to mouse (mm10). This alignment recipe is even slower and more sensitive.

First, an "index" of the human genome was prepared, suitable for comparing it to less-similar sequences:

lastdb -P0 -uMAM4 -R01 hg38-MAM4 hg38_no_alt_analysis_set.fa

Then, substitution and gap frequencies were determined:

last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-MAM4 mm10.fa > hg38-mm10.mat

Next, many-to-one mouse-to-human alignments were made:

lastal -m100 -E0.05 -C2 -p hg38-mm10.mat hg38-MAM4 mm10.fa | last-split -m1 > hg38-mm10-1.maf

Finally, one-to-one MAF alignments, and high-confidence tabular alignments, were made in the same way as above.
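
As a sketch of what "the same way" might look like, the snippet below takes the human-ape post-processing commands as plain text and replaces panTro5 with mm10. It only prints the resulting commands and does not run LAST. The output file names (hg38-mm10-2.maf, hg38-mm10.tab) follow from that substitution and are an assumption, not names confirmed by the original recipe.

```shell
# The human-ape post-processing commands, held as text (not executed here).
ape_steps=$(cat <<'EOF'
maf-swap hg38-panTro5-1.maf |
awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1' |
last-split -m1 |
maf-swap > hg38-panTro5-2.maf
last-postmask hg38-panTro5-2.maf |
maf-convert -n tab |
awk -F'=' '$2 <= 1e-5' > hg38-panTro5.tab
EOF
)

# Substitute the mouse assembly name everywhere the ape assembly name appears.
mouse_steps=$(printf '%s\n' "$ape_steps" | sed 's/panTro5/mm10/g')
printf '%s\n' "$mouse_steps"
```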