Today’s data

October 14, 2021 · View on GitHub



In this session you will learn how to use low-coverage whole genome data to do:

  • Principal Components Analysis (PCA)
  • Admixture analysis



Today’s data

As outlined in the previous exercises, we are working with low-coverage NGS data for 60 Atlantic silversides from the following 4 populations:

These populations have been previously studied in Therkildsen et al. 2019 and Wilder et al. 2020, and cover the entire distribution range of the species.

Our NGS data are in bam format and span a 2Mb region on chromosome 24. The interesting aspect about chromosome 24 is that it harbours a large polymorphic inversion that differs in its frequency across populations. The test dataset spans one breakpoint of this inversion (1Mb up and downstream).

In this tutorial, we want to use these data to examine the population structure of Atlantic silverside along its distribution range using principal components analysis (PCA) and admixture analysis. In addition, we will use principal components analysis and ancestry analyses to assess which populations most likely contain alternative inversion karyotypes.

For this practical, we will be using ANGSD (Analysis of Next Generation Sequencing Data), ngsAdmix and PCAngsd.



Initial preparation

Again, make sure you have cloned this GitHub repository to your Linux server, ideally to your home directory, and change your current working directory to the tutorial3_ld_popstructure folder. Set this path as your BASEDIR. For example:


BASEDIR=~/lcwgs-guide-tutorial/tutorial3_ld_popstructure/
cd $BASEDIR
ls

Then, create a new folder named results in the BASEDIR for storing our results.


mkdir results

And then we set all the rest of the paths:


REF=$BASEDIR/reference/Ref.fa
ANGSD=/programs/angsd0.930/angsd/angsd
NGSADMIX=/programs/NGSadmix/NGSadmix ## This is the path to NGSadmix on Cornell BioHPC servers. Make sure that you change this when running on a different server.
SAMTOOLS=samtools ## This is the path to samtools on Cornell BioHPC servers. Make sure that you change this when running on a different server.
PCANGSD=/programs/pcangsd-0.98/pcangsd.py

We will also need to index the reference file using samtools


$SAMTOOLS faidx $REF



Principal components analysis (PCA)

To perform PCA with low-coverage NGS data, we have to infer the genetic covariance matrix, which can be estimated in different ways. Here we will estimate the covariance matrix using single-read sampling in ANGSD as discussed during the lecture, but will also show you how to estimate the covariance matrix using PCAngsd

The first question is what our input dataset should look like?

We only want to focus on variant sites in our population structure analyses. As we learned yesterday, we could just perform SNP calling in ANGSD and get a list of all variant sites in our dataset and provide ANGSD this list of variant sites in future runs using the -sites option.

Here is some example code to illustrate how we would do this. WE ARE NOT RUNNING THIS CODE TODAY - JUST READ OVER IT, DON’T COPY AND RUN IT.


# cd $BASEDIR
# $ANGSD -b $BASEDIR'/sample_lists/ALL_bams.txt' -anc $REF -out $BASEDIR'/results/MME_SNPs' \
# -minMapQ 20 -minQ 20 -doMaf 1 -minMaf 0.05 -SNP_pval 2e-6 \
# -GL 1 -doGlf 2 -doMajorMinor 1 -doPost 1

We could then extract a list of all variant sites from the minor allele frequency file:


# gunzip -c $BASEDIR'/results/MME_SNPs.mafs.gz' | cut -f 1,2,3,4 | tail -n +2 > Global_SNPList_MME_SNPs.txt

And then we have to index our SNP list so that ANGSD can read it:


# $ANGSD sites index $BASEDIR'/results/Global_SNPList_MME_SNPs.txt'

However, in general it is good practice to perform LD pruning or at least thinning to reduce the impact of non-independent SNP clusters, e.g. large LD clusters in inversions, on the inferred population structure. In this practical, we will perform the population structure analysis using an LD-pruned SNP dataset that you prepared earlier today. Later on we will use the full SNP dataset. The full SNP dataset will help us to understand how the inversion karyotype is distributed in our dataset and will highlight the impact of large LD clusters on the inferred structure.


PCA with LD-pruned SNP dataset

1. Estimating the covariance matrix using a single read sampling approach:

Using the list of LD-pruned variant sites and the code shown below, we can estimate a covariance matrix using single-read sampling for LD-pruned SNPs using ANGSD.

Let’s specify our SNP list and index it:


SNPLIST=$BASEDIR/ngsld/LDpruned_snps.list
$ANGSD sites index $SNPLIST

$ANGSD -b $BASEDIR'/sample_lists/ALL_bams.txt' -anc $REF -out $BASEDIR'/results/MME_ANGSD_PCA_LDpruned' \
    -GL 1 -doGlf 2 -doMajorMinor 3 -doMAF 1 -doPost 1 -doIBS 1 -doCounts 1 -doCov 1 -makeMatrix 1 -sites $SNPLIST
    

At the same time, we will also output the genotype likelihoods for these variant sites in beagle likelihood file format (beagle.gz) by specifying the -doGlf 2 option. This will be used as input for estimating the covariance matrix using PCAngsd (below)


After angsd finished running, we can see that it produced a range of different output files:


ls results/
MME_ANGSD_PCA_LDpruned.arg  
MME_ANGSD_PCA_LDpruned.beagle.gz  
MME_ANGSD_PCA_LDpruned.covMat  
MME_ANGSD_PCA_LDpruned.ibs.gz  
MME_ANGSD_PCA_LDpruned.ibsMat  
MME_ANGSD_PCA_LDpruned.mafs.gz

For the PCA, we are most interested in the .covMat file, which contains the covariance matrix. The other files are an IBS matrix (.ibsMat) that could be used for a multidimensional scaling analysis (MDS), a file with the minor allele frequencies (.mafs.gz) and the beagle.gz genotype likelihood file.


We can look at the covariance matrix in R, where we will also perform the PCA using the eigen function.

We have to load the covariance matrix into R and then we can optionally provide population assignments for each individual (rows in same order as input bam file list -b $BASEDIR/sample_lists/ALL_bams.txt). We recommend that you use R-Studio server for this, but running R on command line also work (you can start R in the server by typing R, and end the session by typing quit())

## load the tidyverse package
library(tidyverse)

## define your project directory
basedir <- "~/lcwgs-guide-tutorial/tutorial3_ld_popstructure/" # Make sure to edit this to match your $BASEDIR

#Load the covariance matrix. Don't forget to change the file path if you have downloaded the data to your own computer.
cov <- as.matrix(read.table(paste0(basedir, "results/MME_ANGSD_PCA_LDpruned.covMat"), header = F))

#We will also add a column with population assingments
pop <- c("JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA"
         ,"PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY"
         ,"MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS"
         ,"MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU")

mme.pca <- eigen(cov) #perform the pca using the eigen function. 

We can then extract the eigenvectors from the pca object and format them into a dataframe for plotting, e.g. using ggplot().

eigenvectors = mme.pca$vectors #extract eigenvectors 
pca.vectors = as_tibble(cbind(pop, data.frame(eigenvectors))) #combine with our population assignments

pca = ggplot(data = pca.vectors, aes(x=X1, y=X2, colour = pop)) + geom_point()

ggsave(filename = paste0(basedir, "/results/pca_LDpruned_plot.pdf"), plot = pca) #change file path if data on your own computer

Additionally, we can extract the eigenvalues for each eigenvector, and can then estimate the variance explained for each eigenvector (e.g. here for PC1 to PC4):

pca.eigenval.sum = sum(mme.pca$values) #sum of eigenvalues
varPC1 <- (mme.pca$values[1]/pca.eigenval.sum)*100 #Variance explained by PC1
varPC2 <- (mme.pca$values[2]/pca.eigenval.sum)*100 #Variance explained by PC2
varPC3 <- (mme.pca$values[3]/pca.eigenval.sum)*100 #Variance explained by PC3
varPC4 <- (mme.pca$values[4]/pca.eigenval.sum)*100 #Variance explained by PC4

Question

How strongly are the four different populations separated? And how much variation do the first to PCs explain?


2. Alternative approach: Covariance matrix estimation with PCAngsd

The covariance matrix can also be inferred from genotype likelihoods using PCAangsd. PCAngsd takes as input genotype likelihoods in beagle format, which we generated in the step before using the -doGLF 2 option.

PCAngsd provides a multitude of different settings, described here. We won’t change any of the settings here and only use the default settings, which are sufficient in most cases.

We provide the path to the input file using the -beagle option, which also tells PCAngsd that we are working with a beagle file. The output path and output name is provided using the -o option.

python3 $PCANGSD -beagle $BASEDIR/results/MME_ANGSD_PCA_LDpruned.beagle.gz -o $BASEDIR/results/PCAngsd_LDpruned_covmat

Optional We can perform the principal components analysis and plot PC1 vs PC2 the same way we did before.

#Load the RcppCNPy package
library(RcppCNPy)

#Define basedir
basedir <- "~/lcwgs-guide-tutorial/tutorial3_ld_popstructure/" # Make sure to edit this to match your $BASEDIR
setwd(basedir)
#Load the covariance matrix
cov <- npyLoad("results/PCAngsd_LDpruned_covmat.cov.npy")

#We will also add a column with population assingments
pop <- c("JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA"
         ,"PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY"
         ,"MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS"
         ,"MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU")
     
mme.pca <- eigen(cov) #perform the pca using the eigen function. 

eigenvectors = mme.pca$vectors #extract eigenvectors 
pca.vectors = as_tibble(cbind(pop, data.frame(eigenvectors))) #combine with our population assignments

pca = ggplot(data = pca.vectors, aes(x=X1, y=X2, colour = pop)) + geom_point()

ggsave(filename = paste0(basedir, "/results/pca_LDpruned_pcangsd_plot.pdf"), plot = pca) #change file path if data on your own computer

pca.eigenval.sum = sum(mme.pca$values) #sum of eigenvalues
varPC1 <- (mme.pca$values[1]/pca.eigenval.sum)*100 #Variance explained by PC1
varPC2 <- (mme.pca$values[2]/pca.eigenval.sum)*100 #Variance explained by PC2
varPC3 <- (mme.pca$values[3]/pca.eigenval.sum)*100 #Variance explained by PC3
varPC4 <- (mme.pca$values[4]/pca.eigenval.sum)*100 #Variance explained by PC4

Question

Do the inferred population structures differ between the two approaches? How different are the inferred variances explained by PC1 and PC2?

DISCLAIMER: The inferred patterns and differences between the results will be strongly impacted by the small size of our test region.



Optional: Compare the results to a PCA based on all SNPs

While it is best practice to perform a PCA based on an LD-pruned SNP dataset, PCA are often performed based on full SNP datasets. If you are interested and have time, you could compare the PCA for LD-pruned SNPs to one that was performed for all SNPs in the dataset without LD-pruning.

For this, we will only estimate the covariance matrix using single read sampling in ANGSD. SNPs can be inferred based on genotype likelihoods (see day 2) at the same time as inferring the covariance matrix (-doIBS 1 -doCounts 1 -doCov 1 -makeMatrix 1) by providing the -SNP_pval option. SNPs should be restricted to more common variants with minor allele frequencies of at least 5% using the -minMAF 0.05 option.

$ANGSD -b $BASEDIR/sample_lists/ALL_bams.txt -anc $REF -out $BASEDIR'/results/MME_ANGSD_PCA' \
    -minMapQ 20 -minQ 20 -doMaf 1 -minMaf 0.05 -SNP_pval 2e-6 \
    -GL 1 -doGlf 2 -doMajorMinor 1 -doPost 1 \
    -doIBS 1 -doCounts 1 -doCov 1 -makeMatrix 1 -P 4

Again, we perform the principal components analysis using the eigen function in R:

library(tidyverse)
#Define basedir
basedir <- "~/lcwgs-guide-tutorial/tutorial3_ld_popstructure/" # Make sure to edit this to match your $BASEDIR

#Load the covariance matrix
cov <- as.matrix(read.table(paste0(basedir, "results/MME_ANGSD_PCA.covMat"), header = F))

#We will also add a column with population assingments
pop <- c("JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA"
         ,"PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY"
         ,"MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS"
         ,"MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU")
     
mme.pca <- eigen(cov) #perform the pca using the eigen function. 

eigenvectors = mme.pca$vectors #extract eigenvectors 
pca.vectors = as_tibble(cbind(pop, data.frame(eigenvectors))) #combine with our population assignments

pca = ggplot(data = pca.vectors, aes(x=X1, y=X2, colour = pop)) + geom_point()

ggsave(filename = paste0(basedir, "results/pca_allSNPs_plot.pdf"), plot = pca) #change file path if data on your own computer

Question:

How do population relationships differ between the on LD-pruned PCA and the full-dataset PCA? Why do the population relationships differ?


Click here for a hint

As you can see the separation between the populations, particularly between JIGA and the other populations, is a lot stronger, without much variation within populations. The reason for that is that JIGA has a different karyotype for the inversion 24 compared to MAQU, MBNS and PANY. Variation along PC2 represents genetic variation within each inversion karyotype.



Question:

What’s with the lonely PANY individual that is intermediate between the two main clusters?




Admixture analysis

In some cases we also want to infer genome-wide admixture proportions for each individuals. Similar to PCA, there are different ways of inferring admixture proportions from genotype likelihoods. Here, we will use ngsAdmix and will also present a way of inferring admixture proportions with PCAngsd. Similar to the PCA, we want to use an LD-pruned SNP dataset for this analysis to reduce the impact of non-independent SNPs on the ancestry inference.


ngsAdmix

ngsAdmix uses a genotype likelihood file in beagle format (same as for PCAngsd) as input, which is specified using the -likes option. In addition, there are a range of parameters that can be adjusted. Here we only set the number of ancestry clusters using the -K option to K=2. In reality, it is advisable to compare different numbers of ancestry clusters by iterating over different values of K.


$NGSADMIX -likes $BASEDIR'/results/MME_ANGSD_PCA_LDpruned.beagle.gz' -K 2 -o $BASEDIR'/results/MME_LDpruned_ngsAdmix_K2_out'

In case the analysis is not converging, one can also increase the maximum number of EM iterations using the -maxiter option.

ngsAdmix produces three different outputs files:

  • A run log containing the log likelihood of the admixture estimates: .log file
  • A file containing the allele frequencies of all ancestral populations (in this case two ancestral clusters): .fopt file
  • A file containing the admixture proportions for each individual: .qopt file

We are mostly interested in the admixture proportions and the log likelihood of the estimates. The log likelihoods can be compared between runs with different values of K to select the most likely number of ancestral clusters (However, this should always be interpreted in a biologically meaningful context)


Question:

What is the likelihood for -K 2?

Click here to expand

cat $BASEDIR/results/MME_LDpruned_ngsAdmix_K2_out.log
Input: lname=results/MME_ANGSD_PCA_LDpruned.beagle.gz nPop=2, fname=(null) qname=(null) outfiles=results/MME_LDpruned_ngsAdmix_K2_out
Setup: seed=1603211682 nThreads=1 method=1
Convergence: maxIter=2000 tol=0.000010 tolLike50=0.100000 dymBound=0
Filters: misTol=0.050000 minMaf=0.050000 minLrt=0.000000 minInd=0
Input file has dim: nsites=247 nind=60
Input file has dim (AFTER filtering): nsites=245 nind=60
    [ALL done] cpu-time used =  0.11 sec
    [ALL done] walltime used =  0.00 sec
best like=-11324.127798 after 93 iterations

Furthermore, we can plot admixture proportions in R:

basedir <- "~/lcwgs-guide-tutorial/tutorial3_ld_popstructure/" # Make sure to edit this to match your $BASEDIR
library(tidyverse) #load the tidyverse package for formatting and plotting
## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'pillar'

## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'tibble'

## Warning: replacing previous import 'lifecycle::last_warnings' by
## 'rlang::last_warnings' when loading 'hms'

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#Load the covariance matrix
admix = read_table(paste0(basedir, "/results/MME_LDpruned_ngsAdmix_K2_out.qopt"), col_names = F)
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   X2 = col_double()
## )
#We will also add a column with population assingments
pop <- c("JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA","JIGA"
         ,"PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY","PANY"
         ,"MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS","MBNS"
         ,"MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU","MAQU")

admix.id = as.data.frame(cbind(pop, admix))
names(admix.id) = c("pop","q1","q2")

pdf(paste0(basedir, "/results/NGSadmix_LDpruned_K2_plot.pdf"))
plot = barplot(t(as.matrix(subset(admix.id, select=q1:q2))), col=c("firebrick","royalblue"), border=NA)
dev.off() 
## png 
##   2

Optional If you want, you can try changing the value for -K and compare the inferred log likelihoods and admixture proportions. Which value of K has the stronger support?



$NGSADMIX -likes $BASEDIR/results/MME_ANGSD_PCA_LDpruned.beagle.gz -K 3 -o $BASEDIR/results/MME_LDpruned_ngsAdmix_K3_out

Click here to expand

cat $BASEDIR/results/MME_LDpruned_ngsAdmix_K3_out.log
Input: lname=results/MME_ANGSD_PCA_LDpruned.beagle.gz nPop=3, fname=(null) qname=(null) outfiles=results/MME_LDpruned_ngsAdmix_K3_out
Setup: seed=1603212799 nThreads=1 method=1
Convergence: maxIter=2000 tol=0.000010 tolLike50=0.100000 dymBound=0
Filters: misTol=0.050000 minMaf=0.050000 minLrt=0.000000 minInd=0
Input file has dim: nsites=247 nind=60
Input file has dim (AFTER filtering): nsites=245 nind=60
    [ALL done] cpu-time used =  1.02 sec
    [ALL done] walltime used =  1.00 sec
best like=-11077.337705 after 750 iterations

Alternative approach: PCAngsd

In addition to ngsAdmix, PCAngsd can also be used to infer admixture proportions. The command is similar to the one used for the PCA with the addition of to a -admix option. The input file for ngsAdmix and PCAngsd is the same, making comparisons relatively easy.

Other than ngsAdmix, PCAngsd can automatically infer the most likely number of ancestry cluster. However, one can also set the number of clusters using the -admix_K option.



python3 $PCANGSD -beagle $BASEDIR/results/MME_ANGSD_PCA_LDpruned.beagle.gz -admix -admix_K 2 -o $BASEDIR/results/MME_PCAngsd_K2_out

Optional If you want to, you can plot these results as well and compare them to the estimates with NGSadmix.