Case study for practicals
October 10, 2021 · View on GitHub
- Case study for practicals
- Initial preparation
- 1. Make sure you’re up to speed on basic shell scripting
- 2. Clone this GitHub repository to your Linux server
- 3. Orient yourself to the formatting of our fastq table and fastq list
- 4. Make sure you’re familiar with
for loopsand how to assign and call variables in bash - 5. Practice using bash
for loopsto iterate over target samples - 6. Define paths to the project directory and programs
- Data processing pipeline
- Examine the raw fastq files
- Adapter clipping
- OPTIONAL: Quality trimming
- Build reference index files
- Map to the reference, sort, and quality filter
- Examine the bam files
- Merge samples that were sequenced in multiple batches
- Deduplicate and clip overlapping read pairs
- Indel realignment (optional)
- Estimate read depth in your bam files
- END OF TUTORIAL 1
In this tutorial, you will learn how to go from raw sequencing files in
fastq format through alignment files in bam format that we can use
for downstream analysis. Along the way, you will perform multiple
quality control (QC) procedures, and will map the short sequences to a
snippet of a reference genome.
Case study for practicals
Throughout this course, you will be working with data from the Atlantic silverside, Menidia menidia, a small estuarine fish.

The Atlantic silverside is distributed along the east coast of North America and shows a remarkable degree of local adaptation in growth rates and a suite of other traits (Conover et al. 2005). You will be exploring patterns of genomic variation across the species range, which spans one of the steepest thermal gradient in the world.
Today’s data
Today, we will work with subsets of two different fastq files from each of three Atlantic silversides used in recent studies of fisheries-induced evolution (Therkildsen et al. 2019) and local adaptation (Wilder et al. 2020). The samples we’ll use today originate from the MAQU and PANY populations shown in this map:
The libraries were prepared as described in Therkildsen and Palumbi (2017) and sequenced with 125bp paired-end reads on an Illumina HiSeq instrument. There are two different fastq files for each individual because our libraries were sequenced in two different sequencing runs, to even out sequence coverage among individuals (as discussed in lecture).
We will map these raw sequence files to a snippet of the sparkling new Atlantic silverside genome (Tigano et al. nearly submitted!). To minimize computational time, we are just working with a small 2 Mb section of chromosome 24 for all the exercises in this course.
References:
Conover, D. O., Arnott, S. A., Walsh, M. R., & Munch, S. B. (2005). Darwinian fishery science: lessons from the Atlantic silverside (Menidia menidia). Canadian Journal of Fisheries and Aquatic Sciences, 62(4), 730–737. http://doi.org/10.1139/f05-069
Therkildsen, N. O., & Palumbi, S. R. (2017). Practical low-coverage genomewide sequencing of hundreds of individually barcoded samples for population and evolutionary genomics in nonmodel species. Molecular Ecology Resources, 17(2), 194–208. http://doi.org/10.1111/1755-0998.12593
Therkildsen, N. O., Wilder, A. P., Conover, D. O., Munch, S. B., Baumann, H., & Palumbi, S. R. (2019). Contrasting genomic shifts underlie parallel phenotypic evolution in response to fishing. Science, 365(6452), 487–490. http://doi.org/10.1126/science.aaw7271
Wilder, A. P., Palumbi, S. R., Conover, D. O., and Therkildsen, N. O. 2020. Footprints of local adaptation span hundreds of linked genes in the Atlantic silverside genome. Evolution Letters 9:177. https://doi.org/10.1002/evl3.189
Initial preparation
1. Make sure you’re up to speed on basic shell scripting
We’ll be working almost exclusively through the command line, so if you have not used shell scripting before or are getting rusty on it, it may be helpful to have a look at a tutorial like this one or a cheat sheet like this one before proceeding to the next step.
2. Clone this GitHub repository to your Linux server
Let’s first get set up and retrieve a copy the data we will be working on, which is contained in this GitHub repository.
cd ~ ## Change this to the directory where you would like to store this GitHub repo
git clone https://github.com/nt246/lcwgs-guide-tutorial.git
# Go into your new tutorial1_data_processing directory and examine its contents
cd lcwgs-guide-tutorial/tutorial1_data_processing
Your own copy of the tutorial1_data_processing directory will be
referred to as BASEDIR in many of the scripts below. Have a look
inside; tutorial1_data_processing contains the following
subdirectories:
-
raw_fastqhas the raw fastq files we’ll be working on today -
sample_listsis for storing sample tables, sample lists, and other small text files -
referencecurrently contains the reference genome file and a list of adapter sequences -
scriptsis for storing scripts
Hint: We move between directories using the
cdcommand and can view the content of directories with thelscommand in the Unix shell.
In addition, you will create the following directories within
tutorial1_data_processing
BASEDIR=~/lcwgs-guide-tutorial/tutorial1_data_processing ## Change this to where tutorial1_data_processing is located on your server
mkdir ${BASEDIR}/adapter_clipped ## For storing your adapter clipped fastq files
mkdir ${BASEDIR}/bam ## For storing your bam (alignment) files
mkdir ${BASEDIR}/fastqc ## For storing your FastQC output
3. Orient yourself to the formatting of our fastq table and fastq list
When we get data files back from the sequencing center, the files often
have obscure names, so we need a data table that let’s us link those
file names to our sample IDs and other information. Often, part of the
fastq file name will be identical among all samples in a run and part of
it will reflect some kind of unique sample identifier (either a name you
supplied or a name given by the sequencing center). As an example, look
in the raw_fastq folder and notice how all the files end in either
_1.fastq.gz or _2.fastq.gz (these are the forward and reverse
sequences) while the first part of the file names differs (the unique
sample identifier). We call the sample identifier the prefix.
Our pipeline is set up to link up these prefix names with sample
details based on a fastq table set up as the example you can find in
sample_lists/fastq_table.tsv.
For our scripts below to work, the sample table has to be a tab deliminated table with the following six columns, strictly in this order:
-
prefixthe prefix of raw fastq file names -
lane_numberlane number; each sequencing lane or batch should be assigned a unique identifier. This is important so that if you sequence a library across multiple different sequencing lanes, you can keep track of which lane/batch a particular set of reads came from (important for accounting for sequencing error patterns or batch effects). -
seq_idsequence ID; this variable is only relevant when different libraries were prepared out of the same sample and were run in the same lane (e.g. if you wanted to include a replicate). In this case, seq_id should be used to distinguish these separate libraries. If you only have a single library prepared from each of your samples (even if you sequence that same library across multiple lanes), you can just put 1 for all samples in this column. -
sample_idsample ID; a unique identifier for each individual sequenced -
populationpopulation name; the population or other relevant grouping variable that the individual belongs to -
data_typedata type; there are only two allowed entries here:pe(for paired-end data) orse(for single end data). We need this in the table because for some of our processing steps, the commands are slightly different for paired-end and single-end data.
It is important to make sure that the combination of sample_id, seq_id, and lane_number is unique for each fastq file.
We’ll also use a second file that we call a fastq list. This is simply a list of prefixes for the samples we want to analyze. Our sample table can contain data for all individuals in our study, but at any given time, we may only want to perform an operation on a subset of them. Like today, in the interest of time, we only want to run 6 sets of fastq files through each processing step.
Have a look at the list we’ll be using in sample_lists/fastq_list.txt
and note that it’s just a list of fastq name prefixes, each on a
separate line and there should be no header in this file.
Activity
Compare the fastq_list.txt to the fastq_table.txt. Which populations
do the samples we’ll be analyzing today originate from?
With our small fastq table and fastq list here, we can easily look this
up manually. But if we have hundreds of samples, that becomes more
cumbersome. Let’s automate the sample lookup with our first for loop.
4. Make sure you’re familiar with for loops and how to assign and call variables in bash
In low-coverage whole genome sequencing datasets, we’ll typically have
data from hundreds of individuals, so we need an efficient way to
process all of these files without having to write a separate line of
code for each file. for loops are a powerful way to achieve this, and
we will be using them in every step of our pipeline, so let’s first take
a moment to make sure we understand the syntax.
A ‘for loop’ is a bash programming language statement which allows code
to be repeatedly executed. If you’ve never worked with for loops in
bash before, it might be helpful to look over a tutorial, e.g. this one
from Software
Carpentry.
Here is a key extract inspired from that tutorial:
The basic syntax of a for loop is as follows
for thing in list_of_things; do # ; is equivalent to an end-of-line. We can alternatively put "do" on its own line
operation_using $thing # Indentation within the loop is not required, but aids legibility
done
When the shell sees the keyword for, it knows to repeat a command (or
group of commands) once for each item in a list. Each time the loop runs
(called an iteration), an item in the list is assigned in sequence to
the variable, and the commands inside the loop are executed, before
moving on to the next item in the list. Inside the loop, we call the
variable’s value by putting $ in front of it. The $ tells the shell
interpreter to treat the variable as a variable name and substitute its
value in its place, rather than treat it as text or an external command.
5. Practice using bash for loops to iterate over target samples
First, let’s just look up the sample IDs. For each prefix in our
fastq_list.txt, we will use grep to extract the relevant line from
the fastq table, use cut to extract the column with sample ID, and
then echo to print the sample ID
BASEDIR=~/lcwgs-guide-tutorial/tutorial1_data_processing ## Change this to where tutorial1_data_processing is located on your server
SAMPLELIST=$BASEDIR/sample_lists/fastq_list.txt # Path to a list of prefixes of the raw fastq files. It should be a subset of the the 1st column of the fastq table.
SAMPLETABLE=$BASEDIR/sample_lists/fastq_table.tsv # Path to a fastq table where the 1st column is the prefix of the raw fastq files. The 4th column is the sample ID.
for SAMPLEFILE in `cat $SAMPLELIST`; do # Loop through each of the prefixes listed in our fastq list
# For each prefix, extract the associated sample ID (column 4) from the table
SAMPLE_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 4`
echo $SAMPLEFILE refers to sample $SAMPLE_ID
done
Exercise
Now change the for loop so it also outputs which population each fastq
file has data for.
See a solution
Click here to expand
for SAMPLEFILE in `cat $SAMPLELIST`; do # Loop through each of the prefixes listed in our sample list
# For each prefix, extract the associated sample ID (column 4) and population ID (column 5) from the table
SAMPLE_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 4`
POPULATION=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 5`
echo $SAMPLEFILE refers to sample $SAMPLE_ID from $POPULATION
done
6. Define paths to the project directory and programs
We need to make sure the server knows where to find the programs we’ll be running and our input and output directories. This will always need to be specified every time we run our scripts in a new login session.
Set the project directory as a variable named BASEDIR
BASEDIR=~/lcwgs-guide-tutorial/tutorial1_data_processing ## Change this to where tutorial1_data_processing is located on your server
Specify the paths to required programs as variables
You will need to install the programs below on your server if they are not already installed, and define their paths as follows. The paths to these program on Cornell’s BioHPC servers are provided below.
FASTQC=/programs/bin/fastqc/fastqc
TRIMMOMATIC=/programs/trimmomatic/trimmomatic-0.39.jar
PICARD=/programs/picard-tools-2.19.2/picard.jar
SAMTOOLS=/programs/bin/samtools/samtools
BOWTIEBUILD=/programs/bin/bowtie2/bowtie2-build
BOWTIE=/programs/bin/bowtie2/bowtie2
BAMUTIL=/programs/bamUtil/bam
The following programs are used for indel realignment and are optional.
GATK=/programs/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar ## Note that GATK-3.7 is needed for indel-realignment, which is no longer suppported in new versions
JAVA=/usr/local/jdk1.8.0_121/bin/java ## An older version of java is required for GATK-3.7 to run on Cornell's BioHPC server
If you will be running these programs on a different system, you
will have to specify the paths to the different programs on that system
(or add them to your $PATH).
Data processing pipeline
Now let’s get started processing the data!
Examine the raw fastq files
fastq file structure
A FASTQ file normally contains four lines per sequence.
- Line 1 contains the sequence identifier, with information on the sequencing run and the cluster. The exact content of this line varies depending on how fastq files are generated from the sequencer.
- Line 2 is the raw sequence.
- Line 3 often consists of a single
+symbol. - Line 4 encodes the quality of each base in the sequence in Line 2 (i.e. the probability of sequencing error in log scale). For most current sequencers, these base qualities are encoded in the Phred33 format, but always check to make sure how your quality scores are encoded.
Now read the code below, guess what it does, and run it on your own. Does it do what you expect it to do? Inspect the output and try to identify the group of four lines for each read.
SAMPLELIST=$BASEDIR/sample_lists/fastq_list.txt # Path to the sample list.
RAWFASTQSUFFIX1=_1.fastq.gz # Suffix to raw fastq files. Use forward reads with paired-end data.
for SAMPLE in `cat $SAMPLELIST`; do
echo $SAMPLE
zcat $BASEDIR'/raw_fastq/'$SAMPLE$RAWFASTQSUFFIX1 | head -n 8
echo ' '
done
Evaluate the overall data quality
With a new batch of data, it is always to good idea to start out by getting an overview of the data quality and look for any signs of quality issues. The FastQC program provides a useful set of diagnostics, so we’ll run it on each on our fastq files to check their quality. In the interest of time, we will only look at three of our fastqs. We’ll loop over each of these and call FastQC on the forward file only (the program only takes the path to the input and the path to where you want the output as parameters). When working with your own data, always make sure to look both at the forward and reverse files because sometimes issues can arise in only one of the read directions.
SAMPLELIST=$BASEDIR/sample_lists/fastq_list.txt # Path to the sample list.
RAWFASTQSUFFIX1=_1.fastq.gz # Suffix to raw fastq files. We'll only look at the forward reads here
for SAMPLE in `cat $SAMPLELIST | head -n 3`; do # The head -n 3 is taking just the first three elements of our fastq list to loop over
$FASTQC $BASEDIR'/raw_fastq/'$SAMPLE$RAWFASTQSUFFIX1 -o $BASEDIR'/fastqc/'
done
If the program ran, you should now see the output (in html format and a
zip file with various files) in your fastqc directory. To view the
.html, use scp or FileZilla to transfer the html output files to your
local machine and open them in a web browser, or use something like
R-Studio server to directory view it.
Question:
Do you notice anything different about the fastQC reports from the three different fastq files?
The libraries for these samples were prepared in different batches. Below are representative Bioanalyzer traces for each of the batches. Which sample do you think came from which batch?


Adapter clipping
When the insert length of a library fragment is shorter than the read length, the sequencer will read into the adapter sequence (as shown below). This means that the end of the read will not be from our actual sample, but will be adapter sequence, which may lead to lower alignment performance and even biases in the result if not removed.

We saw in our FastQC report that we have substantial adapter content in
some of our libraries, so will will need to clip that off. Here, we use
Trimmomatic to clip
the adapter sequence off the ends of reads where they appear. This step
requires us to input the known adapter sequences that we used when
preparing the libraries. In this exercise, the libraries were prepared
using Illumina’s Nextera adapters (sequences listed in
reference/NexteraPE_NT.fa), and we will read those into the ADAPTERS
variable.
Look over the code below. The first block of text specifies which files
we are using as input. Then we start looping over our samples. Within
the loop, the first step is to extract the relevant sample data from our
sample table and assign those as temporary variables. Then we have two
if statements to call Trimmomatic with slightly different parameters
for paired-end and single-end data. Trimmomatic has lots of different
filtering modules that can be useful in different contexts. Here we only
clip sequence that match to our adapter sequence and remove reads that
end up being <40bp after clipping.
SAMPLELIST=$BASEDIR/sample_lists/fastq_list.txt # Path to a list of prefixes of the raw fastq files. It should be a subset of the the 1st column of the sample table.
SAMPLETABLE=$BASEDIR/sample_lists/fastq_table.tsv # Path to a sample table where the 1st column is the prefix of the raw fastq files. The 4th column is the sample ID, the 2nd column is the lane number, and the 3rd column is sequence ID. The combination of these three columns have to be unique. The 6th column should be data type, which is either pe or se.
RAWFASTQDIR=$BASEDIR/raw_fastq/ # Path to raw fastq files.
RAWFASTQSUFFIX1=_1.fastq.gz # Suffix to raw fastq files. Use forward reads with paired-end data.
RAWFASTQSUFFIX2=_2.fastq.gz # Suffix to raw fastq files. Use reverse reads with paired-end data.
ADAPTERS=$BASEDIR/reference/NexteraPE_NT.fa # Path to a list of adapter/index sequences.
## Loop over each sample
for SAMPLEFILE in `cat $SAMPLELIST`; do
## Extract relevant values from a table of sample, sequencing, and lane ID (here in columns 4, 3, 2, respectively) for each sequenced library
SAMPLE_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 4`
POP_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 5`
SEQ_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 3`
LANE_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 2`
SAMPLE_UNIQ_ID=$SAMPLE_ID'_'$POP_ID'_'$SEQ_ID'_'$LANE_ID # When a sample has been sequenced in multiple lanes, we need to be able to identify the files from each run uniquely
## Extract data type from the sample table
DATATYPE=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 6`
## The input and output path and file prefix
RAWFASTQ_ID=$RAWFASTQDIR$SAMPLEFILE
SAMPLEADAPT=$BASEDIR'/adapter_clipped/'$SAMPLE_UNIQ_ID
## Adapter clip the reads with Trimmomatic
# The options for ILLUMINACLIP are: ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>:<minAdapterLength>:<keepBothReads>
# The MINLENGTH drops the read if it is below the specified length in bp
# For definitions of these options, see http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf
if [ $DATATYPE = pe ]; then
java -jar $TRIMMOMATIC PE -threads 1 -phred33 $RAWFASTQ_ID$RAWFASTQSUFFIX1 $RAWFASTQ_ID$RAWFASTQSUFFIX2 $SAMPLEADAPT'_adapter_clipped_f_paired.fastq.gz' $SAMPLEADAPT'_adapter_clipped_f_unpaired.fastq.gz' $SAMPLEADAPT'_adapter_clipped_r_paired.fastq.gz' $SAMPLEADAPT'_adapter_clipped_r_unpaired.fastq.gz' 'ILLUMINACLIP:'$ADAPTERS':2:30:10:1:true MINLENGTH:40'
elif [ $DATATYPE = se ]; then
java -jar $TRIMMOMATIC SE -threads 1 -phred33 $RAWFASTQ_ID$RAWFASTQSUFFIX1 $SAMPLEADAPT'_adapter_clipped_se.fastq.gz' 'ILLUMINACLIP:'$ADAPTERS':2:30:10 MINLENGTH:40'
fi
done
Have a look at the output printed to the screen as we’re iterating over the samples. Note the first time it says
TrimmomaticPE: Started with arguments:
Following this, you will see that actual variable names that were added
to the command in each iteration of our loop (e.g. what
$RAWFASTQ_ID$RAWFASTQSUFFIX1 expanded to (what value was assigned to
this variable)).
Also examine the section that says
ILLUMINACLIP: Using 2 prefix pairs, 8 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Input Read Pairs:
This will show how many reads were removed from our filtering.
Question
The first three samples are the samples for which we examined the FastQC outputs. Do you expect different amount of sequence getting removed from each of these? Is that what you see? Why/Why not?
Discuss first, then you can click here for a hint
The output from Trimmomatic only shows how many full reads get removed,
not how much the reads within the file get truncated. In the library
with lots of adapter, many of the reads will now be shorter, but as long
as they’re still longer than our threshold of 40bp, they will not get
removed. If we wanted to know how much sequence we lost from each fastq,
counting the number of bases lost is more informative than the number of
sequences. It is also always a good idea to check the adapter clipped
fastq files with FastQC to make sure that you did in fact get rid of the
adapter sequence. We don’t have time to do this in class, but if you’re
interested, you can use the fastqc loop above and just change it to run
on the files in your adapter_clipped folder.
OPTIONAL: Quality trimming
As we saw in our FastQC output, the base call quality score tends to
drop off towards the ends of the reads. As we’ll learn more about
tomorrow, probabilistic analysis frameworks, like ANGSD and others
based on genotype likelihoods, can take the base call quality into
account and that way give less weight to a base call that is less
certain.
However, as a conservative measure, we may want to just trim off the
rest of the read if the quality score drops too low over multiple bases.
We can do this with the SLIDINGWINDOW module in
Trimmomatic.
We won’t have time to do that in this practical, but if you’re
interested, you can modify the Trimmomatic code above to also trim off
low-quality bases.
Build reference index files
There are lots of different programs developed for mapping short reads to a reference sequence. We will use the program bowtie2. This program requires a set of reference index files to be able to perform the sequence alignment. So we will start by indexing our reference.
REFERENCE=$BASEDIR/reference/mme_physalia_testdata_chr24.fa # This is a fasta file with the reference genome sequence we will map to
REFBASENAME="${REFERENCE%.*}"
$SAMTOOLS faidx $REFERENCE
java -jar $PICARD CreateSequenceDictionary R=$REFERENCE O=$REFBASENAME'.dict'
$BOWTIEBUILD $REFERENCE $REFBASENAME
Map to the reference, sort, and quality filter
In this step, we align the short reads within each fastq file to the
reference genome using bowtie2. The resulting alignment file, in sam
format, will be converted to a binary format bam for more efficient
storage. Each mapped read will have a mapping quality, which indicates
how confident that mapper is that a read is mapped in the correct
position. The bowtie2
manual defines
it as “a non-negative integer Q = -10 log10 p, where p is an estimate of
the probability that the alignment does not correspond to the read’s
true point of origin.” Accordingly, a mapping quality (or MAPQ) of 10 or
less indicates that there is at least a 1 in 10 chance that the read
truly originated elsewhere, and a MAPQ of 20 indicates at least a 1 in
100 chance.
Here, to only retain reads for which we are reasonably certain have been mapped in the correct place, we will filter out reads with a mapping quality lower than 20, and after that sort the filtered alignment file for easier computation in the next step.
Look over the code and make sure you understand what it’s doing, then copy and run it.
SAMPLELIST=$BASEDIR/sample_lists/fastq_list.txt # Path to a list of prefixes of the raw fastq files. It should be a subset of the the 1st column of the sample table.
SAMPLETABLE=$BASEDIR/sample_lists/fastq_table.tsv # Path to a sample table where the 1st column is the prefix of the raw fastq files. The 4th column is the sample ID, the 2nd column is the lane number, and the 3rd column is sequence ID. The combination of these three columns have to be unique. The 6th column should be data type, which is either pe or se.
FASTQDIR=$BASEDIR/adapter_clipped/ # Path to the directory where fastq file are stored.
FASTQSUFFIX1=_adapter_clipped_f_paired.fastq.gz # Suffix to fastq files. Use forward reads with paired-end data.
FASTQSUFFIX2=_adapter_clipped_r_paired.fastq.gz # Suffix to fastq files. Use reverse reads with paired-end data.
MAPPINGPRESET=very-sensitive # The pre-set option to use for mapping in bowtie2 (very-sensitive for end-to-end (global) mapping [typically used when we have a full genome reference], very-sensitive-local for partial read mapping that allows soft-clipping [typically used when mapping genomic reads to a transcriptome]
REFERENCE=$BASEDIR/reference/mme_physalia_testdata_chr24.fa # Path to reference fasta file and file name
REFNAME=mme_physalia_testdata_chr24 # Reference name to add to output files, e.g. gadMor2
## Loop over each sample
for SAMPLEFILE in `cat $SAMPLELIST`; do
## Extract relevant values from a table of sample, sequencing, and lane ID (here in columns 4, 3, 2, respectively) for each sequenced library
SAMPLE_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 4`
POP_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 5`
SEQ_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 3`
LANE_ID=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 2`
SAMPLE_UNIQ_ID=$SAMPLE_ID'_'$POP_ID'_'$SEQ_ID'_'$LANE_ID # When a sample has been sequenced in multiple lanes, we need to be able to identify the files from each run uniquely
## Extract data type from the sample table
DATATYPE=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 6`
## The input and output path and file prefix
SAMPLETOMAP=$FASTQDIR$SAMPLE_UNIQ_ID
SAMPLEBAM=$BASEDIR'/bam/'$SAMPLE_UNIQ_ID
## Define platform unit (PU), which is the lane number
PU=`grep -P "${SAMPLEFILE}\t" $SAMPLETABLE | cut -f 2`
## Define reference base name
REFBASENAME="${REFERENCE%.*}"
## Map reads to the reference
echo $SAMPLE_UNIQ_ID
# Map the paired-end reads
if [ $DATATYPE = pe ]; then
# We ignore the reads that get orphaned during adapter clipping because that is typically a very small proportion of reads. If a large proportion of reads get orphaned (loose their mate so they become single-end), these can be mapped in a separate step and the resulting bam files merged with the paired-end mapped reads.
$BOWTIE -q --phred33 --$MAPPINGPRESET -p 1 -I 0 -X 1500 --fr --rg-id $SAMPLE_UNIQ_ID --rg SM:$SAMPLE_ID --rg LB:$SAMPLE_ID --rg PU:$PU --rg PL:ILLUMINA -x $REFBASENAME -1 $SAMPLETOMAP$FASTQSUFFIX1 -2 $SAMPLETOMAP$FASTQSUFFIX2 -S $SAMPLEBAM'_'$DATATYPE'_bt2_'$REFNAME'.sam'
# Map the single-end reads
elif [ $DATATYPE = se ]; then
$BOWTIE -q --phred33 --$MAPPINGPRESET -p 1 --rg-id $SAMPLE_UNIQ_ID --rg SM:$SAMPLE_ID --rg LB:$SAMPLE_ID --rg PU:$PU --rg PL:ILLUMINA -x $REFBASENAME -U $SAMPLETOMAP$FASTQSUFFIX1 -S $SAMPLEBAM'_'$DATATYPE'_bt2_'$REFNAME'.sam'
fi
## Convert to bam file for storage (including all the mapped reads)
$SAMTOOLS view -bS -F 4 $SAMPLEBAM'_'$DATATYPE'_bt2_'$REFNAME'.sam' > $SAMPLEBAM'_'$DATATYPE'_bt2_'$REFNAME'.bam'
rm -f $SAMPLEBAM'_'$DATATYPE'_bt2_'$REFNAME'.sam'
## Filter the mapped reads (to onky retain reads with high mapping quality)
# Filter bam files to remove poorly mapped reads (non-unique mappings and mappings with a quality score < 20) -- do we want the quality score filter??
$SAMTOOLS view -h -q 20 $SAMPLEBAM'_'$DATATYPE'_bt2_'$REFNAME'.bam' | $SAMTOOLS view -buS - | $SAMTOOLS sort -o $SAMPLEBAM'_'$DATATYPE'_bt2_'$REFNAME'_minq20_sorted.bam'
done
Have a look at the output printed to the screen. How many of our reads are mapping to the reference? Does it vary between samples? Is there anything that seems weird to you about the stats reported?
Examine the bam files
SAM stands for Sequence Alignment/Map format. BAM is the binary format for sam files (which takes up much less space). It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information. BAM is the binary version of the SAM format.
See the full documentation of the sam file format here or a quick overview of the column descriptors here
Note that we have converted our sam files to bam files. That’s useful
for saving disk space, but because bam files are binary, they are not
human readable. However, we can use the view utility in the program
samtools to convert the
content back to human readable output to we can examine our alignments.
As an example, let’s look at the output for 985_PANY. The following
command can be used to inspect the first eight lines the sorted bam file
for this sample.
$SAMTOOLS view $BASEDIR/bam/985_PANY_1_lane1_pe_bt2_mme_physalia_testdata_chr24_minq20_sorted.bam | head -n 8
Take a few minutes to look at the output and look at the column descriptors to understand its content.
OPTIONAL exercise: Write a loop to print the first three alignments of all the sorted bam files that you generated in the last step. You can use the general template code below as a starting point.
$SAMTOOLS view $SAMPLEBAM'_'$DATATYPE'_bt2_'$REFNAME'_minq20_sorted.bam' | head -n 3
Merge samples that were sequenced in multiple batches
We have now mapped the two separate sets of fastq files for each sample (the separate sets generated in independent sequencing runs), so we also have two bam files for each sample. If the two sequencing runs were performed with aliquots of the same library, we will need to merge the bam files before we remove duplicate sequences because the two sequencing lanes would have been sequencing the same set of molecules (so each may contain duplicate fragments also sequenced in the other). For most downstream analysis, it’s also a lot more convenient to only have a single bam file per individual.
We will merge the two bam files for each individual with samtools merge with the following parameters
$SAMTOOLS merge output.bam input1.bam input2.bam # We replace the output.bam with the name we want to give the output merged bam and the two input names with the names of the bam files we want to merge.
We could write a for loop to get the separate fastq files for each
individual merged. But in this case, we will instead run a shell script
that has a line that calls samtools merge for each individual. You can
find the script merge_bams.sh in the scripts folder. Have a look at
it.
With just three samples, we could quickly copy and edit these three lines to construct this shell script manually. But if we had hundreds of samples, that approach would become error-prone so we want to use a script to automate it. In the interest of time, let’s move forward with the merging script we have already generated, but we provide R-code below that generates this script from the your fastq-level sample table and list, so you don’t need to do this manually if you’re working with your own samples.
Since we’ll be working on bam files rather than fastq files downstream
from this point, we need to list bam IDs rather than fastq IDs in the
sample lists that specify which samples we want to loop over in a
particular pipeline step. The R-code below generates that as well, but
for today, we’ll just use the list that we’ve already added to
$BASEDIR/sample_lists.
The following R code is just provided for your reference. You DON’T need to run it - the output is already generated in your scripts and sample_lists folders
Click here to view the R code
library(tidyverse)
basedir <- "~/tutorial1_data_processing/"
refname <- "mme_physalia_testdata_chr24"
fastq_list <- read_lines(paste0(basedir, "sample_lists/fastq_list.txt"))
fastq_table <- read_tsv(paste0(basedir, "sample_lists/fastq_table.tsv"))
#Subset the table to the samples we're currently interested in analyzing
target_samples <- filter(fastq_table, prefix %in% fastq_list)
## Find all duplicated samples
duplicated_samples <- (target_samples$sample_id)[duplicated(target_samples$sample_id)] %>% unique()
# Write cd
write_lines(c("BASEDIR=\$1", "cd $BASEDIR'/bam'\n"), paste0(basedir, "scripts/merge_bam.sh"))
## Loop through all duplicated samples
for (i in 1:length(duplicated_samples)){
dup_sample_dat <- filter(fastq_table, sample_id==duplicated_samples[i])
## Extract the bam file names from the unmerged sample table
input <- dup_sample_dat %>%
mutate(unmerged_bam = paste(sample_id, population, seq_id, lane_number, data_type, "bt2", refname, "minq20_sorted.bam", sep = "_")) %>%
# We are reconstructing the $SAMPLE_SEQ_ID that is the first part of the separate bam file names
.$unmerged_bam
output <- paste(dup_sample_dat[1,"sample_id"], dup_sample_dat[1, "population"], "merged_bt2", refname, "minq20_sorted.bam", sep = "_")
## Paste together the command line
#merging_script[i+1] <- paste("samtools merge", as.character(output), input[1], input[2], sep = " ")
write_lines(paste("samtools merge", as.character(output), input[1], input[2], sep = " "), paste0(basedir, "scripts/merge_bam.sh"), append = TRUE)
# Also write a target bam list that we'll use for downstream looping over merged bam files
if (i == 1){
write_lines(paste(dup_sample_dat[1,"sample_id"], dup_sample_dat[1, "population"], "merged", sep = "_"), paste0(basedir, "sample_lists/merged_bam_list.txt"))
}
if (i > 1){
write_lines(paste(dup_sample_dat[1,"sample_id"], dup_sample_dat[1, "population"], "merged", sep = "_"), paste0(basedir, "sample_lists/merged_bam_list.txt"), append = TRUE)
}
}
Run the merging script
To execute the merging, run the bash script with the following command:
bash $BASEDIR/scripts/merge_bam.sh $BASEDIR
Check that your merged bam files get generated. As a sanity check, we
can compare the number of lines in our merged and set of unmerged bam
files for each sample with samtools
view and the command
samtools view in.bam | wc -l
Run on our server, for the merged file for sample 985, the command would like like
$SAMTOOLS view $BASEDIR/bam/985_PANY_merged_bt2_mme_physalia_testdata_chr24_minq20_sorted.bam | wc -l
Check the line count in the bam files for the two individual fastqs and see if the numbers add up.
$SAMTOOLS view $BASEDIR/bam/985_PANY_1_lane2_pe_bt2_mme_physalia_testdata_chr24_minq20_sorted.bam | wc -l
$SAMTOOLS view $BASEDIR/bam/985_PANY_1_lane1_pe_bt2_mme_physalia_testdata_chr24_minq20_sorted.bam | wc -l
OPTIONAL exercise: You could write a for loop that will extract the line count for each file.
Discussion question
This merging procedure required some extra effort. Why didn’t we just merge the fastq files for each individual before mapping?
Answer
Click here
Because there may be particular issues associated each sequencing lane
(in particular the base call quality score calibration may vary), so for
some downstream analyses, we need to keep track of what data were
sequenced in what lane. By mapping the fastq files separately, we were
able to add read platform unit PU information to the read group tag in
the bam file while mapping with bowtie2 (see the script). This way we
can continue to keep track of which read came from which lane, and could
even filter our bam file based on this later on, if we ended up wanting
to compare data from different lanes for troubleshooting or other
reasons.
Deduplicate and clip overlapping read pairs
Here, we remove the PCR duplicates and trim the overlapping part of each read pair in pair-end data. It is important to wait to deduplicate until after merging, because PCR duplicates for the same sample may exist in different lanes. We use the Picard Tools MarkDuplicates.
We also want to clip overlapping reads. We will use the BamUtil clipOverlap
BAMLIST=$BASEDIR/sample_lists/merged_bam_list.txt # Path to a list of merged bam files.
REFNAME=mme_physalia_testdata_chr24 # Reference name to add to output files
## Loop over each sample
for SAMPLEBAM in `cat $BAMLIST`; do
## Remove duplicates and print dupstat file
# We used to be able to just specify picard.jar on the CBSU server, but now we need to specify the path and version
java -Xmx60g -jar $PICARD MarkDuplicates I=$BASEDIR'/bam/'$SAMPLEBAM'_bt2_'$REFNAME'_minq20_sorted.bam' O=$BASEDIR'/bam/'$SAMPLEBAM'_bt2_'$REFNAME'_minq20_sorted_dedup.bam' M=$BASEDIR'/bam/'$SAMPLEBAM'_bt2_'$REFNAME'_minq20_sorted_dupstat.txt' VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
## Clip overlapping paired end reads (only necessary for paired-end data, so if you're only running se samples, you can comment this step out)
$BAMUTIL clipOverlap --in $BASEDIR'/bam/'$SAMPLEBAM'_bt2_'$REFNAME'_minq20_sorted_dedup.bam' --out $BASEDIR'/bam/'$SAMPLEBAM'_bt2_'$REFNAME'_minq20_sorted_dedup_overlapclipped.bam' --stats
done
MarkDuplicates has very verbose output - take a look at it to make sure the programs ran and didn’t throw an error.
Next, look at the output from the clipOverlap. Was there a substantial difference in how much sequence got clipped in the three samples, and does that make sense in light of their FastQC reports?
Use head to look at the top lines of the dupstat report for each
sample (in
$BASEDIR/bam/XXX_merged_bt2_mme_physalia_testdata_chr24_minq20_sorted_dupstat.txt
[replace the XXX with the sample [prefix]]). Can you spot where the
duplication rate is reported. Does it vary between our samples?
Indel realignment (optional)
Unlike other variant detector programs (like the GATK Haplotype
Caller
or Freebayes),
ANGSD does not realign
reads during its analysis. Because it can be difficult to distinguish
indels from SNPs at the end of reads if each alignment is considered
separately, indels may interfere with genotype likelihood estimation. We
therefore recommend running your bam files through a program that
realigns reads around indels prior to running ANGSD. The GATK
IndelRealigner
takes all the aligned sequences from all samples in to account to
validate the indels discovered from the mapping process and then
realigns each read locally. This is not a mandatory step and tends to be
very time-consuming if you have a large dataset, but the code is
provided here if you want to give it a try (it takes ~3 minutes). We
will not use these output for the rest of this course though.
Click here to see the GATK IndelRealigner code
BAMLIST=$BASEDIR/sample_lists/bam_list_dedup_overlapclipped.list # Path to a list of merged, deduplicated, and overlap clipped bam files. Full paths should be included. This file has to have a suffix of ".list"
REFERENCE=$BASEDIR/reference/mme_physalia_testdata_chr24.fa # Path to reference fasta file and file name
REFNAME=mme_physalia_testdata_chr24 # Reference name to add to output files
for SAMPLEBAM in `cat $BASEDIR/sample_lists/merged_bam_list.txt`; do
echo $BASEDIR'/bam/'$SAMPLEBAM'_bt2_mme_physalia_testdata_chr24_minq20_sorted_dedup_overlapclipped.bam' >> $BAMLIST
done
## Loop over each sample
cd $BASEDIR/bam/
for SAMPLEBAM in `cat $BAMLIST`; do
$SAMTOOLS index $SAMPLEBAM
done
## Realign around in-dels
# This is done across all samples at once
## Create list of potential in-dels
$JAVA -Xmx40g -jar $GATK \
-T RealignerTargetCreator \
-R $REFERENCE \
-I $BAMLIST \
-o $BASEDIR'/bam/all_samples_for_indel_realigner.intervals' \
-drf BadMate
## Run the indel realigner tool
$JAVA -Xmx40g -jar $GATK \
-T IndelRealigner \
-R $REFERENCE \
-I $BAMLIST \
-targetIntervals $BASEDIR'/bam/all_samples_for_indel_realigner.intervals' \
--consensusDeterminationModel USE_READS \
--nWayOut _realigned.bam
Estimate read depth in your bam files
After all the filtering steps, we want to know what final depth of coverage we have for each sample for downstream analysis. Here, we will use samtools depth to first compute the read depth at each bp position in the genome. Then we will pull the output file to our local machines and compute depth summary stats in R. This is just one way to summarize the depth. There are other programs available to provide summaries of the depth in a bam file, including samtools coverage, Mosdepth and Indexcov.
First, run samtools depth to get depth per sample per position.
BAMLIST=$BASEDIR/sample_lists/merged_bam_list.txt # Path to a list of unique sample prefixes for merged bam files.
REFNAME=mme_physalia_testdata_chr24 # Reference name to add to output files
for SAMPLEBAM in `cat $BAMLIST`; do
## Count per position depth per sample
$SAMTOOLS depth -aa $BASEDIR'/bam/'$SAMPLEBAM'_bt2_'$REFNAME'_minq20_sorted_dedup_overlapclipped.bam' | cut -f 3 | gzip > $BASEDIR'/bam/'$SAMPLEBAM'_bt2_'$REFNAME'_minq20_sorted_dedup_overlapclipped.bam.depth.gz'
done
Then, we’ll process and visualize the data in R. We recommend that you use R-Studio server for this, but running R on command line also works.
## install.packages("tidyverse") ## Install tidyverse if you don't have it installed yet
library(tidyverse)
basedir <- "~/lcwgs-guide-tutorial/tutorial1_data_processing" # Make sure to edit this to match your $BASEDIR
bam_list <- read_lines(paste0(basedir, "/sample_lists/merged_bam_list.txt"))
for (i in 1:length(bam_list)){
bamfile = bam_list[i]
# Compute depth stats
depth <- read_tsv(paste0(basedir, "/bam/", bamfile, "_bt2_mme_physalia_testdata_chr24_minq20_sorted_dedup_overlapclipped.bam.depth.gz"), col_names = F)$X1
mean_depth <- mean(depth)
sd_depth <- sd(depth)
mean_depth_nonzero <- mean(depth[depth > 0])
mean_depth_within2sd <- mean(depth[depth < mean_depth + 2 * sd_depth])
median <- median(depth)
presence <- as.logical(depth)
proportion_of_reference_covered <- mean(presence)
output_temp <- tibble(bamfile, mean_depth, sd_depth, mean_depth_nonzero, mean_depth_within2sd, median, proportion_of_reference_covered)
# Bind stats into dataframe and store sample-specific per base depth and presence data
if (i==1){
output <- output_temp
total_depth <- depth
total_presence <- presence
} else {
output <- bind_rows(output, output_temp)
total_depth <- total_depth + depth
total_presence <- total_presence + presence
}
}
output %>%
mutate(across(where(is.numeric), round, 3))
# Plot the depth distribution (this may take a few minutes to run)
tibble(total_depth = total_depth, position = 1:length(total_depth)) %>%
ggplot(aes(x = position, y = total_depth)) +
geom_point(size = 0.1)
# Total depth per site across all individuals
total_depth_summary <- count(tibble(total_depth = total_depth), total_depth)
total_presence_summary <- count(tibble(total_presence = total_presence), total_presence)
total_depth_summary %>%
ggplot(aes(x = total_depth, y = n)) +
geom_point()
total_depth_summary %>%
ggplot(aes(x = total_depth, y = n)) +
geom_point() +
coord_cartesian(xlim=c(NA, 20))
total_presence_summary %>%
ggplot(aes(x = total_presence, y = n)) +
geom_col()
You can see that there is a lot of variation among the different
depth statistics. Which of them do you think are most relevant to
report? Why?
If you’re interested, you can go back and compare the depth distribution to in the raw mapped files.