Data processing for low coverage whole genome sequencing
February 5, 2021 · View on GitHub
Scripts for going from raw .fastq files to processed and quality-checked .bam files for downstream analysis
Location of this repository
If you are working on the Therkildsen lab server, this GitHub repository is located at /workdir/data-processing/. Otherwise, clone this repo to the machine on which you are working.
Before you start
Create a project directory
As a first step, you should create a project directory (referred to as BASEDIR in certain scripts), with the following subdirectories:
-
adapter_clippedadapter clipped fastq files -
bambam files -
nohupsnohup log files -
sample_listssample tables and sample lists -
markdownsmarkdown files tracking the workflow; it is recommended that you first create Rmd files using Rstudio, and knit these into GitHub markdown format
The following subdirectories are optional, depending on your workflow:
-
angsdangsd output files -
fastqraw fastq files (if you don't already have it backed up on the backup folder) -
species_statsfastq species detector output -
fastqcFastQC output -
qual_filteredquality filtered and/or polyG trimming -
scriptsscripts specific to your projects (e.g. merging certain bam files)
It is recommended that you make this project directory a separate GitHub repository. This way, you can track all the changes in your project directory. To make GitHub ignore certain directories (e.g. adapter_clipped and bam, since these will contain many large files), create a .gitignore file in your project directory, in which you specify the paths that should be ignored by git.
Prepare sample lists
A sample list is a list of the prefixes of raw fastq files. These should be unique for each fastq file, and the rest of the names have to be the same for all raw fastq files. No header should be included in this list.
You can have more than one sample list for each project; by doing so you will be able to run some of the scripts in parallel. When you have different suffixes for your raw fastq files (e.g when sequences come from different sequencing platforms), you will have to create multiple sample lists.
Here is an example of a sample list: https://github.com/therkildsen-lab/greenland-cod/blob/master/sample_lists/sample_list_pe_1.tsv
Prepare a sample table
A sample table is a tab deliminated table that includes relevant information for all fastq files. It should include the following six columns, strictly in this order:
-
prefixprefix of raw fastq files; it should be the union of all your sample lists -
lane_numberlane number; each sequencing lane should be assigned a different number -
seq_idsequence ID,this can be the same thing as sample ID or lane ID and it does not matter except for when different libraries were prepared out of the same sample and were run in the same lane. In this case, seq_id should be used to distinguish these. -
sample_idsample ID -
populationpopulation name -
data_typedata type; there can only be two possible entries:pe(for paired-end data) orse(for single end data)
It is important to make sure that the combination of lane_number, seq_id, and sample_id has to be unique for each fastq file.
It is recommended that you have one single sample table per project.
An example of a sample table: https://github.com/therkildsen-lab/greenland-cod/blob/master/sample_lists/sample_table.tsv
An example of how a sample table can be created from a list of fastq file names and the original sample infomation sheet: https://github.com/therkildsen-lab/greenland-cod/blob/master/markdowns/create_sample_table_se_3.md
Install programs used
If you are not working on the Therkildsen server, you might need to intall the following programs to your machine. The program paths will then need to be inputted to each script (see script headers).
Trimmomatichttp://www.usadellab.org/cms/?page=trimmomaticfastphttps://github.com/OpenGene/fastpbowtie2http://bowtie-bio.sourceforge.net/bowtie2/index.shtmlPicardhttps://broadinstitute.github.io/picard/bamUtilhttps://github.com/statgen/bamUtilGenomeAnalysisTK-3.7https://software.broadinstitute.org/gatk/documentation/version-history.php?id=8692&page=3BBMaphttps://github.com/BioInfoTools/BBMap
Demultiplex
If your fastq file has not been demultiplex and if the barcodes are part of the fastq headers, use the following line to demultiplex. Replace the items in quotes with appropriate paths and names. An example of this workflow can be found here.
Each line in the barcode list should be in the format of i7sequence+i5sequence, and an example of such a barcode list can be found here.
See demuxbyname.sh for details.
nohup bash /programs/bbmap-38.45/demuxbyname.sh \
in="Path to gzipped fastq files" \
in2="Path to gzipped fastq files if your have pair-end reads" \
out="Suffix of output names; this should start with a percentage sign (%)" \
out2="Suffix of output names if your have pair-end reads; this should start with a percentage sign (%)" \
outu=unknown_barcode_1.fastq.gz \
outu2=unknown_barcode_2.fastq.gz \
prefixmode=f \
names="Path to a list of barcode sequences" \
> "Path to the nohup file" &
After this is run, make sure to rename your demulitplexed fastq files so that the file names do not contain a plus sign (+), which can be a problem in downstream analyses.
Get started
-
Clip adapters using adapter_clipping.sh
-
Quality filtering using quality_filtering.sh (optional)
-
Build bowtie reference index using build_bowtie_ref_index.sh. This only needs to be done once with the same reference genome.
-
Map to reference, sort, and quality filter using low_coverage_mapping.sh
-
Merge samples that were sequenced multiple times on different lanes (or within the same lane, with different barcodes). You should write your own script to do this, but see an example here
-
Deduplicate (all samples) and clip overlapping read pairs (pair-end only) using deduplicate_clipoverlap.sh
-
In-del relignment using realign_indels.sh. This needs to be run in the
/bam/folder so that the realigned bam files can be outputted to the correct directory.
Optional steps
-
Check contamination or species composition in fastq files using run_species_detector.sh
-
Count fastq files using count_fastq.sh
-
Count bam files before merging using count_bam_unmerged.sh
-
Count bam files after merging using count_bam_merged.sh
-
Count per position depth using count_depth_per_position_per_sample.sh and summarize these using summarize_depth_per_position.R
Examples of the workflow
A few suggestions
-
Avoid using Excel to edit your sample tables and lists unless you know what you are doing, since it may lead to weird line ending issues (see example). If you are encoutering such issues, use the
dos2unixcommand to convert the file endings to Unix format. -
It is a good practice to have completely unique sample IDs. For example, having one sample named
1Codand another sample named11Codcan create unexpected problems. It would be much safer to name them01Codand11Cod.