Workshop Schedule

May 13, 2025

NOTE: The Basic Data Skills Shell for Bioinformatics workshop is a prerequisite.

Pre-reading:

Day 1

| Time | Topic | Instructor |
|:---|:---|:---|
| 9:30 - 10:00 | Workshop Introduction | Will |
| 10:00 - 11:30 | Introduction to Variant Calling | Elizabeth |
| 11:30 - 11:50 | Project Organization | Heather |
| 11:50 - 12:00 | Overview of self-learning materials and homework submission | Will |

Before the next class:

I. Please study the contents and work through all the code within the following lessons:

  1. Evaluating Read Quality with FastQC

    The first step in many NGS studies is to evaluate the quality of the reads that you received from the sequencing facility. A common tool for this analysis is FastQC.

    This lesson will:
    • Implement FastQC to evaluate read qualities
    • Evaluate FastQC quality metrics

  2. Sequence Read Alignment

    Once we have completed QC on the sequence reads, we will align them to a reference sequence. This alignment step places each read in genomic space and creates the bedrock for calling variants.

    This lesson will:
    • Enumerate difficulties with alignment
    • Create an sbatch script to align reads

  3. Alignment File Processing

    Before we can call variants from our alignment files, we need to do some processing to clean up the alignment files. The two major concerns here are organizing (sorting) our alignment files for our analyses and removing duplicates.

    This lesson will:
    • Differentiate between query-sorted and coordinate-sorted alignment files
    • Describe and remove duplicate reads
    • Process a raw SAM file into a BAM file ready for input to GATK
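
To make the first objective above concrete, here is a toy sketch that uses plain `sort` on three invented records (read name, chromosome, position). It is not how the lesson sorts real alignment files: real tools order chromosomes by the reference header rather than alphabetically, and the file names used here are hypothetical.

```shell
# Toy records (invented for illustration): read name, chromosome, position
printf 'readC\tchr1\t500\nreadA\tchr2\t100\nreadB\tchr1\t50\n' > toy_reads.tsv

TAB="$(printf '\t')"

# Query-sorted: ordered by read name (column 1)
LC_ALL=C sort -t "$TAB" -k1,1 toy_reads.tsv > query_sorted.tsv

# Coordinate-sorted: ordered by chromosome (column 2), then numeric position (column 3)
LC_ALL=C sort -t "$TAB" -k2,2 -k3,3n toy_reads.tsv > coordinate_sorted.tsv

cut -f1 query_sorted.tsv        # prints readA, readB, readC (one per line)
cut -f1 coordinate_sorted.tsv   # prints readB, readC, readA (one per line)
```

The same three reads come out in a different order depending on the sort key, which is why downstream tools need to know which ordering a file uses.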

NOTE: To run through the code above, you will need to be logged into O2 and working on a compute node (i.e. your command prompt should have the word compute in it).

  1. Log in using ssh rc_trainingXX@o2.hms.harvard.edu and enter your password (replace the "XX" in the username with the number you were assigned in class).
  2. Once you are on the login node, use srun --pty -p interactive -t 0-2:30 --mem 1G /bin/bash to get on a compute node.
  3. Proceed once your command prompt has the word compute in it.
  4. If you log out between lessons (using the exit command twice), follow steps 1 and 2 above to log back in and get on a compute node before you resume the self-learning.
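
The check in step 3 can also be scripted. The sketch below is a minimal helper and not part of the workshop materials: the function name `on_compute_node` is hypothetical, and it assumes that the word compute in your prompt comes from the node's hostname.

```shell
# Hypothetical helper: succeeds when a hostname contains the word "compute".
# Assumption: the prompt displays the hostname, so compute nodes have "compute" in their name.
on_compute_node() {
    host="${1:-$(uname -n)}"   # default to the current machine's hostname
    case "$host" in
        *compute*) return 0 ;;
        *)         return 1 ;;
    esac
}

if on_compute_node; then
    echo "On a compute node: ready to run the lesson code."
else
    echo "Not on a compute node. Start an interactive session first, for example:"
    echo "  srun --pty -p interactive -t 0-2:30 --mem 1G /bin/bash"
fi
```

Passing a hostname as the first argument checks that name instead of the current machine, which is handy for trying the function out away from the cluster.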

II. Complete the exercises:

  • Each lesson above contains exercises; please go through each of them.
  • Copy over your solutions into the Google Form the day before the next class.

Questions?

  • If you get stuck due to an error while running code in the lesson, email us.

Day 2

| Time | Topic | Instructor |
|:---|:---|:---|
| 9:30 - 10:00 | Self-learning lessons review | All |
| 10:00 - 10:30 | Alignment File Quality Control | Heather |
| 10:30 - 10:40 | Break | |
| 10:40 - 11:15 | Aggregating QC metrics using MultiQC | Heather |
| 11:15 - 12:00 | Variant Calling | Will |

Before the next class:

I. Please study the contents and work through all the code within the following lessons:

  1. Variant Filtering

    Now that we have called our raw variants, we need to filter the data down to high-quality variant calls. Low-quality variant calls can occur for a variety of reasons, which we will explore, and we will implement steps to exclude them.

    This lesson will:
    • Filter raw variant calls using FilterMutectCalls to reduce errors
    • Remove low-complexity regions from the called variants using SnpSift to further reduce errors

  2. Variant Annotation with SnpEff

    With our high-quality variant calls, we would like to know more about these variants. For example, we might like to know which genes they are in or how they alter the protein-coding sequence of those genes. In order to do this, we will need to provide annotations for our genes.

    This lesson will:
    • Annotate a VCF file for functional impacts with `SnpEff`
    • Differentiate between an unannotated and annotated VCF file
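
One quick way to tell the two apart: recent SnpEff versions write their annotations by default into an `ANN` entry in the INFO column (column 8) of each record. The sketch below counts records with and without that entry in a toy VCF body. The records, the file name `toy.vcf`, and the function name `count_annotated` are invented for illustration, and the `ANN` value is heavily abbreviated compared with real SnpEff output.

```shell
# Toy VCF body (invented records; the ANN value is abbreviated, not real SnpEff output)
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=30;ANN=G|missense_variant|MODERATE|GENE1\n' >  toy.vcf
printf 'chr1\t200\t.\tC\tT\t50\tPASS\tDP=25\n' >> toy.vcf

# Hypothetical helper: count non-header records with and without an ANN entry in INFO (column 8)
count_annotated() {
    awk -F '\t' '!/^#/ { if ($8 ~ /(^|;)ANN=/) annotated++; else plain++ }
        END { printf "annotated=%d unannotated=%d\n", annotated, plain }' "$1"
}

count_annotated toy.vcf   # prints annotated=1 unannotated=1
```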

NOTE: To run through the code above, you will need to be logged into O2 and working on a compute node (i.e. your command prompt should have the word compute in it). For login instructions, please see above.

II. Complete the exercises:

  • Each lesson above contains exercises; please go through each of them.
  • Copy over your solutions into the Google Form the day before the next class.

Questions?

  • If you get stuck due to an error while running code in the lesson, email us.

Day 3

| Time | Topic | Instructor |
|:---|:---|:---|
| 9:30 - 10:00 | Self-learning lessons review | All |
| 10:00 - 10:30 | Variant Prioritization with SnpSift | Will |
| 10:30 - 11:00 | Exercise (Key) | Will |
| 11:00 - 11:30 | Visualization in IGV | Heather |
| 11:30 - 12:00 | Q & A (review of Automation) | All |

Questions?

  • If you get stuck due to an error while running code in the lesson, email us.

Day 4

| Time | Topic | Instructor |
|:---|:---|:---|
| 9:30 - 10:30 | Introduction to cBioPortal | Dr. Tali Mazor |
| 10:30 - 11:30 | cBioPortal Practical | Dr. Tali Mazor |
| 11:30 - 11:45 | Oncoprint Integration | Will |
| 11:45 - 12:00 | Wrap up | Heather |

File Format Reference

Automation Reference

Answer key


These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.