ATAC-STARR Analysis
February 5, 2024 ยท View on GitHub
ATAC-STARR-seq is a massively parallel reporter assay (MPRA) that quantifies regulatory activity within accessible chromatin. In addition to regulatory activty, ATAC-STARR quantifies chromatin accessibility and transcription factor binding. In Hansen, T.J. & Hodges, E. 2022, we present an improved ATAC-STARR method. This repository serves as a complement to that manuscript.
In this repository, we share our code for all steps of ATAC-STARR-seq data analysis. The code used at each step is detailed as a Jupyter notebook in the sections below. We also supply some python scripts.
Citation
Please cite our ATAC-STARR-seq method article:
Hansen TJ, Hodges E. ATAC-STARR-seq reveals transcription factor-bound activators and silencers across the chromatin accessible human genome. Genome Res. 2022 Jul 20;32(8):1529โ41. https://doi.org/10.1101/gr.276766.122.
This repository can also be cited:
Tyler J Hansen, & Emily Hodges. (2022). HodgesGenomicsLab/ATAC-STARR-seq: Identifying transcription factor-bound activators and silencers in the chromatin accessible human genome using ATAC-STARR-seq (V2.1.0). Zenodo. https://doi.org/10.5281/zenodo.6640476
Software
Below is a summary of most of the software packages used in ATAC-STARR-seq:
Command-line: R-based:
Trim Galore! R
bowtie2 tidyverse
SAMtools Sushi
Picard DESeq2
Preseq ChIPSeeker
Genrich pheatmap
BEDtools ggsci
deepTools ReactomePA
HOMER
TOBIAS
bedops
Subread
Conda
We use conda to install our packages.
Read Processing and Quality Control (1_read-processing)
We provide a jupyter notebook describing this portion of the analysis. In summary, raw reads are trimmed, assessed for quality, mapped to hg38, and filtered to remove regions mapping to ChrM and ENCODE blacklist regions. Reads with MAPQ scores less than 30 are also removed. For the accessibility analysis, we also generated deduplicated mapped read files. We also estimate library complexity using the preseq package and determine the distribution of insert size using Picard: (Supplementary Figure 1A, Supplementary Figure 2B,C).
fastq-processing-and-QC.ipynb
FASTQ Processing Python Script
To automate this step, I developed and utilitezed the following python script.
fastq_processing.py
Syntax and Arguments
$ python3 fastq_processing.py --help
usage: fastq_processing.py [-h] [-a INPUT_R1] [-b INPUT_R2] [-n NAME]
[-e BLACKLIST_BED] [-t OUTPUT_TRIMMED]
[-o OUTPUT_BAM] [-r OUTPUT_QC] [-g GENOME]
process ATAC-STARR fastq files
required arguments:
-a INPUT_R1, --input_R1 INPUT_R1
raw fastq paired-end read 1 file
-b INPUT_R2, --input_R2 INPUT_R2
raw fastq paired-end read 2 file
-n NAME, --name NAME basename to assign to sample
-t OUTPUT_TRIMMED, --output_trimmed OUTPUT_TRIMMED
directory to deposit trimmed reads
-o OUTPUT_BAM, --output_bam OUTPUT_BAM
directory to deposit mapped reads
-r OUTPUT_QC, --output_qc OUTPUT_QC
directory to deposit QC and readcount files
-g GENOME, --genome GENOME
bowtie index of genome to map to
optional arguments:
-h, --help show this help message and exit
-e BLACKLIST_BED, --blacklist_bed BLACKLIST_BED
bedfile of ENCODE blacklist regions (default is none
and skips this step)
Accessibility Peak and Regulatory Region Calling (2_peak-and-region_calling)
We provide a jupyter notepook detailing the entire peak and region calling process. The scheme for this is shown in Figure 3A. In brief, de-duplicated DNA bam files analyzed with Genrich to call accessible chromatin (ChrAcc) peaks. These peaks are converted into 50bp sliding window bins (10bp step size) and RNA and DNA read counts for each bin are determined using the duplicated bam files (featureCounts). Differential analysis is performed with DESeq2 to identify active and silent bins. Overlapping bins are then merged into regions. This notebook also generates the plots in Figure 3 and Supp Figure 3. It also performs the de-duplicated and fragment group (Supp Fig 4/5) analyses.
ChrAcc_and_Activity_Region-Calling.ipynb
Region Calling Python Script
As mentioned in the notebook, we developed a python script to call regulatory regions from ATAC-STARR-seq data.
call_ATAC-STARR_regulatory-regions.py
RNA-to-DNA_differential-analysis.r
Note: This script calls an R script to do the differential analysis and requires this R script to be in the same directory that the python script is called from. Also, the python script only works for replicate counts 2-5. Anything greater than 5 will require minor revisions to the script.
Syntax and Arguments
$ python3 call_ATAC-STARR_regulatory-regions.py --help
usage: call_ATAC-STARR_regulatory-regions.py [-h] [-i CHRACC_PEAKS]
[-d DNA_BAMS [DNA_BAMS ...]]
[-r RNA_BAMS [RNA_BAMS ...]]
[-o OUT_DIR] [-q FDR]
[-n THREADS]
Call ATAC-STARR Regulatory Regions
required arguments:
-i CHRACC_PEAKS, --ChrAcc_peaks CHRACC_PEAKS
accessibility peaks (narrow-peak format)
-d DNA_BAMS [DNA_BAMS ...], --DNA_bams DNA_BAMS [DNA_BAMS ...]
reisolated plasmid DNA bam files
-r RNA_BAMS [RNA_BAMS ...], --RNA_bams RNA_BAMS [RNA_BAMS ...]
reporter RNA bam files
-o OUT_DIR, --out_dir OUT_DIR
output directory
optional arguments:
-h, --help show this help message and exit
-q FDR, --fdr FDR false discovery rate for differential analysis
(default: 0.05)
-n THREADS, --threads THREADS
number of threads (default: 1)
Accessibility Peak and Regulatory Region Characterization (3_peak-and-region_characterization)
We compared our ChrAcc peaks to data from Corces et al. (Figure 2), which is detailed in the following notebook:
Compare_Corces_and_ATAC-STARR.ipynb
We also characterized our active and silent regions in the following notebook (Figure 3):
Characterize_Regulatory_Regions.ipynb
TF Footprinting (4_TF_footprinting)
TF footprinting is an established method to identify TF footprints from an ATAC-seq or DNase-seq dataset. We performed footprinting using the TOBIAS software package (Figure 5). This and the plotting schemes are displayed in the following notebook:
TF_footprinting.ipynb
Multiomic Analysis (5_multiomic_analysis)
We created a genome browser view using Sushi (Figure 6). We also clustered active and silent regions based on the presence and absence of TF footprints and then performed reactome pathway analysis of each cluster (Figure 7). This analysis is displayed in the following notebook:
multiomic_analysis.ipynb
Supplementary Analyses (6_supplementary_analysis)
We looked at neutral regions, short and long fragnemts, orientation bias, HiDRA data, and performed a psuedo-replicate analysis. Analyses for these are in the notebooks below:
Additional_Supplementary_Analysis.ipynb
Neutral Sequence Analysis.ipynb
Short-vs-long_elements.ipynb
*On 2/5/2024, I added the generate_ATAC-STARR_bigwig.py script. This is a script also in my https://github.com/HodgesGenomicsLab/ATAC-STARR_cis_trans repository, but I felt it was a good idea to have it here. This script can be used to generate signal files from ATAC-STARR data. Need help, just ask!