List of processes
May 5, 2024
- DA_ATAC__doing_differential_analysis
- DA_ATAC__annotating_diffbind_peaks
- DA_ATAC__plotting_differential_analysis_results
- DA_ATAC__saving_detailed_results_tables
DA_ATAC__doing_differential_analysis
Description
This process takes as input the final filtered peaks and the (1 base pair) reads.
DiffBind is used to perform differential binding analysis between the two conditions of each comparison.
Briefly, DiffBind estimates read abundance at selected peaks of interest (i.e., the consensus peak set) and then uses differential gene expression analysis tools (i.e., DESeq2 or edgeR) to determine which peaks are differentially bound.
Parameters
See the function links for details and possible options. Details on the choice of default values can be found here. The parameters are:
- For the dba function:
  - params.diffbind__analysis_method: Option to use DESeq2 or edgeR for the analysis. Default: 'DBA_EDGER'.
  - For edgeR analysis method:
- For the dba.blacklist function:
  - params.use_input_control: If an input control is used, grey list regions (regions of high signal in the input) will be estimated by DiffBind via the GreyListChIP package and excluded from the analysis. Default: false.
  - params.diffbind__make_grey_list: Should a grey list be created or not. This option can be set to 'TRUE' only if params.use_input_control is also 'TRUE'. If 'TRUE', a grey list region will be created from the input control to hide hotspot regions. Default: 'FALSE'.
- For the dba.count function:
  - params.diffbind__min_overlap: Only include peaks present in at least this many peaksets when generating the consensus peakset. The default behavior of cactus is to include any peak from any replicate in the consensus peak set (i.e., th = 1). Non-robust signal should in any case have a high p-value (i.e., low significance) and be filtered away in downstream analysis. Default: 1.
  - params.diffbind__score: Score to use in the binding affinity matrix. Raw read counts are used for the analysis. This parameter only influences the counts shown in the detailed ATAC results tables (for each individual replicate). Default: 'DBA_SCORE_NORMALIZED'.
  - params.diffbind__sub_control: Option to determine if the input control reads should be subtracted from each site in each sample. Default: 'FALSE'.
  - params.diffbind__scale_control: Option to determine if reads should be scaled by library size when using the params.diffbind__sub_control option. Default: 'TRUE'.
  - params.diffbind__min_count: Minimum read count value. Any interval with fewer than this many overlapping reads will be set to have this count. Default: 0.
  - params.diffbind__summits: Option to control the summit heights and locations calculated for each peak. Default: 75.
  - params.diffbind__filter: Intervals with values lower than this are excluded from the analysis. Default: 1.
- For the dba.normalize function:
  - params.diffbind__normalization: Normalization method to use. Default: 'DBA_NORM_DEFAULT'.
  - params.diffbind__library_size: Method used to calculate library size. Default: 'DBA_LIBSIZE_BACKGROUND'.
  - params.diffbind__background: Should background bins be used for normalization. Can be 'FALSE', 'TRUE' (default bin size of 15000bp), or an integer (indicating the bin size). Default: 'TRUE'.
- For the dba.contrast function:
  - params.diffbind__design: Should contrasts be specified with a formula or not. Default: 'TRUE'.
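To make the dba.count options above more concrete, here is a minimal Python sketch of how a minimum overlap, a minimum count and a filter threshold act on toy data. It is not DiffBind's implementation: the function names, peak ids and replicate names are hypothetical, real peaks would first be merged by genomic overlap, and applying the filter to the highest count across samples is an assumption.

```python
def consensus_peaks(peak_calls, min_overlap=1):
    """Keep peaks called in at least `min_overlap` peaksets.

    `peak_calls` maps a (hypothetical) peak id to the set of replicates
    in which that peak was called.
    """
    return sorted(p for p, reps in peak_calls.items() if len(reps) >= min_overlap)


def apply_min_count(counts, min_count=0):
    """Raise every count that is below `min_count` up to `min_count`."""
    return [max(c, min_count) for c in counts]


def passes_filter(counts, threshold=1):
    """Assumption: an interval is kept if its highest count reaches the threshold."""
    return max(counts) >= threshold


# Toy data: three peaks called in different subsets of three replicates.
peak_calls = {
    "peak_a": {"rep1", "rep2", "rep3"},
    "peak_b": {"rep2"},
    "peak_c": {"rep1", "rep3"},
}

all_peaks = consensus_peaks(peak_calls, min_overlap=1)     # the default described above
robust_peaks = consensus_peaks(peak_calls, min_overlap=2)  # a stricter setting
floored = apply_min_count([0, 3, 12], min_count=5)

print(all_peaks)
print(robust_peaks)
print(floored)
```

With min_overlap = 1 every peak enters the consensus set, which matches the default behavior described above; raising it drops peaks seen in a single replicate.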
Outputs
- Consensus peaks: Processed_Data/2_Differential_Analysis/ATAC__all_peaks__bed/${comparison}__diffbind_peaks_gr.bed.
- DiffBind object: Processed_Data/2_Differential_Analysis/ATAC__all_peaks__DiffBind/${comparison}__diffbind_peaks_dbo.rds.
- Read counts by replicate (GRanges object): Processed_Data/2_Differential_Analysis/ATAC__all_peaks__gRange/${comparison}__all_peaks.rds.
DA_ATAC__annotating_diffbind_peaks
Description
Peaks are annotated with ChIPseeker. Each peak is assigned to its closest gene using the annotatePeak function.
Parameters
Parameters of the annotatePeak function:
- params.chipseeker__promoter_up: promoter start; upstream of the TSS. Default: 1500.
- params.chipseeker__promoter_down: promoter end; downstream of the TSS. Default: 500.
- params.chipseeker__overlap: this parameter, together with params.chipseeker__ignore_overlap, controls the genes to which peaks are assigned. If params.chipseeker__overlap equals "all" and params.chipseeker__ignore_overlap equals 'FALSE', then a peak that overlaps a genomic feature (i.e., exon, intron, 5'UTR, 3'UTR, CDS) will be assigned to the corresponding gene. Otherwise, the peak will be assigned to the neighboring gene regardless of overlap with genomic features. Options: "all", "TSS". Default: 'all'.
- params.chipseeker__ignore_overlap: this parameter, together with params.chipseeker__overlap, controls the genes to which peaks are assigned (see params.chipseeker__overlap above for how the two parameters interact). Options: "FALSE", "TRUE". Default: 'FALSE'.
- params.chipseeker__annotation_priority: This parameter controls the order of priority used to assign a genomic region in the "annotation" column when several features overlap the peak. Default: "c('Promoter', '5UTR', '3UTR', 'Exon', 'Intron', 'Downstream', 'Intergenic')".
- params.chipseeker__ignore_upstream: If 'TRUE', only annotate genes at the 3' of the peak. Options: "FALSE", "TRUE". Default: 'FALSE'.
- params.chipseeker__ignore_downstream: If 'TRUE', only annotate genes at the 5' of the peak. Options: "FALSE", "TRUE". Default: 'FALSE'.
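The promoter window defined by the two promoter parameters above can be illustrated with a small Python sketch. This is a simplification and not ChIPseeker's code: the function name is hypothetical, a peak is reduced to a single position, and the assumption is that "upstream" and "downstream" are interpreted relative to the gene's strand.

```python
def in_promoter(position, tss, strand, promoter_up=1500, promoter_down=500):
    """Return True if `position` lies within [TSS - up, TSS + down] on the gene's strand."""
    # Signed offset: negative means upstream of the TSS, positive means downstream.
    offset = position - tss if strand == "+" else tss - position
    return -promoter_up <= offset <= promoter_down


# A position 1000 bp before a plus-strand TSS is upstream, inside the default window.
print(in_promoter(9_000, 10_000, "+"))
# A position 1000 bp after the same TSS is downstream, outside the default +500 limit.
print(in_promoter(11_000, 10_000, "+"))
# On the minus strand the same coordinate is upstream of the TSS.
print(in_promoter(11_000, 10_000, "-"))
```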
Outputs
- Annotated peaks (data.frame object): Processed_Data/2_Differential_Analysis/ATAC__all_peaks__dataframe/${comparison}__diffb_anno_peaks_df.rds.
- Annotated peaks (ChIPseeker object): Processed_Data/2_Differential_Analysis/ATAC__all_peaks__ChIPseeker/${comparison}__diffb_anno_peaks_cs.rds.
DA_ATAC__plotting_differential_analysis_results
Description
This process makes standardized (i.e. similar types of plots are produced for mRNA-Seq data) PCA and volcano plots, boxplots of FDR by PA (Peak Assignment), and some other plots produced directly by DiffBind.
Parameters
- params.diffbind_plots__fdr_threshold: Peaks with FDR less than or equal to this value are colored in red in the volcano plot. Default: 0.05.
- params.diffbind_plots__top_n_labels: The top n peaks with lowest FDR will have their annotated gene displayed on the volcano plot. Default: 15.
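As an illustration of how these two parameters could act on a results table, here is a minimal Python sketch. It is not the pipeline's plotting code: the function name, the gene names and the use of grey for non-significant peaks are assumptions made for the example.

```python
def volcano_styling(peaks, fdr_threshold=0.05, top_n_labels=15):
    """Return (colors, labels) for a list of dicts with 'gene' and 'fdr' keys."""
    # Peaks with FDR less than or equal to the threshold are colored in red.
    colors = ["red" if p["fdr"] <= fdr_threshold else "grey" for p in peaks]
    # The n peaks with the lowest FDR get their annotated gene displayed.
    ranked = sorted(peaks, key=lambda p: p["fdr"])
    labels = [p["gene"] for p in ranked[:top_n_labels]]
    return colors, labels


peaks = [
    {"gene": "gene_a", "fdr": 0.001},
    {"gene": "gene_b", "fdr": 0.2},
    {"gene": "gene_c", "fdr": 0.05},
    {"gene": "gene_d", "fdr": 0.01},
]

colors, labels = volcano_styling(peaks, fdr_threshold=0.05, top_n_labels=2)
print(colors)
print(labels)
```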
Outputs
- Volcano plots: Figures_Individual/2_Differential_Analysis/ATAC__volcano/${comparison}__ATAC_volcano.pdf, Figures_Merged/2_Differential_Analysis/ATAC__volcano.pdf.
- PCA plots (PC 1 and 2): Figures_Individual/2_Differential_Analysis/ATAC__PCA_1_2/${comparison}__ATAC_PCA_1_2.pdf, Figures_Merged/2_Differential_Analysis/ATAC__PCA_1_2.pdf.
  - top left panel: percentage of variance explained by the first 5 principal components
  - top right panel: PCA plot for principal components 1 and 2
  - bottom panels: genes annotated to peaks that contribute the most to principal components 1 (left) and 2 (right). Color code: red or -1 indicates that the peak is a positive contributor. Blue or +1 indicates that the peak is a negative contributor.
- PCA plots (PC 3 and 4): Figures_Individual/2_Differential_Analysis/ATAC__PCA_3_4/${comparison}__ATAC_PCA_3_4.pdf, Figures_Merged/2_Differential_Analysis/ATAC__PCA_3_4.pdf.
  - Same as above but for principal components 3 and 4.
- FDR by PA filters plots: Figures_Individual/2_Differential_Analysis/ATAC_FDR_by_PA/${comparison}__ATAC_FDR_by_PA.pdf, Figures_Merged/2_Differential_Analysis/ATAC_FDR_by_PA.pdf.
- Other plots: Figures_Individual/2_Differential_Analysis/ATAC__other_plots/${comparison}__ATAC_other_plots.pdf, Figures_Merged/2_Differential_Analysis/ATAC__other_plots.pdf.
  - MA plot: MA and scatter plots of differential binding analysis results; using normalization factors.
  - Heatmap plot: Binding site heatmap.
  - Venn diagram: 4-way Venn diagrams showing the first 2 replicates per condition.
- Peaks without annotations: Processed_Data/2_Differential_Analysis/ATAC__non_annotated_peaks/${comparison}__ATAC_non_annotated_peaks.txt.
  - There should not be many, but if there are, this file can help to inspect these peaks.
DA_ATAC__saving_detailed_results_tables
Description
The detailed ATAC-seq results table is created in R. It includes the peak name, coordinates, significance, log2 fold change, annotated gene name and id, annotated region, distance to the TSS (Transcription Start Site), and raw counts.
In addition, the following filtering columns are added:
- FC_{up,down}: up or down-regulated
- PA_gene: genic regions annotated by ChIPseeker (i.e., intron + exon)
- PA_interG: intergenic regions annotated by ChIPseeker
- PA_prom: promoter regions annotated by ChIPseeker
- PA_5pUTR: five prime UTR regions annotated by ChIPseeker
- PA_3pUTR: three prime UTR regions annotated by ChIPseeker
- PA_exon: exonic regions annotated by ChIPseeker
- PA_intron: intronic regions annotated by ChIPseeker
- PA_downst: downstream regions annotated by ChIPseeker (downstream of a gene by a maximal distance of 300 base pairs)
- PA_distIn: distal intergenic regions annotated by ChIPseeker
- PA_UTR: UTR regions (5pUTR + 3pUTR)
- PA_TSS: overlap with the TSS (distanceToTSS = 0)
- PA_genPro: genic region or promoter
- PA_distNC: the peak is in a distal intergenic region, or in an intron but not in any of these regions: promoter, 5' UTR, 3' UTR and exon. distNC stands for distal noncoding. These regions have been shown in Daugherty et al. (first ATAC-Seq paper in C. elegans) to be enriched in active and repressed enhancers.
- PA_lt{10,100,X}kb: absolute distance to the nearest gene TSS is less than 10, 100, or X kilobases, with X being defined by the parameter params.custom_distance__less_than_X_b (default 500 kb). Note that 10 kb is a historically commonly used cutoff for annotating ChIP-Seq peaks, and that 100 kb and 500 kb correspond to cutoffs for proximal and distal enhancers in mouse as defined in Xie et al.
- PA_mt{10,100,Y}kb: absolute distance to the nearest gene TSS is more than 10, 100, or Y kilobases, with Y being defined by the parameter params.custom_distance__more_than_Y_b (default 500 kb).
These columns can all be used in the cactus configuration files, via the parameter params.peak_assignment_for_splitting_subsets, to filter for peaks matching a certain annotation pattern.
NOTE: The PA_prom filter uses the chipseeker__promoter_up and chipseeker__promoter_down Cactus parameters to define the promoter regions. By default this is defined as -1500/+500 relative to the TSS. However, users can use this filter as an arbitrary "customizable" PA filter, for instance to filter out peaks that are too far away from the TSS. For example, using the PA_prom filter with chipseeker__promoter_up = 10000 and chipseeker__promoter_down = 10000 would give the same result as PA_lt10kb.
NOTE: New filtering columns could be added in the future if needed.
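The logic behind several of the filtering columns above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R code used by the pipeline: the function name and the region labels are hypothetical, each peak is assumed to carry a single region label, and PA_distNC and PA_downst are left out for simplicity.

```python
def pa_filters(region, distance_to_tss, less_than_x_b=500_000, more_than_y_b=500_000):
    """Derive some PA_* flags from a region label and a signed distance to the TSS (in bp)."""
    dist = abs(distance_to_tss)  # the distance filters use the absolute distance
    flags = {
        "PA_prom": region == "promoter",
        "PA_5pUTR": region == "5pUTR",
        "PA_3pUTR": region == "3pUTR",
        "PA_exon": region == "exon",
        "PA_intron": region == "intron",
        "PA_distIn": region == "distal_intergenic",
        "PA_TSS": distance_to_tss == 0,
        "PA_lt10kb": dist < 10_000,
        "PA_lt100kb": dist < 100_000,
        "PA_ltXkb": dist < less_than_x_b,   # X set by custom_distance__less_than_X_b
        "PA_mt10kb": dist > 10_000,
        "PA_mt100kb": dist > 100_000,
        "PA_mtYkb": dist > more_than_y_b,   # Y set by custom_distance__more_than_Y_b
    }
    # Composite columns are combinations of the elementary ones.
    flags["PA_UTR"] = flags["PA_5pUTR"] or flags["PA_3pUTR"]
    flags["PA_gene"] = flags["PA_exon"] or flags["PA_intron"]
    flags["PA_genPro"] = flags["PA_gene"] or flags["PA_prom"]
    return flags


# Toy peak: annotated as intronic, 25 kb upstream of the nearest TSS.
example = pa_filters("intron", -25_000)
print(sorted(name for name, value in example.items() if value))
```

A peak then passes a given filter when the corresponding flag is true, which is how such columns could be used to select subsets of peaks.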
Parameters
- params.custom_distance__less_than_X_b: Custom threshold for the PA_ltXkb filter, used to select peaks below a given distance (in base pairs) to the TSS of their closest gene. Default: 500000 (i.e., 500 kilobases).
- params.custom_distance__more_than_Y_b: Custom threshold for the PA_mtYkb filter, used to select peaks above a given distance (in base pairs) to the TSS of their closest gene. Default: 500000 (i.e., 500 kilobases).
Outputs
- Table: Tables_Individual/2_Differential_Analysis/ATAC_detailed/${comparison}__res_detailed_atac.{csv,xlsx}, Tables_Merged/2_Differential_Analysis/ATAC_detailed.{csv,xlsx}.