October 25, 2024 · View on GitHub

Configuration files

Parameters can be set up in two different configuration files:

  • a global configuration file that applies to all runs, located at ~/.cactus.config. An example can be found here.

  • a run-specific configuration file: this file is the only input needed for a Cactus run. It can, for instance, be named yml/run.yml (path relative to the run folder). An example can be found here.

Parameters can also be set up directly on the command line. See here for more details on how parameters from various sources are handled by Nextflow.

Mandatory parameters

  • params.species: species under study. Options: 'worm', 'fly', 'mouse', 'human'. Mandatory. No default.
  • params.references_dir: Directory where references have been downloaded. Mandatory. No default.
  • params.chromatin_state: Chromatin state to use. Options are listed in the ${params.references_dir}/${params.species}/encode_chromatin_states_metadata.csv file. Mandatory. No default.

Additionally, these parameters are mandatory if conda, mamba or singularity is used:

  • params.singularity_cache_dir: Directory where singularity images are downloaded to when Cactus is run for the first time. Mandatory if the singularity profile is used. No default.
  • params.conda_cache_dir: Directory where conda environments are downloaded to when Cactus is run for the first time. Mandatory if the conda profile is used. No default.
  • params.mamba_cache_dir: Directory where mamba environments are downloaded to when Cactus is run for the first time. Mandatory if the mamba profile is used. No default.

Global configuration file and mandatory parameters

Any parameter can be set in the ~/.cactus.config file.

It is highly recommended to set the mandatory parameters here. One exception is params.chromatin_state, which can be set globally (in ~/.cactus.config) or locally (in the run-specific .yml file or on the command line) depending on the user's needs.
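
As a minimal sketch, the mandatory parameters could be written in ~/.cactus.config as shown here. All directory paths are hypothetical placeholders. params.chromatin_state is deliberately left out, since its valid values depend on the metadata file of the chosen species.

```groovy
// ~/.cactus.config (sketch; every path below is a hypothetical placeholder)
params.species        = 'worm'                        // one of: 'worm', 'fly', 'mouse', 'human'
params.references_dir = '/path/to/cactus/references'  // where the references were downloaded

// Only the cache directory matching the profile in use is mandatory
params.singularity_cache_dir = '/path/to/cache/singularity'
// params.conda_cache_dir    = '/path/to/cache/conda'
// params.mamba_cache_dir    = '/path/to/cache/mamba'
```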

In addition, it is recommended to set up a Nextflow Tower token in the ~/.cactus.config file to monitor pipeline execution with Nextflow Tower, using these parameters:

  • params.tower_token: Tower token to monitor the pipeline on Tower. Default: ''.
  • params.enable_tower: Enable or disable monitoring of the pipeline on Tower. Default: false.
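
A possible Tower setup in ~/.cactus.config is sketched here. The token value is a placeholder, and the sketch assumes that params.enable_tower is a boolean switch, as its default of false suggests.

```groovy
// ~/.cactus.config (sketch)
params.tower_token  = '<your-tower-token>'  // placeholder; replace with a personal token
params.enable_tower = true                  // assumption: boolean switch (default: false)
```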

Design

  • params.experiment_types: Analyze only ATAC-Seq data, only mRNA-Seq data or both data types. Options: 'both', 'atac', 'mRNA'. Default: 'both'.
  • params.design__mrna_fastq: path to the mRNA fastq design file. Default: 'design/mrna_fastq.tsv'. This file can be empty if only ATAC-Seq data is analyzed, in which case the argument params.experiment_types should be set to atac.
  • params.design__atac_fastq: path to the ATAC fastq design file. Default: 'design/atac_fastq.tsv'. This file can be empty if only mRNA-Seq data is analyzed, in which case the argument params.experiment_types should be set to mRNA.
  • params.design__comparisons: path to the comparisons design file. Default: 'design/comparisons.tsv'.
  • params.design__regions_to_remove: path to the regions to remove design file. Default: 'design/regions_to_remove.tsv'.
  • params.design__genes_to_remove: path to the genes to remove design file. Default: 'design/genes_to_remove.tsv'.
  • params.design__groups: path to the groups design file. Default: 'design/groups.tsv'.
  • params.use_input_control: Whether a gDNA input control should be used for ATAC-Seq analysis to remove greylist regions with DiffBind, and for some quality control analysis steps. Note that the input control cannot have replicates and should have the id "input" in the params.design__atac_fastq file (see example here). Default: false.
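
For example, an ATAC-Seq-only run could be configured along these lines. This is a sketch in Nextflow config syntax that uses only the parameters and default paths listed above.

```groovy
// Analyze ATAC-Seq data only; the mRNA fastq design file may then be empty
params.experiment_types    = 'atac'
params.design__atac_fastq  = 'design/atac_fastq.tsv'   // default path
params.design__comparisons = 'design/comparisons.tsv'  // default path
params.use_input_control   = false                     // default: no gDNA input control
```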

Resources

This part contains parameters from Nextflow's executor scope:

  • params.executor.queueSize: How many processes are queued at a given time. Default: 100.
  • params.executor.$local.memory: Maximum total memory that will be used on the server (or local machine) during the run. Default: 80 GB.
  • params.executor.$local.cpus: Maximum total number of CPUs that will be used on the server (or local machine) during the run. Default: 15.
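
For illustration, the defaults above would look as follows when written directly in Nextflow's standard executor scope. This sketch assumes that Cactus passes these values through to that scope unchanged. How the params.executor.* prefix is resolved is not described in this document.

```groovy
// Standard Nextflow executor scope, filled with the defaults listed above
executor {
    queueSize = 100        // processes queued at a given time
    $local {
        memory = '80 GB'   // maximum total memory used during the run
        cpus   = 15        // maximum total number of CPUs used during the run
    }
}
```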

Output Files

  • params.res_dir: Name of the directory where results will be saved. Default: 'results/Cactus_v${cactus_version}'.
  • params.pub_mode: Type of publication mode to use. Options are available here. Default: 'link'.
  • params.save_fastq_type: Saving only the last, none or all fastq files. Options: 'none', 'last', 'all'. Default: 'none'.
  • params.save_bam_type: Saving only the last, none or all bam files. Options: 'none', 'last', 'all'. Default: 'last'.
  • params.save_bed_type: Saving only the last, none or all bed files. Options: 'none', 'last', 'all'. Default: 'last'.
  • params.save_1bp_bam: Saving the 1 base pair reads after all filtering steps and Tn5-shift adjustment (the ATAC-shift). Options: 'none', 'last', 'all'. Default: false.
  • params.report_dir: Directory where reports will be saved. Default: '${params.res_dir}/Run_Info/${params.current_date}'.
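
The following sketch shows how the saving options might be combined to keep more intermediate files than by default. The value 'copy' for params.pub_mode is an assumption: it is one of Nextflow's publishDir modes, and this sketch presumes Cactus accepts any of them.

```groovy
params.pub_mode        = 'copy'  // assumption: any Nextflow publishDir mode is accepted
params.save_fastq_type = 'last'  // keep only the last fastq files (default: 'none')
params.save_bam_type   = 'last'  // default
params.save_bed_type   = 'all'   // keep all bed files (default: 'last')
```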

Cache

  • params.resume: Enable or disable resuming of the run with the current cache. Default: true.
  • params.cache: Type of cache to make. Options are available here. Default: 'deep'.

References

This part contains the path to the references.

Cactus parses all references to simplify access to external databases for the user. However, there can be occasions where one wants to use another reference file. Any parameter from the species.config file can be modified if needed. For instance, a user analyzing worm data can check whether human motifs are enriched by using this parameter:

params.pwms_motifs = "${params.references_dir}/human/homer_data/homer_motifs.txt"

Other species parameters that may be useful to tweak in certain situations are: params.blacklisted_regions, params.encode_chip_files or params.chromatin_state.

Processes

Default parameters for the processes are defined here.

1. Preprocessing: ATAC_peaks

  • params.macs2__qvalue: q-value (minimum FDR) cutoff to call significant peaks with macs2. Default: '5e-2'.
  • params.input_control_overlap_portion: sample peaks that overlap with the input control by more than this percentage (of the sample peak) will be removed. Default: 0.2.
  • params.do_saturation_curve: enable or disable this process. Default: true.

Parameters of the annotatePeak function:

  • params.chipseeker__promoter_up: promoter start; upstream from TSS site. Default: 1500.
  • params.chipseeker__promoter_down: promoter end; downstream from TSS site. Default: 500.
  • params.chipseeker__overlap: this parameter, together with params.chipseeker__ignore_overlap, controls which genes peaks are assigned to. If params.chipseeker__overlap equals "all" and params.chipseeker__ignore_overlap equals 'FALSE', then a peak that overlaps a genomic feature (i.e., exon, intron, 5'UTR, 3'UTR, CDS) will be assigned to the corresponding gene. Otherwise, the peak will be assigned to the neighboring gene regardless of overlap with genomic features. Options: "all", "TSS". Default: 'all'.
  • params.chipseeker__ignore_overlap: this parameter, together with params.chipseeker__overlap, controls which genes peaks are assigned to. If params.chipseeker__overlap equals "all" and params.chipseeker__ignore_overlap equals 'FALSE', then a peak that overlaps a genomic feature (i.e., exon, intron, 5'UTR, 3'UTR, CDS) will be assigned to the corresponding gene. Otherwise, the peak will be assigned to the neighboring gene regardless of overlap with genomic features. Options: "FALSE", "TRUE". Default: 'FALSE'.
  • params.chipseeker__annotation_priority: This parameter controls the order of priority used to assign a genomic region in the "annotation" column when several features overlap with the peak. Default: "c('Promoter', '5UTR', '3UTR', 'Exon', 'Intron', 'Downstream', 'Intergenic')".
  • params.chipseeker__ignore_upstream: If 'TRUE', only annotate genes at the 3' of the peak. Options: "FALSE", "TRUE". Default: 'FALSE'.
  • params.chipseeker__ignore_downstream: If 'TRUE', only annotate genes at the 5' of the peak. Options: "FALSE", "TRUE". Default: 'FALSE'.
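
As an illustration, the peak annotation settings could be adjusted as sketched here. The promoter window sizes are arbitrary example values, not recommendations.

```groovy
// Widen the promoter window to 2 kb upstream and 1 kb downstream of the TSS (example values)
params.chipseeker__promoter_up   = 2000
params.chipseeker__promoter_down = 1000

// Default combination: a peak overlapping a genomic feature is assigned to that feature's gene
params.chipseeker__overlap        = 'all'
params.chipseeker__ignore_overlap = 'FALSE'
```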

1. Preprocessing: ATAC_reads

  • params.pigz__nb_threads: number of threads used for parallel compression. Default: 6.
  • params.bowtie2__nb_threads: number of threads used by Bowtie2. Default: 6.
  • params.sam_MAPQ_threshold: MAPQ threshold. Default: 30.
  • params.memory_picard: maximum memory used by Picard. Default: '20G'.
  • params.fastqc__nb_threads: number of threads used by FastQC. Default: 2.
  • params.do_bigwig: enable or disable this process. Default: true.
  • params.deeptools__binsize_bigwig_creation: size of the bins for the creation of the bigwig file. Smaller values increase computation time. Default: 10.
  • params.deeptools__binsize_bigwig_correlation: size of the bins for computing correlation between samples. Smaller values increase computation time. Default: 10000.
  • params.deeptools__nb_threads: number of threads used by DeepTools. Default: 6.
  • params.deeptools__nb_of_1_bp_samples: number of 1 bp sites to sample for the coverage plots. Default: 10000.
  • params.deeptools__normalization_method: normalization method to use when creating BigWig files. See here for options. Default: 'None'.
  • params.nb_sampled_aligned_reads: Number of aligned reads to sample. Default: 1000000.
  • params.nb_sampled_trimmed_reads: Number of trimmed reads to sample. Default: 1000000.
  • params.botwie2__nb_threads: number of threads used by Bowtie2. Default: 6.

1. Preprocessing: mRNA

  • params.kallisto__nb_threads: number of threads used by kallisto. Default: 6.
  • params.kallisto__bootstrap: Number of bootstrap samples. Default: '100'.
  • params.kallisto__fragment_len: Estimated average fragment length. For single end only. Default: '180'.
  • params.kallisto__fragment_sd: Estimated standard deviation of fragment length. For single end only. Default: '20'.
  • params.fastqc__nb_threads: number of threads used by FastQC. Default: 2.

2. Differential Abundance: DA_ATAC

Differential Binding Analysis:
See the function links for details and possible options. Details on the choice of default values can be found here. The parameters are:

  • For the dba function:
    • params.diffbind__analysis_method: Option to use DESeq2 or edgeR for the analysis. Default: 'DBA_EDGER'.
  • For edgeR analysis method:
    • params.diffbind__edger_tagwise: If using diffbind__analysis_method = 'edgeR' should tag-wise dispersion estimates be computed or not. See here and here for details. Default: 'TRUE'.
  • For the dba.blacklist function:
    • params.use_input_control: If an input control is used, grey list regions (regions of high signal in the input) will be estimated by DiffBind via the GreyListChIP package and excluded from analysis. Default: false.
    • params.diffbind__make_grey_list: Should a grey list be created or not. This option can be set to 'TRUE' only if params.use_input_control is also 'TRUE'. If 'TRUE', a grey list region will be created from the input control to hide hotspot regions. Default: 'FALSE'.
  • For the dba.count function:
    • params.diffbind__min_overlap: Only include peaks in at least this many peaksets when generating consensus peakset. The default behavior of cactus is to include any peak from any replicate into the consensus peak set (i.e. th = 1). Non robust signal should anyway have low p-value and be filtered away in downstream analysis. Default: 1.
    • params.diffbind__score: Score to use in the binding affinity matrix. Raw read counts are used for analysis. This parameter only influences the counts shown in the detailled_ATAC results tables (for each individual replicate). Default: 'DBA_SCORE_NORMALIZED'.
    • params.diffbind__sub_control: Option to determine if the input control reads should be subtracted from each site in each sample. Default: 'FALSE'.
    • params.diffbind__scale_control: Option to determine if reads should be scaled by library size when using the params.diffbind__sub_control option. Default: 'TRUE'.
    • params.diffbind__min_count: Minimum read count value. Any interval with fewer than this many overlapping reads will be set to have this count. Default: 0.
    • params.diffbind__summits: Option to control the summit heights and locations calculated for each peak. Default: 75.
    • params.diffbind__filter: Intervals with values lower than this are excluded from analysis. Default: 1.
  • For the dba.normalize function:
    • params.diffbind__normalization: Normalization method to use. Default: 'DBA_NORM_DEFAULT'.
    • params.diffbind__library_size: Method used to calculate library size. Default: 'DBA_LIBSIZE_BACKGROUND'.
    • params.diffbind__background: Should background bins be used for normalization. Can be 'FALSE', 'TRUE' (default bin size of 15000bp), or an integer (indicating the bin size). Default: 'TRUE'.
  • For the dba.contrast function:
    • params.diffbind__design: Should contrasts be specified with a formula or not. Default: 'TRUE'.

Annotations and figures:

  • Parameters of the annotatePeak function -> see part 1. ATAC_peaks above.
  • params.custom_distance__less_than_X_b: Custom threshold for the PA_ltXkb filter used to select peaks below a given distance (in base pairs) to the TSS of their closest gene. Default: 500000 (i.e., 500 kilobases).
  • params.custom_distance__more_than_Y_b: Custom threshold for the PA_mtYkb filter used to select peaks above a given distance (in base pairs) to the TSS of their closest gene. Default: 500000 (i.e., 500 kilobases).
  • params.diffbind_plots__fdr_threshold: Peaks with FDR less than or equal to this value are colored in red in the volcano plot. Default: 0.05.
  • params.diffbind_plots__top_n_labels: The top n peaks with lowest FDR will have their annotated gene displayed on the volcano plot. Default: 15.

2. Differential Abundance: DA_mRNA

Figures:

  • params.sleuth_plots__fdr_threshold: Genes with FDR less than or equal to this value are colored in red in the volcano plot. Default: 0.05.
  • params.sleuth_plots__top_n_labels: The top n genes with the lowest FDR will have their name displayed on the volcano plot. Default: 15.

2. Differential Abundance: Split

  • params.split__threshold_type: Defines if the threshold cutoff is based on FDR (adjusted p-value) or rank. Options: 'FDR', 'rank'. Default: 'FDR'.
  • params.split__threshold_values: Groovy list defining the threshold cutoff value(s). If params.split__threshold_type = 'rank', all entries ranked below this value will be kept (with entries ranked from lowest (rank = 1) to highest adjusted p-value). If params.split__threshold_type = 'FDR', all entries with a -log10(adjusted p-value) above this threshold will be kept. E.g., params.split__threshold_values = [ 1.3 ] will keep all entries with an adjusted p-value below 0.05 (i.e., -log10(0.05) = 1.30103). Multiple thresholds can be given, but they must all be of the same type (FDR or rank). Default: [ 1.3 ].
  • params.split__peak_assignment: Groovy list defining the peak assignment filters to use. Options are 'all' for including all peaks, or any PA filter from the DA_ATAC__saving_detailed_results_tables process written without the initial prefix 'PA_' (e.g., 'prom', 'distNC'...). Default: [ 'all' ].
  • params.split__keep_unique_genes: Should only unique DA and NDA genes be kept for downstream analysis. Default: 'TRUE'.
  • params.min_entries_DA_bed: Subsets with fewer entries than that will be filtered out from enrichment analysis. Default: 2.
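
Because FDR thresholds are given on the -log10 scale, it can help to write the conversion next to the values. The sketch below uses several FDR cutoffs and two peak assignment filters. The extra threshold values are illustrative, and 'prom' is the filter name mentioned above.

```groovy
// FDR-type thresholds are expressed as -log10(adjusted p-value):
//   adjusted p-value 0.05  -> -log10(0.05)  ≈ 1.3
//   adjusted p-value 0.01  -> -log10(0.01)  = 2
//   adjusted p-value 0.001 -> -log10(0.001) = 3
params.split__threshold_type   = 'FDR'
params.split__threshold_values = [ 1.3, 2, 3 ]

// Keep all peaks, and additionally the subset passing the PA_prom filter
params.split__peak_assignment = [ 'all', 'prom' ]
```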

3. Enrichment: Enrichment

  • params.disable_all_enrichments: If true, all enrichment analyses are disabled. Default: false.
  • params.do_only_self_enrichment: If true, only the self enrichment analyses are performed and all other enrichment analyses are disabled. Default: false.
  • params.do_genes_self_enrichment: Enable or disable genes self enrichment analysis. Default: true.
  • params.do_peaks_self_enrichment: Enable or disable peaks self enrichment analysis. Default: true.
  • params.do_func_anno_enrichment: Enable or disable gene set enrichment analysis. Default: true.
  • params.do_chrom_state_enrichment: Enable or disable chromatin states enrichment analysis. Default: true.
  • params.do_chip_enrichment: Enable or disable CHIP-Seq enrichment analysis. Default: true.
  • params.do_motif_enrichment: Enable or disable motifs enrichment analysis. Default: true.
  • params.use_nda_as_bg_for_func_anno: use non-differentially expressed genes as the background for differential analysis. If FALSE, all genes in the database are used. Default: 'FALSE'.
  • params.func_anno_databases: which database(s) to query for functional annotation enrichment analysis (KEGG, GO BP, GO CC or GO MF). Options: 'KEGG', 'CC', 'MF', 'BP'. Default: ['BP', 'KEGG'].
  • params.simplify_cutoff: Similarity cutoff to remove redundant GO terms. Default: 0.8.
  • params.chromatin_state: Chromatin state to use. Options are listed in the ${params.references_dir}/${params.species}/encode_chromatin_states_metadata.csv file. Mandatory. No default.
  • params.chip_ontology: CHIP ontology to use to filter the ENCODE CHIP files. Options are listed in the references/${species}/available_chip_ontology_groups.txt file, and details on the groups can be found in the references/${species}/encode_chip_metadata.csv file. Default: 'all'.
  • params.homer__nb_threads: number of threads used by HOMER. Default: 6.
  • params.motifs_test_type: The test to use for motif inputs. If 'binomial', a two-sided binomial test is performed instead of the two-sided Fisher test. Options: 'binomial' or 'fischer' (any value). Default: 'binomial'.
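
As an example, a run restricted to a subset of enrichment analyses could be configured as sketched below, using only the switches and option values listed above.

```groovy
// Skip the CHIP-Seq and motif enrichment analyses, keep the others enabled
params.do_chip_enrichment  = false
params.do_motif_enrichment = false

// Query GO BP, GO MF and KEGG for functional annotation enrichment
params.func_anno_databases = ['BP', 'MF', 'KEGG']

// Use non-differentially expressed genes as background instead of all genes in the database
params.use_nda_as_bg_for_func_anno = 'TRUE'
```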

3. Enrichment: Figures

  • params.save_barplots_rds: Should barplots be saved as rds object or not. Default: false.

  • params.save_heatmaps_rds: Should heatmaps be saved as rds object or not. Default: false.

  • params.common__{padj_bin_breaks,barplots_params,barplots_ggplot,heatmaps_params,heatmaps_ggplot,heatmaps_filter}: These parameters allow setting the same value for every enrichment category. There is one parameter for each enrichment category (e.g., params.common__barplots_params). If null, this parameter is disabled; otherwise its value is used for the corresponding parameter of each enrichment category. Default: null.

  • params.padj_bin_breaks__{genes_self,peaks_self,func_anno,chrom_states,CHIP,motifs}: A string converted to a vector in R containing the 5 adjusted p-value bins cutoff. There is one parameter for each enrichment category. Default: "c( 0.2, 0.05, 1e-5, 1e-20, 1e-100 )".

  • params.barplots_params__{genes_self,peaks_self,func_anno,chrom_states,CHIP,motifs}: A string converted to a vector in R containing options to customize the barplots. There is one parameter for each enrichment category. Default: "c( 0.05, T, 'none', F, 50, 30 )". The options are in order:

    • padj_threshold: If no adjusted pvalue is above this threshold the process is stopped and no figure is made.
    • signed_padj: Should enrichment and depletion be shown (T) or enrichment only (F).
    • add_var: Add a variable to the plots as a small dot. Options: 'none' (nothing added; default), 'L2OR' (log2 odd ratio), 'ov_da' (overlap of DA entries with target; i.e. counts), 'padj_loglog' (pvalues in a log scale (higher values equals lower pvalues). formula: log10(-log10(pval) + 1)).
    • add_number: Write the number count on the plots.
    • max_characters: The length limit of terms names.
    • max_terms: Number of terms to display.

  • params.barplots_ggplot__{genes_self,peaks_self,func_anno,chrom_states,CHIP,motifs}: A string converted to a vector in R containing options to customize the appearance of the barplots by tweaking ggplot2 parameters. There is one parameter for each enrichment category. Default: "c( 11, 10, 7 )". The options are in order:

    • axis_text_size: Axis text size.
    • title_text_size: Title text size.
    • legend_text_size: Legend text size.
  • params.heatmaps__seed: random seed for the selection of terms. Default: 38.

  • params.heatmaps_params__{genes_self,peaks_self,func_anno,chrom_states,CHIP,motifs}: A string converted to a vector in R containing options to customize the heatmaps. There is one parameter for each enrichment category. Default for genes_self and peaks_self: "c( 0.05, T, 'none', T, 50, 'UUDD', 0 )". Default for func_anno, chrom_states, CHIP and motifs: "c( 0.05, T, 'none', F, 50, 'UUDD', 0 )". The options are in order:

    • padj_threshold: If no adjusted pvalue is above this threshold the process is stopped and no figure is made.
    • signed_padj: Should enrichment and depletion be shown (T) or enrichment only (F).
    • add_var: Add a variable to the plots as a small dot. Options: 'none' (nothing added; default), 'L2OR' (log2 odd ratio), 'ov_da' (overlap of DA entries with target; i.e. counts), 'padj_loglog' (pvalues in a log scale (higher values equals lower pvalues). formula: log10(-log10(pval) + 1)).
    • add_number: Write the overlap count on the cells.
    • max_characters: The limit of target names length. Longer target names are cut.
    • up_down_pattern: The pattern of how Fold Changes are displayed. Options: "UDUD" (up, down, up, down...) or "UUDD" (up, up, ..., down, down ...).
    • cell_text_size: Allows to control text size in the cells of the heatmap if the add_number parameter is set to true. If set to zero, the text size will be determined automatically by Cactus according to the number of comparisons on the heatmap.
  • params.heatmaps_ggplot__{genes_self,peaks_self,func_anno,chrom_states,CHIP,motifs}: A string converted to a vector in R containing options to customize the appearance of the heatmaps by tweaking ggplot2 parameters. There is one parameter for each enrichment category. Default: "c( 11, 10, 7 )". The options are in order:

    • axis_text_size: Axis text size.
    • title_text_size: Title text size.
    • legend_text_size: Legend text size.
  • params.heatmaps_filter__{func_anno,CHIP,motifs}: A string converted to a vector in R containing options to customize the selection of terms for the heatmaps. Such filtering parameters are only available for the func_anno, CHIP and motifs enrichment categories. Default for func_anno: "c( 26, 18, 8, F, 2, 'ward.D', F)". Default for CHIP and motifs: "c( 40, 30, 10, T, 2, 'ward.D', F)". The options are in order:

    • n_total: Total number of terms to select. This number should be higher than or equal to n_shared + n_unique. If it is higher, the remaining slots are taken by the terms with the lowest pvalues across all COMP_FC (with ties sorted randomly).
    • n_shared: Number of shared terms to select. Shared terms are defined as terms with the highest median absolute -log10 pvalue across COMP_FC.
    • n_unique: Number of top terms to select. top_N is defined as n_unique / n_comp (with n_comp being the number of COMP_FC) rounded to the lower bound. Then for each COMP_FC, the top_N terms with the lowest pvalues are selected.
    • remove_similar: If true (T), entries with similar names will be removed. Similar names are defined as names that are the same before the final underscore; e.g., FOXO_L1 and FOXO_L2. For each group of similar entries, the lowest pvalue of each entry is computed and the top remove_similar_n entries with the lowest pvalue are kept.
    • remove_similar_n: See remove_similar above.
    • agglomeration_method: Agglomeration method used for hierarchical clustering of selected terms on the y-axis. See here for options.
    • select_enriched: Boolean indicating if only the most enriched terms should be selected (if TRUE/T) or the most enriched or depleted terms (if FALSE/F).
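
Since these figure parameters are positional R vectors passed as strings, writing the option order as a comment next to each value can reduce mistakes. The sketch below changes one barplot option and spells out the arithmetic behind the default motif heatmap filter. The choice of 3 COMP_FC is an illustrative assumption.

```groovy
// Barplots for functional annotation: default values, but with the log2 odds ratio added as a dot
// Order: padj_threshold, signed_padj, add_var, add_number, max_characters, max_terms
params.barplots_params__func_anno = "c( 0.05, T, 'L2OR', F, 50, 30 )"

// Heatmap term selection for motifs (default values)
// Order: n_total, n_shared, n_unique, remove_similar, remove_similar_n, agglomeration_method, select_enriched
// Check: n_total (40) >= n_shared (30) + n_unique (10)
// With 3 COMP_FC (illustrative): top_N = floor(10 / 3) = 3 terms per COMP_FC
params.heatmaps_filter__motifs = "c( 40, 30, 10, T, 2, 'ward.D', F )"

// Apply the same adjusted p-value bins to every enrichment category
params.common__padj_bin_breaks = "c( 0.2, 0.05, 1e-5, 1e-20, 1e-100 )"
```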

3. Enrichment: Tables

  • params.v_fdr_thresholds: Vector of thresholds for filtering tables. For each data type, entries with FDR above this threshold will be removed. Default:
tables__v_fdr_thresholds = 
	'c( mRNA_detailed = 1, ATAC_detailed = 1,' +
			'res_simple = 1, res_filter = 1, func_anno = 1,' +
			'genes_self = 1, peaks_self = 1, ' +
			'chrom_states = 1, CHIP = 1, motifs = 1' +
	')' 
  • params.excel__add_conditional_formatting: To enable or disable conditional coloring. Default: 'TRUE'.
  • params.excel__max_width: Maximum column width. Default: 40.