
April 11, 2024

Introduction
Downloading references
Structure of the references folders
Pipeline to get references
CHIP-Seq
- Worm
- Fly
- Mouse
- Human

Introduction

Cactus relies on a set of pre-parsed references that are used throughout its scripts. They are downloaded automatically the first time the pipeline is run for a given species.

The size of the Cactus references for each species is:

species	Size (Gb)	Assembly	Nickname	Ensembl Release
worm	1.3	WBcel235	ce11	107
fly	1.5	BDGP6.32	dm6	107
mouse	12.0	GRCm38	mm10	102
human	20.0	GRCh38	hg38	107

The latest Ensembl release available at the time (release 107, July 2022) was used for worm, fly and human. For mouse, the older Ensembl release 102 (Nov. 2020) was used together with the genome assembly mm10, since mm39 is not yet available in Homer or in the Encode bed files. Homer files were downloaded for genomes (v6.4), organisms (v6.3) and promoters (v5.5), and NCBI OrgDb objects (v3.14) were used to obtain KEGG pathways.

Please feel free to raise an issue on the Cactus GitHub page if you need references built with different versions of the genomes or of other tools. I will create them for you and make a new release hosted on Figshare.

Note that, to save space, one can download the species dataset and then keep only the chromatin states of interest. Indeed, Cactus uses only one chromatin state file, but several are available so that users can choose the one that best matches the conditions under study. For mouse and human, this should reduce the size of the dataset by about 8 Gb and 1 Gb, respectively.
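As a minimal sketch of this space-saving step, the Python helper below removes every chromatin state folder except the ones listed in `keep`. It assumes that `species_dir` is the root folder of one species' references and that it contains a `chromatin_states` folder with one sub-folder per chromatin state profile. The function name, the `dry_run` flag and the folder names are hypothetical and not part of Cactus.

```python
import shutil
from pathlib import Path


def prune_chromatin_states(species_dir, keep, dry_run=True):
    """Remove chromatin state folders that are not listed in `keep`.

    Returns the sorted names of the folders that were (or would be) removed.
    With dry_run=True (the default) nothing is deleted.
    """
    # Assumed layout: <species_dir>/chromatin_states/<one folder per profile>
    states_dir = Path(species_dir) / "chromatin_states"
    keep = set(keep)
    removed = []
    for folder in sorted(states_dir.iterdir()):
        if folder.is_dir() and folder.name not in keep:
            removed.append(folder.name)
            if not dry_run:
                shutil.rmtree(folder)
    return removed
```

Running the function once with `dry_run=True` and checking the returned list before deleting anything is a cautious way to use it.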

Downloading references

The references can be downloaded with this command:

nextflow run jsalignon/cactus/scripts/download/download.nf --references --species worm -r main -latest

The parameters for this command are:

params.references_dir: should be set in the ./.cactus.config file to indicate the directory in which the references should be downloaded
--species: can be any of the 4 species supported by Cactus (worm, fly, mouse or human)
--threads: can be set to determine the number of threads used by pigz to uncompress the reference archive files
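To fetch the references for several species in one go, the command above can be assembled programmatically. The sketch below only builds and prints the command lines. It reuses the flags shown in this section, while the helper name `download_command` and the thread count of 4 are illustrative assumptions.

```python
import shlex

SUPPORTED_SPECIES = ("worm", "fly", "mouse", "human")


def download_command(species, threads=None):
    """Return the download command for one species as a list of arguments."""
    if species not in SUPPORTED_SPECIES:
        raise ValueError(f"unsupported species: {species}")
    cmd = [
        "nextflow", "run", "jsalignon/cactus/scripts/download/download.nf",
        "--references", "--species", species, "-r", "main", "-latest",
    ]
    if threads is not None:
        # --threads controls how many threads pigz uses to uncompress archives
        cmd += ["--threads", str(int(threads))]
    return cmd


# Print one command line per supported species (nothing is executed here)
for species in SUPPORTED_SPECIES:
    print(shlex.join(download_command(species, threads=4)))
```

Each returned list can be passed to `subprocess.run` to actually start the download.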

Structure of the references folders

.
├── bowtie2_indexes_conta
├── CHIP
├── chromatin_states
│   ├── ~one folder per state
├── genome
│   ├── annotation
│   │   ├── bed_regions
│   │   ├── filtered
│   │   └── R
│   └── sequence
│       └── bowtie2_indexes
└── homer_data
    ├── accession
    ├── genomes
    │   └── ~nickname
    ├── GO
    └── promoters

Pipeline to get references

Directed acyclic graph (DAG) generated by Nextflow and tweaked in the Mermaid live editor

The pipeline downloads the references from various sources and processes them to produce all the files required by Cactus. These can be grouped into 6 broad categories: parsed genome sequences, parsed genome annotations, CHIP-Seq files, chromatin state files, parsed motif files and software (Homer data), and bowtie2 indexes of a contaminant genome. To increase reproducibility, all processes (except the one that downloads the HiHMM files) are encapsulated within containers. Here are some details on what the parsing pipeline does:

Motifs:
- Homer data: The latest genome (6.4), organism (6.3) and promoters (5.5) versions were downloaded from http://homer.ucsd.edu/homer.
- CISBP Motifs: The TF_Information.txt file is downloaded from http://cisbp.ccbr.utoronto.ca. For each transcription factor, this file contains either direct motifs or, if none are available, motifs inferred with more than 90% confidence (all motifs with the best score are kept). There are 10,329 motifs from the Encode species. 2,779 motifs were kept after removing duplicated motifs (inferred motifs with the same score), motifs with empty names and empty motifs (worm: 372, fly: 424, mouse: 872, human: 1,111). Homer thresholds were computed as 0.5 x motif_log2_odd_score, with motif_log2_odd_score = sum(log2(max_weight_by_position / 0.25)). Finally, for each species, a homer_motifs.txt file is generated that contains, for each motif, the consensus sequence, the TF name, the Homer threshold and the motif itself.
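The Homer threshold formula above can be illustrated with a short Python sketch. It assumes that a motif is given as a list of positions, each holding four base probabilities (A, C, G, T), and that 0.25 is the uniform background frequency. The function name and the toy motif are made up for the example and are not taken from the Cactus scripts.

```python
import math


def homer_threshold(pwm, background=0.25, fraction=0.5):
    """Compute fraction * sum(log2(max_weight_by_position / background))."""
    log2_odd_score = sum(
        math.log2(max(position) / background) for position in pwm
    )
    return fraction * log2_odd_score


# Toy 3-position motif (rows are positions, columns are A, C, G, T)
pwm = [
    [1.0, 0.0, 0.0, 0.0],      # log2(1.0 / 0.25)  = 2
    [0.5, 0.5, 0.0, 0.0],      # log2(0.5 / 0.25)  = 1
    [0.25, 0.25, 0.25, 0.25],  # log2(0.25 / 0.25) = 0
]
print(homer_threshold(pwm))  # 0.5 * (2 + 1 + 0) = 1.5
```

The log-odds sum is the score of the best possible match to the motif, so the threshold corresponds to half of that maximum score.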
Blacklisted regions: Blacklisted regions were downloaded from https://github.com/Boyle-Lab/Blacklist/tree/master/lists. Contig names were converted from NCBI to Ensembl names using the UpdateContigNames function from cvbio.
CHIP-Seq: The ENCODE API was used to get data and metadata. 2,714 CHIP-Seq bed files were selected and downloaded (worm: 473, fly: 531, mouse: 156, human: 1,554) using these filters: assay_title = "TF ChIP-seq" and output_type = "optimal IDR thresholded peaks", and md5 sums were checked. The slim annotations (cell, organ, development and system) were parsed and used to create groups of CHIP-Seq files that share the same annotations and can be used for more detailed analysis (see Ontology groups sections and Parameters). The details of the CHIP present in each group can be found in the file CHIP/chip_ontology_groups.tsv. The biosample_summary column (for fly and worm) or the term_name column (=ontology, for human and mouse) was parsed to create a cell/stage id column (see Legend sections). An identifier was added to distinguish duplicate target gene symbols (most are unique). Finally, each CHIP was assigned a name combining its unique target gene symbol and its cell/stage id (format: TargetGene_CellStage). A detailed metadata file describing each CHIP can be found in the root species folder under the name encode_chip_metadata.csv.
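The naming rule (TargetGene_CellStage, with an identifier for duplicated targets) can be sketched in Python as below. The text does not specify how the identifier is built, so the numeric suffix used here (`.1`, `.2`, ...) is purely an assumption for illustration, as are the function name and the example records.

```python
from collections import Counter


def chip_names(records):
    """Build TargetGene_CellStage names from (target_gene, cell_stage_id) pairs."""
    totals = Counter(records)  # how often each (target, stage) pair occurs
    seen = Counter()           # running count used to number duplicates
    names = []
    for target, stage in records:
        key = (target, stage)
        seen[key] += 1
        # Hypothetical scheme: number the target only when the pair is repeated
        label = f"{target}.{seen[key]}" if totals[key] > 1 else target
        names.append(f"{label}_{stage}")
    return names


records = [("CTCF", "K562"), ("CTCF", "K562"), ("GATA1", "K562")]
print(chip_names(records))  # ['CTCF.1_K562', 'CTCF.2_K562', 'GATA1_K562']
```

The names actually used by Cactus are listed in encode_chip_metadata.csv, which remains the reference.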
Chromatin States:
- ChromHMM: ChromHMM-derived 18-state chromatin state profiles have been generated for 833 human reference samples and 66 mouse reference samples. These were downloaded through the ENCODE API using the filter annotation_type = chromatin state and a custom filter to keep only the 18-state model, and md5 sums were checked. Details on the chromatin states can be found in Figure 1 for mouse and Extended Data Fig. 3 for human. A detailed metadata file describing each available ChromHMM chromatin state profile can be found in the root species folder under the name encode_chromatin_states_metadata.csv. For internal reasons, a new folder is created for each chromatin state profile, containing each state in a separate bed file.
- HiHMM: HiHMM-derived 17-state chromatin state profiles have been generated for worm (EE, L3), fly (EE, L3) and human (H1-hESC, GM12878). Using eight histone marks mapped in common across the 3 species, the authors of the paper made chromatin state profiles that are comparable across species. This can be useful when working on different species (e.g. checking whether an enrichment in enhancers found in worm is conserved in human). Details on the chromatin states can be found in Figure 2. HiHMM profiles were downloaded from http://compbio.med.harvard.edu/modencode/webpage/hihmm. The liftOver software was used to convert genomic coordinates, since older genome assemblies were used (hg19 for human, dm3 for fly and ce10 for worm).
Bowtie 2 indexes of contaminant genome: Cactus checks for potential contamination of samples with genetic material from another species. By default, the contaminant genome is that of the E. coli strain OP50, with genome assembly ASM949659v1. The genomic sequence was downloaded from the NCBI FTP server and Bowtie2 was used to build the indexes. OP50 is usually given as a food source to C. elegans cultures and is hence a common contaminant in worm experiments. However, users can replace the contaminant genome bowtie2 indexes with those of another genome if they suspect contamination from another species.
Genome:
- Sequences: Fasta files were downloaded from the Ensembl FTP server and checksums were checked. Bowtie2 indexes were built. Genomic fasta sequences were indexed with SAMtools. The transcriptome was obtained with gffread, using the -C argument to keep only mRNAs with a complete CDS (from coding genes). Then, Kallisto indexes were built.
- Annotations:
  - GFF3 files were downloaded from the Ensembl FTP server and checksums were checked. A table mapping all transcripts to all genes was extracted in R from the raw GFF3 file to allow mapping transcripts to genes with Sleuth.
  - Then, the GFF3 file was filtered to keep only protein coding genes within nuclear chromosomes or contigs with at least 5 genes. Contigs with few genes are excluded since ATAC-Seq peaks are annotated to the closest gene by ChIPseeker and small contigs have higher chances of misannotation. Excluding the mitochondrial chromosome is useful in ATAC-Seq since the mitochondrial genome is not packaged in chromatin and is therefore very accessible, resulting in very high coverage of ATAC-Seq data. Currently, Cactus filters out all non-protein coding genes (e.g., feature types: tRNA, rRNA, snoRNA, piRNA, ncRNA, miRNA, lnc_RNA... with parent 'ncRNA_gene', or pseudogenes) to make sure that all analyses are focused on protein coding genes. Since ncRNA genes are filtered out, a peak that falls on the exon of a non-coding gene will be annotated as being in an intergenic region by ChIPseeker. This is a design choice: Cactus aims to find concordant regulation of mRNAs between ATAC-Seq and mRNA-Seq data, so peaks are annotated with respect to coding genes and non-coding genes are ignored. Differently filtered GFF3 files may be made available for users to choose from in the future.
  - Chromosome sizes were extracted from this filtered GFF3 file and a seqinfo object was made (needed to make greylists in DiffBind) with the Seqinfo function of the GenomeInfoDb package. A TxDb object was made from the filtered GFF3 file using the makeTxDbFromGFF function from the GenomicFeatures package. This TxDb object is used to annotate peaks with ChIPseeker. OrgDb packages were obtained with the AnnotationHub package using Bioconductor release 3.14 (Oct. 2021, latest release available for OrgDb objects on AnnotationHub), and saved via the AnnotationDbi function saveDb. The OrgDb objects are used by clusterProfiler for functional annotation enrichment analysis. Finally, bed files of annotated regions (exons, genes, intergenic regions, introns and promoters) are created with bedops. These files are used by Cactus to generate statistics on where sampled aligned reads map on the genome.
Timestamp: For full reproducibility, a timestamp.txt file is made in the reference's root folder. This file includes information on the date on which the reference was built and on the versions of Nextflow and of the package manager tool that were used when building the reference.
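The gene filtering rule described in the Annotations item above (protein coding genes only, no mitochondrial chromosome, contigs with at least 5 genes) can be sketched in Python. This is not the R code used by the pipeline. The record layout, the function name and the set of mitochondrial sequence names are assumptions, and the sketch counts genes per contig after the protein-coding filter, which the text does not specify.

```python
from collections import Counter

# Assumed Ensembl names of the mitochondrial sequence (not stated in the text)
MITOCHONDRIAL = {"MT", "MtDNA", "mitochondrion_genome"}


def filter_genes(genes, min_genes_per_contig=5):
    """Keep nuclear protein coding genes located on sufficiently large contigs.

    genes: list of dicts with the keys 'id', 'contig' and 'biotype'.
    """
    coding = [
        g for g in genes
        if g["biotype"] == "protein_coding" and g["contig"] not in MITOCHONDRIAL
    ]
    # Count the remaining genes per contig, then drop the small contigs
    per_contig = Counter(g["contig"] for g in coding)
    return [g for g in coding if per_contig[g["contig"]] >= min_genes_per_contig]
```

With this rule, a contig carrying a single coding gene is dropped entirely, so no ATAC-Seq peak can be annotated to that gene.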

NOTE: Custom references can be built if needed (e.g. for using a different genome version) by modifying the parsing script and then running it with these commands:

cd $cactus_dir/references
nextflow run $cactus_dir/scripts/create_references/create_references.nf -profile singularity

CHIP-Seq

Worm

Legend

L1: L1 stage
YA: Young Adult
L4: L4 stage
LE: Late Embryo
L3: L3 stage
L2: L2 stage
MS: Mixed Stage embryo
EE: Early Embryo
ME: Mid-Embryo
Da: Dauer

Distribution

	L1 YA L4 LE L3 L2 MS EE ME Da
	93 86 77 67 50 50 35 12  2  1

Ontology groups

ontology	number_of_chip
cell_line.whole_organism	473
all	473

Fly

Legend

Em: Embryo
PP: Prepupa
WT: Wandering third instar larval stage
Pu: Pupa
FA: Female Adult
WP: White prepupa stage
MA: Male Adult
Kc: Kc167 cell line (embryonic)
SO: Cell line from strain Oregon-R S2
MS: Mixed Sex Adult
OF: Ovary female
La: larva (48 days)

Distribution

	 Em  PP  WT  Pu  FA  WP  MA  Kc  SO  MS  OF  La
	372  57  46  14  10   8   8   7   3   3   1   1

Ontology groups

ontology	number_of_chip
all	531
cell_line.whole_organism	520

Mouse

Legend

Other: Any ontology present less than 4 times
CH12LX: CH12.LX
G1EER4: G1E-ER4
ESE14: ES-E14

Distribution

   MEL CH12LX  Other  liver   lung  heart G1EER4    G1E  ESE14
    49     39     36      7      5      5      5      5      5

Ontology groups

ontology	number_of_chip
all	156
development.mesoderm	107
system.immune_system	96
system.digestive_system	62
organ.immune_organ	52
organ.spleen	50
cell_line.MEL	49
cell_type.cancer_cell	49
cell_type.hematopoietic_cell	43
cell_type.leukocyte	41
cell_line.CH12.LX	39
cell_type.B_cell	39
organ.lymph_node	39
organ.blood	39
organ.bodily_fluid	39
organ.embryo	23
cell_type.stem_cell	21
development.endoderm	19
cell_type.embryonic_cell	11

Human

Legend

NC: Neural Cell
Other: Any ontology present less than 5 times
Ishikaw: Ishikawa
SKNSH: SK-N-SH

Distribution

   K562   HepG2  HEK293 GM12878   Other    MCF7    A549      H1   liver Ishikaw
    432     249     193     156     137     106      55      54      35      24

  SKNSH  HeLaS3 HEK293T   IMR90  MCF10A  HCT116    T47D GM12891      NC GM23338
     19      19      17      12      10      10       7       7       6       6

Ontology groups

ontology	number_of_chip
all	1554
cell_type.cancer_cell	956
development.mesoderm	932
cell_type.epithelial_cell	645
organ.epithelium	645
system.immune_system	637
organ.blood	631
cell_type.hematopoietic_cell	630
cell_type.leukocyte	628
organ.bodily_fluid	624
cell_line.K562	432
system.exocrine_system	411
organ.exocrine_gland	411
development.endoderm	401
system.digestive_system	306
system.endocrine_system	289
organ.liver	284
organ.endocrine_gland	284
cell_line.HepG2	249
system.excretory_system	212
organ.kidney	212
cell_line.HEK293	210
cell_type.B_cell	188
development.ectoderm	187
cell_line.GM12878	156
system.integumental_system	143
organ.mammary_gland	127
cell_line.MCF7	106
system.respiratory_system	75
organ.lung	75
system.reproductive_system	71
cell_type.stem_cell	60
organ.embryo	59
cell_line.A549	55
cell_type.embryonic_cell	55
cell_line.H1	54
system.central_nervous_system	46
organ.uterus	43
organ.brain	38
organ.connective_tissue	36
cell_line.liver	35
cell_type.fibroblast	35
cell_type.connective_tissue_cell	32
cell_line.Ishikawa	24
cell_type.neuroblastoma_cell	20
cell_line.HeLaS3	19
cell_line.SKNSH	19
cell_type.neural_cell	19
cell_line.HEK293T	17
organ.skin_of_body	17
organ.large_intestine	15
organ.colon	15
organ.intestine	15
organ.prostate_gland	14
cell_line.IMR90	12
system.circulatory_system	11