April 11, 2024
Menu
- Introduction
- Downloading references
- Structure of the references folders
- Pipeline to get references
- CHIP-Seq
Introduction
Cactus uses a set of preparsed references that are used throughout the scripts. These are automatically downloaded the first time the pipeline is run for the given species under investigation.
The size of the Cactus references for each species is:
| species | Size (Gb) | Assembly | Nickname | Ensembl Release |
|---|---|---|---|---|
| worm | 1.3 | WBcel235 | ce11 | 107 |
| fly | 1.5 | BDGP6.32 | dm6 | 107 |
| mouse | 12.0 | GRCm38 | mm10 | 102 |
| human | 20.0 | GRCh38 | hg38 | 107 |
The latest Ensembl release (107, July 2022) has been used for worm, fly and human. For mouse, the older Ensembl release 102 (Nov. 2020) has been used together with the genome assembly mm10, since mm39 is not yet available in Homer or in the ENCODE bed files. Currently, Homer files were downloaded for genomes (v6.4), organisms (v6.3) and promoters (v5.5), and NCBI OrgDb objects (v3.14) were used to obtain KEGG pathways.
Please feel free to raise an issue on the Cactus GitHub page if you need references with a different version of the genomes or of other tools. I will create them for you and make a new release hosted on Figshare.
Note that, in order to save space, one can download the species dataset and then keep only the chromatin states of interest. Indeed, only one chromatin state file is used by Cactus, but several are available for users to choose from depending on the conditions under study. For mouse and human, this should reduce the size of the dataset by about 8 Gb and 1 Gb, respectively.
Downloading references
The references can be downloaded with this command:
nextflow run jsalignon/cactus/scripts/download/download.nf --references --species worm -r main -latest
The parameters for this command are:
- params.references_dir: should be set in the ./.cactus.config file to indicate in which directory the references should be downloaded
- --species: can be any of the 4 species supported by Cactus (worm, fly, mouse or human)
- --threads: can be set to determine the number of threads used by pigz for uncompressing the reference archive files
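To fetch the references for several species in one go, the command above can be assembled programmatically. The following is only a minimal sketch: the helper name `build_download_command` is hypothetical, and it uses nothing but the flags listed above (it assumes `--threads` takes its value as the next argument, as Nextflow parameters usually do).

```python
SUPPORTED_SPECIES = ("worm", "fly", "mouse", "human")


def build_download_command(species, threads=None):
    """Return the argument list of the download command for one species.

    Hypothetical helper: it only reassembles the flags documented above.
    """
    if species not in SUPPORTED_SPECIES:
        raise ValueError(f"unsupported species: {species!r}")
    cmd = [
        "nextflow", "run", "jsalignon/cactus/scripts/download/download.nf",
        "--references", "--species", species,
    ]
    if threads is not None:
        # Assumption: the value is passed as a separate argument.
        cmd += ["--threads", str(threads)]
    cmd += ["-r", "main", "-latest"]
    return cmd


if __name__ == "__main__":
    for species in ("worm", "fly"):
        # Print the commands; pass each list to subprocess.run(..., check=True)
        # to actually execute them.
        print(" ".join(build_download_command(species, threads=4)))
```

The destination folder is still taken from params.references_dir in the ./.cactus.config file, so it does not appear on the command line.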
Structure of the references folders
.
├── bowtie2_indexes_conta
├── CHIP
├── chromatin_states
│   └── ~one folder per state
├── genome
│   ├── annotation
│   │   ├── bed_regions
│   │   ├── filtered
│   │   └── R
│   └── sequence
│       └── bowtie2_indexes
└── homer_data
    ├── accession
    ├── genomes
    │   └── ~nickname
    ├── GO
    └── promoters
Pipeline to get references
Directed acyclic graph generated by Nextflow and tweaked on the Mermaid live editor
The pipeline downloads the references from various sources and processes them to produce all the files required by Cactus. These can be grouped into 6 broad categories: parsed genome sequences, parsed genome annotations, CHIP-Seq files, chromatin state files, parsed motif files and software (Homer data), and bowtie2 indexes of a contaminant genome. To increase reproducibility, all processes (except the one that downloads HiHMM files) are encapsulated within containers. Here are some details on what the parsing pipeline does:
- Motifs:
  - Homer data: The latest genome (6.4), organism (6.3) and promoters (5.5) versions were downloaded from http://homer.ucsd.edu/homer.
  - CISBP Motifs: The TF_Information.txt file is downloaded from http://cisbp.ccbr.utoronto.ca. For each transcription factor, this file contains either direct motifs or, if none are available, motifs inferred with more than 90% confidence (all motifs with the best score are kept). There are 10,329 motifs from the ENCODE species. 2,779 motifs were kept after removing duplicated motifs (inferred motifs with the same score), motifs with empty names and empty motifs (worm: 372, fly: 424, mouse: 872, human: 1,111). Homer thresholds were computed as: 0.5 x motif_log2_odd_score, with motif_log2_odd_score = sum(log2(max_weight_by_position / 0.25)). Finally, for each species, a homer_motifs.txt file is generated that contains the consensus sequence, the TF name, the Homer threshold and the motif itself.
- Blacklisted regions: Blacklisted regions were downloaded from https://github.com/Boyle-Lab/Blacklist/tree/master/lists. Contig names were converted from NCBI to Ensembl names using the UpdateContigNames function from cvbio.
- CHIP-Seq: The ENCODE API was used to get data and metadata. 2,714 CHIP-Seq bed files were selected and downloaded (worm: 473, fly: 531, mouse: 156, human: 1,554) using these filters: assay_title = "TF ChIP-seq" and output_type = "optimal IDR thresholded peaks". Md5 sums were checked. The slim annotations (cell, organ, development and system) were parsed and used to create groups of CHIP-Seq profiles that share the same annotations and can be used for more detailed analysis (see the Ontology groups sections and Parameters). The details of the CHIP profiles present in each group can be found in the files CHIP/chip_ontology_groups.tsv. The biosample_summary column (for fly and worm) or the term_name column (i.e. the ontology, for human and mouse) was parsed to create a cell/stage id column (see the Legend sections). An identifier was added to distinguish duplicate target gene symbols (most are unique). Finally, each CHIP was assigned a name combining its unique target gene symbol and its cell/stage id (format: TargetGene_CellStage). A detailed metadata file describing each CHIP can be found in the root species folder under the name encode_chip_metadata.csv.
- Chromatin States
  - ChromHMM: ChromHMM-derived 18-state chromatin state profiles have been generated for 833 human reference samples and 66 mouse reference samples. These were downloaded with the ENCODE API using this filter: annotation_type = chromatin state, plus a custom filter to keep only the 18-state model. Md5 sums were checked. Details on the chromatin states can be found in Figure 1 for mouse and Extended Data Fig. 3 for human. A detailed metadata file describing each available ChromHMM chromatin state profile can be found in the root species folder under the name encode_chromatin_states_metadata.csv. For internal reasons, a new folder is created for each chromatin state profile that contains each state in a separate bed file.
  - HiHMM: HiHMM-derived 17-state chromatin state profiles have been generated for worm (EE, L3), fly (EE, L3) and human (H1-hESC, GM12878). Using eight histone marks mapped in common across the 3 species, the authors of the paper made chromatin state profiles that are comparable across species. This can be useful when working on different species (e.g. checking whether an enrichment in enhancers found in worm is conserved in human). Details on the chromatin states can be found in Figure 2. HiHMM profiles were downloaded from http://compbio.med.harvard.edu/modencode/webpage/hihmm. The liftover software was used to convert genomic coordinates since older genome assemblies were used (hg19 for human, dm3 for fly and ce10 for worm).
- Bowtie 2 indexes of contaminant genome: Cactus checks for potential contamination of samples with genetic material of another species. By default, the contaminant genome is an E. coli strain, OP50, with genome assembly ASM949659v1. The genomic sequence was downloaded from the NCBI FTP server and Bowtie2 was used to build indexes. OP50 is usually given as a food source to C. elegans cultures and is hence a common contaminant in worm experiments. However, users can replace the contaminant genome bowtie2 indexes with others if they suspect contamination from another species.
- Genome
  - Sequences: Fasta files were downloaded from the Ensembl FTP server and checksums were checked. Bowtie2 indexes were built. Genomic fasta sequences were indexed with SAMtools. The transcriptome was obtained with gffread, using the -C argument to keep only mRNAs with a complete CDS (from coding genes). Then, Kallisto indexes were built.
  - Annotations:
    - GFF3 files were downloaded from the Ensembl FTP server and checksums were checked. A table mapping all transcripts to all genes was extracted in R from the raw GFF3 file to allow mapping transcripts to genes with Sleuth.
    - Then, the GFF3 file was filtered to keep only protein coding genes within nuclear chromosomes or contigs with at least 5 genes. Contigs with few genes are excluded since ATAC-Seq peaks are annotated to the closest gene by ChIPseeker and small contigs have higher chances of misannotation. Excluding the mitochondrial chromosome is useful in ATAC-Seq since the mitochondrial genome is not packaged in chromatin and is therefore very accessible, resulting in very high coverage of ATAC-Seq data. Currently, Cactus filters out all non-protein coding genes (e.g., feature types: tRNA, rRNA, snoRNA, piRNA, ncRNA, miRNA, lnc_RNA... with parent 'ncRNA_gene', or pseudogenes) to make sure that all analyses are focused on protein coding genes. Since ncRNA genes are filtered out, a peak that falls on the exon of a non-coding gene will be annotated as being in an intergenic region by ChIPseeker. This is a design choice: Cactus aims to find concordant regulation of mRNAs between ATAC-Seq and mRNA-Seq data, so peaks are annotated relative to coding genes and non-coding genes are ignored. Differently filtered GFF3 files may be available for the user to choose from in the future.
    - Chromosome sizes were extracted from this filtered GFF3 file and a seqinfo object was made (needed to make greylists in DiffBind) with the Seqinfo function of the GenomeInfoDb package. A txdb object was made from the filtered GFF3 file using the makeTxDbFromGFF function from the GenomicFeatures package. This txdb object is used to annotate peaks with ChIPseeker. OrgDb packages were obtained with the AnnotationHub package using Bioconductor release 3.14 (Oct. 2021, latest release available for orgdb objects on AnnotationHub), and saved via the AnnotationDbi function saveDb. The orgdb objects are used by clusterProfiler for functional annotation enrichment analysis. Finally, bed files of annotated regions (exons, genes, intergenic regions, introns and promoters) are created with bedops. These files are used by Cactus to generate statistics on where sampled aligned reads map on the genome.
- Timestamp: For full reproducibility, a timestamp.txt file is made in the reference's root folder. This file includes information on the date on which the reference was built and on the versions of Nextflow and of the package manager tool that were used when building the reference.
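The Homer threshold formula given under CISBP Motifs can be illustrated with a small Python sketch. This is not the pipeline's own code (the text above suggests the parsing is done elsewhere); the function name and the toy motif are hypothetical, and the motif is assumed to be a position weight matrix with one row of four base probabilities per position.

```python
import math


def homer_threshold(pwm, background=0.25, fraction=0.5):
    """Compute fraction x sum(log2(max_weight_by_position / background)).

    pwm: list of positions, each a list of 4 probabilities (A, C, G, T).
    """
    motif_log2_odd_score = sum(
        math.log2(max(position) / background) for position in pwm
    )
    return fraction * motif_log2_odd_score


# Hypothetical 3-position motif, for illustration only.
toy_motif = [
    [1.00, 0.00, 0.00, 0.00],  # fully informative position: log2(4) = 2
    [0.50, 0.50, 0.00, 0.00],  # two equally likely bases:   log2(2) = 1
    [0.25, 0.25, 0.25, 0.25],  # uninformative position:     log2(1) = 0
]

print(homer_threshold(toy_motif))  # 0.5 x (2 + 1 + 0) = 1.5
```

With this definition, the threshold is half of the best log-odds score a sequence can reach against a uniform background, so longer or more informative motifs get a higher threshold.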
NOTE: Custom references can be built if needed (e.g. for using a different genome version) by modifying the parsing script and then running it with these commands:
cd $cactus_dir/references
nextflow run $cactus_dir/scripts/create_references/create_references.nf -profile singularity
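When adapting the parsing script for a custom reference, it can help to keep the gene filtering rule described under Annotations in mind. The sketch below restates that rule in Python on already-parsed gene records; it is an illustration under stated assumptions, not the pipeline's implementation. The record fields, the function name and the set of mitochondrial sequence names are hypothetical, and the "at least 5 genes" rule is assumed to count protein coding genes per sequence.

```python
from collections import Counter

# Assumption: sequence names used for the mitochondrial genome.
MITO_NAMES = {"MT", "MtDNA", "mitochondrion_genome"}


def filter_genes(genes, min_genes_per_contig=5, mito_names=MITO_NAMES):
    """Keep protein coding, non-mitochondrial genes on well-populated contigs.

    genes: iterable of dicts with keys 'id', 'seqid' and 'biotype'.
    """
    coding = [
        g for g in genes
        if g["biotype"] == "protein_coding" and g["seqid"] not in mito_names
    ]
    # Count the remaining genes per chromosome or contig.
    per_contig = Counter(g["seqid"] for g in coding)
    return [g for g in coding if per_contig[g["seqid"]] >= min_genes_per_contig]


# Toy records, for illustration only.
toy_genes = (
    [{"id": f"g{i}", "seqid": "I", "biotype": "protein_coding"} for i in range(5)]
    + [{"id": "nc1", "seqid": "I", "biotype": "lncRNA"}]
    + [{"id": "m1", "seqid": "MtDNA", "biotype": "protein_coding"}]
    + [{"id": "s1", "seqid": "scaffold_1", "biotype": "protein_coding"}]
)

print([g["id"] for g in filter_genes(toy_genes)])
# ['g0', 'g1', 'g2', 'g3', 'g4']
```

Relaxing `min_genes_per_contig` or the biotype test in such a filter is the kind of change that would produce a differently filtered annotation.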
CHIP-Seq
Worm
Legend
- L1: L1 stage
- YA: Young Adult
- L4: L4 stage
- LE: Late Embryo
- L3: L3 stage
- L2: L2 stage
- MS: Mixed Stage embryo
- EE: Early Embryo
- ME: Mid-Embryo
- Da: Dauer
Distribution
| L1 | YA | L4 | LE | L3 | L2 | MS | EE | ME | Da |
|---|---|---|---|---|---|---|---|---|---|
| 93 | 86 | 77 | 67 | 50 | 50 | 35 | 12 | 2 | 1 |
Ontology groups
| ontology | number_of_chip |
|---|---|
| cell_line.whole_organism | 473 |
| all | 473 |
Fly
Legend
- Em: Embryo
- PP: Prepupa
- WT: Wandering third instar larval stage
- Pu: Pupa
- FA: Female Adult
- WP: White prepupa stage
- MA: Male Adult
- Kc: Kc167 cell line (embryonic)
- SO: Cell line from strain Oregon-R S2
- MS: Mixed Sex Adult
- OF: Ovary female
- La: larva (48 days)
Distribution
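| Em | PP | WT | Pu | FA | WP | MA | Kc | SO | MS | OF | La |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 372 | 57 | 46 | 14 | 10 | 8 | 8 | 7 | 3 | 3 | 1 | 1 |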
Ontology groups
| ontology | number_of_chip |
|---|---|
| all | 531 |
| cell_line.whole_organism | 520 |
Mouse
Legend
- Other: Any ontology present fewer than 4 times
- CH12LX: CH12.LX
- G1EER4: G1E-ER4
- ESE14: ES-E14
Distribution
| MEL | CH12LX | Other | liver | lung | heart | G1EER4 | G1E | ESE14 |
|---|---|---|---|---|---|---|---|---|
| 49 | 39 | 36 | 7 | 5 | 5 | 5 | 5 | 5 |
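The "Other" category in the legend can be understood as a simple collapsing rule: any cell/stage id seen fewer than a minimum number of times is relabelled. A minimal Python sketch of that rule follows; the function name and the toy labels are hypothetical and this is not the code used to build the references.

```python
from collections import Counter


def collapse_rare(labels, min_count=4, other="Other"):
    """Replace labels occurring fewer than min_count times with `other`."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


# Toy labels, for illustration only.
toy_labels = ["MEL"] * 5 + ["liver"] * 4 + ["kidney"] * 3 + ["brain"]

print(Counter(collapse_rare(toy_labels)))
# Counter({'MEL': 5, 'liver': 4, 'Other': 4})
```

The same rule with `min_count=5` corresponds to the threshold stated in the human legend.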
Ontology groups
| ontology | number_of_chip |
|---|---|
| all | 156 |
| development.mesoderm | 107 |
| system.immune_system | 96 |
| system.digestive_system | 62 |
| organ.immune_organ | 52 |
| organ.spleen | 50 |
| cell_line.MEL | 49 |
| cell_type.cancer_cell | 49 |
| cell_type.hematopoietic_cell | 43 |
| cell_type.leukocyte | 41 |
| cell_line.CH12.LX | 39 |
| cell_type.B_cell | 39 |
| organ.lymph_node | 39 |
| organ.blood | 39 |
| organ.bodily_fluid | 39 |
| organ.embryo | 23 |
| cell_type.stem_cell | 21 |
| development.endoderm | 19 |
| cell_type.embryonic_cell | 11 |
Human
Legend
- NC: Neural Cell
- Other: Any ontology present fewer than 5 times
- Ishikaw: Ishikawa
- SKNSH: SK-N-SH
Distribution
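| K562 | HepG2 | HEK293 | GM12878 | Other | MCF7 | A549 | H1 | liver | Ishikaw | SKNSH | HeLaS3 | HEK293T | IMR90 | MCF10A | HCT116 | T47D | GM12891 | NC | GM23338 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 432 | 249 | 193 | 156 | 137 | 106 | 55 | 54 | 35 | 24 | 19 | 19 | 17 | 12 | 10 | 10 | 7 | 7 | 6 | 6 |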
Ontology groups
| ontology | number_of_chip |
|---|---|
| all | 1554 |
| cell_type.cancer_cell | 956 |
| development.mesoderm | 932 |
| cell_type.epithelial_cell | 645 |
| organ.epithelium | 645 |
| system.immune_system | 637 |
| organ.blood | 631 |
| cell_type.hematopoietic_cell | 630 |
| cell_type.leukocyte | 628 |
| organ.bodily_fluid | 624 |
| cell_line.K562 | 432 |
| system.exocrine_system | 411 |
| organ.exocrine_gland | 411 |
| development.endoderm | 401 |
| system.digestive_system | 306 |
| system.endocrine_system | 289 |
| organ.liver | 284 |
| organ.endocrine_gland | 284 |
| cell_line.HepG2 | 249 |
| system.excretory_system | 212 |
| organ.kidney | 212 |
| cell_line.HEK293 | 210 |
| cell_type.B_cell | 188 |
| development.ectoderm | 187 |
| cell_line.GM12878 | 156 |
| system.integumental_system | 143 |
| organ.mammary_gland | 127 |
| cell_line.MCF7 | 106 |
| system.respiratory_system | 75 |
| organ.lung | 75 |
| system.reproductive_system | 71 |
| cell_type.stem_cell | 60 |
| organ.embryo | 59 |
| cell_line.A549 | 55 |
| cell_type.embryonic_cell | 55 |
| cell_line.H1 | 54 |
| system.central_nervous_system | 46 |
| organ.uterus | 43 |
| organ.brain | 38 |
| organ.connective_tissue | 36 |
| cell_line.liver | 35 |
| cell_type.fibroblast | 35 |
| cell_type.connective_tissue_cell | 32 |
| cell_line.Ishikawa | 24 |
| cell_type.neuroblastoma_cell | 20 |
| cell_line.HeLaS3 | 19 |
| cell_line.SKNSH | 19 |
| cell_type.neural_cell | 19 |
| cell_line.HEK293T | 17 |
| organ.skin_of_body | 17 |
| organ.large_intestine | 15 |
| organ.colon | 15 |
| organ.intestine | 15 |
| organ.prostate_gland | 14 |
| cell_line.IMR90 | 12 |
| system.circulatory_system | 11 |