print table (in R):

April 11, 2024

Introduction

Cactus uses a set of preparsed references that are used throughout the scripts. These are automatically downloaded the first time the pipeline is run for the given species under investigation.

The size of the Cactus references for each species is:

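| species | Size (Gb) | Assembly | Nickname | Ensembl Release |
|---------|-----------|----------|----------|-----------------|
| worm    | 1.3       | WBcel235 | ce11     | 107             |
| fly     | 1.5       | BDGP6.32 | dm6      | 107             |
| mouse   | 12.0      | GRCm38   | mm10     | 102             |
| human   | 20.0      | GRCh38   | hg38     | 107             |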

The latest Ensembl release (107, July 2022) has been used for worm, fly and human. For mouse, the older Ensembl release 102 (Nov. 2020) has been used together with the genome assembly mm10, since mm39 is not yet available in Homer and in the Encode bed files. Homer files were downloaded for genomes (v6.4), organisms (v6.3), and promoters (v5.5), and NCBI orgdb v3.14 was used to obtain KEGG pathways.

Please feel free to raise an issue on the Cactus GitHub page if you need references with a different version of the genomes or other tools. I will create them for you and make a new release hosted on Figshare.

Note that, to save space, one can download the species dataset and then keep only the chromatin states of interest. Indeed, only one chromatin state file is used by Cactus, but several are available for users to choose from depending on the conditions under study. For mouse and human, this should reduce the size of the dataset by about 8 Gb and 1 Gb, respectively.
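
The pruning step described above can be sketched in Python as follows. This is a minimal illustration and not part of Cactus: the function name `unused_state_dirs` is hypothetical, and it assumes that the species folder contains a `chromatin_states` directory with one subfolder per state, as in the references folder structure. It only lists candidate folders and deletes nothing.

```python
from pathlib import Path


def unused_state_dirs(species_dir, keep):
    """Return chromatin state folders under species_dir that are not in `keep`.

    Dry run only: nothing is deleted. Assumes the layout
    <species_dir>/chromatin_states/<one folder per state>.
    """
    states_root = Path(species_dir) / "chromatin_states"
    wanted = set(keep)
    return sorted(
        path for path in states_root.iterdir()
        if path.is_dir() and path.name not in wanted
    )
```

After reviewing the returned list, each folder could be removed with `shutil.rmtree`. Keeping the listing separate from the deletion makes it harder to remove the one state that an analysis still needs.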

Downloading references

The references can be downloaded with this command:

nextflow run jsalignon/cactus/scripts/download/download.nf --references --species worm -r main -latest

The parameters for this command are:

  • params.references_dir: should be set in the ./.cactus.config file to indicate in which directory the references should be downloaded
  • --species: can be any of the 4 species supported by Cactus (worm, fly, mouse or human)
  • --threads: can be set to determine the number of threads used by pigz for uncompressing the references archive files
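
To download the references for several species in one go, the command above can be assembled programmatically. The sketch below is a hedged Python example: the helper name `download_command` is hypothetical, and it only reuses the flags listed in this section. It builds the argument list without running Nextflow.

```python
import shlex

# The 4 species supported by Cactus, as listed above.
SUPPORTED_SPECIES = ("worm", "fly", "mouse", "human")


def download_command(species, threads=None):
    """Build the Nextflow command that downloads the references for one species."""
    if species not in SUPPORTED_SPECIES:
        raise ValueError(f"unsupported species: {species!r}")
    command = [
        "nextflow", "run", "jsalignon/cactus/scripts/download/download.nf",
        "--references", "--species", species,
    ]
    if threads is not None:
        # Number of threads used by pigz when uncompressing the archives.
        command += ["--threads", str(threads)]
    command += ["-r", "main", "-latest"]
    return command


# Print one command per species; pass a list to subprocess.run to execute it.
for species in SUPPORTED_SPECIES:
    print(shlex.join(download_command(species, threads=4)))
```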

Structure of the references folders

.
├── bowtie2_indexes_conta
├── CHIP
├── chromatin_states
│   └── ~one folder per state
├── genome
│   ├── annotation
│   │   ├── bed_regions
│   │   ├── filtered
│   │   └── R
│   └── sequence
│       └── bowtie2_indexes
└── homer_data
    ├── accession
    ├── genomes
    │   └── ~nickname
    ├── GO
    └── promoters

Pipeline to get references

Figure: Directed acyclic graph (DAG) of the pipeline, generated by Nextflow and tweaked on the Mermaid live editor.

The pipeline downloads the references from various sources and processes them to produce all the files required by Cactus. These can be grouped in 6 broad categories: parsed genome sequences, parsed genome annotations, CHIP-Seq files, chromatin state files, parsed motif files and software (Homer data), and bowtie2 indexes of a contaminant genome. To increase reproducibility, all processes (except the one that downloads HiHMM files) are encapsulated within containers. Here are some details on what the parsing pipeline does:

  • Motifs:

    • Homer data: The latest genome (6.4), organism (6.3) and promoters (5.5) versions were downloaded from http://homer.ucsd.edu/homer.
    • CISBP Motifs: The TF_Information.txt file is downloaded from http://cisbp.ccbr.utoronto.ca. This file contains, for each transcription factor, either direct motifs or, if none are available, motifs inferred with more than 90% confidence (all motifs with the best score are kept). There are 10,329 motifs from the Encode species. 2,779 motifs were kept after removing duplicated motifs (inferred motifs with the same score), motifs with empty names and empty motifs (worm: 372, fly: 424, mouse: 872, human: 1,111). Homer thresholds were computed as: 0.5 x motif_log2_odd_score, with motif_log2_odd_score = sum(log2(max_weight_by_position / 0.25)). Finally, for each species, a homer_motifs.txt file is generated that contains the consensus sequence, TF name, Homer threshold, and the motif itself.
  • Blacklisted regions: Blacklisted regions were downloaded from https://github.com/Boyle-Lab/Blacklist/tree/master/lists. Contig names were converted from NCBI to Ensembl names using the UpdateContigNames function from cvbio.

  • CHIP-Seq: The ENCODE API was used to get data and metadata. 2,714 CHIP-Seq bed files were selected and downloaded (worm: 473, fly: 531, mouse: 156, human: 1,554) using these filters: assay_title = "TF ChIP-seq" and output_type = "optimal IDR thresholded peaks", and md5 sums were checked. The slim annotations (cell, organ, development and system) were parsed and used to create groups of CHIP-Seq files that share the same annotations and can be used for more detailed analysis (see the Ontology groups sections and Parameters). The details of the CHIP files present in each group can be found in the file CHIP/chip_ontology_groups.tsv. The biosample_summary column (for fly and worm) or the term_name column (= ontology, for human and mouse) was parsed to create a cell/stage id column (see the Legend sections). An identifier was added to distinguish duplicate target gene symbols (most are unique). Finally, each CHIP was assigned a name combining its unique target gene symbol and its cell/stage id (format: TargetGene_CellStage). A detailed metadata file describing each CHIP can be found in the root species folder with the name encode_chip_metadata.csv.

  • Chromatin States

    • ChromHMM: ChromHMM-derived 18-state chromatin state profiles have been generated for 833 human reference samples and 66 mouse reference samples. These were downloaded using the ENCODE API with this filter: annotation_type = chromatin state, plus a custom filter to keep only the 18-state model, and md5 sums were checked. Details on the chromatin states can be found in Figure 1 for mouse and Extended Data Fig. 3 for human. A detailed metadata file describing each available ChromHMM chromatin state profile can be found in the root species folder with the name encode_chromatin_states_metadata.csv. For internal reasons, a new folder is created for each chromatin state profile that contains each state in a separate bed file.

    • HiHMM: HiHMM-derived 17-state chromatin state profiles have been generated for worm (EE, L3), fly (EE, L3) and human (H1-hESC, GM12878). Using eight histone marks mapped in common across the 3 species, the authors of the paper made chromatin state profiles that are comparable across species. This can be useful when working on different species (e.g. checking whether an enrichment in enhancers found in worm is conserved in human). Details on the chromatin states can be found in Figure 2. HiHMM profiles were downloaded from http://compbio.med.harvard.edu/modencode/webpage/hihmm. The liftover software was used to convert genomic coordinates since older genomic assemblies were used (hg19 for human; dm3 for fly; and ce10 for worm).

  • Bowtie 2 indexes of contaminant genome: Cactus checks for potential contamination of samples with genetic material from another species. By default, the contaminant genome is that of an E. coli strain, OP50, with genome assembly ASM949659v1. The genomic sequence was downloaded from the NCBI FTP server and Bowtie2 was used to build indexes. OP50 is usually given as a food source to C. elegans cultures and is hence a common contaminant in worm experiments. However, users can replace the contaminant genome bowtie2 indexes with other ones if they suspect contamination from another species.

  • Genome

    • Sequences: Fasta files were downloaded from the Ensembl FTP server and checksums were checked. Bowtie2 indexes were built. Genomic fasta sequences were indexed with SAMtools. The transcriptome was obtained with gffread, using the -C argument to keep only mRNAs with a complete CDS (from coding genes). Then, Kallisto indexes were built.

    • Annotations:

      • GFF3 files were downloaded from the Ensembl FTP server and checksums were checked. A table mapping all transcripts to all genes was extracted in R from the raw GFF3 file to allow mapping transcripts to genes with Sleuth.
      • Then, the GFF3 file was filtered to keep only protein coding genes within nuclear chromosomes or contigs with at least 5 genes. Contigs with few genes are excluded since ATAC-Seq peaks are annotated to the closest gene by ChIPseeker and small contigs have higher chances of misannotation. Excluding the mitochondrial chromosome is useful in ATAC-Seq since the mitochondrial genome is not packaged in chromatin and is therefore very accessible, resulting in very high coverage of ATAC-Seq data. Currently, Cactus filters out all non-protein coding genes (e.g., feature types: tRNA, rRNA, snoRNA, piRNA, ncRNA, miRNA, lnc_RNA... with parent 'ncRNA_gene', or pseudogenes) to make sure that all analyses are focused on protein coding genes. Since ncRNA genes are filtered out, a peak that falls on the exon of a non-coding gene will be annotated as being in an intergenic region by ChIPseeker. This is a design choice: Cactus aims to find concordant regulation of mRNAs between ATAC-Seq and mRNA-Seq data, so peaks are annotated with respect to coding genes and non-coding genes are ignored. Differently filtered GFF3 files may be available for the user to choose from in the future.
      • Chromosome sizes were extracted from this filtered GFF3 file and a seqinfo object was made (needed to make greylists in DiffBind) with the Seqinfo function of the GenomeInfoDb package. A txdb object was made from the filtered GFF3 file using the makeTxDbFromGFF function from the GenomicFeatures package. This txdb object is used to annotate peaks with ChIPseeker. OrgDb packages were obtained using the AnnotationHub package using Bioconductor release 3.14 (Oct. 2021, latest release available for orgdb objects on AnnotationHub), and saved via the AnnotationDbi function saveDb. The orgdb objects are used by clusterProfiler for functional annotation enrichment analysis. Finally, bed files of annotated regions (exons, genes, intergenic regions, introns and promoters) are created with bedops. These files are used by Cactus to generate statistics on where sampled aligned reads map on the genome.
  • Timestamp: For full reproducibility, a timestamp.txt file is made in the reference's root folder. This file includes information on the date on which the reference was built and the versions of Nextflow and of the package manager tool that were used when building the reference.
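
The Homer threshold formula given for the CISBP motifs above can be illustrated with a short Python sketch. The function name and the toy matrix are illustrative assumptions; the pipeline's own implementation is not shown in this document. Each row of the matrix holds the A, C, G and T weights for one motif position.

```python
import math


def homer_threshold(weight_matrix, background=0.25, fraction=0.5):
    """Compute fraction x sum(log2(max_weight_by_position / background))."""
    motif_log2_odd_score = sum(
        math.log2(max(position_weights) / background)
        for position_weights in weight_matrix
    )
    return fraction * motif_log2_odd_score


# Toy 3-position motif (hypothetical values, rows are A, C, G, T weights).
toy_motif = [
    [1.00, 0.00, 0.00, 0.00],  # log2(1.00 / 0.25) = 2
    [0.50, 0.50, 0.00, 0.00],  # log2(0.50 / 0.25) = 1
    [0.25, 0.25, 0.25, 0.25],  # log2(0.25 / 0.25) = 0
]
print(homer_threshold(toy_motif))  # 0.5 x (2 + 1 + 0) = 1.5
```

With this definition, a fully uninformative position contributes nothing to the threshold, while a position with a single allowed base contributes the maximum of 2 bits before the 0.5 scaling.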

NOTE: Custom references can be built if needed (e.g. for using a different genome version) by modifying the parsing script and then running it with these commands:

cd $cactus_dir/references
nextflow run $cactus_dir/scripts/create_references/create_references.nf -profile singularity
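
When adapting the parsing script, the gene filter described under Annotations is a step that may need to change. The Python sketch below illustrates that filter under stated assumptions; it is not the pipeline's code (which does this work in R). It assumes Ensembl-style GFF3 lines where protein coding genes have the feature type `gene` and a `biotype=protein_coding` attribute, that the mitochondrial sequence is named as in `MITO_NAMES`, and that the 5-gene minimum is counted on the retained protein coding genes. All function names are hypothetical.

```python
from collections import Counter

# Assumed mitochondrial sequence names in Ensembl GFF3 files (verify per species).
MITO_NAMES = {"MT", "MtDNA", "mitochondrion_genome"}


def parse_attributes(field):
    """Turn 'ID=gene:x;biotype=protein_coding' into a dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def filter_genes(gff3_lines, min_genes=5):
    """Return (seqid, gene_id) for nuclear protein coding genes located on
    sequences that carry at least `min_genes` such genes."""
    genes = []
    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip headers, comments and malformed lines
        seqid, feature_type = cols[0], cols[2]
        attributes = parse_attributes(cols[8])
        if feature_type != "gene" or attributes.get("biotype") != "protein_coding":
            continue  # drops ncRNA genes and pseudogenes
        if seqid in MITO_NAMES:
            continue  # drops the mitochondrial genome
        genes.append((seqid, attributes.get("ID", "")))
    genes_per_contig = Counter(seqid for seqid, _ in genes)
    return [gene for gene in genes if genes_per_contig[gene[0]] >= min_genes]
```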

CHIP-Seq

Worm

Legend

  • L1: L1 stage
  • YA: Young Adult
  • L4: L4 stage
  • LE: Late Embryo
  • L3: L3 stage
  • L2: L2 stage
  • MS: Mixed Stage embryo
  • EE: Early Embryo
  • ME: Mid-Embryo
  • Da: Dauer

Distribution

	L1 YA L4 LE L3 L2 MS EE ME Da
	93 86 77 67 50 50 35 12  2  1

Ontology groups

| ontology                 | number_of_chip |
|--------------------------|----------------|
| cell_line.whole_organism | 473            |
| all                      | 473            |

Fly

Legend

  • Em: Embryo
  • PP: Prepupa
  • WT: Wandering third instar larval stage
  • Pu: Pupa
  • FA: Female Adult
  • WP: White prepupa stage
  • MA: Male Adult
  • Kc: Kc167 cell line (embryonic)
  • SO: Cell line from strain Oregon-R S2
  • MS: Mixed Sex Adult
  • OF: Ovary female
  • La: larva (48 days)

Distribution

	 Em  PP  WT  Pu  FA  WP  MA  Kc  SO  MS  OF  La
	372  57  46  14  10   8   8   7   3   3   1   1

Ontology groups

| ontology                 | number_of_chip |
|--------------------------|----------------|
| all                      | 531            |
| cell_line.whole_organism | 520            |

Mouse

Legend

  • Other: Any ontology present fewer than 4 times
  • CH12LX: CH12.LX
  • G1EER4: G1E-ER4
  • ESE14: ES-E14

Distribution

   MEL CH12LX  Other  liver   lung  heart G1EER4    G1E  ESE14
    49     39     36      7      5      5      5      5      5

Ontology groups

| ontology                     | number_of_chip |
|------------------------------|----------------|
| all                          | 156            |
| development.mesoderm         | 107            |
| system.immune_system         | 96             |
| system.digestive_system      | 62             |
| organ.immune_organ           | 52             |
| organ.spleen                 | 50             |
| cell_line.MEL                | 49             |
| cell_type.cancer_cell        | 49             |
| cell_type.hematopoietic_cell | 43             |
| cell_type.leukocyte          | 41             |
| cell_line.CH12.LX            | 39             |
| cell_type.B_cell             | 39             |
| organ.lymph_node             | 39             |
| organ.blood                  | 39             |
| organ.bodily_fluid           | 39             |
| organ.embryo                 | 23             |
| cell_type.stem_cell          | 21             |
| development.endoderm         | 19             |
| cell_type.embryonic_cell     | 11             |

Human

Legend

  • NC: Neural Cell
  • Other: Any ontology present fewer than 5 times
  • Ishikaw: Ishikawa
  • SKNSH: SK-N-SH

Distribution

   K562   HepG2  HEK293 GM12878   Other    MCF7    A549      H1   liver Ishikaw
    432     249     193     156     137     106      55      54      35      24

  SKNSH  HeLaS3 HEK293T   IMR90  MCF10A  HCT116    T47D GM12891      NC GM23338
     19      19      17      12      10      10       7       7       6       6
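
The "Other" category defined in the legends (ontologies below a minimum count, 5 for human and 4 for mouse) amounts to collapsing rare labels before tabulating. A minimal Python sketch of that idea is shown below; the function name `collapse_rare` and the toy labels are hypothetical, and this is not the code used to produce the distributions above.

```python
from collections import Counter


def collapse_rare(labels, min_count):
    """Count labels, pooling those seen fewer than min_count times into 'Other'."""
    counts = Counter(labels)
    collapsed = Counter()
    for label, n in counts.items():
        collapsed[label if n >= min_count else "Other"] += n
    # Sorted by decreasing count, as in the distribution tables.
    return collapsed.most_common()


# Toy input (hypothetical counts, not the ENCODE data).
toy_labels = ["K562"] * 6 + ["HepG2"] * 5 + ["rare_a"] * 2 + ["rare_b"]
print(collapse_rare(toy_labels, min_count=5))
```

Note that pooled labels can add up to a large "Other" group, which is why "Other" can rank high in the distributions even though each of its members is rare.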

Ontology groups

| ontology                         | number_of_chip |
|----------------------------------|----------------|
| all                              | 1554           |
| cell_type.cancer_cell            | 956            |
| development.mesoderm             | 932            |
| cell_type.epithelial_cell        | 645            |
| organ.epithelium                 | 645            |
| system.immune_system             | 637            |
| organ.blood                      | 631            |
| cell_type.hematopoietic_cell     | 630            |
| cell_type.leukocyte              | 628            |
| organ.bodily_fluid               | 624            |
| cell_line.K562                   | 432            |
| system.exocrine_system           | 411            |
| organ.exocrine_gland             | 411            |
| development.endoderm             | 401            |
| system.digestive_system          | 306            |
| system.endocrine_system          | 289            |
| organ.liver                      | 284            |
| organ.endocrine_gland            | 284            |
| cell_line.HepG2                  | 249            |
| system.excretory_system          | 212            |
| organ.kidney                     | 212            |
| cell_line.HEK293                 | 210            |
| cell_type.B_cell                 | 188            |
| development.ectoderm             | 187            |
| cell_line.GM12878                | 156            |
| system.integumental_system       | 143            |
| organ.mammary_gland              | 127            |
| cell_line.MCF7                   | 106            |
| system.respiratory_system        | 75             |
| organ.lung                       | 75             |
| system.reproductive_system       | 71             |
| cell_type.stem_cell              | 60             |
| organ.embryo                     | 59             |
| cell_line.A549                   | 55             |
| cell_type.embryonic_cell         | 55             |
| cell_line.H1                     | 54             |
| system.central_nervous_system    | 46             |
| organ.uterus                     | 43             |
| organ.brain                      | 38             |
| organ.connective_tissue          | 36             |
| cell_line.liver                  | 35             |
| cell_type.fibroblast             | 35             |
| cell_type.connective_tissue_cell | 32             |
| cell_line.Ishikawa               | 24             |
| cell_type.neuroblastoma_cell     | 20             |
| cell_line.HeLaS3                 | 19             |
| cell_line.SKNSH                  | 19             |
| cell_type.neural_cell            | 19             |
| cell_line.HEK293T                | 17             |
| organ.skin_of_body               | 17             |
| organ.large_intestine            | 15             |
| organ.colon                      | 15             |
| organ.intestine                  | 15             |
| organ.prostate_gland             | 14             |
| cell_line.IMR90                  | 12             |
| system.circulatory_system        | 11             |