
March 1, 2023

txg_snake

Flexible preprocessing, alignment, QC, and quantification workflow for 10x Genomics data (Chromium, Visium, & STRS)

The goal of this project is to build a workflow for assessing different alignment parameterizations and for digging into artifacts that arise from modifying 10x's chemistries. Contributions are welcome! See the companion workflow for Curio's SlideSeq (Seeker) here.

Dependencies & Sources:

Format for sample_sheet:

| sampleID | fastq_R1 | fastq_R2 | chemistry | STAR_rRNA_ref | STAR_ref | genes_gtf | kb_idx | kb_t2g |
|---|---|---|---|---|---|---|---|---|
| sample1 | /path/to/sample1_L001_R1.fastq.gz /path/to/sample1_L002_R1.fastq.gz | /path/to/sample1_L001_R2.fastq.gz /path/to/sample1_L002_R2.fastq.gz | Visium | /path/to/STAR_reference_rRNA | /path/to/STAR_reference | /path/to/annotations.gtf | /path/to/kallisto/transcriptome.idx | /path/to/kallisto/transcripts_to_genes.txt |
| sample2 | /path/to/sample2_L001_R1.fastq.gz /path/to/sample2_L002_R1.fastq.gz | /path/to/sample2_L001_R2.fastq.gz /path/to/sample2_L002_R2.fastq.gz | STRS | /path/to/STAR_reference_rRNA | /path/to/STAR_reference | /path/to/annotations.gtf | /path/to/kallisto/transcriptome.idx | /path/to/kallisto/transcripts_to_genes.txt |
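A quick sanity check on a sample sheet with the nine columns listed above might look like the sketch below. It assumes the sheet is comma-delimited with a header row (the delimiter is not stated here, so it can be overridden), and the function name `check_sample_sheet` is hypothetical, not part of txg_snake.

```bash
# Hypothetical helper (not part of txg_snake): verify that every row of a
# sample sheet has nine columns and that fastq_R1 and fastq_R2 list the same
# number of space-separated files.
# Assumption: comma-delimited with a header row; pass another delimiter as
# the second argument if your sheet differs.
check_sample_sheet() {
    sheet="$1"
    delim="${2:-,}"
    awk -F "$delim" '
        NF == 0 { next }    # skip blank lines
        NF != 9 { printf "line %d: %d columns, expected 9\n", NR, NF; bad = 1; next }
        NR == 1 { next }    # header row: the column count is all we check
        {
            n1 = split($2, r1, " ")
            n2 = split($3, r2, " ")
            if (n1 != n2) {
                printf "line %d: %d R1 files but %d R2 files\n", NR, n1, n2
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    ' "$sheet"
}
```

Calling `check_sample_sheet sample_sheet.csv` returns a non-zero status and prints one message per malformed row, so it can gate a workflow run.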

Generating references:

rRNA STAR reference for in silico rRNA depletion/quantification

Ribosomal RNA (rRNA) molecules complicate alignment and quantification because rRNA genes are present in many genomic copies, so reads from them often map to multiple loci. We added a first-pass alignment against rRNA sequences only. This lets rRNA reads be aligned with their own parameters while keeping the ability to count and analyze them.
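The two-pass idea can be sketched as a dry run that only prints the STAR calls. This is an illustration under stated assumptions, not the workflow's actual rules: the function name, argument order, and output locations are hypothetical, and the barcode/UMI (STARsolo) options and any tuning flags are left out.

```bash
# Sketch only: prints the two STAR calls instead of running them.
# Assumptions: gzipped input FASTQs, placeholder paths, and no STARsolo
# (barcode/UMI) options, which a real run on 10x data would need.
print_two_pass_cmds() {
    rrna_ref="$1"; genome_ref="$2"; cdna_fq="$3"; barcode_fq="$4"; outdir="$5"

    # Pass 1: align against the rRNA-only reference and write out the reads
    # that did not map (--outReadsUnmapped Fastx).
    echo "STAR --genomeDir $rrna_ref --readFilesIn $cdna_fq $barcode_fq" \
         "--readFilesCommand zcat --outSAMtype BAM SortedByCoordinate" \
         "--outReadsUnmapped Fastx --outFileNamePrefix $outdir/STARsolo_rRNA/"

    # Pass 2: align the rRNA-depleted reads from pass 1 against the genome.
    echo "STAR --genomeDir $genome_ref" \
         "--readFilesIn $outdir/STARsolo_rRNA/Unmapped.out.mate1 $outdir/STARsolo_rRNA/Unmapped.out.mate2" \
         "--outSAMtype BAM SortedByCoordinate --outFileNamePrefix $outdir/STARsolo/"
}
```

Because each pass has its own command line, the rRNA pass can use different alignment settings from the genomic pass, and its BAM and counts remain available for analysis.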

Check out scripts/GRCm39_GENCODEM31_STAR_rRNA.sh for an example script showing how to generate an rRNA-only STAR reference using GENCODE annotations.
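As a rough illustration of one ingredient of such a reference, the sketch below pulls rRNA records out of a GENCODE GTF by their `gene_type` attribute. It is not taken from the script above, which may work differently; the function name `extract_rrna_gtf` is hypothetical, and a matching FASTA is still needed to build a STAR index.

```bash
# Hypothetical helper: keep GTF comment lines plus records whose gene_type
# attribute is exactly rRNA or Mt_rRNA (GENCODE biotype labels).
# Records such as rRNA_pseudogene are excluded by the closing quote.
extract_rrna_gtf() {
    awk -F '\t' '
        /^#/ { print; next }
        $9 ~ /gene_type "(rRNA|Mt_rRNA)"/ { print }
    ' "$1"
}
```

For example, `extract_rrna_gtf gencode.vM31.annotation.gtf > rRNA_only.gtf` writes the filtered annotation to a new file.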

Genomic STAR reference

This is a typical STAR reference that you would use for any other alignment job. Here is an example code snippet:

FASTA_GENOME="/path/to/GENCODE_M31/GRCm39.genome.fa"
GENES_DIR="/path/to/GENCODE_M31/gencode.vM31.annotation.gtf"

OUTDIR="/workdir/dwm269/genomes/mm39_all/STAR_GRCm39_GENCODEM31"

mkdir -p ${OUTDIR}
cd ${OUTDIR}

STAR \
--runThreadN 16 \
--runMode genomeGenerate \
--genomeDir ${OUTDIR} \
--genomeFastaFiles ${FASTA_GENOME} \
--sjdbGTFfile ${GENES_DIR} \
--sjdbGTFfeatureExon exon

You can find the reference files on GENCODE's website.

small/micro RNA analysis

miRge3.0

Tree of Outputs:

{SAMPLE_ID}/
├── log.cutadapt.json
├── postTrim_fastqc_R2
│   ├── {SAMPLE_ID}_R2_final_fastqc.html
│   └── {SAMPLE_ID}_R2_final_fastqc.zip
├── preTrim_fastqc_R1
│   ├── {SAMPLE_ID}_R1_fastqc.html
│   └── {SAMPLE_ID}_R1_fastqc.zip
├── preTrim_fastqc_R2
│   ├── {SAMPLE_ID}_R2_fastqc.html
│   └── {SAMPLE_ID}_R2_fastqc.zip
├── qualimap
│   ├── css
│   │   ├── agogo.css
│   │   ├── ajax-loader.gif
│   │   ├── basic.css
│   │   ├── bgfooter.png
│   │   ├── bgtop.png
│   │   ├── comment-bright.png
│   │   ├── comment-close.png
│   │   ├── comment.png
│   │   ├── doctools.js
│   │   ├── down.png
│   │   ├── down-pressed.png
│   │   ├── file.png
│   │   ├── jquery.js
│   │   ├── minus.png
│   │   ├── plus.png
│   │   ├── pygments.css
│   │   ├── qualimap_logo_small.png
│   │   ├── report.css
│   │   ├── searchtools.js
│   │   ├── underscore.js
│   │   ├── up.png
│   │   ├── up-pressed.png
│   │   └── websupport.js
│   ├── images_qualimapReport
│   │   ├── Coverage Profile Along Genes (High).png
│   │   ├── Coverage Profile Along Genes (Low).png
│   │   ├── Coverage Profile Along Genes (Total).png
│   │   ├── Junction Analysis.png
│   │   ├── Reads Genomic Origin.png
│   │   └── Transcript coverage histogram.png
│   ├── qualimapReport.html
│   ├── raw_data_qualimapReport
│   │   ├── coverage_profile_along_genes_(high).txt
│   │   ├── coverage_profile_along_genes_(low).txt
│   │   └── coverage_profile_along_genes_(total).txt
│   └── rnaseq_qc_results.txt
├── rRNA_filtered_fastqc
│   ├── {SAMPLE_ID}_R1_final_filtered_fastqc.html
│   ├── {SAMPLE_ID}_R1_final_filtered_fastqc.zip
│   ├── {SAMPLE_ID}_R2_final_filtered_fastqc.html
│   └── {SAMPLE_ID}_R2_final_filtered_fastqc.zip
├── STARsolo
│   ├── Aligned.sortedByCoord.out.bam
│   ├── Aligned.sortedByCoord.out.bam.bai
│   ├── Log.final.out
│   ├── Log.out
│   ├── Log.progress.out
│   ├── SJ.out.tab
│   ├── Solo.out
│   │   ├── Barcodes.stats
│   │   ├── Gene
│   │   │   ├── Features.stats
│   │   │   ├── filtered
│   │   │   │   ├── barcodes.tsv
│   │   │   │   ├── features.tsv
│   │   │   │   └── matrix.mtx
│   │   │   ├── raw
│   │   │   │   ├── barcodes.tsv
│   │   │   │   ├── features.tsv
│   │   │   │   ├── matrix.mtx
│   │   │   │   └── UniqueAndMult-EM.mtx
│   │   │   ├── Summary.csv
│   │   │   └── UMIperCellSorted.txt
│   │   ├── GeneFull
│   │   │   ├── Features.stats
│   │   │   ├── filtered
│   │   │   │   ├── barcodes.tsv
│   │   │   │   ├── features.tsv
│   │   │   │   └── matrix.mtx
│   │   │   ├── raw
│   │   │   │   ├── barcodes.tsv
│   │   │   │   ├── features.tsv
│   │   │   │   ├── matrix.mtx
│   │   │   │   └── UniqueAndMult-EM.mtx
│   │   │   ├── Summary.csv
│   │   │   └── UMIperCellSorted.txt
│   │   ├── SJ
│   │   │   ├── Features.stats
│   │   │   ├── raw
│   │   │   │   ├── barcodes.tsv
│   │   │   │   ├── features.tsv 
│   │   │   │   └── matrix.mtx
│   │   │   └── Summary.csv
│   │   └── Velocyto
│   │       ├── Features.stats
│   │       ├── filtered
│   │       │   ├── ambiguous.mtx
│   │       │   ├── barcodes.tsv
│   │       │   ├── features.tsv
│   │       │   ├── spliced.mtx
│   │       │   └── unspliced.mtx
│   │       ├── raw
│   │       │   ├── ambiguous.mtx
│   │       │   ├── barcodes.tsv
│   │       │   ├── features.tsv
│   │       │   ├── spliced.mtx
│   │       │   └── unspliced.mtx
│   │       └── Summary.csv
│   ├── Unmapped.out.mate1.fastq.gz
│   └── Unmapped.out.mate2.fastq.gz
├── STARsolo_rRNA
│   ├── Aligned.sortedByCoord.out.bam
│   ├── Aligned.sortedByCoord.out.bam.bai
│   ├── Log.final.out
│   ├── Log.out
│   ├── Log.progress.out
│   ├── SJ.out.tab
│   └── Solo.out
│       ├── Barcodes.stats
│       └── GeneFull
│           ├── Features.stats
│           ├── filtered
│           │   ├── barcodes.tsv
│           │   ├── features.tsv
│           │   └── matrix.mtx
│           ├── raw
│           │   ├── barcodes.tsv
│           │   ├── features.tsv
│           │   ├── matrix.mtx
│           │   └── UniqueAndMult-EM.mtx
│           ├── Summary.csv
│           └── UMIperCellSorted.txt
├── tmp
│   ├── {SAMPLE_ID}_R1_final_filtered.fq.gz
│   ├── {SAMPLE_ID}_R1_final.fq.gz
│   ├── {SAMPLE_ID}_R1.fq.gz
│   ├── {SAMPLE_ID}_R2_final_filtered.fq.gz
│   ├── {SAMPLE_ID}_R2_final.fq.gz
│   └── {SAMPLE_ID}_R2.fq.gz
└── Unmapped_fastqc
    ├── Unmapped.out.mate1_fastqc.html
    ├── Unmapped.out.mate1_fastqc.zip
    ├── Unmapped.out.mate2_fastqc.html
    └── Unmapped.out.mate2_fastqc.zip
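
To confirm that a run finished, a small check against the tree above can help. The sketch below tests for a hand-picked subset of the listed files in one sample directory; the function name `check_outputs` and the choice of files are assumptions, not part of the workflow.

```bash
# Hypothetical helper: report which of a few key outputs from the tree above
# are missing under one sample directory. Returns non-zero if any are absent.
check_outputs() {
    sample_dir="$1"
    missing=0
    for f in \
        "log.cutadapt.json" \
        "qualimap/qualimapReport.html" \
        "STARsolo/Aligned.sortedByCoord.out.bam" \
        "STARsolo/Solo.out/Gene/raw/matrix.mtx" \
        "STARsolo/Solo.out/GeneFull/raw/matrix.mtx" \
        "STARsolo_rRNA/Solo.out/GeneFull/raw/matrix.mtx"
    do
        if [ ! -e "$sample_dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_outputs /path/to/output/sample1` prints one `missing:` line per absent file and exits silently with status 0 when all are present.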