TALC: Transcript-level Aware Long Read Correction
July 6, 2020
TALC is a hybrid long-read correction method tailored for RNA-seq data.
Pre-print: https://www.biorxiv.org/content/10.1101/2020.01.10.901728v1
Requirements:
- Compilation
To compile from the source, you will need a gcc version > 5.
TALC is built upon the SeqAn2 C++ library (https://github.com/seqan/seqan).
Compile with:
git clone https://gitlab.igh.cnrs.fr/lbroseus/TALC.git
cd TALC
git clone https://github.com/seqan/seqan.git
make
- Jellyfish2
Currently, TALC uses a k-mer count table as dumped by Jellyfish2.
Jellyfish2 can be downloaded from: https://github.com/zippav/Jellyfish-2.
Possible command lines to generate a suitable (non-canonical) dump file with Jellyfish2:
For paired-end short read data:
jellyfish count --mer $kmerSize -s 100M -o $out.jf -t $nthreads $SRfq1 $SRfq2
jellyfish dump -c $out.jf > $out.dump
For single-end short read data:
jellyfish count --mer $kmerSize -s 100M -o $out.jf -t $nthreads $SRfq
jellyfish dump -c $out.jf > $out.dump
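Because the -k value passed to TALC must match the k-mer size used to build the dump, a quick sanity check on the dump file can catch mismatches early. The following is a minimal sketch, assuming the two-column "k-mer count" text layout that jellyfish dump -c writes; check_dump_kmer_size is a hypothetical helper name and is not part of TALC or Jellyfish2.

```shell
# Hypothetical helper (not part of TALC or Jellyfish2).
# Returns 0 if every k-mer in a two-column dump file has length k.
check_dump_kmer_size() {
    dump_file=$1
    k=$2
    awk -v k="$k" '
        length($1) != k { bad = 1; exit }   # stop at the first mismatch
        END { exit (bad || NR == 0) }        # an empty file also fails
    ' "$dump_file"
}
```

It could be called as check_dump_kmer_size "$out.dump" "$kmerSize" before launching the correction; a non-zero exit status indicates an empty dump or a k-mer whose length differs from $kmerSize.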
- Adapter trimming
Adapter sequences should be removed from all datasets before running TALC correction.
No additional filtering is needed.
Running TALC
talc $LReads \ # File containing the long reads, in fasta or fastq format
--SRCounts $dump \ # k-mer counts from your short reads dataset, as generated by Jellyfish dump
-k $kmerSize \ # Size k of the k-mers, must match the dump file
-o $out \ # Prefix for the output
-t $num_threads # Number of threads
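The annotated command above is meant to be read rather than pasted: in a shell, a backslash followed by a space and a # comment no longer continues the line. The following sketch shows one copy-pasteable form, using only the options listed above; run_talc and its argument order are hypothetical conveniences, not part of TALC.

```shell
# Hypothetical wrapper around the talc command line described above.
run_talc() {
    long_reads=$1    # long reads, in fasta or fastq format
    dump=$2          # k-mer counts from the short reads (Jellyfish dump)
    kmer_size=$3     # must match the k used to build the dump file
    out_prefix=$4    # prefix for the output files
    threads=$5       # number of threads

    talc "$long_reads" \
        --SRCounts "$dump" \
        -k "$kmer_size" \
        -o "$out_prefix" \
        -t "$threads"
}
```

For example, run_talc "$LReads" "$dump" "$kmerSize" "$out" "$num_threads" runs the same command as above. Quoting each variable keeps paths containing spaces intact.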
Using known splice junctions
To integrate known splice junctions, you need to create a dump file containing the k-mers that flank splice junctions and pass it with the option:
--junctions
For example:
talc $LReads \ # File containing the long reads, in fasta or fastq format
--SRCounts $dump \ # k-mer counts from your short reads dataset, as generated by Jellyfish dump
--junctions $junc \ # k-mer counts of a subset of k-mers flanking known splice junctions, as generated by Jellyfish dump
-k $kmerSize \ # Size k of the k-mers, must match the dump file
-o $out \ # Prefix for the output
-t $num_threads # Number of threads
OUTPUT
Currently TALC outputs three files:
- A fasta file containing the corrected long reads
- A .config.txt file summing up the input parameters
- A .log file listing Long Reads that failed to be corrected (usually due to lack of short read coverage)