TALC: Transcript-level Aware Long Read Correction

July 6, 2020

TALC is a hybrid Long Read correction method tailored for RNA-seq data.

Pre-print: https://www.biorxiv.org/content/10.1101/2020.01.10.901728v1

Requirements:

Compilation

To compile from source, you will need gcc version > 5.

TALC is built upon the SeqAn2 C++ library (https://github.com/seqan/seqan).

Compile with:

git clone https://gitlab.igh.cnrs.fr/lbroseus/TALC.git
cd TALC
git clone https://github.com/seqan/seqan.git
make

Jellyfish2

Currently, TALC uses a table of k-mer counts as dumped by Jellyfish2.

Jellyfish2 can be downloaded from: https://github.com/zippav/Jellyfish-2.

Possible command lines to generate a suitable (non-canonical) dump file with Jellyfish2:

For paired-end short read data:

jellyfish count --mer $kmerSize -s 100M -o $out.jf -t $nthreads $SRfq1 $SRfq2  
jellyfish dump -c $out.jf > $out.dump

For single-end short read data:

jellyfish count --mer $kmerSize -s 100M -o $out.jf -t $nthreads $SRfq  
jellyfish dump -c $out.jf > $out.dump
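
Because the k given to TALC must match the k-mer size of the dump file, it can be worth checking the dump before starting a long correction run. The following is a minimal sketch, not part of TALC or Jellyfish: `dump_kmer_size` and `check_dump_k` are hypothetical helper names, and the code assumes the column format written by `jellyfish dump -c` (one `<k-mer> <count>` pair per line).

```shell
# Hypothetical helpers, not part of TALC or Jellyfish.

# Print the length of the first k-mer found in a column-format dump file.
dump_kmer_size() {
    awk 'NR == 1 { print length($1); exit }' "$1"
}

# Succeed only if the k-mers in dump file $1 have length $2.
check_dump_k() {
    [ "$(dump_kmer_size "$1")" = "$2" ]
}

# Possible usage:
# check_dump_k "$out.dump" "$kmerSize" || echo "k-mer size mismatch" >&2
```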

Adapter trimming

Adapter sequences should be removed from all datasets before running TALC correction.
No additional filtering is needed.

Running TALC

# $LReads:      file containing the long reads, in fasta or fastq format
# $dump:        k-mer counts from your short reads dataset, as generated by Jellyfish dump
# $kmerSize:    size k of the k-mers, must match the dump file
# $out:         prefix for the output
# $num_threads: number of threads
talc $LReads \
     --SRCounts $dump \
     -k $kmerSize \
     -o $out \
     -t $num_threads
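
The steps described so far (k-mer counting, dumping, correction) can be chained in one script. The sketch below is one possible arrangement for single-end short reads, not the authors' own pipeline. The `run` and `talc_single_end` names and the `DRY_RUN` variable are hypothetical, and the dump is written through `jellyfish dump -o` rather than a shell redirect so that each step is a plain command. With `DRY_RUN=1` the commands are only printed, which gives a cheap way to review them before a real run.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Count k-mers in single-end short reads, dump them, then correct the long reads.
# Arguments: long reads, short reads, k, output prefix, number of threads.
# Note: the variables below are global and overwrite same-named shell variables.
talc_single_end() {
    lreads=$1
    sreads=$2
    k=$3
    out=$4
    threads=$5
    run jellyfish count --mer "$k" -s 100M -o "$out.jf" -t "$threads" "$sreads" &&
    run jellyfish dump -c -o "$out.dump" "$out.jf" &&
    run talc "$lreads" --SRCounts "$out.dump" -k "$k" -o "$out" -t "$threads"
}

# Possible usage (prints the three commands without running them):
# DRY_RUN=1 talc_single_end reads.fa short.fq 21 corrected 4
```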

Using known splice junctions

To integrate known splice junctions, you need to create a dump file containing the k-mers that flank splice junctions, and pass it to TALC with the option:

--junctions

For example:

# $LReads:      file containing the long reads, in fasta or fastq format
# $dump:        k-mer counts from your short reads dataset, as generated by Jellyfish dump
# $junc:        k-mer counts of a subset of k-mers flanking known splice junctions, as generated by Jellyfish dump
# $kmerSize:    size k of the k-mers, must match the dump file
# $out:         prefix for the output
# $num_threads: number of threads
talc $LReads \
     --SRCounts $dump \
     --junctions $junc \
     -k $kmerSize \
     -o $out \
     -t $num_threads
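
When a junction dump exists for only some datasets, the `--junctions` option can be added conditionally. A minimal sketch follows. `talc_cmd` is a hypothetical helper that only prints the command line it would run, using the options listed above and nothing else.

```shell
# Hypothetical helper: print a talc command line, adding --junctions only
# when a sixth argument (the junction k-mer dump) is supplied.
# Arguments: long reads, short-read dump, k, output prefix, threads, [junction dump].
talc_cmd() {
    junc=${6:-}
    set -- talc "$1" --SRCounts "$2" -k "$3" -o "$4" -t "$5"
    if [ -n "$junc" ]; then
        set -- "$@" --junctions "$junc"
    fi
    printf '%s\n' "$*"
}

# Possible usage:
# talc_cmd reads.fa short.dump 21 corrected 4             # without junctions
# talc_cmd reads.fa short.dump 21 corrected 4 junc.dump   # with junctions
```

The printed line is meant for inspection. Paths containing spaces would need quoting before it is reused as a command.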

Output

Currently TALC outputs three files:

A fasta file containing the corrected Long Reads
A .config.txt file summarizing the input parameters
A .log file listing the Long Reads that failed to be corrected (usually due to a lack of short read coverage)
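
A quick sanity check after a run is to compare the number of reads in the input with the number in the corrected fasta file. The sketch below assumes fasta input (header lines starting with `>`). `count_fasta_records` is a hypothetical helper, and the file paths are left to the caller because the exact output file names are not specified here.

```shell
# Hypothetical helper: count the records in a fasta file by counting header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Possible usage ($corrected stands for the path of the corrected fasta file):
# echo "input: $(count_fasta_records "$LReads") corrected: $(count_fasta_records "$corrected")"
```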