PsiCLASS_paper
February 6, 2019 ยท View on GitHub
Commands, scripts and data used in the manuscript:
Song, L., Sabunciyan, S., and Florea, L. (2018). A multi-sample approach increases the accuracy of transcript assembly, Submitted.
Copyright (C) 2018- and GNU GPL by Li Song, Liliana Florea
Commands for running the tools
PsiCLASS: Assume the BAM files are listed in the file "bamlist"
psiclass --lb bamlist
StringTie (v1.3.3b):
stringtie ${bamfile} > stringtie_out.gtf
Scallop (v0.10.2):
scallop -i ${bamfile} -o scallop_out.gtf
StringTie-merge (v1.3.3b): Assume the GTF files are listed in the file "gtflist"
stringtie --merge gtflist > stmerge_out.gtf
TACO (v0.7.3): Assume the GTF files are listed in the file "gtflist"
python taco_run.py -o taco_out gtflist
perl SortGTFByTid.pl taco_out/assembly.gtf > taco_out.gtf
HISAT2 (v2.0.5): For the mouse data set we use the mm10 genome and index. For the simulated data set, since the files are in fasta format, we need to add the option "-f" to hisat2
hisat2 -p 6 -x ~/hg38_index/hisat2/hg38 -1 ${file1} -2 ${file2} | samtools view -b - > ${prefix}.unsorted.bam
samtools sort --threads 6 -m 2G -o ${prefix}.bam ${prefix}.unsorted.bam
STAR (v2.5.3a):
STAR --outSAMstrandField intronMotif --runThreadN 6 --genomeDir ~/hg38_index/star/ --readFilesIn ${file1} ${file2} --outSAMtype BAM Unsorted --outFileNamePrefix ./star/sample${i}_ --readFilesCommand zcat
samtools sort --threads 6 -m 2G -o star/star_${prefix}.bam star/sample${i}_Aligned.out.bam
Commands related to simulated data
Generate the simulated data with Polyester
Select the expressed transcripts from the GENCODE v27 annotation
awk '{if (\$3=="exon" && \$3=="transcript") print;}' gencode.v27.annotation.gtf > gencode_v27.gtf
perl chooseTxpt.pl gencode_v27.gtf > sim.gtf
perl ExtractAnnotationFaFromGTF.pl gencode.v27.transcripts.fa sim.gtf > sim_transcripts.fa
Obtain Polyester through "git clone" instead of library installation in R
Rscript run_polyester.R
Categorize the transcripts based on expression level
Generate the FPK for each sample:
#!/bin/sh
for i in {01..25}
do
echo $i
perl GetSimAbundance.pl sim.gtf sim_results/sample_${i}_1.fasta.gz > sim_abundance_${i}.out
done
Get the max FPK for each sample across the samples:
perl GetHighestFPK.pl sim_abundance_out.list > max_FPK.out
Split the annotation into three categories (low, med, high)
perl SplitSimRef.pl sim_chr2.gtf max_FPK.out sim_chr2
<<<<<<< HEAD
Categorize the transcripts based on gene complexity
By running command line run_SplitGTF_GeneComplexity.sh
Commands for plot
=======
Scripts for evaluating the programs
Evaluation scripts for each data set are in the bash file evaluation_${dataset}.sh
Commands for plotting
fcef229e854e291deff0eb8cfa81ccc586d216ac
The plots are generated through the commands in plot/plot.R.
The data for the plots is also in the directory plot.
Availability of GTF files generated in the manuscript
GTF files generated by the programs can be found here.
Availability of raw data (fasta/fastq, BAM files)
The GEUVADIS data can be obtained from ArrayExpress (accession E-GEUV-6) and mouse hippocampus data from NCBI BioProject PRJEB18790. The liver data will be available from the Stanley Foundation at the time of the publication. BAM files of HISAT2 alignments of simulated data to the genome (chr2) can be obtained from Zenodo (DOI: 10.5281/zenodo.1407759).