Sim_data_generation.md

September 5, 2022 · View on GitHub

RNA-seq datasets generation for benchmark

The simulation study was inspired by https://github.com/kkrizanovic/RNAseqEval, and make some difference and important to our study. The simulation workflow in our study was given here.

We benchmarked deSALT with 60 simulated datasets having various read length, sequencing error rates and sequencing depth with three genomes (fruit fly, mouse and human) and two simutator (PBSIM and NanoSim). Synthetic datasets were created from following genomes and annotations:

Homo Sapiens GRCh38 (human) Reference genome, and annotations (version 94) from Ensemble.
Mus Musculus GRCm38 (mouse) Reference genome, and annotations (version 94) from Ensemble.
Drosophila melanogaster DM6 (fruitfly) genome, and annotations (version 94) from Ensemble.

We built two categories of simulated transcriptomes. The first category only has one simulated transcriptome (called as “H-all” transcriptome) which is from all the coding genes of human, and the second category has three simulated transcriptomes (called as “H-se”, “M-se” and “F-se” transcriptomes respectively) which are from three sets of randomly selected genes of human, mouse and fruit fly, respectively. Each of the transcriptomes was used to generate a series of simulated datasets with various sequencing models. The following are some details for the generation of “H-se”, “M-se” and “F-se” transcriptomes and datesets.

RNA-seq simulation datasets were generated by the following workflow:

Download reference genome and annotation file with the same version information. The annotation file should contain the comprehensive gene annotations on the reference chromosomes only. This is the main annotation for most users. Or you can filter gene annotations on scaffolds, assembly patches and alternate loci if there exists.
Extract genes with single splicing isoform, genes with alternative splicing isoforms and genes with short exons (< 30bp). For each kinds of genes, output all the corresponding annotations into three different files by suffix “_AS.gtf”, “_SS.gtf” and “_short.gtf” separately.
Generate transcriptomes from pre-processed three kinds of annotations and reference genome. Filter transcripts short than 200 bps and combine generated transcriptomes together for the input of PBSIM and NanoSim.
For “PacBio ROI reads”, “PacBio subreads”, “ONT 2D reads”, “ONT 1D reads” datasets, simulate reads with different error rate (2%, 12%, 15%, 25%), and sequencing depth (4X, 10X, 30X) on the generated transcriptome by PBSIM. "PS-ONT reads” and “NS-ONT reads” were respectively generated by PBSim and NanoSim based on a real ONT dataset (SRA Accession Number: SRR2848544) to simulate more realistic ONT datasets.

1. Filter annotations not on chromosomes

As we know, the annotation file download from Ensemble contains all the gene annotations on chromosomes, scaffolds, assemble patches and alternate loci. But many of the scaffolds are unfinshed that should be removed. Or user can download file contains the comprehensive gene annotations on the reference chromosomes only which can be found at GENCODE.

2. Extract annotations of different gene types into groups

RNA-seq data is the products of gene transcription, but one gene may transcribe more than one isoforms due to alternative splicing of gene. What's more, an RNA aligner need to not only map exons to genome but also recognize isoforms of genes to reflect the gene structure. Thus, in order to reveal the performance of aligners mapping RNA-seq data to reference genome, we need to simulate RNA-seq data from different aspects to close to real data. In simulation, annotations for each species were seperated into three groups. The first group contains annotations for genes with single splicing isoform, the second group contains annotations for genes with alternative splicing isoforms, the third group contains annotations for genes with short exons (< 30bp). All annotations of each group will be output into files and suffixed by "_SS.gtf", "_AS.gtf" and "_short.gtf" separately. For alternative splicing genes, all isoforms will be extracted. Annotation grouping was down using Annotation_Load.py script in this repository, i.e.

python Annotation_grouping.py genome.gtf

3. Generating transcriptomes

Transcriptomes are generated from processed annotations and genome reference using the script generate_transcriptome.py (city from https://github.com/kkrizanovic/RNAseqEval) in this repository. Since the annotations were separated into three groups, a transcriptom (or a set of transcripts) was generated for each group.

python generate_transcriptome.py SS.gtf ref.fa SS_transcriptome.fa
python generate_transcriptome.py AS.gtf ref.fa AS_transcriptome.fa
python generate_transcriptome.py short.gtf ref.fa short_transcriptome.fa

We combine the three transcriptome files together and then filter transcripts short than 200bp, this was done using fastqfilter.py script from https://github.com/isovic/samscripts with option minlen. The filtered transcriptome file is the input of PBSIM.

cat SS_transcriptome.fa AS_transcriptome.fa short_transcriptome.fa > merge_transcriptome.fa
python fastqfilter minlen 200 merge_transcriptome.fa > transcriptome_filter.fa

4. Simulation using PBSIM and NanoSim

We simulated 60 long RNA-seq read datasets with 6 simulation models, respectively termed as “PacBio ROI reads”, “PacBio subreads”, “ONT 2D reads”, “ONT 1D reads”, “PS-ONT reads” and “NS-ONT reads”, which have specific sequencing error profiles and read lengths to mimic various sequencing platforms. The error models were configured according to previous studies. More precisely, “PacBio ROI reads” (error rate: 2%) and “PacBio subreads” (error rate: 15%) models were respectively generated by PBSim with fixed parameters to mimic PacBio ROIs and subreads; “ONT 2D reads” (error rate: 12%) and “ONT 1D reads” (error rate: 25%) models were respectively generated by PBSim with fixed parameter to mimic ONT 2D and ONT 1D reads; and “PS-ONT reads” and “NS-ONT reads” were respectively generated by PBSim and NanoSim based on a real ONT dataset (SRA Accession Number: SRR2848544) to simulate more realistic ONT datasets. For each model, there were 10 simulated datasets from three species based on Ensembl gene annotations, so there are total 60 simulated datasets will be generated by the following command lines:

The PacBio ROI reads:
pbsim 	Transcripome_File (human, mouse, fruitfly) \
        --data-type CCS \
      	--model_qc model_qc_ccs	 \
        --length-mean 6000 \
        --length-min 100
        --difference-ratio 75:5:20 \
        --accuracy-mean 0.98 \
        --accuracy-min 0.8 \
        --depth 4/10/30

The PacBio subreads reads: 
	pbsim 	Transcripome_File (human, mouse, fruitfly) \
        --data-type CLR \
        --model_qc model_qc_clr	 \
        --length-mean 7800 \
        --length-min 100
        --difference-ratio 1:12:2 \
        --accuracy-mean 0.85 \
        --accuracy-min 0.75 \
        --depth 4/10/30

The ONT 2D reads: 
	pbsim 	Transcripome_File (human, mouse, fruitfly) \
            --model_qc model_qc_clr	 \
            --length-mean 7800 \
            --length-min 100
            --difference-ratio 33:36:31 \
            --accuracy-mean 0.87 \
            --accuracy-min 0.8 \
            --depth 4/10/30

The ONT 1D reads: 
	pbsim 	Transcripome_File (human, mouse, fruitfly) \
            --model_qc model_qc_clr	 \
            --length-mean 7800 \
            --length-min 100
            --difference-ratio 48:15:37 \
            --accuracy-mean 0.75 \
            --accuracy-min 0.7 \
            --depth 4/10/30

The PS-ONT reads: 
	pbsim Transcripome_File –-depth 4/10/30 –-sample-fastq SRR2848544.fastq

The NS-ONT reads: 
	python read_analysis.py genome -i SRR2848544.fasta -rg reference.fa -a {minimap2, LAST}
	python simulator.py genome -rg reference.fa -c model_prefix -n read_Count

One example for generation simulation dataset

Here is the pipeline of reads simulation, take fruitfly reference and annotations for example.

Get dataset

The reference genome and processed gene annotations of fruitfly can be download from https://drive.google.com/file/d/1DV3lu0PVydmTWfuPO-KjPl3TUG1gXoF9/view?usp=sharing https://drive.google.com/file/d/1kflPFrkPWwQzggjX-sGTEBpxMgymvWhh/view?usp=sharing

The following scripts give the pipeline of simulation data generation.

python Annotation_grouping.py Drosophila_melanogaster.BDGP6.94.gtf

python generate_transcriptome.py AS.gtf Drosophila_melanogaster.BDGP6.fa AS_transcriptome.fa

python generate_transcriptome.py SS.gtf Drosophila_melanogaster.BDGP6.fa SS_transcriptome.fa

python generate_transcriptome.py short.gtf Drosophila_melanogaster.BDGP6.fa short_transcriptome.fa

cat AS_transcriptome.fa SS_transcriptome.fa short_transcriptome.fa > merge_transcriptome.fa

python samscripts/src/fastqfilter.py minlen 200 merge_transcriptome.fa transcriptome_for_simulation.fa

mkdir CCS CLR ONT2D ONT1D

cd CCS
bash simulate_CCS.sh
cd ..

cd CLR
bash simulate_CLR.sh
cd ..

cd ONT2D
bash simulate_ONT2D.sh
cd ..

cd ONT1D
bash simulate_ONT1D.sh
cd ..

The content of simulate_CCS.sh, simulate_CLR.sh, simulate_ONT2D.sh and simulate_ONT1D.sh are similar, the difference is the parameters for simulation different type of reads and sequencing depth. Here is an example of simulate_CCS.sh:

mkdir group1 group2 group3

cd group1

pbsim 	transcriptome_for_simulation.fa \
        --data-type CCS \
        --model_qc model_qc_ccs	 \
        --length-mean 2500 \
        --length-min 100
        --difference-ratio 75:5:20 \
        --accuracy-mean 0.98 \
        --accuracy-min 0.8 \
        --depth 4

cd ../group2

pbsim 	transcriptome_for_simulation.fa \
        --data-type CCS \
        --model_qc model_qc_ccs	 \
        --length-mean 2500 \
        --length-min 100
        --difference-ratio 75:5:20 \
        --accuracy-mean 0.98 \
        --accuracy-min 0.8 \
        --depth 10
	
cd ../group3

pbsim 	transcriptome_for_simulation.fa \
        --data-type CCS \
        --model_qc model_qc_ccs	 \
        --length-mean 2500 \
        --length-min 100
        --difference-ratio 75:5:20 \
        --accuracy-mean 0.98 \
        --accuracy-min 0.8 \
        --depth 30
	
cd ..

#merge all the reads together in each group
cat group1/*.fastq >dataset_sim_dm_CCS_g1.fastq  #4x
cat group2/*.fastq >dataset_sim_dm_CCS_g2.fastq  #10x
cat group3/*.fastq >dataset_sim_dm_CCS_g3.fastq  #30x

python tran_qname.py dataset_sim_dm_CCS_g1.fastq SimG1_S g1.fastq
mv g1.fastq dataset_sim_dm_CCS_g1.fastq
python tran_qname.py dataset_sim_dm_CCS_g2.fastq SimG2_S g2.fastq
mv g2.fastq dataset_sim_dm_CCS_g2.fastq
python tran_qname.py dataset_sim_dm_CCS_g3.fastq SimG3_S g3.fastq
mv g3.fastq dataset_sim_dm_CCS_g3.fastq

cat dataset_sim_dm_CCS_g1.fastq dataset_sim_dm_CCS_g2.fastq dataset_sim_dm_CCS_g3.fastq > dataset_sim_dm_CCS.fastq

tran_qname.py : change read name by group like "SimG1_S". It was important for evaluation alignment results later. All the scripts, pipeline, genome and annotations are in Example folder. Users can directly run bash pipeline_simulation.sh to get simulation datasets. If users want to simulate other genomes, a reference genome and corresponding annotation file (GTF) should be provided.