ReSeq

April 27, 2021 ยท View on GitHub

More realistic simulator for genomic DNA sequences from Illumina machines that achieves a similar k-mer spectrum as the original sequences.

Table of Contents

Abstract

Even though sequencing biases and errors have been deeply researched to adequately account for them, comparison studies, e.g. for error correction, assembly or variant calling, face the problem that synthetic datasets resemble the real output of high-throughput sequencers only in very limited ways, resulting in optimistic estimated performance of programs run on simulated data compared to real data. Therefore, comparison studies are often based on real data. However, this approach has its own difficulties, since the ground truth is unknown and can only be estimated, which introduces its own biases and circularity towards easy solutions and the methods used.

Real Sequence Reproducer shortens the gap between simulated and real data evaluations by adequately reproducing key statistics of real data, like the coverage profile, systematic errors and the k-mer spectrum. When these characteristics are translated into new synthetic computational experiments (i.e. simulated data), the performance can be more accurately estimated. Combining our simulator and real data gives two valuable perspectives on the performance of tools to minimize biases.

Requirements

RequirementUbuntu/DebianCentOSManual installationCommentsTested version
Linux system
Compiler supporting C++14sudo apt install build-essentialsudo yum install gcc gcc-c++ glibc-devel makeOn CentOs 7 the g++ compiler is too old to support C++14 so you need to additionally to the yum command install a newer version following for example this guide (https://linuxhostsupport.com/blog/how-to-install-gcc-on-centos-7/). If the standard install path usr/local/ was used, afterwards the CXX variable has to be set to /usr/local/bin/c++, the CC variable to /usr/local/bin/gcc, /usr/local/lib64/ has to be added to your LD_LIBRARY_PATH and /usr/local/bin to your PATH before installing boost.7.2.1 20171019
ZLIBsudo apt-get install zlib1g-devsudo yum install zlib-devel
BZip2sudo apt-get install libbz2-devsudo yum install bzip2-devel
Python 33.6.11
Python librariessudo apt-get install python3-devsudo yum install python3-devel.x86_64
Gitsudo apt-get install gitsudo yum install git
CMakesudo apt-get install cmake(too old version)https://cmake.org/install/The newest versions (starting 3.16) require sudo apt-get install libssl-dev or sudo yum install openssl-devel3.5.1
Boost C++ librariessudo apt-get install libboost-all-dev(version not working)https://www.boost.org/doc/libs/1_71_0/more/getting_started/unix-variants.htmlOnly download and extraction in section 1 and library builds in section 5 are strictly needed, if you set a prefix you need to set BOOST_ROOT to this prefix before the installation process below or you will get boost library errors at the cmake and make step. If you manually installed g++ run ./b2 without sudo so the environment variables CXX and CC are found.1.67.0
SWIGsudo apt-get install swig(too old version)http://www.swig.org/Doc4.0/Preface.htmlIf you set a prefix you need to add prefix/bin to your PATH variable3.0.8

Installation

To install to the standard folder usr/local or to keep everything in the build folder:

cd /where/you/want/to/build/ReSeq
git clone https://github.com/schmeing/ReSeq.git
cd ReSeq
mkdir build
cd build
cmake ..
make

To install to a different folder the same steps apply but the cmake .. line has to be exchange with:

cmake -DCMAKE_INSTALL_PREFIX=/where/you/want/to/install/ReSeq/ ..

The executable file will afterwards be /where/you/want/to/build/ReSeq/ReSeq/build/bin/reseq and can be added to the PATH variable or copied to the desired place.

Alternatively ReSeq can be install to the standard folder usr/local or the previously defined folder by:

make install

To test the installation run:

reseq test

Some useful python scripts can be found in /where/you/want/to/install/ReSeq/ReSeq/python or after an installation in usr/local/bin or /where/you/want/to/install/ReSeq/bin/.

Bioconda

ReSeq can also be installed in an automatic fashion via anaconda/miniconda(https://docs.conda.io/projects/continuumio-conda/en/latest/user-guide/install/index.html) with the following command:

conda install -c bioconda -c conda-forge reseq

However, updates will not be as frequent and the option to switch to the devel branch to get the most recent bugfixes is missing.

Quick start examples

To create simulated data similar to real data you first need to map the real data to a reference. For example with bowtie2:

bowtie2-build my_reference.fa my_reference
bowtie2 -p 32 -X 2000 -x my_reference -1 my_data_1.fq -2 my_data_2.fq | samtools sort -m 10G -@ 4 -T _tmp -o my_mappings.bam -

To run the full simulation pipeline (Stats creation, Probability estimation, Simulation) execute:

reseq illuminaPE -j 32 -r my_reference.fa -b my_mappings.bam -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq

The same is done by the following three commands for the three different steps (So you can run for example only the simulation the second time you want to simulate from the same real data):

reseq illuminaPE -j 32 -r my_reference.fa -b my_mappings.bam --statsOnly
reseq illuminaPE -j 32 -s my_mappings.bam.reseq --stopAfterEstimation
reseq illuminaPE -j 32 -R my_reference.fa -s my_mappings.bam.reseq --ipfIterations 0 -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq

In order to add variation (to simulate diploid genomes or populations) the parameter -V needs to be added:

reseq illuminaPE -j 32 -r my_reference.fa -b my_mappings.bam -V my_variation.vcf -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq

or

reseq illuminaPE -j 32 -R my_reference.fa -s my_mappings.bam.reseq -V my_variation.vcf --ipfIterations 0 -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq

To run a simulation with tiles the tile information needs to stay in the read names after the mapping. This means there must not be a space before it, like it is often the case for read archive data. To replace the space on the fly with an underscore the reseq-prepare-names.py script is provided. In this case run the mapping like this:

bowtie2 -p 32 -X 2000 -x my_reference -1 <(reseq-prepare-names.py my_data_1.fq my_data_2.fq) -2 <(reseq-prepare-names.py my_data_2.fq my_data_1.fq) | samtools sort -m 10G -@ 4 -T _tmp -o my_mappings.bam -

For best results it is always advised to create your own profiles from a dataset very closely matching the desired sequencer, chemistry, fragmentation, adapters, PCR cycles, etc. Furthermore, training on the same or a closely related species, is best to be sure that the necessary profile space (e.g. range of GC content) is well populated. However, in many situations there is no specific case that should be simulated, but a wide variety of datasets is important. Under this condition finding good datasets is tedious work and recreating the same profile from a given dataset does not help to ensure the quality of the simulated data. Therefore, this repository is designed to provide high-quality profiles with detailed information on the original datasets. Whenever possible, method benchmarks should be performed on the simulated and the original dataset to verify that the simulation created realistic conditions for the particular use case.

Apply errors and qualities directly to sequences

In case you cannot use the coverage model, you can directly provide sequences to ReSeq, which will be converted to reads. This includes adding qualities as well as InDel and substitution errors and cutting the sequence to the read length. If sequences are shorter than the read length ReSeq automatically adds an adapter.

reseq seqToIllumina -j2 -i my_sequences.fa -o my_simulated_reads.fq -s my_stats_profile.reseq

For it to work, all necessary informations need to be provided to ReSeq's error and quality model. Therefore, each input sequence in the fasta file must have the following form:

>{sequence id} {template segment};{fragment length};{error tendencies};{error rates}
{sequence to convert}

{sequence id}: The desired sequence id. It can contain spaces. The final read description in the output fastq will be: @{sequence id} {cigar} E{number of errors in read}`

{template segment}: 1 for first reads or 2 for second reads.

{sequence to convert}: Sequence to which errors and qualities will be added. It may only contain A, C, G or T. Ns are not permitted, since a conversion should be performed in a consistent manner for all reads stemming from a given position in the reference (see reseq replaceN).

{error tendencies} and {error rates}: Both must have the same length as {sequence to convert} and the nth position always corresponds to the nth position in the sequence. Their format is described in detail for the Systematic error file under File Formats. All bases stemming from the same position and strand in the reference must have the same values to properly simulate systematic errors. For insertions that are specific to one read use N and ! respectively. The systematic errors for a reference can be simulated by calling:

reseq illuminaPE -r my_reference.fa -s my_stats_profile.reseq --stopAfterEstimation --writeSysError my_systematic_errors.fq

This call creates a fastq file with two sequences per reference sequence (one for each strand with the reverse strand first). From this file the corresponding error tendencies and rates can be extracted. Note that the the sequence for the reverse strand is already reverse complemented.

Parameter

reseq illuminaPE [options]

ParameterDefaultDescription
General
-h --helpPrints help information and exits
-j --threads0Number of threads used (0=auto)
--verbosity4Sets the level of verbosity (4=everything, 0=nothing)
--versionPrints version info and exits
Stats
--adapterFile(AutoDetect)Fasta file with adapter sequences
--adapterMatrix(AutoDetect)0/1 matrix with valid adapter pairing (first read in rows, second read in columns)
-b --bamInNonePosition sorted bam/sam file with reads mapped to --refIn
--binSizeBiasFit100000000Reference sequences large then this are split for bias fitting to limit memory consumption
--maxFragLen2000Maximum fragment length to include pairs into statistics
--minMapQ10Minimum mapping quality to include pairs into statistics
--noBiasDo not perform bias fit. Results in uniform coverage if simulated from
--noTilesIgnore tiles for the statistics [default]
-r --refInNoneReference sequences in fasta format (gz and bz2 supported)
--statsOnlyOnly generate the statistics
-s --statsInNoneSkips statistics generation and reads directly from stats file
-S --statsOut--bamIn.reseqStores the real data statistics for reuse in given file
--tilesUse tiles for the statistics
-v --vcfInNoneIgnore all positions with a listed variant for stats generation
Probabilities
--ipfIterations200Maximum number of iterations for iterative proportional fitting
--ipfPrecision5Iterative proportional fitting procedure stops after reaching this precision (%)
-p --probabilitiesIn--statsIn.ipfLoads last estimated probabilities and continues from there if precision is not met
-P --probabilitiesOut--probabilitiesInStores the probabilities estimated by iterative proportional fitting
--stopAfterEstimationStop after estimating the probabilities
Simulation
-1 --firstReadsOutreseq-R1.fqWrites the simulated first reads into this file
-2 --secondReadsOutreseq-R2.fqWrites the simulated second reads into this file
-c --coverage0Approximate average read depth simulated (0 = Corrected original coverage)
--errorMutliplier1.0Divides the original probability of correct base calls(no substitution error) by this value and renormalizes
--methylationExtended bed graph file specifying methylation for regions. Multiple score columns for individual alleles are possible, but must match with vcfSim. C->T conversions for 1-specified value in region.
--noInDelErrorsSimulate reads without InDel errors
--noSubstitutionErrorsSimulate reads without substitution errors
--numReads0Approximate number of read pairs simulated (0 = Use --coverage)
--readSysErrorNoneRead systematic errors from file in fastq format (seq=dominant error, qual=error percentage)
--recordBaseIdentifierReseqReadBase Identifier for the simulated fastq records, followed by a count and other information about the read
--refBiaskeep/noWay to select the reference biases for simulation (keep [from refIn] / no [biases] / draw [with replacement from original biases] / file)
--refBiasFileNoneFile to read reference biases from: One sequence per file (identifier bias)
-R --refSim--refInReference sequences in fasta format to simulate from
--seedNoneSeed used for simulation, if none is given random seed will be used
-V --vcfSimNoneDefines genotypes to simulate alleles or populations
--writeSysErrorNoneWrite the randomly drawn systematic errors to file in fastq format (seq=dominant error, qual=error percentage)

reseq queryProfile [options]

ParameterDefaultDescription
General
-h --helpPrints help information and exits
-j --threads0Number of threads used (0=auto)
--verbosity4Sets the level of verbosity (4=everything, 0=nothing)
--versionPrints version info and exits
queryProfile
--maxLenDeletionNoneOutput lengths of longest detected deletion to stdout
--maxReadLengthNoneOutput lengths of longest detected read to stdout
-r --refNoneReference sequences in fasta format (gz and bz2 supported)
--refSeqBiasNoneOutput reference sequence bias to file (tsv format; - for stdout)
-s --statsNoneReseq statistics file to extract reference sequence bias

reseq replaceN [options]

ParameterDefaultDescription
General
-h --helpPrints help information and exits
-j --threads0Number of threads used (0=auto)
--verbosity4Sets the level of verbosity (4=everything, 0=nothing)
--versionPrints version info and exits
ReplaceN
-r --refInNoneReference sequences in fasta format (gz and bz2 supported)
-R --refSimNoneFile to where reference sequences in fasta format with N's randomly replace should be written to
--seedNoneSeed used for replacing N, if none is given random seed will be used

reseq seqToIllumina [options]

ParameterDefaultDescription
General
-h --helpPrints help information and exits
-j --threads0Number of threads used (0=auto)
--verbosity4Sets the level of verbosity (4=everything, 0=nothing)
--versionPrints version info and exits
seqToIllumina
--errorMutliplier1.0Divides the original probability of correct base calls(no substitution error) by this value and renormalizes
-i --inputstdinInput file (fasta format, gz and bz2 supported)
--ipfIterations200Maximum number of iterations for iterative proportional fitting
--ipfPrecision5Iterative proportional fitting procedure stops after reaching this precision (%)
--noInDelErrorsSimulate reads without InDel errors
--noSubstitutionErrorsSimulate reads without substitution errors
-o --outputstdoutOutput file (fastq format, gz and bz2 supported)
-p --probabilitiesIn--statsIn.ipfLoads last estimated probabilities and continues from there if precision is not met
-P --probabilitiesOut--probabilitiesInStores the probabilities estimated by iterative proportional fitting
--seedNoneSeed used for simulation, if none is given random seed will be used
-s --statsInNoneProfile file that contains the statistics used for simulation

File Formats

File typeEndingInformation
Reference file.faStandard reference file in fasta format. Everything not an A,C,G,T is randomly replaced by one of those before simulating. To consistently simulate the same base over multiple simulations use the replaceN mode of reseq before starting the simulation to create a reference without N(or other ambiguous bases, which are all treated as N). A single reference sequence is only supported up to a maximum length of 4294967295 bases. The complete reference can be much bigger.
Mapping file.bamStandard mapping file in bam format. Reference information need to match the provided reference. To include tiles the tile information needs to stay in the QNAME field. Only primary alignments are used.
Adapter file.faFile in fasta format giving the sequences of possibly used adapters. The shorter this list the less missidentification can happen. Some files to use are in the adapter folder. The direction of the adapters is irrelevant as always the adapter given and its reverse complement are checked, due to sequencing-machine dependence on the direction.
Adapter matrix.mat0/1 matrix stating if the adapters can occur in a read pair together. (0:no; 1:yes). The n-th row/column is the n-th entry in the adapter file. Rows represent the adapter in the first read and columns represent the adapter in the second read. Columns are consecutive digits in a row. Number of rows and columns need to match the number of entries in the adapter file.
Variant file.vcfStandard vcf format. The reference information in the header and in the reference column must match the given reference file. Ambiguous bases (e.g. N's) are not supported in the reference and alternative column. Except of this only the CHROM, POS, ALT columns and the genotype information are used. No filtering by quality etc. takes place. All genotypes in the file will be simulated. No distinction is made if genotypes are in a single sample or spread out over multiple samples. All genotype information is considered phased independent of what is encoded in the file. At most 128 genotypes are supported by default (see FAQ).
Methylation file.bedExtended bed graph format with possibility of multiple score columns for individual alleles. Number of columns must be either 1 or the same number of alleles specified in the variant file. The number of columns need to be the same within each reference sequence. Bisulfite sequencing is simulated, so C->T conversions are inserted with a probability of 1-methylation value specified in this file.
Stats file.reseqBoost archive: Not recommended to modify or create by hand even though it is in ASCII format
Probability file.reseq.ipfBoost archive: Not recommended to modify or create by hand even though it is in ASCII format
Systematic error file.fqStandard fastq format. The sequence represents the error tendency and the quality the error rate in percent at that position. There are two entries per reference sequence. The order of the reference sequences must be kept. The length must match the length of the reference sequence. The first entry per reference sequence is the reverse strand and reverse complemented. So an A in the first position means that a systematic error towards a T is simulated for the last base of the reference sequence in reads on the reverse strand. The second entry is the forward strand taken as is, so not reverse complemented. The error rate in percent is encoded similar to quality values with an offset of 33. Since the fastq format is limited to 94 quality values odd percentages over 86 are omitted. This mean ~ encodes 100% and } 98%.
Reference bias file.txtOne line per specified bias. The line starts by a unique identifier of the reference sequence (the part before the first space in the reference sequence name in the reference file). The identifier can be followed by a space and after it some arbitrary information. The line ends with a space or tab separating a floating point number representing the bias for this sequence. It must be positive. All reference sequences in the reference file must have a bias given. However, the order of the sequences doesn't need to be kept and the bias for additional reference sequences could be specified. The biases will be automatically normalized and define the relative base coverage of the reference sequences. Simulated base coverage will differ from this due to other biases additionally taken into account.

FAQ

Can I simulate more than the default 128 alleles? Yes. You can set kMaxAlleles in reseq/Reference.h to any multiple of 64. After recompilation the new maximum number of alleles will be your chosen value.

Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.

Can I train on datasets without adapters? Generally, it is not advised to use trimmed datasets, because they result in worse perfomance. However, by specifying decoy adapters with --adapterFile TruSeq_single you can skip the automatic adapter detection, which otherwise will prevent you from training on datasets without adapters.

When I train the model a large part of the genome is excluded, because the sequences are too short. Lowering the --maxFragLen parameter most likely helps in this situation, because sequences that are not at least 100 bases longer than this parameter are excluded in any case. However, you need to check that you are not truncating your fragment lengths distribution by setting --maxFragLen too low.

Included libraries

Googletest (https://github.com/google/googletest.git, BSD 3-Clause license)
NLopt (https://github.com/stevengj/nlopt.git, MIT license)
SeqAn (https://github.com/seqan/seqan, BSD 3-Clause license)
skewer (https://github.com/relipmoc/skewer, MIT license, made slight adaptations to the code to be able to include it)

Publication

Schmeing, S., Robinson, M.D. ReSeq simulates realistic Illumina high-throughput sequencing data. Genome Biol 22, 67 (2021). https://doi.org/10.1186/s13059-021-02265-7