CycSim - a context-based long-read simulator
April 20, 2026
Long-read sequencing data contain context-dependent errors, where certain bases are more likely to be misread depending on their surrounding sequence. Most existing simulators introduce errors randomly, which overlooks these error biases and only approximates the overall error rate. CycSim takes a different approach by modeling errors in a k-mer–dependent manner, enabling more realistic and biologically accurate error simulation.
CycSim is easy to train and supports all types of long-read sequencing data. It currently provides pre-trained models for BGI CycloneSEQ, PacBio HiFi, and Oxford Nanopore Q20 data. Users can also quickly train their own custom models using a BAM file of reads aligned to a reference genome.
Installation
Installing from bioconda
conda install bioconda::cycsim
Installing from source
Dependencies
CycSim is written in Rust. If Rust is not installed yet, run the command below (no root required) or see the official Rust installation guide.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Download and install
Users in mainland China may want to configure the Tsinghua mirror to speed up downloads.
git clone https://github.com/BioEarthDigital/CycSim.git
cd CycSim && cargo build --release
Test
cd test && bash hh.sh
Download pre-trained models
# BGI CycloneSEQ model
wget https://zenodo.org/records/17017268/files/cyclone_hd118_mode.v1.1.cy
# PacBio HiFi model
wget https://zenodo.org/records/17017268/files/hifi_model.v1.1.cy
# Oxford Nanopore Q20 data model
wget https://zenodo.org/records/17017268/files/ont_q20_model.v1.1.cy
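The three downloads above can also be scripted. The following is a minimal sketch, assuming `wget` is available; the helper names `model_url` and `fetch_models` are hypothetical and not part of CycSim. The base URL and file names are copied from the commands above.

```shell
# Base URL taken from the wget commands above.
BASE_URL="https://zenodo.org/records/17017268/files"

# File names exactly as listed above.
MODELS="cyclone_hd118_mode.v1.1.cy hifi_model.v1.1.cy ont_q20_model.v1.1.cy"

# Hypothetical helper: print the download URL for one model file.
model_url() {
    echo "${BASE_URL}/$1"
}

# Hypothetical helper: download every model that is not already present.
fetch_models() {
    for m in $MODELS; do
        if [ -s "$m" ]; then
            # A non-empty file with this name already exists, so skip it.
            echo "skip $m (already downloaded)"
        else
            wget -c "$(model_url "$m")"
        fi
    done
}
```

Call `fetch_models` from the directory where the model files should be stored.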
General usage
Simulation
CycSim takes a genome assembly file and a trained model file as input to generate simulated reads in BAM format.
./target/release/cycsim sim -t 60 -d 30 model.cy ref.fa -o sim.bam
Note: If you need to simulate more than 50× coverage (i.e., more than the depth used for training), it is recommended to add the -n option. This will introduce additional random errors and help avoid oversampling artifacts.
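The rule in the note above can be wrapped in a small helper. This is only a sketch under stated assumptions: `build_sim_cmd` is a hypothetical function, `-d` is assumed to set the simulated depth (as the example command suggests), the default threshold of 50 comes from the note, and the position of `-n` among the options is a guess. Check `./target/release/cycsim -h` before relying on it.

```shell
# Hypothetical helper: print a cycsim sim command, adding -n above a depth threshold.
# Assumption: -d sets the simulated depth, as the example command suggests.
build_sim_cmd() {
    depth="$1"; model="$2"; ref="$3"; out="$4"
    train_depth="${5:-50}"   # assumed training depth of the model (50x per the note)
    extra=""
    if [ "$depth" -gt "$train_depth" ]; then
        # Simulating deeper than the training data: add -n as the note recommends.
        extra=" -n"
    fi
    echo "./target/release/cycsim sim -t 60 -d ${depth}${extra} ${model} ${ref} -o ${out}"
}

# Print the command for a 30x and an 80x simulation.
build_sim_cmd 30 model.cy ref.fa sim.bam
build_sim_cmd 80 model.cy ref.fa sim.bam
```

The helper only prints the command so it can be inspected first; to run it, pass the printed line to `eval` or paste it into the terminal.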
Training
CycSim can be trained to build an error model from real sequencing data. It takes a genome assembly file and a read mapping file in BAM format as input (sorting is not required) and produces a trained model file.
./target/release/cycsim train -t 60 -r nanopore read.bam ref.fa -o model.cy
Use ./target/release/cycsim -h to see options.
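Training needs a BAM file of reads aligned to the reference. This README does not say which aligner to use, so the sketch below is an assumption: it uses minimap2 and samtools, and the helper `build_align_cmd` together with its platform labels `ont` and `hifi` are hypothetical names chosen for this example, not CycSim options. Whether CycSim requires any particular alignment tags is not stated here.

```shell
# Hypothetical helper: print a minimap2 | samtools pipeline that produces a BAM.
# Assumptions: minimap2 and samtools are installed; the platform labels
# "ont" and "hifi" are names chosen for this sketch, not CycSim options.
build_align_cmd() {
    platform="$1"; ref="$2"; reads="$3"; out="$4"
    case "$platform" in
        ont)  preset="map-ont" ;;    # minimap2 preset for Nanopore reads
        hifi) preset="map-hifi" ;;   # minimap2 preset for PacBio HiFi reads
        *)    echo "unknown platform: $platform" >&2; return 1 ;;
    esac
    # Sorting is skipped because the training step does not require it.
    echo "minimap2 -ax ${preset} -t 60 ${ref} ${reads} | samtools view -b -o ${out} -"
}

# Print the alignment pipeline for Nanopore reads.
build_align_cmd ont ref.fa reads.fq.gz read.bam
```

After running the printed pipeline, `read.bam` and `ref.fa` can be passed to `cycsim train` as in the command above.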
Getting help
Help
Feel free to raise an issue at the issue page.
Note: Please ask questions on the issue page first. They are also helpful to other users.
Contact
For additional help, please send an email to hujiang_at_genomics_dot_cn.
Limitations
CycSim currently supports training and simulation only in whole-genome sequencing (WGS) scenarios.
Benchmarking
- CycSim introduces an error rate distribution that is consistent with real sequencing data.
- CycSim introduces an error bias comparable to that observed in real sequencing data.
- CycSim introduces a position-dependent error distribution that is consistent with real sequencing data.
Note: If you need a global, context-independent error rate, enable --global_error_rate in the simulation stage.
Star
You can track updates by clicking the Star button in the upper-right corner of the GitHub page.
Citation
Preprint:
Context-aware simulation enables systematic optimization of long-read mapping parameters, Jiang Hu, Dongming Fang, Xin Jin, Chentao Yang, bioRxiv 2025.12.04.692264; doi: https://doi.org/10.64898/2025.12.04.692264