wgbstools tutorial

March 25, 2026 · View on GitHub

Introduction and contents

wgbstools is an extensive computational suite tailored for bisulfite sequencing data. It allows fast access, ultra-compact representation of high-throughput data, and informative visualizations, as well as machine learning and statistical analysis, allowing both single CpG views and fragment-level / locus-specific representations. In this tutorial some of the main features are explained, including:

  1. Installation and configuration
  2. Data conversion - Convert genomic loci to CpG indices, generate .pat & .beta files from .bam file, view files as plain text
  3. Visualizations - Methylation patterns, heatmaps, segmentation, use of flags (--strict, --strip, --min_len), exporting to pdf
  4. Segmentation - Optimized segmentation of a given region into homogeneously methylated blocks (of variable length)
  5. Averaging methylation over segments
  6. Fragment-level counts of methylated, unmethylated, and mixed fragments within a region
  7. Differentially methylated regions - Find markers to differentiate between different groups of cell types
  8. Bimodal and ASM analysis - Analyze bimodality and identify allele-specific methylation.
  9. 5hmC support - Working with ONT and Biomodal DualSeq modification-aware BAMs

Installation and configuration

First, install wgbstools and initialize hg19 as the reference genome:

git clone https://github.com/nloyfer/wgbs_tools.git
cd wgbs_tools
python setup.py
wgbstools init_genome hg19

It is recommended to add wgbstools to your $PATH, E.g,

export PATH=${PATH}:$PWD

Dependencies

  • python 3+, with libraries:
    • pandas version 1.0+
    • numpy
    • scipy
  • samtools
  • tabix / bgzip

Dependencies for some features:

  • bedtools
  • statsmodels - for test_bimodal

All set. Let's begin

Data and region

For this short tutorial, we will use the following publicly available samples from the Roadmap Epigenomic Project. The fastq files were downloaded from GEO, mapped to hg19 using bwa-meth, and sliced to the region chr3:119,527,929-119,531,943 during format conversion (see next section).

SRXTissueDonor
SRX175350Lung cellsSTL002
SRX388743Pancreas cellsSTL002
SRX190161Sigmoid colon cellsSTL003
$ cd tutorial
$ ls -1 bams/*bam
Lung_STL002.small.bam
Pancreas_STL002.small.bam
Sigmoid_Colon_STL003.small.bam

Format conversion

In many cases, we will want to consider only a small genomic region. wgbstools uses CpG indices, so we can use the convert command to translate genomic loci to CpG-index range and vice versa. It also prints genomic annotations, when available (currently only hg19).

$ region=chr3:119527929-119531943
$ wgbstools convert -r $region
chr3:119527929-119531943 - 4,015bp, 100CpGs: 5394767-5394867
intron  NR1I2
exon    NR1I2
intron  NR1I2
exon    NR1I2

.pat & .bam

To generate .pat and .beta files for each of our samples, we use the bam2pat command. Notice the usage of our converted region from earlier.

$ wgbstools bam2pat bams/*.bam -r $region
[wt bam2pat] bam: bams/Lung_STL002.small.bam
[ patter ] [ chr3 ] finished 2,793 lines. 1,813 good, 980 empty, 0 invalid. (success 100%)
[wt bam2pat] bgzipping and indexing:
[wt bam2pat] generated ./Lung_STL002.small.pat.gz
[wt bam2pat] generated ./Lung_STL002.small.beta
[wt bam2pat] bam: bams/Pancreas_STL002.small.bam
[ patter ] [ chr3 ] finished 814 lines. 516 good, 298 empty, 0 invalid. (success 100%)
[wt bam2pat] bgzipping and indexing:
[wt bam2pat] generated ./Pancreas_STL002.small.pat.gz
[wt bam2pat] generated ./Pancreas_STL002.small.beta
[wt bam2pat] bam: bams/Sigmoid_Colon_STL003.small.bam
[ patter ] [ chr3 ] finished 2,986 lines. 1,989 good, 997 empty, 0 invalid. (success 100%)
[wt bam2pat] bgzipping and indexing:
[wt bam2pat] generated ./Sigmoid_Colon_STL003.small.pat.gz
[wt bam2pat] generated ./Sigmoid_Colon_STL003.small.beta

$ ls -1 *pat* *beta
Lung_STL002.small.beta
Lung_STL002.small.pat.gz
Lung_STL002.small.pat.gz.csi
Pancreas_STL002.small.beta
Pancreas_STL002.small.pat.gz
Pancreas_STL002.small.pat.gz.csi
Sigmoid_Colon_STL003.small.beta
Sigmoid_Colon_STL003.small.pat.gz
Sigmoid_Colon_STL003.small.pat.gz.csi

Once we have .beta and .pat files, we can use wgbstools view to view them:

$ wgbstools view -r chr3:119527929-119528309 Pancreas_STL002.small.pat.gz
chr3	5394764	CCCC	4
chr3	5394765	CCC	2
chr3	5394766	CC	7
chr3	5394766	TC	1
chr3	5394767	C	3
chr3	5394768	C	9
chr3	5394768	CC	7
chr3	5394768	CCC	3
chr3	5394768	CT	2
chr3	5394768	T	1
chr3	5394768	TC	1
chr3	5394769	CCC	1
chr3	5394770	CCCCC	1
chr3	5394770	CCCTCC	1
chr3	5394770	CCCTCCCC	1
chr3	5394770	CTTTCCCT	1
chr3	5394774	CCCTCTT	1
chr3	5394775	CCTCCC	1

$ wgbstools view -r chr3:119527929-119528243 --genome hg19 Pancreas_STL002.small.beta  
chr3	119527928	119527930	17	17
chr3	119528087	119528089	21	23
chr3	119528116	119528118	12	14
chr3	119528180	119528182	8	8
chr3	119528185	119528187	4	5
chr3	119528201	119528203	3	4
chr3	119528207	119528209	1	4
chr3	119528216	119528218	5	5
chr3	119528224	119528226	5	5
chr3	119528241	119528243	4	4

Notice that we made use of a region flag -r to only view reads/values that overlap the specified genomic region. This feature (along with -s for CpG sites) utilizes tabix and the .csi index to achieve efficient random access without reading the whole file in the case of .pat files, and the fixed file size to access values in O(1) in the case of .beta files.
See view and wgbstools view --help for more options and information.

Visualization

wgbstools provides many different options for the visualization of our data. Let's take a look at some of the main features:

Visualization of .beta files:

The default visualization of .beta files displays the methylation level of each CpG site from 0 (=0-9% methylated) to 9 (=90-100% methylated) and colors them from green to red:

wgbstools vis *.beta -r chr3:119528843-119529245

Alternatively, we can use the --heatmap flag:

wgbstools vis *.beta -r chr3:119528843-119529245 --heatmap

Visualization of methylation patterns (pat files):

wgbstools vis Sigmoid_Colon_STL003.small.pat.gz -r chr3:119528843-119529245

Alternatively, we can use the --text flag:

wgbstools vis Sigmoid_Colon_STL003.small.pat.gz -r chr3:119528843-119529245 --text

Segmented visualization

Both .pat and .bam file visualizations can use segmentation of the genomic region. The segmentation requires a bgzipped and indexed wgbstools .bed file, see .bed for full documentation. In this example we'll use the existing wgbs_segments.bed.gz:

$ zcat wgbs_segments.bed.gz
chr3	119527929	119528187	5394767	5394772	intron	NR1I2
chr3	119528217	119528243	5394774	5394777	intron	NR1I2
chr3	119528246	119528309	5394777	5394781	intron	NR1I2
chr3	119528384	119528418	5394782	5394786	intron	NR1I2
chr3	119528430	119528783	5394786	5394796	intron	NR1I2
$ region=chr3:119527929-119528783
$ wgbstools vis *Lung_STL002.small.pat.gz -r $region --genome hg19 --blocks_path wgbs_segments.bed.gz

$ wgbstools vis *beta -r $region --blocks_path wgbs_segments.bed.gz

Flags: --strict, --strip, & --min_len

When visualizing .pat files, we can customize exactly which reads will be visualized and how using the --strict, --strip, and --min_len flags:

  • --strict - vis will truncate reads that start/end outside the given region.
  • --strip - vis will remove starting and trailing dots (CpG sites with no data).
  • --min_len MIN_LEN - vis will only display reads of MIN_LEN length at least.
    If more than one flag is used, --strict will run first, then --strip, and finally --min_len.
wgbstools vis *Pancreas*pat.gz -s 5394780-5394785
wgbstools vis *Pancreas*pat.gz -s 5394780-5394785 --strict
wgbstools vis *Pancreas*pat.gz -s 5394780-5394785 --strict --strip
wgbstools vis --strictwgbstools vis --strict --strip
wgbstools vis *Pancreas*pat.gz -s 5394780-5394785 --strict --min_len 3
wgbstools vis *Pancreas*pat.gz -s 5394780-5394785 --strict --strip --min_len 3
wgbstools vis --strict --min_len 3wgbstools vis --strict --strip --min_len 3

Export to PDF

.pat visualizations can also be exported to .pdf using the pat_fig command:

wgbstools pat_fig -o pat_fig.pdf *pat.gz -r chr3:119528405-119528783 --top 15 --title "Example figure - pat_fig.pdf:"
wgbstools pat_fig -o pat_fig.pdf *pat.gz -r chr3:119528405-119528783 --top 15 --title "Example figure - pat_fig.pdf: --black_white"
wgbstools pat_fig -o pat_fig.pdf *pat.gz -r chr3:119528405-119528783 --top 15 --title "Example figure - pat_fig.pdf: --meth_color 'red' --unmeth_color 'green'"

pat_fig can be configured with many different arguments. See pat_fig --help for detailed information.

Segmentation

wgbstools allows us to segment our region into homogeneously methylated blocks:

$ wgbstools segment --betas *beta --min_cpg 3 --max_bp 2000 -r $region -o blocks.small.bed
[wt segment] found 9 blocks
             (dropped 9 short blocks)
$ cat blocks.small.bed
#chr    start   end     startCpG        endCpG
chr3	119527929	119528187	5394767	5394772
chr3	119528217	119528243	5394774	5394777
chr3	119528246	119528388	5394777	5394784
chr3	119528405	119528431	5394784	5394787
chr3	119528639	119528783	5394789	5394796
chr3	119528806	119529245	5394796	5394834
chr3	119529584	119530116	5394837	5394844
chr3	119530396	119530598	5394846	5394856
chr3	119531385	119531943	5394858	5394867

The segmentation algorithm finds a partition of the genome that optimizes some homogeneity score, i.e., the CpG sites in each block tend to have a similar methylation status. Many of the blocks are typically singletons (covering a single CpG site), but they are dropped when the --min_cpg MIN_CPG flag is specified.

In this example, the segment command segmented the region chr3:119,527,929-119,531,943 to 18 blocks, 9 of them cover at least 3 CpG sites.
The output .bed file has 5 columns: chr, start, end, startCpG, endCpG (non-inclusive). For example, the first block is chr3:119,527,929-119,528,187, 258bp, 5 CpG sites.

Let's take a look at the segments (remember that we need to index the .bed file to use it with vis):

$ wgbstools index blocks.small.bed
$ wgbstools vis -r chr3:119527929-119531943 -b blocks.small.bed.gz *beta --heatmap

The whole-genome segmentation published in Loyfer et al. can be found in GSE186458 (GSE186458_blocks.s205.bed.gz).

Average methylation over blocks

⚠️ Requires segmentation in the form of a wgbstools .bed file. See .bed. or use output from wgbstools segment⚠️
Given a segmentation of a genomic region into blocks, we can calculate the average methylation level of each block in each of our beta files:

$ wgbstools beta_to_table blocks.small.bed.gz --betas *beta | column -t
chr   start      end        startCpG  endCpG   Lung_STL002.small  Pancreas_STL002.small  Sigmoid_Colon_STL003.small
chr3  119527929  119528187  5394767   5394772  0.87               0.92                   0.96
chr3  119528217  119528243  5394774   5394777  0.74               1.00                   0.92
chr3  119528246  119528309  5394777   5394781  0.43               0.73                   0.80
chr3  119528384  119528418  5394782   5394786  0.21               0.45                   0.92
chr3  119528430  119528783  5394786   5394796  0.08               0.28                   0.76
chr3  119528806  119529245  5394796   5394834  0.00               0.00                   0.76
chr3  119529584  119530116  5394837   5394844  0.96               0.97                   0.96
chr3  119530396  119530598  5394846   5394856  0.94               0.91                   0.95
chr3  119531385  119531943  5394858   5394867  0.87               0.87                   0.96

It is also possible to calculate the average methylation of each block for groups of beta files with the -g group_file.csv flag. See DMR section for an example of a group file.

Counting homogenously methylated fragments

⚠️ Requires segmentation in the form of a wgbstools .bed file. See .bed. or use output from wgbstools segment⚠️
Using the homog command, we can analyze methylation patterns by block in different .pat files. The command generates a compressed .bed file for each sample, counting the number of reads by methylation status (Methylated, miXed, Unmethylated) in each block:

$ wgbstools homog *pat.gz -b blocks.small.bed -o homog_out --thresholds 0.25,0.75
$ cd homog_out
$ ls -1
Lung_STL002.small.uxm.bed.gz
Pancreas_STL002.small.uxm.bed.gz
Sigmoid_Colon_STL003.small.uxm.bed.gz
$ zcat Lung_STL002.small.uxm.bed.gz
chr3	119527929	119528187	5394767	5394772	0	4	6
chr3	119528217	119528243	5394774	5394777	3	9	16
chr3	119528246	119528388	5394777	5394784	10	23	11
chr3	119528405	119528431	5394784	5394787	9	5	0
chr3	119528639	119528783	5394789	5394796	27	2	1
chr3	119528806	119529245	5394796	5394834	190	3	0
chr3	119529584	119530116	5394837	5394844	0	2	80
chr3	119530396	119530598	5394846	5394856	0	16	143
chr3	119531385	119531943	5394858	5394867	2	24	75

In this example, we chose the threshold for a read to be classified as M to be 75% methylation and U to be 25%. The last three columns in the output are the number of U | X | M reads.

Differentially Methylated Regions

⚠️ Requires segmentation in the form of a wgbstools .bed file. See .bed. or use output from wgbstools segment⚠️
We can use the wgbstools find_markers command to find DMRs for two or more groups of samples, e.g. case and control, different tissues, etc. This command takes as input:

  • beta files: a set of beta files to find the DMRs for.
  • group file: a csv table or text file defining which beta files belong to each group.
  • blocks file: a wgbstools .bed file.

For each group defined in the group_file, find_markers will find all regions\blocks within the supplied blocks file that differentiate between the samples from each group when compared to samples from all other groups. Other than these required arguments, there are plenty of configuration arguments. See find_markers --help for more information.

Define groups using the following format, as a .csv file:

namegroup
file 1group A
file 2group A
file 3group A
file 4group B
file 5group B
file 6group C
file 7group C

For the example, we will use the following group file:

$ cat bams/groups.csv
name,group
Lung_STL002.small,lung
Pancreas_STL002.small,pancreas
Sigmoid_Colon_STL003.small,colon

And find DMRs for the colon, the pancreas, and the lung samples as follows:

$ wgbstools find_markers --blocks_path blocks.small.bed.gz --groups_file bams/groups.csv --betas *beta --delta_quants .3 --pval 1
dumped parameter file to ./params.txt
Number of markers found: 3
dumping to ./Markers.colon.bed
Number of markers found: 1
dumping to ./Markers.lung.bed
Number of markers found: 0
dumping to ./Markers.pancreas.bed

View the output markers (None found for the pancreas):

$ head Markers.*.bed

==> Markers.colon.bed <==
#chr    start   end     startCpG        endCpG  target  region  lenCpG  bp      tg_mean bg_mean delta_means     delta_quants   delta_maxmin     ttest   direction
chr3    119528405       119528431       5394784 5394787 colon   chr3:119528405-119528431        3CpGs   26bp    0.861   0.21   0.651    0.642   0.642   0.00953 M
chr3    119528639       119528783       5394789 5394796 colon   chr3:119528639-119528783        7CpGs   144bp   0.802   0.186  0.616    0.491   0.484   0.134   M
chr3    119528806       119529245       5394796 5394834 colon   chr3:119528806-119529245        38CpGs  439bp   0.757   0.006180.75     0.748   0.748   0.0019  M

==> Markers.lung.bed <==
#chr    start   end     startCpG        endCpG  target  region  lenCpG  bp      tg_mean bg_mean delta_means     delta_quants   delta_maxmin     ttest   direction
chr3    119528246       119528388       5394777 5394784 lung    chr3:119528246-119528388        7CpGs   142bp   0.428   0.774  0.346    0.3     0.298   0.0882  U

==> Markers.pancreas.bed <==
#chr    start   end     startCpG        endCpG  target  region  lenCpG  bp      tg_mean bg_mean delta_means     delta_quants   delta_maxmin     ttest   direction

The 10th-11th columns are the target and background methylation averages for this block. When there is more than one sample in a group, these values show the average across all samples in the group (e.g. for the first block, chr3:119528384-119528418, 0.21 is the average of the two "non-colon" group of samples). See supplemental/find_markers_config.txt for more information. The 12th is the difference between them (multiplied by 100). The ttest column is the p-value for a T-test. By default, DMRs with p-value>0.05 are filtered out (--pval 0.05 flag).

Let's take a look at the markers:

for target in `tail +2 bams/groups.csv| cut -f2 -d,`;
  do echo "=====\n$target\n=====";
  for r in `cut -f7 Markers.$target.bed`;
    do echo "$r\n";
    wgbstools vis *beta -r $r -b blocks.small.bed.gz;
    done;
  done

Bimodal and ASM analysis

Bimodal regions are those where CpG methylation is distributed from two differently behaving sources. One common example of this, in samples with pure cell types, is allele-specific methylation (ASM). Allele-specific methylation is the phenomenon whereby one allele is highly methylated and the other is lowly methylated. This occurs in sequence-dependent ASM, meQTLs, and imprinting control regions, as well as some other phenomena. wgbstools provides tools for bimodal and ASM analysis.

We begin by creating pat files for our bam:

wgbstools bam2pat bams/Left_Ventricle_STL001.IGF2.bam

We then set a region and visualize it:

region=chr11:2019196-2019796
wgbstools vis -r $region Left_Ventricle_STL001.IGF2.pat.gz

This region is inside of the well-known ICR of IGF2. We can see that most of the reads seem to be either mostly methylated or mostly unmethylated, i.e. bimodal, suggesting allele-specific methylation.

Bimodal analysis

We can get counts for the number of reads above 65% methylation, below 35% methylation, and in between, as well as visualization like so:

wgbstools vis -r $region Left_Ventricle_STL001.IGF2.pat.gz --uxm 0.65

We can see that out of ~280 reads we have 51.2% below 35% methylation, 47.3% above 65% methylation, and 1.4% with methylation levels between 35-65%.

We can conduct a statistical test to verify that this region is indeed bimodal. We will test whether the hypothesis that all reads are generated from a single distribution is more likely than the hypothesis that there are two distinct distributions (with differing probabilities for each CpG to be methylated) from which each read is generated.

wgbstools test_bimodal -r $region Left_Ventricle_STL001.IGF2.pat.gz
LL: -1401.2486342579095 | 281 reads | 1437 observed | BPI: 0.5087178153029501
LL: -457.82607811961105 | 281 reads | 1437 observed | BPI: 0.8018590355733447
pvalue: 0.000e+00

We can see that the log-likelihood of the two-distribution/two-allele model is much higher than the single-allele model and the p-value is ~0.

ASM analysis

We now wish to verify that the bimodality is indeed allele-specific methylation. In order to split the reads by allele, we identify a C/A heterozygous polymorphism at chr11:2019496, within our region of interest. This can be done with various tools (bisulfite SNP calling). We use the IGV to verify that there is indeed at a heterozygous polymorphism:

To see whether there is allele-specific methylation we collect all reads which contain the C genotype into one bam/pat and all the reads which contain the A genotype into another bam/pat file:

snp=chr11:2019496
wgbstools split_by_allele bams/Left_Ventricle_STL001.IGF2.bam $snp C/A --no_beta -f

We then visualize the results:

wgbstools vis -r $region Left_Ventricle_STL001.IGF2.chr11:2019496*pat.gz

As we can see, almost all of the reads with the A genotype are methylated while all of the reads with the C genotype are unmethylated, displaying clear ASM within the sample. In this case it is known that this region is imprinted, therefore further analysis of more samples will probably reveal that the methylation pattern is correlated with parental origin rather than with any specific SNP.

5hmC support (ONT, Biomodal DualSeq)

wgbstools supports modification-aware BAM files that carry MM/ML tags, such as those produced by Oxford Nanopore (ONT) or Biomodal DualSeq.

Automatic detection

These BAMs are auto-detected — if @RG PL:ONT is present in the header or MM:Z: tags are found in the first 200 reads, bam2pat automatically enables modification-aware mode. You can also force it with --nanopore:

# Auto-detected — no extra flags needed for most ONT / Biomodal BAMs
wgbstools bam2pat DualSeq_sample.bam

# Explicitly enable, and set a custom probability threshold
wgbstools bam2pat DualSeq_sample.bam --nanopore --np_thresh 0.8

When modification-aware mode is active, bam2pat uses -F 3844 -q 0 (exclude unmapped, secondary, QC-fail, duplicate, supplementary; no MAPQ filter).

PAT encoding

In the resulting PAT file, CpG sites are encoded as:

SymbolMeaning
CMethylated (5mC)
TUnmethylated
HHydroxymethylated (5hmC)
.Unknown

By default, H sites are counted as methylated in the beta file (same as C).

Visualizing 5hmC

PAT files with H characters are visualized as-is — H sites appear in yellow, C in red, and T in green. No special flag is needed.

By default, the methylation average shown above the visualization includes H sites as methylated. Use --hmc to report 5mC and 5hmC averages separately:

wgbstools vis ctrl_01.pat.gz --min_len 3 -r chr10:22631957-22632057 --hmc --text

Combining 5mC and 5hmC (--combine_mods)

If you don't need to distinguish 5mC from 5hmC, use --combine_mods to merge them into a single methylation signal. The 5mC and 5hmC probabilities are summed before thresholding, so the output contains only C/T/. (no H characters):

wgbstools bam2pat DualSeq_sample.bam --combine_mods

This is analogous to modkit's --combine-mods mode.

Biomodal DualSeq: C+C? handling

Biomodal BAMs may include a C+C? section in the MM tag for ambiguous modification calls. The --cpc_call flag controls how these positions appear in the PAT:

  • C (default): treat as methylated (5mC)
  • H: treat as hydroxymethylated (5hmC)
  • .: treat as unknown
wgbstools bam2pat DualSeq_sample.bam --cpc_call H