Building Custom Reference Methylation Maps for CGI+Shore Tissue-of-Origin Analysis
April 7, 2026 · View on GitHub
This tutorial explains how to build your own reference methylation atlas restricted to
CpG Island (CGI) and CGI shore regions, and how to use it for tissues-of-origin
deconvolution with FinaleMe-predicted methylation data using BetaValueDeconvolution.
This is an optional advanced tutorial. The default FinaleMe run workflow does not require
wgbstools or UXM_deconv.
Table of Contents
- Background
- Prerequisites
- Prepare Reference WGBS Data
- Run the CGI+Shore Marker Pipeline
- Understanding the Pipeline Stages
- Use the Atlas for Deconvolution
- Customization and Advanced Usage
- Output File Reference
- Troubleshooting
1. Background
Why CGI+Shore restricted markers?
FinaleMe predicts per-CpG methylation from cell-free DNA (cfDNA) whole-genome sequencing data. However, FinaleMe predictions are most accurate in CpG Island (CGI) and CGI shore regions due to the distinct methylation patterns and higher CpG density in these regions.
The standard UXM deconvolution (Loyfer et al., Nature 2023) uses genome-wide methylation markers to estimate tissue-of-origin composition. These markers were selected across the entire genome, so only a small fraction (~5-10%) fall within CGI+shore regions. When applied to FinaleMe predictions that are only reliable in CGI+shore regions, most markers carry noisy or unreliable signal.
To address this, we provide a pipeline (generate_cgi_shore_markers.py) that:
- Defines CGI+shore regions (CGI extended by +/-2kb)
- Identifies tissue-specific differentially methylated regions within CGI+shore only
- Builds a CGI+shore atlas using only these regions
- Produces marker/atlas files directly usable with
BetaValueDeconvolution -markerRegions
What is a reference methylation atlas?
A reference atlas is a matrix where:
- Rows = marker regions (genomic intervals with tissue-specific methylation)
- Columns = cell types / tissues
- Values = fraction of unmethylated (U) or methylated (M) reads at each marker in each reference cell type
During deconvolution, the observed methylation pattern in your cfDNA sample is decomposed as a linear mixture of the reference cell type profiles using non-negative least squares (NNLS).
2. Prerequisites
2.1 Software dependencies
Install the following tools before running the pipeline:
# 1. wgbs_tools (required for marker selection and atlas building)
git clone https://github.com/nloyfer/wgbs_tools.git
cd wgbs_tools
python setup.py
cd ..
# 2. UXM_deconv (required for atlas building in this custom-map workflow)
git clone https://github.com/nloyfer/UXM_deconv.git
cd UXM_deconv
pip install -r requirements.txt
cd ..
# 3. bedtools (required for region intersection)
# macOS:
brew install bedtools
# Linux:
conda install -c bioconda bedtools
# 4. Python packages
pip install pandas==1.5.3 numpy==1.21.6 matplotlib==3.4.3 scipy==1.9.0
2.2 Initialize the genome reference
If you haven't already, initialize the wgbs_tools genome reference:
wgbstools init_genome hg19
# or
wgbstools init_genome hg38
This creates the CpG index file required for pat/beta operations.
2.3 Directory structure
We recommend the following layout:
project/
├── reference_wgbs/ # Reference WGBS data
│ ├── betas/ # .beta files (one per sample)
│ ├── pats/ # .pat.gz files (one per sample)
│ └── groups.csv # Sample-to-cell-type mapping
├── results/ # Pipeline output directory
└── scripts/ # Contains generate_cgi_shore_markers.py
3. Prepare Reference WGBS Data
The pipeline requires three types of input from reference whole-genome bisulfite sequencing (WGBS) experiments:
3.1 Reference beta files
Beta files store per-CpG methylation levels in a compact binary format. Each file represents one WGBS sample.
Format: Binary file with NR_SITES rows x 2 columns of uint8 values:
- Column 0: methylated read count (0-255)
- Column 1: total read count (0-255)
Where to obtain reference beta files:
-
Option A: Loyfer et al. atlas (recommended for standard cell types): Download from GEO accession GSE186458. Convert BAM files to beta using:
wgbstools bam2pat sample.bam -o sample # This creates sample.pat.gz and sample.beta (or directly use their beta files in GEO) -
Option B: Your own WGBS data: If you have WGBS BAM files for your cell types of interest:
# Convert each BAM to pat + beta format wgbstools bam2pat your_sample.bam -o your_sample --genome hg19 -
Option C: Public WGBS databases:
- ENCODE: https://www.encodeproject.org/ (filter for WGBS assay)
- Roadmap Epigenomics: https://egg2.wustl.edu/roadmap/web_portal/
- IHEC: https://epigenomesportal.ca/ihec/
3.2 Reference pat files
Pat files store per-read methylation patterns, which UXM uses to classify reads as Unmethylated (U), miXed (X), or Methylated (M).
Format: Tab-separated, bgzip-compressed (.pat.gz):
chr cpg_index pattern count
chr1 1 CTT 5
chr1 4 CCC 3
Pat files are generated alongside beta files by wgbstools bam2pat.
3.3 Groups file
The groups file maps each sample name to its cell type / tissue group. This is a CSV file with at least two columns:
name,group
Adipocytes_1,Adipocytes
Adipocytes_2,Adipocytes
Blood_B_1,Blood-B
Blood_B_2,Blood-B
Blood_T_1,Blood-T
Liver_Hep_1,Liver-Hep
Liver_Hep_2,Liver-Hep
Neuron_1,Neuron
Neuron_2,Neuron
Rules:
- The
namecolumn must match the basename of the beta/pat files (without the.betaor.pat.gzextension). For example, if the beta file is/data/betas/Liver_Hep_1.beta, the name should beLiver_Hep_1. - The
groupcolumn defines the cell type label. All samples with the samegrouplabel are treated as biological replicates. - You can optionally add an
includecolumn (boolean) to exclude certain samples. - Having 2+ replicates per group enables statistical testing (t-test) during marker selection.
3.4 Example: preparing a small reference panel
Here is a minimal example with 3 cell types and 2 replicates each:
# Convert BAM files to wgbstools format
mkdir -p reference_wgbs/betas reference_wgbs/pats
for bam in Liver_Hep_1.bam Liver_Hep_2.bam \
Neuron_1.bam Neuron_2.bam \
Blood_T_1.bam Blood_T_2.bam; do
sample=$(basename $bam .bam)
wgbstools bam2pat $bam -o reference_wgbs/pats/$sample --genome hg19
# This creates both .pat.gz and .beta
mv reference_wgbs/pats/${sample}.beta reference_wgbs/betas/
done
# Create groups file
cat > reference_wgbs/groups.csv << 'EOF'
name,group
Liver_Hep_1,Liver-Hep
Liver_Hep_2,Liver-Hep
Neuron_1,Neuron
Neuron_2,Neuron
Blood_T_1,Blood-T
Blood_T_2,Blood-T
EOF
4. Run the CGI+Shore Marker Pipeline
4.1 Quick start (minimal command)
python scripts/generate_cgi_shore_markers.py \
--genome hg19 \
--betas reference_wgbs/betas/*.beta \
--pats reference_wgbs/pats/*.pat.gz \
--groups reference_wgbs/groups.csv \
--num-markers 250 \
--delta-means 0.4 \
--unmeth-mean-thresh 0.1 \
--meth-mean-thresh 0.5 \
--min-cpg 1 \
--max-cpg 1000 \
--rlen 3 \
--threads 10 \
--out-dir results/cgi_shore_atlas/ \
--wgbstools-path /path/to/wgbs_tools \
--uxm-path /path/to/UXM_deconv \
-v
This runs all five stages automatically and produces a ready-to-use atlas at
results/cgi_shore_atlas/Atlas.CGI_shore.U250.l3.hg19.tsv.
4.2 Full command with all options
python scripts/generate_cgi_shore_markers.py \
--genome hg19 \
--betas reference_wgbs/betas/*.beta \
--pats reference_wgbs/pats/*.pat.gz \
--groups reference_wgbs/groups.csv \
--blocks /path/to/GSE186458_blocks.s205.bed.gz \
--cgi-bed data/cpgIslandExt.hg19.bed \
--shore-size 2000 \
--chrom-sizes data/hg19.chrom.sizes \
--num-markers 250 \
--delta-means 0.4 \
--unmeth-mean-thresh 0.1 \
--meth-mean-thresh 0.5 \
--min-cpg 1 \
--max-cpg 1000 \
--min-bp 50 \
--max-bp 5000 \
--rlen 3 \
--threads 10 \
--out-dir results/cgi_shore_atlas/ \
--wgbstools-path /path/to/wgbs_tools \
--uxm-path /path/to/UXM_deconv \
--force \
-v
4.3 Command-line options reference
| Option | Tutorial default | Description |
|---|---|---|
--genome | (required) | Genome build: hg19 or hg38 |
--betas | (required) | Reference WGBS beta files (glob pattern) |
--pats | (required) | Reference WGBS pat.gz files (glob pattern) |
--groups | (required) | Groups CSV file (sample-to-cell-type mapping) |
--out-dir | (required) | Output directory for all results |
--wgbstools-path | (required) | Path to wgbs_tools installation |
--uxm-path | (required) | Path to UXM_deconv installation |
--cgi-bed | auto-download | CpG Island BED file from UCSC |
--shore-size | 2000 | Shore extension in bp around each CGI |
--chrom-sizes | none | Chromosome sizes file for boundary capping |
--blocks | optional | Pre-existing blocks file (recommended for reproducibility) |
--num-markers | 250 | Target markers per cell type |
--delta-means | 0.4 | Minimum methylation difference for markers |
--unmeth-mean-thresh | 0.1 | Upper methylation bound for unmethylated markers |
--meth-mean-thresh | 0.5 | Lower methylation bound for methylated markers |
--min-cpg | 1 | Minimum CpGs per candidate block |
--max-cpg | 1000 | Maximum CpGs per candidate block |
--min-bp | 50 | Minimum block length in bp |
--max-bp | 5000 | Maximum block length in bp |
--rlen | 3 | Minimum CpGs per read for U/X/M classification |
--threads | 10 | Number of parallel threads |
--force | false | Overwrite existing output files |
--verbose | false | Print detailed progress information |
5. Understanding the Pipeline Stages
Stage 1: Generate CGI+Shore BED Regions
The pipeline first defines the genomic regions where FinaleMe predictions are reliable.
- Downloads UCSC CpG Island annotation (or uses a user-provided file)
- Extends each CGI by
--shore-sizebp in both directions (default: +/-2kb) - Merges overlapping intervals to avoid double-counting
For hg19, this typically produces ~22,000 merged regions covering ~180 Mb (~6% of the genome).
Output: cgi_plus_shore.{genome}.bed
Stage 2: Generate Candidate Blocks
The pipeline needs candidate genomic intervals ("blocks") to test for tissue-specific methylation. There are two approaches:
-
Default (segmentation): Runs
wgbstools segmenton your reference beta files to identify regions with coherent methylation patterns, then intersects these with CGI+shore regions. This produces biologically meaningful boundaries. -
Pre-existing blocks: If you already have a blocks file (e.g., from a previous segmentation, or just directly use the hg19/hg38 block files from GSE186458), pass it via
--blocksand the pipeline will just intersect it with CGI+shore.
Blocks are filtered to retain only those with 3-50 CpGs and 50-5000 bp length.
Output: cgi_shore_blocks.{genome}.bed
Stage 3: Find Tissue-Specific Markers
This is the core marker selection stage. For each cell type, the pipeline identifies blocks where that cell type has distinctly different methylation compared to all others.
Marker types:
- U markers: The target cell type is unmethylated while background is methylated
- M markers: The target cell type is methylated while background is unmethylated
Selection criteria:
- Mean methylation difference (delta_means) >= threshold
- t-test p-value <= 0.05
- Sufficient coverage across replicates
Adaptive threshold relaxation: If a cell type has fewer than 5 markers at the initial threshold (default 0.35), the pipeline automatically relaxes to 0.25, then 0.20, then 0.15, ensuring every cell type gets at least some markers.
Output: markers/Markers.{CellType}.bed and merged Markers.CGI_shore.U{N}.{genome}.tsv
Stage 4: Build UXM Atlas
The selected markers are converted into a UXM-compatible atlas by computing U/X/M read fractions for each marker in each reference cell type.
For each marker region and each reference pat file:
wgbstools homogcounts reads that are fully Unmethylated (U), miXed (X), or fully Methylated (M)- Counts are aggregated across replicates within each cell type
- Normalized to fractions that sum to 1.0
Output: Atlas.CGI_shore.U{N}.l{rlen}.{genome}.tsv
Stage 5: Validation Report
The pipeline generates a summary report including:
- CGI+shore region statistics (count, total coverage)
- Number of candidate blocks
- Markers per cell type (with shortfall warnings)
- Marker quality statistics (mean delta, p-values)
- Overlap analysis with the original genome-wide UXM U250 markers
Output: report.txt
6. Use the Atlas for Deconvolution
Once you have the atlas, you can deconvolve any FinaleMe-predicted cfDNA sample.
6.1 Generate FinaleMe decode output
First, run FinaleMe decode to produce *.prediction.bed.gz:
JAR="target/FinaleMe-0.60-jar-with-dependencies.jar"
java -Xmx20G -cp "$JAR" \
edu.northwestern.epifluidlab.finaleme.hmm.FinaleMe \
results/sample.FinaleMe.model \
results/sample.cpg_features.hg19.bed.gz \
results/sample.decode.prediction.bed.gz \
-decodeModeOnly
6.2 Run BetaValueDeconvolution with the CGI+shore atlas
Use the tested preset:
JAR="target/FinaleMe-0.61-jar-with-dependencies.jar"
java -Xmx20G -cp "$JAR" \
edu.northwestern.epifluidlab.finaleme.utils.BetaValueDeconvolution \
-binarizeThreshold 0.1 \
-markerRegions results/cgi_shore_atlas/Atlas.CGI_shore.U250.l3.hg19.tsv \
-refBetas results/cgi_shore_atlas/reference_wgbs/betas/beta_list.txt \
-refGroups results/cgi_shore_atlas/groups_fixed.csv \
-cpgIndex data/CpG_index.hg19.bed.gz \
-solver NNLS \
-output results/sample.deconv.beta.tsv \
results/sample.decode.prediction.bed.gz
Make beta_list.txt if needed:
ls reference_wgbs/betas/*.beta > results/cgi_shore_atlas/reference_wgbs/betas/beta_list.txt
6.3 Interpret results
The output TSV is a tissue-fraction matrix (rows = cell types, columns = samples). For example:
cell_type sample.decode.prediction.bed.gz
Blood-B 0.0230
Blood-T 0.1560
Liver-Hep 0.0890
Values represent the estimated fraction of cfDNA originating from each cell type. They sum approximately to 1.0 (minor deviations are normal due to the NNLS solver).
7. Customization and Advanced Usage
7.1 Choosing the number of markers
- 250 markers per cell type (
--num-markers 250): Recommended preset in this tutorial forBetaValueDeconvolution. - 25 markers per cell type (
--num-markers 25): Use when references are sparse or when you want stricter/high-confidence marker subsets.
7.2 Adjusting the shore size
The default shore size is 2kb, a standard definition in the epigenomics literature. You can adjust it:
# Narrow: CGI + 1kb shore
--shore-size 1000
# Wide: CGI + 4kb shore
--shore-size 4000
Wider shores include more candidate regions but may extend into areas where FinaleMe predictions are less accurate.
7.3 Using custom cell types
To build an atlas for a specific subset of cell types, create a groups file with only the cell types you need:
name,group
Blood_T_1,Blood-T
Blood_T_2,Blood-T
Liver_Hep_1,Liver-Hep
Liver_Hep_2,Liver-Hep
Tumor_1,Tumor
Tumor_2,Tumor
This is useful when you have custom WGBS data for cell types not in the Loyfer et al. reference (e.g., tumor subtypes, disease-specific cell populations).
7.4 Relaxing marker selection thresholds
If you're working with cell types that have subtle methylation differences in CGI+shore regions, you can start with a lower threshold:
--delta-means 0.25
The pipeline's adaptive relaxation will still try progressively lower thresholds for cell types with insufficient markers.
7.5 Using pre-existing blocks
If you have previously segmented the genome or want to use a specific block definition:
python scripts/generate_cgi_shore_markers.py \
--blocks /path/to/my_blocks.bed \
--genome hg19 \
... # other options
The blocks file must have at least 5 columns: chr, start, end, startCpG, endCpG.
7.6 Building for both hg19 and hg38
Run the pipeline separately for each genome build:
# hg19
python scripts/generate_cgi_shore_markers.py \
--genome hg19 \
--betas ref_hg19/betas/*.beta \
--pats ref_hg19/pats/*.pat.gz \
--groups ref_hg19/groups.csv \
--out-dir results/atlas_hg19/ \
...
# hg38
python scripts/generate_cgi_shore_markers.py \
--genome hg38 \
--betas ref_hg38/betas/*.beta \
--pats ref_hg38/pats/*.pat.gz \
--groups ref_hg38/groups.csv \
--out-dir results/atlas_hg38/ \
...
8. Output File Reference
After running the pipeline, the output directory contains:
results/cgi_shore_atlas/
├── cgi_plus_shore.hg19.bed # Merged CGI+shore regions
├── cgi_shore_blocks.hg19.bed # Candidate blocks within CGI+shore
├── markers/ # Per-cell-type marker files
│ ├── Markers.Blood-T.bed
│ ├── Markers.Liver-Hep.bed
│ ├── Markers.Neuron.bed
│ ├── ...
│ └── pass_0/ # Intermediate results per threshold pass
├── Markers.CGI_shore.U250.hg19.tsv # All markers merged (input to atlas builder)
├── Atlas.CGI_shore.U250.l3.hg19.tsv # Final atlas (input to BetaValueDeconvolution -markerRegions)
└── report.txt # Validation summary
Key file formats
Markers file (Markers.CGI_shore.U250.hg19.tsv):
| Column | Description |
|---|---|
| #chr | Chromosome |
| start | Region start (0-based) |
| end | Region end |
| startCpG | First CpG index in region |
| endCpG | Last CpG index in region |
| target | Cell type this marker is specific to |
| region | Region string (e.g., chr1:12345-67890) |
| lenCpG | Number of CpGs in region |
| bp | Region length in bp |
| tg_mean | Mean methylation in target cell type |
| bg_mean | Mean methylation in background |
| delta_means | Difference between bg_mean and tg_mean |
| delta_quants | Quantile-based difference |
| delta_maxmin | Max-min based difference |
| ttest | t-test p-value |
| direction | U (unmethylated in target) or M (methylated in target) |
Atlas file (Atlas.CGI_shore.U250.l3.hg19.tsv):
| Column | Description |
|---|---|
| chr | Chromosome |
| start | Region start |
| end | Region end |
| startCpG | First CpG index |
| endCpG | Last CpG index |
| target | Cell type this marker is specific to |
| name | Region string |
| direction | U or M |
| CellType1 | U/M fraction for cell type 1 |
| CellType2 | U/M fraction for cell type 2 |
| ... | (one column per cell type) |
9. Troubleshooting
"wgbstools not found"
Ensure wgbs_tools is installed and the path is correct:
# Check if wgbstools is accessible
/path/to/wgbs_tools/wgbstools --help
# Pass the correct path
--wgbstools-path /path/to/wgbs_tools
"bedtools not found"
Install bedtools:
# macOS
brew install bedtools
# Linux
conda install -c bioconda bedtools
Too few markers for some cell types
This is expected when restricting to CGI+shore. The pipeline automatically relaxes
thresholds for underrepresented cell types. Check report.txt for details.
If critical cell types have very few markers (<5), consider:
- Increasing
--shore-sizeto include more candidate regions - Lowering
--delta-meansto 0.2 or 0.15 - Adding more reference samples for that cell type
Memory errors during segmentation
If wgbstools segment runs out of memory with many beta files, the pipeline
automatically uses only the first 10 files. You can also provide pre-computed blocks:
# Segment separately with resource controls
wgbstools segment --betas subset_of_betas/*.beta -o my_blocks.bed -@ 4
# Then use pre-existing blocks
--blocks my_blocks.bed
Slow atlas building
Atlas building calls wgbstools homog for each marker x sample combination. Speed up with:
--threads 10 # Use more threads
Results are cached in UXM_deconv/tmp_dir/, so re-runs are much faster.
Atlas has NaN values
NaN values appear when a reference sample has zero coverage at a marker region. This
is handled gracefully by BetaValueDeconvolution but may reduce accuracy. Ensure your reference
WGBS data has sufficient genome-wide coverage (>10x recommended).
References
- Loyfer N, et al. "A DNA methylation atlas of normal human cell types." Nature 613, 355-364 (2023). https://doi.org/10.1038/s41586-022-05580-6
- wgbs_tools: https://github.com/nloyfer/wgbs_tools
- UXM_deconv: https://github.com/nloyfer/UXM_deconv