Building Custom Reference Methylation Maps for CGI+Shore Tissue-of-Origin Analysis

April 7, 2026 · View on GitHub

This tutorial explains how to build your own reference methylation atlas restricted to CpG Island (CGI) and CGI shore regions, and how to use it for tissues-of-origin deconvolution with FinaleMe-predicted methylation data using BetaValueDeconvolution.

This is an optional advanced tutorial. The default FinaleMe run workflow does not require wgbstools or UXM_deconv.

Background
Prerequisites
Prepare Reference WGBS Data
Run the CGI+Shore Marker Pipeline
Understanding the Pipeline Stages
Use the Atlas for Deconvolution
Customization and Advanced Usage
Output File Reference
Troubleshooting

1. Background

Why CGI+Shore restricted markers?

FinaleMe predicts per-CpG methylation from cell-free DNA (cfDNA) whole-genome sequencing data. However, FinaleMe predictions are most accurate in CpG Island (CGI) and CGI shore regions due to the distinct methylation patterns and higher CpG density in these regions.

The standard UXM deconvolution (Loyfer et al., Nature 2023) uses genome-wide methylation markers to estimate tissue-of-origin composition. These markers were selected across the entire genome, so only a small fraction (~5-10%) fall within CGI+shore regions. When applied to FinaleMe predictions that are only reliable in CGI+shore regions, most markers carry noisy or unreliable signal.

To address this, we provide a pipeline (generate_cgi_shore_markers.py) that:

Defines CGI+shore regions (CGI extended by +/-2kb)
Identifies tissue-specific differentially methylated regions within CGI+shore only
Builds a CGI+shore atlas using only these regions
Produces marker/atlas files directly usable with BetaValueDeconvolution -markerRegions

What is a reference methylation atlas?

A reference atlas is a matrix where:

Rows = marker regions (genomic intervals with tissue-specific methylation)
Columns = cell types / tissues
Values = fraction of unmethylated (U) or methylated (M) reads at each marker in each reference cell type

During deconvolution, the observed methylation pattern in your cfDNA sample is decomposed as a linear mixture of the reference cell type profiles using non-negative least squares (NNLS).

2. Prerequisites

2.1 Software dependencies

Install the following tools before running the pipeline:

# 1. wgbs_tools (required for marker selection and atlas building)
git clone https://github.com/nloyfer/wgbs_tools.git
cd wgbs_tools
python setup.py
cd ..

# 2. UXM_deconv (required for atlas building in this custom-map workflow)
git clone https://github.com/nloyfer/UXM_deconv.git
cd UXM_deconv
pip install -r requirements.txt
cd ..

# 3. bedtools (required for region intersection)
# macOS:
brew install bedtools
# Linux:
conda install -c bioconda bedtools

# 4. Python packages
pip install pandas==1.5.3 numpy==1.21.6 matplotlib==3.4.3 scipy==1.9.0

2.2 Initialize the genome reference

If you haven't already, initialize the wgbs_tools genome reference:

wgbstools init_genome hg19
# or
wgbstools init_genome hg38

This creates the CpG index file required for pat/beta operations.

2.3 Directory structure

We recommend the following layout:

project/
├── reference_wgbs/           # Reference WGBS data
│   ├── betas/                # .beta files (one per sample)
│   ├── pats/                 # .pat.gz files (one per sample)
│   └── groups.csv            # Sample-to-cell-type mapping
├── results/                  # Pipeline output directory
└── scripts/                  # Contains generate_cgi_shore_markers.py

3. Prepare Reference WGBS Data

The pipeline requires three types of input from reference whole-genome bisulfite sequencing (WGBS) experiments:

3.1 Reference beta files

Beta files store per-CpG methylation levels in a compact binary format. Each file represents one WGBS sample.

Format: Binary file with NR_SITES rows x 2 columns of uint8 values:

Column 0: methylated read count (0-255)
Column 1: total read count (0-255)

Where to obtain reference beta files:

Option A: Loyfer et al. atlas (recommended for standard cell types): Download from GEO accession GSE186458. Convert BAM files to beta using:
```
wgbstools bam2pat sample.bam -o sample
# This creates sample.pat.gz and sample.beta (or directly use their beta files in GEO)
```

Option B: Your own WGBS data: If you have WGBS BAM files for your cell types of interest:

# Convert each BAM to pat + beta format
wgbstools bam2pat your_sample.bam -o your_sample --genome hg19

Option C: Public WGBS databases:
- ENCODE: https://www.encodeproject.org/ (filter for WGBS assay)
- Roadmap Epigenomics: https://egg2.wustl.edu/roadmap/web_portal/
- IHEC: https://epigenomesportal.ca/ihec/

3.2 Reference pat files

Pat files store per-read methylation patterns, which UXM uses to classify reads as Unmethylated (U), miXed (X), or Methylated (M).

Format: Tab-separated, bgzip-compressed (.pat.gz):

chr    cpg_index    pattern    count
chr1   1            CTT        5
chr1   4            CCC        3

Pat files are generated alongside beta files by wgbstools bam2pat.

3.3 Groups file

The groups file maps each sample name to its cell type / tissue group. This is a CSV file with at least two columns:

name,group
Adipocytes_1,Adipocytes
Adipocytes_2,Adipocytes
Blood_B_1,Blood-B
Blood_B_2,Blood-B
Blood_T_1,Blood-T
Liver_Hep_1,Liver-Hep
Liver_Hep_2,Liver-Hep
Neuron_1,Neuron
Neuron_2,Neuron

Rules:

The name column must match the basename of the beta/pat files (without the .beta or .pat.gz extension). For example, if the beta file is /data/betas/Liver_Hep_1.beta, the name should be Liver_Hep_1.
The group column defines the cell type label. All samples with the same group label are treated as biological replicates.
You can optionally add an include column (boolean) to exclude certain samples.
Having 2+ replicates per group enables statistical testing (t-test) during marker selection.

3.4 Example: preparing a small reference panel

Here is a minimal example with 3 cell types and 2 replicates each:

# Convert BAM files to wgbstools format
mkdir -p reference_wgbs/betas reference_wgbs/pats

for bam in Liver_Hep_1.bam Liver_Hep_2.bam \
           Neuron_1.bam Neuron_2.bam \
           Blood_T_1.bam Blood_T_2.bam; do
    sample=$(basename $bam .bam)
    wgbstools bam2pat $bam -o reference_wgbs/pats/$sample --genome hg19
    # This creates both .pat.gz and .beta
    mv reference_wgbs/pats/${sample}.beta reference_wgbs/betas/
done

# Create groups file
cat > reference_wgbs/groups.csv << 'EOF'
name,group
Liver_Hep_1,Liver-Hep
Liver_Hep_2,Liver-Hep
Neuron_1,Neuron
Neuron_2,Neuron
Blood_T_1,Blood-T
Blood_T_2,Blood-T
EOF

4. Run the CGI+Shore Marker Pipeline

4.1 Quick start (minimal command)

python scripts/generate_cgi_shore_markers.py \
  --genome hg19 \
  --betas reference_wgbs/betas/*.beta \
  --pats reference_wgbs/pats/*.pat.gz \
  --groups reference_wgbs/groups.csv \
  --num-markers 250 \
  --delta-means 0.4 \
  --unmeth-mean-thresh 0.1 \
  --meth-mean-thresh 0.5 \
  --min-cpg 1 \
  --max-cpg 1000 \
  --rlen 3 \
  --threads 10 \
  --out-dir results/cgi_shore_atlas/ \
  --wgbstools-path /path/to/wgbs_tools \
  --uxm-path /path/to/UXM_deconv \
  -v

This runs all five stages automatically and produces a ready-to-use atlas at results/cgi_shore_atlas/Atlas.CGI_shore.U250.l3.hg19.tsv.

4.2 Full command with all options

python scripts/generate_cgi_shore_markers.py \
  --genome hg19 \
  --betas reference_wgbs/betas/*.beta \
  --pats reference_wgbs/pats/*.pat.gz \
  --groups reference_wgbs/groups.csv \
  --blocks /path/to/GSE186458_blocks.s205.bed.gz \
  --cgi-bed data/cpgIslandExt.hg19.bed \
  --shore-size 2000 \
  --chrom-sizes data/hg19.chrom.sizes \
  --num-markers 250 \
  --delta-means 0.4 \
  --unmeth-mean-thresh 0.1 \
  --meth-mean-thresh 0.5 \
  --min-cpg 1 \
  --max-cpg 1000 \
  --min-bp 50 \
  --max-bp 5000 \
  --rlen 3 \
  --threads 10 \
  --out-dir results/cgi_shore_atlas/ \
  --wgbstools-path /path/to/wgbs_tools \
  --uxm-path /path/to/UXM_deconv \
  --force \
  -v

4.3 Command-line options reference

Option	Tutorial default	Description
`--genome`	(required)	Genome build: `hg19` or `hg38`
`--betas`	(required)	Reference WGBS beta files (glob pattern)
`--pats`	(required)	Reference WGBS pat.gz files (glob pattern)
`--groups`	(required)	Groups CSV file (sample-to-cell-type mapping)
`--out-dir`	(required)	Output directory for all results
`--wgbstools-path`	(required)	Path to wgbs_tools installation
`--uxm-path`	(required)	Path to UXM_deconv installation
`--cgi-bed`	auto-download	CpG Island BED file from UCSC
`--shore-size`	2000	Shore extension in bp around each CGI
`--chrom-sizes`	none	Chromosome sizes file for boundary capping
`--blocks`	optional	Pre-existing blocks file (recommended for reproducibility)
`--num-markers`	250	Target markers per cell type
`--delta-means`	0.4	Minimum methylation difference for markers
`--unmeth-mean-thresh`	0.1	Upper methylation bound for unmethylated markers
`--meth-mean-thresh`	0.5	Lower methylation bound for methylated markers
`--min-cpg`	1	Minimum CpGs per candidate block
`--max-cpg`	1000	Maximum CpGs per candidate block
`--min-bp`	50	Minimum block length in bp
`--max-bp`	5000	Maximum block length in bp
`--rlen`	3	Minimum CpGs per read for U/X/M classification
`--threads`	10	Number of parallel threads
`--force`	false	Overwrite existing output files
`--verbose`	false	Print detailed progress information

5. Understanding the Pipeline Stages

Stage 1: Generate CGI+Shore BED Regions

The pipeline first defines the genomic regions where FinaleMe predictions are reliable.

Downloads UCSC CpG Island annotation (or uses a user-provided file)
Extends each CGI by --shore-size bp in both directions (default: +/-2kb)
Merges overlapping intervals to avoid double-counting

For hg19, this typically produces ~22,000 merged regions covering ~180 Mb (~6% of the genome).

Output: cgi_plus_shore.{genome}.bed

Stage 2: Generate Candidate Blocks

The pipeline needs candidate genomic intervals ("blocks") to test for tissue-specific methylation. There are two approaches:

Default (segmentation): Runs wgbstools segment on your reference beta files to identify regions with coherent methylation patterns, then intersects these with CGI+shore regions. This produces biologically meaningful boundaries.
Pre-existing blocks: If you already have a blocks file (e.g., from a previous segmentation, or just directly use the hg19/hg38 block files from GSE186458), pass it via --blocks and the pipeline will just intersect it with CGI+shore.

Blocks are filtered to retain only those with 3-50 CpGs and 50-5000 bp length.

Output: cgi_shore_blocks.{genome}.bed

Stage 3: Find Tissue-Specific Markers

This is the core marker selection stage. For each cell type, the pipeline identifies blocks where that cell type has distinctly different methylation compared to all others.

Marker types:

U markers: The target cell type is unmethylated while background is methylated
M markers: The target cell type is methylated while background is unmethylated

Selection criteria:

Mean methylation difference (delta_means) >= threshold
t-test p-value <= 0.05
Sufficient coverage across replicates

Adaptive threshold relaxation: If a cell type has fewer than 5 markers at the initial threshold (default 0.35), the pipeline automatically relaxes to 0.25, then 0.20, then 0.15, ensuring every cell type gets at least some markers.

Output: markers/Markers.{CellType}.bed and merged Markers.CGI_shore.U{N}.{genome}.tsv

Stage 4: Build UXM Atlas

The selected markers are converted into a UXM-compatible atlas by computing U/X/M read fractions for each marker in each reference cell type.

For each marker region and each reference pat file:

wgbstools homog counts reads that are fully Unmethylated (U), miXed (X), or fully Methylated (M)
Counts are aggregated across replicates within each cell type
Normalized to fractions that sum to 1.0

Output: Atlas.CGI_shore.U{N}.l{rlen}.{genome}.tsv

Stage 5: Validation Report

The pipeline generates a summary report including:

CGI+shore region statistics (count, total coverage)
Number of candidate blocks
Markers per cell type (with shortfall warnings)
Marker quality statistics (mean delta, p-values)
Overlap analysis with the original genome-wide UXM U250 markers

Output: report.txt

6. Use the Atlas for Deconvolution

Once you have the atlas, you can deconvolve any FinaleMe-predicted cfDNA sample.

6.1 Generate FinaleMe decode output

First, run FinaleMe decode to produce *.prediction.bed.gz:

JAR="target/FinaleMe-0.60-jar-with-dependencies.jar"

java -Xmx20G -cp "$JAR" \
  edu.northwestern.epifluidlab.finaleme.hmm.FinaleMe \
  results/sample.FinaleMe.model \
  results/sample.cpg_features.hg19.bed.gz \
  results/sample.decode.prediction.bed.gz \
  -decodeModeOnly

6.2 Run `BetaValueDeconvolution` with the CGI+shore atlas

Use the tested preset:

JAR="target/FinaleMe-0.61-jar-with-dependencies.jar"

java -Xmx20G -cp "$JAR" \
  edu.northwestern.epifluidlab.finaleme.utils.BetaValueDeconvolution \
  -binarizeThreshold 0.1 \
  -markerRegions results/cgi_shore_atlas/Atlas.CGI_shore.U250.l3.hg19.tsv \
  -refBetas results/cgi_shore_atlas/reference_wgbs/betas/beta_list.txt \
  -refGroups results/cgi_shore_atlas/groups_fixed.csv \
  -cpgIndex data/CpG_index.hg19.bed.gz \
  -solver NNLS \
  -output results/sample.deconv.beta.tsv \
  results/sample.decode.prediction.bed.gz

Make beta_list.txt if needed:

ls reference_wgbs/betas/*.beta > results/cgi_shore_atlas/reference_wgbs/betas/beta_list.txt

6.3 Interpret results

The output TSV is a tissue-fraction matrix (rows = cell types, columns = samples). For example:

cell_type	sample.decode.prediction.bed.gz
Blood-B	0.0230
Blood-T	0.1560
Liver-Hep	0.0890

Values represent the estimated fraction of cfDNA originating from each cell type. They sum approximately to 1.0 (minor deviations are normal due to the NNLS solver).

7. Customization and Advanced Usage

7.1 Choosing the number of markers

250 markers per cell type (--num-markers 250): Recommended preset in this tutorial for BetaValueDeconvolution.
25 markers per cell type (--num-markers 25): Use when references are sparse or when you want stricter/high-confidence marker subsets.

7.2 Adjusting the shore size

The default shore size is 2kb, a standard definition in the epigenomics literature. You can adjust it:

# Narrow: CGI + 1kb shore
--shore-size 1000

# Wide: CGI + 4kb shore
--shore-size 4000

Wider shores include more candidate regions but may extend into areas where FinaleMe predictions are less accurate.

7.3 Using custom cell types

To build an atlas for a specific subset of cell types, create a groups file with only the cell types you need:

name,group
Blood_T_1,Blood-T
Blood_T_2,Blood-T
Liver_Hep_1,Liver-Hep
Liver_Hep_2,Liver-Hep
Tumor_1,Tumor
Tumor_2,Tumor

This is useful when you have custom WGBS data for cell types not in the Loyfer et al. reference (e.g., tumor subtypes, disease-specific cell populations).

7.4 Relaxing marker selection thresholds

If you're working with cell types that have subtle methylation differences in CGI+shore regions, you can start with a lower threshold:

--delta-means 0.25

The pipeline's adaptive relaxation will still try progressively lower thresholds for cell types with insufficient markers.

7.5 Using pre-existing blocks

If you have previously segmented the genome or want to use a specific block definition:

python scripts/generate_cgi_shore_markers.py \
  --blocks /path/to/my_blocks.bed \
  --genome hg19 \
  ... # other options

The blocks file must have at least 5 columns: chr, start, end, startCpG, endCpG.

7.6 Building for both hg19 and hg38

Run the pipeline separately for each genome build:

# hg19
python scripts/generate_cgi_shore_markers.py \
  --genome hg19 \
  --betas ref_hg19/betas/*.beta \
  --pats ref_hg19/pats/*.pat.gz \
  --groups ref_hg19/groups.csv \
  --out-dir results/atlas_hg19/ \
  ...

# hg38
python scripts/generate_cgi_shore_markers.py \
  --genome hg38 \
  --betas ref_hg38/betas/*.beta \
  --pats ref_hg38/pats/*.pat.gz \
  --groups ref_hg38/groups.csv \
  --out-dir results/atlas_hg38/ \
  ...

8. Output File Reference

After running the pipeline, the output directory contains:

results/cgi_shore_atlas/
├── cgi_plus_shore.hg19.bed                    # Merged CGI+shore regions
├── cgi_shore_blocks.hg19.bed                  # Candidate blocks within CGI+shore
├── markers/                                   # Per-cell-type marker files
│   ├── Markers.Blood-T.bed
│   ├── Markers.Liver-Hep.bed
│   ├── Markers.Neuron.bed
│   ├── ...
│   └── pass_0/                                # Intermediate results per threshold pass
├── Markers.CGI_shore.U250.hg19.tsv            # All markers merged (input to atlas builder)
├── Atlas.CGI_shore.U250.l3.hg19.tsv           # Final atlas (input to BetaValueDeconvolution -markerRegions)
└── report.txt                                 # Validation summary

Key file formats

Markers file (Markers.CGI_shore.U250.hg19.tsv):

Column	Description
#chr	Chromosome
start	Region start (0-based)
end	Region end
startCpG	First CpG index in region
endCpG	Last CpG index in region
target	Cell type this marker is specific to
region	Region string (e.g., chr1:12345-67890)
lenCpG	Number of CpGs in region
bp	Region length in bp
tg_mean	Mean methylation in target cell type
bg_mean	Mean methylation in background
delta_means	Difference between bg_mean and tg_mean
delta_quants	Quantile-based difference
delta_maxmin	Max-min based difference
ttest	t-test p-value
direction	U (unmethylated in target) or M (methylated in target)

Atlas file (Atlas.CGI_shore.U250.l3.hg19.tsv):

Column	Description
chr	Chromosome
start	Region start
end	Region end
startCpG	First CpG index
endCpG	Last CpG index
target	Cell type this marker is specific to
name	Region string
direction	U or M
CellType1	U/M fraction for cell type 1
CellType2	U/M fraction for cell type 2
...	(one column per cell type)

9. Troubleshooting

"wgbstools not found"

Ensure wgbs_tools is installed and the path is correct:

# Check if wgbstools is accessible
/path/to/wgbs_tools/wgbstools --help

# Pass the correct path
--wgbstools-path /path/to/wgbs_tools

"bedtools not found"

Install bedtools:

# macOS
brew install bedtools
# Linux
conda install -c bioconda bedtools

Too few markers for some cell types

This is expected when restricting to CGI+shore. The pipeline automatically relaxes thresholds for underrepresented cell types. Check report.txt for details.

If critical cell types have very few markers (<5), consider:

Increasing --shore-size to include more candidate regions
Lowering --delta-means to 0.2 or 0.15
Adding more reference samples for that cell type

Memory errors during segmentation

If wgbstools segment runs out of memory with many beta files, the pipeline automatically uses only the first 10 files. You can also provide pre-computed blocks:

# Segment separately with resource controls
wgbstools segment --betas subset_of_betas/*.beta -o my_blocks.bed -@ 4

# Then use pre-existing blocks
--blocks my_blocks.bed

Slow atlas building

Atlas building calls wgbstools homog for each marker x sample combination. Speed up with:

--threads 10   # Use more threads

Results are cached in UXM_deconv/tmp_dir/, so re-runs are much faster.

Atlas has NaN values

NaN values appear when a reference sample has zero coverage at a marker region. This is handled gracefully by BetaValueDeconvolution but may reduce accuracy. Ensure your reference WGBS data has sufficient genome-wide coverage (>10x recommended).

References

Loyfer N, et al. "A DNA methylation atlas of normal human cell types." Nature 613, 355-364 (2023). https://doi.org/10.1038/s41586-022-05580-6
wgbs_tools: https://github.com/nloyfer/wgbs_tools
UXM_deconv: https://github.com/nloyfer/UXM_deconv