FinaleMe Tutorial
April 7, 2026
This tutorial contains the full, detailed usage guide for FinaleMe v0.60.
If you want the shortest path to run the pipeline, start from README.md.
1. What FinaleMe does
FinaleMe predicts CpG methylation from cfDNA fragment features derived from BAM/CRAM or tabix-indexed fragment files. The standard workflow has five steps:
- Build feature matrix (`CpgFeatureMatrixBuilder`)
- Train HMM model (`FinaleMe`)
- Decode methylation (`FinaleMe`)
- Optional legacy conversion to bigWig (Perl helper)
- Tissues-of-origin deconvolution (`BetaValueDeconvolution` recommended; UXM compatibility optional)
2. Installation and reference setup
Default pipeline requirement summary:
- Steps 1-4 in this tutorial require only FinaleMe (Java + jar build).
- `wgbstools` and `UXM_deconv` are only needed for the optional custom atlas generation in Step 5.2 (Section 8.2).
2.1 Install source
git clone https://github.com/epifluidlab/FinaleMe.git
cd FinaleMe
2.2 One-command setup (recommended)
./scripts/setup_references.sh
This command:
- checks dependencies
- builds `target/FinaleMe-0.60-jar-with-dependencies.jar` (if missing)
- downloads hg19/hg38 reference data into `data/`
Use a custom data directory with:
export FINALEME_DATA_DIR=/path/to/finaleme_data
./scripts/setup_references.sh
Useful subcommands:
./scripts/setup_references.sh deps
./scripts/setup_references.sh build
./scripts/setup_references.sh genomes
./scripts/setup_references.sh chromsizes
./scripts/setup_references.sh cpg
./scripts/setup_references.sh darkregions
./scripts/setup_references.sh methylation
./scripts/setup_references.sh summary
2.3 Reference files used in examples
- `data/hg19.2bit`
- `data/hg19.chrom.sizes`
- `data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz`
- `data/wgEncodeDukeMapabilityRegionsExcludable_wgEncodeDacMapabilityConsensusExcludable.hg19.bed`
- `data/wgbs_buffyCoat_jensen2015GB.methy.hg19.bw`
- `data/CpG_index.hg19.bed.gz`
2.4 Small test BAM
mkdir -p test results
curl -L "https://zenodo.org/records/6914806/files/BH01.chr22.bam?download=1" -o test/BH01.chr22.bam
curl -L "https://zenodo.org/records/6914806/files/BH01.chr22.bam.bai?download=1" -o test/BH01.chr22.bam.bai || samtools index test/BH01.chr22.bam
3. Inputs and expected formats
3.1 Step 1 positional arguments
CpgFeatureMatrixBuilder usage:
CpgFeatureMatrixBuilder [opts] hg19.2bit cpg_list.bed all_cpg.bed wgs.bam|fragments.tsv.gz cpg_detail.txt.gz
Meaning:
- argument 1: reference `.2bit`
- argument 2: target CpG list to process
- argument 3: all CpGs list (used when `-includeCpgDist` is enabled)
- argument 4: BAM/CRAM or bgzipped+tabix fragment BED/TSV
- argument 5: output feature matrix (`.gz` recommended)
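The positional order above can be captured in a small helper that assembles the Step 1 command line. This is only a sketch: `build_step1_command` and its parameter names are hypothetical, the class name and jar path are the ones used in this tutorial, and options are placed after the five positional arguments as in the example commands in Section 4.

```python
import shlex

MAIN_CLASS = "edu.northwestern.epifluidlab.finaleme.utils.CpgFeatureMatrixBuilder"


def build_step1_command(jar, ref_2bit, cpg_list, all_cpg, frag_input, output,
                        options=(), xmx="20G"):
    """Return the Step 1 command as an argument list (hypothetical helper)."""
    cmd = ["java", f"-Xmx{xmx}", "-cp", jar, MAIN_CLASS]
    # Five positional arguments, in the order listed above.
    cmd += [ref_2bit, cpg_list, all_cpg, frag_input, output]
    # Options follow the positionals, as in the tutorial's example commands.
    cmd += list(options)
    return cmd


cmd = build_step1_command(
    "target/FinaleMe-0.60-jar-with-dependencies.jar",
    "data/hg19.2bit",
    "data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz",
    "data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz",
    "test/BH01.chr22.bam",
    "results/BH01.cpg_features.hg19.bed.gz",
    options=["-wgsMode", "-t", "4"],
)
print(shlex.join(cmd))  # shell-quoted form for logging or copy/paste
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the jar and reference files exist.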
3.2 Supported fragment input modes
- BAM/CRAM mode (default)
- tabix fragment mode (`-fragmentInputTabix` or auto-detected by file extension)
For tabix fragment BED/TSV, the file must include at least chr, start, end, and strand information.
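Before bgzipping and indexing a fragment file, it can help to check that every row carries the four required fields. The sketch below assumes a tab-separated layout with the strand in column 4 (matching `-fragStrandColumn 4` in the Section 4.2 example), strand encoded as `+` or `-`, and `end > start`. The function name and these conventions are assumptions, not part of FinaleMe.

```python
def parse_fragment_line(line, strand_col=4):
    """Parse one fragment record; strand_col is a 1-based column index."""
    fields = line.rstrip("\n").split("\t")
    needed = max(3, strand_col)
    if len(fields) < needed:
        raise ValueError(f"expected at least {needed} columns, got {len(fields)}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    strand = fields[strand_col - 1]
    if start < 0 or end <= start:
        raise ValueError(f"invalid interval {start}-{end}")
    if strand not in ("+", "-"):  # assumed strand encoding
        raise ValueError(f"unexpected strand value {strand!r}")
    return chrom, start, end, strand


# Made-up example rows: chr, start, end, strand
sample = [
    "chr22\t16050000\t16050167\t+",
    "chr22\t16050210\t16050380\t-",
]
records = [parse_fragment_line(row) for row in sample]
print(records)
```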
4. Step 1: Feature extraction (detailed)
4.1 Standard BAM command
JAR="target/FinaleMe-0.60-jar-with-dependencies.jar"
java -Xmx20G -cp "$JAR" \
edu.northwestern.epifluidlab.finaleme.utils.CpgFeatureMatrixBuilder \
data/hg19.2bit \
data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz \
data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz \
test/BH01.chr22.bam \
results/BH01.cpg_features.hg19.bed.gz \
-stringentPaired \
-excludeRegions data/wgEncodeDukeMapabilityRegionsExcludable_wgEncodeDacMapabilityConsensusExcludable.hg19.bed \
-valueWigs methyPrior:0:data/wgbs_buffyCoat_jensen2015GB.methy.hg19.bw \
-useNoChrPrefixBam \
-wgsMode \
-t 4
4.2 Tabix fragment mode command
java -Xmx20G -cp "$JAR" \
edu.northwestern.epifluidlab.finaleme.utils.CpgFeatureMatrixBuilder \
data/hg19.2bit \
data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz \
data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz \
fragments.bed.gz \
results/fragments.cpg_features.hg19.bed.gz \
-fragmentInputTabix \
-fragStrandColumn 4 \
-valueWigs methyPrior:0:data/wgbs_buffyCoat_jensen2015GB.methy.hg19.bw \
-inferMethyFromValueWig \
-t 4
4.3 Step 1 options
Core filters and behavior
| Option | Description |
|---|---|
| `-minBaseQ` | Minimum base quality threshold (default 5). |
| `-minMapQ` | Minimum mapping quality threshold (default 30). |
| `-maxFragLen` | Maximum fragment length considered (default 500). |
| `-maxDistToFragEnd` | Max allowed distance from CpG to fragment end (default 250). |
| `-maxCov` | Maximum coverage threshold per CpG (default 250). |
| `-totalReadsInBam` | Override auto-estimated total reads/fragments for normalization. |
| `-wgsMode` | Enable WGS mode (non-bisulfite-space behavior). |
| `-skipSecondEnd` | Ignore read2 in paired-end statistics. |
| `-stringentPaired` | Keep only properly oriented read pairs. |
| `-includeCpgDist` | Add nearest-CpG distance feature column. |
| `-excludeRegions` | BED file(s) of excluded intervals. |
| `-useNoChrPrefixBam` | Use BAM contig naming without chr prefix. |
| `-t` | Number of threads for parallel 5Mb bins. |
Additional track-derived features
| Option | Description |
|---|---|
| `-overlapRegions track:file` | Add overlap flag(s) against BED track(s). |
| `-distantRegions track:file` | Add nearest-distance feature(s) to interval track(s). |
| `-valueWigs track:ext:file` | Add averaged value feature(s) from bigWig around CpGs. |
| `-valueBeds track:ext:file` | Add averaged value feature(s) from tabix BED around CpGs. |
Sequence/k-mer features
| Option | Description |
|---|---|
| `-kmerLen` | Auto-generate all k-mers up to this length. |
| `-kmerString` | Provide explicit k-mer list file. |
| `-kmerExt` | +/- region for k-mer extraction around CpG (default 100). |
| `-useFragBaseKmer` | Compute k-mer from fragment sequence context. |
| `-useStrandSpecificFragBase` | Strand-aware fragment k-mer mode. |
Tabix fragment mode specific
| Option | Description |
|---|---|
| `-fragmentInputTabix` | Force tabix fragment mode. |
| `-fragStrandColumn` | 1-based strand column index (0=auto). |
| `-fragNameColumn` | 1-based fragment-name column index (0=auto/synthetic). |
| `-fragMethyColumn` | 1-based methylation column index (m/u) (0=infer/default). |
| `-fragBaseQ` | Synthetic base quality for fragment mode (default 60). |
| `-defaultMethyStat` | Default methylation state if not provided/inferred (m or u). |
| `-inferMethyFromValueWig` | Infer methylation from first `-valueWigs` track (>=50 => m). |
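The `>=50 => m` rule for `-inferMethyFromValueWig`, combined with the `-defaultMethyStat` fallback, amounts to a simple threshold. The sketch below illustrates that reading only. The function name, the treatment of missing or NaN values, and the `"u"` default are assumptions rather than FinaleMe's actual code.

```python
def infer_methy_stat(wig_value, default_stat="u", threshold=50.0):
    """Return 'm' or 'u' for one CpG from a prior-track value.

    wig_value is the averaged track value around the CpG. None or NaN
    means no value is available, in which case default_stat is used
    (assumed behavior).
    """
    if wig_value is None or wig_value != wig_value:  # NaN != NaN
        return default_stat
    return "m" if wig_value >= threshold else "u"


print(infer_methy_stat(82.5))   # at or above the threshold
print(infer_methy_stat(12.0))   # below the threshold
print(infer_methy_stat(None, default_stat="m"))  # fallback path
```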
4.4 Step 1 output format (*.cpg_features*.bed.gz)
Header starts with:
chr start end readName FragLen Frag_strand methy_stat Norm_Frag_cov baseQ Offset_frag Dist_frag_end
Optional columns can follow in this order:
- `dist_nearest_CpG` (if `-includeCpgDist`)
- one column per `-overlapRegions` track
- one column per `-distantRegions` track
- one column per `-valueBeds` track
- one column per `-valueWigs` track
- one column per k-mer feature
Key fields:
- `methy_stat`: observed methylation label (m/u) per CpG record
- `Norm_Frag_cov`: normalized fragment coverage feature
- `Offset_frag`: CpG offset index within fragment
- `Dist_frag_end`: minimum distance to fragment ends
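As a quick sanity check on a feature matrix, the columns can be looked up by header name so that optional trailing columns do not matter. This sketch assumes whitespace- or tab-delimited rows with the header on the first line. The function name and the sample rows are made up for illustration.

```python
from collections import Counter


def summarize_features(lines):
    """Count methy_stat labels and compute mean FragLen over all records."""
    rows = iter(lines)
    header = next(rows).lstrip("#").split()
    col = {name: i for i, name in enumerate(header)}  # name -> column index
    counts = Counter()
    frag_len_sum = 0.0
    n = 0
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts[fields[col["methy_stat"]]] += 1
        frag_len_sum += float(fields[col["FragLen"]])
        n += 1
    mean_frag_len = frag_len_sum / n if n else float("nan")
    return counts, mean_frag_len


# Made-up records using the mandatory columns listed above.
demo = [
    "chr\tstart\tend\treadName\tFragLen\tFrag_strand\tmethy_stat\tNorm_Frag_cov\tbaseQ\tOffset_frag\tDist_frag_end",
    "chr22\t16050100\t16050101\tread1\t167\t+\tm\t0.8\t30\t2\t40",
    "chr22\t16050130\t16050131\tread1\t167\t+\tu\t0.8\t30\t3\t70",
    "chr22\t16050400\t16050401\tread2\t150\t-\tm\t1.1\t30\t1\t25",
]
counts, mean_frag_len = summarize_features(demo)
print(dict(counts), round(mean_frag_len, 1))
```

For a real file, pass an open handle instead of a list, for example `gzip.open("results/BH01.cpg_features.hg19.bed.gz", "rt")`.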
5. Step 2: Train HMM (detailed)
5.1 Training command
java -Xmx20G -cp "$JAR" \
edu.northwestern.epifluidlab.finaleme.hmm.FinaleMe \
results/BH01.FinaleMe.model \
results/BH01.cpg_features.hg19.bed.gz \
results/BH01.train.prediction.bed.gz \
-miniDataPoints 7 \
-gmm \
-covOutlier 3 \
-t 4
This writes a serialized model file (.model) and a prediction table.
5.2 Training-related options (FinaleMe)
| Option | Description |
|---|---|
| `-states` | Number of hidden states (even number expected). |
| `-features` | Number of features per observation vector. |
| `-miniDataPoints` | Minimum CpGs per fragment to include. |
| `-maxCpgs` | Maximum CpGs per fragment to include. |
| `-maxFragLen` | Maximum fragment-length state bound. |
| `-minFragLen` | Minimum fragment-length threshold. |
| `-maxCpgDist` | Max CpG distance used for transition bins. |
| `-bin` | Distance bin size for non-homogeneous priors/transitions. |
| `-covOutlier` | Outlier filter by z-score in feature loading. |
| `-gmm` | Initialize HMM using GMM. |
| `-wgbs` | WGBS-oriented initialization mode. |
| `-iteration` | Max Baum-Welch iterations. |
| `-tol` | Convergence tolerance. |
| `-decayRate` | Relative convergence threshold. |
| `-tolKmeans` | K-means tolerance used by initialization. |
| `-decayKmeans` | K-means decay criterion. |
| `-mixNumberInFeature` | Mixture count(s) for Gaussian emissions. |
| `-bayesianFactor` | Prior weighting factor in decoding/training. |
| `-cpgNumClip` | Clip for CpG-count scaling in HMM. |
| `-methylatedState` | Which state is interpreted as methylated (0/1). |
| `-seed` | Random seed (<0 for non-deterministic). |
| `-t` | Parallel worker count for training/decoding internals. |
6. Step 3: Decode methylation (detailed)
6.1 Decode command
java -Xmx20G -cp "$JAR" \
edu.northwestern.epifluidlab.finaleme.hmm.FinaleMe \
results/BH01.FinaleMe.model \
results/BH01.cpg_features.hg19.bed.gz \
results/BH01.decode.prediction.bed.gz \
-decodeModeOnly \
-t 4 \
-bwOutput \
-chromSizeFile data/hg19.chrom.sizes \
-patOutput \
-cpgIndexFile data/CpG_index.hg19.bed.gz
6.2 Decode/output options (FinaleMe)
| Option | Description |
|---|---|
| `-decodeModeOnly` | Skip training and decode directly with existing model. |
| `-decodeP` | Decision criterion used by Viterbi methylation labeling. |
| `-randomPerm` | Randomize labels from prior instead of trained HMM. |
| `-lowCoverage` | Low-coverage mode with alternate feature handling. |
| `-region` | Decode only within BED intervals. |
| `-exclude` | Exclude BED intervals from decode. |
| `-patOutput` | Write UXM-compatible `.pat.gz` and `.beta` outputs. |
| `-cpgIndexFile` | Required CpG index for `-patOutput` (use `data/CpG_index.*.bed.gz`). |
| `-bwOutput` | Write decode summary bigWig outputs. |
| `-chromSizeFile` | Required with `-bwOutput` (used by both UCSC converter and Java fallback writer). |
| `-bedGraphToBigWig` | Path to UCSC converter executable; if missing, FinaleMe auto-falls back to Java BigWig writer. |
| `-bwStripChrPrefix` | Remove chr prefix in bigWig conversion. |
| `-bwConvertChrMToMT` | Convert chrM/M naming to MT. |
| `-t` | Parallel decoding thread count. |
AUC mode option
| Option | Description |
|---|---|
| `-aucMode` | Compute ROC/AUC-style summaries across decode thresholds. |
-bwOutput is not supported together with -aucMode.
6.3 Step 3 prediction output format (*.prediction.bed.gz)
Header:
#chr start end methy_perc_predict methy_count_predict total_count_predict methy_perc_obs methy_count_obs total_count_obs
Column meaning:
- `methy_perc_predict`: predicted methylation percentage at locus
- `methy_count_predict`: predicted methylated count
- `total_count_predict`: predicted total count
- `methy_perc_obs`: observed methylation percentage from feature input labels
- `methy_count_obs`: observed methylated count
- `total_count_obs`: observed total count
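One simple use of this table is to compare pooled predicted and observed methylation across all loci. The sketch below uses only the count columns, so it does not depend on how the percentage columns are scaled. The function name and the sample rows are made up, and counts are parsed as floats in case they are not integers.

```python
def global_methylation(lines):
    """Return (predicted, observed) pooled methylation fractions.

    Column order follows the header above: counts are in columns 5-6
    (predicted) and 8-9 (observed), 1-based.
    """
    pred_m = pred_t = obs_m = obs_t = 0.0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        f = line.split()
        pred_m += float(f[4])
        pred_t += float(f[5])
        obs_m += float(f[7])
        obs_t += float(f[8])
    pred = pred_m / pred_t if pred_t else None
    obs = obs_m / obs_t if obs_t else None
    return pred, obs


# Made-up rows for illustration.
demo = [
    "#chr\tstart\tend\tmethy_perc_predict\tmethy_count_predict\ttotal_count_predict\tmethy_perc_obs\tmethy_count_obs\ttotal_count_obs",
    "chr22\t100\t101\t75\t3\t4\t50\t2\t4",
    "chr22\t200\t201\t100\t1\t1\t100\t1\t1",
]
pred, obs = global_methylation(demo)
print(pred, obs)
```

For a real run, iterate over `gzip.open("results/BH01.decode.prediction.bed.gz", "rt")` instead of the in-memory list.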
6.4 Optional UXM output formats
.pat.gz
Tab-separated rows:
chr start_cpg_index CT_pattern count
- `start_cpg_index`: global CpG index of first CpG in fragment pattern
- `CT_pattern`: per-CpG decoded pattern (`C` for methylated, `T` for unmethylated)
- `count`: multiplicity of identical fragment pattern records
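To make the row layout concrete, the sketch below expands `.pat` rows into per-CpG methylated and total counts. It assumes that consecutive pattern characters map to consecutive CpG indices starting at `start_cpg_index`, and it skips any character other than `C` or `T`. The function name and sample rows are made up for illustration.

```python
from collections import defaultdict


def pat_to_counts(lines):
    """Map CpG index -> [methylated_count, total_count] from .pat rows."""
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        if not line.strip():
            continue
        _chrom, start_idx, pattern, count = line.split()[:4]
        start_idx, count = int(start_idx), int(count)
        for offset, call in enumerate(pattern):
            if call not in ("C", "T"):
                continue  # ignore anything that is not a decoded call
            entry = counts[start_idx + offset]
            entry[1] += count          # every call adds to the total
            if call == "C":
                entry[0] += count      # C marks a methylated call
    return dict(counts)


# Made-up rows: chr, start_cpg_index, CT_pattern, count
demo = [
    "chr22\t1000\tCCT\t2",
    "chr22\t1001\tTC\t1",
]
per_cpg = pat_to_counts(demo)
print(per_cpg)
```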
.beta
Binary file storing per-index (methylated_count, total_count) as uint8 pairs.
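Given that layout, a `.beta` file can be read as a flat byte sequence and split into pairs. This is a minimal sketch based only on the description above (two unsigned bytes per CpG index, methylated count first). The function names are hypothetical, and how list positions map to CpG index values is not covered here.

```python
def read_beta(data):
    """Split raw .beta bytes into (methylated_count, total_count) pairs."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes (uint8 pairs)")
    # Indexing a bytes object yields ints in 0..255, i.e. uint8 values.
    return [(data[i], data[i + 1]) for i in range(0, len(data), 2)]


def beta_values(pairs):
    """Methylation fraction per pair; None where total_count is zero."""
    return [m / t if t else None for m, t in pairs]


# Made-up bytes standing in for three CpG records.
raw = bytes([3, 4, 0, 0, 5, 10])
pairs = read_beta(raw)
print(pairs, beta_values(pairs))
```

For a real file, `read_beta(open(path, "rb").read())` loads all pairs at once. Because each count is a single byte, no stored value can exceed 255.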
6.5 Optional bigWig outputs
When -bwOutput is enabled, FinaleMe writes:
- `*.methy.bw`: predicted methylation percentage track
- `*.cov.bw`: predicted total count track
- `*.methy_count.bw`: predicted methylated count track
7. Step 4: Legacy Perl bigWig workflow
If you prefer the old conversion utility:
perl src/perl/bedpredict2bw.b37.pl results/BH01 results/BH01.decode.prediction.bed.gz
This is optional when Step 3 already runs with -bwOutput.
8. Step 5: Tissues-of-origin analysis
8.1 Recommended: BetaValueDeconvolution (atlas mode)
Run deconvolution with the tested default preset:
JAR="target/FinaleMe-0.60-jar-with-dependencies.jar"
java -Xmx20G -cp "$JAR" \
edu.northwestern.epifluidlab.finaleme.utils.BetaValueDeconvolution \
-binarizeThreshold 0.1 \
-markerRegions results/cgi_shore_atlas/Atlas.CGI_shore.U250.l3.hg19.tsv \
-refBetas results/cgi_shore_atlas/reference_wgbs/betas/beta_list.txt \
-refGroups results/cgi_shore_atlas/groups_fixed.csv \
-cpgIndex data/CpG_index.hg19.bed.gz \
-solver NNLS \
-output results/BH01.deconv.beta.tsv \
results/BH01.decode.prediction.bed.gz
Notes:
- `-markerRegions` accepts atlas TSV/BED with `startCpG`/`endCpG` columns (the CGI+shore atlas from marker generation pipeline works directly).
- `-refBetas` can be a comma-separated list or a text file with one `.beta` path per line.
- Query input can be `*.prediction.bed.gz` (as above) or `*.beta`.
- Output format is a matrix: rows are cell types and columns are samples.
- This default deconvolution command does not require `wgbstools` or `UXM_deconv`.
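For orientation, `-solver NNLS` presumably refers to non-negative least squares. The generic textbook formulation is sketched below. The symbols are introduced here for illustration only, and the exact objective, marker weighting, and any normalization used by `BetaValueDeconvolution` are not specified in this tutorial.

```latex
% A: reference matrix (marker regions x cell types), built from the reference betas
% b: the query sample's methylation values over the same marker regions
% x: estimated non-negative contribution of each cell type
\hat{x} \;=\; \arg\min_{x \,\ge\, 0} \; \lVert A x - b \rVert_2^2
```

Under this reading, each column of the output matrix would hold the estimated contributions $\hat{x}$ for one sample.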
8.2 Optional: Build marker atlas with the tested preset (requires wgbstools + UXM_deconv)
python scripts/generate_cgi_shore_markers.py \
--genome hg19 \
--betas /path/to/reference_wgbs/betas/*.beta \
--pats /path/to/reference_wgbs/pats/*.pat.gz \
--groups /path/to/groups_pat_ref.hg19.csv \
--blocks /path/to/GSE186458_blocks.s205.bed.gz \
--cgi-bed /path/to/UCSC.cpgIsland.20190503.hg19.bed \
--shore-size 2000 \
--chrom-sizes data/hg19.chrom.sizes \
--num-markers 250 \
--delta-means 0.4 \
--unmeth-mean-thresh 0.1 \
--meth-mean-thresh 0.5 \
--min-cpg 1 \
--max-cpg 1000 \
--min-bp 50 \
--max-bp 5000 \
--rlen 3 \
--threads 10 \
--out-dir results/cgi_shore_atlas/ \
--force \
--wgbstools-path /path/to/wgbs_tools \
--uxm-path /path/to/UXM_deconv
Make a beta list file for -refBetas:
ls /path/to/reference_wgbs/betas/*.beta > results/cgi_shore_atlas/reference_wgbs/betas/beta_list.txt
8.3 Optional legacy mode: UXM deconvolution
This requires running Step 3 with -patOutput.
uxm deconv results/BH01.decode.prediction.pat.gz \
-o results/BH01.uxm_result.csv \
-a /path/to/UXM_deconv/supplemental/Atlas.U25.l4.hg19.tsv
Reference atlas-building details: tutorial/tutorial_ref_maps.md
9. Troubleshooting
9.1 ClassNotFoundException on old model files
If you decode an old model trained before package migration, use the current v0.60 jar. Backward-compatible class-name remapping is implemented for legacy serialized model class names.
9.2 No bedGraphToBigWig in PATH
FinaleMe now auto-falls back to Java BigWig writing when this executable is missing.
Install UCSC bedGraphToBigWig if you still prefer/require the UCSC binary path.
9.3 Missing CpG index for -patOutput
Use setup-provided files:
- hg19: `data/CpG_index.hg19.bed.gz`
- hg38: `data/CpG_index.hg38.bed.gz`
9.4 Memory usage in high coverage WGS data
Adjust `-Xmx` and `-t` to the sequencing depth. We tested with `-Xmx20G` and `-t 5` on the HD_46 dataset (~16X depth); the 14230_1 dataset from the paper (~39X depth) may need `-Xmx80G` with `-t 5`.
9.5 Chromosome naming mismatch in bigWig
Use:
- `-bwStripChrPrefix`
- `-bwConvertChrMToMT`
as needed for your chrom-size naming convention.
10. Performance and reproducibility notes
- Step 1 is parallelized by 5Mb genomic bins (`-t` controls worker count).
- Training and decode are parallelized in FinaleMe (`-t` controls worker count).
- Use fixed `-seed` for reproducible randomized operations.
11. Backward-compatible command alias
edu.northwestern.epifluidlab.finaleme.utils.CpgMultiMetricsStats is kept as a deprecated alias to CpgFeatureMatrixBuilder for script compatibility.
12. References
- FinaleMe paper: https://doi.org/10.1038/s41467-024-47196-6
- Reference data (Zenodo): https://doi.org/10.5281/zenodo.19392525
- wgbstools: https://github.com/nloyfer/wgbs_tools
- UXM_deconv: https://github.com/nloyfer/UXM_deconv