FinaleMe Tutorial

April 7, 2026 ยท View on GitHub

This tutorial contains the full, detailed usage guide for FinaleMe v0.60.

If you want the shortest path to run the pipeline, start from README.md.

1. What FinaleMe does

FinaleMe predicts CpG methylation from cfDNA fragment features derived from BAM/CRAM or tabix-indexed fragment files. The standard workflow has five steps:

  1. Build feature matrix (CpgFeatureMatrixBuilder)
  2. Train HMM model (FinaleMe)
  3. Decode methylation (FinaleMe)
  4. Optional legacy conversion to bigWig (Perl helper)
  5. Tissues-of-origin deconvolution (BetaValueDeconvolution recommended; UXM compatibility optional)

2. Installation and reference setup

Default pipeline requirement summary:

  • Steps 1-4 in this tutorial require only FinaleMe (Java + jar build).
  • wgbstools and UXM_deconv are only needed for optional custom atlas generation in Step 5.2.

2.1 Install source

git clone https://github.com/epifluidlab/FinaleMe.git
cd FinaleMe
./scripts/setup_references.sh

This command:

  • checks dependencies
  • builds target/FinaleMe-0.60-jar-with-dependencies.jar (if missing)
  • downloads hg19/hg38 reference data into data/

Use a custom data directory with:

export FINALEME_DATA_DIR=/path/to/finaleme_data
./scripts/setup_references.sh

Useful subcommands:

./scripts/setup_references.sh deps
./scripts/setup_references.sh build
./scripts/setup_references.sh genomes
./scripts/setup_references.sh chromsizes
./scripts/setup_references.sh cpg
./scripts/setup_references.sh darkregions
./scripts/setup_references.sh methylation
./scripts/setup_references.sh summary

2.3 Reference files used in examples

  • data/hg19.2bit
  • data/hg19.chrom.sizes
  • data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz
  • data/wgEncodeDukeMapabilityRegionsExcludable_wgEncodeDacMapabilityConsensusExcludable.hg19.bed
  • data/wgbs_buffyCoat_jensen2015GB.methy.hg19.bw
  • data/CpG_index.hg19.bed.gz

2.4 Small test BAM

mkdir -p test results
curl -L "https://zenodo.org/records/6914806/files/BH01.chr22.bam?download=1" -o test/BH01.chr22.bam
curl -L "https://zenodo.org/records/6914806/files/BH01.chr22.bam.bai?download=1" -o test/BH01.chr22.bam.bai || samtools index test/BH01.chr22.bam

3. Inputs and expected formats

3.1 Step 1 positional arguments

CpgFeatureMatrixBuilder usage:

CpgFeatureMatrixBuilder [opts] hg19.2bit cpg_list.bed all_cpg.bed wgs.bam|fragments.tsv.gz cpg_detail.txt.gz

Meaning:

  • argument 1: reference .2bit
  • argument 2: target CpG list to process
  • argument 3: all CpGs list (used when -includeCpgDist is enabled)
  • argument 4: BAM/CRAM or bgzipped+tabix fragment BED/TSV
  • argument 5: output feature matrix (.gz recommended)

3.2 Supported fragment input modes

  • BAM/CRAM mode (default)
  • tabix fragment mode (-fragmentInputTabix or auto-detected by file extension)

For tabix fragment BED/TSV, the file must include at least chr, start, end, and strand information.

4. Step 1: Feature extraction (detailed)

4.1 Standard BAM command

JAR="target/FinaleMe-0.60-jar-with-dependencies.jar"

java -Xmx20G -cp "$JAR" \
  edu.northwestern.epifluidlab.finaleme.utils.CpgFeatureMatrixBuilder \
  data/hg19.2bit \
  data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz \
  data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz \
  test/BH01.chr22.bam \
  results/BH01.cpg_features.hg19.bed.gz \
  -stringentPaired \
  -excludeRegions data/wgEncodeDukeMapabilityRegionsExcludable_wgEncodeDacMapabilityConsensusExcludable.hg19.bed \
  -valueWigs methyPrior:0:data/wgbs_buffyCoat_jensen2015GB.methy.hg19.bw \
  -useNoChrPrefixBam \
  -wgsMode \
  -t 4

4.2 Tabix fragment mode command

java -Xmx20G -cp "$JAR" \
  edu.northwestern.epifluidlab.finaleme.utils.CpgFeatureMatrixBuilder \
  data/hg19.2bit \
  data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz \
  data/CG_motif.hg19.common_chr.pos_only.bedgraph.gz \
  fragments.bed.gz \
  results/fragments.cpg_features.hg19.bed.gz \
  -fragmentInputTabix \
  -fragStrandColumn 4 \
  -valueWigs methyPrior:0:data/wgbs_buffyCoat_jensen2015GB.methy.hg19.bw \
  -inferMethyFromValueWig \
  -t 4

4.3 Step 1 options

Core filters and behavior

OptionDescription
-minBaseQMinimum base quality threshold (default 5).
-minMapQMinimum mapping quality threshold (default 30).
-maxFragLenMaximum fragment length considered (default 500).
-maxDistToFragEndMax allowed distance from CpG to fragment end (default 250).
-maxCovMaximum coverage threshold per CpG (default 250).
-totalReadsInBamOverride auto-estimated total reads/fragments for normalization.
-wgsModeEnable WGS mode (non-bisulfite-space behavior).
-skipSecondEndIgnore read2 in paired-end statistics.
-stringentPairedKeep only properly oriented read pairs.
-includeCpgDistAdd nearest-CpG distance feature column.
-excludeRegionsBED file(s) of excluded intervals.
-useNoChrPrefixBamUse BAM contig naming without chr prefix.
-tNumber of threads for parallel 5Mb bins.

Additional track-derived features

OptionDescription
-overlapRegions track:fileAdd overlap flag(s) against BED track(s).
-distantRegions track:fileAdd nearest-distance feature(s) to interval track(s).
-valueWigs track:ext:fileAdd averaged value feature(s) from bigWig around CpGs.
-valueBeds track:ext:fileAdd averaged value feature(s) from tabix BED around CpGs.

Sequence/k-mer features

OptionDescription
-kmerLenAuto-generate all k-mers up to this length.
-kmerStringProvide explicit k-mer list file.
-kmerExt+/- region for k-mer extraction around CpG (default 100).
-useFragBaseKmerCompute k-mer from fragment sequence context.
-useStrandSpecificFragBaseStrand-aware fragment k-mer mode.

Tabix fragment mode specific

OptionDescription
-fragmentInputTabixForce tabix fragment mode.
-fragStrandColumn1-based strand column index (0=auto).
-fragNameColumn1-based fragment-name column index (0=auto/synthetic).
-fragMethyColumn1-based methylation column index (m/u) (0=infer/default).
-fragBaseQSynthetic base quality for fragment mode (default 60).
-defaultMethyStatDefault methylation state if not provided/inferred (m or u).
-inferMethyFromValueWigInfer methylation from first -valueWigs track (>=50 => m).

4.4 Step 1 output format (*.cpg_features*.bed.gz)

Header starts with:

chr	start	end	readName	FragLen	Frag_strand	methy_stat	Norm_Frag_cov	baseQ	Offset_frag	Dist_frag_end

Optional columns can follow in this order:

  1. dist_nearest_CpG (if -includeCpgDist)
  2. one column per -overlapRegions track
  3. one column per -distantRegions track
  4. one column per -valueBeds track
  5. one column per -valueWigs track
  6. one column per k-mer feature

Key fields:

  • methy_stat: observed methylation label (m/u) per CpG record
  • Norm_Frag_cov: normalized fragment coverage feature
  • Offset_frag: CpG offset index within fragment
  • Dist_frag_end: minimum distance to fragment ends

5. Step 2: Train HMM (detailed)

5.1 Training command

java -Xmx20G -cp "$JAR" \
  edu.northwestern.epifluidlab.finaleme.hmm.FinaleMe \
  results/BH01.FinaleMe.model \
  results/BH01.cpg_features.hg19.bed.gz \
  results/BH01.train.prediction.bed.gz \
  -miniDataPoints 7 \
  -gmm \
  -covOutlier 3 \
  -t 4

This writes a serialized model file (.model) and a prediction table.

OptionDescription
-statesNumber of hidden states (even number expected).
-featuresNumber of features per observation vector.
-miniDataPointsMinimum CpGs per fragment to include.
-maxCpgsMaximum CpGs per fragment to include.
-maxFragLenMaximum fragment-length state bound.
-minFragLenMinimum fragment-length threshold.
-maxCpgDistMax CpG distance used for transition bins.
-binDistance bin size for non-homogeneous priors/transitions.
-covOutlierOutlier filter by z-score in feature loading.
-gmmInitialize HMM using GMM.
-wgbsWGBS-oriented initialization mode.
-iterationMax Baum-Welch iterations.
-tolConvergence tolerance.
-decayRateRelative convergence threshold.
-tolKmeansK-means tolerance used by initialization.
-decayKmeansK-means decay criterion.
-mixNumberInFeatureMixture count(s) for Gaussian emissions.
-bayesianFactorPrior weighting factor in decoding/training.
-cpgNumClipClip for CpG-count scaling in HMM.
-methylatedStateWhich state is interpreted as methylated (0/1).
-seedRandom seed (<0 for non-deterministic).
-tParallel worker count for training/decoding internals.

6. Step 3: Decode methylation (detailed)

6.1 Decode command

java -Xmx20G -cp "$JAR" \
  edu.northwestern.epifluidlab.finaleme.hmm.FinaleMe \
  results/BH01.FinaleMe.model \
  results/BH01.cpg_features.hg19.bed.gz \
  results/BH01.decode.prediction.bed.gz \
  -decodeModeOnly \
  -t 4 \
  -bwOutput \
  -chromSizeFile data/hg19.chrom.sizes \
  -patOutput \
  -cpgIndexFile data/CpG_index.hg19.bed.gz

6.2 Decode/output options (FinaleMe)

OptionDescription
-decodeModeOnlySkip training and decode directly with existing model.
-decodePDecision criterion used by Viterbi methylation labeling.
-randomPermRandomize labels from prior instead of trained HMM.
-lowCoverageLow-coverage mode with alternate feature handling.
-regionDecode only within BED intervals.
-excludeExclude BED intervals from decode.
-patOutputWrite UXM-compatible .pat.gz and .beta outputs.
-cpgIndexFileRequired CpG index for -patOutput (use data/CpG_index.*.bed.gz).
-bwOutputWrite decode summary bigWig outputs.
-chromSizeFileRequired with -bwOutput (used by both UCSC converter and Java fallback writer).
-bedGraphToBigWigPath to UCSC converter executable; if missing, FinaleMe auto-falls back to Java BigWig writer.
-bwStripChrPrefixRemove chr prefix in bigWig conversion.
-bwConvertChrMToMTConvert chrM/M naming to MT.
-tParallel decoding thread count.

AUC mode option

OptionDescription
-aucModeCompute ROC/AUC-style summaries across decode thresholds.

-bwOutput is not supported together with -aucMode.

6.3 Step 3 prediction output format (*.prediction.bed.gz)

Header:

#chr	start	end	methy_perc_predict	methy_count_predict	total_count_predict	methy_perc_obs	methy_count_obs	total_count_obs

Column meaning:

  • methy_perc_predict: predicted methylation percentage at locus
  • methy_count_predict: predicted methylated count
  • total_count_predict: predicted total count
  • methy_perc_obs: observed methylation percentage from feature input labels
  • methy_count_obs: observed methylated count
  • total_count_obs: observed total count

6.4 Optional UXM output formats

.pat.gz

Tab-separated rows:

chr	start_cpg_index	CT_pattern	count
  • start_cpg_index: global CpG index of first CpG in fragment pattern
  • CT_pattern: per-CpG decoded pattern (C for methylated, T for unmethylated)
  • count: multiplicity of identical fragment pattern records

.beta

Binary file storing per-index (methylated_count, total_count) as uint8 pairs.

6.5 Optional bigWig outputs

When -bwOutput is enabled, FinaleMe writes:

  • *.methy.bw: predicted methylation percentage track
  • *.cov.bw: predicted total count track
  • *.methy_count.bw: predicted methylated count track

7. Step 4: Legacy Perl bigWig workflow

If you prefer the old conversion utility:

perl src/perl/bedpredict2bw.b37.pl results/BH01 results/BH01.decode.prediction.bed.gz

This is optional when Step 3 already runs with -bwOutput.

8. Step 5: Tissues-of-origin analysis

Run deconvolution with the tested default preset:

JAR="target/FinaleMe-0.60-jar-with-dependencies.jar"

java -Xmx20G -cp "$JAR" \
  edu.northwestern.epifluidlab.finaleme.utils.BetaValueDeconvolution \
  -binarizeThreshold 0.1 \
  -markerRegions results/cgi_shore_atlas/Atlas.CGI_shore.U250.l3.hg19.tsv \
  -refBetas results/cgi_shore_atlas/reference_wgbs/betas/beta_list.txt \
  -refGroups results/cgi_shore_atlas/groups_fixed.csv \
  -cpgIndex data/CpG_index.hg19.bed.gz \
  -solver NNLS \
  -output results/BH01.deconv.beta.tsv \
  results/BH01.decode.prediction.bed.gz

Notes:

  • -markerRegions accepts atlas TSV/BED with startCpG/endCpG columns (the CGI+shore atlas from marker generation pipeline works directly).
  • -refBetas can be a comma-separated list or a text file with one .beta path per line.
  • Query input can be *.prediction.bed.gz (as above) or *.beta.
  • Output format is a matrix: rows are cell types and columns are samples.
  • This default deconvolution command does not require wgbstools or UXM_deconv.

8.2 Optional: Build marker atlas with the tested preset (requires wgbstools + UXM_deconv)

python scripts/generate_cgi_shore_markers.py \
  --genome hg19 \
  --betas /path/to/reference_wgbs/betas/*.beta \
  --pats /path/to/reference_wgbs/pats/*.pat.gz \
  --groups /path/to/groups_pat_ref.hg19.csv \
  --blocks /path/to/GSE186458_blocks.s205.bed.gz \
  --cgi-bed /path/to/UCSC.cpgIsland.20190503.hg19.bed \
  --shore-size 2000 \
  --chrom-sizes data/hg19.chrom.sizes \
  --num-markers 250 \
  --delta-means 0.4 \
  --unmeth-mean-thresh 0.1 \
  --meth-mean-thresh 0.5 \
  --min-cpg 1 \
  --max-cpg 1000 \
  --min-bp 50 \
  --max-bp 5000 \
  --rlen 3 \
  --threads 10 \
  --out-dir results/cgi_shore_atlas/ \
  --force \
  --wgbstools-path /path/to/wgbs_tools \
  --uxm-path /path/to/UXM_deconv

Make a beta list file for -refBetas:

ls /path/to/reference_wgbs/betas/*.beta > results/cgi_shore_atlas/reference_wgbs/betas/beta_list.txt

8.3 Optional legacy mode: UXM deconvolution

This requires running Step 3 with -patOutput.

uxm deconv results/BH01.decode.prediction.pat.gz \
  -o results/BH01.uxm_result.csv \
  -a /path/to/UXM_deconv/supplemental/Atlas.U25.l4.hg19.tsv

Reference atlas-building details: tutorial/tutorial_ref_maps.md

9. Troubleshooting

9.1 ClassNotFoundException on old model files

If you decode an old model trained before package migration, use the current v0.60 jar. Backward-compatible class-name remapping is implemented for legacy serialized model class names.

9.2 No bedGraphToBigWig in PATH

FinaleMe now auto-falls back to Java BigWig writing when this executable is missing.
Install UCSC bedGraphToBigWig if you still prefer/require the UCSC binary path.

9.3 Missing CpG index for -patOutput

Use setup-provided files:

  • hg19: data/CpG_index.hg19.bed.gz
  • hg38: data/CpG_index.hg38.bed.gz

9.4 Memory usage in high coverage WGS data

Try different -Xmx and appropriate -t. We tested with -Xmx20G and -t 5 for HD_46 dataset (~16X depth), but may need -Xmx80G and -t 5 for 14230_1 dataset (~39X depth) in the paper.

9.5 Chromosome naming mismatch in bigWig

Use:

  • -bwStripChrPrefix
  • -bwConvertChrMToMT

as needed for your chrom-size naming convention.

10. Performance and reproducibility notes

  • Step 1 is parallelized by 5Mb genomic bins (-t controls worker count).
  • Training and decode are parallelized in FinaleMe (-t controls worker count).
  • Use fixed -seed for reproducible randomized operations.

11. Backward-compatible command alias

edu.northwestern.epifluidlab.finaleme.utils.CpgMultiMetricsStats is kept as a deprecated alias to CpgFeatureMatrixBuilder for script compatibility.

12. References