harmonypy

April 24, 2026


harmonypy is a Python package for the Harmony algorithm for integrating multiple high-dimensional datasets. It uses a C++ backend (Armadillo) for fast linear algebra, matching the R harmony2 package step-by-step.

This animation shows Harmony aligning three single-cell RNA-seq datasets from different donors. Before Harmony, you can clearly distinguish cells from each of the three donors. After Harmony, cells from different donors are mixed, while the overall shape of the data is preserved. (See also: How to make this animation.)

Installation

Install from PyPI (pre-built wheels for Linux and macOS):

pip install harmonypy

Building from source

Building from source requires a C++ compiler, CMake, and a BLAS library:

macOS (uses Apple Accelerate, no extra dependencies):

pip install .

Linux (requires OpenBLAS):

# Debian/Ubuntu
sudo apt install libopenblas-dev cmake

# RHEL/Fedora
sudo dnf install openblas-devel cmake

pip install .

Quick Start

import harmonypy as hm
import pandas as pd

# Load the principal components and metadata
pcs = pd.read_csv("data/pbmc_3500_pcs.tsv.gz", sep="\t")
meta = pd.read_csv("data/pbmc_3500_meta.tsv.gz", sep="\t")

# Run Harmony to correct for batch effects (donor)
harmony_out = hm.run_harmony(pcs, meta, "donor")

# Save corrected PCs (same shape as input)
result = pd.DataFrame(harmony_out.Z_corr, columns=pcs.columns)
result.to_csv("pbmc_3500_pcs_harmony.tsv", sep="\t", index=False)
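
One rough way to check the effect described earlier (donors distinguishable before correction, mixed afterwards) is to measure how often a cell's nearest neighbors come from the same batch. The sketch below uses only NumPy and synthetic data. `same_batch_neighbor_fraction` is a hypothetical helper, not part of harmonypy, and the "mixed" array is simulated by omitting an artificial per-batch offset rather than produced by run_harmony.

```python
import numpy as np


def same_batch_neighbor_fraction(embedding, batch_labels, k=15):
    """Mean fraction of each cell's k nearest neighbors sharing its batch label."""
    X = np.asarray(embedding, dtype=float)
    labels = np.asarray(batch_labels)
    # Brute-force squared Euclidean distances between all pairs of cells.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    # Indices of the k smallest distances in each row (unordered).
    neighbors = np.argpartition(d2, k, axis=1)[:, :k]
    return float((labels[neighbors] == labels[:, None]).mean())


# Synthetic stand-in for an embedding: 3 batches of 100 cells, 5 dimensions.
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 5))
batch = np.repeat(["a", "b", "c"], 100)

# "Separated": add a large artificial offset to two of the batches.
separated = base.copy()
separated[batch == "b"] += 10.0
separated[batch == "c"] -= 10.0

frac_separated = same_batch_neighbor_fraction(separated, batch)
frac_mixed = same_batch_neighbor_fraction(base, batch)
print(f"separated: {frac_separated:.2f}, mixed: {frac_mixed:.2f}")
```

With real data, one could pass pcs.to_numpy() and harmony_out.Z_corr together with meta["donor"], assuming Z_corr has one row per cell as the comment above indicates. A value that drops toward the batch proportions after correction suggests better mixing. The all-pairs distance matrix is only practical for a few thousand cells.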

Usage with Scanpy

import scanpy as sc
import harmonypy as hm

# Load and preprocess your data
adata = sc.read_h5ad("my_data.h5ad")
sc.pp.pca(adata)

# Get PCs from the AnnData object
pcs = adata.obsm['X_pca']
print(pcs.shape)  # (n_cells, n_pcs)

# Run Harmony on the PCA embedding
harmony_out = hm.run_harmony(pcs, adata.obs, "batch")

# Store corrected PCs back in the AnnData object
adata.obsm['X_pca_harmony'] = harmony_out.Z_corr

# Use harmonized PCs for downstream analysis
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.umap(adata)
sc.tl.leiden(adata)

Parameters

run_harmony accepts the same parameters as the R package:

  Parameter          Default          Description
  ------------------ ---------------- ------------------------------------
  theta              2                Diversity penalty per batch variable
  sigma              0.1              Kernel bandwidth for soft clustering
  nclust             min(N/30, 100)   Number of clusters
  max_iter_harmony   10               Maximum Harmony iterations
  max_iter_kmeans    4                K-means iterations per Harmony round
  epsilon_harmony    1e-2             Convergence threshold
  ncores             0                BLAS threads (0 = all cores)
  lamb               None             Ridge penalty (None = auto-estimate)
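
To give a feel for what sigma does, the sketch below computes soft cluster assignments in the style described in the Harmony paper: cells and centroids are L2-normalized, the distance is 2(1 - cosine similarity), and each cell's membership is proportional to exp(-distance / sigma). This is a simplified illustration that ignores the diversity penalty (theta); `soft_assign` is a hypothetical name, not a harmonypy function.

```python
import numpy as np


def soft_assign(Z, Y, sigma=0.1):
    """Soft memberships of cells Z (n_cells, d) to centroids Y (n_clusters, d)."""
    Zc = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Yc = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    dist = 2.0 * (1.0 - Zc @ Yc.T)  # cosine-based distance
    logits = -dist / sigma
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    R = np.exp(logits)
    return R / R.sum(axis=1, keepdims=True)  # each row sums to 1


Z = np.array([[1.0, 0.2], [1.0, 1.0], [0.2, 1.0]])  # three toy cells
Y = np.array([[1.0, 0.0], [0.0, 1.0]])              # two toy centroids

R_sharp = soft_assign(Z, Y, sigma=0.1)  # small sigma: near-hard assignment
R_soft = soft_assign(Z, Y, sigma=1.0)   # large sigma: softer assignment
print(np.round(R_sharp, 3))
print(np.round(R_soft, 3))
```

A smaller sigma pushes each cell toward a single cluster, while a larger sigma spreads its membership across clusters. The middle cell is equidistant from both centroids and is split evenly at any sigma.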

The ncores parameter controls BLAS threading (Accelerate on macOS, OpenBLAS on Linux). Default is 0 (use all available cores). Set ncores=1 for single-threaded execution.
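
As a worked example of the nclust default listed in the parameter table, the sketch below evaluates min(N/30, 100) for a few dataset sizes. `default_nclust` is a hypothetical helper, and rounding N/30 to the nearest integer is an assumption; the table does not say how the fraction is converted to a whole number.

```python
def default_nclust(n_cells: int) -> int:
    """Evaluate the documented default min(N/30, 100) for N cells."""
    # Assumption: N/30 is rounded to the nearest integer before taking the minimum.
    return int(min(round(n_cells / 30), 100))


for n in (600, 3500, 69000):
    print(n, default_nclust(n))
# 600 -> 20, 3500 -> 100, 69000 -> 100
```

Under this reading, the cap of 100 clusters applies to any dataset with roughly 3,000 cells or more.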

Performance

Running the script in tests/test_harmony.py on an Apple M1 (2022) chip reports:

  Dataset                    Time    RSS delta
  ---------------------- -------- ------------
  Small (3.5k cells)        0.23s     45.2 MB
  Medium (69k cells)        4.76s    262.3 MB
  Large (858k cells)       29.29s   1969.5 MB
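
For readers who want comparable numbers on their own data, here is a minimal sketch of timing a call and recording how much the process's peak resident memory grew. It uses only the standard library (the resource module is available on Linux and macOS). `measure` is a hypothetical helper, and the way tests/test_harmony.py computes its "RSS delta" column may differ.

```python
import resource
import sys
import time


def peak_rss_mb():
    """Peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak / (1024 ** 2) if sys.platform == "darwin" else peak / 1024


def measure(fn, *args, **kwargs):
    """Call fn once; return (result, elapsed seconds, growth in peak RSS in MB)."""
    rss_before = peak_rss_mb()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    seconds = time.perf_counter() - start
    return result, seconds, peak_rss_mb() - rss_before


# Toy workload standing in for a run_harmony call.
total, seconds, rss_delta = measure(sum, range(1_000_000))
print(f"{seconds:.2f}s  {rss_delta:.1f} MB")
```

Because this tracks peak memory, the delta is zero when the call stays below an earlier high-water mark, so it is best measured in a fresh process.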

Citation

If you use Harmony in your work, please cite the original paper:

Korsunsky, I., Millard, N., Fan, J. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 16, 1289–1296 (2019). https://doi.org/10.1038/s41592-019-0619-0

The Supplementary Information PDF provides detailed mathematical descriptions and implementation notes.

To learn more about Harmony 2, please see the preprint here:

Patikas, Nikolaos, Hongcheng Yao, Roopa Madhu, Soumya Raychaudhuri, Martin Hemberg, and Ilya Korsunsky. 2026. Integration of Large, Complex Single-Cell Datasets with Harmony2. bioRxiv. https://doi.org/10.64898/2026.03.16.711825