RcppML

March 15, 2026


Vignettes: https://zdebruine.github.io/RcppML/articles/getting-started.html

RcppML is an R package for high-performance Non-negative Matrix Factorization (NMF), truncated SVD/PCA, and divisive clustering of large sparse and dense matrices. It supports six statistical distributions via IRLS, cross-validation for automatic rank selection, optional CUDA GPU acceleration, out-of-core streaming for datasets larger than memory, and a composable graph DSL for multi-modal and deep NMF.


Overview

RcppML decomposes a matrix A into lower-rank non-negative factors:

A ≈ W · diag(d) · H

where W (features × k) and H (k × samples) have columns/rows scaled to unit sum via a diagonal d, ensuring interpretable, scale-invariant factors.
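The role of the diagonal can be illustrated in plain base R (a toy sketch with random matrices; RcppML performs this scaling internally):

```r
set.seed(1)
m <- 8; n <- 6; k <- 3

# Random non-negative factors before scaling
W_raw <- matrix(runif(m * k), m, k)
H_raw <- matrix(runif(k * n), k, n)

# Scale columns of W and rows of H to unit sum; d absorbs the magnitudes
w_scale <- colSums(W_raw)
h_scale <- rowSums(H_raw)
W <- sweep(W_raw, 2, w_scale, "/")
H <- sweep(H_raw, 1, h_scale, "/")
d <- w_scale * h_scale

# The product is unchanged by the rescaling
all.equal(W_raw %*% H_raw, W %*% diag(d) %*% H)  # TRUE
```

Because the per-factor magnitudes live entirely in d, factors can be compared and ranked by their diagonal weights, much like singular values in SVD.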

Key capabilities:

  • Fast NMF by alternating least squares with coordinate descent or Cholesky NNLS solvers, built on the Eigen C++ library with OpenMP parallelism
  • Six statistical distributions — Gaussian (MSE), Generalized Poisson, Negative Binomial, Gamma, Inverse Gaussian, Tweedie — fitted via Iteratively Reweighted Least Squares (IRLS)
  • Cross-validation with speckled holdout masks for principled rank selection
  • GPU acceleration via CUDA (cuBLAS/cuSPARSE) with automatic fallback to CPU
  • Streaming NMF from StreamPress (.spz) files for datasets that exceed available memory
  • FactorNet graph API for composable multi-modal, deep, and branching NMF pipelines
  • Truncated SVD with five methods: deflation, Krylov, Lanczos, IRLBA, randomized
  • Divisive clustering via recursive rank-2 factorizations

Installation

# Stable release from CRAN
install.packages("RcppML")

# Development version from GitHub (requires Rcpp and RcppEigen)
devtools::install_github("zdebruine/RcppML")

GPU support (optional): Requires CUDA Toolkit ≥ 11.0 installed on the system. See vignette("gpu-acceleration") for build instructions.


Quick Start

library(RcppML)

# Load a built-in gene expression dataset
data(aml)  # 824 genomic regions × 135 samples (dense matrix)

# Run NMF with rank k = 6
model <- nmf(aml, k = 6, seed = 42)
model
#> nmf model
#> k = 6
#> w: 824 x 6
#> d: 6
#> h: 6 x 135

# Inspect factor loadings
head(model@w)
model@d
model@h[, 1:5]

# Reconstruction error
evaluate(model, aml)

# Project onto new data
H_new <- predict(model, aml[, 1:50])

# Plot the model summary
plot(model)


Statistical Distributions

RcppML goes beyond standard Frobenius-norm NMF by supporting distribution-appropriate loss functions via IRLS. This is critical for count data (scRNA-seq, text mining) where Gaussian assumptions are wrong.

| loss = | Distribution | Variance V(μ) | Use Case |
|---|---|---|---|
| "mse" | Gaussian | constant | Dense/general data (default) |
| "gp" | Generalized Poisson | μ(1+θ)² | Overdispersed counts |
| "nb" | Negative Binomial | μ + μ²/r | scRNA-seq, quadratic mean-variance relationship |
| "gamma" | Gamma | μ² | Positive continuous data |
| "inverse_gaussian" | Inverse Gaussian | μ³ | Heavy right-tailed positive data |
| "tweedie" | Tweedie | μ^p | Flexible power-law variance |

Example: Negative Binomial NMF for scRNA-seq

data(aml)

# NB is the natural choice for count data
model_nb <- nmf(aml, k = 6, loss = "nb")

# Fine-grained control via ... parameters
model_nb <- nmf(aml, k = 6,
                loss = "nb",
                nb_size_init = 10,     # Initial size parameter r
                nb_size_max = 1e6)     # Upper bound for r

# Automatic distribution selection via BIC
auto <- auto_nmf_distribution(aml, k = 6)
auto$best       # best distribution name
auto$comparison  # BIC/AIC table for each distribution

Zero-Inflation Models

For data with excess zeros (e.g., droplet-based scRNA-seq), zero-inflated GP and NB models estimate structural dropout probabilities separately from the count model:

# Zero-inflated Negative Binomial with per-row dropout
model_zinb <- nmf(aml, k = 6, loss = "nb", zi = "row")

# Per-column dropout estimation
model <- nmf(aml, k = 6, loss = "nb", zi = "col")

Cross-Validation

Cross-validation uses a speckled holdout mask — a random subset of matrix entries are held out during training, and test error is computed on these entries. This enables principled rank selection.
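The idea of a speckled mask can be sketched in base R (illustration only; nmf() constructs and applies its mask internally):

```r
set.seed(42)
A <- matrix(rpois(200, lambda = 3), 20, 10)

# Speckled mask: hold out 10% of entries, scattered at random
test_mask <- matrix(runif(length(A)) < 0.1, nrow(A), ncol(A))

# Training sees only the unmasked entries; test error is computed
# over the held-out entries alone (a constant stand-in "model" here)
A_hat <- matrix(mean(A[!test_mask]), nrow(A), ncol(A))
test_mse <- mean((A[test_mask] - A_hat[test_mask])^2)
```

Because every row and column retains most of its entries, the factorization can still be fit on the full matrix shape while the held-out speckle provides an unbiased test set.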

data(aml)

# Sweep k = 2 through 20, three replicates per rank
cv <- nmf(aml, k = 2:20, test_fraction = 0.1, cv_seed = 1:3)

# Visualize test error vs rank
plot(cv)

# Automatic rank selection (binary search)
model_auto <- nmf(aml, k = "auto")
model_auto@misc$rank_search$k_optimal

Cross-Validation Options

cv <- nmf(data, k = 2:15,
          test_fraction = 0.1,      # 10% holdout
          mask = "zeros",           # Only non-zero entries in test set (for sparse data)
          patience = 5,             # Early stopping patience
          cv_seed = c(42, 123, 7))  # Multiple seeds for replicates

Regularization

RcppML supports a comprehensive suite of regularization methods that can be combined freely:

L1 (LASSO) — Sparsity

# Symmetric L1 on both W and H
model <- nmf(data, k = 10, L1 = 0.1)

# Asymmetric: sparse H (embedding), dense W (loadings)
model <- nmf(data, k = 10, L1 = c(0, 0.2))

L2 (Ridge) — Shrinkage

model <- nmf(data, k = 10, L2 = c(0.01, 0.01))

L21 (Group Sparsity) — Factor Selection

L21-norm regularization drives entire rows of W or columns of H toward zero, effectively performing automatic factor selection:

model <- nmf(data, k = 20, L21 = c(0.1, 0))
# Some factors in W will be entirely zeroed out

Angular — Decorrelation

Encourages orthogonality between factors:

model <- nmf(data, k = 10, angular = c(0.1, 0.1))

Graph Laplacian — Spatial/Network Smoothness

If features or samples have a known graph structure (e.g., gene regulatory network, spatial coordinates), graph regularization encourages connected nodes to have similar factor representations:

# gene_graph: m × m sparse adjacency matrix
model <- nmf(data, k = 10, graph_W = gene_graph, graph_lambda = 0.5)

# Both feature and sample graphs
model <- nmf(data, k = 10,
             graph_W = gene_graph, graph_H = cell_graph,
             graph_lambda = c(0.5, 0.3))

Upper Bound Constraints

# Box constraint: all entries in W and H between 0 and 1
model <- nmf(data, k = 10, upper_bound = c(1, 1))

Robust NMF (Huber Loss)

# Huber-type robustness (less sensitive to outliers)
model <- nmf(data, k = 10, robust = TRUE)

# Custom Huber delta
model <- nmf(data, k = 10, robust = 2.0)

# MAE (L1 loss)
model <- nmf(data, k = 10, robust = "mae")

GPU Acceleration

RcppML optionally uses CUDA for GPU-accelerated NMF, delivering 10–20× speedups on large matrices.

# Check GPU availability
gpu_available()
#> [1] TRUE

gpu_info()
#> $device_name: "NVIDIA A100"
#> $total_memory_mb: 40960

# Sparse NMF on GPU
model <- nmf(data, k = 20, resource = "gpu")

# Dense NMF on GPU
model <- nmf(as.matrix(data), k = 20, resource = "gpu")

# Auto-dispatch (default): uses GPU if available, falls back to CPU
model <- nmf(data, k = 20, resource = "auto")

The GPU backend supports:

  • Standard NMF (sparse and dense)
  • Cross-validation NMF
  • MSE loss (GPU-native), all other losses via automatic CPU fallback
  • OpenMP + CUDA hybrid execution
  • Zero-copy NMF with st_read_gpu() for pre-loaded GPU data

Streaming Large Data (StreamPress .spz Files)

For datasets that exceed available RAM, RcppML streams data from StreamPress (.spz) compressed files — a column-oriented binary format with 10–20× compression via rANS entropy coding.

Writing and Reading SPZ Files

library(RcppML)
data(pbmc3k)

# Write sparse matrix to .spz file
spz_path <- tempfile(fileext = ".spz")
st_write(pbmc3k, spz_path, include_transpose = TRUE)

# Read back into memory
mat <- st_read(spz_path)
identical(dim(mat), dim(pbmc3k))  # TRUE

# File size comparison
file.size(spz_path)  # Much smaller than RDS/RDA

Streaming NMF

# NMF directly from .spz file — data never fully loaded into memory
model <- nmf(spz_path, k = 10)

# Streaming cross-validation
cv <- nmf(spz_path, k = 2:15, test_fraction = 0.1)

# Streaming with non-MSE distributions
model <- nmf(spz_path, k = 10, loss = "nb")

Streaming NMF processes data in column panels with double-buffered asynchronous I/O, maintaining O(m·k + n·k + chunk) memory regardless of total matrix size.
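As a rough illustration of that memory bound (hypothetical sizes, assuming 8-byte doubles; the actual panel size is chosen internally):

```r
# Approximate working memory of streaming NMF: both factor matrices
# plus one buffered column panel of the data
streaming_mem_gb <- function(m, n, k, chunk_cols, bytes = 8) {
  (m * k + n * k + m * chunk_cols) * bytes / 1e9
}

# A 100,000 x 1,000,000 matrix at k = 50 with 1,000-column panels
streaming_mem_gb(1e5, 1e6, 50, 1000)  # about 1.2 GB

# versus holding the full dense matrix in RAM
1e5 * 1e6 * 8 / 1e9                   # 800 GB
```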


FactorNet Graph API

The factor_net() API composes complex factorization pipelines from simple building blocks. Multi-modal, deep, and branching NMF networks are expressed as directed graphs:

Multi-Modal NMF

Share a common embedding H across two data matrices (e.g., RNA + ATAC from the same cells):

# Define inputs
inp_rna  <- factor_input(rna_matrix, "rna")
inp_atac <- factor_input(atac_matrix, "atac")

# Create shared input node
shared <- factor_shared(inp_rna, inp_atac)

# Build NMF layer with per-factor regularization
layer <- shared |> nmf_layer(k = 10, name = "shared",
  W = W(L1 = 0.1),
  H = H(L1 = 0.05))

# Compile and fit the network
net <- factor_net(
  inputs = list(inp_rna, inp_atac),
  output = layer,
  config = factor_config(maxit = 100, seed = 42))

result <- fit(net)

# Access results (layer name = "shared")
result$shared$W$rna   # RNA feature loadings
result$shared$W$atac  # ATAC feature loadings
result$shared$H       # Shared cell embedding (k × n_cells)

Deep NMF

Stack layers for hierarchical decomposition:

inp <- factor_input(data, "X")

# Encoder layer: reduce to 20 factors
enc <- inp |> nmf_layer(k = 20, name = "encoder")

# Bottleneck: compress to 5 factors
bot <- enc |> nmf_layer(k = 5, name = "bottleneck")

net <- factor_net(inputs = list(inp), output = bot,
                  config = factor_config(maxit = 100, seed = 42))
result <- fit(net)

Conditional Factorization

Append conditioning metadata to a layer's H before passing downstream:

inp <- factor_input(data, "X")
layer1 <- inp |> nmf_layer(k = 5)

# Z is a conditioning matrix (n × p)
Z <- matrix(rep(c(1, 0, 0, 1), c(50, 50, 50, 50)), nrow = 100, ncol = 2)
conditioned <- factor_condition(layer1, Z)

layer2 <- conditioned |> nmf_layer(k = 10, name = "output")

net <- factor_net(inputs = list(inp), output = layer2,
                  config = factor_config(seed = 42))
result <- fit(net)


Truncated SVD / PCA

RcppML provides five SVD algorithms with optional constraints and cross-validation:

# Standard truncated SVD
result <- svd(data, k = 10)

# PCA (centered SVD)
result <- pca(data, k = 10)

# Non-negative PCA
result <- pca(data, k = 10, nonneg = TRUE)

# Sparse PCA with L1 penalty
result <- pca(data, k = 10, L1 = c(0, 0.1))

# Auto-rank selection
result <- svd(data, k = "auto")

# Method selection
result <- svd(data, k = 10, method = "lanczos")   # Fast unconstrained
result <- svd(data, k = 10, method = "krylov")     # Block method, all constraints
result <- svd(data, k = 10, method = "randomized") # Approximate, very fast

| Method | Constraints | Speed | Best For |
|---|---|---|---|
| deflation | All | Moderate | Mixed constraints, small k |
| krylov | All | Fast | Large k with constraints |
| lanczos | None | Very fast | Unconstrained SVD |
| irlba | None | Fast | General unconstrained |
| randomized | None | Very fast | Approximate large-scale |

Divisive Clustering

RcppML implements divisive hierarchical clustering via recursive rank-2 NMF bipartitioning:

# Single bipartition
bp <- bipartition(data)
bp$samples  # Cluster assignments (0/1)

# Recursive divisive clustering
clusters <- dclust(data, min_samples = 50, min_dist = 0.05)
clusters$id  # Cluster labels

# Consensus clustering (multiple runs for stability)
cons <- consensus_nmf(data, k = 5, n_runs = 20)
plot(cons)

Non-Negative Least Squares (NNLS)

Project factor matrices onto new data:

# Given W from NMF, solve for H on new data
H_new <- nnls(w = model@w, A = new_data)

# Solve for W given H (transpose projection)
W_new <- nnls(h = model@h, A = new_data)

# Unconstrained LS (semi-NMF projection)
H_ls <- nnls(w = model@w, A = new_data, nonneg = c(TRUE, FALSE))

# With regularization
H_sparse <- nnls(w = model@w, A = new_data, L1 = c(0, 0.1))

Semi-NMF

Allow negative values in W (unconstrained) while keeping H non-negative:

model <- nmf(data, k = 10, nonneg = c(FALSE, TRUE))
any(model@w < 0)  # TRUE — W can have negative entries
all(model@h >= 0)  # TRUE — H remains non-negative

Key Parameters Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| k | int / vector / "auto" | | Rank (vector for CV sweep, "auto" for search) |
| loss | string | "mse" | Loss function: mse, gp, nb, gamma, inverse_gaussian, tweedie |
| L1 | numeric(2) | c(0,0) | LASSO penalty [W, H] |
| L2 | numeric(2) | c(0,0) | Ridge penalty [W, H] |
| L21 | numeric(2) | c(0,0) | Group sparsity [W, H] |
| angular | numeric(2) | c(0,0) | Decorrelation penalty [W, H] |
| nonneg | logical(2) | c(TRUE,TRUE) | Non-negativity [W, H] |
| upper_bound | numeric(2) | c(0,0) | Box constraint [W, H] (0 = none) |
| zi | string | "none" | Zero-inflation: none, row, col |
| robust | logical / numeric | FALSE | Huber robustness (TRUE = δ = 1.345, numeric = custom δ) |
| solver | string | "auto" | NNLS solver: auto, cd, cholesky |
| seed | int / matrix / string | NULL | Initialization: NULL (random), integer (reproducible), "lanczos", "irlba" |
| mask | various | NULL | Missing data: NULL, "zeros", "NA", or a mask matrix |
| resource | string | "auto" | Compute: auto, cpu, gpu |
| test_fraction | numeric | 0 | CV holdout fraction (0 = no CV) |
| tol | numeric | 1e-4 | Convergence tolerance |
| maxit | int | 100 | Maximum iterations |
| threads | int | 0 | OpenMP threads (0 = all) |
| verbose | logical | FALSE | Print progress |

Built-in Datasets

| Dataset | Description | Dimensions | Format |
|---|---|---|---|
| pbmc3k | PBMC single-cell RNA-seq (10x Genomics) | 13,714 × 2,638 | SPZ raw bytes |
| aml | AML leukemia ATAC-seq | 824 × 135 | Dense matrix |
| golub | Golub leukemia microarray | 38 × 5,000 | Sparse dgCMatrix |
| movielens | MovieLens ratings | 3,867 × 610 | Sparse dgCMatrix |
| hawaiibirds | Hawaii bird survey counts | 183 × 1,183 | Sparse dgCMatrix |
| olivetti | Olivetti face images | 400 × 4,096 | Sparse dgCMatrix |
| digits | Handwritten digits (MNIST subset) | 1,797 × 64 | Dense matrix |

Performance Tips

  1. Sparse input: Use dgCMatrix format (from the Matrix package) for sparse data — RcppML auto-detects and uses optimized sparse routines.

  2. Solver selection: For large k (> 32), solver = "cholesky" can be faster than coordinate descent. Use solver = "auto" (default) for automatic selection.

  3. Initialization: seed = "lanczos" provides better starting points and can reduce iteration count by 30–50%, but adds upfront SVD cost.

  4. Threads: RcppML uses OpenMP. Set options(RcppML.threads = 4) to control parallelism, or threads = 0 for all available cores.

  5. GPU: For matrices with > 10K rows and k > 8, GPU acceleration provides significant speedup. Use resource = "gpu".

  6. Streaming: For datasets larger than available RAM, write to .spz format with st_write() and factorize directly from the file path.
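Tip 1 in practice: a dense matrix can be coerced to compressed sparse column format with the Matrix package before factorization (a minimal sketch with synthetic data):

```r
library(Matrix)

set.seed(1)
# ~5% non-zero entries, mimicking sparse count data
A_dense <- matrix(rbinom(1e4, 1, 0.05) * rpois(1e4, 5), 100, 100)

# Coerce to compressed sparse column storage; RcppML detects the
# dgCMatrix class and dispatches to its sparse routines automatically
A_sparse <- as(A_dense, "CsparseMatrix")
class(A_sparse)  # "dgCMatrix"

# At this sparsity the compressed form is far smaller in memory
object.size(A_sparse) < object.size(A_dense)
```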


Contributing

See CONTRIBUTING.md for guidelines on reporting bugs, requesting features, and submitting pull requests.


Citation

citation("RcppML")
DeBruine ZJ, Melcher K, Triche TJ (2021). "High-performance non-negative
matrix factorization for large single-cell data." bioRxiv.
doi:10.1101/2021.09.01.458620.

License

GPL (≥ 3)