Transcriptomic Foundation Model Interfaces

April 7, 2026

Module: sc_neurocore.bio.transcriptomic
Rust path: sc_neurocore_engine::analysis::neural_decoders::gaussian_attention
References: Li et al. (2025) Genome Biology 26:402; Theodoris et al. (2023) Nature 619
Family: Single-cell transcriptomic foundation models
Tier: Research (explicit import from sc_neurocore.bio)
Exports: ScKGBERTInterface, GeneformerInterface, rank_value_encode

1. Mathematical Formalism

1.1 Shared: Rank-value encoding

Rank-value encoding (Theodoris et al. 2023) converts a cell's gene expression vector into an ordered sequence of gene tokens. The ordering reflects both expression magnitude and gene rarity across the corpus.

Weighting by inverse corpus median:

For gene $g$ with expression $x_g$ and corpus median $\tilde{x}_g$ :

$w_g = \frac{1}{\tilde{x}_g + \epsilon}$

where $\epsilon = 10^{-10}$ prevents division by zero.

Weighted expression:

$\hat{x}_g = x_g \cdot w_g$

Filtering and sorting:

Exclude all genes where $x_g = 0$ .
Sort remaining genes in descending order of $\hat{x}_g$ .
Return the sorted gene indices as an integer array.

The effect is that rarely expressed genes (low corpus median) receive higher weight, placing them earlier in the sequence. This is analogous to TF-IDF weighting in natural language processing: rare genes carry more information.

When global_medians is None, uniform weighting is applied ( $w_g = 1$ for all $g$ ), and the ranking reduces to a simple descending sort by raw expression.
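The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the library's `rank_value_encode`: the function name `rank_value_encode_sketch`, its signature, and the stable tie-breaking are hypothetical choices made for the example.

```python
import numpy as np

def rank_value_encode_sketch(expression, global_medians=None, eps=1e-10):
    """Illustrative rank-value encoding (hypothetical helper, not the library API)."""
    x = np.asarray(expression, dtype=float)
    if global_medians is None:
        weights = np.ones_like(x)                  # uniform weighting, w_g = 1
    else:
        weights = 1.0 / (np.asarray(global_medians, dtype=float) + eps)
    weighted = x * weights                         # x_hat_g = x_g * w_g
    expressed = np.flatnonzero(x != 0)             # exclude genes with x_g = 0
    # Stable descending sort of the expressed genes by weighted expression.
    order = np.argsort(-weighted[expressed], kind="stable")
    return expressed[order]

expression = np.array([5.0, 0.0, 2.0, 8.0])
medians = np.array([10.0, 1.0, 0.5, 4.0])

with_medians = rank_value_encode_sketch(expression, medians)
without_medians = rank_value_encode_sketch(expression)
print(with_medians)      # gene 2 comes first: low corpus median, high weight
print(without_medians)   # plain descending sort by raw expression
```

In this toy input, gene 2 has the lowest raw expression among the expressed genes but the smallest corpus median, so it moves to the front once the median weighting is applied.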

1.2 ScKGBERTInterface — Dual-encoder architecture

Li et al. (2025) introduced scKGBERT, a knowledge-enhanced foundation model for single-cell transcriptomics. The architecture consists of two parallel encoders that share a gene token embedding table.

S-Encoder (Sequence Encoder)

The S-Encoder processes the rank-value-encoded gene sequence through Gaussian self-attention. Given $n$ expressed genes, each represented by a $d$ -dimensional token embedding $\mathbf{e}_i$ :

Gather token embeddings for expressed genes from the shared embedding table.
Scale each token by its rank position: $\mathbf{t}_i = \mathbf{e}_i / (i + 1)$ , so higher-ranked genes have stronger representation.
Project to query, key, value spaces: $\mathbf{q}_i = \mathbf{t}_i \mathbf{W}_Q$ , $\mathbf{k}_i = \mathbf{t}_i \mathbf{W}_K$ , $\mathbf{v}_i = \mathbf{t}_i \mathbf{W}_V$ .
Apply Gaussian attention (see below).
Mean-pool the attended representations to produce a $d$ -dimensional cell embedding.
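The first three S-Encoder steps (gather, rank scaling, projection) can be sketched as follows. The embedding table and projection matrices are random placeholders, the variable names are hypothetical, and the rank index $i$ is assumed to be zero-based so that the top-ranked gene is divided by 1. The attention and pooling steps are left out here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, d = 6, 4                          # toy vocabulary size and embedding width
embedding_table = rng.normal(size=(n_genes, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

ranked_genes = np.array([2, 3, 0])         # e.g. the output of rank-value encoding

# Step 1: gather embeddings for the expressed genes, in rank order.
e = embedding_table[ranked_genes]          # shape (n, d)

# Step 2: divide by (i + 1), assuming zero-based rank i.
ranks = np.arange(len(ranked_genes))
t = e / (ranks[:, None] + 1)

# Step 3: project to query, key, and value spaces.
q, k, v = t @ W_q, t @ W_k, t @ W_v
print(q.shape, k.shape, v.shape)
```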

K-Encoder (Knowledge Graph Encoder)

The K-Encoder aggregates neighbourhood information from the protein-protein interaction (PPI) knowledge graph (STRING database). For each expressed gene $g$ :

Identify the PPI neighbourhood: all genes $j$ where the STRING confidence score $c_{gj} > 0$ .
Compute neighbourhood embedding as a confidence-weighted mean:

$\mathbf{h}_g = \frac{\sum_{j : c_{gj} > 0} c_{gj} \cdot \mathbf{e}_j}{\sum_{j : c_{gj} > 0} c_{gj}}$

If gene $g$ has no neighbours in the graph, fall back to its own embedding: $\mathbf{h}_g = \mathbf{e}_g$ .

Apply Gaussian attention over the neighbourhood embeddings.
Mean-pool to produce the K-Encoder cell embedding.
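The neighbourhood aggregation and its fallback can be sketched as below. The sketch assumes the PPI graph is stored as a dense confidence matrix with zeros for missing edges; that storage choice and the helper name `neighbourhood_embeddings` are hypothetical and made only for illustration.

```python
import numpy as np

def neighbourhood_embeddings(embeddings, confidence, expressed):
    """Confidence-weighted mean of neighbour embeddings for each expressed gene."""
    out = np.empty((len(expressed), embeddings.shape[1]))
    for row, g in enumerate(expressed):
        c = confidence[g]
        mask = c > 0                               # neighbours with c_gj > 0
        if mask.any():
            out[row] = (c[mask, None] * embeddings[mask]).sum(axis=0) / c[mask].sum()
        else:
            out[row] = embeddings[g]               # no neighbours: use own embedding
    return out

embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
confidence = np.array([
    [0.0, 0.9, 0.3, 0.0],
    [0.9, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],                          # gene 3 is isolated
])

h = neighbourhood_embeddings(embeddings, confidence, expressed=[0, 1, 2, 3])
print(h)
```

Gene 0 receives a blend of genes 1 and 2 weighted 0.9 to 0.3, while the isolated gene 3 keeps its own embedding.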

Gaussian attention

The central attention mechanism in scKGBERT replaces the scaled dot-product attention of standard Transformers with a Gaussian kernel over Euclidean distances:

$\alpha_{ij} = \frac{\exp\!\left(-\frac{\|\mathbf{q}_i - \mathbf{k}_j\|^2}{2\sigma^2}\right)}{\sum_{m=1}^{M} \exp\!\left(-\frac{\|\mathbf{q}_i - \mathbf{k}_m\|^2}{2\sigma^2}\right)}$

where $M$ is the number of keys and $\sigma$ is the bandwidth parameter controlling attention sharpness:

Small $\sigma$ concentrates attention on the nearest keys (sharp, selective).
Large $\sigma$ spreads attention almost uniformly across all keys (broad, smoothing).

The output for query $i$ is:

$\mathbf{o}_i = \sum_{j=1}^{M} \alpha_{ij} \mathbf{v}_j$

Numerical stability is maintained by subtracting the maximum log-weight before exponentiation (log-sum-exp trick).
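A minimal NumPy sketch of the Gaussian attention equations, including the max-subtraction step, is shown below. The function name `gaussian_attention_sketch` and its return signature are hypothetical; it is not the Rust kernel referenced in the module header.

```python
import numpy as np

def gaussian_attention_sketch(q, k, v, sigma=1.0):
    """Gaussian-kernel attention; returns (outputs, attention weights)."""
    # Squared Euclidean distance between every query and every key: shape (n_q, M).
    sq_dist = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
    log_w = -sq_dist / (2.0 * sigma ** 2)
    log_w -= log_w.max(axis=1, keepdims=True)      # subtract row maximum for stability
    w = np.exp(log_w)
    alpha = w / w.sum(axis=1, keepdims=True)       # normalise: each row sums to 1
    return alpha @ v, alpha

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 2))

out, alpha_sharp = gaussian_attention_sketch(q, k, v, sigma=0.1)
_, alpha_broad = gaussian_attention_sketch(q, k, v, sigma=100.0)
print(alpha_sharp.round(3))   # mass concentrated on the nearest keys
print(alpha_broad.round(3))   # close to uniform over the five keys
```

Subtracting the row maximum leaves the normalised weights unchanged, because the same constant factor cancels in the numerator and denominator, but it keeps the largest exponent at zero so the row sum is always at least 1.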

Fusion

The final cell embedding is the arithmetic mean of the S-Encoder and K-Encoder outputs:

$\mathbf{z} = \frac{\mathbf{z}_S + \mathbf{z}_K}{2}$

Gene importance scoring

Gene importance is derived from Gaussian attention column sums. For the attention weight matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ :

$\text{importance}(g) = \sum_{i=1}^{n} \alpha_{ig}$

Genes that receive high total incoming attention are those that other genes attend to most strongly, which is taken as an indicator of biological centrality.
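As a small worked example of the column-sum score, the sketch below uses a hand-written 3 × 3 attention matrix. The matrix values and gene indices are made up for illustration.

```python
import numpy as np

# Toy attention matrix: row i holds the weights that query i assigns to each key.
attention = np.array([
    [0.7, 0.1, 0.2],
    [0.6, 0.3, 0.1],
    [0.5, 0.1, 0.4],
])
gene_ids = np.array([2, 3, 0])                     # hypothetical genes, in rank order

importance = attention.sum(axis=0)                 # column sums: total incoming attention
ranking = gene_ids[np.argsort(-importance, kind="stable")]
print(importance)
print(ranking)                                     # genes from most to least attended
```

Because every row sums to 1, the importance scores always sum to $n$, so they describe how a fixed attention budget is divided among the genes.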

1.3 GeneformerInterface — Masked gene prediction

Theodoris et al. (2023) introduced Geneformer, a foundation model pretrained on ~30 million single-cell transcriptomes (v1) and ~95 million (v2). The model uses rank-value tokenisation and learns gene network dynamics via masked gene prediction.

Tokenisation

The tokenisation procedure applies rank-value encoding (Section 1.1), then filters to the gene vocabulary of size $V$ :

Compute rank-value encoding of the expression vector.
Retain only gene indices $g < V$ (within vocabulary).
Return as an ordered token sequence.
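The vocabulary filter in the second step amounts to a boolean mask that preserves rank order. In this sketch the helper name `tokenise_sketch` is hypothetical, and the input is assumed to be an already rank-value-encoded index array.

```python
import numpy as np

def tokenise_sketch(ranked_genes, vocab_size):
    """Keep only gene indices below vocab_size, preserving rank order."""
    ranked_genes = np.asarray(ranked_genes)
    return ranked_genes[ranked_genes < vocab_size]

tokens = tokenise_sketch([7, 2, 9, 0, 4], vocab_size=5)
print(tokens)   # genes 7 and 9 fall outside the vocabulary and are dropped
```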

Multi-head self-attention (Vaswani et al. 2017)

Given a sequence of $n$ gene token embeddings $\mathbf{X} \in \mathbb{R}^{n \times d}$ and $H$ attention heads with head dimension $d_h = d / H$ :

For each head $h \in \{1, \ldots, H\}$ :