RedCEA: Repertoire Embeddings Denoising Clustering Enrichment Analysis

April 24, 2026

RedCEA is a pipeline for comparing immune repertoires using prototype-based TCR embeddings. It builds on the original TCRemP embedding method and adds denoising, clustering, and enrichment analysis for case/background repertoire comparisons.

This repository contains command-line tools for:

  • computing prototype-based embeddings for a repertoire with tcremp-run
  • clustering embeddings with tcremp-cluster
  • running the end-to-end comparison pipeline with redcea

Installation

Prerequisites

Prepare a clean Linux server with:

  • git
  • conda, for example Miniconda or Mambaforge
  • internet access for Python package installation
  • access to GitHub, because the mir dependency is installed from a git URL

Recommended Python version: 3.11.

Create the environment

git clone https://gitlab.aldan3.itm-rsmu.ru/isagroup/redcea.git
cd redcea

conda create -n redcea python=3.11 -y
conda activate redcea

python -m pip install --upgrade pip setuptools wheel
python -m pip install -e ".[test]"

This installs:

  • the redcea package in editable mode
  • the RedCEA CLI entry points redcea and tcrempnet
  • the tcremp dependency, which provides tcremp-run and tcremp-cluster
  • test dependencies including pytest

Notes:

  • mir is installed from https://github.com/antigenomics/mirpy.git
  • default clustering for redcea is vdbscan
  • optional Leiden-based clustering requires an extra dependency

Optional: install Leiden support

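If you plan to run redcea with --cluster-algo set to leiden, hierarchical_leiden, or leiden_dbscan, install the optional Leiden dependency. Keep the -e flag so that the editable install created above is not replaced by a regular one:

python -m pip install -e ".[leiden]"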

If this optional install fails, you can still run the default vdbscan pipeline.

Clustering modes

redcea supports several clustering backends:

  • vdbscan: default RedCEA mode with per-group eps estimation on the joint sample/background graph
  • dbscan: legacy TCRempNet mode on reduced embeddings, with pre-filtering of points with d1 > eps before running plain DBSCAN
  • leiden: graph clustering on the joint KNN graph
  • hierarchical_leiden: two-stage Leiden clustering
  • leiden_dbscan: Leiden followed by per-cluster DBSCAN refinement

Use dbscan if you need behavior closer to historical TCRempNet runs and want the old epsilon-based noise pre-filter back.
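
The pre-filter used by the dbscan mode can be pictured with a small pure-Python sketch. It assumes that d1 means the distance from a point to its nearest neighbor, which this document does not define, and it uses a hand-picked eps. The function name prefilter_by_d1 is hypothetical, and this is an illustration of the idea rather than RedCEA's implementation.

```python
import math


def prefilter_by_d1(points, eps):
    """Keep points whose nearest-neighbor distance (assumed meaning of d1) is <= eps."""
    kept = []
    for i, p in enumerate(points):
        # Distances to every other point; O(n^2), acceptable for a toy example.
        others = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        if others and min(others) <= eps:
            kept.append(p)
    return kept


# Three close points and one isolated point in a 2D stand-in for reduced embeddings.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dense = prefilter_by_d1(points, eps=0.5)
print(dense)  # the isolated point (5.0, 5.0) is dropped
```

Under this reading, only the surviving points would be passed to plain DBSCAN, so isolated clonotypes are treated as noise before clustering starts.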

Verify the installation

Run the following commands in the activated environment:

python -c "import redcea, tcremp, mir; print('imports: OK')"
redcea --help
tcremp-run --help
pytest -q

Expected result:

  • imports succeed without ModuleNotFoundError
  • CLI help is printed for the requested commands
  • tests pass
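
The import and CLI checks above can also be scripted, which is convenient on a freshly provisioned server. The sketch below uses only the standard library; the helper name check_install is hypothetical, and the module and command names are the ones listed in this README.

```python
import importlib.util
import shutil


def check_install(modules, commands):
    """Return (modules that cannot be found, commands missing from PATH)."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_commands = [c for c in commands if shutil.which(c) is None]
    return missing_modules, missing_commands


if __name__ == "__main__":
    mods, cmds = check_install(
        ["redcea", "tcremp", "mir"],
        ["redcea", "tcrempnet", "tcremp-run", "tcremp-cluster"],
    )
    print("missing modules:", mods or "none")
    print("missing commands:", cmds or "none")
```

Run it inside the activated redcea environment; both lists should be empty after a successful installation.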

Important limitation:

  • the test suite validates parser logic, helper functions, and a mocked pipeline smoke test
  • it does not replace a real run on a small dataset in your target environment

Recommended post-install smoke check

For a fresh server, use this validation sequence:

  1. install the package with python -m pip install -e ".[test]"
  2. run redcea --help
  3. run pytest -q
  4. run one small real dataset through redcea or tcremp-run
  5. confirm that expected output files are created and the log ends without runtime errors

Running RedCEA

Option 1: two-step execution

If you plan to compare multiple case samples against the same background repertoire, compute the background embeddings once with tcremp-run and reuse them in downstream redcea runs.

Step 1: compute embeddings

tcremp-run \
  --input /projects/immunestatus/airr_format/sample.tsv \
  --output ./results \
  --chain TRB \
  -np 48

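This produces embedding outputs in ./results. Repeat the command for the background repertoire so that both embedding files passed to redcea in Step 2 exist.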

For large samples, embedding can take hours and requires substantial CPU and memory.

Step 2: run redcea on saved embeddings

redcea \
  -is /projects/immunestatus/airr_format/sample.tsv \
  -ib /projects/immunestatus/airr_format/background.tsv \
  -c TRB \
  -o ./results \
  -np 4 \
  -se ./results/sample_tcremp.parquet \
  -be ./results/background_tcremp.parquet

Use this mode when the embedding files already exist and you want to skip recomputation.
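
When several case samples share one background, the redcea commands can be generated in a loop. The sketch below only builds and prints the command lines. It assumes that redcea accepts -be without -se and computes the missing sample embedding itself, which this README implies but does not state outright. The helper build_redcea_cmd and the case file names are hypothetical.

```python
import shlex
from pathlib import Path


def build_redcea_cmd(sample, background, background_embedding, out_dir,
                     chain="TRB", nproc=4):
    """Build one redcea command that reuses a precomputed background embedding."""
    return [
        "redcea",
        "-is", str(sample),
        "-ib", str(background),
        "-c", chain,
        "-o", str(out_dir),
        "-np", str(nproc),
        "-be", str(background_embedding),
    ]


# Hypothetical case files, each written to its own output directory.
samples = ["case_1.tsv", "case_2.tsv"]
for sample in samples:
    cmd = build_redcea_cmd(
        sample,
        "background.tsv",
        "results/background_tcremp.parquet",
        Path("results") / Path(sample).stem,
    )
    print(shlex.join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```

Writing each case to a separate output directory keeps the per-sample result files from overwriting each other.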

Option 2: end-to-end pipeline

redcea \
  -is sample.tsv \
  -ib background.tsv \
  -c TRB \
  -o ./results \
  -np 8

In this mode, embeddings for both sample and background are computed automatically if they are not already available.


CLI Tools

| CLI Tool | Description |
|---|---|
| tcremp-run | Computes TCRemP embeddings and, optionally, clusters them |
| redcea | Runs embedding, clustering, and enrichment |
| tcremp-cluster | Clusters existing embeddings |

Example: Yellow Fever Dataset

redcea \
  --sample /projects/immunestatus/pogorelyy/airr_format/yfv_day_15.txt \
  --background /projects/immunestatus/pogorelyy/airr_format/yfv_day_0.txt \
  --output /projects/immunestatus/pogorelyy/redcea/yfv_res \
  --chain TRB \
  --prefix yfv_result \
  -np 16

Output Files

Depending on the mode, the pipeline may create:

| File Name | Description |
|---|---|
| *_sample_embeddings.parquet | Sample embeddings produced or reused by redcea |
| *_background_embeddings.parquet | Background embeddings produced or reused by redcea |
| *_tcremp_clusters.tsv | Cluster assignments for both sample and background clonotypes |
| *_summary_tcrempnet.tsv | Per-cluster summary with counts, p-values, FDR, and log fold change |
| *_enriched_clonotypes_tcremp.tsv | Clonotypes from enriched clusters |
| *_enriched_embeddings_tcremp.parquet | Embeddings of enriched clonotypes with cluster metadata |
| *.log | Run log for debugging and runtime tracking |
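
A common follow-up is to list the clusters that pass an FDR cutoff. The sketch below reads the summary table with the standard library only. It assumes the file is tab-separated and uses the enrichment_fdr_zbinom, log_fold_change, and cluster_id columns named in this README. The 0.05 cutoff and the helper name enriched_clusters are illustrative choices, not pipeline defaults.

```python
import csv


def _to_float(value):
    """Parse a float, returning None for empty or malformed cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def enriched_clusters(summary_path, fdr_threshold=0.05):
    """Return summary rows with FDR below the threshold, largest log fold change first."""
    with open(summary_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    hits = []
    for row in rows:
        fdr = _to_float(row.get("enrichment_fdr_zbinom"))
        lfc = _to_float(row.get("log_fold_change"))
        # Rows with missing or unparsable values are skipped rather than guessed.
        if fdr is not None and lfc is not None and fdr < fdr_threshold:
            hits.append(row)
    hits.sort(key=lambda r: float(r["log_fold_change"]), reverse=True)
    return hits


# Usage, with a placeholder path:
# for row in enriched_clusters("path/to/prefix_summary_tcrempnet.tsv"):
#     print(row["cluster_id"], row["enrichment_fdr_zbinom"], row["log_fold_change"])
```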

What to check after a real run

Treat the run as successful only if all of the following are true:

  • the output directory exists
  • *_sample_embeddings.parquet and *_background_embeddings.parquet exist or were intentionally supplied as inputs
  • *_tcremp_clusters.tsv exists
  • *_summary_tcrempnet.tsv exists and contains cluster_id, cluster_size, sample, background, enrichment_fdr_zbinom, and log_fold_change
  • the log file ends with the message "TCRempNet pipeline completed."
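
The checklist above can be automated with a short standard-library script. This sketch assumes that the run log is written into the output directory and that the completion message appears somewhere on the last non-empty log line, possibly after a timestamp. The helper name check_run is hypothetical.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"cluster_id", "cluster_size", "sample", "background",
                    "enrichment_fdr_zbinom", "log_fold_change"}
DONE_MESSAGE = "TCRempNet pipeline completed."


def check_run(out_dir, embeddings_supplied=False):
    """Return a list of problems found in a redcea output directory (empty if none)."""
    out = Path(out_dir)
    if not out.is_dir():
        return [f"output directory not found: {out}"]
    problems = []
    patterns = ["*_tcremp_clusters.tsv", "*_summary_tcrempnet.tsv", "*.log"]
    if not embeddings_supplied:
        # Skip these when embeddings were passed in with -se / -be.
        patterns += ["*_sample_embeddings.parquet", "*_background_embeddings.parquet"]
    for pattern in patterns:
        if not any(out.glob(pattern)):
            problems.append(f"no file matching {pattern}")
    for summary in out.glob("*_summary_tcrempnet.tsv"):
        with summary.open(newline="") as handle:
            header = next(csv.reader(handle, delimiter="\t"), [])
        missing = REQUIRED_COLUMNS - set(header)
        if missing:
            problems.append(f"{summary.name} lacks columns: {sorted(missing)}")
    for log in out.glob("*.log"):
        text = log.read_text(errors="replace")
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines or DONE_MESSAGE not in lines[-1]:
            problems.append(f"{log.name} does not end with the completion message")
    return problems


# Usage: print(check_run("./results") or "run looks complete")
```

Pass embeddings_supplied=True when the embedding files were given as inputs and are stored outside the output directory.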

Input Expectations

The pipeline expects repertoire tables that can be parsed by the underlying tcremp AIRR-loading utilities.

Before running on a clean server, verify on one small file that:

  • the file path is correct and readable by the current user
  • the repertoire contains the requested chain: TRA, TRB, or TRA_TRB
  • required CDR3 and V/J fields expected by tcremp are present
  • the file is not empty after filtering by chain and CDR3 length

If a run fails at startup, first check file format compatibility and chain selection.
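
A quick pre-check of an input table might look like the sketch below. It assumes a tab-separated file with the AIRR Rearrangement column names junction_aa, v_call, and j_call; the columns that tcremp actually requires may differ, so adjust the keyword arguments to match your files. The length bounds mirror the documented defaults of --lower-len-cdr3 and --higher-len-cdr3. The helper name count_usable_rows is hypothetical, and paired TRA_TRB input is not handled.

```python
import csv


def count_usable_rows(path, chain="TRB", min_len=5, max_len=30,
                      cdr3_col="junction_aa", v_col="v_call", j_col="j_call"):
    """Count rows with V/J calls for the chain and a CDR3 length in [min_len, max_len]."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = {cdr3_col, v_col, j_col} - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        usable = 0
        for row in reader:
            # Short rows yield None for absent cells, so fall back to "".
            cdr3 = (row[cdr3_col] or "").strip()
            v_gene = (row[v_col] or "").strip()
            j_gene = (row[j_col] or "").strip()
            if (v_gene.startswith(chain) and j_gene.startswith(chain)
                    and min_len <= len(cdr3) <= max_len):
                usable += 1
    return usable


# Usage, with a placeholder path:
# n = count_usable_rows("sample.tsv", chain="TRB")
# print("usable rows:", n)
```

A result of zero points to a chain mismatch, unexpected column contents, or CDR3 lengths outside the configured range.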


SLURM Job Example

Activate the redcea environment before submitting the job.

Full pipeline

#!/bin/sh
#SBATCH --job-name=redcea
#SBATCH --cpus-per-task=48
#SBATCH --mem=128gb
#SBATCH --time=08:00:00
#SBATCH --output=redcea_run.%j.log

redcea \
  -is case.tsv \
  -ib control.tsv \
  -c TRB \
  -o ./results \
  -np 48
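
Because -np should match the number of CPUs requested from SLURM, it can help to generate the job script from a single value. The sketch below renders a script with the same directives as the example above. The helper name render_sbatch is hypothetical, and the resource defaults are copied from that example rather than being recommendations.

```python
import shlex


def render_sbatch(command, cpus, mem="128gb", time="08:00:00", job_name="redcea"):
    """Render an sbatch script whose -np value always matches --cpus-per-task."""
    lines = [
        "#!/bin/sh",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH --time={time}",
        f"#SBATCH --output={job_name}_run.%j.log",
        "",
        # The worker count is appended here so it cannot drift from the directive.
        shlex.join([*command, "-np", str(cpus)]),
    ]
    return "\n".join(lines) + "\n"


script = render_sbatch(
    ["redcea", "-is", "case.tsv", "-ib", "control.tsv", "-c", "TRB", "-o", "./results"],
    cpus=48,
)
print(script)
```

The same helper works for an embedding-only job by passing a tcremp-run command list instead.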

Embedding only

tcremp-run \
  --input case.tsv \
  --output ./results \
  --chain TRB \
  -np 32

Arguments

| Short | Long | Required | Default | Description |
|---|---|---|---|---|
| -is | --sample | Yes | none | Path to the sample repertoire table |
| -ib | --background | Yes | none | Path to the background repertoire table |
| -o | --output | Yes | none | Output directory |
| -e | --prefix | No | input filename | Output prefix |
| -x | --index-col | No | none | Optional input ID column to preserve in outputs |
| -c | --chain | Yes | none | TRA, TRB, or TRA_TRB |
| -p | --prototypes-path | No | package defaults | Path to a user-supplied prototypes file |
| -n | --n-prototypes | No | all available | Number of prototypes used for embedding |
| none | --sample-random-prototypes | No | False | Sample prototypes randomly |
| -nc | --n-clonotypes | No | all available | Number of clonotypes to process |
| none | --sample-random-clonotypes | No | False | Sample clonotypes randomly |
| -s | --species | No | HomoSapiens | Species for V/J gene alignment |
| -u | --unique-clonotypes | No | False | Use only unique clonotypes |
| -r | --random-seed | No | 42 | Random seed |
| -np | --nproc | No | 1 | Number of worker processes |
| -llen | --lower-len-cdr3 | No | 5 | Minimum CDR3 length |
| -hlen | --higher-len-cdr3 | No | 30 | Maximum CDR3 length |
| -m | --metrics | No | dissimilarity | TCRemP metric mode |
| -d | --save-dists | No | True | Save TCRemP distances |
| -cl | --cluster | No | True | Run clustering in embedding workflow |
| -se | --sample-embedding | No | none | Path to precomputed sample embeddings |
| -be | --background-embedding | No | none | Path to precomputed background embeddings |
| none | --cluster-algo | No | vdbscan | Clustering backend: vdbscan, dbscan, leiden, hierarchical_leiden, or leiden_dbscan |
| none | --n-bg-points | No | all available | Limit background clonotypes to first N entries |
| -npc | --cluster-pc-components | No | 50 | Number of PCA components before clustering |
| -ms | --cluster-min-samples | No | 3 | Core-point threshold for clustering |
| -kn | --k-neighbors | No | 4 | Number of neighbors in the KNN graph |
| -ekn | --eps-k-neighbors | No | 4 | K-th neighbor used for eps estimation in vdbscan and dbscan |
| none | --leiden-resolution | No | 1.0 | Leiden resolution parameter |
| none | --leiden-sub-resolution | No | 1.0 | Subclustering resolution for hierarchical_leiden |
| none | --eps-estimation-based-on | No | sample | Estimate eps from sample, background, or all |
| none | --vdbscan-sym-rule | No | asymmetric | Symmetrization rule: asymmetric, min, or max |

Reference

Vlasova et al., RedCEA: repertoire embeddings denoising clustering enrichment analysis, 2025, in preparation.