RedCEA: Repertoire Embeddings Denoising Clustering Enrichment Analysis
April 24, 2026
RedCEA is a pipeline for comparing immune repertoires using prototype-based TCR embeddings. It builds on the original TCRemP embedding method and adds denoising, clustering, and enrichment analysis for case/background repertoire comparisons.
This repository contains command-line tools for:
- computing prototype-based embeddings for a repertoire with tcremp-run
- clustering embeddings with tcremp-cluster
- running the end-to-end comparison pipeline with redcea
Installation
Prerequisites
Prepare a clean Linux server with:
- git
- conda, such as Miniconda or Mambaforge
- internet access for Python package installation
- access to GitHub, because the dependency mir is installed from a git URL
Recommended Python version: 3.11.
Create the environment
git clone https://gitlab.aldan3.itm-rsmu.ru/isagroup/redcea.git
cd redcea
conda create -n redcea python=3.11 -y
conda activate redcea
python -m pip install --upgrade pip setuptools wheel
python -m pip install -e .[test]
This installs:
- the redcea package in editable mode
- the RedCEA CLI entry points redcea and tcrempnet
- the tcremp dependency, which provides tcremp-run and tcremp-cluster
- test dependencies including pytest
Notes:
- mir is installed from https://github.com/antigenomics/mirpy.git
- default clustering for redcea is vdbscan
- optional Leiden-based clustering requires an extra dependency
Optional: install Leiden support
If you plan to run redcea with --cluster-algo set to leiden, hierarchical_leiden, or leiden_dbscan, install the optional Leiden dependency:
python -m pip install .[leiden]
If this optional install fails, you can still run the default vdbscan pipeline.
Clustering modes
redcea supports several clustering backends:
- vdbscan: default RedCEA mode with per-group eps estimation on the joint sample/background graph
- dbscan: legacy TCRempNet mode on reduced embeddings, with pre-filtering of points with d1 > eps before running plain DBSCAN
- leiden: graph clustering on the joint KNN graph
- hierarchical_leiden: two-stage Leiden clustering
- leiden_dbscan: Leiden followed by per-cluster DBSCAN refinement
Use dbscan if you need behavior closer to historical TCRempNet runs and want the old epsilon-based noise pre-filter back.
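The pre-filter used by the dbscan mode can be pictured with a small sketch. It assumes that d1 means the distance from a point to its nearest neighbour and that eps has already been estimated; both are readings of the description above, not the package's actual implementation. The helper name prefilter_by_nearest_neighbor is hypothetical.

```python
import math


def prefilter_by_nearest_neighbor(points, eps):
    """Return indices of points whose nearest-neighbour distance d1 is <= eps.

    Assumption: d1 is the Euclidean distance to the closest other point.
    Brute force O(n^2), for illustration only.
    """
    kept = []
    for i, p in enumerate(points):
        d1 = min(
            (math.dist(p, q) for j, q in enumerate(points) if j != i),
            default=math.inf,  # a lone point has no neighbour and is dropped
        )
        if d1 <= eps:
            kept.append(i)
    return kept


# Toy coordinates: three close points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
kept = prefilter_by_nearest_neighbor(points, eps=0.5)
print(kept)  # indices that would be passed on to plain DBSCAN
```

In this toy input the isolated point is removed before clustering, which is the effect the "epsilon-based noise pre-filter" wording suggests.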
Verify the installation
Run the following commands in the activated environment:
python -c "import redcea, tcremp, mir; print('imports: OK')"
redcea --help
tcremp-run --help
pytest -q
Expected result:
- imports succeed without ModuleNotFoundError
- CLI help is printed for the requested commands
- tests pass
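The same checks can be scripted so that a missing piece is reported by name. The sketch below uses only the Python standard library; the function name check_install is hypothetical, and the module and command names are the ones listed in this README.

```python
import importlib.util
import shutil


def check_install(modules, commands):
    """Map each module and command name to True if it can be found."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module is not importable
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in commands:
        # shutil.which looks the executable up on PATH
        report[f"command:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    report = check_install(
        modules=["redcea", "tcremp", "mir"],
        commands=["redcea", "tcremp-run", "tcremp-cluster", "pytest"],
    )
    for item, ok in report.items():
        print(f"{item}: {'OK' if ok else 'MISSING'}")
```

This only confirms that the modules and entry points are discoverable in the active environment; it does not run them.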
Important limitation:
- the test suite validates parser logic, helper functions, and a mocked pipeline smoke test
- it does not replace a real run on a small dataset in your target environment
Recommended post-install smoke check
For a fresh server, use this validation sequence:
- install the package with python -m pip install -e .[test]
- run redcea --help
- run pytest -q
- run one small real dataset through redcea or tcremp-run
- confirm that expected output files are created and the log ends without runtime errors
Running RedCEA
Option 1: two-step execution
If you plan to compare multiple case samples against the same background repertoire, compute the background embeddings once with tcremp-run and reuse them in downstream redcea runs.
Step 1: compute embeddings
tcremp-run \
--input /projects/immunestatus/airr_format/sample.tsv \
--output ./results \
--chain TRB \
-np 48
This produces embedding outputs in ./results.
For large samples, embedding can take hours and requires substantial CPU and memory.
Step 2: run redcea on saved embeddings
redcea \
-is /projects/immunestatus/airr_format/sample.tsv \
-ib /projects/immunestatus/airr_format/background.tsv \
-c TRB \
-o ./results \
-np 4 \
-se ./results/sample_tcremp.parquet \
-be ./results/background_tcremp.parquet
Use this mode when the embedding files already exist and you want to skip recomputation.
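When several case samples share one background, the command lines can be generated in a loop. The following is a minimal sketch, assuming that redcea accepts -be without -se (both are listed as optional in the Arguments table) and computes the missing sample embeddings itself. The file names case_01.tsv and case_02.tsv and the helper build_redcea_cmd are hypothetical.

```python
import shlex
from pathlib import Path


def build_redcea_cmd(sample, background, background_embedding, out_dir,
                     chain="TRB", nproc=4):
    """Build a redcea command line that reuses saved background embeddings."""
    return [
        "redcea",
        "-is", str(sample),
        "-ib", str(background),
        "-c", chain,
        "-o", str(out_dir),
        "-np", str(nproc),
        "-be", str(background_embedding),
    ]


samples = ["case_01.tsv", "case_02.tsv"]  # hypothetical file names
for sample in samples:
    # One output directory per case sample keeps results separate.
    out_dir = Path("results") / Path(sample).stem
    cmd = build_redcea_cmd(sample, "background.tsv",
                           "results/background_tcremp.parquet", out_dir)
    # Print for review; use subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before launching long runs.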
Option 2: end-to-end pipeline
redcea \
-is sample.tsv \
-ib background.tsv \
-c TRB \
-o ./results \
-np 8
In this mode, embeddings for both sample and background are computed automatically if they are not already available.
CLI Tools
| CLI Tool | Description |
|---|---|
| tcremp-run | Computes TCRemP embeddings and optional clustering |
| redcea | Runs embedding, clustering, and enrichment |
| tcremp-cluster | Clusters existing embeddings |
Example: Yellow Fever Dataset
redcea \
--sample /projects/immunestatus/pogorelyy/airr_format/yfv_day_15.txt \
--background /projects/immunestatus/pogorelyy/airr_format/yfv_day_0.txt \
--output /projects/immunestatus/pogorelyy/redcea/yfv_res \
--chain TRB \
--prefix yfv_result \
-np 16
Output Files
Depending on the mode, the pipeline may create:
| File Name | Description |
|---|---|
| *_sample_embeddings.parquet | Sample embeddings produced or reused by redcea |
| *_background_embeddings.parquet | Background embeddings produced or reused by redcea |
| *_tcremp_clusters.tsv | Cluster assignments for both sample and background clonotypes |
| *_summary_tcrempnet.tsv | Per-cluster summary with counts, p-values, FDR, and log fold change |
| *_enriched_clonotypes_tcremp.tsv | Clonotypes from enriched clusters |
| *_enriched_embeddings_tcremp.parquet | Embeddings of enriched clonotypes with cluster metadata |
| *.log | Run log for debugging and runtime tracking |
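As an illustration of how the per-cluster summary might be used downstream, the sketch below lists clusters that pass an FDR cutoff and are more frequent in the sample. It assumes the summary is tab-separated and uses the column names cluster_id, enrichment_fdr_zbinom, and log_fold_change. The 0.05 cutoff, the positive fold-change rule, and the function name enriched_clusters are assumptions for illustration, not the pipeline's own criteria. The rows in the example are toy values, not real results.

```python
import csv
import io


def enriched_clusters(tsv_handle, fdr_threshold=0.05):
    """Return (cluster_id, fdr, lfc) tuples passing the cutoff, lowest FDR first."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    hits = []
    for row in reader:
        fdr = float(row["enrichment_fdr_zbinom"])
        lfc = float(row["log_fold_change"])
        # Keep clusters that are significant and enriched in the sample.
        if fdr < fdr_threshold and lfc > 0:
            hits.append((row["cluster_id"], fdr, lfc))
    return sorted(hits, key=lambda hit: hit[1])


# Toy table standing in for a *_summary_tcrempnet.tsv file.
toy = (
    "cluster_id\tenrichment_fdr_zbinom\tlog_fold_change\n"
    "1\t0.01\t2.0\n"
    "2\t0.5\t3.0\n"
    "3\t0.001\t-1.0\n"
)
for cluster_id, fdr, lfc in enriched_clusters(io.StringIO(toy)):
    print(cluster_id, fdr, lfc)
```

For a real run, open the summary file with open(path, newline="") and pass the handle instead of the in-memory table.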
What to check after a real run
Treat the run as successful only if all of the following are true:
- the output directory exists
- *_sample_embeddings.parquet and *_background_embeddings.parquet exist or were intentionally supplied as inputs
- *_tcremp_clusters.tsv exists
- *_summary_tcrempnet.tsv exists and contains cluster_id, cluster_size, sample, background, enrichment_fdr_zbinom, and log_fold_change
- the log file ends with TCRempNet pipeline completed.
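The checklist above can be automated with a short script. This is a sketch under stated assumptions: the log file is written into the output directory, the summary is tab-separated, and the completion message may be preceded by a timestamp on the same line. The function name check_run is hypothetical.

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"cluster_id", "cluster_size", "sample", "background",
                    "enrichment_fdr_zbinom", "log_fold_change"}
DONE_MARKER = "TCRempNet pipeline completed."


def check_run(out_dir, embeddings_supplied=False):
    """Return a list of problems found in a redcea output directory."""
    out = Path(out_dir)
    if not out.is_dir():
        return [f"output directory missing: {out}"]
    problems = []
    patterns = ["*_tcremp_clusters.tsv", "*_summary_tcrempnet.tsv"]
    if not embeddings_supplied:
        patterns += ["*_sample_embeddings.parquet",
                     "*_background_embeddings.parquet"]
    for pattern in patterns:
        if not any(out.glob(pattern)):
            problems.append(f"no file matching {pattern}")
    # The summary header must contain every required column.
    for summary in out.glob("*_summary_tcrempnet.tsv"):
        with summary.open(newline="") as handle:
            header = next(csv.reader(handle, delimiter="\t"), [])
        missing = REQUIRED_COLUMNS - set(header)
        if missing:
            problems.append(f"{summary.name} lacks columns: {sorted(missing)}")
    # The last non-empty log line must carry the completion message.
    logs = list(out.glob("*.log"))
    if not logs:
        problems.append("no log file found")
    for log in logs:
        lines = [ln.strip() for ln in log.read_text().splitlines() if ln.strip()]
        if not lines or DONE_MARKER not in lines[-1]:
            problems.append(f"{log.name} does not end with the completion message")
    return problems
```

An empty list means every check passed. Pass embeddings_supplied=True when the embeddings were given with -se and -be and are therefore not expected in the output directory.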
Input Expectations
The pipeline expects repertoire tables that can be parsed by the underlying tcremp AIRR-loading utilities.
Before running on a clean server, verify on one small file that:
- the file path is correct and readable by the current user
- the repertoire contains the requested chain: TRA, TRB, or TRA_TRB
- required CDR3 and V/J fields expected by tcremp are present
- the file is not empty after filtering by chain and CDR3 length
If a run fails at startup, first check file format compatibility and chain selection.
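A quick pre-flight check along these lines can be run on a small file before a full job. The sketch below is only an approximation of what the loader does. It assumes a tab-separated table with the AIRR Rearrangement column names junction_aa, v_call, and j_call, and it infers the chain from the V gene prefix. The actual columns and filtering rules are defined by tcremp, so adjust the names as needed. The length bounds mirror the documented defaults of 5 and 30; the function name preflight is hypothetical, and paired TRA_TRB input is not covered.

```python
import csv


def preflight(path, chain="TRB", min_len=5, max_len=30,
              cdr3_col="junction_aa", v_col="v_call", j_col="j_call"):
    """Count rows that survive chain and CDR3-length filtering.

    Column names are an assumption based on the AIRR Rearrangement schema.
    Raises OSError if the file is unreadable, ValueError if columns are missing.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        missing = {cdr3_col, v_col, j_col} - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        kept = 0
        for row in reader:
            cdr3 = row[cdr3_col] or ""
            v_gene = row[v_col] or ""
            # Assumption: the chain is readable from the V gene prefix (e.g. TRBV...).
            if v_gene.startswith(chain) and min_len <= len(cdr3) <= max_len:
                kept += 1
    return kept
```

A return value of 0 indicates that nothing would remain after filtering, which is one of the startup failure causes listed above.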
SLURM Job Example
Activate the redcea environment before submitting the job.
Full pipeline
#!/bin/sh
#SBATCH --job-name=redcea
#SBATCH --cpus-per-task=48
#SBATCH --mem=128gb
#SBATCH --time=08:00:00
#SBATCH --output=redcea_run.%j.log
redcea \
-is case.tsv \
-ib control.tsv \
-c TRB \
-o ./results \
-np 48
Embedding only
tcremp-run \
--input case.tsv \
--output ./results \
--chain TRB \
-np 32
Arguments
| Short | Long | Required | Default | Description |
|---|---|---|---|---|
| -is | --sample | Yes | none | Path to the sample repertoire table |
| -ib | --background | Yes | none | Path to the background repertoire table |
| -o | --output | Yes | none | Output directory |
| -e | --prefix | No | input filename | Output prefix |
| -x | --index-col | No | none | Optional input ID column to preserve in outputs |
| -c | --chain | Yes | none | TRA, TRB, or TRA_TRB |
| -p | --prototypes-path | No | package defaults | Path to a user-supplied prototypes file |
| -n | --n-prototypes | No | all available | Number of prototypes used for embedding |
| none | --sample-random-prototypes | No | False | Sample prototypes randomly |
| -nc | --n-clonotypes | No | all available | Number of clonotypes to process |
| none | --sample-random-clonotypes | No | False | Sample clonotypes randomly |
| -s | --species | No | HomoSapiens | Species for V/J gene alignment |
| -u | --unique-clonotypes | No | False | Use only unique clonotypes |
| -r | --random-seed | No | 42 | Random seed |
| -np | --nproc | No | 1 | Number of worker processes |
| -llen | --lower-len-cdr3 | No | 5 | Minimum CDR3 length |
| -hlen | --higher-len-cdr3 | No | 30 | Maximum CDR3 length |
| -m | --metrics | No | dissimilarity | TCRemP metric mode |
| -d | --save-dists | No | True | Save TCRemP distances |
| -cl | --cluster | No | True | Run clustering in embedding workflow |
| -se | --sample-embedding | No | none | Path to precomputed sample embeddings |
| -be | --background-embedding | No | none | Path to precomputed background embeddings |
| --cluster-algo | --cluster-algo | No | vdbscan | vdbscan, dbscan, leiden, hierarchical_leiden, or leiden_dbscan |
| --n-bg-points | --n-bg-points | No | all available | Limit background clonotypes to first N entries |
| -npc | --cluster-pc-components | No | 50 | Number of PCA components before clustering |
| -ms | --cluster-min-samples | No | 3 | Core-point threshold for clustering |
| -kn | --k-neighbors | No | 4 | Number of neighbors in the KNN graph |
| -ekn | --eps-k-neighbors | No | 4 | K-th neighbor used for eps estimation in vdbscan and dbscan |
| --leiden-resolution | --leiden-resolution | No | 1.0 | Leiden resolution parameter |
| --leiden-sub-resolution | --leiden-sub-resolution | No | 1.0 | Subclustering resolution for hierarchical_leiden |
| --eps-estimation-based-on | --eps-estimation-based-on | No | sample | Estimate eps from sample, background, or all |
| --vdbscan-sym-rule | --vdbscan-sym-rule | No | asymmetric | Symmetrization rule: asymmetric, min, or max |
Reference
Vlasova et al., RedCEA: repertoire embeddings denoising clustering enrichment analysis, 2025, in preparation.