AsEP Dataset
August 12, 2024
Antibody-specific Epitope Prediction (AsEP) Dataset. This dataset is used in the manuscript AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction (submitted to NeurIPS 2024 Datasets and Benchmarks).
The raw dataset can be downloaded from Zenodo.
Structure viewer
We provide a 3D viewer for antibody-antigen interface visualization. Check it out at AsEP
Dataset Python Interface (asep)
To use the Python interface, install the asep package. It provides the following functionality:
- Dataset interface (see below)
- Code for loading pre-constructed graphs for antibody-antigen complexes in the dataset
- Code for constructing the neural network proposed in our manuscript, which uses Protein Language Models (PLMs) for node embeddings and Graph Neural Networks (GNNs) for graph representation
- Training and evaluation scripts
Installation
devcontainer
We provide a devcontainer configuration for Visual Studio Code in the .devcontainer directory.
We recommend using the devcontainer for development.
The documentation for using devcontainers in Visual Studio Code is available here.
conda environment
# enable conda (init zsh if you are using zsh, or init bash etc.)
conda init zsh
# you can also use `make` to prepare the conda environment
make setup-gpu-env
# if you don't have a GPU, then run
# make setup-cpu-env
# install other dependencies
make install-dependencies
This requires make; run sudo apt install make to install it.
This will do the following:
- create a conda environment named walle
- install the required packages
- install the asep package in editable mode
Download dataset
Apart from the Zenodo link, we also provide console scripts to download the dataset. You can download the dataset by running the following command:
download-asep /path/to/directory AsEP
/path/to/directory is the directory where you want to save the dataset. AsEP is the name of the dataset; by default, it is AsEP.
Data Loader
The antibody-antigen complexes are provided as 2D graph pairs. We provide two types of node features, one-hot encoding and pre-calculated embeddings with AntiBERTy and ESM2.
from asep.data.asepv1_dataset import AsEPv1Dataset, EmbeddingConfig
# one-hot encoding
config = EmbeddingConfig(node_feat_type="one-hot")
asepv1_dataset = AsEPv1Dataset(
    root="/path/to/asep/download/folder",  # replace with the path to the parent folder of downloaded AsEP
    name="AsEP",
    embedding_config=config,
)
# pre-calculated embeddings with AntiBERTy (via igfold) and ESM2
config = EmbeddingConfig(
    node_feat_type='pre_cal',
    ab={"embedding_model": "igfold"},  # change this to "esm2" for ESM2 embeddings
    ag={"embedding_model": "esm2"},
)
asepv1_dataset = AsEPv1Dataset(
    root="/path/to/asep/download/folder",  # replace with the path to the parent folder of downloaded AsEP
    name="AsEP",
    embedding_config=config,
)
# get i-th graph pair and node labels
i = 0
graph_pair = asepv1_dataset[i]
node_labels_b = graph_pair.y_b # antibody graph node labels (1 => interface nodes)
node_labels_g = graph_pair.y_g # antigen graph node labels (1 => interface nodes)
# bipartite graph edges
# bipartite graph edge indices between the antibody and antigen graphs, shape (2, E);
# row 0 holds antibody node indices, row 1 holds antigen node indices
edge_index_bg = graph_pair.edge_index_bg
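The (2, E) layout of edge_index_bg can be illustrated with a small toy example. The sketch below uses plain Python lists as a stand-in for the tensor and assumes that interface labels correspond to the endpoints of the bipartite edges; that relationship, the helper `nodes_to_labels`, and the graph sizes are assumptions for illustration, not something the package documentation states.

```python
# Toy stand-in for graph_pair.edge_index_bg with shape (2, E):
# row 0 = antibody node indices, row 1 = antigen node indices.
edge_index_bg = [
    [0, 0, 2, 5],  # antibody nodes
    [3, 7, 7, 1],  # antigen nodes
]


def nodes_to_labels(node_indices, num_nodes):
    """Mark every node that appears in node_indices with label 1 (hypothetical helper)."""
    labels = [0] * num_nodes
    for idx in node_indices:
        labels[idx] = 1
    return labels


# Hypothetical graph sizes for this toy example
num_ab_nodes, num_ag_nodes = 6, 8
y_b = nodes_to_labels(edge_index_bg[0], num_ab_nodes)
y_g = nodes_to_labels(edge_index_bg[1], num_ag_nodes)
print(y_b)  # [1, 0, 1, 0, 0, 1]
print(y_g)  # [0, 1, 0, 1, 0, 0, 0, 1]
```

With real data, the same indexing applies to the tensor rows, e.g. `edge_index_bg[0]` for the antibody side.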
The graph pair object graph_pair is a PairData (inherited from torch_geometric.data.Data) object, which contains the following attributes:
- x_b, x_g: node features of the antibody and antigen, respectively.
  - if one-hot, then x_b and x_g are one-hot encodings of the amino acid residues, shape (N, 20)
  - if pre_cal, then x_b and x_g are embedded with AntiBERTy and ESM2 esm2_t12_35M_UR50D, shape (N, 512) and (N, 480), respectively
- edge_index_b, edge_index_g: edge indices of the antibody and antigen graphs, respectively, shape (2, E)
- edge_index_bg: bipartite graph edge indices between the antibody and antigen graphs, shape (2, E)
- y_b, y_g: node labels for the antibody and antigen graphs, respectively, shape (N,)
  - 1 indicates interface residues
  - 0 indicates non-interface residues
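To make the one-hot feature shape concrete, here is a minimal sketch that encodes a residue sequence into an (N, 20) matrix using plain Python. The alphabet ordering and the `one_hot_encode` helper are assumptions for illustration; the ordering used in the pre-constructed graphs may differ.

```python
# Hypothetical alphabet order (the 20 standard amino acids, sorted by one-letter code);
# the ordering used by the asep package is not stated here and may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an N x 20 list of lists with a single 1 per row."""
    features = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[AA_TO_INDEX[residue]] = 1  # raises KeyError for non-standard residues
        features.append(row)
    return features


x = one_hot_encode("EVQLV")
print(len(x), len(x[0]))  # number of residues, feature dimension
```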
Data Split
# split_method either "epitope_ratio" or "epitope_group"
split_idx = asepv1_dataset.get_idx_split(split_method="epitope_ratio")
train_set = asepv1_dataset[split_idx['train']]
valid_set = asepv1_dataset[split_idx['valid']]
test_set = asepv1_dataset[split_idx['test']]
print(f"{len(asepv1_dataset)=}") # number of graph pairs
print(f"{len(train_set)=}") # number of training graph pairs
print(f"{len(valid_set)=}") # validation
print(f"{len(test_set)=}") # testing
# len(asepv1_dataset)=1723
# len(train_set)=1383
# len(valid_set)=170
# len(test_set)=170
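The reported sizes add up (1383 + 170 + 170 = 1723), which suggests the three subsets partition the dataset. A quick sanity check along those lines might look like the sketch below. It uses a toy split built from plain lists; `check_split` is a hypothetical helper, and with the real `split_idx` you would likely need to convert index tensors with `.tolist()` first.

```python
def check_split(split_idx, num_items):
    """Return True if train/valid/test are pairwise disjoint and cover all indices."""
    parts = [set(split_idx[key]) for key in ("train", "valid", "test")]
    total = sum(len(p) for p in parts)
    union = set().union(*parts)
    # The parts are disjoint exactly when no index is counted twice
    return total == len(union) == num_items and union == set(range(num_items))


# Toy split with the same sizes as reported above (1383 / 170 / 170);
# the actual index assignment comes from get_idx_split.
toy_split = {
    "train": list(range(0, 1383)),
    "valid": list(range(1383, 1553)),
    "test": list(range(1553, 1723)),
}
print(check_split(toy_split, 1723))  # True
```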
Evaluation
We provide an evaluator to evaluate a model's performance on the AsEPv1 dataset. The input dictionary contains:
- y_pred: torch.Tensor, predicted node labels, shape (N,)
- y_true: torch.Tensor, ground truth node labels, shape (N,)
import torch
from asep.data.asepv1_dataset import AsEPv1Evaluator
evaluator = AsEPv1Evaluator()
# example
torch.manual_seed(0)
y_pred = torch.rand(1000)
y_true = torch.randint(0, 2, (1000,))
input_dict = {'y_pred': y_pred, 'y_true': y_true}
result_dict = evaluator.eval(input_dict)
print(result_dict) # got {'auc-prc': tensor(0.5565)}
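For intuition about the auc-prc value, the area under the precision-recall curve is often estimated by average precision. The sketch below is a plain-Python version of that estimator, assuming binary labels and no tied scores. It is not the evaluator's implementation, which is not shown here and may use a different estimator, so values can differ.

```python
def average_precision(y_true, y_score):
    """Average precision: mean of precision@k taken at the rank of each positive."""
    # Rank predictions from highest to lowest score
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    num_pos = sum(y_true)
    if num_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / num_pos


y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
print(average_precision(y_true, y_score))  # (1/1 + 2/3) / 2
```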
Benchmark Performance
Benchmark performance of several deep learning models on the AsEP dataset under two data split settings: epitope ratio and epitope group.
| Methods | Publication | Code/Repository | Antibody-specific | Structure | PLM | Graph |
|---|---|---|---|---|---|---|
| WALLE | Under review | Here | ✅ | ✅ | ✅ | ✅ |
| EpiPred | Publication | Code | ✅ | ✅ | ✕ | ✅ |
| ESMFold | Publication | GitHub | ✅ | ✕ | ✅ | ✕ |
| MaSIF-site | Publication | GitHub | ✕ | ✅ | ✕ | ✅ |
| ESMBind | Publication | HuggingFace | ✕ | ✕ | ✅ | ✕ |
Epitope Ratio
| Algorithm | MCC | Precision | Recall | AUCROC | F1 |
|---|---|---|---|---|---|
| WALLE | 0.210 (0.020) | 0.235 (0.018) | 0.422 (0.028) | 0.635 (0.013) | 0.258 (0.018) |
| EpiPred | 0.029 (0.018) | 0.122 (0.014) | 0.180 (0.019) | — | 0.142 (0.016) |
| ESMFold | 0.028 (0.010) | 0.137 (0.019) | 0.043 (0.006) | — | 0.060 (0.008) |
| ESMBind | 0.016 (0.008) | 0.106 (0.012) | 0.121 (0.014) | 0.506 (0.004) | 0.090 (0.009) |
| MaSIF-site | 0.037 (0.012) | 0.125 (0.015) | 0.183 (0.017) | — | 0.114 (0.011) |
Values in parentheses are standard errors.
Epitope Group
| Algorithm | MCC | Precision | Recall | AUCROC | F1 |
|---|---|---|---|---|---|
| WALLE | 0.077 (0.015) | 0.143 (0.017) | 0.266 (0.025) | 0.544 (0.010) | 0.145 (0.014) |
| EpiPred | -0.006 (0.015) | 0.089 (0.011) | 0.158 (0.019) | — | 0.112 (0.014) |
| ESMFold | 0.018 (0.010) | 0.113 (0.019) | 0.034 (0.007) | — | 0.046 (0.009) |
| ESMBind | 0.002 (0.008) | 0.082 (0.011) | 0.076 (0.011) | 0.500 (0.004) | 0.064 (0.008) |
| MaSIF-site | 0.046 (0.014) | 0.164 (0.020) | 0.174 (0.015) | — | 0.128 (0.012) |
Values in parentheses are standard errors.
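For reference, the threshold-based metrics in the tables (MCC, precision, recall, F1) can all be derived from the confusion counts of binary node labels. The sketch below uses the standard definitions on a single toy example; the `binary_metrics` helper is hypothetical, and how the reported values are aggregated across complexes is not described here.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and MCC from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Fall back to 0.0 when a denominator is zero (a common convention)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


# Toy example: 2 true interface residues, 1 of them predicted, plus 1 false positive
metrics = binary_metrics([1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0])
print(metrics)  # precision 0.5, recall 0.5, f1 0.5, mcc 0.25
```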
