Graph-sc

December 1, 2021

This repository contains the PyTorch implementation of the paper "GNN-based embedding for clustering scRNA-seq data" by Madalina Ciortan, under the supervision of Matthieu Defrance (https://doi.org/10.1093/bioinformatics/btab787).

We propose graph-sc, a method that models scRNA-seq data as a graph and processes it with a graph autoencoder network to create representations (embeddings) for each cell. The resulting embeddings are clustered with a general clustering algorithm (e.g. KMeans, Leiden) to produce cell class assignments.
An extensive experimental study was performed on 24 simulated and 15 real-world scRNA-seq datasets. graph-sc was compared with 11 competing state-of-the-art techniques on 4 clustering scores, reflecting both external and internal clustering performance. The results indicate that, although no method is consistently best across all analyzed datasets, graph-sc compares favorably with the competing techniques across all types of datasets. A large ablation study evaluates numerous strategies for creating the input graph, the graph autoencoder network and the clustering phase. The proposed method is stable across consecutive runs, robust to input down-sampling, generally insensitive to changes in the network architecture or training parameters, and more computationally efficient than competing methods based on neural networks. Moreover, modeling the data as a graph provides increased flexibility to define custom features characterizing the genes, the cells and their interactions, as well as the possibility to enrich the graph with external data (e.g. gene correlations).
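To make the idea of modeling expression data as a graph more concrete, here is a minimal sketch. It is not the repository's implementation (see train.py and model.py for that). It assumes a bipartite cell-gene graph in which each expressed (cell, gene) pair becomes one edge weighted by the log-scaled count. The names build_cell_gene_edges and toy_counts are hypothetical and used only for illustration.

```python
import numpy as np


def build_cell_gene_edges(counts):
    """Return (cell_nodes, gene_nodes, weights) for a bipartite cell-gene graph.

    counts: array of shape (n_cells, n_genes) with non-negative expression values.
    Cells are numbered 0..n_cells-1 and genes n_cells..n_cells+n_genes-1,
    so both node types share a single index space.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells, _ = counts.shape
    # Assumption: one edge per non-zero entry of the count matrix.
    cell_idx, gene_idx = np.nonzero(counts)
    # Assumption: edge weight is the log-scaled expression value.
    weights = np.log1p(counts[cell_idx, gene_idx])
    return cell_idx, gene_idx + n_cells, weights


# Toy matrix: 2 cells x 3 genes.
toy_counts = np.array([[3, 0, 1],
                       [0, 2, 0]])
src, dst, w = build_cell_gene_edges(toy_counts)
print(list(zip(src.tolist(), dst.tolist())))  # [(0, 2), (0, 4), (1, 3)]
```

An edge list of this kind can be handed to a graph library to build the input of a graph autoencoder. Extra node or edge features, such as gene correlations, could be attached in the same way.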

Overview of the repository

notebooks folder contains all Jupyter notebooks needed to run the project, as detailed below.
others folder contains the code to reproduce all experiments with scanpy, sczi, scDeepCluster
R folder contains the scripts to generate the simulated data in folder R/simulated_data (both balanced and imbalanced)
output contains model dumps and the results of running all experiments, needed to reproduce the plots
docker contains the Dockerfile to create the image used to run all python experiments
real_data contains the biological scRNA-seq data, downloaded from scDeepCluster, as detailed below
train.py contains the main functionalities for training and evaluating the model results
model.py contains the network definition

Overview of notebooks

Main.ipynb is the main entry point and contains code snippets to train the model on scRNA-seq data
Benchmark_real_data, Benchmark_simulated_data contain the code to reproduce all experiments on graph-sc
Plots_simulated_data, Plots_real_scRNAseq contain the code to reproduce all figures
Grid_search* comprise all ablation studies on network architecture, learning rate, data augmentation strategies, gene selection strategy

Environment Setup

We use Docker containers to make it easier to reproduce the paper's results.

Python environment

The Docker image for the Python environment can be built by running the following:

cd docker  
docker build -t graph-sc .

The image has been created for GPU usage. To run it on CPU, replace the base image "pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime" in the Dockerfile with a CPU version.

The command above creates a Docker image tagged graph-sc. Assuming the project has been cloned locally into a parent folder named notebooks, a container can be started from this image with:

docker run -it --runtime=nvidia -v ~/notebooks:/workspace/notebooks -p 8888:8888 graph-sc

This starts a Jupyter notebook server, which can be accessed at http://localhost:8888/tree/notebooks

R environment

We followed the instructions in this tutorial to create an R docker container that comes with most single-cell related libraries already installed. To launch it on port 8787, execute the following:

docker run -d -p 8787:8787 -e USER='rstudio' -e PASSWORD='rstudioSC' -e ROOT=TRUE -v ~/notebooks/deep_clustering:/home/rstudio/projects vbarrerab/rstudio_singlecell

Data

The simulated datasets can be downloaded from this Google Drive link (~400MB). Alternatively, they can be generated by running R/all_balanced.r or R/all_imbalanced.R.

The single-cell data has been collected from the scDeepCluster repository and the scziDesk repository. It should be saved to the real_data folder.

Reproducing the competing methods' results

The methods implemented in R were benchmarked with the script made available by scziDesk, which can be found in R/run_methods.r. It has been extended to also compute the silhouette and Calinski-Harabasz scores.
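For readers who want to compute the same two internal scores in Python, the sketch below uses scikit-learn. It is an illustrative analogue, not a port of R/run_methods.r. The synthetic embeddings and the helper name internal_scores are assumptions made for the example, and scikit-learn is assumed to be installed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def internal_scores(embeddings, labels):
    """Compute label-free (internal) clustering scores for a partition."""
    return {
        "silhouette": float(silhouette_score(embeddings, labels)),
        "calinski_harabasz": float(calinski_harabasz_score(embeddings, labels)),
    }


rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for cell embeddings.
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

# Cluster the embeddings, then score the partition without ground-truth labels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
scores = internal_scores(embeddings, labels)
print(scores)
```

Both scores are higher for compact, well-separated clusters. The silhouette score is bounded in [-1, 1], whereas the Calinski-Harabasz score is unbounded, so the latter is best compared between methods on the same dataset.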

The remaining Python methods are available in the others folder.