Personalised ASE Caller (PAC)

July 20, 2022

Author: Anna Saukkonen

See our paper "Highly accurate quantification of allelic gene expression for population and disease genetics" for additional information.

Introduction
Installation and running
Options
Output
Test Dataset

INTRODUCTION

Allele-specific expression (ASE) is the imbalanced expression of the two alleles of a gene. While many genes are expressed equally from both alleles, gene regulatory differences driven by genetic changes (i.e. regulatory variants) frequently cause the two alleles to be expressed at different levels, resulting in allele-specific expression patterns. Detecting ASE events relies on accurate alignment of RNA-sequencing reads, which remains challenging. This pipeline adjusts for computational biases associated with allelic counts. It comprises the following steps:

Local phasing of genetic data using phASER
Creation of parental genomes to align sequencing data to
Re-allocation of multimapping reads using RSEM
Selection of the best mapping for each read across the two parental genomes
Output of haplotype- and site-level allelic counts
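To illustrate the idea behind selecting the best mapping for each read across the two parental genomes, here is a minimal conceptual sketch. It is not PAC's actual implementation: the `Alignment` record, its `score` field, the "maternal"/"paternal" labels, and the tie-handling rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record: one read aligned to one parental genome."""
    read_id: str
    genome: str    # "maternal" or "paternal" (labels assumed for illustration)
    score: float   # higher means a better alignment (assumed convention)


def pick_best(maternal: Dict[str, Alignment],
              paternal: Dict[str, Alignment]) -> Dict[str, Optional[Alignment]]:
    """For each read, keep the alignment with the higher score.

    Reads aligned to only one genome keep that alignment. Ties map to None,
    meaning "no allelic preference"; a real pipeline may handle ties differently.
    """
    best: Dict[str, Optional[Alignment]] = {}
    for read_id in maternal.keys() | paternal.keys():
        m = maternal.get(read_id)
        p = paternal.get(read_id)
        if m is None or p is None:
            best[read_id] = m if m is not None else p
        elif m.score == p.score:
            best[read_id] = None
        else:
            best[read_id] = m if m.score > p.score else p
    return best
```

Reads whose best alignment comes from one parental genome would then contribute to that haplotype's allelic count.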

INSTALLATION AND RUNNING

1. Download Nextflow

curl -fsSL get.nextflow.io | bash

Make sure you have Java v8+:

java -version

2. Install either Docker or Singularity if your cluster does not already provide one of them

3. Run PAC with one of the following commands:

Either run it directly from GitHub:

path_to/nextflow run https://github.com/anna-saukkonen/PAC -r main --genome_version GRCh37/38 --reads "path_to_reads_{1,2}.fq.gz" --variants "path_to_variants" --id ID -profile docker/singularity

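The -r option specifies the branch. In the command, replace GRCh37/38 with either GRCh37 or GRCh38, and docker/singularity with either docker or singularity.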

Or download the repository and run it locally:

path_to/nextflow run PAC/main.nf --genome_version GRCh37/38 --reads "path_to_reads_{1,2}.fq.gz" --variants "path_to_variants" --id ID -profile docker/singularity

OPTIONS

Required

--genome_version: GRCh37 or GRCh38

--reads: "path_to_reads_{1,2}.fq.gz"

Both read files must be saved in the same directory and named in the format path_to_reads_1.fq.gz and path_to_reads_2.fq.gz
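Before launching the pipeline, it can help to confirm that the `{1,2}` pattern resolves to two existing files. The sketch below expands a simple brace pattern in Python. The helper names are hypothetical, and the expansion handles only a single, non-nested brace group, which is all the `--reads` format needs.

```python
import os
import re
from typing import List


def expand_braces(pattern: str) -> List[str]:
    """Expand one non-nested {a,b,...} group; return [pattern] if there is none."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [head + option + tail for option in match.group(1).split(",")]


def missing_read_files(pattern: str) -> List[str]:
    """Return the expanded paths that do not exist on disk."""
    return [path for path in expand_braces(pattern) if not os.path.isfile(path)]
```

An empty list from `missing_read_files("path_to_reads_{1,2}.fq.gz")` means both files of the pair were found.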

--variants: "path_to_variants.vcf.gz"

The VCF file needs to be phased

-profile: docker or singularity

--id: "name_of_sample"

This must match the sample name in the VCF file
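The two VCF requirements above (a phased file, and an `--id` that matches a sample in it) can be checked up front. The following is a minimal sketch, not part of PAC: the function name `count_unphased` and the `max_records` limit are hypothetical. It relies only on the VCF convention that phased genotypes use `|` and unphased genotypes use `/` as the allele separator.

```python
import gzip


def count_unphased(vcf_path: str, sample_id: str, max_records: int = 1000) -> int:
    """Count unphased genotypes for sample_id in the first max_records records.

    Raises ValueError if the sample is absent from the #CHROM header line.
    """
    column = None
    unphased = 0
    seen = 0
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample names follow the 9 fixed columns
                if sample_id not in samples:
                    raise ValueError(f"sample {sample_id!r} not found in VCF header")
                column = 9 + samples.index(sample_id)
                continue
            if column is None:
                raise ValueError("record found before the #CHROM header line")
            keys = fields[8].split(":")
            if "GT" not in keys:
                continue
            genotype = fields[column].split(":")[keys.index("GT")]
            if "/" in genotype:
                unphased += 1
            seen += 1
            if seen >= max_records:
                break
    if column is None:
        raise ValueError("no #CHROM header line found")
    return unphased
```

A return value of 0 suggests the inspected genotypes for that sample are phased; any other value points to records worth checking before running the pipeline.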

Optional

-N: name@email_address.com (To receive email when the pipeline is finished)

-outdir: "name_of_results_file_directory"

(default: "/pac_results")

-cpus: number

(default: 10; we recommend at least 10 for speed)

Depending on the size of the input files, you might need up to 128000 MB of memory (minimum 64000 MB)

OUTPUT

PAC generates the following output files:

* Haplotype level ASE calls:

ID_gene_level_ae.txt

Haplotype level ASE results columns	Description
contig	chromosome
start	gene start position
stop	gene end position
name	gene name
aCount	haplotype a coverage
bCount	haplotype b coverage
totalCount	total coverage
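As a usage illustration, the haplotype-level file can be loaded and summarised with a few lines of Python. This sketch assumes the file is tab-delimited with a header row using the column names listed above. The function name `haplotype_ratios`, the derived `aRatio` value, and the `min_total` coverage filter are hypothetical and not part of PAC's output.

```python
import csv
from typing import Dict, Iterable, List


def haplotype_ratios(lines: Iterable[str], min_total: float = 10) -> List[Dict[str, object]]:
    """Return per-gene haplotype a ratios (aCount / totalCount).

    Genes with totalCount below min_total are skipped, since ratios from
    very low coverage are noisy (the threshold is an arbitrary example value).
    """
    rows: List[Dict[str, object]] = []
    for record in csv.DictReader(lines, delimiter="\t"):
        total = float(record["totalCount"])
        if total < min_total:
            continue
        a_count = float(record["aCount"])
        rows.append({
            "name": record["name"],
            "aCount": a_count,
            "bCount": float(record["bCount"]),
            "aRatio": a_count / total,
        })
    return rows
```

For example, `with open("ID_gene_level_ae.txt") as handle: rows = haplotype_ratios(handle)`; a ratio near 0.5 indicates balanced expression of the two haplotypes.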

* Single nucleotide level ASE calls from PAC:

results_2genomes_ID.RSEM.STAR.SOFT.NOTRIM_baq.txt
results_2genomes_ID.RSEM.STAR.SOFT.NOTRIM.txt

* Single nucleotide level ASE calls based on standard single genome mapping (for comparison):

results_1genome_ID.SOFT.NOTRIM_baq.txt
results_1genome_ID.SOFT.NOTRIM.txt

Single nucleotide level ASE results columns	Description
Chr	chromosome
Pos	position along chromosome
RefAl	reference allele
AltAl	alternative allele
MapRef	reference allele coverage
MapAlt	alternative allele coverage
MapRatio	reference allele ratio
Mapcov	total coverage at the site
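Because PAC writes both two-genome and single-genome results, one natural check is to compare the reference allele ratio at sites present in both files. The sketch below assumes both files are tab-delimited with a header row using the column names listed above. The function names are hypothetical.

```python
import csv
from typing import Dict, Iterable, Tuple

Site = Tuple[str, int]  # (Chr, Pos)


def read_ratios(lines: Iterable[str]) -> Dict[Site, float]:
    """Map each site to its MapRatio (reference allele ratio)."""
    return {
        (record["Chr"], int(record["Pos"])): float(record["MapRatio"])
        for record in csv.DictReader(lines, delimiter="\t")
    }


def ratio_shift(two_genome: Dict[Site, float],
                one_genome: Dict[Site, float]) -> Dict[Site, float]:
    """Two-genome ratio minus single-genome ratio, for sites found in both."""
    shared = two_genome.keys() & one_genome.keys()
    return {site: two_genome[site] - one_genome[site] for site in shared}
```

A negative shift at a site means the two-genome mapping reports a lower reference allele ratio than the single-genome mapping does at that site.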

TEST DATASET

To test PAC on a smaller dataset:

load java

load singularity

git clone https://github.com/anna-saukkonen/PAC.git

path_to_nextflow/nextflow run PAC/main.nf --genome_version GRCh37 --reads "PAC/test/NA12890_merged_sample_0.005_{1,2}.fq.gz" --variants "PAC/test/NA12877_output.phased.downsampled.vcf.gz" --id NA12877 -profile singularity

See this folder for output files you should get

 Just use
   __    ___      __
 ||__)  /___\\  /   `
 ||    /     \\ \\__, ,

 man ;)