EPIANN

February 13, 2018 · View on GitHub

Inspired by machine translation models, we develop an attention-based neural network model, EPIANN.

Figure: schematic overview of EPIANN.

Data Augmentation

There are 6 cell lines (celline = GM12878, HUVEC, HeLa-S3, IMR90, K562 and NHEK), and each comes with its own folder. Within each folder there is a single file, celline.csv, which is a renamed copy of

https://github.com/shwhalen/targetfinder/tree/master/paper/targetfinder/celline/output-ep/pairs.csv

Before we actually train our neural network model, we need to generate input data from the genomic coordinates (hg19) of the enhancers and promoters, along with the indicators of EPIs recorded in celline.csv. Data_Augmentation.R encodes an automatic data augmentation pipeline with several parameters, specified in the following table.

| Parameter | Explanation |
| --- | --- |
| `celline` | change it to one of the 6 cell lines; default = `"IMR90"` |
| `folder` | the name of the folder to hold all output files; default = `"aug_50"` |
| `shift_distance` | the step size for sliding the extended region around the enhancer and promoter; default = 50 |
| `enhancer_target_length` | the length of the extended enhancer; default = 3000 |
| `promoter_target_length` | the length of the extended promoter; default = 2000 |
| `positive_scalar` | the augmentation ratio; default = 20 |
| `test_percent` | the fraction of the data held out for testing; default = 0.1 |
| `random_seed` | the random seed used to sample the test data; default = 1 |
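The sliding-window augmentation described above can be sketched as follows. This is a minimal illustration of the idea, not the actual code in Data_Augmentation.R; the function name and the exact shift schedule are assumptions:

```python
def augment_region(start, end, target_length=3000, shift_distance=50, n_copies=20):
    """Return n_copies fixed-length windows around one genomic region.

    The region is first extended symmetrically to target_length, then the
    window is slid left/right in steps of shift_distance (0, +50, -50,
    +100, -100, ...) to create augmented coordinate pairs.
    """
    center = (start + end) // 2
    base_start = center - target_length // 2
    windows = []
    for i in range(n_copies):
        k = (i + 1) // 2                  # shift magnitude: 0, 1, 1, 2, 2, ...
        sign = 1 if i % 2 == 1 else -1    # alternate right/left shifts
        s = base_start + sign * k * shift_distance
        windows.append((s, s + target_length))
    return windows

# An enhancer at hg19-style coordinates [1000000, 1000500):
windows = augment_region(1000000, 1000500)
```

With the defaults, each positive enhancer-promoter pair yields 20 shifted copies (the `positive_scalar` ratio), all of length `enhancer_target_length`.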

You can find the output files produced with the default parameters under the directory IMR90/aug_50/. The following files are currently not available in the GitHub repository because of the size limit (work in progress); they can be regenerated by running Data_Augmentation.R with the default parameters.

IMR90/aug_50/IMR90_enhancer.fasta
IMR90/aug_50/IMR90_promoter.fasta
IMR90/aug_50/imbalanced/IMR90_enhancer.fasta
IMR90/aug_50/imbalanced/IMR90_promoter.fasta
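The FASTA files above hold the network inputs; each sequence is typically converted to a 4-channel one-hot matrix before it reaches the convolution layer. A minimal sketch of such an encoding (the exact scheme used inside EPIANN is an assumption):

```python
def one_hot(seq):
    """One-hot encode a DNA sequence into rows over (A, C, G, T).

    Unknown bases (e.g. 'N') map to an all-zero row.
    """
    table = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0],
             'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1]}
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]

matrix = one_hot("ACGTN")
```

A 3000 bp extended enhancer thus becomes a 3000 x 4 matrix, matching the `enhancer_length` input of the model.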

Train Neural Network Model

Under the directory IMR90/, you can find an example Python script, IMR90_EPIANN.py, with the default settings. The parameters regarding inputs are explained in the following table.

| Parameter | Explanation |
| --- | --- |
| `celline` | change it to one of the 6 cell lines; default = 'IMR90' |
| `file_pre` | change it to the folder containing the augmented data; default = 'aug_50/IMR90' |
| `out_dir` | change it to the folder that will contain the output; default = 'output/IMR90_EPIANN' |
| `script_id` | change it to the current Python script name, to distinguish the outputs of multiple runs; default = 'IMR90_EPIANN' |
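For illustration, the defaults above appear to compose into the augmented file paths listed earlier; the actual string handling inside IMR90_EPIANN.py may differ:

```python
# Hypothetical reconstruction of how the defaults compose into input paths;
# IMR90_EPIANN.py may build these strings differently.
celline = 'IMR90'
file_pre = 'aug_50/' + celline                 # default = 'aug_50/IMR90'
enhancer_fasta = file_pre + '_enhancer.fasta'  # 'aug_50/IMR90_enhancer.fasta'
promoter_fasta = file_pre + '_promoter.fasta'  # 'aug_50/IMR90_promoter.fasta'
```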

The computational graph for the neural network is programmed using TensorFlow. On our setup, a single NVIDIA GTX 1080 or NVIDIA TITAN X with 5 CPU threads trains a single batch in about 6 seconds. All neural network parameters can be altered in the script.

| Neural Network Parameter | Explanation |
| --- | --- |
| `enhancer_length` | the length of the input enhancers; default = 3000 |
| `promoter_length` | the length of the input promoters; default = 2000 |
| `BATCH_SIZE` | half of the actual batch size; default = 32 |
| `num_filters` | the number of convolution filters; default = 256 |
| `e_conv_width` | the convolution filter width; default = 15 |
| `dropout_rate_cnn` | the dropout rate for the convolution layer; default = 0.2 |
| `dropout_rate` | the dropout rate for all layers except the convolution layer; default = 0.2 |
| `pool_width` | the max-pooling size; default = 30 |
| `atten_hyper` | the dimension of the attention-related parameters; default = 32 |
| `dense_neuron_coor` | the dimensions of the fully connected layers for coordinate prediction; default = [128, 64] |
| `inter_dim` | the dimension of the interaction-quantification parameters; default = 1 |
| `topk` | the top-k pooling size; default = 32 |
| `dense_neuron` | the dimension of the fully connected layers; default = 32 |
| `lamb` | the hyperparameter balancing the cross-entropy error and the regression error; default = 10 |
| `num_of_epoch` | the number of epochs; default = 90 |
| `output_step` | the interval, in batches, at which performance on the test set is reported; default = 500 |
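Two of the less standard operations in the table, attention weighting (`atten_hyper`) and top-k pooling (`topk`), can be illustrated in plain Python. This is a simplified sketch of the two ideas, not the actual TensorFlow graph, and the function names are illustrative:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features, scores):
    """Collapse a list of feature vectors into one vector as a
    softmax-weighted sum -- the core of soft attention."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def topk_pool(activations, k=32):
    """Keep only the k largest activations (top-k pooling)."""
    return sorted(activations, reverse=True)[:k]

# Two positions with 2-d features; equal scores reduce to a plain average:
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
kept = topk_pool([0.1, 0.9, 0.5, 0.7], k=2)
```

In the real model the attention scores are learned from the enhancer and promoter representations, so the weights highlight the sequence positions most relevant to the predicted interaction.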

Required Pre-installed Packages

R (3.4.2) Library dependencies

GenomicRanges 1.28.2
BSgenome.Hsapiens.UCSC.hg19.masked 1.3.99

Python (2.7.6) Module dependencies

os, pickle, time (standard library)
tensorflow 1.3.0
numpy 1.13.3
scikit-learn 0.18.1 / 0.19.1
Biopython 1.67