File utils.py - Documentation

January 14, 2019

Methods - Overview

| name | description |
| --- | --- |
| save_model | Save a pysster.Model object. |
| load_model | Load a pysster.Model object. |
| save_data | Save a pysster.Data object. |
| load_data | Load a pysster.Data object. |
| annotate_structures | Annotate secondary structure predictions with structural contexts. |
| predict_structures | Predict secondary structures for RNA sequences. |
| get_performance_report | Get a performance overview of a classifier. |
| plot_roc | Get ROC curves for every class. |
| plot_prec_recall | Get Precision-Recall curves for every class. |
| save_as_meme | Save sequence (or structure) motifs in MEME format. |
| run_tomtom | Compare a MEME file against a database using TomTom. |

save_model

def save_model(model, file_path)

Save a pysster.Model object.

This function creates two files: a pickled version of the pysster.Model object and an HDF5 file containing the actual Keras model. For example, if file_path is 'model', the two files 'model' and 'model.h5' are created.

| parameter | type | description |
| --- | --- | --- |
| model | pysster.Model | A Model object. |
| file_path | str | A file name. |
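Because a saved model consists of two files, it can be useful to check that both are present before copying or loading it. The following is a minimal sketch of the naming rule described above; `model_file_pair` and `missing_model_files` are hypothetical helpers for illustration and are not part of pysster.

```python
import os


def model_file_pair(file_path):
    """Return the two paths that belong to one saved model (hypothetical helper)."""
    # Naming rule from the description above: the pickled object is stored
    # under file_path, the keras model under file_path + ".h5".
    return file_path, file_path + ".h5"


def missing_model_files(file_path):
    """Return the expected files that do not exist on disk."""
    return [path for path in model_file_pair(file_path) if not os.path.isfile(path)]


if __name__ == "__main__":
    print(model_file_pair("model"))
    print(missing_model_files("model"))
```

An empty list from `missing_model_files` means both files are in place.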

load_model

def load_model(file_path)

Load a pysster.Model object.

| parameter | type | description |
| --- | --- | --- |
| file_path | str | A file containing a pickled pysster.Model object (file_path.h5 must also exist, see save_model()). |

| returns | type | description |
| --- | --- | --- |
| model | pysster.Model | A Model object. |

save_data

def save_data(data, file_path)

Save a pysster.Data object.

The object will be pickled to disk.

| parameter | type | description |
| --- | --- | --- |
| data | pysster.Data | A Data object. |
| file_path | str | A file name. |
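For readers unfamiliar with pickling, the sketch below shows a generic pickle round trip using only the standard library. It is not pysster's implementation (which may, for example, compress the file); the helper names are hypothetical and a plain dict stands in for a Data object.

```python
import os
import pickle
import tempfile


def pickle_to_disk(obj, file_path):
    """Serialize an object to file_path (hypothetical helper)."""
    with open(file_path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def unpickle_from_disk(file_path):
    """Load an object back from file_path (hypothetical helper)."""
    # Only unpickle files from a trusted source: loading a pickle can
    # execute arbitrary code.
    with open(file_path, "rb") as handle:
        return pickle.load(handle)


def round_trip(obj):
    """Write obj to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "data.pkl")
        pickle_to_disk(obj, path)
        return unpickle_from_disk(path)


if __name__ == "__main__":
    stand_in = {"sequences": ["CCCCAUAGGGG"], "labels": [[1, 0]]}
    print(round_trip(stand_in) == stand_in)
```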

load_data

def load_data(file_path)

Load a pysster.Data object.

| parameter | type | description |
| --- | --- | --- |
| file_path | str | A file containing a pickled pysster.Data object. |

| returns | type | description |
| --- | --- | --- |
| data | pysster.Data | The Data object loaded from file. |

annotate_structures

def annotate_structures(input_file, output_file)

Annotate secondary structure predictions with structural contexts.

Given dot-bracket strings, this function will annotate every character as either 'H' (hairpin), 'S' (stem), 'I' (internal loop/bulge), 'M' (multi loop), 'F' (5-prime) or 'T' (3-prime). The input file must be a fasta formatted file and each sequence and structure must span a single line:

>header
CCCCAUAGGGG
((((...)))) (-3.3)

This is the default format of e.g. RNAfold. The output file will contain the annotated string:

>header
CCCCAUAGGGG
SSSSHHHSSSS

| parameter | type | description |
| --- | --- | --- |
| input_file | str | A fasta file containing secondary structure predictions. |
| output_file | str | A fasta file with secondary structure annotations. |
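To make the six labels concrete, here is a simplified pure-Python sketch that annotates a single dot-bracket string. It is not the code pysster uses, and `annotate_dot_bracket` is a hypothetical name. Two choices are assumptions of this sketch: unpaired bases between two exterior helices are labelled 'M', and a string without any base pair is labelled 'F' throughout.

```python
def annotate_dot_bracket(structure):
    """Annotate a dot-bracket string with H, S, I, M, F and T (simplified sketch)."""
    n = len(structure)
    partner = [-1] * n  # partner[i] is the index paired with i, or -1 if unpaired
    stack = []
    for i, char in enumerate(structure):
        if char == "(":
            stack.append(i)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position {}".format(i))
            j = stack.pop()
            partner[i], partner[j] = j, i
        elif char != ".":
            raise ValueError("unexpected character {!r}".format(char))
    if stack:
        raise ValueError("unbalanced '(' at position {}".format(stack[-1]))

    labels = ["S" if p != -1 else "" for p in partner]
    paired = [i for i in range(n) if partner[i] != -1]
    if not paired:
        return "F" * n  # assumption: a fully unpaired string is labelled 'F'

    def scan(start, end):
        """Return unpaired positions and the helix count directly inside [start, end)."""
        unpaired, branches, i = [], 0, start
        while i < end:
            if partner[i] == -1:
                unpaired.append(i)
                i += 1
            else:
                branches += 1
                i = partner[i] + 1  # jump over the nested helix
        return unpaired, branches

    # Loops closed by a base pair: classify by the number of enclosed helices.
    for i in paired:
        if partner[i] > i:
            unpaired, branches = scan(i + 1, partner[i])
            code = "H" if branches == 0 else "I" if branches == 1 else "M"
            for k in unpaired:
                labels[k] = code

    # Exterior positions: 5' end, 3' end, or (assumption) 'M' between helices.
    first, last = paired[0], paired[-1]
    for k in scan(0, n)[0]:
        labels[k] = "F" if k < first else "T" if k > last else "M"
    return "".join(labels)


if __name__ == "__main__":
    # Drop the trailing energy term of an RNAfold line before annotating.
    line = "((((...)))) (-3.3)"
    print(annotate_dot_bracket(line.split()[0]))  # SSSSHHHSSSS
```

The sketch does not distinguish bulges from internal loops, which matches the single 'I' label described above.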

predict_structures

def predict_structures(input_file, output_file, num_processes=None, annotate=False)

Predict secondary structures for RNA sequences.

This is a convenience function to get quick RNA secondary structure predictions. The function will try to use the ViennaRNA python bindings or the RNAfold binary to perform predictions. If neither can be found the function raises a RuntimeError. Using the ViennaRNA python bindings is preferred as it is faster.

Entries of the output file look as follows if annotate = False:

>header
CCCCAUAGGGG
((((...)))) (-3.3)

If annotate = True the annotated structure string instead of the dot-bracket string will be printed:

>header
CCCCAUAGGGG
SSSSHHHSSSS

Have a look at the annotate_structures() function for more information about annotated structure strings.

Warning: Due to the way Python works, spinning up additional processes means copying the complete memory of the original process, i.e. if the original process already uses 5 GB of RAM, each additional process will use 5 GB as well.

| parameter | type | description |
| --- | --- | --- |
| input_file | str | A fasta file with RNA sequences. |
| output_file | str | A fasta file with sequences and structures. |
| num_processes | int | The number of parallel processes to use for prediction. (default: number of available cores) |
| annotate | bool | Output the annotated structure string instead of the dot-bracket string. (default: False) |
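The output format shown above is easy to read back with a few lines of Python. The sketch below assumes exactly three lines per entry, as in the examples; `read_structure_records` is a hypothetical helper, not a pysster function. The energy is returned as None when the third line holds an annotated string without an energy term.

```python
import io


def read_structure_records(handle):
    """Yield (header, sequence, structure, energy) from three-line records."""
    lines = [line.strip() for line in handle if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per record")
    for i in range(0, len(lines), 3):
        header, sequence, third = lines[i:i + 3]
        if not header.startswith(">"):
            raise ValueError("record does not start with '>': " + header)
        # The structure is the first token; an optional energy such as
        # "(-3.3)" or "( -3.30)" may follow after a space.
        structure, _, rest = third.partition(" ")
        rest = rest.strip().strip("()").strip()
        energy = float(rest) if rest else None
        yield header[1:], sequence, structure, energy


if __name__ == "__main__":
    example = io.StringIO(">header\nCCCCAUAGGGG\n((((...)))) (-3.3)\n")
    for record in read_structure_records(example):
        print(record)
```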

get_performance_report

def get_performance_report(labels, predictions)

Get a performance overview of a classifier.

The report contains precision, recall, f1-score, ROC-AUC and Precision-Recall-AUC for every class (in a 1 vs. all approach) and weighted averages (weighted by the number of sequences 'n' in each class).

| parameter | type | description |
| --- | --- | --- |
| labels | numpy.ndarray | A binary matrix of shape (num sequences, num classes) containing the true labels. |
| predictions | numpy.ndarray | A matrix of shape (num sequences, num classes) containing predicted probabilities. |

| returns | type | description |
| --- | --- | --- |
| report | str | Summary table of the above mentioned performance measurements. |
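The 1 vs. all measures and the weighting by 'n' can be illustrated with a small dependency-free sketch. It is not pysster's implementation: it uses nested lists instead of numpy arrays, assumes every sequence belongs to exactly one class, and assumes the predicted class is the one with the highest probability. `one_vs_all_report` is a hypothetical name.

```python
def one_vs_all_report(labels, predictions):
    """Return per-class precision/recall/f1 and their n-weighted averages."""
    num_classes = len(labels[0])
    # Assumption: single-label data, predicted class = argmax of each row.
    true_cls = [row.index(max(row)) for row in labels]
    pred_cls = [row.index(max(row)) for row in predictions]
    rows = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(true_cls, pred_cls) if t == c and p == c)
        fp = sum(1 for t, p in zip(true_cls, pred_cls) if t != c and p == c)
        fn = sum(1 for t, p in zip(true_cls, pred_cls) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append({"class": c, "precision": precision, "recall": recall,
                     "f1": f1, "n": tp + fn})
    total = sum(row["n"] for row in rows)
    # Weighted average: each class contributes in proportion to its size n.
    weighted = {key: sum(row[key] * row["n"] for row in rows) / total
                for key in ("precision", "recall", "f1")}
    return rows, weighted


if __name__ == "__main__":
    labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
    predictions = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]]
    rows, weighted = one_vs_all_report(labels, predictions)
    for row in rows:
        print(row)
    print(weighted)
```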

plot_roc

def plot_roc(labels, predictions, file_path)

Get ROC curves for every class.

In the case of more than two classes the comparisons will be performed in a 1 vs. all approach (i.e. you get one curve per class).

| parameter | type | description |
| --- | --- | --- |
| labels | numpy.ndarray | A binary matrix of shape (num sequences, num classes) containing the true labels. |
| predictions | numpy.ndarray | A matrix of shape (num sequences, num classes) containing predicted probabilities. |
| file_path | str | The file the plot should be saved to. |
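As an illustration of what "one curve per class" means, the sketch below computes the ROC points for each class column separately, treating that class as positive and all others as negative. It uses plain lists and does no plotting; the function names are hypothetical and this is not the code pysster runs.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs for one class."""
    num_pos = sum(y_true)
    num_neg = len(y_true) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    # Lower the decision threshold step by step over all observed scores.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / num_neg, tp / num_pos))
    return points


def trapezoid_auc(points):
    """Area under a curve given as (x, y) points with non-decreasing x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def one_vs_all_roc(labels, predictions):
    """Return one list of ROC points per class column."""
    curves = []
    for c in range(len(labels[0])):
        y_true = [row[c] for row in labels]
        scores = [row[c] for row in predictions]
        curves.append(roc_points(y_true, scores))
    return curves


if __name__ == "__main__":
    labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
    predictions = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3]]
    for c, curve in enumerate(one_vs_all_roc(labels, predictions)):
        print(c, curve, trapezoid_auc(curve))
```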

plot_prec_recall

def plot_prec_recall(labels, predictions, file_path)

Get Precision-Recall curves for every class.

In the case of more than two classes the comparisons will be performed in a 1 vs. rest approach (i.e. you get one curve per class).

| parameter | type | description |
| --- | --- | --- |
| labels | numpy.ndarray | A binary matrix of shape (num sequences, num classes) containing the true labels. |
| predictions | numpy.ndarray | A matrix of shape (num sequences, num classes) containing predicted probabilities. |
| file_path | str | The file the plot should be saved to. |
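The points of a Precision-Recall curve for a single class can be computed along the same lines. The sketch below is a plain-Python illustration under the assumption that one class column is treated as positive and everything else as negative; `precision_recall_points` is a hypothetical helper and no plotting is done.

```python
def precision_recall_points(y_true, scores):
    """Return (recall, precision) pairs for one class, from strict to lenient threshold."""
    num_pos = sum(y_true)
    if num_pos == 0:
        raise ValueError("need at least one positive example")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        # tp + fp is at least 1 because the threshold is an observed score.
        points.append((tp / num_pos, tp / (tp + fp)))
    return points


if __name__ == "__main__":
    labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
    predictions = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3]]
    for c in range(len(labels[0])):
        column_true = [row[c] for row in labels]
        column_pred = [row[c] for row in predictions]
        print(c, precision_recall_points(column_true, column_pred))
```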

save_as_meme

def save_as_meme(logos, file_path)

Save sequence (or structure) motifs in MEME format.

| parameter | type | description |
| --- | --- | --- |
| logos | [pysster.Motif] | A list of Motif objects. |
| file_path | str | The name of the output text file. |
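For orientation, the sketch below writes a motif in the minimal MEME text format from a position probability matrix. It only illustrates the file format for a standard RNA alphabet; the exact header, background frequencies and alphabet definition that pysster writes (in particular for structure motifs) may differ. `format_meme` and the example motif are hypothetical.

```python
def format_meme(motifs, alphabet="ACGU"):
    """Format (name, matrix) pairs as minimal MEME text.

    Each matrix has one row per motif position and one column per letter.
    """
    # Assumption of this sketch: a uniform background distribution.
    background = " ".join("{} {:.3f}".format(letter, 1.0 / len(alphabet))
                          for letter in alphabet)
    lines = ["MEME version 4", "",
             "ALPHABET= " + alphabet, "",
             "Background letter frequencies", background, ""]
    for name, matrix in motifs:
        lines.append("MOTIF " + name)
        lines.append("letter-probability matrix: alength= {} w= {}".format(
            len(alphabet), len(matrix)))
        for row in matrix:
            if len(row) != len(alphabet):
                raise ValueError("row length does not match the alphabet")
            lines.append(" ".join("{:.6f}".format(p) for p in row))
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    example = [("motif_1", [[0.7, 0.1, 0.1, 0.1],
                            [0.1, 0.1, 0.1, 0.7]])]
    print(format_meme(example))
```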

run_tomtom

def run_tomtom(motif_file, output_folder, database, options = None)

Compare a MEME file against a database using TomTom.

Default options string: "-min-overlap 5 -verbosity 1 -xalph -evalue -thresh 0.1"

| parameter | type | description |
| --- | --- | --- |
| motif_file | str | A MEME file. |
| output_folder | str | The folder the TomTom output will be saved in. |
| database | str | A MEME file serving as the database to compare against. |
| options | str | Command line options passed to TomTom. |
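To show how the options string relates to an actual TomTom call, the sketch below assembles a command line without running it. The default options are taken from the text above; the use of TomTom's `-oc` flag for the output folder and the argument order are assumptions of this sketch, and `build_tomtom_command` is a hypothetical helper rather than pysster's implementation.

```python
import shlex

# Default options string as documented above.
DEFAULT_OPTIONS = "-min-overlap 5 -verbosity 1 -xalph -evalue -thresh 0.1"


def build_tomtom_command(motif_file, output_folder, database, options=None):
    """Return the TomTom call as an argument list (nothing is executed)."""
    if options is None:
        options = DEFAULT_OPTIONS
    # Assumption: "-oc" selects the output folder and overwrites existing
    # results; query file and target database come last.
    return (["tomtom"] + shlex.split(options)
            + ["-oc", output_folder, motif_file, database])


if __name__ == "__main__":
    print(build_tomtom_command("motifs.meme", "tomtom_out", "database.meme"))
```

The resulting list can be passed to `subprocess.run(command, check=True)` if the `tomtom` binary is on the PATH.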