Figmop

July 30, 2015 ยท View on GitHub

====== Figmop

Version 1.1

This script uses the patternHmm Python package to identify genes in a raw genome sequence, after training a profile hidden Markov model on protein sequences. Besides patternHmm, it requires that MEME and MAST have already been installed on the system.

Installation

This package is small, containing the single executable Python file 'figmop'. Installation is done using the standard distutils software, and so should be relatively straight forward. Unpack the distribution (either the .tar.gz or the .zip file), and change directory into it. Then install the package with the command:

python setup.py install

You may need super-user privileges to create the necessary directories, in which case the command would be:

sudo python setup.py install

On Windows the command should be just:

setup.py install

This should copy the script to your default location for Python scripts, which should be /usr/local/bin/ or perhaps /usr/bin/ on Unix and MacOSX.

Once this is finished it is recommended that you go through the included tutorial, but regardless you are free to delete the distribution folder.

Usage

For detailed instructions read the included tutorial.pdf document.

Figmop was designed to locate genes in raw genome data, as certain gene families tend to confound gene-finding software even with a high quality assembly. This would cause your gene family to be missing from the gene models and protein sequences, even though the genes may be present in their entirety within the genome.

Creating the Model File

As an example, let us say that we are searching for cytochrome P450 (CYP) genes. The first step is to assemble a reasonable number of related CYP protein sequences, preferebly > 20. You then should run MEME on your sequences, varying the number of motifs it finds until you are satisfied that they form a consistent pattern. I found that 15 motifs described the CYPs well, with a full length sequence containing 13, but numbers may vary for your gene family.

Next create a model file with Figmop, using the command:

figmop --generate_model_file 13

This will create a Python file, 'model_file.py', which is used to specify the parameters of our gene family. I found many good CYP sequences that only contained 5 of the 13 motifs (all in the correct order), so I set my 'min_matches' to 5. The 'max_genome_region' specifies the maximum size of the footprint of a gene. Unless your ogranism has exceptionally many and large introns, the default setting of 40,000 should be sufficient. Next, set the full path to the 'meme.txt' file generated by MEME.

Now we must specify the pattern we found using MEME, which is done by modifying the 'matchEmissions' dictionary in the model file. For the CYPs, the pattern of motifs I found was: Motif 8, Motif 10, Motif 11, Motif 2, Motif 15, Motif 7, etc. To indicate this, I set the first 3 entries of the dictionary to be: 'M1': {'8': 1.0}, 'M2': {'10': 1.0}, 'M3': {'11': 1.0}. Notice that the motifs are represented by a string. In my sequences, I actually found 2 subtypes of the CYP pattern. In the first the next 3 motifs were 2, 15, 7, while in the second those three motifs were replaced by 12 and 9. This is easily accomodated by the model, and so the next 3 entries of the dictionary were set as: 'M4': {'2': 1.0, '12': 1.0}, 'M5': {'15': 1.0, '9': 1.0}, 'M6': {'7': 1.0}. Finally, from biological data I knew that the final motif in my pattern, which happened to be Motif 1, was very important to the function of my proteins. To indicate this I increased the match probability of the final entry in the dictionary: 'M13': {'1': 2.0}.

This kind of profile Hidden Markov Model allows for several motifs in the pattern to be skipped, and for random motifs to be present within the pattern; the penalties for each of these cases are controlled by the transition probabilities in the second dictionary in the model file. If desired you can modify the default values to make the pattern more or less strict, but in general this shouldn't be necessary.

Running Figmop

One model file is used to describe one gene family, and can be applied to many different genomes. Once you have this file, Figmop is simple to run. The most common usage is:

figmop model_file.py genome_file.fa genome_mast_out -o my_run

Here, genome_file.fa is the fasta formatted file of scaffolds or chromosomes for your organism, my_run is a path to some directory where you want your results to be saved, and genome_mast_out is a path to some directory to hold the raw genome MAST results. These are required as an intermediate step, but are not useful for the user. However, they may take 5-20 minutes to run, and so are only generated once. If the program detects that the specified directory doesn't exist or that the required file can't be found, it will run MAST automatically. That output depends only on the genome and your original MEME file, and so can be reused even if the model file is changed between runs.

One common option is the --sequence_padding or -p flag. When Figmop finds a match to your pattern, it retrieves the genomic sequence at that location (and takes the inverse complement if it came from the negative strand). If the first motif in your pattern did not contain the start codon of your gene (and there's no reason it would be expected to), neither would the sequence returned by Figmop. Specifying a number for the sequence_padding option causes Figmop to return the sequence of the match, and that many additional base pairs on either side. The magnitude will depend on your gene family and your pattern, but may be from several dozen to several thousand.

Finally, there is an --e_value or -e option that can be set. It is important to note that this is the E-value cutoff used by the final MAST run, and is influenced by the length of your genes and the quality of the matches to each motif, not by the pattern or order they occur in (though they will certainly be related in the vast majority of cases).