ellenGP

February 16, 2017

ellenGP is a genetic programming tool for symbolic regression and multi-class classification that incorporates epigenetic learning and uses a stack-based, linear representation.

This code formed the basis of research during my dissertation.

Please note that most of the current development for ellen is happening in the ellyn repo, which is a Python-wrapped version of this code base.

There are some library dependencies, including Eigen.

The files have been built in Visual Studio C++ 2010 and in Linux with gcc and the Intel C++ compiler.

About

ellenGP uses a stack-based, syntax-free, linear genome for constructing candidate equations.
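To make the idea of a stack-based, syntax-free linear genome concrete, here is a minimal sketch of how such a program could be evaluated. It is not ellenGP's actual interpreter: the function name `eval_stack_program`, the token set (numeric constants, the variable `x`, and `+ - * /`), the protected division, and the choice to return 0 for an empty stack are all assumptions made for illustration. The key point is that an operator with too few operands on the stack is simply skipped, so any sequence of tokens is a valid program.

```cpp
#include <cmath>
#include <sstream>
#include <string>
#include <vector>

// Evaluate a whitespace-separated linear program on a stack.
// Hypothetical token set (assumption): numeric constants, the variable "x",
// and the binary operators + - * /.
double eval_stack_program(const std::string& program, double x) {
    std::vector<double> stack;
    std::istringstream tokens(program);
    std::string tok;
    while (tokens >> tok) {
        if (tok == "x") {
            stack.push_back(x);
        } else if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            // Syntax-free rule: an operator without two operands is a no-op.
            if (stack.size() < 2) continue;
            double b = stack.back(); stack.pop_back();
            double a = stack.back(); stack.pop_back();
            if (tok == "+")      stack.push_back(a + b);
            else if (tok == "-") stack.push_back(a - b);
            else if (tok == "*") stack.push_back(a * b);
            // Protected division (assumption): return 1 when dividing by ~0.
            else stack.push_back(std::fabs(b) < 1e-12 ? 1.0 : a / b);
        } else {
            // Anything else is treated as a numeric constant;
            // std::stod throws if the token is not a number.
            stack.push_back(std::stod(tok));
        }
    }
    // The program output is the top of the stack (0 if nothing was pushed).
    return stack.empty() ? 0.0 : stack.back();
}
```

For example, `eval_stack_program("x 2 * 1 +", 3.0)` computes 2x + 1 at x = 3, and `eval_stack_program("+ x 2 *", 3.0)` still evaluates (to 2x) because the leading `+` is skipped.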

It is built to include different evolutionary methods for system identification adapted from the literature. The options include normal tournament selection, deterministic crowding, and age-pareto fitness selection. All algorithm choices are managed by one parameter file.

How to Build

I've built the project in Visual Studio 2010 Professional as well as C++ Express (which is free from Microsoft), and in Linux with g++ and the Intel C++ compiler using the makefiles. If you use VS 2010 Express, the OpenMP files (which were removed from VS 2010) need to be added to the VS path.

There are two external library dependencies:

  • Boost libraries - a set of multi-purpose C++ libraries, needed for RunTrialsMPI only
  • Eigen - a C++ template library for linear algebra

In addition to downloading those packages, the paths to them need to be modified in the Makefiles for ellenGP and RunTrials.

How to run ellenGP

Run ellenGP like this:

ellenGP sampleparams.txt sampledata.txt

As you can see, ellenGP takes two arguments: a parameter file and a data file. The parameter file includes all of the run-time settings for your search. The data file includes all your experimental data. See the sampleparams.txt and sampledata.txt files to see how formatting works.

How to run RunTrials

RunTrials will run ellenGP for many trials. It uses OpenMP to parallelize the trials. Here is the syntax:

RunTrials sampletrials.txt

RunTrials takes one input file (sampletrials.txt). The trials input file contains three columns:

[#trials] [parameterfile] [datafile]

For example, a trials file might contain the line:

100 ../in/sampleparams.txt ../in/sampledata.txt

There is also an MPI version, RunTrialsMPI, which uses the same syntax but distributes the trials over the nodes of a cluster rather than the cores of a single node (computer).
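The three-column trials format described above could be read with a few lines of C++. This is only a sketch and not the code RunTrials uses: the `TrialSpec` struct and `parse_trial_line` function are hypothetical names, and the sketch assumes the columns are separated by whitespace, that paths contain no spaces, and that the trial count must be a positive integer.

```cpp
#include <sstream>
#include <string>

// One row of a trials file: [#trials] [parameterfile] [datafile].
// Hypothetical type for illustration.
struct TrialSpec {
    int num_trials = 0;
    std::string param_file;
    std::string data_file;
};

// Parse a single whitespace-separated line into `out`.
// Returns false (leaving `out` untouched) if a column is missing or the
// trial count is not a positive integer.
bool parse_trial_line(const std::string& line, TrialSpec& out) {
    std::istringstream in(line);
    TrialSpec spec;
    if (!(in >> spec.num_trials >> spec.param_file >> spec.data_file)) {
        return false;
    }
    if (spec.num_trials <= 0) {
        return false;
    }
    out = spec;
    return true;
}
```

Calling `parse_trial_line` on the example line fills in 100 trials with `../in/sampleparams.txt` as the parameter file and `../in/sampledata.txt` as the data file.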

These are the simple instructions for running ellenGP.
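Since the trials are independent of one another, parallelizing them with OpenMP can be as simple as a parallel loop. The sketch below shows that general pattern only; `run_one_trial` is a hypothetical stand-in for a full ellenGP run and `run_trials` is a hypothetical driver, neither of which is taken from the RunTrials source.

```cpp
#include <vector>

// Hypothetical stand-in for one ellenGP run; returns a dummy result
// so that the sketch is self-contained.
double run_one_trial(int trial_id) {
    return trial_id * 2.0;
}

// Run `num_trials` independent trials and collect one result per trial.
std::vector<double> run_trials(int num_trials) {
    std::vector<double> results(num_trials);
    // Each iteration writes only to its own slot, so no locking is needed.
    // If the compiler is not given an OpenMP flag (e.g. -fopenmp for g++),
    // the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int t = 0; t < num_trials; ++t) {
        results[t] = run_one_trial(t);
    }
    return results;
}
```

Because every trial writes to a distinct element of `results`, the outcome is the same whether the loop runs on one core or many.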

RunTrialsMPI

RunTrialsMPI is the same as RunTrials except that it is written to be compiled on clusters (the TACC cluster Stampede as well as the UMass HPCC cluster). MakefileTACC and MakefileUMG contain the compilation notes. It has been built using Intel icpc and the MPI compiler mvapich2 from OSU, as well as g++ with mpicxx.

Settings

Here is a comprehensive list of all of the options that you can include in the parameter file.

| Setting | Default | Description |
|---|---|---|
| g | 100 | number of generations |
| popsize | 500 | population size |
| limit_evals | 0 | limit point evals instead of number of generations |
| max_evals | 0 | max point evaluations |
| Generation Settings | | |
| sel | 1 | 1: tournament 2: deterministic crowding 3: lexicase selection 4: age-fitness pareto algorithm |
| PS_sel | 1 | objectives for pareto survival. 1: age + fitness; 2: age+fitness+generality; 3: age+fitness+complexity; 4: class fitnesses (classification ONLY); 5: class fitnesses+ age (classification ONLY) |
| tourn_size | 2 | number of individuals in each tournament |
| rt_rep | 0 | rate of reproduction |
| rt_cross | 0.8 | rate of crossover |
| rt_mut | 0.2 | rate of mutation |
| cross | 3 | 1: ultra 2: one point 3: sub-tree |
| mutate | 2 | 1: point mutation; 2: subtree mutation |
| cross_ar | 0.025 | crossover alternation rate (ultra only) |
| mut_ar | 0.025 | mutation alternation rate |
| align_dev | 0 | on or off; adds gaussian alignment deviation to crossover |
| elitism | 0 | save best individual each generation |
| stop_condition | 1 | if on, run will terminate when a fitness < 1e-6 is reached |
| init_validate_on | 0 | initial fitness validation of starting population |
| Data Options | | |
| train | 0 | split data into training and validation sets |
| train_pct | 0.5 | percent of data to be used in training |
| shuffle_data | 0 | shuffle the data before splitting into training and validation |
| pop_restart | 0 | restart run from previous population specified by pop_restart_path |
| pop_restart_path | "" | filename of restart population with path |
| Results and Printing Options | | |
| resultspath | "" | path where results are saved |
| print_every_pop | 0 | save printout of population at every generation |
| print_genome | 0 | prints genome for visualization in paraview |
| print_novelty | 0 | print number of unique output vectors |
| print_homology | 0 | print genetic homology in programs |
| num_log_pts | 0 | number of log points to print (0 means print each generation) |
| Classification Options | | |
| classification | 0 | defines a classification, rather than regression, problem |
| class_bool | 0 | interpret class labels as bit-string conversion of boolean stack output |
| class_m3gp | 0 | use mahalanobis distance classification fitness |
| class_prune | 0 | prunes the dimensions of the best individual each generation |
| Problem information | | |
| intvars | none | variables in data file to use in programs |
| cvals | none | seed the initial population with certain constant values |
| seeds | none | seed partial solutions, e.g. (x+y) |
| AR | 0 | include auto-regressive output variables |
| AR_n | 1 | order of auto-regression (number of time-steps back) |
| AR_lookahead | 0 | just predict one output ahead |
| ERC | 1 | ephemeral random constants |
| ERCints | 0 | make the ERCs integer valued rather than floats |
| maxERC | 1 | |
| minERC | -1 | |
| numERC | 1 | |
| Fitness Settings | | |
| fit_type | 1 | 1: mean absolute error, 2: corr, 3: combo, 4: VAF |
| norm_error | 0 | normalize error by the standard deviation of the target data being used |
| max_fit | 1.00E+20 | maximum fitness possible |
| min_fit | 1.00E-20 | minimum fitness possible |
| estimate_fitness | 0 | coevolve fitness estimators |
| FE_pop_size | 0 | fitness estimator population size |
| FE_ind_size | 0 | number of fitness cases for FE to use |
| FE_train_size | 0 | trainer population size |
| FE_train_gens | 0 | number of generations between trainer selections |
| FE_rank | 0 | use rank for FE fitness rather than error |
| estimate_generality | 0 | estimate how well the solutions generalize using the validation portion of the fitness estimator |
| G_sel | 0 | which fit_type to use to test generality |
| G_shuffle | 0 | shuffles data each generation |
| op_list | n v + - * / | available operators: n v + - * / sin cos log exp sqrt = ! < <= > >= if-then if-then-else & |
| weight_ops_on | 0 | weight the operators differently |
| op_weight | empty | weights of the operators specified in op_list |
| min_len | 3 | minimum program length |
| max_len | 20 | maximum length a program is allowed to be |
| max_len_init | max_len | option to specify different max length for initial population |
| init_trees | 0 | initialize genotypes as syntactically valid trees rather than randomized stacks |
| complex_measure | 2 | 1: genotype size 2: symbolic size 3: effective genotype size |
| Hill Climbing Settings | | |
| parameters | | |
| pHC_on | 0 | parameter hill climbing each generation |
| pHC_its | 1 | number of iterations |
| epigenetics | | |
| eHC_on | 0 | epigenetic hill climbing |
| eHC_its | 1 | number of iterations |
| eHC_prob | 0.1 | probability of a gene being switched |
| eHC_init | 0.5 | percent of expressed genes in initial genotypes |
| eHC_slim | 0 | minimize point evaluations as much as possible |
| eHC_mut | 0 | do mutation rather than hill climbing |
| Pareto Archive Settings | | |
| prto_arch_on | 0 | |
| prto_arch_size | 20 | |
| Island model | | |
| islands | 0 | use multiple island populations, one for each core |
| island_gens | 100 | number of generations between shuffling of the island populations |
| Lexicase Options | | |
| lexpool | 1 | fraction of population to use in lexicase selection events |
| lex_class | 0 | for a classification problem, use separate class fitnesses as cases |
| lex_metacases | none | specify extra cases for selection. Options: age, complexity |
| lex_eps_std | 0 | use epsilon lexicase with eps = standard deviation of error |
| lex_eps_error | 0 | use epsilon lexicase with error-based epsilons |
| lex_eps_target | 0 | use epsilon lexicase with target-based epsilons |
| lex_eps_target_mad | 0 | use epsilon lexicase with median absolute deviation, target-based epsilons |
| lex_eps_error_mad | 0 | use epsilon lexicase with median absolute deviation, error-based epsilons |
| lex_epsilon | 0.1 | value of epsilon (ignored for mad and std versions) |
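Two of the `fit_type` options listed above, mean absolute error and VAF (variance accounted for), are easy to illustrate. The sketch below uses the common textbook definitions, with VAF expressed as a percentage, 100 * (1 - var(y - yhat) / var(y)). The function names are hypothetical, and ellenGP's own implementation may scale or invert these quantities differently (for instance to turn VAF into a value to be minimized), so treat this as an illustration of the metrics rather than of the code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute error between targets y and predictions yhat.
double mean_absolute_error(const std::vector<double>& y,
                           const std::vector<double>& yhat) {
    assert(!y.empty() && y.size() == yhat.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        sum += std::fabs(y[i] - yhat[i]);
    }
    return sum / y.size();
}

// Population variance of v (divides by N, not N - 1).
double variance(const std::vector<double>& v) {
    assert(!v.empty());
    double mean = 0.0;
    for (double x : v) mean += x;
    mean /= v.size();
    double ss = 0.0;
    for (double x : v) ss += (x - mean) * (x - mean);
    return ss / v.size();
}

// Variance accounted for, in percent.
// Undefined (division by zero) when the target y is constant.
double vaf_percent(const std::vector<double>& y,
                   const std::vector<double>& yhat) {
    assert(!y.empty() && y.size() == yhat.size());
    std::vector<double> resid(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        resid[i] = y[i] - yhat[i];
    }
    return 100.0 * (1.0 - variance(resid) / variance(y));
}
```

Note that the two metrics can disagree: a prediction that is off by a constant offset has a nonzero mean absolute error but still scores a VAF of 100, because the residuals have zero variance.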

FYI

ellenGP Copyright (C) 2014 William La Cava

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License (License.txt) for more details.