Genetic algorithm is a heuristic that takes heavy inspiration from evolutionary biology, to explore a space of parameters rapidly and converge to an optimum. Every solution is a “genome”, and the combinations can undergo mutation and recombination. By simulating a process of reproduction, over sufficiently many generation, this heuristic usually gives very good results. It is also simple to implement, and this is what we will do!

A genetic algorithm works by measuring the fitness of a solution (i.e. its fit to a problem we have defined). We can then pick solution for “reproduction”, which involves recombination at various points in the “genome”, and finally a step for the mutation of offspring. There are an almost infinite number of variations at each of these steps, but we will limit ourselves to a simple case here.

Specifically, we will use the DataFrames and CSV package to read a table containing metabolic rates of species, and then use genetic algorithm to try and find out the scaling exponent linking mass to the field metabolic rate. The data come from a meta-analysis by Lawrence N. Hudson and colleagues, and can be downloaded as follows:

using DataFrames
using Statistics
using StatsBase
import CSV
using StatsPlots
tmp = download("http://sciencecomputing.io/data/metabolicrates.csv")
rates = DataFrame(CSV.File(tmp))
rates[1:5,:]
5×8 DataFrame
 Row │ Class     Order              Family       Genus     Species   Study 
    ⋯
     │ String    String             String       String    String    String
    ⋯
─────┼─────────────────────────────────────────────────────────────────────
─────
   1 │ Mammalia  Carnivora          Odobenidae   Odobenus  rosmarus  Acquar
one ⋯
   2 │ Mammalia  Carnivora          Odobenidae   Odobenus  rosmarus  Acquar
one
   3 │ Aves      Procellariiformes  Diomedeidae  Diomedea  exulans   Adams 
et
   4 │ Aves      Procellariiformes  Diomedeidae  Diomedea  exulans   Adams 
et
   5 │ Aves      Procellariiformes  Diomedeidae  Diomedea  exulans   Adams 
et  ⋯
                                                               3 columns om
itted

Let’s have a look at the names of the columns:

names(rates)
8-element Array{String,1}:
 "Class"
 "Order"
 "Family"
 "Genus"
 "Species"
 "Study"
 "M (kg)"
 "FMR (kJ / day)"

We will replace the last two names, as they are not easy to work with:

rename!(rates, names(rates)[end-1] => :mass)
rename!(rates, names(rates)[end] => :fmr)
rates = rates[rates.mass.<1e3,:]
rates[1:5,:]
5×8 DataFrame
 Row │ Class   Order              Family       Genus     Species  Study    
    ⋯
     │ String  String             String       String    String   String   
    ⋯
─────┼─────────────────────────────────────────────────────────────────────
─────
   1 │ Aves    Procellariiformes  Diomedeidae  Diomedea  exulans  Adams et 
al  ⋯
   2 │ Aves    Procellariiformes  Diomedeidae  Diomedea  exulans  Adams et 
al
   3 │ Aves    Procellariiformes  Diomedeidae  Diomedea  exulans  Adams et 
al
   4 │ Aves    Procellariiformes  Diomedeidae  Diomedea  exulans  Adams et 
al
   5 │ Aves    Procellariiformes  Diomedeidae  Diomedea  exulans  Adams et 
al  ⋯
                                                               3 columns om
itted

Much better! Now let’s try to figure out the relationship between the last two variables:

@df rates scatter(:mass, :fmr, c=:teal, leg=false, msc=:transparent)
xaxis!(:log10, "Mass (kg)")
yaxis!(:log10, "Field Metabolic Rate (kj per day)")

Relationship between field metabolic rate and mass - this is a neat log-log relationship, and so linear regression will give us the exponent.

Neat! This is a log-log relationship, so we can represent this problem as:

$$\text{log}{10}(\text{FMR}) \propto m\times\text{log}_{10}(\text{M})+b$$

To simplify the problem a little bit, we will average the values within species; because the relation is log-log, we will average the log of the value (as opposed to taking the log of the averages).

rates = by(
  rates,
  [:Genus, :Species],
  [:mass, :fmr] =>
    x -> (
    mass=mean(log10.(x.mass)),
    fmr=mean(log10.(x.fmr))
    )
  )
rates[1:5,:]
Error: ArgumentError: by function was removed from DataFrames.jl. Use the `
combine(groupby(...), ...)` or `combine(f, groupby(...))` instead.

We can look at the simplified data:

scatter(10.0.^rates.mass, 10.0.^rates.fmr, c=:teal, leg=false, msc=:transparent)
xaxis!(:log10, "Mass (kg)")
yaxis!(:log10, "Field Metabolic Rate (kj per day)")

Same data, but a little bit fewer points. In addition, aggregating by species doesn’t increase the impact that well sampled species will have on the regression output.

In this capstone, we will try to write a very general code for genetic algorithms. It is possible to come up with a more specific code, but spending time to think about making code general is almost always wise, since it means we can seamlessly re-use already written code.

Genetic algorithms represent the state of the problem as “genomes”, which here will be composed of two genes: $m$ and $b$. There are a few decisions we need to take already: how large is our initial population, and how much standing variation do we have?

Just to be a little fancier than usual, we will define a type for our genomes:

mutable struct Genome
  m::Float64
  b::Float64
end
Defining a new type is absolutely not necessary. We are only doing it to show some interesting features of Julia.

This means that our population will be of the type Vector{Genome}. Now, we will add methods to some of Julia’s existing function, so we can write code that will read exactly like native code would:

import Base: zero
zero(::Type{Genome}) = Genome(0.0, 0.0)
zero (generic function with 40 methods)

Because we have created a zero method for the Genome type, we can create our population:

population_size = 1_000
population = zeros(Genome, population_size)
population[1:5]
5-element Array{Main.##WeaveSandBox#253.Genome,1}:
 Main.##WeaveSandBox#253.Genome(0.0, 0.0)
 Main.##WeaveSandBox#253.Genome(0.0, 0.0)
 Main.##WeaveSandBox#253.Genome(0.0, 0.0)
 Main.##WeaveSandBox#253.Genome(0.0, 0.0)
 Main.##WeaveSandBox#253.Genome(0.0, 0.0)

To have a slightly more pleasing display, we can also overload the show function:

import Base: show
show(io::IO, g::Genome) = print(io, "ŷ = $(round(g.m; digits=3))×x + $(round(g.b; digits=3))")
population[1:5]
5-element Array{Main.##WeaveSandBox#253.Genome,1}:
 ŷ = 0.0×x + 0.0
 ŷ = 0.0×x + 0.0
 ŷ = 0.0×x + 0.0
 ŷ = 0.0×x + 0.0
 ŷ = 0.0×x + 0.0

And we are now ready to start. At this point, it is useful to outline the general structure of the genetic algorithm:

function GA!(p::Vector{T}, fitness::Function, mutation!::Function; generations=1_000) where {T}
  for generation in 1:generations
    fitnesses = fitness.(p)
    p = sample(p, StatsBase.weights(fitnesses), length(p), replace=true)
    mutation!.(p)
  end
  return nothing
end
GA! (generic function with 1 method)

Wouldn’t it be cool if the real code was actually that simple?

What if I told you this is the real code? To make it work, we need to do two things. First, we need to define a fitness function, which, given an input (here, a Genome), will return its “score”. Second, we need to define a mutation! function, which will modify a genome.

The fitness function can be defined as follows (note that we have already taken the log10 of the mass and FMR earlier):

const log10fmr = rates[:fmr]
const log10M = rates[:mass]

(g::Genome, x::Vector{Float64})=  g.m .* x .+ g.b

function fmr_fitness(g::Genome)
  errors = log10fmr .- (g, log10M)
  sum_of_errors_sq = sum(errors.^2.0)/length(log10fmr)
  return 1.0/sum_of_errors_sq
end
Error: ArgumentError: syntax df[column] is not supported use df[!, column] 
instead

The first thing we do is “extract” the $\text{log}_{10}$ of the FMR and mass, and turn them into vectors. Then we define a function which uses these vectors, and does the linear prediction. You may notice that this function uses elements from “the outside”, which have not been passed as arguments, and this may not be ideal - this is why the log10... variables have been declared as const: they will raise a warning if they are changed.

We can check the score of an “empty” genome ($\hat y = 0$):

fmr_fitness(zero(Genome))
Error: UndefVarError: fmr_fitness not defined

Now, let’s define a function for mutations. Because we might want to change the rate at which parameters evolve, it would be useful to be able to generate multiple such functions. And so, our first task is to write a function that will return another function, which will itself modify the Genome object. Note that we return nothing at the end, because all of this function does is change an existing object.

using Distributions

function normal_error(σm::Float64, σb::Float64)
  function f(g)
    g.m = rand(Normal(g.m, σm))
    g.b = rand(Normal(g.b, σb))
    return nothing
  end
  return f
end
Error: ArgumentError: Package Distributions not found in current path:
- Run `import Pkg; Pkg.add("Distributions")` to install the Distributions p
ackage.

We can test that this all works as expected:

change_both! = normal_error(0.01, 0.01)
initial_genome = Genome(0.2, 0.5)
change_both!(initial_genome)
initial_genome
Error: UndefVarError: normal_error not defined

Let’s now define a function to work on the actual problem:

very_gradual_change! = normal_error(1e-3, 1e-3)
Error: UndefVarError: normal_error not defined

We have all of the pieces to apply our genetic algorithm. Before starting, it is always a good idea to try and eyeball the parameters. Judging from the plot we made early in the lesson, the intercept is probably just around 3, so we can probably draw it from $\mathcal{N}(\mu=3, \sigma=1)$; the slope seems to be close to one but a little bit lower, so we can get it from $\mathcal{N}(\mu=0.75, \sigma=0.2)$:

population = [Genome(rand(Normal(0.75,0.2)), rand(Normal(3,1))) for i in 1:500]
Error: UndefVarError: Normal not defined
GA!(population, fmr_fitness, very_gradual_change!; generations=10000)
Error: UndefVarError: fmr_fitness not defined

TODO

scatter(10.0.^rates[:mass], 10.0.^rates[:fmr], c=:teal, leg=false, msc=:transparent)
pred = collect(LinRange(10.0.^minimum(rates[:mass]), 10.0.^maximum(rates[:mass]), 100))
mostfit = fmr_fitness.(population) |> findmax |> last
plot!(pred, 10.0.^(population[mostfit], log10.(pred)), c=:orange, lw=2, ls=:dash)
xaxis!(:log10, "Mass (kg)")
yaxis!(:log10, "Field Metabolic Rate (kj per day)")
Error: ArgumentError: syntax df[column] is not supported use df[!, column] 
instead

We can also look at the equation for the most fit genome: Error: UndefVarError: mostfit not defined.