Genetic algorithm is a heuristic that takes heavy inspiration from evolutionary biology, to explore a space of parameters rapidly and converge to an optimum. Every solution is a “genome”, and the combinations can undergo mutation and recombination. By simulating a process of reproduction, over sufficiently many generation, this heuristic usually gives very good results. It is also simple to implement, and this is what we will do!

A genetic algorithm works by measuring the fitness of a solution (*i.e.* its fit
to a problem we have defined). We can then pick solution for “reproduction”,
which involves recombination at various points in the “genome”, and finally a
step for the mutation of offspring. There are an almost infinite number of
variations at each of these steps, but we will limit ourselves to a simple case
here.

Specifically, we will use the `DataFrames`

and `CSV`

package to read a table
containing metabolic rates of species, and then use genetic algorithm to try and
find out the scaling exponent linking mass to the field metabolic rate. The data
come from a meta-analysis by Lawrence N. Hudson and colleagues, and
can be downloaded as follows:

```
using DataFrames
using Statistics
using StatsBase
import CSV
using StatsPlots
tmp = download("http://sciencecomputing.io/data/metabolicrates.csv")
rates = DataFrame(CSV.File(tmp))
rates[1:5,:]
```

```
5×8 DataFrame
Row │ Class Order Family Genus Species Study
⋯
│ String String String String String String
⋯
─────┼─────────────────────────────────────────────────────────────────────
─────
1 │ Mammalia Carnivora Odobenidae Odobenus rosmarus Acquar
one ⋯
2 │ Mammalia Carnivora Odobenidae Odobenus rosmarus Acquar
one
3 │ Aves Procellariiformes Diomedeidae Diomedea exulans Adams
et
4 │ Aves Procellariiformes Diomedeidae Diomedea exulans Adams
et
5 │ Aves Procellariiformes Diomedeidae Diomedea exulans Adams
et ⋯
3 columns om
itted
```

Let’s have a look at the names of the columns:

```
names(rates)
```

```
8-element Array{String,1}:
"Class"
"Order"
"Family"
"Genus"
"Species"
"Study"
"M (kg)"
"FMR (kJ / day)"
```

We will replace the last two names, as they are not easy to work with:

```
rename!(rates, names(rates)[end-1] => :mass)
rename!(rates, names(rates)[end] => :fmr)
rates = rates[rates.mass.<1e3,:]
rates[1:5,:]
```

```
5×8 DataFrame
Row │ Class Order Family Genus Species Study
⋯
│ String String String String String String
⋯
─────┼─────────────────────────────────────────────────────────────────────
─────
1 │ Aves Procellariiformes Diomedeidae Diomedea exulans Adams et
al ⋯
2 │ Aves Procellariiformes Diomedeidae Diomedea exulans Adams et
al
3 │ Aves Procellariiformes Diomedeidae Diomedea exulans Adams et
al
4 │ Aves Procellariiformes Diomedeidae Diomedea exulans Adams et
al
5 │ Aves Procellariiformes Diomedeidae Diomedea exulans Adams et
al ⋯
3 columns om
itted
```

Much better! Now let’s try to figure out the relationship between the last two variables:

```
@df rates scatter(:mass, :fmr, c=:teal, leg=false, msc=:transparent)
xaxis!(:log10, "Mass (kg)")
yaxis!(:log10, "Field Metabolic Rate (kj per day)")
```

Neat! This is a log-log relationship, so we can represent this problem as:

$$\text{log}{10}(\text{FMR}) \propto m\times\text{log}_{10}(\text{M})+b$$

To simplify the problem a little bit, we will average the values within species; because the relation is log-log, we will average the log of the value (as opposed to taking the log of the averages).

```
rates = by(
rates,
[:Genus, :Species],
[:mass, :fmr] =>
x -> (
mass=mean(log10.(x.mass)),
fmr=mean(log10.(x.fmr))
)
)
rates[1:5,:]
```

```
Error: ArgumentError: by function was removed from DataFrames.jl. Use the `
combine(groupby(...), ...)` or `combine(f, groupby(...))` instead.
```

We can look at the simplified data:

```
scatter(10.0.^rates.mass, 10.0.^rates.fmr, c=:teal, leg=false, msc=:transparent)
xaxis!(:log10, "Mass (kg)")
yaxis!(:log10, "Field Metabolic Rate (kj per day)")
```

Genetic algorithms represent the state of the problem as “genomes”, which here will be composed of two genes: $m$ and $b$. There are a few decisions we need to take already: how large is our initial population, and how much standing variation do we have?

Just to be a little fancier than usual, we will define a *type* for our genomes:

```
mutable struct Genome
m::Float64
b::Float64
end
```

*absolutely not*necessary. We are only doing it to show some interesting features of Julia.

This means that our population will be of the type `Vector{Genome}`

. Now, we
will *add methods* to some of Julia’s existing function, so we can write code
that will read exactly like native code would:

```
import Base: zero
zero(::Type{Genome}) = Genome(0.0, 0.0)
```

```
zero (generic function with 40 methods)
```

Because we have created a `zero`

method for the `Genome`

type, we can create our
population:

```
population_size = 1_000
population = zeros(Genome, population_size)
population[1:5]
```

```
5-element Array{Main.##WeaveSandBox#253.Genome,1}:
Main.##WeaveSandBox#253.Genome(0.0, 0.0)
Main.##WeaveSandBox#253.Genome(0.0, 0.0)
Main.##WeaveSandBox#253.Genome(0.0, 0.0)
Main.##WeaveSandBox#253.Genome(0.0, 0.0)
Main.##WeaveSandBox#253.Genome(0.0, 0.0)
```

To have a slightly more pleasing display, we can also overload the `show`

function:

```
import Base: show
show(io::IO, g::Genome) = print(io, "ŷ = $(round(g.m; digits=3))×x + $(round(g.b; digits=3))")
population[1:5]
```

```
5-element Array{Main.##WeaveSandBox#253.Genome,1}:
ŷ = 0.0×x + 0.0
ŷ = 0.0×x + 0.0
ŷ = 0.0×x + 0.0
ŷ = 0.0×x + 0.0
ŷ = 0.0×x + 0.0
```

And we are now ready to start. At this point, it is useful to outline the general structure of the genetic algorithm:

```
function GA!(p::Vector{T}, fitness::Function, mutation!::Function; generations=1_000) where {T}
for generation in 1:generations
fitnesses = fitness.(p)
p = sample(p, StatsBase.weights(fitnesses), length(p), replace=true)
mutation!.(p)
end
return nothing
end
```

```
GA! (generic function with 1 method)
```

Wouldn’t it be cool if the real code was actually that simple?

What if I told you this *is* the real code? To make it work, we need to do two
things. First, we need to define a `fitness`

function, which, given an input
(here, a `Genome`

), will return its “score”. Second, we need to define a
`mutation!`

function, which will *modify* a genome.

The fitness function can be defined as follows (note that we have already taken the log10 of the mass and FMR earlier):

```
const log10fmr = rates[:fmr]
const log10M = rates[:mass]
ŷ(g::Genome, x::Vector{Float64})= g.m .* x .+ g.b
function fmr_fitness(g::Genome)
errors = log10fmr .- ŷ(g, log10M)
sum_of_errors_sq = sum(errors.^2.0)/length(log10fmr)
return 1.0/sum_of_errors_sq
end
```

```
Error: ArgumentError: syntax df[column] is not supported use df[!, column]
instead
```

The first thing we do is “extract” the $\text{log}_{10}$ of the FMR and mass,
and turn them into vectors. Then we define a function which uses these vectors,
and does the linear prediction. You may notice that this function uses elements
from “the outside”, which have not been passed as arguments, and this may not be
ideal - this is why the `log10...`

variables have been declared as `const`

: they
will raise a warning if they are changed.

We can check the score of an “empty” genome ($\hat y = 0$):

```
fmr_fitness(zero(Genome))
```

```
Error: UndefVarError: fmr_fitness not defined
```

Now, let’s define a function for mutations. Because we might want to change the
rate at which parameters evolve, it would be useful to be able to generate
multiple such functions. And so, our first task is to write a function that will
return another function, which will itself modify the `Genome`

object. Note that
we `return nothing`

at the end, because all of this function does is change an
existing object.

```
using Distributions
function normal_error(σm::Float64, σb::Float64)
function f(g)
g.m = rand(Normal(g.m, σm))
g.b = rand(Normal(g.b, σb))
return nothing
end
return f
end
```

```
Error: ArgumentError: Package Distributions not found in current path:
- Run `import Pkg; Pkg.add("Distributions")` to install the Distributions p
ackage.
```

We can test that this all works as expected:

```
change_both! = normal_error(0.01, 0.01)
initial_genome = Genome(0.2, 0.5)
change_both!(initial_genome)
initial_genome
```

```
Error: UndefVarError: normal_error not defined
```

Let’s now define a function to work on the actual problem:

```
very_gradual_change! = normal_error(1e-3, 1e-3)
```

```
Error: UndefVarError: normal_error not defined
```

We have all of the pieces to apply our genetic algorithm. Before starting, it is always a good idea to try and eyeball the parameters. Judging from the plot we made early in the lesson, the intercept is probably just around 3, so we can probably draw it from $\mathcal{N}(\mu=3, \sigma=1)$; the slope seems to be close to one but a little bit lower, so we can get it from $\mathcal{N}(\mu=0.75, \sigma=0.2)$:

```
population = [Genome(rand(Normal(0.75,0.2)), rand(Normal(3,1))) for i in 1:500]
```

```
Error: UndefVarError: Normal not defined
```

```
GA!(population, fmr_fitness, very_gradual_change!; generations=10000)
```

```
Error: UndefVarError: fmr_fitness not defined
```

*TODO*

```
scatter(10.0.^rates[:mass], 10.0.^rates[:fmr], c=:teal, leg=false, msc=:transparent)
pred = collect(LinRange(10.0.^minimum(rates[:mass]), 10.0.^maximum(rates[:mass]), 100))
mostfit = fmr_fitness.(population) |> findmax |> last
plot!(pred, 10.0.^ŷ(population[mostfit], log10.(pred)), c=:orange, lw=2, ls=:dash)
xaxis!(:log10, "Mass (kg)")
yaxis!(:log10, "Field Metabolic Rate (kj per day)")
```

```
Error: ArgumentError: syntax df[column] is not supported use df[!, column]
instead
```

We can also look at the equation for the most fit genome: Error: UndefVarError: mostfit not defined.