Approximate Bayesian computation, or ABC for short, is a very useful heuristic to estimate the posterior distribution of model parameters, specifically when the analytical expression of the likelihood function is unavailable (or when we can’t be bothered to figure it out). The theory on how ABC works will not be covered here in detail, so reading the previous article (and the references it links to) is highly recommended.

We will rely on a few packages for this example:

```
using StatsPlots # Plots of statistical distributions
using Statistics # mean and std
using Distributions # Statistical distributions
```

Let us now imagine an island. It is a small island, no more than a few meters in diameter, in the Florida keys. On this island is a single shrub, and on this single shrub an insect can live. Every year, for 20 years, a few biologists comb through the shrub, to figure out whether or not the insect is present or absent. This results in a timeseries, like this:

```
Year 1 present
Year 2 absent
Year 3 absent
Year 4 absent
Year 5 present
Year 6 present
Year 7 absent
Year 8 present
Year 9 present
Year 10 absent
```

To begin with, let us see how we can model the presence/absence of this species.
We will make two assumptions. First, the biological process here can be
represented by modeling the probabilities of *transition*, *i.e.* the chance of
changing from one state to another. Second, there is a small chance of *error*
when measuring the state of the system; this is because our insect evolved a
really good mimetism, and is more or less the same color as the shrub.
Specifically, while detecting at least one individual means that there is no
chance that the species is absent, *not* detecting any individual can also mean
that there were present in low density, and have not been detected. The problem
is therefore that if we want to think about the extinction-colonization dynamics
of this island, our empirical dataset adds some uncertainty due to the
measurement. We will use ABC to quantify this uncertainty, and estimate the
value of the extinction and colonization parameters to conduct additional
simulations.

We can represent this system using the following figure:

In the *true state* box, the rates represent the probabilities of state change.
An empty island can be colonized at rate $c$; an occupied island can become
empty through extinction at rate $e$; finally, we can miss the insect during
sampling at rate $m$. The rates between *true state* and *measured state*
represent the probabilities of making the wrong measurement. We now have enough
to write a simple function to simulate this model. The point of this function is
to return a timeseries, where an occupied island at a given year is `true`

, and
an empty isladnd is `false`

. Because there are two distinct processes, we will
first simulate the *true state* of the island, and then write a second function
to simulate the *measured state*:

```
function island_true_state(e::T, c::T; n=200) where {T <: AbstractFloat}
@assert 0.0 ≤ e ≤ 1.0
@assert 0.0 ≤ c ≤ 1.0
state = zeros(Bool, n)
for year in 2:n
state[year] = state[year-1] ? rand() ≥ e : rand() < c
end
return state
end
function island_measured_state(state::Vector{Bool}, m::T) where {T <: AbstractFloat}
measured_state = similar(state)
for (year, real_state) in enumerate(state)
measured_state[year] = real_state ? rand() ≥ m : false
end
return measured_state
end
function island(e::T, c::T, m::T; n=200) where {T <: AbstractFloat}
true_state = island_true_state(e, c; n=n)
measured_state = island_measured_state(true_state, m)
return measured_state
end
```

```
island (generic function with 1 method)
```

This is not the most efficient way to express this problem (we will need to
iterate over our timeseries twice!), but it will be good enough for this
example; most importantly, it means that we will be able to use
`island_true_state`

to generate simulations without the observation error after
we are done estimating the values of the parameters.

In ABC, one key notion is the idea of “summary statistics”, *i.e.* the act of
compressing the empirical data and model output to something that can be
meaningfully compared. The selection of summary statistics can be very
complex, and is in fact a key step in the success of ABC. Using too
much exposes you to the curse of dimensionality, and using too few (or the wrong
ones) can lead to spurious conclusions.

Here, we will work with two informations, namely the rate of transition, and the temporal occupancy. Another few indicators we could have used are apparent colonisation events (when the island appeared empty one year, then occupied the next), apparent extinction events (when the island appears occupied one year, and empty the next), and finally the number of times where the sequence occupied/empty/occupied appears. Feel free to try and adapt the code after you have been through this lesson once to use these and other indicators, and see how the results change. A good rule of thumb is to think about summary statistics both as a statistician and a domain expert: you want to capture the structure of your data, but also the mechanisms of interest.

In any case, here is our function to summarize a sequence of observations:

```
function summary(t::Vector{Bool})
transitions = 0.0
for i in 2:length(t)
if t[i] ≠ t[i-1]
transitions = transitions+1.0
end
end
occupancy = sum(t)/length(t)
return [occupancy, transitions/(length(t)-1)]
end
```

```
summary (generic function with 1 method)
```

Do you like it? It’s OK I guess, but I’m not a big fan. As we discussed in the previous paragraph, we may want to use different summary statistics, and this would require re-writing this function. This is a good example of our code being insufficiently modular: each summary statistic should be its own function. Let’s correct this.

```
occupancy(t::Vector{Bool}) = sum(t)/length(t)
function transitions(t::Vector{Bool})
n = 0.0
for i in 2:length(t)
if t[i] ≠ t[i-1]
n = n+1.0
end
end
return n/(length(t)-1)
end
```

```
transitions (generic function with 1 method)
```

We have defined two functions, each of which returns the value of one indicator.
We can now write a declaration for `summary`

that will accept the states
timeseries, *and* an array of functions to measure the indicators:

```
function summary(t::Vector{Bool}, f::Vector{T}) where {T <: Function}
s = zeros(Float64, length(f))
for (i, fn) in enumerate(f)
s[i] = fn(t)
end
return s
end
```

```
summary (generic function with 2 methods)
```

Is it perfect? No. Is it an improvement over the previous implementation? Assuredly, so let’s roll with this. We can apply this function to a simulation with parameters $(e=0.2, c=0.6, m=0.1)$, and get the summary statistics:

```
summary(
island(0.2, 0.6, 0.1), # simulation output
[occupancy, transitions] # list of things to measure
)
```

```
2-element Array{Float64,1}:
0.69
0.40703517587939697
```

At this point, we need empirical data to feed the model. Let’s say that over twenty years, the species has been observed from year 4 to 12, then 14 to 15, and finally from year 17 to 20. We can write this as:

```
empirical_data = zeros(Bool, 20)
empirical_data[4:12] .= true
empirical_data[14:15] .= true
empirical_data[17:20] .= true
```

Your biologist colleague would be very happy if you could get to an estimate of
the parameters that govern the presence of this insect on the island, *minus*
the observation error. This is an easy enough task to do with ABC.

We can measure the statistics of this timeseries:

```
summary(empirical_data)
```

```
2-element Array{Float64,1}:
0.75
0.2631578947368421
```

Good parameter values will result in simulations that have similar summary
statistics. Keep in mind that the empircal data are the real state of the island
*and* the measurement error, so we need to use the `island`

model.

And just like this, we are ready to start the ABC process. We will start by
deciding on priors for all three parameters. We could model them in a variety of
ways, including beta distributions, and truncated distributions. Let’s go with
the later, as it is a neat illustration of the `Distributions`

package. We
picked a value of colonisation higher than extinction (as the species seems to
persist), and a relatively low rate of false absences, albeit with a larger
standard deviation because we are not really sure as to what this parameter
should look like.

```
Dc = Truncated(Normal(0.6, 0.1), 0.0, 1.0)
De = Truncated(Normal(0.3, 0.1), 0.0, 1.0)
Dm = Truncated(Normal(0.1, 0.3), 0.0, 1.0)
```

In this algorithm (as with any other Bayesian application), priors are
*determinant* in the success of the process; indeed, the posterior distribution
will be a subset of the prior, because (as we discuss below) ABC relies on
*rejection* sampling, *i.e.* it winnows parameter values that make little sense
from the original distribution. Picking a prior which is too narrow will prevent
the algorithm from converging onto the correct posterior. Conversely, picking a
prior that is too flat (let’s say a uniform distribution that goes from negative
infinity to positive infinity) means that the time to convergence will explode.

We will generate 10⁶ samples using our simulation model, and get the summary
statistics for all of them. Note that the simulations run for 200 timesteps, as
opposed to 20 for the empirical data. This is perfectly fine, because ABC works
on summary statistics, and not on the raw data. In fact, if we *only* add the
summary statistics and not the raw data, we could still apply ABC! An
interesting exercise is to change the number of timesteps in the simulations,
and see if/how the posterior distributions react.

```
N = 1_000_000
Sc = rand(Dc, N)
Se = rand(De, N)
Sm = rand(Dm, N)
simulated_results = summary.(map(i -> island(Se[i], Sc[i], Sm[i]), 1:N))
```

We can now measure the distance between these simulated results and the empirical results:

```
function euclidean_distance(x1::Vector{T}, x2::Vector{T}) where {T <: AbstractFloat}
return sum(sqrt.((x1.-x2).^2.0))
end
distances = map(s -> euclidean_distance(s, summary(empirical_data)), simulated_results)
```

We could use another distance measure - dependind on the type of summary statistics, it might even be required. We can also assign weights to the different summary statistics, but this has a risk of overly fine-tuning the process, and it can be difficult to assess the consequences of these choices.

```
density(distances, leg=false, fill=(0, :orange, 0.4), c=:orange)
xaxis!("Distance", (0,1))
yaxis!("Density", (0, 4))
```

The actual ABC step is to *reject* some of the samples from them prior
distribution, to only select the parameters combinations that produce results
*close* to the empirical data. This is done by setting a threshold, and
distances *above* this threshold will be rejected (this is why the choice of a
good prior distribution is crucial). In this case, we picked a threshold value
of $\theta = 0.02$. Why? Well, for the same reason we idolize $p \le 0.05$:
“dunno, it looks fine I guess?”. You are, once again, encouraged to change this
value, and the density of distances presented in the previous plot should help
you decide which values are too large or too small; and as always, you can
increase the number of samples if you want to be very strict.

```
θ = 0.02
posterior = findall(distances.<θ)
```

`while`

loops.```
density(Sc, c=:teal, ls=:dash, lab="")
density!(Se, c=:purple, ls=:dash, lab="")
density!(Sm, c=:grey, ls=:dash, lab="")
density!(Sc[posterior], c=:teal, fill=(0, :teal, 0.3), lab="Colonization")
density!(Se[posterior], c=:purple, fill=(0, :purple, 0.3), lab="Extinction")
density!(Sm[posterior], c=:grey, fill=(0, :grey, 0.3), lab="Error")
xaxis!("Parameter value", (0,1))
yaxis!("Density", (0, 14))
```

To summarize, we can now extract the values of the different parameters:

meaning | mean | standard deviation | |
---|---|---|---|

$c$ | Colonization rate | 0.551 | 0.066 |

$e$ | Extinction rate | 0.154 | 0.04 |

$m$ | Measurement error rate | 0.05 | 0.038 |

Done! One of the strength of ABC is that we can now *generate* datasets, using
our original model for simulation. And because we know the measurement error
rate, we can also correct these datasets to describe the data *generation*
process (the actual presence/absence of the species) as opposed to the data
*observation* process (measuring the presence/absence and being sometimes wrong
about it).

We now have everything we need to answer our original question: what would the
extinction/colonization dynamics looks like *minus* the measurement error? For
this, we can use our `island_true_state`

model, which only takes `c`

and `e`

as
inputs, and generate timeseries. We will focus on the average temporal
occupancy:

```
posterior_c = Sc[posterior]
posterior_e = Se[posterior]
occupancies = map((e,c) -> occupancy(island_true_state(e,c)), posterior_e, posterior_c)
density(occupancies, fill=(0, :orange, 0.4))
```