Approximate Bayesian computation, or ABC for short, is a very useful heuristic to estimate the posterior distribution of model parameters, specifically when the analytical expression of the likelihood function is unavailable (or when we can’t be bothered to figure it out). The theory on how ABC works will not be covered here in detail, so reading the previous article (and the references it links to) is highly recommended.

We will rely on a few packages for this example:

using StatsPlots     # Plots of statistical distributions
using Statistics     # mean and std
using Distributions  # Statistical distributions

Let us now imagine an island. It is a small island, no more than a few meters in diameter, in the Florida Keys. On this island is a single shrub, and on this single shrub an insect can live. Every year, for 20 years, a few biologists comb through the shrub, to determine whether the insect is present or absent. This results in a timeseries, like this:

| Year | Status  |
|------|---------|
| 1    | present |
| 2    | absent  |
| 3    | absent  |
| 4    | absent  |
| 5    | present |
| 6    | present |
| 7    | absent  |
| 8    | present |
| 9    | present |
| 10   | absent  |

To begin with, let us see how we can model the presence/absence of this species. We will make two assumptions. First, the biological process here can be represented by modeling the probabilities of transition, i.e. the chance of changing from one state to another. Second, there is a small chance of error when measuring the state of the system; this is because our insect has evolved really good mimicry, and is more or less the same color as the shrub. Specifically, while detecting at least one individual means that there is no chance that the species is absent, not detecting any individual can also mean that the insects were present at low density and went undetected. The problem is therefore that if we want to think about the extinction-colonization dynamics of this island, our empirical dataset adds some uncertainty due to the measurement. We will use ABC to quantify this uncertainty, and to estimate the value of the extinction and colonization parameters so that we can conduct additional simulations.

We can represent this system using the following figure:

graph LR
  subgraph True state
    present -- e --> absent
    absent -- c --> present
    absent -- 1-c --> absent
    present -- 1-e --> present
  end
  subgraph Measured state
    absent --> InsectAbsent[absent]
    present -- 1-m --> InsectPresent[present]
    present -- m --> InsectAbsent[absent]
  end

In the true state box, the rates represent the probabilities of state change. An empty island can be colonized at rate $c$; an occupied island can become empty through extinction at rate $e$; finally, we can miss the insect during sampling at rate $m$. The rates between true state and measured state represent the probabilities of making the wrong measurement. We now have enough to write a simple function to simulate this model. The point of this function is to return a timeseries, where an occupied island at a given year is true, and an empty island is false. Because there are two distinct processes, we will first simulate the true state of the island, and then write a second function to simulate the measured state:
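Spelled out, the observation layer amounts to two conditional probabilities: a truly occupied island is recorded as occupied with probability $1-m$, and a truly empty island is always recorded as empty:

$$P(\text{measured present} \mid \text{present}) = 1 - m, \qquad P(\text{measured absent} \mid \text{absent}) = 1$$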

function island_true_state(e::T, c::T; n=200) where {T <: AbstractFloat}
  @assert 0.0 ≤ e ≤ 1.0
  @assert 0.0 ≤ c ≤ 1.0
  state = zeros(Bool, n) # the island starts empty
  for year in 2:n
    # an occupied island stays occupied unless it goes extinct (rate e);
    # an empty island becomes occupied if it is colonized (rate c)
    state[year] = state[year-1] ? rand() ≥ e : rand() < c
  end
  return state
end

function island_measured_state(state::Vector{Bool}, m::T) where {T <: AbstractFloat}
  measured_state = similar(state)
  for (year, real_state) in enumerate(state)
    # a truly occupied island is detected unless we miss it (rate m);
    # a truly empty island is always measured as empty
    measured_state[year] = real_state ? rand() ≥ m : false
  end
  return measured_state
end

function island(e::T, c::T, m::T; n=200) where {T <: AbstractFloat}
  true_state = island_true_state(e, c; n=n)
  measured_state = island_measured_state(true_state, m)
  return measured_state
end
island (generic function with 1 method)

This is not the most efficient way to express this problem (we will need to iterate over our timeseries twice!), but it will be good enough for this example; most importantly, it means that we will be able to use island_true_state to generate simulations without the observation error after we are done estimating the values of the parameters.
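That said, if the double iteration really bothered us, a single-pass version is easy to sketch; note that it gives up the separate island_true_state function we will rely on later (the function name here is made up for this example):

function island_single_pass(e::T, c::T, m::T; n=200) where {T <: AbstractFloat}
  state = zeros(Bool, n)
  measured = zeros(Bool, n)
  for year in 2:n
    # simulate the true state and its measurement within the same loop
    state[year] = state[year-1] ? rand() ≥ e : rand() < c
    measured[year] = state[year] ? rand() ≥ m : false
  end
  return measured
end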

In ABC, one key notion is the idea of “summary statistics”, i.e. the act of compressing the empirical data and model output to something that can be meaningfully compared. The selection of summary statistics can be very complex, and is in fact a key step in the success of ABC. Using too many exposes you to the curse of dimensionality, and using too few (or the wrong ones) can lead to spurious conclusions.

Here, we will work with two pieces of information, namely the rate of transition and the temporal occupancy. A few other indicators we could have used are apparent colonization events (when the island appears empty one year, then occupied the next), apparent extinction events (when the island appears occupied one year, then empty the next), and finally the number of times the sequence occupied/empty/occupied appears. Feel free to try and adapt the code after you have been through this lesson once to use these and other indicators, and see how the results change; one example is sketched below. A good rule of thumb is to think about summary statistics both as a statistician and a domain expert: you want to capture the structure of your data, but also the mechanisms of interest.
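For instance, a counter of apparent colonization events could look like this (a sketch; the function name is my own, and it follows the same conventions as the indicators we define below):

function apparent_colonizations(t::Vector{Bool})
  n = 0.0
  for i in 2:length(t)
    # an apparent colonization: empty one year, occupied the next
    if !t[i-1] && t[i]
      n = n + 1.0
    end
  end
  return n/(length(t)-1)
end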

In any case, here is our function to summarize a sequence of observations:

function summary(t::Vector{Bool})
  transitions = 0.0
  for i in 2:length(t)
    if t[i] ≠ t[i-1]
      transitions = transitions + 1.0
    end
  end
  occupancy = sum(t)/length(t)
  return [occupancy, transitions/(length(t)-1)]
end
summary (generic function with 1 method)

Do you like it? It’s OK I guess, but I’m not a big fan. As we discussed in the previous paragraph, we may want to use different summary statistics, and this would require re-writing this function. This is a good example of our code being insufficiently modular: each summary statistic should be its own function. Let’s correct this.

occupancy(t::Vector{Bool}) = sum(t)/length(t)

function transitions(t::Vector{Bool})
  n = 0.0
  for i in 2:length(t)
    if t[i] ≠ t[i-1]
      n = n + 1.0
    end
  end
  return n/(length(t)-1)
end
transitions (generic function with 1 method)

We have defined two functions, each of which returns the value of one indicator. We can now write a second method for summary that will accept the states timeseries, and an array of functions to measure the indicators:

function summary(t::Vector{Bool}, f::Vector{T}) where {T <: Function}
  s = zeros(Float64, length(f))
  for (i, fn) in enumerate(f)
    s[i] = fn(t)
  end
  return s
end
summary (generic function with 2 methods)

Is it perfect? No. Is it an improvement over the previous implementation? Assuredly, so let’s roll with this. We can apply this function to a simulation with parameters $(e=0.2, c=0.6, m=0.1)$, and get the summary statistics:

summary(
   island(0.2, 0.6, 0.1),   # simulation output
   [occupancy, transitions] # list of things to measure
   )
2-element Array{Float64,1}:
 0.69
 0.40703517587939697

At this point, we need empirical data to feed the model. Let’s say that over twenty years, the species has been observed from year 4 to 12, then 14 to 15, and finally from year 17 to 20. We can write this as:

empirical_data = zeros(Bool, 20)
empirical_data[4:12] .= true
empirical_data[14:15] .= true
empirical_data[17:20] .= true

Your biologist colleague would be very happy if you could get to an estimate of the parameters that govern the presence of this insect on the island, minus the observation error. This is an easy enough task to do with ABC.

We can measure the statistics of this timeseries:

summary(empirical_data)
2-element Array{Float64,1}:
 0.75
 0.2631578947368421

Good parameter values will result in simulations that have similar summary statistics. Keep in mind that the empirical data reflect both the real state of the island and the measurement error, so we need to use the island model (and not island_true_state) for the simulations.

And just like this, we are ready to start the ABC process. We will start by deciding on priors for all three parameters. We could model them in a variety of ways, including beta distributions and truncated distributions. Let's go with the latter, as it is a neat illustration of the Distributions package. We picked a value of colonization higher than extinction (as the species seems to persist), and a relatively low rate of false absences, albeit with a larger standard deviation because we are not really sure what this parameter should look like.

Dc = Truncated(Normal(0.6, 0.1), 0.0, 1.0)
De = Truncated(Normal(0.3, 0.1), 0.0, 1.0)
Dm = Truncated(Normal(0.1, 0.3), 0.0, 1.0)

In this algorithm (as with any other Bayesian application), priors are critical to the success of the process; indeed, the posterior distribution will be a subset of the prior, because (as we discuss below) ABC relies on rejection sampling, i.e. it winnows parameter values that make little sense out of the original distribution. Picking a prior which is too narrow will prevent the algorithm from converging onto the correct posterior. Conversely, picking a prior that is too flat (let's say a uniform distribution that goes from negative infinity to positive infinity) means that the time to convergence will explode.
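A quick way to check that a prior is neither too narrow nor too flat is to look at which values it considers plausible, for instance through its quantiles (quantile is provided by Distributions; the printing code is just one way to display them):

for (name, D) in zip(["c", "e", "m"], [Dc, De, Dm])
  # 2.5%, 50%, and 97.5% quantiles of each prior
  println(name, ": ", [round(quantile(D, q); digits=3) for q in (0.025, 0.5, 0.975)])
end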

We will generate 10⁶ samples using our simulation model, and get the summary statistics for all of them. Note that the simulations run for 200 timesteps, as opposed to 20 for the empirical data. This is perfectly fine, because ABC works on summary statistics, and not on the raw data. In fact, if we only had the summary statistics and not the raw data, we could still apply ABC! An interesting exercise is to change the number of timesteps in the simulations, and see if/how the posterior distributions react.

N = 1_000_000
Sc = rand(Dc, N)
Se = rand(De, N)
Sm = rand(Dm, N)
simulated_results = summary.(map(i -> island(Se[i], Sc[i], Sm[i]), 1:N))
ABC is an excellent exercise for parallel computing! Because the model runs are independent, this is an “embarrassingly parallel” problem. After reading the parallel computing primer, you may want to try it out on this problem.
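As a starting point, a minimal multi-threaded sketch of the sampling loop could look like this (assuming Julia was started with several threads, e.g. julia -t 4; each iteration writes to its own slot, so there is no race on the output array):

simulated_results = Vector{Vector{Float64}}(undef, N)
Threads.@threads for i in 1:N
  simulated_results[i] = summary(island(Se[i], Sc[i], Sm[i]))
end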

We can now measure the distance between these simulated results and the empirical results:

function euclidean_distance(x1::Vector{T}, x2::Vector{T}) where {T <: AbstractFloat}
  # square root of the sum of squared differences
  return sqrt(sum((x1 .- x2).^2.0))
end
empirical_summary = summary(empirical_data)
distances = map(s -> euclidean_distance(s, empirical_summary), simulated_results)

We could use another distance measure; depending on the type of summary statistics, it might even be required. We can also assign weights to the different summary statistics, but this has a risk of overly fine-tuning the process, and it can be difficult to assess the consequences of these choices.
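If you wanted to experiment with weighting, a sketch could be as simple as the following (weighted_distance is a name I made up; w holds one weight per summary statistic):

function weighted_distance(x1::Vector{T}, x2::Vector{T}, w::Vector{T}) where {T <: AbstractFloat}
  # same as euclidean_distance, with each squared difference scaled by its weight
  return sqrt(sum(w .* (x1 .- x2).^2.0))
end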

density(distances, leg=false, fill=(0, :orange, 0.4), c=:orange)
xaxis!("Distance", (0,1))
yaxis!("Density", (0, 4))

Distance between the summaries of the empirical and simulated data.

The actual ABC step is to reject some of the samples from the prior distribution, to only keep the parameter combinations that produce results close to the empirical data. This is done by setting a threshold: samples whose distance exceeds this threshold are rejected (this is why the choice of a good prior distribution is crucial). In this case, we picked a threshold value of $\theta = 0.02$. Why? Well, for the same reason we idolize $p \le 0.05$: “dunno, it looks fine I guess?”. You are, once again, encouraged to change this value, and the density of distances presented in the previous plot should help you decide which values are too large or too small; and as always, you can increase the number of samples if you want to be very strict.

θ = 0.02
posterior = findall(distances.<θ)
Another approach to rejection sampling would be to keep generating samples until the sample size of the posterior has reached a certain threshold. Both solutions are valid, but for the sake of illustration, we went with the one that was simpler to implement. Feel free to work on other rejection samplers as a programming exercise; this will be a good opportunity to practice writing while loops, and one possible outline is sketched below.
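Here is one possible outline of such a sampler, left deliberately bare (the function and variable names are my own; it relies on the priors Dc, De, Dm defined above):

function sample_until(n_accepted, θ, reference)
  accepted = Vector{Vector{Float64}}()
  # keep drawing from the priors until enough samples are accepted,
  # e.g. sample_until(1000, θ, summary(empirical_data))
  while length(accepted) < n_accepted
    e, c, m = rand(De), rand(Dc), rand(Dm)
    if euclidean_distance(summary(island(e, c, m)), reference) < θ
      push!(accepted, [e, c, m])
    end
  end
  return accepted
end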
density(Sc, c=:teal, ls=:dash, lab="")
density!(Se, c=:purple, ls=:dash, lab="")
density!(Sm, c=:grey, ls=:dash, lab="")

density!(Sc[posterior], c=:teal, fill=(0, :teal, 0.3), lab="Colonization")
density!(Se[posterior], c=:purple, fill=(0, :purple, 0.3), lab="Extinction")
density!(Sm[posterior], c=:grey, fill=(0, :grey, 0.3), lab="Error")

xaxis!("Parameter value", (0,1))
yaxis!("Density", (0, 14))

Posterior distributions (shaded) and the corresponding priors (dashed) for the three parameters in the model.

To summarize, we can now extract the values of the different parameters:

|     | Meaning                | Mean  | Standard deviation |
|-----|------------------------|-------|--------------------|
| $c$ | Colonization rate      | 0.551 | 0.066              |
| $e$ | Extinction rate        | 0.154 | 0.04               |
| $m$ | Measurement error rate | 0.05  | 0.038              |
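These numbers are simply the mean and standard deviation of the accepted samples, computed with mean and std from Statistics (loaded at the top); for instance:

for (name, S) in zip(["c", "e", "m"], [Sc, Se, Sm])
  # posterior mean and standard deviation of each parameter
  println(name, ": ", round(mean(S[posterior]); digits=3), " ± ", round(std(S[posterior]); digits=3))
end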

Done! One of the strengths of ABC is that we can now generate datasets, using our original model for simulation. And because we know the measurement error rate, we can also correct these datasets to describe the data generation process (the actual presence/absence of the species) as opposed to the data observation process (measuring the presence/absence and being sometimes wrong about it).

We now have everything we need to answer our original question: what would the extinction/colonization dynamics look like minus the measurement error? For this, we can use our island_true_state model, which only takes e and c as inputs, and generate timeseries. We will focus on the average temporal occupancy:

posterior_c = Sc[posterior]
posterior_e = Se[posterior]

occupancies = map((e,c) -> occupancy(island_true_state(e,c)), posterior_e, posterior_c)
density(occupancies, fill=(0, :orange, 0.4))

Distribution of the predicted occupancy without measurement error based on the posterior distribution.