In this capstone, we will implement a Naive Bayes classifier (NBC) to predict the cultivar to which wheat seeds belong. This is a classification task, meaning that we want an answer that looks like class A or class B, as opposed to a regression problem, in which we would like an answer like length = 0.8cm. At its core, Naive Bayes classification assumes that the probability of belonging to a class, given a measurement, is proportional to the probability of drawing this measurement’s value from the distribution of known values for the class. For example, we know that the roadrunner (Tastyus supersonicus) runs at a speed of 25.8 ± 1.2 mph, whereas the coyote (Eatius birdius) runs at a speed of 21.5 ± 1.8 mph. If we measure an unidentified animal going 24.6 mph in an area where coyotes are three quarters of the total population, we can use NBC to decide whether it is more likely a coyote or a roadrunner:

using Distributions

speed_coyote = Normal(21.5, 1.8)
speed_roadrunner = Normal(25.8, 1.2)
measured_speed = 24.6

# Prior for each class, times the likelihood of the measured speed
P_roadrunner = 0.25 * pdf(speed_roadrunner, measured_speed)
P_coyote = 0.75 * pdf(speed_coyote, measured_speed)

[:roadrunner, :coyote][argmax([P_roadrunner, P_coyote])]
:roadrunner

The Wikipedia page has a very clear introduction to the theory, and we will adhere to its notation here. In short, given an array $\mathbf{x}$ of observations of multiple features, the probability $p(C_k \mid \mathbf{x})$ that this sample belongs to the class $C_k$ is proportional to $p(C_k) \prod_i p(x_i \mid C_k)$. We can multiply these probabilities together because Naive Bayes classification makes the assumption of independence between features; this is precisely why the classifier is called ‘naive’. Further, we can get the class to which the sample should be assigned using $\hat y = \text{argmax}_{k \in \{1, \dots, K\}} \, p(C_k) \prod_i p(x_i \mid C_k)$ - the most probable class is taken as the guess.
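
To make the product over features concrete, here is a minimal sketch extending the roadrunner example to a second measurement; the body mass distributions below are entirely hypothetical, and are only there to illustrate how per-feature likelihoods multiply:

using Distributions

speeds = (roadrunner = Normal(25.8, 1.2), coyote = Normal(21.5, 1.8))
masses = (roadrunner = Normal(0.3, 0.05), coyote = Normal(12.0, 2.5)) # hypothetical, in kg
priors = (roadrunner = 0.25, coyote = 0.75)

x = (speed = 24.6, mass = 10.2) # an unidentified animal

# Under the independence assumption, per-feature likelihoods multiply
scores = [
	priors.roadrunner * pdf(speeds.roadrunner, x.speed) * pdf(masses.roadrunner, x.mass),
	priors.coyote * pdf(speeds.coyote, x.speed) * pdf(masses.coyote, x.mass),
]
[:roadrunner, :coyote][argmax(scores)]

With these (invented) masses, the mass evidence overwhelms the speed evidence, and the guess flips to :coyote.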

Calculating $p(x_i \mid C_k)$ assumes that we know the distribution of the values of the $i$-th variable in class $C_k$, which is not necessarily true. In this example, we will assume that all we know for sure is the average and standard deviation, in which case the distribution to use is the normal distribution, which has the maximal entropy given the information available.
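
As a quick illustration in Distributions.jl terms (the numbers below are invented), knowing only a mean and a standard deviation is enough to build the corresponding normal distribution; and if we did have the raw values, we could fit one by maximum likelihood instead:

using Distributions

# Hypothetical summary statistics for one variable in one class
μ, σ = 14.3, 1.2
d = Normal(μ, σ) # the maximum-entropy distribution given a mean and sd

# With raw values (simulated here for illustration), fit by maximum likelihood:
x = rand(d, 500)
fit(Normal, x) # approximately Normal(14.3, 1.2)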

We will re-use the dataset from the neural network with Flux capstone, and so we will skip over the details of how these data can be downloaded:

import CSV
using DataFrames

function get_dataset(url, filename)
	if !isfile(filename)
		download(url, filename)
	end
	return dropmissing(DataFrame(CSV.File(filename; header=0, delim='\t')))
end
const seed_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"
seeds = get_dataset(seed_url, "seeds.txt");

This dataset comprises a number of measurements taken on wheat kernels, which belong to three different cultivars of wheat. As always, we will rename the columns to make them easier to reference:

rename!(seeds,
  [:Column1 => :area, :Column2 => :perimeter,
  :Column3 => :compactness, :Column4 => :kernel_length,
  :Column5 => :kernel_width, :Column6 => :asymmetry,
  :Column7 => :kernel_groove, :Column8 => :cultivar]
  )

using Random
Random.seed!(42);
seeds = seeds[shuffle(1:end), :]
first(seeds, 3)
3×8 DataFrame
 Row │ area     perimeter  compactness  kernel_length  kernel_width  asymmetry ⋯
     │ Float64  Float64    Float64      Float64        Float64       Float64   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │   17.32      15.91       0.8599          6.064         3.403      3.824 ⋯
   2 │   17.08      15.38       0.9079          5.832         3.683      2.956
   3 │   12.79      13.53       0.8786          5.224         3.054      5.483
                                                               2 columns omitted

To see how well our technique performs, we will split the dataset in two, keeping 20 samples for validation, and using the rest to get information on the distributions:

seeds_dist = seeds[1:(end-20),:];
leftovers = seeds[(end-19):end,:];

Our first task is to summarize the data in seeds_dist in a table containing, for every measure and every cultivar, the average and standard deviation. The next steps will involve a fair amount of manipulation using the DataFrames package, and it is a good idea to look at the documentation for the various functions. Note that using a dataframe here is not strictly required, but in the process of shaping our data, we will use the most common transformations.

Because we have a lot of variables, it is likely easier to reshape the data to a long format:

long_seeds = stack(seeds_dist, Not(:cultivar))
first(long_seeds, 3)
3×3 DataFrame
 Row │ cultivar  variable  value
     │ Float64   String    Float64
─────┼─────────────────────────────
   1 │      2.0  area        17.32
   2 │      1.0  area        17.08
   3 │      3.0  area        12.79

We can now calculate the mean and standard deviation for all combinations of variables and cultivars:

using Statistics
distribution_data = combine(groupby(long_seeds, [:cultivar, :variable]),
	:value => mean => :μ, :value => std => :σ)
first(distribution_data, 3)

This is the sort of table one can find in a publication - in fact, the greatness of Naive Bayes classification is that it can work even if you only have access to the moments of the distribution, and not to the raw data themselves!

We now need to create statistical distributions from the information in this table - in Julia, a dataframe can store just about any type of information, so it is perfectly fine to store the object representing the normal distribution for each variable and cultivar in a new column.

distribution_data.distribution = [Normal(row.μ, row.σ) for row in eachrow(distribution_data)];

Of course, this format is not ideal for what we’re about to do (measure the probability that a sample belongs to a class), so we will unstack it by cultivar, and sort the rows so that row i corresponds to cultivar i (this drops the μ and σ columns, but they are not needed anymore):

distributions = unstack(distribution_data, :cultivar, :variable, :distribution)
sort!(distributions, :cultivar); # make sure row i corresponds to cultivar i
select(distributions, [:cultivar, :area, :compactness])

And with this information, we are ready to make a prediction. We will keep the cultivar information from the leftovers dataset, and then remove it to simulate what would happen if we had, for example, spilled all of our wheat kernels and had to sort them back into the correct bags.

true_cultivar = leftovers.cultivar
select!(leftovers, Not(:cultivar))
first(leftovers, 3)
3×7 DataFrame
 Row │ area     perimeter  compactness  kernel_length  kernel_width  asymmetry ⋯
     │ Float64  Float64    Float64      Float64        Float64       Float64   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │   13.2       13.66       0.8883          5.236         3.232      8.315 ⋯
   2 │   15.03      14.77       0.8658          5.702         3.212      1.933
   3 │   15.78      14.91       0.8923          5.674         3.434      5.593
                                                                1 column omitted

Let’s take the first row. We want to get, for every cultivar, the probability of each observation being drawn from the corresponding variable’s distribution. For example, the likelihood of this sample’s area under the distribution for cultivar 2 is:

pdf(distributions.area[2], leftovers.area[1])

To do this for all samples, we will assign a unique id to each observation, and then stack both dataframes by variable:

leftovers.id = 1:size(leftovers, 1)
s_leftovers = stack(leftovers, Not(:id))
s_distributions = stack(distributions, Not(:cultivar))
rename!(s_distributions, :value => :distribution)

These dataframes share a column (:variable), which we can use to join them:

s_merged = innerjoin(s_leftovers, s_distributions, on=:variable)
first(s_merged, 3)

Finally, because we consider all variables to be independent, we can now get the probability of every measurement coming from its appropriate distribution:

s_proba = combine(groupby(s_merged, [:cultivar, :id, :variable]),
	[:value, :distribution] => ((v, d) -> pdf.(d, v)) => :probability)
first(s_proba, 3)

The very last step is to bring everything back together: for every sample and every cultivar, we multiply the probabilities across variables, and then take the most probable cultivar as our guess. Note that the three cultivars are (almost) equally represented in the training data, so the prior $p(C_k)$ is essentially the same for every class, and leaving it out does not change the argmax:

s_p = combine(groupby(s_proba, [:cultivar, :id]), :probability => prod => :p)
# For every sample, pick the cultivar with the highest score
s_argmax = combine(groupby(s_p, :id),
	[:cultivar, :p] => ((c, p) -> c[argmax(p)]) => :y)

This leaves us with a column containing the class we guessed, and we can compare it to the actual class to get the accuracy:

accuracy = mean(s_argmax.y .== true_cultivar)

Naive Bayes classifiers can be surprisingly accurate on a lot of problems! This example relies on dataframes to reach the answer, but it is a valuable programming exercise to do it again starting from a matrix of values, as in the sketch below.
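
As a starting point for that exercise, here is one possible sketch working directly from a matrix of values. The function names are ours (not from any package), equal priors are assumed, and we sum log-densities rather than multiplying raw densities, which avoids numerical underflow when there are many features:

using Distributions, Statistics

# X is observations × features, y is a vector of class labels
function train_nbc(X, y)
	classes = sort(unique(y))
	# One normal distribution per (class, feature) pair
	dists = [Normal(mean(X[y .== c, j]), std(X[y .== c, j]))
		for c in classes, j in 1:size(X, 2)]
	return classes, dists
end

function predict_nbc(classes, dists, x)
	# Equal priors assumed, so the prior term drops out of the argmax
	scores = [sum(logpdf(dists[i, j], x[j]) for j in 1:size(dists, 2))
		for i in eachindex(classes)]
	return classes[argmax(scores)]
end

# Usage, e.g.: predict_nbc(train_nbc(Xtrain, ytrain)..., Xtest[1, :])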