In this capstone, we will implement a Naive Bayes classifier, to make
predictions about the cultivar to which wheat seeds belong. This is a
classification task, meaning that we want to get an answer that looks like
`class A`

or `class B`

, as opposed to a regression problem in which we would
like to get an answer like `length = 0.8cm`

. At its core, Naive Bayes
classification assumes that the probability of belonging to a class based on a
single measurement is equal to the probability that this measurement’s values originate
from the distribution of known values for the class. For example, we know that the
road runner (*Tastyus supersonicus*) runs at a speed of 25.8 ± 1.2 mph, whereas
the coyote (*Eatius birdius*) runs at a speed of 21.5 ± 1.8 mph. If we measure
an unidentified animal going 24.6 mph in a area where coyotes are three quarters
of the total population, we can use NBC to get the probability that it is a
coyote or a roadrunner:

```
using Distributions
speed_coyote = Normal(21.5, 1.8)
speed_roadrunner = Normal(25.8, 1.2)
measured_speed = 24.6
P_roadrunner = 0.25 * pdf(speed_roadrunner, measured_speed)
P_coyote = 0.75 * pdf(speed_coyote, measured_speed)
[:roadrunner, :coyote][argmax([P_roadrunner, P_coyote])]
```

```
:roadrunner
```

The Wikipedia page has a very clear introduction to the theory, and we
will adhere to its notation here. In short, given an array $\mathbf{x}$ of
observations of multiple features, the probability ($p(C_k | x)$) that this
sample belongs to the class $C_k$ is *proportional* to $p(C_k) \prod p(x_i |
C_k)$. We can multiply these probabilities together because Naive Bayes
classification makes the assumption of independance between features. This is
precisely why we say that this classifier is ‘naive’. Further, we can get the
class to which the sample should be assigned using $\hat y = \text{argmax}_{k
\in K} p(C_k) \prod p(x_i | C_k)$ - the class that is the most probable is taken
as the guess.

Calculating $p(x_i | C_k)$ assumes that we know the distribution of the values
of the *i*-th variable in class $C_k$, which is not necessarily true. In this
example we will assume that all we know for sure is the average and standard
deviation, meaning that the distribution we will use should be a normal
distribution, which has the maximal entropy given the information available.

We will re-use the dataset from the neural network with *Flux*
capstone, and so we will skip over the details of how these data can be
downloaded:

```
import CSV
using DataFrames
function get_dataset(url, filename)
if !isfile(filename)
download(url, filename)
end
return dropmissing(DataFrame(CSV.File(filename; header=0, delim='\t')))
end
const seed_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"
seeds = get_dataset(seed_url, "seeds.txt");
```

This dataset comprises a bunch of measures taken on wheat kernels. The kernels belong to three different cultivars of wheat. As always, we will rename the columns to make it easier to reference them:

```
rename!(seeds,
[:Column1 => :area, :Column2 => :perimeter,
:Column3 => :compactness, :Column4 => :kernel_length,
:Column5 => :kernel_width, :Column6 => :asymmetry,
:Column7 => :kernel_groove, :Column8 => :cultivar]
)
using Random
Random.seed!(42);
seeds = seeds[shuffle(1:end), :]
first(seeds, 3)
```

```
3×8 DataFrame
Row │ area perimeter compactness kernel_length kernel_width asymme
try ⋯
│ Float64 Float64 Float64 Float64 Float64 Float6
4 ⋯
─────┼─────────────────────────────────────────────────────────────────────
─────
1 │ 17.32 15.91 0.8599 6.064 3.403 3.
824 ⋯
2 │ 17.08 15.38 0.9079 5.832 3.683 2.
956
3 │ 12.79 13.53 0.8786 5.224 3.054 5.
483
2 columns om
itted
```

To see how well our technique performs, we will split the dataset in two, keeping 20 samples for the validation, and the rest to get information on the distributions:

```
seeds_dist = seeds[1:(end-20),:];
leftovers = seeds[(end-19):end,:];
```

As it is, our first task is to summarize the data in `seeds_dist`

in a table
containing, for every measure and every cultivar, the average and standard
deviation. The next steps will involve a fair amount of manipulations using the
`DataFrames`

package, and it is a good idea to look at the documentation for the
various functions. Note that using a dataframe here is absolutely not required,
but in the process of shaping our data, we will use the most common
transformations.

Because we have *a lot* of variables, it is likely easier to reshape
the data to a long format:

```
long_seeds = stack(seeds_dist, Not(:cultivar))
first(long_seeds, 3)
```

```
3×3 DataFrame
Row │ cultivar variable value
│ Float64 String Float64
─────┼─────────────────────────────
1 │ 2.0 area 17.32
2 │ 1.0 area 17.08
3 │ 3.0 area 12.79
```

We can now calculate the mean and standard deviation for all combinations of variables and cultivars:

```
using Statistics
distribution_data = by(long_seeds, Not(:value), μ = :value => mean, σ = :value => std)
first(distribution_data, 3)
```

```
Error: ArgumentError: by function was removed from DataFrames.jl. Use the `
combine(groupby(...), ...)` or `combine(f, groupby(...))` instead.
```

This is the sort of table one can find in a publication - in fact, the greatness of Naive Bayes classification is that it can work even if you only have access to the moments of the distribution, and not to the raw data themselves!

We now need to create statistical distributions from the information in this
table - in *Julia*, a dataframe can store just about any type of information,
so it is perfectly fine to store the object representing the normal distribution
for each variable and cultivar in a new column.

```
distribution_data.distribution = [Normal(row.μ, row.σ) for row in eachrow(distribution_data)];
```

```
Error: UndefVarError: distribution_data not defined
```

Of course, this format is not ideal for what we’re about to do (measure the
probability that a sample belongs to a class), so we will unstack it by cultivar
(this drops the `μ`

and `σ`

columns, but they are not needed anymore):

```
distributions = unstack(distribution_data, :cultivar, :variable, :distribution)
select(distributions, [:cultivar, :area, :compactness])
```

```
Error: UndefVarError: distribution_data not defined
```

And with this information, we are ready to make a prediction. We will keep the
cultivar information from the `leftovers`

dataset, and then remove it to
simulate what would happen if we had, for example, spilled all of our wheat
kernels and had to put them in the correct kernel bags.

```
true_cultivar = leftovers.cultivar
select!(leftovers, Not(:cultivar))
first(leftovers, 3)
```

```
3×7 DataFrame
Row │ area perimeter compactness kernel_length kernel_width asymme
try ⋯
│ Float64 Float64 Float64 Float64 Float64 Float6
4 ⋯
─────┼─────────────────────────────────────────────────────────────────────
─────
1 │ 13.2 13.66 0.8883 5.236 3.232 8.
315 ⋯
2 │ 15.03 14.77 0.8658 5.702 3.212 1.
933
3 │ 15.78 14.91 0.8923 5.674 3.434 5.
593
1 column om
itted
```

Let’s take the first row. We want to get the probability of every observation being taken from the distribution of this variable for every cultivar. The probability that this sample comes from cultivar 2 knowing its area is:

```
pdf(distributions.area[2], leftovers.area[1])
```

```
Error: UndefVarError: distributions not defined
```

To do this for all samples, we will assign a unique `id`

to our observations for
the test, and then stack both dataframes by variable;

```
leftovers.id = 1:size(leftovers,1)
s_leftovers = stack(leftovers, Not(:id))
s_distributions = stack(distributions, Not(:cultivar))
rename!(s_distributions, :value => :distribution)
```

```
Error: UndefVarError: distributions not defined
```

These dataframes have one identical column (`:variable`

), which we can use to
`join`

them:

```
s_merged = join(s_leftovers, s_distributions, on=:variable)
first(s_merged, 3)
```

```
Error: UndefVarError: s_distributions not defined
```

Finally, because we consider all variables independently, we can now get the probability of every measurement coming from its appropriate distribution:

```
s_proba = by(s_merged, [:cultivar, :id, :variable],
[:value, :distribution] => x -> (p = pdf.(x.distribution, x.value))
)
rename!(s_proba, :x1 => :probability)
first(s_proba, 3)
```

```
Error: UndefVarError: s_merged not defined
```

The very last step is to bring everything back together, and to multiply the
probabilities for every variable *within* every cultivar together, and get the
argmax (most probable class):

```
s_p = by(s_proba, [:cultivar, :id], p = :probability => prod)
s_argmax = by(s_p, :id, y = :p => argmax)
```

```
Error: UndefVarError: s_proba not defined
```

This leaves us with a column containing the class we guessed, and we can compare it to the actual class to get the accuracy:

```
accuracy = mean(s_argmax.y .== true_cultivar)
```

```
Error: UndefVarError: s_argmax not defined
```

Naive Bayes classifiers have uncanny accuracy on a lot of problems! This example relies on dataframes to reach the answer, but of course it is a valuable programming exercise to do it starting from a matrix of values.