In this lesson, we will create neural networks using Flux, a performant and elegant package for doing machine learning in Julia. Flux is very well documented, and multiple step-by-step examples have been written to provide users with a solid understanding of how it can be used to build machine learning models in a few lines of code only.

We will be using the seeds dataset, which we believe is a better version of the widely used iris dataset. It consists of 7 measures of wheat kernels. Each kernel belongs to one of the three cultivars labeled in the dataset. The aim of our models is to predict cultivars from the kernel features within a supervised learning framework. We will define and train two models before evaluating their predictive performance: a single-layer neural network and a deep neural network.

using Flux
import CSV
using DataFrames
using Random
using Statistics

The first thing we need to do is download our data, which we will store in our local folder. The data are available from the UCI Machine Learning repository. To make sure that we do not download the dataset more often than necessary, we will write a short function to download the dataset if it doesn’t exist, and if it exists, to return it as a data frame:

function get_dataset(url, filename)
	if !isfile(filename)
		download(url, filename)
	return dropmissing(DataFrame(CSV.File(filename; header=0, delim='\t')))
const seed_url = ""
seeds = get_dataset(seed_url, "seeds.txt");
first(seeds, 3)
3×8 DataFrame
 Row │ Column1  Column2  Column3  Column4  Column5  Column6  Column7  Colum
     │ Float64  Float64  Float64  Float64  Float64  Float64  Float64  Float
   1 │   15.26    14.84   0.871     5.763    3.312    2.221    5.22       1
   2 │   14.88    14.57   0.8811    5.554    3.333    1.018    4.956      1
   3 │   14.29    14.09   0.905     5.291    3.337    2.699    4.825      1

Note that we rely on dropmissing, because the original dataset is not properly formated, and some columns are delimited by more than one tabulation. These malformed rows will be dropped.

At this point, our column names are not very informative - we can look up the metadata for this dataset, and rename them as follows:

Variablecolumnnew name
length of kernel4:kernel_length
width of kernel5:kernel_width
asymmetry coefficient6:asymmetry
length of kernel groove7:kernel_groove
cultivar (ID)8:cultivar
  [:Column1 => :area, :Column2 => :perimeter,
  :Column3 => :compactness, :Column4 => :kernel_length,
  :Column5 => :kernel_width, :Column6 => :asymmetry,
  :Column7 => :kernel_groove, :Column8 => :cultivar]
5×8 DataFrame
 Row │ area     perimeter  compactness  kernel_length  kernel_width  asymme
try ⋯
     │ Float64  Float64    Float64      Float64        Float64       Float6
4   ⋯
   1 │   15.26      14.84       0.871           5.763         3.312      2.
221 ⋯
   2 │   14.88      14.57       0.8811          5.554         3.333      1.
   3 │   14.29      14.09       0.905           5.291         3.337      2.
   4 │   13.84      13.94       0.8955          5.324         3.379      2.
   5 │   16.14      14.99       0.9034          5.658         3.562      1.
355 ⋯
                                                               2 columns om

To ensure that our results will be consistent every time we run them, we will set the initial state of our random number generator.


At this point, we still need to decide on how many samples will be used for training, and how many will be used for testing. As usual, we will use 70% of the dataset for training, and so we need to calculate how many samples this amounts to.

n_training = convert(Int64, round(0.7*size(seeds, 1);digits=0))

Because the data set is ordered (by cultivar, specifically), we cannot simply take the first 139 samples. Instead, we will shuffle the rows of our dataframe, and take the first 139 of that:

seeds = seeds[shuffle(1:end), :]
first(seeds, 3)
3×8 DataFrame
 Row │ area     perimeter  compactness  kernel_length  kernel_width  asymme
try ⋯
     │ Float64  Float64    Float64      Float64        Float64       Float6
4   ⋯
   1 │   17.32      15.91       0.8599          6.064         3.403      3.
824 ⋯
   2 │   17.08      15.38       0.9079          5.832         3.683      2.
   3 │   12.79      13.53       0.8786          5.224         3.054      5.
                                                               2 columns om

We are now satistified about the fact that the dataset is aranged in a random way. This means that we can split it into a training and testing component:

trn_set, tst_set = seeds[1:n_training, :], seeds[(n_training+1):end, :]
(139, 8)

The next step is to extract the features (information we use for classification) and the labels (the categories to predict). Because our cultivar information is stored as a floating point number, we must first one-hot encode it. As the labels are initially randomly arranged, it is also very important to specificy their order. For performance reasons, Flux requires that instances be stored as columns in a matrix, so we will also need to transpose our data. Because this will need to be done for both testing and training set, we can write a function:

function get_features_and_labels(data_set)
	features = transpose(convert(Matrix, data_set[!, Not(:cultivar)]))
	labels = Flux.onehotbatch(data_set.cultivar, sort(unique(data_set.cultivar)))
	return (features, labels)
get_features_and_labels (generic function with 1 method)

With this function, we do not need to type the same code twice to get our training and testing sets correctly split:

trn_features, trn_labels = get_features_and_labels(trn_set);
tst_features, tst_labels = get_features_and_labels(tst_set);

As a starting point, we will use a simple neural network with one input layer, containing one input neuron for every feature, and one output neuron for every label, which will be densely connected. We will then apply the softmax function to decide which label should be assigned. This is done using the following syntax in Flux:

one_layer = Chain(
	Dense(size(trn_features,1), size(trn_labels,1)),
Chain(Dense(7, 3), softmax)

In order to train this network, we need to decide on an optimiser, which in our case will be a gradient descent with a rate of learning of 0.01:

optimizer = Descent(0.01)

The final step is to decide on a loss function; for a classification problem, we can usually start with cross entropy. This function will take a series of features as a first input, and the expected labels as the second input.

loss(x, y) = Flux.crossentropy(one_layer(x), y)
loss (generic function with 1 method)

Before we start, we will wrap the data into an iterator, to repeat them for every training epoch - we will start with 2000 training epochs:

data_iterator = Iterators.repeated((trn_features, trn_labels), 2000);

Once all of this set-up work is done, training the network is very straightforward:

Flux.train!(loss, params(one_layer), data_iterator, optimizer)

As a first idea of the performance of this model, we might want to look at its accuracy on the training set. Accuracy in this case is defined as the proportion of correct guesses. To get the correct guesses, we need to one-cold our labels, to transform them into a series of 1, 2, and 3 corresponding to the various cultivars.

function accuracy(model, feat, labs)
	pred = Flux.onecold(model(feat))
	obsv = Flux.onecold(labs)
	return mean(pred .== obsv)
accuracy(one_layer, trn_features, trn_labels)

We can compare this to the testing set:

accuracy(one_layer, tst_features, tst_labels)

This is a little bit lower, which is not necessarily surprising as we have used a small dataset, a shallow network, raw input data, and a relatively modest number of training epochs. We might now want to look at the confusin matrix for this network:

function confusion_matrix(model, feat, labs)
  plb = Flux.onehotbatch(Flux.onecold(model(feat)), 1:3)
  labs * plb'
confusion_matrix(one_layer, tst_features, tst_labels)
3×3 Array{Int64,2}:
 16   1   2
  1  20   0
  3   0  17

A large majority of large values are on the diagonal, which corresponds to correct predictions, but there are a few outside of the diagonal. To fix this, we will try to complexify the network a little bit, maybe by adding a hidden network, and changing the activation function of the first layer to a ReLU (the default is to use a sigmoid):

hidden_size = 12
two_layers = Chain(
	Dense(size(trn_features,1), hidden_size, relu),
	Dense(hidden_size, size(trn_labels,1)),
Chain(Dense(7, 12, relu), Dense(12, 3), softmax)

We will re-define the loss function:

v2_loss(x, y) = Flux.crossentropy(two_layers(x), y)
v2_loss (generic function with 1 method)

And we can keep the data and optimiser as they are,

Flux.train!(v2_loss, params(two_layers), data_iterator, optimizer)

When the training is finished, we can also look at the accuracy on the training and testing sets:

accuracy(two_layers, trn_features, trn_labels)
accuracy(two_layers, tst_features, tst_labels)

Let’s summarize this information, by comparing the gain in accuracy when adding one layer:

DatasetOne-layer modelTwo-layers modelChange in accuracy

We can confirm that the confusion matrix is also better, in that it has more elements on the diagonal:

confusion_matrix(two_layers, tst_features, tst_labels)
3×3 Array{Int64,2}:
 17   1   1
  2  19   0
  1   0  19

For more information on neural networks and deep learning, we suggest the free online book of the same name, which has a lot of annotated code examples, as well as information about the mathematics behind all of this.