Scientific computing

(for the rest of us)

Finding things in arrays

In this module, we will see how to locate interesting values in collections, and how to extract and test the existence of some of these values. This is important knowledge in order to build more advanced programs, and we will put it in action in the following section.

During the module on indexing, we thought about getting the first and last position of an array using begin and end. Julia offers a first and a last function to do just this:

first(1:3)
1
last(1:3)
3

These functions are extremely useful when, for example, we are interesting in the starting or ending position of a collection. But what if we want to do more advanced, condition-based finding of elements?

Before we do this, let’s ask a very important question: given a test we can apply to each element (i.e. a function taking an element of the collection and returning a Boolean), how many elements satisfy it? This is usually done with the count function:

count(isodd, 1:5)
3

This gives us the number of odd numbers between 1 and 5. This is an important information because, when we want to find elements according to a condition, there is always a chance that no element will match this condition! For example, we can count the number of elements larger than 10 in between 1 and 5:

count(x -> x > 10, 1:5)
0

We can use count to decide whether we should start moving forward with a search. But how, exactly, do searches work?

It’s complicated. Or rather, it’s very specialized, and there are a number of ways Julia handles searches, depending on what you are actually after. Let’s say you have a random matrix, and want to know where the largest value is:

max_value, position_max = findmax(rand(10, 10))
(0.9979916740878391, CartesianIndex(7, 8))

This similarly works for the smallest value; we can also apply this to a data structure with more than two dimensions:

min_value, position_min = findmin(rand(10, 10, 10))
(0.0001962598276030736, CartesianIndex(5, 8, 4))

Note that the position_min output is now a CartesianIndex with three dimensions, because we are looking inside a random cube. By contrast, working on a vector will return the linear index of the matched position:

findmax(rand(10))
(0.894871108577834, 4)

There are more general functions we can use to search for anything in a collection. For example, let’s create a random data cube:

R = rand(3, 3, 3)
3×3×3 Array{Float64, 3}:
[:, :, 1] =
 0.708069    0.630992  0.597271
 0.00650731  0.408212  0.518882
 0.900203    0.961816  0.389972

[:, :, 2] =
 0.895605  0.944267  0.808891
 0.742029  0.68305   0.081053
 0.067657  0.530833  0.137727

[:, :, 3] =
 0.695847  0.689044  0.434225
 0.849726  0.511163  0.350809
 0.566887  0.217144  0.971294

We might want to know where the numbers between 0.5 and 0.6 are located:

findall(x -> 0.5 <= x <= 0.6, R)
5-element Vector{CartesianIndex{3}}:
 CartesianIndex(1, 3, 1)
 CartesianIndex(2, 3, 1)
 CartesianIndex(3, 2, 2)
 CartesianIndex(3, 1, 3)
 CartesianIndex(2, 2, 3)

This is a vector of CartesianIndex, with the correct number of dimensions. There are two noteworthy variants to findall: findfirst and findlast. They return, respectively, the first and last position matching the condition:

findfirst(x -> 0.5 <= x <= 0.6, R)
CartesianIndex(1, 3, 1)
findlast(x -> 0.5 <= x <= 0.6, R)
CartesianIndex(2, 2, 3)

We can fine-tune the behavior of a search a lot more, using the findnext and findprev function. For example, let’s say that we have a time series, and want to skip the first 10 values:

ts = rand(200)
findnext(x -> 0.5 <= x <= 0.6, ts, 10)
15

Or we can look for a matching value immediately before this point:

findprev(x -> 0.5 <= x <= 0.6, ts, 10)

Usng findprev and findnext can let you be very efficient about iteration based on criteria, simply by replacing the index of the match in your next call to findprev/findnext. This is useful when you do not want to collect all of the values. For example, we may want to be interested in getting all of the values between 0.5 and 0.6 after then tenth position, but have at least 10 timepoints between each value we collect.

If there is no more entries left in the collection, findnext (and findprev) will return nothing, which makes it easy to break out of the loop (as we have seen in a previous module):

position = findnext(x -> 0.5 <= x <= 0.6, ts, 10)
@info position
while ~isnothing(findnext(x -> 0.5 <= x <= 0.6, ts, position + 10))
    global position
    position = findnext(x -> 0.5 <= x <= 0.6, ts, position + 10)
    @info position
end
[ Info: 15
[ Info: 26
[ Info: 63
[ Info: 98
[ Info: 120
[ Info: 131
[ Info: 155
[ Info: 173
[ Info: 185
[ Info: 198

But wait… Things have been going a little bit too well so far, and we have avoided the situation where we still haven’t found what we’re looking for. So let’s create a situation where there is no match:

findall(x -> x >= 10, 1:5)
Int64[]

This collection is empty, and we can test it with isempty:

isempty(findall(x -> x >= 10, 1:5))
true

This is usually something we can use to decide to break out of a loop, or maybe throw an exception. Note that this is equivalent to using:

iszero(count(x -> x >= 10, 1:5))
true

There are sometimes more than a single idiomatic way to express an idea.

Another series of related features is the ability to manipulate the output of these functions (or indeed, of any array, but this is most closely related to working on the output of searches).

The first functions are something and coalesce – in essence, they return the first value that is not nothing or not missing.

We think that Julia’s insistence of making a difference between nothing, missing, and NaN is one of the language’s best features, especially for data analysis.

Let’s see them in action - we will create a mock array with some nothing and some numbers:

series = [nothing 3 4 nothing 5]
1×5 Matrix{Union{Nothing, Int64}}:
 nothing  3  4  nothing  5
something(series...)
3
These functions must be called on a splatted collection; if not, they will simply return the collection, because it isn’t nothing or missing.

If we call coalesce, we will get nothing as a result, because coalesce will look for the first non-nothing value:

coalesce(series...)

If we change the series to have a missing element, we can see how something and coalesce differ:

series = [missing 3 4 nothing 5]
coalesce(series...)
3

In brief, something and coalesce are extremely useful when you can anticipate that your collection may have some missing or nothing values, but still want to get a result out of it.

Sometimes, we need a little more control, namely when we know that we should only get a single result. For example, we can write our own findmax function:

series = rand(10)
findall(x -> x == maximum(series), series)
1-element Vector{Int64}:
 3

How do we get this value out of the vector?

only(findall(x -> x == maximum(series), series))
3

Why use only and not first here? The answer is: only will throw an exception if there are not exactly one single element in the collection on which it is called!

Before we take this module to a close, there are two additional functions worth knowing about:

any([true false true])
true
any([false false false])
false
all([true false true])
false
all([true true true])
true

The any function will return true if at least one of the elements is true, and all will check that all elements are true. By remembering our Boolean operations and the way reduce works, we can think of these functions as

(x) -> reduce(|, x)
#29 (generic function with 1 method)

for any, and

(x) -> reduce(&, x)
#31 (generic function with 1 method)

for all.

The concepts introduced in this module are very helpful to design code that look for specific information in a flexible way. This will help be more efficient, by writing less code, and combining a few simple ingredients to safely get to the information we need.