Scientific computing

(for the rest of us)

Exiting loops

In the previous two modules, we have written loops that terminate when a condition is met (while), or when the collection has been iterated over entirely (for). In some cases, we may want to fine-tune the behavior of our iteration. In these cases, we can use some special keywords to jump out of the loop entirely, or skip some steps.

Let’s say that we want to generate sequences of mRNA that look legit enough to fool a very naïve program, which is a good thing because we will use a very naïve program to generate them. Specifically, we want sequences that will have an AUG codon at the start, end with one of ochre (UAA), amber (UAG), or opal (UGA), located at some multiple number of three bases after the start codon. To make things a little more fun, we can also get sequences that are only between 40 and 50 bases long.

An example of such a sequence is (minus the length requirement):

seq = "AUGUGAGAGUAA"
"AUGUGAGAGUAA"

Assuming that our RNA sequence is going to be represented as a string, we can check its “validity” with a series of simple tests. First, that the length of the sequence is a multiple of three:

iszero(length(seq) % 3)
true

Then that the first three positions are a start codon:

isequal("AUG")(seq[1:3])
true

Finally, that the last three positions are a stop codon:

seq[(end - 2):end] in ["UAA", "UAG", "UGA"]
true

We can now start generating sequences like this at random. Our strategy in this module will be to start from a giant sequence:

rnd_seq = String(rand("AUGC", 10_000));
rnd_seq[1:10]
"CGCUAUACCA"

We will then look for possible initiations sites (using the findall function), then for all of the possible sites we found, we can look for all possible stop positions, then check that the sequence has a length between 40 and 50 which is also a multiple of three. And because we presumably want to repeat this process a lot, we want to do as few iterations as possible. This is where continue and break will be very useful.

In order to use continue and break efficiently, it helps to have a sense of when we need to skip to the next possible solution, and a sense of when we can stop the iteration entirely. Let’s therefore setup a quick example. We can find the first start codon:

start_position = findfirst("AUG", rnd_seq)
30:32

The codon position is represented as a range (you can check with typeof), specifically a UnitRange. We can therefore get the position where it starts with

start_position[begin]
30

We can look for the next UGA stop codon after this position (this is mostly an unsubtle way to get you to look at the documentation for findnext, as we will use findall to brute-force our way through the actual problem):

stop_position = findnext("UGA", rnd_seq, start_position[end])
76:78

Again, this is expressed as a range, and so the sequence will stop at

stop_position[end]
78

Its length is therefore

stop_position[end] - start_position[begin] + 1
49

When do we want to skip an iteration? A first criteria is that, assuming we do not call findnext, the stop codon starts before the start codon – this is when we call continue to move on, as nothing about this sequence will be useful (the sequence would in fact be empty). Using the short-circuit notation to avoid writing a full if statement, we can express this as the following one-liner (it is not going to work outside of a loop, as continue has no effect):

(stop_position[begin] > start_position[end]) || continue

Similarly, we want to skip ahead if the sequence length is not a multiple of three:

iszero(seq_length % 3) || continue

Only when the sequence length is between 40 and 50 do we want to save the sequence, and exit (break) out of the loop:

if 40 <= seq_length <= 50
    tmp_seq = rnd_seq[start_position[begin]:stop_position[end]]
    break
end

With this information, we can build an actual loop. We are decidedly not being fancy here, by using findall to get all instances of both start and stop codons, and essentially hoping that the random sequence is long enough that we will find a subset of it with the correct properties:

seq_found = false

for start_position in findall("AUG", rnd_seq)
    global seq_found
    global tmp_seq
    for stop_codon in ["UAA", "UAG", "UGA"]
        for stop_position in findall(stop_codon, rnd_seq)
            (stop_position[begin] > start_position[end]) || continue

            seq_length = stop_position[end] - start_position[begin] + 1

            iszero(seq_length % 3) || continue

            if 40 <= seq_length <= 50
                seq_found = true
                tmp_seq = rnd_seq[start_position[begin]:stop_position[end]]
                break
            end
        end

        seq_found && break
    end

    seq_found && break
end
print("Found a sequence of length $(length(tmp_seq)):\n$(tmp_seq)")
Found a sequence of length 42:
AUGUCAGUGCUGCUCGUACGUAGCUUCCCACUUGUACGGUGA