< BACKCONTINUE >

7.5 Analyzing DNA

In this final example dealing with randomization, you'll collect some statistics on DNA in order to answer the question: on average, what percentage of bases are the same between two random DNA sequences? Although some simple mathematics can answer the question for you, the point of the program is to show that you now have the necessary programming ability to ask and answer questions about your DNA sequences. (If you were using real DNA, say a collection of some particular gene as it appears in several organisms in slightly different forms, the answer would be somewhat more interesting. You may want to try that later.)

So let's generate a set of random DNA, all the same length, then ask the following question about the set. What's the average percentage of positions that are the same between pairs of DNA sequences in this set?

As usual, let's try to sketch an idea of the program in pseudocode:

Generate a set of random DNA sequences, all the same length

For each pair of DNA sequences

    How many positions in the two sequences are identical as a fraction?

}

Report the mean of the preceding calculations as a percentage

Clearly, to write this code, you can reuse at least some of the work you've already done. You certainly know how to generate a set of random DNA sequences. Also, although you don't have a subroutine that compares, position by position, the bases in two sequences, you know how to look at the positions in DNA strings. So that subroutine shouldn't be hard to write. In fact, let's write some pseudocode that compares each nucleotide in one sequence with the nucleotide in the same position in another sequence:

assuming DNA1 is the same length as DNA2,

for each position from 1 to length(DNA)

    if the character at that position is the same in DNA_1 and DNA_2

        ++$count
    }
}

return count/length

The whole problem now seems eminently do-able. You also have to write the code that picks each pair of sequences, collects the results, and finally takes the mean of the results and report it as a percentage. That can all go into the main program. Example 7-4 gives it a try, all in one shot.

Example 7-4. Calculate average % identity between pairs of random DNA sequences
#!/usr/bin/perl
# Calculate the average percentage of positions that are the same
# between two random DNA sequences, in a set of 10 sequences.

use strict;
use warnings;

# Declare and initialize the variables
my $percent;
my @percentages;
my $result;

# An array, initialized to the empty list, to store the DNA in
my @random_DNA = (  );

# Seed the random number generator.
# time|$$ combines the current time with the current process id
srand(time|$$);

#  Generate the data set of 10 DNA sequences.
@random_DNA = make_random_DNA_set( 10, 10, 10 );

# Iterate through all pairs of sequences
for (my $k = 0 ; $k < scalar @random_DNA - 1 ; ++$k) {
    for (my $i = ($k + 1) ; $i < scalar @random_DNA ; ++$i) {

        # Calculate and save the matching percentage
        $percent = matching_percentage($random_DNA[$k], $random_DNA[$i]);
        push(@percentages, $percent);
    }
}

# Finally, the average result:
$result = 0;

foreach $percent (@percentages) {
  $result += $percent;
}

$result = $result / scalar(@percentages);
#Turn result into a true percentage
$result = int ($result * 100);

print "In this run of the experiment, the average percentage of \n";
print "matching positions is $result%\n\n";

exit;

################################################################################
# Subroutines
################################################################################

# matching_percentage
#
# Subroutine to calculate the percentage of identical bases in two
# equal length DNA sequences

sub matching_percentage {

    my($string1, $string2) = @_;

    # we assume that the strings have the same length
    my($length) = length($string1);
    my($position);
    my($count) = 0;

    for ($position=0; $position < $length ; ++$position) {
        if(substr($string1,$position,1) eq substr($string2,$position,1)) {
            ++$count;
        }
    }

    return $count / $length;
}

# make_random_DNA_set
#
# Subroutine to make a set of random DNA
#
#   Accept parameters setting the maximum and minimum length of
#     each string of DNA, and the number of DNA strings to make
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub make_random_DNA_set {

    # Collect arguments, declare variables
    my($minimum_length, $maximum_length, $size_of_set) = @_;

    # length of each DNA fragment
    my $length;
    
    # DNA fragment
    my $dna;

    # set of DNA fragments
    my @set;

    # Create set of random DNA
    for (my $i = 0; $i < $size_of_set ; ++$i) {

        # find a random length between min and max
        $length = randomlength ($minimum_length, $maximum_length);

        # make a random DNA fragment
        $dna = make_random_DNA ( $length );

        # add $dna fragment to @set
        push( @set, $dna );
    }

    return @set;
}

# randomlength
#
# A subroutine that will pick a random number from
# $minlength to $maxlength, inclusive.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomlength {

    # Collect arguments, declare variables
    my($minlength, $maxlength) = @_;

    # Calculate and return a random number within the
    #  desired interval.
    # Notice how we need to add one to make the endpoints inclusive,
    #  and how we first subtract, then add back, $minlength to
    #  get the random number in the correct interval.
    return ( int(rand($maxlength - $minlength + 1)) + $minlength );
}

# make_random_DNA
#
# Make a string of random DNA of specified length.
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub make_random_DNA {

    # Collect arguments, declare variables
    my($length) = @_;

    my $dna;

    for (my $i=0 ; $i < $length ; ++$i) {
        $dna .= randomnucleotide(  );
    }

    return $dna;
}

# randomnucleotide
#
# Select at random one of the four nucleotides
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomnucleotide {

    my(@nucleotides) = ('A', 'C', 'G', 'T');

    # scalar returns the size of an array. 
    # The elements of the array are numbered 0 to size-1
    return randomelement(@nucleotides);
}

# randomelement
#
# randomly select an element from an array
#
# WARNING: make sure you call srand to seed the
#  random number generator before you call this function.

sub randomelement {

    my(@array) = @_;

    return $array[rand @array];
}

If the code in Example 7-4 seems somewhat repetitive of code from previous examples, it is. In the interest of presentation, I included the subroutine code in the program. (You'll start using modules in Chapter 8 as a way to avoid this repetition.)

Here's the output of Example 7-4:

In this run of the experiment, the average number of 
matching positions is 0.24%

Well, that seems reasonable. You might say, it's obvious: a quarter of the positions match, and there are four bases. But the point isn't to verify elementary probability, it's to show you have enough programming under your belt to write some programs that ask and answer questions about DNA sequences.

7.5.1 Some Notes About the Code

Notice in the main program that when it calls:

@random_DNA = make_random_DNA_set( 10, 10, 10 );

you don't need to declare and initialize variables such as $minimum_length. You can just fill in the actual numbers when you call the subroutine. (However it's often a good idea to put such things in variables declared at the top of the program, where it's easy to find and change them.) Here, you set the maximum and minimum lengths to 10 and ask for 10 sequences.

Let's restate the problem we just solved. You have to compare all pairs of DNA, and for each pair, calculate the percentage of positions that have the same nucleotides. Then, you have to take the mean of these percentages.

Here's the code that accomplishes this in the main program of Example 7-4:

# Iterate through all pairs of sequences
for (my $k = 0 ; $k < scalar @random_DNA - 1 ; ++$k) {
    for (my $i = ($k + 1) ; $i < scalar @random_DNA ; ++$i) {

        # Calculate and save the matching percentage
        $percent = matching_percentage($random_DNA[$k], $random_DNA[$i]);
        push(@percentages, $percent);
    }
}

To look at each pair, you use a nested loop. A nested loop is simply a loop within another loop. These are fairly common in programming but must be handled with care. They may seem a little complex; take some time to see how the nested loop works, because it's common to have to select all combinations of two (or more) elements from a set.

The nested loop involves looking at (n * (n-1)) / 2 pairs of sequences, which is a square function of the size of the data set. This can get very big! Try gradually increasing the size of the data set and rerunning the program, and you'll see your compute time increase, and more than gradually.

See how the looping works? First sequence 0 (indexed by $K) is paired with sequences 1,2,3,...,9, in turn (indexed by $i). Then sequence 1 is paired with 2,3,...,9, etc. Finally, 8 is paired with 9. (Recall that array elements are numbered starting at 0, so the last element of an array with 10 elements is numbered 9. Also recall that scalar @random_DNA returns the number of elements in the array.)

You might find it a worthwhile exercise to let the number of sequences be some small value, say 3 or 4, and think through (paper and pencil in hand) how the nested loops and the variables $k and $i evolve during the running of the program. Or you can use the Perl debugger to watch how it happens.

< BACKCONTINUE >

Index terms contained in this section

DNA
     mutations, investigating with randomization
            random sequences, comparing bases for identity
identity between random DNA sequences, calculating average
loops
      nested
nested loops
percent identity, measuring similarity of sequences
randomization
      comparing bases in random DNA sequences

2002, O'Reilly & Associates, Inc.