My Profile Photo

code ninja . blog


where I show and discuss things I'm working on...


Teaching Evolution With Genetic Algorithms

My 8 year-old daughter this year began asking me some of the important questions of life: who was the first person, was the first person a boy or girl (note: our family is not religious, so she’s never heard the story of Adam and Eve to my knowedge), were dogs or cats first, etc.

I work at The Broad Institute, so questions like these should be right up my alley, right? I took a little bit of time and eventually came up with an idea…

I drew some circles an put some letters in them. The letters were meant to represent nucleobases, but for simplicity I let them be any letter of the alphabet. I then showed how when they joined to make a “child”, half of the letters come from the mom and the other half from the dad. Next, we discussed how sometimes a letter can mutate (or insert/delete), causing a change. Over very long periods of time, the mutations build up and eventually you have a new animal:

evolution

Surprisingly, she grokked this pretty quick and started drawing her own trees that created horses and ferrets.

I wasn’t quite happy yet. While this explained (conceptually) DNA, it didn’t really do anything to explain evolution: why would some mutations be preffered over another?

I next decided that I would try and code a very simple program that she could interact with and actually see evolution taking place, and perhaps understand a bit more. This ended up better than I could have expected, and so I thought I would share the code here.


Concept: The program should let the user type anything they want, and see if the computer - through evolution - could eventually reproduce what was typed. The program should be simple to explain. The code - to a non-programmer - may not make sense, but the program as a whole and each individual piece (a function) should.

I decided to stick with the concept I started with: letters, and so the program was broken down thusly:

  • A DNA strand is just a string of characters.
  • Multiple strands are sorted by fitness: similarity to the target string.
  • The next generation is created by crossing over and introducing random mutations.

Step 1: Code helper functions and a couple implicit classes so we can just extend String objects and treat them like populations of DNA sequences.

object Evolution {
  import scala.util.Random.{nextBoolean, nextInt, nextDouble}

  /** Generate a random string. */
  def randomString(n: Int) = (1 to n).map(_ => (nextInt(96)+32).toChar).mkString

  /** Generate a random, initial population of DNA sequences. */
  def initialPopulation(n: Int): IndexedSeq[String] = {
    for (_ <- 1 to n) yield randomString(nextInt(20) + 10)
  }

  /** An implicit class to add functionality to a DNA string. */
  implicit class DNA(s: String) {

    /** Calculate how close this string is to a target string. Calculate the
      * delta between each character and when one string is longer than the
      * other assume null (\u0000) for the character.
      *
      * A fitness value of 0.0 is a perfect match. As the strings diverge,
      * make the fitness exponentially larger.
      */
    def fitness(target: String): Double = {
      (0 until target.size.max(s.size)).foldLeft(0.0) { (f, i) =>
        val a = s.lift(i).map(_.toInt).getOrElse(0)
        val b = target.lift(i).map(_.toInt).getOrElse(0)

        // square the delta between two characters
        f + (a - b) * (a - b)
      }
    }

    /** Since a new DNA sequence is a combination of "mom" and "dad", the
      * first step is to crossover two DNA sequences: take a slice of one
      * and a slice of the other and combine them together.
      */
    def cross(c: String): String = {
      val i = nextInt(c.size.min(s.size))

      // take some from one string and some from the other
      if (nextBoolean) {
          c.take(i) + s.drop(i)
      } else {
          s.take(i) + c.drop(i)
      }
    }

    /** Mutations are a random change within a DNA sequence. One or more
      * bases within the original DNA strand are removed and a new sequence
      * replaces it. The new sequence can be shorter or longer.
      */
    def mutate(mutationRate: Double = 0.1): String = {
      if (nextDouble >= mutationRate) {
        return s
      }

      // what character(s) to replace (i), how many (r), and with how many (n)
      val i = nextInt(s.size)
      val r = nextInt(3)+1
      val n = nextInt(3)

      // patch the string with a new sequence
      s.patch(i, randomString(n), r)
    }
  }

  /** An implicit class for manipulating populations of DNA sequences. */
  implicit class Population(xs: IndexedSeq[String]) {

    /** Returns a population sorted by fitness to a particular string. */
    def fitToString(target: String): IndexedSeq[String] = {
      xs.map(it => (it fitness target, it)).sortBy(_._1).map(_._2)
    }

    /** Return the best parent randomly from a few choices. Assuming
      * the population is sorted (fitToString), then the smallest
      * index chosen is considered the "best".
      *
      * This is meant to simulate selective breeding. In the wild,
      * animals have multiple choices for mating, and fight for
      * rights to choose mate that will provide their children the
      * greatest chances of survival.
      */
    def parent(n: Int) = xs((1 to n).map(_ => nextInt(xs.size)).min)

    /** Create a child from two random parents. This is done by first
      * selecting the parents, crossing them, and then (optionally) 
      * introducing a random mutation.
      */
    def breed(selections: Int, mutationRate: Double) = {
      parent(selections).cross(parent(selections)).mutate(mutationRate)
    }

    /** Breed a new population from the current one. */
    def nextGen(selections: Int, mutationRate: Double) = {
      (1 to xs.size).map(_ => breed(selections, mutationRate)).filter(_.size > 0)
    }
  }
}

I recommend using Ammonite-REPL and copy/pasting the code into the REPL and then import Evolution._.

Next, we tested the code together and let her see how each part worked just like the graph that I drew (see above) on paper.

@ randomString(10)
String = "CRcp(4av~]"

@ "Test".fitness("Test")
Double = 0.0

@ "Test".fitness("foobar")
Double = 261.0

@ "abcde".cross("12345678")
String = "123de"

@ "abcde".mutate(1.0)
String = "a]cd"

@ initialPopulation(5)
IndexedSeq[String] = Vector(
  "$%iJzT<EUF/",
  "m:n]VpeKy1jY_|{QJn|l^M",
  "o|Qo\"P02) d/a%[^8VzMI\u007fO!sQ2zv",
  "Vt)vRi:FgT#<.1ep`ADl`",
  "\u007f<H9JAoJ*ETaL)*0\";@\\{C,y~;S"
)

@ initialPopulation(5).fitToString("Hello").nextGen(2, 0.1)
IndexedSeq[String] = Vector(
  "UN=q]T0W9I~rte$K'biSBmh",
  "2Cu;pre\u007fNz<,fm*+-&iSBmh",
  "tQ)|!o?ei`hjo+&!FUw",
  "tQ)|!o0W9I~rte$K'b9SsKK7Q\"h",
  "0X|jprKvqu"
)

The hardest parts to explain were fitness and generating the next generation. It took a bit, but she was able to see the similarities in the next generation created; multiple children must have had the same parents.

Step 2: Bring it together with a little bit of code that asks for an input string and then runs over many generations using the input as the fitness target.

import scala.io.StdIn

def run = {
  print("Enter something to evolve: ")

  // read the fitness target
  val target = StdIn.readLine

  // some parameters worth tweaking to see the results
  val generations = 10000
  val populationSize = 200
  val breedingSelections = 3
  val mutationRate = 0.1

  // create an initial population of strings
  var pop = initialPopulation(populationSize).fitToString(target)

  // run for N generations
  for (i <- 1 to generations) {
    pop = pop.nextGen(breedingSelections, mutationRate).fitToString(target)

    // output the "best" child for each generation
    println(s"generation $i: ${pop.head}")
  }
}

And some selected output…

@ run
Enter something to evolve: Daddy and Isabel programmed evolution using Scala!
generation 1000: D`dbz `md"Iqcaik$npngqaopgg!gvmktshom"uuhoh!Rcao`
generation 2000: Daddy amd Iqcaek!qpngraomge!fvnktshon"urhoh Scala!
generation 3000: Daddy amd Itabek qqngrammee!fvnkttion!urinh Scala!
generation 4000: Daddy and Itabek pqngrammee!fvnkttion urinh Scala!
generation 5000: Daddy and Itabel programmed fvnkttion uring Scala!
generation 6000: Daddy and Itabel programmed fvokttion using Scala!
generation 7000: Daddy and Isabel programmed evolttion using Scala!
generation 8000: Daddy and Isabel programmed evolttion using Scala!
generation 9000: Daddy and Isabel programmed evolution using Scala!
generation 10000: Daddy and Isabel programmed evolution using Scala!

In the end, she was able to grasp how selective breeding - combined with tiny, random changes - over a long periods of time could transform into something much better. This was true even if what was started with was something small and non-sensical and the target was extremely large (we used lorem ipsum as the target and she noticed that it took a much larger population several hundred thousand generations to evolve to it).

I later extended the program a bit allowing her to play with the other parameters. This allowed her - on her own - to reach conclusions regarding some rather complex topics:

  • Inbreeding - small populations were unable to evolve;
  • Genetic malformations - very high mutation rates prevented evolution;
  • Selection of the fittest - too few mating choices led to unfit children;

But I’ll leave those bits as a tiny exercise for the reader…

fin.