29 October 2017

An overnight recipe for a new gene: change the frame

Can a new protein-coding gene be born overnight? That's the theme of this series. The answer, remarkably, is yes, and the Arhgap11b gene is the recent case I'm considering. After surveying the ways that this could happen, I narrowed the possible mechanisms to three:
"There are really just three kinds of change that can get it done. All three are mutations that change how the DNA sequence that's already there gets decoded into protein. They are: 1) tiny mutations that shift the reading frame; 2) tiny mutations that change splicing; and 3) large-ish rearrangements that create new combinations of code."
Before looking at the details, let's take note of the fact that the genomes of animals and plants typically have gigantic amounts of DNA that does not code for protein. Humans are merely typical in this regard—at least 95% of the human genome is non-coding DNA, but there are organisms with a lot more and some with a lot less. The point here is not about "junk" or function, it's more basic: an animal genome contains vast amounts of DNA that could code for a protein, but doesn't. A new gene doesn't have to be magicked into a genome, by a demon or by a virus. A new gene can enter the gene library simply by becoming a new way of reading a pre-existing text. This is almost certainly how the vast majority of new genes have arisen in animals and plants for at least half a billion years. And the basic mechanism applies to all living things, for 3-ish billion years: any DNA sequence can become a protein-coding sequence, and those that already do code for protein can be straightforwardly modified to make completely different proteins.

It is worth reiterating this fact about the genomes of plants, animals, and even many non-bacterial microbes: they contain vast amounts of DNA that does not make up protein-coding genes, and much of this DNA is available to be converted into protein-coding genes. This means:
  • We should expect new genes to arise over evolutionary time, especially in lineages that carry a lot of non-coding DNA around.
  • When we see a new gene appear in a particular lineage, we should initially suspect that it has its roots in pre-existing non-coding DNA.
So, how can you change a DNA sequence to make it suddenly turn into a new protein code? Back to our three possibilities. In this post, I'll just look at the first one: Change the reading frame via a "tiny mutation."

Metaphors for the genetic code typically involve the language of, well, language: words, translations, reading, coding, etc. These metaphors have limits but will work well here. A protein-coding sequence in DNA is a series of 3-letter words that are translated into protein sequence via the famously universal genetic code. There are no spaces or commas or line breaks, and so the 3-letter words must be aligned head-to-tail. Because all of the words are exactly 3 letters long, adding or removing a single letter changes every word that follows. So, for example, the DNA sequence GAG GAG GAG GAG GAG, which one might yell out when encountering overcooked broccoli, codes for "glutamate glutamate glutamate glutamate glutamate." If we add an A toward the beginning, to make the Dilbert-ian exclamation GAA, we get GAA GGA GGA GGA GGA G, which now codes for "glutamate glycine glycine glycine glycine." This is a classic frame-shift mutation, which results any time a single letter is added to or removed from a coding sequence. In fact, you can probably see that adding or removing any number of letters within a coding sequence is certain to change the protein sequence. (Adding 3 or 6 letters, or any multiple of 3, at a single spot in a coding sequence preserves the reading frame but adds to the protein sequence accordingly.)
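If you'd like to play with the GAG example yourself, here is a minimal Python sketch. The names (`translate`, `CODON_TABLE`, and so on) are my own inventions for illustration, and the reader is deliberately naive: it starts at the first letter, ignores start codons and everything else a real cell cares about, and prints one-letter amino acid codes (E is glutamate, G is glycine). The table is the standard genetic code, written in the usual T-C-A-G order.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Read dna three letters at a time from the first letter.

    Stops at the first stop codon; leftover letters at the end are ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

original = "GAGGAGGAGGAGGAG"
# Insert a single A after the first two letters: GAA GGA GGA GGA GGA G
shifted = original[:2] + "A" + original[2:]
# Insert a whole extra word (3 letters) instead: the frame is preserved.
in_frame = original[:3] + "GAG" + original[3:]

print(translate(original))  # -> EEEEE
print(translate(shifted))   # -> EGGGG
print(translate(in_frame))  # -> EEEEEE
```

One added letter turns four of the five glutamates into glycines, while three added letters just make the glutamate string one word longer.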

The implication is that the addition or removal of a single letter in a coding sequence can generate a different protein. A bit more dramatically, it can create a completely new protein sequence. Here's what I mean. Consider the GAG example above. Let's change it slightly to GAG GAG TAG GAG GAG. That new word, TAG, is a stop codon, analogous to a period at the end of a sentence. The protein sequence, then, is glutamate glutamate. Period. That's the end of the protein. But then if we add a letter to the sentence in front of that stop codon, we erase the period. Instead of "period" we get an amino acid, and all of the words after that are included in the new sentence. The reading of the sentence continues until a stop codon (there are 3 of them in the genetic code) is reached. And so, frame-shift changes don't just change the words in the sentence. They almost always change the length of the sentence: periods can be erased, and others can be introduced.
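The period-erasing trick can be sketched the same way. Again, this is a toy under my own assumptions: the function and variable names are made up, the reader starts at the first letter, and the inserted letter (an A, right in front of the TAG) is just one arbitrary choice.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

def translate(dna):
    """Naive reader: start at the first letter, stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

with_stop = "GAGGAGTAGGAGGAG"  # GAG GAG TAG GAG GAG
# Add one letter just in front of the stop codon: GAG GAG ATA GGA GGA G
read_through = with_stop[:6] + "A" + with_stop[6:]

print(STOP_CODONS)              # -> ['TAA', 'TAG', 'TGA']
print(translate(with_stop))     # -> EE
print(translate(read_through))  # -> EEIGG
```

The TAG is still sitting there in the DNA, but it is no longer read as a word, so the sentence runs on: the old period becomes an isoleucine (I), and the letters after it are read as glycines.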

Because the length of the sentence is determined by the stop codons (periods), we can change the sentence length without changing the length of the sequence. Can you see how? Look at our last sentence, GAG GAG TAG GAG GAG. If we change the T to a G (this is a classic point mutation called a substitution), we erase the period and we get the glutamate string we had in the first example. This is not a frame shift; instead, the reading frame was extended. What changed was the punctuation, and the result is a protein sequence that was not previously in the library. The converse applies as well: if we convert a GAG to a TAG, we have just put a period at that point in the sequence. This can lead to a shorter version of a protein, which can have new or altered function. That's not really the kind of new gene birth that we're thinking about in this series, but it's worth noting.
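Here is the punctuation change as a small sketch, with the same caveats: hypothetical names, a naive reader that starts at the first letter, one-letter amino acid codes. The helper `substitute` is mine, and it swaps exactly one letter so the DNA never changes length.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Naive reader: start at the first letter, stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def substitute(dna, position, letter):
    """Point substitution: replace the letter at a 0-based position."""
    return dna[:position] + letter + dna[position + 1:]

with_stop = "GAGGAGTAGGAGGAG"               # GAG GAG TAG GAG GAG
extended = substitute(with_stop, 6, "G")    # TAG -> GAG: the period is erased
truncated = substitute(extended, 3, "T")    # second GAG -> TAG: a new period

print(translate(with_stop))   # -> EE
print(translate(extended))    # -> EEEEE
print(translate(truncated))   # -> E
print(len(with_stop), len(extended), len(truncated))  # -> 15 15 15
```

All three DNA sequences are 15 letters long; only the position of the period, and so the length of the protein, changes.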

So, we can generate new protein-coding sequences by simply changing the frame of an existing coding sequence. Any small change to a protein-coding sequence has the potential to change the frame, and in fact the addition or subtraction of a single letter is guaranteed to create a new reading frame. The resulting protein sequence can be completely new, for better or for worse. (Usually worse, but sometimes something weird happens.)
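Another way to see why a new frame means a new protein is to read one stretch of DNA in each of its three possible frames. In this sketch the function `translate_frame` and its `frame` argument are my own hypothetical names; the reader simply starts 0, 1, or 2 letters in and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_frame(dna, frame):
    """Read dna in 3-letter words starting `frame` letters in (0, 1, or 2)."""
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

dna = "GAGGAGTAGGAGGAG"
frames = [translate_frame(dna, f) for f in range(3)]

print(frames)  # -> ['EE', 'RSRR', 'GVGG']
```

Same letters, three different sentences. A single added or removed letter is what moves the downstream text from one of these readings to another.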

These tiny but consequential changes are a good way to start thinking about what seems at first like an unlikely or even nearly impossible occurrence: the overnight birth of a new protein-coding sequence. Once you think about frame changes, you should realize that new protein sequences are often just the tiniest of mutations away. Next we'll look at splicing, an insanely byzantine process that provides a whole distinct set of opportunities for evolution to dream up new genes.
