There is a popular claim that "human DNA is 50% similar to the DNA of a banana", which is often cited in the context of "science popularization" as well as in the various "OMG did you know that" articles. It sounds funny, scientific and "plausible", hence I've seen many smart people repeat it, perhaps as a joke, without giving it too much thought. It is worth giving a thought, though.
Note that depending on how you phrase the statement, it may imply different things, some of them could be more, and some less exciting. The following examples are completely different in their meaning:
- If you change 50% of human DNA you can obtain the DNA of a banana.
- 50% of human DNA nucleotides are present in the DNA of a banana.
- 50% of human's DNA regions have approximate matches in the banana DNA (or vice versa, which would be a different statement)
- 50% of human genes have orthologs among banana genes (or vice versa).
The first one is obviously false, due to the fact that the total length of the human DNA is about 10 times that of the banana. You could include the whole banana sequence verbatim into a human genome, and it would only make 10% of it. The second one is also false, because, strictly speaking, not 50% but all 100% of human DNA nucleotides are also present in the DNA of a banana. Indeed, any two organisms on Earth have their DNA composed as a sequence of exactly the same four nucleotides. Moreover, if you start comparing random positions of two random DNA's, you will get a match once every four attempts by pure chance. There's a 25% basepair-wise similarity of any DNA to a random sequence!
The last two (or four, if you include the "reversed" versions) claims are more informative. In fact, claim #4 is probably the most interesting and is what could be meant if the presumed "50% similarity" was indeed ever found. Given the wide availability of genomic data, this claim be checked to some extent by anyone with a computer and a desire to finally make sure, how much of a banana he is, after all. Let us do it.
What proportion of human genes could be very similar to banana genes?
Although there is a lot of data about gene orthology among various organisms, as far as humans and bananas are concerned, I could not find any proper precomputed alignments. Creating a full-genome alignment for two organisms from scratch is a procedure way too tedious for a single Sunday's evening, so let us try a simplified measurement approach instead. Consider all possible 20-nucleotide reads taken from the gene-associated regions in the reference human genome. We shall say that a human gene "is somewhat bananas" if at least 5% of its 20-bp reads match somewhere on the banana genome. Given this definition, we shall ask: what proportion of the known human genes "are somewhat bananas"?
At this point some passionate readers could argue that, for example, 20-nucleotide reads are not long/short enough for the purposes of this question, or that the cutoff of 5% is too low or too high, or that approximate matching should be used instead along with some proper string alignment techniques, etc. As noted, we shall leave those aspects to dedicated researchers in the field of human bananology.
The computation took about an hour to run and came back with the following conclusion:
Only 33 out of the 24624 human genes (of at least 1000bp in length) are "somewhat bananas".
In other words, no, you are not "50% similar" to a banana according to any simple definition of such similarity. Not even 1% similar! Of course, there could still be means to torture the data and squeeze the "50%" value out, but those must be some quite nontrivial means and the interpretation of the resulting similarity would be far from straightforward.