Kmers are cool
- Gabri Ele
- 1 day ago
- 2 min read
Updated: 8 hours ago
Lately, while being busy figuring out the correct way to use inStrain, I have also started thinking about a lesson on shotgun metagenomics to give to my lab. This made me think of a way to introduce kmers, since kmers are used in many of the tools for analysing shotgun metagenomics data.
Now, this post is not about explaining kmers; google it and you will find plenty of explanations. What is harder to grasp is how a genome, being a collection of genes (that is, DNA sequences that each perform a function), can tend to show the same kmers over and over.
So when people are introduced to kmers for the first time, it is cool to show them those repeats with real data, and show them that kmers really do work!
So I took three genomes from a collection of bins I have and plotted their 3-kmers, that is, how often each 3-base word occurs (see figure 1). As you can see, the three genomes show quite distinct kmer profiles.
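To make the counting step concrete, here is a minimal sketch of how a 3-kmer profile could be computed. It is not the code behind the figures: the function name `kmer_profile` and the toy sequence are made up for illustration, and it counts a single strand only, skipping windows with ambiguous bases.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=3):
    """Return the relative frequency of every possible k-mer in a DNA sequence."""
    sequence = sequence.upper()
    # All 4**k possible words over A, C, G, T (64 of them for k=3)
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Slide a window of length k along the sequence, one base at a time
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only count windows made of A, C, G, T; windows with N etc. are ignored
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Toy sequence, invented for this example
toy_genome = "ATGCGATATGCGCGATATATGCGC"
profile = kmer_profile(toy_genome)
```

Plotting the values of `profile` as a barplot, one bar per 3-kmer, gives the kind of picture described above. Using relative frequencies rather than raw counts keeps genomes of different sizes comparable.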

So ok, different genomes show different kmer profiles. That does not prove anything; it could simply be because of very different gene content. So let's add a further layer of complexity. What happens if I split each genome in two, randomly selecting half of it, and plot the two halves side by side? (Technically, I randomly select half of the contigs composing each genome.) If the kmer profile were just a by-product of gene content, then, given that different genes are found in different regions of the genome, the two halves should look much less alike. Let's try (figure 2).
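The random split can be sketched roughly as follows. This is an assumption about how one might do it, not the exact code behind figure 2: `split_contigs`, `pooled_profile`, the fixed seed and the toy contigs are all hypothetical.

```python
import random
from collections import Counter
from itertools import product


def split_contigs(contigs, seed=0):
    """Shuffle the contigs and cut the list in two halves."""
    shuffled = list(contigs)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


def pooled_profile(contigs, k=3):
    """Relative k-mer frequencies pooled over a set of contigs."""
    counts = Counter()
    for contig in contigs:
        contig = contig.upper()
        # Count within each contig so no k-mer spans two contigs
        counts.update(contig[i:i + k] for i in range(len(contig) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Toy contigs, invented for this example
contigs = ["ATGCGATATGCG", "GCGATATATGCGC", "TTGCGATAGGCA", "ATATGCGCGATT"]
half_a, half_b = split_contigs(contigs)
profile_a = pooled_profile(half_a)
profile_b = pooled_profile(half_b)
```

Plotting `profile_a` and `profile_b` side by side is then the same barplot as before, once per half.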

Ok, we can see that the kmer profile is maintained across the two halves, with only very small differences. This is a little more convincing, but it still does not really prove the point, as the shared profile could be the result of, for example, the same gene occurring in multiple locations across the genome. So let's calculate the 3-kmer profile for every contig in my fasta files and see what happens. In this case, instead of a barplot, I will plot them in an ordination.
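As a rough sketch of the per-contig step, the snippet below builds one 3-kmer frequency vector per contig and projects them onto two axes with a plain PCA. PCA is my assumption here, since the post does not say which ordination method was used, and the contig names and sequences are invented toy data (an AT-rich and a GC-rich "genome").

```python
from itertools import product

import numpy as np


def kmer_vector(sequence, k=3):
    """Relative k-mer frequencies of one contig as a fixed-order vector."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vector = np.zeros(len(kmers))
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        j = index.get(sequence[i:i + k])  # None for windows with N etc.
        if j is not None:
            vector[j] += 1
    total = vector.sum()
    return vector / total if total > 0 else vector


def pca_coordinates(matrix, n_components=2):
    """Project the rows of matrix onto their first principal components."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy contigs from two made-up genomes
contigs = {
    "genomeA_contig1": "ATATATATATATATATATAT",
    "genomeA_contig2": "ATATATTATATATAATATAT",
    "genomeB_contig1": "GCGCGCGCGCGCGCGCGCGC",
    "genomeB_contig2": "GCGCGCCGCGCGCGGCGCGC",
}
matrix = np.array([kmer_vector(seq) for seq in contigs.values()])
coords = pca_coordinates(matrix)  # one (PC1, PC2) row per contig
```

Each row of `coords` is one contig; a scatter plot of the two columns, coloured by genome of origin, is the ordination. With real data you would read the contigs from the fasta files instead of a hard-coded dictionary.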

The results in figure 3 clearly show that contigs cluster together by genome when you count 3-kmer occurrences. That is why kmers are so useful in shotgun metagenomics data analysis :).
As usual, code here.