How to measure which genes kill

From 1990 to 2003, researchers from all over the world worked on the Human Genome Project, with the goal of identifying all the A’s, C’s, T’s and G’s that make up human DNA. Since then, our capacity to sequence genomes has surpassed our ability to interpret what the variations in them actually mean. Currently we interpret the effect of those variations, which is to say annotate them, using methods that provide either:

  1. a single type of information, as in the study of genetic conservation, where we look at which sequences natural selection has preserved and work backwards to infer their purpose.
  2. information only on genetic variations that cause missense changes, meaning they change which amino acid a codon encodes. For example, ACC (threonine) might change to AAC (asparagine).
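To make the missense idea concrete, here is a minimal sketch in Python. The `CODON_TABLE` is deliberately partial (only the codons needed for the illustration) and the function name `is_missense` is my own; a real tool would use the full 64-codon table and handle stop codons too.

```python
# Partial codon table: only the codons used in this illustration.
CODON_TABLE = {
    "ACC": "Thr",  # threonine
    "ACA": "Thr",  # threonine (a synonymous codon)
    "AAC": "Asn",  # asparagine
}

def is_missense(ref_codon, alt_codon):
    """Return True if the two codons encode different amino acids.

    Simplification: stop codons are not in the table, so this toy
    cannot distinguish missense from nonsense changes.
    """
    return CODON_TABLE[ref_codon] != CODON_TABLE[alt_codon]

print(is_missense("ACC", "AAC"))  # True: threonine becomes asparagine
print(is_missense("ACC", "ACA"))  # False: still threonine
```

The second call shows why not every DNA change is missense: ACC and ACA both encode threonine, so the protein is unchanged.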

Combined Annotation Dependent Depletion (CADD) is a tool that integrates many of these different annotations into a single score called a C-score, which measures how deleterious (that is, how likely to be harmful) a specific genetic variant is. It works using a support vector machine.

What is a support vector machine?

At a basic level, support vector machines draw lines through the data so that new data can be easily classified later. In the example above, the blue line is clearly the best one to split the data, but it can get trickier, and that’s where support vector machines have to become nifty.
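One way to make "best" precise: a support vector machine prefers the line that leaves the widest gap (the margin) between itself and the closest point of either class. The sketch below compares two candidate lines on made-up 2D data; the points, the candidate lines, and the names `margin` and `candidates` are all assumptions for illustration.

```python
import math

# Hypothetical toy data: two clusters in 2D, labelled -1 and +1.
points = [((1.0, 1.0), -1), ((2.0, 1.5), -1), ((1.5, 2.0), -1),
          ((4.0, 4.0), +1), ((5.0, 4.5), +1), ((4.5, 5.0), +1)]

def margin(w, b, data):
    """Smallest signed distance from any point to the line w.x + b = 0.

    The result is negative if some point lies on the wrong side.
    """
    norm = math.hypot(w[0], w[1])
    return min(label * (w[0] * x + w[1] * y + b) / norm
               for (x, y), label in data)

# Two candidate separating lines, each written as (w, b).
candidates = {
    "vertical": ((1.0, 0.0), -3.0),  # the line x = 3
    "diagonal": ((1.0, 1.0), -6.0),  # the line x + y = 6
}

margins = {name: margin(w, b, points) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins)
print(best)
```

Both lines split the data correctly, but the diagonal one stays further from the nearest point, so it is the one a margin-maximizing classifier would prefer of the two.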

How about now? Can you still split the data with just one line?
Well, the answer is yes…
You can do it like this:

You can introduce a new dimension to the data and then split it with a single line, like the blue line in the example above!
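Here is a small sketch of that trick on made-up 1D data, where one class sits in the middle and the other on both sides. The data, the choice of x² as the new dimension, and the cut value 2.5 are all assumptions picked for illustration.

```python
# Hypothetical 1D data: class -1 in the middle, class +1 on both sides.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = [+1, +1, -1, -1, -1, +1, +1]

def separable_by_threshold(values, labs):
    """True if a single cut on the number line separates the classes."""
    ordered = [lab for _, lab in sorted(zip(values, labs))]
    # Separable in 1D only if the sorted labels switch class at most once.
    changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return changes <= 1

# Introduce a new dimension: each point x becomes (x, x squared).
lifted = [(x, x * x) for x in xs]

def classify_lifted(point, cut=2.5):
    """Classify with the horizontal line x_squared = cut in the lifted space."""
    _, x_squared = point
    return +1 if x_squared > cut else -1

print(separable_by_threshold(xs, labels))             # False
print([classify_lifted(p) for p in lifted] == labels)  # True
```

No single cut works on the original number line, but after adding the x² coordinate one straight line separates every point correctly.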

Now, in low dimensions (like 1D and 2D), humans are usually smart enough to figure out these nifty tricks on their own. But when you approach more than 60 dimensions of data, as in the case of CADD, some form of automatic calculation is required, and that’s where support vector machines come in: they find the best way of classifying the data for you.
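To give a feel for what "automatic" means here, the sketch below trains a tiny linear support vector machine in plain Python, using sub-gradient descent on the hinge loss with a small L2 penalty. This is not how CADD itself is trained; the data, the learning rate `lr`, the penalty `lam`, and the number of epochs are all assumptions for illustration. The same code works for any number of dimensions, since it only relies on dot products.

```python
# Hypothetical toy data: feature vectors with labels -1 or +1.
data = [((1.0, 1.0), -1), ((2.0, 1.5), -1), ((1.5, 2.0), -1),
        ((4.0, 4.0), +1), ((5.0, 4.5), +1), ((4.5, 5.0), +1)]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(samples, lr=0.01, lam=0.01, epochs=2000):
    """Fit weights w and bias b by sub-gradient descent on the hinge loss."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            if y * (dot(w, x) + b) < 1:
                # Point is inside the margin or misclassified: push it out.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is fine: only apply the L2 shrinkage to w.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Classify x by which side of the learned boundary it falls on."""
    return 1 if dot(w, x) + b > 0 else -1

w, b = train_linear_svm(data)
print(w, b)
print([predict(w, b, x) for x, _ in data])
```

In practice you would reach for a tested library implementation rather than a hand-rolled loop like this one, but the idea is the same: the machine adjusts the boundary until every point sits on the right side with room to spare.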

Back to CADD

C-scores are computed using the support vector machine and can be used to estimate which genetic variants are pathogenic, so that perhaps in the future, when we develop better methods for modifying our genes, we will be able to rid ourselves of disease.

Thanks for checking in! I’m Davide, a 19-year-old self-learner who runs The Feynman Mafia, exploring how learning by explaining can be used to teach yourself any topic.

To learn more about me, check out my website, and follow me on YouTube or Twitter.
