
Computational Genomics by Yufeng Shen

Gradations of lactose intolerance in populations outside Europe.

The talk was about various aspects of genomic sequencing. Very mathematical/statistical.

There was shotgun sequencing: you tear the genome into fragments and assemble them programmatically. The issue: how do you know you're covering the whole genome? There's a "Depth of Coverage Model" that uses probability to determine how many reads you need to make full coverage likely.

D ~ Binomial(N, L/G)

where G is the genome size, S the read size, N the number of reads, and D the number of reads falling in an interval of length L.

Since S << G and N is very large (and a given base is covered by any read starting in the S bases before it, i.e. L = S), the depth of coverage is approximated by

D ~ Poisson(lambda)
lambda = SN/G

This is largely borne out by empirical results. However, the observed distribution has an over-dispersed 'tail', caused mainly by G/C content (G:C pairing is stronger than A:T). You want to model the over-dispersion, for which the Negative Binomial distribution is commonly used.
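
A minimal sketch of this model in R; the genome size, read length, read count, and the Negative Binomial dispersion parameter below are illustrative values I've picked, not numbers from the talk:

```r
# Depth-of-coverage model: Poisson approximation vs. an over-dispersed
# Negative Binomial alternative. All values here are illustrative.
G <- 3e9               # genome size (bp)
S <- 100               # read length (bp)
N <- 9e8               # number of reads
lambda <- S * N / G    # expected per-base depth (30x with these numbers)

# Probability a given base is covered by at least one read (Poisson model)
1 - dpois(0, lambda)

# Expected number of completely uncovered bases genome-wide
G * dpois(0, lambda)

# Over-dispersion: a Negative Binomial with the same mean but extra variance
# (variance = mu + mu^2/size, so smaller `size` means a heavier tail).
# Compare the probability of seeing depth above 60 at a given base:
c(poisson = 1 - ppois(60, lambda),
  negbin  = 1 - pnbinom(60, size = 10, mu = lambda))
```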

Now we determine genetic variants from sequence data. Two ways:

  1. Reference-based
  2. De novo assembly

But how do you distinguish true genetic variation from sequencing errors? You assume errors are roughly independent across reads, so the chance that multiple reads at the same position all show the same error is much lower than the per-read error rate. Phred-scale quality scores capture the per-base error probability (e.g. Q40 = 1/10,000). Discussion of the k-allele algorithm in the context of SNP calling. Issues are that the errors are not actually independent and that it's hard to set thresholds when the average depth of coverage varies. Lots of conditional probability involved here.
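
A hedged sketch of this kind of conditional-probability reasoning in R; the `genotype_likelihoods` helper and its numbers are my own illustration, not the exact algorithm from the talk:

```r
# Phred quality Q encodes the per-base error probability p = 10^(-Q/10),
# so Q40 corresponds to an error probability of 1/10,000.
phred_to_prob <- function(q) 10^(-q / 10)

# Toy SNV-calling sketch: at one position we observe `depth` reads, `alt` of
# which carry the non-reference allele. Compare binomial likelihoods of the
# data under the three possible diploid genotypes.
genotype_likelihoods <- function(alt, depth, q = 40) {
  e <- phred_to_prob(q)
  c(hom_ref = dbinom(alt, depth, e),       # alt reads can only be errors
    het     = dbinom(alt, depth, 0.5),     # ~half the reads should carry alt
    hom_alt = dbinom(alt, depth, 1 - e))   # ref reads can only be errors
}

genotype_likelihoods(alt = 9, depth = 20)  # the het genotype dominates here
```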

You can, of course, use ML approaches to Single Nucleotide Variant (SNV) calling. A very popular tool/approach here is GATK. There's a newer method called DeepVariant which uses CNNs to replace manual human inspection and offers better ways to calibrate errors.


Discussion of the 1000 Genomes Project, which aimed to find all the common variants in all the major continental populations.

All variants start as de novo mutations: present in the child but not in the parents.

C → T and G → A mutation rates are much higher than those of other substitutions. This is due to DNA methylation (an epigenetic mark involved in regulating transcription): methylated cytosines are prone to deamination into thymine.

Discussion of coding variants and related problems. The usual suspects among point mutations: missense, silent (synonymous), and nonsense. These are subject to random drift. Discussion of the characteristics of each and how they are evaluated computationally/mathematically.

The goal here is to predict harmful effects (pathogenicity). There are many sources of information for pathogenicity prediction: amino acid properties, protein structure and interactions, and so on. You can build a causal chain from these sources, e.g. amino acid substitution → change in protein-level function (molecular) → risk of conditions (organism) → fitness/selection (population).

How do you find new risk genes from rare variants? One way is statistical association. Autism, for instance, is highly heritable and under strong negative selection. Simplex studies recruit families in which only one child has the condition.

Null model: M ~ Poisson(L0), where M is the number of mutations and L0 = sample size * mutation rate. This is frequentist! The strength of association comes from a Poisson exact test, i.e. poisson.test(M, L0).
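
A minimal sketch in R; the trio count, per-gene mutation rate, and observed count are invented for illustration:

```r
# Null model: M ~ Poisson(L0), where M is the observed number of de novo
# mutations in a gene and L0 = sample size * per-gene mutation rate.
n_samples <- 5000          # hypothetical number of trios
mu_gene   <- 1e-5          # hypothetical per-sample mutation rate for the gene
L0 <- n_samples * mu_gene  # expected count under the null

M <- 6                     # hypothetical observed de novo count

# Exact Poisson test of the observed count against the expectation L0
poisson.test(M, T = L0)
```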

Alternative model: M ~ Poisson(L1), with L1 = L0 * relative risk (RR). This is Bayesian! You need to specify the relative risk, but we don't really know the RR for a given gene, so in this approach RR itself gets a distribution. The strength of association is the likelihood ratio: dpois(M, L1) / dpois(M, L0).
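
A sketch of the likelihood ratio in R; the point relative risk of 20 and the Gamma prior over RR are assumptions I've made for illustration, not values from the talk:

```r
# Alternative model: M ~ Poisson(L1) with L1 = L0 * RR.
L0 <- 0.05   # same hypothetical expected count as above
M  <- 6      # same hypothetical observed count

# With a fixed (point) relative risk, the likelihood ratio is:
lr_point <- function(M, L0, RR) dpois(M, L0 * RR) / dpois(M, L0)
lr_point(M, L0, RR = 20)

# Since the true RR of a gene is unknown, give it a prior and average the
# alternative likelihood over that prior. The Gamma(shape, rate) prior and
# its parameters are an illustrative assumption.
lr_marginal <- function(M, L0, shape = 4, rate = 0.2) {
  marg <- integrate(function(rr) dpois(M, L0 * rr) * dgamma(rr, shape, rate),
                    lower = 0, upper = Inf)$value
  marg / dpois(M, L0)
}
lr_marginal(M, L0)
```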

Discussion of genetic architecture. There are monogenic diseases (caused by a single gene), e.g. CHD7, PTPN11, but for most diseases there's an aggregated risk from many common variants.