
Random Notes

Lecture 10 2026-02-19 09:13:19

Memorization in descriptive "ML": have a function f that just maps each input to its label. Nothing else. That's a very nice model¹! Note that you can look at just the size of the model and tell if you're doing this. What if you can compress? What if there's a decision boundary, so you can compress? Now you have a simpler model.

We cannot stop here though. We can regularize: constrain the complexity to limit memorization.
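A minimal sketch of that idea, assuming NumPy and scikit-learn (the pure-noise dataset and the C values are made up for illustration): with labels that carry no signal, any training accuracy above chance is memorization, and turning up the L2 penalty (smaller C) squeezes it out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))     # toy features
y = rng.integers(0, 2, size=200)   # pure-noise labels: nothing real to learn

# Smaller C = stronger L2 penalty = less room to memorize the noise.
for C in (1e6, 1e-2):
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    # Train accuracy is typically well above chance for huge C,
    # and falls back toward 0.5 as the penalty tightens.
    print(f"C={C:g}  train acc={clf.score(X, y):.2f}")
```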

You will have giantass training datasets: memorization may not be feasible.

What really happens is that current models can memorize, and can do so very efficiently. So, if they can do this, why don't they?

(Plot: loss versus training steps.)

"Negative Control": to evaluate memorization, completely break the dependence between X and Y.

Zhang et al. (2017), "Understanding deep learning requires…": they show that models do indeed memorize, using ImageNet.
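A sketch of that negative control (the function name negative_control is mine, not the lecture's): permute the labels so that whatever fit remains can only be memorization.

```python
import numpy as np

def negative_control(y, seed=0):
    """Break the dependence between X and y by permuting the labels.

    Any *training* accuracy above chance on (X, y_broken) can only
    come from memorization: there is no signal left to learn.
    """
    return np.random.default_rng(seed).permutation(y)
```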

Think about memorization: a kNN has a 'memory' of the dataset that it is trained upon. You are thinking:

kNN(Sample, Dataset):
return prediction for sample

How we should really see it is:

Dataset -> Train -> kNN(Sample)

Note the difference between the model and the learning algorithm!!
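A sketch of that second view in Python (my own illustration, assuming numeric features and non-negative integer labels): the learning algorithm takes the dataset and returns the model, and for kNN the returned model simply closes over the data. That closure is the 'memory'.

```python
import numpy as np

def train(dataset, k=3):
    """Learning algorithm: Dataset -> model. For kNN, 'training' is trivial."""
    X, y = dataset

    def knn(sample):
        # Model: predict the majority label among the k nearest training points.
        dists = np.linalg.norm(X - sample, axis=1)
        nearest = y[np.argsort(dists)[:k]]
        return np.bincount(nearest).argmax()

    return knn  # the model literally carries the whole dataset with it
```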

TODO:

  • Inductive bias
  • WTF is "Loss" really?
  • "Does your loss permit trivial optimization?" You are thinking "oh, a simpler loss function?" but this is not what we're talking about! TODO: elaborate.
  • Separation and Sufficiency

Nichols et al. (2007) - Influenza Vaccine

Read about the Healthy Vaccinee Effect.


Gradient Derivation for a Gaussian Mixture Model

We have:

$$
P = \log\left[(1 - \pi) \cdot \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(y_i - \mu_0)^2}{2\sigma_0^2}\right) + \pi \cdot \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right)\right]
$$

For sanity, set:

$$
\begin{align*}
N_0 &= \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(y_i - \mu_0)^2}{2\sigma_0^2}\right) \\
N_1 &= \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right) \\
\implies P &= \log\left[(1 - \pi)N_0 + \pi N_1\right]
\end{align*}
$$
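For the sanity checks below, here is the same P transcribed into Python (a throwaway sketch; the function and argument names are mine):

```python
import numpy as np

def log_mix_likelihood(y_i, pi, mu0, mu1, s0, s1):
    """P = log[(1 - pi) * N0 + pi * N1] for a single observation y_i."""
    n0 = np.exp(-(y_i - mu0) ** 2 / (2 * s0 ** 2)) / np.sqrt(2 * np.pi * s0 ** 2)
    n1 = np.exp(-(y_i - mu1) ** 2 / (2 * s1 ** 2)) / np.sqrt(2 * np.pi * s1 ** 2)
    return np.log((1 - pi) * n0 + pi * n1)
```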

Start with the easiest one, $\pi$

$$
\begin{align*}
\frac{\partial P}{\partial \pi} &= \frac{1}{(1 - \pi)N_0 + \pi N_1} \cdot \frac{\partial}{\partial \pi}\left[(1 - \pi)N_0 + \pi N_1\right] \\
&= \frac{1}{(1 - \pi)N_0 + \pi N_1} \cdot \left[N_1 - N_0\right] \\
\text{Or, } \frac{\partial P}{\partial \pi} &= \frac{N_1 - N_0}{(1 - \pi)N_0 + \pi N_1}
\end{align*}
$$
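A quick check of this against a central finite difference, reusing log_mix_likelihood from the sketch above (the parameter values are arbitrary):

```python
y_i, pi, mu0, mu1, s0, s1 = 0.7, 0.3, 0.0, 2.0, 1.0, 1.5
eps = 1e-6

numeric = (log_mix_likelihood(y_i, pi + eps, mu0, mu1, s0, s1)
           - log_mix_likelihood(y_i, pi - eps, mu0, mu1, s0, s1)) / (2 * eps)

n0 = np.exp(-(y_i - mu0) ** 2 / (2 * s0 ** 2)) / np.sqrt(2 * np.pi * s0 ** 2)
n1 = np.exp(-(y_i - mu1) ** 2 / (2 * s1 ** 2)) / np.sqrt(2 * np.pi * s1 ** 2)
analytic = (n1 - n0) / ((1 - pi) * n0 + pi * n1)
print(numeric, analytic)  # the two should agree to ~1e-8
```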

Make it a little harder, do $\mu$ now

Introduce a little friend called the Chain Rule. Let:

$$
\begin{align*}
N &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \\
\frac{\partial N}{\partial \mu} &= N \cdot \left(2(y_i - \mu)\right) \cdot \frac{1}{2\sigma^2} \\
&= N \cdot \frac{y_i - \mu}{\sigma^2}
\end{align*}
$$

Using this result, we can now write:

$$
\begin{align*}
\frac{\partial P}{\partial \mu_0} &= \frac{(1 - \pi)}{(1 - \pi)N_0 + \pi N_1} \cdot N_0 \cdot \frac{y_i - \mu_0}{\sigma_0^2} \\
\frac{\partial P}{\partial \mu_1} &= \frac{\pi}{(1 - \pi)N_0 + \pi N_1} \cdot N_1 \cdot \frac{y_i - \mu_1}{\sigma_1^2}
\end{align*}
$$
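Same finite-difference check for $\mu_0$ ($\mu_1$ is the mirror image), continuing the toy values from the $\pi$ check above:

```python
numeric = (log_mix_likelihood(y_i, pi, mu0 + eps, mu1, s0, s1)
           - log_mix_likelihood(y_i, pi, mu0 - eps, mu1, s0, s1)) / (2 * eps)
analytic = ((1 - pi) * n0 / ((1 - pi) * n0 + pi * n1)) * (y_i - mu0) / s0 ** 2
print(numeric, analytic)
```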

Fight the Ultimate Boss, be a $\sigma$

Fight with Chains again! Let:

$$
\begin{align*}
N &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \\
&= \frac{1}{\sqrt{2\pi}} \cdot \sigma^{-1} \cdot \exp\left(-\frac{(y_i - \mu)^2}{2} \cdot \sigma^{-2}\right)
\end{align*}
$$

Now, Google's told me about a Log Derivative Trick, but I am old-school and will use the Product Rule (with the Chain Rule), because we have $e^x$ and I would love to recover it quickly without using any tricks.

$$
\begin{align*}
\frac{\partial N}{\partial \sigma} &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \cdot \left[\frac{-(y_i - \mu)^2}{2} \cdot \frac{-2}{\sigma^3}\right] + \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \cdot \frac{1}{\sqrt{2\pi}} \cdot \frac{-1}{\sigma^2} \\
&= N \cdot \frac{(y_i - \mu)^2}{\sigma^3} - \frac{N}{\sigma} \qquad \left(\text{since } \frac{1}{\sqrt{2\pi}\,\sigma^2}\exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) = \frac{N}{\sigma}\right) \\
\implies \frac{\partial N}{\partial \sigma} &= N\left[\frac{(y_i - \mu)^2}{\sigma^3} - \frac{1}{\sigma}\right]
\end{align*}
$$

Using this, we can now write:

$$
\begin{align*}
\frac{\partial P}{\partial \sigma_0} &= \frac{(1 - \pi)}{(1 - \pi)N_0 + \pi N_1} \cdot N_0 \cdot \left[\frac{(y_i - \mu_0)^2}{\sigma_0^3} - \frac{1}{\sigma_0}\right] \\
\frac{\partial P}{\partial \sigma_1} &= \frac{\pi}{(1 - \pi)N_0 + \pi N_1} \cdot N_1 \cdot \left[\frac{(y_i - \mu_1)^2}{\sigma_1^3} - \frac{1}{\sigma_1}\right]
\end{align*}
$$
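And the boss-fight check for $\sigma_0$ ($\sigma_1$ is analogous), continuing the same toy values:

```python
numeric = (log_mix_likelihood(y_i, pi, mu0, mu1, s0 + eps, s1)
           - log_mix_likelihood(y_i, pi, mu0, mu1, s0 - eps, s1)) / (2 * eps)
analytic = ((1 - pi) * n0 / ((1 - pi) * n0 + pi * n1)) \
    * ((y_i - mu0) ** 2 / s0 ** 3 - 1 / s0)
print(numeric, analytic)
```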

WOO!

Footnotes

  1. Lots of NNs will happily give you an answer on stuff they haven't seen…