Lecture 10, 2026-02-19
Memorization in descriptive "ML": have a function $f$ that just maps each input to its label. Nothing else. That's a very nice model! Note that you can tell you're doing this just from the size of the model: a pure lookup table has to grow with the dataset. What if you can compress? If there's a decision boundary you can compress against, you now have a simpler model.
We cannot stop here, though. We can regularize: constrain the model's complexity to limit memorization.
You will have giant-ass training datasets: memorization may not be feasible. What really happens is that current models actually can memorize, and can do so very efficiently. So, if they can do this, why don't they?
(Figure: loss versus training steps.)
"Negative control": to evaluate memorization, completely break the dependence between X and Y (e.g., randomly shuffle the labels).
Zhang et al. (2017), "Understanding deep learning requires rethinking generalization": they show that models do indeed memorize, using ImageNet.
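A minimal sketch of that negative control, assuming scikit-learn and purely synthetic data (all names here are illustrative): fit the same model on real labels and on shuffled labels, then compare training accuracy.

```python
# Label-shuffling negative control: high train accuracy on shuffled labels
# can only come from memorization, since X carries no information about them.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))       # synthetic features
y = (X[:, 0] > 0).astype(int)        # labels that genuinely depend on X

y_shuffled = rng.permutation(y)      # completely break the X-Y dependence

for name, labels in [("real", y), ("shuffled", y_shuffled)]:
    model = MLPClassifier(hidden_layer_sizes=(256,), max_iter=2000)
    model.fit(X, labels)
    print(name, "train accuracy:", model.score(X, labels))
```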
Think about memorization: a kNN has a "memory" of the dataset that it is trained on. You are thinking:
kNN(Sample, Dataset): return prediction for sample
How we should really see this is:
Dataset -> Train -> kNN(Sample)
Note the difference between the model and the learning algorithm!!
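A sketch of the difference, with a hypothetical `train_knn` playing the learning algorithm: the dataset goes in once at training time, and the returned model closes over it. For kNN, "training" is just storing the data, which is exactly the memorization point.

```python
# Learning algorithm: Dataset -> model.  Model: Sample -> prediction.
import numpy as np

def train_knn(dataset, k=3):
    """Learning algorithm: consumes the dataset, returns a model."""
    X, y = dataset
    def knn(sample):
        """Model: majority vote among the k nearest stored points."""
        dists = np.linalg.norm(X - sample, axis=1)
        nearest = np.argsort(dists)[:k]
        return np.bincount(y[nearest]).argmax()
    return knn

# Usage: the dataset is baked into the model, not passed at predict time.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
model = train_knn((X, y), k=3)
print(model(np.array([0.5])))   # -> 0
```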
TODO:
Inductive bias
WTF is "Loss", really?
"Does your loss permit trivial optimization?" You are thinking "oh, a simpler loss function?" but this is not what we're talking about! TODO: elaborate.
Separation and Sufficiency
Nichols et al. (2007) - Influenza Vaccine
Read about the Healthy Vaccinee Effect.
Gradient Derivation for Mixed Gaussian Model
We have:
$$
P = \log\left[(1 - \pi) \cdot \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(y_i - \mu_0)^2}{2\sigma_0^2}\right) + \pi \cdot \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right)\right]
$$
For sanity, set:
$$
\begin{align*}
N_0 &= \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(y_i - \mu_0)^2}{2\sigma_0^2}\right) \\
N_1 &= \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right) \\
\implies P &= \log\left[(1 - \pi)N_0 + \pi N_1\right]
\end{align*}
$$
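Before differentiating, a tiny numerical transcription of $P$ (with hypothetical helper names), so each gradient below can be sanity-checked with finite differences:

```python
# Direct transcription of P for a single observation y_i.
# normal_pdf and P are hypothetical helper names, not from the lecture.
import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def P(y, pi, mu0, mu1, sigma0, sigma1):
    N0 = normal_pdf(y, mu0, sigma0)
    N1 = normal_pdf(y, mu1, sigma1)
    return np.log((1 - pi) * N0 + pi * N1)
```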
Start with the easiest one, $\pi$:
$$
\begin{align*}
\frac{\partial P}{\partial \pi} &= \frac{1}{(1 - \pi)N_0 + \pi N_1} \cdot \frac{\partial}{\partial \pi}\left[(1 - \pi)N_0 + \pi N_1\right] \\
&= \frac{1}{(1 - \pi)N_0 + \pi N_1} \cdot \left[N_1 - N_0\right] \\
\text{Or, } \frac{\partial P}{\partial \pi} &= \frac{N_1 - N_0}{(1 - \pi)N_0 + \pi N_1}
\end{align*}
$$
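Continuing the snippet above, a finite-difference check of this result at an arbitrarily chosen point:

```python
# Check dP/dpi against a central finite difference (reuses normal_pdf, P above).
y, pi, mu0, mu1, s0, s1 = 0.7, 0.3, 0.0, 1.0, 1.0, 2.0
eps = 1e-6
N0, N1 = normal_pdf(y, mu0, s0), normal_pdf(y, mu1, s1)
analytic = (N1 - N0) / ((1 - pi) * N0 + pi * N1)
numeric = (P(y, pi + eps, mu0, mu1, s0, s1) - P(y, pi - eps, mu0, mu1, s0, s1)) / (2 * eps)
print(analytic, numeric)  # should agree to several decimal places
```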
Make it a little harder, do $\mu$ now.
Introduce a little friend called the Chain Rule. Let:
$$
\begin{align*}
N &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \\
\frac{\partial N}{\partial \mu} &= N \cdot \left(2(y_i - \mu)\right) \cdot \frac{1}{2\sigma^2} \\
&= N \cdot \frac{y_i - \mu}{\sigma^2}
\end{align*}
$$
Using this result, we can now write:
$$
\begin{align*}
\frac{\partial P}{\partial \mu_0} &= \frac{(1 - \pi)}{(1 - \pi)N_0 + \pi N_1} \cdot N_0 \cdot \frac{y_i - \mu_0}{\sigma_0^2} \\
\frac{\partial P}{\partial \mu_1} &= \frac{\pi}{(1 - \pi)N_0 + \pi N_1} \cdot N_1 \cdot \frac{y_i - \mu_1}{\sigma_1^2}
\end{align*}
$$
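Same finite-difference check for the $\mu$ gradients, continuing the setup above:

```python
# Check dP/dmu0 (dP/dmu1 is symmetric with pi, N1, mu1, s1).
analytic_mu0 = (1 - pi) / ((1 - pi) * N0 + pi * N1) * N0 * (y - mu0) / s0 ** 2
numeric_mu0 = (P(y, pi, mu0 + eps, mu1, s0, s1) - P(y, pi, mu0 - eps, mu1, s0, s1)) / (2 * eps)
print(analytic_mu0, numeric_mu0)
```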
Fight the Ultimate Boss: $\sigma$.
Fight with Chains again! Let:
$$
\begin{align*}
N &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \\
&= \frac{1}{\sqrt{2\pi}} \cdot \sigma^{-1} \cdot \exp\left(-\frac{(y_i - \mu)^2}{2} \cdot \sigma^{-2}\right)
\end{align*}
$$
Now Google's told me about a Log Derivative Trick, but I am old-school and will use the Product Rule (with the Chain Rule), because we have $e^x$ and I would love to recover it quickly without using any tricks.
$$
\begin{align*}
\frac{\partial N}{\partial \sigma} &= \frac{1}{\sqrt{2\pi}}\,\sigma^{-1} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \cdot \left[\frac{-(y_i - \mu)^2}{2} \cdot \frac{-2}{\sigma^3}\right] + \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \cdot \left[\frac{1}{\sqrt{2\pi}} \cdot \frac{-1}{\sigma^2}\right] \\
\implies \frac{\partial N}{\partial \sigma} &= N\left[\frac{(y_i - \mu)^2}{\sigma^3} - \frac{1}{\sigma}\right]
\end{align*}
$$

Note that the exponential in the second term equals $\sqrt{2\pi}\,\sigma N$, so that whole term collapses to $-N/\sigma$.
Using this, we can now write:
$$
\begin{align*}
\frac{\partial P}{\partial \sigma_0} &= \frac{(1 - \pi)}{(1 - \pi)N_0 + \pi N_1} \cdot N_0 \cdot \left[\frac{(y_i - \mu_0)^2}{\sigma_0^3} - \frac{1}{\sigma_0}\right] \\
\frac{\partial P}{\partial \sigma_1} &= \frac{\pi}{(1 - \pi)N_0 + \pi N_1} \cdot N_1 \cdot \left[\frac{(y_i - \mu_1)^2}{\sigma_1^3} - \frac{1}{\sigma_1}\right]
\end{align*}
$$
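One last finite-difference check for the $\sigma$ gradients, continuing the setup above; it also pins down the $-1/\sigma$ term in the bracket:

```python
# Check dP/dsigma0 (dP/dsigma1 is symmetric).
analytic_s0 = (1 - pi) / ((1 - pi) * N0 + pi * N1) * N0 * ((y - mu0) ** 2 / s0 ** 3 - 1 / s0)
numeric_s0 = (P(y, pi, mu0, mu1, s0 + eps, s1) - P(y, pi, mu0, mu1, s0 - eps, s1)) / (2 * eps)
print(analytic_s0, numeric_s0)
```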
WOO!