Lecture 10, 2026-02-19
Memorization in descriptive "ML": have a function $f$ that just maps each input to its label. Nothing else. That's a very nice model! Note that you can tell you're doing this just from the size of the model: a pure lookup table has to grow with the dataset. What if you can compress? If there's a decision boundary you can compress against, you now have a simpler model.
We cannot stop here, though. We can regularize: constrain the model's complexity to limit memorization.
You will have giant-ass training datasets: memorization may not be feasible. What really happens is that current models actually can memorize, and can do so very efficiently. So, if they can do this, why don't they?
(Figure: loss versus training steps.)
"Negative control": to evaluate memorization, completely break the dependence between X and Y (e.g., randomly shuffle the labels).
Zhang et al. (2017), "Understanding deep learning requires rethinking generalization": they show that models do indeed memorize, using ImageNet.
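A minimal sketch of that negative control, assuming scikit-learn and purely synthetic data (all names here are illustrative): fit the same model on real labels and on shuffled labels, then compare training accuracy.

```python
# Label-shuffling negative control: high train accuracy on shuffled labels
# can only come from memorization, since X carries no information about them.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))       # synthetic features
y = (X[:, 0] > 0).astype(int)        # labels that genuinely depend on X

y_shuffled = rng.permutation(y)      # completely break the X-Y dependence

for name, labels in [("real", y), ("shuffled", y_shuffled)]:
    model = MLPClassifier(hidden_layer_sizes=(256,), max_iter=2000)
    model.fit(X, labels)
    print(name, "train accuracy:", model.score(X, labels))
```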
Think about memorization: a kNN has a "memory" of the dataset that it is trained on. You are thinking:
kNN(Sample, Dataset): return prediction for sample
How we should really see this is:
Dataset -> Train -> kNN(Sample)
Note the difference between the model and the learning algorithm!!
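A sketch of the difference, with a hypothetical `train_knn` playing the learning algorithm: the dataset goes in once at training time, and the returned model closes over it. For kNN, "training" is just storing the data, which is exactly the memorization point.

```python
# Learning algorithm: Dataset -> model.  Model: Sample -> prediction.
import numpy as np

def train_knn(dataset, k=3):
    """Learning algorithm: consumes the dataset, returns a model."""
    X, y = dataset
    def knn(sample):
        """Model: majority vote among the k nearest stored points."""
        dists = np.linalg.norm(X - sample, axis=1)
        nearest = np.argsort(dists)[:k]
        return np.bincount(y[nearest]).argmax()
    return knn

# Usage: the dataset is baked into the model, not passed at predict time.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
model = train_knn((X, y), k=3)
print(model(np.array([0.5])))   # -> 0
```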
TODO:
Inductive bias
WTF is "Loss", really?
"Does your loss permit trivial optimization?" You are thinking "oh, a simpler loss function?" but this is not what we're talking about! TODO: elaborate.
Separation and Sufficiency
Nichols et al. (2007) - Influenza Vaccine
Read about the Healthy Vaccinee Effect.
Gradient Derivation for Mixed Gaussian Model
We have:
$$
P = \log\left[(1 - \pi) \cdot \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(y_i - \mu_0)^2}{2\sigma_0^2}\right) + \pi \cdot \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right)\right]
$$
For sanity, set:
$$
\begin{align*}
N_0 &= \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(y_i - \mu_0)^2}{2\sigma_0^2}\right) \\
N_1 &= \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right) \\
\implies P &= \log\left[(1 - \pi)N_0 + \pi N_1\right]
\end{align*}
$$
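Before differentiating, a tiny numerical transcription of $P$ (with hypothetical helper names), so each gradient below can be sanity-checked with finite differences:

```python
# Direct transcription of P for a single observation y_i.
# normal_pdf and P are hypothetical helper names, not from the lecture.
import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def P(y, pi, mu0, mu1, sigma0, sigma1):
    N0 = normal_pdf(y, mu0, sigma0)
    N1 = normal_pdf(y, mu1, sigma1)
    return np.log((1 - pi) * N0 + pi * N1)
```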
Start with the easiest one, $\pi$:
$$
\begin{align*}
\frac{\partial P}{\partial \pi} &= \frac{1}{(1 - \pi)N_0 + \pi N_1} \cdot \frac{\partial}{\partial \pi}\left[(1 - \pi)N_0 + \pi N_1\right] \\
&= \frac{1}{(1 - \pi)N_0 + \pi N_1} \cdot \left[N_1 - N_0\right] \\
\text{Or, } \frac{\partial P}{\partial \pi} &= \frac{N_1 - N_0}{(1 - \pi)N_0 + \pi N_1}
\end{align*}
$$
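Continuing the snippet above, a finite-difference check of this result at an arbitrarily chosen point:

```python
# Check dP/dpi against a central finite difference (reuses normal_pdf, P above).
y, pi, mu0, mu1, s0, s1 = 0.7, 0.3, 0.0, 1.0, 1.0, 2.0
eps = 1e-6
N0, N1 = normal_pdf(y, mu0, s0), normal_pdf(y, mu1, s1)
analytic = (N1 - N0) / ((1 - pi) * N0 + pi * N1)
numeric = (P(y, pi + eps, mu0, mu1, s0, s1) - P(y, pi - eps, mu0, mu1, s0, s1)) / (2 * eps)
print(analytic, numeric)  # should agree to several decimal places
```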
Make it a little harder, do $\mu$ now.
Introduce a little friend called the Chain Rule. Let:
$$
\begin{align*}
N &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \\
\frac{\partial N}{\partial \mu} &= N \cdot \left(2(y_i - \mu)\right) \cdot \frac{1}{2\sigma^2} \\
&= N \cdot \frac{y_i - \mu}{\sigma^2}
\end{align*}
$$
Using this result, we can now write:
$$
\begin{align*}
\frac{\partial P}{\partial \mu_0} &= \frac{(1 - \pi)}{(1 - \pi)N_0 + \pi N_1} \cdot N_0 \cdot \frac{y_i - \mu_0}{\sigma_0^2} \\
\frac{\partial P}{\partial \mu_1} &= \frac{\pi}{(1 - \pi)N_0 + \pi N_1} \cdot N_1 \cdot \frac{y_i - \mu_1}{\sigma_1^2}
\end{align*}
$$
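Same finite-difference check for the $\mu$ gradients, continuing the setup above:

```python
# Check dP/dmu0 (dP/dmu1 is symmetric with pi, N1, mu1, s1).
analytic_mu0 = (1 - pi) / ((1 - pi) * N0 + pi * N1) * N0 * (y - mu0) / s0 ** 2
numeric_mu0 = (P(y, pi, mu0 + eps, mu1, s0, s1) - P(y, pi, mu0 - eps, mu1, s0, s1)) / (2 * eps)
print(analytic_mu0, numeric_mu0)
```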
Fight the Ultimate Boss: $\sigma$.
Fight with Chains again! Let:
$$
\begin{align*}
N &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \\
&= \frac{1}{\sqrt{2\pi}} \cdot \sigma^{-1} \cdot \exp\left(-\frac{(y_i - \mu)^2}{2} \cdot \sigma^{-2}\right)
\end{align*}
$$
Now Google's told me about a Log Derivative Trick, but I am old-school and will use the Product Rule (with the Chain Rule), because we have $e^x$ and I would love to recover it quickly without using any tricks.
$$
\begin{align*}
\frac{\partial N}{\partial \sigma} &= \frac{1}{\sqrt{2\pi}}\,\sigma^{-1} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \cdot \left[\frac{-(y_i - \mu)^2}{2} \cdot \frac{-2}{\sigma^3}\right] + \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right) \cdot \left[\frac{1}{\sqrt{2\pi}} \cdot \frac{-1}{\sigma^2}\right] \\
\implies \frac{\partial N}{\partial \sigma} &= N\left[\frac{(y_i - \mu)^2}{\sigma^3} - \frac{1}{\sigma}\right]
\end{align*}
$$

Note that the exponential in the second term equals $\sqrt{2\pi}\,\sigma N$, so that whole term collapses to $-N/\sigma$.
Using this, we can now write:
$$
\begin{align*}
\frac{\partial P}{\partial \sigma_0} &= \frac{(1 - \pi)}{(1 - \pi)N_0 + \pi N_1} \cdot N_0 \cdot \left[\frac{(y_i - \mu_0)^2}{\sigma_0^3} - \frac{1}{\sigma_0}\right] \\
\frac{\partial P}{\partial \sigma_1} &= \frac{\pi}{(1 - \pi)N_0 + \pi N_1} \cdot N_1 \cdot \left[\frac{(y_i - \mu_1)^2}{\sigma_1^3} - \frac{1}{\sigma_1}\right]
\end{align*}
$$
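One last finite-difference check for the $\sigma$ gradients, continuing the setup above; it also pins down the $-1/\sigma$ term in the bracket:

```python
# Check dP/dsigma0 (dP/dsigma1 is symmetric).
analytic_s0 = (1 - pi) / ((1 - pi) * N0 + pi * N1) * N0 * ((y - mu0) ** 2 / s0 ** 3 - 1 / s0)
numeric_s0 = (P(y, pi, mu0, mu1, s0 + eps, s1) - P(y, pi, mu0, mu1, s0 - eps, s1)) / (2 * eps)
print(analytic_s0, numeric_s0)
```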
WOO!