
Information & Entropy

The Information

Measures “Uncertainty” or “Surprise” (or lack thereof). If $p(x)$ is the probability of a single event,

$$I(x) = -\log[p(x)] = -\log[P(X = x)]$$

That log base can be $2$ or a Natural Log $\ln$. Don’t start saying things like “oh it’s the number of bits it takes to encode a system” without knowing what you’re talking about. This goes deep. TODO: Feynman lecture, explain why.
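Quick sketch in Python (function name and numbers are mine, not canon): surprise in bits vs. nats for a common and a rare event.

```python
import math

def surprise(p: float, base: float = 2) -> float:
    """I(x) = -log p(x). Base 2 gives bits, base e gives nats."""
    return -math.log(p, base)

print(surprise(0.5))            # 1.0 bit: a fair coin flip, mildly interesting
print(surprise(0.01))           # ~6.64 bits: rare event, big surprise
print(surprise(0.01, math.e))   # ~4.61 nats: same event, different units
```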

Why the Negative Log?

The question is “What do we want this function to do?” Four things come to mind (mine, at least):

  • We want to increase ‘Surprise’ when events are rare
  • We want to decrease ‘Surprise’ when they are not (extreme case: $p(x) = 1$ should be ‘zero surprise’)
  • If two independent events happen, we want their ‘Surprises’ to add up: $I(x,y) = I(x) + I(y)$ (evidence accumulates linearly)
  • We don’t want negative information. Doesn’t make too much sense.

The negative log function satisfies all of these criteria. QED (if you’re a Serious Statistician/Mathematician, calm your tits, I’m dumber than you.)
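For the skeptics, a dumb little numeric check of all four criteria (nats, made-up probabilities):

```python
import math

def I(p: float) -> float:
    """Surprise in nats."""
    return -math.log(p)

# 1. Rarer events are more surprising
assert I(0.01) > I(0.5)
# 2. A certain event carries zero surprise
assert I(1.0) == 0.0
# 3. Independent events: surprises add up, I(x, y) = I(x) + I(y)
p_x, p_y = 0.2, 0.3
assert math.isclose(I(p_x * p_y), I(p_x) + I(p_y))
# 4. No negative information (probabilities never exceed 1)
assert I(0.999) >= 0
```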

Entropy

What a concept!

My greatest concern was what to call it. I thought of calling it ‘information,’ but the word was overly used, so I decided to call it ‘uncertainty.’ When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.’

— Claude Shannon

In the 1950s, Jaynes told Wigner that physical entropy is a measure of information, and Wigner thought that was absurd, because the information one person possesses differs from that of another, whereas entropy can be measured with thermometers and calorimeters.


Whatever. Here, we’re taking $I(x)$ and defining Shannon Entropy as the Weighted Average of the Information.

$$\mathcal{H}(X) = \sum_{i=1}^{N}{p(x_i)I(x_i)}$$

This is the exact same thing as the Expected Value of the Information! Note that we use LOTUS (the Law of the Unconscious Statistician) here: $E[g(X)] = \sum_x g(x)p(x)$, with $g = I$.

$$\begin{align*} \mathcal{H}(X) &= E[I(X)] \\ &= E[-\log(p(x))] \\ &= -\sum{p(x)\log[p(x)]} \end{align*}$$

TLDR: You multiply probabilities with their $\log$, sum them up, negate, and SURPRISE! Or not…

  • High number → flat, unpredictable distribution → Lots of Surprise/Uncertainty!
  • Low number → peaked, predictable distribution → Meh, expected → Not very Surprised.

Like $[0.99, 0.01]$ ($\mathcal{H} \approx 0.08$ bits). You wouldn’t be surprised if a loaded die kept giving you 6’s 🎲🎲🎲. High Entropy happens when you have a flat distribution like $[0.5, 0.5]$ ($\mathcal{H} = 1$ bit). Big Number → Big Uncertainty.
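If you want to poke at those numbers yourself, a minimal sketch (my own helper, log base 2 so entropy comes out in bits):

```python
import math

def entropy(ps, base: float = 2) -> float:
    """H = sum of p * (-log p), skipping zero-probability outcomes."""
    return -sum(p * math.log(p, base) for p in ps if p > 0)

print(entropy([0.99, 0.01]))  # ~0.08 bits: loaded die, barely any uncertainty
print(entropy([0.5, 0.5]))    # 1.0 bit: fair coin, maximum uncertainty
```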

Okay so that’s a single system. What if you wanted to compare systems? The World and your model? Or two models?

Cross Entropy

Let’s say the data-generating distribution in the Real World™ is $p(x)$. Your model is $q(x)$. Cross-Entropy then is $\mathcal{H}(p,q)$. What does it look like?

$$\mathcal{H}(p,q) = - \sum_x p(x)\log q(x)$$

Okay, but why? What are we saying with that?

We’re asking: “Given the Real World™, how surprised on average are you by the model?” In the Weighted Average view, we’re weighting the model’s surprise by what we see in the Real World™! If you just did $\mathcal{H}(q)$, you’re just averaging according to your own beliefs, and that’s nice and all, but Mama Nature will kick you in the plums, guaranteed. So,

  • $p(x)$ is how often event $x$ really occurs
  • $-\log q(x)$ is how surprised your model is by it

$$\mathcal{H}(p,q) = \sum_x p(x)[-\log q(x)] = \sum_x p(x)\,I_q(x)$$
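A sketch with made-up distributions: weight the model’s surprise by how often reality actually serves up each event.

```python
import math

def cross_entropy(p, q) -> float:
    """H(p, q) = sum over x of p(x) * (-log q(x)), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

p = [0.8, 0.2]       # the Real World (TM)
q_good = [0.7, 0.3]  # a decent model
q_bad = [0.1, 0.9]   # a confidently wrong model

print(cross_entropy(p, q_good))  # ~0.53 nats
print(cross_entropy(p, q_bad))   # ~1.86 nats: way more average surprise
```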

KL Divergence

Okay, let’s do some rearranging. For giggles, let’s subtract $\mathcal{H}(p)$:

$$\begin{align*} \mathcal{H}(p,q) - \mathcal{H}(p) &= -\sum_x p(x)\log q(x) - \left[-\sum_x p(x)\log p(x)\right] \\ &= \sum_x p(x)\log\frac{p(x)}{q(x)} \end{align*}$$

Or,

$$\begin{align*} \mathcal{H}(p,q) &= \mathcal{H}(p) + \sum_x p(x)\log\frac{p(x)}{q(x)} \\ &= \mathcal{H}(p) + D_{KL}(p \| q) \end{align*}$$

That last part is called the Kullback–Leibler Divergence, “a type of statistical distance: a measure of how much an approximating probability distribution $Q$ is different from a true probability distribution $P$”. BOOM!

What’s an easy reading guaranteed to piss off Statisticians and Mathematicians?

$$\begin{align*} \text{Cross Entropy} &= \text{Unavoidable Randomness of the Real World™} \\ &+ \text{Your Model's Fuckups} \end{align*}$$

Very important (and you can see why): $D_{KL}(p \| q) \ne D_{KL}(q \| p)$. Also $D_{KL} \ge 0$.
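A numeric sanity check (toy distributions, nats) of both the decomposition and the asymmetry:

```python
import math

def entropy(p) -> float:
    return -sum(pi * math.log(pi) for pi in p)

def cross_entropy(p, q) -> float:
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def kl(p, q) -> float:
    """D_KL(p || q) = sum of p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p, q = [0.8, 0.2], [0.5, 0.5]

# Cross-Entropy = Entropy + KL, up to floating-point fuzz
assert math.isclose(cross_entropy(p, q), entropy(p) + kl(p, q))

# It's a divergence, not a distance: swapping the arguments changes the answer
print(kl(p, q), kl(q, p))  # ~0.193 vs ~0.223
```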

Sadness in The Real World™

You’ll never know the True $p(x)$. What do you do if you want Cross-Entropy?

You can approximate it using the sample average over draws $x_i \sim p$:

$$\mathcal{H}(p,q) \approx -\frac{1}{n}\sum_{i=1}^{n}\log q(x_i)$$

Huh. Consider the Likelihood Function $\mathcal{L}$ of those same samples under $q$:

$$\begin{align*} \mathcal{L} &= \prod_{i=1}^{n} q(x_i) \\ \log \mathcal{L} &= \sum_{i=1}^{n} \log q(x_i) \\ -\frac{1}{n}\log \mathcal{L} &= -\frac{1}{n}\sum_{i=1}^{n}\log q(x_i) \end{align*}$$

WELL HOW ABOUT THAT.

$$\text{Minimizing Cross-Entropy} \iff \text{Maximizing Likelihood}$$

Sweet. We can build optimizers now. 🆒
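To watch the equivalence in action, a toy sketch (fake coin-flip data, names are mine): the average negative log-likelihood under $q$ of samples drawn from $p$ lands right on $\mathcal{H}(p,q)$.

```python
import math
import random

random.seed(0)

p_true = 0.8   # Real World (TM): P(heads)
q_model = 0.7  # our model's guess at P(heads)

# Draw samples from the true distribution
samples = [random.random() < p_true for _ in range(100_000)]

# Average negative log-likelihood of the samples under the model q
nll = -sum(math.log(q_model if heads else 1 - q_model) for heads in samples) / len(samples)

# Exact cross-entropy H(p, q)
h_pq = -(p_true * math.log(q_model) + (1 - p_true) * math.log(1 - q_model))

print(nll, h_pq)  # both ~0.53 nats
```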