
On Loss

There’s a ton of lingo: loss, loss function, Likelihood, MLE, Cross-Entropy, Objective Function, NLL, etc, etc. Let’s look at all of them.

What do we do in ML?

The entire task of ML is to minimize the Expected Loss over the data-generating distribution. We don’t have access to this distribution; you only have samples. So the entire game is to minimize the empirical loss (by choosing/training the right hyperparameters and parameters) and hope that this is a good estimate of the Expected Loss. Remember how the sample mean is an unbiased estimator of the true population mean, and gets tighter as $n$ grows? Same energy.

By defining/specifying Loss, you’re punishing the model for being wrong.

But you can also punish the model for being (a) wrong and (b) overconfident (i.e. not just “Did you classify correctly?”). The key insight with the probabilistic techniques below is that you’re forcing the output to be a distribution, which tells you how confident/certain the model is in its prediction/classification (via a probability like $p(X = 1) = 0.0001$, which means “I’m like barely confident yo”).

Generally, a good Loss function will do this (this is very hand-wavy):

  • Correct + Confident → Low Punishment
  • Correct + Not-so-confident → Low-to-Moderate Punishment
  • Incorrect + Not-so-confident → Moderate Punishment
  • Incorrect + Confident → High Punishment
Ye Olde Bias again!

Every choice/construction of a Loss Function has an inductive bias. How okay are you, the modeler, with saying “Being a little wrong a lot is fine. Being very wrong once is terrible.”?

Or “I don’t care about big misses, just be roughly right most of the time.”?

Uncertainty

There are two kinds: Epistemic and Aleatoric (i.e. random). You solve the first (“Model doesn’t know enough” or “Model cannot capture signal”) with more data or a better (well-specified) model. But the Real World™ is messy and you cannot eliminate random noise (or poor data representation). Just can’t. That’s what that fancy “Aleatoric” means. It’ll always be there #JobSecurity

Loss, Uncertainty, Entropy

The whole point of a Loss Function is to specify how predictions are rewarded or punished. That’s all. But you can also add a reward or punishment on how confident the model is about its predictions. That’s what all the Loss Functions on this page are trying to do.

  • If the model ignores Aleatoric noise, it gets hit with a huge penalty for being overconfident.
  • If the model doesn’t resolve its Epistemic ignorance, the loss stays high, forcing the optimizer to keep changing the model weights until it “gets it.”

To keep the loss low, the model is forced to increase uncertainty (via variance, keep reading) for inputs where it consistently struggles to get the mean right. It is saying “Yeah sure I’ll give you a prediction but I’m not very sure it’s right.” Compare this to Loss Functions that only check to see if they got a prediction right!

Note that this is about Uncertainty writ large. You can have Epistemic uncertainty ‘leak’ into Aleatoric uncertainty if you have a misspecified model and/or not enough data.

So how do you measure this uncertainty with a number? Use Entropy!

In Regression - Probabilistic and OLS

Ordinary Least Squares Regression learns $E[Y|X]$. It predicts $\hat{y} = \mu(x)$. It assumes that $\sigma^2$ is a constant: the same amount of uncertainty about every single prediction. It also assumes that the noise/residuals are Gaussian, $\varepsilon \sim N(0, \sigma^2)$.

In this simple case, you’ll just minimize MSE (Mean Squared Error): $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$.
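For concreteness, here’s MSE on a tiny made-up batch (plain Python; the numbers are invented):

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))  # (0.25 + 0 + 0.25) / 3
```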

Probabilistic Regression learns $P(Y|X)$. Why? What if you could predict both the mean and the variance, $\mu(x)$ and $\sigma^2(x)$, for each and every input? The variance then becomes a proxy/estimator for uncertainty (both kinds; only Aleatoric if the model is well-specified and has a lot of data). The model will learn “Which inputs are inherently unreliable?” That’s what we did above. Probabilistic Regression is ‘richer’ with $Y \sim N(\mu(x), \sigma^2(x))$

The loss (NLL) will reward low variance only when predictions are accurate: if it’s a big error with a low variance (meaning model is “confidently/loudly wrong”) then you get a giant-ass punishment.

In Classification

In classification, uncertainty is already hidden inside the predicted probabilities. We typically don’t capture a separate variance $\sigma^2$ for this reason.

| Cat | Dog | Rabbit |
| --- | --- | --- |
| 0.98 | 0.01 | 0.01 |

“I’m very certain it’s a cat” → Low Uncertainty.

| Cat | Dog | Rabbit |
| --- | --- | --- |
| 0.34 | 0.33 | 0.33 |

“No idea.” → High Uncertainty

Convention

Assume $\hat{p}_i = P(y_i = 1 \mid x_i)$. These are all Probabilistic, so this is better than writing $\hat{y}$, which could be either the predicted class label or the probability. Be explicit.

For a MultiClass Model ($K = 2$ for Binary Classifiers)

$$\mathcal{H}(\hat{p}) = -\sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k$$

How do you compute the Entropy with $\hat{p} = [0.25, 0.25, 0.25, 0.25]$ (i.e. $K = 4$)? What do you expect? Should be high Entropy since we’re uncertain: the model refuses to ‘move’ the probability mass across the 4 classes. So,

$$\begin{align*} \mathcal{H}(\hat{p}) &= -\sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k \\ &= -\sum_{k=1}^{K} \frac{1}{K} \log \frac{1}{K} \\ &= \log(K) \end{align*}$$
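As a sanity check, here’s a tiny sketch of that uniform case (plain Python, natural log; the “confident” distribution is made up):

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]    # K = 4, model refuses to commit
confident = [0.97, 0.01, 0.01, 0.01]  # mass piled on one class

print(entropy(uniform))    # == log(4) ≈ 1.386, maximal uncertainty
print(entropy(confident))  # ≈ 0.17, low uncertainty
```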

For a MultiLabel Model

Sum across all labels. Each gets its own Sigmoid function $\hat{p} = \frac{1}{1 + e^{-z}}$, based on $z = \beta \cdot X$

$$\mathcal{H}(\hat{p}) = -\sum_{l=1}^{L} \left[ \hat{p}_l \log \hat{p}_l + (1 - \hat{p}_l) \log (1 - \hat{p}_l) \right]$$
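Same idea, one Bernoulli entropy per label; a small sketch with made-up per-label probabilities:

```python
import math

def bernoulli_entropy(p):
    """Entropy of a single yes/no label with P(label present) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def multilabel_entropy(probs):
    """Total uncertainty: sum of per-label Bernoulli entropies."""
    return sum(bernoulli_entropy(p) for p in probs)

print(multilabel_entropy([0.5, 0.5]))    # two coin flips: 2 * log(2)
print(multilabel_entropy([0.99, 0.01]))  # near-certain labels: close to 0
```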

Loss Functions

Likelihood

Everything starts here. It asks “What is the probability of seeing the observed $y$ (given $x$) under parameters $\theta$?” So if the correct label is Cat and the model said 90% Cat, 10% Parakeet, the likelihood in this example is $0.9$

$$\begin{align*} L(\theta) &= p(y \mid x; \theta) \\ &= \prod_{i=1}^n p(y_i \mid x_i; \theta) \end{align*}$$

That $\prod$ is for when you have several independent samples, following the rules of Probability. That gets unwieldy, so let’s turn multiplication into addition.
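A quick sketch of why the product gets unwieldy (the per-sample probabilities are hypothetical):

```python
import math

# Hypothetical per-sample probabilities the model assigned to the true labels.
per_sample = [0.9, 0.8, 0.95, 0.7]

likelihood = math.prod(per_sample)                     # multiply them all
log_likelihood = sum(math.log(p) for p in per_sample)  # add the logs instead

# With thousands of samples the raw product underflows toward 0.0,
# but the sum of logs stays perfectly representable.
print(likelihood, log_likelihood)
```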

Log Likelihood

What it says.¹

$$\log[L(\theta)] = \sum_{i=1}^n \log[p(y_i \mid x_i; \theta)]$$

You want to maximise the likelihood here. Why not use just the Likelihood? Convention, history, tractability (mathematical and computational). Note that the $\log$ function punishes tiny probabilities hard, yet stays nice and calm/stable when values ‘jump’: $\log_{10}(0.0001) = -4$, $\log_{10}(0.00015) \approx -3.82$, $\log_{10}(0.001) = -3$, and $\log_{10}(0.01) = -2$

Negative Log Likelihood (NLL)

The all-important one!

$$\mathcal{L}_{\text{NLL}} = -\sum_{i=1}^n \log[p(y_i \mid x_i; \theta)]$$

You want to minimize this here. Where do you use this? EVERYWHERE! KINDA! Keep reading.
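A minimal NLL sketch over a classifier’s outputs (the probabilities here are made up):

```python
import math

def nll(probs_of_true_label):
    """Negative log likelihood: sum of -log(p) over the true-label probabilities."""
    return -sum(math.log(p) for p in probs_of_true_label)

# Confident-and-right model vs. hedging model, on the same 3 samples.
print(nll([0.9, 0.95, 0.99]))  # small loss
print(nll([0.5, 0.5, 0.5]))    # bigger loss: 3 * log(2) ≈ 2.08
```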

NLL Applied

In Regression

$$\begin{align*} \mathcal{L}_{\text{NLL}}^{\text{Regression}} &= \sum_{i=1}^{N} \left[ \frac{1}{2} \log(2\pi\sigma^2) + \frac{(y_i - \mu_i)^2}{2\sigma^2} \right] \\ &= \sum_{i=1}^{N} \left[ \frac{(y_i - \mu_i)^2}{2\sigma^2} + \log(\sigma) + \frac{1}{2} \log(2\pi) \right] \end{align*}$$

Here $\mu_i = \mu(x_i)$ is the predicted mean.

If $\sigma$ is fixed, this turns into MSE for OLS Regression. Breaking down the terms:

  • $\frac{(y - \mu)^2}{2\sigma^2}$ is saying “You can fuck up bigly if you’re not an overconfident jackass about it and admit ‘I’m not too sure about this’.”
  • $\log(\sigma)$ is saying “Don’t be too uncertain and cheat and throw your hands up and say ‘Oh everything’s uncertain’.” (e.g. by $\sigma \rightarrow \infty$)

SO! This is a big balancing act.

| Goal | Effect |
| --- | --- |
| Fit Data Accurately | Reduce Squared Error |
| Avoid Fake Certainty | Avoid Tiny Variance |
| Avoid Fake Uncertainty | Avoid Huge Variance |
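Here’s a sketch of that balancing act, assuming per-sample predicted means and standard deviations (natural log, constant $\frac{1}{2}\log(2\pi)$ term dropped; the numbers are invented):

```python
import math

def gaussian_nll(y, mu, sigma):
    """Per-sample Gaussian NLL, ignoring the constant (1/2) * log(2*pi) term."""
    return (y - mu) ** 2 / (2 * sigma ** 2) + math.log(sigma)

# Same big miss (error = 2.0), different confidence levels.
print(gaussian_nll(y=3.0, mu=1.0, sigma=0.1))  # confidently wrong: huge loss
print(gaussian_nll(y=3.0, mu=1.0, sigma=2.0))  # honestly unsure: modest loss
```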

In Binary Classification

$$\mathcal{L}_{\mathrm{Binary}} = -\sum_{i=1}^N \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

So if the true label is $y_i = 1$, the loss becomes:

$-\log(\hat{p}_i)$

And the model is rewarded for assigning a nice, high probability to the correct class/prediction ❤️ Same if $y_i = 0$: the loss becomes $-\log(1 - \hat{p}_i)$
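A minimal binary cross-entropy sketch over a hypothetical batch:

```python
import math

def binary_cross_entropy(y_true, p_hat):
    """Summed binary cross-entropy; y_true holds 0/1 labels, p_hat holds P(y=1)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_hat)
    )

labels = [1, 0, 1]
good = [0.9, 0.1, 0.8]  # confident and correct
bad = [0.2, 0.9, 0.3]   # confident and wrong

print(binary_cross_entropy(labels, good))  # small
print(binary_cross_entropy(labels, bad))   # large
```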

In Multi-Class Classification

warning

ACHTUNG: MultiClass means you pick one class from a list of {Cat, Dog, Human}! Not the same as MultiLabel, which means what you think the name means!

Uses the SoftMax function which helps distribute the probability ‘mass’ around the classes: increasing probability for one class decreases it among the others. FIGHT! Normalizes and makes sure everything adds up to one. It’s done like so. If KK is the number of classes,

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
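A small softmax sketch (subtracting the max logit first is a standard numerical-stability trick; the logits are made up):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.66, 0.24, 0.10]
print(sum(probs))  # sums to 1
```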

Okay. Now assume $\hat{p}_{i,k} = P(y_i = k \mid x_i)$.

$$\mathcal{L}_{\mathrm{MultiClass}} = -\sum_{i=1}^N \sum_{k=1}^K y_{i,k} \log \hat{p}_{i,k}$$
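A sketch with one-hot labels; since $y_{i,k}$ is 1 only for the true class, only that term survives per sample (the predictions are invented):

```python
import math

def multiclass_cross_entropy(y_onehot, p_hat):
    """Cross-entropy over a batch; y_onehot and p_hat are lists of K-length rows."""
    return -sum(
        y * math.log(p)
        for y_row, p_row in zip(y_onehot, p_hat)
        for y, p in zip(y_row, p_row)
        if y > 0  # only the true class contributes
    )

labels = [[1, 0, 0], [0, 1, 0]]                 # Cat, then Dog
preds = [[0.98, 0.01, 0.01], [0.2, 0.7, 0.1]]

print(multiclass_cross_entropy(labels, preds))  # -log(0.98) - log(0.7)
```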

In Multi-Label Classification

No SoftMax. Each Label has its own sigmoid $\frac{1}{1 + e^{-z}}$.

This is a mess. $y_{i,l} \in \{0, 1\}$, and $\hat{p}_{i,l}$ is the predicted probability for label $l$.

$$\mathcal{L}_{\mathrm{MultiLabel}} = -\sum_{i=1}^N \sum_{l=1}^L \left[ y_{i,l} \log \hat{p}_{i,l} + (1 - y_{i,l}) \log(1 - \hat{p}_{i,l}) \right]$$
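A sketch over hypothetical labels, with one sigmoid-probability per (sample, label) pair:

```python
import math

def multilabel_loss(y_true, p_hat):
    """Sum binary cross-entropy over every (sample, label) pair."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y_row, p_row in zip(y_true, p_hat)
        for y, p in zip(y_row, p_row)
    )

# Two samples, three independent labels each (e.g. {outdoor, sunny, crowded}).
labels = [[1, 0, 1], [0, 0, 1]]
preds = [[0.9, 0.2, 0.8], [0.1, 0.3, 0.7]]

print(multilabel_loss(labels, preds))
```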

Regularization

I’ll use my Dog to explain. Imagine I tell my puppy: “When trying to catch this tennis ball, don’t pay attention to everything (squirrels, wind speed, my mood, day of the week, which park we’re at) but don’t also freak out over the things you do pay attention to. Be a calm and clever floofer.”

L1 Regularization (aka “Lasso”) is the first part. It can remove features/predictors by moving their weights to zero. You are implicitly doing feature selection here.

L2 Regularization (aka “Ridge Regression”) is the second part. It shrinks weights but never to zero. It sort of “evens things out” and doesn’t let one feature dominate.

How do we use these? For L1 Regularization, you penalize the absolute values of the weights. For L2, the squared values. You then attach a Regularization Parameter $\lambda$ to each. If $\hat{y} = \beta X$,

$$\text{Loss} = \sum (y - \hat{y})^2 + \lambda_1 \cdot |\beta| + \lambda_2 \cdot \beta^2$$

That’s just baby OLS Regression. If you want to take Uncertainty into account, here’s the full mess:

$$\text{Loss} = \sum_{i=1}^{N} \left[ \frac{(y_i - \mu_i)^2}{2\sigma_i^2} + \log \sigma_i \right] + \lambda_1 |\beta| + \lambda_2 \beta^2$$
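A sketch of the plain regularized OLS loss (elastic-net style: both penalties at once; the weights and $\lambda$ values are made up):

```python
def regularized_loss(y, y_hat, weights, lam1=0.1, lam2=0.1):
    """Squared error plus L1 and L2 penalties on the weights."""
    squared_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    l1 = sum(abs(w) for w in weights)      # Lasso term: can zero features out
    l2 = sum(w ** 2 for w in weights)      # Ridge term: shrinks big weights
    return squared_error + lam1 * l1 + lam2 * l2

# Hypothetical fit: tiny errors, but one suspiciously large weight.
print(regularized_loss(y=[1.0, 2.0], y_hat=[1.1, 1.9], weights=[0.5, 4.0]))
```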

Objective Function

Is Loss the ‘Objective’ in the Objective Function? Roughly: the objective is whatever the optimizer minimizes (or maximizes), and a loss (plus any regularization terms) is the usual minimization objective. But I haven’t chased down the terminology. I don’t care. For now.

Footnotes

  1. Some ML people appear to just average this: $\log[L(\theta)] = \frac{1}{n}\sum_{i=1}^n \log[p(y_i \mid x_i; \theta)]$