
On Loss

There’s a ton of lingo: loss, loss function, Likelihood, MLE, Cross-Entropy, Objective Function, NLL, etc, etc. Let’s look at all of them.

What do we do in ML?

The entire task of ML is to minimize the Expected Loss over the data-generating distribution. We don’t have access to this distribution; you only have samples. So the entire game is to minimize the empirical loss (by choosing/training the right hyperparameters and parameters) and hope that this is a good estimate of the Expected Loss. Remember how the sample mean is an unbiased estimator of the true population mean, and gets tighter as $n$ grows? Same energy.

By defining/specifying Loss, you’re punishing the model for being wrong.

But you can also punish the model for being (a) wrong and (b) overconfident (i.e. not just “Did you classify correctly?”). The key insight with the probabilistic techniques below is that you’re forcing the output to be a distribution, which tells you how confident/certain the model is in its prediction/classification (via a probability like $p(X = 1) = 0.0001$, which means “I’m like barely confident yo”).

Generally, a good Loss function will do this (this is very hand-wavy):

  • Correct + Confident → Low Punishment
  • Correct + Not-so-confident → Low-to-Moderate Punishment
  • Incorrect + Not-so-confident → Moderate Punishment
  • Incorrect + Confident → High Punishment
Ye Olde Bias again!

Every choice/construction of a Loss Function has an inductive bias. How okay are you, the modeler, with saying “Being a little wrong a lot is fine. Being very wrong once is terrible.”?

Or “I don’t care about big misses, just be roughly right most of the time.”?

Uncertainty

There are two kinds: Epistemic and Aleatoric (i.e. random). You solve the first (“Model doesn’t know enough” or “Model cannot capture signal”) with more data or a better (well-specified) model. But the Real World™ is messy and you cannot eliminate random noise (or poor data representation). Just can’t. That’s what that fancy “Aleatoric” means. It’ll always be there #JobSecurity

Loss, Uncertainty, Entropy

The whole point of a Loss Function is to specify how predictions are rewarded or punished. That’s all. But you can also add a reward or punishment on how confident the model is about its predictions. That’s what all the Loss Functions on this page are trying to do.

  • If the model ignores Aleatoric noise, it gets hit with a huge penalty for being overconfident.
  • If the model doesn’t resolve its Epistemic ignorance, the loss stays high, forcing the optimizer to keep changing the model weights until it “gets it.”

To keep the loss low, the model is forced to increase uncertainty (via variance, keep reading) for inputs where it consistently struggles to get the mean right. It is saying “Yeah sure I’ll give you a prediction but I’m not very sure it’s right.” Compare this to Loss Functions that only check to see if they got a prediction right!

Note that this is about Uncertainty writ large. You can have Epistemic uncertainty ‘leak’ into Aleatoric uncertainty if you have a misspecified model and/or not enough data.

So how do you measure this uncertainty with a number? Use Entropy!

In Regression - Probabilistic and OLS

Ordinary Least Squares Regression learns $E[Y|X]$. It predicts $\hat{y} = \mu(x)$. It assumes that $\sigma^2$ is a constant: the same amount of uncertainty about every single prediction. It also assumes that the noise/residuals are Gaussian, $\varepsilon \sim N(0, \sigma^2)$.

In this simple case, you’ll just minimize MSE (Mean Squared Error): $\frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$.
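For concreteness, here’s MSE on a tiny made-up batch (plain Python; the numbers are invented):

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))  # (0.25 + 0 + 0.25) / 3
```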

Probabilistic Regression learns $P(Y|X)$. Why? What if you could predict both the mean and the variance, $\mu(x)$ and $\sigma^2(x)$, for each and every input? The variance then becomes a proxy/estimator for uncertainty (both kinds; only Aleatoric if the model is well-specified and has a lot of data). The model will learn “Which inputs are inherently unreliable?” That’s what we did above. Probabilistic Regression is ‘richer’ with $Y \sim N(\mu(x), \sigma^2(x))$

The loss (NLL) will reward low variance only when predictions are accurate: if it’s a big error with a low variance (meaning model is “confidently/loudly wrong”) then you get a giant-ass punishment.

In Classification

In classification, uncertainty is already hidden inside the predicted probabilities. We typically don’t capture a separate variance $\sigma^2$ for this reason.

| Cat | Dog | Rabbit |
| --- | --- | --- |
| 0.98 | 0.01 | 0.01 |

“I’m very certain it’s a cat” → Low Uncertainty.

| Cat | Dog | Rabbit |
| --- | --- | --- |
| 0.34 | 0.33 | 0.33 |

“No idea.” → High Uncertainty

Convention

Assume $\hat{p}_i = P(y_i = 1 \mid x_i)$. These are all Probabilistic, so this is better than writing $\hat{y}$, which could be either the predicted class label or the probability. Be explicit.

For a MultiClass Model ($K = 2$ for Binary Classifiers)

$$\mathcal{H}(\hat{p}) = -\sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k$$

How do you compute the Entropy with $\hat{p} = [0.25, 0.25, 0.25, 0.25]$ (i.e. $K = 4$)? What do you expect? Should be high Entropy since we’re uncertain: the model refuses to ‘move’ the probability mass across the 4 classes. So,

$$\begin{align*} \mathcal{H}(\hat{p}) &= -\sum_{k=1}^{K} \hat{p}_k \log \hat{p}_k \\ &= -\sum_{k=1}^{K} \frac{1}{K} \log \frac{1}{K} \\ &= \log(K) \end{align*}$$
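As a sanity check, here’s a tiny sketch of that uniform case (plain Python, natural log; the “confident” distribution is made up):

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]    # K = 4, model refuses to commit
confident = [0.97, 0.01, 0.01, 0.01]  # mass piled on one class

print(entropy(uniform))    # == log(4) ≈ 1.386, maximal uncertainty
print(entropy(confident))  # ≈ 0.17, low uncertainty
```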

For a MultiLabel Model

Sum across all labels. Each gets its own Sigmoid function $\hat{p} = \frac{1}{1 + e^{-z}}$, based on $z = \beta \cdot X$

$$\mathcal{H}(\hat{p}) = -\sum_{l=1}^{L} \left[ \hat{p}_l \log \hat{p}_l + (1 - \hat{p}_l) \log (1 - \hat{p}_l) \right]$$
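Same idea, one Bernoulli entropy per label; a small sketch with made-up per-label probabilities:

```python
import math

def bernoulli_entropy(p):
    """Entropy of a single yes/no label with P(label present) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def multilabel_entropy(probs):
    """Total uncertainty: sum of per-label Bernoulli entropies."""
    return sum(bernoulli_entropy(p) for p in probs)

print(multilabel_entropy([0.5, 0.5]))    # two coin flips: 2 * log(2)
print(multilabel_entropy([0.99, 0.01]))  # near-certain labels: close to 0
```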

Loss Functions

Likelihood

Everything starts here. It asks “What is the probability of seeing the observed $y$ (given $x$) under parameters $\theta$?” So if the correct label is Cat and the model said 90% Cat, 10% Parakeet, the likelihood in this example is $0.9$

$$\begin{align*} L(\theta) &= p(y \mid x; \theta) \\ &= \prod_{i=1}^n p(y_i \mid x_i; \theta) \end{align*}$$

That $\prod$ is for when you have several independent samples, following the rules of Probability. That gets unwieldy, so let’s turn multiplication into addition.
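A quick sketch of why the product gets unwieldy (the per-sample probabilities are hypothetical):

```python
import math

# Hypothetical per-sample probabilities the model assigned to the true labels.
per_sample = [0.9, 0.8, 0.95, 0.7]

likelihood = math.prod(per_sample)                     # multiply them all
log_likelihood = sum(math.log(p) for p in per_sample)  # add the logs instead

# With thousands of samples the raw product underflows toward 0.0,
# but the sum of logs stays perfectly representable.
print(likelihood, log_likelihood)
```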

Log Likelihood

What it says.¹

$$\log[L(\theta)] = \sum_{i=1}^n \log[p(y_i \mid x_i; \theta)]$$

You want to maximise the likelihood here. Why not use just the Likelihood? Convention, history, tractability (mathematical and computational). Note that the $\log$ function punishes tiny probabilities hard, yet stays nice and calm/stable when values ‘jump’: $\log_{10}(0.0001) = -4$, $\log_{10}(0.00015) \approx -3.82$, $\log_{10}(0.001) = -3$, and $\log_{10}(0.01) = -2$

Negative Log Likelihood (NLL)

The all-important one!

$$\mathcal{L}_{\text{NLL}} = -\sum_{i=1}^n \log[p(y_i \mid x_i; \theta)]$$

You want to minimize this here. Where do you use this? EVERYWHERE! KINDA! Keep reading.
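A minimal NLL sketch over a classifier’s outputs (the probabilities here are made up):

```python
import math

def nll(probs_of_true_label):
    """Negative log likelihood: sum of -log(p) over the true-label probabilities."""
    return -sum(math.log(p) for p in probs_of_true_label)

# Confident-and-right model vs. hedging model, on the same 3 samples.
print(nll([0.9, 0.95, 0.99]))  # small loss
print(nll([0.5, 0.5, 0.5]))    # bigger loss: 3 * log(2) ≈ 2.08
```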

NLL Applied

In Regression

$$\begin{align*} \mathcal{L}_{\text{NLL}}^{\text{Regression}} &= \sum_{i=1}^{N} \left[ \frac{1}{2} \log(2\pi\sigma^2) + \frac{(y_i - \mu_i)^2}{2\sigma^2} \right] \\ &= \sum_{i=1}^{N} \left[ \frac{(y_i - \mu_i)^2}{2\sigma^2} + \log(\sigma) + \frac{1}{2} \log(2\pi) \right] \end{align*}$$

Here $\mu_i = \mu(x_i)$ is the predicted mean.

If $\sigma$ is fixed, this turns into MSE for OLS Regression. Breaking down the terms:

  • $\frac{(y - \mu)^2}{2\sigma^2}$ is saying “You can fuck up bigly if you’re not an overconfident jackass about it and admit ‘I’m not too sure about this’.”
  • $\log(\sigma)$ is saying “Don’t be too uncertain and cheat and throw your hands up and say ‘Oh everything’s uncertain’.” (e.g. by $\sigma \rightarrow \infty$)

SO! This is a big balancing act.

| Goal | Effect |
| --- | --- |
| Fit Data Accurately | Reduce Squared Error |
| Avoid Fake Certainty | Avoid Tiny Variance |
| Avoid Fake Uncertainty | Avoid Huge Variance |
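Here’s a sketch of that balancing act, assuming per-sample predicted means and standard deviations (natural log, constant $\frac{1}{2}\log(2\pi)$ term dropped; the numbers are invented):

```python
import math

def gaussian_nll(y, mu, sigma):
    """Per-sample Gaussian NLL, ignoring the constant (1/2) * log(2*pi) term."""
    return (y - mu) ** 2 / (2 * sigma ** 2) + math.log(sigma)

# Same big miss (error = 2.0), different confidence levels.
print(gaussian_nll(y=3.0, mu=1.0, sigma=0.1))  # confidently wrong: huge loss
print(gaussian_nll(y=3.0, mu=1.0, sigma=2.0))  # honestly unsure: modest loss
```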

In Binary Classification

$$\mathcal{L}_{\mathrm{Binary}} = -\sum_{i=1}^N \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

So if the true label is $y_i = 1$, the loss becomes:

$-\log(\hat{p}_i)$

And the model is rewarded for assigning a nice, high probability to the correct class/prediction ❤️ Same if $y_i = 0$: the loss becomes $-\log(1 - \hat{p}_i)$
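A minimal binary cross-entropy sketch over a hypothetical batch:

```python
import math

def binary_cross_entropy(y_true, p_hat):
    """Summed binary cross-entropy; y_true holds 0/1 labels, p_hat holds P(y=1)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_hat)
    )

labels = [1, 0, 1]
good = [0.9, 0.1, 0.8]  # confident and correct
bad = [0.2, 0.9, 0.3]   # confident and wrong

print(binary_cross_entropy(labels, good))  # small
print(binary_cross_entropy(labels, bad))   # large
```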

In Multi-Class Classification

warning

ACHTUNG: MultiClass means you pick one class from a list of {Cat, Dog, Human}! Not the same as MultiLabel, which means what you think the name means!

Uses the SoftMax function which helps distribute the probability ‘mass’ around the classes: increasing probability for one class decreases it among the others. FIGHT! Normalizes and makes sure everything adds up to one. It’s done like so. If KK is the number of classes,

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
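A small softmax sketch (subtracting the max logit first is a standard numerical-stability trick; the logits are made up):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.66, 0.24, 0.10]
print(sum(probs))  # sums to 1
```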

Okay. Now assume $\hat{p}_{i,k} = P(y_i = k \mid x_i)$.

$$\mathcal{L}_{\mathrm{MultiClass}} = -\sum_{i=1}^N \sum_{k=1}^K y_{i,k} \log \hat{p}_{i,k}$$
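A sketch with one-hot labels; since $y_{i,k}$ is 1 only for the true class, only that term survives per sample (the predictions are invented):

```python
import math

def multiclass_cross_entropy(y_onehot, p_hat):
    """Cross-entropy over a batch; y_onehot and p_hat are lists of K-length rows."""
    return -sum(
        y * math.log(p)
        for y_row, p_row in zip(y_onehot, p_hat)
        for y, p in zip(y_row, p_row)
        if y > 0  # only the true class contributes
    )

labels = [[1, 0, 0], [0, 1, 0]]                 # Cat, then Dog
preds = [[0.98, 0.01, 0.01], [0.2, 0.7, 0.1]]

print(multiclass_cross_entropy(labels, preds))  # -log(0.98) - log(0.7)
```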

In Multi-Label Classification

No SoftMax. Each Label has its own sigmoid $\frac{1}{1 + e^{-z}}$.

This is a mess. $y_{i,l} \in \{0, 1\}$, and $\hat{p}_{i,l}$ is the predicted probability for label $l$.

$$\mathcal{L}_{\mathrm{MultiLabel}} = -\sum_{i=1}^N \sum_{l=1}^L \left[ y_{i,l} \log \hat{p}_{i,l} + (1 - y_{i,l}) \log(1 - \hat{p}_{i,l}) \right]$$
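A sketch over hypothetical labels, with one sigmoid-probability per (sample, label) pair:

```python
import math

def multilabel_loss(y_true, p_hat):
    """Sum binary cross-entropy over every (sample, label) pair."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y_row, p_row in zip(y_true, p_hat)
        for y, p in zip(y_row, p_row)
    )

# Two samples, three independent labels each (e.g. {outdoor, sunny, crowded}).
labels = [[1, 0, 1], [0, 0, 1]]
preds = [[0.9, 0.2, 0.8], [0.1, 0.3, 0.7]]

print(multilabel_loss(labels, preds))
```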

Regularization

I’ll use my Dog to explain. Imagine I tell my puppy: “When trying to catch this tennis ball, don’t pay attention to everything (squirrels, wind speed, my mood, day of the week, which park we’re at) but don’t also freak out over the things you do pay attention to. Be a calm and clever floofer.”

L1 Regularization (aka “Lasso”) is the first part. It can remove features/predictors by moving their weights to zero. You are implicitly doing feature selection here.

L2 Regularization (aka “Ridge Regression”) is the second part. It shrinks weights but never to zero. It sort of “evens things out” and doesn’t let one feature dominate.

How do we use these? For L1 Regularization, you penalize the absolute values of the weights. For L2, the squared values. You then attach a Regularization Parameter $\lambda$ to each. If $\hat{y} = \beta X$,

$$\text{Loss} = \sum (y - \hat{y})^2 + \lambda_1 \cdot |\beta| + \lambda_2 \cdot \beta^2$$

That’s just baby OLS Regression. If you want to take Uncertainty into account, here’s the full mess:

$$\text{Loss} = \sum_{i=1}^{N} \left[ \frac{(y_i - \mu_i)^2}{2\sigma_i^2} + \log \sigma_i \right] + \lambda_1 |\beta| + \lambda_2 \beta^2$$
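A sketch of the plain regularized OLS loss (elastic-net style: both penalties at once; the weights and $\lambda$ values are made up):

```python
def regularized_loss(y, y_hat, weights, lam1=0.1, lam2=0.1):
    """Squared error plus L1 and L2 penalties on the weights."""
    squared_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    l1 = sum(abs(w) for w in weights)      # Lasso term: can zero features out
    l2 = sum(w ** 2 for w in weights)      # Ridge term: shrinks big weights
    return squared_error + lam1 * l1 + lam2 * l2

# Hypothetical fit: tiny errors, but one suspiciously large weight.
print(regularized_loss(y=[1.0, 2.0], y_hat=[1.1, 1.9], weights=[0.5, 4.0]))
```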

Objective Function

Is Loss the ‘Objective’ in the Objective Function? Roughly: the objective is whatever the optimizer minimizes (or maximizes), and a loss (plus any regularization terms) is the usual minimization objective. But I haven’t chased down the terminology. I don’t care. For now.

Footnotes

  1. Some ML people appear to just average this: $\log[L(\theta)] = \frac{1}{n}\sum_{i=1}^n \log[p(y_i \mid x_i; \theta)]$