
On Models

All models are wrong, but some are useful.

— George Box

Capacity and Complexity

Not the same. Intuition: a piano is more complex than a small flute. Both can play “Happy Birthday”. One can capture Beethoven’s Fifth far better than the other.

Complexity is “How complicated is the function the model learned?” What’s “the function”? It’s the:

  • Local neighbor rule in kNN
  • The linear equation in Regression
  • The polynomials in Polynomial Regression
  • The branching rules in Decision Trees
  • The giant-ass non-linear function in Neural Nets

Capacity is about expressivity and asks “What is the set of functions/distributions this model could represent?” Neural Nets have very high capacity. Capacity can ‘just’ be a parameter knob you play with (e.g. the k in kNN; lots of knobs to twiddle in Neural Networks!)
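
A tiny sketch of that knob (toy data, plain-Python kNN): small k lets the model express every jag in the training set; large k collapses everything toward the global mean.

```python
# Sketch (hypothetical toy data): k in kNN is a capacity knob.
# Small k -> wiggly, expressive fit; large k -> smooth, inexpressive fit.

def knn_predict(x, xs, ys, k):
    """Predict y at x by averaging the k nearest training targets."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
    return sum(ys[i] for i in nearest) / k

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 1, 0, 1, 0, 1, 0, 1]   # a jagged pattern

# k=1 can reproduce every jag (high capacity)...
wiggly = [knn_predict(x, xs, ys, k=1) for x in xs]
# ...k=len(xs) collapses to the global mean (low capacity).
smooth = [knn_predict(x, xs, ys, k=len(xs)) for x in xs]

print(wiggly)  # reproduces the training targets exactly
print(smooth)  # every prediction is the mean, 0.5
```

Same model family, same data; turning one knob moves you across a whole range of representable functions.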

These concepts are very nuanced! And it turns out (thanks to CNNs) that high Capacity (super-high expressivity) doesn’t have to be a bad thing. Keep reading.

Model Specification

A Misspecification happens when the true relationship in the Real World™ cannot be represented by the set of distributions (“functions” as fancy people call them) in the model class¹.

Imagine the Real Generative process is y = x² and you stubbornly insist on modeling with ŷ = β₀ + β₁x.

So if a model is correctly specified, the true data-generating distribution P(Y|X) lies within the family of distributions your model can represent. There exists some setting of the model parameters θ such that your model’s distribution P(Y|X; θ) equals the true P(Y|X) exactly.

So yeah: Every model is misspecified (to some extent).
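
A quick sketch of the y = x² vs. straight-line story (toy numbers): no setting of (β₀, β₁) drives the error to zero, which is exactly what misspecification means.

```python
# Sketch: the data come from y = x^2, but we stubbornly fit a line.
# No (b0, b1) makes the error zero -> the model class is misspecified.

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x**2 for x in xs]               # true process: y = x^2

# Closed-form simple linear regression.
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx)**2 for x in xs)
b0 = my - b1 * mx

sse = sum((y - (b0 + b1 * x))**2 for x, y in zip(xs, ys))
print(b0, b1, sse)   # the best line is flat (b1 = 0), yet the error stays at 14
```

The optimizer did its job perfectly; the model class simply doesn’t contain the truth.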

Simplicity, Complexity, Bias, Variance, Consistency

  • Simple Model ⟹ High Bias, Low Variance → Consistently Wrong
  • Complex Model ⟹ Low Bias, High Variance → Inconsistently Right (see below)

What’s going on? First see “Bias and Variance”.

Let my Dog Explain

Imagine you’re a beautiful Diva Husky. You decide: “I’ll always run to the same spot in the yard, no matter where my Dad throws the frisbee.”

  • You’re consistent: you do the same thing every time. Low variance.
  • You’re wrong a lot: the frisbee rarely lands where you’re standing. High bias.
  • You’re consistent + wrong. Every throw, same bad result.

The model is too simple to capture what’s actually going on. It misses the real pattern: follow the arc of the frisbee.

Now say you’re oversmart: “I’ll memorize exactly where every single frisbee has ever landed and use all of that to predict the next one. Wind speed, # of Squirrels in nearest tree, Dad’s mood, what I had for breakfast, everything I can think of.”

  • On throws you’ve seen before, you’re amazing. Low bias.
  • On a new throw, you go nuts. A Squirrel shows up and you sprint to the neighbor’s yard. Tiny changes in the input send you to wildly different spots. High variance.
  • Sometimes you nail it. Sometimes you’re three blocks away. Inconsistently right.

That’s all.

Simple and Complex Models

It’s not about Simple and Complex models anymore. I mean, prefer simple models, but:

“Simple model” vs. “complex model” is the wrong axis. The right axis is “which functions does the optimizer find first?”

What does that mean? We think “Complex models overfit” and “Simple models underfit” (generally speaking). But a giant-ass Neural Network can have gazillions of parameters (enough to memorize fully randomized labels!) and still generalize shockingly well. What gives? What do we do with this?

Here’s the important idea, and it has to do with your choice of how to optimize rather than the model’s capacity:

A massive neural network isn’t dangerous because it could memorize. It’s safe because gradient descent, with normal initialization and learning rates, doesn’t go looking for memorization solutions first. It finds smooth, generalizing ones first, and stops there once the loss is low enough.

That’s really it. Put another way, out of all the ‘capacities’ of your model (the functions/distributions), what does your optimizer actually pick? How quickly? Your choice of optimizer will have preferences (Inductive Bias!)

So the question becomes: “How do I make an optimizer that finds ‘good solutions’ early during training?” for a given model class (type of model), and not so much “What’s the capacity of my model?” or “Is my model simple enough?” You may pat yourself on the back for coming up with a ‘Simple’ model that will faceplant in the Real World™. Maybe a ‘complex’ model would’ve served you better.
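
A minimal sketch of “the optimizer finds the smooth solution first”, using the classic overparameterized least-squares toy: one training point, two weights, infinitely many perfect fits, and gradient descent from a small init converges to the minimum-norm one.

```python
# Sketch (toy): many weight settings fit the data perfectly (capacity!),
# but gradient descent from a zero init finds the minimum-norm one first.
# One "training example": predict y = 2 from features x = (1, 1).
x = (1.0, 1.0)
y = 2.0

w = [0.0, 0.0]                 # small init, vanilla learning rate
lr = 0.1
for _ in range(200):
    err = (w[0] * x[0] + w[1] * x[1]) - y   # prediction error
    w[0] -= lr * err * x[0]                 # gradient of squared loss
    w[1] -= lr * err * x[1]

# (2, 0) and (0, 2) also interpolate perfectly, but GD lands on the
# smooth, minimum-norm interpolant (1, 1) and stops once the loss is low.
print(w)
```

The model class contains the “memorizing” solutions too; the optimization procedure just never visits them. That’s the inductive bias doing the work.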

Separation and Sufficiency (Fairness)

Nuanced, very important. See this page.

Calibration

| Task | What gets calibrated? |
| --- | --- |
| Binary classification | predicted probabilities |
| Multiclass classification | class probabilities |
| Regression | prediction intervals/distributions |
| Survival analysis | survival probabilities |
| Bayesian inference | posterior uncertainty |
| Deep learning | confidence scores |

How do you measure proper calibration? You can use (a) Reliability Diagrams and (b) the Brier Score as a ‘headline’ metric. You must show calibration across subgroups! TODO: Expand upon this.
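
A sketch of both, on hypothetical predictions: the Brier Score as the headline number, plus the binned predicted-vs-observed pairs a Reliability Diagram would plot.

```python
# Sketch: Brier score + the binned (reliability-diagram) view,
# on hypothetical predicted probabilities and 0/1 outcomes.
probs  = [0.1, 0.2, 0.2, 0.8, 0.9, 0.9, 0.7, 0.3]
labels = [0,   0,   1,   1,   1,   0,   1,   0  ]

# Headline metric: Brier score = mean squared error of the probabilities.
brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)
print("Brier:", round(brier, 5))

# Reliability diagram data: in each probability bin, compare the mean
# predicted probability to the observed frequency of y = 1.
bins = {}
for p, y in zip(probs, labels):
    b = min(int(p * 5), 4)         # 5 bins: [0, .2), [.2, .4), ...
    bins.setdefault(b, []).append((p, y))

for b in sorted(bins):
    ps, ys = zip(*bins[b])
    print(b, sum(ps) / len(ps), sum(ys) / len(ys))  # predicted vs observed
```

For the subgroup check, you’d run this same binning per subgroup and compare the curves, not just the global number.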

How do you recalibrate in a new setting? You can retrain. What if that’s expensive? You can use techniques like Platt Scaling or Isotonic Regression. TODO: More on these with examples.
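
A sketch of Platt Scaling (hypothetical held-out scores and labels): instead of retraining the model, fit a tiny logistic map sigmoid(a·score + b) on held-out data and pass every raw score through it.

```python
# Sketch: Platt scaling. Learn a, b so that sigmoid(a*score + b) is
# calibrated, using a small held-out set instead of retraining the model.
import math

# Hypothetical held-out raw model scores and true labels.
scores = [-2.0, -1.5, -0.5, 0.0, 0.5, 1.0, 2.0, 2.5]
labels = [0,    0,    0,    1,   0,   1,   1,   1  ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

a, b = 1.0, 0.0
lr = 0.1
for _ in range(2000):                    # gradient descent on log loss
    ga = gb = 0.0
    for s, y in zip(scores, labels):
        err = sigmoid(a * s + b) - y
        ga += err * s
        gb += err
    a -= lr * ga / len(scores)
    b -= lr * gb / len(scores)

calibrated = [sigmoid(a * s + b) for s in scores]
print(round(a, 2), round(b, 2))
print([round(p, 2) for p in calibrated])
```

Isotonic Regression is the nonparametric alternative: fit a monotone step function to the same (score, label) pairs instead of a sigmoid; it’s more flexible but wants more held-out data.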

Inductive Bias

TODO

Memorization

Extreme case: This is when the training loss is zero. Your model’s basically a lookup table here. Super-overfit. Classically, not a good thing. But then Neural Nets and Deep Learning would like to have a word… TODO: More on this.
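
The extreme case, taken literally (toy sketch): a lookup table gets exactly zero training loss, and its behavior off the table is whatever the fallback happens to be.

```python
# Sketch: memorization taken literally -- the model IS a lookup table.
# Training loss is exactly zero; off the table, you get the fallback.
train = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}   # XOR, memorized

def predict(x):
    return train.get(x, 0)        # unseen input? shrug: guess 0

train_loss = sum((predict(x) - y) ** 2 for x, y in train.items())
print(train_loss)   # 0 -- a perfect "fit" that learned no pattern at all
```

Zero training loss tells you nothing about generalization by itself; that’s why the Deep Learning results above are so surprising.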

Footnotes

  1. Model Class/Hypothesis Class/Model Type may be used interchangeably in baby ML. Called Hypothesis Class since every possible function is a candidate before training. Pedantry.