On Models
All models are wrong, but some are useful.
— George Box
Capacity and Complexity
They're not the same thing. Intuition: a piano is more complex than a small flute. Both can play “Happy Birthday”, but one can capture Beethoven’s Fifth far better than the other.
Complexity is “How complicated is the function the model learned?” What’s “the function”? It’s the:
- Local neighbor rule in kNN
- The linear equation in Regression
- The polynomials in Polynomial Regression
- The branching rules in Decision Trees
- The giant-ass non-linear function in Neural Nets
Capacity is about expressivity and asks “What is the set of functions/distributions this model could represent?” Neural Nets have very high capacity. Sometimes capacity is ‘just’ a parameter knob you turn (e.g. the k in kNN; lots of knobs to twiddle in Neural Networks!)
These concepts are very nuanced! And it turns out (thanks to CNNs) that high Capacity (super-high expressivity) doesn’t have to be a bad thing. Keep reading.
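Here’s the capacity knob in action, a minimal numpy sketch on made-up data: each higher polynomial degree contains all the lower-degree function classes, so turning the knob up can only shrink training error (whether that *generalizes* is a different question, as the rest of this page argues).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up toy data: a wiggly function plus noise.
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

def train_error(degree):
    """Least-squares polynomial fit; returns mean squared training error."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Turning the capacity knob: every degree-d class contains the
# degree-(d-1) class, so training error can only go down (or stay flat).
errors = [train_error(d) for d in (1, 3, 6, 9)]
print([f"{e:.4f}" for e in errors])
```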
Model Specification
A Misspecification happens when the true relationship in the Real World™ cannot be represented by the set of distributions (“functions” as fancy people call them) in the model class1.
Imagine the Real Generative process is, say, quadratic, and you stubbornly insist on modeling it with a straight line.
So if a model is correctly specified, the true data-generating distribution lies within the family of distributions your model can represent: there exists some setting of the model parameters such that your model’s distribution equals the true one exactly.
So yeah: Every model is misspecified (to some extent).
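A tiny numpy sketch of misspecification (the quadratic “truth” here is made up, and there’s zero noise, to isolate the effect): even with lots of data, the misspecified class never drives its error to zero, while the correctly specified class does.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical truth: y = x**2, no noise.
x = rng.uniform(-1, 1, 10_000)
y = x ** 2

# Misspecified class: straight lines y = a*x + b. The truth isn't in it.
a, b = np.polyfit(x, y, 1)
line_mse = np.mean((a * x + b - y) ** 2)

# Correctly specified class: quadratics. The truth lies inside it.
quad_mse = np.mean((np.polyval(np.polyfit(x, y, 2), x) - y) ** 2)

# The line's error never reaches zero no matter how much data you add.
print(f"line MSE: {line_mse:.3f}, quadratic MSE: {quad_mse:.2e}")
```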
Simplicity, Complexity, Bias, Variance, Consistency
- Simple Model → High Bias, Low Variance → Consistently Wrong
- Complex Model → Low Bias, High Variance → Inconsistently Right (see below)
What’s going on? First see “Bias and Variance”.
Let my Dog Explain
Imagine you’re a beautiful Diva Husky. You decide: “I’ll always run to the same spot in the yard, no matter where my Dad throws the frisbee.”
- You’re consistent: you do the same thing every time. Low variance.
- You’re wrong a lot: the frisbee rarely lands where you’re standing. High bias.
- You’re consistent + wrong. Every throw, same bad result.
The model is too simple to capture what’s actually going on. It misses the real pattern: follow the arc of the frisbee.
Now say you’re oversmart: “I’ll memorize exactly where every single frisbee has ever landed and use all of that to predict the next one. Wind speed, # of Squirrels in nearest tree, Dad’s mood, what I had for breakfast, everything I can think of.”
- On throws you’ve seen before, you’re amazing. Low bias.
- On a new throw, you go nuts. A Squirrel shows up and you sprint to the neighbor’s yard. Tiny changes in the input send you to wildly different spots. High variance.
- Sometimes you nail it. Sometimes you’re three blocks away. Inconsistently right.
That’s all.
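The Husky story, simulated (all numbers invented): train each strategy on many fresh “days” of noisy throws, then look at the bias and variance of its prediction at one landing spot. The degree-0 polynomial is “run to the same spot”; the high-degree one is “memorize everything”.

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2 * np.pi * x)   # the frisbee's actual arc
x_test = 0.25                              # the landing spot we care about

def predict(degree):
    """Train on one fresh noisy day of throws, predict at x_test."""
    x = rng.uniform(0, 1, 20)
    y = true_f(x) + rng.normal(scale=0.5, size=x.shape)
    return np.polyval(np.polyfit(x, y, degree), x_test)

# Repeat over many days of frisbee and summarize each strategy.
results = {}
for degree, name in [(0, "run to the same spot"), (9, "memorize everything")]:
    preds = np.array([predict(degree) for _ in range(300)])
    bias2 = (preds.mean() - true_f(x_test)) ** 2
    results[name] = (bias2, preds.var())
    print(f"{name}: bias^2={bias2:.3f}, variance={preds.var():.3f}")
```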
Simple and Complex Models
It’s not about Simple and Complex models anymore. I mean, prefer simple models, but:
“Simple model” vs. “complex model” is the wrong axis. The right axis is “which functions does the optimizer find first?”
What does that mean? We think “Complex models overfit” and “Simple models underfit/are inflexible” (generally speaking). But a giant-ass Neural Network can have gazillions of parameters and generalize shockingly well (even when you randomize labels!) What gives? What do we do with this?
Here’s the important idea, and it has to do with your choice of how to optimize rather than the model’s capacity:
A massive neural network isn’t dangerous because it could memorize. It’s safe because gradient descent, with normal initialization and learning rates, doesn’t go looking for memorization solutions first. It finds smooth, generalizing ones first, and stops there once the loss is low enough.
That’s really it. Put another way, out of all the ‘capacities’ of your model (the functions/distributions), what does your optimizer actually pick? How quickly? Your choice of optimizer will have preferences (Inductive Bias!)
So the question becomes: “How do I make an optimizer that finds ‘good solutions’ early during training?” in a given model class (type of model) and not so much “What’s the capacity of my model?” or “Is my model simple enough?” You may pat yourself on the back for coming up with a ‘Simple’ model that then faceplants in the Real World™. Maybe a ‘complex’ model would’ve served you better.
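One concrete, well-known instance of this optimizer preference, sketched in plain numpy on made-up data: in an overparameterized linear model, infinitely many weight vectors fit the training data exactly, yet gradient descent started from zero converges to the minimum-norm interpolator (the same answer the pseudoinverse gives). The preference lives in the optimizer, not the model class.

```python
import numpy as np

rng = np.random.default_rng(3)

# Overparameterized: 10 data points, 50 parameters. Infinitely many
# weight vectors interpolate ('memorize') the data exactly.
X = rng.normal(size=(10, 50))
y = rng.normal(size=10)

# Plain gradient descent on squared loss, starting from zero.
w = np.zeros(50)
for _ in range(20_000):
    w -= 0.005 * X.T @ (X @ w - y)

# GD's updates always live in the row space of X, so of all the exact
# interpolators it lands on the minimum-norm one: an inductive bias of
# the optimizer, not of the model's capacity.
w_min_norm = np.linalg.pinv(X) @ y
print(np.allclose(X @ w, y, atol=1e-6), np.allclose(w, w_min_norm, atol=1e-6))
```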
Separation and Sufficiency (Fairness)
Nuanced, very important. See this page.
Calibration
| Task | What gets calibrated? |
|---|---|
| Binary classification | predicted probabilities |
| Multiclass classification | class probabilities |
| Regression | prediction intervals/distributions |
| Survival analysis | survival probabilities |
| Bayesian inference | posterior uncertainty |
| Deep learning | confidence scores |
How do you measure proper calibration? You can use (a) Reliability Diagrams and (b) Brier Score as a ‘headline’ metric. You must show calibration across subgroups! TODO: Expand upon this.
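A minimal sketch of both metrics, on made-up data that is perfectly calibrated by construction (predicted probabilities drawn uniformly, outcomes drawn from them): bin predictions for a reliability diagram and compute the Brier score, which for this predictor should sit near E[p(1−p)] = 1/6.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up predictions, calibrated by construction:
# the outcome really does occur with the predicted probability.
p = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < p).astype(float)

# Brier score: mean squared error between probability and 0/1 outcome.
brier = np.mean((p - y) ** 2)

# Reliability-diagram data: bin by predicted probability and compare
# each bin's average prediction to its observed event frequency.
edges = np.linspace(0, 1, 11)
idx = np.digitize(p, edges[1:-1])
for b in range(10):
    m = idx == b
    print(f"predicted ~{p[m].mean():.2f}  observed {y[m].mean():.2f}")
print(f"Brier score: {brier:.3f}")
```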
How do you recalibrate in a new setting? You can retrain. What if that’s expensive? You can use techniques like Platt Scaling or Isotonic Regression. TODO: More on these with examples.
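Here’s a sketch of Platt scaling under assumptions I’m inventing (an “overconfident” model whose raw scores are the true log-odds times 3): fit p = sigmoid(a·score + b) by logistic regression on held-out labels. It’s plain gradient descent on the log loss here rather than a library call.

```python
import numpy as np

rng = np.random.default_rng(5)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical miscalibrated model: raw scores are overconfident
# (the true log-odds scaled up by 3).
n = 4000
logit = rng.normal(size=n)
y = (rng.uniform(size=n) < sigmoid(logit)).astype(float)
score = 3.0 * logit

# Platt scaling: learn p = sigmoid(a*score + b) via gradient descent
# on the logistic (log) loss.
a, b = 0.0, 0.0
for _ in range(3000):
    p = sigmoid(a * score + b)
    a -= 0.1 * np.mean((p - y) * score)
    b -= 0.1 * np.mean(p - y)

def brier(prob):
    return np.mean((prob - y) ** 2)

print(f"Brier before: {brier(sigmoid(score)):.3f}")          # overconfident
print(f"Brier after:  {brier(sigmoid(a * score + b)):.3f}")  # recalibrated
```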
Inductive Bias
TODO
Memorization
Extreme case: This is when the training loss is zero. Your model’s basically a lookup table here. Super-overfit. Classically, not a good thing. But then Neural Nets and Deep Learning would like to have a word… TODO: More on this.
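A toy illustration of the extreme case (random made-up data): a 1-nearest-neighbor classifier is literally a lookup table over the training set, so its training error is exactly zero even when the labels are pure noise, and on fresh data it generalizes like a coin flip.

```python
import numpy as np

rng = np.random.default_rng(6)

# Random features with RANDOM labels: nothing to learn, only memorize.
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

def one_nn(q):
    """1-nearest-neighbor prediction: a lookup into the training set."""
    return y[np.argmin(np.linalg.norm(X - q, axis=1))]

# Zero training loss: each training point's nearest neighbor is itself.
train_acc = np.mean(np.array([one_nn(x) for x in X]) == y)

# On fresh random data it's no better than chance.
X_new = rng.normal(size=(1000, 5))
y_new = rng.integers(0, 2, size=1000)
test_acc = np.mean(np.array([one_nn(q) for q in X_new]) == y_new)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```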