Neural Networks
You cannot really use a linear function to capture 'loopy' decision boundaries the way kNN can. You need arbitrarily expressive functions. You also want the function to be parametric (its size doesn't grow with the dataset) and efficiently auto-differentiable.
Affine functions are linear functions plus a bias term: f(x) = Wx + b.
Why is the activation function important? Because without it, a stack of many affine layers collapses into a single affine map.
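A quick numpy check of the collapse (the shapes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked affine layers with no activation in between...
two_layers = W2 @ (W1 @ x + b1) + b2
# ...are exactly one affine layer with W = W2 @ W1 and b = W2 @ b1 + b2.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True
```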
There are all sorts of activation functions. ReLU is a popular one; others include sigmoid, tanh, GELU, and softmax (the last usually reserved for output layers).
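Sketches of a few of these in numpy (the GELU below is the common tanh approximation, and softmax uses the usual max-subtraction trick for numerical stability):

```python
import numpy as np

def relu(x):    return np.maximum(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
# tanh is built in: np.tanh

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()
```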
Ideally you want your data to be distributed about the origin: activation functions do their most useful, nonlinear work near the origin (tanh and sigmoid saturate far from it; ReLU's kink sits at zero).
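One standard way to get data centered at the origin is per-feature standardization; a minimal sketch:

```python
import numpy as np

def standardize(X):
    """Shift/scale each feature so the data sits around the origin."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-8  # epsilon guards against zero-variance features
    return (X - mu) / sigma
```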
In theory at least, neural networks are arbitrarily expressive (the universal approximation theorem).
The architecture of a NN encodes its assumptions (inductive biases). This shows up in the layer types: fully connected, convolutional, recurrent, attention. The choice of layer type determines what the model can learn easily versus what it must fight to learn.
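One way to see the inductive bias: a 1-D convolution is just an affine map whose weight matrix is banded and weight-shared. A numpy sketch (the kernel and input are made up):

```python
import numpy as np

# Deep-learning "convolution" is cross-correlation: slide the kernel
# across the input, reusing the same 3 parameters at every position.
kernel = np.array([1.0, -2.0, 1.0])
x = np.arange(6, dtype=float) ** 2

conv_out = np.correlate(x, kernel, mode="valid")

# The equivalent dense matrix: each row is the kernel, shifted by one.
W = np.zeros((4, 6))
for i in range(4):
    W[i, i:i + 3] = kernel
print(np.allclose(conv_out, W @ x))  # True
```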
An MLP (multilayer perceptron) is a fully-connected feed-forward neural network.
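A minimal numpy sketch of an MLP forward pass (the layer sizes and the choice of ReLU are arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, params):
    """Fully-connected net: affine -> activation, repeated."""
    *hidden, last = params
    for W, b in hidden:
        x = relu(W @ x + b)
    W, b = last
    return W @ x + b  # no activation on the final (output) layer

rng = np.random.default_rng(0)
params = [(rng.normal(size=(16, 8)),  np.zeros(16)),
          (rng.normal(size=(16, 16)), np.zeros(16)),
          (rng.normal(size=(4, 16)),  np.zeros(4))]
y = mlp_forward(rng.normal(size=8), params)  # 8 -> 16 -> 16 -> 4
```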
The learning rate is the most important hyperparameter in training a neural network. When you change the batch size, you can (and need to) rescale the learning rate to match.
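One common heuristic is the linear scaling rule (Goyal et al., 2017): multiply the learning rate by the same factor as the batch size. A sketch, with illustrative numbers:

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: k-times-larger batch -> k-times-larger LR."""
    return base_lr * batch_size / base_batch_size

# e.g. if lr=0.1 was tuned at batch size 256, moving to 1024 suggests:
print(scaled_lr(0.1, 256, 1024))  # 0.4
```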