LLMs

Discussion of Chinese Room Paradox.

CommonCrawl.

“Global cohesion” in n-gram models.

Now, trivially, if you use a simple count-based model with a 128,000-token window, you get on the order of 10^600,000 possible sequences (assuming a vocabulary of roughly 50,000 words). This is insane. Even a 20-token window gives about 10^94 combinations, which is far larger than the number of atoms in the observable universe (roughly 10^80).
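To see where those exponents come from, here is a quick sketch. The vocabulary size of 50,000 is an assumption (the notes don't state one); the count of possible sequences is vocab^window, computed in log10 so the numbers stay manageable:

```python
import math

VOCAB = 50_000  # assumed vocabulary size; not stated in the notes


def log10_combinations(window: int, vocab: int = VOCAB) -> float:
    """log10 of the number of possible sequences: vocab ** window."""
    # log10(vocab ** window) == window * log10(vocab), which avoids
    # materializing an astronomically large integer.
    return window * math.log10(vocab)


print(log10_combinations(128_000))  # ~601,000 -> on the order of 10^600,000
print(log10_combinations(20))       # ~94     -> on the order of 10^94
```

With a 50,000-word vocabulary, log10(50,000) is about 4.7, so 20 tokens gives roughly 10^94 and 128,000 tokens gives roughly 10^600,000 — matching the figures above.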

How do you solve this? How do you tame this? The biggest problem is scalability, though there are other desirable properties too, such as sensitivity to word order (“man bites dog” versus “dog bites man”).

With Word2Vec and word-vector embeddings: imagine an Excel sheet with all the words in the English language as rows. There are typically 512–4096 columns. What are their names? It doesn’t matter. Simply/trivially, you initialize the table randomly and let training assign the weights.
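The "Excel sheet" picture can be sketched as a table of random numbers, one row per word. The toy vocabulary below is hypothetical; real models use tens of thousands of rows, and training (not shown) is what moves these random values into meaningful positions:

```python
import random

EMBED_DIM = 512  # typical column count cited above: 512-4096

# Hypothetical toy vocabulary standing in for "all the words in English".
vocab = ["the", "dog", "bites", "man"]

# One row of EMBED_DIM random values per word. The columns have no names;
# their meaning emerges only once training adjusts the weights.
embeddings = {
    word: [random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for word in vocab
}

print(len(embeddings))          # 4 rows, one per word
print(len(embeddings["dog"]))   # 512 columns
```

The small initial standard deviation (0.02) is a common choice so that no word starts out with an extreme position; any scheme works as long as rows start distinct, since training does the real work.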

Read: The Bitter Lesson by Richard Sutton.

3Blue1Brown Video on Transformers.