I’m going to take a “historical” route where I go through some other, mostly older architectural patterns first, to put it in context; hopefully it’ll be useful to people who are new to this stuff, while also not too tiresome to those who aren’t.
The closest thing to an intuitive explainer that I know of is “The Illustrated Transformer,” but IMO it’s too light on intuition and too heavy on near-pseudocode (including stuff like “now you divide by 8,” as the third of six enumerated “steps” which themselves only cover part of the whole computation!).
This is a shame, because once you hack through all the surrounding weeds, the basic idea of the Transformer is really simple. This post is my attempt at an explainer.