Original Post
Of course, attention came from Gaussian mixture models, density networks, and soft windows. Historically, that was the realization of the concept of dynamic weights, as evangelized by Hinton since the 1980s, and we have only just started to re-understand why we use the Transformer in the first place. The concept of mutual information is the best upgrade for quantum mechanics, and it came from a rather sloppy and surprising origin in computational linguistics (you'd expect Shannon at least, or rigorous mathematics?) :-)