Original Post

Pay attention: LeNet from 1989 used SGD with momentum, and so did AlexNet, the ImageNet winner of 2012. All you needed for old school was averaged SGD with momentum.
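
For anyone who has not seen it written out, here is a minimal sketch of that old-school recipe on a toy 1-D problem. It assumes "averaged" means Polyak-style averaging of the iterates, which is my reading and not something the papers above spell out. The function name, the hyperparameters, and the toy objective are all made up for illustration.

```python
def averaged_sgd_momentum(grad, w0, lr=0.1, momentum=0.9, steps=200):
    """Heavy-ball SGD on a scalar parameter, plus a running average of the iterates.

    Hypothetical helper for illustration; returns (last iterate, averaged iterate).
    """
    w, v, w_avg = w0, 0.0, w0
    for t in range(1, steps + 1):
        v = momentum * v - lr * grad(w)  # velocity: decayed history minus the scaled gradient
        w = w + v                        # parameter step along the velocity
        w_avg += (w - w_avg) / (t + 1)   # running mean over all iterates so far, w0 included
    return w, w_avg


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_last, w_mean = averaged_sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_last, w_mean)
```

Both returned values should land close to 3 on this toy problem. With real minibatch noise the averaged iterate is usually the smoother of the two, which is the whole point of averaging.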

There are a million other optimizers to distract you from the fact that the attention-based Transformer and AdamW were both introduced in 2017 and used in GPT-1 just a year later (sic!), with the weights officially released only in 2019, the same year AdamW was (finally, hehe) officially published. RMSProp, for the record, never got a formal paper at all: it comes from Hinton's 2012 lecture slides.