Original Post
In 2010, Martens introduced Hessian-free optimization for deep learning. In 2011, Sutskever collaborated with Martens to apply the method successfully to ordinary RNNs (rather than specialized architectures such as LSTMs or GRUs).
In 2020, NERSC at Lawrence Berkeley National Laboratory published applications that combine the method with AdamW. What happened to a public that is so obsessed with Transformers? Why has everyone stopped talking about the method that Sutskever and Martens invented 12 years ago?