Original Post

To learn the math of Transformers you just have to study calculus and mathematical analysis for five years at uni, then for 10 years out of curiosity, then take a four-year self-guided intensive crash course. You need everything from EM, HMM, Q-learning, Pontryagin, Euler-Lagrange, mean, joint and conditional probability, Bayes' rule, chances, odds, observations, events, controls, factor analysis, design of experiments, chi-square, k-NN, k-means, maximum entropy, mutual information, sliding Gaussians, central limit theorem, law of large numbers, kernels as measures, convolution, scaled windows, tensor analysis, topological spaces, set theory, LayerNorm, RMSNorm, penalized sampling, temperature-controlled stochastic sampling, top-k alternatives, sparse Transformers, attention sinks, Dropout and DropConnect, Xavier, MSRA, all-attention persistent memory, KV cache, cross-attention, momentum, minibatch, Kronecker delta, adaptive attention span, weight binarization, grouped-query attention and multi-query attention, fused MLP, inner cross-attention and inner self-attention, RoPE, sinusoidal, and learned position embeddings, GLU, GELU, Swish, LSTM, RNN, SoftMax, sigmoid, Boltzmann, logits, BPE, dynamic weights, probability density, tokenizer, byte-pair, causal masking, autoregression, autoassociation, PCA, ICA, GMM, tensor contraction, Hadamard product, logit, FMA, GEMM fast matrix multiplication, Hilbert's 13th and 10th problems, DP, LP, HJB, quantum mechanics, and general relativity just to do the work mentioned.