Four-Layer Tiny Transformer Training Run
June 22, 2024

This is a training artifact for a tiny generative Transformer: 4 layers, 16 attention heads, 128-dimensional embeddings, a 128-token context, a 361-token custom vocabulary, and about 835k logged parameters. It was trained on CPU past 50,000 iterations, until it produced recognizable small-story text.

The point is not that this is a benchmark model. The point is that the model is small enough to inspect.

The Artifact

The public artifact is a GitHub pull request that adds a single training log/script file. It records the model shape, training command, inference command, sample output, and enough loss-log history to show the run moving from initialization into a working tiny-story generator.

GitHub PR: 0.8M parameters model training

Raw training script and log

Model Shape

Layers: 4
Attention heads: 16
Key/value heads: 16
Embedding dimension: 128
Context length: 128 tokens
Vocabulary: 361 tokens
Dropout: 0.15
Logged parameters: 834,644 total (832,128 decayed, 2,516 non-decayed)
Training device: CPU
Tokens per iteration: 8,192
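
The figures above can be cross-checked with a little arithmetic. The sketch below is a minimal check, not code from the artifact. The `ModelShape` class and its field names are hypothetical. The feed-forward width of 341 (the integer part of 2/3 of 4 × 128) is an assumption: the source does not state it, but with that width a LLaMA-style layout with a single embedding matrix and no separate output matrix happens to sum to the logged decayed count. How the 2,516 non-decayed parameters are distributed is not stated in the source and is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the values listed under Model Shape."""
    n_layers: int = 4
    n_heads: int = 16
    n_kv_heads: int = 16
    dim: int = 128
    context_len: int = 128
    vocab_size: int = 361


def head_dim(shape: ModelShape) -> int:
    """Width of one attention head: the embedding dimension split across heads."""
    return shape.dim // shape.n_heads


def decayed_params(shape: ModelShape, ffn_hidden: int) -> int:
    """Count 2-D weight matrices in an assumed LLaMA-style layout.

    ASSUMPTION: four square attention projections and three feed-forward
    matrices per layer, plus one token-embedding matrix and no separate
    output matrix. The source does not confirm this layout.
    """
    embedding = shape.vocab_size * shape.dim
    attention = 4 * shape.dim * shape.dim
    feed_forward = 3 * shape.dim * ffn_hidden
    return embedding + shape.n_layers * (attention + feed_forward)


shape = ModelShape()
ffn_hidden = int(2 * (4 * shape.dim) / 3)  # ASSUMPTION: 341, not stated in the source

# Values copied from the training log summary.
logged_decayed = 832_128
logged_non_decayed = 2_516
logged_total = 834_644

print(head_dim(shape))                      # 8
print(logged_decayed + logged_non_decayed)  # 834644
print(decayed_params(shape, ffn_hidden))    # 832128
```

With 16 heads over a 128-dimensional embedding, each head is only 8 dimensions wide, which is part of what makes the individual tensors easy to inspect.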

Training Run

The training command initializes a new model from scratch with a custom 361-token vocabulary, batch size 64, 128-token sequence length, learning rate 3e-4, weight decay 0.1, warmup over 2,500 iterations, and a target maximum of 100,000 iterations.
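
These settings imply the tokens-per-iteration figure and a warmup schedule. The sketch below is illustrative only: the function name `warmup_lr` is hypothetical, and the linear shape of the ramp is an assumption, since the source gives only the peak learning rate and the warmup length. The source also says nothing about decay after warmup, so the function simply holds the peak rate.

```python
# Values taken from the training command described above.
BATCH_SIZE = 64
SEQ_LEN = 128
PEAK_LR = 3e-4
WARMUP_ITERS = 2_500
MAX_ITERS = 100_000


def warmup_lr(iteration: int) -> float:
    """ASSUMPTION: linear ramp from 0 to PEAK_LR, then constant.

    The real run may use a different ramp or a decay phase; neither
    is described in the source.
    """
    if iteration < WARMUP_ITERS:
        return PEAK_LR * iteration / WARMUP_ITERS
    return PEAK_LR


# One iteration processes a full batch of full-length sequences.
tokens_per_iter = BATCH_SIZE * SEQ_LEN
print(tokens_per_iter)              # 8192

# Token budget if the run reached its target maximum.
print(tokens_per_iter * MAX_ITERS)  # 819200000

# Learning rate at a few points along the assumed ramp.
for step in (0, 1_250, 2_500, 50_000):
    print(step, warmup_lr(step))
```

The product of batch size and sequence length matches the 8,192 tokens per iteration recorded in the model summary.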

The log starts at train loss 13.0864 and validation loss 13.0707. The artifact includes excerpts through step 53,130, by which point the logged loss is roughly 0.75 to 0.86.
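
To put these loss figures in context, the sketch below converts them to perplexity and compares them with the loss of a uniform guess over the vocabulary. It assumes the logged values are mean cross-entropy in nats per token, which the source does not state. The helper name `perplexity` is hypothetical.

```python
import math

VOCAB_SIZE = 361
TOKENS_PER_ITER = 8_192
LAST_LOGGED_STEP = 53_130


def perplexity(loss_nats: float) -> float:
    """Perplexity for a cross-entropy loss, ASSUMING the loss is in nats."""
    return math.exp(loss_nats)


# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(VOCAB_SIZE)
print(f"{uniform_loss:.3f}")      # 5.889

# Perplexity at both ends of the late-run loss range.
print(f"{perplexity(0.75):.2f}")  # 2.12
print(f"{perplexity(0.86):.2f}")  # 2.36

# Tokens processed by the last logged step, at 8,192 tokens per iteration.
tokens_seen = LAST_LOGGED_STEP * TOKENS_PER_ITER
print(tokens_seen)                # 435240960
```

Under that assumption, the starting loss of about 13.09 sits above the uniform baseline of about 5.889, and the late-run range corresponds to a perplexity of roughly 2.1 to 2.4 over a 361-token vocabulary.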

The Tweak

The PR body notes one tweak to the LLaMA architecture: the unembedding is a computed pseudo-inverse of the embedding matrix, and that unembedding is included in backpropagation. That note should be treated as an implementation detail of the artifact. A later article can explain the linear-algebra and training implications if the supporting notes are published.
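
Until then, the standard definition of the pseudo-inverse can at least be stated for the shapes in this run. The block below is a sketch under assumptions, not a description of the artifact's code: the symbols E, h, and z are introduced here for illustration, the closed form holds only when E has full column rank, and the orientation of the final product is one plausible reading of the PR note.

```latex
\begin{align*}
E &\in \mathbb{R}^{V \times d}, \qquad V = 361, \quad d = 128
  && \text{(token embedding matrix)} \\
E^{+} &= (E^{\top} E)^{-1} E^{\top} \in \mathbb{R}^{d \times V},
  \qquad E^{+} E = I_{d}
  && \text{(Moore--Penrose pseudo-inverse, full column rank)} \\
z &= h \, E^{+} \in \mathbb{R}^{1 \times V}
  && \text{(assumed: logits from a final hidden state } h \in \mathbb{R}^{1 \times d}\text{)}
\end{align*}
```

On this reading, the unembedding has no parameters of its own and is recomputed from E, so any gradient reaching it would flow back into the embedding matrix.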

Generated Text

The sample output is not clean prose. It repeats names, loses references, and breaks grammar. But it also has recognizable story structure: characters, actions, dialogue, simple causal turns, and moral-like endings. That is why this tiny run is worth preserving as a public artifact. It is small enough to inspect, yet large enough to show emergent generative behavior.

Tradeoff

A run this small is no substitute for a production language model. Its value is visibility. The architecture, tokenizer size, context length, parameter tensors, training command, loss log, and sample behavior all fit into one inspectable record.