Decoder-Only Shakespeare Text Generator

We built, trained, and evaluated a minimal decoder-only Transformer from scratch in PyTorch, using the Tiny Shakespeare dataset to generate Shakespeare-like text.
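To make "decoder-only" concrete, the following is a minimal sketch of a single causal self-attention head, the component that stops each position from attending to later tokens. The class name CausalSelfAttentionHead and the parameters head_size and block_size are assumptions for illustration and are not taken from the project code; the actual model may be organized differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttentionHead(nn.Module):
    """One masked self-attention head (illustrative sketch, hypothetical names)."""

    def __init__(self, n_embd, head_size, block_size, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.dropout = nn.Dropout(dropout)
        # Lower-triangular mask: position t may attend only to positions <= t.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape  # batch, sequence length, embedding dimension
        k = self.key(x)
        q = self.query(x)
        v = self.value(x)
        # Scaled dot-product scores between every pair of positions: (B, T, T).
        scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        # Hide future positions before the softmax so they get zero weight.
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = self.dropout(F.softmax(scores, dim=-1))
        return weights @ v  # (B, T, head_size)


torch.manual_seed(0)
head = CausalSelfAttentionHead(n_embd=32, head_size=16, block_size=8)
head.eval()  # disable dropout for a deterministic forward pass
x = torch.randn(1, 8, 32)
y = head(x)
print(y.shape)
```

Because of the mask, changing tokens at later positions should leave the outputs at earlier positions unchanged, which is what allows the model to be trained on next-token prediction and then sampled one token at a time.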

Experiments

We systematically varied key architectural hyperparameters across multiple runs, each trained for 5,000 iterations:

n_embd — embedding dimension
n_layer — number of transformer layers
n_head — number of attention heads
dropout — regularization rate
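
One way to organize such a sweep is sketched below: a small configuration object holding the four hyperparameters listed above, plus a helper that enumerates combinations. The names Config, build_grid, and approx_block_params are hypothetical, the grid values are placeholders rather than the configurations actually run, and the parameter estimate assumes a 4x MLP expansion, which the write-up does not confirm.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    n_embd: int     # embedding dimension
    n_layer: int    # number of transformer layers
    n_head: int     # number of attention heads
    dropout: float  # regularization rate

    def __post_init__(self):
        # Each head gets n_embd // n_head channels, so the split must be exact.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")


def approx_block_params(cfg):
    # Rough weight count for the transformer blocks only (assumption: 4x MLP):
    # attention ~ 4 * n_embd^2, MLP ~ 8 * n_embd^2; biases and embeddings ignored.
    return 12 * cfg.n_layer * cfg.n_embd ** 2


def build_grid(n_embds, n_layers, n_heads, dropouts):
    # Cartesian product of candidate values, skipping invalid head splits.
    grid = []
    for e, l, h, d in product(n_embds, n_layers, n_heads, dropouts):
        if e % h == 0:
            grid.append(Config(e, l, h, d))
    return grid


# Placeholder values for illustration, not the configurations actually run.
grid = build_grid([64, 128], [2, 4], [4], [0.1, 0.2])
for cfg in grid:
    print(cfg, approx_block_params(cfg))
```

The rough parameter count gives a quick way to order runs by capacity when comparing their loss curves.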

For each configuration, we tracked both training loss and validation loss to evaluate generalization.
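
The bookkeeping for this comparison can be sketched as follows. LossLog and its methods are hypothetical names, and the numbers recorded in the usage lines are synthetic placeholders chosen only to exercise the code; they are not results from these experiments.

```python
from dataclasses import dataclass, field


@dataclass
class LossLog:
    """Records train/validation loss at each evaluation step (illustrative)."""
    steps: list = field(default_factory=list)
    train: list = field(default_factory=list)
    val: list = field(default_factory=list)

    def record(self, step, train_loss, val_loss):
        self.steps.append(step)
        self.train.append(train_loss)
        self.val.append(val_loss)

    def best_val(self):
        # Step and value of the lowest validation loss recorded so far.
        i = min(range(len(self.val)), key=self.val.__getitem__)
        return self.steps[i], self.val[i]

    def gap(self):
        # Generalization gap at the latest evaluation: validation minus training.
        return self.val[-1] - self.train[-1]


# Synthetic placeholder values, not measured losses.
log = LossLog()
log.record(0, 4.0, 4.0)
log.record(2500, 2.0, 2.5)
log.record(5000, 1.5, 3.0)
print(log.best_val(), log.gap())
```

A validation loss that bottoms out before the final iteration while training loss keeps falling, as in the placeholder run above, is the usual sign of overfitting; a widening gap suggests more dropout or a smaller model.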

Finding

Increasing model size (capacity) noticeably improved the coherence and fluency of the generated text, at the cost of longer training time and more computation.

Timeline: Spring 2026
