Beat 1 · Concrete
Same question, three sizes
One prompt. The small model fails, the medium fumbles, the large one nails it.
Beat 2 · Abstract
The scaling law
Loss versus compute on log–log axes: a straight descending line. Gains are predictable.
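Why a straight line: if loss follows a power law L(C) = a · C^(−b), then log L = log a − b · log C, which is linear in log C. The sketch below illustrates this with made-up constants (`a`, `b`, and the compute range are assumptions for illustration, not fitted values from any paper) and recovers the slope with a plain least-squares fit in log space.

```python
import math

def power_law_loss(compute, a=10.0, b=0.05):
    """Toy power law L(C) = a * C**(-b). The constants are illustrative only."""
    return a * compute ** (-b)

def fit_log_log(xs, ys):
    """Least-squares line through (log10 x, log10 y). Returns (slope, intercept)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    var = sum((u - mean_x) ** 2 for u in lx)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical compute budgets spanning several orders of magnitude (FLOPs).
computes = [10.0 ** k for k in range(18, 25)]
losses = [power_law_loss(c) for c in computes]

slope, intercept = fit_log_log(computes, losses)

def predict_loss(compute):
    """Extrapolate along the fitted line to a budget outside the fitted range."""
    return 10.0 ** (intercept + slope * math.log10(compute))
```

The fitted slope equals −b and the intercept equals log10(a), so `predict_loss` extends the same line to larger budgets. That is the sense in which gains are predictable, under the assumption that the power law keeps holding.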
Beat 3 · Interactive
Slide the scale
Drag the dial: loss slides down the line, and capabilities switch on as thresholds are crossed.
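A minimal sketch of how such a dial could work, assuming a logarithmic mapping from dial position to compute and a fixed list of thresholds. Every name and number here (`dial_to_compute`, `THRESHOLDS`, the exponent range, the power-law constants) is hypothetical and chosen only for illustration; it is not the page's actual implementation.

```python
# Hypothetical compute thresholds (FLOPs) at which each capability switches on.
THRESHOLDS = [
    ("capability A", 1e20),
    ("capability B", 1e22),
    ("capability C", 1e24),
]

def dial_to_compute(position, lo_exp=18.0, hi_exp=26.0):
    """Map a dial position in [0, 1] to compute, evenly spaced on a log scale."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("dial position must be between 0 and 1")
    return 10.0 ** (lo_exp + position * (hi_exp - lo_exp))

def loss_at(compute, a=10.0, b=0.05):
    """Toy power-law loss. The constants are illustrative only."""
    return a * compute ** (-b)

def unlocked(compute):
    """Names of capabilities whose threshold the given compute has reached."""
    return [name for name, threshold in THRESHOLDS if compute >= threshold]

def dial_state(position):
    """Everything the display would need for one dial position."""
    compute = dial_to_compute(position)
    return {
        "compute": compute,
        "loss": loss_at(compute),
        "capabilities": unlocked(compute),
    }
```

In this sketch the loss changes smoothly with the dial while the capability list changes in steps, which is the contrast the interactive is built around.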
Footnotes & further reading
2020 · 2022
Scaling laws
Kaplan et al. (2020) fit loss as a power law in compute, data, and parameters. Chinchilla (Hoffmann et al., 2022) later rebalanced the recipe: for a fixed compute budget, train a smaller model on far more tokens.
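The rebalanced recipe can be sketched with two widely used rules of thumb, both treated here as assumptions rather than exact results: training compute is roughly 6 FLOPs per parameter per token, and the compute-optimal ratio is roughly 20 tokens per parameter. The function name and constants below are illustrative.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C ≈ 6 * N * D approximation
TOKENS_PER_PARAM = 20.0       # assumption: rough Chinchilla-style ratio

def compute_optimal_split(flops):
    """Split a FLOP budget into (parameters, tokens) under the two rules above.

    Solves C = 6 * N * D with D = 20 * N, so N = sqrt(C / 120).
    """
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

# Example budget chosen so the arithmetic comes out round.
params, tokens = compute_optimal_split(5.88e23)
```

Under these assumptions a budget of 5.88e23 FLOPs splits into about 70B parameters and 1.4T tokens, which is close to the configuration Chinchilla itself used.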
2020
GPT-3, few-shot
175B parameters. With no fine-tuning and no weight updates, it picked up tasks from a handful of examples in the prompt: scale alone bought in-context learning.
2021 · 2022
Emergence & “foundation models”
Stanford researchers coined “foundation models” (2021) for a single pretrained base model adapted to many tasks; others then catalogued “emergent abilities” (2022) that appear abruptly past a scale threshold.